A Fuzzy Logic Controller for Tasks Scheduling Using Unreliable Cloud Resources
Panagiotis Oikonomou, Kostas Kolomvatsos, Nikos Tziritas, Georgios Theodoropoulos, Thanasis Loukopoulos, Georgios Stamoulis
Panagiotis Oikonomou
Computer Science and Engineering, Southern Univ. of Science and Technology
Shenzhen, [email protected]

Kostas Kolomvatsos
Computer Science and Telecomm., University of Thessaly
Lamia, [email protected]

Nikos Tziritas
Computer Science and Telecomm., University of Thessaly
Lamia, [email protected]

Georgios Theodoropoulos
Computer Science and Engineering, Southern Univ. of Science and Technology
Shenzhen, [email protected]

Thanasis Loukopoulos
Comp. Science and Biomedical Informatics, University of Thessaly
Lamia, [email protected]

Georgios Stamoulis
Electrical and Computer Engineering, University of Thessaly
Volos, [email protected]
Abstract—The Cloud infrastructure offers to end users a broad set of heterogeneous computational resources using the pay-as-you-go model. These virtualized resources can be provisioned using different pricing models, like the unreliable model where resources are provided at a fraction of the cost but with no guarantee for uninterrupted processing. However, the enormous gamut of opportunities comes with a great caveat, as resource management and scheduling decisions become increasingly complicated. Moreover, the uncertainty present in optimally selecting resources also has a negative impact on the quality of solutions delivered by scheduling algorithms. In this paper, we present a dynamic scheduling algorithm (i.e., the Uncertainty-Driven Scheduling - UDS algorithm) for the management of scientific workflows in the Cloud. Our model minimizes both the makespan and the monetary cost by dynamically selecting reliable or unreliable virtualized resources. To cover the uncertainty in decision making, we adopt a Fuzzy Logic Controller (FLC) to derive the pricing model of the resources that will host every task. We evaluate the performance of the proposed algorithm using real workflow applications tested under different probabilities regarding the revocation of unreliable resources. Numerical results depict the performance of the proposed approach and a comparative assessment reveals the position of the paper in the relevant literature.
Index Terms—Scheduling algorithm, Cloud Computing, Virtualized resources, Workflow management, Uncertainty management, Fuzzy Logic
I. INTRODUCTION
Workflow scheduling is the process of mapping interconnected tasks on heterogeneous resources (resources with different computational and storage capabilities). This is a fundamental and well-studied problem in computing environments such as Grid and Cluster Computing [2]. Research centers (e.g., NASA, earthquake-epigenomic centers) utilize such computing environments to distribute the workload of complex and heavy-load scientific experiments. Over the last decade, there has been a growing interest in scheduling algorithms applied to such workflows in the Cloud [1]. The substantial amount of resources, the variety of CPU platforms (vCPUs) as well as the zero cost for management/maintenance have made the Cloud the most suitable environment for the execution of computation-intensive applications. However, challenges like the pay-as-you-go model and data-transfer costs can be an obstacle to the Cloud's potential [2]. A main difference between Cloud and Cluster Computing is that users have to pay for the duration that resources are utilized. In addition, in Cloud environments the performance of resources varies. This is caused by the resource sharing between Virtual Machines (VMs) hosted in the same physical machine.

Amazon Web Services (AWS) is a typical example of a Cloud provider that offers multiple services. The main categories of services are (i) on-demand and (ii) reserved resource instances. On-demand instances have a fixed price for each hour of use while reserved instances have a cheaper per-hour price than the on-demand services; however, users must lease them for long periods of time (more than 1 year). Amazon was one of the first providers that announced the disposal of unused capacity with a significant discount (around 80% compared to on-demand services). This new type of service is referred to as a spot instance. Large Cloud providers followed Amazon and offer spare capacity at a discount as well. Google Compute Engine (GCE) and Alibaba launched Preemptible Virtual Machines (VMs) while Azure offers low-priority VMs. However, from the user's standpoint, using such virtualized resources comes with a major caveat. Instances may be revoked by the provider at any time as their capacity is needed to execute other (preemptive) applications. For instance, Google sends a preemption notice thirty seconds before termination. Usually, preemptible instances are terminated after 24 hours of use. Spot services can be acquired via a bidding policy through an auction-like market and, greedily, the user with the maximum bid acquires the instance. We have to note that spot instances are revoked if the user's bid price is lower than the market price.

A typical scenario in the Cloud environment is that a user wants to execute a workflow application in the minimum time and cost. Since these requirements are orthogonal in nature, users are confronted with a time versus cost dilemma under the constraints imposed by the workflow and the provider. This paper aims at finding a solution for this challenging dilemma. A straightforward solution to minimize monetary cost is to use exclusively unreliable virtualized resources, i.e., to completely rely on spot instances. A key problem with this solution is that the resource availability and the uninterrupted execution of the workflow are not guaranteed. The adopted scheduler should constantly monitor the queues of virtualized resources and back up tasks' progress even if the possibility of premature termination is relatively low.
Minimizing the workflow's execution time (makespan) is a subject that has been extensively studied by the research community. State-of-the-art algorithms like HEFT, CPOP [11] and DCP [10] are effective in minimizing the makespan; however, the monetary cost is disregarded. Mapping tasks exclusively to reliable (on-demand) virtualized resources towards securing the uninterrupted execution of a workflow results in enormous monetary costs compared to a schedule that considers the pay-as-you-go model.

In this paper, we focus on investigating dynamic scheduling approaches like in [29], [30], whereby it is decided to which virtualized resource each task should be assigned. The decisions are made based on the current state of the system and the workflow execution requirements. An algorithm that simultaneously optimizes the workflow's execution time (makespan) and the monetary cost using a mixture of reliable and unreliable resources is proposed. To mitigate the performance variation of the Cloud environment as well as the unstable nature of unreliable virtualized resources, we propose a 'fast', yet efficient, technique that covers the uncertainty present in our scenario. The discussed uncertainty deals with the most appropriate resource that should be selected to host every task under the target of minimizing the execution time and the monetary cost. We adopt the principles of Fuzzy Logic (FL) to handle the uncertainty of the selection process and the definition of efficient decision-making thresholds. FL is widely adopted in many application domains (task scheduling among them [31]) as the appropriate theory/technology for dealing with uncertainty in decision making (e.g., [32], [33], [34]). We depart from the relevant literature and avoid using 'crisp' thresholds in decision making. For instance, other efforts in the field target to meet specific crisp thresholds for deadlines and budget constraints when deciding task assignments. The intuition behind our approach is two-fold: first, we always seek to minimize the adopted parameters, alleviating users from the burden of defining specific thresholds and, secondly, our algorithm takes into consideration multiple parameters (makespan, monetary cost) at the same time, leading to a multi-objective decision making. To the best of our knowledge, this is one of the first efforts that deal with the problem of scheduling scientific workflows using multiple unreliable virtualized resources without taking into consideration any Quality of Service (QoS) constraints. The following list reports on the contributions of our work:

• We propose a model that captures the heterogeneity of the Cloud environment. Both reliable and unreliable virtualized resources with different processing capabilities can be provisioned, exhibiting different interruption probabilities depending on the popularity of each resource;
• We provide a FL Controller (FLC) to decide on the type of the resources we have to adopt to host each task of the desired workflow; thus, we manage the uncertainty related to the discussed decision-making problem;
• We perform an extensive experimental evaluation of the proposed model and simulate the execution of 5 real-world scientific workflows. The configuration of the adopted virtualized resources is based on realistic assumptions (AWS).

The rest of the paper is organized as follows: Section II discusses the related work. The system model and the problem formulation are presented in Section III. The proposed algorithm is described in Section IV, while the uncertainty-driven decision making is detailed in Section V. Section VI reports on our experimental evaluation.
Our conclusions are drawn in the final Section VII.

II. RELATED WORK
A growing body of literature has examined the problem of scheduling workflows in Cluster and Cloud computing. The majority of resource provisioning techniques are guided by QoS constraints defined by users. There are two basic constraints, i.e., the deadline [26], [24], [22], [21], [18], [16] and the monetary budget [25], [23], [15], [14]. While the deadline constraint is usually satisfied, unreliable resources are used to the maximum extent to limit the monetary costs. In the following paragraphs, we categorize the relevant efforts found in the literature and describe the most representative models.
Heuristic Workflow Scheduling. In [16], the authors introduce the concept of the Latest Time On-Demand (LTO), i.e., the latest time at which on-demand instances must be used to guarantee the deadline constraint. If the difference between the LTO and the current time is greater than zero (positive slack), tasks are mapped into spot instances (unreliable resources). This ensures that the deadline constraint is satisfied while cost is minimized. Decisions made upon the slack value inspired other works like [16], [18] and [21]. A just-in-time workflow scheduling algorithm with deadline guarantees and cost minimization is presented in [16]. A task that arrives before the LTO is scheduled to a spot instance, otherwise it is scheduled to an on-demand resource. Only a single spot instance is considered (the cheapest one). The maximization of resource utilization as well as of the number of tasks that fulfil QoS constraints is the subject of [18]. Grid resources are adopted as the default candidate solutions. However, if QoS constraints are not satisfied, tasks, along with their predecessors, are executed on spot instances. In [22], two types of tasks are considered; preemptive tasks are executed exclusively on spot instances while non-preemptive tasks are executed on reliable instances. For each task, the scheduler scans the entire list of busy resources to find an idle time-block that gives the earliest start time (insertion policy). A framework for scheduling scientific workflows in a Hybrid Cloud environment (HC) is presented in [23]. HC consists of multiple Data Centers (DCs) containing both reliable and unreliable VMs. Processing elements within each DC are homogeneous while costs for data movement and dynamic resource provisioning are ignored. An execution manager is responsible for monitoring tasks executed on revocable VMs and rescheduling them when necessary. An extension of the DCP algorithm [10] is presented in [13]. A Grid infrastructure is extended with unreliable public Cloud resources when Grid resources are insufficient/unavailable to process the current execution load.
Resource Provisioning Techniques. In [14], the authors address the problem of auto-scaling spot resources. A newly proposed strategy called Spot Instances Aware Autoscaling (SIAA) aims at minimizing the makespan and the probability of task failures. According to the available budget, SIAA generates a scaling plan comprised of both reliable and unreliable instances. Critical tasks are prioritized first (tasks with small slack time); then, each of them is scheduled based on the Earliest Finish Time (EFT) policy. In [17], the authors present a framework for scheduling multiple workflows that offers probabilistic deadline guarantees and monetary cost minimization. At runtime, a combination of on-demand and spot instances is generated for every task. If the execution of a task on spot instances fails or the deadline is not met, on-demand instances are adopted on the fly. In [19], the authors present an elastic resource provisioner for the allocation of on-demand and spot instances to workflow tasks. High spot prices defined by users trigger the switch from unreliable to reliable resources. In [25], we can find a discussion on the problem of auto-scaling public resources using a multi-objective Genetic Algorithm (GA). The makespan, the monetary cost and Out-of-Bid (OOB) errors are considered as the targets of the minimization process. The overall impact of OOB errors is measured using a probabilistic model which takes as input the probability of OOB error occurrence multiplied by the number of the available vCPUs.
Fault-Tolerant Models. Checkpointing, a mechanism to maintain the reliability of unreliable resources, is introduced in [12]. A load balancing model combined with a GA can decide on the optimal number of tasks within an instance [15]. Fault-tolerance is enforced using a two-threshold (price and time) mechanism. However, only identically sized tasks are considered and Cloud resources are limited to spot instances. In [21], the provisioning of spot instances is associated with task duplication, i.e., tasks are marked for duplication when the scheduler detects idle slots that can execute a task replica. If no suitable idle slot exists, a new spot instance is initiated to host replicas. In [20], the authors propose a multi-objective GA that minimizes both the makespan and the monetary cost. However, it is assumed that all faults in task execution are recoverable. Tasks can continue their execution on the allocated spot instance after a while (fault-recovery).

III. SYSTEM MODEL AND PROBLEM FORMULATION
A. System Model
We consider the scenario where a user wants to execute a set of dependent tasks (a workflow application) in an Infrastructure as a Service (IaaS) Cloud environment.
Definition. A workflow is a set of dependent tasks that solve a scientific problem.

Resources are generally provisioned as VMs and a VM pool configuration, offered by the provider, is defined prior to workflow execution. The cost of leasing such a virtualized resource is bounded between the start time of the first task assigned to it and the completion time of the last task assigned to it. The cost is rounded up to the nearest billing cycle. A workflow application can be modeled as a Directed Acyclic Graph (DAG) $G = (T, E)$, where $T$ is the set of tasks and $E$ is the set of edges. Each edge $e_{ij}$ represents the data dependency between tasks $t_i$ and $t_j$. Task $t_j$ receives $e_{ij}$ amount of data from its predecessor, i.e., task $t_i$. For starting the execution of a task $t_i$, the following two conditions must hold true: a) all predecessor tasks ($pred(t_i)$) must finish execution and b) all data from $pred(t_i)$ must be received. $t_i$ is characterized by a processing demand parameter $g_i$ denoting the number of instructions (i.e., MIPS) that must be executed for its completion. A task is called an entry task ($t_{entry}$) when $pred(t_i) = \emptyset$. Similarly, a task without any successor task ($succ(t_i) = \emptyset$) is called an exit task ($t_{exit}$). If more than one entry task exists, then a pseudo task $t_{pseudo}$ is inserted into $G$ as the predecessor task of every entry task. No data is transmitted from $t_{pseudo}$ to any other task. Multiple exit tasks are handled in an analogous manner.

Let $V$ be the set of heterogeneous resources (VMs) forming the aforementioned VM pool configuration. Any virtualized resource can be leased as a reliable or an unreliable instance. Additionally, every VM is associated with a processing capability and incurs a different cost per use. Let $v_i$ denote the $i$th resource, $r_i$ be the processing power of $v_i$ and $u_i$ be the leasing cost of $v_i$. Precisely, $r_i$ denotes the number of MIPS instructions that can be executed per time unit (i.e., one second) by $v_i$. This also incorporates memory speed, disk size and so on. Each virtualized resource is also associated with a preemption/interruption probability $p_i$.

Definition.
The preemption/interruption probability $p_i$ is the probability of 'losing' the selected virtualized resource, leading to the failure of the corresponding task and the need for a re-execution.

Since reliable resources (on-demand) are irrevocable, $p_i$ is set to zero. Regarding unreliable resources, $p_i$ is a positive number set to unity when the active duration of $v_i$ exceeds one hour (as dictated by GCE). We assume that the tasks that comprise a workflow are non-preemptive and atomic (they must be executed again if they fail). The execution time required to complete the task $t_i$ on resource $v_j$ is calculated by Eq. (1):

$w_{i,j} = g_i / r_j$   (1)

We consider that resources are allocated in the same DC, thus transferring inbound data is free. The DC is assumed to have enough resources to schedule $G$'s tasks. A shared global storage system is considered as a data repository (Amazon S3). Tasks save their outputs to and receive their inputs from the same storage system. Since the global storage system is allocated within the DC's premises, we consider that the data transfer rate between two VMs is constant. Let $\mathcal{T}$ be that data transfer rate and $b_{ij}$ be a binary variable, i.e., $b_{ij} = 0$ iff $i = j$, otherwise $b_{ij} = 1$. The temporal cost to send data from $t_i$ to $t_j$ ($t_i$ is assigned to $v_i$ while $t_j$ to $v_j$) is expressed by Eq. (2):

$d_{i,j} = e_{ij} \, b_{ij} / \mathcal{T}$   (2)

$EST^{r_j}_{t_i}$ and $EFT^{r_j}_{t_i}$ are the earliest execution start time and the earliest execution finish time of $t_i$ on $r_j$. The earliest start time of the entry task is zero. For every other task in $G$, the earliest start time and the earliest finish time are calculated recursively as shown in Eq. (3) and Eq. (4):

$EST^{r_j}_{t_i} = \max_{t_z \in pred(t_i)} \{ AFT_{t_z} + d_{zi} \}$   (3)

$EFT^{r_j}_{t_i} = w_{i,j} + EST^{r_j}_{t_i}$   (4)

However, due to Cloud uncertainties like multi-tenant resource sharing, resource revocation and provision-deprovision delays, EST and EFT can be underestimated. Therefore, we introduce $AST_{t_i}$ and $AFT_{t_i}$ that denote the actual execution start time and the actual finish time of $t_i$. Then, the total elapsed time required to execute $G$ (makespan) is expressed by Eq. (5):

$makespan(G) = AFT_{t_{exit}}$   (5)

Let $N$ denote the total number of tasks scheduled on $r_i$ and $t_k$ be the $k$th task, assuming a total ordering of them, $1 \leq k \leq N$. Then, the overall execution cost incurred by $r_i$ is calculated by Eq. (6), where $\gamma$ is the length of the billing cycle:

$c_i = u_i \left\lceil (AFT_{t_N} - AST_{t_1}) / \gamma \right\rceil$   (6)
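To make the recursion of Eqs. (1)-(5) concrete, the following is a minimal Python sketch (our own illustration, not the authors' code) that computes EST/EFT over a toy DAG; all task and VM figures are made-up assumptions, and for brevity the sketch ignores VM queueing, so it demonstrates the recursion rather than a feasible schedule.

# Minimal sketch of Eqs. (1)-(5): EST/EFT over a toy DAG.
# All task/VM figures below are illustrative, not from the paper.
g = {"t1": 4000, "t2": 6000, "t3": 2000}        # MIPS demand g_i per task
e = {("t1", "t2"): 100.0, ("t1", "t3"): 50.0}   # data e_ij (MB) on each edge
pred = {"t1": [], "t2": ["t1"], "t3": ["t1"]}
r = {"v1": 2000, "v2": 4000}                    # MIPS per second per VM
T_RATE = 20.0                                   # constant transfer rate (MB/s)

def w(task, vm):                                # Eq. (1): execution time
    return g[task] / r[vm]

def d(src, dst, same_vm):                       # Eq. (2): transfer cost
    return 0.0 if same_vm else e[(src, dst)] / T_RATE

# Greedily place each task on the VM minimizing its EFT (Eqs. (3)-(4)).
AFT, placement = {}, {}
for task in ["t1", "t2", "t3"]:                 # topological order
    best_vm, best_eft = None, float("inf")
    for vm in r:
        est = max((AFT[p] + d(p, task, placement[p] == vm) for p in pred[task]),
                  default=0.0)                  # entry task starts at 0
        eft = est + w(task, vm)                 # Eq. (4)
        if eft < best_eft:
            best_vm, best_eft = vm, eft
    placement[task], AFT[task] = best_vm, best_eft

print(placement, "makespan =", max(AFT.values()))  # Eq. (5)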
B. Problem Formulation

Let $X$ be a $|T| \times |V|$ binary matrix used to encode task-resource assignments as follows: $X_{ij} = 1$ iff $t_i$ is assigned for processing to $v_j$, otherwise $X_{ij} = 0$. In our model, time is represented with the introduction of $S$ equally sized time slots. Let $S_\tau$ be the $\tau$th such time slot, with a corresponding assignment matrix $X^\tau$.

Problem: Find the values of all $S$ assignment matrices $X^\tau$, so that the objective function $f$ given by Eq. (7) is minimized:

$f = \left( AFT_{t_{exit}}, \; \sum_{i=1}^{|V|} c_i \right)$   (7)

subject to:

$\sum_{i=1}^{|T|} X^{\tau}_{ik} \leq 1, \; \forall k, \tau$   (c1)

$\sum_{j=1}^{|V|} X^{\tau}_{kj} \leq 1, \; \forall k, \tau$   (c2)

Research Challenge: minimize the makespan and the overall monetary cost, i.e., minimize the objective function $f$ provided by Eq. (7) w.r.t. the following constraints: (i) a resource cannot execute concurrently more than one task, Eq. (c1), and (ii) a task cannot be assigned to more than one resource, Eq. (c2).
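As a quick illustration of constraints (c1)-(c2), the following sketch (our own, with made-up values) checks the feasibility of one per-slot assignment matrix:

import numpy as np

# X_tau[i, j] = 1 iff task t_i runs on resource v_j during slot S_tau.
# Toy 3-task x 2-VM slot; values are illustrative only.
X_tau = np.array([[1, 0],
                  [0, 1],
                  [0, 0]])

# (c1): each resource executes at most one task per slot (column sums <= 1).
# (c2): each task occupies at most one resource per slot (row sums <= 1).
feasible = (X_tau.sum(axis=0) <= 1).all() and (X_tau.sum(axis=1) <= 1).all()
print("slot feasible:", feasible)  # True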
IV. THE UNCERTAINTY-DRIVEN SCHEDULING (UDS) ALGORITHM
At each time step $\tau$, our algorithm tries to accomplish two goals, i.e., (i) the minimization of the makespan and (ii) the minimization of the monetary cost. To achieve both goals, we introduce the concept of effectiveness, $eff_\tau(M_x, C_x)$, which is attached to each scheduling decision. $M_x$ and $C_x$ denote the makespan and monetary cost after applying scheduling plan $x$.

Definition. Effectiveness $eff_\tau(M_x, C_x)$ is defined as the ability of an algorithm to deliver the optimal execution of a workflow in a timely manner after applying scheduling plan $x$.

$eff_\tau(M_x, C_x)$ is measured for both goals as the difference between: (a) the expected performance of the algorithm from the current time $\tau$ to the finish of the schedule (including every future decision), $eff_\tau(M_{curr}, C_{curr})$, $\tau \in [\tau, finish]$, and (b) an idealistic performance which minimizes both metrics to the maximum extent possible, $eff_\tau(M_{ideal}, C_{ideal})$, $\tau \in [start, finish]$. Clearly, if, at the end of the schedule ($\tau = finish$), the difference between the two performances is eliminated (it is close to zero) on both metrics, our algorithm's performance is considered as efficient.

In general, the problem of assigning tasks to heterogeneous resources is NP-hard [27], thus no known algorithm is able to generate the optimal solution within polynomial time. For this reason, we assume that the theoretical optimal performance is accomplished using two well-known greedy algorithms, namely HEFT [11] and GreedyCost (GC). HEFT is applied for the makespan metric while GC for the cost metric, respectively. For each task, HEFT selects the resource that results in the earliest finish time while GC relies on the resource that results in the lowest cost. For both algorithms, the task-to-resource mapping is produced in advance (i.e., they perform a static scheduling) and all resources can be used in an uninterrupted mode, thus they can produce high-quality schedules. This means that both algorithms consider a $p_i$ equal to zero. For our analysis and experimentation, we consider that $M_{lower} = HEFT_{ideal}$ and $C_{lower} = GC_{ideal}$ represent the lower bound of the makespan and the monetary cost, respectively.

Algorithm 1 describes the UDS algorithm. For our scheduling scenario, decisions for each task are made at runtime, i.e., when a task is ready for execution. In the resource allocation phase (lines 6-22), for every ready task $t_i$, we estimate $eff_\tau(M_{curr}, C_{curr})$. To do so, we first apply HEFT and GC in a dynamic way (decisions are based on the current time).

Algorithm 1 The UDS Algorithm
Input: Workflow tasks $T$, pool of resources $R$, $\theta$, $a$, $b$
Output: $M_{final}$, $C_{final}$
1: Calculate $M_{lower}$, $C_{lower}$ by calling $eff_\tau(M_{lower}, C_{lower})$
2: $M_{upper} = M_{lower} + a \times M_{lower}$
3: $C_{upper} = C_{lower} + b \times C_{lower}$
4: $Q \leftarrow t_{entry}$
5: while $Q \neq \emptyset$ do
6:   Select $t_i$ from $Q$
7:   $W \leftarrow$ the set of tasks in waiting state ($t_i$ included)
8:   Call $eff_\tau(M_{curr}, C_{curr})$ $\forall t \in W$
9:   $normM = (M^{\tau}_{curr} - M_{lower}) / (M_{upper} - M_{lower})$
10:  $normC = (C^{\tau}_{curr} - C_{lower}) / (C_{upper} - C_{lower})$
11:  $PMI^{\tau}_{i} = FLC(normM, normC)$
12:  if $PMI^{\tau}_{i} \geq \theta$ then
13:    Select a reliable pricing model
14:  else
15:    Select an unreliable pricing model
16:  end if
17:  for each $r_j \in R$ do
18:    Compute $EFT^{r_j}_{t_i}$
19:  end for
20:  Schedule $t_i$ to the $r_j$ that minimizes $EFT_{t_i}$
21:  Update $Q$ with $succ(t_i)$ if $\forall t_j \in pred(t_i)$: $AFT_{t_j} + d_{ji} \leq \tau$
22: end while

We should mention that HEFT & GC consider only tasks that, at $\tau$, are in waiting state (lines 7-8), i.e., tasks that are not able to run yet because the conditions for running are not in place (precedence constraints). Both HEFT and GC will result in different solutions w.r.t. the makespan and the cost which, in turn, result in different distances from $M_{lower}$ and $C_{lower}$. Let $M^{\tau}_{curr}$ and $C^{\tau}_{curr}$ denote the aforementioned solutions and $M_{upper}$ and $C_{upper}$ be the upper bounds for both $M$ and $C$ as expressed in lines 2 and 3, respectively; $a$ and $b$ are scalar values. Next, both $M^{\tau}_{curr}$ and $C^{\tau}_{curr}$ are normalized in the unity interval based on the aforementioned lower and upper bounds (lines 9, 10). Clearly, when $M^{\tau}_{curr}$ is relatively small compared to $C^{\tau}_{curr}$, unreliable VMs should be utilized to reduce the overall monetary cost; the adversary case indicates that reliable VMs should be used. Our goal is to minimize the distance between the solution $(M_{lower}, C_{lower})$ and the one generated by our approach for every ready task, $(M^{\tau}_{curr}, C^{\tau}_{curr})$. However, due to performance fluctuations in the Cloud environment, selecting the appropriate virtualized resource (type and computational capabilities) is a challenging task.
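The per-task decision of lines 9-16 can be sketched as follows; here `flc` stands for any fuzzy controller returning a PMI in [0, 1] (a minimal one is sketched in Section V), and every number is an illustrative assumption, not a value from the paper.

def uds_decide(m_curr, c_curr, m_lower, m_upper, c_lower, c_upper, flc, theta=0.5):
    """One UDS decision step (Algorithm 1, lines 9-16), as we read it.

    m_curr/c_curr: makespan/cost of the dynamic HEFT/GC solutions at slot tau.
    Returns the pricing model selected for the ready task.
    """
    # Lines 9-10: normalize the distances from the ideal lower bounds.
    norm_m = (m_curr - m_lower) / (m_upper - m_lower)
    norm_c = (c_curr - c_lower) / (c_upper - c_lower)
    # Line 11: the FLC maps the two normalized distances to a PMI in [0, 1].
    pmi = flc(norm_m, norm_c)
    # Lines 12-16: thresholding on theta selects the pricing model.
    return "reliable" if pmi >= theta else "unreliable"

# Illustrative call: far from the makespan bound, close to the cost bound.
model = uds_decide(m_curr=190.0, c_curr=10.5,
                   m_lower=100.0, m_upper=300.0,   # a = 2.0
                   c_lower=10.0, c_upper=30.0,     # b = 2.0
                   flc=lambda m, c: m / (m + c + 1e-9))  # stand-in FLC
print(model)  # "reliable": the makespan gap dominates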
V. THE UNCERTAINTY-DRIVEN DECISION MAKING

A. The Proposed FLC
As it is difficult to be aware of and define specific thresholds for both metrics (makespan and monetary cost) to support an efficient resource allocation, and aiming at the management of the ambient uncertainty, we adopt an FLC to support the final decision related to the selection of the appropriate resources (line 11 of the proposed algorithm). In FL systems, the objects of discourse are associated with information which is, or is allowed to be, incomplete, partially true or partially possible. FL deals with incomplete information and provides knowledge representation models, i.e., Fuzzy Set Theory, through which an entity can automatically take decisions during the fulfillment of a task. FL principles express human expert knowledge and enable the automated interpretation of results. The proposed FLC is responsible for handling the uncertainty in decision making and the definition of thresholds for the involved parameters. The FLC is a non-linear mapping between $l$ inputs $u_i \in U_i$, $i = 1, \ldots, l$ and $m$ outputs $y_i \in Y_i$, $i = 1, \ldots, m$. We adopt the Mamdani type of inference [3] that utilizes rules of the following form:

$R_j$: IF $u_{1j}$ is $A_{1j}$ AND/OR $u_{2j}$ is $A_{2j}$ AND/OR $\ldots$ AND/OR $u_{lj}$ is $A_{lj}$ THEN $y_{1j}$ is $B_{1j}$ AND $\ldots$ AND $y_{mj}$ is $B_{mj}$,

where $R_j$ is the $j$th fuzzy rule, $u_{ij}$ ($i = 1, \ldots, l$) are the inputs of the $j$th rule, $y_{kj}$ ($k = 1, \ldots, m$) are the outputs and $A_{ij}$, $B_{kj}$ are membership functions usually associated with linguistic terms.

The proposed FLC has two inputs, i.e., $M^{\tau}_{curr}$ & $C^{\tau}_{curr}$. The single output of the FLC is the Pricing Model Indicator (PMI), $PMI^{\tau}_{i}$. When $M^{\tau}_{curr} \to 1$ (High), there is an increased demand to decrease the makespan; the opposite is true when $M^{\tau}_{curr} \to 0$ (Low). When $C^{\tau}_{curr} \to 1$ (High), the current scheduling decision suffers from a high monetary cost; the opposite stands for $C^{\tau}_{curr} \to 0$ (Low). Concerning the output fuzzy variable $PMI^{\tau}_{i}$, a value close to one (High) indicates that reliable resources should be used to decrease the overall execution time of the workflow. On the other hand, a value close to zero (Low) depicts a decrease-monetary-cost decision; thus, task $t_i$ should be executed on unreliable resources. So far, the FLC is capable of deciding on the type of resources that must be selected (reliable or unreliable).

For the inputs and the output, we consider three linguistic values: Low, Medium, High. A Low value represents that the fuzzy variable takes values close to the lowest limit while a High value depicts the case where the variable takes values close to the upper limit. A Medium value depicts the case where the variable takes values close to the average (e.g., around 0.5). For simplicity, we consider triangular membership functions as they are widely adopted in the literature. However, the proposed framework is generic enough and, thus, one can adopt any membership function that better suits the application domain.

The proposed FLC receives crisp values for the two inputs, fuzzifies them and, accordingly, proceeds with the inference process. The inference process involves a set of fuzzy rules that produce the best possible value for the output $PMI^{\tau}_{i}$. These rules are defined by experts and incorporate the human view on the decision process that we should follow. In Table I, we present the adopted FL rule base. These rules are designed for the specific scenario and exhibit a behavior that resembles human reasoning, e.g., if the monetary cost is high and the execution time is low, then allocate the current task to an unreliable resource. The final step is the de-fuzzification process in order to derive the final $PMI^{\tau}_{i}$ value. When the $PMI^{\tau}_{i}$ value is over a pre-defined threshold ($\theta$), task $t_i$ is scheduled for execution on a reliable VM, otherwise it is executed on an unreliable VM. Our proposed methodology considers multiple heterogeneous virtualized resources; thus, to conclude on the computational capabilities of the VM that will eventually host $t_i$, we select the one that minimizes $t_i$'s finish time the most (lines 18-20).

TABLE I: Fuzzy Logic rule base

No | $M^{\tau}_{curr}$ | $C^{\tau}_{curr}$    | $PMI^{\tau}_{i}$
1  | Low               | {Low, Medium, High}  | Low
2  | Medium            | Low                  | High
3  | Medium            | Medium               | Medium
4  | Medium            | High                 | Low
5  | High              | {Low, Medium, High}  | High
VI. EXPERIMENTAL EVALUATION
A. Simulation Setup
Workflow Applications. We report on the experimental evaluation of the proposed model relying on five (5) workflow applications as described in [5], [4]. The number of tasks, the execution time of each task as well as the amount of data transferred between them are reported in a Directed Acyclic Graph in XML (DAX) format. Workflows include Montage, LIGO, CyberShake, SIPHT and Epigenomics, which are extensively adopted in the relevant literature. The discussed workflows 'cover' all the basic execution patterns such as pipelining, process, data aggregation, data distribution and data redistribution. Each workflow contains 1,000 tasks.
Virtualized Resources. We consider a Cloud model with a single DC offering VMs of different CPU speeds and prices. For each experiment, we consider five (5) reliable and five (5) unreliable resources with their characteristics generated based on the Amazon EC2 platform. Generic VMs are considered from the US East (Ohio) region with a Linux operating system. Table II presents the adopted VMs' characteristics. We assume that the execution time of each task provided in the DAX files refers to the slowest available VM (a1.medium). The average bandwidth between the storage system (S3) and the VMs is set to 20 Mbps, which is the approximate average bandwidth provided by Amazon services [9]. To measure the performance fluctuations of the adopted VMs, we follow a similar approach to the one presented in [8], [7]. The performance of VMs varies up to 19% based on a normal distribution with a mean of 9.5% and a standard deviation of 5%. The bootup/startup time for each VM (provisioning time) is set to 96.9 seconds [6].
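As an illustration, such a degradation model can be sampled as follows (a sketch under our reading of [8], [7]; treating the 19% cap as a clip and the nominal MIPS figures as placeholders are assumptions):

import numpy as np

rng = np.random.default_rng(42)

def sample_slowdown(n_vms):
    """Per-VM performance degradation: normal(9.5%, 5%), truncated to [0, 19%]."""
    deg = rng.normal(loc=0.095, scale=0.05, size=n_vms)
    return np.clip(deg, 0.0, 0.19)

# Effective MIPS of each VM after applying the sampled fluctuation.
nominal_mips = np.array([2000, 4000, 8000, 16000, 32000])  # illustrative values
effective = nominal_mips * (1.0 - sample_slowdown(len(nominal_mips)))
print(effective.round(1))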
The Interruption Model. As the demand for unreliable instances can vary significantly over time, the availability of such instances is questionable. An unreliable instance can be interrupted at any time and the allocated capacity is returned to the Cloud provider. Amazon claims that the average interruption probability across all regions and instances is less than 5%. However, different types of instances are associated with different interruption probabilities [28]. In our experimental evaluation, we consider that interruptions may occur at any slot ($S_\tau$) during the execution of tasks on any unreliable resource. Table II depicts the interruption probability for each VM. After an interruption, the corresponding VM does not become available again, unless it is requested from the provider (provisioning and de-provisioning costs are considered). To achieve a fault-tolerant setup, we consider task retries, i.e., revoked tasks along with not-yet-running tasks are resubmitted to be scheduled at the time the revocation event actually happened.

TABLE II: VMs characteristics

Type       | vCPUs | Cost per hour ($), Reliable | Cost per hour ($), Unreliable | $p_i$
a1.medium  | 2     | 0.0255 | 0.005  | 30%
a1.large   | 4     | 0.051  | 0.0098 | 28%
a1.xlarge  | 8     | 0.102  | 0.0197 | 25%
a1.2xlarge | 16    | 0.204  | 0.0394 | 22%
a1.4xlarge | 32    | 0.408  | 0.0788 | 20%
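A minimal sketch of how this interruption model can be emulated follows; reading Table II's $p_i$ as a per-slot revocation probability is our assumption, and for brevity the sketch restarts the task on the same VM, whereas the scheduler would resubmit it.

import numpy as np

rng = np.random.default_rng(7)

def run_with_interruptions(exec_slots, p_slot):
    """Count slots needed to finish a task on an unreliable VM.

    exec_slots: slots of work the task needs; p_slot: per-slot revocation
    probability of the hosting VM (our Bernoulli reading of Table II's p_i).
    Tasks are atomic, so a revocation loses all progress.
    """
    elapsed, progress = 0, 0
    while progress < exec_slots:
        elapsed += 1
        if rng.random() < p_slot:      # VM revoked during this slot
            progress = 0               # atomic task: restart from scratch
        else:
            progress += 1
    return elapsed

# Average overhead of a 10-slot task on an a1.medium-like unreliable VM.
trials = [run_with_interruptions(10, 0.30) for _ in range(1000)]
print("mean elapsed slots:", round(np.mean(trials), 1))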
Performance Metrics. We report on the performance of our model concerning its ability to make correct decisions when deciding the pricing model (reliable or unreliable) for the execution of tasks. We also focus on the workflow's execution time and the overall monetary cost. The performance of the proposed mechanism is evaluated by a set of metrics along the following axes: (i) the accuracy of the FLC ($acc$). To measure $acc$, we define the number of correct decisions $\Delta$. To do so, we assume two binary functions, $\lambda(M_t)$ and $\lambda(C_t)$. $\lambda(M_t)$ is equal to unity if task $t$ is executed on a reliable VM when $HEFT_{ideal}$ is applied, otherwise it is equal to zero. Similarly, $\lambda(C_t)$ is equal to unity if $t$ is executed on an unreliable VM when $GC_{ideal}$ is applied, otherwise it is equal to zero. A decision is considered correct when one of Eq. (8), Eq. (9) holds true. For instance, Eq. (8) indicates that a decision is correct when $HEFT_{ideal}$ assigns $t$ to a reliable VM and, at the same time, the FLC decides to schedule $t$ to a reliable VM. The final accuracy of the proposed FLC is measured as $acc = \Delta / |T| \times 100$.

$\lambda(M_t) = 1 \;\&\&\; FLC_t \to reliable$   (8)

$\lambda(C_t) = 1 \;\&\&\; FLC_t \to unreliable$   (9)

(ii) the final makespan ($M_{final}$) and the monetary cost ($C_{final}$) generated by the proposed algorithm. Both metrics are normalized using the equations $normM = M_{final}/M_{lower}$ and $normC = C_{final}/C_{lower}$, respectively. When $normM$ and $normC$ are close to unity, the performance of the proposed algorithm is considered efficient.

We perform a set of experiments for different $\theta$, $a$ and $b$. $\theta$ varies from 0.1 to 0.9 while for $a$ and $b$ we consider both tight and relaxed upper bounds ranging from 0.5 to 3.0, respectively. In total, we conduct 100 iterations for each experiment and report our experimental outcomes for the aforementioned metrics. Experiments were conducted on a Linux server with two 6-core Intel Xeon E5-2630 CPUs running at 2.3GHz.
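For concreteness, the accuracy metric of Eqs. (8)-(9) can be computed as in the following sketch (our illustration; the decision lists are made up):

def accuracy(heft_reliable, gc_unreliable, flc_decisions):
    """acc per Eqs. (8)-(9): a decision is correct when it matches the
    ideal HEFT (reliable) or GreedyCost (unreliable) assignment.

    heft_reliable[t]  -> True iff HEFT_ideal puts task t on a reliable VM.
    gc_unreliable[t]  -> True iff GC_ideal puts task t on an unreliable VM.
    flc_decisions[t]  -> "reliable" or "unreliable" as chosen by the FLC.
    """
    correct = 0
    for t, choice in flc_decisions.items():
        if (heft_reliable[t] and choice == "reliable") or \
           (gc_unreliable[t] and choice == "unreliable"):
            correct += 1                       # contributes to Delta
    return 100.0 * correct / len(flc_decisions)

# Toy example with three tasks (illustrative values): prints ~66.7.
print(accuracy(heft_reliable={"t1": True, "t2": True, "t3": False},
               gc_unreliable={"t1": False, "t2": True, "t3": True},
               flc_decisions={"t1": "reliable", "t2": "unreliable",
                              "t3": "reliable"}))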
B. Performance Assessment

Initially, we perform a set of simulations for various $\theta$ realizations and illustrate the effect on both $normM$ and $normC$. Recall that when $PMI^{\tau}_{i} \geq \theta$, $t_i$ will be scheduled to a reliable resource, otherwise an unreliable resource is selected. In Fig. 1 and Fig. 2, we plot the normalized makespan ($normM$) and cost ($normC$) for different combinations of $a$ and $b$ while $\theta$ varies from 0.1 to 0.9. First, it becomes clear that any performance difference is rather small as $\theta$ increases. As expected, $\theta$ and $normM$ follow the same trend. This is reasonable since, as $\theta$ increases, the majority of tasks are assigned to unreliable VMs. However, even for the case where $\theta = 0.9$, the makespan is less than two times the lower bound. In Fig. 2, $\theta$ is inversely proportional to $normC$. This stems from the fact that high $\theta$ values suggest the use of cost-effective unreliable VMs. For the highest $\theta$ values, we observe cases where the cost is nearly equal to the lower bound.

Fig. 1: Our evaluation outcomes related to normM
Fig. 2: Our evaluation outcomes related to normC
In Figs. 4-6, we keep $\theta$ fixed and present the performance of the UDS algorithm for different combinations of $a$ and $b$ (six in total for each case). We observe that when we expect the performance of the UDS algorithm to be extremely close to the lower bounds ($a = b = 0.5$), the proposed algorithm sacrifices cost for execution time. This is natural as our FL rule base 'suggests' the use of reliable VMs when the distance from the lower bound is high. On the other hand, the UDS algorithm favors the cost as the allowed distance from the lower bounds (i.e., $a$ and $b$) increases. This is due to the fact that the FLC suggests the use of unreliable VMs when the distance from the lower bounds is limited. However, the effect on the makespan is relatively minor compared to the benefit on the cost. In the experimental scenario where $a = 3.0$ & $b = 0.5$, we enjoy the best performance for the cost metric, while the reverse scenario, i.e., $a = 0.5$, $b = 3.0$, leads to the best performance for the makespan metric. To perform efficiently on both metrics in parallel, the distance from the upper bound must be moderate (e.g., $a = 2.0$, $b = 2.0$).

Fig. 3 presents the accuracy ($acc$) of our model as well as the percentage of tasks that have been executed successfully ($succR$). We can see that $acc$ is higher for high $a$ and $b$, i.e., in these scenarios the UDS algorithm achieves an efficient performance for both metrics, as explained above. However, the number of successfully executed tasks decreases as the proposed algorithm utilizes more unreliable VMs. In any case, the proposed approach is characterized by stability, as different $a$ and $b$ values have a minor impact on the accuracy of our model.

Fig. 3: Accuracy (%) and Success Rate (%)
Fig. 4: Results for a ∈ [0.5, 3.0], b ∈ [0.5, 3.0]
Fig. 5: Results for a ∈ [0.5, 3.0], b = 0.5
Fig. 6: Results for a = 0.5, b ∈ [0.5, 3.0]

VII. CONCLUSIONS
In this paper, we tackle the problem of scheduling scientific workflows over distributed heterogeneous resources using different Cloud-based pricing models. To decide on the pricing model, the proposed algorithm (UDS) incorporates an FLC that delivers the realization of an indicator over which the final decision is made. The discussed indicator shows the efficiency of executing a task on reliable or unreliable resources. Viewing the results in retrospect, we can argue that UDS tackles both optimization targets, i.e., the execution time and the monetary cost, achieving a high accuracy (up to 98%). In the first place of our future agenda is to apply optimal stopping theory to detect the appropriate time to migrate a task to more-certain resources and to semantically cover the heterogeneity of the available resources.
REFERENCES

[1] M. A. Rodriguez and R. Buyya, "A taxonomy and survey on scheduling algorithms for scientific workflows in IaaS cloud computing environments," Concurrency and Computation: Practice and Experience, vol. 29, no. 8, p. e4041, 2017.
[2] F. Wu, Q. Wu, and Y. Tan, "Workflow scheduling in cloud: a survey," The Journal of Supercomputing, vol. 71, no. 9, pp. 3373–3418, 2015.
[3] E. H. Mamdani and S. Assilian, "An experiment in linguistic synthesis with a fuzzy logic controller," International Journal of Man-Machine Studies, vol. 7, no. 1, pp. 1–13, 1975.
[4] S. Bharathi, A. Chervenak, E. Deelman, G. Mehta, M.-H. Su, and K. Vahi, "Characterization of scientific workflows," in Third Workshop on Workflows in Support of Large-Scale Science, pp. 1–10, IEEE, 2008.
[5] G. Juve, A. Chervenak, E. Deelman, S. Bharathi, G. Mehta, and K. Vahi, "Characterizing and profiling scientific workflows," Future Generation Computer Systems, vol. 29, no. 3, pp. 682–692, 2013.
[6] M. Mao and M. Humphrey, "A performance study on the VM startup time in the cloud," in IEEE 5th International Conference on Cloud Computing, pp. 423–430, IEEE, 2012.
[7] J. Sahni and D. P. Vidyarthi, "A cost-effective deadline-constrained dynamic scheduling algorithm for scientific workflows in a cloud environment," IEEE Transactions on Cloud Computing, vol. 6, no. 1, pp. 2–18, 2015.
[8] J. Schad, J. Dittrich, and J.-A. Quiané-Ruiz, "Runtime measurements in the cloud: observing, analyzing, and reducing variance," Proceedings of the VLDB Endowment, vol. 3, no. 1-2, pp. 460–471, 2010.
[9] M. R. Palankar, A. Iamnitchi, M. Ripeanu, and S. Garfinkel, "Amazon S3 for science grids: a viable solution?," in International Workshop on Data-Aware Distributed Computing, pp. 55–64, 2008.
[10] Y.-K. Kwok and I. Ahmad, "Dynamic critical-path scheduling: An effective technique for allocating task graphs to multiprocessors," IEEE Transactions on Parallel and Distributed Systems, vol. 7, no. 5, pp. 506–521, 1996.
[11] H. Topcuoglu, S. Hariri, and M.-Y. Wu, "Performance-effective and low-complexity task scheduling for heterogeneous computing," IEEE Transactions on Parallel and Distributed Systems, vol. 13, no. 3, pp. 260–274, 2002.
[12] S. Yi, D. Kondo, and A. Andrzejak, "Reducing costs of spot instances via checkpointing in the Amazon Elastic Compute Cloud," in IEEE 3rd International Conference on Cloud Computing, pp. 236–243, IEEE, 2010.
[13] S. Ostermann and R. Prodan, "Impact of variable priced cloud resources on scientific workflow scheduling," in European Conference on Parallel Processing, pp. 350–362, Springer, 2012.
[14] D. A. Monge and C. G. Garino, "Adaptive spot-instances aware autoscaling for scientific workflows on the cloud," in Latin American High Performance Computing Conference, pp. 13–27, Springer, 2014.
[15] D. Jung, T. Suh, H. Yu, and J. Gil, "A workflow scheduling technique using genetic algorithm in spot instance-based cloud," KSII Transactions on Internet & Information Systems, vol. 8, no. 9, 2014.
[16] D. Poola, K. Ramamohanarao, and R. Buyya, "Fault-tolerant workflow scheduling using spot instances on clouds," in ICCS, pp. 523–533, 2014.
[17] A. C. Zhou, B. He, and C. Liu, "Monetary cost optimizations for hosting workflow-as-a-service in IaaS clouds," IEEE Transactions on Cloud Computing, vol. 4, no. 1, pp. 34–48, 2015.
[18] T. Ghafarian and B. Javadi, "Cloud-aware data intensive workflow scheduling on volunteer computing systems," Future Generation Computer Systems, vol. 51, pp. 87–97, 2015.
[19] R. Chard, K. Chard, K. Bubendorfer, L. Lacinski, R. Madduri, and I. Foster, "Cost-aware cloud provisioning," in IEEE 11th International Conference on e-Science, pp. 136–144, IEEE, 2015.
[20] H. Xu, B. Yang, W. Qi, and E. Ahene, "A multi-objective optimization approach to workflow scheduling in clouds considering fault recovery," KSII Transactions on Internet & Information Systems, vol. 10, no. 3, 2016.
[21] D. Poola, K. Ramamohanarao, and R. Buyya, "Enhancing reliability of workflow execution using task replication and spot instances," ACM Transactions on Autonomous and Adaptive Systems, vol. 10, no. 4, pp. 1–21, 2016.
[22] L. Chen, X. Li, and R. Ruiz, "Cloud workflow scheduling with on-demand and spot block instances," in IEEE 21st International Conference on Computer Supported Cooperative Work in Design, pp. 451–456, IEEE, 2017.
[23] F. Tordini, M. Aldinucci, P. Viviani, M. Ivan, P. Lio, et al., "Scientific workflows on clouds with heterogeneous and preemptible instances," in International Conference on Parallel Computing, pp. 1–10, IOS Press, 2018.
[24] M. Suguna, D. Prakash, D. Y. Thangam, and G. Shobana, "Heuristic task workflow scheduling in cloud using spot and on-demand instances," Journal of Computational and Theoretical Nanoscience, vol. 15, no. 8, pp. 2640–2644, 2018.
[25] D. A. Monge, E. Pacini, C. Mateos, E. Alba, and C. G. Garino, "CMI: An online multi-objective genetic autoscaler for scientific and engineering workflows in cloud infrastructures with unreliable virtual machines," Journal of Network and Computer Applications, vol. 149, p. 102464, 2020.
[26] R. G. Martinez, A. Lopes, and L. Rodrigues, "Planning workflow executions when using spot instances in the cloud," pp. 310–317, 2019.
[27] D. Fernández-Baca, "Allocating modules to processors in a distributed system," IEEE Transactions on Software Engineering, vol. 15, no. 11, pp. 1427–1436, 1989.
[28] "Frequency of interruption." https://aws.amazon.com/ec2/spot/instance-advisor/. Accessed: 2020-09-21.
[29] P. Oikonomou, M. G. Koziri, N. Tziritas, A. N. Dadaliaris, T. Loukopoulos, G. I. Stamoulis, and S. U. Khan, "Scheduling video transcoding jobs in the cloud," in IEEE Green Computing and Communications, pp. 442–449, IEEE, 2018.
[30] P. Oikonomou, M. G. Koziri, N. Tziritas, T. Loukopoulos, and X. Cheng-Zhong, "Scheduling heuristics for live video transcoding on cloud edges," ZTE Communications, vol. 15, no. 2, pp. 35–41, 2019.
[31] K. A. O. P. K. K. L. T., "A demand-driven, proactive tasks management model at the edge," in IEEE International Conference on Fuzzy Systems, IEEE, 2020.
[32] K. K. A. C. H. S., "A fuzzy logic system for bargaining in information markets," ACM Transactions on Intelligent Systems and Technology, vol. 3, no. 2, ACM, 2012.
[33] A. R. K. K. H. S., "Buyer agent decision process based on automatic fuzzy rules generation methods," in IEEE World Congress on Computational Intelligence, FUZZ-IEEE, pp. 856–863, 2010.
[34] K. K. A. C. H. S., "On the use of fuzzy logic in a seller bargaining game," in 32nd Annual IEEE International Computer Software and Applications Conference, 2008.