Energy-Aware Scheduling of Task Graphs with Imprecise Computations and End-to-End Deadlines
AMIRHOSSEIN ESMAILI,
University of Southern California
MAHDI NAZEMI,
University of Southern California
MASSOUD PEDRAM,
University of Southern California

Imprecise computations provide an avenue for scheduling algorithms developed for energy-constrained computing devices by trading off output quality against the utilization of system resources. This work proposes a method for scheduling task graphs with potentially imprecise computations, with the goal of maximizing the quality of service (QoS) subject to a hard deadline and an energy bound. Furthermore, to evaluate the efficacy of the proposed method, a mixed-integer linear programming (MILP) formulation of the problem, which provides the optimal reference scheduling solutions, is also presented. The effect of potentially imprecise inputs of tasks on their output quality is taken into account in the proposed method. Both the proposed method and the MILP formulation target multiprocessor platforms. Experiments are run on 10 randomly generated task graphs. Based on the obtained results, in some cases a feasible schedule of a task graph can be achieved with an energy consumption less than 50% of the minimum energy required for scheduling all tasks in that task graph completely precisely.

Additional Key Words and Phrases: Task Scheduling, Imprecise Computations, Real-time MPSoCs, Input Error
In many real-time applications, it is often preferred for a task to produce an approximate (aka imprecise) result by its deadline rather than producing an exact (aka precise) result late [1]. Imprecise computations increase the flexibility of scheduling algorithms developed for real-time systems by allowing them to trade off output quality against the utilization of system resources, such as processor cycles.

In imprecise computations, a real-time task is allowed to return intermediate and imprecise results of poorer quality as long as it processes a predefined chunk of work that defines its baseline quality. The number of processor cycles required for the task to provide this baseline quality is referred to as the mandatory workload of the task. Assigning processor cycles to a task beyond its mandatory workload increases the quality of its results. In other words, the output quality of each task is a monotonic non-decreasing function of the processor cycles assigned to it [2]. The workload of a task beyond its mandatory workload is referred to as the optional workload, which can be executed partially. The quality of service (QoS) is usually evaluated as a linear or concave function of the number of processor cycles assigned to the optional workloads of tasks [2]. When the full workload of a task, both mandatory and optional, is entirely completed, the results produced by that task are considered precise.

Furthermore, an energy consumption budget may also be one of the main design constraints for energy-constrained computing devices such as embedded systems. Some prior work has focused on scheduling task graphs with imprecise computations under energy and deadline constraints on single-processor platforms [1, 3] and multiprocessor platforms [4, 5].
In the said prior work, the effect of potentially imprecise inputs of tasks on their output quality is not considered, and thus QoS can be obtained solely based on the number of processor cycles assigned to optional workloads. However, in many real-time applications, such as video compression or speech recognition [6], tasks are interdependent and are represented by a task graph. Therefore, the input to a task can depend on the output of one or more other tasks, which may have imprecise results. In the literature, the effect of an imprecise input of a task is usually modeled with an extension of the workload of that task, where this extension is responsible for compensating for the quality degradation due to imprecise inputs.

This work proposes a method for scheduling task graphs with potentially imprecise computations, with the goal of maximizing QoS subject to a hard deadline and an energy bound. The proposed method takes into account the effect of a potential extension in the workload of each task based on the quality of its inputs. It considers task dependencies within a given task graph in order to find tasks (aka nodes) that can be performed imprecisely without having a negative impact on QoS. In addition, for evaluating the efficacy of the proposed method, a mixed-integer linear programming (MILP) formulation of the problem, which provides the optimal reference scheduling solutions, is also presented. Both the proposed method and the MILP formulation target multiprocessor system-on-chip (MPSoC) platforms due to their increasing popularity in many real-time applications. To the best of our knowledge, this is the first work that takes into account the effect of workload extension based on input quality in scheduling task graphs on MPSoC platforms where the goal is to maximize QoS subject to a hard deadline and an energy bound.

The rest of the paper is organized as follows.
Section 2 explains the models used in the paper, formally characterizes tasks with potentially imprecise computations, and presents the problem statement. Next, Section 3 explains the proposed method for scheduling task graphs with imprecise computations on an MPSoC platform. It also presents a comprehensive MILP formulation of the same problem, which allows comparing the proposed method with an exact solution. After that, Section 4 details the experimental results. Finally, Section 5 concludes the paper.

Tasks to be scheduled are modeled as a directed acyclic graph (DAG) represented by $G(V, E, T_d)$, in which $V$ denotes the set of $n$ tasks, $E$ denotes data dependencies among tasks, and $T_d$ denotes the period of the task graph. $T_d$ acts as a hard deadline for scheduling, and each repetition of the task graph should be scheduled before the arrival of the next one.

Each task with the possibility of imprecise computation consists of two parts: a mandatory part and an optional part. In order for a task to produce an acceptable result, its mandatory part must be completed. The optional part refines the result produced by the mandatory part. If the optional part of a task is not executed entirely, the result of the task is imprecise and the task has an output error. In a task graph, if one or more parent tasks of a task $u$ have an output error, task $u$ will have an input error. Similar to prior work [2, 7], we assume that only the execution of the mandatory part of task $u$ is extended to compensate for the input error, while the optional part of task $u$ remains the same. This is a valid assumption for many applications such as weather forecasting systems [2], image and video processing, and Newton's root-finding method [6]. In other words, the mandatory part of a certain task can be thought of as the minimum number of processor cycles required for the task to produce a result with an acceptable quality, and the mandatory part grows when the quality of the task's inputs decreases [7].
In order for a task graph to be considered feasibly scheduled, at least the potentially extended mandatory workload of each task must be completed before the deadline $T_d$.

The number of processor cycles required to finish the mandatory part of task $u$ when its inputs are error-free is represented by $M_u$. For a task $u$ with a nonzero input error, its mandatory workload is extended such that it is capable of producing correct results. The number of processor cycles required to process the extension added to $M_u$, which depends on the quality of the task's inputs, is represented by $M^x_u$. Therefore, the total mandatory workload, represented by $M'_u$, is obtained as follows:

$$M'_u = M_u + M^x_u. \quad (1)$$

The total optional workload of task $u$, which can be executed partially, is represented by $O_u$. The number of processor cycles actually assigned to the optional workload of task $u$ is represented by $o_u$ ($o_u \le O_u$). According to [6], the general mandatory extension function of a task can be estimated by a straight line, which provides an upper bound on the amount of required extension. The slope of this line, represented by $m_u$ and referred to as the task-specific scaling factor [2, 6], quantifies the dependency between $E^i_u$ and $M^x_u$ as follows:

$$M^x_u = m_u \times E^i_u, \quad (2)$$

in which $E^i_u$ indicates the input error of task $u$. Similar to [2], $E^i_u$ in a task graph is defined as follows:

$$E^i_u = \min\Big\{1, \sum_{j \in par(u)} E^o_j\Big\}, \quad (3)$$

where $par(u)$ is the set of immediate parents of task $u$ and $E^o_j$ represents the output error of parent task $j$. $E^o_j$ is defined as the portion of discarded optional workload of task $j$ [6], and is thus obtained as follows:

$$E^o_j = \frac{O_j - o_j}{O_j} = 1 - \frac{o_j}{O_j}, \quad 0 \le E^o_j \le 1. \quad (4)$$

Based on (3) and (4), we have $0 \le E^i_u \le 1$. According to (2), when the input of task $u$ is error-free (i.e., $E^i_u = 0$), $M^x_u = 0$ and $M'_u = M_u$. On the other hand, when task $u$ has the maximum input error (i.e., $E^i_u = 1$), $M^x_u = m_u$ and thus $M'_u = M_u + m_u$. In this case, the mandatory workload extension for task $u$ reaches its maximum.

Note that the assumption that workload extension can always compensate for the input error is not true in general. However, based on [6], we can transform the given mandatory and optional portions of a task's workload such that in the worst case, where all transformed optional workloads of parent tasks are discarded, the extension amount obtained by (2) is able to compensate for the input error. Therefore, the $M_u$ and $O_u$ used in our proposed method are transformed versions of the given mandatory and optional workloads of tasks.

The total number of processor cycles assigned to task $u$ is represented by $W_u$ and is obtained as follows:

$$W_u = M'_u + o_u. \quad (5)$$

To model the power consumption of a processor operating at clock frequency $f$, we use the following equation borrowed from [8]:

$$\rho = \alpha f^\beta + \gamma f + \delta, \quad (6)$$

in which $\rho$ represents the total power consumption, and $\alpha$, $\beta$, $\gamma$, and $\delta$ are the power model constants. $\alpha f^\beta$ represents the dynamic power consumption, and $\gamma f + \delta$ represents the static power consumption. $\alpha$ is a constant that depends on the average switched capacitance and the average activity factor, and $\beta$ indicates the technology-dependent dynamic power exponent, which is usually $\approx 3$. Therefore, the energy consumption in one clock cycle ($\epsilon_{cycle}$), when executing a task at clock frequency $f$, is obtained from the following equation:

$$\epsilon_{cycle} = \alpha f^{\beta-1} + \gamma + \frac{\delta}{f}. \quad (7)$$

2.3 Problem Statement

We seek to schedule a task graph with the possibility of imprecise computations, represented by $G(V, E, T_d)$, on a platform comprising $K$ homogeneous processors in order to maximize QoS subject to a hard deadline and an energy bound. Each processor supports a set of $m$ distinct clock frequencies: $\{f_1, f_2, ..., f_m\}$. QoS highly correlates with how many processor cycles are assigned to the execution of the optional workloads of exit tasks, which are the tasks in the task graph with no child tasks. The reason is that the discarded optional workloads of tasks other than exit tasks are compensated by extensions in the mandatory workloads of their child tasks. Consequently, QoS is quantitatively defined as follows:

$$QoS = \frac{\sum_{u \in exit(G)} P_u}{|exit(G)|}, \quad 0 \le QoS \le 1, \quad (8)$$

where $exit(G)$ represents the set of exit tasks of task graph $G$, and $P_u$ represents the precision of task $u$. $P_u$ is a non-decreasing function of the number of processor cycles assigned to the optional workload of task $u$. Similar to [2], $P_u$ is defined as follows:

$$P_u = P^T_u + (1 - P^T_u)\Big(\frac{o_u}{O_u}\Big), \quad (9)$$

in which $P^T_u$ indicates the minimum precision acceptable from task $u$, aka the precision threshold of task $u$. $P^T_u$ assumes values between 0 and 1 and indicates the precision of task $u$ when only its (extended) mandatory part is completed [2]. Based on (9), executing only the extended mandatory workload of task $u$ ($o_u = 0$) results in $P_u = P^T_u$. On the other hand, executing the entire optional workload of task $u$ ($o_u = O_u$) in addition to its extended mandatory workload leads to $P_u = 1$. For other values of $o_u$, $P^T_u < P_u < 1$.

The proposed framework comprises two main steps:
(1) determining the number of processor cycles assigned to the optional workloads of non-exit tasks, and
(2) scheduling tasks on an MPSoC to maximize QoS subject to energy and deadline constraints.
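The error-propagation and QoS model of (1)-(9) can be sketched in a few lines of code (a minimal illustration, not the authors' implementation; the task parameters and the three-task graph below are hypothetical):

```python
# Sketch of the error-propagation and QoS model of Eqs. (1)-(9).
# Task parameters (M, O, m, PT) and the example graph are hypothetical.

class Task:
    def __init__(self, M, O, m, PT, parents=()):
        self.M, self.O, self.m, self.PT = M, O, m, PT
        self.parents = list(parents)
        self.o = O  # cycles actually given to the optional part (o <= O)

    def output_error(self):                      # Eq. (4)
        return 1.0 - self.o / self.O if self.O else 0.0

    def input_error(self):                       # Eq. (3)
        return min(1.0, sum(p.output_error() for p in self.parents))

    def total_mandatory(self):                   # Eqs. (1)-(2): M' = M + m * E_i
        return self.M + self.m * self.input_error()

    def precision(self):                         # Eq. (9)
        return self.PT + (1.0 - self.PT) * (self.o / self.O)

def qos(exit_tasks):                             # Eq. (8)
    return sum(t.precision() for t in exit_tasks) / len(exit_tasks)

# Two parents feeding one exit task; parent a discards half its optional work.
a = Task(M=100, O=40, m=30, PT=0.5)
b = Task(M=80, O=20, m=25, PT=0.6)
c = Task(M=120, O=60, m=50, PT=0.4, parents=[a, b])
a.o = 20                                  # E_o(a) = 0.5, E_o(b) = 0
print(c.input_error())                    # 0.5
print(c.total_mandatory())                # 120 + 50 * 0.5 = 145.0
print(qos([c]))                           # exit task keeps its full optional part
```

Only the exit task's precision enters the QoS of (8); the discarded optional work of parent `a` instead shows up as the extension of `c`'s mandatory workload.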
The first step of the proposed method tries to minimize the sum of the total workloads of non-exit tasks plus the total (extended) mandatory workloads of exit tasks. The intuition behind choosing such an objective function is that minimizing the total number of processor cycles associated with the aforementioned portions of tasks leaves more processor cycles available for executing the optional workloads of exit tasks under the fixed deadline and energy budget constraints. This can result in increased QoS according to (8). Therefore, we aim to minimize the following expression:

$$\Big[\sum_{u \in \text{non-exit tasks}} W_u\Big] + \Big[\sum_{v \in \text{exit tasks}} M'_v\Big]. \quad (10)$$

We first explain our approach for minimizing (10) for two simple task graphs that constitute base cases. Then, we explain our proposed algorithm for a general task graph.

Base Case 1:
Consider the task graph demonstrated in Fig. 1a. It consists of a parent task $p$ alongside $b$ child tasks. The workload defined in (10) for this simple task graph can be written as follows:

$$[M'_p + o_p] + \Big[\sum_{i=1}^{b} M_i + \sum_{i=1}^{b} m_i \times \Big(1 - \frac{o_p}{O_p}\Big)\Big], \quad (11)$$

in which the subscripts $p$ and $i$ refer to workload components of the parent task and child tasks in Fig. 1a, respectively. (11) can be rewritten as:

$$\Big[M'_p + \sum_{i=1}^{b} (M_i + m_i)\Big] + \Big[o_p \times \Big(1 - \frac{\sum_{i=1}^{b} m_i}{O_p}\Big)\Big]. \quad (12)$$

In (12), the first term of the summation does not depend on how many processor cycles are assigned to $o_p$ (note that the actual workload $M'_p$ depends on the input error of the parent task and not on $o_p$). However, the second term is a function of $o_p$, and minimizing this term leads to the minimization of (12). Two possible scenarios are postulated in this case:

(1) if $\sum_{i=1}^{b} m_i \le O_p$, $o_p$ should be minimized as much as possible, i.e., $o_p = 0$. This means the optional workload of the parent task must be discarded.
(2) if $\sum_{i=1}^{b} m_i > O_p$, $o_p$ should be maximized as much as possible, i.e., $o_p = O_p$. This means that the parent task should be executed precisely. A large number of child tasks and/or high values of their $m_i$ lead to a higher chance of this scenario occurring.

Base Case 2:
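The threshold rule of base case 1 can be stated as a one-line decision (a sketch; the numeric parameters are hypothetical):

```python
# Base case 1 (Eq. (12)): one parent p with optional budget O_p feeding b
# children with scaling factors m_i. Minimizing o_p * (1 - sum(m_i)/O_p)
# yields a threshold rule: discard o_p when sum(m_i) <= O_p, otherwise
# execute the parent precisely (o_p = O_p).

def base_case_1(O_p, child_m):
    """Return the number of optional cycles o_p to assign to the parent."""
    return 0 if sum(child_m) <= O_p else O_p

print(base_case_1(O_p=50, child_m=[10, 15]))   # 25 <= 50 -> discard: 0
print(base_case_1(O_p=50, child_m=[30, 40]))   # 70 > 50 -> precise: 50
```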
Consider the task graph demonstrated in Fig. 1b. It consists of a child task $c$ alongside $b$ parent tasks. The workload of (10) for this simple task graph can be written as follows:

$$\Big[\sum_{i=1}^{b} (M'_i + o_i)\Big] + \Big[M_c + m_c \times \min\Big(1, \sum_{i=1}^{b} \Big(1 - \frac{o_i}{O_i}\Big)\Big)\Big], \quad (13)$$

in which the subscripts $c$ and $i$ refer to workload components of the child task and parent tasks in Fig. 1b, respectively. (13) can be rewritten as:

$$\Big[\sum_{i=1}^{b} M'_i + M_c\Big] + \Big[\sum_{i=1}^{b} o_i + m_c \times \min\Big(1, \sum_{i=1}^{b} \Big(1 - \frac{o_i}{O_i}\Big)\Big)\Big]. \quad (14)$$

In (14), the first term of the summation does not depend on how many processor cycles are assigned to $o_1, o_2, ..., o_b$. However, the second term is a function of how many processor cycles are assigned to the optional workloads of the parent tasks, and therefore this term should be minimized in order to minimize (14). Two possible scenarios are postulated in this case:

Fig. 1. Task graphs of (a) base case 1 and (b) base case 2.
(1) If $\sum_{i=1}^{b} O_i \le m_c$, in order to minimize (14), all optional workloads of the $b$ parent tasks should be executed completely, i.e., $\sum_{i=1}^{b} o_i = \sum_{i=1}^{b} O_i$. The proof is beyond the scope of this paper.
(2) If $\sum_{i=1}^{b} O_i > m_c$, in order to minimize (14), all optional workloads of the $b$ parent tasks should be discarded, i.e., $\sum_{i=1}^{b} o_i = 0$. The proof is beyond the scope of this paper. A large number of parent tasks and/or high values of their $O_i$ lead to a higher chance of this scenario occurring.

General Task Graphs:
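Base case 2 admits the same kind of sketch (hypothetical parameters again):

```python
# Base case 2 (Eq. (14)): b parents with optional budgets O_i share one child
# with scaling factor m_c. The threshold compares sum(O_i) against m_c:
# run all parents precisely when sum(O_i) <= m_c, otherwise discard all.

def base_case_2(parent_O, m_c):
    """Return the list of optional cycles o_i assigned to the parents."""
    if sum(parent_O) <= m_c:
        return list(parent_O)          # execute all optional parts fully
    return [0] * len(parent_O)         # discard all optional parts

print(base_case_2(parent_O=[10, 15], m_c=40))  # [10, 15]
print(base_case_2(parent_O=[30, 40], m_c=40))  # [0, 0]
```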
While base cases 1 and 2 help determine the number of processor cycles assigned to the optional workloads of tasks in simple task graphs, similar conclusions cannot be drawn for complicated task graphs with interdependent tasks. For instance, consider an example where two parent tasks share a few child tasks and the goal is to either fully discard or fully execute the optional workload of each task within this task graph. Because a few child tasks are potentially shared between the two parent tasks, applying base case 1 or base case 2 without considering the interdependence of tasks may lead to conflicting decisions about the execution of optional workloads. As the number of such parent tasks increases, depending on the interdependencies among them and their shared child tasks, the number of possible permutations that should be explored in terms of fully executing or discarding the optional workloads of those parent tasks can grow exponentially. However, the presented base cases can guide us in developing a heuristic that determines the number of processor cycles assigned to the optional workloads of non-exit tasks.

Note that in the proposed heuristic, it is assumed that the input task graph has only one source task (i.e., a task with an in-degree of zero), but potentially many exit tasks. In task graphs where the number of source tasks is larger than one, a dummy task with zero workload is introduced and connected to all source tasks. The steps of the proposed heuristic are as follows:
Step 1 (Forward Pass):
This step traverses the tasks in the task graph $G$ starting from the source task and labels each task as precise (fully executing its optional workload) or imprecise (fully discarding its optional workload), based on the task's optional workload and the total maximum extension of its child tasks if the task is executed imprecisely. This step of the proposed heuristic is similar to base case 1. The difference, though, is that if a child task is encountered more than once due to being a shared child of multiple parent tasks, and its mandatory part is extended because one of its parents is labeled as imprecise, it is not considered when writing (12) for its other parent tasks. After exploring all paths in the task graph, tasks with multiple parents and extended workloads are marked. For these tasks, their parent tasks are evaluated again while their marked child tasks are removed from (12). This may lead to an update in the decision of whether a parent task should be executed precisely or imprecisely. The same process is repeated until no decisions are further updated. Note that each child task with multiple parents is visited only once during this update pass.

Step 2 (Backward Pass):
This step traverses the tasks in the task graph $G$ in reverse order, from the exit tasks back to the source task. For a task with multiple parents, those parents which are labeled as precise are added to a list and sorted in increasing order of their number of child tasks with intact (not extended) mandatory workloads. The resulting list is called sorted_precise_parents and includes $b$ tasks. Next, a subset of tasks in sorted_precise_parents is chosen such that transforming those tasks into imprecise tasks and extending the mandatory workloads of their child tasks leads to the highest reduction in (10). However, instead of exploring all $2^b$ possible subsets, we only explore $b$ subsets, which are: the subset containing the first task in the sorted list, the subset containing the first and second tasks in the sorted list, ..., and, for the $b$-th subset, the subset containing all tasks in the sorted list. The rationale behind this decision is that, according to base case 1, labeling a task with a smaller number of intact child tasks as imprecise is more likely to eventually increase QoS. Such tasks are explored more often in the proposed subsets due to the sorting strategy.

Step 2 (Backward Pass) is inspired by base case 2, where multiple parents with shared child tasks can be labeled as imprecise. In other words, the first step of the proposed heuristic looks at parent tasks independently, while the second step studies their combined effect on the overall QoS.

The presented heuristic determines which tasks in a given task graph should be executed imprecisely. Therefore, we refer to this heuristic as imp_label. The optional workload of each non-exit task $u$ marked as imprecise is $o^{imp\_label}_u = 0$, while for each non-exit task $u$ marked as precise, $o^{imp\_label}_u = O_u$. Furthermore, if a non-exit task $u$ has a parent which is labeled imprecise, $M'^{imp\_label}_u = M_u + m_u$; otherwise $M'^{imp\_label}_u = M_u$.
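The prefix-subset exploration of Step 2 can be sketched as follows. The `workload_reduction` callback, which would evaluate the change in (10) when a subset of parents is relabeled imprecise, is a hypothetical stand-in:

```python
# Sketch of Step 2's subset exploration: instead of all 2^b subsets of
# sorted_precise_parents, only the b prefixes of the sorted list are tried,
# and the prefix with the largest reduction in objective (10) is kept.
# `workload_reduction` is a hypothetical callback evaluating that reduction.

def best_prefix(sorted_precise_parents, workload_reduction):
    best_subset, best_gain = [], 0.0
    for k in range(1, len(sorted_precise_parents) + 1):
        subset = sorted_precise_parents[:k]      # first k tasks of the sorted list
        gain = workload_reduction(subset)        # drop in (10) if subset -> imprecise
        if gain > best_gain:
            best_subset, best_gain = subset, gain
    return best_subset

# Toy evaluation: gains for the prefixes of ["t1", "t2", "t3"] are 2, 5, 3,
# so the best prefix is the first two tasks.
gains = {1: 2.0, 2: 5.0, 3: 3.0}
print(best_prefix(["t1", "t2", "t3"], lambda s: gains[len(s)]))  # ['t1', 't2']
```

This keeps the exploration linear in $b$ while still favoring the parents that, per base case 1, are the most promising to relabel.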
Therefore, the total workload of each non-exit task $u$ determined by imp_label is represented by $W^{imp\_label}_u$ and obtained as follows:

$$W^{imp\_label}_u = M'^{imp\_label}_u + o^{imp\_label}_u. \quad (15)$$

Note that imp_label also determines whether the mandatory workload of an exit task $v$ is extended ($M'^{imp\_label}_v = M_v + m_v$) or not ($M'^{imp\_label}_v = M_v$).

In this section, we seek to schedule the task graph obtained from imp_label on an MPSoC platform to maximize QoS subject to energy and time constraints. For this purpose, we determine a proper processor assignment for each task, alongside the ordering of tasks on each processor, in order to minimize the finish time while operating at the maximum clock frequency (we temporarily ignore the energy budget constraint). This is achieved by deploying a minimal-delay list scheduling algorithm, which is a variant of Heterogeneous Earliest Finish Time (HEFT) [9]. HEFT assigns a rank to each task in the task graph based on the length of the critical path from that task to the exit tasks. While HEFT is designed for heterogeneous platforms, it can be applied to a homogeneous platform as well. For HEFT, we provide the workloads obtained from imp_label for non-exit tasks; for exit tasks, we provide their (extended) mandatory workloads obtained from imp_label plus their total optional workloads. Next, we pick tasks in decreasing order of their ranks and schedule each selected task on its "best" processor, which is the processor that minimizes the finish time of the task under the maximum available frequency.

Note that HEFT is only used to obtain a processor assignment for each task alongside the ordering of tasks on each processor. The start times obtained from HEFT merely indicate the relative ordering of tasks on each processor. Furthermore, we used the maximum frequency in HEFT and included the total optional workloads of all exit tasks since we were temporarily ignoring the energy budget constraint.
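A minimal sketch of the rank-based ordering described above, assuming a homogeneous platform and hypothetical workloads and communication costs (the full HEFT algorithm [9] additionally performs insertion-based processor selection, omitted here):

```python
# Sketch of the HEFT-style upward rank: rank(u) is the length of the longest
# path from u to an exit task (execution time at f_max plus communication),
# and tasks are then picked in decreasing rank. The DAG below is hypothetical.
from functools import lru_cache

def heft_order(workload, children, comm, f_max):
    @lru_cache(maxsize=None)
    def rank(u):  # upward rank: longest path from u to an exit task
        w = workload[u] / f_max
        kids = children.get(u, [])
        if not kids:
            return w
        return w + max(comm[(u, v)] + rank(v) for v in kids)
    return sorted(workload, key=rank, reverse=True)

workload = {"s": 8, "a": 4, "b": 6, "e": 2}      # cycles (toy values)
children = {"s": ["a", "b"], "a": ["e"], "b": ["e"]}
comm = {("s", "a"): 1, ("s", "b"): 1, ("a", "e"): 1, ("b", "e"): 1}
print(heft_order(workload, children, comm, f_max=2.0))  # ['s', 'b', 'a', 'e']
```

With `f_max = 2.0` the ranks are s = 10, b = 5, a = 4, e = 1, so the source is scheduled first and the longer branch through `b` takes priority over `a`.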
Therefore, in the next step, the actual number of processor cycles assigned to the optional workloads of exit tasks, the actual distribution of the workload of each task among the $m$ available processor frequencies, and the actual execution start time of each task should be obtained. For this purpose, we demonstrate that maximizing QoS for a task graph obtained from imp_label, subject to energy and time constraints and the processor assignment and task ordering obtained from HEFT, reduces to a linear programming (LP) formulation. In the following formulation, $u$ and $v$ refer to any of the tasks in the task graph. The duration of task $u$, $u = 1, 2, ..., n$, is formulated as follows:

$$D_u = \sum_{i=1}^{m} \frac{N_{u,i}}{f_i}, \quad N_{u,i} \ge 0, \quad (16)$$

where $N_{u,i}$ indicates the number of processor cycles of task $u$ processed at clock frequency $f_i$ ($i = 1, 2, ..., m$). If task $u$ is a non-exit task, the following constraint is introduced:

$$\sum_{i=1}^{m} N_{u,i} = W^{imp\_label}_u. \quad (17)$$

On the other hand, if task $u$ is an exit task, we have:

$$M'^{imp\_label}_u \le \sum_{i=1}^{m} N_{u,i} \le M'^{imp\_label}_u + O_u. \quad (18)$$

According to (6) and (16), the energy consumption during the execution of task $u$ can be formulated as follows:

$$\epsilon_{task}(u) = \sum_{i=1}^{m} N_{u,i} \cdot \Big(\alpha f_i^{\beta-1} + \gamma + \frac{\delta}{f_i}\Big). \quad (19)$$

To ensure the total energy consumption of tasks is less than or equal to the given energy bound, represented by $\epsilon_{max}$, we have:

$$\sum_{u=1}^{n} \epsilon_{task}(u) \le \epsilon_{max}. \quad (20)$$

To ensure time and precedence constraints, representing the start time of each task $u$ by $S_u$, we should have:

$$S_u + D_u \le T_d, \quad u = 1, 2, ..., n, \quad S_u \ge 0, \quad (21)$$

$$S_u + D_u + C_{u,v} \le S_v, \quad \forall e(u,v) \in E. \quad (22)$$

In (22), $C_{u,v}$ represents the average communication cost associated with $e(u,v)$ for sending the output of task $u$ to the input of task $v$. Finally, we need to ensure that tasks assigned to the same processor do not overlap:

$$S_u + D_u \le S_v, \quad \text{for tasks } u, v \text{ assigned to the same processor, where } v \text{ is the immediate task after } u \text{ based on HEFT.} \quad (23)$$

Maximizing the objective function of (8), with the constraints introduced in (16) to (23), forms an LP over the positive real variables $S_u$, $N_{u,i}$, and the optional workloads of exit tasks ($o_u$ for $u \in$ exit tasks).

In order to evaluate the performance of our two-step proposed method of Sections 3.1 and 3.2 against the optimal solution, we present a comprehensive mixed-integer linear programming (MILP) formulation of the problem statement of Section 2.3. By solving the MILP, we obtain the optimal values for the number of processor cycles assigned to the optional workload of each task, the processor assignment for each task alongside the ordering of tasks on each processor, the task execution start times, and the distribution of the total number of processor cycles associated with the execution of each task among the $m$ available frequencies. For this purpose, the following variables are defined. Denoting the number of processors by $K$, for the assignment of task $u$ to processor $k$, $k = 1, 2, ..., K$, we use the decision variable $\Pi_{k,u}$, defined as follows:

$$\Pi_{k,u} = \begin{cases} 1, & \text{if task } u \text{ is assigned to processor } k, \\ 0, & \text{otherwise.} \end{cases} \quad (24)$$

Consequently, we have the following constraint on $\Pi_{k,u}$:

$$\sum_{k=1}^{K} \Pi_{k,u} = 1, \quad \text{for } u = 1, 2, ..., n. \quad (25)$$

In order to prevent the overlap of the execution of tasks assigned to the same processor, we use the decision variable $Y_{k,u,v}$, indicating the ordering of the tasks. For $k = 1, 2, ..., K$; $u = 1, 2, ..., n$; $v = 1, 2, ..., n$, $v \ne u$; we define:

$$Y_{k,u,v} = \begin{cases} 1, & \text{if task } v \text{ is executed immediately after task } u \text{ on processor } k, \\ 0, & \text{otherwise.} \end{cases} \quad (26)$$

In addition, if task $v$ is the first task assigned to processor $k$, $Y_{k,0,v}$ is defined to be 1 (and is 0 otherwise).
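The linear duration and energy terms of (16) and (19) can be checked numerically (the model constants and the cycle split below are hypothetical):

```python
# Numeric sketch of Eqs. (16) and (19): given a split N[u][i] of a task's
# cycles across frequencies f_i, duration and energy follow directly from
# the power model of Eq. (6). All constants below are hypothetical.

def duration(N_u, freqs):                            # Eq. (16): sum_i N_i / f_i
    return sum(n / f for n, f in zip(N_u, freqs))

def energy(N_u, freqs, alpha, beta, gamma, delta):   # Eqs. (7) and (19)
    return sum(n * (alpha * f**(beta - 1) + gamma + delta / f)
               for n, f in zip(N_u, freqs))

freqs = [1.0, 2.0]           # two available clock frequencies (toy units)
N_u = [300.0, 700.0]         # cycles executed at each frequency
print(duration(N_u, freqs))  # 300/1 + 700/2 = 650.0
print(energy(N_u, freqs, alpha=1.0, beta=3.0, gamma=0.1, delta=0.2))
```

Because both quantities are linear in the $N_{u,i}$, they can serve directly as LP terms; shifting cycles from $f_2$ to $f_1$ lowers energy at the cost of a longer duration, which is exactly the trade-off the LP resolves.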
On the other hand, if task $u$ is the last task assigned to processor $k$, $Y_{k,u,n+1}$ is defined to be 1 (and is 0 otherwise). Furthermore, if there is no task assigned to processor $k$, $Y_{k,0,n+1}$ is defined to be 1 (and is 0 otherwise). Accordingly, using (26) and the definitions provided for $Y_{k,0,v}$, $Y_{k,u,n+1}$, and $Y_{k,0,n+1}$, we have the following constraints for $k = 1, 2, ..., K$:

$$\sum_{\substack{v=1 \\ v \ne u}}^{n+1} Y_{k,u,v} = \Pi_{k,u}, \quad \text{for } u = 1, 2, ..., n, \quad (27)$$

$$\sum_{\substack{u=0 \\ u \ne v}}^{n} Y_{k,u,v} = \Pi_{k,v}, \quad \text{for } v = 1, 2, ..., n+1. \quad (28)$$

According to (27), if task $u$ is assigned to processor $k$ ($\Pi_{k,u} = 1$), either exactly one task is executed immediately after task $u$ on processor $k$ or task $u$ is the last task assigned to processor $k$. Similarly, according to (28), if task $v$ is assigned to processor $k$ ($\Pi_{k,v} = 1$), either exactly one task is executed immediately before task $v$ on processor $k$ or task $v$ is the first task assigned to processor $k$. In both (27) and (28), $\Pi_{k,0}$ and $\Pi_{k,n+1}$ are defined as 1 for all $k = 1, 2, ..., K$. Using $Y_{k,u,v}$, we rewrite the constraint in (23) as follows:

$$S_u + D_u - (1 - Y_{k,u,v}) \times T_d \le S_v, \quad \text{for } u = 1, ..., n; \ v = 1, ..., n, \ v \ne u; \ k = 1, ..., K. \quad (29)$$

Finally, instead of using the imp_label algorithm to determine the workloads of non-exit and exit tasks as in (17) and (18), the following constraint is used for all tasks:

$$M_u + m_u \times E^i_u \le \sum_{i=1}^{m} N_{u,i} \le M_u + m_u \times E^i_u + O_u, \quad (30)$$

where $E^i_u$ is obtained by (3). In order to express the minimum operation in (3) as a linear constraint, we rewrite (3) using an auxiliary decision variable, represented by $X_u$, as follows:

$$E^i_u = X_u \cdot 1 + (1 - X_u) \cdot \Big(\sum_{j \in par(u)} E^o_j\Big), \quad (31)$$

in which $X_u$ is a decision variable that is 1 when $\sum_{j \in par(u)} E^o_j > 1$ and 0 otherwise. The constraint on $X_u$ can be written as follows:

$$\frac{\sum_{j \in par(u)} E^o_j - 1}{n} \le X_u \le \sum_{j \in par(u)} E^o_j, \quad X_u \in \{0, 1\}, \quad (32)$$

in which $n$ serves as an upper bound for $\sum_{j \in par(u)} E^o_j$.
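The behavior of the auxiliary variable $X_u$ in (31)-(32) can be verified on a few examples (a numeric sanity check, not part of the MILP itself):

```python
# Numeric check of the linearization in Eqs. (31)-(32): with the binary X_u
# chosen as (32) dictates, Eq. (31) reproduces E_i = min(1, sum of parent
# output errors). Here n is any upper bound on that sum.

def linearized_input_error(parent_errors, n):
    s = sum(parent_errors)
    X = 1 if s > 1 else 0           # the value forced by (s - 1)/n <= X <= s
    return X * 1.0 + (1 - X) * s    # Eq. (31)

for errs in ([0.2, 0.3], [0.7, 0.8], [1.0, 0.5, 0.9]):
    assert linearized_input_error(errs, n=10) == min(1.0, sum(errs))
print("linearization matches min(1, .)")
```

When the sum exceeds 1, the lower bound in (32) becomes positive and forces $X_u = 1$, so (31) returns 1; otherwise the upper bound keeps $X_u = 0$ and (31) returns the sum itself.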
Furthermore, we use the lemma presented in [10] for the linearization of the product of a Boolean decision variable and a bounded real-valued variable, which appears in the second term of (31). Consequently, maximizing the objective function of (8) with the constraints introduced in (16), (19) to (22), (25), (27) to (32), and the lemma of [10] for linearizing the second term of (31), forms an MILP yielding the optimal values for the desired variables mentioned at the beginning of this section.

The time complexity of the proposed labeling heuristic described in Section 3.1 is
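The lemma of [10] is not reproduced here, but the standard linearization of such a product, which this sketch assumes it to be consistent with, uses four inequalities; the check below confirms they pin down $z = X \cdot y$ exactly:

```python
# The product z = X * y of a binary X and a bounded real 0 <= y <= U, as in
# the second term of Eq. (31), is commonly linearized with four constraints
# (this is the standard construction; the lemma in [10] may differ in form):
#   z <= U * X,   z <= y,   z >= y - U * (1 - X),   z >= 0

def feasible_z(X, y, U):
    """Return the unique z satisfying all four linear constraints."""
    lo = max(0.0, y - U * (1 - X))
    hi = min(U * X, y)
    assert lo == hi, "the constraints pin z to a single value"
    return lo

U = 5.0
assert feasible_z(1, 3.0, U) == 3.0   # X = 1  ->  z = y
assert feasible_z(0, 3.0, U) == 0.0   # X = 0  ->  z = 0
print("z = X * y recovered exactly")
```

Since all four constraints are linear in $X$, $y$, and $z$, they slot directly into the MILP without introducing any nonlinearity.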
O(|E| + |V|), where |E| denotes the number of edges in the task graph and |V| represents the number of vertices. Furthermore, the time complexity of HEFT, which is used for obtaining the processor assignment of tasks in the labeled graph and their ordering on each processor of an MPSoC platform, is O(K × |E|), where K denotes the number of processors.

For solving the formulated MILP of Section 3.3 and the LP part of the proposed method of Section 3.2, we use IBM ILOG CPLEX Optimization Studio [11]. The platform on which simulations are performed is a computer with a 3.2 GHz Intel Core i7-8700 processor and 16 GB RAM. For obtaining the energy model parameters, we employ [12], which uses a classical energy model of a 70nm-technology processor that supports 5 discrete frequencies. The frequency-independent component of processor power consumption, represented by $\delta$ in (6), is obtained as 276 mW. Each processor can operate independently of the other processors at any of the five discrete frequencies $f_1 < f_2 < f_3 < f_4 < f_5$. For these frequencies, the frequency-dependent component of processor power consumption, represented by $\alpha f^\beta + \gamma f$ in (6), is approximately 430 mW, 556 mW, 710 mW, 896 mW, and 1118 mW, respectively. Using curve fitting, we obtain the corresponding values of $\alpha$, $\gamma$, and $\beta$.

For each task $u$, the amount of workload required to produce precise results when the input is error-free is referred to as the initial workload of the task and is represented by $W^{initial}_u$. Therefore, $W^{initial}_u = M_u + O_u$. For the studied task graphs, the average value of $W^{initial}_u$ of each task $u$ is set to the same fixed number of cycles. For each task $u$, based on what portion of $W^{initial}_u$ constitutes its base mandatory workload ($M_u$), we consider 3 cases:

(1) man_low: a low portion of $W^{initial}_u$ is the base mandatory workload, with $M_u$ drawn uniformly from a correspondingly low range of $W^{initial}_u$.
(2) man_med: a medium portion of $W^{initial}_u$ is the base mandatory workload.
(3) man_high: a high portion of $W^{initial}_u$ is the base mandatory workload.

In each of the 3 cases, similar to [2], $m_u$ is drawn uniformly at random in proportion to $M_u$ for each task $u$. For a fair comparison among these 3 cases, each task graph uses the same random seed for all of the above uniform distributions, and this random seed differs across task graphs. In all 3 cases, $P^T_u$ for all tasks is chosen uniformly from [0, 1]. The average communication costs associated with the edges of the task graphs are chosen uniformly from 0.4 ms to 0.6 ms. $T_d$ of each task graph is set to twice the length of the longest path from its source task to an exit task (including communication costs) when executing the total workload along the path, including all optional workloads, at the maximum frequency.

In this section, for each of the studied task graphs, we evaluate the effect of the $\epsilon_{max}$ value on the obtained QoS, defined in (8), using the proposed method of Sections 3.1 and 3.2. In order to obtain a proper value for $\epsilon_{max}$, we first derive the minimum energy required for scheduling the task graph in one $T_d$ without the possibility of imprecise computations. We refer to this energy value as $\epsilon^*$. For obtaining $\epsilon^*$, HEFT is again used to obtain the processor assignment for each task and the ordering of tasks on each processor. Then, we solve the LP which minimizes the objective function $\sum_{u=1}^{n} \epsilon_{task}(u)$, with the constraints described in (16), (19), (21) to (23), and the constraint imposing that the workload of each task $u$, whether a non-exit or an exit task, is executed precisely: $\sum_{i=1}^{m} N_{u,i} = W^{initial}_u$.
By solving this LP, ϵ* of each task graph is obtained. Alternatively, one can use an MILP formulation to obtain ϵ*. For the case of imprecise computations, for each task graph, if its ϵ* is used as the value of ϵ_max, the QoS attains its maximum value (QoS = 100%, if the QoS in (8) is expressed as a percentage). Therefore, for each task graph, we reduce ϵ_max gradually, starting from its ϵ* and decreasing it in fixed steps (a fixed fraction of ϵ*), and observe the QoS obtained using our proposed method for each value of ϵ_max, as presented in Fig. 2. In Fig. 2, for each task graph, the existence of a QoS value for a given ϵ_max shows that our proposed method can generate a feasible schedule for that task graph and ϵ_max which produces that value of QoS. A feasible schedule means that at least the (extended) mandatory workloads of all tasks are completed before T_d and the total energy consumption is below ϵ_max.

According to Fig. 2, as ϵ_max is reduced, the sharpest drop in the QoS obtained by our proposed method is observed in the man_high case, while the slowest drop is observed in the man_low case. This reflects the fact that when a lower portion of the initial task workloads is mandatory, feasible results can be achieved with lower values of ϵ_max than when a higher portion of the initial task workloads is mandatory. For instance, in the man_low case, our proposed method can generate a feasible schedule for TGFF8 even when using 45% of its ϵ* as the value of ϵ_max, while in the man_high case, it can only generate a feasible schedule for TGFF8 when ϵ_max is reduced to at most 85% of its ϵ*.

Fig. 2. QoS versus ϵ_max obtained via the proposed method for different cases of the mandatory workload portion: (a) man_low, (b) man_med, (c) man_high.

In this section, we compare the performance of the proposed method in Sections 3.1 and 3.2 with the MILP formulation presented in Section 3.3, in terms of the QoS obtained for different values of ϵ_max. We consider a case in which M_u of the tasks in a task graph is chosen uniformly from 20% to 80% of W_u^initial (a mix of the 3 cases described in Section 4.1); we refer to this as the man_mixed case. For each task graph and each ϵ_max value, we impose a time limit of 60 minutes for the MILP to find the optimal scheduling solution. For evaluating the performance of our proposed method, we only consider those task graphs for which the MILP found the optimal solution for every value of ϵ_max within the time limit (this comparison is actually in favor of the MILP; we elaborate on this later). Using this setup, the MILP could find solutions for 7 of the 10 studied task graphs within the time limit (TGFF0 to TGFF3, TGFF5, TGFF6, and TGFF9). For these task graphs, the QoS values obtained for different ϵ_max values by the proposed method and the MILP are shown in Fig. 3. According to Fig. 3, the QoS values found by our proposed method are exactly equal to those found by the MILP for 4 task graphs (TGFF0, TGFF3, TGFF6, and TGFF9). Furthermore, for the other task graphs, the average QoS difference between the proposed method and the MILP over different ϵ_max values is 1.63% (up to 6.64%). Consequently, the proposed method yields QoS values close to those of the optimal MILP formulation. Comparing the runtimes of the proposed method and the MILP, we see a clear advantage for the proposed method.
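The shape of these QoS-versus-budget curves can be reproduced qualitatively with a small sketch. The snippet below is illustrative only: the per-task energy costs and QoS weights are invented, and it ignores deadlines, precedence constraints, and DVFS. With the paper's assumption that QoS is a linear function of the executed optional cycles, spending a leftover energy budget greedily by QoS gained per joule is the optimal fractional allocation.

```python
# Illustrative fractional-knapsack sketch of trading QoS against an energy
# budget eps_max. All numbers are invented; the paper's method instead
# solves an LP over a labeled task graph on an MPSoC platform.

def allocate_optional(tasks, eps_max):
    """Reserve energy for the mandatory workloads, then spend what is left
    on optional cycles in decreasing order of QoS gained per joule.
    Each task is a dict with keys: e_mand (J), o_cycles,
    e_per_cycle (J/cycle), qos_per_cycle. Returns the achieved QoS as a
    fraction of the maximum, or None if the schedule is infeasible
    (the mandatory workloads alone exceed eps_max)."""
    budget = eps_max - sum(t["e_mand"] for t in tasks)
    if budget < 0:
        return None
    max_qos = sum(t["o_cycles"] * t["qos_per_cycle"] for t in tasks)
    gained = 0.0
    for t in sorted(tasks, key=lambda t: t["qos_per_cycle"] / t["e_per_cycle"],
                    reverse=True):
        cycles = min(t["o_cycles"], budget / t["e_per_cycle"])
        gained += cycles * t["qos_per_cycle"]
        budget -= cycles * t["e_per_cycle"]
    return gained / max_qos

tasks = [
    {"e_mand": 1.0, "o_cycles": 100, "e_per_cycle": 0.02, "qos_per_cycle": 1.0},
    {"e_mand": 1.5, "o_cycles": 200, "e_per_cycle": 0.01, "qos_per_cycle": 0.4},
]
# Energy of a fully precise schedule, then a downward sweep as in Fig. 2.
eps_precise = sum(t["e_mand"] + t["o_cycles"] * t["e_per_cycle"] for t in tasks)
for frac in (1.0, 0.8, 0.6, 0.4):
    print(frac, allocate_optional(tasks, frac * eps_precise))
```

Sweeping the budget downward yields full QoS at the precise-schedule energy, a graceful decline as the budget shrinks, and infeasibility (None) once even the mandatory workloads no longer fit, mirroring the qualitative behavior reported in Fig. 2.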
On the platform on which we performed the simulations, the average runtime of the proposed method for each task graph and ϵ_max value was 99.38% lower than that of the MILP. This is without considering the cases in which the MILP did not find the optimal solution within the time limit. For many real-world applications, since the task graphs can have a larger number of nodes and more complex interdependencies than the studied task graphs, the runtime of the MILP can grow exponentially. Therefore, employing the proposed method, which provides results close to those of the MILP, can be an efficient alternative.

Fig. 3. QoS versus ϵ_max obtained via (a) the proposed method and (b) the MILP for the man_mixed case. As presented, the QoS values obtained using the proposed method are close (and in some cases exactly equal) to the optimal reference QoS values found by the MILP.

4.4 Evaluating the Effect of the imp_label Algorithm

In order to evaluate the effect of the imp_label algorithm presented in Section 3.1, which traverses the graph and labels some tasks as ones that should be executed imprecisely before feeding the graph to the scheduling method presented in Section 3.2, we compare the results obtained from our proposed method with a baseline approach in which we feed the task graph, with the initial workloads (W^initial) of its non-exit tasks, to the scheduling method presented in Section 3.2 and assign as many processor cycles as possible to the exit tasks in order to maximize the QoS. Therefore, in the baseline approach, we solve the same LP as the one formulated in Section 3.2; however, the constraint in (17) for a non-exit task u is transformed into the following constraint:

∑_{i=1}^{m} N_{u,i} = W_u^initial, (33)

and the constraint in (18) for an exit task u is transformed into the following constraint:

M_u ≤ ∑_{i=1}^{m} N_{u,i} ≤ M_u + O_u.
(34)

Fig. 4 presents the QoS values obtained via the proposed method and the baseline approach for the studied task graphs for different values of ϵ_max. The base mandatory portion of the initial workload of the tasks is set according to the man_mixed case, as in Section 4.3. According to Fig. 4, using the baseline approach, the QoS for all task graphs immediately drops from 100% as soon as ϵ_max is reduced below ϵ*. In the corresponding man_mixed case of our proposed method, however, as shown in Fig. 4, the QoS can be maintained at 100% even for values of ϵ_max lower than ϵ* for the studied task graphs. For instance, as presented in Fig. 4, our proposed method can achieve a QoS of 100% for TGFF8 with 85% of its ϵ*. Furthermore, for each task graph, the minimum ϵ_max with which our proposed method can generate a feasible schedule is lower than that of the baseline approach. For those ϵ_max values for which both the proposed method and the baseline approach can provide a feasible schedule, the QoS values obtained via our proposed method are on average 12.82% (up to 43.40%) higher than those obtained via the baseline approach.

Fig. 4. QoS versus ϵ_max obtained via (a) the proposed method and (b) the baseline approach (without the imp_label algorithm) for the man_mixed case. As presented, the QoS values obtained using the proposed method are considerably higher than those of the baseline approach.

CONCLUSION
In this paper, we presented a method for time- and energy-constrained scheduling of task graphs on MPSoC platforms, with the possibility of imprecise computation of each task in the task graph. We took into account the extension of the workload of each task when the input to that task is not precise. For this purpose, we presented an algorithm that, by traversing the task graph, determines whether the optional workload of each non-exit task should be executed or discarded, and then schedules the labeled graph on an MPSoC platform. For evaluating the efficacy of the proposed method, we also presented an MILP formulation of the problem, which provided us with the optimal reference scheduling solutions. Our results show the effectiveness of the proposed method in terms of obtaining promising QoS values even with low energy budgets.
REFERENCES

[1] H. Yu, B. Veeravalli, and Y. Ha. Dynamic scheduling of imprecise-computation tasks in maximizing QoS under energy constraints for embedded systems. In Proceedings of the 2008 Asia and South Pacific Design Automation Conference, pages 452–455. IEEE Computer Society Press, 2008.
[2] G. L. Stavrinides and H. D. Karatza. Scheduling multiple task graphs with end-to-end deadlines in distributed real-time systems utilizing imprecise computations. Journal of Systems and Software, 83(6):1004–1014, 2010.
[3] L. A. Cortés, P. Eles, and Z. Peng. Quasi-static assignment of voltages and optional cycles in imprecise-computation systems with energy considerations. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 14(10):1117–1129, 2006.
[4] J. Zhou, J. Yan, T. Wei, M. Chen, and X. S. Hu. Energy-adaptive scheduling of imprecise computation tasks for QoS optimization in real-time MPSoC systems. In Proceedings of the Conference on Design, Automation & Test in Europe, pages 1406–1411. European Design and Automation Association, 2017.
[5] H. Yu, B. Veeravalli, Y. Ha, and S. Luo. Dynamic scheduling of imprecise-computation tasks on real-time embedded multiprocessors. Pages 770–777. IEEE, 2013.
[6] W.-C. Feng and J.-S. Liu. Algorithms for scheduling real-time tasks with input error and end-to-end deadlines. IEEE Transactions on Software Engineering, 23(2):93–106, 1997.
[7] D. Hull, A. Shankar, K. Nahrstedt, and J. W. Liu. An end-to-end QoS model and management architecture. In Proceedings of the IEEE Workshop on Middleware for Distributed Real-time Systems and Services. Citeseer, 1997.
[8] M. E. Gerards, J. L. Hurink, and J. Kuper. On the interplay between global DVFS and scheduling tasks with precedence constraints. IEEE Transactions on Computers, 64(6):1742–1754, 2015.
[9] H. Topcuoglu, S. Hariri, and M.-Y. Wu. Performance-effective and low-complexity task scheduling for heterogeneous computing. IEEE Transactions on Parallel and Distributed Systems, 13(3):260–274, 2002.
[10] A. Esmaili, M. Nazemi, and M. Pedram. Modeling processor idle times in MPSoC platforms to enable integrated DPM, DVFS, and task scheduling subject to a hard deadline. In Proceedings of the 24th Asia and South Pacific Design Automation Conference.
[11] IBM ILOG CPLEX Optimization Studio.
[12] ACM Transactions on Embedded Computing Systems (TECS), 13(3s):111, 2014.
[13] R. P. Dick, D. L. Rhodes, and W. Wolf. TGFF: task graphs for free. In Proceedings of the 6th International Workshop on Hardware/Software Codesign, pages 97–101. IEEE Computer Society, 1998.