A C-DAG task model for scheduling complex real-time tasks on heterogeneous platforms: preemption matters
Houssam-Eddine Zahaf, Nicola Capodieci, Roberto Cavicchioli, Marko Bertogna, Giuseppe Lipari
Abstract — Recent commercial hardware platforms for embedded real-time systems feature heterogeneous processing units and computing accelerators on the same System-on-Chip. When designing complex real-time applications for such architectures, the designer needs to make a number of difficult choices: on which processor should a certain task be implemented? Should a component be implemented in parallel or sequentially? These choices may have a great impact on feasibility, as the differences in the processors' internal architectures impact the tasks' execution times and preemption costs. To help the designer explore the wide space of design choices and tune the scheduling parameters, in this paper we propose a novel real-time application model, called C-DAG, specifically conceived for heterogeneous platforms. A C-DAG allows the designer to specify alternative implementations of the same component of an application for different processing engines, to be selected off-line, as well as conditional branches to model if-then-else statements, to be selected at run-time. We also propose a schedulability analysis for the C-DAG model and a heuristic allocation algorithm so that all deadlines are respected. Our analysis takes into account the cost of preempting a task, which can be non-negligible on certain processors. We demonstrate the effectiveness of our approach on a large set of synthetic experiments by comparing against state-of-the-art algorithms in the literature.
Index Terms — Real-Time, Conditional, DAG, Parallel Programming, Heterogeneous ISA
I. INTRODUCTION
Modern cyber-physical embedded systems are increasingly complex and demand powerful computational hardware platforms. A recent trend in hardware architecture design is to combine high-performance multi-core CPU hosts with a number of application-specific accelerators (e.g., Graphic Processing Units, GPUs; Deep Learning Accelerators, DLAs; or FPGAs for programmable hardware) in order to support complex real-time applications with machine learning and image processing software modules.

Such application-specific processors are characterized by different levels of programmability and a different Instruction Set Architecture (ISA) compared to more traditional SoCs. The NVIDIA Volta GPU architecture, for instance, couples a fairly traditional GPU architecture (hundreds of small SIMD processing units called CUDA cores, grouped in computing clusters called Streaming Multiprocessors) with hardware pipelines specifically designed for tensor processing (Tensor Cores), hence designed for the matrix multiply-and-accumulate operations that are typical of neural network arithmetic (see the NVIDIA GV100 white paper, http://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf). The integrated version of the NVIDIA Volta architecture is embedded within the NVIDIA Xavier SoC, which can now be found in the NVIDIA Jetson AGX board and in the NVIDIA Pegasus board. In such embedded platforms, tensor processing can also be performed by specifically designed compute engines such as the DLA (Deep Learning Accelerator; hardware specifications are available at http://nvdla.org/); moreover, another application-specific engine is the PVA (Programmable Vision Accelerator), which is specifically designed for signal processing algorithms such as stereo disparity and optical flow. On such platforms, the main and novel challenge in analyzing the timing behavior of a real-time application is represented by the drastic differences in the ISAs, preemption capabilities, memory hierarchies and interconnections of these collections of computing engines.

When programming these platforms, the software designer is confronted with several design choices: on which processing engine should a task be implemented? Should a certain sub-system be implemented in parallel or sequentially? These choices may impact the timing behavior of the application and the resource utilization. The analysis is complicated by the fact that, on certain processors, the overhead induced by preempting a lower priority task can be large: for example, the overhead of preempting a graphical task executing on certain GPU architectures is in the same order of magnitude as the worst-case execution time of the task. As we will see in Section VI-C, such overhead depends on the computing engine and on the type of task.

a) Contributions: To help the designer explore the design space, in Section II we present a novel real-time task model called C-DAG (Conditional-Directed Acyclic Graph). Thanks to its graph structure, the C-DAG model allows the designer to specify the parallelism of real-time sub-tasks. The designer can use special alternative nodes in the graph to model alternative implementations of the same functionality on different computing engines, to be selected off-line, and conditional nodes to model if-then-else branches, to be selected at run-time. Alternative nodes are used to leverage the diversity of computing accelerators within the target platform.

Then, in Section III we present a schedulability analysis that is used in Section IV by a set of allocation heuristics to map tasks onto computing platforms and to assign scheduling parameters. In particular, we present a novel technique to reduce the pessimism due to high preemption costs in the analysis (Section III-F). After discussing related work in Section V, our methodology is evaluated in Section VI by comparing it with state-of-the-art algorithms through a set of synthetic experiments.

II. SYSTEM MODEL
A. Architecture model
A heterogeneous architecture is modeled as a set of execution engines $Arch = \{e_1, e_2, \ldots, e_m\}$. An execution engine is characterized by 1) its execution capabilities (i.e., its Instruction Set Architecture), specified by the engine's tag, and 2) its scheduling policy. An engine's tag $tag(e_i)$ indicates the ability of a processor to execute dedicated tasks. As an example, a Xavier-based platform such as the NVIDIA Pegasus can be modeled using engines of five different tags: CPUs, dGPUs, iGPUs, DLAs and PVAs.

Tags express the heterogeneity of modern processor architectures: an engine tagged by dGPU (discrete GPU) or iGPU (integrated GPU) is designed to efficiently run generic GPU kernels, whereas engines with DLA tags are designed to run deep learning inference tasks. Trivially, a deep learning task can be compiled to run on any engine, including CPUs and GPUs; however, its worst-case execution time will be lower when running on DLAs. In this paper, we allow the designer to compile the same task for different alternative engines, with different trade-offs in terms of performance and resource utilization, so as to widen the space of possible solutions. As we will see in the next section, the C-DAG model supports alternative implementations of the same code. During the off-line analysis phase, only one of these alternative versions is chosen, depending on the overall schedulability of the system.

Engines are further characterized by a scheduling policy (e.g., Fixed Priority or Earliest Deadline First), which can be preemptive or non-preemptive. Our model allows different engines to support different scheduling policies: as we show in Section III, in our methodology the schedulability analysis of each engine can be performed independently of the others. However, to simplify the presentation, in this paper we focus only on preemptive EDF for all the considered engines.

B. The C-DAG task model

1) Specification tasks: A specification task is a Directed Acyclic Graph (DAG), characterized by a tuple $\tau = \{T, D, \mathcal{V}, \mathcal{A}, \Gamma, \mathcal{E}\}$, where: $T$ is the period (minimum inter-arrival time); $D$ is the relative deadline; $\mathcal{V}$ is a set of graph nodes that represent sub-tasks; $\mathcal{A}$ is a set of alternative nodes; and $\Gamma$ is a set of conditional nodes. The set of all the nodes is denoted by $\mathcal{N} = \mathcal{V} \cup \mathcal{A} \cup \Gamma$. The set $\mathcal{E} \subseteq \mathcal{N} \times \mathcal{N}$ is the set of edges of the graph.

A sub-task $v \in \mathcal{V}$ is the basic computation unit. It represents a block of code to be executed by one of the engines of the architecture. A sub-task is characterized by:
• a tag $tag(v)$, representing the ISA of the sub-task code; a sub-task can only be allocated onto an engine with the same tag;
• a worst-case execution time $C(v)$ when executing the sub-task on the corresponding engine.

A conditional node $\gamma \in \Gamma$ represents alternative paths in the graph due to non-deterministic on-line conditions (e.g., if-then-else conditions). At run-time, only one of the outgoing edges of $\gamma$ is executed, but it is not possible to know in advance which one.

An alternative node $a \in \mathcal{A}$ represents alternative implementations of parts of the graph/task, as introduced in the previous section. During the configuration phase (detailed in Section IV-A), our methodology selects one among many possible alternative implementations of the program, by selecting only one of the outgoing edges of $a$ and removing (part of) the paths starting from the other edges. This is useful when modeling sub-tasks that can be executed on different engines with different execution costs. In our model, the choice of where a sub-task should be executed is performed off-line by our proposed scheduling analysis and allocation strategy.

An edge $e(n_i, n_j) \in \mathcal{E}$ models a precedence constraint (and the related communication) between node $n_i$ and node $n_j$, where $n_i$ and $n_j$ can be sub-tasks, alternative nodes or conditional nodes.

The set of immediate predecessors of a node $n_j$, denoted by $pred(n_j)$, is the set of all nodes $n_i$ such that there exists an edge $(n_i, n_j)$. The set of predecessors of a node $n_j$ is the set of all nodes for which there exists a path toward $n_j$. If a node has no predecessor, it is a source node of the graph. In our model we allow a graph to have several source nodes.
In the same way, we can define the set of immediate successors of a node $n_j$, denoted by $succ(n_j)$, as the set of all nodes $n_k$ such that there exists an edge $(n_j, n_k)$, and the set of successors of $n_j$ as the set of nodes for which there is a path from $n_j$. If a node has no successors, it is a sink node of the graph, and we allow a graph to have several sink nodes.

Conditional nodes and alternative nodes always have at least 2 outgoing edges, so they cannot be sinks. To simplify the reasoning, we also assume that they always have at least one predecessor node, so they cannot be sources.
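To fix the notation in executable form, the following is a minimal Python sketch of the C-DAG structures; all names (NodeKind, Node, CDag) are ours and are only meant to mirror the tuple $\{T, D, \mathcal{V}, \mathcal{A}, \Gamma, \mathcal{E}\}$ defined above:

    from dataclasses import dataclass
    from enum import Enum
    from typing import Dict, List, Tuple

    class NodeKind(Enum):
        SUBTASK = "subtask"          # basic computation unit v in V
        ALTERNATIVE = "alternative"  # off-line choice, node in A
        CONDITIONAL = "conditional"  # run-time if-then-else, node in Gamma

    @dataclass
    class Node:
        name: str
        kind: NodeKind
        tag: str = ""      # ISA tag, e.g. "CPU" or "dGPU"; meaningful for sub-tasks
        wcet: float = 0.0  # C(v); meaningful for sub-tasks only

    @dataclass
    class CDag:
        period: float                 # T, minimum inter-arrival time
        deadline: float               # D, relative deadline (D <= T)
        nodes: Dict[str, Node]
        edges: List[Tuple[str, str]]  # precedence constraints E

        def pred(self, n: str) -> List[str]:
            """Immediate predecessors of node n."""
            return [a for (a, b) in self.edges if b == n]

        def succ(self, n: str) -> List[str]:
            """Immediate successors of node n."""
            return [b for (a, b) in self.edges if a == n]

        def sources(self) -> List[str]:
            """Nodes without predecessors; a graph may have several."""
            return [n for n in self.nodes if not self.pred(n)]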
2) Concrete tasks:
A concrete task $\tau = \{T, D, \mathcal{V}, \Gamma, \mathcal{E}\}$ is an instance of a specification task where all alternatives have been removed by making implementation choices during the analysis. Before explaining how to obtain a concrete task from a specification task, we present an example.

Example 1.
Consider the task specification described in Figure 1a. Each sub-task node is labeled with the sub-task id and the engine tag. Alternative nodes are denoted by square boxes and conditional nodes by diamond boxes. The black boxes denote the corresponding junction nodes for alternative and conditional nodes; they are used to improve the readability of the figure, but they are not part of the task specification (in fact, it is not always possible to insert junction nodes for an arbitrary specification).

Fig. 1: A task specification (a) and one of its concrete tasks (b).
The two CPU-tagged sub-tasks are the sources (entry points) of the DAG; they can run concurrently and, during the off-line analysis, they may be allocated onto the same engine or onto different engines. The DLA sub-task has an outgoing edge to a dGPU sub-task; thus, the dGPU sub-task cannot start its execution before the DLA sub-task has finished its execution. Each of the two source sub-tasks has one outgoing edge to the alternative node A. Thus, τ can execute either: by following the dGPU sub-task, then the DLA and dGPU sub-tasks, finishing its instance on the final CPU sub-task; or by following the conditional node F and selecting, according to a non-deterministic condition evaluated on-line, either the DLA or the dGPU implementation, again finishing its instance on the final CPU sub-task. The two patterns are alternative ways to execute the same functionality at different costs. Figure 1b represents one of the concrete tasks of τ: during the analysis, the alternative execution branch through the dGPU, DLA and dGPU sub-tasks has been dropped.

We consider a sporadic task model, therefore parameter T represents the minimum inter-arrival time between two instances of the same concrete task. When an instance of a task is activated at time t, all source sub-tasks are simultaneously activated. All subsequent sub-tasks are activated upon completion of their predecessors, and all sink sub-tasks must complete no later than time t + D. We assume constrained-deadline tasks, that is, D ≤ T.

We now present a procedure to generate a concrete task from a specification task τ once all alternatives have been chosen. The procedure starts by initializing the concrete task's sets $\mathcal{V} = \emptyset$, $\Gamma = \emptyset$. First, all the source sub-tasks of τ are added to $\mathcal{V}$. Then, for every immediate successor node $n_j$ of a node $n_i \in \mathcal{V} \cup \Gamma$: if $n_j$ is a sub-task node (a conditional node), it is added to $\mathcal{V}$ (to $\Gamma$, respectively); if it is an alternative node, we consider its selected immediate successor $n_k$ and add it to $\mathcal{V}$ or to $\Gamma$, respectively. The procedure is iterated until all nodes of τ have been visited. The set of edges of the concrete task, a subset of $\mathcal{E}$, is updated accordingly.

We denote by Ω(τ) the set of all concrete tasks of a specification task τ; Ω(τ) is generated by simply enumerating all possible alternatives.
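As a companion to the procedure above, the following sketch enumerates Ω(τ) on the structures introduced earlier. It is a simplification of the paper's procedure: the chosen alternative node is kept in the graph as a zero-cost pass-through instead of being spliced out, and the non-selected branches are pruned by a reachability pass from the specification's sources:

    import copy
    import itertools

    def concrete_tasks(spec: CDag):
        """Enumerate Omega(tau): one concrete task per combination of
        choices at the alternative nodes (a sketch)."""
        alts = [n for n, nd in spec.nodes.items()
                if nd.kind is NodeKind.ALTERNATIVE]
        options = [spec.succ(a) for a in alts]  # one outgoing edge to pick each
        roots = set(spec.sources())             # sources of the specification
        for pick in itertools.product(*options):
            g = copy.deepcopy(spec)
            keep = set(zip(alts, pick))
            # drop the non-selected outgoing edges of every alternative node
            g.edges = [(a, b) for (a, b) in g.edges
                       if g.nodes[a].kind is not NodeKind.ALTERNATIVE
                       or (a, b) in keep]
            # prune everything no longer reachable from the original sources
            reach, frontier = set(roots), list(roots)
            while frontier:
                for s in g.succ(frontier.pop()):
                    if s not in reach:
                        reach.add(s)
                        frontier.append(s)
            g.nodes = {n: nd for n, nd in g.nodes.items() if n in reach}
            g.edges = [(a, b) for (a, b) in g.edges
                       if a in reach and b in reach]
            yield g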
III. SCHEDULING ANALYSIS

In this work, we consider partitioned scheduling: each engine has its own scheduler and a separate ready-queue, and sub-tasks are allocated (partitioned) onto the available engines so that the system is schedulable. Partitioned scheduling allows us to use well-known single-processor schedulability tests, which makes the analysis simpler, and it avoids the overhead due to thread migration incurred by global scheduling. The analysis presented here is modular, so engines may have different scheduling policies; in this paper, we restrict our attention to preemptive EDF.
A. Alternative patterns
Given a specification task τ, we have to select one of the possible concrete tasks before proceeding to the allocation and scheduling of the sub-tasks on the computing engines. Since the number of combinations can be very large, in this paper we propose a heuristic algorithm based on a greedy strategy (see Section IV). In particular, we explore the set of concrete tasks in a certain order. The order relation ≻ sorts concrete tasks according to their total execution time.

Definition 1.
Let τ′, τ′′ be two concrete tasks of specification task τ. The partial order relation ≻ is defined as:

$$\tau' \succ \tau'' \implies C(\tau') \ge C(\tau'') \qquad (1)$$

In the next section, we will define a second order relation, ≫, that sorts concrete tasks based on their engine tags.

B. Tagged Tasks
One concrete task may contain sub-tasks with different tags, which will be allocated on different engines. Before proceeding to allocation, we need to select only the sub-tasks pertaining to a given tag. We call this operation task filtering. We start by defining an empty sub-task as a sub-task with null computation time.
Definition 2 (Tagged task). Let $\tau = \{T, D, \mathcal{V}, \Gamma, \mathcal{E}\}$ be a concrete task. Task $\tau(tag_i)$ is a tagged task of τ iff:
• $\tau(tag_i) = \{T, D, \mathcal{V}_i, \Gamma_i, \mathcal{E}_i\}$ is isomorphic to τ, that is, the graph has the same structure, the same number of nodes of the same types, and the same edges between corresponding nodes;
• let $v \in \mathcal{V}$ be a sub-task of τ, and let $v' \in \mathcal{V}_i$ be the corresponding sub-task of $\tau(tag_i)$ in the isomorphism; if $tag(v) = tag_i$, then $C(v') = C(v)$, else $C(v') = 0$;
• $\Gamma_i = \Gamma$.

We denote with $S(\tau) = \{\tau(tag_1), \ldots, \tau(tag_K)\}$ the set of all possible tagged tasks of τ. Each concrete task generates as many tagged tasks as there are tags in the target architecture.

Fig. 2: Tagged tasks for the concrete task of Figure 1b.

Figure 2 shows the three tagged tasks for the concrete task in Figure 1b. The first one contains only the sub-tasks having the CPU tag, the second contains only the DLA sub-tasks, and the third one refers to the GPU sub-tasks. Every tagged task will be allocated on one or more engines having the corresponding tag.
Definition 3 (≫ order relationship). Assume the architecture supports K different tags. Let $n(tag)$ denote the number of computing engines labeled with tag. Assume that the tags are ordered by increasing $n(tag)$, that is, $n(tag_i) < n(tag_j) \implies i < j$. Let τ′, τ′′ be two concrete tasks of specification task τ, and let $S(\tau') = \{\tau'(tag_1), \ldots, \tau'(tag_K)\}$ and $S(\tau'') = \{\tau''(tag_1), \ldots, \tau''(tag_K)\}$ be the respective tagged tasks. The order relation τ′ ≫ τ′′ is defined as follows:

$$\tau' \gg \tau'' \implies \exists\, 1 \le i \le K : \begin{cases} C(\tau'(tag_j)) = C(\tau''(tag_j)) & \forall j < i \\ C(\tau'(tag_i)) < C(\tau''(tag_i)) \end{cases}$$

The relation ≫ gives priority to concrete tasks that place less load on scarce resources: if there are few execution engines with a certain tag, and there is a large number of sub-tasks requiring allocation on that specific engine type, the order relation prefers alternative patterns with a lower workload for those engines.
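Since ≫ compares the per-tag cumulative execution times lexicographically, scarcest tag first, it maps directly onto tuple comparison. In the sketch below the ordered tag list is an assumed scarcity order for a Xavier-like platform, not something prescribed by the model:

    def tag_load(concrete: CDag, tag: str) -> float:
        """C(tau(tag)): cumulative WCET of the sub-tasks carrying this tag."""
        return sum(nd.wcet for nd in concrete.nodes.values()
                   if nd.kind is NodeKind.SUBTASK and nd.tag == tag)

    def scarcity_key(concrete: CDag, ordered_tags: List[str]) -> Tuple[float, ...]:
        """Sort key for the >> relation: per-tag loads compared
        lexicographically, scarcest tag (fewest engines) first."""
        return tuple(tag_load(concrete, t) for t in ordered_tags)

    # Sorting in ascending order tries first the concrete tasks that load
    # the scarce engines the least, e.g.:
    # omega.sort(key=lambda c: scarcity_key(c, ["PVA", "DLA", "iGPU", "dGPU", "CPU"]))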
C. Deadlines and offsets assignment

Meeting the timing constraints of a concrete task depends on the allocation of its sub-tasks onto the different execution engines. As the sub-tasks communicate through shared buffers, they are forced to respect the execution order dictated by the precedence constraints imposed by the graph structure. To reduce the complexity of dealing with precedence constraints directly, we impose intermediate offsets and deadlines on each sub-task. In this way, the precedence constraints are respected automatically if every sub-task is activated after its offset and completes no later than its deadline.

Many authors have proposed techniques to assign intermediate deadlines and offsets to task graphs. In this paper we use techniques similar to those proposed in [1] and [2]. Most deadline assignment techniques are based on the computation of the execution time of the critical path. A path $P_x = \{v_1, v_2, \cdots, v_l\}$ is a sequence of sub-tasks of task τ such that $\forall v_i, v_{i+1} \in P_x, \exists\, e(v_i, v_{i+1}) \in \mathcal{E}$. Let $\mathcal{P}$ denote the set of all possible paths of task τ. The critical path $P_{crit}(\tau) \in \mathcal{P}$ is defined as the path with the largest cumulative execution time of its sub-tasks. We define the slack $Sl(P, D)$ along a path P as:

$$Sl(P, D) = D - \sum_{v \in P} C(v)$$

The assignment algorithm starts by assigning an intermediate relative deadline to every sub-task along a path, distributing the path's slack as follows:

$$D(v) = C(v) + calculate\_share(v, P)$$

The calculate_share function computes the slack share of sub-task v along the path. This slack can be shared according to two alternative heuristics:
• Fair distribution: assigns slack as the ratio of the original slack to the number of sub-tasks along the path:

$$calculate\_share(v, P) = \frac{Sl(P, D)}{|P|} \qquad (2)$$

• Proportional distribution: assigns slack according to the contribution of the sub-task execution time to the path:

$$calculate\_share(v, P) = \frac{C(v)}{C(P)} \cdot Sl(P, D) \qquad (3)$$

Once the relative deadlines of the sub-tasks along the critical path have been assigned, we select the next path in order of decreasing cumulative execution time, and assign the deadlines to the remaining sub-tasks by appropriately subtracting the already assigned deadlines. The complete procedure is described in [2]; due to space constraints, we do not report it here.

Let O(v) be the offset of a sub-task with respect to the arrival time of the task's instance. The sum O(v) + D(v) of the offset and of the intermediate relative deadline of a sub-task is called its local deadline, and it is the deadline relative to the arrival of the task's instance. The offset of a sub-task is set equal to 0 if the sub-task has no predecessors; otherwise, it is computed recursively as the maximum among the local deadlines of its predecessor sub-tasks.

Figure 3 illustrates the relationship between the activation times, the intermediate offsets, the relative deadlines and the local deadlines of the sub-tasks of the concrete task of Figure 1b. We assume that three of the sub-tasks have been allocated on the same CPU, whereas the two remaining sub-tasks have each been allocated on a different engine. The activation time is the absolute time of the arrival of the sub-task instance.
Fig. 3: Example of offsets and local deadlines.

The activation time of a source sub-task corresponds to the activation time of the task graph. The offset is the interval between the activation of the task graph and the activation of the sub-task. The local deadline is the interval between the task graph activation and the sub-task's absolute deadline.
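Equations (2) and (3) and the recursive offset rule translate directly into code. The sketch below assumes that the per-path relative deadlines have already been merged into a single per-node map (the multi-path merging follows [2] and is not shown):

    from collections import deque
    from typing import Dict, List

    def assign_path_deadlines(path: List[Node], D: float,
                              mode: str = "fair") -> Dict[str, float]:
        """Distribute the slack Sl(P, D) = D - sum_{v in P} C(v) over one
        path, per Equation (2) (fair) or (3) (proportional)."""
        c_path = sum(v.wcet for v in path)
        slack = D - c_path
        deadlines = {}
        for v in path:
            if mode == "fair":
                share = slack / len(path)             # Eq. (2)
            else:
                share = (v.wcet / c_path) * slack     # Eq. (3)
            deadlines[v.name] = v.wcet + share        # D(v) = C(v) + share
        return deadlines

    def topological_order(dag: CDag) -> List[str]:
        """Kahn's algorithm; a DAG always admits such an order."""
        indeg = {n: len(dag.pred(n)) for n in dag.nodes}
        queue = deque(n for n, d in indeg.items() if d == 0)
        order = []
        while queue:
            n = queue.popleft()
            order.append(n)
            for s in dag.succ(n):
                indeg[s] -= 1
                if indeg[s] == 0:
                    queue.append(s)
        return order

    def assign_offsets(dag: CDag, rel_deadline: Dict[str, float]) -> Dict[str, float]:
        """O(v) = 0 for sources, otherwise the maximum local deadline
        O(p) + D(p) over the immediate predecessors p of v."""
        offset = {}
        for v in topological_order(dag):
            offset[v] = max((offset[p] + rel_deadline[p] for p in dag.pred(v)),
                            default=0.0)
        return offset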
Definition 4.
Sub-task $v \in \mathcal{V}_\tau$ is feasible if, for each task instance arriving at time $a_j$, sub-task v executes within the interval bounded by its arrival time $a(v) = a_j + O(v)$ and its absolute deadline $a(v) + D(v)$.

Lemma 1.
A concrete task (resp. tagged task) is feasible if all its sub-tasks are feasible.

Proof.
By definition, the local deadline of the sink sub-tasks is equal to the deadline D of the task. Moreover, the offset of a sub-task is never earlier than the local deadlines of its preceding sub-tasks. Therefore, 1) the precedence constraints are respected, and 2) if the sink sub-tasks are feasible, then the concrete task (tagged task, respectively) is feasible.

D. Single engine analysis
In this section, we assume that sub-tasks have already been assigned offsets and deadlines and have been allocated onto the platform's engines, and we present the schedulability analysis used to test whether all tasks respect their deadlines when scheduled by the Earliest Deadline First (EDF) algorithm.
Theorem 1.
Let $\mathcal{T}$ be a set of task graphs allocated onto a single-core engine. Task set $\mathcal{T}$ is schedulable by EDF if and only if:

$$\sum_{\tau \in \mathcal{T}} dbf(\tau, t) \le t, \quad \forall t \le t^* \qquad (4)$$

Here dbf is the demand bound function [3] of a task graph τ in an interval of length t, computed as the worst-case cumulative execution time of all jobs (instances of sub-tasks) having both their arrival time and their deadline within any interval of length t. For a task graph, the dbf can be computed as follows:

$$dbf(\tau, t) = \max_{v \in \tau} \sum_{v' \in \tau} \left\lfloor \frac{t - \tilde{O}(v') - D(v') + T(\tau)}{T(\tau)} \right\rfloor C(v') \qquad (5)$$

where $\tilde{O}(v') = (O(v') - O(v)) \bmod T(\tau)$ (we remind the reader that the remainder of $a/b$ is by definition a non-negative number r such that $a = kb + r$).

In our model, a task graph may contain conditional nodes, which model alternative paths that are selected non-deterministically at run-time. To compute the dbf of a tagged task that contains conditional nodes, we must first enumerate all possible conditional graphs, using the same procedure as the one used for generating concrete tasks from specification tasks. The dbf of a tagged task in an interval of length t is then computed as the largest dbf among all the possible conditional graphs.
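A direct transcription of Equations (4) and (5) is given below; the Sub tuple and the fixed-step sampling loop are our simplifications (an exact test would check only the absolute-deadline points up to t*), and the max(0, ·) clamp makes explicit that negative job counts contribute no demand:

    import math
    from typing import List, NamedTuple, Tuple

    class Sub(NamedTuple):
        O: float  # offset, relative to the task-graph arrival
        D: float  # intermediate relative deadline
        C: float  # WCET on the engine under analysis (0 for filtered-out tags)

    def dbf_task_graph(subs: List[Sub], T: float, t: float) -> float:
        """Equation (5) for one conditional-free tagged task on one engine."""
        best = 0.0
        for v in subs:                      # outer max over v
            demand = 0.0
            for vp in subs:                 # inner sum over v'
                o_tilde = (vp.O - v.O) % T  # O~(v') = (O(v') - O(v)) mod T
                jobs = math.floor((t - o_tilde - vp.D + T) / T)
                demand += max(0, jobs) * vp.C
            best = max(best, demand)
        return best

    def edf_test(tasks: List[Tuple[List[Sub], float]],
                 t_star: float, step: float) -> bool:
        """Processor-demand criterion of Equation (4), sampled up to t*."""
        t = step
        while t <= t_star:
            if sum(dbf_task_graph(subs, T, t) for (subs, T) in tasks) > t:
                return False
            t += step
        return True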
E. Anticipating the activation of sub-tasks

Given an instance of a sub-task v with arrival time a(v) and local deadline D(v), at run-time it may happen that all instances of the preceding sub-tasks have already completed their execution before a(v). In this case, we activate the sub-task as soon as the preceding sub-tasks have finished, with the same local deadline D(v).

Lemma 2.
Consider a feasible set of sub-tasks allocated on a set of engines and scheduled by EDF. If a sub-task is activated as soon as all its predecessor sub-tasks have finished, with the same local deadline, the set remains schedulable.

Proof.
This descends directly from the sustainability property of EDF [4]. In fact, by anticipating the activation of the sub-task without modifying its local deadline, the sub-task is scheduled with a longer relative deadline, and the demand bound function does not increase.

From an implementation point of view, this technique avoids the need to set up activation timers for intermediate sub-tasks; moreover, it allows us to reduce the pessimism of the analysis in the presence of high preemption costs, as we will see in the next section.
F. Preemption-aware analysis
In recent GPUs, preempting an executing task can be a costly operation (see Section VI-C). In particular, the cost of preemption may vary significantly depending on the preempted task and on the engine. For example, preempting a graphical kernel induces a larger cost than preempting a computing CUDA kernel. Therefore, we need to account for the cost of preemption in the analysis.

We start by observing that, under EDF scheduling, a job of a sub-task $v_i$ can preempt a job of sub-task $v_j$ at most once, and only if its relative deadline is shorter: $D(v_i) < D(v_j)$. A simple (although pessimistic) approach is to always consider the worst-case preemption cost as part of the worst-case computation time of the preempting task. Let $pc(v_j)$ denote the cost of preempting sub-task $v_j$.

Lemma 3. Let $V = \{v_1, v_2, \cdots, v_K\}$ be a set of sub-tasks to be scheduled by EDF on a single engine. Consider $V^{pc} = \{v'_1, v'_2, \cdots, v'_K\}$, where $v'_i$ has the same parameters as $v_i$, except for the WCET, which is computed as $C(v'_i) = C(v_i) + pc_i$ with $pc_i = \max\{pc(v) \mid v \in V \wedge D(v) > D(v_i)\}$. If $V^{pc}$ is schedulable by EDF when considering a null preemption cost, then V is schedulable when considering the cost of preemption.

Proof. The Lemma directly follows from the observation that the cost of preemption can never exceed $pc_i$ for sub-task $v_i$.

Lemma 3 is safe but pessimistic. We can further improve the analysis by observing that a sub-task cannot preempt another sub-task belonging to the same task graph (we remind the reader that we assume constrained-deadline tasks). Furthermore, it may be impossible for two consecutive sub-tasks of a task graph to both preempt the same sub-task, as demonstrated by Theorem 2.

Definition 5 (Maximal sequential subset). Let V be a set of sub-tasks allocated on a single engine, and let τ be a tagged task such that $V_\tau \subseteq V$. A maximal sequential subset $V_M$ is a maximal subset of $V_\tau$ such that none of the sub-tasks in $V_M$ has a null predecessor. Further, we denote by $v_M \in V_M$ the sub-task with the shortest local deadline in $V_M$.

We observe that, since all the sub-tasks in $V_M$ are allocated on the same engine and since they do not have any predecessor sub-task allocated on a different engine (no empty predecessor), they can be activated as soon as their predecessor sub-tasks have finished.

Now, suppose $v_1, v_2 \in V_M$ and that $v_1$ is an immediate predecessor of $v_2$. If $v_1$ preempts a sub-task $v_j$, and $D(v_2) \le D(v_j)$, then $v_j$ can be executed only after $v_2$ has finished. This means that the cost of preempting $v_j$ can be accounted to only one between $v_1$ and $v_2$. We assign this preemption cost to the sub-task $v_M$ with the shortest local deadline among all sub-tasks of $V_M$, whereas the others do not pay any preemption cost; the preemption cost of any other sub-task in $V_M$ is set equal to 0. For all sub-tasks that have a null predecessor, we compute a preemption cost as in Lemma 3.

Finally, for any tagged task graph τ, the preemption cost of one of its sub-tasks $v_i \in V_\tau$ can be computed as follows:
• if $v_i = v_M$, or if $v_i$ has a null predecessor, then

$$pc_i = \max\{pc(v) \mid v \in V \setminus V_\tau \wedge D(v) > D(v_i)\}; \qquad (6)$$

• otherwise,

$$pc_i = 0. \qquad (7)$$
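The accounting of Lemma 3 amounts to one maximum per sub-task; a sketch with names of our choosing is given below. Theorem 2's refinement consists in applying the same maximum, restricted to sub-tasks outside the tagged task, only to $v_M$ and to the sub-tasks with a null predecessor, and charging zero to every other sub-task of a maximal sequential subset:

    from typing import Dict, List, Tuple

    def inflate_wcets_lemma3(subs: List[Tuple[str, float, float]],
                             pc: Dict[str, float]) -> Dict[str, float]:
        """Lemma 3: each sub-task (name, D, C) on the engine is charged the
        largest cost of preempting any sub-task with a longer relative
        deadline; pc[name] is the cost of preempting that sub-task."""
        inflated = {}
        for (name_i, d_i, c_i) in subs:
            costs = [pc[name] for (name, d, _c) in subs if d > d_i]
            inflated[name_i] = c_i + (max(costs) if costs else 0.0)
        return inflated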
Theorem 2. Let $V = \{v_1, v_2, \cdots, v_K\}$ be a set of sub-tasks scheduled by EDF. Consider $V^{pc} = \{v'_1, v'_2, \cdots, v'_K\}$, where $v'_i$ has the same parameters as $v_i$, except for the WCET, which is computed as $C(v'_i) = C(v_i) + pc_i$, with $pc_i$ computed as in Equation (6) or (7). If $V^{pc}$ is schedulable by EDF when considering a null preemption cost, then V is schedulable when considering the cost of preemption.

Proof. We report here a proof sketch. Consider any non-source sub-task $v_i \in V_M$: it is activated as soon as the preceding sub-tasks have finished executing their corresponding instances. If one of the sub-tasks preceding $v_i$ preempted a sub-task $v_j$, the preemption cost has already been accounted for in the worst-case execution time of the preceding sub-task; as discussed above, $v_j$ can only resume execution after $v_i$ has completed, so no further preemption cost needs to be accounted for. If instead none of the sub-tasks preceding $v_i$ has preempted $v_j$, then $v_j$ cannot start executing before $v_i$ completes, because its deadline is not smaller than $D(v_i)$; hence no preemption will occur. In either case, no preemption cost needs to be charged to $v_i$.

IV. ALLOCATION
A. Allocation of task specifications
The goal of our methodology is to allocate a set of task specifications onto a set of engines, by selecting alternative implementations, so that all tasks complete before their deadlines. From an operational point of view, this is equivalent to finding a feasible solution to a complex Integer Linear Programming problem. In fact, given the large number of combinations (due to alternative nodes, conditional nodes, and allocation decisions), an ILP formulation of this problem fails to produce feasible solutions in an acceptably short time. Therefore, in this section we propose a set of greedy heuristics to quickly explore the space of solutions.

Algorithm 1 describes the basic methodology of our approach. The algorithm can be customised with four parameters: order is the sorting order of the concrete task sets (see Sections III-A and III-B); parameter slack concerns the way the slack is distributed when assigning intermediate deadlines and offsets (see Section III-C); parameter alloc can be best-fit (BF) or worst-fit (WF); parameter omit concerns the strategy used to eliminate sub-tasks when needed (see Section IV-C).

At each step, the algorithm tries to allocate one single task specification (for loop at line 1). For each task, it first generates all concrete tasks (line 2) and sorts them according to one of the order relations (≻ or ≫). Then, for each concrete task, it first assigns the intermediate deadlines and offsets according to the methodology described in Section III-C (line 5), using either the fair or the proportional slack distribution; then, it separates the concrete task into tagged tasks according to the corresponding tags (line 6).

Next, the algorithm tries to allocate every tagged task onto a single engine having the corresponding tag (line 10; this procedure is described below in Algorithm 2). If a feasible allocation is found, the allocation is applied and the algorithm moves to the next specification task (lines 11-12). If no feasible sequential allocation can be found, the next concrete task is tested.

Algorithm 1 Allocation algorithm
input: T: set of task specifications
parameters: order (≻ or ≫), slack (fair or proportional), alloc (BF or WF), omit (parallel or random)
output: SUCCESS or FAIL
 1: for τ ∈ T do
 2:   Ω = generate_concrete_tasks(τ)
 3:   sort(Ω, order)
 4:   for τ ∈ Ω do
 5:     assign_deadlines_offsets(τ, slack)
 6:     S(τ) = filter_tagged_tasks(τ)
 7:   end for
 8:   allocated = false
 9:   for τ ∈ Ω do
10:     if feasible_sequential(S(τ), alloc) then
11:       allocated = true; assign sub-tasks to engines
12:       break
13:     end if
14:   end for
15:   if not allocated then
16:     for τ ∈ Ω do
17:       (τ′, τ′′) = parallelize(τ, alloc, omit)
18:       if τ′ ≠ ∅ then
19:         allocate τ′ to the selected engines
20:         add τ′′ back to T
21:         allocated = true
22:         break
23:       end if
24:     end for
25:     if not allocated then
26:       return FAIL
27:     end if
28:   end if
29: end for
30: return SUCCESS

The algorithm gives priority to single-engine allocations because they reduce the preemption cost, as discussed in Section III-F. In particular, by allocating an entire tagged task onto a single engine, we reduce the number of null sub-tasks to the minimum necessary, and so we can assign the cost of preemption to fewer sub-tasks.

If none of the concrete tasks of a specification task can be allocated (line 15), this means that one of the tagged tasks could not be allocated on a single engine. Therefore, the algorithm tries to break some of the tagged tasks of a concrete task into parallel tasks to be executed on different engines of the same type. This is performed by procedure parallelize, which is described in Section IV-C. In particular, one part of the concrete task will be allocated, while the second part is put back in the list of not-yet-allocated task graphs (line 20). If this process is also unable to find a feasible concrete task, the analysis fails (line 26).
B. Sequential allocation
Algorithm 2 tries to allocate a concrete task on a minimal number of engines. It takes as input a set of tagged tasks. For each tagged task, it selects the corresponding engines and sorts them according to the alloc parameter, that is, in decreasing order of utilization in the case of Best-Fit, or in increasing order of utilization in the case of Worst-Fit. Then, it tests the feasibility of allocating the tagged task on each engine in turn. If the allocation is successful, the next tagged task is tested; otherwise, the algorithm tries the next engine. If the tagged task cannot be allocated on any engine, the algorithm fails. If all tagged tasks have been allocated, the corresponding allocation is returned.
Algorithm 2 feasible_sequential
input: S(τ): set of tagged tasks, alloc
output: feasibility: SUCCESS or FAIL
 1: nfeas = 0
 2: for τ(tag) ∈ S(τ) do
 3:   engine_list = select_engines(tag)
 4:   sort_engines(engine_list, alloc)
 5:   f = false
 6:   for e ∈ engine_list do
 7:     f = dbf_test(τ(tag) ∪ T_e)
 8:     if f then
 9:       save_allocation(τ(tag), e)
10:       nfeas++
11:       break
12:     end if
13:   end for
14:   if not f then return FAIL
15: end for
16: if nfeas = |S(τ)| then
17:   return SUCCESS, saved allocations
18: end if
C. Parallel allocation
When the sequential allocation fails for a given task specification, the algorithm tries to allocate one or more of its tagged tasks onto multiple engines having the same tag. Algorithm 3 takes as input a concrete task and two parameters: alloc for the BF or WF heuristics, and omit to select which sub-task to remove first.

For each tagged task of the concrete task (line 2), the algorithm selects the list of engines corresponding to the selected tag and sorts them according to BF or WF (line 4). Then, it tests the feasibility of the tagged task on each engine (line 6). If the test fails, it removes one sub-task from the tagged task and adds it to the list of non-allocated sub-tasks τ′′ (line 8). We propose two heuristics:
1) Random heuristic: it selects a random sub-task and adds it to the omitted list.
2) Parallel heuristic: to be feasible, the critical path of each tagged task must be feasible even on an unlimited number of engines. Thus, we are interested in the sub-tasks that do not belong to the critical path, because they are the ones causing the infeasibility: they are omitted one by one until a feasible schedule is found.

The feasibility test is repeated until a feasible subset of τ(tag) is found. The omitted sub-tasks are then tried on the next engine with the same tag (line 13). At the end of the procedure, two concrete tasks are produced: τ′ is the feasible part that will be allocated, while τ′′ will be tried again in the following iteration of Algorithm 1.

Algorithm 3 parallelize
input: τ: concrete task, alloc (BF or WF), omit (parallel or random)
output: concrete tasks (τ′, τ′′)
 1: τ′ = ∅, τ′′ = ∅
 2: for τ(tag) ∈ S(τ) do
 3:   engine_list = select_engines(tag)
 4:   sort(engine_list, alloc)
 5:   for e ∈ engine_list do
 6:     f = dbf_test(τ(tag) ∪ T_e)
 7:     while not f do
 8:       τ′′ = τ′′ ∪ remove(τ(tag), omit)
 9:       f = dbf_test(τ(tag) ∪ T_e)
10:     end while
11:     if τ(tag) ≠ ∅ then
12:       τ′ = τ′ ∪ save_allocation(τ(tag), e)
13:       τ(tag) = τ′′; τ′′ = ∅
14:       allocated = true
15:       break
16:     end if
17:   end for
18:   if not allocated then return (∅, τ)
19: end for
20: return (τ′, τ′′)

V. RELATED WORK
Many authors [1], [5]–[12] have proposed real-time task models based on DAGs. However, to the best of our knowledge, none of the existing models supports alternative implementations of the same functionality on different computing engines.

The authors of [1] studied the deadline assignment problem in distributed real-time systems. They formalize the problem and identify the cases where deadline assignment methods have a strong impact on system performance. They propose Fair Laxity Distribution (FLD) and Unfair Laxity Distribution (ULD) and study their impact on schedulability. In [8], the authors analyze the schedulability of a set of DAGs under global EDF, global rate-monotonic (RM), and federated scheduling. In [13], the authors present a general framework for partitioning real-time tasks onto multiple cores using resource reservations. They propose techniques to set the activation time and deadline of each task, and they use an ILP formulation to solve the allocation and assignment problems. However, when applying such approaches to large applications consisting of hundreds of sub-tasks, the analysis can be highly time consuming.

DAG fixed-priority partitioned scheduling has been presented in [10]. The authors propose methods to compute response times with tight bounds. They model partitioned DAGs as a set of self-suspending tasks, and propose an algorithm to traverse a DAG and characterize the worst-case scheduling scenario.

Unlike previous models, Melani et al. [6] proposed to model conditional branches in the code in a way similar to our conditional nodes; however, their model is not able to express off-line alternative patterns. They proposed different methods to compute an upper bound on the response time under global scheduling algorithms. In [14], alternative on-line execution patterns can be expressed using digraphs. However, the digraph model cannot express parallelism and only supports sequential tasks.

In this paper we assume preemptive EDF scheduling. Typically, the preemption cost on classical CPUs can be assumed to be a negligible percentage of the task execution time. However, this is not always the case for GPUs. Depending on the computing architecture and on the nature of the workload, GPU tasks present different degrees of preemption granularity and related preemption costs. Initial work on preemptive scheduling on GPUs assumed preemption was viable at the kernel granularity [15]. A finer granularity for computing workloads is represented by CTA (Cooperative Thread Array) level preemption, in which preemption occurs at the boundaries of the groups of parallel threads that execute within the same GPU computing cluster [16], [17]. In such a scenario, the cost of preempting an executing context on a GPU may vary significantly, as it involves saving and restoring contexts of variable size and/or reaching the next viable preemption point. The overhead measurements reported in the cited contributions call for modeling each GPU sub-task with a specific non-negligible preemption cost, which can be in the same order of magnitude as the execution time of the sub-task.

VI. RESULTS AND DISCUSSIONS
In this section, we evaluate the performance of our scheduling analysis and allocation strategies. We compare against the cp-DAG model proposed by Melani et al. [6]. Note that in [6] the authors proposed an analysis for cp-DAGs in the context of global scheduling, whereas our analysis is based on partitioned scheduling. Therefore, we extended the cp-DAG model to support multiple engines by adding a randomly selected tag to each node of the graph. Moreover, we applied the same allocation heuristics of Section IV and the same scheduling analysis of Section III to both C-DAGs and cp-DAGs.

In the following experiments, we considered the NVIDIA Jetson AGX Xavier (https://elinux.org/Jetson_AGX_Xavier). It features 8 CPU cores and four different kinds of accelerators: one discrete and one integrated GPU, one DLA and one PVA. Each accelerator is treated as a single computing resource. In this way, we exploit task-level parallelism, as opposed to allowing the parallel execution of more than one sub-task on partitions of the accelerators (e.g., at a given time instant, only one sub-task is allowed to execute across all the computing clusters of a GPU).
A. Task Generation
We apply our heuristics to a large number of randomly generated synthetic task sets. The task set generation process takes as input an engine/tag utilization for each tag on the platform. First, we generate the utilization of the n tasks by using the UUniFast-Discard [18] algorithm for each input utilization. Graph sub-tasks can execute in parallel, thus a task utilization can be greater than 1. The sum of the per-tag utilizations is a fixed number upper bounded by the number of engines per tag.

The number of nodes of every task is chosen randomly between 10 and 30. We define a probability p that expresses the chance of having an edge between two nodes, and we generate the edges according to this probability. We ensure that the graph depth is bounded by an integer d proportional to the number of sub-tasks in the task. We also ensure that the graph is weakly connected (i.e., the corresponding undirected graph is connected); if necessary, we add edges between non-connected portions of the graph. Given a sub-task node, one of its successors is an alternative node or a conditional node with a fixed probability.

To avoid intractable hyper-periods, the period of every task is generated randomly from a predefined list of values. For every sub-task, we randomly select a tag. Further, for each tag, we use the UUniFast-Discard algorithm again to generate the single sub-task utilizations, so that a sub-task utilization can never exceed 1. Finally, we multiply the utilization of each sub-task by the task period to generate the worst-case execution time of every vertex. A cp-DAG is generated from a C-DAG by selecting one of its possible concrete tasks at random.
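For reference, UUniFast-Discard [18] is compact enough to reproduce; the sketch below (function name and signature are ours) draws n utilizations summing to u_total and rejects vectors in which any single value exceeds u_max (1 for sub-tasks, the per-tag engine count at task level):

    import random
    from typing import List

    def uunifast_discard(n: int, u_total: float, u_max: float) -> List[float]:
        """UUniFast with the discard rule: uniformly distribute u_total
        over n values, rejecting vectors with any value above u_max."""
        while True:
            utils, remaining = [], u_total
            for i in range(1, n):
                nxt = remaining * random.random() ** (1.0 / (n - i))
                utils.append(remaining - nxt)
                remaining = nxt
            utils.append(remaining)
            if max(utils) <= u_max:
                return utils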
B. Simulation results and discussions

We varied the baseline utilization from zero up to the number of engines per engine tag, in equal steps; the step size therefore varies from one engine tag to the other. For each utilization, we generated a random number of tasks between 20 and 25.

The results are presented as follows. Each algorithm is identified by a sequence of letters: (i) the first letter is either B for the best-fit or W for the worst-fit allocation technique; (ii) the second is either O for the ≻ order relation or R for the ≫ order relation; (iii) the third character describes the deadline assignment heuristic, F for fair and P for proportional. The algorithm name may additionally contain either option P, for the parallel allocation heuristic that eliminates parallel nodes first, or R, for the random heuristic which randomly selects the sub-task to remove. For Figures 4, 5, 6 and 7, we ran a fixed number of simulations per utilization step.

Fig. 4: Schedulability rate vs. total utilization.

Figure 4 reports the schedulability rate of each combination of heuristics cited above as a function of the total utilization. The fair deadline assignment technique presents better schedulability rates than proportional deadline assignment. In general, the BF heuristic combinations outperform the WF ones: this can be explained by observing that BF tries to pack the maximum number of sub-tasks onto the minimum number of engines, and this leaves more flexibility to schedule heavy tasks on other engines.

In the figures, the cp-DAG model proposed in [6] is shown in yellow. Since a cp-DAG has no alternative implementations, the algorithm has less flexibility in allocating the sub-tasks; therefore, by construction, the results for C-DAG dominate the corresponding results for cp-DAG. However, it is interesting to measure the difference between the two models: for example, in Figure 4 the difference in schedulability rate between the two models is between 10% and 20% for utilization rates between 6 and 14.

When the system load is low, all combinations of heuristics achieve high schedulability rates. BRF shows better results because it aims at relaxing the utilization of scarce engines, thus avoiding the infeasibility of certain task sets due to a high load on scarce engines (DLA, PVA, GPUs). However, when dealing with a highly loaded system, BOF presents better schedulability rates, as it reduces the execution overheads on all engines.

Figure 5 reports the average number of active cores (CPUs) as a function of the total utilization. WF-based heuristics always use the highest number of CPU cores, because our task generator always outputs CPU sub-tasks; hence, the number of tasks is larger than the available number of CPU cores (which is 8 on our test platform). BF heuristics pack the maximum number of sub-tasks onto the minimum number of engines, thus the number of active cores increases quasi-linearly until the maximum is reached (i.e., the number of cores). The BRF heuristic uses more CPU cores because it preserves the scarce resources, thus scheduling more workload on the CPU cores.
Fig. 5: Number of active CPU cores vs. total utilization.
Fig. 6: Active CPU utilization vs. total utilization.

Figure 6 shows the average active utilization of the CPUs. The average utilization of BF-based heuristics is higher compared to WF: the latter distributes the work over different engines, thus the per-core utilization is low, in contrast to BF. Again, BRF has a higher utilization than BOF because it schedules more workload on the CPU cores than the other heuristics. As the workload is equally distributed over the different CPUs, the WF heuristics may be used to reduce the CPU operating frequency in order to save dynamic energy. Regarding the BF heuristics, BRF is not at the top of the average load because it uses more cores than the others.

Figure 7 shows the average utilization of the scarce resources. As can be noticed, the heuristics based on the ≫ order relation reduce the load on the scarce resources compared to those based on ≻: the higher the load, the less loaded the scarce resources are.

Fig. 7: DLA, GPU and PVA utilization vs. total utilization.
In all previous experiments, we applied the analysis described in Section III-F to account for preemption costs. In particular, we applied the technique of Theorem 2, assuming that the cost of preempting a sub-task is 30% of the sub-task execution time on a GPU, 10% on the DLA and PVA, and 0.02% on the CPUs. The DLA and PVA are non-preemptable engines; however, longer jobs might be split into smaller chunks, and this translates into a splitting overhead, as we submit many kernel calls as opposed to a single batch of commands.
Fig. 8: Schedulability rate with the preemption cost accounted as in Theorem 2 (REDUCED-PREM) vs. the maximum preemption cost of Lemma 3 (MAX-PREEM).

To highlight the importance of a proper analysis of the cost of preemption, in Figure 8 we report the schedulability rates obtained by BRF-P in two different cases: when considering the analysis of Lemma 3 (where the maximum preemption cost is charged to all preempting sub-tasks) and that of Theorem 2 (where the cost is only charged to one of the sub-tasks in the maximal sequential subset). As the utilization increases, the schedulability drastically falls for the first method, while the improved analysis of Theorem 2 keeps high schedulability rates.

VII. CONCLUSIONS AND FUTURE WORK
In this paper, we presented the C-DAG real-time task model, which allows the designer to specify both off-line and on-line alternatives, to fully exploit the heterogeneity of complex embedded platforms. We also presented a scheduling analysis and a set of heuristics to allocate C-DAGs on heterogeneous computing platforms. The analysis takes into account the cost of preemption, which may be non-negligible on certain specialized engines.

The results of our extensive synthetic simulations show that our proposed approach significantly reduces the pessimism of the analysis. This leads to an increase in resource utilization compared to similar approaches in the literature. As future work, we are considering extending our framework to account for memory interference between the different compute engines, as it is known to cause significant variations in execution times [19], [20].
REFERENCES

[1] D. Marinca, P. Minet, and L. George, "Analysis of deadline assignment methods in distributed real-time systems," Comput. Commun., vol. 27, no. 15, pp. 1412–1423, Sep. 2004. [Online]. Available: http://dx.doi.org/10.1016/j.comcom.2004.05.006
[2] Y. Wu, Z. Gao, and G. Dai, "Deadline and activation time assignment for partitioned real-time application on multiprocessor reservations," Journal of Systems Architecture, vol. 60, no. 3, pp. 247–257, 2014.
[3] S. K. Baruah, R. R. Howell, and L. E. Rosier, "Algorithms and complexity concerning the preemptive scheduling of periodic, real-time tasks on one processor," Real-Time Systems, vol. 2, no. 4, 1990.
[4] A. Burns and S. Baruah, "Sustainability in real-time scheduling," Journal of Computing Science and Engineering, vol. 2, no. 1, pp. 74–97, 2008.
[5] M. Qamhieh, F. Fauberteau, L. George, and S. Midonnet, "Global EDF scheduling of directed acyclic graphs on multiprocessor systems," in Proceedings of the 21st International Conference on Real-Time Networks and Systems. ACM, 2013, pp. 287–296.
[6] A. Melani, M. Bertogna, V. Bonifaci, A. Marchetti-Spaccamela, and G. C. Buttazzo, "Schedulability analysis of conditional parallel task graphs in multicore systems," IEEE Trans. Computers, vol. 66, no. 2, pp. 339–353, 2017.
[7] A. Saifullah, K. Agrawal, C. Lu, and C. Gill, "Multi-core real-time scheduling for generalized parallel task models," in IEEE Real-Time Systems Symposium (RTSS), Nov. 2011, pp. 217–226.
[8] J. Li, J. J. Chen, K. Agrawal, C. Lu, C. Gill, and A. Saifullah, "Analysis of federated and global scheduling for parallel real-time tasks," in Real-Time Systems (ECRTS), 2014 26th Euromicro Conference on. IEEE, 2014, pp. 85–96.
[9] A. Saifullah, D. Ferry, C. Lu, and C. Gill, "Real-time scheduling of parallel tasks under a general DAG model," 2012.
[10] J. Fonseca, G. Nelissen, V. Nélis, and L. M. Pinho, "Response time analysis of sporadic DAG tasks under partitioned scheduling," in Industrial Embedded Systems (SIES), 2016 11th IEEE Symposium on. IEEE, 2016, pp. 1–10.
[11] H.-E. Zahaf, A. E. H. Benyamina, R. Olejnik, and G. Lipari, "Energy-efficient scheduling for moldable real-time tasks on heterogeneous computing platforms," Journal of Systems Architecture, vol. 74, pp. 46–60, 2017.
[12] ——, "Modeling parallel real-time tasks with di-graphs," in Proceedings of the 24th International Conference on Real-Time Networks and Systems, ser. RTNS '16. ACM, 2016, pp. 339–348.
[13] Y. Wu, Z. Gao, and G. Dai, "Deadline and activation time assignment for partitioned real-time application on multiprocessor reservations," Journal of Systems Architecture, vol. 60, no. 3, pp. 247–257, 2014.
[14] M. Stigge, P. Ekberg, N. Guan, and W. Yi, "The digraph real-time task model," in Real-Time and Embedded Technology and Applications Symposium, April 2011.
[15] J. Zhong and B. He, "Kernelet: High-throughput GPU kernel executions with dynamic slicing and scheduling," IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 6, pp. 1522–1532, 2014.
[16] T. Amert, N. Otterness, M. Yang, J. H. Anderson, and F. D. Smith, "GPU scheduling on the NVIDIA TX2: Hidden details revealed," in IEEE Real-Time Systems Symposium (RTSS). IEEE, 2017, pp. 104–115.
[17] N. Capodieci, R. Cavicchioli, P. Valente, and M. Bertogna, "SiGAMMA: Server based integrated GPU arbitration mechanism for memory accesses," in Proceedings of the 25th International Conference on Real-Time Networks and Systems. ACM, 2017, pp. 48–57.
[18] P. Emberson, R. Stafford, and R. I. Davis, "Techniques for the synthesis of multiprocessor tasksets," in WATERS, 2010.
[19] W. Ali and H. Yun, "Protecting real-time GPU applications on integrated CPU-GPU SoC platforms," arXiv preprint arXiv:1712.08738, 2017.
[20] R. Cavicchioli, N. Capodieci, and M. Bertogna, "Memory interference characterization between CPU cores and integrated GPUs in mixed-criticality platforms," in IEEE International Conference on Emerging Technologies and Factory Automation (ETFA), 2017.