Virtual Gang based Scheduling of Real-Time Tasks on Multicore Platforms
Waqar Ali
University of Kansas, Lawrence, [email protected]
Rodolfo Pellizzoni
University of Waterloo, Ontario, [email protected]
Heechul Yun
University of Kansas, Lawrence, [email protected]
Abstract
We propose a virtual-gang based parallel real-time task scheduling approach for multicore platforms. Our approach is based on the notion of a virtual gang, which is a group of parallel real-time tasks that are statically linked and scheduled together by a gang scheduler. We present a light-weight intra-gang synchronization framework, called RTG-Sync, and virtual gang formation algorithms that provide strong temporal isolation and high real-time schedulability in scheduling real-time tasks on multicore. We evaluate our approach both analytically, with generated tasksets against state-of-the-art approaches, and empirically with a case-study involving real-world workloads on a real embedded multicore platform. The results show that our approach provides a simple but powerful compositional analysis framework, achieves better analytic schedulability, especially when the effect of interference is considered, and is a practical solution for COTS multicore platforms.
Software and its engineering → Real-time schedulability
Keywords and phrases safety-critical, hard real-time, multicore platforms, gang scheduling
© Waqar Ali, Rodolfo Pellizzoni, and Heechul Yun; licensed under Creative Commons License CC-BY. Leibniz International Proceedings in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl Publishing, Germany.

High-performance multicore-based embedded computing platforms are increasingly being used in safety-critical real-time applications, such as avionics, robotics, and autonomous vehicles. This is fueled by the need to efficiently process computationally demanding real-time workloads (e.g., AI and vision). For such workloads, multicore platforms provide ample opportunity for speedup by allowing these workloads to utilize multiple cores. However, the use of multicore platforms in safety-critical real-time applications brings significant challenges due to the difficulties encountered in guaranteeing predictable timing on these platforms. In a multicore platform, tasks running concurrently can experience high timing variations due to shared hardware resource contention. The effect of contention highly depends on the underlying hardware architecture, which is generally optimized for average performance and thus often shows extremely poor worst-case behaviors [8]. Furthermore, which tasks are co-scheduled at a time instance depends on the OS scheduler's decision and may vary over time. Therefore, to estimate a task's worst-case execution time (WCET) through empirical measurements, which is a common industry practice, one might have to explore all feasible co-schedules of the entire taskset under a chosen OS scheduling policy. Any slight change in the schedule, or a change in any of the tasks, can have ripple effects on the observed task execution times. In other words, the timing of a task is coupled with the rest of the tasks, the OS scheduling policy, and the underlying hardware.
For this reason, in use-cases where hard real-time guarantees are a must, such as avionics, it is recommended to disable all but one core of a multicore processor [10], which obviously defeats the purpose of using multicore. Gang scheduling was originally proposed in high-performance computing to maximize the performance of parallel tasks [17], and many real-time variants have been studied in the real-time community [9, 15, 16, 19, 20, 26, 45]. In gang scheduling, threads of a parallel task are scheduled only when there are enough cores to schedule them all at the same time. Therefore, gang scheduling reduces scheduling-induced timing variations and synchronization overhead [45]. However, most prior gang scheduling studies do not consider the co-runner-dependent timing variations due to contention in shared hardware resources, and instead simply assume that WCETs already account for such effects, which may introduce severe pessimism in their analysis [48]. In our previous work [4], we presented
the RT-Gang framework, which implements a restrictive form of gang scheduling policy for parallel real-time tasks, to address this problem of shared resource contention by scheduling only one real-time gang task at a time, even if there are enough cores left to accommodate other real-time tasks. This design philosophy of RT-Gang, although it solves the problem of shared-resource contention among different real-time tasks, constrains the overall real-time schedulability of the system. Given that parallelization of a task often does not scale well, and more cores are being integrated into modern multicore processors, it is unlikely to be a general solution for many systems. To mitigate this problem, we introduced the notion of a virtual gang, which is defined as a group of real-time tasks that are always scheduled together as if they are members of a single parallel task. However, our previous work did not cover how such a virtual gang can be created, selected, and scheduled in ways that improve real-time schedulability. In this paper, we rigorously examine the idea of a virtual gang and demonstrate that, without careful consideration, the notion of a virtual gang, as proposed in [4], can actually substantially decrease the real-time schedulability of the system. In fact, we show that, in the worst case, scheduling a virtual gang task is equivalent to serializing all member tasks of the gang, which effectively nullifies any schedulability benefits of co-scheduling. We propose a light-weight intra-gang synchronization framework, which we call RTG-Sync, to address this problem by ensuring that all member tasks of a virtual gang are synchronously released.
It also provides easy-to-use APIs to create and destroy virtual gangs and their memberships. Moreover, RTG-Sync improves isolation between real-time tasks and best-effort tasks, and among the member tasks of each virtual gang, by integrating a page-coloring based last-level cache (LLC) partitioning mechanism. RTG-Sync provides the following guarantees to the members of each virtual gang task: (1) all member tasks are statically determined and do not change over time; (2) no other real-time tasks can be co-scheduled; (3) best-effort tasks can be co-scheduled on any idle cores, but their usage of the LLC is restricted to a static partition via page-coloring so that they do not pollute the working set of real-time tasks in the LLC. Moreover, the maximum memory bandwidth usage of best-effort tasks is strictly regulated to a certain threshold value set by the virtual gang. These properties greatly simplify the process of determining task WCETs, because once a virtual gang is created, the other tasks that do not belong to the virtual gang cannot interfere with the member tasks, regardless of the OS scheduling policy, and the effect of shared hardware resource contention is strictly bounded. In short, RTG-Sync enables compositional timing analysis on multicore platforms. (In this paper, by a real-time task, we mean a periodically activated task, which is composed of one or more parallel threads.)

We present virtual gang formation algorithms to help create virtual gangs and their member tasks from a given real-time taskset with the goal of maximizing system-level real-time schedulability. Since we are interested in demonstrating our technique on commercial-off-the-shelf platforms, we do not assume that a detailed hardware model is available; instead, we rely on measurement-based techniques for WCET estimation.
Lastly, we describe how a classical single-core schedulability analysis can be applied to analyze the scheduling of parallel real-time tasksets on a multicore platform.

For evaluation, we first present schedulability analysis results with randomly generated parallel real-time tasksets. We then present case-study results conducted on a real multicore platform, demonstrating that the proposed framework achieves higher utilization and time predictability. In summary, we make the following contributions:

- We establish, with the help of concrete examples, the requirements for supporting the virtual gang abstraction in an operating system.
- We present RTG-Sync, a light-weight synchronization framework for ensuring synchronous release of the member tasks of each virtual gang.
- We present virtual gang formation algorithms, which create virtual gangs and their member tasks from a given taskset to improve system-level real-time schedulability.
- We implement our system on a real embedded multicore platform and evaluate our approach both analytically with generated tasksets and empirically with a case-study involving real-world workloads.

The rest of the paper is organized as follows. We provide necessary background in Section 2. In Section 3, we establish the requirements for supporting virtual gangs in a gang-scheduling framework with the help of motivating examples. In Sections 4 and 5, we explain the design of RTG-Sync and the gang formation algorithms, respectively. In Section 6, we describe the schedulability analysis results using synthetic tasksets. In Section 7, we present our case-study evaluation results. We discuss related work in Section 8 and conclude in Section 9.
In this section, we provide necessary background.
In this paper, we consider the rigid real-time gang task model [20] with implicit deadlines, and focus on scheduling n periodic real-time tasks, denoted by T = {τ1, τ2, ..., τn}, on a multicore platform with m identical cores. In the rigid gang model, each real-time task τi is characterized by three parameters <h_i, C_i, T_i>, where h_i represents the number of cores (threads) required by the gang task to run, C_i is the task's worst-case execution time (WCET), and T_i represents its period, which is also equal to its deadline. The task model is said to be rigid because the number of cores a task needs is fixed and does not change over time. The rigid gang task model is well suited for multi-threaded parallel applications, often implemented by using parallel programming frameworks such as OpenMP [13]. We will review other task models in Section 8.

(Timing analysis of a real-time system is considered compositional if the analysis of a component can be carried out independently of other components. RTG-Sync and the analysis tools will be available as open-source.)

Figure 1: Illustration of the "one-gang-at-a-time" scheduling policy: (a) Gang-FTP, (b) RT-Gang, with prio(τ1) > prio(τ2).
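As an illustration of the rigid gang task model, the three parameters and the resulting core utilization can be captured by a small data type (a sketch with names of our choosing, not part of the paper's framework):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GangTask:
    """A rigid real-time gang task <h, C, T> with implicit deadline D = T."""
    h: int      # number of cores (threads) the task needs, fixed over time
    C: float    # worst-case execution time (WCET)
    T: float    # period (equal to the deadline)

    @property
    def utilization(self) -> float:
        # a rigid gang occupies h cores for up to C time units every T
        return self.h * self.C / self.T

# example: a 2-thread gang task with WCET 3 and period 10
tau = GangTask(h=2, C=3.0, T=10.0)
```

Because h is part of the task definition, a scheduler can never run the task on fewer than h cores, which is exactly what distinguishes the rigid model from moldable or malleable gang models.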
Gang-FTP [20] is a fixed-priority gang scheduling algorithm, which schedules rigid and periodic real-time gang tasks as follows: at each scheduling event, the algorithm schedules the highest-priority task τi on the first h_i available cores (if they exist) among the active (ready) tasks. The process repeats for the remainder of the active tasks on the remaining available cores. The Gang-EDF scheduler [29] works similarly, except that the task's priority is not fixed but may dynamically change based on the deadlines of the tasks at the moment (a task with a more imminent deadline is given higher priority).

RT-Gang is a recently proposed open-source real-time gang scheduler, which implements a restrictive form of the Gang-FTP scheduling policy in the Linux kernel [4]. According to this policy, at most one real-time task, which may be composed of one or more parallel threads, can be scheduled at any given time. When a real-time task is released, all of its threads are scheduled simultaneously on the available cores if it is the highest-priority real-time task, or none at all if a higher-priority real-time task is currently in execution.

Figure 1 compares the "one-gang-at-a-time" scheduling policy of RT-Gang against Gang-FTP with a simple example, in which two real-time tasks τ1 and τ2 are scheduled on a multicore platform. Under Gang-FTP, τ1 and τ2 can be co-scheduled because their combined core requirement is equal to the total number of system cores. Under RT-Gang, such co-scheduling is not possible. All threads of τ2 are simultaneously preempted when the higher-priority task τ1 arrives because real-time tasks must be executed one at a time. The rationale behind this "simple" gang scheduling policy, one-gang-at-a-time, is to eliminate the problem of shared hardware resource contention among co-executing real-time tasks by design.
This also greatly simplifies schedulability analysis because it transforms the (complex) problem of multicore scheduling of real-time tasks into the well-understood unicore scheduling problem. Since each real-time task is guaranteed temporal isolation, its worst-case execution time (WCET) can be tightly bounded, as opposed to the pessimistic WCET estimation needed when co-scheduling of real-time tasks is allowed. Note that this restrictive gang scheduling is still strictly better than the Federal Aviation Administration (FAA) recommended industry practice of disabling all but one core [10].

The obvious problem of low CPU utilization (not all cores may be utilized by the single scheduled real-time gang task) is partially mitigated by allowing co-scheduling of best-effort tasks on any idle cores, on the condition that their memory bandwidth usage is throttled to a certain threshold value, set by the real-time gang task, so that the impact of co-scheduled best-effort tasks on the critical real-time gang task is strictly bounded. For throttling best-effort tasks, RT-Gang implements a hardware performance counter based kernel-level throttling mechanism [49]. Note, however, that although co-scheduling best-effort tasks can improve CPU utilization, it has no effect on the low schedulability of real-time tasks under the strict one-gang-at-a-time policy.

Figure 2: Virtual gang concept. Adapted from [4].
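The difference between the two selection rules at a single scheduling event can be sketched as follows. This is our simplified model (tasks as (priority, h) pairs, with a lower number meaning higher priority), not RT-Gang's actual kernel implementation:

```python
def gang_ftp_select(ready, m):
    """Gang-FTP: walk the ready tasks in priority order, scheduling each
    gang whose core demand h still fits on the remaining free cores."""
    scheduled, free = [], m
    for prio, h in sorted(ready):
        if h <= free:
            scheduled.append((prio, h))
            free -= h
    return scheduled

def rt_gang_select(ready, m):
    """RT-Gang's one-gang-at-a-time policy: only the single highest-priority
    real-time gang runs; idle cores may only serve throttled best-effort work."""
    return [min(ready)] if ready else []

# Figure 1's scenario: two gangs whose combined demand equals m = 4 cores
ready = [(1, 2), (2, 2)]
```

Here `gang_ftp_select(ready, 4)` co-schedules both gangs, while `rt_gang_select(ready, 4)` runs only the higher-priority one, leaving two cores for best-effort tasks.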
To improve schedulability of real-time tasks under RT-Gang, we introduced the notion of a virtual gang, which is defined as a group of real-time tasks that are explicitly linked and scheduled together as if they were the threads of a single real-time gang task under the scheduler. Figure 2 illustrates the concept of a virtual gang, in which three separate real-time tasks, τ1, τ2, and τ3, form a virtual gang task v1. The virtual gang task is then treated as a single real-time gang task by the scheduler. Unfortunately, however, the conditions under which virtual gangs can be created, and how they can be effectively scheduled in ways that improve the real-time schedulability of the system, were not shown in our previous work [4].

In this section, we show that the virtual gang concept, as described in [4], does not necessarily improve the real-time schedulability of the system. We present two main challenges that need to be addressed for effective virtual gang based real-time scheduling.
The first major requirement for a virtual gang is that all member tasks must have equal periods and that they must be released synchronously.
Figure 3: Example schedules under different schemes: (a) no virtual gangs, (b) synchronized virtual gang, (c) unsynchronized virtual gang.
As for the period requirement, if the linked member tasks of a virtual gang do not share the same period, then the consolidated gang task cannot be effectively modeled as a single periodic task from the analysis point of view. Therefore, a virtual gang task can only be created when all of its member tasks have a common period.
Task | WCET (C) | Period (T)
τ1   | 4        | 10
τ2   | 3        | 10
τ3   | 2        | 10
τ4   | 1        | 10

Table 1: Taskset parameters of the illustrative example.
As for the synchronous release requirement, consider the taskset in Table 1, and suppose it is scheduled on a quad-core platform (m = 4). We consider scheduling these tasks under the one-gang-at-a-time policy. When virtual gangs are not used, each task in the taskset executes as a gang by itself. This results in the scheduling timeline shown in Figure 3a. In this scheme, the completion time of the taskset is 10 time units, as the tasks execute sequentially, one at a time. Note that even though each of these tasks does not fully utilize the cores (each uses only one, leaving three idle), the idle cores cannot be used to schedule other real-time tasks (they are said to be "locked" by the gang scheduler), and thus real-time schedulability is reduced.

Now we consider the execution of this taskset as a virtual gang. Assuming that these are the only tasks that share the same period in our system, and that members of this taskset do not interfere with each other, an intuitive grouping of these tasks would be to run them at the same time across all four cores of the system. This results in the execution timeline shown in Figure 3b. In this scheme, the virtual gang completes in just 4 time units, after which other real-time tasks can be scheduled, i.e., improved real-time schedulability.

However, the execution of the taskset in the virtual gang scheme assumes that the jobs of the members are perfectly aligned. If this is not the case, then the virtual gang task's execution time will increase, as shown in Figure 3c, and in the worst case, it can be as bad as the original schedule without virtual gangs in terms of real-time task schedulability.

Figure 4: Example schedules under different gang formations: (a) best-case, (b) worst-case.
Another major challenge when creating virtual gangs is to decide which tasks to group together for concurrent execution.

Assume that, in addition to the tasks from Table 1, there is one more task τ5 = (h: 1, C: 3, T: 10) which needs to be scheduled on the quad-core platform. In this case, the taskset must be split into at least two virtual gangs, since all five tasks cannot execute simultaneously on our target system. Hence the problem is to find an optimal grouping of tasks into virtual gangs such that the execution time of the taskset is minimized. For the simple taskset considered here, it can be seen, with a little trial and error, that a virtual gang comprising τ1, τ2, τ3, τ5 and another one comprising just τ4 will achieve this goal, resulting in the execution timeline shown in inset (a) of Figure 4.

However, if the tasks in a virtual gang are not carefully selected, the execution time of the taskset can increase significantly. In the example taskset, a virtual gang comprising τ1, τ2, τ3, τ4 and another one comprising τ5 leads to an execution time of 7 time units, as compared to 5 time units in the previous case, as can be seen in inset (b) of Figure 4.

Given a taskset, the problem of selecting the tasks which should be run together as virtual gangs, so that the execution time of the entire taskset is minimized, is non-trivial. The problem is further complicated by the fact that the tasks in a virtual gang can interfere with each other when run concurrently due to shared hardware resource contention, which may require some degree of pessimism in estimating the virtual gang's WCET. Without taking the synchronization and gang formation problems into account, a strategy to improve system utilization via virtual gangs under RT-Gang may not lead to the desired results and may actually deteriorate the system's performance and real-time schedulability.
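The two groupings above can be checked with a few lines of arithmetic: a configuration's completion time is the sum, over its gangs, of the longest member WCET. The WCET values follow the running example and are illustrative:

```python
def completion_time(config):
    # config: list of virtual gangs, each gang a list of member WCETs;
    # a gang's (isolation) WCET is that of its longest-running member
    return sum(max(gang) for gang in config)

# Grouping (a): tau_5 (C = 3) packed with the three longest tasks
best = completion_time([[4, 3, 2, 3], [1]])

# Grouping (b): the four original tasks together, tau_5 alone
worst = completion_time([[4, 3, 2, 1], [3]])
```

Grouping (a) hides τ5's 3 time units behind the gang's longest task (4 units), giving 4 + 1 = 5; grouping (b) pays for τ5 separately, giving 4 + 3 = 7.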
RTG-Sync is a software framework that enables virtual-gang based parallel real-time task scheduling on multicore platforms.

As explained in Section 3, synchronization between the members of each virtual gang is a key requirement for effective virtual gang based scheduling. For a typical multi-threaded process (task), synchronization between the threads of the process can be achieved by using a barrier mechanism available in the parallel programming library it uses (e.g., an OpenMP barrier). However, such a barrier mechanism is tied to the particular parallel programming
framework, which is used by the particular parallel task, and is not designed to be used by disparate tasks for system-level scheduling.

RTG-Sync provides a cross-process synchronization mechanism for virtual gangs by utilizing existing OS-level inter-process communication (IPC) mechanisms. In addition, it provides an API to create and destroy virtual gangs and their memberships. Furthermore, it integrates shared cache partitioning and memory bandwidth throttling mechanisms to bound the impact of interference in hardware resources.

Figure 5: High-level architecture of the RTG-Sync framework. In this figure, τ1 and τ2 are real-time tasks, and the resources allocated to them via the RTG-Sync framework are color-coded.

Figure 5 shows the high-level architecture of RTG-Sync. The user-level component of RTG-Sync provides a specially designed system-wide barrier to each virtual gang so that all its member tasks can be synchronously released and scheduled simultaneously by the kernel-level gang scheduler. At the kernel level, the modifications we have implemented ensure that the virtual gang's execution is protected from interference by best-effort tasks through partitioning of the LLC via page-coloring and a memory bandwidth throttling framework. In the figure, τ1 (on Cores 1 and 2) and τ2 (on Core 3) are periodic real-time tasks of a virtual gang under RTG-Sync. The LLC has eight distinct partitions (colors), of which four are given to τ1, two are given to τ2, and the rest are reserved for best-effort tasks by RTG-Sync. In addition to the coloring-based LLC partitioning, RTG-Sync additionally throttles the maximum memory bandwidth of Core 4, which can schedule best-effort tasks, to the virtual-gang-determined bandwidth thresholds so that the interference impact of best-effort tasks on the virtual gang is bounded.

The RTG-Sync middleware consists of a server daemon and a client program.
The primary service provided by the server is creating virtual gangs and initializing their associated resources. The server receives the number of processes which need to run as a single virtual gang, and creates a new memory-mapped file in a predefined location, which is used for creating a system-wide barrier. When creating a virtual gang, we also specify the maximum memory bandwidth thresholds and LLC partitions (colors) for the best-effort cores (i.e., the cores that are not used to schedule the gang member tasks and can thus schedule any best-effort tasks). These parameters are enforced by the kernel-level mechanisms (Section 4.2) in order to bound the impact of co-scheduled best-effort tasks, if any exist, on the virtual gang. A unique ID value is generated for each virtual gang, which is then used by the member tasks. Each member task makes RTG-Sync user-library calls to map the barrier file into its own address space and to synchronize with other virtual gang members through the barrier.

The API call to register a process as a virtual gang member takes the virtual gang ID value issued by the RTG-Sync server along with the shared-resource requirement information for the calling process. Currently, each gang member task can specify a portion of the LLC space, in terms of colors, that the task is allowed to use. Internally, RTG-Sync makes a system call to record the passed-in parameters into the calling process's task structure in the kernel. Furthermore, it maps the system-wide barrier registered against the passed-in virtual gang ID into the calling process's address space.

Once a process is registered as part of a virtual gang, the call to synchronize gang members is simple. It takes the barrier pointer returned by the aforementioned API call and uses it to synchronize on the barrier. This call must be made by all member tasks at the start of their periodic execution.
As soon as the waiter count of the barrier is reached, the member tasks are unblocked simultaneously, leading to the desired alignment of their periodic execution. Because RTG-Sync requires all member tasks of a virtual gang to share the same period, no additional synchronization is necessary after they are simultaneously released at the beginning.
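The synchronous-release semantics can be illustrated with a minimal barrier sketch. Threads stand in for the gang's member processes here; the actual RTG-Sync barrier lives in a shared memory-mapped file so that disparate processes can use it:

```python
import threading

class GangRelease:
    """Sketch of a gang release point: every member blocks until all
    members of the virtual gang have arrived, then all are released."""
    def __init__(self, num_members: int):
        self._barrier = threading.Barrier(num_members)

    def wait_for_release(self) -> int:
        # all members pass this point together; returns an arrival index
        return self._barrier.wait()

released = []
gang = GangRelease(3)

def member(tid):
    gang.wait_for_release()   # synchronous release of the whole gang
    released.append(tid)      # the member's periodic work would start here

threads = [threading.Thread(target=member, args=(i,)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The design point is that no member starts its periodic work until every member has arrived, which is exactly the job alignment that Figure 3b assumes.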
RTG-Sync uses the RT-Gang framework to provide gang scheduling inside the Linux kernel. We have made several changes to the RT-Gang framework. First, RT-Gang originally uses the SCHED_FIFO priority value of a process to determine gang membership, i.e., different processes which have the same SCHED_FIFO priority are considered part of the same gang and are allowed simultaneous execution. There are two drawbacks to this approach. First, it forces the RMS priority assignment, in that tasks which have the same period must be executed as a single gang. Second, it restricts the gang scheduler to fixed-priority assignment policies only. In RTG-Sync, we have modified the kernel to instead use the virtual gang ID value recorded in the task's control block to check gang membership, thus avoiding the aforementioned problems.

Second, we have integrated PALLOC [47], a page-coloring framework, into RTG-Sync and modified it to allocate pages to a task based on the LLC color-map information stored in the task's control block. We use PALLOC to perform two-level partitioning of the LLC. The first level statically partitions the LLC into two regions, which are assigned to real-time tasks and best-effort tasks via Linux's Cgroups. The second level of partitioning is used to divide the LLC between member tasks of a virtual gang as per the resource requirement information stored in the task's control block.

Lastly, we have extended RT-Gang's memory bandwidth throttling framework to support separate read and write throttling capabilities [8]. This can be used to improve the performance of best-effort tasks without impacting the virtual gang's performance.
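For reference, page coloring derives a page's "color" from the physical-address bits shared between the page frame number and the LLC set index. The following simplified sketch assumes 4 KiB pages and contiguous color bits; real platforms (and PALLOC's configurable bit mask) may differ:

```python
PAGE_SHIFT = 12  # 4 KiB pages (assumption for this sketch)

def page_color(phys_addr: int, num_colors: int) -> int:
    """Color of the page containing phys_addr: the low-order bits of the
    physical page frame number that overlap the LLC set-index bits."""
    return (phys_addr >> PAGE_SHIFT) % num_colors

# With 8 colors, physically consecutive pages cycle through colors 0..7,
# so restricting a task's allocations to a subset of colors confines its
# data to a corresponding slice of the LLC sets.
```

A coloring allocator simply refuses to hand a task any physical page whose color falls outside the task's assigned color set, which is how RTG-Sync keeps best-effort tasks out of the real-time tasks' LLC partition.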
In this section, we describe the gang formation algorithms of RTG-Sync.
For a given candidate-set of N real-time rigid gang tasks with the same period T, and a given multicore platform with m homogeneous CPU cores, we want to form a set of virtual gang tasks such that the total completion time of the virtual gangs is minimized. The member tasks of each virtual gang are synchronously released through the RTG-Sync framework. We further assume that for each task τk in the candidate-set, we can determine its WCET C*_k in isolation, that is, without any co-running real-time or best-effort tasks. Consistent with employed industry practices, we do not restrict our methodology to any specific approach for computing the task WCET, i.e., either measurement or static analysis is acceptable. Note that while the task is executed without any co-runners, by definition C*_k must include the effects of synchronization and resource interference among the concurrent threads of τk.

We first present a brute-force algorithm to solve the virtual gang formation problem and then describe a heuristic-based algorithm. Before delving into the details of the algorithms, we first define key terms which are used in the remainder of this section.

System Configuration:
We define a system configuration (G), given a candidate-set, as a unique combination of virtual gangs which is sufficient to execute every task from the candidate-set. We use the following notation to denote a configuration: G_i = {v_{i,1}, v_{i,2}, ..., v_{i,j}}, where each v_{i,j} = (..., τk, ...) denotes a virtual gang comprising tasks from the candidate-set.

Completion Time:
The completion time of a configuration is defined as the time it takes for all the virtual gangs which are part of the configuration to complete their execution in a given period. Under our framework, the completion time of a configuration is equal to the sum of the WCETs of the virtual gangs which are part of the configuration.
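Putting the two definitions together, a small sketch can enumerate every viable configuration of a candidate-set and rank them by completion time. Single-threaded tasks are assumed, and the WCET values are those of the running example:

```python
from typing import List

def partitions(tasks):
    """Yield every way of partitioning a list of tasks into virtual gangs."""
    if not tasks:
        yield []
        return
    head, rest = tasks[0], tasks[1:]
    for part in partitions(rest):
        for i in range(len(part)):            # join an existing gang
            yield part[:i] + [[head] + part[i]] + part[i + 1:]
        yield [[head]] + part                 # or start a new gang

def viable_configs(wcets: List[float], m: int):
    # a gang of single-threaded tasks is viable if it has at most m members
    return [p for p in partitions(wcets) if all(len(g) <= m for g in p)]

def completion_time(config):
    # sum over gangs of the longest member's WCET (isolation assumption)
    return sum(max(g) for g in config)

configs = viable_configs([4, 3, 2, 1, 3], m=4)
best = min(completion_time(c) for c in configs)
```

For the five single-threaded example tasks on four cores, this yields 51 viable configurations, and the best completion time found is 5 time units, matching the grouping discussed in Section 3.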
The brute-force algorithm for finding the best configuration from the candidate-set is stated in Algorithm 1 and revolves around the following key steps. Given the candidate-set, we generate all possible configurations containing all possible pairings of tasks into virtual gangs (line-4). For this purpose, we write a recursive algorithm which, starting from the simplest configuration of tasks in the candidate-set into virtual gangs, where every task runs as a gang by itself, successively generates more complex configurations by programmatically pairing tasks into viable virtual gangs. (A virtual gang is viable if it requires up to m cores to execute.)

We first compute an estimated completion time of each configuration (line-5). For this step, we optimistically assume that the tasks inside a virtual gang do not interfere with each other or with best-effort tasks. Under this assumption, the execution time in isolation C*_{i,j} of each virtual gang v_{i,j} is equal to the WCET of its longest running constituent task: C*_{i,j} = max_{τk ∈ v_{i,j}} {C*_k}. Once the completion time of each configuration has been computed, all the configurations are ranked based on their completion time, from shortest to longest (lines-6:7). If two configurations have the same completion time, then the one which comprises the smaller number of virtual gangs is given the higher rank. If a tie still exists between multiple configurations, then the one which is computed first by the algorithm is given the higher rank. The rank value is then used to sort the configurations from best to worst, best being the one with the highest rank.

Algorithm 1: Brute-Force Algorithm
1:  Input: Candidate Set (T), Number of Cores (m)
2:  Output: Best System Configuration
3:  function gang_formation(T, m)
4:    configs = generate_system_configs(T, m)
5:    completionTimes = calc_config_times(configs)
6:    rankedConfigs = rank_configs(configs, completionTimes)
7:    bestConfig = pick_best_config(rankedConfigs)
8:    while True do
9:      bestCompletionTime = bestConfig.completionTime
10:     add_interference(bestConfig)
11:     if bestConfig.completionTime ≤ (1 + tolerance) * bestCompletionTime then
12:       break
13:     else
14:       completionTimes = update_completion_times(bestConfig)
15:       rankedConfigs = rank_configs(configs, completionTimes)
16:       newBestConfig = pick_best_config(rankedConfigs)
17:       if newBestConfig == bestConfig then
18:         break
19:       else
20:         bestConfig = newBestConfig
21:   return bestConfig

Iterative Step: Due to interfering effects over shared resources, the actual WCET C_{i,j} of each virtual gang v_{i,j} might be (significantly) larger than its WCET in isolation C*_{i,j}. For this reason, we need to update the completion time of the best configuration G_i (line-10) by determining the WCET value C_{i,j} of each virtual gang in G_i. As mentioned before, we do not impose restrictions on how the WCET estimation is carried out. If a detailed model of the hardware platform and the resource utilization of the tasks is available, then an analytical approach may be used to determine the interference delay, as in [39]. However, for commercial-off-the-shelf platforms, which are the target of this work, such an approach is practically unfeasible. Therefore, we instead propose to empirically determine the WCET of each virtual gang by executing it on the target platform under the synchronization framework of RTG-Sync, possibly in parallel with isolated best-effort tasks. Independently of the WCET estimation method, based on the interference-updated completion time of the configuration, there are two possibilities.
If the completion time with interference is within a specified tolerance threshold (e.g., 20%) of its computed value in isolation, the algorithm can finish and the best system configuration has been computed (line-11). (Based on the criticality level of the task, an additional safety margin might be added to the measurement-based WCET [25, 22].) If, on the other hand, the completion time with interference exceeds the tolerance threshold, the completion times of all configurations that contain one or more of the virtual gangs from the best configuration, and whose execution times changed in the interference evaluation, are recalculated, and the configurations are re-ranked (lines-14:16). If the best configuration stays the same, then the algorithm can finish (line-17). Otherwise, the iterative step is repeated with the new best configuration until the algorithm converges.

Algorithm 2: Greedy Packing Heuristic
  Input: Candidate Set (T), Number of Cores (m)
  Output: Taskset comprising virtual gangs
  function gang_formation(T, m)
    sortedTaskset = sort_tasks_by_compute_time(T)
    virtualGangSet = []
    while not_empty(sortedTaskset) do
      anchorTask = sortedTaskset.pop()
      for nextTask in sortedTaskset do
        if anchorTask.h + nextTask.h <= m then
          sortedTaskset.remove(nextTask)
          anchorTask = create_virtual_gang(anchorTask, nextTask)
      virtualGangSet.append(anchorTask)
    return virtualGangSet

Complexity Analysis:
The worst-case complexity of the brute-force algorithm is linear with respect to the total number of system configurations, due to the while loop on line-8. Given a candidate-set with N tasks and a system with m cores, the worst case with respect to the number of unique system configurations arises when each task is single-threaded. In this case, the total number of system configurations is upper bounded by the following series sum:

  |G| = Σ_{k=⌈N/m⌉}^{N} S(N, k)   (1)

where S(N, k) is the Stirling number of the second kind [21], which can be calculated using the following equation [3]:

  S(N, k) = (1/k!) · Σ_{i=0}^{k} (−1)^i (k choose i) (k − i)^N   (2)

In the context of virtual gang formation, each S(N, k) represents the number of unique ways to partition N tasks into k virtual gangs.

The brute-force algorithm is adequate when the candidate-set is small and the tasks in the candidate-set are heavily parallel. For lightly parallel tasks and large candidate-sets, the complexity of the brute-force algorithm rapidly becomes intractable. For this reason, we present a simple-to-use heuristic for gang formation, shown in Algorithm 2. The first step of the heuristic is to sort the tasks in the candidate-set in decreasing order of their WCETs in isolation C*_k (line-4). We then remove the task with the highest WCET, which we call the anchor task, and pack as many tasks with it for co-execution as permitted by the number of cores m of the platform (lines-7:11), giving preference to tasks with larger WCETs if multiple tasks can be paired with the anchor task. The tasks which are paired off are removed from the candidate-set. We continue this process until the candidate-set is empty (line-6). Once the virtual gangs are formed, we perform the interference evaluation, as described in the previous section, to empirically determine the WCETs C_k of the virtual gangs under the RTG-Sync synchronization framework. For each virtual gang, if the WCET C_k is within an acceptable tolerance threshold (e.g., 20%) of the WCET in isolation C*_k, the virtual gang is accepted. If this is not the case, then, contrary to the brute-force algorithm, the virtual gang is rejected and its member tasks are considered separate gangs; there is no iterative step. The runtime of the heuristic is O(N^2) because of the nested loops on line-6 and line-8.

As stated in Section 2, we consider the rigid real-time gang task model [20] with implicit deadlines.
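To make the two gang-formation strategies concrete, the following Python sketch implements both of them for tasks represented as (WCET, cores) tuples; the representation and all function names are ours, the completion-time estimate is the optimistic one of line-5, and the interference-driven iterative step is omitted. For single-threaded tasks, the number of partitions enumerated by `viable_partitions` is upper-bounded by the Stirling-number sum of Equation (1).

```python
from math import comb, factorial, ceil

def stirling2(n, k):
    """Stirling number of the second kind, Equation (2)."""
    return sum((-1) ** i * comb(k, i) * (k - i) ** n
               for i in range(k + 1)) // factorial(k)

def viable_partitions(tasks, m):
    """Recursively enumerate every partition of `tasks` into viable
    virtual gangs, i.e., gangs whose combined core demand is <= m.
    Each task is a (wcet, cores) tuple with cores <= m."""
    if not tasks:
        yield []
        return
    first, rest = tasks[0], tasks[1:]
    for part in viable_partitions(rest, m):
        # option 1: `first` forms a gang by itself
        yield [[first]] + part
        # option 2: merge `first` into an existing gang, if still viable
        for idx, gang in enumerate(part):
            if first[1] + sum(h for _, h in gang) <= m:
                yield part[:idx] + [gang + [first]] + part[idx + 1:]

def completion_time(config):
    """Optimistic estimate: gangs run one at a time, each for the WCET
    of its longest member (no intra-gang interference assumed)."""
    return sum(max(c for c, _ in gang) for gang in config)

def brute_force(tasks, m):
    """Pick the configuration with the shortest estimated completion
    time; ties are broken in favor of fewer virtual gangs."""
    configs = list(viable_partitions(tasks, m))
    return min(configs, key=lambda cfg: (completion_time(cfg), len(cfg)))

def greedy_pack(tasks, m):
    """Greedy packing heuristic in the spirit of Algorithm 2."""
    pending = sorted(tasks, key=lambda t: t[0], reverse=True)  # by WCET
    gangs = []
    while pending:
        anchor = pending.pop(0)       # remaining task with largest WCET
        gang, used = [anchor], anchor[1]
        for t in pending[:]:          # larger WCETs are tried first
            if used + t[1] <= m:
                pending.remove(t)
                gang.append(t)
                used += t[1]
        gangs.append(gang)
    return gangs
```

For example, four single-threaded tasks on m = 2 cores admit 10 viable configurations, below the upper bound S(4,2) + S(4,3) + S(4,4) = 14 of Equation (1), and both strategies end up pairing the two longest-running tasks together.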
We consider a set of n periodic real-time tasks, denoted by τ = {τ_1, τ_2, ..., τ_n}, and a multicore platform with m identical cores. Each task τ_i is characterized by three parameters, τ_i = (h_i, C_i, T_i), where h_i is the number of cores the task needs to run, C_i is the WCET, and T_i is the period. For RTG-Sync, we assume that each τ_i represents a virtual gang, which may be created by one of the gang formation algorithms described in the previous section; note that in this case, C_i is the updated WCET including inter-task interference effects. Because the underlying gang scheduler schedules these virtual gang tasks one at a time, the exact schedulability test for our system is a straightforward application of the standard unicore response time analysis under the rate-monotonic priority assignment scheme [5]:

  R_i^{n+1} = C_i + Σ_{∀τ_j ∈ hp(τ_i)} ⌈R_i^n / T_j⌉ · C_j   (3)

The taskset is schedulable if the response time of every task is less than its period.

In this section, we present schedulability results comparing RTG-Sync with other parallel real-time task scheduling approaches on synthetically generated parallel real-time tasksets. For this analysis, we ignore best-effort tasks, as we have means (bandwidth throttling and cache partitioning) to bound their impact on the real-time tasks (Section 4). We do, however, consider possible interference among the real-time tasks in the analysis, as RTG-Sync does not protect against bandwidth contention among the real-time tasks within a virtual gang.

For real-time taskset generation, we first uniformly select a period T_i in the range [10, ?]. For each period, we generate N tasks τ_{i,j}, where N is randomly picked from the interval [2, ?], with a WCET in isolation C*_{i,j} in the range [T/?, T/5] and a parallelism level h_{i,j}. Note that C*_{i,j} represents the WCET of the task while running alone, as discussed in Section 5; the actual WCET C_{i,j}, including interference between co-scheduled tasks, depends on the employed scheduling policy, and we thus detail its computation later. The utilization u_{i,j} of each τ_{i,j} is then calculated using the relation u_{i,j} = (C*_{i,j} × h_{i,j}) / T_i. If u_{i,j} is less than the remaining utilization for the taskset, the task is accepted and generation continues until the desired level of utilization is reached; otherwise, C*_{i,j} is adjusted so that τ_{i,j} exactly fills the remaining utilization.

Taskset Types:
Similarly to the work in [45], we consider three types of tasksets in our simulation, based on the allowed level of parallelization h_{i,j} of the tasks in the taskset. For a lightly-parallel taskset, h_{i,j} is uniformly selected in the range [1, ⌈? × m⌉] (m is the number of cores, as defined earlier). For a heavily-parallel taskset, the value of h_{i,j} is picked from the range [⌈? × m⌉, m]. Finally, for a mixed taskset, the parallelization level h_{i,j} is selected randomly from the interval [1, m].

Priority Assignment: We consider the rate-monotonic priority assignment scheme: prio(τ_i) > prio(τ_j) if T_i < T_j. For tasks with the same period, we assign priorities based on the tasks' WCETs: prio(τ_{i,j}) > prio(τ_{i,k}) if C_{i,j} < C_{i,k}.

Scheduling Policies:
For each taskset type, we calculate the schedulability results under four different scheduling policies. Under the RT-Gang policy, the unicore response time analysis of Equation 3 is applied to calculate the schedulability of the taskset under one-gang-at-a-time scheduling. For RTG-Sync, we first form virtual gangs from the given taskset and then use Equation 3 to calculate the schedulability of the new taskset comprising the virtual gangs; under RTG-Sync (BFC), virtual gangs are formed using the brute-force method, whereas in RTG-Sync (GPC), the greedy packing heuristic is used to form the virtual gangs. For the Gang-FTP policy, we use the analysis in [45] to calculate the schedulability of the taskset under gang fixed-priority scheduling. (Note that the analysis in [45] applies to a bundled gang model, but since the bundled model generalizes the rigid gang model, it still applies to our case.) Concerning other schedulability analyses for the rigid gang model, we do not employ [15, 16] because they assume Gang EDF rather than FTP; nor do we compare against [19], as it requires creating static execution patterns over a hyper-period, which both significantly complicates the runtime and requires strictly periodic tasks. Finally, the Threaded scheme models the scheduling of parallel tasks under the vanilla Linux real-time scheduler, where the h_{i,j} threads of each task τ_{i,j} are independently scheduled. In this case, we assess schedulability based on the state-of-the-art analysis for fixed-priority thread scheduling of DAG tasks in [18]; here τ_{i,j} is simply modeled as a DAG of h_{i,j} nodes with the same execution time. (While we could also model the tasks according to the fork-join parallel model, as noted in [45], the state-of-the-art analysis for fork-join tasks would not perform better than [18].)

Interference Model: We incorporate a simple interference model into our analysis to compute the WCET C_{i,j} of each task, mimicking shared resource interference between co-scheduled tasks on real platforms. For RT-Gang, we simply set C_{i,j} = C*_{i,j}, as under this policy there is no co-scheduling among gangs by design. For all other policies, we consider both an ideal case where C_{i,j} = C*_{i,j}, and a more realistic case where C_{i,j} is derived from C*_{i,j} based on the interference model. According to this model, we randomly generate a resource-demand factor r_{i,j} for each task τ_{i,j} in the range [0, ?], and compute a combined demand R_{i,j} for each task, which is the sum of the demand of the task and the combined demand of the set of maximally resource-intensive tasks which can get co-scheduled with the task under analysis as per the scheduling policy. We use R_{i,j} to scale the WCET of the task as follows: C_{i,j} = C*_{i,j} × max(R_{i,j}, 1). In RTG-Sync (BFC), while calculating the completion time of each system configuration (line-5 of Algorithm 1), the set of co-scheduled tasks is determined based on the virtual gangs and the completion time of each configuration is adjusted; in this case, the algorithm takes interference into account in selecting the optimal virtual gangs. In RTG-Sync (GPC), on the other hand, interference is taken into account, according to the strategy mentioned earlier, only after the best virtual gangs have been formed as per the heuristic. For Gang-FTP, we enumerate all possible sets of co-running tasks based on the remaining number of cores m − h_{i,j}, and pick the set with the maximal combined demand. For Threaded, we assume that each independently scheduled thread of τ_{i,j} has a resource demand of r_{i,j}/h_{i,j}, and pick the m − h_{i,j} co-running threads with the maximal combined demand.
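A minimal sketch of this WCET-scaling rule, assuming the scaling C_{i,j} = C*_{i,j} × max(R_{i,j}, 1) and leaving the policy-specific choice of co-scheduled demands to the caller (the function and parameter names are ours):

```python
def interfered_wcet(c_isolation, r_task, r_coscheduled):
    """Scale a task's isolation WCET C* by its combined demand R,
    where R is the task's own resource-demand factor plus the demands
    of the maximally resource-intensive tasks that the scheduling
    policy may co-schedule with it (supplied by the caller)."""
    R = r_task + sum(r_coscheduled)
    # the scaled WCET never drops below the isolation WCET
    return c_isolation * max(R, 1.0)
```

Under RT-Gang the co-scheduled set is empty, so the WCET stays at its isolation value; under the co-scheduling policies the combined demand inflates it.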
Figure 6 Schedulability results of the analyzed policies for different taskset types on 8 cores. Dashed lines are used when interference is not considered, while solid lines are used when it is incorporated in the analysis. For RT-Gang, because only one task can be scheduled at a time, interference cannot occur by design (thus no dashed lines).
Figure 6 shows the schedulability plots for 8 cores (m = 8) for the three different types of tasksets (lightly parallel, mixed, and heavily parallel), which differ in the degree of parallelism of the individual real-time tasks in the generated tasksets.

For lightly parallel tasksets, the Threaded scheme gives the best schedulability results, followed closely by the Gang-FTP policy, if the interference model is not used (dashed lines). However, when interference is considered (solid lines), the schedulability under these policies deteriorates rapidly. As expected, RT-Gang suffers the most for lightly parallel tasks, as they under-utilize the cores. In comparison, both RTG-Sync (GPC) and RTG-Sync (BFC) are significantly better than RT-Gang because the creation of virtual gangs improves utilization. Note that the effect of interference is less significant in RTG-Sync compared to Gang-FTP and Threaded. This is due to the fact that under RTG-Sync, only the tasks of the same virtual gang can possibly interfere with each other, while many more tasks must be considered in Gang-FTP and Threaded.

For mixed and heavily parallel tasksets, both RTG-Sync policies perform very similarly and outperform the rest, regardless of whether interference is considered or not. RT-Gang improves considerably, as a single parallel task can utilize more cores in the platform, though it still lags behind RTG-Sync. On the other hand, Gang-FTP and Threaded are significantly worse than RTG-Sync in both mixed and heavily parallel tasksets, and worse than RT-Gang in the heavily parallel tasksets. This can be attributed to the analysis pessimism needed to handle carry-in jobs in their schedulability analyses [45, 18], which becomes more pronounced as the parallelism of the tasks increases. Because both RTG-Sync and RT-Gang use unicore fixed-priority schedulability techniques, they do not suffer from such analysis pessimism.
Finally, in all cases, the impact of interference becomes less prominent as the parallelization of the taskset increases. This is because with highly parallel tasks, the opportunity of getting co-scheduled with other resource-intensive tasks decreases, leading to improved schedulability.
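The RTG-Sync and RT-Gang schedulability curves above rest on the unicore response-time recurrence of Equation (3); a minimal fixed-point implementation of that test (the naming and the (C, T)-pair task representation are ours) could look like:

```python
from math import ceil

def response_time(i, tasks):
    """Iterate Equation (3): R_i^{n+1} = C_i + sum over higher-priority
    tasks tau_j of ceil(R_i^n / T_j) * C_j, starting from R_i^0 = C_i.
    `tasks` is priority-ordered (index 0 = highest priority); each
    entry is a (C, T) pair."""
    C_i, T_i = tasks[i]
    R = C_i
    while True:
        R_next = C_i + sum(ceil(R / T_j) * C_j for C_j, T_j in tasks[:i])
        if R_next == R or R_next > T_i:   # converged, or deadline missed
            return R_next
        R = R_next

def schedulable(tasks):
    """A taskset is schedulable if every response time fits within the
    period. Priorities are rate-monotonic: shorter period, higher
    priority."""
    ordered = sorted(tasks, key=lambda t: t[1])
    return all(response_time(i, ordered) <= ordered[i][1]
               for i in range(len(ordered)))
```

For a virtual-gang taskset, each (C, T) entry would be one virtual gang with its interference-updated WCET, since the gang scheduler runs gangs one at a time.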
For the RTG-Sync policies, the parameter N, which denotes the number of tasks generated for each period, is crucial. A high value of N increases the possibilities of virtual gang formation and improves the schedulability under the RTG-Sync policies.
Figure 7 Effect of the number of tasks per period (N) on the RTG-Sync policies for mixed tasksets.

Figure 7 shows the effect of N on the performance of the RTG-Sync policies. In this figure, the schedulability results for the RTG-Sync policies are plotted for mixed taskset types while statically changing the value of N between experiments and keeping the remaining setup the same as detailed in Section 6.1. It can be seen that when N = 2, the schedulability curves under the RTG-Sync policies are much closer to that of RT-Gang. The reason for this is that with N = 2, for each unique period T, only two tasks exist which can possibly form a gang. However, the parallelism level of these tasks may be such that gang formation is not possible. With N = 5, the improvement in schedulability under the RTG-Sync policies over RT-Gang is more pronounced. Finally, for N = 10, the RTG-Sync policies provide a significant improvement in schedulability over RT-Gang.

Another observation from this figure is that under the interference model, increasing the value of N makes the difference in schedulability between the RTG-Sync brute-force algorithm and the heuristic more apparent. This indicates that as the possibilities of virtual gang formation increase, the heuristic becomes less likely to find the optimal virtual gang combination. This is to be expected, since the heuristic does not take the interference among gang members into account during the gang formation process. Although it may be possible to design a more sophisticated heuristic for virtual gang formation, this would require incorporating an analytical interference model into the algorithm. As discussed in Section 5, this is highly challenging for commercial-off-the-shelf platforms.
Therefore, in this paper we preferred to demonstrate that we can still achieve significant improvements in schedulability with an algorithm that is interference-model agnostic, while deferring more complex, analytical methods for platforms where a detailed hardware model can be constructed to future work. (For the sake of simplicity, we do not include the schedulability curves for Gang-FTP and Threaded, because these policies are not fundamentally affected by changing the value of N.)

In this section, we describe the evaluation results of RTG-Sync on a real multicore platform.
We use NVIDIA's Jetson TX2 [1] board for our evaluation experiments with RTG-Sync. The Jetson TX2 board has a heterogeneous multicore cluster comprising six CPU cores (4 Cortex-A57 + 2 Denver). On the software side, we use Linux kernel version 4.4 and patch it with a modified version of RT-Gang [2] to enable real-time gang scheduling at the kernel level, along with the best-effort task throttling and page-coloring frameworks, adding ~? lines to the architecture-neutral part of the Linux kernel. In all our experiments, we put our evaluation platform in maximum performance mode, which involves statically maximizing the CPU and memory bus clock frequencies and disabling the dynamic frequency scaling governor. We also turn off the GUI and networking components and lower the run-level of the system (5 → 3) to keep background system services to a minimum.
In this case-study, we demonstrate the effectiveness of using virtual gangs to improve system utilization, compared to "one-gang-at-a-time" scheduling and Linux's default scheduler.
Table 2 Taskset parameters for the case-study

  Task           WCET (ms)   Period (ms)
  τ^RT_BWT       ?           100
  τ^RT_DNN-1     ?           50
  τ^RT_DNN-2     ?           50
  τ^BE_cutcp     ∞           N/A
  τ^BE_lbm       ∞           N/A
The taskset for the case-study is shown in Table 2. It consists of three real-time tasks and two best-effort ones. For the real-time tasks, we use the DNN workload from DeepPicar [7] as two of the real-time tasks, τ^RT_DNN-1 and τ^RT_DNN-2. Both DNN tasks use two threads each and have the same period of 50 ms. We use the synthetic bandwidth-rt benchmark as the third real-time task, τ^RT_BWT, which uses 4 threads and has a period of 100 ms. τ^RT_BWT is designed to be oblivious to shared resource interference, but it creates significant shared hardware resource contention for the DNN tasks under co-scheduling. As per the RMS priority assignment, we assign higher real-time priority to the DNN tasks than to the bandwidth-rt task.

For best-effort tasks, we use two benchmarks from the Parboil benchmark suite [41]. Among the best-effort tasks, τ^BE_lbm is significantly more memory intensive than τ^BE_cutcp. Both best-effort tasks use two threads each and are pinned to disjoint CPU cores.

We evaluate the performance of this taskset on the Jetson TX2 under three scenarios. The Linux scenario represents the scheduling of the taskset under the vanilla Linux kernel. (We do not use the Denver cores because of their lack of support for the hardware performance counters necessary to implement the throttling mechanism.) In the RT-Gang scheme, the real-time tasks are gang scheduled with the one-gang-at-a-time policy. Finally, under RTG-Sync, we create a virtual gang, which is comprised of the two real-time DNN tasks. We assign 3/4 of the LLC to the virtual gang (the two DNN tasks) and the remaining 1/4 of the cache to the best-effort tasks. We do not, however, apply partitioning between the DNN tasks, as sharing the cache space is beneficial in this case.

Figure 8 Distribution of job duration for τ^RT_DNN-1 (CDF; annotated durations: 8.5, 9.0, and 11.3 ms for RT-Gang, RTG-Sync, and Linux, respectively).

Figure 8 shows the cumulative distribution function of the job execution times of τ^RT_DNN-1 under the three compared schemes. Note that this task has the highest real-time priority in our case-study. In this figure, the performance of τ^RT_DNN-1 remains highly deterministic under both RT-Gang and RTG-Sync. In both cases, the observed WCET of τ^RT_DNN-1 stays within ? of its solo WCET (i.e., the measured WCET in isolation, from Table 2). However, under the baseline Linux kernel (denoted as Linux), the job execution times of τ^RT_DNN-1 vary significantly, with the observed WCET approaching ? of the solo WCET.

The difference among the observed performance of τ^RT_DNN-1 under the three scenarios can be better explained by analyzing the execution trace of the taskset in one hyper-period of 100 ms, which is shown in Figure 9. Inset 9a displays the execution timeline under vanilla Linux. It can be seen that the DNN tasks suffer from two main sources of interference in this scenario. Whenever the execution of the DNN tasks overlaps with the execution of τ^RT_BWT, the execution time of the task increases. The execution time also increases when the DNN tasks get co-scheduled with the best-effort tasks. Note that the system is not regulated in any way in this scenario. Therefore, the effect of shared resource interference is difficult to predict, as evidenced in the CDF plot of Figure 8, which shows highly variable timing behavior. Under RT-Gang, on the other hand, the execution of the DNN tasks is almost completely deterministic. Due to the restrictive one-gang-at-a-time scheduling policy, co-scheduling of the DNN tasks with τ^RT_BWT is not possible.
Moreover, the shared resource interference from the best-effort tasks is strictly regulated due to LLC partitioning and the kernel-level throttling framework.

Figure 9 Annotated KernelShark trace snapshots of the case-study scenarios for one hyper-period of 100 ms: (a) Linux default, (b) RT-Gang, (c) RTG-Sync.

However, under RT-Gang, each DNN task executes as a separate gang by itself, which means that two system cores are left unusable for real-time tasks while the DNN tasks are executing, because of the one-gang-at-a-time policy. This reduces the share of total system utilization of the multicore platform that can be used by other real-time tasks. Although the idle cores are utilized by best-effort tasks, the strict regulation imposed by the DNN tasks means that the best-effort tasks are mostly throttled when they are co-scheduled with the DNN tasks. Under RTG-Sync, both of these problems are solved by pairing τ^RT_DNN-1 and τ^RT_DNN-2 into a single virtual gang. In this case, the system is fully utilizable by real-time tasks. The execution of the virtual DNN gang is completely deterministic due to the synchronization framework of RTG-Sync. Moreover, since there is no co-scheduling of best-effort tasks with real-time tasks, the throttling framework does not get activated, and any slack duration left by the real-time tasks can be utilized completely by the best-effort tasks without imposing throttling.

The runtime overhead due to RTG-Sync can be broken down into two parts. First is the overhead due to synchronization. This overhead is incurred only once, during the setup phase of the real-time tasks which are members of the same virtual gang, and it does not contribute to the WCETs of the periodic jobs. Second, a kernel-level overhead is incurred due to the simultaneous scheduling of real-time tasks by the gang scheduler. Since we use the RT-Gang framework for this purpose, the kernel-level overhead is the same as reported in [4], which showed negligible overhead on a quad-core platform.
Parallel real-time tasks are generally modeled using one of the following three models: the fork-join model [29, 38, 35, 11], the DAG model [6, 37, 18], and the gang task model [9, 19, 20, 26, 16, 15]. In the fork-join model, a task alternates between parallel (fork) and sequential (join) phases over time. In the DAG model, which is a generalization of the fork-join model, a task is represented as a directed acyclic graph with a set of associated precedence constraints, which allows more flexible scheduling as long as the constraints are satisfied. Lastly, the gang model is further divided into three categories. Under the rigid gang task model [26, 16, 19], the number of cores required by the gang is determined prior to scheduling and stays constant throughout its execution. In the moldable gang model [9], the number of cores required by the gang is determined by the scheduler on a per-job basis, but once determined, it is assumed to stay constant throughout the execution of the job. Finally, in the malleable gang model [12], the number of cores required by the job can change during the job's execution. Recently, a bundled gang model was proposed in [45], which is a generalization of the rigid gang model that allows more flexible parallel task modeling at the cost of increased analysis complexity.

In the real-time systems community, fixed-priority and dynamic-priority real-time versions of gang scheduling policies, namely Gang FTP and Gang EDF, respectively, have been studied and analyzed [20, 26, 16]. However, these prior real-time gang scheduling policies do not consider interference caused by shared hardware resources in multicore processors. On the other hand, the Isolation Scheduling model [23] and a recently proposed integrated modular avionics (IMA) scheduler design [34] consider shared resource interference and limit co-scheduling to the tasks of the same criticality (in [23]) or those in the same IMA partition (in [34]). However, they do not specifically target parallel real-time tasks and do not allow co-scheduling of best-effort tasks. Also, to the best of our knowledge, none of the aforementioned real-time scheduling policies have been implemented in actual operating systems. Recently, a restrictive form of gang scheduling policy, which limits scheduling to just one gang task at a time, was proposed and implemented in Linux as an open-source project [4, 2]. The gang scheduler, called RT-Gang, provides strong temporal isolation by avoiding and bounding shared resource interference. However, it can significantly under-utilize computing resources in scheduling critical real-time tasks. Our work leverages the open-source RT-Gang scheduler and develops mechanisms and methodologies that improve the real-time schedulability of the system at a marginal cost in terms of execution time predictability.

Many researchers have attempted to make COTS multicore platforms more predictable with OS-level techniques. A majority of prior works focused on partitioning shared resources among tasks and cores to improve predictability. Page coloring has long been studied to partition shared caches [30, 31, 50, 40, 14, 44, 27, 46], DRAM banks [47, 32, 42], and the TLB [36]. Mancuso et al. [33] and Kim et al. [28] used both coloring and cache way partitioning [24] for fine-grained cache partitioning. While these shared resource partitioning techniques can reduce space conflicts on some shared resources, and are hence beneficial for predictability, they are often not enough to guarantee strong time predictability on COTS multicore platforms because of many undisclosed yet important shared hardware resources [43, 8]. Furthermore, partitioning techniques can lower performance and efficiency and are difficult to apply to parallel tasks.
We presented a virtual gang based parallel real-time task scheduling approach for multicore platforms. Our approach is based on the notion of a virtual gang, a group of parallel real-time tasks that are statically linked and scheduled together as a single scheduling entity. We presented an intra-gang synchronization framework and virtual gang formation algorithms that enable strong temporal isolation and high real-time schedulability in scheduling parallel real-time tasks on COTS multicore platforms. We evaluated our approach both analytically and empirically on a real embedded multicore platform using real-world workloads. Our evaluation results showed the effectiveness and practicality of our approach. In the future, we plan to extend our approach to support heterogeneous cores and accelerators such as GPUs.
References

[1] Jetson TX2 module. https://developer.nvidia.com/embedded/jetson-tx2.
[2] RT-Gang code repository. https://github.com/CSL-KU/RT-Gang.
[3] Stirling number of the second kind. http://mathworld.wolfram.com/StirlingNumberoftheSecondKind.html.
[4] Waqar Ali and Heechul Yun. RT-Gang: Real-time gang scheduling framework for safety-critical systems. In Real-Time and Embedded Technology and Applications Symposium (RTAS), 2019.
[5] N. Audsley, A. Burns, M. Richardson, K. Tindell, and A. Wellings. Applying new scheduling theory to static priority preemptive scheduling. Software Engineering Journal, 8(5):284–292, 1993.
[6] Sanjoy Baruah, Vincenzo Bonifaci, Alberto Marchetti-Spaccamela, Leen Stougie, and Andreas Wiese. A generalized parallel task model for recurrent real-time processes. In Real-Time Systems Symposium (RTSS), pages 63–72. IEEE, 2012.
[7] Michael Garrett Bechtel, Elise McEllhiney, and Heechul Yun. DeepPicar: A low-cost deep neural network-based autonomous car. In Embedded and Real-Time Computing Systems and Applications (RTCSA), 2018.
[8] Michael Garrett Bechtel and Heechul Yun. Denial-of-service attacks on shared cache in multicore: Analysis and prevention. In Real-Time and Embedded Technology and Applications Symposium (RTAS), 2019.
[9] Vandy Berten, Pierre Courbin, and Joël Goossens. Gang fixed priority scheduling of periodic moldable real-time tasks. In Junior Researcher Workshop Session of the 19th International Conference on Real-Time and Network Systems (RTNS), pages 9–12, 2011.
[10] Certification Authorities Software Team. CAST-32A: Multi-core Processors. Technical report, Federal Aviation Administration (FAA), November 2016.
[11] Hoon Sung Chwa, Jinkyu Lee, Kieu-My Phan, Arvind Easwaran, and Insik Shin. Global EDF schedulability analysis for synchronous parallel tasks on multicore platforms. In Euromicro Conference on Real-Time Systems (ECRTS), pages 25–34. IEEE, 2013.
[12] Sébastien Collette, Liliana Cucu, and Joël Goossens. Integrating job parallelism in real-time scheduling theory. Information Processing Letters, 106(5):180–187, 2008.
[13] Leonardo Dagum and Ramesh Menon. OpenMP: An industry-standard API for shared-memory programming. Computing in Science & Engineering, (1):46–55, 1998.
[14] Xiaoning Ding, Kaibo Wang, and Xiaodong Zhang. SRM-Buffer: An OS buffer management technique to prevent last level cache from thrashing in multicores. In Proceedings of the Sixth Conference on Computer Systems (EuroSys), pages 243–256, 2011.
[15] Z. Dong and C. Liu. Analysis techniques for supporting hard real-time sporadic gang task systems. In Real-Time Systems Symposium (RTSS), pages 128–138. IEEE, 2017.
[16] Zheng Dong and Cong Liu. Analysis techniques for supporting hard real-time sporadic gang task systems. In Real-Time Systems Symposium (RTSS), pages 128–138, 2017.
[17] Dror G. Feitelson and Larry Rudolph. Gang scheduling performance benefits for fine-grain synchronization. Journal of Parallel and Distributed Computing, 16(4):306–318, 1992.
[18] José Fonseca, Geoffrey Nelissen, and Vincent Nélis. Improved response time analysis of sporadic DAG tasks for global FP scheduling. In International Conference on Real-Time Networks and Systems (RTNS), pages 28–37, 2017.
[19] Joël Goossens and Pascal Richard. Optimal scheduling of periodic gang tasks. Leibniz Transactions on Embedded Systems (LITES), 3(1):04:1–04:18, 2016.
[20] Joël Goossens and Vandy Berten. Gang FTP scheduling of periodic and parallel rigid real-time tasks. In International Conference on Real-Time Networks and Systems (RTNS), pages 189–196, 2010.
[21] Ronald L. Graham, Donald E. Knuth, and Oren Patashnik. Concrete Mathematics: A Foundation for Computer Science. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1989.
[22] Vance Hilderman and Tony Baghi. Avionics Certification: A Complete Guide to DO-178 (Software), DO-254 (Hardware). Avionics Communications, 2007.
[23] Pengcheng Huang, Georgia Giannopoulou, Rehan Ahmed, Davide B. Bartolini, and Lothar Thiele. An isolation scheduling model for multicores. In Real-Time Systems Symposium (RTSS), pages 141–152. IEEE, 2015.
[24] Intel. Improving real-time performance by utilizing cache allocation technology. https://software.intel.com/en-us/articles/introduction-to-cache-allocation-technology.
[25] Seo-Hyun Jeon, Jin-Hee Cho, Yangjae Jung, Sachoun Park, and Tae-Man Han. Automotive hardware development according to ISO 26262. In , pages 588–592. IEEE, 2011.
[26] Shinpei Kato and Yutaka Ishikawa. Gang EDF scheduling of parallel task systems. In Real-Time Systems Symposium (RTSS), pages 459–468. IEEE, 2009.
[27] H. Kim, A. Kandhalu, and R. Rajkumar. A coordinated approach for practical OS-level cache management in multi-core real-time systems. In Euromicro Conference on Real-Time Systems (ECRTS), pages 80–89, 2013.
[28] Namhoon Kim, Bryan C. Ward, Micaiah Chisholm, James H. Anderson, and F. Donelson Smith. Attacking the one-out-of-m multicore problem by combining hardware management with mixed-criticality provisioning. Real-Time Systems, 53(5):709–759, 2017.
[29] Karthik Lakshmanan, Shinpei Kato, and Ragunathan Rajkumar. Scheduling parallel real-time tasks on multi-core processors. In Real-Time Systems Symposium (RTSS), pages 259–268. IEEE, 2010.
[30] J. Liedtke, H. Härtig, and M. Hohmuth. OS-controlled cache predictability for real-time systems. In Real-Time and Embedded Technology and Applications Symposium (RTAS), pages 213–224, 1997.
[31] Jiang Lin, Qingda Lu, Xiaoning Ding, Zhao Zhang, Xiaodong Zhang, and P. Sadayappan. Gaining insights into multicore cache partitioning: Bridging the gap between simulation and real systems. In IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 367–378, 2008.
[32] Lei Liu, Z. Cui, Mingjie Xing, Y. Bao, M. Chen, and Chengyong Wu. A software memory partition approach for eliminating bank-level interference in multicore systems. In International Conference on Parallel Architectures and Compilation Techniques (PACT), pages 367–375, 2012.
[33] R. Mancuso, R. Dudko, E. Betti, M. Cesati, M. Caccamo, and R. Pellizzoni. Real-time cache management framework for multi-core architectures. In
IEEE Real-Time and EmbeddedTechnology and Applications Symposium (RTAS) , pages 45–54, 2013. doi:10.1109/RTAS.2013.6531078 . Alessandra Melani, Renato Mancuso, Marco Caccamo, Giorgio Buttazzo, Johannes Freitag,and Sascha Uhrig. A scheduling framework for handling integrated modular avionic systemson multicore platforms. In
Embedded and Real-Time Computing Systems and Applications(RTCSA) , pages 1–10. IEEE, 2017. Geoffrey Nelissen, Vandy Berten, Joël Goossens, and Dragomir Milojevic. Techniques optimiz-ing the number of processors to schedule multi-threaded tasks. In
Euromicro Conference onReal-Time Systems (ECRTS) , pages 321–330. IEEE, 2012. . Ali and R. Pellizzoni and H. Yun XX:23 S. A. Panchamukhi and F. Mueller. Providing task isolation via tlb coloring. In
IEEEReal-Time and Embedded Technology and Applications Symposium (RTAS) , pages 3–13, 2015. doi:10.1109/RTAS.2015.7108391 . Abusayeed Saifullah, David Ferry, Jing Li, Kunal Agrawal, Chenyang Lu, and Christopher DGill. Parallel real-time scheduling of DAGs.
Parallel and Distributed Systems, IEEE Transac-tions on , 25(12):3242–3252, 2014. Abusayeed Saifullah, Jing Li, Kunal Agrawal, Chenyang Lu, and Christopher Gill. Multi-corereal-time scheduling for generalized parallel task models.
Real-Time Systems , 49(4):404–435,2013. URL: https://link.springer.com/content/pdf/10.1007{%}2Fs11241-012-9166-9.pdf , doi:10.1007/s11241-012-9166-9 . Martin Schoeberl, Sahar Abbaspour, et al. T-crest: Time-predictable multi-core architecturefor embedded systems.
Journal of Systems Architecture , 61(9):449–471, 2015. L. Soares, D. Tam, and M. Stumm. Reducing the harmful effects of last-level cache polluterswith an os-level, software-only pollute buffer. In
IEEE/ACM International Symposium onMicroarchitecture (MICRO) , pages 258–269, 2008. doi:10.1109/MICRO.2008.4771796 . John A. Stratton, Christopher Rodrigues, I-Jui Sung, Nady Obeid, Li-Wen Chang, NasserAnssari, Geng Daniel Liu, and Wen mei W. Hwu. Parboil: A revised benchmark suite forscientific and commercial throughput computing. Technical report, University of Illinois atUrbana-Champaign, 2012. N. Suzuki, H. Kim, D. d. Niz, B. Andersson, L. Wrage, M. Klein, and R. Rajkumar. Coordinatedbank and cache coloring for temporal protection of memory accesses. In
IEEE InternationalConference on Computational Science and Engineering (CSE) , pages 685–692, 2013. doi:10.1109/CSE.2013.106 . Prathap Kumar Valsan, Heechul Yun, and Farzad Farshchi. Taming non-blocking caches toimprove isolation in multicore real-time systems. In
Real-Time and Embedded Technology andApplications Symposium (RTAS) , 2016. B. C. Ward, J. L. Herman, C. J. Kenna, and J. H. Anderson. Making shared caches morepredictable on multicore platforms. In
Euromicro Conference on Real-Time Systems (ECRTS) ,pages 157–167, 2013. doi:10.1109/ECRTS.2013.26 . S. Wasly and R. Pellizzoni. Bundled scheduling of parallel real-time tasks. In
Real-Time andEmbedded Technology and Applications Symposium (RTAS) , pages 130–142. IEEE, 2019. Y. Ye, R. West, Z. Cheng, and Y. Li. Coloris: A dynamic cache partitioning system using pagecoloring. In
International Conference on Parallel Architecture and Compilation Techniques(PACT) , pages 381–392, 2014. doi:10.1145/2628071.2628104 . H. Yun, R. Mancuso, Z. Wu, and R. Pellizzoni. PALLOC: DRAM bank-aware memory allocatorfor performance isolation on multicore platforms. In
IEEE Real-Time and Embedded Technologyand Applications Symposium (RTAS) , pages 155–166, 2014. doi:10.1109/RTAS.2014.6925999 . Heechul Yun, Rodolfo Pellizzon, and Prathap Kumar Valsan. Parallelism-aware memoryinterference delay analysis for cots multicore systems. In , pages 184–195. IEEE Computer Society, 2015. URL: https://doi.org/10.1109/ECRTS.2015.24 , doi:10.1109/ECRTS.2015.24 . Heechul Yun, Gang Yao, Rodolfo Pellizzoni, Marco Caccamo, and Lui Sha. Memguard: Memorybandwidth reservation system for efficient performance isolation in multi-core platforms. In
Real-Time and Embedded Technology and Applications Symposium (RTAS) , 2013. Xiao Zhang, Sandhya Dwarkadas, and Kai Shen. Towards practical page coloring-basedmulticore cache management. In
Proceedings of the 4th ACM European Conference onComputer Systems , EuroSys ’09, pages 89–102, 2009. URL: http://doi.acm.org/10.1145/1519065.1519076 , doi:10.1145/1519065.1519076doi:10.1145/1519065.1519076