Dependency Graph Approach for Multiprocessor Real-Time Synchronization
Jian-Jia Chen, Georg von der Brüggen, Junjie Shi, and Niklas Ueter
TU Dortmund University, Germany
Abstract—Over the years, many multiprocessor locking protocols have been designed and analyzed. However, the performance of these protocols highly depends on how the tasks are partitioned and prioritized and how the resources are shared locally and globally. This paper answers a few fundamental questions when real-time tasks share resources in multiprocessor systems. We explore the fundamental difficulty of the multiprocessor synchronization problem and show that a very simplified version of this problem is NP-hard in the strong sense regardless of the number of processors and the underlying scheduling paradigm. Therefore, the allowance of preemption or migration does not reduce the computational complexity. On the positive side, we develop a dependency-graph approach that is specifically useful for frame-based real-time tasks, in which all tasks have the same period and always release their jobs at the same time. We present a series of algorithms with speedup factors between 2 and 3 under semi-partitioned scheduling. We further explore methodologies and tradeoffs of preemptive against non-preemptive scheduling algorithms and of partitioned against semi-partitioned scheduling algorithms. The approach is extended to periodic tasks under certain conditions.

Introduction

In a multi-tasking system, mutual exclusion for the accesses to shared resources, e.g., data structures, files, etc., has to be guaranteed to ensure the correctness of these operations. Such accesses to shared resources are typically performed within so-called critical sections, which can be protected by binary semaphores or mutex locks. Hence, at any point in time, no two task instances are in critical sections that access the same shared resource. Moreover, advanced embedded computing systems heavily interact with the physical world, and timeliness of computation is an essential requirement of correctness.
To ensure safe operation of such embedded systems, the satisfaction of the real-time requirements, i.e., worst-case timeliness, needs to be verified.

If aborting or restarting a critical section is not allowed, then, due to mutual exclusion, a higher-priority job may have to be blocked until a lower-priority job unlocks the requested shared resource that it locked earlier, a so-called priority inversion. The study of mutual exclusion in uniprocessor real-time systems can be traced back to the priority inheritance protocol (PIP) and the priority ceiling protocol (PCP) by Sha et al. [40] in 1990 and the stack resource policy (SRP) by Baker [5] in 1991. The Immediate PCP, a variant of the PCP, has been implemented in Ada (called Ceiling Locking) and POSIX (called the Priority Protect Protocol).

To schedule real-time tasks on multiprocessor platforms, three paradigms have been widely adopted: partitioned, global, and semi-partitioned scheduling. The partitioned scheduling approach partitions the tasks statically among the available processors, i.e., a task is always executed on its assigned processor. The global scheduling approach allows a task to migrate from one processor to another at any time. The semi-partitioned scheduling approach decides statically whether a task is divided into subtasks and how each task/subtask is then assigned to a processor. A comprehensive survey of multiprocessor scheduling in real-time systems can be found in [16].

The design of synchronization protocols for real-time tasks on multiprocessor platforms started with the distributed priority ceiling protocol (DPCP) [39], followed by the multiprocessor priority ceiling protocol (MPCP) [38]. The MPCP is based on partitioned fixed-priority scheduling and adopts the PCP for local resources. When requesting global resources that are shared by several tasks on different processors, the MPCP executes the corresponding critical sections with priority boosting.
By contrast, under the DPCP, the sporadic/periodic real-time tasks are scheduled based on partitioned fixed-priority scheduling, except when accessing resources that are bound to a different processor. That is, the DPCP is semi-partitioned scheduling that allows migration at the boundary of critical and non-critical sections. (Neither of these two protocols had a concrete name in the original papers. In the literature, most authors refer to the protocol in [39] as the DPCP and to the one in [38] as the MPCP.)

Over the years, many locking protocols have been designed and analyzed, including the multiprocessor stack resource policy (MSRP) [20], the flexible multiprocessor locking protocol (FMLP) [7], the multiprocessor PIP [17], the O(m) locking protocol (OMLP) [11], the Multiprocessor Bandwidth Inheritance protocol (M-BWI) [19], gEDF-vpr [2], LP-EE-vpr [3], and the Multiprocessor resource sharing Protocol (MrsP) [12]. Also, several protocols for hybrid scheduling approaches, such as clustered scheduling [10], reservation-based scheduling [19], and open real-time systems [33], have been proposed in recent years. To support nested critical sections, Ward and Anderson [46], [47] introduced the Real-time Nested Locking Protocol (RNLP) [46], which adds support for fine-grained nested locking on top of non-nested protocols.

However, the performance of these protocols highly depends on 1) how the tasks are partitioned and prioritized, 2) how the resources are shared locally and globally, and 3) whether a blocked job/task should spin or suspend itself.

Regarding task partitioning, Lakshmanan et al. [28] presented a synchronization-aware partitioning heuristic for the MPCP, which organizes the tasks that share common resources into groups and attempts to assign each group of tasks to the same processor. Following the same principle, Nemati et al. [34] presented a blocking-aware partitioning method that uses an advanced cost heuristic algorithm to split a task group when the entire group fails to be assigned to one processor. In subsequent work, Hsiu et al. [23] proposed a dedicated-core framework to separate the execution of critical sections and normal sections, and employed a priority-based mechanism for resource sharing, such that each request can be blocked by at most one lower-priority request. Wieder and Brandenburg [49] proposed a greedy slacker partitioning heuristic in the presence of spin locks. The resource-oriented partitioned (ROP) scheduling was proposed by Huang et al. [24] in 2016 and later refined by von der Brüggen et al. [44] with release enforcement for a special case.

For priority assignment, most of the results in the literature use rate-monotonic (RM) or earliest-deadline-first (EDF) scheduling. To the best of our knowledge, the priority assignment for systems with shared resources has only been seriously explored in a small number of papers, e.g., relative deadline assignment under release enforcement in [44], priority assignment for spinning [1], reasonable priority assignments under global scheduling [17], and the optimal priority assignment used in the greedy slacker algorithm in [49]. However, no theoretical evidence has been provided to quantify the non-optimality of the above heuristics.

Although many multiprocessor locking protocols have been proposed in the literature, there are a few unsolved fundamental questions when real-time tasks share resources (via locking mechanisms) in multiprocessor systems:
• What is the fundamental difficulty?
• What is the performance gap of partitioned, semi-partitioned, and global scheduling?
• Is it always beneficial to prioritize critical sections?
To answer the above questions, we focus on the simplest and most basic setting: all tasks have the same period and always release their jobs at the same time, a so-called frame-based real-time task system, and are scheduled on M identical (homogeneous) processors. Specifically, we assume that each critical section is non-nested and is guarded by only one binary semaphore or one mutex lock.

Contribution:
Our contributions are as follows:
• We show that finding a schedule of the tasks to meet the given common deadline is NP-hard in the strong sense regardless of the number of processors M in the system. Therefore, there is no polynomial-time approximation algorithm that can bound the allocated number of processors to meet the given deadline. Moreover, the NP-hardness holds under any scheduling paradigm, i.e., the allowance of preemption or migration does not reduce the computational complexity.
• We propose a dependency graph approach for multiprocessor synchronization, which consists of two steps: 1) the construction of a directed acyclic graph (DAG), and 2) the scheduling of this DAG. We prove that, for minimizing the makespan, the lower bound of the approximation ratio of such an approach is at least 2 − 2/M + 1/M^2 under any scheduling paradigm and 2 − 1/M under partitioned or semi-partitioned scheduling.
• We demonstrate how existing results in the literature on uniprocessor non-preemptive scheduling can be adopted to construct the DAG in the first step of the dependency graph approach when each task has only one critical section. This results in several polynomial-time scheduling algorithms with different constant approximation bounds for minimizing the makespan. Specifically, the best approximation developed is based on a polynomial-time approximation scheme and has an approximation ratio of 2 + ε − (1 + ε)/M for any ε > 0 under semi-partitioned scheduling strategies. We further discuss methodologies and tradeoffs of preemptive against non-preemptive scheduling algorithms and of partitioned against semi-partitioned scheduling algorithms.
• We also implemented the dependency graph approach as a prototype in LITMUS^RT [8], [13]. The experimental results show that the overhead is almost the same as for state-of-the-art multiprocessor locking protocols. Moreover, we provide extensive numerical evaluations, which demonstrate the performance of the proposed approach under different scheduling constraints.
Compared to the state-of-the-art resource-oriented partitioned (ROP) scheduling, our approach shows significant improvements.

System Model

In this paper, we implicitly consider frame-based real-time task systems to be scheduled on M identical (homogeneous) processors. The given tasks release their jobs at the same time and have the same period and relative deadline. Our studied problem is the task synchronization problem where all tasks have exactly one (non-nested) critical section, denoted as TS-OCS. Specifically, each task τ_i releases a job (at time 0 for notational brevity) with the following properties:
• C_{i,1} is the execution time of the first non-critical section of the job.
• A_{i,1} is the execution time of the (first) critical section of the job, in which a binary semaphore or a mutex σ(τ_i) is used to control access to the critical section.
• C_{i,2} is the execution time of the second non-critical section of the job.
A subjob is a critical section or a non-critical section. Therefore, a job of task τ_i consists of three subjobs. We assume that the task set T is given and that the deadline is either implicit, i.e., identical to the period, or constrained, i.e., smaller than the period. The cardinality of a set X is |X|. We also make the following assumptions:
• For each task τ_i in T, C_{i,1} ≥ 0, C_{i,2} ≥ 0, and A_{i,1} ≥ 0.
• The critical sections guarded by one binary semaphore s must be executed sequentially under a total order. That is, if two tasks share the same semaphore, their critical sections must be executed one after another without any interleaving.
• The execution of a job cannot be parallelized, i.e., a job must be executed sequentially in the order C_{i,1}, A_{i,1}, C_{i,2}.
• There are in total z binary semaphores.
The paper implicitly focuses on the above task model. In Section 8, we explain how the algorithms in this paper can be extended to periodic task systems under certain conditions.
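To make the task model concrete, it can be sketched as a small data structure (an illustrative sketch; the class and field names are ours, not from the paper):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Task:
    """A frame-based task with at most one non-nested critical section."""
    name: str
    c1: float                  # C_{i,1}: first non-critical section
    a: float                   # A_{i,1}: critical section (0 if none)
    c2: float                  # C_{i,2}: second non-critical section
    sem: Optional[int] = None  # index of the binary semaphore guarding A_{i,1}

    @property
    def total(self) -> float:
        # A job must run sequentially in the order C_{i,1}, A_{i,1}, C_{i,2}.
        return self.c1 + self.a + self.c2

# Example: three tasks, two of which share semaphore 0.
tasks = [
    Task("t1", c1=1.0, a=2.0, c2=1.0, sem=0),
    Task("t2", c1=0.5, a=1.0, c2=2.0, sem=0),
    Task("t3", c1=2.0, a=0.0, c2=1.0),  # no critical section
]
print([t.total for t in tasks])  # → [4.0, 3.5, 3.0]
```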
Scheduling Strategies

Here, we define scheduling strategies and the properties of a schedule for a frame-based real-time task system. Note that the terminology used here is limited to the scenario where each task in T releases only one job at time 0. Therefore, we use the terms jobs and tasks interchangeably.

A schedule is an assignment of the given jobs (tasks) to one of the M identical processors, such that each job is executed (not necessarily consecutively) until completion. A schedule for T can be defined as a function ρ : R≥0 × {1, 2, ..., M} → T ∪ {⊥}, where ρ(t, m) = τ_j denotes that the job of task τ_j is executed at time t on processor m, and ρ(t, m) = ⊥ denotes that processor m is idle at time t. We assume that a job has to be executed sequentially, i.e., intra-task parallelism is not possible. Therefore, it is not feasible to run a job in parallel on two processors, i.e., ρ(t, m) ≠ ρ(t, m′) for any m ≠ m′ if ρ(t, m) ≠ ⊥.

Some other constraints may also be introduced. A schedule is non-preemptive if a job cannot be preempted by any other job, i.e., there is only one interval with ρ(t, m) = τ_j on processor m for each task τ_j in T. A schedule is preemptive if a job can be preempted, i.e., more than one interval with ρ(t, m) = τ_j for a task τ_j in T on processor m is allowed. For a partitioned schedule, a job has to be executed on one processor. That is, there is exactly one processor m with ρ(t, m) = τ_j for every task τ_j in T. Such a schedule can be preemptive or non-preemptive. For a global schedule, a job can be executed arbitrarily on any of the M processors at any point in time. That is, it is possible that ρ(t, m) = τ_j and ρ(t′, m′) = τ_j for m ≠ m′ and t ≠ t′. By definition, a global schedule is preemptive (for frame-based real-time task systems) in our model.
For a semi-partitioned schedule, a subjob (either a critical section or a non-critical section) has to be executed on one processor. Such a semi-partitioned schedule can be preemptive or non-preemptive. Based on the above definitions, a partitioned schedule is also a semi-partitioned schedule, and a semi-partitioned schedule is also a global schedule.

In the rich literature of scheduling theory, one specific objective is to minimize the completion time of the jobs, called the makespan. For frame-based real-time task systems, if the makespan of the jobs released at time 0 is no more than the relative deadline, then the task set can be feasibly scheduled to meet the deadline. We state the makespan problem for
TS-OCS that is studied here as follows:

Definition 1 (The TS-OCS Makespan Problem): We are given M identical (homogeneous) processors. There are N tasks arriving at time 0. Each task is given by (C_{i,1}, A_{i,1}, C_{i,2}) and has at most one critical section, guarded by one binary semaphore. The objective is to find a schedule that minimizes the makespan.

Alternatively, we can also investigate the bin packing version of the problem, i.e., minimizing the number of allocated processors to meet a given common deadline D. Note that the deadline is never larger than the period in our setting.
Definition 2 (The TS-OCS Bin Packing Problem): We are given identical (homogeneous) processors. There are N tasks arriving at time 0 with a common deadline D. Each task is given by (C_{i,1}, A_{i,1}, C_{i,2}) and has at most one critical section, guarded by one binary semaphore. The objective is to find a schedule that meets the deadline with the minimum number of allocated processors.

Essentially, the decision versions of the makespan and the bin packing problems are identical:

Definition 3 (The TS-OCS Schedulability Problem):
We are given M identical (homogeneous) processors. There are N tasks arriving at time 0 with a common deadline D. Each task is given by (C_{i,1}, A_{i,1}, C_{i,2}) and has at most one critical section, guarded by one binary semaphore. The objective is to find a schedule that meets the deadline using the M processors.

In the domain of scheduling theory, a scheduling problem is described by a triplet Field 1 | Field 2 | Field 3:
• Field 1 describes the machine environment.
• Field 2 specifies the processing characteristics and constraints.
• Field 3 presents the objective to be optimized.
For example, the scheduling problem 1 | r_j | L_max deals with a uniprocessor system, in which the input is a set of jobs with different release times and different absolute deadlines, and the objective is to derive a non-preemptive schedule which minimizes the maximum lateness. The scheduling problem P || C_max deals with a homogeneous multiprocessor system, in which the input is a set of jobs with the same release time, and the objective is to derive a partitioned schedule which minimizes the makespan. The scheduling problem P | prec | C_max is an extension of P || C_max that additionally considers precedence constraints among the jobs. The scheduling problem P | prec, prmp | C_max further allows preemption. Note that in classical scheduling theory, preemption on parallel machines implies the possibility of job migration from one machine to another. Therefore, the scheduling problem P | prec, prmp | C_max allows job preemption and migration, i.e., preemptive global scheduling.

Since many scheduling problems are
NP-hard in the strong sense, polynomial-time approximation algorithms are often used. In the realm of real-time systems, there are two widely adopted metrics:

The Approximation Ratio compares the resulting objectives of (i) a scheduling algorithm A and (ii) an optimal algorithm when scheduling any given task set. Formally, an algorithm A for the makespan problem (i.e., Definition 1) has an approximation ratio α ≥ 1 if, given any task set T, the resulting makespan is at most α C*_max on M processors, where C*_max is the minimum (optimal) makespan to schedule T on M processors. An algorithm A for the bin packing problem (i.e., Definition 2) has an approximation ratio α ≥ 1 if, given any task set T, it can find a schedule of T on α M* processors to meet the common deadline, where M* is the minimum (optimal) number of processors required to feasibly schedule T. (In real-time systems, preemption does not necessarily imply migration as in classical scheduling theory. For instance, under preemptive partitioned scheduling, a job can be preempted and resumed later on the same processor without migration.)

The Speedup Factor [26], [36] of a scheduling algorithm A indicates the factor α ≥ 1 by which the overall speed of a system would need to be increased so that the scheduling algorithm A always derives a feasible schedule to meet the deadline, provided that there exists one at the original speed. This metric is used for the problem in Definition 3. We note that an algorithm that has an approximation ratio α for the makespan problem in Definition 1 also has a speedup factor α for the schedulability problem in Definition 3.
To handle the studied makespan problem in Definition 1, we propose a Dependency Graph Approach, which involves two steps:
• In the first step, a directed graph G = (V, E) is constructed. A subjob (i.e., a critical section or a non-critical section) is a vertex in V. The subjob C_{i,1} is a predecessor of the subjob A_{i,1}, and the subjob A_{i,1} is a predecessor of the subjob C_{i,2}. If the jobs of τ_i and τ_j share the same binary semaphore, i.e., σ(τ_i) = σ(τ_j), then either the subjob A_{i,1} is a predecessor of A_{j,1} or the subjob A_{j,1} is a predecessor of A_{i,1}. All the critical sections guarded by one binary semaphore form a chain in G, i.e., the critical sections of that binary semaphore follow a total order. Therefore, we have the following properties of the set E:
  ◦ The two directed edges (C_{i,1}, A_{i,1}) and (A_{i,1}, C_{i,2}) are in E.
  ◦ Suppose that T_k is the set of tasks which require the same binary semaphore s_k. Then, the |T_k| tasks in T_k follow a certain total order π such that (A_{i,1}, A_{j,1}) is a directed edge in E when π(τ_i) = π(τ_j) − 1.
Fig. 1 provides an example of a task dependency graph with one binary semaphore. Since there are z binary semaphores in the task set, the task dependency graph G has in total z connected subgraphs, denoted as G_1, G_2, ..., G_z. In each connected subgraph G_ℓ, the critical sections of the tasks that request the semaphore associated with G_ℓ form a chain and have to be executed sequentially. For example, in Fig. 1, the dependency graph forces the scheduler to execute the critical section A_{1,1} prior to any of the other three critical sections.
• In the second step, a corresponding schedule of G on M processors is generated. The schedule can be based on the system's restrictions or the user's preferences, i.e., either preemptive or non-preemptive, and either global, semi-partitioned, or partitioned.

In the dependency graph approach, the second step has been widely studied in scheduling theory. That is, a solution of the problem P | prec | C_max results in a semi-partitioned schedule, since the dependency graph is constructed by considering a critical section or a non-critical section as a subjob. Moreover, a solution of the problem P | prec, prmp | C_max results in a global schedule. For deriving a partitioned schedule, we can force the subjobs generated by a job to be tied to one processor. That is, P | prec, tied | C_max targets a partitioned non-preemptive schedule and P | prec, prmp, tied | C_max targets a partitioned preemptive schedule.

[Fig. 1. An example of a task dependency graph for a task set with one binary semaphore: each task τ_i, i = 1, ..., 4, contributes the path C_{i,1} → A_{i,1} → C_{i,2}, and the critical sections A_{1,1}, A_{2,1}, A_{3,1}, A_{4,1} form a chain.]

Therefore, the key issue is the construction of the dependency graph. An alternative view of the dependency graph approach is to build the dependency graph assuming a sufficient number of processors (i.e., using as many processors as possible) in the first step; the second step then considers the constraint of the number of processors. Towards the first step, we need the following definition:

Definition 4: A critical path of a task dependency graph G is one of the longest paths of G. The critical path length of G is denoted by len(G).

For the rest of this paper, we denote a dependency task graph of the input task set T that has the minimum critical path length as G*. Note that G* is independent of M.

Lemma 1: len(G*) is a lower bound of the TS-OCS makespan problem for task set T on M processors.

Proof:
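The two steps above can be sketched in a few lines, assuming the tasks of each semaphore are already listed in their chosen total order π (finding a good order is the hard part, as shown later). The second step here is a simple greedy list-scheduling variant for the subjobs, which yields a non-preemptive semi-partitioned schedule; all names are illustrative:

```python
from collections import namedtuple

Task = namedtuple("Task", "c1 a c2 sem")  # (C_{i,1}, A_{i,1}, C_{i,2}, semaphore)

def build_graph(tasks):
    """Step 1: dependency graph as (subjob lengths, predecessor sets).
    Subjob (i, 0) is C_{i,1}, (i, 1) is A_{i,1}, (i, 2) is C_{i,2}.
    The critical sections of each semaphore are chained in list order."""
    length, preds, last_cs = {}, {}, {}
    for i, t in enumerate(tasks):
        c1, a, c2 = (i, 0), (i, 1), (i, 2)
        length[c1], length[a], length[c2] = t.c1, t.a, t.c2
        preds[c1], preds[a], preds[c2] = set(), {c1}, {a}
        if t.sem is not None:
            if t.sem in last_cs:
                preds[a].add(last_cs[t.sem])  # previous A in the chain
            last_cs[t.sem] = a
    return length, preds

def critical_path_len(length, preds):
    """len(G): longest path (lexicographic id order is topological here,
    because the chain follows the task-list order)."""
    dist = {}
    for j in sorted(length):
        dist[j] = length[j] + max((dist[p] for p in preds[j]), default=0.0)
    return max(dist.values())

def list_schedule(length, preds, M):
    """Step 2: greedy non-preemptive list scheduling of the subjobs on M
    processors (a semi-partitioned schedule); returns the makespan."""
    finish, free, remaining = {}, [0.0] * M, set(length)
    while remaining:
        ready = sorted(
            (max((finish[p] for p in preds[j]), default=0.0), j)
            for j in remaining if all(p in finish for p in preds[j]))
        r, j = ready[0]                        # earliest-ready subjob first
        m = min(range(M), key=lambda x: free[x])
        free[m] = finish[j] = max(r, free[m]) + length[j]
        remaining.remove(j)
    return max(finish.values())

# Two tasks share semaphore 0; the third has no critical section (A = 0).
tasks = [Task(1.0, 2.0, 1.0, 0), Task(0.5, 1.0, 2.0, 0), Task(2.0, 0.0, 1.0, None)]
length, preds = build_graph(tasks)
print(critical_path_len(length, preds))   # → 6.0
print(list_schedule(length, preds, M=2))  # → 6.5
```

On this instance the chosen chain order forces len(G) = 6.0, while the greedy second step yields a makespan of 6.5; Graham's classical analysis guarantees that list scheduling stays within a factor 2 − 1/M of the optimal schedule for the given graph.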
This follows from the setting of the problem, i.e., each task τ_i has only one critical section guarded by one binary semaphore, and the definition of the graph G*, i.e., using as many processors as possible.

Definition 5: A feasible schedule S(G) of a task dependency graph G respects the precedence constraints defined in G and the specified scheduling requirement, e.g., being global/semi-partitioned/partitioned and preemptive/non-preemptive. L(S(G)) denotes the makespan of S(G).

With the above definitions, we can recap the objectives of the two steps in the dependency graph approach. In the first step, we would like to construct a dependency graph G that minimizes len(G), and in the second step, we would like to construct a schedule S(G) that minimizes L(S(G)). We conclude this section by stating the following theorem:

Theorem 1: The optimal makespan of the TS-OCS makespan problem for T on M processors is at least

  max { (Σ_{τ_i ∈ T} (C_{i,1} + A_{i,1} + C_{i,2})) / M , len(G*) }   (1)

where G* is a dependency task graph of T that has the minimum critical path length.

Proof:
The lower bound len(G*) comes from Lemma 1, and the lower bound Σ_{τ_i ∈ T} (C_{i,1} + A_{i,1} + C_{i,2}) / M is due to the pigeonhole principle.

Computational Complexity and Lower Bounds

This section presents the computational complexity and the lower bounds on the approximation ratios of the dependency graph approach.
The following theorem shows that constructing G* is unfortunately NP-hard in the strong sense.

Theorem 2: Constructing a dependency task graph G* that has the minimum critical path length is NP-hard in the strong sense.

Proof:
This theorem is proved by a reduction from the decision version of the scheduling problem 1 | r_j | L_max, i.e., uniprocessor non-preemptive scheduling, in which the objective is to minimize the maximum lateness, assuming that each job J_j in the given job set J has a known processing time p_j ≥ 0, arrival time r_j ≥ 0, and absolute deadline d_j. This problem is NP-hard in the strong sense by a reduction from the 3-Partition problem [30]. Suppose that the decision version of the scheduling problem 1 | r_j | L_max is to validate whether there exists a schedule in which the finishing time of each job J_j is no later than d_j.

Let H be any positive integer greater than max_{J_j ∈ J} d_j. For each job J_j in J, we construct a task τ_j with one critical section, where C_{j,1} is set to r_j, C_{j,2} is set to H − d_j, and A_{j,1} is set to p_j. By this setting, C_{j,1} ≥ 0, C_{j,2} ≥ 0, and A_{j,1} ≥ 0 for every constructed task τ_j. The critical sections of all the constructed tasks are guarded by only one binary semaphore. Let the task set constructed above be T. By definition, T is a feasible input task set for the one-critical-section task synchronization problem.

We now prove that there is a non-preemptive uniprocessor schedule for J in which all the jobs meet their deadlines if and only if there is a dependency task graph G* with a critical path length less than or equal to H for the constructed task set T.

If part, i.e., len(G*) ≤ H holds: Without loss of generality, we index the tasks in T so that the critical section A_{i,1} is the immediate predecessor of the critical section A_{i+1,1} in G*, e.g., as in Fig. 1. Suppose that G*(τ_i) is the subgraph of G* that consists of only the vertices representing {C_{k,1}, A_{k,1}, C_{k,2} | k = 1, 2, ..., i − 1} ∪ {C_{i,1}, A_{i,1}} and the corresponding edges. Let f_i be the length of the longest path in G*(τ_i) that ends at the vertex representing A_{i,1}. By definition, f_1 is C_{1,1} + A_{1,1}. Moreover, f_i is max{f_{i−1}, C_{i,1}} + A_{i,1} for every other task τ_i in T. Since len(G*) ≤ H and C_{i,2} = H − d_i, we know that f_i + C_{i,2} ≤ H ⇒ f_i ≤ d_i for every task τ_i in T. We can now construct the uniprocessor non-preemptive schedule for J by following the same execution order. Here, we index the jobs in J corresponding to T. The finishing time of job J_1 is r_1 + p_1 = C_{1,1} + A_{1,1} = f_1. The finishing time of job J_i is max{f_{i−1}, r_i} + p_i = max{f_{i−1}, C_{i,1}} + A_{i,1} = f_i. This proves the if part.

Only-if part, i.e., there is a uniprocessor non-preemptive schedule in which all the deadlines of the jobs in J are met: The proof of the if part can be reversed and the same arguments can be applied. Due to space limitations, we omit the details.
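The forward direction of this reduction can be sketched directly (illustrative code, not from the paper): each 1 | r_j | L_max job (r_j, p_j, d_j) becomes a task with C_{j,1} = r_j, A_{j,1} = p_j, and C_{j,2} = H − d_j, all guarded by one semaphore, and the if-part recurrence f_i = max{f_{i−1}, C_{i,1}} + A_{i,1} evaluates the critical path length of a candidate chain order:

```python
def jobs_to_tasks(jobs):
    """Theorem 2 reduction: map 1|r_j|L_max jobs (r_j, p_j, d_j) to
    TS-OCS tasks (C_{j,1}, A_{j,1}, C_{j,2}) sharing one semaphore."""
    H = max(d for _, _, d in jobs) + 1  # any integer > max_j d_j
    return H, [(r, p, H - d) for (r, p, d) in jobs]

def chain_critical_path(tasks_in_order):
    """len of the graph whose critical sections follow the given order:
    f_i = max(f_{i-1}, C_{i,1}) + A_{i,1}; each path ends with C_{i,2}."""
    f, best = 0, 0
    for c1, a, c2 in tasks_in_order:
        f = max(f, c1) + a
        best = max(best, f + c2)
    return best

# Jobs (r_j, p_j, d_j): in this order both jobs meet their deadlines.
jobs = [(0, 2, 3), (1, 2, 5)]
H, ts = jobs_to_tasks(jobs)
print(H, ts)                    # → 6 [(0, 2, 3), (1, 2, 1)]
print(chain_critical_path(ts))  # → 5, i.e., len ≤ H, as Theorem 2 predicts
```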
The makespan problem with task synchronization for T on M processors is NP-hard in the strong sense, even if M is sufficiently large, under any scheduling paradigm.

Proof:
This follows directly from Theorem 2. Consider that there are M ≥ |T| + 1 processors. The if-and-only-if proof of Theorem 2 can be extended by introducing a concrete schedule that executes the two non-critical sections of task τ_i on processor i and the critical section of task τ_i on processor |T| + 1.

Theorem 3 expresses the fundamental difficulty of the multiprocessor synchronization problem and shows that a very simplified version of this problem is NP-hard in the strong sense regardless of the number of processors and the underlying scheduling paradigm. Therefore, the allowance of preemption or migration does not reduce the computational complexity. The fundamental problem is the sequencing of the critical sections, which is independent of the underlying scheduling paradigm. Therefore, no matter what flexibility the scheduling algorithm has (unless aborting and restarting a critical section is allowed), the computational complexity remains NP-hard in the strong sense.

Although the focus of this paper is the makespan problem in Definition 1 and the schedulability problem in Definition 3, we also state the following theorems to explain the difficulty of the bin packing problem in Definition 2.
Theorem 4:
Minimizing the number of processors needed to meet a given common deadline for a task set T with task synchronization (i.e., Definition 2) is NP-hard in the strong sense under any scheduling paradigm.
Proof:
As the corresponding decision problem is the one in Definition 3, we reach the conclusion based on Theorem 3.
Theorem 5:
There is no polynomial-time (approximation) algorithm that minimizes the number of processors needed to meet a given common deadline for a task set T with task synchronization under any scheduling paradigm, unless P = NP.

Proof:
This is based on Theorems 2 and 3. If such a polynomial-time algorithm existed, then the problem 1 | r_j | L_max could be solved in polynomial time, which implies P = NP.

The dependency graph approach requires two steps. The following theorem shows that, even if both steps are optimized, the resulting schedule for the makespan problem with task synchronization is not optimal and has an asymptotic lower bound on the approximation ratio. (The same statement also holds for using M = |T| processors, but the proof is more involved.)

Theorem 6: The optimal schedule on M identical processors for the dependency graph G* that has the minimum critical path length is not optimal for the TS-OCS makespan problem and can have an approximation bound of at least
• 2 − 2/M + 1/M^2 under any scheduling paradigm, and
• 2 − 1/M under partitioned or semi-partitioned scheduling.

Proof:
We prove this theorem by providing a concrete input instance as follows:
• Suppose that M ≥ 2 is a given integer and we have N = M^2 − M + 1 tasks.
• We assume a small positive number δ close to 0 and a number Q that is much greater than δ, i.e., Q/(MN) ≫ δ > 0.
• All N tasks have a critical section guarded by the same binary semaphore.
• Task τ_1 has C_{1,1} = δ, A_{1,1} = Q − Q/M, and C_{1,2} = Q/M + Nδ.
• Task τ_i has C_{i,1} = δ, A_{i,1} = δ, and C_{i,2} = Q/M for i = 2, 3, ..., N.
We need to show that the optimal dependency graph of this input instance in fact leads to the specified bound. The detailed proof is in the appendix.

Constructing the Dependency Graph

The key to success is to find G*. Unfortunately, as shown in Theorem 2, finding G* is NP-hard in the strong sense. However, finding good approximations is possible. The problem of constructing G is called the dependency-graph construction problem. Here, instead of presenting new algorithms to find good approximations of G*, we explain how to use existing algorithms for the scheduling problem 1 | r_j | L_max to derive good approximations of G*.

It should first be noted that the problem 1 | r_j | L_max cannot be approximated with a bounded approximation ratio, because the optimal schedule may have no lateness at all, in which case any approximation leads to an unbounded approximation ratio. However, a variant of this problem can be easily approximated: the delivery-time model of the problem 1 | r_j | L_max. In this model, each job J_j has a release time r_j, a processing time p_j, and a delivery time q_j ≥ 0. After a job finishes its execution on the machine, its result (final product) needs q_j amount of time to be delivered to the customer. The objective is to minimize the time K at which the last result is delivered. A job delivered by time K must therefore finish on the machine by the effective deadline d_j = K − q_j. Since K is a constant, this is effectively equivalent to the case when d_j is set to −q_j in the maximum-lateness problem. The delivery-time model of the problem 1 | r_j | L_max can then be effectively approximated.
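For a fixed non-preemptive processing order, the delivery-time objective can be evaluated with the same recurrence used in the proof of Theorem 2 (a sketch; job tuples are ours):

```python
def delivery_makespan(order):
    """Delivery-time model of 1|r_j|L_max: jobs (r_j, p_j, q_j) processed
    non-preemptively in the given order; the objective is
    max_j (finish_j + q_j), i.e., when the last result is delivered."""
    f = 0    # finishing time of the last job processed so far
    obj = 0  # latest delivery time seen so far
    for r, p, q in order:
        f = max(f, r) + p      # earliest finish respecting the release time
        obj = max(obj, f + q)  # the result still needs q time to be delivered
    return obj

# Two orders of the same three jobs (r, p, q):
jobs = [(0, 1, 4), (0, 2, 1), (1, 1, 3)]
print(delivery_makespan(jobs))                               # → 7
print(delivery_makespan(sorted(jobs, key=lambda j: -j[2])))  # → 5
```

Ordering the jobs with large delivery times early (second call) improves the objective from 7 to 5 here, which is exactly the intuition behind the extended Jackson's rule discussed below.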
Moreover, our problem of constructing a good dependency graph for T is indeed equivalent to the delivery-time model of the problem 1 | r_j | L_max. To show this equivalence, Algorithm 1 presents the detailed transformation. For each semaphore s_k, suppose that T_k is the set of tasks that use s_k (Line 1 in Algorithm 1). For each task set T_k, we transform the problem of constructing G_k into an equivalent delivery-time model of the problem 1 | r_j | L_max (Lines 3 to 8). Then, we construct the graph G_k based on the schedule derived by an approximation algorithm for the delivery-time model of the problem 1 | r_j | L_max.

Theorem 7: An α-approximation algorithm for the delivery-time model of the problem 1 | r_j | L_max applied in Algorithm 1 guarantees to derive a dependency graph G with len(G) ≤ α × len(G*).

Proof:
This theorem can be proved by a counterpart of the proof of Theorem 2. We show that Algorithm 1 is in fact an L-reduction (i.e., a reduction that preserves the approximation ratio) from the input task set to the delivery-time model of the problem 1|r_j|L_max. In this L-reduction, there is no loss of the approximation ratio. First, by definition, two tasks are independent if they do not share any semaphore. Moreover, since the TS-OCS problem assumes that a task accesses at most one binary semaphore, a task τ_i can appear in at most one T_k for a certain k. Therefore, len(G*) = max_{k=1,2,...,z} len(G*_k).

To show that the reduction preserves the approximation ratio, we only need to prove the one-to-one mapping. One possibility is to prove that a schedule for the input instance of the problem 1|r_j|L_max delivers the last result at time X if and only if the corresponding graph G_k constructed by Lines 9 and 10 in Algorithm 1 has a critical path of length X. This is unfortunately not possible, because a (technically bad but possible) schedule for the input instance of the problem 1|r_j|L_max can be arbitrarily altered by inserting useless delays. Fortunately, for a given permutation to order the |T_k| tasks in T_k, we can always construct a schedule for the input instance of the problem 1|r_j|L_max by respecting the given order and the release times. Such a schedule delivers the last result at time X if and only if the corresponding graph G_k constructed by Lines 9 and 10 in Algorithm 1 has a critical path of length X. Moreover, the schedule for one such permutation is optimal for the input instance of the problem 1|r_j|L_max. Therefore, the approximation ratio is preserved while constructing G_k. According to the above discussion, len(G_k) ≤ α × len(G*_k).
Moreover, len(G) ≤ max_{k=1,2,...,z} len(G_k) ≤ α × max_{k=1,2,...,z} len(G*_k) = α × len(G*).

According to Theorem 7 and Algorithm 1, we can simply apply existing algorithms for the scheduling problem 1|r_j|L_max in the delivery-time model to derive G*, either exactly by using well-studied branch-and-bound methods, see for example [14], [32], [35], or approximately, see for example [22], [37]. Here, we summarize several polynomial-time approximation algorithms; the details can be found in [22]. For the delivery-time model of the scheduling problem 1|r_j|L_max, the extended Jackson's rule (JKS) is as follows: "Whenever the machine is free and one or more jobs are available for processing, schedule an available job with the largest delivery time," as explained in [22].
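As a concrete illustration, the extended Jackson's rule can be sketched as the following greedy procedure (a simplified sketch; data structures and names are ours, not from [22]):

```python
import heapq

def jks_schedule(jobs):
    """Extended Jackson's rule sketch.  jobs: dict name -> (r, p, q).
    Returns (execution order, last delivery time)."""
    pending = sorted(jobs.items(), key=lambda kv: kv[1][0])  # sort by release time
    avail = []            # max-heap on delivery time q (negated for heapq)
    t, i, order, last_delivery = 0.0, 0, [], 0.0
    n = len(pending)
    while i < n or avail:
        while i < n and pending[i][1][0] <= t:    # release all jobs with r <= t
            name, (r, p, q) = pending[i]
            heapq.heappush(avail, (-q, name))
            i += 1
        if not avail:                             # machine idle: jump to next release
            t = pending[i][1][0]
            continue
        _, name = heapq.heappop(avail)            # largest delivery time first
        _, p, q = jobs[name]
        t += p                                    # run the job non-preemptively
        order.append(name)
        last_delivery = max(last_delivery, t + q)
    return order, last_delivery
```

With jobs {"J1": (0, 2, 5), "J2": (1, 3, 1)}, only J1 is released at time 0, so the rule runs J1 first and all results are delivered by time 7.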
Lemma 2:
The extended Jackson's rule (JKS) is a polynomial-time 2-approximation algorithm for the dependency-graph construction problem.

Proof:
This is based on Theorem 7 and the approximation ratio of JKS for the problem 1|r_j|L_max; the proof can be found in [27].

Algorithm 1 Graph Construction Algorithm
Input: set T of N tasks with z shared binary semaphores;
1: T_k ← {τ_i | σ(τ_{i,1}) = s_k} for k = 1, 2, ..., z;
2: for k ← 1 to z do
3:   J ← ∅;
4:   for each τ_i ∈ T_k do
5:     create a job J_i with r_i ← C_{i,1}, p_i ← A_{i,1}, and q_i ← C_{i,2}, where q_i is the delivery time;
6:     J ← J ∪ {J_i};
7:   end for
8:   apply an approximation algorithm to derive a non-preemptive schedule ρ_k for the delivery-time model of the problem 1|r_j|L_max on one machine;
9:   construct the initial dependency graph G_k for T_k, in which the directed edges (C_{i,1}, A_{i,1}) and (A_{i,1}, C_{i,2}) are created for every task τ_i ∈ T_k;
10:  create a directed edge from A_{i,1} to A_{j,1} in G_k if job J_j is executed right after (but not necessarily consecutively after) job J_i in the schedule ρ_k;
11: end for
12: return G = G_1 ∪ G_2 ∪ ... ∪ G_z;

Potts [37] observed some nice properties when the extended Jackson's rule is applied. Suppose that the last delivery is due to a job J_c. Let J_a be the earliest scheduled job such that the machine in the problem 1|r_j|L_max is not idle between the processing of J_a and J_c. The sequence of jobs executed sequentially from J_a to J_c is called a critical sequence. By the definition of J_a, all jobs in the critical sequence are released no earlier than the release time r_a of job J_a. If the delivery time of every job in the critical sequence is at least the delivery time q_c of J_c, it can be proved that the extended Jackson's rule is optimal for the problem 1|r_j|L_max. However, if the delivery time q_b of a job J_b in the critical sequence is shorter than the delivery time q_c of J_c, the extended Jackson's rule may start the non-preemptive job J_b too early. The job J_b that appears last in the critical sequence is called the interference job of the critical sequence. Potts [37] suggested to attempt to improve the schedule by forcing some interference job to be executed after the critical job J_c, i.e., by delaying the release time of J_b from r_b to r'_b = r_c. This procedure is repeated for at most n iterations and the best schedule among the iterations is returned as the solution.

Lemma 3:
Potts' iterative process (Potts) is a polynomial-time 1.5-approximation algorithm for the dependency-graph construction problem.

Proof:
This is based on Theorem 7 and the approximation ratio of Potts for the problem 1|r_j|L_max; the proof can be found in [22].

Hall and Shmoys [22] further improved the approximation ratio to 4/3 by handling a special case in which there are two jobs J_i and J_h with p_i > P/3 and p_h > P/3, where P is Σ_{J_j} p_j, and by running Potts' algorithm for n iterations.

Lemma 4:
Algorithm HS is a polynomial-time 4/3-approximation algorithm for the dependency-graph construction problem. (Hall and Shmoys [22] further use the concept of forward and inverse problems of the input instance of 1|r_j|L_max; as they are not highly related, we omit those details.)

Proof:
This is based on Theorem 7 and the approximation ratio of HS for the problem 1|r_j|L_max; the proof can be found in [22].

The algorithm with the best approximation ratio for the delivery-time model of the problem 1|r_j|L_max is a polynomial-time approximation scheme (PTAS) developed by Hall and Shmoys [22].

Lemma 5:
The dependency-graph construction problem admits a polynomial-time approximation scheme (PTAS), i.e., the approximation bound is 1 + ε under the assumption that ε is a constant, for any ε > 0.

This section presents our heuristic algorithms to schedule the dependency graph G derived from Algorithm 1. We first consider the special case in which there is a sufficient number of processors, i.e., M ≥ N.

Lemma 6:
For a task set T to be scheduled on M identical processors, the makespan of the schedule that executes each task τ_i on only one processor (processor i) as early as possible, respecting the precedence constraints defined in a given task dependency graph G, is len(G) if M ≥ N. By definition, this schedule is a partitioned schedule for the given jobs and is non-preemptive with respect to the subjobs.

Proof:
Since M ≥ N, all tasks can start their first non-critical sections at time 0. Therefore, the critical section of task τ_i arrives exactly at time C_{i,1}. Then, the finishing time of the critical section of task τ_i is exactly the length of the longest path in G that finishes at the vertex representing A_{i,1}. Therefore, the makespan of such a schedule is exactly len(G).

For the remaining part of this section, we focus on the other case, M < N. We heavily utilize the concept of list schedules developed by Graham [21] and its extensions to schedule the dependency graph G derived in Section 5. A list schedule works as follows: whenever a processor idles and there are subjobs eligible to be executed (i.e., all of their predecessors in G have finished), one of the eligible subjobs is executed on the processor. When the number of eligible subjobs is larger than the number of idle processors, many heuristic strategies exist to decide which subjobs should be executed with higher priority. Graham [21] showed that list schedules can be generated in polynomial time and have a 2 − 1/M approximation ratio for the scheduling problem P|prec|C_max.

For the rest of this section, we explain how to use or extend list schedules to generate partitioned or semi-partitioned and preemptive or non-preemptive schedules based on G. In a list schedule, since the subjobs of a task are scheduled individually, a task in the generated list schedule may migrate among different processors, thus representing a semi-partitioned schedule. However, a subjob is by default non-preemptive in list schedules.

The following lemma is widely used in the literature for the list schedules developed by Graham [21]. All existing results on federated scheduling, e.g., [6], [15], [31], and on scheduling sporadic dependent tasks (whose dependencies are not due to synchronization) implicitly or explicitly use the property in this lemma.
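The list-scheduling procedure just described can be simulated with the following sketch (arbitrary tie-breaking among eligible subjobs; all names are ours, not from [21]):

```python
import heapq

def list_schedule_makespan(wcet, edges, M):
    """Graham-style list schedule sketch with non-preemptive subjobs.
    wcet: dict node -> execution time; edges: list of (u, v) precedence
    constraints; M: number of identical processors.  Returns the makespan."""
    succs = {v: [] for v in wcet}
    indeg = {v: 0 for v in wcet}
    for u, v in edges:
        succs[u].append(v)
        indeg[v] += 1
    ready = [v for v in wcet if indeg[v] == 0]   # eligible subjobs
    running = []                                  # heap of (finish_time, node)
    t, makespan = 0.0, 0.0
    while ready or running:
        # a list schedule never leaves a processor idle while a subjob is eligible
        while ready and len(running) < M:
            v = ready.pop()
            heapq.heappush(running, (t + wcet[v], v))
        t, v = heapq.heappop(running)             # advance to the next completion
        makespan = max(makespan, t)
        for w in succs[v]:                        # completions release successors
            indeg[w] -= 1
            if indeg[w] == 0:
                ready.append(w)
    return makespan

print(list_schedule_makespan({"a": 1.0, "b": 2.0}, [("a", "b")], 2))  # chain: 3.0
```

Any priority rule for choosing among the eligible subjobs yields a valid list schedule; the bound in the lemma that follows holds for all of them.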
Lemma 7:
The makespan of a list schedule of a given task dependency graph G for task set T on M processors is at most

(Σ_{τ_i ∈ T} (C_{i,1} + A_{i,1} + C_{i,2}) − len(G)) / M + len(G).
The original proof can be traced back to Theorem 1 by Graham [21] in 1969. We omit the proof here, as this is a standard procedure in the proofs of list schedules for the scheduling problem P|prec|C_max.

Lemma 8: If len(G) ≤ α × len(G*) for a certain α ≥ 1, the makespan of a list schedule of the task dependency graph G for task set T on M processors has an approximation bound of 1 + α − α/M if M < N.

Proof:
Since M < N, the makespan of a list schedule of G, denoted L(List(G)), satisfies

L(List(G)) ≤ (Σ_{τ_i ∈ T} (C_{i,1} + C_{i,2} + A_{i,1}) − len(G)) / M + len(G)      (by Lemma 7)
           = (Σ_{τ_i ∈ T} (C_{i,1} + C_{i,2} + A_{i,1})) / M + len(G)·(1 − 1/M)
           ≤ (Σ_{τ_i ∈ T} (C_{i,1} + C_{i,2} + A_{i,1})) / M + α × len(G*)·(1 − 1/M)      (by the assumption)
           ≤ (1 + α − α/M) OPT      (2)      (by Theorem 1)

We now conclude the approximation ratio.
Theorem 8:
When applying JKS (α = 2, from Lemma 2), Potts (α = 1.5, from Lemma 3), HS (α = 4/3, from Lemma 4), or the PTAS (α = 1 + ε for any ε > 0, from Lemma 5) to generate the task dependency graph G, the TS-OCS Makespan problem admits polynomial-time algorithms to generate a semi-partitioned schedule with an approximation ratio of

  α              if M ≥ N
  1 + α − α/M    if M < N      (3)
Proof:
The case M < N comes from Lemma 8. The case M ≥ N comes from Lemma 6 and the fact that a partitioned schedule is, by definition, also a semi-partitioned schedule.

The default list schedules are non-preemptive at the subjob level. However, it may be more efficient if the second non-critical section of a task can be preempted by a critical section. Otherwise, the processors may be busy executing second non-critical sections while a critical section has to wait. As a result, not only this critical section itself but also its successors in G may be unnecessarily postponed, thereby increasing the makespan. This problem can be handled by preempting second non-critical sections. Allowing such preemption in the scheduler design can be achieved easily as follows:
• In the algorithm, a scheduling decision is made at any time t when a subjob becomes eligible or finishes.
• Whenever a subjob representing a critical section is eligible, it can be assigned to a processor that is executing a second non-critical section of a job by preempting that subjob.
The makespan of the resulting schedule remains at most (Σ_{τ_i ∈ T} (C_{i,1} + A_{i,1} + C_{i,2}) − len(G)) / M + len(G), as in Lemma 7. Therefore, the approximation ratios in Theorem 8 still hold even if preemption of the second non-critical sections is allowed.

In a partitioned schedule of the frame-based task set T, all subjobs of a task must be executed on the same processor. Therefore, the list-scheduling variant must ensure that, once the first subjob C_{i,1} of task τ_i is executed on a processor, all subsequent subjobs of task τ_i are tied to the same processor in any generated list schedule. Specifically, this problem is termed P|prec, tied|C_max in Section 2.3. A special case of P|prec, tied|C_max was recently studied to analyze OpenMP systems by Sun et al. [42] in 2017. They assumed that the synchronization subjob of a task always takes place at the end of the task.
Our dependency graph G unfortunately does not satisfy this assumption, because the synchronization subjob is in fact in the middle of a task. However, fixing this issue is not difficult. We illustrate the key strategy using Fig. 2. The subgraph Ḡ of G that consists of only the vertices of the first non-critical sections and the critical sections in fact satisfies the assumption made by Sun et al. [42]. Therefore, we can generate a multiprocessor schedule for the dependency graph Ḡ on M processors using the BFS* algorithm (an extension of the breadth-first-scheduling algorithm) by Sun et al. [42]. The subjobs that represent the second non-critical sections C_{i,2} can be regarded as background workload, executed only at the end of the schedule or when the available idle time is sufficient to complete C_{i,2}.

Alternatively, to improve parallelism, another heuristic algorithm can be applied in which all the first non-critical sections are scheduled before any of the critical sections using list scheduling. Once the first non-critical section C_{i,1} of task τ_i is assigned to a processor, the remaining execution of task τ_i is forced to be executed on that processor.

Fig. 2. A schematic of a tied task dependency graph for a task set with one binary semaphore.
If the second non-critical sections can be preempted, the subjobs that represent the second non-critical sections C_{i,2} can again be regarded as background workload: they can be executed whenever their processor idles and can be preempted by the first non-critical sections or the critical sections on that processor. For completeness, we present the algorithm as Algorithm 2 in the Appendix.

So far, we have assumed that C_{i,1}, A_{i,1}, and C_{i,2} are exact for a task τ_i. However, the execution of a subjob of task τ_i can finish earlier than in the worst case. It should be noted that list schedules are not sustainable in this case, i.e., reducing the execution time of a subjob can lead to a worse makespan due to the well-known multiprocessor timing anomaly observed by Graham [21]. There are three ways to handle such timing anomalies: 1) ignore the early completion and stick to the offline schedule, 2) reclaim the unused time (slack) carefully without creating timing anomalies, e.g., [50], or 3) use a safe upper bound, e.g., Lemma 7, to account for all possible list schedules. Each of them has advantages and disadvantages. It is up to the designers to choose whether they want to be less effective (Option 1), pay more runtime overhead (Option 2), or be more pessimistic by always taking a safe upper bound (Option 3).

Due to the multiprocessor timing anomaly, a dependency graph with a longer critical path may have a better makespan in the resulting list schedule. Our approach can easily be improved by also returning and scheduling the intermediate dependency graphs in Algorithms Potts and HS.

Our approach can be extended to periodic tasks with different periods under the assumption that a binary semaphore is only shared among tasks that have the same period. For each of the z semaphores, a DAG is constructed using Algorithm 1.
Afterwards, the z resulting DAGs can be scheduled using any approach for multiprocessor DAG scheduling, e.g., global scheduling [29], Federated Scheduling [31], as well as enhanced versions like Semi-Federated Scheduling [25] and Reservation-Based Federated Scheduling [43].

This section presents the evaluation of the proposed approach. We first explain how our approach can be implemented using existing routines in LITMUS^RT and provide the measured overheads in LITMUS^RT. Then, we demonstrate the performance of the proposed approach by numerical evaluations for different configurations.

The hardware platform used in our experiments is a cache-coherent SMP, consisting of two 64-bit Intel Xeon Processors E5-2650Lv4 running at 1.7 GHz, with 35 MB cache and 64 GB of main memory. We have implemented our dependency graph approach in LITMUS^RT in order to investigate the overheads. Both the partitioned and the semi-partitioned scheduling algorithms presented in Section 6 have been implemented in LITMUS^RT under the plug-in Partitioned Fixed Priority (P-FP), detailed in the Appendix. The patches of our implementation have been released in [41].

In Table I, we report the following overheads of different protocols, including the existing protocols DPCP and MPCP

Max. (Avg.) in µs | DPCP          | MPCP         | PDGA         | SDGA
CXS               | 30.93 (1.51)  | 31.1 (0.67)  | 31.21 (0.71) | 30.95 (1.54)
RELEASE           | 32.63 (3.96)  | 19.48 (3.91) | 19.77 (4.03) | 21.64 (4.3)
SCHED2            | 28.7 (0.18)   | 29.78 (0.15) | 29.91 (0.16) | 29.74 (0.2)
SCHED             | 31.43 (1.2)   | 31.38 (0.78) | 31.4 (0.83)  | 31.26 (1.11)
SEND-RESCHED      | 47.01 (14.42) | 31.83 (3.45) | 45.23 (4.33) | 41.53 (7.24)
TABLE I. Overheads of different protocols in LITMUS^RT.

in LITMUS^RT, and our implementations of the partitioned dependency graph approach (PDGA) and the semi-partitioned dependency graph approach (SDGA):
• CXS: context-switch overhead.
• RELEASE: time spent to enqueue a newly released job in a ready queue.
• SCHED2: time spent on post-context-switch and management activities.
• SCHED: time spent to make a scheduling decision (the scheduler finding the next job).
• SEND-RESCHED: inter-processor interrupt latency, including migrations.
Table I shows that the overheads of our approach and of the other protocols implemented in LITMUS^RT are comparable.

We conducted evaluations with M = 4, 8, and 16 processors. Depending on M, we generated task sets, each with M tasks. For each task set T, we generated synthetic tasks with Σ_{τ_i ∈ T} (C_{i,1} + C_{i,2} + A_{i,1}) = M by applying the RandomFixedSum method [18] and enforced an upper bound on C_{i,1} + C_{i,2} + A_{i,1} for each task τ_i. The number of shared resources (binary semaphores) was set to z ∈ {4, 8, 16}. The length of the critical section A_{i,1} is a fraction of the total execution time C_{i,1} + C_{i,2} + A_{i,1} of task τ_i, depending on β ∈ {[5%, 10%], [10%, 40%], [40%, 50%]}. The remaining part C_i was split into C_{i,1} and C_{i,2} by drawing C_{i,1} uniformly at random from [0, C_i] and setting C_{i,2} to C_i − C_{i,1}.

For a generated task set T, we calculated a lower bound LB on the optimal makespan based on Eq. (1). Since deriving len(G*) is computationally expensive, we used min_{τ_i ∈ T} C_{i,1} + min_{τ_i ∈ T} C_{i,2} + max_{k=1,...,z} CriticalSum_k as a safe approximation of len(G*), where CriticalSum_k is the sum of the lengths of the critical sections that share semaphore s_k. If the relative deadline of the task set is less than LB, the task set is not schedulable by any algorithm. We compare the performance of the different algorithms according to the acceptance ratio, setting the relative deadline D = T in a range [LB, c·LB] for a constant c > 1.
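The lower bound LB used in the evaluation can be sketched as follows. Since Eq. (1) is not reproduced in this section, we assume it takes the usual form max(workload/M, len(G*)); the approximation of len(G*) follows the text above. All names and data are illustrative.

```python
def lower_bound(tasks, sem_of, M):
    """tasks: dict name -> (C1, A1, C2); sem_of: dict name -> semaphore id;
    M: number of processors.  Returns a lower bound on the optimal makespan,
    assuming Eq. (1) is max(workload / M, len(G*))."""
    workload = sum(c1 + a1 + c2 for c1, a1, c2 in tasks.values())
    # CriticalSum_k: total critical-section length per semaphore
    crit_sum = {}
    for name, (c1, a1, c2) in tasks.items():
        k = sem_of[name]
        crit_sum[k] = crit_sum.get(k, 0.0) + a1
    # safe approximation of len(G*) from the text:
    # min C_{i,1} + min C_{i,2} + max_k CriticalSum_k
    len_gstar = (min(c1 for c1, _, _ in tasks.values())
                 + min(c2 for _, _, c2 in tasks.values())
                 + max(crit_sum.values()))
    return max(workload / M, len_gstar)

tasks = {"t1": (1.0, 2.0, 1.0), "t2": (1.0, 1.0, 1.0)}
print(lower_bound(tasks, {"t1": 0, "t2": 0}, M=2))  # workload/M = 3.5, len approx = 3.0 + 2.0 -> 5.0
```

Any task set whose relative deadline falls below this value can be rejected without running a scheduler.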
We name the developed algorithms using the following rules: 1) JKS/POTTS in the first part: the extended Jackson's rule or Potts is used to construct the dependency graph; 2) SP/P in the second part: the semi-partitioned or partitioned scheduling algorithm is applied; 3) P/NP in the third part: preemptive or non-preemptive execution of the second non-critical sections.

We evaluated all 8 combinations under different settings, as shown in Fig. 3. (We did not implement Lemma 5 due to the complexity issue; Algorithm HS in general has similar performance to POTTS. In Section 6.2, we presented two strategies for task partitioning: one based on [42], detailed in the Appendix, and a simple heuristic performing list scheduling based on the first non-critical sections. In all experiments regarding partitioned scheduling, we observed that the latter, i.e., the simple heuristic, performed better; all presented results for partitioned scheduling are therefore based on it.) Due to space limitations, only a subset of the results is presented.

Fig. 3. Comparison of different approaches with different deadlines (acceptance ratio (%) over D/LB; algorithms: JKS-SP-NP, JKS-P-NP, JKS-SP-P, JKS-P-P, POTTS-SP-NP, POTTS-P-NP, POTTS-SP-P, POTTS-P-P; panels: (a) M=8 z=8 β=10%-40%, (b) M=16 z=16 β=10%-40%, (c) M=8 z=4 β=10%-40%, (d) M=8 z=16 β=10%-40%, (e) M=8 z=8 β=5%-10%, (f) M=8 z=8 β=40%-50%).

In general, the semi-partitioned scheduling algorithms clearly outperform the partitioned strategies, independently of the algorithm used to construct the dependency graph. In addition, the preemptive scheduling policy with respect to the second computation segment is superior to the non-preemptive strategy, and POTTS (usually) performs slightly better than JKS. We analyze the effect of the three parameters individually by changing:
1) M = z ∈ {8, 16} (Fig. 3(a) and Fig.
3(b)): increasing z and M also slightly increases the difference between the semi-partitioned and the partitioned approaches.
2) z for a fixed M, i.e., z ∈ {4, 8, 16} and M = 8 (Fig. 3(c), Fig. 3(a), and Fig. 3(d)): when the number of resources is decreased compared to the number of processors, the performance gap between preemptive and non-preemptive scheduling increases.
3) the workload of the shared resources, i.e., β ∈ {[5%, 10%], [10%, 40%], [40%, 50%]} (Fig. 3(e), Fig. 3(a), and Fig. 3(f)): if the workload of the critical sections is increased, the difference between preemptive and non-preemptive scheduling approaches becomes more significant.

We also compare our approach with the Resource-Oriented Partitioned (ROP) scheduling with release enforcement by von der Brüggen et al. [44], which is designed to schedule periodic tasks with one critical section on a multiprocessor platform. The concept of ROP is to take a resource-centric view instead of a processor-centric view. The algorithm 1) binds the critical sections of the same resource to the same processor, thus enabling well-known uniprocessor protocols like PCP to handle the synchronization, and 2) schedules the non-critical sections on the remaining processors using a state-of-the-art scheduler for segmented self-suspension tasks, namely SEIFDA [45]. Among the methods in [44], we evaluated FP-EIM-PCP (under fixed-priority scheduling) and EDF-EIM-PCP (under dynamic-priority scheduling). It has been shown in [44] that EDF-EIM-PCP dominates all existing methods. We performed another set of evaluations adopting the aforementioned settings and testing the utilization levels in fixed steps, where the utilization of a task set T is Σ_{τ_i ∈ T} (C_{i,1} + C_{i,2} + A_{i,1}) / T_i.

Fig. 4. Schedulability of different approaches for frame-based task sets (acceptance ratio (%) over utilization (%)/M; (a) M=8 z=8 β=5%-10%, (b) M=8 z=8 β=40%-50%; compared: JKS-SP-P, POTTS-SP-P, EDF-EIM-PCP, FP-EIM-PCP).

Fig. 4 presents the evaluation results. Due to space limitations, only a subset of the results is presented, but the others show very similar tendencies. For readability, we only select the two combinations of our proposed approach that outperform the others. The results in Fig. 4 show that, for frame-based tasks, our approach outperforms ROP significantly. We note that Fig. 4 covers only frame-based tasks; the results for periodic task systems discussed in Section 8 are presented in the Appendix.
10 Conclusion
This paper tries to answer a few fundamental questions when real-time tasks share resources in multiprocessor systems. Here is a short summary of our findings:
• The fundamental difficulty is mainly due to the sequencing of the mutually exclusive accesses to the shared resources (binary semaphores). Adding more processors, removing periodicity and job recurrence, introducing task migration, or allowing preemption does not make the problem easier from the computational-complexity perspective.
• The performance gap between partitioned and semi-partitioned scheduling in our study is mainly due to the capability to schedule the subjobs constrained by the dependency graph. Although partitioned scheduling may seem much worse than semi-partitioned scheduling in our evaluations, this is mainly due to the lack of understanding of the problem P|prec, tied|C_max in the literature. Further exploration is needed to understand these scheduling paradigms for a given dependency graph.
• The dependency graph approach is not work-conserving for the critical sections, since a critical section may be ready but not executed due to the artificially introduced precedence constraints. Existing multiprocessor synchronization protocols mainly assume work-conserving granting of access to the critical sections via priority boosting. Our study reveals a potential to consider cautious and non-work-conserving synchronization protocols in the future.

Acknowledgement: This paper is supported by DFG, as part of the Collaborative Research Center SFB876, projects A3 and B2 (http://sfb876.tu-dortmund.de/). The authors thank Zewei Chen and Maolin Yang for their tool SET-MRTS (Schedulability Experimental Tools for Multiprocessors Real Time Systems, https://github.com/RTLAB-UESTC/SET-MRTS-public) used to evaluate LP-GFP-FMLP, LP-PFP-DPCP, LP-PFP-MPCP, GS-MSRP, and LP-GFP-PIP in Fig. 5.
Appendix
Proof of Theorem 6.
Due to the design of the task set, there are only N different dependency graphs, depending on the position of τ_1 in the execution order. Suppose that the critical section of task τ_1 is the j-th critical section in the dependency graph. It can be proved that the critical path of this dependency graph has length jδ + Q + Nδ. We sketch the proof:
• The non-critical section C_{1,2} must be part of the critical path, since C_{1,2} = Q/M + Nδ, which is greater than (N − 1)A_{i,1} + C_{i,2} for any i = 2, 3, ..., N.
• The longest path that ends at the vertex representing A_{1,1} consists of 1) one non-critical section, 2) j − 1 critical sections from the tasks τ_i with i = 2, 3, ..., N, and 3) the critical section of task τ_1. Therefore, its length is δ + (j − 1)δ + Q − Q/M = jδ + Q − Q/M.
• Combining the two observations, we reach the conclusion.
Therefore, the dependency graph G* with the minimum critical path length is the one in which τ_1's critical section is the first among the N critical sections. The optimal schedule of the dependency graph G* on M processors has the following properties:
• Task τ_1 finishes its critical section at time δ + Q − Q/M.
• Before time δ + Q − Q/M, none of the second non-critical sections is executed.
Therefore, the makespan of any feasible schedule S(G*) of G* on M processors is

L(S(G*)) ≥ δ + Q − Q/M + (Σ_{i=1}^{N} C_{i,2}) / M
         = δ + Q − Q/M + ((M² − M + 1)·Q/M + Nδ) / M
         = (1 + N/M)·δ + (2 − 2/M + 1/M²)·Q

• Moreover, when the scheduling policy is either semi-partitioned or partitioned scheduling, by the pigeonhole principle at least one processor must execute ⌈N/M⌉ of the N second non-critical sections, none earlier than δ + Q − Q/M. Therefore, the makespan of a feasible semi-partitioned or partitioned schedule S_p of G* on M processors is

L(S_p(G*)) ≥ δ + Q − Q/M + ⌈N/M⌉·(Q/M)
           = δ + Q − Q/M + ⌈(M² − M + 1)/M⌉·(Q/M)
           = δ + Q − Q/M + M·(Q/M)
           = δ + (2 − 1/M)·Q

We can construct another feasible partitioned schedule S*:
• The first non-critical section of τ_1 is executed on processor M, and the first non-critical sections of the other N − 1 tasks are executed on the first M − 1 processors based on list scheduling. All first non-critical sections finish no later than Mδ. Each of the first M − 1 processors executes exactly M tasks, since there are N − 1 = M(M − 1) tasks with identical properties on these M − 1 processors.
• The critical sections of tasks τ_N, τ_{N−1}, ..., τ_2 are executed sequentially, following this reversed-index order, on the same processors as their corresponding first non-critical sections, starting from time Mδ.
• At time
Mδ + Nδ, all the second non-critical sections of τ_2, ..., τ_N are eligible to be executed. We execute them in parallel on the first M − 1 processors, respecting the partitioned scheduling strategy. That is, each of the first M − 1 processors executes exactly M tasks with C_{i,2} = Q/M. The makespan of these N − 1 tasks is (N + M)δ + (N − 1)·(Q/M)/(M − 1) = (N + M)δ + Q.
• At time
Mδ + Nδ, the critical section of τ_1 starts its execution on processor M. Furthermore, at time (N + M)δ + Q − Q/M, the second non-critical section of τ_1 is executed on processor M, and it finishes at time (N + M)δ + Q + Nδ = (2N + M)δ + Q.
• As a result, the makespan of the above partitioned schedule S* is exactly (2N + M)δ + Q.
Therefore, the approximation bound of the optimal task dependency graph approach is at least L(S(G*))/L(S*) under any scheduling paradigm and at least L(S_p(G*))/L(S*) under the partitioned or semi-partitioned scheduling paradigm. We reach the conclusion by taking δ → 0.

Pseudo-code of the Partitioned Preemptive Scheduling in Section 6.2
For notational brevity, we define two vertices v_{i,1} and v_{i,3} to represent the first and second non-critical sections of task τ_i, and v_{i,2} to represent the critical section of task τ_i. Let T_m be the set of tasks in T assigned to processor m, for m = 1, 2, ..., M. The pseudo-code is listed in Algorithm 2. It consists of three blocks: initialization from Line 1 to Line 4, scheduling of the first non-critical sections and the critical sections of the tasks according to Ḡ from Line 5 to Line 23, and scheduling of the second non-critical sections of the tasks from Line 24 to Line 28.

The first block is self-explanatory. We focus on the second and third blocks of Algorithm 2. Our scheduling algorithm executes the first non-critical sections and the critical sections non-preemptively. Whenever a subjob finishes at time t, we examine the following scenarios on each processor m, for m = 1, 2, ..., M:
• If there is a pending critical section on processor m that is eligible at time t according to the dependency graph G, we would like to execute it as soon as possible. Therefore, this critical section is executed as soon as it is eligible and the processor idles (Lines 12-13).
• Else, if there is a task in T_m whose first non-critical section has not finished yet at time t, we execute it (Lines 14-15).
• Otherwise, there is no eligible subjob to be executed at time t. If there is still an unassigned task, we select one

Algorithm 2 Tied List-Scheduling (Partitioned Preemptive)
Input: G, T , M with | T | > M ;1: current ← ;2: assign one task τ i in T to task set T m to be executed on processor m ;3: T ← T \ ∪ Mm =1 T m ;4: execute v i, of the unique task τ i in T m on processor m from time ,i.e., ρ ( t, m ) ← τ i for t ∈ [0 , C i, ) , for each m = 1 , , . . . , M ;5: while ∃ τ i such that v i, has not finished yet at time current do
6: let t be the minimum time instant greater than current such that theschedule finishes a subjob at time t ;7: current ← t ;8: for m = 1 , , . . . , M do if processor m is busy executing a subjob at time t then
10: continue;11: else if processor m idles (or just finishes a subjob) at time t then if ∃ τ i ∈ T m , in which v i, has not finished yet and v i, iseligible according to G at time t then
13: execute τ i ’s critical section from time t to t + A i, non-preemptively on processor m , i.e., ρ ( θ, m ) ← τ i for θ ∈ [ t, t + A i, ) ;14: else if ∃ τ i ∈ T m , in which v i, has not finished yet at t then
15: execute v i, from time t on t + C i, processor m , i.e., ρ ( θ, m ) ← τ i for θ ∈ [ t, t + C i, ) ;16: else if T is not empty then
17: select a task τ i and remove τ i from T , i.e., T ← T \ { τ i } ;18: assign task τ i to processor m , i.e., T m ← T m ∪ { τ i } ;19: execute v i, from time t to t + C i, on processor m , i.e., ρ ( θ, m ) ← τ i for θ ∈ [ t, t + C i, ) ;20: end if end if end for end while for m = 1 , , . . . , M do for each task τ i in T m do
26:     schedule the second non-critical section v_{i,3} of task τ_i as background workload with the lowest priority, preemptively, as early as possible but no earlier than the finishing time of its critical section;
27:   end for
28: end for

In all the above steps, task τ_i can be selected arbitrarily if multiple tasks satisfy the specified conditions. We note that the schedule is in fact offline. Therefore, after we have scheduled the first non-critical sections and the critical sections, in the third block of Algorithm 2 we can pad the idle time of the schedule on a processor m with the second non-critical sections assigned to processor m, starting from time 0. The only caveat is that a second non-critical section must not start earlier than the finishing time of the corresponding critical section. Of course, to minimize the makespan, we should always pad the idle time as early as possible.

Implementation in LITMUS^RT. To force the tasks to follow the pre-defined order when executing their critical sections, we added several elements to the rt_params structure, which defines the properties of each task, i.e., priority, period, execution time, etc. Two parameters are added: 1) rt_order, which defines the order in which the task executes its critical section, and 2) rt_total, which defines the number of tasks that share the same resource. To implement the binary semaphores under the dependency graph approach, we created two new structures: pdga_semaphore for the partitioned dependency graph approach (PDGA) and sdga_semaphore for the semi-partitioned dependency graph approach (SDGA). In these structures, one parameter, named current_serving_ticket, controls the order of execution. When a task requests the resource, it compares its rt_order with the semaphore's serving ticket.
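Abstracted away from the LITMUS^RT kernel structures, the intended ticket-based serialization can be sketched as follows. This is a minimal user-space Python sketch under our own naming; `DGASemaphore`, `acquire`, and `release` are illustrative and are not the kernel API:

```python
import threading

class DGASemaphore:
    """Sketch of the ticket-based serialization: tasks enter their
    critical sections in the pre-computed order 0, 1, ..., rt_total-1."""

    def __init__(self, rt_total):
        self.rt_total = rt_total              # number of tasks sharing this resource
        self.current_serving_ticket = 0       # next rt_order to be served
        self._cond = threading.Condition()    # stands in for the kernel wait-queue

    def acquire(self, rt_order):
        with self._cond:
            # Block until this task's rt_order matches the serving ticket.
            while self.current_serving_ticket != rt_order:
                self._cond.wait()

    def release(self):
        with self._cond:
            self.current_serving_ticket += 1
            # One round of critical sections finished: reset for the next iteration.
            if self.current_serving_ticket == self.rt_total:
                self.current_serving_ticket = 0
            self._cond.notify_all()           # let waiters re-check the ticket
```

Each task would call acquire(rt_order) before and release() after its critical section; regardless of the order in which the tasks reach the semaphore, the critical sections are executed in the order fixed by the dependency graph.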
If the task's rt_order equals the semaphore's current_serving_ticket, the task is granted access to the resource and starts its critical section; if not, the task is added to the wait-queue, which is sorted by the tasks' rt_order parameter. Once a task has finished its critical section, it increases the semaphore's current_serving_ticket by 1 and checks the head of the wait-queue to repeat the comparison. Once the current_serving_ticket reaches rt_total, i.e., one dependency graph has finished the execution of its critical sections, the current_serving_ticket is reset to 0 to start the next iteration. The only difference between PDGA and SDGA is that we added a migration function to SDGA to support the semi-partitioned algorithm.

Fig. 5. Comparison of different approaches for periodic task sets: acceptance ratio (%) over utilization (%)/M for (a) M=8, z=8, β=5%-10% and (b) M=8, z=8, β=10%-40%; the compared approaches are POTTS-SF, JKS-SF, EDF-EIM-PCP, FP-EIM-PCP, LP-GFP-FMLP, LP-GFP-DPCP, GS-MSRP, and LP-GFP-PIP.

Evaluations for Periodic Task Sets
We also performed evaluations for periodic task systems, in which a binary semaphore is only shared by tasks with the same period, as described in Section 8. We used configurations similar to those in Section 9.2 to generate the task sets. Tasks that share the same semaphore have the same period in the range of [1, . The following algorithms were evaluated:
• LP-GFP-FMLP [7]: a linear-programming-based (LP) analysis for global FP scheduling using the FMLP [7].
• LP-PFP-DPCP [9]: LP-based analysis for partitioned FP scheduling and the DPCP [39]. Tasks are assigned using Worst-Fit-Decreasing (WFD) as proposed in [9].
• LP-PFP-MPCP [9]: LP-based analysis for partitioned FP scheduling using the MPCP [38]. Tasks are partitioned according to WFD as proposed in [9].
• GS-MSRP [48]: the Greedy Slacker (GS) partitioning heuristic with the spin-based locking protocol MSRP [20] under Audsley's Optimal Priority Assignment [4].
• LP-GFP-PIP: LP-based analysis for global FP scheduling using the Priority Inheritance Protocol (PIP) [17].
• FP-EIM-PCP [44]: the ROP scheduling under fixed-priority scheduling and release enforcement.
• EDF-EIM-PCP [44]: the ROP scheduling under dynamic-priority scheduling and release enforcement.
• POTTS-SF: our approach, applying algorithm Potts to generate G and the semi-federated scheduling in [25].
• JKS-SF: our approach, applying algorithm JKS to generate G and the semi-federated scheduling in [25].
For each evaluation point, synthetic task sets were generated and tested. Only a subset of the results is presented in Fig. 5 due to space limitations, and LP-PFP-MPCP is not shown for better readability, since it performs the worst in the evaluations in Fig. 5. The figure clearly shows that POTTS-SF and JKS-SF significantly outperform the other approaches.

References
[1] S. Afshar, M. Behnam, R. J. Bril, and T. Nolte. An optimal spin-lock priority assignment algorithm for real-time multi-core systems. In RTCSA, pages 1-11, 2017.
[2] B. Andersson and A. Easwaran.
Provably good multiprocessor scheduling with resource sharing. Real-Time Systems, 46(2):153-159, 2010.
[3] B. Andersson and G. Raravi. Real-time scheduling with resource sharing on heterogeneous multiprocessors. Real-Time Systems, 50(2):270-314, 2014.
[4] N. C. Audsley. Optimal priority assignment and feasibility of static priority tasks with arbitrary start times. Technical Report YCS-164, Department of Computer Science, University of York, 1991.
[5] T. P. Baker. Stack-based scheduling of realtime processes. Real-Time Systems, 3(1):67-99, 1991.
[6] S. Baruah. The federated scheduling of systems of conditional sporadic DAG tasks. In Proceedings of the 15th International Conference on Embedded Software (EMSOFT), 2015.
[7] A. Block, H. Leontyev, B. Brandenburg, and J. Anderson. A flexible real-time locking protocol for multiprocessors. In RTCSA, pages 47-56, 2007.
[8] B. Brandenburg. Scheduling and Locking in Multiprocessor Real-Time Operating Systems. PhD thesis, The University of North Carolina at Chapel Hill, 2011.
[9] B. Brandenburg. Improved analysis and evaluation of real-time semaphore protocols for P-FP scheduling. In RTAS, 2013.
[10] B. B. Brandenburg. The FMLP+: an asymptotically optimal real-time locking protocol for suspension-aware analysis. In Euromicro Conference on Real-Time Systems (ECRTS), pages 61-71, 2014.
[11] B. B. Brandenburg and J. H. Anderson. Optimality results for multiprocessor real-time locking. In Real-Time Systems Symposium (RTSS), pages 49-60, 2010.
[12] A. Burns and A. J. Wellings. A schedulability compatible multiprocessor resource sharing protocol - MrsP. In Euromicro Conference on Real-Time Systems (ECRTS), pages 282-291, 2013.
[13] J. M. Calandrino, H. Leontyev, A. Block, U. C. Devi, and J. H. Anderson. LITMUS^RT: A testbed for empirically comparing real-time multiprocessor schedulers. In Real-Time Systems Symposium (RTSS), pages 111-126. IEEE, 2006.
[14] J. Carlier. The one-machine sequencing problem. European Journal of Operational Research, 11(1):42-47, 1982.
[15] J.-J. Chen. Federated scheduling admits no constant speedup factors for constrained-deadline DAG task systems. Real-Time Systems, 52(6):833-838, November 2016.
[16] R. I. Davis and A. Burns. A survey of hard real-time scheduling for multiprocessor systems. ACM Comput. Surv., 43(4):35, 2011.
[17] A. Easwaran and B. Andersson. Resource sharing in global fixed-priority preemptive multiprocessor scheduling. In Real-Time Systems Symposium (RTSS), pages 377-386, 2009.
[18] P. Emberson, R. Stafford, and R. I. Davis. Techniques for the synthesis of multiprocessor tasksets. In International Workshop on Analysis Tools and Methodologies for Embedded and Real-time Systems (WATERS 2010), pages 6-11, 2010.
[19] D. Faggioli, G. Lipari, and T. Cucinotta. The multiprocessor bandwidth inheritance protocol. In Euromicro Conference on Real-Time Systems (ECRTS), pages 90-99, 2010.
[20] P. Gai, G. Lipari, and M. D. Natale. Minimizing memory utilization of real-time task sets in single and multi-processor systems-on-a-chip. In Real-Time Systems Symposium (RTSS), pages 73-83, 2001.
[21] R. L. Graham. Bounds on multiprocessing timing anomalies. SIAM Journal of Applied Mathematics, 17(2):416-429, 1969.
[22] L. A. Hall and D. B. Shmoys. Jackson's rule for single-machine scheduling: Making a good heuristic better. Math. Oper. Res., 17(1):22-35, 1992.
[23] P.-C. Hsiu, D.-N. Lee, and T.-W. Kuo. Task synchronization and allocation for many-core real-time systems. In International Conference on Embedded Software (EMSOFT), pages 79-88, 2011.
[24] W.-H. Huang, M. Yang, and J.-J. Chen. Resource-oriented partitioned scheduling in multiprocessor systems: How to partition and how to share? In Real-Time Systems Symposium (RTSS), pages 111-122, 2016.
[25] X. Jiang, N. Guan, X. Long, and W. Yi. Semi-federated scheduling of parallel real-time tasks on multiprocessors. In Proceedings of the 38th IEEE Real-Time Systems Symposium (RTSS), 2017.
[26] B. Kalyanasundaram and K. Pruhs. Speed is as powerful as clairvoyance. Journal of ACM, 47(4):617-643, July 2000.
[27] H. Kise, T. Ibaraki, and H. Mine. Performance analysis of six approximation algorithms for the one-machine maximum lateness scheduling problem with ready times. Journal of the Operations Research Society of Japan, 22(3):205-224, 1979.
[28] K. Lakshmanan, D. de Niz, and R. Rajkumar. Coordinated task scheduling, allocation and synchronization on multiprocessors. In Real-Time Systems Symposium (RTSS), pages 469-478, 2009.
[29] K. Lakshmanan, S. Kato, and R. R. Rajkumar. Scheduling parallel real-time tasks on multi-core processors. In Proceedings of the 2010 31st IEEE Real-Time Systems Symposium, RTSS '10, pages 259-268, 2010.
[30] J. K. Lenstra, A. H. G. Rinnooy Kan, and P. Brucker. Complexity of machine scheduling problems. Annals of Discrete Mathematics, 1:343-362, 1977.
[31] J. Li, J.-J. Chen, K. Agrawal, C. Lu, C. D. Gill, and A. Saifullah. Analysis of federated and global scheduling for parallel real-time tasks. In , pages 85-96, 2014.
[32] G. McMahon and M. Florian. On scheduling with ready times and due dates to minimize maximum lateness. Operations Research, 23(3):475-482, 1975.
[33] F. Nemati, M. Behnam, and T. Nolte. Independently-developed real-time systems on multi-cores with shared resources. In Euromicro Conference on Real-Time Systems (ECRTS), pages 251-261, 2011.
[34] F. Nemati, T. Nolte, and M. Behnam. Partitioning real-time systems on multiprocessors with shared resources. In Principles of Distributed Systems - International Conference, OPODIS, pages 253-269, 2010.
[35] E. Nowicki and S. Zdrzałka. A note on minimizing maximum lateness in a one-machine sequencing problem with release dates. European Journal of Operational Research, 23(2):266-267, 1986.
[36] C. Phillips, C. Stein, E. Torng, and J. Wein. Optimal time-critical scheduling via resource augmentation. In ACM Symposium on Theory of Computing, pages 140-149, 1997.
[37] C. N. Potts. Analysis of a heuristic for one machine sequencing with release dates and delivery times. Operations Research, 28(6):1436-1441, 1980.
[38] R. Rajkumar. Real-time synchronization protocols for shared memory multiprocessors. In Proceedings, 10th International Conference on Distributed Computing Systems, pages 116-123, 1990.
[39] R. Rajkumar, L. Sha, and J. P. Lehoczky. Real-time synchronization protocols for multiprocessors. In Proceedings of the 9th IEEE Real-Time Systems Symposium (RTSS '88), pages 259-269, 1988.
[40] L. Sha, R. Rajkumar, and J. P. Lehoczky. Priority inheritance protocols: An approach to real-time synchronization. IEEE Trans. Computers, 39(9):1175-1185, 1990.
[41] J. Shi. DGA-LITMUS-RT. https://github.com/Strange369/Dependency-Graph-Approaches-for-LITMUS-RT, 2018.
[42] J. Sun, N. Guan, Y. Wang, Q. He, and W. Yi. Real-time scheduling and analysis of OpenMP task systems with tied tasks. In IEEE Real-Time Systems Symposium (RTSS), pages 92-103, 2017.
[43] N. Ueter, G. von der Brüggen, J.-J. Chen, J. Li, and K. Agrawal. Reservation-based federated scheduling for parallel real-time tasks. CoRR, abs/1712.05040, 2017.
[44] G. von der Brüggen, J.-J. Chen, W.-H. Huang, and M. Yang. Release enforcement in resource-oriented partitioned scheduling for multiprocessor systems. In Proceedings of the 25th International Conference on Real-Time Networks and Systems (RTNS), pages 287-296, 2017.
[45] G. von der Brüggen, W.-H. Huang, J.-J. Chen, and C. Liu. Uniprocessor scheduling strategies for self-suspending task systems. In International Conference on Real-Time Networks and Systems, RTNS '16, pages 119-128, 2016.
[46] B. C. Ward and J. H. Anderson. Supporting nested locking in multiprocessor real-time systems. In Euromicro Conference on Real-Time Systems (ECRTS), pages 223-232, 2012.
[47] B. C. Ward and J. H. Anderson. Fine-grained multiprocessor real-time locking with improved blocking. In International Conference on Real-Time Networks and Systems (RTNS), pages 67-76, 2013.
[48] A. Wieder and B. Brandenburg. On spin locks in AUTOSAR: blocking analysis of FIFO, unordered, and priority-ordered spin locks. In RTSS, 2013.
[49] A. Wieder and B. B. Brandenburg. Efficient partitioning of sporadic real-time tasks with shared resources and spin locks. In International Symposium on Industrial Embedded Systems (SIES), pages 49-58, 2013.
[50] D. Zhu, R. G. Melhem, and B. R. Childers. Scheduling with dynamic voltage/speed adjustment using slack reclamation in multi-processor real-time systems. In