DPCP-p: A Distributed Locking Protocol for Parallel Real-Time Tasks
Maolin Yang†, Zewei Chen†, Xu Jiang†, Nan Guan‡, Hang Lei†
†University of Electronic Science and Technology of China (UESTC), Chengdu, China
‡Hong Kong Polytechnic University (PolyU), Hong Kong, China
Abstract—Real-time scheduling and locking protocols are fundamental facilities for constructing time-critical systems. For parallel real-time tasks, predictable locking protocols are required when concurrent sub-jobs access shared resources in a mutually exclusive manner. This paper for the first time studies the distributed synchronization framework for parallel real-time tasks, where both tasks and global resources are partitioned to designated processors, and requests to each global resource are conducted on the processor to which the resource is partitioned. We extend the Distributed Priority Ceiling Protocol (DPCP) to parallel tasks under federated scheduling, and prove that a request can be blocked by at most one lower-priority request. We develop task and resource partitioning heuristics and propose analysis techniques to safely bound task response times. Numerical evaluation (with heavy tasks on 8-, 16-, and 32-core processors) indicates that the proposed methods improve schedulability significantly compared to the state-of-the-art locking protocols under federated scheduling.
Index Terms—real-time scheduling, locking protocols, parallel tasks
I. INTRODUCTION
To exploit parallelism for time-critical applications on multi-cores, the design of scheduling and analysis techniques for parallel real-time tasks has attracted increasing interest in recent years. Among the scheduling algorithms for parallel real-time tasks, federated scheduling [13] is a promising approach with high flexibility and simplicity in analysis.

Coordinated locking protocols are used to ensure mutually exclusive access to shared resources while preventing uncontrolled priority inversions [6], [11]. In multiprocessor systems, requests to shared resources can be executed locally by the tasks [15] or remotely by resource agents [16], e.g., by means of the Remote Procedure Call (RPC) mechanism. Local execution of requests is in general more efficient since migrations are not needed, while blocking can be better explored and managed with remote execution of requests, e.g., by constraining resource contention to designated processors [9], [10]. While existing locking protocols for parallel tasks [6], [11] are all based on local execution of requests, to the best of our knowledge no work has addressed remote execution of requests.

The Distributed Priority Ceiling Protocol (DPCP) [16] is a classic multiprocessor real-time locking protocol for sequential tasks that executes requests to global resources remotely: both tasks and shared resources are partitioned among the processors, and all requests to a global resource must be conducted by resource agents on the processor to which the resource is partitioned. Empirical studies [2] indicate that the DPCP has better schedulability performance compared to similar protocols with local execution of requests.
(Work supported by the NSFC (Grant No. 61802052) and the China Postdoctoral Science Foundation Funded Project (Grant No. 2017M612947).)

Further, the recent Resource-Oriented Partitioned (ROP) scheduling [10], [17], [18] with the DPCP guarantees bounded speedup factors.

In addition, since each heavy task exclusively uses a subset of processors under federated scheduling, there can be significant resource waste; in the extreme case, almost half of the processing capacity is wasted. Executing global-resource requests on remote processors can alleviate this potential waste by shifting part of the resource-related workload of a task to processors with lower workload.

This paper for the first time studies the distributed synchronization framework for parallel real-time tasks. We answer the fundamental question of whether the key insight of remote execution of requests to shared resources for sequential tasks can be applied to parallel real-time tasks, and how to do so. We propose
DPCP-p, an extension of the DPCP that supports parallel real-time tasks under federated scheduling, and develop the corresponding schedulability analysis and partitioning heuristic. DPCP-p retains the fundamental property of the underlying priority ceiling mechanism of the DPCP, namely that a request can be blocked by at most one lower-priority request. Numerical evaluation with heavy tasks on processors with 8 or more cores indicates that DPCP-p improves schedulability significantly compared to existing locking protocols under federated scheduling.

II. SYSTEM MODEL AND TERMINOLOGIES
We consider a set of n parallel tasks τ = {τ_1, ..., τ_n} to be scheduled on m ≥ 1 identical processors ℘ = {℘_1, ..., ℘_m} with n_r shared resources Φ = {ℓ_1, ..., ℓ_{n_r}}.

Parallel Tasks.
Each task τ_i is characterized by a Worst-Case Execution Time (WCET) C_i, a relative deadline D_i, and a minimum inter-arrival time T_i, where D_i ≤ T_i (constrained deadlines are considered). The utilization of τ_i is defined by U_i = C_i/T_i. The structure of τ_i is represented by a Directed Acyclic Graph (DAG) G_i = ⟨V_i, E_i⟩, where V_i is the set of vertices and E_i is the set of edges. Each vertex v_i,x ∈ V_i has a WCET C_i,x, and the WCET of all vertices of τ_i is C_i = Σ_{v_i,x ∈ V_i} C_i,x. Each edge (v_i,x, v_i,y) ∈ E_i represents a precedence relation between v_i,x and v_i,y. A vertex v_i,x is said to be pending during the interval in which all its predecessors have finished but v_i,x itself has not. A complete path is a sequence of vertices (v_i,a, ..., v_i,z), where v_i,a is a head vertex, v_i,z is a tail vertex, and each vertex on the path is a predecessor of the next. We use λ_i to denote an arbitrary complete path. The length of λ_i, denoted by L(λ_i), is defined as the sum of the WCETs of the vertices on λ_i. We also use L∗_i to denote the length of the longest path of G_i. For example, in Fig. 1(a) the longest path of G_i is (v_i,1, v_i,5, v_i,7, v_i,8), and L∗_i = 10.

At runtime, each task generates a sequence of jobs, and each job inherits the DAG structure of the task. Let J_i,j denote the j-th job of τ_i. Let a_i,j and f_i,j denote the arrival and finish time of J_i,j, respectively; then J_i,j must finish no later than a_i,j + D_i, and the subsequent job J_i,j+1 cannot arrive before a_i,j + T_i. The Worst-Case Response Time (WCRT) of task τ_i is defined as R_i = max_∀j {f_i,j − a_i,j}. For brevity, let J_i be an arbitrary job of τ_i.

Shared Resources.
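The longest-path length L∗_i defined above can be computed by one memoized traversal of the DAG. A minimal sketch in Python; the edge set below is a plausible reconstruction of G_i in Fig. 1(a) (the figure's exact edges are not recoverable from the text), and all container names are illustrative:

```python
# Sketch: computing L*_i by memoized recursion over the DAG.
from functools import lru_cache

def longest_path_length(wcet, succs):
    """L*_i: the maximum cumulative WCET over all complete paths."""
    @lru_cache(maxsize=None)
    def tail(v):
        # Longest cumulative WCET of any path starting at vertex v.
        return wcet[v] + max((tail(s) for s in succs.get(v, ())), default=0)
    return max(tail(v) for v in wcet)

wcet = {1: 2, 2: 3, 3: 2, 4: 2, 5: 4, 6: 2, 7: 2, 8: 2}   # C_i,x of Fig. 1(a)
succs = {1: (2, 3, 4, 5), 2: (6,), 3: (6,), 4: (7,), 5: (7,), 6: (8,), 7: (8,)}
print(longest_path_length(wcet, succs))  # 10, i.e., the path (v1, v5, v7, v8)
```

Per-path quantities such as the request counts N^λ_i,q used later can be accumulated along the same traversal.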
Each task τ_i uses a set of shared resources Φ_i ⊆ Φ, and each resource ℓ_q is shared by a set of tasks τ(ℓ_q). To ensure mutual exclusion, ℓ_q is protected by a binary semaphore (also called a lock for short). A job is allowed to execute a critical section for ℓ_q only if it holds the lock of ℓ_q; otherwise, it is suspended. A vertex v_i,x requests ℓ_q at most N_i,x,q times, and each time uses ℓ_q for at most L_i,q time units. Accordingly, a path λ_i requests ℓ_q at most N^λ_i,q = Σ_{v_i,x ∈ λ_i} N_i,x,q times, and a job J_i requests ℓ_q at most N_i,q = Σ_{v_i,x ∈ V_i} N_i,x,q times.

Given that L_i,q is included in C_i, for brevity, we use C′_i,x and C′_i to denote the WCETs of the non-critical sections of v_i,x and τ_i, respectively. For simplicity, it is assumed that C′_i = Σ_{v_i,x ∈ V_i} C′_i,x = C_i − Σ_{ℓ_q ∈ Φ_i} N_i,q · L_i,q. Further, critical sections are assumed to be non-nested; nested critical sections are left as future work.

Scheduling.
The tasks are scheduled under the federated scheduling paradigm [13]. Each task τ_i with C_i/D_i > 1 (i.e., a heavy task) is assigned m_i dedicated processors, and we use ℘(τ_i) to denote the set of processors assigned to τ_i. The remaining light tasks are assigned to the remaining processors. Each task τ_i has a unique base priority π_i, and π_i < π_h implies that τ_i has a lower base priority than τ_h. All jobs of τ_i and all vertices of τ_i have the same base priority π_i.

At runtime, each heavy task is scheduled exclusively on its assigned processors according to a work-conserving scheduler (i.e., no processor assigned to a task is idle while a vertex of the task is ready to be scheduled), while each light task is treated as a sequential task and is scheduled together with the other tasks (if any) assigned to the same processor. We focus on heavy tasks in the following and discuss how to handle both heavy and light tasks in Sec. VI.

III. THE DISTRIBUTED LOCKING PROTOCOL DPCP-p

The design of DPCP-p is based on the DPCP [16] and extends it to support parallel real-time tasks under federated scheduling.
A. The Synchronization Framework
Under federated scheduling, a resource can be shared locally or globally. A resource ℓ_q is a local resource if it is shared only by the vertices of a single task, and a global resource if it is shared by more than one task. For example, in Fig. 1, ℓ_1 is a global resource and ℓ_2 is a local resource. We use Φ^L and Φ^G to denote the sets of local resources and global resources, respectively.

Each global resource ℓ_q ∈ Φ^G is assigned to a processor, and all requests to ℓ_q must execute on that processor, e.g., by means of an RPC-like agent [16]. Once a vertex requests a global resource, it is suspended until the agent finishes. Requests to local resources are executed by the tasks directly, i.e., no migration is required.

For brevity, we use Φ(℘_k) to denote the set of global resources on processor ℘_k. The global resources that are assigned to the same processor as ℓ_q are denoted by Φ^℘(ℓ_q), and the global resources that are assigned to the same processors as τ_i are denoted by Φ^℘(τ_i).

B. Queue Structure
While pending, a vertex is either executing, ready but not scheduled, or suspended. The following queues are used to maintain the states of the vertices of each task.

• RQ^N_i: the ready queue of τ_i for vertices that are ready to execute non-critical sections. The vertices in RQ^N_i are scheduled in First-In-First-Out (FIFO) order.
• RQ^L_i: the ready queue of τ_i for vertices that are holding local resources. The vertices in RQ^L_i are scheduled in FIFO order. If both RQ^N_i and RQ^L_i are non-empty, the vertices in RQ^L_i are scheduled first.
• SQ_i: the suspended queue of τ_i. Each vertex in SQ_i is waiting for a request to finish.

In addition, each processor maintains two hybrid queues to handle global-resource requests.

• RQ^G_k: the ready queue of global-resource requests on processor ℘_k. The requests in RQ^G_k are scheduled by the priorities of the tasks.
• SQ^G_k: the suspended queue of global-resource requests on processor ℘_k.

C. Locking Rules
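The locking rules below hinge on the ceiling-based grant test of the DPCP: a request with effective priority π_H + π_i is granted only if it exceeds the current processor ceiling. A minimal sketch of that test (not the paper's implementation; the numeric boost level and all names are illustrative):

```python
# Sketch of the DPCP ceiling-based grant test; PI_H models the boost level
# pi_H above every base priority.
PI_H = 1000  # assumed to exceed all base priorities

def ceiling(resource, sharers):
    """Priority ceiling of a global resource: pi_H plus the maximum base
    priority among the tasks sharing it."""
    return PI_H + max(sharers[resource])

def processor_ceiling(locked, sharers):
    """Processor ceiling: maximum ceiling over the currently locked globals."""
    return max((ceiling(r, sharers) for r in locked), default=0)

def granted(pi_i, locked, sharers):
    """A request of task tau_i is granted iff pi_H + pi_i exceeds the
    current processor ceiling."""
    return PI_H + pi_i > processor_ceiling(locked, sharers)

# Two tasks with base priorities 1 and 2 share l1 on the same processor.
sharers = {"l1": [1, 2]}
print(granted(2, [], sharers))      # True: no global resource is locked
print(granted(2, ["l1"], sharers))  # False: ceiling of l1 is PI_H + 2
```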
Under priority scheduling, the problem of priority inversion [3] is inevitable when jobs compete for shared resources. Various progress mechanisms [3], [15], [16] are used to minimize the duration of priority inversions. In the following, we consider the priority ceiling mechanism inherited from the DPCP [16].

Consider a global resource ℓ_q ∈ Φ^G on processor ℘_k. The priority ceiling of ℓ_q is defined as Π_q = π_H + max_{τ_j ∈ τ(ℓ_q)} π_j, where π_H is a priority level higher than the base priority of any task in τ. At runtime, the processor ceiling of ℘_k at some time t, denoted by Π^℘_k(t), is the maximum of the priority ceilings of the global resources that are allocated to ℘_k and locked at time t. Let ℜ_i,q be a request from a job J_i to a global resource ℓ_q ∈ Φ^G. The effective priority of ℜ_i,q at time t, denoted by π^E_i(t), is elevated to π^E_i(t) = π_H + π_i. The priority ceiling mechanism ensures that a global-resource request ℜ_i,q is granted the lock at time t only if π^E_i(t) > Π^℘_k(t).

Next, we introduce the locking rules of DPCP-p. Consider a vertex v_i,x that issues a request ℜ_i,q for a resource ℓ_q at some time t.

Rule 1. If ℓ_q is a local resource locked by another vertex at time t, then v_i,x is suspended and enqueued to SQ_i.

Rule 2. If ℓ_q is a local resource not locked at time t, then v_i,x locks ℓ_q and queues upon RQ^L_i, i.e., v_i,x is ready to be scheduled to execute the critical section.

Rule 3. If ℓ_q is a global resource on some processor ℘_k, then v_i,x is suspended and enqueued to SQ_i. Meanwhile, ℜ_i,q tries to lock ℓ_q according to the priority ceiling mechanism: ℜ_i,q queues upon RQ^G_k and is ready to be scheduled (according to its priority) if the lock is granted; otherwise, ℜ_i,q is enqueued to SQ^G_k.

Rule 4. Once ℜ_i,q finishes, it releases the lock of ℓ_q, and dequeues from RQ^G_k if ℓ_q is a global resource. Then, v_i,x is enqueued to RQ^N_i.

Fig. 1 shows an example schedule of DPCP-p with two DAG tasks on a four-core platform, where each task is assigned two processors. At time t = 2, (i) v_i,2 is suspended and enqueued to SQ_i until the global-resource request ℜ_i,1 finishes on its designated processor at time t = 7; (ii) ℜ_i,1 is suspended and enqueued to the corresponding SQ^G until ℜ_j,1 releases ℓ_1 at time t = 4; and (iii) v_i,3 locks the local resource ℓ_2, is enqueued to RQ^L_i and scheduled until time t = 4, while v_i,4 is suspended and queued upon SQ_i until v_i,3 releases ℓ_2 at time t = 4.

Lemma 1.
Under DPCP-p, a request can be blocked by lower-priority requests at most once.

Proof.
We prove by contradiction. Since each local resource is used by only one task, we consider global-resource requests. Suppose that a request ℜ_i,q (ℓ_q ∈ Φ^G) on a processor ℘_k is blocked by at least two lower-priority requests ℜ_a,u and ℜ_b,v (π_a < π_i, π_b < π_i). Let t_i,s and t_i,f be the times when ℜ_i,q arrives and finishes, respectively. Let t_a,r and t_b,r be the times when ℜ_a,u and ℜ_b,v are granted their locks, respectively. Without loss of generality, assume t_a,r < t_b,r. While ℜ_i,q is pending at some time t ∈ [t_i,s, t_i,f], the processor ceiling satisfies Π^℘_k(t) ≥ π_H + π_i according to the priority ceiling mechanism. Since ℜ_i,q can be blocked by ℜ_a,u, the priority ceiling of ℓ_u is no less than π_H + π_i, i.e., Π_u ≥ π_H + π_i. Thus, Π^℘_k(t) ≥ π_H + π_i during t ∈ [min(t_i,s, t_a,r), t_i,f]. Further, by hypothesis, ℜ_i,q is blocked by ℜ_b,v, so ℜ_b,v must be granted the lock at some time t ∈ (t_a,r, t_i,f). According to the priority ceiling mechanism, the effective priority of ℜ_b,v must be larger than the processor ceiling at time t, i.e., π^E_b(t) = π_H + π_b > Π^℘_k(t) ≥ π_H + π_i. Thus, π_b > π_i, a contradiction.

[Fig. 1: Example schedule of two DAG tasks. (a) The structures of two DAG tasks G_i and G_j, annotated with per-vertex WCETs, sharing resources ℓ_1 (red) and ℓ_2 (blue). (b) Example schedule of DPCP-p with ℓ_1 assigned to ℘.]

IV. WORST-CASE RESPONSE TIME ANALYSIS
We derive an upper bound on the WCRT of an arbitrary path of τ_i using fixed-point Response-Time Analysis (RTA) in this section. Let r_i be the WCRT of an arbitrary path λ_i; then R_i can be upper bounded by the maximum of the WCRTs of the paths, that is,

R_i = max {r_i}. (1)

To upper bound r_i, we classify the delays of a path into four categories as follows.

A. Blocking and Interference
First, we consider a global-resource request ℜ_j,q of a job J_j (i ≠ j, ℓ_q ∈ Φ^G) that causes λ_i to incur

• inter-task blocking, if an agent on behalf of ℜ_j,q is executing on some processor ℘_k while λ_i is suspended on a global resource ℓ_u ∈ Φ^G on ℘_k.

Second, a vertex v_i,y of τ_i that is not on λ_i (i.e., v_i,y ∉ λ_i) causes λ_i to incur

• intra-task blocking, if v_i,y is holding a local resource ℓ_q ∈ Φ^L and scheduled while λ_i is suspended on ℓ_q, or if an agent is executing on behalf of v_i,y on some processor ℘_k while λ_i is suspended on a global resource on ℘_k; and
• intra-task interference, if v_i,y is executing a non-critical section or a local-resource request while λ_i is ready and not executing.

Third, a global-resource request from another job or from a vertex that is not on λ_i causes λ_i to incur

• agent interference, if an agent on behalf of the request is executing while λ_i is (i) ready and not executing, or (ii) suspended on a local resource whose holder is not scheduled (i.e., preempted by the agent of the request).

Notably, the defined delays are mutually exclusive, i.e., at any point in time, a vertex or an agent can cause a path to incur at most one type of delay. This is essential to minimize over-counting in the blocking-time analysis. For example, in Fig. 1(b), at any time during t ∈ [2, 4], ℜ_j,1 only causes path (v_i,1, v_i,2, v_i,6, v_i,8) to incur inter-task blocking, v_i,3 only causes path (v_i,1, v_i,4, v_i,7, v_i,8) to incur intra-task blocking, v_j,2 only causes path (v_j,1, v_j,4, v_j,6) to incur intra-task interference, and ℜ_j,1 only causes path (v_j,1, v_j,5, v_j,6) to incur agent interference. It is also noted that a path can incur multiple types of delay at a time. For instance, at any time during t ∈ [1, 4], path (v_j,1, v_j,5, v_j,6) incurs intra-task interference and agent interference due to v_j,2 and ℜ_j,1, respectively.

Based on these definitions, we derive an upper bound on r_i in Theorem 1. In preparation, we use B_i to denote the workload of the other tasks that causes λ_i to incur inter-task blocking. Analogously, let b_i and I^intra_i denote the workloads of the vertices of τ_i not on λ_i that cause λ_i to incur intra-task blocking and intra-task interference, respectively. Let I^A_i denote the workload of the agents that causes λ_i to incur agent interference. These open variables will be bounded in Sect. IV-B and IV-C.

Theorem 1. r_i ≤ L(λ_i) + B_i + b_i + (I^intra_i + I^A_i)/m_i.

Proof. While λ_i is pending, it is either (I) executing, (II) suspended while executing on global resources, (III) ready and not executing, or (IV) suspended and not executing on any global resource. By definition, the duration of (I) and (II) can be bounded by L(λ_i).

For case (III). The workload executed on ℘(τ_i) can be from (i) the vertices of τ_i not on λ_i (i.e., intra-task interference), and (ii) the agents on behalf of the vertices not on λ_i (i.e., agent interference). By definition, the workload of (i) can be upper-bounded by I^intra_i, and the workload of (ii), denoted by Î^A_i, is a part of I^A_i.

For case (IV). If λ_i is suspended on a local resource ℓ_q, then λ_i is either (iii) waiting for a vertex of τ_i not on λ_i to release ℓ_q (i.e., intra-task blocking), or (iv) waiting for the agents that preempted the resource holder to finish (i.e., agent interference). If λ_i is suspended on a global resource on a processor ℘_k, then it can be delayed by (v) an agent on behalf of another task on ℘_k (i.e., inter-task blocking), or (vi) an agent on behalf of a vertex of τ_i not on λ_i on ℘_k (i.e., intra-task blocking). By definition, the duration of (iii) and (vi) is b_i, and the duration of (v) is B_i.
Further, for case (iv), let the workload of such agents be Ǐ^A_i.

Total durations of (I)–(IV). In (III) and (IV)-(iv), there is at least one vertex ready and not executing, thus none of the m_i processors is idle under work-conserving scheduling. Let the duration of (III) and (IV)-(iv) be Y_i; then I^intra_i + Î^A_i + Ǐ^A_i = Y_i · m_i. By definition, Î^A_i + Ǐ^A_i ≤ I^A_i. Hence, Y_i ≤ (I^intra_i + I^A_i)/m_i. Summing up (I)–(IV), we have r_i ≤ L(λ_i) + B_i + b_i + (I^intra_i + I^A_i)/m_i.

B. Upper Bounds on Blocking
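The bounds of this subsection, such as the request response time W_i,q of Lemma 2 below, are solved as fixed points: the window length appears on both sides of the equation through the job-count function η_j(·). A minimal iteration sketch with hypothetical numbers (one higher-priority task and a lower-priority blocking term β = 3; not the paper's exact procedure):

```python
# Sketch of the fixed-point pattern behind the blocking bounds.
import math

def eta(L, R_j, T_j):
    """Maximum jobs of tau_j in a window of length L: ceil((L + R_j) / T_j)."""
    return math.ceil((L + R_j) / T_j)

def least_fixed_point(rhs, limit=10**6):
    """Iterate W <- rhs(W) from 0 until convergence, or give up at `limit`."""
    W = 0
    while W <= limit:
        nxt = rhs(W)
        if nxt == W:
            return W  # least positive solution reached
        W = nxt
    return None  # no solution below the limit

# Own critical section of length 2, lower-priority term beta = 3, and one
# higher-priority task (R_h = 4, T_h = 10) with one request of length 2 per job.
W = least_fixed_point(lambda W: 2 + 3 + eta(W, 4, 10) * 2)
print(W)  # 9
```

Because the right-hand side is monotone in W, iterating from 0 yields the least positive solution whenever one exists.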
We begin with the analysis of inter-task blocking. To derive an upper bound on B_i, we first derive an upper bound on the response time of a global-resource request.

In preparation, let η_j(L) denote the maximum number of jobs of a task τ_j during a time interval of length L. It is well known that η_j(L) ≤ ⌈(L + R_j)/T_j⌉. Further, let γ_i,q(L) be the cumulative length of the requests from higher-priority tasks of τ_i to the global resources that are assigned to the same processor as ℓ_q ∈ Φ^G during a time interval of length L. Since there are at most η_h(L) jobs of each higher-priority task τ_h (π_h > π_i) during a time interval of length L, and each job J_h uses a resource ℓ_u for at most N_h,u · L_h,u, summing up the workload of all the higher-priority requests we have

γ_i,q(L) ≤ Σ_{π_h > π_i ∧ ℓ_u ∈ Φ^℘(ℓ_q)} η_h(L) · N_h,u · L_h,u. (2)

Let W_i,q be the response time of a request from λ_i to a global resource ℓ_q ∈ Φ^G. We bound W_i,q according to the following lemma.

Lemma 2. W_i,q can be upper bounded by the least positive solution, if one exists, of the following equation:

W_i,q = L_i,q + Σ_{ℓ_u ∈ Φ^℘(ℓ_q)} (N_i,u − N^λ_i,u) · L_i,u + β_i,q + γ_i,q(W_i,q), (3)

where β_i,q = max {L_j,u | π_j < π_i ∧ ℓ_u ∈ Φ^℘(ℓ_q) ∧ Π_u ≥ π_H + π_i}.

Proof.
Under DPCP-p, a global-resource request ℜ_i,q has an effective priority higher than π_H. Thus, while ℜ_i,q is pending, only global-resource requests can execute on the processor. Since global-resource requests are scheduled by their priorities, ℜ_i,q may wait for (i) at most one lower-priority request to a global resource with priority ceiling no less than π_H + π_i on the processor, (ii) intra-task requests from the vertices not on λ_i to the global resources on the processor, and (iii) higher-priority requests to the global resources on the processor. By definition, (i) can be bounded by β_i,q, and (ii) can be bounded by Σ_{ℓ_u ∈ Φ^℘(ℓ_q)} (N_i,u − N^λ_i,u) · L_i,u. By the definition of γ_i,q(L), (iii) can be bounded by γ_i,q(W_i,q). In addition, ℜ_i,q itself executes for at most L_i,q. We claim the lemma by summing up the respective bounds.

With Lemma 2 in place, we are ready to upper bound B_i.

Lemma 3. B_i ≤ Σ_{℘_k ∈ ℘} min(ε^k_i, ζ^k_i), where

ε^k_i = Σ_{ℓ_q ∈ Φ^G ∩ Φ(℘_k)} (β_i,q + γ_i,q(W_i,q)) · N^λ_i,q, (4)

and

ζ^k_i = Σ_{τ_j ≠ τ_i} Σ_{ℓ_q ∈ Φ^G ∩ Φ(℘_k)} η_j(r_i) · N_j,q · L_j,q. (5)

Proof.
Each time λ_i requests a global resource ℓ_q ∈ Φ^G on ℘_k, it can be blocked by (i) at most one lower-priority request and (ii) all higher-priority requests. Analogous to the proof of Lemma 2, (i) can be bounded by β_i,q, and (ii) can be bounded by γ_i,q(W_i,q). Since λ_i requests each global resource ℓ_q at most N^λ_i,q times, the workload of the other tasks that causes λ_i to incur inter-task blocking on ℘_k can be bounded by ε^k_i in Eq. (4).

Further, each other task τ_j (j ≠ i) has at most η_j(r_i) jobs before λ_i finishes, and each job uses a resource ℓ_q for at most N_j,q · L_j,q. Thus, the workload of the other tasks on the global resources on ℘_k can be bounded by ζ^k_i in Eq. (5). We claim the lemma by summing up the minimum of ε^k_i and ζ^k_i over all processors.

Next, we derive an upper bound for intra-task blocking. For brevity, let σ_i,k = min(1, Σ_{ℓ_u ∈ Φ(℘_k)} N^λ_i,u). Intuitively, σ_i,k = 1 if some vertex on path λ_i requests a global resource on processor ℘_k, and σ_i,k = 0 otherwise.

Lemma 4. b_i ≤ Σ_{ℓ_q ∈ Φ^L ∩ Φ(τ_i)} b^L_i,q + Σ_{℘_k ∈ ℘} b^G_i,k, where

b^L_i,q = min(1, N^λ_i,q) · (N_i,q − N^λ_i,q) · L_i,q, (6)

and

b^G_i,k = σ_i,k · Σ_{ℓ_q ∈ Φ(℘_k)} (N_i,q − N^λ_i,q) · L_i,q. (7)

Proof.
By definition, λ_i incurs intra-task blocking on a local resource ℓ_q ∈ Φ^L only if it requests ℓ_q. Clearly, min(1, N^λ_i,q) = 1 if λ_i requests ℓ_q, and min(1, N^λ_i,q) = 0 otherwise. Given that the vertices of τ_i not on λ_i can execute on a resource ℓ_q for a total of at most (N_i,q − N^λ_i,q) · L_i,q, λ_i incurs intra-task blocking on ℓ_q for at most b^L_i,q, as in Eq. (6).

Moreover, λ_i incurs intra-task blocking on a global resource on some processor ℘_k only if it requests some global resource on ℘_k. By definition, σ_i,k = 1 if λ_i requests some global resource on ℘_k, and σ_i,k = 0 otherwise. Thus, the workload that causes λ_i to incur intra-task blocking on ℘_k can be bounded by summing up (N_i,q − N^λ_i,q) · L_i,q over all the global resources on ℘_k, i.e., b^G_i,k, as in Eq. (7).

Thus, b_i can be bounded by summing up (i) b^L_i,q for all local resources in Φ(τ_i), and (ii) b^G_i,k for all processors.

C. Upper Bounds on Interference
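The interference terms bounded in this subsection (Lemmas 5 and 6 below) and the final assembly of Theorem 1 reduce to simple sums. A sketch with flattened, illustrative inputs (the tuple encodings are assumptions of this sketch, not the paper's notation):

```python
# Sketch: the shape of the interference bounds and the Theorem 1 assembly.
def intra_interference(offpath_ncs, local_offpath_reqs):
    """Lemma 5 shape: off-path non-critical WCET plus off-path
    local-resource work, each local resource given as (count, length)."""
    return sum(offpath_ncs) + sum(n * L for n, L in local_offpath_reqs)

def agent_interference(other_task_reqs, own_offpath_reqs):
    """Lemma 6 shape: per-resource agent work of other tasks' jobs
    (jobs, count, length) plus the task's own off-path global work."""
    other = sum(jobs * n * L for jobs, n, L in other_task_reqs)
    own = sum(n * L for n, L in own_offpath_reqs)
    return other + own

def wcrt_bound(L_path, B_i, b_i, I_intra, I_agent, m_i):
    """Theorem 1: r_i <= L(lambda_i) + B_i + b_i + (I_intra + I_agent)/m_i."""
    return L_path + B_i + b_i + (I_intra + I_agent) / m_i

I_intra = intra_interference([3, 2], [(2, 1)])        # (3 + 2) + 2*1 = 7
I_agent = agent_interference([(2, 1, 3)], [(1, 3)])   # 2*1*3 + 1*3 = 9
print(wcrt_bound(10, 4, 2, I_intra, I_agent, m_i=2))  # 10 + 4 + 2 + 16/2 = 24.0
```

Only the interference terms are divided by m_i; the blocking terms delay the path in full, which mirrors the case analysis in the proof of Theorem 1.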
Next, we derive upper bounds for the intra-task interference and the agent interference. First, the intra-task interference of λ_i can be upper bounded by the workload of the non-critical sections and the local-resource requests of the vertices of τ_i that are not on λ_i.

Lemma 5. I^intra_i ≤ Σ_{v_i,x ∉ λ_i} C′_i,x + Σ_{ℓ_q ∈ Φ^L} (N_i,q − N^λ_i,q) · L_i,q.

Proof. By definition, I^intra_i consists of the workload of (i) the non-critical sections and (ii) the local-resource requests of the vertices of τ_i that are not on λ_i. From the task model, (i) and (ii) are bounded by Σ_{v_i,x ∉ λ_i} C′_i,x and Σ_{ℓ_q ∈ Φ^L} (N_i,q − N^λ_i,q) · L_i,q, respectively. Thus, I^intra_i can be bounded by the total of (i) and (ii).

For each global resource in Φ^℘(τ_i), the agent interference of λ_i consists of the agent workload of the vertices that are not on λ_i.

Lemma 6. I^A_i ≤ Σ_{ℓ_q ∈ Φ^G ∩ Φ^℘(τ_i)} (I^A_i,q + Ĭ^A_i,q), where

I^A_i,q = Σ_{τ_j ≠ τ_i} η_j(r_i) · N_j,q · L_j,q, (8)

and

Ĭ^A_i,q = (N_i,q − N^λ_i,q) · L_i,q. (9)

Proof.
While λ_i is pending, the other tasks can occupy the agents of a resource ℓ_q for at most I^A_i,q, and the vertices of τ_i not on λ_i can execute on ℓ_q for at most Ĭ^A_i,q. Thus, the agent interference of λ_i can be bounded by summing up I^A_i,q + Ĭ^A_i,q over all the global resources in Φ^℘(τ_i).

Now that all the variables in Theorem 1 are bounded, the WCRT of task τ_i can be bounded according to Eq. (1) by calculating the WCRTs of all paths of τ_i.

V. TASK AND RESOURCE PARTITIONING
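The worst-fit-decreasing (WFD) assignment of global resources used in this section can be prototyped directly: each resource goes to the cluster with maximum utilization slack, and within that cluster to the least-loaded processor. A runnable sketch with simplified inputs and illustrative names (ties broken by iteration order; not the paper's exact implementation):

```python
# Sketch of the WFD global-resource assignment (cf. Algorithm 2).
def wfd_assign(res_util, clusters):
    """res_util: {resource: utilization u_q}.
    clusters: {name: {"cap": capacity, "util": current utilization,
                      "procs": {processor: resource utilization}}}.
    Returns {resource: processor}, or None if some cluster overflows."""
    placement = {}
    for q in sorted(res_util, key=res_util.get, reverse=True):  # decreasing u_q
        # Cluster with maximum utilization slack cap - util (worst fit).
        x = max(clusters, key=lambda c: clusters[c]["cap"] - clusters[c]["util"])
        if clusters[x]["util"] + res_util[q] > clusters[x]["cap"]:
            return None  # infeasible allocation
        # Least-loaded processor of the chosen cluster.
        k = min(clusters[x]["procs"], key=clusters[x]["procs"].get)
        clusters[x]["procs"][k] += res_util[q]
        clusters[x]["util"] += res_util[q]
        placement[q] = k
    return placement

clusters = {
    "C1": {"cap": 2, "util": 1.2, "procs": {"p1": 0.0, "p2": 0.0}},
    "C2": {"cap": 2, "util": 1.5, "procs": {"p3": 0.0, "p4": 0.0}},
}
placement = wfd_assign({"l1": 0.3, "l2": 0.2}, clusters)
print(placement)  # {'l1': 'p1', 'l2': 'p2'}
```

Updating the cluster utilization in place models the accounting of resource utilization against cluster capacity that drives the feasibility check.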
According to the schedulability analysis in Sect. IV, the WCRT of a task can be determined only once the tasks and the global resources are partitioned. In this section, we present a partitioning algorithm that iteratively assigns tasks and resources.

For ease of description, we consider the processors assigned to each task as a cluster. Accordingly, we use ℘^C_x to denote the x-th cluster (x ≤ m). The capacity of ℘^C_x, denoted by Ũ^cluster_x, is equal to the number of processors in ℘^C_x. The utilization of ℘^C_x, denoted by U^cluster_x, is the total utilization of the task and the resources assigned to ℘^C_x, where the utilization of a resource ℓ_q is defined as u^Φ_q = Σ_{τ_j ∈ τ} N_j,q · L_j,q / T_j. The total utilization of the global resources assigned to a processor ℘_k is denoted by u^℘_k, i.e., u^℘_k = Σ_{ℓ_q ∈ Φ(℘_k)} u^Φ_q. The utilization slack of a cluster ℘^C_x is defined as Ũ^cluster_x − U^cluster_x. A cluster is infeasible if U^cluster_x > Ũ^cluster_x.

Each task τ_i is initially assigned ⌈(C_i − L∗_i)/(D_i − L∗_i)⌉ processors, and the global resources are partitioned according to the Worst-Fit Decreasing (WFD) heuristic, as shown in Algorithm 1. Intuitively, the global resource with the highest utilization is assigned to the processor with the lowest resource utilization in the cluster with maximum utilization slack, as shown in Algorithm 2. The schedulability analysis is performed starting from the task with the highest base priority. If a task is unschedulable, we assign an additional processor, if one exists, to that task. Since the capacity of the cluster changes when an additional processor is assigned, we re-assign the global resources and perform the schedulability tests again. The partitioning process runs for at most m − n rounds for systems containing only heavy tasks.

Algorithm 1: Task and Resource Partitioning
Input: the tasks τ, the processors ℘, and the resources Φ
Output: the schedulability of the system
1:  for ∀τ_i ∈ τ do
2:    if there are ⌈(C_i − L∗_i)/(D_i − L∗_i)⌉ processors unassigned then
3:      assign ⌈(C_i − L∗_i)/(D_i − L∗_i)⌉ processors to τ_i
4:    else return unschedulable
5:  while true do
6:    if WFD_Resources(Φ^G, ℘) is infeasible then return unschedulable
7:    for ∀τ_i ∈ τ in decreasing order of priority do
8:      if WCRT(τ_i) > D_i then
9:        if there is a processor unassigned then
10:         assign one more processor to τ_i
11:         roll back the global resource assignment
12:         break  // i.e., go back to line 5
13:       else return unschedulable
14:   if all tasks passed the test then return schedulable

Algorithm 2: WFD_Resources
Input: the global resources Φ^G, and the processors ℘
Output: the feasibility of the global resource allocation
1:  sort Φ^G in non-increasing order of utilization
2:  for ∀τ_i ∈ τ do Ũ^cluster_i ← m_i
3:  for ∀ℓ_q ∈ Φ^G do
4:    select the cluster ℘^C_x with the maximum value of Ũ^cluster_x − U^cluster_x
5:    if U^cluster_x + u^Φ_q > Ũ^cluster_x then return infeasible allocation
6:    else
7:      assign ℓ_q to the processor ℘_k s.t. u^℘_k = min {u^℘_a | ℘_a ∈ ℘^C_x}
8:      U^cluster_x ← U^cluster_x + u^Φ_q
9:  return feasible allocation

VI. DISCUSSIONS
Although we focus on heavy tasks in this paper, the DPCP-p approach can be extended to support light tasks. First, light tasks are treated as sequential tasks under federated scheduling, so the original DPCP can be used to handle resource sharing among them. Further, since each heavy task is exclusively assigned a cluster of processors, the delays between heavy and light tasks are due only to global resources. According to the definitions in Sect. IV-A, such delays are captured by inter-task blocking and agent interference, and by Lemma 3 and Lemma 6, bounding these terms does not distinguish between heavy and light tasks. Thus, the delays between heavy and light tasks can be analyzed with the framework presented in Sect. IV. Notably, handling light tasks with shared resources optimally under federated scheduling remains an open problem.

Further, we assume that the maximum number of requests N_i,x,q of each vertex is known. This is possible in real-life applications where the maximum number of requests of each vertex can be pre-determined. We can thus derive a more accurate blocking bound by using the exact number of requests on a path λ_i, i.e., N^λ_i,q = Σ_{v_i,x ∈ λ_i} N_i,x,q, rather than enumerating the value of N^λ_i,q from [0, N_i,q] as in [6]. The tradeoff is additional computation to enumerate all paths of the task in the analysis. Notably, the presented analysis also applies to the prior model [6], [11] by using the key-path-oriented analysis [11].

The blocking-time analysis can be further improved by modern analysis techniques, e.g., the Linear-Programming-based (LP-based) analysis in [2]. However, it is not yet clear how the LP-based analysis [2] can be applied in this setting. Thus, we first establish the fundamental analysis framework in this paper and leave fine-grained analysis as future work.

VII. EMPIRICAL COMPARISONS
In this section, we conduct schedulability experiments to evaluate the DPCP-p approach using synthesized heavy tasks.
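The synthesis procedure detailed in Sect. VII-A below combines standard generators: RandFixedSum [7] for task utilizations, Erdős–Rényi graphs [5] for the DAG structure, and log-uniform sampling for periods. A minimal sketch of the latter two steps, with illustrative function names of our own:

```python
import math
import random

def erdos_renyi_dag(n_vertices, p=0.1, rng=None):
    """Erdos-Renyi-style DAG: each edge (i, j) with i < j is present
    with probability p; directing all edges from lower to higher index
    guarantees acyclicity."""
    rng = rng or random.Random()
    return [(i, j)
            for i in range(n_vertices)
            for j in range(i + 1, n_vertices)
            if rng.random() < p]

def log_uniform_period_ms(lo=10.0, hi=1000.0, rng=None):
    """Period drawn from a log-uniform distribution over [lo, hi] milliseconds."""
    rng = rng or random.Random()
    return math.exp(rng.uniform(math.log(lo), math.log(hi)))
```

Log-uniform sampling of periods avoids over-representing long periods, so tasksets exercise a balanced range of timescales.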
A. Experimental Setup
Multiprocessor platforms with m ∈ {8, 16, 32} unispeed processors and n_r shared resources, where n_r ranges over [2, 4], [4, 8], or [8, 16], were considered. For each m, we generated the total utilizations of the tasksets from 1 to m in steps of 0.05m. The task utilizations of a taskset were generated according to the RandFixedSum algorithm [7], ranging over (1, 2·U_avg], where U_avg ∈ { } is the average utilization of the tasks. The base priorities of the tasks were assigned by the Rate-Monotonic (RM) heuristic. The number of tasks n was determined by the chosen U_avg and the total utilization of the taskset. For each task τ_i, the DAG structure was generated by the Erdős–Rényi algorithm [5], where the number of vertices |V_i| was randomly chosen in [10, 100] and the probability of an edge between any two vertices was set to 0.1. The task period T_i was randomly chosen from a log-uniform distribution over [10 ms, 1000 ms], and C_i was computed as U_i · T_i. Each task τ_i uses each resource with probability p_r ∈ { }. If τ_i uses ℓ_q, the maximum number of requests N_{i,q} was randomly chosen from [1, 25] or [1, 50], and the maximum critical-section length L_{i,q} was chosen in [15 µs, 50 µs] or [50 µs, 100 µs]. The WCET of each vertex, C_{i,x}, and the maximum number of requests in each vertex, N_{i,x,q}, were randomly determined such that C_i = ∑_{v_{i,x} ∈ V_i} C_{i,x} and N_{i,q} = ∑_{v_{i,x} ∈ V_i} N_{i,x,q}. To ensure plausibility, we enforced that L*_i < D_i/ and C_{i,x} ≥ ∑_{ℓ_q ∈ Φ} N_{i,x,q} · L_{i,q}. The combination of the parameters yields 216 experimental scenarios.

B. Baselines
We compare DPCP-p with the existing locking protocols under federated scheduling, denoted SPIN-SON [6] and LPP [11] (there is no study on locking protocols for the other state-of-the-art scheduling approaches in the literature, which we discuss in Sect. VIII). For DPCP-p, we use DPCP-p-EP to denote the analysis presented in Sect. IV, which enumerates all paths, and DPCP-p-EN to denote the analysis that enumerates N_{λi,q} from 0 to N_{i,q} for each ℓ_q ∈ Φ, as in [6], [11]. We also use FED-FP to denote a hypothetical baseline that ignores shared resources under federated scheduling [13].

C. Results
Fig. 2 shows the acceptance ratios of the tested approaches with increasing normalized utilization, where Fig. 2(b) and (d) involve more resource contention than Fig. 2(a) and (c). DPCP-p-EP consistently schedules more tasksets than SPIN-SON and LPP. In particular, the advantage of the DPCP-p approach is more significant under heavy resource contention, as shown in Fig. 2(b) and (d), while SPIN-SON appears competitive under light resource contention, as shown in Fig. 2(a) and (c).

Fig. 2: Experiment results for N_{i,q} ∈ [1, 50] and L_{i,q} ∈ [50 µs, 100 µs]; U_avg = 1.5 in (a) and (b), and U_avg = 2 in (c) and (d); m = 16, n_r ∈ [4, 8], p_r = 0.5 for (a) and (c), and m = 32, n_r ∈ [8, 16], p_r = 1 for (b) and (d).

For brevity, we use the notions of dominance and outperformance to summarize the main trends of the results in Tables 2 and 3. The DPCP-p approach improves upon SPIN-SON and LPP significantly. In particular, DPCP-p-EP outperforms in all scenarios, and it dominates in more than 99% of the scenarios. Similarly, DPCP-p-EN dominates and outperforms more often than not.

Table 2: Statistics for Dominance.
              DPCP-p-EP    DPCP-p-EN    SPIN-SON     LPP
DPCP-p-EP     N/A          216 (100%)   215 (99.5%)  216 (100%)
DPCP-p-EN     0 (0.0%)     N/A          104 (48.1%)  87 (40.3%)
SPIN-SON      0 (0.0%)     10 (4.6%)    N/A          39 (18.1%)
LPP           0 (0.0%)     32 (14.8%)   38 (17.6%)   N/A

Table 3: Statistics for Outperformance.
              DPCP-p-EP    DPCP-p-EN    SPIN-SON     LPP
DPCP-p-EP     N/A          216 (100%)   216 (100%)   216 (100%)
DPCP-p-EN     0 (0.0%)     N/A          138 (63.9%)  158 (73.1%)
SPIN-SON      0 (0.0%)     78 (36.1%)   N/A          114 (52.8%)
LPP           0 (0.0%)     58 (26.9%)   102 (47.2%)  N/A
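The dominance and outperformance relations summarized in Tables 2 and 3 (as defined in the footnote) can be checked mechanically. A minimal sketch, under the assumption that each pair of analyses is evaluated on the same utilization grid with the same number of task sets per tested point:

```python
def dominates(a, b):
    """a dominates b: a's acceptance ratio is never lower than b's at
    any tested point and is strictly higher at some tested point."""
    assert len(a) == len(b)  # same utilization grid assumed
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def outperforms(a, b):
    """a outperforms b: a scheduled more task sets than b in total,
    given per-point counts of accepted task sets."""
    return sum(a) > sum(b)
```

Note that dominance is the stronger relation on a scenario: a curve can accept more task sets in total (outperform) while still dipping below the other curve at some tested point.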
VIII. RELATED WORK
Real-time scheduling algorithms and analysis techniques for independent parallel tasks have been widely studied in the literature [1], [4], [8], [12]–[14], where shared resources are not modeled explicitly. The study of multiprocessor real-time locking protocols stems from the DPCP [16] and the Multiprocessor Priority Ceiling Protocol (MPCP) [15]. Empirical studies [2] showed that the DPCP has better schedulability performance than the MPCP. Based on the DPCP, Hsiu et al. [9] presented a dedicated-core scheduling approach. More recently, Huang et al. [10] proposed the ROP scheduling. However, the work in [2], [9], [10], [15], [16] all assumes sequential task models. Although locking protocols originally designed for sequential tasks, e.g., [15], [16], might be used to handle concurrent requests of parallel tasks, no corresponding analysis has been established in the literature. In this paper, we extend the DPCP to support parallel real-time tasks and present the schedulability analysis.

Recently, there has been significant progress on the scheduling of parallel real-time tasks, e.g., partitioned [4], semi-partitioned [1], global [8], [14], federated [13], and decomposition-based scheduling [12]. However, so far as we know, no study on locking protocols for the state-of-the-art scheduling approaches other than federated scheduling has been reported in the literature. For federated scheduling, Dinh et al. [6] studied the schedulability analysis for spin locks, and Jiang et al. [11] developed a semaphore protocol called LPP. Both [6] and [11] assume local execution of resource requests. The presented DPCP-p is based on a distributed synchronization framework, where requests to global resources are conducted on designated processors. In this way, the contention on each resource is isolated to its designated processor, so that blocking among tasks can be better managed.

Footnote: For an experimental scenario, algorithm A is said to outperform algorithm B if A scheduled more task sets than B, and to dominate B if its acceptance ratio is higher than that of B at some tested points and never lower than that of B at any tested point.

IX. CONCLUSION
This paper for the first time studies the distributed synchronization framework for parallel real-time tasks with shared resources. We extend the DPCP to DAG tasks under federated scheduling and develop analysis techniques and partitioning heuristics to bound the task response times. More precise blocking analysis based on the concrete DAG structure would be an interesting direction for future work.

REFERENCES

[1] V. Bonifaci, G. D'Angelo, and A. Marchetti-Spaccamela. Algorithms for hierarchical and semi-partitioned parallel scheduling. In IPDPS, pages 738–747. IEEE Computer Society, 2017.
[2] B. B. Brandenburg. Improved analysis and evaluation of real-time semaphore protocols for P-FP scheduling. In RTAS, pages 141–152, 2013.
[3] B. B. Brandenburg and J. H. Anderson. Optimality results for multiprocessor real-time locking. In RTSS, pages 49–60, 2010.
[4] D. Casini, A. Biondi, G. Nelissen, and G. C. Buttazzo. Partitioned fixed-priority scheduling of parallel tasks without preemptions. In RTSS, pages 421–433. IEEE Computer Society, 2018.
[5] D. Cordeiro, G. Mounié, S. Perarnau, D. Trystram, J. Vincent, and F. Wagner. Random graph generation for scheduling simulations. In SIMUTools, page 60, 2010.
[6] S. Dinh, J. Li, K. Agrawal, C. D. Gill, and C. Lu. Blocking analysis for spin locks in real-time parallel tasks. IEEE Trans. Parallel Distrib. Syst., 29(4):789–802, 2018.
[7] P. Emberson, R. Stafford, and R. I. Davis. Techniques for the synthesis of multiprocessor tasksets. In WATERS, pages 6–11, 2010.
[8] J. C. Fonseca, G. Nelissen, and V. Nélis. Improved response time analysis of sporadic DAG tasks for global FP scheduling. In RTNS, pages 28–37. ACM, 2017.
[9] P. Hsiu, D. Lee, and T. Kuo. Task synchronization and allocation for many-core real-time systems. In EMSOFT, pages 79–88, 2011.
[10] W. Huang, M. Yang, and J. Chen. Resource-oriented partitioned scheduling in multiprocessor systems: How to partition and how to share? In RTSS, pages 111–122, 2016.
[11] X. Jiang, N. Guan, W. Liu, and M. Yang. Scheduling and analysis of parallel real-time tasks with semaphores. In DAC, page 93, 2019.
[12] X. Jiang, X. Long, N. Guan, and H. Wan. On the decomposition-based global EDF scheduling of parallel real-time tasks. In RTSS, pages 237–246. IEEE Computer Society, 2016.
[13] J. Li, J. Chen, K. Agrawal, C. Lu, C. D. Gill, and A. Saifullah. Analysis of federated and global scheduling for parallel real-time tasks. In ECRTS, pages 85–96, 2014.
[14] A. Melani, M. Bertogna, V. Bonifaci, A. Marchetti-Spaccamela, and G. C. Buttazzo. Schedulability analysis of conditional parallel task graphs in multicore systems. IEEE Trans. Comput., 66(2):339–353, 2017.
[15] R. Rajkumar. Real-time synchronization protocols for shared memory multiprocessors. In ICDCS, pages 116–123, 1990.
[16] R. Rajkumar, L. Sha, and J. P. Lehoczky. Real-time synchronization protocols for multiprocessors. In RTSS, pages 259–269, 1988.
[17] G. von der Brüggen, J. Chen, W. Huang, and M. Yang. Release enforcement in resource-oriented partitioned scheduling for multiprocessor systems. In RTNS, pages 287–296, 2017.
[18] M. Yang, W. Huang, and J. Chen. Resource-oriented partitioning for multiprocessor systems with shared resources. IEEE Trans. Comput., 68(6):882–898, 2019.