A WCET-aware cache coloring technique for reducing interference in real-time systems
Fabien Bouquillon, Clément Ballabriga, Giuseppe Lipari, Smail Niar
arXiv preprint [cs.OS], May 2019

Compas'2019 : Parallélisme / Architecture / Système, LIUPPA - IUT de Bayonne, France, June 24-28, 2019
Univ. Lille, CNRS, Centrale Lille, UMR 9189 - CRIStAL, Lille, France; Univ. Polytechnique Hauts-de-France, LAMIH/CNRS, Valenciennes, France
Abstract
The time predictability of a system is the precondition for giving safe and tight bounds on the worst-case execution time of the real-time functionalities running on it. Commercial off-the-shelf (COTS) processors are increasingly used in embedded systems and contain shared cache memory. This component is hard to predict because its state depends on the execution history of the system. To increase the predictability of COTS components we use cache coloring, a technique widely used to partition cache memory. Our main contribution is a WCET-aware heuristic which partitions the cache according to the needs of each task. Our experiments use the ILP solver CPLEX on randomly generated task sets running on a preemptive system scheduled with Earliest Deadline First (EDF).
1. Introduction
Hard real-time systems are found in many different domains, such as avionics, automotive and healthcare services. In such systems, a real-time task has to be executed within predefined timing constraints, whose violation can lead to system failure. Thus, it is important to compute the response time of every task to ensure a priori that it always executes within its time window under all conditions. Schedulability analysis algorithms provide upper bounds on the response times of tasks, which depend upon several parameters such as the tasks' execution times and the scheduling policy. In turn, execution times depend on the hardware architecture and the task's code.

Commercial off-the-shelf (COTS) processors are increasingly used in embedded systems for their low cost and high performance. Most COTS processors use cache memories to bridge the gap between processor speed and main memory access speed. In particular, cache memories improve performance by reducing the typical execution time of a task. However, in real-time systems we need predictability, that is, we need to precisely estimate the Worst-Case Execution Time (WCET) of a task. The cache memory state depends on the execution history of the system, and its prediction is a challenge because the tasks running on the system compete for the same cache memory area. More sophisticated WCET analyses take into account the state of the cache during the execution of the task and provide a tighter WCET. However, these analyses typically assume that every task executes alone in the system, without interference from other tasks. If tasks are executed concurrently and preemptively, one task may preempt another task and evict its cache blocks, making the estimated WCET too optimistic. This type of interference is called inter-task interference, as opposed to intra-task interference, which is due to a task evicting its own cache blocks.
In the literature, many researchers have addressed this problem by accounting for the cost of preemption through the so-called Cache-Related Preemption Delay (CRPD) [1, 8, 9]. Another problem arises in multi-core systems with a shared cache: a task executing on one processor may evict useful cache blocks of a second task executing on a different processor. It is therefore necessary to reduce, or eliminate altogether, the inter-task interference caused by cache conflicts on tasks' execution times.

The goal of this research is to use virtual memory and cache-coloring techniques to reduce inter-task interference: we allocate the virtual pages of a task to physical pages so as to minimize conflicts between tasks on set-associative caches. Since cache memory is limited, by doing so we might increase intra-task conflicts: two pages of the same task may be allocated to two physical pages that correspond to the same position in the cache, thus increasing the task's WCET. We therefore propose a methodology to explore the space of possible cache-coloring configurations so as to reduce conflicts while still meeting the timing constraints. This problem can be represented as a variant of the multiple-choice knapsack problem where the colors are the knapsacks and the pages are the objects; in this variant, however, the value of an object also depends on the presence of the other objects in the same knapsack. Since the problem's complexity is very large, we propose a combination of Integer Linear Programming (ILP) techniques and heuristics to partition the cache, taking into consideration the WCET of each task.
2. Related Work
The predictability of cache memory in real-time systems has been widely explored, especially regarding the CRPD [1, 8, 9].

Lunniss et al. [8] used simulated annealing to find a code layout in memory that minimizes the CRPD. However, tasks are not isolated in the cache, so inter-task and inter-core interference are still present. They used the linker to configure the code layout.

Mancuso et al. [9] propose a complete framework which defines, isolates and locks the most important memory areas in the cache. Their technique is based on cache-coloring partitioning and cache locking; its purpose is to reduce conflicts and enhance predictability, but the cache is not optimally used because only the most important memory areas reside in the cache, and accessing other areas requires costly RAM accesses. In our work, we use their techniques in our heuristics for page coloring, but instead of giving all the partitions to the most important data, we reserve one partition for the remaining data.

Kim et al. [6] propose a practical OS-level cache management scheme using page coloring. They work on a partitioned fixed-priority preemptive scheduling system in which the cache is partitioned between cores with page coloring. In their work, tasks may share the same cache area, so intra-core interference is still present.

Ward et al. [10] consider colors as shared resources protected by critical sections, so priority inversion may occur during execution. To reduce this problem they propose to slice tasks' periods, but their method may force a preempted task to reload its data (the set of data pages that a task may access in one job).
3. System model
In this section, we first present the task model, and then the model of the hardware architecture. We consider a system of N real-time sporadic tasks T = {τ_1, ..., τ_N}. A task τ_i is an infinite succession of jobs J_{i,k}(a_{i,k}, c_{i,k}, d_{i,k}), each characterized by an arrival time a_{i,k}, a computation time c_{i,k} and an absolute deadline d_{i,k}. A job J_{i,k} must be executed in the interval of time
[Figure 1 panels: (a) Cache coloring; (b) WCET control flow graph]
Figure 1: Cache coloring and its impact on WCET

[a_{i,k}, d_{i,k}]; if a job misses its deadline, then a critical failure occurs.

A sporadic task τ_i can be represented by the tuple (C_i, D_i, T_i, P_i), where C_i is the worst-case execution time (WCET) of task τ_i (C_i = max_{k≥0} c_{i,k}), T_i is the minimum time between two consecutive arrivals (T_i ≤ min_k {a_{i,k+1} − a_{i,k}}), D_i is the relative deadline (∀k, d_{i,k} = a_{i,k} + D_i), and P_i is the number of distinct virtual pages used by the task.

We consider a set of sporadic tasks with implicit deadlines (D_i = T_i) or constrained deadlines (D_i ≤ T_i), scheduled with the preemptive Earliest Deadline First (EDF) scheduler on a single processor. This work can easily be extended to partitioned scheduling on a multi-core system with shared caches. We also assume that tasks are independent, that is, they do not share any memory page. We will discuss later how to remove this assumption.

We consider a set-associative cache and denote by S_instructions the number of distinct pages that fit into the cache. In this paper we focus only on the instruction cache: the extension to data caches is the subject of future work. We denote by N_way the number of cache ways.

The color of the j-th virtual page p_{i,j} of task τ_i, denoted κ_{i,j}, is an index in the range [0, S_instructions/N_way − 1] that identifies the position in the cache where the page will be loaded. We therefore search for a method for allocating virtual pages to physical pages so that any two different tasks share the minimum possible number of colors (ideally zero).

Main memory size is a multiple of the cache memory size, which is in turn a multiple of the page size.
Therefore, when considering cache coloring at the page level, the same page is always mapped to the same cache page (a partition of the cache memory of one page size).

Figure 1a shows an example of the cache coloring technique in a set-associative cache: all pages in main memory with the same color share the same cache page (shown in red). Thus, the color of each page in main memory can be computed as κ_{i,j} = index(p_{i,j}) mod (S_instructions/N_way). Since κ_{i,j} depends on the index of the page in main memory, we can use the virtual page table of the task to color instruction pages. The configuration of a task's pages has an impact on the typical execution time of the task, and thus also on its WCET.

To compute the WCET, we can use measurements or static analysis. Measurements give an optimistic estimation of the WCET because not all inputs and internal states can be tested. Static analysis gives a safer over-estimation of the WCET value. Our static analysis method builds a control flow graph (CFG) of the task and runs various analyses on it (including cache behavior prediction) to compute an estimation of the WCET. The task's page allocation has an impact on its WCET: in Figure 1b we show an example of two CFGs of the same task with different page configurations. The color of a node represents the color of the page which contains the block, and the edges represent the possible paths that the execution may take. In the top CFG, Block 1 and Block 0 are not in the same page but use the same area of the cache memory: if the WCET path uses Block 0, then Block 1 will be evicted from the cache. In the bottom CFG, Block 1 and Block 0 do not use the same cache memory area because the colors of their pages differ; thus there is no eviction, and the execution time is lower.
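The page-to-color mapping above can be sketched as follows. This is a minimal illustration (the constants and the `page_color` helper are ours, not the paper's code), using for concreteness the cache geometry of the evaluation section:

```python
# Illustrative cache geometry (the evaluation uses a 32 KB, 2-way
# set-associative cache with 1 KB pages, giving 16 colors).
CACHE_SIZE = 32 * 1024
PAGE_SIZE = 1024
N_WAY = 2

S_INSTRUCTIONS = CACHE_SIZE // PAGE_SIZE   # distinct pages fitting in the cache: 32
N_COLORS = S_INSTRUCTIONS // N_WAY         # available colors: 16

def page_color(page_index):
    """kappa = index mod (S_instructions / N_way): pages whose indices
    differ by a multiple of N_COLORS map to the same cache partition."""
    return page_index % N_COLORS

# Two pages 16 indices apart conflict in the cache; neighbouring pages do not:
assert page_color(3) == page_color(3 + N_COLORS)
assert page_color(3) != page_color(4)
```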
4. WCET-aware Coloring Heuristics
Our goal is to allocate the virtual memory pages of a set of real-time tasks to physical memory pages so as to minimize the inter-task interference in the cache. In this paper, we try to completely remove the interference by partitioning the cache.

We divide the problem into two steps: 1) at the macro level, we assign a certain number of colors to each task so that the total number of colors is less than or equal to the number of available colors in the cache; 2) at the micro level, for each task separately, with a given number of available colors, we compute the best WCET for that task.

We start by proposing a method for solving the micro level. For each possible color combination, it would be necessary to perform a WCET analysis. Since that can be very time-consuming, we rule out the complete exploration of all possible combinations and use a heuristic instead. An over-estimation of the number of solutions is given by (P_i)^{P_i}, which is exponential.

We consider two heuristic algorithms. The first algorithm assigns (approximately) the same number of pages to each color. In particular, if task τ_i is assigned j colors and it has P_i pages, then the same color is assigned to ⌊P_i/j⌋ pages. We use a simple modulo scheme: the first ⌊P_i/j⌋ pages are assigned to the first color, and so on.

The second algorithm classifies pages according to their importance in the program. We assign each page a score that depends on how many times the page is accessed by the program in the control flow graph. The score of a page is computed as the sum of the scores of the instructions in the page, and the score of an instruction ψ is computed as score(ψ) = l(ψ), where l(ψ) is the nesting level of the loops in which the instruction is found: if ψ is not contained in any loop, then l(ψ) = 0; if ψ is contained in a loop of the first level, then l(ψ) = 1; and so on. The pages' scores are computed using the OTAWA analysis tool [2].
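As an illustration, the scoring step can be sketched as follows. This is our own minimal sketch, under the assumption that each instruction is available as a (page index, loop-nesting level) pair; the paper itself obtains this information from OTAWA:

```python
from collections import defaultdict

def page_scores(instructions):
    """Score of a page: sum of score(psi) = l(psi) over its instructions,
    where l(psi) is the loop-nesting level (0 outside any loop)."""
    scores = defaultdict(int)
    for page, nesting_level in instructions:
        scores[page] += nesting_level
    return dict(scores)

# Page 0 holds only straight-line code, page 1 holds a nested loop body:
instrs = [(0, 0), (0, 0), (1, 1), (1, 2), (1, 2)]
print(page_scores(instrs))  # {0: 0, 1: 5}
```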
The pages are then ordered by decreasing score: if task τ_i is assigned j colors, the first j − 1 pages in decreasing score order are each assigned a different color, while all remaining pages are assigned the last remaining color.

Once each page has been assigned a color according to one of the two heuristics above, we launch the OTAWA WCET analysis tool to obtain the corresponding WCET for the task. We do this for every value of j in the interval [1, S_max,i], where

S_max,i = min{ P_i , S_instructions/N_way − (N − 1) }    (1)

and for each value we compute the corresponding WCET C_i(j). These values are used by the ILP solver described in the next section.

The distribution of the cache memory space can be represented as a Multiple-Choice Knapsack Problem (MCKP). In this problem we have a knapsack of limited size and a set of objects of different categories; the problem consists in selecting exactly one object of each category to put in the knapsack.
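The two page-coloring heuristics described above can be sketched as follows. This is our own illustrative code, not the authors' implementation; pages are identified by their index, and `fair_coloring` reflects one plausible reading of the block-wise modulo assignment:

```python
def fair_coloring(n_pages, j):
    """First heuristic: assign roughly n_pages/j consecutive pages to
    each of the j colors (the first floor(n_pages/j) pages get color 0,
    the next block gets color 1, and so on)."""
    block = max(n_pages // j, 1)
    return [min(k // block, j - 1) for k in range(n_pages)]

def federated_coloring(scores, j):
    """Second heuristic: the j-1 highest-scored pages each get a private
    color; every remaining page shares the last color (j-1)."""
    order = sorted(range(len(scores)), key=lambda p: -scores[p])
    colors = [j - 1] * len(scores)            # default: shared last color
    for color, page in enumerate(order[:j - 1]):
        colors[page] = color                  # private color per hot page
    return colors

print(fair_coloring(6, 3))                  # [0, 0, 1, 1, 2, 2]
print(federated_coloring([5, 0, 9, 2], 3))  # [1, 2, 0, 2]
```

In `federated_coloring` above, the page with the highest score (9) gets color 0, the next (5) gets color 1, and the two remaining pages share color 2.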
In our case, the size of the knapsack represents the schedulability constraints; the objective function is the number of colors used (which we want to minimize); the object categories are the different tasks, and an object is a configuration of colors for a given task, with the corresponding WCET.

We encode the problem above as an Integer Linear Programming (ILP) problem, and we use CPLEX as a solver. We use the following variables and constraints:

• We define the variable χ_{i,j} ∈ {0, 1} to denote the fact that task τ_i has been assigned j colors. Each task must have exactly one configuration selected: Σ_{j=1}^{S_max,i} χ_{i,j} = 1.

• The worst-case execution time of a task can then be expressed as C_i = Σ_{j=1}^{S_max,i} C_i(j) · χ_{i,j}, where C_i(j) is computed in the micro-level problem.

• We want to minimize the total number of colors used: min Σ_{i=1}^{N} Σ_{j=1}^{S_max,i} j · χ_{i,j}. If the value of the objective function for the optimal solution is greater than S_instructions/N_way, then the problem has no feasible solution, and we must resort to other methods for computing the interference (for example by using the CRPD analysis [1]).

• To impose the schedulability of the system, we use the demand bound function (DBF) analysis for EDF, first proposed by Baruah [3]. We first impose the utilization constraint Σ_{i=1}^{N} C_i/T_i ≤ 1. Then we add inequalities to check that all deadlines are respected. Let dset = {kT_i + D_i | i = 1, ..., N, k ≥ 0, kT_i + D_i ≤ DIT}, where the first definitive idle time (DIT) [7] is an instant at which all tasks must have completed, and it does not depend on the WCETs of the tasks. Then we add the following inequalities: ∀t ∈ dset: Σ_{i=1}^{N} (⌊(t − D_i)/T_i⌋ + 1) · C_i ≤ t.
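As an illustration of the schedulability constraint, the following sketch checks the demand bound condition for a fixed task set, in plain Python rather than in the ILP encoding; `first_idle_time` is a standard fixed-point iteration and is our assumption on how the DIT is obtained:

```python
import math

def first_idle_time(tasks):
    """First idle time via fixed-point iteration on the synchronous
    busy period; tasks = [(C, D, T), ...], total utilization <= 1."""
    t = sum(C for C, _, _ in tasks)
    while True:
        nxt = sum(math.ceil(t / T) * C for C, _, T in tasks)
        if nxt == t:
            return t
        t = nxt

def edf_schedulable(tasks):
    """Processor demand test for EDF: U <= 1 and dbf(t) <= t at every
    absolute deadline t = k*T_i + D_i up to the first idle time."""
    if sum(C / T for C, _, T in tasks) > 1:
        return False
    dit = first_idle_time(tasks)
    deadlines = {k * T + D
                 for C, D, T in tasks if D <= dit
                 for k in range(int((dit - D) // T) + 1)}
    return all(
        sum((math.floor((t - D) / T) + 1) * C
            for C, D, T in tasks if t >= D) <= t
        for t in deadlines)

print(edf_schedulable([(2, 4, 4), (2, 6, 6)]))  # True
```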
5. Results
The analysis considers a system with a 32 KB, 2-way set-associative cache with 512 rows. We consider a page size of 1 KB (this value is defined as a constant in OTAWA, and changing it would require modifications to the tool); thus, there are 16 colors available. We test each utilization in the range [ ; ] (with a fixed step), varying the periods and deadlines of the 8 tasks in Table 4d, taken from well-known standard benchmarks in the literature [4, 5].

First, our method performs a static analysis of each task, which gives us a list of WCETs as a function of the number of available colors; the worst of them is selected to compute the periods and deadlines with the UUniFast algorithm (T_i = WCET_i(worst)/U_i). To represent constrained deadlines, we assign each task a deadline in the range [WCET_i(worst) + (T_i − WCET_i(worst)) · r, T_i], for a fixed ratio r.

In the following figures, the line labeled "infinite cache" represents the percentage of task sets that are schedulable with a cache of unbounded size. The "random" line represents the percentage of task sets schedulable with a random distribution of the cache space between tasks. Our method (described in the previous section) is represented by the line labeled "ILP". The x-axis represents the utilization of the worst distribution with random coloring.

For all heuristics, Figures 2a, 2b, 2c, 4a, 4b and 4c show that our method (ILP) increases the number of schedulable task sets (by more than 20% compared to the random distribution at high utilizations), but the performance of our coloring heuristics is mixed compared to random coloring (see Figure 5a): in that figure, we do not observe any significant difference between the performance of fair coloring and federated coloring.
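The period-generation step above relies on the classic UUniFast algorithm of Bini and Buttazzo; a compact sketch (our own illustrative code, with a uniform random source assumed):

```python
import random

def uunifast(n, total_u, rng=random):
    """Draw n task utilizations, uniformly distributed over the set of
    non-negative vectors summing to total_u (UUniFast, Bini & Buttazzo)."""
    utils, remaining = [], total_u
    for i in range(1, n):
        # Peel off one utilization; the exponent keeps the split uniform.
        nxt = remaining * rng.random() ** (1.0 / (n - i))
        utils.append(remaining - nxt)
        remaining = nxt
    utils.append(remaining)
    return utils

# Periods then follow from the per-task worst WCETs: T_i = WCET_i(worst) / U_i.
us = uunifast(8, 0.9)
assert len(us) == 8 and abs(sum(us) - 0.9) < 1e-9
```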
However, Figure 5b shows that fair coloring uses fewer cache partitions than federated and random coloring, so fair coloring is the best of the three heuristics. This can be explained by the fact that, for a low number of colors j, federated coloring isolates only the j − 1 most important pages: if the score of the j-th page is also high, it will suffer a significant number of evictions caused by all the other, lower-scored pages.

[Figure 2: percentage of schedulable task sets with implicit deadlines. (a) fair coloring; (b) federated coloring; (c) random coloring. Each plot shows ILP + heuristic, random distribution, infinite cache and worst distribution versus utilization.]

[Figure 3: performance of the heuristics with implicit deadlines. (a) comparison between the heuristics combined with ILP; (b) average number of cache partitions used.]

[Figure 4: percentage of schedulable task sets with constrained deadlines. (a) fair coloring; (b) federated coloring; (c) random coloring. (d) Tasks used in the analysis:]

Task        Benchmark     Pages
compress    Mälardalen    4
fir         Mälardalen    2
ndes        Mälardalen    4
jfdctint    Mälardalen    3
edn         Mälardalen    4
crc         Mälardalen    2
g723_enc    TACLeBench    8
petrinet    TACLeBench    8

[Figure 5: performance of the heuristics with constrained deadlines. (a) comparison between the heuristics combined with ILP; (b) average number of cache partitions used.]
6. Conclusion
We proposed an ILP-based approach to partition the cache memory according to the needs of each task in a preemptive system scheduled with EDF. We also proposed heuristics, based on our empirical results, to find a page layout for each task according to the number of colors it is given. Our experimental results show an increase of more than 20% in the number of schedulable high-utilization task sets compared to a random partitioning of the cache memory; however, the performance of our page-coloring heuristics is mixed. In future work, we will reduce the granularity of the method to partition at the level of a cache line, and explore other heuristics.

Bibliography
1. Altmeyer (S.), Davis (R. I.) and Maiza (C.). Improved cache related pre-emption delay aware response time analysis for fixed priority pre-emptive systems. Real-Time Systems, vol. 48, no. 5, 2012, pp. 499-526.
2. Ballabriga (C.), Cassé (H.), Rochange (C.) and Sainrat (P.). OTAWA: an open toolbox for adaptive WCET analysis. In IFIP International Workshop on Software Technologies for Embedded and Ubiquitous Systems, pp. 35-46. Springer, 2010.
3. Baruah (S. K.), Mok (A. K.) and Rosier (L. E.). Preemptively scheduling hard-real-time sporadic tasks on one processor. In Proceedings of the 11th Real-Time Systems Symposium, pp. 182-190. IEEE, 1990.
4. Falk (H.), Altmeyer (S.), Hellinckx (P.), Lisper (B.), Puffitsch (W.), Rochange (C.), Schoeberl (M.), Sørensen (R. B.), Wägemann (P.) and Wegener (S.). TACLeBench: A benchmark collection to support worst-case execution time research. Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2016.
5. Gustafsson (J.), Betts (A.), Ermedahl (A.) and Lisper (B.). The Mälardalen WCET benchmarks: past, present and future. pp. 137-147, Brussels, Belgium, July 2010. OCG.
6. Kim (H.), Kandhalu (A.) and Rajkumar (R.). A coordinated approach for practical OS-level cache management in multi-core real-time systems. pp. 80-89. IEEE, 2013.
7. Lipari (G.), George (L.), Bini (E.) and Bertogna (M.). On the average complexity of the processor demand analysis for earliest deadline scheduling. In Proceedings of a conference organized in celebration of Professor Alan Burns's sixtieth birthday, p. 75, 2013.
8. Lunniss (W.), Altmeyer (S.) and Davis (R. I.). Optimising task layout to increase schedulability via reduced cache related pre-emption delays. In Proceedings of the 20th International Conference on Real-Time and Network Systems, pp. 161-170. ACM, 2012.
9. Mancuso (R.), Dudko (R.), Betti (E.), Cesati (M.), Caccamo (M.) and Pellizzoni (R.). Real-time cache management framework for multi-core architectures. In Real-Time and Embedded Technology and Applications Symposium (RTAS), 2013 IEEE 19th, pp. 45-54. IEEE, 2013.
10. Ward (B. C.), Herman (J. L.), Kenna (C. J.) and Anderson (J. H.). Making shared caches more predictable on multicore platforms. In 2013 25th Euromicro Conference on Real-Time Systems.