Deterministic Real-time Thread Scheduling
Heechul Yun, Cheolgi Kim and Lui Sha
Department of Computer Science, University of Illinois at Urbana-Champaign, Champaign, IL 61801
{heechul, cheolgi, lrs}@illinois.edu

Abstract — A race condition is a timing-sensitive problem. A significant source of timing variation comes from non-deterministic hardware interactions such as cache misses. While data race detectors and model checkers can check for races, the enormous state space of complex software makes it difficult to identify all of the races, and the residual implementation errors remain a big challenge. In this paper, we propose deterministic real-time scheduling methods to address scheduling nondeterminism in uniprocessor systems. The main idea is to use timing-insensitive deterministic events, e.g., an instruction counter, in conjunction with a real-time clock to schedule threads. By introducing the concept of Worst-Case Executable Instructions (WCEI), we guarantee both determinism and real-time performance.
I. INTRODUCTION
Software running on safety-critical embedded systems, such as avionics, medical devices, and automotive systems, requires high reliability because the consequences of failure can be disastrous. Even in non-safety-critical consumer electronics such as smartphones, consumers demand higher reliability than ever as these devices play increasingly important roles in everyday life. In multithreaded programs, thread interaction bugs, such as race conditions, are sensitive to interleaving patterns and are therefore hard to reproduce when thread scheduling is nondeterministic. Such bugs lead to the great challenge known as No Fault Found (NFF). As an avionics company observed, "Overall, better software has had a far greater impact on reducing NFF than better hardware" [1]; software bugs have become an increasingly major cause of critical problems. As noted by E. Lee [2], correct reasoning about multithreaded programs is extremely difficult because their output depends not only on the input but also on the thread schedule, which is essentially nondeterministic even in uniprocessor systems.

As a solution to nondeterminism, a logical counter, such as an instruction counter, can be used to schedule the threads. If thread switching occurs at specific instruction counter values, the thread schedule is repeatable as long as the program has the same input. Recent work in the parallel systems community adapted this method to reduce nondeterminism in thread scheduling on multi-core systems. These systems use a hardware instruction counter [3] or a compiler-generated virtual counter [4] to control thread interleaving so that they produce deterministic schedules.

However, in real-time systems, we cannot rely solely on instruction counters, because of the real-time constraints. In this paper, we propose a novel thread-scheduling technique for uniprocessor systems that removes time-dependent nondeterminism without sacrificing the real-time guarantee.
The key idea is that our scheduler uses both timer interrupts and instruction-count interrupts to schedule threads. The traditional timer interrupt is used to keep up with real time. In addition, we use an instruction counter to generate an interrupt when a given number of instructions have been executed, to preserve determinism in the scheduling decisions. The challenge is to find a 'good' deterministic counter and a mapping function between the counter value and real-time progress. If the mapping function is too pessimistic, the task must be idle for a substantial amount of time, thus reducing CPU utilization; if it is too aggressive, a high-priority task can miss its deadline because it has to wait for a low-priority task to execute all of its assigned instructions. Finding a good mapping is challenging primarily because of the cache effect. We evaluated our methods for finding a good mapping function using a cycle-accurate processor simulator, SimpleScalar. We also implemented a prototype RTOS with the proposed deterministic scheduler.

The rest of the paper is organized as follows. Section II shows a motivating example. Section III describes the deterministic scheduling methodology. Section IV describes the prototype implementation. Section V concludes the paper.

II. MOTIVATING EXAMPLE
A race condition is a common mistake that is difficult to identify and fix. Consider, for example, Figure 1, which is taken from an earlier version of the paparazzi [5] UAV (unmanned aerial vehicle) source code. Clearly, there is a race between the two threads on reading and updating the status variable. The left box is a correct run, while the right box is an incorrect one. Most of the time, this program performs fairly well. However, if the two threads interleave as in the right box, the final status value is erroneously set to LOST. While this bug can be removed easily by using a proper lock, finding the bug is not easy, because it is rarely manifested in practice. While there are many static and dynamic race detection tools [6], [7], they are often limited: they can miss bugs or generate too many false positives. This kind of bug is a potential cause of NFF because its behavior is nondeterministic. By using the deterministic scheduler presented in this paper, however, the behavior becomes deterministic and therefore easier to identify and fix.

Fig. 1: A race condition example simplified from the paparazzi UAV source repository [5]. In both panels, a Console Input Task writes console input and sets status = AVAILABLE, while a Flight Control Task checks status and either processes the input or sets status = LOST. (a) Normal task interleaving sequence; (b) buggy task interleaving sequence, in which the check races with the update.

III. DETERMINISTIC REAL-TIME SCHEDULING
A. System Model and Definition
We consider a real-time system that is modeled as a set of periodic real-time tasks Γ = {τ_1, τ_2, ..., τ_n}, where n is the number of tasks. The tasks can depend on each other's operations and have a global shared memory for inter-task communication. The scheduling policy is rate monotonic: a higher-priority task preempts a lower-priority one when it arrives. We assume there is a deterministic hardware counter (e.g., an instruction counter) that is capable of measuring program progress and generating an interrupt when it reaches a user-supplied value. We also assume that the deterministic counter and a system timer are the only sources of interruption in the system; threads are not allowed to use blocking system calls to wait for devices such as hard disks.

In a conventional fixed-priority scheduler, when a task is preempted in the middle of execution by a higher-priority task, the logical quantity of execution (the number of executed instructions) up to the preemption point may vary because of complex modern processor features such as caches and pipelining. Such variation can cause races and NFF problems, as presented in Section II.

In our proposed deterministic scheduling, the scheduler uses a deterministic counter to ensure determinism in task executions. When a task τ is scheduled, the scheduler first foresees the time t when τ must be scheduled out because of a higher-priority task arrival or the completion of the task. The scheduler then estimates the number of instructions I that can be executed in the duration t. After the task executes I instructions, τ becomes idle, even if it finishes earlier than the estimated time t. Note that the task must not take longer than t to execute I instructions, to avoid deadline misses. Hence, the next task can always be scheduled immediately on arrival. Moreover, we assume that tasks execute for at least T_unit once they are scheduled.

The key to deterministic scheduling is the estimation of the instruction quantity I to be executed in the given duration t.
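The per-slice rule above can be sketched in a few lines (our own illustration, not the authors' implementation; the task size and the linear WCEI coefficients are hypothetical):

```python
def wcei(t_cycles, a=0.5, b=10_000):
    # Hypothetical linear WCEI of the form I = a*t - b, where a is a
    # worst-case execution speed (instructions/cycle) and b absorbs
    # the cache cold-miss penalty.
    return max(0, int(a * t_cycles) - b)

class Task:
    def __init__(self, total_insns):
        self.remaining = total_insns

    def execute(self, budget):
        # Runs until the instruction-counter interrupt fires (budget
        # instructions retired) or the task completes, whichever is first.
        done = min(budget, self.remaining)
        self.remaining -= done
        return done

def run_slice(task, t_cycles):
    # Look ahead to the next scheduling event at time t_cycles, convert
    # it to an instruction budget, and run the task on that budget.
    budget = wcei(t_cycles)
    retired = task.execute(budget)
    # Even if the task retires its budget (or finishes) before t_cycles,
    # it idles until t_cycles, so every slice ends at a deterministic
    # instruction count AND at a deterministic time.
    return retired, t_cycles

task = Task(total_insns=300_000)
retired, slice_end = run_slice(task, t_cycles=1_000_000)
```

Because a slice always ends at exactly t (never later, by the WCEI guarantee; never earlier, because of the forced idling), the next task is dispatched at a point that is identical across runs.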
We call the estimation function Worst-Case Executable Instructions (WCEI), because the execution of I instructions must be guaranteed within the duration t. Note that it must be a worst-case estimate, but it also should not be too pessimistic, for the sake of overall performance. In Section III-C, we present how to estimate a good (tight) WCEI.

B. An Example of Deterministic Scheduling

Fig. 2: Task preemption in deterministic scheduling. The timeline shows a high-priority task τ_1 and a low-priority task τ_2 with events: (1) calculate t_1 and WCEI(t_1); (2) WCEI(t_1) expires; (3) τ_1 preempts; (4) τ_1 finishes, calculate t_2; (5) calculate t_3 and WCEI(t_3); (6) τ_2 finishes.
Figure 2 shows how deterministic scheduling works. The system has two periodic tasks, τ_1 (high priority) and τ_2 (low priority). (1) The scheduler schedules τ_2 and computes t_1, the duration to the next scheduling event, and the corresponding WCEI; it then installs an interrupt that will be raised after WCEI(t_1) instructions have executed. (2) The interrupt is raised because the CPU has executed all the instructions up to the WCEI. Since the task finished earlier than t_1, it becomes idle. (3) τ_1 arrives and is scheduled immediately. (4) When τ_1 finishes, the scheduler computes t_2 using the inverse of the WCEI function. (5) Similar to step (1), the scheduler computes t_3 and the corresponding WCEI. (6) τ_2 finishes before exhausting all of its WCEI instructions.

As this example clearly shows, the WCEI computation plays an important role in the overall performance of deterministic scheduling: if the function is too pessimistic, the task must idle for a substantial amount of time, reducing CPU utilization; if it is too aggressive, a higher-priority task can be delayed waiting for completion of the instruction budget of the lower-priority task.
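The determinism this buys can be illustrated with a toy replay of the τ_2 slice in step (1) (hypothetical numbers, not the paper's experiment): even when the per-instruction timing jitters from run to run, the instruction-budget interrupt stops τ_2 at the same point every time.

```python
import random

def wcei(t_cycles, a=0.5, b=1_000):
    # Hypothetical linear WCEI; a = 0.5 inst/cycle assumes a worst case
    # of 2 cycles per instruction, so WCEI(t) instructions always fit in t.
    return max(0, int(a * t_cycles) - b)

def preemption_point(t1, jitter_seed):
    # Simulate tau_2 running toward the arrival of tau_1 at time t1,
    # throttled to a budget of WCEI(t1) instructions.
    rng = random.Random(jitter_seed)
    budget = wcei(t1)
    executed = cycles = 0
    while cycles < t1 and executed < budget:
        cycles += rng.choice((1, 2))  # cache-dependent cost: 1 or 2 cycles
        executed += 1
    return executed

# Five runs with different "hardware timing": all stop at the same
# instruction count, so tau_1 always preempts tau_2 at the same point.
points = {preemption_point(100_000, seed) for seed in range(5)}
```

Without the budget (i.e., stopping purely on the timer at t_1), `executed` would differ from run to run, which is exactly the scheduling nondeterminism the counter removes.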
TABLE I: Simulator parameters.

Module   Size   Latency
L1-I&D   8KB    2 cycles
L2-uni   2MB    16 cycles
DRAM     -      200, 4 cycles

Fig. 3: Execution profile and a WCEI example for SPEC2000 GCC (retired instructions vs. elapsed cycles, with the WCEI slope). The WCEI is calculated assuming T_unit is 1M cycles (1 ms on a 1 GHz CPU).

C. WCEI for Homogeneous Tasks
This section describes how to obtain a good WCEI function for homogeneous tasks, in the form of

WCEI(t) = a·t − b     (1)

where t is the duration, a is the execution speed, and b is the cache cold-miss penalty.

The WCEI function correlates the executed instructions with the elapsed real time. For an ideal processor, where every instruction takes exactly one cycle, the function would simply be the processor speed in Hz. For any practical contemporary processor, however, the relationship is much more complicated, for many reasons including cache effects and out-of-order execution. These are known to be difficult, if not impossible, to model.

To obtain a in Eq. (1), we profiled actual program execution using a cycle-accurate processor simulator, SimpleScalar, which modeled an Alpha processor with caches and memory [8]. Table I shows the major system parameters we used in the simulator. Another possible way to profile such data is to use the hardware performance counters found in most modern processors.

We collected the processor cycle count whenever an instruction was retired. Figure 3 shows the retired instructions as a function of the elapsed cycles for a benchmark program, SPEC2000 GCC, and the obtained WCEI function for this task. The solid line shows the observed behavior, and the dotted line presents coefficient a, which is obtained from the smallest number of executed instructions over any T_unit, the smallest scheduling time unit, during the program execution. The choice of T_unit is important: if it is too short, the WCEI function becomes very conservative because the temporal locality of the cache accesses varies significantly. Table II shows the effect: as T_unit increases, the worst-case utilization loss is reduced, because the cache locality is averaged over time. In all of our simulations, T_unit was 1M cycles.

Note that if there are different program execution paths, we have to explore them all, and the WCEI must be a lower bound over all of them. We argue that real-time control tasks often have very limited execution paths and are therefore systematically analyzable using automatic path exploration tools such as KLEE [9].

TABLE II: Utilization impact of the unit time T_unit.

T_unit      WCEI rate     best rate     worst loss
(M cycles)  (inst/cycle)  (inst/cycle)  (%)
1           0.56          0.89          37.73
2           0.59          0.75          21.52
3           0.60          0.71          15.29

TABLE III: Utilization impact comparison of multi-phase WCEIs vs. a single WCEI.

Region   WCEI rate     best rate     worst loss
         (inst/cycle)  (inst/cycle)  (%)
phase 1  1.22          1.25          2.10
phase 2  0.61          0.61          0.28
phase 3  0.48          0.48          0.61
single   0.48          1.25          61.95

D. WCEI for Multi-Phase Tasks
Tasks are often divided into multiple phases of operation, for example a computation phase and an update phase. In such cases, finding a single WCEI function, as described in the previous section, may result in a very pessimistic function, because the instruction execution speed varies significantly across the phases.

A solution to this problem is to identify a separate WCEI function for each phase. In Figure 4, the profiled program, SPEC2000 GZIP, shows three phases. We therefore computed three different WCEI functions, one for each of phases 1, 2, and 3, which are tightly matched to the profiled execution behavior. In comparison, the single WCEI, computed using the method described in the previous section, is not tight, i.e., the function wastes CPU. Table III compares the effectiveness of the multi-phase approach in terms of worst-case CPU utilization loss. In the multi-phase approach, the utilization loss is very small: less than 3% in all phases. In comparison, a single WCEI function results in a worst-case utilization loss of up to 61.95%.

Note that using the multi-phase function requires a means of notifying the scheduler of a phase change, so that the scheduler can re-install the instruction counter interrupt using the new WCEI function. This can be done manually, by inserting function calls in the application program, or automatically from the execution profile.
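The WCEI slopes of Sections III-C and III-D can be derived from an execution profile roughly as follows (our own sketch under the paper's definitions, using a toy two-phase profile; real data would come from SimpleScalar or a hardware performance counter):

```python
import bisect

def wcei_slope(profile, t_unit):
    # profile[i] = cycle at which instruction i retired. The slope a of
    # Eq. (1) is the smallest number of instructions retired over ANY
    # window of t_unit cycles, divided by t_unit.
    worst = None
    last_cycle = profile[-1]
    for i, start in enumerate(profile):
        if start + t_unit > last_cycle:
            break  # window would run past the end of the profile
        j = bisect.bisect_left(profile, start + t_unit)
        retired = j - i
        if worst is None or retired < worst:
            worst = retired
    return worst / t_unit

def per_phase_slopes(profile, boundaries, t_unit):
    # Multi-phase WCEI: one slope per phase; `boundaries` are the
    # instruction indices at which the scheduler is told a phase changes.
    edges = [0, *boundaries, len(profile)]
    return [wcei_slope(profile[lo:hi], t_unit)
            for lo, hi in zip(edges, edges[1:])]

# Toy profile: a fast phase (1 cycle/inst) then a slow phase (4 cycles/inst).
profile = [i for i in range(1000)] + [1000 + 4 * i for i in range(1000)]
single = wcei_slope(profile, t_unit=100)
phase_a, phase_b = per_phase_slopes(profile, [1000], t_unit=100)
```

As in Table III, the single slope is forced down to the slowest phase's rate and wastes CPU in the fast phase, while the per-phase slopes stay tight. (The cold-miss offset b would be read off the startup region of the profile in a similar way.)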
E. Mixed Real-time and Non-real-time Tasks
Fig. 4: Execution profile for a multi-phase task, SPEC2000 GZIP (retired instructions vs. elapsed cycles), showing three phases, the per-phase WCEI slopes, and the single WCEI for comparison.

While we described several optimization techniques in the previous sections, it is inevitable that some CPU utilization is lost, because the scheduler must idle after consuming the instruction budget defined by the WCEI, which is necessarily conservative. The idle cycles, however, can be utilized by other non-critical tasks, for which deadline misses are not an issue, as long as they are independent of the tasks under deterministic scheduling. This is possible by selectively enabling the instruction counter based on the type of the task.

IV. IMPLEMENTATION
We implemented a preliminary prototype deterministic scheduler as a user-level thread scheduler on top of Linux 2.6.36, running on an Intel Core 2 Duo based machine. The basic threading and synchronization APIs are the same as standard pthreads. We used two interrupts, the timer and the instruction counter, to implement the deterministic scheduler. For the instruction counter, we used the retired-store instruction counter found in the Intel Core 2 Duo processor, through the perf events infrastructure [10] of Linux 2.6.36. Note that our current implementation supports only a single WCEI function.

V. CONCLUSION AND FUTURE WORK
In shared-memory multithreaded systems, time-based preemptive scheduling is one of the main sources of nondeterminism. We proposed a counter-based deterministic scheduling method for periodic real-time tasks that eliminates such nondeterminism without violating the real-time property. A key to our approach is the design of a good mapping function, called the WCEI, between the counter and real time. Using a cycle-accurate processor simulator, we explored designs for WCEI functions and discussed related issues. We also built a prototype system showing that the approach is readily implementable on today's computer systems.

Future work will include evaluating a broad range of real-time applications using both a simulator and real hardware. Another interesting avenue would be to explore a deterministic hardware counter with different weights for different types of instructions (e.g., floating-point, integer, and memory operations) or a scratchpad-based MMU [11]. Such hardware can potentially reduce the pessimism of WCEI functions.

REFERENCES

[1] "Untangling No Fault Found."
[2] E. A. Lee, "The Problem with Threads," Computer, vol. 39, no. 5, pp. 33–42, 2006.
[3] M. Olszewski, J. Ansel, and S. Amarasinghe, "Kendo: Efficient deterministic multithreading in software," in ASPLOS. ACM, 2009.
[4] T. Bergan, O. Anderson, J. Devietti, L. Ceze, and D. Grossman, "CoreDet: A compiler and runtime system for deterministic multithreaded execution," in ASPLOS. ACM, 2010.
[5] P. Brisset, A. Drouin, M. Gorraz, P. Huard, and J. Tyler, "The paparazzi solution," MAV2006, Sandestin, Florida, 2006.
[6] P. Pratikakis, J. Foster, and M. Hicks, "LOCKSMITH: Context-sensitive correlation analysis for race detection," ACM SIGPLAN Notices, vol. 41, no. 6, pp. 320–331, 2006.
[7] S. Savage, M. Burrows, G. Nelson, P. Sobalvarro, and T. Anderson, "Eraser: A dynamic data race detector for multithreaded programs," TOCS, 1997.
[8] T. Austin, E. Larson, and D. Ernst, "SimpleScalar: An Infrastructure for Computer System Modeling," Computer, vol. 35, no. 2, 2002.
[9] C. Cadar, D. Dunbar, and D. Engler, "KLEE: Unassisted and automatic generation of high-coverage tests for complex systems programs," in OSDI. USENIX, 2008, pp. 209–224.
[10] perf event. [Online]. Available: http://lwn.net/Articles/357481/
[11] J. Whitham and N. Audsley, "Studying the Applicability of the Scratchpad Memory Management Unit," in RTAS. IEEE, 2010.