Restart-Based Fault-Tolerance: System Design and Schedulability Analysis
Fardin Abdi, Renato Mancuso, Rohan Tabish, Marco Caccamo
Department of Computer Science, University of Illinois at Urbana-Champaign, USA
{abditag2, rmancus2, rtabish, mcaccamo}@illinois.edu

Abstract—Embedded systems in safety-critical environments are continuously required to deliver more performance and functionality, while expected to provide verified safety guarantees. Nonetheless, platform-wide software verification (required for safety) is often expensive. Therefore, design methods that enable utilization of components such as real-time operating systems (RTOS), without requiring their correctness to guarantee safety, are necessary. In this paper, we propose a design approach to deploy safe-by-design embedded systems. To attain this goal, we rely on a small core of verified software to handle faults in applications and RTOS and recover from them while ensuring that timing constraints of safety-critical tasks are always satisfied. Faults are detected by monitoring the application timing, and fault recovery is achieved via full platform restart and software reload, enabled by the short restart time of embedded systems. Schedulability analysis is used to ensure that the timing constraints of critical plant control tasks are always satisfied in spite of faults and consequent restarts. We derive schedulability results for four restart-tolerant task models. We use a simulator to evaluate and compare the performance of the considered scheduling models.
I. INTRODUCTION
Embedded controllers with smart capabilities are being increasingly used to implement safety-critical cyber-physical systems (SC-CPS). In fact, modern medical devices, avionic and automotive systems, to name a few, are required to deliver increasingly high performance without trading off robustness and assurance. Unfortunately, satisfying the increasing demand for smart capabilities and high performance means deploying increasingly complex systems. Even seemingly simple embedded control systems often contain a multitasking real-time kernel, support networking, utilize open-source libraries [1], and rely on a number of specialized hardware components (GPUs, DSPs, DMAs, etc.). As systems increase in complexity, however, the cost of formally verifying their correctness can easily explode. Testing alone is insufficient to guarantee the correctness of safety-critical systems, and unverified software may violate system safety in multiple ways, for instance: (i) the control application may contain unsafe logic that guides the system towards hazardous states; (ii) the logic may be correct but incorrectly implemented, thereby creating unsafe commands at runtime (application-level faults); (iii) even with logically safe, correctly implemented control applications, faults in underlying software layers (e.g., RTOS and device drivers) can prevent the correct execution of the controller and jeopardize system safety (system-level faults). Due to the limited feasibility and high cost of platform-wide formal verification, we take a different approach.
Specifically, we propose a software/hardware co-design methodology to deploy SC-CPS that (i) provide strong safety guarantees; and (ii) can utilize unverified software components to implement complex safety-critical functionalities. Our approach relies on a key observation: by performing careful boot-sequence optimization, many embedded platforms and RTOS utilized in the automotive industry, avionics, and manufacturing can be entirely restarted within a very short period of time. Restarting a computing system and reloading a fresh image of all the software (i.e., RTOS and applications) from a read-only source appears to be an effective approach to recover from unexpected faults. Thus, we propose the following: as soon as a fault that disrupts the execution of critical components is detected, the entire system is restarted. After a restart, all the safety-critical applications that were impacted by the restart are re-executed. If restart and re-execution of critical tasks can be performed fast enough, i.e., such that timing constraints are always met in spite of task re-executions, the physical system will remain oblivious to, and will not be impacted by, the occurrence of faults.

The effectiveness of the proposed restart-based recovery relies on timely detection of faults to trigger a restart. Since detecting logical faults in complex control applications can be challenging, we utilize the Simplex Architecture [2]–[4] to construct the control software. Under Simplex, each control application is divided into three tasks: a safety controller, a complex controller, and a decision module. The safety of the system relies solely on the timely execution of the safety controller tasks. From a scheduling perspective, safety is guaranteed if safety controller tasks have enough CPU cycles to re-execute and finish before their deadlines in spite of restarts. In this paper, we analyze the conditions for a periodic task set to be schedulable in the presence of restarts and re-executions. We assume that when a restart occurs, the task instance executing on the CPU and any of the tasks that were preempted before their completion will need to re-execute after the restart.
In particular, we make the following contributions:
• We propose a Simplex Architecture that can be recovered via restarts and implemented on a single processing unit;
• We derive the response-time analysis under fixed priority with fully preemptive and fully non-preemptive disciplines in the presence of restart-based recovery, and discuss the pros and cons of each;
• We propose response-time analysis of fixed-priority scheduling in the presence of restarts for tasks with preemption thresholds [5] and non-preemptive ending intervals [6] to improve the feasibility of task sets.

II. BACKGROUND ON SIMPLEX DESIGN
Our proposed approach is designed for control tasks that are constructed following the Simplex verified design guidelines [2]–[4]. In the following, we review the Simplex design concepts which are essential for understanding the methodology of this paper. The goal of the original Simplex approach is to design controllers such that faults in the controller software do not cause the physical plant to violate its safety conditions.
Definition. States of the physical plant that do not violate any of the safety conditions are referred to as admissible states. The physical subsystem is assumed safe as long as it is in an admissible state. Likewise, states that violate the constraints are referred to as inadmissible states.

Under the Simplex Architecture, each controlled physical process/component requires a safety controller, a complex controller, and a decision module. In the following, we define the properties of each component.

Definition. Safety Controller is a controller for which a subset of the admissible states, called recoverable states, exists with the following property: if the safety controller starts controlling the plant from one of those states, all future states will remain admissible. The set of recoverable states is denoted by R. The safety controller is formally verified, i.e., it does not contain logical or implementation errors.

Definition.
Complex Controller is the main controller task of the system that drives the plant towards mission set points. However, it is unverified, i.e., it may contain unsafe logic or implementation bugs. As a result, it may generate commands that force the plant into inadmissible states.
Definition. Decision Module includes a switching logic that can determine whether the physical plant will remain safe (stay within the admissible states) if the control output of the complex controller is applied to it.

There are multiple approaches to design a verified safety controller and decision module. The first proposed approach is based on solving linear matrix inequalities [7], and has been used to design Simplex systems as complicated as automated landing maneuvers for an F-16 [8]. In this approach, the safety controller is designed by approximating the system with linear dynamics of the form ẋ = Ax + Bu, for state vector x and input vector u. Safety constraints are expressed as linear constraints in the form of linear matrix inequalities. These constraints, along with the linear dynamics of the system, are the inputs to a convex optimization problem that produces both the linear proportional controller gains K and a positive-definite matrix P. The resulting linear state-feedback controller, u = Kx, yields closed-loop dynamics of the form ẋ = (A + BK)x. Given a state x, when the input Kx is used, the matrix P defines a Lyapunov potential function (x^T P x) with a negative-definite derivative. As a result, the stability of the physical plant is guaranteed using Lyapunov's direct or indirect methods. Furthermore, the matrix P defines an ellipsoid in the state space where all safety constraints are satisfied when x^T P x < 1. If the sensors' and actuators' saturation points were provided as constraints, the states inside the ellipsoid can be reached using control commands within the sensor/actuator limits.

In this way, when the gains K define the safety controller, the ellipsoid of states x^T P x < 1 is the set of recoverable states R. This ellipsoid is used to determine the switching logic of the decision module. As long as the system remains inside the ellipsoid, any unverified, complex controller can be used.
If the state approaches the boundary of the ellipsoid, control can be switched to the safety controller, which will drive the system towards the equilibrium point where x^T P x = 0.

An alternative approach for constructing a verified safety controller and decision module is proposed in [9]. Here, the safety controller is constructed similarly to the above approach [7]. However, a novel switching logic is proposed for the decision module to decide about the safety of complex controller commands. Intuitively, this check examines what happens if the complex controller is used for a single control interval of time, and the safety controller is used thereafter. If the reachable states contain an inadmissible state (either before the switch or after), then the complex controller cannot be used for one more control interval. Assuming the system starts in a recoverable state, this guarantees it will remain in the recoverable set for all time.

A system that adheres to this architecture is guaranteed to remain safe only if the safety controller and decision module execute correctly. In this way, the safety premise is valid only if the safety controller and decision module execute in every control cycle. The original Simplex design only protects the plant from faults in the complex controller. For instance, if a fault in the RTOS crashes the safety controller or decision module, the safety of the physical plant can be violated.

III. SYSTEM MODEL AND ASSUMPTIONS
In this section, we formalize the considered system and task model, and discuss the assumptions under which our methodology is applicable.
A. Periodic Tasks
We consider a task set T composed of n periodic tasks τ_1, ..., τ_n executed on a uniprocessor under fixed-priority scheduling. Each task τ_i is assigned a priority level π_i. We will implicitly index tasks in decreasing priority order, i.e., τ_i has higher priority than τ_k if i < k. Each periodic task τ_i is expressed as a tuple (C_i, T_i, D_i, φ_i), where C_i is the worst-case execution time (WCET), T_i is the period, D_i is the relative deadline of each task instance, and φ_i is the phase (the release time of the first instance). The following relation holds: C_i ≤ D_i ≤ T_i. Whenever D_i = T_i and φ_i = 0, we simply express task parameters as (C_i, T_i). Each instance of a periodic task is called a job, and τ_{i,k} denotes the k-th job of task τ_i. Finally, hp(π_i) and lp(π_i) refer to the sets of tasks with higher or lower priority than π_i, i.e., hp(π_i) = {τ_j | π_i < π_j} and lp(π_i) = {τ_j | π_i > π_j}. We indicate with T_r the minimum inter-arrival time of faults and consequent restarts, while C_r refers to the time required to restart the system.

B. Critical and Non-Critical Workload
It is common practice to execute multiple controllers for different processes of the physical plant on a single processing unit. In this work, we use the Simplex Architecture [2]–[4] to implement each controller. As a result, three periodic tasks are associated with every controller: (i) a safety controller (SC) task, (ii) a complex controller (CC) task, and (iii) a decision module (DM) task. In typical designs, the three tasks that compose the same controller have the same period, deadline, and release time.
Remark 1. SC's control command is sent to the actuator buffer immediately before the termination of that job instance. Hence, the timely execution of SC tasks is necessary and sufficient for the safety of the physical plant.
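The per-cycle interplay implied by Remark 1 can be sketched as follows; the buffer representation and all controller/function names are hypothetical placeholders, not the paper's implementation:

```python
# One control cycle under Simplex: SC writes its command to the actuator
# buffer first, so a safe command is always in place even if later steps
# fail; DM runs last and replaces the buffered command with CC's output
# only if that output is judged safe.

def control_cycle(x, safety_ctrl, complex_ctrl, is_safe):
    actuator_buffer = safety_ctrl(x)   # SC executes first
    u_cc = complex_ctrl(x)             # CC computes the mission command
    if is_safe(x, u_cc):               # DM decides last
        actuator_buffer = u_cc
    return actuator_buffer
```

Note that if CC or DM never run (e.g., they crash), the buffer still holds SC's command, which is exactly why timely SC execution suffices for safety.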
As a result, out of the three tasks, SC must execute first and write its output to the actuator command buffer. Conversely, DM needs to execute last, after the output of CC is available, to decide whether it is safe to replace SC's command, which is already in the actuator buffer. Hence, the priorities of the controller tasks need to be in the following order: π(DM) < π(CC) < π(SC). Note that the precedence constraint that SC, CC and DM tasks must execute in this order can be enforced through the proposed priority ordering if self-suspension and blocking on resources are excluded and if the scheduler is work-conserving. We consider fixed-priority scheduling, which is work-conserving, and we assume SC, CC and DM tasks do not self-suspend. Moreover, tasks controlling different components are independent; SC, CC and DM tasks for the same component share sensors and actuator channels. Sensors are read-only resources, do not require locking/synchronization, and therefore cannot cause blocking. A given SC task may only share actuator channels with the corresponding DM task. However, SC jobs execute before DM jobs and do not self-suspend; hence DM cannot acquire a resource before SC has finished its execution. We assume enough priority levels to assign distinct priorities.

All the SC tasks are referred to as the critical workload. All the CC and DM tasks are referred to as the non-critical workload. Safety is guaranteed if and only if all the critical tasks complete before their deadlines, whereas the execution of non-critical tasks is not crucial for safety; these tasks are said to be mission-critical but not safety-critical. We assume that the first n_c tasks of T are critical. Notice that with this indexing strategy, any critical task has a higher priority than any non-critical task.

C. Fault Model
In this paper, we consider two types of faults for the system: application-level faults and system-level faults. We make the following assumptions about the faults that our system safely handles:

A1 The original image of the system software is stored on a read-only memory unit (e.g., EPROM). This content is unmodifiable at runtime.
A2 Application faults may only occur in the unverified workload (i.e., all the application-level processes on the system except SC and DM tasks).
A3 SC and DM tasks are independently verified and fault-free. They might, however, fail silently (no output is generated) due to faults in software layers or other applications on which they depend.
A4 We only consider system- and application-level faults that cause SC and DM tasks to fail silently but do not change their logic or alter their output.
A5 Faults do not alter sensor readings.
A6 Once SC or CC tasks have sent their outputs to the actuators, the output is unaffected by a system restart. As such, a task does not need to be re-executed if it has completed correctly before a restart.
A7 Re-executing a task even if it has completed correctly does not negatively impact system safety.
A8 Monitoring and initializer tasks (Section IV) are independently verified and fault-free. We assume that system faults can only cause silent failures in these tasks (no output or correct output).
A9 T_r is larger than the least common multiple (hyper-period) of critical tasks, i.e., T_r > LCM{T_k | k ≤ n_c}.

D. Scheduler State Preservation and Absolute Time
In order to know which tasks were preempted, executing, or completed after a restart occurs, it is fundamental to carry a minimum amount of data across restarts. As such, our architecture requires the existence of a small block of non-volatile memory (NVM). We also require the presence of a monotonic clock unit (CLK) as an external device. CLK is used to derive the absolute time after a system restart. Since we assume periodic tasks, the information provided by CLK is enough to determine the last release time of each task. Whenever a critical task is completed, the completion timestamp obtained from CLK is written to NVM, overwriting the previous value for the same task. We assume that a timestamp update in NVM can be performed in a transactional manner.
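A sketch of this bookkeeping follows; the dict-based NVM model and the helper names are illustrative assumptions (in particular, the atomicity of the NVM write is simply assumed here):

```python
# Each critical task overwrites its completion timestamp in NVM when it
# finishes; CLK supplies monotonic absolute time. NVM is modeled as a dict
# and the write is assumed transactional.

def last_release(t, phi, T):
    """Most recent release time of a periodic task (phase phi, period T)
    at absolute time t, i.e. floor((t - phi)/T)*T + phi."""
    return ((t - phi) // T) * T + phi

def record_completion(nvm, task_id, t_now):
    """Overwrite the task's previous completion timestamp (atomic by A-priori
    assumption on the NVM hardware)."""
    nvm[task_id] = t_now
```

Comparing a task's stored timestamp against `last_release` is all the recovery logic needs, which is why a single timestamp per task suffices.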
E. Recovery Model
The recovery action we assume in this paper is to restart the entire system, reload all the software (RTOS and applications) from a read-only storage unit, and re-execute all the jobs that were released but not completed at the time of restart.

(Footnote to Assumption A9: the length of the hyper-period can be significantly reduced if the control tasks have harmonic periods.)

The priority of a re-executing instance is the same as the priority
Fig. 1: Example of a fully preemptive system with 3 tasks τ_1, τ_2, τ_3 (C_1 = 1, C_2 = 2, C_3 = 4) and a restart at t = 10 − ε (C_r = 0). The task set is schedulable without restarts; however, the restart and task re-execution cause a deadline miss at t = 22.

of the original job. Within C_r time units, the system (RTOS and applications) reloads from a read-only image, and re-execution is initiated as needed. Figure 1 depicts how restart and task re-execution affect the scheduling of 3 real-time tasks (τ_1, τ_2, and τ_3). When the restart happens at t = 10 − ε, τ_1 was still running. Moreover, τ_2 and τ_3 were preempted at times t = 9 and t = 8, respectively. Hence, all three tasks will need to be re-executed after the restart.

System restart is triggered only after a fault is detected. The following definition of fault is used throughout this paper:

Critical Fault: any system misbehavior that leads to a non-timely execution of any of the critical tasks.

It follows that (i) the absence of critical faults guarantees that every critical task completes on time; (ii) the timely completion of all the critical tasks ensures system safety by Assumptions A3–A7; and (iii) being able to detect all critical faults and re-execute critical tasks by their deadlines is enough to ensure timely completion of critical tasks in spite of restarts. We discuss critical fault detection in Section IV, and we analyze system schedulability in spite of critical faults in Sections V and VI. Since handling critical faults is necessary and sufficient (Remark 1) for safety, in the rest of this paper the term fault is used to refer to critical faults.
F. RBR-Feasibility
A task set T is said to be feasible under restart-based recovery (RBR-Feasible) if the following two conditions are satisfied: (i) there exists a schedule such that all jobs of all the critical tasks, or their potential re-executions, can complete successfully before their respective deadlines, even in the presence of a system-wide restart occurring at any arbitrary time during execution; (ii) all jobs, including instances of non-critical tasks, can complete before their deadlines when no restart is performed.

IV. FAULT DETECTION AND TASK RE-EXECUTION
As described in the previous section, a successful fault-detection approach must be able to detect any fault before the deadline of a critical task is missed, and to trigger the recovery procedure. Another key requirement is being able to correctly re-execute critical jobs that were affected by a restart.
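One way to realize such timing-based detection is to compute, at each step, the earliest upcoming instant at which some critical task's ideal worst-case response time elapses; the detailed scheme follows in this section. A minimal sketch (the function name and data layout are illustrative assumptions, not the authors' implementation):

```python
import math

# Next checkpoint: the earliest instant at which some critical task's ideal
# worst-case response time R_hat elapses, measured from its latest release.

def next_checkpoint(t, critical):
    """critical: list of (phi, T, R_hat) tuples, one per critical task.
    Returns (t_next, index of the task to be checked at that instant)."""
    best = None
    for i, (phi, T, R_hat) in enumerate(critical):
        cand = math.floor((t - phi) / T) * T + phi + R_hat
        if best is None or cand < best[0]:
            best = (cand, i)
    return best
```

A hardware watchdog armed to expire just after `t_next` then guarantees a reset unless the monitor confirms, at wake-up, that the checked task has completed.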
Fault detection with watchdog (WD) timer: to explain the detection mechanism, we rely on the concept of ideal worst-case response time, i.e., the worst-case response time of a task when there are no restarts (and no re-executions) in the system. We use R̂_i to denote the ideal worst-case response time of τ_i. R̂_i can be derived using traditional response-time analysis, or with the analysis proposed in Sections V and VI by imposing all the overhead terms O_i^p = O_i^np = 0.

If no faults occur in the system, every instance of τ_i is expected to finish its execution within at most R̂_i time units after its arrival time. This can be checked at runtime with a monitoring task. Recall that each critical job records its completion timestamp t_i^comp to NVM. The monitoring task checks the latest timestamp for τ_i at time instants kT_i + R̂_i. If t_i^comp < kT_i, it means that τ_i has not completed by its ideal worst-case response time. Hence, a restart needs to be triggered. A single WD can be used to always ensure a system reset if any of the critical tasks does not complete by its ideal worst-case response time. The following steps are performed:

1) Determine the next checkpoint instant t_next and checked critical task τ_i as follows:

t_next = min_{i ≤ n_c} ( ⌊(t − φ_i)/T_i⌋ T_i + φ_i + R̂_i ).    (1)

In other words, t_next captures the earliest instant of time that corresponds to the elapsing of the ideal worst-case response time of some critical task τ_i;
2) Set the WD to restart the system after t_next − t + ε time units;
3) Terminate and set the wake-up time at t_next;
4) At wake-up, check if τ_i completed correctly: if t_i^comp obtained from NVM satisfies t_i^comp ≥ ⌊(t − φ_i)/T_i⌋ T_i + φ_i, then acknowledge the WD so that it does not trigger a reset. Otherwise, do nothing, causing a WD-induced reset after ε time units;
5) Continue from Step 1 above.

Notice that this simple solution utilizes only one WD timer and handles all the silent failures. The advantage of using hardware WD timers is that if any fault in the OS or other applications prevents the time-monitoring task from executing, the WD, which is already set, will expire and restart the system.

To determine which tasks to execute after a restart, we propose the following. Immediately after the reboot completes, an initializer task calculates the latest release time of each task τ_i using ⌊(t − φ_i)/T_i⌋ T_i + φ_i, where t is the current time retrieved from CLK. Next, it retrieves the last recorded completion time of the task, t_i^comp, from NVM. If t_i^comp < ⌊(t − φ_i)/T_i⌋ T_i + φ_i, then the task needs to be executed and is added to the list of ready tasks. It is possible that a task completed its execution prior to the restart, but was not able to record the completion time due to the restart. In this case, the task will be executed again, which does not impact safety due to Assumption A7.

V. RBR-FEASIBILITY ANALYSIS
As mentioned in Section IV, re-execution of jobs impacted by a restart must not cause any other job to miss a deadline. Also, re-executed jobs need to meet their deadlines as well. The goal of this section is to present a set of sufficient conditions to reason about the feasibility of a given task set T in the presence of restarts (RBR-Feasibility). In particular, in Sections V-A and V-B, we present a methodology that provides a sufficient condition for exact RBR-Feasibility analysis of preemptive and non-preemptive task sets.

Definition: The length of the level-i preemption chain at time t is defined as the sum of the executed portions of all the tasks that are in the preempted or running state and have a priority greater than or equal to π_i at t. The longest level-i preemption chain is the preemption chain that has the longest length over all the possible level-i preemption chains.

For instance, consider a fully preemptive task set with four tasks: C_1 = 1, T_1 = 5; C_2 = 3, T_2 = 10; C_3 = 2, T_3 = 12; C_4 = 4, T_4 = 15; and π_4 < π_3 < π_2 < π_1. For this task set, the longest level-3 and level-4 preemption chains are 6 and 10, respectively.

A. Fully Preemptive Task Set
Under the fully preemptive scheme, as soon as a higher-priority task is ready, it preempts any lower-priority task running on the processor. To calculate the worst-case response time of task τ_i, we have to consider the case where the restart incurs the longest delay on the finishing time of the job. For a fully preemptive task set, this occurs when every task τ_k for k ∈ {2, ..., i} is preempted immediately prior to its completion by τ_{k−1}, and the system restarts right before the completion of τ_1. In other words, when tasks τ_1 to τ_i form the longest level-i preemption chain. An example of this case is depicted in Figure 1. In this case, the restart and consequent re-execution cause a deadline miss at t = 22. The example uses only integer numbers for task parameters; hence, tasks can be preempted only up to 1 unit of time before their completion. In the rest of the paper, we discuss our results assuming that tasks' WCETs are real numbers. Theorem 1 provides RBR-Feasibility conditions for a fully preemptive task set T under fixed-priority scheduling.

Theorem 1.
A set of preemptive periodic tasks T is RBR-Feasible under a fixed-priority algorithm if the response time R_i of each task τ_i satisfies the condition ∀τ_i ∈ T : R_i ≤ D_i. R_i is obtained for the smallest value of k for which R_i^(k+1) = R_i^(k):

R_i^(k+1) = C_i + Σ_{τ_j ∈ hp(π_i)} ⌈R_i^(k)/T_j⌉ C_j + O_i^p    (2)

where the restart overhead O_i^p on the response time is

O_i^p = C_r + Σ_{τ_j ∈ hp(π_i) ∪ {τ_i}} C_j  if i ≤ n_c,  and O_i^p = 0  if i > n_c.    (3)

Proof.
First, note that Equation 2 without the overhead term O_i^p corresponds to the classic response time of a task under fully preemptive fixed-priority scheduling [10]. The additional overhead term represents the worst-case interference on the task instance under analysis introduced by the restart time and the re-execution of the preempted tasks. We need to show that the overhead term can be computed using Equation 3. Consider the scenario in which every task τ_k is preempted by τ_{k−1} after executing for δ_k time units, where k ∈ {2, ..., i}, and a restart occurs after τ_1 has executed for δ_1 time units. Due to the restart, all the tasks have to re-execute, and the earliest time τ_i can finish its execution is C_r + δ_i + ... + δ_1 + C_i + ... + C_1. Hence, it is obvious that the later each preemption or the restart of τ_1 occurs, the more delay it creates for τ_i. Once a task has completed, it no longer needs to be re-executed. Therefore, the maximum delay of each task is felt immediately prior to the task's completion instant. Thus, the overhead is maximized when each τ_k is preempted by τ_{k−1} for k ∈ {2, ..., i} and the restart occurs immediately before the end of τ_1.

As seen in this section, the worst-case overhead of restart-based recovery in the fully preemptive setting occurs when the system restarts at the end of the longest preemption chain. Therefore, to reduce the overhead of restarting, the length of the longest preemption chain must be reduced. In order to reduce this effect, we investigate the non-preemptive setting in the following section.

B. Fully Non-Preemptive Task Set

Under this model, jobs are not preempted until their execution terminates. At every termination point, the scheduler selects the task with the highest priority amongst all the ready tasks to execute.
The main advantage of a non-preemptive task set is that at most one task instance can be affected by a restart at any instant of time.

The authors in [11] showed that in non-preemptive scheduling, the largest response time of a task does not necessarily occur in the first job after the critical instant. In some cases, the high-priority jobs activated during the non-preemptive execution of τ_i's first instance are pushed ahead to successive jobs, which then may experience a higher interference. Due to this phenomenon, the response-time analysis for a task cannot be limited to its first job, activated at the critical instant, as done in preemptive scheduling; it must be performed for multiple jobs, until the processor finishes executing tasks with priority higher than or equal to π_i. Hence, the response time of a task needs to be computed within the longest Level-i Active Period, defined as follows [12], [13].
Definition: The Level-i Active Period L_i is an interval [a, b) such that the amount of processing that still needs to be performed at time t, due to jobs with priority higher than or equal to π_i released strictly before t, is positive for all t ∈ (a, b) and null in a and b. It can be computed using the following iterative relation:

L_i^(q) = B_i + C_i + Σ_{τ_j ∈ hp(π_i)} ⌈L_i^(q−1)/T_j⌉ C_j + O_i^np    (4)

Here, O_i^np is the maximum overhead of restart on the response time of a task; in the following, we describe how to calculate this value. L_i is the smallest value for which L_i^(q) = L_i^(q−1). This indicates that the response time of task τ_i must be computed for all jobs τ_{i,k} with k ∈ [1, K_i], where K_i = ⌈L_i/T_i⌉.

Theorem 2 describes the sufficient conditions under which a fault and the subsequent restart do not compromise the timely execution of the critical workload under fully non-preemptive scheduling. Notice that, as mentioned earlier, it is assumed that the schedule is resumed with the highest-priority active job after a restart.

Theorem 2.
A set of non-preemptive periodic tasks is RBR-Feasible under fixed priority if the response time R_i of each task τ_i, calculated through the following relation, satisfies the condition ∀τ_i ∈ T : R_i ≤ D_i.

R_i = max_{k ∈ [1, K_i]} { F_{i,k} − (k − 1) T_i }    (5)

where F_{i,k} is the finishing time of job τ_{i,k}, given by

F_{i,k} = S_{i,k} + C_i    (6)

Here, S_{i,k} is the start time of job τ_{i,k}, obtained for the smallest value that satisfies S_{i,k}^(q+1) = S_{i,k}^(q) in the following relation:

S_{i,k}^(q+1) = B_i + (k − 1) C_i + Σ_{τ_j ∈ hp(π_i)} ( ⌊S_{i,k}^(q)/T_j⌋ + 1 ) C_j + O_i^np    (7)

In Equation 7, the term B_i is the blocking from lower-priority tasks and is calculated as B_i = max_{τ_j ∈ lp(π_i)} {C_j}. The term O_i^np represents the overhead on task execution introduced by restarts and is calculated as follows:

O_i^np = C_r + max( {C_j | τ_j ∈ hp(π_i)} ∪ {C_i} )  if i ≤ n_c,  and O_i^np = 0  if i > n_c.    (8)

Proof.
Equations 7 and 6, without the restart overhead term O_i^np, are proposed in [12], [13] to calculate the worst-case start time and response time of a task in the non-preemptive setting. We need to show that the overhead term can be computed using Equation 8. Under the non-preemptive discipline, a restart only impacts the single task executing on the CPU at the instant of the restart. There are two possible scenarios that may result in the worst-case restart delay on the finish time of task τ_i. First, when τ_i is waiting for the higher-priority tasks to finish their execution, a restart can occur during the execution of one of the higher-priority tasks τ_j and delay the start time of τ_i by C_r + C_j. Alternatively, a restart can occur an infinitesimal time prior to the completion of τ_i and cause an overhead of C_r + C_i. Hence, the worst-case delay due to a restart is caused by the task with the longest execution time among the task itself and the tasks with higher priority (Equation 8). The restart overhead is not included in the response time of non-critical tasks (O_i^np = 0 for i > n_c).

Fig. 2: Example of a fully non-preemptive system with 3 tasks τ_1, τ_2, τ_3 (C_1 = 1, C_2 = 2, C_3 = 4) and a restart at t = 5 − ε (C_r = 0). Restart and task re-execution cause a deadline miss at t = 9.

Unfortunately, under non-preemptive scheduling, blocking time due to lower-priority tasks may cause higher-priority tasks with short deadlines to be non-schedulable. As a result, when preemptions are disabled, there exist task sets with arbitrarily low utilization that, despite having the lowest restart overhead, are not RBR-Feasible. Figure 2 uses the same task parameters as Figure 1. The plot shows that the considered task system is not schedulable under fully non-preemptive scheduling when a restart is triggered at t = 5 − ε.

VI. LIMITED PREEMPTIONS
In the previous section, we analyzed the RBR-Feasibility of task sets under fully preemptive and fully non-preemptive scheduling. Under full preemption, restarts can cause a significant overhead because the longest preemption chain can contain all the tasks. On the other hand, under non-preemptive scheduling, the restart overhead is minimal. However, due to the additional blocking imposed on higher-priority tasks, some task sets, even with low utilization, are not schedulable.

In this section, we discuss two alternative models with limited preemption. Limited preemption models are suitable for restart-based recovery since they enable the preemptions necessary for the schedulability of the task set, while avoiding many of the unnecessary preemptions that occur under fully preemptive scheduling. Consequently, they induce lower restart overhead and exhibit higher schedulability.
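As a concrete illustration of the fully non-preemptive analysis above, the fixed-point start-time recurrence of Theorem 2 (Equations 6-8) can be sketched in a few lines. This is a minimal Python sketch under our own illustrative conventions, not the authors' implementation: tasks are listed in decreasing priority order (index 0 highest, the first `n_c` tasks critical), each represented as a dict with WCET `"C"` and period `"T"`.

```python
import math

def np_overhead(tasks, i, C_r, n_c):
    """Restart overhead O^np_i (cf. Eq. 8): restart time C_r plus the longest
    WCET among tau_i and its higher-priority tasks; 0 for non-critical tasks."""
    if i >= n_c:                       # 0-indexed: indices >= n_c are non-critical
        return 0.0
    return C_r + max(t["C"] for t in tasks[: i + 1])

def np_start_time(tasks, i, k, C_r, n_c):
    """Worst-case start time S_{i,k} of the k-th job (cf. Eq. 7), solved by
    fixed-point iteration; the finish time is then S_{i,k} + C_i (Eq. 6)."""
    B = max((t["C"] for t in tasks[i + 1:]), default=0.0)  # blocking from lp tasks
    O = np_overhead(tasks, i, C_r, n_c)
    S = B + (k - 1) * tasks[i]["C"] + O                    # initial guess
    while True:
        S_next = (B + (k - 1) * tasks[i]["C"] + O
                  + sum((math.floor(S / t["T"]) + 1) * t["C"] for t in tasks[:i]))
        if S_next == S:
            return S
        S = S_next
```

For example, with `tasks = [{"C": 1.0, "T": 4.0}, {"C": 2.0, "T": 6.0}]`, `C_r = 0.5`, and both tasks critical, the first job of the lower-priority task has worst-case start time 3.5 and therefore worst-case finish time 5.5.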
A. Preemptive tasks with Non-Preemptive Ending
As seen in the previous sections, reducing the number and length of the preempted tasks in the longest preemption chain can reduce the overhead of restarting and increase the RBR-Feasibility of task sets. On the other hand, preventing preemptions entirely is not desirable since it can impact the feasibility of high-priority tasks with short deadlines. As a result, we consider a hybrid preemption model in which a job, once it has executed for longer than $C_i - Q_i$ time units, switches to non-preemptive mode and continues to execute until its termination point. Such a model allows a job that has mostly completed to terminate, instead of being preempted by a higher-priority task. $Q_i$ is called the size of the non-preemptive ending interval of $\tau_i$, and $Q_i \leq C_i$. The model we utilize in this section is a special case of the model proposed in [6], which aims to decrease the preemption overhead due to context switches in real-time operating systems. In Figure 3, we consider a task set with the same parameters as in Figure 1, where in addition task $\tau$ has a non-preemptive region of length $Q = 1$. The preemption chain that caused the system in Figure 1 to be non-schedulable cannot occur, and the instance of the task becomes schedulable under restarts. With the same setup, Figure 4 considers the case where a reset occurs at $t = 9 - \epsilon$.
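The dispatch rule of this hybrid model can be sketched in a few lines. This is our own illustrative sketch (the `Job` fields and the priority convention are assumptions, not the paper's notation): a higher-priority job may preempt the running job only while the latter has executed for less than $C - Q$ time units.

```python
from dataclasses import dataclass

@dataclass
class Job:
    priority: int         # larger value means higher priority (illustrative convention)
    C: float              # worst-case execution time
    Q: float              # length of the non-preemptive ending interval, Q <= C
    executed: float = 0.0 # execution time received so far

def can_preempt(running: Job, candidate: Job) -> bool:
    """Hybrid rule: a higher-priority candidate preempts only while the running
    job is still outside its final non-preemptive region of length Q; once the
    job has executed for C - Q time units, it runs to completion."""
    in_np_ending = running.executed >= running.C - running.Q
    return candidate.priority > running.priority and not in_np_ending
```

Note that setting $Q = C$ recovers fully non-preemptive behavior and $Q = 0$ recovers fully preemptive behavior, so the two disciplines of the previous section are the extreme points of this model.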
1) RBR-Feasibility Analysis:
Theorem 3 provides the RBR-Feasibility conditions for a task set with non-preemptive ending intervals. In this theorem, $S_{i,k}$ represents the worst-case start time of the non-preemptive region of the re-executed instance of job $\tau_{i,k}$. Similarly, $F_{i,k}$ represents its worst-case finish time. The arrival time of instance $k$ of task $\tau_i$ is $(k-1)T_i$. Theorem 3.
A set of periodic tasks $\mathcal{T}$ with non-preemptive ending regions of length $Q_i$ is RBR-Feasible under a fixed-priority algorithm if the worst-case response time $R_i$ of each task $\tau_i$, calculated from Equation 9, satisfies the condition $\forall \tau_i \in \mathcal{T}: R_i \leq D_i$.

$$R_i = \max_{k \in [1, K_i]} \{ F_{i,k} - (k-1)T_i \} \quad (9)$$

where

$$F_{i,k} = S_{i,k} + Q_i \quad (10)$$

and $S_{i,k}$ is obtained for the smallest value of $q$ for which $S^{(q+1)}_{i,k} = S^{(q)}_{i,k}$ in the following:

$$S^{(q+1)}_{i,k} = B_i + (k-1)C_i + C_i - Q_i + \sum_{\tau_j \in hp(\pi_i)} \left( \left\lfloor \frac{S^{(q)}_{i,k}}{T_j} \right\rfloor + 1 \right) C_j + O^{npe}_i \quad (11)$$

Here, the term $B_i$ is the blocking from lower-priority tasks and is calculated by

$$B_i = \max_{\tau_j \in lp(\pi_i)} \{ Q_j \} \quad (12)$$

$O^{npe}_i$ is the maximum overhead of the restart on the response time and is calculated as follows:

$$O^{npe}_i = \begin{cases} C_r + \mathit{WCWE}(i) & i \leq n_c \\ 0 & i > n_c \end{cases} \quad (13)$$

where $\mathit{WCWE}(i)$ is the worst-case amount of execution that may be wasted due to restarts. It is given by the following, where $\mathit{WCWE}(1) = C_1$:

$$\mathit{WCWE}(i) = C_i + \max\left( 0, \mathit{WCWE}(i-1) - Q_i \right) \quad (14)$$

$K_i$ in Equation 9 can be computed from Equation 4 by using $O^{npe}_i$ instead of $O^{np}_i$.

Proof. The authors in [5] show that the worst-case response time of task $\tau_i$ is the maximum difference between the worst-case Fig. 3:
Example of a system with 3 tasks $\tau_1 = (1, \cdot)$, $\tau_2 = (2, \cdot)$, $\tau_3 = (4, \cdot)$, where $\tau$ has a non-preemptive region of size $Q = 1$. A restart occurs at $t = 7 - \epsilon$ ($C_r = 0$). The task set is schedulable with restarts.

finish time and the arrival time of the jobs that arrive within the level-$i$ active period (Equation 9). Hence, we must compute the worst-case finish time of job $\tau_{i,k}$ in the presence of restarts. When a restart occurs during the execution of $\tau_{i,k}$, or while it is in the preempted state, $\tau_{i,k}$ needs to re-execute. Therefore, the finish time of $\tau_{i,k}$ is when the re-executed instance completes. As a result, to obtain the worst-case finish time of $\tau_{i,k}$, we calculate the response time of each instance when a restart with the longest overhead has impacted that instance. We break down the worst-case finish time of $\tau_{i,k}$ into two intervals: the worst-case start time of the non-preemptive region of the re-executed job, and the length of the non-preemptive region, $Q_i$ (Equation 10). $S_{i,k}$ in Equation 10 is the worst-case start time of the non-preemptive region of job $\tau_{i,k}$, which can be iteratively obtained from Equation 11. Equation 11 is an extension of the start time computation from [13]. In the presence of non-preemptive regions, an additional blocking factor $B_i$ must be considered for each task $\tau_i$, equal to the longest non-preemptive region of the lower-priority tasks. Therefore, the maximum blocking time that $\tau_i$ may experience is $B_i = \max_{\tau_j \in lp(\pi_i)} \{ Q_j \}$. $B_i$ is added to the worst-case start time of the task in Equation 11. For a task $\tau_i$ with a non-preemptive region of size $Q_i$, there are two cases that may lead to the worst-case wasted time. The first case is when the system restarts immediately prior to the completion of $\tau_i$, in which case the wasted time is $C_i$. The second case occurs when $\tau_i$ is preempted immediately before the non-preemptive region begins (i.e., at $C_i - Q_i$) by the higher-priority task $\tau_{i-1}$.
In this case, the wasted execution is $C_i - Q_i$ plus the maximum amount of execution of the higher-priority tasks that may be wasted due to restarts (i.e., $\mathit{WCWE}(i-1)$). The worst-case wasted execution is the maximum of these two values, i.e., $\mathit{WCWE}(i) = \max(C_i, C_i - Q_i + \mathit{WCWE}(i-1)) = C_i + \max(0, \mathit{WCWE}(i-1) - Q_i)$. Similarly, $\mathit{WCWE}(i-1)$ can be computed recursively.
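The WCWE recursion above (Equation 14) is straightforward to compute. A minimal Python sketch, again under our own illustrative task representation (tasks 0-indexed by decreasing priority, each a dict with WCET `"C"` and non-preemptive ending `"Q"`):

```python
def wcwe(tasks, i):
    """Worst-case wasted execution (cf. Eq. 14): C_i plus whatever part of the
    higher-priority chain's wasted work is not covered by Q_i.
    With 0-indexing, wcwe(tasks, 0) plays the role of WCWE(1) = C_1."""
    C, Q = tasks[i]["C"], tasks[i]["Q"]
    if i == 0:
        return C
    return C + max(0.0, wcwe(tasks, i - 1) - Q)
```

As a sanity check, with $Q_i = C_i$ the recursion reduces to the largest WCET among $\tau_i$ and its higher-priority tasks, matching the fully non-preemptive overhead of Equation 8, while with $Q_i = 0$ it sums the WCETs of the entire preemption chain, matching the fully preemptive worst case.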
2) Optimal Size of Non-Preemptive Regions:
The RBR-Feasibility of a task set depends on the choice of the $Q_i$'s for the tasks. In this section, we present an approach to determine the sizes of the non-preemptive regions $Q_i$ so as to maximize the RBR-Feasibility of the task set. First, we introduce the notion of the blocking tolerance $\beta_i$ of a task: $\beta_i$ is the maximum number of time units that task $\tau_i$ may be blocked by lower-priority tasks while still meeting its deadline. Algorithm 1 uses binary search and the response-time analysis of the task (from Theorem 3) to find $\beta_i$ for a task $\tau_i$. In Algorithm 1, $R_{i, B_i = \mathit{middle}}$ is computed as described in Theorem 3 (Equation 9), where instead of using the $B_i$ from Equation 12, the blocking time is set to the value of $\mathit{middle}$.

Fig. 4: Example of a system with 3 tasks $\tau_1 = (1, \cdot)$, $\tau_2 = (2, \cdot)$, $\tau_3 = (4, \cdot)$, where $\tau$ has a non-preemptive region of size $Q = 1$. A restart occurs at $t = 9 - \epsilon$ ($C_r = 0$). The task set is schedulable with restarts.

Algorithm 1:
Binary Search for Finding $\beta_i$

FindBlockingTolerance($\tau_i$, $\mathcal{T}$, $Q_1, \ldots, Q_{i-1}$):
    start = 0; end = $T_i$          /* initialize the interval */
    if $R_i$(start) > $T_i$ then return "$\tau_i$ not schedulable"
    while end - start > $\epsilon$ do
        middle = (start + end) / 2
        if $R_{i, B_i = \mathit{middle}} > T_i$ then end = middle
        else start = middle
    end while
    return $\beta_i$ = start

Note that if Algorithm 1 cannot find a $\beta_i$ for task $\tau_i$, this task is not schedulable at all. This indicates that there is no selection of the $Q_i$'s that would make $\mathcal{T}$ RBR-Feasible. Given that task $\tau_1$ has the highest priority, it may not be preempted by any other task; hence we set $Q_1 = C_1$. The next theorem shows how to derive the optimal $Q_i$ for the rest of the tasks in $\mathcal{T}$. The result is optimal, meaning that if there is at least one set of $Q_i$'s under which $\mathcal{T}$ is RBR-Feasible, it will find them. Theorem 4.
The optimal set of non-preemptive intervals $Q_i$ for tasks $\tau_i$, $2 \leq i \leq n$, is given by:

$$Q_i = \min\left\{ \min\{ \beta_j \mid \tau_j \in hp(\pi_i) \}, C_i \right\} \quad (15)$$

assuming that $\beta_j \geq 0$ for all $\tau_j \in hp(\pi_i)$.

Proof. Increasing the length of $Q_i$ for a task reduces its response time in two ways. First, from Equation 11, increasing $Q_i$ reduces the start time $S_{i,k}$ of the job, which reduces the finish time and consequently the response time of $\tau_i$. Second, from Equation 14, increasing $Q_i$ reduces the restart overhead $O^{npe}_i$ on the task and on the lower-priority tasks, which in turn reduces the response time. Thus, $Q_i$ may be increased as much as possible, up to the worst-case execution time $C_i$; that is, $Q_i \leq C_i$. However, the choice of $Q_i$ must not make any of the higher-priority tasks unschedulable. As a result, $Q_i$ must be smaller than the smallest blocking tolerance among all the tasks with priority higher than $\pi_i$; $Q_i \leq \min\{ \beta_j \mid \tau_j \in hp(\pi_i) \}$. Combining these two conditions yields Equation 15.

B. Preemption Thresholds
In the previous section, we discussed non-preemptive endings as a way to reduce the length of the longest preemption chain and decrease the overhead of restarts. In this section, we discuss an alternative approach to reduce the number of tasks in the longest preemption chain and thus reduce the overhead of restart-based recovery. To achieve this goal, we use the notion of preemption thresholds, which was proposed in [5]. According to this
Fig. 5:
Example of a system with 3 tasks $\tau_1 = (1, \cdot)$, $\tau_2 = (2, \cdot)$, $\tau_3 = (4, \cdot)$, where two of the tasks have preemption thresholds of $\lambda = 1$ and $\lambda = 2$, respectively. A restart occurs at $t = 7 - \epsilon$ ($C_r = 0$). In this case, the task set remains schedulable. Fig. 6:
Example of a system with 3 tasks $\tau_1 = (1, \cdot)$, $\tau_2 = (2, \cdot)$, $\tau_3 = (4, \cdot)$, where two of the tasks have preemption thresholds of $\lambda = 1$ and $\lambda = 2$, respectively. A restart occurs at $t = 9 - \epsilon$ ($C_r = 0$). The task set is not schedulable.

model, each task $\tau_i$ is assigned a nominal priority $\pi_i$ and a preemption threshold $\lambda_i \geq \pi_i$. In this case, $\tau_i$ can be preempted by $\tau_h$ only if $\pi_h > \lambda_i$. At activation time, the priority of $\tau_i$ is set to the nominal value $\pi_i$. The nominal priority is maintained as long as the task is kept in the ready queue. During this interval, the execution of $\tau_i$ can be delayed by all tasks with priority $\pi_h > \pi_i$, and by at most one lower-priority task with threshold $\lambda_l \geq \pi_i$. When all such tasks complete, $\tau_i$ is dispatched for execution, and its priority is raised to $\lambda_i$. During execution, $\tau_i$ can be preempted only by tasks with priority $\pi_h > \lambda_i$. When $\tau_i$ is preempted, its priority is kept at $\lambda_i$.

Restarts may increase the response time of $\tau_{i,k}$ in one of two ways: a restart may occur after the arrival of the job but before it has started, delaying its start time $S_{i,k}$; alternatively, the system can be restarted after the job has started. We use $O^{pt,s}_i$ to denote the worst-case overhead of a restart that occurs before the start time of a job in task sets with preemption thresholds, and $O^{pt,f}_i$ to represent the worst-case overhead of a restart that occurs after the start time of a job.

In Figure 5, we consider a task set with the same parameters as in Figure 1, where in addition two of the tasks have preemption thresholds equal to $\lambda = 1$ and $\lambda = 2$, respectively. This assignment is effective in preventing a long preemption chain, and the jobs do not miss their deadlines when the restart occurs at $t = 7 - \epsilon$. Notice that the task set is still not RBR-Feasible, since if the restart occurs at $t = 9 - \epsilon$, some job will miss its deadline, as shown in Figure 6. Theorem 5.
For a task set with preemption thresholds under fixed priority, the worst-case overhead of a restart that occurs after the start of job $\tau_{i,k}$ is $O^{pt,f}_i = C_r + \mathit{WCWE}(i)$, where

$$\mathit{WCWE}(i) = C_i + \max\{ \mathit{WCWE}(j) \mid \tau_j \in hp(\lambda_i) \} \quad (16)$$

Here,
$\mathit{WCWE}(1) = C_1$.

Proof. After a job $\tau_{i,k}$ starts, its priority is raised to $\lambda_i$. In this case, a restart creates the worst-case overhead if it occurs at the end of the longest preemption chain that includes $\tau_i$ and any subset of the tasks with $\pi_h > \lambda_i$. Equation 16 uses a recursive relation to calculate the length of the longest preemption chain consisting of $\tau_i$ and all the tasks with $\pi_h > \lambda_i$. Theorem 6.
For a task set with preemption thresholds under fixed priority, a restart occurring before the start time of a job $\tau_{i,k}$ can cause a worst-case overhead of

$$O^{pt,s}_i = C_r + \max\{ \mathit{WCWE}(j) \mid \tau_j \in hp(\pi_i) \} \quad (17)$$

where $\mathit{WCWE}(j)$ can be computed from Equation 16.

Proof. The start time of a task can be delayed by a restart impacting any of the tasks with priority higher than $\pi_i$. Equation 17 recursively finds the longest possible preemption chain consisting of any subset of tasks with $\pi_h > \pi_i$.

Due to the assumption of one fault per hyper-period, each job may be impacted by at most one of $O^{pt,f}_i$ or $O^{pt,s}_i$, but not both at the same time. Hence, we compute the finish time of the task once assuming that the restart occurs before the start time (i.e., $O^{pt,f}_i = 0$), and once assuming that it occurs after the start time (i.e., $O^{pt,s}_i = 0$). The finish times in these two cases are referred to as $F^s_{i,k}$ (restart before the start time) and $F^f_{i,k}$ (restart after the start time), respectively.

We extend the response-time analysis of tasks with preemption thresholds from [5], considering the overhead of restarting. In the following, $S_{i,k}$ and $F_{i,k}$ represent the worst-case start time and finish time of job $\tau_{i,k}$, and the arrival time of $\tau_{i,k}$ is $(k-1)T_i$. The worst-case response time of task $\tau_i$ is given by:

$$R_i = \max_{k \in [1, K_i]} \left\{ \max\{ F^s_{i,k}, F^f_{i,k} \} - (k-1)T_i \right\} \quad (18)$$

Here, $K_i$ can be obtained from Equation 4 by using $\max(O^{pt,f}_i, O^{pt,s}_i)$ instead of $O^{np}_i$. A task $\tau_i$ can be blocked only by lower-priority tasks that cannot be preempted by it, that is:

$$B_i = \max_j \{ C_j \mid \pi_j < \pi_i \leq \lambda_j \} \quad (19)$$

To compute the finish time, $S_{i,k}$ is computed iteratively using the following equation [5]:

$$S^{(q)}_{i,k} = B_i + (k-1)C_i + \sum_{\tau_j \in hp(\pi_i)} \left( 1 + \left\lfloor \frac{S^{(q-1)}_{i,k}}{T_j} \right\rfloor \right) C_j + O^{pt,s}_i \quad (20)$$

Once the job starts executing, only tasks with priority higher than $\lambda_i$ can preempt it.
Hence, $F_{i,k}$ can be derived from the following:

$$F^{(q)}_{i,k} = S_{i,k} + C_i + \sum_{\tau_j \in hp(\lambda_i)} \left( \left\lceil \frac{F^{(q-1)}_{i,k}}{T_j} \right\rceil - \left( 1 + \left\lfloor \frac{S_{i,k}}{T_j} \right\rfloor \right) \right) C_j + O^{pt,f}_i \quad (21)$$

Task set $\mathcal{T}$ is considered RBR-Feasible if $\forall \tau_i \in \mathcal{T}: R_i \leq T_i$. The RBR-Feasibility of a task set depends on the choice of the $\lambda_i$'s for the tasks. In this paper, we use a genetic algorithm to find a set of preemption thresholds that achieves RBR-Feasibility of the task set. Although this algorithm could be further improved to find optimal threshold assignments, the proposed genetic algorithm achieves acceptable performance, as we show in Section VII.

VII. EVALUATION
In this section, we compare and evaluate the four fault-tolerant scheduling strategies discussed in this paper. In order to evaluate the practical feasibility of our approach, we have also performed a preliminary proof-of-concept implementation on commercial hardware (an i.MX7D platform) for an actual 3-degree-of-freedom helicopter. We tested logical faults, application faults, and system-level faults, and demonstrated that the physical system remained within the admissible region. Due to space constraints, we omit the description and evaluation of our implementation and refer to [14] for additional details.

(a) Fully preemptive. (b) Fully non-preemptive. (c) Non-preemptive ending intervals. (d) Preemption thresholds.
Fig. 7:
Minimum Period: 10, Maximum Period: 1000
A. Evaluating Performance of Scheduling Schemes
In this section, we evaluate the performance of the four fault-tolerant scheduling schemes discussed in this paper. For each data point in the experiments, 500 task sets with the specified utilization and number of tasks are generated. Then, the RBR-feasibility of the task sets is evaluated under the four discussed schemes: fully preemptive, fully non-preemptive, non-preemptive ending intervals, and preemption thresholds. In order to evaluate the performance of the scheduling schemes, all the tasks in the analysis are assumed to be part of the critical workload. Priorities are assigned according to the periods, so a task with a shorter period has a higher priority.

The experiments are performed with two sets of parameters for the periods of the task sets. In the first set of experiments (Figure 7), task sets are generated with periods in the range of 10 to 1000 time units. In the second set (Figure 8), tasks have periods in the range of 900 to 1000 time units. As a result, tasks in the first experiment have a more diverse set of periods than those in the second one.

As shown in Figures 7(a) and 8(a), all the task sets with utilization less than 0.5 are RBR-feasible under preemptive

(a) Fully preemptive. (b) Fully non-preemptive. (c) Non-preemptive ending intervals. (d) Preemption thresholds.
Fig. 8:
Minimum Period: 900, Maximum Period: 1000

scheduling. This observation is consistent with the results of [15], which considers preemptive task sets under rate-monotonic scheduling with a recovery strategy similar to ours (re-executing all the unfinished tasks) and shows that all task sets with utilization under 0.5 are schedulable.

Moreover, a comparison between Figures 7(a) and 8(a) reveals that the fully preemptive setting performs better when the tasks in the task set have diverse rates. To understand this effect, notice that the longest preemption chain for a task in the preemptive setting consists of the execution times of all the tasks with higher priority. Therefore, under this scheduling strategy, tasks with low priority are the bottleneck for the RBR-feasibility analysis. When the diversity of the periods is increased, lower-priority tasks, on average, have much longer periods. As a result, they have more slack to tolerate the overhead of restarts compared to the lower-priority tasks in task sets with less diverse periods. Hence, more task sets are RBR-feasible when a larger range of periods is considered.

On the contrary, when tasks have more diverse periods, the non-preemptive setting performs worse (Figures 7(b) and 8(b)). This is because, with diverse periods, tasks with shorter periods (and higher priorities) experience longer blocking times due to low-priority tasks with long execution times.

As the figures show, scheduling with preemption thresholds and with non-preemptive ending intervals yields better performance than the preemptive and non-preemptive schemes in both experiments. This effect is expected, because the flexibility of these schemes allows them to decrease the overhead of restarts by increasing the non-preemptive regions, or by increasing the preemption thresholds, while maintaining the feasibility of the task sets. Tasks under these disciplines exhibit less blocking and lower restart overhead.

Preemption thresholds and non-preemptive endings in general demonstrate comparable performance.
However, in task sets with a very small number of tasks (2-10 tasks), scheduling with non-preemptive ending intervals performs slightly better than preemption thresholds. This is due to the fact that, with a small number of tasks, the granularity of the latter approach is limited, because few choices can be made for the tasks' preemption thresholds, whereas the length of the non-preemptive intervals can be selected with a finer granularity and is not impacted by the number of tasks.

VIII. RELATED WORK
Most of the previous work on the Simplex Architecture [2]-[4], [16], [17] has focused on the design of the switching logic of the DM or the SC, assuming that the underlying RTOS, libraries, and middleware will correctly execute the SC and DM. Often, however, these underlying software layers are unverified and may contain bugs. Unfortunately, Simplex-based systems are not guaranteed to behave correctly in the presence of system-level faults. System-Level Simplex and its variants [18]-[20] run the SC and DM as bare-metal applications on an isolated, dedicated hardware unit. By doing so, the critical components are protected from faults in the OS or middleware of the complex subsystem. However, exercising this design on most multi-core platforms is challenging. The majority of commercial multi-core platforms are not designed to achieve strong inter-core fault isolation, due to the high degree of hardware resource sharing. For instance, a fault occurring in a core with the highest privilege level may compromise the power and clock configuration of the entire platform. To achieve full isolation and independence, one has to utilize two separate boards/systems. Our design enables the system to safely tolerate and recover from application-level and system-level faults that cause silent failures in the SC and DM without utilizing additional hardware.

The notion of restarting as a means of recovering from faults and improving system availability has been previously studied in the literature. Most of the previous work, however, targets traditional non-safety-critical computing systems such as servers and switches. The authors in [21] introduce recursively restartable systems as a design paradigm for highly available systems. Earlier literature [22], [23] illustrates the concept of microreboot, which consists of having fine-grained rebootable components and trying to restart them from the smallest component to the largest one in the presence of faults.
The works in [24]-[26] focus on failure and fault modeling and try to find an optimal rejuvenation strategy for various non-safety-critical systems. In the context of safety-critical CPS, the authors in [27] propose procedures to design a base controller that enables the entire computing system to be safely restarted at run-time. The base controller keeps the system inside a subset of the safety region by updating the actuator input at least once after every system restart. In [19], which is a variation of System-Level Simplex, the authors propose that the complex subsystem can be restarted upon the occurrence of faults. In this design, safe restarting is possible because the backup controller runs on a dedicated processing unit and is not impacted by restarts of the complex subsystem.

One way to achieve fault tolerance in real-time systems is to use time redundancy. With time redundancy, whenever a fault leads to an error and the error is detected, the faulty task is either re-executed or a different logic (a recovery block) is executed to recover from the error. It is necessary that such a recovery strategy does not cause any deadline misses in the task set. Fault-tolerant scheduling has been extensively studied in the literature; here, we briefly survey the works that are most closely related. A feasibility-check algorithm under multiple faults, assuming EDF scheduling for aperiodic preemptive tasks, is proposed in [28]. An exact schedulability test using checkpointing, for task sets under the fully preemptive model with a transient fault that affects one task, is proposed in [29]. This analysis is further extended in [30] to the case of multiple faults, as well as to the case where the priority of a critical task's recovery block is increased. In [31], the authors propose an exact feasibility test for fixed-priority scheduling of a periodic task set to tolerate multiple transient faults on a uniprocessor.
In [32], an approach is presented to schedule, under fixed-priority preemptive scheduling, at least one of two versions of each task: a simple version with reliable timing, or a complex version that is potentially faulty. The authors in [15] consider a fault model similar to ours, where the recovery action is to re-execute all the partially executed tasks at the instant of fault detection, i.e., the executing task and all the preempted tasks. That work only considers preemptive task sets under rate-monotonic scheduling and shows that single faults, with a minimum inter-arrival time equal to the largest period in the task set, can be recovered from if the processor utilization is less than or equal to 0.5. In [33], the authors investigate the feasibility of task sets under fault bursts with preemptive scheduling. Similar to our work, the recovery action is to re-execute the faulty job along with all the partially completed (preempted) jobs at the time of fault detection. Most of these works are only applicable to transient faults (e.g., faults that occur due to radiation or short-lived hardware malfunctions) that impact a task, and do not consider faults affecting the underlying system. Additionally, most of these works assume that an online fault-detection or acceptance-test mechanism exists. While this assumption is valid for detecting transient faults or timing faults, detecting complex system-level faults or logical faults is non-trivial. Additionally, to the best of our knowledge, our paper is the first to provide sufficient feasibility conditions in the presence of faults under the preemption threshold model and for task sets with non-preemptive ending intervals.

IX. CONCLUSION
Restarting is considered a reliable way to recover traditional computing systems from complex software faults. However, restarting a safety-critical CPS is challenging. In this work, we propose a restart-based fault-tolerance approach and analyze feasibility conditions under various scheduling schemes. We analyze the performance of these strategies for various task sets. This approach enables us to provide formal safety guarantees in the presence of software faults in the application layer, as well as system-layer faults, while utilizing only one commercial off-the-shelf processor.

REFERENCES
[1] S. M. Sulaman, A. Orucevic-Alagic, M. Borg, K. Wnuk, M. Höst, and J. L. de la Vara, "Development of safety-critical software systems using open source software - a systematic map," IEEE, 2014, pp. 17-24.
[2] L. Sha, "Dependable system upgrade," in Real-Time Systems Symposium, 1998. Proceedings., The 19th IEEE. IEEE, 1998, pp. 440-448.
[3] L. Sha, "Using simplicity to control complexity," IEEE Software, 2001.
[4] L. Sha, R. Rajkumar, and M. Gagliardi, "Evolving dependable real-time systems," in Aerospace Applications Conference, 1996. Proceedings., 1996 IEEE, vol. 1. IEEE, 1996, pp. 335-346.
[5] Y. Wang and M. Saksena, "Scheduling fixed-priority tasks with preemption threshold," in Real-Time Computing Systems and Applications, 1999. RTCSA'99. Sixth International Conference on. IEEE, 1999.
[6] S. Baruah, "The limited-preemption uniprocessor scheduling of sporadic task systems," July 2005, pp. 137-144.
[7] D. Seto and L. Sha, "A case study on analytical analysis of the inverted pendulum real-time control system," DTIC Document, Tech. Rep., 1999.
[8] D. Seto, E. Ferreira, and T. F. Marz, "Case study: Development of a baseline controller for automatic landing of an F-16 aircraft using linear matrix inequalities (LMIs)," DTIC Document, Tech. Rep., 2000.
[9] S. Bak, T. T. Johnson, M. Caccamo, and L. Sha, "Real-time reachability for verified simplex design," in Real-Time Systems Symposium (RTSS), 2014 IEEE. IEEE, 2014, pp. 138-148.
[10] C. L. Liu and J. W. Layland, "Scheduling algorithms for multiprogramming in a hard-real-time environment," Journal of the ACM (JACM), vol. 20, no. 1, pp. 46-61, 1973.
[11] R. I. Davis, A. Burns, R. J. Bril, and J. J. Lukkien, "Controller area network (CAN) schedulability analysis: Refuted, revisited and revised," Real-Time Systems, vol. 35, no. 3, pp. 239-272, 2007.
[12] R. J. Bril, J. J. Lukkien, and W. F. J. Verhaegh, "Worst-case response time analysis of real-time tasks under fixed-priority scheduling with deferred preemption revisited," July 2007, pp. 269-279.
[13] G. C. Buttazzo, M. Bertogna, and G. Yao, "Limited preemptive scheduling for real-time systems: a survey," IEEE Transactions on Industrial Informatics, vol. 9, no. 1, pp. 3-15, Feb 2013.
[14] F. Abdi, R. Mancuso, R. Tabish, and M. Caccamo, "Achieving system-level fault-tolerance with controlled resets," University of Illinois at Urbana-Champaign, Tech. Rep., April 2017. [Online]. Available: http://rtsl-edge.cs.illinois.edu/reset-based/reset_sched.pdf
[15] M. Pandya and M. Malek, "Minimum achievable utilization for fault-tolerant processing of periodic tasks," IEEE Transactions on Computers, vol. 47, no. 10, pp. 1102-1112, Oct 1998.
[16] D. Seto and L. Sha, "An engineering method for safety region development," 1999.
[17] T. L. Crenshaw, E. Gunter, C. L. Robinson, L. Sha, and P. Kumar, "The simplex reference model: Limiting fault-propagation due to unreliable components in cyber-physical system architectures," in Real-Time Systems Symposium, 2007. RTSS 2007. 28th IEEE International. IEEE, 2007.
[18] S. Bak, D. K. Chivukula, O. Adekunle, M. Sun, M. Caccamo, and L. Sha, "The system-level simplex architecture for improved real-time embedded system safety," in Real-Time and Embedded Technology and Applications Symposium, 2009. RTAS 2009. 15th IEEE. IEEE, 2009, pp. 99-107.
[19] F. Abdi, R. Mancuso, S. Bak, O. Dantsker, and M. Caccamo, "Reset-based recovery for real-time cyber-physical systems with temporal safety constraints," in IEEE 21st Conference on Emerging Technologies and Factory Automation (ETFA 2016), 2016.
[20] S. Mohan, S. Bak, E. Betti, H. Yun, L. Sha, and M. Caccamo, "S3A: Secure system simplex architecture for enhanced security and robustness of cyber-physical systems," in Proceedings of the 2nd ACM International Conference on High Confidence Networked Systems. ACM, 2013.
[21] G. Candea and A. Fox, "Recursive restartability: Turning the reboot sledgehammer into a scalpel," in Hot Topics in Operating Systems, 2001. Proceedings of the Eighth Workshop on. IEEE, 2001, pp. 125-130.
[22] G. Candea and A. Fox, "Crash-only software," in HotOS IX: The 9th Workshop on Hot Topics in Operating Systems, 2003, pp. 67-72.
[23] G. Candea, S. Kawamoto, Y. Fujiki, G. Friedman, and A. Fox, "Microreboot - a technique for cheap recovery," in Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation - Volume 6, ser. OSDI'04, 2004, pp. 3-3.
[24] K. Vaidyanathan and K. S. Trivedi, "A comprehensive model for software rejuvenation," Dependable and Secure Computing, IEEE Transactions on, vol. 2, no. 2, pp. 124-137, 2005.
[25] S. Garg, A. Puliafito, M. Telek, and K. S. Trivedi, "Analysis of software rejuvenation using Markov regenerative stochastic Petri net," in Software Reliability Engineering, 1995. Proceedings., Sixth International Symposium on. IEEE, 1995, pp. 180-187.
[26] Y. Huang, C. Kintala, N. Kolettis, and N. D. Fulton, "Software rejuvenation: Analysis, module and applications," in Fault-Tolerant Computing, 1995. FTCS-25. Digest of Papers., Twenty-Fifth International Symposium on. IEEE, 1995, pp. 381-390.
[27] F. Abdi, R. Tabish, M. Rungger, M. Zamani, and M. Caccamo, "Application and system-level software fault tolerance through full system restarts," in Proceedings of the 8th ACM/IEEE International Conference on Cyber-Physical Systems. IEEE, 2017.
[28] F. Liberato, R. Melhem, and D. Mosse, "Tolerance to multiple transient faults for aperiodic tasks in hard real-time systems," IEEE Transactions on Computers, vol. 49, no. 9, pp. 906-914, Sep 2000.
[29] S. Punnekkat, A. Burns, and R. Davis, "Analysis of checkpointing for real-time systems," Real-Time Systems, vol. 20, no. 1, pp. 83-102, 2001.
[30] G. Lima and A. Burns, Scheduling Fixed-Priority Hard Real-Time Tasks in the Presence of Faults. Berlin, Heidelberg: Springer Berlin Heidelberg, 2005, pp. 154-173.
[31] R. M. Pathan and J. Jonsson, "Exact fault-tolerant feasibility analysis of fixed-priority real-time tasks," Aug 2010, pp. 265-274.
[32] C.-C. Han, K. G. Shin, and J. Wu, "A fault-tolerant scheduling algorithm for real-time periodic tasks with possible software faults," IEEE Transactions on Computers, vol. 52, no. 3, pp. 362-372, March 2003.
[33] M. A. Haque, H. Aydin, and D. Zhu, "Real-time scheduling under fault bursts with multiple recovery strategy," April 2014.