Bringing Fault-Tolerant GigaHertz-Computing to Space: A Multi-Stage Software-Side Fault-Tolerance Approach for Miniaturized Spacecraft
Christian M. Fuchs, Todor Stefanov, Nadia Murillo, Aske Plaat
26th IEEE Asian Test Symposium 2017, 27-30 Nov 2017, Taipei City, Taiwan. DOI: TBD
Christian M. Fuchs∗, Todor P. Stefanov∗, Nadia M. Murillo† and Aske Plaat∗
∗Leiden Institute for Advanced Computer Science, †Leiden Observatory
Leiden University, 2333 CA, The Netherlands
Email: [email protected]
Abstract: Modern embedded technology is a driving factor in satellite miniaturization, contributing to a massive boom in satellite launches and a rapidly evolving new space industry. Miniaturized satellites, however, suffer from low reliability, as traditional hardware-based fault-tolerance (FT) concepts are ineffective for on-board computers (OBCs) utilizing modern systems-on-a-chip (SoC). Therefore, larger satellites continue to rely on proven processors with large feature sizes. Software-based concepts have largely been ignored by the space industry as they were researched only in theory, and have not yet reached the level of maturity necessary for implementation. We present the first integral, real-world solution to enable fault-tolerant general-purpose computing with modern multiprocessor-SoCs (MPSoCs) for spaceflight, thereby enabling their use in future high-priority space missions. The presented multi-stage approach consists of three FT stages, combining coarse-grained thread-level distributed self-validation, FPGA reconfiguration, and mixed criticality to assure long-term FT and excellent scalability for both resource-constrained and critical high-priority space missions. Early benchmark results indicate a drastic performance increase over state-of-the-art radiation-hard OBC designs and considerably lower software and hardware development costs. This approach was developed for a 4-year European Space Agency (ESA) project, and we are implementing a tiled MPSoC prototype jointly with two industrial partners.
I. INTRODUCTION
Modern embedded technology is a driving factor in satellite miniaturization, contributing to a massive boom in satellite launches and a rapidly evolving new space industry. Micro- and nanosatellites (100-1 kg) have become increasingly popular platforms for a variety of commercial and scientific applications, due to an excellent balance of performance and cost. However, this class of spacecraft suffers from low reliability, discouraging its use in long, complex, or high-priority missions. The on-board computer (OBC) related electronics constitute a much larger share of a miniaturized satellite than they do in larger satellites. Thus, per component, they must deliver drastically better performance and consume less energy. Therefore, and due to cost considerations, miniaturized satellite OBCs are generally based upon processors with considerably finer feature size, such as those developed for mobile embedded devices.
Traditional hardware-based fault-tolerance (FT) concepts for general-purpose computing, however, are ineffective for modern, highly scaled systems-on-chip (SoCs), which thereby become a prime source of malfunctions aboard miniaturized satellites [1]. Larger satellites, too, are limited by the measures traditionally used to assure FT for space applications, as these prevent them from harnessing the benefits of modern processor designs and multiprocessor-SoCs (MPSoCs). Also, these hardware-based FT measures cannot handle varying performance requirements during multi-phased missions and mega-constellations [2]. Software-based FT measures have evolved rapidly due to efforts of the scientific community, and are effective for modern embedded hardware. However, these advances have largely been ignored by the space industry as they were researched only in theory, but rarely designed to be implemented. While many of these concepts include innovative ideas, major implementation obstacles and fundamental issues remain unaddressed.
Often, prior research makes impractical assumptions towards the platform or application environment, or ignores fault detection, recovery from fail-over, or other real-world constraints. Many concepts also attempt to uphold safety and availability, e.g., for atmospheric aerospace use, but not computational correctness. To the best of our knowledge, no integral and practical solution to utilizing modern MPSoC-based systems within high-priority space missions has been developed to date.
There is a wide gap between academic research towards novel FT concepts and their practical application in spacecraft OBCs. Satellite computers for control purposes are still largely based upon architectures developed decades ago, while theoretical research has not achieved the level of maturity necessary to bridge this gap. Thus, neither traditional hardware- nor software-based FT solutions could offer all the functionality necessary to improve the reliability of state-of-the-art embedded SoCs in miniaturized satellite OBCs. Other concepts promise excellent FT guarantees in theory, but require complex architectures that often do not address the specific challenges of computers flying in space. Innovations are especially needed in general-purpose computing, as OBCs must execute a broad variety of applications efficiently.
The presented research addresses these challenges, and our main contributions are:
• the first non-intrusive, integral, flexible, software-side approach enabling the use of modern MPSoCs for spaceflight while meeting real-world constraints;
• an approach that is not based upon custom or proprietary FT-processor cores, and requires neither radiation-hard ASICs nor non-standard functionality;
• which can be implemented with standard toolchains, commercial off-the-shelf (COTS) components, library IP, and little manpower;
• an introduction to an FPGA-based MPSoC architecture developed as an ideal platform for our approach.
This approach was developed for a 4-year European Space Agency (ESA) project with two industrial partners. Due to the interdisciplinary nature of this project, other aspects of this approach and its hardware implementation will be presented in separate publications.
In the next two sections, we will outline the challenges faced in the space environment, and related work. Section IV contains a brief overview of the multi-stage approach, its limitations, terminology, as well as the application model and requirements. Each stage is described in the subsequent sections, with the supervision concept explained in Section V-D. Section VIII then briefly introduces an MPSoC architecture specifically designed as a platform for this FT concept. Performance and checkpoint reliability are discussed in Section IX, followed by conclusions.
II. THE SPACE ENVIRONMENT
Solar cells are the main power source aboard modern spacecraft. A spacecraft's orbit, location and orientation (attitude) relative to the Sun, and the solar array's temperature all influence the efficiency of its solar array. Miniaturized satellites have comparably small solar arrays with strongly fluctuating output, and their OBCs are limited to a few Watts of power budget. Mid-mission physical access to spacecraft is impossible, and historically servicing missions were conducted only on rare occasions for satellites of outstanding importance in low-Earth orbit (LEO). Signal-travel times, limited communication windows, and scarce bandwidth make live interaction with a spacecraft impractical. Thus, faults detected during a satellite mission must be resolved unattended, remotely, and fully autonomously. The drastically different fault model, mass, size, physical stress, and thermal design constraints in space [3] prevent the re-use of FT, debugging, and testing approaches developed for ground application.
High-energy particles are the predominant cause of faults within OBCs [4]. They travel along the Earth's magnetic field lines in the Van Allen belts, are ejected by the Sun during Solar Particle Events, or arrive as Cosmic Rays from beyond our solar system. In LEO, the
III. RELATED WORK
Traditionally, FT is enabled through circuit-, RTL-, core-, and OBC-level voting, which is costly to develop, difficult to validate and maintain, and slow to evolve [6]-[11]. Software takes no active part in fault mitigation, as faults are suppressed at the circuit level, preventing the effective assessment of a processor's health. Circuit- and RTL-voting are effective for microcontrollers and very small SoCs, while core-level voting requires logic unavailable in COTS systems. Modern embedded COTS MPSoCs consume very little energy. But to achieve FT using hardware-side measures, arrays of synchronized high-frequency voters or core-lockstepping in hardware are necessary. As voting and core-level lockstepping at GigaHertz clock rates is non-trivial, it has been implemented only at considerably lower frequencies with non-COTS hardware [9], [11]-[13]. In general, hardware-voting based MPSoC designs are static and non-adaptive, as the entire design's fault-coverage properties are highly chip specific [14]. All these components are single-vendor solutions, therefore implying walled-garden environments. FT MPSoCs for space use contain retrofitted TMRed single-core processors, e.g. [7], or are unique, experimental solutions for specific satellite missions [15], [16]. In contrast to these solutions, modern MPSoCs also allow considerably more software design freedom due to the available compute resources, thereby reducing the required development time and complexity. For scientific instrumentation and low-priority CubeSat missions, COTS-based MPSoCs and FPGA-SoC-hybrids have been utilized, but these are not suitable for critical satellite control applications within miniaturized satellites [17]. Ground-based FT applications do not consider the specific threat scenario and application environment, physical constraints, and thermal design constraints [3], [5]. Instead, we propose to use software-side functionality to assure FT for conventional, non-fault-tolerant processor cores.
Fig. 1. A component-wise view of a satellite OBC. Volatile memory (blue) and non-volatile memory (green) can be protected well using erasure coding. The presented multi-stage approach covers faults affecting processor logic (yellow).
First concepts involving coarse-grained lockstepping are promising [18]-[20], but do not address the specific challenges to FT in space [21]. FT using thread-level very-long-instruction word architectures [22], [23] has also been explored, though the approach still requires pipeline-level voters in hardware. Most implement checkpoint & rollback or restart, which makes them unsuitable for spacecraft command & control applications [24]; others ignore fault detection [25], [26], or require external, infallible fault detection entities with deep knowledge about application intrinsics [27] but offer no concept of how this knowledge could be obtained. Often, faults are assumed to be isolated, side-effect free and local to an application [28] and/or transient [19], [20], [25], which voids the effectiveness of these concepts for space applications. Many prior concepts entail high performance overhead [29] or resource overhead [30], [31], or impose severe design constraints on applications and the OS [18], [19].
To be effective in the space environment, an FT approach must be based upon forward error correction, have low implementation complexity, be suitable for general-purpose computing, and impose little or no constraints on the application software. Changes to the OS infrastructure must be platform portable, code-wise localized, and individually verifiable.
The concepts in [19], [20], [28] implement voting through OS-invasive measures, cannot handle multi-threaded applications, and consider the OS and stored program code to be fault free. [21] requires no modifications to the application software whatsoever, but can only assure availability in a networked application architecture. Accepting these constraints does not allow for adequate FT in a space mission scenario, and thus we propose that application and OS instances must be able to fail arbitrarily without impacting the residual system. In this case, fault propagation between application instances also becomes a non-issue. Considerable research has been directed towards FT real-time scheduling and mixed-critical software-FT systems, though only at a theoretical level [32]-[34]. As a consequence, no implementable, software-driven FT concept for modern embedded- and mobile-market MPSoCs in space exists, creating a gap between the described prior research on software- and hardware-FT
IV. BRIDGING THE GAP: OUR APPROACH
This approach consists of three fault-mitigation stages:
Stage 1 is implemented entirely in software and provides fault detection through coarse-grained lockstepping to enable self-testing, and can be implemented in COTS MPSoCs.
Stage 2 improves medium-term reliability, and enables long-term fault coverage through FPGA reconfiguration and the use of alternative configuration variants. It utilizes Stage 1's fault detection capabilities.
Stage 3 extends the lifetime of a degraded OBC by utilizing mixed criticality to assure fault coverage for high-criticality threads. It enables the OBC to automatically sacrifice performance or fault coverage of lower-criticality threads in favor of higher-criticality applications, thereby maintaining a stable core system.
The presented concept is flexible and the individual stages are modular, as Stage 2 or 3 can be omitted depending on the OBC and mission. Our approach is designed for generic COTS MPSoCs, as these are readily available in a variety of performance classes at low cost. The tiled architecture described in Section VIII is optional but can be considered an ideal platform. For MPSoCs without a tiled architecture, the term tile can be read as processor core; the differences in fault coverage are discussed in Section VIII.
Terminology: Fault detection in our approach is based upon sets of tiles running two or more lockstepped copies of application threads. We refer to such a group of lockstepped threads as a thread group. Timing-compatible thread groups can be combined and executed on the same set of tiles, which is then referred to as a tile group. A tile group periodically executes a checkpoint routine, which computes checksums for all active threads and compares them with those of the other tiles in the group (siblings), thereby enabling a majority decision. The time between checkpoints (determined by the checkpoint frequency) is defined by the threads in a tile group and can be modified at runtime. All lockstepping-relevant information is stored in validation memory, a tile-dedicated memory segment which other tiles can access read-only.
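The terms above can be pictured as a small data model. The following is only a minimal sketch: the class and field names (`Tile`, `TileGroup`, `publish`, `validation_view`) are illustrative assumptions and not part of the described implementation, and validation memory is modeled as a dictionary that siblings see through a read-only view.

```python
from dataclasses import dataclass, field
from types import MappingProxyType


@dataclass
class Tile:
    """One tile (or processor core) with its own validation memory."""
    name: str
    # Hypothetical layout: thread name -> latest checksum.
    _validation: dict = field(default_factory=dict)

    def publish(self, thread: str, checksum: int) -> None:
        # Only the owning tile writes to its validation memory.
        self._validation[thread] = checksum

    def validation_view(self) -> MappingProxyType:
        # Siblings only ever receive a read-only view of this segment.
        return MappingProxyType(self._validation)


@dataclass
class TileGroup:
    """A set of tiles executing the same lockstepped thread groups."""
    tiles: list
    threads: list          # names of all threads mapped to this group
    checkpoint_hz: float   # may be changed at runtime

    def siblings(self, tile: Tile) -> list:
        # All other members of the group, identified by object identity.
        return [t for t in self.tiles if t is not tile]


a, b, c = Tile("A"), Tile("B"), Tile("C")
group = TileGroup(tiles=[a, b, c], threads=["Ta", "Tb"], checkpoint_hz=10.0)
a.publish("Ta", 0x1234)
view = a.validation_view()
```

Here `view` reflects later calls to `a.publish(...)`, while any attempt to assign through `view` raises a `TypeError`, mirroring the read-only access siblings have.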
Application Requirements: The OS only has to support interrupts, wake-up timers, and a multi-threading capable scheduler. To the best of our knowledge, such functionality is available in most widely used RT and general-purpose OS implementations. Virtual memory support is required to enable performance-efficient multi-threading. Furthermore, virtual memory drastically simplifies thread management, context switching, and thread isolation, benefiting overall fault tolerance.
The only requirement for applications is interruptibility at application-defined points in time, during which checkpoints can be executed. As there is no efficient, uniform approach to assess the health of threads, we rely upon applications assessing their own health state. A thread provides four callback routines, which are executed during tile initialization and by the checkpoint handler:
• an initialization routine, to be executed on all tiles at bootup;
• a checksum callback, used to generate a checksum for comparison with siblings;
• a synchronization callback, exposing all thread-state relevant data to synchronize a sibling with a tile group. This data can either be placed directly in the tile's validation memory, or be provided as a reference to structures in main memory;
• and an update callback, which is executed on a tile that needs to synchronize its state to a tile group.
Some of these callbacks may be omitted, e.g., for applications not requiring bootstrapping or with an already exposed state. The checksum computation and state (re-)synchronization are intentionally placed within the domain of the application developer. This enables decisions about an application's state to be taken by the entity with the best knowledge of the individual thread and the means to determine which data is relevant to the system and application state, and must therefore be preserved.
Threads can be executed in an arbitrary order within a lockstep cycle as long as their state is equivalent during the next checkpoint.
However, interrupting an active application at a random point in time is usually undesirable. We avoid thread-synchronization issues [18] by enabling the application developer to define comparison points where the application will yield control to the checkpoint handler. If an application requires real-time scheduling, the tightness of the RT guarantees depends upon the time required to execute these callbacks. Communication between thread groups and tile groups is of course possible and will remain reliable, as long as the receiving application is aware that it will receive multiple message replicas. To prevent faults from propagating through IPC channels, a thread can compare the received messages.
Limitations: This approach guarantees system state consistency and control flow correctness after each checkpoint, and for all past checkpoint periods. It also assures computational correctness before the last checkpoint, but cannot actively prevent faults from occurring during the ongoing checkpoint cycle. Thus, if one tile experiences a fault, incorrect results may be propagated outside the system, even though the damage caused to the OBC will be corrected during the next checkpoint, and system state consistency will be asserted. This limitation is inherent to coarse-grained lockstepping concepts, but could be alleviated somewhat at the thread level using finer-grained event hooking, e.g., system-call hooking [19]. However, this workaround requires in-depth modifications to the OS kernel and development toolchain, is thus non-portable and difficult to maintain, and still does not solve the underlying conceptual limitation.
Related research, however, does show that a solution at the system-design level is much better suited to prevent propagation of transient faults between checkpoints using simple I/O voting [21].
Traditional hardware-FT approaches used in space computing are strong at assuring non-propagation of faults across interfaces using hardware-side voting, but cannot protect the control flow and system-state consistency efficiently. While the system state and system-level fault tolerance are assured by Stage 1, and long-term system resilience is safeguarded by Stages 2 and 3, we can utilize simple I/O voting to prevent fault propagation for tile groups. Performing I/O voting on interfaces is already common practice in space-borne computing, as considerable effort is put into providing interface redundancy aboard larger satellites. Small satellites, especially CubeSats, usually cannot spare the additional energy, space and mass required for interface replication. For such spacecraft, I/O voting can be implemented on-chip using library IP.
V. STAGE 1: SHORT-TERM FAULT MITIGATION
Stage 1 offers software-controlled, thread-level, distributed majority voting and fine-grained fault logging within any COTS MPSoC with three or more processor cores. The objective of Stage 1 is to detect and correct faults at each checkpoint to assure computational correctness, control-flow consistency, and a consistent system state after each checkpoint. To do so, Stage 1 requires a processor guaranteeing sequential consistency.
Instead of exerting direct control over the MPSoC, a supervisor can assure FT indirectly, as fault coverage and control are distributed and enforced by the tiles themselves. In consequence, the supervisor does not require any knowledge about the executed application threads, an individual tile's state, or other OBC intrinsics. The thread group assignment within an MPSoC can be reconfigured freely at runtime to implement different voting configurations. Thus, the described approach can exploit parallelization to improve reliability, increase throughput, or minimize power consumption, thereby allowing the system to adapt to multi-phased missions with varying performance requirements.
A. Thread-Based Self-Testing
The program flow of this stage is depicted in Figure 2 and described below. It can be implemented within an existing scheduler and an interrupt service routine (ISR). A practical example of tile fault handling and recovery, and an overview of how the supervisor interacts with the system, are provided at the end of this section.
Fig. 2. The execution cycle of a tile during Stage 1. All code necessary for implementation is highlighted in blue, callbacks in yellow.
Bootup & Initialization: After bootup, a tile first executes basic self-test functionality to assure integrity of tile-local IP cores and memory. Each thread's initialization routine is executed on all tiles to allow a more rapid state update in case a new thread group is added to a tile. When being assigned to a tile, a thread will register its desired checkpoint frequency and its checksum, synchronization and update callback routines. After the threads have been initialized, each tile will set a periodic timer to initiate checkpoints. As depicted in Figure 2, a tile will execute its first checkpoint immediately after the MPSoC has been fully rebooted, to assure that application and OS initialization were successful. If only this individual tile was rebooted, it can thus return to the spare tile pool to replace a faulty core in the future.
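The registration step described above might look as follows. This is a hedged sketch only: `ThreadCallbacks`, `TileRuntime` and their methods are hypothetical names, and deriving the periodic timer from the highest requested checkpoint frequency is an assumption, as the text does not specify how the timer rate is chosen.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ThreadCallbacks:
    """The four routines a thread may provide; each one can be omitted."""
    init: Optional[Callable[[], None]] = None          # run at tile bootup
    checksum: Optional[Callable[[], int]] = None       # digest of thread state
    synchronize: Optional[Callable[[], bytes]] = None  # expose state to siblings
    update: Optional[Callable[[bytes], None]] = None   # adopt a sibling's state


class TileRuntime:
    def __init__(self) -> None:
        self.registry = {}   # thread name -> (checkpoint_hz, callbacks)

    def register(self, name: str, checkpoint_hz: float, cb: ThreadCallbacks) -> None:
        self.registry[name] = (checkpoint_hz, cb)

    def boot(self) -> None:
        # Run every registered initialization routine, skipping omitted ones.
        for _, cb in self.registry.values():
            if cb.init is not None:
                cb.init()

    def timer_hz(self) -> float:
        # Assumption: the periodic timer follows the most demanding thread.
        return max(hz for hz, _ in self.registry.values())


state = {"counter": 0}
rt = TileRuntime()
rt.register("Ta", 10.0, ThreadCallbacks(
    init=lambda: state.update(counter=1),
    checksum=lambda: state["counter"],
))
rt.register("Tb", 2.0, ThreadCallbacks())  # all callbacks omitted
rt.boot()
```

Keeping all four callbacks optional reflects the statement that some of them may be omitted, e.g., for applications that need no bootstrapping.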
Checkpoint Start: A checkpoint is triggered by a timer interrupt or externally by the supervisor. A thread can delay a checkpoint until it has reached a viable state for checksum comparison by disabling interrupts, thereby deferring interrupt processing. The checkpoint ISR saves the existing system state, loads the actual checkpoint handler, performs a context switch to kernel mode, and invokes the checkpoint handler.
Checksum Computation: The checkpoint handler invokes the checksum callback of each active thread scheduled for checking. As not all threads in a tile group require the same checking frequency, not all active threads will be validated during each checkpoint. This checksum callback returns a representation of the application thread's internal state as a checksum or hash generated from thread-private variables and other internal application state. The checksum format is defined at compile time, and must be chosen based on FT needs. The algorithm used to generate this checksum is up to the application developer. Each checksum is stored in the tile's local validation memory and thereby exposed to the other tiles. If no checksum callback can be provided, a checksum is computed by the checkpoint handler for an application-defined memory range. This memory range can be utilized by the application to deposit state-relevant data passively, e.g., through linker scripts or pre-processor macros. A non-continuously running application can also deposit its results in validation memory or return a checksum upon exit.
Prior concepts required deep modifications to the OS to allow a proprietary central health-management entity to retrieve this information directly [18], [25], or utilized no application-internal information [20], [21], [31]. Instead, this approach enables us to utilize application intrinsics to assess the health state of the system, without requiring any knowledge about the applications. The time required to generate checksums can be minimized by adapting the application code, e.g., by retaining computational by-products which would usually be discarded.
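A minimal sketch of this phase is given below. It assumes that each thread is checked every `every` checkpoints and uses CRC-32 for the fallback over the application-defined memory range; both choices, as well as the dictionary-based thread description, are illustrative assumptions, since the checksum algorithm and format are left to the developer.

```python
import zlib


def due_threads(threads, checkpoint_index):
    """Select the threads scheduled for checking at this checkpoint.

    Assumption: a thread with `every == n` is checked at every n-th checkpoint.
    """
    return [t for t in threads if checkpoint_index % t["every"] == 0]


def thread_checksum(thread):
    """Use the thread's callback if present, else checksum its memory range."""
    callback = thread.get("checksum_cb")
    if callback is not None:
        return callback()
    # Fallback: the handler checksums an application-defined memory range.
    return zlib.crc32(bytes(thread["state_range"]))


def run_checksum_phase(threads, checkpoint_index, validation_memory):
    # Store one checksum per due thread, exposing it to the sibling tiles.
    for t in due_threads(threads, checkpoint_index):
        validation_memory[t["name"]] = thread_checksum(t)
    return validation_memory


threads = [
    {"name": "Ta", "every": 1, "checksum_cb": lambda: 0xBEEF},
    {"name": "Tb", "every": 2, "state_range": bytearray(b"attitude=ok")},
]
first = run_checksum_phase(threads, 1, {})    # only Ta is due
second = run_checksum_phase(threads, 2, {})   # Ta and Tb are due
```

In this sketch `Tb` never provides a callback, so its checksum is always derived from the deposited memory range.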
Checksum Comparison: Once all checksum callbacks have been executed, a tile will monitor its group members' validation memory segments until another tile is ready for comparison. It will do so until it has compared its checksums with all siblings, or until the tile-group deadline set by the system designer has expired. Tiles will usually begin comparing their checksums with siblings immediately or wait only briefly, as delays are mainly induced by varying memory latencies or malfunctions. If a tile detects a checksum mismatch or a sibling violates the deadline, the tile will stop comparing checksums and report disagreement with that sibling to the supervisor.
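The comparison step can be sketched as below. `first_disagreement` follows the described behavior of stopping at the first mismatch or missed deadline; `suspected_tiles` is an added assumption showing one way the individual reports could be combined into a majority decision, and is not taken from the described implementation.

```python
def first_disagreement(own, siblings, ready):
    """Return the first sibling this tile disagrees with, or None.

    own      -- this tile's checksums, thread name -> checksum
    siblings -- sibling name -> that sibling's checksums
    ready    -- names of siblings that published before the deadline
    """
    for name, checksums in siblings.items():
        if name not in ready or checksums != own:
            return name   # stop comparing and report this sibling
    return None


def suspected_tiles(reports):
    """Tiles reported by more than half of the other tiles (assumed rule)."""
    tiles = list(reports)
    suspected = set()
    for candidate in tiles:
        votes = sum(reports[t] == candidate for t in tiles if t != candidate)
        if votes > (len(tiles) - 1) / 2:
            suspected.add(candidate)
    return suspected


checks = {
    "A": {"Ta": 1, "Tb": 7},
    "B": {"Ta": 1, "Tb": 7},
    "C": {"Ta": 1, "Tb": 9},   # faulty result for Tb
}
reports = {
    tile: first_disagreement(
        checks[tile],
        {n: c for n, c in checks.items() if n != tile},
        ready={"A", "B", "C"},
    )
    for tile in checks
}
faulty = suspected_tiles(reports)
```

With three tiles, the faulty tile disagrees with a healthy sibling as well, but only the tile reported by both other members is singled out.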
Thread Disagreement & State Propagation: If a tile detects a checksum mismatch, it executes the synchronization callback routines for all threads in the affected tile group. This callback can be omitted if all state-relevant data is already in validation memory, e.g., for non-continuously running applications. The checkpoint routine will adjust the checkpoint timer if a new thread group was added to the tile group, and return control to the scheduler.
State Update and Thread Execution: The scheduler will check three conditions during regular operation: whether any thread group is active, whether the tile was newly added to a tile group, and whether it requires an update. To reduce energy consumption and fault potential, idle tiles sleep until the next checkpoint and can be woken up by the supervisor. In case a tile must update a thread group's state from a sibling, the relevant update callback will be executed for each thread. Tiles that have detected disagreement with one of their siblings will delay execution for a tile-group-wide grace period, to allow a sibling to retrieve a state copy from validation memory. Once a tile has updated its state using a sibling's data, application processing continues. The other tile group members will also wake up after the grace period and continue executing threads. This concludes the lockstep cycle.
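State propagation and state update form a matched pair of callbacks. The sketch below uses a toy thread whose whole state is a counter and serializes it as JSON; the class, the helper functions and the serialization format are assumptions made purely for illustration.

```python
import json


class CounterThread:
    """Toy application thread with a small, explicit state."""

    def __init__(self):
        self.count = 0

    def synchronize(self):
        # Synchronization callback: expose all state-relevant data.
        return json.dumps({"count": self.count}).encode()

    def update(self, blob):
        # Update callback: adopt the state exposed by a healthy sibling.
        self.count = json.loads(blob.decode())["count"]


def propagate_state(healthy, validation_memory, thread_name):
    # Executed on a tile that detected disagreement.
    validation_memory[thread_name] = healthy.synchronize()


def update_state(replacement, validation_memory, thread_name):
    # Executed on the tile that must synchronize to the tile group.
    replacement.update(validation_memory[thread_name])


healthy, spare = CounterThread(), CounterThread()
healthy.count = 42
shared = {}
propagate_state(healthy, shared, "Ta")
update_state(spare, shared, "Ta")
```

After these two calls the replacement copy holds the same state as the healthy one, so both produce equal checksums at the next checkpoint.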
B. A Practical Example
Figure 3 depicts a quad-core MPSoC with a single tile group and three members. A fault has occurred during the second lockstep cycle on tile C , which is subsequently replaced with the idle tile C . C must retrieve a copy of the state of its threads Ta and Tb from another valid
Fig. 3. Tile initialization and a complete Stage 1 lockstep cycle.
C , can subsequently be tested for permanent defects by the OS and the supervisor.
C. Checkpoint-Frequency, Timing & Real-Time Capabilities
The level of fault coverage is mainly dependent on the checkpoint frequency. During a checkpoint, the computationally most costly operations are the application checksum callbacks, the synchronization callbacks and a new tile's update callback. Each of these operations involves a context switch and may imply a varying amount of data being read or written. Thus, the performance overhead and fault-tolerance capabilities mainly depend upon the actual applications checked, as the actual checkpoint handler code is rather trivial. In general, a higher checkpoint frequency implies that more time will be spent in checkpoints, but also that finer-grained fault detection is possible, and thus better fault coverage.
In our implementation, interrupts are deferred during a checkpoint, thus applications are not serviced and will not process I/O, thereby affecting the level of real-time capabilities the MPSoC can offer. However, this can be worked around using a more elaborate interrupt handling concept, e.g., using interrupt prioritization or filtering. Real-time capabilities are thus directly dependent on the MPSoC and application implementation characteristics, with the OS infrastructure playing a minor role. For complex applications with a large state, however, a lower checkpoint frequency also implies a larger difference in state. Hence, more data must be copied between tiles to achieve thread synchronization, requiring additional time. Thus, a larger state also requires more time for execution and potentially more complex data structures, thereby implying longer synchronization and update callbacks. Overall, the performance of OBCs executing less complex applications with little state will improve with lower checking frequencies. For such OBCs, more checkpoints imply more computational overhead.
With more complex applications, there is considerable optimization potential to find a sweet spot between checkpoint frequency and application-state size. However, performance strongly depends on high-quality callback routines being provided by the application developer.
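The opposing trends discussed in this section can be made concrete with a toy model. It assumes a fixed cost per checkpoint and a constant rate at which application state changes; neither assumption nor any of the numbers below stems from measurements, they only illustrate why a sweet spot exists.

```python
def checkpoint_overhead(frequency_hz, cost_per_checkpoint_s):
    """Fraction of processing time spent inside checkpoints (toy model)."""
    return frequency_hz * cost_per_checkpoint_s


def state_delta_bytes(frequency_hz, change_rate_bytes_per_s):
    """State accumulated between two checkpoints that a sync must copy."""
    return change_rate_bytes_per_s / frequency_hz


# Hypothetical numbers, chosen only to show the opposing trends.
for hz in (1, 10, 100):
    overhead = checkpoint_overhead(hz, cost_per_checkpoint_s=0.0005)
    delta = state_delta_bytes(hz, change_rate_bytes_per_s=64_000)
    print(f"{hz:>3} Hz: {overhead:.2%} in checkpoints, {delta:.0f} B to copy per sync")
```

Raising the frequency increases the time share spent in checkpoints linearly, while the amount of state a replacement tile has to copy shrinks accordingly.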
D. Supervision
The supervisor is connected to the MPSoC through a multiplexed bus interface, where each line signals agreement with another tile. Finer-grained disagreement reporting does not significantly improve fault coverage and constrains scalability of the MPSoC. As depicted in Figure 4, the supervisor only reacts to disagreement between tiles, otherwise remaining passive. It maintains a fault counter for each tile, and acts as a system-reset inducing watchdog timer for the MPSoC. To resolve transient faults within a tile, it increments the fault counter and induces a state update through a low-level debug interface. After repeated faults, the supervisor will replace the tile by adjusting the thread mapping of a spare tile, activating it, and rebooting the faulty tile. If a threshold defined by the system developer is exceeded, the disagreeing tile is assumed permanently defunct and not re-used as a spare. Stage 1 alone cannot reclaim defective tiles beyond programmatically avoiding the use of defective peripherals, memory pages or processor functionality. Thus, Stage 2 will attempt to repair tiles to prevent resource exhaustion.
In contrast to existing FT solutions, faults can be reported by each tile individually, because fault detection is decentralized. As this functionality is implemented at the kernel level, we can utilize the OS's powerful logging and diagnostics facilities, instead of relying upon the supervisor to provide a minimal useful level of logging. Diagnostics can thus be enriched with application-level information. Thereby, defect assessment accuracy can be improved compared to prior FT approaches, enabling more sophisticated debugging without requiring live interaction.
Our approach enables lockstepping frequencies far below the KiloHertz range, thus the supervisor will not be a bottleneck. Therefore, high-performance MPSoCs can be supervised well using pre-existing discrete COTS supervisors.
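The supervisor's reaction to reported disagreement, as described in this section, could be sketched as follows. The class, its method name, the returned action labels and both threshold values are hypothetical; the debug-interface state update and the reboot are reduced to returned labels.

```python
class Supervisor:
    """Passive supervisor that only reacts to reported disagreement."""

    def __init__(self, spares, replace_after=2, defunct_threshold=3):
        # Both limits are hypothetical values a system developer would choose.
        self.spares = list(spares)
        self.replace_after = replace_after
        self.defunct_threshold = defunct_threshold
        self.fault_count = {}
        self.defunct = set()

    def on_disagreement(self, tile):
        """Count the fault and return the action taken as (action, spare)."""
        count = self.fault_count.get(tile, 0) + 1
        self.fault_count[tile] = count
        if count < self.replace_after:
            return ("state_update", None)      # assume a transient fault
        if count > self.defunct_threshold:
            self.defunct.add(tile)             # never reused as a spare
        if not self.spares:
            return ("no_spare_left", None)     # left to the later stages
        spare = self.spares.pop(0)
        if tile not in self.defunct:
            self.spares.append(tile)           # rebooted tile rejoins the pool
        return ("replace", spare)


sup = Supervisor(spares=["D"])
actions = [sup.on_disagreement("C") for _ in range(2)]
```

In this run the first fault on tile `C` only triggers a state update, while the second one swaps in spare `D` and returns the rebooted `C` to the spare pool.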
COTS MPSoCs will utilize an external supervisor, while ASIC, FPGA and FPGA-SoC-hybrid based MPSoCs can implement this functionality in reconfigurable logic. An off-chip supervisor can also be used for active tile health management and FPGA reconfiguration. See [35] for further details on MPSoC-to-supervisor communication.
Fig. 4. A tile's and supervisor's program flow and their interactions. Stage 1, 2 and 3 logic are indicated in white, blue and yellow, respectively.
VI. STAGE 2: TILE REPAIR & RECOVERY
The previous stage can compensate for faults as long as healthy tiles are available to replace defective tiles. In all existing hardware-side FT implementations, resource exhaustion is mitigated through over-provisioning (adding more spares). Over-provisioning of tiles is naturally inefficient and curtails system scalability, but is unavoidable given the static, unchangeable nature of existing ASIC-based solutions. This will inevitably result in resource exhaustion, which has not been solved in prior work. Stage 2 is designed to perform active tile health management and to test, repair, validate and recover faulty tiles, thereby tackling this fundamental limitation. In FPGA-based systems, transient faults can corrupt the stored configuration of programmed logic, and thus induce permanent effects within the running configuration [36], [37]. However, even if a logic cell is damaged permanently, the residual highly redundant FPGA fabric will remain intact and can be re-purposed [38]. It could be repaired with differently routed, functionally equivalent configurations.
The main issue preventing prior research from utilizing FPGA reconfiguration to increase FT of general-purpose computing architectures is a lack of non-invasive, flexible circuit-level fault detection. As efficient fault detection is an unresolved issue and periodic configuration scrubbing is slow, Stage 2 relies upon fault detection by Stage 1. If a tile was replaced by a spare, the supervisor's Stage 2 logic recovers tiles using partial reconfiguration, mapping a tile to one of multiple partitions. Once reconfiguration is complete, the supervisor validates the relevant partitions to detect permanent damage to the FPGA fabric. Assuming a tiled MPSoC architecture (see Section VIII) is used, tiles are self-contained by design. Thus, reconfiguration of just one tile will not impact the other tiles and allows the OBC to recover a tile in the background.
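The recovery step, together with the retry over differently routed configuration variants that Stage 2 falls back to, can be sketched as a simple loop. The hooks `reconfigure` and `validate` stand in for partial reconfiguration and fabric validation; they, the variant names and the cell numbers are purely hypothetical.

```python
def repair_tile(tile, variants, reconfigure, validate):
    """Try configuration variants until one validates; None if all fail.

    reconfigure(tile, variant) -> bool  hypothetical partial-reconfiguration hook
    validate(tile) -> bool              hypothetical fabric validation hook
    """
    for variant in variants:
        if reconfigure(tile, variant) and validate(tile):
            return variant
    return None


# Simulated fabric: one permanently damaged logic cell.
damaged_cells = {17}
variant_cells = {"v0": {3, 17, 21}, "v1": {3, 18, 21}}
loaded = {}


def reconfigure(tile, variant):
    loaded[tile] = variant
    return True


def validate(tile):
    # A variant is healthy if it avoids every damaged cell.
    return not (variant_cells[loaded[tile]] & damaged_cells)


result = repair_tile("tile2", ["v0", "v1"], reconfigure, validate)
exhausted = repair_tile("tile2", ["v0"], reconfigure, validate)
```

A `None` result corresponds to the case where no available variant works and further measures are required.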
If reprogramming was unsuccessful or fabric-level faults persist, the supervisor will repeat the previous step with differently routed configuration variants. Partially defective logic cells can be re-purposed, while other cells can be avoided entirely if no other usage is possible. Other elements of the FPGA fabric can be treated equivalently. As a final measure, faults within shared logic can be resolved using full reconfiguration, briefly halting the MPSoC.

Stage 2 can also test different on-chip memories, the processor cores, and peripheral controllers through external interconnect access ports (e.g. an AXI-bridge). If the OBC is implemented on an ASIC or with a COTS MPSoC, a widely available low-level debug and testing interface such as JTAG can be utilized for the same purpose. For further details on how this functionality can be implemented, see [39].

If a defunct tile cannot be repaired through automated reconfiguration, additional diagnostic information can be used for further analysis. The operator can utilize this information to conduct fault analysis on the ground and to craft a suitable replacement configuration that avoids these areas. Of course, this implies extreme development effort, but for many higher-priority space missions, the loss of a spacecraft may be more costly than the engineering costs for saving the mission. If both partial and full reconfiguration are unsuccessful and all spare resources have been exhausted, the fault is escalated to Stage 3.

Fig. 5. If no healthy spare tiles are available, Stage 3 can split defunct tile groups and uphold FT guarantees for high-criticality threads. The necessary adjustment to the checkpoint frequency on tile 2 is omitted for simplicity.

VII. STAGE 3: APPLIED MIXED CRITICALITY
Stage 3 utilizes thread-level mixed criticality to extend an OBC’s lifetime once the previous stages have depleted all spare resources. Its primary objective is to autonomously maintain system stability of an aged or degraded OBC at short notice to avert loss-of-mission and loss-of-subsystem, even if an OBC approaches the end of its lifetime. The operator can then define a more resource-conserving satellite operations schedule, or sacrifice link capacity or on-board storage space. Thus, dependability for high-criticality threads can be maintained by reducing the compute performance or throughput, or increasing the latency, of lower-criticality applications.

The criticality of applications executed on an OBC can be differentiated by the importance of the controlled subsystem or relevance for commanding the spacecraft. Performance degradation or even a loss of lower-criticality tasks aboard a satellite is in general preferable to a loss of system stability for key applications. As thread groups can be added to and removed from tile groups, and multiple tile groups can coexist in the same MPSoC, individual threads can also be migrated between tile groups [26]. Furthermore, the checkpoint frequency of a tile group can be reduced to increase a tile’s computational capacity, or it can cease servicing low-priority interfaces.

The supervision logic is extended to reallocate thread groups across the system based upon each thread’s priority. Hence, if Stage 2 failed to reconfigure the OBC, the supervisor can generate new tile-group assignments for threads with high priority and will attempt to retain existing assignments. Eventually, all healthy tiles will be saturated with threads, and no further assignments will be possible. Then, it can either allocate more mappings, providing lower-priority threads with less processing time to maintain availability, reduce the checking frequency, or leave them inactive.
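The reallocation described above can be sketched as a greedy, priority-ordered assignment. This is only an illustration under simplifying assumptions, not the supervisor logic of this work: each thread group has an integer criticality and a load expressed in percent of a tile group's capacity, existing assignments are not preserved, and the checkpoint frequency is not adjusted. The names `ThreadGroup`, `TileGroup` and `reallocate` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ThreadGroup:
    name: str
    criticality: int  # higher value = more critical (assumed convention)
    load: int         # required share of a tile group's capacity, in percent


@dataclass
class TileGroup:
    name: str
    capacity: int = 100  # usable capacity in percent
    assigned: List[ThreadGroup] = field(default_factory=list)

    def free(self) -> int:
        """Capacity not yet consumed by assigned thread groups."""
        return self.capacity - sum(t.load for t in self.assigned)


def reallocate(threads: List[ThreadGroup], groups: List[TileGroup]) -> List[str]:
    """Assign thread groups in order of criticality; return names left inactive."""
    inactive = []
    for thread in sorted(threads, key=lambda t: t.criticality, reverse=True):
        target = next((g for g in groups if g.free() >= thread.load), None)
        if target is None:
            inactive.append(thread.name)  # no healthy tile group can host it
        else:
            target.assigned.append(thread)
    return inactive


# One healthy tile group remains; the least critical thread group does not fit.
groups = [TileGroup("healthy")]
threads = [
    ThreadGroup("low", criticality=1, load=50),
    ThreadGroup("high", criticality=3, load=60),
    ThreadGroup("mid", criticality=2, load=30),
]
inactive = reallocate(threads, groups)
```

A fuller model would also cover the other two options named in the text, namely giving lower-priority threads less processing time or reducing the checking frequency, instead of only leaving them inactive.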
The OBC developer can decide at design time which applications would benefit most from continuous operation with reduced performance or reliability, and which can be forgone.

In Figure 5, initially two tile groups are executed on one MPSoC with 6 tiles. The green tile group consists of a computationally expensive low-criticality application T_d and a shorter but more important thread T_c. Tile 2 is a member of another group, and has sufficient spare capacity to accommodate T_c, but not T_d. As no more spare tiles are available, the lower-criticality task T_d remains degraded, and can only detect but not correct subsequent faults. T_c is migrated to a separate, new tile group and executed on tiles 2 – 4, thereby maintaining strong FT.

VIII. PLATFORM ARCHITECTURE
Our multi-stage FT approach is in principle platform independent and can be implemented within any multi-threading capable OS supporting interrupts and timers. For most COTS-MPSoC based nanosatellites in a low Earth orbit (LEO), stages 1-3 alone offer sufficient fault coverage. Aboard such spacecraft, MPSoC interfaces are either unprotected or protected programmatically and outside the MPSoC (e.g. using EDAC chips or by resolving SEFIs through power cycling). Aboard larger, more critical spacecraft such faults cannot be accepted, and OBC interfaces are usually implemented redundantly at great effort. This redundancy is inherent to our approach with tiled architectures, and we developed an MPSoC platform capable of surviving the loss of peripheral devices and permanent, non-resolvable defects in interfaces.
A. Architecture Overview
This MPSoC can be implemented in full using library IP available with standard industry FPGA or ASIC design tools, without custom FT components. We have implemented our MPSoC prototype with Xilinx Vivado standard IP and AXI Interconnects, for low-tier ARM Cortex-A processor cores to be provided by one of our industrial partners. For common space applications, size-optimized cores such as the Cortex-A32, -A35 and -A5 offer an excellent balance between performance, universal platform support and logic utilization. The architecture minimizes shared logic, compartmentalizes tiles, and offers a clearly defined access channel between tiles for sharing checkpoint results and application state. We are aware that most miniaturized satellites do not require such a high degree of fault coverage, and often cannot afford the added hardware complexity and development effort.

The design depicted in Figure 6 follows a tiled architecture and is implemented within an FPGA to counter resource exhaustion when mitigating faults in Stage 2. It utilizes simple redundancy to compensate for SEFIs, but does not contain radiation-hard or FT processor cores or custom logic. Each tile is equipped with a processor core, an interrupt controller (IRQ in the figure), a dedicated on-chip memory slice used as validation memory, and several peripheral interfaces through the local interconnect. Tiles are connected through an I/O memory management unit (IOMMU) and a global interconnect to main and non-volatile memory. They cannot access the local interconnect of other tiles, which prevents interference and minimizes shared logic. This tiled architecture benefits from partial reconfiguration, as tiles can be placed strategically on an FPGA’s fabric along partition borders.
Our approach and this architecture support multi-FPGA and multi-ASIC MPSoCs without adaptation, thereby improving scalability and resilience against FPGA-level SEFIs.

The ECC-protected dual-port validation memory in each tile holds the current tile status, thread assignments, as well as the checksums and state information. One port is connected to the tile’s local interconnect, while the second port is read-only accessible via the global interconnect.
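The access rule for the validation memory can be modelled in a few lines. The sketch below is an illustration only and assumes hypothetical field names (`tile_status`, `thread_assignments`, `checksums`) and example checksum values; it mirrors the two ports with a writable mapping for the owning tile and a read-only view for everything reached via the global interconnect. ECC is not modelled.

```python
from types import MappingProxyType


class ValidationMemory:
    """Toy model of one tile's dual-port validation memory (field names assumed)."""

    def __init__(self) -> None:
        self._cells = {
            "tile_status": "healthy",
            "thread_assignments": (),
            "checksums": (),
        }
        # Port 1: read/write, reachable only through the tile's local interconnect.
        self.local_port = self._cells
        # Port 2: read-only live view, reachable through the global interconnect.
        self.global_port = MappingProxyType(self._cells)


memory = ValidationMemory()
memory.local_port["checksums"] = (0x1A2B, 0x3C4D)  # owning tile publishes results
seen_by_sibling = memory.global_port["checksums"]   # sibling tile reads them

try:
    memory.global_port["tile_status"] = "faulty"    # siblings must not write
    write_rejected = False
except TypeError:
    write_rejected = True
```

Keeping the second port read-only means a faulty tile can publish incorrect data about itself, but it cannot overwrite the state that its siblings publish.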
Fig. 6. A simplified representation of the presented MPSoC with memory controllers highlighted in yellow, scrubbers in green, and interconnect in blue. A dedicated interface on each tile allows supervisor access.
B. FPGA Implementation & Utilization
We have also developed a community-reproducible MPSoC design based on the previously described architecture, utilizing exclusively library IP. Instead of ARM cores, this 4-tile demonstration design includes Xilinx MicroBlaze processor cores, as these are more readily available to the general public. It targets standard FPGA development boards and is equipped with a single shared DDR4 main memory controller and 2MB of on-chip BRAM program memory. This reduced design was implemented successfully using the Xilinx Vivado Design Suite, and Stage 1 was implemented using FreeRTOS and the Xilinx SDK toolchain. Each tile is outfitted with data and instruction caches, an interrupt controller, a UART interface, validation memory, an additional local memory for storing tile-private information, and a GPIO controller to signal agreement between tiles. All tile-local memories are equipped with ECC, which increases the logic size of the relevant memory controllers and adds two additional interrupts for each connected memory. We achieved full timing closure at 250MHz core frequency on VCU118 and KCU116 development kits, though the clock frequency was selected to achieve a simple design, not an efficient or fast one. If additional time were invested into timing optimization and clocking, the clock speed could be drastically increased. Additional information regarding the tile and SoC layout is available in [35].

Fabric utilization based upon the Xilinx Virtex VCU118 Development Kit is depicted in Figure 7. Due to the use of on-chip program memory and the DDR4 memory controller, BRAM utilization is inflated compared to the MPSoC described previously. Resource utilization is indicated in Table I, with more details given in [35]. Stage 2 and 3 do not require additional FPGA logic.

This design’s very low logic usage shows that the architecture itself can be scaled to 8 and more tiles comfortably, and most current-generation FPGAs offer an abundance of unused resources for Stage 2. With current-generation FPGA platforms, Stage 2 will thus not only be able to recover defective tiles using spare resources, but could even place multiple tiles as cold or hot spares. The MicroBlaze cores utilized here for demonstration purposes can directly be replaced with drastically more complex processor cores, assuming the necessary peripheral IP is added as well (e.g. an ARM GIC instead of the MicroBlaze Interrupt Controller).

Resource  Utilization  Available  Utilization %
LUT            68,705  1,182,240           5.81
LUTRAM          9,235    591,840           1.56
FF             92,536  2,364,480           3.91
BRAM              810      2,160          37.50
DSP                27      6,840           0.39
IO                163        832          19.59
BUFG               17      1,800           0.94
MMCM                6         30          20.00

Tab. I. Resource utilization of the 4-tile demonstration MPSoC on a Xilinx VCU118 development board. The on-chip program memory and DDR4 memory controller disproportionately inflate BRAM utilization.

Fig. 7. Logic placement of the demo-MPSoC on a VCU118 development board running 4 Tiles: green, red, yellow, pink; Global Interconnect: white; Xilinx DDR4 controller: blue; Program Memory: teal.

IX. DISCUSSION & OUTLOOK
The reliability of each individual tile’s voting decision can be weak, and an individual tile can report false (dis)agreement with its siblings. Our approach takes into account that any software or hardware component associated with a tile can fail arbitrarily. Such failure is mitigated through a distributed decision, which is taken based on each tile’s perspective of its siblings. Thus, this approach does not require the checksum logic to compute correctly, and we assume that faults may occur at any time during the lifetime of a tile. As tile groups usually consist of three or more tiles, the likelihood of false disagreement or non-reported disagreement is insignificant. To mask such a fault, multiple faults would have to coincide in a majority of tiles within the same tile group during a single checking period and induce the same fault. The probability for such an event is extremely low, except at very high radiation levels. Even in such situations, such faults would be detected after the subsequent checkpoint with near certainty.

Prior research proves the conceptual effectiveness of thread-based FT [9], [20] and software-based FT combined with simple I/O voting [21]. Also, the detailed FT capabilities of a platform utilizing our approach
X. CONCLUSIONS
In this contribution, we present the first practical and integral multi-stage approach to fault-tolerant (FT) general purpose computing for spaceflight use. The approach explicitly does not utilize radiation-hardened or hardware-FT processor cores and utilizes no central MPSoC-internal voting logic. It can thus be implemented within COTS MPSoCs or alternatively entirely with non-FT, standard library IP cores available in FPGA or ASIC design software. In contrast to prior research, the presented approach considers the full and realistic fault model for space computing, and operates within real-world constraints. The approach does not require failure-free components within an MPSoC or in the OS, and does not leave conceptual gaps, e.g., regarding fault detection and recovery. It is not based upon traditional radiation-hardened processor cores and does not achieve fault tolerance through hardware measures. We showed that our approach is programmatically simple and requires little custom code, and that it can be implemented in most pre-existing multi-threading capable OS. Faults can be detected and mitigated using application-provided routines, enabling decisions about an application’s integrity to be taken by the application developers themselves. As a consequence, the system designer no longer must struggle to assess the health of each individual application’s state, and instead can focus on determining an optimal solution to problems at hand. It allows flexible fault detection, mitigation and recovery within COTS MPSoCs, laying the foundations for FT computing aboard miniaturized satellites, and helping to bridge the gap between theoretical embedded research and practical implementation in the space industry.
While remaining flexible and inducing only a minimal performance overhead, the presented multi-stage approach offers time-bounded real-time guarantees.

The approach can be well complemented with several other reliability-improving measures which were integrated into the outlined reference MPSoC architecture. Preliminary benchmark results of an unoptimized implementation show a low performance overhead, suggesting a beyond factor-of-5 performance increase over state-of-the-art radiation-hardened processors for space use. Our approach allows the host platform to scale vertically (more powerful processor cores and more interfaces per tile) as well as horizontally (more tiles), with virtually any modern processor core. Thereby, we aim to increase acceptance for software-side FT approaches in the space industry, building trust in hybrid hardware-software architectures. Thus, our approach is the first integral, real-world solution to enable fault-tolerant general-purpose computing with modern MPSoC designs for critical satellite control applications, thereby enabling the use of such SoCs in future high-priority space missions.

REFERENCES

[1] M. Langer and J. Bouwmeester, “Reliability of cubesats - statistical data, developers’ beliefs and the way forward,” in AIAA SmallSat, 2016.
[2] B. Bastida Virgili and H. Krag, “Mega-constellations issues,” in COSPAR, 2016.
[3] M. Marinella and H. Barnaby, “Total ionizing dose and displacement damage effects in embedded memory technologies,” Sandia National Laboratories, Tech. Rep., 2013.
[4] S. Bourdarie and M. Xapsos, “The Near-Earth Space Radiation Environment,” IEEE Transactions on Nuclear Science, 2008.
[5] J. Schwank et al., “Radiation Hardness Assurance Testing of Microelectronic Devices and Integrated Circuits,” IEEE Transactions on Nuclear Science, 2013.
[6] K. Reick et al., “FT design of the IBM Power6 microprocessor,” IEEE Micro, 2008.
[7] M. Hjorth et al., “GR740: Rad-hard quad-core LEON4FT system-on-chip,” in Eurospace DASIA, 2015.
[8] A. S. Jackson, “Implementation of the configurable fault tolerant system experiment on NPSAT-1,” Ph.D. dissertation, Naval Postgraduate School Monterey, 2016.
[9] X. Iturbe et al., “A triple core lock-step ARM Cortex-R5 processor for safety-critical and ultra-reliable applications,” in IEEE DSN, 2016.
[10] D. Ludtke et al., “OBC-NG: towards a reconfigurable on-board computing architecture for spacecraft,” in IEEE Aerospace, 2014.
[11] S. Gupta et al., “SHAKTI-F: A fault tolerant microprocessor architecture,” in IEEE ATS, 2015.
[12] R. DeCoursey et al., “Non-radiation hardened microprocessors in space-based remote sensing systems,” in Int. Society for Optics and Photonics: Remote Sensing, 2006.
[13] M. Pigno et al., “A testbench for validation of DST fault-tolerant architectures on PowerPC G4 COTS microprocessors,” in Eurospace DASIA, 2011.
[14] M. Pignol, “DMT and DT2,” in IEEE IOLTS, 2006.
[15] C. A. Hulme et al., “Configurable fault-tolerant processor (CFTP) for spacecraft onboard processing,” in IEEE Aerospace Conference, 2004.
[16] J. R. Samson, “Implementation of a dependable multiprocessor cubesat,” in IEEE Aerospace, 2011.
[17] X. Iturbe et al., “On the use of system-on-chip technology in next-generation instruments avionics for space exploration,” in Springer VLSI-SoC, 2015.
[18] U. Kretzschmar et al., “Synchronization of faulty processors in coarse-grained TMR protected partially reconfigurable FPGAs,” Elsevier RESS, 2016.
[19] B. Döbel, “Operating system support for redundant multithreading,” Ph.D. dissertation, Dresden University, 2014.
[20] A. Shye et al., “Using process-level redundancy to exploit multiple cores for transient fault tolerance,” in IEEE DSN, 2007.
[21] Y. Dong et al., “COLO: Coarse-grained lock-stepping virtual machines for non-stop service,” in ACM Symposium on Cloud Computing, 2013.
[22] A. L. Sartor et al., “Exploiting idle hardware to provide low overhead fault tolerance for VLIW processors,” ACM JETC, 2017.
[23] F. Anjam and S. Wong, “Configurable fault-tolerance for a configurable VLIW processor,” in Springer ARC, 2013.
[24] J. Hursey et al., “The design and implementation of checkpoint/restart process fault tolerance for open MPI,” in IEEE IPDPS, 2007.
[25] P. Munk et al., “Toward a fault-tolerance framework for COTS many-core systems,” in IEEE EDCC, 2015.
[26] L. Zeng, P. Huang, and L. Thiele, “Towards the design of fault-tolerant mixed-criticality systems on multicores,” in ACM CASES, 2016.
[27] S. P. Azad et al., “Holistic approach for fault-tolerant network-on-chip based many-core systems,” ACM HiPEAC, DREAMCloud, 2016.
[28] A. Höller et al., “Software-based fault recovery via adaptive diversity for COTS multi-core processors,” 2015, arXiv:1511.03528.
[29] A. D. Santangelo, “An open source space hypervisor for small satellites,” in AIAA SPACE, 2013.
[30] E. Missimer, R. West, and Y. Li, “Distributed real-time fault tolerance on a virtualized multi-core system,” Euromicro ECRTS, OSPERT, 2014.
[31] Z. Al-bayati et al., “Fault-tolerant scheduling of multicore mixed-criticality systems under permanent failures,” in IEEE DFT, 2016.
[32] S. Malik and F. Huet, “Adaptive fault tolerance in real time cloud computing,” in IEEE World Congress on Services, 2011.
[33] K. Smiri et al., “Fault-tolerant in embedded systems (MPSoC): Performance estimation and dynamic migration tasks,” in IEEE IDT, 2016.
[34] Z. Al-bayati et al., “A four-mode model for efficient fault-tolerant mixed-criticality systems,” in IEEE DATE, 2016.
[35] C. M. Fuchs et al., “Preliminary Performance Estimations and Benchmark Results for a Software-based Fault-Tolerance Approach aboard Miniaturized Satellite Computers,” 2017, arXiv:1706.02086.
[36] S. Azimi, B. Du, and L. Sterpone, “On the prediction of radiation-induced SETs in flash-based FPGAs,” Elsevier Microelectronics Reliability, 2016.
[37] H. Zhang et al., “Aging resilience and fault tolerance in runtime reconfigurable architectures,” IEEE Transactions on Computers, 2016.
[38] F. Siegle et al., “Mitigation of radiation effects in SRAM-based FPGAs for space applications,” ACM Computing Surveys, 2015.
[39] C. M. Fuchs et al., “Enhancing nanosatellite dependability through autonomous chip-level debug capabilities,” in Springer ARCS, 2016.
[40] C. M. Fuchs et al., “FTRFS: A fault-tolerant radiation-robust filesystem for space use,” in Springer ARCS, 2015.
[41] S. Zertal, “A reliability enhancing mechanism for a large flash embedded satellite storage system,” in IEEE ICONS, 2008.
[42] M. Ressler et al., “The Mid-Infrared instrument for the James Webb Space Telescope,”