A low-overhead soft-hard fault-tolerant architecture, design and management scheme for reliable high-performance many-core 3D-NoC systems

Abstract

The Network-on-Chip (NoC) paradigm has been proposed as a favorable solution to handle the strict communication requirements between the increasingly large number of cores on a single chip. However, NoC systems are exposed to the aggressive scaling down of transistors, low operating voltages, and high integration and power densities, making them vulnerable to permanent (hard) faults and transient (soft) errors. A hard fault in a NoC can lead to external blocking, causing congestion across the whole network. A soft error is more challenging because of its silent data corruption, which leads to a large area of erroneous data due to error propagation, packet re-transmission, and deadlock. In this paper, we present the architecture and design of a comprehensive soft error and hard fault-tolerant 3D-NoC system, named 3D-Hard-Fault-Soft-Error-Tolerant-OASIS-NoC (3D-FETO). With the aid of efficient mechanisms and algorithms, 3D-FETO is capable of detecting and recovering from soft errors which occur in the routing pipeline stages and leverages reconfigurable components to handle permanent faults in links, input buffers, and crossbars. In-depth evaluation results show that the 3D-FETO system is able to work around different kinds of hard faults and soft errors, ensuring graceful performance degradation, while minimizing additional hardware complexity and remaining power efficient.


Journal of Supercomputing manuscript No. (will be inserted by the editor)

A Low-overhead Soft-Hard Fault Tolerant Architecture, Design and Management Scheme for Reliable High-performance Many-core 3D-NoC Systems

Khanh N. Dang · Michael Meyer · Yuichi Okuyama · Abderazek Ben Abdallah

The final publication is available at Springer via https: // doi.org / / s11227-016-1951-0

Khanh N. Dang · Michael Meyer · Yuichi Okuyama · Abderazek Ben Abdallah
Adaptive Systems Laboratory
Graduate School of Computer Science and Engineering
The University of Aizu
Aizu-Wakamatsu City, Fukushima 965-8580, Japan
E-mail: d8162103, [email protected]

This project is partially supported by Competitive Research Funding (CRF), The University of Aizu, Reference P-11 (2016), and JSPS KAKENHI Grant Number JP30453020

Keywords

3D NoCs · Fault-tolerance · Soft-Hard Faults · Reliability · Architecture · Design

Global interconnects are becoming the principal performance bottleneck for high-performance Systems-on-Chip (SoCs) [2]. The 3-dimensional Networks-on-Chip (3D-NoCs) have been proposed as a promising architecture that combines the high parallelism of the Network-on-Chip paradigm with the high performance and lower interconnect power of 3-dimensional integrated circuits (3D-ICs) [6]. In the past few years, the benefits of 3D-ICs and mesh-based NoCs have been fused into a promising architecture, opening a new horizon for IC design. The parallelism of NoCs can be enhanced in the third dimension thanks to the short wire length and low power consumption of the interconnects of 3D-ICs. As a result, the 3D-NoC paradigm is considered to be one of the most advanced and auspicious architectures for the future of IC design, as it is capable of providing extremely high bandwidth and low power interconnects.

While the NoC paradigm has been increasing in popularity with several commercial chips [3], it is threatened by the decreasing reliability of aggressively scaled transistors. Transistors are approaching the fundamental limits of scaling. Gate widths are nearing the molecular scale, resulting in breakdown and wear-out in end products [19,23]. Moreover, the anticipated fabrication geometry in 2018 scales down to 8 nm with a projected 0.6 V supply voltage [22]. In the 8 nm process, a higher rate of soft errors affects the control logic and buffers of NoC routers, leading to chip failure. In addition, the low supply voltage enforces a very narrow noise margin, which makes the architecture vulnerable and sensitive to faults. As reported in [16], the soft error rate increases about 30% for each 100 mV decrease in the supply voltage. With rising power density and non-ideal threshold and supply voltage scaling, soft errors have become increasingly common during a chip's lifetime [17]. Figure 1 shows a detailed taxonomy of different types of error and fault sources in NoCs. We categorize the faults into two classes: Hard Faults and Soft Errors.

Errors in Network-on-Chip
- Soft Errors
  - Transient faults: Cross-talk; Radiation particles; Cosmic rays; Thermal neutrons; Noise
- Hard Faults
  - Run-time issues: Time-Dependent Dielectric Breakdown; Electromigration; Thermal Stress; Negative-Bias Temperature Instability
  - Manufacturing defects: Open; Stuck-at-0; Stuck-at-1; Bridge

Fig. 1 Taxonomy of errors and faults in NoCs.

Hard faults, including both permanent faults and intermittent faults, can occur during the manufacturing stage or under specific operating circumstances. Intermittent faults periodically occur during operation and can disappear after a certain time. Because these faults do not permanently damage a given component, the component can pass through several testing stages but still cause operation failures. Although intermittent faults can disappear after a specific period of time, their inconsistency means that they can be treated as permanent faults to avoid complex situations. For both permanent and intermittent faults, the most natural solution is using redundant components [15,12].

Soft errors arise from energetic particles, such as alpha particles and neutrons from cosmic rays, generating electron-hole pairs as they pass through a device. A sufficient amount of accumulated charge may invert the state of a logic device such as a latch, gate, or SRAM cell, thereby introducing a logic fault into the NoC's operation. Soft errors do not permanently damage the gate and only occur over a short period of time. Because of these characteristics, they are unpredictable and unavoidable. Unlike permanent and intermittent faults, transient faults cannot be fixed by replacing the affected components. Instead, they can be recovered from by repeating the erroneous operation. A transient failure inside the data path can also be fixed by using code-based techniques (e.g., Error Correction Code (ECC) [8]). Statistically, transient faults are the most common kind of fault, accounting for 80% of failures, as reported in [24].
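To make the code-based protection of the data path more concrete, the following is a minimal sketch of a single-error-correcting, double-error-detecting code. It assumes an extended Hamming(8,4) code over a 4-bit value; the code width and bit layout are illustrative assumptions rather than the configuration used in this work, and the names `secded_encode` and `secded_decode` are hypothetical.

```python
def secded_encode(nibble):
    """Encode a 4-bit value as an 8-bit extended Hamming codeword (list of bits)."""
    data = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8  # index 0: overall parity; indices 1..7: Hamming(7,4) positions
    code[3], code[5], code[6], code[7] = data
    code[1] = code[3] ^ code[5] ^ code[7]  # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2            # makes the total number of ones even
    return code


def secded_decode(code):
    """Return (status, value); status is 'ok', 'corrected' or 'retransmit'."""
    code = list(code)
    syndrome = 0
    for position in range(1, 8):
        if code[position]:
            syndrome ^= position           # XOR of the positions holding a one
    parity_ok = sum(code) % 2 == 0
    if syndrome == 0 and parity_ok:
        status = "ok"
    elif not parity_ok:
        # Odd number of flipped bits: assume a single error and flip it back.
        # A zero syndrome here means the overall parity bit itself was hit.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with correct overall parity: two bits were flipped.
        # The error is detected but cannot be corrected.
        return "retransmit", None
    value = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, value
```

In this sketch a single flipped bit is corrected locally, while a double error is only detected, which is the case where a receiver would have to ask for the data to be sent again.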
Without an efficient protection mechanism, soft errors can compromise the system's functionality and reliability.

Hard fault handling schemes are based on two main approaches: (a) fault-tolerant routing algorithms, which enable packets to avoid faulty nodes in the network [6,15]; (b) architecture-based methods, which use hardware (component) redundancy and/or reconfiguration to recover from faults [15,12,1]. Soft error recovery is also addressed by two main schemes: (a) data corruption handling using Error Correction Code (ECC) based methods [26,8,38]; (b) control logic handling using temporal redundancy based methods [18,39,14].

Although many researchers have proposed solutions for various individual aspects of on-chip reliability, a comprehensive approach encompassing both soft errors and hard faults pertaining to NoC reliability has yet to evolve. In addition, error detection and diagnosis in NoC architectures have been studied thoroughly in the scope of offline testing; however, with soft errors and intermittent faults becoming a dominant failure mode in modern NoCs and general VLSI systems, a widespread deployment of online test approaches has become crucial. In this paper, we present a comprehensive soft error and hard fault tolerant 3D-NoC architecture, named 3D-Hard-Fault-Soft-Error-Tolerant-OASIS-NoC (3D-FETO). With the aid of efficient mechanisms and algorithms, 3D-FETO is capable of detecting and recovering from soft errors occurring in the routing pipeline stages and leverages reconfigurable components to handle permanent fault occurrences in links, input buffers, and crossbars. The main contributions of this work are summarized as follows:

- A new adaptive 3D router architecture based on a robust hardware reconfiguration mechanism for the components most susceptible to hardware faults, and on a low-cost method that is capable of detecting and recovering from soft errors in the router pipeline stages.
- An efficient scheme for online control fault detection and diagnosis in 3D-NoC systems.

The organization of this paper is as follows: in Section 2, we present related works. Section 3 presents the adaptive router architecture (SHER-3DR). In Section 4, we present comprehensive techniques which include fault detection, diagnosis and recovery. Section 5 provides the implementation and evaluation results. Finally, we present the conclusion and our ideas for future work in the last section.

Many works have addressed the fault-tolerance and reliability issues in NoC architectures. In [6,1,7], we covered some well-known solutions presented to tackle hard faults; therefore, in this section we mainly focus on solutions related to soft error recovery. As depicted in Table 1, they are classified into methods focusing on the Data Path (DP) and methods focusing on the Control Logic (CL) of the router.

Table 1 Taxonomy of different error recovery protocols and architectures in NoCs.

| Fault Type | Position / Method | Fault Tolerant Method |
|---|---|---|
| Soft Errors | Data Path | Automatic Re-transmission Request [26] |
| Soft Errors | Data Path | Error Detecting / Correcting Code [8,38] |
| Soft Errors | Control Logic | Logic / Latch Hardening [18,32] |
| Soft Errors | Control Logic | Pipeline Redundancy [14] |
| Soft Errors | Control Logic | Monitoring and Correcting model [39,31,29] |
| Hard Faults | Routing Technique | Spare wire [25,35] |
| Hard Faults | Routing Technique | Split transmission [20] |
| Hard Faults | Routing Technique | Fault-Tolerant routing algorithm [6,15] |
| Hard Faults | Architecture-based Technique | Hardware Redundancy [12] |
| Hard Faults | Architecture-based Technique | Reconfiguration architectures [15] |

For soft errors in the data path, most works use code-based techniques that not only check the integrity of the received data, but also provide a correction function up to an acceptable number of faults. For instance, Bertozzi et al. [8] analyzed several low-cost coding techniques for on-chip communication. Among these techniques, SECDED (Single-Error Correcting and Double-Error Detecting) was found to be the solution with the most balanced trade-off between reliability and implementation cost. Although the authors provide several evaluations of energy and hardware complexity, on-chip communication analysis (such as throughput and latency) is missing. As an adaptive solution, Yu et al. [38] presented a dynamic ECC based on the quality of the wire connection, using a configurable ECC with two Hamming codes to adapt to several fault probabilities. Although this adaptive ECC obtains energy efficiency, its area overhead is problematic.

Soft errors can be detected and recovered from using temporal redundancy. For example, Ernst et al. [18] presented a Razor D flip-flop with an additional shadow latch sampled by a delayed clock for checking the occurrence of transient faults. Furthermore, a soft error detection solution based on redundant latches was also presented by Ravindan et al. [34]. Although these techniques obtain more efficient detection results, they nearly double the area overhead and power consumption to maintain the redundant latches.

For soft errors in the control logic, there are several techniques with cross-layer resolution. At the End-to-End level, Shamshiri et al. [36] proposed error-correction and on-line diagnosis using a specific code. Based on the position of the erroneous bit in the received data, the system can indicate the position of the faulty node in the network; however, when a packet is misrouted due to wrong routing information/arbitration or an adaptive routing algorithm, the path of a packet is not fixed in a way that can determine the faulty node. To ensure arbitration computation across layers, NoCAlert [31] implements constraints to obtain computational accuracy. By constraining the relationship between the input and output of a block, the system can detect both soft and hard faults. Although this work presents efficient detection, it lacks efficiency in recovering from soft errors. First, the system needs to distinguish between soft and hard faults to decide the recovery method. Second, soft errors cannot be recovered by spatial redundancy and their recovery at the End-to-End level is inefficient. The FoReVer framework [29] also presented a network-level method to detect and recover from routing errors: lost, duplicated, and misrouted packets. Since FoReVer is based on End-to-End detection and recovery, dealing with soft errors requires retransmission of the whole packet instead of an online recovery.

In the physical/data-link layers, one of the most common methods is using Triple Modular Redundancy (TMR). By triplicating the original module, the system gets three results at the same time [32]. The three results are sent to a Majority Voting module to decide the accurate result. Although this technique suffers from high area overhead and power consumption (about 300%), it is easy to implement and effective for both soft errors and hard faults. In [39], the authors deploy a monitoring system on important control modules. They can diagnose the output to find the failure. This technique is lightweight in both area and power and has an insignificant impact on the system performance. However, it suffers from a lack of flexibility since the monitor module has to be specifically designed depending on the target component. If any changes in the routing algorithm or pipeline stages are needed, investigation and re-designing of the monitor module is mandatory.

Figure 2 shows the block diagram of the proposed adaptive 3D router architecture (SHER-3DR). The router relies on simple recovery techniques based on system reconfiguration with redundant structural resources to contain hard faults in the input buffers, crossbar, and links, in addition to soft errors in the routing pipeline stages.

Fig. 2 Adaptive 3D router (SHER-3DR) architecture.

The SHER-3DR router is the backbone component of the 3D-FETO system. Each router has a maximum of 7 input and 7 output ports, where 6 input/output ports are dedicated to the connection to the neighboring routers and one input/output port is used to connect the switch to the local computation tile. As shown in Fig. 2, the SHER-3DR contains seven Input-port modules, one for each direction, in addition to the Switch-Allocator and the Crossbar module, which handles the transfer of flits to the next node. An Input-port module is composed of two main elements: an Input-buffer and the LAFT routing (Next-Port-Computing) module. Incoming flits from different neighboring routers, or from the connected computation tile, are first stored in the Input-buffer. This step is considered to be the first pipeline stage of the flit's life-cycle, Buffer-Writing (BW). After receiving and storing the flits, their routing information is read and processed by a LAFT-Routing module (Next-Port-Computing) and an arbitrating module (Switch-Allocator). This step is the second stage, Next-Port-Computing/Switch-Allocator (NPC/SA). After the NPC/SA pipeline stage, the next-port value is merged into the flit and the grant signal allows the flit to traverse from its input port to an output port (Crossbar-Traversal (CT) stage).

An augmented Look-Ahead-Fault-Tolerant routing algorithm (LAFT) [4,5] is used to perform the routing decision. If a given flit is routed to the local port, there is no routing calculation. If the flit is to be routed to another node, the fault link information of all neighboring nodes is read by each input port and LAFT routing is executed. The first phase of the algorithm calculates the next node's address and its fault output information. In the next phase, the LAFT routing algorithm determines the minimal paths which are valid for routing after eliminating the faulty paths. The final routing path is selected by evaluating two factors of all the possible routing paths: (1) the diversity of the routing path to the destination node and (2) the congestion value of the connection. If there is no minimal routing path, a similar approach is applied for the non-minimal routing paths. Finally, an output port of the selected routing is calculated. This information is merged into the flit as next-output-port bits for routing in the following nodes [1].

3.1 Hard Fault Recovery Mechanism Overview

The block diagram of the hard fault recovery mechanism is shown in Fig. 3. The Random Access Buffer mechanism (RAB) [1] solves the deadlock problem that can occur with the look-ahead fault-tolerant routing algorithm (LAFT), and is able to recover from transient, intermittent, and permanent faults in the input buffer. When a fault is detected in one of the slots, the main controller (located in the input port manager in Fig. 2) considers the flagged slots when assigning the write and read addresses.
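The way a controller can take flagged slots into account when assigning addresses can be illustrated with a small behavioural sketch. It is not the RAB design from [1]: the class name `RandomAccessBufferSketch`, the 4-slot depth, and the policy of writing to the lowest-numbered healthy free slot are all assumptions made for illustration.

```python
from collections import deque


class RandomAccessBufferSketch:
    """Input buffer whose write/read addresses skip slots flagged as faulty."""

    def __init__(self, depth=4):
        self.slots = [None] * depth    # None marks an empty slot
        self.faulty = [False] * depth  # per-slot fault flags
        self.order = deque()           # addresses of stored flits, oldest first

    def flag_slot(self, address):
        """Mark a slot as faulty so that it is excluded from future writes."""
        self.faulty[address] = True

    def _free_healthy_address(self):
        for address, flit in enumerate(self.slots):
            if flit is None and not self.faulty[address]:
                return address
        return None

    def write(self, flit):
        """Store a flit; return False if no healthy free slot is available."""
        address = self._free_healthy_address()
        if address is None:
            return False               # the upstream node would have to stall
        self.slots[address] = flit
        self.order.append(address)
        return True

    def read(self):
        """Return the oldest stored flit, or None if the buffer is empty."""
        if not self.order:
            return None
        address = self.order.popleft()
        flit, self.slots[address] = self.slots[address], None
        return flit
```

With one of four slots flagged, the sketch accepts three flits and rejects the fourth until a read frees a healthy slot, so the buffer keeps working with reduced capacity instead of failing outright.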
The flagged slots are then checked for recovery from the faults.

The Bypass Link on Demand mechanism (BLoD) [1] provides additional escape channels whenever the number of faults in the baseline 7x7 crossbar increases. When a fault is detected in one or several crossbar links, the fault manager (depicted in Fig. 2) disables the faulty crossbar links and enables the appropriate number of bypass channels. The number of bypass links is very important and should be minimized as much as possible to reduce the area and power overhead. In a case where the number of faulty links is larger than the number of backup links, the system needs to mark the links as faulty and use the LAFT algorithm to avoid routing through this defective connection.

Fig. 3 Hard-fault tolerant mechanism [1]: (a) Random Access Buffer (RAB); (b) Bypass-Link-on-Demand (BLoD)

3.2 Soft Error Recovery Mechanism

As represented in Fig. 4, the principal soft-error handling method in the proposed 3D-FETO system relies on a solution called Pipeline Computation Redundancy (PCR), which repeats the computation in one additional clock cycle [14].

Fig. 4 High-level view of the soft-hard error recovery approach: (a) 3D-Mesh based NoC configuration; (b) Tile organization; (c) SHER-3DR router organization; (d) Input-Port; (e) Switch allocation unit.

For ease of understanding, we explain the PCR in Algorithm 1. The Next Port Computing (NPC) and Switch Allocator (SA) run in parallel (the first NextPortComputing and SwitchAllocation calls in Algorithm 1) after the Buffer Writing stage. This is achieved by the LAFT routing algorithm, where the dependency between the two stages is eliminated. After the first computation, both stages have an additional computation clock cycle (the second NextPortComputing and SwitchAllocation calls). By comparing the two consecutive results, soft errors are detected. If a soft error is detected, the whole pipeline is halted for correction. A third computation is required for majority voting, which decides the final result. To recover from soft errors in the data, Single Error Correction Double Error Detection (SECDED) [21] with ARQ (Automatic Retransmission Request) [26] is adopted.

In the first stage, flits are stored in the input buffer at the Buffer Writing (BW) stage, and the ECC is used to check and correct the input data in the ECC module. In the second stage, the NPC and the SA are executed in parallel in the LAFT routing unit and the Switch-Allocator module. In the third stage, the Redundant NPC (RNPC) and the Redundant SA (RSA) are computed in parallel. Then, if the output of RNPC is equal to that of NPC, and SA is equal to RSA, the Crossbar Traversal (CT) stage is performed in the third cycle, and the flit goes to the next router via the output channel. If the RNPC is not equal to the NPC, the system rolls back and re-computes the NPC. Likewise, if SA is not equal to RSA, the system rolls back and re-computes the SA stage. After rolling back and re-computing, a majority voting module is used to decide the correct output of these modules. Once the roll-back, re-computation and voting have been executed, the outputs of NPC/SA are sent to the Crossbar Traversal stage to finish the flit transmission.

Figure 5 presents a working demonstration of the SHER-3DR router. [flit(n)] represents the flit in the n-th position of the packet. [time(m)] denotes the m-th time of computation. In the first clock cycle, BW handles [flit(1)] while NPC/SA and CT are idle or are handling another packet. In the second cycle, NPC/SA computes [flit(1), time(1)], which means the computation of the first flit for the first time. In the third cycle, NPC/SA computes [flit(1), time(2)], which means that it computes the first flit for the second time, also known as the redundant computation. [c(1)] compares the results

Algorithm 1: Algorithm of Pipeline Computation Redundancy (PCR).

    // input flit's data
    Input: in_flit
    // output flit's data
    Output: out_flit

    // Write flit's data into buffers
    BufferWriting(in_flit)
    // Compute first time of NPC and SA
    next_port[1] = NextPortComputing(in_flit)
    grants[1] = SwitchAllocation(in_flit)
    // Compute redundant of NPC and SA
    next_port[2] = NextPortComputing(in_flit)
    grants[2] = SwitchAllocation(in_flit)
    // Compare original and redundant to detect soft-error
    // Soft-error on NPC
    if (next_port[1] ≠ next_port[2]) then
        // roll-back and recalculate NPC
        next_port[3] = NextPortComputing(in_flit)
        final_next_port = MajorityVoting(next_port[1,2,3])
    else
        // No soft-error on NPC
        final_next_port = next_port[1]
    end
    // Soft-error on SA
    if (grants[1] ≠ grants[2]) then
        // roll-back and recalculate SA
        grants[3] = SwitchAllocation(in_flit)
        final_grants = MajorityVoting(grants[1,2,3])
    else
        // No soft-error on SA
        final_grants = grants[1]
    end
    // After detection and recovery, the algorithm finishes with CT
    out_flit = CrossbarTraversal(in_flit, final_next_port, final_grants)
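The control flow of Algorithm 1 can be sketched in software as follows. This is a behavioural illustration only, not the hardware implementation: the stage functions are passed in as plain Python callables, the helper `inject_soft_error` is a hypothetical test aid that corrupts one chosen call, and the sketch assumes that at most one of the three computations of a stage is corrupted.

```python
from collections import Counter


def majority_vote(results):
    """Return the value that occurs most often among the results."""
    return Counter(results).most_common(1)[0][0]


def redundant_compute(stage, flit):
    """Run a stage twice; on a mismatch run it a third time and vote.

    Returns (result, number_of_computations).
    """
    first = stage(flit)
    second = stage(flit)           # redundant computation in the extra cycle
    if first == second:
        return first, 2            # no soft error detected
    third = stage(flit)            # roll back and recalculate
    return majority_vote([first, second, third]), 3


def pcr(flit, next_port_computing, switch_allocation):
    """Return (final_next_port, final_grants, recovery_cycle_needed)."""
    final_next_port, npc_runs = redundant_compute(next_port_computing, flit)
    final_grants, sa_runs = redundant_compute(switch_allocation, flit)
    recovery_cycle_needed = npc_runs == 3 or sa_runs == 3
    return final_next_port, final_grants, recovery_cycle_needed


def inject_soft_error(stage, faulty_call):
    """Hypothetical helper: flip the lowest result bit on one chosen call."""
    calls = {"count": 0}

    def wrapped(flit):
        calls["count"] += 1
        result = stage(flit)
        return result ^ 1 if calls["count"] == faulty_call else result

    return wrapped
```

As a usage example under the same assumptions, wrapping a stage with `inject_soft_error(stage, 1)` corrupts its first computation; `pcr` then detects the mismatch, performs the third computation, and still returns the fault-free value with the recovery flag set.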

of [flit(1), time(1)] and [flit(1), time(2)] to detect the occurrence of a soft error. If there is no error, CT processes [flit(1), time(1)] to finish the pipeline stages of the first flit. If there is an error in NPC/SA, the system requires a recovery in the fourth cycle. In this cycle, NPC/SA recalculates the first flit for the third time ([flit(1), time(3)]) and finalizes an accurate result by using majority voting ([f(1)]). After getting the final result of the first flit, CT completes the pipeline stage of the first flit based on the correct result of the two previous computations: [flit(1), time(1)] or [flit(1), time(2)]. As shown in Fig. 5, the router requires one clock cycle for detecting a soft error and one optional cycle for recovering each time an error occurs.

| Cycle | BW | NPC/SA | CT |
|---|---|---|---|
| 1st | flit(1) | idle | idle |
| 2nd | flit(2) | flit(1), time(1) | idle |
| 3rd | flit(3) | flit(1), time(2) → c(1) | flit(1), time(1) |
| 4th: c(1) = T | flit(4) | flit(2) | idle |
| 4th: c(1) = F | flit(4) | flit(1), time(3) → f(1) | flit(1), time(2) |

flit(n): n-th flit in the packet. time(m): computation at the m-th time. c(a): comparison for the a-th flit. T = True; F = False. f(a): finalization for the a-th flit based on majority voting. The two 4th-cycle rows are conditional branches.

Fig. 5 SHER-3DR working demonstration.

Algorithm 2 shows the proposed Detection, Diagnosis and Recovery Mechanism (DDRM). It uses the feedback from the ECC and the Automatic Retransmission Request (ARQ) protocol to monitor the errors. As shown in Fig. 2, the input data is first verified by an ECC decoder. If the value is correct or the ECC decoder can handle the correction, the flit is written to the input buffer. Otherwise, a retransmission is requested. Since a transient fault only occurs over a short period of time, assumed to be a single clock cycle, it does not occur for two consecutive cycles. Therefore, ARQ can recover from this kind of fault. However, if a permanent fault occurs, ARQ is unable to correct it and the faulty connection will keep requesting retransmission indefinitely. Therefore, if the ARQ cannot correct the fault, the system considers it to be a permanent fault (the detection phase of Algorithm 2).

Since a flit's correctness is verified by the ECC module before being written to the buffer, a permanent fault can only occur in the path between the input buffer in the upstream node and the one in the downstream node. Figure 6 shows the high-level view of the DDRM and Router-to-Router interfacing. The transmission path of a flit consists of 3 main components: input buffer slots, a crossbar link and a router-to-router channel. When a fault is detected, DDRM diagnoses these components to find the fault position and recover it with an appropriate mechanism.

For the diagnosis and recovery phase, the router's Fault-manager module initiates the diagnosis with input buffer checking. In this step, the error statuses of the following flits of the monitored input buffer are checked. If errors are detected in the following flits' transmission, it means the fault should belong to the crossbar link or the inter-router channel. The diagnosis is then forwarded to check the crossbar and inter-router channel. If errors are constantly detected at the same position of the monitored buffer, the fault belongs to this detected position. In this case, the Fault-manager sends a signal to the Random Access Buffer (RAB) mechanism to indicate the faultiness of the slot in the input buffer (the input buffer checking branch of Algorithm 2). If the Fault-manager indicates that the fault may belong to the crossbar or inter-router channel, the Fault-manager first configures the Bypass-Link-on-Demand (previously presented in Section 3.1) to establish an alternative connection path. Then, another flit is sent from the input buffer through a bypass link and the router-to-router channel to the downstream node. If, at the downstream node, the flit is found to be not faulty by the ECC module, the Fault-manager concludes that the fault is in the crossbar, which is already handled by the BLoD mechanism. Therefore, the configuration of the BLoD is kept as the recovery. If the flit is still faulty, the fault belongs to the inter-router channel. In this situation, the BLoD is released for further fault-tolerance and the information of the faulty channel is sent to the routing module (in the LAFT algorithm). At the routing module, the Look-Ahead Fault-Tolerant routing algorithm uses the fault information to handle the channel's failure. The flit in the input buffer is re-routed via an alternative output port.

Fig. 6 Router-to-Router interfacing and DDRM scheme.

[...] traffic patterns as benchmarks. For synthetic benchmarks, we selected Transpose [11], Uniform [37], Matrix-multiplication [10,40], and Hotspot 10% [13]. For realistic benchmarks, we chose the H.264 video encoding system [33], Video Object Plane Decoder (VOPD), Picture In Picture (PIP) and Multiple Window Display (MWD) [9]. The simulation configurations are depicted in Table 2.

The above synthetic benchmarks help us understand the performance of the network under stress; however, we also need several realistic benchmarks to understand the network under real application traffic. Therefore, we built a simulator in Verilog-HDL which allows us to set up the traffic patterns from real applications. Based on the traffic patterns, the Network Interfaces send and receive packets over the networks. We select a video encoding system using an H.264 encoder, an MP3 encoder, and an OFDM [33]. Moreover, we select three applications from [9]: VOPD, PIP and MWD.

We evaluate the performance of our fault-tolerant models, which include the hard fault tolerance from 3D-FTO [1], the Soft-Error Tolerant OASIS system, and the proposed system (3D-FETO). We measure the average packet latency with the selected synthetic and realistic benchmarks. To understand the impact of fault-tolerance techniques on performance, we compare the obtained results with the baseline 3D-NoC system presented in [4]. We randomly inject faults at three fault rates: 10%, 20% and 33%. The faults are injected into the hard fault tolerant and soft error tolerant modules. For the soft error tolerant system, only soft errors are injected. For the hard fault tolerant (3D-FTO) system, only hard faults are injected. For the final system (3D-FETO), both soft errors and hard faults are injected.
Hard faults are injected at the beginning of the simulation, and their rate is measured as the percentage of routers with faults. Soft errors are injected during the system's operation, and their rate is the number of soft errors per clock cycle. The injected fault rates are considered individually for each error type.

5.2 Complexity Evaluation

In this evaluation, we consider the hardware complexity of the proposed SHER-3DR router. We use the NANGATE 45 nm technology library [27]. Area cost and power consumption analyses are performed with the Synopsys Design Compiler. The power consumption is analyzed based on the switching activity of the router under the uniform benchmark. We start by observing the additional hardware added to the baseline system when we employ the hard fault tolerance model (3D-FTO router). Then, we evaluate the impact of the soft error tolerant model (Soft Error Tolerant router). Finally, we evaluate the complete SHER-3DR system, which includes both the soft error and hard fault tolerance mechanisms. The configurations of the network are shown in Table 2, and the layout of a single SHER-3DR router is depicted in Fig. 8.

Table 3 reports the hardware complexity results of the SHER-3DR router in terms of area, power (static, dynamic, and total), and speed. In the hard fault tolerant router (3D-FTO), the area and power consumption have increased by 1.43% and 25.65%, respectively.

Algorithm 2: Fault Detection, Diagnosis and Recovery.

    // Automatic Retransmission Request
    Input: transmitting_flit
    // Transmitted Buffer Position
    Input: buffer_position
    // Control signal to all Fault-Tolerance modules
    Output: RAB_control, BLoD_control, LAFT_control

    // Transmit the flit, get the ECC's feedback
    Transmit(transmitting_flit);
    ECC_result = ECC-Decoder(transmitting_flit);
    // DETECTION PHASE:
    if ECC_result == ARQ then
        // Automatic Retransmission Request
        increase(ARQ_counter);
        ARQ(transmitting_flit);
    else
        // The transmitted flit is non faulty
        Finish;
    end
    // Check the number of consecutive ARQs
    if (ARQ_counter == [...]) then
        // There is a permanent fault
        // Jump to DIAGNOSIS-RECOVERY PHASE
    end
    // DIAGNOSIS-RECOVERY PHASE:
    // Start with Input Buffer Checking
    Buffer_Failure ← Buffer_Checking(buffer_position);
    if (Buffer_Failure == Yes) then
        // The Random Access Buffer receives the position to handle.
        RAB_control = buffer_position;
        Finish;
    else
        // The buffer slot is non faulty.
        // Move to Crossbar Checking: using a Bypass-Link.
        BLoD_control = enable;
        // Get the ECC's feedback and detect with ARQ_counter.
        if (ARQ_counter == [...]) then
            // BLoD cannot fix the fault, the link has failed.
            BLoD_control = release;
            // The LAFT routing algorithm handles the faulty link.
            LAFT_control = faulty;
            Finish;
        else
            // BLoD already fixed the failure, the recovery step is finished.
            Finish;
        end
    end
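The control flow of Algorithm 2 can be sketched in software as follows. This is only an illustrative behavioural model, not the authors' Verilog-HDL implementation: the names `Controls`, `delivered_before_threshold`, `detect_diagnose_recover`, the `send` and `buffer_is_faulty` callbacks, and the default `threshold` value are all assumptions introduced here (the ARQ threshold is not legible in the algorithm listing).

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Controls:
    """Control outputs of Algorithm 2; default values mean 'not asserted'."""
    rab_control: Optional[int] = None  # faulty slot handed to the Random Access Buffer
    blod_control: bool = False         # Bypass-Link-on-Demand enabled
    laft_control: bool = False         # link reported as faulty to LAFT routing


def delivered_before_threshold(send: Callable[[], bool], threshold: int) -> bool:
    """Detection phase: retransmit (ARQ) until the ECC accepts the flit.

    `send()` returns True when the ECC decoder accepts the flit and False
    when it requests a retransmission. The function returns False once
    `threshold` consecutive ARQs have been counted, which Algorithm 2
    treats as the sign of a permanent fault.
    """
    arq_counter = 0
    while arq_counter < threshold:
        if send():
            return True
        arq_counter += 1
    return False


def detect_diagnose_recover(
    send: Callable[[Controls], bool],
    buffer_is_faulty: Callable[[int], bool],
    buffer_position: int,
    threshold: int = 2,  # hypothetical value; not legible in the source listing
) -> Controls:
    controls = Controls()
    # Detection phase.
    if delivered_before_threshold(lambda: send(controls), threshold):
        return controls  # no error, or a soft error fixed by retransmission
    # Diagnosis-recovery phase: check the input buffer first.
    if buffer_is_faulty(buffer_position):
        controls.rab_control = buffer_position
        return controls
    # Buffer slot is healthy: try the bypass link around the crossbar.
    controls.blod_control = True
    if delivered_before_threshold(lambda: send(controls), threshold):
        return controls  # the bypass link fixed the failure
    # Bypass did not help: release it and let LAFT route around the link.
    controls.blod_control = False
    controls.laft_control = True
    return controls
```

For example, modelling a crossbar fault as `send = lambda c: c.blod_control` (transmission only succeeds once the bypass link is enabled) leaves `blod_control` asserted and `laft_control` cleared, whereas a `send` that always fails with a healthy buffer ends with `laft_control` asserted.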
The maximum speed has also slightly decreased. On the other hand, our soft error handling mechanism adds seven ARQ buffers and some combinational logic, which increase the area and power consumption more significantly. However, SHER-3DR introduces only 7.50% and 3.74% extra area and power consumption, respectively, when compared to the soft error tolerant model. In comparison to the baseline model, SHER-3DR increases the area and power consumption by 56.39% and 112.10%, respectively, while the maximum speed decreases by 33.70%.

The area cost and power consumption of the proposed router are given by Equation 1, where \pi_i represents the area cost or power consumption of module i. The SHER-3DR router consists of four main modules: input-ports, switch-allocator, crossbar, and fault manager.

\[ \pi_{router} = \pi_{input\text{-}ports} + \pi_{switch\text{-}allocator} + \pi_{crossbar} + \pi_{fault\text{-}manager} \tag{1} \]

The details of the input ports, the switch-allocator, and the crossbar are given in Equation 2.

\[
\begin{aligned}
\pi_{input\text{-}ports} &= \pi_{original\text{-}input\text{-}ports} + \pi_{RAB\text{-}controller} + \pi_{PCR\text{-}controller} + \pi_{ECC} \\
\pi_{switch\text{-}allocator} &= \pi_{original\text{-}switch\text{-}allocator} + \pi_{PCR\text{-}monitor} \\
\pi_{crossbar} &= \pi_{original\text{-}crossbar} + \pi_{bypass\text{-}links} + \pi_{ARQ\text{-}buffers}
\end{aligned}
\tag{2}
\]

We can observe the overheads in power consumption and area cost that are caused by the fault-tolerance mechanisms (RAB-controller, PCR-controller, ECC, BLoD, ARQ buffers). Figure 7 provides the evaluation results of the power consumption and area cost of SHER-3DR. In terms of area cost, the input ports account for the majority with over 67%, followed by the crossbar (20%) and the switch allocator (9%). The fault manager, which supports DDRM, uses only about 4% of the overall area. In terms of power consumption, the input ports consume over 80% of the total. The fault manager module also causes only a small increase in power consumption (3%).

When compared to the baseline OASIS router, the proposed SHER-3DR consumes more power and occupies more area. As shown in Fig. 7, SHER-3DR increases the area and power of all three main modules (crossbar, input ports, and switch-allocator). The overhead can be analyzed with Equation 2, where additional modules are attached to support the fault-tolerance mechanisms.

Table 2 Simulation configurations.

    Parameter                  System                          Value
    Network Size (x × y × z)   Matrix                          6 × [...]
    [...]                                                      [...] + 10% for hotspot nodes
                               Others                          10 flits
    Flit Size                                                  44 bits
    Header Size                                                14 bits
    Payload Bits               Baseline, 3D-FTO                30 bits
                               Soft Error Tolerance, 3D-FETO   18 bits
    Parity Bits                Baseline, 3D-FTO                0 bits
                               Soft Error Tolerance, 3D-FETO   12 bits (2 × SECDED(22,16))
    Buffer Depth                                               4
    Switching                                                  Wormhole-like
    Flow-control                                               Stop-Go
    Routing                                                    LAFT

Table 3 Hardware complexity evaluation and comparison results.

    Model                         Area (µm²)   Power (mW)                        Speed (MHz)
                                               Static     Dynamic    Total
    Baseline LAFT router          18,873       5.1229     0.9429     6.0658      925.28
    3D-FTO router                 19,143       6.4280     1.1939     7.6219      909.09
    Soft Error Tolerance router   27,457       9.7314     2.6710     12.4024     625.00
    SHER-3DR                      29,516       10.0819    2.7839     12.8658     613.50

Fig. 7 Area cost and power consumption analysis: (a) normalized area cost and (b) normalized power consumption of 3D-FETO and the baseline, broken down into crossbar, input-port, switch allocation, and fault-manager.
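The overhead percentages quoted in the complexity discussion follow directly from the rows of Table 3. The short script below recomputes them as a sanity check; the dictionary keys and the helper name `overhead_percent` are illustrative choices made here, not names from the paper.

```python
# Values copied from Table 3: (area, total power in mW, maximum speed in MHz).
TABLE3 = {
    "baseline": (18873, 6.0658, 925.28),
    "3d_fto": (19143, 7.6219, 909.09),
    "soft_error": (27457, 12.4024, 625.00),
    "sher_3dr": (29516, 12.8658, 613.50),
}


def overhead_percent(model, reference):
    """Relative change of (area, power, speed) of `model` against `reference`, in percent."""
    return tuple(
        round(100.0 * (new - ref) / ref, 2)
        for new, ref in zip(TABLE3[model], TABLE3[reference])
    )


if __name__ == "__main__":
    comparisons = [
        ("3d_fto", "baseline"),
        ("sher_3dr", "soft_error"),
        ("sher_3dr", "baseline"),
    ]
    for model, reference in comparisons:
        area, power, speed = overhead_percent(model, reference)
        print(f"{model} vs {reference}: area {area:+.2f}%, "
              f"power {power:+.2f}%, speed {speed:+.2f}%")
```

A negative speed value corresponds to a drop in maximum frequency, so the SHER-3DR versus baseline comparison reproduces the reported 33.70% decrease.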

Although our proposed models are penalized in terms of area, power consumption, and maximum frequency due to the additional logic and registers that are necessary for the fault handling mechanisms, they provide improved resiliency against a significant number of soft errors and hard faults.

Fig. 8 Layout of a single SHER-3DR router for the 3D-FETO system. The SHER-3DR router was designed in Verilog-HDL and synthesized using a 45 nm technology library [27]. For the Through Silicon Via (TSV) integration, we used the FreePDK3D45 kit [28]. The SHER-3DR router is laid out on a 450 µm × [...] µm area, and the TSV array contains 208 TSVs.

[...] buffer and crossbar only), 3D-FTO has similar performance to the baseline system (LAFT-OASIS). In addition, we found that even at a 33% fault-rate, 3D-FTO increases the latency by only 1.71%, 11.38%, 8.79% and 13.73% for Transpose, Uniform, 6 × [...]ffers more of an impact at high error-rates (20% and 33%) since the flits encounter bottlenecks due to errors inside the input buffers. However, the proposed 3D-FETO model still works even at high fault-rates, while the baseline model collapses at a 5% error-rate.

We used the same benchmark programs to evaluate the soft error tolerant model. Since both the proposed Pipeline Computation Redundancy (PCR) mechanism and ECC require additional clock cycles, we can observe a significant effect on the average packet latency. For the 0%, 10%, 20% and 33% fault-rates, the Soft Error Tolerant model increases the average delay in the Transpose benchmark by 18.57%, 28.74%, 34.54% and 49.62%, respectively.

Finally, we evaluate the proposed 3D-FETO system with both soft error and hard fault handling schemes. As shown in Figs. 9 and 10, 3D-FETO has a significant impact on the average latency, which has mostly doubled for both realistic and synthetic benchmarks. At a 33% fault-rate, 3D-FETO's average packet latency increases by 78.44%, 50.73% and 67.18% for the Matrix, Uniform, and Transpose benchmarks, respectively.
The degradation is caused by both the soft error and hard fault tolerance mechanisms: (1) the ECC + ARQ and PCR both require additional re-transmission clock cycles; (2) the RAB and the LAFT routing algorithm may disable a part of the network, which causes congestion. However, the system still maintains the ability to work under an extremely high fault-rate (33% for hard faults and 33% for soft errors).

5.4 Throughput Evaluation

Figure 11 depicts the throughput evaluation with the adopted synthetic benchmarks. At a 0% error rate, 3D-FTO (hard fault tolerance) presents the best throughput, which matches the capacity of the baseline LAFT-OASIS. The Soft Error Tolerant OASIS and the proposed 3D-FETO have lower throughput due to their soft error tolerance mechanisms. When errors are injected into the system, we can observe a degradation in throughput. Thanks to the efficient hard fault tolerance scheme and the fault-tolerant routing algorithm, 3D-FTO at a 33% error-rate provides a slightly decreased throughput: 40.18%, 43.96%, 43.55% and 32.59% for Transpose, Matrix, Uniform, and Hotspot 10%, respectively. For the Soft Error Tolerant OASIS, the system requires re-transmission via the ARQ mechanism and re-execution for the soft error mechanism. Therefore, the throughput is degraded due to the extra clock cycles. The proposed 3D-FETO, which is a fusion of both the hard fault tolerance and soft error tolerance mechanisms, inherits both degradations; however, these systems provide the ability to handle up to a 33% error rate (the limitation of the soft error mechanism).

Table 4 Successful arrival-rate comparison results for a 5 × [...] traffic.

    Algorithm / Fault-rate   1%     5%     10%    15%    20%
    XYZ                      91%    62%    41%    28%    23%
    Hybrid-XYZ               99%    83%    62%    44%    36%
    8-RW                     100%   95%    85%    69%    59%
    Odd-Even                 96%    85%    67%    53%    43%
    Hybrid-Odd-Even          100%   94%    83%    70%    61%
    4N-FIRST                 98%    89%    72%    68%    46%
    4NP-FIRST                97%    98%    95%    83%    76%
    LAFT-OASIS               100%   100%   99%    98%    95%
    3D-FETO                  100%   100%   99%    99%    97%

This subsection presents the reliability evaluation of the proposed 3D-FETO system over several hard fault and soft error injection rates. For comparison, seven systems adopting different routing algorithms are selected [30]: XYZ, Hybrid-XYZ, 8-Random-Walk (8-RW), Odd-Even, Hybrid-Odd-Even, 4N-First, and 4NP-First. Among these algorithms, we find deterministic 3D routing algorithms, fault-tolerant 2D algorithms that were extended to the third dimension, and turn-model based schemes that were proposed for fault-tolerant 3D-NoC systems. We adopted the same simulation environment and assumptions made in [30], from which the arrival-rate results were also obtained. For a fair comparison, we assume that faults can occur at any link with LAFT; thus, we eliminate the two assumptions that are necessary for the algorithm to work efficiently: (1) the links connecting the PE to the local input and output ports are always non-faulty; (2) there exists at least one non-faulty path between a (source, destination) pair. Moreover, we also evaluate the arrival-rate of our final system with the enhancements by Random-Access-Buffer and Bypass-Link-on-Demand. Instead of only distributing faults on the inter-router channels, they are randomly assigned to input buffers, crossbars, or inter-router channels.

Table 4 and Table 5 depict the arrival-rate results for a 5 × [...] traffic. This is in contrast with the remaining algorithms, whose reliability degrades considerably with this application. In fact, the combination of look-ahead routing and the path prioritization using the diversity value in LAFT significantly increases the probability for packets to find non-faulty paths to reach their destinations.

The arrival rates of the proposed 3D-FETO are at least 97% in the worst case (20% fault-rate), while LAFT-OASIS's arrival rates are 95% and 96%. At the other fault rates, 3D-FETO demonstrates its high reliability with an arrival-rate of over 98%. When we analyzed the possible causes for the failing 5%, we observed cases where all the connecting links of a given router are faulty: for example, the East, North, and Up links of the bottom-left router of the network are broken. Thus, the router cannot receive any flit from the network or inject any flit into it. Another failure case manifests when the link connecting the router to the attached PE is faulty. As expected, these two cases justify the two assumptions that we previously made to ensure the efficiency of LAFT's fault-tolerance capabilities.

Fig. 9 Average packet latency evaluation of the synthetic benchmarks: (a) Transpose, (b) Uniform, (c) Matrix, (d) Hotspot. Each panel plots the average latency (cycles/packet) against the fault rate (0%, 10%, 20%, 33%) for Baseline LAFT-OASIS, Hard Fault Tolerant OASIS, Soft Error Tolerant OASIS, and 3D-FETO.

Fig. 10 Average packet latency evaluation of the realistic benchmarks: (a) H.264, (b) PIP, (c) MWD, (d) VOPD. Each panel plots the average latency (cycles/packet) against the fault rate (0%, 10%, 20%, 33%) for Baseline LAFT-OASIS, Hard Fault Tolerant OASIS, Soft Error Tolerant OASIS, and 3D-FETO.

Table 5 Successful arrival-rate comparison results for a 5 × [...] traffic.

    Algorithm / Fault-rate   1%     5%     10%    15%    20%
    XYZ                      85%    46%    31%    14%    11%
    Hybrid-XYZ               99%    68%    42%    25%    20%
    8-RW                     93%    82%    62%    44%    36%
    Odd-Even                 97%    84%    53%    42%    32%
    Hybrid-Odd-Even          99%    92%    77%    62%    53%
    4N-FIRST                 96%    86%    68%    50%    37%
    4NP-FIRST                100%   97%    89%    75%    63%
    LAFT-OASIS               100%   100%   100%   99%    96%
    3D-FETO                  100%   100%   100%   99%    98%
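The first residual failure case discussed in the reliability evaluation (a router whose inter-router links are all faulty, such as a corner router that loses its East, North, and Up links) can be illustrated with a small check on a 3D mesh. This is a hedged sketch only: the mesh dimensions, the axis orientation, and the names `DIRECTIONS`, `neighbours`, and `is_isolated` are assumptions made for illustration and are not taken from the paper's simulator.

```python
# Assumed axis orientation: East/West on x, North/South on y, Up/Down on z.
DIRECTIONS = {
    "East": (1, 0, 0), "West": (-1, 0, 0),
    "North": (0, 1, 0), "South": (0, -1, 0),
    "Up": (0, 0, 1), "Down": (0, 0, -1),
}


def neighbours(router, dims):
    """Yield (direction, neighbour) pairs that exist inside a mesh of size `dims`."""
    for name, delta in DIRECTIONS.items():
        candidate = tuple(c + d for c, d in zip(router, delta))
        if all(0 <= c < size for c, size in zip(candidate, dims)):
            yield name, candidate


def is_isolated(router, dims, faulty_links):
    """True when every inter-router link of `router` is in `faulty_links`.

    Links are undirected and stored as frozensets of two router coordinates.
    """
    return all(
        frozenset((router, other)) in faulty_links
        for _, other in neighbours(router, dims)
    )


if __name__ == "__main__":
    dims = (3, 3, 3)      # hypothetical mesh size, chosen only for the example
    corner = (0, 0, 0)    # "bottom-left" router: only East, North, Up exist
    broken = {frozenset((corner, n)) for _, n in neighbours(corner, dims)}
    print(sorted(name for name, _ in neighbours(corner, dims)))
    print(is_isolated(corner, dims, broken))
```

In such a situation no routing algorithm can deliver packets to or from the isolated router, which is why this case bounds the achievable arrival rate.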

Fig. 11 Throughput evaluation of the synthetic benchmarks: (a) Transpose, (b) Uniform, (c) Matrix, (d) Hotspot. Each panel plots the throughput (flits/node/cycle) against the fault rate (0%, 10%, 20%, 33%) for Baseline LAFT-OASIS, Hard Fault Tolerant OASIS, Soft Error Tolerant OASIS, and 3D-FETO.

Besides the arrival rate evaluation, we assessed our fault-tolerant system in terms of Mean Time To Failure (MTTF) improvement. We define a system as healthy if it operates correctly (100% arrival rate, accurate fault detection and recovery function). Otherwise, the system is marked as failed. To obtain more precise results, we use the net-list (gate-level) models from the complexity evaluation. Moreover, faults are injected not only into the fault-tolerance modules but also into other modules (controller, management module). Before the MTTF assessment, we first assume that the original system has a natural fault rate \lambda_{raw}. The MTTF value is then given as follows.

\[ MTTF_{raw} = \frac{1}{\lambda_{raw}} \tag{3} \]

To measure the MTTF value of the fault-tolerant system, we use a Monte-Carlo based simulation as shown in Figure 12. At the beginning of the simulation, we define the number of experiments (N), the fault models, and the distribution mechanisms. Faults are generated in two types: soft errors (which occur randomly within a clock period) and hard faults (which persist from the beginning to the end of the experiment). There are also two fault models: stuck-at "0" and stuck-at "1". Faults are injected into dedicated gates which are selected by a random generator. We use two distributions: (1) flat: faults are randomly injected into any gate inside a router; (2) weighted: more than 80% of the faults are injected into the fault-tolerant modules (buffer, crossbar, next-port-computing, switch-allocator). For each experiment i, we inject faults and examine the correctness of the system (data accuracy, fault-tolerance configurations). Faults are injected until the system is determined to have failed. At the end of an experiment, the number of faults is recorded for the final process. To calculate the MTTF value of a system, the average number of faults is used in the following equation.

\[ MTTF_{system} = \frac{\sum f_i \times MTTF_{raw}}{N} \tag{4} \]

In order to understand the efficiency of the fault tolerance, the ratio of the two MTTF values is used, as in Equation 5.

\[ Improvement_{MTTF} = \frac{MTTF_{fault\text{-}tolerant}}{MTTF_{original}} \tag{5} \]

Because the raw fault rate depends on the technology parameters and the operating conditions, determining it would require a highly complex evaluation. To alleviate this complexity, we assume that the fault-tolerant and original systems have a similar raw fault rate. Therefore, the MTTF improvement can be obtained by Equation 6, where AFTF is the average number of faults to failure.

\[ Improvement_{MTTF} = \frac{AFTF_{fault\text{-}tolerant}}{AFTF_{original}} \tag{6} \]

Table 6 shows the average number of faults to failure after 1000 simulations. The test scheme is built to functionally verify the data communication and the fault-tolerance mechanisms. With the flat distribution, the proposed SHER-3DR enhances the MTTF for hard faults and soft errors by 1.93 and 1.49 times, respectively. With the weighted distribution, the proposal shows a larger improvement since the faults are concentrated on the fault-tolerant modules: SHER-3DR's hard fault MTTF is 2.96 times better than that of the baseline OASIS router, and its soft error MTTF is 5.32 times better than that of the original router. In conclusion, we observe a significant improvement in terms of MTTF from our proposed mechanism. Along with the high arrival rates, this demonstrates the reliability enhancement of our system.

In this paper, we proposed a comprehensive fault-tolerant 3D-Network-on-Chip (3D-NoC) system architecture for highly-reliable many-core Systems-on-Chips (SoCs), named 3D-FETO. The proposed system is based on two approaches.

Fig. 12 MTTF simulation methodology. Flowchart steps: define the total number of experiments (N); identify the random parameters of the system; assume appropriate distributions for the parameters; initialize the counter i = 1; generate a uniformly distributed number for each experiment i; generate the random variables according to the system's distribution; evaluate the system using the set of random numbers; determine whether the system is a success or a failure while updating the fault count f_i; if i ≠ N, set i = i + 1 and repeat; otherwise, calculate the system MTTF as MTTF = (Σ f_i × MTTF_raw) / N.
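The Monte-Carlo procedure of Fig. 12 can be sketched with a deliberately simplified stand-in for the gate-level router model. Everything specific in the sketch is an assumption made for illustration: each injected fault is survived with a fixed probability `p_survive` (a toy abstraction of fault masking and recovery), the count f_i includes the fault that finally causes the failure, and the function names are hypothetical. Only the structure (N experiments, faults injected until failure, averaging f_i, then Equations 4 and 6) follows the text.

```python
import random
from statistics import mean


def faults_to_failure(p_survive, rng):
    """One experiment: inject faults until the system fails; return f_i."""
    faults = 0
    while True:
        faults += 1
        # Toy system model: the fault is tolerated with probability p_survive.
        if rng.random() >= p_survive:
            return faults


def average_faults_to_failure(p_survive, experiments, seed=0):
    """AFTF: average of f_i over N independent experiments."""
    rng = random.Random(seed)
    return mean(faults_to_failure(p_survive, rng) for _ in range(experiments))


def mttf_system(aftf, mttf_raw):
    """Equation 4, written as (sum of f_i / N) * MTTF_raw."""
    return aftf * mttf_raw


def mttf_improvement(aftf_fault_tolerant, aftf_original):
    """Equation 6: ratio of average faults to failure."""
    return aftf_fault_tolerant / aftf_original


if __name__ == "__main__":
    n = 1000  # the paper reports 1000 simulations
    original = average_faults_to_failure(p_survive=0.5, experiments=n)
    tolerant = average_faults_to_failure(p_survive=0.8, experiments=n)
    print(f"AFTF original: {original:.3f}")
    print(f"AFTF fault-tolerant: {tolerant:.3f}")
    print(f"MTTF improvement: {mttf_improvement(tolerant, original):.2f}x")
```

Because the same raw fault rate is assumed for both systems, `MTTF_raw` cancels in the ratio, which is why Equation 6 only needs the two AFTF values.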

Table 6 Average number of faults to failure.

    Fault-Type   Distribution   Baseline router   SHER-3DR router   MTTF Improvement
    Hard Fault   Flat           2.37              4.58              1.93
                 Weighted       2.055             6.085             2.96
    Soft Error   Flat           17.928            26.770            1.49
                 Weighted       4.037             21.492            5.32
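The MTTF improvement column of Table 6 is the ratio defined in Equation 6, so it can be recomputed from the two AFTF columns. The snippet below does this as a consistency check; the tuple layout and the name `improvement` are choices made here for illustration.

```python
# Rows of Table 6: (fault type, distribution, baseline AFTF, SHER-3DR AFTF, reported ratio).
TABLE6 = [
    ("Hard Fault", "Flat", 2.37, 4.58, 1.93),
    ("Hard Fault", "Weighted", 2.055, 6.085, 2.96),
    ("Soft Error", "Flat", 17.928, 26.770, 1.49),
    ("Soft Error", "Weighted", 4.037, 21.492, 5.32),
]


def improvement(aftf_baseline, aftf_sher_3dr):
    """Equation 6: AFTF of the fault-tolerant router over AFTF of the original."""
    return aftf_sher_3dr / aftf_baseline


if __name__ == "__main__":
    for fault_type, distribution, base, sher, reported in TABLE6:
        ratio = improvement(base, sher)
        print(f"{fault_type:10s} {distribution:8s} "
              f"computed {ratio:.2f}, reported {reported:.2f}")
```

All four computed ratios agree with the reported values to two decimal places.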

First, a comprehensive mechanism to handle both soft errors and hard faults in a 3D-NoC router is proposed. The hard fault support is achieved by leveraging reconfigurable components to handle permanent faults in links, input buffers, and crossbars, while soft error tolerance is obtained via efficient and light-weight software redundancy that enables fault recovery in the router pipeline stages. In the second approach, the system supports a detection, diagnosis and recovery technique which makes it independent of the complex and costly testing mechanisms commonly found in conventional systems.

Through extensive evaluation, we showed that the proposed 3D-FETO was able to recover efficiently from a significant number of soft errors and hard faults at different fault-rates, reaching up to 33%. This means that 3D-FETO can provide up to a 98% packet arrival rate even when almost one-third of its components have failed. Despite the performance degradation and hardware complexity penalty, we consider this overhead acceptable, because the system remains functional at high fault rates where previously proposed systems fail to deliver packets. As reliability constitutes one of the main challenges in future SoC design, we demonstrated that the proposed 3D-FETO can be used as a reliable and independent system capable of ensuring fault resiliency in worst-case scenarios, and that it can be adopted for mission-critical applications where correct data delivery is essential.

As future work, we are planning to investigate the faults within Through-Silicon-Vias of 3D-ICs / [...] efficient fault-tolerance method for 3D-NoC systems. Moreover, the degradation factors of the reliability, such as thermal stress, operating voltages, and design characteristics, should also be studied.

Acknowledgements

This work is partially supported by Competitive Research Funding (CRF), The University of Aizu, Reference P-11 (2016), and JSPS KAKENHI Grant Number JP30453020. This work is also supported by VLSI Design and Education Center (VDEC), the University of Tokyo, Japan, in collaboration with Synopsys, Inc. and Cadence Design Systems, Inc. The first and the last authors in the author list are the main contributors of this work.

References

1. Ahmed, A.B., Abdallah, A.B.: Adaptive fault-tolerant architecture and routing algorithm for reliable many-core 3D-NoC systems. Journal of Parallel and Distributed Computing, 30–43 (2016)
2. Ben Abdallah, A.: Multicore Systems-on-Chip: Practical Hardware/Software Design, 2nd Edition. Atlantis (2013)
3. Ben Abdallah, A., Masahiro, S.: Basic Network-on-Chip Interconnection for Future Gigascale MCSoCs Applications: Communication and Computation Orthogonalization. In: Proc. of the Symposium on Science, Society, and Technology (JASSST2006), pp. 1–7 (2006)
4. Ben Ahmed, A., Ben Abdallah, A.: LA-XYZ: low latency, high throughput look-ahead routing algorithm for 3D network-on-chip (3D-NoC) architecture. In: IEEE 6th International Symposium on Embedded Multicore Socs (MCSoC), pp. 167–174. IEEE (2012)
5. Ben Ahmed, A., Ben Abdallah, A.: Low-overhead Routing Algorithm for 3D Network-on-Chip. In: Third International Conference on Networking and Computing (ICNC), pp. 23–32 (2012)
6. Ben Ahmed, A., Ben Abdallah, A.: Architecture and design of high-throughput, low-latency, and fault-tolerant routing algorithm for 3D-network-on-chip (3D-NoC). The Journal of Supercomputing (3), 1507–1532 (2013)
7. Ben Ahmed, A., Ben Abdallah, A.: Graceful deadlock-free fault-tolerant routing algorithm for 3D Network-on-Chip architectures. Journal of Parallel and Distributed Computing (4), 2229–2240 (2014)
8. Bertozzi, D., Benini, L., De Micheli, G.: Error control schemes for on-chip communication links: the energy-reliability tradeoff. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (6), 818–831 (2005)
9. Bertozzi, D., Jalabert, A., Murali, S., Tamhankar, R., Stergiou, S., Benini, L., De Micheli, G.: NoC synthesis flow for customized domain specific multiprocessor systems-on-chip. IEEE Transactions on Parallel and Distributed Systems (2), 113–129 (2005)
10. Chen, P., Dai, K., Wu, D., Rao, J., Zou, X.: The parallel algorithm implementation of matrix multiplication based on ESCA. In: IEEE Asia Pacific Conference on Circuits and Systems (APCCAS), pp. 1091–1094. IEEE (2010)
11. Chien, A.A., Kim, J.H.: Planar-adaptive routing: low-cost adaptive networks for multiprocessors. Journal of the ACM (JACM) (1), 91–123 (1995)
12. Constantinides, K., Plaza, S., Blome, J., Zhang, B., Bertacco, V., Mahlke, S., Austin, T., Orshansky, M.: Bulletproof: A defect-tolerant CMP switch architecture. In: The Twelfth International Symposium on High-Performance Computer Architecture, pp. 5–16. IEEE (2006)
13. Dally, W.J., Towles, B.P.: Principles and practices of interconnection networks. Elsevier (2004)
14. Dang, K.N., Meyer, M., Okuyama, Y., Tran, X.T., Ben Abdallah, A.: A soft-error resilient 3d network-on-chip router. In: IEEE 7th International Conference on Awareness Science and Technology (iCAST), pp. 84–90 (2015)
15. DeOrio, A., Fick, D., Bertacco, V., Sylvester, D., Blaauw, D., Hu, J., Chen, G.: A reliable routing architecture and algorithm for NoCs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (5), 726–739 (2012)
16. Dixit, A., Wood, A.: The impact of new technology on soft error rates. In: 2011 International Reliability Physics Symposium, pp. 5B.4.1–5B.4.7 (2011)
17. Eghbal, A., Yaghini, P.M., Bagherzadeh, N., Khayambashi, M.: Analytical Fault Tolerance Assessment and Metrics for TSV-based 3D Network-on-Chip. IEEE Transactions on Computers (12), 3591–3604 (2015)
18. Ernst, D., Kim, N.S., Das, S., Pant, S., Rao, R., Pham, T., Ziesler, C., Blaauw, D., Austin, T., Flautner, K., et al.: Razor: A low-power pipeline based on circuit-level timing speculation. In: Proceedings 36th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-36), pp. 7–18. IEEE (2003)
19. Fick, D., DeOrio, A., Chen, G., Bertacco, V., Sylvester, D., Blaauw, D.: A highly resilient routing algorithm for fault-tolerant NoCs. In: 2009 Design, Automation Test in Europe Conference Exhibition, pp. 21–26 (2009)
20. Hernández, C., Silla, F., Santonja, V., Duato, J.: Dealing with variability in NoC links. In: 2nd Workshop on Diagnostic Services in Network-on-Chips, pp. 4–10 (2008)
21. Hsiao, M.Y.: A class of optimal minimum odd-weight-column SEC-DED codes. IBM Journal of Research and Development (4), 395–401 (1970)
22. ITRS: 2012 Edition Update Process Integration, Devices, and Structures. Tech. rep., The International Technology Roadmap for Semiconductor (2012). (accessed 16.06.16)
23. Karl, E., Blaauw, D., Sylvester, D., Mudge, T.: Reliability Modeling and Management in Dynamic Microprocessor-based Systems. In: Proceedings of the 43rd Annual Design Automation Conference, DAC '06, pp. 1057–1060. ACM, New York (2006)
24. Lehtonen, T., Liljeberg, P., Plosila, J.: Online reconfigurable self-timed links for fault tolerant NoC. VLSI Design, 1–13 (2007)
25. Lehtonen, T., Wolpert, D., Liljeberg, P., Plosila, J., Ampadu, P.: Self-adaptive system for addressing permanent errors in on-chip interconnects. IEEE Transactions on Very Large Scale Integration (VLSI) Systems (4), 527–540 (2010)
26. Lin, S., Costello, D., Miller, M.: Automatic-repeat-request error-control schemes. IEEE Communications Magazine (12), 5–17 (1984)
27. NanGate Inc.: Nangate Open Cell Library 45 nm. (accessed 16.06.16)
28. NCSU Electronic Design Automation: FreePDK3D45 3D-IC process design kit. (accessed 16.06.16)
29. Parikh, R., Bertacco, V.: Formally Enhanced Runtime Verification to Ensure NoC Functional Correctness. In: Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-44, pp. 410–419. ACM, New York (2011)
30. Pasricha, S., Zou, Y.: A low overhead fault tolerant routing scheme for 3D Networks-on-Chip. In: 12th International Symposium on Quality Electronic Design (ISQED), pp. 1–8. IEEE (2011)
31. Prodromou, A., Panteli, A., Nicopoulos, C., Sazeides, Y.: NoCAlert: An On-Line and Real-Time Fault Detection Mechanism for Network-on-Chip Architectures. In: Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 60–71 (2012)
32. Radetzki, M., Feng, C., Zhao, X., Jantsch, A.: Methods for fault tolerance in networks-on-chip. ACM Computing Surveys (CSUR) (1), 8 (2013)
33. Rahmani, A.M., Vaddina, K.R., Latif, K., Liljeberg, P., Plosila, J., Tenhunen, H.: High-performance and fault-tolerant 3D noc-bus hybrid architecture using arb-net-based adaptive monitoring platform. IEEE Transactions on Computers (3), 734–747 (2014)
34. Ravindan, D.K.: Structural fault-tolerance on the noc circuit level. Tech. rep., Institut fur Technische Informatik, Universitat Stuttgart (2009)
35. Shamshiri, S., Cheng, K.T.: Yield and cost analysis of a reliable noc. In: 27th IEEE VLSI Test Symposium, pp. 173–178. IEEE (2009)
36. Shamshiri, S., Ghofrani, A.A., Cheng, K.T.: End-to-end error correction and online diagnosis for on-chip networks. In: IEEE International Test Conference (ITC), pp. 1–10. IEEE (2011)
37. Sivaram, R.: Queuing delays for uniform and nonuniform traffic patterns in a MIN. ACM SIGSIM Simulation Digest (1), 17–27 (1992)
38. Yu, Q., Ampadu, P.: Transient and permanent error co-management method for reliable networks-on-chip. In: Fourth ACM/IEEE International Symposium on Networks-on-Chip (NOCS), pp. 145–154. IEEE (2010)
39. Yu, Q., Zhang, M., Ampadu, P.: Addressing network-on-chip router transient errors with inherent information redundancy. ACM Transactions on Embedded Computing Systems (TECS) 12