Soft Errors Detection and Automatic Recovery based on Replication combined with different Levels of Checkpointing
Diego Montezanti, Enzo Rucci, Armando De Giusti, Marcelo Naiouf, Dolores Rexachs, Emilio Luque
III-LIDI, Instituto de Investigación en Informática LIDI, Facultad de Informática, Universidad Nacional de La Plata, La Plata, Buenos Aires, Argentina, {dmontezanti,erucci,degiusti,mnaiouf}@lidi.info.unlp.edu.ar
CAOS, Computer Architecture and Operating Systems, Universidad Autónoma de Barcelona, Barcelona, Spain, {dolores.rexachs,emilio.luque}@uab.es

June 8, 2020

This is the accepted version of the manuscript that was sent to review to Future Generation Computer Systems (ISSN 0167-739X). The manuscript was accepted for publication on July 3rd, 2020, and its final published version is available online at https://doi.org/10.1016/j.future.2020.07.003.

Abstract

Handling faults is a growing concern in HPC. In future exascale systems, it is projected that silent undetected errors will occur several times a day, increasing the occurrence of corrupted results. In this article, we propose SEDAR, a methodology that improves system reliability against transient faults when running parallel message-passing applications. Our approach, based on process replication for detection, combined with different levels of checkpointing for automatic recovery, has the goal of helping users of scientific applications to obtain executions with correct results. SEDAR is structured in three levels: (1) detection only and safe-stop with notification; (2) recovery based on multiple system-level checkpoints; and (3) recovery based on a single valid user-level checkpoint. As each of these variants supplies a particular coverage but involves limitations and implementation costs, SEDAR can be adapted to the needs of the system. In this work, a description of the methodology is presented and the temporal behavior of employing each SEDAR strategy is mathematically described, both in the absence and presence of faults. A model that considers all the fault scenarios on a test application is introduced to show the validity of the detection and recovery mechanisms. An overhead evaluation of each variant is performed with applications involving different communication patterns; this is also used to extract guidelines about when it is beneficial to employ each SEDAR protection level. As a result, we show its efficacy and viability to tolerate transient faults in target HPC environments.
Keywords: soft error detection, automatic recovery, system-level checkpoint, user-level checkpoint
1 Introduction

In the area of High-Performance Computing (HPC), parallel systems continue increasing the number of components to improve their performance and, as a consequence, ensuring their reliability has become a critical issue. Nowadays, fault rates involve just a few hours on modern platforms [1], but it is forecasted that large parallel applications will have to manage fault rates of barely a few minutes in exascale supercomputers [2]. In that sense, these applications require some help to progress efficiently.

Currently, there are different types of transient errors affecting parallel programs; Silent Data Corruption (SDC) is the most dangerous of these, as several recent reports have stated [3]. When SDC occurs, the application seems to run correctly but, at the end, the results are incorrect. Science is one of the areas strongly affected by SDC, since historically it has relied on large-scale simulations. Therefore, the treatment of silent errors is one of the greatest challenges in current and future resilience.

Without a fault tolerance mechanism, a whole application could misbehave due to a failure affecting just one task. Even worse, the program could output invalid results that, in the best-case scenario, will be noticed when the execution is concluded; as a consequence, silent errors require a detection mechanism [4]. In addition, a single SDC can cause deep effects, propagating across all processes that communicate in message-passing applications [5]. One way to tolerate transient faults is to rely on hardware redundancy (registers and processor arithmetic logic units). Nevertheless, this approach is highly expensive and difficult to implement [6]. Given the high cost of re-running the application from the start if a fault is detected, specific software strategies are required to reach a suitable cost-benefit trade-off.

As opposed to silent errors, fail-stop failures cause a process to crash, making their detection almost immediate. A common, well-studied technique to reduce their impact consists of Checkpoint-based Rollback recovery (C/R) [7]. When coordinated checkpointing is used, the entire state of an application is saved in a periodic manner. So, if a failure takes place, all the processes can restart from their saved checkpoints. Instead, if uncoordinated checkpointing is used, only the state of the process being checkpointed is dumped. Unfortunately, C/R can be time-consuming and the overhead increases as the number of cores grows. Nonetheless, despite being effective when dealing with fail-stop errors, C/R shows weakness when facing silent errors. Since a stored checkpoint could contain undetected corruption, C/R cannot guarantee a correct recovery. This situation becomes aggravated in the case of strongly coupled computation, since an error in one node could propagate to the others in microseconds [8].

Performing redundant software execution is a common way to provide resilience. Following the state machine replication approach, a process is duplicated and both copies proceed with the same execution sequence. As a result, for deterministic applications, they produce the same output for the same input [9]; these two outputs can be compared to provide error detection, despite not being enough for recovery.
In the HPC context, using multicore architectures represents a viable solution for detecting SDCs as a result of their intrinsic natural redundancy [1].

Considering these circumstances of non-reliable results and their expensive verification, this paper presents SEDAR, a methodology designed to provide transient fault tolerance for scientific message-passing parallel applications that execute in multicore clusters. SEDAR seeks to help programmers and users of parallel scientific applications to accomplish reliability in their executions. It works as a static library that is compiled with the application. Even though this changes the model of execution, it is still almost transparent to the algorithm, as opposed to specific detectors that force modifying it and do not cover all faults [4, 10]. Following this methodology, each process of the parallel application gets replicated and both copies execute on different cores of the same socket, taking advantage of the multicore's intrinsic hardware redundancy. SEDAR can detect and recover from all transient faults that cause SDC and TOE (Time Out Errors). Three different ways are provided by SEDAR so it can achieve full silent error coverage: (1) only detection with notification; (2) recovery based on multiple system-level checkpoints; and (3) recovery utilizing a single safe application-level checkpoint. Each of these alternatives has particular features and provides a different cost-performance trade-off.

A preliminary, more conceptual version of this work is available in [11]. This article fully describes and validates the methodology, extending the insights already offered in the previous version, with the following new contributions:

• The introduction of an analytical model to verify the efficacy both of the detection strategy and of the recovery mechanism based on multiple, system-level coordinated checkpoints. The model is based on the predictability of the data affected in a well-known test application (considering its computation and communication stages), so the consequences of each fault occurrence can be anticipated.

• The design of a complete workfault (i.e. a set of representative fault cases that emulate real faults experienced by the system, which we use as workload for testing purposes) that permits a grouping of all possible faulty situations in scenarios. This set includes the effects of the faults, their detection latency and their recovery point.

• The implementation of the multiple-checkpoint-based recovery algorithm, using the DMTCP library. To verify its operation, SEDAR has been incorporated into the test application. The empirical validation has been achieved through controlled fault injection.

• The evaluation of the overhead of each alternative SEDAR strategy. For this purpose, we have attached SEDAR to three parallel benchmarks with different communication patterns and workload demands, thus measuring or estimating the execution parameters for each of them.

• The introduction of a function that describes the average execution time of using each SEDAR strategy.

• A qualitative evaluation of the influence of the communication pattern on temporal behavior.

• A discussion about the convenience of saving multiple checkpoints for recovery, compared to just employing the detection mechanism. In addition, a brief analysis of how to determine the best moment to start protection has been conducted.

The remainder of the paper is organized as follows: Section 2 reviews some basic concepts and related work.
Section 3 describes the proposed methodology from a functional point of view, separating it into the three aforementioned strategies. Section 4 presents the evaluation of the recovery strategy, in addition to a discussion about the convenience of its utilization and adaptation, considering the temporal behavior of the possible variants. Finally, in Section 5, the conclusions and future lines of work are detailed.
2 Background and Related Work

Transient faults can be classified according to their consequences on the program execution [12]. A Latent Error (LE) is a non-harmful fault, since it does not affect the final results; this kind of error alters data that are not used anymore. On the contrary, when a Detected Unrecoverable Error (DUE) occurs, a program ends suddenly; the system software can be aware of DUEs but cannot recover from them. In the case of a Time Out Error (TOE), the application does not finish within a stipulated time range. Finally, as mentioned before, when SDC takes part, the application seems to execute correctly, although invalid results are produced. Particularly, in message-passing parallel programs, SDC can be sub-classified into two different types of errors according to [13]: (1) Transmitted Data Corruption (TDC) alters data to be sent by a process, which will propagate to others if it is not detected; (2) Final Status Corruption (FSC) has an effect on non-communicated data, spreading the error in a local manner and invalidating the final results.

Nowadays, there is no silver bullet to manage frequent SDC. Some available algorithmic solutions only apply to specific kernels [14], decreasing the cost of error detection; this kind of solution can be used in HPC environments [10]. Among them, some well-known methods like ABFT [10] can detect up to a maximum number of errors in linear-algebra problems. However, each kernel requires an ad-hoc implementation, which represents a lot of work for large HPC software. Moreover, algorithm-based solutions are more intrusive, as they modify the algorithms [4][14][15]. On the contrary, compiler or runtime software-based detection proposals are more general, since they can be applied to any code, but at the cost of a significant increase in complexity. In addition, containment strategies seek to reduce the fault consequences, either stopping its propagation to the other nodes or to the data stored in checkpoints [7]. In [16], the authors increase availability and offer a trade-off between the number and the quality of components through redundancy in HPC systems. In the same way, [17] showed that the replication strategy is more efficient than C/R under circumstances of high error rate and large values of C/R overhead.

Traditionally, SDC detection has been achieved through execution replication combined with partial or total result comparison during the execution. Software-redundancy solutions remove the need for expensive hardware, performing replication at the level of threads [18], processes [6] or machine status. On the other hand, other proposals require fewer resources but at the cost of reducing their accuracy. One of them is approximate replication, which implements upper and lower limits for computation results [7]. MR-MPI [16] employs a transparent redundancy approach for HPC. It proposes a partial process replication and can be used together with C/R in non-replicated processes [19]. rMPI [17] takes on failures by redundant execution of MPI applications. Using this protocol, the program fails only if two corresponding replicas fail, because each node is duplicated. The probability of simultaneous failure of a node and its replica decreases when the number of nodes increases, so redundancy scales. However, this benefit requires duplicating the number of resources and quadrupling the number of messages. Faults in shared-memory systems are explored in [9], where a scheme based on multi-threaded processes is proposed, including non-determinism management due to memory accesses.
A protocol for hybrid task-parallel MPI programs is described in [1], which carries out recovery based on uncoordinated checkpoints and message logging. Only the task that presented the error is restarted, and all the MPI calls are handled inside it. RedMPI [5] is an MPI library that exploits rMPI's per-process replication to detect and correct SDC, comparing the messages sent by replicated issuers at the receiver side. It avoids sending all messages and comparing their entire contents through a hashing-based optimization. In addition, it does not require source code modification and guarantees determinism among replicated processes. As it offers protection even when high failure rates occur, RedMPI is a potential alternative to be used on large-scale systems. It is shown that even a single transient error can produce deep effects on the program, causing a corruption pattern that cascades toward all other processes through MPI messages. In a similar way to SEDAR, RedMPI also enables replica mapping on the same physical node as the native processes, or in neighbors with lower network latency. Like our proposal, it monitors communications as a strategy that attempts to provide the correct output. Detection is delayed until transmission but, as opposed to SEDAR, validation is carried out on the receiver side. This produces an additional overhead, as well as latency and network congestion, that are not present in our solution. Fault-tolerant protocols for other parallel programming models, such as PGAS [20], have also been explored. The combination of checkpointing the output of tasks and replicating for application-specific detection is explored in [2] for a linear workflow context, in the presence of both fail-stop and silent faults. Finally, in a recent study, the authors of [21] explore the combination of replication with checkpointing for fail-stop errors, and compute the optimal checkpoint interval for this approach.
3 The SEDAR Methodology

The following subsections are dedicated to the description of the basics of the different proposed options to accomplish transient fault tolerance. In addition, an evaluation of the temporal behavior of implementing each specific feature is also included. In that sense, a simple model has been developed which considers the factors that influence the total execution time, both in the absence of faults and when a single silent error occurs during the execution. It is important to remark that SEDAR can functionally manage multiple fault occurrences [22]. However, the proposed performance model contemplates a single fault for the sake of clarity.

To evaluate the different strategies, a baseline is used, which consists of a manual method for ensuring reliable results. This method involves launching two simultaneous instances of the application and comparing the final results of both executions in a semi-automatic manner. In this way, the same computing resources that are consumed in our proposal are assigned to each individual instance (i.e. half of the total cores of the system), which is the fairest way to compare. In the absence of faults, the final results will match. However, if a transient fault occurs, a third re-execution (maintaining the same mapping) and a new comparison are required to pick the outputs of the runs that form a majority (using a voting mechanism) as the correct ones.

The time elapsed by this manual method in the absence of faults (fault absence, T_FA) is given by Equation 1. It consists of the time spent by the two independent instances to run simultaneously (T_prog) plus the time of comparing the results of the two executions. On the other hand, Equation 2 gives the time when a fault occurs (fault presence, T_FP), which adds the time of a new re-execution and a new comparison for voting, besides a restart time (T_rest) to relaunch the third run (which takes the same time as the previous ones because it uses the same mapping) after the two original ones. In Table 1, the parameters involved in all the equations and their meanings are summarized.

T_FA = T_prog + T_comp    (1)

T_FP = 2(T_prog + T_comp) + T_rest    (2)

Table 1: Name and meaning of the parameters involved in the temporal characterization of each alternative strategy

    T_prog    Execution time of two instances of the original application in parallel
    T_comp    Time of semi-automatic comparison of results; may include calculating a hash
    T_rest    Time of manually restarting the application. An automatic restart may take shorter; in the simplified model, both are considered equal
    f_d       Overhead factor due to the detection mechanism. It is application dependent and can be experimentally determined. 0 < f_d < 1
    X         Instant of fault detection, expressed as a fraction of the application progress. Random, 0 < X < 1
    n         Number of checkpoints made during the whole execution, given a checkpoint interval
    t_cs      Time involved in storing a system-level checkpoint
    t_i       Checkpoint interval. It can be adjusted to minimize overhead
    k         Number of additional checkpoints that the application needs to roll back to find a non-corrupted one. It depends on the application and the detection latency
    t_ca      Time involved in storing an application-level checkpoint. It should be shorter than t_cs
    T_compA   Time for validating an application-level checkpoint; may include calculating a hash
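As a quick numeric illustration of Equations 1 and 2 (with hypothetical values, not measurements from our experiments): take T_prog = 10 h and T_comp = T_rest = 1 min. Then

T_FA = 10 h + 1 min ≈ 10.02 h

T_FP = 2(10 h + 1 min) + 1 min ≈ 20.05 h

so, under the manual baseline, a single silent fault roughly doubles the time needed to obtain a trusted result, since the corrupted run is only discovered at the final comparison.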
3.1 Error Detection with Notification

In order to accomplish detection when running deterministic parallel applications, which is the first SEDAR feature, the messages between processes are validated before being sent. Thus, the error that affects any process can be isolated, preventing it from propagating to the others.

The detection strategy consists of duplicating each application process in a thread, which requires a synchronization mechanism between both replicas. Every time a communication is to be performed, the leading thread stops running and then waits for its replica to reach the same point. Once there, the detection mechanism compares the entire contents of the messages computed by both redundant threads and, if there is a coincidence, only one of them sends the message. Such a mechanism does not require additional network bandwidth. When the receiver process reaches the receive operation, it gets the message and in turn waits for its replica to synchronize. Next, it makes a copy of the received contents before resuming execution. Additionally, because a failure may have locally propagated until the end of the execution, a comparison of the final results of the application is performed, thus allowing such failures to be detected.

This detection strategy is capable of detecting failures that cause SDC (both TDC and FSC variants, at the cost of sacrificing latency in the latter case) and TOE. As regards TOEs, they can be detected under the assumption that the execution times of two redundant threads are similar in a homogeneous, dedicated system [22]. Hence, if an appreciable delay is noticed between the two replicas, it is considered that a silent error has caused the separation of their flows. As a single time-out interval is not optimal, it should be configured taking into account the application's needs: if it is too long, the detection latency enlarges, but making it too short may cause false positive detections. In any case, a TOE is definitely detected if a process enters an infinite loop.

A scheme of the proposed detection strategy is shown in Figure 1. Each replica runs in a core that shares a cache level with the core in which the original process executes. Thus, the comparisons are solved with no need to access main memory, taking advantage of the memory hierarchy. It should be clear that the proposed detection mechanism is based on launching a single instance of the application, with each process internally replicated in a thread. This is different from the baseline, in which two independent instances of the application are launched in parallel. Nevertheless, both cases make the same use of half of the available cores from the application performance point of view.

As this methodology is based on process replication, it is a priori capable of performing only detection (triple redundancy would need to be used to achieve correction). The ways to accomplish recovery without the need for triplication are described in the next two subsections.

Implementing any resilience strategy involves unavoidable costs, both in execution time and in resource utilization. In particular, a duplication-based mechanism aims to achieve reliability at the cost of assigning half of the system cores to protect the executions. In this context, it is important to note that SEDAR provides fault tolerance without introducing any additional cost regarding resource utilization; it takes advantage of the available cores (the intrinsic redundancy of the system), without any change or need for specific hardware.
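The compare-before-send step described above can be pictured with the following C sketch. It is a minimal illustration of the technique, not SEDAR's actual code: the names sedar_send and replica_barrier are ours, the barrier is assumed to be initialized elsewhere with count 2, and MPI is assumed to be initialized with MPI_THREAD_FUNNELED, since only the leading thread calls MPI.

    #include <mpi.h>
    #include <pthread.h>
    #include <string.h>

    /* Shared state for the two redundant threads of one logical process. */
    static pthread_barrier_t replica_barrier;
    static const void *outgoing[2];  /* each replica publishes its send buffer */

    /* Called by both replicas in place of MPI_Send (tid is 0 or 1).
       For simplicity, only the leading thread learns the comparison result;
       the real mechanism would notify both replicas and trigger a safe stop. */
    int sedar_send(int tid, const void *buf, int count, MPI_Datatype type,
                   int dest, int tag, MPI_Comm comm)
    {
        int size;
        MPI_Type_size(type, &size);
        outgoing[tid] = buf;

        /* The leading thread stops and waits for its replica to reach the
           same communication point. */
        pthread_barrier_wait(&replica_barrier);

        if (tid != 0)
            return MPI_SUCCESS;

        /* Compare the entire message contents computed by both replicas. */
        if (memcmp(outgoing[0], outgoing[1], (size_t)count * (size_t)size) != 0)
            return -1;  /* silent error detected: isolate it, do not send */

        /* On a match, a single send is issued: no extra network bandwidth. */
        return MPI_Send(buf, count, type, dest, tag, comm);
    }

Because the replica threads share the process address space and run on cores sharing a cache level, the memcmp above typically resolves within the cache hierarchy, as the text notes.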
As a consequence of the problems related to strong scalability, parallel applications may not always make an efficient use of all the available cores, both in terms of time reduction and energy consumption. Therefore, using all the available cores may not necessarily be better (as expected) than using half of them (some examples of this behavior can be found in [23, 24]). In addition, as performance can depend on the mapping, providing resilience is a useful way to take advantage of the available resources.

Another remarkable aspect is that SEDAR does not modify the algorithm (as algorithm-based detectors do) and is almost transparent to the application, which makes it generally usable for message-passing applications.

The steps involved in the detection, such as the replication of processes, the synchronization between redundant threads, the comparison before sending, the copying of messages upon reception, and the final verification of the results, are the cause of the overhead introduced in the overall execution time.

A trade-off is reached in relation to the validation interval. The overhead involved in detection is minimized if results are compared only at the end but, because the detection latency increases, a lot of computation could be useless. On the other hand, if the frequency of partial result validation is high, the introduced overhead enlarges, but the fault can be quickly detected, and hence more computation is profitable. There is evidence that, depending on the
computation-to-communication ratio of the particular application, as well as on the size of the workload, the overhead can differ in a significant manner [13]. As a consequence, the combination of the validation of the messages with the final comparison of results aims to detect all faults and obtain a reliable system while introducing a reasonable overhead.

Figure 1: Outline of the operation of the detection mechanism: (a) normal operation in the absence of faults; (b) operation when detecting a fault

SEDAR's detection mechanism could also be adapted to partial replication, similar to [16]. However, extra work should be done to identify the application's critical parts that need to be replicated, which is a non-general procedure. Consequently, SEDAR could be disabled in the non-critical parts. Moreover, the partial replication would not result in a benefit regarding resource utilization, since the cores were previously assigned to replicas (whether they run or not). Nevertheless, it could be potentially profitable from the standpoint of energy consumption.

In the absence of a recovery strategy, the occurrence of an SDC or a TOE causes the detection-only strategy to notify the user and lead the system to a safe stop, preventing it from delivering defective results. As validating the messages has the effect of limiting the detection latency, the implementation of such a strategy permits relaunching the execution as soon as the error is detected, thus avoiding the needless and expensive wait for a termination with corrupted results. The execution time of the detection strategy in the absence of faults is given by Equation 3. The time is the same as that of the baseline (Equation 1) but, in this case, T_prog is negatively affected (increased) by a factor f_d, which represents the overhead of the detection mechanism. If a fault occurs, the execution time is given by Equation 4. The first term comprises the time executed until the detection instant (X) plus the whole re-execution after the stop caused by the error. Once the fault is detected, a restart is required, and, in the re-execution, the final comparison is needed.

T_FA = T_prog (1 + f_d) + T_comp    (3)

T_FP = T_prog (1 + f_d)(X + 1) + T_rest + T_comp    (4)

It is important to point out that the parameter X represents the instant of the fault detection and not the moment of the fault occurrence (which cannot be exactly known). The value of X is related to the detection latency and depends on how the data (and then the messages) are affected by the fault, so it varies according to the communication pattern of the target application.

A remarkable fact is that the proposed detection strategy is equally effective when multiple errors occur during the execution. The first difference caused by an error, observable in the contents of a message or in the final results, will lead the system to stop safely. The vulnerability of this mechanism is reduced to extremely unlikely cases, which are detailed in [22, 25]. Consequently, despite the fact that we have limited the analysis of the temporal behavior to the cases of fault absence or single error occurrence, the detection strategy can handle multiple non-related errors. As regards the implementation, the procedure required for adding the detection functionality to the parallel application (which could be automated) is detailed in [13].
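To compare with the baseline example given after Table 1 (again with hypothetical values: T_prog = 10 h, T_comp = T_rest = 1 min, f_d = 0.05, X = 0.5), Equations 3 and 4 give

T_FA = 10(1.05) h + 1 min ≈ 10.52 h

T_FP = 10(1.05)(0.5 + 1) h + 1 min + 1 min ≈ 15.78 h

Even including the detection overhead, a fault detected halfway through costs about 15.8 h instead of the roughly 20 h of the baseline, because the corrupted run is aborted at the detection point rather than running to completion before the mismatch is noticed.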
3.2 Recovery based on multiple system-level checkpoints

The next step in the search for transient fault tolerance consists of adding a recovery mechanism. In SEDAR, it is proposed to store a chain of distributed coordinated checkpoints, built with a system-level checkpointing library.

It is not possible to ensure that any particular checkpoint holds a safe state for recovery, because a silent error can spoil the internal state of one of the replicas that is going to be checkpointed. Therefore, recovering from the last stored checkpoint is not always feasible and using an older one may be required. Thus, multiple checkpoints have to be saved to guarantee recovery [8]. As transient faults are fleeting, it is important to note that the restart can be attempted from the same node where the corruption took place. There are two possible cases:
1. The transient fault occurs and is detected inside the boundaries of a checkpoint interval. In this situation, the last checkpoint can be used to resume the execution. As a particular case, if the detection occurs before the first checkpoint, the application must be relaunched from the beginning. This situation is outlined in Figure 2(a).
2. The detection latency transposes the limits of the checkpoint interval. This circumstance arises when the fault occurs before storing a checkpoint but the detection takes place after that. In this situation, the last checkpoint is invalid, so the corresponding restart causes the same error to manifest. Consequently, the previous checkpoint must be used to attempt to recover. Generalizing, the fault can traverse any number of checkpoints (depending on the detection latency), requiring several tries to make rollback recovery possible. In turn, this situation is outlined in Figure 2(b).

Controlled fault injection experiments are needed to verify the operation of the recovery strategy when the two aforementioned cases occur. The data corruption becomes evident as an observable difference between the memory states of the replicas. Hence, in order to simulate a bit-flip in a processor register, the value of a variable is changed in only one of the replicated threads, in a single iteration of the computation. Such an injection is made from inside the code of the application. The details of the injection method are described in Section 4.2. As regards the checkpoints, the best moments to take them are when the communications have just been validated. This strategy reduces the windows of vulnerability [25] (we have previously studied this issue in [22]), as the probability of an error corrupting the state of a checkpoint is smaller in that situation. However, the mechanism works even if the checkpoints are taken elsewhere, as in a more realistic scenario with periodic checkpointing made externally to the application. Note that, in general, a considerable overhead would be involved if a checkpoint were taken after each communication.

As mentioned before, the two possible behaviors of the recovery strategy are outlined in Figure 2, while Algorithm 1 describes the pseudo-code of the proposed method. For the sake of clarity, the injection mechanism is not included in the algorithm, nor is the recording of system checkpoints; in general, they can be taken at any time.
Figure 2: Possible cases of recovery using multiple system-level checkpoints, depending on the detection latency: (a) detection latency confined within the checkpoint interval; (b) detection latency transposing the checkpoint interval
Algorithm 1: Recovery algorithm with multiple system-level checkpoints

    /* extern_counter is an external counter that controls the number
       of rollbacks (not included in the checkpoint) */
    int extern_counter = 0;
    /* fault_detected is a boolean variable that reports if a failure
       was detected in the last execution */
    boolean fault_detected = FALSE;
    /* run parallel application app under SEDAR monitoring */
    SEDAR_run(app);
    /* if fault_detected is TRUE, then a fault was detected in the
       last execution */
    while fault_detected == TRUE do
        /* extern_counter is increased by 1 */
        extern_counter++;
        /* get the number of checkpoints done */
        ckpt_count = get_ckpt_count();
        /* calculate the number of the checkpoint for restart */
        ckpt_no = ckpt_count - extern_counter;
        /* reset detection flag for restart */
        fault_detected = FALSE;
        /* restart app from checkpoint ckpt_no */
        SEDAR_restart(app, ckpt_no);
    end while

Automating this method would increase its usability. This can be accomplished if an outsider process is allowed to read extern_counter and, based on its value, find the correct restart script to try recovery; besides that, the wrongly-restarted checkpoint has to be erased (and stored again during re-execution). In turn, the code of fault injection is executed only once, and the target application modifies the value of extern_counter every time a fault is detected.

It is worth noting that this mechanism is also able to detect multiple faults (if they are independent of each other) [22]. If a different fault is detected during a re-execution, the algorithm will recover, but at a sub-optimal computational cost. In the current state of the proposal, the algorithm is optimized to deal with a single error. So, when any error is detected in a re-execution, the algorithm assumes that it is the same one as was previously detected (i.e. that the detection latency has exceeded the checkpoint interval, as mentioned in Section 3.2). Therefore, the detection of a different error during re-execution will generate an unnecessary rollback attempt (the algorithm assumes that the last checkpoint is corrupted, although this is not the case). In the worst-case situation, the recovery mechanism goes back to the beginning. While predicting the recovery time becomes difficult in the case of multiple faults, a reliable conclusion is still ensured. The described inefficiency can be fixed by adding a more sophisticated mechanism, which is briefly described in Section 4.2.

In the absence of faults, the execution time of this mechanism is given by Equation 5. Compared with the detection-only case (Equation 3), the extra term accounts for the time involved in saving n system-level checkpoints. When a fault occurs, the execution time is the one shown in Equation 6. The parameter k is the number of extra checkpoints that need to be reversed if the restart from the last one does not succeed. Therefore, the third term represents the time spent in checkpointing, taking into account that several checkpoints (k) might be recorded again if there is any corruption that prevents successful recovery. The fourth term is an estimate of the re-execution time, considering that, on average, the fault may be detected midway through the checkpoint interval. This lapse will need to be re-executed in the best case, i.e. when recovery is possible from the last stored checkpoint (k = 0). If k > 0, the same portion plus a number of checkpoint intervals (which depends on the value of k) will require being re-executed. Finally, the last term represents the number of needed restarts.

T_FA = T_prog (1 + f_d) + T_comp + n t_cs    (5)

T_FP = T_prog (1 + f_d) + T_comp + (n + k) t_cs + \sum_{m=0}^{k} (k - m + 1/2) t_i + (k + 1) T_rest    (6)
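As an illustration of how the rollback terms of Equation 6 grow with k (with the hypothetical value t_i = 1 h): for k = 1, i.e. the last checkpoint is corrupted, the re-execution term is

\sum_{m=0}^{1} (1 - m + 1/2) t_i = (3/2 + 1/2) t_i = 2 h

plus two restarts (2 T_rest) and one re-recorded checkpoint inside the (n + k) t_cs term. For k = 0, the same sum reduces to t_i/2, the average half-interval of rework mentioned in the text.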
When system-level checkpoints are the only available option, this strategy becomes appropriate, despite having two significant limitations. The first one refers to the amount of required storage. None of the checkpoints can be erased, given the uncertainty about the validity of the data recorded in them: if a consistent checkpoint cannot be found, a significant part of the application will need to be re-executed; in an extreme case, the whole execution will have to be relaunched from the beginning [8]. In any case, the negative impact of multiple checkpoints on storage can be reduced by solutions based on multi-level checkpointing [7]. The second important drawback is related to scalability: in upcoming exascale systems, and despite some considerable efforts [26], coordinated system-level checkpoints would not be the most suitable solution, because they keep a large amount of information related to the system. In an unrefined version, our method is an expensive approach, because it needs to keep an undetermined number of active checkpoints and may require several restart attempts. Instead, user-level checkpoints are becoming more usual, especially due to their lower costs and portability options [1].

3.3 Recovery based on a single safe application-level checkpoint

This third alternative included in SEDAR is designed to overcome the limitations caused by the utilization of system-level checkpoints. In this context, and despite requiring detailed knowledge of the application's internal organization (computing and communication), user-level checkpoints are a more appropriate option, given the fact that they only save the application-related information [4]. Besides that, they are smaller, more portable and scale better than their system-level counterparts. As a consequence, the utilization of a single user-level checkpoint for recovery is proposed, in conjunction with a strategy to ensure the validity of the last recorded checkpoint. Therefore, the prior checkpoints can be removed, thus decreasing storage usage and reducing the relaunching latency.

The proposed solution is based on recording per-thread user-level checkpoints, taking advantage of the synchronization mechanism between replicas. Such checkpoints just save the set of variables that are significant to the application at that specific moment. As both thread checkpoints are stored, a hash of each one is calculated. To collate the two hashes, the mechanism used for validating message contents in the detection phase is employed again. Hence, the checkpoint is considered valid only if the comparison proves to be true. In this situation, the previous checkpoint can be safely discarded to save storage, as the current one constitutes a consistent state for recovery. On the other hand, if a difference is detected in the verification phase, it is necessarily due to a fault that has occurred within the last checkpoint interval. As this checkpoint is considered corrupted, it is not possible to use it for recovery and it should be erased.
Then, the execution has to be resumed from the previous checkpoint. As a consequence, there is a single valid checkpoint at any given time (except during the validation interval), regardless of the result of the comparison. Algorithm 2 describes the pseudo-code of the proposed mechanism.
Algorithm 2: Recovery algorithm with application-level checkpoints

    function usr_ckpt(n)                         // usr_ckpt function definition
        for (tid = 0; tid < 2; tid++) do         // for both replicas
            /* record its custom checkpoint */
            store_all_significant_variables(tid);
            hash_array[tid] = compute_hash(tid);
        end for
        synch_threads();                         // wait for each other
        /* only one of the replicas compares hashes */
        if tid == 0 then
            if hash_array[0] == hash_array[1] then   // they match
                /* delete own checkpoint */
                remove_all_significant_variables(tid);
                /* this is a valid checkpoint, so the previous one can be discarded */
                return TRUE;
            else
                return FALSE;                    // this is a corrupted checkpoint
            end if
        end if
    end function

    (...)                                        // application code
    /* n represents the current checkpoint */
    if usr_ckpt(n) == TRUE then
        /* delete previous checkpoint since the current one is valid */
        remove_usr_ckpt(n-1);
    else
        /* remove current corrupted checkpoint */
        remove_usr_ckpt(n);
        /* restart from previous checkpoint */
        restart_from_usr_checkpoint(n-1);
    end if
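A concrete possibility for the compute_hash step of Algorithm 2 (our illustration; the paper does not specify the hash function, and here it is shown over an explicit buffer rather than a thread id) is a simple FNV-1a digest over the memory holding the replica's significant variables:

    #include <stdint.h>
    #include <stddef.h>

    /* FNV-1a digest of the memory block holding one replica's significant
       variables. Comparing the two replicas' digests validates the user-level
       checkpoint without comparing the full contents byte by byte. */
    uint64_t compute_hash(const void *data, size_t len)
    {
        const unsigned char *p = (const unsigned char *)data;
        uint64_t h = 14695981039346656037ULL;   /* FNV-1a 64-bit offset basis */
        for (size_t i = 0; i < len; i++) {
            h ^= p[i];
            h *= 1099511628211ULL;              /* FNV-1a 64-bit prime */
        }
        return h;
    }

A stronger digest could be substituted if collisions between a corrupted and a clean state are a concern; since both replicas hash the same variables in the same order, any deterministic function works.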
In the absence of faults, the execution time of this mechanism is given by Equation 7. The time is equal to that of the detection-only strategy, but incorporates an additional last term, which represents the time employed for n user-level checkpoints to be recorded (t_ca) after being validated (T_compA). Instead, Equation 8 accounts for the time when a fault occurs. The fourth term shows that, on average, just half of the checkpoint interval has to be re-executed, as barely a single rollback is required. For the sake of clarity: as each checkpoint is validated, the detection latency is confined within the checkpoint interval. In the worst-case scenario, the re-execution time will be t_i, if the error is detected just before taking a new checkpoint; while, in the best-case scenario, in which the error is detected as soon as a checkpoint has been taken, it will be near 0. As the probability of an error is equally distributed along the checkpoint interval, we state that, on average, the re-execution time is (1/2) t_i. Ultimately, the last term represents the only restart time that is required, as the algorithm performs a single rollback at most. It is important to notice that T_comp represents the time required for the validation of application results, while T_compA comprises the time required to validate an application-level checkpoint.

T_FA = T_prog (1 + f_d) + T_comp + n (t_ca + T_compA)    (7)

T_FP = T_prog (1 + f_d) + T_comp + n (t_ca + T_compA) + (1/2) t_i + T_rest    (8)

As previously mentioned, Equations 3 to 8 describe the time required by each strategy in two cases: (1) in the absence of faults (T_FA); and (2) in the presence of a single silent fault (T_FP). As a fault has an associated occurrence probability, we introduce a general formulation that predicts the Average Execution Time considering it and, as a consequence, the Mean Time Between Errors (MTBE) parameter. This function allows us to estimate the average overhead introduced by each SEDAR strategy. Let α be the probability of a fault occurrence. Then, the Average Execution Time function is given by:

AET = α T_FP + (1 - α) T_FA    (9)

The MTBE of a system with N processors decreases linearly with N, that is, MTBE = MTBE_ind / N, where MTBE_ind is the MTBE of an individual processor [4]. If λ = 1/MTBE_ind is the silent error rate of an individual processor, then the silent error rate of the whole system is λN. Assuming that errors occur according to an exponential distribution, the probability of a silent error affecting a computation that lasts T_prog and executes on a system with N processors is:

P(N, T_prog) = 1 - e^{-λ N T_prog} = 1 - e^{-N T_prog / MTBE_ind} = 1 - e^{-T_prog / MTBE}    (10)

The latter expression is the probability of a silent fault occurrence, that is, α. By considering this in Equation 9, we can obtain the Average Execution Time as a function of the MTBE of the system and the baseline program execution time T_prog:

AET = T_FP (1 - e^{-T_prog / MTBE}) + T_FA e^{-T_prog / MTBE}    (11)

Equation 11 is useful for estimating the average overhead of using each SEDAR strategy, considering both the cases of fault absence and fault presence. If the Error Detection with Notification strategy is used, T_FA and T_FP of Equation 11 are obtained from Equations 3 and 4; when using Recovery based on Multiple System-Level Checkpoints, they are obtained from Equations 5 and 6; and when Recovery based on a Single Safe Application-Level Checkpoint is the chosen strategy, the times are obtained from Equations 7 and 8.
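To make the model operational, the following C sketch (our own illustration; all parameter values must be supplied by the user) evaluates Equation 11 for the multiple-checkpoint strategy, taking T_FA and T_FP from Equations 5 and 6:

    #include <math.h>

    /* Parameters of Table 1 (times in hours). */
    typedef struct {
        double t_prog;   /* baseline execution time              */
        double t_comp;   /* final result comparison time         */
        double t_rest;   /* restart time                         */
        double f_d;      /* detection overhead factor            */
        double t_cs;     /* cost of one system-level checkpoint  */
        double t_i;      /* checkpoint interval                  */
        int    n;        /* checkpoints in a fault-free run      */
        int    k;        /* extra checkpoints to roll back       */
    } sedar_params;

    /* Equation 5: fault-free time with n system-level checkpoints. */
    static double tfa_multi_ckpt(const sedar_params *p)
    {
        return p->t_prog * (1.0 + p->f_d) + p->t_comp + p->n * p->t_cs;
    }

    /* Equation 6: time with one fault requiring k extra rollbacks. */
    static double tfp_multi_ckpt(const sedar_params *p)
    {
        double rework = 0.0;
        for (int m = 0; m <= p->k; m++)
            rework += (p->k - m + 0.5) * p->t_i;
        return p->t_prog * (1.0 + p->f_d) + p->t_comp
             + (p->n + p->k) * p->t_cs + rework + (p->k + 1) * p->t_rest;
    }

    /* Equations 10 and 11: average execution time for a given system MTBE. */
    double aet_multi_ckpt(const sedar_params *p, double mtbe)
    {
        double alpha = 1.0 - exp(-p->t_prog / mtbe);  /* fault probability */
        return alpha * tfp_multi_ckpt(p) + (1.0 - alpha) * tfa_multi_ckpt(p);
    }

Analogous pairs of functions written from Equations 3-4 or 7-8 yield the AET of the detection-only and the application-level-checkpoint strategies.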
4 Validation and Evaluation

4.1 Analytical Model of Fault Scenarios

To describe and validate the functional behavior of the detection and automatic recovery strategies in the presence of faults, we have built a model based on combining a well-known test application with a complete, controlled workfault. The analytical model contemplates all the possible faults that can occur, based on deep knowledge of the behavior of the application. Each fault has a predictable effect, a moment at which it is certainly detected, and a determinable point for recovery. Obviously, there are infinite physical possibilities of fault occurrences, but all of them are represented in the listed scenarios; a scenario represents a class of errors, so it includes a set of cases that behave in the same way. For each experiment, a single fault is injected.

The test application is synthetic, built up over an MPI Master/Worker matrix multiplication (C = A × B). The modifications consist of replicating processes in threads for detection, in addition to the final validation of the resulting matrix. Every time the application performs a communication, a system-level checkpoint is carried out, as messages are sent only if the involved data are safe. The matrix multiplication has been selected because it is a regular, computationally-intensive, representative parallel application, with a well-known communication pattern. The deep knowledge about its behavior allows the clear identification of the moments of communication between processes and of the data involved in each communication. As a consequence, the precise effect of each injected fault can be predicted, as well as the state of each recorded checkpoint (clean or dirty), and, therefore, which checkpoint makes the recovery possible.

The pseudo-code for the test application is shown in Algorithm 3.

Algorithm 3: Pseudo-code for the test application

    SEDAR_Init()
    SEDAR_Ckpt()        // Checkpoint (CK0)
    SEDAR_Scatter(A)    // Master scatters matrix A (SCATTER)
    SEDAR_Ckpt()        // Checkpoint (CK1)
    SEDAR_Bcast(B)      // Master broadcasts matrix B (BCAST)
    SEDAR_Ckpt()        // Checkpoint (CK2)
    matmul(A,B,C)       // Each process computes its block (MATMUL)
    SEDAR_Gather(C)     // Master gathers matrix C (GATHER)
    SEDAR_Ckpt()        // Checkpoint (CK3)
    if rank == MASTER then
        /* Master validates final result (VALIDATE) */
        SEDAR_Validate(C)
    end if

The possible scenarios for injection experiments are organized according to the following criteria:

• P_inj: the execution instant where the injection is carried out, taking as reference the structure of the application (e.g. between SCATTER and CK1).

• Process: whether the injection is made in the code executed by the Master or by any Worker.

• Data: whether the injection is made on an element of matrix A, B or C, or on an index variable; also, whether the injected value is used by the Master or by a Worker for its computation (e.g. A(M), C(W), i(M), etc.).

• Effect: TDC, FSC, LE or TOE.

• P_det: the execution moment where the fault is detected. It can be at a communication or at the final validation.

• P_rec: the nearest checkpoint from which it is possible to recover.

• N_roll: the number of attempts required for correct recovery. The possible values are: 0, if the injected fault causes an LE; 1, if recovery is possible from the last recorded checkpoint; 2, if it is necessary to rewind to the last checkpoint but one; and so on.

Based on the combinations of these factors, we have designed a set of 64 injection experiments that cover all the situations that can occur in the target test application.
The 64 scenarios have been designed according to the following criteria: the faults are injected both in the code executed by the Master and by the Workers, in elements of each of the three matrices. The injections in the Master code are made both in data that it transmits and in data that it keeps for local use. The injections in the Workers' code are made in data that will be transmitted, as the Workers do not locally retain results. In both the Master and the Workers, there are faults injected after each checkpoint, i.e. when the checkpoint is clean, so recovery is possible. Nevertheless, other faults are injected after a communication operation but before the subsequent checkpoint (i.e. between them, into already validated data), making that checkpoint dirty and forcing more than one rollback to recover. On the other hand, injections in index variables are made, both in the Master and the Worker codes, during the actual matrix-multiplication operation, in order to make the processing times of the two redundant threads asymmetric. As previously mentioned, every fault affecting a certain subset of data, and occurring (at any moment) during the lapse of the execution comprised in a particular scenario, is detected at the same time, so it can be recovered from the same checkpoint. In other words, any case of error has a similar effect to one of the 64 provided scenarios; these 64 scenarios are derived from the study of the application behavior.

To illustrate the method followed, only a few representative scenarios are detailed in Table 2. These scenarios were selected mainly to make evident the four possible effects of a fault (TDC, FSC, LE or TOE), but also to display injections made in the code of both the Master and the Workers, showing different moments of detection and reflecting various possible situations of recovery (i.e. no need to roll back, rollback to the last checkpoint, or multiple rollback attempts).

Table 2: Selected representative injection scenarios: effects and predicted points of detection and recovery

    Scenario   P_inj          Process   Data   Effect   P_det      P_rec   N_roll
    2          CK0-SCATTER    Master    A(W)   TDC      SCATTER    CK0     1
    29         BCAST-CK2      Worker    C(W)   LE       -          -       0
    50         GATHER-CK3     Master    C(M)   FSC      VALIDATE   CK2     2
    59         MATMUL         Worker    i(W)   TOE      GATHER     CK2     1

From Table 2 we can observe that:

• In Scenario 2, the injection is carried out by modifying matrix A in the Master process, between the CK0 and SCATTER stages. The injected element of A is going to be transmitted to a Worker, so this injection produces a TDC error, which will be detected at the SCATTER moment. Recovery is possible from the last checkpoint carried out (CK0, a clean checkpoint).

• An example of an LE error is described in Scenario 29. The injection happens between BCAST and CK2, affecting matrix C in a Worker. As this matrix has not been computed yet and will be overwritten afterwards, the error does not modify the final result.

• Scenario 50 shows the details of an injection experiment that causes an FSC error. The injection is made in an element of matrix C that has already been calculated and received by the Master (GATHER), but before making checkpoint CK3. The error will be detected at the VALIDATE stage and, because CK3 is dirty (the fault occurred before recording it), an additional rollback is required to recover.

• A possible TOE error is detailed in Scenario 59. This injection takes place at the MATMUL stage and affects an index variable used by a Worker, causing one of the replicas to restart its computation after it has already done part of its task. This causes a delay in the affected thread, which is detected as a TOE; only the other replica reaches the GATHER operation within a configurable lapse.
The recovery is performed from the CK2 point (a single rollback is enough).

A conclusion of our analysis is that any random fault that can occur along the execution resembles one of the modeled scenarios. A method of analysis like the one described, based on the knowledge of the target application and of the moments of checkpointing, in combination with controlled and systematic fault injection, allows us to predict the behavior of both the detection and the automatic recovery mechanisms. Thus, the efficiency of the strategy is shown when running in a real environment with random faults.

It should be noted that the aim of the performed functional analysis is the evaluation of the efficacy of SEDAR's detection and recovery mechanisms. As the correct operation of both mechanisms is checked for all the errors that can occur in the particular selected test application, and other types of errors do not exist, the carried-out validation verifies the functional suitability of SEDAR.
4.2 Implementation and Fault Injection Details

All the experiments described in this section have been carried out using standard tools. The implementation of the fault tolerance strategy consists of a library of modified MPI functions and data types with extended functionality for fault detection. This includes buffer comparison before sending, message copies upon reception, and synchronization between replicated threads.

The coordinated system-level checkpoints are built with the DMTCP library [27], which generates distributed per-process checkpoint files and a single restart script for each checkpoint. All processes of the application call SEDAR_Ckpt(), but a single process (e.g. the Master) is in charge of checkpointing the whole application. Both redundant threads of this process synchronize with each other (in the same way as they do when a message is to be sent), and only one of them calls DMTCP_Ckpt() from inside SEDAR.

The fault injection is made from inside the test application. An ad-hoc function is included in the library, which contains the 64 scenarios, and conditional compilation is used to make a single injection in each experiment. Depending on the number of the particular injection scenario, the function is invoked in a different place during the execution. This function works in conjunction with a file that is used to control whether an injection has already been made (named injected.txt). The content of this file is evaluated in each function call. In the first instance, the file contains the value 0 and, when the injection is made, its content is incremented (it is changed to 1). In the recovery process, the code is re-executed, calling the injection routine again. As the file content is re-evaluated, the function returns without making a new injection, since it has changed. This flag needs to be external to the application so that its content is independent of the checkpointing storage (i.e. when a checkpoint is made, the value in the file is not affected).

The recovery mechanism, based on keeping a chain of checkpoints, uses another file to control how many times the same fault has been detected (named failures.txt). This file is initialized with 0 and its content is incremented each time a fault is detected. Assuming that a single error occurs during execution, the file contains 1 upon the first detection, and a rollback to the last restart script is tried. If an error is detected during re-execution, it is assumed that it is the same error as the previous one and that the checkpoint is dirty due to the detection latency. In this case, the file's content becomes 2 and a rollback to the prior checkpoint is attempted. Therefore, the content of the file is used to choose the number of the restart script that has to be executed. Once again, the file needs to be external to the application, so that its content is independent of the checkpoint storage.

To recover from multiple different faults, the mechanism should be slightly modified to achieve better performance. Some additional data, related to the current fault, might also need to be stored in the file, in order to be able to distinguish between a repetition of the previous fault and a new fault. In the latter case, the stored counter would not increase and the mechanism would behave as in the "first detection" scenario.
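The flag-file protocol just described can be pictured with the following C sketch; the helper names are ours and the injected corruption is arbitrary, but the use of injected.txt and failures.txt follows the text:

    #include <stdio.h>

    /* Read the integer stored in a flag file; assume 0 if it is missing. */
    static int read_flag(const char *path) {
        int v = 0;
        FILE *f = fopen(path, "r");
        if (f) { if (fscanf(f, "%d", &v) != 1) v = 0; fclose(f); }
        return v;
    }

    static void write_flag(const char *path, int v) {
        FILE *f = fopen(path, "w");
        if (f) { fprintf(f, "%d", v); fclose(f); }
    }

    /* Called at the injection point of the chosen scenario: corrupts one
       variable of one replica, only once across restarts. */
    void maybe_inject(double *target) {
        if (read_flag("injected.txt") == 0) {
            *target = -(*target);           /* emulate a corrupted value */
            write_flag("injected.txt", 1);  /* 0 -> 1: never inject again */
        }
    }

    /* Called when SEDAR detects a fault: the counter selects the restart
       script (1 -> last checkpoint, 2 -> the one before it, and so on). */
    int register_detection(void) {
        int n = read_flag("failures.txt") + 1;
        write_flag("failures.txt", n);
        return n;
    }

Both files live outside the checkpointed state, so, as the text points out, their contents survive a rollback and keep the injection and rollback accounting consistent across restarts.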
The 64 injection experiments, each one being representative of a particular scenario, were performed over the test application, using two nodes of a Blade cluster with eight nodes. Each node contains two quad-core Intel Xeon E5405 2.0 GHz processors (6 MB L2 cache, shared between pairs of cores), 10 GB of RAM (shared between both processors) and 250 GB of local disk storage. The operating system is GNU/Linux Debian 6.0.7 (64 bits, kernel version 2.6.32), the message-passing library is MPICH (version 3.3.1), and the checkpointing library is DMTCP (version 2.4.4).

Figure 3 shows the output file of one of the injection experiments, namely Injection Scenario 50, in order to demonstrate the followed methodology and the obtained behavior.

Figure 3: Output of the execution of the test application when the fault of Scenario 50 is injected

To provide more detail: in this stage of SEDAR's functional validation, our implementation of fault-tolerant MPI functions is based on point-to-point communications, which makes it more general and allows us to exercise the FSC scenarios, at the expense of sacrificing performance. However, we have also developed versions with optimized collective communications, which are used in the stage of temporal behavior evaluation. It is important to note that, in collective communications, the sender process also participates, either sending or receiving. If our test application just uses collective operations, the corrupted data gets transmitted and hence it is validated. In this way, only TDC scenarios remain and FSC scenarios are no longer present. It is clear that this idea could be extended to many other applications.

4.3 Evaluation of the Temporal Behavior

To show how our descriptive model can be used to evaluate the temporal behavior of each alternative, a set of simple examples is presented, which includes real measured values for the parameters in Table 1, taken from the tests carried out.

To enrich the analysis, we have used three parallel benchmarks for the testing stage: matrix multiplication; Jacobi's method for Laplace's equation [28]; and DNA sequence alignment with the Smith-Waterman algorithm [29]. These applications have been selected because they are well-known, computationally intensive, and representative of scientific computing. In addition, they allow us to study the effects of having different communication patterns: Master-Worker, Single-Program-Multiple-Data (SPMD) and Pipeline, respectively [13]. The comparison of the execution times of the three target applications, between raw MPI versions and our MPI-based implementation of SEDAR, has the aim of measuring the execution parameters for each application and pointing out the differences according to the distinct communication patterns. The SEDAR implementation includes mechanisms for duplicating processes, synchronizing replicas, comparing and copying the messages' contents, and validating the final results. Each experiment has been repeated five times, and average values have been taken.

The parameter f_d, the detection overhead, was obtained by comparing the execution time of the SEDAR detection mechanism, in the absence of faults, with the time of executing the manual detection strategy (baseline), for each target application. Explicitly:

f_d = [T_FA^{SEDAR det} - (T_prog + T_comp)] / (T_prog + T_comp)    (12)

where T_FA^{SEDAR det} is the time of Equation 3 and (T_prog + T_comp) is the time of Equation 1.

As regards the parameter T_comp, it has been measured for a manually-launched program that compares the contents of two binary files. It is similar to the time required by the function SEDAR_Validate().
An average value of the parameter t_cs was obtained by measuring with the tools provided by the DMTCP library. As regards T_rest, it has been measured indirectly (through fault injection experiments that demand recovery), and the obtained values were consistent with the assumption that the time needed to perform a checkpoint can be considered equal to the time needed to load a checkpoint from storage [30].

To achieve long-lasting executions for the three applications (around 10 hours in the baseline case), we have adjusted the execution parameters. The matrix product has been repeated 100 times using N = 8192. For the Jacobi algorithm, the size of the workload was N = 8192 and the number of iterations I = 300,000. Finally, for the DNA sequence alignment, the length of the sequences was set to N = 2^22 = 4,194,304. In this context, the parameter t_i has been fixed at 1 hour for all the experiments. Regarding n, this is the number of checkpoints recorded with such an interval value, and it is obtained by dividing the time of the detection-only strategy (Equation 3) by the checkpoint interval t_i. It is worth noting that, instead of being arbitrarily assigned, the checkpoint interval can be determined, for example, with Daly's formula [31], which takes into account the MTBE. The checkpoint interval is intended to be a trade-off between keeping the checkpointing overhead low and bounding the rework time if a fault occurs.

The parameter X, which depends on the detection latency, has been manually assigned for three cases. In turn, the parameter t_ca was estimated under the assumption that an application-level checkpoint is lighter to store than its system-level counterpart. Last, T_compA (the validation time of an application-level checkpoint) has been estimated to be equal to the time of validating the results (i.e. the parameter T_comp), as a way to simplify the model.

Table 3 contains the list of all the values used for the temporal evaluation, obtained as described, whereas Table 4 shows the resulting times in each case. It may be recalled that, in the manual strategy (execution of two simultaneous independent MPI instances), a total of eight MPI processes have been launched, with a maximum of four processes mapped to each node, which means that just four cores have been used in each node. The same mapping was assigned in the SEDAR implementation, but in this case all the cores of each node have been used, as the redundant threads run on the free cores.

An analysis of the data contained in Table 3 reveals some remarkable facts. In all cases, the detection overhead f_d is very low. The largest value corresponds to Jacobi's method, which is the application with the most frequent communications. This is the expected behavior, since f_d is tightly associated with messages. In contrast, the matrix product, which is computation-bound, presents a negligible detection overhead.

In turn, t_cs is directly related to the size of the workload W of each application, and therefore to the amount of memory used, as can be seen in Table 3. The matrix product is the most memory-consuming application: the Master process handles the three entire matrices, whereas each Worker handles the entire B matrix plus its corresponding chunks of A and C. As regards Jacobi's method, one of the processes handles the two entire matrices, whereas the other processes handle their corresponding chunks of them.
Finally, in the Smith-Waterman algorithm, all the processes require local buffers; one of them handles the two entire sequences, whereas the others handle one entire sequence plus their corresponding chunk of the other. It may be recalled that, for the three benchmarks, all the processes are replicated.

For its part, T_comp is associated with the size of the results that have to be validated. In the matrix product, the entire matrix C is validated for each instance, so the obtained value is significant compared to the other applications. At the other extreme, in the DNA sequence alignment only the similarity score has to be validated, which involves negligible time. As a middle ground, in Jacobi's method a single matrix needs validation.

The information shown in Table 4 allows us to survey interesting aspects of SEDAR's behavior. As a remarkable fact, when a fault occurs, the detection mechanism (rows 4, 5 and 6) performs better than the baseline (row 2) for all the applications, regardless of the time of detection, due to the low temporal overhead involved. The sooner the error is detected, the better the mechanism behaves, as expected, because less work must be redone after stopping. It is rather obvious that adding the multiple-checkpoint-based recovery strategy (row 7) involves a larger overhead compared to the detection-only strategy in the absence of faults; the time spent in checkpointing pays off in the case of these long-lasting executions, but may have a non-negligible impact on shorter programs.

Table 3: Values of the parameters utilized in the temporal evaluation of each alternative strategy

Parameter        MATMUL       JACOBI       SW
T_prog [hs]      10.21        8.92         11.15
T_comp [s]       42           1            < 1
f_d [%]          < 1          < 1          < 1
X [%]            30; 50; 80   30; 50; 80   30; 50; 80
t_i [hs]         1            1            1
n                10           8            11
W [MB]           6016         1920         152
t_cs [s]         14.10        9.62         2.55
T_rest [s]       14.10        9.62         2.55
t_ca [s]         10.58        9.11         1.92
t_compA [s]      42           1            < 1
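As noted above, instead of fixing t_i arbitrarily (1 hour in these experiments), the interval could be derived from Daly's formula [31]. The sketch below is ours, not part of SEDAR: it computes Daly's higher-order estimate from the checkpoint cost and an assumed MTBE of one day (a hypothetical value, only for illustration).

/* Checkpoint interval from Daly's higher-order formula [31].
 * delta: cost of one checkpoint (e.g. t_cs), M: mean time between errors.
 * The leading term sqrt(2*delta*M) is Young's classical first-order estimate. */
#include <math.h>
#include <stdio.h>

double daly_interval(double delta, double M) {
    if (delta >= 2.0 * M)
        return M;                    /* degenerate case: checkpointing too costly */
    double r = delta / (2.0 * M);
    return sqrt(2.0 * delta * M) * (1.0 + sqrt(r) / 3.0 + r / 9.0) - delta;
}

int main(void) {
    double t_cs = 9.62;              /* s, system-level checkpoint cost (Jacobi, Table 3) */
    double mtbe = 24.0 * 3600.0;     /* s, assumed MTBE of one day (illustrative) */
    printf("t_i = %.0f s (about %.0f min)\n",
           daly_interval(t_cs, mtbe), daly_interval(t_cs, mtbe) / 60.0);
    return 0;
}

With these numbers the interval comes out near 21 minutes, much shorter than the 1-hour value used above, which illustrates how the choice of t_i trades checkpointing overhead for rework time.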
The analysis of the values in rows 8, 9 and 10 reveals that, when an error takes place, even rolling back several times is advantageous with respect to the baseline; only when the number of rollbacks is greater than 4 does the time spent in reworking become longer than that of the baseline strategy.

The values shown in rows 11 and 12 are similar to those of rows 7 and 8. As expected, the time of recovery from the last valid application-level checkpoint (Equation 8) is almost equal to the time of recovery from the last system-level checkpoint, when this is possible (Equation 6 with k = 0).

In conclusion, it can be seen that the temporal behavior of each SEDAR strategy depends on the communication pattern, the computation-to-communication ratio, and the detection latency. The examples described with the three benchmarks demonstrate how the model can be applied for temporal evaluation if the involved parameters are available (or can be measured). Another remarkable fact that can be observed from Table 4 is that the different alternatives of SEDAR offer considerable gains, both in time and in reliability, when facing the occurrence of a silent error. This becomes particularly important in executions that last many hours. Moreover, the longer the execution time (T_prog), the more useful the fault-tolerance strategy is, because faults are more likely to happen. As previously stated, the protection mechanism should be used for long programs: if the execution is too short, checkpoints become worthless. Although these examples cannot be taken as general conclusions, they illustrate the potential of SEDAR for helping users of scientific applications to obtain reliable executions, as they are representative of scientific parallel applications.

4.4 Convenience of saving multiple checkpoints

As mentioned before, in a system that saves a chain of checkpoints for rollback, recovery is possible after one or more attempts. However, due to the checkpointing and rollback overheads, there are scenarios in which the time spent in those attempts could be longer than simply stopping upon detection and relaunching from the beginning. Therefore, it is useful to evaluate the convenience of saving multiple checkpoints, weighing the benefits of checkpoint-based protection against its costs.

This study also suggests how the developed model can be used. If statistics about the frequency and typical behavior of faults are available for a particular system that runs an application (i.e. when faults are more likely to appear), the protection strategy can be properly tuned between detection-only and checkpoint-based recovery.

For different values of the parameter X, some quantitative knowledge can be extracted from the evaluation of the execution times in Equation 4 and in Equation 6. It can be shown that the fourth term in Equation 6 is equivalent to:

\sum_{m=0}^{k} (k - m + 1/2) t_i = [(k + 1)^2 / 2] t_i    (13)

so that Equation 6 can be rewritten as:

T_{FP} = T_{prog}(1 + f_d) + T_{comp} + (n + k) t_{cs} + [(k + 1)^2 / 2] t_i + (k + 1) T_{rest}    (14)
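The following sketch (ours, with parameter values taken from Table 3 and the equation forms reconstructed above; names are illustrative) evaluates Equation 14 against the relaunch time of Equation 4, to decide whether the (k+1)-th rollback attempt is still worthwhile:

#include <stdio.h>

typedef struct {
    double T_prog;   /* hs, fault-free parallel execution time */
    double f_d;      /* detection overhead fraction */
    double T_comp;   /* hs, validation of the final results */
    double t_cs;     /* hs, one system-level checkpoint */
    double T_rest;   /* hs, one restart */
    double t_i;      /* hs, checkpoint interval */
    int n;           /* checkpoints recorded in a fault-free run */
} sedar_params;

/* Equation 14: detection plus recovery with k+1 rollback attempts. */
double t_rollback(const sedar_params *p, int k) {
    return p->T_prog * (1.0 + p->f_d) + p->T_comp
         + (p->n + k) * p->t_cs
         + ((k + 1.0) * (k + 1.0) / 2.0) * p->t_i
         + (k + 1.0) * p->T_rest;
}

/* Equation 4: detect at fraction X of the run, safe-stop, relaunch. */
double t_relaunch(const sedar_params *p, double X) {
    return (1.0 + X) * p->T_prog * (1.0 + p->f_d) + p->T_comp;
}

int main(void) {
    /* Jacobi values from Table 3 (f_d approximated as 0.56%). */
    sedar_params jac = {8.92, 0.0056, 1.0 / 3600, 9.62 / 3600,
                        9.62 / 3600, 1.0, 8};
    for (int k = 0; k <= 4; k++)
        printf("k=%d: rollback %.2f hs, relaunch at X=50%%: %.2f hs\n",
               k, t_rollback(&jac, k), t_relaunch(&jac, 0.5));
    return 0;
}

Running it reproduces, up to rounding, the rollback column values of Table 5 below (about 9.5, 11.0, 13.5, 17.0 and 21.5 hs) and makes explicit that they do not depend on X, whereas the relaunch time does.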
To illustrate this idea, we have selected Jacobi's method as the test case (although the procedure can easily be applied to the other benchmarks). We have evaluated Equation 4 with three different values of X: the fault being detected near the beginning (X = 30%), in the middle (X = 50%), or toward the end (X = 80%) of the execution. The times obtained are compared with the ones derived from Equation 14, taking into account the following considerations. The reference time for this analysis is the one from Equation 3: the duration of an execution using the detection mechanism, without faults. In such a total execution time (8.97 hs, see Table 4), X = 30% means that the fault is detected at t = 2.69 hs. At this time, with t_i = 1 hour, only the first two checkpoints (CK0 and CK1) have been stored. Therefore, recovery must be possible either from CK1 (i.e. k = 0) or from CK0 (i.e. k = 1), so both values are admissible in Equation 14. The same reasoning has been followed for the other values of X. Table 5 summarizes this data.

Table 5: Execution time with the fault detected at X, with only detection and with different numbers of rollback attempts (k+1). N A: Not Admissible.

X [%]   Only detection [hs]   k+1 rollback attempts [hs]
                              k=0    k=1     k=2     k=3     k=4
30      11.66                 9.5    11.01   N A     N A     N A
50      13.46                 9.5    11.01   13.52   17.02   N A
80      16.16                 9.5    11.01   13.52   17.02   21.53
Based on the value of the parameter X, the second column shows the times obtained with detection, safe-stop and relaunching from the start (i.e. from Equation 4); the third column shows the times obtained by rewinding to the last checkpoint stored by that moment (i.e. from Equation 14 with k = 0); the fourth column shows the times obtained by rewinding to the last-but-one checkpoint (i.e. from Equation 14 with k = 1); and so on. Note that the rollback times depend only on k, not on X; what depends on X is which values of k are admissible. The value N A means that the corresponding value of k is not admissible in Equation 14 for that value of X, because the corresponding checkpoint has not yet been stored at that point of the execution.

For the analyzed values of X, the obtained results suggest that rolling back to the last stored checkpoint (k = 0), if possible, is always advisable when compared with stopping, notifying the error, and relaunching from the beginning. Even restarting from the last-but-one checkpoint (k = 1) is still convenient (obviously, if that checkpoint is not corrupted). However, if the fault is detected around the middle of the execution and two (or more) further rollbacks have to be made, it would have been preferable to stop and relaunch. In other words, if recovery from the last two checkpoints is not possible, trying from the previous ones is more expensive than simply halting and starting again. This is caused by the large overheads involved in re-executing the same computation several times, as well as in restoring checkpoints and making several restart attempts. As the application progresses, the balance shifts: if the fault is detected close to the end, even trying more rollbacks to recorded checkpoints can represent an improvement compared with stopping and relaunching.

Of course, there is no way of knowing beforehand which checkpoint will enable recovery. However, following this line of reasoning, if we force the time of Equation 4 to be less than or equal to that of Equation 14 with k = 0, we obtain

X <= (n t_{cs} + t_i/2 + T_{rest}) / (T_{prog}(1 + f_d))

which, for the Jacobi parameters, is approximately X <= 0.06: halting and relaunching is only convenient if the fault is detected very early in the execution. Proceeding analogously with k = 1, we obtain approximately X >= 0.23; that is, for X >= 0.23, even rolling back a span of up to 2 t_i can be convenient, as the advantage of rolling back a shorter span exceeds the checkpointing overhead.

Once again, although these are simple examples, they illustrate how useful conclusions can be drawn from the model of temporal behavior, allowing us to adapt the protection strategy based on the knowledge of the system parameters.

Conclusions and Future Work

Exascale computing presents several challenges to future generation computer systems, and guaranteeing reliability is one of them. The protection of MPI applications at the message level is a feasible and effective method for detecting, isolating, and avoiding the propagation of data corruption, considering the deep effects that a single transient fault can cause on all the processes that communicate. In this article, SEDAR has been presented as a methodology for detecting and recovering from silent errors in a manner that is agnostic to the application's algorithm. SEDAR consists of three complementary alternatives: detection only; recovery based on multiple system-level checkpoints; and recovery based on a single user-level checkpoint. The most remarkable conclusions are:

• The functional behavior in the presence of faults can be analytically described. We have built a model that considers all the fault scenarios on a well-known test application, as well as SEDAR's response to each scenario, thus showing the validity of the detection and recovery mechanisms.
• The predictions of the model can be empirically verified. Through controlled fault injection experiments, the reliability provided by the SEDAR strategies has been demonstrated.

• The temporal behavior of each SEDAR strategy can be characterized. By obtaining the execution parameters for applications with different communication patterns and computation-to-communication ratios, it has been shown that the different variants of SEDAR offer benefits both in execution time and in reliability. This becomes particularly profitable in long-lasting programs.

• SEDAR can be adapted to a given cost-performance trade-off. As each SEDAR strategy supplies a particular coverage but also has limitations and implementation costs, choosing among them allows adjusting the protection to the needs of a particular system.

• The temporal characterization can be used to extract useful protection guidelines. To illustrate this, it has been shown when it is beneficial to employ each SEDAR strategy.

• Both the viability and the efficacy of tolerating transient faults in the expected HPC exascale systems have been shown.

As ongoing and future lines of work, we can enumerate:

• Emulating non-deterministic calls, which is required to extend the scope of the applications that can be protected with SEDAR.

• Performing experimental validation with customized, non-coordinated user-level checkpoints, calculating the optimal checkpoint interval to minimize execution overhead, and measuring the relationship between the detection latency and the communication pattern.

• Refining the multiple-checkpoint-based recovery mechanism to optimally support several faults, and analytically modeling the temporal response in the presence of multiple unrelated faults.

• Implementing an automatic adaptation of the recovery strategy, i.e. dynamically starting the protection depending on the progress of the execution (based on the reasoning stated in Section 4.4).

As a final aim, integration with scalable architectures that use C/R for permanent fault tolerance [32] should be attempted. As SEDAR also provides scalable options for detection and recovery, fault tolerance for both types of errors could be achieved in the projected exascale systems. It is important to clarify that a production version of SEDAR is being developed, whereas the current implementation remains a prototype.
Acknowledgments
This research has been supported by the Agencia Estatal de Investigación (AEI), Spain, and the Fondo Europeo de Desarrollo Regional (FEDER) UE, under contract TIN2017-84875-P, and partially funded by a research collaboration agreement with the Fundación Escuelas Universitarias Gimbernat (EUG). In addition, this research has been supported by the Universidad Nacional de La Plata, Argentina, through the Programas de Incentivos. We would also like to thank the reviewers and the editors for their constructive feedback and significant contributions to this work.
References

[1] T. Martsinkevich, O. Subasi, O. Unsal, F. Cappello, and J. Labarta, "Fault-tolerant protocol for hybrid task-parallel message-passing applications," in Cluster Computing (CLUSTER), 2015 IEEE International Conference on. IEEE, 2015, pp. 563-570.
[2] A. Benoit, A. Cavelan, F. M. Ciorba, V. Le Fèvre, and Y. Robert, "Combining checkpointing and replication for reliable execution of linear workflows with fail-stop and silent errors," International Journal of Networking and Computing, vol. 9, no. 1, pp. 2-27, 2019.
[3] J. Elliott, M. Hoemmen, and F. Mueller, "Evaluating the impact of SDC on the GMRES iterative solver," in Parallel and Distributed Processing Symposium, 2014 IEEE 28th International. IEEE, 2014, pp. 1193-1202.
[4] A. Benoit, A. Cavelan, F. Cappello, P. Raghavan, Y. Robert, and H. Sun, "Coping with silent and fail-stop errors at scale by combining replication and checkpointing," Journal of Parallel and Distributed Computing, vol. 122, pp. 209-225, 2018.
[5] D. Fiala, F. Mueller, C. Engelmann, R. Riesen, K. Ferreira, and R. Brightwell, "Detection and correction of silent data corruption for large-scale high-performance computing," in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. IEEE Computer Society Press, 2012, p. 78.
[6] A. Shye, J. Blomstedt, T. Moseley, V. J. Reddi, and D. A. Connors, "PLR: A software approach to transient fault tolerance for multicore architectures," IEEE Transactions on Dependable and Secure Computing, vol. 6, no. 2, pp. 135-148, 2009.
[7] F. Cappello, A. Geist, W. Gropp, S. Kale, B. Kramer, and M. Snir, "Toward exascale resilience: 2014 update," Supercomputing Frontiers and Innovations, vol. 1, no. 1, pp. 5-28, 2014.
[8] G. Lu, Z. Zheng, and A. A. Chien, "When is multi-version checkpointing needed?" in Proceedings of the 3rd Workshop on Fault-tolerance for HPC at Extreme Scale. ACM, 2013, pp. 49-56.
[9] H. Mushtaq, Z. Al-Ars, and K. Bertels, "Efficient software-based fault tolerance approach on multicore platforms," in Proceedings of the Conference on Design, Automation and Test in Europe. EDA Consortium, 2013, pp. 921-926.
[10] G. Bosilca, R. Delmas, J. Dongarra, and J. Langou, "Algorithm-based fault tolerance applied to high performance computing," Journal of Parallel and Distributed Computing, vol. 69, no. 4, pp. 410-416, 2009.
[11] D. Montezanti, A. De Giusti, M. Naiouf, J. Villamayor, D. Rexachs, and E. Luque, "A methodology for soft errors detection and automatic recovery," in . IEEE, 2017, pp. 434-441.
[12] S. S. Mukherjee, J. Emer, and S. K. Reinhardt, "The soft error problem: An architectural perspective," in High-Performance Computer Architecture, 2005. HPCA-11. 11th International Symposium on. IEEE, 2005, pp. 243-247.
[13] D. Montezanti, E. Rucci, D. Rexachs, E. Luque, M. Naiouf, and A. De Giusti, "A tool for detecting transient faults in execution of parallel scientific applications on multicore clusters," Journal of Computer Science & Technology, vol. 14, pp. 32-38, 2014.
[14] Z. Chen, "Algorithm-based recovery for iterative methods without checkpointing," in Proceedings of the 20th International Symposium on High Performance Distributed Computing. ACM, 2011, pp. 73-84.
[15] M. Shantharam, S. Srinivasmurthy, and P. Raghavan, "Fault tolerant preconditioned conjugate gradient for sparse linear system solution," in Proceedings of the 26th ACM International Conference on Supercomputing. ACM, 2012, pp. 69-78.
[16] C. Engelmann and S. Böhm, "Redundant execution of HPC applications with MR-MPI," in Proceedings of the 10th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN), 2011, pp. 15-17.
[17] K. Ferreira, R. Riesen, R. Oldfield, J. Stearley, J. Laros, K. Pedretti, and T. Brightwell, "rMPI: increasing fault resiliency in a message-passing environment," Sandia National Laboratories, Albuquerque, NM, Tech. Rep. SAND2011-2488, 2011.
[18] G. Yalcin, O. S. Unsal, and A. Cristal, "Fault tolerance for multi-threaded applications by leveraging hardware transactional memory," in Proceedings of the ACM International Conference on Computing Frontiers. ACM, 2013, p. 4.
[19] X. Ni, E. Meneses, N. Jain, and L. V. Kalé, "ACR: Automatic checkpoint/restart for soft and hard error protection," in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. ACM, 2013, p. 7.
[20] N. Ali, S. Krishnamoorthy, N. Govind, and B. Palmer, "A redundant communication approach to scalable fault tolerance in PGAS programming models," in Parallel, Distributed and Network-Based Processing (PDP), 2011 19th Euromicro International Conference on. IEEE, 2011, pp. 24-31.
[21] A. Benoit, T. Hérault, V. L. Fèvre, and Y. Robert, "Replication is more efficient than you think," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, 2019, p. 89.
[22] D. M. Montezanti, D. Rexachs del Rosario, E. Rucci, E. Luque Fadón, M. Naiouf, and A. E. De Giusti, "Characterizing a detection strategy for transient faults in HPC," in Computer Science & Technology Series. XXI Argentine Congress of Computer Science. Selected Papers. Editorial de la Universidad Nacional de La Plata (EDULP), 2016, pp. 77-90.
[23] J. Panadero, A. Wong, D. Rexachs, and E. Luque, "P3S: A methodology to analyze and predict application scalability," IEEE Transactions on Parallel and Distributed Systems, vol. 29, no. 3, pp. 642-658, 2017.
[24] V. Puzyrev and J. M. Cela, "A review of block Krylov subspace methods for multisource electromagnetic modelling," Geophysical Journal International, vol. 202, no. 2, pp. 1241-1252, 2015.
[25] G. A. Reis, J. Chang, N. Vachharajani, R. Rangan, and D. I. August, "SWIFT: Software implemented fault tolerance," in Proceedings of the International Symposium on Code Generation and Optimization. IEEE Computer Society, 2005, pp. 243-254.
[26] J. Cao, K. Arya, R. Garg, S. Matott, D. K. Panda, H. Subramoni, J. Vienne, and G. Cooperman, "System-level scalable checkpoint-restart for petascale computing," in . IEEE, 2016, pp. 932-941.
[27] J. Ansel, K. Arya, and G. Cooperman, "DMTCP: Transparent checkpointing for cluster computations and the desktop," in Parallel & Distributed Processing, 2009. IPDPS 2009. IEEE International Symposium on. IEEE, 2009, pp. 1-12.
[28] G. Andrews, "Scientific computing," in Foundations of Multithreaded, Parallel and Distributed Computing. Addison-Wesley, 2000, ch. 11, pp. 527-585.
[29] E. Rucci, A. De Giusti, and F. Chichizola, "Parallel Smith-Waterman algorithm for DNA sequences comparison on different cluster architectures," in Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA'11), vol. 1. WorldComp, 2011, pp. 666-672. [Online]. Available: http://worldcomp-proceedings.com/proc/p2011/PDP5014.pdf
[30] L. Fialho, D. Rexachs, and E. Luque, "What is missing in current checkpoint interval models?" in . IEEE, 2011, pp. 322-332.
[31] J. T. Daly, "A higher order estimate of the optimum checkpoint interval for restart dumps," Future Generation Computer Systems, vol. 22, no. 3, pp. 303-312, 2006.
[32] M. Castro-León, H. Meyer, D. Rexachs, and E. Luque, "Fault tolerance at system level based on RADIC architecture," Journal of Parallel and Distributed Computing, vol. 86, pp. 98-111, 2015.