Reinit++: Evaluating the Performance of Global-Restart Recovery Methods For MPI Fault Tolerance
Giorgis Georgakoudis, Luanzheng Guo*, and Ignacio Laguna

Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, USA, {georgakoudis1, lagunaperalt1}@llnl.gov
EECS, UC Merced, USA, [email protected]

* Work performed during an internship at Lawrence Livermore National Laboratory.
Abstract.
Scaling supercomputers comes with an increase in failure rates due to the increasing number of hardware components. In standard practice, applications are made resilient through checkpointing data and restarting execution after a failure occurs to resume from the latest checkpoint. However, re-deploying an application incurs overhead by tearing down and re-instating execution, and possibly limits checkpoint retrieval to slow permanent storage.
In this paper we present Reinit++, a new design and implementation of the Reinit approach for global-restart recovery, which avoids application re-deployment. We extensively evaluate Reinit++ against the leading MPI fault-tolerance approach of ULFM, implementing global-restart recovery, and the typical practice of restarting an application, to derive new insight on performance. Experimentation with three different HPC proxy applications made resilient to withstand process and node failures shows that Reinit++ recovers much faster than restarting, up to 6×, or ULFM, up to 3×, and that it scales excellently as the number of MPI processes grows.

Introduction

HPC system performance scales by increasing the number of computing nodes and by increasing the processing and memory elements of each node. Furthermore, electronics continue to shrink, thus they are more susceptible to interference, such as radiation upsets or voltage fluctuations. Those trends increase the probability of a failure happening, either due to component failure or due to transient soft errors affecting electronics. Large HPC applications run for hours or days and use most, if not all, of the nodes of a supercomputer, thus they are vulnerable to failures, often leading to process or node crashes. Reportedly, the mean time between node failures on petascale systems has been measured to be 6.7 hours [26],
while worst-case projections [12] foresee that exascale systems may experience a failure even more frequently.

HPC applications often implement fault tolerance using checkpoints to restart execution, a method referred to as
Checkpoint-Restart (CR). Applications periodically store checkpoints, e.g., every few iterations of an iterative computation, and when a failure occurs, execution aborts and restarts again to resume from the latest checkpoint. Most scalable HPC applications follow the Bulk Synchronous Parallel (BSP) paradigm, hence CR with global, backward, non-shrinking recovery [21], also known as global-restart, naturally fits their execution. CR is straightforward to implement but requires re-deploying the whole application on a failure, re-spawning all processes on every node and re-initializing any application data structures. This method has significant overhead since a failure of a few processes, even a single process failure, requires complete re-deployment, although most of the processes survived the failure.

By contrast, User-level Fault Mitigation (ULFM) [4] extends MPI with interfaces for handling failures at the application level without restarting execution. The programmer is required to use the ULFM extensions to detect a failure and repair communicators, and either spawn new processes, for non-shrinking recovery, or continue execution with any survivor processes, for shrinking recovery. Although ULFM grants the programmer great flexibility to handle failures, it requires considerable effort to refactor the application to correctly and efficiently implement recovery.

Alternatively, Reinit [23,11] has been proposed as an easier-to-program approach, but equally capable of supporting global-restart recovery. Reinit extends MPI with a function call that sets a rollback point in the application. It transparently implements MPI recovery, by spawning new processes and mending the world communicator at the MPI runtime level. Thus, Reinit transparently ensures a consistent, initial MPI state akin to the state after MPI initialization. However, the existing implementation of Reinit [11] is hard to deploy, since it requires modifications to the job scheduler, and it is difficult to compare with ULFM, which only requires extensions to the MPI library. Notably, both the Reinit and ULFM approaches assume the application has checkpointing in place to resume execution at the application level.

Although a large bibliography [4,11,23,5,25,28,17,18,16,9,22] discusses the programming model and prototypes of those approaches, no study has presented an in-depth performance evaluation of them; most previous works either focus on individual aspects of each approach or perform limited-scale experiments. In this paper, we present an extensive evaluation using HPC proxy applications to contrast these two leading global-restart recovery approaches. Specifically, our contributions are:

– A new design and implementation of the Reinit approach, named Reinit++, using the latest Open MPI runtime. Our design and implementation supports recovery from either process or node failures, is high performance, and deploys easily by extending the Open MPI library. Notably, we present a precise definition of the failures it handles and the scope of this design and implementation.
– An extensive evaluation of the performance of the possible recovery approaches (CR, Reinit++, ULFM) using three HPC proxy applications (CoMD, LULESH, HPCCG), including file and in-memory checkpointing schemes.
– New insight from the results of our evaluation, which show that recovery under Reinit++ is up to 6× faster than CR and up to 3× faster than ULFM. Compared to CR, Reinit++ avoids the re-deployment overhead, while compared to ULFM, Reinit++ avoids interference during fault-free application execution and has less recovery overhead.

Background

This section presents an overview of the state-of-the-art approaches for MPI fault tolerance. Specifically, it provides an overview of the recovery models for applications and briefly discusses ULFM and Reinit, which represent the state of the art in MPI fault tolerance.
Recovery Models

There are several models for fault tolerance depending on the requirements of the application. Specifically, if all MPI processes must recover after a failure, recovery is global; otherwise, if some, but not all, of the MPI processes need to recover, then recovery is deemed local. Furthermore, applications can either recover by rolling back computation to an earlier point in time, defined as backward recovery, or, if they can continue computation without backtracking, recovery is deemed forward. Moreover, if recovery restores the number of MPI processes to resume execution, it is defined as non-shrinking, whereas if execution continues with whatever number of processes survives the failure, then recovery is characterized as shrinking. Global-restart implements global, backward, non-shrinking recovery, which fits most HPC applications that follow a bulk-synchronous paradigm where MPI processes have interlocked dependencies; thus it is the focus of this work.
ULFM

One of the state-of-the-art approaches for fault tolerance in MPI is User-level Fault Mitigation (ULFM) [4]. ULFM extends MPI to enable failure detection at the application level and provides a set of primitives for handling recovery. Specifically, ULFM taps into the existing error handling interface of MPI to implement user-level fault notification. Regarding its extensions to the MPI interface, we elaborate on communicators, since their extensions are a superset of those for other communication objects (windows, I/O). ULFM extends MPI with a revoke operation (MPI_Comm_revoke(comm)) to invalidate a communicator such that any subsequent operation on it raises an error. Also, it defines a shrink operation (MPI_Comm_shrink(comm, newcomm)) that creates a new communicator from an existing one after excluding any failed processes. Additionally, ULFM defines a collective agreement operation (MPI_Comm_agree(comm, flag)) which achieves consensus on the group of failed processes in a communicator and on the value of the integer variable flag.

Based on those extensions, MPI programmers are expected to implement their own recovery strategy tailored to their applications. ULFM operations are general enough to implement any type of recovery discussed earlier. However, this generality comes at the cost of complexity. Programmers need to understand the intricate semantics of those operations to correctly and efficiently implement recovery, and to restructure, possibly significantly, the application to explicitly handle failures. Although ULFM provides examples that prescribe the implementation of global-restart, the programmer must embed this in the code and refactor the application to function with the expectation that communicators may change during execution due to shrinking and merging, which is not ideal.
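To make the prescription concrete, the following condensed sketch outlines a non-shrinking repair step built from these operations. It is our own illustration, not code from ULFM's examples or from the applications we evaluate: it uses the MPIX_ prefix of the ULFM prototype, assumes the failure has already been detected through an error return code (with the error handler set to MPI_ERRORS_RETURN), and omits error checking as well as the rank reordering that a real code performs so that re-spawned processes reclaim the ranks of the failed ones.

#include <mpi.h>
#include <mpi-ext.h>   /* ULFM prototype extensions (MPIX_*) */

/* Rebuild a world-sized communicator after a failure was detected on 'comm'.
   Survivors call this collectively; 'app' is the application binary to spawn. */
static MPI_Comm ulfm_global_restart(MPI_Comm comm, int world_size, char *app)
{
    MPI_Comm shrunk, intercomm, repaired;
    int nsurvivors, nfailed;

    MPIX_Comm_revoke(comm);              /* interrupt pending operations everywhere */
    MPIX_Comm_shrink(comm, &shrunk);     /* communicator of survivors only */
    MPI_Comm_size(shrunk, &nsurvivors);
    nfailed = world_size - nsurvivors;

    /* Re-spawn the failed processes and merge them with the survivors. */
    MPI_Comm_spawn(app, MPI_ARGV_NULL, nfailed, MPI_INFO_NULL,
                   0, shrunk, &intercomm, MPI_ERRCODES_IGNORE);
    MPI_Intercomm_merge(intercomm, /*high=*/0, &repaired);

    /* A real implementation reorders ranks here (e.g., with MPI_Comm_split)
       so that replacements take over the ranks of the failed processes, and
       uses MPIX_Comm_agree to reach consensus that the repair succeeded. */
    return repaired;
}

Re-spawned processes obtain the inter-communicator through MPI_Comm_get_parent and participate in the same merge, a detail elided above; the point of the sketch is that even this prescribed global-restart path requires the application to orchestrate several collective repair steps itself.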
Reinit
Reinit [24,11] has been proposed as an alternative approach for implementing global-restart recovery, through a simpler interface compared to ULFM. The most recent implementation [11] of Reinit is limited in several aspects: (1) it requires modifying the job scheduler (SLURM), besides the MPI runtime, thus it is impractical to deploy and skews performance measurements due to crossing the interface between the job scheduler and the MPI runtime; (2) its implementation is not publicly available; (3) it is based on the MVAPICH2 MPI runtime, which makes comparisons with ULFM hard, since ULFM is implemented on the Open MPI runtime. Thus, we opt for a new design and implementation, named Reinit++ (available open-source at https://github.com/ggeorgakoudis/ompi/tree/reinit), which we present in detail in the next section.

Design of Reinit++

This section describes the programming interface of Reinit++, the assumptions for application deployment, process and node failure detection, and the recovery algorithm for global-restart. We also define the semantics of MPI recovery for the implementation of Reinit++, as well as discuss its specifics.

Programming Interface

Figure 1 presents the programming interface of Reinit++ in the C language, while Figure 2 shows sample usage of it. There is a single function call, MPI_Reinit, for the programmer to call to define the point in code to roll back to and resume execution from after a failure.

typedef enum {
    MPI_REINIT_NEW,
    MPI_REINIT_REINITED,
    MPI_REINIT_RESTARTED
} MPI_Reinit_state_t;

typedef int (*MPI_Restart_point)(int argc, char **argv,
                                 MPI_Reinit_state_t state);

int MPI_Reinit(int argc, char **argv, const MPI_Restart_point point);

Fig. 1: The programming interface of Reinit++

int foo(int argc, char **argv, MPI_Reinit_state_t state)
{
    /* Load checkpoint if it exists */
    while (!done) {
        /* Do computation */
        /* Store checkpoint */
    }
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    /* Application-specific initialization */
    // Entry point of the resilient function
    MPI_Reinit(&argc, &argv, foo);
    MPI_Finalize();
}

Fig. 2: Sample usage of the interface of Reinit++

MPI_Reinit must be called after MPI_Init to ensure the MPI runtime has been initialized. Its arguments imitate the parameters of MPI_Init, adding a parameter for a pointer to a user-defined function. Reinit++ expects the programmer to encapsulate in this function the main computational loop of the application, which is restartable through checkpointing. Internally, MPI_Reinit passes the parameters argc and argv to this user-defined function, plus the parameter state, which indicates the MPI state of the process as a value of the enumeration type MPI_Reinit_state_t. Specifically, the value MPI_REINIT_NEW designates a new process executing for the first time, the value MPI_REINIT_REINITED designates a survivor process that has entered the user-defined function after rolling back due to a failure, and the value MPI_REINIT_RESTARTED designates a process that has failed and has been re-spawned to resume execution. Note that this state variable describes only the MPI state of Reinit++, thus it has no semantics on the application state, such as whether to load a checkpoint or not.
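To illustrate this separation of concerns, the sketch below fleshes out the resilient function of Figure 2 under the assumption of a simple application-level checkpoint API; checkpoint_exists, checkpoint_load, checkpoint_store, and MAX_ITER are hypothetical names of our choosing, not part of Reinit++. Whether to load a checkpoint is decided solely from the presence of a checkpoint, while the Reinit++ state value is used only for reporting.

int foo(int argc, char **argv, MPI_Reinit_state_t state)
{
    int rank, iter = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* The Reinit++ state only reports what happened at the MPI level. */
    if (state == MPI_REINIT_RESTARTED)
        fprintf(stderr, "rank %d: re-spawned after a failure\n", rank);
    else if (state == MPI_REINIT_REINITED)
        fprintf(stderr, "rank %d: rolled back after a failure\n", rank);

    /* Application recovery is driven by the checkpoint, not by 'state'. */
    if (checkpoint_exists())
        checkpoint_load(&iter /*, application data */);

    for (; iter < MAX_ITER; ++iter) {
        /* Do computation */
        checkpoint_store(iter /*, application data */);
    }
    return 0;
}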
Fig. 3: Application deployment model: a single root process manages daemons, one per node, and each daemon manages its node-local MPI processes.
Application Deployment Model
Reinit++ assumes a logical, hierarchical topology of application deployment. Figure 3 shows a graphical representation of this deployment model. At the top level, there is a single root process that spawns and monitors daemon processes, one on each of the computing nodes reserved for the application. Daemons spawn and monitor the MPI processes local to their nodes. The root communicates with daemons and keeps track of their liveness, while daemons track the liveness of their children MPI processes. Based on this execution and deployment model, Reinit++ performs fault detection, which we discuss next.

Fault Detection
Reinit++ targets fail-stop failures of either MPI processes or daemons. A daemon failure is deemed equivalent to a node failure. The causes of those failures may be transient faults or hard faults of hardware components.
In the design of Reinit++, the root manages the execution of the whole application, so any recovery decisions are taken by it; hence it is the focal point for fault detection. Specifically, if an MPI process fails, its managing daemon is notified of the failure and forwards this notification to the root, without taking an action itself. If a daemon process fails, which means either the node failed or the daemon process itself did, the root directly detects the failure and also assumes that the children MPI processes of that daemon are lost too. After detecting a fault, the root process proceeds with recovery, which we introduce in the following section.

MPI Recovery
Reinit++ recovery for both MPI process and daemon failures is similar, except that on a daemon failure the root chooses a new host node to re-instate the failed MPI processes, since a daemon failure proxies a node failure. For recovery, the root process broadcasts a reinit message to all daemons. Daemons receiving that message roll back survivor processes and re-spawn failed ones. After rolling back survivor MPI processes and spawning new ones, the semantics of MPI recovery are that only the world communicator is valid and any previous MPI state (other communicators, windows, etc.) has been discarded. This is similar to the MPI state available immediately after an application calls MPI_Init. Next, the application restores its state, as discussed in the following section.
Application Recovery
Reinit++ assumes that applications are responsible for saving and restoring their state to resume execution. Hence, both survivor and re-spawned MPI processes should load a valid checkpoint after MPI recovery to restore application state and resume computation.

Implementation

We implement Reinit++ in the latest Open MPI runtime, version 4.0.0. The implementation supports recovery from both process and daemon (node) failures. This implementation does not presuppose any particular job scheduler, so it is compatible with any job scheduler the Open MPI runtime works with. Briefly introducing the Open MPI software architecture, it comprises three frameworks of distinct functionality: (i) the Open MPI layer (OMPI), which implements the interface of the MPI specification used by application developers; (ii) the Open MPI Runtime Environment (ORTE), which implements runtime functions for application deployment, execution monitoring, and fault detection; and (iii) the Open Portable Access Layer (OPAL), which implements abstractions of OS interfaces, such as signal handling, process creation, etc.
Reinit++ extends OMPI to provide the function MPI_Reinit. It extends ORTE to propagate fault notifications from daemons to the root and to implement the mechanism of MPI recovery on detecting a fault. Also, Reinit++ extends OPAL to implement low-level process signaling for notifying survivor processes to roll back. The following sections provide more details.

Application Deployment
Reinit++ requires the application to deploy using the default launcher of Open MPI, mpirun. Note that using the launcher mpirun is compatible with any job scheduler and even uses optimized deployment interfaces, if the scheduler provides any. Physical application deployment in Open MPI closely follows the logical model of the design of Reinit++. Specifically, Open MPI sets the root of the deployment at the process launching mpirun, typically on a login node of an HPC installation, which is deemed the Head Node Process (HNP) in Open MPI terminology. Following, the root launches an ORTE daemon on each node allocated for the application. Daemons spawn the set of MPI processes on each node and monitor their execution. The root process communicates with each daemon over a channel of a reliable network transport and monitors the liveness of daemons through the existence of this channel.
When launching an application, the user specifies the number of MPI processes and optionally the number of nodes (or the number of processes per node). To withstand process failures, this specification of deployment is sufficient, since Reinit++ re-spawns failed processes on their original node of deployment.
However, for node failures, the user must over-provision the allocated process slots for re-spawning the set of MPI processes lost due to a failed node. To do so, the most straightforward way is to allocate more nodes than required for fault-free operation, up to the maximum number of node failures to withstand.

Algorithm 1: Root: HandleFailure
Data: D: the set of daemons; Children(x): the set of children MPI processes of daemon x; Parent(x): the parent daemon of MPI process x
Input: the failed process f (MPI process or daemon)

if f ∈ D then                            // the failed process is a daemon
    D ← D \ {f}
    d' ← arg min_{d ∈ D} |Children(d)|   // least loaded surviving daemon
    Broadcast to D the message ⟨REINIT, {⟨d', c⟩ | ∀ c ∈ Children(f)}⟩
else                                     // the failed process is an MPI process
    Broadcast to D the message ⟨REINIT, {⟨Parent(f), f⟩}⟩
end

Fault Detection
In Open MPI, a daemon is the parent of the MPI processes on its node. If an MPI process crashes, its parent daemon is notified by trapping the signal
SIGCHLD, following POSIX semantics. Implementing the fault detection requirements of Reinit++, a daemon relays the fault notification to the root process, which takes action. Regarding node failures, the root detects them directly, proxied through daemon failures. Specifically, the root has an open communication channel with each daemon over some reliable transport, e.g., TCP. If the connection over that communication channel breaks, the root process is notified of the failure and regards the daemon as failed, thus assuming that all its children MPI processes are lost and its host node is unavailable. For both types of failures (process and node), the root process initiates MPI recovery.
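As a rough illustration of the daemon-side detection path, the snippet below shows how a POSIX SIGCHLD handler can reap a crashed child and flag it for notification; report_failure_to_root is a placeholder of our choosing, since the actual forwarding in Reinit++ happens through ORTE's internal messaging rather than any code shown here.

#include <signal.h>
#include <sys/wait.h>

static void on_sigchld(int sig)
{
    int status;
    pid_t pid;
    (void)sig;
    /* Reap every child that changed state; a child killed by a signal or
       exiting with a non-zero code is treated as a failed MPI process. */
    while ((pid = waitpid(-1, &status, WNOHANG)) > 0) {
        if (WIFSIGNALED(status) ||
            (WIFEXITED(status) && WEXITSTATUS(status) != 0))
            report_failure_to_root(pid);   /* placeholder for the real notification */
    }
}

static void install_child_monitor(void)
{
    struct sigaction sa;
    sa.sa_handler = on_sigchld;
    sa.sa_flags = SA_RESTART;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGCHLD, &sa, NULL);         /* typically done once at daemon startup */
}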
MPI Recovery

Algorithm 1 shows in pseudocode the operation of the root process when handling a failure. On detecting a failure, the root process distinguishes whether it is a faulty daemon or MPI process. For a node failure, the root selects the least loaded node in the resource allocation, that is, the node with the fewest occupied process slots, and sets this node's daemon as the parent daemon for the failed processes. For a process failure, the root selects the original parent daemon of the failed process to re-spawn that process. Next, the root process initiates recovery by broadcasting to all daemons a message with the
REINIT command and the list of processes to spawn, along with their selected parent daemons. When a daemon receives that message, it signals its survivor children MPI processes to roll back and re-spawns any processes in the list that have this daemon as their parent. Algorithm 2 presents this procedure in pseudocode.

Algorithm 2: Daemon d̂: HandleReinit
Data: Children(x): the set of children MPI processes of daemon x; Parent(x): the parent daemon of MPI process x
Input: list {⟨d_i, c_i⟩, ...} of processes to spawn and their designated parent daemons

// Signal survivor MPI processes
for c ∈ Children(d̂) do
    c.state ← MPI_REINIT_REINITED
    Signal SIGREINIT to c
end
// Spawn new processes for which d̂ is the parent
foreach ⟨d_i, c_i⟩ do
    if d̂ == d_i then
        Children(d̂) ← Children(d̂) ∪ {c_i}
        c_i.state ← MPI_REINIT_RESTARTED
        Spawn c_i
    end
end

Regarding the asynchronous signaling interface of Reinit++, Algorithm 3 illustrates the internals of Reinit++ in pseudocode. When an MPI process executes MPI_Reinit, it installs a signal handler for the signal
SIGREINIT, which aliases SIGUSR1 in our implementation. Also, MPI_Reinit sets a non-local goto point using the POSIX function setjmp(). The signal handler of SIGREINIT simply calls longjmp() to return execution of survivor processes to this goto point. Rolled-back survivor processes discard any previous MPI state and block on an ORTE-level barrier. This barrier replicates the implicit barrier present in MPI_Init to synchronize with re-spawned processes joining the computation. After the barrier, survivor processes re-initialize the world communicator and call the function foo to resume computation. Re-spawned processes initialize the world communicator as part of the MPI initialization procedure of MPI_Init and go through MPI_Reinit to install the signal handler, set the goto point, and lastly call the user-defined function to resume computation.

Application Recovery

Application recovery includes the actions needed at the application level to resume computation. Any additional MPI state besides the repaired world communicator, such as sub-communicators, must be re-created by the application's MPI processes. Also, it is expected that each process loads the latest consistent checkpoint to continue computing. Checkpointing lies within the responsibility of the application developer. In the next section, we discuss the scope and implications of our implementation.
Algorithm 3: Reinit++ internals

Function OnSignalReinit():
    goto Rollback
end
Function MPI_Reinit(argc, argv, foo):
    Install signal handler OnSignalReinit on SIGREINIT
    Rollback:
    if this.state == MPI_REINIT_REINITED then
        Discard MPI state
        Wait on barrier
        Re-initialize world communicator
    end
    return foo(argc, argv, this.state)
end
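A minimal C sketch of this control flow follows; it is our own illustration of Algorithm 3 rather than an excerpt of the Open MPI code, and the barrier and communicator-repair steps are placeholders for internal ORTE/OMPI routines.

#include <setjmp.h>
#include <signal.h>

#define SIGREINIT SIGUSR1            /* SIGREINIT aliases SIGUSR1 */

static sigjmp_buf rollback_point;
static volatile MPI_Reinit_state_t proc_state = MPI_REINIT_NEW;

static void on_sigreinit(int sig)
{
    (void)sig;
    siglongjmp(rollback_point, 1);   /* return survivors to the rollback point */
}

int MPI_Reinit(int argc, char **argv, const MPI_Restart_point point)
{
    struct sigaction sa;
    sa.sa_handler = on_sigreinit;
    sa.sa_flags = 0;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGREINIT, &sa, NULL);

    if (sigsetjmp(rollback_point, 1) != 0) {
        /* A survivor process lands here after its daemon signals SIGREINIT. */
        proc_state = MPI_REINIT_REINITED;
        /* Discard previous MPI state, wait on the ORTE-level barrier, and
           re-initialize the world communicator (internal routines, not shown). */
    }
    /* A re-spawned process arrives with proc_state set to MPI_REINIT_RESTARTED
       during MPI_Init (not shown) and simply falls through to the call below. */
    return point(argc, argv, proc_state);
}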
Discussion

In this implementation, the scope of fault tolerance is to support recovery from failures happening after MPI_Reinit has been called by all MPI processes. This is because MPI_Reinit must install signal handlers and set the rollback point on all MPI processes. This is sufficient for a large coverage of failures, since execution time is dominated by the main computational loop. In case a failure happens before the call to MPI_Reinit, the application falls back to the default action of aborting execution. Nevertheless, the design of Reinit++ is not limited by this implementation choice. A possible approach instead of aborting, which we leave as future work, is to treat any MPI processes that have not called MPI_Reinit as if they failed and re-execute them.

Furthermore, signaling SIGREINIT to roll back survivor MPI processes asynchronously interrupts execution. In our implementation, we render the MPI runtime library signal and rollback safe by using masking to defer signal handling until a safe point, i.e., avoiding interruption when locks are held or data structures are being updated. Since application code is out of our control, Reinit++ requires the application developer to program the application to be signal and rollback safe. A possible enhancement is to provide an interface for installing cleanup handlers, proposed in earlier designs of Reinit [22], so that application and library developers can install routines to reset application-level state on recovery. Another approach is to make recovery synchronous, by extending the Reinit++ interface to include a function that tests whether a fault has been detected and triggers rollback. The developer may call this function at safe points during execution for recovery. We leave both those enhancements as future work, noting that the existing interface is sufficient for performing our evaluation.
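Regarding the masking mentioned above, the runtime can bracket a non-interruptible region in the style sketched below, so that a pending SIGREINIT is delivered only once the region is left; this is an illustration of the idea, not an excerpt of the Reinit++ code.

#include <signal.h>
#include <pthread.h>

void critical_update(void)
{
    sigset_t block, old;
    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);                /* SIGREINIT aliases SIGUSR1 */

    pthread_sigmask(SIG_BLOCK, &block, &old);
    /* ... unsafe region: locks held, internal data structures updated ... */
    pthread_sigmask(SIG_SETMASK, &old, NULL);  /* pending SIGREINIT delivered here */
}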
Experimental Setup

This section provides detailed information on the experimentation setup, the recovery approaches used for comparisons, the proxy applications and their configurations, and the measurement methodology.

Table 1: Proxy applications and their configuration

Application  Input                                No. ranks
CoMD         -i4 -j2 -k2 -x 80 -y 40 -z 40 -N 20  16, 32, 64, 128, 256, 512, 1024
HPCCG        64 64 64                             16, 32, 64, 128, 256, 512, 1024
LULESH       -i 20 -s 48                          8, 64, 512
Recovery approaches
Experimentation includes the following recovery approaches:
– CR, which implements the typical approach of immediately restarting an application after execution aborts due to a failure.
– ULFM, using its latest revision based on the Open MPI runtime v4.0.1 (4.0.1ulfm2.1rc1).
– Reinit++, which is our own implementation of Reinit, based on the Open MPI runtime v4.0.0.

Emulating failures
Failures are emulated through fault injection. We opt for random fault injection to emulate the occurrence of random faults, e.g., soft errors or failures of hardware components, that lead to a crash failure. Specifically, for process failures, we instrument applications so that at a random iteration of the main computational loop a random MPI process kills itself by raising the signal SIGKILL. The random selection of iteration and MPI process is the same for every recovery approach. For node failures, the method is similar, but instead of killing itself, the MPI process sends the signal SIGKILL to its parent daemon, thus killing the daemon and, by extension, all its children processes. In each experiment, we inject a single MPI process failure or a single node failure.
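The injection hook itself is small; a sketch follows, where fail_iter and fail_rank stand for the randomly chosen but pre-agreed iteration and rank (e.g., drawn on rank 0 and broadcast), names of our choosing rather than of the benchmarks.

#include <signal.h>
#include <unistd.h>

/* Called once per iteration of the main computational loop. */
static void maybe_inject(int iter, int rank, int fail_iter, int fail_rank,
                         int node_failure)
{
    if (iter == fail_iter && rank == fail_rank) {
        if (node_failure)
            kill(getppid(), SIGKILL);  /* kill the parent daemon: emulates a node failure */
        else
            raise(SIGKILL);            /* kill this MPI process: emulates a process failure */
    }
}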
Applications
We experiment with three benchmark applications that represent different HPC domains: CoMD for molecular dynamics, HPCCG for iterative solvers, and LULESH for multi-physics computation. The motivation is to investigate global-restart recovery on a wide range of applications and evaluate any performance differences. Table 1 shows information on the proxy applications and the scaling of their deployed number of ranks. Note that LULESH requires a cube number of ranks, hence its trimmed-down experimentation space. The deployment configuration has 16 ranks per node, so the smallest deployment comprises one node while the largest spans 64 nodes (1024 ranks). Applications execute in weak scaling mode; for CoMD we show the input for 16 ranks only and scale it accordingly. We extend the applications to implement global-restart with Reinit++ or ULFM, to store a checkpoint after every iteration of their main computational loop, and to load the latest checkpoint upon recovery.

Table 2: Checkpointing per recovery approach and failure type
Failure   CR    ULFM    Reinit
process   file  memory  memory
node      file  file    file
Checkpointing
For evaluation purposes, we implement our own simple checkpointing library that supports saving and loading application data using in-memory and file checkpoints. Table 2 summarizes checkpointing per recovery approach and failure type. In detail, we implement two types of checkpointing: file and memory. For file checkpointing, each MPI process stores a checkpoint to globally accessible permanent storage, which is the networked, parallel filesystem Lustre available in our cluster. For memory checkpointing, an MPI process stores a checkpoint both locally in its own memory and remotely in the memory of a buddy [36,35] MPI process, which in our implementation is the (cyclically) next MPI process by rank. This memory checkpointing implementation is applicable only to single process failures, since multiple process failures or a node failure can wipe out both the local and buddy checkpoints of the failed MPI processes. CR necessarily uses file checkpointing, since re-deploying the application requires permanent storage to retrieve checkpoints.
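The buddy scheme reduces, in essence, to a local copy plus one exchange per checkpoint, as in the sketch below; app_state, nbytes, and the two copy buffers are hypothetical placeholders for whatever data the application checkpoints, and the actual library additionally versions checkpoints to keep the latest consistent one.

#include <mpi.h>
#include <string.h>

/* In-memory buddy checkpoint: keep a local copy and mirror the data to the
   (cyclically) next rank, while holding the previous rank's copy for it. */
void buddy_checkpoint(const void *app_state, size_t nbytes,
                      void *local_copy, void *buddy_copy, MPI_Comm comm)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);
    int right = (rank + 1) % nprocs;           /* buddy that stores our copy */
    int left  = (rank - 1 + nprocs) % nprocs;  /* rank whose copy we store */

    memcpy(local_copy, app_state, nbytes);
    MPI_Sendrecv(app_state, (int)nbytes, MPI_BYTE, right, 0,
                 buddy_copy, (int)nbytes, MPI_BYTE, left, 0,
                 comm, MPI_STATUS_IGNORE);
}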
Statistical evaluation
For each proxy application and configuration we perform 10 independent measurements. Each measurement counts the total execution time of the application, breaking it down into the time needed for writing checkpoints, the time spent during MPI recovery, the time reading a checkpoint after a failure, and the pure application time executing the computation. Any confidence intervals shown correspond to a 95% confidence level and are calculated based on the t-distribution to avoid assumptions on the sampled population's distribution.
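Concretely, for n = 10 runs with sample mean x̄ and sample standard deviation s, the reported interval is the standard t-interval, stated here for completeness:

\bar{x} \pm t_{0.975,\,n-1}\,\frac{s}{\sqrt{n}}, \qquad n = 10,\quad t_{0.975,\,9} \approx 2.262 .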
Results

For the evaluation, we compare CR, Reinit++, and ULFM for both process and node failures. The results provide insight into the performance of each of those recovery approaches implementing global-restart and reveal the reasons for their performance differences.

Figure 4 shows the average total execution time for process failures, using file checkpointing for CR and memory checkpointing for Reinit++ and ULFM. The plot breaks down time into the components of writing checkpoints, MPI recovery, and pure
application time. Reading checkpoints occurs once after a failure and has negligible impact, in the order of tens of milliseconds, thus it is omitted.

Fig. 4: Total execution time breakdown recovering from a process failure; panels (a) CoMD, (b) HPCCG, (c) LULESH show time (s) versus ranks for CR, ULFM, and Reinit++, split into write-checkpoint, application, and recovery time.

The first observation is that Reinit++ scales excellently compared to both CR and ULFM, across all programs. CR has the worst performance, increasingly so with more ranks. The reason is the limited scaling of writing checkpoints to the networked filesystem. By contrast, ULFM and Reinit++ use memory checkpointing, spending minimal time writing checkpoints. Interestingly, ULFM scales worse than Reinit++; we believe the reason is that it inflates pure application execution time, which we illustrate in the next section. Further, in the following sections, we remove checkpointing overhead from the analysis to highlight the performance differences of the recovery approaches.

Figure 5 shows the pure application time, without including reading/writing checkpoints or MPI recovery. We observe that application time is on par for CR and Reinit++, and that all applications weak-scale well up to 1024 ranks. CR and Reinit++ do not interfere with execution, thus they have no impact on application time, which is on par with the fault-free execution time of the
proxy applications. However, in ULFM, application time grows significantly as the number of ranks increases. ULFM extends MPI with an always-on, periodic heartbeat mechanism [8] to detect failures and also modifies communication primitives for fault-tolerant operation. Following from our measurements, those extensions noticeably increase the original application execution time. However, it is inconclusive whether this is a result of the tested prototype implementation or a systemic trade-off. Next, we compare the MPI recovery times among all the approaches.

Fig. 5: Scaling of pure application time; panels (a) CoMD, (b) HPCCG, (c) LULESH show time (s) versus ranks for CR, ULFM, and Reinit++.

Fig. 6: Scaling of MPI recovery time recovering from a process failure; panels (a) CoMD, (b) HPCCG, (c) LULESH show time (s) versus ranks for CR, ULFM, and Reinit++.
Though checkpointing saves the application's computation time, reducing MPI recovery time saves overhead from restarting. This overhead is increasingly important the larger the deployment and the higher the fault rate. In particular, Figure 6 shows the scaling of the time required for MPI recovery across all programs and recovery approaches, again removing any checkpointing overhead to focus on the MPI recovery time. As expected, MPI recovery time depends only on the number of ranks, thus times are similar among different programs for the same recovery approach. Commenting on scaling, CR and Reinit++ scale excellently, requiring almost constant time for MPI recovery regardless of the number of ranks. However, CR is about 6× slower, requiring around 3 seconds to tear down execution and re-deploy the application, whereas Reinit++ requires about 0.5 seconds
to propagate the fault, re-initialize survivor processes, and re-spawn the failed process. ULFM has on-par recovery time with Reinit++ up to 64 ranks, but then its time increases, being up to 3× slower than Reinit++ at 1024 ranks. ULFM requires multiple collective operations among all MPI processes to implement global-restart (shrink the faulty communicator, spawn a new process, merge it into a new communicator). By contrast, Reinit++ implements recovery at the MPI runtime layer, requiring fewer operations and confining collective communication to only the root and daemon processes.

Fig. 7: Scaling of MPI recovery time recovering from a node failure; panels (a) CoMD, (b) HPCCG, (c) LULESH show time (s) versus ranks for CR and Reinit++.

The comparison for a node failure includes only CR and Reinit++, since the prototype implementation of ULFM faced robustness issues (hanging or crashing) and did not produce measurements. Also, since both CR and Reinit++ use file checkpointing and do not interfere with pure application time, we present only results for MPI recovery times, shown in Figure 7. Both CR and Reinit++ scale very well with almost constant times, as they do for a process failure. However, in absolute values, Reinit++ has a higher recovery time of about 1.5 seconds for a node failure, compared to 0.5 seconds for a process failure. This is because recovering from a node failure requires extra work to select the least loaded node and spawn all the MPI processes of the failed node. Nevertheless, recovery with Reinit++ is still about 2× faster than with CR.

Related Work

Checkpoint-Restart [15,29,2,31,34,10,1,20] is the most common approach to recover an MPI application after a failure. CR requires substantial development effort to identify which data to checkpoint and may have significant overhead. Thus, many efforts attempt to make checkpointing easier to adopt and to render it fast and storage efficient. We briefly discuss them here.

Hargrove and Duell [15] implement the system-level Berkeley Lab Checkpoint/Restart (BLCR) library to automatically checkpoint applications by extending the Linux kernel. Bosilca et al. [6] integrate an uncoordinated, distributed checkpoint/roll-back system in the MPICH runtime to automatically support fault tolerance for node failures. Furthermore, Sankaran et al. [29] integrate the Berkeley Lab BLCR kernel-level C/R into the LAM implementation of MPI. Adam et al. [2], SCR [27], and FTI [3] propose asynchronous, multi-level checkpointing techniques that significantly improve checkpointing performance. Shahzad et al. [30] provide an extensive interface that simplifies the implementation of application-level checkpointing and recovery. Advances in checkpointing are beneficial not only for CR but also for other MPI fault tolerance approaches, such as ULFM and Reinit. Though making checkpointing faster resolves this bottleneck, the overhead of re-deploying the full application remains.

ULFM [4,5] is the state-of-the-art MPI fault tolerance approach, pursued by the MPI Fault Tolerance Working Group. ULFM extends MPI with interfaces to shrink or revoke communicators, and with fault-tolerant collective consensus. The application developer is responsible for implementing recovery using those operations, choosing the type of recovery best suited to their application. A collection of works on ULFM [25,28,17,18,16,9,22] has investigated the applicability of ULFM and benchmarked individual operations of it. Bosilca et al. [7,8] and Katti et al. [19] propose efficient fault detection algorithms to integrate with ULFM. Teranishi et al.
[33] use spare processes to replace failed processes for local recovery, so as to accelerate recovery with ULFM. Even though ULFM gives developers the flexibility to implement any type of recovery, it requires significant developer effort to refactor the application. Also, implementing ULFM has been identified by previous work [33,14] to suffer from scalability issues, as our experimentation shows too. Fenix [13] provides a simplified abstraction layer atop ULFM to implement global-restart recovery. However, we choose to use ULFM directly, since it already provides a straightforward, prescribed solution for implementing global-restart.

Reinit [24,11] is an alternative solution that supports only global-restart recovery and provides an easy-to-use interface to developers. Previous designs and implementations of Reinit have limited applicability because they require modifying the job scheduler and its interface with the MPI runtime. We present Reinit++, a new design and implementation of Reinit using the latest Open MPI runtime, and thoroughly evaluate it.

Lastly, Sultana et al. [32] propose MPI Stages to reduce the overhead of global-restart recovery by checkpointing MPI state, so that rolling back does not have to re-create it. While this approach is interesting, it is still in proof-of-concept status. How to maintain consistent checkpoints of MPI state across all MPI processes, and doing so fast and efficiently, is still an open problem.

Conclusion

We have presented Reinit++, a new design and implementation of the global-restart approach of Reinit. Reinit++ recovers from both process and node crash failures, by spawning new processes and mending the world communicator, requiring from the programmer only to provide a rollback point in execution and to have checkpointing in place. Our extensive evaluation against the state-of-the-art approaches Checkpoint-Restart (CR) and ULFM shows that Reinit++ scales excellently as the number of ranks grows, achieving almost constant recovery time, being up to 6× faster than CR and up to 3× faster than ULFM. For future work, we plan to expand Reinit to support more recovery strategies besides global-restart, including shrinking and forward recovery strategies, to maintain its implementation, and to expand the experimentation with more applications and larger deployments.

Acknowledgments
The authors would like to thank the anonymous referees for their valuable comments and helpful suggestions. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344 (LLNL-CONF-800061).
References
1. Adam, J., Besnard, J.B., Malony, A.D., Shende, S., Pérache, M., Carribault, P., Jaeger, J.: Transparent high-speed network checkpoint/restart in MPI. In: Proceedings of the 25th European MPI Users' Group Meeting, p. 12 (2018)
2. Adam, J., Kermarquer, M., Besnard, J.B., Bautista-Gomez, L., Pérache, M., Carribault, P., Jaeger, J., Malony, A.D., Shende, S.: Checkpoint/restart approaches for a thread-based MPI runtime. Parallel Computing, 204–219 (2019)
3. Bautista-Gomez, L., Tsuboi, S., Komatitsch, D., Cappello, F., Maruyama, N., Matsuoka, S.: FTI: High performance fault tolerance interface for hybrid systems. In: SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–12 (Nov 2011). https://doi.org/10.1145/2063384.2063427
4. Bland, W., Bouteiller, A., Herault, T., Bosilca, G., Dongarra, J.: Post-failure recovery of MPI communication capability: Design and rationale. The International Journal of High Performance Computing Applications (3), 244–254 (2013)
5. Bland, W., Lu, H., Seo, S., Balaji, P.: Lessons learned implementing user-level failure mitigation in MPICH. In: 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (2015)
6. Bosilca, G., Bouteiller, A., Cappello, F., Djilali, S., Fedak, G., Germain, C., Herault, T., Lemarinier, P., Lodygensky, O., Magniette, F., et al.: MPICH-V: Toward a scalable fault tolerant MPI for volatile nodes. In: SC '02: Proceedings of the 2002 ACM/IEEE Conference on Supercomputing, pp. 29–29. IEEE (2002)
7. Bosilca, G., Bouteiller, A., Guermouche, A., Herault, T., Robert, Y., Sens, P., Dongarra, J.: Failure detection and propagation in HPC systems. In: SC '16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 312–322 (2016)
8. Bosilca, G., Bouteiller, A., Guermouche, A., Herault, T., Robert, Y., Sens, P., Dongarra, J.: A failure detector for HPC platforms. The International Journal of High Performance Computing Applications (1), 139–158 (2018). https://doi.org/10.1177/1094342017711505
11. Concurrency and Computation: Practice and Experience, e4863. https://doi.org/10.1002/cpe.4863
12. Dongarra, J., Beckman, P., Moore, T., Aerts, P., Aloisio, G., Andre, J.C., Barkai, D., Berthou, J.Y., Boku, T., Braunschweig, B., Cappello, F., Chapman, B., Chi, X., Choudhary, A., Dosanjh, S., Dunning, T., Fiore, S., Geist, A., Gropp, B., Harrison, R., Hereld, M., Heroux, M., Hoisie, A., Hotta, K., Jin, Z., Ishikawa, Y., Johnson, F., Kale, S., Kenway, R., Keyes, D., Kramer, B., Labarta, J., Lichnewsky, A., Lippert, T., Lucas, B., Maccabe, B., Matsuoka, S., Messina, P., Michielse, P., Mohr, B., Mueller, M.S., Nagel, W.E., Nakashima, H., Papka, M.E., Reed, D., Sato, M., Seidel, E., Shalf, J., Skinner, D., Snir, M., Sterling, T., Stevens, R., Streitz, F., Sugar, B., Sumimoto, S., Tang, W., Taylor, J., Thakur, R., Trefethen, A., Valero, M., Van Der Steen, A., Vetter, J., Williams, P., Wisniewski, R., Yelick, K.: The international exascale software project roadmap. Int. J. High Perform. Comput. Appl. (1), 3–60 (Feb 2011). https://doi.org/10.1177/1094342010391989
13. Gamell, M., Katz, D.S., Kolla, H., Chen, J., Klasky, S., Parashar, M.: Exploring automatic, online failure recovery for scientific applications at extreme scales. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 895–906. SC '14, IEEE Press, Piscataway, NJ, USA (2014). https://doi.org/10.1109/SC.2014.78
14. Gamell, M., Teranishi, K., Heroux, M.A., Mayo, J., Kolla, H., Chen, J., Parashar, M.: Local recovery and failure masking for stencil-based applications at extreme scales. In: SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–12 (2015)
15. Hargrove, P.H., Duell, J.C.: Berkeley Lab Checkpoint/Restart (BLCR) for Linux clusters. In: Journal of Physics: Conference Series, vol. 46, p. 494 (2006)
16. Herault, T., Bouteiller, A., Bosilca, G., Gamell, M., Teranishi, K., Parashar, M., Dongarra, J.: Practical scalable consensus for pseudo-synchronous distributed systems. In: SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–12 (2015)
17. Hori, A., Yoshinaga, K., Herault, T., Bouteiller, A., Bosilca, G., Ishikawa, Y.: Sliding substitution of failed nodes. In: Proceedings of the 22nd European MPI Users' Group Meeting, p. 14. ACM (2015)
18. Katti, A., Di Fatta, G., Naughton, T., Engelmann, C.: Scalable and fault tolerant failure detection and consensus. In: Proceedings of the 22nd European MPI Users' Group Meeting, p. 13 (2015)
19. Katti, A., Di Fatta, G., Naughton, T., Engelmann, C.: Epidemic failure detection and consensus for extreme parallelism. The International Journal of High Performance Computing Applications (5), 729–743 (2018)
20. Kohl, N., Hötzer, J., Schornbaum, F., Bauer, M., Godenschwager, C., Köstler, H., Nestler, B., Rüde, U.: A scalable and extensible checkpointing scheme for massively parallel simulations. The International Journal of High Performance Computing Applications (4), 571–589 (2019)
21. Laguna, I., Richards, D.F., Gamblin, T., Schulz, M., de Supinski, B.R.: Evaluating user-level fault tolerance for MPI applications. In: Proceedings of the 21st European MPI Users' Group Meeting, pp. 57:57–57:62. EuroMPI/ASIA '14, ACM, New York, NY, USA (2014). https://doi.org/10.1145/2642769.2642775
22. Laguna, I., Richards, D.F., Gamblin, T., Schulz, M., de Supinski, B.R.: Evaluating user-level fault tolerance for MPI applications. In: Proceedings of the 21st European MPI Users' Group Meeting, pp. 57:57–57:62. EuroMPI/ASIA '14, ACM, New York, NY, USA (2014). https://doi.org/10.1145/2642769.2642775
23. Laguna, I., Richards, D.F., Gamblin, T., Schulz, M., de Supinski, B.R., Mohror, K., Pritchard, H.: Evaluating and extending user-level fault tolerance in MPI applications. The International Journal of High Performance Computing Applications (3), 305–319 (2016)
24. Laguna, I., Richards, D.F., Gamblin, T., Schulz, M., de Supinski, B.R., Mohror, K., Pritchard, H.: Evaluating and extending user-level fault tolerance in MPI applications. The International Journal of High Performance Computing Applications (3), 305–319 (2016). https://doi.org/10.1177/1094342015623623
25. Losada, N., Cores, I., Martín, M.J., González, P.: Resilient MPI applications using an application-level checkpointing framework and ULFM. The Journal of Supercomputing (1) (2017)
26. Martino, C.D., Kalbarczyk, Z., Iyer, R.K., Baccanico, F., Fullop, J., Kramer, W.: Lessons learned from the analysis of system failures at petascale: The case of Blue Waters. In: 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, pp. 610–621 (June 2014). https://doi.org/10.1109/DSN.2014.62
27. Mohror, K., Moody, A., Bronevetsky, G., de Supinski, B.R.: Detailed modeling and evaluation of a scalable multilevel checkpointing system. IEEE Transactions on Parallel and Distributed Systems (9), 2255–2263 (Sep 2014). https://doi.org/10.1109/TPDS.2013.100
28. Pauli, S., Kohler, M., Arbenz, P.: A fault tolerant implementation of multi-level Monte Carlo methods. Parallel Computing: Accelerating Computational Science and Engineering (CSE), 471–480 (2014)
29. Sankaran, S., Squyres, J.M., Barrett, B., Sahay, V., Lumsdaine, A., Duell, J., Hargrove, P., Roman, E.: The LAM/MPI checkpoint/restart framework: System-initiated checkpointing. JHPCA (4), 479–493 (2005)
30. Shahzad, F., Thies, J., Kreutzer, M., Zeiser, T., Hager, G., Wellein, G.: CRAFT: A library for easier application-level checkpoint/restart and automatic fault tolerance. IEEE Transactions on Parallel and Distributed Systems (3), 501–514 (2018)
31. Subasi, O., Martsinkevich, T., Zyulkyarov, F., Unsal, O., Labarta, J., Cappello, F.: Unified fault-tolerance framework for hybrid task-parallel message-passing applications. The International Journal of High Performance Computing Applications (5), 641–657 (2018)
32. Sultana, N., Rüfenacht, M., Skjellum, A., Laguna, I., Mohror, K.: Failure recovery for bulk synchronous applications with MPI stages. Parallel Computing, 1–14 (2019). https://doi.org/10.1016/j.parco.2019.02.007
33. Teranishi, K., Heroux, M.A.: Toward local failure local recovery resilience model using MPI-ULFM. In: Proceedings of the 21st European MPI Users' Group Meeting, p. 51 (2014)
34. Wang, Z., Gao, L., Gu, Y., Bao, Y., Yu, G.: A fault-tolerant framework for asynchronous iterative computations in cloud environments. IEEE Transactions on Parallel and Distributed Systems (8), 1678–1692 (2018)
35. Zheng, G., Ni, X., Kalé, L.V.: A scalable double in-memory checkpoint and restart scheme towards exascale. In: IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012), pp. 1–6 (June 2012). https://doi.org/10.1109/DSNW.2012.6264677
36. Zheng, G., Huang, C., Kalé, L.V.: Performance evaluation of automatic checkpoint-based fault tolerance for AMPI and Charm++. SIGOPS Oper. Syst. Rev. (2), 90–99 (Apr 2006). https://doi.org/10.1145/1131322.1131340