MATCH: An MPI Fault Tolerance Benchmark Suite

Luanzheng Guo†, Giorgis Georgakoudis‡, Konstantinos Parasyris‡, Ignacio Laguna‡, and Dong Li†
† EECS, University of California, Merced, USA; {lguo4, dli35}@ucmerced.edu
‡ CASC, Lawrence Livermore National Laboratory, USA; {georgakoudis1, parasyris1, lagunaperalt1}@llnl.gov

Abstract—MPI has been ubiquitously deployed in flagship HPC systems, aiming to accelerate distributed scientific applications running on tens of thousands of processes and compute nodes. Maintaining the correctness and integrity of MPI application execution is critical, especially for safety-critical scientific applications. Therefore, a collection of effective MPI fault tolerance techniques have been proposed to enable MPI application execution to efficiently resume from system failures. However, there is no structured way to study and compare different MPI fault tolerance designs, so as to guide the selection and development of efficient MPI fault tolerance techniques for distinct scenarios. To solve this problem, we design, develop, and evaluate a benchmark suite called MATCH to characterize, research, and comprehensively compare different combinations and configurations of MPI fault tolerance designs. Our investigation derives useful findings: (1) Reinit recovery in general performs better than ULFM recovery; (2) Reinit recovery is independent of the scaling size and the input problem size, whereas ULFM recovery is not; (3) using Reinit recovery with FTI checkpointing is a highly efficient fault tolerance design. MATCH code is available at https://github.com/kakulo/MPI-FT-Bench.
I. INTRODUCTION
As supercomputers continue to increase in computational power and size, next-generation HPC systems are expected to incur a much higher failure rate than current systems. For example, the Sequoia supercomputer at Lawrence Livermore National Laboratory (LLNL) reported a mean time between node failures of 19.2 hours in 2013 [1]. In 2014, the Blue Waters supercomputer reported a mean time between node failures of 6.7 hours [2]. Most recently, the Taurus system at TU Dresden reported a mean time between node failures of 3.65 hours [3].

This trend raises concerns in the HPC community for MPI applications running on tens of thousands of processes and nodes, which are prone to fail due to the increased probability of a failure. Process and node failures frequently occur in production HPC systems due to power outages and other issues. MPI process and node failures are usually fail-stop failures: application execution cannot continue without repairing the communication and has to stop.

These facts make developing efficient and effective fault tolerance designs for scaling HPC systems increasingly important and challenging [4], [5]. Numerous fault tolerance techniques have been proposed to protect MPI application execution from system failures. MPI fault tolerance techniques fall into two types with different focus. Checkpointing [6], [7], [8], [9], [10], commonly used in HPC applications, is one type of fault tolerance technique that focuses on restoring application state. Checkpointing takes place in two separate phases: storing the system state and recovering from it in case of a failure. By saving application execution state periodically, checkpointing helps MPI applications quickly restore application state from the latest checkpoint. The other type of MPI fault tolerance technique focuses on restoring the MPI state. Restarting is a baseline solution for restoring the MPI state: it restarts an application immediately after execution collapses due to a failure. Because restarting an application is inefficient, HPC practitioners have proposed MPI recovery mechanisms that restore the MPI state online. User-Level Fault Mitigation (ULFM) [11] and Reinit [12], [13], [14] are the two pioneering MPI recovery frameworks in this effort. ULFM supports a wide range of recovery strategies, including local forward recovery and global-restart recovery, whereas Reinit only supports global-restart recovery. ULFM is a powerful MPI recovery framework but complicated to use. In contrast, Reinit requires less programming effort.

However, there is no existing framework that enables a comprehensive comparison between different MPI fault tolerance techniques. To solve this problem, we design and develop a benchmark suite called MATCH, aiming to study the performance efficiency of different MPI fault tolerance configurations. MATCH contains six proxy applications from the Exascale Computing Project (ECP) Proxy Apps Suite and the LLNL Advanced Simulation and Computing (ASC) proxy application suite; MATCH uses the Fault Tolerance Interface (FTI) [15] for data recovery, and ULFM and Reinit for MPI recovery. We pick a representative set of HPC applications, but our methodology is extensible to other HPC applications too.
In the evaluation, we break down the execution time and compare the performance overhead when using FTI with Restart, FTI with ULFM, and FTI with Reinit, respectively. All the above experiments run at four different scaling sizes (64, 128, 256, and 512 processes on 32 nodes), with three different input sizes (small, medium, and large), and with or without injecting process failures.

In particular, our contributions are:
1) We present MATCH, an MPI fault tolerance benchmark suite. This is the first benchmark suite designed for MPI fault tolerance. We illustrate the process and detail the implementation of three different fault tolerance designs in HPC proxy applications.
2) To facilitate checkpointing, we propose three principles to automatically detect data objects for checkpointing. Those data objects are the only data objects necessary to guarantee application execution correctness after restoring the application state.
3) We comparatively and extensively investigate the performance efficiency of different fault tolerance designs. Our evaluation reveals that, for MPI global-restart recovery, using FTI with Reinit is the most efficient of the three evaluated fault tolerance designs; Reinit recovery is four times faster than ULFM recovery on average, and 16 times faster than Restart on average.

II. BACKGROUND
A. MATCH
There is no existing benchmark suite aimed at benchmarking MPI fault tolerance. We design, implement, and test the benchmark suite MATCH to understand and comparatively study the performance efficiency of different MPI fault tolerance designs. MATCH is composed of six HPC proxy applications, taken from widely used HPC benchmark suites, aiming to represent the HPC application domain. Our fault tolerance design has two interfaces: the checkpointing interface to preserve and protect application data, and the failure recovery interface to protect and repair the MPI communicator. We use the Fault Tolerance Interface (FTI) for checkpointing, and Restart, ULFM, and Reinit for MPI process recovery.
B. Workloads
Our workloads comprise proxy applications present in well-known benchmark suites: the ECP proxy applications suite [16] and the LLNL ASC proxy applications suite [17]. Proxy applications are small, simplified applications that allow HPC practitioners, operators, and domain scientists to explore and test key features of real applications in a quick-turnaround fashion. Our workloads represent the most important HPC application domains in scientific computing, such as iterative solvers, multi-grid, and molecular dynamics. We describe the six proxy applications used in MATCH below.
AMG: An algebraic multi-grid solver for linear systems arising in unstructured grid problems. AMG is built on top of the BoomerAMG solver of the HYPRE library, a large-scale linear solver library developed at LLNL. AMG provides a number of tests for a variety of problems; the default is an anisotropy problem in the Laplace domain.
CoMD: A proxy application in Molecular Dynamics (MD) commonly used as a research platform for particle motion simulation. Different from previous MD proxy applications such as miniMD, the design of CoMD is significantly modularized, which allows performing analyses on individual modules.
LULESH: A proxy application that solves the hydrodynamics equation in a Sedov blast problem. LULESH uses a mesh to simulate the Sedov blast problem, dividing the domain into a composition of volumetric elements. The mesh is an unstructured hex mesh, where nodes are points connected by mesh lines.

miniFE: A proxy application that solves an unstructured implicit finite element problem, aiming at the approximation of an unstructured implicit finite element method.

miniVite: A proxy application that solves the graph community detection problem using the distributed Louvain method. The Louvain method is a greedy algorithm for community detection.
HPCCG: A preconditioned conjugate gradient solver that solves the linear system of partial differential equations in a 3D chimney domain. HPCCG approximates practical physical applications that simulate unstructured grid problems.
C. Checkpointing Interface - FTI
Fault Tolerance Interface (FTI) [18] is an application-level, multi-level checkpointing interface for efficient checkpointing in large-scale high-performance computing systems. FTI provides programmers an easy-to-use API through which the user can choose a checkpointing strategy that fits the application's needs. FTI enables multiple levels of reliability, with different performance efficiency, by utilizing local storage, data replication, and erasure codes. It requires users to designate which data objects to checkpoint, while hiding any data processing details from them. Users need only pass to FTI the memory address and data size of the data object to be protected to enable checkpointing of this data object. Because failures can corrupt either a single node or multiple nodes during the execution of an application, FTI provides multiple levels of resiliency to recover from failures of different severity. Namely, the levels are the following:
• L1: This level stores checkpoints locally on each compute node. In a node failure, the application state cannot be successfully restored.
• L2: This level is built on top of L1 checkpointing. In this level, each process stores its checkpoint locally as well as on a neighboring node.
• L3: In this level, the checkpoints are encoded with the Reed-Solomon (RS) erasure code. This implementation can survive the breakdown of half of the nodes within a checkpoint encoding group; the lost data can be restored from the RS-encoded files.
• L4: This level flushes checkpoints to the parallel file system. This level enables differential checkpointing.
The FTI authors proposed a multi-level checkpointing model and conducted an extensive study of the correctness and reliability of this model. In our work, for the first time, we use FTI in the context of MPI recovery.
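To make the interface concrete, the following is a minimal sketch of the FTI usage pattern described above, based on FTI's public API (FTI_Init, FTI_Protect, FTI_Status, FTI_Recover, FTI_Checkpoint, FTI_Finalize); the array name, sizes, and checkpoint stride are ours, and error checking is elided:

#include <stdlib.h>
#include <mpi.h>
#include <fti.h>

int main(int argc, char *argv[]) {
  MPI_Init(&argc, &argv);
  FTI_Init(argv[1], MPI_COMM_WORLD);  /* argv[1] is the FTI config file */

  long n = 1000000;
  double *field = malloc(n * sizeof(double));
  /* Register the data object: protection id, base address,
     element count, and element type. */
  FTI_Protect(0, field, n, FTI_DBLE);

  for (int it = 0; it < 100; it++) {
    if (FTI_Status() != 0)   /* this execution is a restart */
      FTI_Recover();         /* reload all protected objects */
    /* ... compute on field, using FTI_COMM_WORLD for MPI calls ... */
    if (it % 10 == 0)
      FTI_Checkpoint(it + 1, 1);  /* checkpoint id, level (L1) */
  }

  free(field);
  FTI_Finalize();
  MPI_Finalize();
  return 0;
}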
D. Failure Recovery Interface - ULFM and Reinit
MPI failure recovery has multiple modes, including global, local, backward, forward, shrinking, and non-shrinking.

Global: The application execution must roll back all processes (including survivor and failed processes) to a global state to fix a failure.
Local: The application can continue execution by repairing only the failed components, such as the failed processes.
Backward: The application execution must go back to some previous correct state to survive a failure.

Forward: The failure can be fixed with the current application state, and the execution can continue.
Non-shrinking: The application needs to bring back all failed processes to resume execution.
Shrinking: The application execution is able to continue with the remaining survivor processes.

We target global, backward, non-shrinking recovery in this work, because this recovery fits best the widely used Bulk Synchronous Parallel (BSP) paradigm of HPC applications.
1) ULFM:
User-Level Fault Mitigation [11] is a leading MPI failure recovery framework providing shrinking recovery and non-shrinking recovery. ULFM introduces new MPI operations to add fault tolerance functionality, including fault detection, communicator repair, and failure recovery. In particular, ULFM leverages the MPI error handler to provide notification of process failures. Once a failure is detected, the notified application invokes ULFM's MPI_Comm_revoke() operation, which revokes the communicator; this interrupts any pending communication on this communicator for all member processes. ULFM then removes the failed processes using MPI_Comm_shrink(), which creates a new communicator consisting only of survivor processes. Shrinking recovery is done using the steps described above. For non-shrinking recovery, ULFM further uses MPI_Comm_spawn() to spawn new processes and create a new communicator, and then MPI_Intercomm_merge() to merge the communicator of survivor processes with the communicator of spawned processes into a new, combined communicator. We provide a sample implementation of ULFM non-shrinking recovery in Figure 3.
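As a minimal sketch of this revoke/shrink/spawn/merge sequence (assuming ULFM's MPIX_* extensions from <mpi-ext.h>; the helper name, the known old communicator size, and the omitted spawned-side merge are our simplifications):

#include <mpi.h>
#include <mpi-ext.h>  /* ULFM extensions: MPIX_Comm_revoke, MPIX_Comm_shrink */

/* Rebuild a full-size communicator after a process failure. */
MPI_Comm repair_world(MPI_Comm comm, int old_size, char *argv[]) {
  MPI_Comm shrunk, intercomm, merged;
  int nsurvivors;

  MPIX_Comm_revoke(comm);            /* interrupt pending communication */
  MPIX_Comm_shrink(comm, &shrunk);   /* survivors-only communicator */

  MPI_Comm_size(shrunk, &nsurvivors);
  /* Respawn one replacement per failed process ... */
  MPI_Comm_spawn(argv[0], argv, old_size - nsurvivors, MPI_INFO_NULL,
                 0, shrunk, &intercomm, MPI_ERRCODES_IGNORE);
  /* ... and merge survivors and replacements into one communicator
     (the spawned processes perform the matching merge via
     MPI_Comm_get_parent, not shown here). */
  MPI_Intercomm_merge(intercomm, 0, &merged);
  return merged;
}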
2) Reinit:
Reinit [13], [14], [19] is an alternative recovery framework designed specifically for global, backward, non-shrinking recovery. Reinit implements the recovery process in the MPI runtime, so it is transparent to users. Therefore, the programming effort of using Reinit is much smaller than that of ULFM: programmers only need to set a global restart point, and Reinit handles the rest of the recovery. Reinit is much more efficient than ULFM because MPI recovery is transparently handled in the MPI runtime [14], whereas ULFM recovery is handled not only in the MPI runtime but also in the application.

III. DESIGN
We present design details in this section. In particular, we describe the algorithm that we use to find data objects for checkpointing through data dependency analysis.
A. Find Data Objects for Checkpointing
Unlike many fault tolerance frameworks that ask programmers to decide which data objects to checkpoint, we develop a practical analysis tool that guides programmers to identify the data objects that must be checkpointed to recover the application execution to the same state as before the failure. We identify data objects for checkpointing through data dependency analysis across iterations, following three principles; a toy example illustrating them closes this subsection.
1) The data objects for checkpointing must be defined before the iterative computation. Data objects defined locally within the main computation loop are excluded from checkpointing.
2) The data objects for checkpointing must be used (read or written) across iterations of the main computation loop.
3) The value of data objects for checkpointing must vary across iterations of the main computation loop.
Following these three principles, we design and develop a data dependency analysis tool. The input to the tool is a dynamic execution instruction trace generated using LLVM-Tracer [20]. The trace contains detailed information about dynamic operations, such as the register name and memory address, the operator, and the line number in the source code where the operation is performed. We describe the algorithm of the data dependency analysis tool in Algorithm 1. The input to the algorithm is the set of locations used within the main computation loop and the set of locations allocated before the main computation loop. Those locations are either registers or memory locations. We create the two sets of locations by traversing the instruction trace once. After that, we first check the values of locations to ensure that the invocation values of the same location within the main computation loop are not all identical. We then remove repetitions from both sets of locations. Lastly, for each location in the set of the main computation loop, we search for a match in the set of locations before the main computation loop. If a match is found, the matched location is used to localize data objects for checkpointing. The output of the tool is a set of locations for checkpointing. Note that the tool only outputs the locations for checkpointing, runs separately, and does not support automatic generation of checkpointing code at this stage; we leave that as future work.
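The toy kernel below (our own illustrative code, not from a MATCH workload) shows how the three principles classify variables: state is defined before the loop, is read and written across iterations, and changes value, so it must be checkpointed; tmp is loop-local, so principle 1 excludes it; steps is defined before the loop but never changes, so principle 3 excludes it.

#include <stdio.h>
#define N 8

int main(void) {
  double state[N];          /* defined before the loop (principle 1) */
  const int steps = 100;    /* loop-invariant: excluded (principle 3) */
  for (int i = 0; i < N; i++)
    state[i] = (double)i;

  for (int it = 0; it < steps; it++) {
    double tmp = 0.0;       /* loop-local: excluded (principle 1) */
    for (int i = 0; i < N; i++)
      tmp += state[i];
    for (int i = 0; i < N; i++)
      state[i] += tmp / N;  /* read and written across iterations, value
                               varies (principles 2 and 3): checkpoint it */
  }
  printf("state[0] = %f\n", state[0]);
  return 0;
}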
IV. IMPLEMENTATION
A. FTI Implementation
The Fault Tolerance Interface (FTI) is a checkpointing library widely used by HPC developers. We illustrate a sample usage of FTI in Figure 1. Please refer to the FTI paper [15] for the implementation details of FTI function calls such as FTI_Protect() and FTI_Recover().

We encountered a challenge while implementing checkpointing with FTI for the MATCH workloads: the programming complexity of enabling FTI checkpointing when the number of data objects to checkpoint is large. FTI requires users to add FTI checkpointing to every data object manually. This significantly increases the programming effort when the number of data objects for checkpointing is large and when a data object is a complicated data structure. This is a common issue in application-level checkpointing libraries such as FTI, VeloC, and SCR: these libraries cannot automatically enable checkpointing for target data objects.
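For example, protecting a user-defined structure takes an extra step: the structure's layout must first be registered with FTI. A minimal sketch using FTI's FTI_InitType (the Particle type and the protection id are ours):

#include <fti.h>

typedef struct {
  double pos[3];
  double vel[3];
} Particle;

/* Register the struct layout with FTI, then protect the array. */
void protect_particles(Particle *particles, long count) {
  FTIT_type particle_type;
  FTI_InitType(&particle_type, sizeof(Particle)); /* opaque type of this size */
  FTI_Protect(1, particles, count, particle_type); /* id 1 is our choice */
}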
B. FTI with Reinit Implementation
Reinit is the state-of-the-art MPI global non-shrinking recovery framework. Reinit hides all of the recovery implementation in the MPI runtime, which makes it easy to use. We provide a sample implementation of Reinit with FTI checkpointing in Figure 2. Reinit recovery adds fewer than five lines of code: Lines 4 and 5 are for Reinit recovery, while Line 14 is used for other functions. FTI is completely independent of Reinit. To implement FTI with Reinit, the only thing to note is to move the FTI_Init() and FTI_Finalize() calls into the resilient_main() function as well. Please refer to the work on Reinit [13], [14], [19] for its design and implementation details.

Algorithm 1: Find Data Objects for Checkpointing
Input: Locs_in_loop, the set of locations used in the main computation loop; Locs_before_loop, the set of locations defined or allocated before the main computation loop
Output: CPK_Locs, the set of locations for checkpointing

  // Check values of locations in Locs_in_loop
  for l ∈ Locs_in_loop do
      if the invocation values of l are not all the same then
          keep l in Locs_in_loop;
      else
          remove l from Locs_in_loop;
  // Remove repetition in Locs_in_loop and Locs_before_loop
  for l ∈ Locs_in_loop do remove repetition;
  for l ∈ Locs_before_loop do remove repetition;
  // Check if locations in Locs_in_loop find a match in Locs_before_loop
  for l_i ∈ Locs_in_loop do
      for l_j ∈ Locs_before_loop do
          if l_i matches l_j then CPK_Locs ← l_i;

C. FTI with ULFM Implementation
ULFM is a pioneering MPI recovery framework. ULFM provides five new MPI interfaces to support MPI fault tolerance, giving programmers the flexibility to implement their own, customized MPI recovery strategy. ULFM also allows programmers to use both shrinking and non-shrinking recovery. However, it takes a significant learning and programming effort before a programmer can successfully implement recovery with ULFM. As most HPC applications follow the Bulk Synchronous Parallel (BSP) paradigm, we focus on ULFM global non-shrinking recovery. To implement ULFM non-shrinking recovery, we add more than 200 lines of code to each benchmark, which requires far more effort than the implementation using Reinit for recovery (fewer than five lines of code). We provide a sample implementation of ULFM global non-shrinking recovery with FTI in Figure 3.

int main(int argc, char *argv[]) {
  MPI_Init(&argc, &argv);
  // Initialize FTI
  FTI_Init(argv[1], MPI_COMM_WORLD);
  // Right before the main computation loop,
  // add FTI protection to data objects
  FTI_Protect(...);
  // the main computation loop
  while (...) {
    // At the beginning of the loop:
    // if the execution is a restart
    if (FTI_Status() != 0) {
      FTI_Recover();
    }
    // do FTI checkpointing
    if (Iter_Num % cp_stride == 0) {
      FTI_Checkpoint(...);
    }
  }
  FTI_Finalize();
  MPI_Finalize();
}

Fig. 1: A sample implementation of FTI.

int main(int argc, char *argv[]) {
  MPI_Init(&argc, &argv);
  OMPI_Reinit(argc, argv, resilient_main);
  MPI_Finalize();
  return 0;
}

// Move the original main() into resilient_main()
int resilient_main(int argc, char **argv,
                   OMPI_reinit_state_t state) {
  FTI_Init(argv[1], MPI_COMM_WORLD);
  ...
  // the main computation loop
  ...
  FTI_Finalize();
  return 0;
}

Fig. 2: A sample implementation of Reinit.

When combining ULFM global non-shrinking recovery with FTI, it is important to note that MPI_COMM_WORLD must be implemented as a global variable with an external declaration (see Lines 2-6 in Figure 3 for the details). This is because ULFM updates the world communicator handle, and FTI must use the repaired world communicator for MPI communication to function correctly.
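A minimal sketch of this global-communicator pattern (the file names and the helper are ours): the world handle is defined once and referenced everywhere else through an extern declaration, so every module, including FTI's communication, picks up the repaired handle after recovery.

/* comm_world.c: the single definition of the replaceable world handle */
#include <mpi.h>
MPI_Comm world = MPI_COMM_NULL;

/* solver.c: any other translation unit references the same handle */
#include <mpi.h>
extern MPI_Comm world;

void exchange(void) {
  int rank;
  MPI_Comm_rank(world, &rank);  /* always the current, possibly
                                   repaired, communicator */
  /* ... MPI communication on world ... */
}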
D. Fault Injection
We emulate MPI process failures through fault injection. In particular, we raise a SIGTERM signal at a randomly selected MPI process in a randomly selected iteration of the main computation loop. We illustrate the fault injection code in Figure 4. Note that we choose to evaluate the different fault tolerance techniques by triggering a process failure; this does not mean that the MPI recovery frameworks do not support recovery from a node failure. On the one hand, Reinit can recover from a node failure [14]; on the other hand, the current ULFM implementation cannot. In our case, evaluating on MPI process failures is sufficient to compare the performance difference when using FTI checkpointing with ULFM and with Reinit.

/* world will swap between worldc[0] and worldc[1]
   after each respawn */
MPI_Comm worldc[2] = { MPI_COMM_NULL, MPI_COMM_NULL };
int worldi = 0;
// the MPI communicator must be implemented as a global
// variable to enable immediate update after ULFM
// recovery for FTI to use

int main(int argc, char *argv[]) {
  MPI_Init(&argc, &argv);
  // set long jump
  int do_recover = setjmp(stack_jmp_buf);
  int survivor = IsSurvivor();
  /* set an errhandler on world, so that a failure is
     not fatal anymore */
  MPI_Comm_set_errhandler(world, ...);
  FTI_Init(argv[1], world);
  ...
  // the main computation loop
  ...
  FTI_Finalize();
  MPI_Finalize();
}

/* error handler: repair comm world */
static void errhandler(MPI_Comm *pcomm, int *errcode, ...) {
  int eclass;
  MPI_Error_class(*errcode, &eclass);
  if (MPIX_ERR_PROC_FAILED != eclass &&
      MPIX_ERR_REVOKED != eclass) {
    MPI_Abort(MPI_COMM_WORLD, *errcode);
  }
  /* swap the worlds */
  worldi = (worldi + 1) % 2;
  MPIX_Comm_revoke(world);
  MPIX_Comm_shrink(...);
  MPI_Comm_spawn(...);
  MPI_Intercomm_merge(...);
  MPIX_Comm_agree(...);
  longjmp(stack_jmp_buf, 1);
}

Fig. 3: A sample implementation of ULFM non-shrinking recovery.

// simulation of proc failures
if (proc_fi == 1 && numIters == Selected_Iter) {
  if (myrank == Selected_Rank) {
    printf("KILL rank %d\n", myrank);
    kill(getpid(), SIGTERM);
  }
}

Fig. 4: A sample implementation of fault injection.

V. EVALUATION
We measure and compare the MPI failure recovery time, the checkpointing time, and the application execution time of the three fault tolerance designs. Using a methodology similar to other works [14], [21], [22], [23] that evaluate MPI fault tolerance, we validate our benchmark suite MATCH at four scaling sizes and three input problem sizes, both with and without fault injection. We answer the following questions:
• Do fault tolerance interfaces (such as ULFM) delay application execution?
• Do the checkpointing interface and the MPI recovery interface interfere with each other?
• Does ULFM or Reinit perform better at different scaling sizes and different input problem sizes?
A. Artifact Description
We run experiments on a large-scale HPC cluster with 752 nodes. Each node is equipped with two Intel Haswell CPUs, 28 CPU cores, 128 GB of shared memory, and 8 TB of local storage.
B. Experimentation Setup
This section provides details of the experimentation setup. We evaluate three fault tolerance designs: FTI checkpointing with Restart (RESTART-FTI), in which we restart the execution in case of a failure for MPI recovery; FTI checkpointing with Reinit recovery (REINIT-FTI); and FTI checkpointing with ULFM recovery (ULFM-FTI).

For FTI checkpointing, we use its L1 mode. FTI L1 mode allows storing checkpoints on the local SSD or doing in-memory checkpointing. We use the fastest approach, which saves checkpoints to local memory using RAMFS through "/dev/shm". Although there are also L2, L3, and L4 checkpointing modes, we do not evaluate all of them; the efficiency comparison between the four FTI checkpointing modes has been thoroughly studied in the FTI paper [18]. We save checkpoints every ten iterations. For ULFM, we use the latest version, "ULFM v4.0.1ulfm2.1rc1", based on Open MPI 4.0.1. For Reinit [14], we use its latest version, based on Open MPI 4.0.0.

We implement all three fault tolerance designs in the MATCH workloads. Each design is run on three input problem sizes with the default scaling size (64 processes), with and without fault injection. Each design is also executed at four scaling sizes (64, 128, 256, and 512 processes on 32 nodes) with the default input problem size (small), with and without fault injection. We show the experimentation configuration in Table I. Note that LULESH needs to run on a cube number of processes, so it runs only with 64 and 512 processes.

For fault injection, we choose a random iteration and a random process to inject a fault. This enables us to fairly compare the efficiency of different fault tolerance configurations. Notably, we run the experiment of each configuration five times and report the average execution time to minimize system noise. We use '-O3' for mpicc and mpicxx compilation.

Fig. 5: Execution time breakdown (application time and checkpoint-writing time) at different scaling sizes (64, 128, 256, 512 processes) with no process failures, for (a) AMG, (b) CoMD, (c) HPCCG, (d) LULESH, (e) miniFE, and (f) miniVite.

TABLE I: Experimentation configuration for proxy applications (default scaling size: 64 processes; default input problem: small)

Application | Small Input             | Medium Input            | Large Input             | Number of processes
AMG         | -problem 2 -n 20 20 20  | -problem 2 -n 40 40 40  | -problem 2 -n 60 60 60  | 64, 128, 256, 512
CoMD        | -nx 128 -ny 128 -nz 128 | -nx 256 -ny 256 -nz 256 | -nx 512 -ny 512 -nz 512 | 64, 128, 256, 512
HPCCG       | 64 64 64                | 128 128 128             | 192 192 192             | 64, 128, 256, 512
LULESH      | -s 30 -p                | -s 40 -p                | -s 50 -p                | 64, 512
miniFE      | -nx 20 -ny 20 -nz 20    | -nx 40 -ny 40 -nz 40    | -nx 60 -ny 60 -nz 60    | 64, 128, 256, 512
miniVite    | -p 3 -l -n 128000       | -p 3 -l -n 256000       | -p 3 -l -n 512000       | 64, 128, 256, 512
C. Performance Comparison on Different Scaling Sizes
In this experiment, we run each evaluation at four scaling sizes with the default input problem size (small). We seek to compare the scaling efficiency of the three fault tolerance designs with and without process failures.
Without A Failure:
Figure 5 shows the average execution time with no failure. We break down the execution time into the application execution time and the time to write checkpoints. Overall, among the three fault tolerance designs, ULFM-FTI performs worst; RESTART-FTI and REINIT-FTI perform similarly, and better than ULFM-FTI.

We first observe that FTI checkpointing scales well: the time spent writing checkpoints increases only modestly with more processes. This implies that a number of collective operations are used in FTI L1 checkpointing. The time for writing checkpoints accounts for 13% of the total execution time.

Second, we observe that Reinit has no impact on application execution when there is no failure. We use the FTI application execution time as the baseline for comparison because FTI is an application-level checkpointing library, whereas ULFM and Reinit modify the MPI runtime. The application execution time of REINIT-FTI is very close to that of RESTART-FTI. However, ULFM-FTI introduces overhead to the application execution, and this overhead grows as the number of processes goes up. This is understandable: ULFM is implemented across the MPI runtime and application levels, so it can introduce memory access and communication latency to the application execution and thus affect its efficiency. As reported in a ULFM paper [24], ULFM implements a constant heartbeat mechanism for failure detection and also amends MPI communication interfaces for failure recovery operations. These changes have an impact on application execution efficiency. Different from ULFM, Reinit incurs overhead only when a failure happens, because it does not perform any background operation during application execution.

Furthermore, we observe that the times for writing checkpoints in the RESTART-FTI and REINIT-FTI cases are close. This indicates that Reinit has no interference with FTI checkpointing, whereas ULFM has a small impact on FTI checkpointing in cases such as HPCCG and miniVite. This is reasonable: Reinit is implemented at the MPI runtime level, which has minimal impact on application-level operations, where the FTI operations run. In contrast, ULFM performs a significant number of collective operations for a periodic heartbeat in the MPI runtime, which leads to background overhead.

Fig. 6: Execution time breakdown (recovery, checkpoint-writing, and application time) recovering from a process failure at different scaling sizes, for (a) AMG, (b) CoMD, (c) HPCCG, (d) LULESH, (e) miniFE, and (f) miniVite.

Fig. 7: Recovery time at different scaling sizes, for (a) AMG, (b) CoMD, (c) HPCCG, (d) LULESH, (e) miniFE, and (f) miniVite.
With A Failure:
Figure 6 shows the breakdown of execution time when recovering from a process failure at different scaling sizes. Note that reading checkpoints happens only once in the execution, after recovery, and takes on the order of milliseconds, which is difficult to observe, so we exclude it from the figure. Figure 7 shows the MPI recovery time at different scaling sizes.

Overall, we observe that REINIT-FTI achieves the best performance compared to RESTART-FTI and ULFM-FTI. There are two reasons. First, Reinit recovery does not affect application execution, including checkpoint writing. Second, Reinit recovery spends the least time on MPI recovery, less than restarting and ULFM recovery. These observations are similar to those derived from Figure 5. Furthermore, we provide new observations by comparing the MPI recovery efficiency of the three fault tolerance designs.

Fig. 8: Execution time breakdown (checkpoint-writing and application time) for different input problem sizes (small, medium, large) with no process failures, for (a) AMG, (b) CoMD, (c) HPCCG, (d) LULESH, (e) miniFE, and (f) miniVite.

Fig. 9: Execution time breakdown (recovery, checkpoint-writing, and application time) recovering from a process failure for different input problem sizes, for (a) AMG, (b) CoMD, (c) HPCCG, (d) LULESH, (e) miniFE, and (f) miniVite.
ULFM recovery vs. Reinit recovery.
We find that ULFM recovery time can be up to 13 times greater than Reinit recovery time, and four times greater on average. We can also see a trend that ULFM recovery time increases as the number of processes grows; thus ULFM recovery does not scale well.
Different from ULFM, we find that Reinit recovery is independent of the number of processes.
This is because ULFM enforces a variety of fault tolerance collective operations on all MPI processes to enable MPI global non-shrinking recovery. Even worse, ULFM implements these fault tolerance operations at the application level, which must synchronize with other fault tolerance operations implemented in the MPI runtime. By contrast, Reinit is implemented at the MPI runtime level and requires far fewer collective operations.

Fig. 10: Recovery time for different input problem sizes, for (a) AMG, (b) CoMD, (c) HPCCG, (d) LULESH, (e) miniFE, and (f) miniVite.
Restart vs. Reinit recovery.
We find that restart recovery can be up to 22 times slower than Reinit recovery, and 16 times slower on average. This is expected: redeploying the MPI setup and allocating resources to restart the execution is very expensive, whereas Reinit recovery repairs the MPI state online.
Restart vs. ULFM recovery.
Restart recovery is 2 to 3 times slower than ULFM recovery. Similarly to Reinit, ULFM recovery is performed online, which is much more efficient than redeployment.
D. Performance Comparison on Different Input Sizes
In this experiment, we compare the performance efficiency of the three fault tolerance designs on three input problem sizes with the default scaling size (64 processes), with and without fault injection.
Without A Failure:
Figure 8 presents the results for different input problem sizes with no process failures. The execution time is divided into the time for writing checkpoints and the pure application execution time. We make several observations. Again, we use the pure application execution time of RESTART-FTI as the baseline for comparison.

First, we see an increase in the pure application execution time and the FTI checkpointing time when running larger input problem sizes, because the amount of data to process increases. We can also observe performance overhead in the application execution time of ULFM-FTI. This overhead increases as the input problem size grows, which indicates that ULFM is intensively involved in the application execution, where ULFM fault tolerance operations run a large number of collective MPI operations. These inefficient operations significantly affect the application execution, causing significant communication latency, especially when there is a large amount of data to process. Different from ULFM, Reinit does not delay application execution: the application execution time of REINIT-FTI is very close to that of RESTART-FTI. This is expected, as Reinit is implemented in the MPI runtime and uses far fewer collective operations than ULFM does.
With A Failure:
Figure 9 shows the execution time breakdown when recovering from a process failure for different input problem sizes. Note that we omit the time for reading checkpoints because it is on the order of milliseconds. Also, Figure 10 shows the recovery time for different input problem sizes.

The observations from Figure 8 and the scaling experiments still hold. The new observation is that the recovery time of either ULFM or Reinit changes negligibly across input problem sizes. This is an interesting finding, but it follows from how they operate. When a failure occurs, ULFM collects messages among daemons and processes in the background while the program execution terminates, so ULFM recovery dominates the execution. Reinit is fully implemented in the MPI runtime, making it even less susceptible to interference. We conclude that ULFM and Reinit recovery are independent of the input problem size.
Conclusion. (1) ULFM delays application execution, whereas Reinit has a negligible impact on it; (2) ULFM affects the performance of FTI checkpointing, whereas Reinit has a negligible effect on it; (3) Reinit performs better than ULFM, both with and without a failure, for MPI global, backward, non-shrinking recovery; (4) REINIT-FTI is the most efficient of the three fault tolerance designs for MPI global, backward, non-shrinking recovery.

E. Use of MATCH
MATCH can help HPC programmers aiming at MPI fault tolerance in three ways. (1) We provide hands-on instructions for implementing ULFM with FTI, Reinit with FTI, and FTI with Restart on representative HPC proxy applications. MATCH is open source, so programmers can learn, with little effort, how to implement the three fault tolerance designs in an HPC application from the MATCH code. (2) We provide a data dependency analysis tool to identify data objects for checkpointing. Those data objects are the minimal set needed to guarantee application execution correctness after restoring the application state. This is especially useful for applications with many data objects that need to be checkpointed. (3) MATCH can also be a foundation for future MPI fault tolerance designs: programmers can develop new MPI fault tolerance designs on top of the three we provide. For example, ULFM global non-shrinking recovery can be replaced with ULFM local forward recovery, and FTI checkpointing can be replaced with SCR checkpointing [25]; a sketch of the latter swap appears at the end of this subsection. This is also part of our future work. Lastly, we encourage programmers to add new HPC applications and new MPI fault tolerance designs to MATCH.
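For instance, the checkpoint-writing path with SCR could look roughly like the sketch below, based on SCR's classic API (SCR_Init/SCR_Finalize wrap the run much as FTI_Init/FTI_Finalize do; the file naming and the elided restart path are our simplifications):

#include <stdio.h>
#include <scr.h>

/* Write one checkpoint file per rank under SCR's control. */
void checkpoint_with_scr(int rank, const double *state, long n) {
  int need = 0;
  SCR_Need_checkpoint(&need);        /* let SCR decide the cadence */
  if (!need) return;

  SCR_Start_checkpoint();
  char name[256], path[SCR_MAX_FILENAME];
  snprintf(name, sizeof(name), "ckpt_rank%d.dat", rank);
  SCR_Route_file(name, path);        /* SCR picks the storage tier */
  FILE *f = fopen(path, "wb");
  int valid = f && fwrite(state, sizeof(double), n, f) == (size_t)n;
  if (f) fclose(f);
  SCR_Complete_checkpoint(valid);    /* mark the checkpoint set valid/invalid */
}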
VI. RELATED WORK
Data Recovery.
Checkpointing [26], [27], [22], [28], [29], [30], [31], [32] is the commonly used approach to restart an MPI application when a failure occurs. Hargrove et al. [26] develop a system-level checkpointing library, the Berkeley Lab Checkpoint/Restart (BLCR) library, which performs system-level checkpointing using the Linux kernel. Furthermore, Adam et al. [22], SCR [25], and FTI [15] propose multi-level checkpointing aiming to significantly advance checkpointing efficiency. CRAFT [23] provides a fault tolerance framework that integrates checkpointing with ULFM shrinking and non-shrinking recovery. In this work, we choose FTI for data recovery because of its high efficiency and extensive documentation. We plan to integrate and evaluate more checkpointing mechanisms in addition to FTI in future work. Furthermore, different from existing work, we also provide a data dependency analysis tool to aid programmers in identifying data objects for checkpointing.
MPI Recovery.
ULFM [11], [21] is a leading MPI recovery framework developed with the MPI Fault Tolerance Working Group. ULFM provides new MPI interfaces to remove failed processes from and add new processes to communicators, and to perform resilient agreement between processes. ULFM asks programmers to implement shrinking or non-shrinking recovery using these interfaces. ULFM provides flexibility to programmers, but significant learning effort is required before programmers can correctly use the ULFM interfaces to implement recovery. A large body of work [6], [7], [8], [10], [33] has explored and extended the applicability of ULFM. Teranishi et al. [34] replace failed processes with spare processes to accelerate ULFM recovery. Bosilca et al. [24] and Katti et al. [35] propose a series of efficient fault detection mechanisms for ULFM. Reinit [14], [36] is a more efficient solution for MPI global recovery: it hides MPI process recovery from programmers by implementing it in the MPI runtime, and it provides a simple interface for programmers to define a global restart point in the form of a resilient target function. The early versions [13], [19], [36], [37] of Reinit had limited usage because they required hard-to-deploy changes to job schedulers. Most recently, Georgakoudis et al. [14] propose a new design and implementation of Reinit in the Open MPI runtime.

Later, researchers realized the benefit of combining checkpointing and MPI recovery for higher MPI fault tolerance efficiency. For example, FENIX [38] and CRAFT [23] both design and develop a checkpointing interface that supports data recovery for ULFM shrinking and non-shrinking recovery. However, developers must explicitly manage and redistribute the restored data among survivor processes in case of a non-shrinking recovery, which can easily cause load imbalance problems. Also, they evaluate their frameworks on only two applications and do not compare their fault tolerance frameworks to other fault tolerance designs. In conclusion, there is no existing work that either benchmarks the design and implementation of MPI fault tolerance or compares the performance efficiency of different fault tolerance designs. Different from FENIX and CRAFT, we evaluate and comprehensively compare fault tolerance designs that combine FTI checkpointing and MPI recovery frameworks (ULFM and Reinit) on a collection of HPC proxy applications.
MPI Fault Tolerance Benchmarking.
There have been many benchmark suites [39], [40], [41] developed for MPI performance modeling. SKaMPI [42] is an early benchmark suite that evaluates different implementations of MPI. Bureddy et al. [43] develop a benchmark suite to evaluate point-to-point, multi-pair, and collective MPI communication on GPU clusters. However, there is no MPI benchmark suite that focuses on fault tolerance and evaluates fault tolerance designs in MPI. This paper proposes the benchmark suite MATCH for benchmarking MPI fault tolerance.

VII. CONCLUSIONS
MPI fault tolerance is becoming an increasingly critical problem as supercomputers continue to grow in size. We have designed and implemented a benchmark suite, called MATCH, to evaluate MPI fault tolerance approaches. Our benchmark suite has six representative HPC proxy applications selected from existing, flagship HPC benchmark suites. We comprehensively evaluate and compare the performance efficiency of three different fault tolerance designs implemented on the selected applications. The evaluation results reveal that FTI checkpointing with Reinit recovery is the most efficient of the three fault tolerance designs. Our analysis and findings provide significant insight to HPC developers on MPI fault tolerance.

VIII. ACKNOWLEDGMENTS
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344 (LLNL-CONF-812453). This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration. This research is partially supported by the U.S. National Science Foundation (CNS-1617967, CCF-1553645, and CCF-1718194). We wish to thank Trusted CI, the NSF Cybersecurity Center of Excellence (NSF Grant Number ACI-1920430), for assisting our project with cybersecurity challenges.

REFERENCES
[1] J. Dongarra, "Emerging heterogeneous technologies for high performance computing," in International Heterogeneity in Computing Workshop, 2013.
[2] C. Di Martino, Z. Kalbarczyk, R. K. Iyer, F. Baccanico, J. Fullop, and W. Kramer, "Lessons learned from the analysis of system failures at petascale: The case of Blue Waters," in IEEE/IFIP International Conference on Dependable Systems and Networks. IEEE, 2014.
[3] S. Ghiasvand, F. M. Ciorba, R. Tschüter, and W. E. Nagel, "Lessons learned from spatial and temporal correlation of node failures in high performance computers," in Euromicro International Conference on Parallel, Distributed, and Network-Based Processing. IEEE, 2016.
[4] L. Guo, D. Li, I. Laguna, and M. Schulz, "FlipTracker: Understanding natural error resilience in HPC applications," in SC, 2018.
[5] L. Guo and D. Li, "MOARD: Modeling application resilience to transient faults on data objects," in International Parallel and Distributed Processing Symposium, 2019.
[6] A. Katti, G. Di Fatta, T. Naughton, and C. Engelmann, "Scalable and fault tolerant failure detection and consensus," in Proceedings of the 22nd European MPI Users' Group Meeting, 2015, p. 13.
[7] T. Herault, A. Bouteiller, G. Bosilca, M. Gamell, K. Teranishi, M. Parashar, and J. Dongarra, "Practical scalable consensus for pseudo-synchronous distributed systems," in SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2015, pp. 1-12.
[8] A. Bouteiller, G. Bosilca, and J. J. Dongarra, "Plan B: Interruption of ongoing MPI operations to support failure recovery," in Proceedings of the 22nd European MPI Users' Group Meeting, 2015, p. 11.
[9] M. M. Ali, P. E. Strazdins, B. Harding, and M. Hegland, "Complex scientific applications made fault-tolerant with the sparse grid combination technique," The International Journal of High Performance Computing Applications, vol. 30, no. 3, pp. 335-359, 2016.
[10] I. Laguna, D. F. Richards, T. Gamblin, M. Schulz, and B. R. de Supinski, "Evaluating user-level fault tolerance for MPI applications," in Proceedings of the 21st European MPI Users' Group Meeting, ser. EuroMPI/ASIA '14. New York, NY, USA: ACM, 2014, pp. 57:57-57:62. [Online]. Available: http://doi.acm.org/10.1145/2642769.2642775
[11] W. Bland, A. Bouteiller, T. Herault, G. Bosilca, and J. Dongarra, "Post-failure recovery of MPI communication capability: Design and rationale," The International Journal of High Performance Computing Applications, vol. 27, no. 3, pp. 244-254, 2013.
[12] I. Laguna, D. F. Richards, T. Gamblin, M. Schulz, B. R. de Supinski, K. Mohror, and H. Pritchard, "Evaluating and extending user-level fault tolerance in MPI applications," The International Journal of High Performance Computing Applications, vol. 30, no. 3, pp. 305-319, 2016.
[13] S. Chakraborty, I. Laguna, M. Emani, K. Mohror, D. K. Panda, M. Schulz, and H. Subramoni, "EReinit: Scalable and efficient fault-tolerance for bulk-synchronous MPI applications," Concurrency and Computation: Practice and Experience, p. e4863. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1002/cpe.4863
[14] G. Georgakoudis, L. Guo, and I. Laguna, "Reinit++: Evaluating the performance of global-restart recovery methods for MPI fault tolerance," in ISC, 2020.
[15] L. Bautista-Gomez, S. Tsuboi, D. Komatitsch, F. Cappello, N. Maruyama, and S. Matsuoka, "FTI: High performance fault tolerance interface for hybrid systems," in SC '11: Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, Nov. 2011, pp. 1-12.
[16] D. Richards, O. Aaziz, J. Cook, S. Moore, D. Pruitt, and C. Vaughan, "Quantitative performance assessment of proxy apps and parents: Report for ECP proxy app project milestone ADCD-504-9," Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States), Tech. Rep., 2020.
[17] J. R. Neely and B. R. de Supinski, "Application modernization at LLNL and the Sierra Center of Excellence," Computing in Science & Engineering, 2017.
[18] L. Bautista-Gomez, S. Tsuboi, D. Komatitsch, F. Cappello, N. Maruyama, and S. Matsuoka, "FTI: High performance fault tolerance interface for hybrid systems," in International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2011.
[19] I. Laguna, D. F. Richards, T. Gamblin, M. Schulz, B. R. de Supinski, K. Mohror, and H. Pritchard, "Evaluating and extending user-level fault tolerance in MPI applications," The International Journal of High Performance Computing Applications, vol. 30, no. 3, pp. 305-319, 2016. [Online]. Available: https://doi.org/10.1177/1094342015623623
[20] Y. S. Shao and D. Brooks, "ISA-independent workload characterization and its implications for specialized architectures," in IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2013.
[21] W. Bland, H. Lu, S. Seo, and P. Balaji, "Lessons learned implementing user-level failure mitigation in MPICH," 2015.
[22] J. Adam, M. Kermarquer, J.-B. Besnard, L. Bautista-Gomez, M. Pérache, P. Carribault, J. Jaeger, A. D. Malony, and S. Shende, "Checkpoint/restart approaches for a thread-based MPI runtime," Parallel Computing, vol. 85, pp. 204-219, 2019.
[23] F. Shahzad, J. Thies, M. Kreutzer, T. Zeiser, G. Hager, and G. Wellein, "CRAFT: A library for easier application-level checkpoint/restart and automatic fault tolerance," IEEE Transactions on Parallel and Distributed Systems, vol. 30, no. 3, pp. 501-514, 2018.
[24] G. Bosilca, A. Bouteiller, A. Guermouche, T. Herault, Y. Robert, P. Sens, and J. Dongarra, "A failure detector for HPC platforms," The International Journal of High Performance Computing Applications, vol. 32, no. 1, pp. 139-158, 2018. [Online]. Available: https://doi.org/10.1177/1094342017711505
[25] K. Mohror, A. Moody, G. Bronevetsky, and B. R. de Supinski, "Detailed modeling and evaluation of a scalable multilevel checkpointing system," IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 9, pp. 2255-2263, Sep. 2014.
[26] P. H. Hargrove and J. C. Duell, "Berkeley Lab Checkpoint/Restart (BLCR) for Linux clusters," in Journal of Physics: Conference Series, vol. 46, no. 1, 2006, p. 494.
[27] S. Sankaran, J. M. Squyres, B. Barrett, V. Sahay, A. Lumsdaine, J. Duell, P. Hargrove, and E. Roman, "The LAM/MPI checkpoint/restart framework: System-initiated checkpointing," JHPCA, vol. 19, no. 4, pp. 479-493, 2005.
[28] O. Subasi, T. Martsinkevich, F. Zyulkyarov, O. Unsal, J. Labarta, and F. Cappello, "Unified fault-tolerance framework for hybrid task-parallel message-passing applications," The International Journal of High Performance Computing Applications, vol. 32, no. 5, pp. 641-657, 2018.
[29] Z. Wang, L. Gao, Y. Gu, Y. Bao, and G. Yu, "A fault-tolerant framework for asynchronous iterative computations in cloud environments," IEEE Transactions on Parallel and Distributed Systems, vol. 29, no. 8, pp. 1678-1692, 2018.
[30] J. Cao, K. Arya, R. Garg, S. Matott, D. K. Panda, H. Subramoni, J. Vienne, and G. Cooperman, "System-level scalable checkpoint-restart for petascale computing," 2016.
[31] J. Adam, J.-B. Besnard, A. D. Malony, S. Shende, M. Pérache, P. Carribault, and J. Jaeger, "Transparent high-speed network checkpoint/restart in MPI," in Proceedings of the 25th European MPI Users' Group Meeting, 2018, p. 12.
[32] N. Kohl, J. Hötzer, F. Schornbaum, M. Bauer, C. Godenschwager, H. Köstler, B. Nestler, and U. Rüde, "A scalable and extensible checkpointing scheme for massively parallel simulations," The International Journal of High Performance Computing Applications, vol. 33, no. 4, pp. 571-589, 2019.
[33] N. Losada, I. Cores, M. J. Martín, and P. González, "Resilient MPI applications using an application-level checkpointing framework and ULFM," The Journal of Supercomputing, vol. 73, no. 1, 2017.
[34] K. Teranishi and M. A. Heroux, "Toward local failure local recovery resilience model using MPI-ULFM," in Proceedings of the 21st European MPI Users' Group Meeting, 2014, p. 51.
[35] A. Katti, G. Di Fatta, T. Naughton, and C. Engelmann, "Epidemic failure detection and consensus for extreme parallelism," The International Journal of High Performance Computing Applications, vol. 32, no. 5, pp. 729-743, 2018.
[36] I. Laguna, D. F. Richards, T. Gamblin, M. Schulz, and B. R. de Supinski, "Evaluating user-level fault tolerance for MPI applications," in Proceedings of the 21st European MPI Users' Group Meeting, ser. EuroMPI/ASIA '14. New York, NY, USA: ACM, 2014, pp. 57:57-57:62. [Online]. Available: http://doi.acm.org/10.1145/2642769.2642775
[37] N. Sultana, M. Rüfenacht, A. Skjellum, I. Laguna, and K. Mohror, "Failure recovery for bulk synchronous applications with MPI Stages," Parallel Computing.
[38] M. Gamell, D. S. Katz, H. Kolla, J. Chen, S. Klasky, and M. Parashar, "Exploring automatic, online failure recovery for scientific applications at extreme scales," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '14. Piscataway, NJ, USA: IEEE Press, 2014, pp. 895-906. [Online]. Available: https://doi.org/10.1109/SC.2014.78
[39] J. M. Bull, J. P. Enright, and N. Ameer, "A microbenchmark suite for mixed-mode OpenMP/MPI," in International Workshop on OpenMP. Springer, 2009.
[40] P. R. Luszczek, D. H. Bailey, J. J. Dongarra, J. Kepner, R. F. Lucas, R. Rabenseifner, and D. Takahashi, "The HPC Challenge (HPCC) benchmark suite," in Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, 2006.
[41] T. Agarwal and M. Becchi, "Design of a hybrid MPI-CUDA benchmark suite for CPU-GPU clusters," IEEE, 2014.
[42] R. Reussner, P. Sanders, L. Prechelt, and M. Müller, "SKaMPI: A detailed, accurate MPI benchmark," in European Parallel Virtual Machine/Message Passing Interface Users' Group Meeting. Springer, 1998.
[43] D. Bureddy, H. Wang, A. Venkatesh, S. Potluri, and D. K. Panda, "OMB-GPU: A micro-benchmark suite for evaluating MPI libraries on GPU clusters," in European MPI Users' Group Meeting. Springer, 2012.