Algorithm-Based Checkpoint-Recovery for the Conjugate Gradient Method
Carlos Pachajoa, Christina Pacher, Markus Levonyak, Wilfried N. Gansterer
Carlos Pachajoa
University of Vienna, Faculty of Computer Science
Vienna, Austria
[email protected]

Christina Pacher
University of Vienna, Faculty of Computer Science
Vienna, Austria
[email protected]

Markus Levonyak
University of Vienna, Faculty of Computer Science
Vienna, Austria
[email protected]

Wilfried N. Gansterer
University of Vienna, Faculty of Computer Science
Vienna, Austria
[email protected]
ABSTRACT
As computers reach exascale and beyond, the incidence of faults will increase. Solutions to this problem are an active research topic. We focus on strategies to make the preconditioned conjugate gradient (PCG) solver resilient against node failures, specifically, the exact state reconstruction (ESR) method, which exploits redundancies in PCG.

Reducing the frequency at which redundant information is stored lessens the runtime overhead. However, after a node failure, the solver must restart from the last iteration for which redundant information was stored, which increases recovery overhead. This formulation highlights the method's similarities to checkpoint-restart (CR). Thus, this method, which we call ESR with periodic storage (ESRP), can be considered a form of algorithm-based checkpoint-restart: The state is stored implicitly, by exploiting redundancy inherent to the algorithm, rather than explicitly as in CR. We also minimize the amount of data to be stored and retrieved compared to CR, but additional computation is required to reconstruct the solver's state. In this paper, we describe the necessary modifications to ESR to convert it into ESRP, and perform an experimental evaluation.

We compare ESRP experimentally with previously existing ESR and application-level in-memory CR. Our results confirm that the overhead of ESR is reduced significantly, both in the failure-free case and if node failures are introduced. In the former case, the overhead of ESRP is usually lower than that of CR. However, CR is faster if node failures happen. We claim that these differences can be alleviated by the implementation of more appropriate preconditioners.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
ICPP '20, Edmonton, AB, Canada. © 2020 Copyright held by the owner/author(s). Publication rights licensed to ACM. ISBN 978-1-4503-8816-0/20/08...$15.00. DOI: 10.1145/3404397.3404438
CCS CONCEPTS
• Mathematics of computing → Solvers; Mathematical software performance; • Computing methodologies → Parallel algorithms;
ACM Reference format:
Carlos Pachajoa, Christina Pacher, Markus Levonyak, and Wilfried N. Gansterer. 2020. Algorithm-Based Checkpoint-Recovery for the Conjugate Gradient Method. In Proceedings of the 49th International Conference on Parallel Processing (ICPP '20), Edmonton, AB, Canada, August 17–20, 2020, 11 pages. DOI: 10.1145/3404397.3404438
1 INTRODUCTION

In order to cover the demand for solving contemporary computational problems in reasonable times, modern computer clusters reach unprecedented levels of parallelism. The mean time between failures (MTBF) in clusters formed of increasingly numerous nodes and components will continue to drop. In these circumstances, there is good reason to start thinking of parallel computers as unreliable machines [23], and to come up with strategies to work around this problem.

The solution of linear equations for symmetric positive-definite (SPD) matrices is a problem of great importance in science and engineering. These matrices often arise from the discretization of elliptic differential equations, describing phenomena such as heat conduction and elastic deformation of materials. More detailed simulations require a finer grid and larger matrices. The solution of these problems often requires computer clusters and, if the scale of the machine is large enough, the solver is prone to encounter faults in the computer it runs on. To solve linear systems defined by SPD matrices, the solver of choice is often the preconditioned conjugate gradient (PCG) method (cf. Section 2.1). For very large matrices, running PCG on large numbers of compute nodes is warranted. Thus, it is worthwhile to develop resilience strategies for this algorithm.

A particular mode in which faults may occur in a cluster are node failures: events where one or more nodes that were working on the solution of the system become inaccessible and the information contained in them is lost. To cope with them, checkpoint-restart (CR) is currently the most extensively used strategy. In CR, the state of the application is periodically stored in safe storage (this is the checkpointing part of CR), and in the event of a node failure, the cluster can revert to a previously stored state and continue from there (the restart part of CR). Checkpointing has a runtime cost and, since this operation does not contribute to the results the application is producing in the absence of errors, it is desirable to keep the checkpointing frequency as low as possible. Returning to the previous state, however, incurs the runtime cost of discarding the iterations performed since the checkpoint. Therefore, there is also pressure to checkpoint as often as possible. Finding the optimal checkpointing period is discussed in the literature, and the value will depend on the incidence of errors in the machine [8, 11, 28].

The application of CR is straightforward: The state can be the content of relevant variables selected by the programmer (application-level CR), or simply the contents of all the memory used by it (full-memory CR). However, there are drawbacks that the method described in this paper improves upon:

• It has been pointed out that CR will not scale up well on future exascale machines, particularly in the full-memory case [4, 5].
• CR strategies are generally algorithm-agnostic. The technique consists of storing data and reverting to it, ignoring any resilience arising from the algorithm itself, thus potentially being suboptimal for the task at hand.

In contrast to these algorithm-agnostic approaches, there is a category of strategies to deal with computer unreliability called algorithm-based fault tolerance (ABFT), a concept first introduced in [13], where it is applied to dense matrix multiplication, that exploits the properties of the algorithms to endow them with resilience.

In this work, we focus on the PCG method used to solve the linear system Ax = b, where A is a sparse SPD matrix. PCG runs on a cluster that is vulnerable to node failures.
We expand on a previously introduced strategy: exact state reconstruction (ESR) [7, 20], which provides ABFT resilience against node failures for PCG. Our contributions in this work reduce its runtime overhead considerably, particularly in the case of multiple simultaneous node failures.

1.1 Terminology and Notation

In this section, we introduce some terms that are used in this paper.
Lost nodes are nodes that stop working in the event of a node failure: They become inaccessible and the information contained in them is lost. A spare node is in standby until a node failure occurs, at which point it replaces one of the lost nodes by reconstructing its information and continuing to iterate in its place. A spare node that, upon occurrence of a node failure, takes the place of a lost node is called a replacement node. A surviving node is still working after a node failure, its information remains accessible, and it also continues working after the reconstruction. In this paper, events in which a single node or multiple nodes fail simultaneously are called single-node failures and multiple-node failures, respectively.

We use the notation of [7] to refer to indices of vectors and matrices. The set of all indices is referred to as I. For a problem of size M, the cardinality of I is M. A subindex will restrict this index set to the indicated node or set of nodes: For example, the entries corresponding to node s will be denoted as I_s. If f is the set of nodes that fail simultaneously, the set of indices corresponding to the lost elements is denoted as I_f, and we can refer to the set of indices for elements in surviving nodes as I \ I_f. We use index sets as subscripts to refer to entries of vectors and matrices: For example, if we have a distributed vector x, we refer to the surviving vector elements as x_{I\I_f}, and for a matrix A, the rows corresponding to the lost nodes can be written as A_{I_f,I}. We say that a node owns an index, and thus the corresponding vector entry and matrix row, if the index is in the set assigned to it. In the context of the PCG method, a superscript (as in x^(j)) indicates the iteration number of a vector or scalar. The number of nodes in the cluster is denoted with N.

The state of an iterative solver refers to all of the solver's dynamic data, i.e., the vectors and scalars whose values change in every iteration, distributed among all nodes. It does not include the static data (system matrix, preconditioner and right-hand-side vector). The trajectory followed by the solver is the sequence of states that it goes through until convergence. A given state fully defines the trajectory subsequently followed by the solver: All future states are completely determined by the current state.

1.2 Problem Setting

In this paper, we work on the solution of sparse, symmetric positive-definite linear systems in a distributed-memory setting using the PCG method. The solver data is distributed, and the problem is solved on N nodes of a computer cluster. Disjoint subsets I_s of consecutive indices are distributed among the nodes, such that their union forms the set of all indices, I. Node s is assigned the rows of the matrix, and the entries of the vectors, corresponding to the indices in its subset I_s. This block-row distribution is common, and is used in prominent libraries such as PETSc [2]. The scalars used by the solver are replicated on all nodes.

We consider a situation in which the computer cluster is unreliable, specifically, that it can suffer node failures, in which one or more nodes can fail simultaneously. In terms of the linear solver, these faults represent the loss of the information owned by the affected nodes: blocks of vector entries, matrix rows, and the scalars also residing in the nodes' memory.

We look for strategies to recover from these events and still converge to the correct solution of the linear system. In this work, the cost metric is the overall runtime for the solver to converge. We assume the availability of sufficient memory to hold redundant information, and of spare nodes in the cluster.

1.3 Related Work

Here we present literature concerned with the recovery of linear solvers from node failures. Concerning the checkpoint-restart approach, the currently most common way to deal with this problem, we highlight work by Tao [25].
In this thesis, the author proposes techniques for lossy compression of checkpoint data, using prediction formulas based on spatial proximity.

In work on the solution of linear systems originating from partial differential equations (PDEs), Ltaief et al. present a strategy for resilience of parabolic PDEs against node failures [17]. Here, forward- and backward-stepping strategies are described to reconstruct data in the physical domain associated with the lost node, avoiding a more expensive checkpoint-restart approach. The method reconstructs the iterand exactly, like the methods described in this paper, but its application is a time-stepping solver with finite-differences discretization, whereas ours is the conjugate gradient algorithm. In [14], Huber et al. present an algorithm for the reconstruction of the subdomain after a node failure for a multigrid solver for the Laplace equation. The affected values are reconstructed approximately by solving a linear system local to the lost node.

Langou et al. [15] work with a wider variety of iterative linear solvers. After a node failure, approximations to the lost entries of the iterand are found using the system matrix, the right-hand-side vector and the surviving data of the iterand itself by solving a small linear system. They can bound the new residual norm as less than the residual norm before the node failure times a constant factor. This method incurs no overhead in the absence of node failures.

Agullo et al. [1] improve upon this strategy. As in [15], they approximate the lost entries of the iterand from its surviving information, but use least-squares minimization instead of solving a linear system.
As a result, the residual norm of the new vector will be less than or equal to the residual norm before the node failure.

Chen [7] introduces a way to perform exact state reconstruction (ESR) for multiple iterative methods, including PCG. They present a strategy to exploit the sparse matrix-vector product (SpMV) to store redundant information for the input vector, so that the full state of the vector can be reconstructed. In [20], Pachajoa et al. extend the algorithm in [7] to combine ESR with differently formulated preconditioners (the preconditioner itself, its inverse or a split preconditioner), and also compare ESR to the linear interpolation algorithm from [15]. In [21], Pachajoa et al. extend the ESR approach by describing how to operate in the event of multiple simultaneous node failures. In [16], Levonyak et al. extend the concept of ESR to the pipelined PCG algorithm, while maintaining its communication-hiding properties.

The work mentioned so far supposes the availability of spare nodes. In [12], Hori et al. propose strategies for the allocation of these spare nodes, and the replacement of lost nodes, when runtime performance is of consideration. Pachajoa et al. [22] introduce an ESR method which does not require spare nodes as replacements for failed nodes, but can reconstruct the lost information and continue on the surviving nodes.
1.4 Contributions of This Paper

As the main contribution of this paper, we frame ESR for PCG as an instance of what we call algorithm-based checkpoint-restart, and describe how it can be restructured to enable decreased state-storage frequencies. We experimentally show that this approach reduces the runtime overhead in the absence of node failures. This is particularly beneficial in scenarios with multiple node failures, where the additional communication needs increase the overhead most drastically and, consequently, for which the runtime overhead reduction is greatest.

The rest of this paper is structured in the following manner: In Section 2, we describe the exact state reconstruction approach applied to PCG in more detail. In Section 3, we reframe ESR, as presented in [21], as a CR-like method, for which the state-storage interval can be optimized, and which offers reduced runtime overheads. In Section 4, we describe the framework we use to obtain our experimental results. In Section 5, we present our experimental results, highlighting favorable scenarios for the new methods. Finally, Section 6 concludes the paper and presents our perspectives on future work.
2 THE PCG METHOD AND EXACT STATE RECONSTRUCTION

In this section, we introduce the PCG method, describe the way we exploit the inherent data redundancies in the sparse matrix-vector product, and explain the exact state reconstruction method.
2.1 The PCG Method

PCG is a linear solver for the system Ax = b, where A is an SPD matrix. The method is applied in conjunction with a preconditioner, which reduces the number of iterations until convergence at the price of the application of the preconditioner in every iteration.

The variables used in PCG are the following: x is the iterand vector, containing the current approximation to the solution. P is the preconditioner, here representing its action as a linear operator. The residual vector is represented with r, and the preconditioned residual vector with z. The search direction vector, p, determines the direction in which the iterand is modified in every iteration. β is a scalar used for the conjugation of the search directions, and α is a scalar determining the length of the step to be taken along the search direction towards the solution. PCG is presented in Alg. 1.

In exact arithmetic, supposing a naive selection of the initial guess, solving an SPD linear system of size M will take the conjugate gradient method M iterations to reach the solution. If the solver is restarted from the iterand at some point before convergence, reinitializing the search directions, reaching the solution might require performing M additional iterations from that point, thus wasting the work already performed. In [19], it is shown that this effect is also observed in floating-point arithmetic. This observation is the motivation for the exact state reconstruction (ESR) method (cf. Section 2.3).

Algorithm 1: Preconditioned conjugate gradient (PCG) method [24, Alg. 9.1]

    r^(0) := b − A x^(0); z^(0) := P r^(0); p^(0) := z^(0);
    for j = 0, 1, ..., until convergence do
        α^(j) := r^(j)ᵀ z^(j) / p^(j)ᵀ A p^(j);
        x^(j+1) := x^(j) + α^(j) p^(j);
        r^(j+1) := r^(j) − α^(j) A p^(j);
        z^(j+1) := P r^(j+1);
        β^(j) := r^(j+1)ᵀ z^(j+1) / r^(j)ᵀ z^(j);
        p^(j+1) := z^(j+1) + β^(j) p^(j);
    end
2.2 Redundancy in the Sparse Matrix-Vector Product

In order to reconstruct the entirety of the state of the PCG solver as it was before a node failure, enough redundant information has to be available. That is, there has to be redundancy that we can exploit.

The SpMV already provides some redundancy: In order to compute the product of matrix A and vector p, entries of p must be transmitted from their owner node to other nodes in the cluster, thus already creating copies of some entries in other locations. However, in order for the SpMV to provide full redundancy for the input vector, the matrix must fulfill the following condition [7]: For every node s, every column of the submatrix A_{I\I_s,I_s} contains at least one non-zero entry. For a matrix with this property, every entry of the vector is communicated from its owner to at least one other node. Most matrices do not fulfill this condition. Furthermore, this would only guarantee sufficient redundancy to recover from the failure of a single node.

2.2.1 The augmented sparse matrix-vector product. To achieve the required redundancy, we use the extensions to the SpMV introduced in [7, 21] and name the concept augmented sparse matrix-vector product (ASpMV). Entries that would not have been sent to any node with the ordinary SpMV are transferred to a neighbor anyway. With our chosen strategy, node s sends a given entry to node (s + 1) mod N if this entry is not already being sent to some other node as part of the regular SpMV.

The exact communication overhead depends on the sparsity pattern of the matrix. In general, denser matrices will have lower overheads for ASpMV, since more information has to be sent anyway to compute the product. With the nodes sending information to their neighbors, it is convenient if the matrix is banded, with most of its entries close to the diagonal. That way, the amount of information that ASpMV has to send additionally to the neighbors is minimized.

ASpMV can also guarantee the presence of several redundant copies of each input vector element. This is necessary in order to provide resilience against multiple-node failures. In order to describe this extension, we make use of the notation of [21, §3 and §4]: Let ϕ denote the target number of times each entry of the input vector must be replicated in the cluster (i.e., the number of simultaneous node failures that should be supported). Before, we defined I_s as the set of all indices owned by node s. Let I_{s,l}, with l ∈ {1..N}, be the subset of I_s with indices of the input vector p corresponding to entries that must be sent to node l for the computation of the product Ap. Node s does not send data to itself during this operation, so we define I_{s,s} := ∅. Furthermore, let d_{s,k} denote the designated destination nodes for resilient copies of the vector elements of node s, with k ∈ {1..ϕ}. In this work, we select the d_{s,k} to be the ϕ nearest neighbors of node s. This can be achieved with the following strategy:

    d_{s,k} := (s + ⌈k/2⌉) mod N,  if k is odd,
               (s − k/2) mod N,    if k is even.    (1)

For index i, which belongs to node s, we define its multiplicity, m(i), as the number of subsets I_{s,l}, with l ∈ {1..N}, in which i is present. That is, m(i) is the number of nodes that the i-th vector entry must be sent to in order to compute Ap. Additionally, we define g(i) as the number of the subsets I_{s,d_{s,k}}, k ∈ {1..ϕ}, in which i appears, that is, how many of the nodes d_{s,k} already need the i-th entry to compute Ap.

We can now describe the set R^c_{s,k} of indices of entries, owned by node s, to be sent to d_{s,k} in addition to the ones required to compute the product Ap:

    R^c_{s,k} := { i ∈ I_s | i ∉ I_{s,d_{s,k}} and m(i) − g(i) < ϕ − k + 1 }.
That is, the i-th entry, if owned by node s, will be sent to d_{s,k} if (1) it is not already being sent there and if (2) as we traverse the designated destination nodes by increasing k, the target number of copies for this entry has not been met yet. The approach for a single-node failure described earlier in this section is the same as the approach for a multiple-node failure with ϕ set to 1.

After the ASpMV is complete, each entry of the vector will have been communicated by its owner to at least ϕ nodes, thus creating at least ϕ + 1 copies of each entry in the cluster. Consequently, if up to ϕ nodes fail simultaneously, each entry will have survived in at least one node and can afterwards be transferred to a replacement node.

With this method, we send entries that are not necessary for the computation of the matrix-vector product. The communication of this additional information will cause iterations to take longer, and thus lead to an increased runtime until convergence. The exact overhead depends on factors such as the sparsity pattern of the matrix and the network topology of the cluster. Optimization of our strategies taking these factors into consideration is beyond the scope of this paper. Research in this direction is ongoing work.

2.2.2 Redundant copies. After the ASpMV is executed, the redundant information of the search direction p^(j) is not explicitly available. This means that, even though the information of the copies is present and spread in the cluster, after a node failure the data must be gathered in a replacement node before we can work with p^(j) again. We introduce the concept of a redundant copy, designated with a prime symbol (′), as in p′^(j), to abstractly represent the redundant vector data in the cluster, in whatever storage scheme the framework utilizes. We do not specify the number of copies per vector entry that a redundant copy represents, since we do not need this information for further descriptions. The concept of redundant copies will be used to explain our algorithms in Section 3.
2.3 Exact State Reconstruction

The PCG algorithm performs a matrix-vector product in every iteration (Line 3 of Alg. 1): The system matrix A is multiplied with the search direction vector p, thus potentially providing redundancy for the latter. In [21], the authors explain how to reconstruct the state of the solver as it was prior to the node failure, save for small perturbations resulting from floating-point arithmetic. With the two latest search directions, and after retrieving the scalar β from one of the surviving nodes, it is possible to move backwards from Line 8 of Alg. 1 and reconstruct every vector involved in the computation. The reconstruction procedure, run on the replacement nodes, is presented in Alg. 2. Note that the reconstruction procedure assumes that the static solver data (system matrix, preconditioner and right-hand-side vector) can be retrieved from safe storage.

Algorithm 2: ESR reconstruction phase for the PCG method on the replacement nodes [20, Alg. 2]

    Retrieve the static data A_{I_f,I}, P_{I_f,I}, and b_{I_f};
    Gather r^(j)_{I\I_f} and x^(j)_{I\I_f};
    Retrieve the redundant copies of β^(j−1), p^(j−1)_{I_f}, and p^(j)_{I_f};
    Compute z^(j)_{I_f} := p^(j)_{I_f} − β^(j−1) p^(j−1)_{I_f};
    Compute v := z^(j)_{I_f} − P_{I_f,I\I_f} r^(j)_{I\I_f};
    Solve P_{I_f,I_f} r^(j)_{I_f} = v for r^(j)_{I_f};
    Compute w := b_{I_f} − r^(j)_{I_f} − A_{I_f,I\I_f} x^(j)_{I\I_f};
    Solve A_{I_f,I_f} x^(j)_{I_f} = w for x^(j)_{I_f};

With the state of the solver as it was before the node failure, it is possible to reach convergence following the same trajectory as an undisturbed solver. As illustrated in [21], this method produces very low overheads, particularly if it protects only against single-node failures.

3 ESR WITH PERIODIC STORAGE (ESRP)

In this section, we extend the ESR method introduced in Section 2.3 to perform iterations with ASpMV at a reduced frequency, that is, not in every iteration, but two consecutive times every T iterations. We call T the checkpointing interval to keep with CR terminology, and we refer to the set of two iterations in which redundant information is stored as the storage stage. In the event of a node failure, the solver will return to the last time the search directions of two successive iterations were stored redundantly via ASpMV. It is then possible to reconstruct the state for the last of those iterations. We call the new approach exact state reconstruction with periodic storage (ESRP), and contrast it with ESR, which stores data in every iteration.

To describe ESRP, we introduce the concept of a queue, where the solver stores redundant copies (cf. Section 2.2.2). In the ESR algorithm, this queue has space for two positions: Every iteration, ASpMV will push a new redundant copy into the queue, and the oldest copy will be released.
This queue thus contains the redundant copies of the search directions of two successive iterations.

We now examine the same procedure in the case of ESRP. Suppose that the last redundant copies held in the queue are p′^(j) and p′^(j+1), that we performed some additional iterations using the regular SpMV afterwards, and that then a node failure occurs. The search directions in the queue could be used to reconstruct the state for iteration j + 1. However, it is possible that the node failure takes place after only one of the two iterations of a storage stage has been completed. Suppose that the solver has reached a storage stage at some iteration j, and the first call to ASpMV is performed. The redundant copy p′^(j) is then pushed to the queue. If a node failure happens at this point in time, before the redundant copy p′^(j+1) is created, the vector p^(j) that we can retrieve is not sufficient to perform the reconstruction shown in Alg. 2, since we would additionally need p^(j+1). For this reason, it is necessary to have a queue of not two, but three redundant copies of search directions, such that, if this happens, the queue still contains the entries of two successive search directions from the redundant storage period before.

In addition to the redundant copies created during the ASpMV, the solver needs to duplicate some local data at each node during the storage stage. As can be seen in Line 4 of Alg. 2, to reconstruct the vector z^(j), and subsequently the state of the solver for iteration j, the value of β^(j−1) is needed. However, depending on when the node failure occurs, the scalar β may have changed since the last storage stage. It is therefore necessary to create a duplicate of the value of β^(j−1). Similarly, the local entries of the residual r, the preconditioned residual z, the iterand x and the search direction p at iteration j must also be duplicated on all nodes, so that they can be used for the reconstruction process and so that the surviving nodes can reset their own parts of the solver state to match the state that is reconstructed at the replacement nodes. Entries of the vector z^(j) corresponding to surviving nodes are not used during the reconstruction, and could also be recomputed from r^(j) once the latter has been reconstructed. However, our solver stores a local copy instead of performing this operation. We mark these duplicate values with an asterisk: β∗ is a scalar, and r∗, p∗, z∗ and x∗ are distributed vectors. There is no need to store the scalar α: It is not used during the reconstruction, and it will be computed in Line 13 of Alg. 3 when the solver continues iterating. Since these copies are created locally by each node, they do not introduce any additional communication between nodes, and the runtime overhead they cause is therefore negligible. Note that, since r∗, p∗, z∗ and x∗ are created by each node copying its own data, if a node fails, the copies contained in it are also lost. Therefore, these copies, by themselves, obviously cannot be used to reconstruct the state.

An example of the procedure follows: The solver has a queue Q with three positions to hold redundant copies. Initially, the queue is empty: Q := [_, _, _]. After T iterations, ASpMV will be called for the first time and will push the first redundant copy, thus Q contains [_, _, p′^(T)]. The solver will also create a copy of the value of β^(T), β∗, on every node. At this point, it is not possible to recover from a node failure using ESRP. After another iteration, ASpMV pushes another redundant copy, Q becomes [_, p′^(T), p′^(T+1)], and copies of the vectors r^(T+1), z^(T+1), x^(T+1) and p^(T+1) are made, respectively designated r∗, z∗, x∗ and p∗. This information can now be used to reconstruct the state for iteration T + 1, and the storage stage ends. From there, we continue iterating using the regular SpMV.

When the next storage stage is reached, after an additional T iterations, we use ASpMV again, and Q becomes [p′^(T), p′^(T+1), p′^(2T)], and again, a copy of β^(2T) is created. The newest entry of Q cannot be used in conjunction with the previous ones to reconstruct the state of the solver. In the event of a node failure at this point, the state would be recovered for iteration T + 1. At this time, we still need the value of β^(T) to reconstruct the state, so we may not overwrite β∗ yet, and will overwrite it in the next iteration instead. After an additional successful iteration, the queue contains [p′^(T+1), p′^(2T), p′^(2T+1)], and copies of r^(2T+1), z^(2T+1), x^(2T+1) and p^(2T+1) are created. From that point on, it is possible to reconstruct the state for iteration 2T + 1, which concludes the second storage stage.
The process is presented in Alg. 3 and graphically represented in Fig. 1. For the description of the algorithms in Section 3, we refer to the ordinary SpMV as ϱ := SpMV(A, p), a function that takes the matrix A and the vector p as inputs and returns their product ϱ. The augmented variant is represented as ϱ := ASpMV(A, p, ϕ, Q), where the function additionally takes the number of desired redundant copies ϕ and the queue Q onto which it will push a new redundant copy of p.

Algorithm 3: Preconditioned conjugate gradient (PCG) method with periodic redundant storage (for ESRP)

 1: r^(0) := b − Ax^(0), z^(0) := Pr^(0), p^(0) := z^(0), j := 0, Q := [_, _, _]
 2: repeat
 3:   if j mod T = 0 and j > 0 then
 4:     ϱ^(j) := ASpMV(A, p^(j), ϕ, Q)
 5:     β∗∗ := β^(j)
 6:   else if (j − 1) mod T = 0 and j > 1 then
 7:     ϱ^(j) := ASpMV(A, p^(j), ϕ, Q)
 8:     x∗ := x^(j), r∗ := r^(j), z∗ := z^(j), p∗ := p^(j)
 9:     β∗ := β∗∗
10:   else
11:     ϱ^(j) := SpMV(A, p^(j))
12:   end if
13:   α^(j) := (r^(j))ᵀ z^(j) / (p^(j))ᵀ ϱ^(j)
14:   x^(j+1) := x^(j) + α^(j) p^(j)
15:   r^(j+1) := r^(j) − α^(j) ϱ^(j)
16:   z^(j+1) := P r^(j+1)
17:   β^(j) := (r^(j+1))ᵀ z^(j+1) / (r^(j))ᵀ z^(j)
18:   p^(j+1) := z^(j+1) + β^(j) p^(j)
19:   j := j + 1
20: until ‖r‖/‖b‖ < rtol

From Fig. 1 we can see which checkpointing intervals make sense for ESRP. For T > 2, we proceed as explained above. For T = 2, it no longer makes sense to use this approach, since redundant copies are created in every iteration; in this case, it is better to use ESR (Section 2.3). For T = 1, we can no longer talk about creating two successive copies; again, this corresponds to regular ESR.
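The bookkeeping of the storage stages can be summarized in a short sketch. This is our own single-process illustration of the branch structure of Alg. 3, not the paper's C implementation: it tracks only which iterations produce the two consecutive redundant search-direction copies that make recovery possible, eliding the distributed copies and the SpMV itself.

```python
from collections import deque

def esrp_storage_schedule(T, iterations, queue_len=3):
    """Replay the branch structure of Alg. 3 and report the iterations
    for which an implicit checkpoint becomes available."""
    Q = deque(maxlen=queue_len)    # queue of redundant search-direction copies
    recovery_points = []
    for j in range(iterations):
        if j % T == 0 and j > 0:
            Q.append(j)            # ASpMV pushes p^(j); beta goes to beta**
        elif (j - 1) % T == 0 and j > 1:
            Q.append(j)            # second consecutive copy: x*, r*, z*, p*
            recovery_points.append(j)  # and beta* are stored; state recoverable
        # otherwise: ordinary SpMV, nothing is stored
    return recovery_points

# With T = 5, the solver can recover to iterations 6, 11, 16, ...
print(esrp_storage_schedule(T=5, iterations=20))  # [6, 11, 16]
```

As the example in the text describes, a recovery point exists one iteration after each storage stage begins, i.e., at iterations T+1, 2T+1, and so on.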
The addition of a checkpointing interval to ESRP highlights its similarity to CR approaches. In both cases, we store the state of the solver after every checkpointing period: either explicitly, in the case of CR, or implicitly, by exploiting redundancy provided by the algorithm itself, in the case of ESRP. We claim, therefore, that ESRP is an algorithm-based checkpoint-restart strategy.

As with CR, this method introduces a trade-off between the runtime overhead, which decreases with increasing T since we create redundant copies less frequently, and the cost of discarding the iterations performed since the last storage stage was reached.

In-memory checkpoint-restart.
In our experiments in Section 5 we compare the runtime of our ESRP approach to an in-memory buddy checkpoint-restart strategy (IMCR), which we now describe in detail. Similar to the description above, we assume a checkpoint interval of T: once every T iterations, each node creates a checkpoint by sending a complete copy of the local parts of all vectors it owns to a neighboring node (the “buddy node”). In the event of a node failure, the replacement node then simply retrieves its local vector parts from the buddy. As in the case of ESR, we assume that the static solver data can be retrieved from safe storage and does not need to be stored during the checkpointing. To extend this approach to support multiple node failures, it is sufficient to send the checkpointing data to multiple buddies. This is a form of algorithm-based fault tolerance, since the checkpointing and recovery strategies are specifically tailored to the PCG solver.

There are further similarities between this checkpointing strategy and ESR/ESRP: for example, the data that can be retrieved from a checkpointing buddy is the same data that is reconstructed during the recovery phase of ESR, and the strategy for choosing the buddies is the same as the strategy for determining the destinations of redundant elements in the augmented sparse matrix-vector product, as defined by Eq. 1 in Section 2.2.1. An important difference, however, is that ESR mainly adds on to existing communication, while the checkpointing strategy introduces a completely new round of communication in each storage iteration.

We implement our algorithms using our own framework, written in C; in this way we achieve the highest flexibility for our purpose. Elementary linear algebra functionality is provided by GSL [10], but the parallelization of linear algebra operations, in particular of the matrix-vector product, is in-house code. Communication between nodes is realized with MPI.
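As a toy illustration of the buddy scheme described above: each node needs ϕ distinct destinations for its redundant data. Eq. 1 itself is not reproduced in this excerpt, so the cyclic-neighbor mapping below is our assumption, not necessarily the mapping used in the paper.

```python
def buddy_ranks(rank, n_nodes, phi):
    """Ranks that hold node `rank`'s checkpoint (or redundant) data.

    Assumes a cyclic-neighbor mapping: the phi buddies of a node are
    the next phi ranks, wrapping around at the end.
    """
    if phi >= n_nodes:
        raise ValueError("need more nodes than redundant copies")
    return [(rank + k) % n_nodes for k in range(1, phi + 1)]

# With 128 nodes and phi = 3, node 126 checkpoints to nodes 127, 0 and 1:
print(buddy_ranks(126, 128, 3))  # [127, 0, 1]
```

Any mapping works as long as the ϕ buddies of a node are pairwise distinct and different from the node itself, so that ϕ simultaneous node failures can be tolerated.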
The framework is modularly structured, such that different strategies to achieve data redundancy, and to perform reconstruction and recovery, can be used. In particular, we can simulate ESRP as well as in-memory CR.

We simulate one node-failure event for each run of PCG. The ranks of the affected nodes and the iteration at which the failure should occur are passed as parameters to our framework. Once the marked iteration is reached, the nodes set to fail zero out all their vector entries, as well as the scalars they contain, thus simulating the loss of all of their dynamic data (their components of the vectors x, r, z, p, as well as the scalars β and α). This is also the initial state of a replacement node, which starts without knowledge of the state of the node it is replacing. For ease of implementation, the set of nodes simulating a node failure also acts as the replacements. After a simulated node failure, the replacement nodes start the recovery process, collecting information from their neighbors and reconstructing the lost data.

In a real-world scenario, the replacement nodes would also have to reload the rows of the system matrix and the preconditioner, and the entries of the right-hand-side vector that they own; however, we have decided not to include this step in the measurement of the runtime overheads during our experiments. There are two main reasons for this decision.
Firstly, this overhead depends too strongly on the individual use case for us to be able to make any generalized statements about it: not only is it influenced by the matrix size and file system properties, but matrices might be stored in different file formats (e.g., plain text or binary), or the solver might be working with a matrix-free representation altogether. This changes the loading time considerably. Secondly, the reloading step is the same for both the ESR and CR versions we investigate. Therefore, we would not gain any valuable information about differences in the behaviour of these strategies from examining the time required for the reloading of static data.

Figure 1: State of the redundancy queue for the search directions during the solution process. The lists below the line represent the state of the search-direction queue, and which search directions are stored redundantly somewhere in the cluster. The thin arrows running leftwards show how far the solver has to revert in the event of a node failure.
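The failure injection described above (zeroing all dynamic data, leaving the node in the state of a fresh replacement) can be sketched as follows. This is an illustrative stand-in only; the real framework is written in C with MPI, and the names here are ours.

```python
import random

class NodeState:
    """One node's dynamic PCG data: local parts of the distributed
    vectors x, r, z, p, plus the scalars alpha and beta."""
    def __init__(self, n_local, seed=0):
        rng = random.Random(seed)
        self.x, self.r, self.z, self.p = ([rng.random() for _ in range(n_local)]
                                          for _ in range(4))
        self.alpha, self.beta = 0.5, 0.25  # placeholder scalar values

    def simulate_failure(self):
        """Zero out all dynamic data: this is exactly the initial
        state of a replacement node, which must then recover."""
        for name in ('x', 'r', 'z', 'p'):
            setattr(self, name, [0.0] * len(getattr(self, name)))
        self.alpha = self.beta = 0.0

node = NodeState(n_local=4)
node.simulate_failure()
print(node.x, node.alpha)  # [0.0, 0.0, 0.0, 0.0] 0.0
```

Static data (matrix rows, preconditioner blocks, right-hand-side entries) is deliberately not modeled here, matching the experimental setup described in the text.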
Beyond node-failure simulation
Since we only simulate node failures, our framework does not capture all of the events that would take place in the case of a real incident, nor all of the steps that are necessary to recover from one. We assume that, in a realistic application, there would be some middleware available to take care of these additional tasks. At any rate, we can describe the operations necessary to perform the recovery using ESR and CR that are not modeled in our framework.

The first of these tasks is detecting a node failure: a prerequisite for recovery for both ESR and CR is a mechanism that notices when one of the nodes becomes unresponsive. It is reasonable to assume that this cost would be roughly the same for both approaches in a framework that is well-optimized for this purpose.

A second task currently not modeled is determining which node has failed. The surviving nodes need this information to decide what data must be transferred to the replacement node, and the replacement node needs it to know which rows of the matrix and the preconditioner, and which entries of the right-hand-side vector, to load.

A third task would be to set up the cluster to continue working. For ESR as well as the CR strategy described in this paper, this involves providing a replacement node to take the place of the lost node and setting up a new communicator to continue iterating. In the case of ESR, it is also possible to proceed without a replacement node (cf. [22]), but this is beyond the scope of this paper.
More generally speaking, depending on how exactly CR is implemented, it could make use of spare nodes, enabling the application to keep using most of the nodes already allocated to it, but making it necessary to identify the lost node just as in the case of ESR; or the whole application could be restarted on newly-allocated nodes, although this is likely to be more costly than identifying the lost nodes, particularly at larger scales [6, 12, 26]. All in all, we expect the costs of the events that our framework does not model to be comparable between ESR and CR.

Although node-failure handling is presently not in the MPI standard, there is ongoing work on tools to deal with node failures. The
User-Level Fault Mitigation (ULFM) library [3, 18] offers functions for detection of node failures and identification of the affected nodes. In conjunction with standard MPI, it is possible to create a new communicator on which the solver can continue working.
Experimental setup.
Our experiments are run on the VSC3 machine of the Vienna Scientific Cluster. We use 128 nodes, with one process per node. (One process is sufficient to examine the overheads for resilience, since the redundant data has to be sent to different nodes in any case.) This machine has a fat-tree topology. We use the following software: the Intel C compiler 18.0.5, Intel MPI version 2018 update 4, and GSL 2.4.

If no protective measures are taken, node failures can cause the loss of all the computation invested in the solution of a linear system. However, they are a relatively rare occurrence. With the incidence estimations of [11] (mean time between failures of 9 hours for 100 000 nodes, and 53 minutes for 1 000 000 nodes), a linear solver might be affected by at most a few such events during its runtime. Thus, we consider examining the behavior of the solver when a single node-failure event strikes at some point during its operation a useful exercise.

We use a block Jacobi preconditioner, with non-overlapping blocks and all rows of a block belonging to a single node. The blocks are uniformly sized and we use as few of them as possible, with a maximum block size of 10. This preconditioner is used both
for the linear system of the problem we are solving, and for the inner systems of the reconstruction (Lines 6 and 8 of Alg. 2). The solver has converged once the relative residual ‖r‖/‖b‖ is below 10^−. The relative residual for the inner system for the reconstruction must reach 10^− for convergence.

Table 1: Test matrices from [9]

Matrix | Problem type | Size | Nonzeros
Emilia_923 | Structural | 923 136 | 40 373 538
audikw_1 | Structural | 943 695 | 77 651 847

Our test problems are SPD matrices from the SuiteSparse Matrix Collection [9] (see Table 1). They were selected based on their size, to allow for comparisons with related work in [21]. With these matrices, we set up the test constellation as follows:
• Two recovery strategies: ESRP and in-memory CR.
• Checkpoint intervals of 20, 50 and 100 iterations, plus an interval of 1 for ESRP, representing the previously existing ESR method.
• Resilience with 1, 3 and 8 redundant copies.
• Reference runs, runs with resilience but without node failures, and node failures introduced in contiguous blocks starting at ranks 0 and 64, with as many node failures as the solver can tolerate with the number of available copies.
We introduce a node failure in the interval between checkpoints that contains the iteration C/
2, where C is the number of iterations that a failure-free solver needs to converge. Within this interval, the node failure is introduced two iterations before its end, thus representing a worst-case scenario in which most of the progress since the start of the interval is lost. Experiments are repeated at least five times for every setting in this test constellation.

The use of contiguous blocks of ranks for the node failures is justified by considering that multiple-node failures would most likely come from, for example, a switch fault, affecting a branch of the fat tree and, consequently, a contiguous block of ranks.

Experimental results
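To make the failure-injection rule above concrete, the injected iteration can be computed from C and T. This is our reading of the setup ("two iterations before the end of the checkpoint interval containing C/2"), written as a small helper with the C and T values reported for Emilia_923 below.

```python
def failure_iteration(C, T):
    """Iteration at which the simulated node failure is injected:
    two iterations before the end of the checkpoint interval that
    contains iteration C/2."""
    k = (C // 2) // T          # index of the interval containing C/2
    return (k + 1) * T - 2

# Emilia_923 converges after C = 10279 iterations; with T = 50 the
# interval containing iteration 5139 is [5100, 5150), so the failure
# is injected at iteration 5148.
print(failure_iteration(C=10279, T=50))  # 5148
```

This places the failure as late as possible in its interval, so that nearly a full checkpointing period of work is discarded on rollback.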
We measure the runtime t of the solver to reach convergence to evaluate our algorithms. We define the reference time, t₀, as the median runtime of a reference, non-resilient PCG solver. The metric that we present in our results is the relative overhead over this reference time: relative overhead = (t − t₀)/t₀. Our results are summarized in Tables 2 and 3, and shown in Figs. 2 and 3. Note that in all of our experiments the measured total runtime is less than a minute and, thus, well below an estimated mean time between failures of approximately an hour up to several hours in large-scale applications [11]. Since we have to recover from a node failure in that short runtime, the overhead due to this recovery is likely more severe than in a real-world application with less frequent recoveries. Therefore, we test a scenario that is less favorable for our approach. Reasonably low overheads in our experiments can hence be considered a proof of concept.

For the used test matrices, the overheads of ESRP are usually much smaller than those of ESR (Section 2.3), and this advantage is more pronounced for larger numbers of redundant copies. In the case of ESRP, the columns for reconstruction overhead in Tables 2 and 3 show the cost of gathering the information and of recomputing the lost data. In the case of IMCR, these columns show the cost of communicating checkpointed data to the replacement nodes. For both test matrices, our experiments show a reconstruction overhead of basically zero for IMCR. This suggests that in our experimental setup the communication cost is considerably smaller than the computation cost. Keeping in mind the common understanding that communication tends to become more expensive than computation, especially at large scale, this observation indicates that the experimental setup currently available to us possibly leads to underestimating the communication cost and does not allow for a solid comparison with IMCR.
Consequently, we decided to depict the experimental measurements for IMCR less prominently in Figs. 2 and 3 and leave more representative experiments at larger scales for future work.

The cluster that we use introduces a certain amount of variation into our measurements. We repeat each experiment to reduce the standard deviation of the tests, and we present their median. However, there are cases in which this standard deviation for the reference runtimes is not reduced below the overhead. Table 2 shows an instance of this: the median runtime for ESRP with three redundant copies is larger for a period of 50 than for a period of 20. In general, we expect larger checkpoint intervals to produce smaller overheads in the failure-free case, but these results can be affected by the variation in runtime from external factors. In this case, the overhead is so close to zero that it is overshadowed by the noise from the machine.

Table 2 shows that, for the matrix Emilia_923, the failure-free overhead for ESRP is lower than for IMCR, down to about half of the corresponding value for IMCR in some settings. For ESRP, reducing the frequency at which redundant copies are stored visibly reduces the overhead, especially in cases with multiple-node failures (Tables 2 and 3). As for the matrix audikw_1, Table 3 shows that the overheads for the failure-free cases for ESRP and IMCR are close, with some advantage for ESRP in cases with multiple-node failures.

In the case of ESRP, the ranks of the lost nodes determine the submatrix A_{I_f,I_f} (where I_f represents the set of indices affected by the node failure), and thus which inner linear system will be solved during the reconstruction in Line 8 of Alg. 2, and how fast this can be done. This cost is also influenced by the performance of the preconditioner used for the inner system. As a consequence, the recovery times for ESRP change for different matrices and for different sets of lost nodes for the same matrix.
In contrast, the recovery cost in IMCR is the cost of transferring checkpointed vectors to the replacement nodes, and it is more or less independent of the location of the failure. Currently, our experiments use a simple block Jacobi preconditioner. We believe that ESRP would greatly benefit from more appropriate preconditioners, and an investigation of this aspect is future work.

Whether ESRP or IMCR is the better strategy for resilience depends on the probability of node failures happening. If the probability is low, a method with a low overhead in the undisturbed case is preferable, even if the reconstruction cost is higher, since it is unlikely that the runtime overhead must be incurred. Conversely, a method with a lower recovery cost is preferable if the probability of encountering a node failure is higher.

Table 2: Results for matrix Emilia_923. Reference time t₀ = 14.66 s. The reference case takes C = 10279 iterations to reach convergence. The strategies shown are ESR with periodic storage (ESRP) and in-memory buddy CR (IMCR). T: checkpointing interval, measured in iterations. ϕ: number of supported node failures. ψ: number of introduced node failures. All overheads are relative to t₀. Failure-free overhead: runtime overhead of runs with resilience, but without introduced node failures. The Location column indicates where the failures are introduced; rows marked start and center have node failures introduced in blocks starting at ranks 0 and 64, respectively. Overhead with node failures: overall overheads for runs with an event where as many nodes fail simultaneously as the solver can tolerate. Reconstruction overhead: overhead for the reconstruction operations (collecting data in the replacement nodes and reconstructing the state for ESRP, and sending the checkpointed data to the replacement node in IMCR). All results are the median of at least five repeated experiments for the corresponding setting. In all cases, node failures are introduced two iterations before a checkpoint for the interval containing the iteration C/2, thus representing a worst-case scenario. The overhead for wasted iterations for T > 1 is not explicitly shown; it can be approximated by subtracting the reconstruction overhead from the overall overhead since, in general, the reconstruction does not change the trajectory of the solver after rollback.

[Table 2 data: rows for ESRP with T = 1, 20, 50, 100 and IMCR with T = 20, 50, 100; columns Failure-free overhead [%], Overhead with node failures [%], and Reconstruction overhead [%], each for ϕ = ψ = 1, 3, 8. The numeric entries are illegible in this copy.]

Table 3: Results for matrix audikw_1. Reference time t₀ = 23.22 s. The reference case takes C = 5543 iterations to reach convergence. Symbols and terms are explained in the caption of Table 2.

[Table 3 data: same layout as Table 2; the numeric entries are illegible in this copy.]

Accuracy of the experiments.
In general, when working with PCG without residual replacement, there is some drift between the vector r and the vector b − Ax [27]. In ESRP, solving the inner system with an iterative solver, and performing the reconstruction in floating-point arithmetic, cause the reconstructed vector r not to be exactly equal to its state before the node failure. To evaluate the accuracy of ESRP, we compute the vector b − Ax^(End), where the superscript End represents the vectors' state after convergence, and use it to compute the residual drift metric:

(‖r^(End)‖ − ‖b − Ax^(End)‖) / ‖b − Ax^(End)‖.    (2)

A more positive value of the residual drift indicates a smaller value of ‖b − Ax^(End)‖ and, thus, a more accurate result. This metric is not used to determine convergence of the solver; for this, we use the relative residual as described in Section 5. The residual drift is computed only after the solver has converged. We use this metric to ensure that ESRP is not generally less accurate than PCG.

PCG and ESRP experiments without node failures produce the same value for the metric, since they always follow exactly the same trajectory. With node failures, the residual drift depends on the selection of affected nodes and on the iteration at which the node failure occurs; in this case, we present the minimum and the median over all experiments. Accuracy results are summarized in Table 4. In the median, ESRP with node failures does not differ significantly from PCG. As for the minimum value, the results for the matrix Emilia_923 show little accuracy loss with respect to PCG, but for the matrix audikw_1 there is a drift of close to 15. This behavior is explained by the fact that the reconstruction recomputes the iterand from the residual vector, thus making the residual and iterand consistent with each other.

Figure 2: Median runtime overhead for the experiments with the matrix Emilia_923. (a) Failure-free solver; (b) node failures introduced. In each plot, experiments for a checkpoint interval T are clustered together. In each cluster, there are three lines, representing experiments with ESRP, ESR and in-memory CR (IMCR). In each line, the three markers, from left to right, represent experiments with 1, 3 or 8 redundant copies, and, in Fig. 2b, also 1, 3 or 8 simultaneous node failures. ESR results are the same for all checkpointing intervals in each plot because they are equivalent to ESRP results with T = 1, and are displayed alongside the data for ESRP and IMCR for comparison. The markers represent the median over results in all locations (cf. Table 2) and repetitions.

Figure 3: Median runtime overhead for the experiments with the matrix audikw_1. (a) Failure-free solver; (b) node failures introduced. See the caption of Fig. 2 for details on the structure of the plots.

Table 4: Residual drift observed in the experiments.
Reference: residual drift for all failure-free cases. Median: median residual drift over all experiments with node failures. Minimum: minimum residual drift over all experiments with node failures, representing the greatest loss of accuracy during the reconstruction in ESRP.
Matrix | Reference | Median | Minimum
[Rows for Emilia_923 and audikw_1; all six entries are negative values in scientific notation, but the digits are illegible in this copy.]

CONCLUSIONS

In this paper, we introduce an extension to the exact state reconstruction (ESR) algorithm for node-failure resilience for PCG. Our approach reduces the runtime overhead of ESR by saving redundant copies not in every iteration, but only every T iterations. Given the relation of this approach to checkpoint-restart (CR), we call such a strategy an algorithm-based checkpoint-restart method. We introduce a framework for our experiments and evaluate this new strategy, which we call ESRP, comparing it to standard ESR and also to our implementation of in-memory buddy CR (IMCR). In our experimental results, the runtime overhead of ESRP turns out to be considerably lower than that of standard ESR, and also lower than that of IMCR in the failure-free cases. When node failures occur, however, the recovery time is dominated by the solution of a smaller inner linear system and depends on the matrix itself.

An important step to take in future work is to evaluate ESRP using different preconditioners. Furthermore, we are working on producing larger test problems, so that we can observe a different regime in the computation/communication ratio for PCG. Another interesting direction is the study of ESRP working with partitioning algorithms, looking in particular for partitioning strategies that optimize for the matrix-vector product and simultaneously provide sufficient redundancy. In the future, we intend to produce an implementation of the algorithms on a framework that can detect node failures and provide replacements, using tools such as ULFM [3].

ACKNOWLEDGMENTS
This work has been funded by the Vienna Science and Technology Fund (WWTF) through project ICT15-113. The computational results presented have been achieved using resources of the Vienna Scientific Cluster (VSC). We thank our reviewers for their helpful comments and suggestions.
REFERENCES
[1] Emmanuel Agullo, Luc Giraud, Abdou Guermouche, Jean Roman, and Mawussi Zounon. 2016. Numerical recovery strategies for parallel resilient Krylov linear solvers. Numer. Lin. Algebra Appl. 23, 5 (2016), 888–905.
[2] Satish Balay, Shrirang Abhyankar, Mark F. Adams, Jed Brown, Peter Brune, Kris Buschelman, Lisandro Dalcin, Alp Dener, Victor Eijkhout, William D. Gropp, Dmitry Karpeyev, Dinesh Kaushik, Matthew G. Knepley, Dave A. May, Lois Curfman McInnes, Richard Tran Mills, Todd Munson, Karl Rupp, Patrick Sanan, Barry F. Smith, Stefano Zampini, Hong Zhang, and Hong Zhang. 2019. PETSc Users Manual.
[3] Int. J. High Perform. Comput. Appl. 27, 3 (2013), 244–254.
[4] Jiajun Cao, Kapil Arya, Rohan Garg, Shawn Matott, Dhabaleswar K. Panda, Hari Subramoni, Jérôme Vienne, and Gene Cooperman. 2016. System-Level Scalable Checkpoint-Restart for Petascale Computing. 932–941.
[5] Franck Cappello, Al Geist, William Gropp, Sanjay Kale, Bill Kramer, and Marc Snir. 2014. Toward exascale resilience: 2014 update. Supercomput. Front. Innov.
[6] 131–140.
[7] Zizhong Chen. 2011. Algorithm-based Recovery for Iterative Methods Without Checkpointing. In Proceedings of the 20th International Symposium on High Performance Distributed Computing. 73–84.
[8] John T. Daly. 2006. A higher order estimate of the optimum checkpoint interval for restart dumps. Future Gener. Comput. Syst. 22, 3 (2006), 303–312.
[9] Timothy A. Davis and Yifan Hu. 2011. The University of Florida Sparse Matrix Collection. ACM Trans. Math. Software 38, 1 (2011), 1:1–1:25.
[10] Brian Gough. 2009. GNU Scientific Library Reference Manual (3rd ed.). Network Theory Ltd.
[11] Thomas Herault and Yves Robert. 2015. Fault-Tolerance Techniques for High-Performance Computing (1st ed.). Springer Publishing Company, Incorporated.
[12] Atsushi Hori, Kazumi Yoshinaga, Thomas Herault, Aurélien Bouteiller, George Bosilca, and Yutaka Ishikawa. 2020. Overhead of using spare nodes. Int. J. High Perform. Comput. Appl. 34, 2 (2020), 208–226.
[13] Kuang-Hua Huang and Jacob A. Abraham. 1984. Algorithm-Based Fault Tolerance for Matrix Operations. IEEE Trans. Comput. C-33, 6 (1984), 518–528.
[14] Markus Huber, Björn Gmeiner, Ulrich Rüde, and Barbara Wohlmuth. 2015. Resilience for Exascale Enabled Multigrid Methods. arXiv:1501.07400.
[15] Julien Langou, Zizhong Chen, George Bosilca, and Jack Dongarra. 2007. Recovery Patterns for Iterative Methods in a Parallel Unstable Environment. SIAM J. Sci. Comput. 30, 1 (2007), 102–116.
[16] Markus Levonyak, Christina Pacher, and Wilfried N. Gansterer. 2020. Scalable Resilience Against Node Failures for Communication-Hiding Preconditioned Conjugate Gradient and Conjugate Residual Methods. In Proceedings of the 2020 SIAM Conference on Parallel Processing for Scientific Computing. 81–92.
[17] Hatem Ltaief, Marc Garbey, and Edgar Gabriel. 2006. Parallel fault tolerant algorithms for parabolic problems. Euro-Par 2006 Parallel Processing (2006), 700–709.
[18] Message Passing Interface Forum. 2017. User Level Failure Mitigation. http://fault-tolerance.org/.
[19] Carlos Pachajoa and Wilfried N. Gansterer. 2017. On the Resilience of Conjugate Gradient and Multigrid Methods to Node Failures. In Euro-Par 2017: Parallel Processing Workshops (Lecture Notes in Computer Science, Vol. 10659). Springer, 569–580.
[20] Carlos Pachajoa, Markus Levonyak, and Wilfried N. Gansterer. 2018. Extending and Evaluating Fault-Tolerant Preconditioned Conjugate Gradient Methods. 49–58.
[21] Carlos Pachajoa, Markus Levonyak, Wilfried N. Gansterer, and Jesper Larsson Träff. 2019. How to Make the Preconditioned Conjugate Gradient Method Resilient Against Multiple Node Failures. In Proceedings of the 48th International Conference on Parallel Processing. Article 67, 10 pages.
[22] Carlos Pachajoa, Christina Pacher, and Wilfried N. Gansterer. 2019. Node-Failure-Resistant Preconditioned Conjugate Gradient Method without Replacement Nodes. IEEE, 31–40.
[23] Daniel A. Reed and Jack Dongarra. 2015. Exascale Computing and Big Data. Commun. ACM 58, 7 (2015), 56–68.
[24] Yousef Saad. 2003. Iterative Methods for Sparse Linear Systems (2nd ed.). Society for Industrial and Applied Mathematics.
[25] Dingwen Tao. 2018. Fault Tolerance for Iterative Methods in High-Performance Computing. Ph.D. Dissertation. UC Riverside.
[26] Keita Teranishi and Michael A. Heroux. 2014. Toward Local Failure Local Recovery Resilience Model Using MPI-ULFM. In Proceedings of the 21st European MPI Users' Group Meeting. 51–56.
[27] Henk A. Van Der Vorst and Qiang Ye. 2000. Residual replacement strategies for Krylov subspace iterative methods for the convergence of true residuals. SIAM J. Sci. Comput. 22, 3 (2000), 835–852.
[28] John Young. 1974. A first order approximation to the optimum checkpoint interval.