Checkpointing and Localized Recovery for Nested Fork-Join Programs
Claudia Fohry
Research Group Programming Languages / Methodologies, University of Kassel, [email protected]
Abstract—While checkpointing is typically combined with a restart of the whole application, localized recovery permits all but the affected processes to continue. In task-based cluster programming, for instance, the application can then be finished on the intact nodes, and the lost tasks be reassigned. This extended abstract suggests adapting a checkpointing and localized recovery technique that was originally developed for independent tasks to nested fork-join programs. We consider a Cilk-like work stealing scheme with work-first policy in a distributed memory setting, and describe the required algorithmic changes. The original technique has checkpointing overheads below 1% and negligible costs for recovery; we expect the new algorithm to achieve a similar performance.
I. INTRODUCTION
Checkpoint/Restart (C/R) is the current standard technique to handle fail-stop failures of processes in clusters [1], [2]. It is often criticized for limited scalability, and therefore variants such as uncoordinated [3], in-memory [4] and multi-level [5] checkpointing have been devised. These variants reduce the costs, but most systems still require restarting the whole application when a failure occurs.

From both a deployment and a performance point of view, it may be preferable to continue the program execution on the reduced set of processes, which is called shrinking recovery. Related to that, a localized recovery approach confines the failure handling to the affected processes, ideally without any involvement of the others.

A few localized, and partly shrinking, recovery techniques have already been suggested, notably in the context of task-based parallel programming [6]–[8]. This context is generally promising for the provision of resilience: since tasks have clearly defined interfaces, checkpointing at the task level can nicely combine ease of use and efficiency. Ease of use is achieved through a transparent implementation in the runtime system, and efficiency through saving task descriptors only.

The importance of task-level checkpointing is likely to increase with the current rise of task-based parallel programming (e.g. [9]–[16]). While the cited environments differ widely in their mechanisms for task creation and cooperation, this paper solely refers to nested fork-join programs (NFJs), which were popularized with Cilk [17]–[19]. Listing 1 depicts the computation of Fibonacci numbers as an example.

Listing 1: Nested fork-join program

int fib(int n) {              // 0
    if (n < 2) return n;
    int x = spawn fib(n-1);   // 1
    int y = spawn fib(n-2);   // 2
    sync;
    return x + y;             // 3
}

NFJs start with a single root task, here fib(n). Then each task may spawn any number of children and pass parameters to them. A task must wait for the results of all children, either explicitly with sync, or implicitly at the end of the function. We assume that the tasks communicate through parameter passing and result return only; they must not have side effects.

The execution of a fork-join program gives rise to a tree, such as the one in Figure 1. In the figure, rectangles denote spawned functions. Numbers 0 to 3 correspond to sequential code sections, as given by the comments in Listing 1. For instance, section 0 runs from the beginning of the function until the spawn of the first child. Downward edges (solid) mark spawns, and upward edges (dotted) mark result returns at explicit or implicit syncs. NFJ implementations commonly use work-first work stealing, which is explained in Section II.

Fig. 1: Execution of nested fork-join programs (task tree of fib(3))

A resilience scheme with shrinking localized recovery for NFJs under work-first work stealing has already been introduced by Kestor et al. [7]. Their technique exploits the particular NFJ style of synchronization to restrict task re-execution to the lost (sub-)tasks. More specifically, when k out of p processes fail, a share of k/p of the previous work must be re-done on average. The technique has overheads below 1% in failure-free runs, but some drawbacks in recovery:

• In contrast to the average case, up to 100% of the previous work must be re-done in worst-case scenarios. These occur when the root of a large (sub-)tree fails at the moment when all results have been returned there.
To avoid such unlimited information loss, the authors of [7] suggest combining their technique with standard C/R, which, however, means essentially giving up localized recovery.
• When a failure occurs, all processes must inspect a data structure and participate in a global reduction, deviating from perfectly localized recovery.
• The average share of k/p may still be large for long-running programs in failure-prone environments.

Therefore, this paper advocates a checkpointing-based alternative. Checkpointing algorithms already exist for other classes of task-based parallel programs [6], [20]. We refer to an algorithm for dynamic independent tasks (DIT) from Posner et al. [8], and adapt it to NFJ. Like NFJ tasks, DIT tasks are spawned dynamically, forming a tree. However, child tasks do not return a result to their parent, but contribute to a final result that is accumulated locally and calculated by reduction.

A recent study has compared the algorithms from Kestor et al. [7] and Posner et al. [8] in DIT, to which the NFJ algorithm was transferred [21]. The study reported overheads below 1% for both algorithms, with those of the NFJ-specific algorithm [7] being lower in failure-free cases, and those of the checkpointing algorithm [8] being lower during recovery. We expect similar results for NFJ, since the algorithm proposed in this paper closely resembles the original one.

Section II of this abstract states our assumptions on the failure model and provides background on work stealing. Then Section III sketches the checkpointing algorithm and explains our proposed algorithmic changes. Section IV outlines some related work, and Section V finishes with conclusions.
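To make the difference between the two task models concrete, the following C++ sketch contrasts them. It is purely schematic; the use of std::async and the names nfj_fib, dit_count and local_result are illustrative assumptions, not the interfaces of [8] or of a Cilk runtime.

#include <future>

// NFJ-style task: every child returns its result to the parent, which waits
// for all children before combining the results (cf. spawn/sync in Listing 1).
long nfj_fib(long n) {
    if (n < 2) return n;
    auto x = std::async(std::launch::async, nfj_fib, n - 1);  // "spawn"
    long y = nfj_fib(n - 2);
    return x.get() + y;                                       // "sync"
}

// DIT-style task: children return nothing to the parent; each task adds to a
// worker-local result, and the final value is a reduction over all workers.
thread_local long local_result = 0;
void dit_count(long n) {      // e.g., count all spawned tasks
    local_result += 1;        // contribute to the local accumulator
    if (n < 2) return;
    dit_count(n - 1);         // children are spawned dynamically, forming a tree
    dit_count(n - 2);
}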
II. BACKGROUND AND ASSUMPTIONS

a) Failure Model:
We consider fail-stop failures of processes after permanent hardware failures, and assume reliable network communication. Any number of failures may strike at any time, including simultaneous failures and failures during recovery. A program must always compute the correct result, except that it may abort when the resilient store used for checkpoint saving (see Section III) fails. All processes must be notified of all failures, possibly with a delay.

Fig. 2: Work stealing example

b) Work stealing:
Tasks are commonly executed by a fixed set of workers, which in our distributed memory setting correspond to processes. Each worker owns a local pool for storing and retrieving tasks, which are represented by stack frames. When the pool is empty, the worker becomes a thief and tries to steal tasks from a victim, e.g. from a random worker. Like [7], we consider private pools, i.e., the thief must send a request message, and the victim answers by sending loot [22], [23].

An NFJ execution starts with one local pool holding the root task. Then, most implementations proceed in a work-first manner: when a worker encounters a spawn, it branches into the child and puts the continuation of the parent frame into the pool. Steals always extract the oldest frame.

Figure 2 illustrates work-first work stealing. The figure uses the same notation as Figure 1, but a different task structure to facilitate further discussion. Each color marks the work performed by a particular worker. The computation starts with the green worker (called Green) processing the A frame. At the first spawn, Green branches into B, and Brown steals the continuation of A. In general, thieves process parent frames, and victims process children, as shown on the right side of the picture.

There are only a few cluster implementations of the above scheme [7], [18]. The one in [7] uses active messages in a Partitioned Global Address Space (PGAS) setting. When a thief encounters a sync, it sends the frame back to the victim (or transitively to all victims) for result matching. Note that the parent frame is sent back to the child (and not vice versa), even though the arrows in Figure 2 indicate that logically the result is incorporated into the parent frame. When a victim finishes a task whose parent is away, it locally saves the result and steals a new task.

For an example, consider Red in Figure 2. It stole frame B from Green at B2, and was stolen from by Yellow at B3. Red finished F before Yellow returned, so Red kept the result (called rF) and stole the A frame at A3. Later, Blue stole the A frame at A4 and already returned it (called fA) at the sync opening A5 (as marked by the dotted red line). Now consider the time when all sections printed in bold have been finished, i.e., right before branching into H. At this time, Red holds rF, fA, a local pool with D2 and G2, and a descriptor of H. Furthermore, Red knows the identities of all victims and thieves with still unmatched results: Green and Yellow for the B2 frame, and Brown and Blue for the A3 frame. In its entirety, this information forms Red's state, which will be defined in Section III.
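To make the work-first discipline concrete, the following C++ sketch shows a single worker's private pool. It is a schematic illustration under assumed names (Frame, Pool, pop_for_owner, pop_for_thief), not the data structure of [7] or [18].

#include <deque>
#include <optional>
#include <utility>

// A frame represents a task: a function activation together with the point
// at which it continues after its next spawn or sync.
struct Frame { /* function id, arguments, continuation point, ... */ };

// Private pool of one worker. Only the owner accesses it directly; thieves
// obtain frames via request messages that the owner answers with loot.
struct Pool {
    std::deque<Frame> frames;

    // Work-first: at a spawn, the owner branches into the child and parks
    // the continuation of the parent frame here (young end of the deque).
    void push_continuation(Frame parent) { frames.push_back(std::move(parent)); }

    // The owner resumes the most recently parked continuation.
    std::optional<Frame> pop_for_owner() {
        if (frames.empty()) return std::nullopt;   // owner becomes a thief
        Frame f = std::move(frames.back());
        frames.pop_back();
        return f;
    }

    // Answer to a steal request: the oldest frame (old end) is sent as loot.
    std::optional<Frame> pop_for_thief() {
        if (frames.empty()) return std::nullopt;   // no loot available
        Frame f = std::move(frames.front());
        frames.pop_front();
        return f;
    }
};

When pop_for_owner finds the pool empty, the worker sends a steal request to a randomly chosen victim, whose runtime answers with the result of pop_for_thief.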
III. CHECKPOINTING ALGORITHM

We refer to the AllFT algorithm from Posner et al. [8], which is the simplest of three DIT algorithms suggested in that reference. Our techniques may extend to their incremental scheme, which reduces the overhead for large task descriptors.

AllFT regularly writes uncoordinated checkpoints to a resilient store. Reference [8] uses the IMap of Hazelcast [24] for this purpose, which is a replication-based in-memory store, but the algorithm is not restricted to it. Checkpoints for DIT comprise the local pool contents and the accumulated worker result. They are written between finishing one task and starting the next one, so that they capture a coherent state. Besides regular checkpoint writing, AllFT updates the checkpoints at each steal. A resilient steal protocol ensures consistency between victim, thief, and their respective checkpoints, despite possible failures [8].

For recovery, each worker has a designated buddy worker. The buddy reads the last checkpoint of the failed worker from the resilient store, and inserts the saved tasks into its own pool. This way, all recently run tasks of the failed worker are re-executed. Moreover, the buddy takes care of any recently extracted loot from the failed worker. As an aspect that is only relevant in [8], the buddy does not adopt the accumulated result, which is kept in the resilient store instead.

Buddies are chosen to be the next worker alive in a ring of workers, using some numbering of the workers. If a buddy fails, its successor takes its role. Like stealing, task adoption is protected by a resilient recovery protocol.

Details of the steal and recovery protocols can be found in [8], [25]. Briefly stated, they consider specific cases and for each define a set of actions to be taken to get back to a consistent state [8]. For instance, a victim may have to take back tasks when the thief fails. Outside the protocols, the checkpointed state of a worker is always consistent with the ongoing computation. Thus, one may safely reset a worker's state to the checkpointed one, without adjusting the states of the others.

The above algorithm can be adapted to NFJ with only two major changes:
1) The contents of the checkpoints must be equated to the state of an NFJ worker, as defined below.
2) An additional frame return protocol is required.

As in [8], checkpoints are written independently for each worker and contain the worker's state. Since they are written between task processings, they need not include the internal state of a current task. We define the state of a worker W to consist of:
• the current contents of W's local pool,
• all locally saved task results at W that have not yet been incorporated into their parent frame (e.g. rF),
• all frames returned to W by its thieves that are awaiting result incorporation (e.g. fA),
• the identities of all victims of W to which W has not yet returned the respective stolen frame,
• the identities of all thieves of W that have not yet returned their frame to W, and
• if relevant, a task descriptor of the next task.

The last item is relevant in case 1 of the two possible occasions for checkpoint writing described below, but not in case 2.
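The following C++ sketch summarizes these checkpoint contents. The type and field names (WorkerState, Frame, Result, and so on) are illustrative assumptions, not identifiers from [8].

#include <optional>
#include <vector>

using WorkerId = int;

struct Frame  { /* task descriptor: function, arguments, continuation point */ };
struct Result { /* value waiting to be matched into a parent frame */ };

// Checkpoint contents of worker W. The state is captured between task
// processings, so no internal state of a running task needs to be included.
struct WorkerState {
    std::vector<Frame>    pool;            // current contents of W's local pool
    std::vector<Result>   storedResults;   // locally saved results not yet in their parent frame (e.g. rF)
    std::vector<Frame>    returnedFrames;  // frames returned by W's thieves, awaiting result incorporation (e.g. fA)
    std::vector<WorkerId> openVictims;     // victims to which W has not yet returned the stolen frame
    std::vector<WorkerId> openThieves;     // thieves that have not yet returned their frame to W
    std::optional<Frame>  nextTask;        // only present when checkpointing right before branching (case 1)
};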
When a checkpoint is due, whichever of these two occasions comes first applies:
1. right before branching into a child or into a stolen task (e.g. before branching into H), or
2. after finishing a task and incorporating its result into the parent frame or storing it locally (e.g. after finishing H and incorporating its result into the G frame).

In case 2, the descriptor of the next task is irrelevant, since it will either be a task from the local pool, which is part of the state anyway, or a newly stolen task, for which the steal protocol will schedule another checkpoint.

The steal and recovery protocols can essentially be taken from [8], [25], since the handshaking to reach consistency is independent of the contents of the checkpoints. Merely a few differences exist in the way in which the buddy worker adopts the failed worker's data. Most importantly, for stored results like rF, it must inform the thief (in the example: Yellow) about the result's new location. This is feasible since the identities of the thieves are contained in the adopted checkpoint. If the thief has failed as well, the buddy contacts the buddy of the thief instead, which has adopted or will adopt the stolen frame or a continuation thereof. The identity of the thief's buddy can be figured out easily, since it is the next worker alive in the ring of workers.

Reference [8] does not specify a frame return protocol, since DIT has no result return. However, frame return resembles stealing insofar as data (a result or loot, respectively) are moved from one worker to another. Therefore, a subset of the steal protocol from [8] (starting after receipt of the steal request) can be used for this purpose. The protocol includes two checkpoint writes and temporarily saves the frame in the resilient store.
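As a small illustration of the recovery side, the C++ sketch below shows the buddy selection in the ring of workers and the relocation notice for stored results like rF. The function notify_new_location is a hypothetical messaging stub, not an operation defined in [8] or [25].

#include <vector>

using WorkerId = int;

// Hypothetical messaging stub: tell `thief` that results it still has to
// match are now held by worker `newOwner`.
void notify_new_location(WorkerId thief, WorkerId newOwner) { /* send message */ }

// The buddy of a failed worker is the next worker alive in the ring 0..p-1.
// If that worker has failed as well, its successor takes the role, and so on.
WorkerId buddy_of(WorkerId failed, int p, const std::vector<bool>& alive) {
    WorkerId w = failed;
    do { w = (w + 1) % p; } while (!alive[w] && w != failed);
    return w;
}

// After adopting the checkpoint of a failed worker, its buddy `me` informs
// every thief with a still unmatched result about the result's new location.
// A failed thief is replaced by its own buddy, which adopts the stolen frame
// or a continuation thereof.
void announce_adopted_results(WorkerId me, int p, const std::vector<bool>& alive,
                              const std::vector<WorkerId>& openThieves) {
    for (WorkerId t : openThieves) {
        WorkerId target = alive[t] ? t : buddy_of(t, p, alive);
        notify_new_location(target, me);
    }
}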
IV. RELATED WORK

There has been growing interest in task-level resilience in recent years. Topics include task re-execution after silent errors [26]–[29], techniques to handle different failure types together [30], and the tracking of global data accesses of tasks [31]. While we focus on the recovery of a dynamic task structure, Lion and Thibault [6] concentrate on checkpointing the data that are communicated between tasks. Like us, they perform a localized recovery, as do all previously discussed NFJ and DIT resilience schemes and their precursors [7], [8], [18]–[20], [25], [32].

Outside task-level resilience, localized recovery has been realized for MPI programs [33]–[35], with the help of Fenix [36] and User Level Failure Mitigation (ULFM) [37]. A resilient in-memory store other than the IMap was discussed in [38].
V. CONCLUSIONS
This extended abstract has suggested a checkpointing algorithm for NFJs, which supports shrinking localized recovery from one or multiple fail-stop failures of processes. It is a variant of a previous algorithm for DIT with only a few changes, which chiefly regard the contents of checkpoints. Therefore, we expect the new algorithm to perform similarly to the original one, which causes less than 1% running time overhead and negligible costs for recovery. Future work should experimentally investigate this expectation.
REFERENCES

[1] M. Snir, R. W. Wisniewski, J. A. Abraham et al., "Addressing failures in exascale computing," The Int. Journal of High Performance Computing Applications (IJHPCA), vol. 28, no. 2, pp. 129–173, 2014.
[2] T. Herault and Y. Robert, Eds., Fault-Tolerance Techniques for High-Performance Computing. Springer, 2015.
[3] A. Guermouche, T. Ropars, E. Brunet et al., "Uncoordinated checkpointing without domino effect for send-deterministic MPI applications," in Proc. Int. Parallel and Distributed Processing Symposium (IPDPS), 2011, pp. 989–1000.
[4] G. Zheng, X. Ni, and L. V. Kalé, "A scalable double in-memory checkpoint and restart scheme towards exascale," in IEEE/IFIP Int. Conf. on Dependable Systems and Networks Workshops (DSN-W), 2012.
[5] A. Benoit, A. Cavelan, V. Le Fèvre et al., "Towards optimal multi-level checkpointing," IEEE Transactions on Computers, vol. 66, no. 7, pp. 1212–1226, 2017.
[6] R. Lion and S. Thibault, "From tasks graphs to asynchronous distributed checkpointing with local restart," in Proc. Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS), 2020, pp. 31–40.
[7] G. Kestor, S. Krishnamoorthy, and W. Ma, "Localized fault recovery for nested fork-join programs," in Proc. Int. Parallel and Distributed Processing Symp. (IPDPS), 2017, pp. 397–408.
[8] J. Posner, L. Reitz, and C. Fohry, "A comparison of application-level fault tolerance schemes for task pools," Future Generation Computer Systems (FGCS), vol. 105, pp. 119–134, 2020.
[9] "The Chapel parallel programming language." [Online]. Available: chapel-lang.org
[10] H. Kaiser, P. Diehl, A. S. Lemoine et al., "HPX - the C++ standard library for parallelism and concurrency," The Journal of Open Source Software (JOSS), vol. 5, no. 53, 2020.
[11] "Legion programming system." [Online]. Available: legion.stanford.edu
[12] R. Hoque, T. Herault, G. Bosilca, and J. Dongarra, "Dynamic task discovery in PaRSEC: a data-flow task-based runtime," in ACM Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA), 2017.
[13] "Tascell backtracking-based load balancing framework." [Online]. Available: super.para.media.kyoto-u.ac.jp/tascell
[14] B. Archibald, P. Maier, R. Stewart, and P. Trinder, "YewPar: Skeletons for exact combinatorial search," in Proc. ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming (PPoPP), 2020, pp. 292–307.
[15] P. Thoman, K. Dichev, T. Heller et al., "A taxonomy of task-based parallel programming technologies for high-performance computing," The Journal of Supercomputing, vol. 74, no. 4, pp. 1422–1434, 2018.
[16] E. Slaughter, W. Wu, Y. Fu et al., "TaskBench: A parameterized benchmark for evaluating parallel runtime performance," in Proc. Int. Conf. on High Performance Computing, Networking, Storage and Analysis (SC), 2020, pp. 62:1–62:15.
[17] M. Frigo, C. E. Leiserson, and K. H. Randall, "The implementation of the Cilk-5 multithreaded language," ACM SIGPLAN Notices (PLDI), vol. 33, no. 5, pp. 212–223, 1998.
[18] R. D. Blumofe and P. A. Lisiecki, "Adaptive and reliable parallel computing on networks of workstations," in Proc. USENIX Annual Technical Symp., 1997, pp. 133–147.
[19] R. V. van Nieuwpoort, G. Wrzesińska, C. J. H. Jacobs, and H. E. Bal, "Satin: a high-level and efficient grid programming model," Transactions on Programming Languages and Systems (TOPLAS), vol. 32, no. 3, pp. 1–40, 2010.
[20] C. Fohry, M. Bungart, and P. Plock, "Fault tolerance for lifeline-based global load balancing," Journal of Software Engineering and Applications, vol. 10, no. 13, pp. 925–958, 2017.
[21] J. Posner, L. Reitz, and C. Fohry, "Checkpointing vs. supervision resilience approaches for dynamic independent tasks," 2021, submitted.
[22] U. A. Acar, A. Charguéraud, and M. Rainey, "Scheduling parallel programs by work stealing with private deques," ACM SIGPLAN Notices (PPoPP), vol. 48, no. 8, pp. 219–228, 2013.
[23] J. Posner and C. Fohry, "Cooperation vs. coordination for lifeline-based global load balancing in APGAS," in Proc. ACM SIGPLAN Workshop on X10, 2016, pp. 13–17.
[24] "Hazelcast." [Online]. Available: hazelcast.org
[25] J. Posner and C. Fohry, "A Java task pool framework providing fault-tolerant global load balancing," Int. Journal of Networking and Computing (IJNC), vol. 8, no. 1, pp. 2–31, 2018.
[26] N. Gupta, J. R. Mayo, A. S. Lemoine, and H. Kaiser, "Towards distributed software resilience in asynchronous many-task programming models," in Proc. Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS), 2020, pp. 11–20.
[27] S. R. Paul, A. Hayashi, N. Slattengren et al., "Enabling resilience in asynchronous many-task programming models," in Proc. Euro-Par Int. Conf. on Parallel and Distributed Computing, 2019, pp. 346–360.
[28] M. C. Kurt, S. Krishnamoorthy, K. Agrawal et al., "Fault-tolerant dynamic task graph scheduling," in Proc. Int. Conf. for High Performance Computing, Networking, Storage and Analysis (SC), 2014, pp. 719–730.
[29] O. Subasi, T. Martsinkevich, F. Zyulkyarov et al., "Unified fault-tolerance framework for hybrid task-parallel message-passing applications," The Int. Journal of High Performance Computing Applications (IJHPCA), vol. 32, no. 5, pp. 641–657, 2018.
[30] O. Subasi, J. Arias, O. Unsal et al., "NanoCheckpoints: A task-based asynchronous dataflow framework for efficient and scalable Checkpoint/Restart," in Proc. Euromicro Conf. on Parallel, Distributed and Network-Based Processing, 2015, pp. 99–102.
[31] W. Ma and S. Krishnamoorthy, "Data-driven fault tolerance for work stealing computations," in Proc. ACM Int. Conf. on Supercomputing, 2012, pp. 79–90.
[32] M. Bungart and C. Fohry, "A malleable and fault-tolerant task pool framework for X10," in Proc. Int. Conf. on Cluster Computing, Workshop on Fault Tolerant Systems. IEEE, 2017, pp. 749–757.
[33] M. Gamell, K. Teranishi, M. A. Heroux et al., "Local recovery and failure masking for stencil-based applications at extreme scales," in Proc. Int. Conf. for High Performance Computing, Networking, Storage and Analysis (SC), 2015, pp. 70:1–70:12.
[34] H. Kolla, J. R. Mayo, K. Teranishi, and R. C. Armstrong, "Improving scalability of silent-error resilience for message-passing solvers via local recovery and asynchrony," in Proc. Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS), 2020, pp. 1–10.
[35] K. Teranishi and M. A. Heroux, "Toward local failure local recovery resilience model using MPI-ULFM," in Proc. European MPI Users' Group Meeting (EuroMPI/ASIA), 2014, pp. 51–56.
[36] M. Gamell, D. S. Katz, H. Kolla et al., "Exploring automatic, online failure recovery for scientific applications at extreme scales," in Proc. Int. Conf. for High Performance Computing, Networking, Storage and Analysis (SC), 2014, pp. 895–906.
[37] N. Losada, P. González, M. J. Martín et al., "Fault tolerance of MPI applications in exascale systems: The ULFM solution," Future Generation Computer Systems (FGCS), vol. 106, pp. 467–481, 2020.
[38] D. Grove, S. S. Hamouda, B. Herta et al., "Failure recovery in resilient X10,"