Adapting the DMTCP Plugin Model for Checkpointing of Hardware Emulation

Abstract

Checkpoint-restart is now a mature technology. It allows a user to save and later restore the state of a running process. The new plugin model for the upcoming version 3.0 of DMTCP (Distributed MultiThreaded Checkpointing) is described here. This plugin model allows a target application to disconnect from the hardware emulator at checkpoint time and then re-connect to a possibly different hardware emulator at the time of restart. The DMTCP plugin model is important in allowing three distinct parties to seamlessly inter-operate. The three parties are: the EDA designer, who is concerned with formal verification of a circuit design; the DMTCP developers, who are concerned with providing transparent checkpointing during the circuit emulation; and the hardware emulator vendor, who provides a plugin library that responds to checkpoint, restart, and other events. The new plugin model is an example of process-level virtualization: virtualization of external abstractions from within a process. This capability is motivated by scenarios for testing circuit models with the help of a hardware emulator. The plugin model enables a three-way collaboration: allowing a circuit designer and emulator vendor to each contribute separate proprietary plugins while sharing an open source software framework from the DMTCP developers. This provides a more flexible platform, where different fault injection models based on plugins can be designed within the DMTCP checkpointing framework. After initialization, one restarts from a checkpointed state under the control of the desired plugin. This restart saves the time spent in simulating the initialization phase, while enabling fault injection exactly at the region of interest. Upon restart, one can inject faults or otherwise modify the remainder of the simulation. The work concludes with a brief survey of checkpointing and process-level virtualization.


Rohan Garg ∗
Northeastern University, Boston, MA

Kapil Arya
Mesosphere, Inc., San Francisco, CA

Jiajun Cao ∗
Northeastern University, Boston, MA

Gene Cooperman ∗
Northeastern University, Boston, MA

Jeff Evans
Mentor Graphics Corp., Austin, TX

Ankit Garg
Mentor Graphics Corp., NOIDA / India

Neil A. Rosenberg
Intel Corporation, Austin, TX

K. Suresh
Mentor Graphics Corp., NOIDA / India


I. INTRODUCTION

Checkpoint-restart is now a mature technology with a variety of robust packages [1], [2], [3]. This work concentrates on the DMTCP (Distributed MultiThreaded CheckPointing) package and its sophisticated plugin model that enables process virtualization [4]. This plugin model has been used recently to demonstrate checkpointing of 32,752 MPI processes on a supercomputer at TACC (Texas Advanced Computing Center) [5]. DMTCP itself is free and open source. The DMTCP publications page [6] lists approximately 50 refereed publications by external groups that have used DMTCP in their work.

∗ This work was partially supported by the National Science Foundation under Grant ACI-1440788.
† SELSE ’17, March 21–22, 2017, Boston, MA, USA

This work concentrates on the recent advances in the DMTCP programming model that were motivated by work with Intel Corporation. While Intel works with multiple vendors of hardware emulators, this work reflects the three-way collaboration between the DMTCP team, Intel, and Mentor Graphics, a vendor of hardware emulators for EDA. Further information specific to EDA (Electronic Design Automation) is contained in [7]. In particular, the ability to save the state of a simulation including the state of a back-end hardware emulator is a key to using checkpoint-restart in EDA.

For background on how DMTCP is used generally at Intel, see [8]. The focus of the ongoing work at Intel is best described by their statement of future work:

“Within Intel IT, we will focus on the development and enhancement of the DMTCP technology for use with graphical EDA tools, with strong network dependencies. . . . There is also additional engagement with third-party vendors to include native DMTCP support in their tools, as well as engagement with super-computing development teams on enabling DMTCP for the Xeon Phi family of products.”

A hardware emulator may entail a thousand-fold slowdown, as compared to direct execution in silicon. There are two natural use cases of checkpointing in the context of EDA. In both cases, the natural strategy is to run until reaching the region of logic of interest. Then checkpoint. Later, one can repeatedly restart and test the logic, without worrying about the long initialization times under a hardware emulator. Restarting under DMTCP is extremely fast, especially when the -fast-restart flag is used that takes advantage of mmap() to load pages into memory on-demand at runtime (after the initial restart). The two use cases follow.

1) Testing of silicon logic: run until reaching the logic to be tested; then repeatedly restart and follow different logic branches; and

2) Fault injection in silicon logic: run until reaching the logic to be tested; then repeatedly restart, inject faults in the emulated (or simulated) silicon model and run along a pre-determined logic branch to determine the level of fault tolerance for that silicon design.

For this work, the second case is of greater interest. This requires running arbitrary code either immediately at the point of restart by injecting faults in the logic design, or by interposing on later logic functions of the simulator/emulator so as to inject transient faults.

The first use case above has been extensively studied using DMTCP in domains as varied as architecture simulation [9], formal verification of embedded control systems [10], network simulation [11], and software model checking [12]. While the two use cases are closely related, this work highlights the second use case, by including the possibility of interposing at runtime. Section II presents the tools for such interposition, including the creation of global barriers at an arbitrary point in the program. Section III presents three particular extensions of checkpointing that were added to the DMTCP plugin model specifically motivated by the concerns observed in our general collaboration on EDA.

The DMTCP plugin model is critical in this latter application. One must stop a computation at a pre-defined location in the simulation, save additional state information (such as the state of a hardware emulator being used [7]), and then inject additional code (such as fault injection) at restart time. A contribution of the DMTCP plugin model is the ability to virtualize multiple aspects of the computation.
These include: pathnames (for example, the subdirectory corresponding to the current “run slot” of the emulator); environment variables (for example, modification of the DISPLAY environment variable, or other environment variables intrinsic to the running of the simulation); interposition of the simulation by a third-party plugin (for example, for purposes of measuring timings since restart at multiple levels of granularity, or programmatically creating additional checkpoints for analysis of interesting states leading to logic errors); and third-party programmable barriers across all processes (enabling the acceleration of simulations through the use of parallel processes and even distributed processes within a single computation).

The DMTCP plugin model is an example of process virtualization: virtualization of external abstractions from within a process. It is argued here that the DMTCP plugin model sets it apart from other checkpointing approaches. To this end, a brief survey of existing checkpointing approaches and process virtualization is provided at the end.

In the rest of this paper, Section II motivates the need for a model of process virtualization with a simple example concerning process ids. It also reviews the DMTCP plugin model. Section III presents a series of micro-case studies in which DMTCP was extended to support the applications at Intel, along with third-party DMTCP plugins developed by Mentor Graphics for use by Intel and other customers. Section IV then provides a survey of DMTCP and some other related approaches to checkpointing and process virtualization. Section V then presents the conclusions.

II. USER-SPACE PROCESS VIRTUALIZATION

Application-specific checkpointing and system-level transparent checkpointing are two well-known options for checkpointing. Unfortunately, neither one fits the requirements for the proposed use case for simulating fault injection in silicon logic. Application-specific checkpointing is error-prone and difficult to maintain. System-level transparent checkpointing generally does not provide enough control at runtime to dynamically adjust the type of fault injection. In particular, it is often necessary to capture control of the target application dynamically at runtime in order to inject faults. Here we show how that can be incorporated in a modular DMTCP plugin, rather than incorporated directly into the simulator/emulator.

For a more thorough introduction to the DMTCP plugin model, see either [4] or the DMTCP documentation [13]. This section highlights those aspects most likely to assist in adding fault injection through a DMTCP plugin.

The primary features of the model of interest for fault injection are:

1) interposition on function/library calls, and their use in virtualization;
2) programmatically defined barriers across all processes on a computer; and
3) programmatically defined choices of when to checkpoint and when to avoid checkpointing.
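As a rough illustration of the third feature, the sketch below models how a checkpoint request might be deferred while checkpointing is disabled. It is a single-threaded toy under stated assumptions: the names demo_disable_ckpt, demo_enable_ckpt, demo_request_ckpt, and demo_ckpt_count are hypothetical and are not the DMTCP API, and "taking a checkpoint" is reduced to incrementing a counter.

```c
/* Hypothetical model of "avoid checkpointing inside a critical section".
 * All names are illustrative; this is not the DMTCP API. */

static int ckpt_disable_depth = 0;  /* nesting depth of disabled regions */
static int ckpt_pending = 0;        /* a request arrived while disabled */
static int ckpt_taken = 0;          /* number of checkpoints "taken" */

/* Stand-in for writing a checkpoint image. */
static void take_checkpoint(void) {
    ckpt_taken++;
    ckpt_pending = 0;
}

/* Enter a region in which checkpoints must not occur (may be nested). */
void demo_disable_ckpt(void) {
    ckpt_disable_depth++;
}

/* Leave the region; run any checkpoint that was deferred meanwhile. */
void demo_enable_ckpt(void) {
    if (ckpt_disable_depth > 0) {
        ckpt_disable_depth--;
    }
    if (ckpt_disable_depth == 0 && ckpt_pending) {
        take_checkpoint();
    }
}

/* Request a checkpoint; it is deferred if checkpointing is disabled. */
void demo_request_ckpt(void) {
    if (ckpt_disable_depth > 0) {
        ckpt_pending = 1;
    } else {
        take_checkpoint();
    }
}

int demo_ckpt_count(void) {
    return ckpt_taken;
}
```

In this model, calling demo_disable_ckpt() followed by demo_request_ckpt() leaves the count unchanged; the deferred checkpoint is taken only when the matching demo_enable_ckpt() closes the critical section. A real implementation would also need to be thread-safe.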

A. Process Virtualization through Interposition and Layers: A Simple Example with Pids

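[Figure 1: two user processes, with virtual PIDs 4000 and 4001, sit above a translation table, which sits above the kernel. The table maps virtual PID 4000 to real PID 2652 and virtual PID 4001 to real PID 3120. A getpid() call for which the kernel returns 2652 is reported to the user process as 4000; a call kill(4001, 9) reaches the kernel as signal 9 sent to pid 3120.]

Fig. 1: Process virtualization for pids.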
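The translation table of Figure 1 can be sketched in C as follows. This is a minimal illustration, not DMTCP's implementation: the fixed-size array, the helper pid_table_set, and the pass-through behavior for unknown pids are assumptions made for the example.

```c
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical fixed-size translation table; illustrative only. */
#define MAX_PIDS 64

struct pid_entry {
    pid_t virt;  /* pid seen by the application (stable across restart) */
    pid_t real;  /* pid currently assigned by the kernel */
};

static struct pid_entry pid_table[MAX_PIDS];
static size_t pid_count = 0;

/* Record the real pid behind a virtual pid, or update it (for example,
 * after restart, when the kernel has assigned a new real pid).
 * Returns 0 on success, -1 if the table is full. */
int pid_table_set(pid_t virt, pid_t real) {
    for (size_t i = 0; i < pid_count; i++) {
        if (pid_table[i].virt == virt) {
            pid_table[i].real = real;
            return 0;
        }
    }
    if (pid_count == MAX_PIDS) {
        return -1;
    }
    pid_table[pid_count].virt = virt;
    pid_table[pid_count].real = real;
    pid_count++;
    return 0;
}

/* Translate virtual -> real; unknown pids pass through unchanged. */
pid_t virt_to_real(pid_t virt) {
    for (size_t i = 0; i < pid_count; i++) {
        if (pid_table[i].virt == virt) {
            return pid_table[i].real;
        }
    }
    return virt;
}

/* Translate real -> virtual; unknown pids pass through unchanged. */
pid_t real_to_virt(pid_t real) {
    for (size_t i = 0; i < pid_count; i++) {
        if (pid_table[i].real == real) {
            return pid_table[i].virt;
        }
    }
    return real;
}
```

With the entries of Figure 1 loaded (4000 to 2652, 4001 to 3120), virt_to_real(4001) yields 3120 and real_to_virt(2652) yields 4000. After a restart, only the real side of each entry would be updated, so the application continues to see the same virtual pids.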

Figure 1 succinctly describes the philosophy of process virtualization. Some invariant (in this case the pid (process id) of a process) may have a different name prior to checkpoint and after restart. A virtualized process will interact only with virtual process ids in the base code. A DMTCP plugin retains a translation table between the virtualized pid known to the base code and the real pid known to the kernel.

Since the base code and the kernel interact primarily through system calls, the DMTCP plugin defines a wrapper function around the relevant system call. The wrapper function translates between virtual and real pids both for arguments to the system call and for the return value. This is illustrated both in Figure 1 and in the example code of Listing 1.

    WRAPPER int kill(pid_t pid, int sig) {
      disable_ckpt();
      pid_t real_pid = virt_to_real(pid);
      int ret = REAL_kill(real_pid, sig);
      enable_ckpt();
      return ret;
    }

Listing 1: A simplified function wrapper for pid virtualization

Additionally, pids may be passed as part of the proc filesystem, and through other limited means. To solve this, DMTCP implements virtualization of filenames as well as pid names, and so the “open” system call will also be interposed upon to detect names such as /proc/PID/maps.

In this way, a collection of wrapper functions can be collected together within a DMTCP plugin library. Such a library implements a virtualization layer. The ELF library standard implements a library search order such that symbols are searched in order as follows:

EXECUTABLE → LIB1 → LIB2 ... LIBC → KERNEL

where the symbol is finally replaced by a direct kernel call. This sequence can also be viewed as a sequence of layers, consistent with the common operating system implementation through layers. A DMTCP plugin for pids then presents a virtualization layer in which all higher layers see only virtual pids, and all lower layers see only real pids. This is analogous to an operating system design in which a higher layer sees the disk as a filesystem, and a lower layer sees the disk as a collection of disk blocks. In a similar way, DMTCP provides layers to virtualize filenames, environment variables and myriad other names.

In this way, an end user can implement a fault injection plugin layer such that all code below that layer sees injected faults, while higher layers do not see the injected faults. Additionally, such a layer can be instrumented to gather information such as the cumulative number of faults.

DMTCP also provides an API for the application or a plugin to either request a checkpoint or to avoid a checkpoint. Upon checkpoint, each plugin is notified of a checkpoint barrier, and similarly upon restart. Thus, it is feasible to create successive checkpoints available for restart or available as a snapshot for later forensics on the cause of a later error. Optimizations such as forked checkpointing (fork a child and continue in the parent) are available in order to take advantage of the kernel’s copy-on-write in order to make checkpointing/snapshotting extremely fast.
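The fault injection layer described above can be pictured with a small, self-contained sketch. Everything here is hypothetical: real_write_reg stands in for a lower-layer simulator/emulator call, the register array is a dummy model, and faults are injected deterministically on every N-th call rather than by any policy from the paper. The wrapper write_reg is the layer: the caller above it believes it wrote the original value, while the code below it sees the corrupted value.

```c
#include <stdint.h>

#define NUM_REGS 16

static uint32_t regs[NUM_REGS];          /* dummy model state (lower layer) */
static unsigned long call_count = 0;     /* calls seen by the layer */
static unsigned long fault_count = 0;    /* cumulative injected faults */
static unsigned long fault_period = 0;   /* inject every N-th call; 0 = off */
static uint32_t fault_mask = 0;          /* bits to flip when injecting */

/* Hypothetical lower-layer call, standing in for the simulator/emulator. */
static void real_write_reg(unsigned idx, uint32_t value) {
    regs[idx % NUM_REGS] = value;
}

/* Inspect the lower layer's state (for demonstration only). */
uint32_t model_peek(unsigned idx) {
    return regs[idx % NUM_REGS];
}

/* Choose the injection policy and reset the counters. */
void fault_layer_configure(unsigned long period, uint32_t mask) {
    fault_period = period;
    fault_mask = mask;
    call_count = 0;
    fault_count = 0;
}

/* The interposition layer: corrupt the value on its way down. */
void write_reg(unsigned idx, uint32_t value) {
    call_count++;
    if (fault_period != 0 && call_count % fault_period == 0) {
        value ^= fault_mask;  /* flip the selected bits */
        fault_count++;
    }
    real_write_reg(idx, value);
}

/* Instrumentation: cumulative number of injected faults. */
unsigned long fault_layer_count(void) {
    return fault_count;
}
```

In an actual plugin, the wrapper would typically carry the same symbol name as the function it wraps and would locate the next definition in the library search order (on glibc, for instance, via dlsym with RTLD_NEXT); the explicit real_write_reg call here only keeps the sketch self-contained.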

B. Checkpointing Distributed Resources with the Help of Barriers

Checkpointing in a distributed application context requires coordination between multiple processes at different virtualization layers. The use of programmable barriers enables this coordination. In addition to the checkpoint and restart events, each plugin (or virtualization layer) can define its own set of barriers and a callback to execute at a barrier. A centralized DMTCP coordinator forces the application processes to execute the barriers in sequence.

Further, a hardware resource, for example, the interface to a hardware emulator, might be shared among multiple processes that share parent-child relationships. To get a semantically equivalent state on restart, the barriers can be used to elect a leader to save and restore the connection to the hardware emulator on restart.

III. CASE STUDIES ALONG THE WAY TO EXTENDING DMTCP

This section describes three specific real-world use cases where DMTCP was extended to support hardware emulation and simulation software. The examples are motivated by our work with various hardware and EDA tool vendors.

A. External connections

GUI-based simulation software presents a unique challenge in checkpointing. The front-end software communicates with an X server via a socket. The X server runs in a privileged mode and outside of checkpoint control. While the connection could be blacklisted for the checkpointing, the application’s GUI context and state is part of the X server and cannot be checkpointed. The context does not exist at restart time and needs to be restored. DMTCP was extended to transparently support checkpointing of VNC [14] and XPRA [15]. The two tools allow X command forwarding to a local X server that can be run under checkpoint control. An alternate record-prune-replay based approach using DMTCP to checkpoint GUI-based applications is presented in [16].

Authentication and license services are an important issue for protecting the intellectual property of all the parties. Often, the authentication protocols and software are proprietary and specific to a vendor. Further, the licensing services are not run under checkpoint control, which makes it difficult to get a “complete” checkpoint of the software. Extensions were added to DMTCP to allow a vendor to hook into the checkpoint and restart events and mark certain connections as “external” to the computation. At checkpoint time, the connections marked external are ignored by DMTCP and instead the responsibility of restoring these connections is delegated to the vendor-specific extension. The vendor-specific plugin also allows the application to check back with the licensing service at restart time in order to not violate a licensing agreement that restricts the number of simultaneous “seats”.
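The division of labor for “external” connections can be sketched as below. This is an illustrative model only: the struct conn layout, the restore_hook callback type, and the function names are hypothetical and do not reflect DMTCP's or any vendor's actual interfaces. The framework side skips external connections at checkpoint time, and at restart time it hands each of them to a vendor-supplied hook.

```c
#include <stddef.h>

/* Hypothetical connection record; illustrative only. */
enum conn_kind { CONN_INTERNAL, CONN_EXTERNAL };

struct conn {
    int fd;               /* descriptor of the connection */
    enum conn_kind kind;  /* marked external by the vendor plugin? */
    int saved;            /* set when the framework checkpoints it */
    int restored;         /* set when the vendor hook re-establishes it */
};

/* Vendor-supplied callback; returns 0 on success. */
typedef int (*restore_hook)(struct conn *c);

/* Checkpoint: save internal connections, ignore external ones.
 * Returns the number of connections saved by the framework. */
size_t checkpoint_conns(struct conn *conns, size_t n) {
    size_t saved = 0;
    for (size_t i = 0; i < n; i++) {
        if (conns[i].kind == CONN_EXTERNAL) {
            continue;  /* restoring this one is the vendor's job */
        }
        conns[i].saved = 1;
        saved++;
    }
    return saved;
}

/* Restart: delegate each external connection to the vendor hook.
 * Returns the number of external connections successfully restored. */
size_t restart_conns(struct conn *conns, size_t n, restore_hook vendor_restore) {
    size_t restored = 0;
    for (size_t i = 0; i < n; i++) {
        if (conns[i].kind == CONN_EXTERNAL && vendor_restore(&conns[i]) == 0) {
            restored++;
        }
    }
    return restored;
}

/* Example hook: a real one might re-contact a license server and check
 * the seat count; this stand-in only marks the connection as restored. */
int demo_license_reconnect(struct conn *c) {
    c->restored = 1;
    return 0;
}
```

Keeping the hook as a callback mirrors the point made above: the proprietary reconnection logic stays in the vendor's code, while the framework only needs to know which connections to leave alone.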

B. Virtualizing an application’s environment

The ability to migrate a process among the available resources is critical for efficient utilization of hardware emulator resources. However, the environment variables, the file paths, and the files that are saved as part of a checkpoint image make such migrations challenging. We added DMTCP extensions (plugins) to virtualize the environment and the file paths. This allows a process to be restarted on a different system by changing the values and the paths. Another extension that we added to DMTCP allows a user to explicitly control the checkpointing of files used by their application at the granularity of a single file.

C. Interfacing with hardware and closed-source, third-party libraries

Hardware emulators communicate with the host software via high-speed interfaces. Any in-flight transactions at checkpoint time can result in the data being lost and inconsistent state on restart. Thus, it is important to bring the system to a quiescent state and drain the in-flight data on the buses before saving the state. Further, checkpointing while the software is in a critical state (like holding a lock on a bus) can lead to complications on restart. To help mitigate such issues, DMTCP was extended to allow fine-grained programmatic control over checkpointing. This enables the hardware/EDA tool vendor to tailor the checkpointing for their specific requirements. In particular, it allows a user to invoke checkpointing from within their code, disable checkpointing for critical sections, or delay the resuming of user threads until the system reaches a well-behaved state.

The software toolchain used for simulation and emulation is often put together by integrating various third-party components. The components may be closed-source and may use proprietary protocols for interfacing with each other and the system. For example, many software toolchains rely on legacy 32-bit code that’s difficult to port to 64 bits, and so support for mixed 32-/64-bit processes was an important consideration.

Checkpointing while holding locks was another interesting issue. While the locks and their states are a part of the user-space memory (and hence, a part of the checkpoint image), an application can also choose to use an error-checking lock that disallows unlocking by a different thread than the one that acquired it. On restart, when new thread ids would be assigned by the system, the locks would become invalid and the unlock call would fail. We extended DMTCP by adding wrapper functions for lock acquisition and release functions to keep track of the state of locks. At restart time, a lock’s state is patched with the newer thread ids.

More generally, the problem described above is about the state that’s preserved when a resource is allocated at checkpoint time and needs to be deallocated at restart time. While the restarted process inherits its state from the checkpoint image, its environment (thread ids, in the above case) might have changed on restart. An application author with domain expertise can extend the DMTCP checkpointing framework to recognize and virtualize these resources. The state could be a part of the locks that are acquired by a custom thread-safe malloc library, or the guard regions created by a library to guard against buffer overflows, or the libraries that are loaded temporarily.

IV. SURVEY OF EXISTING APPROACHES TO CHECKPOINTING AND PROCESS VIRTUALIZATION

High performance computing (HPC) is the traditional domain in which checkpoint-restart is heavily used. It is used for the sake of fault tolerance during a long computation, for example, one lasting days. For a survey of checkpoint-restart implementations in the context of high performance computing, see Egwutuoha et al. [17]. In the context of HPC, DMTCP and BLCR [2], [18] are the most widely used examples of transparent, system-level checkpoint-restart for parallel computing. (A transparent checkpointing package is one that does not modify the target application.)

A. DMTCP

DMTCP (Distributed MultiThreaded CheckPointing) is a purely user-space implementation. In addition to being transparent, it also does not require any kernel modules, and its installation and execution do not require root privilege or the use of special Linux capabilities. It achieves its robustness by trying to stay as close to the POSIX standard as possible in its API with the Linux kernel.

The first version of DMTCP was later described in [1]. That version did not provide the plugin model for process virtualization. For example, virtualization of network addresses did not exist, nor did a series of other constructs, such as timers, session ids, System V shared memory, and other features. These features were added later due to the requirements of high performance computing. Eventually, the current procedure for virtualizing process ids (see Section II-A) was developed. To the best of our knowledge, DMTCP is unique in its approach toward process id virtualization.

Eventually, the plugin model was developed, initially for transparent support of the InfiniBand network fabric [19]. The current extension of that plugin model is described in [4].

Still later, the requirements for robust support of EDA in collaboration with Intel led to the development of: reduction of runtime overhead; graphic support using XPRA; path virtualization (for virtualization of the runtime slot and associated directory of a run using a hardware emulator, including different mount points on the restart computer); virtualization of environment variables including the X-Windows DISPLAY variable (for similar reasons); robustness across a variety of older and newer Linux kernels and GNU libc versions; mixed multi-architecture (32- and 64-bit) processes within a single computation; low-overhead support for malloc-intensive programs; re-connection of a socket to a license server on restart; and whitelist and blacklist of special temporary files that may or may not be present on the restart computer.
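To make the idea of path virtualization concrete, the following sketch rewrites a path whose leading components match the directory recorded at checkpoint time so that it points at the directory in use at restart time. It is an assumption-laden illustration, not DMTCP's plugin code: the function name virtualize_path, the single prefix pair, and the example directories in the usage note are all hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Rewrite `path` into `out`, replacing a leading `old_prefix` with
 * `new_prefix`. The prefix must end on a path-component boundary, so
 * "/emu/slot3" matches "/emu/slot3/run.log" but not "/emu/slot30/x".
 * Paths that do not match are copied unchanged.
 * Returns 0 on success, -1 if `out` is too small. */
int virtualize_path(const char *path, const char *old_prefix,
                    const char *new_prefix, char *out, size_t out_len) {
    size_t plen = strlen(old_prefix);
    int n;

    if (plen > 0 && strncmp(path, old_prefix, plen) == 0 &&
        (path[plen] == '/' || path[plen] == '\0')) {
        /* Keep the remainder of the path after the matched prefix. */
        n = snprintf(out, out_len, "%s%s", new_prefix, path + plen);
    } else {
        n = snprintf(out, out_len, "%s", path);
    }
    return (n >= 0 && (size_t)n < out_len) ? 0 : -1;
}
```

A wrapper around a call such as open could apply this translation to its pathname argument before invoking the real function. For instance, with a checkpoint-time directory of /emu/slot3 and a restart-time directory of /mnt/emu/slot7, the path /emu/slot3/run.log would be rewritten to /mnt/emu/slot7/run.log.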

B. BLCR

BLCR supports only single-node standalone checkpointing. In particular, it does not support checkpointing of TCP sockets, InfiniBand connections, open files, or SysV shared memory objects.

BLCR is often used in HPC clusters, where one has full control over the choice of Linux kernel and other systems software. Typically, a Linux kernel is chosen that is compatible with BLCR, a BLCR kernel module is installed, and when it is time to checkpoint, it is the responsibility of an MPI checkpoint-restart service to temporarily disconnect the MPI network layer, then checkpoint locally on each node, and finally re-connect the MPI network layer.

Note that BLCR is limited in what features it supports, notably including a lack of support for sockets and System V shared memory. Quoting from the BLCR User’s Guide:

“However, certain applications are not supported because they use resources not restored by BLCR: . . . Applications which use sockets (regardless of address family). . . . ; Applications which use character or block devices (e.g. serial ports or raw partitions). . . . ; Applications which use System V IPC mechanisms including shared memory, semaphores and message queues.” [20]

The lack of BLCR support for shared memory also prevents its use in OpenSHMEM [21].

C. ZapC and CRUZ

ZapC and CRUZ represent two other checkpointing approaches that are not currently widely used. ZapC [22] and CRUZ [23] were earlier efforts to support distributed checkpointing, by modifying the kernel to insert hooks into the network stack using netfilter to translate source and destination addresses. ZapC and CRUZ are no longer in active use. They were designed to virtualize primarily two resources: process ids and IP network addresses. They did not support SSH, InfiniBand, System V IPC, or POSIX timers, all of which are commonly used in modern software implementations.

D. CRIU

CRIU [3] leverages Linux namespaces for transparently checkpointing on a single host (often within a Linux container), but lacks support for distributed computations. Instead of directly virtualizing the process id, CRIU relies on extending the kernel API through a much larger proc filesystem and a greatly extended “prctl” system call. For example, the “PR_SET_MM” has 13 additional parameters that can be set (e.g., beginning and end of text, data, and stack). In another example, CRIU relies on the “CONFIG_CHECKPOINT_RESTORE” kernel configuration to allow a process to directly modify the kernel’s choice of pid for the next process to be created [24]. In a general context, there is a danger that the desired pid to be restored may already be occupied by another process, but CRIU is also often used within a container where this restriction can be avoided.

Finally, CRIU has a more specialized plugin facility [25]. Some examples are: ability to save and restore the contents of particular files; and the means to save and restore pointers to external sockets, external links, and mount points that are outside the filesystem namespace of an LXC (Linux Container). Recall that CRIU does not try to support distributed computations. Perhaps it is for this reason that CRIU did not have the same pressure to develop a broader plugin system capable of supporting generic external devices such as hardware emulators.

E. Process Virtualization

The term process virtualization was used in [26]. That work discusses kernel-level support for such process virtualization, while the current work emphasizes an entirely user-space approach within unprivileged processes. Related to process virtualization is the concept of a Library OS, exemplified by the Drawbridge Library OS [27] and Exokernel [28]. However, such systems are concerned with providing extended or modified system services that are not natively present in the underlying operating system kernel.

Both process-level virtualization and the Library OS approach employ a user-space approach (ideally with no modification to the application executable, and no additional privileges required). However, a Library OS is concerned with providing extended or modified system services that are not natively present in the underlying operating system kernel. Process virtualization is concerned with providing a semantically equivalent system object using the same system service. This need arises when restarting from a checkpoint image, or when carrying out a live process migration from one computer to another. The target computer host is assumed to provide the same system services as were available on the original host.

Although process-level virtualization and a Library OS both operate in user space without special privileges, the goal of a Library OS is quite different. A Library OS modifies or extends the system services provided by the operating system kernel. For example, Drawbridge [27] presents a Windows 7 personality, so as to run Windows 7 applications under newer versions of Windows. Similarly, the original exokernel operating system [28] provided additional operating system services beyond those of a small underlying operating system kernel, and this was argued to often be more efficient than a larger kernel directly providing those services.

V. CONCLUSION

In order to develop a successful plugin model for checkpointing in the context of EDA, one required modularity that enabled the DMTCP team, Intel, and Mentor Graphics to each write their own modular code. Further, the Intel and Mentor Graphics DMTCP-based plugins and other code were of necessity proprietary. This work has shown how the DMTCP plugin model can be used to provide a flexible model enabling full cooperation, while avoiding the more extreme roadmaps of either fully application-specific code or transparent, system-level checkpointing with no knowledge of the proprietary aspects of the Mentor Graphics hardware emulator.

REFERENCES

[1] J. Ansel, K. Arya, and G. Cooperman, “DMTCP: Transparent Checkpointing for Cluster Computations and the Desktop,” in IEEE Int. Symp. on Parallel and Distributed Processing (IPDPS). IEEE Press, 2009, pp. 1–12.
[2] P. Hargrove and J. Duell, “Berkeley Lab Checkpoint/Restart (BLCR) for Linux Clusters,” Journal of Physics Conference Series, vol. 46, pp. 494–499, Sep. 2006.
[3] CRIU team, “CRIU,” accessed Jan., 2017, http://criu.org/.
[4] K. Arya, R. Garg, A. Y. Polyakov, and G. Cooperman, “Design and implementation for checkpointing of distributed resources using process-level virtualization,” in IEEE Int. Conf. on Cluster Computing (Cluster’16). IEEE Press, 2016, pp. 402–412.
[5] J. Cao, K. Arya, R. Garg, S. Matott, D. K. Panda, H. Subramoni, J. Vienne, and G. Cooperman, “System-level scalable checkpoint-restart for petascale computing,” in . IEEE Press, 2016, also, technical report available as: arXiv preprint arXiv:1607.07995.
[6] DMTCP team, “DMTCP publications,” accessed Jan., 2017, http://dmtcp.sourceforge.net/publications.html.
[7] G. Cooperman, J. Evans, A. Garg, R. Garg, N. A. Rosenberg, and K. Suresh, “Transparently checkpointing software test benches to improve productivity of SoC verification in an emulation environment,” 2017, (submitted).
[8] I. Ljubuncic, R. Giri, A. Rozenfeld, and A. Goldis, “Be kind, rewind – checkpoint & restore capability for improving reliability of large-scale semiconductor design,” in , Dec 2012, pp. 311–315.
[10] S. Resmerita and W. Pree, “Verification of embedded control systems by simulation and program execution control,” in . IEEE, 2012, pp. 3581–3586.
[11] K. Harrigan and G. Riley, “Simulation speedup of ns-3 using checkpoint and restore,” in Proceedings of the 2014 Workshop on ns-3. ACM, 2014, p. 7.
[12] W. Leungwattanakit, C. Artho, M. Hagiya, Y. Tanabe, M. Yamamoto, and K. Takahashi, “Modular software model checking for distributed systems,” IEEE Trans. on Software Engineering, vol. 40, no. 5, pp. 483–501, May 2014.
[13] DMTCP team, “dmtcp/plugin-tutorial.pdf,” http://github.com/dmtcp/dmtcp/blob/master/doc/plugin-tutorial.pdf, accessed Jan., 2017.
[14] T. Richardson, Q. Stafford-Fraser, K. R. Wood, and A. Hopper, “Virtual network computing,” IEEE Internet Computing, vol. 2, no. 1, pp. 33–38, 1998.
[15] XPRA team, “XPRA,” accessed Jan., 2017, http://xpra.org/.
[16] S. Kazemi, R. Garg, and G. Cooperman, “Transparent checkpoint-restart for hardware-accelerated 3d graphics,” CoRR, vol. abs/1312.6650, 2013. [Online]. Available: http://arxiv.org/abs/1312.6650
[17] I. P. Egwutuoha, D. Levy, B. Selic, and S. Chen, “A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems,” The Journal of Supercomputing, vol. 65, no. 3, pp. 1302–1326, Sep. 2013.
[18] J. Duell, P. Hargrove, and E. Roman, “The Design and Implementation of Berkeley Lab’s Linux Checkpoint/Restart (BLCR),” Lawrence Berkeley National Laboratory, Tech. Rep. LBNL-54941, 2003.
[19] J. Cao, G. Kerr, K. Arya, and G. Cooperman, “Transparent Checkpoint-Restart over InfiniBand,” in Proc. of the 23rd Int. Symp. on High-performance Parallel and Distributed Computing. ACM Press, 2014, pp. 13–24.
[20] BLCR team, “Berkeley Lab Checkpoint/Restart (BLCR) user’s guide,” accessed Jan., 2017, https://upc-bugs.lbl.gov/blcr/doc/html/BLCR_Users_Guide.html.
[21] OpenSHMEM team, “Openshmem,” accessed Jan., 2017, http://openshmem.org/site/Links
Cluster Computing, 2005. IEEE International, Sep. 2005, pp. 1–13.
[23] G. Janakiraman, J. Santos, D. Subhraveti, and Y. Turner, “Cruz: Application-transparent distributed checkpoint-restart on standard operating systems,” in International Conference on Dependable Systems and Networks, 2005. DSN 2005. Proceedings, Jun. 2005, pp. 260–269.
[24] CRIU team, “CRIU – pid restore,” accessed Jan., 2017, https://criu.org/Pid_restore.
[25] CRIU team, “CRIU – plugins,” accessed Jan., 2017, https://criu.org/Plugins.
[26] G. Vallee, R. Lottiaux, D. Margery, and C. Morin, “Ghost process: A sound basis to implement process duplication, migration and checkpoint/restart in Linux clusters,” in Proc. of the The 4th Int. Symp. on Parallel and Distributed Computing, ser. ISPDC ’05, 2005, pp. 97–104.
[27] D. E. Porter, S. Boyd-Wickizer, J. Howell, R. Olinsky, and G. C. Hunt, “Rethinking the Library OS from the top down,” in Proc. of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS XVI. New York, NY, USA: ACM, 2011, pp. 291–304.
[28] D. R. Engler, M. F. Kaashoek, and J. O’Toole, Jr., “Exokernel: An operating system architecture for application-level resource management,” in