A Generic Checkpoint-Restart Mechanism for Virtual Machines
Rohan Garg, Komal Sodha, and Gene Cooperman
Northeastern University
Boston, MA, USA
{rohgarg,komal,gene}@ccs.neu.edu

Abstract
It is common today to deploy complex software inside a virtual machine (VM). Snapshots provide rapid deployment, migration between hosts, dependability (fault tolerance), and security (insulating a guest VM from the host). Yet, for each virtual machine, the code for snapshots is laboriously developed on a per-VM basis. This work demonstrates a generic checkpoint-restart mechanism for virtual machines. The mechanism is based on a plugin on top of an unmodified user-space checkpoint-restart package, DMTCP. Checkpoint-restart is demonstrated for three virtual machines: Lguest, user-space QEMU, and KVM/QEMU. The plugins for Lguest and KVM/QEMU require just 200 lines of code. The Lguest kernel driver API is augmented by 40 lines of code. DMTCP checkpoints user-space QEMU without any new code. KVM/QEMU, user-space QEMU, and DMTCP need no modification. The design benefits from other DMTCP features and plugins. Experiments demonstrate checkpoint and restart in 0.2 seconds using forked checkpointing, mmap-based fast restart, and incremental Btrfs-based snapshots.
A generic mechanism is presented for checkpointing virtual machines. Snapshots of virtual machines are a key technology for dependable computing. They are more important today than ever for deployment in clouds (including IaaS, "Infrastructure as a Service"), rapid deployment (starting from an initial snapshot), migration between hosts, fault tolerance for dependability, and greater security by insulating a guest VM from the host. Current virtual machines rely on machine-specific checkpoint-restart mechanisms. Such designs struggle with common checkpointing issues: live checkpointing without stopping the virtual machine, incremental checkpointing, differential checkpointing, forked checkpointing (checkpointing within a forked child process concurrently with execution within the parent), and checkpointing of distributed virtual machines.

By employing a standard checkpoint-restart package, the virtual machine directly inherits all of the features of that checkpoint-restart package. A further key difference of the new approach is that the checkpoint-restart package operates externally to the virtual machine. Because it is not embedded inside the guest virtual machine or the hypervisor, there is greater flexibility for it to interact as a standard process within the hypervisor or host operating system.

As an example, a desktop user can now gain very high reliability by running within a virtual machine that is set to take a snapshot every minute. Section 5 demonstrates that the DMTCP features of forked checkpointing and mmap-based fast restart enable a virtual machine snapshot in 0.2 seconds, when running with the Btrfs filesystem. That section also shows a run-time overhead that is too small to be measured when running the nbench2 benchmark program.
Btrfs is expected to become the default filesystem for Fedora, Ubuntu, and others in about a year.

The generic mechanism of this work is based on the DMTCP checkpoint-restart package [AAC09]. The mechanism is demonstrated on three types of virtual machines: KVM/QEMU [KVM12], user-space (standalone) QEMU [Qem12], and Lguest [Rus12]. In all three cases, the hypervisor (VMM, or virtual machine monitor) is based on Linux as the host operating system. The three examples cover three distinct situations: entirely user-space virtualization (QEMU), full virtualization using a Linux kernel driver (KVM/QEMU), and paravirtualization using a Linux kernel driver (Lguest). (A paravirtualized virtual machine is a virtual machine that requires modifications to the guest operating system.)

By providing checkpoint-restart capability to existing virtual machines, one can retroactively add snapshot capability to a virtual machine in a mostly transparent manner. DMTCP already checkpoints user-space QEMU "out of the box". An additional DMTCP-based plugin is required for KVM/QEMU, but neither KVM/QEMU nor DMTCP is modified. In the case of Lguest, the kernel driver from Lguest requires about 40 lines of new code to support the checkpoint-restart capability.

The additional code required to checkpoint a new virtual machine is approximately 200 lines (for plugins in the case of KVM/QEMU and Lguest). Since user-space QEMU has no kernel driver component, DMTCP is able to checkpoint it directly.

Given our experience, we estimate that someone familiar with the examples provided here could implement checkpoint-restart for a new virtual machine in approximately five person-days, assuming that the VM provides a kernel driver API, as is the case for KVM. (KVM is the kernel driver component of a KVM/QEMU virtual machine.)
Where no kernel driver API is provided, the development time is estimated at ten person-days, due to the need to understand the VM kernel driver internals and to augment the existing API between driver kernel space and user space.

The two virtual machines above (KVM/QEMU and Lguest) require the estimated effort primarily due to the need to save state within the kernel driver, and then to appropriately restore and patch the state within the kernel driver at the time of restart.

Surprisingly, DMTCP was able to checkpoint user-space QEMU directly, with no requirement for new code or new plugins. In hindsight, this is attributed to the fact that user-space QEMU has no kernel driver, and so no communication between kernel space and user space. In experiments, DMTCP and QEMU were used to checkpoint both the Linux and Windows guest operating systems "out of the box", with no additional modifications.

For all three virtual machines, DMTCP [AAC09, DMT12] is used for purposes of checkpointing. DMTCP is a widely used user-space transparent checkpoint-restart package. DMTCP was chosen in part for the sake of its support for third-party plugins (see Section 2).

Of the three virtual machines on which generic checkpoint-restart is demonstrated, to the best of our knowledge Lguest has not previously been checkpointed. QEMU provides a "savevm" command for directly checkpointing. KVM/QEMU has been previously checkpointed by modifying existing features within KVM [SC11, CLO+08] and making use of the QEMU savevm command.

By checkpointing using an external, generic checkpoint-restart package, one immediately inherits DMTCP's ability to take a consistent distributed snapshot. This is useful for analyzing any type of distributed computation. Furthermore, there is the opportunity to easily extend this work in such future directions as:

• fork-based checkpointing (quiesce the VM process, and fork a child VM process to be checkpointed, while the parent continues to execute, using the copy-on-write semantics of fork);

• heterogeneous checkpointing (checkpointing different virtual machines and "bare" processes running a distributed computation); and

• incremental and differential checkpointing (checkpointing only that part of RAM that has changed since the last checkpoint).

DMTCP has already been used for fault-tolerant applications, while demonstrating each of the above features. Thus, checkpoint-restart of virtual machines can be extended to take advantage of such features. At the same time, the checkpoint-restart capability remains largely orthogonal to the ongoing internal development of the virtual machine packages.

Snapshots (including filesystem).
A distinction is sometimes made between checkpoints and snapshots when the terminology is applied to virtual machines. A checkpoint is a copy of the state of a virtual machine suitable for being restored. Such a checkpoint may or may not include saving a copy of the filesystem. A snapshot always includes a full copy of the filesystem.

In a snapshot, rather than copy the entire filesystem during each checkpoint, one prefers to use a filesystem supporting copy-on-write in order to take a snapshot. Filesystems that support copy-on-write usually also support incremental snapshots.

The experimental section uses copy-on-write incremental snapshots, based on Btrfs [RBM12], for most experiments. Btrfs is a stable filesystem that is likely to be readily available in most future Linux distributions. It has been in the mainline Linux kernel since 2009 (since Linux 2.6.29). Both Fedora and Ubuntu are planning for Btrfs to be the default filesystem in late 2013 or later. The time to take a snapshot of the guest filesystem tends to be too small to measure as part of the total restart time.
Forked checkpointing and fast restart:
The checkpoint and restart can be sped up through standard features of DMTCP. At checkpoint time, forked checkpointing is employed. The guest VM (viewed as a process in the host) forks itself, and the child process is checkpointed. Fast restart uses mmap to map the checkpoint image into RAM. This allows the memory pages to be demand-paged in as needed. Forked checkpointing reduces the delay for a checkpoint to approximately 0.2 seconds (while the child process continues to write out the checkpoint image), and the fast restart time is about 0.1 seconds.

In the rest of this paper, Section 2 provides background on DMTCP plugins. Section 3 describes the generic mechanism for checkpoint-restart of virtual machines. Section 4 describes several challenges in the implementation, in order to provide deeper insights into the issues in implementing the generic mechanism. Finally, Section 5 provides experimental results, Section 6 describes related work, and Section 7 provides the conclusion.
DMTCP (Distributed MultiThreaded CheckPointing) [AAC09] is used to checkpoint and restart a virtual machine. The current version of DMTCP (DMTCP-1.2.6) [DMT12] provides a facility for third-party plugins. The work described here was based on DMTCP svn revision 1755.

When a new virtual machine is launched (e.g. QEMU), the user prefixes the launch command with dmtcp_checkpoint. A checkpoint image is then created, and the virtual machine is restarted via dmtcp_restart:

    dmtcp_checkpoint --with-plugin dmtcp_VM_plugin.so qemu ...
    dmtcp_command --checkpoint
    dmtcp_restart qemu_*.dmtcp

In the above scenario, VM would be KVM or LGUEST. The plugin dmtcp_VM_plugin.so is the additional code developed for this work.

Plugins allow the functionality of DMTCP to be extended without modification to the underlying DMTCP binary. For the purposes of this work, we use two essential features of DMTCP plugins.

1. Wrapper functions: DMTCP provides wrapper functions around calls to library functions. In particular, it supports wrappers around system calls.

2. DMTCP event handling: DMTCP notifies the plugins of several events. DMTCP blocks while plugins process events. The most important events for our purposes are pre-checkpoint and post-restart.

A wrapper function is a function that is interposed between the caller and a callee. If the base code calls a function foo, and if a wrapper function bar is interposed between the base code and foo, then the base code calls bar instead. DMTCP provides a mechanism for plugins to transparently insert such wrapper functions around any library call, including system calls. In typical usage, the wrapper function will then call the interposed function (although possibly with modified arguments), and then pass back a (possibly modified) copy of the return value of the interposed function. For a review of the many techniques for interposition, see [TL01].
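The wrapper idea just described can be illustrated with a small, self-contained sketch: a recording wrapper is interposed between callers and a function, forwards the call, and remembers the arguments and return value for later replay. The names here are invented for illustration; DMTCP performs this interposition in C at the level of library and system calls.

```python
def make_wrapper(fn, log):
    """Interpose a recording wrapper between callers and fn: a user-space
    analogue of DMTCP's library-call wrappers (illustrative sketch)."""
    def wrapper(*args, **kwargs):
        ret = fn(*args, **kwargs)             # call the interposed function
        log.append((fn.__name__, args, ret))  # record the call for later replay
        return ret                            # pass back the return value
    return wrapper

if __name__ == "__main__":
    calls = []
    wrapped_len = make_wrapper(len, calls)
    assert wrapped_len("abc") == 3                 # behaves like the original
    assert calls == [("len", ("abc",), 3)]         # but the call was recorded
```

The same pattern, applied to ioctl and mmap, is how the plugin described later records launch-time configuration for replay at restart.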
[Figure 1 diagram: a syscall issued by the VM launcher passes through a DMTCP wrapper function (plugin), then the VM kernel module (f_ops), and finally the kernel of the host.]
Figure 1: A system call is initiated by a virtual machine launcher as it passes through DMTCP, the VM kernel driver in the host O/S, and then the kernel of the host O/S. The DMTCP wrapper function allows DMTCP to record configuration information from the initial launch of the virtual machine, and then restore the original configuration at the time of restart.

The VM kernel driver may interpose its own wrapper functions around system calls that refer to a device supported by the virtual machine. (Traditional kernel terminology does not call this a wrapper function.) For example, KVM creates wrappers for the device /dev/kvm, and Lguest creates wrappers for the device /dev/lguest. Thus, the DMTCP plugin is effectively creating its own wrapper function around a VM-supplied wrapper, which in turn delegates to the kernel for the standard functionality.

Next, we discuss DMTCP events. During a pre-checkpoint event, all user threads have been quiesced, and DMTCP has not yet begun to save the process state (including the state of memory). During a post-restart event, DMTCP has finished restoring process state (including all of memory), but control has not yet been returned to the user threads.

This design allows the plugin to save additional state relevant to the virtual machine at the time of checkpoint. During the post-restart event, the checkpoint-restart appears transparent to the plugin. Hence, the plugin finds the VM state in whatever data structure the plugin had originally used to save the information.

The plugin uses a virtual-machine-specific method to transfer data between the kernel driver of the virtual machine and the user-space memory where the plugin "lives". Sections 3.4 and 3.5 describe those VM-specific mechanisms.

A typical use of a DMTCP plugin is to use a wrapper function to record information passed by certain system calls issued by the launcher.
This allows the DMTCP plugin to execute modified versions of those same system calls at the time of restart, before the thread of control is handed back to the virtual machine.
In this section, we describe the general mechanism for checkpointing and restarting a virtual machine. In the rest of this section, Section 3.1 provides an overview of the actions of our DMTCP plugin in supporting checkpoint-restart. Section 3.2 describes a generic sequence of steps that any virtual machine must employ in launching a new virtual machine. Section 3.3 then describes the steps needed to restore and restart that virtual machine. Finally, the generic mechanism depends on the APIs provided by the virtual machine (or augmented APIs in the case of Lguest). Sections 3.4 and 3.5 describe those APIs for KVM and Lguest, respectively. The APIs are responsible for saving the VM driver state, and later for launching a shell VM, and then restoring the VM driver state.

The existing DMTCP package already transparently checkpoints and restores all of user-space memory, along with essentially all pertinent process state (threads, open file descriptors, associated terminal device, stdin/stdout/stderr, sockets, shared memory regions, etc.). Where a subsystem refers to an external object, DMTCP has several subsystem-specific heuristics for restoring such information. Examples of such cases abound: open files that were modified or re-named after checkpoint; sockets to database servers; shared memory regions with daemons such as NSCD; etc.

In the case of user-space QEMU, the existing DMTCP package and its heuristics for restoring subsystems sufficed to correctly checkpoint and restart QEMU. No DMTCP plugin was required. This was tested with QEMU running each of Linux and Microsoft Windows. (See Section 5.) The rest of this section is concerned with KVM/QEMU and Lguest, for which a DMTCP plugin was required.

[Figure 2 diagram: the guest VM (user-space component) has a thread for each VCPU and thread(s) to handle I/O asynchronously, with shared memory regions mapped into both user space and kernel space; the kernel module for the VM holds the VM shell, the hardware description (peripherals, IRQ, etc.), a VCPU for each core, and entries for other VMs.]

Figure 2: Launching a Virtual Machine: a Generic Architecture. This sketch illustrates the components of interest for checkpoint-restart. The VM shell refers to one or more uninitialized data structures in the kernel driver that describe the virtual machine. A VM launcher will initialize those data structures, and a generic checkpoint-restart mechanism must be prepared to restore those data structures appropriately.
For the KVM/QEMU and Lguest virtual machines, a DMTCP plugin was implemented to save and restore state contained in the VM kernel driver. Recall from Section 2 that the two features of DMTCP plugins we use are wrapper functions and notification of the pre-checkpoint/post-restart events. Wrapper functions are used to record information about system calls sent by the VM launcher to the kernel. At the time of pre-checkpoint, the plugin saves certain state within the VM kernel driver. Since that information is contained in the plugin's user-space memory, DMTCP automatically saves it at checkpoint time and later restores it, as part of DMTCP's standard procedure for saving and restoring all of user-space memory. At the time of post-restart, the plugin copies that state back into the VM kernel driver and appropriately patches it.

In addition to implementing a VM-specific plugin, one must modify about 40 lines in Lguest (in lguest_user.c). This is because KVM provides an API for communication between the VM kernel driver and the DMTCP plugin, while Lguest does not. Hence, we have augmented the API of Lguest. This is needed to enable the plugin to save and restore state.

In overview, the DMTCP plugin is needed only in the case of VM kernel drivers (KVM/QEMU and Lguest in this work). It does the following:

1. Time of Original VM Launch: Wrapper functions in the DMTCP plugin record pertinent information from the system calls made by the VM launcher. This information is used to later restore the configuration of memory, etc., of the new virtual machine created by the VM launcher.

2. Checkpoint Time: The DMTCP plugin is notified of the pre-checkpoint event after the user threads have been quiesced, and before all of user-space memory is copied to a checkpoint image. The DMTCP plugin then copies pertinent information from the data structures inside the VM kernel driver. This uses a kernel driver API to user space (KVM/QEMU), or else an augmented driver provided by us (Lguest).

3. Restart Time (restoring user-space memory of the VM): DMTCP restores user-space memory to the same addresses where it existed prior to checkpoint. DMTCP does this transparently, and the DMTCP plugin does not do any work at this stage.

4. Restart Time (re-launching the VM): The DMTCP plugin is notified when all user-space memory and process state have been restored, but before control is returned to the user threads. At this time, the user-space component of the VM has been fully restored. But the pre-checkpoint VM does not exist, and so the VM kernel driver is not aware of any VMs. The DMTCP plugin replays a modified version of the first few system calls made by the VM launcher. It replays just enough to provide an "empty shell" of a virtual machine. Many of the VM kernel driver data structures have not been initialized, and for some data structures, not even storage has been allocated.

5. Restart Time (patching the VM kernel driver): The DMTCP plugin must now copy its saved VM kernel driver state back into the VM kernel driver. However, in some cases, that VM kernel driver state must be modified to account for the fact that this is not the original VM, and the kernel may have changed some of the memory addresses in this re-launched VM. This uses a kernel driver API to user space (KVM/QEMU), or else an augmented driver provided by us (Lguest).
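The event-driven part of this flow (steps 2 and 5) can be sketched as a pair of handlers. The sketch below is a simulation in Python, not DMTCP's real plugin API: the event names, the FakeDriver class, and its get_state/set_state methods (stand-ins for the GET_XXX/SET_XXX driver calls) are all invented for illustration.

```python
class FakeDriver:
    """Invented stand-in for the VM kernel driver and its user-space API."""
    def __init__(self):
        self.vcpu_regs = {"rip": 0x1000}
    def get_state(self):            # analogue of the driver's GET calls
        return dict(self.vcpu_regs)
    def set_state(self, saved):     # analogue of the driver's SET calls
        self.vcpu_regs = dict(saved)

class VMPlugin:
    """Sketch of the plugin's two event handlers (names are illustrative)."""
    def __init__(self, driver):
        self.driver = driver
        self.saved = None           # lives in user-space plugin memory,
                                    # so DMTCP saves/restores it for free
    def on_event(self, event):
        if event == "PRE_CHECKPOINT":
            # user threads are quiesced; copy driver state into plugin memory
            self.saved = self.driver.get_state()
        elif event == "POST_RESTART":
            # user-space memory (incl. self.saved) is already restored;
            # copy the state back into the freshly re-launched driver
            self.driver.set_state(self.saved)

if __name__ == "__main__":
    driver = FakeDriver()
    plugin = VMPlugin(driver)
    driver.vcpu_regs["rip"] = 0x2222
    plugin.on_event("PRE_CHECKPOINT")
    plugin.driver = FakeDriver()        # "empty shell" after re-launch
    plugin.on_event("POST_RESTART")
    assert plugin.driver.vcpu_regs == {"rip": 0x2222}
```

The key point mirrored here is that the saved state needs no special handling at checkpoint time beyond the copy: because it sits in plugin (user-space) memory, DMTCP's normal memory save/restore carries it across the checkpoint.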
We first describe in general terms how a virtual machine is launched (created). Any particular virtual machine may differ in some detail, or may merge or sub-divide the steps described. It is partly for this reason that we do not currently see the possibility of a fully transparent checkpoint-restart mechanism. However, this general description provides a framework that can be used to accelerate the development of a plugin for a new virtual machine.

Figure 2 shows those portions of a virtual machine of interest for checkpoint-restart. Typically, the virtual machine is created by a command issued from user-space. The program run by that command is referred to as a VM launcher, which sets up, runs, and services the guest. The launcher must:

1. open an interface to the host kernel via a character device (e.g. /dev/kvm or /dev/lguest);

2. initialize the VM: tell the kernel where the start of the guest physical memory lies in the launcher's virtual address space;

3. arrange to virtualize IRQ interrupts;

4. create and initialize virtual CPUs to hold the current state of the registers; and

5. run the guest.

In restoring the memory after checkpoint, the user-space memory (memory of QEMU) is restored exactly. The memory within the VM kernel driver must be restored in one of three ways.

1. Launch a "shell" of a new VM in the kernel driver: On restart, the DMTCP plugin executes the first few steps of launching a VM, in order to create an empty shell for the VM data structures. (See Figure 3.) We refer to this as "re-launch".

2. Restore the pre-checkpoint state of the kernel driver: Next, we identify those data structures of the VM kernel driver that have not yet been initialized. For those data structures, we design the DMTCP plugin to save the values at the time of checkpoint, and restore the values at the time of restart.

3. Patch the kernel driver state: Finally, some of the data values that are restored are incorrect. These must be filled in correctly on a case-by-case basis. For example, at the time of launching a new VM, KVM dynamically allocates memory for a struct that describes the memory addresses where user-space QEMU resides. At the time of restart, the kernel is unlikely to allocate the new struct at the same address as before. The DMTCP plugin must save the data from the old struct prior to checkpoint and use it within the new struct allocated at the time of restart. KVM provides an API for this purpose, while for Lguest the API was augmented.
QEMU uses KVM's ioctl commands to check for the different hardware capabilities and configures data structures internal to the kernel driver. They represent the state of the virtual machine. Once configured, these data structures can be read from QEMU using ioctl system calls with different GET_XXX parameters. The DMTCP plugin retrieves values of relevant data structures for the task state segment address, guest registers, a programmable interval timer, the IRQ chip, and the registers of the virtual CPU.

For certain internal kernel driver data structures, there is a SET_XXX parameter, but no corresponding GET_XXX parameter. Hence, the DMTCP plugin defines a wrapper function around ioctl, and monitors the initialization of the missing data structures via the VM launcher's calls to ioctl. Upon restart, the plugin (running inside the launcher process) issues an ioctl call with the appropriate SET_XXX parameter and the appropriate values discovered during the original launch.
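The workaround for SET_XXX parameters that have no corresponding GET_XXX can be sketched as a record-and-replay wrapper around ioctl. Everything below is invented for illustration: the request code value, the stub standing in for the real ioctl, and the class name.

```python
KVM_SET_TSS_ADDR = 0x01   # invented request code standing in for a SET_XXX

def real_ioctl(request, arg):
    """Stand-in for the real ioctl system call (always succeeds here)."""
    return 0

class IoctlRecorder:
    """Wrapper around ioctl: for requests with no corresponding GET_XXX,
    remember the argument during the original launch so it can be
    re-issued at restart (illustrative sketch)."""
    def __init__(self):
        self.recorded = {}
    def ioctl(self, request, arg):
        self.recorded[request] = arg          # monitor the initialization
        return real_ioctl(request, arg)       # forward the call unchanged
    def replay(self, do_ioctl=real_ioctl):
        for request, arg in self.recorded.items():
            do_ioctl(request, arg)            # re-issue SET_XXX at restart

if __name__ == "__main__":
    rec = IoctlRecorder()
    rec.ioctl(KVM_SET_TSS_ADDR, 0xFFFBD000)   # launcher's original call
    replayed = []
    rec.replay(lambda r, a: replayed.append((r, a)))
    assert replayed == [(KVM_SET_TSS_ADDR, 0xFFFBD000)]
```

Because the recorded table lives in plugin memory, it is saved and restored by DMTCP automatically, so the replay at restart needs no extra bookkeeping.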
[Figure 3 diagram: the DMTCP plugin launches an empty VM shell (the initial part of the VM launch protocol); the guest VM (user-space component) is restored from the checkpoint image at its original addresses by DMTCP as an exact memory copy, with thread(s) to handle I/O asynchronously; in the kernel module for the VM, the hardware description is empty, the shared memory object is not yet allocated, and there is a VCPU (and a thread) for each core.]

Figure 3: Re-Starting a Virtual Machine from a Checkpoint Image. The original hardware description in the kernel driver must be re-created. In addition, the mechanism for sharing memory between the kernel and the user-space VM component must be re-created.

Lguest already employs the read and write system calls to pass the parameters needed in VM launch, as described in Section 3.2. These system calls were extended to provide an API for reading and writing internal data structures of the kernel driver. Another alternative would have been to augment the API using ioctl, similarly to the situation under KVM. Some of the data structures saved and restored are the virtual CPU, registered eventfd objects, and the addresses of the guest physical memory, stack, page directory, etc.
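The augmented read/write API can be pictured as a small command protocol: the first field of the buffer written to the device selects an operation, and the added commands expose driver state for saving and restoring. The command codes, state layout, and class below are invented for illustration; they are not Lguest's actual protocol.

```python
import struct

# invented command codes for the augmented API
LHREQ_SETREG = 101

class FakeLguestDevice:
    """Stand-in for the lguest character device with an augmented
    read/write protocol for driver state (illustrative sketch)."""
    def __init__(self):
        self.regs = {0: 0}
    def write(self, buf):
        # the first 8-byte field selects the command, as in Lguest's
        # existing launch protocol; the rest are command arguments
        cmd, reg, val = struct.unpack("qqq", buf)
        if cmd == LHREQ_SETREG:        # added command: restore driver state
            self.regs[reg] = val
            return len(buf)
        raise ValueError("unknown command")
    def read_reg(self, reg):           # added accessor: save driver state
        return self.regs[reg]

if __name__ == "__main__":
    dev = FakeLguestDevice()
    dev.write(struct.pack("qqq", LHREQ_SETREG, 0, 0xDEAD))
    assert dev.read_reg(0) == 0xDEAD
```

The design choice mirrored here is minimal intrusion: rather than adding an ioctl interface to Lguest, the existing read/write entry points gain a few new command codes, which is what keeps the kernel change to roughly 40 lines.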
Two particular implementation issues required special treatment.
Graphics support
Most modern guest operating systems support GUI interfaces. Hence, for checkpoint-restart of virtual machines, the graphics of the GUI must be checkpointed and restored. A standard trick is used. The virtual machine is run inside TightVNC, an example of a VNC client-server for virtual network computing. In particular, QEMU starts up a VNC server for the graphics at the time of launching. A VNC viewer connects to the VNC server. Just prior to checkpoint, we disconnect the VNC viewer, and we reconnect after resuming or restarting the guest VM.
Anonymous inodes for KVM
When KVM launches a new virtual machine, it maps a region from kernel space to user-space memory for convenience of communication between the kernel-level VM driver and the user-space QEMU component. This is implemented by having QEMU call mmap on an anonymous inode. (An anonymous inode will be deleted when no object continues to refer to it. Since the anonymous inode is associated with the KVM node, when QEMU calls mmap, the call is intercepted by the KVM driver, which arranges for the kernel memory that will be mapped into user space.) This occurs when KVM launches a new virtual machine.

At the time of restart, the DMTCP plugin must then re-create the memory region for sharing between user space and kernel space. Since the user-space component (QEMU) will be restored exactly at restart time, it will retain pointers into the address of the mapped region backed by the anonymous inode, as it existed prior to checkpoint.

The DMTCP plugin handles this issue in two phases: prior to checkpoint, and at the time of restart. Prior to checkpoint (during the original launch), the wrapper for mmap inside the DMTCP plugin detects the call by QEMU for the specific anonymous inode in question. The return value of mmap is then saved by the DMTCP plugin. At the time of restart, the DMTCP plugin calls mmap, and specifies the anonymous inode and the original address where it had been mapped. In addition, the DMTCP plugin's mmap wrapper passes the flag MAP_FIXED in order to re-map the region at the desired address.
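The record-then-remap pattern can be demonstrated in user space with an anonymous mapping (rather than KVM's anonymous inode, which requires the driver). Python's mmap module does not expose MAP_FIXED, so this sketch calls libc directly via ctypes; the MAP_FIXED value is the Linux one, and the function names are illustrative.

```python
import ctypes, mmap

libc = ctypes.CDLL(None, use_errno=True)
libc.mmap.restype = ctypes.c_void_p
libc.mmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int,
                      ctypes.c_int, ctypes.c_int, ctypes.c_long]

PROT = mmap.PROT_READ | mmap.PROT_WRITE
FLAGS = mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS
MAP_FIXED = 0x10  # Linux value; not exported by the mmap module

def launch_mapping(length):
    """Original launch: let the kernel pick an address.  (In the real
    system, the plugin's mmap wrapper records this return value.)"""
    return libc.mmap(None, length, PROT, FLAGS, -1, 0)

def remap_at_restart(saved_addr, length):
    """Restart: re-map at the recorded address with MAP_FIXED, so that
    restored user-space pointers into the region remain valid."""
    return libc.mmap(ctypes.c_void_p(saved_addr), length,
                     PROT, FLAGS | MAP_FIXED, -1, 0)

if __name__ == "__main__":
    addr = launch_mapping(4096)               # recorded at launch time
    assert remap_at_restart(addr, 4096) == addr
```

MAP_FIXED is what guarantees the region reappears at its pre-checkpoint address; without it, the kernel is free to choose a different address and the restored pointers would dangle.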
Allocated  Free       Lguest                       KVM/QEMU                     QEMU (user-space)
Mem. (MB)  Mem. (MB)  Ckpt (s)  Restart (s) Image  Ckpt (s)  Restart (s) Image  Ckpt (s)  Restart (s) Image
128        2.5        2.292     1.264      30 MB   3.949     1.308      44 MB   4.342     1.686      59 MB
256        4.2        3.169     1.382      33 MB   6.424     2.353      89 MB   7.705     3.017     109 MB
512        184        5.390     2.417      35 MB   9.886     3.278     129 MB   11.870    4.427     170 MB
768        441        6.823     3.013      38 MB   9.212     3.307     130 MB   14.039    5.047     194 MB
1024       700        8.339     2.986      37 MB   10.033    3.130     122 MB   16.504    5.467     208 MB

Table 1: Checkpoint-restart times for idle virtual machines. The checkpoint times include the times for compressing the memory image and writing the contents to the disk.
Allocated    Lguest                            KVM/QEMU                          QEMU (user-space)
Memory (MB)  Ckpt (s)  Restart (s) Image Size  Ckpt (s)  Restart (s) Image Size  Ckpt (s)  Restart (s) Image Size
128          0.157     1.183      30 MB        0.183     1.284      44 MB        0.161     1.697      59 MB
256          0.174     1.426      32 MB        0.200     2.379      90 MB        0.165     2.985     111 MB
512          0.176     2.523      35 MB        0.233     3.061     122 MB        0.174     4.435     171 MB
768          0.174     2.447      36 MB        0.211     3.106     122 MB        0.183     4.970     191 MB
1024         0.178     2.818      37 MB        0.243     2.964     116 MB        0.191     5.633     213 MB

Table 2: Forked checkpoint-restart times for idle virtual machines. (The size of free memory is the same as for Table 1.)
We ran our experiments on a system with an Intel Core i7 (2.3 GHz) and 8 GB of RAM. This was part of a MacBook laptop with a 256 GB SSD. The host operating system was a 32-bit version of Ubuntu-12.10 with Linux kernel-3.5.7. The host was running natively in its own partition on the MacBook. The guest was set up to run Ubuntu-8.04. DMTCP svn revision 1755 was used for all experiments.

All experiments represent full snapshots, including a snapshot of the guest filesystem. The guest filesystem appears as a single file within the host filesystem. Unless otherwise noted, the guest filesystem is located within a Btrfs filesystem of the host operating system. Checkpoint includes the time to create a snapshot of the guest filesystem within Btrfs. The snapshot of the guest filesystem is created using the GNU coreutils command "cp --reflink". This operation tends to be fast, since it primarily involves taking a snapshot of the current data blocks of the host file comprising the guest filesystem.

Experiments were conducted for: broad coverage (Section 5.2); forked checkpointing (Section 5.3); tests of DMTCP features for forked checkpointing and mmap-based fast restart (Section 5.4); analyzing the impact of running the nbench2 benchmark program (Section 5.5); and the overhead of saving snapshots of the guest filesystem on a host Btrfs filesystem (Section 5.6).
Table 1 demonstrates the memory-intensive version of checkpoint-restart using the default mode of DMTCP (using gzip compression) on an idle virtual machine. The checkpoint times grow roughly proportionally to the size of the allocated memory for the larger sizes (512 MB guest VM to 1024 MB guest VM). Below those memory sizes, other factors in the checkpoint times presumably dominate. Restart times do not change appreciably at the higher ranges of memory.
Forked checkpointing on an idle virtual machine is demonstrated in Table 2. This uses the "--enable-forked-checkpointing" configure option of DMTCP, such that at checkpoint time, a child process is created. The child completes the rest of the checkpoint, while the parent process continues computing concurrently. As would be expected, the parent completes its portion of the checkpoint largely independently of the size of the checkpoint image or allocated memory. Forked checkpointing typically requires about 0.2 seconds. Since the checkpoint was taken while the virtual machine was running, it was not possible to take checkpoints at the same time within the two runs (forked checkpointing and standard). For this reason, the sizes of the images differ by approximately 2.5% from those seen in Table 1.

The times for checkpoint and restart for KVM/QEMU are larger than the times for user-space QEMU. This is because the plugin for KVM/QEMU makes extra system calls at checkpoint and restart time. The times can be reduced by modifying the kernel driver to implement a new system call that coalesces all of the operations of the previous system calls.
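The forked-checkpointing mechanism measured above relies on the copy-on-write semantics of fork: the child inherits a snapshot of memory as of the fork and writes it out, while the parent resumes immediately and may keep mutating its own copy. A minimal sketch of the idea (plain Python, with invented names; not DMTCP's implementation):

```python
import json, os

def forked_checkpoint(state, path):
    """Sketch of forked checkpointing: the child sees `state` as of the
    fork (copy-on-write) and writes the image; the parent returns at once."""
    pid = os.fork()
    if pid == 0:                      # child: write out the snapshot
        with open(path, "w") as f:
            json.dump(state, f)
        os._exit(0)
    return pid                        # parent: continue computing

if __name__ == "__main__":
    state = {"guest_ram_pages": [0, 1, 2]}
    pid = forked_checkpoint(state, "/tmp/ckpt_demo.json")
    state["guest_ram_pages"].append(3)    # parent keeps running and mutating
    os.waitpid(pid, 0)
    with open("/tmp/ckpt_demo.json") as f:
        snap = json.load(f)
    # the image reflects the state at fork time, not the later mutation
    assert snap == {"guest_ram_pages": [0, 1, 2]}
```

This is why the parent's portion of the checkpoint in Table 2 is nearly independent of image size: the parent only pays for the fork, while the I/O happens in the child.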
Allocated    Lguest                            KVM/QEMU                          QEMU (user-space)
Memory (MB)  Ckpt (s)  Restart (s) Image Size  Ckpt (s)  Restart (s) Image Size  Ckpt (s)  Restart (s) Image Size
128          0.523     0.096     139 MB        0.689     0.097     182 MB        0.593     0.098     230 MB
256          0.834     0.098     267 MB        1.098     0.092     311 MB        1.329     0.096     408 MB
512          1.489     0.097     523 MB        1.843     0.098     566 MB        2.437     0.097     761 MB
768          2.495     0.097     779 MB        2.523     0.094     823 MB        3.539     0.096     1.1 GB
1024         3.021     0.098     1.1 GB        3.119     0.098     1.1 GB        4.480     0.097     1.5 GB

Table 3: Fast restart times for idle virtual machines. (The size of free memory is the same as for Table 1.)

Allocated Memory (MB)  Checkpoint (s)  Restart (s)  Image Size
128                    0.204           0.095        184 MB
256                    0.194           0.093        310 MB
512                    0.205           0.095        568 MB
768                    0.223           0.098        822 MB
1024                   0.206           0.095        1.1 GB
Table 4: Forked checkpoint and fast restart times for an idle VM under KVM/QEMU. (The size of free memory is the same as for Table 1.)

Table 3 employs fast restart on an idle virtual machine using the "--enable-fast-ckpt-restart" option of DMTCP. This option uses mmap to map the checkpoint image directly into memory, instead of copying it. In this case, memory is demand-paged in as needed from the checkpoint image. In this mode, compression is not used in creating the checkpoint image. Checkpoint times are somewhat faster when writing an uncompressed checkpoint image to disk, since the time for executing gzip (compression) dominates over the time to write to disk.

Table 4 presents the results of combining both the fast-restart and forked-checkpointing mechanisms on KVM/QEMU. Note that on restart from a checkpoint image, the shadow page tables inside the kernel must be recreated, after which the pages will be faulted back into RAM. The impact of this on the performance of the running applications within the guest operating system is not captured by these tables. The tables indicate only the time after which the virtual machine can begin to execute.
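The fast-restart mechanism can be illustrated with Python's mmap module: instead of reading the uncompressed image into RAM, the restarter maps it, and pages are faulted in from the file on first access. A minimal sketch with an invented image layout (not DMTCP's actual image format):

```python
import mmap, os, tempfile

def write_image(data):
    """Write an uncompressed checkpoint image to disk (sketch)."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return path

def fast_restart(path):
    """Map the image instead of copying it: the OS demand-pages the
    contents in as they are first touched."""
    with open(path, "rb") as f:
        # mmap dups the descriptor, so the file object may be closed
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

if __name__ == "__main__":
    path = write_image(b"\x00" * 4096 + b"guest-state")
    image = fast_restart(path)
    assert image[4096:4107] == b"guest-state"   # page faulted in on access
```

This is why the restart times in Tables 3 and 4 are flat at about 0.1 seconds regardless of image size: the mapping itself is cheap, and the paging cost is deferred until the guest actually touches its memory.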
The numbers in Table 5 demonstrate the small overhead of executing under DMTCP. DMTCP incurs this overhead due to its use of wrapper functions around certain system calls. We used the nbench2 benchmark program [May] to analyze the overhead under conditions of stress. The nbench2 benchmark is a collection of applications that stress the CPU and the memory: specifically, the integer unit, the floating-point unit, and the memory subsystem. The indexes in Table 5 are a measure of performance, normalized with respect to the AMD K6/233. Higher numbers are better.

Table 5 shows that DMTCP has little impact on performance for a VM running CPU-intensive or memory-intensive loads. In contrast, the performance of KVM/QEMU is much higher than that of user-space QEMU, as expected.

Table 6 shows the large impact of the DMTCP optimizations on the checkpoint and restart times. Further, one can compare the effect of running a virtual machine under load with that of an idle virtual machine: Table 6 shows a machine under load (running nbench2), while Tables 1, 2 and 3 show an idle machine. The checkpoint and restart times are almost the same in the two cases. The size of the checkpoint image increases by at most 7.2% when under load, due to there being fewer zero pages.
Table 7 shows the advantage of using the copy-on-write feature of Btrfs to store the guest VM's filesystem. At checkpoint time, a small additional DMTCP plugin rapidly copies the state of the entire filesystem (which appears as a single file on the host filesystem), using the --reflink option of the GNU cp command. At restart time, the state of the guest filesystem is similarly copied back.

                 KVM/QEMU                                             QEMU (user-space)
                 Memory Index  Integer Index  Floating-point Index    Memory Index  Integer Index  Floating-point Index
With DMTCP       31.483        25.535         47.806                  2.516         3.473          0.285
Without DMTCP    31.381        25.518         48.380                  2.435         3.338          0.274

Table 5: Nbench2 benchmark program on virtual machines. (Memory allocated in each case is 1024 MB.)

Checkpoint Mechanism        KVM/QEMU                                  QEMU (user-space)
                            Checkpoint (s)  Restart (s)  Image Size   Checkpoint (s)  Restart (s)  Image Size
Default-ckpt                9.915           3.203        125 MB       15.154          5.967        226 MB
Forked-ckpt                 0.214           3.171        125 MB       0.188           5.902        226 MB
Fast-restart                3.245           0.098        1.1 GB       4.382           0.093        1.5 GB
Forked-ckpt/Fast-restart    0.206           0.095        1.1 GB       0.212           0.122        1.5 GB

Table 6: Checkpoint-restart while running nbench2, as influenced by DMTCP options for forked checkpoint and fast restart. (Memory allocated in each case is 1024 MB.)
Checkpoint-restart mechanisms specific to individual virtual machines have existed for some years. Xen [BDF+03] and QEMU [Qem12] are notable examples.

Xen has offered checkpointing at least since [VNOS06]. A faster checkpoint-restart based on COW (copy-on-write) filesystems was developed independently by two groups [SB05, Col12]. Later, support for deduplication in Xen checkpoints was described in [PEL11].

QEMU can be checkpointed by issuing the "stop" command, followed by the "savevm" command. This capability has been enhanced in the case of QEMU running on top of KVM (Kernel-based Virtual Machine): by modifying KVM to add an additional checkpoint thread [SC11], and similarly by modifying the live migration facility of KVM to save a copy while migrating the ongoing VM computation [CLO+08]. BlobCR [NC11] takes a related approach on IaaS clouds, using the BlobSeer data service [NAB+11] to store virtual machine snapshots.

                 Checkpoint (s)  Restart (s)
Btrfs            0.264           0.102
Without Btrfs    7.932           8.428

Table 7: Snapshotting an idle guest VM under KVM/QEMU, including its guest filesystem. The guest filesystem is optionally stored in a host Btrfs filesystem. (Memory allocated in each case is 1024 MB. The size of the guest filesystem is 2.5 GB.)

Among other choices for checkpoint-restart, BLCR [HD06] has the longest history among the commonly used checkpoint-restart packages. It is based on a kernel module, and has especially strong support for use with MPI-based checkpoint-restart services and with batch queues. CryoPid2 [O'N] represents an alternative user-space checkpoint-restart package, based on using ptrace to control the target application. OpenVZ [Ope] is a kernel-based checkpoint-restart package based on Linux containers. CRIU [CRI] is a recent checkpoint-restart package with an interesting hybrid strategy between user-space and kernel-space approaches. The Linux kernel has been extended to include many interfaces that expose the kernel internals.
CRIU uses those interfaces to provide an entirely user-space checkpoint-restart package.

Forked checkpointing has an exceptionally long history, dating back to 1990 [LNP90, LNP94]. Incremental checkpointing has been demonstrated at least since 1995 [PXN95].
A generic checkpoint-restart mechanism was presented, based on the DMTCP checkpoint-restart package. DMTCP can directly checkpoint the user-space QEMU virtual machine. In other cases, where the virtual machine employs a kernel driver, DMTCP relies on an API to transfer driver state between the kernel driver and user space. KVM provides such an API, and so a 200-line DMTCP plugin sufficed to implement checkpoint-restart for KVM/QEMU. Lguest does not provide such an API, and about 40 lines were added to augment the Lguest kernel driver. The development time for checkpoint-restart for a new virtual machine is estimated at five person-days where a full kernel driver API is provided (as for KVM), and ten person-days where it is not (as for Lguest).

The method is applicable wherever DMTCP is available. DMTCP currently runs under Linux (x86, x86_64, and ARM). Thin hypervisors may or may not support DMTCP, depending on what features of Linux they support.

The generic mechanism presented assumes a homogeneous architecture (same CPU, same host operating system, same hardware). Future work may consider removing some of those restrictions, especially that of homogeneous hardware. Future work will also explore transparently checkpointing a cluster of virtual machines.

Where the kernel driver API must be extended (Lguest, in our case), an alternative approach was considered. In this approach, all of the system calls from the VM launcher to the VM kernel driver are recorded at the time of launch, and those system calls are played back at the time of restart (possibly in modified form). This may have the advantage of being more robust as the VM software evolves. This, too, is a topic for future work.

Acknowledgment
The authors gratefully acknowledge the discussions and insights provided by Zhengping Jing.
References

[AAC09] Jason Ansel, Kapil Arya, and Gene Cooperman. DMTCP: Transparent checkpointing for cluster computations and the desktop. In Proc. of IEEE Int. Parallel and Distributed Processing Symposium (IPDPS), pages 1-12, 2009.

[BDF+03] Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf Neugebauer, Ian Pratt, and Andrew Warfield. Xen and the art of virtualization. In Proc. of 19th ACM Symposium on Operating Systems Principles, SOSP '03, pages 164-177, New York, NY, USA, 2003. ACM.

[CLO+08] K. Chanchio, C. Leangsuksun, H. Ong, V. Ratanasamoot, and A. Shafi. An efficient virtual machine checkpointing mechanism for hypervisor-based HPC systems. In High Availability and Performance Computing Workshop (HAPCW), 2008.

[Col12] Patrick Colp. Xen project code released: VM snapshots. http://vmblog.com/archive/2009/04/22/xen-project-code-released-vm-snapshots.aspx, accessed Nov. 18, 2012.

[CRI] CRIU team. CRIU. http://criu.org/.

[DMT12] DMTCP team. DMTCP: Distributed MultiThreaded CheckPointing. http://dmtcp.sourceforge.net, accessed Nov. 18, 2012.

[HD06] Paul Hargrove and Jason Duell. Berkeley Lab Checkpoint/Restart (BLCR) for Linux clusters. Journal of Physics: Conference Series, 46:494-499, September 2006.

[KVM12] KVM team. KVM. http://wiki.qemu.org/KVM, accessed Nov. 18, 2012.

[LNP90] Kai Li, Jeffrey F. Naughton, and James S. Plank. Real-time, concurrent checkpoint for parallel programs. In Proc. of Second ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 79-88, March 1990.

[LNP94] Kai Li, Jeffrey F. Naughton, and James S. Plank. Low-latency, concurrent checkpointing for parallel programs. IEEE Transactions on Parallel and Distributed Systems, 5(8):874-879, August 1994.

[May] Uwe F. Mayer. Linux/Unix nbench. Retrieved Dec. 4, 2012.

[McL08] Mark McLoughlin. The QCOW2 image format. http://people.gnome.org/~markmc/qcow-image-format.html, 2008.

[NAB+11] Bogdan Nicolae, Gabriel Antoniu, Luc Bougé, Diana Moise, and Alexandra Carpen-Amarie. BlobSeer: Next generation data management for large scale infrastructures. Journal of Parallel and Distributed Computing, 71(2):168-184, February 2011.

[NC11] Bogdan Nicolae and Franck Cappello. BlobCR: Efficient checkpoint-restart for HPC applications on IaaS clouds using virtual disk image snapshots. In Proc. of 2011 Int. Conf. for High Performance Computing, Networking, Storage and Analysis, SC '11, pages 1-12. ACM, 2011.

[O'N] Mark O'Neill. CryoPid2. http://sourceforge.net/projects/cryopid2.

[Ope] OpenVZ team. OpenVZ. http://wiki.openvz.org/.

[PEL11] Eunbyung Park, Bernhard Egger, and Jaejin Lee. Fast and space-efficient virtual machine checkpointing. In Proc. of 7th ACM SIGPLAN/SIGOPS Int. Conf. on Virtual Execution Environments, VEE '11, pages 75-86, New York, NY, USA, 2011. ACM.

[PXN95] J. S. Plank, J. Xu, and R. H. B. Netzer. Compressed differences: An algorithm for fast incremental checkpointing. Technical Report CS-95-302, University of Tennessee, August 1995.

[Qem12] QEMU team. QEMU. http://wiki.qemu.org/Main_Page, accessed Nov. 18, 2012.

[RBM12] Ohad Rodeh, Josef Bacik, and Chris Mason. BTRFS: The Linux B-tree filesystem. Technical Report RJ10501 (ALM1207-004), IBM Research, July 2012. http://domino.watson.ibm.com/library/CyberDig.nsf/papers/6E1C5B6A1B6EDD9885257A38006B6130/$File/rj10501.pdf.

[Rus12] Rusty Russell. Lguest: The simple x86 hypervisor. http://lguest.ozlabs.org/, accessed Nov. 18, 2012.

[SB05] Michael H. Sun and Douglas M. Blough. Fast, lightweight virtual machine checkpointing. Technical Report GIT-CERCS-10-05, Georgia Institute of Technology, 2005.

[SC11] Vasinee Siripoonya and Kasidit Chanchio. Thread-based live checkpointing of virtual machines, 2011.

[TL01] Douglas Thain and Miron Livny. Multiple Bypass: Interposition agents for distributed computing. Cluster Computing, 4(1):39-47, 2001.

[VNOS06] Geoffroy Vallée, Thomas Naughton, Hong Ong, and Stephen L. Scott. Checkpoint/restart of virtual machines based on Xen. In HAPCW'06: High Availability and Performance Computing Workshop, Santa Fe, New Mexico, USA, October 2006. Held in conjunction with LACSI 2006.

[WDG+06] Charles P. Wright, Jay Dave, Puja Gupta, Harikesavan Krishnan, David P. Quigley, Erez Zadok, and Mohammad Nayyer Zubair. Versatility and Unix semantics in namespace unification.