Checkpoint, Restore, and Live Migration for Science Platforms
Mario Juric, Steven Stetzler, and Colin T. Slater
DiRAC Institute and the Department of Astronomy, University of Washington, Seattle, WA, U.S.A.; [email protected]
Abstract.
We demonstrate a fully functional implementation of (per-user) checkpoint, restore, and live migration capabilities for JupyterHub platforms. Checkpointing – the ability to freeze and suspend to disk the running state (contents of memory, registers, open files, etc.) of a set of processes – enables the system to snapshot a user's Jupyter session to permanent storage. The restore functionality brings a checkpointed session back to a running state, to continue where it left off at a later time and potentially on a different machine. Finally, live migration enables moving running Jupyter notebook servers between different machines, transparent to the analysis code and without disconnecting the user. Our implementation of these capabilities works at the system level, with few limitations, and typical checkpoint/restore times of O(10s) with a pathway to O(1s) live migrations. It opens a myriad of interesting use cases, especially for cloud-based deployments: from checkpointing idle sessions without interrupting the user's work (achieving cost reductions of 4x or more), to execution on spot instances with transparent migration on eviction (with additional cost reductions of up to 3x), to automated migration of workloads to ideally suited instances (e.g., moving an analysis to a machine with more or less RAM or cores based on observed resource utilization). The capabilities we demonstrate can make science platforms fully elastic while retaining an excellent user experience.
1. Introduction
With ever-increasing dataset sizes, remote analysis paradigms are becoming increasingly popular. In such systems (e.g., Jurić et al. 2017; Taghizadeh-Popp et al. 2020; Nikutta et al. 2020; Stetzler 2020), users access data and computing resources through science platforms – rich gateways exposing server-side code editing, management, execution, and result visualization capabilities – usually implemented as notebooks such as Jupyter (Kluyver et al. 2016) or Zeppelin. A challenge of this model is that the data provider (e.g., an archive facility) now bears both the cost of dataset storage and that of computing resources – including those used for running the users' Jupyter notebooks. This cost can balloon quickly, especially on cloud resources: left unmanaged, a notebook instance running 24/7 can cost thousands of dollars per user per year (cf. Table 1). This can be reduced by terminating inactive instances, but the price is a poor user experience.

In this contribution we present a solution: the ability to checkpoint (freeze) a user's running Jupyter notebook server to disk, and restore it to memory on demand (including on a different host). This C/R functionality can dramatically reduce the cost, while fully maintaining the user experience. It also enables novel capabilities, such as uninterrupted migration of work based on resource needs.
2. Elsa: A Checkpoint-able JupyterHub Deployment
In a fully functional proof-of-concept we named Elsa, we added the C/R functionality to a cloud deployment of JupyterHub. The user experience can be viewed in a YouTube screencast at https://dirac.us/5aj; here, we provide a brief summary.
Figure 1. The C/R UI. "Pause" checkpoints the session; "fast-forward" initiates migration to a different VM instance type.

With Elsa, the user logs into the JupyterHub aspect of the science platform and starts Jupyter on a machine with the desired capabilities (e.g., CPU core count or RAM size). The user then works on their notebooks as usual. However, an additional option is now present in the notebook interface – the "pause" button on the right of the notebook toolbar (Figure 1). Clicking this button checkpoints the complete state (memory, open files, etc.) of the notebook server and releases all computational resources. Later, the user can restore the checkpointed session either on the same VM or one with different resources, and continue where they left off as if nothing happened.
3. Implementation
The high-level architecture of the Elsa prototype is shown in Figure 2. The JupyterHub front-end is run on a single, dedicated node, from an essentially unmodified upstream JupyterHub container image. (The source code is available at https://github.com/dirac-institute/elsa.)

Figure 2. The architecture of the checkpointable JupyterHub deployment.

Our main (configuration-level) customization is the addition of a new Spawner class. Within JupyterHub, a spawner is responsible for starting and managing users' notebook server instance(s); see https://jupyterhub.readthedocs.io/en/stable/reference/spawners.html for more. The spawner finds a VM where the notebooks run, and starts and stops Jupyter on that VM. As Kubernetes has no support for pod C/R (work is ongoing; see https://github.com/kubernetes/enhancements/pull/1990), we could not use it at this time; instead, our spawner directly allocates one new VM per user from the cloud provider. While reducing portability, running on bare VMs does bring some additional (and significant) benefits: isolation between users at the VM level leads to a predictable user experience; using VMs allows us to add swap, making out-of-memory conditions a "soft" fail; and bare VMs are faster to provision relative to speeds from common k8s cluster autoscalers. A minimal sketch of such a spawner is shown below.
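To make the division of labor concrete, the following is a minimal sketch (not Elsa's actual code) of a VM-per-user spawner. The start/stop/poll methods are the real JupyterHub Spawner API; all underscore-prefixed names are hypothetical stubs standing in for the cloud-provider-specific calls (Digital Ocean API calls, in Elsa's case):

    from jupyterhub.spawner import Spawner
    from traitlets import Unicode

    class VMSpawner(Spawner):
        """Allocate one cloud VM per user; run the notebook server there."""

        vm_size = Unicode("8cpu-32gb", config=True,
                          help="Instance type to allocate for this user")

        async def start(self):
            # Allocate a fresh VM for this user (a cloud-provider API call).
            self.vm_ip = await self._provision_vm(self.user.name, self.vm_size)
            # Launch the single-user server on the VM; get_env() carries the
            # tokens the server needs to authenticate back to the Hub.
            await self._remote_start_notebook(self.vm_ip, env=self.get_env())
            return self.vm_ip, 8888   # (ip, port) the Hub proxy routes to

        async def stop(self, now=False):
            # With C/R injected here, "stop" checkpoints instead (see below).
            await self._remote_stop_notebook(self.vm_ip)
            await self._release_vm(self.user.name)

        async def poll(self):
            # None means "running"; an integer exit status means "stopped".
            return None if await self._notebook_alive(self.vm_ip) else 0

        # --- hypothetical cloud-provider-specific stubs ---
        async def _provision_vm(self, user, size): raise NotImplementedError
        async def _release_vm(self, user): raise NotImplementedError
        async def _remote_start_notebook(self, ip, env): raise NotImplementedError
        async def _remote_stop_notebook(self, ip): raise NotImplementedError
        async def _notebook_alive(self, ip): raise NotImplementedError

A deployment would then select such a class in jupyterhub_config.py, e.g., c.JupyterHub.spawner_class = 'mymodule.VMSpawner'.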
Although each user gets their own VM, per-user Jupyter is still run from a container. This i) abstracts away the details of the raw VM (e.g., the Linux distribution doesn't matter, as long as podman/CRIU are available), ii) allows us to use the standard notebook-server container, iii) makes deployment significantly easier (a simple pull, rather than an OS-level install), and iv) allows for secure re-use of VMs between different users (as users are sandboxed by their containers). We manage the container using podman (a daemonless container engine for OCI containers; https://podman.io/), which has built-in support for container checkpointing with CRIU (Checkpoint-and-Restore in Userspace; http://criu.org). To inject C/R support into Jupyter, we hijack the Spawner's start and stop APIs: when receiving the start command, our custom spawner restores the session from a previously stored checkpoint, if one exists; similarly, upon receiving a request to stop, it checkpoints rather than stops. This is clearly a convenient hack, and a proper C/R API should ultimately be added to JupyterHub. The essence of the hijack is sketched below.
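The heart of the hijack is only a few lines. The sketch below uses illustrative assumptions (the container name and the checkpoint path are placeholders); the podman subcommands themselves (container checkpoint --export and container restore --import) are the standard CRIU-backed podman CLI:

    import os
    import subprocess

    CKPT_DIR = "/mnt/nfs/checkpoints"   # on the shared NFS, visible from any VM

    def checkpoint_instead_of_stop(user, container="jupyter"):
        """On 'stop': freeze the notebook container to a portable archive."""
        archive = os.path.join(CKPT_DIR, f"{user}.tar.gz")
        subprocess.run(["podman", "container", "checkpoint",
                        "--export", archive, container], check=True)

    def restore_instead_of_start(user):
        """On 'start': restore a prior session if a checkpoint exists.
        Returns False if there is none (cold-start with `podman run` instead)."""
        archive = os.path.join(CKPT_DIR, f"{user}.tar.gz")
        if not os.path.exists(archive):
            return False
        subprocess.run(["podman", "container", "restore",
                        "--import", archive], check=True)
        return True

Because the exported archive is self-contained and /home is shared over NFS, the restore can run on a different VM than the checkpoint – which is exactly what migration requires.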
Each VM mounts disks from a shared NFS server (nfs.internal in Figure 2), including the /home filesystem, which is mounted as a volume within the user's container. A shared /home elegantly solves the problem of how to keep users' data identical at the inode level if/when they restore a checkpoint on a different machine (a requirement for checkpointing). While a shared filesystem introduces a potential bottleneck (e.g., imagine thousands of users simultaneously analyzing large datasets from their home directories), we haven't observed any issues in typical usage. For larger deployments, one could use a more scalable shared filesystem (e.g., pNFS or GPFS). We use the same NFS to centrally store the checkpoints themselves.

Finally, this new functionality is exposed to the user through a simple two-button UI featuring "pause" and "fast forward" buttons (Figure 1). For simplicity, we added these to the nbresuse Jupyter extension.
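For illustration, the server-side half of such a "pause" button could be as small as the following sketch of a classic notebook server extension. The endpoint name and wiring are hypothetical (not nbresuse's actual code); it relies on the JUPYTERHUB_* environment variables and the DELETE /users/{name}/server Hub API, and assumes the server's own token is permitted to stop the server:

    import os
    from urllib.request import Request, urlopen

    from notebook.base.handlers import IPythonHandler
    from notebook.utils import url_path_join

    class PauseHandler(IPythonHandler):
        def post(self):
            # Credentials JupyterHub injects into every single-user server.
            api = os.environ["JUPYTERHUB_API_URL"]
            user = os.environ["JUPYTERHUB_USER"]
            token = os.environ["JUPYTERHUB_API_TOKEN"]
            # Ask the Hub to stop this server; with the hijacked spawner,
            # "stop" means "checkpoint to NFS and release the VM".
            req = Request(f"{api}/users/{user}/server", method="DELETE",
                          headers={"Authorization": f"token {token}"})
            urlopen(req)
            self.finish({"status": "checkpointing"})

    def load_jupyter_server_extension(nb_app):
        base = nb_app.web_app.settings["base_url"]
        nb_app.web_app.add_handlers(
            ".*$", [(url_path_join(base, "/elsa/pause"), PauseHandler)])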
This code has been prototyped and deployed on Digital Ocean (DO; a low-cost cloud provider). Given that we use DO APIs in our spawner to manage the VM instances, Elsa will not run on other providers out of the box. However, extensions to other clouds (e.g., AWS or GCP) are rather easy – O(50) lines of code – and planned as a future addition.
4. Discussion and Future Work
To our knowledge, this is the first fully functional implementation of checkpoint-restore and migration functionality for Jupyter notebooks on JupyterHub. It demonstrates that C/R for Jupyter is not only possible, but fully functional for analyses as complex as the LSST software stack (Jurić et al. 2017). Going forward, we plan to generalize the code to other cloud providers, implement migration of open TCP network connections, add a dedicated C/R API for JupyterHub, and improve overall C/R performance.

We see three main application areas for this work: shared servers, on-prem science platforms, and cloud-based science platforms. For shared servers (e.g., a machine used by a research group), one can now opportunistically checkpoint Jupyter instances after a period of inactivity, thus optimizing overall resource usage. For on-prem science platforms, our work lets the platform operator checkpoint rather than terminate inactive instances, resulting in a significantly better user experience. Operators can also dynamically migrate users' instances to optimize resource usage.

But the largest opportunity is for cloud deployments, where C/R can both improve the user experience and significantly lower the operating cost. As we show in Table 1, with our C/R work running a typical user's Jupyter instance may cost as little as $200/yr, with no degradation of user experience relative to running 24/7. This is likely better than the total cost of ownership of running a similar system on premises, eliminating one more barrier to migrating science analyses to the Cloud.

Table 1. Savings when running on cloud resources (AWS pricing as of 3:20pm PST, Nov 4, 2020). The first row shows the cost of running an on-demand instance for an entire year (giving ideal user experience). The second row shows the cost of running for 15% of that time (a typical duty cycle we observed with our users), storing a checkpoint while the user is inactive. The third row shows the cost of running on spot instances, which is now possible as users' work can be transparently migrated to a new instance if the spot VM is about to be terminated.

    Instance                                                Annual Cost
    24x7x365, 8 core / 32 GB RAM (m5.2xlarge, on-demand)    $3364
    As above + C/R, 15% duty cycle                          $505
    As above + C/R, 15% duty cycle, spot                    $196
    Savings                                                 ~17x
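The arithmetic behind Table 1 is easy to reproduce. In the sketch below, the on-demand rate is the published m5.2xlarge price at the time ($0.384/hr); the spot rate is an approximation implied by the table, not a quoted price:

    HOURS = 24 * 365        # one year of wall-clock time
    ON_DEMAND = 0.384       # $/hr, m5.2xlarge on-demand (Nov 2020)
    SPOT = 0.149            # $/hr, approximate spot rate implied by Table 1
    DUTY = 0.15             # fraction of time users were actually active

    always_on = ON_DEMAND * HOURS          # ~$3364: running 24x7x365
    with_cr = always_on * DUTY             # ~$505:  checkpointed when idle
    with_cr_spot = SPOT * HOURS * DUTY     # ~$196:  idle C/R + spot instances
    print(f"${always_on:.0f} ${with_cr:.0f} ${with_cr_spot:.0f} "
          f"(~{always_on / with_cr_spot:.0f}x savings)")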
References
Jurić, M., Dubois-Felsmann, G. P., Ciardi, D., & Guy, L. 2017, LSST Science Platform Vision Document. URL http://ls.st/lse-319
Jurić, M., Kantor, J., Lim, K. T., & et al. 2017, in ADASS XXV, edited by N. P. F. Lorente, K. Shortridge, & R. Wayth, vol. 512 of ASP Conference Series, 279
Kluyver, T., Ragan-Kelley, B., Pérez, F., & et al. 2016, in PPAP, edited by F. Loizides & B. Schmidt, 87. URL https://eprints.soton.ac.uk/403913/
Nikutta, R., Fitzpatrick, M., Scott, A., & Weaver, B. A. 2020, Astronomy and Computing, 33, 100411
Stetzler, S. 2020, A Scalable Cloud-Based Analysis Platform for Survey Astronomy. URL osf.io/e2zwf
Taghizadeh-Popp, M., Kim, J. W., Lemson, G., & et al. 2020, Astronomy and Computing, 33, 100412