Checkpointing with cp: the POSIX Shared Memory System
Lehman H. Garrison
Center for Computational Astrophysics, Flatiron Institute
New York, NY 10010
lgarrison@flatironinstitute.org
Daniel J. Eisenstein
Center for Astrophysics | Harvard & Smithsonian
Cambridge, MA
[email protected]
Nina A. Maksimova
Center for Astrophysics | Harvard & Smithsonian
Cambridge, MA
[email protected]
Abstract—We present the checkpointing scheme of ABACUS, an N-body simulation code that allocates all persistent state in POSIX shared memory, or ramdisk. Checkpointing becomes as simple as copying files from ramdisk to external storage. The main simulation executable is invoked once per time step, memory mapping the input state, computing the output state directly into ramdisk, and unmapping the input state. The main executable remains unaware of the concept of checkpointing, with the top-level driver code launching a file-system copy between executable invocations when a checkpoint is needed. Since the only information flow is through files on ramdisk, the checkpoint must be correct so long as the simulation is correct. However, we find that with multi-GB of state, there is a significant overhead to unmapping the shared memory. This can be partially mitigated with multithreading, but ultimately, we do not recommend shared memory for use with a large state.

Index Terms—simulation, checkpointing, shared memory, ramdisk
I. OUT-OF-CORE MODEL

ABACUS is a cosmological N-body solver that integrates particle trajectories under mutual self-gravity in 6D phase space, from the smooth, nearly uniform conditions of the early universe to the clustered richness of filaments, halos, and voids seen in galaxy surveys today. The depth and breadth of upcoming galaxy surveys like DESI, the Dark Energy Spectroscopic Instrument, demand simulations with hundreds of billions or trillions of particles [1].

Particle-based simulation codes are often memory limited, since the natural dimensions for improvement are to simulate physics at smaller scales or in a larger domain, and either way this means using more particles. The floating-point throughput provided by GPUs has made processing such increasing numbers of particles tractable; in many cases, the challenge is now efficiently storing the state and, on distributed-memory systems, communicating state between nodes.

ABACUS is designed to support simulations whether they fit in memory or not. In the latter case, also known as "out of core", the state is buffered on disk. A side effect of this model is that the state is the checkpoint: the only data flow between simulation time steps is through files on disk. This enforces the completeness of checkpoints.

When running on a distributed-memory system where the state does fit in memory, we preserve this file-oriented model by using POSIX shared memory, or ramdisk [2], [3]. (We use the term "ramdisk" colloquially: in Linux kernel parlance, a ramdisk is a "raw" block device on top of which a file system may be created. We use the related, but distinct, kernel tmpfs, which is more widely available.) The ramdisk is exposed as a directory, so checkpointing consists of launching a file-system copy in between time steps. The simulation "reads" the shared memory with a memory map, thus avoiding the copy that a file-system read from ramdisk would incur. However, this mapping procedure has its own overheads, which we discuss below.

II. SIMULATION FLOW

ABACUS is divided into a top-level driver code and a simulation executable. The driver code calls the executable in a loop, once per time step. The task of the executable is to load the state at time t, called the read state, compute forces on particles, update their kinematics, and write a new state at time t+1, called the write state. The executable is idempotent, relying on the top-level driver code to rename the write state to the read state between invocations.

The fact that a new executable is invoked for each time step, loading the state anew, ensures that the state is the checkpoint. There can be no side-channel information flowing through out-of-band allocations; any information that persists across time steps must be part of the state files. This model is a strong enforcement of the completeness and correctness of checkpointing. Furthermore, the executable can remain oblivious to the concept of checkpointing, leaving the driver code to handle the file-oriented tasks to which it is well suited.
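To make the division of labor concrete, the following is a minimal sketch of such a driver loop. The executable name, state paths, backup destination, and checkpoint cadence are all illustrative assumptions, not taken from the actual ABACUS driver:

```c
/* Minimal sketch of the driver loop described above.  The executable
 * name, paths, and checkpoint cadence are illustrative assumptions. */
#include <stdio.h>
#include <stdlib.h>

#define READ_STATE  "/dev/shm/abacus/read"
#define WRITE_STATE "/dev/shm/abacus/write"

int main(void) {
    for (int step = 0; step < 1000; step++) {
        /* One executable invocation per time step: it maps the read
         * state, computes, and writes the write state to ramdisk. */
        if (system("./singlestep") != 0) {
            fprintf(stderr, "step %d failed\n", step);
            return EXIT_FAILURE;
        }

        /* Checkpoint when due: the state *is* the checkpoint, so a
         * plain file-system copy from ramdisk to external storage
         * suffices; the executable never knows it happened. */
        if (step % 100 == 0) {
            char cmd[128];
            snprintf(cmd, sizeof cmd,
                     "cp -a %s /lustre/backups/step%04d", WRITE_STATE, step);
            system(cmd);
        }

        /* The executable is idempotent; the driver renames the write
         * state to the read state between invocations. */
        system("rm -rf " READ_STATE);
        if (rename(WRITE_STATE, READ_STATE) != 0) {
            perror("rename");
            return EXIT_FAILURE;
        }
    }
    return EXIT_SUCCESS;
}
```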
The state files themselves are raw binary representations of the phase-space particle information, with files divided into planar "slabs" of the simulation volume. Within a slab, particles are ordered by cell (the atomic unit of our domain decomposition). These files mirror the memory model used by the simulation. Metadata is stored in a separate ASCII file in the state directory.

Each slab may have multiple slab types, each stored in a separate file and representing one field, such as the positions, velocities, and particle flags. When a given slab is requested by the code, the type is specified as well. The request is processed by the slab buffer, which computes the file path and determines whether it resides on ramdisk. If so, the slab buffer instructs the arena allocator to map the slab directly. If not, the arena allocator makes a new allocation, and a file I/O request is passed to the I/O subsystem to read into that allocation.

The determination of whether a path resides on the ramdisk is done by string comparison of the path prefix. Other, more robust methods were deemed either too complex or too expensive.
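A sketch of that prefix test, assuming the default /dev/shm/ mount point (the function name is ours, not ABACUS's):

```c
#include <stdbool.h>
#include <string.h>

/* Sketch: a path is treated as ramdisk-resident iff it begins with
 * the tmpfs mount point.  More robust checks (e.g., statfs(2) and
 * testing for TMPFS_MAGIC) were judged too complex or expensive. */
static bool is_on_ramdisk(const char *path) {
    static const char prefix[] = "/dev/shm/";
    return strncmp(path, prefix, sizeof prefix - 1) == 0;
}
```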
III. SHARED MEMORY

POSIX shared memory is exposed on Linux via a tmpfs file system. By default, it is mounted at /dev/shm/ and has a capacity of half of the system's physical memory. Files created in that directory have "kernel persistence", meaning they stay in memory until the kernel is terminated.

One way to use this ramdisk is as if it were an ordinary storage device, reading and writing with standard file I/O interfaces. This will speed up I/O in most cases, but it will consume extra memory, and the I/O will only be as fast as a memory copy. This can be slow for large allocations, especially if the I/O is only using a single thread on a system with multiple sockets, and it exerts unnecessary memory bandwidth pressure.

We instead opt to memory map the ramdisk files. This can be thought of as getting a pointer directly into shared memory, avoiding any memory movement. This is accomplished by getting a file descriptor with open(), setting the size with ftruncate() (if writing), and mapping the shared pages into user space with mmap().

This model has been very successful in our code, with state files naturally serving as the checkpoint, and the actual backup to disk being as simple as launching a file-system copy on each node. We have confirmed that even though the shared memory is held by the kernel, the underlying pages obey user-space first-touch NUMA semantics.
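For example, creating and mapping a write-state slab might look like the following sketch of the open()/ftruncate()/mmap() sequence just described (the function name and error handling are our illustration):

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Sketch of mapping a write-state slab directly in shared memory. */
void *map_slab_for_write(const char *path, size_t bytes) {
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return NULL; }

    /* Set the file size; tmpfs pages are allocated lazily, so they
     * follow first-touch NUMA placement when first written. */
    if (ftruncate(fd, (off_t)bytes) != 0) {
        perror("ftruncate");
        close(fd);
        return NULL;
    }

    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd); /* the mapping, and the file name, keep the pages alive */
    return p == MAP_FAILED ? NULL : p;
}
```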
The default ramdisk size limit on Linux systems is half of the system's memory. This is not a limitation in our case, as roughly half of ABACUS's memory allocations are transient, mostly from kinematic data like particle accelerations.

IV. DEPLOYMENT ON SUMMIT
We ran a suite of simulations on the Summit supercomputer at Oak Ridge National Lab using this shared-memory checkpointing model. Overall, it was very successful, with timed checkpoints running every few hours, and conditional checkpoints running before time steps that included on-the-fly analysis. These time steps were considered "riskier", as they increased the memory footprint and the code path was dependent on the physical state of the simulation, increasing the chance of exposing a rare, corner-case bug. The state copy from the nodes to the Alpine network file system took on average 2 minutes for 13 TB (6800 files) spread across 63 nodes, or about 1.7 GB/s/node.

The primary checkpointing failures were (i) a string of network failures triggered by the copy operation on multi-GB files, (ii) timeouts in the checkpointing caused by variable network file system performance, and (iii) user error in deleting the original checkpoint instead of the partial checkpoint in the case of checkpoint failures.
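As a consistency check on the per-node rate quoted above (taking the 2 minutes as 120 s):

\[
\frac{13\,\mathrm{TB}}{63\ \text{nodes} \times 120\,\mathrm{s}} \approx 1.7\ \mathrm{GB/s/node}.
\]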
V. OVERHEADS

We find that unmapping shared memory has a noticeable overhead that scales with the size of the mapping. This is shown for a Linux Intel Skylake platform (page size 4096 bytes) in Figure 1. All mappings and unmappings were performed with a fixed thread affinity and on a single NUMA node. Two cases are shown: with and without the underlying /dev/shm/ file name being unlinked (deleted) before the unmapping. If the file name has not been unlinked, then the unmapping is fast (10s of GB/s). If the name has been unlinked, then the unmap runs at 10 GB/s, independent of the size of the mapping. This rate is similar to the memset() speed and suggests some kind of operation (zeroing?) is occurring on the contents of the pages, not just the page tables. This work can be assigned to its own thread, but in simulations, we have observed performance degradation in other areas of the code while munmap() is running in a separate thread. Memory bandwidth pressure may be to blame.

We confirm that an ordinary malloc()/free() pair does not exhibit this behavior. The Summit platform exhibited this same pattern of overheads, despite being an IBM POWER9 platform with 64 KB pages.

For certain allocations (write state slabs), we can skip the munmap() call, as it will not free any memory, because the underlying file handle must persist until the next time step. However, we find that doing so simply defers the unmapping cost to the program termination. Similarly, performing an unlink after the unmapping, rather than before, just shifts the time differential into the unlink.

These overheads may be similar to, or even smaller than, those of methods that stage checkpoints in a main-memory buffer or a burst buffer; a write to a burst buffer will likely be slower than 10 GB/s. However, our overheads are incurred for every time step (typically O(1000)), rather than a few times per simulation. A hybrid method that allows the simulation executable to run multiple time steps in memory, then switch to the ramdisk method just for the checkpoint step, may be superior, at the cost of increased code complexity.

We surmise that the shared memory system was designed to facilitate lightweight inter-process communication, not allocations of dozens of GB. Because our code requires a large amount of state relative to the time it takes to process it, it is suboptimal to use POSIX shared memory as the only way to pass information between time steps. However, the correctness enforced by the out-of-core model is a useful property. This model may be appropriate for codes with a smaller state or a higher compute density (ratio of compute work to state size).
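The following microbenchmark sketches the experiment behind Figure 1; it is an illustrative reconstruction under the assumptions noted in the comments, not the code used to produce the figure:

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

/* Illustrative reconstruction of the Figure 1 experiment: time
 * munmap() on a faulted-in tmpfs mapping, with the file name either
 * kept or unlinked before the unmap.  Size and path are assumptions. */
static double now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

int main(void) {
    const size_t bytes = 1UL << 30; /* 1 GiB mapping */
    const char *path = "/dev/shm/munmap_bench";

    for (int unlink_first = 0; unlink_first <= 1; unlink_first++) {
        int fd = open(path, O_RDWR | O_CREAT, 0644);
        if (fd < 0 || ftruncate(fd, bytes) != 0) { perror(path); return 1; }
        char *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
        close(fd);
        memset(p, 1, bytes); /* fault in every page */

        if (unlink_first)
            unlink(path); /* delete the name before unmapping */

        double t0 = now();
        munmap(p, bytes);
        double dt = now() - t0;
        printf("%-11s: munmap took %.3f s (%.1f GB/s)\n",
               unlink_first ? "with unlink" : "no unlink", dt,
               bytes / dt / 1e9);

        if (!unlink_first)
            unlink(path); /* clean up for the next iteration */
    }
    return 0;
}
```

On the platforms described above, we would expect the "with unlink" pass to run near the memset() rate, consistent with the page-scrubbing interpretation offered above.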
Fig. 1. In the POSIX shared memory checkpoint model, all persistent allocations are made with mmap() and freed with munmap(). mmap() is fast, but munmap() is noticeably slow, especially when the corresponding file handle has already been unlinked or deleted (dashed line). With checkpoints of 10s of GB, an unmap rate of 10 GB/s can be a bottleneck on the simulation performance. [Plot: munmap rate (GB/s, 0 to 50) versus allocation size (MB); curves "No unlink" and "With unlink".]

ACKNOWLEDGMENT

The authors would like to thank the co-developers of the ABACUS code: Marc Metchnik, Doug Ferrer, and Phil Pinto. Abacus development has been supported by NSF AST-1313285 and more recently by DOE-SC0013718, as well as by Simons Foundation funds and Harvard University startup funds. NM was supported as an NSF Graduate Research Fellow. The Summit simulations have been supported by OLCF projects AST135 and AST145, the latter through the Department of Energy ALCC program.
REFERENCES

[1] DESI Collaboration, A. Aghamousa, J. Aguilar, S. Ahlen, S. Alam, et al., "The DESI Experiment Part I: Science, Targeting, and Survey Design," arXiv e-prints, arXiv:1611.00036, Oct. 2016.
[2] shm_overview(7) - overview of POSIX shared memory, Linux Programmer's Manual.