Checkpointing with cp: the POSIX Shared Memory System
Lehman H. Garrison
Center for Computational Astrophysics, Flatiron Institute
New York, NY 10010
lgarrison@flatironinstitute.org
Daniel J. Eisenstein
Center for Astrophysics | Harvard & Smithsonian
Cambridge, MA
[email protected]
Nina A. Maksimova
Center for Astrophysics | Harvard & Smithsonian
Cambridge, MA
[email protected]
Abstract—We present the checkpointing scheme of ABACUS, an N-body simulation code that allocates all persistent state in POSIX shared memory, or ramdisk. Checkpointing becomes as simple as copying files from ramdisk to external storage. The main simulation executable is invoked once per time step, memory mapping the input state, computing the output state directly into ramdisk, and unmapping the input state. The main executable remains unaware of the concept of checkpointing, with the top-level driver code launching a file-system copy between executable invocations when a checkpoint is needed. Since the only information flow is through files on ramdisk, the checkpoint must be correct so long as the simulation is correct. However, we find that with multi-GB of state, there is a significant overhead to unmapping the shared memory. This can be partially mitigated with multithreading, but ultimately, we do not recommend shared memory for use with a large state.

Index Terms—simulation, checkpointing, shared memory, ramdisk
I. OUT-OF-CORE MODEL

ABACUS is a cosmological N-body solver that integrates particle trajectories under mutual self-gravity in 6D phase space, from the smooth, nearly uniform conditions of the early universe to the clustered richness of filaments, halos, and voids seen in galaxy surveys today. The depth and breadth of upcoming galaxy surveys like DESI, the Dark Energy Spectroscopic Instrument, demand simulations with hundreds of billions or trillions of particles [1].

Particle-based simulation codes are often memory limited, since the natural dimensions for improvement are to simulate physics at smaller scales or in a larger domain, and either way this means using more particles. The floating-point throughput provided by GPUs has made processing such increasing numbers of particles tractable; in many cases, the challenge is now efficiently storing the state and, on distributed-memory systems, communicating state between nodes.

ABACUS is designed to support simulations whether they fit in memory or not. In the latter case, also known as "out of core", the state is buffered on disk. A side effect of this model is that the state is the checkpoint: the only data flow between simulation time steps is through files on disk. This enforces the completeness of checkpoints.

When running on a distributed-memory system where the state does fit in memory, we preserve this file-oriented model by using POSIX shared memory, or ramdisk [2], [3]. (We use the term "ramdisk" colloquially: in Linux kernel parlance, a ramdisk is a "raw" block device on top of which a file system may be created. We use the related, but distinct, kernel tmpfs, which is more widely available.) The ramdisk is exposed as a directory, so checkpointing consists of launching a file-system copy in between time steps. The simulation "reads" the shared memory with a memory map, thus avoiding the copy that a file-system read from ramdisk would incur. However, this mapping procedure has its own overheads, which we discuss below.

II. SIMULATION FLOW

ABACUS is divided into a top-level driver code and a simulation executable. The driver code calls the executable in a loop, once per time step. The task of the executable is to load the state at time t, called the read state, compute forces on particles, update their kinematics, and write a new state at time t+1, called the write state. The executable is idempotent, relying on the top-level driver code to rename the write state to the read state between invocations.

The fact that a new executable is invoked for each time step, loading the state anew, ensures that the state is the checkpoint. There can be no side-channel information flowing through out-of-band allocations; any information that persists across time steps must be part of the state files. This model is a strong enforcement of the completeness and correctness of checkpointing. Furthermore, the executable can remain oblivious to the concept of checkpointing, leaving the driver code to handle the file-oriented tasks to which it is well suited.
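To make the division of labor concrete, the following is a minimal sketch of such a driver loop. The executable name, state paths, backup destination, and checkpoint cadence are all illustrative assumptions, not taken from the actual ABACUS driver:

```c
/* Minimal sketch of the driver loop described above.  The executable
 * name, paths, and checkpoint cadence are illustrative assumptions. */
#include <stdio.h>
#include <stdlib.h>

#define READ_STATE  "/dev/shm/abacus/read"
#define WRITE_STATE "/dev/shm/abacus/write"

int main(void) {
    for (int step = 0; step < 1000; step++) {
        /* One executable invocation per time step: it maps the read
         * state, computes, and writes the write state to ramdisk. */
        if (system("./singlestep") != 0) {
            fprintf(stderr, "step %d failed\n", step);
            return EXIT_FAILURE;
        }

        /* Checkpoint when due: the state *is* the checkpoint, so a
         * plain file-system copy from ramdisk to external storage
         * suffices; the executable never knows it happened. */
        if (step % 100 == 0) {
            char cmd[128];
            snprintf(cmd, sizeof cmd,
                     "cp -a %s /lustre/backups/step%04d", WRITE_STATE, step);
            system(cmd);
        }

        /* The executable is idempotent; the driver renames the write
         * state to the read state between invocations. */
        system("rm -rf " READ_STATE);
        if (rename(WRITE_STATE, READ_STATE) != 0) {
            perror("rename");
            return EXIT_FAILURE;
        }
    }
    return EXIT_SUCCESS;
}
```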
The state files themselves are raw binary representations of the phase-space particle information, with files divided into planar "slabs" of the simulation volume. Within a slab, particles are ordered by cell (the atomic unit of our domain decomposition). These files mirror the memory model used by the simulation. Metadata is stored in a separate ASCII file in the state directory.

Each slab may have multiple slab types, each stored in a separate file and representing one field, such as the positions, velocities, and particle flags. When a given slab is requested by the code, the type is specified as well. The request is processed by the slab buffer, which computes the file path and determines whether it resides on ramdisk. If so, the slab buffer instructs the arena allocator to map the slab directly. If not, the arena allocator makes a new allocation, and a file I/O request is passed to the I/O subsystem to read into that allocation.

The determination of whether a path resides on the ramdisk is done by string comparison of the path prefix. Other, more robust methods were deemed either too complex or too expensive.
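A sketch of that prefix test, assuming the default /dev/shm/ mount point (the function name is ours, not ABACUS's):

```c
#include <stdbool.h>
#include <string.h>

/* Sketch: a path is treated as ramdisk-resident iff it begins with
 * the tmpfs mount point.  More robust checks (e.g., statfs(2) and
 * testing for TMPFS_MAGIC) were judged too complex or expensive. */
static bool is_on_ramdisk(const char *path) {
    static const char prefix[] = "/dev/shm/";
    return strncmp(path, prefix, sizeof prefix - 1) == 0;
}
```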
III. SHARED MEMORY

POSIX shared memory is exposed on Linux via a tmpfs file system. By default, it is mounted at /dev/shm/ and has a capacity of half of the system's physical memory. Files created in that directory have "kernel persistence", meaning they stay in memory until the kernel is terminated.

One way to use this ramdisk is as if it were an ordinary storage device, reading and writing with standard file I/O interfaces. This will speed up I/O in most cases, but it will consume extra memory, and the I/O will only be as fast as a memory copy. This can be slow for large allocations, especially if the I/O is only using a single thread on a system with multiple sockets, and it exerts unnecessary memory bandwidth pressure.

We instead opt to memory map the ramdisk files. This can be thought of as getting a pointer directly into shared memory, avoiding any memory movement. This is accomplished by getting a file descriptor with open(), setting the size with ftruncate() (if writing), and mapping the shared pages into user space with mmap().

This model has been very successful in our code, with state files naturally serving as the checkpoint, and the actual backup to disk being as simple as launching a file-system copy on each node. We have confirmed that even though the shared memory is held by the kernel, the underlying pages obey user-space first-touch NUMA semantics.
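For example, creating and mapping a write-state slab might look like the following sketch of the open()/ftruncate()/mmap() sequence just described (the function name and error handling are our illustration):

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Sketch of mapping a write-state slab directly in shared memory. */
void *map_slab_for_write(const char *path, size_t bytes) {
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return NULL; }

    /* Set the file size; tmpfs pages are allocated lazily, so they
     * follow first-touch NUMA placement when first written. */
    if (ftruncate(fd, (off_t)bytes) != 0) {
        perror("ftruncate");
        close(fd);
        return NULL;
    }

    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd); /* the mapping, and the file name, keep the pages alive */
    return p == MAP_FAILED ? NULL : p;
}
```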
The default ramdisk size limit on Linux systems is half of the system's memory. This is not a limitation in our case, as roughly half of ABACUS's memory allocations are transient, mostly from kinematic data like particle accelerations.

IV. DEPLOYMENT ON SUMMIT
We ran a suite of simulations on the Summit supercomputer at Oak Ridge National Lab using this shared-memory checkpointing model. Overall, it was very successful, with timed checkpoints running every few hours, and conditional checkpoints running before time steps that included on-the-fly analysis. These time steps were considered "riskier", as they increased the memory footprint and the code path was dependent on the physical state of the simulation, increasing the chance of exposing a rare, corner-case bug. The state copy from the nodes to the Alpine network file system took on average 2 minutes for 13 TB (6800 files) spread across 63 nodes, or about 1.7 GB/s/node.

The primary checkpointing failures were (i) a string of network failures triggered by the copy operation on multi-GB files, (ii) timeouts in the checkpointing caused by variable network file system performance, and (iii) user error in deleting the original checkpoint instead of the partial checkpoint in the case of checkpoint failures.
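As a consistency check on the per-node rate quoted above (taking the 2 minutes as 120 s):

\[
\frac{13\,\mathrm{TB}}{63\ \text{nodes} \times 120\,\mathrm{s}} \approx 1.7\ \mathrm{GB/s/node}.
\]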
V. OVERHEADS

We find that unmapping shared memory has a noticeable overhead that scales with the size of the mapping. This is shown for a Linux Intel Skylake platform (page size 4096 bytes) in Figure 1. All mappings and unmappings were performed with a fixed thread affinity and on a single NUMA node. Two cases are shown: with and without the underlying /dev/shm/ file name being unlinked (deleted) before the unmapping. If the file name has not been unlinked, then the unmapping is fast (10s of GB/s). If the name has been unlinked, then the unmap runs at 10 GB/s, independent of the size of the mapping. This rate is similar to the memset() speed and suggests some kind of operation (zeroing?) is occurring on the contents of the pages, not just the page tables. This work can be assigned to its own thread, but in simulations, we have observed performance degradation in other areas of the code while munmap() is running in a separate thread. Memory bandwidth pressure may be to blame.

We confirm that an ordinary malloc()/free() pair does not exhibit this behavior. The Summit platform exhibited this same pattern of overheads, despite being an IBM POWER9 platform with 64 KB pages.

For certain allocations (write state slabs), we can skip the munmap() call, as it will not free any memory, because the underlying file handle must persist until the next time step. However, we find that doing so simply defers the unmapping cost to the program termination. Similarly, performing an unlink after the unmapping, rather than before, just shifts the time differential into the unlink.

These overheads may be similar to, or even smaller than, those of methods that stage checkpoints in a main-memory buffer or a burst buffer; a write to a burst buffer will likely be slower than 10 GB/s. However, our overheads are incurred for every time step (typically O(1000)), rather than a few times per simulation. A hybrid method that allows the simulation executable to run multiple time steps in memory, then switch to the ramdisk method just for the checkpoint step, may be superior, at the cost of increased code complexity.

We surmise that the shared memory system was designed to facilitate lightweight inter-process communication, not allocations of dozens of GB. Because our code requires a large amount of state relative to the time it takes to process it, it is suboptimal to use POSIX shared memory as the only way to pass information between time steps. However, the correctness enforced by the out-of-core model is a useful property. This model may be appropriate for codes with a smaller state or a higher compute density (ratio of compute work to state size).
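The following microbenchmark sketches the experiment behind Figure 1; it is an illustrative reconstruction under the assumptions noted in the comments, not the code used to produce the figure:

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

/* Illustrative reconstruction of the Figure 1 experiment: time
 * munmap() on a faulted-in tmpfs mapping, with the file name either
 * kept or unlinked before the unmap.  Size and path are assumptions. */
static double now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

int main(void) {
    const size_t bytes = 1UL << 30; /* 1 GiB mapping */
    const char *path = "/dev/shm/munmap_bench";

    for (int unlink_first = 0; unlink_first <= 1; unlink_first++) {
        int fd = open(path, O_RDWR | O_CREAT, 0644);
        if (fd < 0 || ftruncate(fd, bytes) != 0) { perror(path); return 1; }
        char *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
        close(fd);
        memset(p, 1, bytes); /* fault in every page */

        if (unlink_first)
            unlink(path); /* delete the name before unmapping */

        double t0 = now();
        munmap(p, bytes);
        double dt = now() - t0;
        printf("%-11s: munmap took %.3f s (%.1f GB/s)\n",
               unlink_first ? "with unlink" : "no unlink", dt,
               bytes / dt / 1e9);

        if (!unlink_first)
            unlink(path); /* clean up for the next iteration */
    }
    return 0;
}
```

On the platforms described above, we would expect the "with unlink" pass to run near the memset() rate, consistent with the page-scrubbing interpretation offered above.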
Fig. 1. In the POSIX shared memory checkpoint model, all persistent allocations are made with mmap() and freed with munmap(). mmap() is fast, but munmap() is noticeably slow, especially when the corresponding file handle has already been unlinked or deleted (dashed line). With checkpoints of 10s of GB, an unmap rate of 10 GB/s can be a bottleneck on the simulation performance. [Plot: munmap rate (GB/s, 0 to 50) versus allocation size (MB); curves "No unlink" and "With unlink".]

ACKNOWLEDGMENT

The authors would like to thank the co-developers of the ABACUS code: Marc Metchnik, Doug Ferrer, and Phil Pinto. Abacus development has been supported by NSF AST-1313285 and more recently by DOE-SC0013718, as well as by Simons Foundation funds and Harvard University startup funds. NM was supported as an NSF Graduate Research Fellow. The Summit simulations have been supported by OLCF projects AST135 and AST145, the latter through the Department of Energy ALCC program.
REFERENCES

[1] DESI Collaboration, A. Aghamousa, J. Aguilar, S. Ahlen, S. Alam, et al., "The DESI Experiment Part I: Science, Targeting, and Survey Design," arXiv e-prints, arXiv:1611.00036, Oct. 2016.
[2] shm_overview(7) - overview of POSIX shared memory, Linux Programmer's Manual.