Optimized Memoryless Fair-Share HPC Resources Scheduling using Transparent Checkpoint-Restart Preemption
Kfir Zvi, Gal Oren

Abstract — Common resource management methods in supercomputing systems usually include hard divisions, capping, and quota allotment. Those methods, despite their 'advantages', have some known serious disadvantages, including unoptimized utilization of an expensive facility; moreover, occasionally there is still a need to dynamically reschedule and reallocate the resources. Consequently, those methods amount to poor supply-and-demand management rather than a free-market playground that would eventually increase system utilization and productivity. In this work, we propose the new Optimized Memoryless Fair-Share HPC Resources Scheduling using Transparent Checkpoint-Restart Preemption, in which social welfare increases using a free-of-cost interchangeable proprietary possession scheme. Accordingly, we permanently keep the status quo with regard to the fairness of the resource distribution, while maximizing the ability of all users to obtain more CPUs and CPU hours for longer periods without any non-straightforward costs, penalties, or additional human intervention.
I. INTRODUCTION
Demand for High-Performance Computing (HPC) resources has been skyrocketing over the last decade [28]. Consequently, said resources became more available over the cloud, affordable (due to competition [10]), and immediately supplied [15], [29], [32]. Nevertheless, HPC on the cloud is limited [14], [20]: (1) Cloud hardware is general-purpose, not tailor-made like a supercomputer. Hence, it is insufficient for large-scale coupled computations; (2) The current commercial tariff paradigm makes it more expensive to run continuous computations on the cloud, compared with on-premise resources. Thus, many R&D institutions still establish in-house supercomputing facilities [9]. However, unlike cloud resources, which are dynamically assigned according to an economic model [17], supercomputing facilities are usually assigned by hard divisions or by utilization capping. These models reflect entitlements of entities (groups or individual users) [25] that originate in centrally-managed work-plans. An entity expects immediate access to its resources, similarly to reserved cloud instances; but, lacking an economic mechanism that would have regulated user requests, congestion is formed. To prevent group starvation, the congestion is usually mitigated by capping usage, static allocation, and rationing of CPU hours. Jobs are prioritized in queues, and low-priority jobs may starve, in full accordance with the users' expectations. These methods suffer from under-utilization of the expensive, divided facility, and from a frequent need to manually adjust the entitlements [35]. Therefore, there is a need for a scheduling mechanism that will alleviate the barriers that prevent full resource pooling, and yet maintain the entities' expectation of fairness: each entity gets at least its entitlement of CPUs when it has the suitable workload for it [18].

K. Zvi is with the Department of Computer Science, Ben-Gurion University of the Negev, Be'er-Sheva, Israel & Israel Atomic Energy Commission, P.O.B. 7061, Tel Aviv, Israel, [email protected]. G. Oren is with the Department of Computer Science, Technion – Israel Institute of Technology, Haifa 32000, Israel & the Scientific Computing Center, Nuclear Research Center-Negev, P.O.B. 9001, Be'er-Sheva, Israel, [email protected]. This work was supported by Pazy grant 226/20. Computational support was provided by the NegevHPC project [4]. The authors would like to thank Matan Rusanovsky, Harel Levin, Hagit Attiya and Danny Hendler for their fruitful comments.

II. THE OPTIMIZED MEMORYLESS FAIR-SHARE
The scheduling and resource allocation are usually performed using schedulers such as SLURM [6], [9], [38]. A recent development of the SLURM scheduler is the seamless incorporation of the Distributed MultiThreaded Checkpointing (DMTCP) [13] library, enabling it to transparently Checkpoint and Restart (C/R) a single-host, parallel, or distributed computation. It does so in user-space, with no modifications to user code or the operating system, supporting a variety of HPC languages and infrastructures, including MPI and OpenMP [31]. Following this recent capability, we propose the Optimized Memoryless Fair-Share HPC Resource Scheduling (OMFS) method by enabling transparent resource pooling. Each entity is still guaranteed its CPU entitlement, while not being limited to certain compute nodes via static allocation, and without computation capping. Instead, underutilized resources are assigned to entities with active demand.
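This pooling rule can be sketched in Python. The model below is a simplified illustration of the decision logic only (not the SLURM implementation); all function and parameter names are ours, and ties are broken slightly differently than in the full algorithm:

```python
import math

def entitled_cpus(percent: float, cpu_total: int) -> int:
    """A user's guaranteed CPU count: floor(percent / 100 * cpu_total)."""
    return math.floor(percent / 100 * cpu_total)

def can_run(job_cpus: int, user_running_cpus: int, user_percent: float,
            cpu_total: int, cpu_idle: int) -> str:
    """Decide how a preemptable job request is handled under memoryless
    fair-share: run immediately, run after evicting over-quota jobs,
    or stay queued."""
    entitlement = entitled_cpus(user_percent, cpu_total)
    if cpu_idle >= job_cpus:
        return "run"        # enough idle CPUs: run regardless of entitlement
    if job_cpus <= entitlement - user_running_cpus:
        return "evict"      # entitled: preempt (checkpoint) over-quota jobs
    return "queue"          # over entitlement and no idle CPUs: wait

# Example: a user entitled to 25% of a 1000-CPU system (250 CPUs).
assert entitled_cpus(25, 1000) == 250
assert can_run(100, 0, 25, 1000, cpu_idle=500) == "run"
assert can_run(150, 100, 25, 1000, cpu_idle=0) == "evict"
assert can_run(200, 100, 25, 1000, cpu_idle=0) == "queue"
```

The "evict" branch is what distinguishes OMFS from capping: an under-entitlement user reclaims CPUs by checkpointing jobs of over-entitlement users, rather than waiting for them to finish.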
Non-preemptible jobs can only be run up to the entity's entitlement. To utilize more resources, jobs must be either preemptable (may be killed) or checkpointable (can C/R using DMTCP or otherwise). Hence, if an entity does not utilize its full entitlement, it can always increase its utilization up to its entitlement by evicting jobs of entities utilizing more than their allotment. If these jobs are checkpointable, they are transparently checkpointed prior to the eviction, and later restarted.

In Algorithm 1 we describe a simplified algorithm, which is based on pre-existing priority queues. They can be governed by any prioritization policy, such as FIFO or priority-by-user (lines 5–6). The algorithm starts with system initialization, which occurs only once at startup (lines 1–9), and with a job initialization procedure which is run every time a new job is initialized (lines 10–13).

Algorithm 1 Optimized Memoryless Fair-Share HPC Resource Scheduling using Transparent Checkpoint-Restart Preemption

 1: procedure SYSTEMINIT(Users, Jobs_Submitted, Jobs_Running)  ⊲ Performed once at startup
 2:   CPU_Total ← N  ⊲ Amount of CPUs in the entire system
 3:   CPU_Idle ← N  ⊲ Amount of idle CPUs in the entire system
 4:   Users ← Users  ⊲ Predefined group of system users
 5:   Jobs_Submitted ← Jobs_Submitted  ⊲ A predefined priority queue
 6:   Jobs_Running ← Jobs_Running  ⊲ A predefined priority queue
 7:   foreach user u ∈ Users:
 8:     u.percent ← p : p ∈ [0, 100]  ⊲ Percentage of CPU allocation per user
 9:   assert Σ_{u ∈ Users}(u.percent) ≤ 100  ⊲ Upper bound of allocations' percentages
10: procedure JOBINIT(User u, Job j)  ⊲ Performed for every job creation
11:   j.priority ← pr : pr ∈ ℕ  ⊲ Priority among the jobs of the user only
12:   j.CPU_Count ← cpus : cpus ∈ {x ∈ ℕ : 0 ≤ x ≤ N}  ⊲ Amount of requested job's CPUs
13:   j.user ← u
14: procedure MEMORYLESSFAIRSHARESCHEDULER()
15:   while True:  ⊲ Continuously try to run jobs from Jobs_Submitted
16:     J ← Jobs_Submitted.dequeue()
17:     MEMORYLESSFAIRSHARERUNNER(J)
18: procedure MEMORYLESSFAIRSHARERUNNER(Job J)
19:   User_PAbleJobsCPUCount ← Σ_{j ∈ {j ∈ Jobs_Running : j.user = J.user and j is preemptable}}(j.CPU_Count)  ⊲ Amount of CPUs occupied by user's preemptable jobs
20:   User_NonPAbleJobsCPUCount ← Σ_{j ∈ {j ∈ Jobs_Running : j.user = J.user and j is not preemptable}}(j.CPU_Count)  ⊲ Amount of CPUs occupied by user's non-preemptable jobs
21:   User_TotalJobsCPUCount ← User_PAbleJobsCPUCount + User_NonPAbleJobsCPUCount
22:   User_EntitledCPUCount ← ⌊(J.user.percent/100) · CPU_Total⌋
23:   if J is not preemptable and User_NonPAbleJobsCPUCount + J.CPU_Count ≥ User_EntitledCPUCount:  ⊲ If the non-preemptable jobs exceed the user's entitlement
24:     Jobs_Submitted.enqueue(J)  ⊲ Do not allow to run
25:     return
26:   elif CPU_Idle > J.CPU_Count:  ⊲ If enough resources are available
27:     goto 37  ⊲ Allow to run anyway
28:   elif J.CPU_Count > User_EntitledCPUCount − User_TotalJobsCPUCount:  ⊲ Check job's CPU request fits within entitlement
29:     Jobs_Submitted.enqueue(J)  ⊲ No fit, the job remains in Jobs_Submitted
30:     return
31:   else:  ⊲ User is entitled to run J, so make resources available
32:     while CPU_Idle < J.CPU_Count:  ⊲ Until reaching the entitlements
33:       Job_checkpoint ← Jobs_Running.dequeue()  ⊲ Checkpoint the least prioritized running jobs
34:       if Job_checkpoint is checkpointable:  ⊲ If it is not checkpointable, drop it
35:         Jobs_Submitted.enqueue(Job_checkpoint)  ⊲ Checkpointed job goes back to Jobs_Submitted
36:       CPU_Idle ← CPU_Idle + Job_checkpoint.CPU_Count  ⊲ Update amount of idle CPUs in the system
37:   Jobs_Running.enqueue(J)  ⊲ J is eligible to run. Schedule it
38:   CPU_Idle ← CPU_Idle − J.CPU_Count  ⊲ Update amount of idle CPUs in the system

The MEMORYLESS FAIR-SHARE SCHEDULER in lines 14–17 iterates over the queue, trying to execute submitted jobs using the main algorithm of the MEMORYLESS FAIR-SHARE RUNNER (lines 18–38). The MEMORYLESS FAIR-SHARE RUNNER may run the job and may also preempt other jobs, according to the fairness policy described above. This preemption method lets underutilized entitlements be temporarily used by other entities, and thus improves the utilization over a capping-based system. Furthermore, an entity can use it to run a single job that is larger than its whole entitlement, without any manual intervention.

Recurrent C/R preemption operations may lead to thrashing. To resolve this problem we take two approaches. First, each job is allowed a quantum: a minimal running duration (e.g., 30 minutes). The running jobs queue demotes jobs that have been running uninterruptedly for at least a quantum. Enlarging the quantum lowers the frequency of C/R operations at the price of non-instantaneous execution of accredited jobs. Second, we utilize current developments in distributed file systems [11], [37] over non-volatile memory [23] to reduce the cost of C/R operations: these memory modules are connected to the CPU via the memory bus, and can be accessed faster than standard storage. We intend to use Persistent Memory File Systems [27] (e.g., SplitFS [24], NOVA-Fortis [36], or Assise [12]) over the Intel Optane™ DataCenter Persistent Memory (DCPMM) module [23], so that DMTCP will be able to use it transparently. A better approach to resolve this problem is to perform source code modifications to DMTCP, to support Direct Access (DAX) to persistent memory via libraries such as PMDK [33]. DAX allows accessing the persistent memory directly from the application. This way, we can utilize the DCPMM to the maximum extent to reduce thrashing cost, as it will serve both as a fast non-volatile storage (an essential property for checkpointing) and as a memory from which the restart operation could directly start.

III. RELATED WORK
A. Technical scheduler and transparent C/R
We follow Rodriguez et al. [31] in basing our work on SLURM because it is the only scheduler which supports C/R in the form of an API. DMTCP is a system-level C/R library which allows performing C/R operations without any source code modifications; it supports SysV extensions such as System V shared memory [2], which many MPI implementations employ; and it supports InfiniBand [16]. This is in contrast to BLCR [21], which requires re-compilation and possibly re-tuning of the kernel module and supports neither SysV extensions nor InfiniBand, and to CRIU [1], which currently does not even have support for parallel or distributed applications. Furthermore, Rodriguez et al. [31] provided two sample scheduling algorithms based on their work. These algorithms do not share the same goal as ours – which is mainly to increase system utilization – as one is used to reduce power consumption by scheduling jobs on as few nodes as possible, and the other is used to schedule jobs with higher priority on better hardware.
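For illustration, DMTCP's user-space workflow looks roughly as follows. The command names are DMTCP's own; the application name and checkpoint interval are placeholders, and exact options should be checked against the installed DMTCP version:

```shell
# Start the application under DMTCP control, checkpointing every 300 s;
# no recompilation or kernel module is required.
dmtcp_launch --interval 300 ./my_mpi_app

# Alternatively, request a checkpoint on demand (e.g., just before eviction).
dmtcp_command --checkpoint

# Later, restart transparently from the generated checkpoint images.
dmtcp_restart ckpt_*.dmtcp
```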
B. Scheduling methods in SLURM
Currently, SLURM supports two main scheduling methods: sched/builtin, which is a simple FCFS algorithm, and sched/backfill, which is an extension to the basic FCFS algorithm that enables scheduling of jobs that are not at the head of the queue, as long as they do not cause a delay in the scheduling of reserved jobs (jobs whose starting time has already been determined). Both FCFS and backfill have been studied thoroughly, and while being relatively simple, they are used extensively in current supercomputers' job scheduling systems. The main reason is that practical limitations prevent the use of other scheduling algorithms [30]. In particular, FCFS-backfill relies on users' estimates of jobs' run-time, which have been proven to be highly inaccurate [19], [26], [30]. This evidently leads to very poor efficiency for scheduling systems [30]. Niu et al. [30] have also shown an example of how checkpointing along with preemptive scheduling can increase the performance of the backfill algorithm, and in this work we follow this approach in practice. To the authors' knowledge, aside from [31], this has never been done before.
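The backfill condition can be illustrated with a minimal Python model. This is an illustrative sketch only (SLURM's sched/backfill is considerably more involved, and the function name is ours): a waiting job may jump ahead if it fits in the currently free CPUs and, by the user's runtime estimate, finishes before the reserved head-of-queue job is due to start:

```python
def can_backfill(job_est_runtime: float, now: float,
                 reservation_start: float, free_cpus: int,
                 job_cpus: int) -> bool:
    """A job may be backfilled if it fits in the currently free CPUs
    and its *estimated* end time does not delay the reserved job."""
    fits = job_cpus <= free_cpus
    ends_in_time = now + job_est_runtime <= reservation_start
    return fits and ends_in_time

# The head job is reserved to start at t=100; 64 CPUs are free at t=0.
assert can_backfill(90, 0, 100, free_cpus=64, job_cpus=32)       # fills the hole
assert not can_backfill(120, 0, 100, free_cpus=64, job_cpus=32)  # would delay
assert not can_backfill(50, 0, 100, free_cpus=64, job_cpus=128)  # too wide
```

The second assertion shows why inaccurate user estimates hurt backfill: a job whose true runtime exceeds its estimate either delays the reservation or must be killed, which is exactly where checkpoint-based preemption helps.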
C. Fairness
Among others, fairness is a common attribute of a job scheduling system in HPC, and it is of utmost importance for HPC users, maybe even more than the productivity of the HPC system [39]. The concept of fairness is well established [22]. As fairness and performance are in conflict in parallel and distributed job scheduling, previous works either focused on providing fair scheduling or on improving performance [39]. A stronger notion of fairness, which we did not implement, is strict fairness, where no job is delayed by any other job of lower priority. Our scheduler implements memoryless fairness, as opposed to history-based fair-share. Schedulers which use the latter concept re-prioritize jobs according to previous job metrics, with the purpose of reaching a designated level of these metrics (e.g., mean usage or wait time) over a time frame, such that it reflects their entitlements.
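A hedged sketch of such a history-based metric (the exact formula SLURM uses differs in detail; see its fair-share documentation): past usage is decayed exponentially with a half-life, so recent consumption weighs more:

```python
def decayed_usage(usage: float, delta_t: float, half_life: float,
                  new_usage: float = 0.0) -> float:
    """Exponentially decay the historical usage so that recent
    consumption weighs more, then add the usage accrued in the
    last interval at full weight."""
    decay = 0.5 ** (delta_t / half_life)
    return usage * decay + new_usage

# After one half-life, past usage counts half as much.
u = decayed_usage(1000.0, delta_t=7.0, half_life=7.0)
assert abs(u - 500.0) < 1e-9
# Fresh usage is added undecayed.
u = decayed_usage(1000.0, 7.0, 7.0, new_usage=200.0)
assert abs(u - 700.0) < 1e-9
```

A memoryless scheduler keeps no such state: only the CPUs a user occupies *right now* are compared against the entitlement, which is what makes its decisions easy to predict.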
History-based fair-share is less predictable and transparent than memoryless fair-share. As implemented in popular schedulers [3], [7], history-based fair-share is only sensitive to jobs that have been completed and logged by the rolling fair-share calculation. If a group submits a bunch of jobs quickly, they could potentially fill the entire system with their jobs [34]. Furthermore, since SLURM's implementation of fair-share [7], [8] uses a decay factor, it is hard to predict when exactly a submitted job will be scheduled.

IV. CONCLUSION AND FUTURE WORK
In this work we presented the Optimized Memoryless Fair-Share HPC Resources Scheduling using Transparent Checkpoint-Restart Preemption (OMFS). We permanently keep the status quo regarding the fairness of the resource distribution – which is crucial to the users' work-fashion – while maximizing the ability of all users to achieve more CPUs and CPU hours for longer periods of time without any non-straightforward costs, penalties or additional human intervention.

In accordance with the notions presented in this paper, we intend to test our scheduler on the NegevHPC [4] semi-supercomputer cluster. The first step will be to use SLURM's API via PySlurm [5] to build our scheduler, and to evaluate users' behaviour. Next, we plan to build a dedicated SLURM plugin, based on the work of Rodriguez et al. [31], and to apply our solution to the thrashing problem using non-volatile memory.

SLURM's fair-share design includes an accounts hierarchy, in which each account is given a normalized relative share of its parent account. For each account, SLURM keeps track of its actual usage, and adjusts its effective usage accordingly so it represents the account's shares according to the fair-share policy. Both actual usage and effective usage are affected by a predefined decay factor, which gives higher weight to more recent usage of the resources made by each account.
REFERENCES
[1] CRIU, a project to implement checkpoint/restore functionality for Linux. https://criu.org. [Online].
[2] DMTCP (FAQ): Distributed MultiThreaded CheckPointing. http://dmtcp.sourceforge.net/FAQ.html. [Retrieved January 18, 2021. Online].
[3] Moab Adaptive Computing Suite Administrator's Guide - v. 5.4. http://docs.adaptivecomputing.com/macs/6.3fairshare.php. [Retrieved January 18, 2021. Online].
[4] NegevHPC Project. [Online].
[5] PySlurm: Slurm Interface for Python. https://github.com/PySlurm/pyslurm. [Online].
[6] SLURM. https://slurm.schedmd.com. [Online].
[7] Slurm Priority, Fairshare and Fair Tree. https://slurm.schedmd.com/SLUG19/Priority_and_Fair_Trees.pdf. [Online].
[8] Slurm's classic fairshare algorithm. slurm.schedmd.com/classic_fair_share.html. [Online].
[9] TOP500 List, June 2020. [Online].
[10] Orna Agmon Ben-Yehuda, Muli Ben-Yehuda, Assaf Schuster, and Dan Tsafrir. The rise of RaaS: The resource-as-a-service cloud. Commun. ACM, 57(7):76–84, July 2014.
[11] Thomas E. Anderson, Marco Canini, Jongyul Kim, Dejan Kostić, Youngjin Kwon, Simon Peter, Waleed Reda, Henry N. Schuh, and Emmett Witchel. Assise: Performance and availability via NVM colocation in a distributed file system. arXiv preprint arXiv:1910.05106, 2019.
[12] Thomas E. Anderson, Marco Canini, Jongyul Kim, Dejan Kostić, Youngjin Kwon, Simon Peter, Waleed Reda, Henry N. Schuh, and Emmett Witchel. Assise: Performance and availability via client-local NVM in a distributed file system. In OSDI '20, pages 1011–1027. USENIX Association, November 2020.
[13] Jason Ansel, Kapil Arya, and Gene Cooperman. DMTCP: Transparent checkpointing for cluster computations and the desktop. In IPDPS, pages 1–12, Rome, Italy, 2009. IEEE.
[14] Alexander Breuer, Yifeng Cui, and Alexander Heinecke. Petaflop seismic simulations in the public cloud. In International Conference on High Performance Computing, pages 167–185. Springer, 2019.
[15] Rajkumar Buyya, Satish Narayana Srirama, Giuliano Casale, Rodrigo Calheiros, Yogesh Simmhan, Blesson Varghese, Erol Gelenbe, Bahman Javadi, Luis Miguel Vaquero, Marco A. S. Netto, et al. A manifesto for future generation cloud computing: Research directions for the next decade. ACM Computing Surveys (CSUR), 51(5):1–38, 2018.
[16] Jiajun Cao, Kapil Arya, and Gene Cooperman. Transparent Checkpoint-Restart over InfiniBand. HPDC 2014 - Proceedings of the 23rd International Symposium on High-Performance Parallel and Distributed Computing, December 2013.
[17] Delong Cui, Zhiping Peng, Qirui Li, Jieguang He, Lizi Zheng, and Yiheng Yuan. A survey on cloud workflow collaborative adaptive scheduling. In Advances in Computer, Communication and Computational Sciences, pages 121–129. Springer, 2020.
[18] Danny Dolev, Dror Feitelson, Joseph Halpern, Raz Kupferman, and Nati Linial. No Justified Complaints: On Fair Sharing of Multiple Resources. ITCS 2012 - Innovations in Theoretical Computer Science Conference, June 2011.
[19] D. G. Feitelson and A. M. Weil. Utilization and predictability in scheduling the IBM SP2 with backfilling. In Proceedings of the First Merged International Parallel Processing Symposium and Symposium on Parallel and Distributed Processing, pages 542–546, 1998.
[20] Marco Ferretti and Luigi Santangelo. Cloud vs On-Premise HPC: a model for comprehensive cost assessment. 2019.
[21] Paul H. Hargrove and Jason C. Duell. Berkeley Lab Checkpoint/Restart (BLCR) for Linux clusters. In Journal of Physics: Conference Series, volume 46, page 494, 2006.
[22] G. J. Henry. The UNIX System: The Fair Share Scheduler. AT&T Bell Laboratories Technical Journal, 63(8):1845–1857, 1984.
[23] Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, Amirsaman Memaripour, Yun Joon Soh, Zixuan Wang, Yi Xu, Subramanya R. Dulloor, et al. Basic performance measurements of the Intel Optane DC Persistent Memory Module. arXiv preprint arXiv:1903.05714, 2019.
[24] Rohan Kadekodi, Se Kwon Lee, Sanidhya Kashyap, Taesoo Kim, Aasheesh Kolli, and Vijay Chidambaram. SplitFS: Reducing software overhead in file systems for persistent memory. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, SOSP '19, pages 494–508, New York, NY, USA, 2019. Association for Computing Machinery.
[25] Lett Yi Kyaw and Sabai Phyu. Scheduling methods in HPC system. Pages 1–6. IEEE, 2020.
[26] Cynthia Bailey Lee, Yael Schwartzman, Jennifer Hardy, and Allan Snavely. Are user runtime estimates inherently inaccurate? In Workshop on Job Scheduling Strategies for Parallel Processing, pages 253–263. Springer, 2004.
[27] Haikun Liu, Di Chen, Hai Jin, Xiaofei Liao, Bingsheng He, Kan Hu, and Yu Zhang. A survey of non-volatile main memory technologies: State-of-the-arts, practices, and future directions. arXiv preprint arXiv:2010.04406, 2020.
[28] Hans Werner Meuer, Erich Strohmaier, Jack Dongarra, and Horst D. Simon. The TOP500: History, trends, and future directions in high performance computing. 2014.
[29] Marco A. S. Netto, Rodrigo N. Calheiros, Eduardo R. Rodrigues, Renato L. F. Cunha, and Rajkumar Buyya. HPC cloud for scientific and business applications: Taxonomy, vision, and research challenges. ACM Computing Surveys (CSUR), 51(1):1–29, 2018.
[30] Shuangcheng Niu, Jidong Zhai, Xiaosong Ma, Mingliang Liu, Y. Zhai, W. Chen, and W. Zheng. Employing checkpoint to improve job scheduling in large-scale systems. In JSSPP, 2012.
[31] Manuel Rodríguez-Pascual, Jiajun Cao, José A. Moríñigo, Gene Cooperman, and Rafael Mayo-García. Job migration in HPC clusters by means of checkpoint/restart. The Journal of Supercomputing, 75(10):6517–6541, 2019.
[32] Iman Sadooghi, Jesús Hernández Martin, Tonglin Li, Kevin Brandstatter, Ketan Maheshwari, Tiago Pais Pitta de Lacerda Ruivo, Gabriele Garzoglio, Steven Timm, Yong Zhao, and Ioan Raicu. Understanding the performance and potential of cloud computing for scientific applications. IEEE Transactions on Cloud Computing, 5(2):358–371, 2015.
[33] Steve Scargall. Introducing the Persistent Memory Development Kit. In Programming Persistent Memory, pages 63–72. Springer, 2020.
[34] Craig P. Steffen. A better way of scheduling jobs on HPC systems: Simultaneous fair-share. 2019.
[35] Wilmer Uruchi Ticona. Game theoretic analysis of the SLURM scheduler model. 2020.
[36] Jian Xu, Lu Zhang, Amirsaman Memaripour, Akshatha Gangadharaiah, Amit Borase, Tamires Brito Da Silva, Steven Swanson, and Andy Rudoff. NOVA-Fortis: A fault-tolerant non-volatile main memory file system. In Proceedings of the 26th Symposium on Operating Systems Principles, SOSP '17, pages 478–496, New York, NY, USA, 2017. Association for Computing Machinery.
[37] Jian Yang, Joseph Izraelevitz, and Steven Swanson. Orion: A distributed file system for non-volatile main memory and RDMA-capable networks. In USENIX Conference on File and Storage Technologies (FAST '19), pages 221–234, 2019.
[38] Andy B. Yoo, Morris A. Jette, and Mark Grondona. SLURM: Simple Linux Utility for Resource Management. In Workshop on Job Scheduling Strategies for Parallel Processing, pages 44–60. Springer, 2003.
[39] Yulai Yuan, Guangwen Yang, Yongwei Wu, and Weimin Zheng. PV-EASY: a strict fairness guaranteed and prediction enabled scheduler in parallel job scheduling. In HPDC '10, 2010.