Deep Reinforcement Agent for Scheduling in HPC

Yuping Fan, Zhiling Lan
Illinois Institute of Technology, Chicago, IL
[email protected], [email protected]

Taylor Childers, Paul Rich, William Allcock
Argonne National Laboratory, Lemont, IL
{jchilders,richp,allcock}@anl.gov

Michael E. Papka
Argonne National Laboratory / Northern Illinois University
[email protected]
Abstract—The cluster scheduler is crucial in high-performance computing (HPC). It determines when and which user jobs should be allocated to available system resources. Existing cluster scheduling heuristics are developed by human experts based on their experience with specific HPC systems and workloads. However, the increasing complexity of computing systems and the highly dynamic nature of application workloads have placed a tremendous burden on manually designed and tuned scheduling heuristics. More aggressive optimization and automation are needed for cluster scheduling in HPC. In this work, we present an automated HPC scheduling agent named DRAS (Deep Reinforcement Agent for Scheduling) by leveraging deep reinforcement learning. DRAS is built on a novel, hierarchical neural network incorporating special HPC scheduling features such as resource reservation and backfilling. A unique training strategy is presented to enable DRAS to rapidly learn the target environment. Once provided a specific scheduling objective by the system manager, DRAS automatically learns to improve its policy through interaction with the scheduling environment and dynamically adjusts its policy as workload changes. The experiments with different production workloads demonstrate that DRAS outperforms the existing heuristic and optimization approaches by up to 45%.
Index Terms—cluster scheduling, high-performance computing, deep reinforcement learning, job starvation, backfilling, resource reservation
I. INTRODUCTION
The cluster scheduler plays a critical role in high-performance computing (HPC). It enforces site policies by deciding when and which user jobs are allocated to system resources. Common scheduling goals include high system utilization, good user satisfaction, and job prioritization.
Heuristics are the prevailing approaches in HPC cluster scheduling. For example, first come, first served (FCFS) with EASY backfilling is a well-known scheduling policy deployed on production HPC systems [1]. Bin packing is another well-known heuristic approach aiming for high utilization. Heuristics are easy to implement and fast, trading optimality for speed. In addition, optimization is also extensively studied in the literature for cluster scheduling [2]–[6]. Optimization methods focus on optimizing immediate scheduling objective(s) without regard to long-term performance. Moreover, both heuristics and optimization approaches are static, and neither of them is capable of adapting its scheduling policy to dynamic changes in the environment. In case of sudden variation in workloads, system administrators have to manually tune the algorithms and parameters in their policies to mitigate performance degradation. As HPC systems become increasingly complex, combined with highly diverse application workloads, such a manual process is becoming challenging, time-consuming, and error-prone. We believe that more aggressive optimization and automation, beyond the existing heuristics and optimization methods, is essential for HPC cluster scheduling.

In recent years, reinforcement learning (RL) combined with deep neural networks has been successfully employed in various fields for dynamic decision making, such as self-driving cars [9], autonomous robots [10], and game playing [11], [12]. Reinforcement learning refers to an area of machine learning that automatically learns to maximize cumulative reward through interaction with the environment [13]. Mao et al. present an RL-driven scheduling design named Decima for data processing jobs with dependent tasks [8]. While Decima has shown promising results for scheduling, it is not applicable to cluster scheduling in HPC (detailed in §II-A).

Inspired by the above RL-driven studies, we present an automated HPC scheduling agent named DRAS (Deep Reinforcement Agent for Scheduling) tailored for HPC workloads. The goal is twofold: (1) to improve HPC scheduling performance beyond the existing approaches, and (2) to automatically adjust scheduling policies in case of workload changes. Unlike cloud scheduling, HPC scheduling has several salient features, especially resource reservation to prevent job starvation and backfilling to reduce resource fragmentation. In the design of DRAS, we incorporate both features into the formulation of deep reinforcement learning and introduce a hierarchical neural network structure, where the level-1 network selects jobs for immediate or reserved execution and the level-2 network concentrates on choosing proper backfilled jobs for more scheduling optimization. In order to optimize and automate the process, all the scheduling decisions including immediate job selection, job reservation, and backfilling are made by DRAS without human involvement. Moreover, we develop a three-phase training process using historical job logs. Our training strategy allows DRAS to gradually explore from simple average situations to more challenging rare situations, hence leading to a fast and converged model.

We evaluate DRAS by extensive trace-based simulations with job traces collected from two production supercomputers representing capability computing and capacity computing. The results indicate DRAS is capable of automatically learning to improve its policy through interaction with the scheduling environment and dynamically adjusting its policy as workload changes. Specifically, this paper makes three major contributions:
1) We design a new scheduling agent DRAS which leverages the advances in deep reinforcement learning and incorporates the key features of HPC scheduling in the form of a hierarchical neural network model.
2) We develop a three-phase training process which allows DRAS to automatically learn the scheduling environment (i.e., the system and its workloads) and to rapidly converge to an optimal policy.
3) Our trace-based experiments demonstrate DRAS outperforms a number of scheduling methods by up to 45%. Compared to the heuristic and optimization approaches, DRAS offers two benefits: better long-term scheduling performance and adaptation to dynamic workload changes without human intervention.

TABLE I: Comparison of cluster scheduling methods.

Features \ Methods                FCFS [1]   BinPacking [7]        Optimization [2]–[4]   Decima [8]     DRAS
Adaptation to workload changes    ✗          ✗                     ✗                      ✓              ✓
Automatic policy tuning           ✗          ✗                     ✗                      ✓              ✓
Long-term scheduling performance  ✗          ✗                     ✗                      ✓              ✓
Starvation avoidance              ✓          ✗                     ✗                      ✗              ✓
Require training                  ✗          ✗                     ✗                      ✓              ✓
Implementation effort             Easy       Easy                  Medium                 Hard           Hard
Key objective                     Fairness   Resource utilization  Customizable           Customizable   Customizable

II. BACKGROUND AND CHALLENGES
A. Cluster Scheduling in HPC
HPC job scheduling, also known as batch scheduling, is responsible for assigning jobs to resources (e.g., compute nodes) according to site policies and resource availability [1], [14], [15]. Well-known schedulers include Slurm, Moab/TORQUE, PBS, and Cobalt [16]–[19]. Let's consider a cluster with $N$ nodes. Users submit their jobs to the system through the scheduler. When submitting a job, a user is required to provide the job size $n_i$ (i.e., the number of compute nodes needed for the job) and a job runtime estimate $t_i$ (i.e., the estimated time needed for the job). Typical HPC jobs are rigid, meaning the job size is fixed throughout execution. The job runtime estimate is an upper bound for the job, such that the job will be killed by the scheduler if its actual runtime exceeds this estimate [20], [21]. At each scheduling instance, the scheduler orders the jobs in the queue according to the site policy and executes jobs from the head of the queue.

Existing HPC scheduling policies can be broadly classified into two groups: heuristics and optimization methods. First Come First Serve (FCFS) with EASY backfilling is the most widely used heuristic, which sorts the jobs in the wait queue according to their arrival times and executes jobs from the head of the queue. If the available resources are not sufficient for the first job in the queue, the scheduler will reserve the resources for this job. Backfilling is often used in conjunction with reservation to enhance system utilization. It allows subsequent jobs in the wait queue to move ahead under the condition that they do not delay the existing reservations [1]. Optimization methods select a set of jobs from the queue with an objective to optimize certain scheduling metrics, such as minimizing average job wait time and maximizing system utilization [2]–[6].

Several recent studies have explored reinforcement learning for cluster scheduling. DeepRM [22] is the first work demonstrating the potential of using reinforcement learning for learning customized scheduling policies from experience. Unfortunately, DeepRM's state representation cannot handle realistic cluster workloads with continuous job arrivals. Unlike DeepRM, RLScheduler [23] attempts to develop a general reinforcement learning model that is trained with one system log and then used on other systems with different characteristics (e.g., system size, workload patterns, etc.). While such a generic model is appealing, RLScheduler might lead to less satisfactory scheduling performance than heuristic methods. The work most closely related to ours is Decima, which explores reinforcement learning to allocate data processing jobs. Each job consists of dependent tasks and is represented as a directed acyclic graph (DAG). Decima integrates a graph neural network to extract job DAGs and cluster status as embedding vectors. It then feeds the embedding vectors to a policy gradient network for decision making. The decision consists of two parts: selecting tasks for immediate execution and determining task parallelism. Unfortunately, Decima is not applicable to HPC scheduling. First, Decima assumes all jobs can be decomposed into malleable tasks, whereas HPC is dominated by rigid jobs that cannot be decomposed. Second, Decima can cause serious job starvation due to the lack of resource reservation support (Figure 7). In short, Table I summarizes and compares existing cluster scheduling methods, along with their features.
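To make the FCFS-with-EASY-backfilling mechanism above concrete, here is a minimal Python sketch of one scheduling pass. The Job fields, helper arguments, and the simplified shadow-time computation are our illustrative assumptions, not code from any production scheduler; in particular, this conservative variant only backfills jobs that finish (by their estimates) before the head job's reservation.

```python
# Minimal sketch of FCFS with EASY backfilling (illustrative, simplified).
from dataclasses import dataclass

@dataclass
class Job:
    size: int           # n_i: number of nodes requested
    runtime_est: float  # t_i: user-supplied runtime estimate (upper bound)
    arrival: float      # submission time

def fcfs_easy(queue, free_nodes, busy_node_free_times, now):
    """Return the jobs to start now; queue is processed in arrival (FCFS) order."""
    to_start = []
    queue = sorted(queue, key=lambda j: j.arrival)
    # Start jobs from the head of the queue while they fit.
    while queue and queue[0].size <= free_nodes:
        job = queue.pop(0)
        to_start.append(job)
        free_nodes -= job.size
    if not queue:
        return to_start
    # Head job does not fit: reserve it for the earliest time enough
    # busy nodes are estimated to free up (the "shadow" time).
    head = queue.pop(0)
    shadow = sorted(busy_node_free_times)[head.size - free_nodes - 1]
    # EASY backfilling (first fit): start any later job that fits now and,
    # per its runtime estimate, ends before the reservation starts.
    for job in list(queue):
        if job.size <= free_nodes and now + job.runtime_est <= shadow:
            queue.remove(job)
            to_start.append(job)
            free_nodes -= job.size
    return to_start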
B. Overview of Reinforcement Learning
Reinforcement learning (RL) is a type of machine learning technique that studies how agents situated in stochastic environments can learn optimal policies through interaction with their environment [24]. The agent's environment is described by an abstraction called a Markov Decision Process (MDP) with four basic components: state space $S$, action space $A$, reward $R$, and state transition probability $P$. In Markov decision processes, a learning agent interacts with a dynamic environment in discrete timesteps. At each timestep $t$, the agent observes the state $s_t \in S$ and takes an action $a_t \in A(s_t)$. Upon taking the action, the environment transitions to a new state $s_{t+1}$ with transition probability $P(s_{t+1} \mid s_t, a_t)$ and provides a reward $r_t$ to the agent as feedback on the action. The process continues until the agent reaches a terminal state. The goal of the agent is to find a policy $\pi(s)$, mapping a state to an action (deterministic) or to a probability distribution over actions (stochastic), which maximizes the long-term (discounted) cumulative reward $\sum_{t'=t}^{T} \gamma^{t'} r_{t'}$. The discount factor $\gamma$ is between 0 and 1. The smaller $\gamma$ is, the less important future rewards are.

In practice, the state and action space is often too large to be stored in a lookup table. It is common to use function approximators with a manageable number of adjustable parameters $\theta$ to represent the components of agents. Using a deep neural network with reinforcement learning is often called deep reinforcement learning [25]. The high representational power of deep neural networks enables reinforcement learning to solve complex decision-making problems, such as playing Atari and Go games [11], [12].

Policy gradient and Q-learning are the most popular RL algorithms [26], [13]. Policy gradient methods directly parameterize the policy $\pi_\theta(s)$ and optimize the parameters $\theta$ in the neural network by gradient descent. In Q-learning algorithms, an agent chooses an action at a given state that maximizes the Q-value, i.e., the cumulative reward over all successive steps. A Q-table is a lookup table containing the Q-value for all state-action pairs. To address an overwhelming number of state-action pairs, neural networks are often used to approximate the Q-table, and these methods are generally called deep Q-learning (DQL). DQL learns by approximating the optimal action-value function $Q^*_\theta(s, a)$. Policy gradient methods are generally believed to be applicable to a wider range of problems and to converge faster, but tend to converge to a local optimum. On the other hand, Q-learning methods are more difficult to converge, but once they converge, they tend to have more stable performance than policy gradient methods [27].
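To make the return concrete, the short sketch below computes the discounted cumulative reward that the agent maximizes, using the standard $\gamma^{t'-t}$ discounting; the function name and example values are ours.

```python
def discounted_return(rewards, gamma=0.99):
    """G_t = sum over t' >= t of gamma^(t'-t) * r_t', computed backwards."""
    G = 0.0
    returns = []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    return list(reversed(returns))

# Example: three timesteps of reward with gamma = 0.9.
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # [2.62, 1.8, 2.0]
```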
C. Technical Challenges

Designing deep-reinforcement-learning-driven cluster scheduling for HPC is challenging. Several key obstacles are listed below.
Avoidance of job starvation.
HPC jobs have drastically different characteristics: user jobs may range from a single-node job to a whole-system job, and job runtimes may vary from seconds to hours or even days. This feature presents a unique challenge to HPC systems: jobs, especially large-sized jobs, tend to starve if small-sized jobs keep arriving and skipping over large jobs due to insufficient available resources. Simply applying existing RL-based scheduling methods can lead to severe job starvation. We have tested a state-of-the-art policy gradient method with a real workload trace. Our results show that large jobs, e.g., 4K-node jobs, were held in the queue for 170 days. Typically, large jobs have high priority at HPC sites, especially capability computing facilities. Such long wait times discourage users from submitting large jobs.
Incorporation of backfilling.
Backfilling is a key strategy to reduce resource fragmentation in HPC. Currently, the well-known EASY backfilling strategy uses the simple first-fit method to select jobs for backfilling, i.e., choosing the first job which can fit in the backfill hole. We argue that, similar to the selection of jobs for scheduling, the selection of jobs for backfilling has many possible options, hence offering the potential for more aggressive optimization.
Scalable state and action representation.
To transform a scheduling problem into a reinforcement learning problem, we must first capture the dynamic environment, e.g., the status of thousands of nodes and hundreds of waiting jobs, in a state vector as an input to the neural network. Additionally, it is vitally important to map the extremely large action space to an output of the neural network of manageable size. The action space grows exponentially with the number of jobs in the queue. Working directly with a large action space can be computationally demanding.
Effective agent training.
An RL agent learns to improve its policy by experiencing diverse situations. An effective training process should be capable of efficiently and rapidly building a converged model based on sample data in order to make decisions without being explicitly programmed to do so. It is also challenging to select training data that reliably cover as much of the state space as possible and generalize to new or unseen situations.

III. DESIGN OF DRAS

Now we present DRAS, a new scheduling method tailored for HPC workloads and empowered by deep reinforcement learning. DRAS, illustrated in Figure 1, represents the scheduler as an agent that decides when and which jobs should be allocated to compute nodes, with the objective of optimizing scheduling performance. At a given scheduling instance $t$, the agent first encodes the job queue and system state into a vector $s_t$, and passes the vector to the neural network (§III-A). Next, DRAS uses a hierarchical neural network for decision making (§III-B). The agent takes an action by selecting jobs from the wait queue according to the output of the neural network and then receives a reward signal from the environment. The goal of DRAS is to choose actions (i.e., to select jobs) over time so as to maximize the cumulative reward. DRAS trains its neural network through simulation with massive datasets composed of both real and synthetic workload traces (§III-C). Once the model has converged, we deploy the DRAS agents into operation. The DRAS agents automatically adjust their neural network parameters during operation to handle workload changes.

A. State, Reward and Action Representation
The DRAS agent receives three observations from the environment: (1) the job wait queue, (2) cluster node status, and (3) the reward, a scalar indicating the quality of the action.
State.
We encode each waiting job as a vector of [2, 2] containing four pieces of information: job size, job estimated runtime, priority (1 means high priority; 0 means low priority), and job queued time (time elapsed since submission). We encode each node as a vector of [1, 2] with two pieces of information. The first cell is a binary representing node availability (1 means available; 0 means not available). If the node is occupied, we use the user-supplied runtime estimate and job start time to calculate the node's estimated available time. The second cell represents the time difference between the node's estimated available time and the current time. If the node is available, we set the second cell to zero. We concatenate job information and node information into a fixed-size vector as the input to the neural network.

Fig. 1: DRAS overview. The agent (at the bottom) represents the scheduler; the environment (at the top) comprises the rest of the system, including the job wait queue and the HPC cluster. The DRAS agent first observes the environment state, including job state and system state, and encodes the state into a vector. The agent's neural network takes the vector as input and outputs a scheduling action. The environment executes the action and provides a reward indicating the quality of the action. The agent uses the reward to improve its policy automatically.
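As a concrete illustration of this encoding, the sketch below builds the fixed-size state matrix. How the four job fields are laid out across the two rows of each [2, 2] job block is our assumption, as are the attribute names.

```python
import numpy as np

def encode_state(window_jobs, nodes, now, W):
    """Build the [2W + N, 2] state matrix described above (illustrative layout)."""
    rows = []
    for job in window_jobs[:W]:                        # each job -> a [2, 2] block
        rows.append([job.size, job.runtime_est])
        rows.append([job.priority, now - job.arrival])  # queued time
    while len(rows) < 2 * W:                            # pad when queue < window
        rows.append([0.0, 0.0])
    for node in nodes:                                  # each node -> a [1, 2] row
        if node.busy:
            rows.append([0.0, node.est_available - now])  # time until free
        else:
            rows.append([1.0, 0.0])                       # available now
    return np.asarray(rows, dtype=np.float32)
```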
Reward.
Reward functions reflect scheduling objectives. It is hard to offer a one-size-fits-all reward function due to diverse site objectives. HPC systems can be broadly classified as capability computing or capacity computing. Capability computing facilities are commonly interested in prioritizing capability jobs (i.e., large jobs) [14] and optimizing resource utilization. An example reward for capability computing could be as follows:

$$w_1 \times \frac{\overline{t_i}}{t_{max}} + w_2 \times \frac{\overline{n_i}}{N} + w_3 \times \frac{N_{used}}{N} \qquad (1)$$

where $\overline{t_i}$ denotes the average wait time of selected jobs and $t_{max}$ is the maximum wait time of jobs in the queue. Similarly, $\overline{n_i}$ is the average job size of the selected jobs; $N$ is the total number of nodes in the system; $N_{used}$ is the number of occupied nodes. In other words, this reward function intends to balance three factors: to prevent job starvation, to promote capability jobs, and to improve system utilization. The weights can be tuned by system administrators based on the site priority. For example, a higher $w_1$ value could meet a more stringent requirement on job starvation.

Capacity computing facilities typically focus on fast turnaround time and short wait time [28]. For capacity computing facilities, we may define the reward function as:

$$\sum_{j \in J} \frac{-1/t_j}{c} \qquad (2)$$

where $J$ is the set of jobs in the queue and $c$ is the number of waiting jobs at the current timestep. This reward function aims to minimize the average job wait time.
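The two reward functions translate directly into code. In the sketch below, the field names are illustrative, and we read $t_j$ in Equation (2) as a per-job time attribute (here taken to be the job's runtime estimate, an assumption on our part).

```python
def capability_reward(selected, queue, n_used, n_total, now,
                      w1=1/3, w2=1/3, w3=1/3):
    """Equation (1): starvation avoidance + capability jobs + utilization."""
    t_avg = sum(now - j.arrival for j in selected) / len(selected)
    t_max = max(now - j.arrival for j in queue)   # longest current wait
    n_avg = sum(j.size for j in selected) / len(selected)
    return w1 * t_avg / t_max + w2 * n_avg / n_total + w3 * n_used / n_total

def capacity_reward(queue):
    """Equation (2): a penalty accumulates while jobs sit in the queue."""
    c = len(queue)
    return sum(-1.0 / j.runtime_est for j in queue) / c
```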
Action.
DRAS processes the input vector and outputs a vector as the scheduling action. The output vector specifies which jobs are selected for job execution (i.e., immediate execution, reserved execution, and backfilled execution). Intuitively, at each scheduling instance, the scheduler selects multiple jobs simultaneously. This leads to an explosive number of actions and is infeasible to train efficiently. Instead, DRAS decomposes one scheduling decision (i.e., selecting several jobs in one shot) into a series of job selections, i.e., selecting one job at a time.
B. Two-level Neural Network
A key challenge in applying deep reinforcement learning to HPC cluster scheduling is to prevent job starvation. State-of-the-art RL methods focus on scheduling jobs for immediate execution and lack a reservation strategy, hence leading to job starvation. To overcome this obstacle, we build a hierarchical neural network structure, in which the level-1 network selects jobs for immediate or reserved execution and the level-2 network identifies jobs for backfilling.

More specifically, at a given scheduling instance, the scheduler first enforces a window at the front of the job wait queue. The window alleviates job starvation problems by giving higher priority to older jobs. The level-1 network selects a job from the window. If the number of available nodes is greater than or equal to the job size, the agent marks the job as a ready job and sends it for immediate execution on the system. This process repeats until the job selected from the window has a size greater than the number of available nodes. The agent marks that job as a reserved job and reserves a set of nodes for its execution on the system at the earliest available time. At this point, the agent moves to the level-2 network. Unlike the first-fit strategy used in the traditional backfilling method, we use the neural network to make backfilling decisions so as to minimize resource waste. Toward this end, we fill the window with job candidates, i.e., the jobs that can fit into the holes in the system before the reserved time. The agent selects one job at a time for the system to backfill. The process at the level-2 network repeats until there are no more job candidates for backfilling.

In a nutshell, the decision making of DRAS is to select jobs and execute them in three modes:
1) ready job: the jobs are selected to run immediately.
2) reserved job: the jobs are selected to start at the earliest reserved time.
3) backfilled job: the jobs are selected to fill the holes before the reserved time.

The same neural network architecture is used for both the level-1 and level-2 networks. The entire two-level neural network is trained jointly using deep reinforcement learning to optimize scheduling performance. Each network consists of five layers: an input layer, a convolution layer, two fully-connected layers, and an output layer. The input layer is connected to a convolution layer with a 1 × 2 filter to extract job or node status information in each row. The convolution layer is connected to two fully-connected layers activated by the leaky rectifier [29]. The second fully-connected layer is connected to the output layer. We denote all of the parameters in the neural network jointly as $\theta$.
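The following sketch summarizes one scheduling instance of this two-level decision process. The agent and cluster helper methods (level1_select, fits_before_reservation, etc.) are hypothetical names standing in for the mechanics described above, not APIs from the DRAS code.

```python
def dras_step(agent, queue, cluster, W):
    """One DRAS scheduling instance: level 1 picks ready/reserved jobs,
    level 2 backfills around the reservation (sketch)."""
    ready, reserved, backfilled = [], None, []
    window = queue[:W]                                  # oldest jobs get priority
    while window:
        job = agent.level1_select(window, cluster)      # one job per inference
        window.remove(job)
        if job.size <= cluster.free_nodes():
            cluster.start(job)                          # ready job
            ready.append(job)
        else:
            reserved = job                              # reserved job: earliest
            cluster.reserve(job)                        # time enough nodes free
            break
    if reserved is not None:
        # Level 2: candidates are jobs fitting the holes before the reservation.
        candidates = [j for j in queue
                      if j is not reserved and j not in ready
                      and cluster.fits_before_reservation(j)]
        while candidates:
            job = agent.level2_select(candidates, cluster)
            cluster.start(job)                          # backfilled job
            backfilled.append(job)
            candidates = [j for j in candidates
                          if j is not job and cluster.fits_before_reservation(j)]
    return ready, reserved, backfilled
```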
In this study, we develop two DRAS agents: DRAS-PG and DRAS-DQL. PG denotes policy gradient, and DQL denotes deep Q-learning. Selecting both PG and DQL lets us systematically evaluate these popular reinforcement learning methods under a unified environment.
DRAS-PG uses the neural network to parameterize the scheduling policy as $\pi_\theta(s_k, a_k)$ (i.e., the probability of taking action $a_k$ in state $s_k$). The input of DRAS-PG is a 2D vector of $[2 \times W + N, 2]$, where $W$ is the window size and $N$ is the total number of nodes in the system. The output of the neural network contains $W$ neurons, each denoting the probability of selecting one job out of the $W$ jobs. A scheduling action is stochastically drawn from the $W$ jobs following their probability distribution. We employ softmax [29] as the activation function to ensure the sum of the output values equals 1.0. If the number of waiting jobs is less than the window size $W$, we mask the invalid actions in the output by rescaling all valid actions. In terms of learning, the DRAS-PG method updates the neural network parameters $\theta$ by:

$$\theta \leftarrow \theta + \alpha \sum_{k=1}^{K} \nabla_\theta \log \pi_\theta(s_k, a_k) \Big( \sum_{k'=k}^{K} r_{k'} - b_k \Big) \qquad (3)$$

Here, $K$ denotes the total number of actions taken in the parameter update, $\alpha$ is the learning rate used with the Adam optimizer [30], and $b_k$ is the baseline used to reduce the variance of the policy gradient. We set $b_k$ to the cumulative reward from step $k$ onwards, averaged over all past parameter updates.

DRAS-DQL uses the neural network to approximate the Q-value as $Q_\theta(s_k, a_k)$ (i.e., the expected cumulative reward of taking action $a_k$ in state $s_k$). The DRAS-DQL network processes one job at a time and produces the expected Q-value for this job. We use the same network to approximate the Q-value for all the jobs in the window $W$. The input of the DRAS-DQL neural network is a 2D vector of $[2 + N, 2]$, containing one job's information and $N$ nodes' information. The output is a single neuron corresponding to the expected Q-value of the job. After processing all the jobs in the window, the agent normally selects the job with the highest Q-value. In order to explore various actions, the agent randomly chooses a job, instead of the job with the highest Q-value, with probability $\epsilon$. In practice, $\epsilon$ is very high at the beginning of the training to ensure that the agent explores various state-action pairs, and it decays over time as the agent becomes more experienced. In our study, we set $\epsilon = 1.0$ at the beginning of the training and let it decay at a fixed rate. In training, the parameters $\theta$ of the DRAS-DQL network are updated by:

$$\theta \leftarrow \theta - \alpha \sum_{k=1}^{K} \nabla_\theta Q(s_k, a_k) \Big( \underbrace{r_k + \max_a Q(s_{k+1}, a)}_{\text{new value}} - \underbrace{Q(s_k, a_k)}_{\text{old value}} \Big) \qquad (4)$$

Here, the old value $Q(s_k, a_k)$ is the expected Q-value of taking action $a_k$ at state $s_k$. After taking action $a_k$, we can compute a more accurate expected Q-value (i.e., the new value) by adding the immediate reward $r_k$ and the expected cumulative future reward. DQL networks learn by minimizing the loss between the new value and the old value.
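A minimal NumPy rendering of the two learning rules follows: the PG update of Equation (3) with undiscounted reward-to-go and a baseline, and the TD error underlying Equation (4). DRAS itself trains with the Adam optimizer in TensorFlow, so the plain additive update here is only illustrative.

```python
import numpy as np

def pg_update(theta, grad_logpi, rewards, baselines, alpha=0.001):
    """Equation (3): theta += alpha * grad log pi(s_k, a_k) *
    (reward-to-go from step k minus baseline b_k), summed over k."""
    returns = np.cumsum(rewards[::-1])[::-1]   # sum_{k'=k}^{K} r_k'
    for g, ret, b in zip(grad_logpi, returns, baselines):
        theta = theta + alpha * g * (ret - b)
    return theta

def dql_td_error(q_sa, r_k, q_next):
    """Equation (4): (new value) - (old value), where the new value is
    r_k + max_a Q(s_{k+1}, a); DQL minimizes the loss between the two."""
    return (r_k + max(q_next)) - q_sa
```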
C. Training Strategy

At the beginning of the training, we initialize DRAS's neural network parameters $\theta$ to random numbers. We train the neural network in episodes, and the network parameters $\theta$ are updated with episodic training until convergence. For each episode, the environment is first set to its initial state (i.e., all nodes are idle and no jobs run on the system). We train DRAS via trace-based simulation, in which job events occur at specific instants in time according to the job traces. DRAS observes the scheduling state, makes scheduling decisions according to its neural network, and collects the scheduling reward. Every ten scheduling instances, DRAS updates its parameters $\theta$ based on the collected observations and then clears its memory for the next update. An episode terminates when all jobs in the jobset have been scheduled. We monitor the progress of the training by taking a snapshot of the model after each episode. The next episode uses a new jobset to refine the previous model.

The jobsets used in training determine the convergence and quality of the DRAS model. To learn a converged model, we follow the principle of gradual improvement: DRAS starts with simple average cases and gradually improves its capability with unseen rare cases.
Specifically, we train DRAS using a three-phase training process in which three types of jobsets are used in order: (1) a set of jobs sampled from real job traces, (2) a period of real job traces, and (3) a set of synthetic jobs generated according to job patterns on the target system. The sampled jobsets have controlled job arrival rates, providing the easiest learning environment. Once DRAS can make good scheduling decisions under the controlled environment, training on the real job traces with various job arrival patterns allows DRAS to learn more challenging situations. The final phase is to train DRAS with synthetic jobsets, which enables DRAS to experience a variety of potential states that might not be seen in the first two types of jobsets. We will show that this three-phase training process leads to fast convergence (§IV-D). DRAS is implemented in TensorFlow [31] and available as open source on GitHub [32].
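Putting the pieces together, the episodic training described in this subsection can be sketched as follows; SchedulingSimulator stands in for the trace-driven simulation environment and, like the other helper names, is our illustrative assumption.

```python
def train(agent, jobsets, update_every=10):
    """Episodic three-phase training: jobsets ordered sampled -> real ->
    synthetic; parameters update every ten scheduling instances."""
    for episode, jobset in enumerate(jobsets):
        env = SchedulingSimulator(jobset)   # initial state: all nodes idle
        memory, step = [], 0
        while not env.done():               # episode ends when all jobs scheduled
            state = env.observe()
            action = agent.act(state)       # job selection via the neural network
            reward = env.step(action)
            memory.append((state, action, reward))
            step += 1
            if step % update_every == 0:
                agent.update(memory)        # e.g., Eq. (3) for PG, Eq. (4) for DQL
                memory.clear()
        agent.snapshot(episode)             # monitor convergence per episode
```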
IV. EXPERIMENTAL SETUP
A. Comparison Methods
We compare the following scheduling methods:
• FCFS represents FCFS with EASY backfilling, which is the default scheduling policy deployed on many production supercomputers [16]. FCFS prioritizes jobs based on their arrival times, and EASY backfilling is used to reduce resource fragmentation [1].
• BinPacking is a widely used heuristic method for scheduling in datacenters [7]. It iteratively allocates the largest runnable jobs (i.e., jobs whose size is less than or equal to the number of available nodes in the system) until the system cannot accommodate any further jobs.
• Random randomly selects runnable jobs from the queue to execute until no more jobs in the queue can fit into the system. Since DRAS performs similarly to Random at the beginning of training by randomly exploring the action space, DRAS outperforming Random demonstrates that DRAS gradually learns to improve its scheduling actions.
• Optimization denotes a suite of scheduling methods that formulate cluster scheduling as an optimization problem [3], e.g., to minimize average job wait time. In our experiments, the optimization problem is formulated as a 0-1 knapsack problem which is solved using dynamic programming. For a fair comparison, we use the same scheduling objectives (i.e., Equations (1) and (2)) for Optimization and for DRAS.
• Decima-PG denotes a modified version of Decima [8]. As mentioned earlier, Decima is not designed for scheduling HPC jobs. Hence, we use a modified version of Decima that skips the graph neural network and adopts our state representation presented in §III-A. Note that Decima-PG is an RL agent without a hierarchical network structure. Hence it acts as the baseline to demonstrate the benefits of the hierarchical design of DRAS.
• DRAS-PG and DRAS-DQL denote our DRAS agents.
B. Trace-based Simulation
We compare these scheduling policies through trace-based simulation. Specifically, a trace-based, event-driven scheduling simulator called CQSim is used in our experiments [2], [33], [34]. CQSim contains a queue manager and a scheduler that can plug in different scheduling policies. It emulates the actual scheduling environment. A real system takes jobs from user submission, while CQSim takes jobs by reading the job arrival information in the trace. Rather than executing jobs on the system, CQSim simulates execution by advancing the simulation clock according to the job runtime information in the trace.
TABLE II: Theta and Cori workloads.

                 Theta                    Cori
Location         ALCF                     NERSC
Scheduler        Cobalt                   Slurm
System Type      Capability computing     Capacity computing
Compute Nodes    4,392 (4,392 KNL)        12,076 (2,388 Haswell; 9,688 KNL)
Trace Period     Jan. 2018 - Dec. 2019    Apr. 2018 - Jul. 2018
Number of Jobs   121,837                  2,607,054
Max Job Length   1 day                    7 days
Fig. 2: Job characterization of Theta at ALCF and Cori at NERSC. The outer circle shows the number of jobs in each job size category. The inner circle presents the total core hours consumed by each job size category.
C. Workload Traces
In our study, two real workload traces are used. Table II summarizes the two traces collected from production systems, and Figure 2 gives an overview of the job size distributions on these supercomputers. We select these traces as they represent different workload profiles: (1) capability computing focusing on solving large-sized problems, and (2) capacity computing solving a mix of small-sized and large-sized problems. The first workload is a two-year job log from Theta [35], the production HPC system located at ALCF. Theta is a capability computing system. The smallest job allowed on Theta is 128 nodes [36]. Only 2.25% of jobs have dependencies. For jobs with dependencies, the scheduler hides them from scheduling until all their parents have been executed. On Theta, 32 nodes are dedicated to running debugging jobs and the remaining 4,360 nodes are dedicated to user jobs. In our experiments, we set the system size to 4,360 and filter out all debugging jobs in the trace. We use the first 2 months of data for training, the next month of data for validating model convergence, and the remaining 21 months of data for testing.

The second trace is a four-month job log from Cori [37]. Cori is a capacity computing system deployed at NERSC. A majority of its jobs consume one or several nodes (Figure 2). The longest job executed for seven days. We use the first 2 weeks of data for training, the next 1 week of data for validating model convergence, and the last 15 weeks of data for testing.

D. DRAS Training
The details of the RL architectures for these systems are listed in Table III. Take the neural network of DRAS-PG on Theta as an example. The input of the neural network is a vector of [4460, 2]. We use a convolutional layer with 4460 neurons and two fully-connected layers with 4000 and 1000 neurons respectively. The output layer contains 50 neurons representing the jobs in the window. In total, the neural network has 21,890,053 trainable parameters.

TABLE III: DRAS network configurations for Theta and Cori.

                          Theta                      Cori
                          DRAS-PG      DRAS-DQL      DRAS-PG       DRAS-DQL
Input                     [4460, 2]    [4362, 2]     [12176, 2]    [12078, 2]
Convolutional Layer       4460         4362          12176         12078
Fully Connected Layer 1   4000         4000          10000         10000
Fully Connected Layer 2   1000         1000          4000          4000
Output                    50           1             50            1
Trainable Parameters      21,890,053   21,449,004    161,960,053   161,764,004
For the capability computing facility Theta, we define its reward as Equation (1). We set the weights $w_1 = w_2 = w_3 = 1/3$. For the capacity computing facility Cori, we set the reward as Equation (2). The learning rate $\alpha$ is set to 0.001.

Fig. 3: Job patterns of the Theta training dataset: (a) weekly job submission pattern; (b) daily job submission pattern; (c) job size distribution; (d) accuracy of job runtime estimates supplied by users.

We use 100 jobsets composed of 320,000 jobs for DRAS training on Theta. We collect 9 sampled jobsets by randomly selecting jobs from the original training trace and modeling job arrival times as a Poisson distribution following the average inter-arrival time of the original trace. We split the original Theta training dataset into nine one-week jobsets. We generate 82 synthetic jobsets that mimic Theta workload patterns in terms of hourly and daily job arrivals and distributions of job sizes and runtimes (Figure 3).

We validate the trained DRAS agent with an unseen validation dataset (i.e., March of 2018). Figure 4 compares the convergence rates obtained by training DRAS with different jobset orderings.

Fig. 4: Comparison of quality and convergence of DRAS-PG when training it with different jobset orders (§III-C).

We make two key observations. First, training only with real jobsets (the first 9 episodes of the orange line) cannot obtain a converged model. To achieve a converged model, more jobsets are needed to train our agents. Second, training order plays an important role in performance. Training in the order of sampled, real, and synthetic jobsets achieves the best result. While training with real jobsets first can also obtain a converged model, the performance is not as good as the case of training with sampled jobsets first. Training with synthetic jobsets first results in slow convergence. In summary, in order to generate a converged and high-quality model, DRAS needs to first learn from simple averaged cases (sampled jobsets) and then gradually move to more complicated special cases (real and synthetic jobsets).
Fig. 5: The total reward collected by the different scheduling methods on the Theta validation dataset.

Figure 5 shows the learning curves of the different scheduling methods. Our three-phase training process allows DRAS to quickly learn, surpass the other competing methods, and converge to optimal solutions. Based on these results, we use the model trained after the 50th episode for the testing reported in §V. We perform a similar training and validation process on Cori. We train DRAS using 100 jobsets (20,000,000 jobs) composed of sampled traces, real traces, and synthetic traces. Both DRAS methods converge at 40 episodes. Hence, we use the model trained after the 40th episode for testing.
E. Evaluation Metrics
There are two classes of metrics for evaluating cluster scheduling: user-level metrics and system-level metrics. In our experiments, we measure four well-established metrics (a short code sketch follows the list):
• Job wait time is a user-level metric. It measures the interval between job submission and job start time. In our experiments, we analyze average job wait time, maximum job wait time, as well as the distribution of job wait times.
• Job response time is a user-level metric which measures the interval between job submission and completion.
• Job slowdown is another user-level metric. It measures the ratio of the job response time to its actual runtime.
• System utilization is a system-level metric. It measures the ratio of the node-hours used for useful job execution to the total elapsed node-hours.
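These four metrics translate directly into code; below is a small sketch with illustrative job fields (arrival, start, and end timestamps in hours).

```python
def job_metrics(job):
    """User-level metrics for a completed job."""
    wait = job.start - job.arrival                 # job wait time
    response = job.end - job.arrival               # job response time
    slowdown = response / (job.end - job.start)    # response / actual runtime
    return wait, response, slowdown

def system_utilization(jobs, n_total, t_begin, t_end):
    """System-level metric: used node-hours / total elapsed node-hours."""
    used = sum(j.size * (j.end - j.start) for j in jobs)
    return used / (n_total * (t_end - t_begin))
```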
V. EXPERIMENTAL RESULTS
Now we present the experimental results of applying the trained DRAS to the test data (i.e., the 21-month Theta log and the 15-week Cori log). Our experiments intend to answer the following questions:
1) Does DRAS outperform existing scheduling methods? (§V-A)
2) Is DRAS capable of preventing jobs from starvation? (§V-B)
3) If DRAS outperforms other methods, where does the performance gain come from? (§V-C)
4) Can DRAS adapt to workload changes? (§V-D)
Fig. 6: Overall scheduling performance comparison using Kiviat graphs: (a) Theta traces (left) and (b) Cori traces (right). We use the reciprocal of average job wait time, the reciprocal of maximum job wait time, the reciprocal of average slowdown, and the reciprocal of average job response time in the plots. All metrics are normalized to the range of 0 to 1: 1 means a method achieves the best performance among all methods and 0 means a method obtains the worst performance. The larger the area is, the better the overall performance is.
A. Scheduling Performance
The quality of a scheduling method needs to be evaluated by multiple metrics, including both system-level and user-level metrics. Figure 6 presents the overall scheduling performance obtained by the different scheduling methods. DRAS yields the best result. DRAS-PG achieves a slightly better result on user-level metrics, while DRAS-DQL obtains the best system-level performance. Although FCFS has the lowest maximum wait time, it has poor performance on the rest of the metrics. Both DRAS methods outperform Optimization, suggesting that the DRAS agents learn to select jobs that not only maximize the immediate reward, but also potentially improve performance in the future through maximizing cumulative reward. Decima-PG achieves good performance on system utilization, but it fails to improve user-level metrics. BinPacking and Random have the worst performance, because they greedily select jobs one by one, which ignores the best job combinations. Recall that DRAS applies a similar strategy to Random at the beginning of training by randomly exploring various actions; the better performance of DRAS indicates that our RL models developed good policies through learning.

We also notice that some methods have inconsistent performance on the user-level metrics. For example, FCFS achieves the lowest maximum job wait time; however, it suffers from high average job wait time. More detailed analysis of job wait time is needed. Due to space limitations, we only present the in-depth analysis of job wait time on Theta in the following subsections.
B. Job Starvation Analysis
Figure 7 shows job wait times under different job sizes and categories. We make three key observations from this figure. First, DRAS and FCFS prevent jobs from starvation, while Decima-PG, BinPacking, and Random suffer severe job starvation. The maximum job wait times of DRAS-PG and DRAS-DQL are 16 days and 20 days respectively, which are only slightly higher than the maximum job wait time of FCFS (13 days) and are similar to Trace. Although Optimization and DRAS aim at the same scheduling objectives, the maximum wait time of Optimization is twice as long as that of DRAS. The maximum job wait times of Decima-PG, BinPacking, and Random are 170 days, 95 days, and 170 days respectively, indicating they are not suitable for HPC cluster scheduling. Second, in Decima-PG, BinPacking, and Random, large-sized jobs wait noticeably longer than small-sized jobs. These methods inherently give higher priority to small-sized jobs at the expense of large-sized jobs, because they lack a reservation strategy to reserve resources for large-sized jobs. This bias toward small-sized jobs is not ideal for HPC scheduling, especially for capability systems. In contrast, the methods with a reservation strategy, i.e., FCFS and DRAS, do not show a significant difference between small jobs' wait times and large jobs' wait times. This demonstrates that DRAS and FCFS are relatively fair scheduling policies. Third, if we look at the methods using reservation and backfilling strategies (i.e., FCFS and DRAS), we notice that almost all large jobs are executed through reservation, while the majority of small jobs are executed through backfilling. In short, these results demonstrate that DRAS is capable of preventing job starvation, mainly due to the incorporation of job reservation and backfilling in our DRAS design.
C. Source of DRAS Performance Gain
Table IV presents job distributions under the different scheduling methods. We notice that although DRAS backfills a majority of the jobs, most node-hours are consumed by reserved jobs. If we read Table IV along with Figure 7, we observe that there are a few jobs with wait times of over 300 hours, and these jobs are mainly allocated through reservation by DRAS. Without reservation, these jobs would wait 2X-10X longer, as happens in Decima-PG, Optimization, BinPacking, and Random. Put together, these results reveal that DRAS learns to achieve the scheduling goals by prioritizing jobs and preventing job starvation through the reservation mechanism embedded in its two-level neural network design.

Fig. 7: Job wait time distributions with respect to job size and job type on Theta. Note that the y-axis scale for Decima-PG, BinPacking, and Random is much larger than those for the others. Trace presents the job wait times extracted from the original log, which can be used as the baseline. Since we do not have the job type information, all jobs are marked in grey. Ellipses in the plots indicate Decima-PG, BinPacking, and Random lead to severe job starvation.

TABLE IV: Job distributions in different execution modes (defined in §III-B) on Theta.

              Backfilled             Ready                  Reserved
              jobs     core hours    jobs     core hours    jobs     core hours
Optimization  0%       0%            100%     100%          0%       0%
Decima-PG     0%       0%            100%     100%          0%       0%
BinPacking    0%       0%            100%     100%          0%       0%
Random        0%       0%            100%     100%          0%       0%
FCFS          79.25%   30.45%        9.88%    16.99%        10.87%   52.56%
DRAS-PG       83.76%   33.67%        8.63%    11.29%        7.61%    55.04%
DRAS-DQL      84.83%   34.17%        6.84%    10.91%        15.17%   54.92%
Fig. 8: Bar plot of job wait times, grouped by job execution modes. As compared to FCFS, DRAS learns to intelligently select jobs for immediate execution, reservation, or backfilling so as to maximize the overall scheduling performance.

Although both FCFS and DRAS apply backfilling strategies, DRAS performs significantly better in terms of average job wait time (Figure 6). In Figure 8, we notice that DRAS largely reduces the wait time of ready and backfilled jobs at the expense of a slightly higher wait time for reserved jobs. FCFS schedules jobs in their arrival order, while DRAS selects jobs from the queue aiming to balance three objectives (i.e., minimizing average job wait time, prioritizing large jobs, and maximizing system utilization). Therefore, DRAS learns to pick backfilled and ready jobs that lead to lower average job wait time and to select jobs queued for long times to avoid job starvation. The better performance of DRAS demonstrates that DRAS learns to intelligently select jobs for resource allocation so as to maximize the long-term scheduling performance.
D. Adaptation to Workload Change
Figure 9 shows the total core hours and average job wait times per week during the testing period. The system loads are dynamically changing. Several dramatic demand surges put severe pressure on scheduling performance. The bottom figure compares how DRAS agents respond to the workload changes compared to the other methods. It is clear that DRAS achieves greater wait time reduction when the workload surges.

Fig. 9: When the system load is high, DRAS dynamically adjusts its network parameters to reduce average job wait time.

Recall that DRAS agents continuously adjust their network parameters to minimize average job wait time. On the other hand, the policy of the static methods is predefined and fixed, which performs poorly under heavy workloads. A system change, such as adding more nodes, requires DRAS to re-train the model. In our experiments, we spent less than 3 hours on a personal computer to obtain a converged model. Considering that system changes are not very frequent and DRAS can avoid complicated manual policy tuning, it is worthwhile to re-train the model when the system changes.
E. Runtime Overhead
In our experiments, DRAS-PG takes less than 1 second and DRAS-DQL takes less than 2 seconds for each network parameter update during testing. Note that the experiments are conducted on a personal computer configured with an Intel quad-core 2.6 GHz CPU and 16 GB of memory. In practice, HPC cluster scheduling is typically required to make decisions in 15-30 seconds [3]. In other words, the DRAS agents impose trivial overhead, hence being feasible for online deployment.

VI. CONCLUSION
In this paper, we have presented DRAS, an RL-empowered HPC cluster scheduling agent. DRAS represents the scheduling policy as a hierarchical neural network and automatically learns customized policies through training with system-specific workloads. Our results demonstrate that DRAS is capable of grasping system- and workload-specific characteristics, preventing large jobs from starvation, adapting to workload changes without human intervention, and outperforming existing scheduling policies by up to 45%. We hope it opens up exciting opportunities to rethink HPC cluster scheduling.

ACKNOWLEDGMENT
This work is supported in part by US National Science Foundation grants CNS-1717763 and CCF-1618776. T. Childers, W. Allcock, P. Rich, and M. Papka are supported by the Argonne Leadership Computing Facility, which is a U.S. Department of Energy Office of Science User Facility operated under contract DE-AC02-06CH11357. Job logs from the Cori system were provided by the National Energy Research Scientific Computing Center (NERSC), a U.S. Department of Energy Office of Science User Facility operated under Contract No. DE-AC02-05CH11231.

REFERENCES

[1] A. Mu'alem and D. Feitelson. Utilization, Predictability, Workloads, and User Runtime Estimates in Scheduling the IBM SP2 with Backfilling. TPDS'01.
[2] X. Yang, Z. Zhou, S. Wallace, Z. Lan, W. Tang, S. Coghlan, and M. Papka. Integrating Dynamic Pricing of Electricity into Energy Aware Scheduling for HPC Systems. SC'13.
[3] Y. Fan, Z. Lan, P. Rich, W. Allcock, M. Papka, B. Austin, and D. Paul. Scheduling Beyond CPUs for HPC. HPDC'19.
[4] H. Sun, P. Stolf, J. Pierson, and G. Costa. Energy-Efficient and Thermal-Aware Resource Management for Heterogeneous Datacenters. Sustainable Computing: Informatics and Systems, 2014.
[5] P. Qiao, X. Wang, X. Yang, Y. Fan, and Z. Lan. Preliminary Interference Study About Job Placement and Routing Algorithms in the Fat-Tree Topology for HPC Applications. CLUSTER, 2017.
[6] P. Qiao, X. Wang, X. Yang, Y. Fan, and Z. Lan. Joint Effects of Application Communication Pattern, Job Placement and Network Routing on Fat-Tree Systems. ICPP Workshops, 2018.
[7] R. Grandl, G. Ananthanarayanan, S. Kandula, S. Rao, and A. Akella. Multi-resource Packing for Cluster Schedulers. SIGCOMM'14.
[8] H. Mao, M. Schwarzkopf, S. Venkatakrishnan, Z. Meng, and M. Alizadeh. Learning Scheduling Algorithms for Data Processing Clusters. SIGCOMM'19.
[9] A. Sallab, M. Abdou, E. Perot, and S. Yogamani. Deep Reinforcement Learning Framework for Autonomous Driving. Electronic Imaging.
[10] T. Johannink, S. Bahl, A. Nair, J. Luo, A. Kumar, M. Loskyll, J. Ojea, E. Solowjow, and S. Levine. Residual Reinforcement Learning for Robot Control. ICRA'19.
[11] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing Atari with Deep Reinforcement Learning. NIPS Deep Learning Workshop, 2013.
[12] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. Driessche, T. Graepel, and D. Hassabis. Mastering the Game of Go without Human Knowledge. Nature, 2017.
[13] R. Sutton and A. Barto. Reinforcement Learning: An Introduction, Second Edition. MIT Press, 2017.
[14] W. Allcock, P. Rich, Y. Fan, and Z. Lan. Experience and Practice of Batch Scheduling on Leadership Supercomputers at Argonne. JSSPP'17.
[15] L. Yu, Z. Zhou, Y. Fan, M. Papka, and Z. Lan. System-wide Trade-off Modeling of Performance, Power, and Resilience on Petascale Systems. The Journal of Supercomputing, 2018.
[16] M. Jette, A. Yoo, and M. Grondona. SLURM: Simple Linux Utility for Resource Management. JSSPP'03.
[21] SC Poster, 2019.
[22] H. Mao, M. Alizadeh, I. Menache, and S. Kandula. Resource Management with Deep Reinforcement Learning. HotNets'16.
[23] D. Zhang, D. Dai, Y. He, F. Bao, and B. Xie. RLScheduler: An Automated HPC Batch Job Scheduler Using Reinforcement Learning. SC'20.
[24] E. İpek, O. Mutlu, J. Martínez, and R. Caruana. Self-Optimizing Memory Controllers: A Reinforcement Learning Approach. ISCA'08.
[25] Y. LeCun, Y. Bengio, and G. Hinton. Deep Learning. Nature, 2015.
[26] R. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy Gradient Methods for Reinforcement Learning with Function Approximation. NIPS'99.
[27] V. Mnih, A. Badia, M. Mirza, A. Graves, T. Harley, T. Lillicrap, D. Silver, and K. Kavukcuoglu. Asynchronous Methods for Deep Reinforcement Learning. ICML'16.
[28] NERSC Queue Policies. https://docs.nersc.gov/jobs/policy/.
[29] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning.