Deep Reinforcement Agent for Scheduling in HPC

Yuping Fan, Zhiling Lan
Illinois Institute of Technology, Chicago, IL
[email protected], [email protected]

Taylor Childers, Paul Rich, William Allcock
Argonne National Laboratory, Lemont, IL
{jchilders,richp,allcock}@anl.gov

Michael E. Papka
Argonne National Laboratory / Northern Illinois University
[email protected]
Abstract—The cluster scheduler is crucial in high-performance computing (HPC). It determines when and which user jobs should be allocated to available system resources. Existing cluster scheduling heuristics are developed by human experts based on their experience with specific HPC systems and workloads. However, the increasing complexity of computing systems and the highly dynamic nature of application workloads have placed a tremendous burden on manually designed and tuned scheduling heuristics. More aggressive optimization and automation are needed for cluster scheduling in HPC. In this work, we present an automated HPC scheduling agent named DRAS (Deep Reinforcement Agent for Scheduling) by leveraging deep reinforcement learning. DRAS is built on a novel, hierarchical neural network incorporating special HPC scheduling features such as resource reservation and backfilling. A unique training strategy is presented to enable DRAS to rapidly learn the target environment. Once provided a specific scheduling objective by the system manager, DRAS automatically learns to improve its policy through interaction with the scheduling environment and dynamically adjusts its policy as workload changes. The experiments with different production workloads demonstrate that DRAS outperforms the existing heuristic and optimization approaches by up to 45%.
Index Terms—cluster scheduling, high-performance computing, deep reinforcement learning, job starvation, backfilling, resource reservation
I. INTRODUCTION
The cluster scheduler plays a critical role in high-performance computing (HPC). It enforces site policies by deciding when and which user jobs are allocated to system resources. Common scheduling goals include high system utilization, good user satisfaction, and job prioritization.
Heuristics are the prevailing approaches in HPC cluster scheduling. For example, first come, first served (FCFS) with EASY backfilling is a well-known scheduling policy deployed on production HPC systems [1]. Bin packing is another well-known heuristic approach aiming for high utilization. Heuristics are easy to implement and fast, trading optimality for speed. In addition, optimization is also extensively studied in the literature for cluster scheduling [2]–[6]. Optimization methods focus on optimizing immediate scheduling objective(s) without regard to long-term performance. Moreover, both heuristics and optimization approaches are static, and neither of them is capable of adapting its scheduling policy to dynamic changes in the environment. In case of sudden variation in workloads, system administrators have to manually tune the algorithms and parameters in their policies to mitigate performance degradation. As HPC systems become increasingly complex, combined with highly diverse application workloads, such a manual process is becoming challenging, time-consuming, and error-prone. We believe that more aggressive optimization and automation, beyond the existing heuristics and optimization methods, is essential for HPC cluster scheduling.

In recent years, reinforcement learning (RL) combined with deep neural networks has been successfully employed in various fields for dynamic decision making, such as self-driving cars [9], autonomous robots [10], and game playing [11], [12]. Reinforcement learning refers to an area of machine learning that automatically learns to maximize cumulative reward through interaction with the environment [13]. Mao et al. present an RL-driven scheduling design named Decima for data processing jobs with dependent tasks [8]. While Decima has shown promising results for scheduling, it is not applicable to cluster scheduling in HPC (detailed in §II-A).

Inspired by the above RL-driven studies, we present an automated HPC scheduling agent named DRAS (Deep Reinforcement Agent for Scheduling) tailored for HPC workloads. The goal is twofold: (1) to improve HPC scheduling performance beyond the existing approaches, and (2) to automatically adjust scheduling policies in case of workload changes. Unlike cloud scheduling, HPC scheduling has several salient features, especially resource reservation to prevent job starvation and backfilling to reduce resource fragmentation. In the design of DRAS, we incorporate both features into the formulation of deep reinforcement learning and introduce a hierarchical neural network structure, where the level-1 network selects jobs for immediate or reserved execution and the level-2 network concentrates on choosing proper backfilled jobs for more scheduling optimization. In order to optimize and automate the process, all the scheduling decisions including immediate job selection, job reservation, and backfilling are made by DRAS without human involvement. Moreover, we develop a three-phase training process using historical job logs. Our training strategy allows DRAS to gradually explore from simple average situations to more challenging rare situations, hence leading to a fast and converged model.

We evaluate DRAS by extensive trace-based simulations with job traces collected from two production supercomputers representing capability computing and capacity computing. The results indicate DRAS is capable of automatically learning to improve its policy through interaction with the scheduling environment and dynamically adjusting its policy as workload changes. Specifically, this paper makes three major contributions:
1) We design a new scheduling agent DRAS which leverages the advances in deep reinforcement learning and incorporates the key features of HPC scheduling in the form of a hierarchical neural network model.
2) We develop a three-phase training process which allows DRAS to automatically learn the scheduling environment (i.e., the system and its workloads) and to rapidly converge to an optimal policy.
3) Our trace-based experiments demonstrate DRAS outperforms a number of scheduling methods by up to 45%. Compared to the heuristic and optimization approaches, DRAS offers two benefits: better long-term scheduling performance and adaptation to dynamic workload changes without human intervention.

TABLE I: Comparison of cluster scheduling methods.

Features \ Methods                FCFS [1]   BinPacking [7]        Optimization [2]–[4]   Decima [8]     DRAS
Adaptation to workload changes    ✗          ✗                     ✗                      ✓              ✓
Automatic policy tuning           ✗          ✗                     ✗                      ✓              ✓
Long-term scheduling performance  ✗          ✗                     ✗                      ✓              ✓
Starvation avoidance              ✓          ✗                     ✗                      ✗              ✓
Require training                  ✗          ✗                     ✗                      ✓              ✓
Implementation effort             Easy       Easy                  Medium                 Hard           Hard
Key objective                     Fairness   Resource utilization  Customizable           Customizable   Customizable

II. BACKGROUND AND CHALLENGES
A. Cluster Scheduling in HPC
HPC job scheduling, also known as batch scheduling, is responsible for assigning jobs to resources (e.g., compute nodes) according to site policies and resource availability [1], [14], [15]. Well-known schedulers include Slurm, Moab/TORQUE, PBS, and Cobalt [16]–[19]. Let's consider a cluster with $N$ nodes. Users submit their jobs to the system through the scheduler. When submitting a job, a user is required to provide the job size $n_i$ (i.e., the number of compute nodes needed for the job) and a job runtime estimate $t_i$ (i.e., the estimated time needed for the job). Typical HPC jobs are rigid, meaning the job size is fixed throughout execution. The job runtime estimate is an upper bound for the job, such that the job will be killed by the scheduler if its actual runtime exceeds this estimate [20], [21]. At each scheduling instance, the scheduler orders the jobs in the queue according to the site policy and executes jobs from the head of the queue.

Existing HPC scheduling policies can be broadly classified into two groups: heuristics and optimization methods. First Come First Serve (FCFS) with EASY backfilling is the most widely used heuristic, which sorts the jobs in the wait queue according to their arrival times and executes jobs from the head of the queue. If the available resources are not sufficient for the first job in the queue, the scheduler will reserve the resources for this job. Backfilling is often used in conjunction with reservation to enhance system utilization. It allows subsequent jobs in the wait queue to move ahead under the condition that they do not delay the existing reservations [1]. Optimization methods select a set of jobs from the queue with an objective to optimize certain scheduling metrics, such as minimizing average job wait time and maximizing system utilization [2]–[6].

Several recent studies have explored reinforcement learning for cluster scheduling. DeepRM [22] is the first work demonstrating the potential of using reinforcement learning for learning customized scheduling policies from experience. Unfortunately, DeepRM's state representation cannot handle realistic cluster workloads with continuous job arrivals. Unlike DeepRM, RLScheduler [23] attempts to develop a general reinforcement learning model that is trained with one system log and then used on other systems with different characteristics (e.g., system size, workload patterns, etc.). While such a generic model is appealing, RLScheduler might lead to less satisfactory scheduling performance than heuristic methods. The work most closely related to ours is Decima, which explores reinforcement learning to allocate data processing jobs. Each job consists of dependent tasks and is represented as a directed acyclic graph (DAG). Decima integrates a graph neural network to extract job DAGs and cluster status as embedding vectors. It then feeds the embedding vectors to a policy gradient network for decision making. The decision consists of two parts: selecting tasks for immediate execution and determining task parallelism. Unfortunately, Decima is not applicable to HPC scheduling. First, Decima assumes all jobs can be decomposed into malleable tasks, whereas HPC is dominated by rigid jobs that cannot be decomposed. Second, Decima can cause serious job starvation due to the lack of resource reservation support (Figure 7). In short, Table I summarizes and compares existing cluster scheduling methods, along with their features.
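To make the FCFS-with-EASY-backfilling mechanism above concrete, here is a minimal Python sketch of one scheduling pass. The Job fields, helper arguments, and the simplified shadow-time computation are our illustrative assumptions, not code from any production scheduler; in particular, this conservative variant only backfills jobs that finish (by their estimates) before the head job's reservation.

```python
# Minimal sketch of FCFS with EASY backfilling (illustrative, simplified).
from dataclasses import dataclass

@dataclass
class Job:
    size: int           # n_i: number of nodes requested
    runtime_est: float  # t_i: user-supplied runtime estimate (upper bound)
    arrival: float      # submission time

def fcfs_easy(queue, free_nodes, busy_node_free_times, now):
    """Return the jobs to start now; queue is processed in arrival (FCFS) order."""
    to_start = []
    queue = sorted(queue, key=lambda j: j.arrival)
    # Start jobs from the head of the queue while they fit.
    while queue and queue[0].size <= free_nodes:
        job = queue.pop(0)
        to_start.append(job)
        free_nodes -= job.size
    if not queue:
        return to_start
    # Head job does not fit: reserve it for the earliest time enough
    # busy nodes are estimated to free up (the "shadow" time).
    head = queue.pop(0)
    shadow = sorted(busy_node_free_times)[head.size - free_nodes - 1]
    # EASY backfilling (first fit): start any later job that fits now and,
    # per its runtime estimate, ends before the reservation starts.
    for job in list(queue):
        if job.size <= free_nodes and now + job.runtime_est <= shadow:
            queue.remove(job)
            to_start.append(job)
            free_nodes -= job.size
    return to_start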
B. Overview of Reinforcement Learning
Reinforcement learning (RL) is a type of machine learning technique that studies how agents situated in stochastic environments can learn optimal policies through interaction with their environment [24]. The agent's environment is described by an abstraction called a Markov Decision Process (MDP) with four basic components: state space $S$, action space $A$, reward $R$, and state transition probability $P$. In Markov decision processes, a learning agent interacts with a dynamic environment in discrete timesteps. At each timestep $t$, the agent observes the state $s_t \in S$ and takes an action $a_t \in A(s_t)$. Upon taking the action, the environment transitions to a new state $s_{t+1}$ with transition probability $P(s_{t+1} \mid s_t, a_t)$ and provides a reward $r_t$ to the agent as feedback on the action. The process continues until the agent reaches a terminal state. The goal of the agent is to find a policy $\pi(s)$, mapping a state to an action (deterministic) or to a probability distribution over actions (stochastic), which maximizes the long-term (discounted) cumulative reward $\sum_{t'=t}^{T} \gamma^{t'} r_{t'}$. The discount factor $\gamma$ is between 0 and 1. The smaller $\gamma$ is, the less important future rewards are.

In practice, the state and action space is often too large to be stored in a lookup table. It is common to use function approximators with a manageable number of adjustable parameters $\theta$ to represent the components of agents. Using a deep neural network with reinforcement learning is often called deep reinforcement learning [25]. The high representational power of deep neural networks enables reinforcement learning to solve complex decision-making problems, such as playing Atari and Go games [11], [12].

Policy gradient and Q-learning are the most popular RL algorithms [26], [13]. Policy gradient methods directly parameterize the policy $\pi_\theta(s)$ and optimize the parameters $\theta$ in the neural network by gradient descent. In Q-learning algorithms, an agent chooses an action at a given state that maximizes the Q-value, i.e., the cumulative reward over all successive steps. A Q-table is a lookup table containing the Q-value for all state-action pairs. To address an overwhelming number of state-action pairs, neural networks are often used to approximate the Q-table, and these methods are generally called deep Q-learning (DQL). DQL learns by approximating the optimal action-value function $Q^*_\theta(s, a)$. Policy gradient methods are generally believed to be applicable to a wider range of problems and to converge faster, but tend to converge to a local optimum. On the other hand, Q-learning methods are more difficult to converge, but once they converge, they tend to have more stable performance than policy gradient methods [27].
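To make the return concrete, the short sketch below computes the discounted cumulative reward that the agent maximizes, using the standard $\gamma^{t'-t}$ discounting; the function name and example values are ours.

```python
def discounted_return(rewards, gamma=0.99):
    """G_t = sum over t' >= t of gamma^(t'-t) * r_t', computed backwards."""
    G = 0.0
    returns = []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    return list(reversed(returns))

# Example: three timesteps of reward with gamma = 0.9.
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # [2.62, 1.8, 2.0]
```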
C. Technical Challenges

Designing deep-reinforcement-learning-driven cluster scheduling for HPC is challenging. Several key obstacles are listed below.
Avoidance of job starvation.
HPC jobs have drastically different characteristics: user jobs may range from a single-node job to a whole-system job, and job runtimes may vary from seconds to hours or even days. This feature presents a unique challenge to HPC systems: jobs, especially large-sized jobs, tend to starve if small-sized jobs keep arriving and skipping over large jobs due to insufficient available resources. Simply applying existing RL-based scheduling methods can lead to severe job starvation. We have tested a state-of-the-art policy gradient method with a real workload trace. Our results show that large jobs, e.g., 4K-node jobs, were held in the queue for 170 days. Typically, large jobs have high priority at HPC sites, especially capability computing facilities. Such long wait times discourage users from submitting large jobs.
Incorporation of backfilling.
Backfilling is a key strategy to reduce resource fragmentation in HPC. Currently, the well-known EASY backfilling strategy uses the simple first-fit method to select jobs for backfilling, i.e., choosing the first job which can fit in the backfill hole. We argue that, similar to the selection of jobs for scheduling, the selection of jobs for backfilling has many possible options, hence offering the potential for more aggressive optimization.
Scalable state and action representation.
To transform a scheduling problem into a reinforcement learning problem, we must first capture the dynamic environment, e.g., the status of thousands of nodes and hundreds of waiting jobs, in a state vector as an input to the neural network. Additionally, it is vitally important to map the extremely large action space to an output of the neural network of manageable size. The action space grows exponentially with the number of jobs in the queue. Working directly with a large action space can be computationally demanding.
Effective agent training.
An RL agent learns to improve its policy by experiencing diverse situations. An effective training process should be capable of efficiently and rapidly building a converged model based on sample data in order to make decisions without being explicitly programmed to do so. It is also challenging to select training data that reliably cover as much of the state space as possible and generalize to new or unseen situations.

III. DESIGN OF DRAS

Now we present DRAS, a new scheduling method tailored for HPC workloads and empowered by deep reinforcement learning. DRAS, illustrated in Figure 1, represents the scheduler as an agent that decides when and which jobs should be allocated to compute nodes, with the objective of optimizing scheduling performance. At a given scheduling instance $t$, the agent first encodes the job queue and system state into a vector $s_t$, and passes the vector to the neural network (§III-A). Next, DRAS uses a hierarchical neural network for decision making (§III-B). The agent takes an action by selecting jobs from the wait queue according to the output of the neural network and then receives a reward signal from the environment. The goal of DRAS is to choose actions (i.e., to select jobs) over time so as to maximize the cumulative reward. DRAS trains its neural network through simulation with massive datasets composed of both real and synthetic workload traces (§III-C). Once the model has converged, we deploy the DRAS agents into operation. The DRAS agents automatically adjust their neural network parameters during operation to handle workload changes.

A. State, Reward and Action Representation
The DRAS agent receives three observations from the environment: (1) the job wait queue, (2) cluster node status, and (3) the reward, a scalar indicating the quality of the action.
State.
We encode each waiting job as a vector of [2, 2] containing four pieces of information: job size, job estimated runtime, priority (1 means high priority; 0 means low priority), and job queued time (time elapsed since submission). We encode each node as a vector of [1, 2] with two pieces of information. The first cell is a binary representing node availability (1 means available; 0 means not available). If the node is occupied, we use the user-supplied runtime estimate and job start time to calculate the node's estimated available time. The second cell represents the time difference between the node's estimated available time and the current time. If the node is available, we set the second cell to zero. We concatenate job information and node information into a fixed-size vector as the input to the neural network.

Fig. 1: DRAS overview. The agent (at the bottom) represents the scheduler; the environment (at the top) comprises the rest of the system, including the job wait queue and the HPC cluster. The DRAS agent first observes the environment state, including job state and system state, and encodes the state into a vector. The agent's neural network takes the vector as input and outputs a scheduling action. The environment executes the action and provides a reward indicating the quality of the action. The agent uses the reward to improve its policy automatically.
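As a concrete illustration of this encoding, the sketch below builds the fixed-size state matrix. How the four job fields are laid out across the two rows of each [2, 2] job block is our assumption, as are the attribute names.

```python
import numpy as np

def encode_state(window_jobs, nodes, now, W):
    """Build the [2W + N, 2] state matrix described above (illustrative layout)."""
    rows = []
    for job in window_jobs[:W]:                        # each job -> a [2, 2] block
        rows.append([job.size, job.runtime_est])
        rows.append([job.priority, now - job.arrival])  # queued time
    while len(rows) < 2 * W:                            # pad when queue < window
        rows.append([0.0, 0.0])
    for node in nodes:                                  # each node -> a [1, 2] row
        if node.busy:
            rows.append([0.0, node.est_available - now])  # time until free
        else:
            rows.append([1.0, 0.0])                       # available now
    return np.asarray(rows, dtype=np.float32)
```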
Reward.
Reward functions reflect scheduling objectives. It is hard to offer a one-size-fits-all reward function due to diverse site objectives. HPC systems can be broadly classified as capability computing or capacity computing. Capability computing facilities are commonly interested in prioritizing capability jobs (i.e., large jobs) [14] and optimizing resource utilization. An example reward for capability computing could be as follows:

$$w_1 \times \frac{\overline{t_i}}{t_{max}} + w_2 \times \frac{\overline{n_i}}{N} + w_3 \times \frac{N_{used}}{N} \qquad (1)$$

where $\overline{t_i}$ denotes the average wait time of selected jobs and $t_{max}$ is the maximum wait time of jobs in the queue. Similarly, $\overline{n_i}$ is the average job size of the selected jobs; $N$ is the total number of nodes in the system; $N_{used}$ is the number of occupied nodes. In other words, this reward function intends to balance three factors: to prevent job starvation, to promote capability jobs, and to improve system utilization. The weights can be tuned by system administrators based on the site priority. For example, a higher $w_1$ value could meet a more stringent requirement on job starvation.

Capacity computing facilities typically focus on fast turnaround time and short wait time [28]. For capacity computing facilities, we may define the reward function as:

$$\sum_{j \in J} \frac{-1/t_j}{c} \qquad (2)$$

where $J$ is the set of jobs in the queue and $c$ is the number of waiting jobs at the current timestep. This reward function aims to minimize the average job wait time.
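The two reward functions translate directly into code. In the sketch below, the field names are illustrative, and we read $t_j$ in Equation (2) as a per-job time attribute (here taken to be the job's runtime estimate, an assumption on our part).

```python
def capability_reward(selected, queue, n_used, n_total, now,
                      w1=1/3, w2=1/3, w3=1/3):
    """Equation (1): starvation avoidance + capability jobs + utilization."""
    t_avg = sum(now - j.arrival for j in selected) / len(selected)
    t_max = max(now - j.arrival for j in queue)   # longest current wait
    n_avg = sum(j.size for j in selected) / len(selected)
    return w1 * t_avg / t_max + w2 * n_avg / n_total + w3 * n_used / n_total

def capacity_reward(queue):
    """Equation (2): a penalty accumulates while jobs sit in the queue."""
    c = len(queue)
    return sum(-1.0 / j.runtime_est for j in queue) / c
```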
Action.
DRAS processes the input vector and outputs a vector as the scheduling action. The output vector specifies which jobs are selected for job execution (i.e., immediate execution, reserved execution, and backfilled execution). Intuitively, at each scheduling instance, the scheduler selects multiple jobs simultaneously. This leads to an explosive number of actions and is infeasible to train efficiently. Instead, DRAS decomposes one scheduling decision (i.e., selecting several jobs in one shot) into a series of job selections, i.e., selecting one job at a time.
B. Two-level Neural Network
A key challenge in applying deep reinforcement learning to HPC cluster scheduling is to prevent job starvation. State-of-the-art RL methods focus on scheduling jobs for immediate execution and lack a reservation strategy, hence leading to job starvation. To overcome this obstacle, we build a hierarchical neural network structure, in which the level-1 network selects jobs for immediate or reserved execution and the level-2 network identifies jobs for backfilling.

More specifically, at a given scheduling instance, the scheduler first enforces a window at the front of the job wait queue. The window alleviates job starvation problems by giving higher priority to older jobs. The level-1 network selects a job from the window. If the number of available nodes is greater than or equal to the job size, the agent marks the job as a ready job and sends it for immediate execution on the system. This process repeats until the job selected from the window has a size greater than the number of available nodes. The agent marks that job as a reserved job and reserves a set of nodes for its execution on the system at the earliest available time. At this point, the agent moves to the level-2 network. Unlike the first-fit strategy used in the traditional backfilling method, we use the neural network to make backfilling decisions so as to minimize resource waste. Toward this end, we fill the window with job candidates, i.e., the jobs that can fit into the holes in the system before the reserved time. The agent selects one job at a time for the system to backfill. The process at the level-2 network repeats until there are no more job candidates for backfilling.

In a nutshell, the decision making of DRAS is to select jobs and execute them in three modes:
1) ready job: the jobs are selected to run immediately.
2) reserved job: the jobs are selected to start at the earliest reserved time.
3) backfilled job: the jobs are selected to fill the holes before the reserved time.

The same neural network architecture is used for both the level-1 and level-2 networks. The entire two-level neural network is trained jointly using deep reinforcement learning to optimize scheduling performance. Each network consists of five layers: an input layer, a convolution layer, two fully-connected layers, and an output layer. The input layer is connected to a convolution layer with a 1 × 2 filter to extract job or node status information in each row. The convolution layer is connected to two fully-connected layers activated by the leaky rectifier [29]. The second fully-connected layer is connected to the output layer. We denote all of the parameters in the neural network jointly as $\theta$.
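The following sketch summarizes one scheduling instance of this two-level decision process. The agent and cluster helper methods (level1_select, fits_before_reservation, etc.) are hypothetical names standing in for the mechanics described above, not APIs from the DRAS code.

```python
def dras_step(agent, queue, cluster, W):
    """One DRAS scheduling instance: level 1 picks ready/reserved jobs,
    level 2 backfills around the reservation (sketch)."""
    ready, reserved, backfilled = [], None, []
    window = queue[:W]                                  # oldest jobs get priority
    while window:
        job = agent.level1_select(window, cluster)      # one job per inference
        window.remove(job)
        if job.size <= cluster.free_nodes():
            cluster.start(job)                          # ready job
            ready.append(job)
        else:
            reserved = job                              # reserved job: earliest
            cluster.reserve(job)                        # time enough nodes free
            break
    if reserved is not None:
        # Level 2: candidates are jobs fitting the holes before the reservation.
        candidates = [j for j in queue
                      if j is not reserved and j not in ready
                      and cluster.fits_before_reservation(j)]
        while candidates:
            job = agent.level2_select(candidates, cluster)
            cluster.start(job)                          # backfilled job
            backfilled.append(job)
            candidates = [j for j in candidates
                          if j is not job and cluster.fits_before_reservation(j)]
    return ready, reserved, backfilled
```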
In this study, we develop two DRAS agents: DRAS-PG and DRAS-DQL. PG denotes policy gradient, and DQL denotes deep Q-learning. Selecting both PG and DQL lets us systematically evaluate these popular reinforcement learning methods under a unified environment.
DRAS-PG uses the neural network to parameterize the scheduling policy as $\pi_\theta(s_k, a_k)$ (i.e., the probability of taking action $a_k$ in state $s_k$). The input of DRAS-PG is a 2D vector of $[2 \times W + N, 2]$, where $W$ is the window size and $N$ is the total number of nodes in the system. The output of the neural network contains $W$ neurons, each denoting the probability of selecting one job out of the $W$ jobs. A scheduling action is stochastically drawn from the $W$ jobs following their probability distribution. We employ softmax [29] as the activation function to ensure the sum of the output values equals 1.0. If the number of waiting jobs is less than the window size $W$, we mask the invalid actions in the output by rescaling all valid actions. In terms of learning, the DRAS-PG method updates the neural network parameters $\theta$ by:

$$\theta \leftarrow \theta + \alpha \sum_{k=1}^{K} \nabla_\theta \log \pi_\theta(s_k, a_k) \Big( \sum_{k'=k}^{K} r_{k'} - b_k \Big) \qquad (3)$$

Here, $K$ denotes the total number of actions taken in the parameter update, $\alpha$ is the learning rate used with the Adam optimizer [30], and $b_k$ is the baseline used to reduce the variance of the policy gradient. We set $b_k$ to the cumulative reward from step $k$ onwards, averaged over all past parameter updates.

DRAS-DQL uses the neural network to approximate the Q-value as $Q_\theta(s_k, a_k)$ (i.e., the expected cumulative reward of taking action $a_k$ in state $s_k$). The DRAS-DQL network processes one job at a time and produces the expected Q-value for this job. We use the same network to approximate the Q-value for all the jobs in the window $W$. The input of the DRAS-DQL neural network is a 2D vector of $[2 + N, 2]$, containing one job's information and $N$ nodes' information. The output is a single neuron corresponding to the expected Q-value of the job. After processing all the jobs in the window, the agent normally selects the job with the highest Q-value. In order to explore various actions, the agent randomly chooses a job, instead of the job with the highest Q-value, with probability $\epsilon$. In practice, $\epsilon$ is very high at the beginning of the training to ensure that the agent explores various state-action pairs, and it decays over time as the agent becomes more experienced. In our study, we set $\epsilon = 1.0$ at the beginning of the training and let it decay at a fixed rate. In training, the parameters $\theta$ of the DRAS-DQL network are updated by:

$$\theta \leftarrow \theta - \alpha \sum_{k=1}^{K} \nabla_\theta Q(s_k, a_k) \Big( \underbrace{r_k + \max_a Q(s_{k+1}, a)}_{\text{new value}} - \underbrace{Q(s_k, a_k)}_{\text{old value}} \Big) \qquad (4)$$

Here, the old value $Q(s_k, a_k)$ is the expected Q-value of taking action $a_k$ at state $s_k$. After taking action $a_k$, we can compute a more accurate expected Q-value (i.e., the new value) by adding the immediate reward $r_k$ and the expected cumulative future reward. DQL networks learn by minimizing the loss between the new value and the old value.
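A minimal NumPy rendering of the two learning rules follows: the PG update of Equation (3) with undiscounted reward-to-go and a baseline, and the TD error underlying Equation (4). DRAS itself trains with the Adam optimizer in TensorFlow, so the plain additive update here is only illustrative.

```python
import numpy as np

def pg_update(theta, grad_logpi, rewards, baselines, alpha=0.001):
    """Equation (3): theta += alpha * grad log pi(s_k, a_k) *
    (reward-to-go from step k minus baseline b_k), summed over k."""
    returns = np.cumsum(rewards[::-1])[::-1]   # sum_{k'=k}^{K} r_k'
    for g, ret, b in zip(grad_logpi, returns, baselines):
        theta = theta + alpha * g * (ret - b)
    return theta

def dql_td_error(q_sa, r_k, q_next):
    """Equation (4): (new value) - (old value), where the new value is
    r_k + max_a Q(s_{k+1}, a); DQL minimizes the loss between the two."""
    return (r_k + max(q_next)) - q_sa
```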
C. Training Strategy

At the beginning of the training, we initialize DRAS's neural network parameters $\theta$ to random numbers. We train the neural network in episodes, and the network parameters $\theta$ are updated with episodic training until convergence. For each episode, the environment is first set to its initial state (i.e., all nodes are idle and no jobs run on the system). We train DRAS via trace-based simulation, in which job events occur at specific instants in time according to the job traces. DRAS observes the scheduling state, makes scheduling decisions according to its neural network, and collects the scheduling reward. Every ten scheduling instances, DRAS updates its parameters $\theta$ based on the collected observations and then clears its memory for the next update. An episode terminates when all jobs in the jobset have been scheduled. We monitor the progress of the training by taking a snapshot of the model after each episode. The next episode uses a new jobset to refine the previous model.

The jobsets used in training determine the convergence and quality of the DRAS model. To learn a converged model, we follow the principle of gradual improvement: DRAS starts with simple average cases and gradually improves its capability with unseen rare cases.
Specifically, we train DRAS using a three-phase training process in which three types of jobsets are used in order: (1) a set of jobs sampled from real job traces, (2) a period of real job traces, and (3) a set of synthetic jobs generated according to job patterns on the target system. The sampled jobsets have controlled job arrival rates, providing the easiest learning environment. Once DRAS can make good scheduling decisions under the controlled environment, training on the real job traces with various job arrival patterns allows DRAS to learn more challenging situations. The final phase is to train DRAS with synthetic jobsets, which enables DRAS to experience a variety of potential states that might not be seen in the first two types of jobsets. We will show that this three-phase training process leads to fast convergence (§IV-D). DRAS is implemented in TensorFlow [31] and available as open source on GitHub [32].
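Putting the pieces together, the episodic training described in this subsection can be sketched as follows; SchedulingSimulator stands in for the trace-driven simulation environment and, like the other helper names, is our illustrative assumption.

```python
def train(agent, jobsets, update_every=10):
    """Episodic three-phase training: jobsets ordered sampled -> real ->
    synthetic; parameters update every ten scheduling instances."""
    for episode, jobset in enumerate(jobsets):
        env = SchedulingSimulator(jobset)   # initial state: all nodes idle
        memory, step = [], 0
        while not env.done():               # episode ends when all jobs scheduled
            state = env.observe()
            action = agent.act(state)       # job selection via the neural network
            reward = env.step(action)
            memory.append((state, action, reward))
            step += 1
            if step % update_every == 0:
                agent.update(memory)        # e.g., Eq. (3) for PG, Eq. (4) for DQL
                memory.clear()
        agent.snapshot(episode)             # monitor convergence per episode
```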
IV. EXPERIMENTAL SETUP
A. Comparison Methods
We compare the following scheduling methods:
• FCFS represents FCFS with EASY backfilling, which is the default scheduling policy deployed on many production supercomputers [16]. FCFS prioritizes jobs based on their arrival times, and EASY backfilling is used to reduce resource fragmentation [1].
• BinPacking is a widely used heuristic method for scheduling in datacenters [7]. It iteratively allocates the largest runnable jobs (i.e., jobs whose size is less than or equal to the number of available nodes in the system) until the system cannot accommodate any further jobs.
• Random randomly selects runnable jobs from the queue to execute until no more jobs in the queue can fit into the system. Since DRAS performs similarly to Random at the beginning of training by randomly exploring the action space, DRAS outperforming Random demonstrates that DRAS gradually learns to improve its scheduling actions.
• Optimization denotes a suite of scheduling methods that formulate cluster scheduling as an optimization problem [3], e.g., to minimize average job wait time. In our experiments, the optimization problem is formulated as a 0-1 knapsack problem which is solved using dynamic programming. For a fair comparison, we use the same scheduling objectives (i.e., Equations (1) and (2)) for Optimization and for DRAS.
• Decima-PG denotes a modified version of Decima [8]. As mentioned earlier, Decima is not designed for scheduling HPC jobs. Hence, we use a modified version of Decima that skips the graph neural network and adopts our state representation presented in §III-A. Note that Decima-PG is an RL agent without a hierarchical network structure. Hence it acts as the baseline to demonstrate the benefits of the hierarchical design of DRAS.
• DRAS-PG and DRAS-DQL denote our DRAS agents.
B. Trace-based Simulation
We compare these scheduling policies through trace-based simulation. Specifically, a trace-based, event-driven scheduling simulator called CQSim is used in our experiments [2], [33], [34]. CQSim contains a queue manager and a scheduler that can plug in different scheduling policies. It emulates the actual scheduling environment. A real system takes jobs from user submission, while CQSim takes jobs by reading the job arrival information in the trace. Rather than executing jobs on the system, CQSim simulates execution by advancing the simulation clock according to the job runtime information in the trace.
TABLE II: Theta and Cori workloads.

                 Theta                    Cori
Location         ALCF                     NERSC
Scheduler        Cobalt                   Slurm
System Type      Capability computing     Capacity computing
Compute Nodes    4,392 (4,392 KNL)        12,076 (2,388 Haswell; 9,688 KNL)
Trace Period     Jan. 2018 - Dec. 2019    Apr. 2018 - Jul. 2018
Number of Jobs   121,837                  2,607,054
Max Job Length   1 day                    7 days
Fig. 2: Job characterization of Theta at ALCF and Cori at NERSC. The outer circle shows the number of jobs in each job size category. The inner circle presents the total core hours consumed by each job size category.
C. Workload Traces
In our study, two real workload traces are used. Table II summarizes the two traces collected from production systems, and Figure 2 gives an overview of the job size distributions on these supercomputers. We select these traces as they represent different workload profiles: (1) capability computing focusing on solving large-sized problems, and (2) capacity computing solving a mix of small-sized and large-sized problems. The first workload is a two-year job log from Theta [35], the production HPC system located at ALCF. Theta is a capability computing system. The smallest job allowed on Theta is 128 nodes [36]. Only 2.25% of jobs have dependencies. For jobs with dependencies, the scheduler hides them from scheduling until all their parents have been executed. On Theta, 32 nodes are dedicated to running debugging jobs and the remaining 4,360 nodes are dedicated to user jobs. In our experiments, we set the system size to 4,360 and filter out all debugging jobs in the trace. We use the first 2 months of data for training, the next month of data for validating model convergence, and the remaining 21 months of data for testing.

The second trace is a four-month job log from Cori [37]. Cori is a capacity computing system deployed at NERSC. A majority of its jobs consume one or several nodes (Figure 2). The longest job executed for seven days. We use the first 2 weeks of data for training, the next 1 week of data for validating model convergence, and the last 15 weeks of data for testing.

D. DRAS Training
The details of the RL architectures for these systems are listed in Table III. Take the neural network of DRAS-PG on Theta as an example. The input of the neural network is a vector of [4460, 2]. We use a convolutional layer with 4460 neurons and two fully-connected layers with 4000 and 1000 neurons respectively. The output layer contains 50 neurons representing the jobs in the window. In total, the neural network has 21,890,053 trainable parameters.

TABLE III: DRAS network configurations for Theta and Cori.

                          Theta                      Cori
                          DRAS-PG      DRAS-DQL      DRAS-PG       DRAS-DQL
Input                     [4460, 2]    [4362, 2]     [12176, 2]    [12078, 2]
Convolutional Layer       4460         4362          12176         12078
Fully Connected Layer 1   4000         4000          10000         10000
Fully Connected Layer 2   1000         1000          4000          4000
Output                    50           1             50            1
Trainable Parameters      21,890,053   21,449,004    161,960,053   161,764,004
For the capability computing facility Theta, we define its reward as Equation (1). We set the weights $w_1 = w_2 = w_3 = 1/3$. For the capacity computing facility Cori, we set the reward as Equation (2). The learning rate $\alpha$ is set to 0.001.

Fig. 3: Job patterns of the Theta training dataset: (a) weekly job submission pattern; (b) daily job submission pattern; (c) job size distribution; (d) accuracy of job runtime estimates supplied by users.

We use 100 jobsets composed of 320,000 jobs for DRAS training on Theta. We collect 9 sampled jobsets by randomly selecting jobs from the original training trace and modeling job arrival times as a Poisson distribution following the average inter-arrival time of the original trace. We split the original Theta training dataset into nine one-week jobsets. We generate 82 synthetic jobsets that mimic Theta workload patterns in terms of hourly and daily job arrivals and distributions of job sizes and runtimes (Figure 3).

We validate the trained DRAS agent with an unseen validation dataset (i.e., March of 2018). Figure 4 compares the convergence rates obtained by training DRAS with different jobset orderings.

Fig. 4: Comparison of quality and convergence of DRAS-PG when training it with different jobset orders (§III-C).

We make two key observations. First, training only with real jobsets (the first 9 episodes of the orange line) cannot obtain a converged model. To achieve a converged model, more jobsets are needed to train our agents. Second, training order plays an important role in performance. Training in the order of sampled, real, and synthetic jobsets achieves the best result. While training with real jobsets first can also obtain a converged model, the performance is not as good as the case of training with sampled jobsets first. Training with synthetic jobsets first results in slow convergence. In summary, in order to generate a converged and high-quality model, DRAS needs to first learn from simple averaged cases (sampled jobsets) and then gradually move to more complicated special cases (real and synthetic jobsets).
Fig. 5: The total reward collected by the different scheduling methods on the Theta validation dataset.

Figure 5 shows the learning curves of the different scheduling methods. Our three-phase training process allows DRAS to quickly learn, surpass the other competing methods, and converge to optimal solutions. Based on these results, we use the model trained after the 50th episode for the testing reported in §V. We perform a similar training and validation process on Cori. We train DRAS using 100 jobsets (20,000,000 jobs) composed of sampled traces, real traces, and synthetic traces. Both DRAS methods converge at 40 episodes. Hence, we use the model trained after the 40th episode for testing.
E. Evaluation Metrics
There are two classes of metrics for evaluating cluster scheduling: user-level metrics and system-level metrics. In our experiments, we measure four well-established metrics (a short code sketch follows the list):
• Job wait time is a user-level metric. It measures the interval between job submission and job start time. In our experiments, we analyze average job wait time, maximum job wait time, as well as the distribution of job wait times.
• Job response time is a user-level metric which measures the interval between job submission and completion.
• Job slowdown is another user-level metric. It measures the ratio of the job response time to its actual runtime.
• System utilization is a system-level metric. It measures the ratio of the node-hours used for useful job execution to the total elapsed node-hours.
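These four metrics translate directly into code; below is a small sketch with illustrative job fields (arrival, start, and end timestamps in hours).

```python
def job_metrics(job):
    """User-level metrics for a completed job."""
    wait = job.start - job.arrival                 # job wait time
    response = job.end - job.arrival               # job response time
    slowdown = response / (job.end - job.start)    # response / actual runtime
    return wait, response, slowdown

def system_utilization(jobs, n_total, t_begin, t_end):
    """System-level metric: used node-hours / total elapsed node-hours."""
    used = sum(j.size * (j.end - j.start) for j in jobs)
    return used / (n_total * (t_end - t_begin))
```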
V. EXPERIMENTAL RESULTS
Now we present the experimental results of applying the trained DRAS to the test data (i.e., the 21-month Theta log and the 15-week Cori log). Our experiments intend to answer the following questions:
1) Does DRAS outperform existing scheduling methods? (§V-A)
2) Is DRAS capable of preventing jobs from starvation? (§V-B)
3) If DRAS outperforms other methods, where does the performance gain come from? (§V-C)
4) Can DRAS adapt to workload changes? (§V-D)
Fig. 6: Overall scheduling performance comparison using Kiviat graphs: (a) Theta traces (left) and (b) Cori traces (right). We use the reciprocal of average job wait time, the reciprocal of maximum job wait time, the reciprocal of average slowdown, and the reciprocal of average job response time in the plots. All metrics are normalized to the range of 0 to 1: 1 means a method achieves the best performance among all methods and 0 means a method obtains the worst performance. The larger the area is, the better the overall performance is.
A. Scheduling Performance
The quality of a scheduling method needs to be evaluated by multiple metrics, including both system-level and user-level metrics. Figure 6 presents the overall scheduling performance obtained by the different scheduling methods. DRAS yields the best result. DRAS-PG achieves a slightly better result on user-level metrics, while DRAS-DQL obtains the best system-level performance. Although FCFS has the lowest maximum wait time, it has poor performance on the rest of the metrics. Both DRAS methods outperform Optimization, suggesting that the DRAS agents learn to select jobs that not only maximize the immediate reward, but also potentially improve performance in the future through maximizing cumulative reward. Decima-PG achieves good performance on system utilization, but it fails to improve user-level metrics. BinPacking and Random have the worst performance, because they greedily select jobs one by one, which ignores the best job combinations. Recall that DRAS applies a similar strategy to Random at the beginning of training by randomly exploring various actions; the better performance of DRAS indicates that our RL models developed good policies through learning.

We also notice that some methods have inconsistent performance on the user-level metrics. For example, FCFS achieves the lowest maximum job wait time; however, it suffers from high average job wait time. More detailed analysis of job wait time is needed. Due to space limitations, we only present the in-depth analysis of job wait time on Theta in the following subsections.
B. Job Starvation Analysis
Figure 7 shows job wait times under different job sizes and categories. We make three key observations from this figure. First, DRAS and FCFS prevent jobs from starvation, while Decima-PG, BinPacking, and Random suffer severe job starvation. The maximum job wait times of DRAS-PG and DRAS-DQL are 16 days and 20 days respectively, which are only slightly higher than the maximum job wait time of FCFS (13 days) and are similar to Trace. Although Optimization and DRAS aim at the same scheduling objectives, the maximum wait time of Optimization is twice as long as that of DRAS. The maximum job wait times of Decima-PG, BinPacking, and Random are 170 days, 95 days, and 170 days respectively, indicating they are not suitable for HPC cluster scheduling. Second, in Decima-PG, BinPacking, and Random, large-sized jobs wait noticeably longer than small-sized jobs. These methods inherently give higher priority to small-sized jobs at the expense of large-sized jobs, because they lack a reservation strategy to reserve resources for large-sized jobs. This bias toward small-sized jobs is not ideal for HPC scheduling, especially for capability systems. In contrast, the methods with a reservation strategy, i.e., FCFS and DRAS, do not show a significant difference between small jobs' wait times and large jobs' wait times. This demonstrates that DRAS and FCFS are relatively fair scheduling policies. Third, if we look at the methods using reservation and backfilling strategies (i.e., FCFS and DRAS), we notice that almost all large jobs are executed through reservation, while the majority of small jobs are executed through backfilling. In short, these results demonstrate that DRAS is capable of preventing job starvation, mainly due to the incorporation of job reservation and backfilling in our DRAS design.
C. Source of DRAS Performance Gain
Table IV presents job distributions under the different scheduling methods. We notice that although DRAS backfills a majority of the jobs, most node-hours are consumed by reserved jobs. If we read Table IV along with Figure 7, we observe that there are a few jobs with wait times of over 300 hours, and these jobs are mainly allocated through reservation by DRAS. Without reservation, these jobs would wait 2X-10X longer, as happens in Decima-PG, Optimization, BinPacking, and Random. Put together, these results reveal that DRAS learns to achieve the scheduling goals by prioritizing jobs and preventing job starvation through the reservation mechanism embedded in its two-level neural network design.

Fig. 7: Job wait time distributions with respect to job size and job type on Theta. Note that the y-axis scale for Decima-PG, BinPacking, and Random is much larger than those for the others. Trace presents the job wait times extracted from the original log, which can be used as the baseline. Since we do not have the job type information, all jobs are marked in grey. Ellipses in the plots indicate Decima-PG, BinPacking, and Random lead to severe job starvation.

TABLE IV: Job distributions in different execution modes (defined in §III-B) on Theta.

              Backfilled             Ready                  Reserved
              jobs     core hours    jobs     core hours    jobs     core hours
Optimization  0%       0%            100%     100%          0%       0%
Decima-PG     0%       0%            100%     100%          0%       0%
BinPacking    0%       0%            100%     100%          0%       0%
Random        0%       0%            100%     100%          0%       0%
FCFS          79.25%   30.45%        9.88%    16.99%        10.87%   52.56%
DRAS-PG       83.76%   33.67%        8.63%    11.29%        7.61%    55.04%
DRAS-DQL      84.83%   34.17%        6.84%    10.91%        15.17%   54.92%
Fig. 8: Bar plot of job wait times, grouped by job execution modes. As compared to FCFS, DRAS learns to intelligently select jobs for immediate execution, reservation, or backfilling so as to maximize the overall scheduling performance.

Although both FCFS and DRAS apply backfilling strategies, DRAS performs significantly better in terms of average job wait time (Figure 6). In Figure 8, we notice that DRAS largely reduces the wait time of ready and backfilled jobs at the expense of a slightly higher wait time for reserved jobs. FCFS schedules jobs in their arrival order, while DRAS selects jobs from the queue aiming to balance three objectives (i.e., minimizing average job wait time, prioritizing large jobs, and maximizing system utilization). Therefore, DRAS learns to pick backfilled and ready jobs that lead to lower average job wait time and to select jobs queued for long times to avoid job starvation. The better performance of DRAS demonstrates that DRAS learns to intelligently select jobs for resource allocation so as to maximize the long-term scheduling performance.
D. Adaptation to Workload Change
Figure 9 shows the total core hours and average job wait times per week during the testing period. The system loads are dynamically changing. Several dramatic demand surges put severe pressure on scheduling performance. The bottom figure compares how DRAS agents respond to the workload changes compared to the other methods. It is clear that DRAS achieves greater wait time reduction when the workload surges.

Fig. 9: When the system load is high, DRAS dynamically adjusts its network parameters to reduce average job wait time.

Recall that DRAS agents continuously adjust their network parameters to minimize average job wait time. On the other hand, the policy of the static methods is predefined and fixed, which performs poorly under heavy workloads. A system change, such as adding more nodes, requires DRAS to re-train the model. In our experiments, we spent less than 3 hours on a personal computer to obtain a converged model. Considering that system changes are not very frequent and DRAS can avoid complicated manual policy tuning, it is worthwhile to re-train the model when the system changes.
E. Runtime Overhead
In our experiments, DRAS-PG takes less than 1 second and DRAS-DQL takes less than 2 seconds for each network parameter update during testing. Note that the experiments are conducted on a personal computer configured with an Intel quad-core 2.6 GHz CPU and 16 GB of memory. In practice, HPC cluster scheduling is typically required to make decisions in 15-30 seconds [3]. In other words, the DRAS agents impose trivial overhead, hence being feasible for online deployment.

VI. CONCLUSION
In this paper, we have presented DRAS, an RL-empowered HPC cluster scheduling agent. DRAS represents the scheduling policy as a hierarchical neural network and automatically learns customized policies through training with system-specific workloads. Our results demonstrate that DRAS is capable of grasping system- and workload-specific characteristics, preventing large jobs from starvation, adapting to workload changes without human intervention, and outperforming existing scheduling policies by up to 45%. We hope it opens up exciting opportunities to rethink HPC cluster scheduling.

ACKNOWLEDGMENT
This work is supported in part by US National Science Foundation grants CNS-1717763 and CCF-1618776. T. Childers, W. Allcock, P. Rich, and M. Papka are supported by the Argonne Leadership Computing Facility, which is a U.S. Department of Energy Office of Science User Facility operated under contract DE-AC02-06CH11357. Job logs from the Cori system were provided by the National Energy Research Scientific Computing Center (NERSC), a U.S. Department of Energy Office of Science User Facility operated under Contract No. DE-AC02-05CH11231.

REFERENCES

[1] A. Mu'alem and D. Feitelson. Utilization, Predictability, Workloads, and User Runtime Estimates in Scheduling the IBM SP2 with Backfilling. TPDS'01.
[2] X. Yang, Z. Zhou, S. Wallace, Z. Lan, W. Tang, S. Coghlan, and M. Papka. Integrating Dynamic Pricing of Electricity into Energy Aware Scheduling for HPC Systems. SC'13.
[3] Y. Fan, Z. Lan, P. Rich, W. Allcock, M. Papka, B. Austin, and D. Paul. Scheduling Beyond CPUs for HPC. HPDC'19.
[4] H. Sun, P. Stolf, J. Pierson, and G. Costa. Energy-Efficient and Thermal-Aware Resource Management for Heterogeneous Datacenters. Sustainable Computing: Informatics and Systems, 2014.
[5] P. Qiao, X. Wang, X. Yang, Y. Fan, and Z. Lan. Preliminary Interference Study About Job Placement and Routing Algorithms in the Fat-Tree Topology for HPC Applications. CLUSTER, 2017.
[6] P. Qiao, X. Wang, X. Yang, Y. Fan, and Z. Lan. Joint Effects of Application Communication Pattern, Job Placement and Network Routing on Fat-Tree Systems. ICPP Workshops, 2018.
[7] R. Grandl, G. Ananthanarayanan, S. Kandula, S. Rao, and A. Akella. Multi-resource Packing for Cluster Schedulers. SIGCOMM'14.
[8] H. Mao, M. Schwarzkopf, S. Venkatakrishnan, Z. Meng, and M. Alizadeh. Learning Scheduling Algorithms for Data Processing Clusters. SIGCOMM'19.
[9] A. Sallab, M. Abdou, E. Perot, and S. Yogamani. Deep Reinforcement Learning Framework for Autonomous Driving. Electronic Imaging.
[10] T. Johannink, S. Bahl, A. Nair, J. Luo, A. Kumar, M. Loskyll, J. Ojea, E. Solowjow, and S. Levine. Residual Reinforcement Learning for Robot Control. ICRA'19.
[11] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing Atari with Deep Reinforcement Learning. NIPS Deep Learning Workshop, 2013.
[12] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. Driessche, T. Graepel, and D. Hassabis. Mastering the Game of Go without Human Knowledge. Nature, 2017.
[13] R. Sutton and A. Barto. Reinforcement Learning: An Introduction, Second Edition. MIT Press, 2017.
[14] W. Allcock, P. Rich, Y. Fan, and Z. Lan. Experience and Practice of Batch Scheduling on Leadership Supercomputers at Argonne. JSSPP'17.
[15] L. Yu, Z. Zhou, Y. Fan, M. Papka, and Z. Lan. System-wide Trade-off Modeling of Performance, Power, and Resilience on Petascale Systems. The Journal of Supercomputing, 2018.
[16] M. Jette, A. Yoo, and M. Grondona. SLURM: Simple Linux Utility for Resource Management. JSSPP'03.
[21] SC Poster, 2019.
[22] H. Mao, M. Alizadeh, I. Menache, and S. Kandula. Resource Management with Deep Reinforcement Learning. HotNets'16.
[23] D. Zhang, D. Dai, Y. He, F. Bao, and B. Xie. RLScheduler: An Automated HPC Batch Job Scheduler Using Reinforcement Learning. SC'20.
[24] E. İpek, O. Mutlu, J. Martínez, and R. Caruana. Self-Optimizing Memory Controllers: A Reinforcement Learning Approach. ISCA'08.
[25] Y. LeCun, Y. Bengio, and G. Hinton. Deep Learning. Nature, 2015.
[26] R. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy Gradient Methods for Reinforcement Learning with Function Approximation. NIPS'99.
[27] V. Mnih, A. Badia, M. Mirza, A. Graves, T. Harley, T. Lillicrap, D. Silver, and K. Kavukcuoglu. Asynchronous Methods for Deep Reinforcement Learning. ICML'16.
[28] NERSC Queue Policies. https://docs.nersc.gov/jobs/policy/.
[29] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning.