Data Centers Job Scheduling with Deep Reinforcement Learning
Sisheng Liang, Zhou Yang, Fang Jin, and Yong Chen
Department of Computer Science, Texas Tech University
{sisheng.liang, zhou.yang, fang.jin, yong.chen}@ttu.edu

Abstract.
Efficient job scheduling on data centers under heterogeneous complexity is crucial but challenging, since it involves the allocation of multi-dimensional resources over time and space. To adapt to the complex computing environment in data centers, we propose an innovative Advantage Actor-Critic (A2C) deep reinforcement learning based approach, called A2cScheduler, for job scheduling. A2cScheduler consists of two agents: one, dubbed the actor, is responsible for learning the scheduling policy automatically, and the other, the critic, reduces the estimation error. Unlike previous policy gradient approaches, A2cScheduler is designed to reduce the gradient estimation variance and to update parameters efficiently. We show that A2cScheduler achieves competitive scheduling performance using both simulated workloads and real data collected from an academic data center.
Keywords:
Job scheduling · Cluster scheduling · Deep reinforcement learning · Actor critic.
Introduction

Job scheduling is a critical and challenging task for computer systems, since it involves a complex allocation of limited resources such as CPU/GPU, memory, and I/O among numerous jobs. It is one of the major tasks of the scheduler in a computer system's Resource Management System (RMS), especially in high-performance computing (HPC) and cloud computing systems, where inefficient job scheduling may result in a significant waste of valuable computing resources. Data centers, including HPC systems and cloud computing systems, have become progressively more complex in their architecture [15], configuration (e.g., special visualization nodes in a cluster) [6], and the size of the work and workloads received [3], all of which sharply increase the job scheduling complexity.

The undoubted importance of job scheduling has fueled interest in scheduling algorithms for data centers. At present, the fundamental scheduling methodologies [18], such as FCFS (first-come-first-serve), backfilling, and priority queues that are commonly deployed in data centers, are extremely hard and time-consuming to configure, severely compromising system performance, flexibility, and usability. To address this problem, several researchers have proposed data-driven machine learning methods that are capable of automatically learning scheduling policies, thus reducing human interference to a minimum. Specifically, a series of policy-based deep reinforcement learning approaches have been proposed to manage CPU and memory for incoming jobs [10], schedule time-critical workloads [8], handle jobs with dependencies [9], and schedule data centers with hundreds of nodes [2].

Despite the extensive research into job scheduling, however, the increasing heterogeneity of the data being handled remains a challenge. The difficulties arise from multiple issues. First, policy-gradient-based deep reinforcement learning (DRL) scheduling methods suffer from a high-variance problem, which can lead to low accuracy when computing the gradient. Second, previous work has relied on the Monte Carlo (MC) method to update the parameters, which involves massive calculations, especially when there are large numbers of jobs in the trajectory.

To address these challenges, we propose a policy-value-based deep reinforcement learning scheduling method called A2cScheduler, which can satisfy the heterogeneous requirements from diverse users, improve the space exploration efficiency, and reduce the variance of the policy. A2cScheduler consists of two agents, named the actor and the critic: the actor is responsible for learning the scheduling policy, and the critic reduces the estimation error. The approximate value function of the critic is incorporated as a baseline to reduce the variance of the actor, thus reducing the estimation variance considerably [14]. A2cScheduler updates parameters via the multi-step temporal-difference (TD) method, which speeds up the training process markedly compared with the conventional MC method, because TD updates do not have to wait until the end of an episode. The main contributions are summarized below:

1. To the best of the authors' knowledge, this represents the first time that A2C deep reinforcement learning has been successfully applied to data center resource management.
2. A2cScheduler updates parameters via the multi-step temporal-difference (TD) method, which speeds up the training process compared with the MC method, since TD updates can be applied after every transition rather than only after a complete trajectory (see the sketch below). This is critical for real-world data center scheduling, since jobs arrive in real time and low latency is undeniably important.
3. We tested the proposed approach on both real-world and simulated datasets, and the results demonstrate that our proposed model outperforms many existing widely used methods.
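To make the MC/TD distinction concrete, the sketch below contrasts a Monte Carlo value update, which must wait for a complete trajectory before computing returns, with a one-step TD update that bootstraps from the current value estimate after every transition. The `value` table and the trajectory format are illustrative assumptions, not the paper's implementation.

```python
def mc_update(value, trajectory, alpha=0.1, gamma=0.99):
    """Monte Carlo: wait for the whole episode, then sweep it backwards."""
    g = 0.0
    for state, reward in reversed(trajectory):      # trajectory: [(state, reward), ...]
        g = reward + gamma * g                      # full return from this step onward
        value[state] += alpha * (g - value[state])

def td_update(value, state, reward, next_state, alpha=0.1, gamma=0.99):
    """One-step TD: update immediately after every single transition."""
    target = reward + gamma * value[next_state]     # bootstrap from the current estimate
    value[state] += alpha * (target - value[state])
```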
Related Work

Job scheduling with deep reinforcement learning
Recently, researchers have tried to apply deep reinforcement learning to cluster resource management. A resource manager, DeepRM, was proposed in [10] to manage CPU and memory for incoming jobs. The results show that policy-based deep reinforcement learning outperforms conventional job scheduling algorithms such as Shortest Job First and Tetris [4]. [8] improves exploration efficiency by adding baseline-guided actions for time-critical workload job scheduling. [17] discussed a heuristic-based method to coordinate disaster response. Mao proposed Decima in [9], which can handle jobs with dependencies when a graph embedding technique is utilized. [2] proved that policy-gradient-based deep reinforcement learning can be implemented to schedule data centers with hundreds of nodes.

Fig. 1. The A2cScheduler job scheduling framework.
Actor-critic reinforcement learning
The actor-critic algorithm is one of the most popular algorithms in the reinforcement learning framework [5], which falls into three categories: actor-only, critic-only, and actor-critic methods [7]. Actor-critic methods combine the advantages of actor-only and critic-only methods and, in contrast to critic-only methods, usually have good convergence properties [5]. At the core of several recent state-of-the-art deep RL algorithms is the advantage actor-critic (A2C) algorithm [11]. In addition to learning a policy (actor) π(a|s; θ), A2C learns a parameterized critic: an estimate of the value function v_π(s), which it then uses both to estimate the remaining return after k steps and as a control variate (i.e., baseline) that reduces the variance of the return estimates [13].

Proposed Approach

In this section, we first review the framework of A2C deep reinforcement learning, and then explain how the proposed A2C-based A2cScheduler works for job scheduling on data centers. The rest of this section covers the essential details of model training.
The Advantage Actor-Critic (A2C) algorithm, which combines the policy-based and value-based methods, can overcome the high-variance problem of pure policy gradient approaches. The A2C algorithm is composed of a policy π(a_t|s_t; θ) and a value function V(s_t; w), where the policy is generated by the policy network and the value is estimated by the critic network. The proposed A2cScheduler framework is shown in Figure 1; it consists of an actor network, a critic network, and the cluster environment. The cluster environment includes a global queue, a backlog, and the simulated machines. The queue holds the waiting jobs, and the backlog is an extension of the queue for when there is not enough space for waiting jobs. Only jobs in the queue are allocated in each state.
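As a concrete illustration, below is a minimal Keras sketch of the two networks, assuming a state tensor of shape H×W×1 and a queue of N waiting jobs (N + 1 actions, including the void action). The filter count follows the Conv3-16 variant discussed later in Table 3, but the remaining dimensions are illustrative assumptions rather than the paper's exact architecture; the critic is sketched as a state-value network, matching v̂(s, w) in Algorithm 1.

```python
import tensorflow as tf

H, W, N = 20, 20, 10  # illustrative state dimensions and queue length

def build_actor():
    # Policy network: state tensor in, probabilities over N+1 actions out.
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(H, W, 1)),
        tf.keras.layers.Conv2D(16, 3, activation="relu"),   # Conv3-16
        tf.keras.layers.Flatten(),                          # no pooling (cf. Table 3)
        tf.keras.layers.Dense(N + 1, activation="softmax"),
    ])

def build_critic():
    # Value network: state tensor in, scalar state-value estimate out.
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(H, W, 1)),
        tf.keras.layers.Conv2D(16, 3, activation="relu"),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(1),
    ])
```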
The setting of A2C:

– Actor: The policy π is an actor which generates a probability for each possible action; π is a mapping from state s_t to action a_t. The actor chooses a job from the queue based on the action probabilities generated by the policy π. For instance, given the action probabilities P = {p_1, ..., p_N} for N actions, p_i denotes the probability that action a_i will be selected. If the action is always chosen according to the maximum probability (action = argmax_i p_i), the actor acts greedily, which limits the exploration of the agent; exploration is allowed in this research. The policy is estimated by a neural network π(a|s, θ), where a is an action, s is the state of the system, and θ denotes the weights of the policy network.
– Critic: A state-value function v(s) used to evaluate the performance of the actor. It is estimated by a neural network v̂(s, w) in this research, where s is the state and w denotes the weights of the value network.
– State s_t ∈ S: A state s_t is defined as the resource allocation status of the data center, including the status of the cluster and the status of the queue at time t. The state set S is finite. Figure 2 shows an example of the state at one time step. The state includes three parts: the resources already allocated and the available resources in the cluster, the resources requested by jobs in the queue, and the status of the jobs waiting in the backlog. The scheduler only schedules jobs in the queue.
– Action a_t ∈ A: An action a_t denotes the allocation strategy for jobs waiting in the queue at time t, where N is the number of slots for waiting jobs in the queue. The action space A specifies all possible allocations of jobs in the queue for the next iteration, which gives a set of N + 1 discrete actions represented by {∅, 1, 2, ..., N}, where a_t = i (∀i ∈ {1, ..., N}) means the allocation of the i-th job in the queue and a_t = ∅ denotes a void action in which no job is allocated.
– Environment: The simulated data center contains resources such as CPUs, RAM, and I/O. It also includes the resource management queue system in which jobs wait to be allocated.
– Discount factor γ: The discount factor γ is between 0 and 1 and quantifies the difference in importance between immediate and future rewards; the smaller γ, the less important future rewards are.
– Transition function P: S × A → [0, 1]: p(s_{t+1} | s_t, a_t) represents the probability of transitioning to s_{t+1} ∈ S given that action a_t ∈ A is taken in the current state s_t ∈ S.
– Reward function r: S × A → (−∞, +∞): A reward in the data center scheduling problem is defined as the feedback from the environment when the actor takes an action at a state.
Fig. 2. An example of the tensor representation of a state. At each iteration, the number of possible job-scheduling combinations is $2^{N_{jobs}}$, which grows exponentially; we simplify the case by selecting a decision from the domain {0, 1, ..., N}, where N is a fixed hyper-parameter, decision = i denotes selecting the i-th job, and decision = 0 denotes that no job is selected.

The actor attempts to maximize its expected discounted reward:

$$R_t = \mathbb{E}\Big(\sum_{k=0}^{\infty} \gamma^k r_{t+k}\Big) = \mathbb{E}(r_t + \gamma R_{t+1}).$$

The agent reward at time t is defined as $r_t = -T_j$, where $T_j$ is the runtime of job j.

The goal of data center job scheduling is to find the optimal policy π* (a sequence of actions for the agent) that maximizes the total reward. The state-action value function Q^π(s, a) is introduced to evaluate the performance of different policies; Q^π(s, a) stands for the expected total discounted reward from the current state s onwards under policy π, which is equal to:

$$Q^{\pi}(s_t, a_t) = \mathbb{E}_{\pi}(R_t \mid s_t, a_t) = \mathbb{E}_{\pi}\big(r_t + \gamma Q^{\pi}(s', a')\big) = r_t + \gamma \sum_{s' \in S} P^{\pi}(s' \mid s)\, Q^{\pi}(s', a'), \qquad (1)$$

where s' is the next state and a' is the action at the next time step.

Function approximation is a way to generalize when the state and/or action spaces are large or continuous. Several reinforcement learning algorithms have been proposed to estimate the value of an action in various contexts, such as Q-learning [16] and SARSA [12]. Among them, the model-free Q-learning algorithm stands out for its simplicity [1]. In Q-learning, the algorithm uses a Q-function to calculate the total reward, defined as Q: S × A → R. Q-learning iteratively evaluates the optimal Q-value function using backups:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \big[\, r + \gamma \max_{a'} Q(s', a') - Q(s, a) \,\big], \qquad (2)$$
where α ∈ [0, 1) is the learning rate and the term in the brackets is the temporal-difference (TD) error. Convergence to $Q^{\pi^*}$ is guaranteed in the tabular case provided there is sufficient state/action space exploration.
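For concreteness, a minimal tabular rendering of the backup in Equation (2); the Q-table layout (a 2-D array indexed by state and action) is an illustrative assumption.

```python
import numpy as np

def q_learning_step(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Eq. (2): move Q(s, a) toward the target r + gamma * max_a' Q(s', a').
    td_error = r + gamma * np.max(Q[s_next]) - Q[s, a]  # bracketed TD error
    Q[s, a] += alpha * td_error
    return td_error
```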
The loss function for the critic

The loss function of the critic is used to update the critic network parameters:

$$L(w_i) = \mathbb{E}\big(r + \gamma \max_{a'} Q(s', a'; w_{i-1}) - Q(s, a; w_i)\big)^2, \qquad (3)$$

where s' is the state encountered after state s. The critic updates the parameters of the value network by minimizing the critic loss in Equation (3).
Advantage actor-critic

The critic updates the state-action value function parameters, and the actor updates the policy parameters in the direction suggested by the critic. A2C updates both the policy and value-function networks with multi-step returns, as described in [11]. The critic is updated by minimizing the loss function of Equation (3). The actor network is updated by minimizing the actor loss function in Equation (4):

$$L(\theta'_i) = \nabla_{\theta'} \log \pi(a_t \mid s_t; \theta')\, A(s_t, a_t; \theta, w_i), \qquad (4)$$

where θ_i denotes the parameters of the actor network and w_i the parameters of the critic network. Note that the policy parameters θ_i and the value parameters w_i are kept distinct for generality. Algorithm 1 presents the calculation and update of parameters per episode.

The A2C consists of an actor and a critic, and we implement both using deep convolutional neural networks. The actor network takes the aforementioned tensor representation of resource requests and machine status as input, and outputs a probability distribution over all possible actions, representing the jobs to be scheduled. The critic network takes as input the combination of the action and the state of the system, and outputs a single value, indicating the evaluation of the actor's action.
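The two updates can be wired together as below: a one-step TD error serves both as the critic's regression target (Equation (3), in the one-step form used by Algorithm 1) and as the advantage estimate in the actor loss (Equation (4)). This is a plausible TensorFlow 2 sketch under those assumptions, not the authors' exact implementation.

```python
import tensorflow as tf

def a2c_update(actor, critic, opt_a, opt_c, s, a, r, s_next, done, gamma=0.99):
    # s, s_next: state batches; a: int action batch; r, done: float32 batches.
    with tf.GradientTape(persistent=True) as tape:
        v = critic(s)[:, 0]
        v_next = tf.stop_gradient(critic(s_next)[:, 0]) * (1.0 - done)
        advantage = r + gamma * v_next - v                      # one-step TD error
        critic_loss = tf.reduce_mean(tf.square(advantage))      # Eq. (3), one-step form
        log_prob = tf.math.log(tf.gather(actor(s), a, batch_dims=1) + 1e-8)
        actor_loss = -tf.reduce_mean(log_prob * tf.stop_gradient(advantage))  # Eq. (4)
    opt_c.apply_gradients(zip(tape.gradient(critic_loss, critic.trainable_variables),
                              critic.trainable_variables))
    opt_a.apply_gradients(zip(tape.gradient(actor_loss, actor.trainable_variables),
                              actor.trainable_variables))
    del tape
```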
Experiments

The experiments are executed on a desktop computer with two RTX-2080 GPUs and one i7-9700K 8-core CPU. A2cScheduler is implemented using the TensorFlow framework. Simulated jobs arrive online following a Bernoulli process, and a piece of job trace from a real data center is also tested. CPU and memory are the two kinds of resources considered in this research.
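The Bernoulli arrival process can be simulated in a few lines, where `rate` plays the role of the job rate in Tables 1 and 2 (the per-timestep probability that a new job arrives); this is an illustrative sketch, not the simulator's actual code.

```python
import numpy as np

def bernoulli_arrivals(rate, horizon, seed=0):
    # One draw per time step: a new job arrives with probability `rate`.
    rng = np.random.default_rng(seed)
    return [t for t in range(horizon) if rng.random() < rate]
```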
Algorithm 1. A2C reinforcement learning scheduling algorithm

Input: a policy parameterization π(a|s, θ)
Input: a state-value function parameterization v̂(s, w)
Parameters: step sizes α_θ > 0, α_w > 0; initialize policy parameters θ ∈ ℝ^{d'} and state-value function weights w ∈ ℝ^d (e.g., to 0)
Output: the scheduled sequence of jobs[1..n]

Loop forever (for each episode):
    Initialize S (first state of the episode)
    Loop while S is not terminal (for each time step of the episode):
        A ∼ π(·|S, θ)
        Take action A, observe state S', reward R
        δ ← R + γ v̂(S', w) − v̂(S, w)   (if S' is terminal, then v̂(S', w) ≐ 0)
        w ← w + α_w δ ∇ v̂(S, w)
        θ ← θ + α_θ δ ∇ ln π(A|S, θ)
        S ← S'
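Read as code, one episode of Algorithm 1 might look like the following; the Gym-style `env.reset()`/`env.step()` interface and the reuse of the `a2c_update` step sketched earlier are assumptions made for illustration.

```python
import numpy as np

def run_episode(env, actor, critic, opt_a, opt_c, gamma=0.99):
    s = env.reset()                                 # Initialize S
    done = False
    while not done:                                 # while S is not terminal
        probs = actor(s[None]).numpy()[0]
        a = np.random.choice(len(probs), p=probs)   # A ~ pi(.|S, theta)
        s_next, r, done = env.step(a)               # observe S', R
        a2c_update(actor, critic, opt_a, opt_c,     # delta, then w and theta updates
                   s[None], np.array([a]),
                   np.array([r], dtype=np.float32),
                   s_next[None],
                   np.array([float(done)], dtype=np.float32),
                   gamma)
        s = s_next                                  # S <- S'
```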
Table 1. Performance comparison after the model converged (job rates 0.9 and 0.8). Rows: slowdown, completion time, waiting time; columns: Random, Tetris, SJF, and A2cScheduler (mean ± std).

The training process begins with an initial state of the data center. At each time step, a state is passed into the policy network π, and an action is generated under policy π: either a void action is taken, or a job is chosen from the global queue and put into the cluster for execution. A new state is then generated and a reward is collected. The states, actions, policy, and rewards are collected as trajectories. Meanwhile, the state is also passed into the value network to estimate its value, which is used to evaluate the performance of the action. The actor in A2cScheduler learns to produce resource allocation strategies from these experiences over the training epochs.
Table 2. Performance comparison after the model converged (job rates 0.7 and 0.6). Rows: slowdown, completion time, waiting time; columns: Random, Tetris, SJF, and A2cScheduler (mean ± std).
Evaluation Metrics

Reinforcement learning algorithms, including A2C, have mostly been evaluated by convergence speed. However, such metrics are not very informative in domain-specific applications such as scheduling. Therefore, we present several evaluation metrics that are helpful for assessing the performance of the proposed model. Consider a set of jobs J = {j_1, ..., j_N}, where the i-th job is associated with an arrival time t_{a_i}, a finish time t_{f_i}, and an execution time t_{e_i}.
Average job slowdown. The slowdown of the i-th job is defined as

$$s_i = \frac{t_{f_i} - t_{a_i}}{t_{e_i}} = \frac{c_i}{t_i},$$

where $c_i = t_{f_i} - t_{a_i}$ is the completion time of the job and $t_i$ is its duration. The average job slowdown is defined as

$$s_{avg} = \frac{1}{N} \sum_{i=1}^{N} \frac{t_{f_i} - t_{a_i}}{t_{e_i}} = \frac{1}{N} \sum_{i=1}^{N} \frac{c_i}{t_i}.$$

The slowdown metric is important because it helps to evaluate the normalized waiting time of a system.

Average job waiting time. For the i-th job, the waiting time t_{w_i} is the time between its arrival and the start of its execution, formally defined as $t_{w_i} = t_{s_i} - t_{a_i}$.
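Both metrics follow directly from the definitions above. In the sketch below, the list of (arrival, start, finish) triples is an illustrative assumption, with execution time taken as finish minus start (i.e., no preemption).

```python
def average_slowdown(jobs):
    # jobs: list of (t_arrival, t_start, t_finish) triples
    return sum((f - a) / (f - s) for a, s, f in jobs) / len(jobs)

def average_waiting_time(jobs):
    # Waiting time: time between arrival and start of execution.
    return sum(s - a for a, s, f in jobs) / len(jobs)
```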
We simulated a data center cluster containing N nodes with two resources: CPU and memory. We trained A2cScheduler with different neural networks, including a fully connected (FC) layer and convolutional neural networks (CNNs). To find the best-performing network, we explored different CNN architectures and compared whether, and how fast, each setting converges. As shown in Table 3, an FC layer preceded by a flattening layer did not converge. This is because the state of the environment is a matrix carrying location information, some of which is lost in the flattening layer when the state is processed. To keep the location information, we used CNN layers (a CNN layer with sixteen 3×3 filters and one with thirty-two 3×3 filters), which showed better results. We then explored CNNs with max-pooling and CNNs followed by a flattening layer. Both converge, but the CNN with max-pooling gets poorer results, because some of the state information is also lost in the max-pooling layer. Based on these results, we chose the CNN followed by a flattening layer, as it converges quickly and gives the best performance.

The performance of the proposed method is compared with mainstream baselines: Shortest Job First (SJF), Tetris [4], and a random policy. SJF sorts jobs according to their execution time and schedules the job with the shortest execution time first; Tetris schedules jobs by a combined score of preferences for short jobs and resource packing; the random policy schedules jobs randomly. All of these baselines work in a greedy way that allocates as many jobs as the resources allow, and they share the same resource constraints and take the same input as the proposed model.
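For reference, the two simplest baselines can be sketched as follows, assuming each queued job exposes an `exec_time` attribute (a hypothetical interface); both return the index of the job to schedule, or None for the void action.

```python
import random

def sjf_action(queue):
    # Shortest Job First: schedule the waiting job with the smallest runtime.
    if not queue:
        return None                                # void action
    return min(range(len(queue)), key=lambda i: queue[i].exec_time)

def random_action(queue):
    # Random baseline: schedule any waiting job uniformly at random.
    return random.randrange(len(queue)) if queue else None
```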
Fig. 3. A2C performance with a job arrival rate of 0.7: (a) discounted reward; (b) slowdown; (c) average completion time; (d) average waiting time.
Table 3. Performance of different network architectures.

Architecture         Converges   Converging speed   Converging epochs
FC layer             No          N.A.               N.A.
Conv3-16             Yes         Fast               500
Conv3-32             Yes         Slow               1100
Conv3-16 + pooling   Yes         Fast               700
Conv3-32 + pooling   Yes         Fast               900
Fig. 4. A2C performance with real-world log data.
Table 4. Results on real job traces. Rows: slowdown, completion time (CT), waiting time (WT); columns: Random, Tetris, SJF, and A2cScheduler (mean ± std).

Performance on Synthetic Dataset
In our experiments, A2cScheduler uses the A2C reinforcement learning method. It is worth mentioning that the model includes the option to run multiple episodes, which allows us to measure both the average performance achieved and the capacity to learn for each scheduling policy. Algorithm 1 presents the calculation and update of parameters per episode. Figure 3 shows the experimental results with a synthetic job distribution as input.
Figures 3(a) and 3(b) present the rewards and average slowdown when the new job rate is 0.7. The cumulative reward and average slowdown converge after around 500 episodes, and A2cScheduler achieves a lower average slowdown than Random, Tetris, and SJF from that point on. Figures 3(c) and 3(d) show the average completion time and average waiting time of A2cScheduler versus the baselines. As we can see, A2cScheduler performs best compared to all the baselines.

Tables 1 and 2 present the steady-state simulation results at different job rates. A2cScheduler achieves the best, or close to the best, performance regarding slowdown, average completion time, and average waiting time at job rates ranging from 0.6 to 0.9.
Performance on Real-world Dataset
We ran experiments with a piece of job trace from an academic data center; the results are shown in Figure 4. The job traces were preprocessed before being used to train A2cScheduler. There is some fluctuation over the first 500 episodes in Figure 4(a), after which training starts to converge. Figure 4(b) shows that the average slowdown is better than all the baselines and close to the optimal value of 1, which means the average waiting time is almost 0, as shown in Figure 4(d). This happens because there were only 60 jobs in this case study and their runtimes are small; it is a case where almost no job waits for allocation when scheduling is optimal. A2cScheduler also attains the shortest completion time among the compared methods, as shown in Figure 4(c). Table 4 shows the steady-state results for a real-world job distribution running on an academic cluster; A2cScheduler achieves near-optimal scheduling, with an average waiting time of nearly 0 for this job distribution. These experimental results again show that A2cScheduler effectively finds proper scheduling policies by itself given adequate training, on both the simulated dataset and the real-world dataset. No rules were predefined for the scheduler; only a reward capturing the system optimization target was defined. This demonstrates that our reward function is effective in helping the scheduler learn an optimal strategy automatically after adequate training.
Conclusion

Job scheduling with resource constraints is a long-standing but critically important problem for computer systems. In this paper, we proposed an A2C deep reinforcement learning algorithm to address the customized job scheduling problem in data centers. We defined a reward function related to average job waiting time, which leads A2cScheduler to find a scheduling policy by itself. Without the need for any predefined rules, the scheduler automatically learns strategies directly from experience and thus improves its scheduling policies. Our experiments on both simulated data and real job traces from a data center show that the proposed method performs better than the widely used SJF and Tetris multi-resource cluster scheduling algorithms, offering a real alternative to current conventional approaches. The experimental results reported in this paper are based on two-resource (CPU/memory) restrictions, but the approach can easily be adapted to more complex multi-resource scheduling scenarios.
Acknowledgments

We are thankful to the anonymous reviewers for their valuable feedback. This research is supported in part by the National Science Foundation under grants CCF-1718336 and CNS-1817094.
References
1. Al-Tamimi, A., et al.: Model-free Q-learning designs for linear discrete-time zero-sum games with application to H-infinity control. Automatica 43(3), 473–481 (2007)
2. Domeniconi, G., Lee, E.K., Morari, A.: CuSH: Cognitive scheduler for heterogeneous high performance computing system (2019)
3. Garg, S.K., Gopalaiyengar, S.K., Buyya, R.: SLA-based resource provisioning for heterogeneous workloads in a virtualized cloud datacenter. In: Proc. ICA3PP, pp. 371–384 (2011)
4. Grandl, R., et al.: Multi-resource packing for cluster schedulers. ACM SIGCOMM Computer Communication Review 44(4), 455–466 (2015)
5. Grondman, I., et al.: A survey of actor-critic reinforcement learning: Standard and natural policy gradients. IEEE Transactions on Systems, Man, and Cybernetics 42(6), 1291–1307 (2012)
6. Hovestadt, M., Kao, O., Keller, A., Streit, A.: Scheduling in HPC resource management systems: Queuing vs. planning. In: Workshop on JSSPP, pp. 1–20. Springer (2003)
7. Konda, V.R., et al.: Actor-critic algorithms. In: Proc. NIPS, pp. 1008–1014 (2000)
8. Liu, Z., Zhang, H., Rao, B., Wang, L.: A reinforcement learning based resource management approach for time-critical workloads in distributed computing environment. In: Proc. Big Data, pp. 252–261. IEEE (2018)
9. Mao, H., et al.: Learning scheduling algorithms for data processing clusters. arXiv preprint arXiv:1810.01963 (2018)
10. Mao, H., Alizadeh, M., Menache, I., Kandula, S.: Resource management with deep reinforcement learning. In: HotNets '16, pp. 50–56. ACM, New York (2016). https://doi.org/10.1145/3005745.3005750
11. Mnih, V., et al.: Asynchronous methods for deep reinforcement learning. In: Proc. ICML, pp. 1928–1937 (2016)
12. Sprague, N., Ballard, D.: Multiple-goal reinforcement learning with modular sarsa(0) (2003)
13. Srinivasan, S., et al.: Actor-critic policy optimization in partially observable multiagent environments. In: Proc. NIPS, pp. 3422–3435 (2018)
14. Sutton, R.S., Barto, A.G.: Introduction to Reinforcement Learning, vol. 135. MIT Press, Cambridge (1998)
15. Van Craeynest, K., et al.: Scheduling heterogeneous multi-cores through performance impact estimation (PIE). In: Computer Architecture News, vol. 40, pp. 213–224 (2012)
16. Watkins, C.J., Dayan, P.: Q-learning. Machine Learning 8(3–4), 279–292 (1992)