A Novel Update Mechanism for Q-Networks Based On Extreme Learning Machines
Callum Wilson
Department of Mechanical and Aerospace Engineering, University of Strathclyde
Glasgow, United Kingdom
[email protected]

Annalisa Riccardi
Department of Mechanical and Aerospace Engineering, University of Strathclyde
Glasgow, United Kingdom
[email protected]

Edmondo Minisci
Department of Mechanical and Aerospace Engineering, University of Strathclyde
Glasgow, United Kingdom
[email protected]
Abstract—Reinforcement learning is a popular machine learning paradigm which can find near-optimal solutions to complex problems. Most often, these procedures involve function approximation using neural networks with gradient-based updates to optimise weights for the problem being considered. While this common approach generally works well, there are other update mechanisms which are largely unexplored in reinforcement learning. One such mechanism is Extreme Learning Machines. These were initially proposed to drastically improve the training speed of neural networks and have since seen many applications. Here we attempt to apply extreme learning machines to a reinforcement learning problem in the same manner as gradient-based updates. This new algorithm is called the Extreme Q-Learning Machine (EQLM). We compare its performance to a typical Q-Network on the cart-pole task, a benchmark reinforcement learning problem, and show EQLM has similar long-term learning performance to a Q-Network.
I. INTRODUCTION

Machine learning methods have developed significantly over many years and are now applied to increasingly practical and real-world problems. For example, these techniques can optimise control tasks which are often carried out inefficiently by basic controllers. The field of Reinforcement Learning (RL) originates in part from the study of optimal control [1], where a controller is designed to maximise, or minimise, a characteristic of a dynamical system over time. It is often impossible or impractical to derive an analytical optimal control solution for environments with complex or unknown dynamics, which motivates the use of more intelligent methods such as RL. In particular, intelligent controllers must be capable of learning quickly online to adapt to changes. The study of optimal control and RL brought machine learning into the broader field of engineering with applications to a wide variety of problems [2].

The generalisation performance of RL-derived controllers significantly improved with the incorporation of function approximators [3]. Unlike the earlier tabular methods, architectures such as fuzzy logic controllers [4] or, more commonly, Neural Networks (NNs) can exploit similarities in areas of the state space to learn better policies. This comes at a cost: NNs usually take a long time to train and in general they do not guarantee convergence. Furthermore, nonlinear function approximators can be unstable and cause the learning algorithm to diverge [5]. Despite this, through careful selection of hyperparameters and use of additional stability improvement measures, as will be discussed later, such function approximators can still obtain useful solutions to control problems. Of all the algorithms available for tuning network weights, backpropagation is the most widely used in state-of-the-art systems [6], [7], [8], [9]. The most common alternatives to this approach involve evolutionary algorithms, which can be used to evolve network weights or replace the function approximator entirely [10]. Such algorithms tend to show better performance but have a much higher computational cost, which can make them infeasible for certain learning problems.

Extreme Learning Machines (ELMs) are a class of neural networks which avoid using gradient-based updates [11]. For certain machine learning problems, ELM has several advantages over other update rules, mainly that it can be considerably faster than iterative methods for optimising network weights since the weights are instead calculated analytically. ELM has seen many improvements and adaptations allowing it to be applied to a wide variety of problems involving function approximation [12]. These include applications within the realm of RL, such as using a table to provide training data for an ELM network [13], approximating system dynamics using ELM to later apply RL methods [14], or using ELM theory to derive an analytical solution for weight updates based on the loss function of gradient-based updates [15]. Here we aim to use ELM in a conventional RL algorithm by only altering the neural network update rule. The algorithm which uses ELM in this manner is referred to here as the "Extreme Q-Learning Machine" (EQLM).

Fig. 1. Agent-environment interaction in reinforcement learning, where the agent observes a state and reward signal from the environment and uses this information to select an action to take.

In this paper we develop the EQLM algorithm and compare its performance to a standard Q-network of the same complexity. The type of Q-network used here is relatively primitive but incorporates some features to improve stability and general learning performance to provide a reasonable comparison. A full stability analysis of each algorithm is outwith the scope of this paper; however, we compare their performance using standard measures of learning. EQLM uses an incremental form of ELM which allows updates to be made online while the RL agent is interacting with the environment. Tests are carried out on a classical RL benchmark known as the cart-pole problem.

II. BACKGROUND
A RL process consists of an agent which senses an environment in a certain state and carries out actions to maximise future reward [2]. The only feedback the agent receives from the environment is a state signal and a reward signal, and it can only affect the environment by its actions, as shown schematically in Figure 1. The objective is then to maximise the total discounted reward it receives. One method for optimising the reward is Q-Learning, where the agent learns the action-value function Q of its policy and uses this to improve the policy in an iterative process [16]. The temporal-difference (TD) error of Q is defined as shown:

e = Q(s_t, a_t) - \left( r_{t+1} + \gamma \max_a Q(s_{t+1}, a) \right)    (1)

where s_t, a_t, and r_t denote state, action, and reward respectively at time-step t, \gamma is the discount factor which determines the effect of long-term rewards on the TD error, and Q(s, a) is the estimated action-value function. We approximate Q(s, a) using a feed-forward NN with parameters \theta and, for the standard Q-network, perform updates using the mean-squared TD error. Approximating the value function and updating using the TD error forms the basis for the most rudimentary Q-Learning algorithms. This section details the additional features of the Q-Learning algorithm which are employed in EQLM.

A. \epsilon-Greedy Policies

To find an optimal solution, an agent must visit every state in the environment throughout its training, which requires the agent to explore the environment by periodically taking random actions. This conflicts with the agent's global goal of taking actions deemed optimal by the current control policy in order to improve the policy; thus the well-known issue of balancing exploration and exploitation. One type of policy which helps remedy this issue is the "\epsilon-greedy" policy [2]. In these policies, a parameter \epsilon dictates the probability of taking a random action at each time-step, where 0 \leq \epsilon \leq 1, and this can be tuned to give the desired trade-off between taking exploratory or exploitative actions. Exploration becomes less necessary later in the training period once the agent has more experience. Instead, the agent seeks to exploit actions considered more "optimal" following a period of more exploration. To achieve this in practice, \epsilon varies linearly during the training period from \epsilon_i to \epsilon_f over N_\epsilon episodes. Following this, \epsilon is held constant at \epsilon = \epsilon_f for the remainder of training. The exploration probability after n episodes is given by Equation 2.

\epsilon = \begin{cases} \epsilon_i - \frac{n}{N_\epsilon}\left(\epsilon_i - \epsilon_f\right), & \text{if } n < N_\epsilon \\ \epsilon_f, & \text{if } n \geq N_\epsilon \end{cases}    (2)
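As a concrete illustration, the following is a minimal sketch of this linear annealing schedule in Python; the function and argument names are our own, not taken from the paper's implementation.

```python
def exploration_probability(n, eps_i, eps_f, n_eps):
    """Linearly annealed exploration probability after n episodes (Equation 2)."""
    if n < n_eps:
        return eps_i - (n / n_eps) * (eps_i - eps_f)
    return eps_f
```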
B. Target Network

A crucial issue with Q-networks is that they are inherently unstable and will tend to overestimate action-values, which can cause the predicted action-values to diverge [17]. Several methods to resolve this issue have been proposed, including the use of a target network [7]. This network calculates target action-values for updating the policy network and shares its structure with this network. The parameters of the policy network, \theta, are periodically transferred to the target network, whose parameters are denoted \theta^-, and which otherwise remain constant. In practice, the target network values are updated every C time-steps. This slightly decouples the target values from the policy network, which reduces the risk of divergence.
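A minimal, framework-agnostic sketch of this periodic synchronisation is shown below; it assumes network parameters are stored as a dictionary of NumPy arrays, which is our own simplification rather than the paper's implementation.

```python
def maybe_sync_target(step, C, policy_params, target_params):
    """Copy policy-network parameters to the target network every C time-steps."""
    if step % C == 0:
        for name, value in policy_params.items():
            target_params[name] = value.copy()   # theta_minus <- theta
    return target_params
```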
C. Experience Replay

The online methods of learning discussed thus far conventionally make updates on the most recent observed state transition, which has several limitations [7]. For example, states which are only visited once may contain useful update information which is quickly lost, and updating on single state transitions results in low data efficiency of the agent's experience. A more data-efficient method of performing updates utilises experience replay [18]. In this method, experiences of transitions are stored in a memory D which contains the state s_j, action taken a_j, observed reward r_{j+1}, and observed state s_{j+1} for each transition. Updates are then made on a minibatch of k experiences selected randomly from the memory at every time-step. To limit the number of state transitions stored, a maximum memory size N_mem is defined such that a moving window of the N_mem previous transitions is stored in the agent's memory [17].
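A minimal sketch of such a sliding-window replay memory in Python is given below; the class and method names are illustrative only.

```python
import random
from collections import deque

class ReplayMemory:
    """Stores at most n_mem transitions and samples random minibatches."""
    def __init__(self, n_mem):
        self.buffer = deque(maxlen=n_mem)     # oldest transitions are dropped automatically

    def add(self, state, action, reward, next_state, terminal):
        self.buffer.append((state, action, reward, next_state, terminal))

    def sample(self, k):
        return random.sample(self.buffer, min(k, len(self.buffer)))
```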
D. Q-Network Algorithm

Figure 2 details the Q-Network algorithm, which incorporates a target network and experience replay. This algorithm gives our baseline performance to which we compare the Extreme Q-Learning Machine (EQLM) and also provides the basis for incorporating ELM as a novel update mechanism.

Fig. 2. Q-Network algorithm
1: initialise network with random weights
2: for episode = 1 to N_ep do
3:   initialise state s_t
4:   while state s_t is non-terminal do
5:     select action a_t according to policy \pi
6:     execute action a_t and observe r, s_{t+1}
7:     update memory D with (s_t, a_t, r_t, s_{t+1})
8:     select random minibatch of k experiences (s_j, a_j, r_j, s_{j+1}) from D
9:     t_j = r_j if s_{j+1} is terminal, otherwise t_j = r_j + \gamma \max_a Q_T(s_{j+1}, a)
10:    e_j = Q(s_j, a_j) - (r_{j+1} + \gamma \max_a Q_T(s_{j+1}, a))
11:    update network using the error e_j for each transition in the minibatch
12:    after C time-steps set \theta^- ← \theta
13:  end while
14: end for

III. ELM THEORY AND DEVELOPMENT
A. Extreme Learning Machine
ELM in its most widely used form is a type of single-layer feedforward network (SLFN). The description of ELM herein uses the same notation as in [11]. Considering an arbitrary set of training data (x_i, t_i) where x_i = [x_{i1}, x_{i2}, \ldots, x_{in}] and t_i = [t_{i1}, t_{i2}, \ldots, t_{im}], a SLFN can be mathematically modelled as follows:

\sum_{i=1}^{\tilde{N}} \beta_i g(w_i \cdot x_j + b_i) = o_j, \quad j = 1, \ldots, N    (3)

where \tilde{N} is the number of hidden nodes, \beta_i = [\beta_{i1}, \beta_{i2}, \ldots, \beta_{im}]^T is the output weight vector which connects the i-th hidden node to the output nodes, g(x) is the activation function, w_i = [w_{i1}, w_{i2}, \ldots, w_{in}]^T is the input weight vector which connects the i-th hidden node to the input nodes, and b_i is the bias of the i-th hidden node. Where the network outputs o_j have zero error compared to the targets t_j for all N samples, i.e. \sum_{j=1}^{N} \| o_j - t_j \| = 0, it can be written that

\sum_{i=1}^{\tilde{N}} \beta_i g(w_i \cdot x_j + b_i) = t_j, \quad j = 1, \ldots, N    (4)

which contains the assumption that the SLFN can approximate the N samples with zero error. Writing this in a more compact form gives

H \beta = T    (5)

where H is the hidden layer output matrix, \beta is the output weight matrix, and T is the target matrix. These are defined as shown:

H = \begin{bmatrix} g(w_1 \cdot x_1 + b_1) & \cdots & g(w_{\tilde{N}} \cdot x_1 + b_{\tilde{N}}) \\ \vdots & \ddots & \vdots \\ g(w_1 \cdot x_N + b_1) & \cdots & g(w_{\tilde{N}} \cdot x_N + b_{\tilde{N}}) \end{bmatrix}_{N \times \tilde{N}}    (6)

\beta = \begin{bmatrix} \beta_1^T \\ \vdots \\ \beta_{\tilde{N}}^T \end{bmatrix}_{\tilde{N} \times m}    (7)

T = \begin{bmatrix} t_1^T \\ \vdots \\ t_N^T \end{bmatrix}_{N \times m}    (8)

ELM performs network updates by solving the linear system defined in equation 5 for \beta:

\hat{\beta} = H^{\dagger} T    (9)

where H^{\dagger} denotes the Moore-Penrose generalised inverse of H as defined in equation 10. This is used since, in general, H is not a square matrix and so cannot be inverted directly.

H^{\dagger} = \left( H^T H \right)^{\dagger} H^T    (10)

The method used by ELM to update its weights has several advantages over classical methods of updating neural networks. It is proven in [11] that \hat{\beta} is the smallest norm least-squares solution for \beta in the linear system defined by equation 5, which is not always the solution reached using classical methods. ELM also avoids many of the issues commonly associated with neural networks, such as converging to local minima and an improper learning rate. Such problems are usually avoided by using more sophisticated algorithms, whereas ELM is far simpler than most conventional algorithms.
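For concreteness, the batch ELM solution can be sketched in a few lines of NumPy; the sigmoid activation, the uniform weight initialisation, and all names here are our own illustrative choices rather than details taken from the paper.

```python
import numpy as np

def elm_fit(X, T, n_hidden, seed=0):
    """Fit ELM output weights by solving H beta = T (Equations 3-9)."""
    rng = np.random.default_rng(seed)
    n_in = X.shape[1]
    W = rng.uniform(-1.0, 1.0, size=(n_in, n_hidden))   # random input weights w_i (never trained)
    b = rng.uniform(-1.0, 1.0, size=n_hidden)            # random hidden biases b_i
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))               # hidden layer output matrix, N x N_tilde
    beta = np.linalg.pinv(H) @ T                         # beta = H^dagger T (Equation 9)
    return W, b, beta
```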
B. Regularized ELM

Despite the many benefits of ELM, several issues with the algorithm are noted in [19]. Mainly, the algorithm still tends to overfit and is not robust to outliers in the input data. The authors propose a Regularized ELM which attempts to balance the empirical risk and structural risk to give better generalisation. This differs from the ELM algorithm, which is based solely on empirical risk minimisation. The main feature of regularized ELM is the introduction of a parameter \bar{\gamma} which regulates the amount of empirical and structural risk. This parameter can be adjusted to balance the risks and obtain the best generalisation of the network. Weights are calculated as shown:

\beta = \left( \frac{I}{\bar{\gamma}} + H^T D H \right)^{\dagger} H^T T    (11)

which incorporates the parameter \bar{\gamma} and a weighting matrix D. Setting D as the identity matrix I yields an expression for unweighted regularized ELM:

\beta = \left( \frac{I}{\bar{\gamma}} + H^T H \right)^{\dagger} H^T T    (12)

which is a simplification of equation 11. ELM is then the case of equation 12 where \bar{\gamma} \to \infty. Adding the parameter \bar{\gamma} adds some complexity to the ELM algorithm because of its tuning; however, regularized ELM still maintains most of the advantages of ELM over conventional neural networks.
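Continuing the sketch above, the unweighted regularized solution of equation 12 would change only the output-weight computation; again, this is an illustrative sketch rather than the paper's code.

```python
import numpy as np

def regularized_elm_beta(H, T, gamma_bar):
    """Unweighted regularized ELM output weights (Equation 12)."""
    n_hidden = H.shape[1]
    A = np.eye(n_hidden) / gamma_bar + H.T @ H
    return np.linalg.pinv(A) @ H.T @ T
```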
C. Incremental Extreme Learning Machine

It is desired to perform network updates sequentially on batches of data, which necessitates an incremental form of ELM. Such an algorithm is presented in [20], whose basis is the regularized form of ELM shown in equation 12. The algorithm used for the purposes of EQLM is the least squares incremental extreme learning machine (LS-IELM). For an initial set of N training samples (x_i, t_i), the LS-IELM algorithm initialises the network weights as shown:

\beta = A_t^{\dagger} H^T T    (13)

where

A_t = \frac{I}{\bar{\gamma}} + H^T H    (14)

and H and T are given by equations 6 and 8. Suppose new sets of training data arrive in chunks of k samples; the hidden layer output matrix and targets for a new set of k samples are as shown:

H_{IC} = \begin{bmatrix} g(w_1 \cdot x_{N+1} + b_1) & \cdots & g(w_{\tilde{N}} \cdot x_{N+1} + b_{\tilde{N}}) \\ \vdots & \ddots & \vdots \\ g(w_1 \cdot x_{N+k} + b_1) & \cdots & g(w_{\tilde{N}} \cdot x_{N+k} + b_{\tilde{N}}) \end{bmatrix}_{k \times \tilde{N}}    (15)

T_{IC} = \begin{bmatrix} t_{N+1}^T \\ \vdots \\ t_{N+k}^T \end{bmatrix}_{k \times m}    (16)

To perform updates using the most recent data at time t, K_t is defined as

K_t = I - A_t^{\dagger} H_{IC}^T \left( H_{IC} A_t^{\dagger} H_{IC}^T + I_{k \times k} \right)^{\dagger} H_{IC}    (17)

and the update rules for \beta and A are then as follows:

\beta_{t+1} = K_t \beta_t + K_t A_t^{\dagger} H_{IC}^T T_{IC}    (18)

A_{t+1}^{\dagger} = K_t A_t^{\dagger}    (19)
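The update rules above translate almost directly into NumPy; the following is a hedged sketch (class and attribute names are ours) of the LS-IELM initialisation and incremental step.

```python
import numpy as np

class LSIELM:
    """Sketch of LS-IELM output-weight updates (Equations 13-19)."""
    def __init__(self, H0, T0, gamma_bar):
        n_hidden = H0.shape[1]
        A = np.eye(n_hidden) / gamma_bar + H0.T @ H0        # Equation 14
        self.A_pinv = np.linalg.pinv(A)
        self.beta = self.A_pinv @ H0.T @ T0                 # Equation 13

    def update(self, H_ic, T_ic):
        k = H_ic.shape[0]
        inner = np.linalg.pinv(H_ic @ self.A_pinv @ H_ic.T + np.eye(k))
        K = np.eye(self.A_pinv.shape[0]) - self.A_pinv @ H_ic.T @ inner @ H_ic   # Equation 17
        self.beta = K @ self.beta + K @ self.A_pinv @ H_ic.T @ T_ic              # Equation 18
        self.A_pinv = K @ self.A_pinv                                            # Equation 19
```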
D. Extreme Q-Learning Machine

The algorithm for applying Q-learning using LS-IELM based updates, here referred to as the Extreme Q-Learning Machine (EQLM), is shown in Figure 3. Similar to the Q-network algorithm in Figure 2, it uses experience replay and a target network to improve its performance. Unlike the Q-network, the TD error is not calculated; instead, a target matrix T for the minibatch of data is created which contains the predicted action-values for all actions in the given states. The target action-value for each state s_j is then assigned to the applicable entry in t_j. Matrix H is constructed using the states in the minibatch and then the update rules are applied. The boolean variable step is introduced to initialise the network at the very first update.

One further key difference in the EQLM algorithm is the heuristic policy used in the initial episodes. The return in the initial episodes has a substantial effect on the convergence of EQLM, as discussed later. This necessitates a simple heuristic controller for the start of training which does not need to perform very well, but can at least prevent the agent from converging on a highly sub-optimal policy. EQLM uses a heuristic action selection a_t = h(t), which is effectively an open-loop control scheme dependent only on the time-step, for N_h episodes. Definition of this heuristic is discussed in the following section.
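To make the target-matrix construction concrete, the sketch below builds T for a minibatch by starting from the network's predicted action-values and overwriting only the entries for the actions actually taken; the function and argument names are placeholders of our own choosing.

```python
import numpy as np

def build_target_matrix(q_pred, actions, rewards, q_next_max, terminal, gamma):
    """Target matrix T for a minibatch of k transitions (cf. Figure 3)."""
    T = q_pred.copy()                          # k x n_actions predicted action-values
    targets = rewards + gamma * q_next_max     # r_j + gamma * max_a Q(s_{j+1}, a)
    targets[terminal] = rewards[terminal]      # r_j alone for terminal transitions
    T[np.arange(len(actions)), actions] = targets
    return T
```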
IV. EXPERIMENTS AND RESULTS

Code to reproduce results is available at https://github.com/strath-ace/smart-ml.
A. OpenAI Gym Environments
The environment used to test the algorithms comes from the OpenAI Gym, which is a toolkit containing benchmark tests for a variety of machine learning algorithms [21]. The gym contains, among other environments, several classical control problems, control tasks which use the MuJoCo physics engine [22], and the Atari 2600 games which are used in [7]. Here the agents will be tested on the environment named "CartPole-v0". The cart-pole problem was originally devised in [23], where the authors created an algorithm called "BOXES" to learn to control the system. In this task, a pole is attached by a hinge to a cart which rolls along a track and is controlled by two possible actions: an applied force of fixed magnitude in either the positive or negative x-direction along the track. The goal is to keep the pendulum from toppling for as long as possible, which yields a very simple reward function of r = +1 for every time-step where the pendulum has not toppled. In addition, the track on which the cart is situated is finite and reaching the limits of the track also indicates failure. The dynamics of the system used in the gym are the same as those defined by [24]. The state-space size for this environment is 4 and the action-space size is 2. This problem can be considered an "episodic" task [2], where the learning is divided into episodes which have defined ending criteria. An episode terminates either when the pendulum exceeds a fixed angle from the vertical or the cart reaches either end of the track.
B. Heuristic Policy
As discussed previously, EQLM is susceptible to convergingon a sub-optimal policy without the use of a heuristic policyin the initial episodes. A random policy at the start of trainingwill sometimes produce this sub-optimal result and so weneed to define a simple deterministic policy which does notimmediately solve the task but prevents unacceptable long-term performance. For the cart-pole task we consider herewhich has a binary action space, we define the heuristic policyas taking alternating actions at each time-step as shown: h ( t ) = mod ( t, (20)From testing, we found N h = 5 to be a suitable number ofepisodes over which to use the heuristic. The effect of thisinitial heuristic policy is shown in Figure 4. This shows theaveraged rewards over the first 200 episodes of training forboth networks with and without the heuristic. While the returnin the initial episodes is still higher for EQLM in both cases, it is clear that with the heuristic EQLM shows a more favourableperformance. This is due to occasions where, without theheuristic, EQLM quickly converges to a sub-optimal policy,which is mitigated by the heuristic policy. Also shown is theaverage performance of the heuristic alone, which receivesa reward of 37 per episode. This indicates that although theheuristic alone performs very poorly on the task, it is stilluseful to improve the performance of both algorithms. Fig. 4. Varying performance with the use of an initial heuristic h with theaverage performance for the heuristic alone shown C. Hyperparameter Selection
The performance of a Q-learning agent can be very sensitiveto its hyperparameters. To create a useful comparison of eachagent’s performance we therefore need to tune the hyperpa-rameters for this problem. Here we use the Python libraryHyperopt which is suitable for optimising within combineddiscrete- and real-valued search spaces [25]. Hyperparametersto be optimised are: learning rate α (Q-Network only), regular-isation parameter ¯ γ (EQLM only), number of hidden nodes ˜ N ,initial exploration probability (cid:15) i (with (cid:15) f fixed as 0), numberof episodes to decrease exploration probability N (cid:15) , discountfactor γ , minibatch size k , and target network update steps C .Our main objective to optimise is the final performance ofthe agent, i.e. the total reward per episode, after it convergesto a solution. In addition, an agent should converge to theoptimal solution in as few episodes as possible. Both theseobjectives can be combined into the single metric of area underthe learning curve as shown in Figure 5. Since hyperopt usesa minimisation procedure, we specifically take the negativearea under the curve. One of the issues with optimising thesesystems is their stochastic nature which can result in severalruns with the same hyperparameters having vastly differentperformance. To account for this, each evaluation uses 8 runsand the loss is the upper confidence interval of the metricfrom these runs. This gives a conservative estimate of theworst-case performance for a set of hyperparameters.Table I shows the best parameters obtained when tuning thehyperparameters for this task. Most of the hyperparameterscommon to each algorithm are not substantially different with ig. 5. Example learning curves which show different values for the lossfunction the exception of minibatch size, k which is 26 and 2 for theQ-network and EQLM respectively. In fact, the performanceof EQLM tended to decrease for larger values of k whichwas not the case for the Q-network. This could be a resultof the matrix inversion in EQLM where the behaviour of thenetwork is less stable if the matrix is non-square. Alternatively,it is possible that EQLM attempting to fit to a much largernumber of predicted Q-values causes the overall performanceto decrease. The fact it needs fewer data per time-step thana standard Q-network could also indicate that EQLM is moreefficient at extracting information on the environment’s action-values compared to using gradient descent. Hyperparameter Q-Network EQLM α ¯ γ - 1.827e-5 ˜ N
TABLE I
HYPERPARAMETERS USED FOR EACH AGENT IN THE CART-POLE TASK

Hyperparameter       Q-Network    EQLM
\alpha                            -
\bar{\gamma}         -            1.827e-5
\tilde{N}            29           25
\epsilon_i
N_\epsilon           400          360
\gamma
k                    26           2
C                    70           48
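As a rough sketch of the tuning loss described above, the snippet below computes a negative normalised area under the learning curve and takes a conservative upper confidence bound over repeated runs; the normalisation constant and the confidence formula are our own illustrative choices, not necessarily those used by the authors.

```python
import numpy as np

def tuning_loss(rewards_per_run, max_reward=200.0):
    """Negative area under the learning curve, aggregated pessimistically over runs."""
    rewards = np.asarray(rewards_per_run, dtype=float)           # shape: (n_runs, n_episodes)
    loss = -(rewards.mean(axis=1) / max_reward)                  # per-run negative normalised AUC
    half_width = 1.96 * loss.std(ddof=1) / np.sqrt(len(loss))    # ~95% confidence half-width
    return loss.mean() + half_width                              # upper confidence bound (minimised by Hyperopt)
```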
D. Learning Performance
With the selected hyperparameters, each agent carried out 50 runs of 600 episodes in the cart-pole environment to compare their performance. The results of this are shown in Figure 6 and Table II. Here we use two measures of performance: mean reward over the final 100 episodes and area under the learning curve (auc). From the learning curves, we see EQLM on average achieves a superior performance in the earliest episodes of training, followed by a steady increase in return until it plateaus. The Q-network begins with comparatively low average return but then shows a sharp increase in return before its performance plateaus for the remainder of the episodes.
Fig. 6. Learning curves for EQLM and a standard Q-Network in the cart-pole task. Results are averaged over all 50 runs at each episode and the shaded area indicates the confidence interval.

TABLE II
PERFORMANCE OF EACH ALGORITHM IN THE CART-POLE TASK

Measure          Q-Network               EQLM
reward  mean     160.0 (147.5, 173.7)    166.9 (160.7, 173.3)
        std      47.0 (35.1, 62.2)       23.1 (20.3, 26.7)
auc (*) mean     84.1 (81.0, 87.4)       83.3 (80.4, 86.2)
        std      11.7 (9.1, 14.7)        10.6 (9.3, 12.4)

After each of the learning curves plateaus at its near-optimal performance, we see some of the most interesting differences between the two algorithms. The average return for EQLM remains very consistent, as do the confidence intervals; however, the Q-network displays some temporal variation in its performance as training continues and its confidence intervals tend to get larger. This shows that the long-term performance of EQLM is more consistent than that of the equivalent Q-network, which is backed up by the data in Table II. The standard deviation of the mean reward of EQLM (23.1) is less than half that of the Q-network (47.0), and both algorithms have comparable mean rewards (160.0 and 166.9 for the Q-network and EQLM respectively).

To find a statistical measure of the difference in performance of each algorithm, we use a two-tailed t-test [26]. This assumes both algorithms' performance belongs to the same distribution, which we reject when the p-value is less than a threshold of 0.05. For both the mean reward in the final episodes and the area under the learning curve, the resulting p-values were above this threshold. As a result, we cannot reject the hypothesis that the performance of both algorithms follows the same distribution. This demonstrates EQLM as being capable of achieving similar performance to a standard Q-Network in this task.
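A hedged sketch of this comparison using SciPy is shown below; the reward arrays are synthetic placeholders standing in for the 50 per-run scores of each agent, and the choice of Welch's variant of the t-test is ours.

```python
import numpy as np
from scipy import stats

# Placeholder data: final-episode mean reward from each of the 50 runs per agent.
rewards_qnet = np.random.default_rng(0).normal(160.0, 47.0, size=50)
rewards_eqlm = np.random.default_rng(1).normal(166.9, 23.1, size=50)

# Two-tailed t-test; p >= 0.05 means we cannot reject that both sets of
# scores come from the same distribution.
t_stat, p_value = stats.ttest_ind(rewards_qnet, rewards_eqlm, equal_var=False)
print(t_stat, p_value)
```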
V. CONCLUSION

This paper proposed a new method of updating Q-networks using techniques derived from ELM, called the Extreme Q-Learning Machine (EQLM). When compared to a standard Q-network on the benchmark cart-pole task, EQLM shows comparable average performance, which it achieves more consistently than the Q-network. EQLM also shows better initial learning performance when initialised using a basic heuristic policy.

While EQLM shows several advantages over standard Q-networks, it is clear that the conventional gradient descent methods are also capable of learning quickly as they gain more experience. Future work could look at combining the strengths of EQLM's initial performance with gradient-based methods to accelerate the learning. In this paper we have tuned the hyperparameters of EQLM for a specific problem, but a more rigorous parametric study is necessary to learn more about the effect of the hyperparameters on EQLM's learning performance. One of the developments in ELM which was not used here is the ELM-based multilayer perceptron [27]. Such a network could similarly be used for RL problems since deep networks are generally better suited to more complex tasks [28].

The results in this paper suggest ELM methods are capable of being used within RL with similar performance and greater consistency than conventional gradient descent for simple RL problems. Additional research is needed on the application of EQLM to higher-dimensional and adaptive control problems.

ACKNOWLEDGMENT
The authors would like to thank the University of Strathclyde Alumni Fund for their support.

REFERENCES

[1] R. E. Bellman, "The theory of dynamic programming," Bulletin of the American Mathematical Society, vol. 60, pp. 503-516, 1954.
[2] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge: MIT Press, 1998.
[3] R. S. Sutton, "Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding," Advances in Neural Information Processing Systems, vol. 8, pp. 1038-1044, 1996.
[4] R. Davoodi and B. J. Andrews, "Computer simulation of FES standing up in paraplegia: A self-adaptive fuzzy controller with reinforcement learning," IEEE Transactions on Rehabilitation Engineering, 1998.
[5] J. N. Tsitsiklis and B. Van Roy, "An Analysis of Temporal-Difference Learning with Function Approximation," IEEE Transactions on Automatic Control, vol. 42, no. 5, pp. 674-690, 1997.
[6] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, 2016.
[7] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, "Human-level control through deep reinforcement learning," Nature, vol. 518, 2015.
[8] V. Mnih, A. Puigdomènech Badia, M. Mirza, A. Graves, T. Harley, T. P. Lillicrap, D. Silver, and K. Kavukcuoglu, "Asynchronous Methods for Deep Reinforcement Learning," in International Conference on Machine Learning, 2016.
[9] H. Van Hasselt, A. Guez, and D. Silver, "Deep Reinforcement Learning With Double Q-Learning," in Proceedings of the 30th AAAI Conference on Artificial Intelligence, 2016, pp. 2094-2100.
[10] D. E. Moriarty and A. C. Schultz, "Evolutionary Algorithms for Reinforcement Learning," Journal of Artificial Intelligence Research, vol. 11, pp. 241-276, 1999.
[11] G. B. Huang, Q. Y. Zhu, and C. K. Siew, "Extreme learning machine: Theory and applications," Neurocomputing, vol. 70, no. 1-3, pp. 489-501, 2006.
[12] G. B. Huang, D. H. Wang, and Y. Lan, "Extreme Learning Machines: A Survey," International Journal of Machine Learning and Cybernetics, vol. 2, no. 2, pp. 107-122, 2011.
[13] J. M. Lopez-Guede, B. Fernandez-Gauna, and M. Graña, "State-Action Value Function Modeled by ELM in Reinforcement Learning for Hose Control Problems," International Journal of Uncertainty, vol. 21, pp. 99-116, 2013.
[14] J. M. Lopez-Guede, B. Fernandez-Gauna, and J. A. Ramos-Hernanz, "A L-MCRS dynamics approximation by ELM for Reinforcement Learning," Neurocomputing, vol. 150, pp. 116-123, 2014.
[15] T. Sun, B. He, and R. Nian, "Target Following for an Autonomous Underwater Vehicle Using Regularized ELM-based Reinforcement Learning," in OCEANS'15 MTS/IEEE Washington, 2015, pp. 1-5.
[16] C. J. C. H. Watkins, "Learning from Delayed Rewards," Ph.D. dissertation, King's College, 1989.
[17] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing Atari with Deep Reinforcement Learning," arXiv preprint arXiv:1312.5602, 2013.
[18] L.-J. Lin, "Self-Improving Reactive Agents Based On Reinforcement Learning, Planning and Teaching," Machine Learning, vol. 8, pp. 293-321, 1992.
[19] W. Deng, Q. Zheng, and L. Chen, "Regularized Extreme Learning Machine," IEEE Symposium on Computational Intelligence and Data Mining, pp. 389-395, 2009.
[20] L. Guo, J. H. Hao, and M. Liu, "An incremental extreme learning machine for online sequential learning problems," Neurocomputing, vol. 128, pp. 50-58, 2014.
[21] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, "OpenAI Gym," pp. 1-4, 2016.
[22] E. Todorov, T. Erez, and Y. Tassa, "MuJoCo: A physics engine for model-based control," IEEE International Conference on Intelligent Robots and Systems, pp. 5026-5033, 2012.
[23] D. Michie and R. A. Chambers, "BOXES: An Experiment in Adaptive Control," Machine Intelligence, vol. 2, pp. 137-152, 1968.
[24] A. G. Barto, R. S. Sutton, and C. W. Anderson, "Neuronlike Adaptive Elements That Can Solve Difficult Learning Control Problems," IEEE Transactions on Systems, Man and Cybernetics, vol. SMC-13, no. 5, pp. 834-846, 1983.
[25] J. Bergstra, D. Yamins, and D. D. Cox, "Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures," in Proceedings of the 30th International Conference on Machine Learning, vol. 28, no. 1, 2013, pp. 115-123.
[26] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger, "Deep Reinforcement Learning that Matters," in The Thirty-Second AAAI Conference on Artificial Intelligence, 2018, pp. 3207-3214.
[27] J. Tang, C. Deng, and G.-B. Huang, "Extreme learning machine for multilayer perceptron," IEEE Transactions on Neural Networks and Learning Systems, vol. 27, no. 4, pp. 809-821, 2015.
[28] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning,"