Natural Gradient Deep Q-learning
Ethan Knight [email protected]
Stanford Cognitive & Systems Neuroscience Lab
The Nueva School
Osher Lerner [email protected]
The Nueva School
Abstract
We present a novel algorithm to train a deep Q-learning agent using natural-gradient techniques. We compare the original deep Q-network (DQN) algorithm to its natural-gradient counterpart, which we refer to as NGDQN, on a collection of classic control domains. Without employing target networks, NGDQN significantly outperforms DQN without target networks, and performs no worse than DQN with target networks, suggesting that NGDQN stabilizes training and can help reduce the need for additional hyperparameter tuning. We also find that NGDQN is less sensitive to hyperparameter optimization relative to DQN. Together, these results suggest that natural-gradient techniques can improve value-function optimization in deep reinforcement learning.
Introduction

A core piece of various reinforcement-learning algorithms is the estimation of a value function: the expected sum of future discounted rewards under a desired policy [Sutton and Barto, 1998]. Q-learning is a model-free reinforcement learning algorithm that estimates the value function under the optimal policy by minimizing the temporal-difference error of the agent's value-function estimates [Watkins, 1989]. This basic algorithm, when combined with deep neural networks [LeCun et al., 2015], has proven to be a major success in AI, most notably by exhibiting human-level performance on a suite of challenging Atari games [Mnih et al., 2013].

Because Q-learning, as the control extension of TD algorithms [Sutton, 1988], is not truly a stochastic gradient descent algorithm [Maei, 2011], convergence of the algorithm with non-linear function approximators is poorly understood. In fact, it has been shown that TD algorithms can sometimes be divergent [Tsitsiklis and Van Roy, 1997]. Moreover, in practice it is sometimes difficult to train a deep neural network with Q-learning. In the original DQN work, for example, the authors proposed three key additions to stabilize training, namely experience replay [Lin, 1992], reward clipping, and the use of target networks [Mnih et al., 2013].

In this work, we aim to address some of the practical issues pertaining to DQN training, as well as to improve upon it, by using natural-gradient techniques. Natural gradient was originally proposed by Amari as a method to accelerate gradient descent [1998]. Rather than exclusively using the loss gradient or local curvature from the Hessian matrix, natural gradient uses "information" found in the parameter space of the model to train efficiently.

Natural gradient has been successfully applied to several deep learning domains [Desjardins et al., 2015, Schulman et al., 2015, Wu et al., 2017] and has been used to accelerate the training of reinforcement learning systems [Kakade, 2001, Peters and Schaal, 2008, Dabney and Thomas, 2014]. In developing our approach, we hoped that using natural gradient would accelerate the training of DQN, making our system more sample-efficient and thereby addressing one of the major problems in reinforcement learning. We also hoped that, since natural gradient stabilizes training (e.g., the natural gradient is relatively unchanged when the order of training inputs is changed [Pascanu and Bengio, 2013]), it would reduce DQN's reliance on additional stabilization machinery such as target networks.
Background

In the reinforcement learning problem, typically modeled as an MDP [Puterman, 2014], an agent interacts with an environment to maximize cumulative reward. The agent observes a state s, performs an action a, and receives a new state s' and reward r. Usually a discount factor γ is also defined, which specifies the relative importance of immediate rewards as opposed to those received in the future. More specifically, the objective is to maximize

\[ \mathbb{E}_\pi[R_t] = \mathbb{E}\Big[\sum_{t'=t}^{T} \gamma^{t'-t} r_{t'} \,\Big|\, \pi\Big] \qquad (1) \]

by attempting to learn a good policy π.

Q-learning [Watkins, 1989, Rummery and Niranjan, 1994] is a model-free reinforcement learning algorithm which works by gradually learning Q(s, a), the expectation of the cumulative reward. The Bellman equation defines the optimal Q-value Q* [Sutton and Barto, 1998, Hester et al., 2017]:

\[ Q^*(s,a) = \mathbb{E}\Big[R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, \max_{a'} Q^*(s',a')\Big] \qquad (2) \]

This function Q can then be optimized through value iteration, which defines the update rule Q(s,a) ← E[r + γ max_{a'} Q(s',a') | s,a] [Sutton and Barto, 1998, Mnih et al., 2013]. Additionally, the optimal policy π is defined as π(s) = argmax_a Q*(s,a) [Sutton and Barto, 1998, Hester et al., 2017].

A neural network can be described as a parametric function approximator built from "layers" of units called "neurons", each containing weights, biases, and activation functions. Each layer's output is fed into the next layer, and the loss is backpropagated to each layer's weights in order to adjust the parameters according to their effect on the loss.

For deep Q-learning, the neural network, parameterized by θ, takes in a state s and outputs a predicted future reward for each possible action a, with a linear activation on the final layer. The loss of this network is defined as follows, given the environment ε:

\[ \mathcal{L} = \mathbb{E}\big[(y - Q(s, a_i; \theta))^2\big] \qquad (3) \]

where Q(s, a_i; θ) is the output of the network corresponding to the action taken a_i, and

\[ y = \mathbb{E}_{s' \sim \varepsilon}\Big[r + \gamma \max_{a'} Q(s', a'; \theta) \,\Big|\, s, a_i\Big] \qquad (4) \]

Notice that we take the mean squared error between the expected Q-value and the actual Q-value. The neural network is optimized over the course of numerous iterations through some form of gradient descent. In the original DQN (deep Q-network) paper, in which an agent successfully played Atari games from pixels, an adaptive gradient method was used to train this network [Mnih et al., 2013].

Deep Q-networks use experience replay to train the Q-value estimator on a randomly sampled batch of previous experiences (essentially replaying past remembered events back into the neural network) [Lin, 1992]. Experience replay makes the training samples approximately independent and identically distributed (i.i.d.), unlike the highly correlated consecutive samples encountered during interaction with the environment [Schaul et al., 2015]; i.i.d. samples are a prerequisite for many SGD convergence theorems. Additionally, DQN uses an ε-greedy policy: the agent initially acts nearly randomly in order to explore potentially successful strategies, and as the agent learns, it acts randomly less often (this is sometimes called the "exploit" stage, as opposed to the prior "explore" stage). Mathematically, the probability ε of choosing a random action is gradually annealed over the course of training.
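To make equations (3) and (4) concrete, the following NumPy sketch computes the TD targets and the mean-squared TD error for a minibatch of transitions. It is an illustration only, not the paper's Theano/Lasagne code; q_net is an assumed callable returning Q-values of shape (batch, n_actions).

```python
import numpy as np

def dqn_targets_and_loss(q_net, batch, gamma=0.99):
    """Illustrative sketch of equations (3)-(4); `q_net` is an assumed interface."""
    states, actions, rewards, next_states, terminal = batch
    # y = r for terminal transitions, else r + gamma * max_a' Q(s', a'; theta)
    next_q = q_net(next_states).max(axis=1)
    targets = rewards + gamma * next_q * (1.0 - terminal)
    # Mean-squared TD error between targets and the Q-values of the taken actions
    q_taken = q_net(states)[np.arange(len(actions)), actions]
    loss = np.mean((targets - q_taken) ** 2)
    return targets, loss
```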
We combine these two approaches, using natural gradient to optimize the neural network in Q-learning architectures.

Natural Gradient

Gradient descent optimizes the parameters of a model with respect to a loss function by "descending" down the loss manifold. To do this, we take the gradient of the loss with respect to the parameters, then move in the opposite direction of that gradient [Goodfellow et al., 2016]. Mathematically, gradient descent updates the parameters θ of a model mapping from x to y as θ ← θ − α∇_θ L(x, y; θ), given a learning rate α. A commonly used variant of gradient descent is stochastic gradient descent (SGD). Instead of calculating the gradient over the entire dataset at once, SGD uses a mini-batch of training samples: θ ← θ − α∇_θ L(x_i, y_i; θ). Our baselines use Adam, an adaptive gradient optimizer that is a modification of SGD [Kingma and Ba, 2014].

However, gradient descent has a number of issues. For one, it often becomes very slow in plateaus where the magnitude of the gradient is close to zero. Also, while gradient descent takes uniform steps in the parameter space, these do not necessarily correspond to uniform steps in the output distribution. Natural gradient attempts to fix these issues by incorporating the inverse Fisher information matrix, a concept from statistical learning theory [Amari, 1998].

Essentially, the core problem is that Euclidean distances in the parameter space do not give enough information about the distances between the corresponding outputs, as there is not a strong enough relationship between the two [Foti, 2013]. Kullback and Leibler define a more expressive, distribution-wise measure, as follows [1951]:

\[ \mathrm{KL}(\mu_1 \,\|\, \mu_2) = \int_{-\infty}^{\infty} \mu_1(s) \log \frac{\mu_1(s)}{\mu_2(s)} \, ds \qquad (5) \]

However, since KL(μ1‖μ2) ≠ KL(μ2‖μ1), a symmetric KL divergence is defined as follows [Foti, 2013]:

\[ \mathrm{KL}_{\mathrm{sym}}(\mu_1 \,\|\, \mu_2) := \tfrac{1}{2}\big(\mathrm{KL}(\mu_1 \,\|\, \mu_2) + \mathrm{KL}(\mu_2 \,\|\, \mu_1)\big) \qquad (6) \]

To perform gradient descent on the manifold of functions given by our model, we use the Fisher information metric on a Riemannian manifold. Since the symmetric KL divergence behaves like a distance measure in infinitesimal form, a Riemannian metric is derived as the Hessian of the symmetric KL divergence [Pascanu and Bengio, 2013]. We give Pascanu and Bengio's definition, which assumes that the probability of a point sampled from the network is a Gaussian with the network's output as the mean and a fixed variance. Given some probability density function p, input vector s, and parameters θ [Pascanu and Bengio, 2013]:

\[ F_\theta = \mathbb{E}_{s,q}\big[(\nabla \log p_\theta(q \mid s))^T (\nabla \log p_\theta(q \mid s))\big] \qquad (7) \]

Finally, to achieve uniform steps on the output distribution, we use Pascanu and Bengio's definition of the natural gradient of a loss function L [2013]:

\[ \nabla_N \mathcal{L} = \nabla \mathcal{L}\, F_\theta^{-1} \qquad (8) \]

Using this definition and solving the Lagrange-multiplier problem of minimizing the loss of parameters updated by Δθ under the constraint of a constant symmetric KL divergence, one can derive an approximation for constant symmetric KL divergence using the information matrix. Taking the second-order Taylor expansion, Pascanu and Bengio [2013] derive:

\[ \mathrm{KL}_{\mathrm{sym}}(p_\theta \,\|\, p_{\theta + \Delta\theta}) \approx \tfrac{1}{2}\, \Delta\theta^T F_\theta\, \Delta\theta \qquad (9) \]
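To make equations (7) and (8) concrete, the toy sketch below forms an empirical Fisher matrix from per-sample score gradients and applies a damped inverse to an ordinary loss gradient. The explicit matrix construction, the function name, and the array shapes are illustrative assumptions; in the method described below the Fisher is only ever applied as a matrix-vector operator.

```python
import numpy as np

def natural_gradient(score_grads, loss_grad, damping=1e-4):
    """Toy sketch of eqs. (7)-(8): empirical Fisher as the average outer product
    of per-sample score gradients, then a damped solve for F^{-1} grad.
    `score_grads` has shape (n_samples, n_params); `loss_grad` has shape (n_params,)."""
    n, p = score_grads.shape
    fisher = score_grads.T @ score_grads / n                    # F_theta, eq. (7)
    return np.linalg.solve(fisher + damping * np.eye(p), loss_grad)  # eq. (8)
```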
As the output probability distribution depends on the final-layer activation, Pascanu and Bengio [2013] give the following representation for a layer with a linear activation (interpreted as a conditional Gaussian distribution), here adapted for Q-learning, where β is the standard deviation:

\[ p_\theta(q \mid s) = \mathcal{N}\big(q \mid Q(s;\theta),\, \beta^2\big) \qquad (10) \]

In this formulation, since the Fisher information depends only on the final layer's activation, we can use different activations in the hidden layers without changing the Fisher information. As in Pascanu and Bengio [2013], the Fisher information can be derived as follows, where J_Q is the Jacobian of the output vector with respect to the parameters:

\[ F_{\mathrm{linear}} = \beta^{-2}\, \mathbb{E}_{s \sim d^\pi(s)}\big[J_Q^T J_Q\big] \qquad (11) \]

Related Work

We borrow heavily from the approach of Pascanu and Bengio [2013], using their formalization and implementation of natural gradient for deep neural networks in our method.

Next, we look at work on a different method of natural-gradient descent by Desjardins et al. [2015]. In this paper, an algorithm called "Projected Natural Gradient Descent" (PRONG) is proposed, which also builds on the Fisher information matrix. While our paper does not explore this approach, it could be an area of future research, as PRONG is shown to converge better on multiple datasets, such as CIFAR-10 [Desjardins et al., 2015].

Additional methods of applying natural gradient to reinforcement learning algorithms, such as policy gradient and actor-critic, are explored in Kakade [2001] and Peters et al. [2005]. In both works, the natural variants of the respective algorithms are shown to perform favorably compared to their non-natural counterparts. Details on theory, implementation, and results are in their respective papers. Insights into the mathematics of optimization using natural conjugate gradient techniques are provided in the work of Honkela et al. [2015]. These methods allow for more efficient optimization in high-dimensional and nonlinear contexts.

The Natural Temporal Difference Learning algorithm applies natural gradient to reinforcement learning systems based on the Bellman error, although Q-learning is not explored [Dabney and Thomas, 2014]. The authors combine natural gradient with residual gradient, which minimizes the MSE of the Bellman error, and apply natural gradient to SARSA, an on-policy learning algorithm. Empirical experiments show that natural gradient again outperforms standard methods in the tested environments.

Finally, to our knowledge, the only other published or publicly available attempt at natural Q-learning is by Barron et al. [2016]. In this work, the authors re-implemented PRONG and verified its efficacy on MNIST. However, when they applied it to Q-learning, they obtained negative results, with no change on CartPole and worse results on GridWorld.
Methods

In our experiments, we use a standard method of Q-learning to act on the environment. Lasagne [Dieleman et al., 2015], Theano [Theano Development Team, 2016], and AgentNet [Yandex, 2016] perform the brunt of the computational work. Because the implementation of natural gradient that we adapted from Pascanu and Bengio originally fit inputs X to targets y and directly backpropagated a loss, we modify the training procedure to use target values computed as in equations 3 and 4. We also decay the learning rate by multiplying it by a constant factor every iteration.

As the output layer of our Q-network has a linear activation function, we use the parameterization of the Fisher information matrix for linear activations, which determines the natural gradient. For this, we refer to equation 11, approximated at every batch.

We calculate the desired change in parameters according to the Fisher information matrix, as in Pascanu and Bengio [2013], by efficiently solving the system of linear equations relating the desired change in parameters to the gradient of the loss with respect to the weights: Gx = ∂L/∂θ (see Algorithm 1). The MinRes-QLP algorithm solves this linear system by extending MinRes, an existing Krylov-subspace method for linear equations, to ill-conditioned systems such as a singular FIM, using the QLP decomposition of the tridiagonal matrix from the Lanczos process [Choi et al., 2011]. This method finds minimum-length solutions and is robust to different conditions. We also test linear conjugate gradient, an algorithm that solves the linear system by decomposing x into vectors conjugate with respect to G and iteratively calculating its components. Linear conjugate gradient solves linear equations quickly and efficiently, with complexity O(m√κ), where m is the number of nonzero entries of G and κ is its condition number [Shewchuk, 1994].

For both algorithms, a damping factor d is applied to ensure computability: G := G + dI. Another advantage of using linear solvers is that we can represent the FIM as an operator on x without explicitly computing the matrix. We take advantage of this by using Theano's left and right operator representations (Lop and Rop) of the Jacobian, and we also compare the linear solvers to Theano's explicit matrix inversion. This inversion uses Gauss–Jordan elimination to invert the Fisher information matrix, with asymptotic time complexity O(n³) [Theano Development Team, 2016].

Our implementation runs on the OpenAI Gym platform, which provides several classic control environments, such as the ones shown here, as well as other environments such as Atari [Brockman et al., 2016]. The current algorithm takes a continuous state space and maps it to a discrete set of actions.

In Algorithm 1, we adapt Mnih et al.'s Algorithm 1 and Pascanu and Bengio's Algorithm 2 [2013, 2013]. Because these environments do not require preprocessing, we have omitted the preprocessing step; however, it can easily be re-added. In our experiments, the learning-rate decay Δα was chosen somewhat arbitrarily, and α₀ was selected according to our grid search (see: Hyperparameters). Also according to our grid search, we either leave the damping value unchanged or adjust it with the Levenberg–Marquardt heuristic, as used in Pascanu and Bengio [2013] and Martens [2010].
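Before stating the full procedure, here is a minimal, matrix-free sketch of that linear solve in NumPy/SciPy, with a LinearOperator standing in for Theano's Rop/Lop machinery. The function name, the dense jacobian argument, and the damping value are assumptions made for this example rather than details of the released code.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def solve_natural_gradient(jacobian, loss_grad, beta=1.0, damping=1e-4):
    """Sketch of the matrix-free solve G x = dL/dtheta, where the damped Fisher
    G v = J^T (J v) / beta^2 + d v is applied as an operator and never formed
    explicitly. `jacobian` has shape (n_outputs, n_params)."""
    n_params = jacobian.shape[1]

    def fisher_vec(v):
        # (1 / beta^2) J^T (J v) + damping * v
        return jacobian.T @ (jacobian @ v) / beta**2 + damping * v

    G = LinearOperator((n_params, n_params), matvec=fisher_vec)
    x, info = cg(G, loss_grad)          # natural-gradient direction
    assert info == 0, "CG did not converge"
    return x
```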
Algorithm 1: Natural Gradient Deep Q-Learning with Experience Replay
Require: Initial learning rate α₀
Require: Learning rate decay Δα
Require: Function update_damping
Initialize replay memory D to capacity N
Initialize action-value function Q with random weights
α ← α₀
for episode = 1, M do
    Initialize sequence with initial state s₁
    for t = 1, T do
        With probability ε select a random action a_t, otherwise select a_t = argmax_a Q(s_t, a; θ)
        Execute action a_t in the emulator and observe reward r_t and state s_{t+1}
        Store transition (s_t, a_t, r_t, s_{t+1}) in memory D
        Sample a random minibatch of n transitions (s_j, a_j, r_j, s_{j+1}) from D
        y_j ← r_j for terminal s_{j+1}; y_j ← r_j + γ max_{a'} Q(s_{j+1}, a'; θ) for non-terminal s_{j+1}
        g ← ∂L/∂θ
        d ← update_damping(d)
        Define G such that G(v) = (1/n) J_Q^T (J_Q v)
        Solve argmin_x ‖(G + dI)x − ∂L/∂θ‖ with a linear solver (e.g., MinRes-QLP [Choi et al., 2011])
        θ ← θ − αx
        α ← Δα · α
    end for
end for

Figure 1: NGDQN and DQN performance over 10 trials over time, with average line. When training, NGDQN appears to be significantly more stable than the DQN baseline (i.e., NGDQN tended to reliably converge to a solution while the DQN baseline without target networks did not).

Experiments

To run Q-learning models on OpenAI Gym, we adapt Pascanu and Bengio's implementation [2013]. For the baseline, we use OpenAI's open-source Baselines library [Dhariwal et al., 2017], which allows reliable testing of tuned reinforcement learning architectures. As defined in Gym, performance is measured by taking the best 100-episode average reward over the course of a run.

We run a grid search over the parameter spaces specified in the Hyperparameters section, measuring performance for all possible combinations. Because certain parameters, like the exploration fraction, are not used in our implementation of NGDQN, we grid-search those parameters for the baseline as well. As we wish to compare "vanilla" NGDQN to "vanilla" DQN, we test a version in which target networks, model saving, and other features such as prioritized experience replay are not used. To further test the capabilities of NGDQN, we also compare NGDQN (which in these experiments is always run without target networks) to DQN with target networks, in order to show that the algorithm is competitive with other stabilization techniques.

Following this grid search, we take the best-performing configuration for each environment from both DQN and NGDQN and run this configuration 10 times, recording a moving 100-episode average and the mean best 100-episode average across each run. These experiments reveal that NGDQN without target networks compares favorably to standard adaptive gradient techniques, even outperforming DQN with target networks. However, the increase in stability and speed comes with a trade-off: due to the additional computation, natural gradient takes longer to train than adaptive methods such as the Adam optimizer [Kingma and Ba, 2014]. Details can be found in Pascanu and Bengio's work [2013].

Figure 2: Average best 100-episode reward over 10 trials with IQR. In every environment fully run, NGDQN achieves a higher best 100-episode average than our DQN baselines.

We test NGDQN in this manner on four common control environments from https://github.com/openai/gym: CartPole-v0, CartPole-v1, Acrobot-v1, and LunarLander-v2 (see Appendix B).
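The evaluation metric above can be stated compactly in code. The helper below is our own illustration (not part of the Baselines or NGDQN codebase) of the best 100-episode average reward.

```python
import numpy as np

def best_100_episode_average(episode_rewards):
    """Best average reward over any window of 100 consecutive episodes."""
    rewards = np.asarray(episode_rewards, dtype=float)
    if len(rewards) < 100:
        return rewards.mean()
    # Moving 100-episode average computed with a cumulative sum
    csum = np.concatenate(([0.0], np.cumsum(rewards)))
    window_means = (csum[100:] - csum[:-100]) / 100.0
    return window_means.max()
```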
During training, updates to the weights are calculated by solving a linear system (see Algorithm 1), equivalent to a matrix-vector product of the inverse damped FIM with the gradient of the loss. We test the different methods of solving for these updates by computing the parameter updates from both MinRes-QLP and linear conjugate gradient and comparing them to the updates given by an explicitly computed true FIM inversion. After separately solving for these individual updates, we calculate a variety of metrics to record how the natural gradient differs between inversion methods. This process is measured over 100 episodes of training on CartPole-v0, with the agent trained using the linear CG parameter updates.

For the NGDQN algorithm to satisfy the theoretical properties of natural gradient, an accurate inversion method is needed, and in order to create an effective yet efficient algorithm, it is necessary to balance accuracy and computational cost. Because natural gradient alters the step size of the gradient-descent vector according to second-order information, and the angle of that vector through KL divergence, we record the norm of the calculated natural gradient and the angle between the update steps of the true solver and the estimators (calculated as arccos(â · b̂), where â is the flattened, normalized update computed by the true inverse and b̂ is the flattened, normalized update given by a linear solver). We also record the computation time of each method. Finally, to ensure that damping is not significantly skewing the natural-gradient calculation, we compute the maximal eigenvalue of the Fisher information matrix by optimizing max_x̂ x̂ᵀGx̂ (see Appendix C), which gives an indication of the scaling done by the FIM. By comparing this value to the damping factor, we see that the damping is relatively small and does not significantly skew the calculation of the natural gradient.

Results

NGDQN and DQN were run on these four environments, with results summarized in Figure 1. The hyperparameters can be found in Appendix A, and the code for this project in Appendix D. Each environment was run for a number of episodes (see Appendix A), and as per Gym standards, the best 100-episode performance was taken. (The LunarLander-v2 task for NGDQN was not completed, as the Stanford Sherlock cluster where the environments were run does not permit GPU tasks longer than 48 hours; each of the 10 trials was therefore run for 48 hours and then stopped.)

In all experiments, natural gradient converges faster and achieves higher performance more consistently than the DQN benchmark, indicating its robustness on these tasks compared to the standard adaptive gradient optimizer used in the Baselines library (Adam). NGDQN also arrives at better solutions more reliably across the searched hyperparameters, exhibiting its versatility to different configurations compared with the harsh tuning required by DQN (see Appendix A). The success across all tests indicates that natural gradient generalizes well to diverse control tasks, from simpler tasks like CartPole to more complex tasks like LunarLander.

Comparison of the different inversion methods reveals that they calculate similar parameter updates. We find that MinRes-QLP and linear CG arrive at updates with slightly smaller magnitudes and extremely similar directions to those computed with the true matrix inversion of the FIM, indicating that even with these approximations, the algorithm is consistent with the theory behind natural gradient.
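The norm-and-angle comparison described above can be stated compactly in code; the sketch below is our own helper, not taken from the paper's codebase.

```python
import numpy as np

def compare_updates(update_a, update_b):
    """Return the norms of two flattened update vectors and the angle
    arccos(a_hat . b_hat) between them, in radians."""
    a = np.ravel(update_a)
    b = np.ravel(update_b)
    a_hat = a / np.linalg.norm(a)
    b_hat = b / np.linalg.norm(b)
    angle = np.arccos(np.clip(a_hat @ b_hat, -1.0, 1.0))
    return np.linalg.norm(a), np.linalg.norm(b), angle
```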
The damping factor is also shown to be, in most cases, only a small fraction of the maximal eigenvalue, indicating that the FIM is not over-damped. Finally, the compute times for the estimated FIM inversions are shown to be significantly lower than those of the true FIM inversions, showing that these methods help accelerate training. (Compute times were tested on a separate computer, hence they differ slightly from the agent-training times.)

Discussion

In this paper, natural-gradient methods are shown to accelerate and stabilize training on common control tasks, even without target networks. This could indicate that Q-learning's instability may be diminished by optimizing it naturally, and also that natural gradient could be applied to other areas of reinforcement learning in order to address important problems such as sample efficiency.

Although we have not yet empirically investigated the precise cause of this increase in stability, we offer a possible explanation below. One potential cause is that, although the replay buffer partially decorrelates the training set from time, the replay buffer will, over the course of training, become more filled with transitions from later on in each environment's episodes.

During the beginning of training, the agent's buffer will be filled primarily with transitions from when the agent was acting poorly (e.g., in LunarLander-v2, the replay buffer is filled with transitions where the craft eventually plummets to a fiery death). Much later in training, when the agent has learned a policy that achieves higher rewards, the overall composition of this buffer will have shifted to become more relevant to what the agent has to learn later in training. As the first few steps of gradient descent have a disproportionately large impact on the trained model, training with SGD could be destabilized later on by these largely random early transitions [Pascanu and Bengio, 2013].

In this scenario, target networks could help stabilize training. Natural gradient, by comparison, is very robust to reordering of the training set [Pascanu and Bengio, 2013]. This means that NGDQN could potentially use experience acquired later in training more effectively, as the agent's overall policy would not be as skewed toward experience gained early in training. This is, of course, only one possible explanation, and we hope that researchers will investigate this phenomenon further in later work.
Contributions & Acknowledgements
Here, a brief contributor statement is provided, as recommended by Sculley et al. [2018].

References
Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.

Alex Barron, Todor Markov, and Zack Swafford. Deep Q-learning with natural gradients, December 2016. URL https://github.com/todor-markov/natural-q-learning/blob/master/writeup.pdf.

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. ArXiv e-prints, abs/1606.01540, 2016. URL http://arxiv.org/abs/1606.01540.

Sou-Cheng T. Choi, Christopher C. Paige, and Michael A. Saunders. MINRES-QLP: A Krylov subspace method for indefinite or singular symmetric systems. SIAM Journal on Scientific Computing, 33(4):1810–1836, 2011. doi: 10.1137/100787921. URL http://web.stanford.edu/group/SOL/software/minresqlp/MINRESQLP-SISC-2011.pdf.

William Dabney and Philip S. Thomas. Natural temporal difference learning. AAAI Conference on Artificial Intelligence, 2014.

Guillaume Desjardins, Karen Simonyan, Razvan Pascanu, and Koray Kavukcuoglu. Natural neural networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2071–2079. Curran Associates, Inc., 2015. URL http://papers.nips.cc/paper/5953-natural-neural-networks.pdf.

Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, and Yuhuai Wu. OpenAI Baselines. https://github.com/openai/baselines, 2017.

Sander Dieleman, Jan Schlüter, Colin Raffel, Eben Olson, Søren Kaae Sønderby, Daniel Nouri, Daniel Maturana, Martin Thoma, Eric Battenberg, Jack Kelly, Jeffrey De Fauw, Michael Heilman, Diogo Moitinho de Almeida, Brian McFee, Hendrik Weideman, Gábor Takács, Peter de Rivaz, Jon Crall, Gregory Sanders, Kashif Rasul, Cong Liu, Geoffrey French, and Jonas Degrave. Lasagne: First release, August 2015. URL http://dx.doi.org/10.5281/zenodo.27878.

Nick Foti. The natural gradient, January 2013. URL https://hips.seas.harvard.edu/blog/2013/01/25/the-natural-gradient/.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.

T. Hester, M. Vecerik, O. Pietquin, M. Lanctot, T. Schaul, B. Piot, D. Horgan, J. Quan, A. Sendonaris, G. Dulac-Arnold, I. Osband, J. Agapiou, J. Z. Leibo, and A. Gruslys. Deep Q-learning from Demonstrations. ArXiv e-prints, April 2017.

Antti Honkela, Matti Tornio, Tapani Raiko, and Juha Karhunen. Natural conjugate gradient in variational inference. International Conference on Neural Information Processing, 2015.

Sham Kakade. A natural policy gradient. In Thomas G. Dietterich, Suzanna Becker, and Zoubin Ghahramani, editors, Advances in Neural Information Processing Systems 14 (NIPS 2001), pages 1531–1538. MIT Press, 2001. URL http://books.nips.cc/papers/files/nips14/CN11.pdf.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ArXiv e-prints, abs/1412.6980, 2014. URL http://arxiv.org/abs/1412.6980.

S. Kullback and R. A. Leibler. On information and sufficiency. Annals of Mathematical Statistics, 22(1):79–86, 1951. doi: 10.1214/aoms/1177729694. URL http://dx.doi.org/10.1214/aoms/1177729694.

Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.

Long-Ji Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3-4):293–321, 1992.

Hamid Reza Maei. Gradient temporal-difference learning algorithms. ACM Digital Library, 2011. URL https://dl.acm.org/citation.cfm?id=2518887.

James Martens. Deep learning via Hessian-free optimization. In ICML, 2010.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin A. Riedmiller. Playing Atari with deep reinforcement learning. ArXiv e-prints, abs/1312.5602, 2013. URL http://arxiv.org/abs/1312.5602.

OpenAI. Requests for research: Initial commit, 2016. URL https://github.com/openai/requests-for-research/commit/03c3d42764dc00a95bb9fab03af08dedb4e5c547.

OpenAI. Requests for research, 2018. URL https://openai.com/requests-for-research/.

Razvan Pascanu and Yoshua Bengio. Natural gradient revisited. ArXiv e-prints, abs/1301.3584, 2013. URL http://arxiv.org/abs/1301.3584.

Jan Peters and Stefan Schaal. Natural actor-critic. Neurocomputing, 71(7-9):1180–1190, 2008.

Jan Peters, Sethu Vijayakumar, and Stefan Schaal. Natural Actor-Critic, pages 280–291. Springer Berlin Heidelberg, Berlin, Heidelberg, 2005. ISBN 978-3-540-31692-3. doi: 10.1007/11564096_29. URL https://doi.org/10.1007/11564096_29.

Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.

G. A. Rummery and M. Niranjan. On-line Q-learning using connectionist systems. Technical Report 166, Cambridge University Engineering Department, September 1994. URL ftp://svr-ftp.eng.cam.ac.uk/reports/rummery_tr166.ps.Z.

T. Schaul, J. Quan, I. Antonoglou, and D. Silver. Prioritized experience replay. ArXiv e-prints, November 2015.

John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1889–1897, Lille, France, July 2015. PMLR. URL http://proceedings.mlr.press/v37/schulman15.html.

D. Sculley, Jasper Snoek, Alex Wiltschko, and Ali Rahimi. Winner's curse? On pace, progress, and empirical rigor. ICLR 2018 (under review), February 2018. URL https://openreview.net/forum?id=rJWF0Fywf.

J. R. Shewchuk. An introduction to the conjugate gradient method without the agonizing pain, 1994.

Richard S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44, 1988.

Richard S. Sutton and Andrew G. Barto. Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition, 1998. ISBN 0262193981. URL https://pdfs.semanticscholar.org/aa32/c33e7c832e76040edc85e8922423b1a1db77.pdf.

Gerald Tesauro. Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3):58–68, March 1995. ISSN 0001-0782. doi: 10.1145/203330.203343. URL http://doi.acm.org/10.1145/203330.203343.

Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688, May 2016. URL http://arxiv.org/abs/1605.02688.

John N. Tsitsiklis and Benjamin Van Roy. Analysis of temporal-difference learning with function approximation. In Advances in Neural Information Processing Systems, pages 1075–1081, 1997.

Christopher John Cornish Hellaby Watkins. Learning from Delayed Rewards. PhD thesis, King's College, Cambridge, UK, May 1989.

Y. Wu, E. Mansimov, S. Liao, R. Grosse, and J. Ba. Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. ArXiv e-prints, August 2017.

Yandex. AgentNet, 2016. URL https://github.com/yandexdataschool/AgentNet.

Appendix A: Hyperparameters
Both NGDQN and DQN used the Baselines-default minimum epsilon and γ. The NGDQN model was tested with the initial learning rates listed in Tables 7 and 8, and its epsilon decay was set to a fixed value; since there was no equivalent value in the Baselines library, the grid search for Baselines instead included an exploration fraction (defined as the fraction of the entire training period over which the exploration rate is annealed) of 0.01, 0.1, or 0.5 (see Table 4). Likewise, to give the baselines the best chance of beating NGDQN, we also searched a wide range of learning rates, given below.

Environment      Episodes   Hidden layers
CartPole-v0      2000       [64]
CartPole-v1      2000       [64]
Acrobot-v1       10,000     [64, 64]
LunarLander-v2   10,000     [256, 128]
Table 1: Shared configuration

The batch job running times on Sherlock are given below (hours:minutes:seconds). NGDQN LunarLander-v2 was run on the gpu partition, which supplied either an Nvidia GTX Titan Black or an Nvidia Tesla GPU. All other environments were run on the normal partition. Additional details about natural-gradient computation time can be found in Pascanu and Bengio [2013].
Environment      NGDQN Batch Time   DQN Batch Time
CartPole-v0      4:00:00    1:00:00
CartPole-v1      9:00:00    1:00:00
Acrobot-v1       48:00:00   8:00:00
LunarLander-v2   48:00:00   Jobs not completed; see Figure 1 for details

Learning Rate                [1e-08, 1e-07, 1e-06, 1e-05, 1e-04, 1e-03]
Exploration Fraction         [0.01, 0.1, 0.5]
Batch Size                   [32, 128]
Memory Length                [500, 2500, 50,000]
Activation                   [Tanh, ReLU]
Target Network Update Freq   [N/A, 500, 1000, 10,000]
Table 4: Baseline DQN hyperparameter search space

Best grid-searched configurations, used for experiments:
Environment      Learning Rate   Exploration Fraction   Batch Size   Memory Length   Activation
CartPole-v0      1e-07           0.01                   128          2500            Tanh
CartPole-v1      1e-08           0.1                    32           50,000          Tanh
Acrobot-v1       1e-05           0.01                   128          50,000          ReLU
LunarLander-v2   1e-05           0.01                   128          2500            Tanh
Table 5: Baseline DQN hyperparameter configuration
Environment      Learning Rate   Exploration Fraction   Batch Size   Memory Length   Target Net Update Freq   Activation
CartPole-v0      0.001    0.01   128   50,000   500      Tanh
CartPole-v1      0.001    0.01   32    50,000   500      Tanh
Acrobot-v1       0.001    0.01   32    50,000   500      Tanh
LunarLander-v2   0.0001   0.1    128   50,000   10,000   ReLU
Table 6: Baseline DQN with target networks hyperparameter configuration
Environment      Learning Rate   Adapt Damping   Batch Size   Memory Length   Activation
CartPole-v0      0.01   No    128   50,000   Tanh
CartPole-v1      0.01   Yes   128   50,000   Tanh
Acrobot-v1       1.0    No    128   50,000   Tanh
LunarLander-v2   0.01   No    128   50,000   ReLU
Table 7: NGDQN (MinresQLP) hyperparameter configuration
(Due to training idiosyncrasies, the learning-rate search space was different and some configurations for Acrobot-v1 were not run, although we believe that, given the results, this minor difference is insignificant.)

Figure 4: Grid-searched performances over tested hyperparameter configurations for NGDQN (left) and DQN (right), ordered by increasing performance, hinting at robustness to changing hyperparameters.
Environment      Learning Rate   Adapt Damping   Batch Size   Memory Length   Activation
CartPole-v0      0.01   N/A   128   50,000   Tanh
CartPole-v1      0.01   N/A   128   50,000   Tanh
Acrobot-v1       0.1    N/A   128   50,000   Tanh
LunarLander-v2   0.01   N/A   128   50,000   Tanh
Table 8: NGDQN (LinCG) hyperparameter configuration
(Damping adaptation and certain learning-rate, memory-length, and batch-size settings were not tested for LinCG, due to poor results in initial tests and to reduce computational burden.)

Appendix B: Environments
Data is summarized from https://github.com/openai/gym and from information provided on the wiki: https://github.com/openai/gym/wiki.

CartPole-v0
The classic control task CartPole involves balancing a pole on a controllable sliding cart on a frictionless rail for 200 timesteps. The agent "solves" the environment when the average reward over 100 episodes is greater than or equal to 195. However, for the sake of consistency, we measure performance by taking the best 100-episode average reward.

The agent is assigned a reward for each timestep in which the pole angle is within ±12 degrees and the cart position is within ±2.4 units of the center. The agent is given a continuous 4-dimensional space describing the environment, and responds by returning one of two values, pushing the cart either right or left.

CartPole-v1
CartPole-v1 is a more challenging environment which requires the agent to balance a pole on a cart for 500 timesteps rather than 200. The agent solves the environment when it achieves an average reward of 450 or more over 100 consecutive episodes. However, again for the sake of consistency, we measure performance by taking the best 100-episode average reward. This environment behaves essentially identically to CartPole-v0, except that the cart can balance for 500 timesteps instead of 200.

Acrobot-v1

In the Acrobot environment, the agent is given rewards for swinging a double-jointed pendulum up from a stationary position. The agent can actuate the second joint by returning one of three actions, corresponding to left, right, or no torque. The agent is given a six-dimensional vector describing the environment's angles and velocities. The episode ends when the end of the second pole is more than the length of a pole above the base. For each timestep that the agent does not reach this state, it is given a reward of −1.

LunarLander-v2
Finally, in the LunarLander environment, the agent attempts to land a lander at a particular location in a simulated 2D world. If the lander hits the ground going too fast it explodes, and if it runs out of fuel it plummets toward the surface. The agent is given a continuous vector describing the state and responds with one of four discrete actions: doing nothing, or firing the main, left, or right engine. The landing pad is placed in the center of the screen, and the agent is given reward for landing on the pad. The agent also receives a variable amount of reward for coming to rest or for contacting the ground with a leg. The agent loses a small amount of reward by firing the engine, and loses a large amount of reward if it crashes. Although this environment also defines a solve point, we use the same metric as above to measure performance.
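For reference, a minimal interaction loop with these Gym environments looks roughly like the sketch below. It is written against the classic (pre-0.26) Gym step/reset API that was current when these experiments were run, and the policy argument is a hypothetical stand-in for a trained Q-network's greedy action.

```python
import random
import gym  # classic Gym API: reset() -> obs, step() -> (obs, reward, done, info)

def run_episode(env_name="CartPole-v0", policy=None, epsilon=0.1):
    """Run one epsilon-greedy episode and return its total reward."""
    env = gym.make(env_name)
    obs, total_reward, done = env.reset(), 0.0, False
    while not done:
        if policy is None or random.random() < epsilon:
            action = env.action_space.sample()   # explore
        else:
            action = policy(obs)                 # exploit
        obs, reward, done, _ = env.step(action)
        total_reward += reward
    return total_reward
```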
Appendix C: Computation of the Maximal Eigenvalue
To ensure our inversion is not over-damped, we compare the maximal eigenvalue of the Fisher information matrix to its damping factor. To calculate this eigenvalue, we optimize max_x̂ x̂ᵀGx̂, as our FIM is implemented only as a matrix-vector product. Here we give pseudocode to calculate the eigenvector, and we outline a proof that this expression attains its global maximum at the maximal eigenvalue.
Algorithm 2: Algorithm to find the approximate maximum eigenvalue
Require: Matrix-vector product Gx given x
Require: Starting vector v, initialized randomly
Require: Early stopping condition c
Require: Number of training steps
Require: Learning rate α

for i = 1, steps do
    Δv ← ∇_v ( (v/‖v‖) · G(v/‖v‖)ᵀ )
    v ← v + αΔv
    if ‖Δv‖ < c then
        break
    end if
end for
return (v/‖v‖) · G(v/‖v‖)ᵀ

To prove that this is maximized at the maximal eigenvalue of G, we note that x̂ᵀGx̂ is equivalent to the dot product x̂ · Gx̂, which can also be expressed as ‖x̂‖‖Gx̂‖cos θ (where θ is the angle between x̂ and Gx̂). Clearly, ‖x̂‖ = 1. Next, ‖Gx̂‖ is maximized at the maximal eigenvalue of G. This is because any x̂ can be decomposed as a sum over the unit eigenvectors v̂_i of G, x̂ = Σ_{i=1}^{N_λ} c_i v̂_i. When x̂ is acted on by G,

\[ G\hat{x} = G \sum_{i=1}^{N_\lambda} c_i \hat{v}_i = \sum_{i=1}^{N_\lambda} c_i G \hat{v}_i = \sum_{i=1}^{N_\lambda} \lambda_i c_i \hat{v}_i \qquad (12) \]

Thus, ‖Gx̂‖ is maximized at ‖Gx̂‖ = λ_max when x̂ = v̂_max. Furthermore, cos θ is maximized at θ = 0, which also holds when x̂ = v̂_max, since v̂_max and Gv̂_max point in the same direction by the definition of an eigenvector. Since each of the factors is maximized at this point, the expression x̂ᵀGx̂ is maximized when x̂ = v̂_max. Its maximal value is then

\[ \hat{v}_{\max}^T G \hat{v}_{\max} = \hat{v}_{\max}^T (\lambda_{\max} \hat{v}_{\max}) = \lambda_{\max} (\hat{v}_{\max}^T \hat{v}_{\max}) = \lambda_{\max} \qquad (13) \]
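As a sanity check on Algorithm 2, the short NumPy sketch below performs the same gradient ascent on the Rayleigh quotient using only matrix-vector products and compares the estimate against an exact eigendecomposition of a small random PSD matrix. The step count, learning rate, and tolerance are illustrative choices for the example rather than the values used in the paper.

```python
import numpy as np

def max_eigenvalue(G_matvec, dim, steps=500, lr=0.1, tol=1e-8):
    """Estimate lambda_max by gradient ascent on f(v) = v_hat^T G v_hat,
    using only matrix-vector products with G."""
    v = np.random.randn(dim)
    for _ in range(steps):
        norm = np.linalg.norm(v)
        v_hat = v / norm
        Gv = G_matvec(v_hat)
        f = v_hat @ Gv
        grad = 2.0 * (Gv - f * v_hat) / norm   # gradient of the Rayleigh quotient
        v = v + lr * grad
        if np.linalg.norm(grad) < tol:         # early stopping
            break
    v_hat = v / np.linalg.norm(v)
    return v_hat @ G_matvec(v_hat)

# Quick numerical check against an exact eigendecomposition
A = np.random.randn(6, 6)
G = A.T @ A                                    # symmetric PSD, like a damped FIM
print(max_eigenvalue(lambda x: G @ x, 6), np.linalg.eigvalsh(G).max())
```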
The code for this project can be found at https://github.com/hyperdo/natural-gradient-deep-q-learning. It uses a fork of OpenAI Baselines to allow for different activation functions: https://github.com/hyperdo/baselines.