Depth and nonlinearity induce implicit exploration for RL
Justas Dauparas*, Ryota Tomioka, Katja Hofmann
University of Cambridge & Microsoft Research, Cambridge, UK
May 31, 2018
Abstract
The question of how to explore, i.e., take actions with uncertain outcomes to learn about possible future rewards, is a key question in reinforcement learning (RL). Here, we show a surprising result: Q-learning with a nonlinear Q-function and no explicit exploration (i.e., a purely greedy policy) can learn several standard benchmark tasks, including mountain car, equally well as, or better than, the most commonly used ε-greedy exploration. We carefully examine this result and show that both the depth of the Q-network and the type of nonlinearity are important to induce such deterministic exploration.
Reinforcement learning (RL) is a systematic approach to learning in sequential decision problems, where a learner's future task performance depends on its past actions. In such settings, learners have to explore, meaning they have to take actions with uncertain outcomes, to facilitate learning about the consequences of such actions. The question of how to best explore is a key open question in RL. Here, we specifically address this question from an empirical perspective, and investigate how to explore in a way that leads to sample-efficient learning in deep RL, i.e., reinforcement learning with value function approximators that are parameterized as powerful neural networks. We present a surprising finding: in this setting, good approximate value functions can be learned without any explicit exploration. In fact, we find that in several cases learning without explicit exploration is equally or more sample efficient than the most commonly used ε-greedy exploration scheme on several standard benchmark tasks. We present additional results that suggest a likely role of model structure (network depth and nonlinearity) in inducing such implicit exploration. We believe that our insights have strong practical implications and open up a novel line of research towards understanding exploration in deep RL.
We briefly outline the components of our approach that form the basis of our investigation. We assume a standard formulation of RL as learning in a Markov Decision Process, where the learner is tasked with finding an optimal policy π*. For any given policy π the Q-value, also called the state-action value, can be written as Q^π(s, a) := E[r(s, a) + Σ_{t=1}^∞ γ^t r(s_t, a_t)], i.e., the expected discounted (with discount factor γ) cumulative reward from taking action a in state s and following policy π thereafter. An optimal policy achieves the optimal Q-values Q*(s, a) := max_π Q^π(s, a).

Learning approach (DDQN)
Q-learning-based approaches estimate Q* using an iterative approach that bootstraps estimates of Q(s, a) from those of subsequent states s′, using the recursion Q(s, a) = r(s, a) + γ max_{a′} Q(s′, a′). In approaches based on deep Q-learning (Mnih et al., 2015), Q-value estimates are parameterized by a deep neural network, and trained using stochastic gradient descent on interaction data obtained through interaction with an environment under a behavior policy. In Double DQN (DDQN; Van Hasselt et al., 2016), gradient updates minimize the squared loss ‖Q(s, a; θ_t) − r(s, a) − γ Q(s′, argmax_{a′} Q(s′, a′; θ_t); θ′_t)‖², where the parameters of the Q-function are denoted by θ and we explicitly distinguish between model parameters θ and target parameters θ′. Stochastic updates are computed on mini-batches sampled from a replay buffer, a record of past experience.

* Part of this work was done while Justas was a Research Intern at Microsoft Research Cambridge.

Figure 1: Comparison of no explicit exploration (ε = 0) to a linearly decaying ε on the mountaincar-v0 task (5 random seeds): (a) ε = 0; (b) ε decayed in 25k steps; (c) ε decayed in 100k steps.

Figure 2: cartpole-v0 and acrobot-v1 tasks (10 random seeds): (a) cartpole-v0, ε = 0; (b) cartpole-v0, ε decayed in 10k steps; (c) acrobot-v1, ε = 0; (d) acrobot-v1, ε decayed in 10k steps.
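The DDQN update above bootstraps from the action selected by the online network (θ) but evaluated by the target network (θ′). A minimal numpy sketch of the target computation (function and variable names are our own, for illustration):

```python
import numpy as np

def ddqn_targets(q_online_next, q_target_next, rewards, gamma):
    """Compute r + gamma * Q(s', argmax_a' Q(s', a'; theta); theta').

    q_online_next, q_target_next: arrays of shape (batch, n_actions)
    holding Q-values at the next state s' under the online network
    (theta) and the target network (theta'), respectively.
    """
    a_star = np.argmax(q_online_next, axis=1)                  # action chosen by the online net
    bootstrap = q_target_next[np.arange(len(a_star)), a_star]  # evaluated by the target net
    return rewards + gamma * bootstrap

# The squared loss is then mean((Q(s, a; theta) - target)**2), minimized by
# stochastic gradient descent on mini-batches drawn from the replay buffer.
```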
Exploration
We contrast a greedy behavior policy with the standard ε-greedy approach (Sutton and Barto, 1998). A greedy policy selects actions a*_θ = argmax_a Q(s, a; θ). In ε-greedy, an action is sampled uniformly at random with probability ε (the exploration rate), while the greedy action is selected with probability 1 − ε. Following common practice (Mnih et al., 2015), we decay the exploration rate over time.

Tasks

We use the following OpenAI Gym (Brockman et al., 2016) tasks: mountaincar-v0, cartpole-v0, and acrobot-v1. These are common RL benchmarks (Duan et al., 2016).
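The two behavior policies described in the Exploration paragraph can be sketched as follows (a minimal illustration; function names and the default decay horizon are ours, the 25k-step default matching one of the schedules in Figure 1):

```python
import numpy as np

def epsilon_at(step, eps_start=1.0, eps_end=0.0, decay_steps=25_000):
    """Linearly decay epsilon from eps_start to eps_end over decay_steps."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def select_action(q_values, eps, rng):
    """Epsilon-greedy: uniform random action w.p. eps, greedy otherwise.

    With eps = 0 this reduces to the purely greedy policy studied here."""
    if rng.random() < eps:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))
```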
Hyper-parameters
For the experiments on mountaincar-v0, we used a replay buffer of size 200k, batch size 256, a discount factor γ, and the Adam optimizer (Kingma and Ba, 2015) with learning rate α; the target network was updated every 1000 steps. For cartpole-v0 and acrobot-v1 we used a replay buffer of size 50k; all other parameters were the same.

In Figure 1, we plot the reward statistics (mean, median, 2%- and 98%-percentiles) obtained by running DDQN on the mountaincar-v0 task with (a) no explicit exploration (ε = 0), (b) linear decay of the exploration rate ε from 1 to 0 in 25k steps, and (c) linear decay in 100k steps. 5 independent random seeds were used to obtain the statistics. The plots show that the agent without explicit exploration (ε = 0) can solve the task equally well as, or even slightly better than, standard exploration strategies. We confirmed similar results on cartpole-v0 and acrobot-v1 (Fig. 2).

How can an agent explore without randomness? Note that all of the above environments are deterministic except for their initial states. If it is not the environment or stochasticity in the behavior policy, it must be some property of the Q-network that is inducing the exploration.

To understand what induces the exploration on the mountaincar-v0 task, we carried out further experiments with the following Q-network architectures (see Fig. 4):

1. Linear (no hidden layer);
2. 1 hidden layer with 128 ReLU units;
3. 2 hidden layers with 128 ReLU units in each layer (the original setting);
4. 2 hidden layers with 128 tanh units in each layer.

Figure 3: Vector fields in the phase space of the mountaincar-v0 task with and without a random Q-function as a controller. Blue: with the controller; black: the uncontrolled system. (a) A random nonlinear Q-function as the controller. (b) A random linear Q-function as the controller.

All results were obtained with ε = 0. Within each column of Fig. 4, we plot the reward statistics (as above), and phase-space diagrams showing the 1000 state transitions leading up to 10k, 20k, 40k, and 160k steps. The trajectories are superimposed on histograms of the state-visit frequencies, colored from black (zero) to white (more than 100). The red vertical lines indicate the goal states.

In the first column of Fig. 4, we can see that without any nonlinearity the agent was not able to reach the goal state even once, and consequently did not learn the task at all, although we believe that a linear agent is sufficient to solve this task (Mania et al., 2018). By contrast, we can see in the second column that the agent is able to solve the task with a single ReLU hidden layer of size 128. We also experimented with two fully-connected layers without nonlinearity, and with just one fully-connected layer initialized with a large weight initialization scale, but none of these were as successful as the networks with ReLU nonlinearities. The original setup of two hidden layers (third column) seems to be slightly better than one hidden layer. The last column shows the same result for two hidden layers with the tanh nonlinearity. The reward curve appears slightly noisier than for the ReLU activations, but this may be due to the high variance.
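The four Q-network variants compared in Fig. 4 can be sketched as numpy forward passes (128 units per hidden layer as in the text; the 2-dimensional mountain-car state and 3 actions are the task's, while the helper names and the use of zero biases with Glorot-style weights are our illustrative choices):

```python
import numpy as np

def glorot(rng, fan_in, fan_out):
    """Glorot/Xavier uniform initialization."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def make_mlp(rng, sizes, act):
    """Build a forward pass for an MLP with the given layer sizes."""
    Ws = [glorot(rng, a, b) for a, b in zip(sizes[:-1], sizes[1:])]
    bs = [np.zeros(b) for b in sizes[1:]]
    def forward(s):
        h = s
        for W, b in zip(Ws[:-1], bs[:-1]):
            h = act(h @ W + b)          # hidden layers apply the nonlinearity
        return h @ Ws[-1] + bs[-1]      # linear output: one Q-value per action
    return forward

rng = np.random.default_rng(0)
relu = lambda x: np.maximum(x, 0.0)
q_linear = make_mlp(rng, [2, 3], relu)            # 1. no hidden layer (linear)
q_relu_1 = make_mlp(rng, [2, 128, 3], relu)       # 2. one hidden layer, ReLU
q_relu_2 = make_mlp(rng, [2, 128, 128, 3], relu)  # 3. two hidden layers, ReLU
q_tanh_2 = make_mlp(rng, [2, 128, 128, 3], np.tanh)  # 4. two hidden layers, tanh
```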
Can deterministic exploration be an alternative to random exploration?
Deterministic exploration is attractive because it would avoid the unnatural dithering behavior often observed with ε-greedy and other stochastic exploration strategies. From a control-theory perspective, an easy way to induce exploration is to destabilize the underlying system. For example, a small inverse damper term (i.e., an acceleration proportional to the speed) would be sufficient for the mountain car task, because success does not depend on the speed at which the goal state is reached. However, this is not the case for other benchmark tasks (e.g., acrobot-v1), and it would be a bad idea for real-world systems. Another way to induce deterministic exploration would be to induce chaotic dynamics. For example, for acrobot-v1, it is enough for the controller to compensate for gravity. However, in both cases it is unlikely that a randomly drawn initial Q-function behaves like an inverse damper term or a gravity compensator. In this paper we did not design an optimal deterministic exploration behavior, but we demonstrated that such a behavior can be induced by the network architecture.
What is the role of the nonlinearity?
We plot the vector fields of the mountaincar-v0 task with and without a randomly initialized Q-function with two hidden layers as a controller in Fig. 3(a). All weight matrices were initialized using Glorot initialization (Glorot and Bengio, 2010) and all bias terms were initialized to zero. We also plot 10 trajectories from random initial states. The same plot with a linear Q-function is shown in Fig. 3(b). Comparing the two plots, we notice that the nonlinear Q-function can modify the dynamics in multiple regions of the phase space, in a way a linear Q-function cannot.
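One way to see the difference (our own illustration, not the paper's experiment): along any line through the phase space, the greedy action of a linear Q-function with n actions is the argmax of n affine functions and can therefore switch at most n − 1 times, so a random linear controller carves the phase plane into only a few regions; a random two-hidden-layer ReLU network (Glorot weights, zero biases, as in the text) can switch far more often:

```python
import numpy as np

rng = np.random.default_rng(0)

def glorot(fan_in, fan_out):
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

W_lin = glorot(2, 3)                       # linear Q: Q(s) = s @ W_lin
W1, W2, W3 = glorot(2, 128), glorot(128, 128), glorot(128, 3)
q_lin = lambda s: s @ W_lin
q_mlp = lambda s: np.maximum(np.maximum(s @ W1, 0) @ W2, 0) @ W3

# Scan a slice of the mountain-car phase space at a fixed small velocity.
positions = np.linspace(-1.2, 0.6, 2000)
states = np.stack([positions, np.full_like(positions, 0.01)], axis=1)

def switches(q):
    """Count how often the greedy action changes along the slice."""
    a = np.argmax(q(states), axis=1)
    return int(np.sum(a[1:] != a[:-1]))

# For 3 affine Q-values, at most 2 switches are possible along any line.
assert switches(q_lin) <= 2
```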
Limitations
We note several limitations. First, stochasticity may be induced by the initial states, as they are randomly sampled from [−0.6, −0.4] in mountaincar-v0. Second, we have only compared the purely greedy policy (ε = 0) to the ε-greedy approach.

In this note we have shown that competitive performance on standard RL benchmarks can be achieved without explicit exploration when deep neural networks are used as function approximators in Q-learning. Our analysis suggests that both network depth and nonlinearity play a role by inducing optimism without overgeneralization. While we have mainly focused on the aspect of optimism induced by a deterministic policy, another important aspect is understanding the role of uncertainty. We believe that combining uncertainty quantification (e.g., bootstrapped DQN; Osband et al., 2016) with deterministic exploration could be an interesting alternative to standard stochastic exploration.

Figure 4: Reward (5 random seeds) and trajectories for different Q-network architectures. Columns: no hidden layer (linear); 1 hidden layer (ReLU); 2 hidden layers (ReLU); 2 hidden layers (tanh). Rows: reward, followed by phase-space snapshots at 10k, 20k, 40k, and 160k steps.

References
Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv:1606.01540, 2016.

Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. In ICML, pages 1329–1338, 2016. arXiv:1604.06778.

Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, pages 249–256, 2010.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015. arXiv:1412.6980.

Horia Mania, Aurelia Guy, and Benjamin Recht. Simple random search provides a competitive approach to reinforcement learning. arXiv:1803.07055, 2018.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped DQN. In NIPS, pages 4026–4034, 2016. arXiv:1602.04621.

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, 1998.

Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In AAAI, volume 16, pages 2094–2100, 2016. arXiv:1509.06461.