An Empirical Comparison of Neural Architectures for Reinforcement Learning in Partially Observable Environments
Denis Steckelmacher, Peter Vrancx
AI-lab, Vrije Universiteit Brussel, Pleinlaan 2, 1050 Brussel
Abstract
This paper explores the performance of fitted neural Q iteration for reinforcement learning in several partially observable environments, using three recurrent neural network architectures: Long Short-Term Memory [7], Gated Recurrent Unit [3] and MUT1, a recurrent neural architecture evolved from a pool of several thousand candidate architectures [8]. A variant of fitted Q iteration, based on Advantage values [6, 1] instead of Q values, is also explored. The results show that GRU performs significantly better than LSTM and MUT1 for most of the problems considered, requiring fewer training episodes and less CPU time before learning a very good policy. Advantage learning also tends to produce better results.
Introduction

Reinforcement learning was originally developed for Markov Decision Processes (MDPs). It allows an agent to learn a policy that maximizes a possibly delayed reward signal in a stochastic environment, and guarantees convergence to an optimal policy, provided that the agent can experiment sufficiently and the environment in which it operates is Markovian.

In many real-world problems, however, the agent cannot directly perceive the full state of its environment and must make decisions based on incomplete observations of the system state. This partial observability introduces uncertainty about the true environment state and renders the problem non-Markovian from the agent's point of view. One way to deal with partially observable environments is to equip the agent with a memory of past observations and actions, in order to help it discover what the current state of the environment is. This memory can be implemented in a variety of ways, including explicit history windows [9, 10], but this article focuses only on reinforcement learning using recurrent neural networks for function approximation. Unlike basic feed-forward networks, recurrent neural networks can contain cyclic connections between neurons. These cycles give rise to dynamic temporal behavior, which can function as an internal memory that allows these networks to model values associated with sequences of observations [7, 4, 1]. This paper aims at comparing different recurrent neural architectures when used to model value functions in a reinforcement learning context.

The next section provides the necessary background on reinforcement learning and the recurrent network architectures compared in this paper. Section 3 describes the experimental setup and environments used for the comparison. The empirical results are provided in Section 4. Finally, we conclude in Section 5.
Discrete-time reinforcement learning consists of an agent that repeatedly senses observations of its environment and performs actions. After each action $a_t \in A$, the environment changes state to $s_{t+1} \in S$ and the agent receives a reward $r_{t+1} = R(s_t, a_t, s_{t+1}) \in \mathbb{R}$ and an observation $o_{t+1} \in O = f(s_{t+1})$. The agent has no knowledge of $R(s, a, s')$ and $f(s)$ and has to interact with its environment in order to learn a policy $\pi(o_t) \in \mathbb{R}^{|A|}$, which gives the probability of taking each of the actions for any given observation. The optimal policy $\pi^*$ is the one that, when followed by the agent, maximizes the cumulative discounted reward $r = \sum_t \gamma^t r_t$, with $0 \le \gamma \le 1$.

When the reward received by the agent depends solely on its current observation and action, the problem reduces to a Markov decision process and is said to be completely observable (the agent can assume that $o_t = s_t$ without losing learning abilities). Partially observable Markov decision problems occur when the reward does not depend only on $o_t$, but on the state $s_t$, whose dynamics still obey some underlying MDP, but which the agent cannot observe directly. In this case, $o_t = f(s_t)$, with $f$ an unknown one-way function that is part of the environment.

Q-Learning [13] and Advantage Learning [6] allow an agent to learn a policy that converges to the optimal policy, given an infinite amount of time and in discrete domains. Q-Learning estimates the $Q(o, a)$ function, which maps each observation-action pair to the expected, optimal cumulative discounted reward reachable by taking action $a$ given observation $o$. At each time step, the agent observes $o_t$, takes action $a_t$ and observes $r_{t+1}$ and $o_{t+1}$. Equations 1 and 2 are used to update the Q function after each time step, with $0 \le \alpha \le 1$ the learning factor.

$\delta_t = r_{t+1} + \gamma \max_a Q_k(o_{t+1}, a) - Q_k(o_t, a_t)$  (1)
$Q_{k+1}(o_t, a_t) = Q_k(o_t, a_t) + \alpha \delta_t$  (2)

Advantage Learning [6] is related to Q-Learning, but artificially decreases the value of non-optimal actions. This widens the difference between the value of the optimal action and the other ones, which allows learning to converge more easily even when the values are approximated (using function approximation). Equations 3 and 4 are used to update the Advantage values at each time step [1]. The smaller $\kappa$ is, the wider the gap between the optimal and non-optimal actions becomes.

$\delta_t = \max_a A_k(o_t, a) + \frac{r_{t+1} + \gamma \max_a A_k(o_{t+1}, a) - \max_a A_k(o_t, a)}{\kappa} - A_k(o_t, a_t)$  (3)
$A_{k+1}(o_t, a_t) = A_k(o_t, a_t) + \alpha \delta_t$  (4)

In very large or even continuous environments, exact representation of the Q-function (or Advantage function) is no longer possible. In these cases, a function approximation architecture is needed to represent the target function. It has been shown, however, that on-line Q-Learning can diverge, or converge very slowly, when used in combination with function approximation [11]. One solution to this problem is to learn the Q-values off-line. The method used in this paper is the neural fitted Q iteration described in [11], an adaptation of fitted Q iteration [5] using neural networks. The agent interacts with its environment using a fixed policy until it reaches the goal or a maximum number of time steps have elapsed, and collects samples of the form $(o_t, a_t, r_{t+1}, o_{t+1})$. After a number of episodes have been run, the model is trained in batch on the collected data.
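As a concrete illustration of Equations 1–4, the sketch below implements the tabular Q-Learning and Advantage Learning updates in Python. The dictionary-based storage, the helper names and the default values of alpha, gamma and kappa are illustrative assumptions, not values taken from the paper.

```python
from collections import defaultdict

# Tabular value estimates, indexed by (observation, action) pairs; unseen pairs default to 0.
Q = defaultdict(float)   # Q-Learning values
A = defaultdict(float)   # Advantage-Learning values

def q_update(o_t, a_t, r_next, o_next, actions, alpha=0.1, gamma=0.9):
    """One Q-Learning step (Equations 1 and 2)."""
    best_next = max(Q[(o_next, a)] for a in actions)
    delta = r_next + gamma * best_next - Q[(o_t, a_t)]
    Q[(o_t, a_t)] += alpha * delta

def advantage_update(o_t, a_t, r_next, o_next, actions,
                     alpha=0.1, gamma=0.9, kappa=0.3):
    """One Advantage-Learning step (Equations 3 and 4).

    The temporal-difference term is scaled by 1/kappa, which widens the gap
    between the optimal action and the others as kappa becomes smaller.
    """
    best_here = max(A[(o_t, a)] for a in actions)
    best_next = max(A[(o_next, a)] for a in actions)
    delta = (best_here
             + (r_next + gamma * best_next - best_here) / kappa
             - A[(o_t, a_t)])
    A[(o_t, a_t)] += alpha * delta
```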
The model maps sequences of observations to action values: $M : O^\infty \to \mathbb{R}^{|A|}$. The next subsections describe the different recurrent network architectures that we consider in this paper to represent the target functions.

An LSTM [7] cell stores a value. An output gate allows the cell to modulate its output strength, while an input gate controls the intensity of the input signal that is continuously added to the cell's content. A forget gate, when set to zero, clears the content of the cell. Equations 5 to 7 show how the values of the gates are computed, Equations 8 and 9 show how to compute the value of the memory cell, and Equation 10 shows the output of an LSTM cell.
Figure 1: Cyclic architecture proposed by [1] (left), and architecture used in this paper (right).

$i_t^j = \sigma(W_i x_t + U_i h_{t-1})^j$  (5)
$f_t^j = \sigma(W_f x_t + U_f h_{t-1})^j$  (6)
$o_t^j = \sigma(W_o x_t + U_o h_{t-1})^j$  (7)
$\tilde{c}_t^j = \tanh(W_c x_t + U_c h_{t-1})^j$  (8)
$c_t^j = f_t^j c_{t-1}^j + i_t^j \tilde{c}_t^j$  (9)
$h_t^j = o_t^j \tanh(c_t^j)$  (10)

In [1], Bakker proposes a neural network architecture tailored for reinforcement learning. The network has one input neuron per observation variable, and one output neuron per action. A softmax layer transforms the output of the neural network into a probability distribution over the actions. The neural network itself consists of an LSTM layer and a simple tanh layer working in parallel: the input of the network is fed to the LSTM and tanh layers, both layers are connected to the output, the output of the tanh layer is connected to the input of the LSTM layer, and the output of the LSTM layer is connected to the input of the tanh layer (see Figure 1). This article uses a simpler version of the network: the input is connected to a tanh layer, which is in turn connected to an LSTM layer, which is connected to the output. Both the tanh layer and the LSTM layer contain 100 neurons (or LSTM cells). The models themselves are built on Keras (http://keras.io/), a Python library providing neural network primitives based on Theano [2]. Keras provides LSTM, GRU, MUT and dense fully-connected layers (among others), which can be assembled either in a stack or in a directed acyclic graph. The connection scheme of [1] makes the network layer graph cyclic, and hence impossible to build using the current version of Keras, which motivates the simpler stacked architecture used here.

GRU has been introduced recently and follows a design completely different from LSTM [3, 4]. Instead of storing a value in a memory cell and updating it using input and forget gates, a GRU unit computes a candidate activation $\tilde{h}_t$ based on its input, and then produces an output that is a blend of its past output and the candidate activation. Equations 11 and 12 show how the $z$ (modulation) and $r$ (reset) gates are computed, Equations 13 and 14 show how the input is mixed with the last activation in order to produce the candidate activation, and Equation 15 shows how the last activation and the candidate activation are mixed to produce the new activation.

$z_t^j = \sigma(W_z x_t + U_z h_{t-1})^j$  (11)
$r_t^j = \sigma(W_r x_t + U_r h_{t-1})^j$  (12)
$\tilde{x}_t^j = r_t^j h_{t-1}^j$  (13)
$\tilde{h}_t^j = \tanh(W x_t + U \tilde{x}_t)^j$  (14)
$h_t^j = (1 - z_t^j) h_{t-1}^j + z_t^j \tilde{h}_t^j$  (15)
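To make the gate equations concrete, the following NumPy sketch implements a single LSTM step (Equations 5–10) and a single GRU step (Equations 11–15). The dictionary-of-matrices weight layout and the omission of bias terms mirror the equations above rather than any particular library; the function names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U):
    """One LSTM step. W and U map gate names to weight matrices."""
    i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev)          # input gate (5)
    f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev)          # forget gate (6)
    o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev)          # output gate (7)
    c_tilde = np.tanh(W["c"] @ x_t + U["c"] @ h_prev)    # candidate content (8)
    c = f * c_prev + i * c_tilde                         # new cell content (9)
    h = o * np.tanh(c)                                   # gated cell output (10)
    return h, c

def gru_step(x_t, h_prev, W, U):
    """One GRU step."""
    z = sigmoid(W["z"] @ x_t + U["z"] @ h_prev)          # modulation gate (11)
    r = sigmoid(W["r"] @ x_t + U["r"] @ h_prev)          # reset gate (12)
    x_tilde = r * h_prev                                 # reset past activation (13)
    h_tilde = np.tanh(W["h"] @ x_t + U["h"] @ x_tilde)   # candidate activation (14)
    return (1.0 - z) * h_prev + z * h_tilde              # blended new activation (15)
```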
Józefowicz et al. observed that GRU and LSTM are very different from each other, and wondered whether other recurrent neural architectures could be used. In order to discover them, they developed a genetic algorithm that evaluated thousands of recurrent neural architectures. Once the experiment was finished, they identified three architectures that performed as well as or better than LSTM and GRU on their test vectors: MUT1, MUT2 and MUT3 [8]. This paper only considers MUT1, which produced the best results in preliminary experiments. Equations 16 and 17 show how to compute the values of the $z$ and $r$ gates, Equations 18 and 19 show how to compute the candidate activation, and Equation 20 shows that the output of MUT1 uses the same type of mixing as the one used by GRU.

$z_t^j = \sigma(W_z x_t)^j$  (16)
$r_t^j = \sigma(W_r x_t + W_h h_{t-1})^j$  (17)
$\hat{h}_t^j = r_t^j h_{t-1}^j$  (18)
$\tilde{h}_t^j = \tanh(W_{\hat{h}} \hat{h}_t + \tanh(x_t))^j$  (19)
$h_t^j = (1 - z_t^j) h_{t-1}^j + z_t^j \tilde{h}_t^j$  (20)

In order to keep training time manageable, the neural networks are trained to associate values with the last 10 observations, instead of the complete history. LSTM, GRU and MUT1 are able to associate values with arbitrarily long sequences of inputs, but Keras requires all the sequences on which it is trained to have the same length (possibly with padding).

Training has to be done carefully, because one does not want the model to forget past experiences when a new batch of episodes is learned. The network has been configured to perform 2 training epochs on the data, using a batch size of 10 (batches of 10 observation-sequence/target-value samples, each mapping a sequence of observations to $\mathbb{R}^{|A|}$, are used to compute an average gradient when performing backpropagation). The small number of epochs prevents the model from overfitting specific episodes.

Three environments are used to evaluate the neural network models. The first one is a simple fully-observable 10 × 5 grid world, with the initial position at x = 0, the goal at x = 9 and an obstacle at x = 5 (see Figure 2). The agent can observe its (x, y) coordinates. It receives a small negative reward at each time step, a larger negative reward if it hits a wall or the obstacle, and a positive reward when it reaches the goal.

The second environment is based on the same grid world as the first one, but the agent can only observe its x coordinate. The y coordinate is masked to zero.

The last environment is also based on the grid world, but the agent can only observe its orientation (whether it is facing up, down, left or right, expressed as an integer from 0 to 3) and the distance between it and the wall in front of it. This agent-centric environment is very close to what actual robots can experience.

Figure 2: Grid world; the initial position (when fixed) is at S, the goal is at G, and the obstacle is depicted by O.

Q̂(s, a) is a neural network model
for e = 1 to 5000 do
    H_e ← ∅
    for t = 1 to 500 (or until the goal is reached) do
        Agent observes o_t, takes action a_t, receives reward r_{t+1} and observation o_{t+1}
        Q(o_t, a_t) ← Q̂(o_t, a_t) + α (r_{t+1} + γ max_a Q̂(o_{t+1}, a) − Q̂(o_t, a_t))
        H_e ← H_e ∪ {(o_t, a_t, Q(o_t, a_t))}
    end for
    if e is a multiple of 10 then
        Train Q̂(s, a) on H_{e−9}, ..., H_e
    end if
end for

Figure 3: Neural fitted Q iteration as used in this paper.

The "stochastic" variant of the experiments uses a random initial position for every episode. The agent can sense its initial (x, y) coordinates at the first time step, even in otherwise partially observable environments (some experiments have been re-run without this hint, with no change in the results: the agent learns to look left, then up, and uses those observations as its initial position).

The observations of the agent, which consist of integer numbers, are encoded using a one-hot encoding so that they are more easily processed by neural networks. For instance, the y coordinate of the grid world can take values from 0 to 4, which are encoded as (1, 0, 0, 0, 0), (0, 1, 0, 0, 0), ..., (0, 0, 0, 0, 1). For the 10 × 5 grid world, the neural networks therefore have 15 input neurons.

Each experiment consists of 5000 episodes of a maximum of 500 time steps. During the episodes, the neural network is not trained on any new data, but $Q_{k+1}(s, a)$ values are computed based on $Q_k(s, a)$ and stored in a list. After every batch of 10 episodes, the neural networks are trained on the $Q_{k+1}$ values, as described in [11] and shown in Figure 3.

The experiments themselves consist of trying to reach the goal in one of the environments described in Section 3.1.
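The following sketch shows how the stacked network described above (a 100-unit tanh layer feeding a 100-unit recurrent layer) and the batch training procedure (2 epochs, batch size 10, 10-step observation windows, 15 one-hot inputs) could be expressed with the modern tf.keras API. The paper used an earlier Keras version based on Theano, and the number of actions, the optimizer and the mean-squared-error loss are assumptions made for illustration.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

SEQ_LEN = 10     # last 10 observations fed to the network
N_INPUTS = 15    # one-hot encoded x (10 values) and y (5 values)
N_ACTIONS = 4    # assumption: up, down, left, right

def build_model(recurrent_layer=layers.GRU):
    """Dense tanh layer, recurrent layer, linear action-value outputs."""
    model = keras.Sequential([
        keras.Input(shape=(SEQ_LEN, N_INPUTS)),
        layers.Dense(100, activation="tanh"),   # applied independently at every time step
        recurrent_layer(100),                   # layers.LSTM, layers.GRU, or a custom cell
        layers.Dense(N_ACTIONS),                # one value per action
    ])
    model.compile(optimizer="rmsprop", loss="mse")  # optimizer and loss are assumptions
    return model

# Batch training on collected (observation sequence, target values) pairs:
# 2 epochs with mini-batches of 10 samples, as described in the text.
model = build_model()
X = np.zeros((100, SEQ_LEN, N_INPUTS), dtype=np.float32)  # placeholder sequences
y = np.zeros((100, N_ACTIONS), dtype=np.float32)          # placeholder target values
model.fit(X, y, epochs=2, batch_size=10, verbose=0)
```

The same builder can be reused with `layers.LSTM` in place of `layers.GRU` to compare the two architectures under identical training settings.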
Each experiment is run 15 times for each combination of the following parameters:

• Value iteration: Q-Learning and Advantage Learning, with fixed α, γ and κ (each set to a value between 0 and 1)
• Neural network architecture: feed-forward perceptron with a single hidden layer (nnet), LSTM (lstm), GRU (gru) and MUT1 (mut1)
• World: grid world (gw), partially observable grid world (po) and agent-centric grid world (ac)
• Fixed initial position and random initial position
• Softmax action selection with a temperature of 0.5 (a minimal sketch of this selection rule follows this list)
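A minimal sketch of the softmax (Boltzmann) action-selection rule with temperature 0.5 mentioned in the last item above; the function name and the example values are illustrative.

```python
import numpy as np

rng = np.random.default_rng()

def softmax_action(values, temperature=0.5):
    """Sample an action index with probability proportional to exp(value / temperature)."""
    prefs = np.asarray(values, dtype=np.float64) / temperature
    prefs -= prefs.max()                        # subtract the max for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return rng.choice(len(probs), p=probs)

# Example: pick an action from illustrative action values for one observation sequence.
action = softmax_action([-1.2, -0.4, -0.9, -1.0])
```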
Each experiment (see Section 3.2) is run 15 times. The first time step at which the agent is able to maintain an average reward, over the next 1000 time steps, above a fixed (negative) threshold with a standard deviation of less than 20 is called the learning time. The best average reward obtained during a 1000-time-step window is called the learning performance. A sketch of how both metrics can be computed from a reward trace is given below.
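The sketch below shows how these two metrics could be computed from a per-time-step reward trace. The `threshold` argument stands in for the paper's fixed reward threshold, whose exact value is not reproduced here; the function names are illustrative.

```python
import numpy as np

def learning_time(rewards, threshold, window=1000, max_std=20.0):
    """First time step from which the next `window` rewards have a mean above
    `threshold` and a standard deviation below `max_std`; None if this never happens."""
    rewards = np.asarray(rewards, dtype=np.float64)
    for t in range(len(rewards) - window + 1):
        chunk = rewards[t:t + window]
        if chunk.mean() > threshold and chunk.std() < max_std:
            return t
    return None

def learning_performance(rewards, window=1000):
    """Best average reward obtained over any `window`-time-step window."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return max(rewards[t:t + window].mean()
               for t in range(len(rewards) - window + 1))
```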
Table 1 shows the learning time of the different neural networks for all the experiment configurations, in a mean/stddev format, for four settings: (a) fixed initial position and Advantage Learning, (b) fixed initial position and Q-Learning, (c) random initial position and Advantage Learning, and (d) random initial position and Q-Learning. The best results are emphasized in bold.

Table 1: Learning time of all the experiments (rows: gw, po, ac; columns: nnet, lstm, gru, mut1).

Advantage Learning leads to smaller learning times and standard deviations than Q-Learning in all worlds except the partially observable grid world. Figure 4 shows the behavior of the Q-Learning and Advantage Learning algorithms in the partially observable grid world; there, Q-Learning allows faster convergence with a smaller standard deviation.

When using a fixed initial position, GRU learns faster than any other network. The difference in learning speed between GRU and LSTM is statistically significant for (gw, Advantage), (gw, Q-Learning), (po, Q-Learning) and (ac, Q-Learning) (p-values of 0.003, 0.008, 0.0003 and 0.0003, respectively), but not for (po, Advantage) and (ac, Advantage) (p-values of 0.118 and 0.140, respectively).

When using a random initial position, GRU is the only model that allows learning in all the environments when Advantage Learning is used. LSTM and GRU give comparable results in the partially observable worlds, with no statistically significant difference between them.

Agents using MUT1 as a function approximator nearly always manage to learn a good enough policy in partially observable worlds, but they need a large number of episodes to do so. Plain perceptron-based agents, however, do not manage to learn a policy in these worlds at all (except in the partially observable grid world using Q-Learning and random initial positions, where the agent learns to go left, then randomly go up and down until the goal is reached by chance), which shows that MUT1 allows better learning in partially observable worlds than a simple non-recurrent neural network.

Table 2 shows the learning performance of the different neural networks for the same four settings (a)–(d), with the highest values highlighted in bold. Results are displayed in a mean/stddev format.

The feed-forward neural network always achieves the best scores in the grid world, followed by GRU, then LSTM, and finally MUT1. GRU always outperforms the other network architectures in the partially observable worlds. In these worlds, GRU is statistically significantly better than LSTM in all cases (p-value less than 0.0001) except (random initial position, ac, Q-Learning) (p-value of 0.682).

Table 2: Learning performance of all the experiments (rows: gw, po, ac; columns: nnet, lstm, gru, mut1).

Figure 4: Average reward over 15 runs for each episode: (a) LSTM, partially observable grid world, Advantage; (b) GRU, partially observable grid world, Advantage; (c) GRU, partially observable grid world, Q-Learning. In the partially observable grid world, Q-Learning allows higher rewards using GRU and LSTM (GRU shown). Using Advantage, GRU and LSTM need more episodes before learning.
Conclusion
LSTM, GRU and MUT1 have been compared on simple reinforcement learning problems. It has been shown that agents using LSTM and GRU for approximating Q or Advantage values perform significantly better than the ones using MUT1, obtaining higher rewards and learning faster.

GRU and LSTM provide comparable performance, with GRU often being significantly better than LSTM, while LSTM is never significantly better than GRU. When considering the rewards received by the agents once they have learned, rather than the time required for learning, GRU always achieves better results than LSTM.

This shows that using GRU instead of LSTM should be considered when tackling reinforcement learning problems. Moreover, on the machine used for the experiments, the simpler GRU cell (compared to LSTM) allowed the GRU-based agents to complete their 5000 episodes approximately two times faster than the LSTM-based agents.
References

[1] B. Bakker. Reinforcement Learning with Long Short-Term Memory. In Advances in Neural Information Processing Systems 14, pages 1475–1482, 2001.
[2] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio. Theano: a CPU and GPU Math Expression Compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), 2010.
[3] K. Cho, B. van Merrienboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1724–1734, 2014.
[4] J. Chung, Ç. Gülçehre, K. Cho, and Y. Bengio. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. CoRR, abs/1412.3555, 2014.
[5] D. Ernst, P. Geurts, and L. Wehenkel. Tree-Based Batch Mode Reinforcement Learning. Journal of Machine Learning Research, 6:503–556, 2005.
[6] M. E. Harmon and L. C. Baird III. Multi-player residual advantage learning with general function approximation. Wright Laboratory, WL/AACF, Wright-Patterson Air Force Base, OH, 1996.
[7] S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, 1997.
[8] R. Józefowicz, W. Zaremba, and I. Sutskever. An Empirical Exploration of Recurrent Network Architectures. In Proceedings of the 32nd International Conference on Machine Learning, pages 2342–2350, 2015.
[9] L.-J. Lin and T. Mitchell. Reinforcement Learning with Hidden States. In From animals to animats 2: Proceedings of the second international conference on simulation of adaptive behavior, volume 2, page 271. MIT Press, 1993.
[10] A. K. McCallum. Learning to use selective attention and short-term memory in sequential tasks. In From animals to animats 4: Proceedings of the fourth international conference on simulation of adaptive behavior, volume 4, page 315. MIT Press, 1996.
[11] M. A. Riedmiller. Neural Fitted Q Iteration - First Experiences with a Data Efficient Neural Reinforcement Learning Method. In Machine Learning: ECML 2005, 16th European Conference on Machine Learning, pages 317–328, 2005.
[12] O. Tange. GNU Parallel - The Command-Line Power Tool. ;login: The USENIX Magazine, 36(1):42–47, Feb 2011.
[13] C. J. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3–4):279–292, 1992.