Extrapolation in Gridworld Markov-Decision Processes
Eugene Charniak∗
Department of Computer Science, Brown University
April 16, 2020
Abstract
Extrapolation in reinforcement learning is the ability to generalize at test time given states that could never have occurred at training time. Here we consider four factors that lead to improved extrapolation in a simple Gridworld environment: (a) avoiding maximum Q-value (or other deterministic methods) for action choice at test time, (b) ego-centric representation of the Gridworld, (c) building rotational and mirror symmetry into the learning mechanism using rotational and mirror invariant convolution (rather than standard translation-invariant convolution), and (d) adding a maximum entropy term to the loss function to encourage equally good actions to be chosen equally often.
Reinforcement learning is commonly studied with no distinction made between training and testing data. This is reasonable when the set of states encountered in training is so large that the agent is unlikely to see a novel (test-time) situation thereafter, or if the research goals lie elsewhere, such as efficient policy convergence at training time. However, for RL to be trusted in the "wild," it is so important that it works in novel cases that increasingly researchers deliberately distinguish between training and testing data, and ensure that the latter includes "unreachable" states [WLT+18, PGK+18, ZVMB18, MDG+19].

∗ Thanks to George Konidaris and Michael Littman for advice and guidance. Problems, of course, are completely my own.

While a major impetus to this research was the work of [WLT+18], [NPH+18] have developed a version of the Sonic the Hedgehog™ game that allows the user to get at enough of the internals to create an environment in which different levels of the game are available either for training or for testing.

[Figure 1: Gridworld game on a 7*7 grid bounded by walls ("X"), where "@" represents the agent, which receives a reward if it gets to the position marked with "*".]

Inspired by this new version of Sonic, [CKH+18] create a new game from scratch, coin run, designed to make training and testing on distinct levels easy. Similarly, [JKB+19] create the game obstacle tower.

In reinforcement learning, a Gridworld is a Markov-Decision Process in which a single agent moves around a two-dimensional grid by means of four possible actions: left, down, right, and up. There are many variations, e.g., [BM95, CH03, MMB14]. In this paper all Gridworld MDPs will be of size 7*7, as in Figure 1, with the outermost rows and columns occupied by walls.
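For concreteness, the following is a minimal sketch of such a Gridworld environment. The object codes, the class and function names, the 0/1 reward scheme, and the treatment of moves into walls as no-ops are our own assumptions for illustration; the paper specifies only the grid size, the border walls, and that reaching the goal is rewarded.

```python
import numpy as np

# Assumed object codes: 0 = space, 1 = wall, 2 = agent, 3 = goal.
# Actions are indexed 0-3 as in the paper: left, down, right, up.
ACTIONS = [(0, -1), (1, 0), (0, 1), (-1, 0)]

class Gridworld:
    """Sketch of a 7*7 Gridworld MDP with border walls, one agent, one goal."""

    def __init__(self, agent, goal, g=7):
        self.grid = np.zeros((g, g), dtype=np.int64)
        self.grid[0, :] = self.grid[-1, :] = 1   # top and bottom walls
        self.grid[:, 0] = self.grid[:, -1] = 1   # left and right walls
        self.agent, self.goal = agent, goal

    def step(self, a):
        dr, dc = ACTIONS[a]
        r, c = self.agent[0] + dr, self.agent[1] + dc
        if self.grid[r, c] != 1:                 # a move into a wall is a no-op
            self.agent = (r, c)
        done = self.agent == self.goal
        return self.agent, (1.0 if done else 0.0), done
```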
The basic function-approximation scheme is shown in Figure 2.

[Figure 2: Basic function approximation of Q values or policy-probability logits: board → 1D embedding → W → Q values or logits.]

The state is the grid of object indices (we let g be the one-dimensional size of the grid, so g = 7). The object indices are immediately converted to trainable embeddings. Then a single full-width matrix W converts the input to Q values (for Q-function learning methods) or, for policy-gradient learning, logits that are fed into a softmax. The loss function for Q-value-function learning is the squared distance between the current value estimate and the improved estimate computed from the next reward plus the maximum estimated Q value for the next state. In all cases we use experience replay [SQAS15]. For hyperparameter details, see Appendix A.

Table 1 shows how well our deep Q-learning version of Gridworld is able to generalize. A constant of generalization research is that the more variation we see at training time the better the generalization. For example, [CKH+18] vary the number of distinct training levels of their game; in addition to 10 test instances, we leverage the simplicity of our Gridworld to reduce the number of training instances from 16 down to 1, as the label in the first column of each row indicates. The second and third columns give two measures of train-time performance. Column two indicates the fraction of train-time completions, which increases monotonically from 77% with 16 train-time instances to 100% with one or two such instances. Column three shows, for the games that are completed, how many steps the completion took, on average, over the minimum (mistake-free) trajectory length. This too shows improving train-time performance as the variation seen at train time decreases.

However, the test-time performance illustrated in Table 1 is quite different. To a first approximation, none of the test-time games are solved. A look at what happens in these games quickly shows that in virtually all cases the policy created has an infinite loop of one or more steps in all of the 200 testing instances. (We test on 10 game instances, all of which start in an unreachable state. We run the entire train-test cycle 20 times, creating a new suite of training and testing games each time, and report mean results.) E.g., in position (3,4) the maximum Q value corresponds to moving up, but in position (2,4) it is for moving down, so the agent oscillates between these two squares forever. The key point here is that standard Q-function learning at test time has us always deterministically choosing the move with maximum Q value in the current state, so any inaccuracy in the Q-value/state association is blown up by the determinism.†

Secondly, there are lots of reasons to expect inaccurate associations. The input state representation is simply our 2D grid, so at test time, if our agent is in position (3,4), the most similar input states are those with the agent at this position no matter where the reward state is located, since in all such positions the difference is in one grid location. So to a first approximation the policy at (3,4) is going to be independent of the testing reward location. In particular, when trained on a single goal, the agent will still head toward the single training goal, not the testing one. When trained on 16 various goals, it will head toward whichever of the goals happens to be ascendant.

† We have done experiments that combine Q-learning with a temperature-guided probabilistic choice of actions at test time. While this improves extrapolation greatly, we have observed odd behavior using this scheme that we have not been able to diagnose, and thus have put it aside in this study.
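To make Figure 2 concrete, here is a hedged sketch of the network and of the Q-learning loss just described, written in PyTorch (our choice; the paper does not name its framework). The layer sizes follow the text and Appendix A, but all names and the discount factor gamma are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FlatNet(nn.Module):
    """Figure 2: object indices -> trainable embeddings -> one full-width
    matrix W -> four Q values (or, for policy-gradient learning, logits)."""

    def __init__(self, g=7, n_objects=4, emb_dim=2, n_actions=4):
        super().__init__()
        self.emb = nn.Embedding(n_objects, emb_dim)
        self.w = nn.Linear(g * g * emb_dim, n_actions)

    def forward(self, grid):                      # grid: (batch, g, g) indices
        return self.w(self.emb(grid).flatten(1))  # (batch, n_actions)

def q_loss(net, s, a, r, s_next, done, gamma=0.9):
    """Squared distance between the current Q estimate and the improved
    one-step estimate: next reward plus max Q of the next state."""
    q = net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                          # target is held fixed
        target = r + gamma * net(s_next).max(dim=1).values * (1.0 - done)
    return ((q - target) ** 2).mean()
```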
An ego-centric grid representation, also called an agent-centric or deictic representation [AC87, RB04, KSB12, FGKO12, JRK19], is one in which the representation depends on the agent's point of view. In this paper we adopt a very simple version of this in which, rather than placing the origin of the Gridworld at an arbitrary point, e.g., the upper left-hand corner, we instead make the origin always the position of the agent. Or, as we will show it, the agent is always at the center of the grid. So the grid of Figure 1 would now look like Figure 3. Notice how the agent is at location (5,5), and no matter where it moves in the smaller 7*7 space, it remains at location (5,5) in ego-centric space. We have also padded both the X and Y axes with 4 extra spaces. This allows the representation to put the agent in the center and still show the complete original grid, even if in the original grid the agent was in one of the extreme corners. (We let x be the one-dimensional size of the ego-centric array; given that the agent is barred from rows and columns 0 and 6, it can be seen that to make sure every bit of the original grid is visible in the ego-centric version, x = 2g − 3 = 11.)
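As a concrete sketch, the ego-centric conversion can be a simple re-indexing. Following the text, out-of-grid cells are padded with spaces (object code 0 under the codes assumed earlier); the function name is ours.

```python
import numpy as np

def egocentric(grid, agent, g=7):
    """Recenter the g*g grid on the agent in an x*x array, x = 2g - 3 = 11,
    so the agent always sits at the center (x//2, x//2) = (5, 5)."""
    x = 2 * g - 3
    ego = np.zeros((x, x), dtype=grid.dtype)  # 0 = space padding
    r0 = x // 2 - agent[0]                    # offsets that place the agent
    c0 = x // 2 - agent[1]                    # at the center of the array
    ego[r0:r0 + g, c0:c0 + g] = grid
    return ego
```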
To see why this matters, consider the vector board representation (the g*g grid flattened row by row into a vector of length 49) in two situations: in one game the agent is at (2,1) with a wall to its left at (2,0); in another it is at (4,1) with a wall at (4,0):

    position:   0   1  ···  14  15  ···  28  29  ···  48
    board 1:    X   X  ···   X   @  ···   X      ···   X
    board 2:    X   X  ···   X      ···   X   @  ···   X

When we learn not to go into the wall from (2,1), the agent position corresponds to position 15, and the change in W will be specific to this location. There will be little to nothing that translates into information about what to do when the agent is at location (4,1) (position 29 in the vector board representation). Contrast this with the processing when we use an ego-centric representation (the 11*11 ego-centric array flattened into a vector of length 121, with the agent always at position 60). In both cases we see the following piece of the state representation:

    position:     ···  59  60  ···
    both boards:  ···   X   @  ···

Table 3 shows our extrapolation results when using ego-centric coordinates. Contrast them with those in Table 2. Most notably, the completion percentages have more than doubled (falling from 32% to 10% as training instances decrease without ego-centrism, versus from 84% to 24% with it). The quality of the solutions is up (see the number of moves over the minimum necessary), and from the small probability mass devoted to trivially wrong moves (e.g., 3.6% when trained on 16 instances) it is clear that the policy has learned to handle the basics reasonably well when trained on 16, 8, or 4 instances.
In the last section we saw how an ego-centric spatial representation allows an agent to automatically generalize between some kinds of learning experiences: those which look similar when viewed from the agent's point of view. There are other cases, however, that ego-centrism does not catch. For example, learning not to move right when there is a wall to your right does not help in learning not to move left when there is a wall to the left. Or more generally, Gridworld is inherently 90-degree rotationally symmetric, but nothing we have done so far captures this.‡

‡ It might be noted/objected that something like the same generalization could be obtained by using convolutional filters on the original grid, since convolution automatically induces translation invariance. But it is not too hard to convince yourself that translation invariance is not really right for Gridworld problems. We see a vertical wall in positions (0, 0:6) and (6, 0:6) but not elsewhere. Similarly, if we see the agent in one location, it is excluded from all others at that time instant.

Exploiting symmetry has occasionally been used in computer-vision work, but searches for papers about computer vision and symmetry almost exclusively return papers on recognizing objects with symmetry. One exception is [DWD15], who applied convolutional deep learning to the classification of galaxies. Pictures of galaxies do, of course, have rotational symmetry,
and this property is added in [DWD15] by data augmentation: the galactic photographs are repeated in the data set with different 90-degree rotations. Rotation is also mentioned in [CKH+18], but in the immediate context of image augmentation, and no study of its use in RL is presented. [MVT16] directly build rotational convolution into their texture-recognition system, as texture is another area where images show rotational symmetry. However, they do this by enforcing rotational symmetry on their convolutional kernels, which makes sense for texture but does not work for us. A theoretical analysis of symmetry in RL can be found in [RB04].

Here we too build rotational (and later mirror-image) symmetry directly into our learning mechanism, but in a novel fashion. Standard convolution builds translational symmetry into visual processing by translating patches to line up with a kernel before taking the dot product. We do the same, but rather than a translation/dot-product process we use a rotation/dot-product one. The shape of the kernel is shown in Figure 5. To get full coverage of the image we rotate it four times to bring each quarter into alignment with our pattern. The idea is that the matrix W of Figure 2 should not cover the entire x*x Gridworld matrix but rather just a quarter of it.§

[Figure 5: Triangular-quadrant two, t(2), array values for Figure 3. The triangular quadrants are labeled 0-3, in correspondence with the directions of motion 0-3. In the figure, blank space indicates grid positions not in t(2), and "o" indicates positions that are in t(2) and hold a space in the grid.]

§ To be more precise, since a symmetric form for our quadrant requires overlap between the triangular shapes (at the origin, and along the diagonals), the four rotated copies of the kernel shape together cover x² + 2(x − 1) + 3 grid positions rather than x²; each triangular quadrant thus contains ((x + 1)/2)² = 36 positions when x = 11.
[Figure 6: A triangular octant for Figure 3.]
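The following is a hedged sketch of rotational weight sharing in PyTorch, in the spirit of the rotation/dot-product scheme just described: one quadrant-sized set of weights is applied to the four 90-degree rotations of the ego-centric grid, one rotation per direction of motion. The quadrant mask, the rotation-to-action correspondence, and all names are our assumptions; mirror symmetry (the octants of Figure 6) could be added by also averaging over a flipped copy of the grid.

```python
import torch
import torch.nn as nn

class RotSharedNet(nn.Module):
    """One kernel W covering a triangular quadrant (cf. Figure 5), shared
    across the four rotations of the x*x ego-centric grid; rotation k
    produces the logit (or Q value) for action k."""

    def __init__(self, x=11, n_objects=4, emb_dim=2):
        super().__init__()
        self.emb = nn.Embedding(n_objects, emb_dim)
        r = torch.arange(x).view(-1, 1).expand(x, x)
        c = torch.arange(x).view(1, -1).expand(x, x)
        ctr = x // 2
        # Mask for one triangular quadrant (diagonals included); masking
        # means W only ever receives input from a quarter of the grid.
        self.register_buffer("quad", ((ctr - c) >= (r - ctr).abs()).float())
        self.w = nn.Linear(x * x * emb_dim, 1)

    def forward(self, grid):                  # grid: (batch, x, x) indices
        e = self.emb(grid)                    # (batch, x, x, emb_dim)
        outs = []
        for k in range(4):                    # one 90-degree rotation per action
            g = torch.rot90(e, k, dims=(1, 2))
            g = g * self.quad.unsqueeze(0).unsqueeze(-1)
            outs.append(self.w(g.flatten(1)))
        return torch.cat(outs, dim=1)         # (batch, 4)
```

By construction, a wall seen to the agent's left and the same wall seen below it activate the same parameters after rotation, which is exactly the generalization the text is after.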
Policy-gradient RL methods in general, and REINFORCE [Wil92] in particular, do not necessarily converge to a unique optimum when two or more actions both engender the maximum discounted reward. In such cases any combination of the best moves summing to one produces an equally good policy.¶

¶ One might hope that actor-critic methods such as a2c [MBM+16] would fix this problem, since they explicitly compute the value function and, after all, the values of the states resulting from equally good moves are equal. However, our experiments indicate this is not the case. While the values for the moves end up more or less equal, this does not change the dynamics of the policy-probability update mechanism any more than do the correct discounted rewards when using REINFORCE.
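A hedged sketch of the REINFORCE loss with the simplified maximum-entropy term used here (Appendix A: a fraction f of the highest action probability is added to the loss); the value of f shown is illustrative, not the paper's setting.

```python
import torch

def reinforce_loss(logits, actions, returns, f=0.05):
    """Policy-gradient loss plus a penalty on the largest action
    probability, nudging equally good actions toward equal probability."""
    log_probs = torch.log_softmax(logits, dim=-1)
    picked = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    pg_loss = -(picked * returns).mean()                # standard REINFORCE
    max_prob = log_probs.exp().max(dim=-1).values.mean()
    return pg_loss + f * max_prob                       # entropy-style penalty
```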
We have presented results on extrapolation in Gridworld MDPs. In particular we conclude that

• Policy-gradient methods extrapolate better than ones using estimates of Q values. First, as seen in many other contexts, it seems to be easier to learn a policy directly than to estimate Q values; more specifically, the Q-value optimization step of deterministically taking the move that leads to maximum state value makes extrapolation very difficult.

• Ego-centric space representation makes a difference, since similarities in spatial configurations better align with those that need attention for move choice. E.g., a wall to the left looks nearly identical for various agent positions, thus bringing into play the same deep-learning parameters.

• Building rotational and mirror-image symmetry into the learning mechanism helps a great deal, as it goes much further in unifying the response to common situations (the same parameters deal with moving left to the goal state as with moving down). Furthermore, this can be accomplished in a relatively straightforward fashion using rotational and mirror convolution rather than standard translational convolution.

• All of the policy-gradient methods we have tried have a very strong tendency to find nearly minimum-entropy action choices when two or more actions are equally good, leading to fewer well-explored states at training time. Furthermore, we find that adding a maximum-entropy loss term has only a very small effect on completion rate, while noticeably worsening the quality of solutions.

To boil this down even further: after we moved from Q-value learning to policy-gradient methods, the most important modifications we found were the ego-centric representation and building symmetry into the convolution mechanism.

References
[AC87] Philip E. Agre and David Chapman. Pengi: An implementation of a theory of activity. In AAAI, pages 268–272, 1987.

[BM95] Justin A. Boyan and Andrew W. Moore. Generalization in reinforcement learning: Safely approximating the value function. In Advances in Neural Information Processing Systems, pages 369–376, 1995.

[CH03] Paul Crook and Gillian Hayes. Learning in a state of confusion: Perceptual aliasing in grid world navigation. Towards Intelligent Mobile Robots, 4, 2003.

[CKH+18] Karl Cobbe, Oleg Klimov, Chris Hesse, Taehoon Kim, and John Schulman. Quantifying generalization in reinforcement learning. arXiv preprint arXiv:1812.02341, 2018.

[DWD15] Sander Dieleman, Kyle W. Willett, and Joni Dambre. Rotation-invariant convolutional neural networks for galaxy morphology prediction. Monthly Notices of the Royal Astronomical Society, 450(2):1441–1459, 2015.

[FGKO12] Sarah Finney, Natalia Gardiol, Leslie Pack Kaelbling, and Tim Oates. The thing that we tried didn't work very well: deictic representation in reinforcement learning. arXiv preprint arXiv:1301.0567, 2012.

[JKB+19] Arthur Juliani, Ahmed Khalifa, Vincent-Pierre Berges, Jonathan Harper, Ervin Teng, Hunter Henry, Adam Crespi, Julian Togelius, and Danny Lange. Obstacle tower: A generalization challenge in vision, control, and planning. arXiv preprint arXiv:1902.01378, 2019.

[JRK19] Steven James, Benjamin Rosman, and George Konidaris. Learning portable representations for high-level planning. arXiv preprint arXiv:1905.12006, 2019.

[KSB12] George Konidaris, Ilya Scheidwasser, and Andrew Barto. Transfer in reinforcement learning via shared features. Journal of Machine Learning Research, 13(May):1333–1371, 2012.

[MBM+16] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.

[MDG+19] Bhairav Mehta, Manfred Diaz, Florian Golemo, Christopher J. Pal, and Liam Paull. Active domain randomization. arXiv preprint arXiv:1904.04762, 2019.

[MMB14] Alexey A. Melnikov, Adi Makmal, and Hans J. Briegel. Projective simulation applied to the grid-world and the mountain-car problem. arXiv preprint arXiv:1405.5459, 2014.

[MVT16] Diego Marcos, Michele Volpi, and Devis Tuia. Learning rotation invariant convolutional filters for texture classification. In International Conference on Pattern Recognition (ICPR), pages 2012–2017. IEEE, 2016.

[NPH+18] Alex Nichol, Vicki Pfau, Christopher Hesse, Oleg Klimov, and John Schulman. Gotta learn fast: A new benchmark for generalization in RL. arXiv preprint arXiv:1804.03720, 2018.

[PGK+18] Charles Packer, Katelyn Gao, Jernej Kos, Philipp Krähenbühl, Vladlen Koltun, and Dawn Song. Assessing generalization in deep reinforcement learning. arXiv preprint arXiv:1810.12282, 2018.

[RB04] Balaraman Ravindran and Andrew G. Barto. An algebraic approach to abstraction in reinforcement learning. PhD thesis, University of Massachusetts at Amherst, 2004.

[SQAS15] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.

[Wil92] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.

[WLT+18] Sam Witty, Jun Ki Lee, Emma Tosch, Akanksha Atrey, Michael Littman, and David Jensen. Measuring and characterizing generalization in deep reinforcement learning. arXiv preprint arXiv:1812.02868, 2018.

[ZVMB18] Chiyuan Zhang, Oriol Vinyals, Remi Munos, and Samy Bengio. A study on overfitting in deep reinforcement learning. arXiv preprint arXiv:1804.06893, 2018.
A Hyperparameters
We use an embedding size of 2 for our 4 objects (space, wall, agent, goal). To speed convergence in training, embeddings are initialized to (0,0) for space, and to three of the four points (±0.1, 0), (0, ±0.1) for the other objects. For game reuse we maintain a set of 228 full games. At every epoch the oldest 32 are removed and a new 32, generated with the current policy, are added. Games are terminated when the goal state is achieved or after 100 moves.

A training epoch consists of 3*NumMoves random selections from the set of moves. There are 200 epochs. We used the Adam optimizer with a learning rate of .002 and a batch size of 10. Gradients were clipped at +/-20.

We simplified the maximum-entropy calculations in Section 6 by adding a fraction f of the highest action probability to the loss.
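Collected as a configuration sketch (the values are those stated above; the dictionary and its key names are ours):

```python
HPARAMS = dict(
    embedding_dim=2,     # per-object embedding size
    replay_games=228,    # full games kept for experience replay
    games_per_epoch=32,  # oldest 32 replaced by 32 new games each epoch
    max_moves=100,       # games terminated after this many moves
    epochs=200,
    optimizer="Adam",
    learning_rate=2e-3,
    batch_size=10,
    grad_clip=20.0,      # gradients clipped at +/-20
)
```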