Bootstrapped Q-learning with Context Relevant Observation Pruning to Generalize in Text-based Games
Subhajit Chaudhury, Daiki Kimura, Kartik Talamadupula, Michiaki Tatsubori, Asim Munawar, Ryuki Tachibana
Subhajit Chaudhury
IBM Research AI [email protected]
Daiki Kimura
IBM Research AI [email protected]
Kartik Talamadupula
IBM Research AI [email protected]
Michiaki Tatsubori
IBM Research AI [email protected]
Asim Munawar
IBM Research AI [email protected]
Ryuki Tachibana
IBM Research AI [email protected]
Abstract
We show that Reinforcement Learning (RL) methods for solving Text-Based Games (TBGs) often fail to generalize to unseen games, especially in small data regimes. To address this issue, we propose Context Relevant Episodic State Truncation (CREST) for irrelevant token removal in observation text for improved generalization. Our method first trains a base model using Q-learning, which typically overfits the training games. The base model's action token distribution is used to perform observation pruning that removes irrelevant tokens. A second bootstrapped model is then retrained on the pruned observation text. Our bootstrapped agent shows improved generalization in solving unseen TextWorld games, using 10x-20x fewer training games than previous state-of-the-art (SOTA) methods, while requiring fewer training episodes.

1 Introduction

Reinforcement Learning (RL) methods are increasingly being used for solving sequential decision-making problems from natural language inputs, such as text-based games (Narasimhan et al., 2015; He et al., 2016; Yuan et al., 2018; Zahavy et al., 2018), chat-bots (Serban et al., 2017), and personal conversation assistants (Dhingra et al., 2017; Li et al., 2017; Wu et al., 2016). In this work, we focus on Text-Based Games (TBGs), which require solving goals like "Obtain coin from the kitchen" based on a natural language description of the agent's observation of the environment. To interact with the environment, the agent issues text-based action commands ("go west"), upon which it receives a reward signal used for training the RL agent. Traditional text-based RL methods focus on the problems of partial observability and large action spaces. However, the topic of generalization to unseen TBGs is less explored in the literature.
Goal:
Who's got a virtual machine and is about to play through a fast paced round of TextWorld? You do! Retrieve the coin in the balmy kitchen.

Observation: You've entered a studio. You try to gain information on your surroundings by using a technique you call "looking." You need an unguarded exit? You should try going east. You need an unguarded exit? You should try going south. You don't like doors? Why not try going west, that entranceway is unblocked.
Bootstrapped Policy Action: go south
Figure 1: Our method retains context-relevant tokens from the observation text (shown in green) while pruning irrelevant tokens (shown in red). A second policy network re-trained on the pruned observations generalizes better by avoiding overfitting to unwanted tokens.

We show that previous RL methods for TBGs often exhibit poor generalization to unseen test games. We hypothesize that such overfitting is caused by the presence of irrelevant tokens in the observation text, which can lead to action memorization. To alleviate this problem, we propose
CREST, which first trains an overfitted base model on the original observation text of the training games using Q-learning. Subsequently, we apply observation pruning: for each episode of the training games, we remove the observation tokens that are not semantically related to the base policy's action tokens. Finally, we re-train a bootstrapped policy on the pruned observation text using Q-learning; this policy generalizes better because the irrelevant tokens have been removed. Figure 1 shows an illustrative example of our method. Experimental results on TextWorld games (Côté et al., 2018) show that our proposed method generalizes to unseen games using almost 10x-20x fewer training games compared to SOTA methods, and features significantly faster learning.
Figure 2(a) schematic: the original observation text is encoded by the base model (an LSTM encoder with an attention map and context vector), and the base model's episodic action tokens (e.g., "go", "west", ..., "coin") are aggregated and compared to each observation token via a ConceptNet-based similarity score, yielding a token relevance distribution; thresholding this distribution produces the pruned observation text (e.g., "exit east"), on which the bootstrapped policy is trained with Q-learning.
Figure 2: (a) Overview of the Context Relevant Episodic State Truncation (CREST) module, which uses a Token Relevance Distribution for observation pruning. Our method shows better generalization from 10x-20x fewer training games and faster learning with fewer episodes on (b) "easy" and (c) "medium" validation games.

2 Related Work

LSTM-DQN (Narasimhan et al., 2015) is the first work on text-based RL combining natural language representation learning and deep Q-learning. LSTM-DRQN (Yuan et al., 2018) is the state of the art on TextWorld CoinCollector games, and addresses the issue of partial observability by using memory units in the action scorer. Fulda et al. (2017) proposed a method for affordance extraction via word embeddings trained on a Wikipedia corpus. AE-DQN (Action-Elimination DQN), a combination of a deep RL algorithm with an action-eliminating network for sub-optimal actions, was proposed by Zahavy et al. (2018). Recent methods (Adolphs and Hofmann, 2019; Ammanabrolu and Riedl, 2018; Ammanabrolu and Hausknecht, 2020; Yin and May, 2019; Adhikari et al., 2020) use various heuristics to learn better state representations for efficiently solving complex TBGs.
3 Method

We consider the standard sequential decision-making setting: a finite-horizon Partially Observable Markov Decision Process (POMDP), represented as $(s, a, r, s')$, where $s$ is the current state, $s'$ the next state, $a$ the current action, and $r(s, a)$ the reward function. The agent receives a state description $s_t$ that combines text describing the agent's observation and the goal statement. The action is a combination of a verb and an object, such as "go north", "take coin", etc.

The overall model has two modules: a representation generator and an action scorer, as shown in Figure 2. The observation tokens are fed to the embedding layer, which produces a sequence of vectors $x^t = \{x^t_1, x^t_2, \ldots, x^t_{N_t}\}$, where $N_t$ is the number of tokens in the observation text at time-step $t$. We obtain hidden representations of the input embedding vectors using an LSTM model as $h^t_i = f(x^t_i, h^t_{i-1})$. We compute a context vector (Bahdanau et al., 2014) using attention on the $j$-th input token as
$$e^t_j = v^T \tanh(W_h h^t_j + b_{attn}) \quad (1)$$
$$\alpha^t_j = \mathrm{softmax}(e^t_j) \quad (2)$$
where $W_h$, $v$, and $b_{attn}$ are learnable parameters. The context vector at time-step $t$ is computed as the weighted sum of the hidden vectors, $c^t = \sum_{j=1}^{N_t} \alpha^t_j h^t_j$. The context vector is fed into the action scorer, where two multi-layer perceptrons (MLPs), $Q(s, v)$ and $Q(s, o)$, produce the Q-values over available verbs and objects from a shared MLP's output. The original works of Narasimhan et al. (2015) and Yuan et al. (2018) do not use the attention layer. LSTM-DRQN replaces the shared MLP with an LSTM layer so that the model remembers previous states, thus addressing the partial observability of these environments.

Q-learning (Watkins and Dayan, 1992; Mnih et al., 2015) is used to train the agent.
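The representation generator and action scorer described above can be summarized in a short sketch. The following is a minimal PyTorch illustration; the module sizes, names, and vocabulary handling are our own assumptions rather than the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CRESTPolicy(nn.Module):
    """Representation generator (LSTM + attention, Eqs. 1-2) and action scorer
    (shared MLP feeding separate verb and object Q-value heads)."""

    def __init__(self, vocab_size, n_verbs, n_objects, emb_dim=100, hid_dim=100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        # Attention parameters: e_j = v^T tanh(W_h h_j + b_attn)
        self.W_h = nn.Linear(hid_dim, hid_dim)          # bias plays the role of b_attn
        self.v = nn.Linear(hid_dim, 1, bias=False)
        # Action scorer: shared MLP, then verb and object Q-heads
        self.shared = nn.Linear(hid_dim, hid_dim)
        self.q_verb = nn.Linear(hid_dim, n_verbs)
        self.q_obj = nn.Linear(hid_dim, n_objects)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer ids of the observation tokens
        x = self.embed(token_ids)                         # (B, T, emb_dim)
        h, _ = self.encoder(x)                            # (B, T, hid_dim)
        e = self.v(torch.tanh(self.W_h(h))).squeeze(-1)   # (B, T) attention logits
        alpha = F.softmax(e, dim=-1)                      # attention weights
        c = torch.bmm(alpha.unsqueeze(1), h).squeeze(1)   # context vector (B, hid_dim)
        shared = torch.relu(self.shared(c))
        return self.q_verb(shared), self.q_obj(shared), alpha
```

An action command is issued by taking the argmax verb and argmax object from the two heads; the $Q(s, a)$ used in the training loss below is the average of the two corresponding Q-values.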
Table 1: The average success rate of various methods on 20 unseen test games. Experiments were repeated on 3 random seeds. Our method, trained on almost 10x less data, has a success rate similar to state-of-the-art methods.

Methods                   | Easy: N25  N50  N500 | Medium: N50  N100  N500 | Hard: N50  N100
LSTM-DQN (no att)         | 0.0   0.03  0.33     | 0.0   0.0   0.0         | 0.0   0.0
LSTM-DRQN (no att)        | 0.17  0.53  0.87     | 0.02  0.0   0.25        | 0.0   0.0
LSTM-DQN (+attn)          | 0.0   0.03  0.58     | 0.0   0.0   0.0         | 0.0   0.0
LSTM-DRQN (+attn)         | 0.32  0.47  0.87     | 0.02  0.06  0.82        | 0.02  0.08
Ours (ConceptNet+no att)  |
Ours (ConceptNet+att)     |
(a) Easy games (N50). Observation: You find yourself in a launderette. An usual kind of place. The room seems oddly familiar, as though it were only superficially different from the other rooms in the building. There is an exit to the east. Don't worry, it is unguarded. There is an unguarded exit to the west.

(b) Medium games (N50). Observation: You've entered a cookhouse. You begin to take stock of what's in the room. You need an unguarded exit? You should try going north. There is an exit to the south. Don't worry, it is unguarded. There is a coin on the floor.
Figure 3: Ranking of context-relevant tokens from observation text by our token relevance distribution.

The parameters of the model are updated by optimizing the following loss function obtained from the Bellman equation (Sutton et al., 1998),
$$L = \left\| Q(s, a) - \mathbb{E}_{s,a}\left[ r + \gamma \max_{a'} Q(s', a') \right] \right\|^2 \quad (3)$$
where $Q(s, a)$ is obtained as the average of the verb and object Q-values, and $\gamma \in (0, 1)$ is the discount factor. The agent is given a reward of 1.0 by the environment on completing the objective. We also use an episodic discovery bonus (Yuan et al., 2018) as a reward during training, which introduces curiosity (Pathak et al., 2017) and encourages the agent to uncover unseen states for accelerated convergence.

Traditional LSTM-DQN and LSTM-DRQN methods are trained on observation text containing irrelevant textual artifacts (like "You don't like doors?" in Figure 1), which leads to overfitting in small data regimes. Our CREST module removes unwanted tokens in the observation that do not contribute to decision making. Since the base policy overfits on the training games, the action commands issued by it can successfully solve the training games, thus yielding correct (observation text, action command) pairs for each step of the training games. Therefore, by retaining only the tokens in the observation text that are contextually similar to the base model's action commands, we remove unwanted tokens that might otherwise cause overfitting. Figure 2(a) shows the overview of our method.

We use three embeddings to obtain token relevance: (1) Word2Vec (Mikolov et al., 2013); (2) GloVe (Pennington et al., 2014); and (3) ConceptNet (Liu and Singh, 2004). The distance between tokens is computed using cosine similarity, $D(a, b)$.

Token Relevance Distribution (TRD): We run inference with the overfitted base model on each training game (indexed by $k$) and aggregate all the action tokens issued for that particular game as the Episodic Action Token Aggregation (EATA), $\mathcal{A}^k$. For each token $w_i$ in a given observation text $o^k_t$ at step $t$ of the $k$-th game, we compute the Token Relevance Distribution (TRD) $C$ as
$$C(w_i, \mathcal{A}^k) = \max_{a_j \in \mathcal{A}^k} D(w_i, a_j) \quad \forall\, w_i \in o^k_t, \quad (4)$$
where the $i$-th token $w_i$'s score is its maximum similarity to all tokens in $\mathcal{A}^k$. This relevance score is used to prune irrelevant tokens from the observation text by creating a hard attention mask with a threshold value. Figure 3 presents examples of TRDs from observations, highlighting which tokens are relevant for the next action. Further examples of token relevance are shown in the appendix.

Bootstrapped model: The bootstrapped model is trained on the pruned observation text, obtained by removing irrelevant tokens using TRDs. The same model architecture and training methods as the base model are used.
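The pruning step in Eq. (4) reduces to a maximum cosine similarity between each observation token and the episodic action tokens, followed by thresholding. A minimal sketch is given below; the embedding lookup (`get_vector`) stands in for whichever pre-trained Word2Vec, GloVe, or ConceptNet vectors are used, the function names are ours, and the default threshold is illustrative (the paper tunes it per game mode on validation games):

```python
import numpy as np

def cosine_similarity(a, b):
    """D(a, b): cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def token_relevance(obs_tokens, action_tokens, get_vector):
    """Eq. (4): score each observation token by its maximum similarity to any
    action token issued by the base policy in this game (the EATA set A_k)."""
    scores = {}
    for w in obs_tokens:
        sims = [cosine_similarity(get_vector(w), get_vector(a)) for a in action_tokens]
        scores[w] = max(sims) if sims else 0.0
    return scores

def prune_observation(obs_tokens, action_tokens, get_vector, threshold=0.4):
    """Hard attention mask: keep only tokens whose relevance exceeds the threshold."""
    scores = token_relevance(obs_tokens, action_tokens, get_vector)
    return [w for w in obs_tokens if scores[w] >= threshold]
```

For the example observation in Figure 2(a), this kind of filtering yields the pruned text "exit east".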
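Both the base and the bootstrapped policies are optimized with the Q-learning loss of Eq. (3). A minimal sketch of one update, assuming the CRESTPolicy sketch above and a mini-batch of transitions, is shown below; the target network, the done mask, and the hyperparameters are illustrative assumptions rather than details given in the paper:

```python
import torch
import torch.nn.functional as F

def td_loss(policy, target_policy, batch, gamma=0.9):
    """One-step temporal-difference loss of Eq. (3).
    `batch` holds observation token ids, the chosen verb/object indices,
    rewards (environment reward plus episodic discovery bonus), next
    observations, and an episode-termination mask."""
    q_verb, q_obj, _ = policy(batch["obs"])
    # Q(s, a): average of the Q-values of the chosen verb and object
    q_sa = 0.5 * (q_verb.gather(1, batch["verb"].unsqueeze(1)).squeeze(1)
                  + q_obj.gather(1, batch["obj"].unsqueeze(1)).squeeze(1))
    with torch.no_grad():
        nq_verb, nq_obj, _ = target_policy(batch["next_obs"])
        # max over next actions, again averaging the two heads
        q_next = 0.5 * (nq_verb.max(dim=1).values + nq_obj.max(dim=1).values)
        target = batch["reward"] + gamma * (1.0 - batch["done"]) * q_next
    return F.mse_loss(q_sa, target)
```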
Figure 4: Comparison of validation performance for various pruning thresholds on (a) N25 easy and (b) N50 medium validation games (reward vs. training episodes), and (c) zero-shot transfer for N50 easy games (reward vs. number of testing levels): our method, trained on quest length L and tested on L20 and L25 games, significantly outperforms the previous DQN and DRQN methods.

During testing, TRDs on unseen games are computed as $C(w_i, \mathcal{G})$ by global aggregation of the action tokens, $\mathcal{G} = \bigcup_k \mathcal{A}^k$, which combines the EATA over all training games. This approach retains all relevant action tokens and carries the training-domain information into inference, under the assumption that the training and test games have similar domain distributions.
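A sketch of this test-time pruning, reusing the `prune_observation` helper from the sketch above with the globally aggregated action tokens G (again, the function names are ours, not the authors'):

```python
def global_action_tokens(per_game_action_tokens):
    """G = union over all training games k of the episodic action token sets A_k."""
    tokens = set()
    for game_tokens in per_game_action_tokens:
        tokens.update(game_tokens)
    return tokens

def prune_test_observation(obs_tokens, per_game_action_tokens, get_vector, threshold=0.4):
    """At test time the episode's true actions are unknown, so relevance is
    computed against the global aggregation G instead of a single game's A_k."""
    G = global_action_tokens(per_game_action_tokens)
    return prune_observation(obs_tokens, sorted(G), get_vector, threshold)
```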
4 Experiments

Setup: We used the easy, medium, and hard modes of the Coin-Collector TextWorld (Côté et al., 2018; Yuan et al., 2018) framework to evaluate our model's generalization ability. The agent has to collect a coin located in a particular room. We trained each method on various numbers of training games (denoted by N).
Quantitative comparison: We compare the performance of our proposed model with LSTM-DQN (Narasimhan et al., 2015) and LSTM-DRQN (Yuan et al., 2018). Figures 2(b) and 2(c) show the reward of the various trained models with increasing training episodes on easy and medium games. Our method shows improved out-of-sample generalization on validation games with about 10x-20x fewer training games (25-50 vs. 500), with accelerated training using drastically fewer training episodes compared to previous methods. We report performance on unseen test games in Table 1; the parameters corresponding to the best validation score are used. Our method trained with N25 and N50 games for the easy and medium levels respectively achieves performance similar to that of SOTA methods trained on 500 games. We perform an ablation study with and without attention in the policy network and show that the attention mechanism alone does not substantially improve generalization. We also compare various word embeddings for TRD computation and find that ConceptNet gives the best generalization performance.
Pruning threshold: In this experiment, we test our method's sensitivity to the threshold value used for observation pruning. Figure 4(a) and Figure 4(b) reveal the thresholds that give the best validation performance for easy and medium games respectively. A very high threshold may remove relevant tokens as well, leading to failure in training, whereas a low threshold retains most irrelevant tokens, leading to over-fitting.

Zero-shot transfer: In this experiment, agents trained on games with quest length L were tested, without retraining, on unseen game configurations with quest lengths of 20 and 25 rooms respectively, to study the zero-shot transferability of our learned agents to unseen configurations. The results in the bar charts of Figure 4(c) for N50 easy games show that our proposed method generalizes to unseen game configurations significantly better than previous state-of-the-art methods on the Coin-Collector game.

Generalizability to other games: In the above experiments, we reported results on the Coin-Collector environment, where the nouns and verbs used in the training and testing games have a strong overlap. Here we discuss the generalizability of the method to other games, where the context-relevant tokens for a particular game might never be seen in the training games. To test this, we performed experiments on a different type of game (cooking games) considered in Adolphs and Hofmann (2019). An example observation looks like this:
". . . You see a fridge. The fridge contains some water, a diced cilantro and a diced parsley. You wonder idly who left that here. Were you looking for an oven? Because look over there, it's an oven. Were you looking for a table? Because look over there, it's a table. The table is massive. On the table you make out a cookbook and a knife. You see a counter. However, the counter, like an empty counter, has nothing on it. . . ."

The objective of this game is to prepare a meal following the recipe found in the kitchen and eat it. We took 20 training and 20 testing cooking games, with unseen items in the test observations. Training action commands are obtained from an oracle. From the training games, we obtain the noun action tokens 'onion', 'potato', 'parsley', 'apple', 'counter', 'pepper', 'meal', 'water', 'fridge', and 'carrot'. Using our token relevance method (with ConceptNet embeddings) described in Section 3.2, we obtain the following scores for unseen cooking-related nouns at test time: "banana": 0.45, "cheese": 0.48, "chop": 0.39, "cilantro": 0.71, "cookbook": 0.30, "knife": 0.13, "oven": 0.52, "stove": 0.48, "table": 0.43. Although these nouns were absent from the training action distribution, our proposed method assigns them a high score (except "knife"), since they are similar in concept to the training actions. An appropriate threshold (e.g., 0.4) retains most of these tokens, and the threshold can be tuned automatically using validation games, as shown in Figures 4(a) and 4(b). Thus, as described in Section 5, assuming some level of overlap between the training and testing knowledge domains, our method is generalizable and can reduce overfitting for RL in NLP. A large gap between training and testing distributions is a much harder problem, even for supervised ML and conventional RL settings, and is out of the scope of this paper.
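A sketch of how such scores could be computed with off-the-shelf ConceptNet Numberbatch vectors is shown below; the embedding file, its path, and the resulting values are assumptions (the scores quoted above come from the authors' setup):

```python
from gensim.models import KeyedVectors

# ConceptNet Numberbatch in word2vec text format (file name/version are assumptions)
vectors = KeyedVectors.load_word2vec_format("numberbatch-en.txt", binary=False)

train_action_nouns = ["onion", "potato", "parsley", "apple", "counter",
                      "pepper", "meal", "water", "fridge", "carrot"]
unseen_test_nouns = ["banana", "cheese", "chop", "cilantro", "cookbook",
                     "knife", "oven", "stove", "table"]

# Score each unseen noun by its maximum cosine similarity to the training
# action nouns (Eq. 4); retain it if the score exceeds the pruning threshold.
threshold = 0.4
for w in unseen_test_nouns:
    score = max(vectors.similarity(w, a) for a in train_action_nouns)
    print(f"{w}: {score:.2f} -> {'keep' if score >= threshold else 'prune'}")
```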
5 Conclusion

We present a method for improving generalization in TBGs by removing irrelevant tokens from observation texts. Our bootstrapped model, trained on the salient observation tokens, matches the generalization performance of SOTA methods with 10x-20x fewer training games, and shows accelerated convergence. In this paper, we have restricted our analysis to TBGs that feature similar domain distributions in training and test games. In the future, we wish to address generalization in the presence of domain differences, such as novel objects and goal statements in test games that were not seen during training.

References
Ashutosh Adhikari, Xingdi Yuan, Marc-Alexandre Côté, Mikuláš Zelinka, Marc-Antoine Rondeau, Romain Laroche, Pascal Poupart, Jian Tang, Adam Trischler, and William L. Hamilton. 2020. Learning dynamic knowledge graphs to generalize on text-based games. arXiv preprint arXiv:2002.09127.

Leonard Adolphs and Thomas Hofmann. 2019. LeDeepChef: Deep reinforcement learning agent for families of text-based games. arXiv preprint arXiv:1909.01646.

Prithviraj Ammanabrolu and Matthew Hausknecht. 2020. Graph constrained reinforcement learning for natural language action spaces. arXiv preprint arXiv:2001.08837.

Prithviraj Ammanabrolu and Mark O. Riedl. 2018. Playing text-adventure games with graph-based deep reinforcement learning. arXiv preprint arXiv:1812.01628.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Marc-Alexandre Côté, Ákos Kádár, Xingdi Yuan, Ben Kybartas, Tavian Barnes, Emery Fine, James Moore, Matthew Hausknecht, Layla El Asri, Mahmoud Adada, et al. 2018. TextWorld: A learning environment for text-based games. arXiv preprint arXiv:1806.11532.

Bhuwan Dhingra, Lihong Li, Xiujun Li, Jianfeng Gao, Yun-Nung Chen, Faisal Ahmed, and Li Deng. 2017. Towards end-to-end reinforcement learning of dialogue agents for information access. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 484–495.

Nancy Fulda, Daniel Ricks, Ben Murdoch, and David Wingate. 2017. What can you do with a rock? Affordance extraction via word embeddings. arXiv preprint arXiv:1703.03429.

Ji He, Jianshu Chen, Xiaodong He, Jianfeng Gao, Lihong Li, Li Deng, and Mari Ostendorf. 2016. Deep reinforcement learning with a natural language action space. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1621–1630.

Xiujun Li, Yun-Nung Chen, Lihong Li, Jianfeng Gao, and Asli Celikyilmaz. 2017. End-to-end task-completion neural dialogue systems. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 733–743.

Hugo Liu and Push Singh. 2004. ConceptNet: A practical commonsense reasoning tool-kit. BT Technology Journal, 22(4):211–226.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. 2015. Human-level control through deep reinforcement learning. Nature, 518(7540):529.

Karthik Narasimhan, Tejas Kulkarni, and Regina Barzilay. 2015. Language understanding for text-based games using deep reinforcement learning. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1–11.

Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. 2016. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems, pages 4026–4034.

Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. 2017. Curiosity-driven exploration by self-supervised prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 16–17.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Matthias Plappert, Rein Houthooft, Prafulla Dhariwal, Szymon Sidor, Richard Y. Chen, Xi Chen, Tamim Asfour, Pieter Abbeel, and Marcin Andrychowicz. 2017. Parameter space noise for exploration. arXiv preprint arXiv:1706.01905.

Iulian V. Serban, Chinnadhurai Sankar, Mathieu Germain, Saizheng Zhang, Zhouhan Lin, Sandeep Subramanian, Taesup Kim, Michael Pieper, Sarath Chandar, Nan Rosemary Ke, et al. 2017. A deep reinforcement learning chatbot. arXiv preprint arXiv:1709.02349.

Richard S. Sutton, Andrew G. Barto, et al. 1998. Introduction to Reinforcement Learning, volume 2. MIT Press, Cambridge.

Christopher J. C. H. Watkins and Peter Dayan. 1992. Q-learning. Machine Learning, 8(3-4):279–292.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Xusen Yin and Jonathan May. 2019. Learn how to cook a new recipe in a new house: Using map familiarization, curriculum learning, and bandit feedback to learn families of text-based adventure games. arXiv preprint arXiv:1908.04777.

Xingdi Yuan, Marc-Alexandre Côté, Alessandro Sordoni, Romain Laroche, Remi Tachet des Combes, Matthew Hausknecht, and Adam Trischler. 2018. Counting to explore and generalize in text-based games. arXiv preprint arXiv:1806.11525.

Tom Zahavy, Matan Haroush, Nadav Merlis, Daniel J. Mankowitz, and Shie Mannor. 2018. Learn what not to learn: Action elimination with deep reinforcement learning. In Advances in Neural Information Processing Systems, pages 3562–3573.
A Description of Text-based games
We used the TextWorld (Côté et al., 2018) framework for evaluating our model's generalization ability on text-based games. For each game, the agent is provided with a goal statement and an observation text describing only the current state of the world around it. The agent has to overcome partial observability using memory, because it never sees the full state of the world. The games are inspired by the chain experiments used in (Plappert et al., 2017; Osband et al., 2016) for evaluating exploration in RL policies. The agent has to navigate through various rooms that are randomly connected to form a chain, finally reaching the goal state. We ensure that the goal statement does not contain navigational instructions by using the "--only-last-action" option. The agent is rewarded only when it successfully achieves the end goal. We use a quest length (the number of rooms to travel before reaching the goal state) of L for training our policies.

We use the Coin-Collector environment for our experiments, where the agent has to collect a coin located in a particular room. Across games, the location of the coin and the interconnectivity between the rooms differ. We experiment with three modes of this challenge, following (Yuan et al., 2018): easy, where there are no distractor rooms (dead ends) along the optimal path and the agent needs to choose a command that depends only on the previous state; medium, where there is one distractor per room on the optimal path and the agent has to issue the reverse of its previous command to come out of such distractor rooms; and hard, where there are two distractors per room on the optimal path and the agent has to issue the reverse of its previous command to come out of such distractor rooms, in addition to remembering longer into the past to keep track of which paths it has already traveled.

Figure 5 (panels: (a) easy, (b) medium, (c) hard training games; (d) easy, (e) medium, (f) hard validation games): Training and validation learning curves for the various games. The metric on the y-axis is the average of a final reward of 1.0 on completion of the quest and 0.0 otherwise, i.e., the average success rate. Our method shows better generalization from significantly fewer training games and faster learning with fewer episodes for all of the "easy", "medium", and "hard" validation games.
B Experimental Setup
For the TextWorld Coin-Collector environment, we use 10 verbs and 10 nouns in the action vocabulary, which is learned using Q-learning. This differs from previous methods that use only 2 verbs and 5 objects, and thus slightly increases the complexity of Q-learning. However, note that the problem of generalization exists even for the smaller number of action tokens, as reported in (Yuan et al., 2018), and it is not significantly aggravated by a slightly larger action space (10 vs. 100 combinations). The configuration used in our base and bootstrapped model training is the same as in the previous methods, with the only change being the addition of the attention layer. We trained each environment for 6,000 epochs, with the exploration rate annealed over 3,600 epochs from a starting value of 1.0 to 0.2. Each training experiment took a few hours to complete. Our experiments were conducted on an Ubuntu 16.04 system with a Titan X (Pascal) GPU. We use a single LSTM network with 100-dimensional hidden units in the representation generator. For the action scorer, a single LSTM network with 64-dimensional hidden units (for DRQN) and two MLPs for verb and object Q-values are used. The number of trainable parameters in our policy network is 128,628 for the model with attention and 125,364 without attention.

In our experiments, we investigate the generalization of our method from a small number of training games. To that end, we answer the following questions: (1) Can our proposed observation masking system outperform previous RL methods for TBGs using less training data and with accelerated learning? (2) Is there a positive correlation between the strength of observation masking and generalization performance? (3) Does our method understand the semantic meaning of the games well enough to perform zero-shot generalization to unseen configurations of TBGs?
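For concreteness, the annealing schedule described above (6,000 training epochs, exploration annealed from 1.0 to 0.2 over the first 3,600 epochs) can be written as a small helper; linear decay is an assumption here, since the exact shape of the schedule is not stated:

```python
def exploration_epsilon(epoch, start=1.0, end=0.2, anneal_epochs=3600):
    """Exploration rate for a given training epoch: starts at `start`,
    decays linearly over `anneal_epochs` epochs, then stays at `end`."""
    if epoch >= anneal_epochs:
        return end
    return start + (end - start) * (epoch / anneal_epochs)

# Example: epsilon at the start, midway through annealing, and at the end of training
print(exploration_epsilon(0), exploration_epsilon(1800), exploration_epsilon(6000))
# -> 1.0 0.6 0.2
```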
C Improved generalization by CREST

The generalization ability of the learned base policy is measured by its performance on unseen games that were not used during training of the policy network. We measure the reward obtained by the agent in each episode, which is the metric of success in our experiments. During evaluation on unseen games, only the environment reward is used and the episodic discovery bonus is turned off.
(a) Easy games (N50):
Observation: You find yourself in a launderette. An usual kind of place. The room seems oddly familiar, as though it were only superficially different from the other rooms in the building. There is an exit to the east. Don't worry, it is unguarded. There is an unguarded exit to the west.
Observation: You've just walked into a chamber. You begin to take stock of what's here. There is an unblocked exit to the east. There is a coin on the floor.

(b) Medium games (N50):
Observation: You find yourself in a studio. An usual kind of place. Okay, just remember what you're here to do, and everything will go great. There is an exit to the east. Don't worry, it is unblocked. There is an unguarded exit to the north. There is an unguarded exit to the south.
Observation: You've entered a cookhouse. You begin to take stock of what's in the room. You need an unguarded exit? You should try going north. There is an exit to the south. Don't worry, it is unguarded. There is a coin on the floor.

(c) Hard games (N100):
Observation: You have fallen into a salon. Not the salon you'd expect. No, this is a salon. You start to take note of what's in the room. You need an unguarded exit? You should try going east. You don't like doors? Why not try going north, that entranceway is unblocked. There is an unblocked exit to the south. You don't like doors? Why not try going west, that entranceway is unblocked.
Observation: You arrive in an office. A normal kind of place. You don't like doors? Why not try going east, that entranceway is unblocked. There is an exit to the north. Don't worry, it is unblocked. You need an unblocked exit? You should try going west. There is a coin on the floor.
Figure 6: Relevance distributions of observation tokens for easy, medium, and hard games, along with the original observation text. The top row shows non-terminal observations in which the "coin" is not present; the second row shows terminal states. Each relevance score is bounded between [0, 1]. The bootstrapped model is trained only on tokens whose relevance is above a threshold, which removes irrelevant tokens.

Since a reward of 1.0 is obtained on completion of the game, the average reward can also be interpreted as the average success rate in solving the games. The verb and object tokens corresponding to the maximum Q-values are chosen as the action command. Traditional LSTM-DQN and LSTM-DRQN methods are trained on observation text descriptions that include irrelevant textual artifacts, which can lead to overfitting in small data regimes. To demonstrate this effect, we plot the performance of LSTM-DRQN (SOTA on Coin-Collector) and LSTM-DQN on the Coin-Collector easy, medium, and hard games for various numbers of training and unseen validation games in Figure 5. In each training episode, a random batch of games is sampled from the available training games and Q-learning is performed. For a large number of training games (500), the SOTA policies can solve most of the validation games (especially the easy games). However, the performance degrades significantly with fewer training games, while the training performance remains at a 100% success rate, indicating overfitting. This behavior can occur if the agent associates certain action commands with irrelevant tokens in the observation. For example, the agent might encounter training games in which the observation tokens "a typical kind of place" co-occur with the action "go east". In this case, the agent might learn to associate such irrelevant tokens with the "go east" command without actually learning the true dependency on tokens like "there is a door to the east".

C.1 Quantitative evaluation of generalization
Our proposed method shows better generalization performance, as is evident from Figure 5. Both training and validation performance increase with increasing training episodes, indicating good generalization; slight overfitting appears only if the agent is trained for a longer duration. The policy parameters corresponding to the best validation score are used for evaluating unseen test games. Our policy performs better because it is trained only on context-relevant tokens, after the removal of unwanted tokens from the observation text. We show the visualization of Token Relevance Distributions (TRDs) obtained by our method for easy, medium, and hard games in Figure 6. Each token has a similarity score between 0 and 1, indicating how relevant it is for making decisions about the next action. Tokens with a score below a threshold are pruned. We also perform such observation pruning in the testing phase. Therefore, our proposed method both learns from and is tested on pruned, clean observation texts, which leads to improved generalization.

Table 2: Average success rate in zero-shot transfer to other configurations. We trained the RL policies on quest length L games and tested the performance on L20 and L25 unseen game configurations. CREST significantly outperforms the previous methods on such tasks for all cases of easy, medium, and hard games.

Methods               | Easy: N25  N50  N500 | Medium: N50  N100  N500 | Hard: N50  N100
Evaluate on L20
LSTM-DQN (+attn)      | 0.0   0.02  0.58     | 0.0   0.0   0.0         | 0.0   0.0
LSTM-DRQN (+attn)     | 0.25  0.17  0.90     | 0.0   0.02  0.53        | 0.0   0.02
Ours (ConceptNet+att) |
Evaluate on L25
LSTM-DQN (+attn)      | 0.0   0.0   0.48     | 0.0   0.0   0.00        | 0.0   0.0
LSTM-DRQN (+attn)     | 0.07  0.13  0.88     | 0.0   0.02  0.50        | 0.02  0.0
Ours (ConceptNet+att) |

C.2 Zero-shot transfer
While in the previous experiments the training and evaluation games had the same quest-length configuration, in this experiment we evaluate our method on Coin-Collector configurations never seen during training. Specifically, during training we use games with quest length L. The models trained on this configuration are then tested, without any retraining, on games with quest lengths of 20 and 25 rooms respectively, to study zero-shot transferability to configurations the agent has never encountered before. The results, shown in Table 2 for all modes of the Coin-Collector games, show that our proposed observation masking method also generalizes to unseen game configurations with increased quest length and largely outperforms the previous state-of-the-art methods. Our method, CREST, learns to retain the important tokens in the observation text, which leads to a better semantic understanding and hence better zero-shot transfer. In contrast, the previous methods can overfit to unwanted tokens in the observation text that do not contribute to the decision-making process.
D Discussion
Empirical evaluation shows that our observation masking method can successfully reduce the overfitting problem in RL for text-based games by removing irrelevant tokens. Our method also learns at an accelerated rate, requiring fewer training episodes, thanks to the pruned textual representations. We show that observation masking leads to better generalization, as demonstrated by the superior performance of our CREST method, with accelerated convergence from fewer training games compared to the state-of-the-art methods.

In this paper, we assume that the domain distributions of the training and evaluation environments are similar, because our goal is to explore generalization by observation pruning without additional heuristic learning components. This means the evaluation games have objectives similar to those seen during training, and similar objects are encountered in the evaluation games without any novel objects. For example, if the goal objective is set as "pickup the coin" in the training games, it will not be changed to "eat the apple" in the evaluation games.