Visual Transfer between Atari Games using Competitive Reinforcement Learning
Akshita Mittel, Purna Sowmya Munukutla, Himanshi Yadav
{amittel, spmunuku, hyadav}@andrew.cmu.edu
Robotics Institute, Carnegie Mellon University
Abstract
This paper explores the use of deep reinforcement learning agents to transfer knowledge from one environment to another. More specifically, the method takes advantage of the asynchronous advantage actor critic (A3C) architecture to generalize to a target game using an agent trained on a source game in Atari. Instead of fine-tuning a pre-trained model for the target game, we propose a learning approach to update the model using multiple agents trained in parallel with different representations of the target game. Visual mapping between video sequences of transfer pairs is used to derive new representations of the target game; training on these visual representations of the target game improves model updates in terms of performance, data efficiency and stability. To demonstrate the functionality of the architecture, the Atari games Pong-v0 and Breakout-v0 from the OpenAI Gym environment are used as the source and target environments.
1 Introduction

Our paper is motivated by 'Learning to Learn' as discussed in the NIPS-95 workshop, where interest was shown in applying previously gained knowledge to a new domain for learning new tasks [1]. The learner uses a series of tasks learned in the source domain to improve and learn the target tasks. In our paper, a frame taken from the target game is mapped to the analogous state in the source game, and the trained policy learned on the source game is used to play the target game. We also rely on transfer learning methods that have been shown to improve performance and stability [2]. Expert performance can be achieved on several games with the model complexity of a single expert, and significant improvements in learning speed on target games are also seen [2, 3]. The same learned weights can be generalized from a source game to several new target games. This treatment of weights from a previously trained model has implications in Safe Reinforcement Learning [2], as we can easily learn from already trained stable agents. We find underlying similarities between the source and target domains, i.e., different Atari games, to represent common knowledge using Unsupervised Image-to-image Translation (UNIT) generative adversarial networks (GANs) [4].

Recent progress in the application of Reinforcement Learning (RL) with deep networks has led to policy gradient methods like the A3C algorithm, which can autonomously achieve human-like performance in Atari games [5]. In an A3C network, several agents are executed in parallel with different starting policies, and the global state is updated intermittently by the agents. A3C has a smaller computation cost since we do not have to calculate the Q value for every action in the action space to find the maximum. McKenzie et al. [6] propose a training method in which two games are trained simultaneously, with each game competing for representational space within the neural network. In our paper, the target game competes with its visual representation obtained by using the UNIT GAN as a visual mapper between the source and target game.

2 Related Work

Translating images to another set of images has been extensively studied by the vision and graphics communities. Recently, considerable interest has been shown in the field of unsupervised image translation [7, 8], which successfully aids the task of domain adaptation. The main objective is to learn a mapper that translates an input image to an output image without prior aligned image pairs. Assuming that the input and output images have some underlying relationship, Liu and Tuzel [9] introduce CoGAN, and cross-modal scene networks [10] learn a common representation using a weight sharing technique. Zhu et al. [7] achieve the same without relying on a similarity function between the input and output images, and without assuming that the input and output images lie in the same low-dimensional embedding space.

Domain adaptation aims to generate a shared representation for distinct domains [11]. Parisotto et al. [12] use multi-task and transfer learning to act in a new domain by using previous knowledge. The multi-task learning method employed, called "Actor-Mimic", uses model compression techniques to train one multi-task network from several expert networks. The multi-task network is treated as a Deep Q Network (DQN) pre-trained on some tasks in the source domain. A DQN with this multi-task pre-training learns a target task significantly faster than a DQN starting from a random initialization, effectively demonstrating that the source task representations generalize to the target task. McKenzie et al. [6] adapt the DQN algorithm to competitively learn two similar Atari games. The authors show that a DQN trained to solve one specific task does not perform well on unforeseen tasks, whereas a DQN agent trained simultaneously on both tasks with their competitive learning technique does. More recently, Ratcliffe et al. [13] use a 2D GridWorld environment rendered in several different ways to create versions of the same environment that still have underlying similarities, and employ multi-task learning and domain adaptation on these versions. The environment is then learned using the A3C algorithm.
3 Methodology

A3C is a lightweight asynchronous variant of the actor critic model that has shown success in learning a variety of Atari games. The A3C network architecture consists of four convolutional layers followed by an LSTM layer and two fully connected layers that predict the actions and the value functions of the states. However, the entire representational space of the network is specialized for the particular game it has been trained on. The idea of transfer learning methods is to use this experience to improve performance on other, similar tasks.

The goal of this paper is to use an RL agent to generalize between two related but vastly different Atari games like Pong-v0 and Breakout-v0. This is done by learning visual mappers across games: given a frame from the source game, we should be able to generate the analogous frame in the target game. Building on the existence of these mappers, the training method we propose simultaneously learns two representations of the target game, effectively making them compete for representational space within the neural network.
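To make the architecture concrete, the following is a minimal PyTorch sketch of such an actor-critic network: four convolutional layers, an LSTM cell, and separate policy and value heads. The filter counts, kernel sizes, 84x84 input resolution, and 256-unit LSTM are illustrative assumptions; the paper specifies only the layer types.

```python
import torch
import torch.nn as nn

class A3CNet(nn.Module):
    """Four conv layers -> LSTM -> policy and value heads.
    Hyperparameters are assumptions, not taken from the paper."""

    def __init__(self, in_channels: int, num_actions: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ELU(),
            nn.Conv2d(32, 32, 3, stride=2, padding=1), nn.ELU(),
            nn.Conv2d(32, 32, 3, stride=2, padding=1), nn.ELU(),
            nn.Conv2d(32, 32, 3, stride=2, padding=1), nn.ELU(),
        )
        # an 84x84 input halved four times gives 6x6 feature maps
        self.lstm = nn.LSTMCell(32 * 6 * 6, 256)
        self.policy = nn.Linear(256, num_actions)  # actor head (logits)
        self.value = nn.Linear(256, 1)             # critic head

    def forward(self, x, hx, cx):
        feat = self.conv(x).flatten(start_dim=1)
        hx, cx = self.lstm(feat, (hx, cx))
        return self.policy(hx), self.value(hx), (hx, cx)
```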
3.1 Learning visual mappers

To create visual analogies between games, we rely on the core ideas explored in [3] to learn the mapping G : s → t between the source game (s) and the target game (t) in an unsupervised manner. The unsupervised learning step, as described in the setup of [3], requires preprocessing of the data: attention maps of the input frames are used as the preprocessed frames. Attention maps are generated by rotating the input image so that the main axis of motion is horizontal, binarizing the input after subtracting the median pixel, and applying a dilation operator to enlarge the relevant object sizes. In the final preprocessing step, the output is obtained by cloning the dilated image and applying two levels of blurring, creating the three channels of the image shown in Figure 1.

Figure 1: Input frame and preprocessed frame for the source (Pong-v0) and target (Breakout-v0) games.
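This preprocessing pipeline can be sketched as below; the rotation direction, binarization threshold, dilation kernel, and blur kernel sizes are our assumptions, since the text names the operations but not their constants.

```python
import cv2
import numpy as np

def attention_map(frame: np.ndarray) -> np.ndarray:
    """Turn an RGB Atari frame into a 3-channel attention map:
    rotate, median-subtract, binarize, dilate, then stack the dilated
    image with two blurred copies. Constants are illustrative."""
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
    # rotate so that the main axis of motion is horizontal
    gray = cv2.rotate(gray, cv2.ROTATE_90_CLOCKWISE)
    # subtract the median pixel to suppress the static background
    median = np.full_like(gray, int(np.median(gray)))
    _, binary = cv2.threshold(cv2.absdiff(gray, median), 16, 255,
                              cv2.THRESH_BINARY)
    # dilation enlarges the relevant objects (ball, paddles)
    dilated = cv2.dilate(binary, np.ones((3, 3), np.uint8))
    # clone the dilated frame and apply two levels of blurring
    blur1 = cv2.GaussianBlur(dilated, (5, 5), 0)
    blur2 = cv2.GaussianBlur(dilated, (9, 9), 0)
    return np.stack([dilated, blur1, blur2], axis=-1)
```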
To train the mapper function G, a shared latent space assumption can be made for Atari games, and the mapping is trained with an unsupervised image-to-image translation framework. The preprocessed frames from the source game and the target game are mapped to the same latent representation in a shared latent space across both games. We assume that a pair of corresponding images in the two domains can be inferred by learning two encoding functions that map images to latent codes, and two generation functions that map latent codes to images. Based on this shared latent space assumption, we use an existing framework based on GANs and VAEs to learn the mapper.

Figure 2: Mapped images with preprocessed frames for the source (Pong-v0) on the left and the target (Breakout-v0) mapping outputted by the trained UNIT GAN.

Figure 3: Mapped images with preprocessed frames for the source (Pong-v0) and target (Breakout-v0) mapping outputted by the trained UNIT GAN as implemented in [3].

The model is implemented with the network architecture of the UNIT GAN [4] with a cycle-consistency loss. The encoding and generating functions are implemented using CNNs, and the shared latent space assumption is enforced with a weight sharing constraint across these functions. In addition, adversarial discriminators for the respective domains are trained to evaluate whether the translated images are realistic. To generate an image in the target domain of Breakout-v0, we use the encoding function of the source game to get the latent code and the generating function of the target game to produce the frame. The preprocessing of the input frames is shown in Figure 1, and the resulting naive mapping between the source and target games learned by the UNIT GAN is shown in Figures 2 and 3.
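At inference time, cross-domain translation then amounts to encoding with one domain's encoder and decoding with the other domain's generator. A minimal sketch, where `enc_pong` and `gen_breakout` stand in for the trained UNIT encoder and generator (the names are ours):

```python
import torch

@torch.no_grad()
def pong_to_breakout(frame: torch.Tensor,
                     enc_pong: torch.nn.Module,
                     gen_breakout: torch.nn.Module) -> torch.Tensor:
    """Map a preprocessed Pong-v0 frame to its Breakout-v0 analogue
    through the shared latent space."""
    z = enc_pong(frame)      # latent code shared by both games
    return gen_breakout(z)   # analogous frame in the target domain
```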
3.2 Competitive transfer learning

One of the challenges we address in this paper is to prove or disprove that visual analogies across games are necessary and sufficient to transfer the knowledge of playing one particular game to another. In recent times, policy gradient methods like A3C have been shown to be extremely effective in learning the world of Atari games. They currently set the baseline for Pong-v0 and Breakout-v0 in terms of training time versus rewards, having already surpassed the performance achieved by Dueling DQN and Double DQN networks on these games. The idea of A3C networks is to use multiple workers in parallel that each interact with the environment and update the shared model simultaneously. Thus, A3C networks asynchronously execute multiple agents in parallel instead of using experience replay.

We use the baseline A3C network trained on the source game (Pong-v0) in the first stage of our training process and transfer the knowledge from this model to learning to play the target game (Breakout-v0). We measure the efficiency of the transfer learning method in terms of training time and data efficiency across parallel actor-learners. In the second stage of the training process, we use two representations of the target game among the parallel workers. The first representation uses target game frames taken directly from the environment. The second representation uses the frames produced by the visual mapper, i.e., G(s). The ratio of the number of workers that train directly on frames queried from the target game to those that train on frames mapped from the source game is a hyperparameter determined through experimentation.

Figure 4: Transfer learning process, with training for the source game in stage 1 and training for the target game using two different representations in a competitive manner in stage 2. Representation 1 of the target game is queried directly from the environment; representation 2 is extracted using visual transfer of states and a static mapping of actions and rewards from the source game.

In this way, we transfer the knowledge from the source game to the target game by competitively and simultaneously fine-tuning the model using two different visual representations of the target game. One visual representation is queried directly from the environment, and the other uses the learned mapper for the source game to extract frames of the target game. The actions of the target game for the second representation are determined from a static mapping of actions between the source and target games. Since Pong-v0 and Breakout-v0 share the strategy of controlling a paddle to hit a ball toward a certain objective, it is intuitive to determine a meaningful static mapping of actions. The six actions of Pong-v0, {No Operation, Fire, Right, Left, Right Fire, Left Fire}, are mapped to the four actions of Breakout-v0 as {Fire, Fire, Right, Left, Right, Left} respectively. The rewards are mapped directly from the source game to the target game without any scaling.
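The static action mapping can be written down directly. In Gym, Pong-v0 exposes six actions (NOOP, FIRE, RIGHT, LEFT, RIGHTFIRE, LEFTFIRE, indexed 0-5) and Breakout-v0 four (NOOP, FIRE, RIGHT, LEFT, indexed 0-3); the dictionary below encodes the mapping described above.

```python
# Pong-v0 action index -> Breakout-v0 action index
PONG_TO_BREAKOUT = {
    0: 1,  # NOOP      -> FIRE
    1: 1,  # FIRE      -> FIRE
    2: 2,  # RIGHT     -> RIGHT
    3: 3,  # LEFT      -> LEFT
    4: 2,  # RIGHTFIRE -> RIGHT
    5: 3,  # LEFTFIRE  -> LEFT
}

def map_action(pong_action: int) -> int:
    """Translate an action chosen in the Pong action space into the
    corresponding Breakout action; rewards pass through unscaled."""
    return PONG_TO_BREAKOUT[pong_action]
```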
4 Results

This section describes the results of the different stages in our transfer learning pipeline. For this pipeline we use the OpenAI Gym environment of Atari games and generalize the RL agent between Pong-v0 and Breakout-v0.

4.1 Visual mappings and learned features

As described in Section 3, the first stage is to obtain the preprocessed images for the frames of both the source (Pong-v0) and target (Breakout-v0) games. The next stage is to obtain the state, action, and reward mappings to train our A3C network. To get the state mapping G from source to target frames, [3] propose training the UNIT GAN network architecture. However, this stage requires immense amounts of manual processing to generate enough unique states for both the source and target games. Furthermore, it took us approximately six hours of training time per epoch to run this architecture on a GeForce GTX 1080 GPU. To avoid this bottleneck, we currently use pre-trained models, provided by the original authors [3], that contain the visual mapping from Pong-v0 to Breakout-v0. The results of the mapping from the source frames s to G(s) are shown in Figures 2 and 3.

The process of reward mapping varies across stages. During stage 1 of the training process, to learn the base network, the rewards of s are used directly. During stage 2, to fine-tune, a static mapping of rewards and actions is used for G(s); for t, the rewards, actions, and states of t itself are used directly.

Figure 5: (a) Feature activations from the first layer of the A3C model trained on Breakout-v0 with the visual transfer learning method in the competitive setting. (b) Feature activations from the second layer of the same model.

In addition, Figure 5 shows the activations of states learned by the A3C model initialized with the weights of Pong-v0 and fine-tuned on Breakout-v0. The first layer activations are analogous to a series of Gabor filters, learning low-level state features such as edges and contours in the Atari frames. These include, for instance, the paddle, the position of the fired entity, and the target. The second layer activations extract game-specific features by combining low-level features from the previously learned layer. In the game play of Breakout-v0, the paddle is controlled to hit the ball, which determines the rewards; the target position does not alter the rewards. This is reflected in the second layer activations, where it can be observed that the pixels at the target position are no longer fired. The second layer activations focus on information such as the positions of the paddle and the fired entity: features that are crucial for predicting the optimal policies.
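One possible way to capture the layer activations shown in Figure 5 is with forward hooks; a sketch below, reusing the hypothetical A3CNet from Section 3 (the layer indices follow that sketch's nn.Sequential layout).

```python
import torch

activations = {}

def save_activation(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

net = A3CNet(in_channels=3, num_actions=4)  # Breakout-v0 has 4 actions
net.conv[0].register_forward_hook(save_activation("conv1"))
net.conv[2].register_forward_hook(save_activation("conv2"))
# after a forward pass, activations["conv1"] and activations["conv2"]
# hold the feature maps of the kind visualized in Figure 5(a) and 5(b)
```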
4.2 Transfer learning

The first stage of the transfer learning process is to train the baseline A3C network for the source game (Pong-v0) and transfer the knowledge from this model to learning to play the target game (Breakout-v0). The source and target games are trained with the preprocessed attention frames from Section 3. This section compares the vanilla transfer learning method of directly using a pre-trained model of the source game with the proposed strategy of using competing representations to learn the target game.
The baseline model for the target game (Breakout-v0) is initialized directly with the weights from the expert model of the source game (Pong-v0), excluding the last layer. It is then fine-tuned on the target game to learn the optimal policy. The first graph in Figure 6 depicts the training curve of the model fine-tuned with preprocessed frames of Breakout-v0. The blue curve shows the behavior on Breakout-v0 without pre-training, whereas the red curve shows the behavior with pre-training. From Figure 6 it is evident that, by initializing the A3C architecture with weights obtained from Pong-v0, Breakout-v0 attains much better rewards, as shown in Figure 6(b). Though both curves reach similar rewards towards the end of training, the mean rewards obtained by the pre-trained network are significantly higher, showing that this transfer learning method works. However, the reward curve rises sharply and fails to make a smooth transition from Pong-v0 to Breakout-v0, in spite of the similar objectives of the two games.

Figure 6: (a) Total reward per episode versus number of epochs. (b) Mean reward per episode over 500 epochs during the training phases of Breakout-v0 with and without pre-training.

It can also be seen that the total rewards obtained per episode are not stable and vary considerably between consecutive episodes. With the architecture we propose in Section 3, using competitive learning between different representations of the target game, we aim to make the model converge faster and smooth the transition of learning the target game from the source game. We further evaluate the performance of our models on several evaluation metrics, as described in Section 4.3.
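A minimal sketch of this initialization, assuming the hypothetical A3CNet above and a saved Pong checkpoint (`pong_a3c.pt` is a placeholder name): all weights are copied except the final policy layer, whose output size differs between the six Pong actions and the four Breakout actions.

```python
import torch

pong_state = torch.load("pong_a3c.pt")  # hypothetical checkpoint path
breakout_net = A3CNet(in_channels=3, num_actions=4)

# keep everything except the last (policy) layer, then fine-tune
transferable = {k: v for k, v in pong_state.items()
                if not k.startswith("policy")}
breakout_net.load_state_dict(transferable, strict=False)
```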
In the next stage, we train the agent with a set of representations of the target game that are either queried directly from the environment or generated with the visual mapper from the source game, as described in Section 3. The ratio of the number of workers that use these representations simultaneously is a hyperparameter determined through experiments with multiple worker threads. A subset of the workers is fed directly with the frames and actions of the target game, Breakout-v0 (the native frames given by the Atari environment). The other subset of workers is fed with frames from the source game, Pong-v0, along with the converter (the visual mapper that converts source frames to target representations). The results obtained from these experiments are plotted in Figure 7(a). To find the right combination of workers using competing representations, a series of experiments was carried out with the following ratios of native to visual-mapper workers: {3:1, 2:1, 1:1, 1:3}.

Figure 7: (a) Mean rewards per episode for the entire set of experiments; the curve in blue is the baseline with no pre-training. (b) Total reward per episode over 700 epochs during the training phases of Breakout-v0 with and without pre-training; here the ratio of workers is 2:1.
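The worker-ratio hyperparameter can be realized by assigning each asynchronous worker one of the two representations in a repeating pattern; a sketch under that assumption (the function and names are ours):

```python
def assign_representations(num_workers: int, native: int, mapper: int):
    """Assign each A3C worker either the native Breakout frames or the
    visual-mapper (converted Pong) frames in a native:mapper ratio."""
    cycle = ["native"] * native + ["mapper"] * mapper
    return [cycle[i % len(cycle)] for i in range(num_workers)]

# e.g. 12 workers at the 2:1 ratio -> 8 native, 4 visual-mapper workers
roles = assign_representations(12, native=2, mapper=1)
```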
From Figure 7, it is clear that it is possible to learn better representations through the use of pre-trained methods. Further, through experimentation with different configurations of native versus visual-mapper asynchronous workers, it can be concluded that the number of visual-mapper workers has a significant impact on the results.

In other experiments, we also trained a model with only the representations derived from the visual mapper (implemented as a UNIT GAN converter) for the source game, Pong-v0. This model is never updated with native Atari frames and only uses the generated frames from the source game as input. The model was trained for over 300 epochs and gained only little improvement in learning the expert policy of the target game. Though it did not show significant performance on its own, it acts as a stabilizer to the transfer process with A3C agents.

In the experiments evaluating running mean rewards, plotted in Figure 7(a), the blue curve indicates the baseline model, which is trained directly on the Breakout-v0 environment without using the pre-trained weights obtained from Pong-v0. It is important to note that no converter workers are used in this setup. The other curves are trained with specific ratios of native versus visual-mapper workers. As can be seen from Figure 7(a), the configurations with visual-mapper workers outnumbering native Atari workers did not perform well. This can be explained by the inability of the converter workers to learn by themselves, as explained previously. On the other hand, as the proportion of these workers decreases, the performance of the model increases. The configurations where the ratios of the workers are 3:1 and 2:1 perform better than the baseline. A detailed discussion of the evaluation of each model is given in Section 4.3.

The improvement in performance of the models with fewer visual-mapper workers, which use the generated representations of the target game, raises questions about their importance. However, upon careful observation of the graph in Figure 6(b), it can be seen that vanilla transfer learning methods, in spite of the similarities between the two games, do not transition smoothly when learning the target game from a given source game. The aim of these experiments has been to transfer knowledge between games in a stable and data-efficient manner. It can also be seen that the final performance of each model with the A3C agent converges to the same optimal policy, albeit the time taken to converge differs across experiment settings.

4.3 Evaluation metrics
This section evaluates the trained models across a range of metrics [14] for all experiment setups discussed in the previous section.
Metric               Original Atari frames   3:1     2:1     1:1     1:3
Jumpstart            -                       None    None    None    None
Epoch to threshold   435                     357     319     746     872
Total rewards        47960                   74932   65376   18400   17403
Transfer ratio       -                       1.562   1.363   0.384   0.355

Table 1: Evaluation metrics across different experiment settings, where worker configurations give the ratio of workers using frames from the native Atari target game to workers using frames generated from the source game via visual mappers.

The experiments are evaluated with the following metrics to compare the extent of improvement across transfer learning processes.
Jumpstart: This compares the initial performance of the agent on the transfer learning task; as seen in Table 1, the initial performance could not be improved by transfer from the source task.
Epoch to threshold: This measures the time taken to reach a particular level of performance. Based on the results in Figure 7, it is intuitive that models with a higher number of native workers reach a given threshold earlier. Though there is no standard threshold at which the environment is considered solved, the threshold for this experiment is set at 400.
Total rewards: The total rewards are the area under the graph of mean reward per episode versus total number of episodes for each model. Based on the graphs in the previous section, agents trained with a higher proportion of native workers accrue higher total rewards. The mean rewards per episode are computed for only 700 episodes to ensure that the area under the curve is compared across consistent experiment setups.
Transfer ratio: The transfer ratio is the ratio of the total rewards obtained in the transfer learning experiment to those of the baseline, and it measures the effectiveness of the transfer learning process. A transfer ratio greater than one implies that the total reward accumulated by the transfer learner is higher than that accumulated by the non-transfer learner, and its magnitude indicates the extent of efficient knowledge transfer. Consistent with the previous metrics, agents trained with a higher proportion of native workers transfer better.
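The last three metrics can be computed directly from a model's mean-reward curve; a minimal sketch, treating each curve as an array of mean reward per episode over a fixed number of episodes:

```python
import numpy as np

def total_rewards(curve):
    """Area under the mean-reward-per-episode curve."""
    return np.trapz(curve)

def transfer_ratio(transfer_curve, baseline_curve):
    """Ratio of the transfer learner's total reward to the baseline's."""
    return total_rewards(transfer_curve) / total_rewards(baseline_curve)

def epochs_to_threshold(curve, threshold=400):
    """First epoch at which the mean reward reaches the threshold."""
    hits = np.flatnonzero(np.asarray(curve) >= threshold)
    return int(hits[0]) if hits.size else None
```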
5 Conclusion

We conclude that it is possible to generate a visual mapper for semantically similar games with the use of UNIT GANs. We then explored the idea of learning two different representations of the same game and using them simultaneously for transfer learning, and showed that the learning curve is significantly stabilized. Different ratios of the workers were used to study the effect of the visual mapper on transfer learning. Although the workers using representations of the target game obtained from the visual mappers did not perform well in a stand-alone setting, they showed improvements when used in the competitive learning.

A topic for further research is the generalization of the methods and techniques discussed above to other sets of Atari games. Secondly, the A3C workers that obtain representations of the source game from the visual mapper are to be studied and analyzed further, to test the hypothesis that these workers should perform well on both the source and target games simultaneously when allowed to learn models that do well in multiple settings. The trained models and code can be accessed here for further experiments.

References
[1] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345-1359, Oct 2010. ISSN 1041-4347. doi: 10.1109/TKDE.2009.191.

[2] Chaitanya Asawa, Christopher Elamri, and David Pan. Using transfer learning between games to improve deep reinforcement learning performance and stability, 2017. URL http://web.stanford.edu/class/cs234/past_projects/2017/2017_Asawa_Elamri_Pan_Transfer_Learning_Paper.pdf.

[3] Doron Sobol, Lior Wolf, and Yaniv Taigman. Visual analogies between Atari games for studying transfer learning in RL, 2018. URL https://openreview.net/pdf?id=rJvjL71DM.

[4] Ming-Yu Liu, Thomas Breuel, and Jan Kautz. Unsupervised image-to-image translation networks. CoRR, abs/1703.00848, 2017. URL http://arxiv.org/abs/1703.00848.

[5] Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. CoRR, abs/1602.01783, 2016. URL http://arxiv.org/abs/1602.01783.

[6] Mark McKenzie, Peter Loxley, William Billingsley, and Sebastien Wong. Competitive reinforcement learning in Atari games. In Wei Peng, Damminda Alahakoon, and Xiaodong Li, editors, AI 2017: Advances in Artificial Intelligence, pages 14-26, Cham, 2017. Springer International Publishing. ISBN 978-3-319-63004-5.

[7] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. CoRR, abs/1703.10593, 2017. URL http://arxiv.org/abs/1703.10593.

[8] Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jung Kwon Lee, and Jiwon Kim. Learning to discover cross-domain relations with generative adversarial networks. CoRR, abs/1703.05192, 2017. URL http://arxiv.org/abs/1703.05192.

[9] Ming-Yu Liu and Oncel Tuzel. Coupled generative adversarial networks. CoRR, abs/1606.07536, 2016. URL http://arxiv.org/abs/1606.07536.

[10] Yusuf Aytar, Lluis Castrejon, Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Cross-modal scene networks. CoRR, abs/1610.09003, 2016. URL http://arxiv.org/abs/1610.09003.

[11] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In Proceedings of the 32nd International Conference on Machine Learning, ICML'15, pages 1180-1189. JMLR.org, 2015. URL http://dl.acm.org/citation.cfm?id=3045118.3045244.

[12] Emilio Parisotto, Lei Jimmy Ba, and Ruslan Salakhutdinov. Actor-Mimic: Deep multitask and transfer reinforcement learning. CoRR, abs/1511.06342, 2015. URL http://arxiv.org/abs/1511.06342.

[13] Dino S. Ratcliffe, Luca Citi, Sam Devlin, and Udo Kruschwitz. Domain adaptation for deep reinforcement learning in visually distinct games, 2018. URL https://openreview.net/forum?id=BJB7fkWR-.

[14] Matthew E. Taylor and Peter Stone. Transfer learning for reinforcement learning domains: A survey. J. Mach. Learn. Res., 10:1633-1685, December 2009. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=1577069.1755839.