How Transferable are the Representations Learned by Deep Q Agents?
Jacob Tyo & Zachary Lipton
Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA
[email protected], [email protected]

ABSTRACT
In this paper, we consider the source of Deep Reinforcement Learning (DRL)'s sample complexity, asking how much derives from the requirement of learning useful representations of environment states and how much is due to the sample complexity of learning a policy. While for DRL agents the distinction between representation and policy may not be clear, we seek new insight through a set of transfer learning experiments. In each experiment, we retain some fraction of layers trained on either the same game or a related game, comparing the benefits of transfer learning to learning a policy from scratch. Interestingly, we find that benefits due to transfer are highly variable in general and non-symmetric across pairs of tasks. Our experiments suggest that perhaps transfer from simpler environments can boost performance on more complex downstream tasks and that the requirements of learning a useful representation can range from negligible to the majority of the sample complexity, based on the environment. Furthermore, we find that fine-tuning generally outperforms training with the transferred layers frozen, confirming an insight first noted in the classification setting.
1 INTRODUCTION
Deep Reinforcement Learning (DRL) agents learn policies by selecting actions directly from raw perceptual data. Despite numerous breakthroughs (Mnih et al., 2015; Vinyals et al., 2019; Silver et al., 2016), DRL's prohibitive sample complexity limits its real-world application. The sample complexity of DRL may derive from many sources, including the requirement of learning useful state representations and the requirement of learning good policies given a suitable representation. This paper provides several simple experiments to offer new insights into this breakdown.

For a DRL agent, where precisely the "representation" ends and the "policy" begins is not clear. In our experiments, we consider multiple interpretations of this divide by partitioning the network at various layers. To evaluate the extent to which representation learning contributes to the sample complexity of DRL, we execute a series of transfer learning experiments aimed at determining how quickly an agent can learn given pre-learned representations (from either the same or a different game). Our experiments proceed in the following manner: 1) train a parent network until best reported performance is achieved; 2) transplant the first k layers into a child network, re-initializing the remaining l − k layers randomly; 3) train the child network, either fine-tuning the transplanted layers or keeping them frozen, following the methodology of Yosinski et al. (2014).

In particular, our experiments address the transferability of DQN representations among the three Atari games Berzerk, Krull, and River Raid. We find the benefits of transferring representations from pretrained networks to be remarkably variable in general and non-symmetric across pairs of tasks. While we have not evaluated enough tasks to draw definitive conclusions, our preliminary results suggest that transfer from simple environments may improve performance on more complex tasks more than the reverse transfer. Furthermore, our results suggest that the contribution of learning a useful representation to the overall sample complexity of the problem can range from negligible to the majority, based on the destination environment. Lastly, fine-tuning the transferred layers generally outperforms training with those layers frozen.
2 BACKGROUND AND EXPERIMENTAL SETUP

Reinforcement Learning (RL) methods address the problem where an agent must learn to act in an unknown environment to maximize a reward signal. Initially, the environment provides a state $s$ to the agent, and the agent selects an action $a$ based on the provided state. Thereafter, at each time step the environment provides the agent with a state $s_t$ and a reward $r_t$, each influenced by the agent's previous action. The agent must then select subsequent actions $a_t$, and so on, until the episode terminates. Formally, this interaction is described by a Markov Decision Process (MDP) $M = \langle \mathcal{S}, \mathcal{A}, T, r, \gamma \rangle$, where $\mathcal{S}$ is the set of states, $\mathcal{A}$ is the set of actions, $T(s, a, s') = P(s_{t+1} = s' \mid s_t = s, a_t = a)$ is the transition function, $r(s, a) = \mathbb{E}[r_{t+1} \mid s_t = s, a_t = a]$ is the reward function, and $\gamma \in [0, 1]$ is a discount factor.

We focus on Q-learning, an off-policy, value-based RL algorithm. The value of each state is represented by $v_\pi(s) = \mathbb{E}_\pi[G_t \mid s_t = s]$, and the value of each state-action pair is represented by $q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid s_t = s, a_t = a]$, where $G_t = \sum_{i=t}^{T} \gamma^{i-t} r_i$. In Q-learning, the agent estimates the state-action value function by predicting the expected discounted return.

Many interesting problems, including Atari games, have large state and action spaces, making tabular estimates of the Q-function intractable. In these cases, the Q-function can be approximated. DRL denotes methods that approximate the value function (or, in other algorithms, the policy directly) with deep neural networks.

In this paper, we build on a DRL Q-learning implementation called Rainbow (Hessel et al., 2018): a 5-layer convolutional neural network based on Mnih et al. (2015) that incorporates double Q-learning (Van Hasselt et al., 2016), prioritized replay (Schaul et al., 2015), dueling networks (Wang et al., 2015), multi-step learning (Sutton, 1988), distributional RL (Bellemare et al., 2017), and Noisy Nets (Fortunato et al., 2017). Furthermore, Rainbow maintains two separate Q-networks: one with parameters $\theta$, and a second with parameters $\theta_{\text{target}}$ that are updated from $\theta$ every fixed number of iterations. In order to capture the game dynamics, a state is represented by a sequence (four in our case) of history frames.
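To make the notion of the "first k layers" concrete, the following is a minimal PyTorch sketch of the five-layer convolutional Q-network of Mnih et al. (2015) on which Rainbow builds: three convolutional layers followed by two fully connected layers. It is an illustration only. Rainbow's dueling streams, noisy linear layers, and distributional output are omitted, and the class and argument names (QNetwork, num_actions, history_len) are ours rather than those of the implementation used in the experiments.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Minimal 5-layer Q-network in the style of Mnih et al. (2015).

    Rainbow additionally uses dueling streams, noisy linear layers, and a
    distributional output head; those are omitted here. Layer indices 1-5
    match the "first k layers" terminology of the transfer experiments.
    """

    def __init__(self, num_actions: int, history_len: int = 4):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Sequential(nn.Conv2d(history_len, 32, kernel_size=8, stride=4), nn.ReLU()),  # layer 1
            nn.Sequential(nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU()),           # layer 2
            nn.Sequential(nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU()),           # layer 3
            nn.Sequential(nn.Flatten(), nn.Linear(64 * 7 * 7, 512), nn.ReLU()),             # layer 4
            nn.Linear(512, num_actions),                                                     # layer 5
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: a stack of the four most recent 84x84 grayscale frames.
        for layer in self.layers:
            x = layer(x)
        return x
```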
We tested the transferability of features learned by Rainbow agents on Atari games (i.e., environments). To make the transfer learning experiments feasible within our resources, we selected environments according to the speed with which a Rainbow agent can reach high performance, requiring that the cardinality of the state and action spaces of each environment be equivalent and that two among the three environments be qualitatively similar (same genre of game). We are interested in game similarity to test the hypothesis that more similar games have more similar representations, and that agent transfer should therefore be more effective between them. Requiring the same state and action space cardinality made agent transfer possible without modification. Drawing on the results of Hessel et al. (2018), we selected Berzerk, Krull, and River Raid from the Arcade Learning Environment (Bellemare et al., 2013).

Drawing inspiration from Yosinski et al. (2014), our experiments proceeded as follows:

1. For all environments, train a parent network (a Rainbow agent) until best-reported performance is reached.
2. For every permutation of environment pairs, transplant the first k layers of the parent network into a child network (also a Rainbow agent), then reinitialize the remaining l − k layers randomly (where l is the total number of layers in the network and k ∈ {2, 4}).
3. Train the child network, either fine-tuning or freezing the transplanted portion of the network (we explore both settings).

With several runs per pair of environments, each setting of k, and each choice among {freezing, fine-tuning}, the full set of trials took multiple days on the available resources.
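As a rough illustration of steps 2 and 3 above, the sketch below copies the first k layers from a trained parent into a freshly initialized child and optionally freezes them. It assumes the hypothetical QNetwork class sketched earlier and stands in for, rather than reproduces, the actual experiment code.

```python
def make_child(parent: QNetwork, k: int, freeze: bool, num_actions: int) -> QNetwork:
    """Transplant the first k layers of a trained parent into a child network.

    The remaining l - k layers keep the child's fresh random initialization.
    If freeze is True, the transplanted layers are excluded from gradient
    updates; otherwise they are fine-tuned along with the new layers.
    """
    child = QNetwork(num_actions)
    for i in range(k):
        child.layers[i].load_state_dict(parent.layers[i].state_dict())  # copy weights
        if freeze:
            for p in child.layers[i].parameters():
                p.requires_grad = False
    return child

# Hypothetical usage: build "child4-frozen-Krull" before training it on another game.
# child = make_child(parent_krull, k=4, freeze=True, num_actions=n_actions)
```

In the fine-tuning setting, freeze is simply set to False so that the transplanted layers continue to receive gradient updates during the child's training.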
3 EXPERIMENTAL RESULTS AND DISCUSSION

In this section, we analyze the performance and learned representations of the child agents, where one child exists for every environment pair, k-value, and choice among freezing/fine-tuning. For brevity, we denote a child agent with 4 layers transferred from a parent trained on environment1 and then frozen as child4-frozen-environment1. We refer to the parents as baselines with respect to their performance on the environment on which they were trained.

Figure 1: Child agent (blue, red, and green) performance over iterations of training when the transplanted layers are frozen. Compare against the parent network trained from scratch (black).

We evaluate each network separated at two places, transferring 2 and transferring 4 (out of 5) layers, and compare freezing vs. fine-tuning of the transplanted layers. For each respective experiment, we refer to the output of the last layer of the transplanted portion of the network as the "representation."

Figures 1a, 1b, and 1c demonstrate the performance of all environment pairs when 4 layers are transplanted and then frozen during subsequent training. Because only the output layer is not frozen, these plots show the performance of a linear policy trained on the transplanted representation. Levine et al. (2017) deem representations that can be used with a linear policy to be "good" representations, thus our analysis starts with these experiments.

Figure 1a presents the performance of the agents on Berzerk for runs with 4 transplanted layers, all of which are frozen during the subsequent training. The child4-frozen-Berzerk agent (blue) does not outperform the baseline (either by final performance or training speed), which we find surprising, since four of five layers of this agent were transferred from a high-performing parent trained on the same environment, leaving only a linear policy to be relearned. We might deduce that for this game, the entire difficulty of DRL can be attributed simply to the "RL," since starting off with representations upon which a strong linear policy can be learned seems to confer no benefit. Notably, the agents transferred from foreign games (child4-frozen-Krull and child4-frozen-RiverRaid) have worse final performance than the baseline. As a linear policy could not be learned from a known good representation, we would not expect any other representation to perform better. Figure 1d shows these trends continuing when 2 layers are transplanted. The difference in performance between the children and the baseline is negligible, again indicating that the difficulty lies within the RL problem. Even with more model complexity allotted to the policy and a different representation, no benefits are seen.

Figure 2: Child agent (blue, red, and green) performance over iterations of training when the transplanted layers are fine-tuned. Compare against the parent network trained from scratch (black).

However, as shown in Figure 1b, the transfer from Krull to Krull results in faster training and higher final performance than the baseline, perhaps suggesting that for this game, representation learning is a more significant part of the challenge. Note that with only two layers transferred, the representations pretrained on both Krull and Berzerk outperform the baseline, as shown in Figure 1e.
This reflects our intuition that these games are similar, while the representations pretrained on River Raid confer no benefit.

Transferring and freezing layers from the parent to the child agent on Berzerk and Krull showed negligible and positive changes in performance, respectively. However, River Raid paints a different picture. As shown in Figure 1c, no agent is able to learn a useful policy. This is especially surprising because we know that, if nothing else, the transfer from the parent trained on River Raid itself could replicate the baseline results by learning a linear policy. Similarly, we see in Figure 1f that the agent with transplanted layers trained on the same environment performs the worst. Again this is surprising because, theoretically, the baseline could be replicated. Interestingly, this may indicate that a representation learned from a simpler environment may be helpful in solving more complex environments. The converse does not appear to hold in our experiments, but a deeper analysis on more tasks would be required to confirm these intuitions.

In general, we see that when transplanted layers are frozen, there is a trade-off between negative transfer and training speed. As the number of unfrozen layers increases, the effect of negative transfer decreases but training slows slightly. Furthermore, Figure 1 shows that transfer between environments is not symmetric. In other words, if a representation x is learned on environment X and is shown to perform well on environment Y, this does not imply that a representation y learned on environment Y will be effective on X. This relation continues to hold even when fine-tuning the transplanted layers. However, when fine-tuning the transplanted layers we do not see the same level of negative transfer.

Figures 2a, 2b, and 2c show the performance of all environment pairs when four layers are transplanted from the parents to the children and then fine-tuned, whereas Figures 2d, 2e, and 2f show the performance when only two layers are transplanted and fine-tuned. Intuitively, these experiments, which allow the transplanted representations to change during training, examine a more flexible notion of their utility (as initializations only). For these experiments, in all environments except River Raid, negative transfer is no longer an issue. In River Raid, negative transfer is still seen in some children, though in fewer than when the transferred layers are frozen.

The performance of all agents on Berzerk with four and two layers transplanted and fine-tuned, respectively, shown in Figures 2a and 2d, is similar to the performance seen for the same game with frozen transplanted layers. There is little difference between the agents, regardless of the environment on which the parent was trained. This further supports the earlier conclusion that the difficulty of this game is due entirely to learning the policy, as no transferred representation improves performance.

Transplanting and fine-tuning layers results in a large decrease in training time and a large increase in final performance on the Krull environment. This holds both when four layers are transplanted and fine-tuned, as shown in Figure 2b, and when two layers are, as shown in Figure 2e. As expected, the transplanted layers from the parent trained on Krull performed best when four layers were transplanted. But interestingly, when only two layers were transplanted, the layers transplanted from the parent trained on Berzerk were more effective.
Overall, this supports our earlier conclusion about the importance of representations for Krull: in all cases, transfer is preferred to random initialization.

In Figures 1 and 2, we see the general trend that the transfer of pretrained layers has a negligible effect on Berzerk, positive effects on Krull, and negative effects on River Raid. Figures 2c and 2f show the performance of all agents with four and two layers transplanted and fine-tuned, respectively. Interestingly, River Raid exhibits the same trend observed in Krull: the child with features transferred from a parent trained on the same environment is the best performer when four layers are transferred, yet not when only two layers are transferred. However, the baseline trains faster and reaches better final performance than all of the children on River Raid.
4 CONCLUSIONS

This paper presents an empirical evaluation of transfer learning experiments on agents trained on the Atari 2600 games Berzerk, Krull, and River Raid. We compare the effect of transplanting the initial layers of a pretrained network into a child network while either freezing or fine-tuning the transplanted layers. Surprisingly, the benefits of transferring portions of pretrained networks are highly variable and non-symmetric across tasks. Furthermore, the requirements of learning a useful representation can range from negligible to the majority of the sample complexity, based on the destination environment. We present analyses of why each task transfer behaves as observed, and give intuition for understanding representations and policies in DQNs. We show that, in general, fine-tuning is better than freezing portions of networks, as performance gains can still be expected with less likelihood of negative transfer.

Zahavy et al. (2016) have shown that DQNs find hierarchical abstractions automatically. Our work suggests that the similarity of high-level task abstractions may be a good metric for determining the transferability of DQN agents. Future work includes deeper analysis of these experiments to determine how agents pretrained on different environments "focus" differently, how their representations differ, and how to numerically quantify the contribution of representation learning to the overall sample complexity. Furthermore, this methodology can lend insight into questions about the benefit of unsupervised reinforcement learning for pre-training, e.g., techniques based on intrinsic motivation (Chentanez et al., 2005). To the extent that intrinsic motivation serves to learn representations suitable for fine-tuning to a given reward signal, quantifying just how much representation learning is the bottleneck to learning in the first place can help in assessing its potential.
REFERENCES

Mehran Asadi and Manfred Huber. Effective control knowledge transfer through learning skill and representation hierarchies. In IJCAI, volume 7, pp. 2054–2059, 2007.

Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.

Marc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 449–458. JMLR.org, 2017.

Daniele Calandriello, Alessandro Lazaric, and Marcello Restelli. Sparse multi-task reinforcement learning. In Advances in Neural Information Processing Systems, pp. 819–827, 2014.

Nuttapong Chentanez, Andrew G Barto, and Satinder P Singh. Intrinsically motivated reinforcement learning. In Advances in Neural Information Processing Systems, pp. 1281–1288, 2005.

Meire Fortunato, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Ian Osband, Alex Graves, Vlad Mnih, Remi Munos, Demis Hassabis, Olivier Pietquin, et al. Noisy networks for exploration. arXiv preprint arXiv:1706.10295, 2017.

David Foster and Peter Dayan. Structure in the space of value functions. Machine Learning, 49(2-3):325–346, 2002.

Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Alessandro Lazaric. Knowledge transfer in reinforcement learning. PhD thesis, Politecnico di Milano, 2008.

Nir Levine, Tom Zahavy, Daniel J Mankowitz, Aviv Tamar, and Shie Mannor. Shallow updates for deep reinforcement learning. In Advances in Neural Information Processing Systems, pp. 3135–3145, 2017.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.

Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.

David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.

Satinder Pal Singh. Transfer of learning by composing solutions of elemental sequential tasks. Machine Learning, 8(3-4):323–339, 1992.

Richard S Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44, 1988.

Matthew E Taylor and Peter Stone. Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 10(Jul):1633–1685, 2009.

Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.

Oriol Vinyals, Igor Babuschkin, Junyoung Chung, Michael Mathieu, Max Jaderberg, Wojciech M. Czarnecki, Andrew Dudzik, Aja Huang, Petko Georgiev, Richard Powell, Timo Ewalds, Dan Horgan, Manuel Kroiss, Ivo Danihelka, John Agapiou, Junhyuk Oh, Valentin Dalibard, David Choi, Laurent Sifre, Yury Sulsky, Sasha Vezhnevets, James Molloy, Trevor Cai, David Budden, Tom Paine, Caglar Gulcehre, Ziyu Wang, Tobias Pfaff, Toby Pohlen, Yuhuai Wu, Dani Yogatama, Julia Cohen, Katrina McKinney, Oliver Smith, Tom Schaul, Timothy Lillicrap, Chris Apps, Koray Kavukcuoglu, Demis Hassabis, and David Silver. AlphaStar: Mastering the real-time strategy game StarCraft II. https://deepmind.com/blog/alphastar-mastering-real-time-strategy-game-starcraft-ii/, 2019.

Thomas J Walsh, Lihong Li, and Michael L Littman. Transferring state abstractions between MDPs. In ICML Workshop on Structural Knowledge Transfer for Machine Learning, 2006.

Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Van Hasselt, Marc Lanctot, and Nando De Freitas. Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581, 2015.

Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, pp. 3320–3328, 2014.

Tom Zahavy, Nir Ben-Zrihem, and Shie Mannor. Graying the black box: Understanding DQNs. In International Conference on Machine Learning, pp. 1899–1908, 2016.
A RELATED WORK

Hessel et al. (2018) present Rainbow, the DQN used in this paper, incorporating double Q-learning (Van Hasselt et al., 2016), prioritized replay (Schaul et al., 2015), dueling networks (Wang et al., 2015), multi-step learning (Sutton, 1988), distributional reinforcement learning (Bellemare et al., 2017), and noisy networks (Fortunato et al., 2017).

Zahavy et al. (2016) analyze Deep Q Networks (DQNs) by observing the activations of the model's last layer and saliency maps. They show that DQNs learn temporal abstractions, such as hierarchical state aggregation and options, automatically. Levine et al. (2017) show that the last layer in a deep architecture can be seen as a linear representation, and thus can be learned using standard shallow reinforcement learning algorithms. They then show that this hybrid approach improves performance on the Atari benchmark.

Yosinski et al. (2014) present a large-scale study of feature transferability in deep neural networks. They show that transferability is negatively affected by the specialization of higher-layer neurons to their original task. Furthermore, optimization difficulties can arise when co-adapted neurons are split during transfer, and freezing vs. fine-tuning of transferred layers is compared. The authors show that transferring features, even from a very different task, can improve generalization performance even after substantial fine-tuning on a new task. Lastly, a relation between the effectiveness of transfer and the distance between tasks is presented, but even in the worst case transfer is shown to be better than random initialization.

Taylor & Stone (2009) provide a survey of transfer learning techniques in reinforcement learning. Here we focus on tasks that allow variation in the reward function, as we assume in this paper that the state spaces are of the same cardinality and the action spaces are equivalent. Singh (1992) and Foster & Dayan (2002) learn multiple tasks by assuming that each goal (or composite) task is composed of several elemental tasks, and then learning a set of elemental tasks that can be composed to solve each task of interest.

Solving multiple MDPs has also been approached from the representation perspective, specifically with the goal of developing a shared representation that can then be used to solve all tasks. The approach proposed by Asadi & Huber (2007) focuses on learning a more efficient state-space representation of the problem that will transfer between multiple tasks, and then learning options on the new representation. Walsh et al. (2006) use a similar approach, but rely on learned state abstraction techniques to transfer between tasks. Another approach similar to state abstractions is to compare observations ($\langle s, a, r, s' \rangle$)