Universal Successor Representations for Transfer Reinforcement Learning
Workshop track - ICLR 2018
Chen Ma & Junfeng Wen
Department of Computing Science, University of Alberta
[email protected], [email protected]

Yoshua Bengio
MILA, Université de Montréal
[email protected]

Abstract
The objective of transfer reinforcement learning is to generalize from a set of previous tasks to unseen new tasks. In this work, we focus on the transfer scenario where the dynamics among tasks are the same, but their goals differ. Although general value functions (Sutton et al., 2011) have been shown to be useful for knowledge transfer, learning a universal value function can be challenging in practice. To attack this, we propose (1) to use universal successor representations (USR) to represent the transferable knowledge and (2) a USR approximator (USRA) that can be trained by interacting with the environment. Our experiments show that USR can be effectively applied to new tasks, and that an agent initialized by the trained USRA can achieve the goal considerably faster than with random initialization.
1 Introduction
Deep reinforcement learning (RL) has shown its capability to learn human-level knowledge in many domains, such as playing Atari games (Mnih et al., 2015) and control in robotics (Levine et al., 2016). However, these methods often spend a huge amount of time and resources only to train a deep model for a very specific task. How to utilize knowledge learned from one task in other related tasks remains a challenging problem. Transfer reinforcement learning (Taylor & Stone, 2009), which reuses previous knowledge to facilitate new tasks, is appealing for solving this problem. Knowledge transfer would not be possible if the tasks were completely unrelated. Therefore, in this work, we focus on one particular transfer scenario, where the dynamics among tasks remain the same and their goals are different, as will be elaborated in Sec. 2.

General value functions (Sutton et al., 2011) can be used as knowledge for transfer. However, learning a good universal value function approximator V(s, g; θ) (Schaul et al., 2015), which generalizes over the state s and the goal g with parameters θ, is challenging. Unlike Schaul et al. (2015), who factorized the general state values into state and goal features to facilitate learning, we propose to learn a universal approximator for successor representations (SR) (Dayan, 1993), which is more suitable for transfer, as we will see in Sec. 2.

Kulkarni et al. (2016) proposed a deep learning framework to approximate SR and incorporated it with Q-learning to learn SR by interacting with the environment on a single task. In comparison, our approach learns the universal SR (USR), which generalizes not only over states but also over goals, so as to accomplish multi-task learning and transfer among tasks. Additionally, we incorporate the framework with actor-critic (Mnih et al., 2016) to learn the SR in an on-policy fashion.

2 Universal Successor Representations
Consider a Markov decision process (MDP) with state space S, action space A and transition probability p(s' | s, a) of reaching s' ∈ S when action a ∈ A is taken in state s ∈ S. For any goal g ∈ G (very often G ⊆ S), define a pseudo-reward function r_g(s, a, s') and a pseudo-discount function γ_g(s) ∈ [0, 1]. One choice of γ_g(s) can be that γ_g(s) = 0 when s is a terminal state w.r.t. g. For any policy π : S → A, the general value function (Sutton et al., 2011) is defined as

$$V^\pi_g(s) = \mathbb{E}_\pi\!\left[\, \sum_{t=0}^{\infty} r_g(S_t, A_t, S_{t+1}) \prod_{k=0}^{t} \gamma_g(S_k) \;\middle|\; S_0 = s \right].$$

For any g, there exists $V^*_g(s) = V^{\pi^*_g}_g(s)$ evaluated according to the optimal policy π*_g w.r.t. g. By seeing many optimal policies π*_g and optimal values V*_g for different goals, we would hope that the agent can utilize previous experience and quickly adapt to a new goal. Ideally, such transfer would succeed if we could accurately model π*_g(s), V*_g(s) using universal approximators π(s, g; θ_π), V(s, g; θ_V), where θ_π, θ_V are the respective parameters. However, this would not be easy without utilizing the similarities within r_g for all g, as we discuss next.
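To make the pseudo-reward and pseudo-discount above concrete, a minimal sketch for a simple goal-reaching task is given below; the +1 reward at the goal and the constant discount elsewhere are illustrative assumptions, not the paper's specification.

```python
GAMMA = 0.95  # base discount factor (assumed)

def pseudo_reward(s, a, s_next, g):
    """r_g(s, a, s'): reward only when the transition enters the goal state g."""
    return 1.0 if s_next == g else 0.0

def pseudo_discount(s, g):
    """gamma_g(s): zero once s is terminal w.r.t. g, so returns stop accumulating."""
    return 0.0 if s == g else GAMMA
```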
2.1 Transfer via Universal Successor Representations

We assume that the reward function can be factorized as (Kulkarni et al., 2016; Barreto et al., 2017)

$$r_g(s_t, a_t, s_{t+1}) = \phi(s_t, a_t, s_{t+1})^\top w_g, \qquad (1)$$

where φ ∈ R^d are state features and w_g ∈ R^d are goal-specific features of the reward. Note that if w_g can be effectively computed for any g, then we can quickly identify r_g, since φ is shared across goals. With this factorization, for a fixed policy π, the general value function can be computed as

$$V^\pi_g(s) = \mathbb{E}_\pi\!\left[\, \sum_{t=0}^{\infty} \phi(S_t, A_t, S_{t+1}) \prod_{k=0}^{t} \gamma_g(S_k) \;\middle|\; S_0 = s \right]^{\!\top} w_g = \psi^\pi_g(s)^\top w_g,$$

where ψ^π_g(s) is defined as the universal successor representation (USR) of state s. The following Bellman equations enable us to learn the USR in the same way as learning the value function:

$$V^\pi_g(s) = \mathbb{E}_\pi\!\left[ r_g(s, A, S') + \gamma_g(s)\, V^\pi_g(S') \right], \qquad \psi^\pi_g(s) = \mathbb{E}_\pi\!\left[ \phi(s, A, S') + \gamma_g(s)\, \psi^\pi_g(S') \right].$$
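As a sanity check on this factorization, the following sketch verifies on a single sampled trajectory (standing in for the expectation, with toy random numbers) that the discounted sum of factorized rewards r_g = φ^⊤ w_g equals ψ^⊤ w_g, where ψ is the correspondingly discounted sum of features; all dimensions and values here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 4, 6                        # feature dimension and rollout length (toy values)

phi = rng.random((T, d))           # phi(S_t, A_t, S_{t+1}) along one sampled trajectory
gammas = np.full(T, 0.9)           # gamma_g(S_k) along the same trajectory
w_g = rng.random(d)                # goal-specific reward weights

discounts = np.cumprod(gammas)     # prod_{k=0}^{t} gamma_g(S_k)

# Return computed from the factorized rewards r_g = phi^T w_g ...
ret = sum(discounts[t] * (phi[t] @ w_g) for t in range(T))
# ... equals psi^T w_g, with psi the discounted sum of state features.
psi = sum(discounts[t] * phi[t] for t in range(T))
assert np.isclose(ret, psi @ w_g)
```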
Figure 1: Model architecture.

Framework Architecture.
In addition to modeling the USR with a USR approximator (USRA) ψ^π(s, g; θ_ψ) parametrized by θ_ψ, we also model the policy with π(s, g; θ_π). Practically, we combine θ_π and θ_ψ in a deep neural network such that they share the first few layers and fork in the higher layers. In order to quickly transfer to a new goal, we need an efficient way to obtain w_g given the goal g. This can be achieved by directly modeling w_g = w(g; θ_w) using a neural network. Finally, we further encode the state features φ(s, a, s') as φ(s, a, s'; θ_φ). Often, it is sufficient to model them as φ(s'; θ_φ), as we do in the experiment. To summarize, the trainable parameters of our model are (θ_π, θ_ψ, θ_w, θ_φ), as shown in Fig. 1.
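A minimal sketch of such a network is shown below, assuming flat state and goal vectors; the layer sizes, the exact point where the policy and USR heads fork, and the use of fully connected layers are assumptions made for illustration rather than details taken from the paper.

```python
import torch
import torch.nn as nn

class USRA(nn.Module):
    """Sketch of the USRA: a shared trunk with forked policy and USR heads,
    plus a separate network w(g; theta_w) for the goal-specific reward weights."""

    def __init__(self, state_dim, goal_dim, n_actions, d=16, hidden=128):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(state_dim + goal_dim, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, n_actions)        # pi(s, g; theta_pi)
        self.usr_head = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, d))                               # psi(s, g; theta_psi)
        self.w = nn.Linear(goal_dim, d)                         # w(g; theta_w)

    def forward(self, s, g):
        h = self.shared(torch.cat([s, g], dim=-1))
        logits = self.policy_head(h)
        psi = self.usr_head(h)
        w_g = self.w(g)
        value = (psi * w_g).sum(-1)                             # V(s, g) = psi^T w_g
        return logits, psi, w_g, value
```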
Transfer via USRA. The trained USRA can be used (1) as an initialization for exploring a new goal, and (2) to directly compute a policy for any new goal.
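In code, both uses amount to a few lines; the sketch below reuses the hypothetical USRA module from above, and the names model, new_model, s, and g_new are placeholders for illustration.

```python
import torch

def act_for_new_goal(model, s, g_new):
    # (2) Zero-shot: compute the policy directly by conditioning on the unseen goal.
    logits, psi, w_g, value = model(s, g_new)
    return torch.distributions.Categorical(logits=logits).sample()

def init_from_usra(model, new_model):
    # (1) Initialization: copy the trained parameters, then keep exploring and
    # fine-tuning on the new goal instead of starting from a random initialization.
    new_model.load_state_dict(model.state_dict())
    return new_model
```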
2.2 Training USR
We begin with the state features φ(s). The state features are learned with an autoencoder, mapping from the raw input s to φ(s) and then back to s. In the early stage of training, states s are sampled from the exploration of an agent with a randomly initialized policy. The autoencoder is trained on the reconstruction loss, and θ_φ are the encoder parameters. This step can be skipped when φ(s) already has a meaningful natural representation.

Once φ(s) is trained to convergence, we learn the rest of the parameters with an actor-critic method by interacting with the environment. Algorithm 1 highlights the learning procedure. The update to θ_π is the typical policy gradient method (Williams, 1992).
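A minimal sketch of this φ(s) pretraining step might look as follows; the feature dimension, layer sizes, optimizer, and the use of flattened image inputs are assumptions for illustration.

```python
import torch
import torch.nn as nn

d, state_dim, hidden = 16, 64 * 64, 128      # assumed sizes

encoder = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                        nn.Linear(hidden, d))               # phi(s; theta_phi)
decoder = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(),
                        nn.Linear(hidden, state_dim))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

def pretrain_phi(states, epochs=10):
    """states: raw inputs collected by an agent with a randomly initialized policy."""
    for _ in range(epochs):
        recon = decoder(encoder(states))
        loss = ((recon - states) ** 2).mean()               # reconstruction loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return encoder                                          # theta_phi is then frozen
```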
3 Experiment

We perform experiments in a four-room grid-world environment. The agent's objective is to reach certain positions (goals). We use a grid world for simplicity, but our model takes raw pixels as input to show how USRA can handle continuous spaces. There are 64 goals in total, 48 of which act as source goals and the remaining 16 as unseen target goals to transfer to. The state input is an image indicating the agent's location; the goal is represented in the same way.
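For concreteness, one simple way to build such image inputs is sketched below; the grid size and the single-pixel encoding are assumptions, since the text does not specify the exact rendering.

```python
import numpy as np

GRID = 13  # side length of the four-room grid (assumed)

def render_position(pos):
    """One-channel image with a single pixel set at the given cell."""
    img = np.zeros((GRID, GRID), dtype=np.float32)
    img[pos] = 1.0
    return img.ravel()                     # flattened raw-pixel input

state_img = render_position((3, 4))        # image indicating the agent's location
goal_img = render_position((9, 10))        # the goal image is built the same way
```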
Algorithm 1: USR with actor-critic
for each time step t do
    Obtain transition {g, s_t, a_t, s_{t+1}, r_t, γ_t} from the environment by following π(s_t)
    Perform gradient descent on L_w = [r_t − φ(s_{t+1})^⊤ w(g; θ_w)]² w.r.t. θ_w
    Perform gradient descent on L_ψ = ‖φ(s_t) + γ_t ψ(s_{t+1}, g; θ_ψ) − ψ(s_t, g; θ_ψ)‖² w.r.t. θ_ψ
    Compute the advantage A_t = [φ(s_t) + γ_t ψ(s_{t+1}, g) − ψ(s_t, g)]^⊤ w(g)
    Perform gradient descent on J_π = log π(s_t, g; θ_π) A_t w.r.t. θ_π
end for

Figure 2: USR generalization. Figure 3: π generalization. Figure 4: Effect of initialization.
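A sketch of the per-step updates in Algorithm 1 is given below. It reuses the hypothetical USRA module and the frozen encoder from the earlier sketches; for brevity a single optimizer over the USRA parameters replaces the separate gradient steps of Algorithm 1, so treat it as an illustration of the losses rather than a faithful reimplementation.

```python
import torch
import torch.nn.functional as F

def update_step(model, phi_encoder, opt, g, s_t, a_t, s_next, r_t, gamma_t):
    logits, psi_t, w_g, _ = model(s_t, g)
    with torch.no_grad():                                 # targets carry no gradient
        _, psi_next, _, _ = model(s_next, g)
        phi_t = phi_encoder(s_t)
        phi_next = phi_encoder(s_next)

    # L_w: fit the goal weights so that phi(s_{t+1})^T w(g) predicts r_t
    loss_w = (r_t - phi_next @ w_g) ** 2

    # L_psi: TD regression of psi(s_t, g) toward phi(s_t) + gamma_t * psi(s_{t+1}, g)
    psi_target = phi_t + gamma_t * psi_next
    loss_psi = ((psi_target - psi_t) ** 2).sum()

    # Advantage A_t = [phi(s_t) + gamma_t psi(s_{t+1}, g) - psi(s_t, g)]^T w(g)
    advantage = ((psi_target - psi_t) @ w_g).detach()

    # Actor-critic policy-gradient loss for theta_pi
    loss_pi = -F.log_softmax(logits, dim=-1)[a_t] * advantage

    loss = loss_w + loss_psi + loss_pi
    opt.zero_grad()
    loss.backward()
    opt.step()
```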
3.1 Generalization Performance on Unseen Goals
In this section, we show how our model can generalize and transfer from source goals to target goals. Following our approach, we first train the USRA on k randomly selected source goals until it converges. To measure generalization performance on the target goals, we then compute the distance between the USR/policy produced by our model and the "optimal" ones, which are obtained by training directly on the target goals with the same model until convergence. Here we use the mean squared error (MSE) as the distance for the USR and the cross entropy for the policy, averaged over 6 repeats.

Fig. 2 and Fig. 3 visualize the generalization performance of the USR and the policy w.r.t. different numbers of source goals used for training, with the solid line showing the mean and the shaded region the standard error. First, note that as the number of source goals increases, the generalized policy and USR approach the "optimal" ones. Second, the generalization performance when training on k = 20 goals is comparable to that on k = 40 goals. This indicates that only a relatively small portion of the goals is required to achieve decent generalization. These results demonstrate that our approach enables the USR and the policy to generalize across goals.
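As a concrete reading of the two metrics above (MSE for the USR, cross entropy for the policy), a minimal sketch follows; the function names and inputs are hypothetical placeholders.

```python
import numpy as np

def usr_distance(psi_pred, psi_opt):
    """MSE between the generalized USR and the 'optimal' USR for a target goal."""
    return np.mean((psi_pred - psi_opt) ** 2)

def policy_distance(pi_pred, pi_opt, eps=1e-12):
    """Cross entropy between the 'optimal' action distribution and the generalized one."""
    return -np.sum(pi_opt * np.log(pi_pred + eps))
```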
3.2 Trained USRA as Initialization
In this section, we show how the trained model can be used as an initialization for fast learning on target goals. We first train the USRA on k randomly selected source goals until convergence, then initialize the agent with this learned USRA for further exploration on the target goals. Fig. 4 shows the average reward the agent collects on target tasks over training steps. The baseline is trained with random initialization. When the number of source goals k is very small (k = 1), the agent learns more slowly than the baseline, which could be due to insufficient knowledge interfering with learning the new goals. However, when trained on a sufficient number of goals, 20 out of 64 in this case, the agent learns considerably faster on new goals. These results show that an agent initialized with a USRA trained on only a small portion of the goals can learn much faster than with random initialization.

4 Conclusion
In this work, we focus on the transfer reinforcement learning problem in which tasks share the same underlying dynamics but their goals differ. Our experiments show that the proposed USRA can generalize across tasks and can be used as a better initialization for learning new tasks.

References
André Barreto, Will Dabney, Rémi Munos, Jonathan J Hunt, Tom Schaul, David Silver, and Hado P van Hasselt. Successor features for transfer in reinforcement learning. In Advances in Neural Information Processing Systems, pp. 4058–4068, 2017.

Peter Dayan. Improving generalization for temporal difference learning: The successor representation. Neural Computation, 5(4):613–624, 1993.

Tejas D Kulkarni, Ardavan Saeedi, Simanta Gautam, and Samuel J Gershman. Deep successor reinforcement learning. arXiv preprint arXiv:1606.02396, 2016.

Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937, 2016.

Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal value function approximators. In International Conference on Machine Learning, pp. 1312–1320, 2015.

Richard S Sutton, Joseph Modayil, Michael Delp, Thomas Degris, Patrick M Pilarski, Adam White, and Doina Precup. Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In The 10th International Conference on Autonomous Agents and Multiagent Systems - Volume 2, pp. 761–768. International Foundation for Autonomous Agents and Multiagent Systems, 2011.

Matthew E Taylor and Peter Stone. Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 10(Jul):1633–1685, 2009.

Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.