Learning State Representations from Random Deep Action-conditional Predictions
Zeyu Zheng, Vivek Veeriah, Risto Vuorio, Richard Lewis, Satinder Singh
University of Michigan († now at the University of Oxford). Correspondence to: Zeyu Zheng <[email protected]>.

Abstract
In this work, we study auxiliary prediction tasks defined by temporal-difference networks (TD networks) (Sutton & Tanner, 2004); these networks are a language for expressing a rich space of general value function (GVF) prediction targets that may be learned efficiently with TD. Through analysis in an illustrative domain we show the benefits to learning state representations of exploiting the full richness of TD networks, including both action-conditional predictions and temporally deep predictions. Our main (and perhaps surprising) result is that deep action-conditional TD networks with random structures, which pose random prediction questions about random features, yield state representations that are competitive with state-of-the-art hand-crafted value prediction and pixel control auxiliary tasks in both Atari games and DeepMind Lab tasks. We also show through stop-gradient experiments that learning the state representations solely via these unsupervised random TD network prediction tasks yields agents that outperform the end-to-end-trained actor-critic baseline.
1. Introduction
Providing auxiliary tasks to Deep Reinforcement Learning (Deep RL) agents has become an important class of methods for driving the learning of representations that accelerate learning on a main task. Existing auxiliary tasks have the property that their semantics are fixed and carefully designed by the agent designer. Some notable examples include pixel control, reward prediction, termination prediction, and multi-horizon value prediction (these are reviewed in more detail below). Unlike the prior approaches that require careful design of auxiliary task semantics, we explore here a different approach in which a set of random prediction tasks is generated through a rich space of general value functions (GVFs) defined by a language of predictions of random features of observations. We adopt temporal-difference networks (TD networks) (Sutton & Tanner, 2004) as our language of GVF prediction targets. Our main, and perhaps surprising, result is that TD networks that ask random questions about the action-conditional futures of random features yield state representations that are competitive with state-of-the-art auxiliary tasks with hand-crafted semantics. We demonstrate this in Atari games and DeepMind Lab tasks, comparing to multi-horizon value prediction and pixel control as our baseline auxiliary tasks. Through empirical analyses on illustrative domains we show the benefits of exploiting the richness of TD networks: their temporal depth and action-conditionality. We also provide direct evidence that using random TD networks learns useful representations for the main task through a set of stop-gradient experiments in which the state representations are trained solely via the random TD network tasks, without using the rewards. We show that, again surprisingly, these stop-gradient agents outperform the end-to-end-trained actor-critic baseline.
2. Related Work
Auxiliary tasks were formalized and introduced to RL by Sutton et al. (2011) through the Horde learning architecture. Horde is an off-policy learning framework for learning knowledge from an agent's experience, where the knowledge takes the form of predictions represented as a set of GVFs. Our work is related to Horde in that TD networks express a rich subspace of GVFs, but differs from Horde in that our interest is in the effect of learning these auxiliary predictions on the main task via shared state representations (as opposed to the knowledge captured in these GVFs). Our work is also related to predictive state representations (PSRs) (Littman et al., 2001; Singh et al., 2004). PSRs use predictions as state representations, whereas our work learns latent state representations from predictions. Recently, with the use of deep neural networks as powerful function approximators in RL, various auxiliary tasks have been proposed to improve the latent state representations of Deep RL agents. We review these auxiliary tasks below. Our work belongs to this family of work in that the auxiliary prediction tasks are used to improve the state representations of Deep RL agents. All of the approaches in the following paragraph have predefined GVF targets for the auxiliary task predictions.
UNREAL (Jaderberg et al., 2017) uses reward prediction (and also an auxiliary pixel control task) and achieved a significant performance improvement in DeepMind Lab but only a marginal improvement in Atari. Kartal et al. (2019) found predicting episode termination a useful auxiliary task in episodic RL settings, but only demonstrated mild improvement in four Atari games. SimCore DRAW (Gregor et al., 2019) learns a generative model of future observations conditioned on action sequences and uses it as an auxiliary task to shape the agent's belief states in partially observable environments. Fedus et al. (2019) found that simply predicting the return with multiple different discount factors (MHVP) serves as an effective auxiliary task. MHVP relies on the availability of rewards and thus differs from our work and other unsupervised auxiliary tasks.

The following approaches use information-theoretic methods to learn representations that are informative about the future trajectory of those representations as the agent interacts with the environment. CPC (van den Oord et al., 2018), CPC|action (Guo et al., 2018), ST-DIM (Anand et al., 2019), and DRIML (Mazoure et al., 2020) apply different forms of temporal contrastive losses to learn predictions in a latent space. PBL (Guo et al., 2020) focuses on partially observable environments and introduces a separate target encoder to set the prediction targets; the target encoder is trained to distill the learned state representations. SPR (Schwarzer et al., 2020) replaces the target encoder in PBL with a moving average of the state representation function. In addition to being predictive, PI-SAC (Lee et al., 2020) also enforces the state representations to be compressed. SPR and PI-SAC focus on data efficiency and only conducted experiments under low data budgets. In the works cited in this paragraph the targets are not GVFs, and despite some empirical success, none of these works could learn long-term predictions effectively. This is in contrast to GVF-like prediction tasks, which can be learned effectively via TD, as in our work and the work discussed in the previous paragraph.

Bellemare et al. (2019) and Dabney et al. (2020) studied optimal representations in RL from a geometric perspective and provided theoretical insights into why predicting GVF-like targets is helpful in learning state representations. Our work is consistent with this theoretical motivation.

In recent work, Veeriah et al. (2019) used metagradients to discover very simple kinds of GVFs (essentially discounted sums of features of observations) instead of hand-crafting them. In this work, we explore a much richer space of GVFs and show that random choices of features and random but rich GVFs are competitive with the state of the art in hand-crafted GVFs as auxiliary tasks.

Figure 1. An example of a question network. The squares represent feature nodes and the circles represent prediction nodes. The question network is described in § 3.1.
3. Method
In this section we first describe the TD networks formalism for defining GVFs, then describe RADAR, our procedure for constructing random GVFs, and finish with a description of the agent architecture used in our empirical work.
A TD network is a graph in which the nodes represent predictions. In the original work, there are two sets of edges in a TD network. One set defines the targets of the predictions, and the graph formed by those edges is called the question network. The other set determines the computational process for computing the predictions and is called the answer network. In this work we are only interested in the question network part of a TD network because it defines the prediction semantics, i.e., the targets. We do not use the answer network; instead we parameterize the computation of the predictions using a deep neural network and update the predictions via backpropagation.

Figure 1 shows an example of a question network. The two squares represent two feature nodes and the six circles represent six prediction nodes. Node 1 (the labels are shown in the circles) predicts the expected value of feature $f^1$ at the next step. Implicitly, this prediction is conditioned on following the current policy. Node 2 predicts the expected value of node 1 at the next step. Note that we can "unroll" the target of node 2 to ground it on the features: in this example, node 2 predicts the expected value of feature $f^1$ after two steps when following the current policy. Node 3 has a self-loop and predicts the expectation of the discounted sum of feature $f^1$ with a discount factor $\gamma$. We will call this a conventional TD prediction, except that it is a general value function because the feature $f^1$ is not the reward but some other function of the observation. Node 4 is labeled by action $a$: it predicts the expected value of node 3 at the next step given that action $a$ is taken at the current step. We say node 4 is conditioned on action $a$. Similarly, node 5 predicts the same target but is conditioned on action $b$. Node 6 has two outgoing links: it predicts the sum (in general a weighted sum, but in this paper we do not explore the role of these weights and instead fix them to 1) of feature $f^2$ and the value of another prediction node, both at the next step. In this case it is hard to describe the semantics of node 6's prediction in terms of the features alone, but we can see that the prediction is still grounded on features $f^1$ and $f^2$.

Generalising from the example above, a question network with $n_p$ prediction nodes and $n_f$ feature nodes defines $n_p$ predictions of $n_f$ features. We use $N_p$ to denote the set of all prediction nodes and $N_f$ to denote the set of all feature nodes. Let $W$ be the adjacency matrix of the question network, where $W_{ij}$ denotes the weight on the edge from node $i$ to node $j$; we define $W_{ij} \triangleq 0$ if there is no edge from node $i$ to node $j$. Now consider an agent interacting with the environment. At each step $t$, it receives an observation $O_t$ and takes an action $A_t$ according to its policy $\pi$; at the next step it receives an observation $O_{t+1}$. A feature $f^k(O_t, A_t, O_{t+1})$ is a scalar function of the transition. The agent makes a prediction $y^i(O_0, A_0, \ldots, O_t)$ for each prediction node $i$ based on its history; this is computed by a neural network in our work. For brevity, we use $f^k_{t+1}$ and $y^i_t$ to denote $f^k(O_t, A_t, O_{t+1})$ and $y^i(O_0, A_0, \ldots, O_t)$ respectively. The target for prediction $i$ at step $t$ is denoted by $z^i_t$.
If prediction node $i$ is not conditioned on any action, its target is
$$z_t^i = \mathbb{E}_\pi\Big[\sum_{j \in N_p} W_{ij}\, z_{t+1}^j + \sum_{k \in N_f} W_{ik}\, f_{t+1}^k\Big];$$
otherwise, if it is conditioned on action $a^i$, its target is
$$z_t^i = \mathbb{E}_\pi\Big[\sum_{j \in N_p} W_{ij}\, z_{t+1}^j + \sum_{k \in N_f} W_{ik}\, f_{t+1}^k \,\Big|\, A_t = a^i\Big].$$
By the construction of the targets, the agent can learn the prediction $y_t^i$ via TD. If $i$ is not conditioned on any action, then $y_t^i$ is updated toward the bootstrapped target
$$\sum_{j \in N_p} W_{ij}\, y_{t+1}^j + \sum_{k \in N_f} W_{ik}\, f_{t+1}^k;$$
otherwise, if $i$ is conditioned on action $a^i$, then $y_t^i$ is updated toward the same bootstrapped target when $A_t = a^i$ and is left unchanged otherwise. In an episodic setting, if the episode terminates at step $T$, we define $z_T^i \triangleq 0$ and $y_T^i \triangleq 0$ for all prediction nodes $i$.
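To make these targets concrete, here is a minimal numpy sketch (the helper names are ours, not from the paper) of the bootstrapped targets for one transition, assuming the next-step predictions have already been computed:

```python
import numpy as np

def td_targets(W_pred, W_feat, y_next, f_next, cond_action, a_t):
    """Bootstrapped TD targets for all prediction nodes on one transition.

    W_pred[i, j]   -- weight on the edge from prediction node i to prediction node j
    W_feat[i, k]   -- weight on the edge from prediction node i to feature node k
    y_next[j]      -- predictions y^j_{t+1} (the bootstrap values)
    f_next[k]      -- observed feature values f^k_{t+1}
    cond_action[i] -- action that node i is conditioned on, or -1 if unconditional
    a_t            -- action actually taken at step t
    """
    targets = W_pred @ y_next + W_feat @ f_next
    # Nodes conditioned on an action other than A_t keep their old prediction.
    update_mask = (cond_action == -1) | (cond_action == a_t)
    return targets, update_mask
```

Each prediction $y^i_t$ is then regressed toward its target wherever the mask holds, e.g., with the mean-squared loss used for $L_{ans}$ below.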
TD networks represent a broader class of predictions than conventional value functions, and many existing predictions used as auxiliary tasks can be expressed by a question network. Reward prediction (Jaderberg et al., 2017) can be represented by a question network with a single feature node representing the reward and a single prediction node predicting the reward. Multi-horizon value prediction (Fedus et al., 2019) can be represented by a similar question network but with multiple self-loop prediction nodes with different discount factors. Termination prediction (Kartal et al., 2019) can be represented by a question network with a constant feature node and a self-loop prediction node; with the episode-termination convention above, such a discounted sum tracks proximity to the end of the episode.

In this work, instead of hand-crafting a new question network instance as in previous work on the use of predictions as auxiliary tasks, we verify a conjecture that a large number of random deep, action-conditional predictions is enough to drive the learning of good state representations, without needing to carefully hand-design the semantics of those predictions. To test this conjecture, we designed a generator of random question networks from which we can take samples and evaluate their performance as auxiliary tasks. Specifically, we designed RAndom Deep Action-conditional pRedictions (RADAR), a heuristic algorithm that generates question networks with random structures.

Random Features.

RADAR uses random features, each computed by a scalar function $g^k$ with random parameters. For any transition $(O_t, A_t, O_{t+1})$, the feature for this transition is computed as $f^k_{t+1} = |g^k(O_{t+1}) - g^k(O_t)|$. Instead of directly using the output of $g^k$ as the feature, we use the amount of change in $g^k$. A similar transformation was used in the pixel control auxiliary task (Jaderberg et al., 2017).

Random Structure.
The question network structure in RADAR is defined by five input arguments: the number of features $n_f$, the discrete action set $\mathcal{A}$, a discount factor $\gamma$, a depth $D$, and a repeat $R$. Its output is a question network that contains $n_f$ feature nodes as defined above and $D + 1$ layers of prediction nodes; each layer contains $R \times |\mathcal{A}|$ prediction nodes except the first layer, which contains $n_f$ prediction nodes. RADAR constructs the question network incrementally, layer by layer, from layer 0 to layer $D$. Layer 0 has the $n_f$ feature nodes and $n_f$ prediction nodes; each prediction node has an edge with weight 1 to a distinct feature node and a self-loop with weight $\gamma$. Each prediction node in layer 0 thus predicts the discounted sum of its corresponding feature node and is not conditioned on actions. The procedure for generating each layer from 1 through $D$ is the same. For a layer $l$ $(1 \le l \le D)$, RADAR creates $R \times |\mathcal{A}|$ prediction nodes. Each prediction node is conditioned on one action, and there are exactly $R$ nodes conditioned on each action. Each prediction node has two edges: one to a random node in layer $l - 1$ and one to a random feature node in layer 0. Note that prediction nodes in layer 1 do not necessarily connect to a prediction node in layer 0; they may connect only to feature nodes. A constraint for preventing duplicated predictions is included in RADAR, so that any two prediction nodes in layer $l$ that are conditioned on the same action cannot connect to the same node in layer $l - 1$. Detailed pseudocode for RADAR is provided in the Appendix.

Figure 2. The agent architecture. The dashed cross denotes an optional stop-gradient operation.

Figure 2 shows the agent architecture. We base our agent on the actor-critic architecture; it consists of three modules. The state representation module, parameterized by $\theta_{repr}$, maps the history of observations and actions $(O_0, A_0, \ldots, O_t)$ to a state vector $S_t$. The RL module, parameterized by $\theta_{RL}$, maps the state vector $S_t$ to a policy distribution over the available actions $\pi(\cdot|S_t)$ and a value function $v(S_t)$. The answer network module, parameterized by $\theta_{ans}$, maps the state vector $S_t$ to a set of predictions $y(S_t)$ equal in size to the number of prediction nodes in the question network.

We trained the network in two separate ways. In the auxiliary task setting, the RL loss $L_{RL}$ is backpropagated to update the parameters of the state representation ($\theta_{repr}$) and RL ($\theta_{RL}$) modules, while the answer network loss $L_{ans}$ is backpropagated to update the parameters of the answer network ($\theta_{ans}$) and state representation ($\theta_{repr}$) modules. Note that the answer network loss only affects the RL module indirectly, through the shared state representation module. In the stop-gradient setting, we stopped the gradients of the RL loss from flowing from $L_{RL}$ to $\theta_{repr}$. This allows us to do a harsher and more direct evaluation of how well the auxiliary tasks can train the state representation without any help from the main task. For $L_{RL}$, we used the standard actor-critic objective with an entropy regularizer (Mnih et al., 2016). For $L_{ans}$, we used the mean-squared loss for all the targets and predictions.
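A minimal sketch of the two training regimes (PyTorch; the module and helper names are ours): detach() implements the optional stop-gradient, so that in the stop-gradient setting only the answer-network loss shapes $\theta_{repr}$.

```python
import torch

def compute_losses(encoder, rl_head, answer_net, history, targets,
                   rl_loss_fn, stop_gradient):
    s = encoder(history)                          # state vector S_t (theta_repr)
    s_rl = s.detach() if stop_gradient else s     # optional stop-gradient on the RL path
    policy_logits, value = rl_head(s_rl)          # pi(.|S_t) and v(S_t) (theta_RL)
    y = answer_net(s)                             # one prediction per question-network node
    L_RL = rl_loss_fn(policy_logits, value)       # actor-critic loss with entropy regularizer
    L_ans = ((y - targets.detach()) ** 2).mean()  # mean-squared TD loss on the answers
    return L_RL, L_ans
```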
4. Illustrating the Benefits of Deep Action-conditional Questions
We present our experiments and analyses in two sections. The main aim of the experiments in this section is to explore the benefits of exploiting the full richness of the space defined by TD networks, by systematically varying the depth, features, and action-conditionality of the question networks. We illustrate the benefits first in a simple grid world that affords the use of intuitive features and visualizations, and then on a set of six Atari games. In addition, to test the robustness of RADAR to its hyperparameters, we conducted a random search experiment on the Atari game Breakout.
Although our primary interest (and the focus of subsequent experiments) is learning good policies, in this domain we study policy evaluation, because this simpler objective is sufficient to illustrate our points and we can compute and visualize the true value function for comparison. Figure 3a shows the environment, a rectangular grid surrounded by walls. The observation is a three-channel top-down view including the walls; the channels represent the walls, the agent, and the goal. There are four available actions that move the agent horizontally or vertically to an adjacent cell. The agent gets a reward of 1 upon arriving at the goal cell located in the top row, and 0 otherwise. This is a continuing environment, so achieving the goal does not terminate the interaction. The objective here is to learn the discounted state-value function of a random policy, which selects each action with equal probability.

Specifying a question network requires specifying both the structure and the features. Later we explore random features, but here we use a single hand-crafted touch feature so that every prediction has a clear semantics: touch is 1 if the agent's move is blocked by the wall and 0 otherwise. Using the touch feature we constructed two types of question networks. The first type is the expected discounted sum of touch; we refer to this as a conventional TD prediction target (Figure 3b). The second type is a full action-conditional tree of depth D. There is only one feature node in the tree, corresponding to the touch feature. Each internal node has child nodes conditioned on distinct actions, and each prediction node also has a skip edge directly to the feature node (except for the child nodes of the feature node). Figure 3c shows an example of a depth-2 tree (the caption describes the semantics of some of the predictions). We also compared to a randomly initialized state representation module as a baseline, where the state representation was fixed and only the value-function weights were learned during training.

Figure 3. The empty room grid world environment and the question networks we studied in our illustrative experiment. (a) The blue circle denotes the agent and the yellow star denotes the rewarding state. (b) A conventional TD prediction. (c) A depth-2 tree question network with two actions. The bottom-right prediction node predicts the sum of the values of the feature over the next two steps if action b were taken for both the current and the next step; other prediction nodes have similar semantics. (d) A random question network sampled from RADAR over two actions, with a small depth and repeat.
Figure 4. MSE between the learned value function and the true value function in the tree question networks experiment in the empty room grid world environment. The agents trained with action-conditional tree question networks (depths 1 through 4) achieved substantially lower MSE than the random baseline and the agent trained with the conventional TD prediction target. The performance improved monotonically with increasing depth up to depth 4, where it matched that of end-to-end training (black curve). Each curve is averaged over independent runs with different random seeds. The shaded area shows the standard error.

Neural Network Architecture.
The empty room environment is fully observable, so the state representation module is a feed-forward neural network that maps the current observation $O_t$ to a state vector $S_t$; it is parameterized by a 3-layer multi-layer perceptron (MLP). The RL module has one hidden layer and one output head representing the state value (there is no policy head, as the policy is given). The answer network module also has one hidden layer and one output layer. We applied a stop-gradient between the state representation module and the RL module (Figure 2). More implementation details are provided in the Appendix.

Results.
We measured performance by the mean-squared error (MSE) between the learned value function and the true value function across all states. The true value function was obtained by solving a system of linear equations (Sutton & Barto, 2018). Figure 4 shows the MSE during training. Both the random baseline and the conventional TD prediction target performed poorly. But even a tree question network of depth 1 (i.e., four prediction targets corresponding to the four action-conditional predictions of touch after one step) performed much better than these two baselines. Performance increased monotonically with increasing depth, and at depth 4 the MSE matched that of end-to-end training.

Figure 5 shows the different value functions learned by agents with the different prediction tasks. Figure 5a visualizes the true value function. Figure 5b shows the learned value function when the state representations are learned from conventional TD predictions of touch. Its symmetric pattern reflects the symmetry of the grid world and the random policy, but this is inconsistent with the asymmetric true value function. Figure 5c shows the learned value function when the state representations are learned from depth-1-tree predictions. It clearly distinguishes corner states, groups of states on the boundary, and the states in the center area, because this grouping reflects the different prediction targets for these states. For the answer network module to make accurate predictions of the targets of the question network, the state representation module must map states with the same prediction target to similar representations and states with different targets to different representations. As the question network tree becomes deeper, the agent learns finer state distinctions, until an essentially zero MSE is achieved (Figure 5e).

Figure 5. Visualization of the learned value functions in the empty room environment. Bright indicates high value and dark indicates low value. (a) The true values. (b) The conventional TD predictions of the touch feature. (c)-(f) The predictions are defined by a full-tree-structured question network over the touch feature; the depth of the tree increases from 1 to 4 from (c) to (f).
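The true value function used in this evaluation can be computed exactly; a minimal numpy sketch, assuming the transition matrix and expected rewards of the random policy are available:

```python
import numpy as np

def true_state_values(P, r, gamma):
    """Solve the Bellman equations v = r + gamma * P v (Sutton & Barto, 2018).

    P : (n, n) state-transition matrix under the uniformly random policy
    r : (n,)   expected one-step reward from each state
    """
    n = P.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P, r)
```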
Figure 6. MSE between the learned value function and the true value function in the random question networks experiment in the empty room environment. Different loss curves show the performance of agents trained with different RADAR variants; increasing the number of random features improves the performance. Each curve is averaged over independent runs with different random seeds. The shaded area shows the standard error.

The previous experiment demonstrated the benefits of temporally deeper prediction tasks. But achieving this by creating deeper and deeper full-branching action-conditional trees is not tractable, as the number of prediction targets grows exponentially. The previous experiment also used a single feature formulated with domain knowledge; such feature selection is also not scalable. RADAR provides a method to mitigate both concerns by growing random question networks with random features. Specifically, we used RADAR with depth 4 and repeat equal to the number of features. Figure 6 shows the MSE of different RADAR variants. RADAR with touch (that is, random but not necessarily full-branching trees of depth 4) performed as well as touch with a full tree of depth 4. RADAR with a single random feature performed suboptimally; a random feature is likely less discriminative than touch. However, as the number of random features increases, the performance improves, and with enough random features RADAR matches the final performance of touch with a full depth-4 tree.

The results on the grid world provide preliminary evidence that RADAR-generated deep random question networks with many random features can yield good state representations. We next test this conjecture on a set of Atari games, exploring again the benefits of depth and also exploring the benefit of action conditionality.

Here we use six Atari games (these six are often used for hyperparameter selection on the Atari benchmark (Mnih et al., 2016)) to compare four different question networks: (a) conventional TD predictions of the discounted sums of distinct random features (Figure 3b); (b) shallow action-conditional predictions, a set of shallow trees of the form in Figure 3c, each for a distinct random feature; (c) RADAR without action-conditioning; and (d) RADAR (Figure 3d). Network (a) provides a useful TD-prediction baseline that exploits none of the richness of TD networks; the contrast between (b) and (c) provides further evidence for the benefits of depth; and the contrast between (c) and (d) provides evidence of the benefits of conditioning predictions on actions.
Random Features for Atari.
The random functions $g^k$ for computing the random features are designed as follows. The 84 × 84 observation $O_t$ is divided into disjoint square patches, and a shared random linear function is applied to each patch to obtain the random features $g^1_t, g^2_t, \ldots, g^{n_f}_t$. Finally, we process these features as described in § 3.2.
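A sketch of this feature construction; the 4 × 4 grid of 21 × 21 patches below is an assumption consistent with 16 base features, since the paper's exact patch size is not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)
W_rand = rng.normal(size=(21 * 21,))  # one shared random linear function

def patch_features(obs):
    """obs: (84, 84) observation -> 16 random scalars, one per disjoint patch."""
    patches = obs.reshape(4, 21, 4, 21).transpose(0, 2, 1, 3).reshape(16, -1)
    return patches @ W_rand

def radar_feature(obs_t, obs_tp1):
    """The feature RADAR attaches to a transition: |g^k(O_{t+1}) - g^k(O_t)|."""
    return np.abs(patch_features(obs_tp1) - patch_features(obs_t))
```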
BeamRider
Breakout
Pong
Qbert
Seaquest
SpaceInvadersshallow action-conditional conventional TD RADAR w/o action-conditional RADAR
Figure 7.
Learning curves of different question networks in six Atari games. x-axis denotes the number of frames and y-axis denotes theepisode returns. Each curve is averaged over independent runs with different random seeds. Shaded area shows the std. error. γ )450500550600650700750 0 4 8 12 16depth450500550600650700750 0 8 16 24 32repeat450500550600650700750 0 8 16 24 32 Figure 8.
Figure 8. Scatter plots of scores in Breakout obtained by RADAR with different hyperparameters (discount γ, depth, repeat, and number of features). The x-axis denotes the value of the hyperparameter; the y-axis denotes the final game score at the end of training. The dotted horizontal lines denote the performance of the end-to-end A2C baseline. The solid vertical lines denote the values we used in our final experiments.
Neural Network Architecture.
We used A2C (Mnih et al., 2016) with a standard neural network architecture for Atari (Mnih et al., 2015) as our base agent. Specifically, the state representation module consists of 3 convolutional layers. The RL module has one hidden dense layer and two output heads for the policy and the value function respectively. The answer network has one hidden dense layer followed by the output layer. We stopped the gradient from the RL module to the state representation module (Figure 2).
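For reference, a sketch of this encoder in PyTorch; the convolution sizes are the standard stack of Mnih et al. (2015), while the 512-unit hidden widths and the prediction-head size are placeholders:

```python
import torch.nn as nn

n_preds = 200  # placeholder: one output per question-network prediction node

encoder = nn.Sequential(                        # theta_repr
    nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
    nn.Flatten(),                               # 84x84x4 input -> 64*7*7 = 3136
)
rl_hidden = nn.Sequential(nn.Linear(3136, 512), nn.ReLU())   # theta_RL, plus policy/value heads
answer_net = nn.Sequential(nn.Linear(3136, 512), nn.ReLU(),  # theta_ans
                           nn.Linear(512, n_preds))
```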
Hyperparameters.

The discount factor, depth, and repeat for RADAR were held fixed across games (the solid vertical lines in Figure 8 mark the values used). With $n_f = 16$ random features, there are $16 + D \times R \times |\mathcal{A}|$ total predictions. RADAR without action-conditioning has the same question network as RADAR except that no prediction is conditioned on actions. To match the total number of predictions, we used $16 + D \times R \times |\mathcal{A}|$ random features for conventional TD and a matching number of features for the shallow action-conditional predictions (each shallow tree contributes $|\mathcal{A}|$ predictions). The additional random features were generated by applying more random linear functions to the image patches. The conventional TD predictions used the same discount factor as RADAR. More implementation details are provided in the Appendix.

Results.
Figure 7 shows the learning curves in the six Atari games. Shallow action-conditional predictions performed the worst in all the games, providing further evidence for the value of making deep predictions. RADAR consistently outperformed its non-action-conditional counterpart, providing clear evidence that action-conditioning is beneficial. And finally, RADAR performed better than conventional TD predictions in most of the six games despite using many fewer features, suggesting that structured deep action-conditional predictions can be more effective than simply making conventional TD predictions about many features.

We also tested the robustness and stability of RADAR with respect to its hyperparameters, namely the discount, depth, repeat, and number of features. We explored different values for each hyperparameter independently while holding the other hyperparameters fixed to the values used in the previous experiment. For each hyperparameter, we took samples uniformly from a wide interval and evaluated RADAR on Breakout using the sampled value. The results are presented in Figure 8. Each hyperparameter has a range of values that achieves high performance, indicating that RADAR is stable and robust to different hyperparameter choices.
5. Comparison to Baseline Auxiliary Tasks on Atari and DeepMind Lab
In this section, we present empirical results comparing the performance of RADAR against the A2C baseline (Mnih et al., 2016) and two other auxiliary tasks: multi-horizon value prediction (MHVP) (Fedus et al., 2019) and pixel control (PC) (Jaderberg et al., 2017). We conducted the evaluation in Atari games and DeepMind Lab environments.

Figure 9. (a) Median human-normalized score across Atari games. (b) Mean record-normalized score across Atari games. (c) Mean capped human-normalized score across DeepMind Lab environments. Each panel compares A2C, A2C + RADAR, A2C + MHVP, and A2C + PC, trained both end-to-end and with a stop-gradient. In all panels, the x-axis denotes the number of frames. Each dark curve is averaged over independent runs with different random seeds. The shaded area shows the standard error.
Atari Implementation.
For Atari games, we used the same architecture for RADAR as in the prior section. For MHVP, we used value predictions following Fedus et al. (2019); each prediction has a unique discount factor, with the discounts chosen so that the effective horizons are spread uniformly. The architecture for MHVP is the same as for RADAR. For PC, we followed the architecture design and hyperparameters in Jaderberg et al. (2017). When not stopping the gradient from the RL loss, we mixed the RL updates and the answer network updates by scaling the learning rate for the answer network with a coefficient $c$. We searched over several values of $c$ on the six games in the previous section; $c = 1$ worked the best for all methods. More details are provided in the Appendix.
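One way to realize such a horizon-uniform set of discounts is sketched below; the geometric spacing and the endpoints are illustrative assumptions, not necessarily the exact values used:

```python
import numpy as np

def mhvp_discounts(n_preds, h_min=1.0, h_max=512.0):
    """Discount factors whose effective horizons 1/(1 - gamma) are spread
    evenly (here geometrically) between h_min and h_max."""
    horizons = np.geomspace(h_min, h_max, n_preds)
    return 1.0 - 1.0 / horizons  # h = 1 -> gamma = 0; h = 512 -> gamma ~ 0.998
```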
DeepMind Lab Implementation.

For DeepMind Lab, we used the same RL module and answer network module as for Atari but used a different state representation module to address the partial observability. Specifically, the convolutional layers in the state representation module were followed by a dense layer and a GRU core (Cho et al., 2014; Chung et al., 2014).
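A sketch of this recurrent state representation module (PyTorch); the 256-unit widths are assumptions, since the exact sizes are not reproduced here:

```python
import torch
import torch.nn as nn

class RecurrentEncoder(nn.Module):
    """Conv stack -> dense layer -> GRU core; S_t doubles as the GRU state."""
    def __init__(self, conv, conv_out=3136, hidden=256):
        super().__init__()
        self.conv = conv                    # e.g., the Atari conv stack above
        self.fc = nn.Linear(conv_out, hidden)
        self.gru = nn.GRUCell(hidden, hidden)

    def forward(self, obs, h_prev):
        x = torch.relu(self.fc(self.conv(obs)))
        return self.gru(x, h_prev)          # new state vector S_t
```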
Results.
Figure 9a and Figure 9b show the results for both the stop-gradient and end-to-end architectures on Atari, using two standard normalized score measures: the median human-normalized score (Mnih et al., 2015) and the mean record-normalized score (Hafner et al., 2020). When training representations end-to-end through a combined main-task and auxiliary-task loss, the performance of RADAR matches or substantially exceeds the two baselines. Surprisingly, the stop-gradient RADAR agents outperform the end-to-end A2C baseline, unlike the stop-gradient versions of the baseline auxiliary-task agents. Figure 9c shows the results for both stop-gradient and end-to-end architectures on DeepMind Lab environments (using mean capped human-normalized scores). Again, RADAR substantially outperforms both auxiliary-task baselines, and the stop-gradient version matches the final performance of the end-to-end A2C. Taken together, the results from these tasks provide substantial evidence that RADAR-generated random question networks drive the learning of good state representations, outperforming auxiliary tasks with fixed hand-crafted semantics.
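For reference, the score normalizations used in Figure 9 can be sketched as follows; capping at 100% for the DeepMind Lab measure is the standard convention we assume here:

```python
import numpy as np

def human_normalized(score, random_score, human_score):
    """Human-normalized score in percent (Mnih et al., 2015)."""
    return 100.0 * (score - random_score) / (human_score - random_score)

def mean_capped(normalized_scores):
    """Mean capped human-normalized score: each environment's score is
    clipped at 100% before averaging."""
    return float(np.mean(np.minimum(normalized_scores, 100.0)))
```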
6. Conclusion and Future Work
In this work we provided evidence that learning how to make random, deep action-conditional predictions can drive the learning of good state representations. We used the language of TD networks to define a rich space of predictions that can be learned efficiently with TD methods, and created a heuristic algorithm, RADAR, that generates prediction targets in this space in the form of random question networks with random features. Our empirical study on the Atari and DeepMind Lab benchmarks shows that learning state representations solely via auxiliary prediction tasks defined by RADAR outperforms the end-to-end trained A2C baseline. RADAR also outperformed pixel control and multi-horizon value prediction on these two benchmarks when used as part of a combined loss function with the main RL task.

In this work, the question network was sampled before learning and held fixed during learning. An interesting goal for future research is to find methods that can adapt the question network and discover useful questions during learning. The question networks we studied are also limited to discrete actions, and it is unclear how to condition a prediction on a continuous action. Another future direction to explore is therefore to extend TD networks to continuous action spaces.
Acknowledgement
This work was supported by DARPA's L2M program as well as a grant from the Open Philanthropy Project to the Center for Human Compatible AI. Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the views of the sponsors.
References
Anand, A., Racah, E., Ozair, S., Bengio, Y., Côté, M., and Hjelm, R. D. Unsupervised state representation learning in Atari. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019), pp. 8766-8779, 2019.

Bellemare, M. G., Dabney, W., Dadashi, R., Taïga, A. A., Castro, P. S., Roux, N. L., Schuurmans, D., Lattimore, T., and Lyle, C. A geometric perspective on optimal representations for reinforcement learning. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019), pp. 4360-4371, 2019.

Cho, K., van Merrienboer, B., Bahdanau, D., and Bengio, Y. On the properties of neural machine translation: Encoder-decoder approaches. In Proceedings of SSST@EMNLP 2014, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pp. 103-111. Association for Computational Linguistics, 2014. doi: 10.3115/v1/W14-4012.

Chung, J., Gülçehre, Ç., Cho, K., and Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555, 2014.

Dabney, W., Barreto, A., Rowland, M., Dadashi, R., Quan, J., Bellemare, M. G., and Silver, D. The value-improvement path: Towards better representations for reinforcement learning. CoRR, abs/2006.02243, 2020.

Fedus, W., Gelada, C., Bengio, Y., Bellemare, M. G., and Larochelle, H. Hyperbolic discounting and learning over multiple horizons. CoRR, abs/1902.06865, 2019.

Gregor, K., Rezende, D. J., Besse, F., Wu, Y., Merzic, H., and van den Oord, A. Shaping belief states with generative environment models for RL. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019), pp. 13475-13487, 2019.

Guo, Z. D., Azar, M. G., Piot, B., Pires, B. A., Pohlen, T., and Munos, R. Neural predictive belief representations. CoRR, abs/1811.06407, 2018.

Guo, Z. D., Pires, B. Á., Piot, B., Grill, J., Altché, F., Munos, R., and Azar, M. G. Bootstrap latent-predictive representations for multitask reinforcement learning. In Proceedings of the 37th International Conference on Machine Learning (ICML 2020), volume 119 of Proceedings of Machine Learning Research, pp. 3875-3886. PMLR, 2020.

Hafner, D., Lillicrap, T. P., Norouzi, M., and Ba, J. Mastering Atari with discrete world models. CoRR, abs/2010.02193, 2020.

Jaderberg, M., Mnih, V., Czarnecki, W. M., Schaul, T., Leibo, J. Z., Silver, D., and Kavukcuoglu, K. Reinforcement learning with unsupervised auxiliary tasks. In 5th International Conference on Learning Representations (ICLR 2017). OpenReview.net, 2017.

Kartal, B., Hernandez-Leal, P., and Taylor, M. E. Terminal prediction as an auxiliary task for deep reinforcement learning. In Proceedings of the Fifteenth AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment (AIIDE 2019), pp. 38-44. AAAI Press, 2019.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations (ICLR 2015), 2015.

Lee, K., Fischer, I., Liu, A., Guo, Y., Lee, H., Canny, J., and Guadarrama, S. Predictive information accelerates learning in RL. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020), 2020.

Littman, M. L., Sutton, R. S., and Singh, S. Predictive representations of state. In Advances in Neural Information Processing Systems 14 (NIPS 2001), pp. 1555-1561. MIT Press, 2001.

Mazoure, B., des Combes, R. T., Doan, T., Bachman, P., and Hjelm, R. D. Deep reinforcement and infomax learning. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020), 2020.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M. A., Fidjeland, A., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015. doi: 10.1038/nature14236.

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T. P., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), volume 48 of JMLR Workshop and Conference Proceedings, pp. 1928-1937. JMLR.org, 2016.

Schwarzer, M., Anand, A., Goel, R., Hjelm, R. D., Courville, A., and Bachman, P. Data-efficient reinforcement learning with self-predictive representations. arXiv preprint arXiv:2007.05929, 2020.

Singh, S., James, M. R., and Rudary, M. R. Predictive state representations: A new theory for modeling dynamical systems. In Proceedings of the 20th Conference in Uncertainty in Artificial Intelligence (UAI 2004), pp. 512-518. AUAI Press, 2004.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 2018.

Sutton, R. S. and Tanner, B. Temporal-difference networks. In Advances in Neural Information Processing Systems 17 (NIPS 2004), pp. 1377-1384, 2004.

Sutton, R. S., Modayil, J., Delp, M., Degris, T., Pilarski, P. M., White, A., and Precup, D. Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In Proceedings of the 10th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2011), pp. 761-768. IFAAMAS, 2011.

van den Oord, A., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. CoRR, abs/1807.03748, 2018.

Veeriah, V., Hessel, M., Xu, Z., Rajendran, J., Lewis, R. L., Oh, J., van Hasselt, H., Silver, D., and Singh, S. Discovery of useful questions as auxiliary tasks. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019), pp. 9306-9317, 2019.

Wang, Z., Schaul, T., Hessel, M., van Hasselt, H., Lanctot, M., and de Freitas, N. Dueling network architectures for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), volume 48 of JMLR Workshop and Conference Proceedings, pp. 1995-2003. JMLR.org, 2016.
A. Pseudocode for RADAR
Algorithm 1 RADAR: A Random Question Network Generator
Input: number of features n_f, discount factor γ, action set A, depth D, repeat R
Output: a question network G

G ← an empty graph; roots ← an empty set; leaves ← an empty set
for i = 1 to n_f do
    create a new feature node f in G
    roots ← roots ∪ {f}; leaves ← leaves ∪ {f}
    create a new prediction node v in G
    leaves ← leaves ∪ {v}
    add edge ⟨v, f, 1⟩ to G
    add edge ⟨v, v, γ⟩ to G
end for
for d = 1 to D do
    expanded ← an empty set
    for a ∈ A do
        parents ← randomly select R distinct nodes from leaves
        for p ∈ parents do
            create a new prediction node v in G
            mark v as conditioned on action a
            expanded ← expanded ∪ {v}
            add edge ⟨v, p, 1⟩ to G
            f ← randomly select a node from roots
            add edge ⟨v, f, 1⟩ to G
        end for
    end for
    leaves ← expanded
end for
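A runnable Python rendering of Algorithm 1 (a sketch; the graph representation is our own choice):

```python
import random

def radar(n_feat, gamma, actions, depth, repeat, seed=0):
    """Generate a random question network as (node_action, edges).

    node_action maps a node id to the action it is conditioned on (None if
    unconditional); edges is a list of (src, dst, weight) triples. Feature
    node ids start with 'f'; prediction node ids start with 'p'.
    """
    rng = random.Random(seed)
    node_action, edges = {}, []
    roots, leaves = [], []
    for i in range(n_feat):
        f, v = f"f{i}", f"p0_{i}"
        node_action[f] = node_action[v] = None
        roots.append(f)
        leaves += [f, v]
        edges += [(v, f, 1.0), (v, v, gamma)]  # discounted sum of feature i
    for d in range(1, depth + 1):
        expanded = []
        for a in actions:
            # Distinct parents per action enforce the no-duplicate constraint.
            for j, p in enumerate(rng.sample(leaves, repeat)):
                v = f"p{d}_{a}_{j}"
                node_action[v] = a
                expanded.append(v)
                edges += [(v, p, 1.0), (v, rng.choice(roots), 1.0)]
        leaves = expanded
    return node_action, edges
```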
B.1. Experiments on the Empty Room EnvironmentNeural Network Architecture.
The empty room environment is fully observable and so the state representation module isa feed-forward neural network that maps the current observation O t to a state vector S t . It is parameterized by a -layermulti-layer perceptron (MLP) with units in the first two layers and units in the third layer. The RL module has onehidden layer with units and one output head representing the state value. (There is no policy head as the policy wasgiven). The answer network module also has one hidden layer with units and one output layer. We applied a stop-gradientbetween the state representation module and the RL module. Hyperparameters.
Both the value function and the answer network were updated via TD. We used multiple parallel actors to generate data and updated the parameters periodically. We used the Adam optimizer (Kingma & Ba, 2015), with a separately tuned learning rate for the end-to-end agent. The value function updates and the answer network updates used two separate optimizers with identical hyperparameters.
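A sketch of the two-optimizer setup (PyTorch; the module names reuse the sketch from § 3, and the learning rate is a placeholder):

```python
import torch

# Two optimizers with identical hyperparameters: one drives the value-function
# (RL) loss, the other the answer-network loss.
value_opt = torch.optim.Adam(
    list(encoder.parameters()) + list(rl_head.parameters()), lr=1e-3)
answer_opt = torch.optim.Adam(
    list(encoder.parameters()) + list(answer_net.parameters()), lr=1e-3)

# Each loss is stepped through its own optimizer:
#   value_opt.zero_grad();  L_RL.backward();  value_opt.step()
#   answer_opt.zero_grad(); L_ans.backward(); answer_opt.step()
```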
B.2. Atari Experiments

Neural Network Architecture.

We used A2C (Mnih et al., 2016) with a standard neural network architecture for Atari (Mnih et al., 2015) as our base agent. Specifically, the state representation module consists of 3 convolutional layers: the first layer has 32 8 × 8 convolutional kernels with a stride of 4, the second layer has 64 4 × 4 kernels with stride 2, and the third layer has 64 3 × 3 kernels with stride 1. The RL module has one dense layer and two output heads for the policy and the value function respectively. The answer network has one hidden dense layer followed by the output layer. We stopped the gradient from the RL module to the state representation module.

Hyperparameters.
Following convention (Mnih et al., 2015), we used a stack of the latest 4 frames as the input to the agent, i.e., the input to the state representation module at step $t$ is $(O_{t-3}, O_{t-2}, O_{t-1}, O_t)$. We used multiple parallel actors to generate data and updated the agent's parameters every few steps. The entropy regularization weight and the discount factor for the A2C loss were fixed across games, and we used the RMSProp optimizer. The RL updates and the answer network updates used two separate optimizers with identical hyperparameters. The gradient from the A2C loss was clipped by global norm. When not stopping the gradient from the RL loss, we mixed the RL updates and the answer network updates by scaling the learning rate for the answer network with a coefficient $c$. We searched over several values of $c$ on the six games in the previous section; $c = 1$ worked the best for both RADAR and the baseline methods.

Baseline Methods.
For MHVP, we used value predictions following Fedus et al. (2019). Each prediction has a unique discount factor, with the discounts chosen so that the effective horizons are spread uniformly. The architecture for MHVP is the same as for RADAR. For PC, we followed the architecture design and hyperparameters of Jaderberg et al. (2017): the observation image is center-cropped to 80 × 80 and divided into 4 × 4 patches, yielding a 20 × 20 grid of pixel-control cells. The output of the state representation module is first mapped to an intermediate vector by a dense layer and then reshaped to a 3-D tensor; a deconvolutional layer then maps this tensor to $|\mathcal{A}| \times 20 \times 20$, representing the action-values for each patch. We used the dueling architecture (Wang et al., 2016).

B.3. DeepMind Lab Experiments

Neural Network Architecture.
For DeepMind Lab, we used the same RL module and answer network module as for Atari but used a different state representation module to address the partial observability. Specifically, the convolutional layers in the state representation module were followed by a dense layer and a GRU core (Cho et al., 2014; Chung et al., 2014).
Hyperparameters.
We used multiple parallel actors to generate data and updated the agent's parameters every few steps. The entropy regularization weight and the discount factor for the A2C loss were fixed across environments, and we used the RMSProp optimizer. The RL updates and the answer network updates used two separate optimizers with identical hyperparameters. The gradient from the A2C loss was clipped by global norm. For end-to-end (not stop-gradient) agents, we searched the mixing coefficient $c$ over several values on explore goal locations small, explore object locations small, and lasertag three opponents small; $c = 1$ worked the best for both RADAR and the baseline methods.

C. Additional Empirical Results
Figure 10. Learning curves in DeepMind Lab environments. The x-axis denotes the number of frames. Each dark curve is averaged over independent runs with different random seeds. The shaded area shows the standard error.
Figure 11.
Learning curves in Atari games. The x-axis denotes the number of frames. Each dark curve is averaged over5