TD or not TD: Analyzing the Role of Temporal Differencing in Deep Reinforcement Learning
Artemij Amiranashvili, Alexey Dosovitskiy, Vladlen Koltun, Thomas Brox
University of Freiburg, Intel Labs, Intel Labs, University of Freiburg

Published as a conference paper at ICLR 2018

ABSTRACT
Our understanding of reinforcement learning (RL) has been shaped by theoretical and empirical results that were obtained decades ago using tabular representations and linear function approximators. These results suggest that RL methods that use temporal differencing (TD) are superior to direct Monte Carlo estimation (MC). How do these results hold up in deep RL, which deals with perceptually complex environments and deep nonlinear models? In this paper, we re-examine the role of TD in modern deep RL, using specially designed environments that control for specific factors that affect performance, such as reward sparsity, reward delay, and the perceptual complexity of the task. When comparing TD with infinite-horizon MC, we are able to reproduce classic results in modern settings. Yet we also find that finite-horizon MC is not inferior to TD, even when rewards are sparse or delayed. This makes MC a viable alternative to TD in deep RL.
1 INTRODUCTION
The use of deep networks as function approximators has significantly expanded the range of problems that can be successfully tackled with reinforcement learning. However, there is little understanding of when and why certain deep RL algorithms work well. Theoretical results are mainly based on tabular environments or linear function approximators (Sutton & Barto, 2017). Their assumptions do not cover the typical application domains of deep RL, which feature extremely high input dimensionality (typically in the tens of thousands) and the use of nonlinear function approximators. Thus, our understanding of deep RL is based primarily on empirical results, and these empirical results guide the design of deep RL algorithms.

One design decision shared by the vast majority of existing value-based deep RL methods is the use of temporal difference (TD) learning – training predictive models by bootstrapping based on their own predictions. This design decision is primarily based on evidence from the pre-deep-RL era (Sutton, 1988; 1995). The results of those experimental studies are well known and clearly demonstrate that simple supervised learning, also known as Monte Carlo prediction (MC), is outperformed by pure TD learning, which, in turn, is outperformed by TD(λ) – a method that can be seen as a mixture of TD and MC (Sutton, 1988).

However, recent research has shown that an algorithm based on Monte Carlo prediction can outperform TD-based methods on complex sensorimotor control tasks in three-dimensional, partially observable environments (Dosovitskiy & Koltun, 2017). These results suggest that the classic understanding of the relative performance of TD and MC may not hold in modern settings. This evidence is not conclusive: the algorithm proposed by Dosovitskiy & Koltun (2017) involves custom components such as parametrized goals and decomposed rewards, and therefore cannot be directly compared to TD-based baselines.

In this paper, we perform a controlled experimental study aiming at better understanding the role of temporal differencing in modern deep reinforcement learning, which is characterized by essentially infinite-dimensional state spaces, extremely high observation dimensionality, partial observability, and deep nonlinear models used as function approximators. We focus on environments with visual inputs and discrete action sets, and algorithms that involve prediction of value or action-value functions. This is in contrast to value-free policy optimization algorithms (Schulman et al., 2015; Levine & Koltun, 2013) and tasks with continuous action spaces and low-dimensional vectorial state representations that have been extensively benchmarked by Duan et al. (2016) and Henderson et al. (2017). We base our study on deep Q-learning (Mnih et al., 2015), where the Q-function is learned either via temporal differencing or via a finite-horizon Monte Carlo method. To ensure that our conclusions are not limited to pure value-based methods, we additionally evaluate asynchronous advantage actor-critic (A3C), which combines temporal differencing with a policy gradient method (Mnih et al., 2016).

Our main focus is on performing controlled experiments, in terms of both algorithm configurations and environment properties. This is in contrast to prior work, which typically benchmarked a number of existing algorithms on a set of standard environments.
While proper benchmarking is crucial for tracking progress in the field, it is not always sufficient for understanding the reasons behind good or poor performance. In this work, we ensure that the algorithms are comparable by implementing them in a common software framework. By varying parameters such as the balance between TD and MC in the learning update or the prediction horizon, we are able to clearly isolate the effect of these parameters on learning. Moreover, we designed a series of controlled scenarios that focus on specific characteristics of RL problems: reward sparsity, reward delay, perceptual complexity, and properties of terminal states. Results in these environments shed light on the strengths and weaknesses of the considered algorithms.

Our findings in modern deep RL settings both support and contradict past results on the merits of TD. On the one hand, value-based infinite-horizon methods perform best with a mixture of TD and MC; this is consistent with the TD(λ) results of Sutton (1988). On the other hand, in sharp contrast to prior beliefs, we observe that Monte Carlo algorithms can perform very well on challenging RL tasks. This is made possible by simply limiting the prediction to a finite horizon. Surprisingly, finite-horizon Monte Carlo training is successful in dealing with sparse and delayed rewards, which are generally assumed to impair this class of methods. Monte Carlo training is also more robust to noisy rewards and is particularly robust to perceptual complexity and variability.

2 PRELIMINARIES
We work in a standard reinforcement learning setting of an agent acting in an environment over discrete time steps. At each time step t, the agent receives an observation o_t and selects an action a_t. We assume partial observability: the observation o_t need not carry complete information about the environment and can be seen as a function of the environment's "true state". We assume an episodic setup, where an episode starts at time step 0 and concludes at a terminal time step T. We denote by s_t the tuple of all observations collected by the agent from the beginning of the episode: s_t = ⟨o_0, ..., o_t⟩. (In practice we will only include a set of recent observations in s_t.) The objective is to find a policy π(a_t | s_t) that maximizes the expected return – the sum of all future rewards through the remainder of the episode:

R_t = \sum_{i=t}^{T} r_i .   (1)

This sum can become arbitrarily large for long episodes. To avoid divergence, temporally distant rewards can be discounted. This is typically done in one of two ways: by introducing a discount factor γ or by truncating the sum after a fixed number of steps (horizon) τ:

R^γ_t = \sum_{i=t}^{T} γ^{i−t} r_i = r_t + γ r_{t+1} + γ^2 r_{t+2} + ... ;   R^τ_t = \sum_{i=t}^{t+τ} r_i .   (2)

The parameters γ and τ regulate the contribution of temporally distant rewards to the agent's objective. In what follows, R̂_t stands for R^γ_t or R^τ_t.

For a given policy π, the value function and the action-value function are defined as expected returns that are conditioned, respectively, on the observation or the observation-action pair:

V^π(s_t) = E_π[ R̂_t | s_t ] ,   Q^π(s_t, a_t) = E_π[ R̂_t | s_t, a_t ] .   (3)

Optimal value and action-value functions are defined as the maxima over all possible policies:

V^⋆(s_t) = max_π V^π(s_t) ,   Q^⋆(s_t, a_t) = max_π Q^π(s_t, a_t) .   (4)

In value-based, model-free reinforcement learning, the value or action value is estimated by a function approximator V with parameters θ. The function approximator is typically trained by minimizing a loss between the current estimate and a target value:

L(θ) = ( V(s_t; θ) − V_target )^2 .   (5)

The learning procedure for the action-value function is analogous. Hence, we focus on the value function in the remainder of this section.

Reinforcement learning methods differ in how the target value is obtained. The most straightforward approach is to use the empirical return as target: i.e., V_target = R^γ_t or V_target = R^τ_t. This is referred to as Monte Carlo (MC) training, since the empirical loss becomes a Monte Carlo estimate of the expected loss. Using the empirical return as target requires propagating the environment forward before a training step can take place – by τ steps for the finite-horizon return R^τ_t (Dosovitskiy & Koltun, 2017; Veness et al., 2015) or until the end of the episode for the discounted return R^γ_t. This increases the variance of the target value for long horizons and large discount factors.
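To make the two notions of return in Eq. (2) concrete, the following minimal sketch (plain Python/NumPy, not the paper's code; the function names are ours) computes the discounted return R^γ_t and the truncated return R^τ_t from a recorded reward sequence.

```python
import numpy as np

def discounted_return(rewards, t, gamma):
    """R^gamma_t = sum_{i=t}^{T} gamma^(i-t) * r_i over the remainder of the episode."""
    r = np.asarray(rewards[t:], dtype=np.float64)
    return float(np.sum(gamma ** np.arange(len(r)) * r))

def finite_horizon_return(rewards, t, tau):
    """R^tau_t = sum_{i=t}^{t+tau} r_i; rewards beyond the horizon are ignored."""
    return float(np.sum(rewards[t:t + tau + 1]))

# Example: a single reward arriving 3 steps in the future.
rewards = [0.0, 0.0, 0.0, 1.0, 0.0]
print(discounted_return(rewards, t=0, gamma=0.99))   # 0.99**3 ~= 0.970
print(finite_horizon_return(rewards, t=0, tau=2))    # 0.0 -- the reward lies beyond the horizon
print(finite_horizon_return(rewards, t=0, tau=3))    # 1.0
```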
An alternative to Monte Carlo training is temporal difference (TD) learning (Sutton, 1988). The idea is to estimate the return by bootstrapping from the function approximator itself, after acting for a fixed number of steps n:

V_target = \sum_{i=t}^{t+n−1} γ^{i−t} r_i + γ^n V(s_{t+n}; θ) .   (6)

TD learning is typically used with infinite-horizon returns. When the rollout length n approaches infinity (or, in practice, the maximal episode duration T_max), TD becomes identical to Monte Carlo training. TD learning applied to the action-value function is known as Q-learning (Watkins, 1989; Watkins & Dayan, 1992; Peng & Williams, 1996; Mnih et al., 2015).

An alternative to value-based methods are policy-based methods, which directly parametrize the policy π(a | s; θ). An approximate gradient of the expected return is computed with respect to the policy parameters, and the return is maximized using gradient ascent. Williams (1992) has shown that an unbiased estimate of the gradient can be computed as ∇_θ log π(a | s; θ)(R_t − b_t(s_t)), where the function b_t(s_t) is called a baseline and can be chosen so as to decrease the variance of the estimator. A common choice for the baseline is the value function: b_t(s_t) = V^π(s_t). A combination of policy gradient with a baseline value function learned via TD is referred to as an actor-critic method, with the policy π being the actor and the value function estimator being the critic.
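As a summary of the bootstrapped target in Eq. (6), here is a small sketch (plain Python/NumPy; illustrative only) of an n-step TD target. Setting n = 1 gives the standard one-step Bellman backup, and letting n grow toward the episode length recovers the Monte Carlo target.

```python
import numpy as np

def n_step_td_target(rewards, bootstrap_value, gamma):
    """V_target = sum_{i=0}^{n-1} gamma^i * r_i + gamma^n * V(s_{t+n}; theta).

    rewards:         the n rewards observed from state s_t onward
    bootstrap_value: the current estimate V(s_{t+n}; theta), i.e. the "guess" the target bootstraps from
    """
    n = len(rewards)
    return float(np.sum(gamma ** np.arange(n) * np.asarray(rewards)) + gamma ** n * bootstrap_value)

print(n_step_td_target([0.0, 1.0, 0.0], bootstrap_value=2.5, gamma=0.99))
```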
3 EXPERIMENTAL SETUP

3.1 ALGORITHMS
In our analysis of temporal differencing we focus on three key characteristics of RL algorithms. The first is the balance between TD and MC in the learning update. The second is the prediction horizon, in particular infinite versus finite horizon. The third is the use of pure value-based learning versus an actor-critic approach, which includes an explicitly parametrized policy.

To study the first aspect, we use asynchronous n-step Q-learning (n-step Q) (Mnih et al., 2016). In this algorithm, an action-value function is learned with n-step TD (Eq. (6)), and actions are selected greedily according to this function. By varying the rollout length n, we can smoothly interpolate between pure TD and pure MC updates. In order to analyze the second aspect, we implemented a finite-horizon Monte Carlo version of n-step Q, which we call Q_MC. This algorithm can be seen as a simplified version of Direct Future Prediction (Dosovitskiy & Koltun, 2017). Finally, we select asynchronous advantage actor-critic (A3C) (Mnih et al., 2016) to study the third aspect. In A3C, the value function estimate is learned with n-step TD, and a policy is trained with policy gradient. This allows us to evaluate the interplay of TD learning and policy gradient learning.

To ensure that the comparison is fully controlled and fair, we implemented all algorithms in the asynchronous training framework proposed by Mnih et al. (2016). Multiple actor threads run in parallel and send weight updates asynchronously to a parameter server. For A3C and n-step Q, we use the algorithms as described by Mnih et al. (2016). Q_MC is the n-step Q algorithm where the n-step TD targets are replaced by finite-horizon MC targets. Further details on the Q_MC and n-step Q algorithms and the network architecture are provided in the supplement.

Note that switching to a finite horizon necessitates a small additional change in the Q_MC algorithm. In practice, in n-step Q each parameter update is not just an n-step TD update, but a sum of all updates for rollouts from 1 to n. This improves the stability of training. In Q_MC such accumulation of updates is impossible, since predictions for different horizons are not compatible. We therefore always predict several Q-values corresponding to different horizons, similar to Dosovitskiy & Koltun (2017). Specifically, for horizon τ = 2^K, we additionally predict Q-values for the horizons 2^k with k ≤ K.
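For illustration, here is a minimal sketch of the multi-horizon finite-horizon MC targets used by Q_MC (plain Python/NumPy; the power-of-two horizon set is our reading of the description above and should be treated as an assumption, not the released implementation).

```python
import numpy as np

HORIZONS = [1, 2, 4, 8, 16, 32]  # assumed 2^k horizons with the largest tau = 32

def finite_horizon_targets(rewards, t, horizons=HORIZONS):
    """Monte Carlo targets for state s_t: one cumulative reward sum per horizon k (cf. Eq. (2)).

    Unlike an n-step TD target there is no bootstrapped value term, so the environment
    must be propagated forward by the largest horizon before s_t can be trained on.
    """
    r = np.asarray(rewards, dtype=np.float64)
    return {k: float(r[t:t + k + 1].sum()) for k in horizons}

print(finite_horizon_targets([0, 0, 1, 0, 0, 0, 0, 0, 1] + [0] * 30, t=0))
```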
3.2 ENVIRONMENTS

To calibrate our implementations against results available in the literature, we begin by conducting experiments on several standard benchmark environments: five Atari games from the Arcade Learning Environment (Bellemare et al., 2013) and three environments based on first-person-view 3D simulation in the ViZDoom framework (Kempka et al., 2016). We used a set of Atari games commonly analyzed in the literature: Space Invaders, Pong, Beam Rider, Sea Quest, and Frostbite (Mnih et al., 2015; Schulman et al., 2015; Lake et al., 2017). For the ViZDoom environments, we used the Navigation, Battle, and Battle2 scenarios from Dosovitskiy & Koltun (2017).

Our main experiments are on sequences of specialized environments. Each sequence is designed such that a single factor of variation is modified in a controlled fashion. This allows us to study the effect of this factor. Factors of variation include: reward sparsity, reward delay, reward type, and perceptual complexity.

For the controlled environments, we used the ViZDoom platform. This platform is compatible with existing map editors with built-in scripting, which allows for flexible and controlled specification of different scenarios. In comparison to Atari games, ViZDoom offers a more realistic setting with a three-dimensional environment and partially observed first-person navigation. We now briefly describe the tasks. Further details are provided in the supplement.

Basic health gathering. The basis for our controlled scenarios is the health gathering task. In this scenario, the agent's aim is to collect health kits while navigating through a maze using visual input. Figure 1(b) shows a typical image observed by the agent. The agent's health level is constantly declining, and health kits add to the health level. The goal is to collect as many health kits as possible. To be precise, the agent loses a fixed amount of health every eight steps and gains a fixed amount of health when collecting a health kit; the total health is capped. The reward is +1 when the agent collects a health kit and 0 otherwise. A fixed number of health kits is present in the labyrinth at any given time; when the agent collects one of them, a new one appears at a random location. An episode is terminated after a fixed number of steps, equivalent to one minute of in-game time.

Figure 1: Different levels of perceptual complexity in the health gathering task. (a) Map view of a grid world. (b) First-person view of a three-dimensional environment, fixed textures. (c) First-person view of a three-dimensional environment, random textures.

Terminal states. To test the effect of terminal states on the performance of the algorithms, we modified the health gathering scenario so that each episode terminates after m health kits are collected. For m = 1, all useful training signal comes from the terminal state. With larger m, the importance of terminal states diminishes.

Delayed rewards. In this sequence of scenarios we introduce a delay between the act of collecting a health kit and its effect – an increase in health and a reward of +1. We set up environments with five different delay values (a generic wrapper implementing such a delay is sketched below, after the scenario descriptions).

Sparse rewards. To examine the effect of reward sparsity, we varied the number of available health kits on the map. We created two variations of the basic health gathering environment with increasingly sparse rewards. In the 'Sparse' setting, the labyrinth contains four times fewer health kits than in the basic setting. In the 'Very Sparse' setting, it contains eight times fewer health kits than in the basic setting. In order to isolate the effect of sparsity, we keep the achievable reward fixed by adjusting the amount of health the agent loses per time period in the Sparse and Very Sparse configurations. In the Very Sparse scenario under random exploration, the agent gathers a health kit on average only once every few thousand steps.

Reward type. In this scenario, we compare the standard binary reward with its more natural but noisier counterpart. In the basic scenario above, the reward is +1 for gathering a health kit and 0 otherwise. A more natural measure of success in the health gathering task is the actual change in health. With this reward, the agent directly aims to maximize its health. In this configuration we therefore use a scaled change in health as the reward signal. This reward is more challenging than the basic binary reward due to its noisiness (health decreases only every eighth step) and the variance in the reward after collecting a health kit caused by the total health limit.
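The delayed-reward scenarios referenced above can be emulated for any environment with a generic wrapper that buffers rewards for a fixed number of steps before releasing them. The sketch below (Python, gym-style step/reset interface; class and method names are ours, not part of the paper's release) illustrates the idea.

```python
from collections import deque

class DelayedRewardWrapper:
    """Delays every reward by `delay` environment steps.

    Wraps any environment exposing reset() -> obs and step(a) -> (obs, reward, done, info).
    """

    def __init__(self, env, delay):
        self.env = env
        self.delay = delay
        self.buffer = deque()

    def reset(self):
        self.buffer.clear()
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.buffer.append(reward)
        if done:
            # Flush everything still in flight at episode end.
            delayed = sum(self.buffer)
            self.buffer.clear()
        elif len(self.buffer) > self.delay:
            delayed = self.buffer.popleft()
        else:
            delayed = 0.0
        return obs, delayed, done, info
```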
Perceptual complexity. To analyze the effect of perceptual complexity, we designed variants of the health gathering task with different input representations. First, to increase the perceptual complexity of the task, we replaced the single maze used in the basic health gathering scenario by randomly textured versions, some of which are shown in Figure 1(c). The labyrinth's texture is changed after each episode during both training and evaluation.

We also created two variants of the health gathering task with reduced visual complexity. These are the only controlled scenarios not using the ViZDoom framework. Both are based on a grid world, where the agent navigates a square room with five available actions: wait, up, down, left, and right. Several randomly placed health kits are in the room, and the aim of the agent is to collect these, with reward +1 for collecting a health kit and 0 otherwise. Each time a health kit is collected, a new one appears at a random location. The two variants differ in the representation that is fed to the agent. In one, the agent's input is a low-dimensional vector that concatenates the 2D Cartesian coordinates of the agent itself and of the health kits, sorted by their distance to the agent. In the other variant, we use a k-hot vector for the health kit coordinates and a one-hot vector for the agent coordinates. Each possible position on the grid is a separate entry in these vectors, and is equal to 1 if the corresponding object is present and 0 otherwise.
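To make the two gridworld input encodings concrete, here is a small sketch (plain Python/NumPy; the grid size and names are arbitrary choices of ours) that builds the coordinate-based observation and the one-hot/k-hot observation for the same state.

```python
import numpy as np

GRID = 8  # illustrative grid size, not necessarily the one used in the paper

def coord_observation(agent_xy, kit_xys):
    """Concatenate agent coordinates and health-kit coordinates sorted by distance to the agent."""
    agent = np.asarray(agent_xy, dtype=np.float32)
    kits = sorted(kit_xys, key=lambda p: np.linalg.norm(np.asarray(p) - agent))
    return np.concatenate([agent] + [np.asarray(p, dtype=np.float32) for p in kits])

def onehot_observation(agent_xy, kit_xys):
    """One-hot vector for the agent position, k-hot vector for the health-kit positions."""
    agent_vec = np.zeros(GRID * GRID, dtype=np.float32)
    kits_vec = np.zeros(GRID * GRID, dtype=np.float32)
    agent_vec[agent_xy[1] * GRID + agent_xy[0]] = 1.0
    for x, y in kit_xys:
        kits_vec[y * GRID + x] = 1.0
    return np.concatenate([agent_vec, kits_vec])

obs_a = coord_observation((2, 3), [(5, 5), (1, 3)])   # low-dimensional, already "perception-free"
obs_b = onehot_observation((2, 3), [(5, 5), (1, 3)])  # sparse positional encoding of the same state
```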
3.3 ALGORITHM DETAILS

We used identical network architectures for the three algorithms in all experiments. For experiments in the Atari and ViZDoom domains we used deep convolutional networks similar to the one used by Mnih et al. (2015). For gridworld experiments we used fully-connected networks with three hidden layers. For Q_MC and n-step Q we used dueling network architectures, similar to Wang et al. (2016). The exact architectures are specified in the supplement.

For experiments in Atari environments we followed common practice and fed a stack of the most recent frames to the networks. In all other environments the input was limited to the observation from the current time step. In ViZDoom scenarios, in addition to the observed image we fed a vector of measurements to all networks. The measurements are the agent's scalar health in the health gathering scenarios and a three-dimensional vector of the agent's health, ammo, and frags in the Battle scenario.

We trained all models with asynchronous actor threads for a fixed total number of steps. We identified optimal hyperparameters for each algorithm via a hyperparameter search on a subset of environments and used these fixed hyperparameters for all environments, unless noted otherwise.

Table 1: Calibration against published results on standard environments. We report the average score at the end of an episode for Atari games, health for the Navigation scenario, and frags for the Battle scenarios. In all cases, higher is better.

For evaluation, we trained three models on each task, selected the best-performing snapshot for each training run, and averaged the performance of these three best-performing snapshots. Further details are provided in the supplement.

The implementation of the environments and the algorithms will be made available at https://github.com/lmb-freiburg/td-or-not-td/. A video of a Q_MC agent trained on various tasks is available on the project page: https://lmb.informatik.uni-freiburg.de/projects/tdornottd/

4 RESULTS

4.1 CALIBRATION

We start by calibrating our implementations of the methods against published results reported in the literature. To this end, we train and test our implementations on standard environments used in prior work. The results are summarized in Table 1. Our implementations perform similarly to corresponding results reported in prior work.

For A3C the results are significantly different only for BeamRider. However, in Mnih et al. (2016) the evaluation used the average over the best runs out of a larger set of experiments with different learning rates, whereas we used the average over three runs with a fixed learning rate. Since the results for BeamRider have a high variance even for very small learning rate changes, this explains the difference between the results.

On the ViZDoom scenarios, the Q_MC implementation performs on par with the DFP algorithm. This shows that DFP does not crucially depend on a decomposition of the reward into a vector of measurements, and can perform equally well given a standard RL setup with a scalar reward. Our A3C implementation achieves significantly better results than those reported by Dosovitskiy & Koltun (2017) on the ViZDoom scenarios. We attribute this to (a) using a rollout value of 20 in our experiments instead of the shorter rollout used by Mnih et al. (2016) and Dosovitskiy & Koltun (2017), and (b) providing the measurements as input to the network. Dosovitskiy & Koltun (2017) did not report results on Atari games. We find that in these environments Q_MC performs worse overall than 20-step Q and 20-step A3C.

4.2 VARYING THE ROLLOUT IN TD-BASED ALGORITHMS

By changing the rollout length n in n-step Q and A3C, we can smoothly transition between TD and MC training. 1-step rollouts correspond to pure bootstrapping as used in the standard Bellman equation. Infinite rollouts (until the terminal state), on the other hand, correspond to pure Monte Carlo learning of discounted infinite-horizon returns.

Results on three environments – Basic health gathering, Sparse health gathering, and Battle – are presented in Figure 2. A rollout length of 20 is best on all tasks for n-step Q. Both very short and very long rollouts lead to decreased performance. These findings are in agreement with prior results of TD(λ) experiments (Sutton, 1988; 1995), considering that longer rollouts increase the MC portion of the value target, converging to a full MC update for an infinite rollout. A mixture of TD and MC yields the best performance. The results for A3C are qualitatively similar, and again the 20-step rollout is overall near-optimal.

Figure 2: Effect of rollout length on TD learning for n-step Q and A3C. We report average health at the end of an episode for health gathering and average frags in the Battle scenario. Higher is better.

4.3 CONTROLLED EXPERIMENTS

We now proceed to a series of controlled experiments on a set of specifically designed environments and compare TD-based methods to Q_MC, a purely Monte Carlo approach. The motivation is as follows. In the previous section we have seen that very long rollouts lead to deteriorated performance of n-step Q and A3C. This can be attributed to large variance in target values. The variance can be reduced by using a finite horizon, as is the case in Q_MC.
However, the use of a finite horizon means that rewards that are further away than the horizon will not be part of the value target, resulting in a disadvantage in tasks with sparse or delayed rewards. In order to evaluate this we run controlled experiments designed to isolate the effects of reward delay, sparsity, and other factors. We test 20-step Q and A3C (the optimal rollout for TD-based methods), 5-step Q and A3C (more TD in the update), and Q_MC (finite-horizon Monte Carlo).

Reward type. We contrast the standard binary reward with the more natural reward signal proportional to the change in the health level of the agent. Figure 3 (left) shows that in the scenario with binary reward the performance of Q_MC, 20-step Q, and 20-step A3C is nearly identical. However, when trained with the noisier health-based reward, Q_MC stays close to its result with the binary reward, while the performance of the TD-based algorithms decreases significantly, especially for the 5-step rollouts. These results suggest that Monte Carlo training is more robust to noisy rewards than TD-based methods.

Table 2: Terminal states.

Terminal states. Table 2 shows that in environments where terminal states play a crucial role, Q_MC is outperformed by TD-based methods. This is due to the finite-horizon nature of Q_MC: a terminal reward only contributes to a single update per episode, while in TD it contributes to every update in the episode. If non-terminal rewards are present (m = 2), Q_MC approaches the TD-based algorithms, but still does not reach the performance of 20-step Q. Difficulties with terminal states can partially explain the poor performance of Q_MC on some Atari games. The results for larger m values are discussed in the supplement.

Delayed rewards. Figure 3 (middle) shows that the performance of all algorithms declines even with moderate delays in the reward signal. A delay of a fraction of a second of in-game time already leads to an 8–12% relative drop in performance for Q_MC and the 20-step TD algorithms and a 30–40% drop for the 5-step TD algorithms. With a delay of approximately one second, the performance of Q_MC and the 20-step TD algorithms drops by 30–70% and the 5-step TD agents are essentially unable to survive until the end of an episode. With even longer delays, all algorithms degrade to a trivial score. Interestingly, the performance of Q_MC declines less rapidly than the performance of the other algorithms, and Q_MC consistently outperforms the other algorithms in the presence of delayed rewards.

Sparse rewards. TD-based infinite-horizon approaches should theoretically be effective at propagating distal rewards, and are therefore supposed to be advantageous in scenarios with sparse rewards. The results on the Sparse and Very Sparse scenarios, however, do not support this expectation (Figure 3 (right)): Q_MC performs on par with 20-step Q, and noticeably better than 20-step A3C and the 5-step algorithms. We believe the reason for the unexpectedly good performance of Q_MC is that Monte Carlo approaches are well suited for training perception systems, as discussed in more detail in Section 4.4.

Figure 3: Effect of reward properties. Left to right: reward type, reward delay, reward sparsity. We report the average health at the end of an episode. Higher is better. MC training (Q_MC, green) performs well on all environments.

Perceptual complexity. We test the algorithms on a series of environments with varying perceptual complexity.
The results are summarized in Figure 4. In gridworld environments, TD-based methods perform well. The Coord. Grid task, where the task is simplified by sorting the health kit coordinates by distance, is successfully solved by all methods. 5-step unrolling outperforms the 20-step versions and Q_MC in both setups.

However, the situation is completely different in the vision-based Basic and Multi-texture setups, in which the perceptual input is much more complex. In the Basic setup, the methods perform roughly on par, with 5-step unrolling dropping behind the rest. In the Multi-texture setup, Q_MC outperforms the other algorithms.

Figure 4: Effect of perceptual complexity. We report average cumulative reward per episode for grid worlds and average health at the end of the episode for ViZDoom-based setups. Perception in both gridworlds is trivial. The perceptual complexity in the multi-texture task is higher than in the basic task.

To further analyze the effect of perception on deep RL, we conduct an additional experiment where we separate the learning of perception and control. We first train two perception systems on the Battle task by predicting Q-values under a fixed policy with 20-step Q or Q_MC. We then re-initialize the weights in the top two layers, freeze the weights in the rest of the networks, and re-train the top two layers on the Battle task with 20-step Q or Q_MC. To make sure that the perception results are not an artifact of having multiple heads for multiple final horizons, we also trained one perception system using a single head (1-head Q_MC). Further details are provided in the supplement. The results are shown in Table 3. Both 20-step Q and Q_MC control reach a higher score with a perception system trained with Q_MC. This supports the hypothesis that Monte Carlo training is efficient at training deep perception systems from raw pixels.

Table 3: Separate training of perception and control on the Battle scenario. Higher is better.
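The freeze-and-retrain procedure can be expressed in a few lines in a deep learning framework. Below is a minimal PyTorch-style sketch; the trunk/head split, layer sizes, and 84×84 grayscale input are our assumptions in the spirit of the DQN-style architecture, not the released code.

```python
import torch
import torch.nn as nn

# Hypothetical network for 1x84x84 grayscale input: a convolutional perception trunk
# followed by a two-layer head (layer sizes follow the DQN-style architecture).
perception = nn.Sequential(
    nn.Conv2d(1, 32, 8, stride=4), nn.ReLU(),
    nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
    nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),
    nn.Flatten(),
)
head = nn.Sequential(nn.Linear(3136, 512), nn.ReLU(), nn.Linear(512, 6))

# 1) Pretrain perception + head by predicting Q-values under a fixed policy (not shown here).
# 2) Freeze the perception trunk ...
for p in perception.parameters():
    p.requires_grad = False
# ... 3) re-initialize the head, and 4) retrain only the head on the control task.
for layer in head:
    if isinstance(layer, nn.Linear):
        layer.reset_parameters()
optimizer = torch.optim.RMSprop(head.parameters(), lr=1e-4)
```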
4.4 TD OR NOT TD?

Temporal differencing methods are generally considered superior to Monte Carlo methods in reinforcement learning. This opinion is largely based on empirical evidence from domains such as gridworlds (Sutton, 1995), cart pole (Barto et al., 1983), and mountain car (Moore, 1990). Our results agree: in gridworlds and on Atari games we find that n-step Q-learning outperforms Q_MC. We further find, similar to the TD(λ) experiments from the past (Sutton, 1988), that a mixture of MC and TD achieves the best results in n-step Q and A3C.

However, the situation changes in perceptually complex environments. In our experiments in immersive three-dimensional simulations, a finite-horizon MC method (Q_MC) matches or outperforms TD-based methods. Especially interesting are the results of the sparse reward experiments. Sparse problems are supposed to be specifically challenging for finite-horizon Monte Carlo estimation: in our Very Sparse setting, the average time between health kits exceeds Q_MC's finite prediction horizon even when a human is controlling the agent, making it seemingly impossible for the algorithm to achieve nontrivial performance. Yet Q_MC is able to keep up with the results of the 20-step Q algorithm and clearly outperforms A3C.

What is the reason for this contrast between classic findings and our results? We believe that the key difference is in the complexity of perception in immersive three-dimensional environments, which was not present in gridworlds and other classic problems, and is only partially present in Atari games. In immersive simulation, the agent's observation is a high-dimensional image that represents a partial view of a large (mostly hidden) three-dimensional environment. The dimensionality of the state space is essentially infinite: the underlying environment is specified by continuous surfaces in three-dimensional space. Memorizing all possible states is easy and routine in gridworlds and is also possible in some Atari games (Blundell et al., 2016), but is not feasible in immersive three-dimensional simulations. Therefore, in order to successfully operate in such simulations, the agent has to learn to extract useful representations from the observations it receives. Encoding a meaningful representation from rich perceptual input is where Monte Carlo methods are at an advantage, due to the reliability of their training signals. Monte Carlo methods train on ground-truth targets, not a "guess from a guess", as TD methods do (Sutton & Barto, 2017).

These intuitions are supported by our experiments. Figure 4 shows that increasing the perceptual difficulty of the health gathering scenario hurts the performance of Q_MC less than it does the TD-based approaches. Table 3 shows that Q_MC is able to learn a better perception network than 20-step Q. In Figure 3, 20-step TD algorithms perform better than their 5-step counterparts in all tested scenarios. Longer rollouts bring TD closer to MC, in agreement with our hypothesis.

5 CONCLUSION

For the past 30 years, TD methods have dominated the field of reinforcement learning. Our experiments on a range of complex tasks in perceptually challenging environments show that in deep reinforcement learning, finite-horizon MC can be a viable alternative to TD. We find that while TD is at an advantage in tasks with simple perception, long planning horizons, or terminal rewards, MC training is more robust to noisy rewards, effective for training perception systems from raw sensory inputs, and surprisingly successful in dealing with sparse and delayed rewards. A key challenge motivated by our results is to find ways to combine the advantages of supervised MC learning with those of TD. We hope that our work will contribute to a set of best practices for deep reinforcement learning that are consistent with the empirical reality of modern application domains.

ACKNOWLEDGMENTS

This project was funded in part by the BrainLinks-BrainTools Cluster of Excellence (DFG EXC 1086) and by the Intel Network on Intelligent Systems.

REFERENCES

Andrew G. Barto, Richard S. Sutton, and Charles W. Anderson. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 13(5), 1983.

Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. JAIR, 47, 2013.

Charles Blundell, Benigno Uria, Alexander Pritzel, Yazhe Li, Avraham Ruderman, Joel Z. Leibo, Jack Rae, Daan Wierstra, and Demis Hassabis. Model-free episodic control. arXiv:1606.04460, 2016.

Alexey Dosovitskiy and Vladlen Koltun. Learning to act by predicting the future. In ICLR, 2017.

Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. In ICML, 2016.
Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep reinforcement learning that matters. arXiv:1709.06560, 2017.

Michał Kempka, Marek Wydmuch, Grzegorz Runc, Jakub Toczek, and Wojciech Jaśkowski. ViZDoom: A Doom-based AI research platform for visual reinforcement learning. In IEEE Conference on Computational Intelligence and Games, 2016.

Brenden M. Lake, Tomer D. Ullman, Joshua B. Tenenbaum, and Samuel J. Gershman. Building machines that learn and think like people. Behavioral and Brain Sciences, 40, 2017.

Sergey Levine and Vladlen Koltun. Guided policy search. In ICML, 2013.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, et al. Human-level control through deep reinforcement learning. Nature, 518(7540), 2015.

Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In ICML, 2016.

Andrew William Moore. Efficient memory-based learning for robot control. Technical Report 209, University of Cambridge, Computer Laboratory, 1990.

Jing Peng and Ronald J. Williams. Incremental multi-step Q-learning. Machine Learning, 22, 1996.

John Schulman, Sergey Levine, Pieter Abbeel, Michael I. Jordan, and Philipp Moritz. Trust region policy optimization. In ICML, 2015.

Richard S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3, 1988.

Richard S. Sutton. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In NIPS, 1995.

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2nd edition, 2017.

Joel Veness, Marc G. Bellemare, Marcus Hutter, Alvin Chua, Guillaume Desjardins, et al. Compress and control. In AAAI, 2015.

Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, and Nando de Freitas. Dueling network architectures for deep reinforcement learning. In ICML, 2016.

Christopher J. C. H. Watkins. Learning from delayed rewards. PhD thesis, University of Cambridge, England, 1989.

Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8, 1992.

Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8, 1992.

SUPPLEMENTARY MATERIAL

S1 FURTHER RESULTS

Effect of the rollout length and the prediction horizon. In Figure 2 of the main paper we have shown that the performance of n-step Q decreases for rollouts larger than 20. For the Q_MC algorithm a similar phenomenon is observed: as shown in Figure S1, the performance decreases for large horizons.

In both cases, the decrease is likely caused by the high variance of large sums of future rewards. The high variance in reward sums increases the variance of the gradients and leads to higher noise when training the value predictions. This hinders the action selection process, which relies on fine differences between values of different actions.

Figure S1: Performance of the Q_MC algorithm using different value prediction horizons.

Difference between asynchronous n-step Q and Q_MC.
As mentioned in the main paper, apart from the different targets used to learn the Q-function there is another difference between the n-step Q and Q_MC algorithms. It is caused by the use of multiple unrolling values in the n-step Q algorithm. In n-step Q, instead of only using the n-step rollout, multiple rollout lengths are used within every batch (every value from 1 to n (Mnih et al., 2016)). This improves the performance and stability of the n-step Q algorithm. It is not directly applicable to Q_MC, since different unrolling values result in different finite horizons. Instead, Q_MC has multiple Q-function heads that predict the different finite horizons (Dosovitskiy & Koltun, 2017). The difference between the trivial implementation and the multiple-unrolling modification is shown in Table S1.

There are no further differences between the two algorithms. They use the same architecture and asynchronous training. Both even perform best under the same hyperparameters, such as the learning rate.

Table S1: Difference between using multiple rollouts or a constant rollout within one training step.

Separate training of perception and control. In order to perform the perception-freezing experiments, we first train two perception systems on the Battle task with 20-step Q and Q_MC by predicting Q-values under a fixed policy (we tried using a fully trained Q_MC or 20-step Q policy), using a fraction of the usual number of training steps. Thereafter we freeze the perception and measurement parts of the network (the full architecture of the perception and measurement parts is shown in Table S5). We then reinitialize the remaining layers and retrain the networks with the frozen perception with Q_MC and 20-step Q, each using each of the two available perception systems (for 40 million steps).

Table S2: Performance of 20-step Q and Q_MC with a pretrained and frozen perception. Higher is better.

The full results are shown in Table S2. Both 20-step Q and Q_MC are able to reach a higher score with a Q_MC perception, for both of the initial policies used. The results in the main paper correspond to perception systems trained under the Q_MC policy.

Additional results on terminal states. The full results on the terminal-reward environment are shown in Table S3. As m increases, the terminal rewards become less relevant, and for m = ∞ the task converges to the Health Sparse environment. Besides the result that Q_MC performs worse than the other algorithms for m = 1, we also see that the performance of all TD-based algorithms declines with larger m values (Q_MC performance also declines after m = 3). The reason for this is that, apart from the terminal state, the task becomes harder for larger m values: for m = 1 it is easy to find a single health kit. The larger m becomes, the higher the probability that a new health kit will spawn in a hard-to-reach place. Overall, exploring the labyrinth efficiently is important for high scores on the Health Sparse (m = ∞) task, but complete labyrinth exploration is not needed to find a small number of health kits.

To show this, we evaluated the performance of a 20-step Q agent, trained on the Health Sparse environment, on the m = 1 task.
As expected, without additional training the agent was able to solve the task with the same score as the agent trained on the m = 1 task. Since the increasing difficulty is not directly related to terminal states, we excluded the results for m > 2 from the main paper.

Table S3: Results on the terminal-state environments for different values of m.

S2 ADDITIONAL ALGORITHM AND ENVIRONMENT DETAILS

Q_MC and n-step Q details. In each experiment we used the same network architecture for all algorithms. For tasks with visual input – in ViZDoom and ALE – we used a convolutional network with an architecture similar to Mnih et al. (2015). For all experiments in the ViZDoom environment, in addition to the image the networks received a vector of measurements as input: the agent's health level and current time step for Health gathering and Navigation, and the agent's health, ammo, and frags for Battle. For Q_MC and n-step Q we used the dueling architecture (Wang et al., 2016), splitting the value prediction into an action-independent expectation E(s_t, θ) and an action-dependent part for the advantage of using a specific action, A(s_t, a, θ). For l actions, the value prediction emitted by the network is computed as:

Q(s_t, a, θ) = E(s_t, θ) + Ā(s_t, a, θ) ;   Ā(s_t, a, θ) = A(s_t, a, θ) − (1/l) \sum_{a'} A(s_t, a', θ)   (7)

The architecture of the Q_MC network is shown in Table S5. Q_MC predicts the Q-value for multiple finite horizons at once: 1, 2, 4, 8, 16, and 32 steps. Predictions for all horizons are emitted at once. Therefore, for l actions, the network has 6 outputs for the expectation values and 6 × l outputs for the action advantages. We used greedy action selection according to an objective function which is a linear combination of predictions at different horizons, same as in DFP (Dosovitskiy & Koltun, 2017):

a(s_t) = argmax_{a'} [ 0.5 · Q^(8)(s_t, a') + 0.5 · Q^(16)(s_t, a') + 1.0 · Q^(32)(s_t, a') ]   (8)

The network for n-step Q was identical, except that instead of multi-horizon predictions, a single value function was predicted for each action. The A3C architecture was also identical, except that the network was not split in the last hidden layer as it was for the dueling networks; both the policy and the value output shared the same last hidden layer, as in Mnih et al. (2016). The network we use for A3C is larger than that used by Mnih et al. (2016); we found that the larger network matches or exceeds the performance of the smaller network used by Mnih et al. (2016) on our tasks. For both gridworlds we used fully connected networks. For all results reported in the paper, the three algorithms used three fully connected hidden layers of equal width.

The pseudocode for Q_MC is shown in Algorithm 1; the pseudocode for n-step Q and A3C is the same as in Mnih et al. (2016). The hyperparameters are summarized in Table S4.

Algorithm 1: Q_MC pseudocode for each asynchronous thread

    if rank = 0 then
        Initialize global shared network parameters θ
        Initialize global shared step counter N ← 0
    end if
    Initialize local network parameters θ'
    Initialize local step counter n ← 0
    Initialize local experience replay          ▷ storing only the last 32 + batch_size transitions
    while N < N_MAX do
        Update local network parameters: θ' ← θ
        for i ∈ {n, ..., n + batch_size} do
            Get state s_i
            Sample a random action a_i with probability ε, otherwise:
                a_i = argmax_a [ 0.5 · Q^(8)(s_i, a, θ') + 0.5 · Q^(16)(s_i, a, θ') + 1.0 · Q^(32)(s_i, a, θ') ]
            Get reward r_i and terminal flag τ_i by applying action a_i          ▷ τ_i ∈ {0, 1}
            Store s_i, r_i, and τ_i in the experience replay
        end for
        n += batch_size
        N += batch_size
        for m ∈ {n − (32 + batch_size), ..., n − 32} do          ▷ 32 is the largest rollout
            for k ∈ {1, 2, 4, 8, 16, 32} do
                R_k = Σ_{i=m}^{m+k} r_i
                T_k = clip( Σ_{i=m}^{m+k} τ_i , 0, 1 )
                loss^(k)_m(θ') = (1 − T_k) · Huber( Q^(k)(s_m, a_m, θ') − R_k )
            end for
        end for
        Get gradients dθ ← ∂ Σ_{m,k} loss^(k)_m(θ') / ∂θ'
        Apply the gradients dθ to the global network parameters θ using RMSProp
    end while
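A compact sketch of the per-horizon loss computation from Algorithm 1 (plain Python/NumPy with a scalar Huber loss; the horizon set and names follow our reconstruction of the pseudocode above and are illustrative only).

```python
import numpy as np

HORIZONS = [1, 2, 4, 8, 16, 32]

def huber(x, delta=1.0):
    """Scalar Huber loss."""
    return 0.5 * x ** 2 if abs(x) <= delta else delta * (abs(x) - 0.5 * delta)

def per_horizon_losses(q_pred, rewards, terminals, m, horizons=HORIZONS):
    """Losses for state index m, one per horizon k.

    q_pred:    dict {k: predicted Q^(k)(s_m, a_m)}
    rewards:   recorded rewards r_i
    terminals: recorded terminal flags tau_i in {0, 1}
    A horizon is masked out (loss weight 0) whenever a terminal state falls inside it.
    """
    losses = {}
    for k in horizons:
        R_k = float(np.sum(rewards[m:m + k + 1]))
        T_k = float(np.clip(np.sum(terminals[m:m + k + 1]), 0, 1))
        losses[k] = (1.0 - T_k) * huber(q_pred[k] - R_k)
    return losses
```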
Training and evaluation details. We found that for each of the three asynchronous algorithms a single learning rate value leads to the best result in most of the tested environments. Further, we found that, in general, ViZDoom scenarios are less sensitive to learning rate changes than different Atari games. We decreased the learning rate linearly to zero over the course of training. As the optimizer we used shared RMSProp with the same parameters as in Mnih et al. (2016).

Table S4: Summary of the Q_MC and n-step Q algorithm hyperparameters. Both algorithms share the optimizer (shared RMSProp), the number of asynchronous workers, the batch size, the grayscale input image resolution, the frame skip, the total number of environment steps N_MAX (240 million counting skipped frames), the linearly decayed learning rate, and the ε-greedy exploration schedule.

We used ε-greedy exploration for both Q_MC and n-step Q. We decreased ε linearly from 1.0 to 0.01 over the first part of training; afterwards ε remains at 0.01. In all experiments we used the same total number of training steps, meaning all actor threads together processed this number of steps. With frame skipping, the frame-skipped training steps correspond to 240 million non-frame-skipped environment steps. For Q_MC, each asynchronous agent performed a parameter update every batch of steps, each time using the most recent frames with available value targets. At regular intervals during training we evaluated the network over a fixed number of episodes for ViZDoom environments and for Atari games. For one training run, the best result out of all evaluations was considered its final score. For each experiment, three runs were performed for each algorithm, and the average of the three run scores was considered the final performance of that algorithm on the task.
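The linear schedules described above (learning rate annealed to zero over training, ε annealed from 1.0 to 0.01 and then held constant) can be implemented with a single helper. A sketch, with the step counts left as parameters since the exact values are not restated here.

```python
def linear_schedule(start, end, anneal_steps, step):
    """Linearly interpolate from start to end over anneal_steps, then stay at end."""
    if step >= anneal_steps:
        return end
    return start + (end - start) * step / anneal_steps

# Epsilon-greedy exploration: 1.0 -> 0.01, then constant at 0.01.
def epsilon_at(step, anneal_steps):
    return linear_schedule(1.0, 0.01, anneal_steps, step)

# Learning rate: initial value -> 0 over the whole training run.
def learning_rate_at(step, initial_lr, total_steps):
    return linear_schedule(initial_lr, 0.0, total_steps, step)
```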
Batch size for small rollouts. In algorithms with asynchronous n-step TD targets, the batch size is usually equal to the unrolling length n. However, decreasing the batch size in A3C could also affect the policy gradient part of the A3C loss. To make sure that we only measure the effect of different n-step TD targets and do not alter the policy gradient part, we keep the batch size at a constant value of 20 for all rollouts smaller than 20. This is realized by using multiple n-step rollouts within one batch (e.g., for a 5-step rollout the batch consists of 4 rollouts). Overall, these batches lead to improved performance of A3C. For n-step Q, using the constant batch size of 20 results in similar performance and significantly reduces the execution time. Therefore we used these batches for both A3C and n-step Q in our experiments.

Additional environment details. The Navigation scenario is identical to the "Health Gathering Supreme" scenario included in the ViZDoom environment. The aim of the agent is to navigate a maze, collect health kits, and avoid vials with poison. A map of the maze is shown in Figure S2. All other health gathering scenarios are set up in the same labyrinth, but differ in the presence and the number of objects in the maze: there are no poison vials, and the number of health kits depends on the variant of the health gathering scenario. In each health gathering scenario a constant number of health kits is present on the map at any given point in time. Once a health kit is gathered, another one is created at a random location in the maze.

To make the results on the sparse health gathering maps comparable to each other, we scaled the amount of health d that the agent loses per time period with the density of health kits on the map.

In the Battle scenario we used the same reward as in Dosovitskiy & Koltun (2017). It is a weighted sum of changes in measurements: the number of eliminated monsters f, the change in health Δh, and the change in ammunition Δa. For Basic health gathering we either used a binary reward r ∈ {0, 1} or a scaled change in health.

Table S5: Network architecture of Q_MC. The perception part (P) processes the grayscale input image with three convolutional layers (32 channels with 8×8 kernels, 64 channels with 4×4 kernels, and 64 channels with 3×3 kernels), flattens the result to 3136 features, and applies a fully connected layer with 512 units. The measurement part (M) processes the measurement vector with three fully connected layers of 128 units each. The expectation head takes the concatenated P + M features (512 + 128) through a fully connected layer with 512 units and outputs 6 values (one per horizon). The action advantage head has the same structure and outputs 6 · l values for l actions.
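For reference, Table S5 can be transcribed into a compact PyTorch module. The strides and the 84×84 grayscale input resolution below are assumptions (they follow the DQN-style architecture of Mnih et al. (2015) and reproduce the 3136-dimensional flattened feature size), so this is a sketch rather than the exact released model.

```python
import torch
import torch.nn as nn

class QMCNetwork(nn.Module):
    """Perception + measurement branches with dueling multi-horizon heads (cf. Table S5)."""

    def __init__(self, num_actions, num_measurements=1, num_horizons=6):
        super().__init__()
        self.perception = nn.Sequential(              # image branch, assumed 1x84x84 input
            nn.Conv2d(1, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),
            nn.Flatten(),                             # 64 * 7 * 7 = 3136 features
            nn.Linear(3136, 512), nn.ReLU(),
        )
        self.measurements = nn.Sequential(            # low-dimensional measurement branch
            nn.Linear(num_measurements, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
        )
        joint = 512 + 128
        self.expectation = nn.Sequential(nn.Linear(joint, 512), nn.ReLU(),
                                         nn.Linear(512, num_horizons))
        self.advantage = nn.Sequential(nn.Linear(joint, 512), nn.ReLU(),
                                       nn.Linear(512, num_horizons * num_actions))
        self.num_horizons = num_horizons
        self.num_actions = num_actions

    def forward(self, image, measurements):
        x = torch.cat([self.perception(image), self.measurements(measurements)], dim=1)
        e = self.expectation(x).unsqueeze(-1)                                # [B, H, 1]
        a = self.advantage(x).view(-1, self.num_horizons, self.num_actions)  # [B, H, A]
        return e + a - a.mean(dim=2, keepdim=True)       # dueling combination, cf. Eq. (7)
```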