Pairwise Weights for Temporal Credit Assignment
Zeyu Zheng
Risto Vuorio
Richard Lewis
Satinder Singh

* Equal contribution. † Now at the University of Oxford. University of Michigan. Correspondence to: Zeyu Zheng <[email protected]>, Risto Vuorio <[email protected]>.

Abstract
How much credit (or blame) should an action taken in a state get for a future reward? This is the fundamental temporal credit assignment problem in Reinforcement Learning (RL). One of the earliest and still most widely used heuristics is to assign this credit based on a scalar coefficient λ (treated as a hyperparameter) raised to the power of the time interval between the state-action and the reward. In this empirical paper, we explore heuristics based on more general pairwise weightings that are functions of the state in which the action was taken, the state at the time of the reward, as well as the time interval between the two. Of course it isn't clear what these pairwise weight functions should be, and because they are too complex to be treated as hyperparameters we develop a metagradient procedure for learning these weight functions during the usual RL training of a policy. Our empirical work shows that it is often possible to learn these pairwise weight functions during learning of the policy to achieve better performance than competing approaches.
1. Introduction
The following umbrella problem (Osband et al., 2019) illustrates a fundamental challenge in most reinforcement learning (RL) problems, namely the temporal credit assignment (TCA) problem. An RL agent takes an umbrella at the start of a cloudy workday morning and experiences a long day at work filled with various rewards uninfluenced by the umbrella, before needing the umbrella in the rain on the way home. The agent must learn to credit the take-umbrella action in the cloudy-morning state with the very delayed reward at the end of the day, while also learning to not credit the action with the many intervening rewards, despite their occurring much closer in time. More generally, the TCA problem is how much credit or blame should an action taken in a state get for a future reward. One of the earliest and still most widely used heuristics for TCA assigns credit based on a scalar coefficient λ raised to the power of the time interval between the state-action and the reward. This heuristic comes from the celebrated TD(λ), 0 ≤ λ ≤ 1 (Sutton, 1988) family of algorithms and has since been adopted in most modern RL algorithms.

In this empirical paper, we explore new heuristics for TCA based on more general (than TD(λ)) pairwise weightings that are functions of the state in which the action was taken, the state at the time of the reward, as well as the time interval between the two. Of course, it isn't clear what this pairwise weight function should be, and it is too complex to be treated as a hyperparameter (in contrast to the scalar λ, which is typically set by searching over a small set of values). We develop a metagradient approach to learning the pairwise weight function at the same time as learning the policy parameters of the agent. Like most metagradient algorithms, our algorithm has two loops: an outer loop that periodically updates the pairwise weight function in order to optimize the usual RL loss (policy-gradient loss in our case), and an inner loop where the policy parameters are updated using the pairwise weight function set by the outer loop.

Thus, our main contributions in this paper are a new pairwise weight function for TCA and a metagradient algorithm to learn such a function. Our empirical work is geared towards answering two questions: (1) Are the more general pairwise weight functions we propose able to outperform the best choice of λ as well as other baselines? and (2) Is our metagradient algorithm able to learn the pairwise weight functions fast enough to be worth the more complex learning problem they introduce?

Related Work on Credit Assignment.
Several heuristic methods have been proposed to address the long-term credit assignment problem in RL. Hindsight Credit Assignment (HCA) (Harutyunyan et al., 2019) proposes the notion of hindsight return, which leads to a new family of RL algorithms. Theoretically, HCA can address some problems where classic RL algorithms struggle, e.g., counterfactual credit assignment. RUDDER (Arjona-Medina et al., 2019) trains an LSTM (Hochreiter & Schmidhuber, 1997) to predict the return of an episode given the entire state and action sequence. Then it conducts contribution analysis with the LSTM to decompose the return and redistribute rewards to state-action pairs. Temporal Value Transport (TVT) (Hung et al., 2019) augments the agent with an external memory module and utilizes memory retrieval as a proxy for transporting future value back to related state-action pairs. Both RUDDER and TVT employ memory modules to identify key events in the trajectory to facilitate long-term credit assignment. We compare directly against TVT because their code was available and applicable, and we take inspiration from the core reward-redistribution idea of RUDDER and implement it within our policy gradient agent as a comparison baseline (because the available RUDDER code is not directly applicable). We do not compare against HCA because it is mainly a theoretical idea at this point and it is unclear how to implement it in a scalable way within DeepRL architectures. We also compare against two other algorithms that are more closely related to ours in their use of metagradients. Xu et al. (2018) adapt λ via metagradients rather than tuning it via hyperparameter search, thereby improving over the use of a fixed-λ algorithm. The Return Generating Model (RGM) (Wang et al., 2019) generalizes the notion of return from an exponentially discounted sum of rewards to a more flexibly weighted sum of rewards, where the weights are adapted via metagradients during policy learning. RGM takes the entire episode as input and generates one weight for each time step. In contrast, we study pairwise weights as explained below.
2. Pairwise Weights for Advantages
At the core of our contribution are new parameterizations of functions for computing advantages used in policy gradient algorithms. Next, we briefly review advantages in policy gradient RL and TD(λ) as our points of departure for the new parameterizations.

Background on Policy Gradient RL, Advantages, and TD(λ). We assume an episodic RL setting. The agent's policy π_θ, parameterized by θ, maps a state S to a probability distribution over the actions. Within each episode, at time step t, the agent observes the current state S_t, takes an action A_t ∼ π_θ(·|S_t), and receives the reward R_{t+1}. The performance measure for the policy π_θ, denoted by J(θ), is defined as the expected sum of the rewards when the agent behaves according to π_θ, i.e.,

J_π(θ) = E_{S_0 ∼ ν, A_t ∼ π_θ(·|S_t), S_{t+1} ∼ P(·|S_t, A_t)} [ Σ_{t=1}^{T} γ^{t−1} R_t ],

where ν is the probability distribution of the initial state, P is the transition dynamics, T is the length of an episode, and γ is the discount factor. For brevity, henceforth we will denote the expectation in the equation above by E_θ unless further clarification is needed. The gradient of J(θ) w.r.t. the policy parameters θ is (Sutton et al., 2000; Williams, 1992)

∇_θ J_π(θ) = E_θ [ (G_t − b(S_t)) ∇_θ log π_θ(A_t|S_t) ],   (1)

where G_t = Σ_{t'=t+1}^{T} γ^{t'−t−1} R_{t'} denotes the return and b(S_t) is an arbitrary baseline function for variance reduction. Typically this baseline is the value function V^π(s) = E_θ[G_t | S_t = s], and the resulting difference is called the advantage function

Ψ^π(s, a) = E_θ [ G_t − V^π(s) | S_t = s, A_t = a ].   (2)

For brevity, we will omit the superscript π on V and Ψ. Since the true value function V is usually unknown, an estimated value function v is used in place of V to provide an approximation, which leads to a Monte-Carlo (MC) estimation of Ψ^π:

Ψ̂^MC_t = G_t − v(S_t),   (3)

where Ψ̂^MC_t is short for Ψ̂^MC(S_t, A_t). However, Ψ̂^MC usually suffers from high variance. To reduce variance, the estimated value function is used to estimate the return as in the TD(λ) algorithm using the eligibility trace parameter λ; specifically, the new form of the return, called the λ-return, is a weighted sum of n-step truncated corrected returns where the correction uses the estimated value function after n steps. The corresponding λ-estimator is

Ψ̂^(λ)_t = Σ_{t'=t+1}^{T} (γλ)^{t'−t−1} δ_{t'},   (4)

where δ_t = R_t + γ v(S_t) − v(S_{t−1}) is the TD-error at time t (Schulman et al., 2015). Note that when λ = 1, it recovers the MC estimator:

Ψ̂^(1)_t = Σ_{t'=t+1}^{T} γ^{t'−t−1} δ_{t'} = G_t − v(S_t) = Ψ̂^MC_t.   (5)

As noted above, the value for λ is usually manually tuned as a hyperparameter. Adjusting λ provides a way to trade off bias and variance in Ψ̂^(λ) (this is absent in Ψ̂^MC). Below we present two new estimators that are analogous in this regard to Ψ̂^(λ) and Ψ̂^MC.
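To make these estimators concrete, the following short check (plain Python/NumPy; the variable names and random value estimates are our own illustration, not the paper's code) numerically verifies the identity in Eq. 5: with λ = 1, the discounted sum of TD-errors telescopes to the Monte-Carlo advantage G_t − v(S_t) for any value function.

```python
import numpy as np

# Numerical check of Eq. 5: with lambda = 1, the discounted sum of TD-errors
# from t' = t+1 to T telescopes to the Monte-Carlo advantage G_t - v(S_t),
# for any value estimates. Convention: rewards[k] = R_{k+1}, values[k] = v(S_k).
rng = np.random.default_rng(0)
T, gamma = 8, 0.95
rewards = rng.normal(size=T)
values = rng.normal(size=T + 1)
values[T] = 0.0                      # terminal value is zero

# TD-errors delta_{t'} = R_{t'} + gamma * v(S_{t'}) - v(S_{t'-1}) for t' = 1..T.
deltas = rewards + gamma * values[1:] - values[:-1]

t = 0
g_t = sum(gamma ** (tp - t - 1) * rewards[tp - 1] for tp in range(t + 1, T + 1))
td_sum = sum(gamma ** (tp - t - 1) * deltas[tp - 1] for tp in range(t + 1, T + 1))
print(np.isclose(td_sum, g_t - values[t]))   # True: Eq. 5 holds
```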
Proposed Heuristic 1: Advantages via Pairwise Weighted Sum of TD-errors. Our first new estimator, denoted PWTD for Pairwise Weighted TD-error, is a strict generalization of the λ-estimator above and is defined as follows:

Ψ̂^PWTD_{η,t} = Σ_{t'=t+1}^{T} f_η(S_t, S_{t'}, t'−t) δ_{t'},   (6)
Figure 1. A simple MDP for illustration. The initial action in state s_0 determines the reward for the last transition but does not influence the intermediate noisy rewards. The main consequence of the initial action is thus significantly delayed.

where f_η(S_t, S_{t'}, t'−t) ∈ [0, 1], parameterized by η, is the weight given to the TD-error δ_{t'} as a function of the state to which credit is being assigned, the state at which the TD-error is obtained, and the time interval between the two. Note that if we choose f to be f(S_t, S_{t'}, t'−t) = (γλ)^{t'−t−1}, it recovers the usual λ-estimator Ψ̂^(λ).
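Below is a minimal sketch (NumPy; the function names and the trajectory-indexed weight function are ours, not the paper's code) of computing Ψ̂^PWTD from one episode's TD-errors, together with the special case that recovers the λ-estimator of Eq. 4.

```python
import numpy as np

def pwtd_advantages(td_errors, weight_fn):
    """Eq. 6: advantage at step t is the sum over t' > t of f(t, t') * delta_{t'}.

    td_errors[k] holds delta_{k+1}; weight_fn(t, tp) plays the role of
    f_eta(S_t, S_{t'}, t' - t) along one fixed trajectory.
    """
    T = len(td_errors)
    adv = np.zeros(T)
    for t in range(T):
        for tp in range(t + 1, T + 1):
            adv[t] += weight_fn(t, tp) * td_errors[tp - 1]
    return adv

gamma, lam = 0.99, 0.9
deltas = np.random.randn(10)   # stand-in TD-errors for one episode

# The special case f(t, t') = (gamma * lam)^(t' - t - 1) recovers the usual
# lambda-estimator of Eq. 4; any other weight function gives a PWTD estimator.
lam_weights = lambda t, tp: (gamma * lam) ** (tp - t - 1)
print(pwtd_advantages(deltas, lam_weights))
```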
Proposed Heuristic 2: Advantages via Pairwise Weighted Sum of Rewards. Instead of generalizing from the λ-estimator, we can also generalize from the MC estimator via pairwise weighting. Specifically, the new pairwise-weighted return is defined as

G^PWR_{η,t} = Σ_{t'=t+1}^{T} f_η(S_t, S_{t'}, t'−t) R_{t'},   (7)

where f_η(S_t, S_{t'}, t'−t) ∈ [0, 1] is the weight given to the reward R_{t'}. The corresponding advantage estimator, denoted PWR for Pairwise Weighted Reward, then is:

Ψ̂^PWR_{η,t} = G^PWR_{η,t} − v^PWR(S_t),   (8)

where V^PWR(s) = E_θ[G^PWR_{η,t} | S_t = s] and v^PWR is an approximation of V^PWR. Note that if we choose f to be f(S_t, S_{t'}, t'−t) = γ^{t'−t−1}, we recover the MC estimator Ψ̂^MC.

While the usual estimators Ψ̂^(λ) and Ψ̂^MC have some nice theoretical properties stemming from the guarantees associated with Monte-Carlo returns and λ-returns, we lose those guarantees for our generalizations Ψ̂^PWR and Ψ̂^PWTD. In particular, the new estimators can be unbounded in the infinite-horizon setting. However, this is less of a concern in practice because episodes are typically of finite expected length. Moreover, recall that these new estimators will be used in the inner loop of a metagradient algorithm and can thus be viewed as a flexible parameterization of advantages where the pairwise weight function is tuned from data using an outer loop that uses standard, theoretically justified, policy gradient losses. We will discuss this further in the next section as well as validate the potential benefits in our empirical work. Next, we provide an illustrative example based on the umbrella problem (cf. §1) to show how exploiting the flexibility in the new estimators can be of benefit.
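A corresponding sketch for the reward-weighted estimator of Eqs. 7-8 (again NumPy, with names of our own choosing; the baseline values stand in for an approximation of v^PWR):

```python
import numpy as np

def pwr_advantages(rewards, values, weight_fn):
    """Eqs. 7-8: pairwise-weighted return minus its learned baseline.

    rewards[k] holds R_{k+1}; values[t] approximates V^PWR(S_t), the expected
    pairwise-weighted reward sum from S_t; weight_fn(t, tp) stands in for
    f_eta(S_t, S_{t'}, t' - t) along one fixed trajectory.
    """
    T = len(rewards)
    adv = np.zeros(T)
    for t in range(T):
        g = sum(weight_fn(t, tp) * rewards[tp - 1] for tp in range(t + 1, T + 1))
        adv[t] = g - values[t]
    return adv

gamma = 0.99
rewards = np.random.randn(10)
values = np.zeros(10)          # placeholder baseline v^PWR

# f(t, t') = gamma^(t' - t - 1) recovers the Monte-Carlo estimator of Eq. 3.
mc_weights = lambda t, tp: gamma ** (tp - t - 1)
print(pwr_advantages(rewards, values, mc_weights))
```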
An Illustrative Analysis of the Benefit of the PWR Estimator.
Consider the simple-MDP version of the Umbrella problem in Figure 1. Each episode starts at the leftmost state, s_0, and consists of T transitions. The only choice of action is at s_0 and it determines the reward on the last transition. A noisy reward ε is sampled for each intermediate transition independently from a distribution with mean E[ε] and variance Var[ε] > 0. By construction, the intermediate rewards are independent of the initial action. The expected return for state s_0 under policy π is

V(s_0) = E_θ[G_0] = (T − 1) E[ε] + E_{A∼π(·|s_0)}[R_T].

For any initial action a, the advantage is

Ψ(s_0, a) = E_ε[G_0 − V(s_0) | a] = E_ε[ Σ_{i=1}^{T} R_i | a ] − V(s_0) = E[R_T | a] − E_{A∼π(·|s_0)}[R_T].

Consider pairwise weights for computing Ψ̂^PWR(s_0, a) that place weight only on the final transition, and zero weight on the noisy intermediate rewards, capturing the notion that the intermediate rewards are not influenced by the initial action choice. More specifically, we choose f such that for any episode, w_{0T} = 1 and w_{ij} = 0 for all other i and j. Here we use w_{ij} to denote f(S_i, S_j, j − i) for brevity. The expected pairwise-weighted reward sum for the initial state s_0 is

V^PWR(s_0) = E_θ[G_{η,0}] = E_θ[ Σ_{t=1}^{T} w_{0t} R_t ] = E_{A∼π(·|s_0)}[R_T].

If v^PWR is correct, then for any initial action a, the pairwise-weighted advantage is the same as the regular advantage:

E_ε[Ψ̂^PWR_η(s_0, a)] = E_ε[G_{η,0} − v^PWR(s_0) | a] = E_ε[ Σ_{t=1}^{T} w_{0t} R_t | a ] − V^PWR(s_0) = E[R_T | a] − E_{A∼π(·|s_0)}[R_T] = Ψ(s_0, a).

As for variance, for any initial action a, G_{η,0} given a is deterministic because of the zero weight on all the intermediate rewards, and thus Ψ̂^PWR_η(s_0, a) has zero variance. The variance of Ψ̂^MC(s_0, a), on the other hand, is (T − 1) Var[ε] > 0. Thus, in this illustrative example Ψ̂^PWR yields an unbiased advantage estimator with far lower variance than Ψ̂^MC.

Our example exploited knowledge of the domain to set weights that would yield an unbiased advantage estimator with reduced variance, thereby providing some intuition on how a more flexible return might in principle yield benefits for learning. Of course, general RL problems will have the Umbrella Problem in them to varying degrees. But how can these weights be set by the agent itself, without prior knowledge of the domain? We turn to this question next.
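The zero-variance claim is easy to check by simulation. The sketch below assumes a horizon T = 20, standard-normal intermediate rewards, a final reward of 1 for the evaluated initial action, and a uniform random policy over the two initial actions; these specifics are ours, chosen only to instantiate the example.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_episodes = 20, 5000      # assumed horizon and sample size
p_good = 0.5                  # uniform random policy over the two initial actions

# Analytic baselines for the random policy: E[eps] = 0, final reward is 1 or 0.
v_mc = (T - 1) * 0.0 + p_good * 1.0   # regular value of s_0
v_pwr = p_good * 1.0                   # pairwise-weighted value (weight on R_T only)

mc_adv, pwr_adv = [], []
for _ in range(n_episodes):
    eps = rng.normal(0.0, 1.0, size=T - 1)   # noisy intermediate rewards
    final = 1.0                               # evaluate the "take umbrella" action
    mc_adv.append(eps.sum() + final - v_mc)   # Monte-Carlo advantage estimate
    pwr_adv.append(final - v_pwr)             # PWR advantage with w_{0T} = 1, rest 0

print("MC  advantage: mean %.2f, var %.2f" % (np.mean(mc_adv), np.var(mc_adv)))
print("PWR advantage: mean %.2f, var %.2f" % (np.mean(pwr_adv), np.var(pwr_adv)))
# Both means are ~0.5 (unbiased); MC variance is ~(T-1)*Var[eps], PWR variance is 0.
```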
3. A Metagradient Algorithm for Adapting Pairwise Weights
Recently, metagradient methods have been developed to learn various kinds of parameters that would otherwise be set by hand or by manual hyperparameter search; these include discount factors (Xu et al., 2018; Zahavy et al., 2020), intrinsic rewards (Zheng et al., 2018; Rajendran et al., 2019; Zheng et al., 2019), auxiliary tasks (Veeriah et al., 2019), general return functions (Wang et al., 2019), and new RL objectives (Oh et al., 2020; Xu et al., 2020). We develop a similar metagradient algorithm for learning pairwise weight functions during policy-gradient learning: an outer-loop learner for the pairwise weight function is driven by a conventional policy gradient loss, while an inner-loop learner is driven by a policy-gradient loss based on the new pairwise-weighted advantages. An overview of the algorithm is in the Supplement. In the rest of this section, we use the unified notation Ψ̂_η to denote Ψ̂^PWTD_η or Ψ̂^PWR_η unless it causes ambiguity.

Learning in the Inner Loop.
In the inner loop, the pairwise-weighted advantage Ψ̂_η is used to compute the policy gradient. We rewrite the gradient update from Eq. 1 with the new advantage as

∇_θ J^π_η(θ) = E_{τ∼π_θ} [ Σ_{t=0}^{T−1} Ψ̂_η(S_t, A_t) ∇_θ log π_θ(A_t|S_t) ],

where τ is a trajectory sampled by executing π_θ. The overall update to θ is

∇_θ J^inner(θ) = ∇_θ J^π_η(θ) + β_H ∇_θ J^H(π_θ),   (9)

where J^H(θ) is the usual entropy regularization term to encourage exploration, and β_H is a mixing coefficient.

Computing Ψ̂^PWR_η with Equation 8 requires a value function predicting the expected pairwise-weighted sums of rewards. We train this value function, v_ψ with parameters ψ, along with the policy by minimizing the mean squared error between its output v_ψ(S_t) and the pairwise-weighted sum of rewards G_{η,t}. The objective for training v_ψ is

J^v_η(ψ) = E_{τ∼π_θ} [ Σ_{t=0}^{T−1} ( G_{η,t} − v_ψ(S_t) )² ].   (10)

Note that Ψ̂^PWTD_η does not need this extra value function.

Updating η via Metagradient in the Outer Loop. To update η, the parameters of the pairwise weight functions, we need to compute the gradient of the usual policy loss w.r.t. η through the effect of η on the inner loop's parameters θ. Recall that η determines the update of θ to a new θ′ as defined above. Therefore, by the chain rule,

∇_η J^outer(η) = ∇_{θ′} J^π(θ′) ∇_η θ′,   (11)

where

∇_{θ′} J^π(θ′) = E_{τ′∼π_{θ′}} [ Σ_{t=0}^{T−1} Ψ(S_t, A_t) ∇_{θ′} log π_{θ′}(A_t|S_t) ],

and τ′ is another trajectory sampled by executing the updated policy π_{θ′} and Ψ(S_t, A_t) is the regular advantage. Note that we need two trajectories, τ and τ′, to make one update to the meta-parameters η. The policy parameters θ are updated with Equation 9 after collecting trajectory τ. The next trajectory τ′ is collected using the updated parameters θ′. The η-parameters are updated on τ′. In order to make more efficient use of the data, we follow Xu et al. (2018) and reuse the second trajectory τ′ in the next iteration as the trajectory for updating θ. In practice we use modern auto-differentiation tools to compute Equation 11 without applying the chain rule explicitly.
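The two-loop structure can be written compactly with automatic differentiation. The sketch below (JAX; the three loss functions are made-up stand-ins chosen only so the code runs, not the paper's actual losses) shows how the gradient of the outer objective with respect to η flows through one inner update of θ, as in Eq. 11.

```python
import jax
import jax.numpy as jnp

# Minimal runnable sketch of the two-loop update (Eqs. 9 and 11). The three
# functions below are stand-ins for the real quantities (pairwise-weighted
# advantages, the inner policy-gradient loss, and the outer policy-gradient
# loss with regular advantages); only the differentiation structure matters.

def pairwise_adv(eta, theta, traj):     # stands in for Eq. 6 / Eqs. 7-8
    return jax.nn.sigmoid(eta) * traj

def inner_loss(theta, traj, adv):       # stands in for Eq. 9
    return jnp.sum((theta - adv) ** 2)

def outer_loss(theta, traj):            # stands in for the regular PG loss
    return jnp.sum((theta - traj) ** 2)

inner_lr, outer_lr = 1e-2, 1e-3

def inner_update(theta, eta, traj):
    adv = pairwise_adv(eta, theta, traj)
    return theta - inner_lr * jax.grad(inner_loss)(theta, traj, adv)

def meta_objective(eta, theta, traj, traj_next):
    # Outer objective: ordinary loss of the *updated* policy on a fresh
    # trajectory tau'; its gradient w.r.t. eta implements Eq. 11.
    theta_new = inner_update(theta, eta, traj)
    return outer_loss(theta_new, traj_next)

theta, eta = jnp.zeros(4), jnp.zeros(4)
traj, traj_next = jnp.ones(4), 0.5 * jnp.ones(4)

eta_grad = jax.grad(meta_objective)(eta, theta, traj, traj_next)  # through the inner step
theta = inner_update(theta, eta, traj)                            # inner-loop update of theta
eta = eta - outer_lr * eta_grad                                   # outer-loop update of eta
print(eta)
```

In the procedure described above, the fresh trajectory τ′ is then reused as the next iteration's τ; the sketch omits that bookkeeping.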
Computing the regular advantage, Ψ(S_t, A_t), requires an estimated value function for the regular return. This value function is parameterized by φ and updated to minimize the mean squared error, analogously to Equation 10.
4. Experiments
We present three sets of experiments. The first set (§4.1) uses simple tabular MDPs that allow visualization of the pairwise weights learned by Meta-PWTD and -PWR. The results show that the metagradient adaptation both increases and decreases weights in a way that can be interpreted as reflecting explicit credit assignment and variance reduction. In the second set (§4.2) we test Meta-PWTD and -PWR with neural networks in a 2D pixel-based variant of the benchmark credit assignment task Key-to-Door (Hung et al., 2019).
We show that Meta-PWTD and -PWR outperform several existing methods for directly addressing credit assignment, as well as TD(λ) methods, and show again that the learned weights reflect domain structure in a sensible way. In the third set (§4.3), we evaluate Meta-PWTD and -PWR in two benchmark RL domains, bsuite (Osband et al., 2019) and Atari, and show that our methods do not hinder policy learning in environments not specifically designed to pose idealized long-term credit assignment challenges.

Consider the environment represented as a DAG in Figure 2 (left). In each state in the left part of the DAG (the first phase), the agent chooses one of two actions but receives no reward. In the remaining states (the second phase), the agent has only one action available and it receives a reward of +1 or −1 at each transition. Crucially, the rewards the agent obtains in the second phase are a consequence of the action choices in the first phase, because they determine which states are encountered in the second phase.

Figure 2. Inner loop-reset and weight visualization experiment: (Left) Depth 8 DAG environment with a choice of two actions at each state and rewards along transitions. (Right) Learning performance of regular return, handcrafted weights, and fixed meta-learned weights. Lower is better.

There is an interesting credit assignment problem with a nested structure; for example, the action chosen at a state in the first phase determines the reward received much later, upon the transition into a corresponding state in the second phase. We refer to this environment as the Depth 8 DAG and also report results below for two other depths with the same nested structure.

For these DAG environments we use tabular policy, value function, and meta-parameter representations. The parameters θ, ψ, φ, and η represent the policy, baseline for the weighted return, baseline for the regular return, and meta-parameters respectively. The η parameters are a |S| × |S| matrix. The entry in the i-th row and the j-th column defines the pairwise weight for computing the contribution of the reward at state j to the return at state i. A sigmoid is used to squash the weights to [0, 1] when computing the updates, and the η parameters are initialized so that the pairwise weights start near the middle of that range.
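For the tabular case just described, a minimal sketch of the meta-parameterization might look as follows (NumPy; the number of states and the initialization range are our assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_states = 46                 # assumed number of DAG states; one row/column per state
rng = np.random.default_rng(0)

# Meta-parameters: one logit per (credit-receiving state i, reward state j) pair.
# A small random initialization keeps the squashed weights near the middle of [0, 1].
eta = rng.uniform(-0.1, 0.1, size=(n_states, n_states))

def pairwise_weight(i, j):
    """Weight on the reward (or TD-error) observed at state j when computing
    the advantage for an action taken at state i."""
    return sigmoid(eta[i, j])

print(pairwise_weight(0, 45))   # ~0.5 before any meta-training
```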
Visualizing the Learned Weights via Inner-loop Reset. One view of Meta-PWTD and -PWR is that they are co-adapting the pairwise weights to the current policy at each point in training. To clearly see the most effective weights that the metagradient learned for a random policy, we repeatedly reset the policy parameters to a random initialization while continuing to train the meta-parameters until convergence. More specifically: the meta-parameters η are trained by repeatedly randomly initializing θ, ψ, and φ and running the inner loop for a fixed number of updates for each outer-loop update. Following Veeriah et al. (2019) and Zheng et al. (2019), the outer-loop objective is evaluated on all trajectories sampled with the updated policies. The gradient of the outer-loop objective on the i-th trajectory with respect to η is back-propagated through all of the preceding updates to θ. We run the outer loop for a fixed budget of updates. Hyperparameters are provided in the Supplement.

What pairwise weights would accelerate learning in this domain? Figure 3 (top) visualizes a set of handcrafted weights for Ψ̂^PWR in the Depth 8 DAG; each row in the grid represents the state in which an action advantage is estimated, and each column the state in which a future reward is experienced. For each state pair (s_i, s_j) the weight is 1 (yellow) only if the reward at s_j depends on the action choice at s_i, else it is zero (dark purple; the white pairs are unreachable).

Figure 3. Inner loop-reset weight visualization. Top: Handcrafted pairwise weights for the Depth 8 DAG; rows and columns correspond to states in Fig. 2. Middle: Meta-learned weights for the Depth 8 DAG for rewards (Meta-PWR). Bottom: Meta-learned weights for TD-errors (Meta-PWTD).

Figure 3 (middle) shows the corresponding weights learned by Meta-PWR via the inner-loop reset procedure. Importantly, the learned pairwise weights have been increased for those state pairs in which the handcrafted weights are 1 and have been decreased (some to near zero) for those state pairs in which the handcrafted weights are 0; recall that they were initialized near the middle of the [0, 1] range. As in the analysis of the simple domain in §2, these weights will result in lower-variance advantage estimates.

The same reset-training procedure was applied to Ψ̂^PWTD; Figure 3 (bottom) visualizes the resulting weights. Since the TD-errors depend on the value function, which is nonstationary during agent learning, we expect different weights to emerge at different points in training; the presented weights are but one snapshot. But a clear contrast to reward weighting can be seen: high weights are placed on transitions in the first phase of the DAG, which yield no rewards, because the TD-errors at these transitions do provide signal once the value function begins to be learned. In the Supplement, we explicitly probe the adaptation of Ψ̂^PWTD to different points in learning by modifying the value function in reset experiments, and show that the weights indeed adapt sensibly to differences in the accuracy of the value function.
Figure 4. The three Key-to-Door phases. The agent observation is a top-down view of one of the grids. The blue circle is the agent; the yellow and red areas are possible initial locations for the agent and key, respectively.

Evaluation of the Learned Pairwise Weights. After the θ-reset training of the pairwise weights completed, we used them to train a new set of θ parameters, fixing the pairwise weights during learning. Figure 2 (right) shows the number of episodes needed to reach a threshold percentage of the maximum score in each DAG, for policies trained with regular returns, handcrafted weights (H-PWR), and meta-learned weights. The meta-learned weights perform as well as and indeed better than the handcrafted weights, and both outperform the regular returns, with the gap increasing for larger DAG depth. We conjecture that the learned weights performed even better than the handcrafted weights because they adapted to the dynamics of the inner-loop policy learning procedure, which the handcrafted weights do not take into account.

We evaluated Meta-PWTD and -PWR in a 2D variant of the Key-to-Door (KtD) environment (Hung et al., 2019), an elaborate Umbrella problem that was designed to show off the TVT algorithm's ability to solve TCA. We varied properties of the domain to vary the credit assignment challenge. We compared the learning performance of our algorithms to a version of Ψ̂^PWR that uses fixed handcrafted pairwise weights and no metagradient adaptation, as well as to the following five baselines (see related work in §1): (a) best fixed-λ: Advantage Actor-Critic (A2C) (Mnih et al., 2016) with the best fixed λ found via hyperparameter search; (b) TVT (Hung et al., 2019), using the code accompanying the paper; (c) A2C-RR: a reward redistribution method inspired by RUDDER (Arjona-Medina et al., 2019); (d) Meta-λ(s) (Xu et al., 2018): meta-learning a state-dependent function λ(s) for λ-returns; and (e) RGM (Wang et al., 2019): meta-learning a single set of weights for generating returns as a linear combination of rewards.
Environment and Parametric Variation.
KtD is a fixed-horizon episodic task where each episode consists of three 2D gridworld phases (Figure 4, top). In the Key phase (of fixed duration), there is no reward and the agent must navigate to the key to collect it. The initial locations of the agent and the key are randomly sampled for each episode. In the Apple phase (also of fixed duration), the agent collects apples by walking over them; apples disappear once collected. Each apple yields a noisy reward with mean µ and variance σ². The number of apples is uniformly sampled from a fixed range and their locations are randomly sampled. In the Door phase (also of fixed duration), the agent starts at the center of a room with a door but can open the door only if it has collected the key earlier. Opening the door yields a positive reward.

The agent's observation is a tuple, (map, has_key), where map is the top-down view of the current phase rendered in RGB, and has_key is 1 if the agent has collected the key and 0 otherwise. The agent has four navigation actions: up, down, left, and right. The primary difference between our KtD environment and the original is that our KtD is a top-down-view, fully observable environment while the original is a first-person-view, partially observable environment.

Crucially, picking up the key or not has no bearing on the ability to collect apple rewards. The apples are the noisy rewards that distract the agent from learning that picking up the key early on leads to a door reward later. In our experiments, we evaluate methods on environments representing different combinations of apple reward mean and apple reward variance levels.
Neural Network Architecture. The policy (θ) and the value functions (ψ and φ) are implemented by separate convolutional neural networks. The meta-network (η) computes the pairwise weight w_{ij} as follows: first, it embeds the observations s_i and s_j and the time difference (j − i) into separate latent vectors. Then it takes the element-wise product of these three vectors to fuse them into a vector h_{ij}. Finally, it maps h_{ij} to a scalar output. A sigmoid is applied to the output to bound the weight to [0, 1]. More details are provided in the Supplement.
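As a rough sketch of this fusion architecture (NumPy with random, untrained parameters; the flat observation encoding, layer sizes, and maximum time interval are our assumptions, since the paper uses convolutional encoders on image observations):

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, hidden, max_dt = 10, 64, 128   # assumed sizes for a flat-observation sketch

# Randomly initialized (untrained) parameters of a sketch meta-network f_eta.
W_row = rng.normal(0, 0.1, (obs_dim, hidden))
W_col = rng.normal(0, 0.1, (obs_dim, hidden))
W_dt = rng.normal(0, 0.1, (max_dt, hidden))   # one embedding per time interval
w_out = rng.normal(0, 0.1, hidden)

def pairwise_weight(s_i, s_j, dt):
    h_row = np.maximum(s_i @ W_row, 0.0)       # embed the credit-receiving state S_t
    h_col = np.maximum(s_j @ W_col, 0.0)       # embed the reward state S_t'
    h = h_row * h_col * W_dt[dt]               # element-wise fusion of the three embeddings
    return 1.0 / (1.0 + np.exp(-(h @ w_out)))  # sigmoid keeps the weight in [0, 1]

print(pairwise_weight(rng.normal(size=obs_dim), rng.normal(size=obs_dim), 5))
```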
Hyperparameters. We tuned hyperparameters for each method on the mid-level ⟨apple mean, apple variance⟩ configuration ⟨µ = 5, σ² = 25⟩ and kept these parameters fixed for the remaining environments. Each method has a distinct set of parameters (e.g., outer-loop learning rates, λ values) and we used the original papers as guides for the parameter ranges searched over; details are in the Supplement.

Empirical Results.
Figure 5 presents learning curves for Meta-PWTD, Meta-PWR, and baselines in three KtD configurations (apple reward mean and variance labeled at the top; the remaining configurations are in the Supplement). Learning curves are shown separately for the total episode return and the door phase reward, the latter a measure of success at long-term credit assignment.

Figure 5. Learning curves for the Key-to-Door domain. Each column corresponds to a different mean (µ) and variance (σ²) of apple rewards. The x-axis denotes the number of frames. The y-axis denotes the episode return (top row) and the door phase reward (bottom row). The solid curves show the average over independent runs with different random seeds and the shaded area shows the standard errors.

Not unexpectedly, H-PWR, which uses handcrafted pairwise weights, performs the best. The gap in performance between H-PWR and the best fixed-λ shows that this domain provides a credit assignment challenge that the pairwise-weighted advantage estimate can help with. The TVT and A2C-RR methods used a low discount factor and so relied solely on their heuristics for learning to pick up the key, but neither appears to enable fast learning in this domain. In the door phase, Meta-PWR is generally the fastest learner after H-PWR. Meta-PWTD, though slower, achieves optimal performance. Although RGM performs third best in the door phase, it does not perform well overall, suggesting that the inflexibility of its single set of reward weights (vs. the pairwise weights of Meta-PWR) forces a trade-off between short- and long-term credit assignment. In summary, Meta-PWR outperforms all the other methods and Meta-PWTD is comparable to the baselines.

Figure 6 presents a visualization of the handcrafted weights for H-PWR (left) and weights learned by Meta-PWR (right). In each heatmap, the element in the i-th row and the j-th column denotes w_{ij}, the pairwise weight for computing the contribution of the reward upon transition to the j-th state to the return at the i-th state in the episode. In the heatmap of the handcrafted weights (left), the top-right area has non-zero weights because the rewards in the door phase depend on the actions selected in the key phase. The weights in the remaining part of the top rows are zero because those rewards do not depend on the actions in the key phase. For the same reason, the weights in the middle-right area are zero as well.
Figure 6. Visualization of pairwise weights in the KtD experiment. Left: Handcrafted weights (H-PWR). Right: The weights learned by Meta-PWR, sampled from the µ = 5, σ = 5 setting.

The weights in the rest of the area resemble exponentially discounted weights with a steep discount. This steep discounting helps fast learning of collecting apples. The weights learned by metagradients (right) largely resemble the handcrafted weights, which indicates that the metagradient procedure was able to simultaneously learn (1) that the important rewards for the key phase are in the door phase, and (2) a quick-discounting set of weights within the apple phase that allows faster learning of collecting apples.

To test the robustness of Meta-PWTD and -PWR to stochastic transitions, we ran a KtD environment where at each time step the action being executed is replaced by a random action with a fixed probability. The learning curves are provided in the Supplement. The results show that Meta-PWTD and -PWR are robust to this kind of stochasticity.

Both the DAG and KtD domains are idealized credit assignment problems. It is possible that, in domains outside this idealized class, Meta-PWTD and -PWR are slower to learn than baseline methods because they need time to learn useful weights. To evaluate this possibility we compared them to baseline methods on bsuite (Osband et al., 2019) and Atari, both standard benchmarks for RL agents. For these experiments, we did not compare to Meta-λ(s) because it performed similarly to the fixed-λ baseline in previous experiments, as noted in the original paper (Xu et al., 2018).

bsuite is a set of unit-tests for RL agents: each domain tests one or more specific RL challenges, such as exploration, memory, and credit assignment, and each contains several versions varying in difficulty. We selected all domains in bsuite that are tagged with "credit assignment", but with the exception of Discount Chain, these domains involve multiple challenges and were not designed solely as idealized credit assignment problems. We ran all methods for a fixed number of episodes in each domain (with a different budget for Cartpole). Each run was repeated with several different random seeds. The resulting regret scores are summarized in Table 1.
For Discount Chain, Meta-PWR improved regret by a large amount. More generally, Meta-PWTD or -PWR achieved the lowest total regrets in all domains except for Catch, in which A2C-RR achieved the lowest. This shows that Meta-PWTD and Meta-PWR perform better than or comparably to the baseline methods even in domains without the idealized long-term TCA structure present in Umbrella-like problems such as Key-to-Door.

Table 1. Total regrets on selected bsuite domains (lower is better).

        Catch   Catch Noise   Catch Scale   Umbr. Length   Umbr. Distract   Cartpole   Discount Chain
A2C     5975    42221         56800         38050          37524            76874      3554

To test scalability to high-dimensional environments, we conducted experiments on Atari. Atari games often have long episodes of many thousands of steps, so episode truncation is required. Meta-PWTD and -PWR can deal with truncated episodes by using the value function at the state of truncation to correct for the missing rest of the episode. Using a value function to handle truncation in PWR would be analogous to using an n-step estimator, while using the value function to handle truncation in PWTD is analogous to using a truncated λ-estimator. Since truncated PWTD is smoother than truncated PWR (early experiments confirmed this intuition) and needs nearly zero modification, we focused on the former. Note that truncated PWTD can precisely implement the truncated λ-return by setting the weight to (γλ)^{t'−t−1} for the correction value function at the state of truncation. TVT and RGM are precluded because they require complete episodes to apply updates. Therefore, we only ran Meta-PWTD, A2C-RR, and A2C. For each we conducted a hyperparameter search on a subset of games (Asterix, BeamRider, Breakout, Qbert, Seaquest, and SpaceInvaders), and ran each method on the full set of games with the fixed set of hyperparameters; see the Supplement for details. An important hyperparameter for the A2C baseline is λ, which was tuned in this search. Following convention, we ran each method for a standard budget of frames on each game.

Figure 7. Relative performance of Meta-PWTD over A2C with its tuned λ. All scores are averaged over independent runs with different random seeds. Inset: Learning curves of median human-normalized score over all Atari games. The shaded area shows the standard error over runs.

Figure 7 (inset) shows the median human-normalized score during training. Meta-PWTD performed slightly better than A2C over the entire period, and both performed better than A2C-RR. Figure 7 shows the relative performance of Meta-PWTD over A2C on each game. Meta-PWTD outperforms A2C in a majority of the games, underperforms in a smaller number, and ties in the rest. These results show that Meta-PWTD can scale to high-dimensional environments like Atari. We conjecture that Meta-PWTD provides a benefit in games with embedded Umbrella problems, but this is hard to verify directly.
5. Conclusion
We presented two new advantage estimators with pairwise weight functions as parameters to be used in policy gradient algorithms, and a metagradient algorithm for learning the pairwise weight functions. Simple analysis and empirical work confirmed that the additional flexibility in our advantage estimators can be useful in domains with delayed consequences of actions, e.g., in Umbrella-like problems. Empirical work also confirmed that the metagradient algorithm can learn the pairwise weights fast enough to be useful for policy learning, even in large-scale environments like Atari.

(The paper introducing RUDDER incorporated many ideas in addition to reward redistribution into PPO (Schulman et al., 2017) and was able to outperform PPO, but that code is not available.)
Acknowledgement
This work was supported by DARPA's L2M program as well as a grant from the Open Philanthropy Project to the Center for Human Compatible AI. Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the views of the sponsors.
References
Arjona-Medina, J. A., Gillhofer, M., Widrich, M., Unterthiner, T., Brandstetter, J., and Hochreiter, S. RUDDER: Return decomposition for delayed rewards. In Advances in Neural Information Processing Systems, pp. 13544-13555, 2019.

Dhariwal, P., Hesse, C., Klimov, O., Nichol, A., Plappert, M., Radford, A., Schulman, J., Sidor, S., Wu, Y., and Zhokhov, P. OpenAI Baselines. https://github.com/openai/baselines, 2017.

Harutyunyan, A., Dabney, W., Mesnard, T., Azar, M. G., Piot, B., Heess, N., van Hasselt, H. P., Wayne, G., Singh, S., Precup, D., et al. Hindsight credit assignment. In Advances in Neural Information Processing Systems, pp. 12467-12476, 2019.

Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.

Hung, C.-C., Lillicrap, T., Abramson, J., Wu, Y., Mirza, M., Carnevale, F., Ahuja, A., and Wayne, G. Optimizing agent behavior over long time scales by transporting value. Nature Communications, 10(1):1-12, 2019.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015.

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928-1937, 2016.

Oh, J., Hessel, M., Czarnecki, W. M., Xu, Z., van Hasselt, H. P., Singh, S., and Silver, D. Discovering reinforcement learning algorithms. Advances in Neural Information Processing Systems, 33, 2020.

Osband, I., Doron, Y., Hessel, M., Aslanides, J., Sezener, E., Saraiva, A., McKinney, K., Lattimore, T., Szepesvari, C., Singh, S., et al. Behaviour suite for reinforcement learning. arXiv preprint arXiv:1908.03568, 2019.

Perez, E., Strub, F., De Vries, H., Dumoulin, V., and Courville, A. FiLM: Visual reasoning with a general conditioning layer. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Rajendran, J., Lewis, R., Veeriah, V., Lee, H., and Singh, S. How should an agent practice? arXiv preprint arXiv:1912.07045, 2019.

Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Sutton, R. S. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9-44, 1988.

Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pp. 1057-1063, 2000.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998-6008, 2017.

Veeriah, V., Hessel, M., Xu, Z., Rajendran, J., Lewis, R. L., Oh, J., van Hasselt, H. P., Silver, D., and Singh, S. Discovery of useful questions as auxiliary tasks. In Advances in Neural Information Processing Systems, pp. 9306-9317, 2019.

Wang, Y., Ye, Q., and Liu, T.-Y. Beyond exponentially discounted sum: Automatic learning of return function. arXiv preprint arXiv:1905.11591, 2019.

Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229-256, 1992.

Xu, Z., van Hasselt, H. P., and Silver, D. Meta-gradient reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2396-2407, 2018.

Xu, Z., van Hasselt, H., Hessel, M., Oh, J., Singh, S., and Silver, D. Meta-gradient reinforcement learning with an objective discovered online. arXiv preprint arXiv:2007.08433, 2020.

Zahavy, T., Xu, Z., Veeriah, V., Hessel, M., Oh, J., van Hasselt, H. P., Silver, D., and Singh, S. A self-tuning actor-critic algorithm. Advances in Neural Information Processing Systems, 33, 2020.

Zheng, Z., Oh, J., and Singh, S. On learning intrinsic rewards for policy gradient methods. In Advances in Neural Information Processing Systems, pp. 4644-4654, 2018.

Zheng, Z., Oh, J., Hessel, M., Xu, Z., Kroiss, M., van Hasselt, H., Silver, D., and Singh, S. What can learned intrinsic rewards capture? arXiv preprint arXiv:1912.05500, 2019.
A. DAG Experiments
Three algorithms were compared in a tabular domain. In this appendix, the hyperparameter configurations of the three algorithms are provided in Section A.1 and more detailed results are reported in Section A.3.
A.1. Hyperparameters
An informal hyperparameter search was conducted for the three algorithms. Fixed-λ uses the Adam optimizer (Kingma & Ba, 2014). Updates are computed on batches consisting of full episodes. We found λ = 1 the best, better than any smaller value. We used a discount factor γ = 1 and a small entropy regularization coefficient. The inner loop of Meta-PWTD, Meta-PWR, and H-PWR shares hyperparameters with the fixed-λ baseline, except that λ is not used. For Meta-PWTD and Meta-PWR, the outer-loop optimizer is Adam with the same hyperparameters as the inner-loop optimizer. The outer-loop gradient is clipped by global norm. The weight matrix η is initialized from a uniform distribution over a small range centered at zero.

A.2. Learning TD-error weights with different value functions
In the main paper we show that the weights learned for rewards converge to a fixed weight matrix, which resembles the handcrafted weights we believe are useful for variance reduction (see Figure 10, Figure 11, and Figure 12 for visualizations of the weights). While the reward weighting may change somewhat during training, there exist fixed sets of weights that are beneficial throughout training. During the training of an actor-critic algorithm, the TD-errors change as the value function fits to the current policy. While we have demonstrated that we can learn fixed weights for the TD-errors that speed up learning considerably, in more complicated domains a fixed weighting scheme may end up hurting performance. To shed light on weight learning in a setting where the value function is changing, we consider the DAG reset-training setting but, instead of learning the value function with the policy, we use the optimal value function, which we mask to simulate non-stationary learning. By masking we mean setting the value to zero for all states up to a certain depth in the DAG environment, making the optimal value function less informative.

In Figure 8, three cases of value masking are shown in the depth 8 DAG environment, where the optimal value function has been masked to depth 0, 4, and 8. In the mask depth 0 case, the weights for the TD-errors have mostly changed in the parts of the weight matrix before column 30. Column 30 corresponds to the last state of the middle layer of the DAG that separates the states where the agent can act from the states where the agent receives rewards. When values are masked up to depth 4, the TD-error weighting shifts. State 14 is the last state at depth 4, so all weights before that are unchanged from the initialization. Note that at mask depths 0 and 4, no weights are placed on the states in the weight matrix after column 30. Finally, at depth 8, all of the value function has been masked out and the TD-error weighting has converged to visually similar weights as the reward weights, which is expected, as the TD-errors computed with the fully masked value function consist only of the rewards. The learned weights in Figure 8 show that Meta-PWTD will learn different kinds of weightings depending on the value function.
A.3. Additional Empirical Results
Learning curves in the DAG environment are presented in Figure 9. Handcrafted and learned weights in the three DAG environment variants of different depths are presented in Figures 10, 11, and 12.

B. Key-to-Door Experiments
B.1. Environment Description
Key-to-Door (KtD) is a fixed-horizon episodic task where each episode consists of three 2D gridworld phases.

In the Key phase (of fixed duration), there is no reward and the agent must navigate a small gridworld map to collect a key. The key disappears once collected. The initial locations of the agent and the key are randomly sampled in each episode.

In the Apple phase (of fixed duration), the agent collects apples in a gridworld map by walking over them; apples disappear once collected. Each apple yields a noisy reward with mean µ and variance σ²: the apple yields a fixed positive reward with some probability and a reward of 0 otherwise, with the two chosen to produce the desired mean and variance. This sampling procedure is consistent with the original TVT paper (Hung et al., 2019). The number of apples is uniformly sampled from a fixed range and their locations are randomly sampled.

In the Door phase (of fixed duration), the agent starts at the center of a room with a door. The agent can open the door only if it has collected the key in the earlier Key phase. The door disappears after being opened. Successfully opening the door yields a positive reward.

The agent's observation is a tuple, (map, has_key). map is the top-down view of the current phase and is rendered in an RGB representation. has_key is a binary channel which is 1 if the agent has already collected the key and 0 otherwise. The agent has four actions which correspond to moving up, down, left, and right. The primary difference between our KtD environment and the original is that our environment is fully observable; the original is partially observable. This difference is reflected in two modifications: the agent observes the top-down view of the map rather than the first-person view, and the agent observes whether it has collected the key.

For the stochastic dynamics variant, the action being executed is replaced by a random action with a fixed probability at each time step.

B.2. Implementation Details
All methods use A2C (Mnih et al., 2016) as the policy optimization algorithm. Multiple parallel actors are used to generate data. The rollout length is equal to the episode length in this case. For each method described below, we conducted a hyperparameter search in the µ = 5, σ = 5 KtD environment and selected the best-performing hyperparameters. These hyperparameters were then fixed for all the other environment configurations. Each candidate hyperparameter combination was run with three different random seeds for a fixed number of frames. The best hyperparameter combination was determined to be the one that first achieved a threshold fraction of the maximum possible episode return. The following hyperparameter settings are shared across all methods unless otherwise noted: the learning rate, the Adam optimizer parameters (β_1, β_2, ε), the discount factor γ, and the entropy regularization coefficient. The advantage estimates are standardized in a batch of trajectories before computing the policy gradient loss (Dhariwal et al., 2017) unless otherwise noted.

A Standard Perception Module.
A standard perception module is used by all methods to process the observation s into a latent vector h. The observation s is a tuple, (map, has_key). map is the top-down view of the current phase, rendered as an RGB image. has_key is a binary channel which is 1 if the agent has already collected the key and 0 otherwise. map is processed by two convolutional layers, each followed by a ReLU activation. The output of the last convolutional layer is then flattened and processed by Dense(512) - ReLU. The binary input has_key is concatenated with the ReLU layer output. Finally, the concatenated vector is further processed by an MLP: Dense(512) - ReLU - Dense(256) - ReLU. We denote the final output of the perception module as h.

Fixed-λ. The fixed-λ baseline implements the standard A2C algorithm. The policy and value function are implemented by two separate neural networks, each consisting of a perception module and an output layer, without any parameter sharing. The policy network maps h to the policy logits via a single dense layer. The value network maps h to a single scalar via a single dense layer. We label this baseline fixed-λ to underline the importance of the eligibility trace parameter λ. We searched over all combinations of a set of λ values and learning rates. The best performing setting used λ = 0.5.

Meta-PWTD
The policy ( θ ) and value function for the original return ( ψ ) have the same network architecture as in thefixed- λ baseline. The value function for the weighted sum of rewards ( φ ) has identical architecture as the value functionfor the original return ( ψ ). The meta-network ( η ) computes the weights as follows. For each episode, the inputs to themeta-network is a sequence ( s , δ , s , . . . , δ T , s T ) . Note that the TD-error δ i is part of the inputs. The meta-network firstmaps each s t (0 ≤ t ≤ T ) into a latent vector h t with a standard perception module. The meta-network ( η ) shares theperception module with the value function for the original return ( ψ ). No gradient is back-propagated from the meta-networkto the shared perception module. A dense layer with hidden units maps h i (0 ≤ i < T ) into h rowi ; a separate denselayer with units maps h j ⊕ δ j (0 < j ≤ T ) into h colj where ⊕ denotes concatenation. The TD-error δ j is clipped to [ − , before concatenation. Both dense layers are followed by ReLU activation. Another dense layer with units mapsthe time interval ( j − i )(0 ≤ i < j ≤ T ) to a latent vector td ij . h rowi , h colj , and td ij are element-wise multiplied to fuse thethree latent vectors into one vector h ij : h ij = ( h rowi + 1) ∗ ( h colj + 1) ∗ ( td ij + 1) . Note that every vector is shifted by aconstant before the multiplication to mitigate gradient vanishing at the beginning of training (Perez et al., 2018). The airwise Weights for Temporal Credit Assignment latent vectors h ij (0 ≤ i < j ≤ T ) are normalized by h (cid:48) ijd = γ d ( h ijd − µ d ) σ d + β d (1 ≤ d ≤ , where µ d = 2 T ∗ ( T + 1) (cid:88) ≤ i Meta-PWR uses the same neural network architecture as Meta-PWTD. The only difference is that Meta-PWRdoes not take the reward r i or the TD-error δ i as inputs. The hyperparameters for Meta-PWR are also the same as those forMeta-PWTD. H-PWR The policy and value function for the weighted sum of rewards have the same network architecture as in thefixed- λ baseline. We handcraft pairwise weights for the KtD domain to take advantage of the known credit assignmentstructure that can be described as follows: The policy learning in the key phase depends only on the reward in the doorphase, the policy learning in the apple phase does not depend on the other phases, and while the picking up the key or notimpacts the reward in the door phase, the reward in the door phase is still instantaneous. The weights are set so that in thekey phase, the rewards in the apple phase receive a zero weight and the rewards in the door phase are discounted startingfrom the first timestep of the door phase. Weights that compute discounting equivalent to γ = 0 . are applied in the applephase and door phase. An illustration of the learned weights is presented in Figure 3 in the main paper. H-PWR uses thesame hyperparameters as fixed- λ except γ , which does not apply and λ , which is set to . . No hyperparameter search isconducted specifically for H-PWR. Meta- λ ( s ) The policy ( θ ) and value function for the original return ( ψ ) have the same network architecture as in thefixed- λ baseline. The value function for the weighted sum of rewards ( φ ) has identical architecture as the value functionfor the original return ( ψ ). The meta-network ( η ) maps a state s t to a scalar λ ( s t ) ∈ (0 , . The meta-network first mapsthe observation s t to a latent vector h t with the standard perception module and then maps h t to a single scalar λ ( s t ) via asingle dense layer. 
Sigmoid is applied to the output of the meta-network to bound it to (0 , . We searched for the outer-looplearning rate in { − , ∗ − , ∗ − , − } . The best performing outer-loop learning rate is − . The outer-loop λ is set to . without search. RGM The policy ( θ ) and value function for the original return ( ψ ) have the same network architecture as in the fixed- λ baseline. The value function for the weighted sum of rewards ( φ ) has identical architecture as the value function for theoriginal return ( ψ ). For an episode τ = ( s , a , r , s , . . . , s T ) , the meta-network first maps each s t (0 ≤ t ≤ T ) to h t by ashared standard perception module. Then it concatenates h t with r t and the one-hot representation of a t . Four Transformerblocks (Vaswani et al., 2017) are applied on the concatenated features, each block with four attention heads. We denotethe output of the final Transformer block as h t (0 ≤ t ≤ T ) . Finally, a shared linear layer maps each h t to β t , the weighton the reward r t . We searched for the outer-loop learning rate in { − , − , − , − } , as suggested by the originalpaper (Wang et al., 2019). The best performing outer-loop learning rate is − . The outer-loop λ is set to . withoutsearch. TVT We adopted the agent architecture from the open-source implementation accompanying the original TVT paper(https://github.com/deepmind/deepmind-research/tree/master/tvt). The only difference is that we replaced the convolutional airwise Weights for Temporal Credit Assignment neural network torso in the original code with the standard perception module. We searched for all combinations of thefollowing hyperparameter sets: the read strength threshold for triggering the splice event in { , } and the learning ratein { − , ∗ − , ∗ − , − } . The best performing set of hyperparameters are for the read strength thresholdand ∗ − for the learning rate. We did not use advantage standardization for TVT because we found that it hurt theperformance in the KtD domain. We used the Adam optimizer parameters β = 0 , β = 0 . , and (cid:15) = 10 − , as theopen-source implementation suggested. We also set λ = 0 . and γ = 0 . following the original implementation. A2C-RR We pick the main ideas from RUDDER (Arjona-Medina et al., 2019) and implement them in an algorithm wecall A2C-RR. RUDDER uses contribution analysis to redistribute rewards in a RL episode. The high-level idea is that sincethe environment rewards may be delayed from the transitions that resulted in them, contribution analysis may be used tocompute how much of the total return is explained by any particular transition. In effect, this drives the expected futurereturn to zero because any reward that can be expected at any given timestep will be included in the redistributed reward,eliminating the delay and leading to faster learning of the RL agent. The method is based on learning a LSTM-network,which predicts the total episodic return at every timestep. The redistributed reward is computed as the difference of returnpredictions on consecutive timesteps.Compared to the full RUDDER algorithm, A2C-RR incorporates a few changes to isolate some of the core ideas of thereward redistribution module and make the algorithm more directly comparable to the other A2C-based algorithms discussedin this paper. We use the LSTM cell-architecture proposed in the RUDDER paper and train it from samples stored in areplay buffer. 
Compared to the full RUDDER algorithm, A2C-RR incorporates a few changes that isolate some of the core ideas of the reward redistribution module and make the algorithm more directly comparable to the other A2C-based algorithms discussed in this paper. We use the LSTM cell architecture proposed in the RUDDER paper and train it from samples stored in a replay buffer. In the KtD experiments, we do not use the "quality"-weighted advantage estimate proposed in the paper: because of the random noise in the episodic return, we deem it too high-variance for a reliable estimate of the redistribution quality. Instead, we mix the original and redistribution-based advantages at a fixed ratio. We recognize that omitting some of the features of the full RUDDER algorithm may adversely affect the reward redistribution and therefore the agent's learning performance. Nevertheless, we believe reward redistribution is an interesting take on an idea similar to the pairwise weighting studied in this paper and therefore provide our implementation, A2C-RR, as a baseline.

The regular frames and delta frames (as described in the paper) are processed by the standard perception module. The perception module outputs and the one-hot encoded action are concatenated and processed by a Dense(512)-ReLU layer, whose output is the input to the reward redistribution model. As suggested in the paper, the reward redistribution model is an LSTM without forget and output gates; the cell input only receives forward connections and the gates only receive recurrent connections. We chose not to use the prioritized replay buffer described in the paper because of the high variance of the returns in the KtD environment. For the same reason, we did not use the quality measure described in the paper for mixing the RUDDER advantage and the regular advantage; instead, we used a fixed mixing coefficient, which we searched for. The advantage is standardized after the mixing. We implemented an auxiliary task described in the paper, where the total-return prediction loss is applied at every step of the episode. Between policy updates, the reward redistribution model is trained on randomly sampled batches from a circular buffer holding past trajectories. The number of training updates between policy updates was chosen via an informal hyperparameter search in the µ = 5, σ = 5 KtD setting; more training between updates performed better up to the chosen value, beyond which further increases did not yield large improvements. The reward redistribution model is trained with Adam, and we applied an L2 weight regularizer. We searched over all combinations of γ, λ, the auxiliary-task coefficient, and the advantage mixing coefficient, and used the best-performing combination, with γ = 0.92.

B.3. Additional Empirical Results

We ran all of the methods described above in variants of the KtD environment. Figure 13, Figure 14, and Figure 15 show the episode return, the total reward in the door phase, and the total reward in the apple phase, respectively. Noticing that the KtD domain has deterministic dynamics, we also conducted experiments on a stochastic KtD domain in which, at each time step, the executed action is replaced by a random action with some fixed probability (a minimal sketch of such a perturbation is given below). The corresponding results are presented in Figure 16, Figure 17, and Figure 18. In general, Meta-PWTD and Meta-PWR still perform better than the baseline methods regardless of the stochasticity in the transition dynamics.
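For illustration, a minimal environment wrapper implementing this kind of action perturbation might look as follows; the environment interface and the replacement probability are assumptions, not the paper's code.

```python
import random

class RandomActionWrapper:
    """Sketch of the stochastic-KtD perturbation described above.

    `env` is assumed to expose `reset()` and `step(action)` with a discrete
    action space of size `num_actions`; the replacement probability used in
    the experiments is not reproduced here.
    """
    def __init__(self, env, num_actions: int, p: float):
        self.env = env
        self.num_actions = num_actions
        self.p = p

    def reset(self):
        return self.env.reset()

    def step(self, action):
        # With probability p, ignore the agent's action and act randomly.
        if random.random() < self.p:
            action = random.randrange(self.num_actions)
        return self.env.step(action)
```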
C. bsuite Experiments

C.1. Environment Description

We selected the tasks associated with the "credit assignment" tag in bsuite. They present a variety of credit assignment structures, including the umbrella problem discussed in the main text. Additionally, all domains except Discount Chain carry multiple tags, which create additional challenges beyond temporal credit assignment. We ran all variants of every task, each with multiple independent random seeds. Unlike the standard data regime of bsuite, we ran each task for a fixed number of episodes for all methods to calculate the total regret score, using a different episode budget for Cartpole. We refer the readers to the original bsuite paper (Osband et al., 2019) and the accompanying GitHub repository (https://github.com/deepmind/bsuite) for further details.

C.2. Implementation Details

Most methods use a neural network architecture similar to that described in §B.2, with two common differences. First, we used a single actor instead of parallel actors for generating data. Second, the standard perception module is replaced by a small MLP with ReLU activations in each layer, because the inputs in bsuite are vectors instead of images. Further architecture differences and hyperparameters are described below.

Actor-critic baseline. Entropy regularization is applied, and we used the Adam optimizer for training.

Meta-PWTD and Meta-PWR. Besides the perception module, there are two further differences relative to §B.2. First, the meta-network (η) and the value function for the original return (ψ) use separate perception modules instead of sharing one. Second, the hidden layers after the perception module use a different number of hidden units than in §B.2. Entropy regularization is applied. Both the inner loop and the outer loop use the Adam optimizer, with separate learning rates. The outer-loop gradient is clipped by global norm.

RGM. Entropy regularization is applied. Both the inner loop and the outer loop use the Adam optimizer, with separate learning rates.

A2C-RR. Apart from the perception module, the same A2C-RR implementation was used for bsuite as for the KtD experiments. The actor-critic was trained with the same hyperparameters as the actor-critic baseline for bsuite, and the LSTM was trained with the same hyperparameters as in KtD.

D. Atari Experiment

D.1. Implementation Details

Most methods use a neural network architecture similar to that described in §B.2, with two common differences. First, we generate fixed-length n-step trajectories instead of full episodes for each policy update, following the original A2C implementation (Mnih et al., 2016). Second, the standard perception module is replaced by the convolutional neural network architecture used in (Mnih et al., 2015). Further architecture differences and hyperparameters are described below.

A2C. The policy and the value function share the perception module, and the loss combines the policy-gradient loss, a value loss, and an entropy regularizer. We used the RMSProp optimizer with gradient clipping by global norm and a fixed discount factor γ. We searched for the eligibility-trace parameter λ over a grid of values and selected the best-performing one (a generic sketch of the λ-return computation that this parameter controls is given below).
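For reference, this is the standard backward recursion for λ-returns over an n-step trajectory; it is a textbook formulation, not the paper's exact implementation.

```python
import torch

def lambda_returns(rewards: torch.Tensor, next_values: torch.Tensor,
                   gamma: float, lam: float) -> torch.Tensor:
    """Generic backward recursion for lambda-returns over an n-step trajectory.

    rewards[t] is r_{t+1} and next_values[t] is V(s_{t+1}) for t = 0..n-1, so
    next_values[-1] bootstraps the tail of the trajectory:
        G_t = r_{t+1} + gamma * ((1 - lam) * V(s_{t+1}) + lam * G_{t+1}),
    with G_{n-1} = r_n + gamma * V(s_n).
    """
    n = rewards.shape[0]
    returns = torch.empty(n)
    g = next_values[-1]
    for t in reversed(range(n)):
        g = rewards[t] + gamma * ((1.0 - lam) * next_values[t] + lam * g)
        returns[t] = g
    return returns
```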
Meta-PWTD. The inner-loop hyperparameters are exactly the same as for the A2C baseline. For the outer loop, we applied an entropy regularization term as well to stabilize training. The outer loop uses the Adam optimizer, and the outer-loop gradient is clipped by global norm.

Figure 8. Meta-PWTD inner-loop-reset weight visualization experiment with a masked optimal value function. Panels (a), (b), and (c) show the learned weight matrices for mask depths 0, 4, and 8, respectively.

Figure 9. Learning curves in the DAG environments of depths 4, 8, and 16 (fixed λ = 1.0, Meta-PWR, Meta-PWTD, and H-PWR). The x-axis is the number of episodes and the y-axis is the episode return. Each curve is the mean over multiple seeds. The line for 95% of the max return is added to help contextualize these learning curves with Figure 2 in the main paper.

A2C-RR. To handle the variable-length episodes in Atari, we chunk the trajectories as described in the RUDDER paper (Arjona-Medina et al., 2019). Unlike RUDDER, A2C-RR uses a circular replay buffer to which all trajectories are added and from which training batches are sampled uniformly. For Atari, we use the quality measure described in the RUDDER paper for mixing the advantage estimates: the quality is computed as described in the paper and is used for mixing the advantage computed with the environment rewards and the one computed with the redistributed rewards. The quality is also used as the coefficient of the mean-squared-error loss used for training the baseline for the redistributed reward. The LSTM is trained for a maximum of 100 LSTM epochs every 100 actor-critic training iterations; if the quality over the last 40 LSTM training trajectories is positive after updating the LSTM, LSTM training is stopped. For training the LSTM, we normalize the rewards by the maximum return encountered so far and multiply the normalized rewards by a constant. Before mixing the regular and the redistributed advantages, we denormalize the redistributed rewards by inverting this normalization. The inputs to the LSTM are the delta frames and one-hot encoded actions. The RUDDER paper uses a more sophisticated exploration strategy for collecting data; we did not implement it, to keep the comparison with the other methods fair.

A2C-RR uses the same A2C agent as the fixed-λ baseline, with the same hyperparameters. Apart from the differences described above, the LSTM training hyperparameters are taken from the RUDDER paper (Arjona-Medina et al., 2019). The LSTM is trained on trajectories that are further split into fixed-length chunks, with a training batch size of 8, the Adam optimizer, gradient clipping, and an L2 regularizer. The LSTM starts training only after a minimum number of trajectories has been collected from the environment, and the replay buffer stores a bounded number of trajectories. A sketch of the reward normalization and quality-based advantage mixing is given below.
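The sketch below illustrates the reward normalization and its inverse as described above; the multiplicative constant is not reproduced, and the interpolation form of the quality-based mixing is our assumption rather than the exact RUDDER formula.

```python
import torch

def normalize_rewards(rewards: torch.Tensor, max_return: float,
                      scale: float = 1.0) -> torch.Tensor:
    """Scale rewards by the largest episodic return seen so far, then by a
    constant (the constant used in the experiments is not reproduced here)."""
    return rewards / max(abs(max_return), 1e-8) * scale

def denormalize_rewards(rewards: torch.Tensor, max_return: float,
                        scale: float = 1.0) -> torch.Tensor:
    """Invert `normalize_rewards` before mixing with environment rewards."""
    return rewards * max(abs(max_return), 1e-8) / scale

def mix_advantages(adv_env: torch.Tensor, adv_redistributed: torch.Tensor,
                   quality: float) -> torch.Tensor:
    """One plausible reading of quality-based mixing (our assumption, not the
    exact RUDDER formula): interpolate between the two advantage estimates
    with the quality, clamped to [0, 1], as the mixing coefficient."""
    q = min(max(quality, 0.0), 1.0)
    return (1.0 - q) * adv_env + q * adv_redistributed
```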
D.2. Additional Empirical Results

Figure 19 shows the learning curves of Meta-PWTD, A2C-RR, and the A2C baseline.

Figure 10. Handcrafted weights (a), meta-learned reward weights (b), and meta-learned TD-error weights (c) for H-PWR, Meta-PWR, and Meta-PWTD in a DAG environment. White indicates unreachable pairs. The y-axis is cropped to include only the weights that affect inner-loop learning.

Figure 11. Handcrafted weights (a), meta-learned reward weights (b), and meta-learned TD-error weights (c) for H-PWR, Meta-PWR, and Meta-PWTD in a deeper DAG environment. White indicates unreachable pairs. The y-axis is cropped to include only the weights that affect inner-loop learning.

Figure 12. Handcrafted weights (a), meta-learned reward weights (b), and meta-learned TD-error weights (c) for H-PWR, Meta-PWR, and Meta-PWTD in the deepest DAG environment. White indicates unreachable pairs. The y-axis is cropped to include only the weights that affect inner-loop learning.

Figure 13. Episode returns in all variants of the Key-to-Door environment. Rows are the different apple reward means (µ) and columns the different apple reward variances (σ); panel titles indicate the (µ, σ) setting, e.g., µ = 10 with σ = 5 or σ = 10. The x-axis is the number of frames and the y-axis is the episode return. The methods compared are H-PWR, RGM, the best fixed λ (λ = 0.5), meta-λ(s), TVT (γ = 0.92), Meta-PWR, A2C-RR (γ = 0.92), and Meta-PWTD. The curves show the average over independent runs with different random seeds and the shaded areas show the standard errors.

Figure 14. Door phase returns in all variants of the Key-to-Door environment. Rows are the different apple reward means (µ) and columns the different apple reward variances (σ). The x-axis is the number of frames and the y-axis is the door phase return. The curves show the average over independent runs with different random seeds and the shaded areas show the standard errors.

Figure 15. Apple phase returns in all variants of the Key-to-Door environment. Rows are the different apple reward means (µ) and columns the different apple reward variances (σ). The x-axis is the number of frames and the y-axis is the apple phase return. The curves show the average over independent runs with different random seeds and the shaded areas show the standard errors.

Figure 16. Episode returns in all variants of the stochastic Key-to-Door environment. Rows are the different apple reward means (µ) and columns the different apple reward variances (σ). The x-axis is the number of frames and the y-axis is the episode return. The curves show the average over independent runs with different random seeds and the shaded areas show the standard errors.

Figure 17. Door phase returns in all variants of the stochastic Key-to-Door environment. Rows are the different apple reward means (µ) and columns the different apple reward variances (σ). The x-axis is the number of frames and the y-axis is the door phase return. The curves show the average over independent runs with different random seeds and the shaded areas show the standard errors.
Figure 18. Apple phase returns in all variants of the stochastic Key-to-Door environment. Rows are the different apple reward means (µ) and columns the different apple reward variances (σ). The x-axis is the number of frames and the y-axis is the apple phase return. The curves show the average over independent runs with different random seeds and the shaded areas show the standard errors.

Figure 19. Learning curves on Atari games, one panel per game (Atlantis, BankHeist, BattleZone, BeamRider, Bowling, Boxing, Breakout, Centipede, ChopperCommand, CrazyClimber, DemonAttack, DoubleDunk, Enduro, FishingDerby, Freeway, Frostbite, Gopher, Gravitar, Hero, IceHockey, Jamesbond, Kangaroo, Krull, KungFuMaster, MontezumaRevenge, MsPacman, NameThisGame, Pong, PrivateEye, Qbert, Riverraid, RoadRunner, Robotank, Seaquest, SpaceInvaders, StarGunner, Tennis, TimePilot, Tutankham, UpNDown, Venture, VideoPinball, WizardOfWor). The x-axis is the number of frames and the y-axis is the episode return. Each curve is averaged over independent runs with different random seeds, and the shaded areas show the standard error across runs.