Meta-descent for Online, Continual Prediction
Andrew Jacobsen, Matthew Schlegel, Cameron Linke, Thomas Degris, Adam White, Martha White
University of Alberta, Edmonton, Canada; Google DeepMind, London, UK; Google DeepMind, Edmonton
Abstract
This paper investigates different vector step-size adaptation approaches for non-stationary online, continual prediction problems. Vanilla stochastic gradient descent can be considerably improved by scaling the update with a vector of appropriately chosen step-sizes. Many methods, including AdaGrad, RMSProp, and AMSGrad, keep statistics about the learning process to approximate a second order update: a vector approximation of the inverse Hessian. Another family of approaches uses meta-gradient descent to adapt the step-size parameters to minimize prediction error. These meta-descent strategies are promising for non-stationary problems, but have not been as extensively explored as quasi-second order methods. We first derive a general, incremental meta-descent algorithm, called AdaGain, designed to be applicable to a much broader range of algorithms, including those with semi-gradient updates or even those with accelerations, such as RMSProp. We provide an empirical comparison of methods from both families. We conclude that methods from both families can perform well, but in non-stationary prediction problems the meta-descent methods exhibit advantages. Our method is particularly robust across several prediction problems, and is competitive with the state-of-the-art method on a large-scale, time-series prediction problem on real data from a mobile robot.
Introduction
In this paper we consider continual, non-stationary prediction problems. Consider a learning system whose objective is to learn a large collection of predictions about an agent's future interactions with the world. The predictions specify the value of some signal many steps in the future, given that the agent follows some specific course of action. There are many examples of such prediction learning systems including Predictive State Representations (Littman, Sutton, and Singh 2001), Observable Operator Models (Jaeger 2000), Temporal-difference Networks (Sutton and Tanner 2004), and General Value Functions (Sutton et al. 2011). In our setting, the agent continually interacts with the world, making new predictions about the future, and revising its previous predictions as new outcomes are revealed. Occasionally, partially due to changes in the world and partially due
to changes in the agent's own behaviour, the targets may change and the agent must refine its predictions.

Stochastic gradient descent (SGD) is a natural choice for our setting because gradient descent methods work well when paired with abundant training data. The performance of SGD is dependent on the step-size parameter (scalar, vector or matrix), which scales the gradient to mitigate sample variance and improve data efficiency. Most modern large-scale learning systems make use of optimization algorithms that attempt to approximate stochastic second-order gradient descent to adjust both the direction and magnitude of the descent direction, with early work indicating the benefits of such quasi-second order methods if used carefully in the stochastic case (Schraudolph, Yu, and Günter 2007; Bordes, Bottou, and Gallinari 2009). Many of these algorithms attempt to approximate the diagonal of the inverse Hessian, which describes the curvature of the loss function, and so maintain a vector of step-sizes, one for each parameter. Starting from AdaGrad (McMahan and Streeter 2010; Duchi, Hazan, and Singer 2011), several diagonal approximations have been proposed, including RMSProp (Tieleman and Hinton 2012), AdaDelta (Zeiler 2012), vSGD (Schaul, Zhang, and LeCun 2013), Adam (Kingma and Ba 2015) and AMSGrad (Reddi, Kale, and Kumar 2018). Stochastic quasi-second order updates have been derived specifically for temporal difference learning, with some empirical success (Meyer et al. 2014), particularly in terms of parameter sensitivity (Pan, White, and White 2017; Pan, Azer, and White 2017). On the other hand, second order methods, by design, assume the loss and thus the Hessian are fixed, and so non-stationary dynamics or drifting targets could be problematic.

A related family of optimization algorithms, called meta-descent algorithms, was developed for continual, online prediction problems.
These algorithms perform meta-gradient descent, adapting a vector of step-size parameters to minimize the error of the base learner, instead of approximating the Hessian. Meta-descent applied to the step-size was first introduced for online least-mean squares methods (Jacobs 1988; Sutton 1992b; 1992a; Almeida et al. 1998; Mahmood et al. 2012), including the linear complexity method IDBD (Sutton 1992b). IDBD was later extended to more general losses (Schraudolph 1999) and to support (semi-gradient) temporal difference methods (Dabney and Barto 2012; Dabney 2014; Kearney et al. 2018). These methods are well-suited to non-stationary problems, and have been shown to ignore irrelevant features. The main limitation of several of these meta-descent algorithms, however, is that the derivations are heuristic, making it difficult to extend to new settings beyond linear temporal difference learning. The more general approaches, like Stochastic Meta-Descent (SMD) (Schraudolph 1999), require the update to be a stochastic gradient descent update and have some issues in biasing towards smaller step-sizes (Wu et al. 2018). It remains an open challenge to make these meta-descent strategies as broadly and easily applicable as the AdaGrad variants.

Footnote: We exclude recent meta-learning frameworks (MAML (Finn, Abbeel, and Levine 2017), LTLGDGD (Andrychowicz et al. 2016)) because they assume access to a collection of tasks that can be sampled independently, enabling the agent to learn how to select meta-parameters for a new problem. In our setting, the agent must solve a large collection of non-stationary prediction problems in parallel using off-policy learning methods.

In this paper we introduce a new meta-descent algorithm, called AdaGain, that attempts to optimize the stability of the base learner, rather than convergence to a fixed point.
AdaGain is built on a generic derivation scheme that allows it to be easily combined with a variety of base-learners including SGD, (semi-gradient) temporal-difference learning and even optimized SGD updates, like AMSGrad. Our goal is to investigate the utility of both meta-descent methods and the more widely used quasi-second order optimizers in online, continual prediction problems. We provide an extensive empirical comparison on (1) canonical optimization problems with large flat regions that are difficult to optimize, (2) an online, supervised tracking problem where the optimal step-sizes can be computed, (3) a finite Markov Decision Process with linear features that cause conventional temporal difference learning to diverge, and (4) a high-dimensional time-series prediction problem using data generated from a real mobile robot. In problems with non-stationary dynamics the meta-descent methods can exhibit an advantage over the quasi-second order methods. On the difficult optimization problems, however, meta-descent methods fail, which, retrospectively, is unsurprising given that the meta-optimization problem for step-sizes is similarly difficult to optimize. We show that AdaGain can possess the advantages of both families, performing well on both optimization problems with flat regions and non-stationary problems, by selecting an appropriate base learner, such as RMSProp.

Background and Notation
In this paper we consider online continual prediction problems modeled as non-stationary, uncontrolled dynamical systems. On each discrete time step $t$, the agent observes the internal state of the system through an imperfect summary vector $o_t \in \mathcal{O} \subset \mathbb{R}^d$ for some $d \in \mathbb{N}$, such as the sensor readings of a mobile robot. On each step, the agent makes a prediction about a target signal $T_t \in \mathbb{R}$. In the simplest case, the target of the prediction is a component $i$ of the observation vector on the next step, $T_t = o_{t+1,i}$, the classic one-step prediction. In the more general case, the target is constructed by mapping the entire future of the observation time series to a scalar, such as the discounted sum formulation used in reinforcement learning: $T_t = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^k o_{t+k+1,i}\right]$, where $\gamma \in [0, 1)$ discounts the contribution of future observations to the infinite sum. The prediction $P_t \in \mathbb{R}$ is generated by a parametrized function, with modifiable parameter vector $w_t \in \mathbb{R}^k$.

In online continual prediction problems the agent updates its predictions (via $w_t$) with each new sample $o_t$, unlike the more common batch and stochastic settings. The agent's objective is to minimize the error between the prediction $P_t$ given by $w_t$ and the target $T_t$ before it is observed, over all time steps. Online continual prediction problems are typically solved using stochastic updates to adapt the parameter vector $w_t$ after each time step $t$ to reduce the error (retroactively) between $P_t$ and $T_t$. Generically, for a stochastic update vector $\Delta_t \in \mathbb{R}^k$, the weights are modified as

$$w_{t+1} = w_t + \alpha_t \circ \Delta_t \tag{1}$$

for a vector step-size $\alpha_t$, where the operator $\circ$ denotes element-wise multiplication. Given an update vector, the goal is to select $\alpha_t$ to reduce error into the future.
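As a concrete reading of update (1), the following minimal sketch applies a vector step-size element-wise to an update vector. The linear predictor and the squared-error update vector are illustrative assumptions, not choices fixed by the text; the function names are hypothetical.

```python
def predict(w, x):
    """Linear prediction P_t = w . x (an assumed form of the parametrized function)."""
    return sum(wi * xi for wi, xi in zip(w, x))


def lms_update_vector(w, x, target):
    """Assumed update vector Delta_t = (T_t - P_t) * x for squared error."""
    err = target - predict(w, x)
    return [err * xi for xi in x]


def apply_update(w, alpha, delta):
    """Equation (1): w_{t+1} = w_t + alpha_t o Delta_t, element-wise."""
    return [wi + ai * di for wi, ai, di in zip(w, alpha, delta)]


w = [0.0, 0.0]
alpha = [0.1, 0.5]  # one step-size per weight
delta = lms_update_vector(w, x=[1.0, 2.0], target=3.0)
w_next = apply_update(w, alpha, delta)
```

Because each weight has its own step-size, the second weight here moves five times further per unit of update than the first.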
Semi-gradient methods like temporal difference learning follow a similar scheme, but $\Delta_t$ is not the gradient of an objective function.

Step-size adaptation for the stationary setting is often based on estimating second-order updates. The idea is to estimate the loss function $\ell : \mathbb{R}^k \to \mathbb{R}$ locally around the current weights $w_t$ using a second-order Taylor series approximation, which requires the Hessian $H_t$. A closed-form solution can then be obtained for the approximation, because it is a quadratic function, giving the next candidate solution $w_{t+1} = w_t - (H_t)^{-1} \nabla \ell(w_t)$. If instead the Hessian is approximated, such as with a diagonal approximation, then we obtain quasi-second order updates. Taken to the extreme, with the Hessian approximated by a scalar, as $H_t = \alpha_t^{-1} I$, we obtain first-order gradient descent with a step-size of $\alpha_t$. For the batch setting, the gains from second order methods are clear, with a convergence rate of $O(1/t^2)$, as opposed to $O(1/t)$ for first-order descent.

These gains are not as clear in the stochastic setting, but diagonal approximations appear to provide an effective balance between computation and convergence rate improvements (Bordes, Bottou, and Gallinari 2009). Duchi, Hazan, and Singer (2011) provide a general regret analysis for diagonal approximation methods, proving sublinear regret if step-sizes decrease to zero over time. One algorithm, AdaGrad, uses the vector step-size $\alpha_t = \eta \left( \sum_{i=1}^{t} \Delta_i^2 + \epsilon \right)^{-1/2}$ for a fixed $\eta > 0$ and a small $\epsilon > 0$, with element-wise squaring and division. RMSProp and Adam, which are not guaranteed to obtain sublinear regret, use a running average rather than a sum of gradients, with Adam additionally including a momentum term for faster convergence. AMSGrad is a modification of Adam that satisfies the regret criteria, without decaying the step-sizes as aggressively as AdaGrad.

Footnote: A related class of algorithms are natural gradient methods, which aim to be robust to the functional parametrization. Incremental natural gradient methods have been proposed (Amari, Park, and Fukumizu 2000), including for policy evaluation with gradient TD methods (Dabney and Thomas 2014). However, these algorithms do not remove the need to select a step-size, and so we do not consider them further here.

Footnote: There is a large literature on accelerated first-order descent methods, starting from early work on momentum (Nesterov 1983) and many since focused mainly on variance reduction (cf. (Roux, Schmidt, and Bach 2012)). These methods can complement step-size adaptation, but are not well-suited to non-stationary problems because many of the algorithms are designed for a batch of data and focus on increasing convergence rate to a fixed minimum.

The meta-descent strategies instead directly learn step-sizes that minimize the same objective as the base learner. A simpler set of such methods, called hypergradient methods (Jacobs 1988; Almeida et al. 1998; Baydin et al. 2018), only adjust the step-size based on its impact on the weights on a single step. Hypergradient Descent (HD) (Baydin et al. 2018) takes the gradient of the loss $\ell(w)$ w.r.t. a scalar step-size $\alpha > 0$, to get the meta-gradient for the step-size as $\partial \ell(w_t) / \partial \alpha = -\nabla_w \ell(w_{t-1})^\top \nabla_w \ell(w_t)$. The update simply requires storing the vector $g_{t-1} = \nabla_w \ell(w_{t-1})$ and updating $\alpha_{t+1} = \alpha_t + \bar{\alpha}\, g_{t-1}^\top g_t$, for a meta step-size $\bar{\alpha} > 0$. More generally, meta-descent methods, like IDBD (Sutton 1992b) and SMD (Schraudolph 1999), consider the impact of the step-size back in time, through the weights, with $w_{t,j}$ the $j$-th element in vector $w_t$:

$$\frac{\partial \ell(w_t(\alpha))}{\partial \alpha_i} = \sum_{j=1}^{k} \frac{\partial \ell(w_t(\alpha))}{\partial w_{t,j}} \frac{\partial w_{t,j}}{\partial \alpha_i}. \tag{2}$$

The goal is to approximate this gradient efficiently, usually using a recursive strategy.
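The two step-size rules described above can be sketched in a few lines. This is only an illustration under stated assumptions: the quadratic toy loss $\frac{1}{2}\|w - c\|^2$, the default constants, and the function names are hypothetical and are not taken from the experiments in this paper.

```python
import math


def adagrad_step_sizes(sum_sq, delta, eta=0.1, eps=1e-8):
    """AdaGrad: alpha_t = eta * (sum_i Delta_i^2 + eps)^(-1/2), element-wise.

    Returns the new step-size vector and the updated running sum of squares.
    """
    new_sum = [s + d * d for s, d in zip(sum_sq, delta)]
    alpha = [eta / math.sqrt(s + eps) for s in new_sum]
    return alpha, new_sum


def toy_loss(w, c):
    """Assumed toy loss 0.5 * ||w - c||^2, whose gradient is w - c."""
    return 0.5 * sum((wi - ci) ** 2 for wi, ci in zip(w, c))


def hypergradient_descent(w0, c, alpha0=0.01, meta=0.001, steps=200):
    """HD with a scalar step-size: alpha <- alpha + meta * g_{t-1}^T g_t."""
    w = list(w0)
    alpha = alpha0
    prev_g = [0.0] * len(w)  # no previous gradient on the first step
    for _ in range(steps):
        g = [wi - ci for wi, ci in zip(w, c)]
        alpha += meta * sum(a * b for a, b in zip(prev_g, g))
        w = [wi - alpha * gi for wi, gi in zip(w, g)]
        prev_g = g
    return w, alpha


w_final, alpha_final = hypergradient_descent([1.0, -2.0], [0.0, 0.0])
```

In this toy problem successive gradients stay aligned, so HD keeps increasing the step-size, whereas the AdaGrad rule can only shrink it as squared updates accumulate.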
We derive such a strategy for AdaGain below using a different meta-descent objective, and for completeness include the derivation for the SMD objective in the appendix (as the original contains an error).

Illustrative example
To make the problem more concrete, consider a simple stateless tracking problem driven by two interacting Gaussians:

$$Y_t \overset{\text{def}}{=} Z_t + \mathcal{N}(0, \sigma^2_{Y,t}), \qquad Z_{t+1} \leftarrow Z_t + \mathcal{N}(0, \sigma^2_{Z,t}), \tag{3}$$

where the agent only observes the sequence $Y_1, Y_2, \ldots$. The objective is to minimize the mean squared error (MSE) between a scalar prediction $P_t = w_t$ and the target $T_t = Y_{t+1}$. This problem is non-stationary because $\sigma_{Y,t}$ and $\sigma_{Z,t}$ change periodically and the agent has no knowledge of the schedule. Since $\sigma_{Y,t}$ and $\sigma_{Z,t}$ govern how quickly the mean $Z_t$ drifts and the sampling variance in $Y_t$, the agent must set its step-size accordingly: larger $\sigma_{Z,t}$ requires a larger step-size, and larger $\sigma_{Y,t}$ requires a smaller step-size. The agent must continually change its scalar step-size value in order to achieve low MSE. The optimal constant scalar step-size can be computed in this simple domain (Sutton 1992b), and is shown by the black dashed line in Figure 1. We compared the step-sizes learned by several well-known quasi-second order methods (AdaGrad, RMSProp, AdaDelta) and three meta-descent strategies, including our own AdaGain. We ran the experiment for over 24 hours to test the robustness of these methods in a long-running continual prediction task. Several methods, including AdaGain, were able to match the optimal step-size. However, several well-known methods, including AdaGrad and AdaDelta, completely fail in this problem. In addition, the meta-descent strategy SMD diverged after 8,183,817 time steps, highlighting the special challenges of online, continual prediction problems.
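A minimal simulation of the tracking problem (3) with a fixed scalar step-size can make the trade-off concrete. This is a sketch under assumptions: the noise schedules are passed as hypothetical functions of the time step and are treated as standard deviations, and the particular schedule, seed, and step-size below are illustrative rather than those used in the experiment.

```python
import random


def simulate_tracking(n_steps, sigma_y, sigma_z, alpha, z0=0.0, seed=0):
    """Track Y_{t+1} with a scalar prediction P_t = w and a fixed step-size.

    sigma_y and sigma_z are functions of t returning standard deviations
    (an assumption; the real schedule is unknown to the agent).
    Returns the mean squared error and the final weight.
    """
    rng = random.Random(seed)
    z, w = z0, 0.0
    total_sq_err = 0.0
    for t in range(n_steps):
        z = z + rng.gauss(0.0, sigma_z(t))       # Z_{t+1}: drifting mean
        y_next = z + rng.gauss(0.0, sigma_y(t))  # Y_{t+1}: noisy observation
        err = y_next - w                         # error of prediction P_t = w_t
        total_sq_err += err * err
        w += alpha * err                         # scalar case of update (1)
    return total_sq_err / n_steps, w


# Hypothetical schedule: the drift alternates between slow and fast every 250 steps.
mse, w_last = simulate_tracking(
    1000,
    sigma_y=lambda t: 1.0,
    sigma_z=lambda t: 0.1 if (t // 250) % 2 == 0 else 0.5,
    alpha=0.2,
)
```

Re-running with different values of `alpha` on each segment of the schedule shows why no single fixed step-size is best throughout.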
[Figure 1 plot: step-size parameter value versus time steps for AdaDelta, AdaGain, IDBD, and the optimal step-size.]

Figure 1: Optimal Gain Experiment. Depicted are the last 500,000 steps of the experiment. AdaGrad and AdaDelta fail to learn the correct progression of step-sizes, and SMD diverges.

Adaptive Gain for Stability
Tracking (continually updating the weights with recent experience) contrasts with the typical goal of convergence. Much of the previous algorithm development for step-size adaptation, however, has been towards the aim of convergence, with algorithms like AdaGrad and AMSGrad that decay step-sizes over time. Assuming finite representational capacity, there may be aspects of the problem that can never be accurately modeled or predicted by the agent. In these partially observable problems, tracking, and thus treating the problem as if it were non-stationary, can improve prediction accuracy compared with methods that converge (Sutton, Koop, and Silver 2007). In continual learning we assume the agent's task is partially observable in this way, and develop a new step-size method that can facilitate tracking.

We treat the learning system as a dynamical system, where the weight update is based on stochastic updates known to suitably track the targets, and consider the choice of step-size as the input to the system to maintain stability. Such a view has been previously considered under adaptive gain for least-mean squares (LMS) (Benveniste, Metivier, and Priouret 1990, Chapter 4), where weights are treated as state following a random drift. To generalize this idea to other incremental algorithms, we propose a more general criterion based on the magnitude of the update vector.

A criterion for $\alpha$ to maintain stability in the system is to keep the norm of the update vector small:

$$\min_{\alpha > 0} \; \mathbb{E}\left[ \|\Delta_t(w_t(\alpha))\|_2^2 \;\middle|\; w_0 \right]. \tag{4}$$

The update $\Delta_t(w_t(\alpha))$ on this time step is dependent on the step-size $\alpha$ because that step-size influences $w_t$ and past updates. The expected value is over all possible update vectors $\Delta_t(w_t(\alpha))$ for the given step-size, assuming the system started with some $w_0$. If the dynamics are ergodic, $\Delta_t(w_t(\alpha))$ does not depend on the initial $w_0$, and is only driven by the underlying state dynamics and the choice of $\alpha$.
The step-size can be seen as a control input for this system, with the goal to maintain a stable dynamical system by minimizing $\|\Delta_t(w_t(\alpha))\|_2^2$ over time.

We derive an algorithm to estimate $\alpha$ for this dynamical system, which we call AdaGain: Adaptive Gain for Stability. The algorithm is derived for a generic update $\Delta_t(w_t(\alpha))$ that is differentiable w.r.t. the weights $w_t$; we provide specific examples for particular updates in the appendix, including for linear TD.

Generic algorithm with quadratic-complexity
We derive the full quadratic-complexity algorithm to start, and then introduce approximations to obtain a linear-complexity algorithm. To minimize (4), we use stochastic gradient descent, and thus need to compute the gradient of $\|\Delta_t(w_t(\alpha))\|_2^2$ w.r.t. the step-size $\alpha$. For step-size $\alpha_i$ as the $i$-th element in the vector $\alpha$, and $w_{t,j}$ the $j$-th element in vector $w_t$,

$$\frac{1}{2} \frac{\partial \|\Delta_t(w_t(\alpha))\|_2^2}{\partial \alpha_i} = \Delta_t(w_t(\alpha))^\top \frac{\partial \Delta_t(w_t(\alpha))}{\partial \alpha_i} = \Delta_t(w_t(\alpha))^\top \sum_{j=1}^{k} \frac{\partial \Delta_t(w_t(\alpha))}{\partial w_{t,j}} \frac{\partial w_{t,j}}{\partial \alpha_i}.$$

The key, then, is to track how a change in the weights impacts the update and how changes in the step-size impact the weights. The first term can be computed instantaneously on this step. For the second term, however, the impact of the step-size on the weights goes back further to previous updates. We show how to obtain a recursive form for this step-size gradient, $\psi_{t,i} \overset{\text{def}}{=} \frac{\partial w_t}{\partial \alpha_i} \in \mathbb{R}^k$:

$$\begin{aligned}
\psi_{t+1,i} &= \frac{\partial \left( w_t + \alpha \circ \Delta_t(w_t(\alpha)) \right)}{\partial \alpha_i} \\
&= \psi_{t,i} + \alpha \circ \sum_{j=1}^{k} \frac{\partial \Delta_t(w_t(\alpha))}{\partial w_{t,j}} \frac{\partial w_{t,j}}{\partial \alpha_i} + \left[ 0, \ldots, \Delta_{t,i}(\alpha), \ldots, 0 \right]^\top \\
&= \left( I + \mathrm{diag}(\alpha)\, G_t \right) \psi_{t,i} + \left[ 0, \ldots, \Delta_{t,i}(\alpha), \ldots, 0 \right]^\top,
\end{aligned}$$

where $G_{t,j} \overset{\text{def}}{=} \frac{\partial \Delta_t(w_t(\alpha))}{\partial w_{t,j}} \in \mathbb{R}^k$, $G_t \overset{\text{def}}{=} [G_{t,1}, \ldots, G_{t,k}] \in \mathbb{R}^{k \times k}$, and the bracketed vector has $\Delta_{t,i}(\alpha)$ in its $i$-th entry and zeros elsewhere. Therefore, $\psi_{t+1,i}$ represents a sum of updates, with a recursive weighting on previous $\psi_{t,i}$ adjusting the weight of previous updates in the sum.

We can approximate the gradient using this recursive relationship, without storing all previous samples. Though the above updates are exact, we obtain an approximation when implementing such a recursive form in practice. When using $\psi_{t-1,i}$ computed on the last time step $t-1$, this gradient estimate is in fact w.r.t. the previous step-size $\alpha_{t-2}$, rather than $\alpha_{t-1}$. Because these step-sizes are slowly changing, this gradient still provides a reasonable estimate; however, for many steps into the past, the accumulated gradients in $\psi_{t,i}$ are likely inaccurate.
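The exact recursion above can be written for all $i$ at once by stacking the $\psi_{t,i}$ as columns of a $k \times k$ matrix, giving $\Psi_{t+1} = (I + \mathrm{diag}(\alpha) G_t)\Psi_t + \mathrm{diag}(\Delta_t)$. The sketch below implements one such step with plain nested lists; the function name and the LMS-style Jacobian $G_t = -x x^\top$ used in the example are illustrative assumptions.

```python
def psi_recursion(Psi, G, alpha, delta):
    """One step of Psi_{t+1} = (I + diag(alpha) G) Psi + diag(delta).

    Psi[r][c] holds the r-th entry of psi_{t,c} = d w_t / d alpha_c.
    G[r][c] holds d Delta_{t,r} / d w_{t,c}.
    """
    k = len(delta)
    # Matrix product G @ Psi, costing O(k^3) as written.
    G_Psi = [[sum(G[r][m] * Psi[m][c] for m in range(k)) for c in range(k)]
             for r in range(k)]
    return [[Psi[r][c] + alpha[r] * G_Psi[r][c] + (delta[r] if r == c else 0.0)
             for c in range(k)]
            for r in range(k)]


# Assumed example: LMS update Delta = (y - x.w) x, whose Jacobian is G = -x x^T.
x = [1.0, 2.0]
G = [[-xi * xj for xj in x] for xi in x]
Psi0 = [[0.0, 0.0], [0.0, 0.0]]
Psi1 = psi_recursion(Psi0, G, alpha=[0.1, 0.1], delta=[3.0, 6.0])
```

Starting from a zero matrix, the first step simply places the update vector on the diagonal; off-diagonal entries only appear on later steps through the $G_t \Psi_t$ term.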
To improve the approximation, and forget old gradients, we introduce a forgetting parameter $0 < \beta < 1$, which focuses the accumulation of gradients in $\psi_{t,i}$ on a more recent window.

The gradient update to the step-size also needs to ensure that the step-sizes remain positive. Similarly to IDBD, we use an exponential form for the step-size, where $\alpha = \exp(\beta)$ and $\beta \in \mathbb{R}$ is updated with (unconstrained) stochastic gradient descent. Conveniently, as we show in the appendix, we do not need to maintain this auxiliary variable, and can simply directly update $\alpha$.

The resulting generic updates for quadratic-complexity AdaGain, with meta step-size $\bar{\alpha}$, are

$$\begin{aligned}
\alpha_t &= \alpha_{t-1} \circ \exp\left( -\bar{\alpha}\, \alpha_{t-1} \circ \left( \Psi_t^\top G_t^\top \Delta_t \right) \right) \qquad (5) \\
\psi_{t+1,i} &= (1 - \beta)\, \psi_{t,i} + \beta\, \alpha_t \circ \left( G_t \psi_{t,i} \right) + \beta \left[ 0, \ldots, \Delta_{t,i}, \ldots, 0 \right]^\top
\end{aligned}$$

where the exponential is applied element-wise, each $\psi_{t,i}$ is initialized to $\mathbf{0}$, $\alpha$ is initialized to a small positive value, and $(\Psi_t)_{:,i} = \psi_{t,i}$ with $\Psi_t \in \mathbb{R}^{k \times k}$. For computational efficiency, to avoid matrix-matrix multiplication, the order of multiplication for $\Psi_t^\top G_t^\top \Delta_t$ should start from the right, as $\Psi_t^\top (G_t^\top \Delta_t)$. The key complexity in deriving an AdaGain update, then, is simply in computing the Jacobian $G_t$; given this, the remainder of the algorithm is fixed. For each update $\Delta_t(w_t(\alpha))$, the Jacobian will be different, but is straightforward to compute.

Generic AdaGain algorithm with linear-complexity
Maintaining the entire matrix $\Psi_t$ can be prohibitively expensive. As was done in IDBD (Sutton 1992b), one way to avoid maintaining this matrix is to assume that $\frac{\partial w_{t,j}}{\partial \alpha_i} = 0$ for $i \neq j$. This heuristic reflects that $\alpha_i$ is likely to have the largest impact on $w_{t,i}$, and less impact on the other entries in $w_t$.

The modification above for this heuristic is straightforward, simply by setting entries $(\psi_{t,i})_j = 0$ for $i \neq j$. This results in the simplification

$$\begin{aligned}
\psi_{t+1,i} &= \psi_{t,i} + \alpha \circ \sum_{j=1}^{k} G_{t,j} (\psi_{t,i})_j + \left[ 0, \ldots, \Delta_{t,i}(\alpha), \ldots, 0 \right]^\top \\
&= \psi_{t,i} + \alpha \circ G_{t,i} (\psi_{t,i})_i + \left[ 0, \ldots, \Delta_{t,i}(\alpha), \ldots, 0 \right]^\top.
\end{aligned}$$

Further, since we will then assume that $(\psi_{t+1,i})_j = 0$ for $i \neq j$, there is no purpose in computing the full vector $G_{t,i} (\psi_{t,i})_i$. Instead, we only need to compute the $i$-th entry, i.e., $\frac{\partial \Delta_{t,i}(\alpha)}{\partial w_{t,i}}$. We can then instead define $\hat{\psi}_{t,i}$ to be a scalar approximating $\frac{\partial w_{t,i}}{\partial \alpha_i}$, with $\hat{\psi}_t$ the vector of these, and $\hat{j}_t \overset{\text{def}}{=} \left[ \frac{\partial \Delta_{t,1}(\alpha)}{\partial w_{t,1}}, \ldots, \frac{\partial \Delta_{t,k}(\alpha)}{\partial w_{t,k}} \right]$ to define the recursion as $\hat{\psi}_{t+1} \overset{\text{def}}{=} \hat{\psi}_t + \alpha \circ \hat{j}_t \circ \hat{\psi}_t + \Delta_t(w_t(\alpha))$, with $\hat{\psi}$ initialized to $\mathbf{0}$. The gradient using this approximation, with off-diagonals zero, is

$$\begin{aligned}
\frac{1}{2} \frac{\partial \|\Delta_t(w_t(\alpha))\|_2^2}{\partial \alpha_i} &= \Delta_t(w_t(\alpha))^\top \sum_{j=1}^{k} \frac{\partial \Delta_t(w_t(\alpha))}{\partial w_{t,j}} \frac{\partial w_{t,j}}{\partial \alpha_i} \\
&\approx \Delta_t(w_t(\alpha))^\top \frac{\partial \Delta_t(w_t(\alpha))}{\partial w_{t,i}} \frac{\partial w_{t,i}}{\partial \alpha_i} \\
&= \hat{\psi}_{t,i}\, G_{t,i}^\top \Delta_t(w_t(\alpha)).
\end{aligned}$$

To compute this approximation, for all $i$, we still need to be able to compute $G_t^\top \Delta_t(w_t(\alpha))$. In some cases this is straightforward, as is the case for linear TD (found in the appendix). More generally, we can use R-operators (Pearlmutter 1994) to compute this Jacobian-vector product, or a simple finite difference approximation, as we do in the appendix. Therefore, because we can compute this Jacobian-vector product in linear time, the only approximation is to $\hat{\psi}_t$.
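One simple way to realize the finite difference option is a forward difference of the update function along a direction $v$. The sketch below is an assumption about how such an approximation could look, not necessarily the variant given in the appendix; the names are hypothetical. Note that a forward difference yields $G_t v$, which coincides with $G_t^\top v$ only when $G_t$ is symmetric, as it is for the LMS update used here.

```python
def lms_update(w, x, y):
    """Assumed base update Delta(w) = (y - x.w) x, with Jacobian G = -x x^T."""
    err = y - sum(wi * xi for wi, xi in zip(w, x))
    return [err * xi for xi in x]


def jacobian_vector_product_fd(update_fn, w, v, eps=1e-6):
    """Forward-difference estimate of G v ~= (Delta(w + eps*v) - Delta(w)) / eps."""
    base = update_fn(w)
    shifted = update_fn([wi + eps * vi for wi, vi in zip(w, v)])
    return [(s - b) / eps for s, b in zip(shifted, base)]


x, y = [1.0, 2.0], 3.0
w = [0.0, 0.0]
delta = lms_update(w, x, y)
approx = jacobian_vector_product_fd(lambda w_: lms_update(w_, x, y), w, delta)

# Analytic value for comparison: G delta = -x (x . delta).
x_dot_delta = sum(xi * di for xi, di in zip(x, delta))
exact = [-xi * x_dot_delta for xi in x]
```

The cost is two evaluations of the update function, so the product is obtained in time linear in the number of weights whenever the update itself is.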
The update is

$$\begin{aligned}
\alpha_t &= \alpha_{t-1} \circ \exp\left( -\bar{\alpha}\, \alpha_{t-1} \circ \hat{\psi}_t \circ \left( G_t^\top \Delta_t \right) \right) \qquad (6) \\
\hat{\psi}_{t+1} &= (1 - \beta)\, \hat{\psi}_t + \beta\, \alpha_t \circ \hat{j}_t \circ \hat{\psi}_t + \beta\, \Delta_t.
\end{aligned}$$

These approximations parallel diagonal approximations for second-order techniques, which similarly assume off-diagonal elements are zero. Further, $G_t$ itself is a gradient of the update w.r.t. the weights, where this update was already likely the gradient of the loss w.r.t. the weights. This $G_t$, therefore, contains similar information to the Hessian. The AdaGain update, therefore, contains some information about curvature, but allows for updates that are not necessarily (true) gradient updates.

This AdaGain update is generic, but does require computing the Jacobian of a given update, which could be onerous in certain settings. We provide an update based on finite differences in the appendix that only requires differences between updates, and that we have found works well in practice.

Experiments in synthetic tasks
We conduct experiments in several simulation domains to highlight the performance characteristics of meta-descent and quasi-second order methods. In our first experiment we investigate AdaGain and several meta-descent and quasi-second order approaches on a notoriously difficult stationary optimization task. Next we return to the simple stateless tracking problem described in the introduction, and investigate the parameter sensitivity of each method. Our third experiment investigates how different optimization algorithms can stabilize the iterates in sequential off-policy learning problems, which cause SGD-based methods to diverge. We conclude with a comparison of AdaGain and AMSGrad (the best performing quasi-second order method in the first three experiments) for online prediction on data generated by a mobile robot.

In all the experiments, we use AdaGain layered on top of an RMSProp update, rather than a vanilla SGD update. As motivated earlier, meta-descent methods are not robust on difficult optimization surfaces, such as those with flat or sharp regions. AdaGain provides a practical method to pursue meta-descent strategies that are robust to such realistic optimization problems. We motivate the importance of this choice in our first experiment on a difficult optimization task.
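For orientation, the linear-complexity update (6) can be sketched for a plain LMS base update, where the Jacobian is available in closed form ($G_t = -x x^\top$). This is a simplified illustration only: it does not layer AdaGain on RMSProp as the experiments do, and the function name, constants, and toy data are assumptions.

```python
import math


def adagain_lms_step(w, alpha, psi, x, y, meta=0.01, beta=0.1):
    """One linear-complexity AdaGain step (update (6)) for an assumed LMS base update."""
    err = y - sum(wi * xi for wi, xi in zip(w, x))
    delta = [err * xi for xi in x]                        # Delta_t
    x_dot_delta = sum(xi * di for xi, di in zip(x, delta))
    gt_delta = [-xi * x_dot_delta for xi in x]            # G_t^T Delta_t, G_t = -x x^T
    j_diag = [-xi * xi for xi in x]                       # diagonal of G_t
    # Exponentiated step-size update keeps every step-size positive.
    alpha = [a * math.exp(-meta * a * p * g)
             for a, p, g in zip(alpha, psi, gt_delta)]
    # Base weight update (1) with the new vector step-size.
    w = [wi + a * d for wi, a, d in zip(w, alpha, delta)]
    # Diagonal trace of d w / d alpha with forgetting parameter beta.
    psi = [(1 - beta) * p + beta * a * j * p + beta * d
           for p, a, j, d in zip(psi, alpha, j_diag, delta)]
    return w, alpha, psi


# Toy usage: repeatedly present one (x, y) pair.
w, alpha, psi = [0.0, 0.0], [0.1, 0.1], [0.0, 0.0]
x, y = [1.0, 2.0], 3.0
for _ in range(50):
    w, alpha, psi = adagain_lms_step(w, alpha, psi, x, y)
final_error = y - sum(wi * xi for wi, xi in zip(w, x))
```

Swapping in a different base learner changes only how `delta`, `gt_delta`, and `j_diag` are computed; the step-size and trace updates stay the same.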
Function optimization.
[Figure 2 plots: optimization trajectories on the Rosenbrock function and learning curves (RMSE averaged over 100 runs) for AMSGrad, AdaGain, AdaGain (finite diff.), SGD, SMD, Quadratic AdaGain, and AdaGain w/o RMSProp.]

Figure 2: Optimization paths of a single run (with tuned meta-parameters) for several algorithms on the Rosenbrock function. The white × symbol indicates where in the input space the algorithm converged. The paths represent how each algorithm changes the weights while searching for the minimum. The white + symbol indicates the optimal value for the weights; if the × and + symbols overlap, the algorithm has reached the global minimum of the function. Although SGD and SMD appear to quickly approach the minimum, the valley is in fact easy to find, but reaching the + is difficult. Both methods converge slowly and neither achieves a low final value. The AdaGain algorithms with RMSProp (the full quadratic AdaGain algorithm, AdaGain with the linear approximation, and AdaGain with the linear approximation and finite differences) outperform the other methods in this problem. The finite-differences AdaGain algorithm (provided in the appendix) is a generic strategy that does not require knowledge of the Jacobian, and so can be easily applied to any update. This result highlights that there is not a significant loss in using this approximation over AdaGain with analytic Jacobians. AdaGain without RMSProp, on the other hand, converges much more slowly, though interestingly it does still outperform SMD. Note that although the run above of AdaGain without RMSProp did reach the minimum, that was not true in general, as reflected by the learning curve.

The aim of our first experiment is to investigate how AdaGain performs on optimization problems designed to be difficult for gradient descent. The Rosenbrock function is a two-dimensional non-convex function, and the minimum is inside a flat, parabolic-shaped valley. We compared AMSGrad, SGD, and SMD, in each case extensively searching the meta-parameters of each method, averaging performance over 100 runs and 6000 optimization steps. The results are summarized in Figure 2, with trajectory plots of a single run of each algorithm, and the learning curves for all methods. AdaGain both learns faster and gets closer to the global optimum than all other methods considered. Further, two meta-descent methods, SMD and AdaGain without RMSProp, perform poorly.
This result highlights issues with applying meta-descent approaches without considering the optimization surface, and the importance of having an algorithm like AdaGain which can be combined with quasi-second order methods.

[Figure 3 plot: mean squared error averaged over 100 runs for each parameter combination of AdaGain, IDBD, AdaGrad, SMD, and RMSProp.]

Figure 3: Parameter sensitivity plot for the first 500,000 steps of the stateless tracking problem. Each circle denotes the average MSE for a single parameter combination of an algorithm. The parameter combinations and respective performance are grouped in vertical columns for each method. The circles in each column are randomly offset within the column horizontally, as many parameter settings may achieve almost identical MSE. Circles near the bottom of the plot represent low MSE. Circles arranged in a line in the top-most part of the plot are parameter combinations that either diverged or exceeded a minimum performance threshold, with the percentage of such parameter combinations given in the graph.
Stateless tracking problem.
Recall from Figure 1 that several methods performed well in the stateless tracking problem; sensitivity to parameter settings, however, is also important. To help better understand these methods, we constructed a parameter sensitivity graph (Figure 3). IDBD can outperform AdaGain on this problem (lower MSE), but only a tiny fraction of IDBD's parameter settings achieve good performance. None of AdaGrad's parameter combinations exceeded the threshold, but all combinations resulted in high error compared with AdaGain. Many of the parameter combinations allowed AdaGain to achieve low error, suggesting that AdaGain with simple manual parameter tuning is likely to achieve good performance on this problem, while IDBD likely requires a comprehensive parameter sweep.
Baird’s counterexample.
Our final synthetic-domain experiment tests the stability of AdaGain's update when combined with the TD($\lambda$) algorithm for off-policy state-value prediction in a Markov Decision Process. We use Baird's counterexample, which causes the weights learned by off-policy TD($\lambda$) (Sutton and Barto 1998) to diverge if a global step-size parameter is used (decaying or otherwise) (Baird 1995; Sutton and Barto 1998; Maei 2011). The key challenge is the feature representation, and the difference between the target and behavior policies. There is a shared redundant feature, and the weight associated with the seventh feature is initialized to a high value. The target policy always chooses to go to state seven and stay there forever. The behavior policy, on the other hand, only visits state seven 1/7 of the time, causing large importance sampling corrections.

We applied AdaGain, Adam, RMSProp, SMD, and TIDBD (Kearney et al. 2018), a recent extension of the IDBD algorithm, to adapt the step-sizes of linear TD($\lambda$) on Baird's counterexample. As before, the meta-parameters were extensively swept and the best performing parameters were used to generate the results for comparison. Figure 5 shows the learning curves of each method. Only AdaGain and Adam are able to prevent divergence. SMD's performance is typical of Baird's counterexample: the meta-parameter search simply found parameters that caused extremely slow divergence. AdaGain learns significantly faster than Adam, and achieves lower error.

To understand how AdaGain prevents divergence, consider Figure 4. The left graph shows the step-size values as they evolve over time, and the right graph shows the corresponding weights. Recall that the weight for feature seven is initialized to a high value. AdaGain initially increases feature seven's step-size, causing weight seven to quickly fall. In parallel, AdaGain reduces the step-size for the redundant feature, preventing incorrect generalization.
Over time the weights converge to one of many valid solutions, and the value error, plotted in black on the right side, converges to zero. The left plots of Figure 5 show the same evolution of the weights and step-sizes for Adam. Adam is successful in reducing the step-size for the redundant feature; however, the step-sizes of the other features decay quickly and then begin growing again, preventing convergence to low value error.

[Figure 4 plots: step-size parameter value (left) and weight with root mean squared value error (right) versus time steps, showing $\alpha_t$ and $w_t$ for the state 7 feature, the shared feature, and features 1 to 6.]

Figure 4: The step-size parameter values over time, and the corresponding weights learned by AdaGain in Baird's counterexample, with results averaged over 1000 independent runs. AdaGain is able to adapt the step-sizes of each feature in such a way that off-policy TD($\lambda$) converges.

[Figure 5 plots: step-size parameter value and weight with value error versus time steps for Adam, showing $\alpha_t$ and $w_t$ for the state 7 feature, the shared feature, and features 1 to 6; and root mean squared value error versus time steps (log scale) for Adam, SMD, RMSProp, TIDBD, and AdaGain.]

Figure 5: The step-size parameter values over time, and the corresponding weights learned by Adam, and learning curves for several methods in Baird's counterexample. Results averaged over 1000 independent runs. TD($\lambda$) combined with AdaGain achieves the best performance. Adam also prevents divergence, but converges to worse value error.

[Figure 6 plots: median SMAPE versus time steps (left) and heat sensor predictions versus time steps (right), showing the AdaGain prediction and the offline optimal prediction.]

Figure 6: The median symmetric mean absolute percentage error (SMAPE) across all 53 sensors (left), with a plot of the predictions for the heat sensor versus the ideal prediction in early learning (right). The ideal predictions are computed offline using all future data (as described in (Modayil, White, and Sutton 2014)), but the predictions are learned online and incrementally. The learning curve shows that the predictions learned by AdaGain achieve good accuracy more quickly than those learned by AMSGrad. The right plot highlights early learning performance on the heat sensor, from time zero, illustrating that AdaGain's prediction more quickly approaches the desired magnitude and then maintains good stability. This is particularly notable because the heat sensor targets in this case are unnormalized, obtaining values over 1 million. We also include the optimal predictions computed by solving a system of equations offline (again as in (Modayil, White, and Sutton 2014)). The optimal solution makes use of only the first 40,000 data points for each sensor, reflecting the realistic scenario of computing predictions from a limited batch of data, and later using the offline solution for online prediction. As is to be expected, the SMAPE for these offline optimal predictions is low on the training data (first 40,000 time steps), and much higher on later data.
Experiments on robot data
In our final experiment we recreate nexting (Modayil, White, and Sutton 2014), using TD(λ) to make dozens of predictions about the future values of robot sensor readings. We formulate each prediction as estimating the discounted sum of future sensor readings, treating each sensor as a reward signal with discount factor of γ = 0 . corresponding to approximately 80 second predictions. Using the freely available nexting data set (144,000 samples, corresponding to 3.4 hours of runtime on the robot), we incrementally processed the data, on each step constructing a feature vector from the sensor vector and making one prediction for each sensor. At the end of learning we computed the "ideal" prediction offline and computed the symmetric mean absolute percentage error of each prediction, and aggregated the 50 learning curves using the median. We used the same non-linear coarse recoding of the sensor inputs described in the original work, giving 6065 binary feature components for use as a linear representation.

For this experiment we reduced the number of algorithms, using AMSGrad as the best performing quasi-second order method based on our synthetic task experiments and AdaGain as the representative meta-descent algorithm. The meta step-size was optimized for both algorithms.

[Figure 7 plots. Panel titles: light sensor during stall condition, light sensor normal operation, magnetic sensor normal operation. x-axes: time steps (+83,000), time steps (+90,000), time steps (+90,000). y-axes: light reading, magnetic reading. Legend: AMSGrad prediction, AdaGain prediction; the stall is marked on the plot.]

Figure 7: Three snapshots in time of the predictions learned by AdaGain compared with the offline ideal predictions. Each of the three plots highlights a different part of the dataset to give an alternative perspective on the accuracy of AdaGain's learned predictions. In the leftmost plot we see a situation where the robot stalled unexpectedly directly in front of a bright light source, saturating the light sensor. Due to this sudden unpredictable event, the predictions of both AdaGain and AMSGrad became incorrect. However, AdaGain more quickly adapts learning to adjust its predictions to reflect the new reality, matching the ideal predictions (black line). Otherwise, these plots show that, in general, AdaGain and AMSGrad can track the ideal prediction similarly.

The learning curves in Figure 6 show a clear advantage for AdaGain in terms of aggregate error over all predictions. Inspecting the predictions of one of the heat sensors reveals why. In early learning, AdaGain much more quickly increases the prediction to near the ideal prediction, whereas AMSGrad reaches this point much more slowly, taking over 12000 steps. AdaGain and AMSGrad then both track the ideal heat prediction similarly, and so obtain similar error for the remainder of learning. This advantage in initial learning is also demonstrated in Figure 7, which depicts predictions on two different sensors. For example, AdaGain adapts the predictions more quickly in reaction to the unexpected stall event, but otherwise AdaGain and AMSGrad obtain similar errors. This result also serves as a sanity check for AdaGain, validating that AdaGain does scale to more realistic problems and remains stable in the face of high levels of noise and high-magnitude prediction targets.
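The evaluation pipeline described in this section (offline "ideal" discounted sums, a per-step symmetric percentage error, and a median across sensors) can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the discounted sum is truncated at the end of the recording, and the SMAPE form |p - g| / (|p| + |g|) is one common definition that may differ from the one used in the experiments.

```python
import numpy as np


def discounted_future_sum(signal, gamma):
    """Return G with G[t] = sum_{k>=0} gamma**k * signal[t + 1 + k].

    The sum is truncated at the end of the recording, which is an
    assumption of this sketch.
    """
    signal = np.asarray(signal, dtype=float)
    returns = np.zeros_like(signal)
    running = 0.0
    # Work backwards using the recursion G[t] = signal[t + 1] + gamma * G[t + 1].
    for t in range(len(signal) - 1, -1, -1):
        returns[t] = running
        running = signal[t] + gamma * running
    return returns


def smape(prediction, target, eps=1e-12):
    """Per-step symmetric absolute percentage error (assumed form, in [0, 1])."""
    prediction = np.asarray(prediction, dtype=float)
    target = np.asarray(target, dtype=float)
    return np.abs(prediction - target) / (np.abs(prediction) + np.abs(target) + eps)


def median_error_curve(per_sensor_errors):
    """Aggregate an array of shape (num_sensors, num_steps) with the median."""
    return np.median(np.asarray(per_sensor_errors, dtype=float), axis=0)
```

Computing the targets backwards keeps the cost linear in the length of the recording, which matters for a data set with over a hundred thousand samples per sensor.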
Conclusion
In this work, we proposed a new general meta-descent strategy to adapt a vector of stepsizes for online, continual prediction problems. We defined a new meta-descent objective that enables a broader class of incremental updates for the base learner, generalizing beyond work specialized to least-mean squares, temporal difference learning and vanilla stochastic gradient descent updates. We derive a recursive update for the stepsizes, and provide a linear-complexity approximation. In a series of experiments, we highlight that meta-descent strategies are not robust to the shape of the optimization surface. The ability to use AdaGain for generic updates enabled us to overcome this issue, by layering AdaGain on RMSProp, a simple quasi-second order approach. We then showed that, with this modification, meta-descent methods can perform better than the more commonly used quasi-second order updates, adapting more quickly in non-stationary tasks.

References
Almeida, L. B.; Langlois, T.; Amaral, J. D.; and Plakhov, A. 1998. On-line learning in neural networks. In Saad, D., ed., On-Line Learning in Neural Networks. New York, NY, USA: Cambridge University Press. chapter Parameter Adaptation in Stochastic Optimization, 111–134.
Amari, S.-i.; Park, H.; and Fukumizu, K. 2000. Adaptive Method of Realizing Natural Gradient Learning for Multilayer Perceptrons. Neural Computation.
Andrychowicz, M.; Denil, M.; Gómez, S.; Hoffman, M. W.; Pfau, D.; Schaul, T.; and de Freitas, N. 2016. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems.
Baird, L. 1995. Residual algorithms: Reinforcement learning with function approximation. In Machine Learning Proceedings 1995. Elsevier. 30–37.
Baydin, A. G.; Cornish, R.; Rubio, D. M.; Schmidt, M.; and Wood, F. 2018. Online Learning Rate Adaptation with Hypergradient Descent. In International Conference on Learning Representations.
Benveniste, A.; Metivier, M.; and Priouret, P. 1990. Adaptive Algorithms and Stochastic Approximations. Springer.
Bordes, A.; Bottou, L.; and Gallinari, P. 2009. SGD-QN: Careful quasi-Newton stochastic gradient descent. Journal of Machine Learning Research.
Dabney, W., and Barto, A. G. 2012. Adaptive step-size for online temporal difference learning. In AAAI.
Dabney, W., and Thomas, P. S. 2014. Natural Temporal Difference Learning. In AAAI Conference on Artificial Intelligence.
Dabney, W. C. 2014. Adaptive Step-sizes for Reinforcement Learning. Ph.D. Dissertation, University of Massachusetts - Amherst.
Duchi, J.; Hazan, E.; and Singer, Y. 2011. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research.
Finn, C.; Abbeel, P.; and Levine, S. 2017. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. In International Conference on Machine Learning.
Jacobs, R. 1988. Increased rates of convergence through learning rate adaptation. Neural Networks.
Jaeger, H. 2000. Observable Operator Processes and Conditioned Continuation Representations. Neural Computation.
Kearney, A.; Veeriah, V.; Travnik, J. B.; Sutton, R. S.; and Pilarski, P. M. 2018. Tidbd: Adapting temporal-difference step-sizes through stochastic meta-descent. arXiv preprint arXiv:1804.03334.
Kingma, D. P., and Ba, J. 2015. Adam: A Method for Stochastic Optimization. In International Conference on Machine Learning.
Littman, M. L.; Sutton, R. S.; and Singh, S. 2001. Predictive representations of state. In Advances in Neural Information Processing Systems.
Maei, H. R. 2011. Gradient temporal-difference learning algorithms. University of Alberta Edmonton, Alberta.
Mahmood, A. R.; Sutton, R. S.; Degris, T.; and Pilarski, P. M. 2012. Tuning-free step-size adaptation. ICASSP.
McMahan, H. B., and Streeter, M. 2010. Adaptive Bound Optimization for Online Convex Optimization. In International Conference on Learning Representations.
Meyer, D.; Degenne, R.; Omrane, A.; and Shen, H. 2014. Accelerated gradient temporal difference learning algorithms. In IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning.
Modayil, J.; White, A.; and Sutton, R. S. 2014. Multi-timescale nexting in a reinforcement learning robot. Adaptive Behavior.
Soviet Mathematics and Doklady.
Pan, Y.; Azer, E. S.; and White, M. 2017. Effective sketching methods for value function approximation. In Conference on Uncertainty in Artificial Intelligence, Amsterdam, Netherlands.
Pan, Y.; White, A.; and White, M. 2017. Accelerated Gradient Temporal Difference Learning. In International Conference on Machine Learning.
Pearlmutter, B. A. 1994. Fast Exact Multiplication by the Hessian. dx.doi.org.
Reddi, S. J.; Kale, S.; and Kumar, S. 2018. On the Convergence of Adam and Beyond. In International Conference on Learning Representations.
Roux, N. L.; Schmidt, M.; and Bach, F. R. 2012. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems.
Schaul, T.; Zhang, S.; and LeCun, Y. 2013. No More Pesky Learning Rates. In International Conference on Artificial Intelligence and Statistics.
Schraudolph, N.; Yu, J.; and Günter, S. 2007. A stochastic quasi-Newton method for online convex optimization. In International Conference on Artificial Intelligence and Statistics.
Schraudolph, N. N. 1999. Local gain adaptation in stochastic gradient descent. International Conference on Artificial Neural Networks: ICANN '99.
Spall, J. C. 1992. Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Transactions on Automatic Control.
Introduction to Reinforcement Learning. Cambridge, MA, USA: MIT Press, 1st edition.
Sutton, R. S., and Tanner, B. 2004. Temporal-Difference Networks. In Advances in Neural Information Processing Systems.
Sutton, R. S.; Modayil, J.; Delp, M.; Degris, T.; Pilarski, P.; White, A.; and Precup, D. 2011. Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In International Conference on Autonomous Agents and Multiagent Systems.
Sutton, R.; Koop, A.; and Silver, D. 2007. On the role of tracking in stationary environments. In International Conference on Machine Learning.
Sutton, R. S. 1992a. Gain Adaptation Beats Least Squares? In Seventh Yale Workshop on Adaptive and Learning Systems.
Sutton, R. 1992b. Adapting bias by gradient descent: An incremental version of delta-bar-delta. In AAAI Conference on Artificial Intelligence.
Tieleman, T., and Hinton, G. 2012. RmsProp: Divide the gradient by a running average of its recent magnitude. In COURSERA Neural Networks for Machine Learning.
Wu, Y.; Ren, M.; Liao, R.; and Grosse, R. B. 2018. Understanding Short-Horizon Bias in Stochastic Meta-Optimization. In
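International Conference on Learning Representations.
Zeiler, M. D. 2012. ADADELTA: An Adaptive Learning Rate Method. arXiv:1411.4000v2 [cs.LG].

Stochastic Meta-Descent algorithm

We recreate the SMD derivation, in our notation, for easier reference. We compute the gradient of the loss function $\ell(\mathbf{w})$ w.r.t. the step-size. We derive the full quadratic-complexity algorithm to start, and then introduce approximations to obtain a linear-complexity algorithm. For stepsize $\alpha_i$ as the $i$-th element in the vector $\boldsymbol{\alpha}$,

$$\frac{\partial \ell(\mathbf{w}(\boldsymbol{\alpha}))}{\partial \alpha_i} = \sum_{j=1}^{k} \frac{\partial \ell(\mathbf{w}(\boldsymbol{\alpha}))}{\partial w_{t,j}} \frac{\partial w_{t,j}}{\partial \alpha_i}$$

Define the following two vectors, for $w_{t,j}$ the $j$-th element in vector $\mathbf{w}_t$,

$$\mathbf{g}_t \in \mathbb{R}^k \ \text{with entries} \ g_{t,j} \overset{\text{def}}{=} -\frac{\partial \ell(\mathbf{w}(\boldsymbol{\alpha}))}{\partial w_{t,j}} \quad \text{(the gradient update)} \tag{7}$$

$$\boldsymbol{\psi}_{t,i} \overset{\text{def}}{=} \frac{\partial \mathbf{w}_t}{\partial \alpha_i} \in \mathbb{R}^k. \tag{8}$$

We can obtain vector $\boldsymbol{\psi}_{t,i}$ recursively as

$$\begin{aligned}
\boldsymbol{\psi}_{t+1,i} &= \frac{\partial (\mathbf{w}_t + \boldsymbol{\alpha} \circ \mathbf{g}_t)}{\partial \alpha_i} \\
&= \frac{\partial \mathbf{w}_t}{\partial \alpha_i} + \boldsymbol{\alpha} \circ \frac{\partial \mathbf{g}_t}{\partial \alpha_i} + [0, \ldots, g_{t,i}(\boldsymbol{\alpha}), \ldots, 0]^\top \\
&= \boldsymbol{\psi}_{t,i} + \boldsymbol{\alpha} \circ \sum_{j} \frac{\partial \mathbf{g}_t}{\partial w_{t,j}} \frac{\partial w_{t,j}}{\partial \alpha_i} + [0, \ldots, g_{t,i}(\boldsymbol{\alpha}), \ldots, 0]^\top \\
&= \boldsymbol{\psi}_{t,i} - \boldsymbol{\alpha} \circ (\mathbf{H}_t \boldsymbol{\psi}_{t,i}) + [0, \ldots, g_{t,i}(\boldsymbol{\alpha}), \ldots, 0]^\top \\
&= (\mathbf{I} - \operatorname{diag}(\boldsymbol{\alpha}) \mathbf{H}_t)\, \boldsymbol{\psi}_{t,i} + [0, \ldots, g_{t,i}(\boldsymbol{\alpha}), \ldots, 0]^\top,
\end{aligned}$$

where the bracketed vector has $g_{t,i}(\boldsymbol{\alpha})$ in position $i$ and zeros elsewhere.

The resulting generic updates for quadratic-complexity SMD, with meta stepsize $\bar{\alpha}$, are

$$\boldsymbol{\alpha}_t = \boldsymbol{\alpha}_{t-1} \circ \exp\left(\bar{\alpha}\, \boldsymbol{\alpha}_t \circ \boldsymbol{\Psi}_t^\top \mathbf{g}_t\right) \tag{9}$$

for $(\boldsymbol{\Psi}_t)_{:,i} = \boldsymbol{\psi}_{t,i}$ with $\boldsymbol{\Psi}_t \in \mathbb{R}^{k \times k}$, and $\mathbf{H}_t$ the Hessian of $\ell_t$ w.r.t. $\mathbf{w}_t$,

$$\boldsymbol{\psi}_{t+1,i} = (1 - \beta)\, \boldsymbol{\psi}_{t,i} - \beta\, \boldsymbol{\alpha}_t \circ (\mathbf{H}_t \boldsymbol{\psi}_{t,i}) + \beta\, [0, \ldots, g_{t,i}, \ldots, 0]^\top$$

with $\boldsymbol{\psi}_{0,i} = \mathbf{0}$ and $\boldsymbol{\alpha}_0$ = 0 . (or some initial value). As with AdaGain, the Hessian-vector product $\mathbf{H}_t \boldsymbol{\psi}_{t,i}$ can be computed efficiently, using R-operators. Here, it is irrelevant, because we maintain the quadratic $\boldsymbol{\Psi}$.

For the linear-complexity algorithm, again we set entries $(\boldsymbol{\psi}_{t,i})_j = 0$ for $i \neq j$. Let $\mathbf{H}_{t,i}$ be the $i$-th column of the Hessian. This results in the simplification

$$\begin{aligned}
\boldsymbol{\psi}_{t+1,i} &= \boldsymbol{\psi}_{t,i} - \boldsymbol{\alpha} \circ \sum_{j=1}^{k} \mathbf{H}_{t,j} (\boldsymbol{\psi}_{t,i})_j + [0, \ldots, g_{t,i}(\boldsymbol{\alpha}), \ldots, 0]^\top \\
&= \boldsymbol{\psi}_{t,i} - \boldsymbol{\alpha} \circ \mathbf{H}_{t,i} (\boldsymbol{\psi}_{t,i})_i + [0, \ldots, g_{t,i}(\boldsymbol{\alpha}), \ldots, 0]^\top.
\end{aligned}$$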
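Further, since we will then assume that $(\boldsymbol{\psi}_{t+1,i})_j = 0$ for $i \neq j$, there is no purpose in computing the full vector $\mathbf{H}_{t,i} (\boldsymbol{\psi}_{t,i})_i$. Instead, we only need to compute the $i$-th entry, i.e., for $\frac{\partial g_{t,i}(\boldsymbol{\alpha})}{\partial w_{t,i}}$. We can then instead define $\hat{\psi}_{t,i}$ to be a scalar approximating $\frac{\partial w_{t,i}}{\partial \alpha_i}$, with $\hat{\boldsymbol{\psi}}_t$ the vector of these, and the diagonal of the Hessian

$$\hat{\mathbf{h}}_t \overset{\text{def}}{=} \left[ \frac{\partial^2 \ell(\mathbf{w}(\boldsymbol{\alpha}))}{\partial w_{t,1}^2}, \ldots, \frac{\partial^2 \ell(\mathbf{w}(\boldsymbol{\alpha}))}{\partial w_{t,k}^2} \right] \tag{10}$$

to define the recursion as

$$\hat{\boldsymbol{\psi}}_{t+1} \overset{\text{def}}{=} \hat{\boldsymbol{\psi}}_t - \boldsymbol{\alpha} \circ \hat{\mathbf{h}}_t \circ \hat{\boldsymbol{\psi}}_t + \mathbf{g}_t(\boldsymbol{\alpha}),$$

with $\hat{\boldsymbol{\psi}}_0 = \mathbf{0}$. The gradient using this approximation, with off-diagonals zero, is

$$\frac{\partial \ell(\mathbf{w}(\boldsymbol{\alpha}))}{\partial \alpha_i} = \sum_{j=1}^{k} \frac{\partial \ell(\mathbf{w}(\boldsymbol{\alpha}))}{\partial w_{t,j}} \frac{\partial w_{t,j}}{\partial \alpha_i} \approx \frac{\partial \ell(\mathbf{w}(\boldsymbol{\alpha}))}{\partial w_{t,i}} \frac{\partial w_{t,i}}{\partial \alpha_i} = -\hat{\psi}_{t,i}\, g_{t,i},$$

where the sign follows because $g_{t,i}$ is the negative gradient in (7). Descending this gradient with a multiplicative update gives the positive exponent below. The resulting update to the stepsize is

$$\boldsymbol{\alpha}_t = \boldsymbol{\alpha}_{t-1} \circ \exp\left(\bar{\alpha}\, \boldsymbol{\alpha}_t \circ \hat{\boldsymbol{\psi}}_t \circ \mathbf{g}_t\right) \tag{11}$$

$$\hat{\boldsymbol{\psi}}_{t+1} = (1 - \beta)\, \hat{\boldsymbol{\psi}}_t - \beta\, \boldsymbol{\alpha}_t \circ \hat{\mathbf{h}}_t \circ \hat{\boldsymbol{\psi}}_t + \beta\, \mathbf{g}_t.$$

Difference to original SMD algorithm: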
Now, surprisingly, the above algorithm differs from the algorithm given for SMD. But that derivation appears to have a flaw, where the gradient of the weights taken w.r.t. a vector of stepsizes is assumed to be a vector. Rather, with the same off-diagonal approximation we use, it should be a diagonal matrix, and then they would also only get a diagonal Hessian. For completeness, we include their algorithm, which uses a full Hessian-vector product.

$$\boldsymbol{\alpha}_t = \boldsymbol{\alpha}_{t-1} \circ \exp\left(\bar{\alpha}\, \boldsymbol{\alpha}_t \circ \hat{\boldsymbol{\psi}}_t \circ \mathbf{g}_t\right) \tag{12}$$

$$\hat{\boldsymbol{\psi}}_{t+1} = \hat{\boldsymbol{\psi}}_t - \boldsymbol{\alpha}_t \circ \mathbf{H}_t \hat{\boldsymbol{\psi}}_t + \mathbf{g}_t.$$

Note that a follow-up paper that tested SMD (Wu et al. 2018) uses this update, but does not have an error, because they use a scalar step size. In fact, in the SMD paper, if the step size had been a scalar, then their derivation would be correct.
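As a concrete instance of the linear-complexity recursion in (11), the following sketch specializes it to a linear model with squared loss, where the negative gradient is $(y - \mathbf{x}^\top \mathbf{w})\mathbf{x}$ and the Hessian diagonal is $\mathbf{x} \circ \mathbf{x}$. The function name, the default meta-parameters, and the choice to use the previous step-size inside the exponential are assumptions made for illustration, not details taken from the experiments.

```python
import numpy as np


def smd_lms_step(w, alpha, psi, x, y, meta_step=0.01, beta=0.1):
    """One linear-complexity SMD step for the loss 0.5 * (y - x @ w) ** 2."""
    g = (y - x @ w) * x   # g_t: negative gradient of the loss w.r.t. w
    h = x * x             # diagonal of the Hessian x x^T
    # Multiplicative step-size update. The previous step-size is used inside
    # the exponential, which is an assumption about the ordering.
    alpha = alpha * np.exp(meta_step * alpha * psi * g)
    # Base update w_{t+1} = w_t + alpha o g_t.
    w = w + alpha * g
    # Trace of d w / d alpha with forgetting factor beta.
    psi = (1.0 - beta) * psi - beta * alpha * h * psi + beta * g
    return w, alpha, psi


# Usage on a noise-free regression stream (synthetic data, for illustration only).
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
w, alpha, psi = np.zeros(3), np.full(3, 0.01), np.zeros(3)
for _ in range(1000):
    x = rng.normal(size=3)
    w, alpha, psi = smd_lms_step(w, alpha, psi, x, x @ w_true)
```

Because the step-size is updated multiplicatively, it stays positive without any projection, and every operation is element-wise, so the cost per step is linear in the number of weights.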
The addition of β: The original SMD algorithm did not use forgetting with β. In our experiments, however, we consider SMD with β, which performs significantly better, since our goal is not to compare directly with SMD, but rather to compare the choice of objectives.

Derivations for AdaGain updates
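Consider again the generic update

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \boldsymbol{\alpha} \circ \boldsymbol{\Delta}_t \tag{13}$$

where $\boldsymbol{\Delta}_t \in \mathbb{R}^d$ is the update for this step, for weights $\mathbf{w}_t \in \mathbb{R}^d$ and constant vector stepsize $\boldsymbol{\alpha}$, and the operator $\circ$ denotes element-wise multiplication.

Maintaining non-negative stepsizes in AdaGain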
One straightforward option to maintain non-negative stepsizes is to define a constraint on the stepsize. We can prevent the stepsize from going below a small threshold $\epsilon$ (e.g., $\epsilon$ = 0 . ), ensuring positive stepsizes. The projection onto this constraint set after each gradient descent step simply involves applying the operator $(\cdot)_\epsilon$, which thresholds any values below $\epsilon > 0$ to $\epsilon$. We experimented with this strategy compared to the mentioned exponential form, and found it performed relatively similarly, but required an extra parameter to tune.

Another option, and the one we use in this work, is to use an exponential form for the stepsize, so that it remains positive. One form, used also by IDBD, is to use $\boldsymbol{\alpha} = \exp(\boldsymbol{\beta})$. The algorithm, with or without an exponential form, remains essentially identical to the thresholded version, because

$$\frac{1}{2} \frac{\partial \|\boldsymbol{\Delta}_t(\boldsymbol{\alpha}(\boldsymbol{\beta}))\|_2^2}{\partial \beta_i} = \boldsymbol{\Delta}_t(\boldsymbol{\alpha}(\boldsymbol{\beta}))^\top \frac{\partial \boldsymbol{\Delta}_t(\boldsymbol{\alpha}(\boldsymbol{\beta}))}{\partial \alpha_i} \frac{\partial \alpha_i}{\partial \beta_i}.$$

Therefore, we can still recursively estimate the gradient with the same approach, regardless of how the stepsize $\boldsymbol{\alpha}$ is constrained. For the thresholded form, we simply use the gradient $\boldsymbol{\Delta}_t(\boldsymbol{\alpha}(\boldsymbol{\beta}))^\top \frac{\partial \boldsymbol{\Delta}_t(\boldsymbol{\alpha})}{\partial \alpha_i}$ and then project (i.e., threshold). For the exponential form, the gradient update for $\boldsymbol{\alpha}$ is simply used within an exponential function, as described below.

Consider directly maintaining $\boldsymbol{\beta}$, which is unconstrained. For the function form $\alpha_i = \exp(\beta_i)$, the partial derivative $\frac{\partial \alpha_i}{\partial \beta_i}$ is simply equal to $\alpha_i$, and so the gradient update includes an additional $\alpha_i$ in front. This can more explicitly be maintained, without an additional variable, by noticing that for gradient $g_i = \alpha_i\, \boldsymbol{\Delta}_t(\boldsymbol{\alpha}(\boldsymbol{\beta}))^\top \frac{\partial \boldsymbol{\Delta}_t(\boldsymbol{\alpha}(\boldsymbol{\beta}))}{\partial \alpha_i}$ for $\beta_{t,i}$,

$$\begin{aligned}
\alpha_{t+1,i} &= \exp(\beta_{t+1,i}) \\
&= \exp(\beta_{t,i} - \bar{\alpha} g_i) \\
&= \exp(\beta_{t,i}) \exp(-\bar{\alpha} g_i) \\
&= \alpha_{t,i} \exp(-\bar{\alpha} g_i).
\end{aligned}$$

Therefore, we can still directly maintain $\boldsymbol{\alpha}$.
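The equivalence between a gradient step on the unconstrained log step-size and a multiplicative update on the step-size itself can be checked numerically. The sketch below uses hypothetical function names and writes the log step-size as `log_alpha` to avoid a clash with the forgetting factor; the thresholded variant is included for comparison, with a threshold value chosen by the caller.

```python
import numpy as np


def stepsize_via_log(log_alpha, grad, meta_step):
    """Gradient step on the unconstrained log step-size, then map back."""
    return np.exp(log_alpha - meta_step * grad)


def stepsize_multiplicative(alpha, grad, meta_step):
    """Equivalent update that stores only alpha = exp(log_alpha)."""
    return alpha * np.exp(-meta_step * grad)


def stepsize_thresholded(alpha, grad_alpha, meta_step, eps):
    """Projected alternative: gradient step on alpha, then clip from below at eps."""
    return np.maximum(alpha - meta_step * grad_alpha, eps)


# The two exponential forms agree for any gradient, so only alpha needs storing.
log_alpha = np.array([-2.0, 0.0, 1.0])
grad = np.array([0.5, -1.0, 3.0])
same = np.allclose(
    stepsize_via_log(log_alpha, grad, 0.1),
    stepsize_multiplicative(np.exp(log_alpha), grad, 0.1),
)
```

The multiplicative form needs no projection and no extra threshold parameter, which matches the motivation given for preferring it.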
The resulting update to $\boldsymbol{\alpha}$ is simply

$$\boldsymbol{\alpha}_t = \boldsymbol{\alpha}_{t-1} \circ \exp\left(-\bar{\alpha}\, \boldsymbol{\alpha}_t \circ \hat{\boldsymbol{\psi}}_t \circ (\mathbf{G}_t^\top \boldsymbol{\Delta}_t)\right) \tag{14}$$

Other multiplicative updates are also possible. Schraudolph (1999) uses an exponential update, but uses an approximation with a maximum, to avoid the expensive computation of the exponential function. Baydin et al. (2018) uses a similar multiplicative update, but without a maximum.

AdaGain for linear TD
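In this section, we derive $\mathbf{g}_t$ for a particular algorithm, namely linear TD. LMS updates can be obtained as special cases, by setting $\gamma = 0$. We then provide a more general update algorithm, which does not require knowledge of the form of the update, in the next section. One advantage of AdaGain is that it is derived generically, allowing extensions to many online algorithms, unlike IDBD and variants, which are derived specifically for the squared TD-error.

We first provide the AdaGain updates for linear TD(λ), and then provide the derivation below. For TD(λ), the update is

$$\delta_t \overset{\text{def}}{=} r_{t+1} + \gamma_{t+1} \mathbf{x}_{t+1}^\top \mathbf{w}_t - \mathbf{x}_t^\top \mathbf{w}_t$$

$$\boldsymbol{\Delta}_t \overset{\text{def}}{=} \delta_t \mathbf{e}_t$$

$$\boldsymbol{\alpha}_t = \boldsymbol{\alpha}_{t-1} \circ \exp\left(-\bar{\alpha}\, (\boldsymbol{\Delta}_t^\top \mathbf{e}_t)\, \boldsymbol{\alpha}_{t-1} \circ \mathbf{d}_t \circ \hat{\boldsymbol{\psi}}_t\right) \tag{15}$$

$$\hat{\boldsymbol{\psi}}_{t+1} = (1 - \beta)\, \hat{\boldsymbol{\psi}}_t + \beta\, \boldsymbol{\alpha}_t \circ \mathbf{e}_t \circ \mathbf{d}_t \circ \hat{\boldsymbol{\psi}}_t + \beta\, \boldsymbol{\Delta}_t$$

where $\boldsymbol{\alpha}_0$ = 0 . , $\hat{\boldsymbol{\psi}}_0 = \mathbf{0}$, and $\gamma_t \overset{\text{def}}{=} \gamma(S_t, A_t, S_{t+1})$.

To derive the update for $\boldsymbol{\alpha}$, we need to compute the gradients of the updates, particularly $(\mathbf{g}_t)_i = \frac{\partial \Delta_{t,i}}{\partial w_{t,i}}$ or, for the full algorithm, the Jacobian $\mathbf{G}$.

$$\begin{aligned}
\frac{\partial \boldsymbol{\Delta}_t}{\partial w_{t,i}} &= \mathbf{e}_t \frac{\partial \delta_t}{\partial w_{t,i}} \\
&= \mathbf{e}_t \frac{\partial}{\partial w_{t,i}} \left( r_{t+1} + \gamma_{t+1} \mathbf{x}_{t+1}^\top \mathbf{w}_t - \mathbf{x}_t^\top \mathbf{w}_t \right) \\
&= \mathbf{e}_t\, (\gamma_{t+1} \mathbf{x}_{t+1} - \mathbf{x}_t)_i.
\end{aligned}$$

Letting $\mathbf{d}_t \overset{\text{def}}{=} \gamma_{t+1} \mathbf{x}_{t+1} - \mathbf{x}_t$, the Jacobian is $\mathbf{G}_t = \mathbf{e}_t \mathbf{d}_t^\top$ and the diagonal approximation is $\mathbf{g}_t = \mathbf{e}_t \circ \mathbf{d}_t$. Because of the form of the Jacobian, we can actually use it in the update to $\boldsymbol{\alpha}$, though not in computing $\hat{\boldsymbol{\psi}}_t$, if we want to maintain linearity. The quadratic complexity algorithm uses $\mathbf{G}$ as given

$$\boldsymbol{\alpha}_t = \boldsymbol{\alpha}_{t-1} \circ \exp\left(-\bar{\alpha}\, (\boldsymbol{\Delta}_t^\top \mathbf{d}_t)\, \boldsymbol{\alpha}_{t-1} \circ (\boldsymbol{\Psi}_t^\top \mathbf{e}_t)\right)$$

$$\boldsymbol{\psi}_{t,i} = (1 - \beta)\, \boldsymbol{\psi}_{t-1,i} + \beta\, \boldsymbol{\alpha} \circ (\mathbf{e}_{t-1} \mathbf{d}_{t-1}^\top \boldsymbol{\psi}_{t-1,i}) + \beta\, [0, \ldots, \Delta_{t-1,i}, \ldots, 0]^\top$$

The linear complexity algorithm uses $\mathbf{g}_t$ to update $\hat{\boldsymbol{\psi}}_t$, giving the stepsize update in (15)

$$\boldsymbol{\alpha}_t = \boldsymbol{\alpha}_{t-1} \circ \exp\left(-\bar{\alpha}\, (\boldsymbol{\Delta}_t^\top \mathbf{d}_t)\, \boldsymbol{\alpha}_{t-1} \circ \mathbf{e}_t \circ \hat{\boldsymbol{\psi}}_t\right)$$

$$\hat{\boldsymbol{\psi}}_{t+1} = (1 - \beta)\, \hat{\boldsymbol{\psi}}_t + \beta\, \boldsymbol{\alpha}_t \circ \mathbf{e}_t \circ \mathbf{d}_t \circ \hat{\boldsymbol{\psi}}_t + \beta\, \boldsymbol{\Delta}_t$$

Generic AdaGain algorithm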
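For contrast with the generic construction developed in this section, the closed-form linear TD(λ) updates can be written out by hand. The sketch below follows the form of Equation (15) as first stated, with exponent $(\boldsymbol{\Delta}_t^\top \mathbf{e}_t)\, \boldsymbol{\alpha}_{t-1} \circ \mathbf{d}_t \circ \hat{\boldsymbol{\psi}}_t$. The function name, the accumulating trace without importance-sampling corrections, the constant discount, and the default meta-parameters are all assumptions of this illustration.

```python
import numpy as np


def adagain_td_step(w, alpha, psi, e, x, r, x_next, gamma, lam,
                    meta_step=0.01, beta=0.1):
    """One linear-complexity AdaGain step for on-policy linear TD(lambda)."""
    e = gamma * lam * e + x                    # accumulating trace (assumed form)
    delta = r + gamma * (x_next @ w) - x @ w   # TD error delta_t
    update = delta * e                         # Delta_t = delta_t * e_t
    d = gamma * x_next - x                     # d_t = gamma * x_{t+1} - x_t
    # Step-size update in the form of Equation (15), using the previous alpha.
    alpha_new = alpha * np.exp(-meta_step * (update @ e) * alpha * d * psi)
    # Base update w_{t+1} = w_t + alpha o Delta_t.
    w_new = w + alpha_new * update
    # Diagonal Jacobian approximation g_t = e_t o d_t inside the psi recursion.
    psi_new = (1.0 - beta) * psi + beta * alpha_new * e * d * psi + beta * update
    return w_new, alpha_new, psi_new, e
```

Every derivative here is specific to the TD(λ) update, which is exactly the requirement the finite-difference approach removes.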
To avoid requiring knowledge about the algorithm update and its derivatives, we can provide an approximation to the Jacobian-vector product and the diagonal of the Jacobian, using finite differences. As long as the update function for the algorithm can be queried multiple times, this algorithm can be easily applied to any update.

To compute the Jacobian-vector product, we use the fact that this corresponds to a directional derivative. Notice that $\mathbf{G}_t^\top \boldsymbol{\Delta}_t$ corresponds to the vector of directional derivatives for each component (function) in the update $\boldsymbol{\Delta}_t$, in the direction of $\mathbf{u} = \boldsymbol{\Delta}_t$, because the dot-product separates in $\mathbf{G}_{t,1}^\top \mathbf{u}, \ldots, \mathbf{G}_{t,k}^\top \mathbf{u}$. Therefore, for update function $\Delta : \mathbb{R}^k \to \mathbb{R}^k$ (such as the gradient of the loss), we get for small $r$ = 0 . ,

$$\mathbf{G}_t^\top \boldsymbol{\Delta}_t \approx \frac{\Delta(\mathbf{w} + r\mathbf{u}) - \Delta(\mathbf{w} - r\mathbf{u})}{2r} \tag{16}$$

For the diagonal of the Jacobian, we can again use finite differences. An efficient finite difference computation is proposed within the simultaneous perturbation stochastic approximation algorithm (Spall 1992), which uses a random perturbation vector $\boldsymbol{\epsilon}$ to compute the centered difference $\frac{(\Delta(\mathbf{w} + r\boldsymbol{\epsilon}) - \Delta(\mathbf{w} - r\boldsymbol{\epsilon}))_i}{2 r \epsilon_i}$. This formula provides an approximation to the gradient of the $i$-th entry in the update $\boldsymbol{\Delta}_t$ with respect to weight $i$; when computed for all $i$, this approximates the diagonal of the Jacobian $\hat{\mathbf{j}}_t$. To avoid additional computation, we can re-use the above difference with perturbation $\mathbf{u}$, rather than a random vector $\boldsymbol{\epsilon}$. To avoid division by zero, if $\mathbf{u}$ contains a zero entry, we threshold the normalization with a small constant $10^{-}$ to give

$$\hat{\mathbf{j}}_t \approx \frac{\Delta(\mathbf{w} + r\mathbf{u}) - \Delta(\mathbf{w} - r\mathbf{u})}{2r} \circ \frac{1}{\operatorname{sign}(\mathbf{u}) \max(10^{-}, |\mathbf{u}|)} \tag{17}$$

where division is element-wise. Another approach would be to sample a random direction $\boldsymbol{\epsilon}$ for this finite difference and use $\Delta(\mathbf{w} + \boldsymbol{\epsilon}) - \Delta(\mathbf{w})$, divided by the absolute value of each element of $\boldsymbol{\epsilon}$.
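We found empirically that using the same direction as $\boldsymbol{\Delta}_t$ was actually more effective, and more computationally efficient, so we propose that approach.

Using these approximations, we can compute the update to the stepsize as in Equation (6), repeated here for easy reference

$$\boldsymbol{\alpha}_t = \boldsymbol{\alpha}_{t-1} \circ \exp\left(-\bar{\alpha}\, \boldsymbol{\alpha}_{t-1} \circ \hat{\boldsymbol{\psi}}_t \circ (\mathbf{G}_t^\top \boldsymbol{\Delta}_t)\right)$$

$$\hat{\boldsymbol{\psi}}_{t+1} = (1 - \beta)\, \hat{\boldsymbol{\psi}}_t + \beta\, \boldsymbol{\alpha}_t \circ \hat{\mathbf{j}}_t \circ \hat{\boldsymbol{\psi}}_t + \beta\, \boldsymbol{\Delta}_t.$$

[Plot residue. x-axis: time steps from 90000 to 91200. Legend: Offline solution, True returns.]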
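The generic updates above can be sketched for any update function that can be queried at perturbed weights. In the sketch below, the function name, the perturbation size `r`, the threshold `floor`, and the default meta-parameters are hypothetical values chosen for illustration, since the corresponding constants are not recoverable here. One deliberate deviation from Equation (17): the sign of a zero entry is treated as +1, because sign(0) = 0 would otherwise still divide by zero.

```python
import numpy as np


def adagain_generic_step(w, alpha, psi, update_fn, meta_step=0.01, beta=0.1,
                         r=1e-3, floor=1e-8):
    """One generic AdaGain step using finite differences of update_fn."""
    u = update_fn(w)  # Delta_t, also used as the probe direction
    # Central difference along u, approximating G^T Delta as in Equation (16).
    diff = (update_fn(w + r * u) - update_fn(w - r * u)) / (2.0 * r)
    # Re-use the same difference for the Jacobian diagonal, Equation (17).
    sign = np.where(u >= 0, 1.0, -1.0)  # sign(0) treated as +1 (assumption)
    j_diag = diff / (sign * np.maximum(floor, np.abs(u)))
    # Multiplicative step-size update with the previous alpha in the exponent.
    alpha_new = alpha * np.exp(-meta_step * alpha * psi * diff)
    w_new = w + alpha_new * u
    psi_new = (1.0 - beta) * psi + beta * alpha_new * j_diag * psi + beta * u
    return w_new, alpha_new, psi_new


# Usage with an LMS update on synthetic, noise-free data (illustration only).
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
w, alpha, psi = np.zeros(3), np.full(3, 0.01), np.zeros(3)
for _ in range(1000):
    x = rng.normal(size=3)
    y = x @ w_true
    w, alpha, psi = adagain_generic_step(
        w, alpha, psi, lambda v, x=x, y=y: (y - x @ v) * x)
```

Each step costs three evaluations of the update function, and nothing in the routine depends on the form of the update, so the same code could wrap a semi-gradient or accelerated update.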