Trust Region Value Optimization using Kalman Filtering
Shirli Di-Castro Shashua, Shie Mannor
Technion, Israel

Abstract
Policy evaluation is a key process in reinforcement learning. It assesses a given policy using estimation of the corresponding value function. When using a parameterized function to approximate the value, it is common to optimize the set of parameters by minimizing the sum of squared Bellman Temporal Differences errors. However, this approach ignores certain distributional properties of both the errors and value parameters. Taking these distributions into account in the optimization process can provide useful information on the amount of confidence in value estimation. In this work we propose to optimize the value by minimizing a regularized objective function which forms a trust region over its parameters. We present a novel optimization method, the Kalman Optimization for Value Approximation (KOVA), based on the Extended Kalman Filter. KOVA minimizes the regularized objective function by adopting a Bayesian perspective over both the value parameters and noisy observed returns. This distributional property provides information on parameter uncertainty in addition to value estimates. We provide theoretical results of our approach and analyze the performance of our proposed optimizer on domains with large state and action spaces.
1. Introduction
Reinforcement learning (RL) solves sequential decision making problems by considering an agent that interacts with the environment and seeks the optimal policy (Sutton & Barto, 1998). During the learning process, the agent is required to evaluate its policies using a value function. In many real world RL domains, such as robotics, games and autonomous driving cars, the state and action spaces are large, hence the value function is approximated, e.g., using a Deep Neural Network (DNN). A common approach is to optimize a set of parameters by minimizing the sum of squared Bellman Temporal Differences (TD) errors (Dann et al., 2014). There are two underlying assumptions in this approach: first, the value and its parameters are deterministic; second, the Bellman TD errors are independent Gaussian random variables (RVs) with zero mean and a fixed variance. Although this is a commonly used objective function, these underlying assumptions may not be suitable for the policy evaluation task in RL. Distributional RL (Bellemare et al., 2017) addresses the second assumption and argues in favor of a full distribution perspective over the sum of discounted rewards for a fixed policy. In particular, learning this distribution is meaningful in the presence of value approximation. However, in their formulation the value parameters are still considered deterministic and they do not provide an amount of confidence for the value estimates.

Treating the value or its parameters as RVs has been investigated in the RL literature. Engel et al. (2003; 2005) used Gaussian Processes (GP) for the value and the return to capture uncertainties in policy evaluation. Geist & Pietquin (2010) proposed to use the Unscented Kalman filter (UKF) to learn the uncertainty in value parameters. Their formulation requires many samples of parameters in each training step, which is not feasible in Deep Reinforcement Learning (DRL) with large state and action spaces.

Motivated by the works of Engel et al. (2003; 2005) and Geist & Pietquin (2010), we present in this work a unified framework for addressing uncertainties while approximating the value in DRL domains. Our framework incorporates the well-known Kalman filter estimation techniques with RL principles to improve value approximation. The Kalman filter (Kalman et al., 1960) and its variant for nonlinear approximations, the Extended Kalman filter (EKF) (Anderson & Moore, 1979; Gelb, 1974), are used for on-line tracking and for estimating states in dynamic environments through indirect noisy observations. These methods have been successfully applied to numerous control dynamic systems such as navigation and target tracking (Särkkä, 2013). The Kalman filter can also be used for parameter estimation in approximation functions, where parameters replace the states of dynamic systems.

We develop a new optimization method for policy evaluation based on the EKF formulation. Figure 1 illustrates our Bayesian perspective over value parameters and noisy observed returns.
Our proposed method has the following properties: it forms a trust region over the value parameters, based on their uncertainty covariance; it is aimed at tracking the solution rather than converging to it; it incrementally updates the parameters and the error covariance matrix, hence avoids sampling the parameters as is often required in Bayesian methods; it adjusts a suitable learning rate to each individual parameter through the Kalman gain, thus the learning procedure does not depend on the parameterization of the value.

Our main contributions are: (1) Developing a new regularized objective function for approximating values in the policy evaluation task. The regularization term accounts for both parameter and observation uncertainties. (2) Presenting a novel optimization algorithm, Kalman Optimization for Value Approximation (KOVA), and proving that it minimizes at each time step the regularized objective function. This optimizer can be easily plugged into any policy optimization algorithm and improve it. (3) Beyond the RL context, we present the connection between EKF and the incremental Gauss-Newton method, the on-line natural gradient and the Kullback-Leibler (KL) divergence, and explain how our objective function forms a trust region over the value parameters. (4) Demonstrating the improvement achieved by our optimizer on several control tasks with large state and action spaces.
2. Background
The standard RL setting considers an interaction of an agent with an environment E for a discrete number of time steps. The environment is modeled as a Markov Decision Process (MDP) $\{\mathcal{S}, \mathcal{A}, P, R, \gamma\}$ where $\mathcal{S}$ is a finite set of states, $\mathcal{A}$ is a finite set of actions, $P : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0,1]$ is the state transition probability for each state $s$ and action $a$, $R : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is a deterministic and bounded reward function and $\gamma$ is a discount factor. At each time step $t$, the agent observes state $s_t \in \mathcal{S}$ and chooses action $a_t \in \mathcal{A}$ according to a policy $\pi : \mathcal{S} \times \mathcal{A} \to [0,1]$. The agent receives an immediate reward $r_t(s_t, a_t)$ and the environment stochastically steps to state $s_{t+1} \in \mathcal{S}$ according to the probability distribution $P(s_{t+1} | s_t, a_t)$. The state value function and the state-action Q-function are used for evaluating the performance of a fixed policy $\pi$ (Sutton & Barto, 1998):
$$V^\pi(s) = \mathbb{E}_\pi\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\Big|\, s_0 = s\Big], \qquad Q^\pi(s,a) = \mathbb{E}_\pi\Big[\sum_{t=0}^{\infty} \gamma^t r_t(s_t, a_t) \,\Big|\, s_0 = s, a_0 = a\Big],$$
where $\mathbb{E}_\pi$ denotes the expectation with respect to the state (state-action) distribution induced by the transition law $P$ and policy $\pi$.
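For reference, these definitions satisfy the Bellman recursions; this standard identity is not spelled out in the text, but it is what the TD targets listed later in Table 1 exploit:
$$V^\pi(s) = \mathbb{E}_{a \sim \pi(\cdot|s),\, s' \sim P(\cdot|s,a)}\big[r(s,a) + \gamma V^\pi(s')\big], \qquad Q^\pi(s,a) = r(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot|s,a),\, a' \sim \pi(\cdot|s')}\big[Q^\pi(s',a')\big].$$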
Figure 1.
Illustration of our proposed model: a Bayesian perspective for the policy evaluation problem in RL. The noisy observation y(u) for an input u (for example, u is a state or a state-action pair and y(u) is a sum of discounted n-step rewards from this state) is decomposed into its mean, the value h(u; θ), and a random zero-mean noise n. The randomness of y(u) originates from two sources: (i) the random noise n, which relates to the stochasticity of the transitions in the trajectory and to the possibly random policy; (ii) the randomness of h through its dependency on the random parameters θ. In the context of RL, this randomness can be related to uncertainty regarding the MDP model that generated the noisy observations.

Policy evaluation, or value estimation, is a core element in RL algorithms. We will use the term value function (VF) to address the following functions: the state value function $V^\pi(s)$, the state-action Q-function $Q^\pi(s,a)$ and the advantage function $A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$. When the state or action space is large, a common approach is to approximate the VF using a parameterized function, h(·; θ). We focus on general, possibly non-linear approximation functions such as DNNs that can effectively learn complex approximations.

A common approach for optimizing the VF parameters is to minimize at each time step t the empirical mean of the squared Bellman TD error $\delta(u; \theta_t) \triangleq y(u) - h(u; \theta_t)$, over a batch of N samples generated from the environment E under a given policy:
$$\mathcal{L}^{\text{MLE}}_t(\theta_t) = \frac{1}{2N} \sum_{i=1}^{N} \delta^2(u_i; \theta_t). \qquad (1)$$
We use the general notation u to specify the input for the target label y(u) and for the approximated value at time t, h(u; θ_t). For example, for h(u; θ_t) = V(s_m; θ_t), u = s_m is the state at a discrete time m; for h(u; θ_t) = Q(s_m, a_m; θ_t), u = (s_m, a_m) is the state-action pair. In Table 1 we provide examples of several options for y(u) and h(u; θ_t), which clarify how this general notation can be utilized in known policy optimization algorithms.
Table 1. Different examples of policy optimization algorithms and their Bellman TD error δ(u; θ_t) type. The decomposition of δ(u; θ_t) into the observation function h(u; θ_t) and the target label y(u) in the EKF model (2) enables the integration of our KOVA optimizer with any policy optimization algorithm. θ' refers to the previous network or to a target network, different from the one being trained, θ_t.

Algorithm type | Example | δ(u; θ_t) type | h(u; θ_t) | y(u)
Actor-critic | A3C (Mnih et al., 2016) | k-step V-evaluation | $V(s_m; \theta_t)$ | $\sum_{i=0}^{k-1} \gamma^i r_{m+i} + \gamma^k V(s_{m+k}; \theta')$
Actor-critic | DDPG (Lillicrap et al., 2015) | 1-step Q-evaluation | $Q(s, a; \theta_t)$ | $r + \gamma Q(s', \pi(s'); \theta')$
Policy gradient | PPO (Schulman et al., 2017), TRPO (Schulman et al., 2015a) | GAE (Schulman et al., 2015b) | $V(s_m; \theta_t)$ | $\sum_{i=0}^{\infty} (\gamma\lambda)^i \big(r_{m+i} + \gamma V(s_{m+i+1}; \theta') - V(s_{m+i}; \theta')\big) + V(s_m; \theta')$
ε-greedy | DQN (Mnih et al., 2013) | Optimality equation | $Q(s, a; \theta_t)$ | $r + \gamma \max_{a'} Q(s', a'; \theta')$

Traditionally, the VF is trained by stochastic gradient descent methods, estimating the loss on each experience as it is encountered, yielding the update:
$$\theta_{t+1} \leftarrow \theta_t + \alpha\, \mathbb{E}_{u \sim p(\cdot)}\big[\big(y(u) - h(u; \theta_t)\big)\, \nabla_{\theta_t} h(u; \theta_t)\big],$$
where α is the learning rate and p(·) is the experience distribution. Typically, the training procedure seeks a point estimate of the model parameters. We will show (Section 3) that the underlying assumption in $\mathcal{L}^{\text{MLE}}_t$ (1) is that the parameters θ_t are deterministic and that the target labels y(u) are independent Gaussian RVs with mean h(u; θ_t) and a fixed variance. In Section 2.3 we present the EKF approach, which generalizes the process of generating observations and adds flexibility to the model assumptions: the parameters may be viewed as RVs and the variance of the target label may change between observations.

2.3. Extended Kalman Filter (EKF)

In this section we briefly outline the Extended Kalman filter (Anderson & Moore, 1979; Gelb, 1974). The EKF is a standard technique for estimating the state of a nonlinear dynamic system or for learning the parameters of a nonlinear approximation function. In this paper we will focus on its latter role, meaning estimating θ. The EKF considers the following model:
$$\begin{cases} \theta_t = \theta_{t-1} + v_t \\ y(u_t) = h(u_t; \theta_t) + n_t, \end{cases} \qquad (2)$$
where $\theta_t \in \mathbb{R}^{d \times 1}$ are the parameters evaluated at time t, $y(u_t)$ is the N-dimensional observation vector at time t:
$$y(u_t) = [y(u_t^1), y(u_t^2), \ldots, y(u_t^N)]^\top \in \mathbb{R}^{N \times 1}, \qquad (3)$$
and $h(u_t; \theta_t) \in \mathbb{R}^{N \times 1}$ is an N-dimensional vector, where h(u; θ) is a nonlinear observation function with input u and parameters θ:
$$h(u_t; \theta_t) = [h(u_t^1; \theta_t), h(u_t^2; \theta_t), \ldots, h(u_t^N; \theta_t)]^\top. \qquad (4)$$
$v_t$ is the evolution noise and $n_t$ is the observation noise, both modeled as additive white noises with covariances $P_{v_t}$ and $P_{n_t}$, respectively. As seen in the model presented in Equation (2), the EKF treats the parameters θ_t as RVs, similarly to Bayesian approaches. According to this perspective, the parameters belong to an uncertainty set Θ governed by the mean and covariance of the parameter distribution.

The estimation at time t, denoted $\hat{\theta}_{t|\cdot}$, is the conditional expectation of the parameters with respect to the observed data.
The EKF formulation distinguishes between estimates that are based on observations up to time t, $\hat{\theta}_{t|t} \triangleq \mathbb{E}[\theta_t | y^t]$, and observations up to time t−1, $\hat{\theta}_{t|t-1} \triangleq \mathbb{E}[\theta_t | y^{t-1}] = \hat{\theta}_{t-1|t-1}$. With some abuse of notation, $y^{t'}$ are the observations gathered up to time t': $y(u_1), \ldots, y(u_{t'})$. The parameter errors are defined by $\tilde{\theta}_{t|t} \triangleq \theta_t - \hat{\theta}_{t|t}$ and $\tilde{\theta}_{t|t-1} \triangleq \theta_t - \hat{\theta}_{t|t-1}$. The conditional error covariances are given by:
$$P_{t|t} \triangleq \mathbb{E}\big[\tilde{\theta}_{t|t} \tilde{\theta}_{t|t}^\top \,|\, y^t\big], \qquad P_{t|t-1} \triangleq \mathbb{E}\big[\tilde{\theta}_{t|t-1} \tilde{\theta}_{t|t-1}^\top \,|\, y^{t-1}\big] = P_{t-1|t-1} + P_{v_t}.$$
The EKF considers several statistics of interest at each time step: the prediction of the observation function, the observation innovation, the covariance between the parameter error and the innovation, the covariance of the innovation and the Kalman gain, defined respectively in Equations (5)-(9):
$$\hat{y}_{t|t-1} \triangleq \mathbb{E}[h(u_t; \theta_t) | y^{t-1}], \qquad (5)$$
$$\tilde{y}_{t|t-1} \triangleq h(u_t; \theta_t) - \hat{y}_{t|t-1}, \qquad (6)$$
$$P_{\tilde{\theta}_t, \tilde{y}_t} \triangleq \mathbb{E}[\tilde{\theta}_{t|t-1} \tilde{y}_{t|t-1}^\top | y^{t-1}], \qquad (7)$$
$$P_{\tilde{y}_t} \triangleq \mathbb{E}[\tilde{y}_{t|t-1} \tilde{y}_{t|t-1}^\top | y^{t-1}] + P_{n_t}, \qquad (8)$$
$$K_t \triangleq P_{\tilde{\theta}_t, \tilde{y}_t} P_{\tilde{y}_t}^{-1}. \qquad (9)$$
The above statistics serve for the EKF updates:
$$\begin{cases} \hat{\theta}^{\text{EKF}}_{t|t} = \hat{\theta}_{t|t-1} + K_t \big(y(u_t) - h(u_t; \hat{\theta}_{t|t-1})\big), \\ P_{t|t} = P_{t|t-1} - K_t P_{\tilde{y}_t} K_t^\top. \end{cases} \qquad (10)$$
In the next section we present how to use the EKF formulation in order to approximate VFs while considering uncertainty both in the parameters and in the noisy observations.
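As a simple worked example (ours, for intuition only), consider the scalar linear case d = N = 1 with $h(u_t; \theta_t) = x_t \theta_t$ for a known scalar feature $x_t$, evolution noise variance $P_{v_t} = q$ and observation noise variance $P_{n_t} = \sigma_n^2$. Then Equations (5)-(10) reduce to
$$P_{t|t-1} = P_{t-1|t-1} + q, \qquad K_t = \frac{P_{t|t-1}\, x_t}{x_t^2 P_{t|t-1} + \sigma_n^2}, \qquad \hat{\theta}_{t|t} = \hat{\theta}_{t|t-1} + K_t\big(y(u_t) - x_t \hat{\theta}_{t|t-1}\big), \qquad P_{t|t} = (1 - K_t x_t)\, P_{t|t-1},$$
so the Kalman gain acts as a data-dependent learning rate that shrinks as the parameter uncertainty $P_{t|t-1}$ decreases or as the observation noise $\sigma_n^2$ grows.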
3. EKF for Value Function Approximation
We now derive a novel regularized objective function and argue in its favor for optimizing value functions in RL. We use general notations in order to enable integration of our proposed VF optimization method with any policy optimization algorithm. The main idea is to decompose the Bellman TD error vector δ(u_t; θ_t) into two parts:
$$\delta(u_t; \theta_t) = y(u_t) - h(u_t; \theta_t) = [\delta(u_t^1; \theta_t), \ldots, \delta(u_t^N; \theta_t)]^\top.$$
(i) The observation at time t, y(u_t), is a vector that contains N target labels $y(u_t^1), \ldots, y(u_t^N)$. (ii) The observation function may be one of the following: h(u; θ_t) = V(s; θ_t), the state value function; Q(s, a; θ_t), the state-action Q-function; or A(s, a; θ_t), the advantage function. The observation functions for N inputs are concatenated into the N-dimensional vector h(u_t; θ_t), as presented in Equation (4). In Table 1 we provide several examples for the Bellman TD error decomposition according to the chosen policy optimization algorithm.

Our goal is to estimate the parameters θ_t. One way is to learn them by maximum likelihood estimation (MLE) using stochastic gradient descent methods: $\theta^{\text{MLE}} = \arg\max_\theta \log p(y^t | \theta)$. This forms the objective function in Equation (1). Another way is learning them by a Bayesian approach, which uses Bayes rule and adds prior knowledge over the parameters, p(θ), to calculate the maximum a-posteriori (MAP) estimator: $\theta^{\text{MAP}} = \arg\max_\theta \log p(\theta | y^t) = \arg\max_\theta \log p(y^t | \theta) + \log p(\theta)$. Given the observations gathered up to time t, we can re-write the MAP estimator:
$$\theta^{\text{MAP}}_t = \arg\max_{\theta_t}\; \log p(y_t | \theta_t) + \log p(\theta_t | y^{t-1}). \qquad (11)$$
Here, instead of using the parameters prior, we use an equivalent derivation for the parameters posterior conditioned on $y^t$, based on the likelihood of a single observation $y_t \triangleq y(u_t)$ and the posterior conditioned on $y^{t-1}$ (Van Der Merwe, 2004). This derivation is a key step for making the incremental Kalman updates and for defining the objective function in Equation (12). In order to define the likelihood $p(y_t | \theta_t)$ and the posterior $p(\theta_t | y^{t-1})$, we adopt the EKF model (2) and make the following assumptions:
Assumption 1. The likelihood $p(y(u_t) | \theta_t)$ is assumed to be Gaussian: $y(u_t) | \theta_t \sim \mathcal{N}(h(u_t; \theta_t), P_{n_t})$.
Assumption 2. The posterior distribution $p(\theta_t | y^{t-1})$ is assumed to be Gaussian: $\theta_t | y^{t-1} \sim \mathcal{N}(\hat{\theta}_{t|t-1}, P_{t|t-1})$.

These assumptions are common when using the EKF. In the context of RL, these assumptions add the flexibility we want: the value is treated as a RV and information is gathered on the uncertainty of its estimate. In addition, the noisy observations (the target labels) can have different variances and can even be correlated. Based on these Gaussian assumptions, we can derive the following theorem:
Theorem 1.
Under Assumptions 1 and 2, $\hat{\theta}^{\text{EKF}}_{t|t}$ (10) minimizes at each t the following regularized objective function:
$$\mathcal{L}^{\text{EKF}}_t(\theta_t) = \frac{1}{2}\, \delta(u_t; \theta_t)^\top P_{n_t}^{-1}\, \delta(u_t; \theta_t) + \frac{1}{2}\, (\theta_t - \hat{\theta}_{t|t-1})^\top P_{t|t-1}^{-1}\, (\theta_t - \hat{\theta}_{t|t-1}), \qquad (12)$$
where $\hat{\theta}^{\text{EKF}}_{t|t} \in \arg\min_{\theta_t} \mathcal{L}^{\text{EKF}}_t(\theta_t)$.

The proof for Theorem 1 appears in the supplementary material. It is based on solving the maximization problem in Equation (11) using the EKF model (2) and the Gaussian Assumptions 1 and 2.

We now explicitly write the expressions for the statistics of interest in Equations (5)-(9) (see the supplementary material for more detailed derivations). The derivations are based on the first-order Taylor series linearization of the observation function: $h(u_t; \theta_t) = h(u_t; \hat{\theta}) + \nabla_{\theta_t} h(u_t; \hat{\theta})^\top (\theta_t - \hat{\theta})$, where
$$\nabla_{\theta_t} h(u_t; \hat{\theta}) = \big[\nabla_{\theta_t} h(u_t^1; \hat{\theta}), \nabla_{\theta_t} h(u_t^2; \hat{\theta}), \ldots, \nabla_{\theta_t} h(u_t^N; \hat{\theta})\big] \in \mathbb{R}^{d \times N} \qquad (13)$$
and $\hat{\theta}$ is typically chosen to be the previous estimate of the parameters at time t−1, $\hat{\theta} = \hat{\theta}_{t|t-1}$. The prediction of the observation function is $\hat{y}_{t|t-1} = h(u_t; \hat{\theta})$, the covariance between the parameter error and the innovation is $P_{\tilde{\theta}_t, \tilde{y}_t} = P_{t|t-1} \nabla_{\theta_t} h(u_t; \hat{\theta})$ and the covariance of the innovation is:
$$P_{\tilde{y}_t} = \nabla_{\theta_t} h(u_t; \hat{\theta})^\top P_{t|t-1} \nabla_{\theta_t} h(u_t; \hat{\theta}) + P_{n_t}. \qquad (14)$$
The Kalman gain then becomes:
$$K_t = P_{t|t-1} \nabla_{\theta_t} h(u_t; \hat{\theta}) \big(\nabla_{\theta_t} h(u_t; \hat{\theta})^\top P_{t|t-1} \nabla_{\theta_t} h(u_t; \hat{\theta}) + P_{n_t}\big)^{-1}. \qquad (15)$$
This Kalman gain is used in the parameters update and the error covariance update in Equation (10).

3.1. $\mathcal{L}^{\text{EKF}}_t$ and $\mathcal{L}^{\text{MLE}}_t$ for Optimizing Value Functions

We argue in favor of using the regularized objective function $\mathcal{L}^{\text{EKF}}_t(\theta_t)$ (12) for optimizing VFs instead of the commonly used objective function $\mathcal{L}^{\text{MLE}}_t(\theta_t)$ (1). Corollary 1 will assist us in discussing and comparing the two objective functions:
Corollary 1. Under Assumptions 1 and 2, consider a diagonal covariance $P_{n_t}$ with diagonal elements $\sigma_i^2 = N$ and assume $P_{0|0}^{-1} = P_{v_t} = 0$; then $\mathcal{L}^{\text{EKF}}_t(\theta_t) = \mathcal{L}^{\text{MLE}}_t(\theta_t)$.

The proof is given in the supplementary material. According to Corollary 1, the two objective functions are the same if we consider the parameters as deterministic and if we assume that the noisy target labels have a fixed variance. Indeed, with $P_{n_t} = N \cdot I$ the first term of (12) reduces to $\frac{1}{2N}\sum_{i=1}^{N} \delta^2(u_t^i; \theta_t)$, and with $P_{t|t-1}^{-1} = 0$ the regularization term vanishes.

So what are the differences between the two objective functions? First, $\mathcal{L}^{\text{EKF}}_t$ is a regularized version of $\mathcal{L}^{\text{MLE}}_t$: the regularization causes the parameters θ_t to track the recent parameter estimate, $\hat{\theta}_{t|t-1}$, stabilizing the estimation process. The error between the successive estimates is weighted with the inverse of the uncertainty information $P_{t|t-1}$. $\mathcal{L}^{\text{MLE}}_t$ does not include a regularization term, meaning it does not account for parametrization uncertainties. Note that when adding a standard L2 regularization to $\mathcal{L}^{\text{MLE}}_t$, as is common in DNNs, it reflects staying close to the zero vector, which is not always desired.

Second, $\mathcal{L}^{\text{EKF}}_t$ weights the squared Bellman TD error vector δ(u_t; θ_t) with $P_{n_t}^{-1}$, which can be interpreted as an additional regularization technique. $P_{n_t}$ can be viewed as the amount of confidence we have in the observations, as defined in the EKF model (2): if the observations are noisy, we should consider larger values for the diagonal elements in the covariance $P_{n_t}$. In addition, $\mathcal{L}^{\text{EKF}}_t$ allows us to model correlations between observation errors, unlike the i.i.d. assumption in $\mathcal{L}^{\text{MLE}}_t$. In Section 5 we discuss possible options for $P_{n_t}$.

Looking at the parameters update in Equation (10) and the definition of the Kalman gain $K_t$ in Equation (15), we can see that the Kalman gain propagates the new information from the noisy target labels back down into the parameters uncertainty set Θ, before combining it with the estimated parameter value. In fact, $K_t$ can be interpreted as an adaptive learning rate for each individual parameter that implicitly incorporates the uncertainty of each parameter. This approach resembles familiar stochastic gradient optimization methods such as Adagrad (Duchi et al., 2011), AdaDelta (Zeiler, 2012), RMSprop (Tieleman & Hinton, 2012) and Adam (Kingma & Ba, 2014), for different choices of $P_{t|t-1}$ and $P_{n_t}$. We refer the reader to Ruder (2016).

When looking at $\mathcal{L}^{\text{EKF}}_t(\theta_t)$, the reader may ask what $\hat{\theta}_{t|t-1}$ and $P_{t|t-1}$ stand for. When we are estimating the VF parameters for a fixed policy $\pi_{\text{old}}$, our objective function imposes a trust region method in each iteration of a batch optimization procedure. The trust region helps us avoid over-fitting to the most recent batch of data. In this case $\hat{\theta}^{\pi_{\text{old}}}_{t|t-1}$ is the last evaluation of the VF parameters for the same fixed policy $\pi_{\text{old}}$, and $P^{\pi_{\text{old}}}_{t|t-1}$ is the conditional error covariance between the new parameter estimate and the previous one, again for the same fixed policy (the superscript $\pi_{\text{old}}$ emphasizes that the VF parameters correspond to evaluating the same policy). When we change policies and start to evaluate the VF parameters of $\pi_{\text{new}}$, we set $\theta^{\pi_{\text{new}}}_{0|0} = \hat{\theta}^{\pi_{\text{old}}}_{t|t}$ and $P^{\pi_{\text{new}}}_{0|0} = P^{\pi_{\text{old}}}_{t|t}$, meaning we start a new estimation procedure at t = 0 for the new policy.

The EKF may be viewed as an on-line natural gradient algorithm (Amari, 1998) that uses the Fisher information matrix $J_t$ (Ollivier et al., 2018).
In this setting, the connection between the error covariance matrix and the Fisher matrix is given by $P_{t|t}^{-1} = (t+1) J_t$. This insight suggests that the regularization term in $\mathcal{L}^{\text{EKF}}_t$ is actually a second-order approximation of the KL-divergence between the previous parameter estimate and the current one, since for a small parameter change Δθ the KL-divergence expands as $D_{\text{KL}}\big(P(\theta) \,\|\, P(\theta + \Delta\theta)\big) \approx \frac{1}{2} \Delta\theta^\top J_t\, \Delta\theta$. Combining these insights together, we conclude that our proposed method can be viewed as a natural gradient algorithm for VF approximation. Similarly, the EKF may be viewed as an incremental version of the Gauss-Newton method, which is a common iterative method for solving least squares problems (Bertsekas, 1996). When updating the parameters, Gauss-Newton uses the matrix $H = \mathbb{E}[J^\top J]$ where $J$ is the Jacobian of $h(\theta_t)$. When the observations are assumed to be Gaussian (as we assume in Assumption 1), $H$ is equivalent to the Fisher information matrix.

The following theorem formalizes the connection between EKF and two separate KL divergences:
Theorem 2. Assume the inputs u are drawn independently from a training distribution $\hat{Q}_u$ with density function q(u), and assume the corresponding observations y are drawn from a conditional training distribution $\hat{Q}_{y|u}$ with density function q(y|u). Let $Q_{u,y}$ be the joint distribution whose density is q(u, y) = q(y|u) q(u), and let $P_{u,y}(\theta)$ be the learned distribution, whose density is p(u, y | θ) = p(y | u, θ) q(u). Under Assumptions 1 and 2, consider a diagonal covariance $P_{n_t}$ with diagonal elements $\sigma_i^2 = N$; then:
$$\mathcal{L}^{\text{EKF}}_t(\theta_t) = C + N\, \mathbb{E}_{\hat{Q}_u}\big[D_{\text{KL}}\big(\hat{Q}_{y|u} \,\|\, P_{y|u}(\theta)\big)\big] + t \cdot D_{\text{KL}}\big(P_{u,y}(\theta + \Delta\theta) \,\|\, P_{u,y}(\theta)\big) + O(\|\Delta\theta\|^3),$$
where $C = \log\big((2\pi)^{N/2} |P_{n_t}|^{1/2}\big)$.

Theorem 2 illustrates how the EKF is aimed at minimizing two separate KL-divergences. The first is the KL divergence between two conditional distributions, $\hat{Q}_{y|u}$ and $P_{y|u}(\theta)$. This term is equivalent to the loss in $\mathcal{L}^{\text{MLE}}_t$ (1). The second is the KL divergence between two different parameterizations of the joint learned distribution $P_{u,y}$. This is the term which imposes a trust region on the VF parameters in $\mathcal{L}^{\text{EKF}}_t$ (12). The proof for Theorem 2 appears in the supplementary material.

We now derive a practical algorithm for approximating VFs by minimizing the objective function $\mathcal{L}^{\text{EKF}}_t$ (12). In practice we use the update Equations (10) and the Kalman gain Equations (14)-(15) in order to avoid inverting $P_{t|t-1}$.
In addition, we add a fixed learning rate α to smooth the update. The KOVA optimizer is presented in Algorithm 1 and illustrated in Figure 2. Notice that R is a samples generator whose structure depends on the policy algorithm for which KOVA is used as a VF optimizer. R can contain trajectories from a fixed policy or it can be an experience replay which contains transitions from several different policies.

Figure 2. KOVA optimizer block diagram. KOVA receives as input the initial general prior $P_{0|0}$ and the covariances $P_{v_t}$ and $P_{n_t}$. It initializes $\hat{\theta}_{0|0}$ with small random values or with the VF parameters of the previous policy (see the discussion in Section 3.1). For every t, it samples N target labels from R (see Table 1 for target label examples), constructs $y(u_t)$ (3) and $h(u_t; \hat{\theta}_{t|t-1})$ (4), and computes $\nabla_{\theta_t} h(u_t; \hat{\theta}_{t|t-1})$ (13) and $K_t$ (14)-(15). Then it updates and outputs the MAP parameter estimate $\hat{\theta}_{t|t}$ and the error covariance matrix $P_{t|t}$ according to Equation (10).

Algorithm complexity:
For a d-dimensional parameter vector $\theta \in \mathbb{R}^d$, our algorithm requires $O(d^2)$ extra space to store the covariance matrix and $O(d^2)$ computations for matrix multiplications. Note that our update method does not require inverting the (d × d)-dimensional matrix $P_{t|t-1}$ in the update process, but only requires inverting the (N × N)-dimensional matrix $\big(\nabla h(\hat{\theta})^\top P_{t|t-1} \nabla h(\hat{\theta}) + P_{n_t}\big)$. Usually, N ≪ d. The extra time and memory requirements can be tolerated for small-to-medium networks of size d. However, they can be considered a drawback of the algorithm for large network sizes. Fortunately, there are several options for overcoming these drawbacks: (a) The use of a GPU for matrix multiplications can accelerate the computation time. (b) We can assume correlations only between blocks of parameters, for example between parameters in the same DNN layer, and apply layer factorization. This can significantly reduce the computation and memory requirements (Puskorius & Feldkamp, 1991; Zhang et al., 2017; Wu et al., 2017). (c) We can apply the Kalman optimization method only on the last layer of large DNNs. This approach was used by Levine et al. (2017), where they optimized the last layer using linear least squares optimization methods. We emphasize that, nevertheless, our approach scales with large state and action spaces and is suitable for continuous control problems, which are considered hard domains.

Algorithm 1 KOVA Optimizer
Input: $P_{0|0}$, $P_{v_t}$, $P_{n_t}$, α, R. Initialize: $\hat{\theta}_{0|0}$, t = 0.
1: for t = 1, ..., T do
2:   Set predictions: $\hat{\theta}_{t|t-1} = \hat{\theta}_{t-1|t-1}$;  $P_{t|t-1} = P_{t-1|t-1} + P_{v_t}$.
3:   Sample N tuples $\{y(u^i), h(u^i; \hat{\theta}_{t|t-1})\}_{i=1}^{N}$ from R.
4:   Construct the N-dimensional vectors $y(u_t)$ (3) and $h(u_t; \hat{\theta}_{t|t-1})$ (4).
5:   Compute the (d × N)-dimensional matrix $\nabla_\theta h(u_t; \hat{\theta}_{t|t-1})$ (13).
6:   $P_{\tilde{\theta}_t, \tilde{y}_t} = P_{t|t-1} \nabla_\theta h(u_t; \hat{\theta}_{t|t-1})$.
7:   $P_{\tilde{y}_t} = \nabla_\theta h(u_t; \hat{\theta}_{t|t-1})^\top P_{t|t-1} \nabla_\theta h(u_t; \hat{\theta}_{t|t-1}) + P_{n_t}$.
8:   $K_t = P_{\tilde{\theta}_t, \tilde{y}_t} P_{\tilde{y}_t}^{-1}$.
9:   Set updates: $\hat{\theta}_{t|t} = \hat{\theta}_{t|t-1} + \alpha K_t \big(y(u_t) - h(u_t; \hat{\theta}_{t|t-1})\big)$;  $P_{t|t} = P_{t|t-1} - \alpha K_t P_{\tilde{y}_t} K_t^\top$.
end for
Output: $\hat{\theta}_{t|t}$ and $P_{t|t}$.
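To connect Algorithm 1 to code, the following is a condensed NumPy sketch of a single KOVA iteration with the learning rate α and the fading-memory choice $P_{v_t} = \frac{\eta}{1-\eta} P_{t-1|t-1}$ discussed in Section 5. The function signature, the dense covariance representation and the variable names are our own illustrative choices and do not reproduce the released implementation.

```python
import numpy as np

def kova_iteration(theta_hat, P, y, h, grad_h, P_n, eta=0.01, alpha=1.0):
    """One iteration of the KOVA optimizer (Algorithm 1).

    theta_hat : (d,)   previous MAP estimate  theta_{t-1|t-1}
    P         : (d, d) previous error covariance  P_{t-1|t-1}
    y, h      : (N,)   target labels y(u_t) and predictions h(u_t; theta_hat)
    grad_h    : (d, N) Jacobian of h with respect to the parameters (Eq. (13))
    P_n       : (N, N) observation noise covariance
    eta       : fading-memory factor, P_v = eta / (1 - eta) * P
    alpha     : fixed learning rate used to smooth the update
    """
    # Step 2: predictions.
    P_pred = P + (eta / (1.0 - eta)) * P
    # Steps 6-8: covariances of interest and the Kalman gain (only an N x N inverse).
    P_ty = P_pred @ grad_h                              # P_{theta~_t, y~_t}
    P_y = grad_h.T @ P_pred @ grad_h + P_n              # P_{y~_t}
    K = P_ty @ np.linalg.inv(P_y)
    # Step 9: smoothed parameter and covariance updates.
    theta_new = theta_hat + alpha * K @ (y - h)
    P_new = P_pred - alpha * K @ P_y @ K.T
    return theta_new, P_new
```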
4. Related Work
Bayesian Neural Networks (BNNs):
There are several works on Bayesian methods for placing uncertainty on the approximator parameters (Blundell et al., 2015; Gal & Ghahramani, 2016). Depeweg et al. (2016; 2017) have used BNNs for learning MDP dynamics in RL tasks. In these works a fully factorized Gaussian distribution on the parameters is assumed, while we consider possible correlations between parameters. In addition, BNNs require sampling the parameters and running several feed-forward passes, one for each of the parameter samples. Our incremental method avoids multiple samples of the parameters, since the uncertainty is propagated with every optimization update.
Kalman filters:
Outside of the RL framework, the use of the Kalman filter as an optimization method is discussed in (Haykin et al., 2001; Vuckovic, 2018; Gomez-Uribe & Karrer, 2018). Wilson & Finkel (2009) solve the dynamics of each parameter with Kalman filtering. Wang et al. (2018) use a Kalman filter for normalizing batches. In our work we use Kalman filtering for VF optimization in the context of RL. The EKF is connected with the incremental Gauss-Newton method (Bertsekas, 1996) and with the on-line natural gradient (Ollivier et al., 2018). These methods require inverting the (d × d)-dimensional Fisher information matrix (for a d-dimensional parameter), thus requiring high computational resources. Our method avoids this inversion in the update step, which is more computationally efficient.
Figure 3. Mean episode reward during training for Mujoco environments. (a) PPO or (b) TRPO are used as policy optimization algorithms. We compare between the Adam and KOVA optimizers for policy evaluation. For Swimmer-v2, Hopper-v2 and HalfCheetah-v2 we trained over one million time steps, and for Ant-v2 and Walker2d-v2 we trained over two million time steps. We present the average (solid lines) and standard deviation (shaded area) of the episode rewards over 8 runs, generated from random seeds.
Trust region for policies:
The natural gradient method, when applied to RL tasks, is mostly used in policy gradient algorithms to estimate the parameters of the policy (Kakade, 2002; Peters & Schaal, 2008; Schulman et al., 2015a). Trust region methods in RL have been developed for parameterized policies (Schulman et al., 2015a; 2017). Despite that, trust region methods for parametrized VFs are rarely presented in the RL literature. Recently, Wu et al. (2017) suggested applying the natural gradient method also on the critic in the actor-critic framework, using Kronecker-factored approximations. Schulman et al. (2015b) suggested applying the Gauss-Newton method to estimate the VF. However, they did not analyze and formalize the underlying model and assumptions that lead to the regularization in the objective function, while this is the focus of our work.
Distributional perspective on values and observations:
Distributional RL (Bellemare et al., 2017) treats the full (general) distribution of the total return, and considers the VF parameters as deterministic. In our work we assume a Gaussian distribution over the total return and, in addition, a Gaussian distribution over the VF parameters.

Our work may be seen as a modern extension of GPTD (Engel et al., 2003; 2005) for DRL domains with continuous state and action spaces. GPTD uses Gaussian Processes (GPs) for both the VF and the total return, for solving the RL problem of value estimation. We introduce here several improvements and generalizations over their work: (1) Our formulation is adapted to learning nonlinear VF approximations, as common in DRL; (2) We include a fading memory option for previous observations by using a decay factor in the error covariance prediction ($P_{v_t}$); (3) We allow for a general observation noise covariance (not necessarily diagonal) and for general noisy observations (not only 1-step TD errors); (4) Our observation vector y(u) has a fixed size N (the batch size), as opposed to the growing-size vectors in GPTD, which grow with every new observation and make it difficult to train in DRL domains.

The use of Kalman filters to solve RL tasks was proposed by Geist & Pietquin (2010). Their formulation, called Kalman Temporal Difference (KTD), serves as the base of the formulation for the optimizer we propose. We introduce here several differences between their work and ours: (1) We reformulate the observation equation (10) to increase training stability by using a target network for the VF that appears in the target label (see Table 1). With this formulation, the observation function is simply the VF of the current input; (2) We use the Extended Kalman filter, as opposed to their use of the Unscented Kalman filter, for approximating nonlinear functions (Julier & Uhlmann, 1997; Wan & Van Der Merwe, 2000). In our formulation the observation function is differentiable, allowing us to use a first-order Taylor expansion linearization. The UKF has shown superior performance in some applications (St-Pierre & Gingras, 2004; Van Der Merwe, 2004); however, its computational cost is much greater than that of the EKF, due to its requirement of sampling the parameters 2d times in each training step. Moreover, it requires evaluating the observation function at these samples at every training step. Unfortunately, this is not tractable in DNNs, where the parameters might be high-dimensional.
Figure 4. Mean episode reward, policy entropy and policy loss for a PPO agent in the Mujoco environments Swimmer-v2 and HalfCheetah-v2. We compare between optimizing the VF with Adam vs. our KOVA optimizer. For KOVA, we present three different values for η = 0.1, 0.01, 0.001 and two different settings for the diagonal elements in $P_{n_t}$: (a) max-ratio and (b) batch-size. We present the average (solid lines) and standard deviation (shaded area) of the episode rewards over 8 runs, generated from random seeds.
5. Experiments
In this section we present experiments that illustrate the performance attained by our KOVA optimizer.

KOVA optimizer for policy evaluation:
We tested the performance of KOVA in domains with large state and action spaces: the robotic task benchmarks implemented in OpenAI Gym (Brockman et al., 2016), which use the MuJoCo physics engine (Todorov et al., 2012). For the policy training we used PPO (Schulman et al., 2017) and TRPO (Schulman et al., 2015a) with their baselines implementations (Dhariwal et al., 2017). For VF training we replaced the originally used Adam optimizer (Kingma & Ba, 2014) with our KOVA optimizer (Algorithm 1) and compared their effect on the mean episode reward in each environment. Code can be found at https://github.com/KOVA-trustregion/KOVA; technical details on the policy and VF networks and on the hyper-parameters we used are described in the supplementary material. The results are presented in Figure 3. When training with PPO, we can see that KOVA improved the agent's performance in four out of five environments. In Ant-v2 it kept approximately the same performance. When training with TRPO, we can see that KOVA improved the agent's performance mostly in Swimmer-v2 and HalfCheetah-v2. These improvements, both in PPO and in TRPO, demonstrate the importance of incorporating uncertainty estimation in value function approximation for improving the agent's performance.
Investigating the evolution and observation noises:
The most interesting hyper-parameters in KOVA are related to the covariances $P_{v_t}$ and $P_{n_t}$. As seen in Corollary 1, for a deterministic interpretation of the parameters we simply set $P_{v_t} = 0$. However, the more interesting setting would be $P_{v_t} = \frac{\eta}{1-\eta} P_{t-1|t-1}$, with η being a small number that controls the amount of fading memory (Ollivier et al., 2018). $P_{n_t}$ can be used for incorporating prior domain knowledge. For example, a diagonal matrix implies independent observations, while if observations are known to be correlated, additional non-diagonal elements can be added. We investigated the effect of different values of η and $P_{n_t}$ in the Swimmer and HalfCheetah environments, where KOVA gained the most success. The results are depicted in Figure 4. We tested two different $P_{n_t}$ settings: the batch-size setting, where $\sigma_i^2 = \sigma^2 = N$, and the max-ratio setting, where $\sigma_i^2 = N \max\big(1, \frac{\pi_{\text{old}}(a_i|s_i)}{\pi_{\text{new}}(a_i|s_i)} + \epsilon\big)$. Interestingly, although using KOVA results in a lower policy loss (which we try to maximize), it actually increases the policy entropy and encourages exploration, which we believe helps in gaining higher rewards during training. We can clearly see how the mean reward increases as the policy entropy increases, for different values of η. This insight was observed in both tested Mujoco environments and in both settings of $P_{n_t}$.
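For concreteness, the following is a small NumPy helper (our own sketch, not taken from the released code) that builds the diagonal $P_{n_t}$ under the two settings tested above; `ratios` stands for the per-sample values $\pi_{\text{old}}(a_i|s_i)/\pi_{\text{new}}(a_i|s_i)$ and `eps` for the small constant ε, whose value is not specified here.

```python
import numpy as np

def observation_noise_cov(N, setting="batch-size", ratios=None, eps=0.0):
    """Diagonal observation-noise covariance P_{n_t} for a batch of size N.

    "batch-size": sigma_i^2 = N for every sample (the setting of Corollary 1).
    "max-ratio" : sigma_i^2 = N * max(1, pi_old(a_i|s_i) / pi_new(a_i|s_i) + eps).
    """
    if setting == "batch-size":
        sigma2 = np.full(N, float(N))
    elif setting == "max-ratio":
        sigma2 = N * np.maximum(1.0, np.asarray(ratios, dtype=float) + eps)
    else:
        raise ValueError("unknown setting")
    return np.diag(sigma2)
```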
6. Conclusion
In this work we presented a novel regularized objective function for optimizing VFs in policy evaluation, which originates from a Bayesian perspective over both noisy observations and value parameters. Our empirical results illustrate how the KOVA optimizer can improve the performance of various RL agents in domains with large state and action spaces. For future work, it would be interesting to further investigate the connection between a trust region over value parameters and a trust region over policy parameters, and how to use this connection to improve exploration.
References
Amari, S.-I. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
Anderson, B. D. and Moore, J. B. Optimal filtering. Englewood Cliffs, 21:22–95, 1979.
Bellemare, M. G., Dabney, W., and Munos, R. A distributional perspective on reinforcement learning. arXiv preprint arXiv:1707.06887, 2017.
Bertsekas, D. P. Incremental least squares methods and the extended Kalman filter. SIAM Journal on Optimization, 6(3):807–822, 1996.
Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.
Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
Dann, C., Neumann, G., and Peters, J. Policy evaluation with temporal differences: A survey and comparison. The Journal of Machine Learning Research, 15(1):809–883, 2014.
Depeweg, S., Hernández-Lobato, J. M., Doshi-Velez, F., and Udluft, S. Learning and policy search in stochastic dynamical systems with Bayesian neural networks. arXiv preprint arXiv:1605.07127, 2016.
Depeweg, S., Hernández-Lobato, J. M., Doshi-Velez, F., and Udluft, S. Decomposition of uncertainty for active learning and reliable reinforcement learning in stochastic systems. arXiv preprint arXiv:1710.07283, 2017.
Dhariwal, P., Hesse, C., Klimov, O., Nichol, A., Plappert, M., Radford, A., Schulman, J., Sidor, S., Wu, Y., and Zhokhov, P. OpenAI Baselines. https://github.com/openai/baselines, 2017.
Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
Engel, Y., Mannor, S., and Meir, R. Bayes meets Bellman: The Gaussian process approach to temporal difference learning. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 154–161, 2003.
Engel, Y., Mannor, S., and Meir, R. Reinforcement learning with Gaussian processes. In Proceedings of the 22nd International Conference on Machine Learning, pp. 201–208. ACM, 2005.
Gal, Y. and Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pp. 1050–1059, 2016.
Geist, M. and Pietquin, O. Kalman temporal differences. Journal of Artificial Intelligence Research, 39:483–532, 2010.
Gelb, A. Applied Optimal Estimation. MIT Press, 1974.
Gomez-Uribe, C. A. and Karrer, B. The decoupled extended Kalman filter for dynamic exponential-family factorization models. arXiv preprint arXiv:1806.09976, 2018.
Haykin, S. S. et al. Kalman Filtering and Neural Networks. Wiley Online Library, 2001.
Julier, S. J. and Uhlmann, J. K. New extension of the Kalman filter to nonlinear systems. In AeroSense'97, pp. 182–193. International Society for Optics and Photonics, 1997.
Kakade, S. M. A natural policy gradient. In Advances in Neural Information Processing Systems, pp. 1531–1538, 2002.
Kalman, R. E. et al. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82(1):35–45, 1960.
Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Levine, N., Zahavy, T., Mankowitz, D. J., Tamar, A., and Mannor, S. Shallow updates for deep reinforcement learning. In Advances in Neural Information Processing Systems, pp. 3135–3145, 2017.
Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
Martens, J. New insights and perspectives on the natural gradient method. arXiv preprint arXiv:1412.1193, 2014.
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937, 2016.
Ollivier, Y. et al. Online natural gradient as a Kalman filter. Electronic Journal of Statistics, 12(2):2930–2961, 2018.
Peters, J. and Schaal, S. Natural actor-critic. Neurocomputing, 71(7-9):1180–1190, 2008.
Puskorius, G. V. and Feldkamp, L. A. Decoupled extended Kalman filter training of feedforward layered networks. In Neural Networks, 1991, IJCNN-91-Seattle International Joint Conference on, volume 1, pp. 771–777. IEEE, 1991.
Ruder, S. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, 2016.
Särkkä, S. Bayesian Filtering and Smoothing, volume 3. Cambridge University Press, 2013.
Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897, 2015a.
Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015b.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
St-Pierre, M. and Gingras, D. Comparison between the unscented Kalman filter and the extended Kalman filter for the position estimation module of an integrated navigation information system. In Intelligent Vehicles Symposium, 2004 IEEE, pp. 831–835. IEEE, 2004.
Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge, 1998.
Tieleman, T. and Hinton, G. Lecture 6.5-RMSprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2), 2012.
Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 5026–5033. IEEE, 2012.
Van Der Merwe, R. Sigma-Point Kalman Filters for Probabilistic Inference in Dynamic State-Space Models. PhD thesis, Oregon Health & Science University, 2004.
Vuckovic, J. Kalman gradient descent: Adaptive variance reduction in stochastic optimization. arXiv preprint arXiv:1810.12273, 2018.
Wan, E. A. and Van Der Merwe, R. The unscented Kalman filter for nonlinear estimation. In Adaptive Systems for Signal Processing, Communications, and Control Symposium 2000 (AS-SPCC), pp. 153–158. IEEE, 2000.
Wang, G., Peng, J., Luo, P., Wang, X., and Lin, L. Batch Kalman normalization: Towards training deep neural networks with micro-batches. arXiv preprint arXiv:1802.03133, 2018.
Wilson, R. and Finkel, L. A neural implementation of the Kalman filter. In Advances in Neural Information Processing Systems, pp. 2062–2070, 2009.
Wu, Y., Mansimov, E., Grosse, R. B., Liao, S., and Ba, J. Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. In Advances in Neural Information Processing Systems, pp. 5279–5288, 2017.
Zeiler, M. D. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
Zhang, R., Li, C., Chen, C., and Carin, L. Learning structural weight uncertainty for sequential decision-making. arXiv preprint arXiv:1801.00085, 2017.
Supplementary Material

A. Theoretical Results
A.1. Extended Kalman Filter (EKF)
In this section we briefly outline the Extended Kalman filter (Anderson & Moore, 1979; Gelb, 1974). The EKF considers the following model:
$$\begin{cases} \theta_t = \theta_{t-1} + v_t \\ y(u_t) = h(u_t; \theta_t) + n_t, \end{cases} \qquad (A.1)$$
where $\theta_t \in \mathbb{R}^{d \times 1}$ are the parameters evaluated at time t, $y(u_t) = [y(u_t^1), y(u_t^2), \ldots, y(u_t^N)]^\top \in \mathbb{R}^{N \times 1}$ is the N-dimensional observation vector at time t, and $h(u_t; \theta_t) = [h(u_t^1; \theta_t), h(u_t^2; \theta_t), \ldots, h(u_t^N; \theta_t)]^\top \in \mathbb{R}^{N \times 1}$, where h(u; θ) is a nonlinear observation function with input u and parameters θ.

The evolution noise $v_t$ is white ($\mathbb{E}[v_t] = 0$) with covariance $P_{v_t} \triangleq \mathbb{E}[v_t v_t^\top]$ and $\mathbb{E}[v_t v_{t'}^\top] = 0$ for all $t \neq t'$. The observation noise $n_t$ is white ($\mathbb{E}[n_t] = 0$) with covariance $P_{n_t} \triangleq \mathbb{E}[n_t n_t^\top]$ and $\mathbb{E}[n_t n_{t'}^\top] = 0$ for all $t \neq t'$.

The EKF sets the estimation of the parameters θ at time t according to the conditional expectation:
$$\hat{\theta}_{t|t} \triangleq \mathbb{E}[\theta_t | y^t], \qquad \hat{\theta}_{t|t-1} \triangleq \mathbb{E}[\theta_t | y^{t-1}] = \hat{\theta}_{t-1|t-1}, \qquad (A.2)$$
where, with some abuse of notation, $y^{t'}$ are the observations gathered up to time t': $y(u_1), \ldots, y(u_{t'})$. The parameter errors are defined by:
$$\tilde{\theta}_{t|t} \triangleq \theta_t - \hat{\theta}_{t|t}, \qquad \tilde{\theta}_{t|t-1} \triangleq \theta_t - \hat{\theta}_{t|t-1}. \qquad (A.3)$$
The conditional error covariances are given by $P_{t|t} \triangleq \mathbb{E}[\tilde{\theta}_{t|t} \tilde{\theta}_{t|t}^\top | y^t]$ and
$$\begin{aligned} P_{t|t-1} &\triangleq \mathbb{E}\big[\tilde{\theta}_{t|t-1} \tilde{\theta}_{t|t-1}^\top \,|\, y^{t-1}\big] = \mathbb{E}\big[(\theta_t - \hat{\theta}_{t|t-1})(\theta_t - \hat{\theta}_{t|t-1})^\top \,|\, y^{t-1}\big] \\ &= \mathbb{E}\big[(\theta_{t-1} + v_t - \hat{\theta}_{t-1|t-1})(\theta_{t-1} + v_t - \hat{\theta}_{t-1|t-1})^\top \,|\, y^{t-1}\big] = \mathbb{E}\big[(\tilde{\theta}_{t-1|t-1} + v_t)(\tilde{\theta}_{t-1|t-1} + v_t)^\top \,|\, y^{t-1}\big] \\ &= \mathbb{E}\big[\tilde{\theta}_{t-1|t-1} \tilde{\theta}_{t-1|t-1}^\top \,|\, y^{t-1}\big] + 2\,\mathbb{E}\big[\tilde{\theta}_{t-1|t-1} v_t^\top \,|\, y^{t-1}\big] + \mathbb{E}\big[v_t v_t^\top \,|\, y^{t-1}\big] = P_{t-1|t-1} + P_{v_t}, \end{aligned} \qquad (A.4)$$
where the cross term vanishes since $v_t$ is white and independent of the past estimates.

The EKF considers several statistics of interest at each time step. The prediction of the observation function: $\hat{y}_{t|t-1} \triangleq \mathbb{E}[h(u_t; \theta_t) | y^{t-1}]$. The observation innovation: $\tilde{y}_{t|t-1} \triangleq h(u_t; \theta_t) - \hat{y}_{t|t-1}$. The covariance between the parameter error and the innovation: $P_{\tilde{\theta}_t, \tilde{y}_t} \triangleq \mathbb{E}[\tilde{\theta}_{t|t-1} \tilde{y}_{t|t-1}^\top | y^{t-1}]$. The covariance of the innovation: $P_{\tilde{y}_t} \triangleq \mathbb{E}[\tilde{y}_{t|t-1} \tilde{y}_{t|t-1}^\top | y^{t-1}] + P_{n_t}$. The Kalman gain: $K_t \triangleq P_{\tilde{\theta}_t, \tilde{y}_t} P_{\tilde{y}_t}^{-1}$.

The above statistics serve for the update of the parameters and the error covariance:
$$\begin{cases} \hat{\theta}_{t|t} = \hat{\theta}_{t|t-1} + K_t \big(y(u_t) - h(u_t; \hat{\theta}_{t|t-1})\big), \\ P_{t|t} = P_{t|t-1} - K_t P_{\tilde{y}_t} K_t^\top. \end{cases} \qquad (A.5)$$

A.2. EKF for Value Function Estimation
When applying the EKF formulation to value function approximation, the observation at time t is the target label $y(u_t)$ (see Table 1 in the main article), and the observation function h can be the state value function, the state-action value function or the advantage function.

The EKF uses a first-order Taylor series linearization of the observation function:
$$h(u_t; \theta_t) = h(u_t; \hat{\theta}) + \nabla_{\theta_t} h(u_t; \hat{\theta})^\top (\theta_t - \hat{\theta}), \qquad (A.6)$$
where $\nabla_{\theta_t} h(u_t; \hat{\theta}) = [\nabla_{\theta_t} h(u_t^1; \hat{\theta}), \ldots, \nabla_{\theta_t} h(u_t^N; \hat{\theta})] \in \mathbb{R}^{d \times N}$ and $\hat{\theta}$ is typically chosen to be the previous estimate of the parameters at time t−1, $\hat{\theta}_{t|t-1}$. This linearization helps in computing the statistics of interest. Recall that the expectation here is over the random variable $\theta_t$, where $\hat{\theta}_{t|t-1}$ is fixed. For simplicity, we keep writing $\hat{\theta}$.

The prediction of the observation function:
$$\hat{y}_{t|t-1} \triangleq \mathbb{E}[h(u_t; \theta_t) | y^{t-1}] \overset{(A.6)}{=} \mathbb{E}\big[h(u_t; \hat{\theta}) + \nabla_{\theta_t} h(u_t; \hat{\theta})^\top (\theta_t - \hat{\theta}) \,\big|\, y^{t-1}\big] = h(u_t; \hat{\theta}) + \nabla_{\theta_t} h(u_t; \hat{\theta})^\top \big(\mathbb{E}[\theta_t | y^{t-1}] - \hat{\theta}\big) \overset{(A.2)}{=} h(u_t; \hat{\theta}).$$
We conclude that
$$\hat{y}_{t|t-1} = h(u_t; \hat{\theta}_{t|t-1}). \qquad (A.7)$$
The observation innovation:
$$\tilde{y}_{t|t-1} \triangleq h(u_t; \theta_t) - \hat{y}_{t|t-1} \overset{(A.7)}{=} h(u_t; \theta_t) - h(u_t; \hat{\theta}_{t|t-1}). \qquad (A.8)$$
Using (A.6),
$$h(u_t; \theta_t) - h(u_t; \hat{\theta}) = \nabla_{\theta_t} h(u_t; \hat{\theta})^\top (\theta_t - \hat{\theta}). \qquad (A.9)$$
The covariance between the parameter error and the innovation (here we also denote $\hat{\theta} = \hat{\theta}_{t|t-1}$):
$$P_{\tilde{\theta}_t, \tilde{y}_t} \triangleq \mathbb{E}[\tilde{\theta}_{t|t-1} \tilde{y}_{t|t-1}^\top | y^{t-1}] \overset{(A.3),(A.8)}{=} \mathbb{E}\big[(\theta_t - \hat{\theta})\big(h(u_t; \theta_t) - h(u_t; \hat{\theta})\big)^\top \,|\, y^{t-1}\big] \overset{(A.9)}{=} \mathbb{E}\big[(\theta_t - \hat{\theta})(\theta_t - \hat{\theta})^\top | y^{t-1}\big]\, \nabla_{\theta_t} h(u_t; \hat{\theta}) \overset{(A.4)}{=} P_{t|t-1} \nabla_{\theta_t} h(u_t; \hat{\theta}),$$
$$P_{\tilde{\theta}_t, \tilde{y}_t} = P_{t|t-1} \nabla_{\theta_t} h(u_t; \hat{\theta}_{t|t-1}). \qquad (A.10)$$
The covariance of the innovation:
$$P_{\tilde{y}_t} \triangleq \mathbb{E}[\tilde{y}_{t|t-1} \tilde{y}_{t|t-1}^\top | y^{t-1}] + P_{n_t} \overset{(A.8),(A.9)}{=} \nabla_{\theta_t} h(u_t; \hat{\theta})^\top\, \mathbb{E}\big[\tilde{\theta}_{t|t-1} \tilde{\theta}_{t|t-1}^\top | y^{t-1}\big]\, \nabla_{\theta_t} h(u_t; \hat{\theta}) + P_{n_t},$$
$$P_{\tilde{y}_t} = \nabla_{\theta_t} h(u_t; \hat{\theta}_{t|t-1})^\top P_{t|t-1} \nabla_{\theta_t} h(u_t; \hat{\theta}_{t|t-1}) + P_{n_t}. \qquad (A.11)$$
The Kalman gain:
$$K_t \triangleq P_{\tilde{\theta}_t, \tilde{y}_t} P_{\tilde{y}_t}^{-1} \overset{(A.10),(A.11)}{=} P_{t|t-1} \nabla_{\theta_t} h(u_t; \hat{\theta}_{t|t-1}) \big(\nabla_{\theta_t} h(u_t; \hat{\theta}_{t|t-1})^\top P_{t|t-1} \nabla_{\theta_t} h(u_t; \hat{\theta}_{t|t-1}) + P_{n_t}\big)^{-1}, \qquad (A.12)$$
and the update for the parameters of the value function and the error covariance are the same as in Equation (A.5), as we prove in Theorem 1.

A.3. A Bayesian approach: MAP estimator
We adopt the Bayesian approach, in which we are interested in finding the optimal set of parameters $\theta_t$ that maximizes the posterior distribution of the parameters given the observations we have gathered up to time t, denoted $y^t$. According to Bayes rule, the posterior distribution is defined as
$$p(\theta_t | y^t) = \frac{p(y^t | \theta_t)\, p(\theta_t)}{p(y^t)},$$
where $p(y^t | \theta)$ is the likelihood of the observations given the parameters θ and $p(\theta)$ is the prior distribution over θ. We expand the term of the posterior (Van Der Merwe, 2004):
$$p(\theta_t | y^t) = \frac{p(y^t | \theta_t)\, p(\theta_t)}{p(y^t)} = \frac{p(y_t | y^{t-1}, \theta_t)\, p(y^{t-1} | \theta_t)\, p(\theta_t)}{p(y^t)} \qquad (A.13)$$
$$= \frac{p(y_t | \theta_t)\, p(y^{t-1} | \theta_t)\, p(\theta_t)}{p(y^t)} \cdot \frac{p(y^{t-1})}{p(y^{t-1})} \qquad (A.14)$$
$$= \frac{p(y_t | \theta_t)\, p(\theta_t | y^{t-1})\, p(y^{t-1})}{p(y^t)}. \qquad (A.15)$$
The transition in (A.13) is according to the conditional probability:
$$p(y^t | \theta_t) = p(y_t, y^{t-1} | \theta_t) = \frac{p(y_t, y^{t-1}, \theta_t)}{p(\theta_t)} = \frac{p(y^{t-1}, \theta_t)\, p(y_t | y^{t-1}, \theta_t)}{p(\theta_t)} = p(y^{t-1} | \theta_t)\, p(y_t | y^{t-1}, \theta_t).$$
The transition in (A.14) is according to the conditional independence $p(y_t | y^{t-1}, \theta_t) = p(y_t | \theta_t)$, and we multiplied the numerator and the denominator by $p(y^{t-1})$. The transition in (A.15) is according to Bayes rule: $p(\theta_t | y^{t-1}) = \frac{p(y^{t-1} | \theta_t)\, p(\theta_t)}{p(y^{t-1})}$.

The MAP estimator for $\theta_t$ is the one that maximizes the posterior distribution described in (A.15):
$$\begin{aligned} \theta^{\text{MAP}}_t &= \arg\max_{\theta_t} \big\{p(\theta_t | y^t)\big\} = \arg\max_{\theta_t} \Big\{\frac{p(y_t | \theta_t)\, p(\theta_t | y^{t-1})\, p(y^{t-1})}{p(y^t)}\Big\} = \arg\max_{\theta_t} \big\{p(y_t | \theta_t)\, p(\theta_t | y^{t-1})\big\} \\ &= \arg\max_{\theta_t} \big\{\log p(y_t | \theta_t) + \log p(\theta_t | y^{t-1})\big\} = \arg\min_{\theta_t} \big\{-\log p(y_t | \theta_t) - \log p(\theta_t | y^{t-1})\big\}. \end{aligned} \qquad (A.16)$$
In (A.16) we used the derivation in (A.15) and the fact that the argument which maximizes the posterior is the same as the argument that maximizes the log of the posterior; in addition, this argument also minimizes the negative log. Replacing $y_t = y(u_t)$, we receive:
$$\theta^{\text{MAP}}_t = \arg\min_{\theta_t} \big\{-\log p(y(u_t) | \theta_t) - \log p(\theta_t | y^{t-1})\big\}. \qquad (A.17)$$
In order to solve (A.17), we consider the EKF formulation for the value function parameters.

A.4. Gaussian assumptions
When estimating using the EKF, it is common to make the following assumptions regarding the likelihood and the posterior in Equation (A.17):
Assumption A.1.
The likelihood $p(y(u_t) | \theta_t)$ is assumed to be Gaussian: $y(u_t) | \theta_t \sim \mathcal{N}(h(u_t; \theta_t), P_{n_t})$.

Assumption A.2.
The posterior distribution $p(\theta_t \mid y_{1:t-1})$ is assumed to be Gaussian: $\theta_t \mid y_{1:t-1} \sim \mathcal{N}\big(\hat\theta_{t|t-1}, P_{t|t-1}\big)$.

The following are the calculations for the means and covariances in Assumptions A.1 and A.2. For the likelihood $p(y(u_t) \mid \theta_t)$:

$\mathbb{E}[y(u_t) \mid \theta_t] = \mathbb{E}[h(u_t; \theta_t) + n_t \mid \theta_t] = \mathbb{E}[h(u_t; \theta_t) \mid \theta_t] + \underbrace{\mathbb{E}[n_t \mid \theta_t]}_{=0} = h(u_t; \theta_t)$.    (A.18)

Next, we evaluate:

$y(u_t) - \mathbb{E}[y(u_t) \mid \theta_t] = h(u_t; \theta_t) + n_t - h(u_t; \theta_t) = n_t$.    (A.19)

$\mathrm{Cov}(y(u_t) \mid \theta_t) \triangleq \mathbb{E}\big[\big(y(u_t) - \mathbb{E}[y(u_t) \mid \theta_t]\big)\big(y(u_t) - \mathbb{E}[y(u_t) \mid \theta_t]\big)^\top \mid \theta_t\big] \overset{(A.19)}{=} \mathbb{E}[n_t n_t^\top \mid \theta_t] = P_{n_t}$.

For the posterior $p(\theta_t \mid y_{1:t-1})$:

$\mathbb{E}_{\theta_t}[\theta_t \mid y_{1:t-1}] = \hat\theta_{t|t-1}$,
$\mathrm{Cov}(\theta_t \mid y_{1:t-1}) \triangleq \mathbb{E}_{\theta_t}\big[\big(\theta_t - \hat\theta_{t|t-1}\big)\big(\theta_t - \hat\theta_{t|t-1}\big)^\top \mid y_{1:t-1}\big] = \mathbb{E}_{\theta_t}[\tilde\theta_{t|t-1} \tilde\theta_{t|t-1}^\top \mid y_{1:t-1}] = P_{t|t-1}$.

A.5. Proof of Theorem 1
Based on the Gaussian assumptions, we can derive the fol-lowing Theorem:
Theorem A.1.
Under Assumptions A.1 and A.2, $\hat\theta^{EKF}_{t|t}$ in (A.5) minimizes at each time step $t$ the following regularized objective function:

$\mathcal{L}^{EKF}_t(\theta_t) = \frac{1}{2} \delta(u_t; \theta_t)^\top P_{n_t}^{-1} \delta(u_t; \theta_t) + \frac{1}{2} (\theta_t - \hat\theta_{t|t-1})^\top P_{t|t-1}^{-1} (\theta_t - \hat\theta_{t|t-1})$,    (A.20)

where $\hat\theta^{EKF}_{t|t} \in \arg\min_{\theta_t} \mathcal{L}^{EKF}_t(\theta_t)$.

Proof. We solve the minimization problem in (A.17) by substituting the Gaussian Assumptions A.1 and A.2. We show that this minimization problem is equivalent to minimizing the objective function $\mathcal{L}^{EKF}_t$ in Theorem A.1.

$\hat\theta^{MAP}_{t|t} = \arg\min_{\theta_t} \big\{ -\log p(y(u_t) \mid \theta_t) - \log p(\theta_t \mid y_{1:t-1}) \big\}$
$= \arg\min_{\theta_t} \Big\{ -\log\Big( \frac{1}{(2\pi)^{N/2} |P_{n_t}|^{1/2}} \exp\big( -\tfrac{1}{2} (y(u_t) - h(u_t; \theta_t))^\top P_{n_t}^{-1} (y(u_t) - h(u_t; \theta_t)) \big) \Big)$
$\quad\; - \log\Big( \frac{1}{(2\pi)^{d/2} |P_{t|t-1}|^{1/2}} \exp\big( -\tfrac{1}{2} (\theta_t - \hat\theta_{t|t-1})^\top P_{t|t-1}^{-1} (\theta_t - \hat\theta_{t|t-1}) \big) \Big) \Big\}$
$= \arg\min_{\theta_t} \Big\{ \tfrac{1}{2} (y(u_t) - h(u_t; \theta_t))^\top P_{n_t}^{-1} (y(u_t) - h(u_t; \theta_t)) - \log\frac{1}{(2\pi)^{N/2} |P_{n_t}|^{1/2}}$
$\quad\; + \tfrac{1}{2} (\theta_t - \hat\theta_{t|t-1})^\top P_{t|t-1}^{-1} (\theta_t - \hat\theta_{t|t-1}) - \log\frac{1}{(2\pi)^{d/2} |P_{t|t-1}|^{1/2}} \Big\}$
$= \arg\min_{\theta_t} \Big\{ \tfrac{1}{2} (y(u_t) - h(u_t; \theta_t))^\top P_{n_t}^{-1} (y(u_t) - h(u_t; \theta_t)) + \tfrac{1}{2} (\theta_t - \hat\theta_{t|t-1})^\top P_{t|t-1}^{-1} (\theta_t - \hat\theta_{t|t-1}) \Big\}$,

where $|\cdot|$ denotes the determinant. We obtain the following objective function:

$\mathcal{L}_t(\theta_t) = \frac{1}{2} \big(y(u_t) - h(u_t; \theta_t)\big)^\top P_{n_t}^{-1} \big(y(u_t) - h(u_t; \theta_t)\big) + \frac{1}{2} (\theta_t - \hat\theta_{t|t-1})^\top P_{t|t-1}^{-1} (\theta_t - \hat\theta_{t|t-1})$,    (A.21)

which is exactly the objective function (A.20) in Theorem A.1, with $\delta(u_t; \theta_t) = y(u_t) - h(u_t; \theta_t)$. To minimize this objective function we take the derivative of $\mathcal{L}^{EKF}_t(\theta_t)$ with respect to $\theta_t$ and set it to zero:

$\nabla_{\theta_t} \mathcal{L}^{EKF}_t(\theta_t) = -\nabla_{\theta_t} h(u_t; \theta_t) P_{n_t}^{-1} \big(y(u_t) - h(u_t; \theta_t)\big) + P_{t|t-1}^{-1} (\theta_t - \hat\theta_{t|t-1}) = 0$.

We use the linearization of the value function in Equation (A.6):

$P_{t|t-1}^{-1} (\theta_t - \hat\theta) = \nabla_{\theta_t}\big( h(u_t; \hat\theta) + \nabla_{\theta_t} h(u_t; \hat\theta)^\top (\theta_t - \hat\theta) \big) P_{n_t}^{-1} \Big( y(u_t) - h(u_t; \hat\theta) - \nabla_{\theta_t} h(u_t; \hat\theta)^\top (\theta_t - \hat\theta) \Big)$
$= \nabla_{\theta_t} h(u_t; \hat\theta) P_{n_t}^{-1} \big( y(u_t) - h(u_t; \hat\theta) \big) - \nabla_{\theta_t} h(u_t; \hat\theta) P_{n_t}^{-1} \nabla_{\theta_t} h(u_t; \hat\theta)^\top (\theta_t - \hat\theta)$.

We obtain:

$\Big( P_{t|t-1}^{-1} + \nabla_{\theta_t} h(u_t; \hat\theta) P_{n_t}^{-1} \nabla_{\theta_t} h(u_t; \hat\theta)^\top \Big) (\theta_t - \hat\theta) = \nabla_{\theta_t} h(u_t; \hat\theta) P_{n_t}^{-1} \big( y(u_t) - h(u_t; \hat\theta) \big)$,

and finally:

$\theta_t = \hat\theta + \Big( P_{t|t-1}^{-1} + \nabla_{\theta_t} h(u_t; \hat\theta) P_{n_t}^{-1} \nabla_{\theta_t} h(u_t; \hat\theta)^\top \Big)^{-1} \nabla_{\theta_t} h(u_t; \hat\theta) P_{n_t}^{-1} \big( y(u_t) - h(u_t; \hat\theta) \big)$.    (A.22)

For simplicity we denote $\nabla h = \nabla_{\theta_t} h(u_t; \hat\theta)$. We now simplify the following term:

$\big( P_{t|t-1}^{-1} + \nabla h P_{n_t}^{-1} \nabla h^\top \big)^{-1} \nabla h P_{n_t}^{-1}$
$= \big( P_{t|t-1}^{-1} + \nabla h P_{n_t}^{-1} \nabla h^\top \big)^{-1} \nabla h P_{n_t}^{-1} \big( \nabla h^\top P_{t|t-1} \nabla h + P_{n_t} \big) \big( \nabla h^\top P_{t|t-1} \nabla h + P_{n_t} \big)^{-1}$
$= \big( P_{t|t-1}^{-1} + \nabla h P_{n_t}^{-1} \nabla h^\top \big)^{-1} \big( \nabla h P_{n_t}^{-1} \nabla h^\top P_{t|t-1} \nabla h + \nabla h P_{n_t}^{-1} P_{n_t} \big) \big( \nabla h^\top P_{t|t-1} \nabla h + P_{n_t} \big)^{-1}$
$= \big( P_{t|t-1}^{-1} + \nabla h P_{n_t}^{-1} \nabla h^\top \big)^{-1} \big( \nabla h P_{n_t}^{-1} \nabla h^\top + P_{t|t-1}^{-1} \big) P_{t|t-1} \nabla h \big( \nabla h^\top P_{t|t-1} \nabla h + P_{n_t} \big)^{-1}$
$= P_{t|t-1} \nabla h \big( \nabla h^\top P_{t|t-1} \nabla h + P_{n_t} \big)^{-1} \overset{(A.10),(A.11)}{=} P_{\tilde\theta_t, \tilde y_t} P_{\tilde y_t}^{-1} \overset{(A.12)}{=} K_t$.    (A.23)

Substituting this result in Equation (A.22), we obtain the EKF update for the parameters:

$\hat\theta^{EKF}_{t|t} = \hat\theta_{t|t-1} + K_t \big( y(u_t) - h(u_t; \hat\theta_{t|t-1}) \big)$,    (A.24)

which is exactly as in Equation (A.5).

We now develop the term $\big( P_{t|t-1}^{-1} + \nabla h P_{n_t}^{-1} \nabla h^\top \big)^{-1}$ that appears in (A.22) by using the matrix inversion lemma:

$\big( B^{-1} + C D^{-1} C^\top \big)^{-1} = B - B C \big( D + C^\top B C \big)^{-1} C^\top B$,    (A.25)

where $B$ is a square symmetric positive-definite (and hence invertible) matrix. For this purpose we assume that the error covariance matrix of $\theta_t$, $P_{t|t-1}$, is symmetric and positive-definite. Then:

$\big( P_{t|t-1}^{-1} + \nabla h P_{n_t}^{-1} \nabla h^\top \big)^{-1} \overset{(A.25)}{=} P_{t|t-1} - P_{t|t-1} \nabla h \big( P_{n_t} + \nabla h^\top P_{t|t-1} \nabla h \big)^{-1} \nabla h^\top P_{t|t-1}$
$\overset{(A.12)}{=} P_{t|t-1} - K_t \nabla h^\top P_{t|t-1} \overset{(A.10)}{=} P_{t|t-1} - K_t P_{\tilde\theta_t, \tilde y_t}^\top \overset{(A.11),(A.12)}{=} P_{t|t-1} - K_t P_{\tilde y_t} K_t^\top$.

We can write the update of the parameters error covariance as:

$P_{t|t} = P_{t|t-1} - K_t P_{\tilde y_t} K_t^\top$.    (A.26)

We conclude the proof by stating that the optimal parameter $\hat\theta^{EKF}_{t|t}$ in (A.5) is the solution to the minimization of the objective function in (A.20): $\hat\theta^{EKF}_{t|t} \in \arg\min_{\theta_t} \mathcal{L}^{EKF}_t(\theta_t)$.
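As a quick numerical sanity check of the algebra in (A.23)-(A.26) (this snippet is ours and purely illustrative, with arbitrary random matrices), one can verify that the coefficient from the normal equations equals the Kalman gain, and that the matrix-inversion-lemma form of the covariance coincides with $P_{t|t-1} - K_t P_{\tilde y_t} K_t^\top$:

```python
import numpy as np

rng = np.random.default_rng(1)
d, N = 4, 3
grad_h = rng.normal(size=(d, N))                         # Jacobian at the linearization point
A = rng.normal(size=(d, d)); P = A @ A.T + np.eye(d)     # P_{t|t-1}, symmetric positive-definite
B = rng.normal(size=(N, N)); P_n = B @ B.T + np.eye(N)   # P_{n_t}, symmetric positive-definite

# (A.23): the solution coefficient equals the Kalman gain (A.12).
lhs = np.linalg.inv(np.linalg.inv(P) + grad_h @ np.linalg.inv(P_n) @ grad_h.T) \
      @ grad_h @ np.linalg.inv(P_n)
P_y = grad_h.T @ P @ grad_h + P_n
K = P @ grad_h @ np.linalg.inv(P_y)
print(np.allclose(lhs, K))                               # True

# (A.25)-(A.26): matrix-inversion-lemma form equals P_{t|t-1} - K_t P_y K_t^T.
woodbury = np.linalg.inv(np.linalg.inv(P) + grad_h @ np.linalg.inv(P_n) @ grad_h.T)
print(np.allclose(woodbury, P - K @ P_y @ K.T))          # True
```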
A.6. Proof of Corollary 1

Proof. If $P_{n_t}$ is diagonal with diagonal elements $\sigma_i^2 = N$, where $N$ is the number of samples in a batch, then:

$\frac{1}{2} \delta(u_t; \theta_t)^\top P_{n_t}^{-1} \delta(u_t; \theta_t) = \frac{1}{2N} \sum_{i=1}^{N} \delta(u_t^i; \theta_t)^2 = \mathcal{L}^{MLE}_t(\theta_t)$.

If, in addition, $P_{0|0}$ and $P_{v_t}$ are set such that the initial error covariance matrix does not change, then $\mathcal{L}^{EKF}_t(\theta_t) = \mathcal{L}^{MLE}_t(\theta_t)$ for each $t$.
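The first equality in the proof is easy to verify numerically; the following tiny snippet (ours, for illustration only, with made-up TD errors) confirms that with $P_{n_t} = N \cdot I$ the quadratic data term reduces to half the mean squared TD error over the batch:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 16
delta = rng.normal(size=N)              # a batch of TD errors delta(u_t^i; theta_t)
P_n = N * np.eye(N)                     # P_{n_t} diagonal with sigma_i^2 = N

ekf_term = 0.5 * delta @ np.linalg.inv(P_n) @ delta
mle_term = 0.5 * np.mean(delta ** 2)
print(np.isclose(ekf_term, mle_term))   # True
```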
A.7. Proof of Theorem 2

First, let us define the distributions of interest. We adopt the notation from (Martens, 2014). Assume the inputs $u$ are drawn independently from a target distribution $Q_u$ with density function $q(u)$, and assume the corresponding outputs $y$ are drawn from a conditional target distribution $Q_{y|u}$ with density function $q(y|u)$. The target joint distribution is $Q_{u,y}$, whose density is $q(u, y) = q(y|u) q(u)$, and the learned distribution is $P_{u,y}(\theta)$, whose density is $p(u, y \mid \theta) = p(y \mid u, \theta) q(u)$.

Lemma A.1. If $P_{n_t}$ is diagonal with diagonal elements $\sigma_i^2 = N$, then:

$\frac{1}{2} \delta(u_t; \theta_t)^\top P_{n_t}^{-1} \delta(u_t; \theta_t) = C + N \, \mathbb{E}_{\hat Q_u}\big[ D_{KL}\big( \hat Q_{y|u} \,\|\, P_{y|u}(\theta) \big) \big]$,

where $C$ is a constant with respect to $\theta$.

Proof.
By definition:

$D_{KL}\big( Q_{u,y} \,\|\, P_{u,y}(\theta) \big) = \int q(u, y) \log \frac{q(u, y)}{p(u, y \mid \theta)} \, du \, dy$.

This is equivalent to the expected KL divergence over the conditional distributions, $\mathbb{E}_{Q_u}\big[ D_{KL}\big( Q_{y|u} \,\|\, P_{y|u}(\theta) \big) \big]$, since:

$\mathbb{E}_{Q_u}\big[ D_{KL}\big( Q_{y|u} \,\|\, P_{y|u}(\theta) \big) \big] = \int q(u) \int q(y|u) \log \frac{q(y|u)}{p(y \mid u, \theta)} \, dy \, du = \int q(u, y) \log \frac{q(y|u) q(u)}{p(y \mid u, \theta) q(u)} \, du \, dy = D_{KL}\big( Q_{u,y} \,\|\, P_{u,y}(\theta) \big)$.

Since we do not have access to $Q_u$, we substitute an empirical training distribution $\hat Q_u$ for $Q_u$, which is given by a set $S_u$ of samples from $Q_u$. Then we define:

$\mathbb{E}_{\hat Q_u}\big[ D_{KL}\big( Q_{y|u} \,\|\, P_{y|u}(\theta) \big) \big] = \frac{1}{|S|} \sum_{u \in S_u} D_{KL}\big( Q_{y|u} \,\|\, P_{y|u}(\theta) \big)$.

In our training setting, we only have access to a single sample $y$ from $Q_{y|u}$ for each $u \in S_u$, giving an empirical training distribution $\hat Q_{y|u}$. Then:

$\mathbb{E}_{\hat Q_u}\big[ D_{KL}\big( \hat Q_{y|u} \,\|\, P_{y|u}(\theta) \big) \big] = \frac{1}{|S|} \sum_{(u, y) \in S} \log \frac{1}{p(y \mid u, \theta)} = -\frac{1}{|S|} \sum_{(u, y) \in S} \log p(y \mid u, \theta)$,

since $\hat q(y|u) = 1$.

Now, back to our EKF notation; here the batch contains $|S| = N$ samples. Assume that the $N$ observations in $y(u_t)$ are independent; then:

$\log p(y(u_t) \mid \theta) = \log \prod_{i=1}^{N} p(y(u_t^i) \mid \theta) = \sum_{i=1}^{N} \log p(y \mid u_t^i, \theta)$,

where we changed the notation $p(y(u_t^i) \mid \theta) = p(y \mid u_t^i, \theta)$. Writing this explicitly for Gaussian distributions:

$\log p(y(u_t) \mid \theta) = \log \Big( \frac{1}{(2\pi)^{N/2} |P_{n_t}|^{1/2}} \exp\big( -\tfrac{1}{2} (y(u_t) - h(u_t; \theta_t))^\top P_{n_t}^{-1} (y(u_t) - h(u_t; \theta_t)) \big) \Big) = C - \frac{1}{2} \delta(u_t; \theta_t)^\top P_{n_t}^{-1} \delta(u_t; \theta_t)$,

where $C = \log \frac{1}{(2\pi)^{N/2} |P_{n_t}|^{1/2}}$ is constant with respect to $\theta$. Then we have:

$\frac{1}{2} \delta(u_t; \theta_t)^\top P_{n_t}^{-1} \delta(u_t; \theta_t) = C - \log p(y(u_t) \mid \theta) = C - \sum_{i=1}^{N} \log p(y \mid u_t^i, \theta) = C + N \, \mathbb{E}_{\hat Q_u}\big[ D_{KL}\big( \hat Q_{y|u} \,\|\, P_{y|u}(\theta) \big) \big]$,

which proves the lemma.
For the empirical Fisher information matrix $\hat F$ and $\hat\theta = \theta + \Delta\theta$:

$D_{KL}\big( P_{u,y}(\theta + \Delta\theta) \,\|\, P_{u,y}(\theta) \big) = \frac{1}{2} (\theta - \hat\theta)^\top \hat F (\theta - \hat\theta) + O(\|\Delta\theta\|^3)$.
According to the KL-divergence definition:

$D_{KL}\big( P_{u,y}(\theta + \Delta\theta) \,\|\, P_{u,y}(\theta) \big) = \int p(u, y \mid \theta + \Delta\theta) \log p(u, y \mid \theta + \Delta\theta) \, du \, dy - \int p(u, y \mid \theta + \Delta\theta) \log p(u, y \mid \theta) \, du \, dy$.

According to a Taylor expansion:

$\log p(u, y \mid \theta) = \log p(u, y \mid \theta + \Delta\theta) - g^\top \Delta\theta + \frac{1}{2} \Delta\theta^\top H \Delta\theta + O(\|\Delta\theta\|^3)$,    (A.27)

where $g$ is the gradient of $\log p(u, y \mid \theta)$ at the point $\theta + \Delta\theta$: $g = \nabla_\theta \log p(u, y \mid \theta) \big|_{\theta + \Delta\theta}$.

Note that $p(u, y \mid \theta) = p(y \mid u, \theta) q(u)$. Since $q(u)$ does not depend on $\theta$, $\nabla_\theta \log p(u, y \mid \theta) = \nabla_\theta \log p(y \mid u, \theta)$. Therefore, we can write $g$ as:

$g = \nabla_\theta \log p(y \mid u, \theta) \big|_{\theta + \Delta\theta} = \Big( \frac{\partial \log p(y \mid u, \theta + \Delta\theta)}{\partial \theta_1}, \ldots, \frac{\partial \log p(y \mid u, \theta + \Delta\theta)}{\partial \theta_d} \Big)^\top$.

Similarly, the Hessian $H$ can be written as $H = \nabla^2_\theta \log p(u, y \mid \theta) \big|_{\theta + \Delta\theta} = \nabla^2_\theta \log p(y \mid u, \theta) \big|_{\theta + \Delta\theta}$, with entries $H_{ij} = \frac{\partial^2 \log p(y \mid u, \theta + \Delta\theta)}{\partial \theta_i \partial \theta_j}$.

We use this Taylor expansion in the KL-divergence term, and use the notation $\hat\theta = \theta + \Delta\theta$, so that $\theta - \hat\theta = -\Delta\theta$:

$D_{KL}\big( P_{u,y}(\theta + \Delta\theta) \,\|\, P_{u,y}(\theta) \big)$
$= \int p(u, y \mid \hat\theta) \log p(u, y \mid \hat\theta) \, du \, dy - \int p(u, y \mid \hat\theta) \Big( \log p(u, y \mid \hat\theta) - g^\top \Delta\theta + \frac{1}{2} \Delta\theta^\top H \Delta\theta \Big) du \, dy + O(\|\Delta\theta\|^3)$
$= \underbrace{\int p(u, y \mid \hat\theta) \log p(u, y \mid \hat\theta) \, du \, dy - \int p(u, y \mid \hat\theta) \log p(u, y \mid \hat\theta) \, du \, dy}_{=0}$
$\quad + \underbrace{\int p(u, y \mid \hat\theta) \sum_{i=1}^{d} \frac{\partial \log p(y \mid u, \hat\theta)}{\partial \theta_i} \Delta\theta_i \, du \, dy}_{=0,\ \text{see}\ (*)} \underbrace{- \frac{1}{2} \int p(u, y \mid \hat\theta) \sum_{i=1}^{d} \sum_{j=1}^{d} \Delta\theta_i \Delta\theta_j \frac{\partial^2 \log p(y \mid u, \hat\theta)}{\partial \theta_i \partial \theta_j} \, du \, dy}_{=\frac{1}{2} \Delta\theta^\top F \Delta\theta,\ \text{see}\ (**)} + O(\|\Delta\theta\|^3)$
$= \frac{1}{2} \Delta\theta^\top F \Delta\theta + O(\|\Delta\theta\|^3)$.

We explain $(*)$; by the regularity conditions of the Leibniz integral rule (switching differentiation and integration):

$\int p(u, y \mid \hat\theta) \sum_{i=1}^{d} \frac{\partial \log p(y \mid u, \hat\theta)}{\partial \theta_i} \Delta\theta_i \, du \, dy = \int q(u) \int p(y \mid u, \hat\theta) \sum_{i=1}^{d} \frac{1}{p(y \mid u, \hat\theta)} \frac{\partial p(y \mid u, \hat\theta)}{\partial \theta_i} \Delta\theta_i \, dy \, du = \int q(u) \sum_{i=1}^{d} \Delta\theta_i \frac{\partial}{\partial \theta_i} \underbrace{\int p(y \mid u, \hat\theta) \, dy}_{=1} \, du = 0$.

We explain $(**)$:

$-\frac{1}{2} \int p(u, y \mid \hat\theta) \sum_{i=1}^{d} \sum_{j=1}^{d} \Delta\theta_i \Delta\theta_j \frac{\partial^2 \log p(y \mid u, \hat\theta)}{\partial \theta_i \partial \theta_j} \, du \, dy$
$= -\frac{1}{2} \int q(u) \sum_{i=1}^{d} \sum_{j=1}^{d} \Delta\theta_i \Delta\theta_j \int p(y \mid u, \hat\theta) \frac{\partial}{\partial \theta_i} \Big( \frac{1}{p(y \mid u, \hat\theta)} \frac{\partial p(y \mid u, \hat\theta)}{\partial \theta_j} \Big) dy \, du$
$= -\frac{1}{2} \int q(u) \sum_{i=1}^{d} \sum_{j=1}^{d} \Delta\theta_i \Delta\theta_j \int p(y \mid u, \hat\theta) \Big( \frac{1}{p(y \mid u, \hat\theta)} \frac{\partial^2 p(y \mid u, \hat\theta)}{\partial \theta_i \partial \theta_j} - \frac{1}{p(y \mid u, \hat\theta)^2} \frac{\partial p(y \mid u, \hat\theta)}{\partial \theta_i} \frac{\partial p(y \mid u, \hat\theta)}{\partial \theta_j} \Big) dy \, du$
$= -\frac{1}{2} \int q(u) \sum_{i=1}^{d} \sum_{j=1}^{d} \Delta\theta_i \Delta\theta_j \int \Big( \frac{\partial^2 p(y \mid u, \hat\theta)}{\partial \theta_i \partial \theta_j} - p(y \mid u, \hat\theta) \frac{\partial \log p(y \mid u, \hat\theta)}{\partial \theta_i} \frac{\partial \log p(y \mid u, \hat\theta)}{\partial \theta_j} \Big) dy \, du$
$= -\frac{1}{2} \int q(u) \sum_{i=1}^{d} \sum_{j=1}^{d} \Delta\theta_i \Delta\theta_j \frac{\partial^2}{\partial \theta_i \partial \theta_j} \underbrace{\int p(y \mid u, \hat\theta) \, dy}_{=1} \, du + \frac{1}{2} \int q(u) \sum_{i=1}^{d} \sum_{j=1}^{d} \Delta\theta_i \Delta\theta_j \, \mathbb{E}_{P_{y|u}(\hat\theta)} \Big[ \frac{\partial \log p(y \mid u, \hat\theta)}{\partial \theta_i} \frac{\partial \log p(y \mid u, \hat\theta)}{\partial \theta_j} \Big] du$
$= \frac{1}{2} \Delta\theta^\top F \Delta\theta$,

where

$F_{ij} = \mathbb{E}_{Q_u} \Big[ \mathbb{E}_{P_{y|u}(\hat\theta)} \Big[ \frac{\partial \log p(y \mid u, \hat\theta)}{\partial \theta_i} \frac{\partial \log p(y \mid u, \hat\theta)}{\partial \theta_j} \Big] \Big]$.

Since we do not have access to $Q_u$, we use the empirical training distribution $\hat Q_u$:

$\hat F_{ij} = \frac{1}{|S|} \sum_{u \in S_u} \mathbb{E}_{P_{y|u}(\hat\theta)} \Big[ \frac{\partial \log p(y \mid u, \hat\theta)}{\partial \theta_i} \frac{\partial \log p(y \mid u, \hat\theta)}{\partial \theta_j} \Big]$.

We obtain:

$D_{KL}\big( P_{u,y}(\theta + \Delta\theta) \,\|\, P_{u,y}(\theta) \big) = \frac{1}{2} (\theta - \hat\theta)^\top \hat F (\theta - \hat\theta) + O(\|\Delta\theta\|^3)$.

Now we can summarize the proof of Theorem 2:
Proof.
Adding the relationship from (Ollivier et al., 2018), $\hat F_{t|t-1} = \frac{1}{t} P^{-1}_{t|t-1}$, and combining the results from Lemma A.1 and Lemma A.2, our objective function can be approximated as:

$\mathcal{L}^{EKF}_t(\theta_t) = \frac{1}{2} \delta(u_t; \theta_t)^\top P_{n_t}^{-1} \delta(u_t; \theta_t) + \frac{1}{2} (\theta_t - \hat\theta_{t|t-1})^\top P^{-1}_{t|t-1} (\theta_t - \hat\theta_{t|t-1})$
$= C + N \, \mathbb{E}_{\hat Q_u}\big[ D_{KL}\big( \hat Q_{y|u} \,\|\, P_{y|u}(\theta) \big) \big] + \frac{t}{2} (\theta_t - \hat\theta_{t|t-1})^\top \hat F_{t|t-1} (\theta_t - \hat\theta_{t|t-1})$
$\approx C + N \, \mathbb{E}_{\hat Q_u}\big[ D_{KL}\big( \hat Q_{y|u} \,\|\, P_{y|u}(\theta) \big) \big] + t \cdot D_{KL}\big( P_{u,y}(\theta + \Delta\theta) \,\|\, P_{u,y}(\theta) \big)$,

which completes the proof.
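To illustrate what the empirical Fisher matrix $\hat F$ entering this trust-region interpretation looks like computationally, here is a small self-contained numpy example; it is our own construction and assumes a simple linear-Gaussian model $y \mid u \sim \mathcal{N}(u^\top\theta, \sigma^2)$ rather than anything used in the paper's experiments:

```python
import numpy as np

# Empirical Fisher: F_hat_ij = (1/|S|) sum_u E_{p(y|u,theta)}[ dlogp/dtheta_i * dlogp/dtheta_j ].
# For y | u ~ N(u^T theta, sigma^2) the score is (y - u^T theta) * u / sigma^2,
# and the inner expectation is approximated by sampling y from the model itself.
rng = np.random.default_rng(3)
d, n_inputs, n_y = 4, 32, 200
sigma = 1.0
theta = rng.normal(size=d)
U = rng.normal(size=(n_inputs, d))            # empirical Q_hat_u: a batch of inputs

F_hat = np.zeros((d, d))
for u in U:
    y = u @ theta + sigma * rng.normal(size=n_y)               # samples from p(y | u, theta)
    scores = (y[:, None] - u @ theta) * u[None, :] / sigma**2  # (n_y, d) score vectors
    F_hat += scores.T @ scores / n_y
F_hat /= n_inputs

# For this model the exact Fisher is (1/|S|) sum_u u u^T / sigma^2.
F_exact = U.T @ U / (n_inputs * sigma**2)
print(np.abs(F_hat - F_exact).max())          # small Monte-Carlo error, on the order of 1e-2
```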
B. Experimental details

Our experiments are based on the baselines implementation (Dhariwal et al., 2017) for PPO and TRPO. We used their default hyperparameters, and only changed the optimizer for the value function from Adam to KOVA. For brevity, we provide here the network architecture and the hyperparameters for each algorithm.
PPO:
Following (Schulman et al., 2017), the policy network is a fully-connected MLP with two hidden layers, 64 units each and tanh nonlinearities. The output of the policy network is the mean and standard deviations of a Gaussian distribution of actions for a given (input) state. The value network is a fully-connected MLP with two hidden layers, 64 units each and tanh nonlinearities. The output of the value network is a scalar, representing the value function for a given (input) state. PPO uses the GAE estimator for the advantage function (Schulman et al., 2015b). In Tables B.1 and B.2 we present the hyperparameters for the PPO experiments. The Horizon represents the number of timesteps per policy rollout.
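For concreteness, the following is a rough numpy sketch of the described two-hidden-layer tanh networks. It is our own illustration (the experiments use the baselines implementation, and details such as whether the policy's log standard deviation is a network output or a separate parameter differ between implementations), so treat the dimensions and initialization as assumptions:

```python
import numpy as np

def mlp(sizes, rng):
    """Initialize a fully-connected network; tanh is applied to hidden layers in forward()."""
    return [(rng.normal(scale=0.1, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:           # tanh on hidden layers, linear output layer
            x = np.tanh(x)
    return x

rng = np.random.default_rng(0)
obs_dim, act_dim = 11, 3                  # Hopper-like dimensions, purely illustrative
value_net = mlp([obs_dim, 64, 64, 1], rng)               # V(s): scalar output
policy_net = mlp([obs_dim, 64, 64, 2 * act_dim], rng)    # mean and log-std per action dim

s = rng.normal(size=obs_dim)
v = forward(value_net, s)                 # value estimate for state s
mean, log_std = np.split(forward(policy_net, s), 2)
a = mean + np.exp(log_std) * rng.normal(size=act_dim)    # sampled action
print(v.shape, a.shape)                   # (1,) (3,)
```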
TRPO:
The policy network and the value network are the same as described for PPO, only with 32 units instead of 64. TRPO also uses the GAE estimator. In Tables B.3 and B.4 we present the hyperparameters for the TRPO experiments.
Table B.1.
PPO hyper-parameters used for the Mujoco tasks
Hyper-parameter Value
Horizon                  2048
Adam learning rate
Num. epochs              10
Minibatch size           64
Discount (γ)
GAE parameter (λ)

Table B.2.
KOVA hyper-parameters used for VF optimization in PPO
Hyper-parameter Value
KOVA learning rate       (Swimmer, HalfCheetah, Walker2d)
                         (Hopper, Ant)
P_{n_t} type             max-ratio
η                        (Hopper, HalfCheetah, Ant)
                         (Swimmer, Walker2d)

Table B.3.
TRPO hyper-parameters used for Mujoco tasks
Hyper-parameter Value
Horizon                  1024
Batch size               64
Discount (γ)
GAE parameter (λ)
Normalize observations   True
Table B.4.
KOVA hyper-parameters used for VF optimization in TRPO
Hyper-parameter Value
KOVA learning rate       (Swimmer, Hopper)
                         (HalfCheetah)
                         (Ant, Walker2d)
P_{n_t} type             max-ratio
η