2019 53rd Annual Conference on Information Sciences and Systems (CISS)
Policy Gradient using Weak Derivatives for Reinforcement Learning
Abstract
Reinforcement learning (RL) is a form of implicit stochastic adaptive control in which the optimal control policy is estimated without directly estimating the underlying model parameters. This paper considers reinforcement learning for an infinite-horizon discounted-cost continuous-state Markov decision process (MDP). In an MDP, actions affect the Markovian state dynamics and yield rewards for the agent. The objective is to find a map from states to actions, known as a policy, that maximizes the expected return accumulated over an infinite horizon. There are many ways to estimate such a policy: policy iteration, Q-learning (which operates in the so-called "value" space), and policy gradient methods (which operate in the policy space); see [1, 14] for a detailed discussion of these methods.
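For concreteness, the objective described above can be written in the following standard form; the notation here (parameterized policy \pi_\theta, discount factor \gamma, per-stage reward r) is assumed for illustration rather than taken from the abstract itself:

% Infinite-horizon discounted objective (standard formulation; notation assumed):
% the agent seeks parameters \theta of a policy \pi_\theta maximizing the
% expected discounted return, with discount factor \gamma \in (0,1).
\begin{equation}
J(\theta) \;=\; \mathbb{E}_{\pi_\theta}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right],
\qquad
\theta^{\star} \in \arg\max_{\theta}\, J(\theta).
\end{equation}

Policy gradient methods, the class to which this paper's weak-derivative approach belongs, ascend this objective by estimating \nabla_\theta J(\theta) from sampled trajectories.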