Publications


Featured research published by Richard S. Sutton.


Machine Learning | 1988

Learning to Predict by the Methods of Temporal Differences

Richard S. Sutton

This article introduces a class of incremental learning procedures specialized for prediction – that is, for using past experience with an incompletely known system to predict its future behavior. Whereas conventional prediction-learning methods assign credit by means of the difference between predicted and actual outcomes, the new methods assign credit by means of the difference between temporally successive predictions. Although such temporal-difference methods have been used in Samuel's checker player, Holland's bucket brigade, and the author's Adaptive Heuristic Critic, they have remained poorly understood. Here we prove their convergence and optimality for special cases and relate them to supervised-learning methods. For most real-world prediction problems, temporal-difference methods require less memory and less peak computation than conventional methods, and they produce more accurate predictions. We argue that most problems to which supervised learning is currently applied are really prediction problems of the sort to which temporal-difference methods can be applied to advantage.
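
At the heart of the paper is the idea that each prediction is updated toward the reward plus the next prediction, rather than waiting for the final outcome. Below is a minimal tabular TD(0) prediction sketch of that update; the trajectory format, step size, and discount are illustrative choices, not details from the paper.

    def td0_predict(episodes, alpha=0.1, gamma=0.9):
        """Tabular TD(0): move each state's value toward the bootstrapped target.

        `episodes` is assumed to be a list of trajectories, each a list of
        (state, reward, next_state) transitions with next_state = None at
        termination.  Illustrative sketch only.
        """
        V = {}  # value estimates, default 0.0
        for trajectory in episodes:
            for state, reward, next_state in trajectory:
                v_next = 0.0 if next_state is None else V.get(next_state, 0.0)
                # credit assignment via the difference between successive predictions
                td_error = reward + gamma * v_next - V.get(state, 0.0)
                V[state] = V.get(state, 0.0) + alpha * td_error
        return V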


Artificial Intelligence | 1999

Between MDPs and semi-MDPs: a framework for temporal abstraction in reinforcement learning

Richard S. Sutton; Doina Precup; Satinder P. Singh

Learning, planning, and representing knowledge at multiple levels of temporal abstraction are key, longstanding challenges for AI. In this paper we consider how these challenges can be addressed within the mathematical framework of reinforcement learning and Markov decision processes (MDPs). We extend the usual notion of action in this framework to include options—closed-loop policies for taking action over a period of time. Examples of options include picking up an object, going to lunch, and traveling to a distant city, as well as primitive actions such as muscle twitches and joint torques. Overall, we show that options enable temporally abstract knowledge and action to be included in the reinforcement learning framework in a natural and general way. In particular, we show that options may be used interchangeably with primitive actions in planning methods such as dynamic programming and in learning methods such as Q-learning. Formally, a set of options defined over an MDP constitutes a semi-Markov decision process (SMDP), and the theory of SMDPs provides the foundation for the theory of options. However, the most interesting issues concern the interplay between the underlying MDP and the SMDP and are thus beyond SMDP theory. We present results for three such cases: 1) we show that the results of planning with options can be used during execution to interrupt options and thereby perform even better than planned, 2) we introduce new intra-option methods that are able to learn about an option from fragments of its execution, and 3) we propose a notion of subgoal that can be used to improve the options themselves. All of these results have precursors in the existing literature; the contribution of this paper is to establish them in a simpler and more general setting with fewer changes to the existing reinforcement learning framework. In particular, we show that these results can be obtained without committing to (or ruling out) any particular approach to state abstraction, hierarchy, function approximation, or the macro-utility problem.
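
The options construct itself is simple to state concretely: an initiation set, a closed-loop policy, and a termination condition, whose execution yields an SMDP-style transition (cumulative discounted reward, resulting state, duration). The sketch below is an illustrative Python rendering under an assumed env.step interface, not code from the paper.

    import random
    from dataclasses import dataclass
    from typing import Any, Callable, Hashable, Set

    State = Hashable
    Action = Any

    @dataclass
    class Option:
        """An option: where it may start, how it acts, and when it stops."""
        initiation_set: Set[State]             # states in which the option may be invoked
        policy: Callable[[State], Action]      # pi: state -> action while the option runs
        termination: Callable[[State], float]  # beta: state -> probability of terminating

    def run_option(env, state, option, gamma=0.99):
        """Execute one option to termination, returning the cumulative discounted
        reward, the resulting state, and the duration (one SMDP transition).
        env.step(action) -> (state, reward, done) is an assumed interface."""
        total, discount, steps = 0.0, 1.0, 0
        while True:
            action = option.policy(state)
            state, reward, done = env.step(action)
            total += discount * reward
            discount *= gamma
            steps += 1
            if done or random.random() < option.termination(state):
                return total, state, steps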


Adaptive Behavior | 2005

Reinforcement Learning for RoboCup Soccer Keepaway

Peter Stone; Richard S. Sutton; Gregory Kuhlmann

RoboCup simulated soccer presents many challenges to reinforcement learning methods, including a large state space, hidden and uncertain state, multiple independent agents learning simultaneously, and long and variable delays in the effects of actions. We describe our application of episodic SMDP Sarsa(λ) with linear tile-coding function approximation and variable λ to learning higher-level decisions in a keepaway subtask of RoboCup soccer. In keepaway, one team, “the keepers,” tries to keep control of the ball for as long as possible despite the efforts of “the takers.” The keepers learn individually when to hold the ball and when to pass to a teammate. Our agents learned policies that significantly outperform a range of benchmark policies. We demonstrate the generality of our approach by applying it to a number of task variations including different field sizes and different numbers of players on each team.
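
The learning update underneath the keepaway agents is linear Sarsa(λ) over binary tile-coded features. A minimal sketch of one such update with replacing traces is shown below; the feature-index interface, step sizes, and the absence of SMDP-style variable-length discounting are simplifications for illustration, not the published agent.

    import numpy as np

    def sarsa_lambda_update(w, z, active_tiles, reward, next_active_tiles,
                            alpha=0.1, gamma=1.0, lam=0.9, terminal=False):
        """One linear Sarsa(lambda) step with binary (tile-coded) features.

        w: weight vector; z: eligibility-trace vector of the same shape.
        active_tiles / next_active_tiles: indices of tiles active for the current
        and next state-action pair.  Illustrative sketch only."""
        q = w[active_tiles].sum()
        q_next = 0.0 if terminal else w[next_active_tiles].sum()
        delta = reward + gamma * q_next - q   # TD error for the chosen actions
        z *= gamma * lam                      # decay all eligibility traces
        z[active_tiles] = 1.0                 # replacing traces for the active tiles
        w += alpha * delta * z
        return w, z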


Automatica | 2009

Natural actor-critic algorithms

Shalabh Bhatnagar; Richard S. Sutton; Mohammad Ghavamzadeh; Mark Lee

We present four new reinforcement learning algorithms based on actor-critic, natural-gradient and function-approximation ideas, and we provide their convergence proofs. Actor-critic reinforcement learning methods are online approximations to policy iteration in which the value-function parameters are estimated using temporal difference learning and the policy parameters are updated by stochastic gradient descent. Methods based on policy gradients in this way are of special interest because of their compatibility with function-approximation methods, which are needed to handle large or infinite state spaces. The use of temporal difference learning in this way is of special interest because in many applications it dramatically reduces the variance of the gradient estimates. The use of the natural gradient is of interest because it can produce better conditioned parameterizations and has been shown to further reduce variance in some cases. Our results extend prior two-timescale convergence results for actor-critic methods by Konda and Tsitsiklis by using temporal difference learning in the actor and by incorporating natural gradients. Our results extend prior empirical studies of natural actor-critic methods by Peters, Vijayakumar and Schaal by providing the first convergence proofs and the first fully incremental algorithms.
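
As a point of reference for the algorithms in the paper, the basic actor-critic loop they build on can be written in a few lines: a linear TD critic estimates values, and its TD error drives a stochastic-gradient step on softmax policy parameters. The sketch below deliberately omits the natural-gradient correction and the two-timescale step-size schedules that the paper's convergence results depend on; names and interfaces are illustrative.

    import numpy as np

    def softmax_probs(theta, action_feats):
        """action_feats: (n_actions, d) feature matrix; returns action probabilities."""
        prefs = action_feats @ theta
        prefs -= prefs.max()                  # numerical stability
        expd = np.exp(prefs)
        return expd / expd.sum()

    def actor_critic_step(theta, v, state_feats, action_feats, action, reward,
                          next_state_feats, alpha_actor=0.01, alpha_critic=0.1,
                          gamma=0.99, terminal=False):
        """One 'vanilla' actor-critic update (no natural gradient)."""
        v_next = 0.0 if terminal else v @ next_state_feats
        delta = reward + gamma * v_next - v @ state_feats     # TD error
        v = v + alpha_critic * delta * state_feats            # critic: linear TD(0)
        probs = softmax_probs(theta, action_feats)
        grad_log_pi = action_feats[action] - probs @ action_feats
        theta = theta + alpha_actor * delta * grad_log_pi     # actor: policy-gradient step
        return theta, v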


IEEE Control Systems Magazine | 1992

Reinforcement learning is direct adaptive optimal control

Richard S. Sutton; Andrew G. Barto; Ronald J. Williams

Control problems can be divided into two classes: 1) regulation and tracking problems, in which the objective is to follow a reference trajectory, and 2) optimal control problems, in which the objective is to extremize a functional of the controlled system's behavior that is not necessarily defined in terms of a reference trajectory. Adaptive methods for problems of the first kind are well known, and include self-tuning regulators and model-reference methods, whereas adaptive methods for optimal-control problems have received relatively little attention. Moreover, the adaptive optimal-control methods that have been studied are almost all indirect methods, in which controls are recomputed from an estimated system model at each step. This computation is inherently complex, making adaptive methods in which the optimal controls are estimated directly more attractive. Here we present reinforcement learning methods as a computationally simple, direct approach to the adaptive optimal control of nonlinear systems.


ACM SIGART Bulletin | 1991

Dyna, an integrated architecture for learning, planning, and reacting

Richard S. Sutton

Dyna is an AI architecture that integrates learning, planning, and reactive execution. Learning methods are used in Dyna both for compiling planning results and for updating a model of the effects of the agent's actions on the world. Planning is incremental and can use the probabilistic and ofttimes incorrect world models generated by learning processes. Execution is fully reactive in the sense that no planning intervenes between perception and action. Dyna relies on machine learning methods for learning from examples---these are among the basic building blocks making up the architecture---yet is not tied to any particular method. This paper briefly introduces Dyna and discusses its strengths and weaknesses with respect to other architectures.
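
The best-known concrete instance of the architecture is the Dyna-Q loop: act, learn from the real transition, update a learned model, then replay simulated transitions from that model as planning. A minimal tabular sketch follows; the env.reset/env.step/env.actions interface and the deterministic one-step model are assumptions for illustration, not the paper's specification.

    import random

    def dyna_q(env, n_episodes=50, planning_steps=10, alpha=0.1, gamma=0.95, eps=0.1):
        """Tabular Dyna-Q: Q-learning from real experience plus planning from a learned model."""
        Q, model = {}, {}                       # Q[(s, a)] -> value; model[(s, a)] -> (r, s', done)
        def q(s, a): return Q.get((s, a), 0.0)
        for _ in range(n_episodes):
            s, done = env.reset(), False
            while not done:
                acts = env.actions(s)
                a = random.choice(acts) if random.random() < eps else max(acts, key=lambda b: q(s, b))
                s2, r, done = env.step(a)
                target = r if done else r + gamma * max(q(s2, b) for b in env.actions(s2))
                Q[(s, a)] = q(s, a) + alpha * (target - q(s, a))       # learn from real experience
                model[(s, a)] = (r, s2, done)                          # update deterministic model
                for _ in range(planning_steps):                        # planning: replay simulated experience
                    (ps, pa), (pr, ps2, pdone) = random.choice(list(model.items()))
                    t = pr if pdone else pr + gamma * max(q(ps2, b) for b in env.actions(ps2))
                    Q[(ps, pa)] = q(ps, pa) + alpha * (t - q(ps, pa))
                s = s2
        return Q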


Biological Cybernetics | 1981

Associative search network: A reinforcement learning associative memory

Andrew G. Barto; Richard S. Sutton; Peter S. Brouwer

An associative memory system is presented which does not require a “teacher” to provide the desired associations. For each input key it conducts a search for the output pattern which optimizes an external payoff or reinforcement signal. The associative search network (ASN) combines pattern recognition and function optimization capabilities in a simple and effective way. We define the associative search problem, discuss conditions under which the associative search network is capable of solving it, and present results from computer simulations. The synthesis of sensory-motor control surfaces is discussed as an example of the associative search problem.
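
To make the problem statement concrete: the system must discover, for each input key, an output pattern that raises a scalar payoff, with no teacher supplying target patterns. The sketch below uses a simple reward-weighted (REINFORCE-style) rule for stochastic binary units; it captures the spirit of associative search but is not the ASN equations from the paper, and payoff(x, y) is an assumed external reinforcement signal.

    import numpy as np

    def associative_search_sketch(payoff, n_in=4, n_out=3, steps=5000, lr=0.1, seed=0):
        """Learn input -> output associations from a scalar payoff alone."""
        rng = np.random.default_rng(seed)
        W = np.zeros((n_out, n_in))
        for _ in range(steps):
            x = (rng.random(n_in) < 0.5).astype(float)   # random binary input key
            p = 1.0 / (1.0 + np.exp(-(W @ x)))           # firing probabilities of output units
            y = (rng.random(n_out) < p).astype(float)    # stochastic output pattern (search)
            z = payoff(x, y)                             # external payoff / reinforcement
            W += lr * z * np.outer(y - p, x)             # strengthen patterns that paid off
        return W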


Biological Cybernetics | 1981

Landmark learning: An illustration of associative search

Andrew G. Barto; Richard S. Sutton

In a previous paper we defined the associative search problem and presented a system capable of solving it under certain conditions. In this paper we interpret a spatial learning problem as an associative search task and describe the behavior of an adaptive network capable of solving it. This example shows how naturally the associative search problem can arise and permits the search, association, and generalization properties of the adaptive network to be clearly illustrated.


Behavioural Brain Research | 1986

Simulation of the classically conditioned nictitating membrane response by a neuron-like adaptive element: Response topography, neuronal firing, and interstimulus intervals

John W. Moore; John E. Desmond; Neil E. Berthier; Diana E.J. Blazis; Richard S. Sutton; Andrew G. Barto

A neuron-like adaptive element with computational features suitable for classical conditioning, the Sutton-Barto (S-B) model, was extended to simulate real-time aspects of the conditioned nictitating membrane (NM) response. The aspects of concern were response topography, CR-related neuronal firing, and interstimulus interval (ISI) effects for forward-delay and trace conditioning paradigms. The topography of the NM CR has the following features: response latency after CS onset decreases over trials; response amplitude increases gradually within the ISI and attains its maximum coincident with the UR. A similar pattern characterizes the firing of some (but not all) neurons in brain regions demonstrated experimentally to be important for NM conditioning. The variant of the S-B model described in this paper consists of a set of parameters and implementation rules based on 10-ms computational time steps. It differs from the original S-B model in a number of ways. The main difference is the assumption that CS inputs to the adaptive element are not instantaneous but are instead shaped by unspecified coding processes so as to produce outputs that conform with the real-time properties of NM conditioning. The model successfully simulates the aforementioned features of NM response topography. It is also capable of simulating appropriate ISI functions, i.e. with maximum conditioning strength at ISIs of 250 ms, for forward-delay and trace paradigms. The original model's successful treatment of multiple-CS phenomena, such as blocking, conditioned inhibition, and higher-order conditioning, is retained by the present model.


Behavioural Brain Research | 1982

Simulation of anticipatory responses in classical conditioning by a neuron-like adaptive element

Andrew G. Barto; Richard S. Sutton

A neuron-like adaptive element is described that produces an important feature of the anticipatory nature of classical conditioning. The response that occurs after training (conditioned response) usually begins earlier than the reinforcing stimulus (unconditioned stimulus). The conditioned response therefore usually anticipates the unconditioned stimulus. This aspect of classical conditioning has been largely neglected by hypotheses that neurons provide single unit analogs of conditioning. This paper briefly presents the model and extends earlier results by computer simulation of conditioned inhibition and chaining of associations.

Collaboration


Dive into Richard S. Sutton's collaborations.

Top Co-Authors

Andrew G. Barto
University of Massachusetts Amherst

Paul J. Werbos
National Science Foundation

W. Thomas Miller
University of New Hampshire

E. James Kehoe
University of New South Wales