Kazuteru Miyazaki
Tokyo Institute of Technology
Publications
Featured research published by Kazuteru Miyazaki.
Artificial Intelligence | 1997
Kazuteru Miyazaki; Masayuki Yamamura; Shigenobu Kobayashi
Reinforcement learning aims to adapt an agent to an unknown environment according to rewards. There are two issues: how to handle delayed reward and how to handle uncertainty. Q-learning is a representative reinforcement learning method and is used in many works since it can learn an optimum policy. However, Q-learning needs numerous trials to converge to an optimum policy. If the target environment can be described as a Markov decision process, we can identify it from statistics of sensor-action pairs. Once a correct environment model has been built, we can derive an optimum policy with the Policy Iteration Algorithm. Therefore, we can construct an optimum policy by identifying the environment efficiently. We separate the learning process into two phases: identifying the environment and determining an optimum policy. We propose the k-Certainty Exploration Method for identifying the environment; an optimum policy is then determined by the Policy Iteration Algorithm. We call a rule k-certain if and only if it has been selected k times or more. The k-Certainty Exploration Method excludes any loop of rules that have already achieved k-certainty. We show its effectiveness by comparing it with Q-learning in two experiments: one is Sutton's maze-like environment, and the other is an original environment where the optimum policy varies according to a parameter.
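The two-phase idea above lends itself to a compact illustration. Below is a minimal, hypothetical Python sketch of the k-certainty counting step only, assuming a tabular environment with reset()/step(action) methods (the names and interface are assumptions, not from the paper); the collected transition statistics would then be handed to the Policy Iteration Algorithm.

```python
import random
from collections import defaultdict

def k_certainty_exploration(env, states, actions, k=3, max_steps=10_000):
    """Minimal sketch: explore until every rule (state, action) has been
    selected at least k times, preferring rules that are not yet k-certain."""
    count = defaultdict(int)                              # selection count of each rule
    transitions = defaultdict(lambda: defaultdict(int))   # observed model statistics

    s = env.reset()                                       # assumed interface
    for _ in range(max_steps):
        # Rules selected fewer than k times are not yet k-certain.
        uncertain = [a for a in actions if count[(s, a)] < k]
        # Prefer an uncertain rule; otherwise pick any action to keep moving
        # (the original method additionally avoids loops of k-certain rules).
        a = random.choice(uncertain if uncertain else actions)

        count[(s, a)] += 1
        s_next, _reward, done = env.step(a)               # assumed interface
        transitions[(s, a)][s_next] += 1                  # statistics for model identification
        s = env.reset() if done else s_next

        # Stop once every rule is k-certain; the identified model can then
        # be passed to the Policy Iteration Algorithm.
        if all(count[(st, ac)] >= k for st in states for ac in actions):
            break
    return transitions
```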
systems man and cybernetics | 2000
Kazuteru Miyazaki; Shigenobu Kobayashi
Reinforcement learning is a kind of machine learning. It aims to adapt an agent to a given environment using rewards as a clue. In general, the purpose of a reinforcement learning system is to acquire an optimum policy that maximizes the expected reward per action. However, this is not always the most important goal in every environment. In particular, if we apply reinforcement learning to engineering, we expect the agent to avoid all penalties. In Markov decision processes, we call a rule a penalty rule if and only if it yields a penalty or it can transit to a penalty state without contributing to obtaining any reward. After suppressing all penalty rules, we aim to construct a rational policy whose expected reward per action is larger than zero. We propose the penalty avoiding rational policy making algorithm, which suppresses any penalty as stably as possible and obtains a reward constantly. By applying the algorithm to tic-tac-toe, we show its effectiveness.
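A rough sketch of the penalty-rule marking described above might look like the following. The data structures and names are assumptions, and this is a simplification: the paper's exact definition also requires that a marked rule does not contribute to obtaining a reward.

```python
def find_penalty_rules(transitions, penalized_rules, penalty_states):
    """Sketch: starting from rules that directly received a penalty and from
    known penalty states, repeatedly mark (a) any rule that can transit to a
    penalty state and (b) any state whose every rule is marked, until the
    marking stabilises. The remaining (non-penalty) rules are the candidates
    for a rational policy."""
    penalty_rules = set(penalized_rules)
    penalty_states = set(penalty_states)
    changed = True
    while changed:
        changed = False
        # (a) A rule that can lead to a penalty state is treated as a penalty rule.
        for (s, a), next_states in transitions.items():
            if (s, a) not in penalty_rules and any(ns in penalty_states for ns in next_states):
                penalty_rules.add((s, a))
                changed = True
        # (b) A state with no remaining non-penalty rule is itself a penalty state.
        for s in {s2 for (s2, _a) in transitions}:
            rules_in_s = [(s2, a) for (s2, a) in transitions if s2 == s]
            if s not in penalty_states and all(r in penalty_rules for r in rules_in_s):
                penalty_states.add(s)
                changed = True
    return penalty_rules
```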
New Generation Computing | 2001
Kazuteru Miyazaki; Shigenobu Kobayashi
In multi-agent reinforcement learning systems, it is important to share a reward among all agents. We focus on the Rationality Theorem of Profit Sharing [5] and analyze how to share a reward among all profit sharing agents. When an agent gets a direct reward R (R > 0), an indirect reward μR (μ ≥ 0) is given to the other agents. We have derived the necessary and sufficient condition to preserve rationality as follows:
\mu < \frac{M - 1}{M^{W}\left(1 - \left(\tfrac{1}{M}\right)^{W_o}\right)(n - 1)L}
where M and L are the maximum numbers of conflicting rules and of rational rules for the same sensory input, W and W_o are the maximum episode lengths of the direct-reward and indirect-reward agents, and n is the number of agents. This theorem is derived by avoiding the least desirable situation, in which the expected reward per action is zero. Therefore, if we use this theorem, we can exploit several efficient aspects of reward sharing. Through numerical examples, we confirm the effectiveness of this theorem.
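As a quick worked check of the bound, one can plug values into the inequality above; the numbers below are illustrative assumptions, not taken from the paper.

```python
def mu_upper_bound(M, L, W, W_o, n):
    """Upper bound on the indirect-reward ratio mu from the theorem above."""
    return (M - 1) / (M**W * (1 - (1.0 / M)**W_o) * (n - 1) * L)

# Hypothetical example: 4 conflicting rules (2 of them rational) per sensory
# input, episode lengths W = W_o = 3, and n = 3 agents.
print(mu_upper_bound(M=4, L=2, W=3, W_o=3, n=3))   # roughly 0.0119
```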
international symposium on autonomous decentralized systems | 1999
Sachiyo Arai; Kazuteru Miyazaki; Shigenobu Kobayashi
intelligent agents | 1999
Kazuteru Miyazaki; Shigenobu Kobayashi
systems man and cybernetics | 1999
Kazuteru Miyazaki; Shigenobu Kobayashi
computational intelligence | 2001
Kazuteru Miyazaki; Shigenobu Kobayashi
Archive | 2009
Kazuteru Miyazaki; Takuji Namatame; Hiroaki Kobayashi
In recent years, a reinforcement learning approach to building an agent's knowledge in a multi-agent world has prevailed. When reinforcement learning is applied to such a world, concurrent learning among the agents, perceptual aliasing, and the design of rewards are the most important problems to be considered. We have already confirmed through experiments that the profit-sharing algorithm is robust against these three problems. In this paper, we focus on an advantage of profit sharing over Q-learning through simulations of controlling cranes, where conflicts exist among the agents. The conflict resolution problem becomes a bottleneck in the multi-agent world if we approach it with a top-down method. Similarly, Q-learning is also weak on this problem without an exhaustive design of the rewards or detailed information about the other agents. We show that the profit-sharing method can resolve it, through the results of experiments on the crane control problem.
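The profit-sharing credit assignment referred to above can be sketched as follows: when a reward is obtained, every rule fired during the episode is reinforced with geometrically decreasing credit, a common formulation in the profit-sharing literature. The function names, decay value, and episode format here are assumptions for illustration only.

```python
from collections import defaultdict

def profit_sharing_update(weights, episode, reward, decay=0.2):
    """Sketch of a profit-sharing reinforcement: when a reward is obtained,
    credit every rule fired during the episode, decreasing geometrically
    from the rewarded step back toward the start of the episode."""
    credit = reward
    for (state, action) in reversed(episode):    # episode: list of fired rules
        weights[(state, action)] += credit
        credit *= decay                           # geometrically decreasing credit
    return weights

# Hypothetical usage
w = defaultdict(float)
profit_sharing_update(w, episode=[("s0", "right"), ("s1", "up"), ("s2", "right")], reward=1.0)
```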
intelligent data engineering and automated learning | 2008
Kazuteru Miyazaki; Shigenobu Kobayashi
Artificial Life and Robotics | 2004
Kazuteru Miyazaki; Sougo Tsuboi; Shigenobu Kobayashi
Reinforcement learning is a kind of machine learning. The partially observable Markov decision process (POMDP) is a representative class of non-Markovian environments in reinforcement learning. The rational policy making (RPM) algorithm learns a deterministic rational policy in POMDPs. Though RPM can learn a policy very quickly, it needs numerous trials to improve the policy. Furthermore, RPM cannot be applied to the class of environments in which there is no deterministic rational policy. In this paper, we propose the rational policy improvement (RPI) algorithm, which combines RPM and the mark transit algorithm with a χ²-goodness-of-fit test. RPI can learn a deterministic or stochastic rational policy in POMDPs. RPI is applied to maze environments. We show that RPI learns the most stable rational policy in comparison with other methods.
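The χ²-goodness-of-fit test mentioned above can be illustrated in isolation. The sketch below only shows how observed outcome frequencies of a rule might be tested against an expected distribution; it is not the paper's mark transit algorithm, and the function name, threshold, and example counts are assumptions.

```python
from scipy.stats import chisquare

def outcomes_differ(observed_counts, expected_probs, alpha=0.05):
    """Illustrative chi-square goodness-of-fit check: do the observed outcome
    counts for a rule deviate significantly from an expected distribution?"""
    total = sum(observed_counts)
    expected = [p * total for p in expected_probs]     # expected counts under the hypothesis
    _stat, p_value = chisquare(f_obs=observed_counts, f_exp=expected)
    return p_value < alpha

# Hypothetical example: 50 observations over three possible next states,
# tested against a uniform expectation.
print(outcomes_differ([30, 15, 5], [1/3, 1/3, 1/3]))   # True: significant deviation
```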