
Safe and Sample-Efficient Reinforcement Learning Algorithms for Factored Environments

 

Abstract


Reinforcement Learning (RL) deals with problems that can be modeled as a Markov Decision Process (MDP) in which the transition function is unknown. In situations where an arbitrary policy π is already in execution and the experiences with the environment have been recorded in a batch D, an RL algorithm can use D to compute a new policy π′. However, the policy computed by traditional RL algorithms might perform worse than π. Our goal is to develop safe RL algorithms, where the agent has high confidence that the performance of π′ is better than the performance of π given D. To develop sample-efficient and safe RL algorithms, we combine ideas from exploration strategies in RL with a safe policy improvement method.

1 Model-based Exploration in RL

To find the optimal policy quickly, the R-max algorithm [Brafman and Tennenholtz, 2002] incentivizes the agent to explore unknown parts of the environment in the early stages of the learning process. To do so, it keeps track of a set of state-action pairs considered known:

K_m = {(s, a) ∈ S × A | n(s, a) ≥ m},    (1)

where n(s, a) is the number of times the agent has applied action a in state s, and m is a threshold for considering a state-action pair known.

Often, the state space S can be represented by a set of state factors X = {X_1, …, X_|X|}, where each factor X_i has a domain Dom(X_i). When these factors are highly independent, a Factored MDP (FMDP) can compactly represent an MDP, using a dependence function D : S × A × X → I that indicates the commonalities among different factors, where I is a set of dependency identifiers [Strehl, 2007]. The probabilistic transition function can then be compactly represented as

T(s′ | s, a) = ∏_{i=1}^{|X|} P(s′_i | D(s, a, X_i)),

where s′_i is the value of X_i in the next state s′.

The factored R-max algorithm is a direct extension of R-max for FMDPs [Guestrin et al., 2003]. It maintains an estimate of the distribution of each transition component and decides which state-action pairs are known according to these estimates. The algorithm only considers as known the parts of the environment where all transition components have been experienced enough times. In particular, given a minimum number of samples m⃗ = ⟨m_1, …, m_|X|⟩ for each factor and the counters n(j) of each transition component, the set of known state-action pairs is constructed as follows:

K_m⃗ = {(s, a) ∈ S × A | ∀X_i : n(D(s, a, X_i)) ≥ m_i}.    (2)

2 Safe Policy Improvement

Safe Policy Improvement (SPI) addresses the question of how to compute a new policy π that outperforms the behavior policy π_b with high confidence 1 − δ, given a batch of previous interactions D and an admissible error ζ. The SPI by Baseline Bootstrapping (SPIBB) framework is a model-based approach that guarantees safety by bootstrapping unknown parts of the approximated model with the behavior policy π_b [Laroche et al., 2019]. Formally, the set of bootstrapped state-action pairs B_m is the complement of the known set K_m in (1). This way, the SPIBB algorithms are guaranteed to perform at least as well as the behavior policy and, in contrast to other SPI algorithms, do not rely on a safety test. The policy-based Πb-SPIBB algorithm assigns to bootstrapped pairs the same probability as the behavior policy, which restricts the policy space to

Π_b = {π ∈ Π | π(s, a) = π_b(s, a), ∀(s, a) ∈ B_m}.    (3)
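For concreteness, the sketch below shows one way to obtain a member of the constrained space Π_b in (3): copy the behavior policy on the bootstrapped pairs and redistribute the remaining probability mass over the sufficiently sampled actions. This is only a minimal illustration under assumed tabular representations (NumPy arrays for the policies and the counters n(s, a)); it is not the authors' implementation, and Πb-SPIBB itself places the remaining mass greedily with respect to the estimated values rather than proportionally as done here.

```python
import numpy as np

def project_onto_pi_b(pi_candidate, pi_b, counts, m):
    """Return a policy in Pi_b: it equals pi_b on the bootstrapped pairs B_m
    (pairs with fewer than m samples) and reuses pi_candidate elsewhere.

    pi_candidate, pi_b : arrays of shape (n_states, n_actions), rows sum to 1
    counts             : n(s, a), array of shape (n_states, n_actions)
    m                  : threshold for a state-action pair to be known
    """
    bootstrapped = counts < m                      # B_m, the complement of K_m
    pi_new = pi_candidate.copy()
    pi_new[bootstrapped] = pi_b[bootstrapped]      # agree with pi_b on B_m

    for s in range(pi_new.shape[0]):
        known = ~bootstrapped[s]
        if known.any():
            # Probability mass left for the known actions after fixing pi_b on B_m.
            remaining = 1.0 - pi_b[s, bootstrapped[s]].sum()
            total = pi_candidate[s, known].sum()
            if total > 0.0:
                pi_new[s, known] = remaining * pi_candidate[s, known] / total
            else:
                # Candidate puts no mass on known actions; keep pi_b there.
                pi_new[s, known] = pi_b[s, known]
    return pi_new
```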
Laroche et al. [2019] prove that if m = (2/ε²) log(|S||A|2^|S| / δ), then the Πb-SPIBB algorithm is safe, where ε is a bound on the L1 distance between the estimated transition function and the true transition function that depends on the precision parameter ζ. The Πb-SPIBB algorithm can change the policy whenever a subset of the state-action pairs is well known; therefore, it can be less conservative than other SPI algorithms. Nevertheless, when the problem is described by a set of factors, m grows exponentially in the number of factors. In the next section we show that, by taking into account the independence between factors, it is possible to exploit the factored representation of the problem using a minimum number of samples that is only polynomial in the number of parameters of the FMDP.
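The key object behind this claim is the factored known-set test in (2), which counts samples per transition component rather than per joint state-action pair. The sketch below illustrates that bookkeeping; the dictionary of counters, the callable dependence function, and the per-factor thresholds are assumed data structures for illustration, not the representation used in the paper.

```python
from collections import defaultdict

def record_transition(s, a, n_factors, dependence, counts):
    """Update the component counters n(.) after observing one transition
    from (s, a): each factor contributes one sample to its dependency."""
    for i in range(n_factors):
        counts[dependence(s, a, i)] += 1

def is_known(s, a, n_factors, dependence, counts, thresholds):
    """Factored known-set test of Eq. (2): (s, a) is known iff every
    transition component has at least m_i samples."""
    return all(counts[dependence(s, a, i)] >= thresholds[i]
               for i in range(n_factors))

# Usage sketch: counters are indexed by dependency identifiers, so their
# number grows with the FMDP parameters instead of with |S||A|.
counts = defaultdict(int)
```

Because one observed transition updates a counter for every factor, and a counter aggregates samples over all state-action pairs that share the same dependency, components can become known after far fewer samples than the flat count n(s, a) in (1) would require.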

Pages 6460-6461
DOI 10.24963/ijcai.2019/919
Language English
