Improved Sample Complexity for Incremental Autonomous Exploration in MDPs
Jean Tarbouriech, Matteo Pirotta, Michal Valko, Alessandro Lazaric
Jean Tarbouriech
Facebook AI Research Paris & Inria Lille [email protected]
Matteo Pirotta
Facebook AI Research Paris [email protected]
Michal Valko
DeepMind Paris [email protected]
Alessandro Lazaric
Facebook AI Research Paris [email protected]
Abstract
We investigate the exploration of an unknown environment when no reward function is provided. Building on the incremental exploration setting introduced by Lim and Auer [1], we define the objective of learning the set of ε-optimal goal-conditioned policies attaining all states that are incrementally reachable within L steps (in expectation) from a reference state s_0. In this paper, we introduce a novel model-based approach that interleaves discovering new states from s_0 and improving the accuracy of a model estimate that is used to compute goal-conditioned policies to reach newly discovered states. The resulting algorithm, DisCo, achieves a sample complexity scaling as Õ(L^5 S_{L+ε} Γ_{L+ε} A ε^{-2}), where A is the number of actions, S_{L+ε} is the number of states that are incrementally reachable from s_0 in L + ε steps, and Γ_{L+ε} is the branching factor of the dynamics over such states. This improves over the algorithm proposed in [1] in both ε and L at the cost of an extra Γ_{L+ε} factor, which is small in most environments of interest. Furthermore, DisCo is the first algorithm that can return an ε/c_min-optimal policy for any cost-sensitive shortest-path problem defined on the L-reachable states with minimum cost c_min. Finally, we report preliminary empirical results confirming our theoretical findings.

Introduction

In cases where the reward signal is not informative enough (e.g., too sparse, time-varying or even absent), a reinforcement learning (RL) agent needs to explore the environment driven by objectives other than reward maximization, see e.g., [2, 3, 4, 5, 6]. This can be performed by designing intrinsic rewards to drive the learning process, for instance via state visitation counts [7, 8], novelty or prediction errors [9, 10, 11]. Other recent methods perform information-theoretic skill discovery to learn a set of diverse and task-agnostic behaviors [12, 13, 14]. Alternatively, goal-conditioned policies learned by carefully designing the sequence of goals during the learning process are often used to solve sparse reward problems [15] and a variety of goal-reaching tasks [16, 17, 18, 19].

While the approaches reviewed above effectively leverage deep RL techniques and are able to achieve impressive results in complex domains (e.g., Montezuma's Revenge [15] or real-world robotic manipulation tasks [19]), they often lack substantial theoretical understanding and guarantees. Recently, some unsupervised RL objectives were analyzed rigorously. Some of them quantify how well the agent visits the states under a sought-after frequency, e.g., to induce a maximally entropic state distribution [20, 21, 22, 23]. While such strategies provably mimic their desired behavior via a Frank-Wolfe algorithmic scheme, they may not learn how to effectively reach any state of the environment and thus may not be sufficient to efficiently solve downstream tasks. Another relevant take is the reward-free RL paradigm of [24]: following its exploration phase, the agent is able to compute a near-optimal policy for any reward function at test time. While this framework yields strong end-to-end guarantees, it is limited to the finite-horizon setting and the agent is thus unable to tackle tasks beyond finite-horizon, e.g., goal-conditioned tasks.

In this paper, we build on and refine the setting of incremental exploration of [1]: the agent starts at an initial state s_0 in an unknown, possibly large environment, and it is provided with a RESET action to restart at s_0.
At a high level, in this setting the agent should explore the environment and stop when it has identified the tasks within its reach and learned to master each of them sufficiently well. More specifically, the objective of the agent is to learn a goal-conditioned policy for any state that can be reached from s_0 within L steps in expectation; such a state is said to be L-controllable. Lim and Auer [1] address this setting with the UcbExplore method, for which they bound the number of exploration steps that are required to identify in an incremental way all L-controllable states (i.e., the algorithm needs to define a suitable stopping condition) and to return a set of policies that are able to reach each of them in at most L + ε steps. A key aspect of UcbExplore is to first focus on simple states (i.e., states that can be reached within a few steps), learn policies to efficiently reach them, and leverage them to identify and tackle states that are increasingly more difficult to reach. This approach aims to avoid wasting exploration in the attempt of reaching states that are further than L steps from s_0 or that are too difficult to reach given the limited knowledge available at earlier stages of the exploration process. Our main contributions are:

• We strengthen the objective of incremental exploration and require the agent to learn ε-optimal goal-conditioned policies for any L-controllable state. Formally, let V*(s) be the length of the shortest path from s_0 to s; then the agent needs to learn a policy to navigate from s_0 to s in at most V*(s) + ε steps, while in [1] any policy reaching s in at most L + ε steps is acceptable.

• We design DisCo, a novel algorithm for incremental exploration. DisCo relies on an estimate of the transition model to compute goal-conditioned policies to the states observed so far, and then uses those policies to improve the accuracy of the model and incrementally discover new states.

• We derive a sample complexity bound for DisCo scaling as Õ(L^5 S_{L+ε} Γ_{L+ε} A ε^{-2}), where A is the number of actions, S_{L+ε} is the number of states that are incrementally controllable from s_0 in L + ε steps, and Γ_{L+ε} is the branching factor of the dynamics over such incrementally controllable states. Not only is this sample complexity obtained for a more challenging objective than UcbExplore, but it also improves in both ε and L at the cost of an extra Γ_{L+ε} factor, which is small in most environments of interest.

• Leveraging the model-based nature of DisCo, we can also readily compute an ε/c_min-optimal policy for any cost-sensitive shortest-path problem defined on the L-controllable states with minimum cost c_min. This result serves as a goal-conditioned counterpart to the reward-free exploration framework defined by Jin et al. [24] for the finite-horizon setting.

In this section we expand on [1], with a more challenging objective for autonomous exploration.

L-Controllable States

We consider a reward-free
Markov decision process (MDP) [25, Sect. 8.3] M := ⟨S, A, p, s_0⟩. We assume a finite action space A with A = |A| actions, and a finite, possibly large state space S for which an upper bound S on its cardinality is known, i.e., |S| ≤ S. Each state-action pair (s, a) ∈ S × A is characterized by an unknown transition probability distribution p(·|s, a) over next states. We denote by Γ_{S'} := max_{(s,a) ∈ S'×A} ‖{p(s'|s, a)}_{s' ∈ S'}‖_0 the largest branching factor of the dynamics over states in any subset S' ⊆ S. The environment has no extrinsic reward, and s_0 ∈ S is a designated initial state. A deterministic stationary policy π : S → A is a mapping from states to actions, and we denote by Π the set of all possible policies. Since in environments with arbitrary dynamics the learner may get stuck in a state without being able to return to s_0, we introduce the following assumption.

Footnote: We say that f(ε) = Õ(ε^{-α}) if there are constants a, b such that f(ε) ≤ a · ε^{-α} log^b(1/ε).

Footnote: Lim and Auer [1] originally considered a countable, possibly infinite state space; however this leads to a technical issue in the analysis of UcbExplore (acknowledged by the authors via personal communication and explained in App. E.3), which disappears by considering only finite state spaces.

Footnote: This assumption should be contrasted with the finite-horizon setting, where each policy resets automatically after H steps, or assumptions on the MDP dynamics such as ergodicity or bounded diameter, which guarantee that it is always possible to find a policy navigating between any two states.
Figure 1: Illustration of L-controllable states in two environments (the initial state s_0 is in white). Left: each transition between states is deterministic and depicted with an edge. Right: each transition from s_0 to the first layer is equiprobable and the transitions in the successive layers are deterministic. If we set L = 3, then the states belonging to S_L are colored in red. As the right figure illustrates, L-controllability is not necessarily linked to a notion of distance between states, and an L-controllable state may be achieved by traversing states that are not L-controllable themselves.

Assumption 1.
The action space contains a RESET action such that p(s_0 | s, RESET) = 1 for any s ∈ S.

We make explicit the states where a policy π takes action RESET in the following definition.
Definition 1 (Policy restricted on a subset). For any S' ⊆ S, a policy π is restricted on S' if π(s) = RESET for any s ∉ S'. We denote by Π(S') the set of policies restricted on S'.

We measure the performance of a policy in navigating the MDP as follows.

Definition 2. For any policy π and a pair of states (s, s') ∈ S × S, let τ_π(s → s') be the (random) number of steps it takes to reach s' starting from s when executing policy π, i.e., τ_π(s → s') := inf{t ≥ 0 : s_{t+1} = s' | s_1 = s, π}. We also set v_π(s → s') := E[τ_π(s → s')] as the expected traveling time, which corresponds to the value function of policy π in a stochastic shortest-path setting (SSP, [26, Sect. 3]) with initial state s, goal state s' and unit cost function. Note that we have v_π(s → s') = +∞ when the policy π does not reach s' from s with probability 1. Furthermore, for any subset S' ⊆ S and any state s, we denote by V*_{S'}(s_0 → s) := min_{π ∈ Π(S')} v_π(s_0 → s) the length of the shortest path to s, restricted to policies resetting to s_0 from any state outside S'.

The objective of the learning agent is to control efficiently the environment in the vicinity of s_0. We say that a state s is controlled if the agent can reliably navigate to it from s_0, that is, there exists an effective goal-conditioned policy (i.e., a shortest-path policy) from s_0 to s.

Definition 3 (L-controllable states). Given a reference state s_0, we say that a state s is L-controllable if there exists a policy π such that v_π(s_0 → s) ≤ L. The set of L-controllable states is then

    S_L := { s ∈ S : min_{π ∈ Π} v_π(s_0 → s) ≤ L }.     (1)

We illustrate the concept of controllable states in Fig. 1 for L = 3. Interestingly, in the right figure, the black states are not L-controllable. In fact, there is no policy that can directly choose which one of the black states to reach. On the other hand, the red state, despite being in some sense further from s_0 than the black states, does belong to S_L. In general, there is a crucial difference between the existence of a random realization where a state s is reached from s_0 in fewer than L steps (i.e., the black states) and the notion of L-controllability, which means that there exists a policy that consistently reaches the state in a number of steps less than or equal to L on average (i.e., the red state); the sketch below makes this distinction concrete. This explains the choice of the term controllable over reachable, since a state s is often said to be reachable if there is a policy π with a non-zero probability of eventually reaching it, which is a weaker requirement. Unfortunately, Lim and Auer [1] showed that in order to discover all the states in S_L, the learner may require a number of exploration steps that is exponential in L or |S_L| (we refer the reader to [1, Sect. 2.1] for a more formal and complete characterization of this negative result). Intuitively, this negative result is due to the fact that the minimum in Eq. 1 is over the set of all possible policies, including those that may traverse states that are not in S_L. Hence, we similarly constrain the learner to focus on the set of incrementally controllable states.
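To illustrate Definitions 2 and 3, the following minimal sketch estimates v_π(s_0 → s) by Monte Carlo rollouts on a toy environment mimicking the right-hand side of Fig. 1. The exact layout (four equiprobable first-layer states followed by two deterministic layers) and all names are assumptions made purely for illustration, not the domain used in the paper.

```python
import random

# Assumed toy layout: from s0, moving forward lands uniformly on one of four
# first-layer ("black") states; from any of them, two further deterministic steps
# lead to the single "red" state. RESET always returns to s0 (Assumption 1).
FIRST_LAYER = [f"b{i}" for i in range(4)]

def step(state, action):
    if action == "RESET":
        return "s0"
    if state == "s0":
        return random.choice(FIRST_LAYER)   # equiprobable transition to the first layer
    if state in FIRST_LAYER:
        return state + "_mid"                # deterministic successive layers
    if state.endswith("_mid"):
        return "red"
    return state

def rollout_time(policy, goal, cap=10_000):
    """One realization of tau_pi(s0 -> goal), truncated at `cap` steps."""
    s, t = "s0", 0
    while s != goal and t < cap:
        s, t = step(s, policy(s)), t + 1
    return t

def v_hat(policy, goal, n=50_000):
    """Monte Carlo estimate of the expected traveling time v_pi(s0 -> goal)."""
    return sum(rollout_time(policy, goal) for _ in range(n)) / n

# Best policy for the black state b0: retry from s0 whenever the wrong black state comes up.
to_b0 = lambda s: "forward" if s == "s0" else "RESET"
# Policy for the red state: always move forward.
to_red = lambda s: "forward"

print("v(s0 -> b0) ~", v_hat(to_b0, "b0"))    # ~7 > 3: b0 is not 3-controllable
print("v(s0 -> red) ~", v_hat(to_red, "red")) # = 3: red is 3-controllable
```

Under this assumed layout, every black state can be reached in a single step with positive probability (it is reachable), yet no policy controls it in 3 steps on average, whereas the red state is 3-controllable.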
Definition 4 (Incrementally controllable states S_L^→). Let ≺ be some partial order on S. The set S_L^≺ of states controllable in L steps w.r.t. ≺ is defined inductively as follows. The initial state s_0 belongs to S_L^≺ by definition, and if there exists a policy π restricted on {s' ∈ S_L^≺ : s' ≺ s} with v_π(s_0 → s) ≤ L, then s ∈ S_L^≺. The set S_L^→ of incrementally L-controllable states is defined as S_L^→ := ∪_≺ S_L^≺, where the union is over all possible partial orders.

By way of illustration, in Fig. 1 for L = 3, it holds that S_L^→ = S_L in the left figure, whereas S_L^→ = {s_0} ≠ S_L in the right figure. Indeed, while the red state is L-controllable, it requires traversing the black states, which are not L-controllable.

AX Objectives
We are now ready to formalize two alternative objectives for
Autonomous eXploration (AX) in MDPs.

Definition 5 (AX sample complexity). Fix any length L ≥ 1, error threshold ε > 0 and confidence level δ ∈ (0, 1). The sample complexities C_{AX_L}(A, L, ε, δ) and C_{AX*}(A, L, ε, δ) are defined as the number of time steps required by a learning algorithm A to identify a set K ⊇ S_L^→ such that with probability at least 1 − δ, it has learned a set of policies {π_s}_{s ∈ K} that respectively verifies the following AX requirement:

    (AX_L)  ∀ s ∈ K,  v_{π_s}(s_0 → s) ≤ L + ε,
    (AX*)   ∀ s ∈ K,  v_{π_s}(s_0 → s) ≤ V*_{S_L^→}(s_0 → s) + ε.

Footnote: Note that we translated the condition in [1] of a relative error of Lε into an absolute error of ε, to align it with the common formulation of sample complexity in RL.

Designing agents satisfying the objectives defined above introduces critical difficulties w.r.t. standard goal-directed learning in RL. First, the agent has to find accurate policies for a set of goals (i.e., all incrementally L-controllable states) and not just for one specific goal. On top of this, the set of desired goals itself (i.e., the set S_L^→) is unknown in advance and has to be estimated online. Specifically, AX_L is the original objective introduced in [1] and it requires the agent to discover all the incrementally L-controllable states as fast as possible. At the end of the learning process, for each state s ∈ S_L^→ the agent should return a policy that can reach s from s_0 in at most L steps (in expectation). Unfortunately, this may correspond to a rather poor performance in practice. Consider a state s ∈ S_L^→ such that V*_{S_L^→}(s_0 → s) ≪ L, i.e., the shortest path between s_0 and s following policies restricted on S_L^→ is much smaller than L. Satisfying AX_L only guarantees that a policy reaching s in L steps is found. On the other hand, objective AX* is more demanding, as it requires learning a near-optimal shortest-path policy for each state in S_L^→. Since V*_{S_L^→}(s_0 → s) ≤ L and the gap between the two quantities may be arbitrarily large, especially for states close to s_0 and far from the fringe of S_L^→, AX* is a significantly tighter objective than AX_L and it is thus preferable in practice.

We say that an exploration algorithm solves the AX problem if its sample complexity C_AX(A, L, ε, δ) in Def. 5 is polynomial in |K|, A, L, ε^{-1} and log(S). Notice that requiring a logarithmic dependency on the size of S is crucial but nontrivial, since the overall state space may be large and we do not want the agent to waste time trying to reach states that are not L-controllable. The dependency on the (algorithm-dependent and random) set K can always be replaced using the upper bound |K| ≤ |S_{L+ε}^→|, which is implied with high probability by both the AX_L and AX* conditions. Finally, notice that the error threshold ε > 0 has a two-fold impact on the performance of the algorithm. First, ε defines the largest set S_{L+ε}^→ that could be returned by the algorithm: the larger ε, the bigger the set. Second, as ε increases, the quality (in terms of controllability and navigational precision) of the output policies worsens w.r.t. the shortest-path policy restricted on S_L^→.

The DisCo Algorithm
The algorithm
DisCo (for
Discover and Control) is detailed in Alg. 1. It maintains a set K of "controllable" states and a set U of states that are considered "uncontrollable" so far. A state s is tagged as controllable when a policy to reach s in at most L + ε steps (in expectation from s_0) has been found with high confidence, and we denote by π_s such a policy. The states in U are states that have been discovered as potential members of S_L^→, but the algorithm has yet to produce a policy to control any of them in fewer than L + ε steps. The algorithm stores an estimate of the transition model and it proceeds through rounds, which are indexed by k and incremented whenever a state in U gets transferred to the set K, i.e., when the transition model reaches a level of accuracy sufficient
to compute a policy to control one of the states encountered before. We denote by K_k (resp. U_k) the set of controllable (resp. uncontrollable) states at the beginning of round k. DisCo stops at a round K when it can confidently claim that all the remaining states outside of K_K cannot be L-controllable.

Algorithm 1: DisCo
  Input: actions A, initial state s_0, confidence parameter δ ∈ (0, 1), error threshold ε > 0, L ≥ 1 and (possibly adaptive) allocation function φ : P(S) → N (where P(S) denotes the power set of S).
  Initialize k := 0, K_0 := {s_0}, U_0 := {} and a restricted policy π_{s_0} ∈ Π(K_0). Set ε := min{ε, 1} and continue := True.
  while continue do
      Set k += 1.   // new round
      // ① Sample collection on K
      For each (s, a) ∈ K_k × A, execute policy π_s until the total number of visits N_k(s, a) to (s, a) satisfies N_k(s, a) ≥ n_k := φ(K_k).
      For each (s, a) ∈ K_k × A, add the observed next states s' ∼ p(·|s, a) to U_k if s' ∉ K_k.
      // ② Restriction of candidate states U
      Compute transitions p̂_k(s'|s, a) and W_k := { s' ∈ U_k : ∃ (s, a) ∈ K_k × A, p̂_k(s'|s, a) ≥ (1 − ε/2)/L }.
      if W_k is empty then
          Set continue := False.   // condition STOP1
      else
          // ③ Computation of the optimistic policies on K
          for each state s' ∈ W_k do
              Compute (ũ_{s'}, π̃_{s'}) := OVI_SSP(K_k, A, s', N_k, ε/L), see Alg. 3 in App. D.1.
          Let s† := argmin_{s ∈ W_k} ũ_s(s_0) and ũ† := ũ_{s†}(s_0).
          if ũ† > L then
              Set continue := False.   // condition STOP2
          else
              // ④ State transfer from U to K
              Set K_{k+1} := K_k ∪ {s†}, U_{k+1} := U_k \ {s†} and π_{s†} := π̃_{s†}.
  // ⑤ Policy consolidation: computation on the final set K
  Set K := k.
  for each state s ∈ K_K do
      Compute (ũ_s, π̃_s) := OVI_SSP(K_K, A, s, N_K, ε/L).
  Output: the states s in K_K and their corresponding policy π_s := π̃_s.

Each round is divided into two phases. The first is a sample collection phase. At the beginning of round k, the agent collects additional samples until n_k := φ(K_k) samples are available at each state-action pair in K_k × A (step ①). A key challenge lies in the careful (and adaptive) choice of the allocation function φ, which we report in the statement of Thm. 1 (see Eq. 19 in App. D.4 for its exact definition). Importantly, the incremental construction of K_k entails that sampling at each state s ∈ K_k can be done efficiently. In fact, for all s ∈ K_k the agent has already confidently learned a policy π_s to reach s in at most L + ε steps on average (see how such a policy is computed in the second phase). The generation of transitions (s, a, s') for (s, a) ∈ K_k × A achieves two objectives at once. First, it serves as a discovery step, since all observed next states s' not in K_k are added to U_k; in particular this guarantees sufficient exploration at the fringe (or border) of the set K_k. Second, it improves the accuracy of the model p in the states in K_k, which is essential in computing near-optimal policies and thus fulfilling the AX* condition.

The second phase does not require interacting with the environment and it focuses on the computation of optimistic policies. The agent begins by significantly restricting the set of candidate states in each round to alleviate the computational complexity of the algorithm. Namely, among all the states in U_k, it discards those that do not have a high probability of belonging to S_L^→ by considering a restricted set W_k ⊆ U_k (step ②). In fact, if the estimated probability p̂_k of reaching a state s ∈ U_k from any of the controllable states in K_k is lower than (1 − ε/2)/L, then no shortest-path policy restricted on K_k could get to s from s_0 in fewer than L + ε steps on average. Then for each state s' in W_k, DisCo computes an optimistic policy restricted on K_k to reach s'. Formally, for any candidate state s' ∈ W_k, we define the induced stochastic shortest path (SSP) MDP M'_k with goal state s' as follows.

Definition 6. We define the SSP-MDP M'_k := ⟨S, A'_k(·), c'_k, p'_k⟩ with goal state s', where the action space is such that A'_k(s) = A for all s ∈ K_k and A'_k(s) = {RESET} otherwise (i.e., we focus on policies restricted on K_k). The cost function is such that for all a ∈ A, c'_k(s', a) = 0, and for any s ≠ s', c'_k(s, a) = 1. The transition model is p'_k(s'|s', a) = 1 and p'_k(·|s, a) = p(·|s, a) otherwise. (In words, all actions at states in K_k behave exactly as in M and suffer a unit cost, in all states outside K_k only the reset action to s_0 is available with a unit cost, and all actions at the goal s' induce a zero-cost self-loop.)
The solution of M'_k is the shortest-path policy from s_0 to s' restricted on K_k. Since p'_k is unknown, DisCo cannot compute the exact solution of M'_k; instead, it executes optimistic value iteration (OVI_SSP) for SSP [27, 28] to obtain a value function ũ_{s'} and its associated greedy policy π̃_{s'} restricted on K_k (see App. D.1 for more details).

The agent then chooses a candidate goal state s† for which the value ũ† := ũ_{s†}(s_0) is the smallest. This step can be interpreted as selecting the optimistically most promising new state to control. Two cases are possible. If ũ† ≤ L, then s† is added to K_k (step ④), since the accuracy of the model estimate on the state-action space K_k × A guarantees that the policy π̃_{s†} is able to reach the state s† in fewer than L + ε steps in expectation with high probability (i.e., s† is incrementally (L + ε)-controllable). Otherwise, we can guarantee that S_L^→ ⊆ K_k with high probability. In the latter case, the algorithm terminates and, using the current estimates of the model, it recomputes an optimistic shortest-path policy π_s restricted on the final set K_K for each state s ∈ K_K (step ⑤). This policy consolidation step is essential to identify near-optimal policies restricted on the final set K_K (and thus on S_L^→): indeed, the expansion of the set of so-far controllable states may alter and refine the optimal goal-reaching policies restricted on it (see App. A).

Computational Complexity. Note that algorithmically, we do not need to define M'_k (Def. 6) over the whole state space S, as we can limit it to K_k ∪ {s'}, i.e., the candidate state s' and the set K_k of so-far controllable states. As shown in Thm. 1, this set can be significantly smaller than S. In particular this implies that the computational complexity of the value iteration algorithm used to compute the optimistic policies is independent of S (see App. D.9 for more details). A schematic sketch of the resulting exploration loop is given below.
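The following minimal sketch summarizes how the pieces of Alg. 1 fit together. It is not the authors' implementation: the helper routines collect_until and estimate_model, the callable ovi_ssp (standing for the optimistic value iteration of App. D.1) and the signature of the allocation function phi are all assumptions made for illustration.

```python
def disco(env, A, s0, L, eps, phi, collect_until, estimate_model, ovi_ssp):
    """Schematic sketch of the DisCo loop (Alg. 1); all helpers are assumed, not the paper's code."""
    K, U = {s0}, set()          # controllable / candidate ("uncontrollable") states
    policies = {s0: None}       # goal-conditioned policies for states in K
    counts = {}                 # visit counters N(s, a) and N(s, a, s')
    eps = min(eps, 1.0)

    while True:
        # Step 1: sample collection, until each (s, a) in K x A has n_k = phi(K) samples.
        n_k = phi(K, counts)    # allocation function of Eq. 2 (Thm. 1), evaluated on current estimates
        new_states = collect_until(env, K, policies, counts, n_k)
        U |= {s for s in new_states if s not in K}

        # Step 2: restrict candidates to states reachable with estimated probability
        #         at least (1 - eps / 2) / L from some (s, a) in K x A.
        p_hat = estimate_model(counts, K)
        W = {sp for sp in U
             if any(p_hat.get((s, a, sp), 0.0) >= (1 - eps / 2) / L
                    for s in K for a in A)}
        if not W:
            break               # STOP1: no plausible candidate left

        # Step 3: one optimistic shortest-path policy restricted on K per candidate.
        values, opt_pol = {}, {}
        for sp in W:
            values[sp], opt_pol[sp] = ovi_ssp(K, A, sp, counts, eps / L)
        s_dag = min(W, key=lambda sp: values[sp][s0])
        if values[s_dag][s0] > L:
            break               # STOP2: nothing else is plausibly L-controllable

        # Step 4: transfer the most promising candidate to the controllable set.
        K.add(s_dag); U.discard(s_dag)
        policies[s_dag] = opt_pol[s_dag]

    # Step 5: policy consolidation on the final set K.
    for s in K - {s0}:
        _, policies[s] = ovi_ssp(K, A, s, counts, eps / L)
    return K, policies
```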
Sample Complexity Analysis of DisCo

We now present our main result: a sample complexity guarantee for DisCo for the AX* objective, which directly implies that AX_L is also satisfied.

Theorem 1. There exists an absolute constant α > 0 such that for any L ≥ 1, ε ∈ (0, 1], and δ ∈ (0, 1), if we set the allocation function φ as

    φ : X → α · ( (L^4 Θ̂(X) / ε^2) · log(LSA/(εδ)) + (L^2 |X| / ε) · log(LSA/(εδ)) ),     (2)

with Θ̂(X) := max_{(s,a) ∈ X×A} ( Σ_{s' ∈ X} √( p̂(s'|s, a)(1 − p̂(s'|s, a)) ) )^2, then the algorithm DisCo (Alg. 1) satisfies the following sample complexity bound for AX*:

    C_{AX*}(DisCo, L, ε, δ) = Õ( (L^5 Γ_{L+ε} S_{L+ε} A) / ε^2 + (L^3 S_{L+ε}^2 A) / ε ),     (3)

where S_{L+ε} := |S_{L+ε}^→| and Γ_{L+ε} := max_{(s,a) ∈ S_{L+ε}^→ × A} ‖{p(s'|s, a)}_{s' ∈ S_{L+ε}^→}‖_0 ≤ S_{L+ε} is the maximal support of the transition probabilities p(·|s, a) restricted to the set S_{L+ε}^→.

Given the definition of AX*, Thm. 1 implies that DisCo terminates after C_{AX*}(DisCo, L, ε, δ) time steps, discovers a set of states K ⊇ S_L^→ with |K| ≤ S_{L+ε}, and for each s ∈ K outputs a policy π_s which is ε-optimal w.r.t. policies restricted on S_L^→, i.e., v_{π_s}(s_0 → s) ≤ V*_{S_L^→}(s_0 → s) + ε. Note that Eq. 3 displays only a logarithmic dependency on S, the total number of states. This property of the sample complexity of DisCo, along with its S-independent computational complexity, is significant when the state space S grows large w.r.t. the unknown set of interest S_L^→.
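The allocation function of Eq. 2 can be evaluated directly from the empirical transition estimates. The sketch below is illustrative only: the constant alpha, the exact logarithmic factors and the function names are assumptions (the exact definition is Eq. 19 in App. D.4, which is not reproduced here).

```python
import math

def theta_hat(X, A, p_hat):
    """Empirical aggregate of Thm. 1: max over (s, a) of (sum_{s' in X} sqrt(p(1-p)))^2."""
    best = 0.0
    for s in X:
        for a in A:
            tot = sum(math.sqrt(p_hat.get((s, a, sp), 0.0) * (1.0 - p_hat.get((s, a, sp), 0.0)))
                      for sp in X)
            best = max(best, tot ** 2)
    return best

def phi(X, A, p_hat, L, eps, delta, S, alpha=1.0):
    """Per-pair sample requirement n_k = phi(K_k); alpha and the log factors are placeholders."""
    log_term = math.log(L * S * len(A) / (eps * delta))
    return math.ceil(alpha * (L ** 4 * theta_hat(X, A, p_hat) / eps ** 2 * log_term
                              + L ** 2 * len(X) / eps * log_term))
```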
Proof Sketch of Theorem 1

While the complete proof is reported in App. D, we now provide the main intuition behind the result.

State Transfer from U to K (step ④). Let us focus on a round k and a state s† ∈ U_k that gets added to K_k. For clarity we remove from the notation the round k, the goal state s† and the starting state s_0. We denote by v and ṽ the value functions of the candidate policy π̃ in the true and optimistic model respectively, and by ũ the quantity w.r.t. which π̃ is optimistically greedy. We aim to prove that s† ∈ S_{L+ε}^→ (with high probability). The main chain of inequalities underpinning the argument is

    v ≤ |v − ṽ| + ṽ ≤(a) ε/2 + ṽ ≤(b) ε/2 + ũ + ε/2 ≤(c) L + ε,     (4)

where (c) is guaranteed by algorithmic construction and (b) stems from the chosen level of value iteration accuracy. Inequality (a) has the flavor of a simulation lemma for SSP, relating the shortest-path value function of the same policy between two models (the true one and the optimistic one). Importantly, when restricted to K these two models are close by virtue of the algorithmic design, which enforces the collection of a minimum amount of samples at each state-action pair of K × A, denoted by n. Specifically, we obtain that

    |v − ṽ| = Õ( √(L^4 Γ_K / n) + L^2 |K| / n ),

with Γ_K := max_{(s,a) ∈ K×A} ‖{p(s'|s, a)}_{s' ∈ K}‖_0 ≤ |K|. Note that Γ_K is the branching factor restricted to the set K. Our choice of n (given in Eq. 2) is then dictated so as to upper bound the above quantity by ε/2 in order to satisfy inequality (a). Let us point out that, interestingly yet unfortunately, the structure of the problem does not appear to allow for technical variance-aware improvements seeking to lower the value of n prescribed above (indeed the AX framework requires to analytically encompass the uncontrollable states U into a single meta state with higher transitional uncertainty, see App. D for details).
Termination of the Algorithm. Since S_L^→ is unknown, we have to ensure that none of the states in S_L^→ are "missed". As such, we prove that with overwhelming probability, we have S_L^→ ⊆ K_K when the algorithm terminates at a round denoted by K. It remains to justify the final near-optimality guarantee w.r.t. the set of policies Π(S_L^→). Leveraging the fact that step ⑤ recomputes the policies (π_s)_{s ∈ K_K} on the final set K_K, we establish the following chain of inequalities

    v ≤ |v − ṽ| + ṽ ≤(a) ε/2 + ṽ ≤(b) ε/2 + ũ + ε/2 ≤(c) V*_{K_K} + ε ≤(d) V*_{S_L^→} + ε,     (5)

where (a) and (b) are as in Eq. 4, (c) leverages optimism and (d) stems from the inclusion S_L^→ ⊆ K_K.
Sample Complexity Bound. The choice of allocation function φ in Eq. 2 bounds n_K, the total number of samples required at each state-action pair in K_K × A. We then compute a high-probability bound ψ on the time steps needed to collect a given sample, and show that it scales as Õ(L). Since the sample complexity is solely induced by the sample collection phase (step ①), it can be bounded by the quantity ψ n_K |K_K| A. Putting everything together yields the bound of Thm. 1.

Comparison with UcbExplore [1]
We start by recalling the critical distinction that DisCo succeeds in tackling problem AX*, while UcbExplore [1] fails to do so (see App. A for details on the AX objectives). Nonetheless, in the following we show that even if we restrict our attention to AX_L, for which UcbExplore is designed, DisCo yields a better sample complexity in most of the cases. From [1], UcbExplore verifies

    C_{AX_L}(UcbExplore, L, ε, δ) = Õ( (L^6 S_{L+ε} A) / ε^3 ).     (6)

Footnote: Note that if we replace the error of ε for AX_L with an error of Lε as in [1], we recover the sample complexity of Õ(L^3 S_{L+ε} A / ε^3) stated in [1, Thm. 8].

Eq. 6 shows that the sample complexity of UcbExplore is linear in S_{L+ε}, while for DisCo the dependency is somewhat worse. In the main-order term Õ(1/ε^2) of Eq. 3, the bound depends linearly on S_{L+ε} but also grows with the branching factor Γ_{L+ε}, which is not the "global" branching factor of the MDP but only the branching factor of transitions into S_{L+ε}^→ starting from S_{L+ε}^→. While in general we only have Γ_{L+ε} ≤ S_{L+ε}, in many practical domains (e.g., robotics, user modeling) each state can only transition to a small number of states, i.e., we often have Γ_{L+ε} = O(1) as long as the dynamics is not too "chaotic". While DisCo does suffer from a quadratic dependency on S_{L+ε} in the second term of order Õ(1/ε), we notice that for any S_{L+ε} ≤ L^3 ε^{-2} the bound of DisCo is still preferable. Furthermore, since for ε → 0, S_{L+ε} tends to S_L, the condition is always verified for small enough ε.

Compared to DisCo, the sample complexity of
UcbExplore is worse in both ε and L. As stressed in Sect. 2.2, the better dependency on ε both improves the quality of the output goal-reaching policies and reduces the number of incrementally (L + ε)-controllable states returned by the algorithm. It is interesting to investigate why the bound of [1] (Eq. 6) inherits an Õ(ε^{-3}) dependency. As reviewed in App. E, UcbExplore alternates between two phases of state discovery and policy evaluation. The optimistic policies computed by UcbExplore solve a finite-horizon problem (with horizon set to H_UCB). However, minimizing the expected time to reach a target state is intrinsically an SSP problem, which is exactly what DisCo leverages. By computing policies that solve a finite-horizon problem (note that UcbExplore resets every H_UCB time steps), [1] sets the horizon to H_UCB := ⌈L + Lε^{-1}⌉, which leads to a policy-evaluation phase with sample complexity scaling as Õ(H_UCB ε^{-2}) = Õ(ε^{-3}). Since the rollout budget of Õ(ε^{-3}) is hard-coded into the algorithm, the dependency on ε of UcbExplore's sample complexity cannot be improved by a more refined analysis; instead a different algorithmic approach is required, such as the one employed by DisCo.

Planning on S_L^→ with DisCo
A compelling advantage of
DisCo is that it achieves an accurate estimation of the environment's dynamics restricted to the unknown subset of interest S_L^→. In contrast to UcbExplore, which needs to restart its sample collection from scratch whenever L, ε or some transition costs change, DisCo can thus be robust to changes in such problem parameters. At the end of its exploration phase in Alg. 1, DisCo is able to perform zero-shot planning to solve other tasks restricted on S_L^→, such as cost-sensitive ones. Indeed, in the following we show how the DisCo agent is able to compute an ε/c_min-optimal policy for any stochastic shortest-path problem on S_L^→ with goal state s ∈ S_L^→ (i.e., s is absorbing and zero-cost) and cost function lower bounded by c_min > 0.

Corollary 1.
There exists an absolute constant β > 0 such that for any L ≥ 1, ε ∈ (0, 1] and c_min ∈ (0, 1] verifying ε ≤ β · (L c_min), with probability at least 1 − δ, for any goal state s ∈ S_L^→ and any cost function c taking values in [c_min, 1], DisCo can compute (after its exploration phase, without additional environment interaction) a policy π̂_{s,c} whose SSP value function V_{π̂_{s,c}} verifies

    V_{π̂_{s,c}}(s_0 → s) ≤ V*_{S_L^→}(s_0 → s) + ε/c_min,

where V_π(s_0 → s) := E[ Σ_{t=1}^{τ_π(s_0 → s)} c(s_t, π(s_t)) | s_1 = s_0 ] is the SSP value function of a policy π and V*_{S_L^→}(s_0 → s) := min_{π ∈ Π(S_L^→)} V_π(s_0 → s) is the optimal SSP value function restricted on S_L^→.

It is interesting to compare Cor. 1 with the reward-free exploration framework recently introduced by Jin et al. [24] in finite-horizon. At a high level, the result in Cor. 1 can be seen as a counterpart of [24] beyond finite-horizon problems, specifically in the goal-conditioned setting. While the parameter L defines the horizon of interest for DisCo, resetting after every L steps (as in finite-horizon) would prevent the agent from identifying L-controllable states and lead to poor performance. This explains the distinct technical tools used: while [24] executes finite-horizon no-regret algorithms, DisCo deploys SSP policies restricted on the set of states that it "controls" so far. Algorithmically, both approaches seek to build accurate estimates of the transitions on a specific (unknown) state space of interest: the so-called "significant" states within H steps for [24], and the incrementally L-controllable states S_L^→ for DisCo. Bound-wise, the cost-sensitive AX* problem inherits the critical role of the minimum cost c_min in SSP problems (see App. C and e.g., [27, 28, 29]), which is reflected in the accuracy of Cor. 1 scaling inversely with c_min. Another interesting element of comparison is the dependency on the size of the state space. While the algorithm introduced in [24] is robust w.r.t. states that can be reached with very low probability, it still displays a polynomial dependency on the total number of states S. On the other hand, DisCo has only a logarithmic dependency on S, while it directly depends on the number of (L + ε)-controllable states, which shows that DisCo effectively adapts to the state space of interest and ignores all other states. This result is significant since not only can S_{L+ε} be arbitrarily smaller than S, but also because the set S_{L+ε}^→ itself is initially unknown to the algorithm. A sketch of such zero-shot planning is given below.
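The following minimal sketch illustrates the zero-shot planning of Cor. 1 under stated assumptions: p_hat is the learned empirical model restricted to the returned set K, cost is a new cost function in [c_min, 1], and vi_ssp is a generic value-iteration routine for SSP such as the one sketched in App. B.1 below; none of these names come from the paper.

```python
def plan_cost_sensitive(K, A, s0, p_hat, goal, cost, vi_ssp, precision=1e-4):
    """Zero-shot SSP planning on the learned model restricted to K (a sketch, not the paper's code)."""
    out = "__out__"                      # aggregate of all states outside K: only RESET is available there
    states = [s for s in K if s != goal] + [out]

    def p(s, a, sp):                     # transition model of the restricted SSP-MDP (cf. Def. 6)
        if s == out:
            return 1.0 if sp == s0 else 0.0          # forced reset to s0 (Assumption 1)
        mass_in = sum(p_hat.get((s, a, y), 0.0) for y in K)
        if sp == out:
            return max(0.0, 1.0 - mass_in)           # leftover estimated mass leaves K
        return p_hat.get((s, a, sp), 0.0)

    def c(s, a):                         # unit cost outside K, given costs in [c_min, 1] inside
        return 1.0 if s == out else cost[(s, a)]

    return vi_ssp(states, A, goal, p, c, precision)  # value vector and greedy policy
```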
Figure 2: Proportion of the incrementally L-controllable states identified by DisCo and UcbExplore as a function of time, in a confusing chain domain for L = 4.5 and ε ∈ {0.1, 0.4, 0.8}. Values are averaged over multiple runs.

Numerical Simulations

In this section, we provide the first evaluation of algorithms in the incremental autonomous exploration setting. In the implementation of both
DisCo and
UcbExplore , we remove the logarithmic andconstant terms for simplicity. We also boost the empirical performance of
UcbExplore in variousways, for example by considering confidence intervals derived from the empirical Bernstein inequality(see [30]) as opposed to Hoeffding as done in [1]. We refer the reader to App. F for details on thealgorithmic configurations and on the environments considered.We compare the sample complexity empirically achieved by
DisCo and
UcbExplore. Fig. 2 depicts the time needed to identify all the incrementally L-controllable states when L = 4.5, for different values of ε, on a confusing chain domain. Note that the sample complexity is achieved soon after, when the algorithm can confidently discard all the remaining states as non-controllable (it is reported in Tab. 2 of App. F). We observe that DisCo outperforms UcbExplore for any value of ε. In particular, the gap in performance increases as ε decreases, which matches the theoretical improvement in sample complexity from Õ(ε^{-3}) for UcbExplore to Õ(ε^{-2}) for DisCo. On a second environment, the combination lock problem introduced in [31], we notice that
DisCo again outperforms
UcbExplore , as shown in App. F.Another important feature of
DisCo is that it targets the tighter objective AX (cid:63) , whereas UcbExplore is only able to fulfill objective AX L and may therefore elect suboptimal policies. In App. F we showempirically that, as expected theoretically, this directly translates into higher-quality goal-reachingpolicies recovered by DisCo . Connections to existing deep-RL methods.
While we primarily focus the analysis of
DisCo in thetabular case, we believe that the formal definition of AX problems and the general structure of DisCo may also serve as a theoretical grounding of many recent approaches to unsupervised exploration.For instance, it is interesting to draw a parallel between
DisCo and the ideas behind Go-Explore [32].Go-Explore similarly exploits the following principles: (1) remember states that have previously beenvisited, (2) first return to a promising state (without exploration), (3) then explore from it. Go-Exploreassumes that the world is deterministic and resettable, meaning that one can reset the state of thesimulator to a previous visit to that cell. Very recently [15], the same authors proposed a way to relaxthis requirement by training goal-conditioned policies to reliably return to cells in the archive duringthe exploration phase. In this paper, we investigated the theoretical dimension of this direction, byprovably learning such goal-conditioned policies for the set of incrementally controllable states.
Future work.
Interesting directions for future investigation include: Deriving a lower bound for the AX problems; Integrating
DisCo into the meta-algorithm
MNM [33], which deals with incremental exploration for AX_L in non-stationary environments; Extending the problem to continuous state spaces and function approximation; Relaxing the definition of incrementally controllable states and relaxing the performance definition towards allowing the agent to have a non-zero but limited sample complexity of learning a shortest-path policy for any state at test time.

Broader Impact
This paper makes contributions to the fundamentals of online learning (RL) and due to its theoreticalnature, we see no ethical or immediate societal consequence of our work.
References [1] Shiau Hong Lim and Peter Auer. Autonomous exploration for navigating in MDPs. In
Conference on Learning Theory, pages 40.1–40.24, 2012.
[2] Jürgen Schmidhuber. A possibility for implementing curiosity and boredom in model-building neural controllers. In
Proc. of the international conference on simulation of adaptive behavior:From animals to animats , pages 222–227, 1991.[3] Nuttapong Chentanez, Andrew G Barto, and Satinder P Singh. Intrinsically motivated rein-forcement learning. In
Advances in neural information processing systems , pages 1281–1288,2005.[4] Pierre-Yves Oudeyer and Frederic Kaplan. What is intrinsic motivation? a typology ofcomputational approaches.
Frontiers in neurorobotics , 1:6, 2009.[5] Satinder Singh, Richard L Lewis, Andrew G Barto, and Jonathan Sorg. Intrinsically motivatedreinforcement learning: An evolutionary perspective.
IEEE Transactions on Autonomous MentalDevelopment , 2(2):70–82, 2010.[6] Adrien Baranes and Pierre-Yves Oudeyer. Intrinsically motivated goal exploration for activemotor learning in robots: A case study. In , pages 1766–1773. IEEE, 2010.[7] Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and RemiMunos. Unifying count-based exploration and intrinsic motivation. In
Advances in neuralinformation processing systems , pages 1471–1479, 2016.[8] Haoran Tang, Rein Houthooft, Davis Foote, Adam Stooke, Xi Chen, Yan Duan, John Schulman,Filip DeTurck, and Pieter Abbeel.
#Exploration: A study of count-based exploration for deep reinforcement learning. In Advances in Neural Information Processing Systems, pages 2753–2762, 2017.
[9] Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. Variational information maximizing exploration.
Advances in Neural Information ProcessingSystems (NIPS) , 2016.[10] Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven explorationby self-supervised prediction. In
Proceedings of the IEEE Conference on Computer Vision andPattern Recognition Workshops , pages 16–17, 2017.[11] Mohammad Gheshlaghi Azar, Bilal Piot, Bernardo Avila Pires, Jean-Bastian Grill, FlorentAltch´e, and R´emi Munos. World discovery models. arXiv preprint arXiv:1902.07685 , 2019.[12] Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all youneed: Learning skills without a reward function. In
International Conference on LearningRepresentations , 2019.[13] Archit Sharma, Shixiang Gu, Sergey Levine, Vikash Kumar, and Karol Hausman. Dynamics-aware unsupervised discovery of skills. In
International Conference on Learning Representa-tions , 2020.[14] V´ıctor Campos Cam´u˜nez, Alex Trott, Caiming Xiong, Richard Socher, Xavier Gir´o Nieto, andJordi Torres Vi˜nals. Explore, discover and learn: unsupervised discovery of state-covering skills.In
International Conference on Machine Learning, pages 1317–1327. PMLR, 2020.
[15] Adrien Ecoffet, Joost Huizinga, Joel Lehman, Kenneth O Stanley, and Jeff Clune. First return then explore. arXiv preprint arXiv:2004.12919, 2020.
[16] Carlos Florensa, David Held, Xinyang Geng, and Pieter Abbeel. Automatic goal generation for reinforcement learning agents. In
International Conference on Machine Learning , pages1515–1528, 2018.[17] C´edric Colas, Pierre Fournier, Mohamed Chetouani, Olivier Sigaud, and Pierre-Yves Oudeyer.Curious: intrinsically motivated modular multi-goal reinforcement learning. In
Internationalconference on machine learning , pages 1331–1340. PMLR, 2019.[18] David Warde-Farley, Tom Van de Wiele, Tejas Kulkarni, Catalin Ionescu, Steven Hansen, andVolodymyr Mnih. Unsupervised control through non-parametric discriminative rewards. In
International Conference on Learning Representations , 2019.[19] Vitchyr H Pong, Murtaza Dalal, Steven Lin, Ashvin Nair, Shikhar Bahl, and Sergey Levine.Skew-fit: State-covering self-supervised reinforcement learning. In
International Conferenceon Machine Learning , pages 7783–7792. PMLR, 2020.[20] Elad Hazan, Sham Kakade, Karan Singh, and Abby Van Soest. Provably efficient maximumentropy exploration. In
International Conference on Machine Learning , pages 2681–2691,2019.[21] Jean Tarbouriech and Alessandro Lazaric. Active exploration in markov decision processes.In
The 22nd International Conference on Artificial Intelligence and Statistics , pages 974–982,2019.[22] Wang Chi Cheung. Exploration-exploitation trade-off in reinforcement learning on onlinemarkov decision processes with global concave rewards. arXiv preprint arXiv:1905.06466 ,2019.[23] Jean Tarbouriech, Shubhanshu Shekhar, Matteo Pirotta, Mohammad Ghavamzadeh, and Alessan-dro Lazaric. Active model estimation in markov decision processes. In
Conference on Uncer-tainty in Artificial Intelligence , 2020.[24] Chi Jin, Akshay Krishnamurthy, Max Simchowitz, and Tiancheng Yu. Reward-free explorationfor reinforcement learning. In
International Conference on Machine Learning , pages 4870–4879.PMLR, 2020.[25] Martin L Puterman.
Markov Decision Processes.: Discrete Stochastic Dynamic Programming .John Wiley & Sons, 2014.[26] Dimitri Bertsekas.
Dynamic programming and optimal control , volume 2. 2012.[27] Jean Tarbouriech, Evrard Garcelon, Michal Valko, Matteo Pirotta, and Alessandro Lazaric.No-regret exploration in goal-oriented reinforcement learning. In
International Conference onMachine Learning , pages 9428–9437. PMLR, 2020.[28] Aviv Rosenberg, Alon Cohen, Yishay Mansour, and Haim Kaplan. Near-optimal regret boundsfor stochastic shortest path. In
International Conference on Machine Learning , pages 8210–8219. PMLR, 2020.[29] Dimitri P Bertsekas and Huizhen Yu. Stochastic shortest path problems under weak conditions.
Lab. for Information and Decision Systems Report LIDS-P-2909, MIT , 2013.[30] Mohammad Gheshlaghi Azar, Ian Osband, and R´emi Munos. Minimax regret bounds forreinforcement learning. In
Proceedings of the 34th International Conference on MachineLearning-Volume 70 , pages 263–272. JMLR. org, 2017.[31] Mohammad Gheshlaghi Azar, Vicenc¸ G´omez, and Hilbert J Kappen. Dynamic policy program-ming.
Journal of Machine Learning Research, 13(Nov):3207–3245, 2012.
[32] Adrien Ecoffet, Joost Huizinga, Joel Lehman, Kenneth O Stanley, and Jeff Clune. Go-explore: a new approach for hard-exploration problems. arXiv preprint arXiv:1901.10995, 2019.
[33] Pratik Gajane, Ronald Ortner, Peter Auer, and Csaba Szepesvari. Autonomous exploration for navigating in non-stationary CMPs. arXiv preprint arXiv:1910.08446, 2019.
[34] Blai Bonet. On the speed of convergence of value iteration on stochastic shortest-path problems.
Mathematics of Operations Research , 32(2):365–373, 2007.[35] Jean-Yves Audibert, R´emi Munos, and Csaba Szepesv´ari. Tuning bandit algorithms in stochasticenvironments. In
International conference on algorithmic learning theory , pages 150–165.Springer, 2007.[36] Andreas Maurer and Massimiliano Pontil. Empirical bernstein bounds and sample variancepenalization. arXiv preprint arXiv:0907.3740 , 2009.[37] Dimitri P Bertsekas and John N Tsitsiklis. An analysis of stochastic shortest path problems.
Mathematics of Operations Research , 16(3):580–595, 1991.[38] Ronan Fruit, Matteo Pirotta, and Alessandro Lazaric. Improved analysis of ucrl2 with empiricalbernstein inequality. arXiv preprint arXiv:2007.05456 , 2020.[39] Abbas Kazerouni, Mohammad Ghavamzadeh, Yasin Abbasi, and Benjamin Van Roy. Conserva-tive contextual linear bandits. In
Advances in Neural Information Processing Systems, pages 3910–3919, 2017.

Appendix
A Autonomous Exploration Objectives
We recall the two AX objectives stated in Def. 5: for any length L ≥ 1, error threshold ε > 0 and confidence level δ ∈ (0, 1), the sample complexities C_{AX_L}(A, L, ε, δ) and C_{AX*}(A, L, ε, δ) are defined as the number of time steps required by a learning algorithm A to identify a set K ⊇ S_L^→ such that with probability at least 1 − δ, it has learned a set of policies {π_s}_{s ∈ K} that respectively verifies the following AX requirement:

    (AX_L)  ∀ s ∈ K,  v_{π_s}(s_0 → s) ≤ L + ε,
    (AX*)   ∀ s ∈ K,  v_{π_s}(s_0 → s) ≤ V*_{S_L^→}(s_0 → s) + ε.

As we explain in Sect. 4,
DisCo (Alg. 1) succeeds in tackling condition AX*, whereas UcbExplore [1], which is designed to tackle condition AX_L, is unable to tackle AX*. Note that the algorithmic design of UcbExplore entails that it computes policies whose value function implicitly targets V*_{K_t}, with K_t the current set of controllable states. While V*_{K_t} is always smaller than L, UcbExplore cannot provide any tightness guarantees w.r.t. V*_{K_t}, since it has no guarantee that the transition dynamics are estimated well enough on K_t. An additional challenge with which UcbExplore fails to cope is the fact that the set K_t increases over time and thus unlocks new states and paths, which may be useful to improve its shortest-path policies for previously discovered states.

To better understand this phenomenon, let us introduce an alternative condition AX', tighter than AX_L but looser than AX*, which stems from the challenge of not knowing S_L^→ in advance. We define AX' as follows: for any state s in S_L^→, the objective is to find a policy that can reach s from s_0 in at most L' + ε steps on average, where L' := min{l ≤ L : s ∈ S_l^→}, i.e.,

    (AX')  ∀ s ∈ K,  v_{π_s}(s_0 → s) ≤ L' + ε,  where L' := min{l ≤ L : s ∈ S_l^→}.

As mentioned in [1, Corollary 9], it is possible to run separate instances of UcbExplore with increasing L_n = 1 + nε from n = 0 to ⌈(L − 1)ε^{-1}⌉ (i.e., until n satisfies L_{n−1} ≤ L ≤ L_n). This verifies the condition AX' at the cost of a worsened dependency on both ε and L as follows:

    C_{AX'}(UcbExplore, L, ε, δ) = Õ( (L^7 S_{L+ε} A) / ε^4 ).

While AX' is tighter than AX_L, it may be arbitrarily loose compared to AX*, which illustrates the intrinsic limitations of UcbExplore's design.
UcbExplore incrementally expands a set of "controllable" states K: starting with K = {s_0}, at time t a state s is added to K_t whenever UcbExplore can confidently assess that it managed to learn a policy reaching s in fewer than L steps. Since at time t UcbExplore can only consider policies restricted to the controllable states K_t, even the shortest-path policy computed to reach s at time t may not be ε-optimal w.r.t. the whole set S_L^→. Indeed, every time a state is added to K, this state may unlock new paths which may, for previously controllable states, allow for better shortest-path policies restricted on the updated K. Fig. 3 illustrates this behavior, where the state y unlocks a fast path from y to x which should be taken in y instead of resetting to s_0. Consequently, if the agent seeks to tackle condition AX*, it must have the faculty to backtrack, i.e., continuously update both its belief of the vicinity (K) and its notion of optimality on the vicinity (V*_K). Unfortunately, UcbExplore can only compute policies targeting V*_K with K the current set of controllable states, but it fails to be accurate enough to revise such policies as the set of controllable states K is expanded over time. In contrast, by virtue of its allocation function φ (Eq. 2), which enables it to track the number of collected samples as K increases, DisCo is able to improve its candidate shortest-path policies during the consolidation step ⑤ when the final set K is considered. The following general and simple statement captures how the expansion of the state space of interest may alter and refine the optimal policy restricted on it.

Figure 3: Let X := {s_0} ∪ {x} and Y := X ∪ {y}. For any l ≥ 1, suppose that from s_0 the agent reaches x in l steps with probability 1/2, or reaches y in l + 1 steps with probability 1/2. If the goal state is x, constraining an agent to use policies restricted to X (i.e., that reset to s_0 outside of X) is detrimental, since x can actually be reached in 1 step from y. Formally, we can easily prove that V*_X(s_0 → x) − V*_Y(s_0 → x) = l + 1, which grows arbitrarily as l increases.

Table 1: Comparison between the sample complexity of UcbExplore and DisCo, depending on the condition AX_L, AX' or AX*.

            | UcbExplore [1]             | DisCo (Alg. 1)
    AX_L    | Õ(L^6 S_{L+ε} A / ε^3)     | Õ(L^5 Γ_{L+ε} S_{L+ε} A / ε^2 + L^3 S_{L+ε}^2 A / ε)
    AX'     | Õ(L^7 S_{L+ε} A / ε^4)     | Õ(L^5 Γ_{L+ε} S_{L+ε} A / ε^2 + L^3 S_{L+ε}^2 A / ε)
    AX*     | Unable                     | Õ(L^5 Γ_{L+ε} S_{L+ε} A / ε^2 + L^3 S_{L+ε}^2 A / ε)
Lemma 1. For any two sets X ⊆ Y and any state x ∈ X, we have V*_X(s_0 → x) ≥ V*_Y(s_0 → x). Moreover, the gap between the two quantities may be arbitrarily large.

Proof. The inequality is immediate from Asm. 1. Fig. 3 shows the gap may be arbitrarily large.

Finally, we summarize all the sample complexity results in Tab. 1.
B Efficient Computation of Optimistic SSP Policy
In this section we recall from [27, 28] how to efficiently compute an optimistic stochastic shortest-path(SSP) policy.
B.1 Computation of Optimal Policy in Known SSP
This section details the procedure to efficiently compute an (arbitrarily near-) optimal policy π in a known SSP instance with positive costs and which admits at least one proper policy. Recall that a proper policy is a policy whose execution starting from any non-goal state eventually reaches thegoal state with probability one [26].
Definition 7 (SSP-MDP). An SSP-MDP is an MDP M = (S†, A, s†, p, c) where S† is the set of non-goal states with |S†| = S†, A is the set of actions, p is the transition function and c is the cost function. The goal state s† ∉ S† is zero-cost and absorbing, i.e., p(s†|s†, a) = 1 and c(s†, a) = 0 for any a ∈ A. The (possibly unbounded) value function (also called expected cost-to-go) of any policy π ∈ Π starting from state s is defined as

    V_π(s) := E[ Σ_{t=1}^{+∞} c(s_t, π(s_t)) | s_1 = s ] = E[ Σ_{t=1}^{τ_π(s → s†)} c(s_t, π(s_t)) | s_1 = s ].

Assumption 2. We restrict the attention to SSP-MDPs M (see Def. 7) such that, for any (s, a) ∈ S† × A, c(s, a) ∈ [c_min, 1] with c_min > 0. (Note that having positive costs ensures that for any non-proper policy π there exists a state s with V_π(s) = +∞.) Moreover, we assume that there exists at least one proper policy (i.e., one that reaches the goal state s† with probability one starting from any state in S†).

The procedure VI_SSP considers the following inputs: a goal s†, non-goal states S†, a known model p and a known cost function c, with (non-goal) costs lower bounded by c_min > 0. VI_SSP outputs a vector u (of size |S†|) and a policy π which is greedy w.r.t. the vector u. The optimal Bellman operator is defined as follows for any vector u and any non-goal state s ∈ S†:

    L u(s) := min_{a ∈ A} { c(s, a) + Σ_{s' ∈ S†} p(s'|s, a) u(s') }.

Algorithm 2: VI_SSP
  Input: non-goal states S†, action set A, transitions p, costs c and accuracy γ.
  Output: value vector u and greedy policy π.
  Define L u(s) := min_{a ∈ A} { c(s, a) + Σ_{s' ∈ S†} p(s'|s, a) u(s') }.
  Set j := 0, u_0 := 0_{S†} and u_1 := L u_0.
  while ‖u_{j+1} − u_j‖_∞ > γ do
      j := j + 1
      u_{j+1} := L u_j
  Set u := u_j and π(s) ∈ argmin_{a ∈ A} { c(s, a) + Σ_{s' ∈ S†} p(s'|s, a) u(s') } for any s ∈ S† ∪ {s†}.

Note that by definition, V_π(s†) = 0 for any π. We perform a value iteration (VI) scheme over this operator as explained in, e.g., [29, 34, 27]. Namely, we consider the initial vector u_0 := 0 and set iteratively u_{i+1} := L u_i (see Alg. 2). For a predefined VI precision γ > 0, the stopping condition is reached at the first iteration j such that ‖u_{j+1} − u_j‖_∞ ≤ γ. The policy is then selected to be the greedy policy w.r.t. the vector u := u_j, i.e.,

    ∀ s ∈ S† ∪ {s†},  π(s) ∈ argmin_{a ∈ A} { c(s, a) + Σ_{s' ∈ S†} p(s'|s, a) u(s') }.     (7)
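For concreteness, here is a minimal Python rendering of the VI_SSP routine of Alg. 2. The callable-based model interface (p(s, a, s') and c(s, a)) and all names are our own assumptions, not the authors' implementation.

```python
import numpy as np

def vi_ssp(states, actions, goal, p, c, gamma_vi):
    """Value iteration for SSP (a sketch of Alg. 2): `states` are the non-goal states,
    p(s, a, s') and c(s, a) describe the known model, gamma_vi is the VI precision."""
    idx = {s: i for i, s in enumerate(states)}
    u = np.zeros(len(states))                      # u_0 = 0, component-wise below the optimum

    def backup(vec):
        """One application of the optimal Bellman operator L, plus the greedy policy."""
        new = np.empty_like(vec)
        greedy = {}
        for s in states:
            q = {a: c(s, a) + sum(p(s, a, sp) * vec[idx[sp]] for sp in states)
                 for a in actions}                 # the absorbing goal contributes 0 and is omitted
            greedy[s] = min(q, key=q.get)
            new[idx[s]] = q[greedy[s]]
        return new, greedy

    while True:
        u_next, pi = backup(u)
        if np.max(np.abs(u_next - u)) <= gamma_vi: # stopping rule ||u_{j+1} - u_j||_inf <= gamma
            return u_next, pi                      # value vector and greedy policy (Eq. 7)
        u = u_next
```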
Importantly, while u is not the value function of π, both quantities can be related according to the following lemma.

Lemma 2. Consider an SSP-MDP M = (S†, A, s†, p, c) defined as in Def. 7 and satisfying Asm. 2. Let (u, π) = VI_SSP(S†, A, p, c, γ) be the solution computed by VI_SSP. Denote by V_π the true value function of π and by V* = V_{π*} = L V* the optimal value function. The following component-wise inequalities hold:
• u ≤ V* ≤ V_π;
• if the VI precision level verifies γ ≤ c_min/2, then V_π ≤ (1 + 2γ/c_min) u.

Proof. The result can be obtained by adapting [27, Lem. 4 & App. E]. For the first inequality, given that we consider the initial vector u_0 = 0, we know that u_0 ≤ V*, with V* = L V* by definition. By monotonicity of the operator L [25, 26], we obtain u_j ≤ V* ≤ V_π. As for the second inequality, we introduce the following Bellman operators of a deterministic policy π, for any vector u and state s:

    L_π u(s) := c(s, π(s)) + Σ_{s' ∈ S} p(s'|s, π(s)) u(s'),
    T^π_γ u(s) := (c(s, π(s)) − γ) + Σ_{s' ∈ S} p(s'|s, π(s)) u(s'),   where c(s, π(s)) − γ > 0.

Note that the SSP problem defined by the operator T^π_γ satisfies Asm. 2 since i) it has positive costs due to the condition γ ≤ c_min/2 and ii) the fact that M satisfies Asm. 2 guarantees the existence of at least one proper policy in the model p. We can write component-wise

    T^π_γ u_j = L_π u_j − γ =(a) L u_j − γ ≤(b) u_j,

where (a) uses that π is the greedy policy w.r.t. u_j and (b) stems from the chosen stopping condition, which yields L u_j ≤ u_j + γ. By monotonicity of the operator T^π_γ, we have for all m > 0, (T^π_γ)^m u_j ≤ u_j. The asymptotic convergence of the operator in an SSP problem satisfying Asm. 2 (see e.g., [26, Prop. 2.2.1]) guarantees that taking the limit m → +∞ yields W^π_γ ≤ u_j, where W^π_γ is defined as the value function of policy π in the model p with γ subtracted from all the costs, i.e.,

    W^π_γ(s) := E[ Σ_{t=1}^{τ_π(s)} (c(s_t, π(s_t)) − γ) | s_1 = s ] = V_π(s) − γ E[τ_π(s)],

where τ_π(s) denotes the (random) hitting time of policy π to reach the goal starting from state s. Moreover, we have c_min E[τ_π(s)] ≤ V_π(s) ≤ c_max E[τ_π(s)]. Putting everything together, we thus get

    (1 − γ/c_min) V_π ≤ u_j.

Since γ ≤ c_min/2, we ultimately obtain

    V_π ≤ (1/(1 − γ/c_min)) u_j ≤ (1 + 2γ/c_min) u_j,

where the last inequality uses the fact that 1/(1 − x) ≤ 1 + 2x holds for any 0 ≤ x ≤ 1/2.

Figure 4: Optimistic Value Iteration for SSP (OVI_SSP). The routine takes as input the goal state s†, the non-goal states S†, the samples collected so far N, the costs c ≥ c_min > 0 and the VI precision γ > 0; it outputs an optimistic value vector ũ and an optimistic SSP policy π̃.

B.2 Computation of Optimistic Model in Unknown SSP
Consider an SSP problem $M$ defined as in Asm. 2. Consider that, at any given stage of the learning process, the agent is equipped with $N(s,a)$ samples at each state-action pair. A method to compute an optimistic model $\widetilde{p}$ is provided in [28], which we recall below.

Denote by $\widehat{p}$ the current empirical average of transitions: $\widehat{p}(s' \mid s,a) = N(s,a,s') / N(s,a)$, and set $\widehat{\sigma}^2(s' \mid s,a) := \widehat{p}(s' \mid s,a)(1 - \widehat{p}(s' \mid s,a))$ as well as $N^+(s,a) := \max\{1, N(s,a)\}$. For any $(s,a,s') \in \mathcal{S}^\dagger \times \mathcal{A} \times \mathcal{S}^\dagger$, the empirical Bernstein inequality [35, 36] is leveraged to select the following confidence intervals (holding with probability at least $1 - \delta$) on the transition probabilities:
$$\beta(s,a,s') := 2 \sqrt{\frac{\widehat{\sigma}^2(s' \mid s,a)}{N^+(s,a)} \log\Big( \frac{S A\, N^+(s,a)}{\delta} \Big)} + \frac{6 \log\big( \frac{S A\, N^+(s,a)}{\delta} \big)}{N^+(s,a)},$$
and $\beta(s,a,s^\dagger) := \sum_{s' \in \mathcal{S}^\dagger} \beta(s,a,s')$. The selection of the optimistic model $\widetilde{p}$ is as follows: the probability of reaching the goal $s^\dagger$ is maximized at every state-action pair, which implies minimizing the probability of reaching all other states and setting them at the lowest value of their confidence range. Formally, we set for all $(s,a,s') \in \mathcal{S}^\dagger \times \mathcal{A} \times \mathcal{S}^\dagger$,
$$\widetilde{p}(s' \mid s,a) := \max\big\{ \widehat{p}(s' \mid s,a) - \beta(s,a,s'),\ 0 \big\}, \qquad \widetilde{p}(s^\dagger \mid s,a) := 1 - \sum_{s' \in \mathcal{S}^\dagger} \widetilde{p}(s' \mid s,a).$$
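As an illustration (a sketch, not the authors' implementation), the construction above can be written as follows; the array shapes and the handling of the zero-count case via $N^+$ are the only choices made here beyond the formulas above.

```python
import numpy as np

def optimistic_model(counts, delta):
    """Optimistic transition model of App. B.2 (sketch).

    counts: array of shape (S, A, S + 1) with N(s, a, s'); index S stands for
            the goal state s_dagger.
    Returns p_tilde of the same shape, which maximizes the probability of
    reaching the goal within the Bernstein confidence range.
    """
    S, A, _ = counts.shape
    N = counts.sum(axis=2)                          # N(s, a)
    N_plus = np.maximum(1, N)                       # N^+(s, a)
    p_hat = counts / N_plus[:, :, None]             # empirical transitions
    log_term = np.log(S * A * N_plus / delta)
    var = p_hat * (1.0 - p_hat)                     # sigma_hat^2(s' | s, a)
    beta = 2 * np.sqrt(var * log_term[:, :, None] / N_plus[:, :, None]) \
           + 6 * log_term[:, :, None] / N_plus[:, :, None]
    # Shrink every non-goal transition to the bottom of its confidence range...
    p_tilde = np.maximum(p_hat[:, :, :S] - beta[:, :, :S], 0.0)
    # ...and give all the remaining mass to the goal state.
    goal = 1.0 - p_tilde.sum(axis=2)
    return np.concatenate([p_tilde, goal[:, :, None]], axis=2)
```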
B.3 Combining the two: Optimistic Value Iteration for SSP (OVI$_{\text{SSP}}$)

OVI$_{\text{SSP}}$ first computes an optimistic model $\widetilde{p}$ leveraging App. B.2, and it then runs the VI$_{\text{SSP}}$ procedure of App. B.1 in the model $\widetilde{p}$, i.e., $(\widetilde{u}, \widetilde{\pi}) = \text{VI}_{\text{SSP}}(\mathcal{S}^\dagger, \mathcal{A}, s^\dagger, \widetilde{p}, c)$. This outputs an optimistic pair $(\widetilde{u}, \widetilde{\pi})$ composed of the VI vector $\widetilde{u}$ and the policy $\widetilde{\pi}$ that is greedy w.r.t. $\widetilde{u}$ in the model $\widetilde{p}$. The OVI$_{\text{SSP}}$ scheme is recapped in Fig. 4.
C Useful Result: Simulation Lemma for SSP
Consider a stochastic shortest-path (SSP) instance (see Def. 7) that satisfies Asm. 2. We denote by $A = |\mathcal{A}|$ the number of actions, $S = |\mathcal{S}|$ the number of non-goal states, $g \notin \mathcal{S}$ the (zero-cost and absorbing) goal state, $p$ the unknown transitions and $c$ the known cost function. We assume that $0 < c(s,a) \leq 1$ for all $(s,a) \in \mathcal{S} \times \mathcal{A}$, and set $c_{\min} := \min_{s,a} c(s,a) > 0$. We also set $\mathcal{S}' := \mathcal{S} \cup \{ g \}$. Recall that the goal state is zero-cost (i.e., $c(g,a) = 0$) and absorbing (i.e., $p(g \mid g, a) = 1$), and that the value function of a policy amounts to the expected cumulative cost incurred by following this policy until reaching the goal.

Definition 8.
For any model $p$ and $\eta > 0$, we introduce the set of models close to $p$ w.r.t. the $\ell_1$-norm on the non-goal states as follows:
$$\mathcal{P}(p)_\eta := \Big\{ p' \in \mathbb{R}^{\mathcal{S}' \times \mathcal{A} \times \mathcal{S}'} : \forall (s,a) \in \mathcal{S} \times \mathcal{A},\ p'(\cdot \mid s,a) \in \Delta(\mathcal{S}'),\ p'(g \mid g, a) = 1,\ \sum_{y \in \mathcal{S}} \big| p(y \mid s,a) - p'(y \mid s,a) \big| \leq \eta \Big\}.$$

Lemma 3 (Simulation Lemma for SSP). Consider any model $p$ and $p' \in \mathcal{P}(p)_\eta$ such that, for each model, there exists at least one proper policy w.r.t. the goal state $g$. Consider any policy $\pi$ that is proper in $p'$, with value function denoted by $V'_\pi$, such that the following condition is verified:
$$2 \eta\, \| V'_\pi \|_\infty \leq c_{\min}. \tag{8}$$
Then $\pi$ is proper in $p$ (i.e., its value function verifies $V_\pi < +\infty$ component-wise), and we have
$$\forall s \neq g, \quad V_\pi(s) \leq \Big( 1 + \frac{2 \eta \| V'_\pi \|_\infty}{c_{\min}} \Big) V'_\pi(s),$$
and conversely,
$$\forall s \neq g, \quad V'_\pi(s) \leq \Big( 1 + \frac{2 \eta \| V'_\pi \|_\infty}{c_{\min}} \Big) V_\pi(s).$$
Combining the two inequalities above yields $\| V_\pi - V'_\pi \|_\infty \leq \frac{6 \eta \| V'_\pi \|_\infty^2}{c_{\min}}$.
Proof. The proof of Lem. 3 requires a result of [37] recalled in Lem. 4, and it can be seen as a generalization of [28, Lem. B.4]. First, recall that $\pi$ is proper in the model $p'$ by assumption. This implies that its value function, denoted by $V'$, is bounded component-wise. Moreover, for any non-goal state $s \in \mathcal{S}$, the Bellman equation holds as follows:
$$V'(s) = c(s, \pi(s)) + \sum_{y \in \mathcal{S}} p'(y \mid s, \pi(s)) V'(y) = c(s, \pi(s)) + \sum_{y \in \mathcal{S}} p(y \mid s, \pi(s)) V'(y) + \sum_{y \in \mathcal{S}} \big( p'(y \mid s, \pi(s)) - p(y \mid s, \pi(s)) \big) V'(y). \tag{9}$$
By successively using Hölder's inequality and the facts that $p' \in \mathcal{P}(p)_\eta$ and $c(s, \pi(s)) \geq c_{\min}$, we get
$$V'(s) \geq c(s, \pi(s)) - \eta \| V' \|_\infty + p(\cdot \mid s, \pi(s))^\top V' \geq c(s, \pi(s)) \Big( 1 - \frac{\eta \| V' \|_\infty}{c_{\min}} \Big) + p(\cdot \mid s, \pi(s))^\top V'.$$
Let us now introduce the vector $V'' := \big( 1 - \frac{\eta \| V' \|_\infty}{c_{\min}} \big)^{-1} V'$. Then for all $s \in \mathcal{S}$,
$$V''(s) \geq c(s, \pi(s)) + p(\cdot \mid s, \pi(s))^\top V''.$$
Hence, from Lem. 4, $\pi$ is proper in $p$ (i.e., $V < +\infty$), and we have
$$V \leq V'' \leq \Big( 1 + \frac{2 \eta \| V' \|_\infty}{c_{\min}} \Big) V', \tag{10}$$
where the last inequality stems from condition (8) and the fact that $\frac{1}{1-x} \leq 1 + 2x$ holds for any $0 \leq x \leq 1/2$. Conversely, analyzing Eq. 9 from the other side, we get
$$V'(s) \leq c(s, \pi(s)) \Big( 1 + \frac{\eta \| V' \|_\infty}{c_{\min}} \Big) + p(\cdot \mid s, \pi(s))^\top V'.$$
Introduce now the vector $V'' := \big( 1 + \frac{\eta \| V' \|_\infty}{c_{\min}} \big)^{-1} V'$. Then
$$V''(s) \leq c(s, \pi(s)) + p(\cdot \mid s, \pi(s))^\top V''.$$
We then obtain, in the same vein as Lem. 4 (by leveraging the monotonicity of the Bellman operator $\mathcal{L}^\pi U(s) := c(s, \pi(s)) + p(\cdot \mid s, \pi(s))^\top U$), that $V'' \leq V$, and therefore
$$V' \leq \Big( 1 + \frac{\eta \| V' \|_\infty}{c_{\min}} \Big) V \leq \Big( 1 + \frac{2 \eta \| V' \|_\infty}{c_{\min}} \Big) V. \tag{11}$$
Combining Eq. 10 and 11 yields component-wise
$$\| V - V' \|_\infty \leq \frac{2 \eta \| V' \|_\infty}{c_{\min}} \| V' \|_\infty + \frac{2 \eta \| V' \|_\infty}{c_{\min}} \| V \|_\infty \leq \frac{6 \eta \| V' \|_\infty^2}{c_{\min}},$$
where the last inequality uses that $\| V \|_\infty \leq 2 \| V' \|_\infty$, which stems from plugging condition (8) into Eq. 10. Note that here $p$ and $p'$ play symmetric roles; we can perform the same reasoning in the case where $\pi$ is proper in the model $p$, and it would yield an equivalent result by switching the dependencies on $V$ and $V'$.

Lemma 4 ([37], Lem. 1). In an SSP-MDP satisfying Asm. 2, let $\pi$ be any policy. Then:
• If there exists a vector $U : \mathcal{S} \to \mathbb{R}$ such that $U(s) \geq c(s, \pi(s)) + \sum_{s' \in \mathcal{S}} p(s' \mid s, \pi(s)) U(s')$ for all $s \in \mathcal{S}$, then $\pi$ is proper, and the value function $V_\pi$ of $\pi$ is upper bounded by $U$ component-wise, i.e., $V_\pi(s) \leq U(s)$ for all $s \in \mathcal{S}$.
• If $\pi$ is proper, then its value function $V_\pi$ is the unique solution to the Bellman equations $V_\pi(s) = c(s, \pi(s)) + \sum_{s' \in \mathcal{S}} p(s' \mid s, \pi(s)) V_\pi(s')$ for all $s \in \mathcal{S}$.
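The statement of Lem. 3 can be checked numerically on a small randomly generated instance. The sketch below (arbitrary sizes, an ad-hoc perturbation scheme, and a policy represented directly by its transition kernel — all illustrative assumptions) solves the two linear systems for the value functions and compares their gap to the bound of the lemma.

```python
import numpy as np

rng = np.random.default_rng(0)
S = 4            # non-goal states; the goal g is the extra last column
c_min = 0.5

# Transition kernel of a fixed (proper) policy pi and its costs.
P = rng.dirichlet(np.ones(S + 1), size=S)
P[:, S] += 0.2                               # ensure the goal is reachable
P /= P.sum(axis=1, keepdims=True)
cost = rng.uniform(c_min, 1.0, size=S)

def value(P):
    # Solve V = c + P[:, :S] V (the goal has value 0) for a proper policy.
    return np.linalg.solve(np.eye(S) - P[:, :S], cost)

# Perturbed model p' at small l1-distance from p in every state.
eta = 1e-3
P2 = P + rng.uniform(-eta / (2 * (S + 1)), eta / (2 * (S + 1)), size=P.shape)
P2 = np.clip(P2, 0.0, None)
P2 /= P2.sum(axis=1, keepdims=True)
eta_emp = np.abs(P - P2).sum(axis=1).max()   # realized l1 distance

V, V2 = value(P), value(P2)
assert 2 * eta_emp * np.linalg.norm(V2, np.inf) <= c_min          # condition (8)
bound = 6 * eta_emp * np.linalg.norm(V2, np.inf) ** 2 / c_min
assert np.abs(V - V2).max() <= bound
print(np.abs(V - V2).max(), bound)
```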
D Proof of Theorem 1 (Sample Complexity Analysis of DisCo)

D.1 Computation of the Optimistic Policies
At each round $k$, for each goal state $s^\dagger \in \mathcal{W}_k$, DisCo computes an optimistic goal-oriented policy associated to the MDP $M'_k(s^\dagger)$ constructed as in Def. 6. This MDP is defined over the entire state space $\mathcal{S}$ and restricts the action set to the single RESET action outside $\mathcal{K}_k$. We can build an equivalent MDP by restricting the focus to $\mathcal{K}_k$. To this end, we define the following SSP-MDP.

Definition 9.
Define $M^\dagger_k(s^\dagger) := \langle \mathcal{S}^\dagger_k, \mathcal{A}^\dagger_k(\cdot), c^\dagger_k, p^\dagger_k \rangle$, where $\mathcal{S}^\dagger_k := \mathcal{K}_k \cup \{ s^\dagger, x \}$ and $S^\dagger_k := |\mathcal{S}^\dagger_k| = |\mathcal{K}_k| + 2$. State $x$ is a meta-state that encapsulates all the states that have been observed so far and are not in $\mathcal{K}_k$. The action space $\mathcal{A}^\dagger_k(\cdot)$ is such that $\mathcal{A}^\dagger_k(s) = \mathcal{A}$ for all states $s \in \mathcal{K}_k$ and $\mathcal{A}^\dagger_k(s) = \{ \text{RESET} \}$ for $s \in \{ s^\dagger, x \}$. The cost function is $c^\dagger_k(s^\dagger, a) = 0$ for any $a \in \mathcal{A}^\dagger_k(s^\dagger)$ and $c^\dagger_k(s, a) = 1$ everywhere else (in particular at the meta-state $x$). The transition function is defined as $p^\dagger_k(s^\dagger \mid s^\dagger, a) = p^\dagger_k(s_0 \mid x, a) = 1$ for any $a$, $p^\dagger_k(y \mid s, a) = p(y \mid s, a)$ for any $(s, a, y) \in \mathcal{K}_k \times \mathcal{A} \times (\mathcal{K}_k \cup \{ s^\dagger \})$ and $p^\dagger_k(x \mid s, a) = 1 - \sum_{y \in \mathcal{K}_k \cup \{ s^\dagger \}} p^\dagger_k(y \mid s, a)$.

Note that solving $M^\dagger_k$ yields a policy effectively restricted to the set $\mathcal{K}_k$, insofar as we can interpret the meta-state $x$ as $\mathcal{S} \setminus \{ \mathcal{K}_k \cup \{ s^\dagger \} \}$. Since $p$ is unknown, we cannot construct $M^\dagger_k(s^\dagger)$. Let $N_k$ be the state-action counts accumulated up until now. We denote by $\widehat{p}_k$ the "global" empirical estimates, i.e., $\widehat{p}_k(y \mid s, a) = N_k(s, a, y) / N_k(s, a)$. Given them, we define the "restricted" empirical estimates $\widehat{p}^\dagger_k$ as follows: $\widehat{p}^\dagger_k(y \mid s, a) := \widehat{p}_k(y \mid s, a)$ for any $(s, a, y) \in \mathcal{K}_k \times \mathcal{A} \times (\mathcal{K}_k \cup \{ s^\dagger \})$ and $\widehat{p}^\dagger_k(x \mid s, a) := 1 - \sum_{y \in \mathcal{K}_k \cup \{ s^\dagger \}} \widehat{p}^\dagger_k(y \mid s, a)$. Denoting $N^+_k(s, a) := \max\{ 1, N_k(s, a) \}$, we then define the following bonuses for any $(s, a, y) \in \mathcal{K}_k \times \mathcal{A} \times (\mathcal{K}_k \cup \{ s^\dagger \})$:
$$\beta_k(s, a, y) := 2 \sqrt{\frac{\widehat{p}_k(y \mid s, a)(1 - \widehat{p}_k(y \mid s, a))}{N^+_k(s, a)} \log\Big( \frac{S A\, N^+_k(s, a)}{\delta} \Big)} + \frac{6 \log\big( \frac{S A\, N^+_k(s, a)}{\delta} \big)}{N^+_k(s, a)}, \tag{12}$$
$$\beta_k(s, a, x) := \sum_{y \in \mathcal{K}_k \cup \{ s^\dagger \}} \beta_k(s, a, y). \tag{13}$$
Algorithm 3: OVI$_{\text{SSP}}$
Input: $\mathcal{K}_k$, $\mathcal{A}$, $s^\dagger$, $N_k$, $\gamma > 0$
Output: Value vector $\widetilde{u}^\dagger$ and policy $\widetilde{\pi}^\dagger$
Estimate the transition probabilities $\widehat{p}_k$ using $N_k$
Compute the optimistic SSP-MDP $\widetilde{M}^\dagger_k$ as detailed in Def. 10
Compute $(\widetilde{u}^\dagger_k, \widetilde{\pi}^\dagger_k) = \text{VI}_{\text{SSP}}(\mathcal{S}^\dagger_k, \mathcal{A}^\dagger_k, c^\dagger_k, \widetilde{p}^\dagger_k, \gamma)$ (see Alg. 2)

Moreover, we set the uncertainty about the MDP at the meta-state $x$ and at the goal state $s^\dagger$ to $0$ by construction (since their outgoing transitions are deterministic, respectively to $s_0$ and $s^\dagger$). We now leverage the optimistic construction described in App. B.2.

Definition 10.
We denote by $\widetilde{M}^\dagger_k(s^\dagger) = \langle \mathcal{S}^\dagger_k, \mathcal{A}^\dagger_k(\cdot), c^\dagger_k, \widetilde{p}^\dagger_k \rangle$ the optimistic MDP associated to $M^\dagger_k(s^\dagger)$ defined in Def. 9. Then, for all $(s, a) \in \mathcal{K}_k \times \mathcal{A}$,
$$\widetilde{p}^\dagger_k(y \mid s, a) := \max\big\{ \widehat{p}^\dagger_k(y \mid s, a) - \beta_k(s, a, y),\ 0 \big\}, \quad \forall y \in \mathcal{K}_k \cup \{ x \}, \tag{14}$$
$$\widetilde{p}^\dagger_k(s^\dagger \mid s, a) := 1 - \sum_{y \in \mathcal{K}_k \cup \{ x \}} \widetilde{p}^\dagger_k(y \mid s, a), \tag{15}$$
$$\widetilde{p}^\dagger_k(s^\dagger \mid s^\dagger, a) = \widetilde{p}^\dagger_k(s_0 \mid x, a) = 1. \tag{16}$$
Given this MDP, we can compute the optimistic value vector $\widetilde{u}^\dagger_k$ and policy $\widetilde{\pi}^\dagger_k$ using value iteration for SSP: $(\widetilde{u}^\dagger_k, \widetilde{\pi}^\dagger_k) = \text{VI}_{\text{SSP}}(\mathcal{S}^\dagger_k, \mathcal{A}^\dagger_k, c^\dagger_k, \widetilde{p}^\dagger_k, \varepsilon / (4L))$. We summarize the construction of the optimistic model and the computation of the value function and policy in Alg. 3 (OVI$_{\text{SSP}}$).
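For illustration, here is a sketch of the construction of Def. 9–10 from the global empirical model. The local state ordering ($\mathcal{K}_k$ first, then $x$, then $s^\dagger$), the assumption that $s_0$ is the first state of $\mathcal{K}_k$, and the helper name are choices made here for readability, not part of the original algorithm.

```python
import numpy as np

def optimistic_restricted_model(p_hat, beta, K, goal, s0_index=0):
    """Optimistic restricted model p_tilde^dagger_k of Def. 10 (sketch).

    p_hat: global empirical transitions, shape (S, A, S).
    beta:  Bernstein bonuses of Eq. (12), shape (S, A, S).
    K:     ordered list of "controllable" states K_k (global indices);
           s_0 is assumed to be K[s0_index].
    goal:  candidate goal state s_dagger in W_k (not in K).
    """
    S, A, _ = p_hat.shape
    n = len(K)
    x, g = n, n + 1                                    # local indices of x and s_dagger
    p_tilde = np.zeros((n + 2, A, n + 2))
    for i, s in enumerate(K):
        # Empirical restricted model: mass on K_k and the goal is copied,
        # everything else is aggregated into the meta-state x (Def. 9).
        p_K_hat = p_hat[s][:, K]                                   # (A, n)
        p_x_hat = 1.0 - p_K_hat.sum(axis=1) - p_hat[s][:, goal]    # (A,)
        beta_x = beta[s][:, K].sum(axis=1) + beta[s][:, goal]      # Eq. (13)
        # Eq. (14): shrink the probability of every y in K_k U {x} ...
        p_tilde[i, :, :n] = np.maximum(p_K_hat - beta[s][:, K], 0.0)
        p_tilde[i, :, x] = np.maximum(p_x_hat - beta_x, 0.0)
        # Eq. (15): ... and give all the remaining mass to the goal.
        p_tilde[i, :, g] = 1.0 - p_tilde[i, :, :n + 1].sum(axis=1)
    # Eq. (16): deterministic transitions out of x (to s_0) and of the goal.
    p_tilde[x, :, s0_index] = 1.0
    p_tilde[g, :, g] = 1.0
    return p_tilde
```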
Remark. Note that the structure of the problem does not appear to allow for variance-aware improvements in the analysis of Thm. 1 (specifically, when the analysis applies an SSP simulation lemma argument). Indeed, given the possibly large number of states in the total environment $S$, the computation of the optimistic policies requires the construction of the meta-state $x$ that encapsulates all the states in $\mathcal{S} \setminus \{ \mathcal{K}_k \cup \{ s^\dagger \} \}$, where $s^\dagger$ is the candidate goal state considered at round $k$. As a result, the uncertainty on the transitions reaching $x$ needs to be summed over multiple states, as shown in Eq. 13. This extra uncertainty at a single state in the induced MDP has the effect of canceling out Bernstein techniques seeking to lower the prescribed requirement of state-action samples that the algorithm should collect. In turn, this implies that such variance-aware techniques would not lead to any improvement in the final sample complexity bound.

D.2 High-Probability Event

Lemma 5.
It holds with probability at least $1 - \delta$ that for any time step $t \geq 1$ and for any state-action pair $(s,a)$ and next state $s'$,
$$\big| \widehat{p}_t(s' \mid s,a) - p(s' \mid s,a) \big| \leq 2 \sqrt{\frac{\widehat{\sigma}^2_t(s' \mid s,a)}{N^+_t(s,a)} \log\Big( \frac{S A\, N^+_t(s,a)}{\delta} \Big)} + \frac{6 \log\big( \frac{S A\, N^+_t(s,a)}{\delta} \big)}{N^+_t(s,a)}, \tag{17}$$
where $N^+_t(s,a) := \max\{ 1, N_t(s,a) \}$ and where $\widehat{\sigma}^2_t$ are the empirical variances of the transitions, i.e., $\widehat{\sigma}^2_t(s' \mid s,a) := \widehat{p}_t(s' \mid s,a)(1 - \widehat{p}_t(s' \mid s,a))$.

Proof. The confidence intervals in Eq. 17 are constructed using the empirical Bernstein inequality, which guarantees that the considered event holds with probability at least $1 - \delta$, see e.g., [38].

Define the set of plausible transition probabilities as
$$\mathcal{C}^\dagger_k := \bigcap_{(s,a) \in \mathcal{S}^\dagger_k \times \mathcal{A}} \mathcal{C}^\dagger_k(s,a), \qquad \mathcal{C}^\dagger_k(s,a) := \big\{ \widetilde{p} \in \mathcal{C} \,\big|\, \widetilde{p}(\cdot \mid s^\dagger, a) = \mathbb{1}_{s^\dagger},\ \widetilde{p}(\cdot \mid x, a) = \mathbb{1}_{s_0},\ |\widetilde{p}(s' \mid s,a) - \widehat{p}_k(s' \mid s,a)| \leq \beta_k(s,a,s') \big\},$$
with $\mathcal{C}$ the $S^\dagger_k$-dimensional simplex and $\widehat{p}_k$ the empirical average of transitions.

Lemma 6.
Introduce the event $\Theta := \bigcap_{k=1}^{+\infty} \bigcap_{s^\dagger \in \mathcal{W}_k} \{ p^\dagger_k \in \mathcal{C}^\dagger_k \}$. Then $\mathbb{P}(\Theta) \geq 1 - \delta$.

Proof. We have with probability at least $1 - \delta$ that, for any $y \neq x$, $|p^\dagger_k(y \mid s,a) - \widehat{p}^\dagger_k(y \mid s,a)| \leq \beta_k(s,a,y)$ from the empirical Bernstein inequality (see Eq. 17), and moreover
$$\big| \widehat{p}^\dagger_k(x \mid s,a) - p^\dagger_k(x \mid s,a) \big| = \bigg| 1 - \sum_{y \in \mathcal{K}_k \cup \{s^\dagger\}} p^\dagger_k(y \mid s,a) - \Big( 1 - \sum_{y \in \mathcal{K}_k \cup \{s^\dagger\}} \widehat{p}^\dagger_k(y \mid s,a) \Big) \bigg| \leq \sum_{y \in \mathcal{K}_k \cup \{s^\dagger\}} \big| p^\dagger_k(y \mid s,a) - \widehat{p}^\dagger_k(y \mid s,a) \big| \leq \beta_k(s,a,x).$$

Lemma 7. Under the event $\Theta$, for any round $k$ and any goal state $s^\dagger \in \mathcal{W}_k$, the optimistic model $\widetilde{p}^\dagger_k$ constructed in Def. 10 verifies $\widetilde{p}^\dagger_k \in \mathcal{P}(p^\dagger_k)_{\eta_k}$, with $\eta_k := 4 \max_{(s,a) \in \mathcal{K}_k \times \mathcal{A}} \beta_k(s,a,x)$, where $\beta_k$ is defined in Eq. 13.

Proof. Combining the construction in Def. 10, the proof of Lem. 6 and the triangle inequality yields, for any $(s,a) \in \mathcal{K}_k \times \mathcal{A}$,
$$\sum_{y \in \mathcal{K}_k \cup \{x\}} \big| \widetilde{p}^\dagger_k(y \mid s,a) - p^\dagger_k(y \mid s,a) \big| \leq \sum_{y \in \mathcal{K}_k \cup \{x\}} \Big( \big| \widetilde{p}^\dagger_k(y \mid s,a) - \widehat{p}^\dagger_k(y \mid s,a) \big| + \big| \widehat{p}^\dagger_k(y \mid s,a) - p^\dagger_k(y \mid s,a) \big| \Big) \leq \sum_{y \in \mathcal{K}_k \cup \{x\}} \beta_k(s,a,y) + 2\beta_k(s,a,x) \leq 4\beta_k(s,a,x).$$

Throughout the remainder of the proof, we assume that the event $\Theta$ holds.

D.3 Properties of the Optimistic Policies and Value Vectors
We recall notation. Let us fix any round $k$ and any goal state $s^\dagger \in \mathcal{W}_k$. We denote by $\widetilde{\pi}^\dagger_k$ the greedy policy w.r.t. $\widetilde{u}^\dagger_k(\cdot \to s^\dagger)$ in the optimistic model $\widetilde{p}^\dagger_k$. Let $\widetilde{v}^\dagger_k(s \to s^\dagger)$ be the value function of policy $\widetilde{\pi}^\dagger_k$ starting from state $s$ in the model $\widetilde{p}^\dagger_k$. We can apply Lem. 2 given that the conditions of Asm. 2 hold (indeed, we have $c_{\min} = 1 > 0$ and there exists at least one proper policy to reach the goal state $s^\dagger$ since it belongs to $\mathcal{W}_k$). Moreover, we have that $\widetilde{V}^\star_{\mathcal{K}_k}(s \to s^\dagger) \leq V^\star_{\mathcal{K}_k}(s \to s^\dagger)$ given the way the optimistic model $\widetilde{p}^\dagger_k$ is computed (i.e., by maximizing the probability of transitioning to the goal at any state-action pair), see [28, Lem. B.12]. Hence we get the following two important properties.

Lemma 8. For any round $k$, goal state $s^\dagger \in \mathcal{W}_k$ and state $s \in \mathcal{K}_k \cup \{ x \}$, we have under the event $\Theta$, $\widetilde{u}^\dagger_k(s \to s^\dagger) \leq V^\star_{\mathcal{K}_k}(s \to s^\dagger)$.

Lemma 9. For any round $k$, goal state $s^\dagger \in \mathcal{W}_k$ and state $s \in \mathcal{K}_k \cup \{ x \}$, we have $\widetilde{v}^\dagger_k(s \to s^\dagger) \leq (1 + 2\gamma)\, \widetilde{u}^\dagger_k(s \to s^\dagger)$.

D.4 State Transfer from U to K (step ④)

We fix any round $k$ and any goal state $s^\dagger \in \mathcal{W}_k$ that is added to the set of "controllable" states $\mathcal{K}$, i.e., for which $\widetilde{u}^\dagger_k(s_0 \to s^\dagger) \leq L$.

Lemma 10.
Under the event $\Theta$, we have both of the following inequalities:
$$v^\dagger_k(s_0 \to s^\dagger) \leq L + \varepsilon, \qquad v^\dagger_k(s_0 \to s^\dagger) \leq V^\star_{\mathcal{K}_k}(s_0 \to s^\dagger) + \varepsilon.$$
In particular, the first inequality entails that $s^\dagger \in \mathcal{S}^{\to}_{L+\varepsilon}$, which justifies the validity of the state transfer from $\mathcal{U}$ to $\mathcal{K}$.

Proof. We have
$$\widetilde{v}^\dagger_k(s_0 \to s^\dagger) \overset{(a)}{\leq} (1 + 2\gamma)\, \widetilde{u}^\dagger_k(s_0 \to s^\dagger) \overset{(b)}{\leq} L + \frac{\varepsilon}{2}, \qquad \widetilde{v}^\dagger_k(s_0 \to s^\dagger) \overset{(c)}{\leq} V^\star_{\mathcal{K}_k}(s_0 \to s^\dagger) + \frac{\varepsilon}{2}, \tag{18}$$
where inequality (a) comes from Lem. 9, inequality (b) combines the algorithmic condition $\widetilde{u}^\dagger_k(s_0 \to s^\dagger) \leq L$ and the VI precision level $\gamma := \varepsilon / (4L)$, and inequality (c) combines Lem. 8 and the VI precision level. Moreover, for any state $s \in \mathcal{K}_k$,
$$\widetilde{v}^\dagger_k(s \to s^\dagger) \overset{(a)}{\leq} \widetilde{V}^\star_{\mathcal{K}_k}(s \to s^\dagger) + \varepsilon \overset{(b)}{\leq} \widetilde{V}^\star_{\mathcal{K}_k}(s_0 \to s^\dagger) + 1 + \varepsilon \leq \widetilde{v}^\dagger_k(s_0 \to s^\dagger) + 1 + \varepsilon,$$
where (a) comes from Lem. 8 and the VI precision level, and (b) stems from the presence of the RESET action (Asm. 1).

We now provide the exact choice of allocation function $\phi$ in Alg. 1. We introduce $\gamma_2 := \frac{\varepsilon}{16 (L + 1 + \varepsilon)(L + \varepsilon)}$ (note that $\gamma_2 = O(\varepsilon / L^2)$). We set the following requirement of samples for each state-action pair $(s,a)$ at round $k$ (see the sketch after Lemma 11 below):
$$n_k = \phi(\mathcal{K}_k) = \frac{X_k^2}{\gamma_2^2} \bigg[ \log\bigg( \frac{e\, X_k \sqrt{SA}}{\sqrt{\delta}\, \gamma_2} \bigg) \bigg] + \frac{24\, |\mathcal{S}^\dagger_k|}{\gamma_2} \log\bigg( \frac{|\mathcal{S}^\dagger_k|\, S A}{\delta\, \gamma_2} \bigg), \tag{19}$$
where we define
$$X_k := \max_{(s,a) \in \mathcal{S}^\dagger_k \times \mathcal{A}} \sum_{s' \in \mathcal{S}^\dagger_k} \sqrt{\widehat{\sigma}^2_k(s' \mid s,a)},$$
with $\widehat{\sigma}^2_k(s' \mid s,a) := \widehat{p}^\dagger_k(s' \mid s,a)(1 - \widehat{p}^\dagger_k(s' \mid s,a))$ the estimated variance of the transition from $(s,a)$ to $s'$. Leveraging the empirical Bernstein inequality (Lem. 5) and performing simple algebraic manipulations (see e.g., [39, Lem. 8 and 9]) yields that $\beta_k(s,a,x) \leq \gamma_2$. From Lem. 7, this implies that $\widetilde{p}^\dagger_k \in \mathcal{P}(p^\dagger_k)_{\eta}$ with $\eta := 4\gamma_2$. We can then apply Lem. 3 (whose condition (8) is verified), which gives
$$v^\dagger_k(s_0 \to s^\dagger) \leq \big( 1 + 2\eta\, \| \widetilde{v}^\dagger_k(\cdot \to s^\dagger) \|_\infty \big)\, \widetilde{v}^\dagger_k(s_0 \to s^\dagger) \leq \big( 1 + 2\eta (L + 1 + \varepsilon) \big)\, \widetilde{v}^\dagger_k(s_0 \to s^\dagger) \leq \widetilde{v}^\dagger_k(s_0 \to s^\dagger) + \frac{\varepsilon}{2}, \tag{20}$$
where the last inequality uses that $2 \eta (L + 1 + \varepsilon)(L + \varepsilon) = \varepsilon/2$ by definition of $\gamma_2$. Plugging in Eq. 18 yields the sought-after inequalities.

D.5 Termination of the Algorithm

Lemma 11 (Variant of Lem. 17 of [1]). Suppose that for every state $s \in \mathcal{S}$, each action $a \in \mathcal{A}$ is executed $b \geq \lceil L \log(\frac{ALS}{\delta}) \rceil$ times. Let $\mathcal{S}'_{s,a}$ be the set of all next states visited during the $b$ executions of $(s,a)$. Denote by $\Lambda$ the complement of the event
$$\Big\{ \exists (s', s, a) \in \mathcal{S} \times \mathcal{S} \times \mathcal{A} : p(s' \mid s, a) \geq \frac{1}{L}\ \wedge\ s' \notin \mathcal{S}'_{s,a} \Big\}.$$
Then $\mathbb{P}(\Lambda) \geq 1 - \delta$.
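The sketch below spells out the allocation of Eq. (19). Since several numerical constants in this part are reconstructed, they should be treated as illustrative rather than as the authors' exact implementation.

```python
import numpy as np

def allocation(p_hat_dagger, eps, L, delta, S, A):
    """Per-(s,a) sample requirement n_k = phi(K_k) of Eq. (19) (sketch).

    p_hat_dagger: restricted empirical model over S_k^dagger, shape (n, A, n).
    S, A: sizes of the global state and action spaces (used in the log terms).
    """
    n = p_hat_dagger.shape[0]
    gamma2 = eps / (16.0 * (L + 1 + eps) * (L + eps))   # gamma_2 = O(eps / L^2)
    sigma = np.sqrt(p_hat_dagger * (1.0 - p_hat_dagger))
    X = sigma.sum(axis=2).max()                         # X_k in Eq. (19)
    # max(X, 1) guards the degenerate case X = 0 (e.g., no samples yet).
    term1 = (X / gamma2) ** 2 * np.log(np.e * max(X, 1.0) * np.sqrt(S * A)
                                       / (np.sqrt(delta) * gamma2))
    term2 = 24.0 * n / gamma2 * np.log(n * S * A / (delta * gamma2))
    return int(np.ceil(term1 + term2))
```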
Lemma 12. Under the event $\Theta \cap \Lambda$, for any round $k$, either $\mathcal{S}^{\to}_L \subseteq \mathcal{K}_k$, or there exists a state $s^\dagger \in \mathcal{S}^{\to}_L \setminus \mathcal{K}_k$ such that $s^\dagger \in \mathcal{W}_k$ and is $L$-controllable with a policy restricted to $\mathcal{K}_k$. Moreover, $|\mathcal{W}_k| \leq 2 L A |\mathcal{K}_k|$.

Proof of Lem. 12. Consider a round $k$ such that $\mathcal{S}^{\to}_L \setminus \mathcal{K}_k$ is non-empty. Due to the incremental construction of the set $\mathcal{S}^{\to}_L$ (Def. 4), there exists a state $s^\dagger \in \mathcal{S}^{\to}_L$ and a policy restricted to $\mathcal{K}_k$ that can reach $s^\dagger$ in at most $L$ steps (in expectation). Hence there exists a state-action pair $(s,a) \in \mathcal{K}_k \times \mathcal{A}$ such that $p(s^\dagger \mid s,a) \geq \frac{1}{L}$. Since $\phi(\mathcal{K}_k) \geq \lceil L \log(\frac{ALS}{\delta}) \rceil$ samples are available at each state-action pair, according to Lem. 11, we get that, under the event $\Lambda$, $s^\dagger$ is found during the sample collection procedure for the state-action pair $(s,a)$ (step ①), which implies that $s^\dagger \in \mathcal{U}_k$.

Moreover, the choice of allocation function $\phi$ guarantees in particular that more than $\Omega\big( \frac{L^2}{\varepsilon^2} \log(\frac{LSA}{\delta \varepsilon}) \big)$ samples are available at each state-action pair $(s,a) \in \mathcal{K}_k \times \mathcal{A}$. From the empirical Bernstein inequality of Eq. 17, we thus have that $|p(s^\dagger \mid s,a) - \widehat{p}_k(s^\dagger \mid s,a)| \leq \frac{\varepsilon}{2L}$ under the event $\Theta$. Consequently we have
$$\widehat{p}_k(s^\dagger \mid s,a) \geq \frac{1}{L} - \big| p(s^\dagger \mid s,a) - \widehat{p}_k(s^\dagger \mid s,a) \big| \geq \frac{1 - \varepsilon/2}{L},$$
which implies that $s^\dagger \in \mathcal{W}_k$. Furthermore, we can decompose $\mathcal{W}_k$ in the following way:
$$\mathcal{W}_k = \bigcup_{(s,a) \in \mathcal{K}_k \times \mathcal{A}} \mathcal{Y}_k(s,a), \qquad \text{where } \mathcal{Y}_k(s,a) := \Big\{ s' \in \mathcal{U}_k : \widehat{p}_k(s' \mid s,a) \geq \frac{1 - \varepsilon/2}{L} \Big\}.$$
We then have
$$1 \geq \sum_{s' \in \mathcal{S}} \widehat{p}_k(s' \mid s,a) \geq \sum_{s' \in \mathcal{Y}_k(s,a)} \widehat{p}_k(s' \mid s,a) \geq \frac{1 - \varepsilon/2}{L}\, |\mathcal{Y}_k(s,a)|.$$
We conclude the proof by writing that
$$|\mathcal{W}_k| \leq \sum_{(s,a) \in \mathcal{K}_k \times \mathcal{A}} |\mathcal{Y}_k(s,a)| \leq \frac{L}{1 - \varepsilon/2}\, A\, |\mathcal{K}_k| \leq 2 L A |\mathcal{K}_k|,$$
where the last inequality uses that $\varepsilon \leq 1$ (from line 2 of Alg. 1).

Lemma 13.
Under the event $\Theta \cap \Lambda$, when either condition STOP1 or STOP2 is triggered (at a round indexed by $K$), we have $\mathcal{S}^{\to}_L \subseteq \mathcal{K}_K$.

Proof. If condition STOP1 is triggered, Lem. 12 immediately guarantees that $\mathcal{S}^{\to}_L \subseteq \mathcal{K}_K$ under the event $\Lambda$. If condition STOP2 is triggered, we have for all $s \in \mathcal{W}_K$, $\widetilde{u}_s(s_0 \to s) > L$. From Lem. 8 this means that, under the event $\Theta$, for all $s \in \mathcal{W}_K$, $V^\star_{\mathcal{K}_K}(s_0 \to s) > L$. Hence none of the states in $\mathcal{W}_K$ can be reached in at most $L$ steps (in expectation) with a policy restricted to $\mathcal{K}_K$. We conclude the proof using Lem. 12.

Lemma 14. Under the event $\Theta \cap \Lambda$, when DisCo terminates at round $K$, for any state $s \in \mathcal{K}_K$, the policy $\pi_s$ computed during step ⑤ verifies
$$v^{\pi_s}(s_0 \to s) \leq \min_{\pi \in \Pi(\mathcal{S}^{\to}_L)} v^{\pi}(s_0 \to s) + \varepsilon.$$
Moreover, we have that $\mathcal{S}^{\to}_L \subseteq \mathcal{K}_K \subseteq \mathcal{S}^{\to}_{L+\varepsilon}$.

Proof. Assume that the event $\Theta \cap \Lambda$ holds. Then, when the final set $\mathcal{K}_K$ is considered and the new policies are computed using all the samples, Lem. 10 yields for all $s \in \mathcal{K}_K$,
$$v^{\pi_s}(s_0 \to s) \leq \min_{\pi \in \Pi(\mathcal{K}_K)} v^{\pi}(s_0 \to s) + \varepsilon.$$
Moreover, Lem. 13 entails that $\mathcal{K}_K \supseteq \mathcal{S}^{\to}_L$. This implies from Lem. 1 that
$$\min_{\pi \in \Pi(\mathcal{K}_K)} v^{\pi}(s_0 \to s) \leq \min_{\pi \in \Pi(\mathcal{S}^{\to}_L)} v^{\pi}(s_0 \to s),$$
which gives the first claim. Finally, Lem. 10 also guarantees that every state added to $\mathcal{K}$ satisfies $v^{\pi_s}(s_0 \to s) \leq L + \varepsilon$, which means that $\mathcal{K}_K \subseteq \mathcal{S}^{\to}_{L+\varepsilon}$.

D.6 High Probability Bound on the Sample Collection Phase (step ①)

Denote by $K$ the (random) index of the last round during which the algorithm terminates. We focus on the sample collection procedure for any state $s \in \mathcal{K}_K$. We denote by $k_s$ the index of the round during which $s$ was added to the set of "controllable" states $\mathcal{K}$. To collect samples at state $s$, the learner uses the shortest-path policy $\pi_s$. We say that an attempt to collect a specific sample is a rollout. We denote by $Z_K := |\mathcal{K}_K|\, A\, \phi(\mathcal{K}_K)$ the total number of samples that the learner needs to collect. As such, at most $Z_K$ rollouts must take place. Assume that the event $\Theta$ holds. Then from Lem. 14, we have $\mathcal{K}_K \subseteq \mathcal{S}^{\to}_{L+\varepsilon}$. Hence, denoting $S_{L+\varepsilon} := |\mathcal{S}^{\to}_{L+\varepsilon}|$, we have $Z_K \leq Z_{L+\varepsilon} := S_{L+\varepsilon}\, A\, \phi(\mathcal{S}^{\to}_{L+\varepsilon})$. The following lemma provides a high-probability upper bound on the number of time steps required to meet the sampling requirements.

Lemma 15.
Assume that the event $\Theta$ holds. Set
$$\psi := 4 (L + \varepsilon + 1) \log\Big( \frac{2 Z_{L+\varepsilon}}{\delta} \Big),$$
and introduce the following event:
$$T := \big\{ \exists \text{ one rollout (with goal state } s\text{) s.t. } \tau_{\pi_s}(s_0 \to s) > \psi \big\}.$$
We have $\mathbb{P}(T) \leq \delta$.

Proof. Assume that the event $\Theta$ holds. Leveraging a union bound argument and applying Lem. 16 to the policy $\pi_s$, which verifies $v^{\pi_s}(s' \to s) \leq L + \varepsilon + 1$ for any $s' \in \mathcal{K}_{k_s}$, we get
$$\mathbb{P}(T) \leq \sum_{\text{rollouts}} 2 \exp\Big( - \frac{\psi}{4 (L + \varepsilon + 1)} \Big) \leq 2 Z_{L+\varepsilon} \exp\Big( - \frac{\psi}{4 (L + \varepsilon + 1)} \Big) \leq \delta,$$
where the last inequality comes from the choice of $\psi$.

Lemma 16 ([28], Lem. B.5). Let $\pi$ be a proper policy such that for some $d > 0$, $V^\pi(s) \leq d$ for every non-goal state $s$. Then the probability that the cumulative cost of $\pi$ to reach the goal state from any state $s$ is more than $m$ is at most $2 e^{-m / (4d)}$, for all $m \geq 0$. Note that a cost of at most $m$ implies that the number of steps is at most $m / c_{\min}$.

D.7 Putting Everything Together: Sample Complexity Bound
The sample complexity of the algorithm is solely induced by the sample collection procedure (step ①). Recall that we denote by $K$ the index of the round at which the algorithm terminates. With probability at least $1 - \delta$, Lem. 13 holds, and so does the event $\Theta$. Hence the algorithm discovers a set of states $\mathcal{K}_K \supseteq \mathcal{S}^{\to}_L$. Moreover, from Lem. 14, the algorithm outputs for each $s \in \mathcal{K}_K$ a policy $\pi_s$ with $\mathbb{E}[\tau_{\pi_s}(s_0 \to s)] \leq V^\star_{\mathcal{S}^{\to}_L}(s) + \varepsilon$. Hence we also have $|\mathcal{K}_K| \leq S_{L+\varepsilon} := |\mathcal{S}^{\to}_{L+\varepsilon}|$.

We denote by $Z_K := |\mathcal{K}_K|\, A\, \phi(\mathcal{K}_K)$ the total number of samples that the learner needs to collect. From Lem. 15, with probability at least $1 - \delta$, the total sample complexity of the algorithm is at most $\psi Z_K$, where $\psi := 4 (L + \varepsilon + 1) \log\big( \frac{2 Z_{L+\varepsilon}}{\delta} \big)$.

Now, from Eq. 19 there exists an absolute constant $\alpha > 0$ such that DisCo selects as allocation function $\phi$:
$$\phi : \mathcal{X} \to \alpha \cdot \bigg( \frac{L^4\, \widehat{\Theta}(\mathcal{X})}{\varepsilon^2} \log\Big( \frac{LSA}{\varepsilon \delta} \Big) + \frac{L^2 |\mathcal{X}|}{\varepsilon} \log\Big( \frac{LSA}{\varepsilon \delta} \Big) \bigg), \quad \text{where } \widehat{\Theta}(\mathcal{X}) := \max_{(s,a) \in \mathcal{X} \times \mathcal{A}} \bigg( \sum_{s' \in \mathcal{X}} \sqrt{\widehat{p}(s' \mid s,a)(1 - \widehat{p}(s' \mid s,a))} \bigg)^2.$$
The total requirement is $\phi(\mathcal{K}_K)$. Note that from the Cauchy-Schwarz inequality, we have
$$\widehat{\Theta}(\mathcal{K}_K) \leq \Gamma_K := \max_{(s,a) \in \mathcal{K}_K \times \mathcal{A}} \big\| \{ p(s' \mid s,a) \}_{s' \in \mathcal{K}_K} \big\|_0 \leq |\mathcal{K}_K|.$$
Hence, with probability at least $1 - \delta$,
$$\psi Z_K = \widetilde{O}\bigg( \frac{L^5\, \Gamma_K\, |\mathcal{K}_K|\, A}{\varepsilon^2} + \frac{L^3\, |\mathcal{K}_K|^2\, A}{\varepsilon} \bigg).$$
We finally use that $\mathcal{K}_K \subseteq \mathcal{S}^{\to}_{L+\varepsilon}$ from Lem. 14, which implies that
$$\mathcal{C}_{AX^\star}(\text{DisCo}, L, \varepsilon, \delta) = \widetilde{O}\bigg( \frac{L^5\, \Gamma_{L+\varepsilon}\, S_{L+\varepsilon}\, A}{\varepsilon^2} + \frac{L^3\, S_{L+\varepsilon}^2\, A}{\varepsilon} \bigg),$$
where $\Gamma_{L+\varepsilon} := \max_{(s,a) \in \mathcal{S}^{\to}_{L+\varepsilon} \times \mathcal{A}} \big\| \{ p(s' \mid s,a) \}_{s' \in \mathcal{S}^{\to}_{L+\varepsilon}} \big\|_0$. This concludes the proof of Thm. 1.

D.8 Proof of Corollary 1
The result given in Cor. 1 comes from retracing the analysis of Lem. 14 and therefore Lem. 10 by considering non-uniform costs in $[c_{\min}, 1]$ instead of costs all equal to $1$. Specifically, Eq. 20 needs to account for the inverse dependency on $c_{\min}$ of the simulation lemma (Lem. 3). This induces the final $\varepsilon / c_{\min}$ accuracy level achieved by the policies output by DisCo. There remains to guarantee that condition (8) of Lem. 3 is verified. In particular, the condition holds if $2\eta (L + 1 + \varepsilon) \leq c_{\min}$, where $\eta$ is the model accuracy prescribed in the proof of Lem. 10. We see that this is the case whenever $\varepsilon = O(L\, c_{\min})$, since $\eta (L + 1 + \varepsilon) = O(\varepsilon / L)$.

D.9 Computational Complexity of DisCo
The overall computational complexity of DisCo can be expressed as $\sum_{k=1}^{K} |\mathcal{W}_k| \cdot \mathcal{C}(\text{OVI}_{\text{SSP}})$, where $\mathcal{C}(\text{OVI}_{\text{SSP}})$ denotes the complexity of an OVI$_{\text{SSP}}$ procedure and where we recall that $K$ denotes the (random) index of the last round during which the algorithm terminates. Note that it holds with high probability that $K \leq |\mathcal{S}^{\to}_{L+\varepsilon}|$ and $|\mathcal{W}_k| \leq 2 L A |\mathcal{K}_k| \leq 2 L A |\mathcal{S}^{\to}_{L+\varepsilon}|$. Moreover, $\mathcal{C}(\text{OVI}_{\text{SSP}})$ captures the complexity of the value iteration (VI) algorithm for SSP, which was proved in [34] to converge in time quadratic w.r.t. the size of the considered state space (here, $\mathcal{K}_k$) and w.r.t. $\| V^\star \|_\infty / c_{\min}$. Here we have $c_{\min} = 1$, and we can easily prove that in all the SSP instances considered by DisCo, the optimal value function $V^\star$ verifies $\| V^\star \|_\infty = O(L)$, due to the restriction of the goal state to $\mathcal{W}_k$ (indeed, this restriction implies that there exists a state-action pair in $\mathcal{K}_k \times \mathcal{A}$ that transitions to the goal state with probability $\Omega(1/L)$ in the true MDP). Putting everything together gives DisCo's computational complexity. Interestingly, we notice that while it depends polynomially on $S_{L+\varepsilon}$, $L$ and $A$, it is independent from $S$, the size of the global state space.
E The UcbExplore Algorithm [1]

E.1 Outline of the Algorithm

The UcbExplore algorithm was introduced by Lim and Auer [1] to specifically tackle condition AX$_L$. The algorithm maintains a set $\mathcal{K}$ of "controllable" states and a set $\mathcal{U}$ of "uncontrollable" states. It alternates between two phases of state discovery and policy evaluation. In a state discovery phase, new candidate states are discovered as potential members of the set of controllable states. Any policy evaluation phase is called a round and it relies on an optimistic principle: it attempts to reach an "optimistic" state $s$ (i.e., the easiest state to reach based on the information collected so far) among all the candidate states by executing an optimistic policy $\pi_s$ that minimizes the optimistic expected hitting time truncated at a horizon of $H_{UCB} := \lceil L + L^2 \varepsilon^{-1} \rceil$. Within the round of evaluation of policy $\pi_s$, the algorithm proceeds through at most $\lambda_{UCB} := \lceil L^2 \varepsilon^{-2} \log(|\mathcal{K}| \delta^{-1}) \rceil$ episodes, each of which begins at $s_0$ and ends either when $\pi_s$ successfully reaches $s$ or when $H_{UCB}$ steps have been executed. If the empirical performance of $\pi_s$ is poor (measured through a performance check done after each episode), the round is said to have failed. Otherwise, the round is successful, which means that $s$ is controllable and an acceptable policy ($\pi_s$) has been discovered. A failure round leads to selecting another candidate state-policy pair for evaluation, while a success round leads to a state discovery phase, which in turn adds more candidate states for the subsequent rounds. As explained in App. A, UcbExplore is unable to tackle the more challenging objective AX$^\star$.

E.2 Minor Issue and Fix in the Analysis of UcbExplore
The key insight of UcbExplore is to bound the number of failure rounds of the algorithm, by lower- and upper-bounding the so-called "regret" contribution of failure rounds, where the regret of a failure round $k$ is defined as
$$\sum_{j=1}^{e_k} \Big[ H_{UCB} - L - \sum_{i=0}^{H_{UCB}-1} r_i \Big],$$
where $e_k \leq \lambda_{UCB}$ is the actual number of episodes executed in round $k$ and where the reward $r_i \in \{0, 1\}$ is equal to 1 only if the state is the goal state. However, upper bounding the regret contribution of failure rounds implies applying a concentration inequality to only specific rounds that are chosen given their empirical performance. Hence Lim and Auer [1, Lem. 18] improperly use a martingale argument to bound a sum whose summands are chosen in a non-martingale way, i.e., depending on their realization.

To avoid the aforementioned issue, one must upper and lower bound the cumulative regret of the entire set of rounds, and not only the failure rounds, in order to obtain a bound on the number of failure rounds. However, this would yield a sample complexity whose second term scales as $\widetilde{O}(\varepsilon^{-4})$. Following personal communication with the authors, the fix is to change the definition of the regret of a round, making it equal to
$$\sum_{j=1}^{e_k} \bigg[ \widetilde{u}_{H_{UCB}}(s_0 \to s) - \sum_{i=0}^{H_{UCB}-1} r_i \bigg],$$
where $s$ is the considered goal state and $\widetilde{u}_{H_{UCB}}(s_0 \to s)$ is the optimistic $H_{UCB}$-step reward (where the reward is equal to 1 only at state $s$). With this new definition, it is possible to recover the sample complexity provided in [1], scaling as $\widetilde{O}(\varepsilon^{-3})$.

E.3 Issue with a Possibly Infinite State Space
Lim and Auer [1] claim that their setting can cope with a countable, possibly infinite state space. However, this leads to a technical issue, which has been acknowledged by the authors via personal communication and as of now has not been resolved. Indeed, it occurs when a union bound over the unknown set $\mathcal{U}$ is taken to guarantee high-probability statements (e.g., Lem. 14 or 17 of [1]). Yet for each realization of the algorithm, we do not know what the set $\mathcal{U}$, or equivalently $\mathcal{K}$, looks like, hence it is improper to perform a union bound over a set of unknown identity. Simple workarounds to circumvent this issue are to impose a finite state space, or to assume prior knowledge of a finite superset of $\mathcal{U}$. In this paper we opt for the first option. It remains an open and highly non-trivial question as to how (and whether) the framework can cope with an infinite state space.

E.4 Effective Horizon of the AX Problem and its Dependency on ε

UcbExplore [1] designs finite-horizon problems with horizon $H_{UCB} := \lceil L + L^2 \varepsilon^{-1} \rceil$ and outputs policies that reset every $H_{UCB}$ time steps. In the following we prove that the effective horizon of the AX problem actually scales as $O(L \log(L \varepsilon^{-1}))$, i.e., only logarithmically w.r.t. $\varepsilon^{-1}$. We begin by defining the concept of "resetting" policies as follows.

Definition 11.
For any $\pi \in \Pi$ and horizon $H \geq 1$, we denote by $\pi|_H$ the non-stationary policy that executes the actions prescribed by $\pi$ and performs the RESET action every $H$ steps, i.e.,
$$\pi|_{H,t}(a \mid s) := \begin{cases} \text{RESET} & \text{if } t \equiv 0 \ (\mathrm{mod}\ H), \\ \pi(a \mid s) & \text{otherwise.} \end{cases}$$
We denote by $\Pi|_H$ the set of such "resetting" policies. The following lemma captures the effective horizon $H_{\text{eff}}$ of the problem, in the sense that restricting our attention to $\Pi|_H(\mathcal{S}^{\to}_L)$ for $H \geq H_{\text{eff}}$ does not compromise the possibility of finding policies that achieve the performance required by AX$^\star$ (and thus also by AX$_L$).

Lemma 17. For any $\varepsilon \in (0, 1]$ and $L \geq 1$, whenever $H \geq H_{\text{eff}} := 4(L+1) \big\lceil \log\big( \frac{4(L+1)}{\varepsilon} \big) \big\rceil$, we have for any $s^\dagger \in \mathcal{S}^{\to}_L$,
$$\min_{\pi|_H \in \Pi|_H(\mathcal{S}^{\to}_L)} v^{\pi|_H}(s_0 \to s^\dagger) \leq V^\star_{\mathcal{S}^{\to}_L}(s_0 \to s^\dagger) + \varepsilon.$$
Proof. Consider any goal state $s^\dagger \in \mathcal{S}^{\to}_L$. Set $\varepsilon' := \frac{\varepsilon}{2(L+1)} \leq \frac{1}{2}$. Denote by $\pi \in \Pi(\mathcal{S}^{\to}_L)$ the minimizer of $V^\star_{\mathcal{S}^{\to}_L}(s_0 \to s^\dagger)$. For any horizon $H \geq 1$, we introduce the truncated value function $v^{\pi, H}(s \to s') := \mathbb{E}[\tau_\pi(s \to s') \wedge H]$ and the tail probability $q^{\pi, H}(s \to s') := \mathbb{P}(\tau_\pi(s \to s') > H)$. Due to the presence of the RESET action, the value function of $\pi$ can be bounded for all states $s \in \mathcal{S}^{\to}_L \setminus \{ s^\dagger \}$ as
$$v^{\pi}(s \to s^\dagger) \leq V^\star_{\mathcal{S}^{\to}_L}(s_0 \to s^\dagger) + 1 \leq L + 1.$$
This entails that the probability of the goal-reaching time decays exponentially. More specifically, we have
$$q^{\pi, H}(s_0 \to s^\dagger) \leq 2 \exp\Big( - \frac{H}{4(L+1)} \Big) \leq \varepsilon', \tag{21}$$
where the first inequality stems from Lem. 16 and the second inequality comes from the choice of $H \geq 4(L+1) \big\lceil \log\big( \frac{2}{\varepsilon'} \big) \big\rceil$. Furthermore, we have $\tau_\pi(s \to s') \wedge H \leq \tau_\pi(s \to s')$ and thus $\mathbb{E}[\tau_\pi(s \to s') \wedge H] \leq \mathbb{E}[\tau_\pi(s \to s')]$. Consequently,
$$v^{\pi, H}(s_0 \to s^\dagger) \leq v^{\pi}(s_0 \to s^\dagger) = V^\star_{\mathcal{S}^{\to}_L}(s_0 \to s^\dagger). \tag{22}$$
Now, from [1, Eq. 4], the value function of $\pi|_H$ can be related to its truncated value function and tail probability as follows:
$$v^{\pi|_H} = \frac{v^{\pi, H} + q^{\pi, H}}{1 - q^{\pi, H}}. \tag{23}$$
Plugging Eq. 21 and 22 into Eq. 23 yields
$$v^{\pi|_H}(s_0 \to s^\dagger) \leq \frac{V^\star_{\mathcal{S}^{\to}_L}(s_0 \to s^\dagger) + \varepsilon'}{1 - \varepsilon'}.$$
Notice that the inequalities $\frac{1}{1-x} \leq 1 + 2x$ and $\frac{x}{1-x} \leq 2x$ hold for any $0 < x \leq \frac{1}{2}$. Applying them for $x = \varepsilon'$ yields
$$\frac{V^\star_{\mathcal{S}^{\to}_L}(s_0 \to s^\dagger) + \varepsilon'}{1 - \varepsilon'} \leq (1 + 2\varepsilon')\, V^\star_{\mathcal{S}^{\to}_L}(s_0 \to s^\dagger) + 2\varepsilon'.$$
From the inequality $V^\star_{\mathcal{S}^{\to}_L}(s_0 \to s^\dagger) \leq L$ and the definition of $\varepsilon'$, we finally obtain
$$v^{\pi|_H}(s_0 \to s^\dagger) \leq V^\star_{\mathcal{S}^{\to}_L}(s_0 \to s^\dagger) + \varepsilon,$$
which completes the proof.

Lem. 17 reveals that the effective horizon $H_{\text{eff}}$ of the AX problem scales only logarithmically, and not linearly, in $\varepsilon^{-1}$. This highlights that the design choice in UcbExplore to tackle finite-horizon problems with horizon $H_{UCB}$ unavoidably leads to a suboptimal dependency on $\varepsilon$ in its AX$_L$ sample complexity bound. In contrast, by designing SSP problems and thus leveraging the intrinsic goal-oriented nature of the problem, DisCo can (implicitly) capture the effective horizon of the problem. This observation is at the heart of the improvement in the $\varepsilon$ dependency from $\widetilde{O}(\varepsilon^{-3})$ of UcbExplore [1] to $\widetilde{O}(\varepsilon^{-2})$ of DisCo (Thm. 1).
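As an illustration of Def. 11, the following Monte-Carlo sketch estimates the expected hitting time of a resetting policy $\pi|_H$ by executing $\pi$ and playing RESET every $H$ steps. The simulator interface (`env_reset`, `env_step`) is hypothetical, and the estimate is only meaningful when $\pi|_H$ reaches the goal with probability one.

```python
import numpy as np

def hitting_time_with_resets(env_reset, env_step, pi, goal, H, n_rollouts=1000):
    """Estimate E[tau_{pi|H}(s_0 -> goal)] for the resetting policy of Def. 11.

    env_reset() -> s_0 and env_step(s, a) -> s' are assumed-to-exist helpers
    for the simulator; pi maps a state to an action.
    """
    times = []
    for _ in range(n_rollouts):
        s, t = env_reset(), 0
        while s != goal:
            t += 1
            # Every H steps the resetting policy plays RESET, i.e., returns to s_0.
            s = env_reset() if t % H == 0 else env_step(s, pi(s))
        times.append(t)
    return float(np.mean(times))
```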
F Experiments
This section complements the experimental findings partially reported in Sect. 5. We provide details about the algorithmic configurations and the environments, as well as additional experiments.
F.1 Algorithmic Configurations

Experimental improvements to UcbExplore [1]. We introduce several modifications to UcbExplore in order to boost its practical performance. We remove all the constants and logarithmic terms from the requirements for state discovery and policy evaluation (refer to [1, Fig. 1]). Furthermore, we remove the constants in the definition of the accuracy $\varepsilon' = \varepsilon / L$ used by UcbExplore (while their original algorithm requires $\varepsilon'$ to be divided by a further constant, we remove it). We also significantly improve the planning phase of UcbExplore [1, Fig. 2]. Their procedure requires to divide the samples into $H := (1 + 1/\varepsilon') L$ disjoint sets to estimate the transition probabilities of each stage $h$ of the finite-horizon MDP. This substantially reduces the accuracy of the estimated transition probabilities, since for each stage $h$ only $N_k(s,a)/H$ samples are used. In our experiments, we use all the samples to estimate a stationary MDP (i.e., $\widehat{p}_k(s' \mid s,a) = N_k(s,a,s') / N_k(s,a)$) rather than a stage-dependent model. Estimating a stationary model instead of bucketing the data is simpler and more efficient, since it leads to a higher accuracy of the estimated model. To avoid moving too far away from the original UcbExplore, we define the confidence intervals as if bucketing was used: we thus consider $\overline{N}_k(s,a) = N_k(s,a)/H$ for the construction of the confidence intervals. For planning, we use the optimistic backward induction procedure as in [30]. We thus leverage empirical Bernstein inequalities — which are much tighter — rather than Hoeffding inequalities as suggested in [1]. In particular, we further approximate the bonus suggested in [30, Alg. 4] as
$$b_h(s,a) = \sqrt{\frac{\mathrm{Var}_{s' \sim \widehat{p}_k(\cdot \mid s,a)}\big[ V_{k, h+1}(s') \big]}{\overline{N}_k(s,a) \vee 1}} + \frac{H - h}{\overline{N}_k(s,a) \vee 1}.$$

For DisCo, we follow the same approach of removing constants and logarithmic terms. We thus use the definition of $\phi$ as in Thm. 1 with $\alpha = 1$ and without log-terms. For planning, we use the procedure described in App. D with
$$b_k(s,a,s') = \sqrt{\frac{\widehat{p}_k(s' \mid s,a)(1 - \widehat{p}_k(s' \mid s,a))}{N_k(s,a) \vee 1}} + \frac{1}{N_k(s,a) \vee 1}.$$
Finally, in the experiments we use a state-action dependent value $\widehat{\Theta}(s,a,\mathcal{K}_k) = \big( \sum_{s' \in \mathcal{K}_k} \sqrt{\widehat{p}_k(s' \mid s,a)(1 - \widehat{p}_k(s' \mid s,a))} \big)^2$ instead of taking the maximum over $(s,a)$. Even though we boosted the practical performance of UcbExplore w.r.t. the original algorithm proposed in [1] (e.g., through the use of Bernstein inequalities), we believe this makes the comparison between DisCo and UcbExplore as fair as possible.
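For reference, the two bonuses described above can be written as follows (a sketch; the function and variable names are ours, and `V_next` stands for the optimistic values at stage $h+1$ of the finite-horizon planner).

```python
import numpy as np

def bonus_ucbexplore(p_hat_sa, V_next, N_bar_sa, H, h):
    """Finite-horizon Bernstein bonus b_h(s, a) used for UcbExplore-Bernstein."""
    n = max(N_bar_sa, 1)
    var = np.dot(p_hat_sa, V_next ** 2) - np.dot(p_hat_sa, V_next) ** 2
    return np.sqrt(var / n) + (H - h) / n

def bonus_disco(p_hat_sas, N_sa):
    """Transition bonus b_k(s, a, s') used for DisCo in the experiments."""
    n = max(N_sa, 1)
    return np.sqrt(p_hat_sas * (1.0 - p_hat_sas) / n) + 1.0 / n
```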
F.2 Confusing Chain
The confusing chain environment referred to in Sect. 5 is constructed as follows. It is an MDP composed of an initial state $s_0$, a chain of length $C$ (whose states are denoted by $s_1, \ldots, s_C$) and a set of $K$ confusing states ($s_{C+1}, \ldots, s_{C+K}$). Two actions are available in each state. In state $s_0$, we have a forward action $a_1$ that moves to the chain with probability $p_c$ (i.e., $p(s_1 \mid s_0, a_1) = p_c$ and $p(s_0 \mid s_0, a_1) = 1 - p_c$) and a confusing action $a_2$ that has uniform probability of reaching any confusing state ($p(s_i \mid s_0, a_2) = 1/K$ for any $i \in \{C+1, \ldots, C+K\}$). In the confusing states, all actions move deterministically to the end of the chain ($p(s_C \mid s_i, a) = 1$ for any $i \in \{C+1, \ldots, C+K\}$ and any $a$). In each state of the chain, there is a forward action $a_1$ that behaves as in $s_0$ ($p(s_{\min(C, i+1)} \mid s_i, a_1) = p_c$ and $p(s_i \mid s_i, a_1) = 1 - p_c$, for any $i \in \{1, \ldots, C-1\}$) and a skip action $a_2$ that moves $m$ states ahead with probability $p_{\text{skip}}$ ($p(s_{\min(C, i+m)} \mid s_i, a_2) = p_{\text{skip}}$ and $p(s_i \mid s_i, a_2) = 1 - p_{\text{skip}}$, for any $i \in \{1, \ldots, C-1\}$). Finally, $p(s_0 \mid s_C, a) = 1$ for any action $a$. In our experiments, we set $m = 4$, $p_{\text{skip}} = 1/2$, $p_c = 1$, $C = 5$, $K = 6$ and $L = 4.5$.
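A sketch of the corresponding transition tensor is given below; the parameter defaults mirror the values reported above and are illustrative wherever the extraction leaves them ambiguous.

```python
import numpy as np

def confusing_chain(C=5, K=6, p_c=1.0, p_skip=0.5, m=4):
    """Transition tensor of the confusing chain (sketch).

    States: s_0, chain s_1..s_C, confusing states s_{C+1}..s_{C+K}.
    Actions: 0 = forward a_1, 1 = confusing/skip a_2.
    """
    S = 1 + C + K
    p = np.zeros((S, 2, S))
    # s_0: forward enters the chain, the other action scatters to confusing states.
    p[0, 0, 1], p[0, 0, 0] = p_c, 1.0 - p_c
    p[0, 1, C + 1: C + K + 1] = 1.0 / K
    # Chain states s_1..s_{C-1}: forward and skip actions.
    for i in range(1, C):
        p[i, 0, min(C, i + 1)], p[i, 0, i] = p_c, 1.0 - p_c
        p[i, 1, min(C, i + m)], p[i, 1, i] = p_skip, 1.0 - p_skip
    # Confusing states: every action moves deterministically to the end of the chain.
    for i in range(C + 1, C + K + 1):
        p[i, :, C] = 1.0
    # End of the chain: every action goes back to s_0.
    p[C, :, 0] = 1.0
    return p
```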
Table 2: Sample complexity of DisCo and UcbExplore-Bernstein on the confusing chain domain, for different values of $\varepsilon$. Values are averaged over multiple runs and the confidence interval of the mean is reported in parentheses.

Table 3: Expected hitting time $v^{\pi_{s_i}}(s_0 \to s_i)$ of state $s_i$ for the goal-oriented policy $\pi_{s_i}$ recovered by UcbExplore-Bernstein on the confusing chain domain, for different values of $\varepsilon$ and for the states $s_1, \ldots, s_6$.
DisCo recovers the optimal goal-oriented policy in all the runs and for all $\varepsilon$; the advantage of DisCo lies in its final policy consolidation step. Values are averaged over multiple runs and the confidence interval of the mean is reported in parentheses (it is omitted when equal to 0). The table shows that UcbExplore recovers the optimal goal-oriented policy in every run only for two of the considered values of $\varepsilon$.

Sample complexity.
We provide in Tab. 2 the sample complexity of the algorithms for varying values of $\varepsilon$. As mentioned in Sect. 5, DisCo outperforms UcbExplore for any value of $\varepsilon$, and increasingly so when $\varepsilon$ decreases. Fig. 7 complements Fig. 2 for additional values of $\varepsilon$.

Quality of goal-reaching policies.
We now investigate the quality of the policies recovered by DisCo and UcbExplore. In particular, we show that DisCo is able to find the incrementally near-optimal shortest-path policies to any goal state, while UcbExplore may only recover sub-optimal policies. On the confusing chain domain, the intuition is that the set of confusing states makes $s_C$ reachable in just two steps, but the confusing states are not in the controllable set and thus the algorithms are not able to recover the shortest-path policy to $s_C$ through them. On the other hand, state $s_C$ is controllable through two policies: 1) the policy $\pi_1$ that always takes the forward action $a_1$ and traverses the whole chain to reach $s_C$; 2) the policy $\pi_2$ that takes the skip action $a_2$ in $s_1$. We observed empirically that DisCo always recovers the faster of the two policies, while UcbExplore selects the slower one in several cases. This is highlighted in Tab. 3, where we report the expected hitting times of the policies recovered by the algorithms. This finding is not surprising since, as we explain in Sect. 4 and App. A, UcbExplore is designed to find policies reaching states in at most $L$ steps on average, yet it is not able to recover incrementally near-optimal shortest-path policies, as opposed to DisCo.

F.3 Combination Lock
We consider the combination lock problem introduced in [31]. The domain is a stochastic chain with $S = 6$ states and $A = 2$ actions. In each state $x_k$, action right ($a_1$) is deterministic and leads to state $x_{k+1}$, while action left ($a_2$) moves to a state $x_l$ with $l < k$ with probability inversely proportional to the distance between the states. Formally, we have that
$$n(x_k, x_l) = \begin{cases} \frac{1}{k - l} & \text{if } l < k, \\ 0 & \text{otherwise}, \end{cases} \qquad p(x_l \mid x_k, a_2) = \frac{n(x_k, x_l)}{\sum_{x} n(x_k, x)}.$$
We set the initial state $s_0$ to be located part-way along the chain. The end states of the chain are absorbing for the actions pointing outside the chain (i.e., $p(x_0 \mid x_0, a_2) = 1$ and $p(x_{S-1} \mid x_{S-1}, a_1) = 1$), while the remaining actions behave normally. See Fig. 5 for an illustration of the domain.
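A sketch of the corresponding transition tensor (with the chain states indexed from left to right) is given below.

```python
import numpy as np

def combination_lock(S=6):
    """Transition tensor of the combination lock domain [31] (sketch).

    Action 0 (right) is deterministic; action 1 (left) jumps back to x_l
    with probability proportional to 1 / (k - l).  The two end states are
    absorbing for the corresponding actions, as described above.
    """
    p = np.zeros((S, 2, S))
    for k in range(S):
        # right: deterministic step towards the end of the chain
        p[k, 0, min(S - 1, k + 1)] = 1.0
        if k == 0:
            p[k, 1, 0] = 1.0          # left is absorbing in the left-most state
        else:
            # left: distance-weighted jump towards the beginning of the chain
            w = np.array([1.0 / (k - l) for l in range(k)])
            p[k, 1, :k] = w / w.sum()
    return p
```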
Figure 5: Combination lock domain with $S = 6$ states. The figure reports the expected hitting times $v^{\pi}(s_0 \to x_i)$ from the initial state. For $L = 3$, the set of incrementally $L$-controllable states $\mathcal{S}^{\to}_L$ contains four of the six states; the goal-oriented policies for the states to the right of $s_0$ always take the right action $a_1$, while the policy for the remaining controllable state always selects the left action $a_2$.
Figure 6: Proportion of the incrementally $L$-controllable states identified by DisCo and UcbExplore in the combination lock domain for $L = 2.5$ and $\varepsilon = 0.2$. Values are averaged over multiple runs.

Sample complexity.
We evaluate the two algorithms DisCo and UcbExplore on the combination lock domain, for $\varepsilon = 0.2$ and $L = 2.5$. We further boost the empirical performance of UcbExplore by using $N_k$ instead of $\overline{N}_k$ for the construction of the confidence intervals (i.e., we do not account for the data bucketing in [1], see App. F.1). To preserve the robustness of the algorithm, we use $\log(|\mathcal{K}_k|) / (\varepsilon')^2$ episodes for UcbExplore's policy evaluation phase (indeed we noticed that the removal of the logarithmic term here sometimes leads UcbExplore to miss some states in $\mathcal{S}^{\to}_L$ in this domain). For the same reason, in DisCo we use the value $\widehat{\Theta}(\mathcal{K}_k) = \max_{s,a} \widehat{\Theta}(s,a,\mathcal{K}_k)$ prescribed by the theoretical algorithm instead of the state-action dependent values used in the previous experiment. We average the experiments over multiple runs and obtain a smaller sample complexity for DisCo than for UcbExplore. Fig. 6 reports the proportion of incrementally $L$-controllable states identified by the algorithms as a function of time. We notice that once again DisCo clearly outperforms UcbExplore.
Figure 7: Proportion of the incrementally $L$-controllable states identified by DisCo and UcbExplore in the confusing chain domain, for $\varepsilon \in \{0.1, 0.2, 0.4, 0.6, 0.8\}$.