Improved Sample Complexity for Incremental Autonomous Exploration in MDPs
Jean Tarbouriech, Matteo Pirotta, Michal Valko, Alessandro Lazaric
Jean Tarbouriech
Facebook AI Research Paris & Inria Lille [email protected]
Matteo Pirotta
Facebook AI Research Paris [email protected]
Michal Valko
DeepMind Paris [email protected]
Alessandro Lazaric
Facebook AI Research Paris [email protected]
Abstract
We investigate the exploration of an unknown environment when no reward function is provided. Building on the incremental exploration setting introduced by Lim and Auer [1], we define the objective of learning the set of ε-optimal goal-conditioned policies attaining all states that are incrementally reachable within L steps (in expectation) from a reference state s_0. In this paper, we introduce a novel model-based approach that interleaves discovering new states from s_0 and improving the accuracy of a model estimate that is used to compute goal-conditioned policies to reach newly discovered states. The resulting algorithm, DisCo, achieves a sample complexity scaling as Õ(L^5 S_{L+ε} Γ_{L+ε} A ε^{-2}), where A is the number of actions, S_{L+ε} is the number of states that are incrementally reachable from s_0 in L + ε steps, and Γ_{L+ε} is the branching factor of the dynamics over such states. This improves over the algorithm proposed in [1] in both ε and L at the cost of an extra Γ_{L+ε} factor, which is small in most environments of interest. Furthermore, DisCo is the first algorithm that can return an ε/c_min-optimal policy for any cost-sensitive shortest-path problem defined on the L-reachable states with minimum cost c_min. Finally, we report preliminary empirical results confirming our theoretical findings.

Introduction

In cases where the reward signal is not informative enough (e.g., too sparse, time-varying or even absent), a reinforcement learning (RL) agent needs to explore the environment driven by objectives other than reward maximization, see e.g., [2, 3, 4, 5, 6]. This can be performed by designing intrinsic rewards to drive the learning process, for instance via state visitation counts [7, 8], novelty or prediction errors [9, 10, 11]. Other recent methods perform information-theoretic skill discovery to learn a set of diverse and task-agnostic behaviors [12, 13, 14]. Alternatively, goal-conditioned policies learned by carefully designing the sequence of goals during the learning process are often used to solve sparse reward problems [15] and a variety of goal-reaching tasks [16, 17, 18, 19].

While the approaches reviewed above effectively leverage deep RL techniques and are able to achieve impressive results in complex domains (e.g., Montezuma's Revenge [15] or real-world robotic manipulation tasks [19]), they often lack substantial theoretical understanding and guarantees. Recently, some unsupervised RL objectives were analyzed rigorously. Some of them quantify how well the agent visits the states under a sought-after frequency, e.g., to induce a maximally entropic state distribution [20, 21, 22, 23]. While such strategies provably mimic their desired behavior via a Frank-Wolfe algorithmic scheme, they may not learn how to effectively reach any state of the environment and thus may not be sufficient to efficiently solve downstream tasks. Another relevant take is the reward-free RL paradigm of [24]: following its exploration phase, the agent is able to compute a near-optimal policy for any reward function at test time. While this framework yields strong end-to-end guarantees, it is limited to the finite-horizon setting and the agent is thus unable to tackle tasks beyond finite-horizon, e.g., goal-conditioned tasks.

In this paper, we build on and refine the setting of incremental exploration of [1]: the agent starts at an initial state s_0 in an unknown, possibly large environment, and it is provided with a RESET action to restart at s_0.
At a high level, in this setting the agent should explore the environment and stop when it has identified the tasks within its reach and learned to master each of them sufficiently well. More specifically, the objective of the agent is to learn a goal-conditioned policy for any state that can be reached from s_0 within L steps in expectation; such a state is said to be L-controllable. Lim and Auer [1] address this setting with the UcbExplore method, for which they bound the number of exploration steps that are required to identify in an incremental way all L-controllable states (i.e., the algorithm needs to define a suitable stopping condition) and to return a set of policies that are able to reach each of them in at most L + ε steps. A key aspect of UcbExplore is to first focus on simple states (i.e., states that can be reached within a few steps), learn policies to efficiently reach them, and leverage them to identify and tackle states that are increasingly more difficult to reach. This approach aims to avoid wasting exploration in the attempt of reaching states that are further than L steps from s_0 or that are too difficult to reach given the limited knowledge available at earlier stages of the exploration process. Our main contributions are:

• We strengthen the objective of incremental exploration and require the agent to learn ε-optimal goal-conditioned policies for any L-controllable state. Formally, let V*(s) be the length of the shortest path from s_0 to s; then the agent needs to learn a policy to navigate from s_0 to s in at most V*(s) + ε steps, while in [1] any policy reaching s in at most L + ε steps is acceptable.

• We design DisCo, a novel algorithm for incremental exploration. DisCo relies on an estimate of the transition model to compute goal-conditioned policies to the states observed so far, and then uses those policies to improve the accuracy of the model and incrementally discover new states.

• We derive a sample complexity bound for DisCo scaling as Õ(L^5 S_{L+ε} Γ_{L+ε} A ε^{-2}), where A is the number of actions, S_{L+ε} is the number of states that are incrementally controllable from s_0 in L + ε steps, and Γ_{L+ε} is the branching factor of the dynamics over such incrementally controllable states. Not only is this sample complexity obtained for a more challenging objective than UcbExplore, but it also improves in both ε and L at the cost of an extra Γ_{L+ε} factor, which is small in most environments of interest.

• Leveraging the model-based nature of DisCo, we can also readily compute an ε/c_min-optimal policy for any cost-sensitive shortest-path problem defined on the L-controllable states with minimum cost c_min. This result serves as a goal-conditioned counterpart to the reward-free exploration framework defined by Jin et al. [24] for the finite-horizon setting.

In this section we expand on [1], with a more challenging objective for autonomous exploration.

L-Controllable States

We consider a reward-free
Markov decision process (MDP) [25, Sect. 8.3] M := ⟨S, A, p, s_0⟩. We assume a finite action space A with A = |A| actions, and a finite, possibly large state space S for which an upper bound S on its cardinality is known, i.e., |S| ≤ S. Each state-action pair (s, a) ∈ S × A is characterized by an unknown transition probability distribution p(·|s, a) over next states. We denote by Γ_{S'} := max_{(s,a) ∈ S'×A} ‖{p(s'|s, a)}_{s' ∈ S'}‖_0 the largest branching factor of the dynamics over states in any subset S' ⊆ S. The environment has no extrinsic reward, and s_0 ∈ S is a designated initial state. A deterministic stationary policy π : S → A is a mapping from states to actions, and we denote by Π the set of all possible policies. Since in environments with arbitrary dynamics the learner may get stuck in a state without being able to return to s_0, we introduce the following assumption.

Footnote: We say that f(ε) = Õ(ε^{-α}) if there are constants a, b such that f(ε) ≤ a · ε^{-α} log^b(1/ε).

Footnote: Lim and Auer [1] originally considered a countable, possibly infinite state space; however this leads to a technical issue in the analysis of UcbExplore (acknowledged by the authors via personal communication and explained in App. E.3), which disappears by considering only finite state spaces.

Footnote: This assumption should be contrasted with the finite-horizon setting, where each policy resets automatically after H steps, or assumptions on the MDP dynamics such as ergodicity or bounded diameter, which guarantee that it is always possible to find a policy navigating between any two states.
Figure 1: Illustration of L-controllable states in two environments (the initial state s_0 is in white). Left: each transition between states is deterministic and depicted with an edge. Right: each transition from s_0 to the first layer is equiprobable and the transitions in the successive layers are deterministic. If we set L = 3, then the states belonging to S_L are colored in red. As the right figure illustrates, L-controllability is not necessarily linked to a notion of distance between states, and an L-controllable state may be achieved by traversing states that are not L-controllable themselves.

Assumption 1.
The action space contains a RESET action such that p(s_0 | s, RESET) = 1 for any s ∈ S.

We make explicit the states where a policy π takes action RESET in the following definition.
Definition 1 (Policy restricted on a subset). For any S' ⊆ S, a policy π is restricted on S' if π(s) = RESET for any s ∉ S'. We denote by Π(S') the set of policies restricted on S'.

We measure the performance of a policy in navigating the MDP as follows.

Definition 2. For any policy π and a pair of states (s, s') ∈ S × S, let τ_π(s → s') be the (random) number of steps it takes to reach s' starting from s when executing policy π, i.e., τ_π(s → s') := inf{t ≥ 0 : s_{t+1} = s' | s_1 = s, π}. We also set v_π(s → s') := E[τ_π(s → s')] as the expected traveling time, which corresponds to the value function of policy π in a stochastic shortest-path setting (SSP, [26, Sect. 3]) with initial state s, goal state s' and unit cost function. Note that we have v_π(s → s') = +∞ when the policy π does not reach s' from s with probability 1. Furthermore, for any subset S' ⊆ S and any state s, we denote by V*_{S'}(s_0 → s) := min_{π ∈ Π(S')} v_π(s_0 → s) the length of the shortest path to s, restricted to policies resetting to s_0 from any state outside S'.

The objective of the learning agent is to control efficiently the environment in the vicinity of s_0. We say that a state s is controlled if the agent can reliably navigate to it from s_0, that is, there exists an effective goal-conditioned policy (i.e., a shortest-path policy) from s_0 to s.

Definition 3 (L-controllable states). Given a reference state s_0, we say that a state s is L-controllable if there exists a policy π such that v_π(s_0 → s) ≤ L. The set of L-controllable states is then

    S_L := { s ∈ S : min_{π ∈ Π} v_π(s_0 → s) ≤ L }.     (1)

We illustrate the concept of controllable states in Fig. 1 for L = 3. Interestingly, in the right figure, the black states are not L-controllable. In fact, there is no policy that can directly choose which one of the black states to reach. On the other hand, the red state, despite being in some sense further from s_0 than the black states, does belong to S_L. In general, there is a crucial difference between the existence of a random realization where a state s is reached from s_0 in fewer than L steps (i.e., the black states) and the notion of L-controllability, which means that there exists a policy that consistently reaches the state in a number of steps less than or equal to L on average (i.e., the red state); the sketch below makes this distinction concrete. This explains the choice of the term controllable over reachable, since a state s is often said to be reachable if there is a policy π with a non-zero probability of eventually reaching it, which is a weaker requirement. Unfortunately, Lim and Auer [1] showed that in order to discover all the states in S_L, the learner may require a number of exploration steps that is exponential in L or |S_L| (we refer the reader to [1, Sect. 2.1] for a more formal and complete characterization of this negative result). Intuitively, this negative result is due to the fact that the minimum in Eq. 1 is over the set of all possible policies, including those that may traverse states that are not in S_L. Hence, we similarly constrain the learner to focus on the set of incrementally controllable states.
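To illustrate Definitions 2 and 3, the following minimal sketch estimates v_π(s_0 → s) by Monte Carlo rollouts on a toy environment mimicking the right-hand side of Fig. 1. The exact layout (four equiprobable first-layer states followed by two deterministic layers) and all names are assumptions made purely for illustration, not the domain used in the paper.

```python
import random

# Assumed toy layout: from s0, moving forward lands uniformly on one of four
# first-layer ("black") states; from any of them, two further deterministic steps
# lead to the single "red" state. RESET always returns to s0 (Assumption 1).
FIRST_LAYER = [f"b{i}" for i in range(4)]

def step(state, action):
    if action == "RESET":
        return "s0"
    if state == "s0":
        return random.choice(FIRST_LAYER)   # equiprobable transition to the first layer
    if state in FIRST_LAYER:
        return state + "_mid"                # deterministic successive layers
    if state.endswith("_mid"):
        return "red"
    return state

def rollout_time(policy, goal, cap=10_000):
    """One realization of tau_pi(s0 -> goal), truncated at `cap` steps."""
    s, t = "s0", 0
    while s != goal and t < cap:
        s, t = step(s, policy(s)), t + 1
    return t

def v_hat(policy, goal, n=50_000):
    """Monte Carlo estimate of the expected traveling time v_pi(s0 -> goal)."""
    return sum(rollout_time(policy, goal) for _ in range(n)) / n

# Best policy for the black state b0: retry from s0 whenever the wrong black state comes up.
to_b0 = lambda s: "forward" if s == "s0" else "RESET"
# Policy for the red state: always move forward.
to_red = lambda s: "forward"

print("v(s0 -> b0) ~", v_hat(to_b0, "b0"))    # ~7 > 3: b0 is not 3-controllable
print("v(s0 -> red) ~", v_hat(to_red, "red")) # = 3: red is 3-controllable
```

Under this assumed layout, every black state can be reached in a single step with positive probability (it is reachable), yet no policy controls it in 3 steps on average, whereas the red state is 3-controllable.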
Definition 4 (Incrementally controllable states S_L^→). Let ≺ be some partial order on S. The set S_L^≺ of states controllable in L steps w.r.t. ≺ is defined inductively as follows. The initial state s_0 belongs to S_L^≺ by definition, and if there exists a policy π restricted on {s' ∈ S_L^≺ : s' ≺ s} with v_π(s_0 → s) ≤ L, then s ∈ S_L^≺. The set S_L^→ of incrementally L-controllable states is defined as S_L^→ := ∪_≺ S_L^≺, where the union is over all possible partial orders.

By way of illustration, in Fig. 1 for L = 3, it holds that S_L^→ = S_L in the left figure, whereas S_L^→ = {s_0} ≠ S_L in the right figure. Indeed, while the red state is L-controllable, it requires traversing the black states, which are not L-controllable.

AX Objectives
We are now ready to formalize two alternative objectives for
Autonomous eXploration (AX) in MDPs.

Definition 5 (AX sample complexity). Fix any length L ≥ 1, error threshold ε > 0 and confidence level δ ∈ (0, 1). The sample complexities C_{AX_L}(A, L, ε, δ) and C_{AX*}(A, L, ε, δ) are defined as the number of time steps required by a learning algorithm A to identify a set K ⊇ S_L^→ such that with probability at least 1 − δ, it has learned a set of policies {π_s}_{s ∈ K} that respectively verifies the following AX requirement:

    (AX_L)  ∀ s ∈ K,  v_{π_s}(s_0 → s) ≤ L + ε,
    (AX*)   ∀ s ∈ K,  v_{π_s}(s_0 → s) ≤ V*_{S_L^→}(s_0 → s) + ε.

Footnote: Note that we translated the condition in [1] of a relative error of Lε into an absolute error of ε, to align it with the common formulation of sample complexity in RL.

Designing agents satisfying the objectives defined above introduces critical difficulties w.r.t. standard goal-directed learning in RL. First, the agent has to find accurate policies for a set of goals (i.e., all incrementally L-controllable states) and not just for one specific goal. On top of this, the set of desired goals itself (i.e., the set S_L^→) is unknown in advance and has to be estimated online. Specifically, AX_L is the original objective introduced in [1] and it requires the agent to discover all the incrementally L-controllable states as fast as possible. At the end of the learning process, for each state s ∈ S_L^→ the agent should return a policy that can reach s from s_0 in at most L steps (in expectation). Unfortunately, this may correspond to a rather poor performance in practice. Consider a state s ∈ S_L^→ such that V*_{S_L^→}(s_0 → s) ≪ L, i.e., the shortest path between s_0 and s following policies restricted on S_L^→ is much smaller than L. Satisfying AX_L only guarantees that a policy reaching s in L steps is found. On the other hand, objective AX* is more demanding, as it requires learning a near-optimal shortest-path policy for each state in S_L^→. Since V*_{S_L^→}(s_0 → s) ≤ L and the gap between the two quantities may be arbitrarily large, especially for states close to s_0 and far from the fringe of S_L^→, AX* is a significantly tighter objective than AX_L and it is thus preferable in practice.

We say that an exploration algorithm solves the AX problem if its sample complexity C_AX(A, L, ε, δ) in Def. 5 is polynomial in |K|, A, L, ε^{-1} and log(S). Notice that requiring a logarithmic dependency on the size of S is crucial but nontrivial, since the overall state space may be large and we do not want the agent to waste time trying to reach states that are not L-controllable. The dependency on the (algorithm-dependent and random) set K can always be replaced using the upper bound |K| ≤ |S_{L+ε}^→|, which is implied with high probability by both the AX_L and AX* conditions. Finally, notice that the error threshold ε > 0 has a two-fold impact on the performance of the algorithm. First, ε defines the largest set S_{L+ε}^→ that could be returned by the algorithm: the larger ε, the bigger the set. Second, as ε increases, the quality (in terms of controllability and navigational precision) of the output policies worsens w.r.t. the shortest-path policy restricted on S_L^→.

The DisCo Algorithm
The algorithm
DisCo (for
Discover and Control) is detailed in Alg. 1. It maintains a set K of "controllable" states and a set U of states that are considered "uncontrollable" so far. A state s is tagged as controllable when a policy to reach s in at most L + ε steps (in expectation from s_0) has been found with high confidence, and we denote by π_s such a policy. The states in U are states that have been discovered as potential members of S_L^→, but the algorithm has yet to produce a policy to control any of them in fewer than L + ε steps. The algorithm stores an estimate of the transition model and it proceeds through rounds, which are indexed by k and incremented whenever a state in U gets transferred to the set K, i.e., when the transition model reaches a level of accuracy sufficient
to compute a policy to control one of the states encountered before. We denote by K_k (resp. U_k) the set of controllable (resp. uncontrollable) states at the beginning of round k. DisCo stops at a round K when it can confidently claim that all the remaining states outside of K_K cannot be L-controllable.

Algorithm 1: DisCo
  Input: actions A, initial state s_0, confidence parameter δ ∈ (0, 1), error threshold ε > 0, L ≥ 1 and (possibly adaptive) allocation function φ : P(S) → N (where P(S) denotes the power set of S).
  Initialize k := 0, K_0 := {s_0}, U_0 := {} and a restricted policy π_{s_0} ∈ Π(K_0). Set ε := min{ε, 1} and continue := True.
  while continue do
      Set k += 1.   // new round
      // ① Sample collection on K
      For each (s, a) ∈ K_k × A, execute policy π_s until the total number of visits N_k(s, a) to (s, a) satisfies N_k(s, a) ≥ n_k := φ(K_k).
      For each (s, a) ∈ K_k × A, add the observed next states s' ∼ p(·|s, a) to U_k if s' ∉ K_k.
      // ② Restriction of candidate states U
      Compute transitions p̂_k(s'|s, a) and W_k := { s' ∈ U_k : ∃ (s, a) ∈ K_k × A, p̂_k(s'|s, a) ≥ (1 − ε/2)/L }.
      if W_k is empty then
          Set continue := False.   // condition STOP1
      else
          // ③ Computation of the optimistic policies on K
          for each state s' ∈ W_k do
              Compute (ũ_{s'}, π̃_{s'}) := OVI_SSP(K_k, A, s', N_k, ε/L), see Alg. 3 in App. D.1.
          Let s† := argmin_{s ∈ W_k} ũ_s(s_0) and ũ† := ũ_{s†}(s_0).
          if ũ† > L then
              Set continue := False.   // condition STOP2
          else
              // ④ State transfer from U to K
              Set K_{k+1} := K_k ∪ {s†}, U_{k+1} := U_k \ {s†} and π_{s†} := π̃_{s†}.
  // ⑤ Policy consolidation: computation on the final set K
  Set K := k.
  for each state s ∈ K_K do
      Compute (ũ_s, π̃_s) := OVI_SSP(K_K, A, s, N_K, ε/L).
  Output: the states s in K_K and their corresponding policy π_s := π̃_s.

Each round is divided into two phases. The first is a sample collection phase. At the beginning of round k, the agent collects additional samples until n_k := φ(K_k) samples are available at each state-action pair in K_k × A (step ①). A key challenge lies in the careful (and adaptive) choice of the allocation function φ, which we report in the statement of Thm. 1 (see Eq. 19 in App. D.4 for its exact definition). Importantly, the incremental construction of K_k entails that sampling at each state s ∈ K_k can be done efficiently. In fact, for all s ∈ K_k the agent has already confidently learned a policy π_s to reach s in at most L + ε steps on average (see how such a policy is computed in the second phase). The generation of transitions (s, a, s') for (s, a) ∈ K_k × A achieves two objectives at once. First, it serves as a discovery step, since all observed next states s' not in K_k are added to U_k; in particular this guarantees sufficient exploration at the fringe (or border) of the set K_k. Second, it improves the accuracy of the model p in the states in K_k, which is essential in computing near-optimal policies and thus fulfilling the AX* condition.

The second phase does not require interacting with the environment and it focuses on the computation of optimistic policies. The agent begins by significantly restricting the set of candidate states in each round to alleviate the computational complexity of the algorithm. Namely, among all the states in U_k, it discards those that do not have a high probability of belonging to S_L^→ by considering a restricted set W_k ⊆ U_k (step ②). In fact, if the estimated probability p̂_k of reaching a state s ∈ U_k from any of the controllable states in K_k is lower than (1 − ε/2)/L, then no shortest-path policy restricted on K_k could get to s from s_0 in fewer than L + ε steps on average. Then for each state s' in W_k, DisCo computes an optimistic policy restricted on K_k to reach s'. Formally, for any candidate state s' ∈ W_k, we define the induced stochastic shortest path (SSP) MDP M'_k with goal state s' as follows.

Definition 6. We define the SSP-MDP M'_k := ⟨S, A'_k(·), c'_k, p'_k⟩ with goal state s', where the action space is such that A'_k(s) = A for all s ∈ K_k and A'_k(s) = {RESET} otherwise (i.e., we focus on policies restricted on K_k). The cost function is such that for all a ∈ A, c'_k(s', a) = 0, and for any s ≠ s', c'_k(s, a) = 1. The transition model is p'_k(s'|s', a) = 1 and p'_k(·|s, a) = p(·|s, a) otherwise. (In words, all actions at states in K_k behave exactly as in M and suffer a unit cost, in all states outside K_k only the reset action to s_0 is available with a unit cost, and all actions at the goal s' induce a zero-cost self-loop.)
The solution of M'_k is the shortest-path policy from s_0 to s' restricted on K_k. Since p'_k is unknown, DisCo cannot compute the exact solution of M'_k; instead, it executes optimistic value iteration (OVI_SSP) for SSP [27, 28] to obtain a value function ũ_{s'} and its associated greedy policy π̃_{s'} restricted on K_k (see App. D.1 for more details).

The agent then chooses a candidate goal state s† for which the value ũ† := ũ_{s†}(s_0) is the smallest. This step can be interpreted as selecting the optimistically most promising new state to control. Two cases are possible. If ũ† ≤ L, then s† is added to K_k (step ④), since the accuracy of the model estimate on the state-action space K_k × A guarantees that the policy π̃_{s†} is able to reach the state s† in fewer than L + ε steps in expectation with high probability (i.e., s† is incrementally (L + ε)-controllable). Otherwise, we can guarantee that S_L^→ ⊆ K_k with high probability. In the latter case, the algorithm terminates and, using the current estimates of the model, it recomputes an optimistic shortest-path policy π_s restricted on the final set K_K for each state s ∈ K_K (step ⑤). This policy consolidation step is essential to identify near-optimal policies restricted on the final set K_K (and thus on S_L^→): indeed, the expansion of the set of so-far controllable states may alter and refine the optimal goal-reaching policies restricted on it (see App. A).

Computational Complexity. Note that algorithmically, we do not need to define M'_k (Def. 6) over the whole state space S, as we can limit it to K_k ∪ {s'}, i.e., the candidate state s' and the set K_k of so-far controllable states. As shown in Thm. 1, this set can be significantly smaller than S. In particular this implies that the computational complexity of the value iteration algorithm used to compute the optimistic policies is independent of S (see App. D.9 for more details). A schematic sketch of the resulting exploration loop is given below.
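The following minimal sketch summarizes how the pieces of Alg. 1 fit together. It is not the authors' implementation: the helper routines collect_until and estimate_model, the callable ovi_ssp (standing for the optimistic value iteration of App. D.1) and the signature of the allocation function phi are all assumptions made for illustration.

```python
def disco(env, A, s0, L, eps, phi, collect_until, estimate_model, ovi_ssp):
    """Schematic sketch of the DisCo loop (Alg. 1); all helpers are assumed, not the paper's code."""
    K, U = {s0}, set()          # controllable / candidate ("uncontrollable") states
    policies = {s0: None}       # goal-conditioned policies for states in K
    counts = {}                 # visit counters N(s, a) and N(s, a, s')
    eps = min(eps, 1.0)

    while True:
        # Step 1: sample collection, until each (s, a) in K x A has n_k = phi(K) samples.
        n_k = phi(K, counts)    # allocation function of Eq. 2 (Thm. 1), evaluated on current estimates
        new_states = collect_until(env, K, policies, counts, n_k)
        U |= {s for s in new_states if s not in K}

        # Step 2: restrict candidates to states reachable with estimated probability
        #         at least (1 - eps / 2) / L from some (s, a) in K x A.
        p_hat = estimate_model(counts, K)
        W = {sp for sp in U
             if any(p_hat.get((s, a, sp), 0.0) >= (1 - eps / 2) / L
                    for s in K for a in A)}
        if not W:
            break               # STOP1: no plausible candidate left

        # Step 3: one optimistic shortest-path policy restricted on K per candidate.
        values, opt_pol = {}, {}
        for sp in W:
            values[sp], opt_pol[sp] = ovi_ssp(K, A, sp, counts, eps / L)
        s_dag = min(W, key=lambda sp: values[sp][s0])
        if values[s_dag][s0] > L:
            break               # STOP2: nothing else is plausibly L-controllable

        # Step 4: transfer the most promising candidate to the controllable set.
        K.add(s_dag); U.discard(s_dag)
        policies[s_dag] = opt_pol[s_dag]

    # Step 5: policy consolidation on the final set K.
    for s in K - {s0}:
        _, policies[s] = ovi_ssp(K, A, s, counts, eps / L)
    return K, policies
```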
Sample Complexity Analysis of DisCo

We now present our main result: a sample complexity guarantee for DisCo for the AX* objective, which directly implies that AX_L is also satisfied.

Theorem 1. There exists an absolute constant α > 0 such that for any L ≥ 1, ε ∈ (0, 1], and δ ∈ (0, 1), if we set the allocation function φ as

    φ : X → α · ( (L^4 Θ̂(X) / ε^2) · log(LSA/(εδ)) + (L^2 |X| / ε) · log(LSA/(εδ)) ),     (2)

with Θ̂(X) := max_{(s,a) ∈ X×A} ( Σ_{s' ∈ X} √( p̂(s'|s, a)(1 − p̂(s'|s, a)) ) )^2, then the algorithm DisCo (Alg. 1) satisfies the following sample complexity bound for AX*:

    C_{AX*}(DisCo, L, ε, δ) = Õ( (L^5 Γ_{L+ε} S_{L+ε} A) / ε^2 + (L^3 S_{L+ε}^2 A) / ε ),     (3)

where S_{L+ε} := |S_{L+ε}^→| and Γ_{L+ε} := max_{(s,a) ∈ S_{L+ε}^→ × A} ‖{p(s'|s, a)}_{s' ∈ S_{L+ε}^→}‖_0 ≤ S_{L+ε} is the maximal support of the transition probabilities p(·|s, a) restricted to the set S_{L+ε}^→.

Given the definition of AX*, Thm. 1 implies that DisCo terminates after C_{AX*}(DisCo, L, ε, δ) time steps, discovers a set of states K ⊇ S_L^→ with |K| ≤ S_{L+ε}, and for each s ∈ K outputs a policy π_s which is ε-optimal w.r.t. policies restricted on S_L^→, i.e., v_{π_s}(s_0 → s) ≤ V*_{S_L^→}(s_0 → s) + ε. Note that Eq. 3 displays only a logarithmic dependency on S, the total number of states. This property of the sample complexity of DisCo, along with its S-independent computational complexity, is significant when the state space S grows large w.r.t. the unknown set of interest S_L^→.
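The allocation function of Eq. 2 can be evaluated directly from the empirical transition estimates. The sketch below is illustrative only: the constant alpha, the exact logarithmic factors and the function names are assumptions (the exact definition is Eq. 19 in App. D.4, which is not reproduced here).

```python
import math

def theta_hat(X, A, p_hat):
    """Empirical aggregate of Thm. 1: max over (s, a) of (sum_{s' in X} sqrt(p(1-p)))^2."""
    best = 0.0
    for s in X:
        for a in A:
            tot = sum(math.sqrt(p_hat.get((s, a, sp), 0.0) * (1.0 - p_hat.get((s, a, sp), 0.0)))
                      for sp in X)
            best = max(best, tot ** 2)
    return best

def phi(X, A, p_hat, L, eps, delta, S, alpha=1.0):
    """Per-pair sample requirement n_k = phi(K_k); alpha and the log factors are placeholders."""
    log_term = math.log(L * S * len(A) / (eps * delta))
    return math.ceil(alpha * (L ** 4 * theta_hat(X, A, p_hat) / eps ** 2 * log_term
                              + L ** 2 * len(X) / eps * log_term))
```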
Proof Sketch of Theorem 1

While the complete proof is reported in App. D, we now provide the main intuition behind the result.

State Transfer from U to K (step ④). Let us focus on a round k and a state s† ∈ U_k that gets added to K_k. For clarity we remove from the notation the round k, the goal state s† and the starting state s_0. We denote by v and ṽ the value functions of the candidate policy π̃ in the true and optimistic model respectively, and by ũ the quantity w.r.t. which π̃ is optimistically greedy. We aim to prove that s† ∈ S_{L+ε}^→ (with high probability). The main chain of inequalities underpinning the argument is

    v ≤ |v − ṽ| + ṽ ≤(a) ε/2 + ṽ ≤(b) ε/2 + ũ + ε/2 ≤(c) L + ε,     (4)

where (c) is guaranteed by algorithmic construction and (b) stems from the chosen level of value iteration accuracy. Inequality (a) has the flavor of a simulation lemma for SSP, relating the shortest-path value function of the same policy between two models (the true one and the optimistic one). Importantly, when restricted to K these two models are close by virtue of the algorithmic design, which enforces the collection of a minimum amount of samples at each state-action pair of K × A, denoted by n. Specifically, we obtain that

    |v − ṽ| = Õ( √(L^4 Γ_K / n) + L^2 |K| / n ),

with Γ_K := max_{(s,a) ∈ K×A} ‖{p(s'|s, a)}_{s' ∈ K}‖_0 ≤ |K|. Note that Γ_K is the branching factor restricted to the set K. Our choice of n (given in Eq. 2) is then dictated so as to upper bound the above quantity by ε/2 in order to satisfy inequality (a). Let us point out that, interestingly yet unfortunately, the structure of the problem does not appear to allow for technical variance-aware improvements seeking to lower the value of n prescribed above (indeed the AX framework requires to analytically encompass the uncontrollable states U into a single meta state with higher transitional uncertainty, see App. D for details).
Termination of the Algorithm. Since S_L^→ is unknown, we have to ensure that none of the states in S_L^→ are "missed". As such, we prove that with overwhelming probability, we have S_L^→ ⊆ K_K when the algorithm terminates at a round denoted by K. It remains to justify the final near-optimality guarantee w.r.t. the set of policies Π(S_L^→). Leveraging the fact that step ⑤ recomputes the policies (π_s)_{s ∈ K_K} on the final set K_K, we establish the following chain of inequalities

    v ≤ |v − ṽ| + ṽ ≤(a) ε/2 + ṽ ≤(b) ε/2 + ũ + ε/2 ≤(c) V*_{K_K} + ε ≤(d) V*_{S_L^→} + ε,     (5)

where (a) and (b) are as in Eq. 4, (c) leverages optimism and (d) stems from the inclusion S_L^→ ⊆ K_K.
Sample Complexity Bound. The choice of allocation function φ in Eq. 2 bounds n_K, the total number of samples required at each state-action pair in K_K × A. We then compute a high-probability bound ψ on the time steps needed to collect a given sample, and show that it scales as Õ(L). Since the sample complexity is solely induced by the sample collection phase (step ①), it can be bounded by the quantity ψ n_K |K_K| A. Putting everything together yields the bound of Thm. 1.

Comparison with UcbExplore [1]
We start by recalling the critical distinction that DisCo succeeds in tackling problem AX*, while UcbExplore [1] fails to do so (see App. A for details on the AX objectives). Nonetheless, in the following we show that even if we restrict our attention to AX_L, for which UcbExplore is designed, DisCo yields a better sample complexity in most of the cases. From [1], UcbExplore verifies

    C_{AX_L}(UcbExplore, L, ε, δ) = Õ( (L^6 S_{L+ε} A) / ε^3 ).     (6)

Footnote: Note that if we replace the error of ε for AX_L with an error of Lε as in [1], we recover the sample complexity of Õ(L^3 S_{L+ε} A / ε^3) stated in [1, Thm. 8].

Eq. 6 shows that the sample complexity of UcbExplore is linear in S_{L+ε}, while for DisCo the dependency is somewhat worse. In the main-order term Õ(1/ε^2) of Eq. 3, the bound depends linearly on S_{L+ε} but also grows with the branching factor Γ_{L+ε}, which is not the "global" branching factor of the MDP but only the branching factor of transitions into S_{L+ε}^→ starting from S_{L+ε}^→. While in general we only have Γ_{L+ε} ≤ S_{L+ε}, in many practical domains (e.g., robotics, user modeling) each state can only transition to a small number of states, i.e., we often have Γ_{L+ε} = O(1) as long as the dynamics is not too "chaotic". While DisCo does suffer from a quadratic dependency on S_{L+ε} in the second term of order Õ(1/ε), we notice that for any S_{L+ε} ≤ L^3 ε^{-2} the bound of DisCo is still preferable. Furthermore, since for ε → 0, S_{L+ε} tends to S_L, the condition is always verified for small enough ε.

Compared to DisCo, the sample complexity of
UcbExplore is worse in both ε and L. As stressed in Sect. 2.2, the better dependency on ε both improves the quality of the output goal-reaching policies and reduces the number of incrementally (L + ε)-controllable states returned by the algorithm. It is interesting to investigate why the bound of [1] (Eq. 6) inherits an Õ(ε^{-3}) dependency. As reviewed in App. E, UcbExplore alternates between two phases of state discovery and policy evaluation. The optimistic policies computed by UcbExplore solve a finite-horizon problem (with horizon set to H_UCB). However, minimizing the expected time to reach a target state is intrinsically an SSP problem, which is exactly what DisCo leverages. By computing policies that solve a finite-horizon problem (note that UcbExplore resets every H_UCB time steps), [1] sets the horizon to H_UCB := ⌈L + Lε^{-1}⌉, which leads to a policy-evaluation phase with sample complexity scaling as Õ(H_UCB ε^{-2}) = Õ(ε^{-3}). Since the rollout budget of Õ(ε^{-3}) is hard-coded into the algorithm, the dependency on ε of UcbExplore's sample complexity cannot be improved by a more refined analysis; instead a different algorithmic approach is required, such as the one employed by DisCo.

Planning on S_L^→ with DisCo
A compelling advantage of
DisCo is that it achieves an accurate estimation of the environment's dynamics restricted to the unknown subset of interest S_L^→. In contrast to UcbExplore, which needs to restart its sample collection from scratch whenever L, ε or some transition costs change, DisCo can thus be robust to changes in such problem parameters. At the end of its exploration phase in Alg. 1, DisCo is able to perform zero-shot planning to solve other tasks restricted on S_L^→, such as cost-sensitive ones. Indeed, in the following we show how the DisCo agent is able to compute an ε/c_min-optimal policy for any stochastic shortest-path problem on S_L^→ with goal state s ∈ S_L^→ (i.e., s is absorbing and zero-cost) and cost function lower bounded by c_min > 0.

Corollary 1.
There exists an absolute constant β > 0 such that for any L ≥ 1, ε ∈ (0, 1] and c_min ∈ (0, 1] verifying ε ≤ β · (L c_min), with probability at least 1 − δ, for any goal state s ∈ S_L^→ and any cost function c taking values in [c_min, 1], DisCo can compute (after its exploration phase, without additional environment interaction) a policy π̂_{s,c} whose SSP value function V_{π̂_{s,c}} verifies

    V_{π̂_{s,c}}(s_0 → s) ≤ V*_{S_L^→}(s_0 → s) + ε/c_min,

where V_π(s_0 → s) := E[ Σ_{t=1}^{τ_π(s_0 → s)} c(s_t, π(s_t)) | s_1 = s_0 ] is the SSP value function of a policy π and V*_{S_L^→}(s_0 → s) := min_{π ∈ Π(S_L^→)} V_π(s_0 → s) is the optimal SSP value function restricted on S_L^→.

It is interesting to compare Cor. 1 with the reward-free exploration framework recently introduced by Jin et al. [24] in finite-horizon. At a high level, the result in Cor. 1 can be seen as a counterpart of [24] beyond finite-horizon problems, specifically in the goal-conditioned setting. While the parameter L defines the horizon of interest for DisCo, resetting after every L steps (as in finite-horizon) would prevent the agent from identifying L-controllable states and lead to poor performance. This explains the distinct technical tools used: while [24] executes finite-horizon no-regret algorithms, DisCo deploys SSP policies restricted on the set of states that it "controls" so far. Algorithmically, both approaches seek to build accurate estimates of the transitions on a specific (unknown) state space of interest: the so-called "significant" states within H steps for [24], and the incrementally L-controllable states S_L^→ for DisCo. Bound-wise, the cost-sensitive AX* problem inherits the critical role of the minimum cost c_min in SSP problems (see App. C and e.g., [27, 28, 29]), which is reflected in the accuracy of Cor. 1 scaling inversely with c_min. Another interesting element of comparison is the dependency on the size of the state space. While the algorithm introduced in [24] is robust w.r.t. states that can be reached with very low probability, it still displays a polynomial dependency on the total number of states S. On the other hand, DisCo has only a logarithmic dependency on S, while it directly depends on the number of (L + ε)-controllable states, which shows that DisCo effectively adapts to the state space of interest and ignores all other states. This result is significant since not only can S_{L+ε} be arbitrarily smaller than S, but also because the set S_{L+ε}^→ itself is initially unknown to the algorithm. A sketch of such zero-shot planning is given below.
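The following minimal sketch illustrates the zero-shot planning of Cor. 1 under stated assumptions: p_hat is the learned empirical model restricted to the returned set K, cost is a new cost function in [c_min, 1], and vi_ssp is a generic value-iteration routine for SSP such as the one sketched in App. B.1 below; none of these names come from the paper.

```python
def plan_cost_sensitive(K, A, s0, p_hat, goal, cost, vi_ssp, precision=1e-4):
    """Zero-shot SSP planning on the learned model restricted to K (a sketch, not the paper's code)."""
    out = "__out__"                      # aggregate of all states outside K: only RESET is available there
    states = [s for s in K if s != goal] + [out]

    def p(s, a, sp):                     # transition model of the restricted SSP-MDP (cf. Def. 6)
        if s == out:
            return 1.0 if sp == s0 else 0.0          # forced reset to s0 (Assumption 1)
        mass_in = sum(p_hat.get((s, a, y), 0.0) for y in K)
        if sp == out:
            return max(0.0, 1.0 - mass_in)           # leftover estimated mass leaves K
        return p_hat.get((s, a, sp), 0.0)

    def c(s, a):                         # unit cost outside K, given costs in [c_min, 1] inside
        return 1.0 if s == out else cost[(s, a)]

    return vi_ssp(states, A, goal, p, c, precision)  # value vector and greedy policy
```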
Figure 2: Proportion of the incrementally L-controllable states identified by DisCo and UcbExplore as a function of time, in a confusing chain domain for L = 4.5 and ε ∈ {0.1, 0.4, 0.8}. Values are averaged over multiple runs.

Numerical Simulations

In this section, we provide the first evaluation of algorithms in the incremental autonomous exploration setting. In the implementation of both
DisCo and
UcbExplore , we remove the logarithmic andconstant terms for simplicity. We also boost the empirical performance of
UcbExplore in variousways, for example by considering confidence intervals derived from the empirical Bernstein inequality(see [30]) as opposed to Hoeffding as done in [1]. We refer the reader to App. F for details on thealgorithmic configurations and on the environments considered.We compare the sample complexity empirically achieved by
DisCo and
UcbExplore. Fig. 2 depicts the time needed to identify all the incrementally L-controllable states when L = 4.5, for different values of ε, on a confusing chain domain. Note that the sample complexity is achieved soon after, when the algorithm can confidently discard all the remaining states as non-controllable (it is reported in Tab. 2 of App. F). We observe that DisCo outperforms UcbExplore for any value of ε. In particular, the gap in performance increases as ε decreases, which matches the theoretical improvement in sample complexity from Õ(ε^{-3}) for UcbExplore to Õ(ε^{-2}) for DisCo. On a second environment, the combination lock problem introduced in [31], we notice that
DisCo again outperforms
UcbExplore , as shown in App. F.Another important feature of
DisCo is that it targets the tighter objective AX (cid:63) , whereas UcbExplore is only able to fulfill objective AX L and may therefore elect suboptimal policies. In App. F we showempirically that, as expected theoretically, this directly translates into higher-quality goal-reachingpolicies recovered by DisCo . Connections to existing deep-RL methods.
While we primarily focus the analysis of
DisCo in thetabular case, we believe that the formal definition of AX problems and the general structure of DisCo may also serve as a theoretical grounding of many recent approaches to unsupervised exploration.For instance, it is interesting to draw a parallel between
DisCo and the ideas behind Go-Explore [32].Go-Explore similarly exploits the following principles: (1) remember states that have previously beenvisited, (2) first return to a promising state (without exploration), (3) then explore from it. Go-Exploreassumes that the world is deterministic and resettable, meaning that one can reset the state of thesimulator to a previous visit to that cell. Very recently [15], the same authors proposed a way to relaxthis requirement by training goal-conditioned policies to reliably return to cells in the archive duringthe exploration phase. In this paper, we investigated the theoretical dimension of this direction, byprovably learning such goal-conditioned policies for the set of incrementally controllable states.
Future work.
Interesting directions for future investigation include: Deriving a lower bound for the AX problems; Integrating
DisCo into the meta-algorithm
MNM [33], which deals with incremental exploration for AX_L in non-stationary environments; Extending the problem to continuous state spaces and function approximation; Relaxing the definition of incrementally controllable states and relaxing the performance definition towards allowing the agent to have a non-zero but limited sample complexity of learning a shortest-path policy for any state at test time.

Broader Impact
This paper makes contributions to the fundamentals of online learning (RL) and due to its theoreticalnature, we see no ethical or immediate societal consequence of our work.
References [1] Shiau Hong Lim and Peter Auer. Autonomous exploration for navigating in MDPs. In
Conference on Learning Theory, pages 40.1–40.24, 2012.
[2] Jürgen Schmidhuber. A possibility for implementing curiosity and boredom in model-building neural controllers. In
Proc. of the international conference on simulation of adaptive behavior:From animals to animats , pages 222–227, 1991.[3] Nuttapong Chentanez, Andrew G Barto, and Satinder P Singh. Intrinsically motivated rein-forcement learning. In
Advances in neural information processing systems , pages 1281–1288,2005.[4] Pierre-Yves Oudeyer and Frederic Kaplan. What is intrinsic motivation? a typology ofcomputational approaches.
Frontiers in neurorobotics , 1:6, 2009.[5] Satinder Singh, Richard L Lewis, Andrew G Barto, and Jonathan Sorg. Intrinsically motivatedreinforcement learning: An evolutionary perspective.
IEEE Transactions on Autonomous MentalDevelopment , 2(2):70–82, 2010.[6] Adrien Baranes and Pierre-Yves Oudeyer. Intrinsically motivated goal exploration for activemotor learning in robots: A case study. In , pages 1766–1773. IEEE, 2010.[7] Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and RemiMunos. Unifying count-based exploration and intrinsic motivation. In
Advances in neuralinformation processing systems , pages 1471–1479, 2016.[8] Haoran Tang, Rein Houthooft, Davis Foote, Adam Stooke, Xi Chen, Yan Duan, John Schulman,Filip DeTurck, and Pieter Abbeel.
#Exploration: A study of count-based exploration for deep reinforcement learning. In Advances in Neural Information Processing Systems, pages 2753–2762, 2017.
[9] Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. Variational information maximizing exploration.
Advances in Neural Information ProcessingSystems (NIPS) , 2016.[10] Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven explorationby self-supervised prediction. In
Proceedings of the IEEE Conference on Computer Vision andPattern Recognition Workshops , pages 16–17, 2017.[11] Mohammad Gheshlaghi Azar, Bilal Piot, Bernardo Avila Pires, Jean-Bastian Grill, FlorentAltch´e, and R´emi Munos. World discovery models. arXiv preprint arXiv:1902.07685 , 2019.[12] Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all youneed: Learning skills without a reward function. In
International Conference on LearningRepresentations , 2019.[13] Archit Sharma, Shixiang Gu, Sergey Levine, Vikash Kumar, and Karol Hausman. Dynamics-aware unsupervised discovery of skills. In
International Conference on Learning Representa-tions , 2020.[14] V´ıctor Campos Cam´u˜nez, Alex Trott, Caiming Xiong, Richard Socher, Xavier Gir´o Nieto, andJordi Torres Vi˜nals. Explore, discover and learn: unsupervised discovery of state-covering skills.In
International Conference on Machine Learning, pages 1317–1327. PMLR, 2020.
[15] Adrien Ecoffet, Joost Huizinga, Joel Lehman, Kenneth O Stanley, and Jeff Clune. First return then explore. arXiv preprint arXiv:2004.12919, 2020.
[16] Carlos Florensa, David Held, Xinyang Geng, and Pieter Abbeel. Automatic goal generation for reinforcement learning agents. In
International Conference on Machine Learning , pages1515–1528, 2018.[17] C´edric Colas, Pierre Fournier, Mohamed Chetouani, Olivier Sigaud, and Pierre-Yves Oudeyer.Curious: intrinsically motivated modular multi-goal reinforcement learning. In
Internationalconference on machine learning , pages 1331–1340. PMLR, 2019.[18] David Warde-Farley, Tom Van de Wiele, Tejas Kulkarni, Catalin Ionescu, Steven Hansen, andVolodymyr Mnih. Unsupervised control through non-parametric discriminative rewards. In
International Conference on Learning Representations , 2019.[19] Vitchyr H Pong, Murtaza Dalal, Steven Lin, Ashvin Nair, Shikhar Bahl, and Sergey Levine.Skew-fit: State-covering self-supervised reinforcement learning. In
International Conferenceon Machine Learning , pages 7783–7792. PMLR, 2020.[20] Elad Hazan, Sham Kakade, Karan Singh, and Abby Van Soest. Provably efficient maximumentropy exploration. In
International Conference on Machine Learning , pages 2681–2691,2019.[21] Jean Tarbouriech and Alessandro Lazaric. Active exploration in markov decision processes.In
The 22nd International Conference on Artificial Intelligence and Statistics , pages 974–982,2019.[22] Wang Chi Cheung. Exploration-exploitation trade-off in reinforcement learning on onlinemarkov decision processes with global concave rewards. arXiv preprint arXiv:1905.06466 ,2019.[23] Jean Tarbouriech, Shubhanshu Shekhar, Matteo Pirotta, Mohammad Ghavamzadeh, and Alessan-dro Lazaric. Active model estimation in markov decision processes. In
Conference on Uncer-tainty in Artificial Intelligence , 2020.[24] Chi Jin, Akshay Krishnamurthy, Max Simchowitz, and Tiancheng Yu. Reward-free explorationfor reinforcement learning. In
International Conference on Machine Learning , pages 4870–4879.PMLR, 2020.[25] Martin L Puterman.
Markov Decision Processes.: Discrete Stochastic Dynamic Programming .John Wiley & Sons, 2014.[26] Dimitri Bertsekas.
Dynamic programming and optimal control , volume 2. 2012.[27] Jean Tarbouriech, Evrard Garcelon, Michal Valko, Matteo Pirotta, and Alessandro Lazaric.No-regret exploration in goal-oriented reinforcement learning. In
International Conference onMachine Learning , pages 9428–9437. PMLR, 2020.[28] Aviv Rosenberg, Alon Cohen, Yishay Mansour, and Haim Kaplan. Near-optimal regret boundsfor stochastic shortest path. In
International Conference on Machine Learning , pages 8210–8219. PMLR, 2020.[29] Dimitri P Bertsekas and Huizhen Yu. Stochastic shortest path problems under weak conditions.
Lab. for Information and Decision Systems Report LIDS-P-2909, MIT , 2013.[30] Mohammad Gheshlaghi Azar, Ian Osband, and R´emi Munos. Minimax regret bounds forreinforcement learning. In
Proceedings of the 34th International Conference on MachineLearning-Volume 70 , pages 263–272. JMLR. org, 2017.[31] Mohammad Gheshlaghi Azar, Vicenc¸ G´omez, and Hilbert J Kappen. Dynamic policy program-ming.
Journal of Machine Learning Research, 13(Nov):3207–3245, 2012.
[32] Adrien Ecoffet, Joost Huizinga, Joel Lehman, Kenneth O Stanley, and Jeff Clune. Go-explore: a new approach for hard-exploration problems. arXiv preprint arXiv:1901.10995, 2019.
[33] Pratik Gajane, Ronald Ortner, Peter Auer, and Csaba Szepesvari. Autonomous exploration for navigating in non-stationary CMPs. arXiv preprint arXiv:1910.08446, 2019.
[34] Blai Bonet. On the speed of convergence of value iteration on stochastic shortest-path problems.
Mathematics of Operations Research , 32(2):365–373, 2007.[35] Jean-Yves Audibert, R´emi Munos, and Csaba Szepesv´ari. Tuning bandit algorithms in stochasticenvironments. In
International conference on algorithmic learning theory , pages 150–165.Springer, 2007.[36] Andreas Maurer and Massimiliano Pontil. Empirical bernstein bounds and sample variancepenalization. arXiv preprint arXiv:0907.3740 , 2009.[37] Dimitri P Bertsekas and John N Tsitsiklis. An analysis of stochastic shortest path problems.
Mathematics of Operations Research , 16(3):580–595, 1991.[38] Ronan Fruit, Matteo Pirotta, and Alessandro Lazaric. Improved analysis of ucrl2 with empiricalbernstein inequality. arXiv preprint arXiv:2007.05456 , 2020.[39] Abbas Kazerouni, Mohammad Ghavamzadeh, Yasin Abbasi, and Benjamin Van Roy. Conserva-tive contextual linear bandits. In
Advances in Neural Information Processing Systems, pages 3910–3919, 2017.

Appendix
A Autonomous Exploration Objectives
We recall the two AX objectives stated in Def. 5: for any length L ≥ 1, error threshold ε > 0 and confidence level δ ∈ (0, 1), the sample complexities C_{AX_L}(A, L, ε, δ) and C_{AX*}(A, L, ε, δ) are defined as the number of time steps required by a learning algorithm A to identify a set K ⊇ S_L^→ such that with probability at least 1 − δ, it has learned a set of policies {π_s}_{s ∈ K} that respectively verifies the following AX requirement:

    (AX_L)  ∀ s ∈ K,  v_{π_s}(s_0 → s) ≤ L + ε,
    (AX*)   ∀ s ∈ K,  v_{π_s}(s_0 → s) ≤ V*_{S_L^→}(s_0 → s) + ε.

As we explain in Sect. 4,
DisCo (Alg. 1) succeeds in tackling condition AX*, whereas UcbExplore [1], which is designed to tackle condition AX_L, is unable to tackle AX*. Note that the algorithmic design of UcbExplore entails that it computes policies whose value function implicitly targets V*_{K_t}, with K_t the current set of controllable states. While V*_{K_t} is always smaller than L, UcbExplore cannot provide any tightness guarantees w.r.t. V*_{K_t}, since it has no guarantee that the transition dynamics are estimated well enough on K_t. An additional challenge with which UcbExplore fails to cope is the fact that the set K_t increases over time and thus unlocks new states and paths, which may be useful to improve its shortest-path policies for previously discovered states.

To better understand this phenomenon, let us introduce an alternative condition AX', tighter than AX_L but looser than AX*, which stems from the challenge of not knowing S_L^→ in advance. We define AX' as follows: for any state s in S_L^→, the objective is to find a policy that can reach s from s_0 in at most L' + ε steps on average, where L' := min{l ≤ L : s ∈ S_l^→}, i.e.,

    (AX')  ∀ s ∈ K,  v_{π_s}(s_0 → s) ≤ L' + ε,  where L' := min{l ≤ L : s ∈ S_l^→}.

As mentioned in [1, Corollary 9], it is possible to run separate instances of UcbExplore with increasing L_n = 1 + nε from n = 0 to ⌈(L − 1)ε^{-1}⌉ (i.e., until n satisfies L_{n−1} ≤ L ≤ L_n). This verifies the condition AX' at the cost of a worsened dependency on both ε and L as follows:

    C_{AX'}(UcbExplore, L, ε, δ) = Õ( (L^7 S_{L+ε} A) / ε^4 ).

While AX' is tighter than AX_L, it may be arbitrarily loose compared to AX*, which illustrates the intrinsic limitations of UcbExplore's design.
UcbExplore incrementally expands a set of "controllable" states K: starting with K = {s_0}, at time t a state s is added to K_t whenever UcbExplore can confidently assess that it managed to learn a policy reaching s in fewer than L steps. Since at time t UcbExplore can only consider policies restricted to the controllable states K_t, even the shortest-path policy computed to reach s at time t may not be ε-optimal w.r.t. the whole set S_L^→. Indeed, every time a state is added to K, this state may unlock new paths which may, for previously controllable states, allow for better shortest-path policies restricted on the updated K. Fig. 3 illustrates this behavior, where the state y unlocks a fast path from y to x which should be taken in y instead of resetting to s_0. Consequently, if the agent seeks to tackle condition AX*, it must have the faculty to backtrack, i.e., continuously update both its belief of the vicinity (K) and its notion of optimality on the vicinity (V*_K). Unfortunately, UcbExplore can only compute policies targeting V*_K with K the current set of controllable states, but it fails to be accurate enough to revise such policies as the set of controllable states K is expanded over time. In contrast, by virtue of its allocation function φ (Eq. 2), which enables it to track the number of collected samples as K increases, DisCo is able to improve its candidate shortest-path policies during the consolidation step ⑤ when the final set K is considered. The following general and simple statement captures how the expansion of the state space of interest may alter and refine the optimal policy restricted on it.

Figure 3: Let X := {s_0} ∪ {x} and Y := X ∪ {y}. For any l ≥ 1, suppose that from s_0 the agent reaches x in l steps with probability 1/2, or reaches y in l + 1 steps with probability 1/2. If the goal state is x, constraining an agent to use policies restricted to X (i.e., that reset to s_0 outside of X) is detrimental, since x can actually be reached in 1 step from y. Formally, we can easily prove that V*_X(s_0 → x) − V*_Y(s_0 → x) = l + 1, which grows arbitrarily as l increases.

Table 1: Comparison between the sample complexity of UcbExplore and DisCo, depending on the condition AX_L, AX' or AX*.

            | UcbExplore [1]             | DisCo (Alg. 1)
    AX_L    | Õ(L^6 S_{L+ε} A / ε^3)     | Õ(L^5 Γ_{L+ε} S_{L+ε} A / ε^2 + L^3 S_{L+ε}^2 A / ε)
    AX'     | Õ(L^7 S_{L+ε} A / ε^4)     | Õ(L^5 Γ_{L+ε} S_{L+ε} A / ε^2 + L^3 S_{L+ε}^2 A / ε)
    AX*     | Unable                     | Õ(L^5 Γ_{L+ε} S_{L+ε} A / ε^2 + L^3 S_{L+ε}^2 A / ε)
Lemma 1. For any two sets X ⊆ Y and any state x ∈ X, we have V*_X(s_0 → x) ≥ V*_Y(s_0 → x). Moreover, the gap between the two quantities may be arbitrarily large.

Proof. The inequality is immediate from Asm. 1. Fig. 3 shows the gap may be arbitrarily large.

Finally, we summarize all the sample complexity results in Tab. 1.
B Efficient Computation of Optimistic SSP Policy
In this section we recall from [27, 28] how to efficiently compute an optimistic stochastic shortest-path(SSP) policy.
B.1 Computation of Optimal Policy in Known SSP
This section details the procedure to efficiently compute an (arbitrarily near-) optimal policy π in a known SSP instance with positive costs and which admits at least one proper policy. Recall that a proper policy is a policy whose execution starting from any non-goal state eventually reaches thegoal state with probability one [26].
Definition 7 (SSP-MDP). An SSP-MDP is an MDP M = (S†, A, s†, p, c) where S† is the set of non-goal states with |S†| = S†, A is the set of actions, p is the transition function and c is the cost function. The goal state s† ∉ S† is zero-cost and absorbing, i.e., p(s†|s†, a) = 1 and c(s†, a) = 0 for any a ∈ A. The (possibly unbounded) value function (also called expected cost-to-go) of any policy π ∈ Π starting from state s is defined as

    V_π(s) := E[ Σ_{t=1}^{+∞} c(s_t, π(s_t)) | s_1 = s ] = E[ Σ_{t=1}^{τ_π(s → s†)} c(s_t, π(s_t)) | s_1 = s ].

Assumption 2. We restrict the attention to SSP-MDPs M (see Def. 7) such that, for any (s, a) ∈ S† × A, c(s, a) ∈ [c_min, 1] with c_min > 0. (Note that having positive costs ensures that for any non-proper policy π there exists a state s with V_π(s) = +∞.) Moreover, we assume that there exists at least one proper policy (i.e., one that reaches the goal state s† with probability one starting from any state in S†).

The procedure VI_SSP considers the following inputs: a goal s†, non-goal states S†, a known model p and a known cost function c, with (non-goal) costs lower bounded by c_min > 0. VI_SSP outputs a vector u (of size |S†|) and a policy π which is greedy w.r.t. the vector u. The optimal Bellman operator is defined as follows for any vector u and any non-goal state s ∈ S†:

    L u(s) := min_{a ∈ A} { c(s, a) + Σ_{s' ∈ S†} p(s'|s, a) u(s') }.

Algorithm 2: VI_SSP
  Input: non-goal states S†, action set A, transitions p, costs c and accuracy γ.
  Output: value vector u and greedy policy π.
  Define L u(s) := min_{a ∈ A} { c(s, a) + Σ_{s' ∈ S†} p(s'|s, a) u(s') }.
  Set j := 0, u_0 := 0_{S†} and u_1 := L u_0.
  while ‖u_{j+1} − u_j‖_∞ > γ do
      j := j + 1
      u_{j+1} := L u_j
  Set u := u_j and π(s) ∈ argmin_{a ∈ A} { c(s, a) + Σ_{s' ∈ S†} p(s'|s, a) u(s') } for any s ∈ S† ∪ {s†}.

Note that by definition, V_π(s†) = 0 for any π. We perform a value iteration (VI) scheme over this operator as explained in, e.g., [29, 34, 27]. Namely, we consider the initial vector u_0 := 0 and set iteratively u_{i+1} := L u_i (see Alg. 2). For a predefined VI precision γ > 0, the stopping condition is reached at the first iteration j such that ‖u_{j+1} − u_j‖_∞ ≤ γ. The policy is then selected to be the greedy policy w.r.t. the vector u := u_j, i.e.,

    ∀ s ∈ S† ∪ {s†},  π(s) ∈ argmin_{a ∈ A} { c(s, a) + Σ_{s' ∈ S†} p(s'|s, a) u(s') }.     (7)
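For concreteness, here is a minimal Python rendering of the VI_SSP routine of Alg. 2. The callable-based model interface (p(s, a, s') and c(s, a)) and all names are our own assumptions, not the authors' implementation.

```python
import numpy as np

def vi_ssp(states, actions, goal, p, c, gamma_vi):
    """Value iteration for SSP (a sketch of Alg. 2): `states` are the non-goal states,
    p(s, a, s') and c(s, a) describe the known model, gamma_vi is the VI precision."""
    idx = {s: i for i, s in enumerate(states)}
    u = np.zeros(len(states))                      # u_0 = 0, component-wise below the optimum

    def backup(vec):
        """One application of the optimal Bellman operator L, plus the greedy policy."""
        new = np.empty_like(vec)
        greedy = {}
        for s in states:
            q = {a: c(s, a) + sum(p(s, a, sp) * vec[idx[sp]] for sp in states)
                 for a in actions}                 # the absorbing goal contributes 0 and is omitted
            greedy[s] = min(q, key=q.get)
            new[idx[s]] = q[greedy[s]]
        return new, greedy

    while True:
        u_next, pi = backup(u)
        if np.max(np.abs(u_next - u)) <= gamma_vi: # stopping rule ||u_{j+1} - u_j||_inf <= gamma
            return u_next, pi                      # value vector and greedy policy (Eq. 7)
        u = u_next
```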
Importantly, while u is not the value function of π, both quantities can be related according to the following lemma.

Lemma 2. Consider an SSP-MDP M = (S†, A, s†, p, c) defined as in Def. 7 and satisfying Asm. 2. Let (u, π) = VI_SSP(S†, A, p, c, γ) be the solution computed by VI_SSP. Denote by V_π the true value function of π and by V* = V_{π*} = L V* the optimal value function. The following component-wise inequalities hold:
• u ≤ V* ≤ V_π;
• if the VI precision level verifies γ ≤ c_min/2, then V_π ≤ (1 + 2γ/c_min) u.

Proof. The result can be obtained by adapting [27, Lem. 4 & App. E]. For the first inequality, given that we consider the initial vector u_0 = 0, we know that u_0 ≤ V*, with V* = L V* by definition. By monotonicity of the operator L [25, 26], we obtain u_j ≤ V* ≤ V_π. As for the second inequality, we introduce the following Bellman operators of a deterministic policy π, for any vector u and state s:

    L_π u(s) := c(s, π(s)) + Σ_{s' ∈ S} p(s'|s, π(s)) u(s'),
    T^π_γ u(s) := (c(s, π(s)) − γ) + Σ_{s' ∈ S} p(s'|s, π(s)) u(s'),   where c(s, π(s)) − γ > 0.

Note that the SSP problem defined by the operator T^π_γ satisfies Asm. 2 since i) it has positive costs due to the condition γ ≤ c_min/2 and ii) the fact that M satisfies Asm. 2 guarantees the existence of at least one proper policy in the model p. We can write component-wise

    T^π_γ u_j = L_π u_j − γ =(a) L u_j − γ ≤(b) u_j,

where (a) uses that π is the greedy policy w.r.t. u_j and (b) stems from the chosen stopping condition, which yields L u_j ≤ u_j + γ. By monotonicity of the operator T^π_γ, we have for all m > 0, (T^π_γ)^m u_j ≤ u_j. The asymptotic convergence of the operator in an SSP problem satisfying Asm. 2 (see e.g., [26, Prop. 2.2.1]) guarantees that taking the limit m → +∞ yields W^π_γ ≤ u_j, where W^π_γ is defined as the value function of policy π in the model p with γ subtracted from all the costs, i.e.,

    W^π_γ(s) := E[ Σ_{t=1}^{τ_π(s)} (c(s_t, π(s_t)) − γ) | s_1 = s ] = V_π(s) − γ E[τ_π(s)],

where τ_π(s) denotes the (random) hitting time of policy π to reach the goal starting from state s. Moreover, we have c_min E[τ_π(s)] ≤ V_π(s) ≤ c_max E[τ_π(s)]. Putting everything together, we thus get

    (1 − γ/c_min) V_π ≤ u_j.

Since γ ≤ c_min/2, we ultimately obtain

    V_π ≤ (1/(1 − γ/c_min)) u_j ≤ (1 + 2γ/c_min) u_j,

where the last inequality uses the fact that 1/(1 − x) ≤ 1 + 2x holds for any 0 ≤ x ≤ 1/2.

Figure 4: Optimistic Value Iteration for SSP (OVI_SSP). The routine takes as input the goal state s†, the non-goal states S†, the samples collected so far N, the costs c ≥ c_min > 0 and the VI precision γ > 0; it outputs an optimistic value vector ũ and an optimistic SSP policy π̃.

B.2 Computation of Optimistic Model in Unknown SSP
Consider an SSP problem $M$ defined as in Asm. 2. Consider that, at any given stage of the learning process, the agent is equipped with $N(s,a)$ samples at each state-action pair. A method to compute an optimistic model $\widetilde{p}$ is provided in [28], which we recall below.

Denote by $\widehat{p}$ the current empirical average of transitions: $\widehat{p}(s' \mid s,a) = N(s,a,s') / N(s,a)$, and set $\widehat{\sigma}^2(s' \mid s,a) := \widehat{p}(s' \mid s,a)(1 - \widehat{p}(s' \mid s,a))$ as well as $N^+(s,a) := \max\{1, N(s,a)\}$. For any $(s,a,s') \in \mathcal{S}^\dagger \times \mathcal{A} \times \mathcal{S}^\dagger$, the empirical Bernstein inequality [35, 36] is leveraged to select the following confidence intervals (holding with probability at least $1 - \delta$) on the transition probabilities:
$$\beta(s,a,s') := 2 \sqrt{\frac{\widehat{\sigma}^2(s' \mid s,a)}{N^+(s,a)} \log\Big( \frac{S A\, N^+(s,a)}{\delta} \Big)} + \frac{6 \log\big( \frac{S A\, N^+(s,a)}{\delta} \big)}{N^+(s,a)},$$
and $\beta(s,a,s^\dagger) := \sum_{s' \in \mathcal{S}^\dagger} \beta(s,a,s')$. The selection of the optimistic model $\widetilde{p}$ is as follows: the probability of reaching the goal $s^\dagger$ is maximized at every state-action pair, which implies minimizing the probability of reaching all other states and setting them at the lowest value of their confidence range. Formally, we set for all $(s,a,s') \in \mathcal{S}^\dagger \times \mathcal{A} \times \mathcal{S}^\dagger$,
$$\widetilde{p}(s' \mid s,a) := \max\big\{ \widehat{p}(s' \mid s,a) - \beta(s,a,s'),\ 0 \big\}, \qquad \widetilde{p}(s^\dagger \mid s,a) := 1 - \sum_{s' \in \mathcal{S}^\dagger} \widetilde{p}(s' \mid s,a).$$
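As an illustration (a sketch, not the authors' implementation), the construction above can be written as follows; the array shapes and the handling of the zero-count case via $N^+$ are the only choices made here beyond the formulas above.

```python
import numpy as np

def optimistic_model(counts, delta):
    """Optimistic transition model of App. B.2 (sketch).

    counts: array of shape (S, A, S + 1) with N(s, a, s'); index S stands for
            the goal state s_dagger.
    Returns p_tilde of the same shape, which maximizes the probability of
    reaching the goal within the Bernstein confidence range.
    """
    S, A, _ = counts.shape
    N = counts.sum(axis=2)                          # N(s, a)
    N_plus = np.maximum(1, N)                       # N^+(s, a)
    p_hat = counts / N_plus[:, :, None]             # empirical transitions
    log_term = np.log(S * A * N_plus / delta)
    var = p_hat * (1.0 - p_hat)                     # sigma_hat^2(s' | s, a)
    beta = 2 * np.sqrt(var * log_term[:, :, None] / N_plus[:, :, None]) \
           + 6 * log_term[:, :, None] / N_plus[:, :, None]
    # Shrink every non-goal transition to the bottom of its confidence range...
    p_tilde = np.maximum(p_hat[:, :, :S] - beta[:, :, :S], 0.0)
    # ...and give all the remaining mass to the goal state.
    goal = 1.0 - p_tilde.sum(axis=2)
    return np.concatenate([p_tilde, goal[:, :, None]], axis=2)
```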
B.3 Combining the two: Optimistic Value Iteration for SSP (OVI$_{\text{SSP}}$)

OVI$_{\text{SSP}}$ first computes an optimistic model $\widetilde{p}$ leveraging App. B.2, and it then runs the VI$_{\text{SSP}}$ procedure of App. B.1 in the model $\widetilde{p}$, i.e., $(\widetilde{u}, \widetilde{\pi}) = \text{VI}_{\text{SSP}}(\mathcal{S}^\dagger, \mathcal{A}, s^\dagger, \widetilde{p}, c)$. This outputs an optimistic pair $(\widetilde{u}, \widetilde{\pi})$ composed of the VI vector $\widetilde{u}$ and the policy $\widetilde{\pi}$ that is greedy w.r.t. $\widetilde{u}$ in the model $\widetilde{p}$. The OVI$_{\text{SSP}}$ scheme is recapped in Fig. 4.
C Useful Result: Simulation Lemma for SSP
Consider a stochastic shortest-path (SSP) instance (see Def. 7) that satisfies Asm. 2. We denote by $A = |\mathcal{A}|$ the number of actions, $S = |\mathcal{S}|$ the number of non-goal states, $g \notin \mathcal{S}$ the (zero-cost and absorbing) goal state, $p$ the unknown transitions and $c$ the known cost function. We assume that $0 < c(s,a) \leq 1$ for all $(s,a) \in \mathcal{S} \times \mathcal{A}$, and set $c_{\min} := \min_{s,a} c(s,a) > 0$. We also set $\mathcal{S}' := \mathcal{S} \cup \{ g \}$. Recall that the goal state is zero-cost (i.e., $c(g,a) = 0$) and absorbing (i.e., $p(g \mid g, a) = 1$), and that the value function of a policy amounts to the expected cumulative cost incurred by following this policy until reaching the goal.

Definition 8.
For any model $p$ and $\eta > 0$, we introduce the set of models close to $p$ w.r.t. the $\ell_1$-norm on the non-goal states as follows:
$$\mathcal{P}(p)_\eta := \Big\{ p' \in \mathbb{R}^{\mathcal{S}' \times \mathcal{A} \times \mathcal{S}'} : \forall (s,a) \in \mathcal{S} \times \mathcal{A},\ p'(\cdot \mid s,a) \in \Delta(\mathcal{S}'),\ p'(g \mid g, a) = 1,\ \sum_{y \in \mathcal{S}} \big| p(y \mid s,a) - p'(y \mid s,a) \big| \leq \eta \Big\}.$$

Lemma 3 (Simulation Lemma for SSP). Consider any model $p$ and $p' \in \mathcal{P}(p)_\eta$ such that, for each model, there exists at least one proper policy w.r.t. the goal state $g$. Consider any policy $\pi$ that is proper in $p'$, with value function denoted by $V'_\pi$, such that the following condition is verified:
$$2 \eta\, \| V'_\pi \|_\infty \leq c_{\min}. \tag{8}$$
Then $\pi$ is proper in $p$ (i.e., its value function verifies $V_\pi < +\infty$ component-wise), and we have
$$\forall s \neq g, \quad V_\pi(s) \leq \Big( 1 + \frac{2 \eta \| V'_\pi \|_\infty}{c_{\min}} \Big) V'_\pi(s),$$
and conversely,
$$\forall s \neq g, \quad V'_\pi(s) \leq \Big( 1 + \frac{2 \eta \| V'_\pi \|_\infty}{c_{\min}} \Big) V_\pi(s).$$
Combining the two inequalities above yields $\| V_\pi - V'_\pi \|_\infty \leq \frac{6 \eta \| V'_\pi \|_\infty^2}{c_{\min}}$.
Proof. The proof of Lem. 3 requires a result of [37] recalled in Lem. 4, and it can be seen as a generalization of [28, Lem. B.4]. First, recall that $\pi$ is proper in the model $p'$ by assumption. This implies that its value function, denoted by $V'$, is bounded component-wise. Moreover, for any non-goal state $s \in \mathcal{S}$, the Bellman equation holds as follows:
$$V'(s) = c(s, \pi(s)) + \sum_{y \in \mathcal{S}} p'(y \mid s, \pi(s)) V'(y) = c(s, \pi(s)) + \sum_{y \in \mathcal{S}} p(y \mid s, \pi(s)) V'(y) + \sum_{y \in \mathcal{S}} \big( p'(y \mid s, \pi(s)) - p(y \mid s, \pi(s)) \big) V'(y). \tag{9}$$
By successively using Hölder's inequality and the facts that $p' \in \mathcal{P}(p)_\eta$ and $c(s, \pi(s)) \geq c_{\min}$, we get
$$V'(s) \geq c(s, \pi(s)) - \eta \| V' \|_\infty + p(\cdot \mid s, \pi(s))^\top V' \geq c(s, \pi(s)) \Big( 1 - \frac{\eta \| V' \|_\infty}{c_{\min}} \Big) + p(\cdot \mid s, \pi(s))^\top V'.$$
Let us now introduce the vector $V'' := \big( 1 - \frac{\eta \| V' \|_\infty}{c_{\min}} \big)^{-1} V'$. Then for all $s \in \mathcal{S}$,
$$V''(s) \geq c(s, \pi(s)) + p(\cdot \mid s, \pi(s))^\top V''.$$
Hence, from Lem. 4, $\pi$ is proper in $p$ (i.e., $V < +\infty$), and we have
$$V \leq V'' \leq \Big( 1 + \frac{2 \eta \| V' \|_\infty}{c_{\min}} \Big) V', \tag{10}$$
where the last inequality stems from condition (8) and the fact that $\frac{1}{1-x} \leq 1 + 2x$ holds for any $0 \leq x \leq 1/2$. Conversely, analyzing Eq. 9 from the other side, we get
$$V'(s) \leq c(s, \pi(s)) \Big( 1 + \frac{\eta \| V' \|_\infty}{c_{\min}} \Big) + p(\cdot \mid s, \pi(s))^\top V'.$$
Introduce now the vector $V'' := \big( 1 + \frac{\eta \| V' \|_\infty}{c_{\min}} \big)^{-1} V'$. Then
$$V''(s) \leq c(s, \pi(s)) + p(\cdot \mid s, \pi(s))^\top V''.$$
We then obtain, in the same vein as Lem. 4 (by leveraging the monotonicity of the Bellman operator $\mathcal{L}^\pi U(s) := c(s, \pi(s)) + p(\cdot \mid s, \pi(s))^\top U$), that $V'' \leq V$, and therefore
$$V' \leq \Big( 1 + \frac{\eta \| V' \|_\infty}{c_{\min}} \Big) V \leq \Big( 1 + \frac{2 \eta \| V' \|_\infty}{c_{\min}} \Big) V. \tag{11}$$
Combining Eq. 10 and 11 yields component-wise
$$\| V - V' \|_\infty \leq \frac{2 \eta \| V' \|_\infty}{c_{\min}} \| V' \|_\infty + \frac{2 \eta \| V' \|_\infty}{c_{\min}} \| V \|_\infty \leq \frac{6 \eta \| V' \|_\infty^2}{c_{\min}},$$
where the last inequality uses that $\| V \|_\infty \leq 2 \| V' \|_\infty$, which stems from plugging condition (8) into Eq. 10. Note that here $p$ and $p'$ play symmetric roles; we can perform the same reasoning in the case where $\pi$ is proper in the model $p$, and it would yield an equivalent result by switching the dependencies on $V$ and $V'$.

Lemma 4 ([37], Lem. 1). In an SSP-MDP satisfying Asm. 2, let $\pi$ be any policy. Then:
• If there exists a vector $U : \mathcal{S} \to \mathbb{R}$ such that $U(s) \geq c(s, \pi(s)) + \sum_{s' \in \mathcal{S}} p(s' \mid s, \pi(s)) U(s')$ for all $s \in \mathcal{S}$, then $\pi$ is proper, and the value function $V_\pi$ of $\pi$ is upper bounded by $U$ component-wise, i.e., $V_\pi(s) \leq U(s)$ for all $s \in \mathcal{S}$.
• If $\pi$ is proper, then its value function $V_\pi$ is the unique solution to the Bellman equations $V_\pi(s) = c(s, \pi(s)) + \sum_{s' \in \mathcal{S}} p(s' \mid s, \pi(s)) V_\pi(s')$ for all $s \in \mathcal{S}$.
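The statement of Lem. 3 can be checked numerically on a small randomly generated instance. The sketch below (arbitrary sizes, an ad-hoc perturbation scheme, and a policy represented directly by its transition kernel — all illustrative assumptions) solves the two linear systems for the value functions and compares their gap to the bound of the lemma.

```python
import numpy as np

rng = np.random.default_rng(0)
S = 4            # non-goal states; the goal g is the extra last column
c_min = 0.5

# Transition kernel of a fixed (proper) policy pi and its costs.
P = rng.dirichlet(np.ones(S + 1), size=S)
P[:, S] += 0.2                               # ensure the goal is reachable
P /= P.sum(axis=1, keepdims=True)
cost = rng.uniform(c_min, 1.0, size=S)

def value(P):
    # Solve V = c + P[:, :S] V (the goal has value 0) for a proper policy.
    return np.linalg.solve(np.eye(S) - P[:, :S], cost)

# Perturbed model p' at small l1-distance from p in every state.
eta = 1e-3
P2 = P + rng.uniform(-eta / (2 * (S + 1)), eta / (2 * (S + 1)), size=P.shape)
P2 = np.clip(P2, 0.0, None)
P2 /= P2.sum(axis=1, keepdims=True)
eta_emp = np.abs(P - P2).sum(axis=1).max()   # realized l1 distance

V, V2 = value(P), value(P2)
assert 2 * eta_emp * np.linalg.norm(V2, np.inf) <= c_min          # condition (8)
bound = 6 * eta_emp * np.linalg.norm(V2, np.inf) ** 2 / c_min
assert np.abs(V - V2).max() <= bound
print(np.abs(V - V2).max(), bound)
```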
D Proof of Theorem 1 (Sample Complexity Analysis of DisCo)

D.1 Computation of the Optimistic Policies
At each round $k$, for each goal state $s^\dagger \in \mathcal{W}_k$, DisCo computes an optimistic goal-oriented policy associated to the MDP $M'_k(s^\dagger)$ constructed as in Def. 6. This MDP is defined over the entire state space $\mathcal{S}$ and restricts the action set to the single RESET action outside $\mathcal{K}_k$. We can build an equivalent MDP by restricting the focus to $\mathcal{K}_k$. To this end, we define the following SSP-MDP.

Definition 9.
Define $M^\dagger_k(s^\dagger) := \langle \mathcal{S}^\dagger_k, \mathcal{A}^\dagger_k(\cdot), c^\dagger_k, p^\dagger_k \rangle$, where $\mathcal{S}^\dagger_k := \mathcal{K}_k \cup \{ s^\dagger, x \}$ and $S^\dagger_k := |\mathcal{S}^\dagger_k| = |\mathcal{K}_k| + 2$. State $x$ is a meta-state that encapsulates all the states that have been observed so far and are not in $\mathcal{K}_k$. The action space $\mathcal{A}^\dagger_k(\cdot)$ is such that $\mathcal{A}^\dagger_k(s) = \mathcal{A}$ for all states $s \in \mathcal{K}_k$ and $\mathcal{A}^\dagger_k(s) = \{ \text{RESET} \}$ for $s \in \{ s^\dagger, x \}$. The cost function is $c^\dagger_k(s^\dagger, a) = 0$ for any $a \in \mathcal{A}^\dagger_k(s^\dagger)$ and $c^\dagger_k(s, a) = 1$ everywhere else (in particular at the meta-state $x$). The transition function is defined as $p^\dagger_k(s^\dagger \mid s^\dagger, a) = p^\dagger_k(s_0 \mid x, a) = 1$ for any $a$, $p^\dagger_k(y \mid s, a) = p(y \mid s, a)$ for any $(s, a, y) \in \mathcal{K}_k \times \mathcal{A} \times (\mathcal{K}_k \cup \{ s^\dagger \})$ and $p^\dagger_k(x \mid s, a) = 1 - \sum_{y \in \mathcal{K}_k \cup \{ s^\dagger \}} p^\dagger_k(y \mid s, a)$.

Note that solving $M^\dagger_k$ yields a policy effectively restricted to the set $\mathcal{K}_k$, insofar as we can interpret the meta-state $x$ as $\mathcal{S} \setminus \{ \mathcal{K}_k \cup \{ s^\dagger \} \}$. Since $p$ is unknown, we cannot construct $M^\dagger_k(s^\dagger)$. Let $N_k$ be the state-action counts accumulated up until now. We denote by $\widehat{p}_k$ the "global" empirical estimates, i.e., $\widehat{p}_k(y \mid s, a) = N_k(s, a, y) / N_k(s, a)$. Given them, we define the "restricted" empirical estimates $\widehat{p}^\dagger_k$ as follows: $\widehat{p}^\dagger_k(y \mid s, a) := \widehat{p}_k(y \mid s, a)$ for any $(s, a, y) \in \mathcal{K}_k \times \mathcal{A} \times (\mathcal{K}_k \cup \{ s^\dagger \})$ and $\widehat{p}^\dagger_k(x \mid s, a) := 1 - \sum_{y \in \mathcal{K}_k \cup \{ s^\dagger \}} \widehat{p}^\dagger_k(y \mid s, a)$. Denoting $N^+_k(s, a) := \max\{ 1, N_k(s, a) \}$, we then define the following bonuses for any $(s, a, y) \in \mathcal{K}_k \times \mathcal{A} \times (\mathcal{K}_k \cup \{ s^\dagger \})$:
$$\beta_k(s, a, y) := 2 \sqrt{\frac{\widehat{p}_k(y \mid s, a)(1 - \widehat{p}_k(y \mid s, a))}{N^+_k(s, a)} \log\Big( \frac{S A\, N^+_k(s, a)}{\delta} \Big)} + \frac{6 \log\big( \frac{S A\, N^+_k(s, a)}{\delta} \big)}{N^+_k(s, a)}, \tag{12}$$
$$\beta_k(s, a, x) := \sum_{y \in \mathcal{K}_k \cup \{ s^\dagger \}} \beta_k(s, a, y). \tag{13}$$
Algorithm 3: OVI$_{\text{SSP}}$
Input: $\mathcal{K}_k$, $\mathcal{A}$, $s^\dagger$, $N_k$, $\gamma > 0$
Output: Value vector $\widetilde{u}^\dagger$ and policy $\widetilde{\pi}^\dagger$
Estimate the transition probabilities $\widehat{p}_k$ using $N_k$
Compute the optimistic SSP-MDP $\widetilde{M}^\dagger_k$ as detailed in Def. 10
Compute $(\widetilde{u}^\dagger_k, \widetilde{\pi}^\dagger_k) = \text{VI}_{\text{SSP}}(\mathcal{S}^\dagger_k, \mathcal{A}^\dagger_k, c^\dagger_k, \widetilde{p}^\dagger_k, \gamma)$ (see Alg. 2)

Moreover, we set the uncertainty about the MDP at the meta-state $x$ and at the goal state $s^\dagger$ to $0$ by construction (since their outgoing transitions are deterministic, respectively to $s_0$ and $s^\dagger$). We now leverage the optimistic construction described in App. B.2.

Definition 10.
We denote by $\widetilde{M}^\dagger_k(s^\dagger) = \langle \mathcal{S}^\dagger_k, \mathcal{A}^\dagger_k(\cdot), c^\dagger_k, \widetilde{p}^\dagger_k \rangle$ the optimistic MDP associated to $M^\dagger_k(s^\dagger)$ defined in Def. 9. Then, for all $(s, a) \in \mathcal{K}_k \times \mathcal{A}$,
$$\widetilde{p}^\dagger_k(y \mid s, a) := \max\big\{ \widehat{p}^\dagger_k(y \mid s, a) - \beta_k(s, a, y),\ 0 \big\}, \quad \forall y \in \mathcal{K}_k \cup \{ x \}, \tag{14}$$
$$\widetilde{p}^\dagger_k(s^\dagger \mid s, a) := 1 - \sum_{y \in \mathcal{K}_k \cup \{ x \}} \widetilde{p}^\dagger_k(y \mid s, a), \tag{15}$$
$$\widetilde{p}^\dagger_k(s^\dagger \mid s^\dagger, a) = \widetilde{p}^\dagger_k(s_0 \mid x, a) = 1. \tag{16}$$
Given this MDP, we can compute the optimistic value vector $\widetilde{u}^\dagger_k$ and policy $\widetilde{\pi}^\dagger_k$ using value iteration for SSP: $(\widetilde{u}^\dagger_k, \widetilde{\pi}^\dagger_k) = \text{VI}_{\text{SSP}}(\mathcal{S}^\dagger_k, \mathcal{A}^\dagger_k, c^\dagger_k, \widetilde{p}^\dagger_k, \varepsilon / (4L))$. We summarize the construction of the optimistic model and the computation of the value function and policy in Alg. 3 (OVI$_{\text{SSP}}$).
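For illustration, here is a sketch of the construction of Def. 9–10 from the global empirical model. The local state ordering ($\mathcal{K}_k$ first, then $x$, then $s^\dagger$), the assumption that $s_0$ is the first state of $\mathcal{K}_k$, and the helper name are choices made here for readability, not part of the original algorithm.

```python
import numpy as np

def optimistic_restricted_model(p_hat, beta, K, goal, s0_index=0):
    """Optimistic restricted model p_tilde^dagger_k of Def. 10 (sketch).

    p_hat: global empirical transitions, shape (S, A, S).
    beta:  Bernstein bonuses of Eq. (12), shape (S, A, S).
    K:     ordered list of "controllable" states K_k (global indices);
           s_0 is assumed to be K[s0_index].
    goal:  candidate goal state s_dagger in W_k (not in K).
    """
    S, A, _ = p_hat.shape
    n = len(K)
    x, g = n, n + 1                                    # local indices of x and s_dagger
    p_tilde = np.zeros((n + 2, A, n + 2))
    for i, s in enumerate(K):
        # Empirical restricted model: mass on K_k and the goal is copied,
        # everything else is aggregated into the meta-state x (Def. 9).
        p_K_hat = p_hat[s][:, K]                                   # (A, n)
        p_x_hat = 1.0 - p_K_hat.sum(axis=1) - p_hat[s][:, goal]    # (A,)
        beta_x = beta[s][:, K].sum(axis=1) + beta[s][:, goal]      # Eq. (13)
        # Eq. (14): shrink the probability of every y in K_k U {x} ...
        p_tilde[i, :, :n] = np.maximum(p_K_hat - beta[s][:, K], 0.0)
        p_tilde[i, :, x] = np.maximum(p_x_hat - beta_x, 0.0)
        # Eq. (15): ... and give all the remaining mass to the goal.
        p_tilde[i, :, g] = 1.0 - p_tilde[i, :, :n + 1].sum(axis=1)
    # Eq. (16): deterministic transitions out of x (to s_0) and of the goal.
    p_tilde[x, :, s0_index] = 1.0
    p_tilde[g, :, g] = 1.0
    return p_tilde
```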
Remark. Note that the structure of the problem does not appear to allow for variance-aware improvements in the analysis of Thm. 1 (specifically, when the analysis applies an SSP simulation lemma argument). Indeed, given the possibly large number of states in the total environment $S$, the computation of the optimistic policies requires the construction of the meta-state $x$ that encapsulates all the states in $\mathcal{S} \setminus \{ \mathcal{K}_k \cup \{ s^\dagger \} \}$, where $s^\dagger$ is the candidate goal state considered at round $k$. As a result, the uncertainty on the transitions reaching $x$ needs to be summed over multiple states, as shown in Eq. 13. This extra uncertainty at a single state in the induced MDP has the effect of canceling out Bernstein techniques seeking to lower the prescribed requirement of state-action samples that the algorithm should collect. In turn, this implies that such variance-aware techniques would not lead to any improvement in the final sample complexity bound.

D.2 High-Probability Event

Lemma 5.
It holds with probability at least $1 - \delta$ that for any time step $t \geq 1$ and for any state-action pair $(s,a)$ and next state $s'$,
$$\big| \widehat{p}_t(s' \mid s,a) - p(s' \mid s,a) \big| \leq 2 \sqrt{\frac{\widehat{\sigma}^2_t(s' \mid s,a)}{N^+_t(s,a)} \log\Big( \frac{S A\, N^+_t(s,a)}{\delta} \Big)} + \frac{6 \log\big( \frac{S A\, N^+_t(s,a)}{\delta} \big)}{N^+_t(s,a)}, \tag{17}$$
where $N^+_t(s,a) := \max\{ 1, N_t(s,a) \}$ and where $\widehat{\sigma}^2_t$ are the empirical variances of the transitions, i.e., $\widehat{\sigma}^2_t(s' \mid s,a) := \widehat{p}_t(s' \mid s,a)(1 - \widehat{p}_t(s' \mid s,a))$.

Proof. The confidence intervals in Eq. 17 are constructed using the empirical Bernstein inequality, which guarantees that the considered event holds with probability at least $1 - \delta$, see e.g., [38].

Define the set of plausible transition probabilities as
$$\mathcal{C}^\dagger_k := \bigcap_{(s,a) \in \mathcal{S}^\dagger_k \times \mathcal{A}} \mathcal{C}^\dagger_k(s,a), \qquad \mathcal{C}^\dagger_k(s,a) := \big\{ \widetilde{p} \in \mathcal{C} \,\big|\, \widetilde{p}(\cdot \mid s^\dagger, a) = \mathbb{1}_{s^\dagger},\ \widetilde{p}(\cdot \mid x, a) = \mathbb{1}_{s_0},\ |\widetilde{p}(s' \mid s,a) - \widehat{p}_k(s' \mid s,a)| \leq \beta_k(s,a,s') \big\},$$
with $\mathcal{C}$ the $S^\dagger_k$-dimensional simplex and $\widehat{p}_k$ the empirical average of transitions.

Lemma 6.
Introduce the event $\Theta := \bigcap_{k=1}^{+\infty} \bigcap_{s^\dagger \in \mathcal{W}_k} \{ p^\dagger_k \in \mathcal{C}^\dagger_k \}$. Then $\mathbb{P}(\Theta) \geq 1 - \delta$.

Proof. We have with probability at least $1 - \delta$ that, for any $y \neq x$, $|p^\dagger_k(y \mid s,a) - \widehat{p}^\dagger_k(y \mid s,a)| \leq \beta_k(s,a,y)$ from the empirical Bernstein inequality (see Eq. 17), and moreover
$$\big| \widehat{p}^\dagger_k(x \mid s,a) - p^\dagger_k(x \mid s,a) \big| = \bigg| 1 - \sum_{y \in \mathcal{K}_k \cup \{s^\dagger\}} p^\dagger_k(y \mid s,a) - \Big( 1 - \sum_{y \in \mathcal{K}_k \cup \{s^\dagger\}} \widehat{p}^\dagger_k(y \mid s,a) \Big) \bigg| \leq \sum_{y \in \mathcal{K}_k \cup \{s^\dagger\}} \big| p^\dagger_k(y \mid s,a) - \widehat{p}^\dagger_k(y \mid s,a) \big| \leq \beta_k(s,a,x).$$

Lemma 7. Under the event $\Theta$, for any round $k$ and any goal state $s^\dagger \in \mathcal{W}_k$, the optimistic model $\widetilde{p}^\dagger_k$ constructed in Def. 10 verifies $\widetilde{p}^\dagger_k \in \mathcal{P}(p^\dagger_k)_{\eta_k}$, with $\eta_k := 4 \max_{(s,a) \in \mathcal{K}_k \times \mathcal{A}} \beta_k(s,a,x)$, where $\beta_k$ is defined in Eq. 13.

Proof. Combining the construction in Def. 10, the proof of Lem. 6 and the triangle inequality yields, for any $(s,a) \in \mathcal{K}_k \times \mathcal{A}$,
$$\sum_{y \in \mathcal{K}_k \cup \{x\}} \big| \widetilde{p}^\dagger_k(y \mid s,a) - p^\dagger_k(y \mid s,a) \big| \leq \sum_{y \in \mathcal{K}_k \cup \{x\}} \Big( \big| \widetilde{p}^\dagger_k(y \mid s,a) - \widehat{p}^\dagger_k(y \mid s,a) \big| + \big| \widehat{p}^\dagger_k(y \mid s,a) - p^\dagger_k(y \mid s,a) \big| \Big) \leq \sum_{y \in \mathcal{K}_k \cup \{x\}} \beta_k(s,a,y) + 2\beta_k(s,a,x) \leq 4\beta_k(s,a,x).$$

Throughout the remainder of the proof, we assume that the event $\Theta$ holds.

D.3 Properties of the Optimistic Policies and Value Vectors
We recall notation. Let us fix any round $k$ and any goal state $s^\dagger \in \mathcal{W}_k$. We denote by $\widetilde{\pi}^\dagger_k$ the greedy policy w.r.t. $\widetilde{u}^\dagger_k(\cdot \to s^\dagger)$ in the optimistic model $\widetilde{p}^\dagger_k$. Let $\widetilde{v}^\dagger_k(s \to s^\dagger)$ be the value function of policy $\widetilde{\pi}^\dagger_k$ starting from state $s$ in the model $\widetilde{p}^\dagger_k$. We can apply Lem. 2 given that the conditions of Asm. 2 hold (indeed, we have $c_{\min} = 1 > 0$ and there exists at least one proper policy to reach the goal state $s^\dagger$ since it belongs to $\mathcal{W}_k$). Moreover, we have that $\widetilde{V}^\star_{\mathcal{K}_k}(s \to s^\dagger) \leq V^\star_{\mathcal{K}_k}(s \to s^\dagger)$ given the way the optimistic model $\widetilde{p}^\dagger_k$ is computed (i.e., by maximizing the probability of transitioning to the goal at any state-action pair), see [28, Lem. B.12]. Hence we get the following two important properties.

Lemma 8. For any round $k$, goal state $s^\dagger \in \mathcal{W}_k$ and state $s \in \mathcal{K}_k \cup \{ x \}$, we have under the event $\Theta$, $\widetilde{u}^\dagger_k(s \to s^\dagger) \leq V^\star_{\mathcal{K}_k}(s \to s^\dagger)$.

Lemma 9. For any round $k$, goal state $s^\dagger \in \mathcal{W}_k$ and state $s \in \mathcal{K}_k \cup \{ x \}$, we have $\widetilde{v}^\dagger_k(s \to s^\dagger) \leq (1 + 2\gamma)\, \widetilde{u}^\dagger_k(s \to s^\dagger)$.

D.4 State Transfer from U to K (step ④)

We fix any round $k$ and any goal state $s^\dagger \in \mathcal{W}_k$ that is added to the set of "controllable" states $\mathcal{K}$, i.e., for which $\widetilde{u}^\dagger_k(s_0 \to s^\dagger) \leq L$.

Lemma 10.
Under the event $\Theta$, we have both of the following inequalities:
$$v^\dagger_k(s_0 \to s^\dagger) \leq L + \varepsilon, \qquad v^\dagger_k(s_0 \to s^\dagger) \leq V^\star_{\mathcal{K}_k}(s_0 \to s^\dagger) + \varepsilon.$$
In particular, the first inequality entails that $s^\dagger \in \mathcal{S}^{\to}_{L+\varepsilon}$, which justifies the validity of the state transfer from $\mathcal{U}$ to $\mathcal{K}$.

Proof. We have
$$\widetilde{v}^\dagger_k(s_0 \to s^\dagger) \overset{(a)}{\leq} (1 + 2\gamma)\, \widetilde{u}^\dagger_k(s_0 \to s^\dagger) \overset{(b)}{\leq} L + \frac{\varepsilon}{2}, \qquad \widetilde{v}^\dagger_k(s_0 \to s^\dagger) \overset{(c)}{\leq} V^\star_{\mathcal{K}_k}(s_0 \to s^\dagger) + \frac{\varepsilon}{2}, \tag{18}$$
where inequality (a) comes from Lem. 9, inequality (b) combines the algorithmic condition $\widetilde{u}^\dagger_k(s_0 \to s^\dagger) \leq L$ and the VI precision level $\gamma := \varepsilon / (4L)$, and inequality (c) combines Lem. 8 and the VI precision level. Moreover, for any state $s \in \mathcal{K}_k$,
$$\widetilde{v}^\dagger_k(s \to s^\dagger) \overset{(a)}{\leq} \widetilde{V}^\star_{\mathcal{K}_k}(s \to s^\dagger) + \varepsilon \overset{(b)}{\leq} \widetilde{V}^\star_{\mathcal{K}_k}(s_0 \to s^\dagger) + 1 + \varepsilon \leq \widetilde{v}^\dagger_k(s_0 \to s^\dagger) + 1 + \varepsilon,$$
where (a) comes from Lem. 8 and the VI precision level, and (b) stems from the presence of the RESET action (Asm. 1).

We now provide the exact choice of allocation function $\phi$ in Alg. 1. We introduce $\gamma_2 := \frac{\varepsilon}{16 (L + 1 + \varepsilon)(L + \varepsilon)}$ (note that $\gamma_2 = O(\varepsilon / L^2)$). We set the following requirement of samples for each state-action pair $(s,a)$ at round $k$ (see the sketch after Lemma 11 below):
$$n_k = \phi(\mathcal{K}_k) = \frac{X_k^2}{\gamma_2^2} \bigg[ \log\bigg( \frac{e\, X_k \sqrt{SA}}{\sqrt{\delta}\, \gamma_2} \bigg) \bigg] + \frac{24\, |\mathcal{S}^\dagger_k|}{\gamma_2} \log\bigg( \frac{|\mathcal{S}^\dagger_k|\, S A}{\delta\, \gamma_2} \bigg), \tag{19}$$
where we define
$$X_k := \max_{(s,a) \in \mathcal{S}^\dagger_k \times \mathcal{A}} \sum_{s' \in \mathcal{S}^\dagger_k} \sqrt{\widehat{\sigma}^2_k(s' \mid s,a)},$$
with $\widehat{\sigma}^2_k(s' \mid s,a) := \widehat{p}^\dagger_k(s' \mid s,a)(1 - \widehat{p}^\dagger_k(s' \mid s,a))$ the estimated variance of the transition from $(s,a)$ to $s'$. Leveraging the empirical Bernstein inequality (Lem. 5) and performing simple algebraic manipulations (see e.g., [39, Lem. 8 and 9]) yields that $\beta_k(s,a,x) \leq \gamma_2$. From Lem. 7, this implies that $\widetilde{p}^\dagger_k \in \mathcal{P}(p^\dagger_k)_{\eta}$ with $\eta := 4\gamma_2$. We can then apply Lem. 3 (whose condition (8) is verified), which gives
$$v^\dagger_k(s_0 \to s^\dagger) \leq \big( 1 + 2\eta\, \| \widetilde{v}^\dagger_k(\cdot \to s^\dagger) \|_\infty \big)\, \widetilde{v}^\dagger_k(s_0 \to s^\dagger) \leq \big( 1 + 2\eta (L + 1 + \varepsilon) \big)\, \widetilde{v}^\dagger_k(s_0 \to s^\dagger) \leq \widetilde{v}^\dagger_k(s_0 \to s^\dagger) + \frac{\varepsilon}{2}, \tag{20}$$
where the last inequality uses that $2 \eta (L + 1 + \varepsilon)(L + \varepsilon) = \varepsilon/2$ by definition of $\gamma_2$. Plugging in Eq. 18 yields the sought-after inequalities.

D.5 Termination of the Algorithm

Lemma 11 (Variant of Lem. 17 of [1]). Suppose that for every state $s \in \mathcal{S}$, each action $a \in \mathcal{A}$ is executed $b \geq \lceil L \log(\frac{ALS}{\delta}) \rceil$ times. Let $\mathcal{S}'_{s,a}$ be the set of all next states visited during the $b$ executions of $(s,a)$. Denote by $\Lambda$ the complement of the event
$$\Big\{ \exists (s', s, a) \in \mathcal{S} \times \mathcal{S} \times \mathcal{A} : p(s' \mid s, a) \geq \frac{1}{L}\ \wedge\ s' \notin \mathcal{S}'_{s,a} \Big\}.$$
Then $\mathbb{P}(\Lambda) \geq 1 - \delta$.
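The sketch below spells out the allocation of Eq. (19). Since several numerical constants in this part are reconstructed, they should be treated as illustrative rather than as the authors' exact implementation.

```python
import numpy as np

def allocation(p_hat_dagger, eps, L, delta, S, A):
    """Per-(s,a) sample requirement n_k = phi(K_k) of Eq. (19) (sketch).

    p_hat_dagger: restricted empirical model over S_k^dagger, shape (n, A, n).
    S, A: sizes of the global state and action spaces (used in the log terms).
    """
    n = p_hat_dagger.shape[0]
    gamma2 = eps / (16.0 * (L + 1 + eps) * (L + eps))   # gamma_2 = O(eps / L^2)
    sigma = np.sqrt(p_hat_dagger * (1.0 - p_hat_dagger))
    X = sigma.sum(axis=2).max()                         # X_k in Eq. (19)
    # max(X, 1) guards the degenerate case X = 0 (e.g., no samples yet).
    term1 = (X / gamma2) ** 2 * np.log(np.e * max(X, 1.0) * np.sqrt(S * A)
                                       / (np.sqrt(delta) * gamma2))
    term2 = 24.0 * n / gamma2 * np.log(n * S * A / (delta * gamma2))
    return int(np.ceil(term1 + term2))
```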
Lemma 12. Under the event $\Theta \cap \Lambda$, for any round $k$, either $\mathcal{S}^{\to}_L \subseteq \mathcal{K}_k$, or there exists a state $s^\dagger \in \mathcal{S}^{\to}_L \setminus \mathcal{K}_k$ such that $s^\dagger \in \mathcal{W}_k$ and is $L$-controllable with a policy restricted to $\mathcal{K}_k$. Moreover, $|\mathcal{W}_k| \leq 2 L A |\mathcal{K}_k|$.

Proof of Lem. 12. Consider a round $k$ such that $\mathcal{S}^{\to}_L \setminus \mathcal{K}_k$ is non-empty. Due to the incremental construction of the set $\mathcal{S}^{\to}_L$ (Def. 4), there exists a state $s^\dagger \in \mathcal{S}^{\to}_L$ and a policy restricted to $\mathcal{K}_k$ that can reach $s^\dagger$ in at most $L$ steps (in expectation). Hence there exists a state-action pair $(s,a) \in \mathcal{K}_k \times \mathcal{A}$ such that $p(s^\dagger \mid s,a) \geq \frac{1}{L}$. Since $\phi(\mathcal{K}_k) \geq \lceil L \log(\frac{ALS}{\delta}) \rceil$ samples are available at each state-action pair, according to Lem. 11, we get that, under the event $\Lambda$, $s^\dagger$ is found during the sample collection procedure for the state-action pair $(s,a)$ (step ①), which implies that $s^\dagger \in \mathcal{U}_k$.

Moreover, the choice of allocation function $\phi$ guarantees in particular that more than $\Omega\big( \frac{L^2}{\varepsilon^2} \log(\frac{LSA}{\delta \varepsilon}) \big)$ samples are available at each state-action pair $(s,a) \in \mathcal{K}_k \times \mathcal{A}$. From the empirical Bernstein inequality of Eq. 17, we thus have that $|p(s^\dagger \mid s,a) - \widehat{p}_k(s^\dagger \mid s,a)| \leq \frac{\varepsilon}{2L}$ under the event $\Theta$. Consequently we have
$$\widehat{p}_k(s^\dagger \mid s,a) \geq \frac{1}{L} - \big| p(s^\dagger \mid s,a) - \widehat{p}_k(s^\dagger \mid s,a) \big| \geq \frac{1 - \varepsilon/2}{L},$$
which implies that $s^\dagger \in \mathcal{W}_k$. Furthermore, we can decompose $\mathcal{W}_k$ in the following way:
$$\mathcal{W}_k = \bigcup_{(s,a) \in \mathcal{K}_k \times \mathcal{A}} \mathcal{Y}_k(s,a), \qquad \text{where } \mathcal{Y}_k(s,a) := \Big\{ s' \in \mathcal{U}_k : \widehat{p}_k(s' \mid s,a) \geq \frac{1 - \varepsilon/2}{L} \Big\}.$$
We then have
$$1 \geq \sum_{s' \in \mathcal{S}} \widehat{p}_k(s' \mid s,a) \geq \sum_{s' \in \mathcal{Y}_k(s,a)} \widehat{p}_k(s' \mid s,a) \geq \frac{1 - \varepsilon/2}{L}\, |\mathcal{Y}_k(s,a)|.$$
We conclude the proof by writing that
$$|\mathcal{W}_k| \leq \sum_{(s,a) \in \mathcal{K}_k \times \mathcal{A}} |\mathcal{Y}_k(s,a)| \leq \frac{L}{1 - \varepsilon/2}\, A\, |\mathcal{K}_k| \leq 2 L A |\mathcal{K}_k|,$$
where the last inequality uses that $\varepsilon \leq 1$ (from line 2 of Alg. 1).

Lemma 13.
Under the event $\Theta \cap \Lambda$, when either condition STOP1 or STOP2 is triggered (at a round indexed by $K$), we have $\mathcal{S}^{\to}_L \subseteq \mathcal{K}_K$.

Proof. If condition STOP1 is triggered, Lem. 12 immediately guarantees that $\mathcal{S}^{\to}_L \subseteq \mathcal{K}_K$ under the event $\Lambda$. If condition STOP2 is triggered, we have for all $s \in \mathcal{W}_K$, $\widetilde{u}_s(s_0 \to s) > L$. From Lem. 8 this means that, under the event $\Theta$, for all $s \in \mathcal{W}_K$, $V^\star_{\mathcal{K}_K}(s_0 \to s) > L$. Hence none of the states in $\mathcal{W}_K$ can be reached in at most $L$ steps (in expectation) with a policy restricted to $\mathcal{K}_K$. We conclude the proof using Lem. 12.

Lemma 14. Under the event $\Theta \cap \Lambda$, when DisCo terminates at round $K$, for any state $s \in \mathcal{K}_K$, the policy $\pi_s$ computed during step ⑤ verifies
$$v^{\pi_s}(s_0 \to s) \leq \min_{\pi \in \Pi(\mathcal{S}^{\to}_L)} v^{\pi}(s_0 \to s) + \varepsilon.$$
Moreover, we have that $\mathcal{S}^{\to}_L \subseteq \mathcal{K}_K \subseteq \mathcal{S}^{\to}_{L+\varepsilon}$.

Proof. Assume that the event $\Theta \cap \Lambda$ holds. Then, when the final set $\mathcal{K}_K$ is considered and the new policies are computed using all the samples, Lem. 10 yields for all $s \in \mathcal{K}_K$,
$$v^{\pi_s}(s_0 \to s) \leq \min_{\pi \in \Pi(\mathcal{K}_K)} v^{\pi}(s_0 \to s) + \varepsilon.$$
Moreover, Lem. 13 entails that $\mathcal{K}_K \supseteq \mathcal{S}^{\to}_L$. This implies from Lem. 1 that
$$\min_{\pi \in \Pi(\mathcal{K}_K)} v^{\pi}(s_0 \to s) \leq \min_{\pi \in \Pi(\mathcal{S}^{\to}_L)} v^{\pi}(s_0 \to s),$$
which gives the first claim. Finally, Lem. 10 also guarantees that every state added to $\mathcal{K}$ satisfies $v^{\pi_s}(s_0 \to s) \leq L + \varepsilon$, which means that $\mathcal{K}_K \subseteq \mathcal{S}^{\to}_{L+\varepsilon}$.

D.6 High Probability Bound on the Sample Collection Phase (step ①)

Denote by $K$ the (random) index of the last round during which the algorithm terminates. We focus on the sample collection procedure for any state $s \in \mathcal{K}_K$. We denote by $k_s$ the index of the round during which $s$ was added to the set of "controllable" states $\mathcal{K}$. To collect samples at state $s$, the learner uses the shortest-path policy $\pi_s$. We say that an attempt to collect a specific sample is a rollout. We denote by $Z_K := |\mathcal{K}_K|\, A\, \phi(\mathcal{K}_K)$ the total number of samples that the learner needs to collect. As such, at most $Z_K$ rollouts must take place. Assume that the event $\Theta$ holds. Then from Lem. 14, we have $\mathcal{K}_K \subseteq \mathcal{S}^{\to}_{L+\varepsilon}$. Hence, denoting $S_{L+\varepsilon} := |\mathcal{S}^{\to}_{L+\varepsilon}|$, we have $Z_K \leq Z_{L+\varepsilon} := S_{L+\varepsilon}\, A\, \phi(\mathcal{S}^{\to}_{L+\varepsilon})$. The following lemma provides a high-probability upper bound on the number of time steps required to meet the sampling requirements.

Lemma 15.
Assume that the event $\Theta$ holds. Set
$$\psi := 4 (L + \varepsilon + 1) \log\Big( \frac{2 Z_{L+\varepsilon}}{\delta} \Big),$$
and introduce the following event:
$$T := \big\{ \exists \text{ one rollout (with goal state } s\text{) s.t. } \tau_{\pi_s}(s_0 \to s) > \psi \big\}.$$
We have $\mathbb{P}(T) \leq \delta$.

Proof. Assume that the event $\Theta$ holds. Leveraging a union bound argument and applying Lem. 16 to the policy $\pi_s$, which verifies $v^{\pi_s}(s' \to s) \leq L + \varepsilon + 1$ for any $s' \in \mathcal{K}_{k_s}$, we get
$$\mathbb{P}(T) \leq \sum_{\text{rollouts}} 2 \exp\Big( - \frac{\psi}{4 (L + \varepsilon + 1)} \Big) \leq 2 Z_{L+\varepsilon} \exp\Big( - \frac{\psi}{4 (L + \varepsilon + 1)} \Big) \leq \delta,$$
where the last inequality comes from the choice of $\psi$.

Lemma 16 ([28], Lem. B.5). Let $\pi$ be a proper policy such that for some $d > 0$, $V^\pi(s) \leq d$ for every non-goal state $s$. Then the probability that the cumulative cost of $\pi$ to reach the goal state from any state $s$ is more than $m$ is at most $2 e^{-m / (4d)}$, for all $m \geq 0$. Note that a cost of at most $m$ implies that the number of steps is at most $m / c_{\min}$.

D.7 Putting Everything Together: Sample Complexity Bound
The sample complexity of the algorithm is solely induced by the sample collection procedure (step ①). Recall that we denote by $K$ the index of the round at which the algorithm terminates. With probability at least $1 - \delta$, Lem. 13 holds, and so does the event $\Theta$. Hence the algorithm discovers a set of states $\mathcal{K}_K \supseteq \mathcal{S}^{\to}_L$. Moreover, from Lem. 14, the algorithm outputs for each $s \in \mathcal{K}_K$ a policy $\pi_s$ with $\mathbb{E}[\tau_{\pi_s}(s_0 \to s)] \leq V^\star_{\mathcal{S}^{\to}_L}(s) + \varepsilon$. Hence we also have $|\mathcal{K}_K| \leq S_{L+\varepsilon} := |\mathcal{S}^{\to}_{L+\varepsilon}|$.

We denote by $Z_K := |\mathcal{K}_K|\, A\, \phi(\mathcal{K}_K)$ the total number of samples that the learner needs to collect. From Lem. 15, with probability at least $1 - \delta$, the total sample complexity of the algorithm is at most $\psi Z_K$, where $\psi := 4 (L + \varepsilon + 1) \log\big( \frac{2 Z_{L+\varepsilon}}{\delta} \big)$.

Now, from Eq. 19 there exists an absolute constant $\alpha > 0$ such that DisCo selects as allocation function $\phi$:
$$\phi : \mathcal{X} \to \alpha \cdot \bigg( \frac{L^4\, \widehat{\Theta}(\mathcal{X})}{\varepsilon^2} \log\Big( \frac{LSA}{\varepsilon \delta} \Big) + \frac{L^2 |\mathcal{X}|}{\varepsilon} \log\Big( \frac{LSA}{\varepsilon \delta} \Big) \bigg), \quad \text{where } \widehat{\Theta}(\mathcal{X}) := \max_{(s,a) \in \mathcal{X} \times \mathcal{A}} \bigg( \sum_{s' \in \mathcal{X}} \sqrt{\widehat{p}(s' \mid s,a)(1 - \widehat{p}(s' \mid s,a))} \bigg)^2.$$
The total requirement is $\phi(\mathcal{K}_K)$. Note that from the Cauchy-Schwarz inequality, we have
$$\widehat{\Theta}(\mathcal{K}_K) \leq \Gamma_K := \max_{(s,a) \in \mathcal{K}_K \times \mathcal{A}} \big\| \{ p(s' \mid s,a) \}_{s' \in \mathcal{K}_K} \big\|_0 \leq |\mathcal{K}_K|.$$
Hence, with probability at least $1 - \delta$,
$$\psi Z_K = \widetilde{O}\bigg( \frac{L^5\, \Gamma_K\, |\mathcal{K}_K|\, A}{\varepsilon^2} + \frac{L^3\, |\mathcal{K}_K|^2\, A}{\varepsilon} \bigg).$$
We finally use that $\mathcal{K}_K \subseteq \mathcal{S}^{\to}_{L+\varepsilon}$ from Lem. 14, which implies that
$$\mathcal{C}_{AX^\star}(\text{DisCo}, L, \varepsilon, \delta) = \widetilde{O}\bigg( \frac{L^5\, \Gamma_{L+\varepsilon}\, S_{L+\varepsilon}\, A}{\varepsilon^2} + \frac{L^3\, S_{L+\varepsilon}^2\, A}{\varepsilon} \bigg),$$
where $\Gamma_{L+\varepsilon} := \max_{(s,a) \in \mathcal{S}^{\to}_{L+\varepsilon} \times \mathcal{A}} \big\| \{ p(s' \mid s,a) \}_{s' \in \mathcal{S}^{\to}_{L+\varepsilon}} \big\|_0$. This concludes the proof of Thm. 1.

D.8 Proof of Corollary 1
The result given in Cor. 1 comes from retracing the analysis of Lem. 14 and therefore Lem. 10 by considering non-uniform costs in $[c_{\min}, 1]$ instead of costs all equal to $1$. Specifically, Eq. 20 needs to account for the inverse dependency on $c_{\min}$ of the simulation lemma (Lem. 3). This induces the final $\varepsilon / c_{\min}$ accuracy level achieved by the policies output by DisCo. There remains to guarantee that condition (8) of Lem. 3 is verified. In particular, the condition holds if $2\eta (L + 1 + \varepsilon) \leq c_{\min}$, where $\eta$ is the model accuracy prescribed in the proof of Lem. 10. We see that this is the case whenever $\varepsilon = O(L\, c_{\min})$, since $\eta (L + 1 + \varepsilon) = O(\varepsilon / L)$.

D.9 Computational Complexity of DisCo
The overall computational complexity of DisCo can be expressed as $\sum_{k=1}^{K} |\mathcal{W}_k| \cdot \mathcal{C}(\text{OVI}_{\text{SSP}})$, where $\mathcal{C}(\text{OVI}_{\text{SSP}})$ denotes the complexity of an OVI$_{\text{SSP}}$ procedure and where we recall that $K$ denotes the (random) index of the last round during which the algorithm terminates. Note that it holds with high probability that $K \leq |\mathcal{S}^{\to}_{L+\varepsilon}|$ and $|\mathcal{W}_k| \leq 2 L A |\mathcal{K}_k| \leq 2 L A |\mathcal{S}^{\to}_{L+\varepsilon}|$. Moreover, $\mathcal{C}(\text{OVI}_{\text{SSP}})$ captures the complexity of the value iteration (VI) algorithm for SSP, which was proved in [34] to converge in time quadratic w.r.t. the size of the considered state space (here, $\mathcal{K}_k$) and w.r.t. $\| V^\star \|_\infty / c_{\min}$. Here we have $c_{\min} = 1$, and we can easily prove that in all the SSP instances considered by DisCo, the optimal value function $V^\star$ verifies $\| V^\star \|_\infty = O(L)$, due to the restriction of the goal state to $\mathcal{W}_k$ (indeed, this restriction implies that there exists a state-action pair in $\mathcal{K}_k \times \mathcal{A}$ that transitions to the goal state with probability $\Omega(1/L)$ in the true MDP). Putting everything together gives DisCo's computational complexity. Interestingly, we notice that while it depends polynomially on $S_{L+\varepsilon}$, $L$ and $A$, it is independent from $S$, the size of the global state space.
E The UcbExplore Algorithm [1]

E.1 Outline of the Algorithm

The UcbExplore algorithm was introduced by Lim and Auer [1] to specifically tackle condition AX$_L$. The algorithm maintains a set $\mathcal{K}$ of "controllable" states and a set $\mathcal{U}$ of "uncontrollable" states. It alternates between two phases of state discovery and policy evaluation. In a state discovery phase, new candidate states are discovered as potential members of the set of controllable states. Any policy evaluation phase is called a round and it relies on an optimistic principle: it attempts to reach an "optimistic" state $s$ (i.e., the easiest state to reach based on the information collected so far) among all the candidate states by executing an optimistic policy $\pi_s$ that minimizes the optimistic expected hitting time truncated at a horizon of $H_{UCB} := \lceil L + L^2 \varepsilon^{-1} \rceil$. Within the round of evaluation of policy $\pi_s$, the algorithm proceeds through at most $\lambda_{UCB} := \lceil L^2 \varepsilon^{-2} \log(|\mathcal{K}| \delta^{-1}) \rceil$ episodes, each of which begins at $s_0$ and ends either when $\pi_s$ successfully reaches $s$ or when $H_{UCB}$ steps have been executed. If the empirical performance of $\pi_s$ is poor (measured through a performance check done after each episode), the round is said to have failed. Otherwise, the round is successful, which means that $s$ is controllable and an acceptable policy ($\pi_s$) has been discovered. A failure round leads to selecting another candidate state-policy pair for evaluation, while a success round leads to a state discovery phase, which in turn adds more candidate states for the subsequent rounds. As explained in App. A, UcbExplore is unable to tackle the more challenging objective AX$^\star$.

E.2 Minor Issue and Fix in the Analysis of UcbExplore
The key insight of UcbExplore is to bound the number of failure rounds of the algorithm, by lower- and upper-bounding the so-called "regret" contribution of failure rounds, where the regret of a failure round $k$ is defined as
$$\sum_{j=1}^{e_k} \Big[ H_{UCB} - L - \sum_{i=0}^{H_{UCB}-1} r_i \Big],$$
where $e_k \leq \lambda_{UCB}$ is the actual number of episodes executed in round $k$ and where the reward $r_i \in \{0, 1\}$ is equal to 1 only if the state is the goal state. However, upper bounding the regret contribution of failure rounds implies applying a concentration inequality to only specific rounds that are chosen given their empirical performance. Hence Lim and Auer [1, Lem. 18] improperly use a martingale argument to bound a sum whose summands are chosen in a non-martingale way, i.e., depending on their realization.

To avoid the aforementioned issue, one must upper and lower bound the cumulative regret of the entire set of rounds, and not only the failure rounds, in order to obtain a bound on the number of failure rounds. However, this would yield a sample complexity whose second term scales as $\widetilde{O}(\varepsilon^{-4})$. Following personal communication with the authors, the fix is to change the definition of the regret of a round, making it equal to
$$\sum_{j=1}^{e_k} \bigg[ \widetilde{u}_{H_{UCB}}(s_0 \to s) - \sum_{i=0}^{H_{UCB}-1} r_i \bigg],$$
where $s$ is the considered goal state and $\widetilde{u}_{H_{UCB}}(s_0 \to s)$ is the optimistic $H_{UCB}$-step reward (where the reward is equal to 1 only at state $s$). With this new definition, it is possible to recover the sample complexity provided in [1], scaling as $\widetilde{O}(\varepsilon^{-3})$.

E.3 Issue with a Possibly Infinite State Space
Lim and Auer [1] claim that their setting can cope with a countable, possibly infinite state space. However, this leads to a technical issue, which has been acknowledged by the authors via personal communication and as of now has not been resolved. Indeed, it occurs when a union bound over the unknown set $\mathcal{U}$ is taken to guarantee high-probability statements (e.g., Lem. 14 or 17 of [1]). Yet for each realization of the algorithm, we do not know what the set $\mathcal{U}$, or equivalently $\mathcal{K}$, looks like, hence it is improper to perform a union bound over a set of unknown identity. Simple workarounds to circumvent this issue are to impose a finite state space, or to assume prior knowledge of a finite superset of $\mathcal{U}$. In this paper we opt for the first option. It remains an open and highly non-trivial question as to how (and whether) the framework can cope with an infinite state space.

E.4 Effective Horizon of the AX Problem and its Dependency on ε

UcbExplore [1] designs finite-horizon problems with horizon $H_{UCB} := \lceil L + L^2 \varepsilon^{-1} \rceil$ and outputs policies that reset every $H_{UCB}$ time steps. In the following we prove that the effective horizon of the AX problem actually scales as $O(L \log(L \varepsilon^{-1}))$, i.e., only logarithmically w.r.t. $\varepsilon^{-1}$. We begin by defining the concept of "resetting" policies as follows.

Definition 11.
For any $\pi \in \Pi$ and horizon $H \geq 1$, we denote by $\pi|_H$ the non-stationary policy that executes the actions prescribed by $\pi$ and performs the RESET action every $H$ steps, i.e.,
$$\pi|_{H,t}(a \mid s) := \begin{cases} \text{RESET} & \text{if } t \equiv 0 \ (\mathrm{mod}\ H), \\ \pi(a \mid s) & \text{otherwise.} \end{cases}$$
We denote by $\Pi|_H$ the set of such "resetting" policies. The following lemma captures the effective horizon $H_{\text{eff}}$ of the problem, in the sense that restricting our attention to $\Pi|_H(\mathcal{S}^{\to}_L)$ for $H \geq H_{\text{eff}}$ does not compromise the possibility of finding policies that achieve the performance required by AX$^\star$ (and thus also by AX$_L$).

Lemma 17. For any $\varepsilon \in (0, 1]$ and $L \geq 1$, whenever $H \geq H_{\text{eff}} := 4(L+1) \big\lceil \log\big( \frac{4(L+1)}{\varepsilon} \big) \big\rceil$, we have for any $s^\dagger \in \mathcal{S}^{\to}_L$,
$$\min_{\pi|_H \in \Pi|_H(\mathcal{S}^{\to}_L)} v^{\pi|_H}(s_0 \to s^\dagger) \leq V^\star_{\mathcal{S}^{\to}_L}(s_0 \to s^\dagger) + \varepsilon.$$
Proof. Consider any goal state $s^\dagger \in \mathcal{S}^{\to}_L$. Set $\varepsilon' := \frac{\varepsilon}{2(L+1)} \leq \frac{1}{2}$. Denote by $\pi \in \Pi(\mathcal{S}^{\to}_L)$ the minimizer of $V^\star_{\mathcal{S}^{\to}_L}(s_0 \to s^\dagger)$. For any horizon $H \geq 1$, we introduce the truncated value function $v^{\pi, H}(s \to s') := \mathbb{E}[\tau_\pi(s \to s') \wedge H]$ and the tail probability $q^{\pi, H}(s \to s') := \mathbb{P}(\tau_\pi(s \to s') > H)$. Due to the presence of the RESET action, the value function of $\pi$ can be bounded for all states $s \in \mathcal{S}^{\to}_L \setminus \{ s^\dagger \}$ as
$$v^{\pi}(s \to s^\dagger) \leq V^\star_{\mathcal{S}^{\to}_L}(s_0 \to s^\dagger) + 1 \leq L + 1.$$
This entails that the probability of the goal-reaching time decays exponentially. More specifically, we have
$$q^{\pi, H}(s_0 \to s^\dagger) \leq 2 \exp\Big( - \frac{H}{4(L+1)} \Big) \leq \varepsilon', \tag{21}$$
where the first inequality stems from Lem. 16 and the second inequality comes from the choice of $H \geq 4(L+1) \big\lceil \log\big( \frac{2}{\varepsilon'} \big) \big\rceil$. Furthermore, we have $\tau_\pi(s \to s') \wedge H \leq \tau_\pi(s \to s')$ and thus $\mathbb{E}[\tau_\pi(s \to s') \wedge H] \leq \mathbb{E}[\tau_\pi(s \to s')]$. Consequently,
$$v^{\pi, H}(s_0 \to s^\dagger) \leq v^{\pi}(s_0 \to s^\dagger) = V^\star_{\mathcal{S}^{\to}_L}(s_0 \to s^\dagger). \tag{22}$$
Now, from [1, Eq. 4], the value function of $\pi|_H$ can be related to its truncated value function and tail probability as follows:
$$v^{\pi|_H} = \frac{v^{\pi, H} + q^{\pi, H}}{1 - q^{\pi, H}}. \tag{23}$$
Plugging Eq. 21 and 22 into Eq. 23 yields
$$v^{\pi|_H}(s_0 \to s^\dagger) \leq \frac{V^\star_{\mathcal{S}^{\to}_L}(s_0 \to s^\dagger) + \varepsilon'}{1 - \varepsilon'}.$$
Notice that the inequalities $\frac{1}{1-x} \leq 1 + 2x$ and $\frac{x}{1-x} \leq 2x$ hold for any $0 < x \leq \frac{1}{2}$. Applying them for $x = \varepsilon'$ yields
$$\frac{V^\star_{\mathcal{S}^{\to}_L}(s_0 \to s^\dagger) + \varepsilon'}{1 - \varepsilon'} \leq (1 + 2\varepsilon')\, V^\star_{\mathcal{S}^{\to}_L}(s_0 \to s^\dagger) + 2\varepsilon'.$$
From the inequality $V^\star_{\mathcal{S}^{\to}_L}(s_0 \to s^\dagger) \leq L$ and the definition of $\varepsilon'$, we finally obtain
$$v^{\pi|_H}(s_0 \to s^\dagger) \leq V^\star_{\mathcal{S}^{\to}_L}(s_0 \to s^\dagger) + \varepsilon,$$
which completes the proof.

Lem. 17 reveals that the effective horizon $H_{\text{eff}}$ of the AX problem scales only logarithmically, and not linearly, in $\varepsilon^{-1}$. This highlights that the design choice in UcbExplore to tackle finite-horizon problems with horizon $H_{UCB}$ unavoidably leads to a suboptimal dependency on $\varepsilon$ in its AX$_L$ sample complexity bound. In contrast, by designing SSP problems and thus leveraging the intrinsic goal-oriented nature of the problem, DisCo can (implicitly) capture the effective horizon of the problem. This observation is at the heart of the improvement in the $\varepsilon$ dependency from $\widetilde{O}(\varepsilon^{-3})$ of UcbExplore [1] to $\widetilde{O}(\varepsilon^{-2})$ of DisCo (Thm. 1).
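As an illustration of Def. 11, the following Monte-Carlo sketch estimates the expected hitting time of a resetting policy $\pi|_H$ by executing $\pi$ and playing RESET every $H$ steps. The simulator interface (`env_reset`, `env_step`) is hypothetical, and the estimate is only meaningful when $\pi|_H$ reaches the goal with probability one.

```python
import numpy as np

def hitting_time_with_resets(env_reset, env_step, pi, goal, H, n_rollouts=1000):
    """Estimate E[tau_{pi|H}(s_0 -> goal)] for the resetting policy of Def. 11.

    env_reset() -> s_0 and env_step(s, a) -> s' are assumed-to-exist helpers
    for the simulator; pi maps a state to an action.
    """
    times = []
    for _ in range(n_rollouts):
        s, t = env_reset(), 0
        while s != goal:
            t += 1
            # Every H steps the resetting policy plays RESET, i.e., returns to s_0.
            s = env_reset() if t % H == 0 else env_step(s, pi(s))
        times.append(t)
    return float(np.mean(times))
```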
F Experiments
This section complements the experimental findings partially reported in Sect. 5. We provide details about the algorithmic configurations and the environments, as well as additional experiments.
F.1 Algorithmic Configurations

Experimental improvements to UcbExplore [1]. We introduce several modifications to UcbExplore in order to boost its practical performance. We remove all the constants and logarithmic terms from the requirements for state discovery and policy evaluation (refer to [1, Fig. 1]). Furthermore, we remove the constants in the definition of the accuracy $\varepsilon' = \varepsilon / L$ used by UcbExplore (while their original algorithm requires $\varepsilon'$ to be divided by a further constant, we remove it). We also significantly improve the planning phase of UcbExplore [1, Fig. 2]. Their procedure requires to divide the samples into $H := (1 + 1/\varepsilon') L$ disjoint sets to estimate the transition probabilities of each stage $h$ of the finite-horizon MDP. This substantially reduces the accuracy of the estimated transition probabilities, since for each stage $h$ only $N_k(s,a)/H$ samples are used. In our experiments, we use all the samples to estimate a stationary MDP (i.e., $\widehat{p}_k(s' \mid s,a) = N_k(s,a,s') / N_k(s,a)$) rather than a stage-dependent model. Estimating a stationary model instead of bucketing the data is simpler and more efficient, since it leads to a higher accuracy of the estimated model. To avoid moving too far away from the original UcbExplore, we define the confidence intervals as if bucketing was used: we thus consider $\overline{N}_k(s,a) = N_k(s,a)/H$ for the construction of the confidence intervals. For planning, we use the optimistic backward induction procedure as in [30]. We thus leverage empirical Bernstein inequalities — which are much tighter — rather than Hoeffding inequalities as suggested in [1]. In particular, we further approximate the bonus suggested in [30, Alg. 4] as
$$b_h(s,a) = \sqrt{\frac{\mathrm{Var}_{s' \sim \widehat{p}_k(\cdot \mid s,a)}\big[ V_{k, h+1}(s') \big]}{\overline{N}_k(s,a) \vee 1}} + \frac{H - h}{\overline{N}_k(s,a) \vee 1}.$$

For DisCo, we follow the same approach of removing constants and logarithmic terms. We thus use the definition of $\phi$ as in Thm. 1 with $\alpha = 1$ and without log-terms. For planning, we use the procedure described in App. D with
$$b_k(s,a,s') = \sqrt{\frac{\widehat{p}_k(s' \mid s,a)(1 - \widehat{p}_k(s' \mid s,a))}{N_k(s,a) \vee 1}} + \frac{1}{N_k(s,a) \vee 1}.$$
Finally, in the experiments we use a state-action dependent value $\widehat{\Theta}(s,a,\mathcal{K}_k) = \big( \sum_{s' \in \mathcal{K}_k} \sqrt{\widehat{p}_k(s' \mid s,a)(1 - \widehat{p}_k(s' \mid s,a))} \big)^2$ instead of taking the maximum over $(s,a)$. Even though we boosted the practical performance of UcbExplore w.r.t. the original algorithm proposed in [1] (e.g., through the use of Bernstein inequalities), we believe this makes the comparison between DisCo and UcbExplore as fair as possible.
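For reference, the two bonuses described above can be written as follows (a sketch; the function and variable names are ours, and `V_next` stands for the optimistic values at stage $h+1$ of the finite-horizon planner).

```python
import numpy as np

def bonus_ucbexplore(p_hat_sa, V_next, N_bar_sa, H, h):
    """Finite-horizon Bernstein bonus b_h(s, a) used for UcbExplore-Bernstein."""
    n = max(N_bar_sa, 1)
    var = np.dot(p_hat_sa, V_next ** 2) - np.dot(p_hat_sa, V_next) ** 2
    return np.sqrt(var / n) + (H - h) / n

def bonus_disco(p_hat_sas, N_sa):
    """Transition bonus b_k(s, a, s') used for DisCo in the experiments."""
    n = max(N_sa, 1)
    return np.sqrt(p_hat_sas * (1.0 - p_hat_sas) / n) + 1.0 / n
```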
F.2 Confusing Chain
The confusing chain environment referred to in Sect. 5 is constructed as follows. It is an MDP composed of an initial state $s_0$, a chain of length $C$ (whose states are denoted by $s_1, \ldots, s_C$) and a set of $K$ confusing states ($s_{C+1}, \ldots, s_{C+K}$). Two actions are available in each state. In state $s_0$, we have a forward action $a_1$ that moves to the chain with probability $p_c$ (i.e., $p(s_1 \mid s_0, a_1) = p_c$ and $p(s_0 \mid s_0, a_1) = 1 - p_c$) and a confusing action $a_2$ that has uniform probability of reaching any confusing state ($p(s_i \mid s_0, a_2) = 1/K$ for any $i \in \{C+1, \ldots, C+K\}$). In the confusing states, all actions move deterministically to the end of the chain ($p(s_C \mid s_i, a) = 1$ for any $i \in \{C+1, \ldots, C+K\}$ and any $a$). In each state of the chain, there is a forward action $a_1$ that behaves as in $s_0$ ($p(s_{\min(C, i+1)} \mid s_i, a_1) = p_c$ and $p(s_i \mid s_i, a_1) = 1 - p_c$, for any $i \in \{1, \ldots, C-1\}$) and a skip action $a_2$ that moves $m$ states ahead with probability $p_{\text{skip}}$ ($p(s_{\min(C, i+m)} \mid s_i, a_2) = p_{\text{skip}}$ and $p(s_i \mid s_i, a_2) = 1 - p_{\text{skip}}$, for any $i \in \{1, \ldots, C-1\}$). Finally, $p(s_0 \mid s_C, a) = 1$ for any action $a$. In our experiments, we set $m = 4$, $p_{\text{skip}} = 1/2$, $p_c = 1$, $C = 5$, $K = 6$ and $L = 4.5$.
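A sketch of the corresponding transition tensor is given below; the parameter defaults mirror the values reported above and are illustrative wherever the extraction leaves them ambiguous.

```python
import numpy as np

def confusing_chain(C=5, K=6, p_c=1.0, p_skip=0.5, m=4):
    """Transition tensor of the confusing chain (sketch).

    States: s_0, chain s_1..s_C, confusing states s_{C+1}..s_{C+K}.
    Actions: 0 = forward a_1, 1 = confusing/skip a_2.
    """
    S = 1 + C + K
    p = np.zeros((S, 2, S))
    # s_0: forward enters the chain, the other action scatters to confusing states.
    p[0, 0, 1], p[0, 0, 0] = p_c, 1.0 - p_c
    p[0, 1, C + 1: C + K + 1] = 1.0 / K
    # Chain states s_1..s_{C-1}: forward and skip actions.
    for i in range(1, C):
        p[i, 0, min(C, i + 1)], p[i, 0, i] = p_c, 1.0 - p_c
        p[i, 1, min(C, i + m)], p[i, 1, i] = p_skip, 1.0 - p_skip
    # Confusing states: every action moves deterministically to the end of the chain.
    for i in range(C + 1, C + K + 1):
        p[i, :, C] = 1.0
    # End of the chain: every action goes back to s_0.
    p[C, :, 0] = 1.0
    return p
```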
Table 2: Sample complexity of DisCo and UcbExplore-Bernstein on the confusing chain domain, for different values of $\varepsilon$. Values are averaged over multiple runs and the confidence interval of the mean is reported in parentheses.

Table 3: Expected hitting time $v^{\pi_{s_i}}(s_0 \to s_i)$ of state $s_i$ for the goal-oriented policy $\pi_{s_i}$ recovered by UcbExplore-Bernstein on the confusing chain domain, for different values of $\varepsilon$ and for the states $s_1, \ldots, s_6$.
DisCo recovers the optimal goal-oriented policy in all the runs and for all $\varepsilon$; the advantage of DisCo lies in its final policy consolidation step. Values are averaged over multiple runs and the confidence interval of the mean is reported in parentheses (it is omitted when equal to 0). The table shows that UcbExplore recovers the optimal goal-oriented policy in every run only for two of the considered values of $\varepsilon$.

Sample complexity.
We provide in Tab. 2 the sample complexity of the algorithms for varying values of $\varepsilon$. As mentioned in Sect. 5, DisCo outperforms UcbExplore for any value of $\varepsilon$, and increasingly so when $\varepsilon$ decreases. Fig. 7 complements Fig. 2 for additional values of $\varepsilon$.

Quality of goal-reaching policies.
We now investigate the quality of the policies recovered by DisCo and UcbExplore. In particular, we show that DisCo is able to find the incrementally near-optimal shortest-path policies to any goal state, while UcbExplore may only recover sub-optimal policies. On the confusing chain domain, the intuition is that the set of confusing states makes $s_C$ reachable in just two steps, but the confusing states are not in the controllable set and thus the algorithms are not able to recover the shortest-path policy to $s_C$ through them. On the other hand, state $s_C$ is controllable through two policies: 1) the policy $\pi_1$ that always takes the forward action $a_1$ and traverses the whole chain to reach $s_C$; 2) the policy $\pi_2$ that takes the skip action $a_2$ in $s_1$. We observed empirically that DisCo always recovers the faster of the two policies, while UcbExplore selects the slower one in several cases. This is highlighted in Tab. 3, where we report the expected hitting times of the policies recovered by the algorithms. This finding is not surprising since, as we explain in Sect. 4 and App. A, UcbExplore is designed to find policies reaching states in at most $L$ steps on average, yet it is not able to recover incrementally near-optimal shortest-path policies, as opposed to DisCo.

F.3 Combination Lock
We consider the combination lock problem introduced in [31]. The domain is a stochastic chain with $S = 6$ states and $A = 2$ actions. In each state $x_k$, action right ($a_1$) is deterministic and leads to state $x_{k+1}$, while action left ($a_2$) moves to a state $x_l$ with $l < k$ with probability inversely proportional to the distance between the states. Formally, we have that
$$n(x_k, x_l) = \begin{cases} \frac{1}{k - l} & \text{if } l < k, \\ 0 & \text{otherwise}, \end{cases} \qquad p(x_l \mid x_k, a_2) = \frac{n(x_k, x_l)}{\sum_{x} n(x_k, x)}.$$
We set the initial state $s_0$ to be located part-way along the chain. The end states of the chain are absorbing for the actions pointing outside the chain (i.e., $p(x_0 \mid x_0, a_2) = 1$ and $p(x_{S-1} \mid x_{S-1}, a_1) = 1$), while the remaining actions behave normally. See Fig. 5 for an illustration of the domain.
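A sketch of the corresponding transition tensor (with the chain states indexed from left to right) is given below.

```python
import numpy as np

def combination_lock(S=6):
    """Transition tensor of the combination lock domain [31] (sketch).

    Action 0 (right) is deterministic; action 1 (left) jumps back to x_l
    with probability proportional to 1 / (k - l).  The two end states are
    absorbing for the corresponding actions, as described above.
    """
    p = np.zeros((S, 2, S))
    for k in range(S):
        # right: deterministic step towards the end of the chain
        p[k, 0, min(S - 1, k + 1)] = 1.0
        if k == 0:
            p[k, 1, 0] = 1.0          # left is absorbing in the left-most state
        else:
            # left: distance-weighted jump towards the beginning of the chain
            w = np.array([1.0 / (k - l) for l in range(k)])
            p[k, 1, :k] = w / w.sum()
    return p
```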
Figure 5: Combination lock domain with $S = 6$ states. The figure reports the expected hitting times $v^{\pi}(s_0 \to x_i)$ from the initial state. For $L = 3$, the set of incrementally $L$-controllable states $\mathcal{S}^{\to}_L$ contains four of the six states; the goal-oriented policies for the states to the right of $s_0$ always take the right action $a_1$, while the policy for the remaining controllable state always selects the left action $a_2$.
Figure 6: Proportion of the incrementally $L$-controllable states identified by DisCo and UcbExplore in the combination lock domain for $L = 2.5$ and $\varepsilon = 0.2$. Values are averaged over multiple runs.

Sample complexity.
We evaluate the two algorithms DisCo and UcbExplore on the combination lock domain, for $\varepsilon = 0.2$ and $L = 2.5$. We further boost the empirical performance of UcbExplore by using $N_k$ instead of $\overline{N}_k$ for the construction of the confidence intervals (i.e., we do not account for the data bucketing in [1], see App. F.1). To preserve the robustness of the algorithm, we use $\log(|\mathcal{K}_k|) / (\varepsilon')^2$ episodes for UcbExplore's policy evaluation phase (indeed we noticed that the removal of the logarithmic term here sometimes leads UcbExplore to miss some states in $\mathcal{S}^{\to}_L$ in this domain). For the same reason, in DisCo we use the value $\widehat{\Theta}(\mathcal{K}_k) = \max_{s,a} \widehat{\Theta}(s,a,\mathcal{K}_k)$ prescribed by the theoretical algorithm instead of the state-action dependent values used in the previous experiment. We average the experiments over multiple runs and obtain a smaller sample complexity for DisCo than for UcbExplore. Fig. 6 reports the proportion of incrementally $L$-controllable states identified by the algorithms as a function of time. We notice that once again DisCo clearly outperforms UcbExplore.
Figure 7: Proportion of the incrementally $L$-controllable states identified by DisCo and UcbExplore in the confusing chain domain, for $\varepsilon \in \{0.1, 0.2, 0.4, 0.6, 0.8\}$.