Near-optimal Regret Bounds for Stochastic Shortest Path


arXiv preprint, cs.LG

Alon Cohen∗, Haim Kaplan†, Yishay Mansour‡, Aviv Rosenberg§

February 25, 2020

Abstract

Stochastic shortest path (SSP) is a well-known problem in planning and control, in which an agent has to reach a goal state in minimum total expected cost. In the learning formulation of the problem, the agent is unaware of the environment dynamics (i.e., the transition function) and has to repeatedly play for a given number of episodes while learning the problem's optimal solution. Unlike other well-studied models in reinforcement learning (RL), the length of an episode is not predetermined (or bounded) and is influenced by the agent's actions. Recently, Tarbouriech et al. (2019) studied this problem in the context of regret minimization and provided an algorithm whose regret bound is inversely proportional to the square root of the minimum instantaneous cost. In this work we remove this dependence on the minimum cost: we give an algorithm that guarantees a regret bound of $\widetilde{O}(B_\star |S| \sqrt{|A| K})$, where $B_\star$ is an upper bound on the expected cost of the optimal policy, $S$ is the set of states, $A$ is the set of actions and $K$ is the number of episodes. We additionally show that any learning algorithm must have at least $\Omega(B_\star \sqrt{|S||A|K})$ regret in the worst case.

Stochastic shortest path (SSP) is one of the most basic models in reinforcement learning (RL). It includes the discounted return model and the finite-horizon model as special cases. In SSP the goal of the agent is to reach a predefined goal state in minimum expected cost. This setting captures a wide variety of realistic scenarios, such as car navigation, game playing and drone flying; i.e., tasks carried out in episodes that eventually terminate.

The focus of this work is on regret minimization in SSP. It builds on the extensive literature on theoretical aspects of online RL, and in particular on the many works about regret minimization in either the average-cost model or the finite-horizon model. A major contribution to this literature is the UCRL2 algorithm (Jaksch et al., 2010), which gives a general framework for achieving optimism in the face of uncertainty in these settings. The main methodology is to define a confidence set that includes the true model parameters with high probability. The algorithm periodically computes an optimistic policy that minimizes the overall expected cost simultaneously over all policies and over all parameters within the confidence set, and proceeds to play this policy.

The only regret minimization algorithm specifically designed for SSP is that of Tarbouriech et al. (2019), which assumes that all costs are bounded away from zero (i.e., there is a $c_{\min} > 0$ such that all costs are in $[c_{\min}, 1]$). They show a regret bound that scales as $O(D^{3/2} |S| \sqrt{|A| K / c_{\min}})$, where $D$ is the minimum expected time of reaching the goal state from any state, $S$ is the set of states, $A$ is the set of actions and $K$ is the number of episodes. In addition, they show that the algorithm's regret is $\widetilde{O}(K^{2/3})$ when the costs are arbitrary (namely, may be zero).

∗ Google Research, Tel Aviv.
† Tel-Aviv University and Google Research, Tel Aviv.
‡ Tel-Aviv University and Google Research, Tel Aviv. Supported in part by a grant from the ISF.
§ Tel-Aviv University.

First, we remove the dependence on $c_{\min}^{-1}$ and allow for zero costs while maintaining regret of $\widetilde{O}(\sqrt{K})$. Second, we give a much simpler algorithm in which the computation of the optimistic policy has a simple solution. Our main regret term is $\widetilde{O}(B_\star |S| \sqrt{|A| K})$, where $B_\star$ is an upper bound on the expected cost of the optimal policy (note that $B_\star \le D$). We show that this is almost optimal by giving a lower bound of $\Omega(B_\star \sqrt{|S||A|K})$.

Our technical contribution is as follows. We start by assuming that the costs are lower bounded by $c_{\min}$ and give an algorithm that is simple to analyze and achieves a regret bound of $\widetilde{O}(B_\star^{3/2} |S| \sqrt{|A| K / c_{\min}})$. Note that this bound is comparable to the one of Tarbouriech et al. (2019), yet our algorithm and its analysis are significantly simpler and more intuitive. We subsequently improve our algorithm by utilizing better confidence sets based on the Bernstein concentration inequality (Azar et al., 2017). This algorithm is even simpler than our first one, mainly because picking the parameters of the optimistic model is particularly easy. The analysis, however, is somewhat more delicate. We achieve our final bound by perturbing the instantaneous costs to be at least $\epsilon > 0$. The additional cost due to this perturbation has a small effect, since the dependency of our regret on $c_{\min}^{-1}$ is additive and does not multiply any term depending on $K$.

Early work by Bertsekas and Tsitsiklis (1991) studied the problem of planning in SSPs, that is, computing the optimal strategy efficiently in a known SSP instance. They established that, under certain assumptions, the optimal strategy is a deterministic stationary policy (a mapping from states to actions) and can be computed efficiently using standard planning algorithms, e.g., Value Iteration or Policy Iteration.

The extensive literature about regret minimization in RL focuses on the average-cost infinite-horizon model (Bartlett and Tewari, 2009; Jaksch et al., 2010) and on the finite-horizon model (Osband et al., 2016; Azar et al., 2017; Dann et al., 2017; Zanette and Brunskill, 2019). These recent works give algorithms with near-optimal regret bounds using Bernstein-type concentration bounds.

Another related model is that of loop-free SSP with adversarial costs (Neu et al., 2010, 2012; Zimin and Neu, 2013; Rosenberg and Mansour, 2019a,b). This model eliminates the challenge of avoiding policies that never terminate, but the adversarial costs pose a different, unrelated, challenge.

An instance of the SSP problem is a Markov decision process (MDP) $M = (S, A, P, c, s_{\text{init}})$, where $S$ is the state space and $A$ is the action space. The agent begins at the initial state $s_{\text{init}}$ and ends her interaction with $M$ by arriving at the goal state $g$ (where $g \notin S$). Whenever she plays action $a$ in state $s$, she pays a cost $c(s,a) \in [0,1]$ and the next state $s' \in S$ is chosen with probability $P(s' \mid s,a)$. To simplify the presentation we avoid addressing the goal state $g$ explicitly: we assume that the probability of reaching the goal state by playing action $a$ at state $s$ is $1 - \sum_{s' \in S} P(s' \mid s,a)$.

We now review planning in a known SSP instance. Under certain assumptions that we shall briefly discuss, the optimal behaviour of the agent, i.e., the policy that minimizes the expected total cost of reaching the goal state from any state, is a stationary, deterministic and proper policy. A stationary and deterministic policy $\pi : S \to A$ is a mapping that selects action $\pi(s)$ whenever the agent is at state $s$. A proper policy is defined as follows.

Definition 1 (Proper and Improper Policies). A policy $\pi$ is proper if playing $\pi$ reaches the goal state with probability 1 when starting from any state. A policy is improper if it is not proper.

Any policy $\pi$ induces a cost-to-go function $J^\pi : S \to [0, \infty]$ defined as

$$J^\pi(s) = \lim_{T \to \infty} \mathbb{E}_\pi\Big[\sum_{t=1}^{T} c(s_t, a_t) \,\Big|\, s_1 = s\Big],$$

where the expectation is taken w.r.t. the random sequence of states generated by playing according to $\pi$ when the initial state is $s$. For a proper policy $\pi$, since the number of states $|S|$ is finite, it follows that $J^\pi(s)$ is finite for all $s \in S$. However, note that $J^\pi(s)$ may be finite even if $\pi$ is improper. We additionally denote by $T^\pi(s)$ the expected time it takes for $\pi$ to reach $g$ starting at $s$; in particular, if $\pi$ is proper then $T^\pi(s)$ is finite for all $s$, and if $\pi$ is improper there must exist some $s$ such that $T^\pi(s) = \infty$.
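To make the definitions of $J^\pi$ and $T^\pi$ concrete, the following is a minimal Monte Carlo sketch in plain Python. It is an illustration only, not part of the paper's algorithms: the two-state instance, the function names `rollout` and `estimate_cost_to_go`, and the `max_steps` safety cap are assumptions made for this example. The goal state is kept implicit, as in the text: the probability mass missing from each row of the transition function is the probability of reaching the goal.

```python
import random

GOAL = None  # stand-in for the goal state g, which is not a member of S


def rollout(P, c, policy, s, rng, max_steps=100_000):
    """Play `policy` from state s; return (total cost, number of steps).

    P[s][a] maps next states in S to probabilities. The missing mass
    1 - sum(P[s][a].values()) is the probability of moving to the goal.
    max_steps is a safety cap (an assumption of this sketch) so that an
    improper policy cannot loop forever.
    """
    total_cost, steps = 0.0, 0
    while s is not GOAL and steps < max_steps:
        a = policy[s]
        total_cost += c[s][a]
        steps += 1
        # Sample the next state; falling through the loop means "goal".
        u, acc, nxt = rng.random(), 0.0, GOAL
        for s_next, p in P[s][a].items():
            acc += p
            if u < acc:
                nxt = s_next
                break
        s = nxt
    return total_cost, steps


def estimate_cost_to_go(P, c, policy, s, n_episodes=20_000, seed=0):
    """Monte Carlo estimates of J^pi(s) and T^pi(s) for a proper policy."""
    rng = random.Random(seed)
    cost_sum, time_sum = 0.0, 0
    for _ in range(n_episodes):
        cost, steps = rollout(P, c, policy, s, rng)
        cost_sum += cost
        time_sum += steps
    return cost_sum / n_episodes, time_sum / n_episodes


# Hypothetical two-state instance: state 0 always moves to state 1 (cost 0.5);
# state 1 stays put w.p. 0.5 and reaches the goal w.p. 0.5 (cost 1).
P = {0: {"go": {1: 1.0}}, 1: {"go": {1: 0.5}}}
c = {0: {"go": 0.5}, 1: {"go": 1.0}}
policy = {0: "go", 1: "go"}

J_hat, T_hat = estimate_cost_to_go(P, c, policy, s=0)
print(J_hat, T_hat)  # the exact values for this instance are J = 2.5 and T = 3
```

For this instance the exact values follow by hand: from state 1 the number of steps to the goal is geometric with mean 2, so $J^\pi(1) = T^\pi(1) = 2$, $J^\pi(0) = 0.5 + 2 = 2.5$ and $T^\pi(0) = 3$. The estimates should be close to these values.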
In this work we assume the following about the SSP model.

Assumption 1. There exists at least one proper policy.

With Assumption 1, we have the following important properties of proper policies. In particular, the first result shows that a policy is proper if and only if its cost-to-go function satisfies the Bellman equations. The second result proves that a policy is optimal if and only if it satisfies the Bellman optimality criterion. Note that both results assume that every improper policy has infinite cost-to-go from at least one state.

Lemma 2.1 (Bertsekas and Tsitsiklis, 1991, Lemma 1). Suppose that Assumption 1 holds and that for every improper policy $\pi'$ there exists at least one state $s \in S$ such that $J^{\pi'}(s) = \infty$. Let $\pi$ be any policy. Then:

(i) If there exists some $J : S \to \mathbb{R}$ such that $J(s) \ge c(s, \pi(s)) + \sum_{s' \in S} P(s' \mid s, \pi(s))\, J(s')$ for all $s \in S$, then $\pi$ is proper. Moreover, it holds that $J^\pi(s) \le J(s)$ for all $s \in S$.

(ii) If $\pi$ is proper, then $J^\pi$ is the unique solution to the equations $J^\pi(s) = c(s, \pi(s)) + \sum_{s' \in S} P(s' \mid s, \pi(s))\, J^\pi(s')$ for all $s \in S$.
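Part (ii) of Lemma 2.1 characterizes $J^\pi$ as the solution of a linear system. As a minimal sketch, assuming a small instance with a proper policy, one simple way to approximate that solution is to iterate the map $J \mapsto c_\pi + P_\pi J$ until it stops changing; a direct linear solve would work as well. The helper names `evaluate_policy` and `bellman_residual` and the two-state instance are hypothetical and chosen only for illustration.

```python
def evaluate_policy(P_pi, c_pi, tol=1e-12, max_iter=100_000):
    """Approximate J^pi by iterating J <- c_pi + P_pi J, starting from J = 0.

    P_pi[s][s2] = P(s2 | s, pi(s)) over non-goal states only, so a row may
    sum to less than 1; the missing mass is the probability of reaching the
    goal. c_pi[s] = c(s, pi(s)).
    """
    n = len(c_pi)
    J = [0.0] * n
    for _ in range(max_iter):
        J_new = [c_pi[s] + sum(P_pi[s][s2] * J[s2] for s2 in range(n))
                 for s in range(n)]
        if max(abs(x - y) for x, y in zip(J_new, J)) < tol:
            return J_new
        J = J_new
    raise RuntimeError("no convergence; the policy may be improper")


def bellman_residual(P_pi, c_pi, J):
    """Largest violation of J(s) = c_pi(s) + sum_s2 P_pi(s2 | s) J(s2)."""
    n = len(c_pi)
    return max(abs(J[s] - c_pi[s] - sum(P_pi[s][s2] * J[s2] for s2 in range(n)))
               for s in range(n))


# Hypothetical instance: state 0 moves to state 1 (cost 0.5); state 1 stays
# put w.p. 0.5 and reaches the goal w.p. 0.5 (cost 1).
P_pi = [[0.0, 1.0],
        [0.0, 0.5]]
c_pi = [0.5, 1.0]

J = evaluate_policy(P_pi, c_pi)          # cost-to-go J^pi
T = evaluate_policy(P_pi, [1.0, 1.0])    # unit costs give the expected time T^pi
print(J, T, bellman_residual(P_pi, c_pi, J))
```

Replacing the costs by 1 everywhere turns the same computation into the expected hitting time $T^\pi$, which is how the sketch obtains both quantities from one routine.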

Lemma 2.2 (Bertsekas and Tsitsiklis, 1991, Proposition 2). Under the conditions of Lemma 2.1 the optimal policy $\pi^\star$ is stationary, deterministic, and proper. Moreover, a policy $\pi$ is optimal if and only if it satisfies the Bellman optimality equations for all $s \in S$:

$$J^\pi(s) = \min_{a \in A}\; c(s,a) + \sum_{s' \in S} P(s' \mid s,a)\, J^\pi(s'), \tag{1}$$

$$\pi(s) \in \arg\min_{a \in A}\; c(s,a) + \sum_{s' \in S} P(s' \mid s,a)\, J^\pi(s').$$

In this work we are not interested in approximating the optimal policy overall, but rather the best proper policy. In this case the second requirement in the lemmas above, that for every improper policy $\pi$ there exists some state $s \in S$ such that $J^\pi(s) = \infty$, can be circumvented in the following way (Bertsekas and Yu, 2013). First, note that this requirement is trivially satisfied when all instantaneous costs are strictly positive. Then, one can perturb the instantaneous costs by adding a small positive cost $\epsilon \in [0,1]$, i.e., the new cost function is $c_\epsilon(s,a) = \max\{c(s,a), \epsilon\}$. After this perturbation, all proper policies remain proper, and every improper policy has infinite cost-to-go from some state (as all costs are positive). In the modified MDP, we apply Lemma 2.2 and obtain an optimal policy $\pi^\star_\epsilon$ that is stationary, deterministic and proper and has a cost-to-go function $J^\star_\epsilon$. Taking the limit as $\epsilon \to 0$, we have that $\pi^\star_\epsilon \to \pi^\star$ and $J^\star_\epsilon \to J^{\pi^\star}$, where $\pi^\star$ is the optimal proper policy in the original model that is also stationary and deterministic, and $J^{\pi^\star}$ denotes its cost-to-go function. We use this observation to obtain Corollaries 2.5 and 2.6 below, which only require Assumption 1 to hold.

Learning formulation. We assume that the costs are deterministic and known to the learner, and the transition probabilities $P$ are fixed but unknown to the learner. The learner interacts with the model in episodes: each episode starts at the initial state $s_{\text{init}}$ and ends when the learner reaches the goal state $g$ (note that she might never reach the goal state). Success is measured by the learner's regret over $K$ such episodes, that is, the difference between her total cost over the $K$ episodes and the total expected cost of the optimal proper policy:

$$R_K = \sum_{k=1}^{K} \sum_{i=1}^{I^k} c(s_i^k, a_i^k) - K \cdot \min_{\pi \in \Pi_{\text{proper}}} J^\pi(s_{\text{init}}),$$

where $I^k$ is the time it takes the learner to complete episode $k$ (which may be infinite), $\Pi_{\text{proper}}$ is the set of all stationary, deterministic and proper policies (which is not empty by Assumption 1), and $(s_i^k, a_i^k)$ is the $i$-th state-action pair at episode $k$. In the case that $I^k$ is infinite for some $k$, we define $R_K = \infty$.

We denote the optimal proper policy by $\pi^\star$, i.e., $J^{\pi^\star}(s) = \min_{\pi \in \Pi_{\text{proper}}} J^\pi(s)$ for all $s \in S$. Moreover, let $B_\star > 0$ be an upper bound on the values of $J^{\pi^\star}$ and let $T_\star > 0$ be an upper bound on the times $T^{\pi^\star}$, i.e., $B_\star \ge \max_{s \in S} J^{\pi^\star}(s)$ and $T_\star \ge \max_{s \in S} T^{\pi^\star}(s)$.

2.1 Summary of our results

In Section 3 we present our Hoeffding-based algorithms (Algorithms 1 and 3) and their analysis. In Section 4 we show our Bernstein-based algorithm (Algorithm 2), for which we prove improved regret bounds. In addition, we give a lower bound on the learner's regret showing that Algorithm 2 is near-optimal (see Appendix C).

The learner must reach the goal state, otherwise she has infinite regret. Therefore, she has to trade off two objectives: one is to reach the goal state and the other is to minimize the cost. Under the following assumption, the two objectives essentially coincide.

Assumption 2. All costs are positive, i.e., there exists $c_{\min} > 0$ such that $c(s,a) \ge c_{\min}$ for every $(s,a) \in S \times A$.

This assumption allows us to upper bound the running time of the algorithm by its total cost up to a factor of $c_{\min}^{-1}$. In particular, it guarantees that any policy that does not reach the goal state has infinite cost, so any bounded-regret algorithm has to reach the goal state. We eventually relax Assumption 2 by a technique similar to that of Bertsekas and Yu (2013). We add a small positive perturbation to the instantaneous costs and run our algorithms on the model with the perturbed costs. This provides a regret bound that scales with the expected running time of the optimal policy.

We now summarize our results. For ease of comparison, we first present our regret bounds for both the Hoeffding-based and Bernstein-based algorithms when Assumption 2 holds, and subsequently show the regret bounds of both algorithms for the general case. In order to simplify the presentation of our results, we assume that $|S| \ge 2$, $|A| \ge 2$ and $K \ge |S|^2 |A|$ throughout. In addition, we denote $L = \log(K B_\star |S||A| / \delta c_{\min})$. The complete proofs of all statements are found in the supplementary material.

Positive costs. The following results hold when Assumption 2 holds (recall that we always assume Assumption 1). In particular, when this assumption holds the optimal policy overall is proper (Lemma 2.2), hence the regret bounds below are with respect to the best overall policy.

Theorem 2.3. Suppose that Assumption 2 holds. With probability at least $1 - \delta$ the regret of Algorithm 3 is bounded as follows:

$$R_K = O\left(\sqrt{\frac{B_\star^3 |S|^2 |A| K}{c_{\min}}}\, L + \frac{B_\star^3 |S|^2 |A|}{c_{\min}^2}\, L\right).$$

The main issue with the regret bound in Theorem 2.3 is that it scales with $\sqrt{K / c_{\min}}$, which cannot be avoided regardless of how large $K$ is with respect to $c_{\min}^{-1}$. This problem is alleviated in Algorithm 2, which uses the tighter Bernstein-based confidence bounds.

Theorem 2.4. Assume that Assumption 2 holds. With probability at least $1 - \delta$ the regret of Algorithm 2 is bounded as follows:

$$R_K = O\left(B_\star |S| \sqrt{|A| K}\, L + \sqrt{\frac{B_\star^3 |S|^4 |A|^2}{c_{\min}}}\, L\right).$$

Note that when $K \gg B_\star |S|^2 |A| / c_{\min}$, the regret bound above scales as $\widetilde{O}(B_\star |S| \sqrt{|A| K})$, thus obtaining a near-optimal rate.

Arbitrary costs. Recall that in this case we can no longer assume that the optimal policy is proper. Therefore, the regret bounds below are in comparison to the best proper policy. Assumption 2 can be easily relaxed by adding a small fixed cost to the cost of all state-action pairs. Following the perturbation of the costs, we obtain regret bounds from Theorems 2.3 and 2.4 with $c_{\min} \leftarrow \epsilon$ and $B_\star \leftarrow B_\star + \epsilon T_\star$, and the learner also suffers an additional cost of $\epsilon T_\star K$ due to the misspecification of the model caused by the perturbation. By picking $\epsilon$ to balance these terms we get the following corollaries (letting $\widetilde{L} = \log(K B_\star T_\star |S||A| / \delta)$).

Corollary 2.5. Running Algorithm 3 using costs $c_\epsilon(s,a) = \max\{c(s,a), \epsilon\}$ for $\epsilon = (|S|^2 |A| / K)^{1/3}$ gives the following regret bound with probability at least $1 - \delta$:

$$R_K = O\left(T_\star^3 |S|^{2/3} |A|^{1/3} K^{2/3}\, \widetilde{L} + T_\star^3 |S|^2 |A|\, \widetilde{L}\right).$$

Corollary 2.6. Running Algorithm 2 using costs $c_\epsilon(s,a) = \max\{c(s,a), \epsilon\}$ for $\epsilon = |S|^2 |A| / K$ gives the following regret bound with probability at least $1 - \delta$:

$$R_K = O\left(B_\star^{3/2} |S| \sqrt{|A| K}\, \widetilde{L} + T_\star^{3/2} |S|^2 |A|\, \widetilde{L}\right).$$

Moreover, when the algorithm knows $B_\star$ and $K \gg |S|^2 |A| T_\star^2$, then choosing $\epsilon = B_\star |S|^2 |A| / K$ gets a near-optimal regret bound of $\widetilde{O}(B_\star |S| \sqrt{|A| K})$.

Lower bound. In Appendix C we show that Corollary 2.6 is nearly tight using the following theorem.

Theorem 2.7. There exists an SSP problem instance $M = (S, A, P, c, s_{\text{init}})$ in which $J^{\pi^\star}(s) \le B_\star$ for all $s \in S$, $|S| \ge 2$, $|A| \ge 16$, $B_\star \ge 2$, $K \ge |S||A|$, and $c(s,a) = 1$ for all $s \in S$, $a \in A$, such that the expected regret of any learner after $K$ episodes satisfies

$$\mathbb{E}[R_K] \ge \frac{1}{1024}\, B_\star \sqrt{|S||A|K}.$$

We start with a simpler case in which $B_\star$ is known to the learner. In Section 3.2 we remove this assumption at the price of an additional log-factor in the regret bound. For now, we prove the following bound on the learner's regret.

Theorem 3.1.

Suppose that Assumption 2 holds. With probability at least $1 - \delta$ the regret of Algorithm 1 is bounded as follows:

$$R_K = O\left(\sqrt{\frac{B_\star^3 |S|^2 |A| K}{c_{\min}}}\, L + \frac{B_\star^3 |S|^2 |A|}{c_{\min}^2}\, L\right).$$

Our algorithm follows the known concept of optimism in the face of uncertainty. That is, it maintains confidence sets that contain the true transition function with high probability and picks an optimistic optimal policy: a policy that minimizes the expected cost over all policies and all transition functions in the current confidence set. The computation of the optimistic optimal policy can be done efficiently as shown by Tarbouriech et al. (2019). Construct an augmented MDP whose states are $S$ and whose action set consists of tuples $(a, \widetilde{P})$, where $a \in A$ and $\widetilde{P}$ is any transition function such that

$$\big\| \widetilde{P}(\cdot \mid s,a) - \bar{P}(\cdot \mid s,a) \big\|_1 \le 5 \sqrt{\frac{|S| \log\big(|S||A| N_+(s,a) / \delta\big)}{N_+(s,a)}}, \tag{2}$$

where $\bar{P}$ is the empirical estimate of $P$. It can be shown that the optimistic policy and the optimistic model, i.e., those that minimize the expected total cost over all policies and feasible transition functions, correspond to the optimal policy of the augmented MDP.

To ensure that the algorithm reaches the goal state in every episode, we define a state-action pair $(s,a)$ as known if the number of visits to this pair is at least $\frac{5000 B_\star^2 |S|}{c_{\min}^2} \log \frac{B_\star |S||A|}{\delta c_{\min}}$ and as unknown otherwise. We show that, with high probability, the optimistic policy chosen by the algorithm will be proper once all state-action pairs are known. However, when some pairs are still unknown, our chosen policies may be improper. This implies that the strategy of keeping the policy fixed throughout an episode, as is usually done in episodic RL, will fail. Consequently, our algorithm changes policies at the start of every episode and also every time we reach an unknown state-action pair.

Algorithm 1 Hoeffding-type confidence bounds and known $B_\star$

input: state space $S$, action space $A$, bound on cost-to-go of optimal policy $B_\star$, confidence parameter $\delta$.
initialization: $\forall (s,a,s') \in S \times A \times S$: $N(s,a,s') \leftarrow 0$, $N(s,a) \leftarrow 0$, an arbitrary policy $\tilde{\pi}$, $t \leftarrow 1$.
for $k = 1, 2, \ldots$ do
    set $s_t \leftarrow s_{\text{init}}$.
    while $s_t \ne g$ do
        follow optimistic optimal policy: $a_t \leftarrow \tilde{\pi}(s_t)$.
        observe next state $s_{t+1} \sim P(\cdot \mid s_t, a_t)$.
        update: $N(s_t, a_t, s_{t+1}) \leftarrow N(s_t, a_t, s_{t+1}) + 1$, $N(s_t, a_t) \leftarrow N(s_t, a_t) + 1$.
        if $N(s_{t+1}, \tilde{\pi}(s_{t+1})) \le \frac{5000 B_\star^2 |S|}{c_{\min}^2} \log \frac{B_\star |S||A|}{\delta c_{\min}}$ or $s_{t+1} = g$ then
            compute empirical transition function $\bar{P}$ as $\bar{P}(s' \mid s,a) = N(s,a,s') / N_+(s,a)$, where $N_+(s,a) = \max\{N(s,a), 1\}$.
            compute optimistic policy $\tilde{\pi}$ by minimizing expected cost over transition functions $\widetilde{P}$ that satisfy Eq. (2).
        end if
        set $t \leftarrow t + 1$.
    end while
end for

Formally, we split the time into intervals. The first interval begins at the first time step, and every interval ends by reaching the goal state or a state $s$ such that $(s, \tilde{\pi}(s))$ is unknown (where $\tilde{\pi}$ is the current policy followed by the learner). Recall that once all state-action pairs are known, the optimistic policy will eventually reach the goal state. Therefore, recomputing the optimistic policy at the end of every interval ensures that the algorithm will eventually reach the goal state with high probability. Note that the total number of intervals is at most the number of visits to an unknown state-action pair plus the number of episodes.

Observation 3.2. The total number of intervals, $M$, is

$$O\left(K + \frac{B_\star^2 |S|^2 |A|}{c_{\min}^2} \log \frac{B_\star |S||A|}{\delta c_{\min}}\right).$$

The proof of Theorem 3.1 begins by defining the "good event" in which our confidence sets contain the true transition function and the total cost in every interval is bounded. This in turn implies that all episodes end in finite time. We prove that the good event holds with high probability.

Then, independently, we give a high-probability bound on the regret of the algorithm when the good event holds. To do so, recall that at the beginning of every interval $m$, the learner computes an optimistic policy by minimizing over all policies and over all transition functions within the current confidence set. We denote the chosen policy by $\tilde{\pi}^m$ and let $\widetilde{P}_m$ be the minimizing transition function (i.e., the optimistic model). A key observation is that, by the definition of our confidence sets, $\widetilde{P}_m$ is such that there is always some positive probability to transition to the goal state directly from any state-action pair. This implies that all policies are proper in the optimistic model and that the cost-to-go function of $\tilde{\pi}^m$ defined with respect to $\widetilde{P}_m$, denoted by $\widetilde{J}^m$, is finite. By Lemma 2.1, the following Bellman optimality equations hold for all $s \in S$:

$$\widetilde{J}^m(s) = \min_{a \in A}\; c(s,a) + \sum_{s' \in S} \widetilde{P}_m(s' \mid s,a)\, \widetilde{J}^m(s'). \tag{3}$$

High probability events. For every interval $m$, we let $\Omega^m$ denote the event that the confidence set for interval $m$ contains the true transition function $P$. Formally, let $\bar{P}_m$ denote the empirical estimate of the transition function at the beginning of interval $m$, let $N_m(s,a)$ denote the number of visits to state-action pair $(s,a)$ up to (and not including) interval $m$, and let $n_m(s,a)$ be the number of visits to $(s,a)$ during interval $m$. Then we say that $\Omega^m$ holds if for all $(s,a) \in S \times A$ we have (with $N^m_+(s,a) = \max\{N_m(s,a), 1\}$)

$$\big\| P(\cdot \mid s,a) - \bar{P}_m(\cdot \mid s,a) \big\|_1 \le 5 \sqrt{\frac{|S| \log\big(|S||A| N^m_+(s,a) / \delta\big)}{N^m_+(s,a)}}. \tag{4}$$

In the following lemma we show that, with high probability, the events $\Omega^m$ hold and that the total cost in each interval is bounded. Combining this with Observation 3.2 we get that all episodes terminate within a finite number of steps, with high probability.

Lemma 3.3. With probability at least $1 - \delta/2$, for all intervals $m$ simultaneously, we have that $\Omega^m$ holds and that $\sum_{h=1}^{H^m} c(s_h^m, a_h^m) \le 24 B_\star \log \frac{4m}{\delta}$, where $H^m$ denotes the length of interval $m$, $s_h^m$ is the observed state at time $h$ of interval $m$ and $a_h^m = \tilde{\pi}^m(s_h^m)$ is the chosen action. This implies that the total number of steps of the algorithm is

$$T = O\left(\frac{K B_\star}{c_{\min}}\, L + \frac{B_\star^3 |S|^2 |A|}{c_{\min}^3}\, L\right).$$

Proof sketch. The events $\Omega^m$ hold with high probability due to standard concentration inequalities, and thus it remains to address the high probability bound on the total cost within each interval.

This proof consists of three parts. In the first, we show that when $\Omega^m$ occurs we have that $\widetilde{J}^m(s) \le J^{\pi^\star}(s) \le B_\star$ for all $s \in S$, due to the optimistic nature of the computation of $\tilde{\pi}^m$. In the second part, we argue that had all state-action pairs been known, then having $\Omega^m$ hold implies that $J^m(s) \le 2 B_\star$ for all $s \in S$. That is, when all state-action pairs are known, not only is $\tilde{\pi}^m$ proper in the true model, but its expected cumulative cost is at most $2 B_\star$.

The third part of the proof deals with the general case when not all state-action pairs are known. Fix some interval $m$. Since the interval ends when we reach an unknown state-action pair, it must be that all but the first state-action pair visited during the interval are known. For this unknown first state-action pair, it follows from the Bellman equations (Eq. (3)) and from $\widetilde{J}^m(s) \le B_\star$ for all $s \in S$ that $\tilde{\pi}^m$ never picks an action whose instantaneous cost is larger than $B_\star$. Therefore, the cost of this first unknown state-action pair is at most $B_\star$, and we focus on bounding the total cost in the remaining time steps with high probability.

To that end, we define the following modified MDP $M^{\text{know}} = (S^{\text{know}}, A, P^{\text{know}}, c, s_{\text{init}})$ in which every state $s \in S$ such that $(s, \tilde{\pi}^m(s))$ is unknown is contracted to the goal state. Let $P^{\text{know}}$ be the transition function induced in $M^{\text{know}}$ by $P$, and let $J^m_{\text{know}}$ be the cost-to-go of $\tilde{\pi}^m$ in $M^{\text{know}}$ w.r.t. $P^{\text{know}}$. Similarly, define $\widetilde{P}^{\text{know}}_m$ as the transition function induced in $M^{\text{know}}$ by $\widetilde{P}_m$, and $\widetilde{J}^m_{\text{know}}$ as the cost-to-go of $\tilde{\pi}^m$ in $M^{\text{know}}$ w.r.t. $\widetilde{P}^{\text{know}}_m$. It is clear that $\widetilde{J}^m_{\text{know}}(s) \le \widetilde{J}^m(s)$ for every $s \in S$, whence $\widetilde{J}^m_{\text{know}}(s) \le B_\star$.
Moreover, since all states $s \in S$ for which $(s, \tilde{\pi}^m(s))$ is unknown were contracted to the goal state, in $M^{\text{know}}$ all remaining state-action pairs are known. Therefore, by the second part of the proof, $J^m_{\text{know}}(s) \le 2 B_\star$ for all $s \in S$. Note that reaching the goal state in $M^{\text{know}}$ is equivalent to reaching either the goal state or an unknown state-action pair in the true model, hence the latter argument shows that the total expected cost in doing so is at most $2 B_\star$. We further obtain the high probability bound by a probabilistic amplification argument using the Markov property of the MDP.

Regret analysis. In what follows, instead of bounding $R_K$, we bound $\widetilde{R}_K = \sum_{m=1}^{M} \sum_{h=1}^{H^m} c(s_h^m, a_h^m)\, \mathbb{I}\{\Omega^m\} - K \cdot J^{\pi^\star}(s_{\text{init}})$, where $\mathbb{I}$ is the indicator function. Note that according to Lemma 3.3, we have that $\widetilde{R}_K = R_K$ with high probability.

The definition of $\widetilde{R}_K$ allows the analysis to disentangle two dependent probabilistic events. The first is the intersection of the events $\Omega^m$, which is dealt with in Lemma 3.3. The second holds when, for a fixed policy, the costs suffered by the learner do not deviate significantly from their expectation. In the following lemma we bound $\widetilde{R}_K$.

Lemma 3.4. With probability at least $1 - \delta/2$, we have

$$\widetilde{R}_K \le O\Bigg(\underbrace{\frac{B_\star^3 |S|^2 |A|}{c_{\min}^2} \log \frac{B_\star |S||A|}{c_{\min} \delta}}_{(1)} + B_\star \sqrt{T \log \frac{T}{\delta}} + B_\star \sqrt{|S| \log \frac{|S||A| T}{\delta}}\; \underbrace{\sum_{s,a} \sum_{m=1}^{M} \frac{n_m(s,a)}{\sqrt{N^m_+(s,a)}}}_{(2)}\Bigg).$$

Here we only explain how to interpret the resulting bound. The term (1) bounds the total cost spent in intervals that ended in unknown state-action pairs (it does not depend on $K$). The term (2) is at most $O(\sqrt{|S||A| T})$ when Lemma 3.3 holds, and then the dominant term in Lemma 3.4 becomes $\widetilde{O}(B_\star |S| \sqrt{|A| T})$. Theorem 3.1 is finally obtained by applying a union bound on Lemmas 3.3 and 3.4 and using Lemma 3.3 to bound $T$.

In this section we relax the assumption that $B_\star$ is known to the learner. Instead, we keep an estimate $\widetilde{B}$ that is initialized to $c_{\min}$ and doubles every time the cost in interval $m$ (denoted as $C_m$) reaches $24 \widetilde{B} \log \frac{4m}{\delta}$. By Lemma 3.3, with high probability, $\widetilde{B} \le 2 B_\star$. We end an interval as before (once the goal state is reached or an unknown state-action pair is reached), but also when $\widetilde{B}$ is doubled. The algorithm for this case is presented in Appendix A (Algorithm 3). Since $\widetilde{B}$ changes, every state-action pair can become known once for every different value of $\widetilde{B}$.

Observation 3.5. When $B_\star$ is unknown to the learner, the number of times a state-action pair can become known is at most $\log(B_\star / c_{\min})$. The number of intervals $M$ is

$$O\left(K + \frac{B_\star^2 |S|^2 |A|}{c_{\min}^2} \log \frac{B_\star |S||A|}{\delta c_{\min}}\right).$$

Lemma 3.6. When $B_\star$ is unknown, with probability at least $1 - \delta/2$, for all intervals $m$ simultaneously, we have that $\Omega^m$ holds and that $\sum_{h=1}^{H^m} c(s_h^m, a_h^m) \le 24 B_\star \log \frac{4m}{\delta}$. This implies that the total number of steps of the algorithm is

$$T = O\left(\frac{K B_\star}{c_{\min}}\, L + \frac{B_\star^3 |S|^2 |A|}{c_{\min}^3}\, L\right).$$

The analysis follows that of Algorithm 1. In particular, Lemma 3.4 still holds (with $2 B_\star$ instead of $B_\star$), and jointly with Lemma 3.6 implies Theorem 2.3.

Algorithm 1 has two drawbacks. The first one is the use of Hoeffding-style confidence bounds, which we improve with Bernstein-style confidence bounds. The second is the number of times the optimistic optimal policy is computed. In this section we propose to compute it in a way similar to UCRL2, i.e., once the number of visits to some state-action pair is doubled. Note that this change also eliminates the need to know or to estimate $B_\star$.

Algorithm 2 Bernstein-type confidence bounds

input: state space $S$, action space $A$ and confidence parameter $\delta$.
initialization: $i \leftarrow 1$, $t \leftarrow 1$, arbitrary policy $\tilde{\pi}_1$, $\forall (s,a,s') \in S \times A \times S$: $N_1(s,a,s') \leftarrow 0$, $N_1(s,a) \leftarrow 0$, $n_1(s,a,s') \leftarrow 0$, $n_1(s,a) \leftarrow 0$.
for $k = 1, 2, \ldots$ do
    set $s_t \leftarrow s_{\text{init}}$.
    while $s_t \ne g$ do
        follow optimistic optimal policy: $a_t \leftarrow \tilde{\pi}_i(s_t)$.
        observe next state $s_{t+1} \sim P(\cdot \mid s_t, a_t)$.
        set: $n_i(s_t, a_t) \leftarrow n_i(s_t, a_t) + 1$, $n_i(s_t, a_t, s_{t+1}) \leftarrow n_i(s_t, a_t, s_{t+1}) + 1$.
        if $n_i(s_{t+1}, \tilde{\pi}_i(s_{t+1})) < N_i(s_{t+1}, \tilde{\pi}_i(s_{t+1}))$ then
            set $t \leftarrow t + 1$ and continue.
        end if
        set: $N_{i+1}(s,a,s') \leftarrow N_i(s,a,s') + n_i(s,a,s')$, $N_{i+1}(s,a) \leftarrow N_i(s,a) + n_i(s,a)$, $n_{i+1}(s,a) \leftarrow 0$, $n_{i+1}(s,a,s') \leftarrow 0$ for all $(s,a,s') \in S \times A \times S$.
        compute empirical transition function $\bar{P}$ as $\bar{P}(s' \mid s,a) = N(s,a,s') / N_+(s,a)$ for every $(s,a,s') \in S \times A \times S$, where $N_+(s,a) = \max\{N(s,a), 1\}$.
        compute optimistic transition function $\widetilde{P}$ using Eq. (5).
        compute optimal policy $\tilde{\pi}$ w.r.t. $\widetilde{P}$.
        $i \leftarrow i + 1$, $t \leftarrow t + 1$.
    end while
end for

The algorithm is presented in Algorithm 2. It consists of epochs. The first epoch starts at the first time step, and each epoch ends once the number of visits to some state-action pair is doubled. An optimistic policy is computed at the end of every epoch using (empirical) Bernstein confidence bounds. In contrast to Algorithm 1, Algorithm 2 defines a confidence range for each state, action, and next state separately, around its empirical estimate (i.e., we use an $L_\infty$ "ball" rather than an $L_1$ "ball" around the empirical estimates). This allows us to disentangle the computation of the optimistic policy from the computation of the optimistic model. Indeed, the computation of the optimistic model becomes very easy: one simply has to maximize the probability of transition directly to the goal state at every state-action pair, which means minimizing the probability of transition to all other states and setting them at the lowest possible value of their confidence range.
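The rule just described (push every next-state probability to the bottom of its confidence range and let the goal absorb the rest) can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the list-of-lists layout of the counts, and the toy counts and costs are hypothetical. The constants 28 and 4 are those of Eq. (5), and the planning step uses ordinary value iteration, which is one of the standard planners mentioned earlier rather than a method prescribed by the paper.

```python
import math


def empirical_transitions(N_sas, N_sa):
    """P_bar[s][a][s2] = N(s, a, s2) / N_+(s, a), with N_+ = max(N, 1)."""
    return [[[n / max(N_sa[s][a], 1) for n in N_sas[s][a]]
             for a in range(len(N_sa[s]))] for s in range(len(N_sa))]


def optimistic_transitions(N_sas, N_sa, delta):
    """Lower every P_bar(s2 | s, a) to the bottom of its confidence range.

    Uses the constants 28 and 4 of Eq. (5). The mass that is removed is
    implicitly given to the goal state, which is not stored explicitly.
    """
    n_states, n_actions = len(N_sa), len(N_sa[0])
    P_bar = empirical_transitions(N_sas, N_sa)
    P_tilde = []
    for s in range(n_states):
        rows = []
        for a in range(n_actions):
            n_plus = max(N_sa[s][a], 1)
            A = math.log(n_states * n_actions * n_plus / delta) / n_plus
            rows.append([max(p - 28 * A - 4 * math.sqrt(p * A), 0.0)
                         for p in P_bar[s][a]])
        P_tilde.append(rows)
    return P_tilde


def value_iteration(P, c, tol=1e-10, max_iter=100_000):
    """Iterate the Bellman optimality operator; return (J, greedy policy)."""
    n_states, n_actions = len(c), len(c[0])
    J = [0.0] * n_states
    for _ in range(max_iter):
        Q = [[c[s][a] + sum(P[s][a][s2] * J[s2] for s2 in range(n_states))
              for a in range(n_actions)] for s in range(n_states)]
        J_new = [min(q) for q in Q]
        if max(abs(x - y) for x, y in zip(J_new, J)) < tol:
            return J_new, [q.index(min(q)) for q in Q]
        J = J_new
    raise RuntimeError("value iteration did not converge")


# Hypothetical counts for 2 states and 2 actions. N_sa[s][a] counts all visits
# (including those that ended in the goal); N_sas[s][a][s2] counts transitions
# into non-goal state s2.
N_sa = [[1000, 1000], [1000, 1000]]
N_sas = [[[0, 1000], [900, 0]], [[0, 500], [1000, 0]]]
c = [[0.5, 0.2], [1.0, 0.1]]

P_bar = empirical_transitions(N_sas, N_sa)
P_tilde = optimistic_transitions(N_sas, N_sa, delta=0.1)
J_emp, _ = value_iteration(P_bar, c)
J_opt, pi_opt = value_iteration(P_tilde, c)
print(J_emp, J_opt, pi_opt)
```

Because costs are non-negative and every entry of the optimistic model is no larger than the empirical one, the optimistic cost-to-go `J_opt` can never exceed `J_emp`; this is the sense in which the model is optimistic. Note also that the policy and the model are computed in two separate steps, which is the disentangling the text refers to.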
This results in the following formula for e P ( s ′ | s , a ):max { ¯ P ( s ′ | s , a ) – 28 A ( s , a ) – 4 q ¯ P ( s ′ | s , a ) A ( s , a ), 0 } , (5)where A ( s , a ) = log( | S || A | N + ( s , a )/ δ )/ N + ( s , a ). The optimistic policy is then the optimal policy in the SSPmodel deﬁned by the transition function e P . In this section we prove Theorem 2.4. We start by showing that our new conﬁdence sets contain P withhigh probability which implies that each episode ends in ﬁnite time with high probability. Consequently, weare able to bound the regret through summation of our conﬁdence bounds.We once again distinguish between known and unknown state-action pairs similarly to Algorithm 1. Astate-action pair ( s , a ) becomes known at the end of an epoch if the total number of visits to ( s , a ) haspassed α · B ⋆ | S | c min log B ⋆ | S || A | δ c min at some time step during the epoch (for some constant α > s , a ) may be strictly larger than α · B ⋆ | S | c min log B ⋆ | S || A | δ c min but at mosttwice as much by the deﬁnition of our algorithm. Furthermore, we split each epoch into intervals similar to9hat did in Section 3. The ﬁrst interval starts at the ﬁrst time step and each interval ends once (1) the totalcost in the interval accumulates to at least B ⋆ ; (2) an unknown state-action pair is reached; (3) the currentepisode ends; or (4) the current epoch ends. We have the following observation. Observation 4.1.

Let C M denote the cost of the learner after M intervals. Observe that the total cost ineach interval is at least B ⋆ unless the interval ends in the goal state, in an unknown state-action pair or theepoch ends. Thus the total number of intervals satisﬁes M ≤ C M B ⋆ + 2 | S || A | log T + K + O (cid:18) B ⋆ | S | | A | c min log B ⋆ | S || A | δ c min (cid:19) ,and the total time satisﬁes T ≤ C M / c min .Recall that in the analysis of Algorithm 1 we show that once all state-action pairs are known, the opti-mistic policies generated by the algorithm are proper in the true MDP. The same holds true for Algorithm 2,yet we never prove this directly. Instead, our proof goes as follows. We prove that C M , the cost accumulatedby the learner during the ﬁrst M intervals, is at most K · J π ⋆ ( s init ) + B ⋆ √ M with high probability as long asno more than K episodes have been completed during these M intervals. We notice that once all state-actionpairs are known, the total cost in each interval is at least B ⋆ (ignoring intervals that end with the end ofan epoch or an episode), which implies that the total number of intervals M is bounded by C M / B ⋆ . Thisallows us to get a bound on C M that is independent of the number of intervals by solving the inequality C M . K · J π ⋆ ( s init ) + B ⋆ √ M . K · J π ⋆ ( s init ) + √ B ⋆ · C M . From this, and since the instantaneous costsare strictly positive (by Assumption 2), it must be that the learner eventually completes all K episodes; i.e.,there must be a time from which Algorithm 2 generates only proper policies. Notation.

The epoch that interval $m$ belongs to is denoted by $i(m)$; other notations are as in Section 3.1. Note that since the optimistic policy is computed at the end of an epoch and not at the end of an interval, it follows that $\tilde\pi^m = \tilde\pi^{i(m)}$ and $\tilde J^m = \tilde J^{i(m)}$. The trajectory visited in interval $m$ is denoted by $U^m = (s^m_1, a^m_1, \ldots, s^m_{H^m}, a^m_{H^m}, s^m_{H^m+1})$, where $a^m_h$ is the action taken in $s^m_h$, and $H^m$ is the length of the interval. In addition, the concatenation of the trajectories of the intervals up to and including interval $m$ is denoted by $\bar U^m$, that is, $\bar U^m = \cup_{m'=1}^{m} U^{m'}$.

High probability events.

Throughout the analysis we denote $S^+ = S \cup \{g\}$. For every interval $m$ we let $\Omega^m$ denote the event that the confidence set for epoch $i = i(m)$ contains the actual transition function $P$. Formally, if $\Omega^m$ holds then for all $(s, a, s') \in S \times A \times S^+$, we have (denote $N^m_+(s, a) = \max\{N^m(s, a), 1\}$ and $A^m_h = A(s^m_h, a^m_h)$)

$$\big| P(s' \mid s, a) - \bar P_m(s' \mid s, a) \big| \le 28 A^m_h + 4\sqrt{\bar P_m(s' \mid s, a)\, A^m_h}. \tag{6}$$

In the following lemma we show that the events $\Omega^m$ hold with high probability.

Lemma 4.2.

With probability at least $1 - \delta/2$, $\Omega^m$ holds for all intervals $m$ simultaneously.

Regret analysis.

In the following section, instead of bounding $R_K$, we bound $\tilde R_M = \sum_{m=1}^{M} \sum_{h=1}^{H^m} c(s^m_h, a^m_h)\, \mathbb{I}\{\Omega^m\} - K J^{\pi^\star}(s_{\text{init}})$ for any number of intervals $M$. This implies Theorem 2.4 by the following argument. Lemma 4.2 implies that $\tilde R_M = R_M$ with high probability for any number of intervals $M$ ($R_M$ is the true regret within the first $M$ intervals). In particular, when $M$ is the number of intervals in which the first $K$ episodes elapse, this implies Theorem 2.4 (we show that the learner indeed completes these $K$ episodes).

To bound $\tilde R_M$, we use the next lemma to decompose $\tilde R_M$ into two terms which we bound independently. We neglect low order terms here.

Lemma 4.3. It holds that $\tilde R_M = \sum_{m=1}^{M} \tilde R^1_m + \sum_{m=1}^{M} \tilde R^2_m - K \cdot J^{\pi^\star}(s_{\text{init}})$, where

$$\tilde R^1_m = \big( \tilde J^m(s^m_1) - \tilde J^m(s^m_{H^m+1}) \big)\, \mathbb{I}\{\Omega^m\},$$

and

$$\tilde R^2_m = \sum_{h=1}^{H^m} \Big( \tilde J^m(s^m_{h+1}) - \sum_{s' \in S} \tilde P_m(s' \mid s^m_h, a^m_h)\, \tilde J^m(s') \Big)\, \mathbb{I}\{\Omega^m\}.$$

The lemma breaks down $\tilde R_M$ into two terms. The first term accounts for the number of times in which the learner changes her policy in the middle of an episode, which is at most the number of epochs. The second term sums the errors between the cost-to-go of the observed next state and its estimated expectation. Indeed, $\sum_{m=1}^{M} \tilde R^1_m$ is related to the total number of epochs, which is at most $2|S||A| \log T$, due to the following lemma.

Lemma 4.4.

It holds that $\sum_{m=1}^{M} \tilde R^1_m \le 2 B_\star |S||A| \log T + K J^{\pi^\star}(s_{\text{init}})$.

The next lemma shows that $\sum_{m=1}^{M} \tilde R^2_m$ does not deviate significantly from $\sum_{m=1}^{M} \mathbb{E}[\tilde R^2_m \mid \bar U^{m-1}]$.

Lemma 4.5.

With probability at least $1 - \delta/4$,

$$\sum_{m=1}^{M} \tilde R^2_m \le \sum_{m=1}^{M} \mathbb{E}\big[ \tilde R^2_m \mid \bar U^{m-1} \big] + 3 B_\star \sqrt{M \log \frac{8M}{\delta}}.$$

The key property of the lemma is that the deviation between $\sum_{m=1}^{M} \tilde R^2_m$ and its corresponding expectation is of order $\sqrt{M}$ and does not scale with $T$.

To prove the lemma, we recall that an interval ends, at the latest, at the first time step in which the accumulated cost in the interval surpasses $B_\star$. We show in our analysis that $\tilde J^m(s) \le J^{\pi^\star}(s) \le B_\star$ for all $s \in S$ due to the optimistic computation of $\tilde\pi^m$. Therefore, $\tilde\pi^m$ never picks an action whose instantaneous cost is more than $B_\star$. This implies that the total cost within each interval is at most $2 B_\star$. Then, we use the Bellman equations to bound $\tilde R^2_m$ by the order of the total cost in the interval, and the lemma follows by an application of Azuma's concentration inequality.

Lemma 4.6 below bounds $\mathbb{E}\big[ \tilde R^2_m \mid \bar U^{m-1} \big]$ for every interval $m$ by a sum of the confidence bounds used in Algorithm 2.

Lemma 4.6.

For every interval m, E (cid:2) e R m | ¯ U m –1 (cid:3) ≤ E " H m X h =1 q | S | V mh A mh I { Ω m } (cid:12)(cid:12)(cid:12)(cid:12) ¯ U m –1 + 272 E " H m X h =1 B ⋆ | S | A mh I { Ω m } (cid:12)(cid:12)(cid:12)(cid:12) ¯ U m –1 , (7) where V mh is the empirical variance deﬁned as V mh = P s ′ ∈ S + P ( s ′ | s mh , a mh ) (cid:0)e J m ( s ′ ) – µ mh (cid:1) , and µ mh = P s ′ ∈ S + P ( s ′ | s mh , a mh ) e J m ( s ′ ) . The next step is the part of our proof in which our analysis departs from that of Algorithm 1. Note thatwhen Ω m holds, V mh ≤ B ⋆ . Using this bound for each time step separately will result in a bound similar tothat of Theorem 2.3. However, this bound is loose due to the following intuitive argument. Suppose thatwe replace e J m with the true cost-to-go function of ˜ π m , J m , in the deﬁnition of V mh . Note that from theBellman equations (Eq. (1)) we have J m ( s mh ) > J m ( s mh +1 ) in expectation on consecutive time steps h and h + 1 hence we surmise that in expectation V mh would also decrease on consecutive time steps. A similarargument holds when in reality we use e J m because all-but-one of the state-action pairs in the interval areknown, and e J m is a “close enough” approximation of J m on known state-action pairs since they have beensampled suﬃciently many times. Indeed, in Lemma 4.7 we use the technique of Azar et al. (2017) to showthat (up to a constant) B ⋆ bounds the expected sum of the variances over the time steps of an interval. Lemma 4.7. E (cid:2)P H m h =1 V mh I { Ω m } | ¯ U m –1 (cid:3) ≤ B ⋆ . 11rmed with Lemma 4.7, we upper bound P Mm =1 E (cid:2) e R m | ¯ U m –1 (cid:3) by applying some algebraic manipulationon Eq. (7), and summing over all intervals which gives the next lemma. Lemma 4.8.

With probability at least δ /4 , M X m =1 E (cid:2) e R m | ¯ U m –1 (cid:3) ≤ B ⋆ r M | S | | A | log T | S || A | δ + 8160 B ⋆ | S | | A | log T | S || A | δ .Theorem 2.4 is obtained by ﬁrst applying a union bound on Lemmas 4.2, 4.5 and 4.8, plugging in thebounds of Lemmas 4.4, 4.5 and 4.8 into Lemma 4.3, and bounding T and M using Observation 4.1. Thisresults in a quadratic inequality in √ C M and solving it yields the theorem. Acknowledgements

HK is supported by the Israeli Science Foundation (ISF) grant 1595/19.

References

Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem. SIAM journal on computing, 32(1):48–77, 2002.

Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 263–272. JMLR. org, 2017.

Peter L Bartlett and Ambuj Tewari. Regal: A regularization based algorithm for reinforcement learning in weakly communicating mdps. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 35–42. AUAI Press, 2009.

Dimitri P Bertsekas and John N Tsitsiklis. An analysis of stochastic shortest path problems. Mathematics of Operations Research, 16(3):580–595, 1991.

Dimitri P Bertsekas and Huizhen Yu. Stochastic shortest path problems under weak conditions. Lab. for Information and Decision Systems Report LIDS-P-2909, MIT, 2013.

Nicolo Cesa-Bianchi and Gábor Lugosi. Prediction, learning, and games. Cambridge university press, 2006.

Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying pac and regret: Uniform pac bounds for episodic reinforcement learning. In Advances in Neural Information Processing Systems, pages 5713–5723, 2017.

Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.

Gergely Neu, András György, and Csaba Szepesvári. The online loop-free stochastic shortest-path problem. In COLT 2010 - The 23rd Conference on Learning Theory, Haifa, Israel, June 27-29, 2010, pages 231–243, 2010.

Gergely Neu, András György, and Csaba Szepesvári. The adversarial stochastic shortest path problem with unknown transition probabilities. In Artificial Intelligence and Statistics, pages 805–813, 2012.

Ian Osband, Benjamin Van Roy, and Zheng Wen. Generalization and exploration via randomized value functions. In International Conference on Machine Learning, pages 2377–2386, 2016.

Aviv Rosenberg and Yishay Mansour. Online stochastic shortest path with bandit feedback and unknown transition function. In Advances in Neural Information Processing Systems, pages 2209–2218, 2019a.

Aviv Rosenberg and Yishay Mansour. Online convex optimization in adversarial markov decision processes. In International Conference on Machine Learning, pages 5478–5486, 2019b.

Jean Tarbouriech, Evrard Garcelon, Michal Valko, Matteo Pirotta, and Alessandro Lazaric. No-regret exploration in goal-oriented reinforcement learning, 2019.

Tsachy Weissman, Erik Ordentlich, Gadiel Seroussi, Sergio Verdu, and Marcelo J Weinberger. Inequalities for the l1 deviation of the empirical distribution. Hewlett-Packard Labs, Tech. Rep, 2003.

Andrea Zanette and Emma Brunskill. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. In International Conference on Machine Learning, pages 7304–7312, 2019.

Alexander Zimin and Gergely Neu. Online learning in episodic markovian decision processes by relative entropy policy search. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, pages 1583–1591, 2013.

Algorithm

Algorithm 3

Hoeffding-type confidence bounds input: state space S , action space A and conﬁdence parameter δ . initialization: arbitrary policy ˜ π , m ← e B ← c min , C ← ∀ ( s , a , s ′ ) ∈ S × A × S : N ( s , a , s ′ ) ← N ( s , a ) ← for k = 1, 2, . . . do set s ← s init . while s = g do follow optimistic optimal policy: a ← ˜ π ( s ).suﬀer cost: C m ← C m + c ( s , a ).observe next state s ′ ∼ P ( · | s , a ).update visit counters: N ( s , a , s ′ ) ← N ( s , a , s ′ ) + 1, N ( s , a ) ← N ( s , a ) + 1. if N ( s ′ , ˜ π ( s ′ )) ≤ e B | S | c log e B | S || A | δ c min or s ′ = g or C m ≥ e B log m δ then if C m ≥ e B log m δ then update B ⋆ estimate: e B ← e B . end if advance intervals counter: m ← m + 1.initialize cost suﬀered in interval: C m ← compute empirical transition function ¯ P for every ( s , a , s ′ ) ∈ S × A × S :¯ P ( s ′ | s , a ) = N ( s , a , s ′ )max { N ( s , a ), 1 } . compute policy ˜ π that minimizes the expected cost with respect to a transition function e P , suchthat for every ( s , a ) ∈ S × A : (cid:13)(cid:13) e P ( · | s , a ) – ¯ P ( · | s , a ) (cid:13)(cid:13) ≤ s | S | log (cid:0) | S || A | N + ( s , a )/ δ (cid:1) N + ( s , a ) . end if set s ← s ′ . end whileend for B Proofs

B.1 Proofs for Section 3.1

B.1.1 Proof of Lemma 3.3Lemma (restatement of Lemma 3.3) . With probability at least δ /2 , Ω m holds and P H m h =1 c ( s mh , a mh ) ≤ B ⋆ log m δ for all intervals m simultaneously. This implies that the total number of steps of the algorithmis T = O (cid:18) KB ⋆ c min log KB ⋆ | S || A | δ c min + B ⋆ | S | | A | c min log KB ⋆ | S || A | δ c min (cid:19) . Lemma B.1.

The event Ω m holds for all intervals m simultaneously with probability at least δ /4 . roof. Fix a state s and an action a . Consider an inﬁnite sequence { Z i } ∞ i =1 of draws from the distribution P ( · | s , a ). By Theorem D.2 we get that for a preﬁx of length t of this sequence (that is { Z i } ti =1 ) (cid:13)(cid:13) P ( · | s , a ) – ¯ P { Z i } ti =1 ( · | s , a ) (cid:13)(cid:13) ≤ r | S | log( δ –1 t ) t ,holds with probability 1 – δ t , where ¯ P { Z i } ti =1 ( · | s , a ) is the empirical distribution deﬁned by the draws { Z i } ti =1 . We repeat this argument for every preﬁx { Z i } ti =1 of { Z i } ∞ i =1 and for every state-action pair, with δ t = δ /8 | S || A | t . Then from the union bound we get that Ω m holds for all intervals m simultaneously withprobability at least 1 – δ /4. Lemma B.2.

Let m be an interval. If Ω m holds then e J m ( s ) ≤ J π ⋆ ( s ) ≤ B ⋆ for every s ∈ S .Proof.

Tarbouriech et al. (2019) show that all the transition functions in the confidence set of Eq. (4) can be combined into a single augmented MDP. The optimal policy of the augmented MDP can be found efficiently, e.g., with Extended Value Iteration. The optimistic policy is the optimal policy in the augmented MDP. It minimizes $\tilde J^m(s)$ over all policies and feasible transition functions, for all states $s \in S$ simultaneously (following Bertsekas and Tsitsiklis, 1991). Since $\Omega^m$ holds, the real transition function is in the confidence set and is therefore also considered in the minimization. Thus $\tilde J^m(s) \le J^{\pi^\star}(s)$ for all $s \in S$. Finally, $J^{\pi^\star}(s) \le B_\star$ by the definition of $B_\star$.

Lemma B.3.

Let m be an interval and ( s , a ) be a known state-action pair. If Ω m holds then k e P m ( · | s , a ) – P ( · | s , a ) k ≤ c ( s , a )2 B ⋆ . Proof.

By the deﬁnition of the conﬁdence set k e P m ( · | s , a ) – ¯ P m ( · | s , a ) k ≤ s | S | log (cid:0) | S || A | N m + ( s , a )/ δ (cid:1) N m + ( s , a ) ≤ c ( s , a )4 B ⋆ ,where the last inequality follows because log( x )/ x is decreasing, and N m + ( s , a ) ≥ B ⋆ | S | c log B ⋆ | S || A | δ c min since( s , a ) is known. Similarly, since Ω m holds we also have that k P ( · | s , a ) – ¯ P m ( · | s , a ) k ≤ s | S | log (cid:0) | S || A | N m + ( s , a )/ δ (cid:1) N m + ( s , a ) ≤ c ( s , a )4 B ⋆ ,and the lemma follows by the triangle inequality. Lemma B.4.

Let ˜ π be a policy and e P be a transition function. Denote the cost-to-go of ˜ π with respect to e P by e J . Assume that for every s ∈ S , e J ( s ) ≤ B ⋆ and that (cid:13)(cid:13) e P ( · | s , ˜ π ( s )) – P ( · | s , ˜ π ( s )) (cid:13)(cid:13) ≤ c ( s , ˜ π ( s ))2 B ⋆ . Then, ˜ π is proper (with respect to P ), and it holds that J ˜ π ( s ) ≤ B ⋆ for every s ∈ S .Proof.

Consider the Bellman equations of ˜ π with respect to transition function e P at some state s ∈ S (seeLemma 2.1), deﬁned as e J ( s ) = c ( s , ˜ π ( s )) + X s ′ ∈ S e P ( s ′ | s , ˜ π ( s )) e J ( s ′ )= c ( s , ˜ π ( s )) + X s ′ ∈ S P ( s ′ | s , ˜ π ( s )) e J ( s ′ ) + X s ′ ∈ S e J ( s ′ ) (cid:16) e P ( s ′ | s , ˜ π ( s )) – P ( s ′ | s , ˜ π ( s )) (cid:17) . (8)15otice that by our assumptions and using H¨older inequality, (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) X s ′ ∈ S e J ( s ′ ) (cid:16) e P ( s ′ | s , ˜ π ( s )) – P ( s ′ | s , ˜ π ( s )) (cid:17)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ k e P ( · | s , ˜ π ( s )) – P ( · | s , ˜ π ( s )) k · k e J k ∞ ≤ c ( s , ˜ π ( s ))2 B ⋆ · B ⋆ = c ( s , ˜ π ( s ))2 .Plugging this into Eq. (8), we obtain e J ( s ) ≥ c ( s , ˜ π ( s )) + X s ′ ∈ S P ( s ′ | s , ˜ π ( s )) e J ( s ′ ) – c ( s , ˜ π ( s ))2 = c ( s , ˜ π ( s ))2 + X s ′ ∈ S P ( s ′ | s , ˜ π ( s )) e J ( s ′ ).Therefore, deﬁning J ′ = 2 e J , then J ′ ( s ) ≥ c ( s , ˜ π ( s )) + P s ′ ∈ S P ( s ′ | s , ˜ π ( s )) J ′ ( s ′ ) for all s ∈ S . Thestatement now follows by Lemma 2.1. Lemma B.5.

Let $\pi$ be a proper policy such that for some $v > 0$, $J^\pi(s) \le v$ for every $s \in S$. Then, the probability that the cost of $\pi$ to reach the goal state from any state $s$ is more than $m$ is at most $2 e^{-m/4v}$ for all $m \ge 0$. Note that a cost of at most $m$ implies that the number of steps is at most $m / c_{\min}$.

Proof. By Markov's inequality, the probability that $\pi$ accumulates cost of more than $2v$ before reaching the goal state is at most $1/2$. Iterating this argument, we get that the probability that $\pi$ accumulates cost of more than $2kv$ before reaching the goal state is at most $2^{-k}$ for every integer $k \ge 0$. In general, for any $m \ge 0$, the probability that $\pi$ suffers a cost of more than $m$ is at most $2^{-\lfloor m/2v \rfloor} \le 2 \cdot 2^{-m/2v} \le 2 e^{-m/4v}$.

For the next lemma we will need the following definitions. The trajectory visited in interval $m$ is denoted by $U^m = (s^m_1, a^m_1, \ldots, s^m_{H^m}, a^m_{H^m}, s^m_{H^m+1})$, where $a^m_h$ is the action taken in $s^m_h$, and $H^m$ is the length of the interval. In addition, the concatenation of the trajectories in the intervals up to and including interval $m$ is denoted by $\bar U^m = \cup_{m'=1}^{m} U^{m'}$.

Lemma B.6.

Let m be an interval. For all r ≥ , we have that P " H m X h =1 c (cid:0) s mh , a mh (cid:1) I { Ω m } > r | ¯ U m –1 ≤ e – r /8 B ⋆ . Proof.

Note that Ω m is determined given ¯ U m –1 , and suppose that Ω m holds otherwise P H m h =1 c (cid:0) s mh , a mh (cid:1) I { Ω m } is 0. Also assume that r ≥ B ⋆ or else the statement holds trivially.Deﬁne the MDP M know = ( S know , A , P know , c , s init ) in which every state s ∈ S such that ( s , ˜ π m ( s )) isunknown is contracted into the goal state. Let P know be the transition function induced in M know by P , andlet J m know be the cost-to-go of ˜ π m in M know with respect to P know . Similarly, deﬁne e P know m as the transitionfunction induced in M know by e P m , and e J m know as the cost-to-go of ˜ π m in M know with respect to e P know m . Itis clear that e J m know ( s ) ≤ e J m ( s ) for every s ∈ S , so by Lemma B.2, e J m know ( s ) ≤ B ⋆ . Moreover, since all thestates s ∈ S for which ( s , ˜ π m ( s )) is unknown were contracted to the goal state, we can use Lemma B.3 toobtain for all s ∈ S know : (cid:13)(cid:13) e P know m ( · | s , ˜ π m ( s )) – P know ( · | s , ˜ π m ( s )) (cid:13)(cid:13) ≤ (cid:13)(cid:13) e P m ( · | s , ˜ π m ( s )) – P ( · | s , ˜ π m ( s )) (cid:13)(cid:13) ≤ c ( s , ˜ π m ( s ))2 B ⋆ . (9)We can apply Lemma B.4 in M know and obtain that J m know ( s ) ≤ B ⋆ for every s ∈ S know . Notice thatreaching the goal state in M know is equivalent to reaching the goal state or an unknown state-action pair in M , and also recall that all state-action pairs in the interval are known except for the ﬁrst one. Thus, fromLemma B.5, P " H m X h =2 c (cid:0) s mh , a mh (cid:1) I { Ω m } > r – B ⋆ | ¯ U m –1 ≤ e –( r – B ⋆ )/8 B ⋆ ≤ e – r /8 B ⋆ .16ince e J m ≤ B ⋆ , our algorithm will never select an action whose instantaneous cost is larger than B ⋆ .Since the ﬁrst state-action in the interval might not be known, its cost is at most B ⋆ , and therefore P " H m X h =1 c (cid:0) s mh , a mh (cid:1) I { Ω m } > r | ¯ U m –1 ≤ P " H m X h =2 c (cid:0) s mh , a mh (cid:1) I { Ω m } > r – B ⋆ | ¯ U m –1 ≤ e – r /8 B ⋆ . 
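
As a sanity check of the halving argument behind Lemma B.5, which the proof above relies on, the following sketch compares the exact tail of the accumulated cost with the bound $2^{-\lfloor m/2v \rfloor}$ on a toy problem. The toy model (a single non-goal state, unit cost per step, goal reached with probability `p` after each step) and all function names are assumptions made for illustration; they do not appear in the paper.

```python
import math


def exact_tail(p: float, m: float) -> float:
    """P(total cost > m) in the toy SSP (hypothetical model).

    The cost equals the number of steps, which is Geometric(p) on {1, 2, ...},
    so the cost exceeds m iff the first floor(m) steps all miss the goal.
    """
    return (1.0 - p) ** math.floor(m)


def halving_bound(v: float, m: float) -> float:
    """Bound 2^{-floor(m / 2v)} obtained by iterating Markov's inequality."""
    return 2.0 ** (-math.floor(m / (2.0 * v)))


def check(p: float, max_m: int = 200) -> bool:
    """Return True if the exact tail never exceeds the bound for m = 0..max_m."""
    v = 1.0 / p  # expected cost-to-go of the toy policy from its only state
    return all(
        exact_tail(p, m) <= halving_bound(v, m) + 1e-12
        for m in range(max_m + 1)
    )
```

In this toy model the bound is loose, which is expected: Markov's inequality only uses the expected cost-to-go $v$ and no other property of the cost distribution.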
Proof of Lemma 3.3.

From Lemma B.6, with probability at least 1– δ /16 m , P H m h =1 c (cid:0) s mh , a mh (cid:1) ≤ B ⋆ log m δ ,and by the union bound this holds for all intervals m simultaneously with probability at least 1 – δ /4. ByLemma B.1, with probability 1 – δ /4, Ω m holds for all intervals m . Combining these two facts again by aunion bound, we get that both Ω m holds and the cost of interval m is at most 24 B ⋆ log m δ simultaneouslyto all intervals m with probability at least 1 – δ /2.If the cost of all intervals is bounded (and therefore so is the length of the interval), we can use the boundon the number of intervals in Observation 3.2 to conclude that T = O (cid:18) B ⋆ c min log M δ · (cid:18) K + B ⋆ | S | | A | c log B ⋆ | S || A | δ c min (cid:19)(cid:19) = O (cid:18) KB ⋆ c min log KB ⋆ | S || A | δ c min + B ⋆ | S | | A | c log KB ⋆ | S || A | δ c min (cid:19) . B.1.2 Proof of Lemma 3.4Lemma (restatement of Lemma 3.4) . With probability at least δ /2 , we have e R K ≤ B ⋆ | S | | A | c min log B ⋆ | S || A | c min δ + B ⋆ r T log 4 T δ + 10 B ⋆ r | S | log | S || A | T δ X s , a M X m =1 n m ( s , a ) p N m + ( s , a ) .To analyze e R K , we begin by plugging in the Bellman optimality equation of ˜ π m with respect to e P m into e R K . This allows us to decompose e R K into three terms as follows. e R K = M X m =1 H m X h =1 e J m ( s mh ) – X s ′ ∈ S e P m ( s ′ | s mh , a mh ) e J m ( s ′ ) ! I { Ω m } – K · J π ⋆ ( s init )= M X m =1 H m X h =1 (cid:16)e J m ( s mh ) – e J m ( s mh +1 ) (cid:17) I { Ω m } – K · J π ⋆ ( s init ) (10)+ M X m =1 H m X h =1 X s ′ ∈ S e J m ( s ′ ) (cid:16) P ( s ′ | s mh , a mh ) – e P m ( s ′ | s mh , a mh ) (cid:17) I { Ω m } (11)+ M X m =1 H m X h =1 e J m ( s mh +1 ) – X s ′ ∈ S P ( s ′ | s mh , a mh ) e J m ( s ′ ) ! I { Ω m } . (12)Eq. (10) is a bound on the cost suﬀered from switching policies each time we visit an unknown state-actionpair and is bounded by the following lemma. Lemma B.7. 
P Mm =1 P H m h =1 (cid:16)e J m ( s mh ) – e J m ( s mh +1 ) (cid:17) I { Ω m } ≤ B ⋆ | S || A | · B ⋆ | S | c min log B ⋆ | S || A | δ c min + K · J π ⋆ ( s init ). Proof.

Note that per interval P H m h =1 ( e J m ( s mh )– e J m ( s mh +1 )) is a telescopic sum which equals e J m ( s m )– e J m ( s mH m +1 ).Furthermore, for every two consecutive intervals m , m + 1 one of the following occurs:(i) If interval m ended in the goal state then e J m ( s mH m +1 ) = e J m ( g ) = 0 and e J m +1 ( s m +11 ) = e J m +1 ( s init ).Thus, using Lemma B.2 for the last inequality, e J m +1 ( s m +11 ) I { Ω m +1 } – e J m ( s mH m +1 ) I { Ω m } = e J m +1 ( s init ) I { Ω m +1 } ≤ J π ⋆ ( s init ).This happens at most K times. 17ii) If interval m ended in an unknown state then e J m +1 ( s m +11 ) I { Ω m +1 } – e J m ( s mH m +1 ) I { Ω m } ≤ e J m +1 ( s m +11 ) I { Ω m +1 } ≤ B ⋆ .This happens at most | S || A | · B ⋆ | S | c log B ⋆ | S || A | δ c min times.Lemma B.8 bounds Eq. (11) using techniques borrowed from Jaksch et al. (2010). Lemma B.8.

It holds that M X m =1 H m X h =1 X s ′ ∈ S e J m ( s ′ ) (cid:16) P ( s ′ | s mh , a mh ) – e P m ( s ′ | s mh , a mh ) (cid:17) I { Ω m } ≤ B ⋆ r | S | log | S || A | T δ X s , a M X m =1 n m ( s , a ) p N m + ( s , a ) . Proof.

Using the deﬁnition of the conﬁdence sets we obtain M X m =1 H m X h =1 X s ′ ∈ S e J m ( s ′ ) (cid:16) P ( s ′ | s mh , a mh ) – e P m ( s ′ | s mh , a mh ) (cid:17) I { Ω m } ≤≤ B ⋆ X s ∈ S X a ∈ A M X m =1 n m ( s , a ) k P ( · | s , a ) – e P m ( · | s , a ) k I { Ω m }≤ B ⋆ X s ∈ S X a ∈ A M X m =1 n m ( s , a ) s | S | log (cid:0) | S || A | N m + ( s , a )/ δ (cid:1) N m + ( s , a ) ≤ B ⋆ r | S | log | S || A | T δ X s ∈ S X a ∈ A M X m =1 n m ( s , a ) p N m + ( s , a ) .where the ﬁrst inequality follows from H¨older inequality and Lemma B.2, and the second because e P m and P are both in the conﬁdence set of Eq. (4) when Ω m holds. The third inequality follows because N m + ( s , a ) ≤ T . Lemma B.9 bounds the term in Eq. (12) using Azuma’s concentration inequality. Lemma B.9.

With probability at least δ /2 , M X m =1 H m X h =1 e J m ( s mh +1 ) – X s ′ ∈ S P ( s ′ | s mh , a mh ) e J m ( s ′ ) ! I { Ω m } ≤ B ⋆ r T log 4 T δ . Proof.

Consider the inﬁnite sequence of random variables X t = e J m ( s mh +1 ) – X s ′ ∈ S P ( s ′ | s mh , ˜ π m ( s mh )) e J m ( s ′ ) ! I { Ω m } ,where m is the interval containing time t , and h is the index of time step t within interval m . Notice thatthis is a martingale diﬀerence sequence, and | X t | ≤ B ⋆ by Lemma B.2. Now, we apply anytime Azuma’sinequality (Theorem D.1) to any preﬁx of the sequence { X t } ∞ t =1 . Thus, with probability at least 1 – δ /2, forevery T : T X t =1 X t ≤ B ⋆ r T log 4 T δ .18 .1.3 Proof of Theorem 3.1Theorem (restatement of Theorem 3.1) . Suppose that Assumption 2 holds. With probability at least δ the regret of Algorithm 1 is bounded as follows:R K = O s B ⋆ | S | | A | Kc min log KB ⋆ | S || A | δ c min + B ⋆ | S | | A | c min log KB ⋆ | S || A | δ c min ! . Lemma B.10.

Assume that the number of steps in every interval is is at most B ⋆ c min log m δ . Then for everys ∈ S and a ∈ A, M X m =1 n m ( s , a ) p N m + ( s , a ) ≤ p N M +1 ( s , a ) . Proof.

We claim that, by the assumption of the lemma, for every interval m we have that n m ( s , a ) ≤ N m + ( s , a ). Indeed, if ( s , a ) is unknown then n m ( s , a ) = 1 and since N m + ( s , a ) ≥ s , a ) is known then N m + ( s , a ) ≥ B ⋆ | S | c log B ⋆ | S || A | δ c min and by our assumption the length of the interval, andin particular n m ( s , a ), is at most B ⋆ c min log m δ . Our statement then follows by Jaksch et al. (2010, Lemma19). Proof of Theorem 3.1.

With probability at least 1 – δ , both Lemmas 3.3 and B.9 hold. Lemma 3.3 statesthat the length of every interval is at most B ⋆ c min log m δ , and Lemma B.10 obtains X s ∈ S X a ∈ A M X m =1 n m ( s , a ) p N m + ( s , a ) ≤ X ( s , a ) ∈ S × A p N M +1 ( s , a ) ≤ p | S || A | T , (13)where the last inequality follows from Jensen’s inequality and the fact that P ( s , a ) ∈ S × A N M +1 ( s , a ) ≤ T .Next, we sum the bounds of Lemmas B.7 to B.9 and use Eq. (13) to obtain R K ≤ B ⋆ | S | | A | c log B ⋆ | S || A | δ c min + 30 B ⋆ | S | r | A | T log | S || A | T δ + B ⋆ r T log 4 T δ .To ﬁnish the proof use Lemma 3.3 to bound T . B.2 Proofs for Section 4.1

B.2.1 Proof of Lemma 4.2Lemma (restatement of Lemma 4.2) . With probability at least δ /2 , Ω m holds for all intervals m simul-taneously.Proof. Fix a triplet ( s , a , s ′ ) ∈ S × A × S + . Consider an inﬁnite sequence ( Z i ) ∞ i =1 of draws from thedistribution P ( · | s , a ) and let X i = I { Z i = s ′ } . We apply Eq. (25) of Theorem D.3 with δ t = δ | S | | A | t to apreﬁx of length t of the sequence ( X i ) ∞ i =1 . Then divide Eq. (25) by t and obtain that, after simplifying usingthe assumptions that | S | ≥ | A | ≥

2, Eq. (6) holds with probability 1 – δ t . We repeat this argumentfor every preﬁx ( Z i ) ti =1 of ( Z i ) ∞ i =1 and for every state-action-state triplet. Then from the union bound weget that Ω m holds for all intervals m simultaneously with probability at least 1 – δ /2.19 .2.2 Proof of Lemma 4.3Lemma (restatement of Lemma 4.3) . It holds that e R M = M X m =1 H m X h =1 e J m ( s mh ) – e J m ( s mh +1 ) ! I { Ω m } – K · J π ⋆ ( s init )+ M X m =1 H m X h =1 e J m ( s mh +1 ) – X s ′ ∈ S e P m ( s ′ | s mh , a mh ) e J m ( s ′ ) ! I { Ω m } . Lemma B.11.

Let $m$ be an interval. If $\Omega^m$ holds then $\tilde\pi^m$ satisfies the Bellman equations in the optimistic model:

$$\tilde J^m(s) = c(s, \tilde\pi^m(s)) + \sum_{s' \in S} \tilde P_m(s' \mid s, \tilde\pi^m(s))\, \tilde J^m(s'), \quad \forall s \in S.$$

Proof.

Note that the Bellman equations hold in the optimistic model because, by the way we defined this model, there is a nonzero probability of transition to the goal state by any action from every state. Thus in the optimistic model every policy is a proper policy, and in particular Lemma 2.2 holds.
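
To make the "nonzero probability of transition to the goal state" concrete, here is a minimal sketch of the optimistic transition function of Eq. (5) for a single state-action pair. It assumes, as the proof above suggests, that the probability mass removed from the non-goal states is assigned to the goal state; the function name, the argument layout and the sample counts are hypothetical.

```python
import math


def optimistic_transitions(counts, n_visits, n_states, n_actions, delta):
    """Optimistic next-state probabilities for one state-action pair (s, a).

    counts:    observed counts N(s, a, s') for each non-goal next state s'
    n_visits:  total visit count N(s, a)
    Returns (p_tilde, goal_mass), where p_tilde follows Eq. (5) and goal_mass
    is the leftover probability, assumed here to go to the goal state.
    """
    n_plus = max(n_visits, 1)  # N_+(s, a)
    # A(s, a) = log(|S||A| N_+(s, a) / delta) / N_+(s, a)
    bonus = math.log(n_states * n_actions * n_plus / delta) / n_plus
    p_tilde = []
    for count in counts:
        p_bar = count / n_plus  # empirical transition probability
        shrunk = p_bar - 28.0 * bonus - 4.0 * math.sqrt(p_bar * bonus)
        p_tilde.append(max(shrunk, 0.0))  # clip at zero as in Eq. (5)
    goal_mass = 1.0 - sum(p_tilde)
    return p_tilde, goal_mass
```

Every positive empirical probability is strictly reduced, so the goal always receives positive mass in this sketch. With few samples every entry is clipped to zero and the whole mass moves to the goal.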

Proof of Lemma 4.3.

By Lemma B.11, we can use the Bellman equations in the optimistic model to have the following interpretation of the costs for every interval $m$ and time $h$:

$$\begin{aligned} c(s^m_h, a^m_h)\, \mathbb{I}\{\Omega^m\} &= \Big( \tilde J^m(s^m_h) - \sum_{s' \in S} \tilde P_m(s' \mid s^m_h, a^m_h)\, \tilde J^m(s') \Big)\, \mathbb{I}\{\Omega^m\} \\ &= \Big( \tilde J^m(s^m_h) - \tilde J^m(s^m_{h+1}) \Big)\, \mathbb{I}\{\Omega^m\} + \Big( \tilde J^m(s^m_{h+1}) - \sum_{s' \in S} \tilde P_m(s' \mid s^m_h, a^m_h)\, \tilde J^m(s') \Big)\, \mathbb{I}\{\Omega^m\}. \end{aligned} \tag{14}$$

We now write $\tilde R_M = \sum_{m=1}^{M} \sum_{h=1}^{H^m} c(s^m_h, a^m_h)\, \mathbb{I}\{\Omega^m\} - K \cdot J^{\pi^\star}(s_{\text{init}})$, and substitute for each cost using Eq. (14) to get the lemma.

B.2.3 Proof of Lemma 4.4

Lemma (restatement of Lemma 4.4). $\sum_{m=1}^{M} \sum_{h=1}^{H^m} \big( \tilde J^m(s^m_h) - \tilde J^m(s^m_{h+1}) \big)\, \mathbb{I}\{\Omega^m\} - K \cdot J^{\pi^\star}(s_{\text{init}}) \le 2 B_\star |S||A| \log T$.

Lemma B.12.

Let $m$ be an interval. If $\Omega^m$ holds then $\tilde J^m(s) \le J^{\pi^\star}(s) \le B_\star$ for every $s \in S$.

Proof.

Denote by $\tilde P$ the transition function computed by Algorithm 2 at the beginning of epoch $i(m)$, and by $\tilde J$ the cost-to-go with respect to $\tilde P$. We claim that for every proper policy $\pi$ and state $s \in S$, $\tilde J^\pi(s) \le J^\pi(s)$. Then, the lemma follows easily since $\tilde J^m(s) \le \tilde J^{\pi^\star}(s) \le J^{\pi^\star}(s) \le B_\star$.

Indeed, let $s \in S$ and consider the Bellman equations of $\pi$ with respect to $P$:

$$J^\pi(s) = c(s, \pi(s)) + \sum_{s' \in S} P(s' \mid s, \pi(s))\, J^\pi(s') \ge c(s, \pi(s)) + \sum_{s' \in S} \tilde P(s' \mid s, \pi(s))\, J^\pi(s'),$$

where the inequality follows because $\tilde P(s' \mid s, a) \le P(s' \mid s, a)$ for every $(s, a, s') \in S \times A \times S$. This holds since $P$ is in the confidence set of Eq. (6) (as $\Omega^m$ holds), and by the way $\tilde P$ is computed in Algorithm 2. It follows that $J^\pi(s) \ge \tilde J^\pi(s)$ for every $s \in S$, as required.

Proof of Lemma 4.4.

For every two consecutive intervals $m, m+1$, denoting $i = i(m)$, we have one of the following:

(i) If interval $m$ ended in the goal state then $\tilde J^{i(m)}(s^m_{H^m+1}) = \tilde J^{i(m)}(g) = 0$ and $\tilde J^{i(m+1)}(s^{m+1}_1) = \tilde J^{i(m+1)}(s_{\text{init}})$. Therefore, by Lemma B.12,

$$\tilde J^{i(m+1)}(s^{m+1}_1)\, \mathbb{I}\{\Omega^{m+1}\} - \tilde J^{i(m)}(s^m_{H^m+1})\, \mathbb{I}\{\Omega^m\} = \tilde J^{i(m+1)}(s_{\text{init}})\, \mathbb{I}\{\Omega^{m+1}\} \le J^{\pi^\star}(s_{\text{init}}).$$

This happens at most $K$ times.

(ii) If interval $m$ ended in an unknown state-action pair or because the cost reached $B_\star$, and we stay in the same epoch, then $i(m) = i(m+1) = i$ and $s^{m+1}_1 = s^m_{H^m+1}$. Thus

$$\tilde J^{i(m+1)}(s^{m+1}_1)\, \mathbb{I}\{\Omega^{m+1}\} - \tilde J^{i(m)}(s^m_{H^m+1})\, \mathbb{I}\{\Omega^m\} = \tilde J^{i}(s^{m+1}_1)\, \mathbb{I}\{\Omega^m\} - \tilde J^{i}(s^m_{H^m+1})\, \mathbb{I}\{\Omega^m\} = 0.$$

(iii) If interval $m$ ended by doubling the visit count to some state-action pair, then we start a new epoch. Thus by Lemma B.12,

$$\tilde J^{i(m+1)}(s^{m+1}_1)\, \mathbb{I}\{\Omega^{m+1}\} - \tilde J^{i(m)}(s^m_{H^m+1})\, \mathbb{I}\{\Omega^m\} \le \tilde J^{i+1}(s^{m+1}_1)\, \mathbb{I}\{\Omega^{m+1}\} \le B_\star.$$

This happens at most $2|S||A| \log T$ times.

To conclude, we have

$$\sum_{m=1}^{M} \sum_{h=1}^{H^m} \Big( \tilde J^{i(m)}(s^m_h) - \tilde J^{i(m)}(s^m_{h+1}) \Big)\, \mathbb{I}\{\Omega^m\} - K J^{\pi^\star}(s_{\text{init}}) \le K J^{\pi^\star}(s_{\text{init}}) + 2 B_\star |S||A| \log T - K J^{\pi^\star}(s_{\text{init}}) = 2 B_\star |S||A| \log T.$$

B.2.4 Proof of Lemma 4.5

Lemma (restatement of Lemma 4.5). With probability at least $1 - \delta/4$, the following holds for all $M = 1, 2, \ldots$ simultaneously.

$$\sum_{m=1}^{M} \sum_{h=1}^{H^m} \Big( \tilde J^m(s^m_{h+1}) - \sum_{s' \in S} \tilde P_m(s' \mid s^m_h, a^m_h)\, \tilde J^m(s') \Big)\, \mathbb{I}\{\Omega^m\} \le \sum_{m=1}^{M} \mathbb{E}\Big[ \sum_{h=1}^{H^m} \Big( \tilde J^m(s^m_{h+1}) - \sum_{s' \in S} \tilde P_m(s' \mid s^m_h, a^m_h)\, \tilde J^m(s') \Big)\, \mathbb{I}\{\Omega^m\} \;\Big|\; \bar U^{m-1} \Big] + 3 B_\star \sqrt{M \log \frac{8M}{\delta}}.$$

Proof.

Consider the following martingale diﬀerence sequence ( X m ) ∞ m =1 deﬁned by X m = H m X h =1 (cid:0)e J m ( s mh +1 ) – X s ′ ∈ S e P m ( s ′ | s mh , a mh ) e J m ( s ′ ) (cid:1) I { Ω m } .The Bellman optimality equations of ˜ π m with respect to e P m (Lemma B.11) obtains | X m | = (cid:12)(cid:12)(cid:12)(cid:12)(cid:18)e J m ( s mH m +1 ) – e J m ( s m ) | {z } ≤ B ⋆ + H m X h =1 c ( s mh , a mh ) | {z } ≤ B ⋆ (cid:19) I { Ω m } (cid:12)(cid:12)(cid:12)(cid:12) ≤ B ⋆ ,where the inequality follows from Lemma B.12 and the fact that the total cost within each interval at most2 B ⋆ by construction. Therefore, we use anytime Azuma’s inequality (Theorem D.1) to obtain that withprobability at least 1 – δ /4: M X m =1 X m ≤ M X m =1 E (cid:2) X m | ¯ U m –1 (cid:3) + 3 B ⋆ r M log 8 M δ .21 .2.5 Proof of Lemma 4.6Lemma (restatement of Lemma 4.6) . For every interval m and time h, denote A mh = log( | S || A | N m + ( s mh , a mh )/ δ ) N m + ( s mh , a mh ) . Then, E " H m X h =1 e J m ( s mh +1 ) – X s ′ ∈ S e P m ( s ′ | s mh , a mh ) e J m ( s ′ ) ! I { Ω m } | ¯ U m –1 ≤ · E " H m X h =1 q | S | V mh A mh I { Ω m } (cid:12)(cid:12)(cid:12)(cid:12) ¯ U m –1 + 272 · E " H m X h =1 B ⋆ | S | A mh I { Ω m } (cid:12)(cid:12)(cid:12)(cid:12) ¯ U m –1 , where V mh is the empirical variance deﬁned as V mh = X s ′ ∈ S + P ( s ′ | s mh , a mh ) e J m ( s ′ ) – X s ′′ ∈ S + P ( s ′′ | s mh , a mh ) e J m ( s ′′ ) ! .The next lemma gives a diﬀerent interpretation to the conﬁdence bounds of Eq. (6), and will be usefulin the proofs that follow. Lemma B.13.

Denote A mh = log( | S || A | N m + ( s , a )/ δ )/ N m + ( s , a ) . When Ω m holds we have for any ( s , a , s ′ ) ∈ S × A × S + : (cid:12)(cid:12) P ( s ′ | s , a ) – e P m ( s ′ | s , a ) (cid:12)(cid:12) ≤ q P ( s ′ | s , a ) A mh + 136 A mh . Proof.

Since Ω m holds we have for all ( s , a , s ′ ) ∈ S × A × S + that¯ P m ( s ′ | s , a ) – P ( s ′ | s , a ) ≤ q ¯ P m ( s ′ | s , a ) A mh + 28 A mh .This is a quadratic inequality in p ¯ P m ( s ′ | s , a ). Using the fact that x ≤ a · x + b implies x ≤ a + √ b with a = 4 p A mh and b = P ( s ′ | s , a ) + 28 A mh , we have q ¯ P m ( s ′ | s , a ) ≤ p A mh + q P ( s ′ | s , a ) + 28 A mh ≤ p P ( s ′ | s , a ) + 10 p A mh ,where we used the inequality √ x + y ≤ √ x + √ y that holds for any x ≥ y ≥

0. Substituting backinto Eq. (6) obtains (cid:12)(cid:12) P ( s ′ | s , a ) – ¯ P m ( s ′ | s , a ) (cid:12)(cid:12) ≤ q P ( s ′ | s , a ) A mh + 68 A mh .From a similar argument (cid:12)(cid:12) e P m ( s ′ | s , a ) – ¯ P m ( s ′ | s , a ) (cid:12)(cid:12) ≤ q P ( s ′ | s , a ) A mh + 68 A mh .Using the triangle inequality ﬁnishes the proof. Proof of Lemma 4.6.

Denote X m = (cid:0)P H m h =1 e J m ( s mh +1 ) – P s ′ ∈ S e P m ( s ′ | s mh , a mh ) e J m ( s ′ ) (cid:1) I { Ω m } , and Z mh = (cid:0)e J m ( s mh +1 ) – P s ′ ∈ S P ( s ′ | s mh , a mh ) e J m ( s ′ ) (cid:1) I { Ω m } . Think of the interval as an inﬁnite continuous stochasticprocess, and note that, conditioned on ¯ U m –1 , (cid:0) Z mh (cid:1) ∞ h =1 is a martingale diﬀerence sequence w.r.t ( U h ) ∞ h =1 ,where U h is the trajectory of the learner from the beginning of the interval and up to and including time h . This holds since, by conditioning on ¯ U m –1 , Ω m is determined and is independent of the randomnessgenerated during the interval. Note that H m is a stopping time with respect to ( Z mh ) ∞ h =1 which is bounded22y 2 B ⋆ / c min . Hence by the optional stopping theorem E [ P H m h =1 Z mh | ¯ U m –1 ] = 0, which gets us E [ X m | ¯ U m –1 ] = E " H m X h =1 (cid:18)e J m ( s mh +1 ) – X s ′ ∈ S e P m ( s ′ | s mh , a mh ) e J m ( s ′ ) (cid:19) I { Ω m } | ¯ U m –1 = E " H m X h =1 Z mh | ¯ U m –1 + E " H m X h =1 X s ′ ∈ S (cid:0) P ( s ′ | s mh , a mh ) – e P , ( s ′ | s mh , a mh ) (cid:1)e J m ( s ′ ) I { Ω m } | ¯ U m –1 = E " H m X h =1 X s ′ ∈ S (cid:0) P ( s ′ | s mh , a mh ) – e P m ( s ′ | s mh , a mh ) (cid:1)e J m ( s ′ ) I { Ω m } | ¯ U m –1 .Furthermore, we have E " H m X h =1 X s ′ ∈ S (cid:0) P ( s ′ | s mh , a mh ) – e P m ( s ′ | s mh , a mh ) (cid:1)e J m ( s ′ ) I { Ω m } | ¯ U m –1 = E " H m X h =1 X s ′ ∈ S + (cid:18) P ( s ′ | s mh , a mh ) – e P m ( s ′ | s mh , a mh ) (cid:19)(cid:18)e J m ( s ′ ) – X s ′′ ∈ S + P ( s ′′ | s mh , a mh ) e J m ( s ′′ ) (cid:19) I { Ω m } | ¯ U m –1 ≤ E " H m X h =1 X s ′ ∈ S + vuut A mh P ( s ′ | s mh , a mh ) e J m ( s ′ ) – X s ′′ ∈ S + P ( s ′′ | s mh , a mh ) e J m ( s ′′ ) ! 
I { Ω m } | ¯ U m –1 + E " H m X h =1 X s ′ ∈ S + A mh (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)e J m ( s ′ ) – X s ′′ ∈ S + P ( s ′′ | s mh , a mh ) e J m ( s ′′ ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) I { Ω m } | ¯ U m –1 ≤ E " H m X h =1 q | S | V mh A mh I { Ω m } + 272 | S | B ⋆ A mh I { Ω m } | ¯ U m –1 ,where the ﬁrst equality follows since e J m ( g ) = 0, and P ( · | s mh , a mh ) and e P i ( · | s mh , a mh ) are probabilitydistributions over S + whence P s ′′ ∈ S + P ( s ′′ | s mh , a mh ) e J m ( s ′′ ) does not depend on s ′ . The ﬁrst inequalityfollows from Lemma B.13, and the second inequality from Jensen’s inequality, Lemma B.12, | S + | ≤ | S | ,and the deﬁnition of V mh . B.2.6 Proof of Lemma 4.7Lemma (restatement of Lemma 4.7) . For any interval m, E (cid:2)P H m h =1 V mh I { Ω m } | ¯ U m –1 (cid:3) ≤ B ⋆ . Lemma B.14.

By Lemma B.13 we have that (cid:12)(cid:12) e P m (cid:0) s ′ | s , a (cid:1) – P (cid:0) s ′ | s , a (cid:1)(cid:12)(cid:12) ≤ s P ( s ′ | s , a ) log (cid:0) | S || A | N m + ( s , a )/ δ (cid:1) N m + ( s , a ) + 136 log (cid:0) | S || A | N m + ( s , a )/ δ (cid:1) N m + ( s , a )which gives the required bound because log( x )/ x is decreasing, and ( s , a ) is a known state-action pair so N m + ( s , a ) ≥ · B ⋆ | S | c min log B ⋆ | S || A | δ c min . 23 roof of Lemma 4.7. Note that the ﬁrst state-action pair in the subinterval, ( s m , a m ), might be unknownand that all state-action pairs that appear afterwards are known. Thus, we bound E " H m X h =1 V mh | ¯ U m –1 = E " V m I { Ω m } | ¯ U m –1 + E " H m X h =2 V mh I { Ω m } | ¯ U m –1 .The ﬁrst summand is trivially bounded by B ⋆ (Lemma B.12). We now upper bound E (cid:2)P H m h =2 V mh I { Ω m } | ¯ U m –1 (cid:3) . Denote Z mh = (cid:0)e J m ( s mh +1 ) – P s ′ ∈ S P ( s ′ | s mh , a mh ) e J m ( s ′ ) (cid:1) I { Ω m } , and think of the interval as aninﬁnite continuous stochastic process. Note that, conditioned on ¯ U m –1 , (cid:0) Z mh (cid:1) ∞ h =1 is a martingale diﬀerencesequence w.r.t ( U h ) ∞ h =1 , where U h is the trajectory of the learner from the beginning of the interval and upto time h and including. This holds since, by conditioning on ¯ U m –1 , Ω m is determined and is independentof the randomness generated during the interval. Note that H m is a stopping time with respect to ( Z mh ) ∞ h =1 which is bounded by 2 B ⋆ / c min . Therefore, applying Lemma B.15 found below obtains E " H m X h =2 V mh I { Ω m } | ¯ U m –1 = E " H m X h =2 Z mh I { Ω m } ! | ¯ U m –1 . (15)We now proceed by bounding | P H m h =1 Z mh | when Ω m occurs. 
Therefore, (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) H m X h =2 Z mh (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) H m X h =2 e J m ( s mh +1 ) – X s ′ ∈ S P ( s ′ | s mh , a mh ) e J m ( s ′ ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) H m X h =2 e J m ( s mh +1 ) – e J m ( s mh ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (16)+ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) H m X h =2 e J m ( s mh ) – X s ′ ∈ S e P m ( s ′ | s mh , a mh ) e J m ( s ′ ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (17)+ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) H m X h =2 X s ′ ∈ S + (cid:16) P ( s ′ | s mh , a mh ) – e P m ( s ′ | s mh , a mh ) (cid:17)(cid:16)e J m ( s ′ ) – X s ′′ ∈ S + P ( s ′′ | s mh , a mh ) e J m ( s ′′ ) (cid:17)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ,(18)where Eq. (18) is given as P ( · | s mh , a mh ) and e P i ( · | s mh , a mh ) are probability distributions over S + , P s ′′ ∈ S + P ( s ′′ | s mh , a mh ) e J m ( s ′′ ) is constant w.r.t s ′ , and e J m ( g ) = 0.We now bound each of the three terms above individually. Eq. (16) is a telescopic sum that is at most B ⋆ on Ω m (Lemma B.12). For Eq. (17), we use the Bellman equations for ˜ π m on the optimistic model deﬁned bythe transitions e P m (Lemma B.11) thus it is at most P H m h =2 c (cid:0) s mh , a mh (cid:1) ≤ B ⋆ (see text following Lemma 4.5).For Eq. (18), recall that all states-action pairs at times h = 2, . . . , H m are known by deﬁnition of H m . 
Hence24y Lemma B.14, (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) X s ′ ∈ S + (cid:16)e J m ( s ′ ) – X s ′′ ∈ S + P (cid:0) s ′′ | s mh , a mh (cid:1)e J m ( s ′′ ) (cid:17)(cid:16) e P m (cid:0) s ′ | s mh , a mh (cid:1) – P (cid:0) s ′ | s mh , a mh (cid:1)(cid:17)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ X s ′ ∈ S + s c min · P (cid:0) s ′ | s mh , a mh (cid:1)(cid:0)e J m ( s ′ ) – P s ′′ ∈ S + P (cid:0) s ′′ | s mh , a mh (cid:1)e J m ( s ′′ ) (cid:1) | S | B ⋆ + X s ′ ∈ S + c min | S | B ⋆ · (cid:12)(cid:12)(cid:12)e J m ( s ′ ) – X s ′′ ∈ S P (cid:0) s ′′ | s mh , a mh (cid:1)e J m ( s ′′ ) (cid:12)(cid:12)(cid:12)| {z } ≤ B ⋆ by Lemma B.12 ≤ s c min · V mh B ⋆ + c (cid:0) s mh , a mh (cid:1) c min ≤ c ( s mh , a mh ), | S + | ≤ | S | )and again by Jensen’s inequality and that the total cost throughout the interval is at most 2 B ⋆ , we have onΩ mH m X h =2 s c min · V mh B ⋆ + c (cid:0) s mh , a mh (cid:1) ≤ vuuut H m |{z} ≤ B ⋆ / c min · H m X h =2 c min · V mh B ⋆ + 12 H m X h =2 c (cid:0) s mh , a mh (cid:1)| {z } ≤ B ⋆ (Jensen’s inequality) ≤ vuut H m X h =2 V mh + B ⋆ .Plugging these bounds back into Eq. (15) gets us E " H m X h =2 V mh I { Ω m } (cid:12)(cid:12)(cid:12)(cid:12) ¯ U m –1 ≤ E " B ⋆ + 14 vuut H m X h =1 V mh I { Ω m } ! (cid:12)(cid:12)(cid:12)(cid:12) ¯ U m –1 ≤ B ⋆ + 14 E " H m X h =2 V mh I { Ω m } (cid:12)(cid:12)(cid:12)(cid:12) ¯ U m –1 ,where the last inequality is by the elementary inequality ( a + b ) ≤ a + b ). Rearranging gets us E (cid:2)P H m h =2 V mh I { Ω m } | ¯ U m –1 (cid:3) ≤ B ⋆ , and the lemma follows. Lemma B.15.

Let ( X t ) ∞ t =1 be a martingale diﬀerence sequence adapted to the ﬁltration ( F t ) ∞ t =0 . Let Y n =( P nt =1 X t ) – P nt =1 E [ X t | F t –1 ] . Then ( Y n ) ∞ n =0 is a martingale, and in particular if τ is a stopping timesuch that τ ≤ c almost surely, then E [ Y τ ] = 0 . roof. We ﬁrst show that ( Y n ) ∞ n =1 is a martingale. Indeed, E [ Y n | F n –1 ] = E " n X t =1 X t ! – n X t =1 E [ X t | F t –1 ] | F n –1 = E " n –1 X t =1 X t ! – 2 n –1 X t =1 X t ! X n + X n – n X t =1 E [ X t | F t –1 ] | F n –1 = n –1 X t =1 X t ! – 2 n –1 X t =1 X t ! · E [ X n | F n –1 ] – n X t =1 E [ X t | F t –1 ] ( E [ X n | F n –1 ] = 0)= n –1 X t =1 X t ! – n –1 X t =1 E [ X t | F t –1 ] = Y n –1 .We would now like to show that E [ Y τ ] = E [ Y ] = 0 using the optional stopping theorem. The latterholds since τ ≤ c almost surely and E [ Y ] = E [ X – E [ X | F ]] = 0. B.2.7 Proof of Lemma 4.8Lemma (restatement of Lemma 4.8) . With probability at least δ /4 , M X m =1 E " H m X h =1 X s ′ ∈ S (cid:0) P ( s ′ | s mh , a mh ) – e P m ( s ′ | s mh , a mh ) (cid:1)e J m ( s ′ ) I { Ω m } | ¯ U m –1 ≤ B ⋆ r M | S | | A | log T | S || A | δ + 8160 B ⋆ | S | | A | log T | S || A | δ . Proof.

Recall the following deﬁnitions: A mh = log( | S || A | N m + ( s mh , a mh )/ δ ) N m + ( s mh , a mh ) . V mh = X s ′ ∈ S + P ( s ′ | s mh , a mh ) e J m ( s ′ ) – X s ′′ ∈ S + P ( s ′′ | s mh , a mh ) e J m ( s ′′ ) ! .From Lemma 4.6 we have that E " H m X h =1 X s ′ ∈ S (cid:0) P ( s ′ | s mh , a mh ) – e P i ( s ′ | s mh , a mh ) (cid:1)e J i ( s ′ ) I { Ω m } | ¯ U m –1 ≤ E " p | S | H m X h =1 p V mh A mh I { Ω m } + 272 B ⋆ | S | A mh I { Ω m } | ¯ U m –1 .Moreover, by applying the Cauchy-Schwartz inequality twice, we get that E " H m X h =1 p V mh A mh I { Ω m } (cid:12)(cid:12)(cid:12)(cid:12) ¯ U m –1 ≤ E "vuut H m X h =1 V mh I { Ω m } · vuut H m X h =1 A mh I { Ω m } (cid:12)(cid:12)(cid:12)(cid:12) ¯ U m –1 ≤ vuut E " H m X h =1 A mh I { Ω m } (cid:12)(cid:12)(cid:12)(cid:12) ¯ U m –1 · vuut E " H m X h =1 V mh I { Ω m } (cid:12)(cid:12)(cid:12)(cid:12) ¯ U m –1 ≤ B ⋆ vuut E " H m X h =1 A mh I { Ω m } (cid:12)(cid:12)(cid:12)(cid:12) ¯ U m –1 . (Lemma 4.7)26e sum over all intervals to obtain M X m =1 E " H m X h =1 X s ′ ∈ S (cid:0) P ( s ′ | s mh , a mh ) – e P i ( s ′ | s mh , a mh ) (cid:1)e J i ( s ′ ) I { Ω m } | ¯ U m –1 ≤≤ B ⋆ M X m =1 vuut | S | E " H m X h =1 A mh I { Ω m } | ¯ U m –1 + 272 B ⋆ | S | M X m =1 E " H m X h =1 A mh I { Ω m } | ¯ U m –1 ≤ B ⋆ vuut M | S | M X m =1 E " H m X h =1 A mh I { Ω m } | ¯ U m –1 + 272 B ⋆ | S | M X m =1 E " H m X h =1 A mh I { Ω m } | ¯ U m –1 ,where the last inequality follows from Jensen’s inequality. We ﬁnish the proof using Lemma B.16 below. Lemma B.16.

With probability at least $1 - \delta/4$, the following holds for $M = 1, 2, \ldots$ simultaneously:
$$\sum_{m=1}^{M} \mathbb{E}\bigg[ \sum_{h=1}^{H^m} A^m_h\, \mathbb{I}\{\Omega^m\} \,\Big|\, \bar U^{m-1} \bigg] \le O\Big( |S||A| \log^2 \frac{T|S||A|}{\delta} \Big).$$

Proof.

Define the infinite sequence of random variables $X^m = \sum_{h=1}^{H^m} A^m_h\, \mathbb{I}\{\Omega^m\}$, for which $|X^m| \le 3\log(|S||A|/\delta)$ due to Lemma B.17 below. We apply Eq. (26) of Lemma D.4 to obtain with probability at least $1 - \delta/4$, for all $M = 1, 2, \ldots$ simultaneously,
$$\sum_{m=1}^{M} \mathbb{E}\big[ X^m \mid \bar U^{m-1} \big] \le 2\sum_{m=1}^{M} X^m + 12 \log\Big(\frac{|S||A|}{\delta}\Big) \log\Big(\frac{8M}{\delta}\Big).$$
Now, we bound the sum over $X^m$ by rewriting it as a sum over epochs:
$$\sum_{m=1}^{M} X^m \le \sum_{m=1}^{M} \sum_{h=1}^{H^m} \frac{\log(|S||A| N^m_+(s^m_h, a^m_h)/\delta)}{N^m_+(s^m_h, a^m_h)} \le \log\frac{|S||A|T}{\delta} \sum_{s \in S} \sum_{a \in A} \sum_{i=1}^{E} \frac{n_i(s,a)}{N^i_+(s,a)},$$
where $E$ is the last epoch. Finally, from Lemma B.18 below we have that for every $(s,a) \in S \times A$,
$$\sum_{i=1}^{E} \frac{n_i(s,a)}{N^i_+(s,a)} \le 2\log N^{E+1}(s,a) \le 2\log T.$$
We now plug in the resulting bound for $\sum_{m=1}^{M} X^m$ and simplify the acquired expression by using $M \le T$.

Lemma B.17.

For any interval $m$, $\big| \sum_{h=1}^{H^m} A^m_h \big| \le 3\log(|S||A|/\delta)$.

Proof.

Note that all state-action pairs ( s mh , a mh ) (except the ﬁrst one ( s m , a m )) are known. Hence, for h ≥ N i + ( s mh , a mh ) ≥ · B ⋆ | S | c min log B ⋆ | S || A | δ c min . Therefore, since log( x )/ x is decreasing and since | S | ≥ | A | ≥ H m X h =1 log( | S || A | N i + ( s mh , a mh )/ δ ) N i + ( s mh , a mh ) ≤ log( | S || A | N i + ( s m , a m )/ δ ) N i + ( s m , a m ) + H m X h =2 log( | S || A | N i + ( s mh , a mh )/ δ ) N i + ( s mh , a mh ) ≤ log( | S || A | / δ ) + c min H m B ⋆ ≤ log( | S || A | / δ ) + 2 ( H m ≤ B ⋆ c min by deﬁnition.) ≤ | S || A | / δ ).27 emma B.18. For any sequence of integers z , . . . , z n with ≤ z k ≤ Z k –1 := max { P k –1 i =1 z i } and Z = 1 ,it holds that n X k =1 z k Z k –1 ≤ Z n . Proof.

We use the inequality x ≤ x ) for every 0 ≤ x ≤ n X k =1 z k Z k –1 ≤ n X k =1 log (cid:18) z k Z k –1 (cid:19) = 2 n X k =1 log Z k –1 + z k Z k –1 = 2 n X k =1 log Z k Z k –1 = 2 log n Y k =1 Z k Z k –1 = 2 log Z n . B.2.8 Proof of Theorem 2.4Theorem (restatement of Theorem 2.4) . Assume that Assumption 2 holds. With probability at least δ the regret of Algorithm 2 is bounded as follows:R K = O (cid:18) B ⋆ | S | p | A | K log KB ⋆ | S || A | δ c min + s B ⋆ | S | | A | c min log KB ⋆ | S || A | δ c min (cid:19) . Proof.

Let C M denote the cost of the learner after M intervals. First, with probability at least 1 – δ , we haveLemmas 4.2, 4.5 and 4.8 via a union bound. Now, as Ω m hold for all intervals, we have e R M = R M for anynumber of intervals M . Plugging in the bounds of Lemmas 4.4, 4.5 and 4.8 into Lemma 4.3, we have thatfor any number of intervals M : C M = O (cid:18) K · J π ⋆ ( s init ) + B ⋆ r M | S | | A | log T | S || A | δ + B ⋆ | S | | A | log T | S || A | δ (cid:19) .We now plug in the bounds on M and T from Observation 4.1 into the bound above. First, we plugin the bound on M . As long as the K episodes have not elapsed we have that M ≤ O (cid:0) C M / B ⋆ + K +2 | S || A | log T + B ⋆ | S | | A | c min log B ⋆ | S || A | δ c min (cid:1) . This gets after using the subadditivity of the square root to simplifythe resulting expression, C M = O (cid:18) K · J π ⋆ ( s init ) + B ⋆ r K | S | | A | log T | S || A | δ + r B ⋆ C M | S | | A | log T | S || A | δ + s B ⋆ | S | | A | c min log TB ⋆ | S || A | c min δ (cid:19) .From which, by solving for C M (using that x ≤ a √ x + b implies x ≤ ( a + √ b ) for a ≥ b ≥ J π ⋆ ( s init ) ≤ B ⋆ and our assumptions that K ≥ | S | | A | ,28 S | ≥ | A | ≥

2, we get that C M = O (cid:18)r B ⋆ | S | | A | log T | S || A | δ + vuut K · J π ⋆ ( s init ) + B ⋆ r K | S | | A | log T | S || A | δ + s B ⋆ | S | | A | c min log TB ⋆ | S || A | c min δ (cid:19) ! = O B ⋆ | S | | A | log T | S || A | δ + r B ⋆ | S | | A | log T | S || A | δ · vuut K · J π ⋆ ( s init ) + B ⋆ r K | S | | A | log T | S || A | δ + s B ⋆ | S | | A | c min log TB ⋆ | S || A | c min δ + K · J π ⋆ ( s init ) + B ⋆ r K | S | | A | log T | S || A | δ + s B ⋆ | S | | A | c min log TB ⋆ | S || A | c min δ ! = O B ⋆ | S | | A | log T | S || A | δ + B ⋆ r K | S | | A | log T | S || A | δ + vuut B ⋆ | S | | A | c log TB ⋆ | S || A | c min δ + K · J π ⋆ ( s init ) + B ⋆ r K | S | | A | log T | S || A | δ + s B ⋆ | S | | A | c min log TB ⋆ | S || A | c min δ ! = O K · J π ⋆ ( s init ) + B ⋆ r K | S | | A | log T | S || A | δ + s B ⋆ | S | | A | c min log TB ⋆ | S || A | c min δ ! . (19)Note that in particular, by simplifying the bound above, we have C M = O (cid:16)p B ⋆ | S | | A | KT / c min δ (cid:17) .Next we combine this with the fact, stated in Observation 4.1 that T ≤ C M / c min . Isolating T gets T = O (cid:16) B ⋆ | S | | A | Kc δ (cid:17) , and plugging this bound back into Eq. (19) and simplifying gets us C M = O (cid:18) K · J π ⋆ ( s init ) + B ⋆ | S | s | A | K log KB ⋆ | S || A | c min δ + s B ⋆ | S | | A | c min log KB ⋆ | S || A | c min δ (cid:19) .Finally, we note that the bound above holds for any number of intervals M as long as K episodes do notelapse. As the instantaneous costs in the model are positive, this means that the learner must eventuallyﬁnish the K episodes from which we derive the bound for R K claimed by the theroem. C Lower Bound

In this section we prove Theorem 2.7. At first glance, it is tempting to use the lower bound of Jaksch et al. (2010, Theorem 5) on the regret of learning average-reward MDPs, by reducing any problem instance of an average-reward MDP to an instance of SSP. However, it is unclear to us whether such a reduction is possible and, if it is, how to perform it (even though a reduction in the reverse direction is fairly straightforward in the unit-cost case; Tarbouriech et al., 2019). We consequently prove the theorem here directly.

By Yao's minimax principle, in order to derive a lower bound on the learner's regret, it suffices to show a distribution over MDP instances that forces any deterministic learner to suffer a regret of $\Omega(B_\star \sqrt{|S||A|K})$ in expectation.

To simplify our arguments, let us first consider the following simpler problem before considering the problem in its full generality. Think of a simple MDP with two states: the initial state $s_{\text{init}}$ and a goal state $g$. The set of actions $A$ has a special action $a^\star$ chosen uniformly at random a priori. Upon choosing the special action, the learner transitions to the goal state with probability $\approx 1/B_\star$ and remains at $s_{\text{init}}$ with the remaining probability. Concretely, $P(g \mid a^\star) = 1/B_\star$ and $P(s_{\text{init}} \mid a^\star) = 1 - 1/B_\star$, and for any other action $a \ne a^\star$ we have $P(g \mid a) = (1-\epsilon)/B_\star$ and $P(s_{\text{init}} \mid a) = 1 - (1-\epsilon)/B_\star$ for some $\epsilon \in (0, 1/8)$. The costs of all actions equal 1; i.e., $c(s_{\text{init}}, a) = 1$ for all $a \in A$. Clearly, the optimal policy constantly plays $a^\star$ and therefore $J^{\pi^\star}(s_{\text{init}}) = B_\star$.

Fix any deterministic learning algorithm. We shall now quantify the regret of the learner during a single episode in terms of the number of times that it chooses $a^\star$. Let $N_k$ denote the number of steps that the learner spends in $s_{\text{init}}$ during episode $k$, and let $N^\star_k$ be the number of times the learner plays $a^\star$ at $s_{\text{init}}$ during the episode. Note that $N_k$ is also the total cost that the learning algorithm suffers during episode $k$. We have the following lemma.

Lemma C.1. $\mathbb{E}[N_k] - J^{\pi^\star}(s_{\text{init}}) = \epsilon \cdot \mathbb{E}[N_k - N^\star_k]$.

Proof.

Let us denote by $s_1, s_2, \ldots$ and $a_1, a_2, \ldots$ the sequences of states and actions observed by the learner during the episode. We have
$$\begin{aligned}
\mathbb{E}[N_k] &= \sum_{t=1}^{\infty} \mathbb{P}[s_t = s_{\text{init}}] = 1 + \sum_{t=2}^{\infty} \mathbb{P}[s_t = s_{\text{init}}] \\
&= 1 + \sum_{t=2}^{\infty} \mathbb{P}[s_t = s_{\text{init}} \mid s_{t-1} = s_{\text{init}}, a_{t-1} = a^\star]\, \mathbb{P}[s_{t-1} = s_{\text{init}}, a_{t-1} = a^\star] \\
&\quad + \sum_{t=2}^{\infty} \mathbb{P}[s_t = s_{\text{init}} \mid s_{t-1} = s_{\text{init}}, a_{t-1} \ne a^\star]\, \mathbb{P}[s_{t-1} = s_{\text{init}}, a_{t-1} \ne a^\star] \\
&= 1 + \sum_{t=2}^{\infty} \Big(1 - \frac{1}{B_\star}\Big) \mathbb{P}[s_{t-1} = s_{\text{init}}, a_{t-1} = a^\star] + \sum_{t=2}^{\infty} \Big(1 - \frac{1-\epsilon}{B_\star}\Big) \mathbb{P}[s_{t-1} = s_{\text{init}}, a_{t-1} \ne a^\star] \\
&= 1 + \Big(1 - \frac{1}{B_\star}\Big) \sum_{t=1}^{\infty} \mathbb{P}[s_t = s_{\text{init}}, a_t = a^\star] + \Big(1 - \frac{1-\epsilon}{B_\star}\Big) \sum_{t=1}^{\infty} \mathbb{P}[s_t = s_{\text{init}}, a_t \ne a^\star] \\
&= 1 + \Big(1 - \frac{1}{B_\star}\Big) \mathbb{E}[N^\star_k] + \Big(1 - \frac{1-\epsilon}{B_\star}\Big) \mathbb{E}[N_k - N^\star_k].
\end{aligned}$$
Rearranging using $J^{\pi^\star}(s_{\text{init}}) = B_\star$ gives the lemma's statement.

By Lemma C.1 the overall regret of the learner over $K$ episodes is $\mathbb{E}[R_K] = \epsilon \cdot \mathbb{E}[N - N^\star]$, where $N = \sum_{k=1}^{K} N_k$ and $N^\star = \sum_{k=1}^{K} N^\star_k$. In words, the regret of the learner is $\epsilon$ times the expected number of visits to $s_{\text{init}}$ in which the learner did not play $a^\star$.

In the remainder of the proof we lower bound $N$ in expectation and upper bound the expected value of $N^\star$. To upper bound $N^\star$, we use standard techniques from lower bounds for multi-armed bandits (Auer et al., 2002) that bound the total variation distance between the distribution of the sequence of states traversed by the learner in the original MDP and that generated in a "uniform MDP" in which all actions are identical. However, we cannot apply this argument directly since it requires $N^\star$ to be bounded almost surely, yet here $N^\star$ depends on the total length of all $K$ episodes, which is unbounded in general. We fix this issue by looking only at the first $T$ steps (where $T$ is to be determined) and showing that the regret is large even in these $T$ steps.

(Footnote to the transition function defined above: for ease of notation, and since there is only one state other than $g$, we do not write this state as the origin state in the definition of the transition function.)

Formally, we view the run of the $K$ episodes as a continuous process in which, when the learner reaches the goal state, we transfer it to $s_{\text{init}}$ (at no cost) and let it restart from there. Furthermore, we cap the learning process to consist of exactly $T$ steps as follows. If the $K$ episodes are completed before $T$ steps have elapsed, the learner remains in $g$ (until completing $T$ steps) without suffering any additional cost; otherwise we stop the learner after $T$ steps, before it completes its $K$ episodes. In this capped process, we denote the number of visits to $s_{\text{init}}$ by $N_-$ and the number of times the learner played $a^\star$ in $s_{\text{init}}$ by $N^\star_-$. We have
$$\mathbb{E}[R_K] \ge \epsilon \cdot \big( \mathbb{E}[N_-] - \mathbb{E}[N^\star_-] \big). \tag{20}$$
The number of visits to $s_{\text{init}}$ under this capping is lower bounded by the following lemma.

Lemma C.2.

For any deterministic learner, if $T \ge 2KB_\star$ then we have that $\mathbb{E}[N_-] \ge KB_\star/4$.

Proof. If the capped learner finished its $K$ episodes then $N_- = N$. Otherwise, it visits the goal state fewer than $K$ times and therefore $N_- \ge T - K$. Hence
$$\mathbb{E}[N_-] \ge \mathbb{E}\big[ \min\{T - K, N\} \big] \ge \sum_{k=1}^{K} \mathbb{E}\big[ \min\{T/K - 1, N_k\} \big].$$
Since $T \ge 2KB_\star$, the lemma will follow if we show that $N_k \ge B_\star$ with probability at least $1/4$. We lower bound the probability that $N_k \ge B_\star$ by the probability of staying at $s_{\text{init}}$ for $B_\star$ steps and picking $a^\star$ in the first $B_\star - 1$ steps. Indeed, using $(1 - 1/x)^{x-1} \ge 1/e$ for $x \ge 1$, we get that
$$\mathbb{P}[N_k \ge B_\star] \ge \Big(1 - \frac{1}{B_\star}\Big)^{B_\star - 1} \ge \frac{1}{e} \ge \frac{1}{4}.$$

We now introduce an additional distribution of the transitions, which we call $P_{\text{unif}}$. $P_{\text{unif}}$ is identical to $P$ as defined above, except that $P_{\text{unif}}(g \mid a) = (1-\epsilon)/B_\star$ for all actions $a$. We denote expectations over $P_{\text{unif}}$ by $\mathbb{E}_{\text{unif}}$. The following lemma uses standard lower bound techniques for multi-armed bandits (see, e.g., Jaksch et al., 2010, Theorem 13) to bound the difference in the expectation of $N^\star_-$ when the learner plays in $P$ compared to when it plays in $P_{\text{unif}}$.

Lemma C.3.

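For any deterministic learner we have that $\mathbb{E}[N^\star_-] \le \mathbb{E}_{\text{unif}}[N^\star_-] + \epsilon T \sqrt{\mathbb{E}_{\text{unif}}[N^\star_-]/B_\star}$.

Proof. Fix any deterministic learner. Let us denote by $s^{(t)}$ the sequence of states observed by the learner up to and including time $t$. Now, as $N^\star_- \le T$ and $N^\star_-$ is a function of $s^{(T)}$, we have $\mathbb{E}[N^\star_-] \le \mathbb{E}_{\text{unif}}[N^\star_-] + T \cdot \mathrm{TV}(\mathbb{P}_{\text{unif}}[s^{(T)}], \mathbb{P}[s^{(T)}])$, and Pinsker's inequality yields
$$\mathrm{TV}\big(\mathbb{P}_{\text{unif}}[s^{(T)}], \mathbb{P}[s^{(T)}]\big) \le \sqrt{\tfrac{1}{2}\, \mathrm{KL}\big(\mathbb{P}_{\text{unif}}[s^{(T)}] \,\|\, \mathbb{P}[s^{(T)}]\big)}. \tag{21}$$
Next, the chain rule of the KL divergence obtains
$$\mathrm{KL}\big(\mathbb{P}_{\text{unif}}[s^{(T)}] \,\|\, \mathbb{P}[s^{(T)}]\big) = \sum_{t=1}^{T} \sum_{s^{(t-1)}} \mathbb{P}_{\text{unif}}[s^{(t-1)}] \cdot \mathrm{KL}\big(\mathbb{P}_{\text{unif}}[s_t \mid s^{(t-1)}] \,\|\, \mathbb{P}[s_t \mid s^{(t-1)}]\big).$$
Observe that at any time, since the learning algorithm is deterministic, the learner chooses an action given $s^{(t-1)}$ regardless of whether $s^{(t-1)}$ was generated under $P$ or under $P_{\text{unif}}$. Thus, $\mathrm{KL}(\mathbb{P}_{\text{unif}}[s_t \mid s^{(t-1)}] \,\|\, \mathbb{P}[s_t \mid s^{(t-1)}])$ is zero if $a_{t-1} \ne a^\star$, and otherwise
$$\begin{aligned}
\mathrm{KL}\big(\mathbb{P}_{\text{unif}}[s_t \mid s^{(t-1)}] \,\|\, \mathbb{P}[s_t \mid s^{(t-1)}]\big) &= \sum_{s \in \{s_{\text{init}}, g\}} \mathbb{P}_{\text{unif}}[s_t = s \mid s_{t-1} = s_{\text{init}}, a_{t-1} = a^\star] \log \frac{\mathbb{P}_{\text{unif}}[s_t = s \mid s_{t-1} = s_{\text{init}}, a_{t-1} = a^\star]}{\mathbb{P}[s_t = s \mid s_{t-1} = s_{\text{init}}, a_{t-1} = a^\star]} \\
&= \frac{1-\epsilon}{B_\star} \cdot \log(1-\epsilon) + \Big(1 - \frac{1-\epsilon}{B_\star}\Big) \log\Big(1 + \frac{\epsilon}{B_\star - 1}\Big) \\
&\le \frac{\epsilon^2}{B_\star - 1},
\end{aligned}$$
using $\log(1+x) \le x$ for all $x > -1$. Since this term is nonzero only at steps in which $a^\star$ is played at $s_{\text{init}}$, summing over $t$ and using $B_\star \ge 2$ bounds the total KL divergence by $\mathbb{E}_{\text{unif}}[N^\star_-] \cdot \epsilon^2/(B_\star - 1) \le 2\epsilon^2\, \mathbb{E}_{\text{unif}}[N^\star_-]/B_\star$. Plugging this into Eq. (21) gives the lemma.

Theorem C.4.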

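Suppose that $B_\star \ge 2$, $\epsilon \in (0, \tfrac{1}{8})$ and $|A| \ge 16$. For the problem described above we have that
$$\mathbb{E}[R_K] \ge \epsilon K B_\star \Big( \frac{1}{8} - 2\epsilon \sqrt{\frac{2K}{|A|}} \Big).$$

Proof of Theorem C.4. Note that as under $P_{\text{unif}}$ the transition distributions are identical for all actions, we have that
$$\sum_{a \in A \,\mid\, a^\star = a} \mathbb{E}_{\text{unif}}[N^\star_-] = \mathbb{E}_{\text{unif}}\Big[ \sum_{a \in A \,\mid\, a^\star = a} N^\star_- \Big] = \mathbb{E}_{\text{unif}}[N_-] \le T. \tag{22}$$
Suppose that $a^\star$ is sampled uniformly at random before the game starts. Denote the probability and expectation with respect to the distribution induced by a specific choice of $a^\star = a$ by $\mathbb{P}_a$ and $\mathbb{E}_a$ respectively. Then for $T = 2KB_\star$,
$$\begin{aligned}
\mathbb{E}[R_K] &= \frac{1}{|A|} \sum_{a \in A} \mathbb{E}_a[R_K] \ge \frac{\epsilon}{|A|} \sum_{a \in A} \mathbb{E}_a[N_- - N^\star_-] && \text{(Eq. (20))} \\
&\ge \frac{\epsilon}{|A|} \sum_{a \in A \,\mid\, a^\star = a} \bigg( \frac{KB_\star}{4} - \mathbb{E}_{\text{unif}}[N^\star_-] - \epsilon T \sqrt{\frac{\mathbb{E}_{\text{unif}}[N^\star_-]}{B_\star}} \bigg) && \text{(Lemmas C.2 and C.3)} \\
&\ge \epsilon \bigg( \frac{KB_\star}{4} - \frac{1}{|A|} \sum_{a \in A \,\mid\, a^\star = a} \mathbb{E}_{\text{unif}}[N^\star_-] - \epsilon T \sqrt{\frac{1}{B_\star \cdot |A|} \sum_{a \in A \,\mid\, a^\star = a} \mathbb{E}_{\text{unif}}[N^\star_-]} \bigg) && \text{(Jensen's inequality)} \\
&\ge \epsilon \bigg( \frac{KB_\star}{4} - \frac{T}{|A|} - \epsilon T \sqrt{\frac{T}{B_\star |A|}} \bigg) && \text{(Eq. (22))} \\
&= \epsilon \bigg( \frac{KB_\star}{4} - \frac{2KB_\star}{|A|} - 2\epsilon K B_\star \sqrt{\frac{2KB_\star}{|A| B_\star}} \bigg) \\
&= \epsilon K B_\star \bigg( \frac{1}{4} - \frac{2}{|A|} - 2\epsilon \sqrt{\frac{2K}{|A|}} \bigg).
\end{aligned}$$
The theorem follows from $|A| \ge 16$ and by rearranging.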
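As a numerical sanity check of three steps used above (the regret identity of Lemma C.1, the per-step KL bound in the proof of Lemma C.3, and the final simplification in the proof of Theorem C.4), the following Python sketch may be helpful. It is illustrative only and not part of the proofs: all function names are hypothetical, and the check of Lemma C.1 assumes a stationary randomized policy that plays $a^\star$ with a fixed probability `q` at every step, a simplification relative to the deterministic learners considered in the text.

```python
import math


def expected_visits(b_star, eps, q):
    """Return (E[N_k], E[N*_k]) on the two-state instance.

    Assumption (not from the paper): the policy is stationary and plays
    the special action a* with probability q at every visit to s_init.
    """
    # Per-step probability of moving from s_init to the goal under this policy.
    p_goal = q / b_star + (1.0 - q) * (1.0 - eps) / b_star
    n_total = 1.0 / p_goal  # the number of steps in s_init is geometric
    n_star = q * n_total    # each visit plays a* independently w.p. q
    return n_total, n_star


def kl_special_action(b_star, eps):
    """One-step KL(P_unif || P) when a* is played at s_init."""
    pu_goal = (1.0 - eps) / b_star  # goal probability under P_unif
    p_goal = 1.0 / b_star           # goal probability under P
    return (pu_goal * math.log(pu_goal / p_goal)
            + (1.0 - pu_goal) * math.log((1.0 - pu_goal) / (1.0 - p_goal)))


def c4_before_simplification(k, b_star, n_actions, eps):
    """Right-hand side after applying Eq. (22), with T = 2 K B*."""
    t = 2 * k * b_star
    return eps * (k * b_star / 4 - t / n_actions
                  - eps * t * math.sqrt(t / (b_star * n_actions)))


def c4_lower_bound(k, b_star, n_actions, eps):
    """Final bound stated in Theorem C.4."""
    return eps * k * b_star * (1 / 8 - 2 * eps * math.sqrt(2 * k / n_actions))


if __name__ == "__main__":
    # Lemma C.1: E[N_k] - B* equals eps * E[N_k - N*_k].
    n, n_star = expected_visits(b_star=4.0, eps=0.1, q=0.3)
    print(n - 4.0, 0.1 * (n - n_star))

    # Lemma C.3: the one-step KL is at most eps^2 / (B* - 1).
    print(kl_special_action(4.0, 0.1), 0.1 ** 2 / (4.0 - 1.0))

    # Theorem C.4: for |A| >= 16 the simplified bound is the weaker one.
    print(c4_before_simplification(1000, 4, 32, 0.01),
          c4_lower_bound(1000, 4, 32, 0.01))
```

With $|A| = 16$ the two Theorem C.4 expressions coincide, since $2/|A| = 1/8$; larger action sets leave slack in the final step.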

Proof of Theorem 2.7. Consider the following MDP. Let $S$ be the set of states disregarding $g$. The initial state is sampled uniformly at random from $S$. Each $s \in S$ has its own special action $a^\star_s$. The transition distributions are defined by $P(g \mid a^\star_s, s) = 1/B_\star$, $P(s \mid a^\star_s, s) = 1 - 1/B_\star$, and $P(g \mid a, s) = (1-\epsilon)/B_\star$, $P(s \mid a, s) = 1 - (1-\epsilon)/B_\star$ for any other action $a \in A \setminus \{a^\star_s\}$.

Note that for each $s \in S$, the learner is faced with a simple problem like the one described above, about which it cannot learn anything from other states $s' \ne s$. Therefore, we can apply Theorem C.4 for each $s \in S$ separately and lower bound the learner's expected regret by the sum of the regrets suffered at each $s \in S$, which depends on the number of times $s$ is drawn as the initial state. Since the states are chosen uniformly at random, there are many states (a constant fraction) that are chosen $\Theta(K/|S|)$ times. Summing the regret bounds of Theorem C.4 over only these states and choosing $\epsilon$ appropriately gives the sought-after bound.

Denote by $K_s$ the number of episodes that start in each state $s \in S$. Then
$$\mathbb{E}[R_K] \ge \sum_{s \in S} \mathbb{E}\bigg[ \epsilon K_s B_\star \Big( \frac{1}{8} - 2\epsilon \sqrt{\frac{2K_s}{|A|}} \Big) \bigg] = \frac{\epsilon K B_\star}{8} - 2\epsilon^2 B_\star \sqrt{\frac{2}{|A|}} \sum_{s \in S} \mathbb{E}\big[ K_s^{3/2} \big]. \tag{23}$$
Taking expectation over the initial states and applying the Cauchy-Schwarz inequality gives
$$\sum_{s \in S} \mathbb{E}\big[ K_s^{3/2} \big] \le \sum_{s \in S} \sqrt{\mathbb{E}[K_s]} \sqrt{\mathbb{E}[K_s^2]} = \sum_{s \in S} \sqrt{\mathbb{E}[K_s]} \sqrt{\mathbb{E}[K_s]^2 + \mathbb{V}[K_s]} = \sum_{s \in S} \sqrt{\frac{K}{|S|}} \sqrt{\frac{K^2}{|S|^2} + \frac{K(|S|-1)}{|S|^2}} \le K \sqrt{\frac{2K}{|S|}},$$
where we have used the expectation and variance formulas of the Binomial distribution. The lower bound is now given by applying the inequality above in Eq. (23) and choosing $\epsilon$ to be a sufficiently small constant multiple of $\sqrt{|A||S|/K}$.

D Concentration inequalities

Theorem D.1 (Anytime Azuma) . Let ( X n ) ∞ n =1 be a martingale diﬀerence sequence with respect to theﬁltration ( F n ) ∞ n =0 such that | X n | ≤ B almost surely. Then with probability at least δ , (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 X i (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ B r n log 2 n δ , ∀ n ≥ Theorem D.2 (Weissman et al., 2003) . Let p ( · ) be a distribution over m elements, and let ¯ p t ( · ) be theempirical distribution deﬁned by t iid samples from p ( · ) . Then, with probability at least δ , (cid:13)(cid:13) ¯ p t ( · ) – p ( · ) (cid:13)(cid:13) ≤ s m log δ t . Theorem D.3 (Anytime Bernstein) . Let ( X n ) ∞ n =1 be a sequence of i.i.d. random variables with expectation µ . Suppose that ≤ X n ≤ B almost surely. Then with probability at least δ , the following holds for alln ≥ simultaneously: (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 ( X i – µ ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ r B µ n log 2 n δ + B log 2 n δ . (24) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 ( X i – µ ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ vuut B n X i =1 X i log 2 n δ + 7 B log 2 n δ . (25) Proof.

Fix some n ≥

1. By Bernstein’s concentration inequality (see for example, Cesa-Bianchi and Lugosi,2006, Corollary A.3), we have with probability at least 1 – δ n that Eq. (24) holds. By a union bound, theinequality holds with probability at least 1 – δ for all n ≥ µ · n – n X i =1 X i ≤ r B µ n log 2 n δ + B log 2 n δ that is a quadratic inequality in µ . This implies that √ µ ≤ vuut n n X i =1 X i + 3 s B log n δ n .Plugging this inequality back into the RHS of Eq. (24) gets us Eq. (25).33 emma D.4. Let ( X n ) ∞ n =1 be a sequence of random variables with expectation adapted to the ﬁltration ( F n ) ∞ n =0 . Suppose that ≤ X n ≤ B almost surely. Then with probability at least δ , the following holdsfor all n ≥ simultaneously: n X i =1 E [ X i | F i –1 ] ≤ n X i =1 X i + 4 B log 2 n δ . (26) Proof.

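For all $n \ge 1$, we have
$$\begin{aligned}
\mathbb{E}\big[ e^{-X_n/B} \mid \mathcal{F}_{n-1} \big] &\le \mathbb{E}\Big[ 1 - \frac{X_n}{B} + \frac{X_n^2}{2B^2} \,\Big|\, \mathcal{F}_{n-1} \Big] && (e^{-x} \le 1 - x + \tfrac{x^2}{2} \text{ for all } x \ge 0) \\
&\le 1 - \frac{\mathbb{E}[X_n \mid \mathcal{F}_{n-1}]}{B} + \frac{\mathbb{E}[X_n \mid \mathcal{F}_{n-1}]}{2B} && (X_n \le B) \\
&= 1 - \frac{\mathbb{E}[X_n \mid \mathcal{F}_{n-1}]}{2B} \\
&\le e^{-\mathbb{E}[X_n \mid \mathcal{F}_{n-1}]/2B}. && (1 - x \le e^{-x} \text{ for all } x)
\end{aligned}$$
Hence, fix some $n \ge 1$; then
$$\begin{aligned}
\mathbb{E}\bigg[ \exp\Big( \frac{1}{B} \sum_{i=1}^{n} \Big( \tfrac{1}{2}\mathbb{E}[X_i \mid \mathcal{F}_{i-1}] - X_i \Big) \Big) \bigg]
&= \mathbb{E}\bigg[ \exp\Big( \frac{1}{B} \sum_{i=1}^{n-1} \Big( \tfrac{1}{2}\mathbb{E}[X_i \mid \mathcal{F}_{i-1}] - X_i \Big) \Big) \cdot \mathbb{E}\Big[ \exp\Big( \frac{1}{B} \Big( \tfrac{1}{2}\mathbb{E}[X_n \mid \mathcal{F}_{n-1}] - X_n \Big) \Big) \,\Big|\, \mathcal{F}_{n-1} \Big] \bigg] \\
&\le \mathbb{E}\bigg[ \exp\Big( \frac{1}{B} \sum_{i=1}^{n-1} \Big( \tfrac{1}{2}\mathbb{E}[X_i \mid \mathcal{F}_{i-1}] - X_i \Big) \Big) \bigg] \le 1,
\end{aligned}$$
where the last step follows by repeating the last argument inductively. Therefore,
$$\begin{aligned}
\mathbb{P}\bigg[ \sum_{i=1}^{n} \Big( \tfrac{1}{2}\mathbb{E}[X_i \mid \mathcal{F}_{i-1}] - X_i \Big) > 2B \log\frac{2n}{\delta} \bigg]
&\le \mathbb{P}\bigg[ \exp\Big( \frac{1}{B} \sum_{i=1}^{n} \Big( \tfrac{1}{2}\mathbb{E}[X_i \mid \mathcal{F}_{i-1}] - X_i \Big) \Big) > \frac{4n^2}{\delta^2} \bigg] \\
&\le \mathbb{E}\bigg[ \exp\Big( \frac{1}{B} \sum_{i=1}^{n} \Big( \tfrac{1}{2}\mathbb{E}[X_i \mid \mathcal{F}_{i-1}] - X_i \Big) \Big) \bigg] \cdot \frac{\delta^2}{4n^2} && \text{(Markov's inequality)} \\
&\le \frac{\delta^2}{4n^2}.
\end{aligned}$$
Hence the above holds for all $n \ge$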