Constrained episodic reinforcement learning in concave-convex and knapsack settings
Kianté Brantley, Miroslav Dudik, Thodoris Lykouris, Sobhan Miryoosefi, Max Simchowitz, Aleksandrs Slivkins, Wen Sun
Kianté Brantley (University of Maryland), Miroslav Dudik (Microsoft Research NYC), Thodoris Lykouris (Microsoft Research NYC), Sobhan Miryoosefi (Princeton University), Max Simchowitz (UC Berkeley), Aleksandrs Slivkins (Microsoft Research NYC), Wen Sun (Microsoft Research NYC). Kianté Brantley's work was supported by the National Science Foundation under Grant No. 1618193 and by an ACM SIGHPC/Intel Computational and Data Science Fellowship.

Abstract

We propose an algorithm for tabular episodic reinforcement learning with constraints. We provide a modular analysis with strong theoretical guarantees for settings with concave rewards and convex constraints, and for settings with hard constraints (knapsacks). Most of the previous work in constrained reinforcement learning is limited to linear constraints, and the remaining work focuses on either the feasibility question or settings with a single episode. Our experiments demonstrate that the proposed algorithm significantly outperforms these approaches in existing constrained episodic environments.

1 Introduction
Standard reinforcement learning (RL) approaches seek to maximize a scalar reward (Sutton and Barto, 1998, 2018; Schulman et al., 2015; Mnih et al., 2015), but in many settings this is insufficient, because the desired properties of the agent's behavior are better described using constraints. For example, an autonomous vehicle should not only get to its destination, but should also respect safety, fuel-efficiency, and human-comfort constraints along the way (Le et al., 2019); a robot should not only fulfill its task, but should also control its wear and tear, for example, by limiting the torque exerted on its motors (Tessler et al., 2019). Moreover, in many settings, we wish to satisfy such constraints already during training and not only at deployment. For example, a power grid, an autonomous vehicle, or real robotic hardware should avoid costly failures, where the hardware is damaged or humans are harmed, already during training (Leike et al., 2017; Ray et al., 2020). Constraints are also key in additional sequential decision-making applications, such as dynamic pricing with limited supply (e.g., Besbes and Zeevi, 2009; Babaioff et al., 2015), scheduling of resources on a computer cluster (Mao et al., 2016), and imitation learning, where the goal is to stay close to an expert behavior (Syed and Schapire, 2007; Ziebart et al., 2008; Sun et al., 2019).

In this paper we study constrained episodic reinforcement learning, which encompasses all of these applications. An important characteristic of our approach, distinguishing it from previous work (Altman, 1999; Achiam et al., 2017; Tessler et al., 2019; Miryoosefi et al., 2019; Ray et al., 2020), is our focus on efficient exploration, leading to reduced sample complexity. Notably, the modularity of our approach enables extensions to more complex settings, such as (i) maximizing concave objectives under convex constraints, and (ii) reinforcement learning under hard constraints, where the learner has to stop when some constraint is violated (e.g., a car runs out of gas). For these extensions, which we refer to as the concave-convex setting and the knapsack setting, we provide the first regret guarantees in the episodic setting (see the related work below for a detailed comparison). Moreover, our guarantees are anytime, meaning that the constraint violations are bounded at any point during learning, even if the learning process is interrupted. This is important for applications where the system continues to learn after it is deployed.

Our approach relies on the principle of optimism under uncertainty to explore efficiently. Our learning algorithms optimize their actions with respect to a model based on the empirical statistics, while optimistically overestimating rewards and underestimating the resource consumption (i.e., overestimating the distance from the constraint). This idea was previously introduced in multi-armed bandits (Agrawal and Devanur, 2014); extending it to episodic reinforcement learning poses additional challenges, since the policy space is exponential in the episode horizon. Circumventing these challenges, we provide a modular way to analyze this approach in the basic setting where both rewards and constraints are linear (Section 3), and then transfer this result to the more complicated concave-convex and knapsack settings (Sections 4 and 5). We empirically compare our approach with the only previous works that can handle convex constraints and show that our algorithmic innovations lead to significant empirical improvements (Section 6).
Related work.
Sample-efficient exploration in constrained episodic reinforcement learning has only recently started to receive attention. Most previous works on episodic reinforcement learning focus on unconstrained settings (Jaksch et al., 2010; Azar et al., 2017; Dann et al., 2017). A notable exception is the work of Cheung (2019), which provides theoretical guarantees for the reinforcement learning setting with a single episode, but requires a strong reachability assumption, which is not needed in the episodic setting studied here. Also, our results for the knapsack setting allow for a significantly smaller budget, as we illustrate in Section 5. Moreover, our approach is based on a tighter bonus, which leads to superior empirical performance (see Section 6). Recently, there have also been several concurrent and independent works on sample-efficient exploration for reinforcement learning with constraints (Singh et al., 2020; Efroni et al., 2020; Qiu et al., 2020; Ding et al., 2020). Unlike our work, all of these approaches focus on a linear reward objective and linear constraints and do not handle the concave-convex and knapsack settings that we consider.

Constrained reinforcement learning has also been studied in settings that do not focus on sample-efficient exploration (Achiam et al., 2017; Tessler et al., 2019; Miryoosefi et al., 2019). Among these, only Miryoosefi et al. (2019) handle convex constraints, albeit without a reward objective (they solve the feasibility problem). Since these works do not focus on sample-efficient exploration, their performance drastically deteriorates when the task requires exploration (as we show in Section 6).

Sample-efficient exploration under constraints has been studied in multi-armed bandits, starting from a line of work on dynamic pricing with limited supply (Besbes and Zeevi, 2009, 2011; Babaioff et al., 2015; Wang et al., 2014). A general setting for bandits with global knapsack constraints (bandits with knapsacks) was defined and solved by Badanidiyuru et al. (2018); see Ch. 10 of Slivkins (2019) for a discussion of this and related work in bandits. Within this literature, the closest to ours is the work of Agrawal and Devanur (2014), who study bandits with concave objectives and convex constraints. Our work is directly inspired by theirs and lifts their techniques to the more general episodic reinforcement learning setting.
2 Model and preliminaries

In episodic reinforcement learning, a learner repeatedly interacts with an environment across $K$ episodes. The environment includes the state space $\mathcal{S}$, the action space $\mathcal{A}$, the episode horizon $H$, and the initial state $s_0$. To capture constrained settings, the environment also includes a set $\mathcal{D}$ of $d$ resources, where each $i \in \mathcal{D}$ has a capacity constraint $\xi(i) \in \mathbb{R}_+$. The above are fixed and known to the learner.
Constrained Markov Decision Process.

We work with MDPs that have resource consumption in addition to rewards. Formally, a constrained MDP (cMDP) is a triple $M = (p, r, c)$ that describes transition probabilities $p: \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$, rewards $r: \mathcal{S} \times \mathcal{A} \to [0,1]$, and resource consumption $c: \mathcal{S} \times \mathcal{A} \to [0,1]^d$. For convenience, we denote $c(s,a,i) = c_i(s,a)$. We allow stochastic rewards and consumptions, in which case $r$ and $c$ refer to the conditional expectations, conditioned on $s$ and $a$ (our definitions and algorithms are based on this conditional expectation rather than the full conditional distribution).

The above definition is useful in two ways. On the one hand, there is a true cMDP $M^\star = (p^\star, r^\star, c^\star)$, which is fixed but unknown to the learner. Selecting action $a$ at state $s$ results in rewards and consumptions drawn from (possibly correlated) distributions with means $r^\star(s,a)$ and $c^\star(s,a)$ and supports in $[0,1]$ and $[0,1]^d$, respectively. Next states are generated from the transition probabilities $p^\star(s,a)$. On the other hand, our algorithm is model-based and, at episode $k$, uses a cMDP $M^{(k)}$ as its model.

Episodic reinforcement learning protocol.

At episode $k \in [K]$, the learner commits to a policy $\pi_k = (\pi_{k,h})_{h=1}^H$, where $\pi_{k,h}: \mathcal{S} \to \Delta(\mathcal{A})$ specifies how to select actions at step $h$ for every state. The learner starts from state $s_{k,1} = s_0$. At each step $h = 1, \dots, H$, she selects an action $a_{k,h} \sim \pi_{k,h}(s_{k,h})$, earns a reward $r_{k,h}$ and suffers a consumption $c_{k,h}$, both drawn from the true cMDP $M^\star$ at the state-action pair $(s_{k,h}, a_{k,h})$ as described above, and transitions to state $s_{k,h+1} \sim p^\star(s_{k,h}, a_{k,h})$.

(A fixed and known initial state is without loss of generality. In general, there is a fixed but unknown distribution $\rho$ from which the initial state is drawn before each episode. We can modify the MDP by adding a new state $s_0$ as the initial state, such that the next state is sampled from $\rho$ for any action; then $\rho$ is "included" within the transition probabilities. The extra state $s_0$ does not contribute any reward and does not consume any resources.)

Objectives.
In the basic setting (Section 3), the learner wishes to maximize reward while respecting the consumption constraints in expectation, by competing favorably against the following benchmark:

$$\max_\pi \; \mathbb{E}_{\pi, p^\star}\Big[\sum_{h=1}^H r^\star(s_h, a_h)\Big] \quad \text{s.t.} \quad \forall i \in \mathcal{D}: \; \mathbb{E}_{\pi, p^\star}\Big[\sum_{h=1}^H c^\star(s_h, a_h, i)\Big] \le \xi(i), \qquad (1)$$

where $\mathbb{E}_{\pi,p}$ denotes the expectation over the run of policy $\pi$ according to transitions $p$, and $s_h, a_h$ are the induced random state-action pairs. We denote by $\pi^\star$ the policy that maximizes this objective. Our main results hold more generally for a concave reward objective and convex consumption constraints (Section 4) and extend to the knapsack setting, where the constraints are hard (Section 5).

For the basic setting, we track two performance measures: the reward regret compares the learner's average per-episode reward to the benchmark, and the consumption regret bounds the excess in resource consumption:

$$\mathrm{RewReg}(k) := \mathbb{E}_{\pi^\star, p^\star}\Big[\sum_{h=1}^H r^\star(s_h,a_h)\Big] - \frac{1}{k}\sum_{t=1}^k \mathbb{E}_{\pi_t, p^\star}\Big[\sum_{h=1}^H r^\star(s_h,a_h)\Big], \qquad (2)$$

$$\mathrm{ConsReg}(k) := \max_{i \in \mathcal{D}} \Big( \frac{1}{k}\sum_{t=1}^k \mathbb{E}_{\pi_t, p^\star}\Big[\sum_{h=1}^H c^\star(s_h,a_h,i)\Big] - \xi(i) \Big). \qquad (3)$$

Our guarantees are anytime, i.e., they hold at any episode $k$ and not only after the last episode.
Tabular MDPs.

We assume that the state space $\mathcal{S}$ and the action space $\mathcal{A}$ are finite (the tabular setting). We construct standard empirical estimates separately for each state-action pair $(s,a)$, using the learner's observations up to, but not including, a given episode $k$. Eqs. (4-7) define sample counts, empirical transition probabilities, empirical rewards, and empirical resource consumption:

$$N_k(s,a) = \max\Big\{1, \sum_{t \in [k-1],\, h \in [H]} \mathbb{1}\{s_{t,h} = s, a_{t,h} = a\}\Big\} \qquad (4)$$

$$\hat p_k(s' \mid s,a) = \frac{1}{N_k(s,a)} \sum_{t \in [k-1],\, h \in [H]} \mathbb{1}\{s_{t,h} = s, a_{t,h} = a, s_{t,h+1} = s'\} \qquad (5)$$

$$\hat r_k(s,a) = \frac{1}{N_k(s,a)} \sum_{t \in [k-1],\, h \in [H]} r_{t,h} \cdot \mathbb{1}\{s_{t,h} = s, a_{t,h} = a\} \qquad (6)$$

$$\hat c_k(s,a,i) = \frac{1}{N_k(s,a)} \sum_{t \in [k-1],\, h \in [H]} c_{t,h,i} \cdot \mathbb{1}\{s_{t,h} = s, a_{t,h} = a\} \quad \forall i \in \mathcal{D} \qquad (7)$$

(The max operator in Eq. (4) is to avoid dividing by 0.)

Preliminaries for theoretical analysis.

A quality function ($Q$-function) is a standard object in RL that tracks the learner's expected performance if she starts from state $s \in \mathcal{S}$ at step $h$, selects action $a \in \mathcal{A}$, and then follows a policy $\pi$ under a model with transitions $p$ for the remainder of the episode. It is defined with respect to an objective function $m: \mathcal{S} \times \mathcal{A} \to [0,1]$, which can be either a reward, i.e., $m(s,a) = r(s,a)$, or the consumption of some resource $i \in \mathcal{D}$, i.e., $m(s,a) = c(s,a,i)$. (For the unconstrained setting, the objective is the reward.) The performance of the policy at a particular step $h$ is evaluated by the value function, which is the expectation of the $Q$-function at the selected action (where the expectation is taken over the possibly randomized action selection of $\pi$). Both quality and value functions can be recursively defined by dynamic programming:

$$Q^{\pi,p}_m(s,a,h) = m(s,a) + \sum_{s' \in \mathcal{S}} p(s' \mid s,a)\, V^{\pi,p}_m(s', h+1),$$
$$V^{\pi,p}_m(s,h) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[Q^{\pi,p}_m(s,a,h)\big], \quad \text{with } V^{\pi,p}_m(s, H+1) = 0.$$

By a slight abuse of notation, for $m \in \{r\} \cup \{c_i\}_{i \in \mathcal{D}}$, we denote by $m^\star \in \{r^\star\} \cup \{c^\star_i\}_{i \in \mathcal{D}}$ the corresponding objective with respect to the rewards and consumptions of the true cMDP $M^\star$. For objectives $m^\star$ and transitions $p^\star$, the above are the Bellman equations of the system (Bellman, 1957).

Estimating the $Q$-function based on the model parameters $p$ and $m$, rather than the ground-truth parameters $p^\star$ and $m^\star$, introduces errors. These errors are localized across stages by the notion of Bellman error, which contrasts the performance of policy $\pi$ starting from stage $h$ under the model parameters to a benchmark that behaves according to the model parameters starting from the next stage $h+1$ but uses the true parameters of the system in stage $h$. More formally, for objective $m$:

$$\mathrm{Bell}^{\pi,p}_m(s,a,h) = Q^{\pi,p}_m(s,a,h) - \Big(m^\star(s,a) + \sum_{s' \in \mathcal{S}} p^\star(s' \mid s,a)\, V^{\pi,p}_m(s', h+1)\Big). \qquad (8)$$

Note that when the cMDP is $M^\star$ ($m = m^\star$, $p = p^\star$), there is no mismatch and $\mathrm{Bell}^{\pi,p^\star}_{m^\star} = 0$.
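The Bellman recursion above can be implemented directly by backward induction in the tabular setting. The following is a minimal sketch (array layout and function names are our own illustration, not from the paper):

```python
import numpy as np

def evaluate_policy(p, m, pi, H):
    """Backward induction for Q^{pi,p}_m and V^{pi,p}_m.

    p:  transition tensor of shape (S, A, S), p[s, a, s'] = p(s' | s, a)
    m:  objective (reward or one resource's consumption), shape (S, A)
    pi: policy, shape (H, S, A), pi[h, s, a] = probability of a at (s, h)
    Returns Q of shape (H, S, A) and V of shape (H + 1, S).
    """
    S, A = m.shape
    Q = np.zeros((H, S, A))
    V = np.zeros((H + 1, S))  # V[H] = 0 encodes the terminal condition
    for h in reversed(range(H)):
        # Bellman equation: Q(s,a,h) = m(s,a) + sum_{s'} p(s'|s,a) V(s',h+1)
        Q[h] = m + p @ V[h + 1]
        # Value function: expectation of Q under the (randomized) policy
        V[h] = np.einsum("sa,sa->s", pi[h], Q[h])
    return Q, V
```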
3 Basic setting

In this section, we introduce a simple algorithm that simultaneously bounds the reward and consumption regrets in the basic setting introduced in the previous section. Even in this basic setting, we provide the first sample-efficient guarantees for constrained episodic reinforcement learning. (We refer the reader to the related work in Section 1 for a discussion of concurrent and independent papers; unlike our results, these papers do not extend to either the concave-convex or the knapsack settings.) The modular analysis of these guarantees also allows us to subsequently extend (in Sections 4 and 5) the algorithm and guarantees to the more general concave-convex and knapsack settings.
Our algorithm.
At episode $k$, we construct an estimated cMDP $M^{(k)} = (p^{(k)}, r^{(k)}, c^{(k)})$ based on the observations collected so far. The estimates are bonus-enhanced (formalized below) to encourage more targeted exploration. Our algorithm ConRL selects a policy $\pi_k$ by solving the following constrained optimization problem, which we refer to as BasicConPlanner$(p^{(k)}, r^{(k)}, c^{(k)})$:

$$\max_\pi \; \mathbb{E}_{\pi, p^{(k)}}\Big[\sum_{h=1}^H r^{(k)}(s_h, a_h)\Big] \quad \text{s.t.} \quad \forall i \in \mathcal{D}: \; \mathbb{E}_{\pi, p^{(k)}}\Big[\sum_{h=1}^H c^{(k)}(s_h, a_h, i)\Big] \le \xi(i).$$

The above optimization problem is similar to the objective (1) but uses the estimated model instead of the true model (which is unknown to the learner). We also note that this optimization problem can be solved exactly, as it is a linear program in the occupation measures (Puterman, 2014), i.e., treating as variables the probabilities of each state-action pair and imposing flow-conservation constraints with respect to the transitions. This program is described in Appendix A.1.
Bonus-enhanced model.
A standard approach to implement the principle of optimism under uncertainty is to introduce, at each episode $k$, a bonus term $\hat b_k(s,a)$ that favors under-explored actions. Specifically, we add this bonus to the empirical rewards (6) and subtract it from the consumptions (7): $r^{(k)}(s,a) = \hat r_k(s,a) + \hat b_k(s,a)$ and $c^{(k)}(s,a,i) = \hat c_k(s,a,i) - \hat b_k(s,a)$ for each resource $i$. Following the unconstrained analogues (Azar et al., 2017; Dann et al., 2017), we define the bonus as:

$$\hat b_k(s,a) = H \sqrt{\frac{\ln\big(4SAH(d+1)k^2/\delta\big)}{N_k(s,a)}}, \qquad (9)$$

where $\delta > 0$ is the desired failure probability of the algorithm, $N_k(s,a)$ is the number of times the $(s,a)$ pair has been visited, c.f. (4), $S = |\mathcal{S}|$, and $A = |\mathcal{A}|$. Thus, under-explored actions have a larger bonus and therefore appear more appealing to the planner. For the estimated transition probabilities, we just use the empirical averages (5): $p^{(k)}(s' \mid s,a) = \hat p_k(s' \mid s,a)$.
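The bonus-enhanced model is straightforward to compute from the empirical estimates. Here is a minimal sketch (names and array shapes are our own illustration; the formula mirrors Eq. (9) as given above):

```python
import numpy as np

def bonus(N, S, A, H, d, k, delta):
    """Exploration bonus of Eq. (9); N holds the (clipped) visit counts."""
    return H * np.sqrt(np.log(4 * S * A * H * (d + 1) * k**2 / delta) / N)

def bonus_enhanced_model(p_hat, r_hat, c_hat, N, H, k, delta):
    """Optimistic model M^(k): rewards shifted up, consumptions shifted down;
    transitions are kept at their empirical averages."""
    S, A, d = c_hat.shape
    b = bonus(N, S, A, H, d, k, delta)          # shape (S, A)
    return p_hat, r_hat + b, c_hat - b[:, :, None]
```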
Valid bonus and Bellman-error decomposition.

For a bonus-enhanced model to achieve effective exploration, the resulting bonuses need to be valid, i.e., they should ensure that the estimated rewards overestimate the true rewards (and the consumptions underestimate the true consumptions).
Definition 3.1.
A bonus $b_k: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is valid if, for all $s \in \mathcal{S}$, $a \in \mathcal{A}$, $h \in [H]$, and $m \in \{r\} \cup \{c_i\}_{i \in \mathcal{D}}$:

$$\Big| \big(\hat m_k(s,a) - m^\star(s,a)\big) + \sum_{s' \in \mathcal{S}} \big(\hat p_k(s' \mid s,a) - p^\star(s' \mid s,a)\big)\, V^{\pi^\star, p^\star}_{m^\star}(s', h+1) \Big| \le b_k(s,a).$$

By classical concentration bounds (Appendix B.1), the bonus $\hat b_k$ of Eq. (9) satisfies this condition:

Lemma 3.2.
With probability $1 - \delta$, the bonus $\hat b_k(s,a)$ is valid for all episodes $k$ simultaneously.

Our algorithm solves the BasicConPlanner optimization problem on a bonus-enhanced model. When the bonuses are valid, we can upper bound the per-episode regret by the expected sum of Bellman errors across steps. This is the first part of classical unconstrained analyses, and the following proposition extends the decomposition to constrained episodic reinforcement learning. The proof uses the so-called simulation lemma (Kearns and Singh, 2002) and is provided in Appendix B.3.
Proposition 3.3. If $\hat b_k(s,a)$ is valid for all episodes $k$ simultaneously, then the per-episode reward and consumption regrets can be upper bounded by the expected sum of Bellman errors (8):

$$\mathbb{E}_{\pi^\star, p^\star}\Big[\sum_{h=1}^H r^\star(s_h,a_h)\Big] - \mathbb{E}_{\pi_k, p^\star}\Big[\sum_{h=1}^H r^\star(s_h,a_h)\Big] \le \mathbb{E}_{\pi_k}\Big[\sum_{h=1}^H \big| \mathrm{Bell}^{\pi_k, p^{(k)}}_{r^{(k)}}(s_h,a_h,h) \big|\Big] \qquad (10)$$

$$\forall i \in \mathcal{D}: \quad \mathbb{E}_{\pi_k, p^\star}\Big[\sum_{h=1}^H c^\star(s_h,a_h,i)\Big] - \xi(i) \le \mathbb{E}_{\pi_k}\Big[\sum_{h=1}^H \big| \mathrm{Bell}^{\pi_k, p^{(k)}}_{c^{(k)}_i}(s_h,a_h,h) \big|\Big] \qquad (11)$$

Final guarantee.
One difficulty with directly bounding the Bellman error is that the value function is not independent of the draws forming $r^{(k)}(s,a)$, $c^{(k)}(s,a)$, and $p^{(k)}(s' \mid s,a)$; hence we cannot apply the Hoeffding inequality directly. While Azar et al. (2017) propose a trick to bound the Bellman error on the order of $\sqrt{S}$ in unconstrained settings, the trick relies on a crucial consequence of Bellman optimality: for an unconstrained MDP, the optimal policy $\pi^\star$ satisfies $V^{\pi^\star}_{r^\star}(s,h) \ge V^{\pi}_{r^\star}(s,h)$ for all $s, h, \pi$ (i.e., $\pi^\star$ is optimal at every state). However, when constraints exist, the optimal policy does not satisfy the Bellman optimality property. Indeed, we can only guarantee optimality with respect to the initial state distribution, i.e., $V^{\pi^\star}_{r^\star}(s_0,1) \ge V^{\pi}_{r^\star}(s_0,1)$ for any $\pi$, but not everywhere else. This illustrates a fundamental difference between constrained and unconstrained MDPs. Thus we cannot directly apply the trick from Azar et al. (2017). Instead, we follow an alternative approach of bounding the value function via an $\epsilon$-net on the possible values. This analysis leads to a guarantee that is weaker by a factor of $\sqrt{S}$ than the unconstrained results. The proof is provided in Appendix B.6.

Theorem 3.4.
Let $c \in \mathbb{R}_+$ be some absolute constant. With probability at least $1 - \delta$, the reward and consumption regrets are both upper bounded by:

$$\frac{c}{\sqrt{k}} \cdot S\sqrt{AH^3} \cdot \sqrt{\ln(k)\, \ln\big(SAH(d+1)k/\delta\big)} \;+\; \frac{c}{k} \cdot S^{3/2} A H^2 \sqrt{\ln\big(SAH(d+1)k/\delta\big)}.$$

Comparison to single-episode results.
We note that for the single-episode setting, Cheung (2019) achieves a $\sqrt{S}$ dependency under the further assumption that the transitions are sparse, i.e., $\|p^\star(s,a)\|_0 \ll S$ for all $(s,a)$. (The single-episode setting also requires a strong reachability assumption, not present in the episodic setting.) We make no such sparsity assumption on the MDP, and we note that the regret bound of Cheung (2019) scales on the order of $S$ when $\|p^\star(s,a)\|_0 = \Theta(S)$.

4 Concave-convex setting

We now extend the algorithm and guarantees derived for the basic setting to the case where the objective is a concave function of the accumulated reward and the constraints are a convex function of the cumulative consumptions. Our approach is modular, seamlessly building on the basic setting.
Setting and objective.
Formally, there is a concave reward-objective function $f: \mathbb{R} \to \mathbb{R}$ and a convex consumption-objective function $g: \mathbb{R}^d \to \mathbb{R}$; the only assumption is that these functions are $L$-Lipschitz for some constant $L$, i.e., $|f(x) - f(y)| \le L|x - y|$ for any $x, y \in \mathbb{R}$, and $|g(x) - g(y)| \le L\|x - y\|_1$ for any $x, y \in \mathbb{R}^d$. Analogous to (1), the learner wishes to compete against the following benchmark, which can be viewed as a reinforcement learning variant of the benchmark used by Agrawal and Devanur (2014) in multi-armed bandits:

$$\max_\pi \; f\Big(\mathbb{E}_{\pi,p^\star}\Big[\sum_{h=1}^H r^\star(s_h,a_h)\Big]\Big) \quad \text{s.t.} \quad g\Big(\mathbb{E}_{\pi,p^\star}\Big[\sum_{h=1}^H c^\star(s_h,a_h)\Big]\Big) \le 0. \qquad (12)$$

The reward and consumption regrets are therefore adapted to:

$$\mathrm{ConvexRewReg}(k) := f\Big(\mathbb{E}_{\pi^\star,p^\star}\Big[\sum_{h=1}^H r^\star(s_h,a_h)\Big]\Big) - f\Big(\frac{1}{k}\sum_{t=1}^k \mathbb{E}_{\pi_t,p^\star}\Big[\sum_{h=1}^H r^\star(s_h,a_h)\Big]\Big),$$

$$\mathrm{ConvexConsReg}(k) := g\Big(\frac{1}{k}\sum_{t=1}^k \mathbb{E}_{\pi_t,p^\star}\Big[\sum_{h=1}^H c^\star(s_h,a_h)\Big]\Big).$$
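As one concrete illustration (our example, not from the paper), the basic setting of Section 3 is recovered by the linear choices

$$f(x) = x, \qquad g(x) = \max_{i \in \mathcal{D}} \big( x_i - \xi(i) \big),$$

both of which are $1$-Lipschitz, since $|f(x) - f(y)| = |x - y|$ and $|g(x) - g(y)| \le \|x - y\|_\infty \le \|x - y\|_1$; the constraint $g(\cdot) \le 0$ then coincides with the per-resource constraints of (1).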
Our algorithm.

As in the basic setting, we wish to create a bonus-enhanced model and optimize over it. To model the transition probabilities, we use the empirical estimates $p^{(k)} = \hat p_k$ of Eq. (5) as before. For rewards and consumptions, however, since $f$ and $g$ need not be monotone, we cannot simply add or subtract the bonus $\hat b_k$ (defined in Eq. 9) as we did before. Instead, we compute the policy $\pi_k$ of episode $k$ together with the model by solving the following optimization problem, which we call ConvexConPlanner:

$$\max_\pi \max_{r^{(k)} \in [\hat r_k \pm \hat b_k]} f\Big(\mathbb{E}_{\pi,p^{(k)}}\Big[\sum_{h=1}^H r^{(k)}(s_h,a_h)\Big]\Big) \quad \text{s.t.} \quad \min_{c^{(k)} \in [\hat c_k \pm \hat b_k \cdot \mathbb{1}]} g\Big(\mathbb{E}_{\pi,p^{(k)}}\Big[\sum_{h=1}^H c^{(k)}(s_h,a_h)\Big]\Big) \le 0.$$

The above problem is convex in the occupation measures, i.e., the probabilities $\rho(s,a,h)$ that the learner is at state-action-step $(s,a,h)$; c.f. Appendix A.2 for further discussion:

$$\max_\rho \max_{r \in [\hat r_k \pm \hat b_k]} f\Big(\sum_{s,a,h} \rho(s,a,h)\, r(s,a)\Big) \quad \text{s.t.} \quad \min_{c \in [\hat c_k \pm \hat b_k \cdot \mathbb{1}]} g\Big(\sum_{s,a,h} \rho(s,a,h)\, c(s,a)\Big) \le 0$$
$$\forall s', h: \; \sum_a \rho(s', a, h+1) = \sum_{s,a} \rho(s,a,h)\, \hat p_k(s' \mid s,a)$$
$$\forall s,a,h: \; 0 \le \rho(s,a,h) \le 1 \quad \text{and} \quad \sum_{s,a} \rho(s,a,h) = 1$$

(Under mild assumptions, this program can be solved in polynomial time, similarly to its bandit analogue, Lemma 4.3 of Agrawal and Devanur (2014); in the basic setting it reduces to just a linear program.)
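When $f$ and $g$ happen to be componentwise nondecreasing (an extra assumption we make only for this sketch), the inner maximization and minimization over the confidence boxes are attained at $\hat r_k + \hat b_k$ and $\hat c_k - \hat b_k$, and ConvexConPlanner reduces to a single convex program over occupation measures. A hedged cvxpy sketch with the illustrative choices $f(x) = \sqrt{x}$ and $g(x) = \max_i (x_i - \xi(i))$; all names are our own:

```python
import cvxpy as cp
import numpy as np

def convex_con_planner_monotone(p_hat, r_hat, c_hat, b, xi, H):
    """ConvexConPlanner in the special case of componentwise nondecreasing f, g
    (our simplifying assumption; the paper's general program keeps r and c as
    additional box-constrained variables)."""
    S, A, d = c_hat.shape
    rho = cp.Variable((H, S * A), nonneg=True)   # rho[h, s*A + a] = rho(s, a, h)
    P = p_hat.reshape(S * A, S)                  # P[s*A + a, s'] = p(s' | s, a)
    M = np.kron(np.eye(S), np.ones(A))           # sums rho over actions per state
    cons = [cp.sum(rho, axis=1) == 1]            # each step is a distribution
    cons += [M @ rho[h + 1] == P.T @ rho[h] for h in range(H - 1)]
    occ = cp.sum(rho, axis=0)                    # total occupancy per (s, a)
    reward = occ @ (r_hat + b).reshape(-1)                         # optimistic
    consumption = occ @ (c_hat - b[:, :, None]).reshape(S * A, d)  # optimistic
    cons.append(cp.max(consumption - xi) <= 0)   # g(x) <= 0
    cp.Problem(cp.Maximize(cp.sqrt(reward)), cons).solve()
    return rho.value.reshape(H, S, A)
```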
Guarantee for concave-convex setting.

To extend the guarantee of the basic setting to the concave-convex setting, we face an additional challenge: it is not immediately clear that the optimal policy $\pi^\star$ is feasible for the ConvexConPlanner program, since ConvexConPlanner is defined with respect to the empirical transition probabilities $p^{(k)}$. (In the multi-armed bandit concave-convex setting of Agrawal and Devanur (2014), proving feasibility of the best arm is straightforward, as there are no transitions.) We use a novel application of the mean-value theorem to show that $\pi^\star$ is indeed a feasible solution of that program. This enables us to create a regret decomposition similar to Proposition 3.3, which then allows us to plug in the results developed for the basic setting. The full proof is provided in Appendix C.

Theorem 4.1.
Let $L$ be the Lipschitz constant of $f$ and $g$, and let RewReg and ConsReg be the reward and consumption regrets for the basic setting (Theorem 3.4), where $\delta$ is its failure probability. With probability $1 - \delta$, our algorithm in the concave-convex setting has reward and consumption regret upper bounded by $L \cdot \mathrm{RewReg}$ and $Ld \cdot \mathrm{ConsReg}$, respectively.
The linear dependence on $d$ in the consumption regret above comes from the fact that we assume $g$ is Lipschitz under the $\ell_1$ norm.

5 Knapsack setting

Our last technical section extends the algorithm and guarantee of the basic setting to scenarios where the constraints are hard, in accordance with most of the literature on bandits with knapsacks. The goal here is to achieve aggregate reward regret that is sublinear in the time horizon (in our case, the number of episodes $K$), while also respecting budget constraints, for as small budgets as possible. We derive guarantees in terms of the reward regret, as defined previously, and then argue that our benchmark extends to the seemingly stronger benchmark of the best dynamic policy.

Setting and objective.

Each resource $i \in \mathcal{D}$ has an aggregate budget $B_i$, and the learner should not exceed it over the $K$ episodes; unlike the basic setting, this is therefore a hard constraint. As in most works on bandits with knapsacks, the algorithm is allowed to use a "null action" for an episode, i.e., an action that yields zero reward and zero consumption when selected at the beginning of the episode. The learner wishes to maximize her aggregate reward while respecting these hard constraints. Formally, the reward objective is the same as defined in (1) and (2) with $\xi(i) = \frac{B_i}{K}$; instead of the consumption objective, we can think of selecting the null action once any constraint is ever violated. We denote this benchmark as $\pi^\star$. Note that $\pi^\star$ satisfies the constraints in expectation. We explain at the end of this section how our algorithm can also compete against a benchmark that respects the constraints deterministically.

Our algorithm.
In the basic setting of Section 3, we showed a reward regret guarantee and a consumption regret guarantee (i.e., the average constraint violation scales on the order of $1/\sqrt{K}$). Since this is not sufficient for hard constraints, we need to ensure that the learned policy satisfies the budget constraints with high probability. Our algorithm optimizes a mathematical program, KnapsackConPlanner (13), that strengthens the consumption requirement:

$$\max_\pi \; \mathbb{E}_{\pi,p^{(k)}}\Big[\sum_{h=1}^H r^{(k)}(s_h,a_h)\Big] \quad \text{s.t.} \quad \forall i \in \mathcal{D}: \; \mathbb{E}_{\pi,p^{(k)}}\Big[\sum_{h=1}^H c^{(k)}(s_h,a_h,i)\Big] \le (1-\epsilon)\frac{B_i}{K}. \qquad (13)$$

In the above, $p^{(k)}, r^{(k)}, c^{(k)}$ are exactly as in the basic setting, and $\epsilon > 0$ is instantiated in the theorem below. Note that the program in (13) is feasible thanks to the existence of the null action: the following mixture policy induces a feasible solution. With probability $1 - \epsilon$, we play the optimal policy $\pi^\star$ for the entire episode; with probability $\epsilon$, we play the null action for the entire episode. (Indeed, the mixture's expected consumption of each resource $i$ is at most $(1-\epsilon)\,\xi(i) = (1-\epsilon)\frac{B_i}{K}$.) Note that the above program can again be cast as a linear program in the occupancy-measure space; c.f. Appendix A.3 for further discussion.

Guarantee for knapsack setting.
The guarantee of the basic setting on this tighter mathematical program seamlessly transfers to a reward guarantee that does not violate the hard constraints.
Theorem 5.1.
Assume that $\min_i B_i \le KH$ (otherwise the setting is unconstrained). Let $\mathrm{AggReg}(\delta)$ be a bound on the aggregate (across episodes) reward or consumption regret for the soft-constraint setting (Theorem 3.4), where $\delta$ is its failure probability. Let $\epsilon = \frac{\mathrm{AggReg}(\delta)}{\min_i B_i}$. If $\min_i B_i > \mathrm{AggReg}(\delta)$, then, with probability $1 - \delta$, the reward regret in the hard-constraint setting is at most $\frac{H \cdot \mathrm{AggReg}(\delta)}{\min_i B_i}$ and the constraints are not violated.

The above theorem implies that the aggregate reward regret is sublinear in $K$ as long as $\min_i B_i \gg H \cdot \mathrm{AggReg}(\delta)$. The analysis of the above main theorem (provided in Appendix D) is modular, in the sense that it leverages ConRL's performance to solve (13) in a black-box manner.
A smaller $\mathrm{AggReg}(\delta)$ from the basic soft-constraint setting immediately translates to smaller reward regret and a smaller budget regime (i.e., $\min_i B_i$ can be smaller). In particular, using the $\mathrm{AggReg}(\delta)$ bound of Theorem 3.4, the reward regret is sublinear as long as $\min_i B_i = \Omega(\sqrt{K})$. In contrast, the previous work of Cheung (2019) can only deal with a larger budget regime, i.e., $\min_i B_i = \Omega(K^{3/4})$. Although the guarantees are not directly comparable, as the latter is for the single-episode setting (which requires further reachability assumptions), the budget we can handle is significantly smaller, and in the next section we show that our algorithm has superior empirical performance in episodic settings even when such assumptions are granted.
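To spell out the budget-regime claim, the substitution is (our arithmetic; $\tilde O(\cdot)$ hides logarithmic factors):

$$\mathrm{AggReg}(\delta) = K \cdot \tilde O\Big(\frac{S\sqrt{AH^3}}{\sqrt{K}}\Big) = \tilde O\big(S\sqrt{AH^3 K}\big), \qquad \text{so we need} \qquad \min_i B_i \gg H \cdot \mathrm{AggReg}(\delta) = \tilde O\big(S\sqrt{AH^5 K}\big),$$

which, viewed as a function of $K$ alone, is the $\min_i B_i = \Omega(\sqrt{K})$ regime stated above.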
Dynamic policy benchmark.

The common benchmark used in bandits with knapsacks is not the best stationary policy $\pi^\star$ that respects the constraints in expectation, but rather the best dynamic policy (i.e., a policy that makes decisions based on the history) that never violates the hard constraints, deterministically. In Appendix D, we show that the optimal dynamic policy (formally defined there) has reward at most that of policy $\pi^\star$ (informally, this is because $\pi^\star$ respects the constraints in expectation, while the dynamic policy has to satisfy them deterministically), and therefore the guarantee of Theorem 5.1 also applies against the optimal dynamic policy.

[Figure 1: a grid of plots with columns A - Mars Rover, B - Box, and C - Mars Rover, and rows showing reward, constraint satisfaction, and train constraint violation against the number of trajectories; legends include ConRL-Value Iteration, ConRL-A2C, RCPO, ApproPO, and TFW-UCRL2.]

Figure 1: The performance of the algorithms as a function of the number of sample trajectories (trajectory = 30 samples), showing the average and standard deviation over 10 runs. The dashed line is the upper bound for the constraint (for all algorithms), while the dashed line for the reward is a lower bound and is only required by ApproPO.

6 Experiments

In this section, we evaluate the performance of ConRL against previous approaches. Although our ConPlanner (see Appendix A) can be solved exactly using linear programming (Altman, 1999), in our experiments it suffices to use a Lagrangian heuristic, denoted LagrConPlanner (see Appendix E.1); a sketch is given below. This Lagrangian heuristic only needs a planner for the unconstrained RL task. We use two planning algorithms with LagrConPlanner: ConRL-Value Iteration, using a value iteration planner, and ConRL-A2C, using a model-based Advantage Actor-Critic (A2C) planner (Mnih et al., 2016) which uses fictitious samples.
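The paper defers the exact LagrConPlanner procedure to Appendix E.1 (not reproduced here); the following is a minimal sketch of one standard Lagrangian scheme consistent with the description above, where every name, the learning rate, and the multiplier update are our own illustrative assumptions:

```python
import numpy as np

def lagr_con_planner(plan_unconstrained, r_hat, c_hat, xi, n_iters=100, lr=0.1):
    """Hedged sketch of a Lagrangian heuristic (not the exact Appendix E.1 one).

    plan_unconstrained(reward) is any planner for the standard unconstrained RL
    task; it must return a policy and that policy's expected consumption vector.
    """
    d = c_hat.shape[-1]
    lam = np.zeros(d)                      # one multiplier per resource
    pi = None
    for _ in range(n_iters):
        # Scalarize: maximize r(s, a) - sum_i lam_i * c(s, a, i).
        pi, expected_cons = plan_unconstrained(r_hat - c_hat @ lam)
        # Dual ascent: raise lam_i when constraint i is violated.
        lam = np.maximum(lam + lr * (expected_cons - xi), 0.0)
    return pi
```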
We run our experiments on two grid-world environments, Mars rover (Tessler et al., 2019) and Box (Leike et al., 2017). Code is available at https://github.com/miryoosefi/ConRL.

Mars rover.

The agent must move from the starting position to the goal without crashing into the rocks. If the agent reaches the goal or crashes into a rock, it stays in that cell for the remaining steps of the episode. The reward is 1 when the agent reaches the goal or crashes into a rock, and 1/H afterwards; the constraint signal is 1 when the agent crashes into a rock, and 1/H afterwards. The episode horizon H is 30, and the agent's action is perturbed to a random action with a small fixed probability.

Box.
The agent must move from the starting position to the goal while avoiding irreversible side effects, that is, moving the box into a corner (a cell adjacent to at least two walls). If the agent reaches the goal, it stays in that cell for the remaining steps of the episode. The reward is 1 when the agent reaches the goal for the first time, and 1/H afterwards; the constraint signal is 1/H for every state in which the box is in a corner. The horizon H is 30, and the agent's action is perturbed to a random action with a small fixed probability.

We compare ConRL to previous constrained approaches (derived for either episodic or single-episode settings) in Figure 1. We keep track of three metrics: reward (top row), constraint (middle row), and aggregate constraint (bottom row); the bottom row is the aggregate actual constraint incurred during training, unlike the first and second rows, which are with respect to the true model $r^\star$ and $c^\star$ as defined in (2) and (3). Further details on the experiments are provided in Appendix E.

Episodic setting.
We first compare to two episodic RL approaches: ApproPO (Miryoosefi et al., 2019) and RCPO (Tessler et al., 2019). We note that none of the previous approaches in this setting address sample-efficient exploration. In addition, most of them are limited to linear constraints, with the exception of ApproPO (Miryoosefi et al., 2019), which is capable of handling general convex constraints. (Trust-region methods like CPO (Achiam et al., 2017) address a more restrictive setting and require constraint satisfaction at each iteration; for this reason, they are not included in the experiments.) Both ApproPO and RCPO (used as a baseline in Miryoosefi et al. (2019)) maintain and update a weight vector $\lambda$, used to derive a reward for a learning algorithm capable of solving the unconstrained RL task. However, ApproPO focuses on the feasibility problem; therefore, in the experiments we also need to specify a lower bound on the reward for ApproPO, set separately for Mars rover and Box. We see in the first column (A - Mars Rover) and second column (B - Box) of Figure 1 that both versions of ConRL achieve the highest reward (top row) and satisfy the constraint (middle row), while requiring significantly fewer trajectories (i.e., samples).
Single-episode setting.
Closest to our work is TFW-UCRL2 (Cheung, 2019), which is based on UCRL2 (Jaksch et al., 2010). However, this approach focuses on the single-episode setting and requires a strong reachability assumption. By connecting the terminal states of our MDP to the initial state, we reduce our episodic setting to the single-episode setting, in which we can compare ConRL against TFW-UCRL2. Results for Mars rover are depicted in the last column of Figure 1 (C - Mars Rover); we attempted to run TFW-UCRL2 in the Box environment as well, but due to the larger state space it was computationally infeasible. The results show that both versions of ConRL significantly outperform TFW-UCRL2, and suggest that TFW-UCRL2 might be impractical in (at least some) episodic settings.
Acknowledgements
The authors would like to thank Rob Schapire for useful discussions that helped in the initial stages of this work.

References
Achiam, J., Held, D., Tamar, A., and Abbeel, P. (2017). Constrained policy optimization. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 22–31. JMLR.org.

Agrawal, S. and Devanur, N. R. (2014). Bandits with concave rewards and convex knapsacks. In Proceedings of the 15th ACM Conference on Economics and Computation (EC).

Altman, E. (1999). Constrained Markov Decision Processes. Chapman and Hall.

Azar, M. G., Osband, I., and Munos, R. (2017). Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning (ICML).

Babaioff, M., Dughmi, S., Kleinberg, R. D., and Slivkins, A. (2015). Dynamic pricing with limited supply. TEAC, 3(1):4. Special issue for ACM EC 2012.

Badanidiyuru, A., Kleinberg, R., and Slivkins, A. (2018). Bandits with knapsacks. Journal of the ACM, 65(3):13:1–13:55. Preliminary version in FOCS 2013.

Bellman, R. (1957). A Markovian decision process. Indiana Univ. Math. J., 6:679–684.

Besbes, O. and Zeevi, A. (2009). Dynamic pricing without knowing the demand function: Risk bounds and near-optimal algorithms. Operations Research, 57(6):1407–1420.

Besbes, O. and Zeevi, A. (2011). On the minimax complexity of pricing in a changing environment. Operations Research, 59(1):66–79.

Cesa-Bianchi, N. and Lugosi, G. (2006). Prediction, Learning, and Games. Cambridge University Press.

Cheung, W. C. (2019). Regret minimization for reinforcement learning with vectorial feedback and complex objectives. In Advances in Neural Information Processing Systems (NeurIPS).

Dann, C., Lattimore, T., and Brunskill, E. (2017). Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. In Advances in Neural Information Processing Systems, pages 5713–5723.

Ding, D., Wei, X., Yang, Z., Wang, Z., and Jovanović, M. R. (2020). Provably efficient safe exploration via primal-dual policy optimization. arXiv preprint arXiv:2003.00534.

Efroni, Y., Mannor, S., and Pirotta, M. (2020). Exploration-exploitation in constrained MDPs. arXiv preprint arXiv:2003.02189.

Jaksch, T., Ortner, R., and Auer, P. (2010). Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600.

Kearns, M. and Singh, S. (2002). Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2):209–232.

Le, H. M., Voloshin, C., and Yue, Y. (2019). Batch policy learning under constraints. CoRR, abs/1903.08738.

Leike, J., Martic, M., Krakovna, V., Ortega, P. A., Everitt, T., Lefrancq, A., Orseau, L., and Legg, S. (2017). AI safety gridworlds. arXiv preprint arXiv:1711.09883.

Mao, H., Alizadeh, M., Menache, I., and Kandula, S. (2016). Resource management with deep reinforcement learning. In Proceedings of the 15th ACM Workshop on Hot Topics in Networks, pages 50–56, New York, NY, USA. Association for Computing Machinery.

Miryoosefi, S., Brantley, K., Daume III, H., Dudik, M., and Schapire, R. E. (2019). Reinforcement learning with convex constraints. In Advances in Neural Information Processing Systems (NeurIPS).

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529.

Puterman, M. L. (2014). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons.

Qiu, S., Wei, X., Yang, Z., Ye, J., and Wang, Z. (2020). Upper confidence primal-dual optimization: Stochastically constrained Markov decision processes with adversarial losses and unknown transitions. arXiv preprint arXiv:2003.00660.

Ray, A., Achiam, J., and Amodei, D. (2020). Benchmarking safe exploration in deep reinforcement learning. https://cdn.openai.com/safexp-short.pdf. Accessed March 11, 2020.

Rosenberg, A. and Mansour, Y. (2019). Online convex optimization in adversarial Markov decision processes. In International Conference on Machine Learning, pages 5478–5486.

Schulman, J., Levine, S., Moritz, P., Jordan, M. I., and Abbeel, P. (2015). Trust region policy optimization. CoRR, abs/1502.05477.

Singh, R., Gupta, A., and Shroff, N. B. (2020). Learning in Markov decision processes under constraints. arXiv preprint arXiv:2002.12435.

Slivkins, A. (2019). Introduction to multi-armed bandits. Foundations and Trends in Machine Learning, 12(1-2):1–286. Also available at https://arxiv.org/abs/1904.07272.

Sun, W., Vemula, A., Boots, B., and Bagnell, J. A. (2019). Provably efficient imitation learning from observation alone. arXiv preprint arXiv:1905.10948.

Sutton, R. S. (1991). Dyna, an integrated architecture for learning, planning, and reacting. SIGART Bull., 2(4):160–163.

Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press, first edition.

Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press, second edition.

Syed, U. and Schapire, R. E. (2007). A game-theoretic approach to apprenticeship learning. In Proceedings of the 20th International Conference on Neural Information Processing Systems, NIPS'07, pages 1449–1456, Red Hook, NY, USA. Curran Associates Inc.

Tessler, C., Mankowitz, D. J., and Mannor, S. (2019). Reward constrained policy optimization. In International Conference on Learning Representations.

Wang, Z., Deng, S., and Ye, Y. (2014). Close the gaps: A learning-while-doing algorithm for single-product revenue management problems. Operations Research, 62(2):318–331.

Ziebart, B. D., Maas, A. L., Bagnell, J. A., and Dey, A. K. (2008). Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433–1438. Chicago, IL, USA.

Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the Twentieth International Conference on Machine Learning (ICML).
A Algorithm: Formal description and design choices
Our main algorithm, denoted ConRL, is presented as Algorithm 1. We instantiate ConRL for our different settings (i.e., the basic, concave-convex, and knapsack settings) by using the appropriate ConPlanner, which we discuss in the remainder of this section.
Algorithm 1 ConRL

for episode k = 1 to K do
    Compute empirical estimates: compute $N_k$, $\hat p_k$, $\hat r_k$, and $\hat c_k$ based on Eqs. (4-7)
    Compute bonus: compute $\hat b_k$ as in Eq. (9)
    Call constrained planner: $\pi_k \leftarrow$ ConPlanner$(\hat p_k, \hat r_k, \hat c_k, \hat b_k)$
    Execute policy: initialize state $s_{k,1} = s_0$
    for stage h = 1 to H do
        Select $a_{k,h} \sim \pi_k(s_{k,h})$
        Observe reward $r_{k,h}$, consumptions $c_{k,h,i}$ for all $i \in \mathcal{D}$, and new state $s_{k,h+1}$
    end for
end for
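For concreteness, here is a minimal sketch of the loop above, assuming a hypothetical environment object env with reset() and step(a) returning (next state, reward, consumption vector), and a con_planner callable implementing one of the planners of this appendix; all of these names are our own illustration:

```python
import numpy as np

def con_rl(env, con_planner, S, A, H, d, K, delta):
    """Sketch of Algorithm 1 (ConRL); `env` and `con_planner` are hypothetical."""
    counts = np.zeros((S, A))           # raw visit counts
    p_sum = np.zeros((S, A, S))         # transition counts
    r_sum = np.zeros((S, A))            # summed rewards
    c_sum = np.zeros((S, A, d))         # summed consumptions
    for k in range(1, K + 1):
        # Empirical estimates (Eqs. 4-7), with N_k clipped away from zero.
        N = np.maximum(counts, 1)
        p_hat = p_sum / N[:, :, None]
        p_hat[counts == 0] = 1.0 / S    # arbitrary distribution for unvisited pairs
        r_hat, c_hat = r_sum / N, c_sum / N[:, :, None]
        # Bonus (Eq. 9) and planning on the bonus-enhanced model.
        b = H * np.sqrt(np.log(4 * S * A * H * (d + 1) * k**2 / delta) / N)
        pi = con_planner(p_hat, r_hat, c_hat, b)       # returns shape (H, S, A)
        # Execute the policy for one episode and record the transitions.
        s = env.reset()
        for h in range(H):
            a = np.random.choice(A, p=pi[h, s])
            s_next, r, c = env.step(a)                  # c is a length-d vector
            counts[s, a] += 1
            p_sum[s, a, s_next] += 1
            r_sum[s, a] += r
            c_sum[s, a] += c
            s = s_next
```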
A.1 Basic setting - BasicConPlanner
We define the bonus-enhanced cMDP $M^{(k)} = (p^{(k)}, r^{(k)}, c^{(k)})$ as

$$p^{(k)}(s' \mid s,a) = \hat p_k(s' \mid s,a) \quad \forall s,a,s'$$
$$r^{(k)}(s,a) = \hat r_k(s,a) + \hat b_k(s,a) \quad \forall s,a$$
$$c^{(k)}(s,a,i) = \hat c_k(s,a,i) - \hat b_k(s,a) \quad \forall s,a,\, i \in \mathcal{D},$$

and then solve the following optimization problem:

$$\max_\pi \; \mathbb{E}_{\pi,p^{(k)}}\Big[\sum_{h=1}^H r^{(k)}(s_h,a_h)\Big] \quad \text{s.t.} \quad \forall i \in \mathcal{D}: \; \mathbb{E}_{\pi,p^{(k)}}\Big[\sum_{h=1}^H c^{(k)}(s_h,a_h,i)\Big] \le \xi(i).$$

This optimization problem can be solved exactly, since it is equivalent to the following linear program on occupation measures (Rosenberg and Mansour, 2019; Altman, 1999). The decision variables are $\rho(s,a,h)$, i.e., the probability of the agent being at state-action pair $(s,a)$ at time step $h$:

$$\max_\rho \; \sum_{s,a,h} \rho(s,a,h)\, r^{(k)}(s,a)$$
$$\text{s.t.} \quad \sum_{s,a,h} \rho(s,a,h)\, c^{(k)}(s,a,i) \le \xi(i) \quad \forall i \in \mathcal{D}$$
$$\forall s', h: \; \sum_a \rho(s',a,h+1) = \sum_{s,a} \rho(s,a,h)\, p^{(k)}(s' \mid s,a)$$
$$\forall s,a,h: \; 0 \le \rho(s,a,h) \le 1 \quad \text{and} \quad \sum_{s,a} \rho(s,a,h) = 1 \qquad (14)$$
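A compact sketch of LP (14) using cvxpy (our choice of solver interface; the paper does not prescribe one, and all names below are illustrative). One may additionally pin the step-1 occupancy to the known initial state, which the LP above leaves implicit:

```python
import cvxpy as cp
import numpy as np

def basic_con_planner(p, r, c, xi, H):
    """Solve LP (14) over occupation measures.
    p: (S, A, S) transitions, r: (S, A) rewards, c: (S, A, d) consumptions,
    xi: (d,) capacities. Returns a policy of shape (H, S, A)."""
    S, A, d = c.shape
    rho = cp.Variable((H, S * A), nonneg=True)   # rho[h, s*A + a] = rho(s, a, h)
    P = p.reshape(S * A, S)                      # P[s*A + a, s'] = p(s' | s, a)
    M = np.kron(np.eye(S), np.ones(A))           # sums rho over actions per state
    cons = [cp.sum(rho, axis=1) == 1]            # each step is a distribution
    # Flow conservation: state marginal at h+1 equals the flow pushed from h.
    cons += [M @ rho[h + 1] == P.T @ rho[h] for h in range(H - 1)]
    occ = cp.sum(rho, axis=0)                    # total occupancy per (s, a)
    cons.append(occ @ c.reshape(S * A, d) <= xi) # consumption capacities
    cp.Problem(cp.Maximize(occ @ r.reshape(-1)), cons).solve()
    # Recover a (stochastic) policy: pi(a | s, h) proportional to rho(s, a, h).
    rho_val = rho.value.reshape(H, S, A)
    return rho_val / np.maximum(rho_val.sum(axis=2, keepdims=True), 1e-12)
```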
A.2 Concave-convex setting - ConvexConPlanner

In this setting, unlike the basic setting, the objective and constraints are not linear. Therefore, due to the lack of monotonicity, we cannot explicitly define the bonus-enhanced cMDP $M^{(k)} = (p^{(k)}, r^{(k)}, c^{(k)})$. The bonus-enhanced cMDP is implicit in the following program that we solve (see Section 4):

$$\max_\pi \max_{r^{(k)} \in [\hat r_k \pm \hat b_k]} f\Big(\mathbb{E}_{\pi,p^{(k)}}\Big[\sum_{h=1}^H r^{(k)}(s_h,a_h)\Big]\Big) \quad \text{s.t.} \quad \min_{c^{(k)} \in [\hat c_k \pm \hat b_k \cdot \mathbb{1}]} g\Big(\mathbb{E}_{\pi,p^{(k)}}\Big[\sum_{h=1}^H c^{(k)}(s_h,a_h)\Big]\Big) \le 0.$$

Similarly to before, expressing this program in terms of occupation measures yields a convex program:

$$\max_\rho \max_{r \in [\hat r_k \pm \hat b_k]} f\Big(\sum_{s,a,h} \rho(s,a,h)\, r(s,a)\Big) \quad \text{s.t.} \quad \min_{c \in [\hat c_k \pm \hat b_k \cdot \mathbb{1}]} g\Big(\sum_{s,a,h} \rho(s,a,h)\, c(s,a)\Big) \le 0$$
$$\forall s', h: \; \sum_a \rho(s',a,h+1) = \sum_{s,a} \rho(s,a,h)\, \hat p_k(s' \mid s,a)$$
$$\forall s,a,h: \; 0 \le \rho(s,a,h) \le 1 \quad \text{and} \quad \sum_{s,a} \rho(s,a,h) = 1 \qquad (15)$$

The notations $r \in [\hat r_k \pm \hat b_k]$ and $c \in [\hat c_k \pm \hat b_k \cdot \mathbb{1}]$ are defined as

$$r \in [\hat r_k \pm \hat b_k] \iff \forall s,a: \; r(s,a) \in [\hat r_k(s,a) - \hat b_k(s,a),\; \hat r_k(s,a) + \hat b_k(s,a)]$$
$$c \in [\hat c_k \pm \hat b_k \cdot \mathbb{1}] \iff \forall i \in \mathcal{D},\, s,a: \; c(s,a,i) \in [\hat c_k(s,a,i) - \hat b_k(s,a),\; \hat c_k(s,a,i) + \hat b_k(s,a)].$$

Note that if $f$ and $g$ are linear, we end up with a linear program similar to (14).
A.3 Knapsack setting - KnapsackConPlanner
We define the bonus-enhanced cMDP $M^{(k)} = (p^{(k)}, r^{(k)}, c^{(k)})$ as in the basic setting (A.1). We then solve a similar optimization problem with tighter constraints:

$$\max_\pi \; \mathbb{E}_{\pi,p^{(k)}}\Big[\sum_{h=1}^H r^{(k)}(s_h,a_h)\Big] \quad \text{s.t.} \quad \forall i \in \mathcal{D}: \; \mathbb{E}_{\pi,p^{(k)}}\Big[\sum_{h=1}^H c^{(k)}(s_h,a_h,i)\Big] \le (1-\epsilon)\frac{B_i}{K}.$$
This optimization problem can again be solved using the following linear program on occupation measures, with decision variables $\rho(s,a,h)$, i.e., the probability of the agent being at state-action pair $(s,a)$ at step $h$:

$$\max_\rho \; \sum_{s,a,h} \rho(s,a,h)\, r^{(k)}(s,a)$$
$$\text{s.t.} \quad \sum_{s,a,h} \rho(s,a,h)\, c^{(k)}(s,a,i) \le (1-\epsilon)\frac{B_i}{K} \quad \forall i \in \mathcal{D}$$
$$\forall s', h: \; \sum_a \rho(s',a,h+1) = \sum_{s,a} \rho(s,a,h)\, p^{(k)}(s' \mid s,a)$$
$$\forall s,a,h: \; 0 \le \rho(s,a,h) \le 1 \quad \text{and} \quad \sum_{s,a} \rho(s,a,h) = 1 \qquad (16)$$
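Since LP (16) differs from LP (14) only in its right-hand side, the basic_con_planner sketch from our code example in Appendix A.1 above (an illustrative helper, not from the paper) can be reused directly:

```python
# Knapsack planner: LP (16) is LP (14) with capacities (1 - eps) * B / K,
# where B is the vector of budgets and eps the slack from Theorem 5.1.
pi = basic_con_planner(p_hat, r_opt, c_opt, xi=(1 - eps) * B / K, H=H)
```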
B Analysis: Basic setting (Section 3)

In this section, we prove the main guarantee for the basic setting.
B.1 Validity of bonus (Lemma 3.2)
We first prove that the bonus $\hat b_k(s,a) = H\sqrt{\frac{\ln(4SAH(d+1)k^2/\delta)}{N_k(s,a)}}$ of Eq. (9) is valid in the sense of Definition 3.1.

Proof of Lemma 3.2.
We focus on a single state-action pair $(s,a)$, stage $h$, and objective $m$. Since the support of $m$ is in $[0,1]$ and that of the value function is in $[0, H-1]$, by the Hoeffding inequality (see Lemma F.2), and since the $(s,a)$ pair is visited $N_k(s,a)$ times prior to episode $k$, it holds for all $k$, with probability at least $1 - \delta'$, that:

$$\Big| \big(\hat m_k(s,a) - m^\star(s,a)\big) + \sum_{s' \in \mathcal{S}} \big(\hat p_k(s' \mid s,a) - p^\star(s' \mid s,a)\big)\, V^{\pi^\star,p^\star}_{m^\star}(s', h+1) \Big| \le H \sqrt{\frac{\ln(2/\delta')}{N_k(s,a)}}.$$

As a result, with $\delta' = \frac{\delta}{2SAH(d+1)k^2}$, the bonus $\hat b_k(s,a)$ satisfies this inequality for a particular state-action-step-objective combination with failure probability at most $\delta'$, and is therefore valid (satisfying it for all states, actions, steps, and objectives) with failure probability $\frac{\delta}{2k^2}$. Union bounding across episodes, the probability of $\hat b_k(s,a)$ not being valid for some $k$ is at most $\sum_{k=1}^K \frac{\delta}{2k^2} \le \delta$. □

B.2 Valid bonus implies optimism
The main reason to optimize a bonus-enhanced model with valid bonuses is that the latter renders the model optimistic, i.e., its estimated reward is an overestimate of the true reward. Similarly, in constrained settings, its estimated resource consumptions are underestimates of the true resource consumptions. This is formalized in the following definition.
Definition B.1. A cMDP $M = (p, r, c)$ is optimistic if its estimated reward (resp. consumption) value function for policy $\pi^\star$ upper (resp. lower) bounds the corresponding value function under the ground truth:

$$\mathbb{E}\big[V^{\pi^\star,p}_r(s_0,1)\big] \ge \mathbb{E}\big[V^{\pi^\star,p^\star}_{r^\star}(s_0,1)\big] \quad \text{and} \quad \mathbb{E}\big[V^{\pi^\star,p}_{c_i}(s_0,1)\big] \le \mathbb{E}\big[V^{\pi^\star,p^\star}_{c^\star_i}(s_0,1)\big] \quad \forall i \in \mathcal{D}.$$

An important block of the analysis for the basic setting is to show that, when using a bonus-enhanced model with valid bonuses, the resulting cMDP is optimistic.
Lemma B.2.
If the bonus $\hat b_k(s,a)$ of Eq. (9) in episode $k$ is valid (Definition 3.1) for the corresponding cMDP $M^{(k)} = (p^{(k)}, r^{(k)}, c^{(k)})$, then $M^{(k)}$ is optimistic.

Proof. We first prove the optimism of the model for the reward objective. More concretely, we show by induction that for any state $s$, action $a$, and stage $h$, $Q^{\pi^\star,p^{(k)}}_{r^{(k)}}(s,a,h) \ge Q^{\pi^\star,p^\star}_{r^\star}(s,a,h)$; taking the expectation over the state-action pair of the first stage, the claim then follows.

Since the episode ends at stage $H$, $Q^{\pi^\star,p^{(k)}}_{r^{(k)}}(s,a,H+1) = Q^{\pi^\star,p^\star}_{r^\star}(s,a,H+1) = 0$.

We assume that the inductive hypothesis $Q^{\pi^\star,p^{(k)}}_{r^{(k)}}(s,a,h+1) \ge Q^{\pi^\star,p^\star}_{r^\star}(s,a,h+1)$ (and thus also $V^{\pi^\star,p^{(k)}}_{r^{(k)}}(s,h+1) \ge V^{\pi^\star,p^\star}_{r^\star}(s,h+1)$) holds, and proceed with the inductive step. The $Q$-functions in question are:

$$Q^{\pi^\star,p^{(k)}}_{r^{(k)}}(s,a,h) = r^{(k)}(s,a) + \sum_{s' \in \mathcal{S}} p^{(k)}(s' \mid s,a)\, V^{\pi^\star,p^{(k)}}_{r^{(k)}}(s',h+1) \ge r^{(k)}(s,a) + \sum_{s' \in \mathcal{S}} p^{(k)}(s' \mid s,a)\, V^{\pi^\star,p^\star}_{r^\star}(s',h+1),$$
$$Q^{\pi^\star,p^\star}_{r^\star}(s,a,h) = r^\star(s,a) + \sum_{s' \in \mathcal{S}} p^\star(s' \mid s,a)\, V^{\pi^\star,p^\star}_{r^\star}(s',h+1).$$

Subtracting, we have:

$$Q^{\pi^\star,p^{(k)}}_{r^{(k)}}(s,a,h) - Q^{\pi^\star,p^\star}_{r^\star}(s,a,h) \ge \big(\hat r_k(s,a) + \hat b_k(s,a) - r^\star(s,a)\big) + \sum_{s' \in \mathcal{S}} \big(\hat p_k(s' \mid s,a) - p^\star(s' \mid s,a)\big)\, V^{\pi^\star,p^\star}_{r^\star}(s',h+1) \ge 0,$$

where the last inequality holds since the bonuses are valid.

The optimism of the model with respect to the consumption objectives follows the same steps, reversing the direction of the inequalities and setting the estimate to the empirical mean minus the bonus. □

We emphasize that our bonus in Eq. (9) does not scale polynomially with respect to $|\mathcal{S}|$; despite that, as indicated by the above lemma, it suffices to prove optimism.

B.3 Simulation lemma
To prove the Bellman-error regret decomposition, an essential piece is the so-called simulation lemma (Kearns and Singh, 2002), which we adapt to constrained settings below:
Lemma B.3 (Simulation lemma). For any policy $\pi$, any cMDP $M = (p, r, c)$, and any objective $m \in \{r\} \cup \{c_i\}_{i \in \mathcal{D}}$ with corresponding true objective $m^\star \in \{r^\star\} \cup \{c^\star_i\}_{i \in \mathcal{D}}$, it holds that:

$$\mathbb{E}\big[V^{\pi,p}_m(s_0,1)\big] - \mathbb{E}\big[V^{\pi,p^\star}_{m^\star}(s_0,1)\big] = \mathbb{E}_\pi\Big[\sum_{h=1}^H \mathrm{Bell}^{\pi,p}_m(s_h,a_h,h)\Big]. \qquad (17)$$

Proof.
For all $m \in \{r\} \cup \{c_i\}_{i \in \mathcal{D}}$, rearranging the definition of the Bellman error, we obtain:

$$Q^{\pi,p}_m(s,a,h) = \big(\mathrm{Bell}^{\pi,p}_m(s,a,h) + m^\star(s,a)\big) + \sum_{s' \in \mathcal{S}} p^\star(s' \mid s,a)\, V^{\pi,p}_m(s',h+1),$$
$$Q^{\pi,p^\star}_{m^\star}(s,a,h) = \big(\mathrm{Bell}^{\pi,p^\star}_{m^\star}(s,a,h) + m^\star(s,a)\big) + \sum_{s' \in \mathcal{S}} p^\star(s' \mid s,a)\, V^{\pi,p^\star}_{m^\star}(s',h+1).$$

By definition, the Bellman error with respect to the true model is equal to $0$. As a result, subtracting the two equations above, we obtain:

$$Q^{\pi,p}_m(s,a,h) - Q^{\pi,p^\star}_{m^\star}(s,a,h) = \mathrm{Bell}^{\pi,p}_m(s,a,h) + \sum_{s' \in \mathcal{S}} p^\star(s' \mid s,a)\, \big( V^{\pi,p}_m(s',h+1) - V^{\pi,p^\star}_{m^\star}(s',h+1) \big).$$

Using $\pi$ to select $a_1$ at the initial state $s_1 = s_0$ and setting $h = 1$, we obtain:

$$\mathbb{E}\big[V^{\pi,p}_m(s_1,1) - V^{\pi,p^\star}_{m^\star}(s_1,1)\big] = \mathbb{E}_\pi\big[\mathrm{Bell}^{\pi,p}_m(s_1,a_1,1)\big] + \mathbb{E}_\pi\big[V^{\pi,p}_m(s_2,2) - V^{\pi,p^\star}_{m^\star}(s_2,2)\big].$$

Recursively expanding the second term of the right-hand side as above concludes the lemma. □
B.4 Bellman-error regret decomposition (Proposition 3.3)
Proof of Proposition 3.3.
The consumption requirement (11) for resource $i$ follows by applying the simulation lemma (Lemma B.3) to the cMDP $M^{(k)}$ and objective $m = c^{(k)}_i$ (with corresponding true objective $m^\star = c^\star_i$), and using that $\pi_k$ is feasible for ConPlanner$(p^{(k)}, r^{(k)}, c^{(k)})$:

$$\mathbb{E}_{\pi_k,p^\star}\Big[\sum_{h=1}^H c^\star(s_h,a_h,i)\Big] = \mathbb{E}\big[V^{\pi_k,p^\star}_{c^\star_i}(s_0,1)\big] = \mathbb{E}\big[V^{\pi_k,p^{(k)}}_{c^{(k)}_i}(s_0,1)\big] - \mathbb{E}_{\pi_k}\Big[\sum_{h=1}^H \mathrm{Bell}^{\pi_k,p^{(k)}}_{c^{(k)}_i}(s_h,a_h,h)\Big] \le \xi(i) + \mathbb{E}_{\pi_k}\Big[\sum_{h=1}^H \big|\mathrm{Bell}^{\pi_k,p^{(k)}}_{c^{(k)}_i}(s_h,a_h,h)\big|\Big].$$

Regarding the reward requirement (10), what we wish to bound is:

$$\mathbb{E}_{\pi^\star,p^\star}\Big[\sum_{h=1}^H r^\star(s_h,a_h)\Big] - \mathbb{E}_{\pi_k,p^\star}\Big[\sum_{h=1}^H r^\star(s_h,a_h)\Big] = \mathbb{E}\big[V^{\pi^\star,p^\star}_{r^\star}(s_0,1)\big] - \mathbb{E}\big[V^{\pi_k,p^\star}_{r^\star}(s_0,1)\big].$$

The validity of the bonus implies that the model $M^{(k)}$ is optimistic (Lemma B.2), i.e., we have that $\mathbb{E}\big[V^{\pi^\star,p^\star}_{r^\star}(s_0,1)\big] \le \mathbb{E}\big[V^{\pi^\star,p^{(k)}}_{r^{(k)}}(s_0,1)\big]$. If $\pi^\star$ is feasible for ConPlanner$(p^{(k)}, r^{(k)}, c^{(k)})$ then, since $\pi_k$ is the maximizer of this program:

$$\mathbb{E}\big[V^{\pi^\star,p^{(k)}}_{r^{(k)}}(s_0,1)\big] - \mathbb{E}\big[V^{\pi_k,p^\star}_{r^\star}(s_0,1)\big] \le \mathbb{E}\big[V^{\pi_k,p^{(k)}}_{r^{(k)}}(s_0,1)\big] - \mathbb{E}\big[V^{\pi_k,p^\star}_{r^\star}(s_0,1)\big] = \mathbb{E}_{\pi_k}\Big[\sum_{h=1}^H \mathrm{Bell}^{\pi_k,p^{(k)}}_{r^{(k)}}(s_h,a_h,h)\Big], \qquad (18)$$

where the last equality holds by applying the simulation lemma with $m = r^{(k)}$. Hence, this proves (10). What is left to show is that $\pi^\star$ is indeed feasible for ConPlanner$(p^{(k)}, r^{(k)}, c^{(k)})$. Since $M^{(k)}$ is optimistic and $\pi^\star$ is feasible for the ground truth $M^\star$, for all resources $i \in \mathcal{D}$:

$$\mathbb{E}\big[V^{\pi^\star,p^{(k)}}_{c^{(k)}_i}(s_0,1)\big] \le \mathbb{E}\big[V^{\pi^\star,p^\star}_{c^\star_i}(s_0,1)\big] \le \xi(i).$$

This completes the proof of the proposition. □
B.5 Bounding the Bellman error
We now provide an upper bound on the Bellman error, which arises on the right-hand side of the regret decomposition (Proposition 3.3).

Lemma B.4.
Let $\epsilon > 0$. If the bonus $\hat b_k$ is valid for all episodes $k$ simultaneously, then, with probability at least $1 - \delta$, for all objectives $m^{(k)} \in \{r^{(k)}\} \cup \{c^{(k)}_i\}_{i \in \mathcal{D}}$, transitions $p = p^{(k)}$, and stages $h$, the Bellman error at episode $k$ is upper bounded by:

$$\big| \mathrm{Bell}^{\pi_k,p^{(k)}}_{m^{(k)}}(s,a,h) \big| \le 2H \sqrt{\frac{S \ln\big(8SAH(d+1)Hk^2/(\epsilon\delta)\big)}{N_k(s,a)}} + \epsilon S.$$

Proof of Lemma B.4.
Let $\Psi$ be an $\epsilon$-net of $[-(H-1), (H-1)]^S$. For a fixed value $\bar V \in \Psi$, similarly to Lemma 3.2, with probability $1 - \delta'$, simultaneously for all states $s \in \mathcal{S}$, actions $a \in \mathcal{A}$, steps $h \in [H]$, episodes $k \in [K]$, and objectives $m^{(k)} \in \{r^{(k)}\} \cup \{c^{(k)}_i\}_{i \in \mathcal{D}}$, it holds that:

$$\Big| m^{(k)}(s,a) - m^\star(s,a) + \sum_{s' \in \mathcal{S}} \big( p^{(k)}(s' \mid s,a) - p^\star(s' \mid s,a) \big)\, \bar V(s') \Big| \le \hat b_k(s,a) + H \sqrt{\frac{\ln\big(4SAH(d+1)k^2/\delta'\big)}{N_k(s,a)}}.$$

Since $\Psi$ is an $\epsilon$-net, there are $(2H/\epsilon)^S$ potential values of $\bar V$. In order for the above to hold simultaneously for all of these values with probability $1 - \delta$, we need to set $\delta' = \delta / (2H/\epsilon)^S$, so that $\ln(4SAH(d+1)k^2/\delta') = \ln(4SAH(d+1)k^2/\delta) + S\ln(2H/\epsilon) \le S \ln\big(8SAH(d+1)Hk^2/(\epsilon\delta)\big)$.

Since the value $V^{\pi_k,p^{(k)}}_{m^{(k)}}(s', h+1)$ lies in $[-(H-1), (H-1)]$ for all $s'$, there exists a value $\bar V$ in the $\epsilon$-net within distance at most $\epsilon S$. As a result, since $\hat b_k(s,a)$ is valid for episode $k$:

$$\big| \mathrm{Bell}^{\pi_k,p^{(k)}}_{m^{(k)}}(s,a,h) \big| \le \Big| m^{(k)}(s,a) - m^\star(s,a) + \sum_{s' \in \mathcal{S}} \big( p^{(k)}(s' \mid s,a) - p^\star(s' \mid s,a) \big)\, \bar V(s') \Big| + \Big| \sum_{s' \in \mathcal{S}} \big( p^{(k)}(s' \mid s,a) - p^\star(s' \mid s,a) \big) \big( \bar V(s') - V^{\pi_k,p^{(k)}}_{m^{(k)}}(s', h+1) \big) \Big|$$
$$\le \hat b_k(s,a) + H \sqrt{\frac{S \ln\big(8SAH(d+1)Hk^2/(\epsilon\delta)\big)}{N_k(s,a)}} + \epsilon S.$$

Upper bounding $\hat b_k(s,a) \le H \sqrt{\frac{S \ln\big(8SAH(d+1)Hk^2/(\epsilon\delta)\big)}{N_k(s,a)}}$ completes the lemma. □

B.6 Final guarantee for the basic setting (Theorem 3.4)
B.6 Final guarantee for the basic setting (Theorem 3.4)

Proof.
The failure probability of the algorithm is $\delta$ due to the validity of the bonus $\hat{b}_k(s, a)$ (Lemma 3.2) and another $\delta$ due to the bound on the Bellman error (Lemma B.4). When neither failure event occurs (probability $1 - 2\delta$), Proposition 3.3 upper bounds both the reward and the consumption regret by $\mathbb{E}_{\pi_k}\big[\sum_{h=1}^H \big|\mathrm{Bell}^{\pi_k, p^{(k)}}_{m^{(k)}}(s_h, a_h, h)\big|\big]$. By Lemma B.4, the Bellman error at episode $t$, for $\varepsilon > 0$, is at most:
$$\Big|\mathrm{Bell}^{\pi_t, p^{(t)}}_{m^{(t)}}(s_{t,h}, a_{t,h}, h)\Big| \le 2H\sqrt{\frac{S \ln\big(2SAH(d+1)t^2/(\varepsilon\delta)\big)}{N_t(s, a)}} + \varepsilon S.$$
Summing over $h = 1, \dots, H$ and $t = 1, \dots, k$, the sum of Bellman errors is at most:
$$\sum_{t=1}^k \sum_{h=1}^H \Big|\mathrm{Bell}^{\pi_t, p^{(t)}}_{m^{(t)}}(s_{t,h}, a_{t,h}, h)\Big| \le \sum_{s,a} \Bigg(\sum_{j=1}^H 2H\sqrt{S \ln\big(2SAH(d+1)k^2/(\varepsilon\delta)\big)} + \sum_{j=H+1}^{N_k(s,a)} 2H\sqrt{\frac{2S \ln\big(2SAH(d+1)k^2/(\varepsilon\delta)\big)}{j}}\Bigg) + \varepsilon kHS.$$
The inequality follows since a particular state-action pair may keep the same visitation count for $H$ consecutive steps (as we only update this quantity at the end of the episode). To avoid incurring an additional dependence on $H$, we separate the first $H$ visits of each state-action pair and treat the bound as if $j = 1$ for them. For the remaining visits, $j$ and $N_t(s, a)$ are always within a factor of $2$, and this factor therefore appears within the square root. (We sum until $H$ in the first term because we want to cover all visits occurring in an episode that started with $N_k(s, a) < H$; the additional factor of $2$ in the second term comes from $j / N_t(s, a) \le 2$ whenever $N_t(s, a) \ge H$ and the $j$-th visit happens within the same episode.)

We now bound the second term. By Cauchy-Schwarz, $\sum_{j=H+1}^{N_k(s,a)} 1/\sqrt{j} \le \sqrt{N_k(s, a) \ln N_k(s, a)}$ and, applying Cauchy-Schwarz again together with $\sum_{s,a} N_k(s, a) \le kH$, $\sum_{s,a} \sqrt{N_k(s, a) \ln N_k(s, a)} \le \sqrt{SA \cdot kH \ln(kH)}$, so:
$$\sum_{s,a} \sum_{j=H+1}^{N_k(s,a)} 2H\sqrt{\frac{2S \ln\big(2SAH(d+1)k^2/(\varepsilon\delta)\big)}{j}} + \varepsilon kHS \le 2H\sqrt{2S \ln\big(2SAH(d+1)k^2/(\varepsilon\delta)\big)} \cdot \sqrt{SA \cdot kH \ln(kH)} + \varepsilon kHS \le 5\, S\sqrt{A}\, H^{3/2} \sqrt{k \ln(kH) \ln\big(2SAH(d+1)k/\delta\big)} + 1.$$
The last inequality holds by setting $\varepsilon = \frac{1}{kHS}$. The first term can be bounded by an additive term that depends only logarithmically on $k$:
$$\sum_{s,a} \sum_{j=1}^H 2H\sqrt{S \ln\big(2SAH(d+1)k^2/(\varepsilon\delta)\big)} \le 16\, S^{3/2} A H^2 \sqrt{\ln\big(2SAH(d+1)k/\delta\big)}.$$
As a result:
$$\sum_{t=1}^k \sum_{h=1}^H \Big|\mathrm{Bell}^{\pi_t, p^{(t)}}_{m^{(t)}}(s_{t,h}, a_{t,h}, h)\Big| \le 5\, S\sqrt{A}\, H^{3/2} \sqrt{k \ln(kH) \ln\big(2SAH(d+1)k/\delta\big)} + 1 + 16\, S^{3/2} A H^2 \sqrt{\ln\big(2SAH(d+1)k/\delta\big)}.$$
Finally, Proposition 3.3 requires the expected sum of Bellman errors under the policies $\{\pi_t\}$, while the above bounds the realized sum along the empirical trajectories; we relate the two via a simple martingale argument. From Lemma F.3, with probability at least $1 - \delta$, we have:
$$\Bigg|\sum_{t=1}^k \sum_{h=1}^H \Big|\mathrm{Bell}^{\pi_t, p^{(t)}}_{m^{(t)}}(s_{t,h}, a_{t,h}, h)\Big| - \sum_{t=1}^k \mathbb{E}_{\pi_t}\Big[\sum_{h=1}^H \Big|\mathrm{Bell}^{\pi_t, p^{(t)}}_{m^{(t)}}(s_h, a_h, h)\Big|\Big]\Bigg| \le 9 H^{3/2} \sqrt{k \ln(4k^2/\delta)},$$
where we use the fact that $\big|\mathrm{Bell}^{\pi,p}_m\big| \le 3H$, which follows from $Q^{\pi,p}_m(s, a) \in [0, H]$, $m^\star(s, a) \in [0, 1]$, and $V^{\pi,p}_m(s) \in [0, H]$. Combining the above, we conclude the proof.
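The two counting facts used in the summation step can be sanity-checked numerically (a throwaway script; the specific values of N and H below are arbitrary):

```python
import math

# Numeric sanity check of the two counting facts used above:
# (i)  sum_{j=2}^{N} 1/sqrt(j) <= sqrt(N ln N)   (Cauchy-Schwarz),
# (ii) j / N(s,a) <= 2 once N(s,a) >= H, when counts are frozen within an episode.
for N in [10, 100, 10_000]:
    lhs = sum(1 / math.sqrt(j) for j in range(2, N + 1))
    rhs = math.sqrt(N * math.log(N))
    print(f"N={N:>6}: sum 1/sqrt(j) = {lhs:10.2f} <= {rhs:10.2f}")

# (ii): within an episode of length H, the j-th visit to (s, a) is charged
# against the count recorded at the episode start, so j <= N(s,a) + H <= 2 N(s,a).
H = 7
for n_start in [H, 3 * H]:
    worst_j = n_start + H           # last visit charged against the stale count
    assert worst_j / n_start <= 2
print("factor-2 check passed")
```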
C Analysis: concave-convex setting (Section 4)

In this section, we prove the main guarantee for the concave-convex setting. Since the regret decomposition of the basic setting (Proposition 3.3) does not hold directly, as $f$ and $g$ are not linear, we need to create an analogous regret decomposition (Proposition C.2) for the concave-convex setting. This can be done by leveraging the Lipschitzness of the functions. Armed with this new regret decomposition, we can directly invoke the results for the basic setting (e.g., the upper bounds on Bellman errors) to conclude the regret analysis for the concave-convex setting. The first step leading to this regret decomposition is to show that $\pi^\star$ is a feasible solution of ConvexConPlanner.

C.1 Feasibility of optimal policy in concave-convex setting (Lemma C.1)
Lemma C.1.
If the bonus $\hat{b}_k$ is valid (in the sense of Definition 3.1) then the policy $\pi^\star$ that maximizes the objective of the concave-convex setting is feasible in ConvexConPlanner.

Proof.
Unlike the linear case, the feasibility of $\pi^\star$ requires more care. Applying the same dynamic programming argument as in Lemma B.2, it follows that:
$$\forall i \in \mathcal{D}: \quad \mathbb{E}\big[V^{\pi^\star, p^{(k)}}_{\hat{c}_{i,k} - \hat{b}_k}(s_1, 1)\big] \le \mathbb{E}\big[V^{\pi^\star, p^\star}_{c^\star_i}(s_1, 1)\big] \le \mathbb{E}\big[V^{\pi^\star, p^{(k)}}_{\hat{c}_{i,k} + \hat{b}_k}(s_1, 1)\big].$$
Letting $\tilde{g}_i(\alpha) = \mathbb{E}\big[V^{\pi^\star, p^{(k)}}_{\hat{c}_{i,k} + \alpha \hat{b}_k}(s_1, 1)\big]$, the above can be rewritten as:
$$\forall i \in \mathcal{D}: \quad \tilde{g}_i(-1) \le \mathbb{E}\big[V^{\pi^\star, p^\star}_{c^\star_i}(s_1, 1)\big] \le \tilde{g}_i(1).$$
Since $\tilde{g}_i(\cdot)$ is the expected value of the same policy under the same transitions, it is continuous in its argument. As a result, applying the intermediate value theorem to each $i$ separately, there exists some $\alpha_i \in [-1, 1]$ such that $\tilde{g}_i(\alpha_i) = \mathbb{E}\big[V^{\pi^\star, p^\star}_{c^\star_i}(s_1, 1)\big]$. Due to the feasibility of $\pi^\star$ under the true transitions and consumptions, it holds that $g\big((\tilde{g}_i(\alpha_i))_{i \in \mathcal{D}}\big) \le 0$. Hence, selecting the consumption estimates $\hat{c}_{i,k} + \alpha_i \hat{b}_k$ creates a feasible solution for $\pi^\star$ under the estimated transitions of the ConvexConPlanner program. The final value of $\pi^\star$ in this program maximizes the objective while retaining feasibility; hence, the existence of one feasible selection of consumption estimates concludes the proof of the lemma.

We conclude by remarking that proving feasibility under optimism for the concave-convex setting in the multi-step RL setting is more challenging than in the single-step multi-armed bandit setting of Agrawal and Devanur (2014), since in bandits there are no transitions. In the proof above, to show that $\pi^\star$ is feasible in ConvexConPlanner, which is defined with respect to $p^{(k)}$, we leverage the fact that $\tilde{g}_i(\alpha)$ is continuous, together with a novel application of the intermediate value theorem, to link $\pi^\star$'s performance in the optimistic model, $\mathbb{E}\big[V^{\pi^\star, p^{(k)}}_{\hat{c}_{i,k} + \alpha_i \hat{b}_k}(s_1, 1)\big]$, with its performance under the real model, $\mathbb{E}\big[V^{\pi^\star, p^\star}_{c^\star_i}(s_1, 1)\big]$.
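For a fixed policy and fixed transitions, the value is linear in the objective, so $\tilde{g}_i$ is in fact affine in $\alpha$ (and nondecreasing when $\hat{b}_k \ge 0$), and the interpolation coefficient $\alpha_i$ can even be located numerically. A minimal sketch of this intermediate-value step (the toy $\tilde{g}$, its coefficients, and the helper name are ours):

```python
def find_alpha(g_tilde, target, tol=1e-12):
    """Bisection for alpha in [-1, 1] with g_tilde(alpha) = target, assuming
    g_tilde is continuous and nondecreasing with g_tilde(-1) <= target <= g_tilde(1)."""
    lo, hi = -1.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g_tilde(mid) < target else (lo, mid)
    return 0.5 * (lo + hi)

# Toy example: g_tilde(alpha) = E[V under consumption c_hat + alpha * b] is affine in alpha.
g_tilde = lambda alpha: 0.30 + 0.20 * alpha     # hypothetical values
print(find_alpha(g_tilde, target=0.35))          # ~0.25
```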
C.2 Regret decomposition for concave-convex setting

Using the Lipschitz continuity of $f$ and $g$, we can decompose the regret into a sum of Bellman errors as before, scaled this time by the Lipschitz constant.

Proposition C.2.
Let $L$ be the Lipschitz constant of $f$ and $g$. If $\hat{b}_k(s, a, \delta)$ is valid for all episodes $k$ simultaneously, then the per-episode reward and consumption regrets can be upper bounded by:
$$f\Big(\mathbb{E}_{\pi^\star, p^\star}\Big[\sum_{h=1}^H r^\star(s_h, a_h)\Big]\Big) - f\Big(\mathbb{E}_{\pi_k, p^\star}\Big[\sum_{h=1}^H r^\star(s_h, a_h)\Big]\Big) \le L \cdot \mathbb{E}_{\pi_k}\Big[\sum_{h=1}^H \Big|\mathrm{Bell}^{\pi_k, p^{(k)}}_{r^{(k)}}(s_h, a_h, h)\Big|\Big],$$
$$g\Big(\mathbb{E}_{\pi_k, p^\star}\Big[\sum_{h=1}^H c^\star(s_h, a_h, \cdot)\Big]\Big) \le L \cdot \sum_{i \in \mathcal{D}} \mathbb{E}_{\pi_k}\Big[\sum_{h=1}^H \Big|\mathrm{Bell}^{\pi_k, p^{(k)}}_{c^{(k)}_i}(s_h, a_h, h)\Big|\Big].$$

Proof.
We first prove the reward requirement. Let $r(\pi)$ be the solution of the inner maximization program for policy $\pi$, and define $r^{(k)} = r(\pi_k)$. For notational convenience, we denote $\bar{V}^{\pi, p}_m = \mathbb{E}\big[V^{\pi, p}_m(s_1, 1)\big]$. Since $r^\star(s, a) \in [\hat{r}(s, a) - \hat{b}_k(s, a, \delta),\, \hat{r}(s, a) + \hat{b}_k(s, a, \delta)]$ and the bonus $\hat{b}_k$ is valid, similarly to Lemma B.2 it holds that:
$$\bar{V}^{\pi^\star, p^\star}_{r^\star} \in \Big[\bar{V}^{\pi^\star, p^{(k)}}_{\hat{r} - \hat{b}},\; \bar{V}^{\pi^\star, p^{(k)}}_{\hat{r} + \hat{b}}\Big]. \qquad (19)$$
As a result, by the intermediate value theorem, there exists $\alpha \in [-1, 1]$ such that $\bar{V}^{\pi^\star, p^\star}_{r^\star} = \bar{V}^{\pi^\star, p^{(k)}}_{\hat{r} + \alpha\hat{b}}$. Since $\pi_k$ is the maximizer of ConvexConPlanner and $\pi^\star$ is feasible for that program, it holds that:
$$f\big(\bar{V}^{\pi_k, p^{(k)}}_{r(\pi_k)}\big) \ge f\big(\bar{V}^{\pi^\star, p^{(k)}}_{r(\pi^\star)}\big) \ge f\big(\bar{V}^{\pi^\star, p^{(k)}}_{\hat{r} + \alpha\hat{b}}\big) = f\big(\bar{V}^{\pi^\star, p^\star}_{r^\star}\big), \qquad (20)$$
where the second inequality holds since $r(\pi^\star)$ is the maximizer of the inner program for $\pi^\star$, and the equality holds by (19). We are now ready to provide the equivalent of the regret decomposition:
$$f\big(\bar{V}^{\pi^\star, p^\star}_{r^\star}\big) - f\big(\bar{V}^{\pi_k, p^\star}_{r^\star}\big) \le f\big(\bar{V}^{\pi_k, p^{(k)}}_{r(\pi_k)}\big) - f\big(\bar{V}^{\pi_k, p^\star}_{r^\star}\big) \le L \cdot \Big|\bar{V}^{\pi_k, p^{(k)}}_{r(\pi_k)} - \bar{V}^{\pi_k, p^\star}_{r^\star}\Big| \le L \cdot \mathbb{E}_{\pi_k}\Big[\sum_{h=1}^H \Big|\mathrm{Bell}^{\pi_k, p^{(k)}}_{r^{(k)}}(s_h, a_h, h)\Big|\Big],$$
where the first inequality holds by (20), the second by Lipschitzness, and the last by the simulation lemma (Lemma B.3).

For the consumption requirement, since $\pi_k$ is feasible in ConvexConPlanner, denote again by $c(\pi)$ the consumption estimates in the maximizer of the inner mathematical program for policy $\pi$, and, as above, define $c^{(k)} = c(\pi_k)$. It holds that:
$$g\Big(\mathbb{E}_{\pi_k, p^{(k)}}\Big[\sum_{h=1}^H c_h(\pi_k)\Big]\Big) \le 0. \qquad (21)$$
As a result,
$$g\Big(\mathbb{E}_{\pi_k, p^\star}\Big[\sum_{h=1}^H c^\star_h\Big]\Big) - g\Big(\mathbb{E}_{\pi_k, p^{(k)}}\Big[\sum_{h=1}^H c_h(\pi_k)\Big]\Big) \le L \Big\|\mathbb{E}_{\pi_k, p^\star}\Big[\sum_{h=1}^H c^\star_h\Big] - \mathbb{E}_{\pi_k, p^{(k)}}\Big[\sum_{h=1}^H c_h(\pi_k)\Big]\Big\|_1 = L \sum_{i \in \mathcal{D}} \Big|\mathbb{E}_{\pi_k, p^\star}\Big[\sum_{h=1}^H c^\star_h(i)\Big] - \mathbb{E}_{\pi_k, p^{(k)}}\Big[\sum_{h=1}^H c_h(\pi_k, i)\Big]\Big| \le L \cdot \sum_{i \in \mathcal{D}} \mathbb{E}_{\pi_k}\Big[\sum_{h=1}^H \Big|\mathrm{Bell}^{\pi_k, p^{(k)}}_{c^{(k)}_i}(s_h, a_h, h)\Big|\Big],$$
where again we applied Lipschitzness and the simulation lemma.

C.3 Concave-convex theorem (Theorem 4.1)
Proof of Theorem 4.1.
The proof follows similarly to the proof of Theorem 3.4, replacing Proposition 3.3 with Proposition C.2. The linear dependence on $d$ in the consumption regret comes from the fact that the Lipschitzness of $g$ is defined with respect to the $\ell_1$ norm.
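To make the role of the norm concrete, here is one illustrative choice of constraint function (our example, not one used in the paper): the largest constraint violation,
$$g(x) = \max_{i \in \mathcal{D}} \big(x_i - \xi(i)\big)_+, \qquad |g(x) - g(y)| \le \max_{i} |x_i - y_i| = \|x - y\|_\infty \le \|x - y\|_1,$$
so this $g$ is $1$-Lipschitz in $\ell_1$ (indeed, even in $\ell_\infty$), and Proposition C.2 applies with $L = 1$.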
D Analysis: Knapsack setting (Section 5)

In this section, we prove the guarantee for the hard-constraint setting. The goal is to show that over $K$ episodes, our algorithm has sublinear reward regret compared to the best dynamic policy (formally defined in Appendix D.2), while satisfying the hard budget constraints with high probability.

D.1 Theorem with hard constraints (Theorem 5.1)
Proof of Theorem 5.1.
We denote by Opt the expected total reward of $\pi^\star$. Consider now the policy $\tilde{\pi}^\star$ that selects the null policy with probability $\varepsilon$ and follows $\pi^\star$ otherwise. This policy is feasible for (13); as a result, the expected reward of $\tilde{\pi}^\star$ for (13) is at least $(1 - \varepsilon)\,\mathrm{Opt}$. Since the total reward is upper bounded by $KH$, it therefore holds that:
$$\sum_{k=1}^K \mathbb{E}_{\tilde{\pi}^\star}\Big[\sum_{h=1}^H r^\star(s_h, a_h)\Big] \ge (1 - \varepsilon)\,\mathrm{Opt} \ge \mathrm{Opt} - \varepsilon KH. \qquad (22)$$
In the high-probability event where the regret guarantee $\mathrm{AggReg}(\delta)$ does not fail, the reward of the algorithm is at least:
$$\sum_{k=1}^K \sum_{h=1}^H r_{k,h} \ge \sum_{k=1}^K \mathbb{E}_{\tilde{\pi}^\star}\Big[\sum_{h=1}^H r^\star(s_h, a_h)\Big] - \mathrm{AggReg}(\delta). \qquad (23)$$
Combining (22) and (23), with probability $1 - \delta$, the per-episode reward regret with respect to $\pi^\star$ is at most:
$$\mathrm{RewReg}(K) \le \frac{1}{K}\,\mathrm{AggReg}(\delta) + \varepsilon H. \qquad (24)$$
We now focus on the consumption. Since we optimize (13), for any resource $i \in \mathcal{D}$, when the regret guarantee $\mathrm{AggReg}(\delta)$ against $\tilde{\pi}^\star$ does not fail, and given that $\tilde{\pi}^\star$ is feasible for (13), it holds that:
$$\sum_{k=1}^K \sum_{h=1}^H c_{k,h,i} \le \sum_{k=1}^K \mathbb{E}_{\tilde{\pi}^\star}\Big[\sum_{h=1}^H c(s_h, a_h, i)\Big] + \mathrm{AggReg}(\delta) \le (1 - \varepsilon) B_i + \mathrm{AggReg}(\delta).$$
Hence, when the regret guarantee $\mathrm{AggReg}(\delta)$ does not fail, the consumption is less than $B_i$ for all $i$ as long as $\varepsilon \ge \frac{\mathrm{AggReg}(\delta)}{\min_i B_i}$. Moreover, $\varepsilon$ is a probability, so it must also be at most $1$, which holds when $\min_i B_i \ge \mathrm{AggReg}(\delta)$. Applying this choice of $\varepsilon$ in (24) and assuming without loss of generality that $KH > \min_i B_i$ (otherwise the setting is essentially unconstrained), the reward regret is at most:
$$\mathrm{RewReg}(K) \le \frac{2H \cdot \mathrm{AggReg}(\delta)}{\min_i B_i}.$$
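The choice of the mixing probability and the resulting bound can be traced with hypothetical numbers (the function and all values below are our own illustration, not experimental settings):

```python
def null_mixing(agg_reg: float, budgets: list[float], K: int, H: int):
    """Pick the null-policy mixing probability from the proof of Theorem 5.1:
    eps = AggReg(delta) / min_i B_i, assuming min_i B_i >= AggReg(delta)."""
    b_min = min(budgets)
    assert b_min >= agg_reg, "budgets too small for the argument to apply"
    eps = agg_reg / b_min
    rew_reg = agg_reg / K + eps * H      # per-episode bound (24)
    return eps, rew_reg

# Hypothetical numbers: K = 10_000 episodes of horizon H = 10,
# online-learner regret AggReg = 500, per-resource budgets B_i.
print(null_mixing(500.0, [5_000.0, 8_000.0], K=10_000, H=10))
# eps = 0.1; per-episode reward regret <= 0.05 + 1.0
```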
D.2 Dynamic policy benchmark

We call a policy dynamic if it maps the entire history to a distribution over the action space. Specifically, we denote by $\mathcal{H}_{k,h}$ the history containing all information from the beginning of the first episode to the end of step $h - 1$ of the $k$-th episode, plus the state at step $h$ in episode $k$. At any episode $k$ and step $h$, a dynamic policy $\tilde{\pi}(\cdot \mid \mathcal{H}_{k,h}) \in \Delta(\mathcal{A})$ maps the history $\mathcal{H}_{k,h}$ to a distribution over the action space. We denote by $\Pi_{\mathrm{dynamic}}$ the set of all dynamic policies that satisfy the budget constraints deterministically, i.e., for any $\tilde{\pi} \in \Pi_{\mathrm{dynamic}}$, when executed for $K$ episodes in the MDP, we have $\sum_{k=1}^K \sum_{h=1}^H c_i(s_{k,h}, a_{k,h}) \le B_i$ for all $i \in \mathcal{D}$, deterministically. Ideally, we want to compare against the best dynamic policy, i.e., the one maximizing the expected total reward. We denote such an optimal dynamic policy by $\tilde{\pi}^\star$ and its expected total reward across $K$ episodes as
$$\mathrm{Opt} := \max_{\tilde{\pi} \in \Pi_{\mathrm{dynamic}}} \mathbb{E}_{\tilde{\pi}}\Big[\sum_{k=1}^K \sum_{h=1}^H r_{k,h}\Big].$$
The lemma below shows that the stationary Markovian policy $\pi^\star$ achieves expected total reward across $K$ episodes no smaller than that of the best dynamic policy.

Lemma D.1.
The reward of the policy $\pi^\star$ maximizing program (1) with $\xi(i) = \frac{B_i}{K}$ is at least as large as the per-episode reward of the optimal dynamic policy that is subject to hard constraints instead:
$$\mathbb{E}_{\pi^\star}\Big[\sum_{h=1}^H r^\star(s_h, a_h)\Big] \ge \frac{1}{K} \max_{\tilde{\pi} \in \Pi_{\mathrm{dynamic}}} \mathbb{E}_{\tilde{\pi}}\Big[\sum_{k=1}^K \sum_{h=1}^H r(s_{k,h}, a_{k,h})\Big] = \frac{\mathrm{Opt}}{K}.$$
Proof.
Denote by $\tilde{\pi}^\star$ the optimal dynamic policy in $\Pi_{\mathrm{dynamic}}$. Any policy induces a state-action distribution at episode $k$ and stage $h$, denoted $\rho_{\tilde{\pi}}(s, a; h, k)$, which stands for the probability that $\tilde{\pi}$ visits state-action pair $(s, a)$ at stage $h$ of episode $k$. Denote $\rho_{\tilde{\pi}}(s, a; h) = \sum_{k=1}^K \rho_{\tilde{\pi}}(s, a; h, k) / K$, which stands for the average probability of $\tilde{\pi}$ visiting $(s, a)$ at stage $h$. We have:
$$\sum_a \rho_{\tilde{\pi}}(s', a; h, k) = \sum_{s,a} \rho_{\tilde{\pi}}(s, a; h-1, k)\, p^\star(s' \mid s, a) \quad \forall s',$$
due to the Markovian transitions $p^\star(s' \mid s, a)$, which implies that:
$$\sum_a \rho_{\tilde{\pi}}(s', a; h) = \sum_{s,a} \rho_{\tilde{\pi}}(s, a; h-1)\, p^\star(s' \mid s, a) \quad \forall s'.$$
Hence, $\rho_{\tilde{\pi}}(s, a; h)$ satisfies the flow constraints, and therefore induces a stationary Markovian policy
$$\pi_{\tilde{\pi}}(a \mid s; h) = \rho_{\tilde{\pi}}(s, a; h) \Big/ \sum_{a'} \rho_{\tilde{\pi}}(s, a'; h),$$
and $\pi_{\tilde{\pi}}$ induces state-action visitation distributions exactly equal to $\rho_{\tilde{\pi}}(s, a; h)$.

Note that $\tilde{\pi}^\star$ satisfies the budget constraints deterministically, which means that it satisfies them in expectation as well, i.e.,
$$\sum_{k=1}^K \sum_{h=1}^H \sum_{(s,a)} \rho_{\tilde{\pi}^\star}(s, a; h)\, c_i(s, a) \le B_i \quad \forall i \in \mathcal{D},$$
which implies that, in expectation, for $\pi_{\tilde{\pi}^\star}$ we have, for all $i \in \mathcal{D}$:
$$\mathbb{E}_{\pi_{\tilde{\pi}^\star}}\Big[\sum_{h=1}^H c_i(s_h, a_h)\Big] = \sum_{h=1}^H \sum_{(s,a)} \rho_{\pi_{\tilde{\pi}^\star}}(s, a; h)\, c_i(s, a) = \frac{1}{K} \sum_{k=1}^K \sum_{h=1}^H \sum_{(s,a)} \rho_{\tilde{\pi}^\star}(s, a; h)\, c_i(s, a) \le \frac{B_i}{K}.$$
This means that $\pi_{\tilde{\pi}^\star}$ is a feasible solution of the hard-constraint program. Similarly, the expected per-episode total reward of $\tilde{\pi}^\star$ is the same as the expected total reward of $\pi_{\tilde{\pi}^\star}$:
$$\mathbb{E}_{\pi_{\tilde{\pi}^\star}}\Big[\sum_{h=1}^H r_h(s_h, a_h)\Big] = \frac{1}{K}\, \mathbb{E}_{\tilde{\pi}^\star}\Big[\sum_{k=1}^K \sum_{h=1}^H r_{k,h}\Big].$$
Hence, due to the optimality of $\pi^\star$, we immediately have:
$$\mathbb{E}_{\pi^\star}\Big[\sum_{h=1}^H r_h\Big] \ge \mathbb{E}_{\pi_{\tilde{\pi}^\star}}\Big[\sum_{h=1}^H r_h\Big] = \frac{1}{K}\, \mathbb{E}_{\tilde{\pi}^\star}\Big[\sum_{k=1}^K \sum_{h=1}^H r_{k,h}\Big].$$
Since our approach incurs sublinear regret with respect to $\pi^\star$, it follows from the above lemma that it incurs sublinear regret with respect to Opt, the total reward across $K$ episodes of the best dynamic policy.
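The occupancy-measure construction in this proof is directly implementable. The following sketch (our illustration; the toy sizes and variable names are ours) builds a stage-dependent occupancy measure, extracts the induced Markov policy by normalization as in the proof, and checks that rolling the policy forward under the true dynamics reproduces the occupancy measure:

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, H = 3, 2, 4

p_star = rng.dirichlet(np.ones(S), size=(S, A))   # p_star[s, a] = distribution over s'

# A toy stage-dependent occupancy measure rho[h, s, a]: here we generate it
# from an arbitrary Markov policy, but any rho satisfying the flow
# constraints of the proof would do.
behavior = rng.dirichlet(np.ones(A), size=(H, S))
rho = np.zeros((H, S, A))
d = np.zeros(S); d[0] = 1.0                       # episodes start in state 0
for h in range(H):
    rho[h] = d[:, None] * behavior[h]
    d = np.einsum('sa,sat->t', rho[h], p_star)

# Extract the induced Markov policy pi(a | s; h) by normalizing rho, as in
# the proof of Lemma D.1 (uniform on states that rho never visits).
marg = rho.sum(axis=2, keepdims=True)
pi = np.where(marg > 0, rho / np.maximum(marg, 1e-12), 1.0 / A)

# Check: rolling pi forward under p_star reproduces rho exactly.
d = np.zeros(S); d[0] = 1.0
for h in range(H):
    visit = d[:, None] * pi[h]
    assert np.allclose(visit, rho[h])
    d = np.einsum('sa,sat->t', visit, p_star)
print("induced Markov policy reproduces the occupancy measure")
```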
E Experimental details

In the experiments, both ApproPO and RCPO use the same policy-gradient algorithm, Advantage Actor-Critic (A2C; Mnih et al., 2016), as the learning algorithm. We implemented ConRL using two versions of LagrConPlanner (see Algorithm 2 below) as ConPlanner, in which the planner is either value iteration (exact planner) or A2C using fictitious samples (approximate planner, similar to Dyna model-based RL; Sutton, 1991). All three algorithms have outer-loop learning rates, which we tuned, while the hyperparameters used for A2C are the same across all three methods. Here, we report the result for the best learning rate for each method.

TFW-UCRL2 gives fixed weights to the reward and the constraint violation and maximizes a scalar function. We therefore tuned TFW-UCRL2 with different weights, and the reported result is for the best weight.
E.1 LagrConPlanner

Our theoretical results posit that ConPlanner is solved optimally, which can indeed be achieved via linear programming (see Appendix A). However, in our experiments it suffices to use a general heuristic for ConPlanner. Our approach is to Lagrangify the constraints and create a min-max mathematical program with the Lagrangian objective:
$$\min_{\lambda(i) \le 0\ \forall i \in \mathcal{D}}\ \max_{\pi}\ \mathbb{E}_{\pi, p^{(k)}}\Big[\sum_{h=1}^H r^{(k)}(s_h, a_h)\Big] + \sum_{i \in \mathcal{D}} \lambda(i)\Big(\mathbb{E}_{\pi, p^{(k)}}\Big[\sum_{h=1}^H c^{(k)}(s_h, a_h, i)\Big] - \xi(i)\Big).$$
Define the pseudo-reward $r^{(k)}_\lambda$ as
$$r^{(k)}_\lambda(s, a) = r^{(k)}(s, a) + \sum_{i \in \mathcal{D}} \lambda(i)\big[c^{(k)}(s, a, i) - \xi(i)\big].$$
With a fixed choice of Lagrange multipliers $\{\lambda(i)\}_{i \in \mathcal{D}}$, this is an unconstrained planning program, which we refer to as Planner$(p^{(k)}, r^{(k)}_\lambda)$, and it can be solved by a planning oracle. We update the Lagrange multipliers via projected gradient descent (Zinkevich, 2003). The overhead of solving ConPlanner is purely computational, as we do not require new samples. The full procedure is given in Algorithm 2. The near-optimality of Algorithm 2 can be proved by leveraging the fact that we iteratively update $\pi$ and $\lambda$ using no-regret online learning procedures (best response for $\pi$ and online gradient descent for $\lambda$); see, e.g., Cesa-Bianchi and Lugosi (2006). We omit the analysis of Algorithm 2 as it is not the main focus of this work.

In our experiments, two versions of Planner have been implemented: value iteration (exact planner) and A2C with fictitious samples (approximate planner).
Value Iteration as Planner. This program takes $p$ and $r$ as input. Finite-horizon value iteration simply solves the following acyclic dynamic program:
$$Q(s, a, h) = \begin{cases} 0 & h = H + 1, \\ r(s, a) + \sum_{s'} p(s' \mid s, a)\, \max_{a'} Q(s', a', h + 1) & h = 1, \dots, H; \end{cases}$$
then the optimal policy for step $h$ is computed as $\pi_h(s) = \arg\max_a Q(s, a, h)$, and the algorithm returns the $H$-step policy $\pi = (\pi_h)_{h=1}^H$.

Algorithm 2 Lagrangian-based Constrained Planner (LagrConPlanner)

  hyper-parameters: learning rate $\eta$
  Input: estimates $\hat{p}_k, \hat{r}_k, \hat{c}_k$ and bonus $\hat{b}_k$
  Compute the bonus-enhanced model $M^{(k)} = (p^{(k)}, r^{(k)}, c^{(k)})$:
      $p^{(k)}(s' \mid s, a) = \hat{p}_k(s' \mid s, a)$  for all $s, a, s'$
      $r^{(k)}(s, a) = \hat{r}_k(s, a) + \hat{b}_k(s, a)$  for all $s, a$
      $c^{(k)}(s, a, i) = \hat{c}_k(s, a, i) - \hat{b}_k(s, a)$  for all $s, a$ and $i \in \mathcal{D}$
  Initialize the Lagrange parameters $\lambda_1(i) \leftarrow 0$ for $i \in \mathcal{D}$
  for iteration $k$ from $1$ to $N$ do
      Define $r^{(k)}_\lambda(s, a) = r^{(k)}(s, a) + \sum_{i \in \mathcal{D}} \lambda_k(i)\big[c^{(k)}(s, a, i) - \xi(i)\big]$
      $\pi_k \leftarrow$ Planner$(p^{(k)}, r^{(k)}_\lambda)$
      $\lambda_{k+1}(i) \leftarrow \min\Big\{0,\; \lambda_k(i) - \eta\Big(\mathbb{E}_{\pi_k, p^{(k)}}\Big[\sum_{h=1}^H c^{(k)}(s_h, a_h, i)\Big] - \xi(i)\Big)\Big\}$  for all $i \in \mathcal{D}$
  end for
  Return the mixture policy $\pi := \frac{1}{N} \sum_{k=1}^N \pi_k$
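The following compact Python sketch puts the two pieces together: exact value iteration as the inner Planner, driven by the projected-gradient loop of Algorithm 2. It is our own illustration, not the experimental code; the array shapes, function names, and default hyper-parameters are assumptions:

```python
import numpy as np

def value_iteration(p, r, H):
    """Exact finite-horizon planner: greedy policy pi[h, s] via backward induction."""
    S, A = r.shape
    V = np.zeros(S)
    pi = np.zeros((H, S), dtype=int)
    for h in range(H - 1, -1, -1):            # stages are 0-indexed here
        Q = r + p @ V                         # Q[s, a] = r(s,a) + sum_s' p(s'|s,a) V(s')
        pi[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return pi

def episode_expectations(p, pi, objs, H):
    """E[sum_h obj(s_h, a_h)] from state 0 under deterministic policy pi,
    for each objective array obj[s, a] in objs."""
    S = p.shape[0]
    d = np.zeros(S); d[0] = 1.0
    totals = np.zeros(len(objs))
    for h in range(H):
        for n, obj in enumerate(objs):
            totals[n] += d @ obj[np.arange(S), pi[h]]
        d = np.einsum('s,st->t', d, p[np.arange(S), pi[h]])  # push state distribution forward
    return totals

def lagr_con_planner(p, r, c, xi, H, eta=0.05, iters=200):
    """Sketch of Algorithm 2 with value iteration as the inner Planner.
    p: (S,A,S) transitions, r: (S,A) reward, c: (S,A,d) consumptions, xi: (d,) budgets.
    Returns the list of iterates (to be played as a uniform mixture)."""
    d_res = c.shape[2]
    lam = np.zeros(d_res)                      # multipliers, projected to stay <= 0
    policies = []
    for _ in range(iters):
        r_lam = r + (c - xi) @ lam             # pseudo-reward r + sum_i lam_i (c_i - xi_i)
        pi = value_iteration(p, r_lam, H)
        policies.append(pi)
        cons = episode_expectations(p, pi, [c[:, :, i] for i in range(d_res)], H)
        lam = np.minimum(0.0, lam - eta * (cons - xi))   # projected gradient step
    return policies
```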
A2C with fictitious samples as Planner. This program takes $p$ and $r$ as input. Since we have the full model ($p$ and $r$), we can generate fictitious episodes (not adding to the sample complexity) and use those samples to train our A2C agent. The procedure is given in Algorithm 3 (parameterized policy $\pi_\theta$ and value-function estimate $V_\theta$).

Algorithm 3 A2C planner with fictitious samples

  hyper-parameters: learning rate $\eta$, $\alpha \in [0, 1]$
  Input: transitions $p$, reward function $r$
  Define the A2C loss
      $L(\theta) = \mathbb{E}_{\pi_\theta, p}\Big[\sum_{h=1}^H -\log \pi_\theta(a_h \mid s_h)\big(R(h) - V_\theta(s_h)\big) + \alpha\big(R(h) - V_\theta(s_h)\big)^2\Big]$, where $R(h) = \sum_{h'=h}^H r(s_{h'}, a_{h'})$
  Initialize $\theta$ arbitrarily
  for iteration $i$ from $1$ to $T$ do
      Emulate an episode by running $\pi_\theta$ on the MDP with transitions $p$ and reward function $r$
      Update $\theta \leftarrow \theta - \eta \nabla_\theta L(\theta)$
  end for
  Return $\pi_\theta$
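A minimal tabular instantiation of Algorithm 3 (our sketch under assumptions: a softmax policy over tabular logits and a tabular critic, with single-episode gradient estimates; the experiments use neural A2C instead):

```python
import numpy as np

rng = np.random.default_rng(0)

def a2c_fictitious(p, r, H, eta=0.1, alpha=0.5, T=2000):
    """Actor-critic on episodes emulated from the known model (p, r);
    no real environment samples are consumed."""
    S, A = r.shape
    theta = np.zeros((H, S, A))    # tabular policy logits
    v = np.zeros((H, S))           # tabular value estimate
    for _ in range(T):
        # emulate one fictitious episode from the model
        s, traj = 0, []
        for h in range(H):
            logits = theta[h, s] - theta[h, s].max()
            probs = np.exp(logits) / np.exp(logits).sum()
            a = rng.choice(A, p=probs)
            traj.append((h, s, a, probs))
            s = rng.choice(S, p=p[s, a])
        # returns-to-go R(h) and A2C-style updates
        R = 0.0
        for h, s_h, a_h, probs in reversed(traj):
            R += r[s_h, a_h]
            adv = R - v[h, s_h]
            grad_logp = -probs; grad_logp[a_h] += 1.0   # d log pi / d theta[h, s_h]
            theta[h, s_h] += eta * adv * grad_logp       # policy-gradient ascent
            v[h, s_h] += eta * alpha * adv               # squared-loss critic step
    return theta
```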
E.2 Hyperparameter Tuning

Both ConRL-A2C and RCPO used the Adam optimizer. For our method, we performed a hyperparameter search on both domains over the values in Table 1; the selected values are given in Table 2. The reset row refers to how the A2C planner is initialized on each call to the planner; we tried the following options: reuse the previous weights and reset the optimizer (warm-start), continue learning with the previous weights and optimizer (continue), or reset both the model weights and the optimizer (none).
Table 1: Considered hyperparameters

  Hyperparameter              Values considered
  A2C learning rate           …
  lambda learning rate        …
  reset                       warm-start, continue, none
  conplanner iterations       …
  A2C entropy coefficient     …
  A2C value-loss coefficient  …

Table 2: Selected hyperparameters

  Hyperparameter              Gridworld    Box
  A2C learning rate           …            …
  lambda learning rate        …            …
  reset                       none         none
  conplanner iterations       10           10
  A2C entropy coefficient     …            …
  A2C value-loss coefficient  …            …

F Concentration tools
This section contains general concentration inequalities that are not tied to the constrained RL setting considered in the paper.
Lemma F.1 (Hoeffding). Let $\{X_i\}_{i=1}^N$ be i.i.d. samples from some distribution with $\mathbb{E}[X_i] = 0$ for all $i$ and $\max_i |X_i| \le b$. Then with probability at least $1 - \delta$, it holds that:
$$\Big|\frac{1}{N} \sum_{i=1}^N X_i\Big| \le b\sqrt{\frac{2\ln(2/\delta)}{N}}.$$
Lemma F.2 (Anytime version of Hoeffding). Let $\{X_i\}_{i=1}^\infty$ be i.i.d. samples from some distribution with $\mathbb{E}[X_i] = 0$ for all $i$ and $\max_i |X_i| \le b$. Then with probability at least $1 - \delta$, for any $N \in \mathbb{N}^+$, it holds that:
$$\Big|\frac{1}{N} \sum_{i=1}^N X_i\Big| \le b\sqrt{\frac{2\ln(4N^2/\delta)}{N}}.$$
Proof.
We first fix $N \in \mathbb{N}^+$ and apply standard Hoeffding (Lemma F.1) with failure probability $\delta / (2N^2)$. Then we apply a union bound over $\mathbb{N}^+$ and use the fact that $\sum_{N > 0} \frac{\delta}{2N^2} \le \delta$ to conclude the lemma.

The following lemma is used when bounding the final regret in the above analysis, where we bound the difference between the cumulative Bellman error along the empirical trajectories and the cumulative Bellman error in expectation over trajectories (the expectation is taken with respect to the policies generating these trajectories across episodes).

Lemma F.3.
Consider a sequence of episodes $k = 1, \dots, K$, a sequence of policies $\{\pi_k\}_{k=1}^K$, and a sequence of functions $\{f_k\}_{k=1}^K$ with corresponding filtration $\{\mathcal{F}_k\}$, where $\pi_k \in \mathcal{F}_{k-1}$ and $f_k \in \mathcal{F}_{k-1}$. Each policy $\pi_k$ generates a trajectory $\{s_{k;h}, a_{k;h}\}_{h=1}^H$, and each function satisfies $f_k : \mathcal{S} \times \mathcal{A} \to [0, H]$. With probability at least $1 - \delta$, for any $K$, we have:
$$\Big|\sum_{k=1}^K \sum_{h=1}^H f_k(s_{k;h}, a_{k;h}) - \sum_{k=1}^K \mathbb{E}_{\pi_k}\Big[\sum_{h=1}^H f_k(s_h, a_h)\Big]\Big| \le 3 H^{3/2} \sqrt{K \ln(4K^2/\delta)}.$$
Denote the random variable $v_{k;h} = f_k(s_{k;h}, a_{k;h})$, and denote by $\mathbb{E}_{k;h}$ the conditional expectation given all history from the beginning up to, but not including, step $h$ of episode $k$. Iterating conditional expectations over $h = 1, \dots, H$ gives $\mathbb{E}_{k;1}\big[\sum_{h=1}^H v_{k;h}\big] = \mathbb{E}_{\pi_k}\big[\sum_{h=1}^H f_k(s_h, a_h)\big]$. Note that $|v_{k;h}| \le H$ for all $k, h$ by the assumption on $f_k$; hence the centered sums form a martingale with increments bounded by $H$. Applying the Azuma-Hoeffding inequality over the $KH$ steps, with probability at least $1 - \delta$,
$$\Big|\sum_{k=1}^K \sum_{h=1}^H v_{k;h} - \sum_{k=1}^K \mathbb{E}_{\pi_k}\Big[\sum_{h=1}^H f_k(s_h, a_h)\Big]\Big| \le H\sqrt{2\ln(2/\delta) \cdot KH} \le 3 H^{3/2} \sqrt{K \ln(2/\delta)}.$$
Assigning failure probability $\delta / (2K^2)$ to each value of $K$ and taking a union bound over $K \in \mathbb{N}^+$ (as in Lemma F.2) yields the anytime bound with $\ln(4K^2/\delta)$ in place of $\ln(2/\delta)$, completing the proof.
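As a quick empirical sanity check of the reconstructed envelope (purely illustrative: the i.i.d. uniform noise is a special case of the adapted sequences covered by the lemma, and the seed and constants are ours):

```python
import numpy as np

rng = np.random.default_rng(2)
K, H, delta, trials = 200, 5, 0.05, 2000

# Monte Carlo check of the martingale bound: f_k(s, a) replaced by bounded
# i.i.d. noise in [0, H]; compare realized deviations of the double sum with
# the 3 H^{3/2} sqrt(K ln(4 K^2 / delta)) envelope reconstructed above.
bound = 3 * H**1.5 * np.sqrt(K * np.log(4 * K**2 / delta))
sums = rng.uniform(0, H, size=(trials, K, H)).sum(axis=(1, 2))
devs = np.abs(sums - K * H * (H / 2))    # mean of U(0, H) is H/2, so E[sum] = K*H*(H/2)
print(f"fraction exceeding the bound: {(devs > bound).mean():.4f} (should be << {delta})")
```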