Reinforcement Learning of Markov Decision Processes with Peak Constraints
Proceedings of Machine Learning Research vol xxx:1–16, 2020
Ather Gattami
ATHER.GATTAMI@RI.SE
RISE AI, Research Institutes of Sweden, Stockholm, Sweden
Abstract
In this paper, we consider reinforcement learning of Markov Decision Processes (MDP) with peak constraints, where an agent chooses a policy to optimize an objective and at the same time satisfy additional peak constraints. The agent has to take actions based on the observed states, reward outputs, and constraint outputs, without any knowledge of the dynamics, the reward functions, or the constraint functions. We introduce a transformation of the original problem that allows us to apply reinforcement learning algorithms in which the agent maximizes a bounded and unconstrained objective. We show that the policies obtained from the transformed problem are optimal whenever the original problem is feasible. Our solution is memory efficient and doesn't require storing the values of the constraint functions. To the best of our knowledge, this is the first time learning algorithms guarantee convergence to optimal stationary policies for the MDP problem with peak constraints, for discounted and expected average rewards, respectively.
Keywords:
Markov Decision Process, Reinforcement Learning, Peak Constraints, Memory Efficient Learning
1. Introduction
Reinforcement learning is concerned with optimizing an objective function that depends on a given agent's actions and the state of the process to be controlled. However, many applications in practice require that we take actions subject to additional constraints that need to be fulfilled. One example is wireless communication, where the total transmission power of the connected wireless devices is to be minimized subject to constraints on the quality of service (QoS), such as maximum delay constraints Djonin and Krishnamurthy (2007). Another example is the use of reinforcement learning methods to select treatments for future patients. To achieve just the right effect of a drug for specific patients, one needs to make sure that other important patient values satisfy some peak constraints that should not be violated in the short or long run, for the safety of the patient.

Informally, the problem of reinforcement learning for Markov decision processes with peak constraints is described as follows (note that bandit optimization with peak constraints becomes a special case). Given a stochastic process with state $s_k$ at time step $k$, reward function $r$, and a discount factor $0 < \gamma < 1$, the constrained reinforcement learning problem is for the optimizing agent to find a stationary policy $\pi(s_k)$, taking values in some finite set $A$, that maximizes the discounted reward
$$
\sum_{k=0}^{\infty} \gamma^k E\big(r(s_k, \pi(s_k))\big) \tag{1}
$$
or the expected average reward
$$
\lim_{T \to \infty} \frac{1}{T} \sum_{k=0}^{T-1} E\big(r(s_k, \pi(s_k))\big) \tag{2}
$$
subject to the constraints
$$
r_j(s_k, a_k) \geq 0, \quad a_k \in A, \quad \text{for all } k, \ j = 1, \ldots, J \tag{3}
$$
(a more formal definition of the problem is introduced in the next section).

The peak constrained reinforcement learning problem of Markov decision processes is that of finding an optimal policy that satisfies a number of peak constraints of the form (3). The constraints (3) can be equivalently written as $a_k \in A(s_k)$, where the feasibility set $A(s_k)$ is state dependent and is given by the inequality constraints (3). If the functions $r_1, \ldots, r_J$ were known, then the optimization problem would be straightforward, and standard $Q$-learning would solve the problem in the case of an unknown process and unknown reward $r$. However, if we don't know the constraint functions $r_1, \ldots, r_J$, then we don't know the feasibility set $A(s_k)$, and hence we need to learn it. It's important to note that since the constraint functions are unknown, it's inevitable that we violate the constraints during the learning process. The agent may, however, measure the samples $r_1(s_k, a_k), \ldots, r_J(s_k, a_k)$. The goal of this paper is to provide an algorithm that asymptotically converges to a feasible and optimal solution. One way could be to wait until one passes through all pairs $(s, a)$ to observe $r_1(s, a), \ldots, r_J(s, a)$, learn them, and then construct the feasibility sets $A(s)$ and run $Q$-learning for optimization. However, this imposes a huge cost in terms of memory, as one must store $r_1(s, a), \ldots, r_J(s, a)$ for all $(s, a)$, which amounts to $|S| \times |A| \times J$ elements. Furthermore, not optimizing along the learning process could be costly.

The following example from wireless communication describes in more detail a model where we have a Markov decision process with hard (peak) constraints and where the agent doesn't have knowledge of the process and reward functions, but only observations of the state $s$ and reward samples $r(s, a), r_1(s, a), \ldots, r_J(s, a)$.
Example 1 (Wireless communication)
Consider the problem of wireless communication where the goal is to minimize the average of the transmitted power subject to a strict quality of service (QoS) constraint. Let $s_k$ denote the channel state at time step $k$, which belongs to a finite set, and let $a_k$ be the bandwidth allocation action, also belonging to a finite set of actions. The power required to occupy a bandwidth $a$ given the channel state $s$ is $P(s, a)$, which is unknown to the agent. The power affects the channel state, and hence the channel evolves according to a probability distribution given by $p(s_{k+1} \mid s_k, a_k)$, which is also unknown. The QoS is given by a lower bound $b$ on the bit error rate, $q(s_k, a_k) \geq b$. The function $q(s_k, a_k)$ is not known, as it is affected by noise that is not accessible to the agent. By introducing $r(s, a) = -P(s, a)$ and $r_1(s, a) = q(s, a) - b$, the task is to solve the following optimization problem:
$$
\sup_{a_k} \ \lim_{N \to \infty} \frac{1}{N} \sum_{k=0}^{N-1} E\big(r(s_k, a_k)\big) \quad \text{s.t.} \quad r_1(s_k, a_k) \geq 0.
$$
Example 2 (Search Engine)
In a search engine, there is a number of documents that are related to a certain query. There are two values related to every document: the first is an (advertisement) value $u_i$ of document $i$ for the search engine, and the second is a value $v_i$ for the user (which could be a measure of how strongly related the document is to the user query). The task of the search engine is to display the documents in some order, where each row has an attention value $A_j$ for row $j$. We assume that $u_i$ and $v_i$ are known to the search engine for all $i$, whereas the attention values $\{A_j\}$ are not known. The action of the search engine is to display document $i$ in position $j$, $a_i = j$. Thus, the expected average reward for the search engine is
$$
R = \lim_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N} E\big(u_i A_{a(i)}\big).
$$
The search engine has multiple objectives here, as it wants to maximize the rewards for both the user and itself. One solution is to define a measure for the quality of service, $v_i A_{a(i)} \geq q$, and then maximize its own reward subject to the quality of service constraint, that is,
$$
\sup_{a_i} \ \lim_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N} E\big(u_i A_{a(i)}\big) \quad \text{s.t.} \quad v_i A_{a(i)} - q \geq 0.
$$

Although constrained Markov decision process problems are fundamental and have been studied extensively in the literature (see Altman (1999) and the references therein), the reinforcement learning counterpart of finding the optimal policies seems to still be open, and even less is known for the case of peak constraints considered in this paper. When an agent has to take actions based solely on the observed states, reward outputs, and constraint outputs (without any knowledge of the dynamics, the reward functions, or the constraint functions), a general solution seems to be lacking, to the best of the authors' knowledge.
Most of the work on Markov decision processes and bandit optimization considers constraints in the form of discounted or expected average rewards Altman (1999). Constrained MDP problems are convex, and hence one can convert the constrained MDP problem to an unconstrained zero-sum game where the objective is the Lagrangian of the optimization problem Altman (1999). However, when the dynamics and rewards are not known, it is not apparent how to do this, as the Lagrangian itself becomes unknown to the optimizing agent. Previous work regarding constrained MDPs, when the dynamics of the stochastic process are not known, considers scalarization through weighted sums of the rewards; see Roijers et al. (2013) and the references therein. Another approach is to consider Pareto optimality when multiple objectives are present Moffaert and Nowé (2014). However, none of the aforementioned approaches guarantees satisfying lower bounds for a given set of reward functions simultaneously. In Gabor et al. (1998), a multi-criteria problem is considered where the search is over deterministic policies. In general, however, deterministic policies are not optimal Altman (1999). Also, the multi-criteria approach in Gabor et al. (1998) may provide a deterministic solution to a multi-objective problem in the case of two objectives, and it's not clear how to generalize to more than two objectives. In Geibel (2006), the author considers a single constraint and allows for randomized policies. However, no proofs of convergence are provided for the proposed sub-optimal algorithms. Sub-optimal solutions with convergence guarantees are provided in Chow et al. (2017) for the single constraint problem, allowing for randomized policies. In Borkar (2005), an actor-critic sub-optimal algorithm is provided for a single constraint, and it's claimed that it can be generalized to an arbitrary number of constraints. Sub-optimal solutions to constrained reinforcement learning problems with expected average rewards in a wireless communications context were considered in Djonin and Krishnamurthy (2007). Sub-optimal reinforcement learning algorithms were presented in Lizotte et al. (2010) for controlled trial analysis with multiple rewards, again by considering a scalarization approach. In Drugan and Nowé (2013), multi-objective bandit algorithms were studied by considering scalarization functions and Pareto partial orders, respectively, and regret bounds were presented. As with previous results, the approach in Drugan and Nowé (2013) doesn't guarantee satisfying the constraints that correspond to the multiple objectives. In Achiam et al. (2017), constrained policy optimization is studied for the continuous MDP problem and some heuristic algorithms were suggested.
We consider the problem of optimization and learning for Markov decision processes with peak constraints given by (3), for both discounted and expected average rewards, respectively. We reformulate the optimization problem with peak constraints as a zero-sum game where the opponent acts on a finite set of actions (and not on a continuous space of actions). This transformation is essential in order to achieve a tractable optimal algorithm. The reason is that using Lagrange duality without model knowledge requires infinite dimensional optimization, since the Lagrange multipliers are continuous (compare with the intractability of a partially observable MDP, where the beliefs are continuous variables). Furthermore, the Lagrange multipliers imply that the reward function is unbounded. We introduce a reformulation of the problem that is proved to be equivalent, where we obtain bounded reward functions, and thereafter apply reinforcement learning algorithms that converge to optimal policies for the Markov decision process problem with peak constraints. The algorithm works without imposing a cost in terms of memory: there is no need to store the learned values of $r_1(s, a), \ldots, r_J(s, a)$ for all $(s, a)$, saving memory of size $|S| \times |A| \times J$. We give complete proofs for both cases of discounted and expected average rewards, respectively.

Notation:
$\mathbb{N}$: the set of nonnegative integers.
$[J]$: the set of integers $\{1, \ldots, J\}$.
$\mathbb{R}$: the set of real numbers.
$E$: the expectation operator.
$\Pr$: $\Pr(x \mid y)$ denotes the probability of the stochastic variable $x$ given $y$.
$\arg\max$: $\pi^\star = \arg\max_{\pi \in \Pi} f_\pi$ denotes an element $\pi^\star \in \Pi$ that maximizes the function $f_\pi$.
$\geq$: for $\lambda = (\lambda_1, \ldots, \lambda_J)$, $\lambda \geq 0$ denotes that $\lambda_i \geq 0$ for $i = 1, \ldots, J$.
$\mathbb{1}_{\{X\}}(x)$: $\mathbb{1}_{\{X\}}(x) = 1$ if $x \in X$ and $\mathbb{1}_{\{X\}}(x) = 0$ if $x \notin X$.
$\mathbf{1}_n$: $\mathbf{1}_n = (1, 1, \ldots, 1) \in \mathbb{R}^n$.
$N(t, s, a)$: $N(t, s, a) = \sum_{k=1}^{t} \mathbb{1}_{\{s,a\}}(s_k, a_k)$.
$e$: $e : (s, a)$.
$|S|$: the number of elements in $S$.
$s^+$: for a state $s = s_k$, we have $s^+ = s_{k+1}$.
2. Problem Formulation
Consider a Markov Decision Process (MDP) defined by the tuple $(S, A, P)$, where $S = \{S_1, S_2, \ldots, S_n\}$ is a finite set of states, $A = \{A_1, A_2, \ldots, A_m\}$ is a finite set of actions taken by the agent, and $P : S \times A \times S \to [0, 1]$ is a transition function mapping each triple $(s, a, s^+)$ to a probability given by $P(s, a, s^+) = \Pr(s^+ \mid s, a)$, and hence
$$
\sum_{s^+ \in S} P(s, a, s^+) = 1, \quad \forall (s, a) \in S \times A.
$$
Let $\Pi$ be the set of policies that map a state $s \in S$ to a probability distribution over the actions, with a probability assigned to each action $a \in A$, that is, $\pi(s) = a$ with probability $\Pr(a \mid s)$. The agent's objective is to find a stationary policy $\pi \in \Pi$ that maximizes the expected value of the total discounted reward
$$
\sum_{k=0}^{\infty} \gamma^k E\big(r(s_k, \pi(s_k))\big) \tag{4}
$$
or the expected average reward
$$
\lim_{T \to \infty} \frac{1}{T} \sum_{k=0}^{T-1} E\big(r(s_k, \pi(s_k))\big) \tag{5}
$$
for $s = s_0 \in S$, where $r : S \times A \to \mathbb{R}$ is some unknown reward function. The parameter $\gamma \in (0, 1)$ is a discount factor which models how much weight to put on future rewards. The expectation is taken with respect to the randomness introduced by the policy $\pi$ and the transition mapping $P$.
The (hard or peak) constrained reinforcement learning problem is concerned with finding a policy that satisfies a set of peak constraints given by (3), where $r_j : S \times A \to \mathbb{R}$ are bounded functions, for $j = 1, \ldots, J$. The agent doesn't have knowledge of the process and reward functions, but only measurements of the state $s$ and reward samples $r(s_k, a_k), r_1(s_k, a_k), \ldots, r_J(s_k, a_k)$. However, the agent knows that the reward functions are bounded by some constant $c$.

Assumption 1
The absolute values of the reward functions $r$ and $\{r_j\}_{j=1}^{J}$ are bounded by some constant $c$ known to the agent.

Assumption 2
$r(s, a) > 0$ for all $(s, a) \in S \times A$.

Note that Assumption 2 is not restrictive, since we can replace $r$ with $r + c + \epsilon$ for some positive real number $\epsilon > 0$ and obtain an equivalent problem with positive reward $r$ in the objective function, and it will remain bounded by the constant $2c + \epsilon$.
3. Reinforcement Learning for Markov Decision Processes
Consider a Markov Decision Process where the agent is maximizing the total discounted reward given by
$$
V(s_0) = \sum_{k=0}^{\infty} \gamma^k E\big(R(s_k, a_k)\big) \tag{6}
$$
for some initial state $s_0 \in S$. Let $Q^\star(s, a)$ be the expected reward of the agent taking action $a \in A$ from state $s \in S$ and continuing with an optimal policy thereafter. Then, for any stationary policy $\pi$, we have that
$$
Q^\star(s, a) = R(s, a) + \max_{\pi \in \Pi} \sum_{k=1}^{\infty} \gamma^k E\big(R(s_k, \pi(s_k))\big) = R(s, a) + \gamma \cdot \max_{\pi \in \Pi} E\big(Q^\star(s^+, \pi(s^+))\big) \tag{7}
$$
Equation (7) is known as the Bellman equation, and the solution to (7) with respect to $Q^\star$ that corresponds to the optimal policy $\pi^\star$ is denoted $Q^\star$. If we have the function $Q^\star$, then we can obtain the optimal policy $\pi^\star$ according to the equation
$$
\pi^\star(s) = \arg\max_{\pi \in \Pi} E\big(Q^\star(s, \pi(s))\big) \tag{8}
$$
which maximizes the total discounted reward
$$
\sum_{k=0}^{\infty} \gamma^k E\big(R(s_k, \pi^\star(s_k))\big) = E\big(Q^\star(s, \pi^\star(s))\big)
$$
for $s = s_0$. Note that the optimal policy may not be deterministic, as opposed to reinforcement learning for unconstrained Markov Decision Processes, where there is always an optimal policy that is deterministic.

In the case where we don't know the process $P$ and the reward function $R$, we will not be able to take advantage of the Bellman equation directly. The following results show that we will be able to design an algorithm that always converges to $Q^\star$.
Definition 1 (Unichain MDP)
An MDP is called unichain if, for each policy $\pi$, the Markov chain induced by $\pi$ is ergodic, i.e., each state is reachable from any other state.

Unichain MDPs are usually considered in reinforcement learning problems with discounted rewards, since they guarantee that we learn the process dynamics from the initial states. Thus, for the discounted reward case we will make the following assumption.
Assumption 3 (Unichain MDP)
The MDP $(S, A, P)$ is assumed to be unichain.

Proposition 1
Consider a Markov Decision Process given by the tuple $(S, A, P)$, where $(S, A, P)$ is unichain, and suppose that $R$ is bounded. Let $Q = Q^\star$ and $\pi = \pi^\star$ be solutions to the Bellman equation
$$
\begin{aligned}
Q(s, a) &= R(s, a) + \gamma \cdot \max_{\pi \in \Pi} E\big(Q(s^+, \pi(s^+))\big) \\
\pi(s) &= \arg\max_{\pi \in \Pi} E\big(Q(s, \pi(s))\big)
\end{aligned} \tag{9}
$$
Let $\alpha_k(s, a) = \alpha_k \cdot \mathbb{1}_{\{s,a\}}(s_k, a_k)$ satisfy
$$
0 \leq \alpha_k(s, a) < 1, \quad \sum_{k=0}^{\infty} \alpha_k(s, a) = \infty, \quad \sum_{k=0}^{\infty} \alpha_k^2(s, a) < \infty, \quad \forall (s, a) \in S \times A \tag{10}
$$
Then, the update rule
$$
Q_{k+1}(s, a) = (1 - \alpha_k(s, a)) Q_k(s, a) + \alpha_k(s, a)\big(R(s, a) + \gamma \max_{a' \in A} Q_k(s^+, a')\big) \tag{11}
$$
converges to $Q^\star$ with probability 1. Furthermore, the optimal policy $\pi^\star \in \Pi$ given by (8) maximizes (6).

Proof
Consult Jaakkola et al. (1994).
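To make the recursion (11) concrete, the following is a minimal sketch of tabular $Q$-learning with state-action dependent step sizes satisfying (10). The environment interface (`env.reset`, `env.step`, `env.sample_action`) and the epsilon-greedy exploration scheme are illustrative assumptions, not part of the paper.

```python
import numpy as np

def q_learning(env, n_states, n_actions, gamma=0.95, n_steps=100_000, eps=0.1):
    """Tabular Q-learning implementing recursion (11), with step sizes
    alpha_k(s, a) = 1 / (number of visits to (s, a)), which satisfy (10)."""
    Q = np.zeros((n_states, n_actions))
    visits = np.zeros((n_states, n_actions))      # N(t, s, a)
    s = env.reset()                               # assumed environment interface
    for _ in range(n_steps):
        # epsilon-greedy exploration so every pair (s, a) keeps being visited
        a = env.sample_action() if np.random.rand() < eps else int(np.argmax(Q[s]))
        s_next, r = env.step(a)                   # observe reward sample R(s, a)
        visits[s, a] += 1
        alpha = 1.0 / visits[s, a]                # Robbins-Monro step size
        # update rule (11): move Q(s, a) towards R(s, a) + gamma * max_a' Q(s+, a')
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next
    return Q
```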
The agent's objective here is to maximize the total reward given by
$$
\lim_{T \to \infty} \frac{1}{T} \sum_{k=0}^{T-1} E\big(R(s_k, a_k)\big) \tag{12}
$$
for some initial state $s_0 \in S$. We will make a simple assumption regarding the existence of a recurring state, a standard assumption in Markov decision process problems with expected average rewards, to ensure that the expected value of the reward is independent of the initial state.
Assumption 4
There exists a state $s^* \in S$ which is recurrent for every stationary policy $\pi$ played by the agent.

Assumption 4 implies that $E(r_j(s_k, a_k))$ is independent of the initial state at stationarity.

Proposition 2
If Assumption 4 holds, then the value of the Markov Decision Process $(S, A, P)$, with finite state and action spaces, is independent of the initial state.

Proof
Consult Bertsekas (2005).
Proposition 3
Under Assumption 4, there exists a number $v$ and a vector $(h(S_1), \ldots, h(S_n)) \in \mathbb{R}^n$ such that, for each $s \in S$, we have
$$
h(s) + v = T h(s) = \max_{\pi \in \Pi} E\Big( R(s, \pi(s)) + \sum_{s^+ \in S} P(s^+ \mid s, \pi(s)) \, h(s^+) \Big) \tag{13}
$$
Furthermore, the value of (12) is equal to $v$.

Proof
Consult Bertsekas (2005).

Similar to $Q$-learning (but still different), our goal is to find a function $Q^\star(s, a)$ that satisfies
$$
Q^\star(s, a) + v^\star = R(s, a) + \sum_{s^+ \in S} P(s^+ \mid s, a) \, h^\star(s^+) \tag{14}
$$
for any solutions $h^\star$ and $v^\star$ to Equations (13)-(14). Note that we have
$$
h^\star(s) = \max_{\pi \in \Pi} E\big(Q^\star(s, \pi(s))\big).
$$
In the case where we don't know the process $P$ and the reward function $R$, we will not be able to take advantage of (14) directly. The next proposition shows that we will be able to design an algorithm that always converges to $Q^\star$. It's worth noting here that the operator $T$ in Equation (13) is not a contraction, so the standard $Q$-learning that is commonly used for reinforcement learning in Markov decision processes with discounted rewards can't be applied here.

Assumption 5 (Learning rate)
The sequence $\beta(k)$ satisfies:
1. For every $0 < x < 1$, $\sup_k \beta(\lfloor xk \rfloor)/\beta(k) < \infty$.
2. $\sum_{k=1}^{\infty} \beta(k) = \infty$ and $\sum_{k=0}^{\infty} \beta^2(k) < \infty$.
3. For every $0 < x < 1$, the fraction
$$
\frac{\sum_{k=1}^{\lfloor yt \rfloor} \beta(k)}{\sum_{k=1}^{t} \beta(k)}
$$
converges to 1 uniformly in $y \in [x, 1]$ as $t \to \infty$.
For example, $\beta(k) = 1/k$ and $\beta(k) = 1/(k \log k)$ (for $k > 1$) satisfy Assumption 5.

Now define $N(t, s, a)$ as the number of times that state $s$ and action $a$ were played up to time $t$, that is,
$$
N(t, s, a) = \sum_{k=1}^{t} \mathbb{1}_{\{s,a\}}(s_k, a_k).
$$
The following assumption is needed to guarantee that all pairs $(s, a)$ are visited often.

Assumption 6 (Often updates)
There exists a deterministic number $d > 0$ such that for every $s \in S$ and $a \in A$, we have
$$
\liminf_{t \to \infty} \frac{N(t, s, a)}{t} \geq d
$$
with probability 1.

Definition 2
We define the set $\Phi$ as the set of all functions $f : \mathbb{R}^{n \times m} \to \mathbb{R}$ such that:
1. $f$ is Lipschitz;
2. for any $c \in \mathbb{R}$, $f(cQ) = c f(Q)$;
3. for any $r \in \mathbb{R}$ and $\widehat{Q}(s, a) = Q(s, a) + r$ for all $(s, a) \in S \times A$, we have $f(\widehat{Q}) = f(Q) + r$.

Proposition 4
Suppose that $R$ is bounded and that Assumptions 4, 5, and 6 hold. Introduce
$$
F Q(s, a) = \max_{\pi \in \Pi} E\big(Q(s, \pi(s))\big) \tag{15}
$$
and let $f \in \Phi$ be given, where the set $\Phi$ is defined as in Definition 2. Then, the asynchronous update algorithm given by
$$
Q_{t+1}(s, a) = Q_t(s, a) + \mathbb{1}_{\{s,a\}}(s_t, a_t) \, \beta(N(t, s, a)) \big( R(s_t, a_t) + F Q_t(s_{t+1}, a_t) - f(Q_t) - Q_t(s, a) \big) \tag{16}
$$
converges to $Q^\star$ in (13)-(14) with probability 1.

Proof
Consult Abounadi et al. (2001).
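As an illustration of the asynchronous recursion (16), here is a minimal sketch of relative value iteration $Q$-learning along the lines of Abounadi et al. (2001). The choice $f(Q) = Q(s_{\mathrm{ref}}, a_{\mathrm{ref}})$ for a fixed reference pair is one admissible member of $\Phi$; the environment interface and the exploration scheme are, as before, assumptions used only for illustration.

```python
import numpy as np

def rvi_q_learning(env, n_states, n_actions, n_steps=200_000, eps=0.1,
                   ref_state=0, ref_action=0):
    """Average-reward Q-learning of the form (16), sketched under the
    assumption that f(Q) = Q[ref_state, ref_action] (a member of Phi)."""
    Q = np.zeros((n_states, n_actions))
    visits = np.zeros((n_states, n_actions))      # N(t, s, a)
    s = env.reset()                               # assumed environment interface
    for _ in range(n_steps):
        a = env.sample_action() if np.random.rand() < eps else int(np.argmax(Q[s]))
        s_next, r = env.step(a)                   # observe reward sample R(s_t, a_t)
        visits[s, a] += 1
        beta = 1.0 / visits[s, a]                 # beta(N(t, s, a)), satisfies Assumption 5
        f_Q = Q[ref_state, ref_action]            # f(Q_t), an offset tracking v*
        # update (16): no discount factor; subtracting f(Q_t) keeps the iterates bounded
        Q[s, a] += beta * (r + np.max(Q[s_next]) - f_Q - Q[s, a])
        s = s_next
    return Q
```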
4. Reinforcement Learning for Constrained Markov Decision Processes
Consider the optimization problem of finding a stationary policy $\pi$ that maximizes the discounted reward (1) subject to the initial state $s_0 = s$ and the constraints (3), that is,
$$
\begin{aligned}
\sup_{a_k \in A} \ & \sum_{k=0}^{\infty} \gamma^k E\big(r(s_k, a_k)\big) \\
\text{s.t.} \ & r_j(s_k, a_k) \geq 0, \quad k \in \mathbb{N}, \ j = 1, \ldots, J \\
& s_0 = s
\end{aligned} \tag{17}
$$
Since $\gamma > 0$, optimization problem (17) is equivalent to
$$
\begin{aligned}
\sup_{a_k \in A} \ & \sum_{k=0}^{\infty} \gamma^k E\big(r(s_k, a_k)\big) \\
\text{s.t.} \ & \gamma^k r_j(s_k, a_k) \geq 0, \quad k \in \mathbb{N}, \ j = 1, \ldots, J \\
& s_0 = s
\end{aligned} \tag{18}
$$
The Lagrange dual of the problem gives the following equivalent formulation:
$$
\sup_{a_k \in A} \ \min_{\lambda_k \geq 0} \ \sum_{k=0}^{\infty} \gamma^k E\Big( r(s_k, a_k) + \sum_{j=1}^{J} \lambda_{jk} r_j(s_k, a_k) \Big) \quad \text{s.t.} \quad s_0 = s \tag{19}
$$
Now introduce
$$
R(s, a, \lambda) = r(s, a) + \sum_{j=1}^{J} \lambda_j r_j(s, a) \tag{20}
$$
Let $Q^\star(s, a)$ be the expected reward of the agent taking action $a = a_0 \in A$ from state $s = s_0$, satisfying the constraints in (17), and continuing with an optimal policy satisfying the constraints in (17) thereafter. Then, we have that
$$
Q^\star(s, a) = \min_{\lambda \geq 0} R(s, a, \lambda) + \gamma \cdot \max_{\pi \in \Pi} E\big(Q^\star(s^+, \pi(s^+))\big) \tag{21}
$$
and the optimal policy $\pi^\star$ is given by
$$
\pi^\star(s) = \arg\max_{\pi \in \Pi} E\big(Q^\star(s, \pi(s))\big).
$$
The next theorem states that if the optimization problem (17) is feasible, then it's equivalent to an unconstrained optimization problem with bounded rewards, in the sense that an optimal strategy of the agent in the unconstrained problem is optimal for (17), and the value of the unconstrained problem is equal to the value of (17).
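Before the formal statement, it is worth spelling out the pointwise minimization over $\lambda$ in (21); this one-line identity, implicit in the appendix proofs, is what makes the approach memory efficient, since the clipped reward used below can be computed from the current observed samples alone:
$$
\min_{\lambda \geq 0} R(s, a, \lambda)
  = \min_{\lambda \geq 0} \Big( r(s, a) + \sum_{j=1}^{J} \lambda_j r_j(s, a) \Big)
  = \begin{cases}
      r(s, a), & \text{if } r_j(s, a) \geq 0 \text{ for all } j \in [J], \\
      -\infty, & \text{otherwise}.
    \end{cases}
$$
Hence $\max\big(-C, \min_{\lambda \geq 0} R(s, a, \lambda)\big)$ equals $r(s, a)$ whenever all constraint samples at $(s, a)$ are nonnegative, and $-C$ otherwise.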
Theorem 1
Suppose that Assumptions 1 and 2 hold and set $C = c \cdot \gamma / (1 - \gamma)$. Let $Q^\star$ and $\pi^\star$ be solutions to
$$
\begin{aligned}
Q^\star(s, a) &= \max\Big( -C, \ \min_{\lambda \geq 0} R(s, a, \lambda) \Big) + \gamma \cdot \max_{\pi \in \Pi} E\big(Q^\star(s^+, \pi(s^+))\big) \\
\pi^\star(s) &= \arg\max_{\pi \in \Pi} E\big(Q^\star(s, \pi(s))\big)
\end{aligned} \tag{22}
$$
If $Q^\star(s, a) > 0$ for all $(s, a) \in S \times A$, then (19) is feasible and $\pi^\star$ is an optimal solution. Otherwise, if $Q^\star(s, a) \leq 0$ for some $(s, a) \in S \times A$, then (19) is not feasible.

Proof
See the appendix.

The reason why we need Theorem 1 is that we want to transform the optimization problem (17) into an equivalent problem where the reward functions are bounded. Now that we are equipped with Theorem 1, we are ready to state and prove our first main result.
Theorem 2
Consider the constrained MDP problem (17) and suppose that it's feasible and that Assumptions 1, 2, and 3 hold. Introduce $C = c \cdot \gamma / (1 - \gamma)$ and
$$
R(s, a) = \max\Big( -C, \ \min_{\lambda \geq 0} R(s, a, \lambda) \Big).
$$
Let $Q_k$ be given by the recursion according to (11). Then $Q_k \to Q^\star$ as $k \to \infty$, where $Q^\star$ is the solution to (22). Furthermore, the policy
$$
\pi^\star(s) = \arg\max_{\pi \in \Pi} E\big(Q^\star(s, \pi(s))\big)
$$
is an optimal solution to (17).

Proof
Follows from Theorem 1, the fact that $R(s, a)$ is bounded for all $(s, a) \in S \times A$, and Proposition 1.
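The following is a minimal sketch of the resulting learning procedure in the discounted case: it runs the standard recursion (11), but with each observed reward replaced on the fly by $\max\big(-C, \min_{\lambda \geq 0} R(s, a, \lambda)\big)$, which, by the pointwise identity noted before Theorem 1, requires only the current samples $r(s_k, a_k), r_1(s_k, a_k), \ldots, r_J(s_k, a_k)$ and no table of constraint values. The environment interface is an assumption for illustration.

```python
import numpy as np

def constrained_q_learning(env, n_states, n_actions, c, gamma=0.95,
                           n_steps=200_000, eps=0.1):
    """Sketch of the discounted peak-constrained algorithm (Theorem 2):
    ordinary Q-learning on the clipped reward max(-C, min_lambda R(s,a,lambda))."""
    C = c * gamma / (1.0 - gamma)
    Q = np.zeros((n_states, n_actions))
    visits = np.zeros((n_states, n_actions))
    s = env.reset()                                   # assumed environment interface
    for _ in range(n_steps):
        a = env.sample_action() if np.random.rand() < eps else int(np.argmax(Q[s]))
        # observe the objective sample and the J constraint samples for (s, a)
        s_next, r, r_constraints = env.step(a)        # r_j(s, a) for j = 1, ..., J
        # min over lambda >= 0 equals r if all constraint samples are >= 0, else -inf;
        # clipping at -C gives the bounded transformed reward of Theorem 1
        r_bar = r if all(rj >= 0 for rj in r_constraints) else -C
        visits[s, a] += 1
        alpha = 1.0 / visits[s, a]
        Q[s, a] += alpha * (r_bar + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next
    return Q
```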
Consider the optimization problem of finding a stationary policy $\pi$ that maximizes the expected average reward (2) subject to the constraints (3), that is,
$$
\sup_{a_k \in A} \ \lim_{T \to \infty} \frac{1}{T} \sum_{k=0}^{T-1} E\big(r(s_k, a_k)\big) \quad \text{s.t.} \quad r_j(s_k, a_k) \geq 0 \ \text{ for } j = 1, \ldots, J \tag{23}
$$
Under Assumption 4, the value of the objective function is independent of the initial state according to Proposition 2. The Lagrange dual of the problem gives the following equivalent formulation:
$$
\sup_{a_k \in A} \ \min_{\lambda_k \geq 0} \ \lim_{T \to \infty} \frac{1}{T} \sum_{k=0}^{T-1} E\Big( r(s_k, a_k) + \sum_{j=1}^{J} \lambda_{jk} r_j(s_k, a_k) \Big) \tag{24}
$$
Let $Q^\star(s, a)$ be the expected reward of the agent taking action $a = a_0 \in A$ from state $s = s_0$, satisfying the constraints in (23), and continuing with an optimal policy satisfying the constraints in (23) thereafter. Also, let $R(s, a, \lambda)$ be given by (20). Then, we have that
$$
\begin{aligned}
h^\star(s) + v^\star &= \max_{\pi \in \Pi} E\Big( \min_{\lambda \geq 0} R(s, \pi(s), \lambda) + \sum_{s^+ \in S} P(s^+ \mid s, \pi(s)) \, h^\star(s^+) \Big) \\
Q^\star(s, a) + v^\star &= \min_{\lambda \geq 0} R(s, a, \lambda) + \sum_{s^+ \in S} P(s^+ \mid s, a) \, h^\star(s^+)
\end{aligned} \tag{25}
$$
and the optimal policy $\pi^\star$ is given by
$$
\pi^\star(s) = \arg\max_{\pi \in \Pi} E\big(Q^\star(s, \pi(s))\big).
$$
The value of (23) and (24) is given by $v^\star$.
Theorem 3
Suppose that Assumptions 1 and 2 hold. Also, let $Q^\star$, $h^\star$, and $\pi^\star$ be solutions to
$$
\begin{aligned}
h^\star(s) + v^\star &= \max_{\pi \in \Pi} E\Big( \max\Big( -c, \ \min_{\lambda \geq 0} R(s, \pi(s), \lambda) \Big) + \sum_{s^+ \in S} P(s^+ \mid s, \pi(s)) \, h^\star(s^+) \Big) \\
Q^\star(s, a) + v^\star &= \max\Big( -c, \ \min_{\lambda \geq 0} R(s, a, \lambda) \Big) + \sum_{s^+ \in S} P(s^+ \mid s, a) \, h^\star(s^+) \\
\pi^\star(s) &= \arg\max_{\pi \in \Pi} E\big(Q^\star(s, \pi(s))\big)
\end{aligned} \tag{26}
$$
If $Q^\star(s, a) + v^\star > 0$ for all $(s, a) \in S \times A$, then (24) is feasible and $\pi^\star$ is an optimal solution. Otherwise, if $Q^\star(s, a) + v^\star \leq 0$ for some $(s, a) \in S \times A$, then (24) is not feasible.

Proof
See the appendix.

The reason why we need Theorem 3 is that we want to transform the optimization problem (24) into an equivalent problem where the reward functions are bounded. Now that we are equipped with Theorem 3, we are ready to state and prove the second main result.
Theorem 4
Consider the constrained Markov Decision Process problem (23) and suppose that Assumptions 4, 5, and 6 hold. Introduce
$$
R(s, a) = \max\Big( -c, \ \min_{\lambda \geq 0} R(s, a, \lambda) \Big)
$$
and let $Q_k$ be given by the recursion according to (16). Then $Q_k \to Q^\star$ as $k \to \infty$, where $Q^\star$ is the solution to (26). Furthermore, the policy
$$
\pi^\star(s) = \arg\max_{\pi \in \Pi} E\big(Q^\star(s, \pi(s))\big) \tag{27}
$$
is a solution to (23).

Proof
Follows from Theorem 3, the fact that $R(s, a)$ is bounded for all $(s, a) \in S \times A$, and Proposition 4.
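For completeness, the average-reward counterpart can be sketched analogously: the recursion (16) is run with the observed reward replaced by $\max\big(-c, \min_{\lambda \geq 0} R(s, a, \lambda)\big)$, again computed directly from the current samples. The reference-pair choice of $f$ and the environment interface are the same illustrative assumptions as before.

```python
import numpy as np

def constrained_rvi_q_learning(env, n_states, n_actions, c,
                               n_steps=300_000, eps=0.1, ref=(0, 0)):
    """Sketch of the average-reward peak-constrained algorithm (Theorem 4):
    recursion (16) applied to the clipped reward max(-c, min_lambda R(s,a,lambda))."""
    Q = np.zeros((n_states, n_actions))
    visits = np.zeros((n_states, n_actions))
    s = env.reset()                                   # assumed environment interface
    for _ in range(n_steps):
        a = env.sample_action() if np.random.rand() < eps else int(np.argmax(Q[s]))
        s_next, r, r_constraints = env.step(a)        # objective and constraint samples
        r_bar = r if all(rj >= 0 for rj in r_constraints) else -c
        visits[s, a] += 1
        beta = 1.0 / visits[s, a]                     # beta(N(t, s, a))
        f_Q = Q[ref]                                  # f(Q_t), assumed reference pair in Phi
        Q[s, a] += beta * (r_bar + np.max(Q[s_next]) - f_Q - Q[s, a])
        s = s_next
    return Q
```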
5. Conclusions
We considered the problem of optimization and learning for peak constrained Markov decision processes, for both discounted and expected average rewards, respectively. We transformed the original problems in order to apply reinforcement learning to a bounded unconstrained problem. The algorithm works without imposing a cost in terms of memory: there is no need to store the learned values of $r_1(s, a), \ldots, r_J(s, a)$ for all $(s, a)$, saving memory of size $|S| \times |A| \times J$.

It would be interesting to combine our algorithms with deep reinforcement learning and study the performance of the approach introduced in this paper when the value function $Q$ is modeled as a deep neural network.
References
J. Abounadi, D. Bertsekas, and V. S. Borkar. Learning algorithms for Markov decision processes with average cost. SIAM Journal on Control and Optimization, 40:681–698, 2001.

J. Achiam, D. Held, A. Tamar, and P. Abbeel. Constrained policy optimization. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML'17, pages 22–31. JMLR.org, 2017. URL http://dl.acm.org/citation.cfm?id=3305381.3305384.

E. Altman. Constrained Markov decision processes, 1999.

D. P. Bertsekas. Dynamic Programming and Optimal Control, volume 1. Athena Scientific, 2005. ISBN 1886529094.

V. S. Borkar. An actor-critic algorithm for constrained Markov decision processes. Systems & Control Letters, 54(3):207–213, 2005. ISSN 0167-6911. doi: 10.1016/j.sysconle.2004.08.007.

Y. Chow, M. Ghavamzadeh, L. Janson, and M. Pavone. Risk-constrained reinforcement learning with percentile risk criteria. Journal of Machine Learning Research, 18:167:1–167:51, 2017.

D. V. Djonin and V. Krishnamurthy. MIMO transmission control in fading channels: A constrained Markov decision process formulation with monotone randomized policies. IEEE Transactions on Signal Processing, 55(10):5069–5083, Oct 2007. ISSN 1053-587X. doi: 10.1109/TSP.2007.897859.

M. M. Drugan and A. Nowé. Designing multi-objective multi-armed bandits algorithms: A study. In The 2013 International Joint Conference on Neural Networks (IJCNN), pages 1–8, Aug 2013. doi: 10.1109/IJCNN.2013.6707036.

Z. Gabor, Z. Kalmár, and C. Szepesvari. Multi-criteria reinforcement learning. In ICML, 1998.

P. Geibel. Reinforcement learning for MDPs with constraints. In Johannes Fürnkranz, Tobias Scheffer, and Myra Spiliopoulou, editors, Machine Learning: ECML 2006, pages 646–653, Berlin, Heidelberg, 2006. Springer Berlin Heidelberg. ISBN 978-3-540-46056-5.

T. Jaakkola, M. I. Jordan, and S. P. Singh. On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6(6):1185–1201, November 1994. ISSN 0899-7667. doi: 10.1162/neco.1994.6.6.1185. URL http://dx.doi.org/10.1162/neco.1994.6.6.1185.

D. Lizotte, M. H. Bowling, and S. A. Murphy. Efficient reinforcement learning with multiple reward functions for randomized controlled trial analysis. In Proceedings of the 27th International Conference on Machine Learning, ICML'10, pages 695–702, USA, 2010. Omnipress. ISBN 978-1-60558-907-7. URL http://dl.acm.org/citation.cfm?id=3104322.3104411.

K. V. Moffaert and A. Nowé. Multi-objective reinforcement learning using sets of Pareto dominating policies. Journal of Machine Learning Research, 15:3663–3692, 2014. URL http://jmlr.org/papers/v15/vanmoffaert14a.html.

D. M. Roijers, P. Vamplew, S. Whiteson, and R. Dazeley. A survey of multi-objective sequential decision-making. Journal of Artificial Intelligence Research, 48(1):67–113, October 2013. ISSN 1076-9757. URL http://dl.acm.org/citation.cfm?id=2591248.2591251.
Appendix
Proof of Theorem 1
First note that the value of optimization problem (17) is positive if it's feasible, since $r$ is positive according to Assumption 2. Thus, the value of (19) is either positive, if the constraints in (17) are satisfied, or it goes to $-\infty$, if (17) is not feasible. Furthermore, the value of (19) (which is equal to the value of (17)) is at most $c/(1-\gamma)$, since $r$ is bounded by $c$ according to Assumption 1. Thus,
$$
\gamma \max_{\pi \in \Pi} E\big(Q^\star(s, \pi(s))\big) \leq \gamma \cdot c/(1-\gamma) = C
$$
for all $s \in S$. Suppose that $a$ is such that $r_j(s, a) < 0$ for some $j$. Then,
$$
\max\Big( -C, \ \min_{\lambda \geq 0} R(s, a, \lambda) \Big) = -C
$$
and
$$
Q^\star(s, a) = \max\Big( -C, \ \min_{\lambda \geq 0} R(s, a, \lambda) \Big) + \gamma \cdot \max_{\pi \in \Pi} E\big(Q^\star(s^+, \pi(s^+))\big) \leq -C + C = 0.
$$
On the other hand, if $a$ is such that $r_j(s, a) \geq 0$ for all $j \in [J]$, then we have that
$$
\max\Big( -C, \ \min_{\lambda \geq 0} R(s, a, \lambda) \Big) = \min_{\lambda \geq 0} R(s, a, \lambda)
$$
and
$$
Q^\star(s, a) = \min_{\lambda \geq 0} R(s, a, \lambda) + \gamma \cdot \max_{\pi \in \Pi} E\big(Q^\star(s^+, \pi(s^+))\big),
$$
which is identical to (21). Thus, by choosing a policy that satisfies the constraints in (17), $r_j(s, a) \geq 0$, $Q^\star(s, a)$ will be positive. We conclude that if the policy $\pi^\star$ implies that $Q^\star(s, a) > 0$, then it's optimal for (19). Otherwise, at least one of the constraints $r_j(s, a) \geq 0$ is not satisfied and the value of
$$
\max_{\pi \in \Pi} E\big(Q^\star(s, \pi(s))\big) = E\big(Q^\star(s, \pi^\star(s))\big)
$$
is nonpositive, which implies that the value of (19) goes to $-\infty$, and so (17) is not feasible.
First note that the value of optimization problem (23) is positive if it's feasible, since $r$ is positive according to Assumption 2. Thus, the value of (24) is either positive, if the constraints in (23) are satisfied, or it goes to $-\infty$, if (23) is not feasible. Now, we have that $\min_{\lambda \geq 0} R(s, a, \lambda) \leq c$ according to Assumption 1. Thus,
$$
h^\star(s) = \max_{\pi \in \Pi} E\big(Q^\star(s, \pi(s))\big) \leq c
$$
for all $s \in S$. Suppose that $a$ is such that $r_j(s, a) < 0$ for some $j$. Then,
$$
\max\Big( -c, \ \min_{\lambda \geq 0} R(s, a, \lambda) \Big) = -c
$$
and hence
$$
Q^\star(s, a) + v^\star = \max\Big( -c, \ \min_{\lambda \geq 0} R(s, a, \lambda) \Big) + \sum_{s^+ \in S} P(s^+ \mid s, a) \, h^\star(s^+) = -c + \sum_{s^+ \in S} P(s^+ \mid s, a) \, h^\star(s^+) \leq -c + c = 0. \tag{28}
$$
On the other hand, if $a$ is such that $r_j(s, a) \geq 0$ for all $j \in [J]$, then we have that
$$
\max\Big( -c, \ \min_{\lambda \geq 0} R(s, a, \lambda) \Big) = \min_{\lambda \geq 0} R(s, a, \lambda)
$$
and
$$
Q^\star(s, a) + v^\star = \min_{\lambda \geq 0} R(s, a, \lambda) + \sum_{s^+ \in S} P(s^+ \mid s, a) \, h^\star(s^+),
$$
which is identical to (25). Thus, by choosing a policy that satisfies the constraints in (23), $r_j(s, a) \geq 0$, $Q^\star(s, a) + v^\star$ will be positive. We conclude that if the policy $\pi^\star$ implies that $Q^\star(s, a) + v^\star > 0$, then it's optimal for (24). Otherwise, at least one of the constraints $r_j(s, a) \geq 0$ is not satisfied and the value of
$$
\max_{\pi \in \Pi} E\big(Q^\star(s, \pi(s))\big) + v^\star = E\big(Q^\star(s, \pi^\star(s))\big) + v^\star
$$
is nonpositive, which implies that the value of (24) goes to $-\infty$, and so (23) is not feasible.