A Primal-Dual Approach to Constrained Markov Decision Processes
Yi Chen
Department of Industrial Engineering & Management Sciences, Northwestern University, Evanston, IL, 60208
Jing Dong
Columbia Business School, New York, NY, 10027
Zhaoran Wang
Department of Industrial Engineering & Management Sciences, Northwestern University, Evanston, IL, 60208
In many operations management problems, we need to make decisions sequentially to minimize the cost while satisfying certain constraints. One modeling approach to study such problems is the constrained Markov decision process (CMDP). When solving the CMDP to derive good operational policies, there are two key challenges: one is the prohibitively large state space and action space; the other is the hard-to-compute transition kernel. In this work, we develop a sampling-based primal-dual algorithm to solve CMDPs. Our approach alternately applies regularized policy iteration to improve the policy and subgradient ascent to maintain the constraints. Under mild regularity conditions, we show that the algorithm converges at rate $O(\log(T)/\sqrt{T})$, where $T$ is the number of iterations. When the CMDP has a weakly coupled structure, our approach can substantially reduce the dimension of the problem through an embedded decomposition. We apply the algorithm to two important applications with weakly coupled structures: multi-product inventory management and multi-class queue scheduling, and show that it generates controls that outperform state-of-the-art heuristics.

Key words: constrained Markov decision process, primal-dual algorithm, weakly coupled Markov decision process
1. Introduction
In many sequential decision-making problems, a single utility might not suffice to describe the real objectives faced by the decision-makers. A natural approach to study such problems is to optimize one objective while putting constraints on the others. In this context, the constrained Markov decision process (CMDP) has become an important modeling tool for sequential multi-objective decision-making problems under uncertainty. A CMDP aims to minimize one type of cost while keeping the other costs below certain thresholds. It has been successfully applied to analyze various important applications, including admission control and routing in telecommunication networks, scheduling for hospital admissions, and maintenance scheduling for infrastructures (Altman 1999). Due to the complicated system dynamics and the scale of the problem, exact optimal solutions to CMDPs can rarely be derived. Instead, numerical approximations have become the main workhorse to study CMDPs. In this paper, we propose a sampling-based primal-dual algorithm that can efficiently solve a wide range of CMDPs.

One basic approach to solve the CMDP is to use a linear programming (LP) formulation based on the occupancy measure. This approach faces two key challenges in implementation: it requires explicit knowledge of the transition kernel of the underlying dynamical system, and it does not scale well as the state space and action space get large. An alternative approach is to apply Lagrangian duality. In particular, by dualizing the constraints and utilizing strong duality, we can translate the CMDP into a max-min problem, where for a given Lagrangian multiplier, the inner minimization problem is just a standard Markov decision process (MDP). This approach allows us to solve the inner problem using standard dynamic programming based methods. It does not require direct knowledge of the transition kernel as long as we can estimate the value functions from simulated or empirical data. In implementation, one would iteratively update the MDP policy and the Lagrangian multiplier. The current development of this approach requires solving the MDP to get the optimal policy for each updated Lagrangian multiplier (see, for example, Le et al. (2019), Miryoosefi et al. (2019)), which can be computationally costly. A more natural idea is to solve the MDP only approximately at each iteration. In this paper, we investigate this idea and show that at each iteration, we only need to do one iteration of policy update to achieve the optimal convergence rate (in terms of the number of primal-dual iterations). Compared to the existing algorithms utilizing Lagrangian duality, our primal-dual algorithm can be run at a much lower cost per iteration. We also demonstrate that our algorithm can be easily combined with many other approximate dynamic programming techniques, such as Monte Carlo policy evaluation, TD-learning, and value function approximations (Sutton and Barto 2018).

A key ingredient of our algorithm is regularized policy iteration. The standard policy iteration includes two steps: policy evaluation and policy improvement. The policy evaluation step calculates the action-value function under a given policy. Then, the policy improvement step defines a new policy by taking the action that minimizes the action-value function. Through a Kullback-Leibler (KL) regularization term, the regularized policy iteration modifies the policy improvement step by reweighing the probability of taking each action via a softmax transformation of the action-value function.
This modification allows us to view the policy update step as running mirror descent for the objective function in the policy space (Nemirovski 2012). In addition, we update the Lagrangian multiplier using subgradient ascent, which also belongs to the family of mirror descent methods. This unified viewpoint makes the improved primal-dual algorithm possible. Notably, many recent developments in reinforcement learning also benefit from regularization, which has been shown to improve exploration and robustness. For example, Trust Region Policy Optimization and Proximal Policy Optimization use the KL divergence between two consecutive policies as a penalty in policy improvement (Schulman et al. 2015, 2017). Soft Q-learning uses the Shannon entropy as a penalty in value iteration (Haarnoja et al. 2017). Geist et al. (2019) propose a unified framework to analyze the above algorithms via the regularized Bellman operator (see also Liu et al. (2019a), Shani et al. (2019), Wang et al. (2019) for convergence analysis of regularized policy iteration).

In terms of applications of the algorithm, we study an important class of CMDPs which we refer to as weakly coupled CMDPs (Singh and Cohn 1998). A weakly coupled CMDP comprises multiple sub-problems that are independent except for a collection of coupling constraints. Due to the linking constraints, the scale of the problem grows exponentially in the number of sub-problems. Hence, even in the case where each sub-problem is computationally tractable, it can be computationally prohibitive to solve the joint problem. Our primal-dual algorithm naturally helps break the curse of dimensionality in this case. In particular, the weakly coupled CMDP can be decomposed into independent sub-problems in the policy iteration step. In this case, the complexity only grows linearly with the number of sub-problems. We also comment that the weakly coupled CMDP can be viewed as a Lagrangian relaxation of the weakly coupled MDP (Adelman and Mersereau 2008). Even though there is a relaxation gap between the two, as we will demonstrate in our numerical experiments, the (modified) policy obtained via the CMDP can perform very well for the original MDP problem in the applications we consider.

We apply the primal-dual algorithm to solve two classical operations management problems: inventory planning and queue scheduling. For the inventory planning problem, we consider a multi-product newsvendor problem with budget constraints (Turken et al. 2012). We formulate this problem as a weakly coupled CMDP and study a small-scale instance where we can numerically solve for the optimal policy. We show that our policy can indeed achieve $O(\log(T)/\sqrt{T})$ convergence in this case, where $T$ is the number of iterations. For the queue scheduling problem, we consider a multi-class multi-pool parallel-server system where the decision-maker needs to route different classes of customers to different pools of servers in order to minimize the performance cost (holding cost plus overflow cost). We allow the service rates to be both customer-class and server-pool dependent. Since each pool only has a finite number of servers, the routing policy needs to satisfy the capacity constraints. This optimal scheduling problem can be formulated as a weakly coupled MDP. We consider instances where it is prohibitive to solve for the optimal policy. Applying the Lagrangian relaxation, we solve the resulting weakly coupled CMDP by combining our primal-dual algorithm with value function approximation techniques.
We show that our method generates comparable or even better policies than the state-of-the-art policies.

1.1. Related Literature

In this section, we review some of the existing methods and results for solving CMDPs. The goal is to clearly state the contribution of our work. Most existing algorithms for CMDPs are adapted from methods for MDPs, and can be roughly divided into three categories: LP based approaches, dynamic programming based approaches (including policy iteration and value iteration), and policy gradient methods.

One LP based approach utilizes the occupation measure, which is the weighted proportion of time the system spends at each state-action pair. The objective and constraints can be written as the inner products of instantaneous cost functions and the occupation measure. The other LP based approach utilizes the dynamic programming principle and treats the value function (defined on the state space) as the decision variable. In particular, the optimal value function of an MDP is the largest super-harmonic function that satisfies certain linear constraints determined by the transition dynamics. For the CMDP, we obtain an LP by combining the dynamic programming principle with Lagrangian duality. These two LP formulations are dual of each other (Altman 1999). Many works on LP based approaches aim to find efficient ways to solve the LPs by exploiting the structures of some specific MDPs/CMDPs. For example, Bertsimas and Orlin (1994) use the ellipsoid method to derive efficient algorithms for problems with side constraints, including the traveling salesman problem and the vehicle routing problem. Neely (2011) studies a linear fractional programming method to solve CMDPs. More recently, Caramanis et al. (2014) propose two algorithms based on column generation and the generalized experts framework, respectively. These algorithms are shown to be as efficient as value iteration. Another challenge of LP based approaches is that we need to know the transition kernel up front and in explicit form. Some recent developments aim to overcome this challenge. For example, Chen and Wang (2016) reformulate the LP of an MDP as a saddle point problem and use stochastic approximation to solve it. Chen and Wang (2016) combine the saddle point formulation of the LP with value function approximation and develop a proximal stochastic mirror descent method to solve it. However, most existing developments in this line focus on MDPs.

The policy/value iteration stems from the Bellman operator on the value function, and converges to the optimal value function linearly. In implementation, the value function can be estimated via simulation or data. Thus, this class of methods does not require knowledge of the transition kernel up front. For example, in reinforcement learning, we learn the transition dynamics of the system while solving for the optimal policy (Sutton and Barto 2018). There are many works that apply policy/value iteration to solve CMDPs. For example, Gattami (2019) formulates the CMDP as a zero-sum game and uses primal-dual $Q$-learning to solve the game. It proves the almost sure convergence of the algorithm, but does not establish the rate of convergence. Le et al. (2019) study CMDPs in the offline learning setting, and combine various dynamic programming techniques with Lagrangian duality to solve them. Miryoosefi et al. (2019) extend the constraints of CMDPs to nonlinear forms and propose to solve them via Lagrangian duality and $Q$-learning as well. An $O(1/\sqrt{T})$ rate of convergence is obtained in both Le et al. (2019) and Miryoosefi et al. (2019).
However, their algorithms require solving for the optimal policy at each updated Lagrangian multiplier. Our method can be viewed as an improved version of their algorithms. In particular, our algorithm only requires one policy iteration for each updated Lagrangian multiplier while achieving the same convergence rate. We comment that Le et al. (2019) and Miryoosefi et al. (2019) consider more complicated settings than the classical CMDPs studied in this paper. It would be interesting to extend our primal-dual algorithm to solve these more general problems.

Both LP based methods and policy/value iteration can suffer from the curse of dimensionality when dealing with a large action space. The policy gradient based approaches alleviate the dimensionality issue by approximating the policy using a parametric family of functions. In this case, searching over the policy space reduces to a finite-dimensional optimization problem. Several works combine this idea with Lagrangian duality to solve large-scale CMDPs. For example, Borkar (2005) and Bhatnagar and Lakshmanan (2012) combine actor-critic algorithms with policy function approximations. Tessler et al. (2018) use two-timescale stochastic approximation. Beyond duality, several other policy gradient based approaches to solve CMDPs have been developed. For example, Achiam et al. (2017) propose a trust region method that focuses on safe exploration. Liu et al. (2019b) develop an interior point method with logarithmic barrier functions. Chow et al. (2018) propose to use Lyapunov functions to handle constraints. However, the key challenge of policy gradient based methods for CMDPs is that the corresponding optimization problems are non-convex. In most cases, only convergence to a local minimum can be guaranteed, and the convergence rates are often hard to establish.

1.2. Organization and Notation

The paper is organized as follows. We first introduce the CMDP and review some classical results that are relevant to our subsequent development in Section 2. We then introduce our algorithm in Section 3, and show that the algorithm achieves the optimal convergence rate in Section 4. In Section 5, we discuss how our algorithm can be applied to (approximately) solve weakly coupled CMDPs and weakly coupled MDPs. We then implement our algorithm to solve an inventory planning problem and a queue scheduling problem in Sections 6 and 7, respectively. Lastly, we conclude the paper and discuss some interesting future directions in Section 8.

The following notations are used throughout the paper. For a positive integer $K$, we denote by $[K]$ the set $\{1, 2, \ldots, K\}$. For a vector $\lambda \in \mathbb{R}^K$, $[\lambda]_k$ denotes its $k$-th coordinate and $\|\lambda\|_2 = (\sum_{k=1}^K [\lambda]_k^2)^{1/2}$ denotes its $L_2$ norm. Given two vectors $a, b \in \mathbb{R}^K$, we say $a \le b$ if the inequality holds for each coordinate, i.e., $[a]_k \le [b]_k$, $\forall k \in [K]$. Given a vector $x \in \mathbb{R}^K$, $[x]_+ = (\max\{[x]_1, 0\}, \ldots, \max\{[x]_K, 0\})$. Finally, given two sequences of real numbers $\{a_n\}_{n \ge 1}$ and $\{b_n\}_{n \ge 1}$, we say $b_n = O(a_n)$, $b_n = \Omega(a_n)$, and $b_n = \Theta(a_n)$ if there exist some constants $C, C' > 0$ such that $b_n \le C a_n$, $b_n \ge C' a_n$, and $C' a_n \le b_n \le C a_n$, respectively. We also introduce the $\tilde{O}(\cdot)$ notation when we ignore logarithmic factors. For example, if $b_n \le C a_n \cdot \log(n)$, we denote it by $b_n = \tilde{O}(a_n)$.
2. Constrained Markov Decision Process
We start by considering a discrete-time MDP characterized by the tuple $(\mathcal{S}, \mathcal{A}, P, \gamma, \mu)$. Here, $\mathcal{S}$ and $\mathcal{A}$ denote the state and action spaces; $P = \{P(\cdot\,|\,s,a)\}_{(s,a)\in\mathcal{S}\times\mathcal{A}}$ is the collection of probability measures indexed by the state-action pair $(s,a)$. For each $(s,a)$, $P(\cdot\,|\,s,a)$ characterizes the one-step transition probability of the Markov chain conditional on being in state $s$ and taking action $a$. The function $c = \{c(s,a)\}_{(s,a)\in\mathcal{S}\times\mathcal{A}}$ is the expected instantaneous cost, where $c(s,a)$ is the cost incurred by taking action $a$ at state $s$. Lastly, $\gamma \in (0,1)$ and $\mu = \{\mu(s)\}_{s\in\mathcal{S}}$ are the discount rate and the distribution of the initial state, respectively. Given an MDP $(\mathcal{S}, \mathcal{A}, P, \gamma, \mu)$, a policy $\pi$ determines what action to take at each state. We define the expected cumulative discounted cost with initial state $s$ under policy $\pi$ as
$$V^\pi(s) = (1-\gamma)\cdot\mathbb{E}^\pi\Big[\sum_{t=0}^\infty \gamma^t\cdot c(s_t, a_t)\,\Big|\,s_0 = s\Big], \quad (1)$$
where $s_t, a_t$ are the state and action at time $t$ and $\mathbb{E}^\pi$ denotes the expectation with respect to the transition dynamics determined by policy $\pi$. We further weight the expected costs according to the initial state distribution and define
$$C(\pi) = \mathbb{E}_{s_0\sim\mu}\big[V^\pi(s_0)\big]. \quad (2)$$
Our goal is to minimize the cost $C(\pi)$ over a suitably defined class of policies.

As an extension of the MDP, the CMDP model optimizes one objective while keeping the others satisfying certain constraints. Specifically, in addition to the original cost $c$, we introduce $K$ auxiliary instantaneous costs $d_k = \{d_k(s,a)\}_{(s,a)\in\mathcal{S}\times\mathcal{A}}$, $\forall k \in [K]$. The goal of a CMDP is to find the policy that minimizes the cost defined in (2) while keeping the following constraints satisfied:
$$D_k(\pi) = (1-\gamma)\cdot\mathbb{E}_{s_0\sim\mu}\Big[\mathbb{E}^\pi\Big[\sum_{t=0}^\infty \gamma^t\cdot d_k(s_t, a_t)\,\Big|\,s_0\Big]\Big] \le q_k, \quad \forall k \in [K]. \quad (3)$$
To make the expression more concise, we define $D(\pi) := (D_1(\pi), \ldots, D_K(\pi))^\top$, $q := (q_1, \ldots, q_K)^\top$, and write the constraints in (3) as $D(\pi) \le q$.

We remark that the CMDP is only one choice for modeling problems with multiple objectives/constraints. This particular modeling choice turns out to enjoy a lot of analytical and computational tractability, as we will discuss next. The CMDP is also closely connected to an important class of MDPs, the weakly coupled MDP. In particular, CMDPs can be viewed as a relaxation of weakly coupled MDPs (Adelman and Mersereau 2008). We will provide more discussion about this in Section 5.

Solving a CMDP requires finding the optimal policy over a properly defined policy space, which is a function space. Imposing suitable regularity conditions on the policy space will facilitate the development of algorithms. We next introduce a few classes of commonly used policies. It is natural to require that all policies are non-anticipative, which means that the decision-maker does not have access to future information. Define the history at time $t$ to be the sequence of previous states and actions as well as the current state, i.e., $h_t := (s_0, a_0, \ldots, a_{t-1}, s_t)$. Then a non-anticipative policy can be viewed as a mapping from $h_t$ and $t$ to the action space. We refer to such a policy as a "behavior policy". If a policy only depends on the current state $s_t$ and time $t$ instead of the whole history $h_t$, it is called a "Markov policy". For a Markov policy, if it is independent of the time index $t$, it is referred to as a "stationary policy". When a stationary policy is a deterministic mapping from the state space to the action space, it becomes a "stationary deterministic policy". We use $\Pi$, $\Pi_M$, $\Pi_S$, $\Pi_D$ to denote the space of behavior, Markov, stationary, and stationary deterministic policies, respectively.

Given an arbitrary policy space $U$, we can further generate a new type of policy called "mixing policies" via an initial randomization. Specifically, let $\rho$ be a probability measure on $U$. Under a mixing policy on $U$ with mixing probability $\rho$, we first draw a policy, say $\pi_g$, from $U$ following the distribution $\rho$. Then $\pi_g$ is executed for $t = 0, 1, 2, \cdots$.
We denote by $M(U)$ the space of mixing policies constructed from $U$. An important special case is $M(\Pi_S)$, i.e., the space of mixing stationary policies. When allowing the mixing operation, we incorporate the randomness of the initial mixing into the calculation of the accumulated cost. In particular, for $\pi \in M(U)$ with initial randomization $\rho$,
$$C(\pi) = (1-\gamma)\cdot\mathbb{E}_{s_0\sim\mu}\Big[\int_U \mathbb{E}^{\pi_g}\Big[\sum_{t=0}^\infty \gamma^t\cdot c(s_t,a_t)\,\Big|\,s_0\Big]\,\rho(\mathrm{d}g)\Big].$$
By definition, we note that
$$\Pi \supseteq \Pi_M \supseteq \Pi_S \supseteq \Pi_D, \qquad M(\Pi_S) \supseteq \Pi_S.$$
A class of policies $U$ is called a "dominating class" for a CMDP if for any policy $\pi \in \Pi$, there exists a policy $\bar\pi \in U$ such that
$$C(\bar\pi) \le C(\pi), \qquad D_k(\bar\pi) \le D_k(\pi), \quad \forall k \in [K].$$
For CMDPs, when the instantaneous costs $c(\cdot,\cdot)$ and $d_k(\cdot,\cdot)$ are uniformly bounded from below, $\Pi_S$ is dominating (Altman 1999). The class of mixing stationary policies $M(\Pi_S)$ is also dominating in this case (Theorem 8.4 in Altman (1999)).

There are two classical approaches to CMDPs. We use CMDPs with finite state and action spaces as examples. The first method utilizes the occupation measure. Given a policy $\pi$, the occupation measure is defined as
$$\nu^\pi(s,a) := (1-\gamma)\cdot\mathbb{E}_{s_0\sim\mu}\Big[\sum_{t=0}^\infty \gamma^t\,P^\pi(s_t = s, a_t = a\,|\,s_0)\Big], \quad (4)$$
where $P^\pi(\cdot,\cdot\,|\,s_0)$ denotes the probability measure induced by policy $\pi$ with initial state $s_0$. Note that the occupation measure is the weighted long-run proportion of time that the system spends at each state-action pair. We can then express the accumulated costs in (2) and (3) as
$$C(\pi) = \sum_{(s,a)\in\mathcal{S}\times\mathcal{A}} c(s,a)\cdot\nu^\pi(s,a), \qquad D_k(\pi) = \sum_{(s,a)\in\mathcal{S}\times\mathcal{A}} d_k(s,a)\cdot\nu^\pi(s,a), \quad \forall k \in [K].$$
Let $Q$ denote the set of feasible occupation measures, i.e., for any occupancy measure $\nu \in Q$ there exists a policy $\pi$ that leads to $\nu$. By Theorem 3.2 in Altman (1999), $Q$ can be represented by the collection of vectors $\{\nu(s,a)\}_{(s,a)\in\mathcal{S}\times\mathcal{A}}$ that satisfy the following system of linear equations:
$$\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}} \nu(s,a)\Big(\mathbf{1}(s = s') - \gamma P(s'\,|\,s,a)\Big) = (1-\gamma)\cdot\mu(s'), \quad \forall s' \in \mathcal{S}, \qquad \nu(s,a) \ge 0, \quad \forall (s,a) \in \mathcal{S}\times\mathcal{A},$$
where $\mathbf{1}(\cdot)$ is the indicator function. Then we obtain the following LP formulation of the CMDP:
$$\min \sum_{(s,a)\in\mathcal{S}\times\mathcal{A}} c(s,a)\cdot\nu(s,a) \quad (5)$$
$$\text{s.t.}\quad \big\{\nu(s,a)\big\}_{(s,a)\in\mathcal{S}\times\mathcal{A}} \in Q, \qquad \sum_{(s,a)\in\mathcal{S}\times\mathcal{A}} d_k(s,a)\cdot\nu(s,a) \le q_k, \quad \forall k \in [K].$$

The second method utilizes Lagrangian duality. Let $\lambda \in \mathbb{R}^K$ denote the Lagrangian multiplier. Define
$$L(\pi,\lambda) := C(\pi) + \sum_{k=1}^K [\lambda]_k\cdot(D_k(\pi) - q_k). \quad (6)$$
Then the CMDP can be equivalently formulated as $\inf_{\pi\in\Pi_S}\sup_{\lambda\ge 0} L(\pi,\lambda)$. By Theorem 3.6 in Altman (1999), we can exchange the order of inf and sup and obtain
$$\inf_{\pi\in\Pi_S}\sup_{\lambda\ge 0} L(\pi,\lambda) = \sup_{\lambda\ge 0}\inf_{\pi\in\Pi_S} L(\pi,\lambda) = \sup_{\lambda\ge 0}\inf_{\pi\in\Pi_D} L(\pi,\lambda),$$
where the last equality holds because for each fixed $\lambda$, the inner problem is an unconstrained MDP and the optimal policy is a stationary deterministic policy. We emphasize that given the optimal solution $\lambda^*$ to the dual problem, not every policy $\pi(\lambda^*)$ that minimizes $L(\pi,\lambda^*)$ is an optimal policy for the original CMDP. A necessary condition for $\pi(\lambda^*)$ to be optimal for the original CMDP is complementary slackness: $[\lambda^*]_k\cdot(D_k(\pi(\lambda^*)) - q_k) = 0$, $\forall k \in [K]$.

The dual problem $\sup_{\lambda\ge 0}\inf_{\pi\in\Pi_D} L(\pi,\lambda)$ leads to the following LP formulation:
$$\max_{\phi,\lambda} \sum_{s\in\mathcal{S}} \mu(s)\phi(s) - \sum_{k=1}^K [\lambda]_k q_k \quad (7)$$
$$\text{s.t.}\quad \phi(s) \le (1-\gamma)\Big(c(s,a) + \sum_{k=1}^K [\lambda]_k d_k(s,a)\Big) + \gamma\cdot\sum_{s'\in\mathcal{S}} \phi(s')P(s'\,|\,s,a), \quad \forall (s,a)\in\mathcal{S}\times\mathcal{A}, \qquad \lambda \ge 0,$$
where $\phi(s)$ denotes the value function with initial state $s$. Note that (5) and (7) are dual of each other.

Various methods have been developed in the literature to solve the LPs (5) or (7). There are two main obstacles to solving these LPs in practice. First, it can be computationally prohibitive when dealing with a large state space or a large action space. Second, it requires an explicit characterization of the transition kernel $P$. To overcome these difficulties, we next develop a sampling-based primal-dual algorithm to solve CMDPs.
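To make the occupation-measure approach concrete, the following is a minimal sketch that solves the LP (5) for a small synthetic CMDP with finite state and action spaces via SciPy, and then recovers a stationary policy through (10) below. All problem data in the snippet (the random kernel, costs, and threshold) are illustrative placeholders rather than anything from the paper.

```python
import numpy as np
from scipy.optimize import linprog

# Illustrative CMDP data: 3 states, 2 actions, one constraint (K = 1).
nS, nA, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a, s'] transition kernel
c = rng.uniform(0, 1, size=(nS, nA))            # objective cost c(s, a)
d = rng.uniform(0, 1, size=(1, nS, nA))         # auxiliary cost d_1(s, a)
q = np.array([0.5])                             # threshold q_1 (loose enough for feasibility)
mu = np.full(nS, 1 / nS)                        # initial distribution

# Flow constraints defining Q: for every s',
# sum_{s,a} nu(s,a) * (1{s = s'} - gamma * P(s'|s,a)) = (1 - gamma) * mu(s').
A_eq = np.zeros((nS, nS * nA))
for s in range(nS):
    for a in range(nA):
        col = s * nA + a
        A_eq[s, col] += 1.0
        A_eq[:, col] -= gamma * P[s, a]
b_eq = (1 - gamma) * mu

res = linprog(c.reshape(-1),
              A_ub=d.reshape(1, -1), b_ub=q,    # sum_{s,a} d_k(s,a) nu(s,a) <= q_k
              A_eq=A_eq, b_eq=b_eq,
              bounds=(0, None), method="highs")
nu = res.x.reshape(nS, nA)
policy = nu / (nu.sum(axis=1, keepdims=True) + 1e-12)   # stationary policy via (10)
print("optimal CMDP cost:", res.fun)
```

The same construction scales to any finite instance; the two practical obstacles noted above (problem size and explicit knowledge of $P$) are exactly what the snippet makes visible, since both the kernel and the full state-action enumeration are required inputs.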
3. The Primal-Dual Algorithm
Consider the Lagrangian dual problem
$$\sup_{\lambda\ge 0}\inf_{\pi\in\Pi_S} L(\pi,\lambda). \quad (8)$$
For each fixed $\lambda$, the inner problem is an unconstrained MDP. A natural idea is to solve the unconstrained MDP via a sampling-based method and then update the Lagrangian multipliers via subgradient ascent. Such an idea is exploited in Le et al. (2019). However, this method is computationally expensive, since we need to solve a new MDP every time the Lagrangian multipliers are updated. In contrast, our method only requires a single policy update at each iteration, i.e., we do not need to solve for the corresponding optimal policy at each iteration.

We develop the algorithm and analyze its convergence in $M(\Pi_S)$, the space of mixing stationary policies, rather than $\Pi_S$. The benefits of allowing the mixing are twofold. First, it provides an intuitive way to understand strong duality:
$$\inf_{\pi\in M(\Pi_S)}\sup_{\lambda\ge 0} L(\pi,\lambda) = \sup_{\lambda\ge 0}\inf_{\pi\in M(\Pi_S)} L(\pi,\lambda). \quad (9)$$
With the mixing operation, we can treat $C(\pi)$ and $D(\pi)$ as infinite-dimensional linear functions with respect to the distributions of initial randomization of policies in $\Pi_S$. Hence, the Lagrangian $L(\pi,\lambda)$ is a bilinear function and strong duality follows from the minimax theorem (Sion 1958). Second, it is in general difficult to average the policies $\pi(\cdot\,|\,s)$, i.e., the probabilities of taking each action at each state, directly. The mixing operation provides a simple way to average the occupation measures. In addition, given a mixing policy, under mild regularity conditions, there exists a non-mixing stationary policy that has the same occupation measure (Theorem 3.1 of Altman (1999)). In particular, for $\pi \in M(\Pi_S)$, let $\nu^\pi(\cdot,\cdot)$ be the corresponding occupation measure. Then, we can construct such a stationary policy $\tilde\pi$ via
$$\tilde\pi(a\,|\,s) = \frac{\nu^\pi(s,a)}{\sum_{a\in\mathcal{A}}\nu^\pi(s,a)}. \quad (10)$$

Our algorithmic development is based on strong duality (9), which holds under certain regularity conditions (see Section 4 for details). By the minimax theorem, there exists a saddle point $(\pi^*, \lambda^*)$ such that
$$L(\pi^*,\lambda) \le L(\pi^*,\lambda^*) \le L(\pi,\lambda^*), \quad \forall \lambda \in \mathbb{R}^K_+,\ \pi \in M(\Pi_S). \quad (11)$$
Moreover, $\pi^*$ is an optimal solution to the primal problem and $\lambda^*$ is an optimal solution to the dual problem. In addition, $L(\pi^*,\lambda^*)$ equals the optimal cost of the CMDP. The saddle point property (11) suggests that we can use iterative primal-dual updates to find the saddle point.

We next introduce our actual algorithm. Note that for a fixed value of $\lambda$, the inner inf-problem is an unconstrained MDP with modified instantaneous cost $c_\lambda(s,a) := c(s,a) + \sum_{k=1}^K [\lambda]_k (d_k(s,a) - q_k)$. In what follows, we refer to the inner problem $\inf_{\pi\in M(\Pi_S)} L(\pi,\lambda)$ as the modified unconstrained MDP.

For a given policy $\pi$ and Lagrangian multiplier $\lambda$, define
$$Q^{\pi,\lambda}(s,a) := (1-\gamma)\cdot\Big(c_\lambda(s,a) + \mathbb{E}^\pi\Big[\sum_{t=1}^\infty \gamma^t c_\lambda(s_t,a_t)\,\Big|\,s_0 = s, a_0 = a\Big]\Big), \quad (12)$$
which is known as the action-value function or $Q$-function. Let $\pi_m$ and $\lambda_m$ denote the policy and the Lagrangian multiplier obtained at iteration $m$. For the policy update, we use the KL divergence as the regularization (Geist et al. 2019). In particular, the regularized policy iteration is defined as
$$\pi_m(\cdot\,|\,s) = \operatorname*{arg\,min}_{\pi(\cdot|s)\in\Delta_\mathcal{A}}\Big\{\big\langle Q^{\pi_{m-1},\lambda_{m-1}}(s,\cdot),\,\pi(\cdot\,|\,s)\big\rangle + \eta_{m-1}^{-1}\cdot\mathrm{KL}\big(\pi(\cdot\,|\,s)\,\|\,\pi_{m-1}(\cdot\,|\,s)\big)\Big\}, \quad (13)$$
where $\eta_{m-1} > 0$ is the step size and the update is performed for each $s \in \mathcal{S}$. The minimization is taken over the probability simplex $\Delta_\mathcal{A} := \{\pi(\cdot\,|\,s) : 0 \le \pi(a\,|\,s) \le 1, \sum_{a\in\mathcal{A}}\pi(a\,|\,s) = 1\}$.

Let $\Lambda_M$ denote a suitably bounded domain that includes the dual optimal solution $\lambda^*$. We will provide an explicit construction of $\Lambda_M$ in Section 4. To update the Lagrangian multiplier, we use projected subgradient ascent:
$$\lambda_m = \mathrm{Proj}_{\Lambda_M}\big\{\lambda_{m-1} + \eta_{m-1}\cdot\partial_\lambda L(\pi_{m-1},\lambda_{m-1})\big\}, \quad (14)$$
where $\mathrm{Proj}_{\Lambda_M}\{\cdot\}$ denotes the projection (in $L_2$-norm) onto $\Lambda_M$. We need such a projection to ensure the boundedness of the "subgradient" in order to establish convergence.

By the definition of the KL divergence, the regularized policy iteration can be re-written as
$$\pi_m(\cdot\,|\,s) = Z_{m-1}^{-1}\cdot\pi_{m-1}(\cdot\,|\,s)\cdot\exp\big\{-\eta_{m-1}\cdot Q^{\pi_{m-1},\lambda_{m-1}}(s,\cdot)\big\}, \quad (15)$$
where $Z_{m-1}$ is a normalization constant. For the subgradient ascent update, we have
$$\big[\partial_\lambda L(\pi_{m-1},\lambda_{m-1})\big]_k = D_k(\pi_{m-1}) - q_k. \quad (16)$$
Both (15) and (16) can be evaluated or approximated using simulation. More advanced approximation techniques for policy evaluation, such as TD-learning, can also be applied here.

Suppose that our algorithm runs $T$ iterations and generates a sequence of policies and Lagrangian multipliers $\{(\pi_m, \lambda_m)\}_{0\le m\le T-1}$. Then, the algorithm outputs a mixing policy and a Lagrangian multiplier by taking a weighted average of the iterates:
$$\bar\pi_T = \sum_{m=0}^{T-1}\tilde\eta_m\pi_m, \qquad \bar\lambda_T = \sum_{m=0}^{T-1}\tilde\eta_m\lambda_m, \qquad \text{where } \tilde\eta_m = \eta_m\Big/\sum_{m=0}^{T-1}\eta_m. \quad (17)$$
The averaging operation is required for convergence, since the objective $L(\pi,\lambda)$ is bilinear and does not possess sufficient convexity. In particular, there exist counter-examples that fail to converge without averaging. The summation in the definition of $\bar\pi_T$ is interpreted as the mixing operation, i.e., it mixes the policies $(\pi_0, \ldots, \pi_{T-1})$ with initial randomization distribution $(\tilde\eta_0, \cdots, \tilde\eta_{T-1})$. Note that this essentially takes the average of the occupation measures of the $\pi_m$'s. From $\bar\pi_T$, we can apply (10) to define a non-mixing stationary policy that has the same occupation measure.

Our primal-dual algorithm is summarized in Algorithm 1.
Algorithm 1 Primal-Dual Algorithm for CMDPs
Input: pre-specified projection domain $\Lambda_M$, step sizes $\{\eta_m\}_{m\ge 0}$, initial policy $\pi_0$, and initial Lagrangian multiplier $\lambda_0$.
for $m = 1, \ldots, T-1$ do: update the Lagrangian multiplier and the policy as
$$\lambda_m = \mathrm{Proj}_{\Lambda_M}\big\{\lambda_{m-1} + \eta_{m-1}\cdot\partial_\lambda L(\pi_{m-1},\lambda_{m-1})\big\}, \qquad \pi_m(\cdot\,|\,s) \propto \pi_{m-1}(\cdot\,|\,s)\cdot\exp\big\{-\eta_{m-1}\cdot Q^{\pi_{m-1},\lambda_{m-1}}(s,\cdot)\big\}.$$
end for
Output: mixing policy $\bar\pi_T = \sum_{m=0}^{T-1}\tilde\eta_m\pi_m$, where $\tilde\eta_m = \eta_m/\sum_{m=0}^{T-1}\eta_m$.
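To illustrate Algorithm 1 end to end, here is a minimal tabular sketch in which an exact linear-solve policy evaluation stands in for the sampling-based estimates of (15) and (16). The step-size constant and the projection radius (playing the role of $M + r$ in $\Lambda_M$) are illustrative choices of ours, and the constant $\lambda^\top q$ in $c_\lambda$ is dropped since it cancels in the softmax update.

```python
import numpy as np

def value(pi, cost, P, gamma):
    """V^pi for instantaneous cost(s, a), with the (1 - gamma) normalization of (1)."""
    c_pi = (pi * cost).sum(axis=1)                       # expected one-step cost
    P_pi = np.einsum("sa,sat->st", pi, P)                # state transitions under pi
    return np.linalg.solve(np.eye(len(c_pi)) - gamma * P_pi, (1 - gamma) * c_pi)

def primal_dual(c, d, q, P, mu, gamma, T=500, radius=50.0):
    """Tabular sketch of Algorithm 1; c: (nS, nA), d: (K, nS, nA), P: (nS, nA, nS)."""
    nS, nA = c.shape
    pi = np.full((nS, nA), 1 / nA)                       # uniform initial policy
    lam = np.zeros(len(q))
    pis, etas = [], []
    for m in range(T):
        eta = 0.5 / np.sqrt(m + 1)                       # eta_m = Theta(1/sqrt(m))
        # Q-function (12) of the modified cost under (pi_{m-1}, lambda_{m-1});
        # the constant lambda^T q is omitted, as it cancels in the softmax step.
        c_lam = c + np.tensordot(lam, d, axes=1)
        Q = (1 - gamma) * c_lam + gamma * P @ value(pi, c_lam, P, gamma)
        # Subgradient (16): [grad]_k = D_k(pi_{m-1}) - q_k.
        grad = np.array([mu @ value(pi, dk, P, gamma) for dk in d]) - q
        # Dual update (14): ascent step, then projection onto Lambda_M.
        lam = np.clip(lam + eta * grad, 0.0, None)
        lam *= min(1.0, radius / max(np.linalg.norm(lam), 1e-12))
        # Primal update (15): softmax reweighting (row-wise shift for stability;
        # shifting Q(s, .) by a per-state constant leaves the update unchanged).
        pi = pi * np.exp(-eta * (Q - Q.min(axis=1, keepdims=True)))
        pi /= pi.sum(axis=1, keepdims=True)
        pis.append(pi.copy()); etas.append(eta)
    w = np.array(etas) / np.sum(etas)                    # mixing weights of (17)
    return pis, w
```

The returned pair represents the mixing policy $\bar\pi_T$: policy `pis[m]` is executed with probability `w[m]`. In a simulation-based implementation, the two linear solves would be replaced by Monte Carlo or TD estimates.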
4. Convergence Analysis
In this section, we conduct a detailed performance analysis of Algorithm 1. In particular, we study the performance of the policy $\bar\pi_T$ by analyzing the values of the objective $C(\bar\pi_T)$ and the constraints $D(\bar\pi_T)$. We show that the objective value $C(\bar\pi_T)$ converges to the optimal value $C^* := C(\pi^*) = L(\pi^*,\lambda^*)$ at a rate of $O(\log(T)/\sqrt{T})$. In addition, even though $\bar\pi_T$ may be infeasible, we show that the violation of the constraints, which is measured by
$$\big\|[D(\bar\pi_T) - q]_+\big\|_2 := \Big(\sum_{k=1}^K\big([D_k(\bar\pi_T) - q_k]_+\big)^2\Big)^{1/2}, \quad (18)$$
converges to zero at a rate of $O(\log(T)/\sqrt{T})$. The analysis builds on a combination of the subgradient method for saddle point problems and mirror descent for regularized policy iteration.

Recall that our algorithmic development builds on the strong duality of the CMDP. For CMDPs with finite state and action spaces, strong duality always holds (Theorem 3.6 in Altman (1999)). However, when the state space is countably infinite, we need more regularity conditions to ensure strong duality. One sufficient condition is that the instantaneous costs of the CMDP are uniformly bounded from below (see Definition 7.1, Theorem 9.9, and Chapter 10.3 in Altman (1999)). Specifically, we impose the following assumption.

Assumption 1 (Lower Bound of Instantaneous Costs). There exists a constant $W$ such that for all $s \in \mathcal{S}$, $a \in \mathcal{A}$, and $k = 1, 2, \ldots, K$,
$$c(s,a) > W, \qquad d_k(s,a) > W.$$

To establish the convergence result, we also require Slater's condition:
Assumption 2. [Slater’s Condition] There exists some policy ˜ π such that D k (˜ π ) < q k , ∀ ≤ k ≤ K. Slater’s condition ensures the existences of finite and bounded optimal Lagrangian multipliers λ ∗ = arg max λ ≥ n inf π ∈M (Π S ) L ( π, λ ) o . This condition is commonly assumed in the constrained optimization literature. For many practicalproblems, the Slater’s condition holds trivially.Our last assumption is about the boundedness of the “subgradient”, which regularizes the move-ment of policies and Lagrangian multipliers at each iteration. Recall that in Algorithm 1, afterapplying the subgradient ascent for Lagrangian multipliers, we project λ onto a bounded domainΛ M , which takes the form Λ M = (cid:8) λ ∈ R K + : k λ k ≤ M + r (cid:9) , (19)where M is an upper bound of k λ ∗ k and r > Assumption 3. [Bounded Subgradient] There exists some constant
$G > 0$ such that for any $\lambda \in \Lambda_M$ and policy $\pi \in M(\Pi_S)$,
$$\big\|\partial_\lambda L(\pi,\lambda)\big\|_2 \le G, \qquad \sup_{s\in\mathcal{S}}\sup_{a\in\mathcal{A}}\big|Q^{\pi,\lambda}(s,a)\big| \le G.$$

Since $Q^{\pi,\lambda}(s,a)$ is linear in $\lambda$, it is necessary to restrict $\lambda$ to a bounded domain $\Lambda_M$ for Assumption 3 to hold. That is why we need the projection step in updating $\lambda$. Note that when the instantaneous cost functions $c(\cdot,\cdot)$ and $d_k(\cdot,\cdot)$ are uniformly bounded, or when the state and action spaces are finite, Assumption 3 holds trivially.

Lastly, we comment that Slater's condition (Assumption 2) not only guarantees the existence and boundedness of $\lambda^*$, but also provides an explicit upper bound on $\|\lambda^*\|_2$. In particular, let $\tilde\pi$ be a Slater point (a policy that satisfies Slater's condition); then we have
$$\|\lambda^*\|_2 \le -\frac{C(\tilde\pi) - \tilde{c}}{\max_{1\le k\le K}\big\{D_k(\tilde\pi) - q_k\big\}}, \quad (20)$$
where $\tilde{c} \le C(\pi^*)$ is an arbitrary lower bound for the dual problem. In many applications, it is possible to obtain a better upper bound on $\|\lambda^*\|_2$ than (20) by exploiting the structure of the specific problem.

Next, to establish convergence, we need to construct an appropriate potential function, also known as a Bregman divergence in the optimization literature. The potential function ensures that the regularized policy iteration is equivalent to minimizing the sum of a linear approximation of the objective function and the potential function. We next introduce this potential function, which is essentially a weighted KL-divergence. Consider the state occupation measure $\nu^\pi_s$ induced by a policy $\pi \in \Pi_S$, i.e.,
$$\nu^\pi_s(s) := (1-\gamma)\cdot\mathbb{E}_{s_0\sim\mu}\Big[\sum_{t=0}^\infty \gamma^t\,P^\pi(s_t = s\,|\,s_0)\Big].$$
The KL-divergence between two stationary policies $\pi_1$ and $\pi_2$ weighted by $\nu^\pi_s$ is defined as
$$\Phi_\pi(\pi_1\,\|\,\pi_2) = \mathbb{E}_{s\sim\nu^\pi_s}\Big[\mathrm{KL}\big(\pi_1(\cdot\,|\,s)\,\|\,\pi_2(\cdot\,|\,s)\big)\Big]. \quad (21)$$
When $\pi_1$ and $\pi_2$ are mixing policies, we first transform them into the equivalent stationary policies via (10), and then define $\Phi_\pi(\pi_1\|\pi_2)$ as the weighted KL-divergence between the equivalent stationary policies.

By definition, $\Phi_\pi(\pi_1\|\pi_2)$ measures the discrepancy between two policies weighted by a given state occupation measure. It connects the regularized policy iteration in (13), which is defined state-wise, with a single objective, and serves as the Bregman divergence in the mirror descent analysis. Unlike the traditional analysis of mirror descent where the potential function is fixed (Nemirovski 2012), in the analysis of regularized policy iteration we need to construct a policy-dependent potential function and cannot fix the weight of the KL-divergence. However, since policy updates are defined state-wise, for an arbitrary weight, the regularized policy iteration always takes the form of minimizing a linear approximation of the objective function regularized by a certain potential function. Thus, the analysis of mirror descent can be applied here with some modifications.

We are now ready to introduce the convergence results of our primal-dual algorithm.

Theorem 1 (Convergence of Main Algorithm). Under Assumptions 1-3, if the step size $\eta_m = \Theta(1/\sqrt{m})$, then there exist positive constants $\kappa_1$ and $\kappa_2$ such that
$$\big\|[D(\bar\pi_T) - q]_+\big\|_2 \le \frac{G^2\kappa_1\log(T) + \Phi_{\pi^*}(\pi^*\,\|\,\pi_0)}{r(1-\gamma)\kappa_2\sqrt{T}},$$
and
$$C(\bar\pi_T) - L(\pi^*,\lambda^*) \le \frac{G^2\kappa_1\log(T) + \Phi_{\pi^*}(\pi^*\,\|\,\pi_0) + \|\lambda_0\|_2^2}{(1-\gamma)\kappa_2\sqrt{T}},$$
$$C(\bar\pi_T) - L(\pi^*,\lambda^*) \ge -\|\lambda^*\|_2\cdot\frac{G^2\kappa_1\log(T) + \Phi_{\pi^*}(\pi^*\,\|\,\pi_0)}{r(1-\gamma)\kappa_2\sqrt{T}}.$$
If the step size is constant, $\eta_m = \eta$, then
$$\big\|[D(\bar\pi_T) - q]_+\big\|_2 \le \frac{G^2 + (1-\gamma)^{-1}\cdot\Phi_{\pi^*}(\pi^*\,\|\,\pi_0)}{rT\eta} + \Big(12 + \frac{18}{1-\gamma}\Big)\frac{G^2\eta}{r},$$
and
$$C(\bar\pi_T) - L(\pi^*,\lambda^*) \le \frac{(1-\gamma)^{-1}\cdot\Phi_{\pi^*}(\pi^*\,\|\,\pi_0) + \|\lambda_0\|_2^2}{T\eta} + \frac{5G^2\eta}{1-\gamma},$$
$$C(\bar\pi_T) - L(\pi^*,\lambda^*) \ge -\|\lambda^*\|_2\cdot\frac{G^2 + (1-\gamma)^{-1}\cdot\Phi_{\pi^*}(\pi^*\,\|\,\pi_0)}{rT\eta} - \|\lambda^*\|_2\Big(12 + \frac{18}{1-\gamma}\Big)\frac{G^2\eta}{r},$$
where $r$ is the slackness constant in (19).

Theorem 1 indicates that with the decreasing step size $\eta_m = \Theta(1/\sqrt{m})$, our primal-dual algorithm achieves $O(\log(T)/\sqrt{T})$ convergence. In particular,
$$\big\|[D(\bar\pi_T) - q]_+\big\|_2 = O(\log(T)/\sqrt{T}) \quad \text{and} \quad \big|C(\bar\pi_T) - L(\pi^*,\lambda^*)\big| = O(\log(T)/\sqrt{T}).$$
For the constant step size $\eta_m = \eta$, our primal-dual algorithm converges to a neighborhood of the optimum at rate $O(1/T)$. In particular,
$$\big\|[D(\bar\pi_T) - q]_+\big\|_2 = O(1/(\eta T) + \eta) \quad \text{and} \quad \big|C(\bar\pi_T) - L(\pi^*,\lambda^*)\big| = O(1/(\eta T) + \eta).$$
These convergence rates match those in Le et al. (2019), which requires solving the modified unconstrained MDP to optimality at each iteration. We also note that it is unlikely the convergence rate can be improved beyond $\Theta(1/\sqrt{T})$. This is because the dual problem is a finite-dimensional concave optimization problem without strong concavity, and the convergence rate of the subgradient method in this setting is lower bounded by $\Omega(1/\sqrt{T})$ (Bubeck 2014). The proof of Theorem 1 is deferred to the appendix.

We comment that in the bounds in Theorem 1, although the slackness constant $r$ appears in the denominators only, the constant $G$, which is an upper bound on the subgradients, grows linearly in $r$. In particular, by Assumption 3, $G$ is determined by the shape of $\Lambda_M$. Hence, $r$ cannot be set too large.
5. Weakly Coupled MDP and Weakly Coupled CMDP
One fundamental challenge in solving MDPs and CMDPs is the curse of dimensionality. However, there is an important class of problems that has certain decomposable structures. These problems, which are often referred to as weakly coupled MDPs/CMDPs, contain multiple sub-problems that are almost independent of each other except for some linking constraints on the action space (Singh and Cohn 1998). More precisely, a weakly coupled MDP consisting of $I$ sub-problems $\{(\mathcal{S}^i, \mathcal{A}^i, P^i, c^i(\cdot,\cdot), \gamma, \mu^i)\}_{i\in[I]}$ has the following structural properties:

P1. Its state and action spaces can be expressed in the form of Cartesian products, i.e.,
$$s = (s^1, \ldots, s^I), \quad \mathcal{S} = \mathcal{S}^1\times\mathcal{S}^2\times\cdots\times\mathcal{S}^I, \qquad a = (a^1, \ldots, a^I), \quad \mathcal{A} = \mathcal{A}^1\times\mathcal{A}^2\times\cdots\times\mathcal{A}^I.$$

P2. For each state $s_t$ and action $a_t$, the instantaneous cost admits an additively separable form:
$$c(s_t, a_t) = \sum_{i=1}^I c^i(s^i_t, a^i_t).$$

P3. The joint initial distribution satisfies $\mu(s) = \mu^1(s^1)\cdot\mu^2(s^2)\cdots\mu^I(s^I)$, and the one-step transition dynamics of the sub-MDPs are independent of each other, i.e.,
$$P(s_{t+1}\,|\,s_t, a_t) = \prod_{i=1}^I P^i(s^i_{t+1}\,|\,s^i_t, a^i_t).$$

For the linking constraints, let $b^i(\cdot,\cdot) : \mathcal{S}^i\times\mathcal{A}^i \to \mathbb{R}^K$ be a $K$-dimensional real function, which can be interpreted as the resources consumed by the $i$-th sub-problem, $i \in [I]$. Then, at each state $s$, the feasible actions need to satisfy
$$b(s, a) = \sum_{i=1}^I b^i(s^i, a^i) \le q, \quad (22)$$
where $q \in \mathbb{R}^K$. Note that the linking constraint (22) is a hard constraint and needs to be satisfied path-by-path almost surely. A weakly coupled CMDP satisfies the same structural properties P1-P3 as the weakly coupled MDP. The only difference is that the linking constraint now takes the form
$$(1-\gamma)\cdot\mathbb{E}_{s_0\sim\mu}\Big[\sum_{t=0}^\infty \gamma^t\cdot\sum_{i=1}^I b^i(s^i_t, a^i_t)\,\Big|\,s_0\Big] \le q. \quad (23)$$

The weakly coupled MDP and the weakly coupled CMDP are closely related to each other. Let
$$\bar{\mathcal{A}}(s) = \Big\{a = (a^1, \ldots, a^I) \in \mathcal{A} : \sum_{i=1}^I b^i(s^i, a^i) \le q\Big\}$$
be the (joint) feasible action space of a weakly coupled MDP. Then, the Bellman equation is
$$V^*_\mu(s) = \min_{a\in\bar{\mathcal{A}}(s)}\Big\{\sum_{i=1}^I c^i(s^i, a^i) + \gamma\cdot\sum_{s'\in\mathcal{S}} V^*_\mu(s')\cdot P(s'\,|\,s, a)\Big\}.$$
When the number of sub-MDPs $I$ is large, even if the scale of each sub-problem is small, the size of the joint state space $\mathcal{S}$ can be prohibitively large. Hence, solving the MDP directly can be intractable. Two decomposition schemes have been proposed to alleviate the curse of dimensionality: LP-based ADP relaxation and Lagrangian relaxation (Adelman and Mersereau 2008). Both of them lead to $I$ independent sub-LPs, which reduces the complexity significantly. The LP-based ADP relaxation approximates the value function with additively separable functions, i.e., $V^*_\mu(s) \approx \sum_{i=1}^I V^*_{\mu^i}(s^i)$. The Lagrangian relaxation dualizes the constraints (22) based on the LP representation of the Bellman equation. The latter relaxation translates the weakly coupled MDP into a weakly coupled CMDP. It has been established that the optimal cost of the relaxed CMDP provides a lower bound for the optimal cost of the original MDP (Adelman and Mersereau 2008).

Many operations management problems can be formulated as weakly coupled MDPs/CMDPs. Examples include inventory planning problems with multiple types of inventories and budget constraints, and scheduling of parallel-server queues with multiple classes of customers.
We provide more details about these problems in Sections 6 and 7, where we apply our primal-dual algorithm to solve them.

When applying the primal-dual algorithm to solve weakly coupled CMDPs, it can be easily adapted to exploit the decomposability. We call a policy $\pi$ decomposable if it takes the product form:
$$\pi(a\,|\,s) = \prod_{i=1}^I \pi^i(a^i\,|\,s^i).$$
Since our algorithm converges from any initial policy, we shall start with a decomposable policy. Let $\{s_t\}_{t\ge 0} = \{(s^1_t, \ldots, s^I_t)\}_{t\ge 0}$ and $\{a_t\}_{t\ge 0} = \{(a^1_t, \ldots, a^I_t)\}_{t\ge 0}$ be the trajectory of the CMDP under policy $\pi = (\pi^1, \ldots, \pi^I)$. To simplify the notation, for each $i \in [I]$, we define
$$C^i(\pi^i) = (1-\gamma)\cdot\mathbb{E}^{\pi^i}_{s^i_0\sim\mu^i}\Big[\sum_{t=0}^\infty \gamma^t\cdot c^i(s^i_t, a^i_t)\,\Big|\,s^i_0\Big], \qquad B^i(\pi^i) = (1-\gamma)\cdot\mathbb{E}^{\pi^i}_{s^i_0\sim\mu^i}\Big[\sum_{t=0}^\infty \gamma^t\cdot b^i(s^i_t, a^i_t)\,\Big|\,s^i_0\Big].$$
Then, the CMDP can be written as
$$\min_{(\pi^1,\ldots,\pi^I)} \sum_{i=1}^I C^i(\pi^i), \qquad \text{s.t.}\quad \sum_{i=1}^I B^i(\pi^i) \le q. \quad (24)$$
When applying the primal-dual algorithm, if we start with a decomposable policy, then the policies obtained in all subsequent iterations are decomposable. To see this, note that the Lagrangian function
$$L(\pi,\lambda) = \sum_{i=1}^I\big(C^i(\pi^i) + \lambda^\top B^i(\pi^i)\big) - \lambda^\top q$$
can be decomposed into $I$ independent sub-problems. If $\pi_m$ is decomposable, then
$$Q^{\pi_m,\lambda}(s, \cdot) = \sum_{i=1}^I Q^{\pi^i_m,\lambda}(s^i, \cdot),$$
where $Q^{\pi^i_m,\lambda}(\cdot,\cdot)$ is the $Q$-function of the $i$-th modified sub-MDP with instantaneous cost $c^i(\cdot,\cdot) + \lambda^\top b^i(\cdot,\cdot)$. Here, we ignore the constant $\lambda^\top q$, since subtracting a common constant from the $Q$-function does not change the updates of the regularized policy iteration. This indicates that the regularized policy iteration, including policy evaluation and improvement, can be implemented separately in parallel via
$$\pi^i_{m+1}(\cdot\,|\,s^i) \propto \pi^i_m(\cdot\,|\,s^i)\cdot\exp\big\{-\eta_m\cdot Q^{\pi^i_m,\lambda}(s^i,\cdot)\big\}, \quad \forall i \in [I].$$
Moreover, as the subgradient with respect to the Lagrangian multiplier takes the form $\partial_\lambda L(\pi_m,\lambda) = \sum_{i=1}^I B^i(\pi^i_m) - q$, it can be evaluated for the sub-MDPs in parallel as well. In this case, the primal-dual algorithm therefore improves the computational complexity from depending exponentially on $I$ to depending linearly on $I$.
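A sketch of one decomposed primal-dual iteration follows: each tabular sub-MDP updates its own policy against the shared multiplier $\lambda$, and the subgradient $\sum_i B^i(\pi^i_m) - q$ is accumulated across sub-problems. The data layout (the `subs[i]` tuples and function names) is our own convention for the sketch, not notation from the paper.

```python
import numpy as np

def value(pi, cost, P, gamma):
    """V^pi for a tabular sub-MDP (same helper as in the Algorithm 1 sketch)."""
    c_pi = (pi * cost).sum(axis=1)
    P_pi = np.einsum("sa,sat->st", pi, P)
    return np.linalg.solve(np.eye(len(c_pi)) - gamma * P_pi, (1 - gamma) * c_pi)

def decomposed_step(pis, subs, lam, q, eta):
    """One primal-dual iteration across I sub-MDPs sharing the multiplier lam.

    subs[i] = (c_i, b_i, P_i, mu_i, gamma), where b_i has shape (K, nS_i, nA_i).
    """
    grad = -np.asarray(q, dtype=float)           # accumulates sum_i B^i(pi^i) - q
    new_pis = []
    for pi, (c_i, b_i, P_i, mu_i, gamma) in zip(pis, subs):
        # Modified sub-cost c^i + lam^T b^i (the constant lam^T q is dropped).
        c_lam = c_i + np.tensordot(lam, b_i, axes=1)
        Q = (1 - gamma) * c_lam + gamma * P_i @ value(pi, c_lam, P_i, gamma)
        grad += np.array([mu_i @ value(pi, bk, P_i, gamma) for bk in b_i])
        # Per-sub-problem softmax policy update, run independently in parallel.
        new_pi = pi * np.exp(-eta * (Q - Q.min(axis=1, keepdims=True)))
        new_pis.append(new_pi / new_pi.sum(axis=1, keepdims=True))
    lam = np.clip(lam + eta * grad, 0.0, None)   # shared dual ascent step
    return new_pis, lam
```

The loop body touches only one sub-MDP at a time, which is exactly why the per-iteration cost grows linearly rather than exponentially in $I$.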
6. Application to an Inventory Planning Problem
In this section, we apply the primal-dual algorithm to solve a multi-product multi-period newsvendor problem with budget constraints.

Consider an inventory planning problem with $I$ distinct products. At the beginning of each period, we need to decide the quantities to order based on the current inventory levels. The orders are assumed to be fulfilled without delay. After the inventory is replenished, a random demand is realized. We assume the demands for the products are independent. Let $F_i$ denote the cumulative distribution function of the demand for product $i$ in each period. In particular, for each period, the demand for product $i$ is an independent draw from the distribution $F_i$. For each product $i \in [I]$, we denote its inventory level at the beginning of period $t$ by $s^i_t$, the quantity we order by $a^i_t$, and the demand in period $t$ by $w^i_t$. For product $i$ in period $t$, if the demand does not exceed the current inventory level, i.e., $w^i_t \le s^i_t + a^i_t$, all the demand is fulfilled and the remaining inventory is carried to the next period. Otherwise, only $s^i_t + a^i_t$ units are fulfilled in the current period, and the remaining $(w^i_t - s^i_t - a^i_t)$ units are carried to the next period as backlog. We allow the $s^i_t$'s to be negative to represent backlogs. For product $i$, inventory incurs a holding cost of $h_i$ per unit per period and backlog incurs a backlog cost of $b_i$ per unit per period. In addition, product $i$ in inventory consumes $v_i$ units of resource per unit per period. For a fixed $q > 0$,
we impose the following budget constraint:
$$(1-\gamma)\cdot\mathbb{E}\Big[\sum_{t=0}^\infty\sum_{i=1}^I \gamma^t\cdot[s^i_t + a^i_t]_+\cdot v_i\,\Big|\,(s^1_0, \ldots, s^I_0)\Big] \le q. \quad (25)$$
The resource can be interpreted as, for example, the volume of each product. In this case, the above constraint puts restrictions on the warehouse space.

The inventory planning problem can be formulated as a weakly coupled CMDP with state $s = (s^1, \ldots, s^I)$, action $a = (a^1, \ldots, a^I)$, and transition dynamics
$$s^i_{t+1} = s^i_t + a^i_t - w^i_t, \qquad w^i_t \sim F_i(w), \quad \forall i \in [I].$$
As the demands are independent, $P(s_{t+1}\,|\,s_t, a_t) = \prod_{i=1}^I P(s^i_{t+1}\,|\,s^i_t, a^i_t)$. The instantaneous cost function and the auxiliary cost function are
$$c(s_t, a_t) = \sum_{i=1}^I h_i\cdot[s^i_t + a^i_t - w^i_t]_+ + b_i\cdot[w^i_t - s^i_t - a^i_t]_+, \qquad b(s_t, a_t) = \sum_{i=1}^I [s^i_t + a^i_t]_+\cdot v_i.$$

To verify the correctness of our convergence analysis, we consider a small-scale instance of the problem with appropriate truncations. Such a truncation makes the state and action spaces finite. In this case, the optimal cost can be solved numerically (using the LP formulation). In particular, we consider $I = 2$, and the demands for the two products are both uniformly distributed on a finite set of consecutive integers starting from 0. We impose an upper bound of $10$ and a lower bound of $-10$
for the state space. In particular, when backlogs drop below $-10$, the excess demands are lost without incurring any cost. For the other system parameters, we set the holding costs $h_1 = 1$, $h_2 = 2$, backlog costs $b_1 = 2$, $b_2 = 3$, resource consumptions $v_1 = 1.\cdot$, $v_2 = 1$, threshold $q = 10$, and discount rate $\gamma = 0.\cdot$.

In the implementation of the algorithm, we use simulation to estimate the $Q$-function for a given policy. Since the system scale is small, we can enumerate all the state-action pairs in policy evaluation, i.e., no approximation of the value function is needed. The estimation of the $Q$-function is based on an average of 400 independent replications of the inventory process over 40 periods of time. We implement two versions of the algorithm, one with a constant step size
$\eta_m = 0.2$, the other with a decreasing step size proportional to $1/\sqrt{m+1}$. In each experiment, we run 500 iterations in total and calculate the objective values at each iteration. The results of the numerical experiments are summarized in Figures 1 and 2. Figure 1 shows the trajectories of the objective values and the constraint violations across iterations with the constant step size. We observe that after 500 iterations, the averaged CMDP cost (without multiplying by the $(1-\gamma)$ factor) converges to
$49.47$. In terms of feasibility, we calculate the violation of the constraints, which is the expected value of the auxiliary cost minus the budget threshold. We observe that the averaged violation value converges to $0.\cdot$. Figure 2 plots the relationship between the averaged CMDP cost $\sum_{t=0}^{T-1}\tilde\eta_t C(\pi_t)$ and the reciprocal of the number of iterations (for the constant step size) or the reciprocal of the square root of the number of iterations (for the decreasing step size). In both cases, we observe a straight line, which confirms the rates of convergence developed in Theorem 1.
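To make the experimental setup concrete, below is a sketch of a simulator for the truncated two-product instance. Since some parameter values above are not fully recoverable from the text, the demand support, $v_1$, and $\gamma$ used here are illustrative assumptions, and the base-stock rule at the end is only a placeholder policy for exercising the simulator.

```python
import numpy as np

rng = np.random.default_rng(1)
I, bound = 2, 10                                  # two products, states in [-10, 10]
h = np.array([1.0, 2.0])                          # holding costs h_1, h_2
b = np.array([2.0, 3.0])                          # backlog costs b_1, b_2
v = np.array([1.5, 1.0])                          # v_1 = 1.5 is an assumed value
gamma, demand_max = 0.9, 9                        # assumed: gamma and Uniform{0,...,9} demand

def step(s, a):
    """One period: order a, observe demand; returns (next state, cost, resource use)."""
    w = rng.integers(0, demand_max + 1, size=I)
    level = s + a
    cost = np.sum(h * np.maximum(level - w, 0) + b * np.maximum(w - level, 0))
    resource = np.sum(np.maximum(level, 0) * v)   # one-period term of (25)
    return np.clip(level - w, -bound, bound), cost, resource   # lost sales below -10

def rollout(policy, s0, T=40):
    """Discounted cost/resource along one trajectory (40 periods, as in the text)."""
    s, cost, res = np.array(s0), 0.0, 0.0
    for t in range(T):
        s, c_t, r_t = step(s, policy(s))
        cost += gamma**t * c_t
        res += gamma**t * r_t
    return (1 - gamma) * cost, (1 - gamma) * res

base_stock = lambda s: np.maximum(5 - s, 0)       # placeholder policy: order up to 5
est = np.mean([rollout(base_stock, [0, 0])[0] for _ in range(400)])
print("estimated discounted cost:", est)
```

Averaging 400 such rollouts mirrors the Monte Carlo $Q$-function estimation described above, with the policy's action fixed in the first period.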
7. Application to Queueing Scheduling
In this section, we apply our primal-dual algorithm to a queue scheduling problem, which is motivated by applications in service operations management. Service systems often feature multiple classes of customers with different service needs and multiple pools of servers with different skill sets. Efficiently matching customers with compatible servers is critical to the management of these systems. In this context, we consider a parallel-server system (PSS) with multiple classes of customers and multiple pools (types) of servers. Customers waiting in queue incur holding costs, and routing customers to different pools leads to different routing costs. The goal is to find a scheduling policy that minimizes the performance cost (holding cost plus routing cost). This class of problems is known as the skill-based routing problem and has been widely studied in the literature. We refer to Chen et al. (2020) for a comprehensive survey of related works.

[Figure 1: Trajectories of costs and constraints with constant step sizes. Panels: (a) $C(\pi_T)$; (b) averaged CMDP cost $\sum_{t=0}^{T-1}\tilde\eta_t C(\pi_t)$; (c) violation $\|[\sum_{i=1}^2 B^i(\pi_T) - q]_+\|$; (d) averaged violation $\sum_{t=0}^{T-1}\tilde\eta_t\|[\sum_{i=1}^2 B^i(\pi_t) - q]_+\|$; the horizontal axis is the iteration count in all panels.]

[Figure 2: Convergence rate with constant and decreasing step sizes. Panels: (a) $\eta_m = 0.2$; (b) $\eta_m \propto 1/\sqrt{m+1}$.]

In what follows, we first introduce the queueing model and some heuristic policies adapted from policies developed in the literature. We then present the implementation details of our primal-dual algorithm in this setting. Due to the large state and action spaces, we combine our primal-dual algorithm with several approximation techniques. Lastly, we compare the performance of our policy with the benchmark policies numerically.
The multi-class multi-pool queueing network has $I$ classes of customers and $J$ pools of servers. We consider a discrete-time model. In each period, the number of arrivals of class $i$ customers follows a Poisson distribution with rate $\theta_i$. There are $N_j$ homogeneous servers in pool $j$, $j \in [J]$. We assume that each customer can only be served by one server and each server can only serve one customer at a time. If a class $i$ customer is served by a server from pool $j$, its service time follows a geometric distribution with success probability $\mu_{ij}$. When there is no compatibility between customer class $i$ and server type $j$, $\mu_{ij} = 0$. Figure 3 provides a pictorial illustration of such a system.

[Figure 3: Multi-class multi-pool queueing system.]
We consider non-preemptive scheduling policies. Let $A_i(t)$ denote the number of new class $i$ arrivals in time period $t$, i.e., $A_i(t)$ follows a Poisson distribution with rate $\theta_i$. Let $Z_{ij}(t)$ denote the number of class $i$ customers in service in pool $j$ at the beginning of time period $t$. We also denote by $U_{ij}(t)$ the number of class $i$ customers assigned to pool $j$ in time period $t$. Note that the $U_{ij}(t)$'s are determined by our scheduling policy. Then the number of class $i$ departures from pool $j$ at the end of time period $t$, $R_{ij}(t)$, follows a Binomial distribution with parameters $Z_{ij}(t) + U_{ij}(t)$ and $\mu_{ij}$. Let $X_i(t)$ denote the number of class $i$ customers waiting in queue at the beginning of period $t$. Then we have the following system dynamics:
$$X_i(t+1) = X_i(t) + A_i(t) - \sum_{j=1}^J U_{ij}(t), \quad \forall i \in [I], \qquad Z_{ij}(t+1) = Z_{ij}(t) + U_{ij}(t) - R_{ij}(t), \quad \forall i \in [I],\ j \in [J]. \quad (26)$$
The state of the system is $s(t) = (X_i(t), Z_{ij}(t) : i \in [I], j \in [J]) \in \mathbb{N}^{I\times(J+1)}$. The action is $a(t) = (U_{ij}(t) : i \in [I], j \in [J]) \in \mathbb{N}^{I\times J}$.

The routing policy needs to satisfy the following constraints:
$$U_{ij}(t) \in \mathbb{N}, \qquad \sum_{j=1}^J U_{ij}(t) \le X_i(t), \quad \forall i \in [I],\ j \in [J],\ \forall t \ge 0, \quad (27)$$
i.e., we cannot schedule more customers than there are waiting, and
$$\sum_{i=1}^I Z_{ij}(t) + U_{ij}(t) \le N_j, \quad \forall j \in [J],\ \forall t \ge 0, \quad (28)$$
i.e., the number of customers in service cannot exceed the capacity. Note that constraints (27)-(28) are hard constraints, i.e., they need to be satisfied path-by-path.

Each class $i$ customer waiting in queue incurs a holding cost of $h_i$ per period. There is also a one-shot routing cost of $r_{ij}$ for scheduling a class $i$ customer to a pool $j$ server. The overall cost for period $t$ is given by
$$c\big(s(t), a(t)\big) = \sum_{i=1}^I h_i X_i(t) + \sum_{i=1}^I\sum_{j=1}^J r_{ij} U_{ij}(t).$$
Our goal is to minimize the cumulative discounted cost:
$$(1-\gamma)\cdot\mathbb{E}^\pi\Big[\sum_{t=0}^\infty \gamma^t\cdot c(s(t), a(t))\Big].$$

The problem we consider here is a weakly coupled MDP with $I$ sub-problems, where each sub-problem is an inverted-V model (i.e., a single customer class and multiple server pools). In particular, for the $i$-th sub-problem, define the state and action as $s^i(t) = (X_i(t), Z_{i1}(t), \ldots, Z_{iJ}(t))$ and $a^i(t) = (U_{i1}(t), \ldots, U_{iJ}(t))$. The transition dynamics of the $i$-th sub-system follow
$$X_i(t+1) = X_i(t) + A_i(t) - \sum_{j=1}^J U_{ij}(t), \qquad Z_{ij}(t+1) = Z_{ij}(t) + U_{ij}(t) - R_{ij}(t), \quad \forall j \in [J].$$
Given $s^i(t)$, the corresponding action space is defined as
$$\mathcal{A}^i(s^i(t)) = \Big\{\{U_{ij}(t)\}_{j\in[J]} : U_{ij}(t) \in \mathbb{N},\ \sum_{j=1}^J U_{ij}(t) \le X_i(t),\ Z_{ij}(t) + U_{ij}(t) \le N_j,\ \forall j \in [J]\Big\}.$$
We also define the auxiliary cost function
$$b^i(s^i(t), a^i(t)) = \big(Z_{i1}(t) + U_{i1}(t), \ldots, Z_{iJ}(t) + U_{iJ}(t)\big)^\top \in \mathbb{N}^J.$$
Then the capacity constraints (28) can be expressed as
$$\sum_{i=1}^I b^i(s^i(t), a^i(t)) \le (N_1, \ldots, N_J)^\top,$$
which takes the same form as the linking constraint in (22).
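As a minimal illustration of the dynamics (26) and the hard constraints (27)-(28), the following sketch simulates one period. The routing decision $U$ is supplied by the caller and is only checked for feasibility here, not constructed; all function and variable names are our own.

```python
import numpy as np

rng = np.random.default_rng(2)

def queue_step(X, Z, U, theta, mu, N):
    """One period of (26). X: (I,) queues; Z, U: (I, J) in-service / newly assigned."""
    assert np.all(U.sum(axis=1) <= X)                   # constraint (27)
    assert np.all(Z.sum(axis=0) + U.sum(axis=0) <= N)   # constraint (28)
    A = rng.poisson(theta)                              # class-i Poisson arrivals
    R = rng.binomial(Z + U, mu)                         # departures ~ Bin(Z+U, mu)
    return X + A - U.sum(axis=1), Z + U - R

def period_cost(X, U, h, r):
    """Holding plus one-shot routing cost c(s(t), a(t)) for one period."""
    return h @ X + np.sum(r * U)
```

The binomial draw reflects the geometric service times: each customer in service with a pool-$j$ server completes service in the current period with probability $\mu_{ij}$, independently of the others.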
There are three important features of the problem that we attempt to address in this section: 1) non-preemptive routing; 2) class-and-pool dependent service rates; and 3) routing costs (overflow costs). The first two features require us to keep track of a very high dimensional state space, i.e., of dimension $I(J+1)$. The third feature has not been extensively studied in the literature.

We next introduce two heuristic policies adapted from policies developed in the literature. For PSSs with multiple classes of customers and multiple pools of servers, a myopic policy called the $c\mu$-rule (or generalizations of it) has been shown to be asymptotically optimal in some systems, where the goal is to minimize the holding cost (Mandelbaum and Stolyar 2004). The idea is to maximize the instantaneous cost-reduction rate at each decision epoch. Another policy is the max-pressure policy, which is known to be throughput optimal and asymptotically cost optimal for some forms of convex holding cost (Stolyar et al. 2004, Dai et al. 2008). We consider modified versions of the above routing policies, which take the routing costs into account (Chen et al. 2020). At each decision epoch $t$, we choose the $U_{ij}(t)$'s that solve the following optimization problem (a sketch of this assignment step is given below):
$$\max_{U_{ij}(t)} \sum_{i=1}^I\sum_{j=1}^J \omega_{ij}(t)\,U_{ij}(t)$$
$$\text{s.t.}\quad \sum_{j=1}^J U_{ij}(t) \le X_i(t), \quad \forall i \in [I], \qquad \sum_{i=1}^I Z_{ij}(t) + U_{ij}(t) \le N_j, \quad \forall j \in [J], \qquad U_{ij}(t) \in \mathbb{N}, \quad \forall i \in [I],\ j \in [J],$$
where the $\omega_{ij}(t)$'s are modified instantaneous weights that we introduce next. We consider two different forms of the $\omega_{ij}(t)$'s. The first one sets $\omega_{ij}(t) = h_i - r_{ij}$, which is adapted from the $c\mu$-rule; we refer to this policy as the modified $c\mu$-rule. The second one sets $\omega_{ij}(t) = h_i X_i(t) - r_{ij}$, which is adapted from the max-pressure policy; we refer to this policy as the modified max-pressure policy.

We consider the CMDP relaxation of the weakly coupled MDP:
$$\min_\pi\ (1-\gamma)\cdot\mathbb{E}^\pi\Big[\sum_{t=0}^\infty \gamma^t\cdot c(s(t), a(t))\Big] \qquad \text{s.t.}\quad (1-\gamma)\cdot\mathbb{E}\Big[\sum_{i=1}^I\sum_{t=0}^\infty \gamma^t\cdot b^i(s^i(t), a^i(t))\Big] \le (N_1, \ldots, N_J)^\top,$$
and apply the primal-dual algorithm to solve it. The decoupling allows us to translate the original problem into $I$ sub-problems. In particular, in each iteration, we use regularized policy iteration to update the scheduling policy of a single-class multi-pool system with modified instantaneous cost
$$c^i_\lambda\big(s^i(t), a^i(t)\big) = h_i X_i(t) + \sum_{j=1}^J r_{ij} U_{ij}(t) + \sum_{j=1}^J \lambda_j\big(Z_{ij}(t) + U_{ij}(t)\big)$$
for the $i$-th sub-problem.

Even with the decomposition, the state and policy spaces are still too large in this case. We next introduce some further approximations to reduce the dimension of the problem. We shall omit the index $i$ in subsequent discussions, as the development focuses on each sub-problem.
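The per-period assignment problem above, which defines both benchmark policies, can be handled by an off-the-shelf solver. Below is a minimal sketch using SciPy's HiGHS backend (the `integrality` option requires SciPy 1.9 or later); the function name and the example weights are ours, not the paper's.

```python
import numpy as np
from scipy.optimize import linprog

def benchmark_assignment(w, X, Z, N):
    """Maximize sum_ij w_ij U_ij subject to (27)-(28); returns U of shape (I, J)."""
    I, J = w.shape
    A_ub, b_ub = [], []
    for i in range(I):                                  # sum_j U_ij <= X_i
        row = np.zeros(I * J); row[i * J:(i + 1) * J] = 1.0
        A_ub.append(row); b_ub.append(X[i])
    for j in range(J):                                  # sum_i U_ij <= N_j - sum_i Z_ij
        row = np.zeros(I * J); row[j::J] = 1.0
        A_ub.append(row); b_ub.append(N[j] - Z[:, j].sum())
    res = linprog(-w.reshape(-1),                       # linprog minimizes, so negate
                  A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=(0, None), integrality=1, method="highs")
    return np.round(res.x).reshape(I, J).astype(int)

# Modified max-pressure weights, for example: w[i, j] = h[i] * X[i] - r[i, j].
```

Note that whenever $\omega_{ij}(t) < 0$, the solver sets $U_{ij}(t) = 0$, which is exactly the "prefer waiting" behavior built into both heuristics.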
Policy space reduction: For each sub-problem, the policy space is still prohibitively large. To see this, consider a system with 3 pools and 30 servers in each pool. When the queue length is 90 and all pools are empty, there are roughly 30^3 feasible actions. To overcome this challenge, we reduce the action space to only include priority rules. State-dependent extreme policies have been shown to be asymptotically optimal in the scheduling of PSS due to the linear system dynamics and linear holding costs (Harrison and Zeevi 2004). A priority rule lists pools in decreasing order of preference and ends with −1, which denotes "keep the remaining customers waiting". For example, priority (1, 2, −1) means pool 1 is preferred to pool 2, which is preferred to waiting. When following priority (1, 2, −1), we schedule as many waiting customers as possible to pool 1, then as many of the remaining customers as possible to pool 2, and keep the rest waiting in the queue. We denote the reduced action space by Ã.
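The following sketch shows how a priority rule can be translated into a concrete action for one class, under the convention just described: customers are sent to the preferred pools in order until the pool capacities or the queue are exhausted. The helper name is illustrative.

    def schedule_by_priority(priority, x, z, N):
        """Map a priority rule, e.g. (1, 3, -1), to an action U of shape (J,).

        priority lists pools (1-indexed) in decreasing order of preference;
        -1 means "keep the remaining customers waiting".
        x: queue length; z: in-service counts per pool; N: pool capacities.
        """
        U = [0] * len(N)
        waiting = x
        for p in priority:
            if p == -1:                       # leave the rest in the queue
                break
            j = p - 1
            U[j] = min(waiting, N[j] - z[j])  # fill pool j as much as possible
            waiting -= U[j]
        return U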
Value function approximation: In our policy iteration step, given a policy π, we need to estimate the function Q^{π,λ}(s, a) for all s ∈ S, a ∈ Ã, where the state is s = (x, z_1, ..., z_J). Due to the large state space, we cannot enumerate all the states to evaluate the value function. Instead, we use value function approximation with a quadratic basis. The idea is to find θ_{π,a} ∈ R^{2(J+1)+1} such that

Q^{π,λ}(s, a) ≈ ⟨φ(s), θ_{π,a}⟩,

where φ(s) is the quadratic basis. To obtain θ_{π,a} at each iteration, we first randomly sample M states {s_i}_{i∈[M]} and use Monte Carlo simulation to estimate Q^{π,λ}(s_i, a). Then, we set

θ_{π,a} = argmin_θ { (1/M) · Σ_{i=1}^{M} ( Q^{π,λ}(s_i, a) − ⟨φ(s_i), θ⟩ )^2 }.
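A minimal sketch of this fitting step, assuming numpy and taking the Monte Carlo Q-estimates as given. The basis below uses intercept, linear, and squared terms of the state, which is one common reading of "quadratic basis" and matches the dimension 2(J+1)+1.

    import numpy as np

    def quadratic_basis(s):
        """phi(s) for s = (x, z_1, ..., z_J): intercept, linear, squared terms."""
        s = np.asarray(s, dtype=float)
        return np.concatenate(([1.0], s, s ** 2))

    def fit_q_weights(sampled_states, q_estimates):
        """Least-squares fit of theta so that Q(s, a) ~ <phi(s), theta>
        for a fixed action a."""
        Phi = np.stack([quadratic_basis(s) for s in sampled_states])   # (M, d)
        theta, *_ = np.linalg.lstsq(Phi, np.asarray(q_estimates), rcond=None)
        return theta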
For the numerical experiments, we consider a setting similar to that in Dai and Shi (2019), which is motivated by hospital inpatient-flow management. In particular, we consider a network with three customer classes and three server pools, where pool i is the primary pool for class i customers, with r_ii = 0 for all i ∈ [I]. The major difference between our model and the model considered in Dai and Shi (2019) is that we allow the service rates to vary across server types, i.e., μ_ij depends on both i and j. This captures the potential slowdown effect due to off-service placement (Song et al. 2020).

For the system parameters, we set the arrival rates (θ_1, θ_2, θ_3) = (12, ·, ·), the holding costs (h_1, h_2, h_3) = (3, ·, ·), the pool sizes (N_1, N_2, N_3) = (40, ·, ·), and the class- and pool-dependent service rates (μ_11, μ_12, μ_13) = (0.·, 0.·, 0.·), (μ_21, μ_22, μ_23) = (0.·, 0.·, 0.·), (μ_31, μ_32, μ_33) = (0.·, 0.·, 0.·). We run two sets of experiments, corresponding to large routing/overflow costs,

(r_11, r_12, r_13) = (0, ·, ·), (r_21, r_22, r_23) = (3, 0, ·), (r_31, r_32, r_33) = (1, ·, 0),    (29)

and small routing costs,

(r_11, r_12, r_13) = (0, 0.·, 0.·), (r_21, r_22, r_23) = (0.·, 0, 0.·), (r_31, r_32, r_33) = (0.·, 0.·, 0).    (30)

Note that for class i customers, the primary server pool i has the largest service rate and zero routing cost. For customer class i, we define its nominal traffic intensity as ρ_i = θ_i / (N_i μ_ii). The nominal traffic intensities of the three classes are ρ_1 = 1, ρ_2 = 16/15, and ρ_3 = 5/6.
This indicates that the first two classes are unstable if we do not do any "overflow". We initialize the system with X_i(0) = 50 for all i, Z_11(0) = 20, Z_22(0) = 30, Z_33(0) = 40, and Z_ij(0) = 0 for i ≠ j, i, j ∈ [3]. We compare the performance of our policy with the two benchmark policies for problems with different routing costs and discount rates.

When constructing the policy space for our primal-dual algorithm, because each customer class has a primary server pool with the fastest service rate and zero routing cost, we always give the primary pool the highest priority. In particular, the action spaces for the three classes are defined as

Ã_1 = {(1, −1), (1, 2, −1), (1, 3, −1), (1, 2, 3, −1), (1, 3, 2, −1)},
Ã_2 = {(2, −1), (2, 1, −1), (2, 3, −1), (2, 1, 3, −1), (2, 3, 1, −1)},
Ã_3 = {(3, −1), (3, 1, −1), (3, 2, −1), (3, 1, 2, −1), (3, 2, 1, −1)},

respectively.
In our primal-dual update, we use a constant stepsize of 0.1. When using simulation to estimate the value function, we truncate the horizon at T = 100, ·, 800 for γ = 0.·, 0.·, 0.99, respectively. This ensures that γ^T, and hence the truncation error, is negligible. When fitting the parameters of the quadratic value function approximation, we sample 1000 states and use simulation to estimate the Q-function at these states.
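A sketch of the truncated Monte Carlo evaluation just described; env_step, policy, and cost are hypothetical callables standing in for the simulator, the current policy, and the (Lagrangian-modified) per-period cost.

    def mc_q_estimate(env_step, policy, cost, s0, a0, gamma, T_trunc, n_rollouts=10):
        """Estimate Q(s0, a0): take a0 first, then follow the policy,
        truncating the discounted sum at T_trunc periods (chosen so that
        gamma ** T_trunc is negligible)."""
        total = 0.0
        for _ in range(n_rollouts):
            s, a, ret = s0, a0, 0.0
            for t in range(T_trunc):
                ret += gamma ** t * cost(s, a)
                s = env_step(s, a)         # sample the next state
                a = policy(s)              # then follow the current policy
            total += ret
        return total / n_rollouts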
For each value of γ, we start with the Lagrangian multipliers λ_0 = (10, 10, 10), run the primal-dual algorithm for 30 iterations, and take the policy obtained in the last iteration. Note that this policy may not be feasible for the original weakly coupled MDP. To obtain a feasible policy, we adopt the following modification. In each period, for each pool, when the number of scheduled customers exceeds the capacity, the primary customers are prioritized for admission; we then admit the "overflowed" customers uniformly at random until the capacity is reached. Customers who are not admitted to service are sent back to their corresponding queues and wait for the next decision epoch. For example, suppose that there are 20 servers available in pool 1, but the policy schedules (15, 5, 5) customers from the three classes to this pool. The modified policy first admits the 15 customers from class 1 and then randomly picks 5 among the 10 customers of classes 2 and 3 to admit.
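The repair step can be sketched as follows for the special case considered here, where class j is primary for pool j (so I = J); rng is a numpy random Generator and the interface is illustrative.

    import numpy as np

    def admit_with_priority(U, Z, N, rng):
        """Make a proposed schedule feasible, one pool at a time: admit the
        primary class first, then admit the remaining "overflowed" customers
        uniformly at random until the pool capacity is reached."""
        U = U.copy()
        I, J = U.shape
        for j in range(J):
            free = N[j] - Z[:, j].sum()            # available servers in pool j
            admit = np.zeros(I, dtype=int)
            admit[j] = min(U[j, j], free)          # primary class j first
            free -= admit[j]
            # one list entry per overflowed customer, then sample uniformly
            others = [i for i in range(I) if i != j for _ in range(U[i, j])]
            for i in rng.permutation(others)[:max(free, 0)]:
                admit[i] += 1
            U[:, j] = admit
        return U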
Given a policy, to evaluate its performance, we estimate the cumulative discounted costs from 500 independent replications of the system over T periods of time. The results are summarized in Tables 1 and 2.

Table 1: Cumulative discounted costs under different policies with large routing costs (29) under different discount factors. (Numbers in brackets are the standard errors from the simulation estimation.)

                             γ = 0.·           γ = 0.·           γ = 0.99
    Modified cμ-rule         270.49 (·.50)     286.11 (2.07)     467.· (4.·)
    Modified max-pressure    ···.62 (·.25)     269.19 (1.74)     278.· (2.·)
    Primal-dual (ours)       ···.77 (1.47)     243.87 (2.26)     218.· (2.·)

Table 2: Cumulative discounted costs under different policies with small routing costs (30) under different discount factors. (Numbers in brackets are the standard errors from the simulation estimation.)

                             γ = 0.·           γ = 0.·           γ = 0.99
    Modified cμ-rule         232.22 (·.20)     230.54 (1.71)     266.· (3.·)
    Modified max-pressure    ···.89 (·.21)     266.83 (1.77)     308.· (2.·)
    Primal-dual (ours)       ···.53 (1.50)     251.20 (2.40)     210.· (2.·)

We observe from Table 1 that the cumulative discounted cost of the modified cμ-rule increases substantially as the discount factor γ increases. Taking a closer look at the ω_ij(t)'s, we note that in this case both ω_21(t) and ω_23(t) are negative.
This implies that the modified cμ-rule would never overflow class 2 customers. As a result, the system is unstable, i.e., the class 2 queue blows up as t increases. (The cumulative discounted cost is still well-defined, as the discount factor decays exponentially in t while the queue length grows only linearly in t.) The modified max-pressure policy is able to achieve reasonably good performance in this case. When γ is small, our algorithm achieves performance comparable to (slightly better than) the max-pressure policy. When γ is large, i.e., γ = 0.99, our policy achieves a substantially lower cost than the max-pressure policy, i.e., a 21% cost reduction. This is because the max-pressure policy only starts overflowing when the queues are large enough. In this example, where overflow is necessary to achieve system stability, we need more aggressive overflow. Our policy is able to "learn" this through the primal-dual training.

When the overflow cost is small (Table 2), the modified cμ-rule achieves better performance than the modified max-pressure policy. Note that in this case, all ω_ij(t)'s are nonnegative for both the modified cμ-rule and the modified max-pressure policy (when X_i(t) > 0). When γ is small, our policy achieves performance comparable to the modified cμ-rule; when γ is large, i.e., γ = 0.99, our policy achieves a 21% cost reduction over the modified cμ-rule. This suggests that overflow needs to be exercised carefully.

We next discuss the structure of the policies obtained via the primal-dual algorithm. We observe that our policies in general follow a threshold structure: overflow customers only when the queue length exceeds some threshold. However, the thresholds are highly dependent on the state of the system. Take the scheduling policies for class 1 and class 2 customers as an example (Figure 4): holding the occupancies of two of the pools fixed, the threshold for overflow decreases as the occupancy of the remaining pool increases.

Figure 4: The thresholds at which the class 1 and class 2 queues start overflowing. In each panel, two of the pool occupancies are fixed at 0 and the third is varied.
8. Conclusion and Future Directions
In this work, we propose a sampling-based primal-dual algorithm to solve CMDPs. Our approach alternately applies regularized policy iteration to improve the policy and subgradient ascent to maintain the constraints. The algorithm achieves an O(log(T)/√T) convergence rate and requires only one policy update at each primal-dual iteration. Our algorithm also enjoys a decomposability property for weakly coupled CMDPs. We demonstrate the application of our algorithm to two important operations management problems with weakly coupled structures: multi-product inventory management and multi-class queue scheduling.

In Section 7, we also show the good empirical performance of our algorithm for solving an important class of weakly coupled MDPs. This opens two directions for future research. First, it would be important to quantify theoretically the optimality gap between the weakly coupled MDP and its CMDP relaxation. The gap can be large in some problems, as demonstrated in Adelman and Mersereau (2008). It would be interesting to establish easy-to-verify conditions for when the gap is small. Second, the policy obtained via the Lagrangian relaxation may not satisfy the hard constraints in the original MDP. One approach to overcome this issue is to use more stringent thresholds when defining the constraints in the CMDP relaxation (Balseiro et al. 2019). Another approach is to modify the CMDP-based policies to construct good MDP policies. For example, Brown and Smith (2020) study a dynamic assortment problem, propose an index heuristic derived from the relaxed problem, and show that the policy achieves asymptotic optimality. In Section 7, we apply a rather straightforward modification to the CMDP-based policy in order to satisfy the hard constraints of the original MDP. In general, how to "translate" a policy derived from the relaxed problem back to the original MDP is an interesting research direction.

References
Achiam J, Held D, Tamar A, Abbeel P (2017) Constrained policy optimization. arXiv preprint arXiv:1705.10528.
Adelman D, Mersereau AJ (2008) Relaxations of weakly coupled stochastic dynamic programs. Operations Research.
Altman E (1999) Constrained Markov Decision Processes, volume 7 (CRC Press).
Balseiro SR, Brown DB, Chen C (2019) Dynamic pricing of relocating resources in large networks. ACM SIGMETRICS Performance Evaluation Review.
Bertsekas DP (2015) Convex Optimization Algorithms (Athena Scientific, Belmont).
Bertsimas D, Orlin JB (1994) A technique for speeding up the solution of the Lagrangian dual. Mathematical Programming.
Bhatnagar S, Lakshmanan K (2012) An online actor-critic algorithm with function approximation for constrained Markov decision processes. Journal of Optimization Theory and Applications.
Borkar VS (2005) An actor-critic algorithm for constrained Markov decision processes. Systems & Control Letters.
Brown DB, Smith JE (2020) Index policies and performance bounds for dynamic selection problems. Management Science.
Bubeck S (2014) Convex optimization: Algorithms and complexity. arXiv preprint arXiv:1405.4980.
Caramanis C, Dimitrov NB, Morton DP (2014) Efficient algorithms for budget-constrained Markov decision processes. IEEE Transactions on Automatic Control.
Chen Y, Wang M (2016) Stochastic primal-dual methods and sample complexity of reinforcement learning. arXiv preprint arXiv:1612.02516.
Chow Y, Nachum O, Duenez-Guzman E, Ghavamzadeh M (2018) A Lyapunov-based approach to safe reinforcement learning. Advances in Neural Information Processing Systems, 8092–8101.
Dai J, Shi P (2019) Inpatient overflow: An approximate dynamic programming approach. Manufacturing & Service Operations Management.
Dai JG, Lin W (2008) Asymptotic optimality of maximum pressure policies in stochastic processing networks. The Annals of Applied Probability.
Geist M, Scherrer B, Pietquin O (2019) A theory of regularized Markov decision processes. arXiv preprint arXiv:1901.11275.
Haarnoja T, Tang H, Abbeel P, Levine S (2017) Reinforcement learning with deep energy-based policies. arXiv preprint arXiv:1702.08165.
Harrison JM, Zeevi A (2004) Dynamic scheduling of a multiclass queue in the Halfin-Whitt heavy traffic regime. Operations Research.
Kakade S, Langford J (2002) Approximately optimal approximate reinforcement learning. ICML, volume 2, 267–274.
Le HM, Voloshin C, Yue Y (2019) Batch policy learning under constraints. arXiv preprint arXiv:1903.08738.
Liu B, Cai Q, Yang Z, Wang Z (2019a) Neural proximal/trust region policy optimization attains globally optimal policy. arXiv preprint arXiv:1906.10306.
Liu Y, Ding J, Liu X (2019b) IPO: Interior-point policy optimization under constraints. arXiv preprint arXiv:1910.09615.
Mandelbaum A, Stolyar AL (2004) Scheduling flexible servers with convex delay costs: Heavy-traffic optimality of the generalized cμ-rule. Operations Research.
Miryoosefi S, Brantley K, Daumé III H, Dudík M, Schapire RE (2019) Reinforcement learning with convex constraints. Advances in Neural Information Processing Systems, 14093–14102.
Nedić A, Ozdaglar A (2009) Subgradient methods for saddle-point problems. Journal of Optimization Theory and Applications.
Nemirovski A (2012) Tutorial: Mirror descent algorithms for large-scale deterministic and stochastic convex optimization. Conference on Learning Theory (COLT).
Schulman J, Levine S, Abbeel P, Jordan M, Moritz P (2015) Trust region policy optimization. International Conference on Machine Learning, 1889–1897.
Schulman J, Wolski F, Dhariwal P, Radford A, Klimov O (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
Shani L, Efroni Y, Mannor S (2019) Adaptive trust region policy optimization: Global convergence and faster rates for regularized MDPs. arXiv preprint arXiv:1909.02769.
Singh SP, Cohn D (1998) How to dynamically merge Markov decision processes. Advances in Neural Information Processing Systems, 1057–1063.
Sion M (1958) On general minimax theorems. Pacific Journal of Mathematics.
Song H, Tucker AL, Graue R, Moravick S, Yang JJ (2020) Capacity pooling in hospitals: The hidden consequences of off-service placement. Management Science.
Stolyar AL (2004) MaxWeight scheduling in a generalized switch: State space collapse and workload minimization in heavy traffic. The Annals of Applied Probability.
Sutton RS, Barto AG (2018) Reinforcement Learning: An Introduction (MIT Press).
Tessler C, Mankowitz DJ, Mannor S (2018) Reward constrained policy optimization. arXiv preprint arXiv:1805.11074.
Turken N, Tan Y, Vakharia AJ, Wang L, Wang R, Yenipazarli A (2012) The multi-product newsvendor problem: Review, extensions, and directions for future research. Handbook of Newsvendor Problems, 3–39 (Springer).
Wang L, Cai Q, Yang Z, Wang Z (2019) Neural policy gradient methods: Global optimality and rates of convergence. arXiv preprint arXiv:1909.01150.

Appendix. Proof of Main Results
The proof of Theorem 1 relies on the following lemma, which upper and lower bounds the movement of the Lagrangian after a single iteration, i.e., one update of the policy and of the Lagrangian multipliers.
Lemma 1.
Let {(π_m, λ_m)}_{m≥0} be the sequences of stationary policies and Lagrangian multipliers generated by Algorithm 1. Then, for arbitrary λ ∈ R_+^K and π ∈ Π_S, we have the upper bound

L(π_m, λ) − L(π_m, λ_m) ≤ (2η_m)^{−1} · ( ‖λ − λ_m‖² − ‖λ − λ_{m+1}‖² ) + (η_m/2) · ‖∂_λ L(π_m, λ_m)‖²,

and the lower bound

L(π, λ_m) − L(π_m, λ_m) ≥ ((1 − γ) η_m)^{−1} ( Φ_π(π ‖ π_{m+1}) − Φ_π(π ‖ π_m) ) − (η_m/(1 − γ)) · ( sup_{s∈S} sup_{a∈A} |Q^{π_m, λ_m}(s, a)| )².

Before we prove Lemma 1, we first present two auxiliary lemmas. The first lemma (Lemma 2) is rather standard; a similar version of the result can be found in Proposition 3.2.2 of Bertsekas (2015). For self-completeness, we still provide the proof here.
Lemma 2.
Let f be a proper convex function on a space Ω (not necessarily a Euclidean space). Let C be an open set in Ω, and let Ψ_ξ(·‖·) be the Bregman divergence induced by a strictly convex function ξ on Ω. For an arbitrary constant η > 0 and a point x_0 ∈ Ω, define

x* = argmin_{x ∈ C} { f(x) + (1/η) Ψ_ξ(x ‖ x_0) }.

Then we have

f(x) − f(x*) ≥ (1/η) ( Ψ_ξ(x* ‖ x_0) + Ψ_ξ(x ‖ x*) − Ψ_ξ(x ‖ x_0) ), ∀ x ∈ Ω.

By symmetry, for a concave function g on Ω and

x̂* = argmax_{x ∈ C} { g(x) − (1/η) Ψ_ξ(x ‖ x_0) },

we have

g(x) − g(x̂*) ≤ −(1/η) ( Ψ_ξ(x̂* ‖ x_0) + Ψ_ξ(x ‖ x̂*) − Ψ_ξ(x ‖ x_0) ), ∀ x ∈ Ω.
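As a concrete instance of Lemma 2 (and the one used for the policy update), take ξ to be the negative entropy on the probability simplex; the induced Bregman divergence is then exactly the Kullback-Leibler divergence:

ξ(x) = Σ_a x(a) log x(a),
Ψ_ξ(x ‖ y) = ξ(x) − ξ(y) − ⟨∇ξ(y), x − y⟩ = Σ_a x(a) log( x(a)/y(a) ) = KL(x ‖ y),

for probability distributions x and y on the simplex. With this choice, Lemma 2 yields the three-point inequality for the KL divergence that is applied, state by state, to the regularized policy update in the proof of Lemma 1 below.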
Proof of Lemma 2. We first consider the minimization problem. Since x* minimizes the objective f(x) + η^{−1} · Ψ_ξ(x ‖ x_0) on the set C, there exists a subgradient of the form

p* = q* + η^{−1} · ∂_x Ψ_ξ(x* ‖ x_0) = q* + η^{−1} · ( ∇ξ(x*) − ∇ξ(x_0) )

such that ⟨p*, x − x*⟩ ≥ 0 for all x ∈ C. Here q* ∈ ∂_x f(x*) is some subgradient of f at x*. As a result, by the property of subgradients, for all x ∈ C we have

f(x) ≥ f(x*) + ⟨q*, x − x*⟩
     ≥ f(x*) + η^{−1} · ⟨∇ξ(x_0) − ∇ξ(x*), x − x*⟩
     = f(x*) + η^{−1} · ( Ψ_ξ(x* ‖ x_0) + Ψ_ξ(x ‖ x*) − Ψ_ξ(x ‖ x_0) ),

where the last equality follows from the definition of the Bregman divergence, Ψ_ξ(x ‖ y) = ξ(x) − ξ(y) − ⟨∇ξ(y), x − y⟩. For the maximization problem, we only need to consider −g and apply the above result. □

The next lemma is Lemma 6.1 in Kakade and Langford (2002). Given two policies, it characterizes the difference of expected accumulated costs as the inner product of the advantage function of one policy and the occupation measure of the other policy. Recall that the value function V^π and the action-value function Q^π of an MDP under policy π are defined in (1) and (12).

Lemma 3.
For arbitrary policies π, π′ ∈ Π_S,

E_{s∼μ}[ V^π(s) ] − E_{s∼μ}[ V^{π′}(s) ] = (1/(1 − γ)) · E_{(s,a)∼ν^{π′}}[ Q^π(s, a) − V^π(s) ],

where ν^{π′}(·,·) is the occupation measure associated with π′.

Proof of Lemma 1. For the upper bound, note that because L(π_m, λ) is linear in λ, the update

λ_{m+1} = Proj_{Λ_M}{ λ_m + η_m · ∂_λ L(π_m, λ_m) }

is equivalent to

λ_{m+1} = argmax_{λ ∈ Λ_M} { L(π_m, λ) − (1/(2η_m)) ‖λ − λ_m‖² }.

Then, by Lemma 2, we have

L(π_m, λ) − L(π_m, λ_{m+1}) ≤ (2η_m)^{−1} ( ‖λ − λ_m‖² − ‖λ − λ_{m+1}‖² − ‖λ_{m+1} − λ_m‖² )
                            ≤ (2η_m)^{−1} ( ‖λ − λ_m‖² − ‖λ − λ_{m+1}‖² ).

Next,

L(π_m, λ) − L(π_m, λ_m) ≤ (2η_m)^{−1} ( ‖λ − λ_m‖² − ‖λ − λ_{m+1}‖² ) + L(π_m, λ_{m+1}) − L(π_m, λ_m)
 = (2η_m)^{−1} ( ‖λ − λ_m‖² − ‖λ − λ_{m+1}‖² ) + ⟨∂_λ L(π_m, λ_m), λ_{m+1} − λ_m⟩
 ≤ (2η_m)^{−1} ( ‖λ − λ_m‖² − ‖λ − λ_{m+1}‖² ) + (η_m/2) · ‖∂_λ L(π_m, λ_m)‖²,

where the last inequality follows from the definition of λ_{m+1} and the non-expansiveness of the projection. This establishes the upper bound.

For the lower bound, recall that we update π_m via

π_{m+1}(·|s) = argmin_{π(·|s) ∈ Δ_A} { ⟨Q^{π_m, λ_m}(s, ·), π(·|s)⟩ + (1/η_m) KL( π(·|s) ‖ π_m(·|s) ) }

for each state s ∈ S. Then, for an arbitrary stationary policy π′ ∈ Π_S, we have

π_{m+1} = argmin_{π ∈ Π_S} { E_{s∼ν_s^{π′}}[ ⟨Q^{π_m, λ_m}(s, ·), π(·|s)⟩ + (1/η_m) KL( π(·|s) ‖ π_m(·|s) ) ] },

where ν_s^{π′} is the state occupation measure associated with π′.

Note that the space of stationary policies, Π_S, can be represented as a product of simplices. Consider Ω := Π_S = (Δ_A)^{⊗|S|} and let

g(π) := E_{s∼ν_s^{π′}}[ ⟨Q^{π_m, λ_m}(s, ·), π(·|s)⟩ ],  Ψ_ξ(π ‖ π_m) := E_{s∼ν_s^{π′}}[ KL( π(·|s) ‖ π_m(·|s) ) ] = Φ_{π′}(π ‖ π_m),

where Φ_{π′} is defined in (21). Since g(π) is linear in π, setting x = π′ in Lemma 2, we obtain

E_{s∼ν_s^{π′}}[ ⟨Q^{π_m, λ_m}(s, ·), π′(·|s) − π_{m+1}(·|s)⟩ ] ≥ η_m^{−1} ( Φ_{π′}(π_{m+1} ‖ π_m) + Φ_{π′}(π′ ‖ π_{m+1}) − Φ_{π′}(π′ ‖ π_m) ),

which can be equivalently written as

η_m^{−1} ( Φ_{π′}(π′ ‖ π_{m+1}) − Φ_{π′}(π′ ‖ π_m) + Φ_{π′}(π_{m+1} ‖ π_m) )
 ≤ E_{s∼ν_s^{π′}}[ ⟨Q^{π_m, λ_m}(s, ·), π′(·|s) − π_m(·|s)⟩ ] + E_{s∼ν_s^{π′}}[ ⟨Q^{π_m, λ_m}(s, ·), π_m(·|s) − π_{m+1}(·|s)⟩ ].    (31)

We next derive an upper bound for the right-hand side of (31). Let ‖·‖_TV denote the total variation norm of probability distributions. First, for each state s ∈ S,

η_m · ⟨Q^{π_m, λ_m}(s, ·), π_m(·|s) − π_{m+1}(·|s)⟩
 ≤ η_m · sup_{a∈A} |Q^{π_m, λ_m}(s, a)| · 2 ‖π_m(·|s) − π_{m+1}(·|s)‖_TV
 ≤ η_m² · ( sup_{s∈S} sup_{a∈A} |Q^{π_m, λ_m}(s, a)| )² + ‖π_{m+1}(·|s) − π_m(·|s)‖²_TV
 ≤ η_m² · ( sup_{s∈S} sup_{a∈A} |Q^{π_m, λ_m}(s, a)| )² + (1/2) KL( π_{m+1}(·|s) ‖ π_m(·|s) )   (by Pinsker's inequality),

where the second inequality uses 2ab ≤ a² + b².
Hence, taking the average over s ∼ ν_s^{π′} and dividing by η_m, we obtain

E_{s∼ν_s^{π′}}[ ⟨Q^{π_m, λ_m}(s, ·), π_m(·|s) − π_{m+1}(·|s)⟩ ] ≤ η_m · ( sup_{s∈S} sup_{a∈A} |Q^{π_m, λ_m}(s, a)| )² + (2η_m)^{−1} · Φ_{π′}(π_{m+1} ‖ π_m).    (32)

Second, recall that ν^π(s, a) = ν_s^π(s) · π(a|s) and V^π(s) = ⟨Q^π(s, ·), π(·|s)⟩. Then, by Lemma 3, for the modified unconstrained MDP we have

E_{s∼ν_s^{π′}}[ ⟨Q^{π_m, λ_m}(s, ·), π′(·|s) − π_m(·|s)⟩ ] = E_{(s,a)∼ν^{π′}}[ Q^{π_m, λ_m}(s, a) − V^{π_m, λ_m}(s) ] = (1 − γ) · ( L(π′, λ_m) − L(π_m, λ_m) ).    (33)

Finally, combining (31)-(33), we obtain

L(π′, λ_m) − L(π_m, λ_m) ≥ ((1 − γ) η_m)^{−1} ( Φ_{π′}(π′ ‖ π_{m+1}) − Φ_{π′}(π′ ‖ π_m) ) − (η_m/(1 − γ)) ( sup_{s∈S} sup_{a∈A} |Q^{π_m, λ_m}(s, a)| )². □

We are now ready to prove Theorem 1.
Proof of Theorem 1
We first prove the bound for D(π̄_T) − q. For this, we only need to establish an upper bound for ‖[∂_λ L(π̄_T, λ)]_+‖. Since L(π_m, λ) is linear in λ, we have

L(π_m, λ_m) − L(π_m, λ*) = (λ_m − λ*)^⊤ ∂_λ L(π_m, λ_m).    (34)

By the first part of Lemma 1, for any λ, we have

η_m · (λ − λ_m)^⊤ ∂_λ L(π_m, λ_m) = η_m · ( L(π_m, λ) − L(π_m, λ_m) ) ≤ ( ‖λ_m − λ‖² − ‖λ_{m+1} − λ‖² )/2 + η_m² G²/2.    (35)

On the other hand, by the saddle-point property of (π*, λ*), we also have

L(π_m, λ*) ≥ L(π*, λ*).    (36)

In the following, we denote L* := L(π*, λ*). Combining (34)-(36), we obtain

η_m · (λ − λ*)^⊤ ∂_λ L(π_m, λ_m)
 = η_m · (λ − λ_m)^⊤ ∂_λ L(π_m, λ_m) + η_m · (λ_m − λ*)^⊤ ∂_λ L(π_m, λ_m)
 ≤ ( ‖λ_m − λ‖² − ‖λ_{m+1} − λ‖² )/2 + η_m² G²/2 + η_m · ( L(π_m, λ_m) − L* ).

Taking the telescoping sum of the above inequality, for any λ ∈ Λ_M, we have

Σ_{m=0}^{T−1} η_m · (λ − λ*)^⊤ ∂_λ L(π_m, λ_m) ≤ ( ‖λ_0 − λ‖² − ‖λ_T − λ‖² )/2 + ( Σ_{m=0}^{T−1} η_m²/2 ) · G² + Σ_{m=0}^{T−1} η_m · ( L(π_m, λ_m) − L* ).    (37)

For the left-hand side of (37), let

ζ_T := Σ_{m=0}^{T−1} η_m · ∂_λ L(π_m, λ_m) = ( Σ_{m=0}^{T−1} η_m ) · ∂_λ L(π̄_T, λ),

where the last equality follows from the definition of π̄_T and the linearity of the value function under the mixing operation. If [ζ_T]_+ = 0, then the upper bound holds trivially. Otherwise, let

λ̃ = λ* + r · [ζ_T]_+ / ‖[ζ_T]_+‖,

where r is the slackness constant in the definition of Λ_M in (19). It is easy to see that λ̃ ∈ Λ_M. By (37), we have

(λ̃ − λ*)^⊤ ζ_T ≤ max_{λ∈Λ_M} ‖λ − λ_0‖²/2 + ( Σ_{m=0}^{T−1} η_m²/2 ) · G² + Σ_{m=0}^{T−1} η_m · ( L(π_m, λ_m) − L* ).

By the definition of λ̃, we also have

(λ̃ − λ*)^⊤ ζ_T = r · ([ζ_T]_+)^⊤ ζ_T / ‖[ζ_T]_+‖ = r · ‖[ζ_T]_+‖ = r · ( Σ_{m=0}^{T−1} η_m ) · ‖[∂_λ L(π̄_T, λ)]_+‖.

Hence,

‖[∂_λ L(π̄_T, λ)]_+‖ ≤ max_{λ∈Λ_M} ‖λ − λ_0‖² / ( 2r Σ_{m=0}^{T−1} η_m ) + G² Σ_{m=0}^{T−1} η_m² / ( 2r Σ_{m=0}^{T−1} η_m ) + Σ_{m=0}^{T−1} η_m ( L(π_m, λ_m) − L* ) / ( r Σ_{m=0}^{T−1} η_m ).    (38)

Next, recall that π̄_T = Σ_{m=0}^{T−1} η̃_m π_m, where η̃_m = η_m / Σ_{m=0}^{T−1} η_m for m = 0, ..., T−1. Since λ* is the optimal solution of the dual problem and L(π*, λ*) ≥ L(π*, λ̄_T), by the saddle-point property we have

Σ_{m=0}^{T−1} η̃_m · ( L(π_m, λ_m) − L* ) = Σ_{m=0}^{T−1} η̃_m · L(π_m, λ_m) − L*
 ≤ Σ_{m=0}^{T−1} η̃_m · L(π_m, λ_m) − L(π*, λ̄_T)
 = Σ_{m=0}^{T−1} η̃_m · ( L(π_m, λ_m) − L(π*, λ_m) ).    (39)

Similarly, under Assumption 3, by the second part of Lemma 1 we have

Σ_{m=0}^{T−1} η̃_m · ( L(π_m, λ_m) − L(π*, λ_m) ) ≤ ( (1 − γ) · Σ_{m=0}^{T−1} η_m )^{−1} · ( G² · Σ_{m=0}^{T−1} η_m² + Φ_{π*}(π* ‖ π_0) ),    (40)

since the weighted KL divergence Φ_{π*}(·‖·) is nonnegative.

Lastly, combining inequalities (38)-(40), we have

‖[∂_λ L(π̄_T, λ)]_+‖ ≤ G² / ( r Σ_{m=0}^{T−1} η_m ) + ( 1/2 + 1/(8(1 − γ)) ) · G² Σ_{m=0}^{T−1} η_m² / ( r Σ_{m=0}^{T−1} η_m ) + (1 − γ)^{−1} Φ_{π*}(π* ‖ π_0) / ( 2r Σ_{m=0}^{T−1} η_m ).

If we set η_m = Θ(1/√m), there exist finite constants κ_1 and κ_2 such that

Σ_{m=0}^{T−1} η_m ≥ κ_1 √T and Σ_{m=0}^{T−1} η_m² ≤ κ_2 log(T).

Subsequently, we obtain

‖[∂_λ L(π̄_T, λ)]_+‖ ≤ ( G² κ_2 log(T) + Φ_{π*}(π* ‖ π_0) ) / ( r (1 − γ) κ_1 √T ).

Similarly, if we set η_m = η (constant stepsize), then

‖[∂_λ L(π̄_T, λ)]_+‖ ≤ ( G² + (1 − γ)^{−1} · Φ_{π*}(π* ‖ π_0) ) / ( r T η ) + ( 1/2 + 1/(8(1 − γ)) ) · G² η / r.

We next prove the bound for C(π̄_T) − L*. We start with the upper bound. By the definition of π̄_T, we have

C(π̄_T) − L* = Σ_{m=0}^{T−1} η̃_m · ( L(π_m, λ_m) − L* ) − Σ_{m=0}^{T−1} η̃_m · λ_m^⊤ ( D(π_m) − q ).    (41)

From inequalities (39) and (40), we have

Σ_{m=0}^{T−1} η̃_m · ( L(π_m, λ_m) − L* ) ≤ ( (1 − γ) Σ_{m=0}^{T−1} η_m )^{−1} ( G² Σ_{m=0}^{T−1} η_m² + Φ_{π*}(π* ‖ π_0) ).

Next, since D(π_m) − q = ∂_λ L(π_m, λ_m), setting λ = 0 in (35), we similarly obtain

−Σ_{m=0}^{T−1} η̃_m · λ_m^⊤ ( D(π_m) − q ) ≤ ( ‖λ_0‖² + G² Σ_{m=0}^{T−1} η_m² ) / ( 2 Σ_{m=0}^{T−1} η_m ).

Hence, when η_m = Θ(1/√m), we have

C(π̄_T) − L* ≤ ( G² κ_2 log(T) + Φ_{π*}(π* ‖ π_0) + ‖λ_0‖² ) / ( (1 − γ) κ_1 √T ).
For the lower bound, by the saddle-point property, we have

C(π̄_T) = L(π̄_T, λ*) − (λ*)^⊤ ( D(π̄_T) − q ) ≥ L* − (λ*)^⊤ ( D(π̄_T) − q ).

Since λ* ≥ 0 and D(π̄_T) − q ≤ [D(π̄_T) − q]_+, we obtain

C(π̄_T) − L* ≥ −‖λ*‖ · ‖[D(π̄_T) − q]_+‖ ≥ −‖λ*‖ · ( G² κ_2 log(T) + Φ_{π*}(π* ‖ π_0) ) / ( r (1 − γ) κ_1 √T ).

Similarly, when η_m = η, we have

C(π̄_T) − L* ≤ ( (1 − γ)^{−1} Φ_{π*}(π* ‖ π_0) + ‖λ_0‖²/2 ) / ( T η ) + 5 G² η / ( 2 (1 − γ) ),
C(π̄_T) − L* ≥ −‖λ*‖ · ( G² + (1 − γ)^{−1} Φ_{π*}(π* ‖ π_0) ) / ( r T η ) − ‖λ*‖ · ( 1/2 + 1/(8(1 − γ)) ) · G² η / r. □