Constrained Reinforcement Learning Has Zero Duality Gap
Santiago Paternain, Luiz F.O. Chamon, Miguel Calvo-Fullana, Alejandro Ribeiro
Electrical and Systems Engineering, University of Pennsylvania
{spater,luizf,cfullana,aribeiro}@seas.upenn.edu
Abstract
Autonomous agents must often deal with conflicting requirements, such as completing tasks using the least amount of time/energy, learning multiple tasks, or dealing with multiple opponents. In the context of reinforcement learning (RL), these problems are addressed by (i) designing a reward function that simultaneously describes all requirements or (ii) combining modular value functions that encode them individually. Though effective, these methods have critical downsides. Designing good reward functions that balance different objectives is challenging, especially as the number of objectives grows. Moreover, implicit interference between goals may lead to performance plateaus as they compete for resources, particularly when training on-policy. Similarly, selecting parameters to combine value functions is at least as hard as designing an all-encompassing reward, given that the effect of their values on the overall policy is not straightforward. The latter is generally addressed by formulating the conflicting requirements as a constrained RL problem, which is then solved using primal-dual methods. These algorithms are in general not guaranteed to converge to the optimal solution since the problem is not convex. This work provides theoretical support for these approaches by establishing that, despite its non-convexity, this problem has zero duality gap, i.e., it can be solved exactly in the dual domain, where it becomes convex. We further show that this result essentially holds if the policy is described by a good parametrization (e.g., a neural network). Finally, we connect this result with primal-dual algorithms present in the literature and establish their convergence to the optimal solution.
Autonomous agents must often deal with conflicting requirements, such as completing a task in the least amount of time/energy, learning multiple tasks or contexts, dealing with multiple opponents, or meeting several specifications designed to guide the agent during learning. In the context of reinforcement learning [1], these problems are generally addressed by combining modular value functions that encode them individually, multiplying each signal by its own coefficient, which controls the emphasis placed on it [2–4]. Although effective, this multi-objective approach [5] has several downsides. First, for each set of penalty coefficients there exists a different optimal solution, also known as a Pareto optimal point [6]. In practice, the exact coefficients are selected through a time-consuming and computationally intensive process of hyper-parameter tuning that is often domain dependent, as shown in [7–9]. Moreover, implicit interference between the goals may lead to training plateaus as they compete for resources in the policy [10].

An alternative is to embed all conflicting requirements in a constrained RL problem and to use a primal-dual algorithm, as in [7, 11], that chooses the parameters automatically. The main advantage of this approach is that constraints ensure satisfying behavior without the need to manually select the penalty coefficients. In these algorithms the policy update is on a faster time-scale than the multiplier update. Effectively, these approaches thus work as if the dual problem of the constrained reinforcement learning problem were being solved, which guarantees obtaining the feasible solution with the smallest suboptimality. Yet, there is no guarantee on how small that suboptimality is. In this work we address this question. In particular, we establish that:

1. Despite its non-convexity, constrained reinforcement learning for policies belonging to a general distribution class has zero duality gap, i.e., it can be solved exactly in the dual domain, where the problem is actually convex.
2. Since working with generic distributions as policies is in general intractable, we extend this result to parametrized policies, showing that the suboptimality bound also holds when the parametrization is a universal approximator, e.g., a neural network [12].
3. We leverage these theoretical results to establish that the family of primal-dual algorithms for constrained reinforcement learning, e.g. [7, 11], in fact converges to the optimal solution under mild assumptions.

Constrained Markov Decision Processes (CMDPs) [13] are an active field of research. CMDP applications cover a vast number of topics, such as electric grids [14], networking [15], robotics [3, 16, 17], and finance [18, 19]. The most common approaches to solve these problems can be divided into the following categories.
Manual selection of Lagrange multipliers: constrained reinforcement learning problems can be solved by maximizing an unconstrained Lagrangian for a specific multiplier [2]. The combination of different rewards with manually selected Lagrange multipliers has been applied, for instance, to learning complex movements for humanoids [4] or to limiting the variance of the constraint that needs to be satisfied [19, 20].
Integrating prior knowledge: knowledge about the system transitions can be exploited to project the action chosen by the policy onto a set that ensures the satisfaction of the constraints [21].
Primal-dual algorithms: methods such as [7, 11] choose the multipliers dynamically by finding the best policy for the current set of parameters and then taking steps along the gradient of the Lagrangian with respect to the multipliers. These methods handle general constraints, are reward agnostic, and do not require prior knowledge.
Let t ∈ ℕ ∪ {0} denote the time instant and let S ⊂ R^n and A ⊂ R^d be compact sets describing the possible states and actions of an agent described by a Markovian dynamical system with transition probability density p, i.e., p(s_{t+1} | {s_u, a_u}_{u≤t}) = p(s_{t+1} | s_t, a_t) for s_t ∈ S and a_t ∈ A for all t. The agent chooses actions sequentially based on a policy π ∈ P(S), where P(S) is the space of probability measures on (A, B(A)) parametrized by elements of S, and B(A) are the Borel sets of A. The action taken by the agent at each state results in rewards defined by the functions r_i : S × A → R, for i = 0, …, m, that the agent accumulates over time. These rewards describe different objectives that the agent must achieve, such as completing a task, remaining within a region of the state space, or not running out of battery. The goal of constrained RL is then to find a policy π⋆ ∈ P(S) that meets these objectives by solving the problem

P⋆ ≜ max_{π ∈ P(S)}  V_0(π) ≜ E_{s,π} [ Σ_{t=0}^∞ γ^t r_0(s_t, π(s_t)) ]
subject to  V_i(π) ≜ E_{s,π} [ Σ_{t=0}^∞ γ^t r_i(s_t, π(s_t)) ] ≥ c_i,  i = 1, …, m,    (PI)

where γ ∈ (0,1) is a discount factor and c_i ∈ R represents the i-th reward specification. It is important to contrast the formulation in (PI) with the unconstrained, regularized problem commonly found in the literature [4, 19, 20]

maximize_{π ∈ P(S)}  V_0(π) + Σ_{i=1}^m w_i ( V_i(π) − c_i ),    (~PI)

where w_i ≥ 0 are the regularization parameters. First, (PI) precludes the manual balancing of different requirements through the choice of w_i. Even with expert knowledge, tuning these parameters can be as hard as solving the RL problem itself, since there is no straightforward relation between the value of w_i and the value V_i(π⋆) attained by the final policy. What is more, note that the objective of (~PI) can be written as a single value function V̄(π) ≜ E_{s,π}[ Σ_{t=0}^∞ γ^t r̄(s_t, π(s_t)) ] with r̄(s_t, π(s_t)) = r_0(s_t, π(s_t)) + Σ_{i=1}^m w_i r_i(s_t, π(s_t)). In other words, choosing the value of w_i amounts to designing a reward that simultaneously encodes different, possibly conflicting, objectives and/or requirements. Given how challenging it can be to design a good reward function even for a single task, it is clear that this regularized approach is neither efficient nor effective.

Though promising, solving the constrained RL problem in (PI) is intricate. Indeed, it is both infinite dimensional and non-convex, so it is in general not tractable in the primal domain. Its dual problem, on the other hand, is convex and has dimensionality equal to the number of constraints. However, since (PI) is not a convex program, its dual problem in general only provides an upper bound on P⋆. How good the policy obtained by solving the dual problem is depends on the tightness of this bound. What is more, formulating the problem in the dual domain is at least as hard as solving (~PI), which is also infinite dimensional and non-convex. In the sequel, we address these two issues by first showing that (PI) has no duality gap (Section 3), i.e., that the upper bound on P⋆ from the dual problem is tight. This implies that (PI) can be solved exactly in the dual domain.
Then, we show that we lose (almost) nothing by parametrizing the policies π (Section 4), which immediately addresses the issue of dimensionality in (PI)–(~PI). Finally, we put forward and analyze a primal-dual algorithm for constrained RL (Section 5), showing that under mild conditions it yields a locally optimal, feasible solution of (PI).
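To make the quantities in (PI) concrete, the following Python sketch estimates the objective V_0(π) and the constraint values V_i(π) of a fixed policy by truncated Monte Carlo rollouts and checks feasibility against the specifications c_i. This is only an illustrative sketch: the environment interface (a hypothetical env whose step() returns the reward vector (r_0, r_1, …, r_m)) and all names are placeholder assumptions, not objects defined in the paper.

import numpy as np

def estimate_values(env, policy, gamma, num_episodes=100, horizon=200):
    """Monte Carlo estimate of V_i(pi) = E[sum_t gamma^t r_i(s_t, a_t)], i = 0,...,m.

    Assumes a hypothetical env whose step() returns a reward *vector*
    (r_0, r_1, ..., r_m): r_0 is the objective, r_1..r_m are the constrained rewards.
    """
    returns = []
    for _ in range(num_episodes):
        s = env.reset()
        total, discount = None, 1.0
        for _ in range(horizon):  # truncated surrogate for the infinite horizon in (PI)
            a = policy(s)
            s, r_vec, done = env.step(a)
            r_vec = np.asarray(r_vec, dtype=float)
            total = discount * r_vec if total is None else total + discount * r_vec
            discount *= gamma
            if done:
                break
        returns.append(total)
    return np.mean(returns, axis=0)  # [V_0, V_1, ..., V_m]

def is_feasible(values, c, tol=0.0):
    """Check the constraints V_i(pi) >= c_i of (PI); values[0] is the objective V_0."""
    return bool(np.all(values[1:] >= np.asarray(c, dtype=float) - tol))

The sketch only evaluates a fixed policy; the rest of the paper is about how to search for a policy that is both feasible and optimal.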
Let us start by formalizing the concept of dual problem. Let the vector λ ∈ R^m_+ collect the Lagrange multipliers of the constraints of (PI) and define its Lagrangian as

L(π, λ) ≜ V_0(π) + Σ_{i=1}^m λ_i ( V_i(π) − c_i ).    (1)

The dual function is then the point-wise maximum of (1) with respect to the policy π, i.e.,

d(λ) ≜ max_{π ∈ P(S)} L(π, λ).    (2)

The dual function (2) provides an upper bound on the value of (PI), i.e., d(λ) ≥ P⋆ for all λ ∈ R^m_+ [22, Section 5.1.3]. The tighter the bound, the closer the policy obtained from (2) is to the optimal solution of (PI). Hence, the dual problem is that of finding the tightest of these bounds:

D⋆ ≜ min_{λ ∈ R^m_+} d(λ).    (DI)

Note that the dual function (2) can be related to the unconstrained, regularized problem (~PI) from Section 2 by taking λ_i = w_i in (1). Hence, (2) takes on the optimal value of (~PI) for all possible regularization parameters. Problem (DI) then finds the best regularized problem, i.e., the one whose value is closest to P⋆. It turns out that this problem is tractable whenever d(λ) can be evaluated, since (DI) is a convex program (the dual function is the point-wise maximum of a set of linear functions and is therefore convex) [22, Section 3.2.3].

Despite these similarities, (DI) [and consequently (~PI)] need not solve the same problem as (PI). In other words, there need not be a relation between the optimal dual variables λ⋆ from (DI), or the regularization parameters w_i, and the specifications c_i of (PI). This depends on the value of the duality gap ∆ = D⋆ − P⋆. Indeed, if ∆ is small, then so is the suboptimality of the policies obtained from (DI). In the limit case where ∆ = 0, problems (PI)–(DI) and (~PI) are all essentially equivalent. Since (PI) is not a convex program, however, this result does not hold immediately. Still, we claim in Theorem 1 that (PI) has zero duality gap under Slater's condition. Before stating the theorem, we define the perturbation function associated to problem (PI), which is fundamental for the proof of the result and for future reference. For any ξ ∈ R^m, the perturbation function associated to (PI) is defined as

P(ξ) ≜ max_{π ∈ P(S)}  V_0(π)
subject to  V_i(π) ≥ c_i + ξ_i,  i = 1, …, m.    (PI′)

Notice that P(0) = P⋆, the optimal value of (PI). We formally state next the conditions under which Problem (PI) has zero duality gap.
Theorem 1. Suppose that r_i is bounded for all i = 0, …, m and that Slater's condition holds for (PI). Then, strong duality holds for (PI), i.e., P⋆ = D⋆.

Proof. The proof relies on a well-known result from perturbation theory connecting strong duality to the concavity of the perturbation function defined in (PI′). We formalize this result next.
Proposition 1 (Fenchel-Moreau). If (i) Slater's condition holds for (PI) and (ii) its perturbation function P(ξ) is concave, then strong duality holds for (PI).

Proof. See, e.g., [23, Cor. 30.2.2].

Condition (i) of Proposition 1 is satisfied by the hypotheses of Theorem 1. It suffices then to show that the perturbation function is concave [(ii)], i.e., that for every ξ¹, ξ² ∈ R^m and µ ∈ (0,1),

P[ µξ¹ + (1−µ)ξ² ] ≥ µ P(ξ¹) + (1−µ) P(ξ²).    (3)

If the problem becomes infeasible for either perturbation ξ¹ or ξ², then P(ξ¹) = −∞ or P(ξ²) = −∞ and (3) holds trivially. For perturbations that keep the problem feasible, suppose P(ξ¹) and P(ξ²) are achieved by the policies π₁ ∈ P(S) and π₂ ∈ P(S) respectively. Then, P(ξ¹) = V_0(π₁) with V_i(π₁) − c_i ≥ ξ¹_i, and P(ξ²) = V_0(π₂) with V_i(π₂) − c_i ≥ ξ²_i, for i = 1, …, m.

To establish (3) it suffices to show that for every µ ∈ (0,1) there exists a policy π_µ such that V_i(π_µ) − c_i ≥ µξ¹_i + (1−µ)ξ²_i and V_0(π_µ) = µV_0(π₁) + (1−µ)V_0(π₂). Notice that any policy π_µ satisfying the previous conditions is feasible for the specification c_i + µξ¹_i + (1−µ)ξ²_i. Hence, by definition of the perturbation function (PI′), it follows that

P[ µξ¹ + (1−µ)ξ² ] ≥ V_0(π_µ) = µV_0(π₁) + (1−µ)V_0(π₂) = µP(ξ¹) + (1−µ)P(ξ²).    (4)

If such a policy exists, the previous equation implies (3). Thus, to complete the proof we need to establish its existence. To do so, we start by formulating a linear program equivalent to (PI′). Notice that for any i = 0, …, m we can write

V_i(π) = ∫_{(S×A)^∞} ( Σ_{t=0}^∞ γ^t r_i(s_t, a_t) ) p_π(s_0, a_0, s_1, a_1, …) ds_0 da_0 ds_1 da_1 …    (5)

Since the reward functions are bounded, the Dominated Convergence Theorem allows us to exchange the order of the sum and the integral. Moreover, using conditional probabilities and the Markov property of the system transitions, we can write V_i(π) as

V_i(π) = Σ_{t=0}^∞ γ^t ∫_{(S×A)^∞} r_i(s_t, a_t) ∏_{u=1}^∞ p(s_u | s_{u−1}, a_{u−1}) π(a_u | s_u) p(s_0) π(a_0 | s_0) ds_0 … da_0 …    (6)

Notice that for every u > t the integrals with respect to a_u and s_u equal one, since they integrate density functions. Thus, the previous expression reduces to

V_i(π) = Σ_{t=0}^∞ γ^t ∫_{(S×A)^{t+1}} r_i(s_t, a_t) ∏_{u=1}^t p(s_u | s_{u−1}, a_{u−1}) π(a_u | s_u) p(s_0) π(a_0 | s_0) ds_0 … ds_t da_0 … da_t.    (7)

Notice that the probability density of being at state s and choosing action a at time t under the policy π can be written as

p^t_π(s_t, a_t) = ∫_{(S×A)^t} ∏_{u=1}^t p(s_u | s_{u−1}, a_{u−1}) π(a_u | s_u) p(s_0) π(a_0 | s_0) ds_0 … ds_{t−1} da_0 … da_{t−1}.    (8)

Thus, using again the Dominated Convergence Theorem, one can write (7) compactly as

V_i(π) = ∫_{S×A} r_i(s, a) Σ_{t=0}^∞ γ^t p^t_π(s, a) ds da.    (9)

By defining the occupation measure ρ(s, a) = (1−γ) Σ_{t=0}^∞ γ^t p^t_π(s, a), it follows that (1−γ) V_i(π) = ∫_{S×A} r_i(s, a) ρ(s, a) ds da.
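As a quick sanity check of the identity (1−γ) V_i(π) = ∫ r_i(s,a) ρ(s,a) ds da on a finite state-action space, the sketch below builds a small, hypothetical MDP, computes a (truncated) occupation measure by propagating state marginals, computes V_i(π) independently from the Bellman equation, and compares the two sides. All quantities are illustrative assumptions, not taken from the paper.

import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma, T = 4, 3, 0.9, 2000  # tiny hypothetical MDP, long truncation horizon

P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)   # p(s'|s,a)
r = rng.random((nS, nA))                                          # one reward r_i(s,a)
pi = rng.random((nS, nA)); pi /= pi.sum(axis=1, keepdims=True)    # policy pi(a|s)
p0 = np.full(nS, 1.0 / nS)                                        # initial distribution

# Occupation measure rho(s,a) = (1-gamma) * sum_t gamma^t p_t(s,a), truncated at T
p_s = p0.copy()
rho = np.zeros((nS, nA))
for t in range(T):
    p_sa = p_s[:, None] * pi                 # p_t(s,a) = p_t(s) * pi(a|s)
    rho += (1 - gamma) * gamma**t * p_sa
    p_s = np.einsum('sa,san->n', p_sa, P)    # p_{t+1}(s') = sum_{s,a} p_t(s,a) p(s'|s,a)

# Value V_i(pi) via the Bellman equation: v = r_pi + gamma * P_pi v
r_pi = np.sum(pi * r, axis=1)                # expected reward per state
P_pi = np.einsum('sa,san->sn', pi, P)        # state transition matrix under pi
v = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
V = p0 @ v

print(np.isclose((1 - gamma) * V, np.sum(rho * r)))  # True: (1-gamma) V_i = <r_i, rho>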
Denote by M(S, A) the set of measures over S × A and define R as the set of all occupation measures induced by the policies π ∈ P(S), i.e.,

R := { ρ ∈ M(S, A) | ρ(s, a) = (1−γ) Σ_{t=0}^∞ γ^t p_π(s_t = s, a_t = a), π ∈ P(S) }.    (10)

It follows from [24, Theorem 3.1] that the set of occupation measures R is convex and compact. Hence, we can write the following linear program equivalent to (PI′):

P(ξ) ≜ max_{ρ ∈ R}  (1/(1−γ)) ∫_{S×A} r_0(s, a) ρ(s, a) ds da
subject to  (1/(1−γ)) ∫_{S×A} r_i(s, a) ρ(s, a) ds da ≥ c_i + ξ_i,  i = 1, …, m.    (PI′′)

Let ρ₁, ρ₂ ∈ R be the occupation measures associated to π₁ and π₂. Since R is convex, there exists a policy π_µ ∈ P(S) whose occupation measure is ρ_µ = µρ₁ + (1−µ)ρ₂ ∈ R. Notice that ρ_µ satisfies the constraints with specification c_i + µξ¹_i + (1−µ)ξ²_i for i = 1, …, m, since the integral is linear and ρ₁ and ρ₂ satisfy the constraints with specifications c_i + ξ¹_i and c_i + ξ²_i respectively. Thus, it follows that

P(µξ¹ + (1−µ)ξ²) ≥ (1/(1−γ)) ∫_{S×A} r_0(s, a) ρ_µ(s, a) ds da = µV_0(π₁) + (1−µ)V_0(π₂),    (11)

where we have used again the linearity of the integral. Since the policies π₁, π₂ are such that V_0(π₁) = P(ξ¹) and V_0(π₂) = P(ξ²), inequality (3) follows. This completes the proof that the perturbation function is concave. ∎

Theorem 1 establishes a fundamental equivalence between the constrained problem (PI) and the dual problem (DI) [and therefore also (~PI)]. Indeed, since (PI) has no duality gap, its solution can be obtained by solving (DI). What is more, the trade-offs expressed by the w_i in (~PI) are the same as those expressed by the specifications c_i, in the sense that they trace the same Pareto front. Nevertheless, note that the relationship between c_i and w_i is not trivial and that specifying the constrained problem is often considerably simpler. Theorem 1 establishes that this is indeed a valid transformation, since both problems are equivalent. Observe that, due to the non-convexity of the objective in RL problems, this result is in fact not immediate.

The theoretical importance of the previous result notwithstanding, it does not yield a procedure to solve (PI), since evaluating the dual function involves a maximization problem that is intractable for general classes of distributions. In the next section, we study the effect of using a finite parametrization for the policies and show that the price to pay in terms of duality gap depends on how "good" the parametrization is. If we consider, for instance, a neural network, which is a universal function approximator [12, 25–28], the loss in optimality can be made arbitrarily small.

We consider next the problem where the policies are parametrized by a vector θ ∈ R^p. This vector could be, for instance, the coefficients of a neural network or the weights of a linear combination of functions. In this work, we focus our attention on a widely used class of parametrizations that we term near-universal, which are able to model any function in P(S) to within a stated accuracy. We formalize this concept in the following definition.
Definition 1. A parametrization π_θ is an ϵ-universal parametrization of functions in P(S) if, for some ϵ > 0 and for any π ∈ P(S), there exists a parameter θ ∈ R^p such that

max_{s ∈ S} ∫_A | π(a|s) − π_θ(a|s) | da ≤ ϵ.    (12)

The previous definition includes all parametrizations that induce distributions close, in total variation norm, to the distributions in P(S). Notice that this is a milder requirement than approximation in uniform norm, a property that has been established for radial basis function networks [29], reproducing kernel Hilbert spaces [30], and deep neural networks [12]. Notice also that the objective function and the constraints in Problem (PI) involve an infinite horizon and thus the policy is applied an infinite number of times. Hence, the error introduced by the parametrization could a priori accumulate and induce distributions over trajectories that differ considerably from the distributions induced by policies in P(S). We claim in the following lemma that this is not the case.
Lemma 1. Let ρ and ρ_θ be the occupation measures induced by the policies π ∈ P(S) and π_θ respectively, where π_θ is an ϵ-universal parametrization of π. Then, it follows that

∫_{S×A} | ρ(s, a) − ρ_θ(s, a) | ds da ≤ ϵ / (1−γ).    (13)

The previous result, although derived as a technical step needed to bound the duality gap of parametrized problems, has a natural interpretation: the larger γ, that is, the more weight is placed on rewards far in the future, the larger the error in the approximation of the occupation measure.

Having defined the concept of universal approximator, we shift focus to the parametric version of the constrained reinforcement learning problem, i.e., to finding the parameters that solve (PI) when the policies are restricted to the functions induced by the chosen parametrization:

P⋆_θ ≜ max_θ  V_0(θ) ≜ E_{s,π_θ} [ Σ_{t=0}^∞ γ^t r_0(s_t, π_θ(s_t)) ]
subject to  V_i(θ) ≜ E_{s,π_θ} [ Σ_{t=0}^∞ γ^t r_i(s_t, π_θ(s_t)) ] ≥ c_i,  i = 1, …, m.    (PII)

Notice that problem (PII) is similar to the original problem (PI), with the only difference that the expectations are now with respect to distributions induced by the parameter vector θ. As in the previous section, let λ ∈ R^m_+ and define the dual function associated to (PII) as

d_θ(λ) ≜ max_{θ ∈ R^p} L_θ(θ, λ),  with  L_θ(θ, λ) ≜ V_0(θ) + Σ_{i=1}^m λ_i ( V_i(θ) − c_i ).    (14)

Likewise, we define the dual problem as finding the tightest upper bound on (PII):

D⋆_θ ≜ min_{λ ∈ R^m_+} d_θ(λ).    (DII)

As previously stated, the reason for introducing the parametrization is to turn the original functional optimization problem into a tractable one in which the optimization variable is a finite dimensional vector of parameters. Yet, there is a cost to introducing this parametrization: the duality gap is no longer null. This means that the solution obtained through the dual problem is suboptimal. We claim, however, that this gap is bounded by a function that is linear in the approximation error ϵ, so that if the parametrization has good representation power, the price to pay is almost zero. This is the subject of the following theorem.
Theorem 2. Suppose that r_i is bounded for all i = 0, …, m by constants B_{r_i} > 0 and define B_r = max_{i=1,…,m} B_{r_i}. Let λ⋆_ϵ be the solution of the dual problem associated to (PI′) with perturbation ξ_i = B_r ϵ/(1−γ) for all i = 1, …, m. Then, under the hypotheses of Theorem 1, it follows that

P⋆ ≥ D⋆_θ ≥ P⋆ − ( B_{r_0} + ‖λ⋆_ϵ‖ B_r ) ϵ/(1−γ),    (15)

where P⋆ is the optimal value of (PI) and D⋆_θ is the value of the parametrized dual problem (DII).

The implication of the previous result is that there is almost no price to pay for introducing a parametrization. By solving the dual problem (DII), the suboptimality incurred is of order ϵ, i.e., of the order of the error in the representation of the policies. Notice that this error can be made arbitrarily small by increasing the representation ability of the parametrization, for instance by increasing the dimension of the parameter vector θ. Hence, if we can compute the dual function, it is possible to solve (PI) approximately. Moreover, working in the dual domain provides two computational advantages: on one hand, the dimension of the problem equals the number of constraints in (PI); on the other hand, the dual function is always convex, so gradient descent in the dual domain solves the problem of interest. In the next section we propose an algorithm to solve (PI) approximately based on the previous discussion.

Before doing so, notice that we have not assumed anything about the feasibility of problem (PII). If the problem is infeasible, then D⋆_θ = −∞ and the upper bound in (15) holds trivially. On the other hand, infeasibility of (PII) also means that there is no policy π ∈ P(S) satisfying the constraints of (PI) with slack B_r ϵ/(1−γ), since π_θ is an ϵ-universal approximation of P(S). Hence the perturbed problem (PI′) is infeasible, which yields a dual multiplier λ⋆_ϵ of infinite norm, so the right-hand side of (15) holds as well. In that sense, as long as the parametrization keeps the problem feasible, the price to pay for parametrizing is almost zero.

As previously stated, the dual function is always convex since it is the point-wise maximum of linear functions. Thus the dual problem (DII) can be efficiently solved using (sub)gradient descent, with the caveat that, because the dual iterates must remain in the positive orthant, we include a projection onto this set after taking the gradient step

λ_{k+1} = [ λ_k − η ∂d_θ(λ_k) ]_+,    (16)

where η > 0 is the step size of the algorithm, [·]_+ denotes the projection onto R^m_+, and ∂d_θ(λ) denotes, with a slight abuse of notation, a vector in the subdifferential of d_θ(λ). The latter can be computed by virtue of Danskin's Theorem (see e.g. [31, Chapter 3]) by evaluating the constraints of the original problem (PII) at the primal maximizer of the Lagrangian. Thus, the main theoretical difficulty in this computation lies in finding said maximizer, since the Lagrangian is non-convex with respect to θ. However, maximizing the Lagrangian with respect to θ corresponds to learning a policy that uses as reward the following linear combination of rewards

r_λ(s, a) = r_0(s, a) + Σ_{i=1}^m λ_i r_i(s, a).    (17)
Indeed, using the linearity of the expectation, the cumulative discounted reward associated with r_λ(s, a) yields

E_{s,π_θ} [ Σ_{t=0}^∞ γ^t r_λ(s_t, a_t) ] = E_{s,π_θ} [ Σ_{t=0}^∞ γ^t r_0(s_t, a_t) ] + Σ_{i=1}^m λ_i E_{s,π_θ} [ Σ_{t=0}^∞ γ^t r_i(s_t, a_t) ],    (18)

which coincides with the Lagrangian L_θ(θ, λ) up to the constant Σ_{i=1}^m λ_i c_i, which does not depend on θ. Therefore, reinforcement learning algorithms such as policy gradient [32] or actor-critic methods [33] can be used to find parameters θ that maximize the Lagrangian. The good performance of these algorithms is rooted in the fact that they are able to maximize the expected cumulative reward, or at least to achieve a value that is close to the maximum.
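As an illustration of (17)–(18), the following sketch wraps a hypothetical multi-reward environment so that any standard RL algorithm maximizing a scalar return effectively maximizes the Lagrangian for a fixed multiplier λ (up to the constant Σ_i λ_i c_i). The environment interface and the class name are assumptions made for illustration only, not objects defined in the paper.

import numpy as np

class LagrangianRewardWrapper:
    """Exposes the scalar reward r_lambda(s,a) = r_0(s,a) + sum_i lambda_i r_i(s,a),
    so that a standard RL algorithm run on this wrapper maximizes the Lagrangian
    (up to the additive constant sum_i lambda_i c_i, which does not depend on theta)."""

    def __init__(self, env, lam):
        self.env = env                        # hypothetical env returning a reward vector
        self.lam = np.asarray(lam, float)     # current multipliers lambda_1, ..., lambda_m

    def reset(self):
        return self.env.reset()

    def step(self, action):
        s, r_vec, done = self.env.step(action)           # r_vec = (r_0, r_1, ..., r_m)
        r_vec = np.asarray(r_vec, float)
        r_lambda = r_vec[0] + float(self.lam @ r_vec[1:])
        return s, r_lambda, done

In practice the wrapper is rebuilt (or its lam attribute updated) every time the multipliers change, so the same RL code can be reused across dual iterations.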
The next assumption formalizes this idea.

Assumption 1. Let π_θ be a parametrization of functions in P(S) and let L_θ(θ, λ) with λ ∈ R^m_+ be the Lagrangian associated to (PII). Denote by θ⋆(λ) ∈ R^p a maximizer of L_θ(θ, λ) and by θ†(λ) ∈ R^p a local maximizer achieved by a generic reinforcement learning algorithm. Then, there exists δ > 0 such that for all λ ∈ R^m_+ it holds that L_θ(θ⋆(λ), λ) ≤ L_θ(θ†(λ), λ) + δ.

Notice that the previous assumption only means that we are able to solve the regularized, unconstrained problem approximately. In other words, the parameter at step k+1 is

θ_{k+1} ≈ argmax_{θ ∈ R^p} L_θ(θ, λ_k).    (19)

Then, the dual variable is updated following the gradient scheme suggested in (16), where we replace the subgradient of the dual function by the constraint evaluation of the primal problem (PII). Defining ∂̂d_k ≜ V(θ_{k+1}) − c, where V(θ) ≜ (V_1(θ), …, V_m(θ)) and c ≜ (c_1, …, c_m), the update yields

λ_{k+1} = [ λ_k − η ∂̂d_k ]_+ = [ λ_k − η ( V(θ_{k+1}) − c ) ]_+.    (20)

Algorithm 1 dualDescent
1: Input: step size η
2: Initialize: θ_0 = 0, λ_0 = 0
3: for k = 0, 1, 2, … do
4:   Compute an approximation θ_{k+1} ≈ argmax_θ L_θ(θ, λ_k) with an RL algorithm
5:   Update the multipliers λ_{k+1} = [ λ_k − η ( V(θ_{k+1}) − c ) ]_+
6: end for

The procedure given by (19)–(20) is summarized in Algorithm 1.
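A minimal Python sketch of Algorithm 1 follows. It assumes a hypothetical rl_solver routine that approximately maximizes the Lagrangian, e.g., policy gradient run on the wrapped reward above, and a hypothetical estimate_constraints routine returning Monte Carlo estimates of V_1(θ), …, V_m(θ); neither is prescribed by the paper.

import numpy as np

def dual_descent(rl_solver, estimate_constraints, c, eta, num_iters, m):
    """Sketch of Algorithm 1 (dualDescent).

    rl_solver(lam, theta_init) -> theta approximately maximizing the Lagrangian
        L_theta(theta, lam), e.g. by running policy gradient on the combined
        reward r_lambda of (17).
    estimate_constraints(theta) -> array [V_1(theta), ..., V_m(theta)].
    c : array of specifications c_1, ..., c_m.
    """
    lam = np.zeros(m)      # lambda_0 = 0
    theta = None           # theta_0; the solver may warm-start from it
    history = []
    for k in range(num_iters):
        # Primal step (19): approximately maximize the Lagrangian with an RL algorithm.
        theta = rl_solver(lam, theta)
        # Dual step (20): move along the constraint violation and project onto R^m_+.
        violation = estimate_constraints(theta) - np.asarray(c, float)
        lam = np.maximum(lam - eta * violation, 0.0)
        history.append((theta, lam.copy()))
    return theta, lam, history

Passing the previous θ back into rl_solver is a design choice meant to allow warm-starting the primal maximization, which is what makes running only a few RL updates per dual step practical.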
The algorithm relies on the fact that ∂̂d_k does not differ much from ∂d_θ(λ_k). We claim in the following proposition that this is the case. In particular, we establish that the constraint evaluation differs from a subgradient of the dual function by at most δ, the error in the primal maximization defined in Assumption 1.

Proposition 2. Under Assumption 1, the constraints of (PII) evaluated at a local maximizer of the Lagrangian θ†(λ) approximate a subgradient of the dual function (14). In particular, it follows that

d_θ(λ) − d_θ(λ⋆_θ) ≤ (λ − λ⋆_θ)^⊤ ( V(θ†(λ)) − c ) + δ.    (21)

The previous proposition is key to establishing convergence of the proposed algorithm, since it allows us to claim that the dual update is an approximate dual descent step. We formalize this result next and establish a maximum number of dual steps required to achieve a desired accuracy.
Theorem 3. Let π_θ be an ϵ-universal parametrization of P(S) according to Definition 1, let B_r = max_{i=1,…,m} B_{r_i} with B_{r_i} > 0 bounds on the rewards r_i, and let γ ∈ (0,1) be the discount factor. Then, if Slater's condition holds for (PII), under Assumption 1 and for any ε > 0 (a desired accuracy, not to be confused with the parametrization error ϵ), the sequence of updates of Algorithm 1 with step size η converges in K > 0 steps, with

K ≤ ‖λ_0 − λ⋆_θ‖² / (ηε),    (22)

to a neighborhood of P⋆, the solution of (PI), satisfying

P⋆ − ( B_{r_0} + ‖λ⋆_ϵ‖ B_r ) ϵ/(1−γ) ≤ d_θ(λ_K) ≤ P⋆ + ηB²/2 + δ + ε,    (23)

where B² = Σ_{i=1}^m ( B_{r_i}/(1−γ) − c_i )² and λ⋆_ϵ is as defined in Theorem 2.

The previous result establishes a bound on the number of dual iterations required to converge to a neighborhood of the optimal solution. This bound is linear in the inverse of the desired accuracy ε. Notice that the size of the neighborhood to which the dual descent algorithm converges depends on the representation ability of the chosen parametrization and on the quality of the solution of the Lagrangian maximization. Since running policy gradient or actor-critic algorithms until convergence before each dual update might be computationally prohibitive, a common alternative in optimization is to update both variables in parallel [34]. This idea can also be applied in reinforcement learning, where a policy gradient (or actor-critic, as in [7, 11]) update is followed by an update of the multipliers along the direction of the constraint violation. In these algorithms the policy update is on a faster time-scale than the multiplier update, so from a theoretical point of view they operate as Algorithm 1. In particular, the proofs in [7, 11] rely on the fact that this time-scale separation allows the multipliers to be treated as constant during the policy update.

In this section, we include a numerical example in order to showcase the consequences of our theoretical results. As an illustrative example, we consider a gridworld navigation scenario. This scenario, illustrated in Figure 1, consists of an agent attempting to navigate from a starting position to a goal. To do so, the agent must cross from the left side of the world to the right side using either one of two bridges, one of which is deemed "unsafe". The agent uses a softmax policy with four possible actions (moving up, down, left, and right) over a table-lookup of states and actions. The agent receives a reward of 10 for reaching the goal and a reward of −1 for each step it wanders outside of the goal.

Figure 1: Safe (blue) and unsafe (red) optimal path. Parametrization coarseness is illustrated on the bottom left.
Figure 2: Duality gap of the policies (duality gap versus iteration k, for exact primal maximization and for a single policy gradient step).
Figure 3: Effect of parametrization coarseness (duality gap versus iteration k for parametrization sizes d = 1, …, 5, and final duality gap versus the parametrization size d).

The scenario is designed such that the shortest path requires crossing the unsafe path (red bridge), while the safe path (blue bridge) requires a longer detour. Using our formulation, we constrain the agent to not cross the unsafe bridge with a prescribed probability. We train the agent via Algorithm 1 and plot in Fig. 2 the resulting normalized duality gap. We consider two cases: an inexact primal maximization via policy gradient and an exact primal maximization. For a given value of the dual variables λ, the optimal primal maximizer can be found exactly via Dijkstra's algorithm. We show that by solving Step 4 of Algorithm 1 exactly, the duality gap effectively vanishes (red curve). We also show a curve in which Step 4 is replaced by a single policy gradient step (blue curve). Since the maximization in Step 4 is then only approximate, the duality gap decreases at a slower rate and only converges to a neighborhood of zero (as per Theorem 3). In either case, the agent ultimately learns to navigate from start to goal by crossing the safe bridge (blue path in Fig. 1).

We now turn our attention to the effect of the parametrization size. We consider parametrizations of different coarseness via state aggregation, as shown in Fig. 1. As per Definition 1, this corresponds to parametrizations with larger values of ϵ, i.e., looser approximators. Figure 3 displays the effect of using coarser parametrizations: as the parametrization becomes coarser, the duality gap increases (as per Theorem 2). In particular, for very coarse parametrizations (such as the cyan case), the agent cannot learn a successful policy due to the poor covering properties of its parametrization, and the resulting problem has a large duality gap.

Throughout this work we have developed a duality theory for constrained reinforcement learning problems. In particular, we have established that for policies belonging to a general class of distributions the duality gap of these problems is null, and therefore solving the problem in the dual domain, which always yields a finite dimensional convex problem, gives the same result as solving the original problem directly. Moreover, this establishes the equivalence between the constrained problem and the regularized problem, or manual selection of multipliers, in the sense that both problems trace the same Pareto front.

These theoretical implications, however, do not mean that it is always possible to solve the problem. To solve the dual problem, one is required to evaluate the dual function, which might be intractable in several settings, for instance when arbitrary policies are considered. To overcome this limitation, we have shown that for sufficiently rich parametrizations the zero duality gap result holds approximately.
However, for the most part, the parametrizations considered in the literature are not necessarily universal approximators of distributions, since in general the output of the neural network reduces to the mean, and in some cases the variance, of a distribution. Regardless of these limitations, the primal-dual algorithm considered here and those proposed in [7, 11] provide a way to solve constrained policy optimization problems without the need to perform an exhaustive search over the weights assigned to each reward function, as is the case in [4, 19, 20]. Likewise, the need to impose constraints might arise directly from the algorithm design; this is for instance the case in Trust Region Policy Optimization [35], where a constraint on the divergence of the policy is included. Although our theorems do not guarantee that the zero duality gap result holds under these constraints, since they reduce to a projection onto a convex set it would not be surprising if the result could be adapted.

References

[1] Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction, MIT Press, 2018.
[2] Vivek S. Borkar, "An actor-critic algorithm for constrained Markov decision processes," Systems & Control Letters, vol. 54, no. 3, pp. 207–213, 2005.
[3] Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel, "Constrained policy optimization," in Proceedings of the 34th International Conference on Machine Learning, 2017, pp. 22–31.
[4] Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel van de Panne, "DeepMimic: Example-guided deep reinforcement learning of physics-based character skills," ACM Transactions on Graphics (TOG), vol. 37, no. 4, pp. 143, 2018.
[5] Shie Mannor and Nahum Shimkin, "A geometric approach to multi-criterion reinforcement learning," Journal of Machine Learning Research, vol. 5, pp. 325–360, 2004.
[6] Kristof Van Moffaert and Ann Nowé, "Multi-objective reinforcement learning using sets of Pareto dominating policies," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 3483–3512, 2014.
[7] Chen Tessler, Daniel J. Mankowitz, and Shie Mannor, "Reward constrained policy optimization," arXiv preprint arXiv:1805.11074, 2018.
[8] Jan Leike, Miljan Martic, Victoria Krakovna, Pedro A. Ortega, Tom Everitt, Andrew Lefrancq, Laurent Orseau, and Shane Legg, "AI safety gridworlds," arXiv preprint arXiv:1711.09883, 2017.
[9] Horia Mania, Aurelia Guy, and Benjamin Recht, "Simple random search provides a competitive approach to reinforcement learning," arXiv preprint arXiv:1803.07055, 2018.
[10] Tom Schaul, Diana Borsa, Joseph Modayil, and Razvan Pascanu, "Ray interference: a source of plateaus in deep reinforcement learning," arXiv preprint arXiv:1904.11455, 2019.
[11] Shalabh Bhatnagar and K. Lakshmanan, "An online actor-critic algorithm with function approximation for constrained Markov decision processes," Journal of Optimization Theory and Applications, vol. 153, no. 3, pp. 688–708, 2012.
[12] Kurt Hornik, Maxwell Stinchcombe, and Halbert White, "Multilayer feedforward networks are universal approximators," Neural Networks, vol. 2, no. 5, pp. 359–366, 1989.
[13] Eitan Altman, Constrained Markov Decision Processes, vol. 7, CRC Press, 1999.
[14] Iordanis Koutsopoulos and Leandros Tassiulas, "Control and optimization meet the smart power grid: Scheduling of power demands for optimal energy management," in Proceedings of the 2nd International Conference on Energy-Efficient Computing and Networking, ACM, 2011, pp. 41–50.
[15] Chen Hou and Qianchuan Zhao, "Optimization of web service-based control system for balance between network traffic and delay," IEEE Transactions on Automation Science and Engineering, vol. 15, no. 3, pp. 1152–1162, 2018.
[16] Yinlam Chow, Aviv Tamar, Shie Mannor, and Marco Pavone, "Risk-sensitive and robust decision-making: a CVaR optimization approach," in Advances in Neural Information Processing Systems, 2015, pp. 1522–1530.
[17] Shixiang Gu, Ethan Holly, Timothy Lillicrap, and Sergey Levine, "Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates," in 2017 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2017, pp. 3389–3396.
[18] Pavlo Krokhmal, Jonas Palmquist, and Stanislav Uryasev, "Portfolio optimization with conditional value-at-risk objective and constraints," Journal of Risk, vol. 4, pp. 43–68, 2002.
[19] Dotan Di Castro, Aviv Tamar, and Shie Mannor, "Policy gradients with variance related risk criteria," arXiv preprint arXiv:1206.6404, 2012.
[20] Aviv Tamar and Shie Mannor, "Variance adjusted actor critic algorithms," arXiv preprint arXiv:1310.3697, 2013.
[21] Gal Dalal, Krishnamurthy Dvijotham, Matej Vecerik, Todd Hester, Cosmin Paduraru, and Yuval Tassa, "Safe exploration in continuous action spaces," arXiv preprint arXiv:1801.08757, 2018.
[22] S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge University Press, 2004.
[23] R. T. Rockafellar, Convex Analysis, Princeton University Press, 1970.
[24] Vivek S. Borkar, "A convex analytic approach to Markov decision processes," Probability Theory and Related Fields, vol. 78, no. 4, pp. 583–602, 1988.
[25] Ken-Ichi Funahashi, "On the approximate realization of continuous mappings by neural networks," Neural Networks, vol. 2, no. 3, pp. 183–192, 1989.
[26] George Cybenko, "Approximation by superpositions of a sigmoidal function," Mathematics of Control, Signals and Systems, vol. 2, no. 4, pp. 303–314, 1989.
[27] Zhou Lu, Hongming Pu, Feicheng Wang, Zhiqiang Hu, and Liwei Wang, "The expressive power of neural networks: A view from the width," in Advances in Neural Information Processing Systems, 2017, pp. 6231–6239.
[28] Hongzhou Lin and Stefanie Jegelka, "ResNet with one-neuron hidden layers is a universal approximator," in Advances in Neural Information Processing Systems, 2018, pp. 6169–6178.
[29] Jooyoung Park and Irwin W. Sandberg, "Universal approximation using radial-basis-function networks," Neural Computation, vol. 3, no. 2, pp. 246–257, 1991.
[30] Bharath Sriperumbudur, Kenji Fukumizu, and Gert Lanckriet, "On the relation between universality, characteristic kernels and RKHS embedding of measures," in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010, pp. 773–780.
[31] Dimitri P. Bertsekas, Convex Optimization Algorithms, Athena Scientific, Belmont, 2015.
[32] Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour, "Policy gradient methods for reinforcement learning with function approximation," in Advances in Neural Information Processing Systems, 2000, pp. 1057–1063.
[33] Vijay R. Konda and John N. Tsitsiklis, "Actor-critic algorithms," in Advances in Neural Information Processing Systems, 2000, pp. 1008–1014.
[34] Kenneth J. Arrow and Leonard Hurwicz, Studies in Linear and Nonlinear Programming, Stanford University Press, CA, 1958.
[35] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz, "Trust region policy optimization," in International Conference on Machine Learning, 2015, pp. 1889–1897.
Proofs
Proof of Lemma 1.
Let us start by writing the left-hand side of (13) as

∫_{S×A} | ρ(s, a) − ρ_θ(s, a) | ds da = (1−γ) ∫_{S×A} | Σ_{t=0}^∞ γ^t ( p^t_π(s, a) − p^t_θ(s, a) ) | ds da.    (24)

Using the triangle inequality, we upper bound the previous expression as

∫_{S×A} | ρ(s, a) − ρ_θ(s, a) | ds da ≤ (1−γ) Σ_{t=0}^∞ γ^t ∫_{S×A} | p^t_π(s, a) − p^t_θ(s, a) | ds da.    (25)

To complete the proof it suffices to show that the right-hand side of the previous expression is bounded by ϵ/(1−γ). We work towards that end by first bounding the difference |p^t_π(s, a) − p^t_θ(s, a)|. This difference can be upper bounded using the triangle inequality as

| p^t_π(s, a) − p^t_θ(s, a) | ≤ p^t_π(s) | π(a|s) − π_θ(a|s) | + π_θ(a|s) | p^t_π(s) − p^t_θ(s) |.    (26)

Since π_θ is an ϵ-approximation of π, it follows from Definition 1 that

∫_{S×A} p^t_π(s) | π(a|s) − π_θ(a|s) | ds da ≤ ϵ ∫_S p^t_π(s) ds = ϵ,    (27)

where the last equality follows from the fact that p^t_π(s) is a density. We next bound the integral of the second term in (26). Using the fact that π_θ(a|s) is a density, it follows that

∫_{S×A} π_θ(a|s) | p^t_π(s) − p^t_θ(s) | ds da = ∫_S | p^t_π(s) − p^t_θ(s) | ds.    (28)

Notice that the previous difference is zero for t = 0, and for any t > 0 it can be upper bounded by

∫_S | p^t_π(s) − p^t_θ(s) | ds ≤ ∫_S ∫_{S×A} p(s | s′, a′) | p^{t−1}_π(s′, a′) − p^{t−1}_θ(s′, a′) | ds ds′ da′ = ∫_{S×A} | p^{t−1}_π(s′, a′) − p^{t−1}_θ(s′, a′) | ds′ da′.    (29)

Combining the bounds derived in (25), (27) and (29), we have that

(1−γ) Σ_{t=0}^∞ γ^t ∫_{S×A} | p^t_π(s, a) − p^t_θ(s, a) | ds da ≤ (1−γ) Σ_{t=0}^∞ γ^t ϵ + (1−γ) Σ_{t=1}^∞ γ^t ∫_{S×A} | p^{t−1}_π(s, a) − p^{t−1}_θ(s, a) | ds da.    (30)

The first term on the right-hand side of the previous expression is the geometric sum multiplied by 1−γ, hence (1−γ) Σ_{t=0}^∞ γ^t ϵ = ϵ. The second term on the right-hand side is the same as the term on the left-hand side multiplied by the discount factor γ. Thus, rearranging terms, the previous expression implies that

(1−γ) Σ_{t=0}^∞ γ^t ∫_{S×A} | p^t_π(s, a) − p^t_θ(s, a) | ds da ≤ ϵ / (1−γ).    (31)

This completes the proof of the Lemma. ∎
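For a quick numerical illustration of the bound (13), the sketch below builds a small, hypothetical finite MDP, perturbs a policy π into a policy π_θ whose per-state total variation distance is at most ϵ (as in Definition 1), and compares the resulting (truncated) occupation measures. All numbers are illustrative assumptions, not taken from the paper.

import numpy as np

rng = np.random.default_rng(1)
nS, nA, gamma, eps, T = 5, 3, 0.9, 0.05, 3000  # hypothetical MDP and tolerance

P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)   # p(s'|s,a)
p0 = np.full(nS, 1.0 / nS)
pi = rng.random((nS, nA)); pi /= pi.sum(axis=1, keepdims=True)

# Perturb pi so that sum_a |pi(a|s) - pi_theta(a|s)| <= eps for every s (Definition 1).
noise = rng.random((nS, nA)); noise /= noise.sum(axis=1, keepdims=True)
pi_theta = (1 - eps / 2) * pi + (eps / 2) * noise   # per-state TV <= (eps/2) * 2 = eps

def occupation_measure(policy):
    """rho(s,a) = (1-gamma) sum_t gamma^t p_t(s,a), truncated at horizon T."""
    p_s, rho = p0.copy(), np.zeros((nS, nA))
    for t in range(T):
        p_sa = p_s[:, None] * policy
        rho += (1 - gamma) * gamma**t * p_sa
        p_s = np.einsum('sa,san->n', p_sa, P)
    return rho

gap = np.abs(occupation_measure(pi) - occupation_measure(pi_theta)).sum()
print(gap, "<=", eps / (1 - gamma), ":", gap <= eps / (1 - gamma))  # Lemma 1 bound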
Proof of Theorem 2. Notice that the dual functions d(λ) and d_θ(λ) associated to problems (PI) and (PII) respectively are such that for every λ we have d_θ(λ) ≤ d(λ). This follows from the fact that the set of policies induced by the parametrization is contained in P(S), so the maximization defining d_θ(λ) is over a smaller set. In particular, this holds for λ⋆, the solution of the dual problem associated to (PI). Hence we have the following sequence of inequalities

D⋆ = d(λ⋆) ≥ d_θ(λ⋆) ≥ D⋆_θ,    (32)

where the last inequality follows from the fact that D⋆_θ is the minimum of (DII). The zero duality gap established in Theorem 1 completes the proof of the upper bound on D⋆_θ. We next work towards proving the lower bound on D⋆_θ. Let us write the dual function of the parametrized problem (DII) as

d_θ(λ) = d(λ) − ( max_{π ∈ P(S)} L(π, λ) − max_{θ ∈ R^p} L_θ(θ, λ) ).    (33)

Let π⋆ ≜ argmax_{π ∈ P(S)} L(π, λ) and let θ⋆ be an ϵ-approximation of π⋆. Then, by definition of the maximum, it follows that

d_θ(λ) ≥ d(λ) − ( L(π⋆, λ) − L_θ(θ⋆, λ) ).    (34)

We next work towards a bound on L(π⋆, λ) − L_θ(θ⋆, λ). To do so, notice that we can write this difference in terms of the occupation measures ρ⋆ and ρ⋆_θ associated to the policies π⋆ and π_{θ⋆}:

L(π⋆, λ) − L_θ(θ⋆, λ) = ∫_{S×A} ( r_0 + λ^⊤ r ) ( dρ⋆ − dρ⋆_θ ),    (35)

where r = (r_1, …, r_m). Since π_{θ⋆} is by definition an ϵ-approximation of π⋆, it follows from Lemma 1 that

∫_{S×A} | dρ⋆ − dρ⋆_θ | ≤ ϵ / (1−γ).    (36)

Using the bounds on the reward functions, we can upper bound the difference L(π⋆, λ) − L_θ(θ⋆, λ) by

L(π⋆, λ) − L_θ(θ⋆, λ) ≤ ( B_{r_0} + ‖λ‖ B_r ) ϵ/(1−γ).    (37)

Combining the previous bound with (34), we can lower bound d_θ(λ) as

d_θ(λ) ≥ d(λ) − ( B_{r_0} + ‖λ‖ B_r ) ϵ/(1−γ).    (38)

Let us next define d_ϵ(λ) = d(λ) − (B_r ϵ/(1−γ)) ‖λ‖, and notice that d_ϵ(λ) is in fact the dual function associated to Problem (PI′) with ξ_i = B_r ϵ/(1−γ) for all i = 1, …, m. With this definition, (38) reduces to

d_θ(λ) ≥ d_ϵ(λ) − B_{r_0} ϵ/(1−γ).    (39)

Since the previous expression holds for every λ, in particular it holds for λ⋆_θ, the dual solution of the parametrized problem (DII). Thus, we have that

D⋆_θ ≥ d_ϵ(λ⋆_θ) − B_{r_0} ϵ/(1−γ).    (40)

Recall that λ⋆_ϵ = argmin d_ϵ(λ), so that d_ϵ(λ⋆_θ) ≥ d_ϵ(λ⋆_ϵ), and use the definition of the dual function to lower bound D⋆_θ by

D⋆_θ ≥ max_{π ∈ P(S)} V_0(π) + Σ_{i=1}^m λ⋆_{ϵ,i} ( V_i(π) − c_i − B_r ϵ/(1−γ) ) − B_{r_0} ϵ/(1−γ).    (41)

By definition of the maximum, we can lower bound the previous expression by substituting any π ∈ P(S). In particular, we select π⋆, the solution of (PI):

D⋆_θ ≥ V_0(π⋆) + Σ_{i=1}^m λ⋆_{ϵ,i} ( V_i(π⋆) − c_i − B_r ϵ/(1−γ) ) − B_{r_0} ϵ/(1−γ).    (42)
Since π⋆ is the optimal solution of (PI), it follows that V_i(π⋆) − c_i ≥ 0, and since λ⋆_{ϵ,i} ≥ 0, the previous expression reduces to

D⋆_θ ≥ V_0(π⋆) − ( B_{r_0} + B_r ‖λ⋆_ϵ‖ ) ϵ/(1−γ) = P⋆ − ( B_{r_0} + B_r ‖λ⋆_ϵ‖ ) ϵ/(1−γ),    (43)

which completes the proof of the result. ∎

Proof of Proposition 2. Let λ⋆_θ be the solution of the parametrized dual problem (DII). Then we can write the difference between the dual function evaluated at an arbitrary λ ∈ R^m_+ and at λ⋆_θ as

d_θ(λ) − d_θ(λ⋆_θ) = max_θ L_θ(θ, λ) − max_θ L_θ(θ, λ⋆_θ) ≤ L_θ(θ⋆(λ), λ) − L_θ(θ†(λ), λ⋆_θ).    (44)

It follows from Assumption 1 that there exists δ > 0 such that L_θ(θ⋆(λ), λ) ≤ L_θ(θ†(λ), λ) + δ, thus we can upper bound the right-hand side of the previous inequality by

L_θ(θ⋆(λ), λ) − L_θ(θ†(λ), λ⋆_θ) ≤ L_θ(θ†(λ), λ) + δ − L_θ(θ†(λ), λ⋆_θ) = (λ − λ⋆_θ)^⊤ ( V(θ†(λ)) − c ) + δ.    (45)

Combining the two upper bounds completes the proof of the proposition. ∎

Proof of Theorem 3.
We start by showing the lower bound, which in fact holds for any λ. For any λ, by definition of the dual problem, d_θ(λ) ≥ D⋆_θ. Combining this bound with the result of Theorem 2, it follows that

d_θ(λ) ≥ P⋆ − ( B_{r_0} + ‖λ⋆_ϵ‖ B_r ) ϵ/(1−γ).    (46)

To show the upper bound, we start by writing the distance between the dual multiplier at step k+1 and the solution of (DII) in terms of the iterate at step k. Since λ⋆_θ ∈ R^m_+ and using the non-expansiveness of the projection, it follows that

‖λ_{k+1} − λ⋆_θ‖² ≤ ‖λ_k − η ( V(θ†(λ_k)) − c ) − λ⋆_θ‖².    (47)

Expanding the square and using that B² = Σ_{i=1}^m ( B_{r_i}/(1−γ) − c_i )² is a bound on the squared norm of V(θ) − c, it follows that

‖λ_{k+1} − λ⋆_θ‖² ≤ ‖λ_k − λ⋆_θ‖² − 2η (λ_k − λ⋆_θ)^⊤ ( V(θ†(λ_k)) − c ) + η²B².    (48)

Using the result of Proposition 2, we can further upper bound the inner product in the previous expression by the difference of the dual function evaluated at λ_k and λ⋆_θ plus δ, the error in the primal maximization:

‖λ_{k+1} − λ⋆_θ‖² ≤ ‖λ_k − λ⋆_θ‖² + 2η ( δ + d_θ(λ⋆_θ) − d_θ(λ_k) ) + η²B².    (49)

Defining α_k = 2( δ + d_θ(λ⋆_θ) − d_θ(λ_k) ) + ηB² and unrolling the previous expression recursively yields

‖λ_{k+1} − λ⋆_θ‖² ≤ ‖λ_0 − λ⋆_θ‖² + η Σ_{j=0}^k α_j.    (50)

Since d_θ(λ⋆_θ) is the minimum of the dual function, the difference d_θ(λ⋆_θ) − d_θ(λ_k) is always nonpositive. Thus, when λ_k is not close to the solution of the dual problem, α_k is negative, which by virtue of (50) implies that the distance between λ_k and λ⋆_θ decreases. Formally, for any ε > 0, when α_j > −ε we have that

d_θ(λ_j) − d_θ(λ⋆_θ) ≤ ηB²/2 + δ + ε/2 ≤ ηB²/2 + δ + ε.    (51)

Using the result of Theorem 2, we can upper bound D⋆_θ = d_θ(λ⋆_θ) by P⋆, which establishes the neighborhood defined in (23). We are left to show that the number of iterations required to do so is bounded by

K ≤ ‖λ_0 − λ⋆_θ‖² / (ηε).    (52)

To do so, let K > 0 be the first iterate in the neighborhood (23), i.e., K = min{ j ∈ ℕ | α_j > −ε }. Then it follows from the recursion (50) that

‖λ_K − λ⋆_θ‖² ≤ ‖λ_0 − λ⋆_θ‖² − Kηε.    (53)

Since ‖λ_K − λ⋆_θ‖² is nonnegative, the previous expression reduces to Kηε ≤ ‖λ_0 − λ⋆_θ‖², which yields (52) and completes the proof. ∎