Escaping from Zero Gradient: Revisiting Action-Constrained Reinforcement Learning via Frank-Wolfe Policy Optimization
Jyun-Li Lin, Wei Hung, Shang-Hsuan Yang, Ping-Chun Hsieh, Xi Liu
Department of Computer Science, National Chiao Tung University, Hsinchu, Taiwan
Applied Machine Learning, Facebook AI, Menlo Park, CA, USA
* Equal Contribution
Abstract
Action-constrained reinforcement learning (RL) is a widely-used approach in various real-world applications, such as scheduling in networked systems with resource constraints and control of a robot with kinematic constraints. While the existing projection-based approaches ensure zero constraint violation, they could suffer from the zero-gradient problem due to the tight coupling of the policy gradient and the projection, which results in sample-inefficient training and slow convergence. To tackle this issue, we propose a learning algorithm that decouples the action constraints from the policy parameter update by leveraging state-wise Frank-Wolfe and a regression-based policy update scheme. Moreover, we show that the proposed algorithm enjoys convergence and policy improvement properties in the tabular case as well as generalizes the popular DDPG algorithm for action-constrained RL in the general case. Through experiments, we demonstrate that the proposed algorithm significantly outperforms the benchmark methods on a variety of control tasks.
Action-constrained reinforcement learning (RL) is a popular approach for sequential decision making in real-world systems. One classic example is maximizing the network-wide utility by optimally allocating the network resource under capacity constraints (Xu et al., 2018; Gu et al., 2019; Zhang et al., 2020). Another example is robot control under kinematic constraints (Pham et al., 2018; Gu et al., 2017; Jaillet and Porta, 2012; Tsounis et al., 2020), which capture the limitations of the physical components of a robot (e.g., in terms of velocity, torque, or output power). In these examples, the constraints essentially characterize the set of feasible actions at each state. To ensure the safe and normal operation of these real-world systems, it is required that these action constraints are satisfied throughout the evaluation as well as the training processes (Chow et al., 2018; Liu et al., 2020a; Gu et al., 2017). Therefore, in action-constrained RL, an effective training algorithm is required to achieve the following two tasks simultaneously: (i) iteratively improving the policy and (ii) ensuring zero constraint violation at each training step.

To enable RL with action constraints, one popular generic approach is to include an additional differentiable projection layer at the output of the policy network and follow the standard end-to-end policy gradient approach (Pham et al., 2018; Dalal et al., 2018; Bhatia et al., 2019). While being a general-purpose solution, this projection layer could result in the zero-gradient issue during training due to the tight coupling of the policy gradient update and the projection layer. Specifically, zero gradient occurs when the original output of the policy network falls outside of the feasible action set and any small perturbation of the policy parameters does not lead to any change in the final output action due to the projection mechanism. To better understand the zero-gradient issue, let us consider a toy example of a policy network with one hidden layer and a linear output layer used to produce a deterministic scalar action. Suppose the actions are required to be non-negative. To satisfy the non-negativity action constraint, an additional projection layer, which in this case is equivalent to a Rectified Linear Unit (ReLU), is added to the output of the policy. It can be seen that the policy network can easily suffer from zero gradient due to the clipping effect of ReLU (Maas et al., 2013); a short numerical sketch of this effect is given below. If the zero-gradient issue occurs in a large portion of the state space, the training process could be sample-inefficient as most of the samples are wasted, and therefore the convergence speed could be slow. Notably, the zero-gradient issue can be particularly severe in the early training phase since the pre-projection actions produced by the policy network are likely to be far away from the feasible sets.

The fundamental cause of the zero-gradient issue is the tight coupling of the policy parameter update and the projection layer under the standard policy gradient framework. Specifically, in the end-to-end policy-gradient-based training process, the update of the policy parameters relies on the gradient of the actual policy output with respect to the policy parameters and thereby involves the gradient of this additional projection layer. To escape from the zero-gradient issue, we take a different approach and propose a learning algorithm that decouples the parameter update for policy improvement from constraint satisfaction, without using the policy gradient theorem.
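To make the clipping effect concrete, the following minimal PyTorch sketch (not part of the original paper) builds a tiny one-hidden-layer policy, shifts its output so that the pre-projection action is negative, and shows that backpropagating through the ReLU-style projection produces an exactly zero gradient for every policy parameter. The network sizes and the shift are arbitrary illustrative choices.

```python
import torch

torch.manual_seed(0)

# Tiny deterministic policy: one hidden layer, scalar action.
policy = torch.nn.Sequential(
    torch.nn.Linear(3, 8),
    torch.nn.Tanh(),
    torch.nn.Linear(8, 1),
)

state = torch.randn(3)
# Shift the output so the pre-projection action is negative (for illustration).
pre_projection = policy(state) - 5.0
# Projection onto the feasible set {a >= 0} is exactly a ReLU.
post_projection = torch.relu(pre_projection)

# Backpropagate through the projected action, as an end-to-end
# projection-based method would do.
post_projection.sum().backward()

grad_norm = sum(p.grad.abs().sum() for p in policy.parameters())
print("pre-projection action:", pre_projection.item())
print("total |grad| through the projection:", grad_norm.item())  # exactly 0.0
```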
The proposed algorithm can be highlighted as follows:
• To accommodate the action constraints, we leverage the Frank-Wolfe method (Frank et al., 1956) to search for feasible action update directions directly within the feasible action sets in a state-wise manner. Through this procedure, for a collection of states, we obtain the reference actions that are used to guide the update of the policy parameters for improving the current policy.
• To update the parameters of the policy network, we propose to construct a loss function (e.g., mean squared error) that enables the policy network to adjust its outputs toward the reference actions. This update scheme can be viewed as solving a regression problem based on the reference actions by taking one-step gradient descent. In this way, the parameter update is completely decoupled from the action constraints.
Since the proposed framework obviates the need for the gradient of a projection layer, it avoids the zero-gradient issue by nature.

Our Contributions.
In this paper, we revisit the action-constrained RL problem and propose a novel learning framework that avoids the zero-gradient issue and achieves zero constraint violation simultaneously:
• To better describe the proposed learning framework, we start from the case of finite state spaces and introduce Frank-Wolfe policy optimization (FWPO) with tabular policy parameterization, which can be viewed as an instance of generalized policy iteration. By directly searching for update directions within the feasible sets via state-wise Frank-Wolfe, FWPO automatically achieves zero constraint violation and does not require any additional projection. Moreover, we establish the convergence of FWPO as well as its policy improvement property.
• Built on FWPO, we propose NFWPO by extending the idea of FWPO to general neural policies via a regression argument. By constructing a loss function and leveraging state-wise Frank-Wolfe, we decouple the policy parameter update from the action constraints. This design automatically prevents the zero-gradient issue. Moreover, we show that the vanilla DDPG is a special case of NFWPO if there are no action constraints.
• Through experiments on various real applications, we empirically show the zero-gradient problem and demonstrate that the proposed algorithms significantly outperform the popular benchmark methods for action-constrained RL.
We consider an infinite-horizon discounted Markov decision process (MDP) defined by a tuple $(\mathcal{S}, \mathcal{A}, p, r, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ denotes the action space, $p$ is the state transition probability, $r$ is the reward function, and $\gamma \in (0,1)$ denotes the discount factor. We assume that the action space $\mathcal{A} \subseteq \mathbb{R}^N$ is continuous and the reward function takes values in $[0,1]$ for all state-action pairs. At each time step $t = 0, 1, \cdots$, the learner observes state $s_t$, takes an action $a_t$, and receives an immediate reward $r_t$. In this paper, we consider action-constrained MDPs where for each state $s \in \mathcal{S}$ there is a feasible action set $\mathcal{C}(s) \subseteq \mathcal{A}$ determined by the underlying collection of constraints. We assume that $\mathcal{C}(s)$ is compact and convex. In this paper, we focus on deterministic policies and use $\pi(\cdot\,;\theta): \mathcal{S} \to \mathcal{A}$ to denote a deterministic parametric policy with parameter vector $\theta \in \mathbb{R}^n$. Under a policy $\pi$, the value functions are defined as the expected long-term rewards

$$V(s;\pi) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\,\Big|\, s_0 = s, \pi\Big], \quad (1)$$
$$Q(s,a;\pi) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\,\Big|\, s_0 = s, a_0 = a, \pi\Big]. \quad (2)$$

To make a comparison between policies, for any two policies $\pi$ and $\pi'$, we say that $\pi \geq \pi'$ if $V(s;\pi) \geq V(s;\pi')$ for all $s \in \mathcal{S}$. This essentially constructs a partial ordering among policies. To construct a total ordering of all the policies, consider the performance objective defined as a weighted average of the value function

$$J_\mu(\pi) := \mathbb{E}_{s \sim \mu}\big[V(s;\pi)\big], \quad (3)$$

where $\mu$ is called the restarting state distribution (Kakade and Langford, 2002). Note that one common choice of $\mu$ is the initial state distribution. It is also convenient to define the discounted state visitation distribution $d^\pi_\mu$ as $d^\pi_\mu(s) := (1-\gamma)\,\mathbb{E}_{s_0 \sim \mu}\big[\sum_{t=0}^{\infty} \gamma^t P(s_t = s\,|\,s_0, \pi)\big]$, for each $s \in \mathcal{S}$.

To optimize the objective $J_\mu(\pi)$, the typical approach is to apply gradient ascent based on the policy gradient. Under the standard regularity conditions, the deterministic policy gradient (Silver et al., 2014) can be written as

$$\nabla_\theta J_\mu(\pi(\cdot\,;\theta)) = \mathbb{E}_{s \sim d^\pi_\mu}\Big[\nabla_\theta \pi(s;\theta)\, \nabla_a Q(s,a;\pi(\cdot\,;\theta))\big|_{a = \pi(s;\theta)}\Big]. \quad (4)$$

As a practical implementation of the deterministic policy gradient approach, DDPG (Lillicrap et al., 2016) extends deep Q-learning (Mnih et al., 2015) to continuous action spaces in an actor-critic manner. Specifically, DDPG updates the policy parameter $\theta$ by applying stochastic gradient ascent according to (4) and obtains an approximated Q-function $Q(s,a;\phi)$ parameterized by $\phi$ by using a Q-learning-like critic, which updates $\phi$ by minimizing the loss $\mathbb{E}_{(s,a,s',r) \sim \rho}\big[\big(r + \gamma Q(s', \pi(s';\theta^-);\phi^-) - Q(s,a;\phi)\big)^2\big]$, where $\rho$ denotes the sampling distribution of the replay buffer, and $\theta^-$ and $\phi^-$ are the parameters of the actor and critic target networks, respectively.
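As a reference point for the rest of the paper, the following PyTorch sketch shows one DDPG actor-critic update in the unconstrained setting. The network sizes, learning rates, and batch handling are simplifying assumptions, not the configuration used in the experiments, and the target-network soft update is omitted for brevity.

```python
import copy
import torch

# Illustrative sizes and hyper-parameters (assumptions, not the paper's setup).
state_dim, action_dim, gamma = 8, 2, 0.99
actor = torch.nn.Sequential(torch.nn.Linear(state_dim, 64), torch.nn.ReLU(),
                            torch.nn.Linear(64, action_dim))
critic = torch.nn.Sequential(torch.nn.Linear(state_dim + action_dim, 64),
                             torch.nn.ReLU(), torch.nn.Linear(64, 1))
actor_target, critic_target = copy.deepcopy(actor), copy.deepcopy(critic)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(s, a, r, s_next):
    """One DDPG step on a mini-batch of transitions; r has shape (batch, 1)."""
    # Critic: minimize the TD error against the target networks.
    with torch.no_grad():
        a_next = actor_target(s_next)
        y = r + gamma * critic_target(torch.cat([s_next, a_next], dim=1))
    critic_loss = ((critic(torch.cat([s, a], dim=1)) - y) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    # Actor: ascend the sample-based deterministic policy gradient (4),
    # i.e., maximize Q(s, pi(s; theta)) end-to-end through the critic.
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    # (Target-network soft updates are omitted in this sketch.)
```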
In this section, we provide an overview of the Frank-Wolfe algorithms. Consider an optimization problem of the form

$$\max_{x \in \mathcal{X}} F(x), \quad (5)$$

where $F(\cdot): \mathbb{R}^d \to \mathbb{R}$ is a differentiable function with a Lipschitz continuous gradient and $\mathcal{X} \subseteq \mathbb{R}^d$ is the feasible set characterized by the underlying constraints on $x$. One popular approach is to apply the projected gradient ascent method (Bubeck et al., 2015), which combines standard gradient ascent with a projection step. By contrast, as a projection-free method, the classic Frank-Wolfe algorithm (Frank et al., 1956) and its variants solve the constrained optimization problem in (5) by leveraging a first-order subproblem. We briefly summarize the Frank-Wolfe algorithm for non-convex objective functions in the batch setting as follows (Lacoste-Julien, 2016; Reddi et al., 2016):
• Initialization. Let $x_k$ denote the iterate at the $k$-th iteration and choose an arbitrary $x_0 \in \mathcal{X}$ to be the initial point.
• Search for an update direction within the feasible set.
In the $k$-th iteration, compute $v_k = \arg\max_{v \in \mathcal{X}} \langle v, \nabla_x F(x)|_{x = x_k} \rangle$ and update the iterate as $x_{k+1} = x_k + \beta_k (v_k - x_k)$, where $v_k - x_k$ is the update direction and $\beta_k$ denotes the learning rate.

For unconstrained optimization problems, the convergence properties are typically analyzed in terms of the gradient norm $\|\nabla_x F(x)\|$. By contrast, for constrained maximization problems, one widely-used metric of convergence in the Frank-Wolfe literature is the Frank-Wolfe gap defined as $G(x) := \max_{z \in \mathcal{X}} \langle z - x, \nabla_x F(x) \rangle$. It is easy to verify that $G(x) = 0$ is a necessary and sufficient condition for $x$ to be a stationary point. (In the literature, the Frank-Wolfe gap is typically defined as $\max_{z \in \mathcal{X}} \langle z - x, -\nabla_x F(x) \rangle$ since the goal is to minimize an objective function. By contrast, as the goal of RL is to optimize the policy in terms of rewards, we consider the maximization problem in the form of (5) and make the required changes accordingly.)
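The following NumPy sketch instantiates the iteration above for a simple box-constrained feasible set, where the linear subproblem has a closed form. The objective, the box bounds, and the step-size schedule are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def frank_wolfe_max(grad_F, x0, lo, hi, num_iters=100):
    """Frank-Wolfe for max_{lo <= x <= hi} F(x) over a box feasible set.

    grad_F: callable returning the gradient of F at x.
    For a box, the linear subproblem argmax_{v in X} <v, g> is solved
    coordinate-wise: pick hi[i] when g[i] > 0 and lo[i] otherwise.
    """
    x = np.asarray(x0, dtype=float)
    for k in range(num_iters):
        g = grad_F(x)
        v = np.where(g > 0, hi, lo)          # linear maximization oracle
        gap = np.dot(v - x, g)               # Frank-Wolfe gap at x_k
        beta = 2.0 / (k + 2)                 # a standard diminishing step size
        x = x + beta * (v - x)               # convex combination stays feasible
        if gap < 1e-8:
            break
    return x

# Example: maximize F(x) = -||x - c||^2 over the box [0, 1]^3.
c = np.array([0.3, 1.5, -0.2])
x_star = frank_wolfe_max(lambda x: -2.0 * (x - c), np.zeros(3),
                         lo=np.zeros(3), hi=np.ones(3))
print(x_star)   # approaches the clipped point [0.3, 1.0, 0.0]
```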
Notations. We use the standard notations $\|\cdot\|_p$ and $\|\cdot\|_F$ to denote the $L_p$-norm of a vector and the Frobenius norm of a matrix, respectively. We use $\langle \cdot, \cdot \rangle$ to denote the inner product of two real vectors. For a set $\mathcal{D}$, we define the diameter of $\mathcal{D}$ as $\mathrm{diam}_{\|\cdot\|}(\mathcal{D}) := \sup_{x_1, x_2 \in \mathcal{D}} \|x_1 - x_2\|$. We use $\mathrm{dom}\, f$ to denote the domain of a function $f$.

In this section, we formally present the proposed learning algorithms for action-constrained RL. To better describe the proposed learning framework, we start from a stylized setting with tabular policy parameterization for finite state spaces and extend the idea to develop a more practical algorithm for general parametric policies.

For ease of exposition, we first illustrate the proposed algorithm for the case of finite state spaces and tabular policies with direct parameterization, i.e., $\pi(s;\theta) \equiv \theta(s)$, for all $s \in \mathcal{S}$. We consider the performance objective $J_\mu(\pi)$ with some restarting state distribution $\mu$ with $\mu(s) > 0$ for all $s \in \mathcal{S}$, and define $\mu_{\min} := \min_{s \in \mathcal{S}} \mu(s)$. For ease of notation, we also define $D_s := \mathrm{diam}_{\|\cdot\|}(\mathcal{C}(s))$ for each $s$ and $D_{\max} := \max_{s \in \mathcal{S}} D_s$.

Now we present the proposed FWPO algorithm. We use $\theta_k$ to denote the policy parameters in the $k$-th iteration and choose feasible initial policy parameters $\theta_0$ which satisfy $\theta_0(s) \in \mathcal{C}(s)$, for all $s \in \mathcal{S}$. FWPO adopts the generalized policy iteration framework (Sutton and Barto, 2018) by alternating between two subroutines in each iteration:
• Policy update via state-wise Frank-Wolfe.
FWPO updates the policy by finding a feasible update direction for each state $s \in \mathcal{S}$ via Frank-Wolfe as

$$c_k(s) = \arg\max_{c \in \mathcal{C}(s)} \big\langle c, \nabla_a Q(s,a;\pi(\cdot\,;\theta_k))\big|_{a = \theta_k(s)} \big\rangle, \quad (6)$$
$$\theta_{k+1}(s) = \theta_k(s) + \alpha_k(s)\big(c_k(s) - \theta_k(s)\big), \quad (7)$$

where $c_k(s) - \theta_k(s)$ is the update direction and $\alpha_k(s)$ denotes the (state-dependent) learning rate. Moreover, it is natural to define the state-wise Frank-Wolfe gap of the Q-function at $\theta_k$ as

$$g_k(s) := \big\langle c_k(s) - \theta_k(s), \nabla_a Q(s,a;\pi(\cdot\,;\theta_k))\big|_{a = \theta_k(s)} \big\rangle. \quad (8)$$

It is easy to verify that $g_k(s) \geq 0$, for all $k \in \mathbb{N}$ and for all $s \in \mathcal{S}$. As will be shown momentarily, to ensure convergence, the learning rate is configured to be $\alpha_k(s) = \frac{(1-\gamma)\mu_{\min}}{L D_s^2}\, g_k(s)$ (a minimal code sketch of this state-wise update is given after this list).
• Evaluation of the current policy. FWPO then evaluates the updated policy and obtains the Q-function (or an approximated version) for the next iteration. This can be done by a standard policy evaluation approach.
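A minimal sketch of one FWPO update (6)-(7) at a single state is given below, assuming (purely for illustration) that $\mathcal{C}(s)$ is described by linear constraints $A_s a \leq b_s$ so that the linear subproblem can be handed to an off-the-shelf LP solver. The helper names and constants are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def fwpo_state_update(theta_s, grad_q_s, A_s, b_s, L, D_s, gamma, mu_min):
    """One FWPO update (6)-(7) at a single state s.

    theta_s : current tabular action theta_k(s), assumed feasible.
    grad_q_s: gradient of Q(s, a) with respect to a, evaluated at a = theta_s.
    A_s, b_s: C(s) = {a : A_s @ a <= b_s}, an illustrative linear description.
    """
    # (6): linear maximization over C(s); linprog minimizes, so negate the objective.
    res = linprog(c=-grad_q_s, A_ub=A_s, b_ub=b_s,
                  bounds=[(None, None)] * len(theta_s))
    c_k = res.x
    # (8): the state-wise Frank-Wolfe gap (non-negative since theta_s is feasible).
    g_k = float(np.dot(c_k - theta_s, grad_q_s))
    # State-dependent learning rate from Proposition 2, clipped to [0, 1] so the
    # convex combination below always stays inside C(s).
    alpha = min((1.0 - gamma) * mu_min / (L * D_s ** 2) * g_k, 1.0)
    # (7): move toward the Frank-Wolfe vertex.
    return theta_s + alpha * (c_k - theta_s)
```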
The above scheme of FWPO is detailed in Algorithm 1. As suggested by Algorithm 1, FWPO always searches for an update direction within the feasible action sets. Therefore, FWPO automatically achieves zero constraint violation and does not require any additional projection by nature.

Algorithm 1 Frank-Wolfe Policy Optimization (FWPO)
Input: Initial policy parameters $\theta_0$ that satisfy $\theta_0(s) \in \mathcal{C}(s)$ for all $s \in \mathcal{S}$
for each iteration $k = 0, 1, \cdots$ do
  Evaluate $\pi(\cdot\,;\theta_k)$ and obtain $Q(s,a;\pi(\cdot\,;\theta_k))$
  for each state $s \in \mathcal{S}$ do
    Compute the Frank-Wolfe update direction by
      $c_k(s) = \arg\max_{c \in \mathcal{C}(s)} \langle c, \nabla_a Q(s,a;\pi(\cdot\,;\theta_k))|_{a = \theta_k(s)} \rangle$
      $g_k(s) = \langle c_k(s) - \theta_k(s), \nabla_a Q(s,a;\pi(\cdot\,;\theta_k))|_{a = \theta_k(s)} \rangle$
      $\alpha_k(s) = \frac{(1-\gamma)\mu_{\min}}{L D_s^2}\, g_k(s)$
      $\theta_{k+1}(s) = \theta_k(s) + \alpha_k(s)\big(c_k(s) - \theta_k(s)\big)$
  end for
end for

Remark 1 One salient feature of FWPO is that the policy update in (6)-(7) is done by searching for feasible update directions based on $\nabla_a Q(s,a;\pi)$ on a per-state basis with state-dependent learning rates, instead of using the standard policy gradient of the performance objective $J_\mu(\pi)$. As will be seen in Section 3.2, this design plays a critical role in decoupling the policy parameter update from constraint satisfaction. Another advantage of FWPO is that it is agnostic to the discounted state visitation distribution $d^\pi_\mu$ (cf. the deterministic policy gradient in (4)) due to its state-wise nature. This feature allows FWPO to be directly applicable in the off-policy settings in its original form. (In the off-policy settings, the deep policy gradient approaches typically require dropping a term in the policy gradient expression to accommodate the behavior policy (Silver et al., 2014).)
Remark 2 As the policy update under FWPO is done on a state-by-state basis instead of directly on $J_\mu(\pi)$, the convergence guarantees of the standard Frank-Wolfe methods do not directly apply to the objective $J_\mu(\pi)$ under FWPO. From this perspective, FWPO is not a trivial combination of the Frank-Wolfe methods and policy iteration.

As suggested by Remark 2, we proceed to establish the convergence result of FWPO. For the convergence analysis, based on the state-wise Frank-Wolfe gaps defined in (8), we define the effective Frank-Wolfe gap of $J_\mu(\pi(\cdot\,;\theta))$ at $\theta_k$ as

$$G_k := \Big( \sum_{s \in \mathcal{S}} g_k(s)^2 \Big)^{1/2}. \quad (9)$$
Note that $G_k = 0$ if and only if the update direction is zero for all the states, i.e., $c_k(s) - \theta_k(s) = 0$. Hence, $G_k$ indicates whether $J_\mu(\pi(\cdot\,;\theta))$ has converged to a stationary point. We also define $\bar{G}_T := \min_{0 \leq k \leq T} G_k$. To establish the convergence results, we also assume mild regularity conditions on $r$ and $p$ as follows.

Definition 1 A differentiable function $f: \mathrm{dom}\, f \to \mathbb{R}$ is said to be $L$-smooth if there exists $L \geq 0$ such that for any $x, y \in \mathrm{dom}\, f$, $\|\nabla f(x) - \nabla f(y)\| \leq L\|x - y\|$.

Regularity Assumptions:
(A1) The reward function $r(s,a)$ is differentiable and is $L_r$-smooth in $a$, for all $s, a$.
(A2) The transition probability $p(s'|s,a)$ is twice differentiable and $L_p$-smooth in $a$, for all $s, s', a$. Moreover, $p(s'|s,a)$ satisfies $\sup_{s,a,s'} \|\nabla_a p(s'|s,a)\| < C_p$.

As the first step, we introduce the following proposition on the smoothness of the performance objective $J_\mu(\pi(\cdot\,;\theta))$. Notably, given the regularity assumptions on $r$ and $p$ in the action, it remains non-trivial to establish the smoothness of $J_\mu(\pi(\cdot\,;\theta))$ in $\theta$ due to the multi-step compound effect of the changes in policy parameters on the value functions.

Proposition 1 Under the regularity assumptions (A1)-(A2), there exists some constant $L > 0$ such that for any restarting state distribution $\mu$, $J_\mu(\pi(\cdot\,;\theta))$ is $L$-smooth in $\theta$.

The proof of Proposition 1 is provided in Appendix A.1. Now we are ready to present the convergence result.
Proposition 2 Under the FWPO algorithm with $\alpha_k(s) = \frac{(1-\gamma)\mu_{\min}}{L D_s^2}\, g_k(s)$, $\{\pi(\cdot\,;\theta_k)\}$ form a non-decreasing sequence of policies in the sense that $\pi(\cdot\,;\theta_{k+1}) \geq \pi(\cdot\,;\theta_k)$, for all $k$. Moreover, the effective Frank-Wolfe gap of FWPO converges to zero as $k \to \infty$, and the convergence rate can be quantified as

$$\sum_{k=0}^{\infty} G_k^2 \leq \frac{2 L D_{\max}^2}{(1-\gamma)^3 \mu_{\min}^2}, \quad (10)$$

which implies that $\bar{G}_T = O(T^{-1/2})$.

Proof Due to the space limitation, we provide a sketch of the proof: (i) To show the non-decreasing property, we leverage the policy difference lemma (Kakade and Langford, 2002) and verify a sufficient condition of strict policy improvement; (ii) To show the convergence result, we leverage the smoothness of the value functions as well as the objective and use the technique for convergence of non-convex optimization similar to that in (Reddi et al., 2016; Lacoste-Julien, 2016); (iii) A proper learning rate can be selected by taking the smoothness conditions as well as the restarting state distribution into account. For completeness, the detailed proof is provided in Appendix A.2. □
Remark 3 The style of the convergence guarantee in Proposition 2 is common in the analysis of gradient descent methods for non-convex smooth functions (Bottou et al., 2018). Moreover, the result (i.e., convergence to a stationary point) in Proposition 2 resembles those of the policy gradient algorithms (Sutton, 2000; Silver et al., 2014), but for the action-constrained RL settings. On the other hand, in (6), the search for the update direction requires the gradient of the Q-function. In practice, it may not be feasible to obtain the whole true Q-function, and a value function approximator can be included. In practice, it can be expected that a sufficiently accurate critic shall provide a sufficiently good update direction.

In this section, we formally present the proposed NFWPO algorithm for general parametric policies for action-constrained RL. As highlighted in Section 1, we propose to decouple constraint satisfaction from the policy parameter update. Specifically, to accommodate the action constraints, we extend the state-wise Frank-Wolfe subroutine to the general parametric policies. One inherent challenge of such an extension is that the Frank-Wolfe method searches for an update direction within the feasible set by nature. However, under neural parameterization, an action produced by the neural network is not guaranteed to stay in the feasible action set. To address this, we propose to incorporate a projection step into the state-wise Frank-Wolfe subroutine. Define a projection operator as

$$\Pi_{\mathcal{C}(s)}(z) = \arg\min_{y \in \mathcal{C}(s)} \|y - z\|. \quad (11)$$

For ease of exposition, in the sequel we call the input $z$ a pre-projection action and $\Pi_{\mathcal{C}(s)}(z)$ a post-projection action (a small sketch of computing this projection for linear constraints is given below).
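As a concrete example, the projection in (11) for a polyhedral feasible set is a small quadratic program. The sketch below uses cvxpy as a stand-in solver (the experiments in Section 4 use Gurobi instead), and the constraint data are illustrative.

```python
import cvxpy as cp
import numpy as np

def project_onto_feasible_set(z, A_s, b_s):
    """Euclidean projection Pi_{C(s)}(z) in (11) for a polyhedral feasible set
    C(s) = {a : A_s @ a <= b_s} (a modeling assumption for this sketch)."""
    y = cp.Variable(len(z))
    problem = cp.Problem(cp.Minimize(cp.sum_squares(y - z)), [A_s @ y <= b_s])
    problem.solve()
    return y.value

# Toy example: project onto {a : a_1 + a_2 <= 1, a >= 0}.
A = np.array([[1.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
b = np.array([1.0, 0.0, 0.0])
print(project_onto_feasible_set(np.array([0.9, 0.8]), A, b))  # approx. [0.55, 0.45]
```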
NFWPO adopts the actor-critic architecture. Let $\bar{\theta}$ and $\bar{\phi}$ be the current parameters of the actor and the critic, respectively. The main features of NFWPO are captured by the actor part as follows.
• Derive reference actions via state-wise Frank-Wolfe. For each $s$ in the mini-batch $\mathcal{B}$, NFWPO uses Frank-Wolfe to compute the reference action at state $s$ as

$$\tilde{a}_s = \Pi_{\mathcal{C}(s)}(\pi(s;\bar{\theta})) + \alpha\big(\bar{c}(s) - \Pi_{\mathcal{C}(s)}(\pi(s;\bar{\theta}))\big), \quad (12)$$

where $\alpha$ is the learning rate of Frank-Wolfe and

$$\bar{c}(s) = \arg\max_{c \in \mathcal{C}(s)} \big\langle c, \nabla_a Q(s,a;\bar{\phi})\big|_{a = \Pi_{\mathcal{C}(s)}(\pi(s;\bar{\theta}))} \big\rangle. \quad (13)$$

(Note that the projection $\Pi_{\mathcal{C}(s)}(\cdot)$ is only for generating feasible actions and does not require backpropagation.)
• Construct an MSE loss function. NFWPO constructs a loss function $L_{\mathrm{NFWPO}}(\theta;\bar{\theta})$ as the MSE between the actions of the current policy and the reference actions, i.e.,

$$L_{\mathrm{NFWPO}}(\theta;\bar{\theta}) = \sum_{s \in \mathcal{B}} \big\|\pi(s;\theta) - \tilde{a}_s\big\|^2. \quad (14)$$

• Update policy by gradient descent. NFWPO updates the policy parameters by minimizing the MSE loss in (14) using gradient descent for one step, i.e.,

$$\theta \leftarrow \theta - \beta\, \nabla_\theta L_{\mathrm{NFWPO}}(\theta;\bar{\theta}). \quad (15)$$

On the other hand, the critic of NFWPO can be based on any standard policy evaluation technique. For ease of exposition, for NFWPO, we use the same critic as the vanilla DDPG (as described in Section 2.1). The detailed pseudo code of NFWPO is provided in the supplementary material. Notably, similar to (6), NFWPO only uses $\nabla_a Q(s,a;\bar{\phi})$ for deriving reference actions, without using the deterministic policy gradient in (4). This design allows NFWPO to decouple constraint satisfaction in (12)-(13) from the parameter update in (14)-(15). As highlighted in Section 1, this decoupling obviates the need for the gradient of a projection layer and hence automatically avoids the zero-gradient issue.
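Putting (11)-(15) together, the following sketch outlines one NFWPO actor update on a mini-batch. It reuses the hypothetical projection helper above, assumes linear per-state constraints returned by a user-supplied constraint_fn, and treats the step sizes as arbitrary illustrative values.

```python
import numpy as np
import torch
from scipy.optimize import linprog

def fw_linear_max(grad_q, A_s, b_s):
    """(13): argmax_{c : A_s @ c <= b_s} <c, grad_q>, solved as an LP."""
    res = linprog(c=-grad_q, A_ub=A_s, b_ub=b_s,
                  bounds=[(None, None)] * len(grad_q))
    return res.x

def nfwpo_actor_update(actor, critic, actor_opt, batch_states, constraint_fn,
                       alpha=0.05):
    """One NFWPO actor step (12)-(15); constraint_fn(s) -> (A_s, b_s)."""
    reference_actions = []
    for s in batch_states:
        A_s, b_s = constraint_fn(s)
        s_t = torch.as_tensor(s, dtype=torch.float32)
        # Project the current (possibly infeasible) policy output, eq. (11).
        a_proj = project_onto_feasible_set(actor(s_t).detach().numpy(), A_s, b_s)
        # Gradient of the critic w.r.t. the action at the post-projection point.
        a_var = torch.tensor(a_proj, dtype=torch.float32, requires_grad=True)
        critic(torch.cat([s_t, a_var])).sum().backward()
        c_bar = fw_linear_max(a_var.grad.numpy(), A_s, b_s)
        reference_actions.append(a_proj + alpha * (c_bar - a_proj))   # eq. (12)
    # (14)-(15): one regression step of the policy toward the reference actions.
    states = torch.as_tensor(np.array(batch_states), dtype=torch.float32)
    targets = torch.as_tensor(np.array(reference_actions), dtype=torch.float32)
    loss = ((actor(states) - targets) ** 2).sum()
    actor_opt.zero_grad()
    loss.backward()
    actor_opt.step()
```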
Moreover, below we show that DDPG is actually a special case of NFWPO when there are no action constraints. The proof is provided in Appendix B.

Proposition 3 If there are no action constraints, then the policy update scheme of NFWPO in (12)-(15) is equivalent to the vanilla DDPG of (Lillicrap et al., 2016).
Remark 4
While NFWPO leverages a projection step in (12), this projection step is only for deriving reference actions and does not take part in the policy parameter update. As a result, NFWPO does not require backpropagation through the projection step (as shown in (12)-(15)) and therefore automatically avoids the zero-gradient issue. Hence, NFWPO is essentially different from the existing solutions that combine DDPG with a projection layer for end-to-end training (Pham et al., 2018; Dalal et al., 2018).
In this section, we empirically evaluate FWPO and NFWPO in various real-world applications, including bike sharing systems, communication networks, and continuous control in MuJoCo. We compare the proposed algorithms against the following popular benchmark methods:
• DDPG+Projection: The training procedure is identical to the vanilla DDPG (Lillicrap et al., 2016) except that the action is post-processed by the $L_2$-projection operator $\Pi_{\mathcal{C}(s)}(\cdot)$ before being applied to the environment.
• DDPG+RewardShaping: Built on DDPG+Projection, this algorithm adds the $L_2$-norm between the pre-projection and post-projection actions as a penalty to the intrinsic reward.
• DDPG+OptLayer: This design uses a differentiable projection layer, namely the OptLayer, that supports end-to-end training via gradient descent (Pham et al., 2018).
Moreover, for the projection step (without the need for backpropagation) required by DDPG+Projection, DDPG+RewardShaping, and NFWPO, we implement this functionality on the Gurobi optimization solver (Gurobi Optimization, 2021). Therefore, the post-projection actions are guaranteed to satisfy the action constraints for all the algorithms. For each task, each algorithm is trained under a common set of 5 random seeds. Each evaluation consists of 10 episodes, and we report the average performance along with the standard deviation in Figures 1-5. We also summarize the average return over the final 10 evaluations in Table 1. The detailed training setup can be found in Appendix D.
We use the open-source BSS simulator (https://github.com/bhatiaabhinav/gym-BSS), which was originally proposed by (Ghosh and Varakantham, 2017) and later used for evaluating action-constrained RL by (Bhatia et al., 2019). In a bike-sharing problem, there are $m$ bikes and $n$ stations, each of which has a pre-determined bike storage capacity $C$. An action is to allocate $m$ bikes to $n$ stations under random demands. The reward signal consists of three parts: (i) Moving cost: the cost of moving one or multiple bikes from one station to another; (ii) Lost-demand cost: the cost of unserved demand due to bike outage; (iii) Overflow cost: the cost incurred when the number of bikes in one station exceeds its capacity.
Evaluating FWPO. Since the bike sharing environment has a finite state space, we first use it to evaluate FWPO against the baseline methods, all with tabular policy parameterization. For the action value function, we use the same Q-learning-like critic as the vanilla DDPG for all the algorithms. A medium-sized system with $n = 3$, $m = 90$, and $C = 35$ is chosen that allows us to analytically find the optimal policy. There are two types of constraints: (i) Global constraint: all the action entries shall sum to $m$; (ii) Local constraints: each entry of the action shall be between $0$ and $C$. (A closed-form sketch of the Frank-Wolfe linear subproblem for this feasible set is given below.) Figure 1(a) shows the average return of the three algorithms. We observe that FWPO performs the best, while DDPG+Projection and DDPG+RewardShaping both suffer from slow learning. This is also reflected by Figure 1(b), which shows that FWPO converges to a near-optimal policy much faster than the baselines. The above phenomenon is mainly due to the inaccurate policy gradient of DDPG under action constraints. Specifically, the critics of the two baselines are trained with samples with feasible actions while the gradients $\nabla_a Q(s,a;\phi)$ are mostly evaluated at actions outside the feasible sets. By contrast, FWPO always stays in the feasible action sets and hence naturally avoids the issue of inaccurate gradients.
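For this particular feasible set, the Frank-Wolfe linear subproblem (6) has a closed-form greedy solution, sketched below with the BSS-3 values $m = 90$ and $C = 35$. The greedy rule is a standard continuous-knapsack argument, not code from the paper.

```python
import numpy as np

def bss_frank_wolfe_direction(grad_q, m=90, cap=35.0):
    """argmax <c, grad_q> over {c : sum(c) = m, 0 <= c_i <= cap}.

    A linear objective over this polytope is maximized greedily: fill the
    stations with the largest gradient entries up to their capacity until
    all m bikes are assigned.
    """
    c = np.zeros_like(grad_q, dtype=float)
    remaining = float(m)
    for i in np.argsort(-grad_q):        # stations sorted by descending gradient
        c[i] = min(cap, remaining)
        remaining -= c[i]
        if remaining <= 0:
            break
    return c

# Example with n = 3 stations: bikes go where dQ/da is largest.
print(bss_frank_wolfe_direction(np.array([0.2, -0.1, 0.5])))  # [35. 20. 35.]
```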
Figure 1: Bike sharing problem with $n = 3$ (BSS-3) under tabular policies: (a) Average return over 5 random seeds; (b) $L_2$-norm between the learned policies and the optimal policy at each training step.
Evaluating NFWPO. We proceed to compare NFWPO with the other three baselines in solving a larger-scale bike-sharing problem with $m = 150$, $n = 5$, and $C = 35$. As shown by Figure 2(a), NFWPO converges faster and achieves a larger return than the other baselines. To better understand its behavior, Figure 2(b) shows the cumulative constraint violations of the pre-projection actions. Interestingly, the pre-projection actions of NFWPO can largely avoid constraint violation, and thus NFWPO requires less help from the projection during training. By contrast, all the baselines rely heavily on the projection step to stay feasible, because most of their pre-projection actions fail to satisfy the constraints. We also observe that DDPG+Projection and DDPG+RewardShaping attain similar average return and frequency of violation. This is because they both produce pre-projection actions far from the feasible sets and thereby obtain similar post-projection actions. Meanwhile, DDPG+OptLayer suffers from nearly zero learning progress due to the zero-gradient issue. Figure 2(c) compares the sample-based gradient with respect to the pre- and post-OptLayer actions. Since the gradients of the pre-OptLayer actions (green line in Figure 2(c)) are mostly close to zero, the sample-based policy gradients $\nabla_\theta \hat{J}_\mu(\pi(\cdot\,;\theta))$ are therefore close to zero for most of the training steps. As the gradients with respect to the post-OptLayer actions are always non-zero (blue dotted line in Figure 2(c)), we know that the zero-gradient issue of $\nabla_\theta \hat{J}_\mu(\pi(\cdot\,;\theta))$ indeed results from the projection layer. This confirms that the additional OptLayer could easily lead to the zero-gradient issue and sample-inefficient training.
Figure 2: Bike sharing problem with $n = 5$ (BSS-5) under neural policies: (a) Average return over 5 random seeds; (b) Cumulative number of constraint violations of the pre-projection actions during training; (c) $L_2$-norm of the sample-based gradients with respect to the pre- and post-OptLayer actions of DDPG+OptLayer.

In this section, we evaluate the proposed methods over the task of utility maximization in communication networks. We simulate the network with the open-source network simulator from PCC-RL (Jay et al., 2019) (https://github.com/PCCproject/PCC-RL). For the network topology, we consider the classic T3 NSFNET Backbone and set the bandwidth of each link to be 50 packets per second throughout the experiments. We generate three network flows, each of which has three candidate paths from its source to the destination. The action is to determine the rate allocation of each flow along each candidate path. The reward consists of three parts: (i) Throughput: the number of received packets per second; (ii) Drop rate: the number of dropped packets per second; (iii) Latency: the average latency of the packets in the last second. For each flow $i$, its immediate reward is the logarithm of its throughput divided by a product of powers of its drop rate and latency, which corresponds to the widely-used proportional fairness criteria (Kelly, 1997). One salient feature of a communication network is that when the total packet arrival rate of a link approaches its bandwidth, the latency grows rapidly, and accordingly most of the packets are dropped. Therefore, in this environment, the action constraints correspond to the link bandwidth constraints, i.e., the total assigned packet arrival rate on each link should be bounded by 50; a sketch of how these constraints translate into a linear feasible set is given below.

Figure 3(a) shows the training curves and indicates that NFWPO still converges fast and achieves a much larger return than the baselines. Moreover, similar to the bike-sharing problems, we see from Figure 3(b) that most of the pre-projection actions of NFWPO already satisfy the constraints. In this task, we find that reward shaping does help in guiding the pre-projection actions towards the feasible action sets, but only under some random seeds (hence the large variance in Figures 3(a)-(b)). Regarding DDPG+OptLayer, in the initial training phase, we observe that it mostly produces pre-OptLayer actions with small flow rates, which lead to a smaller number of constraint violations and moderate returns. To achieve a higher return, DDPG+OptLayer then gradually increases the flow rates but accidentally causes more constraint violations of pre-OptLayer actions and suffers from the inaccurate gradient issue described in Section 4.1. Ultimately, DDPG+OptLayer can only achieve a fairly low return.
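To illustrate how the per-link bandwidth constraints define the feasible action set, the sketch below builds a made-up path-to-link incidence matrix (much smaller than the NSFNET topology used in the experiments) and solves the resulting Frank-Wolfe linear subproblem as an LP.

```python
import numpy as np
from scipy.optimize import linprog

np.random.seed(0)

# Each action entry is the rate of one (flow, path) pair; a path uses a set of
# links. The incidence matrix below is a toy example, not the real topology.
num_links, num_paths, bandwidth = 4, 9, 50.0
uses_link = np.zeros((num_links, num_paths))
uses_link[0, [0, 3, 6]] = 1.0        # link 0 is shared by paths 0, 3, and 6
uses_link[1, [1, 4]] = 1.0
uses_link[2, [2, 5, 7]] = 1.0
uses_link[3, [6, 7, 8]] = 1.0

# Feasible set C = {a >= 0 : uses_link @ a <= bandwidth}, so the Frank-Wolfe
# linear subproblem (13) is a small LP over the per-link capacity constraints.
grad_q = np.random.randn(num_paths)   # placeholder for grad_a Q(s, a)
res = linprog(c=-grad_q, A_ub=uses_link, b_ub=np.full(num_links, bandwidth),
              bounds=[(0, None)] * num_paths)
print(res.x)                          # a feasible Frank-Wolfe vertex
```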
Figure 3: Utility maximization in NSFNET: (a) Average return over 5 random seeds; (b) Cumulative constraint violations of the pre-projection actions during training.
To further validate NFWPO, we consider popular continuous control tasks in MuJoCo (Todorov et al., 2012) with linear, non-linear, and state-dependent action constraints:
Reacher with linear and non-linear constraints.
In this task, the action space is 2-dimensional (denoted by $u_1$, $u_2$), and each action entry corresponds to the torque of a joint of a 2-DoF robot. To validate the applicability of NFWPO, we impose both a linear and a nonlinear action constraint: (i) the magnitude $|u_1 + u_2|$ and (ii) the sum of squares $u_1^2 + u_2^2$ are each required to stay below a small threshold. We intentionally make the constraints quite tight to better showcase the performance of NFWPO. From Figure 4(a), we observe a similar trend that NFWPO still converges faster and achieves a larger return than the other baseline methods. In this task, DDPG+OptLayer can achieve a return closer to NFWPO as it has fewer pre-OptLayer violations, as shown in Figure 4(b). DDPG+Projection and DDPG+RewardShaping still perform poorly as they always produce actions far from the feasible sets and rely heavily on the projection step.

Table 1: Average return over the final 10 evaluations.
Methods             | BSS-3    | BSS-5     | NSFNET   | Reacher | Halfcheetah
NFWPO               | -1673.04 | -15132.21 | 13770.67 | -7.15   | 6513.26
DDPG+Projection     | -2254.52 | -16123.48 | 1514.44  | -11.95  | 2746.72
DDPG+RewardShaping  | -2308.00 | -16123.48 | 9010.46  | -10.10  | 3065.37
DDPG+OptLayer       | -        | -16686.04 | 1667.59  | -8.33   | 1399.37
Figure 4: Reacher-v2: (a) Average return over 5 random seeds; (b) Cumulative number of constraint violations of the pre-projection actions during training.
Halfcheetah with state-dependent constraints.
In this task, an action is 6-dimensional and is denoted by $(v_1, \cdots, v_6)$. We consider a challenging scenario where the constraint is state-dependent. Specifically, the imposed constraint requires $\sum_{i=1}^{6} |v_i w_i|$ to stay below a fixed bound, where $w_i$ denotes the angular velocity of the $i$-th joint and is part of the state. This constraint is meant to capture the limitation of total output power. Similar to the other environments, from Figures 5(a)-(b), we still observe that NFWPO achieves better sample efficiency than the other baselines and has much fewer constraint violations in the pre-projection actions. (A sketch of the Frank-Wolfe linear subproblem for this state-dependent constraint is given below.)
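Assuming the bound on the total output power is some constant $P$ (the exact value used in the experiments is not reproduced here), the feasible set is a weighted $L_1$ ball intersected with the unit action box, and the Frank-Wolfe linear subproblem reduces to a continuous knapsack. The budget and example values below are placeholders.

```python
import numpy as np

def halfcheetah_fw_direction(grad_q, w, power_budget=5.0):
    """argmax <v, grad_q> over {v : sum_i |v_i * w_i| <= P, |v_i| <= 1}.

    w is the vector of joint angular velocities taken from the current state,
    so the feasible set changes from state to state. The power budget P and the
    unit action box are illustrative assumptions. The optimum spends the budget
    on the coordinates with the largest |grad| per unit of |w|, with the sign
    of the gradient (a continuous-knapsack argument).
    """
    v = np.zeros_like(grad_q)
    budget = float(power_budget)
    cost = np.abs(w)
    ratio = np.abs(grad_q) / np.maximum(cost, 1e-12)   # value per unit of power
    for i in np.argsort(-ratio):
        amount = min(1.0, budget / cost[i]) if cost[i] > 1e-12 else 1.0
        v[i] = np.sign(grad_q[i]) * amount
        budget -= cost[i] * amount
        if budget <= 0:
            break
    return v

w = np.array([3.0, 0.5, 8.0, 1.0, 0.2, 4.0])      # example joint velocities
g = np.array([1.0, -0.3, 0.4, 2.0, 0.1, -1.5])    # example grad_a Q
print(halfcheetah_fw_direction(g, w))
```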
Figure 5: Halfcheetah-v2 with state-dependent action constraints: (a) Average return over 5 random seeds; (b) Cumulative number of constraint violations of the pre-projection actions during training.

The constrained RL problems have been extensively studied from two main perspectives. The first category encodes the constraints via cost signals, which are incurred at each step along with the reward signals, and accordingly focuses on the average-cost constraints. This line of research borrows a variety of ideas from the optimization literature. For example, (Chow et al., 2015) addressed chance constraints by using a primal-dual approach to achieve a trade-off between return and risk. Similarly, (Tessler et al., 2018) proposed Reward Constrained Policy Optimization, which applied Lagrangian relaxation and converted the constraints into penalty terms in the objective. (Achiam et al., 2017) proposed Constrained Policy Optimization to achieve strict policy improvement under the average-cost constraints by using the trust-region approach. In (Chow et al., 2018), Lyapunov-based safe reinforcement learning was proposed to address the constraints by solving a linear program. (Yang et al., 2019) proposed Projection-Based Constrained Policy Optimization to achieve no constraint violation by taking a projection step after the local reward improvement update. In (Liu et al., 2020b), Interior-point Policy Optimization was proposed to handle the average-cost constraints by augmenting the objective with logarithmic barrier functions. (Satija et al., 2020) took a different approach by converting the trajectory-level constraints into per-step state-wise constraints and accordingly defining a safe policy improvement step. Different from all the above prior works, this paper considers the RL settings with state-wise action constraints.

The second category is on the state-wise constraints that need to be satisfied on a step-by-step basis. (Pham et al., 2018) studied the state-wise action constraints of robotic systems and proposed a projection-based OptLayer to enforce the constraints. (Dalal et al., 2018) also considered state-wise safety constraints under linearization and proposed a projection-based safety layer to handle the constraints. Similarly, (Bhatia et al., 2019) considered resource constraints and proposed variants of OptLayer to improve the computational efficiency. (Shah et al., 2020) proposed a more efficient projection scheme for enforcing linear action constraints. Despite the similarity in the problem setting, we take a different approach and propose a decoupling framework by leveraging Frank-Wolfe to address the action constraints and completely avoid the zero-gradient issue.
This paper revisits action-constrained RL to tackle the zero-gradient issue and ensure zero constraint violation simultaneously. We achieve this goal by developing a learning framework that decouples the policy parameter update from constraint satisfaction by leveraging state-wise Frank-Wolfe and a regression argument. Our theoretical and experimental results demonstrate that the proposed learning algorithm is indeed a promising approach for action-constrained RL.
References
Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. In International Conference on Machine Learning, pages 22-31. PMLR, 2017.
Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Mahajan. On the theory of policy gradient methods: Optimality, approximation, and distribution shift. arXiv:1908.00261, 2019.
Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Mahajan. Optimality and approximation with policy gradient methods in Markov decision processes. In Conference on Learning Theory, pages 64-66, 2020.
Akshay Agrawal, Brandon Amos, Shane Barratt, Stephen Boyd, Steven Diamond, and Zico Kolter. Differentiable convex optimization layers. In Advances in Neural Information Processing Systems, 2019.
Brandon Amos and J Zico Kolter. OptNet: Differentiable optimization as a layer in neural networks. In International Conference on Machine Learning, pages 136-145. PMLR, 2017.
Abhinav Bhatia, Pradeep Varakantham, and Akshat Kumar. Resource constrained deep reinforcement learning. In Proceedings of the International Conference on Automated Planning and Scheduling, volume 29, pages 610-620, 2019.
Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223-311, 2018.
Sébastien Bubeck et al. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231-357, 2015.
Yinlam Chow, Aviv Tamar, Shie Mannor, and Marco Pavone. Risk-sensitive and robust decision-making: a CVaR optimization approach. In Advances in Neural Information Processing Systems, pages 1522-1530, 2015.
Yinlam Chow, Ofir Nachum, Edgar Duenez-Guzman, and Mohammad Ghavamzadeh. A Lyapunov-based approach to safe reinforcement learning. In Advances in Neural Information Processing Systems, pages 8103-8112, 2018.
Gal Dalal, Krishnamurthy Dvijotham, Matej Vecerik, Todd Hester, Cosmin Paduraru, and Yuval Tassa. Safe exploration in continuous action spaces. arXiv:1801.08757, 2018.
Marguerite Frank, Philip Wolfe, et al. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1-2):95-110, 1956.
Scott Fujimoto, Herke van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, pages 1587-1596, 2018.
Supriyo Ghosh and Pradeep Varakantham. Incentivizing the use of bike trailers for dynamic repositioning in bike sharing systems. In Proceedings of the International Conference on Automated Planning and Scheduling, volume 27, 2017.
Lin Gu, Deze Zeng, Wei Li, Song Guo, Albert Y Zomaya, and Hai Jin. Intelligent VNF orchestration and flow scheduling via model-assisted deep reinforcement learning. IEEE Journal on Selected Areas in Communications, 38(2):279-291, 2019.
Shixiang Gu, Ethan Holly, Timothy Lillicrap, and Sergey Levine. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In IEEE International Conference on Robotics and Automation (ICRA), pages 3389-3396, 2017.
Gurobi Optimization, LLC. Gurobi optimizer reference manual, 2021.
Léonard Jaillet and Josep M Porta. Path planning under kinematic constraints by rapidly exploring manifolds. IEEE Transactions on Robotics, 29(1):105-117, 2012.
Nathan Jay, Noga H. Rotman, P. Brighten Godfrey, Michael Schapira, and Aviv Tamar. Internet congestion control via deep reinforcement learning, 2019.
Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning, 2002.
Frank Kelly. Charging and rate control for elastic traffic. European Transactions on Telecommunications, 8(1):33-37, 1997.
Simon Lacoste-Julien. Convergence rate of Frank-Wolfe for non-convex objectives. arXiv:1607.00345, 2016.
Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations, 2016.
Anqi Liu, Guanya Shi, Soon-Jo Chung, Anima Anandkumar, and Yisong Yue. Robust regression for safe exploration in control. In Learning for Dynamics and Control, pages 608-619. PMLR, 2020a.
Yongshuai Liu, Jiaxin Ding, and Xin Liu. IPO: Interior-point policy optimization under constraints. In AAAI Conference on Artificial Intelligence, volume 34, pages 4940-4947, 2020b.
Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. Rectifier nonlinearities improve neural network acoustic models. In International Conference on Machine Learning, volume 30, page 3, 2013.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015.
Tu-Hoa Pham, Giovanni De Magistris, and Ryuki Tachibana. OptLayer - practical constrained optimization for deep reinforcement learning in the real world. In IEEE International Conference on Robotics and Automation (ICRA), pages 6236-6243, 2018.
Sashank J Reddi, Suvrit Sra, Barnabás Póczos, and Alex Smola. Stochastic Frank-Wolfe methods for nonconvex optimization. In Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 1244-1251, 2016.
Harsh Satija, Philip Amortila, and Joelle Pineau. Constrained Markov decision processes via backward value functions. In International Conference on Machine Learning, pages 8502-8511. PMLR, 2020.
Sanket Shah, Arunesh Sinha, Pradeep Varakantham, Andrew Perrault, and Milind Tambe. Solving online threat screening games using constrained action space reinforcement learning. In AAAI Conference on Artificial Intelligence, volume 34, pages 2226-2235, 2020.
David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In International Conference on Machine Learning, pages 387-395, 2014.
Richard S Sutton. Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems, 12:1057-1063, 2000.
Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
Chen Tessler, Daniel J Mankowitz, and Shie Mannor. Reward constrained policy optimization. In International Conference on Learning Representations, 2018.
E. Todorov, T. Erez, and Y. Tassa. MuJoCo: A physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026-5033, 2012. doi: 10.1109/IROS.2012.6386109.
Vassilios Tsounis, Mitja Alge, Joonho Lee, Farbod Farshidian, and Marco Hutter. DeepGait: Planning and control of quadrupedal gaits using deep reinforcement learning. IEEE Robotics and Automation Letters, 5(2):3699-3706, 2020.
Zhiyuan Xu, Jian Tang, Jingsong Meng, Weiyi Zhang, Yanzhi Wang, Chi Harold Liu, and Dejun Yang. Experience-driven networking: A deep reinforcement learning based approach. In IEEE Conference on Computer Communications (INFOCOM), pages 1871-1879, 2018.
Tsung-Yen Yang, Justinian Rosca, Karthik Narasimhan, and Peter J Ramadge. Projection-based constrained policy optimization. In International Conference on Learning Representations, 2019.
Junjie Zhang, Minghao Ye, Zehua Guo, Chen-Yu Yen, and H Jonathan Chao. CFR-RL: Traffic engineering with reinforcement learning in SDN. IEEE Journal on Selected Areas in Communications, 38(10):2249-2259, 2020.
A PROOFS OF THE THEORETICAL RESULTS
In this section, we provide the proof of the convergence result of the tabular FWPO in Proposition 2 by leveraging the proof technique of smooth optimization. We start by establishing the key smoothness result of Proposition 1 and then proceed to the convergence analysis.
A.1 SMOOTHNESS RESULTS AND PROOF OF PROPOSITION 1
In this section, we show the smoothness results of the value functions as well as the performance objective under the regularity assumptions (A1)-(A2). Notably, given the regularity conditions of $r$ and $p$ in the action, it remains non-trivial to establish the smoothness of $V(\cdot\,;\theta)$ and $J_\mu(\pi(\cdot\,;\theta))$ in $\theta$ due to the multi-step compound effect of the changes in policy parameters on the value functions. Despite this challenge, we are still able to characterize the required smoothness conditions.

Recall from Definition 1 that a differentiable function $f: \mathrm{dom}\, f \to \mathbb{R}$ is $L$-smooth if there exists $L \geq 0$ such that for any $x, y \in \mathrm{dom}\, f$, we have $\|\nabla f(x) - \nabla f(y)\| \leq L\|x - y\|$. One useful property is that if $f$ is $L$-smooth, then $f$ satisfies

$$\big|f(y) - f(x) - \langle \nabla f(x), y - x \rangle\big| \leq \frac{L}{2}\, \|y - x\|^2, \quad \forall x, y \in \mathrm{dom}\, f. \quad (16)$$

For notational convenience, we explicitly number the state space as $\mathcal{S} = \{1, \cdots, M\}$. By slightly abusing the notation, we use $\pi$ to denote the $M \times N$ matrix where the $s$-th row represents the action selected by a deterministic policy $\pi$ at each state $s \in \{1, \cdots, M\}$. Let $P^\pi$ denote an $M \times M$ matrix whose $(i,j)$-th entry is $p(j|i, \pi(i))$. Given a deterministic policy $\pi$, consider a "perturbed" version of the policy defined by $\pi_\delta = \pi + \delta W$, where $\delta \in \mathbb{R}$ and $W \in \mathbb{R}^{M \times N}$ is some fixed matrix with $\|W\|_F = 1$. Moreover, we simplify the notations by letting $P(\delta) \equiv P^{\pi_\delta}$ and $V_s(\delta) \equiv V(s;\pi_\delta)$, and then define $G(\delta) := (I_M - \gamma P(\delta))^{-1}$, where $I_M$ denotes the identity matrix of size $M \times M$. To show the smoothness result of $V(s;\theta)$, it is sufficient to show that $\big|\frac{d^2 V_s(\delta)}{d\delta^2}\big|$ is bounded for any $W \in \mathbb{R}^{M \times N}$ with $\|W\|_F = 1$, for any state $s$. We first introduce several useful lemmas.

Lemma 1 (Matrix-by-scalar derivatives)

$$\frac{dG(\delta)}{d\delta} = \gamma\, G(\delta)\, \frac{dP(\delta)}{d\delta}\, G(\delta), \quad (17)$$
$$\frac{d^2 G(\delta)}{d\delta^2} = 2\gamma^2\, G(\delta)\, \frac{dP(\delta)}{d\delta}\, G(\delta)\, \frac{dP(\delta)}{d\delta}\, G(\delta) + \gamma\, G(\delta)\, \frac{d^2 P(\delta)}{d\delta^2}\, G(\delta). \quad (18)$$

Proof (Lemma 1)
By the definition of $G(\delta)$, we know

$$I_M = G(\delta)\big(I_M - \gamma P(\delta)\big) = G(\delta) - \gamma\, G(\delta) P(\delta). \quad (19)$$

Note that the product rule of the matrix-by-scalar derivative suggests that for two matrices $U_1(\delta), U_2(\delta)$, we have $\frac{d}{d\delta}(U_1 U_2) = U_1 \frac{dU_2}{d\delta} + \frac{dU_1}{d\delta} U_2$. By taking the matrix-by-scalar derivative with respect to $\delta$ on both sides of (19) and using the product rule, we have

$$\frac{dG(\delta)}{d\delta} - \Big( \gamma\, G(\delta)\, \frac{dP(\delta)}{d\delta} + \gamma\, \frac{dG(\delta)}{d\delta}\, P(\delta) \Big) = \mathbf{0}_M, \quad (20)$$

where $\mathbf{0}_M$ denotes the $M \times M$ zero matrix. By reorganizing the terms in (20), it is straightforward to verify that (17) holds. Based on (17), we can obtain the second-order derivative of $G(\delta)$ as

$$\frac{d^2 G(\delta)}{d\delta^2} = \gamma\, \frac{d}{d\delta}\Big( G(\delta)\, \frac{dP(\delta)}{d\delta}\, G(\delta) \Big) \quad (21)$$
$$= \gamma\, G(\delta)\, \frac{d}{d\delta}\Big( \frac{dP(\delta)}{d\delta}\, G(\delta) \Big) + \gamma\, \frac{dG(\delta)}{d\delta}\Big( \frac{dP(\delta)}{d\delta}\, G(\delta) \Big) \quad (22)$$
$$= 2\gamma^2\, G(\delta)\, \frac{dP(\delta)}{d\delta}\, G(\delta)\, \frac{dP(\delta)}{d\delta}\, G(\delta) + \gamma\, G(\delta)\, \frac{d^2 P(\delta)}{d\delta^2}\, G(\delta). \quad (23)$$

Hence, we conclude that (17) and (18) indeed hold. □

Lemma 2 For any $x \in \mathbb{R}^M$, $G(\delta)$ satisfies

$$\|G(\delta)\, x\|_\infty \leq \frac{1}{1-\gamma}\, \|x\|_\infty. \quad (24)$$

Proof (Lemma 2)
Note that $G(\delta) = (I_M - \gamma P(\delta))^{-1} = \sum_{m=0}^{\infty} \gamma^m P(\delta)^m$. Given that $P(\delta)$ is a stochastic matrix where all the elements are non-negative and each row sums to 1, we know $P(\delta)^m$ is also a stochastic matrix, for any $m \in \mathbb{N}$. For each $i$, let $e_i$ denote an $M$-dimensional one-hot vector with the $i$-th entry equal to 1. Then, it is easy to verify that $\|G(\delta)\, e_i\|_\infty \leq \frac{1}{1-\gamma}$, for any $i \in \{1, \cdots, M\}$. This thereby implies that $\|G(\delta)\, x\|_\infty \leq \frac{1}{1-\gamma}\, \|x\|_\infty$, for any $x \in \mathbb{R}^M$. □

Lemma 3
Under the regularity assumptions (A1)-(A2), there exist constants $C_1, C_2 > 0$ such that for any $W \in \mathbb{R}^{M \times N}$ with $\|W\|_F = 1$ and for any $x \in \mathbb{R}^M$, we have

$$\Big\| \frac{dP(\delta)}{d\delta}\, x \Big\|_\infty < C_1\, \|x\|_\infty, \quad (25)$$
$$\Big\| \frac{d^2 P(\delta)}{d\delta^2}\, x \Big\|_\infty < C_2\, \|x\|_\infty. \quad (26)$$
For convenience, we use $W_i$ to denote the $i$-th row of $W$. For any pair of $i, j \in \{1, \cdots, M\}$, the $(i,j)$-th element of $\frac{dP(\delta)}{d\delta}$ evaluated at $\delta = 0$ satisfies

$$\Big| \Big[ \tfrac{dP(\delta)}{d\delta}\big|_{\delta=0} \Big]_{(i,j)} \Big| = \Big| \lim_{h \to 0} \frac{p(j|i, \pi_h(i)) - p(j|i, \pi(i))}{h} \Big| \quad (27)$$
$$= \Big| \big\langle \nabla_a p(j|i,a)\big|_{a=\pi(i)},\, W_i^\top \big\rangle \Big| \quad (28)$$
$$\leq \big\| \nabla_a p(j|i,a)\big|_{a=\pi(i)} \big\| \cdot \|W_i\| \quad (29)$$
$$< C_p, \quad (30)$$

where (28) follows from the differentiability of $p$ and the property of directional derivatives, (29) holds by the Cauchy-Schwarz inequality, and (30) follows from the regularity assumptions and the fact that $\|W\|_F = 1$. Then, by (30), it is straightforward to verify that (25) indeed holds.

Regarding (26), we first let $H_a(i,j)$ denote the Hessian of $p(j|i,a)$ with respect to $a$. The second directional derivative $\frac{d^2 P(\delta)}{d\delta^2}$ satisfies

$$\Big| \Big[ \tfrac{d^2 P(\delta)}{d\delta^2}\big|_{\delta=0} \Big]_{(i,j)} \Big| = \Big| W_i\, H_a(i,j)\big|_{a=\pi(i)}\, W_i^\top \Big| \quad (31)$$
$$\leq \|W_i\| \cdot \big\| H_a(i,j)\big|_{a=\pi(i)}\, W_i^\top \big\| \quad (32)$$
$$\leq \big\| H_a(i,j)\big|_{a=\pi(i)} \big\|_F \quad (33)$$
$$< L_p M, \quad (34)$$

where (31) holds by the basic property of second directional derivatives, (32) is due to the Cauchy-Schwarz inequality, (33) follows from $\|W\|_F \leq 1$ and $\big\|H_a(i,j)|_{a=\pi(i)} W_i^\top\big\| \leq \big\|H_a(i,j)|_{a=\pi(i)}\big\|_F\, \|W_i\|$, and (34) holds by the $L_p$-smoothness of $p$. Therefore, by (34) we conclude that (26) holds. □

Now we are ready to establish the smoothness conditions of the value functions. Define $L_Q = L_r + \frac{M L_p \gamma}{1-\gamma}$.

Lemma 4
Under the regularity assumptions (A1)-(A2) and tabular direct policy parameterization, we have the following smoothness properties of $Q(s,a;\pi)$ and $V(s;\pi)$:
(i) Under any fixed policy $\pi$, $Q(s,a;\pi)$ is $L_Q$-smooth in the action $a$, for any state $s \in \mathcal{S}$.
(ii) There exists some constant $L > 0$ such that $V(s;\pi_\theta)$ is $L$-smooth in the policy parameters $\theta$, for any state $s \in \mathcal{S}$.

Proof (Lemma 4) For (i): Recall that the action-value function can be expressed as

$$Q(s,a;\pi) = r(s,a) + \gamma \sum_{s' \in \mathcal{S}} p(s'|s,a)\, V(s';\pi). \quad (35)$$

By taking the derivative of (35) with respect to $a$, we have

$$\nabla_a Q(s,a;\pi) = \nabla_a r(s,a) + \gamma \sum_{s' \in \mathcal{S}} \nabla_a p(s'|s,a)\, V(s';\pi). \quad (36)$$

By the regularity assumptions on $r$ and $p$, we know that for any state $s$ and any two actions $a'$ and $a''$,

$$\big\|\nabla_a Q(s,a;\pi)|_{a=a'} - \nabla_a Q(s,a;\pi)|_{a=a''}\big\| \quad (37)$$
$$\leq \big\|\nabla_a r(s,a)|_{a=a'} - \nabla_a r(s,a)|_{a=a''}\big\| + \gamma \sum_{s' \in \mathcal{S}} \Big( \big\|\nabla_a p(s'|s,a)|_{a=a'} - \nabla_a p(s'|s,a)|_{a=a''}\big\| \cdot |V(s';\pi)| \Big) \quad (38)$$
$$\leq L_r\, \|a' - a''\| + \frac{M L_p \gamma}{1-\gamma}\, \|a' - a''\|, \quad (39)$$

where (38) follows from (36) and the triangle inequality, and (39) holds by the regularity assumptions on $r$ and $p$ as well as the fact that $|V(s';\pi)| \leq \frac{1}{1-\gamma}$. This implies that $Q(s,a;\pi)$ is $L_Q$-smooth in $a$.

For (ii):
We adapt the proof technique in (Agarwal et al., 2019, 2020) and show the smoothness result of $V(s;\pi_\theta)$ in $\theta$. Recall that by the Bellman equation, under a deterministic policy we have

$$V(s;\pi) = r(s, \pi(s)) + \gamma \sum_{s' \in \mathcal{S}} p(s'|s, \pi(s))\, V(s';\pi), \quad \forall s \in \mathcal{S}. \quad (40)$$

Let $r_\pi$ be an $M$-dimensional column vector whose $i$-th entry is $r(i, \pi(i))$, and let $V_\pi$ denote an $M$-dimensional column vector whose $i$-th entry is $V(i;\pi)$. Then, we can rewrite (40) in matrix form as

$$V_\pi = r_\pi + \gamma\, P^\pi V_\pi. \quad (41)$$

By (41), we immediately know that

$$V_\pi = (I_M - \gamma P^\pi)^{-1}\, r_\pi. \quad (42)$$

For each $s \in \{1, \cdots, M\}$, let $e_s$ denote an $M$-dimensional one-hot vector with the $s$-th entry equal to 1. Hence, for each $s \in \mathcal{S}$, we know $V(s;\pi) = e_s^\top (I_M - \gamma P^\pi)^{-1} r_\pi$. By Lemma 1, we have

$$\frac{dV_s(\delta)}{d\delta} = \gamma\, e_s^\top\, G(\delta)\, \frac{dP(\delta)}{d\delta}\, G(\delta)\, r_\pi, \quad (43)$$
$$\frac{d^2 V_s(\delta)}{d\delta^2} = 2\gamma^2\, e_s^\top\, G(\delta)\, \frac{dP(\delta)}{d\delta}\, G(\delta)\, \frac{dP(\delta)}{d\delta}\, G(\delta)\, r_\pi + \gamma\, e_s^\top\, G(\delta)\, \frac{d^2 P(\delta)}{d\delta^2}\, G(\delta)\, r_\pi. \quad (44)$$

Then, for any $W \in \mathbb{R}^{M \times N}$ with $\|W\|_F = 1$, we have

$$\Big| \frac{dV_s(\delta)}{d\delta} \Big| \leq \gamma\, \Big\| G(\delta)\, \frac{dP(\delta)}{d\delta}\, G(\delta)\, r_\pi \Big\|_\infty \quad (45)$$
$$\leq \frac{\gamma C_1}{(1-\gamma)^2}, \quad (46)$$

where (45) is a direct result of (43), and (46) follows from Lemmas 2-3 and $\|r_\pi\|_\infty \leq 1$. Similarly, we know

$$\Big| \frac{d^2 V_s(\delta)}{d\delta^2} \Big| \leq 2\gamma^2\, \Big\| G(\delta)\, \frac{dP(\delta)}{d\delta}\, G(\delta)\, \frac{dP(\delta)}{d\delta}\, G(\delta)\, r_\pi \Big\|_\infty + \gamma\, \Big\| G(\delta)\, \frac{d^2 P(\delta)}{d\delta^2}\, G(\delta)\, r_\pi \Big\|_\infty \quad (47)$$
$$\leq \frac{2\gamma^2 C_1^2}{(1-\gamma)^3} + \frac{\gamma C_2}{(1-\gamma)^2}, \quad (48)$$

where (47) is a direct result of (44), and (48) holds by Lemmas 2-3 and $\|r_\pi\|_\infty \leq 1$. Since (47)-(48) hold for any $W \in \mathbb{R}^{M \times N}$ with $\|W\|_F = 1$, we know that for every state $s$, $V(s;\pi_\theta)$ is $L$-smooth in $\theta$, where $L = \frac{2\gamma^2 C_1^2}{(1-\gamma)^3} + \frac{\gamma C_2}{(1-\gamma)^2}$. □

Now, we are ready to prove Proposition 1. For convenience, we restate Proposition 1 below.

Proposition 1
Under the regularity assumptions (A1)-(A2), there exists some constant $L > 0$ such that for any restarting state distribution $\mu$, $J_\mu(\pi(\cdot\,;\theta))$ is $L$-smooth in $\theta$.

Proof (Proposition 1)
By (ii) in Lemma 4 and the definition $J_\mu(\pi_\theta) = \mathbb{E}_{s \sim \mu}[V(s;\pi_\theta)]$, we know $J_\mu(\pi_\theta)$ is $L$-smooth, for any restarting state distribution $\mu$. □

A.2 PROOF OF PROPOSITION 2
For convenience, we restate Proposition 2 as follows.
Proposition 2 Under the FWPO algorithm with $\alpha_k(s) = \frac{(1-\gamma)\mu_{\min}}{L D_s^2}\, g_k(s)$, $\{\pi(\cdot\,;\theta_k)\}$ form a non-decreasing sequence of policies in the sense that $\pi(\cdot\,;\theta_{k+1}) \geq \pi(\cdot\,;\theta_k)$, for all $k$. Moreover, the effective Frank-Wolfe gap of FWPO converges to zero as $k \to \infty$, and the convergence rate can be quantified as

$$\sum_{k=0}^{\infty} G_k^2 \leq \frac{2 L D_{\max}^2}{(1-\gamma)^3 \mu_{\min}^2}, \quad (49)$$

which implies that $\bar{G}_T = O(T^{-1/2})$.

To show the non-decreasing property in Proposition 2, we summarize useful properties on policy improvement as follows. We use $A(\cdot,\cdot\,;\pi)$ to denote the advantage function of a policy $\pi$.

Lemma 5 (Performance difference lemma, (Kakade and Langford, 2002))
For any two policies $\pi$ and $\pi'$ and any restarting state distribution $\mu$, the performance difference between the two policies satisfies

$$\mathbb{E}_{s_0 \sim \mu}\big[ V(s_0;\pi') - V(s_0;\pi) \big] = \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim d^{\pi'}_{\mu},\, a \sim \pi'(\cdot|s)}\big[ A(s,a;\pi) \big]. \quad (50)$$

In particular, taking $\mu$ to be a point mass at a single state gives the state-wise version of the identity. By Lemma 5, we can directly obtain a sufficient condition of state-wise policy improvement.

Corollary 1
For any two policies $\pi$ and $\pi'$, we have $\pi' \geq \pi$ if the following condition holds for every state $s \in \mathcal{S}$:

$$\mathbb{E}_{a \sim \pi'}\big[A(s,a;\pi)\big] \geq \mathbb{E}_{a \sim \pi}\big[A(s,a;\pi)\big] = 0. \quad (51)$$

Now we are ready to prove Proposition 2.

Proof (Proposition 2)
For ease of notation, in this proof we use $\pi_k \equiv \pi(\cdot\,;\theta_k)$ and $\pi_k(s) \equiv \pi(s;\theta_k)$. Recall from (8) that under the policy $\pi_k$, the state-wise Frank-Wolfe gap is defined as $g_k(s) := \langle c_k(s) - \theta_k(s), \nabla_a Q(s,a;\pi_k)|_{a=\pi_k(s)} \rangle$. By Lemma 4, $Q(s,a;\pi)$ is $L$-smooth in $a$, for any state $s$ and any policy $\pi$. Then, under FWPO, we have

$$Q(s, \pi_{k+1}(s); \pi_k) \geq Q(s, \pi_k(s); \pi_k) + \alpha_k(s)\, \big\langle c_k(s) - \theta_k(s), \nabla_a Q(s,a;\pi_k)|_{a=\pi_k(s)} \big\rangle - \frac{L}{2}\, \alpha_k(s)^2\, \|c_k(s) - \theta_k(s)\|^2 \quad (52)$$
$$\geq Q(s, \pi_k(s); \pi_k) + \alpha_k(s)\, g_k(s) - \frac{L}{2}\, \alpha_k(s)^2\, D_s^2, \quad (53)$$

where (53) follows from the definitions of the state-wise Frank-Wolfe gap and the diameter $D_s$. It is easy to verify that $\alpha_k(s) g_k(s) - \frac{L}{2}\alpha_k(s)^2 D_s^2$ is positive for all $\alpha_k(s) \in \big(0, \frac{2 g_k(s)}{L D_s^2}\big)$ and attains a maximum of $\frac{g_k(s)^2}{2 L D_s^2}$ at $\alpha_k(s) = \frac{g_k(s)}{L D_s^2}$. Therefore, if the learning rate satisfies $\alpha_k(s) \in \big(0, \frac{2 g_k(s)}{L D_s^2}\big)$, then $Q(s, \pi_{k+1}(s); \pi_k) > Q(s, \pi_k(s); \pi_k)$ and hence $A(s, \pi_{k+1}(s); \pi_k) > 0$. By Corollary 1, we know $\pi_{k+1} \geq \pi_k$, for all $k$. Hence, $\{\pi(\cdot\,;\theta_k)\}$ indeed form a non-decreasing sequence of policies.

Next, we characterize the rate of convergence of the objective $J_\mu(\pi_k)$ under the proposed FWPO algorithm. By Proposition 1, we know the objective $J_\mu(\pi_k)$ is $L$-smooth. Therefore, we have

$$J_\mu(\pi_{k+1}) \geq J_\mu(\pi_k) + \langle \nabla_\theta J_\mu(\pi_k), \theta_{k+1} - \theta_k \rangle - \frac{L}{2}\, \|\theta_{k+1} - \theta_k\|^2 \quad (54)$$
$$= J_\mu(\pi_k) + \sum_{s \in \mathcal{S}} d^{\pi_k}_\mu(s)\, \alpha_k(s)\, \big\langle c_k(s) - \theta_k(s), \nabla_a Q(s,a;\pi_k)|_{a=\pi_k(s)} \big\rangle - \frac{L}{2} \sum_{s \in \mathcal{S}} \alpha_k(s)^2\, \|c_k(s) - \theta_k(s)\|^2 \quad (55)$$
$$\geq J_\mu(\pi_k) + (1-\gamma)\,\mu_{\min} \sum_{s \in \mathcal{S}} \alpha_k(s)\, g_k(s) - \frac{L}{2} \sum_{s \in \mathcal{S}} \alpha_k(s)^2\, D_s^2, \quad (56)$$

where (54) follows from the $L$-smoothness of $J_\mu(\pi_k)$, (55) holds by the update scheme of FWPO, and (56) follows from $d^{\pi_k}_\mu(s) \geq (1-\gamma)\mu_{\min}$ and the definition of the diameters. By using (56) and letting $\alpha_k(s) = \frac{(1-\gamma)\mu_{\min}}{L D_s^2}\, g_k(s)$, we have

$$J_\mu(\pi_{k+1}) \geq J_\mu(\pi_k) + \sum_{s \in \mathcal{S}} \frac{(1-\gamma)^2 \mu_{\min}^2\, g_k(s)^2}{2 L D_s^2} \quad (57)$$
$$\geq J_\mu(\pi_k) + \frac{(1-\gamma)^2 \mu_{\min}^2}{2 L D_{\max}^2} \sum_{s \in \mathcal{S}} g_k(s)^2. \quad (58)$$

Recall that $G_k^2 := \sum_{s \in \mathcal{S}} g_k(s)^2$ denotes the squared effective Frank-Wolfe gap of the $k$-th iteration. By taking the telescoping sum of (58), we know

$$J_\mu(\pi_{k+1}) \geq J_\mu(\pi_0) + \frac{(1-\gamma)^2 \mu_{\min}^2}{2 L D_{\max}^2} \sum_{t=0}^{k} G_t^2. \quad (59)$$

Let $\pi^*$ denote an optimal policy, i.e., $\pi^* \geq \pi$ for any policy $\pi$. Hence, we have

$$\sum_{t=0}^{k} G_t^2 \leq \frac{2 L D_{\max}^2}{(1-\gamma)^2 \mu_{\min}^2}\, \big(J_\mu(\pi_{k+1}) - J_\mu(\pi_0)\big) \leq \frac{2 L D_{\max}^2}{(1-\gamma)^2 \mu_{\min}^2}\, \big(J_\mu(\pi^*) - J_\mu(\pi_0)\big) \leq \frac{2 L D_{\max}^2}{(1-\gamma)^3 \mu_{\min}^2}, \quad (60)$$

where the last inequality follows from the fact that the value functions are upper bounded by $(1-\gamma)^{-1}$. Recall that $\bar{G}_T := \min_{0 \leq k \leq T} G_k$. Therefore, (60) implies that

$$\bar{G}_T \leq \sqrt{\frac{1}{T+1} \sum_{t=0}^{T} G_t^2} \leq \sqrt{\frac{1}{T+1} \cdot \frac{2 L D_{\max}^2}{(1-\gamma)^3 \mu_{\min}^2}} = O(T^{-1/2}). \quad (61)$$

This completes the proof. □

B PROOF OF PROPOSITION 3
B PROOF OF PROPOSITION 3

For convenience, we restate Proposition 3 as follows.
Proposition 3
If there are no action constraints, then the policy update scheme of NFWPO is equivalent to the vanilla DDPG update of (Lillicrap et al., 2016).
Proof
We prove this result by reinterpreting DDPG from a state-wise perspective. Let $\bar{\theta}$ and $\bar{\phi}$ be the current parameters of the actor and the critic, respectively. As proposed in (Lillicrap et al., 2016), the policy update under DDPG is done by approximating the true gradient $\nabla_\theta J_\mu(\pi(\cdot;\theta))$ by the sample-based gradient $\nabla_\theta \hat{J}_\mu(\pi(\cdot;\theta)) \approx \nabla_\theta J_\mu(\pi(\cdot;\theta))$ as

$\nabla_\theta \hat{J}_\mu(\pi(\cdot;\theta)) = \frac{1}{|\mathcal{B}|} \sum_{s\in\mathcal{B}} \nabla_a Q(s,a;\bar{\phi})|_{a=\pi(s;\bar{\theta})}\, \nabla_\theta \pi(s;\theta)$,   (62)

where $\mathcal{B}$ denotes a mini-batch of states drawn from the replay buffer. Note that this update rule can be reinterpreted from the perspective of regression by the following steps:

• Reference actions.
For each $s$ in the mini-batch $\mathcal{B}$, compute the reference action at state $s$ under the guidance of the critic

$a^*_s = \pi(s;\bar{\theta}) + \eta_1\, \nabla_a Q(s,a;\bar{\phi})|_{a=\pi(s;\bar{\theta})}$,   (63)

where $\eta_1 > 0$ is the step size.

• Loss function.
Construct a loss function $\mathcal{L}(\theta;\bar{\theta})$ as the mean-squared error (MSE) between the actions of the current policy and the reference actions, i.e.,

$\mathcal{L}(\theta;\bar{\theta}) = \frac{1}{2|\mathcal{B}|} \sum_{s\in\mathcal{B}} \big\| \pi(s;\theta) - a^*_s \big\|^2$.   (64)

• Gradient update.
Accordingly, update the policy parameters by minimizing the MSE loss with one step of gradient descent of step size $\eta_2 > 0$, i.e.,

$\theta \leftarrow \bar{\theta} - \eta_2\, \nabla_\theta \mathcal{L}(\theta;\bar{\theta})|_{\theta=\bar{\theta}}$.   (65)

We can observe that the update scheme characterized by (64)-(65) is equivalent to the DDPG update (62) with a learning rate of $\eta_1\eta_2$. Therefore, DDPG can be viewed as an actor-critic algorithm where both the actor and the critic are trained by using regression as subroutines. Note that (63) is equivalent to (12) if there are no action constraints. Hence, we conclude that NFWPO is equivalent to DDPG if there are no action constraints. $\square$
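The regression view of DDPG described in steps (63)-(65) can be written directly in code. The following PyTorch sketch assumes an actor and a critic given as torch modules with signatures actor(states) and critic(states, actions); the names eta1 and eta2 for the two step sizes are ours. Up to the combined learning rate eta1*eta2, this performs one deterministic-policy-gradient step of the form (62).

import torch

def ddpg_update_via_regression(actor, critic, states, eta1, eta2):
    """One actor update written as the regression (63)-(65); a sketch, not the authors' code."""
    # (63) Reference actions: a*_s = pi(s; theta_bar) + eta1 * grad_a Q(s, a; phi_bar)|_{a = pi(s)}
    with torch.no_grad():
        a = actor(states)
    a = a.clone().requires_grad_(True)
    q = critic(states, a).sum()
    grad_a = torch.autograd.grad(q, a)[0]
    ref_actions = (a + eta1 * grad_a).detach()

    # (64) MSE loss between the current policy's actions and the reference actions
    loss = 0.5 * ((actor(states) - ref_actions) ** 2).sum(dim=-1).mean()

    # (65) One gradient-descent step on the actor parameters with step size eta2
    actor.zero_grad()
    loss.backward()
    with torch.no_grad():
        for p in actor.parameters():
            p -= eta2 * p.grad
    return loss.item()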
C PSEUDO CODE OF NFWPO

For completeness, we provide the pseudo code of NFWPO in Algorithm 2.
Algorithm 2: Frank-Wolfe Policy Optimization With Neural Policy Parameterization (NFWPO)

Input: Frank-Wolfe learning rate α, actor learning rate β, critic learning rate β_c, target update ratio τ, and variance of exploration noise σ
Randomly initialize the actor π(·;θ) and the critic Q(·,·;φ) with parameters θ, φ
Initialize the target networks with parameters θ†, φ†
for each episode i = 0, 1, ... do
    Let θ̄ and φ̄ be the snapshots of the current actor and critic parameters
    Receive the initial state s_0 of the current episode
    for t = 0, 1, ... do
        Select action a_t = π(s_t; θ̄) + N(0, σ)
        Apply action a_t and observe the reward r_t as well as the next state s_{t+1}
        Store the transition (s_t, a_t, r_t, s_{t+1}) in the replay buffer
        Randomly sample a mini-batch B of transitions (s, a, r, s′) from the replay buffer
        // Update the actor by state-wise Frank-Wolfe
        for each state s ∈ B do
            c̄(s) = argmax_{c ∈ C(s)} ⟨c, ∇_a Q(s, a; φ̄)|_{a = Π_{C(s)}(π(s; θ̄))}⟩
            ã_s = Π_{C(s)}(π(s; θ̄)) + α (c̄(s) − Π_{C(s)}(π(s; θ̄)))
        end for
        L_NFWPO(θ; θ̄) = (1/|B|) Σ_{s∈B} ‖π(s; θ) − ã_s‖²
        θ ← θ̄ − β ∇_θ L_NFWPO(θ; θ̄)|_{θ=θ̄}
        // Update the Q-learning-like critic as in vanilla DDPG
        L_critic(φ) = (1/|B|) Σ_{(s,a,r,s′)∈B} (r + γ Q(s′, π(s′; θ†); φ†) − Q(s, a; φ))²
        φ ← φ̄ − β_c ∇_φ L_critic(φ)|_{φ=φ̄}
        Update the target networks: θ† ← τθ + (1−τ)θ†, φ† ← τφ + (1−τ)φ†
    end for
end for
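To illustrate the actor update of Algorithm 2, the following Python sketch computes the Frank-Wolfe target actions and the regression loss L_NFWPO. The helpers lmo (the linear maximization oracle over C(s)) and project (the Euclidean projection onto C(s)) are assumed to be provided, e.g., backed by an LP/QP solver; they are placeholders rather than part of the original pseudo code, and the critic is assumed to be a differentiable torch module taking (states, actions).

import torch

def nfwpo_actor_targets(actor, critic, states, lmo, project, alpha):
    """Compute the Frank-Wolfe target actions a~_s of Algorithm 2 (illustrative sketch).

    lmo(s, grad)  : returns argmax_{c in C(s)} <c, grad>  (assumed helper, e.g. an LP solver)
    project(s, a) : Euclidean projection of a onto C(s)   (assumed helper, e.g. a QP solver)
    """
    targets = []
    with torch.no_grad():
        raw_actions = actor(states)
    for s, a_raw in zip(states, raw_actions):
        a_proj = project(s, a_raw).clone().requires_grad_(True)
        q = critic(s.unsqueeze(0), a_proj.unsqueeze(0)).sum()
        grad_a = torch.autograd.grad(q, a_proj)[0]    # nabla_a Q(s, a; phi_bar) at the projected action
        c_bar = lmo(s, grad_a)                        # Frank-Wolfe vertex c_bar(s)
        targets.append((a_proj + alpha * (c_bar - a_proj)).detach())
    return torch.stack(targets)

def nfwpo_actor_loss(actor, states, targets):
    # Regression loss L_NFWPO(theta; theta_bar): fit the policy to the Frank-Wolfe targets
    return ((actor(states) - targets) ** 2).sum(dim=-1).mean()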
D DETAILED EXPERIMENTAL SETUP AND ADDITIONAL EXPERIMENTAL RESULTS

D.1 TRAINING CONFIGURATIONS

Random seeds. For each task, each algorithm is trained under a common set of five random seeds.
Exploration. The training process starts after an initial number of time steps (1000 steps for Reacher-v2 and 10000 steps for the other environments), and we use a purely exploratory policy in this initial phase to collect samples for the replay buffer for all the algorithms. During training, Gaussian noise is added to each action for exploration in the cases with neural policy parameterization. In the tabular case, we use the ε-greedy policy as the behavior policy instead.

Update frequency. For NSFNET, due to the longer computation time required by the network simulator, we speed up the simulations by updating the actor every 50 steps for all the algorithms. For the other environments, as DDPG+OptLayer is particularly time-consuming, we also update its actor every 50 steps to speed up the training process.
Implementation of OptLayer.
As there is no off-the-shelf implementation of DDPG+OptLayer available, we leverage the open-source package cvxpylayers (Agrawal et al., 2019) as well as the OptNet approach proposed by (Amos and Kolter, 2017) to implement the differentiable projection-based OptLayer for end-to-end training. Note that in DDPG+OptLayer there are two scenarios where projection is needed for the actor output: (i) producing actions for interacting with the environment (under a behavior policy) and (ii) evaluating the actions produced by the current policy for calculating the deterministic policy gradient as in (62). Since OptLayer is computationally inefficient, we use it only for scenario (ii), where back-propagation is required for gradient descent. For scenario (i), we use the Gurobi optimization solver instead to speed up the training process.
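For reference, a differentiable L2-projection layer of the kind used by OptLayer can be assembled with cvxpylayers roughly as follows. The linear constraint data G, h (defining C = {a : Ga ≤ h}) and all names are illustrative; this is a sketch under those assumptions, not the exact layer used in the experiments.

import cvxpy as cp
from cvxpylayers.torch import CvxpyLayer

def make_projection_layer(G, h):
    """Differentiable Euclidean projection onto {a : G a <= h} (sketch using cvxpylayers).

    G, h : numpy constraint data (illustrative); the resulting layer maps a raw action
           to its nearest feasible action and lets gradients flow back to the policy.
    """
    d = G.shape[1]
    a_raw = cp.Parameter(d)                           # pre-projection action from the policy network
    a = cp.Variable(d)
    objective = cp.Minimize(cp.norm(a - a_raw))       # same minimizer as the squared distance
    problem = cp.Problem(objective, [G @ a <= h])
    return CvxpyLayer(problem, parameters=[a_raw], variables=[a])

# Usage (illustrative): gradients flow through the projection back into the policy network.
# layer = make_projection_layer(G, h)
# projected_action, = layer(raw_action)   # raw_action: torch tensor with requires_grad=True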
D.2 A SUMMARY OF THE HYPERPARAMETERS
In this section, we provide a summary of the key hyperparameters in Tables 3-4. We highlight some key design choices:

• For both the actor and critic networks, we use two hidden layers with 400 and 300 hidden units with ReLU as the activation, as suggested by Fujimoto et al. (2018). For the activation function of the actor output, in order to better accommodate the various action ranges of different environments, we choose tanh for the MuJoCo control tasks and ReLU for the other tasks, respectively (see the sketch after this list).

• We choose a smaller batch size for DDPG+OptLayer since its computation time is much larger than that of the other approaches. As DDPG+OptLayer is a strong baseline in some of the environments, we also set the batch size of NFWPO to 16 for a fair comparison.
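As a concrete illustration of the first design choice above, a PyTorch actor with two hidden layers of 400 and 300 ReLU units might look as follows; the class name, the max_action output scaling, and the constructor arguments are ours and purely illustrative.

import torch.nn as nn

class Actor(nn.Module):
    """Two hidden layers (400, 300) with ReLU, matching the setup above (illustrative sketch)."""
    def __init__(self, state_dim, action_dim, out_activation="tanh", max_action=1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 400), nn.ReLU(),
            nn.Linear(400, 300), nn.ReLU(),
            nn.Linear(300, action_dim),
        )
        # tanh for the MuJoCo tasks (bounded symmetric action ranges), ReLU for the other tasks
        self.out = nn.Tanh() if out_activation == "tanh" else nn.ReLU()
        self.max_action = max_action

    def forward(self, state):
        return self.max_action * self.out(self.net(state))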
D.3 ADDITIONAL EXPERIMENTAL RESULTS
For the experiments with HalfCheetah-v2, we also evaluate NFWPO and the baseline methods with nonlinear action constraints. Recall that in this task, an action is 6-dimensional and is denoted by (v_1, ..., v_6). We consider quadratic action constraints that impose an upper bound on the sum of squares over two disjoint groups of three action dimensions (i.e., two constraints of the form v_i^2 + v_j^2 + v_k^2 ≤ c); the Frank-Wolfe linear maximization over such sets is sketched after Table 2. These constraints capture the limitation on the total torque applied to the joints of the robot. Similar to the results in the other environments, from Figure 6(a)-(b), we observe that NFWPO achieves much faster convergence than the baselines and has far fewer constraint violations in the pre-projection actions.

Table 2: Average return over the final 10 evaluations for HalfCheetah-v2 with quadratic constraints.

  Method                 Average return
  NFWPO
  DDPG+Projection        1277.16
  DDPG+Reward Shaping    2040.08
  DDPG+OptLayer          3598.95
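For quadratic constraints of this group-wise form, the Frank-Wolfe linear maximization used by NFWPO admits a closed form on each group: maximizing ⟨c, g⟩ over {c : Σ_{i∈I} c_i² ≤ b} places c on the boundary along the direction of g restricted to the group. A small sketch follows; the group indices and bounds are illustrative and ignore any additional box constraints imposed by the environment.

import numpy as np

def lmo_groupwise_ball(grad, groups, bounds):
    """argmax_c <c, grad> subject to sum_{i in I} c_i^2 <= b_I for each group I (sketch).

    grad   : action-gradient of the critic, shape (d,)
    groups : list of index arrays partitioning {0, ..., d-1}
    bounds : squared-radius bound b_I for each group (illustrative values)
    """
    c = np.zeros_like(grad, dtype=float)
    for idx, b in zip(groups, bounds):
        g = grad[idx]
        norm = np.linalg.norm(g)
        if norm > 0:
            # The linear objective is maximized on the sphere, at radius sqrt(b) along g
            c[idx] = np.sqrt(b) * g / norm
    return c

# Example with two groups of three action dimensions each (hypothetical bounds):
# lmo_groupwise_ball(grad, groups=[np.arange(0, 3), np.arange(3, 6)], bounds=[5.0, 5.0])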
Figure 6: HalfCheetah-v2 with quadratic constraints: (a) Average return at each evaluation over 5 random seeds; (b) Cumulative number of constraint violations of the primitive actions during training.

Table 3: A summary of the hyper-parameters of NFWPO and the baseline methods.

  Hyper-parameter                  Reacher-v2   HalfCheetah-v2   NSFNET   BSS-5
  Learning rate (Frank-Wolfe)      0.05         0.01             0.05     0.05
  Actor learning rate (others)     0.0001       0.0001           0.0001   0.0001
  Critic learning rate             0.001        0.001            0.001    0.001
  Discount factor                  0.99         0.99             0.99     0.99
  Target update ratio (τ)          0.001        0.001            0.001    0.001
  Replay buffer size               –            –                –        –
  Evaluation frequency             5000         5000             10000    5000
  Total training steps             –            –                –        –
  Exploration noise                N(0, ·)      N(0, ·)          N(0, ·)  N(0, ·)
  Weight of reward shaping         –            –                –        –

Table 4: A summary of the hyper-parameters in the tabular case.

  Hyper-parameter                      Value
  Target update ratio (τ)              0.01
  Target update frequency for actor    100
  Replay buffer size                   –
  Evaluation frequency                 5000
  Total training steps                 –
  Starting time of training            10000
  Exploratory behavior policy          ε-greedy, ε = 0.1