Zeroth-Order Supervised Policy Improvement
Hao Sun, Ziping Xu, Yuhang Song, Meng Fang, Jiechao Xiong, Bo Dai, Zhengyou Zhang, Bolei Zhou
The Chinese University of Hong Kong, University of Michigan, University of Oxford, Tencent Robotics X
Abstract
Despite the remarkable progress made by policy gradient algorithms in reinforcement learning (RL), sub-optimal policies usually result from the local exploration property of the policy gradient update. In this work, we propose a method referred to as Zeroth-Order Supervised Policy Improvement (ZOSPI) that exploits the estimated value function Q globally while preserving the local exploitation of policy gradient methods. We prove that with a good function structure, the zeroth-order optimization strategy combining both local and global samplings can find the global optima within a polynomial number of samples. To improve exploration efficiency in unknown environments, ZOSPI is further combined with bootstrapped Q networks. Different from standard policy gradient methods, the policy learning of ZOSPI is conducted in a self-supervised manner, so that the policy can be implemented with gradient-free non-parametric models besides neural network approximators. Experiments show that ZOSPI achieves competitive results on MuJoCo locomotion tasks with remarkable sample efficiency.

Code will be made available at https://github.com/decisionforce/ZOSPI.

Introduction

Model-free reinforcement learning has achieved great success in many challenging tasks [1-3]; however, one obstacle to its application in real-world control problems is insufficient sample efficiency. To improve sample efficiency, off-policy methods [4-8] reuse the experiences generated by previous policies to optimize the current policy, therefore obtaining higher sample efficiency than on-policy methods [9, 10]. Recently, SAC [11] proposed to regularize the off-policy actor-critic with the maximum entropy RL framework [12] for better exploration, which results in a much improved sample efficiency and state-of-the-art asymptotic performance. OAC [13] further improves SAC by combining it with the Upper Confidence Bound heuristic [14] to conduct more informative exploration. Despite their successes, these methods rely on a Gaussian policy and a local exploration strategy that simply adds noise in the action space, which might still lead to sub-optimal solutions, as pointed out by [15].

In this work we aim to explore a new learning paradigm that is able to carry out non-local exploration as well as non-local exploitation in continuous control tasks to achieve better sample efficiency. Specifically, we propose to better exploit the learned value function Q by searching globally for a better action, rather than only utilizing its local information through the Jacobian matrix used in previous policy gradient methods [16]. The idea behind our work is most related to the value-based policy gradient methods [7, 8], where the policy gradient step takes the role of finding a well-performing action given a learned state-action value function. Besides, such a step also tackles the curse of dimensionality, since it is intractable to directly search for the maximal value in a continuous action space [1].

Inspired by the works on evolution strategies [17-19] that adopt zeroth-order methods in the parameter space, we apply the zeroth-order method to the action space and then update the policy through supervised learning. Combining zeroth-order optimization with supervised learning forms a new way of policy update different from the standard policy gradient.
It avoids the local improvement of the policy gradient: when the policy gradient is applied, the target policy outputs continuous actions and uses the policy gradient to adjust its predictions according to the deterministic policy gradient theorem [16], but such updates can only lead to local improvements and thus induce sub-optimal policies due to the non-convexity of the policy function [15]; on the contrary, our sample-based zeroth-order method with supervised learning can help the non-convex policy optimization escape local optima.

Our contributions are summarized as follows. First, we propose a new policy optimization method, the Zeroth-Order Supervised Policy Improvement (ZOSPI), where the policy utilizes global information of the learned value function Q and learns through sample-based supervised learning. Second, in order to obtain a better estimate of the value function Q, we combine ZOSPI with optimistic exploration strategies to reduce the estimation error. Finally, we verify the exploration improvement of ZOSPI in a diagnostic environment named Four-Solution-Maze, where only ZOSPI with optimistic exploration is able to find the optimal solution. We further demonstrate the effectiveness of ZOSPI by comparing it with state-of-the-art policy gradient methods on the MuJoCo locomotion benchmarks in terms of improved performance and sampling efficiency.

Related Work

Policy Gradient Methods.
Policy gradient methods solve the MDP by directly optimizing the policy to maximize the cumulative reward [20, 21]. While prominent on-policy policy gradient methods like TRPO [9] and PPO [10] improve learning stability via trust-region updates, off-policy methods such as DPG [16] and DDPG [7] can learn with higher sample efficiency than on-policy methods. The work of TD3 [8] further addresses the function approximation error and boosts the stability of DDPG with several improvements. Another line of work combines policy gradient methods with the max-entropy principle, which leads to better exploration and stable asymptotic performance [12, 11]. All of these approaches adopt function approximators [22] for state or state-action value estimation as well as directionally uninformed Gaussian policies for policy parameterization, which leads to a local exploration behavior [13, 15].
Self-Supervised RL.
Self-supervised learning or self-imitation learning is a rising stream as an alternative approach to model-free RL. Instead of applying the policy gradient for policy improvement, methods of self-supervised RL update policies through supervised learning by minimizing the mean squared error between target actions and the current actions predicted by a policy network [23], or alternatively by maximizing the likelihood under stochastic policy parameterizations [24]. While these works focus on goal-conditioned tasks, in this work we aim at general RL tasks. Some other works use supervised learning to optimize the policy towards manually selected policies to achieve better training stability [25, 26].
Zeroth-Order Methods.
Zeroth-order optimization methods, also called gradient-free methods, are widely used when gradients are difficult to compute. They approximate the local gradient with random samples around the current estimate. The works in [27, 28] show that a local zeroth-order optimization method has a convergence rate that depends logarithmically on the ambient dimension of the problem under some sparsity assumptions. Zeroth-order methods can also efficiently escape saddle points in non-convex optimization [29, 30]. In RL, many studies have verified an improved sample efficiency of zeroth-order optimization [31, 19, 17]. In this work we provide a novel way of combining local sampling and global sampling to ensure that our algorithm approximates gradient descent locally while also being able to find a better global region.
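As a concrete illustration of the local zeroth-order idea referred to above, the following minimal sketch estimates a local ascent direction of a black-box objective purely from function evaluations at randomly perturbed points. The objective f, the smoothing scale sigma, and the step size are illustrative assumptions, not part of any cited method.

import numpy as np

def zeroth_order_gradient(f, x, sigma=0.05, n_samples=64, rng=None):
    """Estimate grad f(x) from function evaluations only (no analytic gradient).

    Uses the Gaussian-smoothing estimator g ~ E[(f(x + sigma*e) - f(x)) / sigma * e]
    with e ~ N(0, I), a standard zeroth-order construction.
    """
    rng = rng or np.random.default_rng(0)
    d = x.shape[0]
    e = rng.standard_normal((n_samples, d))           # random perturbation directions
    fx = f(x)
    diffs = np.array([f(x + sigma * ei) - fx for ei in e])
    return (diffs[:, None] * e).mean(axis=0) / sigma  # Monte-Carlo gradient estimate

# Toy usage: ascend a 2-D quadratic using only function values.
f = lambda a: -np.sum((a - 0.3) ** 2)                 # hypothetical black-box objective
a = np.zeros(2)
for _ in range(200):
    a = a + 0.1 * zeroth_order_gradient(f, a)
print(a)  # approaches the maximizer (0.3, 0.3)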
Preliminaries

We consider a deterministic Markov Decision Process (MDP) with continuous state and action spaces in the discounted infinite-horizon setting. Such an MDP can be denoted by M = (S, A, P, r, γ), where the state space S and the action space A are continuous, the unknown state transition function representing the transition dynamics is denoted by P : S × A → S, r : S × A → [0, 1] is the reward function, and γ ∈ [0, 1) is the discount factor. An MDP M and a learning algorithm operating on M with an arbitrary initial state s_0 ∈ S constitute a stochastic process described sequentially by the state s_t visited at time step t, the action a_t chosen by the algorithm at step t, the reward r_t = r(s_t, a_t), and the next state s_{t+1} = P(s_t, a_t) for any t = 0, ..., T. Let H_t = {s_0, a_0, r_0, ..., s_t, a_t, r_t} be the trajectory up to time t. Our algorithm finds the policy that maximizes the discounted cumulative reward.

Figure 1: (a) Q-value landscape of a 1-dim continuous control task; policy gradient methods optimize the policy according to local information. (b) For the same task, supervised policy improvement directly updates the predicted action to the sampled action with the largest Q value. (c) Simulation results: in each optimization iteration, actions are uniformly sampled with different ranges (0.01 to 2.0); results are averaged over 100 random seeds. A larger random sample range improves the chance of finding the global optima. A similar phenomenon exists in practice, as shown in Appendix A.

Our work follows the general actor-critic framework, which learns in an unknown environment using a Q network denoted by Q_{w_t} : S × A → R for estimating Q values and a policy network for learning the behavior policy π_{θ_t} : S → A. Here w_t and θ_t are respectively the parameters of these two networks at step t.

Figure 1 shows a motivating example to demonstrate the benefits of applying zeroth-order methods to policy optimization. Suppose we have learned a Q function that has multiple local optima, and our present deterministic policy selects a certain action at this state, denoted as the red dot in Figure 1(a). In deterministic policy gradient methods [16, 7], the policy gradient is computed according to the chain rule to optimize the policy parameter θ with regard to the Q value, by multiplying the Jacobian matrix ∇_θ π_θ(s) with the derivative of Q, i.e., ∇_a Q(s, a). Consequently, the policy gradient can only lead to a local improvement, and similar local improvement behaviors are also observed in stochastic policy gradient methods like PPO and SAC [10, 11, 15, 13]. Instead, if we are able to sample sufficient random actions in a broader range for the state, denoted as blue dots in Figure 1(b), and then evaluate their values through the learned Q estimator, it is possible to find an action with a higher Q value as the target action in the optimization. Figure 1(c) shows the simulation results using different sample ranges for the sample-based optimization starting from the red point. It is clear that a larger sample range improves the chance of finding the global optima. Utilizing such global exploitation of the learned value function is the key insight of this work.
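The following minimal sketch illustrates the kind of simulation behind Figure 1(c) under assumed specifics: the multi-modal one-dimensional Q landscape, the starting action, and the iteration count are illustrative choices, not the exact setup used in the paper. At each iteration a batch of actions is sampled uniformly within a given range around the current action, and the action with the largest Q value becomes the new iterate.

import numpy as np

def q_landscape(a):
    """A hypothetical non-convex 1-D Q landscape with several local optima."""
    return np.sin(3.0 * a) + 0.5 * np.cos(7.0 * a) - 0.1 * a ** 2

def sample_based_optimization(sample_range, n_samples=16, n_iters=20, a0=-1.5, seed=0):
    rng = np.random.default_rng(seed)
    a = a0
    for _ in range(n_iters):
        candidates = a + rng.uniform(-sample_range, sample_range, size=n_samples)
        candidates = np.clip(candidates, -2.0, 2.0)          # stay inside the action space
        a = candidates[np.argmax(q_landscape(candidates))]   # move to the best sampled action
    return q_landscape(a)

for r in [0.01, 0.1, 0.3, 0.5, 1.0, 2.0]:
    final_q = np.mean([sample_based_optimization(r, seed=s) for s in range(100)])
    print(f"range={r}: average final Q = {final_q:.3f}")

Averaged over seeds, larger sampling ranges end at higher Q values on such landscapes, which is the qualitative behavior reported in Figure 1(c).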
Q-learning in the tabular setting relies on finding the best action given the current state, which can be difficult in a continuous action space due to the non-convexity of Q. Instead, a policy network is trained to approximate the solution. In most previous policy gradient methods, the policy class is chosen to be Gaussian in consideration of both exploration and computational tractability, while in this work we consider the deterministic policy class, which is simpler and easier to learn, as presented in [16]. Here we assume the traditional estimation of the Q function is sufficient for global exploitation [8, 11]; we will discuss an improved estimation method in the next section.

Algorithm 1 Zeroth-Order Policy Optimization
Require
Objective function Q_s, domain A, current point a_0 = π_θ(s), number of local samples n_1, number of global samples n_2, local scale η > 0, and step size h.
Locally sampling
Sample n_1 points around a_0 by a_i = a_0 + η e_i for e_i ∼ N(0, I_d), i = 1, ..., n_1, where N(0, I_d) is the standard normal distribution.
Globally sampling
Sample n_2 points uniformly in the entire space by a_{n_1+i} ∼ U_A for i = 1, ..., n_2, where U_A is the uniform distribution over A.
Update
Set a⁺ = arg max_{a ∈ {a_1, ..., a_{n_1+n_2}}} Q_s(a). Update the policy π_θ according to Eq. (2).

As shown in [16], a deterministic policy gradient updates the policy network only through the first-order gradient of the current Q estimate:

∇_a Q_{w_t}(s_t, π_{θ_t}(s_t)) ∇_θ π_{θ_t}(s_t).   (1)

Such an optimization through the local information of Q may incur a slow convergence rate, especially when the Q function is non-convex. To mitigate this issue, we propose the Zeroth-Order Supervised Policy Improvement (ZOSPI), which exploits the entire learned Q function instead of merely the local gradient information of Q. Thus the key insight of our proposed method is to utilize the zeroth-order method to overcome the local policy improvement problem induced by the non-convexity of Q.

To be specific, we first calculate the predicted action a_0 = π_{θ_t}(s_t). Then we sample two sets of actions, namely a local set of size n_1 and a global set of size n_2. For the local set, we sample actions randomly from a Gaussian distribution centered at a_0. For the global set, we sample points uniformly over the action space. The update target a⁺_t is chosen as the action that gives the highest Q value in the union of the two sets. Finally, we apply the supervised policy improvement that minimizes the L2 distance between a⁺_t and π_{θ_t}(s_t), which gives the descent direction:
∇_θ ½ ‖a⁺_t − π_{θ_t}(s_t)‖² = −(a⁺_t − π_{θ_t}(s_t)) ∇_θ π_{θ_t}(s_t).   (2)

The implementation detail is shown in Algorithm 1.
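The sketch below illustrates one ZOSPI policy update in the spirit of Algorithm 1 and Eq. (2). It is a minimal PyTorch rendering under assumed shapes and hyper-parameters (network interfaces, n_local, n_global, and the local scale are illustrative), not the released implementation.

import torch

def zospi_policy_update(policy, q_net, optimizer, states,
                        act_low=-1.0, act_high=1.0,
                        n_local=8, n_global=8, local_scale=0.1):
    """One supervised policy-improvement step:
    sample local and global candidate actions, then regress pi(s) toward the best one."""
    with torch.no_grad():
        a0 = policy(states)                                          # (B, d) current predicted actions
        B, d = a0.shape
        local = a0.unsqueeze(1) + local_scale * torch.randn(B, n_local, d)
        glob = torch.rand(B, n_global, d) * (act_high - act_low) + act_low
        cand = torch.cat([a0.unsqueeze(1), local, glob], dim=1).clamp(act_low, act_high)
        s_rep = states.unsqueeze(1).expand(-1, cand.shape[1], -1)
        q = q_net(s_rep.reshape(-1, states.shape[-1]),
                  cand.reshape(-1, d)).view(B, -1)                    # Q value of every candidate
        a_plus = cand[torch.arange(B), q.argmax(dim=1)]               # best candidate per state

    loss = 0.5 * ((policy(states) - a_plus) ** 2).sum(dim=-1).mean()  # L2 regression, as in Eq. (2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

Setting n_global to zero recovers a purely local search, the regime that Proposition 1 below relates to the first-order update.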
Comparison between two methods. We now compare the performance of the supervised policy improvement in Eq. (2) with that of the standard deterministic policy gradient update in Eq. (1).
Proposition 1.
Let the best action in the local set be a_L = arg max_{a_i, i=1,...,n_1} Q_{w_t}(s_t, a_i). We have a_L − π_{θ_t}(s_t) ∝ ∇_a Q_{w_t}(s_t, π_{θ_t}(s_t)) when n_1 → ∞ and η → 0.

Proposition 1 guarantees that with a sufficiently large number of samples and a sufficiently small η, zeroth-order optimization can at least find the local descent direction as in a first-order method. By Theorem 7 in [28], local zeroth-order optimization has an exponential convergence rate for convex functions. When the best action is actually included in the global set, the zeroth-order method will update towards the global direction with a larger step size, since in general E‖a_i − π_{θ_t}(s_t)‖ ≤ E‖a_{n_1+i} − π_{θ_t}(s_t)‖, i.e., a globally sampled action is in expectation farther from the current prediction than a locally sampled one. The benefit of the supervised policy improvement (SPI) over DPG is determined by the probability of globally sampling an action that is close to the global optima. We discuss this in more detail in Section 4.3. With a sufficient number of sampled actions at each step, our supervised policy improvement is able to find better solutions in terms of higher Q values for a given state, and therefore can globally exploit the Q function, which is especially useful when the Q function is non-convex, as illustrated in Figure 1.

Algorithm 2 provides the pseudo code for ZOSPI, where we follow the double Q network in TD3 as the critic and also use target networks for stability.
Algorithm 2 Zeroth-Order Supervised Policy Improvement (ZOSPI)
Require
• Number of epochs M, size of mini-batch N, momentum τ > 0.
• Randomly initialized policy network π_θ, target policy network π_{θ'}, θ' ← θ.
• Two randomly initialized Q networks and corresponding target networks, parameterized by w_{1,0}, w_{2,0}, w'_{1,0}, w'_{2,0}, with w'_{i,0} ← w_{i,0}.
• Empty experience replay buffer D = {}.
for iteration = 1, 2, ... do
  for t = 1, 2, ..., T do
    Interaction: run policy π_{θ'_t} in the environment, store transition tuples (s_t, a_t, s_{t+1}, r_t) into D.
    for epoch = 1, 2, ..., M do
      Sample a mini-batch of transition tuples D' = {(s_{t_j}, a_{t_j}, s_{t_j+1}, r_{t_j})}_{j=1}^N.
      Update Q: calculate the target Q value y_j = r_{t_j} + γ min_{i=1,2} Q_{w'_{i,t}}(s_{t_j+1}, π_{θ'_t}(s_{t_j+1})).
      Update w_{i,t} with one step of gradient descent on the loss Σ_j (y_j − Q_{w_{i,t}}(s_{t_j}, a_{t_j}))², i = 1, 2.
      Update π: call Algorithm 1 for policy optimization to update θ_t.
    end for
    θ'_{t+1} ← τ θ_t + (1 − τ) θ'_t; w'_{i,t+1} ← τ w_{i,t} + (1 − τ) w'_{i,t}; w_{i,t+1} ← w_{i,t}; θ_{t+1} ← θ_t.
  end for
end for

In Algorithm 2, we sample actions in the global set from a uniform distribution over the action space, a_i ∼ U_A, and sample actions from an on-policy local Gaussian (e.g., a_i(s) = π_{θ_old}(s) + η_i with η_i ∼ N(0, σ²)) to form the local set and guarantee local exploitation, so that ZOSPI will perform at least as well as deterministic policy gradient methods [16, 7, 8].
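As a complement to the policy step sketched earlier, the following sketch shows the TD3-style critic update used inside the loop above (clipped double-Q target with target networks). The network objects, the batch layout, and the hyper-parameters are assumed placeholders rather than the paper's exact code.

import torch

def zospi_critic_update(q1, q2, q1_targ, q2_targ, policy_targ,
                        q_optimizer, batch, gamma=0.99):
    """One critic step: regress both Q networks toward the clipped double-Q target."""
    s, a, r, s_next = batch  # tensors of shape (B, s_dim), (B, a_dim), (B, 1), (B, s_dim)
    with torch.no_grad():
        a_next = policy_targ(s_next)
        q_next = torch.min(q1_targ(s_next, a_next), q2_targ(s_next, a_next))
        y = r + gamma * q_next                     # target value, as in Algorithm 2
    loss = ((q1(s, a) - y) ** 2).mean() + ((q2(s, a) - y) ** 2).mean()
    q_optimizer.zero_grad()
    loss.backward()
    q_optimizer.step()
    return loss.item()

def soft_update(net, target_net, tau=0.005):
    """Polyak averaging of target-network parameters, matching the tau update above."""
    with torch.no_grad():
        for p, p_targ in zip(net.parameters(), target_net.parameters()):
            p_targ.mul_(1.0 - tau).add_(tau * p)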
In this section, we discuss a type of function structure with which our zeroth-order optimization has an exponential convergence rate. To better explain this point, we include Algorithm 3 in Appendix B, a modified version of Algorithm 1, which prevents the sampled action from jumping too far across different global regions.

Definition 1 (Sampling-Easy Functions). A function F : X ⊂ R^d → R is called αβ-Sampling-Easy if it has a unique global minimum x*, and there exists a region D ⊂ X such that
1. x* ∈ D;
2. F is α-convex and β-smooth in the region D;
3. |D| / |X| ≥ c/d for some c > 0.
A function F is α-convex in the region D if F(y) ≥ F(x) + ⟨∇F(x), y − x⟩ + (α/2) ‖y − x‖² for x, y ∈ D. Furthermore, it is β-smooth if |F(y) − F(x) − ⟨∇F(x), y − x⟩| ≤ (β/2) ‖x − y‖² for x, y ∈ D.

Theorem 1.
For any αβ-Sampling-Easy function F that satisfies F(x*) ≤ F(x) − ε₀ for all x ∉ D, running Algorithm 3 requires on average at most

O( log( D_m² β / min{ε, ε₀} ) · dβ / (cα) )

iterations to find an ε-optimal solution for any ε > 0, with c and d the same as in Definition 1. Here D_m = max_{x ∈ D} ‖x − x*‖. The proof is provided in Appendix B.

Theorem 1 suggests that even if the function is non-convex or non-smooth, convergence can be guaranteed as long as there is a sufficiently large convex and smooth region around the global optimum.

Better Exploration with Bootstrapped Networks
Sample-efficient RL requires algorithms to balance exploration and exploitation. One of the most popular ways to achieve this is called optimism in the face of uncertainty (OFU) [14, 32-34], which builds an upper bound on the Q estimates and applies the optimal action corresponding to that upper bound. The optimal action a_t is given by the following optimization problem:

arg max_a Q⁺(s_t, a),   (3)

where Q⁺ is the upper confidence bound on the optimal Q function. A guaranteed exploration performance requires both a good solution to (3) and a valid upper confidence bound.

While it is trivial to solve (3) in the tabular setting, the problem can be intractable in a continuous action space. Therefore, as shown in the previous section, ZOSPI adopts a local set to approximate policy gradient methods in the local region and further applies a global sampling scheme to increase the chance of finding a better maximum.

As for the second requirement, we use bootstrapped Q networks to address the uncertainty of Q estimates as in [35-38, 13]. Specifically, we keep K estimates of Q, namely Q_1, ..., Q_K, trained with bootstrapped samples from the replay buffer. Let Q̄ = (1/K) Σ_k Q_k(s, a). An upper bound Q⁺ is

Q⁺(s, a) = Q̄ + φ sqrt( (1/K) Σ_k [Q_k(s, a) − Q̄]² ),   (4)

where φ is the hyper-parameter controlling the failure rate of the upper bound. Another issue is the update of the bootstrapped Q networks. Previous methods [37] usually update each Q network with the following target:

r_t + γ Q_k(s_{t+1}, π_{θ_t}(s_{t+1})),

which violates the Bellman equation, as π_{θ_t} is designed to be the optimal policy for Q⁺ rather than for Q_k. Using π_{θ_t} also introduces extra dependencies among the K estimates. We instead employ a global random sampling method to correct the violation:

r_t + γ max_{i=1,...,n_2} Q_k(s_{t+1}, a_i),   a_1, ..., a_{n_2} ∼ U_A.

The correction also reinforces the argument that a global random sampling method yields a good approximation to the solution of the optimization problem (3). The detailed algorithm is provided in Algorithm 4 in Appendix C.
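A minimal sketch of the upper confidence bound in Eq. (4) and of the corrected bootstrapped target follows, assuming the K critic heads are a list of modules and using an illustrative value of φ; the exact head architecture and masking scheme are not specified here.

import torch

def ucb_q(q_heads, s, a, phi=1.0):
    """Eq. (4): mean of the K bootstrapped Q heads plus phi times their (population) std."""
    qs = torch.stack([q(s, a) for q in q_heads], dim=0)   # (K, B, 1)
    q_mean = qs.mean(dim=0)
    q_std = ((qs - q_mean) ** 2).mean(dim=0).sqrt()
    return q_mean + phi * q_std

def bootstrapped_target(q_head, r, s_next, act_dim,
                        act_low=-1.0, act_high=1.0, n_global=16, gamma=0.99):
    """Corrected target: bootstrap each head with the best globally sampled action."""
    B = s_next.shape[0]
    a_cand = torch.rand(B, n_global, act_dim) * (act_high - act_low) + act_low
    s_rep = s_next.unsqueeze(1).expand(-1, n_global, -1).reshape(B * n_global, -1)
    q_vals = q_head(s_rep, a_cand.reshape(B * n_global, act_dim)).view(B, n_global)
    return r + gamma * q_vals.max(dim=1, keepdim=True)[0]

Because each head is bootstrapped with its own max over globally sampled actions rather than with the shared policy, the heads stay closer to independent estimates, which is the motivation stated above.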
Experiments

In this section, we show empirical results to demonstrate the effectiveness of the proposed ZOSPI method on both a diagnostic environment with known ground truth and the MuJoCo locomotion tasks. Specifically, we validate the following statements:
1. If we further estimate the Q function with bootstrapping and behave according to the OFU principle [14], we may apply Eq. (3) to pursue better exploration and reduce the estimation error effectively, and subsequently acquire a better solution or even the optimal one.
2. If we use ZOSPI with only locally sampled actions, the performance of ZOSPI should be the same as its policy gradient counterpart (i.e., TD3); if we increase the sampling range, ZOSPI will be able to better exploit the Q function and find better solutions than methods based on the policy gradient.
3. If we continue to increase the sampling range, it results in uniform sampling (in practice we include an additional local sampling to encourage local improvements in the later stage of the learning process), and the Q function can be maximally exploited.

Global Exploration on the Four-Solution-Maze.
The Four-Solution-Maze environment is a diagnostic environment where four positive-reward regions with unit side length are placed at the middle points of the edges of an N × N map. An agent starts from a uniformly initialized position in the map and can then move in the map by taking actions according to the location observations (its current coordinates x and y). Valid actions are limited to [−1, 1] for both the x and y axes. Each game consists of N timesteps for the agent to navigate the map and collect rewards. In each timestep, the agent receives a +10 reward if it is inside one of the reward regions, and a tiny penalty otherwise. For simplicity there are no obstacles in the map; the optimal policy thus finds the nearest reward region, moves directly towards it, and stays in the region till the end. Figure 2(a) visualizes the environment and the ground-truth optimal solution.

Figure 2: Experiments on the Four-Solution-Maze environment. (a) The Four-Solution-Maze environment and its optimal solution, where an optimal policy should learn to find the nearest reward region and step into it; (b) learning curves of the different approaches (TD3, ZOSPI, SAC, DDPG, PPO, ZOSPI-UCB); (c)-(h) visualizations of the learned policies and their corresponding value functions for PPO, DDPG, TD3, SAC, ZOSPI, and ZOSPI-UCB.

On this environment we compare ZOSPI with on-policy and off-policy state-of-the-art policy gradient methods in terms of their learning curves, each averaged over multiple runs. The results are presented in Figure 2(b). For ZOSPI with UCB we use bootstrapped Q networks for the upper bound estimation. The sample efficiencies of ZOSPI and ZOSPI with UCB exploration are much higher than those of the other methods. Noticeably, ZOSPI with UCB exploration is the only method that finds the optimal solution, i.e., a policy that directs to the nearest region with a positive reward. All other methods get trapped in sub-optimal solutions by stepping to certain reward regions they find.

The learned policies of the different methods are visualized in Figures 2(c)-2(h). For each method we plot the predicted behaviors of its learned policy at grid points using arrows (although the environment is continuous in the state space), and show the corresponding value function of its learned policy as a color map. All policies and value functions are learned with the same interaction budget, except for SAC, whose visualization is obtained with additional interactions, as it can find more of the target regions when more interactions are provided.

On the other hand, we find that ZOSPI with UCB exploration, although it can find the globally optimal solution, outperforms plain ZOSPI only by a small margin, which comes at the price of keeping K bootstrapped Q networks and updating them through separately sampled actions. Since this is relatively inefficient in terms of computational complexity, we demonstrate ZOSPI without UCB exploration on the MuJoCo benchmarks in consideration of both sample efficiency and computational efficiency.
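For concreteness, the sketch below implements a minimal version of the Four-Solution-Maze dynamics described above; the map size, episode length, and penalty value are illustrative assumptions (the excerpt does not fix them), and the gym-style interface is only for readability.

import numpy as np

class FourSolutionMaze:
    """Minimal diagnostic environment: four unit reward regions at the edge midpoints of an N x N map."""

    def __init__(self, n=20, penalty=-0.01, seed=0):
        self.n, self.penalty = n, penalty
        self.rng = np.random.default_rng(seed)
        m = n / 2.0
        # Centers of the four reward regions (middle of each edge).
        self.centers = np.array([[m, 0.5], [m, n - 0.5], [0.5, m], [n - 0.5, m]])
        self.t = 0
        self.pos = None

    def reset(self):
        self.t = 0
        self.pos = self.rng.uniform(0.0, self.n, size=2)    # uniformly initialized position
        return self.pos.copy()

    def step(self, action):
        action = np.clip(action, -1.0, 1.0)                 # valid actions in [-1, 1]^2
        self.pos = np.clip(self.pos + action, 0.0, self.n)
        self.t += 1
        # +10 inside any unit-side reward region, small penalty otherwise.
        inside = np.any(np.all(np.abs(self.pos - self.centers) <= 0.5, axis=1))
        reward = 10.0 if inside else self.penalty
        done = self.t >= self.n                             # episode length tied to N (assumed)
        return self.pos.copy(), reward, done, {}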
Figure 3: Experimental results on the MuJoCo locomotion tasks (reward versus environment interactions for TD3, ZOSPI, and SAC on Hopper-v2, Walker2d-v2, HalfCheetah-v2, Ant-v2, and Humanoid-v2), together with an ablation study on Walker2d-v2 over different ZOSPI sampling ranges. The shaded region represents half a standard deviation of the average evaluation over trials with different random seeds.

ZOSPI on the MuJoCo Locomotion Tasks.
In this section we evaluate ZOSPI on the OpenAI Gym locomotion tasks based on the MuJoCo engine [39, 40]. Concretely, we test ZOSPI on five locomotion environments, namely Hopper-v2 (11-dim observations and 3-dim actions), Walker2d-v2 (17-dim observations and 6-dim actions), HalfCheetah-v2 (17-dim observations and 6-dim actions), Ant-v2 (111-dim observations and 8-dim actions), and Humanoid-v2 (376-dim observations and 17-dim actions). We compare the results of the different methods under a limited budget of environment interactions to demonstrate the high learning efficiency of ZOSPI. We include TD3 and SAC, respectively a deterministic and a stochastic state-of-the-art policy gradient method, in the comparison. The results of TD3 are obtained by running the author-released code, and the results of SAC are directly extracted from the training logs released by the authors.

The results of ZOSPI, TD3, and SAC are shown in Figure 3. It is worth noting that in our implementation of ZOSPI, only a small number of actions are sampled for all tasks, which is sufficient to learn well-performing policies. Surprisingly, with such a small sampling budget, the results of ZOSPI are good even in tasks with high-dimensional action spaces such as Ant-v2 and Humanoid-v2. In all tasks the sample efficiency is consistently improved over TD3, which is the DPG counterpart of ZOSPI. While such a small set of sampled actions is very sparse in a high-dimensional space, we attribute the success of ZOSPI to the generalization ability of the policy network as well as the sparsity of meaningful actions, i.e., even in tasks with high-dimensional action spaces, only a limited number of action dimensions are crucial for making decisions.

The last plot in Figure 3 shows the ablation study on the sampling range in ZOSPI, where a sampling method based on a zero-mean Gaussian is applied and we gradually increase its scale from 0.1 to 0.5. We also evaluate a uniform sampling method with a radius of 0.5, denoted as U0.5 in the figure. The results suggest that zeroth-order optimization with local sampling performs similarly to the policy gradient method, and that increasing the sampling range can effectively improve the performance.

Conclusion

In this work, we propose the Zeroth-Order Supervised Policy Improvement (ZOSPI) method as an alternative to policy gradient methods for continuous control tasks. We also propose to combine ZOSPI with a bootstrapped estimation of the Q function for further improvement. We provide a theoretical analysis to validate that the proposed ZOSPI can be a sample-efficient method. On a diagnostic environment called Four-Solution-Maze, ZOSPI is shown to outperform prevailing policy gradient methods and is the only method that finds the optimal solution. We further evaluate ZOSPI on the MuJoCo locomotion tasks and demonstrate its high sampling efficiency, where ZOSPI achieves competitive performance compared to state-of-the-art policy gradient methods on continuous control tasks.

Broader Impact

In this work we show that a sample-based zeroth-order method can perform as well as previous policy gradient methods in continuous control tasks. Our proposed method outperforms previous state-of-the-art policy gradient methods even in challenging high-dimensional control tasks in terms of sample efficiency. Thus more investigation into applying self-supervised methods to RL should be a promising future direction. Empirically, it is promising to apply our method to areas where interaction with the environment is extremely expensive, e.g., autonomous driving, robotics, etc.
References

[1] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
[2] Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.
[3] Jakub Pachocki, Greg Brockman, Jonathan Raiman, Susan Zhang, Henrique Pondé, Jie Tang, Filip Wolski, Christy Dennison, Rafal Jozefowicz, Przemyslaw Debiak, et al. OpenAI Five, 2018. URL https://blog.openai.com/openai-five.
[4] Thomas Degris, Martha White, and Richard S Sutton. Off-policy actor-critic. arXiv preprint arXiv:1205.4839, 2012.
[5] Shixiang Gu, Timothy Lillicrap, Zoubin Ghahramani, Richard E Turner, and Sergey Levine. Q-Prop: Sample-efficient policy gradient with an off-policy critic. arXiv preprint arXiv:1611.02247, 2016.
[6] Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Remi Munos, Koray Kavukcuoglu, and Nando de Freitas. Sample efficient actor-critic with experience replay. arXiv preprint arXiv:1611.01224, 2016.
[7] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
[8] Scott Fujimoto, Herke Van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477, 2018.
[9] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.
[10] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
[11] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.
[12] Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. In Proceedings of the 34th International Conference on Machine Learning, pages 1352–1361. JMLR.org, 2017.
[13] Kamil Ciosek, Quan Vuong, Robert Loftin, and Katja Hofmann. Better exploration with optimistic actor critic. In Advances in Neural Information Processing Systems, pages 1785–1796, 2019.
[14] Ronen I Brafman and Moshe Tennenholtz. R-max: a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3(Oct):213–231, 2002.
[15] Chen Tessler, Guy Tennenholtz, and Shie Mannor. Distributional policy optimization: An alternative approach for continuous control. arXiv preprint arXiv:1905.09855, 2019.
[16] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In ICML, 2014.
[17] Tim Salimans, Jonathan Ho, Xi Chen, Szymon Sidor, and Ilya Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.
[18] Edoardo Conti, Vashisht Madhavan, Felipe Petroski Such, Joel Lehman, Kenneth Stanley, and Jeff Clune. Improving exploration in evolution strategies for deep reinforcement learning via a population of novelty-seeking agents. In Advances in Neural Information Processing Systems, pages 5027–5038, 2018.
[19] Horia Mania, Aurelia Guy, and Benjamin Recht. Simple random search provides a competitive approach to reinforcement learning. arXiv preprint arXiv:1803.07055, 2018.
[20] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.
[21] Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. 1998.
[22] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pages 1057–1063, 2000.
[23] Hao Sun, Zhizhong Li, Xiaotong Liu, Bolei Zhou, and Dahua Lin. Policy continuation with hindsight inverse dynamics. In Advances in Neural Information Processing Systems, pages 10265–10275, 2019.
[24] Dibya Ghosh, Abhishek Gupta, Justin Fu, Ashwin Reddy, Coline Devin, Benjamin Eysenbach, and Sergey Levine. Learning to reach goals without reinforcement learning. arXiv preprint arXiv:1912.06088, 2019.
[25] Qing Wang, Jiechao Xiong, Lei Han, Han Liu, Tong Zhang, et al. Exponentially weighted imitation learning for batched historical data. In Advances in Neural Information Processing Systems, pages 6288–6297, 2018.
[26] Chuheng Zhang, Yuanqi Li, and Jian Li. Policy search by target distribution learning for continuous control. arXiv preprint arXiv:1905.11041, 2019.
[27] Yining Wang, Simon Du, Sivaraman Balakrishnan, and Aarti Singh. Stochastic zeroth-order optimization in high dimensions. arXiv preprint arXiv:1710.10551, 2017.
[28] Daniel Golovin, John Karro, Greg Kochanski, Chansoo Lee, Xingyou Song, et al. Gradientless descent: High-dimensional zeroth-order optimization. arXiv preprint arXiv:1911.06317, 2019.
[29] Emmanouil-Vasileios Vlatakis-Gkaragkounis, Lampros Flokas, and Georgios Piliouras. Efficiently avoiding saddle points with zero order methods: No gradients required. In Advances in Neural Information Processing Systems, pages 10066–10077, 2019.
[30] Qinbo Bai, Mridul Agarwal, and Vaneet Aggarwal. Escaping saddle points for zeroth-order non-convex optimization using estimated gradient descent. pages 1–6. IEEE, 2020.
[31] Nicolas Usunier, Gabriel Synnaeve, Zeming Lin, and Soumith Chintala. Episodic exploration for deep deterministic policies for StarCraft micromanagement. 2016.
[32] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
[33] Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, pages 263–272. JMLR.org, 2017.
[34] Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, pages 4863–4873, 2018.
[35] Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems, pages 4026–4034, 2016.
[36] Ian Osband, John Aslanides, and Albin Cassirer. Randomized prior functions for deep reinforcement learning. In Advances in Neural Information Processing Systems, pages 8617–8629, 2018.
[37] Rishabh Agarwal, Dale Schuurmans, and Mohammad Norouzi. Striving for simplicity in off-policy deep reinforcement learning. arXiv preprint arXiv:1907.04543, 2019.
[38] Aviral Kumar, Justin Fu, Matthew Soh, George Tucker, and Sergey Levine. Stabilizing off-policy Q-learning via bootstrapping error reduction. In Advances in Neural Information Processing Systems, pages 11761–11771, 2019.
[39] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, 2016.
[40] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In IROS, pages 5026–5033. IEEE, 2012.
[41] Yurii Nesterov and Vladimir Spokoiny. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17(2):527–566, 2017.
A Visualization of Q-Landscape
Figure 4 shows the visualization of learned policies (actions given different states) and Q values in TD3 during training in the Pendulum-v0 environment, where the state space is 3-dimensional and the action space is 1-dimensional. The red lines indicate the action selected by the current policy. The learned Q functions are always non-convex; as a consequence, in many states TD3 is not able to find the globally optimal solution, and local gradient information may be misleading when searching for actions with high Q values.
Figure 4: Landscape of learned value function in the Pendulum-v0 environment
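A short sketch of how such a landscape plot can be produced from a trained critic and actor; the critic/actor interfaces and the set of probed states are assumptions for illustration, not the script used for Figure 4.

import numpy as np
import torch
import matplotlib.pyplot as plt

def plot_q_landscape(q_net, policy, states, act_low=-2.0, act_high=2.0, n_points=200):
    """Plot Q(s, a) over a 1-D action grid for a few fixed states, marking the policy's action."""
    actions = torch.linspace(act_low, act_high, n_points).unsqueeze(1)     # (n_points, 1)
    fig, axes = plt.subplots(1, len(states), figsize=(4 * len(states), 3))
    for ax, s in zip(np.atleast_1d(axes), states):
        s_rep = torch.as_tensor(s, dtype=torch.float32).repeat(n_points, 1)
        with torch.no_grad():
            q = q_net(s_rep, actions).squeeze(-1).numpy()
            a_pi = policy(s_rep[:1]).item()                                # action chosen by the policy
        ax.plot(actions.squeeze(-1).numpy(), q)
        ax.axvline(a_pi, color="red")                                      # red line: selected action
        ax.set_xlabel("Action")
        ax.set_ylabel("Q value")
    plt.tight_layout()
    plt.show()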
B Convergence for Zeroth-order Optimization
Algorithm 3
One-step Zeroth-Order Optimization with Consistent Iteration
Require
Objective function Q, domain A, current point a_0, number of local samples n_1, number of global samples n_2, local scale η > 0, step size h, and number of steps m.
for t = 1, ..., n_2 do
  Globally sampling: sample a point uniformly in the entire space by a_{t,0} ∼ U_A, where U_A is the uniform distribution over A.
  for i = 1, ..., m do
    Locally sampling: sample n_1 points around a_{t,i−1} by ã_j = a_{t,i−1} + η e_j for e_j ∼ N(0, I_d), j = 1, ..., n_1, where N(0, I_d) is the standard normal distribution centered at 0.
    Update: set a_{t,i} = a_{t,i−1} + h (arg max_{a ∈ {ã_j}} Q(a) − a_{t,i−1}).
  end for
end for
return max_{a ∈ {a_{t,m}}_{t=1}^{n_2}} Q(a).
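A compact Python rendering of Algorithm 3 under assumed defaults; the objective Q, the box domain, and the hyper-parameters are placeholders used only to make the restart-then-refine structure concrete.

import numpy as np

def zeroth_order_consistent(q_fn, low, high, n_global=10, n_local=16,
                            n_steps=20, local_scale=0.1, step_size=0.5, seed=0):
    """Algorithm 3 sketch: restart from global uniform samples, then refine locally."""
    rng = np.random.default_rng(seed)
    d = low.shape[0]
    finals = []
    for _ in range(n_global):
        a = rng.uniform(low, high)                       # global restart point a_{t,0}
        for _ in range(n_steps):
            cand = a + local_scale * rng.standard_normal((n_local, d))
            cand = np.clip(cand, low, high)
            best = cand[np.argmax([q_fn(c) for c in cand])]
            a = a + step_size * (best - a)               # damped update step with step size h
        finals.append(a)
    return max(finals, key=q_fn)                         # best refined point across restarts

# Toy usage on a 2-D multi-modal objective (hypothetical):
q = lambda a: np.sin(4 * a[0]) * np.cos(3 * a[1]) - 0.05 * np.sum(a ** 2)
low, high = np.array([-2.0, -2.0]), np.array([2.0, 2.0])
print(zeroth_order_consistent(q, low, high))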
Proof Sketch. As shown in [41], under the same conditions as in Definition 1, given x_0, ..., x_N ∈ D, we have

F(x_N) − F(x*) ≤ β (1 − α / (8(d + 4)β))^N ‖x_0 − x*‖².

Thus, as long as a global sample lies in D, it requires at most

N_ε = log( β ‖x_0 − x*‖² / ε ) · 8(d + 4)β / α

iterations to find an ε-optimal solution. The probability of globally sampling a point in D is at least c/d. In expectation, it therefore requires d/c global samples to start from a point in D. Theorem 1 follows.

C Algorithm 4: ZOSPI with Bootstrapped Q Networks

Algorithm 4
ZOSPI with UCB Exploration
Require
• The number of epochs M, the size of mini-batch N, momentum τ > 0, and the number of bootstrapped Q networks K.
• Randomly initialized policy network π_θ, target policy network π_{θ'}, θ' ← θ.
• K randomly initialized Q networks and corresponding target networks, parameterized by w_{k,0} and w'_{k,0}, with w'_{k,0} ← w_{k,0} for k = 1, ..., K.
for iteration = 1, 2, ... do
  for t = 1, 2, ..., T do
    Interaction: run policy π_{θ'_t} and collect transition tuples (s_t, a_t, s'_t, r_t, m_t).
    for epoch j = 1, 2, ..., M do
      Sample a mini-batch of transition tuples D_j = {(s, a, s', r, m)_i}_{i=1}^N.
      Update Q:
      for k = 1, 2, ..., K do
        Calculate the k-th target Q value y_i^k = r_i + γ max_l Q_{w'_{k,t}}(s'_i, a'_l), where a'_l ∼ U_A.
        Update w_{k,t} with the loss Σ_{i=1}^N m_i^k (y_i^k − Q_{w_{k,t}}(s_i, a_i))².
      end for
      Update π: calculate the predicted action a_0 = π_{θ'_t}(s_i); sample actions a_l ∼ U_A; select a⁺ ∈ {a_l} ∪ {a_0} as the action with maximal Q⁺(s_i, a) defined in (4); update the policy network with Eq. (2).
    end for
    θ'_{t+1} ← τ θ_t + (1 − τ) θ'_t; w'_{k,t+1} ← τ w_{k,t} + (1 − τ) w'_{k,t}; w_{k,t+1} ← w_{k,t}; θ_{t+1} ← θ_t.
  end for
end for