Evolutionary Reinforcement Learning via Cooperative Coevolutionary Negatively Correlated Search
Hu Zhang; Peng Yang; Yanglong Yu; Mingjia Li; Ke Tang Southern University of Science and Technology
Abstract
Evolutionary algorithms (EAs) have been successfully applied to optimize the policies for Reinforcement Learning (RL) tasks due to their exploration ability. The recently proposed Negatively Correlated Search (NCS) provides a distinct parallel exploration search behavior and is expected to facilitate RL more effectively. Considering that the commonly adopted neural policies usually involve millions of parameters to be optimized, the direct application of NCS to RL may face the great challenge of a large-scale search space. To address this issue, this paper presents an NCS-friendly Cooperative Coevolution (CC) framework to scale up NCS while largely preserving its parallel exploration search behavior. The issue that traditional CC can deteriorate NCS is also discussed. Empirical studies on 10 popular Atari games show that the proposed method can significantly outperform three state-of-the-art deep RL methods with 50% less computational time by effectively exploring a 1.7 million-dimensional search space.
Section I Introduction
Reinforcement Learning (RL) is an attractive machine learning problem that actively learns a decision-making policy through environmental interactions with delayed and noisy rewards, without consuming labeled training data like traditional supervised learning [1]. In RL, an agent sequentially observes the environment state and takes an action suggested by the policy; a reward is then returned from the environment to the agent, reflecting how well it interacts with the environment. By iteratively conducting the above procedure, RL aims to learn the policy that maximizes the long-term reward [2]. For this purpose, the policy is expected to react accurately to the observed environment states. The emerging Deep Reinforcement Learning (DRL) methods provide this possibility by employing deep neural networks as the policy [3]. DRL builds the policy directly on the observed raw data, offering seamless interfaces to the environment, and facilitates state-action decision-making with powerful non-linear inference ability, enabling many successful real-world applications [4]-[6]. On the other hand, due to the non-uniform distribution of rewards, it is important to explore the action space to enlarge the coverage of the policy on the state space [7]. This requires exploratively optimizing the parameters of the policy, e.g., the connection weights of deep networks. Traditional gradient-based methods [8][9] are not always effective due to their lack of exploration ability. Recently, Evolutionary Algorithms (EAs) have been successfully applied to DRL by searching for the optimal policy in a population-based manner [10]-[12]. The population-based nature of EAs not only provides the much-needed exploration ability to RL, but also offers other bonuses such as parallel acceleration [13], noise resistance [14], and compatibility with training non-differentiable policies (e.g., trees [15]).
Due to these advantages, the resultant Evolutionary Reinforcement Learning (ERL) has drawn increasing research popularity from both academia and industry. Actually, the balance between exploration and exploitation has been extensively studied in the EA community for decades [16]. Basically, it is widely acknowledged that the diversity of the population is an important indicator to guide the exploration. A rich body of work has been proposed to effectively measure the diversity of the current population and adjust it accordingly to generate the next population. Recently, [17] argued that the diversity of the current population may not be closely related to the generation of a diversified next population, and proposed to directly measure the diversity of the next population. Based on this idea, the presented Negatively Correlated Search (NCS) is able to capture the on-going interactions between successive iterations and introduce manageable diversity into the generation of the next population, successfully forming a fully parallel exploration search behavior where each individual in the population is managed to visit a different region of the search space with high enough objective quality. Due to its effectiveness, NCS has been successfully applied to several types of real-world applications [18]-[21]. This paper aims to apply NCS to ERL to offer a parallel exploration ability for policy training. Note that training a deep neural network commonly requires optimizing millions or more connection weights. Such huge numbers of decision variables usually impose great challenges on EAs (e.g., NCS) [13]. Generally, to solve such large-scale problems, the Cooperative Coevolution (CC) framework is often beneficial for scaling up EAs by dividing the decision variables into multiple exclusive groups and addressing the resultant smaller sub-problems (i.e., groups of decision variables) with EAs respectively [22].
One major difficulty of applying CC lies in the evaluation of partial solutions in each sub-problem with respect to the original objective function. Existing CC methods mostly complement all partial solutions in a sub-problem with the same complementary vector [22], and then evaluate the complemented solutions accordingly. This strategy is mostly effective for traditional EAs, whose populations tend to converge such that the individuals are somewhat similar at the later stage of the search. However, due to its parallel exploration behavior, the population of NCS tends to diverge. In this case, sharing the same complementary vector among all partial solutions may lead to significantly incorrect evaluations of some partial solutions, and thus fail the parallel exploration. To address this issue, a specialized CC framework is devised to scale up NCS without compromising its exploration ability, resulting in the proposed Cooperative Coevolutionary Negatively Correlated Search (CCNCS). CCNCS is applied to train a neural policy with 1.7 million connection weights on 10 popular Atari games to verify its effectiveness on RL problems. Empirical results show that CCNCS can outperform state-of-the-art DRL methods (including both gradient-based and EA-based policy training methods) by achieving significantly higher scores within half the computational time. Furthermore, it is shown that CCNCS can obtain scores in some games with very sparse reward distributions, while the state-of-the-art methods cannot break the tie. This should also be attributed to the well-preserved strong exploration ability while scaling up NCS with CC. The remainder of this paper is organized as follows. Section II briefly introduces the preliminaries of ERL, NCS, and CC. Section III details the proposed CCNCS method. Section IV presents the empirical studies of CCNCS on Atari games. The conclusions are drawn in Section V.
Section II Preliminaries
This section briefly provides three aspects of preliminaries for the following sections.
Section II.A Evolutionary Reinforcement Learning
ERL specifies the class of RL approaches that use EAs to directly optimize the parameters of the policy [10], e.g., deep neural networks in this paper. The general flowchart of ERL can be seen in Fig.1. Initially, each individual of an EA is represented as a vector of all the connection weights of the policy. The training phase is divided into multiple epochs. At each epoch, the agent starts from the beginning of a game and takes actions suggested by a policy model (i.e., an individual of the EA) with respect to the environment states. After a game (i.e., an epoch) has finished, the reward of the policy is returned to the agent as well as the EA. The EA then takes the rewards of all candidate policies in the current population to generate the new population of policies for the next epoch. The above procedure repeats until the training budget runs out, and the best policy model at the final epoch is output.
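The flowchart can be sketched as a minimal, hypothetical ERL loop. Everything here is illustrative, not the paper's setup: a toy scoring function stands in for an Atari game, a small weight vector for the deep network, and a simple truncation-selection EA for NCS.

```python
import numpy as np

rng = np.random.default_rng(0)

def episode_reward(weights):
    """Play one 'game': the reward is higher the closer the policy
    weights are to a hidden target vector (a stand-in for a game score)."""
    target = np.array([1.0, -2.0, 0.5])
    return -float(np.sum((weights - target) ** 2))

def erl_train(pop_size=8, dim=3, epochs=200, sigma=0.3):
    # each individual is a full weight vector of the policy
    population = [rng.normal(size=dim) for _ in range(pop_size)]
    for _ in range(epochs):
        # one game playing per candidate policy = one evaluation
        rewards = [episode_reward(w) for w in population]
        # keep the better half, refill by mutating the survivors
        order = np.argsort(rewards)[::-1]
        survivors = [population[i] for i in order[: pop_size // 2]]
        children = [w + rng.normal(scale=sigma, size=dim) for w in survivors]
        population = survivors + children
    # output the best policy at the final epoch
    return max(population, key=episode_reward)

best = erl_train()
```

The loop mirrors the flowchart: evaluate every candidate policy by playing one game, then let the EA produce the next population from the rewards.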
Fig. 1 The flowchart of Evolutionary Reinforcement Learning
ERL enjoys multiple merits offered by EAs. First, the population-based search mechanism of EAs can be exploited to improve both the exploration and the parallel acceleration of ERL [11]. Second, the search direction of EAs can be guided by the relative order of the individuals in terms of their objective qualities, which makes ERL less sensitive to evaluation noise [14]. Lastly, EAs can be used to optimize non-differentiable models (e.g., trees [15]), thus ERL is not restricted to training neural policies.
Section II.B Negatively Correlated Search
NCS is a recently proposed EA that features a parallel exploration behavior [17]. Technically, NCS regards the population of N individuals as N exclusive search processes. Each search process is evolved with a traditional EA to maximize the objective quality of its individual. Meanwhile, the diversity among different search processes is explicitly measured and maximized to improve the exploration ability. Different from most existing EAs, which measure the diversity as the differences among the individuals of the current population, NCS measures the diversity as the differences among the individuals of the next population. By maximizing the diversity of the unsampled next population, it is straightforward to diversify the search processes in a controllable manner. As a result, each search process can visit a different region of the search space while preserving high objective quality, finally forming a parallel exploration search behavior (see Fig.2 for illustration).
Fig.2 The parallel exploration search pattern of NCS is illustrated on a 2-D multi-modal function, i.e., F19 in the CEC'2005 continuous optimization benchmark [28], where 4 search processes (i.e., 4 individuals) explore different regions of the search space and successfully locate 4 optima.
Take the instantiated NCS-C [17] as an example. At each iteration, the i-th search process preserves a parent individual x_i and generates an offspring individual x_i' from the Gaussian distribution N(x_i, Σ_i). The Gaussian distribution basically describes how the i-th search process will generate the next individual. Thus, if the Gaussian distributions of different search processes can be explicitly forced to be different, they are highly likely to generate diversified individuals. To realize this idea, three steps are conducted in NCS-C. First, the difference between distributions is measured by the Bhattacharyya distance, which considers the differences of both mean vectors and covariance matrices and is able to quantify the "overlap" of pairwise Gaussian distributions. Second, the diversity of the next population is modeled as the differences among all search processes and is maximized in parallel with respect to each search process. Specifically, for each search process, either the parent distribution N(x_i, Σ_i) or the offspring distribution N(x_i', Σ_i') is selected for the next iteration to maximize the whole diversity. Third, while selecting either the parent or the offspring, their objective qualities are also considered. Thus, a balance between exploration (by maximizing the diversity) and exploitation (by maximizing the objective quality) can be readily achieved via a trade-off parameter, which in turn is managed to control the distributions of the search processes to be negatively correlated for parallel exploration. The pseudo code of NCS-C is given in Algorithm 1 for illustration. For more details of NCS, please refer to [17].
Algorithm 1
NCS-C
1. Initialize N solutions x_i randomly, i = 1, ..., N.
2. Evaluate the N solutions with respect to the objective function f.
3. Repeat until stop criteria are met:
4.   For i = 1 to N:
5.     Generate an offspring solution x_i' from the distribution p_i ~ N(x_i, Σ_i).
6.     Calculate f(x_i'), d(p_i) and d(p_i').  // d denotes the diversity
7.     If f(x_i') + λ·d(p_i') > f(x_i) + λ·d(p_i)  // λ denotes the trade-off parameter
8.       Update (x_i, Σ_i) with (x_i', Σ_i').
9.   Update Σ_i according to certain EA rules.
10. Output the best solution ever found with respect to f.
Section II.C Cooperative Coevolution
EAs basically work by randomized sampling in the search space, which means their performance usually drops heavily when the search space enlarges significantly. To solve such large-scale problems with EAs, a commonly adopted method is CC [22]. CC follows the classic divide-and-conquer methodology: it first divides the decision variables into multiple small-scale groups, and then optimizes each group separately with an EA. Ideally, each EA only needs to tackle a small-scale sub-problem while the original large-scale problem can still be effectively solved [22]. The major difficulty of applying the divide-and-conquer idea to EAs lies in how to evaluate the partial solutions in each sub-problem. To solve this problem, CC first complements the low-dimensional partial solutions to the dimensionality of the original objective function, and then evaluates them accordingly. However, different complementary vectors for the same partial solution can result in different objective qualities. Thus, considerable research effort has been devoted to studying the complementing methods so as to improve the accuracy of the evaluation of partial solutions [23]. Plenty of approaches have also been proposed for the grouping of decision variables, aiming to make the sub-problems independent from each other so that even random complementary vectors can lead to accurate evaluations [24]. Also note that other methods that do not need to complement partial solutions exist [13].
Despite all the technical details, most CC methods basically complement all the partial solutions in a sub-problem with the same complementary vector [22]. This is reasonable in many cases, since keeping the complementary vector the same for all partial solutions can largely reduce the noise in calculating their relative order, which guides the search direction of many EAs. However, it will be shown in the next section that such a strategy does not work well for NCS due to its parallel exploration search behavior. For illustration, the framework of classic CC is briefly given as follows.
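The shared-complement evaluation just described can be sketched on a toy problem. This is an illustrative sketch, not the paper's implementation: a negated sphere function stands in for the objective, a simple hill-climber for the sub-problem EA, and the complementary vector is a single shared context updated per sub-problem.

```python
import numpy as np

rng = np.random.default_rng(2)

def f(x):
    # objective to maximize: negated sphere function (optimum at the origin)
    return -float(np.sum(np.asarray(x) ** 2))

def classic_cc(dim=4, n=6, m=2, iters=200, sigma=0.3):
    groups = np.array_split(np.arange(dim), m)   # fixed decomposition
    context = rng.normal(size=dim)               # the SHARED complementary vector
    pop = [rng.normal(size=dim) for _ in range(n)]
    for _ in range(iters):
        for g in groups:                         # one sub-problem at a time
            for i in range(n):
                child = pop[i].copy()
                child[g] += rng.normal(scale=sigma, size=len(g))
                # parent and child partial solutions are both completed
                # with the same context before evaluation
                old = context.copy(); old[g] = pop[i][g]
                new = context.copy(); new[g] = child[g]
                if f(new) > f(old):
                    pop[i] = child
            # the context adopts the best partial solution of this sub-problem
            scores = []
            for i in range(n):
                cand = context.copy(); cand[g] = pop[i][g]
                scores.append(f(cand))
            context[g] = pop[int(np.argmax(scores))][g]
    return context

solution = classic_cc()
```

On this unimodal toy problem the shared context is harmless; the next section argues why it becomes harmful when the population is deliberately kept diverse across multiple basins.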
Algorithm 2
Classic Cooperative Coevolution
1. Initialize N solutions x_i randomly, i = 1, ..., N.
2. Repeat until stop criteria are met:
3.   Divide the D-dimensional problem into M sub-problems.
4.   For j = 1 to M:
5.     Evolve all the partial solutions s_{i,j} for one iteration with an EA, i = 1, ..., N.
6.     For i = 1 to N:
7.       Complement each partial solution s_{i,j} with the same vector c_j as [s_{i,j}, c_j].
8.       Evaluate each complemented solution f([s_{i,j}, c_j]) as the quality of s_{i,j}.
9. Output the best complemented solution ever found with respect to f.
Section III Cooperative Coevolutionary Negatively Correlated Search
This section first discusses why the traditional CC framework is not suitable for NCS, and then proposes an NCS-friendly CC framework accordingly. The resultant CCNCS, obtained by incorporating NCS into the variant CC framework, is also described in detail.
Section III.A The NCS-Friendly CC Framework
As mentioned above, traditional CC complements all the partial solutions of a sub-problem with the same complementary vector and then evaluates the complemented solutions with respect to the objective function. This is useful for most EAs, as their partial solutions in each sub-problem often tend to be similar at the convergence stage of the search, where the population most likely locates in the same sub-space of a basin of attraction (i.e., they will converge to the same optimum). In this case, the differences of the evaluation qualities among all partial solutions can be made marginal by complementing them with the same complementary vector. By "marginal", we mean this optimum will very likely be found by the population, no matter by which specific individual. Distinctly, NCS explicitly asks the search processes to be negatively correlated, where the partial solutions in each sub-problem tend to be very different from each other at the later stage of the search, and the population may locate multiple different basins of attraction. In this case, complementing all the partial solutions with the same complementary vector is problematic. For example, Fig.3 illustrates a 2-D search space with two optima located at the left-upper corner and the right-lower corner. Suppose NCS has obtained two partial solutions in the first sub-problem, denoted as x_{1,1} and x_{2,1}, respectively. If they are complemented with x_{1,2}, it is clear that f([x_{1,1}, x_{1,2}]) is better than f([x_{2,1}, x_{1,2}]), and the left-upper optimum will be found. If they are complemented with x_{2,2}, the right-lower optimum will be found instead. In either case, traditional CC deteriorates the parallel exploration ability of NCS, as one of the two optima will be missed. As the iterations go on, the population of NCS will finally converge to the partial solution that best fits the complementary vector, and thus fail to form the featured parallel exploration.
One can also complement the partial solutions with multiple complementary vectors (e.g., both x_{1,2} and x_{2,2} in Fig.3) to fit each promising partial solution with redundancy. However, this is computationally expensive, as multiple additional objective function evaluations will be consumed.
Fig.3 Traditional CC that complements two partial solutions of a sub-problem, i.e., x_{1,1} and x_{2,1}, with the same vector, i.e., either x_{1,2} or x_{2,2}, is not suitable for NCS-like search methods, as one of the optima (marked as red solid circles) will be missed.
In this paper, we propose to complement each partial solution of NCS with a different complementary vector so as to maximally preserve the parallel exploration search behavior. The only difference between the NCS-friendly CC and traditional CC lies in step 7 of Algorithm 2. Specifically, in each j-th sub-problem, each i-th partial solution s_{i,j} is complemented with a distinct complementary vector c_{i,j} in the new CC framework, rather than the same complementary vector c_j as in traditional CC. This helps to avoid the situation in Fig.3 where one complementary vector cannot fit diversified promising partial solutions. For the content of the i-th complementary vector c_{i,j}, we simply set it as c_{i,j} = x_i \ s_{i,j}. That is, each i-th complementary vector in the j-th sub-problem is calculated as the quotient of x_i with respect to s_{i,j}, i.e., it is composed of the current i-th partial solutions in the other sub-problems. As the complementary vector is evolved together with the partial solution, i.e., [s_{i,j}, c_{i,j}] = x_i, this complementing strategy should also be able to well characterize the quality of each s_{i,j}. This strategy actually exploits the parallel exploration feature itself, namely that different partial solutions will locate the sub-spaces of different basins of attraction. Also note that this strategy does not cost extra objective function evaluations, as each partial solution is still evaluated once.
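The per-individual complementing rule can be illustrated with a few lines of code. The `complement` helper and the concrete vectors below are hypothetical, chosen only to show that [s_{i,j}, c_{i,j}] is simply the i-th individual with the j-th block swapped in.

```python
import numpy as np

def complement(x_i, group, partial):
    """Build the full candidate [s_ij, c_ij]: the complementary vector
    c_ij is x_i's own coordinates outside the j-th sub-problem."""
    full = np.array(x_i, dtype=float)   # copy of the individual
    full[group] = partial               # only sub-problem j's block is replaced
    return full

# two diversified individuals that (say) sit near two different optima
x1 = np.array([0.0, 0.0, 1.0, 1.0])
x2 = np.array([5.0, 5.0, 4.0, 4.0])
group0 = np.array([0, 1])               # variables of the first sub-problem

cand1 = complement(x1, group0, np.array([0.1, -0.1]))  # keeps x1's own context
cand2 = complement(x2, group0, np.array([5.1, 4.9]))   # keeps x2's own context
```

Each partial solution is evaluated in its own context, so neither individual is dragged toward the basin of the other, and each still costs exactly one objective function evaluation.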
Section III.B The Details of CCNCS
In this section, NCS-C is incorporated into the variant CC framework to obtain the proposed CCNCS, whose detailed pseudo code is illustrated in Algorithm 3.
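As a complement to the pseudo code, one runnable sketch of the whole loop is given below. This is an illustrative approximation, not the authors' implementation: a toy 4-D objective with two optima stands in for a game score, the Gaussians are isotropic so the Bhattacharyya distance has a simple closed form, diversity is taken as the distance to the nearest neighbouring process, and step-size adaptation is omitted.

```python
import numpy as np

rng = np.random.default_rng(4)

def f(x):
    # toy objective to maximize: two Gaussian bumps, i.e., two optima
    return float(np.exp(-np.sum((x - 2.0) ** 2)) + np.exp(-np.sum((x + 2.0) ** 2)))

def bhattacharyya(m1, s1, m2, s2):
    # closed form for isotropic Gaussians N(m, s^2 I)
    s = (s1 ** 2 + s2 ** 2) / 2.0
    return float(np.sum((m1 - m2) ** 2) / (8.0 * s)
                 + 0.5 * len(m1) * np.log(s / (s1 * s2)))

def ccncs(dim=4, n=4, iters=150, lam=1.0, pool=(2, 3, 4)):
    xs = [rng.uniform(-4.0, 4.0, size=dim) for _ in range(n)]  # N search processes
    sigmas = [1.0] * n
    for _ in range(iters):
        # step 3: random grouping with a varied number of sub-problems M
        m = int(rng.choice(pool))
        groups = np.array_split(rng.permutation(dim), m)
        for g in groups:                  # steps 4-12, one sub-problem at a time
            for i in range(n):
                child = xs[i].copy()
                child[g] += rng.normal(scale=sigmas[i], size=len(g))
                # diversity: distance of a process to its nearest neighbour
                d_par = min(bhattacharyya(xs[i], sigmas[i], xs[j], sigmas[j])
                            for j in range(n) if j != i)
                d_off = min(bhattacharyya(child, sigmas[i], xs[j], sigmas[j])
                            for j in range(n) if j != i)
                # each partial solution carries its OWN complement, so the
                # complemented candidate is simply the full individual
                if f(child) + lam * d_off > f(xs[i]) + lam * d_par:
                    xs[i] = child
    return xs

processes = ccncs()
```

Note how the per-individual complement makes the evaluation step trivial: since [s_{i,j}, c_{i,j}] = x_i, completing a partial solution is just writing its block back into the full individual.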
Algorithm 3
CCNCS
1. Initialize N individuals x_i randomly, i = 1, ..., N.
2. Repeat until stop criteria are met:
3.   Divide the D-dimensional problem into M sub-problems.
4.   For j = 1 to M:
5.     For i = 1 to N:
6.       Generate an offspring solution s_{i,j}' from the distribution p_{i,j} ~ N(s_{i,j}, Σ_{i,j}).
7.       Calculate the diversity of both parent and offspring as d(p_{i,j}) and d(p_{i,j}').
8.       Complement s_{i,j} and s_{i,j}' as [s_{i,j}, c_{i,j}] and [s_{i,j}', c_{i,j}], respectively.
9.       Evaluate f([s_{i,j}, c_{i,j}]) and f([s_{i,j}', c_{i,j}]) as the qualities of s_{i,j} and s_{i,j}'.
10.      If f([s_{i,j}', c_{i,j}]) + λ·d(p_{i,j}') > f([s_{i,j}, c_{i,j}]) + λ·d(p_{i,j})
11.        Update (s_{i,j}, Σ_{i,j}) with (s_{i,j}', Σ_{i,j}').
12.      Update Σ_{i,j} according to NCS rules.
13. Output the best complemented solution ever found with respect to f.
CCNCS first randomly initializes N individuals (i.e., search processes) in the original D-dimensional search space. At each iteration of the main loop, the D decision variables are first decomposed into M groups. In the literature, there are many well-established decomposition methods [25]. For simplicity, random grouping with varied values of M is employed [26]. To be brief, when executing step 3, the value of M is first randomly selected from a fixed pool of candidate values, and the D decision variables are then randomly divided into M groups. Then NCS is run to optimize each sub-problem (steps 5-12). As NCS conducts N search processes in parallel, the detailed search operators can be executed with respect to each partial solution. Specifically, for the i-th partial solution in the j-th sub-problem s_{i,j}, its offspring s_{i,j}' is first generated from the distribution p_{i,j} ~ N(s_{i,j}, Σ_{i,j}). Then the Bhattacharyya distance between the parent distribution p_{i,j} ~ N(s_{i,j}, Σ_{i,j}) and the distributions of the other search processes is calculated and denoted as d(p_{i,j}). Similarly, the Bhattacharyya distance between the offspring distribution p_{i,j}' ~ N(s_{i,j}', Σ_{i,j}') and the distributions of the other search processes is calculated and denoted as d(p_{i,j}').
To measure the qualities of s_{i,j} and s_{i,j}', they are first respectively complemented as [s_{i,j}, c_{i,j}] and [s_{i,j}', c_{i,j}], and evaluated with respect to the objective function f. In fact, f([s_{i,j}, c_{i,j}]) can be obtained from the previous iteration, so it costs no function evaluations except at the first iteration. Then, by considering both the objective qualities and the diversity, i.e., f([s_{i,j}', c_{i,j}]) + λ·d(p_{i,j}') and f([s_{i,j}, c_{i,j}]) + λ·d(p_{i,j}), either the parent partial solution or its offspring is selected for the next iteration. Accordingly, the distribution of either the parent or the offspring should also be passed to the next iteration, as it describes the behavior of the search process. After that, the distribution should be adjusted based on the feedback of the search process. Basically, the distribution concerns the mean vector and the covariance matrix. Between them, the mean vector is identical to the partial solution and should not be adjusted separately. The covariance matrix can be adjusted via various rules (e.g., the 1/5 success rule adopted in [17]). After the main loop terminates, the best complemented solution ever found is output as the final solution of CCNCS.
Section IV Empirical Studies
This section verifies the effectiveness of CCNCS on popular RL problems, i.e., playing Atari games. First, the Atari games and the CCNCS-based ERL method are described. After that, the experimental protocol, including the experimental settings and the compared state-of-the-art algorithms, is briefly introduced. Finally, the empirical results are analyzed to discuss the effectiveness of CCNCS.
Section IV.A CCNCS for Playing Atari Games
Atari 2600 is a set of video games that have been popular for benchmarking RL approaches [10]. Atari games cover different types of difficult tasks, such as obstacle avoidance (e.g., Freeway and Enduro), shooting (e.g., Beamrider and SpaceInvaders), mazes (e.g., Alien and Venture), ball games (e.g., Pong), Mario-like games (e.g., Montezuma's Revenge), sports (e.g., Bowling and DoubleDunk), and other types. These types of games provide significantly different environmental settings and interaction modes for the agent, and thus are able to test RL approaches on different tasks in terms of maximizing the long-term reward. Another feature of Atari games is that the reward of an action is delayed, as it is only returned after a game has finished. Furthermore, the environment of Atari games can involve considerable randomness, so the rewards may be noisy. For the underlying empirical studies, the above-mentioned 10 games are selected to assess the performance of CCNCS. When playing Atari games, CCNCS fully follows the ERL flowchart depicted in Fig.1. Specifically, each individual is represented as a vector of all connection weights of the neural policy model (i.e., one individual of CCNCS represents one candidate policy), and is evolved based on Algorithm 3. Each epoch of training is equivalent to one iteration of CCNCS, and one game played with a candidate policy is one objective function evaluation of an individual of CCNCS. In this paper, the policy suggested in [8] is adopted, which is composed of three convolutional layers and two fully connected layers. The architecture of the network can be seen in Table I; it involves nearly 1.7 million connection weights to be optimized. With the convolutional layers, the policy is able to directly process the raw pixel data from the videos, and no extra effort is needed to preprocess the environmental states.
Table I The Network Architecture of the Agent
Layer | Input | Output | Kernel Size | Stride
(rows for Conv1-Conv3, Fc1, and Fc2; the per-layer entries are not recoverable here, apart from the 512-unit fully connected layer)
Section IV.B Experimental Protocol
Three state-of-the-art DRL approaches are employed as the compared algorithms, i.e., A3C [8], PPO [9], and CES [10]. Among them, A3C and PPO are popular gradient-based methods that train the network with traditional back-propagation. CES is a recently proposed ERL method that utilizes canonical Evolution Strategies to directly optimize the policy network. Comparing with these three methods suffices to assess how CCNCS performs against the state of the art. Besides, in order to show how CC facilitates NCS, we directly apply NCS to the Atari games as a baseline. All the above four algorithms train the same policy network as CCNCS. Each algorithm terminates the training phase in a game when the total budget runs out, and the best policy network at the final epoch is returned for testing. The quality of the final policy is measured by the testing score, i.e., the averaged score of 200 repeated game plays without frame limitations. Given that the environment is randomly initialized, each algorithm is run three times, i.e., each algorithm obtains three testing scores on each game. The total time budget is set as the total number of game frames that each training phase is allowed to consume. For the three ERL methods (i.e., CES, NCS, and CCNCS), the total game frames are set to 100 million. For A3C and PPO, as they work quite differently due to back-propagation, it is unfair to set the same total game frames as for the ERL methods. On this basis, we counted the game frames consumed by the well-established CES and A3C on the same hardware configuration with the same given computational run time in the same games. It was found that the ratio of the consumed game frames between CES and A3C is about 2.5. As a result, the total game frames are set to 40 million for A3C and PPO for fairness. To discretize the games for the agent's action execution and state acquisition, the frame skip is set to 4.
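The budget arithmetic behind these settings can be checked directly (the numbers are those of the protocol; the variable names are illustrative):

```python
# With a frame skip of 4, the number of agent actions equals the
# consumed game frames divided by 4.
FRAME_SKIP = 4

erl_frames = 100_000_000    # CES / NCS / CCNCS budget
grad_frames = 40_000_000    # A3C / PPO budget (from the measured ~2.5x frame ratio)

erl_actions = erl_frames // FRAME_SKIP     # 25 million actions
grad_actions = grad_frames // FRAME_SKIP   # 10 million actions
frame_ratio = erl_frames / grad_frames     # 2.5
```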
In this case, in each training phase, the agent is allowed to take 25 million actions for the derivative-free methods and 10 million actions for the gradient-based methods. For A3C, PPO, and CES, the hyper-parameters suggested in their original papers are directly employed for the comparisons. For CCNCS, the population size N is set to 6. The covariance of the Gaussian distribution is initialized as 0.2 for each dimension. The number of sub-problems M is randomly picked from the set {2, 3, 4} at each iteration. The other hyper-parameters of CCNCS follow the original settings of NCS in [17]. To keep it simple, NCS shares the same hyper-parameters with CCNCS, except that it does not divide the problem into M sub-problems.
Section IV.C Results and Analysis
The three repeated testing scores of each algorithm on the 10 games are shown in Table II, where the best averaged score of the three testing scores in each game is marked in bold. It can be seen that CCNCS generally performs the best among all compared algorithms, and NCS also obtains competitive results in most cases. Specifically, for Alien, Beamrider, Freeway, Montezuma's Revenge, Pong, SpaceInvader, Venture, Bowling, and DoubleDunk, CCNCS significantly outperforms the three state-of-the-art methods within the same time budget. For Enduro, CCNCS still performs better than A3C and CES, while performing less satisfactorily than PPO. To further demonstrate the effectiveness of CCNCS, we run it with 50 million game frames, which is 50% less than the time budget of the other algorithms. The results are listed in the rightmost column of Table II. It is clearly seen that the advantages persist: CCNCS can gain up to twice the scores of the state-of-the-art methods in most of the tested games within half the time. Among all tested games, Montezuma's Revenge is arguably the hardest one, as traditional well-established methods (including the three compared ones) can hardly gain any score unless human experience is incorporated to guide the policy [28]. This is mainly because the reward distribution of this game is so sparse that very few action sequences hit an effective reward. More intuitively, the search space of Montezuma's Revenge appears to be very flat, and very little information can be utilized to guide the search (i.e., to optimize the policy). On this basis, Montezuma's Revenge might be the most appropriate problem for assessing the exploration ability of RL methods. As both NCS and CCNCS can break the tie, the motivation of this paper, that applying NCS to ERL can benefit the exploration ability, is verified.
Comparing CCNCS with NCS suffices to show that the proposed NCS-friendly CC framework is able to facilitate NCS in the large-scale search space. That is, by consuming half the time of NCS, CCNCS can still significantly outperform NCS in almost all tested games. In contrast, without using CC, the advantage of NCS over the three state-of-the-art algorithms is much weaker than that of CCNCS.
Table II The testing scores of the compared algorithms on the ten Atari games (three runs and their average per method; methods in column order: CES, PPO, A3C, NCS, CCNCS with the full budget, and CCNCS with 50% of the budget; the best average per game is marked in bold in the original). The rows cover Alien, Beamrider, Enduro, Freeway, Montezuma's Revenge, Pong, SpaceInvader, Venture, Bowling, and DoubleDunk; only the Pong and DoubleDunk entries are legible here:
Pong (Run 1/Run 2/Run 3, Average): CES -19.9/-5.7/6.3, -6.4; PPO -19.8/-19.9/-20.0, -19.9; A3C -21.0/-21.0/-21.0, -21.0; NCS -9.8/-9.1/-9.6, -9.5; CCNCS 4.8/2.9/4.0; CCNCS (50%) -9.9/11.3/-10.8, -3.5
DoubleDunk (Run 1/Run 2/Run 3, Average): CES -3.3/-2.8/-4.1, -3.4; PPO -3.5/-3.2/-3.0, -3.2; A3C -2.0/-1.5/-2.8, -2.1; NCS -1.5/-0.8/-1.7, -1.3; CCNCS -0.6/0.0/-0.1, -0.2; CCNCS (50%) -0.4/0.0/-0.4, -0.3
Section V Conclusions
This paper studies how to effectively endow DRL approaches with exploration ability. Focusing on the emerging ERL techniques that employ EAs to optimize the parameters of the policies, this paper argues that the recently presented NCS can be particularly appropriate for ERL due to its parallel exploration search behavior. To apply NCS to optimize the millions of connection weights of neural policies, this paper employs the CC framework to scale up NCS for the large-scale search space. The adopted CC framework is modified to specifically fit NCS-like methods that have parallel exploration search behaviors. Extensive studies are conducted on 10 Atari games to verify the proposed method. Empirical results show that the scaled-up NCS can perform more effectively than three state-of-the-art DRL methods, including both gradient-based and derivative-free ones, with 50% less computational time. Furthermore, the powerful exploration ability enabled by NCS successfully helps the agent gain scores on the difficult Montezuma's Revenge game, where most existing methods cannot score due to the highly sparse reward distribution.
References
[1] L. Zhang, K. Tang, and X. Yao, "Log-normality and Skewness of Estimated State/Action Values in Reinforcement Learning," in Proceedings of the 2017 Neural Information Processing Systems (NIPS'17), pp. 1804-1814, Long Beach, US, Dec. 2017.
[2] J. Oh, S. Singh, and H. Lee, "Value prediction network," in Proceedings of the 2017 Neural Information Processing Systems (NIPS'17), pp. 6118-6128, Long Beach, US, Dec. 2017.
[3] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath, "Deep Reinforcement Learning: A Brief Survey," IEEE Signal Processing Magazine, vol. 34, no. 6, pp. 26-38, Nov. 2017.
[4] V. Mnih, K. Kavukcuoglu, D. Silver, et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529-533, 2015.
[5] D. Silver, et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484-489, 2016.
[6] N. C. Luong, et al., "Applications of Deep Reinforcement Learning in Communications and Networking: A Survey," IEEE Communications Surveys and Tutorials, vol. 21, no. 4, pp. 3133-3174, 2019.
[7] L. Zhang, K. Tang, and X. Yao, "Efficient exploration is crucial to achieving good performance in reinforcement learning," in Proceedings of the 2019 Neural Information Processing Systems (NIPS'19), Vancouver, Canada, Dec. 2019, in press.
[8] V. Mnih, A. P. Badia, M. Mirza, et al., "Asynchronous Methods for Deep Reinforcement Learning," in Proceedings of the 33rd International Conference on Machine Learning, pp. 1928-1937, 2016.
[9] J. Schulman, et al., "Proximal Policy Optimization Algorithms," arXiv:1707.06347, 2017.
[10] P. Chrabąszcz, I. Loshchilov, and F. Hutter, "Back to Basics: Benchmarking Canonical Evolution Strategies for Playing Atari," in Proceedings of the 27th International Joint Conference on Artificial Intelligence, pp. 1419-1426, 2018.
[11] T. Salimans, et al., "Evolution strategies as a scalable alternative to reinforcement learning," arXiv:1703.03864, 2017.
[12] M. M. Drugan, "Reinforcement learning versus evolutionary computation: A survey on hybrid algorithms," Swarm and Evolutionary Computation, pp. 228-246, 2019.
[13] P. Yang, K. Tang, and X. Yao, "Turning High-dimensional Optimization into Computationally Expensive Optimization," IEEE Transactions on Evolutionary Computation, vol. 22, no. 1, pp. 143-156, 2018.
[14] C. Qian, Y. Yu, K. Tang, Y. Jin, X. Yao, and Z. Zhou, "On the Effectiveness of Sampling for Evolutionary Optimization in Noisy Environments," Evolutionary Computation, vol. 26, no. 2, pp. 237-267, 2018.
[15] Z. Zhou and J. Feng, "Deep forest: towards an alternative to deep neural networks," in Proceedings of the International Joint Conference on Artificial Intelligence, pp. 3553-3559, 2017.
[16] M. Črepinšek, S.-H. Liu, and M. Mernik, "Exploration and exploitation in evolutionary algorithms: A survey," ACM Computing Surveys, vol. 45, no. 3, p. 35, 2013.
[17] K. Tang, P. Yang, and X. Yao, "Negatively Correlated Search," IEEE Journal on Selected Areas in Communications, vol. 34, no. 3, pp. 542-550, Mar. 2016.
[18] G. Li, C. Qian, C. Jiang, X. Lu, and K. Tang, "Optimization based Layer-wise Magnitude-based Pruning for DNN Compression," in Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI 2018), pp. 2383-2389, Stockholm, Sweden, 2018.
[19] Y. Lin, H. Liu, G. Xie, and Y. Zhang, "Time series forecasting by evolving deep belief network with negative correlation search," in Proceedings of the 2018 Chinese Automation Congress (CAC), pp. 3839-3843, Xi'an, China, 2018.
[20] D. Jiao, P. Yang, L. Fu, L. Ke, and K. Tang, "Optimal Energy-Delay Scheduling for Energy-Harvesting WSNs With Interference Channel via Negatively Correlated Search," IEEE Internet of Things Journal, vol. 7, no. 3, pp. 1690-1703, Mar. 2020.
[21] P. Yang, K. Tang, J. A. Lozano, and X. Cao, "Path Planning for Single Unmanned Aerial Vehicle by Separately Evolving Waypoints," IEEE Transactions on Robotics, vol. 31, no. 5, pp. 1130-1146, Oct. 2015.
[22] Z. Yang, K. Tang, and X. Yao, "Large scale evolutionary optimization using cooperative coevolution," Information Sciences, vol. 178, no. 15, pp. 2985-2999, 2008.
[23] L. Panait, "Theoretical convergence guarantees for cooperative coevolutionary algorithms," Evolutionary Computation, vol. 18, no. 4, pp. 581-615, 2010.
[24] M. N. Omidvar, X. Li, Y. Mei, and X. Yao, "Cooperative co-evolution with differential grouping for large scale optimization," IEEE Transactions on Evolutionary Computation, vol. 18, no. 3, pp. 378-393, Jun. 2014.
[25] S. Mahdavi, M. E. Shiri, and S. Rahnamayan, "Metaheuristics in large-scale global continues optimization: A survey," Information Sciences, vol. 295, pp. 407-428, Feb. 2015.
[26] Z. Yang, K. Tang, and X. Yao, "Multilevel cooperative coevolution for large scale optimization," in Proceedings of the 2008 IEEE World Congress on Computational Intelligence, pp. 1663-1670, Jun. 2008.
[27] Y. Aytar, et al., "Playing hard exploration games by watching YouTube," in Advances in Neural Information Processing Systems, pp. 2930-2941, 2018.
[28] P. N. Suganthan, N. Hansen, J. J. Liang, K. Deb, Y.-P. Chen, A. Auger, and S. Tiwari, "Problem definitions and evaluation criteria for the CEC 2005 special session on real-parameter optimization," Nanyang Technological University (NTU), Singapore, Tech. Rep., May 2005.