Generalization in Transfer Learning

A Preprint

Suzan Ece Ada, Emre Ugur, H. Levent Akin
Department of Computer Engineering, Bogazici University, Bebek, 34342 Besiktas/Istanbul
[email protected], [email protected], [email protected]

September 4, 2019

Abstract
Agents trained with deep reinforcement learning algorithms are capable of performing highly complex tasks, including locomotion in continuous environments. In order to attain human-level performance, the next step of research should be to investigate the ability to transfer the learning acquired in one task to a different set of tasks. Concerns about generalization and overfitting in deep reinforcement learning are not usually addressed in current transfer learning research. This issue results in underperforming benchmarks and inaccurate algorithm comparisons due to rudimentary assessments. In this study, we primarily propose regularization techniques in deep reinforcement learning for continuous control through the application of sample elimination and early stopping. First, the importance of including the training iteration among the hyperparameters in deep transfer learning problems is emphasized. Because source task performance is not indicative of the generalization capacity of the algorithm, we start by proposing various transfer learning evaluation methods that acknowledge the training iteration as a hyperparameter. In line with this, we introduce an additional step of resorting to earlier snapshots of policy parameters, depending on the target task, due to overfitting to the source task. Then, in order to generate robust policies, we discard the samples that lead to overfitting via strict clipping. Furthermore, we increase the generalization capacity in widely used transfer learning benchmarks by using an entropy bonus, different critic methods and curriculum learning in an adversarial setup. Finally, we evaluate the robustness of these techniques and algorithms on simulated robots in target environments where the morphology of the robot, the gravity and the tangential friction of the environment are altered from the source environment.

Keywords Robot Learning · Deep Reinforcement Learning · Transfer Learning · Continuous Control
Introduction

Inferring the general intuition of the learning process and harnessing it to learn a different task is necessary for an autonomous agent to operate in non-stationary real-life environments. Being able to adapt to changes in an environment while performing continuous control, as quickly as possible and by cross-task generalization, is preeminent to attaining artificial intelligence. Generalizing well among the tasks belonging to the same set often leads to higher performance in each.

Deep reinforcement learning methods require long training periods in the source domain to develop a strategy close to that of a human in the same source environment [1]. In real life, a robot will encounter different scenarios when executing tasks in a non-stationary environment. Waiting for a robot to interact millions of times with the environment instead of increasing its robustness to varying environmental dynamics is time-consuming and impractical. A robot is expected to generalize to a similar task it hasn't encountered, adequately and quickly, in order to coexist with humans in the real world. In order to obtain robust policies that can excel in the aforementioned tasks, the robot should not only learn walking but learn to walk robustly in altered target environments, and the learning attained in the source environment should
be transferable to morphologically different robots. Hence, the trade-off between the generalization capacity and the source task performance should be acknowledged when determining the most pertinent training model.

Analogous to the variety of ways humans carry out a simple task in the real world, robots can perform a task in continuous control environments distinctively, based on the design decisions made in the training phase. A human can perform locomotion in many different ways, ranging from close-to-the-ground running to careful tiptoeing on a rope. To increase the capabilities of robots in the same manner, we have focused on increasing the scope of abilities gained during learning.

Evaluation of generalization in deep reinforcement learning for discrete and continuous environments is still an open research area [2, 3, 4]. Similar to Zhao et al. [4], we first show that source task environment performance is not indicative of generalization capacity and target task performance. This statement is the origin of our methodology for achieving adequate performance in the target task environment. In continuous control environments, we leverage a recent policy gradient method, namely Proximal Policy Optimization (PPO) [5], to acquire knowledge in the form of neural network parameters. First, we show how the failure to recognize overfitting leads to inaccurate algorithm comparisons. We suggest transfer learning evaluation structures based on the policy buffer we propose and employ one of them in our evaluations. The policy buffer consists of promising policy network snapshots saved during training in the source task environment, and its design is inspired by human memory. Likewise, we propose new environments inspired by real-life humanoid robot scenarios for benchmarking transfer learning. First, we increased the range of the gravity and robot torso mass experiments used by Henderson et al.
[6], Rajeswaran et al. [7] and Pinto et al. [8], respectively, to demonstrate the capabilities of the methods we propose. Then, we introduce new morphology environments for the humanoid. Robots should be able to transfer the learning they have attained to each other, like humans. Accordingly, we designed two new target environments, named tall humanoid and short humanoid, each having different loss function constraints and morphologies than the standard humanoid source environment. Similarly, because carrying a heavy object is among the expectations of a service robot, we designed a humanoid delivery environment.

Recognition of the policy iteration as a hyperparameter not only prevents inaccurate algorithm evaluations but also increases the performance of recent algorithms. In this regard, we show that earlier iterations of policies perform well in harder target environments due to the regularization effect of early stopping. We introduce the method of strict clipping to discard samples that cause overfitting. This regularization technique is developed for PPO, but we discuss its possible applications to other algorithms in future directions. We propose a new advantage estimation technique for Robust Adversarial Reinforcement Learning (RARL) [8] in Section 3.2, named Average Consecutive Critic Robust Adversarial Reinforcement Learning (ACC-RARL), which involves both critics at each iteration. In morphological experiments, we further demonstrate that a hopper robot can hop with twice its original torso mass using learning attained in the standard environment with this technique [8]. We compare the generalization capacity of different training methods on a hopper robot, namely advantage estimation techniques, entropy bonuses [2] and different curricula [9] in the RARL setting. In contrast to previous work, we combine the entropy bonus with RARL and compare different critic architectures using the iteration of training as a hyperparameter.
These findings are beneficial in constructing a meaningful policy buffer, and we further assess the necessity of this evaluation technique by pointing out the striking variability in the target task performance under constant hyperparameters.

In Section II, studies on generalization in deep reinforcement learning and transfer learning are reviewed. The proposed method, along with the background, is detailed in Section III. The experimental setup and results are given in Sections IV and V, respectively. Finally, Section VI concludes with a summary of our contributions, a discussion of the results and future directions for research.
Generalization in Deep Reinforcement Learning
Evaluation of generalization in deep reinforcement learning is a trending research area [2, 3] that flourished from the need to extend the range of control tasks using existing skills. Because the testing and training environments are identical in deep reinforcement learning research, the proposed algorithms' robustness, which is essential to real-world deployment, is often neglected [3]. Since generalization in deep reinforcement learning is often neglected, algorithms are developed without the necessary intermediary step of increasing their generalization capacity.

Data is separated into a training set, a validation set and a test set in supervised learning problems. Cross-validation is used for hyperparameter tuning on the training set. The training set is divided into a predetermined number of groups, and within each hyperparameter search iteration, each group is used once as the held-out evaluation set. After all the hyperparameter candidates have been evaluated, the algorithm is ready to be tested on the unseen test set. Although hyperparameter optimization to increase the generalization capacity of the algorithm is an almost mandatory step in supervised learning problems, it is challenging to implement this method in deep reinforcement learning.
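The cross-validation procedure described above can be sketched in plain Python. Here `train` and `evaluate` are hypothetical stand-ins for an actual learner and scoring function, not part of any framework:

```python
import statistics

def k_fold_score(data, k, train, evaluate):
    """Estimate a hyperparameter setting's quality by averaging
    held-out performance over k folds of the training set."""
    fold_size = len(data) // k
    scores = []
    for i in range(k):
        held_out = data[i * fold_size:(i + 1) * fold_size]
        train_split = data[:i * fold_size] + data[(i + 1) * fold_size:]
        model = train(train_split)
        scores.append(evaluate(model, held_out))
    return statistics.mean(scores)

# Toy example: "training" memorizes the mean of the split, and
# "evaluating" scores how close the held-out mean is to it.
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
train = lambda split: statistics.mean(split)
evaluate = lambda model, held_out: -abs(model - statistics.mean(held_out))
score = k_fold_score(data, 3, train, evaluate)
```

Each hyperparameter candidate would receive such a score, and the best-scoring candidate is then trained on the full training set before the final test.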
Concerns over the reproducibility and evaluation of deep reinforcement learning algorithms in general have been brought up by Henderson et al. [10], where the effect of various hyperparameter selections on training performance is analyzed. In transfer learning, the design decisions are even more pivotal, since training environment performance is not informative about target environment performance. Consequently, few-shot learning via sampling from the target environment is used in these cases.

Cobbe et al. [2] developed a benchmark for generalization named CoinRun, in which a high number of test and training levels can be generated. The generalization capacity of various agents is compared using the percentage of levels solved in the test environment. Dropout, L2 regularization, data augmentation, batch normalization and increasing the stochasticity of the policy and the environment are the regularization techniques used in [2] to increase zero-shot test performance. Still, few-shot learning in the test environment is required to find the dropout probability, L2 weight, epsilon and entropy coefficient that increase the generalization capacity. Zhao et al. [4] studied generalization for continuous control by parametrizing domain shift, measuring generalization as testing Area Under Curve (AUC) under systematic shifts and noise scales introduced to the transition, observation and actuator functions. Similar to [2], Zhao et al. [4] compared regularization techniques for deep reinforcement learning, such as policy network size reduction, the soft actor-critic entropy regularizer [11], multi-domain learning [12], structured control nets [13] and adversarially robust policy learning (ARPL) [14], using AUC. In contrast to [4], we will not compare regularization techniques using the policies obtained after a predetermined number of training iterations.
Forward Transfer Learning
Transferring the knowledge and representations gained from one task to another task is called forward transfer learning [15]. In order to start from a more rewarding location in the parameter space, or to learn faster, most applications include transfer of the policy parameters to the new task after manipulation of the training phase or modification of the architecture [7, 16]. The transferred neural networks are expected to generalize to the new task and adapt to the new task's domain.

A transfer can occur from a large domain to a small domain and vice versa. In both cases, the possibility of negative transfer exists. If the algorithm performs worse than not using any transfer at all, this is called negative transfer learning. For instance, if the algorithm is initialized from an optimal point for the source task but a suboptimal point for the target task, insufficient exploration might occur. Eventually, this results in a negative transfer, thereby making random initialization a better choice.

When a transfer occurs from a larger domain to a smaller domain, it is called partial transfer learning. One example of this in computer vision is discussed in [17], where the target space is a subset of the source space. The transfer cases covered in their experiments include performing transfer from ImageNet 1000 [18] to Caltech 84, and from Caltech 256 [19] to ImageNet 84, where the corresponding numbers stand for the number of classes. They attempt to counteract the effects of negative transfer by discriminating the outlier classes from the source domain and maximizing the accuracy of the source and target distribution locations.

Transferring from simulation to the real world is oftentimes a tedious task. Model-free algorithms rely on samples; however, the cost of sampling from a real-world environment is high in robotics settings. Tobin et al. [12] implemented the method of domain randomization in the physics simulator to accurately localize objects in the real world for a manipulation task. Similarly, Sadeghi et al. [20] used domain variation in the 3D modeling simulator Blender [21] by generating distinct pieces of furniture and hallway textures to train a simulated quadcopter.

Adversarial scenarios have also been widely used for robotics tasks involving computer vision. For instance, Bousmalis et al. [22] and Tzeng et al. [23] used adversarial networks, where one neural network is optimized to discriminate real-world image data from simulation data, whereas the other network is optimized to generate simulator images that can fool the discriminator. The generator network comes up with better representations of the real image data as the discriminator approaches its minimum.

Deep reinforcement learning has benefited from adversarial implementations. Increasing the robustness of the policy leads to higher performance in target tasks involving domain change. Rajeswaran et al. [7] suggested a method to increase the robustness of the policy network in the Ensemble Policy Optimization (EPOpt) algorithm by training on an ensemble of different tasks. Their dual-step approach to performing transfer from one distribution of tasks to another consists of
Robust Policy Search, where policy optimization is performed using samples from a batch of different tasks, and Model-Based Bayesian Reinforcement Learning, where the source task distribution parameters are updated via experience on the target task during training. Experiments in EPOpt further assess that performing batch policy optimization solely on the worst performing subset, while discarding the higher performing trajectories, leads to more robust policies. Although not experimented with in the paper, this setting is applicable to problems where a limited number of trials are allowed in the real-world target setting. All in all, they provide satisfactory results by comparing EPOpt on the worst percentile with Trust Region Policy Optimization (TRPO) trained on a single source task for each mass, and with EPOpt without the use of worst-percentile subset extraction.
In Robust Adversarial Reinforcement Learning (RARL) [8], a separate adversarial network is created to destabilize the agent during training in a more intelligent way, for increased robustness in the target environment. Separate critic networks are consecutively optimized with their policy network counterparts in RARL. The reward functions of the protagonist and the antagonist are each other's negatives in the algorithm; thus, a single shared critic network is also a relevant architecture. For instance, in Dong's implementation, both of the networks are optimized redundantly, but only the protagonist critic network's resulting advantage estimation is used in policy optimization, resulting in a single shared critic network architecture [24].

Inspired by the fruitful results of RARL [8], Shioya et al. [9] proposed two extensions to the RARL algorithm by varying the adversary policies. Their first proposal is to add a penalty term to the adversary policy's reward function, using samples from the test domain, to adapt to the source task's transition function. This, however, is robustness tailored to each test task at hand, which requires sampling from the test domain, similar to the Bayesian update used in EPOpt [7]. The second extension is inspired by curriculum learning and selects the adversarial agents based on the progress of learning instead of naively taking the latest adversarial policy. Protagonist policies trained in harder environments do not guarantee a more robust performance; in fact, using previous adversaries randomly during training was explored in some previous works [25, 9]. In both Shioya et al.'s [9] and Bansal et al.'s [25] experiments, using the latest and hardest adversarial policy hinders the learning progression of the protagonist. In [9], first, multiple adversaries are created, and samples from the last T iterations of the adversary policies are ranked according to the progress of learning using linear regression. The ranks are used to determine the probabilities of using the samples, which are selected stochastically during training. Each adversary policy is maximized using the negative reward of the protagonist agent and the sum of the KL divergences from all the other adversary policies, to encourage diversity between the adversaries. Bansal et al. [25] used a Uniform(δv, v) distribution for determining the opponent humanoid's policy iteration, where δ is the percentage of the constructed set's coverage of the last T adversary policy iterations. Shioya et al. [9] used multiple adversaries and ranked each sample's performance to determine the set of samples that should be used for optimization. Experiments were done using the Hopper and Walker2d environments in MuJoCo [26] to compare the results to the RARL algorithm [9]. It was found that in the Hopper environment, ranking the policies to adapt the probability of their selection performs better than RARL and uniform random selection, but worse than the latter in the Walker2d environment. In both of the environments, using the less rewarding trajectories performed worse than all the other methods. In contrast to this result, optimizing over the worst performing samples generated more robust policies in the Hopper task when tested with different torso masses in the EPOpt algorithm [7], where an adversary policy is not present.

The adversarial algorithms can be considered a dynamic way of generating different suitable tasks for the agent at each iteration to encourage it to be more robust in unseen test environments [27]. Challenging tasks increase robustness by allowing agents to grasp complex latent features of the task.

Policy gradient methods have gained vast popularity over Deep Q-Networks (DQNs) [1], especially after the introduction of algorithms that constrain gradient movement in the policy parameter space, such as Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) [28, 5]. In our experiments, the open-source
OpenAI Baselines framework [29] will be used, where PPO [5] optimizes the actor policy network and the generalized advantage estimator (GAE) is used to optimize the critic value function network simultaneously [30].

Actor-critic architectures are used to estimate the advantage function [30], because the actual value of a state can only be approximated based on the samples rolled out so far in the reinforcement learning domain. Thus, a separate critic network is trained simultaneously with the policy network to predict the value of a given state. The value function loss is the mean squared difference between the target value V_t^{targ} and the predicted critic network output. In the OpenAI Baselines framework, the target value is the sum of the generalized advantage estimate and the sampled output of the value function network. In PPO, given in Eq. 1, if the advantage is negative then all the ratios below 1 − ε are clipped, and if the advantage is positive then all the ratios above 1 + ε are clipped, so the gradient of the clipped loss is 0.

L^{CLIP}(\theta) = \hat{E}_t \left[ \min\left( r_t(\theta)\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right)\hat{A}_t \right) \right]    (1)

The RARL algorithm will be used as the baseline algorithm in the hopper robot experiments, so detailed information is presented in Section 3.2. The proposed scenario is a two-player zero-sum discounted game where the agent tries to maximize its own reward, which is actively minimized by the adversary. Actions (a^{pro}) are sampled from the agent's
policy, denoted μ, whereas the actions (a^{adv}) are rolled out from the adversary's policy ν. Equation 2 shows the reward function of the agent; correspondingly, the reward function of the adversary is R^{adv} = −R^{pro}.

R^{pro} = E_{s \sim \rho,\ a^{pro} \sim \mu(s),\ a^{adv} \sim \nu(s)} \left[ \sum_{t=0}^{T-1} r\left(s_t, a^{pro}_t, a^{adv}_t\right) \right]    (2)

Instead of optimizing the minimax equation at each iteration, the reward functions of the agent and the adversary are maximized consecutively. The agent's policy is optimized iteratively while collecting samples using a fixed adversary policy. After that, the same number of rollouts are collected from the environment for the adversary's optimization, while the agent's last policy parameters are used. First, the authors compared TRPO and RARL using the default environment hyperparameters, without any disturbances, for 500 iterations on the HalfCheetah, Swimmer, Hopper and Walker2d tasks in MuJoCo environments. Training with RARL achieved a better mean reward than the baseline TRPO. RARL and TRPO were also evaluated with a trained adversary in the test environment, where RARL performed significantly better on all tasks compared to TRPO, although the training and test environments are identical in this scenario. They also tested the aforementioned tasks by varying torso mass and friction coefficients that were not seen during the training phase, and again RARL yielded better results than the TRPO baseline. An implementation of the algorithm using the rllab framework [31, 32, 33], and a single-critic, simultaneous PPO variant using the OpenAI Baselines framework, are open-sourced [24]. We will compare double, single and shared double critic structures in Sections 5.2.2 and 5.2.3 to evaluate how the critic affects the algorithm's generalization capability using a policy buffer.
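For concreteness, the clipped objective in Eq. 1 can be sketched in NumPy as follows. This is a minimal single-batch illustration of the formula, not the OpenAI Baselines implementation, and the variable names are ours:

```python
import numpy as np

def ppo_clip_loss(ratios, advantages, epsilon=0.2):
    """Clipped surrogate objective of PPO (Eq. 1).

    ratios:     pi_theta(a|s) / pi_theta_old(a|s) per sample
    advantages: advantage estimates per sample (e.g. from GAE)
    Returns the (to-be-maximized) clipped surrogate value.
    """
    unclipped = ratios * advantages
    clipped = np.clip(ratios, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # Per-sample minimum of the two terms, then empirical mean over the batch.
    return np.minimum(unclipped, clipped).mean()

ratios = np.array([0.5, 1.0, 1.5])
advantages = np.array([1.0, 1.0, -1.0])
loss = ppo_clip_loss(ratios, advantages, epsilon=0.2)
```

Samples whose ratio falls in the flat (clipped) branch of the minimum contribute a constant to the objective, so their gradient with respect to θ is zero.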
Policy Buffer

Performing continuous control tasks in different environments requires different policies. Policies trained with different hyperparameters show different control patterns; thus, we propose a policy buffer to store these policies trained on the same source task environment with the same loss function. In the proposed system, we show that it is possible to extract a comprehensive set of policies, representative of distinct control patterns, from just one source task environment.

Different snapshots of the policy network parameters taken during training in the source task environment perform dramatically differently in the target task environment. Overtraining in the source task environment decreases the testing area-under-curve performance, which depends on the expected return in the test environment [4]. Overfitting leads to worse results on the target environments as the distance between the target environment and the source task environment increases in the environment parameter space. In order to discover the most suitable policy network parameters for the unknown target task, we first take snapshots of the policy network at predetermined sampling-iteration intervals. The snapshots of the policies trained with different hyperparameters in the source task environment are saved in the policy buffer. Considering the striking difference between each policy snapshot's ability to transfer, evaluations at different iterations are necessary to compare various methods.
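The snapshot mechanism described above can be sketched as a small container class. Class and method names here are hypothetical, and the dictionary `params` merely stands in for actual network weights:

```python
import copy

class PolicyBuffer:
    """Stores snapshots of policy parameters taken during source-task
    training so earlier (less overfit) policies can be retrieved later."""

    def __init__(self, snapshot_interval):
        self.snapshot_interval = snapshot_interval
        self.snapshots = {}  # iteration -> parameters

    def maybe_save(self, iteration, params):
        if iteration % self.snapshot_interval == 0:
            # Deep-copy so later training updates don't mutate the snapshot.
            self.snapshots[iteration] = copy.deepcopy(params)

    def best_for_target(self, evaluate):
        """Pick the snapshot scoring highest under a target-task evaluator."""
        return max(self.snapshots.items(), key=lambda kv: evaluate(kv[1]))

buffer = PolicyBuffer(snapshot_interval=10)
for it in range(50):
    buffer.maybe_save(it, {"iteration": it})  # stand-in for network weights
# Toy evaluator: pretend the iteration-20 snapshot generalizes best.
best_iter, best_params = buffer.best_for_target(lambda p: -abs(p["iteration"] - 20))
```

In practice `evaluate` would be a (few-shot) rollout in the target or surrogate validation environment rather than a synthetic score.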
Taking the resulting parameters after a constant number of training iterations does not constitute a valid comparison, because the full scope of the algorithms' generalization capacity is omitted.

Sampling from the target environment during training is used in several transfer learning algorithms [7, 9]. These works treat the real world as the target environment, where minimizing the quantity of samples is the aim, and the simulator as the source environment, where the sampling phase is restricted only by computational resources.

Acknowledging the training iteration as a hyperparameter is analogous to using the early stopping regularization technique in supervised learning. In deep reinforcement learning settings where the difference between the source task and the target task can be parametrized, we propose designing a surrogate validation task. This task proves to be informative for hyperparameter optimization, identifying eminently generalizable policy snapshots for an alike target task. The surrogate validation task should have parameters closer to the target environment than to the source environment, and a few-shot learning setting on the surrogate validation task would be an adequate starting point for determining the policies that should be given priority during target environment sampling.

We should be aware that these target environment performance plots are unknown to us before sampling in the target environment.
Consequently, we suggest that an alternative proper comparison of algorithms can be made by computing the area above a predetermined threshold for each curve: if this area is bigger, the likelihood of choosing a robust policy from the snapshots of policies trained with the corresponding algorithm is higher. However, in our work we will plot the consecutive policy iterations of the algorithm that generated the best performing policy in the target environment, to show that an expert-level robust policy for a variety of environments has already been saved during training.
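One simple way to compute the suggested area-above-threshold statistic for a curve of per-snapshot target returns is sketched below. The per-snapshot returns are hypothetical numbers for illustration, not experimental results:

```python
def area_above_threshold(returns, threshold):
    """Sum of how far each snapshot's target-task return exceeds a threshold.

    A larger area means a higher chance that a randomly chosen snapshot
    from this training run is robust in the target environment.
    """
    return sum(max(r - threshold, 0.0) for r in returns)

# Hypothetical per-snapshot target returns for two algorithms.
algo_a = [900.0, 1500.0, 2200.0, 1800.0, 700.0]
algo_b = [1000.0, 1100.0, 1200.0, 1150.0, 1050.0]
threshold = 1000.0
area_a = area_above_threshold(algo_a, threshold)
area_b = area_above_threshold(algo_b, threshold)
```

In this toy comparison, algorithm A has the larger area even though some of its snapshots fall below the threshold, reflecting that picking a snapshot from A is more likely to yield a highly robust policy.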
Regularization via PPO Hyperparameter Tuning
Policy gradient algorithms are the building blocks of recent transfer learning and generalization algorithms. The choice of the clipping hyperparameter of Proximal Policy Optimization is crucial when using the algorithm as a transfer learning benchmark. The OpenAI Baselines framework and most of the literature use a clip parameter of 0.2 for continuous control tasks [29, 5, 25, 27, 34, 10]. In addition to that, the clipping parameter is discounted using a learning rate multiplier in the OpenAI Baselines framework, to encourage swift reaching of the asymptote for continuous control tasks in MuJoCo. In our experiments, we found that decaying the clipping parameter decreases the asymptotic performance of the algorithm in the Humanoid environment. After our suggestion, annealing of the clipping hyperparameter in the ppo1 algorithm was omitted from the OpenAI Baselines framework [29].

We hypothesize that, in a transfer learning setting, strict clipping can be used to discard the MDP tuples that lead to overfitting. In addition to early stopping, we propose strict clipping as a regularization technique for PPO to combat overfitting introduced by source task-specific samples. In order to construct a fair comparison between state-of-the-art transfer learning algorithms and their corresponding benchmarks, various lower values of the clipping parameter are analyzed. Strict clipping is performed by decreasing the clipping parameter to unconventionally low values, for instance as low as 0.01. We show that this method is a competitive benchmark for the transfer learning algorithms.
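To illustrate why lowering ε discards more samples, the following toy NumPy sketch (our own construction, with an assumed spread of probability ratios) counts the fraction of samples whose clipped-PPO gradient is zero:

```python
import numpy as np

def clipped_fraction(ratios, advantages, epsilon):
    """Fraction of samples whose clipped-PPO gradient is zero, i.e. samples
    effectively discarded by the clipping operation.

    With positive advantage, ratios above 1+eps land on the flat branch;
    with negative advantage, ratios below 1-eps do."""
    discarded = ((advantages > 0) & (ratios > 1.0 + epsilon)) | \
                ((advantages < 0) & (ratios < 1.0 - epsilon))
    return discarded.mean()

rng = np.random.default_rng(0)
ratios = rng.normal(loc=1.0, scale=0.1, size=10_000)  # assumed ratio spread
advantages = rng.normal(size=10_000)
frac_default = clipped_fraction(ratios, advantages, epsilon=0.2)
frac_strict = clipped_fraction(ratios, advantages, epsilon=0.01)
```

Under strict clipping (ε = 0.01) a much larger fraction of the batch is discarded than under the conventional ε = 0.2, which is the intended regularization effect.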
Adversarial Reinforcement Learning
Training the policy with the aim of general robustness to a range of possible unknown scenarios is one way of achieving successful initial test performance. Adversarial scenarios are inspired by the success of domain randomization during training. Introducing an adversary that destabilizes the agent using multidimensional forces, as in Robust Adversarial Reinforcement Learning (RARL), has produced successful results in continuous control tasks [8]. ARPL [14], another type of adversarial algorithm that uses adversarial noise, is among the regularization methods compared in [4]. However, it performed worse than vanilla PPO in most of the cases, where each regularization technique was compared on test performance after a predetermined number of timesteps sampled from the source environment. In contrast to prior work, we acknowledge the policy iteration as a hyperparameter in our comparisons and use RARL instead of ARPL in our evaluations.

There is a continuous competition in RARL that depends on the destabilization capability of the adversary. For instance, an adversary policy with a 2-dimensional output has restricted power due to its low-dimensional action space and might have a hard time destabilizing a humanoid robot with a 17-dimensional action space, depending on the strength of the force during training. However, an adversary with a 27-dimensional action space that applies a 3-dimensional force to each body component of the protagonist humanoid might even hinder the protagonist policy from reaching convergence. Accounting for overfitting is pivotal in finding the policy with the highest generalization capability when increasing the complexity of the training environment.
Thus, we will compare and discuss the variants of RARL by forming a policy buffer and extracting the most generalizable policies, to increase the capability of the algorithm.

In order to analyze the effect of different critic network architectures on the generalization capacity of the policy, we compare three critic architectures: the separate double critic networks used in RARL [31], the single critic network of Shared Critic Robust Adversarial Reinforcement Learning (SC-RARL) [24], and our proposal, Average Consecutive Critic Robust Adversarial Reinforcement Learning (ACC-RARL). Critic networks are inherently function approximators, so they are vulnerable to overfitting, as are the actor networks.

In RARL, the critic networks are separate from each other, and at each global iteration the protagonist and adversary policies are updated with the A^{GAE(γ,λ)} computed using the rewards from different trajectories and the outputs of separate, randomly initialized critic networks. In SC-RARL, the shared critic is updated by the rewards gained during the protagonist optimization phase, whereas in RARL each critic is updated only by the rewards gained during its corresponding policy's optimization phase. Since the advantage function computation for the adversary uses the negative of the protagonist's, SC-RARL is meaningful, but the protagonist's critic network is updated only by the trajectories sampled in the protagonist's optimization phase. Fewer samples and optimization iterations might act as a regularizer, but we should also explore ways of using the rewards from the other sampling iterations without overfitting. The total number of samples used to update
The total number of samples used to updatethe critic networks are the same for RARL and SC-RARL.We propose a third architecture named Average Consecutive Critic Robust Adversarial Reinforcement Learning (ACC-RARL) that computes advantage estimates using the mean of both critic’s output but consecutively and separatelyoptimizes each critic network along with their corresponding policies. By this method, we aim to decrease overfittingvia using double critic networks with different random initializations and restrict the movement of the critic in parameter6eneralization in Transfer Learning A P
REPRINT space by including the output of the previously updated critic in advantage estimation. The δ t values for both adversaryand protagonist in ACC-RARL algorithm are shown in Equation 3. δ protagonistt = ( − V protagonist ( s t ) + V adversary ( s t )) / r t + γ ( V protagonist ( s t +1 ) − V adversary ( s t +1 )) / δ adversaryt = − δ protagonistt (3)In ACC-RARL, each critic is updated by the average of two sequentially updated critic outputs and the rewards gainedduring their corresponding policy’s optimization phase. In RARL, policies are not informed of each other’s criticoutput but they rely on the similarity of 2 consecutive sets of rewards accumulated by a different group of agents. Onthe other hand, in ACC-RARL, both critics are updated by each others’ critic outputs with the rewards gained duringtheir optimization phase. Updating the critic function more frequently than the policies would increase overfitting soby ACC-RARL we aim to increase generalization capacity by encouraging them to optimize with different rewardbatches while considering each other’s critic outputs. Thus, each critic network observes all the rewards sampled in theenvironment but assigns more weight to the rewards gained during its optimization phase.Both the value function approximators and the policy networks have 2 hidden layers with size 64 and input layer of size11 for the hopper environment. All policy networks have a separate logarithm of standard deviations vector that has thesame size as the policy network’s output. These vectors are optimized simultaneously with the policy networks.Entropy bonus is used to aid in exploration by rewarding the variance in the multivariate Gaussian distribution of actionprobabilities by a coefficient c given in Equation 4. c S [ π θ ]( s t ) = c ∗ (cid:88) ( log ( σ ) + 0 . ∗ log (2 πe )) (4)Although entropy bonus is a part of PPO total loss function we did not include it in the loss function because it decreasedthe training performance. 
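Under our reading of Eq. 3, the protagonist's state value is the average of its own critic and the negated adversary critic (the adversary's reward is the protagonist's negative, so −V^{adversary} is a second estimate of the protagonist's value). The TD residuals can then be sketched as follows; the function name and toy numbers are ours:

```python
def acc_rarl_deltas(r_t, v_pro_t, v_adv_t, v_pro_next, v_adv_next, gamma=0.99):
    """TD residuals for ACC-RARL (Eq. 3).

    The state value used for the protagonist is the mean of its own
    critic and the negated adversary critic output."""
    value_t = (v_pro_t - v_adv_t) / 2.0
    value_next = (v_pro_next - v_adv_next) / 2.0
    delta_pro = r_t + gamma * value_next - value_t
    delta_adv = -delta_pro  # zero-sum game
    return delta_pro, delta_adv

# Toy check: when the critics agree (V_adv = -V_pro), the averaged value
# reduces to V_pro and the residual matches the usual single-critic form.
d_pro, d_adv = acc_rarl_deltas(r_t=1.0, v_pro_t=2.0, v_adv_t=-2.0,
                               v_pro_next=3.0, v_adv_next=-3.0, gamma=0.99)
```

These per-step residuals would then be combined into GAE advantages exactly as in standard PPO.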
Cobbe et al. [2] used PPO with a non-zero entropy coefficient to apply stochasticity as a regularization technique in discrete environments. Instead, we incorporate the entropy bonus into RARL for the continuous control cases, because decreasing training performance usually increases the generalization capacity of the algorithm. Its inclusion in the protagonist's and the adversary's loss functions will be discussed separately due to its destabilizing effect. Our hypothesis is motivated by work on competitive and adversarial environments suggesting that hard environments might hinder learning [25, 9]. Impelling the protagonist to explore proves beneficial in some cases, although it decreases performance in environments with no adversaries. Similarly, adding an entropy bonus to the adversary's loss function might prepare the protagonist for a wider range of target tasks through extended domain randomization. Correspondingly, the entropy bonus occasionally hinders the adversary from reaching peak performance and decreases the difficulty of the environment. All in all, snapshots of two uniquely trained policies are added to the policy buffer for each critic architecture.

Curriculum Learning is a recent branch of transfer learning that focuses on discovering the optimal arrangement of the source tasks in order to perform better on the target task. To excel in complex tasks, humans follow specifically designed curricula in higher education [35]. Just as a more personalized curriculum leads to more successful results for humans, this strategy is beneficial for robot learning in hard environments.

Similar to [25, 9], we construct a random curriculum by randomizing the adversary policy iterations during training. In our experiments, we compare the performance of protagonist policies trained with adversaries randomly chosen from the last T iterations.
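The entropy bonus of Equation 4, which some of the variants above add to the adversary's or protagonist's loss, can be computed directly from the policy's log-standard-deviation vector. This is a sketch; the function name and the coefficient value are illustrative.

```python
import numpy as np

def gaussian_entropy_bonus(log_std, c=0.01):
    """Entropy bonus of a diagonal Gaussian policy (Equation 4).

    log_std: the policy's vector of log standard deviations; c is the
    entropy coefficient (0.01 here is an illustrative value).
    """
    # Entropy of N(mu, diag(sigma^2)) = sum_i (log sigma_i + 0.5*log(2*pi*e))
    entropy = np.sum(log_std + 0.5 * np.log(2.0 * np.pi * np.e))
    return c * entropy
```

Because the bonus depends only on the log standard deviations, maximizing it directly pushes the policy toward more randomized actions.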
We use uniform sampling from a restricted set of the last T adversary policies for this experiment, which corresponds to Shioya et al.'s [9] "mean" method. We introduce RARL Policy Storage (PS)-Curriculum, where we first train the policy with the adversary in the source task and record every adversary policy snapshot at each iteration to the policy storage. Then, at each iteration, an adversary sampled from the Uniform(δv, v) distribution [25] is loaded from the policy storage. This method is included in our experiments because the adversaries loaded from and recorded to the buffer during curriculum training become less capable and more inconsistent as training progresses and δ decreases. As a consequence, we intend to show how a variety of design decisions made during training affect the target performance, in pursuit of developing more reliable benchmarks for transfer deep reinforcement learning.

To demonstrate the generalization capability of the policies trained using the proposed regularization techniques, we introduce a set of transfer learning benchmarks in the
Humanoid-v2 environment and increase the range of commonly
Hopper-v2 environment [26]. Reproducibility of reinforcement learning algorithms is challenging due to their colossal dependence on hyperparameters and implementation details. We use a stochastic policy in the training environment and a deterministic policy in the testing environment. The inputs of each policy network are the same; however, the outputs of the protagonist and adversary policy networks differ. Because
Hopper-v2 is a 2-dimensional environment, 2 output neurons are specified to represent each force applied to the geom component of the robot, specifically the heel. The environment simulates the actions of both the protagonist and the adversary at each state and outputs the reward based on the resulting position of the agent.
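The two-player step described above can be sketched as follows. The combined-action interface and all names are our assumptions, not the paper's code: we assume a wrapper whose action space concatenates the protagonist's joint torques with the adversary's 2-D force on the heel, and zero-sum rewards.

```python
import numpy as np

def step_two_player(env, prot_action, adv_action):
    """One simulator step with both players acting.

    Assumes `env.step` accepts the concatenated action vector; the
    adversary's reward is the negated protagonist reward (zero-sum).
    """
    combined = np.concatenate([prot_action, adv_action])
    obs, reward, done, info = env.step(combined)
    return obs, reward, -reward, done, info
```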
The environments in our setup are diverse enough to pose a challenging transfer learning problem. For instance, a humanoid trained in the source friction environment with parameters optimized for the source task cannot walk in a target environment with a high tangential friction coefficient. No modifications are made to the loss functions for these environmental conditions.

Friction is one of the environmental dynamics with a substantial effect on bipedal locomotion. Rajeswaran et al. used ground friction for a hopper robot in [7]. Similarly, we perform forward transfer learning in the higher-dimensional humanoid environment by transferring the learning attained in an environment with ground tangential friction 1 to an environment with ground tangential friction 3.5.

Altering gravity to generate target tasks for the Humanoid and the Hopper was one of the experiments with which Henderson et al. [6] designed multi-task environments, in line with
OpenAI's request for research [36]. In [37], 4 target tasks are created with 0.25 multiples of the source environment's gravity, specifically 0.5G, 0.75G, 1.25G and 1.5G, using the MuJoCo simulator, where G = -9.81. In our gravity experiments, we use 0.5G, 1.5G and 1.75G as the target environment gravities to benchmark our propositions.

Transferring among morphologically different robots with different limb sizes or torso masses has been a popular multi-task learning benchmark [6, 7, 8]. In the first section, we test the generalization capability of our trained policy on a hopper with differing torso masses to compare our proposition to other algorithms. In Section 5.1.2 we introduce three new target environments: a tall, heavy humanoid robot; a short, lightweight humanoid robot; and a delivery robot that carries a heavy box.
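The gravity target tasks above amount to scaling the simulator's gravity vector. A sketch, where the `unwrapped.model.opt.gravity` attribute path assumes a MuJoCo-based Gym environment and should be checked against the bindings in use:

```python
G_EARTH = -9.81

def scaled_gravity(scale):
    # Target-task gravity as a multiple of earth's gravity G = -9.81.
    return scale * G_EARTH

def make_gravity_target(env, scale):
    """Apply scaled gravity to a MuJoCo-based Gym env (assumed attribute path)."""
    env.unwrapped.model.opt.gravity[2] = scaled_gravity(scale)
    return env
```

For example, scales of 0.5, 1.5 and 1.75 give gravities of -4.905, -14.715 and -17.1675, the target values used in the experiments.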
In this section, we first discuss the details of the training conducted in the source environments using the hyperparameters provided in Table 1; then the target environment experiments are discussed using the methodology we propose.
As detailed in Equation 5, the standard reward function in the
Humanoid-v2 environment consists of an alive bonus, a linear forward velocity reward, a quadratic impact cost capped at 10, and a quadratic control cost:

r_humanoid(s, a) = 0.25 r_vfwd − min(5 · 10^−7 c_impact(s, a), 10) − 0.1 c_control(s, a) + C_bonus    (5)

If the z-coordinate of the agent's root, which lies at the center of the torso, is not within the interval [1, 2], the episode terminates. The alive bonus C_bonus is +5 for the Humanoid task, the default value in the Gym Humanoid environment [38].

In Equation 6, the total loss function of the PPO algorithm that we use to update the actor and critic networks is given. We do not use the entropy reward of the original PPO algorithm in our implementation because we did not see any improvement in the expected reward; likewise, the entropy bonus is not used in the PPO implementation of [29][5]. In addition, in the OpenAI Baselines framework [29] the clipping hyperparameter decays with the learning rate multiplier; we omit that as well, because the learning curve tends to decrease after reaching the asymptote at the later stages of training.
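For concreteness, the Humanoid reward of Equation 5 can be sketched as follows. The coefficient values are the Gym defaults and should be checked against the exact environment version; the function and argument names are ours.

```python
def humanoid_reward(v_fwd, impact_cost, control_cost, alive_bonus=5.0):
    """Sketch of the Humanoid-v2 reward in Equation 5 (Gym default coefficients).

    v_fwd: forward velocity term; impact_cost: sum of squared external
    contact forces; control_cost: sum of squared actions.
    """
    return (0.25 * v_fwd
            - min(5e-7 * impact_cost, 10.0)  # quadratic impact cost, capped at 10
            - 0.1 * control_cost
            + alive_bonus)
```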
Table 1: Hyperparameters
Hyperparameters Symbol Values
Clipping Parameter ε
Batch Size b 64, 512
Step Size α
Curriculum Parameter χ
Horizon H
Discount Factor γ
GAE Parameter λ
β

L_t^{CLIP+VF}(θ) = Ê_t [ L_t^CLIP(θ) − c L_t^VF(θ) ]    (6)

(a) (b) Figure 1: (a) Humanoid running in the source environment using the last policy trained with PPO, ε = 0. . (b) Expected reward of policies trained in the standard humanoid source environment with different hyperparameters.

In Figure 1b, the average episode rewards of policies trained with 4 different sets of hyperparameters are shown. The policy parameters of each set are saved at intervals of 50 iterations. The learning curve obtained with the latest PPO hyperparameters suggested in the OpenAI Baselines framework [29] for the Humanoid environment is represented by the red curve in Figure 1b. Following the
OpenAI Baselines framework, we used 16 parallel processes to sample concurrently from structurally identical environments with different random seeds during training. The learning curves for the strict clipping methods for generalization are shown with clipping hyperparameters 0.01, 0.025, and 0.01 with decaying learning rate and clipping. Linearly decaying the learning rate and clipping is a method used for lower-dimensional environments, but we include strict clipping variations with clipping hyperparameters 0.01 and 0.025 in our benchmarks, since source environment performance is not indicative of target environment performance. In the testing phase, we sample trajectories of 2048 timesteps from 32 differently seeded target task environments for each policy in the policy buffer.
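A numpy sketch of the PPO total objective of Equation 6 with strict clipping follows. The names and the value-loss coefficient are illustrative; the key point is that samples whose probability ratio falls outside the narrow clip range contribute no gradient, i.e. they are discarded.

```python
import numpy as np

def ppo_total_loss(ratio, adv, v_pred, v_target, eps=0.01, c_vf=0.5):
    """PPO surrogate with strict clipping (Equation 6), as a numpy sketch.

    ratio: pi_theta(a|s) / pi_theta_old(a|s) per sample; eps=0.01 is the
    strict clipping parameter; c_vf is an illustrative value-loss weight.
    """
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    # Samples with ratios outside [1-eps, 1+eps] are flattened by the clip
    # and yield zero gradient: strict clipping eliminates them.
    l_clip = np.minimum(ratio * adv, clipped * adv).mean()
    l_vf = ((v_pred - v_target) ** 2).mean()
    return l_clip - c_vf * l_vf  # objective to be maximized
```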
The morphological modification experiments involve inter-robot transfer learning. Since the termination criterion depends on the location of the center of the torso, the reward functions of both the tall and the short humanoid environments are updated: for the tall humanoid, the range of the constraint shifts higher, and for the short humanoid it shifts lower. In addition, the total body weights of the short and tall humanoids differ from the standard humanoid by the exclusion and inclusion of the upper waist, respectively, as seen in Figure 2.

Clipping is primarily used to discourage catastrophic displacement in the parameter space. We observed that a higher clipping parameter often causes a sudden drop in the learning curve and puts the policy in an unrecoverable location in the policy parameter space. In contrast, we show that when strict clipping to unconventional values such as ε = 0.01 is used in a transfer learning setting, the MDP samples that lead to overfitting to the source task are discarded. With strict clipping, the trajectory used in the optimization process is free of the variance introduced by source task-specific samples.

Figures 2, 3 and 4 show the performance of the policies trained with clipping parameters ε = 0. and ε = 0. . The curves of the strictly clipped policy iterations tend to be smooth, so saving the policy parameters every 50 iterations is sufficient for this experiment in all environments. We also tried the RARL algorithm on these tasks, but strict clipping performed better with the hyperparameters we used. Moreover, strict clipping achieved remarkably good results in our benchmarks, so the domain randomization enacted by the adversary is not needed. Strict clipping allows the humanoid to learn general characteristics of forward locomotion that can be transferred to various different environments by discarding samples that cause overfitting.
(a) (b) Figure 2: (a) Average reward per episode, for every 50th iteration from the policy buffer, for a shorter humanoid. (b) Snapshot from the short humanoid environment simulation when the th policy iteration is used.

Figure 2a shows that all the policies trained in the standard humanoid environment with strict clipping ε = 0. , saved after the th iteration, perform exceptionally well in the target environment. The th and th policy iterations not only gained a high average reward per episode but also performed consistently well, with low standard deviation over all the trajectories sampled from 32 differently seeded environments. When the th policy is directly transferred, the short humanoid is able to run without any adaptation. In contrast, the policy with clipping parameter ε = 0. cannot transfer the learning it attained in the source environment, because the additional samples used during optimization caused overfitting to the source environment. Figure 2 also shows that even the earlier iterations of the policy trained with clipping parameter ε = 0. cannot be transferred to a shorter robot; here the early stopping regularization technique is not sufficient.

We observed that the humanoid takes smaller steps to stay in balance with a larger upper body and a higher +z constraint. Figure 3 shows that the same policy iterations trained with strict clipping also perform well in the tall humanoid environment, suggesting that this form of moving forward is applicable in both of these environments as well as the source environment. As a result, a tradeoff between generalization capacity and performance arises, as we show that overfitting to the source task samples is a crucial issue. Our results confirm that the tall humanoid environment is harder than the short humanoid environment, as expected, because it is harder to keep balance with a heavier upper body whose center of mass is further from the ground.
Similarly, we also observe higher variance in the average rewards collected from the hopper environments with heavier unit torso mass.

(a) (b)
Figure 3: (a) Average reward per episode, for every 50th iteration from the policy buffer, for a taller humanoid. (b) Snapshot from the tall humanoid environment simulation when the th policy iteration is used.

The masses of the relevant body parts of the delivery robot are given in Table 2. Taking into account the total body mass, a delivery box with a unit mass of 5 constitutes a challenging benchmark. The design decision was made to create a horizontal imbalance by requiring the humanoid to carry the box only in its right hand. The humanoid is able to carry the heavy box just like a human using the th policy. The th policy iteration, which has the best target environment jumpstart performance, is shown in the simulation snapshot in Figure 4b. The simulation shows that the humanoid can generalize to this delivery task purely by utilizing the learning attained in the standard humanoid environment. The reduction in performance after the th iteration in Figure 4a supports our method of resorting to earlier policy iterations before discarding all snapshots of the corresponding algorithm. This concavity suggests that, beyond a point depending on the target environment, the experience gained in the source environment becomes detrimental, and that early stopping is effective as a regularization technique.

Table 2: Delivery Environment Body Unit Mass
Delivery Box 5
Right Hand 1.19834313
Torso 8.32207894
Total Body without Delivery Box 39.64581713

(a) (b)
Figure 4: (a) Average reward per episode, for every 50th iteration from the policy buffer, in the target environment where a delivery box of mass 5 units is carried in the right hand. (b) Standard delivery humanoid.
The humanoid robots fall immediately in the target environments when the best-performing policy in the source environment is used; thus the orange curves in Figures 2, 3 and 4 support our claim that source environment performance is not indicative of generalization capacity. When strict clipping is used as a regularization technique for PPO, the humanoid is able to run in the proposed target environments. In these sets of experiments, our policy buffer consisted of snapshots of policies trained with only ε = 0. and ε = 0. . An alternative policy buffer might consist of snapshots of policies trained with different hyperparameters or training methods, and a better-performing snapshot trained in the source environment might then be found for each benchmark. In order to find the best-performing policy in the buffer when there is no surrogate validation environment, more sampling must be done in the target environment. Thus, a tradeoff emerges between the number of trajectories rolled out and the performance in the target environment. If this problem is not acknowledged, the number of experiences gathered from the environment might even exceed the number of samples needed to train from a random initialization from the beginning. Using observations of the performance of different policies in various target environments, we provide insight into sensible ways of constructing a policy buffer.

Frictional variation is one of the most common scenarios encountered in real life for bipedal locomotion. The environment designed to benchmark this scenario is shown in Figure 5b. The humanoid is seen sinking into the ground due to high tangential friction, but it is still able to run using the policy trained with strict clipping, without any knowledge of the target environment friction.
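The snapshot-selection procedure implied above can be sketched as follows. All names are illustrative; `evaluate` stands for rolling out the policy in the target environment, so every call spends target-environment samples, which is exactly the tradeoff discussed.

```python
def best_snapshot(buffer, evaluate, n_episodes=32):
    """Pick the best-performing snapshot from a policy buffer.

    buffer: list of (iteration, policy) pairs saved during source training;
    evaluate(policy, n_episodes): average episode reward in the target
    environment over n_episodes rollouts.
    """
    scores = {it: evaluate(pol, n_episodes) for it, pol in buffer}
    best_it = max(scores, key=scores.get)
    return best_it, scores[best_it]
```

A smaller or better-curated buffer reduces the number of target-environment rollouts needed to locate a good snapshot.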
Figure 5: (a) Comparison of average reward per episode of policies in the target environment with tangential friction 3.5 times that of the source environment. (b) Humanoid running in the target environment with tangential friction 3.5 times that of the source environment.

Instead of using multiple different policies for environments with different friction coefficients, choosing a single policy with a higher generalization capacity is sufficient, even for a target environment with 3.5 times the tangential friction of the source environment. In Figure 5a, the best jumpstart performances for each clipping parameter are given. For instance, as seen in Table 3, the last iteration of the policy with strict clipping ε = 0. trained in the source environment has an average reward of 8283 and a standard deviation of 24.26 across 32 target environments. In contrast, the best-performing policy in the source environment has a low generalization capacity because it overfits the source task environment. As
Table 3: Best Performing Iterations of Policies in Target Friction Environment
Clip Iteration Average Reward per Episode

more samples are discarded and the movement in parameter space is restricted using strict clipping, the agent learns more generalizable patterns of bipedal locomotion.

In order to stay balanced under harsh circumstances, the policy being transferred should be robust to unknown environmental dynamics. Walking uninterruptedly in gravities lower and higher than the earth's requires different patterns of forward locomotion, unlike the humanoid morphology benchmarks in 5.1.2, where the th iteration of each policy trained with the same clipping parameter gained above 4000 average reward per episode in all target environments.

(a) (b) Figure 6: (a) Average reward per episode in the target environment with gravity = -4.905 (0.5 G_earth). (b) Humanoid in the target environment with gravity = -4.905 (0.5 G_earth), including RARL variations.

When the last iteration of the policy trained with strict clipping ε = 0. is used in the target environment with gravity = -4.905 (0.5 G_earth), the humanoid in Figure 6b is able to run. Although earlier snapshots of the policy that shows the best training performance yield lower average rewards in the target environment than the last policy iteration of PPO trained with ε = 0. , the performance is still remarkably good. Both regularization techniques, namely early stopping and strict clipping, show the same performance and generalization in this target environment as in the delivery environment of Section 5.1.2. Similarly, in the target environment with gravity = -14.715 (1.5 G_earth), the humanoid needs to resort to previous snapshots of the policy trained with strict clipping ε = 0. , as plotted in Figure 7a. The bipedal locomotion pattern in the simulated target environment with gravity = -14.715 (1.5 G_earth), when the humanoid jumpstarts with the th policy trained with clipping parameter ε = 0. ,
is shown in Figure 7b.

The gravity benchmarks for the humanoid indicate that snapshots of different policies should be used for the target environment with gravity = -17.1675 (1.75 G_earth). The policy iterations trained with the hyperparameters "ε = 0. with decaying learning rate and clipping" performed poorly in the source environment and in the target environments with lower gravities, shown in Figures 1a and 6a respectively. However, their last iterations perform consistently well in environments with higher gravities. Figures 7 and 8a reveal that decaying the clipping during training might have hindered exploration and restricted the humanoid to a more careful way of stepping forward under the high gravitational force that pulls it to the ground, as in Figure 8b.

(a) (b)
Figure 7: (a) Average reward per episode in the target environment with gravity = -14.715 (1.5 G_earth). (b) Average reward per episode in the target environment with gravity = -14.715 (1.5 G_earth), including RARL variations.

(a) (b) Figure 8: (a) Average reward per episode in the target environment with gravity = -17.1675 (1.75 G_earth). (b) Humanoid in the target environment with gravity = -17.1675 (1.75 G_earth), including RARL variations.

The reward function of the hopper environment, given in Equation 7, consists of an alive bonus, a linear forward velocity reward and a sum of squared actions. The alive bonus C_bonus in the Hopper environment is +1.

r_hopper(s, a) = r_vfwd − 0.001 Σ a² + C_bonus    (7)

16 parallel processes with different random seeds are initiated for each Hopper environment, and 1.875M timesteps of samples are collected from each uniquely seeded environment. The hyperparameters of the best-performing PPO in this experimental setting are found via a simple grid search. In Figure 9a, the best-performing step size, clipping parameter and batch size in the source task environment are α = 0. , ε = 0. and b = 512. The average reward per episode of PPO and of RARL with different critic architectures in the source environment, over a total of 30 million timesteps, is shown in Figure 9a.

The policies trained with different variations of RARL perform worse in the source task environment than PPO, consistent with the comparison of the policies proposed in the Humanoid experiments. The protagonist policy gains fewer rewards due to the domain randomization introduced by the adversary, but a natural regularization occurs which counteracts overfitting to the source task.

(a) (b)
Figure 9: (a) Expected reward of policies trained in the standard hopper environment. (b) Hopping action in the source environment using the last policy trained with PPO.
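The hopper reward of Equation 7 can be sketched as follows. The 0.001 action-cost coefficient is the Gym default, assumed here; the function and argument names are ours.

```python
def hopper_reward(v_fwd, actions, alive_bonus=1.0):
    """Sketch of the Hopper-v2 reward in Equation 7.

    v_fwd: forward velocity term; actions: the action vector; the 0.001
    action-cost coefficient is the assumed Gym default.
    """
    return v_fwd - 0.001 * sum(a * a for a in actions) + alive_bonus
```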
Initially, the performance of different critic structures is compared: RARL [31], Shared Critic Robust Adversarial Reinforcement Learning (SC-RARL) [24], and our proposition, Average Consecutive Critic Robust Adversarial Reinforcement Learning (ACC-RARL). Next, the best-performing variation of each critic structure is analyzed. Table 4 shows the morphological specifications of the standard Hopper. In Figure 9a, the best source task reward is found to be

Table 4: Source Environment
Body Unit Mass
Torso 3.53429174
Thigh 3.92699082
Leg 2.71433605
Foot 5.0893801
Total Body 15.26499871

above 3000, where the hopper hops quickly and seamlessly. In RARL with TRPO, the torso mass range chosen for the experiments is [2. − .] [8]; we experiment with torso unit masses in the range [1 − 8]. The torso unit masses in [1 − 5] prove to be easier benchmarks, so the best-performing policy for each critic structure comparison is omitted there. The right iterations of the baseline PPO policy performed adequately between [1 − 5], which suggests that domain randomization via an adversary is redundant for these experiments, and early stopping as a regularization technique is sufficient when the right iteration of the policy is used from the policy buffer.

Algorithms trained with the adversaries proposed in Section 3.2 are more unstable during the training phase. In our experiments, we observed that a wrong choice of hyperparameters leads to a decrease in average reward per episode after convergence.

The policy buffer consists of snapshots of policies trained with PPO and 21 variants of RARL, recorded at intervals of 10 iterations. Hence, we analyze the average reward per episode of 2002 different policies and test their generalization capacities for torso masses [6 − 8]. In Figure 10b, it is seen that different iterations of each algorithm reach maximum performance in the target environment.

Figures 10c and 10d show that RARL and SC-RARL perform uniformly satisfactorily on the corresponding benchmarks close to the source environment.
Although the two algorithms have different critic initializations, each critic is updated for the same number of iterations using identically structured loss functions, so it is understandable that in Figure 10d SC-RARL and RARL perform similarly when the target environment is close to the source environment. There is a considerable difference between the target environment performances of SC-RARL and RARL, especially at the th iteration in Figure 10e, which shows that training with the rewards of different trajectories sampled using different protagonist-adversary pairs does affect the type of control behavior learned.

(a) (b) (c) (d) (e)
Figure 10: (a) Average reward per episode in the target environment with torso mass 1, for every 10th iteration from the policy buffer. (b) Average reward per episode in the target environment with torso mass 2, for every 10th iteration from the policy buffer. (c) Average reward per episode in the target environment with torso mass 3, for every 10th iteration from the policy buffer. (d) Average reward per episode in the target environment with torso mass 4, for every 10th iteration from the policy buffer. (e) Average reward per episode in the target environment with torso mass 5, for every 10th iteration from the policy buffer.
The performance of the last iteration of RARL starts to decay in Figures 10a and 10e. As a consequence, the agent should first resort to earlier snapshots of the policy intended for transfer in order to succeed in the harder target environments. Let us assume that the agent has only several snapshots of the policy in its policy buffer, trained in the source environment with the standard torso mass, and is then put in a target environment with torso mass 6, analogous to an agent expected to carry weight while performing a control task. We propose that in such cases, instead of training from the very beginning because the last iteration of each policy known to the agent performs below a certain threshold, as in Figure 10e, the agent should first resort to earlier policies at intervals suited to the context, because policies performing above 3000 are readily available in the agent's memory if the policy iterations were saved during training.

The policy buffer allows us to analyze all the different patterns of hopping learned during training. For instance, the type of hopping learned by ACC-RARL between iterations 550 and 800 is successful in the source task environment and in the environment with torso mass 3 (Figure 10c), but it is clearly unsuccessful in the environments with torso masses 1 and 2, as illustrated in Figures 10a and 10b. If we had not recognized the policy iteration as a hyperparameter, comparing the algorithms at arbitrarily selected iterations would not constitute a fair comparison. More importantly, PPO, which is generally used as a benchmark algorithm, performs poorly when its last snapshot is used for comparison in the target task of Figure 10a. However, for the target environments with torso masses [1 − 5], the right snapshots of PPO are capable of obtaining above 2500 average reward per episode, as seen in Figures 10a, 10b, 10c, 10d and 10e.
Thus, this transfer learning problem reduces to finding the most suitable snapshot of the policies from an appropriately constructed policy buffer.

Assuming that the environments are parametrized, it might be possible to predict the performance of the policies residing in the policy buffer. Consistent with the performances seen in Figures 10a, 11a, 12a and 13a, it is anticipated that the earlier policy iterations, trained with fewer samples, perform better as the distance between the target environment and the source environment increases in parameter space. As seen in Figures 10b, 10c and 10d, the last iteration of the policy trained with RARL with PPO performs better than or comparably to the original RARL experiment carried out with TRPO in [8]. For masses below the hopper's standard torso mass, the last iterations of ACC-RARL, shown in Figures 10a, 10b and 10c, perform superior to the last iterations of SC-RARL and RARL.

(a) (b) Figure 11: (a) Average reward per episode in the target environment with torso mass 6, for every 10th iteration from the policy buffer. (b) Average reward per episode in the target environment with torso mass 6, for every 10th iteration from the policy buffer, including RARL variations.

In Figure 11a, a significant drop in target environment performance occurs from approximately the th until the last policy iteration, affecting all algorithms.
This implies that the type of hopping behavior learned after a certain point in training cannot generalize to hopper environments with higher torso masses, and that all policy iterations trained via the PPO algorithm with the given hyperparameters are inadequate. The best-performing policy iterations of all algorithms start to accumulate in the range [150 − when the torso mass is greater than 5; thus a mapping between the target environment parameters and the policy iterations is highly probable.

The addition of the entropy bonus increased the fluctuation of the average rewards for all algorithms, as seen in Figure 11b. Because the standard deviations of the adversary's output action probability distributions increase as a result of optimization, the adversary takes more randomized actions, which often changes the protagonist's way of hopping by destabilizing the equilibrium.

In the harder environments with torso masses 7 and 8, the earlier iterations of the policy trained with our proposed critic architecture, ACC-RARL, perform best. Figure 12a shows that the range of the best-performing policy iterations is contracted further and the performance similarities of RARL and SC-RARL are more apparent, as illustrated in Figure 10d. Moreover, it is observed in Figure 12b that the benefit of adding entropy to the adversary loss function in

(a) (b)
Figure 12: (a) Average reward per episode in the target environment with torso mass 7, for every 10th iteration from the policy buffer. (b) Average reward per episode in the target environment with torso mass 7, for every 10th iteration, including RARL variations.

ACC-RARL and SC-RARL continues in the target environment with torso unit mass 7, whereas standard RARL is seen to perform better in this case. Hence, the addition of an entropy bonus is not guaranteed to increase the maximum average performance in the target environment.

(a) (b)
Figure 13: (a)Average reward per episode at target environment with torso mass 8 of every 10 iterations from policybuffer. (b) Average reward per episode at target environment with torso mass 8 of every 10 iterations from policy bufferincluding RARL variations.The best average episode rewards for SC-RARL is trained with adversary loss function including the entropy bonuswith entropy coefficient c adversary = 0 . as shown in Figures 11b, 12b, 13b. Training with adversary entropyshifts the best-performing policy iterations slightly to the right due to the increased domain randomization through theencouragement of adversary policy exploration via entropy bonus.When the mass of the torso is increased to 8, th iteration of policy trained with ACC-RARL gains an average episodereward of ± . as plotted in Figure 13. Although RARL trained with curriculum (RARL PS-Curriculum χ = 0 . ) still performed worse than ACC-RARL, it is observed that there is less variation among policy snapshotsrecorded after 100 iterations as shown in Figure 13b. The increase in the performance indicates that training with thehardest adversary policies might not be beneficial. The robustness of a randomly chosen snapshot from the policiestrained with RARL PS-Curriculum χ = 0 . has increased.If the target environments are grouped as all the target environments lower and higher than the source environment’storso mass, we find that each target group requires a different policy if the highest possible target performance isintended. We show the performance of two policy iterations with high generalization capacity trained with the ACC-RARL algorithm in Table 5 to demonstrate that only two closely saved policy iterations are capable of performingforward locomotion when torso mass is in the range [1 − . Using ACC-RARL and regularization via early stoppingwe’ve increased the target environment success range in RARL[8] from [2 . − . to [1 − .18eneralization in Transfer Learning A P
Table 5: Average Reward per Episode of 2 policy iterations trained with ACC-RARL
Unit Mass | Iteration | Average Reward per Episode

In Learning Joint Reward Policy Options using Generative Adversarial Inverse Reinforcement Learning (OptionGAN) [37], the parameter space of the gravity environment lies between . G_earth and . G_earth for both the humanoid and the hopper. The policy over options converges to two different policies for the Hopper tasks, one for gravities lower and one for gravities higher than the Earth's, indicating that the tasks are complex enough to be solved with different policies. In our sets of experiments, conducted with a larger range of gravity environments, specifically 0.5 G_earth to 1.75 G_earth, we further show that by using early stopping with adversarial algorithms, an average episode reward greater than 3000 is achieved in the target environment. For these sets of tasks, we use the same policy buffer created with PPO and the different variations of RARL for the hopper morphology experiments. The plots in Figures 14a and 14b show that the probability of picking from the policy buffer an iteration of a curriculum-trained policy that conforms to expectations is higher.

Figure 14: (a) Average reward per episode in the target environment with gravity = −4.905 (0.5 G_earth). (b) Average reward per episode in the target environment with gravity = −4.905 (0.5 G_earth), using the policy buffer including RARL variations.

In Figure 15a, the best-performing policy of the baseline PPO is shown. Although it is the best-performing policy among the policy iterations recorded at intervals of 10, this iteration is harder to find than the policies trained with RARL.
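The selection problem just described, finding the right iteration in the buffer, can be sketched as a contrast between taking the final iterate and choosing by a proxy return. The returns below are synthetic illustrations under assumed peak locations, not measured rewards.

```python
def last_iterate(buffer):
    """Conventional transfer: use the final training iteration."""
    return max(buffer)

def early_stopped(buffer, proxy_return):
    """Early stopping over the buffer: use the snapshot that scores
    best on a proxy (e.g. surrogate validation) environment."""
    return max(buffer, key=lambda it: proxy_return(buffer[it]))

# Synthetic returns: the proxy's peak (iteration 60) sits near the
# unseen target's peak (iteration 70), while source-task training
# keeps "improving" all the way to iteration 200.
buffer = {it: it for it in range(10, 210, 10)}
proxy = lambda p: -(p - 60) ** 2
target = lambda p: -(p - 70) ** 2
```

Here `target(buffer[early_stopped(buffer, proxy)])` exceeds `target(buffer[last_iterate(buffer)])`: the snapshot chosen by the proxy transfers better than the final iterate, which is the gap these benchmarks quantify.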
Figure 15b shows that training SC-RARL with curriculum and including the entropy bonus in the adversary loss function not only increased the average reward per episode for the best-performing policy iterations, but also increased the number of policy snapshots that obtained an average episode reward above 2000.

As the target environment gets harder by moving further away from the source environment, the best-performing policy iterations aggregate around earlier iterations, similar to the morphology experiments. As seen in Figure 16, the domain randomization in naive PPO does not suffice for generalization in harder target environments. In Figure 16b, we see that encouraging the exploration of the protagonist policy through the entropy bonus increases performance for policies trained with ACC-RARL and SC-RARL. Although entropy does not guarantee an increase in all harder target environments in the morphology tasks, it is a regularization technique that should be considered, given that it increases generalization for ACC-RARL and SC-RARL in the gravity tasks.

In Figures 16a and 16b, we see the same concavity encountered in the performance plots of the heavier-torso target environments in the hopper morphology and the humanoid delivery experiments. This curve is analogous to the convexity of the test error curve in supervised learning, where earlier training iterations on the training error
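The entropy bonus discussed here can be written down explicitly for a diagonal Gaussian policy. This is a generic sketch: the coefficient `c_ent` below is illustrative, and the coefficients used in the experiments are not restated here.

```python
import math

def gaussian_entropy(log_std):
    """Per-dimension differential entropy of a Gaussian policy,
    H = 0.5 * ln(2*pi*e) + log_std."""
    return 0.5 * math.log(2.0 * math.pi * math.e) + log_std

def objective_with_entropy(surrogate, log_std, c_ent):
    """Adding c_ent * entropy to the maximized objective rewards a
    higher-variance policy; for the adversary this yields more varied
    disturbances, i.e. extra domain randomization."""
    return surrogate + c_ent * gaussian_entropy(log_std)
```

Because the entropy grows with `log_std`, gradient ascent on this objective pushes the policy toward exploration unless the surrogate term pays for the extra variance.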
Figure 15: (a) Average reward per episode in the target environment with gravity = −14.715 (1.5 G_earth). (b) Average reward per episode in the target environment with gravity = −14.715 (1.5 G_earth), using the policy buffer including RARL variations.

Figure 16: (a) Average reward per episode in the target environment with gravity = −17.1675 (1.75 G_earth). (b) Average reward per episode in the target environment with gravity = −17.1675 (1.75 G_earth), using the policy buffer including RARL variations.

curve lead to underfitting while the later iterations overfit to the training set. The optimum point on the curve has high generalization capacity and performance in the target environment. The regularization effect of early stopping is pivotal in increasing the generalization capacity. Above all, the gravity environments behaved in line with the morphology experiments, and early stopping in deep reinforcement transfer learning proves essential both for succeeding in harder target environments and for creating meaningful benchmarks.

An agent tries a substantial number of different action combinations, depending on the hyperparameters, during the training phase of a deep reinforcement learning algorithm. With each environment interaction, the agent's strategy for solving that particular task is expected to advance globally. At some point, the agent becomes an expert at performing the task, but it no longer remembers the generalizable strategies it acquired during the earlier phases of learning. These overridden strategies, and strategies with poorer training performance, yield higher performance in our transfer learning benchmarks. Given that forward locomotion is an integral problem in continuous control, we altered the gravity, the tangential friction of the environment, and the morphology of the agent in our benchmarks.

When training deep neural networks, early stopping [39] is applied when the algorithm's generalization capacity, i.e., its validation set performance, starts to decrease.
Knowing where to stop depends on the difference between the target and the source environment. In our work, we proposed keeping a policy buffer, analogous to human memory, to capture different strategies, because training performance does not determine test performance in a transfer learning setting. Since we are not given the context of the target environment while training in the source environment, we keep snapshots of policies in the policy buffer. Transferring the policy with the best source task performance to the target task becomes a less adequate evaluation technique as the difference between the source and target environments increases. This methodology allowed us to increase the scope of existing algorithms and to transfer the learning attained in a source environment to harder target environments. In addition, we suggest the use of a surrogate validation environment to optimize the hyperparameters by choosing the best-fitting policy from the buffer. We proposed
the inclusion of the training iteration among the hyperparameters. After this inclusion, we managed to retrieve the overridden strategies that yield high rewards in the target environments. We showed that a hopper robot is capable of performing forward locomotion in an unknown environment with 1.75 times the source task's gravity by using the policies saved at earlier iterations.

We provided comparisons of RARL algorithms trained with different critic structures, curriculum learning, and entropy bonuses, and we showed how these training choices affect the generalization capacity in continuous control experiments. Using our proposed methods, ACC-RARL and early stopping via the policy buffer, we increased the range of the Hopper torso mass experiments from [2.5−4.75] to [1−8].

Furthermore, we introduced strict clipping for Proximal Policy Optimization as a regularization technique. Using an unconventionally low clipping parameter, we discarded the samples that overfit the source task, namely the standard humanoid environment. We observed higher jumpstart performance in humanoid environments with higher tangential friction, a larger range of gravity, and morphological modifications by using the robust policies saved during training. Although outside the scope of transfer learning, we discovered that decaying the clipping parameter decreases training performance for the humanoid environment, which has a higher-dimensional state and action space. Policy gradient algorithms are used in transfer learning algorithms for continuous control, so this finding has a substantial effect on both training and testing performance.

We believe that the first step in determining the most promising policy parameters lies in an accurate parametrization of the environment. We would like to investigate the relationship between the parametrized distance between environments and the hyperparameters used to train the policies residing in the buffer.
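Strict clipping can be illustrated on PPO's per-sample clipped objective. This is a sketch: `eps = 0.02` below is an illustrative "unconventionally low" value, not the exact parameter used in the experiments.

```python
def clipped_surrogate(ratio, advantage, eps):
    """PPO's per-sample clipped objective, min(r*A, clip(r, 1-eps, 1+eps)*A).
    With a strict (very small) eps, any sample whose importance ratio
    leaves [1 - eps, 1 + eps] contributes a constant, zero-gradient term,
    effectively discarding that sample from the update."""
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped_ratio * advantage)

# Standard vs. strict clipping on the same off-policy sample (r = 1.5):
loose = clipped_surrogate(1.5, 1.0, eps=0.2)    # objective capped at 1.2
strict = clipped_surrogate(1.5, 1.0, eps=0.02)  # objective capped at 1.02
```

Because the objective takes the pessimistic minimum, tightening `eps` never rewards samples whose ratio drifts from 1; it only removes their gradient contribution, which is the sample-elimination effect strict clipping relies on.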
This mapping might be represented by a nonlinear function approximator and should be estimated with the least amount of data possible.

In this study, we showed the necessity of hyperparameter tuning not just for training performance but for the generalization capacity of the transferred policy. Designing a better surrogate validation task while minimizing its difference from the source task is also a fruitful future research direction we would like to explore. Decreasing the Kullback–Leibler divergence constraint to unconventional values for Trust Region Policy Optimization (TRPO), similar to strict clipping for PPO, might also prove to increase the generalization capacity.
References

[1] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, February 2015.
[2] Karl Cobbe, Oleg Klimov, Chris Hesse, Taehoon Kim, and John Schulman. Quantifying generalization in reinforcement learning. arXiv preprint arXiv:1812.02341, 2018.
[3] Chiyuan Zhang, Oriol Vinyals, Remi Munos, and Samy Bengio. A study on overfitting in deep reinforcement learning. arXiv preprint arXiv:1804.06893, 2018.
[4] Chenyang Zhao, Olivier Sigaud, Freek Stulp, and Timothy M Hospedales. Investigating generalisation in continuous deep reinforcement learning. arXiv preprint arXiv:1902.07015, 2019.
[5] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
[6] Peter Henderson, Wei-Di Chang, Florian Shkurti, Johanna Hansen, David Meger, and Gregory Dudek. Benchmark environments for multitask learning in continuous domains. arXiv preprint arXiv:1708.04352, 2017.
[7] Aravind Rajeswaran, Sarvjeet Ghotra, Balaraman Ravindran, and Sergey Levine. EPOpt: Learning robust neural network policies using model ensembles. arXiv preprint arXiv:1610.01283, 2016.
[8] Lerrel Pinto, James Davidson, Rahul Sukthankar, and Abhinav Gupta. Robust adversarial reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 2817–2826. JMLR.org, 2017.
[9] Hiroaki Shioya, Yusuke Iwasawa, and Yutaka Matsuo. Extending robust adversarial reinforcement learning considering adaptation and diversity. In International Conference on Learning Representations, 2018.
[10] Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep reinforcement learning that matters. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[11] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.
[12] Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In Intelligent Robots and Systems (IROS), 2017 IEEE/RSJ International Conference on, pages 23–30. IEEE, 2017.
[13] Mario Srouji, Jian Zhang, and Ruslan Salakhutdinov. Structured control nets for deep reinforcement learning. In International Conference on Machine Learning, pages 4749–4758, 2018.
[14] Ajay Mandlekar, Yuke Zhu, Animesh Garg, Li Fei-Fei, and Silvio Savarese. Adversarially robust policy learning: Active construction of physically-plausible perturbations. In , pages 3932–3939. IEEE, 2017.
[15] Sergey Levine. Lecture 15: Transfer and Multi-Task Learning. 2017, http://rail.eecs.berkeley.edu/deeprlcourse-fa17/f17docs/lecture_15_multi_task_learning.pdf, accessed in June 2018.
[16] Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, et al. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.
[17] Zhangjie Cao, Mingsheng Long, Jianmin Wang, and Michael I. Jordan. Partial transfer learning with selective adversarial networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[18] Olga Russakovsky, Jia Deng, Hao Su, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[19] G. Griffin, A. Holub, and P. Perona. Caltech-256 object category dataset. Technical Report 7694, California Institute of Technology, 2007.
[20] Fereshteh Sadeghi and Sergey Levine. CAD2RL: Real single-image flight without a single real image. arXiv preprint arXiv:1611.04201, 2016.
[21] Blender Online Community. Blender - a 3D modelling and rendering package. Blender Foundation, Blender Institute, Amsterdam, 2002.
[22] Konstantinos Bousmalis, Alex Irpan, Paul Wohlhart, et al. Using simulation and domain adaptation to improve efficiency of deep robotic grasping. In , pages 4243–4250. IEEE, 2018.
[23] Eric Tzeng, Coline Devin, Judy Hoffman, et al. Adapting deep visuomotor representations with weak pairwise constraints. arXiv preprint arXiv:1511.07111, 2015.
[24] Zisu Dong. Tensorflow implementation for Robust Adversarial Reinforcement Learning. 2018, https://github.com/Jekyll1021/RARL, accessed in March 2019.
[25] Trapit Bansal, Jakub Pachocki, Szymon Sidor, Ilya Sutskever, and Igor Mordatch. Emergent complexity via multi-agent competition. arXiv preprint arXiv:1710.03748, 2017.
[26] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pages 5026–5033. IEEE, 2012.
[27] Maruan Al-Shedivat, Trapit Bansal, Yuri Burda, Ilya Sutskever, Igor Mordatch, and Pieter Abbeel. Continuous adaptation via meta-learning in nonstationary and competitive environments. arXiv preprint arXiv:1710.03641, 2017.
[28] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.
[29] Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, et al. OpenAI Baselines. 2017, https://github.com/openai/baselines, accessed in May 2018.
[30] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.
[31] Lerrel Pinto. Rllab implementation for Robust Adversarial Reinforcement Learning. 2017, https://github.com/lerrel/rllab-adv, accessed in December 2018.
[32] Lerrel Pinto. Gym environments with adversarial disturbance agents. 2017, https://github.com/lerrel/gym-adv, accessed in March 2019.
[33] Rocky Duan, Peter Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. rllab. 2016, https://github.com/rll/rllab, accessed in December 2018.
[34] Ye Huang, Chaochen Gu, Kaijie Wu, and Xinping Guan. Reinforcement learning policy with proportional-integral control. In International Conference on Neural Information Processing, pages 253–264. Springer, 2018.
[35] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 41–48. ACM, 2009.
[36] OpenAI. Request for Research: Multitask RL with Continuous Actions. https://openai.com/requests-for-research/, accessed in November 2018.
[37] Peter Henderson, Wei-Di Chang, Pierre-Luc Bacon, David Meger, Joelle Pineau, and Doina Precup. OptionGAN: Learning joint reward-policy options using generative adversarial inverse reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[38] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
[39] Lutz Prechelt. Early stopping - but when? In