Risk Averse Robust Adversarial Reinforcement Learning
Xinlei Pan, Daniel Seita, Yang Gao, John Canny
University of California, Berkeley. {xinleipan,seita,yg,canny}@berkeley.edu

Abstract — Deep reinforcement learning has recently made significant progress in solving computer games and robotic control tasks. A known problem, though, is that policies overfit to the training environment and may not avoid rare, catastrophic events such as automotive accidents. A classical technique for improving the robustness of reinforcement learning algorithms is to train on a set of randomized environments, but this approach only guards against common situations. Recently, robust adversarial reinforcement learning (RARL) was developed, which allows efficient applications of random and systematic perturbations by a trained adversary. A limitation of RARL is that only the expected control objective is optimized; there is no explicit modeling or optimization of risk. Thus the agents do not consider the probability of catastrophic events (i.e., those inducing abnormally large negative reward), except through their effect on the expected objective. In this paper we introduce risk-averse robust adversarial reinforcement learning (RARARL), using a risk-averse protagonist and a risk-seeking adversary. We test our approach on a self-driving vehicle controller. We use an ensemble of policy networks to model risk as the variance of value functions. We show through experiments that a risk-averse agent is better equipped to handle a risk-seeking adversary, and experiences substantially fewer crashes compared to agents trained without an adversary. Supplementary materials are available at https://sites.google.com/view/rararl.

I. INTRODUCTION
Reinforcement learning has demonstrated remarkable performance on a variety of sequential decision making tasks such as Go [1], Atari games [2], autonomous driving [3], [4], and continuous robotic control [5], [6]. Reinforcement learning (RL) methods fall under two broad categories: model-free and model-based. In model-free RL, the environment's physics are not modeled, and such methods require substantial environment interaction and can have prohibitive sample complexity [7]. In contrast, model-based methods allow for systematic analysis of environment physics, and in principle should lead to better sample complexity and more robust policies. These methods, however, have to date been challenging to integrate with deep neural networks and to generalize across multiple environment dimensions [8], [9], or in truly novel scenarios, which are expected in unrestricted real-world applications such as driving.

In this work, we focus on model-free methods, but include explicit modeling of risk. We additionally focus on a framework that includes an adversary in addition to the main (i.e., protagonist) agent. By modeling risk, we can train stronger adversaries and, through competition, more robust policies for the protagonist (see Figure 1 for an overview).
Fig. 1: Risk averse robust adversarial reinforcement learning diagram: an autonomous driving example. Our framework includes two competing agents acting against each other, trying to drive a car (protagonist), or trying to slow or crash the car (adversary). We include a notion of risk modeling in policy learning. The risk-averse protagonist and risk-seeking adversarial agents learn policies to maximize or minimize reward, respectively. The use of the adversary helps the protagonist to effectively explore risky states.
We envision this as enabling training of more robust agents in simulation and then using sim-to-real techniques [10] to generalize to real-world applications, such as household robots or autonomous driving, with high reliability and safety requirements. A recent algorithm combining robustness in reinforcement learning and the adversarial framework is robust adversarial reinforcement learning (RARL) [11], which trained a robust protagonist agent by having an adversary provide random and systematic attacks on input states and dynamics. The adversary is itself trained using reinforcement learning, and tries to minimize the long term expected reward while the protagonist tries to maximize it. As the adversary gets stronger, the protagonist experiences harder challenges. RARL, along with similar methods [12], is able to achieve some robustness, but the level of variation seen during training may not be diverse enough to resemble the variety encountered in the real world. Specifically, the adversary does not actively seek catastrophic outcomes as does the agent constructed in this paper. Without such experiences, the protagonist agent will not learn to guard against them. Consider autonomous driving: a car controlled by the protagonist may suddenly be hit by another car. We call this and other similar events catastrophic since they present extremely negative rewards to the protagonist, and should not occur under a reasonable policy. Such catastrophic events are highly unlikely to be encountered if an adversary only randomly perturbs the environment parameters or dynamics, or if the adversary only tries to minimize total reward.

In this paper, we propose risk averse robust adversarial reinforcement learning (RARARL) for training risk averse policies that are simultaneously robust to dynamics changes. Inspired by [13], we model risk as the variance of value functions. To emphasize that the protagonist be averse to catastrophes, we design an asymmetric reward function (see Section IV-A): successful behavior receives a small positive reward, whereas catastrophes receive a very negative reward. A robust policy should not only maximize long term expected reward, but should also select actions with low variance of that expected reward. Maximizing the expectation of the value function only maximizes the point estimate of that function without giving a guarantee on the variance. While [13] proposed a method to estimate that variance, it assumes a limited number of states; that assumption makes the method impractical in real-world settings where the number of possible states can be effectively unbounded. We make no such assumption and instead use an ensemble of Q-value networks to estimate variance. A similar technique was proposed in Bootstrapped DQN [14] to assist exploration, though in our case, the primary purpose of the ensemble is to estimate variance.

We consider a two-agent reinforcement learning scenario (formalized in Section III). Unlike in [11], where the agents performed actions simultaneously, here they take turns executing actions, so that one agent may take multiple steps to bring the environment into a more challenging state for the other. We seek to enable the adversarial agent to actively explore the parameter variation space, so that perturbations are generated more efficiently. We use a discrete control task, autonomous driving with the TORCS [15] simulator, to demonstrate the benefits of RARARL.

II. RELATED WORK
Reinforcement Learning with Adversaries. A recent technique in reinforcement learning involves introducing adversaries and other agents that can adjust the environment difficulty for a main agent. This has been used for robust grasping [16], simulated fighting [17], and RARL [11], the most relevant prior work to ours. RARL trains an adversary to appropriately perturb the environment for a main agent. The perturbations, though, were limited to a few parameters such as mass or friction, and the trained protagonist may be vulnerable to other variations. The works of [12] and [18] proposed to add noise to state observations to provide adversarial perturbations, with the noise generated using the fast gradient sign method [19]. However, they did not consider training an adversary or training risk averse policies. The work of [20] proposed to introduce Bayesian optimization to actively select environment variables that may induce catastrophes, so that trained models can be robust to these environment dynamics. However, they did not systematically explore dynamics variations, and therefore the model may be vulnerable to changing dynamics even if it is robust to a handful of rare events.
Robustness and Safety in RL. More generally, robustness and safety have long been explored in reinforcement learning [21], [22], [23]. Chow et al. [23] proposed to model risk via a constraint or chance constraint on the conditional value at risk (CVaR). This paper provided strong convergence guarantees but made strong assumptions: value and constrained value functions are assumed to be known exactly and to be differentiable and smooth. Risk is estimated by simply sampling trajectories, which may never encounter adverse outcomes, whereas with sparse risks (as is the case here) adversarial sampling provides more accurate estimates of the probability of a catastrophe. A popular ingredient is to enforce constraints on an agent during exploration [24] and policy updates [25], [26]. Alternative techniques include random noise injection during various stages of training [27], [28], injecting noise into the transition dynamics during training [29], learning when to reset [30], and even physically crashing as needed [31]. However, Rajeswaran et al. [29] requires training on a target domain and suffers performance degradation when the target domain has a different model parameter distribution from the source. We also note that in control theory, [32], [33] have provided theoretical analysis for robust control, though their focus lies in model-based RL instead of model-free RL. These prior techniques are orthogonal to our contribution, which relies on model ensembles to estimate variance.
Uncertainty-Driven Exploration. Prior work on exploration includes [34], which measures novelty of states using state prediction error, and [35], which uses pseudo-counts to explore novel states. In our work, we seek to measure the risk of a state by the variance of value functions. The adversarial agent explores states with high variance so that it can create appropriate challenges for the protagonist.
Simulation to Real Transfer. Running reinforcement learning on physical hardware can be dangerous due to exploration and slow due to high sample complexity. One approach to deploying RL-trained agents safely in the real world is to experience enough environment variation during training in simulation so that the real-world environment looks just like another variation. These simulation-to-real techniques have grown popular, including domain randomization [10], [36] and dynamics randomization [37]. However, their focus is on transferring policies to the real world rather than training robust and risk averse policies.

III. RISK AVERSE ROBUST ADVERSARIAL RL

In this section, we formalize our risk averse robust adversarial reinforcement learning (RARARL) framework.
A. Two Player Reinforcement Learning
We consider the environment as a Markov Decision Process (MDP) M = {S, A, R, P, γ}, where S is the state space, A is the action space, R(s, a) is the reward function, P(s' | s, a) is the state transition model, and γ is the reward discount rate. There are two agents: the protagonist P and the adversary A.

Definition. Protagonist Agent. A protagonist P learns a policy π_P to maximize the discounted expected reward E_{π_P}[Σ_t γ^t r_t]. The protagonist should be risk averse, so we define the value of action a at state s to be

  Q̂_P(s, a) = Q_P(s, a) − λ_P Var_k[Q_P^k(s, a)],   (1)

where Q̂_P(s, a) is the modified Q function, Q_P(s, a) is the original Q function, Var_k[Q_P^k(s, a)] is the variance of the Q function across k different models, and λ_P is a constant. The term −λ_P Var_k[Q_P^k(s, a)] is called the risk-averse term hereafter, and encourages the protagonist to seek lower-variance actions. The reward for P is the environment reward r_t at time t.

Definition. Adversarial Agent. An adversary A learns a policy π_A to minimize the long term expected reward, or equivalently to maximize the negative discounted reward E_{π_A}[Σ_t −γ^t r_t]. To encourage the adversary to systematically seek adverse outcomes, its modified value function for action selection is

  Q̂_A(s, a) = Q_A(s, a) + λ_A Var_k[Q_A^k(s, a)],   (2)

where Q̂_A(s, a) is the modified Q function, Q_A(s, a) is the original Q function, Var_k[Q_A^k(s, a)] is the variance of the Q function across k different models, and λ_A is a constant; the interaction between agents becomes a zero-sum game by setting λ_A = λ_P. The term λ_A Var_k[Q_A^k(s, a)] is called the risk-seeking term hereafter. The reward of A is the negative of the environment reward, −r_t, and its action space is the same as for the protagonist.

The necessity of having the two agents act separately instead of jointly is to give the adversary more power to create challenges for the protagonist. For example, in autonomous driving, a single risky action may not put the vehicle in a dangerous condition. In order to create a catastrophic event (e.g., a traffic accident) the adversary needs to be stronger. In our experiments (a vehicle controller with discrete control), the protagonist and adversary alternate full control of a vehicle, though our methods also apply to settings as in [11], where the action applied to the environment is a sum of contributions from the protagonist and the adversary.
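To make the action-selection rules in Equations (1) and (2) concrete, the following is a minimal sketch (not the authors' released code) of risk-sensitive action selection from an ensemble of Q-value heads; the array shapes and the λ value are illustrative assumptions.

```python
import numpy as np

def risk_sensitive_action(q_heads, lam, risk_seeking=False):
    """Select an action from ensemble Q-values.

    q_heads: array of shape (k, num_actions), one row per ensemble head.
    lam:     weight on the variance term (lambda_P or lambda_A).
    risk_seeking: False for the protagonist (Eq. 1), True for the adversary (Eq. 2).
    """
    q_mean = q_heads.mean(axis=0)             # point estimate of Q(s, a)
    q_var = q_heads.var(axis=0)               # Var_k[Q^k(s, a)] across heads
    sign = 1.0 if risk_seeking else -1.0
    q_modified = q_mean + sign * lam * q_var  # Eq. (2) or Eq. (1)
    return int(np.argmax(q_modified))

# Example: k = 10 heads, 9 discrete driving actions; lam is a placeholder.
q_heads = np.random.randn(10, 9)
a_protagonist = risk_sensitive_action(q_heads, lam=0.1, risk_seeking=False)
a_adversary = risk_sensitive_action(q_heads, lam=0.1, risk_seeking=True)
```

The same ensemble statistics serve both agents; only the sign of the variance term differs, which is what makes the interaction zero-sum when λ_A = λ_P.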
B. Reward Design and Risk Modeling

To train risk averse agents, we propose an asymmetric reward function design such that good behavior receives small positive rewards and risky behavior receives very negative rewards. See Section IV-A and Equation 4 for details.

The risk of an action can be modeled by estimating the variance of the value function across different models trained on different sets of data. Inspired by [14], we estimate the variance of Q value functions by training multiple Q value networks in parallel. Hereafter, we use Q to denote the entire Q value network, and Q_i to denote the i-th head of the multi-head Q value network. (We use Q and Q_i to represent functions that could apply to either the protagonist or the adversary; when it is necessary to distinguish between the two agents, we add the appropriate subscript P or A.) As shown in Figure 2, the network takes an input s, which consists of stacked frames of consecutive observations. It passes s through three shared convolutional layers, followed by one (shared) dense layer. After this, the representation is passed to k different heads, each applying one dense layer to obtain k action-value outputs: {Q_1(s, ·), ..., Q_k(s, ·)}. Defining the mean as Q̃(s, a) = (1/k) Σ_{i=1}^k Q_i(s, a), the variance of a single action a is

  Var_k(Q(s, a)) = (1/k) Σ_{i=1}^k (Q_i(s, a) − Q̃(s, a))^2,   (3)

where we use the k subscript to indicate variance over k models, as in Equations 1 and 2. The variance in Equation 3 measures risk, and our goal is for the protagonist and adversarial agents to select actions with low and high variance, respectively.

Fig. 2: Our neural network design. (Notation: "s" indicates stride for the convolutional weight kernel, and two crossing arrows indicate dense layers.) The input is a sequence of four stacked observations. It is passed through three convolutional layers to obtain a 3136-dimensional vector, which is then processed through a dense layer. (All activations are ReLUs.) The resulting 512-dimensional vector is copied and passed to k branches, which each process it through dense layers to obtain a state value vector Q_i(s, ·). We apply the ensemble DQN framework for estimating the value function variance.

At training time, when we sample an action using the Q values, we randomly choose one of the k heads Q_1 to Q_k, and use this head throughout one episode to choose the actions applied by the agent. When updating Q functions, our algorithm (like DQN [2]) samples a batch of data of size B from the replay buffer {(s, a, s', r, done)_t}_{t=1}^B which, for each data point, includes the state, action, next state, reward, and task completion signal. Then we sample a k-sized mask. Each mask value is sampled using a Poisson distribution (modeling a true bootstrap sample with replacement) instead of the Bernoulli distribution in [14] (sampling without replacement). At test time, the mean value Q̃(s, a) is used for selecting actions.
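As a rough illustration of the training-time mechanics just described (per-episode head selection, Poisson bootstrap masks, and test-time averaging), the sketch below uses placeholder values; the exact mask rate and head count in the experiments are given in Section IV-B.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 10  # number of ensemble heads (Section IV-B trains with 10)

def pick_episode_head():
    """At training time, sample one head at the start of an episode and
    use it for action selection throughout that episode."""
    return int(rng.integers(0, K))

def sample_update_mask(q):
    """Poisson bootstrap mask over the K heads: entry i is the (integer)
    weight with which head i is updated by the current batch. The rate q
    here is a placeholder, not necessarily the paper's exact setting."""
    return rng.poisson(lam=q, size=K)

def test_time_q(q_heads):
    """At test time, actions are chosen from the ensemble mean Q~(s, .)."""
    return q_heads.mean(axis=0)

head = pick_episode_head()
mask = sample_update_mask(q=1.0)
q_mean = test_time_q(np.random.randn(K, 9))
```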
C. Risk Averse RARL

In our two-player framework, the agents take actions sequentially, not simultaneously: the protagonist takes m steps, the adversary takes n steps, and the cycle repeats. The experience of each agent is only visible to itself, which means each agent changes the environment transition dynamics seen by the other agent. The Q-learning Bellman equation is modified to be compatible with this case. Let the current and target value functions be Q_P and Q*_P for the protagonist, and (respectively) Q_A and Q*_A for the adversary. Given the current state and action pair (s_t, a_t), we denote actions executed by the protagonist as a_t^P and actions taken by the adversary as a_t^A. The target value functions are

  Q_P(s_t^P, a_t^P) = r(s_t^P, a_t^P) + Σ_{i=1}^n γ^i r(s_{t+i}^A, a_{t+i}^A) + γ^{n+1} max_a Q*_P(s_{t+n+1}^P, a),

and, similarly,

  Q_A(s_t^A, a_t^A) = r(s_t^A, a_t^A) + Σ_{i=1}^m γ^i r(s_{t+i}^P, a_{t+i}^P) + γ^{m+1} max_a Q*_A(s_{t+m+1}^A, a).

To increase training stability for the protagonist, we designed a training schedule Ξ for the adversarial agent. For the first ξ steps, only the protagonist agent takes actions. After that, for every m steps taken by the protagonist, the adversary takes n steps. The reason for this training schedule design is that we observed that if the adversarial agent is added too early (e.g., right at the start), the protagonist is unable to attain any rewards. Thus, we let the protagonist undergo a sufficient number of training steps to learn basic skills. The use of masks in updating Q value functions is similar to [14], where the mask is an integer vector of size equal to the batch size times the number of ensemble Q networks, and is used to determine which models are updated with the sampled batch.
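The following is a small sketch (with hypothetical variable names, not the paper's code) of how the protagonist's modified Bellman target above folds the adversary's intervening steps into the discounted return.

```python
def protagonist_target(r_p, adv_rewards, q_next_max, gamma=0.99):
    """n-step target for the protagonist (see the Q_P target above).

    r_p:         reward for the protagonist's own step, r(s_t^P, a_t^P).
    adv_rewards: environment rewards collected during the adversary's
                 n intervening steps.
    q_next_max:  max_a Q*_P(s_{t+n+1}^P, a) at the protagonist's next turn.
    """
    n = len(adv_rewards)
    target = r_p
    for i, r in enumerate(adv_rewards, start=1):
        target += (gamma ** i) * r          # adversary steps, discounted
    target += (gamma ** (n + 1)) * q_next_max  # bootstrap at next turn
    return target

# Example: one adversary step (n = 1) between protagonist turns.
y = protagonist_target(r_p=0.02, adv_rewards=[-0.01], q_next_max=1.3)
```

Algorithm 1 describes our training algorithm.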
Algorithm 1: Risk Averse RARL Training Algorithm

Result: Protagonist value function Q_P; adversarial value function Q_A.
Input: Training steps T; environment env; adversarial action schedule Ξ; exploration rate ε; number of models k.
Initialize: Q_P^i, Q_A^i (i = 1, ..., k); replay buffers RB_P, RB_A; action-choosing heads H_P, H_A ∈ {1, ..., k}; t = 0; training frequency f; Poisson sample rate q.

while t < T do
    Choose agent g from {A (adversarial agent), P (protagonist agent)} according to Ξ;
    Compute Q̂_g(s, a) according to (1) and (2);
    Select an action according to Q̂_g(s, a) by applying the ε-greedy strategy;
    Execute the action and get obs, reward, done;
    RB_g = RB_g ∪ {(obs, reward, done)};
    if t mod f == 0 then
        Generate mask M ∈ R^k ∼ Poisson(q);
        Update Q_P^i with RB_P and M_i, i = 1, 2, ..., k;
        Update Q_A^i with RB_A and M_i, i = 1, 2, ..., k;
    if done then
        Update H_P and H_A by randomly sampling integers from 1 to k;
        Reset env;
    t = t + 1;
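As an illustration of the schedule Ξ used in Algorithm 1, the helper below decides which agent acts at a given environment step. The values of ξ, m, and n follow Section IV-B, but the exact ordering within each m+n cycle is an assumption; this is a sketch, not the paper's implementation.

```python
def acting_agent(t, xi=550_000, m=10, n=1):
    """Return 'P' (protagonist) or 'A' (adversary) for step t under the
    schedule Xi: protagonist only for the first xi steps, then n adversary
    steps after every m protagonist steps."""
    if t < xi:
        return 'P'
    phase = (t - xi) % (m + n)
    return 'P' if phase < m else 'A'

# Example: after warm-up, the adversary acts once per ten protagonist steps.
window = [acting_agent(t) for t in range(550_000, 550_022)]
```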
IV. EXPERIMENTS

We evaluated models trained by RARARL on an autonomous driving environment, TORCS [15]. Autonomous driving has been explored in recent contexts for policy learning and safety [38], [39], [40] and is a good testbed for risk-averse reinforcement learning since it involves events (particularly crashes) that qualify as catastrophes.
A. Simulation Environment
For experiments, we use the Michigan Speedway environment in TORCS [15], which is a round racing track; see Figure 6 for sample observations. The states are RGB images. The vehicle can execute nine actions: (1) move left and accelerate, (2) move ahead and accelerate, (3) move right and accelerate, (4) move left, (5) do nothing, (6) move right, (7) move left and decelerate, (8) move ahead and decelerate, (9) move right and decelerate.

We next define our asymmetric reward function. Let v be the magnitude of the speed, α be the angle between the speed and the road direction, p be the distance of the vehicle to the center of the road, and w be the road width. We additionally define two binary flags, st and da, indicating whether the vehicle is stuck or damaged, respectively, and the catastrophe indicator C = ⌈(st + da)/2⌉. The reward function is defined as

  r = β v (cos(α) − |sin(α)| − p/w)(1 − st)(1 − da) + r_cat · C,   (4)

with the intuition being that cos(α) encourages the speed direction along the road direction, |sin(α)| penalizes moving across the road, and p/w penalizes driving on the side of the road. We set the catastrophe reward r_cat to a large negative constant and β = 0.025 as a tunable constant which ensures that the magnitude of the non-catastrophe reward is significantly less than that of the catastrophe reward. The catastrophe reward measures collisions, which are highly undesirable events to be avoided. We note that the constants λ_P = λ_A used to blend reward and variance terms in the risk-augmented Q-functions in Equations 1 and 2 were set to a fixed constant in all experiments.

For evaluation, the total progress reward excludes the catastrophe reward:

  r = β v (cos(α) − |sin(α)| − p/w)(1 − st)(1 − da),   (5)

and the pure progress reward is defined as

  r = β v (cos(α) − |sin(α)|)(1 − st)(1 − da).   (6)

The total progress reward considers both moving along the road and across the road, and penalizes large distances to the center of the road, while the pure progress reward only measures the distance traveled by the vehicle, regardless of the vehicle's location. The latter can be a more realistic measure since vehicles do not always need to be at the center of the road.
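A minimal sketch of the asymmetric reward in Equation (4); β follows Section IV-B, while the catastrophe reward value here is a placeholder since its exact magnitude is not reproduced above.

```python
import math

def torcs_reward(v, alpha, p, w, stuck, damaged,
                 beta=0.025, r_cat=-2.5):
    """Asymmetric reward (Eq. 4). r_cat is a placeholder negative constant.

    v:       speed magnitude
    alpha:   angle between velocity and road direction (radians)
    p:       distance from the road center
    w:       road width
    stuck:   1 if the vehicle is stuck, else 0
    damaged: 1 if the vehicle is damaged (collision), else 0
    """
    catastrophe = math.ceil((stuck + damaged) / 2)  # C in Eq. (4)
    progress = beta * v * (math.cos(alpha) - abs(math.sin(alpha)) - p / w)
    return progress * (1 - stuck) * (1 - damaged) + r_cat * catastrophe

# Normal driving step vs. a collision step.
r_ok = torcs_reward(v=20.0, alpha=0.05, p=0.5, w=10.0, stuck=0, damaged=0)
r_crash = torcs_reward(v=20.0, alpha=0.3, p=4.0, w=10.0, stuck=0, damaged=1)
```

B. Baselines and Our Method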
All baselines are optimized using Adam [41] with learning rate 0.0001 and batch size 32. In all our ensemble DQN models, we trained with 10 heads, since empirically that provided a reasonable balance between having enough models for variance estimation and not so many that training time would be overbearing. For each update, we sampled 5 models using Poisson sampling with q = 0.03 to generate the mask for updating Q value functions. We set the training frequency as 4, the target update frequency as 1000, and the replay buffer size as 100,000. For training DQN with an epsilon-greedy strategy, ε decreased linearly from 1 to 0.02 between step 10,000 and step 500,000. The time point to add in perturbations is ξ = 550,000 steps, and for every m = 10 steps taken by the protagonist agent, the random agent or adversary agent takes n = 1 step. A hyperparameter summary is sketched after this list.

Vanilla DQN. The purpose of comparing with vanilla DQN is to show that models trained in one environment may overfit to specific dynamics and fail to transfer to other environments, particularly those that involve random perturbations. We denote this as dqn.

Ensemble DQN. Ensemble DQN tends to be more robust than vanilla DQN. However, without being trained on different dynamics, even Ensemble DQN may not work well when there are adversarial attacks or simple random changes in the dynamics. We denote this as bsdqn.

Ensemble DQN with Random Perturbations Without Risk Averse Term. We train the protagonist and provide random perturbations according to the schedule Ξ. We do not include the variance-guided exploration term here, so only the Q value function is used for choosing actions. The schedule Ξ is the same as in our method. We denote this as bsdqnrand.

Ensemble DQN with Random Perturbations With the Risk Averse Term. We only train the protagonist agent and provide random perturbations according to the adversarial training schedule Ξ. The protagonist selects actions based on its Q value function and the risk-averse term. We denote this as bsdqnrandriskaverse.

Ensemble DQN with Adversarial Perturbation. This is to compare our model with [11]. For a fair comparison, we also use Ensemble DQN to train the policy, while the variance term is not used as either a risk-averse or risk-seeking term in either agent. We denote this as bsdqnadv.

Our method. We train both the protagonist and the adversary with Ensemble DQN. We include the variance-guided exploration term, so the Q function and its variance across different models are used for action selection. The adversarial perturbation is provided according to the adversarial training schedule Ξ. We denote this as bsdqnadvriskaverse.
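For reference, the training hyperparameters listed at the start of this subsection can be gathered into a single configuration. The key names below are chosen for illustration only; the values simply restate Section IV-B.

```python
# Hypothetical configuration names; values restate Section IV-B.
RARARL_CONFIG = {
    "optimizer": "Adam",
    "learning_rate": 1e-4,
    "batch_size": 32,
    "num_heads": 10,              # ensemble heads k
    "poisson_rate_q": 0.03,       # mask sampling rate
    "train_frequency": 4,         # env steps between updates
    "target_update_freq": 1000,
    "replay_buffer_size": 100_000,
    "epsilon_start": 1.0,
    "epsilon_end": 0.02,
    "epsilon_anneal_steps": (10_000, 500_000),
    "adversary_start_step_xi": 550_000,
    "protagonist_steps_m": 10,
    "adversary_steps_n": 1,
}
```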
C. Evaluation

To evaluate the robustness of our trained models, we use the same trained models under different testing conditions, evaluate using the previously-defined reward classes of total progress (Equation 5) and pure progress (Equation 6), and additionally consider the reward of catastrophes. We present three broad sets of results: (1) No perturbations (Figure 3). We tested all trained models from Section IV-B without perturbations. (2) Random perturbations (Figure 4). To evaluate the robustness of trained models in the presence of random environment perturbations, we benchmarked all trained models using random perturbations. For every 10 actions taken by the main agent, 1 was taken at random. (3)
Adversarial perturbations (Figure 5). To test the ability of our models to avoid catastrophes, which normally require deliberate, non-random perturbations, we test with a trained adversarial agent which took 1 action for every 10 taken by the protagonist.

All subplots in Figures 3, 4, and 5 include a vertical blue line at 0.55 million steps indicating when perturbations were first applied during training (if any). Before 0.55 million steps, we allow enough time for protagonist agents to learn to drive normally. We choose 0.55 million steps because the exploration rate decreases to 0.02 at 0.50 million steps, and we allow an additional 50,000 steps for learning to stabilize.

TABLE I: Robustness of Models Measured by Average Best Catastrophe Reward Per Episode (Higher is better)

Exp                   Normal   Random Perturb   Adv. Perturb
dqn                   -0.80    -3.0             -4.0
bsdqn                 -0.90    -1.1             -2.5
bsdqnrand             -0.10    -1.0             -2.1
bsdqnadv              -0.30    -0.5             -1.0
bsdqnrandriskaverse   -0.09    -0.4             -2.0
bsdqnadvriskaverse    -0.08    -0.1             -0.1
Does adding the adversarial agent's perturbations affect the robustness? In Table I, we compare the robustness of all models by their catastrophe rewards. The results indicate that adding perturbations improves a model's robustness, especially to adversarial attacks. DQN trained with random perturbations is not as robust as models trained with adversarial perturbations, since random perturbations are weaker than adversarial perturbations.
How does the risk term affect the robustness of the trained models? As shown in Figures 4 and 5, models trained with the risk term achieved better robustness under both random and adversarial perturbations. We attribute this to the risk term encouraging the adversary to aggressively explore regions with high risk while encouraging the opposite for the protagonist.
How do adversarial perturbations compare to random perturbations? A trained adversarial agent can enforce stronger perturbations than random perturbations. By comparing Figure 4 and Figure 5, we see that the adversarial perturbation provides stronger attacks, which causes the reward to be lower than with random perturbations. We also visualize an example of the differences between a trained adversary and random perturbations in Figure 6, which shows that a trained adversary can force the protagonist (a vanilla DQN model) to drive into a wall and crash.

V. CONCLUSION
We show that by introducing a notion of risk averse behavior, a protagonist agent trained with a learned adversary experiences substantially fewer catastrophic events during test-time rollouts as compared to agents trained without an adversary. Furthermore, a trained adversarial agent is able to provide stronger perturbations than random perturbations and can provide a better training signal for the protagonist. In future work, we will apply RARARL in other safety-critical domains, such as surgical robotics.

ACKNOWLEDGMENTS

Xinlei Pan is supported by Berkeley Deep Drive. Daniel Seita is supported by a National Physical Science Consortium Fellowship.
Fig. 3: Testing all models without attacks or perturbations. The reward is divided into distance-related reward (left subplot) and progress-related reward (middle subplot). We also present results for catastrophe reward per episode (right subplot). The blue vertical line indicates the beginning of adding perturbations during training. All legends follow the naming convention described in Section IV-B.
Fig. 4: Testing all models with random attacks. The three subplots follow the same convention as in Figure 3.
Fig. 5: Testing all models with adversarial attacks. The three subplots follow the same convention as in Figure 3.
Fig. 6: Two representative (subsampled) sequences of states in TORCS for a trained protagonist, with either a trained adversary (top row) or random perturbations (bottom row) affecting the trajectory. The overlaid arrows in the upper left corners indicate the direction of the vehicle. The top row indicates that the trained adversary is able to force the protagonist to drive towards the right and into the wall (i.e., a catastrophe). Random perturbations cannot affect the protagonist's trajectory to the same extent because many steps of deliberate actions in one direction are needed to force a crash.
REFERENCES

[1] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al., "Mastering the game of go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, 2016.
[2] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[3] S. Shalev-Shwartz, S. Shammah, and A. Shashua, "Safe, multi-agent, reinforcement learning for autonomous driving," arXiv preprint arXiv:1610.03295, 2016.
[4] Y. You, X. Pan, Z. Wang, and C. Lu, "Virtual to real reinforcement learning for autonomous driving," British Machine Vision Conference, 2017.
[5] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," in International Conference on Learning Representations (ICLR), 2016.
[6] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel, "Benchmarking deep reinforcement learning for continuous control," in International Conference on Machine Learning (ICML), 2016.
[7] J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel, "Trust region policy optimization," in International Conference on Machine Learning (ICML), 2015.
[8] Y. Zhu, Z. Wang, J. Merel, A. Rusu, T. Erez, S. Cabi, S. Tunyasuvunakool, J. Kramar, R. Hadsell, N. de Freitas, and N. Heess, "Reinforcement and imitation learning for diverse visuomotor skills," in Robotics: Science and Systems (RSS), 2018.
[9] A. Nagabandi, G. Kahn, R. S. Fearing, and S. Levine, "Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning," in IEEE International Conference on Robotics and Automation (ICRA), 2018.
[10] F. Sadeghi and S. Levine, "CAD2RL: Real single-image flight without a single real image," in Robotics: Science and Systems (RSS), 2017.
[11] L. Pinto, J. Davidson, R. Sukthankar, and A. Gupta, "Robust adversarial reinforcement learning," in International Conference on Machine Learning (ICML), 2017.
[12] A. Mandlekar, Y. Zhu, A. Garg, L. Fei-Fei, and S. Savarese, "Adversarially robust policy learning: Active construction of physically-plausible perturbations," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017.
[13] A. Tamar, D. Di Castro, and S. Mannor, "Learning the variance of the reward-to-go," Journal of Machine Learning Research, vol. 17, no. 13, pp. 1–36, 2016.
[14] I. Osband, C. Blundell, A. Pritzel, and B. Van Roy, "Deep exploration via bootstrapped DQN," in Advances in Neural Information Processing Systems, 2016, pp. 4026–4034.
[15] B. Wymann, E. Espié, C. Guionneau, C. Dimitrakakis, R. Coulom, and A. Sumner, "TORCS, the open racing car simulator," software available at http://torcs.sourceforge.net, 2000.
[16] L. Pinto, J. Davidson, and A. Gupta, "Supervision via competition: Robot adversaries for learning tasks," in IEEE International Conference on Robotics and Automation (ICRA), 2017.
[17] T. Bansal, J. Pachocki, S. Sidor, I. Sutskever, and I. Mordatch, "Emergent complexity via multi-agent competition," in International Conference on Learning Representations (ICLR), 2018.
[18] A. Pattanaik, Z. Tang, S. Liu, G. Bommannan, and G. Chowdhary, "Robust deep reinforcement learning with adversarial attacks," arXiv preprint arXiv:1712.03632, 2017.
[19] I. J. Goodfellow, J. Shlens, and C. Szegedy, "Explaining and harnessing adversarial examples," in International Conference on Learning Representations (ICLR), 2015.
[20] S. Paul, K. Chatzilygeroudis, K. Ciosek, J.-B. Mouret, M. A. Osborne, and S. Whiteson, "Alternating optimisation and quadrature for robust control," in AAAI 2018 - The Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[21] R. Neuneier and O. Mihatsch, "Risk sensitive reinforcement learning," in Neural Information Processing Systems (NIPS), 1998.
[22] S. Carpin, Y.-L. Chow, and M. Pavone, "Risk aversion in finite markov decision processes using total cost criteria and average value at risk," in IEEE International Conference on Robotics and Automation (ICRA), 2016.
[23] Y. Chow, M. Ghavamzadeh, L. Janson, and M. Pavone, "Risk-constrained reinforcement learning with percentile risk criteria," Journal of Machine Learning Research, 2018.
[24] T. M. Moldovan and P. Abbeel, "Safe exploration in markov decision processes," in International Conference on Machine Learning (ICML), 2012.
[25] J. Achiam, D. Held, A. Tamar, and P. Abbeel, "Constrained policy optimization," in International Conference on Machine Learning (ICML), 2017.
[26] D. Held, Z. McCarthy, M. Zhang, F. Shentu, and P. Abbeel, "Probabilistically safe policy transfer," in IEEE International Conference on Robotics and Automation (ICRA), 2017.
[27] M. Plappert, R. Houthooft, P. Dhariwal, S. Sidor, R. Y. Chen, X. Chen, T. Asfour, P. Abbeel, and M. Andrychowicz, "Parameter space noise for exploration," in International Conference on Learning Representations (ICLR), 2018.
[28] M. Fortunato, M. G. Azar, B. Piot, J. Menick, I. Osband, A. Graves, V. Mnih, R. Munos, D. Hassabis, O. Pietquin, C. Blundell, and S. Legg, "Noisy networks for exploration," in International Conference on Learning Representations (ICLR), 2018.
[29] A. Rajeswaran, S. Ghotra, B. Ravindran, and S. Levine, "EPOpt: Learning robust neural network policies using model ensembles," in International Conference on Learning Representations (ICLR), 2017.
[30] B. Eysenbach, S. Gu, J. Ibarz, and S. Levine, "Leave no trace: Learning to reset for safe and autonomous reinforcement learning," in International Conference on Learning Representations (ICLR), 2018.
[31] D. Gandhi, L. Pinto, and A. Gupta, "Learning to fly by crashing," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017.
[32] A. Aswani, H. Gonzalez, S. S. Sastry, and C. Tomlin, "Provably safe and robust learning-based model predictive control," Automatica, vol. 49, no. 5, pp. 1216–1226, 2013.
[33] A. Aswani, P. Bouffard, and C. Tomlin, "Extensions of learning-based model predictive control for real-time application to a quadrotor helicopter," in American Control Conference (ACC), 2012, pp. 4661–4666.
[34] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell, "Curiosity-driven exploration by self-supervised prediction," in International Conference on Machine Learning (ICML), 2017.
[35] M. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos, "Unifying count-based exploration and intrinsic motivation," in Advances in Neural Information Processing Systems, 2016, pp. 1471–1479.
[36] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel, "Domain randomization for transferring deep neural networks from simulation to the real world," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017.
[37] X. B. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel, "Sim-to-real transfer of robotic control with dynamics randomization," in IEEE International Conference on Robotics and Automation (ICRA), 2018.
[38] G.-H. Liu, A. Siravuru, S. Prabhakar, M. Veloso, and G. Kantor, "Learning end-to-end multimodal sensor policies for autonomous navigation," in Conference on Robot Learning (CoRL), 2017.
[39] S. Ebrahimi, A. Rohrbach, and T. Darrell, "Gradient-free policy architecture search and adaptation," in Conference on Robot Learning (CoRL), 2017.
[40] A. Amini, L. Paull, T. Balch, S. Karaman, and D. Rus, "Learning steering bounds for parallel autonomous systems," in IEEE International Conference on Robotics and Automation (ICRA), 2018.
[41] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in International Conference on Learning Representations (ICLR), 2015.