Enhanced Adversarial Strategically-Timed Attacks against Deep Reinforcement Learning
Chao-Han Huck Yang, Jun Qi, Pin-Yu Chen, Yi Ouyang, I-Te Danny Hung, Chin-Hui Lee, Xiaoli Ma
1. Georgia Institute of Technology, Atlanta, GA, USA; 2. IBM Research, Yorktown, NY, USA; 3. Preferred Network America, Berkeley, CA, USA; 4. Columbia University, NY, USA
ABSTRACT
Recent deep neural network based techniques, especially those equipped with the ability of system-level self-adaptation such as deep reinforcement learning (DRL), are shown to possess many advantages in optimizing robot learning systems (e.g., autonomous navigation and continuous robot arm control). However, the learning-based systems and the associated models may be threatened by the risks of intentionally adaptive (e.g., noisy sensor confusion) and adversarial perturbations from real-world scenarios. In this paper, we introduce timing-based adversarial strategies against a DRL-based navigation system by jamming in physical noise patterns on the selected time frames. To study the vulnerability of learning-based navigation systems, we propose two adversarial agent models: one refers to online learning; the other is based on evolutionary learning. Besides, three open-source robot learning and navigation control environments are employed to study the vulnerability under adversarial timing attacks. Our experimental results show that the adversarial timing attacks can lead to a significant performance drop, and also suggest the necessity of enhancing the robustness of robot learning systems.
Index Terms— Deep Reinforcement Learning, Adversarial Robustness, Decision Control, Intelligent Navigation
1. INTRODUCTION
Deep Reinforcement Learning (DRL) has gained widespread application in digital gaming, robotics, and control. In particular, the main DRL approaches, such as the value-based deep Q-network (DQN) [1], Asynchronous Advantage Actor-Critic (A3C) [2], and the population-based Go-explore [3], have succeeded in mastering many dynamically unknown action-searching environments [3]. Owing to their adaptive and interactive behaviors, DRL-based models are commonly used in the domain of navigation and robotics, and achieve a noticeable improvement over classical methods. However, despite the significant performance enhancement, DRL-based models may incur new challenges in terms of system robustness against adversarial attacks. For example, DRL-based navigation systems are likely to propagate and even enlarge risks (e.g., delay, noise, and pixel-wise pulsed signals [4] on the sensor networks of a vehicle [5]) induced by attackers. Besides, unlike image classification tasks where only a single mission is involved, a navigation learning agent has to deal with a couple of dynamic states (e.g., inputs from sensors or raw pixels) and the related rewards. Our work mainly focuses on the robustness analysis of strategically-timed attacks by potential noises incurred from real-world scenarios. More specifically, we formulate the adversarial attacks under two DRL-security settings:

• White-box attack: if the attacker can access the model parameters, a potential function is used to estimate the learning performance and decide when to jam in noise.

• Black-box attack: without requiring model parameters, the attacker trains a policy agent with the opposite reward objective by observing the actions of the victim DRL network, the state, and the reward from the environment.

To validate the adversarial robustness of a navigation system, we pursue a new and important research direction based on 3D environments of (1) continuous robot arm control (e.g., Unity Reacher); (2) a sensor-input navigation system (e.g., Unity Banana Collector [6]); and (3) raw-image self-driving environments (e.g., Donkey Car), as shown in Fig. 1 (1), (2), and (3).
2. RELATED WORK

Scheduling Physical Attacks on Sensor Fusion.
Sensor networks for navigation systems are susceptible to flooding-based attacks like Pulsing Denial-of-Service (PDoS) [7] and adversary selective jamming attacks [8]. The related work includes the security and robustness of background noise, spoofing pulses, and jamming signals on autonomous vehicles. For example, Yan et al. [9] show that PDoS attacks can feasibly be conducted on a Tesla Model S automobile equipped with standard millimeter-wave radars, ultrasonic sensors, and forward-looking cameras. Besides, to detect anonymous network attacks, a sensing engine defined by offline algorithms is required within a built-in network system. Furthermore, a recent work [10] also demonstrates that the LiDAR-based Apollo-Auto system [11] could be fooled by adversarial noises during the 3D-point-cloud pre-processing phase as a malicious reconstruction.

Adversarial Attacks on Deep Reinforcement Learning.
Many works are devoted to adversarial attacks on neural network classifiers in either white-box or black-box settings [12, 13]. Goodfellow et al. [14] proposed adversarial examples for evaluating the robustness of machine learning classifiers. Zeroth order optimization (ZOO) [13] was employed to estimate the gradients of black-box systems for generating adversarial examples. Besides, studies of RL-based adversarial attacks aim at addressing policy misconduct [15, 16] or generalization issues [17]. In particular, Lin et al. [16] developed a strategically-timed attacking method in which, at time $t$, an agent takes action based on a policy derived from a potential energy function [18]. However, these approaches do not consider the online update of weights associated with the size of the action space. In this work, we further improve the potential-estimation model from [16] by weighted-majority online learning, which owns a performance guarantee with the regret bound in Eq. (4). Besides, we introduce a more realistic black-box timed-attack setting.
Fig. 1: The 3D robot learning environments: (1) continuous robot arm control as Env 1; (2) banana collector as Env 2; (3) self-driving donkey car as Env 3. Noisy observation under timing attack: (4) zero-out; (5) random sensor fusion; (6) adversarial perturbation.
3. METHOD

3.1. Noisy Observation from the Real World
We define a noisy DRL framework of a robot learning system under perturbation, where a noisy state observation $\mathit{noisy\_s}_t$ can be formulated as the addition of a state $s_t$ and a noise pattern $\mathit{noise}(t)$:

$$\mathit{noisy\_s}_t = s_t + \mathit{noise}(t). \quad (1)$$

We propose three principal types of noise tests ($T_1$, $T_2$, and $T_3$) from the real world to impose adversarial timing attacks:

Pulsed Zero-out Attacks ($T_1$): off-the-shelf hardware [9] can affect the entire sensor network with an overshooting noise, so a timing attack yields $\mathit{noisy\_s}_t = 0$ in Eq. (1), as in Fig. 1 (4).

Gaussian Average on Sensor Fusion ($T_2$): sensor fusion is an essential part of an autonomous system, combining sensory data from disparate sources with less uncertainty. We define a noisy sensor-fusion system by applying a Gaussian filter to obtain $\mathit{noisy\_s}_t$ in Eq. (1), as shown in Fig. 1 (5).

Adversarial Noise Patterns ($T_3$): inspired by the fast gradient sign method (FGSM) [12, 15] based DQN attacks, we use FGSM to generate adversarial patterns against the prediction loss of a well-trained DQN. We use a fixed $\epsilon$ under an $\ell_\infty$-norm restriction, where $x$ is the whole input including $s_t$ and $r_t$, and $y = a_t$ is the optimal output action obtained by weighting over the possible actions, in Eq. (2):

$$\mathit{noise}(t) = \epsilon\,\mathrm{sign}(\nabla_x J(\theta, x, y)). \quad (2)$$

To evaluate the performance of each timing-selection algorithm in the following sections, each model receives the noise patterns ($T_1$, $T_2$, and $T_3$) and the total rewards are averaged as in Table 1. From a system-level perspective, we take the random pulsed signal as the attacking baseline: we jam in the PDoS signals discussed above at random time frames, with a maximum constraint of $H$ attacks per episode (following the baseline setting in [16]), to block the agent from obtaining actual state observations. Recently, since various pre-defined DRL architectures and models (e.g., Google Dopamine [19]) have been released for public use and as a key to Business-to-Business (B2B) solutions, an adversarial attacker is likely to access the open-source code and design an efficient strategically-timed attack.
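For concreteness, the following is a minimal NumPy sketch of the three noise tests built around Eq. (1) and Eq. (2); the helper names, the filter width `sigma`, and the gradient callable `grad_fn` are illustrative assumptions rather than the exact implementation used in the experiments.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def zero_out_noisy_state(state):
    """T1: pulsed zero-out attack -- the jammed sensors return nothing,
    i.e., Eq. (1) with noise(t) = -s_t, so noisy_s_t = 0."""
    return np.zeros_like(state)

def gaussian_fusion_noisy_state(state, sigma=1.0):
    """T2: noisy sensor fusion, modeled here as Gaussian smoothing over
    the state vector; the filter width sigma is an assumed parameter."""
    return gaussian_filter1d(state.astype(np.float64), sigma=sigma)

def fgsm_noisy_state(state, epsilon, grad_fn):
    """T3: FGSM perturbation (Eq. 2); grad_fn is an assumed callable that
    returns the gradient of the DQN prediction loss J(theta, x, y) with
    respect to the input."""
    return state + epsilon * np.sign(grad_fn(state))

def observe(state, attack_now, noisy_fn):
    """Apply the perturbation only on the strategically selected frames."""
    return noisy_fn(state) if attack_now else state
```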
Weighted-Majority Potential Energy Function.
We first propose an advanced adversarial attack that originates from online learning and is based on the weighted majority algorithm (WMA). The procedure of WMA is shown in Eq. (3) and Algorithm 1, where we introduce $d$ experts for weighting the revenues incurred by taking the $d$ actions. The weights of the experts are equally initialized to 1 and then iteratively updated as in step (12) of Algorithm 1. At each time $t$, steps (7) and (8) obtain an $a_t^{\max}$ and an $a_t^{\min}$, which correspond to the actions of maximum and minimum costs. The decision of attacking the states relies on the threshold value $c(s_t, w_t, a_t^{\max}, a_t^{\min})$: if it is greater than a pre-specified constant threshold $\beta$, we attack the states by adding pulses to make the victim receive random observations. The choice of $c(s_t, w_t, a_t^{\max}, a_t^{\min})$ is based on the difference of two potential energy functions (inspired by [16] and [15]), defined as

$$c(s_t, w_t, a_t^{\min}, a_t^{\max}) = \frac{w_t^{\top} \exp(-Q(s_t, a_t^{\max}))}{\sum_{a_t^{(k)}} w_t^{\top} \exp(-Q(s_t, a_t^{(k)}))} - \frac{w_t^{\top} \exp(-Q(s_t, a_t^{\min}))}{\sum_{a_t^{(k)}} w_t^{\top} \exp(-Q(s_t, a_t^{(k)}))}. \quad (3)$$

We use the strategically-timed attack in [16] as a baseline with $\beta = 0.3$ to evaluate our WMA-enhanced algorithm; a minimal sketch of one attack round is given below.
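The sketch below covers Eq. (3) and steps 5-12 of Algorithm 1, assuming the elementwise reading of $w_t^{\top}\exp(-Q(s_t, a_t))$ (expert $i$ weights the potential of action $i$); all names are illustrative.

```python
import numpy as np

def wma_attack_round(q_values, w_bar, eta, beta):
    """One round of the WMA-based timing attack (steps 5-12 of Alg. 1).
    q_values: Q(s_t, a) for the d actions (the expert revenues);
    w_bar:    unnormalized expert weights of shape (d,);
    eta/beta: learning rate and attack threshold.
    Returns (attack_now, updated_w_bar)."""
    w = w_bar / w_bar.sum()                  # step 5: normalize by Z_t
    scores = w * np.exp(-q_values)           # potential energy per action
    # Eq. (3): normalized gap between the max- and min-cost actions.
    c = (scores.max() - scores.min()) / scores.sum()
    attack_now = c > beta                    # steps 9-10
    w_bar = w_bar * np.exp(-eta * q_values)  # step 12: update rule
    return attack_now, w_bar
```

When `attack_now` is true, the current frame is jammed with one of the noise patterns of Sec. 3.1 (e.g., $s_t \leftarrow \mathrm{shuffle}(s_t)$).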
We further discuss a learning bound for this advanced WMA-policy estimation.

Proposition 1: Assuming that the total number of rounds satisfies $T > \ln(d)$, the weighted majority algorithm enjoys the bound in Eq. (4), where $Z_t$ denotes a normalization term at time $t$:

$$\mathrm{regret}_T = \sum_{t=1}^{T} \frac{w_t^{\top} \exp(-Q(s_t, a_t))}{Z_t} - \min_{i \in [d]} \sum_{t=1}^{T} \frac{\exp(-Q(s_t, a_t^{(i)}))}{Z_t} \le \sqrt{\ln(d)\, T}. \quad (4)$$

Proposition 1, following [18], suggests that the weighted revenues are more likely to reach the global optimum in theory, since the regret at time $T$ is upper bounded for Algorithm 1.

Black-box Adversarial-Strategic Agent. Since an insidious adversarial attacking agent is hardly recognizable, such an agent is able to drive the equilibrium of a DRL-based system with an opposite objective reward without any information about the targeted DRL model. Thus, we propose an adversarial-strategic agent (ASA) via a population-based training method based on parameter-exploring policy gradients (PEPG) [20] to attack a black-box system. The PEPG-ASA algorithm dynamically selects sensitive time frames for jamming in the physical noise patterns of Section 3.1, aiming to minimize the total system reward from an offline observation of input-output pairs, without accessing actual parameters of the given DRL framework, as below:

• observation: records of states $S = [s_1, s_2, \ldots, s_n]$ and adversarial rewards $R_{\mathrm{adv}} = [r_1, r_2, \ldots, r_n]$ against the victim navigation DRL agent, as a black-box security setting.

• adversarial reward $R_{\mathrm{adv}}$: a negative absolute value of the environmental reward $R_{\mathrm{env}}$.

For potential-energy estimation on a policy-based model (e.g., A3C), we use a weighted-majority average as $c(s_t, w_t) = \max_{a_t} w_t^{\top} \pi(s_t, a_t) - \min_{a_t} w_t^{\top} \pi(s_t, a_t)$. An obvious way to maximize $\mathbb{E}[R_{\mathrm{adv}} \mid s, a_{\mathrm{adv}}, \pi_{\mathrm{adv}}]$ is to estimate $\nabla \mathbb{E}$. Differentiating this form of the expected return with respect to $\rho$ and applying sampling methods, where $\rho$ in Eq. (5) denotes the parameters determining the distribution over $\theta$, the agent can generate a history $h$ from $p(h \mid \theta)$ and yield the following gradient estimator:

$$\nabla_{\rho} \mathbb{E}(\rho) \approx \frac{1}{N} \sum_{n=1}^{N} \nabla_{\rho} \log p(\theta \mid \rho)\, r(h_n). \quad (5)$$

The probabilistic policy, which is parametrized over a single parameter $\theta$ for PEPG, has the advantage of taking deterministic actions such that an entire track of history can be traced by sampling the parameter $\theta$; a sketch of this estimator follows.
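Below is a minimal sketch of one PEPG-ASA update implementing Eq. (5), assuming a factorized Gaussian search distribution $p(\theta \mid \rho)$ with $\rho = (\mu, \sigma)$ and omitting PEPG's symmetric sampling for brevity; `rollout_fn`, `pop_size`, and `lr` are illustrative.

```python
import numpy as np

def pepg_asa_update(mu, sigma, rollout_fn, pop_size=20, lr=0.05):
    """One PEPG update for the black-box ASA. rollout_fn(theta) runs the
    victim for one episode under the attack-timing parameters theta and
    returns the adversarial reward R_adv = -|R_env|. The pair (mu, sigma)
    is rho, parametrizing the Gaussian search distribution p(theta | rho)."""
    thetas = mu + sigma * np.random.randn(pop_size, mu.size)
    rewards = np.array([rollout_fn(th) for th in thetas], dtype=np.float64)
    rewards -= rewards.mean()  # baseline subtraction for variance reduction
    # Likelihood-ratio terms of Eq. (5) for a factorized Gaussian:
    d_mu = (thetas - mu) / sigma ** 2
    d_sigma = ((thetas - mu) ** 2 - sigma ** 2) / sigma ** 3
    mu = mu + lr * (rewards[:, None] * d_mu).mean(axis=0)
    sigma = np.abs(sigma + lr * (rewards[:, None] * d_sigma).mean(axis=0))
    return mu, sigma
```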
4. RESULTS

4.1. 3D Control and Robot Learning Environment Setup
Our testing platforms were based on the recently released open-source 'Unity-3D' environments [6] for robotic applications.
Env 1 Reacher:
A double-jointed arm moves to a desired position. A reward of +0.1 is provided for each step that the agent's hand is in the goal location. The observation space consists of 33 variables corresponding to the position, rotation, velocity, and angular velocities of the arm. Each action is a vector with four numbers, corresponding to the torques applicable to the two joints. Each entry in the action vector should be a numerical value between -1 and 1.
Env 2 Banana Collector:
The agent is a first-person-view vehicle that should collect as many yellow bananas as possible while avoiding blue ones: a reward of +1 is provided for collecting a yellow banana, and a reward of -1 for collecting a blue banana. The state space has 37 dimensions and contains the agent's velocity, along with the ray-based perception of objects around the agent's forward direction. Four discrete actions are available, associated with the four moving directions.

Env 3 Donkey Car:
Donkey Car is an open-source embedded system for radio-controlled vehicles with an offline RL simulator. The state input is the image from the front camera with 80 × 80 pixels, the actions are two steering values ranging from -1 to 1, and the reward is a cross-track error (CTE). We use a modified reward from [21], divided by 1k, to balance track-staying and maximizing speed; a sketch of one plausible shaping is given below.
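This is one plausible reading of the CTE-based reward; only the division by 1k is taken from the text, and every other constant is an assumption.

```python
def shaped_reward(cte, speed, max_cte=2.0):
    """Hypothetical CTE-based reward in the spirit of [21]: reward staying
    near the track center while keeping speed. The off-track threshold
    max_cte and the linear shaping are assumptions."""
    if abs(cte) > max_cte:
        return -1.0  # treat leaving the track as a failure
    return (1.0 - abs(cte) / max_cte) * speed / 1000.0
```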
We applied two classical DRL algorithms, namely DQNand A3C, to evaluate the learning performance relative towell-trained DRL models in Tab. 1.
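The evaluation protocol behind Table 1 can be summarized by the following gym-style sketch, which averages total rewards over ten episodes while jamming at most $H$ randomly chosen frames per episode (the random baseline of Sec. 3.1); the environment/agent interface and the per-step attack probability are assumptions.

```python
import numpy as np

def evaluate_under_attack(env, agent, noisy_fn, h_max, n_runs=10, p=0.05):
    """Average the total reward over n_runs episodes while jamming at most
    h_max randomly chosen frames per episode. The gym-style env/agent
    interface and the attack probability p are assumed for illustration."""
    totals = []
    for _ in range(n_runs):
        state, done, total, budget = env.reset(), False, 0.0, h_max
        while not done:
            if budget > 0 and np.random.rand() < p:
                state = noisy_fn(state)  # jam the observation
                budget -= 1
            state, reward, done = env.step(agent.act(state))
            total += reward
        totals.append(total)
    return float(np.mean(totals)), float(np.std(totals))
```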
Baseline (aka no attack):
We modify DQN and A3C models from the open-source Dopamine 2.0 [19] package to avoid an overparameterized model, with a reproducibility guarantee.
Adversarial Robustness (aka under attack): the timing attacks follow the settings of Sec. 3, and the averaged total rewards under each noise pattern are reported in Table 1.

Algorithm 1: Adversarial Online Learning Attack based on the Weighted Majority Algorithm
1. Input: number of experts $d$; number of rounds $T$; a threshold constant $\beta$.
2. Parameter: $\eta = \sqrt{\ln(d)/T}$; expert weights $\bar{w} \in \mathbb{R}^d$ associated with the $d$ actions.
3. Initialize: $\bar{w}_1 = (1, 1, \ldots, 1)$.
4. For $t = 1, 2, \ldots, T$:
5.   Set $w_t = \bar{w}_t / Z_t$, where $Z_t = \sum_i \bar{w}_t^{(i)}$.
6.   Receive revenues from all experts: $Q(s_t, a_t) = [Q(s_t, a_t^{(1)}), Q(s_t, a_t^{(2)}), \ldots, Q(s_t, a_t^{(d)})]$.
7.   $a_t^{\min} \leftarrow \arg\min_{a_t} w_t^{\top} \exp(-Q(s_t, a_t))$.
8.   $a_t^{\max} \leftarrow \arg\max_{a_t} w_t^{\top} \exp(-Q(s_t, a_t))$.
9.   Compute the threshold function $c(s_t, w_t, a_t^{\max}, a_t^{\min})$.
10.  If $c(s_t, w_t, a_t^{\max}, a_t^{\min}) > \beta$:
11.    Attack the state by $s_t \leftarrow \mathrm{shuffle}(s_t)$.
12. Update rule: $\forall i,\ \bar{w}_{t+1}^{(i)} = \bar{w}_t^{(i)} \exp(-\eta\, Q(s_t, a_t^{(i)}))$.

Table 1: A comparison of the performance of the timing-attack algorithms on the Env 1, Env 2, and Env 3 environments. We evaluate the robustness of a robot learning system with four types of strategically-timed attack algorithms, namely random selection, the weighted majority algorithm (WMA), the parameter-exploring policy gradients [20] adversarial-strategic agent (PEPG-ASA), and Lin et al. [16]. All experiments are tested ten times under three different types of noise patterns (zero-out, Gaussian, and adversarial noises), and the total rewards are averaged over the ten runs.
Model | Baseline | Random | WMA | PEPG-ASA | Lin et al. [16]
Env 1: Continuous Robot-Arm Control with DQN [1] | 30.2 | – | – | – | –
Env 1: Continuous Robot-Arm Control with A3C [2] | 30.1 | – | – | – | –
Env 2: 3D-Banana-Collector Navigation with DQN | – | – | – | – | –
Env 2: 3D-Banana-Collector Navigation with A3C | – | – | – | – | –
Env 3: Donkey-Car Navigation with DQN | – | – | – | – | –
Env 3: Donkey-Car Navigation with A3C | – | – | – | – | –
5. CONCLUSION
This work introduces two novel adversarial timing-attack algorithms for evaluating the robustness of DRL-based models under white-box and black-box adversarial settings. The experiments suggest that the improved performance of DRL-based continuous control and robot learning models can be significantly degraded in adversarial settings. In particular, both value-based and policy-based DRL algorithms are easily manipulated by a black-box adversarial attacking agent. Besides, our work points out the importance of robustness and adversarial training against adversarial examples in DRL-based navigation systems. Our future work will discuss the visualization and interpretability of robot learning and control systems in order to secure the system. To improve model defense, we could also adapt adversarial training [12] to train the DQN and A3C models with noisy states.
Fig. 2: Learning performance of a DQN agent tested in the Unity 3D banana collector, a 3D-navigation task, including the baseline (a.k.a. no attacks), random jamming, WMA, PEPG-ASA, and the method from Lin et al. [16].

6. REFERENCES
[1] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.

[2] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu, "Asynchronous methods for deep reinforcement learning," in International Conference on Machine Learning, 2016, pp. 1928–1937.

[3] Adrien Ecoffet, Joost Huizinga, Joel Lehman, Kenneth O. Stanley, and Jeff Clune, "Go-explore: A new approach for hard-exploration problems," arXiv preprint arXiv:1901.10995, 2019.

[4] Chao-Han Huck Yang, Yi-Chieh Liu, Pin-Yu Chen, Xiaoli Ma, and Yi-Chang James Tsai, "When causal intervention meets adversarial examples and image masking for deep neural networks," in ICIP. IEEE, 2019, pp. 3811–3815.

[5] Tor A. Johansen, Andrea Cristofaro, Kim Sørensen, Jakob M. Hansen, and Thor I. Fossen, "On estimation of wind velocity, angle-of-attack and sideslip angle of small UAVs using standard sensors," in International Conference on Unmanned Aircraft Systems (ICUAS). IEEE, 2015, pp. 510–519.

[6] Arthur Juliani, Vincent-Pierre Berges, Esh Vckay, Yuan Gao, Hunter Henry, Marwan Mattar, and Danny Lange, "Unity: A general platform for intelligent agents," arXiv preprint arXiv:1809.02627, 2018.

[7] Xiapu Luo and Rocky K. C. Chang, "On a new class of pulsing denial-of-service attacks and the defense," in NDSS, 2005.

[8] Alejandro Proano and Loukas Lazos, "Selective jamming attacks in wireless networks," in 2010 IEEE International Conference on Communications (ICC). IEEE, 2010, pp. 1–6.

[9] Chen Yan, Wenyuan Xu, and Jianhao Liu, "Can you trust autonomous vehicles: Contactless attacks against sensors of self-driving vehicle," DEF CON 24, 2016.

[10] Yulong Cao, Chaowei Xiao, Dawei Yang, Jing Fang, Ruigang Yang, Mingyan Liu, and Bo Li, "Adversarial objects against LiDAR-based autonomous driving systems," arXiv preprint arXiv:1907.05418, 2019.

[11] Haoyang Fan, Fan Zhu, Changchun Liu, Liangliang Zhang, Li Zhuang, Dong Li, Weicheng Zhu, Jiangtao Hu, Hongye Li, and Qi Kong, "Baidu Apollo EM motion planner," arXiv preprint arXiv:1807.08048, 2018.

[12] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy, "Explaining and harnessing adversarial examples," ICLR, 2015.

[13] Pin-Yu Chen, Huan Zhang, Yash Sharma, Jinfeng Yi, and Cho-Jui Hsieh, "ZOO: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models," in Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security. ACM, 2017, pp. 15–26.

[14] Alexey Kurakin, Ian Goodfellow, Samy Bengio, Yinpeng Dong, Fangzhou Liao, Ming Liang, Tianyu Pang, Jun Zhu, Xiaolin Hu, Cihang Xie, et al., "Adversarial attacks and defences competition," arXiv preprint arXiv:1804.00097, 2018.

[15] Sandy Huang, Nicolas Papernot, Ian Goodfellow, Yan Duan, and Pieter Abbeel, "Adversarial attacks on neural network policies," arXiv preprint arXiv:1702.02284, 2017.

[16] Yen-Chen Lin, Zhang-Wei Hong, Yuan-Hong Liao, Meng-Li Shih, Ming-Yu Liu, and Min Sun, "Tactics of adversarial attack on deep reinforcement learning agents," arXiv preprint arXiv:1703.06748, 2017.

[17] Lerrel Pinto, James Davidson, Rahul Sukthankar, and Abhinav Gupta, "Robust adversarial reinforcement learning," arXiv preprint arXiv:1703.02702, 2017.

[18] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar, Foundations of Machine Learning, MIT Press, 2012.

[19] Pablo Samuel Castro, Subhodeep Moitra, Carles Gelada, Saurabh Kumar, and Marc G. Bellemare, "Dopamine: A research framework for deep reinforcement learning," arXiv preprint arXiv:1812.06110, 2018.

[20] Frank Sehnke, Christian Osendorfer, Thomas Rückstieß, Alex Graves, Jan Peters, and Jürgen Schmidhuber, "Parameter-exploring policy gradients," Neural Networks, vol. 23, no. 4, pp. 551–559, 2010.

[21] Bharat Prakash, Mark Horton, Nicholas R. Waytowich, William David Hairston, Tim Oates, and Tinoosh Mohsenin, "On the use of deep autoencoders for efficient embedded reinforcement learning," in Proceedings of the Great Lakes Symposium on VLSI (GLSVLSI), 2019.