Vulnerability of Deep Reinforcement Learning to Policy Induction Attacks
Vahid Behzadan and Arslan Munir
Department of Computer Science and Engineering, University of Nevada, Reno, 1664 N Virginia St, Reno, NV 89557 {vbehzadan,amunir}@unr.edu
Abstract.
Deep learning classifiers are known to be inherently vulnerable to manipulation by intentionally perturbed inputs, named adversarial examples. In this work, we establish that reinforcement learning techniques based on Deep Q-Networks (DQNs) are also vulnerable to adversarial input perturbations, and verify the transferability of adversarial examples across different DQN models. Furthermore, we present a novel class of attacks based on this vulnerability that enable policy manipulation and induction in the learning process of DQNs. We propose an attack mechanism that exploits the transferability of adversarial examples to implement policy induction attacks on DQNs, and demonstrate its efficacy and impact through experimental study of a game-learning scenario.
Keywords:
Reinforcement Learning, Deep Q-Learning, Adversarial Examples, Policy Induction, Manipulation, Vulnerability
1 Introduction

Inspired by the psychological and neuroscientific models of natural learning, Reinforcement Learning (RL) techniques aim to optimize the actions of intelligent agents in complex environments by learning effective controls and reactions that maximize the long-term reward of agents [1]. The applications of RL range from combinatorial search problems such as learning to play games [2] to autonomous navigation [3], multi-agent systems [4], and optimal control [5]. However, classic RL techniques generally rely on hand-crafted representations of sensory input, thus limiting their performance in complex and high-dimensional real-world environments. To overcome this limitation, recent developments combine RL techniques with the significant feature extraction and processing capabilities of deep learning models in a framework known as Deep Q-Network (DQN) [6]. This approach exploits deep neural networks for both feature selection and Q-function approximation, hence enabling unprecedented performance in complex settings such as learning efficient playing strategies from unlabeled video frames of Atari games [7], robotic manipulation [8], and autonomous navigation of aerial [9] and ground vehicles [10].
The growing interest in the application of DQNs in critical systems necessitates the investigation of this framework with regard to its resilience and robustness to adversarial attacks on the integrity of reinforcement learning processes. The reliance of RL on interactions with the environment gives rise to an inherent vulnerability which makes the process of learning susceptible to perturbation as a result of changes in the observable environment. Exploiting this vulnerability provides adversaries with the means to disrupt or change control policies, leading to unintended and potentially harmful actions. For instance, manipulation of the obstacle avoidance and navigation policies learned by autonomous Unmanned Aerial Vehicles (UAVs) enables the adversary to use such systems as kinetic weapons by inducing actions that lead to intentional collisions.

In this paper, we study the efficacy and impact of policy induction attacks on the Deep Q-Learning RL framework. To this end, we propose a novel attack methodology based on adversarial example attacks against deep learning models [13]. Through experimental results, we verify that, similar to classifiers, Q-networks are also vulnerable to adversarial examples, and confirm the transferability of such examples between different models. We then evaluate the proposed attack methodology on the original DQN architecture of Mnih et al. [7], the results of which verify the feasibility of policy induction attacks by incurring minimal perturbations in the environment or sensory inputs of an RL system.
We also discuss the insufficiency of defensive distillation [14] and adversarial training [15] techniques as state-of-the-art countermeasures proposed against adversarial example attacks on deep learning classifiers, and present potential techniques to mitigate the effect of policy induction attacks against DQNs.

The remainder of this paper is organized as follows: Section 2 presents an overview of Q-Learning, Deep Q-Networks, and adversarial examples. Section 3 formalizes the problem and defines the target and attacker models. In Section 4, we outline the attack methodology and algorithm, followed by the experimental evaluation of the proposed methodology in Section 5. A high-level discussion on the effectiveness of current countermeasures is presented in Section 6, and the paper is concluded in Section 7 with remarks on future research directions.
2 Background

2.1 Q-Learning

The generic RL problem can be formally modeled as a Markov Decision Process (MDP), described by the tuple MDP = (S, A, P, R), where S is the set of reachable states in the process, A is the set of available actions, R is the mapping of transitions to the immediate reward, and P represents the transition probabilities. At any given time-step t, the MDP is at a state s_t ∈ S. The RL agent's choice of action at time t, a_t ∈ A, causes a transition from s_t to a state s_{t+1} according to the transition probability P^{a_t}_{s_t, s_{t+1}}. The agent receives a reward r_t = R(s_t, a_t) ∈ ℝ for choosing the action a_t at state s_t.

Interactions of the agent with the MDP are captured in a policy π. When such interactions are deterministic, the policy π : S → A is a mapping between the states and their corresponding actions. A stochastic policy π(s, a) represents the probability of optimality for action a at state s.

The objective of RL is to find the optimal policy π* that maximizes the cumulative reward over time, denoted at time t by the return function R̂ = Σ_{t'=t}^{T} γ^{t'−t} r_{t'}, where γ < 1 is the discount factor that ensures R̂ is bounded.

One approach to this problem is to estimate the optimal value of each action, defined as the expected sum of future rewards when taking that action and following the optimal policy thereafter. The value of an action a in a state s is given by the action-value function Q, defined as:

Q(s, a) = R(s, a) + γ max_{a'} Q(s', a')    (1)

where s' is the state that emerges as a result of action a, and a' is a possible action in state s'. The optimal Q value given a policy π is hence defined as Q*(s, a) = max_π Q^π(s, a), and the optimal policy is given by π*(s) = argmax_a Q(s, a).

The Q-learning method estimates the optimal action policies by using the Bellman equation Q_{i+1}(s, a) = E[R + γ max_{a'} Q_i(s', a')] as the iterative update of a value iteration technique.
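For illustration, the value-iteration style update above can be sketched in the tabular case as follows. This is a minimal sketch: the learning rate alpha and the table layout are assumptions for demonstration, not part of the paper's setup.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One step toward the Bellman target R(s, a) + gamma * max_a' Q(s', a').

    Q is a |S| x |A| table; s, a, s_next are integer indices.
    """
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```

Repeated application of this update over observed transitions converges to Q* under standard conditions (sufficient exploration, decaying learning rate).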
Practical implementation of Q-learning is commonly based on function approximation of the parametrized Q-function Q(s, a; θ) ≈ Q*(s, a). A common technique for approximating the parametrized non-linear Q-function is to train a neural network whose weights correspond to θ. Such neural networks, commonly referred to as Q-networks, are trained such that at every iteration i, the following loss function is minimized:

L_i(θ_i) = E_{s,a∼ρ(·)}[(y_i − Q(s, a; θ_i))²]    (2)

where y_i = E[R + γ max_{a'} Q(s', a'; θ_{i−1}) | s, a], and ρ(s, a) is a probability distribution over states s and actions a. This optimization problem is typically solved using computationally efficient techniques such as Stochastic Gradient Descent (SGD) [11].

2.2 Deep Q-Networks

Classical Q-networks present a number of major disadvantages in the Q-learning process. First, the sequential processing of consecutive observations breaks the iid requirement of training data, as successive samples are correlated. Furthermore, slight changes to Q-values lead to rapid changes in the policy estimated by the Q-network, thus enabling policy oscillations. Also, since the scale of rewards and Q-values is unknown, the gradients of Q-networks can be sufficiently large to render the backpropagation process unstable.

A Deep Q-Network (DQN) [6] is a multi-layered Q-network designed to mitigate such disadvantages. To overcome the issue of correlation between consecutive observations, DQN employs a technique named experience replay: instead of training on successive observations, experience replay samples a random batch
Fig. 1: DQN architecture for end-to-end learning of Atari 2600 game plays

of previous observations stored in the replay memory to train on. As a result, the correlation between successive training samples is broken and the iid setting is re-established. In order to avoid oscillations, DQN fixes the parameters of the optimization target y_i. These parameters are then updated at regular intervals by adopting the current weights of the Q-network. The issue of instability in backpropagation is also solved in DQN by clipping the reward values to the range [−1, +1], thus preventing Q-values from becoming too large.

Mnih et al. [7] demonstrate the application of this new Q-network technique to end-to-end learning of Q values in playing Atari games based on observations of pixel values in the game environment. The neural network architecture of this work is depicted in Figure 1. To capture the movements in the game environment, Mnih et al. use stacks of 4 consecutive image frames as the input to the network. To train the network, a random batch is sampled from the previous observation tuples (s_t, a_t, r_t, s_{t+1}). Each observation is then processed by 2 layers of convolutional neural networks to learn the features of input images, which are then employed by feed-forward layers to approximate the Q-function. The target network Q̂, with parameters θ⁻, is synchronized with the parameters of the original Q network at fixed intervals, i.e., at every i-th iteration, θ⁻_t = θ_t, and is kept fixed until the next synchronization. The target value for optimization of DQN learning thus becomes:

y'_t ≡ r_{t+1} + γ max_{a'} Q̂(s_{t+1}, a'; θ⁻)    (3)

Accordingly, the training process can be stated as:

min_{a_t} (y'_t − Q(s_t, a_t; θ))²    (4)

2.3 Adversarial Examples

In [16], Szegedy et al.
report an intriguing discovery: several machine learning models, including deep neural networks, are vulnerable to adversarial examples. That is, these machine learning models misclassify inputs that are only slightly different from correctly classified samples drawn from the data distribution. Furthermore, a wide variety of models with different architectures trained on different subsets of the training data misclassify the same adversarial example.

This suggests that adversarial examples expose fundamental blind spots in machine learning algorithms. The issue can be stated as follows: consider a machine learning system M and a benign input sample C which is correctly classified by the machine learning system, i.e., M(C) = y_true. According to the report of Szegedy [16] and many subsequent studies [13], it is possible to construct an adversarial example A = C + δ, which is perceptually indistinguishable from C, but is classified incorrectly, i.e., M(A) ≠ y_true.

Adversarial examples are misclassified far more often than examples that have been perturbed by random noise, even if the magnitude of the noise is much larger than the magnitude of the adversarial perturbation [17]. According to the objective of adversaries, adversarial example attacks are generally classified into the following two categories:

1. Misclassification attacks, which aim to generate examples that are classified incorrectly by the target network
2. Targeted attacks, whose goal is to generate samples that the target misclassifies into an arbitrary class designated by the attacker.

To generate such adversarial examples, several algorithms have been proposed, such as the Fast Gradient Sign Method (FGSM) by Goodfellow et al. [17], and the Jacobian Saliency Map Algorithm (JSMA) approach by Papernot et al. [13].
A grounding assumption in many of the crafting algorithms is that the attacker has complete knowledge of the target neural network, such as its architecture, weights, and other hyperparameters. Recently, Papernot et al. [18] proposed the first black-box approach to generating adversarial examples. This method exploits the generalized nature of adversarial examples: an adversarial example generated for a neural network classifier applies to most other neural network classifiers that perform the same classification task, regardless of their architecture, parameters, and even the distribution of training data. Accordingly, the approach of [18] is based on generating a replica of the target network. To train this replica, the attacker creates and trains over a dataset made from a mixture of samples obtained by observing the target's performance, and synthetically generated input and label pairs. Once trained, any of the adversarial example crafting algorithms that require knowledge of the target network can be applied to the replica. Due to the transferability of adversarial examples, the perturbed samples generated from the replica network will induce misclassifications in many of the other networks that perform the same task. In the following sections, we describe how a similar approach can be adopted in policy induction attacks against DQNs.
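As a concrete illustration of one such crafting algorithm, FGSM perturbs an input in the direction of the sign of the loss gradient with respect to that input. The sketch below is a minimal numpy rendition under the assumption of a supplied gradient function (grad_loss_wrt_input is a hypothetical stand-in for a framework's autodiff) and inputs normalized to [0, 1]:

```python
import numpy as np

def fgsm(x, grad_loss_wrt_input, epsilon=0.01):
    """FGSM: x_adv = x + epsilon * sign(grad_x L(x, y_true)).

    grad_loss_wrt_input(x) is assumed to return the gradient of the model's
    loss with respect to the input; the result is clipped to the valid range.
    """
    x_adv = x + epsilon * np.sign(grad_loss_wrt_input(x))
    return np.clip(x_adv, 0.0, 1.0)
```

A single gradient evaluation per input makes FGSM very cheap, which is why it is often used as a baseline crafting method.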
3 Threat Model

We consider an attacker whose goal is to perturb the optimality of actions taken by a DQN learner via inducing an arbitrary policy π_adv on the target DQN. The attacker is assumed to have minimal a priori information about the target, such as the type and format of inputs to the DQN, as well as its reward function R and an estimate of the frequency at which the Q̂ network is updated. It is noteworthy that even if the target's reward function is not known, it can be estimated via Inverse Reinforcement Learning techniques [19]. No knowledge of the target's exact architecture is assumed in this work, but the attacker can estimate this architecture based on the conventions applied to the input type (e.g., image and video input may indicate a convolutional neural network, speech and voice data point towards a recurrent neural network, etc.).

In this model, the attacker has no direct influence on the target's architecture and parameters, including its reward function and the optimization mechanism. The only parameter that the attacker can directly manipulate is the configuration of the environment observed by the target. For instance, in the case of video game learning [6], the attacker is capable of changing the pixel values of the game's frames, but not the score. In cyber-physical scenarios, such perturbations can be implemented by strategic rearrangement of objects or precise illumination of certain areas via tools such as laser pointers. To this end, we assume that the attacker is capable of changing the state before it is observed by the target, either by predicting future states, or after such states are generated by the environment's dynamics.
The latter can be achieved if the attacker has a faster action speed than the target's sampling rate, or by inducing a delay between generation of the new environment and its observation by the target.

To avoid detection and minimize influence on the environment's dynamics, we impose an extra constraint on the attack such that the magnitude of perturbations applied in each configuration must be smaller than a set value denoted by ε. Also, we do not limit the attacker's domain of perturbations (e.g., in the case of video games, the attacker may change the value of any pixel at any position on the screen).

4 Attack Mechanism

As discussed in Section 2, the DQN framework of Mnih et al. [7] can be seen as consisting of two neural networks: one is the native network which performs the image classification and function approximation, and the other is the auxiliary Q̂ network whose architecture and parameters are copies of the native network, sampled once every c iterations. Training of DQN is performed by optimizing the loss function of Equation 4 with Stochastic Gradient Descent (SGD). Due to the similarity of this process to the training mechanism of neural network classifiers, we hypothesize that the function approximators of DQN are also vulnerable to adversarial example attacks. In other words, the set of all possible inputs to the approximated functions Q and Q̂ contains elements which cause the approximated functions to generate outputs that are different from the output of the original Q function. Furthermore, we hypothesize that, similar to the case of classifiers, the elements that cause one DQN to generate incorrect Q values will incur the same effect on other DQNs that approximate the same Q-function. Consequently, the attacker can manipulate a DQN's learning process by crafting states s_t such that Q̂(s_{t+1}, a; θ⁻_t) identifies an incorrect choice of optimal action at s_{t+1}.
If the attacker is capable of crafting adversarial inputs s'_t and s'_{t+1} such that the value of Equation 4 is minimized for a specific action a', then the policy learned by DQN at this time-step is optimized towards suggesting a' as the optimal action given the state s_t.

Considering that the attacker is not aware of the target's network architecture and its parameters at every time step, crafting adversarial states must rely on black-box techniques such as those introduced in [18]. The attacker can exploit the transferability of adversarial examples by obtaining the state perturbations from a replica of the target's DQN. At every time step of training this replica, the attacker calculates the perturbation vector δ̂_{t+1} for the next state s_{t+1} such that max_{a'} Q̂(s_{t+1} + δ̂_{t+1}, a'; θ⁻_t) causes Q̂ to generate its maximum when a' = π*_adv(s_{t+1}), i.e., the maximum reward at the next state is obtained when the optimal action taken at that state is determined by the attacker's policy.

This is procedurally similar to the targeted misclassification attacks described in Section 2, which aim to find minimal perturbations to an input sample such that the classifier assigns the maximum likelihood to an incorrect target class. Therefore, the adversarial example crafting techniques developed for classifiers, such as the Fast Gradient Sign Method (FGSM) and the Jacobian Saliency Map Algorithm (JSMA), can be applied to obtain the perturbation vector δ̂_{t+1}.

The procedure of this attack can be divided into the two phases of initialization and exploitation. The initialization phase implements processes that must be performed before the target begins interacting with the environment, which are:

1. Train a DQN based on the attacker's reward function r' to obtain the adversarial policy π*_adv
2. Create a replica of the target's DQN and initialize it with random parameters

The exploitation phase implements the attack processes such as crafting adversarial inputs. This phase constitutes an attack cycle, depicted in Figure 2. The cycle initiates with the attacker's first observation of the environment, and runs in tandem with the target's operation. Algorithm 1 details the procedural flow of this phase.
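The crafting step Craft(Q̂', a'_adv, s_{t+1}) can be sketched as an FGSM-style targeted perturbation that ascends the replica's Q-value for the adversarial action while staying within the ε budget. This is an illustration only, not the paper's exact JSMA-based implementation: q_replica_grad is a hypothetical gradient of Q̂'(s, a) with respect to the input state.

```python
import numpy as np

def craft(q_replica_grad, a_adv, s_next, epsilon=0.01, n_steps=10):
    """Iteratively nudge the state so the replica ranks a_adv highest.

    q_replica_grad(state, action) is assumed to return the gradient of the
    replica's Q-value for `action` with respect to `state`. The perturbation
    is kept within the infinity-norm budget epsilon at every step.
    """
    delta = np.zeros_like(s_next)
    step = epsilon / n_steps
    for _ in range(n_steps):
        # Ascend Q'(s + delta, a_adv) in sign-gradient direction.
        delta += step * np.sign(q_replica_grad(s_next + delta, a_adv))
        # Enforce the perturbation-magnitude constraint from Section 3.
        delta = np.clip(delta, -epsilon, epsilon)
    return delta
```

The returned δ̂_{t+1} is then added to s_{t+1} before it is revealed to the target, as in Algorithm 1.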
Fig. 2: Exploitation cycle of policy induction attack
5 Experimental Verification

To study the performance and efficacy of the proposed mechanism, we examine the targeting of Mnih et al.'s DQN designed to learn Atari 2600 games [7]. In our setup, we train the network on a game of Pong implemented in Python using the PyGame library [12]. The game is played against an opponent with a modest level of heuristic artificial intelligence, and is customized to handle the delays in DQN's reaction due to the training process. The game's backend provides the DQN agent with the game screen sampled at 8Hz, as well as the game score (+1 for a win, -1 for a loss, 0 for an ongoing game) throughout each episode of the game. The set of available actions A = {UP, DOWN, Stand} enables the DQN agent to control the movements of its paddle. Figure 3 illustrates the game screen of Pong used in our experiments.

The training process of DQN is implemented in TensorFlow [20] and executed on an Amazon EC2 g2.2xlarge instance [21] with 8 Intel Xeon E5-2670 CPU cores and an NVIDIA GPU with 1536 CUDA cores and 4GB of video memory. Each state observed by the DQN is a stack of 4 consecutive 80x80 gray-scale game frames. Similar to the original architecture of Mnih et al. [7], this input is first passed through two convolutional layers to extract a compressed feature space for the following two feed-forward layers for Q-function estimation. The discount factor γ is set to 0.99, and the initial probability of taking a random action is set to 1, which is annealed after every 500000 actions. The agent is also set to train its DQN after every 50000 observations. Regular training of this DQN takes approximately 1.5 million iterations (∼16 hours on the g2.2xlarge instance) to reach a winning average of 51% against the heuristic AI of its opponent. As expected, longer training of this DQN leads to better results: after a 2-week period of training, we verified the convergent trait of our implementation by witnessing winning averages of more than 80%.

Algorithm 1: Exploitation Procedure
input: adversarial policy π*_adv, initialized replica DQNs Q', Q̂', synchronization frequency c, number of iterations N
for observation = 1, N do
    Observe current state s_t, action a_t, reward r_t, and resulting state s_{t+1}
    if s_{t+1} is not terminal then
        Set a'_adv = π*_adv(s_{t+1})
        Calculate perturbation vector δ̂_{t+1} = Craft(Q̂', a'_adv, s_{t+1})
        Update s_{t+1} ← s_{t+1} + δ̂_{t+1}
        Set y_t = r_t + max_{a'} Q̂'(s_{t+1} + δ̂_{t+1}, a'; θ'⁻)
        Perform SGD on (y_t − Q'(s_t, a_t; θ'))² w.r.t. θ'
    end
    Reveal s_{t+1} to target
    if observation mod c = 0 then θ'⁻ ← θ' end
end

Fig. 3: Game of Pong

Following the threat model presented in Section 3, this experiment considers an attacker capable of observing the state interactions between its target DQN and the game, but whose domain of influence is limited to the implementation of minor changes in the environment. Considering the visual representation of the environment in this setup, the minor changes incurred by the attacker take the form of perturbing pixel values in the 4 consecutive frames of a given state.
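The exploitation procedure of Algorithm 1 can be sketched in Python as follows. This is a schematic only: env, replica, craft, and pi_adv are hypothetical stand-ins for the game backend, the attacker's replica DQN, an adversarial-example crafting routine (e.g., FGSM or JSMA), and the adversarial policy, none of which are specified at this level of detail in the paper.

```python
import numpy as np

def exploitation_phase(env, pi_adv, replica, craft, n_iters, sync_freq):
    """Sketch of Algorithm 1: perturb each next state toward the adversarial
    policy's preferred action before the target observes it."""
    for i in range(1, n_iters + 1):
        s_t, a_t, r_t, s_next, terminal = env.observe()
        if not terminal:
            a_adv = pi_adv(s_next)                       # adversary's desired action
            delta = craft(replica.q_target, a_adv, s_next)
            s_next = s_next + delta                      # perturbed next state
            y_t = r_t + np.max(replica.q_target(s_next)) # replica's target value
            replica.sgd_step(s_t, a_t, y_t)              # train replica on (y_t - Q')^2
        env.reveal(s_next)                               # target observes perturbed state
        if i % sync_freq == 0:
            replica.sync_target()                        # theta'- <- theta'
```

Because the replica is trained on the same observation stream as the target, transferability lets perturbations crafted against it carry over to the target's Q-network.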
[Plot omitted: success rate (%) of FGSM and JSMA vs. number of iterations (×1e6)]
Fig. 4: Success rate of crafting adversarial examples for DQN

Successful implementation of the proposed policy induction attack mechanisms relies on the vulnerability of DQNs to targeted adversarial perturbations. To verify the existence of this vulnerability, the Q̂ networks of the target were sampled at regular intervals during training in the game environment. In the next step, 100 observations comprised of a pair of consecutive states (s_t, s_{t+1}) were randomly selected from the experience memory of the DQN, to ensure the possibility of their occurrence in the game. Considering s_{t+1} to be the variable that can be manipulated by the attacker, it is passed along with the model Q̂ to the adversarial example crafting algorithms. To study the extent of vulnerability, we evaluated the success rate of both the FGSM and JSMA algorithms for each of the 100 random observations in inducing a random game action other than the current optimal a*_t. The results, presented in Figure 4, verify that DQNs are indeed vulnerable to adversarial example attacks. It is noteworthy that the success rate of FGSM with a fixed perturbation limit decreases by one percent per 100000 observations as the number of observations increases. Yet, JSMA seems to be more robust to this effect, as it maintains a success rate of 100 percent throughout the experiment.

To measure the transferability of adversarial examples between models, we trained another Q-network with a similar architecture on the same experience memory of the game at the sampled instances of the previous experiment. It is noteworthy that due to random initializations, the exploration mechanism, and the stochastic nature of SGD, even similar Q-networks trained on the same set of observations will obtain different sets of weights. The second Q-network was tested to measure its vulnerability to the adversarial examples obtained from the last experiment.
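The transferability measurement described above can be sketched as follows (a minimal illustration; q_second is a hypothetical stand-in for the second Q-network's forward pass):

```python
import numpy as np

def transfer_rate(adv_states, actions_induced, q_second):
    """Fraction of adversarial states, crafted against one Q-network, that
    also induce the intended action on a second, independently trained
    network. q_second(state) is assumed to return that network's Q-values."""
    hits = sum(int(np.argmax(q_second(s)) == a)
               for s, a in zip(adv_states, actions_induced))
    return hits / len(adv_states)
```

Applying this measurement to both crafting algorithms yields the per-method transfer rates reported in Figure 5.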
Figure 5 shows that more than 70% of the perturbations obtained from both the FGSM and JSMA methods also affect the second network, hence verifying the transferability of adversarial examples between DQNs.
[Plot omitted: transfer success rate (%) of FGSM and JSMA vs. number of iterations (×1e6)]
Fig. 5: Transferability of adversarial examples in DQN
Our final experiment tests the performance of our proposed exploitation mechanism. In this experiment, we consider an adversary whose reward value is the exact opposite of the game score, meaning that it aims to devise a policy that maximizes the number of lost games. To obtain this policy, we trained an adversarial DQN on the game, whose reward value was the negative of the value obtained from the target DQN's reward function. With the adversarial policy at hand, a target DQN was set up to train on the game environment to maximize the original reward function. The game environment was modified to allow perturbation of pixel values in game frames by the adversary. A second DQN was also set up to train on the target's observations to provide an estimation of the target DQN to enable black-box crafting of adversarial examples. At every observation, the adversarial policy obtained in the initialization phase was consulted to calculate the action that would satisfy the adversary's goal. Then, the JSMA algorithm was utilized to generate the adversarial example that would cause the output of the replica DQN network to be the action selected by the adversarial policy. This example was then passed to the target DQN as its observation. Figure 6 compares the performance of unperturbed and attacked DQNs in terms of their reward values, measured as the difference of the current game score from the average score. It can be seen that the reward value for the targeted agent rapidly falls below the unperturbed case and maintains the trend of losing the game throughout the experiment. This result confirms the efficacy of our proposed attack mechanism, and verifies the vulnerability of Deep Q-Networks to policy induction attacks.

[Plot omitted: average reward per epoch, unperturbed vs. attacked]
Fig. 6: Comparison of rewards between unperturbed and attacked DQNs
6 Discussion on Current Countermeasures

Since the introduction of adversarial examples by Szegedy et al. [16], various countermeasures have been proposed to mitigate the exploitation of this vulnerability in deep neural networks. Goodfellow et al. [17] proposed to retrain deep networks on a set of minimally perturbed adversarial examples to prevent their misclassification. This approach suffers from two inherent shortcomings: firstly, it merely increases the amount of perturbation required to craft an adversarial example; secondly, it does not provide a comprehensive countermeasure, as it is computationally inefficient to find all possible adversarial examples. Furthermore, Papernot et al. [18] argue that by training the network on adversarial examples, the emerging network will have new adversarial examples, and hence this technique does not solve the problem of exploiting this vulnerability in critical systems. Consequently, Papernot et al. [14] proposed a technique named Defensive Distillation, which is also based on retraining the network on a dimensionally-reduced set of training data. This approach, too, was recently shown to be insufficient in mitigating adversarial examples [22]. It is hence concluded that the current state of the art in countering adversarial examples and their exploitation is incapable of providing a concrete defense against such exploitations.
In the context of policy induction attacks, we conjecture that the temporal features of the training process may be utilized to provide protection mechanisms. The proposed attack mechanism relies on the assumption that, due to the decreasing chance of random actions, the target DQN is most likely to perform the action induced by adversarial inputs as the number of iterations progresses. This may be mitigated by implementing adaptive exploration-exploitation mechanisms that both increase and decrease the chance of random actions according to the performance of the trained model. Also, it may be possible to exploit spatio-temporal pattern recognition techniques to detect and omit regular perturbations during the pre-processing phase of the learning process. Investigating such techniques is the priority of our future work.
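The adaptive exploration-exploitation idea can be sketched as a performance-driven ε schedule. This is purely hypothetical: the step size, bounds, and performance baseline below are illustrative assumptions, not an evaluated defense.

```python
def adaptive_epsilon(eps, recent_reward, baseline, step=0.05,
                     eps_min=0.01, eps_max=1.0):
    """Raise the chance of random actions when performance drops below a
    baseline (as an induced policy would cause), and lower it as performance
    recovers, instead of annealing epsilon monotonically."""
    if recent_reward < baseline:
        eps = min(eps_max, eps + step)   # degraded performance: explore more
    else:
        eps = max(eps_min, eps - step)   # healthy performance: exploit more
    return eps
```

By keeping exploration responsive to performance, the attacker can no longer rely on the target deterministically following the induced greedy action late in training.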
7 Conclusions and Future Work

We established the vulnerability of reinforcement learning based on Deep Q-Networks to policy induction attacks. Furthermore, we proposed an attack mechanism which exploits the vulnerability of deep neural networks to adversarial examples, and demonstrated its efficacy and impact through experiments on a game-learning DQN.

This preliminary work solicits a wide range of studies on the security of Deep Reinforcement Learning. As discussed in Section 6, novel countermeasures need to be investigated to mitigate the effect of such attacks on DQNs deployed in cyber-physical and critical systems. Also, an analytical treatment of the problem to establish the bounds and relationships of model parameters, such as network architecture and exploration mechanisms, with DQN's vulnerability to policy induction will provide deeper insight and guidelines for designing safe and secure deep reinforcement learning architectures.
References
1. R. S. Sutton and A. G. Barto, Introduction to Reinforcement Learning, vol. 135. MIT Press, Cambridge, 1998.
2. I. Ghory, "Reinforcement learning in board games," Department of Computer Science, University of Bristol, Tech. Rep., 2004.
3. X. Dai, C.-K. Li, and A. B. Rad, "An approach to tune fuzzy controllers based on reinforcement learning for autonomous vehicle control," IEEE Transactions on Intelligent Transportation Systems, vol. 6, no. 3, pp. 285–293, 2005.
4. L. Busoniu, R. Babuska, and B. De Schutter, "A comprehensive survey of multiagent reinforcement learning," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 38, no. 2, 2008.
5. R. S. Sutton, A. G. Barto, and R. J. Williams, "Reinforcement learning is direct adaptive optimal control," IEEE Control Systems, vol. 12, no. 2, pp. 19–22, 1992.
6. V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing Atari with deep reinforcement learning," arXiv preprint arXiv:1312.5602, 2013.
7. V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
8. S. Gu, E. Holly, T. Lillicrap, and S. Levine, "Deep reinforcement learning for robotic manipulation," arXiv preprint arXiv:1610.00633, 2016.
9. T. Zhang, G. Kahn, S. Levine, and P. Abbeel, "Learning deep control policies for autonomous aerial vehicles with MPC-guided policy search," arXiv preprint arXiv:1509.06791, 2015.
10. A. Hussein, M. M. Gaber, and E. Elyan, "Deep active learning for autonomous navigation," in International Conference on Engineering Applications of Neural Networks, pp. 3–17, Springer, 2016.
11. L. Baird and A. W. Moore, "Gradient descent for general reinforcement learning," in Advances in Neural Information Processing Systems, pp. 968–974, 1999.
12. W. McGugan, Beginning Game Development with Python and Pygame: From Novice to Professional. Apress, 2007.
13. N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami, "The limitations of deep learning in adversarial settings," in IEEE European Symposium on Security and Privacy (EuroS&P), pp. 372–387, IEEE, 2016.
14. N. Papernot, P. McDaniel, X. Wu, S. Jha, and A. Swami, "Distillation as a defense to adversarial perturbations against deep neural networks," arXiv preprint arXiv:1511.04508, 2015.
15. N. Carlini and D. Wagner, "Towards evaluating the robustness of neural networks," arXiv preprint arXiv:1608.04644, 2016.
16. C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, "Intriguing properties of neural networks," arXiv preprint arXiv:1312.6199, 2013.
17. I. J. Goodfellow, J. Shlens, and C. Szegedy, "Explaining and harnessing adversarial examples," arXiv preprint arXiv:1412.6572, 2014.
18. N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. Berkay Celik, and A. Swami, "Practical black-box attacks against deep learning systems using adversarial examples," arXiv preprint arXiv:1602.02697, 2016.
19. Y. Gao, J. Peters, A. Tsourdos, S. Zhifei, and E. Meng Joo, "A survey of inverse reinforcement learning techniques," International Journal of Intelligent Computing and Cybernetics, vol. 5, no. 3, pp. 293–311, 2012.
20. M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al., "TensorFlow: Large-scale machine learning on heterogeneous distributed systems," arXiv preprint arXiv:1603.04467, 2016.
21. M. Gilani, C. Inibhunu, and Q. H. Mahmoud, "Application and network performance of Amazon Elastic Compute Cloud instances," in 2015 IEEE 4th International Conference on Cloud Networking (CloudNet), pp. 315–318, IEEE, 2015.
22. N. Carlini and D. Wagner, "Defensive distillation is not robust to adversarial examples," arXiv preprint, 2016.