Stealing Deep Reinforcement Learning Models for Fun and Profit
Kangjie Chen, Shangwei Guo, Tianwei Zhang, Xiaofei Xie, Yang Liu
Nanyang Technological University, Singapore
Email: {kangjie.chen, tianwei.zhang, xfxie, yangliu}@ntu.edu.sg

Abstract—In this paper, we present the first attack methodology to extract black-box Deep Reinforcement Learning (DRL) models only from their actions with the environment. Model extraction attacks against supervised Deep Learning models have been widely studied. However, those techniques cannot be applied to the reinforcement learning scenario due to DRL models' high complexity, stochasticity and limited observable information. Our methodology overcomes those challenges by proposing two techniques. The first technique is an RNN classifier which can reveal the training algorithms of the target black-box DRL model only based on its predicted actions. The second technique is the adoption of imitation learning to replicate the model from the extracted training algorithm. Experimental results indicate that the integration of these two techniques can effectively recover the DRL models with high fidelity. We also demonstrate a use case to show that our model extraction attack can significantly improve the success rate of adversarial attacks, making the DRL models more vulnerable.
I. INTRODUCTION
Deep Reinforcement Learning has gained popularity due to its strong capability of handling complex tasks and environments. It integrates Deep Learning (DL) architectures and reinforcement learning algorithms to build sophisticated policies, which can accurately understand the environmental context (states), and make the optimal decisions (actions). Various DRL algorithms and methodologies have been designed to facilitate the application of DRL in different artificial intelligence tasks, e.g., autonomous driving [1], robot motion planning [2], video game playing [3], etc.

As DRL has been widely commercialized (e.g., the autonomous driving framework Wayve [4], the path planning system MobilEye [5]), it is of paramount importance for model owners to protect the intellectual property of their DRL-based products. DRL models are generally deployed as black boxes inside the applications, so the model designs and parameters are not disclosed to the public.
Problem statement.
From the adversarial perspective, we want to address the following question in this paper: is it possible for an adversary to extract the properties (i.e., training algorithm) of a black-box DRL model, and produce a replicated model with the same behaviors?
This is known as a model extraction attack, which has been widely studied for supervised DL models [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]. However, the possibility and feasibility of extracting DRL models have not been explored yet. We make a first step towards this goal.

It is worth noting that our goal is to extract the DRL model algorithm and replicate its behaviors. It is impossible to extract the values of the model parameters: different parameter values can give the same model behaviors, which are indistinguishable to the adversary. Also, the adversary cannot identify the values of dead neurons which never contribute to the model output.
Threat model.
We assume the adversary has the domain knowledge of the target DRL model, i.e., the task the model is performing, the environmental context, and also the formats of the model input and output. However, he has little knowledge about the DRL model itself, including the model structures and parameter values, training methods and hyper-parameters, etc. We further assume the adversary is able to operate the DRL model or product in a controllable environment: he can set certain environmental states, and observe the model's corresponding actions.
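To make this threat model concrete, the sketch below shows how such an adversary could query a deployed policy in an environment under his control. It is a minimal illustration only: it assumes the classic (pre-0.26) OpenAI Gym interface, and target_policy.predict is a hypothetical stand-in for whatever query interface the black-box product exposes.

    import gym

    def observe_actions(target_policy, env_name="CartPole-v1", seed=0, steps=200):
        """Run the black-box policy in an environment the adversary controls and
        record the (state, action) pairs it produces."""
        env = gym.make(env_name)
        env.seed(seed)                                # the adversary chooses the random seed
        state = env.reset()
        trace = []
        for _ in range(steps):
            action = target_policy.predict(state)    # only the chosen action is observable
            trace.append((state, action))
            state, _, done, _ = env.step(action)
            if done:
                state = env.reset()
        return trace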
Challenges.
Although various extraction techniques were designed against supervised DL models, it is hard to apply them to DRL models due to significant differences in model features and scenarios.

First, some attack approaches can only extract very simple models and datasets. For instance, the method in [9] can only work for two-layer neural networks (one hidden layer and a ReLU layer). The method in [6] is only applicable to simple models with simple datasets (e.g., MNIST), constrained by the computing power. In contrast, DRL models usually have more complicated and deeper network structures to handle complex tasks. As such, the above techniques fail to extract DRL models.

Second, the adversary in our threat model has less observable information for model extraction. Past works assume the adversary has access to the prediction confidence scores [9, 6, 7, 12], gradients [10] or the side-channel execution characteristics [11, 13, 14, 15]. In our scenario, the adversary can only observe the predicted actions from the DRL model. This can also invalidate the above methods.

Third, supervised DL models perform predictions over discrete input samples, which are independent of each other. However, DRL is a Markov Decision Process (MDP). Individual input samples cannot fully reflect the inherent features of DRL models and training algorithms. The adversary will lose the information of temporal relationships if he only observes these discrete data. Besides, compared to supervised DL models, DRL models are more stochastic and their behaviors highly depend on the environments with different transition probabilities.

Fig. 1: Overview of our proposed DRL model extraction attack

Contribution.
We propose a novel model extraction approach for DRL models which can overcome the above challenges. It is composed of two techniques.

The first technique is to build a classifier which can identify the algorithm of the target DRL model based on its runtime actions. This technique has three innovations: (1) We use a timing sequence of actions as the feature of a DRL model to characterize its decision process and interaction with the environment. (2) We utilize Recurrent Neural Networks as the structure of the classifier for training and prediction, which can better understand the temporal relationships inside the feature sequence. (3) For one DRL model with the same algorithm, we generate different feature sequences in environments initialized with different random seeds. This guarantees that the training set of the classifier is comprehensive and includes different behaviors of the same model.

The second technique is to adopt imitation learning to replicate the behaviors of the target model based on the extracted algorithm. We use the Generative Adversarial Imitation Learning (GAIL) framework [17] to achieve this process. The contest between the discriminative model and the generative model guarantees that our replication has similar behaviors to the target one within the environment.

The integration of RNN classification and imitation learning can produce replicated models with high similarity to the target model in training algorithm, behavior and performance. This can bring severe threats of copyright infringement and economic loss to DRL-based applications and products. More seriously, we provide a use case to show that this attack approach can significantly enhance adversarial attacks by increasing the attack transferability and success rate. This demonstrates the practical value of our study, and is expected to raise people's awareness about the privacy threats of DRL models, as well as the necessity of defense solutions.

II. BACKGROUND
Deep Reinforcement Learning.
DRL adopts deep learning technology to instruct an agent to act in a given task, in order to maximize the expected cumulative reward. The deep neural networks adopted by DRL are powerful enough to understand and interpret complex environmental states, and make the optimal decisions. There are three common approaches to solve reinforcement learning tasks. The first one is value function based methods. The DRL algorithm trains deep neural networks to approximate the optimal value functions. For instance, a common value-based algorithm is the deep Q-network (DQN) [3], which learns Q value estimates for each state-action pair independently. The second category is policy search based methods. The DRL algorithm attempts to identify the optimal policy directly. Typical examples include REINFORCE [18], which optimizes policies directly. The third category is the hybrid of value function and policy search (the actor-critic approach). These algorithms learn both a policy and a state value function to reduce variance and accelerate learning. State-of-the-art algorithms include Proximal Policy Optimization (PPO) [19], Actor-Critic with Experience Replay (ACER) [20] and Actor Critic using Kronecker-Factored Trust Region (ACKTR) [21].
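As a reference for these two families (textbook forms, not taken from the paper itself), the DQN regression loss and the REINFORCE policy-gradient estimator can be written as

    L(\theta) = \mathbb{E}_{(s,a,r,s')}\Big[\big(r + \gamma \max_{a'} Q_{\theta^-}(s',a') - Q_\theta(s,a)\big)^2\Big]

    \nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\Big[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\Big], \qquad G_t = \sum_{k \ge t} \gamma^{\,k-t} r_k

where Q_{\theta^-} is a target network and G_t the return from timestep t. Actor-critic methods such as PPO, ACER and ACKTR replace G_t with an advantage estimate from a learned critic, which is the variance-reduction role described above.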
Imitation Learning.
Imitation learning [22] is a process of acquiring skills or behaviors by observing demonstrations of an expert performing the corresponding tasks. It was originally proposed for learning from human demonstrations. Then the concept of imitation learning was applied to the domain of artificial experts, such as reinforcement learning agents. Various imitation learning techniques have been designed to imitate the behaviors of DRL models [17, 23].
Adversarial Examples.
It has been found that small and undetectable perturbations in input samples can affect the results of a target classifier [24]. Following this initial study, many researchers designed various methods to attack supervised DL models [25, 26, 27]. Adversarial attacks on RL policies have also received some attention in the past years. Huang et al. [28] made an initial attempt to attack neural network policies by applying FGSM to the state at each time step. Following this work, black-box adversarial attacks against DRL were demonstrated in [30]. Russo and Proutiere [29] found that adversarial examples can also be transferred across different DRL models.
Privacy in Machine Learning.
There have been a number of works on the privacy threats of deep learning models and data. Model extraction attacks aim to steal model parameters or architectures [7, 6]. Membership inference attacks [31] are designed to determine if a given data sample has been included in the training data. Model inversion attacks [32] aim to leverage model predictions to inverse the training data properties. In this paper, we focus on model extraction attacks.

III. ATTACK METHODOLOGY
Our attack approach consists of two stages. At the first stage, we construct a classifier, which can predict the training algorithm of a given black-box DRL model based on its runtime behavior. At the second stage, based on the extracted algorithm, we adopt a state-of-the-art imitation learning technique to generate and fine-tune a model with similar behaviors to the victim one. Figure 1 illustrates the methodology overview, and Algorithm 1 describes the detailed steps.
A. Extracting DRL Model Algorithms via RNN Classification
As the first stage, we train an RNN classifier, whose input is a DRL model's action sequence, and whose output is the model's training algorithm. With this classifier, we are able to identify the algorithm of an arbitrary black-box DRL model.
Dataset preparation.
A dataset is necessary to train this classifier. It should consist of enough samples to cover models with different algorithms, as well as various behaviors. We train a large quantity of shadow DRL models in the same environment but with various algorithms, and collect their behaviors to form this dataset. Specifically, we set up an algorithm pool P that includes all the training algorithms in our consideration. We prepare a set S of random seeds for environment initialization. Then, for each algorithm in the pool P, we train some DRL models with this algorithm in various environments initialized by different random seeds in S. We evaluate the performance of each trained DRL model by measuring its reward and comparing it with a reward threshold R: we only select the DRL models whose reward is higher than R. For each qualified model, we collect N different state-action sequences with a length of T: {(s_1, a_1), (s_2, a_2), ..., (s_T, a_T)}. Then samples are generated with the action sequences (A = {a_1, a_2, ..., a_T}) as the feature and the training algorithm as the label, to construct the dataset.
Algorithm 1: Extracting DRL models

    Input: target model M*, DRL environment env
    Output: Replicated model M'
    /* Stage 1 */
    Set up a set of random seeds S;
    Select algorithm pool P, reward threshold R, sequence length T;
    Dataset D = ∅;
    for each p ∈ P do
        for each s ∈ S do
            env.initialize(s);
            m = train_DRL(env, p);
            if evaluate(m, env) > R then
                A = GenSequence(m, env, T);
                D.add([A, p]);
            end
        end
    end
    C = train_RNN(D);
    /* Extract model algorithm */
    A* = GenSequence(M*, env, T);
    P* = C.predict(A*);
    /* Stage 2 */
    M' = ImitationLearning(M*, P*, env);
    while evaluate(M', env) < evaluate(M*, env) do
        M' = ImitationLearning(M*, P*, env);
    end
    return M'
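For readers who prefer code over pseudocode, the following Python sketch mirrors Stage 1 of Algorithm 1. Here train_drl, evaluate and gen_sequence are hypothetical helpers corresponding to the train_DRL, evaluate and GenSequence subroutines of Algorithm 1 (in practice they would wrap a DRL framework such as OpenAI Baselines), and the algorithm pool matches the one evaluated in Section IV.

    # Sketch of Stage 1 of Algorithm 1: shadow-model training and dataset collection.
    ALGORITHM_POOL = ["A2C", "PPO", "ACER", "ACKTR", "DQN"]

    def build_classifier_dataset(make_env, seeds, reward_threshold, T, sequences_per_model=50):
        dataset = []                                       # list of (action_sequence, algorithm) pairs
        for algo in ALGORITHM_POOL:
            for seed in seeds:
                env = make_env(seed)                       # environment initialized with this random seed
                model = train_drl(env, algo)               # shadow DRL model
                if evaluate(model, env) > reward_threshold:
                    for _ in range(sequences_per_model):
                        seq = gen_sequence(model, env, T)  # action sequence of length T
                        dataset.append((seq, algo))        # feature = sequence, label = training algorithm
        return dataset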
Training.

We train a Recurrent Neural Network over the prepared dataset for the classifier. An RNN is capable of processing sequence data of arbitrary lengths by recursively applying a transition function to its internal hidden state vector of the input. It is generally used to map the input sequence to a fixed-sized vector, which is further fed to a softmax layer for classification. However, vanilla RNNs are well known to suffer from the gradient vanishing and exploding problem: during training, components of the gradient vector can grow or decay exponentially over long sequences. To address this issue, we adopt the Long Short-Term Memory (LSTM) network [33] in our approach. LSTMs can selectively remember or forget things regulated by a set of gates. Each gate in LSTM units is composed of a sigmoid neural net layer and a pointwise multiplication operation, which can filter the information through the network. As a result, LSTM units can maintain information in memory for a long period under the control of gates. To train the classifier, for each input sequence A = {a_1, a_2, ..., a_T}, we first apply a set of LSTM layers to obtain its vector representation. Then we attach a fully-connected layer and a non-linear softmax layer after the LSTMs to output the probability distribution over all classes of possible model algorithms. We use the cross-entropy of the predicted and ground-truth labels as the loss function, and identify the optimal parameters for this classifier by minimizing the loss function.
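A minimal sketch of such a classifier, written here in PyTorch purely for illustration (the paper does not specify the framework); the embedding of discrete actions and the layer sizes are assumptions, not the settings used in the experiments.

    import torch
    import torch.nn as nn

    class AlgorithmClassifier(nn.Module):
        """LSTM over the action sequence, followed by a fully-connected layer."""
        def __init__(self, n_actions, n_algorithms, embed_dim=16, hidden_dim=64, num_layers=1):
            super().__init__()
            self.embed = nn.Embedding(n_actions, embed_dim)     # discrete actions -> vectors
            self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers, batch_first=True)
            self.fc = nn.Linear(hidden_dim, n_algorithms)

        def forward(self, action_seq):                          # action_seq: (batch, T) integer tensor
            x = self.embed(action_seq)
            _, (h_n, _) = self.lstm(x)
            return self.fc(h_n[-1])                             # logits; softmax is folded into the loss

    # Cross-entropy over predicted vs. ground-truth training algorithms.
    model = AlgorithmClassifier(n_actions=6, n_algorithms=5)    # e.g., Atari Pong actions, 5 algorithms
    loss_fn = nn.CrossEntropyLoss()                             # applies log-softmax internally
    optimizer = torch.optim.Adam(model.parameters(), lr=0.005)  # initial learning rate from Section IV-A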
Fig. 2: The accuracy of the RNN classifier for different algorithms (confusion matrices over A2C, PPO, ACER, ACKTR and DQN): (a) Cart-Pole, (b) Atari Pong.

Extracting model algorithms.

With this RNN classifier, we are now able to predict the training algorithm of a given black-box DRL model. We operate this target model in the same environment with a certain random seed and collect the action sequence for T rounds. We query the classifier with this sequence and get the probability of each candidate algorithm. We select the one with the highest probability as the attack result. To further increase the confidence and eliminate stochastic effects, we can run the target model in differently initialized environments and collect the sequences for predictions. We choose the most-predicted label as the target model's algorithm.
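A sketch of this extraction step, reusing the hypothetical gen_sequence helper introduced above; classifier.predict is an assumed wrapper that returns class probabilities for one sequence, and the majority vote over seeds is the only addition.

    from collections import Counter

    def extract_algorithm(target_model, make_env, classifier, seeds, T):
        # Query the classifier with action sequences collected under several random
        # seeds and return the most frequently predicted training algorithm.
        votes = []
        for seed in seeds:
            env = make_env(seed)
            seq = gen_sequence(target_model, env, T)   # hypothetical helper (see Stage 1 sketch)
            probs = classifier.predict(seq)            # probability per candidate algorithm
            votes.append(int(probs.argmax()))
        return Counter(votes).most_common(1)[0][0]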
B. Replicating DRL Models via Imitation Learning

With the extracted model algorithm, the adversary can train a new model (or just pick a shadow DRL model from the classifier training step) with the same algorithm as the replica of the target model. However, due to the complexity of DRL algorithms and the variance of the initial environment, this replicated model can still exhibit distinct behaviors from the real one, even if they are trained with the same algorithm. This stage aims to refine the replicated model via imitation learning.

Imitation learning aims to mimic expert behavior in a given task. An imitation model learns skills to perform a task from expert demonstrations by learning a mapping between observations and actions [22]. Recently, several works conduct model imitation on DRL models, e.g., GAIL [17] and DQfD [23]. In our case, we adopt the GAIL algorithm to replicate DRL models. GAIL is a model-free learning algorithm that can obtain significant performance gains in imitating complex behaviors in large-scale and high-dimensional environments. Specifically, two models are constructed to contest with each other during the imitation process: a generative DRL model G with the extracted algorithm, and a discriminative model D whose job is to distinguish the distribution of data generated by G from the ground-truth data distribution (i.e., the expert trajectory) from the target DRL model. The trajectory data for the generative model and the target model is a sequence of {(s_1, a_1), (s_2, a_2), ..., (s_T, a_T)}. The generative model G iteratively refines its parameters based on the feedback from D until D cannot distinguish the data generated by G from that of the target model.

After the imitated model is produced, considering the stochasticity of the learning progress, it is possible that it cannot reach the same reward although it has the same behaviors as the target model. Therefore, we repeat the GAIL process until a qualified model is obtained which has very similar performance (i.e., reward) to the target model.
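For reference, each imitation cycle optimizes the standard GAIL saddle-point objective of [17] (written here in its usual form, with an entropy regularizer H), where π_E denotes the expert (target) policy and π_G the generative (replicated) policy:

    \min_{\pi_G} \max_{D}\; \mathbb{E}_{\pi_G}\big[\log D(s,a)\big] + \mathbb{E}_{\pi_E}\big[\log\big(1 - D(s,a)\big)\big] - \lambda H(\pi_G)

The discriminator D plays exactly the role described above: training stops when it can no longer separate trajectories of the replica from those of the target model.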
IV. EVALUATION

A. Implementation and Experimental Setup
Our attack approach is general-purpose and applicable to various reinforcement learning environments. Without loss of generality, we consider two popular environments: Cart-Pole and Atari Pong [34]. For each environment, we train DRL models with five mainstream DRL algorithms (DQN [3], PPO [19], ACER [20], ACKTR [21] and A2C [35]). We use the default training settings and hyperparameters in the OpenAI Baselines framework. For each environment, we select 50 trained models whose rewards are higher than the baseline R as introduced in the OpenAI Baselines framework [34].

For RNN classification, we consider different sequence lengths T (50, 100 and 200), and compare their impacts on the RNN classification accuracy. For each trained DRL model, we collect 50 sequences of actions as the training input of our RNN classifier. Therefore, for Cart-Pole and Atari Pong, the sizes of the datasets we collected from 250 trained DRL models are both 12,500. To evaluate the trained RNN classifier, we split the dataset into training and test sets randomly. Moreover, we consider various RNN structures. During the training process, the initial learning rate is set to 0.005 with a decay factor of 0.7 whenever the loss plateaus, and the batch size is set to 32. We stop the training after N = 100 iterations.

Fig. 3: Average accuracy of the RNN classifier versus sequence length, for Cart-Pole and Atari Pong with 1-layer and 2-layer classifiers.
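The classifier training configuration above can be summarized in one place; the dictionary below only restates the values given in this subsection, and the key names themselves are illustrative rather than options of any specific framework.

    # Classifier training configuration as described in Section IV-A.
    RNN_CLASSIFIER_CONFIG = {
        "sequence_lengths": [50, 100, 200],   # candidate values of T
        "sequences_per_model": 50,
        "dataset_size": 12500,                # 250 shadow models x 50 sequences per environment
        "initial_learning_rate": 0.005,
        "lr_decay_factor": 0.7,               # applied whenever the loss plateaus
        "batch_size": 32,
        "training_iterations": 100,
        "hidden_lstm_layers": [1, 2],         # structures compared in Fig. 3
    }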
B. Results of RNN Classification
Impact of hyper-parameters.
The prediction accuracy of the RNN classifier can be affected by a few hyper-parameters, e.g., the length of the input sequence and the number of hidden layers. Figure 3 shows the accuracy under different combinations of these hyper-parameters. First, we observe that the length of the input sequence can affect the classification performance: a longer input sequence gives a higher accuracy. Therefore, for the Cart-Pole environment, we take all the actions within one episode as the input sequence (T = 200). For the Atari Pong environment, one episode can have up to 10,000 actions. It is not recommended to take the entire episode as input, which would incur very high cost and training over-fitting. Since T = 200 can already give us very satisfactory accuracy, we set the length of the input sequence to 200 as well.

Second, we consider different numbers of hidden LSTM layers (1 and 2) for the classifier. We observe that this factor has slight influence on the accuracy of the classifier. One hidden layer can already validate the effectiveness of the RNN classification. So in the following experiments, we adopt a 1-layer RNN for simplicity.

Third, the action space can also affect the classification accuracy. Higher-dimensional actions contain more information about the DRL model, so it is easier and more accurate to classify them. In our case, the action space of the Cart-Pole environment is 2 while that of the Atari Pong environment is 6. Hence the classifier of Atari Pong has a higher accuracy than that of Cart-Pole, as reflected in Figure 3.

Accuracy of each class.
Figure 2 shows the confusion matrices for both environments. We observe that the RNN classifier can distinguish DRL models of each algorithm with very high confidence. For most cases, the prediction accuracy is above 70%; the best case is up to 100% (i.e., DQN models in Atari Pong); the worst case is 54% (ACER models in Cart-Pole), which is still much higher than random choice (20%). The prediction accuracy of the DQN model is particularly high (0.95 in Cart-Pole, and 1 in Atari Pong). The reason behind this is that DQN is a value-based algorithm while all the other four algorithms are actor-critic methods. So DQN models are easier to distinguish.
C. Explanation of RNN Classification
We quantitatively explain and validate why our RNN classifier is able to distinguish different DRL algorithms. We adopt the Local Interpretable Model-agnostic Explanations (LIME) framework [36], which attempts to understand a model by perturbing its input and observing how the output changes. Specifically, it modifies a single data sample by tweaking the feature values and observes the resulting impact on the output, to determine which features play an important role in the model predictions.
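As a rough illustration of this procedure (not the paper's exact setup), the lime package can treat each timestep of the action sequence as a tabular feature; explain_instance then reports the timesteps that contribute most to the predicted algorithm. The predict_proba argument is an assumed wrapper mapping a batch of sequences to class probabilities of our classifier.

    import numpy as np
    from lime.lime_tabular import LimeTabularExplainer

    def explain_sequence(train_sequences, sequence, predict_proba, T=200):
        # Treat every timestep of the action sequence as one feature.
        explainer = LimeTabularExplainer(
            np.asarray(train_sequences),
            feature_names=[f"t{t}" for t in range(T)],
            class_names=["A2C", "PPO", "ACER", "ACKTR", "DQN"],
            mode="classification",
        )
        exp = explainer.explain_instance(np.asarray(sequence), predict_proba, num_features=5)
        return exp.as_list()    # [(feature description, contribution), ...]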
Fig. 4: The explanation of the RNN classifier: contribution of the most influential actions at each input sequence timestep.

In our case, we build an explainer with LIME on our RNN classifier. Then, we randomly select 200 explanation instances from the training data of the classifier, collected from the shadow DRL models trained in the Atari Pong environment in Section III-A. We feed these instances to the explainer and obtain the explanation results. For each explanation instance, we identify the feature (i.e., one action of the input sequence) which contributes most to the prediction. Through the analysis of these features, we can discover the different behaviors of DRL models trained with different algorithms. Fig. 4 shows the contribution of the actions (UP, DOWN, IDLE) with prominent impacts on the prediction in each input sequence. We can observe that different DRL algorithms give very different behavior preferences. A2C tends to issue important actions of UP at the beginning of the sequence; ACKTR prefers to give the action of DOWN also at the beginning of the task; DQN has a higher chance to predict IDLE, clustered at the beginning; PPO issues the DOWN action all over the sequence with a large variance of contribution factor; ACER has important actions of UP and IDLE with similar contributions spanning the whole sequence. This shows those DRL algorithms have quite different characteristics in making action decisions, giving the classifier an opportunity to distinguish them just based on the action sequence.

Fig. 5: The reward of the replicated model during the imitation learning progress: (a) imitation learning with the same algorithm (PPO expert), (b) imitation learning with different algorithms (DQN expert).
D. Results of Imitation Learning
We demonstrate the effectiveness of imitation learning for model replication. To train the replicated models, the GAIL algorithm is applied to imitate behaviors from the target model. As implemented in OpenAI Baselines [34], the generator of GAIL can be a PPO or TRPO policy. Without loss of generality, we select PPO as the generator.
Imitating Performance.
The replicated model with the same algorithm can reach similar performance (i.e., reward) to the target model after imitation learning. Without loss of generality, we show this effect in the Cart-Pole environment. We consider that the adversary has identified the training algorithm via RNN classification, and then uses this algorithm for imitation learning. Figure 5a shows the fine-tuning process (the target model is trained with PPO). We can observe that in the first imitation cycle, the replicated model cannot reach the same performance as the target one, as it has been supervised to learn the random behaviors of the target model with low rewards. Then we start a new imitation cycle, and now the learned model can get the same reward as the victim model. We can stop with this replica, or continue to identify more qualified ones (at the 6th cycle). In contrast, we also consider a case where the adversary does not know the training algorithm, and randomly picks one for imitation learning. Figure 5b shows the corresponding imitation process (the target model uses DQN while the adversary selects a PPO generator). Now the replicated model can never get the same performance as the target model. This indicates the importance of the extracted algorithm from the RNN classification, in order to perform high-quality imitation learning.

Fig. 6: JS divergence of different models (cumulative portion of the JS divergence for the baseline, the replicated model and a shadow model).
Imitating behaviors.
In addition to performance, the replicated model can also learn similar behaviors to the target model. Since the output of a DRL model is a probability distribution over legal actions, we adopt the Jensen-Shannon (JS) divergence [37] to measure the similarity of the action probability distributions between the replicated model and the target model. We still use the PPO policy in the Cart-Pole environment for illustration. We consider three cases: (1) the similarity between the target model and itself (i.e., collecting the behaviors twice), which serves as the baseline for comparison; (2) the similarity between the target model and the replicated model from imitation learning; (3) the similarity between the target model and a shadow model with the same training algorithm. For each case, we feed the same states to the two models in comparison, sample 100 actions from each model, compute the action probability distributions and the divergence between these two action distributions. Fig. 6 shows the cumulative probability of the JS divergence for each case. We can observe that the cumulative probability of the JS divergence in both case (1) and case (2) increases sharply to 1. This indicates that the replicated model indeed has very similar behaviors to the target model. In contrast, the divergence of the action probability distributions between the shadow model and the target model can be very high. Even if they are trained with the same algorithm, their behaviors are still quite distinct in the same environment. We can conclude that through imitation learning with the extracted algorithm, the replicated model can behave very closely to the target one.
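The similarity metric itself is straightforward to compute. The sketch below estimates the two empirical action distributions from the sampled actions and applies the standard JS divergence formula (scipy's entropy computes the KL divergence when given two distributions); it is an illustration of the measurement, not the paper's exact code.

    import numpy as np
    from scipy.stats import entropy

    def js_divergence(p, q):
        # Jensen-Shannon divergence between two discrete distributions p and q.
        p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
        m = 0.5 * (p + q)
        return 0.5 * entropy(p, m) + 0.5 * entropy(q, m)

    def action_distribution(sampled_actions, n_actions):
        # Empirical distribution over legal actions from sampled actions (e.g., 100 per state).
        counts = np.bincount(sampled_actions, minlength=n_actions)
        return counts / counts.sum()

For the Cart-Pole experiment, n_actions would be 2 and sampled_actions would hold the 100 actions drawn from one model at a given state.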
V. CASE STUDY: ENHANCING ADVERSARIAL ATTACKS

In this section, we present a case study to show how an adversary can leverage the model extraction technique to cause severe damage to the victim.

Generally, there are three types of adversarial attacks: white-box, grey-box and black-box attacks, determined by the adversary's knowledge of the victim model [38]. The black-box scenario is the most realistic setting, as the information and details of the DL models are usually kept confidential for intellectual property protection. However, black-box attacks have lower success rates than grey-box attacks due to the low transferability across models with different algorithms. Such distinction is more prominent for DRL models because of their complexity and large diversity. So to enhance the adversarial attacks against black-box DRL models, we can use the proposed model extraction attack to turn the black-box models into grey-box ones. Specifically, we extract the training algorithm from the black-box DRL model and replicate a new one. Then we generate adversarial examples via conventional methods from the parameters of the replicated model, and use them to attack the target black-box one.
Implementation.
We evaluate the effectiveness of adversarial examples in the Atari Pong environment. The target black-box model can use any training algorithm and configuration. The adversary may choose an arbitrary different algorithm to train a shadow model, or use our model extraction method to identify the target model's algorithm and replicate a new model. For each case, we adopt the FGSM technique [25] to generate 1,000 adversarial examples and measure their success rates on the target model.

Fig. 7: The transferability of adversarial examples across different DRL algorithms (white-box models vs. black-box models).
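A hedged sketch of how such adversarial states could be crafted from the white-box replica (PyTorch is assumed; replica_policy is assumed to be a module exposing action logits). The perturbed state would then be fed to the black-box target to test transferability.

    import torch
    import torch.nn.functional as F

    def fgsm_state(replica_policy, state, epsilon):
        # Craft an adversarial state with FGSM against the replicated (white-box) model.
        state = state.clone().detach().requires_grad_(True)
        logits = replica_policy(state)
        action = logits.argmax(dim=-1)                   # action the replica would take
        loss = F.cross_entropy(logits, action)           # push the policy away from that action
        loss.backward()
        adv_state = state + epsilon * state.grad.sign()  # FGSM step: x + eps * sign(grad_x loss)
        return adv_state.detach()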
Results.
Fig. 7 reports the transferability of adversarial examples across different model algorithms under the same perturbation scale. We can observe that the success rate increases when the replicated model has the same training algorithm as the target model. The reason behind this is that the gradients of DRL models with the same algorithm are closer than those of models with different algorithms. Therefore, adversarial examples are easier to transfer to models with the same algorithm, even when their parameters are different. This indicates that our model extraction technique can significantly enhance the adversarial attack effects on black-box DRL models.

VI. CONCLUSION
In this paper, we design a novel attack methodology to steal DRL models. We utilize RNN classification and Generative Adversarial Imitation Learning to extract the model algorithms, and to imitate their behaviors and performance. With such powerful attack techniques, an adversary can recover DRL models with high fidelity only by observing their actions, rather than prediction confidence scores. Such minimal attack requirements can invalidate the common defenses against model extraction attacks, e.g., perturbing the output probability [7, 39, 40], removing the probabilities for some classes [7], returning only the class output [7, 39], query pattern analysis [41, 42], and watermarking [43, 44]. We expect this study can raise people's awareness about the severity of the DRL model privacy issue, and inspire better solutions to mitigate such model extraction attacks.
REFERENCES

[1] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
[2] Yuke Zhu, Roozbeh Mottaghi, Eric Kolve, Joseph J Lim, Abhinav Gupta, Li Fei-Fei, and Ali Farhadi. Target-driven visual navigation in indoor scenes using deep reinforcement learning. In IEEE International Conference on Robotics and Automation (ICRA), pages 3357-3364. IEEE, 2017.
[3] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015.
[4] Learning to drive in a day. https://wayve.
[5] https://mobileye.com/our-technology/driving-policy/, 2020. Accessed: 2020-06-05.
[6] Seong Joon Oh, Bernt Schiele, and Mario Fritz. Towards reverse-engineering black-box neural networks. In Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, pages 121-144. Springer, 2019.
[7] Florian Tramèr, Fan Zhang, Ari Juels, Michael K Reiter, and Thomas Ristenpart. Stealing machine learning models via prediction APIs. In 25th USENIX Security Symposium (USENIX Security 16), pages 601-618, 2016.
[8] Vasisht Duddu, Debasis Samanta, D Vijay Rao, and Valentina E Balas. Stealing neural networks via timing side channels. arXiv preprint arXiv:1812.11720, 2018.
[9] Matthew Jagielski, Nicholas Carlini, David Berthelot, Alex Kurakin, and Nicolas Papernot. High accuracy and high fidelity extraction of neural networks. In 29th USENIX Security Symposium (USENIX Security 20), 2020.
[10] Smitha Milli, Ludwig Schmidt, Anca D Dragan, and Moritz Hardt. Model reconstruction from model explanations. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 1-9, 2019.
[11] Xing Hu, Ling Liang, Shuangchen Li, Lei Deng, Pengfei Zuo, Yu Ji, Xinfeng Xie, Yufei Ding, Chang Liu, Timothy Sherwood, et al. DeepSniffer: A DNN model extraction framework based on learning architectural hints. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 385-399, 2020.
[12] Honggang Yu, Kaichen Yang, Teng Zhang, Yun-Yun Tsai, Tsung-Yi Ho, and Yier Jin. CloudLeak: Large-scale deep learning models stealing through adversarial examples. In Proceedings of the Network and Distributed Systems Security Symposium (NDSS), 2020.
[13] Mengjia Yan, Christopher Fletcher, and Josep Torrellas. Cache telepathy: Leveraging shared resource attacks to learn DNN architectures. arXiv preprint arXiv:1808.04761, 2018.
[14] Lejla Batina, Shivam Bhasin, Dirmanto Jap, and Stjepan Picek. CSI NN: Reverse engineering of neural network architectures through electromagnetic side channel. In 28th USENIX Security Symposium (USENIX Security 19), pages 515-532, 2019.
[15] Weizhe Hua, Zhiru Zhang, and G Edward Suh. Reverse engineering convolutional neural networks through side-channel information leaks. In Design Automation Conference (DAC), pages 1-6. IEEE, 2018.
[16] Binghui Wang and Neil Zhenqiang Gong. Stealing hyperparameters in machine learning. In IEEE Symposium on Security and Privacy (SP), pages 36-52. IEEE, 2018.
[17] Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pages 4565-4573, 2016.
[18] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229-256, 1992.
[19] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
[20] Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Remi Munos, Koray Kavukcuoglu, and Nando de Freitas. Sample efficient actor-critic with experience replay. arXiv preprint arXiv:1611.01224, 2016.
[21] Yuhuai Wu, Elman Mansimov, Roger B Grosse, Shun Liao, and Jimmy Ba. Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. In Advances in Neural Information Processing Systems, pages 5279-5288, 2017.
[22] Ahmed Hussein, Mohamed Medhat Gaber, Eyad Elyan, and Chrisina Jayne. Imitation learning: A survey of learning methods. ACM Computing Surveys (CSUR), 50(2):1-35, 2017.
[23] Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Dan Horgan, John Quan, Andrew Sendonaris, Ian Osband, et al. Deep Q-learning from demonstrations. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[24] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
[25] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
[26] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy (SP), pages 39-57. IEEE, 2017.
[27] Nicolas Papernot, Patrick McDaniel, Somesh Jha, Matt Fredrikson, Z Berkay Celik, and Ananthram Swami. The limitations of deep learning in adversarial settings. In IEEE European Symposium on Security and Privacy (EuroS&P), pages 372-387. IEEE, 2016.
[28] Sandy Huang, Nicolas Papernot, Ian Goodfellow, Yan Duan, and Pieter Abbeel. Adversarial attacks on neural network policies. arXiv preprint arXiv:1702.02284, 2017.
[29] Alessio Russo and Alexandre Proutiere. Optimal attacks on reinforcement learning policies. arXiv preprint arXiv:1907.13548, 2019.
[30] Vahid Behzadan and Arslan Munir. Vulnerability of deep reinforcement learning to policy induction attacks. In International Conference on Machine Learning and Data Mining in Pattern Recognition, pages 262-275. Springer, 2017.
[31] Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. Membership inference attacks against machine learning models. In IEEE Symposium on Security and Privacy (SP), pages 3-18. IEEE, 2017.
[32] Matt Fredrikson, Somesh Jha, and Thomas Ristenpart. Model inversion attacks that exploit confidence information and basic countermeasures. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pages 1322-1333, 2015.
[33] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.
[34] Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, Yuhuai Wu, and Peter Zhokhov. OpenAI Baselines. https://github.com/openai/baselines, 2017.
[35] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928-1937, 2016.
[36] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135-1144, 2016.
[37] Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145-151, 1991.
[38] Dongyu Meng and Hao Chen. MagNet: A two-pronged defense against adversarial examples. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pages 135-147, 2017.
[39] Varun Chandrasekaran, Kamalika Chaudhuri, Irene Giacomelli, Somesh Jha, and Songbai Yan. Exploring connections between active learning and model extraction. arXiv preprint arXiv:1811.02054, 2018.
[40] Taesung Lee, Benjamin Edwards, Ian Molloy, and Dong Su. Defending against model stealing attacks using deceptive perturbations. arXiv preprint arXiv:1806.00054, 2018.
[41] Mika Juuti, Sebastian Szyller, Samuel Marchal, and N Asokan. PRADA: Protecting against DNN model stealing attacks. In IEEE European Symposium on Security and Privacy (EuroS&P), pages 512-527. IEEE, 2019.
[42] Manish Kesarwani, Bhaskar Mukhoty, Vijay Arya, and Sameep Mehta. Model extraction warning in MLaaS paradigm. In Proceedings of the 34th Annual Computer Security Applications Conference, pages 371-380, 2018.
[43] Jialong Zhang, Zhongshu Gu, Jiyong Jang, Hui Wu, Marc Ph Stoecklin, Heqing Huang, and Ian Molloy. Protecting intellectual property of deep neural networks with watermarking. In Proceedings of the 2018 Asia Conference on Computer and Communications Security, pages 159-172, 2018.
[44] Yusuke Uchida, Yuki Nagai, Shigeyuki Sakazawa, and Shin'ichi Satoh. Embedding watermarks into deep neural networks. In Proceedings of the 2017 ACM International Conference on Multimedia Retrieval, 2017.