Towards Closing the Sim-to-Real Gap in Collaborative Multi-Robot Deep Reinforcement Learning
Wenshuai Zhao, Jorge Peña Queralta, Li Qingqing, Tomi Westerlund
Turku Intelligent Embedded and Robotic Systems Lab, University of Turku, Finland
Emails: {wezhao, jopequ, qingqli, tovewe}@utu.fi

Abstract—Current research directions in deep reinforcement learning include bridging the simulation-reality gap, improving the sample efficiency of experiences in distributed multi-agent reinforcement learning, and developing robust methods against adversarial agents in distributed learning, among many others. In this work, we are particularly interested in analyzing how multi-agent reinforcement learning can bridge the gap to reality in distributed multi-robot systems where the operation of the different robots is not necessarily homogeneous. These variations can happen due to sensing mismatches, inherent errors in the calibration of the mechanical joints, or simple differences in accuracy. While our results are simulation-based, we introduce the effect of sensing, calibration, and accuracy mismatches in distributed reinforcement learning with proximal policy optimization (PPO). We discuss how both the different types of perturbations and the number of agents experiencing those perturbations affect the collaborative learning effort. The simulations are carried out using a Kuka arm model in the Bullet physics engine. This is, to the best of our knowledge, the first work exploring the limitations of PPO in multi-robot systems when considering that different robots might be exposed to different environments where their sensors or actuators have induced errors. With the conclusions of this work, we set the initial point for future work on designing and developing methods to achieve robust reinforcement learning in the presence of real-world perturbations that might differ within a multi-robot system.
Index Terms—Reinforcement Learning; Multi-Robot Systems; Collaborative Learning; Deep RL; Adversarial RL; Sim-to-Real
I. INTRODUCTION
Reinforcement learning (RL) algorithms for robotics and cyber-physical systems have seen increasing adoption across multiple domains over the past decade [1], [2]. Deep reinforcement learning (DRL) enables agents to be trained in realistic environments without the need for large amounts of data to be gathered and labeled a priori. Specifically, reinforcement learning has enjoyed significant success in robotic control tasks involving manipulation [3], [4], [5]. Motivated by the way humans and animals learn, DRL algorithms work on a trial-and-error basis, where an agent interacts with its environment and receives a reward based on its performance. When complex agents or environments are involved, the learning process can be relatively slow. This has motivated the design and development of multi-agent DRL algorithms. In this paper, we are interested in exploring some of the challenges remaining in multi-robot collaborative DRL.

Reinforcement learning applied to multi-agent systems has two dimensions: DRL algorithms that model policies for multi-agent control and interaction, and DRL approaches that rely on multiple agents to parallelize the learning process or explore a wider variety of experiences. Within the former category, we can find examples of DRL for formation control [6], obstacle and collision avoidance [7], [8], collaborative assembly [9], or cooperative multi-agent control in general [10]. In the latter category, most existing approaches refer to the utilization of multiple agents to learn in parallel, but from the point of view of a multi-process or multi-threaded application [3]. We are interested in works exploring the possibilities of using multiple robotic agents that collaborate on learning the same task. This has been identified as one of the future paradigms with 5G-and-beyond connectivity and edge computing [11], [12]. For instance, in [13] an asynchronous method for off-policy updates between robots was presented. Other works also consider network conditions and propose frameworks for multi-agent collaborative DRL over imperfect network channels [14]. This type of scenario is illustrated in Fig. 1, where three robotic arms are collaboratively learning the same task and sharing their experiences to update a common policy. Hereinafter, we refer to these types of scenarios as multi-agent or multi-robot collaborative RL tasks, where multiple agents collaborate to learn the same task but might be exposed to different environments, or work under different conditions.

Fig. 1: Conceptual view of the proposed scenario, where multiple robotic agents are collaboratively learning the same task. While the task is common, and the agents are a priori identical, we study how different alterations in the agents or their environments affect the performance of the collaborative learning process.

Among the multiple challenges in DRL, recent years have seen a growing research interest in closing the simulation-to-reality gap [5], [15], and in the design and development of robust algorithms with resilience against adversarial conditions [16], [17], [18]. This latter topic is also of paramount relevance in distributed or multi-agent DRL, where adversarial agents can hinder the collaborative learning process [19]. When multiple agents are learning a collaborative or coordinated behavior, byzantine agents can significantly reduce the performance of the system as a whole.

We aim at studying how adversarial conditions can help to bridge the simulation-to-reality gap. In [5] and [15], the authors analyze perturbations in the rewards towards the applicability of DRL in real-world applications. In [5], the focus is on learning how to manipulate deformable objects, with agents trained in a simulation environment but directly deployable in the real world. In [20], the authors present a meta-learning approach for domain adaptation in simulation-to-reality transfers. Our objective in this paper is not to design a specific sim-to-real method for a given algorithm or task, but instead to analyze the performance of collaborative multi-robot DRL in the presence of disturbances in the environment, as a step towards more effective sim-to-real transfers where real noises, errors, or perturbations are accounted for also in the simulation environment. This includes variability in the operation of the robots, as robots might be operating in slightly different environments, or operate in different ways under the same environment.
In particular, we are interested in studying how exposing multiple collaborative robots to different environments, from the point of view of sensing and actuation, can affect the joint learning effort.

In this paper, therefore, we focus on introducing perturbations inspired by real-world cases in a multi-agent DRL simulation. We expose different subsets of agents to slightly modified environments and study how different types of disturbances affect the collaborative learning process and the ability of the multi-robot system to converge to a common policy. The main contribution of this paper is the analysis of how input and output disturbances affect a collaborative deep reinforcement learning process with multiple robot arms. In particular, we simulate real-world perturbations that can occur on robotic arms, from the sensing and actuation perspectives. This is, to the best of our knowledge, the first study to consider the evaluation of both sensing and actuation disturbances in a multi-robot collaborative learning scenario, with different robots being exposed to different environments.

The remainder of this document is organized as follows. In Section II we review the literature on distributed RL, adversarial RL, and robust multi-agent RL in the presence of byzantine agents. Then, Section III introduces the DRL algorithm, and the methodology and simulation environment utilized in our experiments. The agent training methods and environment disturbances introduced to emulate real-world operational variability, together with the simulation results, are presented in Section IV. Section V concludes the work.

II. RELATED WORKS
In this work, we study adversarial conditions in a simulation environment to emulate real-world conditions in terms of variability of the environment across a set of multiple agents collaborating in learning the same task. With most of the literature on simulation-to-reality transfer aiming at specific applications or adaptation to different environments [15], [5], [20], in this section we focus instead on previous works analyzing the effect of adversarial or byzantine behavior in multi-agent reinforcement learning, as well as on works considering other perturbations in the environment to better emulate real-world conditions. The literature on adversarial conditions for collaborative multi-agent learning is, nonetheless, sparse.

Adversarial RL has been a topic of extensive study over the past years. Multiple deep learning algorithms have been shown to be vulnerable when subject to adversarial input perturbations, which are able to induce certain policies [16]. This is a general problem of reinforcement learning that affects different types of algorithms and scenarios. In multi-agent environments, the ability of an attacker to create adversarial observations increases significantly [17]. A comprehensive survey on the main challenges and potential solutions for adversarial attacks on DRL is available in [21]. The authors classify attacks into four categories: attacks targeting (i) rewards, (ii) policies, (iii) observations, and (iv) the environment. Among these, those targeting observations and the environment are the most relevant within the scope of our work. In most of these cases, however, the literature only considers single-agent learning (or multiple agents being affected in the same way). Moreover, previous works focus on malicious perturbations aimed at decreasing the performance of the learning agent. In this paper, nonetheless, we induce perturbations that are inspired by real-world issues, including changes in accuracy or calibration errors.

Other authors have explored the effects of having noisy rewards in RL. In this direction, Wang et al. presented an analysis of perturbed rewards for different RL algorithms, including PPO but also DQN and DDPG, among others [18]. Compared to their approach, we also consider perturbations of the RL process but focus on those that model real-world noises and errors. Moreover, we specifically put an emphasis on multi-robot collaborative learning, and consider situations in which the perturbations that affect different robots are also different. We also focus on the PPO algorithm as the state of the art in three-dimensional locomotion. In fact, PPO has been identified as one of the most robust approaches against reward perturbations in [18]. Also within the study of noisy rewards, a method to improve performance in such scenarios is proposed in [22].

In general, we see a gap in the literature in the study of noisy or perturbed environments that do not affect equally all of the multiple agents collaborating towards learning the same task. This paper thus tries to address this issue with an initial assessment of how perturbations in the environment influencing a subset of agents affect a global common model where experiences from different agents are aggregated.
III. METHODOLOGY
In this section, we define our problem of distributed reinforcement learning with a subset of perturbed agents, as well as the simulation environment and the modifications applied to it.
A. Multi-agent RL
In multi-agent reinforcement learning, approaches can be roughly divided into two parallel modes: asynchronous and synchronous. A3C (Asynchronous Advantage Actor-Critic) [3] is one of the most widely adopted methods for multi-agent reinforcement learning, representing the asynchronous paradigm. A3C consists of multiple independent agents with their own networks. These agents interact with different copies of the environment in parallel and update a global network periodically and asynchronously. After each update, the agents reset their own weights to those of the global network and then resume their independent exploration. Because some of the agents will be exploring the environment with an older version of the network weights, A3C results in a relatively suboptimal use of computational resources as well as noisier updates. An alternative is A2C (Advantage Actor-Critic), which utilizes a synchronous parallel mode. In this case, there are only two networks in the system. One is used by all agents equally to interact with the environment in parallel, and outputs a batch of experiences. With this data, the second network is trained and updates the former network.

In this paper, we utilize a synchronous multi-agent reinforcement learning algorithm: proximal policy optimization (PPO). PPO has been adopted as the default method of OpenAI owing to its excellent performance. The PPO algorithm takes advantage of the A2C ideas in terms of having multiple workers, and of policy gradient ideas from TRPO (Trust Region Policy Optimization) to improve the actor performance by utilizing a trust region. PPO seeks to find a balance between ease of implementation, sample complexity, and ease of adjustment, trying to minimize the cost function at each update step while ensuring that the new policy does not deviate far from the previous one. The scheme follows these steps:

1) Set the initial policy parameters θ.
2) In each iteration, use θ_k to interact with the environment, collect experience data (tuples of state and action {s_t, a_t}), and compute their advantage A^{θ_k}(s_t, a_t) [3].
3) Find the optimal θ by optimizing J_PPO(θ):

J_PPO^{θ_k}(θ) = J^{θ_k}(θ) − β · KL(θ, θ_k)    (1)

where β is a hyperparameter that is adapted according to the value of KL. J^{θ_k}(θ) is calculated as:

J^{θ_k}(θ) ≈ Σ_{(s_t, a_t)} [p_θ(a_t | s_t) / p_{θ_k}(a_t | s_t)] · A^{θ_k}(s_t, a_t)    (2)

where p_{θ_k}(a_t | s_t) is the probability of (s_t, a_t) under θ_k.
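As a concrete reference, the following is a minimal PyTorch-style sketch of the KL-penalized objective in Eqs. (1) and (2). It is our own illustration rather than the exact training code used in the experiments, and the function and variable names are ours.

import torch

def ppo_kl_loss(logp_new, logp_old, advantages, beta):
    # logp_new: log p_theta(a_t | s_t) under the policy being optimized (requires grad).
    # logp_old: log p_theta_k(a_t | s_t) under the policy that collected the data.
    # advantages: estimates of A^{theta_k}(s_t, a_t); beta: KL penalty coefficient.
    ratio = torch.exp(logp_new - logp_old)
    surrogate = (ratio * advantages).mean()    # importance-weighted advantage, Eq. (2)
    approx_kl = (logp_old - logp_new).mean()   # sample-based estimate of the KL term
    return -(surrogate - beta * approx_kl)     # negative of Eq. (1), minimized by gradient descent

In practice, β is typically increased when the measured KL divergence exceeds a target value and decreased otherwise, which corresponds to the adaptation mentioned after Eq. (1).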
B. Simulation Environment

Our simulation environment is built on top of the Bullet physics simulator, specifically the PyBullet Kuka arm grasping environment [23]. In order to simplify the training of our RL algorithm, we modify the original grasping task into a reaching task, which allows us to focus on observing the effect of adversarial agents in training distributed reinforcement learning algorithms, rather than on optimizing the training itself. The simulation environment is shown in Figure 2.

Fig. 2: Kuka arm reaching environment based on the Bullet simulator.

The robotic arm in this environment attempts to reach the object in the bin. It takes the Cartesian coordinates of the gripper and the relative position of the object as input, instead of the on-shoulder camera observation. This input can be seen as a list with nine elements:

Input = [x_g, y_g, z_g, yaw_g, pit_g, rol_g, x_og, y_og, rol_og]    (3)

where x_g, y_g, z_g denote the Cartesian coordinates of the center of the gripper, yaw_g, pit_g, rol_g refer to its three Euler angles, and x_og, y_og, rol_og indicate the relative x, y position and the roll angle of the object in the gripper space. Our RL algorithm receives this input and then outputs a Cartesian displacement:

Output = [dx, dy, dz, dφ]    (4)

in which φ is the rotation angle of the wrist around the z-axis. An inverse kinematics method is then employed to calculate the actual motor control values of the joints. Note that all the units used for positions are meters, and the angles are in radians. This environment, together with our training code, is open-source on GitHub: https://github.com/TIERS/NoisyKukaReacher
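The relative object pose in Eq. (3) can be obtained with PyBullet's standard transform utilities. The following is a minimal sketch of that computation (the function name is ours, and the exact code in the open-source environment may differ):

import pybullet as p

def object_in_gripper_frame(gripper_pos, gripper_orn, obj_pos, obj_orn):
    # All poses are (position, quaternion) pairs expressed in world coordinates.
    # Eq. (3) keeps the relative x, y position and one rotation angle of the
    # object expressed in the gripper frame.
    inv_pos, inv_orn = p.invertTransform(gripper_pos, gripper_orn)
    rel_pos, rel_orn = p.multiplyTransforms(inv_pos, inv_orn, obj_pos, obj_orn)
    rel_euler = p.getEulerFromQuaternion(rel_orn)
    return rel_pos, rel_euler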
C. Collaborative Learning under Real-World Perturbations

In real robots, some of the most characteristic sources of perturbations within a homogeneous multi-robot team come from the calibration of the robots in terms of sensing and actuation. In this paper, we thus study how these two types of input (sensing) and output (actuation) perturbations affect a collaborative learning process:
Input perturbations: we consider both fixed and variable errors in the input to the network regarding the position of the object to be reached. This emulates the error that might result from identifying the position of the object with a camera or another sensor on the robot arm. The fixed noise represents, for instance, installation or calibration errors on the position of the camera, which might have an offset in position or orientation. Variable errors, on the other hand, try to emulate the sensing errors that come, for example, from the vibration of the arm or local odometry errors describing its orientation and position.
Output perturbations: we simulate both fixed and variable perturbations in the actuation of the robotic arm, emulating calibration errors (e.g., a constant offset in one direction), or changes in accuracy or repeatability across different robots.

Through multiple simulations, we study how these types of perturbations affect the collaborative learning effort when they are not common across the entire set of agents.
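To make the two perturbation types concrete, the following is a minimal sketch of how fixed and per-step uniform offsets could be injected into observations and actions around an environment step. This is our own illustration and not the code of the open-source environment; the wrapper class, attribute names, and the Gym-like reset()/step() interface are assumptions.

import numpy as np

class NoisyEnvWrapper:
    # Adds sensing (input) and actuation (output) offsets to a reaching environment.
    # Fixed offsets model calibration errors; variable bounds model per-step noise
    # such as vibration or reduced repeatability.
    def __init__(self, base_env, obs_fixed=0.0, obs_var=0.0, act_fixed=0.0, act_var=0.0):
        self.env = base_env
        self.obs_fixed, self.obs_var = obs_fixed, obs_var
        self.act_fixed, self.act_var = act_fixed, act_var
        self.rng = np.random.default_rng()

    def _perturb(self, values, fixed, var):
        noise = self.rng.uniform(0.0, var, size=np.shape(values)) if var > 0 else 0.0
        return np.asarray(values, dtype=float) + fixed + noise

    def reset(self):
        return self._perturb(self.env.reset(), self.obs_fixed, self.obs_var)

    def step(self, action):
        # Output (actuation) perturbation: offset the x displacement of the action.
        noisy_action = np.asarray(action, dtype=float).copy()
        noisy_action[0] = self._perturb(noisy_action[0], self.act_fixed, self.act_var)
        obs, reward, done, info = self.env.step(noisy_action)
        # Input (sensing) perturbation: offset every element of the observation.
        return self._perturb(obs, self.obs_fixed, self.obs_var), reward, done, info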
IV. EXPERIMENTATION AND RESULTS
In this section, we describe the training parameters used throughout our simulations, and the ways in which the environments have been modified to introduce disturbances in both sensors and actuators. We then present the results of multiple simulations where different numbers of agents have been trained in different environments but treated equally from the point of view of the collaborative learning process.
A. Training Method
The maximal number of steps in one episode is set to 1200, and the maximum number of steps for the whole training process is set to 4 million. If the gripper contacts the object or approaches it within a very small distance (0.008 m), the episode is terminated. The final score for the episode is thus calculated by summing all the rewards obtained in all the steps until termination.

The reward is set to -1000 for each step in which the distance is larger than 1 m. However, if the distance between the finger on the gripper and the object is smaller than 1 m, the reward is computed as reward_raw = -1000 · distance. Moreover, we also add a cost to each step, in order to encourage the gripper to approach the target as soon as possible. The cost of each step is set to 1. The final reward for a step is hence reward_final = -1000 · distance - 1, where the distance is given in meters. If the gripper finally contacts its target or approaches it within the threshold, we give it a significantly larger reward (1000) to help the model learn faster and more clearly.

In total, in our simulations, we utilize 30 agents parallelized as GPU processes to produce experience data based on a vectorized environment. These agents can therefore represent a multi-robot system learning a collaborative RL task. We apply different settings to individual environments to manually simulate the possible perturbations that robots encounter in real-world scenarios.
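A minimal reconstruction of this per-step reward is sketched below. It is our own reading of the description above: the function name is ours, the -1000 scaling of the distance term mirrors the per-step penalty used beyond 1 m, and applying the step cost in all non-terminal steps is an assumption.

def step_reward(distance, reached):
    # distance: gripper-to-object distance in meters.
    # reached: True when the gripper contacts the object or is within 0.008 m.
    if reached:
        return 1000.0  # large terminal bonus to speed up learning
    raw = -1000.0 if distance > 1.0 else -1000.0 * distance
    return raw - 1.0   # per-step cost of 1 to encourage reaching the target quickly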
B. Calibration and Accuracy Noises

To emulate the practical noises and errors that robots could encounter when training an RL algorithm, we consider the following four types of perturbations, generating a different environment for each of them to which a variable number of robots is exposed: fixed input errors added to all nine observation elements, sensing errors drawn uniformly from a fixed interval, fixed output errors offsetting the gripper actuation along the x axis, and output errors drawn uniformly from a fixed interval. It should be noted that the uniformly distributed errors on input and output can be different in each step, and can thus be regarded as inaccurate sensing, or reduced repeatability in the actuation of real robots.

Moreover, in order to further analyze how more extreme cases affect the collaborative learning process, we also consider fixed disturbances of larger magnitude (0.015 m on all the input values for the sensing error and 0.015 m on the x-axis for the actuation error), as well as scenarios where the noise is different for each of the agents exposed to the modified environment (in the interval 0.005 m to 0.025 m for 25 agents).
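The assignment of these settings across the 30 parallel agents can be illustrated with a small helper that builds one noise configuration per agent; the function, dictionary keys, and the 0.005 m placeholder magnitude are our own assumptions, not values taken from the repository.

N_AGENTS = 30

def agent_noise_configs(n_perturbed, mode, magnitude=0.005):
    # mode: "IF", "IV", "OF" or "OV" for input/output, fixed/variable perturbations.
    # Only the first n_perturbed agents receive a modified environment; the
    # remaining agents keep a zero-noise configuration.
    configs = []
    for i in range(N_AGENTS):
        cfg = {"obs_fixed": 0.0, "obs_var": 0.0, "act_fixed": 0.0, "act_var": 0.0}
        if i < n_perturbed:
            key = {"IF": "obs_fixed", "IV": "obs_var",
                   "OF": "act_fixed", "OV": "act_var"}[mode]
            cfg[key] = magnitude
        configs.append(cfg)
    return configs

# Example: 15 agents with a fixed sensing offset, 15 noise-free agents.
configs = agent_noise_configs(n_perturbed=15, mode="IF")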
C. Simulation Results

Figure 3 shows the results of our simulations. The notation describing each subfigure is as follows: {I,O} represents input (sensing) or output (actuation) perturbations, {F,V} represents fixed or variable perturbations, and {5, 15, 25} represents the number of agents exposed to the modified environment where the perturbations occur.

Fig. 3: Simulation results showing the training in a perturbation-free environment (a), and 12 cases where we analyze the effect of a modified environment (fixed and variable perturbations on both sensing and actuation) on 5, 15 and 25 agents: (b)-(d) IF5/IF15/IF25, (e)-(g) IV5/IV15/IV25, (h)-(j) OF5/OF15/OF25, (k)-(m) OV5/OV15/OV25. The total number of agents is 30 in all cases. The legend is common across all graphs and has been omitted in subfigures (b) through (m) to improve readability.

Comparing perturbations in the sensors versus perturbations in the actuators, we see an overall more robust performance against adversarial elements on the sensing side. In Figures 3b to 3d, we see that the network always converges, and we only see more unstable behaviour when there is a large fraction of agents suffering from variable sensing errors (50% and 83%). When we compare the effect of constant or fixed perturbations against variable ones, we notice that variable perturbations induce less stable convergence. This can be to some extent explained by the fact that there are no large subsets of agents being exposed to a common environment.

For small fixed perturbations affecting actuation (output disturbances), we have seen that the agents are able to converge towards a working policy. In the cases where 5 or 25 of the agents are affected, this was expected, as there is in both cases a majority (25) of agents that work in exactly the same way, and a small subset (5) that work in a slightly different way (but still the same within that subset). When this fixed perturbation is introduced to half of the agents, then we have two subsets of the same size operating in different ways, but again consistently within each of the subsets. In this case, we have seen that for a small magnitude of the perturbation, the agents still converge on a policy that works for both subsets. As the difference between the operation of the agents in these two subsets diverges, the performance of the system as a whole drops significantly. Nonetheless, we have observed that in the case where half of the agents have a common fixed perturbation of small magnitude, the system is able to converge even when the initial conditions are disadvantageous.

In order to analyze the effect of perturbations of larger magnitude, as well as of fixed perturbations in both sensing and actuation that vary across the robots exposed to a modified environment, we have analyzed four more cases shown in Fig. 4. In Figures 4a and 4b, we analyze how perturbations with larger magnitude affect the learning process, with half of the agents being affected as the worst-case scenario. We see that the trend from the previous results is followed, with the network being able to converge to a common policy when a constant error is added to the sensor interface, but not when the disturbances affect the actuators.
Finally, Figures 4c and 4d show that when there are no differentiated subsets of agents with a common behaviour and the perturbations are different across a large number of agents, then the system is not able to converge.

Fig. 4: Simulation results with an extra four scenarios analyzed: two where we consider perturbations of larger magnitude ((a) IF15 and (b) OF15), and two more where we consider that each of the agents in a modified environment is affected differently ((c) OF25 and (d) IF25).
V. CONCLUSION AND FUTURE WORK

Adversarial agents and closing the simulation-to-reality gap are among the key challenges preventing a wider adoption of reinforcement learning in real-world applications. In this paper, we have addressed the latter from the perspective of the former: by introducing adversarial conditions inspired by real-world perturbations to a subset of agents in a multi-robot system during a collaborative reinforcement learning process, we have been able to identify points where the robustness of distributed multi-agent DRL algorithms needs to be improved. Concretely, we have considered multiple robotic arms in a simulation environment collaborating towards learning a common policy to reach an object. In order to emulate more realistic conditions and understand how perturbations in the environment affect the learning process, we have considered variability across the agents in terms of their ability to sense and actuate accurately. We have shown how different types of disturbances in the model's input (sensing) and output (actuation) affect the robustness and the ability to converge towards an effective policy. We have seen that variable perturbations have the largest effect on the ability of the network to converge, while disturbances in the ability of the robots to actuate properly have had a comparatively worse effect than those in their ability to sense the position of the object accurately.

The conclusions of this work serve as a starting point towards the design and development of more robust methods able to identify and take into account those disturbances in the environment that do not occur across all robots equally. This will be the subject of our future work, as well as the study of other types or combinations of disturbances in the environment. We will also work towards modeling real-world errors more accurately for RL simulation environments.
ACKNOWLEDGEMENTS
This work was supported by the Academy of Finland's AutoSOS project, grant number 328755.
REFERENCES

[1] Kai Arulkumaran, Marc Peter Deisenroth, Miles Brundage, and Anil Anthony Bharath. A brief survey of deep reinforcement learning. arXiv preprint arXiv:1708.05866, 2017.
[2] Thanh Thi Nguyen, Ngoc Duy Nguyen, and Saeid Nahavandi. Deep reinforcement learning for multiagent systems: A review of challenges, solutions, and applications. IEEE Transactions on Cybernetics, 2020.
[3] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, 2016.
[4] Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, Giulia Vezzani, John Schulman, Emanuel Todorov, and Sergey Levine. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. arXiv preprint arXiv:1709.10087, 2017.
[5] Jan Matas, Stephen James, and Andrew J. Davison. Sim-to-real reinforcement learning for deformable object manipulation. arXiv preprint arXiv:1806.07851, 2018.
[6] Ronny Conde, José Ramón Llata, and Carlos Torre-Ferrero. Time-varying formation controllers for unmanned aerial vehicles using deep reinforcement learning. arXiv preprint arXiv:1706.01384, 2017.
[7] Yu Fan Chen, Miao Liu, Michael Everett, and Jonathan P. How. Decentralized non-communicating multiagent collision avoidance with deep reinforcement learning. In ICRA, pages 285–292. IEEE, 2017.
[8] Pinxin Long, Tingxiang Fan, Xinyi Liao, Wenxi Liu, Hao Zhang, and Jia Pan. Towards optimally decentralized multi-robot collision avoidance via deep reinforcement learning. In ICRA. IEEE, 2018.
[9] Dorothea Schwung, Fabian Csaplar, Andreas Schwung, and Steven X. Ding. An application of reinforcement learning algorithms to industrial multi-robot stations for cooperative handling operation, pages 194–199. IEEE, 2017.
[10] Jayesh K. Gupta, Maxim Egorov, and Mykel Kochenderfer. Cooperative multi-agent control using deep reinforcement learning. In AAMAS, pages 66–83. Springer, 2017.
[11] Jorge Peña Queralta, Li Qingqing, Zhuo Zou, and Tomi Westerlund. Enhancing autonomy with blockchain and multi-access edge computing in distributed robotic systems. In The Fifth International Conference on Fog and Mobile Edge Computing (FMEC). IEEE, 2020.
[12] Jorge Peña Queralta and Tomi Westerlund. Blockchain-powered collaboration in heterogeneous swarms of robots. Frontiers in Robotics and AI, 2020.
[13] Shixiang Gu, Ethan Holly, Timothy Lillicrap, and Sergey Levine. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In ICRA, pages 3389–3396. IEEE, 2017.
[14] Yiding Yu, Soung Chang Liew, and Taotao Wang. Multi-agent deep reinforcement learning multiple access for heterogeneous wireless networks with imperfect channels. arXiv preprint arXiv:2003.11210, 2020.
[15] Bharathan Balaji, Sunil Mallya, Sahika Genc, Saurabh Gupta, Leo Dirac, Vineet Khare, Gourav Roy, Tao Sun, Yunzhe Tao, Brian Townsend, et al. DeepRacer: Educational autonomous racing platform for experimentation with sim2real reinforcement learning. arXiv preprint arXiv:1911.01562, 2019.
[16] Vahid Behzadan and Arslan Munir. Vulnerability of deep reinforcement learning to policy induction attacks. In MLDM. Springer, 2017.
[17] Adam Gleave, Michael Dennis, Cody Wild, Neel Kant, Sergey Levine, and Stuart Russell. Adversarial policies: Attacking deep reinforcement learning. arXiv preprint arXiv:1905.10615, 2019.
[18] Jingkang Wang, Yang Liu, and Bo Li. Reinforcement learning with perturbed rewards. In AAAI, pages 6202–6209, 2020.
[19] Jiaming Song, Hongyu Ren, Dorsa Sadigh, and Stefano Ermon. Multi-agent generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pages 7461–7472, 2018.
[20] Karol Arndt, Murtaza Hazara, Ali Ghadirzadeh, and Ville Kyrki. Meta reinforcement learning for sim-to-real domain adaptation. arXiv preprint arXiv:1909.12906, 2019.
[21] Inaam Ilahi, Muhammad Usama, Junaid Qadir, Muhammad Umar Janjua, Ala Al-Fuqaha, Dinh Thai Hoang, and Dusit Niyato. Challenges and countermeasures for adversarial attacks on deep reinforcement learning. arXiv preprint arXiv:2001.09684, 2020.
[22] Aashish Kumar et al. Enhancing performance of reinforcement learning models in the presence of noisy rewards. PhD thesis, 2019.