Ubiquitous Distributed Deep Reinforcement Learning at the Edge: Analyzing Byzantine Agents in Discrete Action Spaces

Wenshuai Zhao, Jorge Peña Queralta, Li Qingqing, Tomi Westerlund
Turku Intelligent Embedded and Robotic Systems Lab, University of Turku, Finland
Emails: {wezhao, jopequ, qingqli, tovewe}@utu.fi

Abstract—The integration of edge computing in next-generation mobile networks is bringing low-latency and high-bandwidth ubiquitous connectivity to a myriad of cyber-physical systems. This will further boost the increasing intelligence that is being embedded at the edge in various types of autonomous systems, where collaborative machine learning has the potential to play a significant role. This paper discusses some of the challenges in multi-agent distributed deep reinforcement learning that can occur in the presence of byzantine or malfunctioning agents. As the simulation-to-reality gap gets bridged, the probability of malfunctions or errors must be taken into account. We show how wrong discrete actions can significantly affect the collaborative learning effort. In particular, we analyze the effect of having a fraction of agents that might perform the wrong action with a given probability. We study the ability of the system to converge towards a common working policy through the collaborative learning process, based on the number of experiences from each of the agents to be aggregated for each policy update, together with the fraction of wrong actions from agents experiencing malfunctions. Our experiments are carried out in a simulation environment using the Atari testbed for the discrete action spaces, and advantage actor-critic (A2C) for the distributed multi-agent training.
Index Terms—Reinforcement Learning; Edge Computing; Multi-Agent Systems; Collaborative Learning; RL; Deep RL; Adversarial RL
I. INTRODUCTION
The edge computing paradigm is bringing higher degrees of intelligence to connected cyber-physical systems across multiple domains. This intelligence is in turn enabled by lightweight deep learning (DL) models deployed at the edge for real-time computation. Among the multiple DL approaches, reinforcement learning (RL) has been increasingly adopted in various types of cyber-physical systems over the past decade, and, in particular, multi-agent RL [1]. Deep reinforcement learning (DRL) algorithms are motivated by the way natural learning happens: through trial and error, learning from experiences based on the performance outcome of different actions. Among other fields, DRL algorithms have had success in robotic manipulation [2], but also in finding more optimal approaches to complex multi-dimensional problems involved in edge computing [3].

We are particularly interested in how edge computing can enable more efficient real-time distributed and collaborative multi-agent RL. When discussing multi-agent DRL, two different views emerge from the literature: those where multiple agents are utilized to improve the learning process (e.g., faster learning through parallelization [4], higher diversity by means of exploration of different environments [5], or increased robustness with redundancy [1]), and those where multiple agents are learning a policy emerging from an interactive behavior (e.g., formation control algorithms [6], or collision avoidance [7]).

This work explores a scenario where multiple agents are collaboratively learning towards the same task, which is often individual, as illustrated in Fig. 1. Deep reinforcement learning has been identified as one of the key areas that will see wider adoption within the Internet of Things (IoT) owing to the rise of edge computing and 5G-and-beyond connectivity [8], [9]. Multiple application fields can benefit from this synergy in different domains, from robotic agents in the industrial IoT, to information fusion in wireless sensor networks, and including all types of connected autonomous systems and other types of cyber-physical systems.

Multiple challenges still exist in distributed multi-agent DRL. Among the most relevant ones within the scope of this paper are the development of novel techniques to increase robustness in the presence of adversarial agents or perturbed environments [10], [11], [12], as well as closing the simulation-to-reality gap [2], [13], [14]. In relation to the former area, recent works have focused on exploring different types of noise or perturbations in the agents or environments to better understand how these potentially adversarial conditions affect the collaborative learning process. For example, Gu et al. study in [15] the effect of network delays and propose an asynchronous method for off-policy updates, while Yu et al. have studied the effect of adversarial conditions in the network connection between the agents [16]. We have seen a lack of research, however, on the analysis of adversarial conditions in discrete action spaces. These types of scenarios occur when agents need to make a decision from a finite and discrete set of actions.

In this paper, we study the effect of byzantine agents that perform the wrong action with certain probabilities, and report initial results that let us understand the limitations of the state of the art in multi-agent RL in the presence of byzantine agents for discrete action spaces.
In particular, we utilize the synchronous advantage actor-critic (A2C) algorithm on two of the standard Atari environments typically used for benchmarking DRL methods. This is, to the best of our knowledge, the first paper analyzing the effects, in terms of policy convergence, that different fractions of wrong actions in discrete action spaces have on collaborative multi-agent DRL. Our results show that in some environments with a total of 16 distributed agents the training process is highly sensitive to having a single agent acting in the wrong manner over a relatively small fraction of its actions, and unstable convergence appears with just a single agent having 2.5% of its actions wrong. In other environments the threshold is higher, with the network still converging to a working policy at over 10% of wrong actions in a single byzantine agent.

Fig. 1: Conceptual view of the system architecture proposed in this paper. Multiple agents are collaborating towards learning a common task, with the deep neural network updates being synchronized at the edge layer. Each agent, however, individually explores its environment, gathering experience in a series of episodes and calculating the corresponding rewards.
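To make the synchronization scheme of Fig. 1 concrete, the following minimal Python sketch outlines one update cycle: every agent collects a short rollout under the shared policy, the experiences are aggregated synchronously at the edge backend, and a single gradient step produces the next common policy. The Agent class, the policy.act/policy.update interface, and the rollout length are illustrative assumptions, not the implementation used for the experiments.

class Agent:
    """One worker exploring its own copy of the environment."""
    def __init__(self, env, policy):
        self.env = env
        self.policy = policy        # shared, edge-hosted policy object
        self.obs = env.reset()

    def rollout(self, n_steps=5):
        """Collect n_steps transitions under the current common policy."""
        batch = []
        for _ in range(n_steps):
            action = self.policy.act(self.obs)
            next_obs, reward, done, _ = self.env.step(action)
            batch.append((self.obs, action, reward, done))
            self.obs = self.env.reset() if done else next_obs
        return batch

def train(agents, policy, n_updates):
    for _ in range(n_updates):
        # Synchronous experience upload: all agents contribute before any update.
        batches = [agent.rollout() for agent in agents]
        # One shared gradient step on the aggregated data (A2C-style).
        policy.update([step for batch in batches for step in batch])
        # Common policy download is implicit: all agents share `policy`.

Because the aggregation is synchronous, a byzantine agent's experiences enter the same gradient step as everyone else's, which is why even a single faulty worker can bias the common policy.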
The remainder of this document is organized as follows. Section 2 presents related works in the area of adversarial RL and applications combining RL with edge computing. Section 3 describes the basic theory behind A2C and the simulation environments. Section 4 then presents our results on the convergence of the system when a fraction of the actions is wrong, and Section 5 concludes the work and outlines our future work directions.

II. RELATED WORKS

Adversarial RL has attracted many researchers' interest in recent years. Multiple deep learning algorithms are known to be vulnerable to manipulation by perturbed inputs [10]. This problem also affects various reinforcement learning algorithms under different scenarios. In multi-agent environments, an attacker controlling one of the agents can significantly affect the others through adversarial policies [11]. Ilahi et al. review emerging adversarial attacks in DRL-based systems and the potential countermeasures to defend against these attacks [17]. The authors classify the attacks as attacks targeting (i) rewards, (ii) policies, (iii) observations, and (iv) the environment. In this paper, instead, we consider attacks targeting the agents' actions, which can happen in real-world applications when agents interact with their environment.

Similarly considering how to better transfer learning from simulations to the real world, multiple researchers have been working on simulation-to-reality transfer for specific applications in different environments [13], [2], [18]. In this paper, we analyze the effect of adversarial or byzantine agents in multi-agent reinforcement learning, and introduce a fraction of byzantine actions, which has not been studied before.

Other researchers have explored the influence of noisy rewards in RL. Wang et al. present a robust RL framework that enables agents to learn in noisy environments where only perturbed rewards are observed, and analyze the performance of different algorithms under their proposed framework, including PPO, DQN, and DDPG [12]. In this paper, perturbations of the DRL process are also explored, but we focus on analyzing discrete action spaces and a fraction of byzantine actions performed by a small number of byzantine agents.

Multiple works have been presented in the convergence of DRL and edge computing. However, rather than focusing on exploiting edge computing for distributed RL, most of the current literature exploits RL for edge services. For instance, Ning et al. apply DRL for more efficient offloading orchestration [3], while Wang et al. have applied DRL to optimize resource allocation at the edge [19]. In our work, however, we focus on analyzing some of the challenges that can appear when edge computing is exploited for distributed multi-agent DRL in real-world applications.
III. METHODOLOGY
This section describes the methods and simulation environments utilized for the analysis in Section 4: the advantage actor-critic (A2C) algorithm for distributed DRL, and the simulation environment.

Actor-critic methods combine the advantages of value-based and policy-based methods, and have been regarded as the basis of many modern RL algorithms. In A2C, two neural networks represent the actor and the critic, where the actor controls the agent's behavior and the critic evaluates how good the taken action is. As value-based methods tend to exhibit high variability, an advantage function is employed to replace the raw value function, leading to the advantage actor-critic (A2C). The main scheme of the policy gradient updates is shown in (1):

$\theta_{new} \leftarrow \theta_{old} + \eta \nabla \bar{R}_{\theta}$    (1)

where $\theta$ denotes the parameters of the policy to be learned, $\eta$ is the learning rate, and $\nabla \bar{R}_{\theta}$ represents the policy gradient, given by (2):

$\nabla \bar{R}_{\theta} \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^n) \, \nabla \log p(a_t^n \mid s_t^n, \theta)$    (2)

where $N$ is the number of trajectories $\tau$ sampled under the policy, and $R(\tau^n)$ denotes the accumulated reward for each episode consisting of $T_n$ steps. Under the policy with weights $\theta$, an action $a_t^n$ in the state $s_t^n$ is chosen with probability $p(a_t^n \mid s_t^n, \theta)$.

In this policy gradient method, the accumulated reward $R(\tau^n)$ is calculated by sampling the trajectories, which are only computed when the whole episode is finished, and hence might bring high variability affecting the policy convergence. To avoid this, value estimation is introduced and merged into the policy gradient method. An advantage function is thus proposed to replace $R(\tau^n)$ according to (3), which is also the reason for the name of A2C:

$R(\tau^n) = r_t^n + V^{\pi}(s_{t+1}^n) - V^{\pi}(s_t^n)$    (3)

where $r_t^n$ is the reward gained in step $t$, and $V^{\pi}$ denotes the value function estimating the accumulated reward that will be gained. Additionally, in the implementation of this A2C algorithm, multiple agents are employed to produce the trajectories in parallel. Compared to A3C [4], in which each agent updates the network individually and asynchronously, A2C collects the whole data from all agents and then updates the shared network. This is also illustrated in Fig. 1.

In order to analyze the effect of byzantine actions, we choose two typical gym-wrapped Atari games as our simulation environments: PongNoFrameskip-v4 (Fig. 2b) and BreakoutNoFrameskip-v4 (Fig. 3b). Both of them take video as input, based on which the policy is trained to produce the corresponding discrete actions to obtain higher rewards. The action spaces for Pong and Breakout have cardinality 5 and 4, respectively. We set the corresponding byzantine agents to behave with the opposite actions (e.g., if the output action from the policy is action = 2 in Pong, then the wrong action taken by a byzantine agent will be action = 3).

In this paper, we consider the effect of the presence of byzantine agents in terms of their number and the frequency of the wrong actions they perform. The patterns we observe in the experiments can be further utilized to detect byzantine agents in distributed multi-agent DRL scenarios.
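As a concrete illustration of the byzantine behavior described above, the following sketch wraps a gym environment so that, with a given probability, the action selected by the policy is replaced by its opposite before being executed. This is a minimal sketch assuming the standard gym ActionWrapper API; the opposite-action map shown is only an example based on the Pong case mentioned in the text, not the authors' released code.

import random
import gym

class ByzantineActionWrapper(gym.ActionWrapper):
    """Replaces the intended action with a wrong one with probability p_wrong."""
    def __init__(self, env, opposite, p_wrong):
        super().__init__(env)
        self.opposite = opposite  # maps intended action -> opposite action
        self.p_wrong = p_wrong    # probability of performing the wrong action

    def action(self, act):
        # Called by gym on every env.step(): flip the action with prob. p_wrong.
        if random.random() < self.p_wrong:
            return self.opposite.get(act, act)
        return act

# Example: a single byzantine worker, wrong on 5% of its steps
# (assumes the Atari extras for gym are installed).
env = ByzantineActionWrapper(gym.make("PongNoFrameskip-v4"),
                             opposite={2: 3, 3: 2}, p_wrong=0.05)

In the experiments, the wrong actions are additionally gated by the update schedule (e.g., only during a given fraction of the policy updates), which a single per-step probability as above only approximates.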
IV. SIMULATIONS AND RESULTS

In this section, we describe the settings in our experiments and present the main conclusions of our analysis. A total of 16 agents or workers are employed to produce trajectory data in both the Pong and Breakout environments. The experiences, actions and rewards from the different agents are then aggregated synchronously to calculate the policy gradients and update the policy towards a more optimal one. In the experiments, we analyze how byzantine agents that unknowingly perform wrong actions affect the collaborative learning effort. The reference trainings without byzantine agents for the Pong and Breakout environments are shown in Fig. 2a and Fig. 3a, respectively.

In the Pong environment, we first set a single agent continuously behaving wrongly, out of the total of 16 agents working in parallel (Fig. 2c). Compared to the reference training, we observe that the policy is unable to improve in order to obtain better rewards. Therefore, a single byzantine agent representing as little as 6.25% of the total is enough to completely disable the ability of the system to converge towards a working policy. We have therefore focused on analyzing the maximum fraction of wrong actions that byzantine agents can perform while still ensuring convergence of the system. Moreover, in order to test whether it is the total fraction that matters or the number of agents, we have considered the same total fraction of byzantine actions in different settings.

In the training, the policy is updated only after the agents perform a series of steps, collecting a certain amount of interaction data. In particular, each agent performs 5 steps between updates of the policy. This leads to 80 steps between updates, and we fix the total number of steps for the complete training process. The number of episodes depends on the performance of the agents (the better the performance, the longer the episodes are).

With this, we set different fractions of byzantine actions depending on (i) the number of agents, (ii) the number of wrong actions in between updates, and (iii) the fraction of updates affected by byzantine actions. The results for the Pong environment are shown in Figures 2d through 2h. From these, we conclude that 20% of byzantine actions is enough to deplete the system's ability to converge, while the system is able to converge with slight instabilities in the presence of 10% of byzantine actions (Figures 2e, 2g and 2h). Finally, with just 5% of byzantine actions (Fig. 2f) the convergence is similar to the reference.

Fig. 2: Experiments in the PongNoFrameskip-v4 environment. (a) Reference training (no byzantine agents). (b) The PongNoFrameskip-v4 environment. (c) One agent with continuous byzantine actions. (d) One agent, byzantine actions in 1/5 of updates (20% of the total). (e) One agent, byzantine actions in 1/10 of updates (10% of the total). (f) One agent, byzantine actions in 1/2 of steps of 1/10 of updates (5% of the total). (g) Two agents, byzantine actions in 1/2 of steps of 1/10 of updates (2 × 5% of the total). (h) Four agents, byzantine actions in 1/2 of steps of 1/20 of updates (4 × 2.5% of the total).

In addition, we conduct similar experiments on another classical Atari game, BreakoutNoFrameskip-v4. The results with byzantine actions are shown in Figures 3c through 3h. In this environment, 10% of byzantine actions is enough to deter convergence (Figures 3c, 3f, and 3h). Only by reducing the frequency to 2.5% can the training achieve acceptable convergence (Figure 3e). A general conclusion is also that the total fraction of byzantine actions is what matters the most, and not how they are introduced in the system.

Fig. 3: Experiments in the BreakoutNoFrameskip-v4 environment. (a) Reference training (no byzantine agents). (b) The BreakoutNoFrameskip-v4 environment. (c) One agent, byzantine actions in 1/10 of updates (10% of the total). (d) One agent, byzantine actions in 1/2 of steps of 1/10 of updates (5% of the total). (e) One agent, byzantine actions in 1/2 of steps of 1/20 of updates (2.5% of the total). (f) One agent, byzantine actions in 1/2 of steps of 1/5 of updates (10% of the total). (g) Two agents, byzantine actions in 1/2 of steps of 1/10 of updates (2 × 5% of the total). (h) Four agents, byzantine actions in 1/2 of steps of 1/20 of updates (4 × 2.5% of the total).
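For clarity, the per-agent percentages quoted above follow from the product of the share of updates affected and the share of wrong steps within an affected update. The helper below is only a bookkeeping illustration, not part of the training code.

def byzantine_fraction(update_share, step_share):
    # Fraction of a single agent's actions that are byzantine.
    return update_share * step_share

print(byzantine_fraction(1/5, 1))     # 0.2   -> Fig. 2d (20%)
print(byzantine_fraction(1/10, 1/2))  # 0.05  -> Fig. 2f (5%)
print(byzantine_fraction(1/20, 1/2))  # 0.025 -> Fig. 3e (2.5%)

Multiplying by the number of affected agents gives the combined fractions reported in Figures 2g, 2h, 3g and 3h.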
V. CONCLUSION AND FUTURE WORK
Adversarial agents and closing the simulation-to-reality gap are among the key challenges preventing a wider adoption of reinforcement learning in real-world applications. In this paper, we have addressed the latter from the perspective of the former: by introducing adversarial conditions inspired by real-world malfunctions to a subset of agents in a multi-agent system during a collaborative reinforcement learning process, we have been able to identify points where the robustness of distributed multi-agent DRL algorithms needs to be improved.

In particular, we have considered 16 agents collaboratively learning a common policy in two Atari environments with discrete action spaces, where byzantine agents perform the wrong action with a given probability. We have shown that even a small fraction of byzantine actions can significantly affect the ability of the system to converge towards a working policy, with the sensitivity varying across environments, and that the total fraction of byzantine actions matters more than the number of agents across which those actions are distributed.

The conclusions of this work serve as a starting point towards the design and development of more robust methods able to identify and take into account these disturbances, which do not occur across all agents equally. This will be the subject of our future work, as well as the study of other types or combinations of disturbances in the environment. We will also work towards modeling real-world errors for RL simulation environments more accurately.

ACKNOWLEDGEMENTS
This work was supported by the Academy of Finland's AutoSOS project with grant number 328755.
REFERENCES
[1] Thanh Thi Nguyen, Ngoc Duy Nguyen, and Saeid Nahavandi. Deep reinforcement learning for multiagent systems: A review of challenges, solutions, and applications. IEEE Transactions on Cybernetics, 2020.
[2] Jan Matas, Stephen James, and Andrew J. Davison. Sim-to-real reinforcement learning for deformable object manipulation. arXiv preprint arXiv:1806.07851, 2018.
[3] Zhaolong Ning, Peiran Dong, Xiaojie Wang, Joel J. P. C. Rodrigues, and Feng Xia. Deep reinforcement learning for vehicular edge computing: An intelligent offloading system. ACM Transactions on Intelligent Systems and Technology (TIST), 10(6):1–24, 2019.
[4] Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.
[5] Roberta Raileanu, Emily Denton, Arthur Szlam, and Rob Fergus. Modeling others using oneself in multi-agent reinforcement learning. arXiv preprint arXiv:1802.09640, 2018.
[6] Ronny Conde, José Ramón Llata, and Carlos Torre-Ferrero. Time-varying formation controllers for unmanned aerial vehicles using deep reinforcement learning. arXiv preprint arXiv:1706.01384, 2017.
[7] Pinxin Long, Tingxiang Fan, Xinyi Liao, Wenxi Liu, Hao Zhang, and Jia Pan. Towards optimally decentralized multi-robot collision avoidance via deep reinforcement learning. In IEEE International Conference on Robotics and Automation (ICRA), pages 6252–6259. IEEE, 2018.
[8] Jorge Peña Queralta, Li Qingqing, Zhuo Zou, and Tomi Westerlund. Enhancing autonomy with blockchain and multi-access edge computing in distributed robotic systems. In The Fifth International Conference on Fog and Mobile Edge Computing (FMEC). IEEE, 2020.
[9] Jorge Peña Queralta and Tomi Westerlund. Blockchain-powered collaboration in heterogeneous swarms of robots. Frontiers in Robotics and AI, 2020.
[10] Vahid Behzadan and Arslan Munir. Vulnerability of deep reinforcement learning to policy induction attacks. In International Conference on Machine Learning and Data Mining in Pattern Recognition, pages 262–275. Springer, 2017.
[11] Adam Gleave, Michael Dennis, Cody Wild, Neel Kant, Sergey Levine, and Stuart Russell. Adversarial policies: Attacking deep reinforcement learning. arXiv preprint arXiv:1905.10615, 2019.
[12] Jingkang Wang, Yang Liu, and Bo Li. Reinforcement learning with perturbed rewards. In AAAI, pages 6202–6209, 2020.
[13] Bharathan Balaji, Sunil Mallya, Sahika Genc, Saurabh Gupta, Leo Dirac, Vineet Khare, Gourav Roy, Tao Sun, Yunzhe Tao, Brian Townsend, et al. DeepRacer: Educational autonomous racing platform for experimentation with sim2real reinforcement learning. arXiv preprint arXiv:1911.01562, 2019.
[14] Wenshuai Zhao, Jorge Peña Queralta, Li Qingqing, and Tomi Westerlund. Towards closing the sim-to-real gap in collaborative multi-robot deep reinforcement learning. IEEE, 2020.
[15] Shixiang Gu, Ethan Holly, Timothy Lillicrap, and Sergey Levine. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In IEEE International Conference on Robotics and Automation (ICRA), pages 3389–3396. IEEE, 2017.
[16] Yiding Yu, Soung Chang Liew, and Taotao Wang. Multi-agent deep reinforcement learning multiple access for heterogeneous wireless networks with imperfect channels. arXiv preprint arXiv:2003.11210, 2020.
[17] Inaam Ilahi, Muhammad Usama, Junaid Qadir, Muhammad Umar Janjua, Ala Al-Fuqaha, Dinh Thai Hoang, and Dusit Niyato. Challenges and countermeasures for adversarial attacks on deep reinforcement learning. arXiv preprint arXiv:2001.09684, 2020.
[18] Karol Arndt, Murtaza Hazara, Ali Ghadirzadeh, and Ville Kyrki. Meta reinforcement learning for sim-to-real domain adaptation. arXiv preprint arXiv:1909.12906, 2019.
[19] Jiadai Wang, Lei Zhao, Jiajia Liu, and Nei Kato. Smart resource allocation for mobile edge computing: A deep reinforcement learning approach. IEEE Transactions on Emerging Topics in Computing, 2019.