Diverse Auto-Curriculum is Critical for Successful Real-World Multiagent Learning Systems∗
Blue Sky Ideas Track
Yaodong Yang†
University College London
Huawei R&D U.K.
Jun Luo
Huawei Canada
Ying Wen
Shanghai Jiao Tong University
Oliver Slumbers
University College London
Daniel Graves
Huawei Canada
Haitham Bou Ammar
Huawei R&D U.K.
Jun Wang
University College London
Huawei R&D U.K.
Matthew E. Taylor
University of Alberta
Alberta Machine Intelligence Institute
ABSTRACT
Multiagent reinforcement learning (MARL) has achieved a remarkable amount of success in solving various types of video games. A cornerstone of this success is the auto-curriculum framework, which shapes the learning process by continually creating new challenging tasks for agents to adapt to, thereby facilitating the acquisition of new skills. In order to extend MARL methods to real-world domains outside of video games, we envision in this blue sky paper that maintaining a diversity-aware auto-curriculum is critical for successful MARL applications. Specifically, we argue that behavioural diversity is a pivotal, yet under-explored, component for real-world multiagent learning systems, and that significant work remains in understanding how to design a diversity-aware auto-curriculum. We list four open challenges for auto-curriculum techniques, which we believe deserve more attention from this community. Towards validating our vision, we recommend modelling realistic interactive behaviours in autonomous driving as an important test bed, and recommend the SMARTS/ULTRA benchmark.
KEYWORDS
Multiagent reinforcement learning; auto-curriculum; behaviour models; autonomous driving; simulators; SMARTS
ACM Reference Format:
Yaodong Yang, Jun Luo, Ying Wen, Oliver Slumbers, Daniel Graves, Haitham Bou Ammar, Jun Wang, and Matthew E. Taylor. 2021. Diverse Auto-Curriculum is Critical for Successful Real-World Multiagent Learning Systems: Blue Sky Ideas Track. In Proc. of the 20th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2021), Online, May 3–7, 2021, IFAAMAS, 6 pages.

∗The authors thank the anonymous reviewers, as well as Greg d'Eon, Calarina Muslimani, Laura Petrich, Sahir, and Amirmohsen Sattarifard for comments and suggestions. Part of this work has taken place in the Intelligent Robot Learning (IRL) Lab, which is supported in part by research grants from Amii, CIFAR, and NSERC.
†Corresponding author: [email protected]

Proc. of the 20th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2021), U. Endriss, A. Nowé, F. Dignum, A. Lomuscio (eds.), May 3–7, 2021, Online.
1 INTRODUCTION

Reinforcement learning (RL) [85] allows an agent to learn to maximise cumulative rewards via environmental interactions. It has proved successful in many areas, including playing video games [61, 70], robotics control [49], data centre cooling [57], and asset pricing in finance [45]. Multiagent RL (MARL) extends RL to cover the setting where there are multiple learning entities in the environment [36, 97]. This technique has shown remarkable success, especially on multi-player video games such as StarCraft [88], Dota 2 [67], and Hide and Seek [5]. However, MARL has had relatively few successes in solving real-world problems. The core thesis of this paper is that the development of learning frameworks that can induce behavioural diversity in the policy space is critical for MARL to succeed in real-world domains. We summarise existing challenges and recommend autonomous driving (AD) as an ideal test bed for future investigations.

The challenges of deploying RL in the real world are frequently discussed in both workshops [64, 95, 96] and papers [20, 21]. Challenges include the lack of an accurate simulator [13], the high cost of environmental interaction, and the difficulty in learning both effective and diverse policies. Off-policy RL [56] or imitation learning [40] methods could be used for policy evaluation or policy improvement in an offline manner; this would allow agents to learn a good initial behaviour before ever interacting with an environment. However, these methods are only applicable if training data sets exist. When the training data are limited, for example in the AD domain [100], offline methods are insufficient for robust performance in the real world due to a lack of diversity in agents' behaviours [18, 92]. In fact, even in cases when a simulator is available, a lack of behavioural diversity could still exist due to the sim-to-real gap [59, 65, 76].

Unfortunately, MARL suffers from all of these concerns. Furthermore, the additional complexity of multi-agent problems, arising from the cross product of multiple agents' state and action spaces induced by social interactions, compounds these concerns. Developing frameworks that can deal with the underlying complexities of the MARL domain is crucial. We argue that the development of effective, yet diverse, behaviours is critical for MARL to have an impact in domains outside of video games.

The auto-curricula framework [53, 72] is a promising direction towards such a goal. In natural evolution, species with stronger adaptability flourish when nature alters the environment and some previously well-adapted species no longer survive. Through this co-evolution process [22, 68, 74], the diversity of life on Earth has been maintained over billions of years. Inspired by this mechanism of bio-diversity in nature, a series of MARL learning frameworks have recently been proposed and have demonstrated remarkable empirical success. These include open-ended evolution [9, 50, 82], population-based training [42, 58], and training by emergent curricula [5, 53, 72]. In general, these frameworks can be unified under the idea of an auto-curriculum, where an endless procession of better-performing agents is automatically generated by exerting selection pressure among the multiple self-optimising agents. The underlying principle of auto-curricula is that any adaptation an agent makes will have a cascading effect that other agents must adapt to in order to survive.
This intrinsically provides an automatic curriculum that continually facilitates agents' acquisition of new skills via such social interactions.

In this Blue Sky paper, we emphasise that maintaining a diversity-aware auto-curriculum is critical for successful MARL applications. Specifically, we advocate verification through one specific real-world problem: modelling interactive behaviours in AD scenarios. The main contributions of this work are as follows. We start by highlighting the necessity of behavioural diversity in multiagent systems in Section 2, and then briefly survey existing works and explain why they are not yet sufficient for MARL in Section 3. We investigate the idea of auto-curriculum in Section 4, and raise four open challenges which we believe deserve more attention from this community. Section 5 discusses why AD is an excellent domain to host such investigations and proposes one particular test bed for future study. Finally, we reiterate our vision in Section 6.

2 THE NECESSITY OF BEHAVIOURAL DIVERSITY

Nature exhibits a remarkable tendency towards diversity [38]. Over the past billions of years, a vast assortment of unique species have naturally evolved, each one capable of orchestrating the complex biological processes necessary to sustain life. Analogously, in computer science, machine intelligence can be considered as the ability to adapt to a diverse set of complex environments [37]. This suggests that the ceiling of intelligence rises when environments of increasing diversity and complexity are provided. In fact, recent successes in developing AI capable of achieving super-human performance on complicated multi-player video games, such as StarCraft [34, 89], Honour of Kings [99], Hide and Seek [5], and Dota 2 [67], have provided justification for emphasising behavioural diversity when designing learning protocols in multiagent systems. Specifically, promoting behavioural diversity is pivotal for MARL methods. Diversity not only prevents AI agents from checking the same policies repeatedly, but also helps agents discover niche skills, avoid systematic weaknesses, and maintain robust performance when encountering unfamiliar types of opponents at test time.

Behavioural diversity and the non-transitivity of many environments are intertwined. In biological systems, bio-diversity is promoted by the non-transitive interactions among many competing populations [43, 75]. The central feature of such non-transitive relations can be thought of as analogous to a Rock-Paper-Scissors game, where rock beats scissors, scissors beats paper, and paper beats rock. In game theory, the necessity of pursuing behavioural diversity is also deeply rooted in the non-transitive structure of games [6, 8, 48]. In general, an arbitrary game, of either the normal-form type [15] or the differential type [7], can always be decomposed into a sum of two components: a transitive part plus a non-transitive part. The transitive part of a game represents the structure in which the rule of winning is transitive (i.e., if strategy A beats B, and B beats C, then A can surely beat C), and the non-transitive part refers to the game structure in which the set of strategies follows a cyclic rule (e.g., the endless cycles of rock, paper, and scissors). Diversity matters especially for the non-transitive part because there is no consistent winner in such sub-games: if a player only plays rock, he can be exploited by paper, but not so if he is diverse in playing rock and scissors.
In fact, real-world problems often consist of a mixture of both parts [17]; it is therefore critical to design learning objectives that lead to behavioural diversity (a minimal numerical illustration of the decomposition is given at the end of this section).

Effective MARL performance often requires diversity in two aspects. The first aspect is whether the training player uses diversified strategies against a fixed type of opponent (we use the term "opponent" for presentation purposes, acknowledging that agents may be teammates, opponents, or something in between in non-zero-sum games). Most games involve non-transitivity in the policy space, and thus it is necessary for each player to acquire a diverse set of winning strategies so as to remain hard to exploit. The second aspect is the ability to pick a diverse set of opponents during training. In playing cooperative card games like Hanabi [10], one player may or may not understand the indirect signalling when choosing a card to play. If an agent has not learned to play diversely under both mindsets, it will fail to accurately model the collaborator and play sub-optimally. Similarly, in real-world driving, distinct locales have different conventions. The UK and the US drive on different sides of the road, and even within the same country, different cities can follow different conventions. For example, the Pittsburgh Left convention assumes that a few cars will turn left in front of traffic at the beginning of a green light [94], while other areas assume that cars will turn left in front of traffic during a yellow or at the beginning of a red light [78]. As a result, it can be expected that an autonomous agent without a diverse mindset could easily create hazards on the road [93].
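To make the transitive/non-transitive decomposition concrete, the following minimal sketch (plain NumPy; the uniform-average rating is our illustrative choice, not a construction taken verbatim from [6, 15]) rates each Rock-Paper-Scissors strategy by its average payoff, treats rating differences as the transitive part, and confirms that the cyclic remainder is the entire game:

```python
# A minimal sketch of the transitive/cyclic split on Rock-Paper-Scissors.
import numpy as np

# Antisymmetric payoff matrix for the row player: entry [i, j] is the payoff
# of strategy i against strategy j (1 = win, -1 = loss, 0 = draw).
A = np.array([[0., -1., 1.],   # rock vs (rock, paper, scissors)
              [1., 0., -1.],   # paper
              [-1., 1., 0.]])  # scissors

# Transitive component: each strategy's average payoff acts as a "rating"
# r_i, and the transitive game is the difference r_i - r_j.
r = A.mean(axis=1)
transitive = r[:, None] - r[None, :]

# Non-transitive (cyclic) component: whatever the ratings cannot explain.
cyclic = A - transitive

print(r)           # [0. 0. 0.] -- all ratings are equal
print(transitive)  # all zeros: RPS has no transitive structure
print(cyclic)      # equals A: the game is purely cyclic
```

In a purely transitive game the cyclic remainder would instead vanish and the ratings alone would rank the strategies; real-world games mix the two [17].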
3 EXISTING WORK ON MODELLING DIVERSITY

Despite the importance of diversity, there has been limited work within the machine learning domain where diversity is modelled in a principled way. Furthermore, there is no agreed-upon, formal definition. For example, behavioural diversity can be defined as the variance in rewards [50, 51], the convex hull of a gamescape [6], choosing whether or not to visit a new environmental state [27, 84, 98], or acquiring new types of skills in a task [28, 35].

So far, the majority of work that models diversity lies in evolutionary computation (EC) [3, 29], which attempts to mimic the natural evolution process. One classic idea in EC is novelty search [50, 51], which aims to search for behaviours that lead to different outcomes. Quality-diversity (QD) methods hybridise novelty search with fitness under the notion of survival of the fittest (i.e., high utility) [73]. Two representatives are Novelty Search with Local Competition [52] and MAP-Elites [16, 62].

Searching for behavioural diversity is also a common topic in RL, where it is often studied in the context of skill discovery [27, 28, 35], intrinsic rewards [11, 12, 31], or maximum-entropy learning [32, 33, 55]. These RL algorithms can be considered as QD methods, in the sense that quality refers to maximising cumulative reward, and diversity means either visiting a new state [27, 98] or obtaining a policy with larger entropy [55]. A minimal sketch of the novelty-search selection rule follows at the end of this section.

In the context of MARL, learning typically means an agent acting in an open-ended system with continually changing policies by different opponents. Yet, such a learning process can only guarantee differences but not diversity, which are two different notions; diversity is not an inherent feature of MARL. In fact, understanding the principle of how diversity is promoted in an auto-curriculum is an open problem in MARL [6, 69, 98]. In the example of training soccer AIs [47], learning against only different opponents can easily make an agent get into circular dynamics and not improve. Finally, this work is also different from the previous manifesto [53], which links the auto-curricula in natural evolution with MARL; our main focus is to emphasise creating diversity-aware auto-curricula.
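As a concrete reference point for the EC methods surveyed above, the sketch below implements the core selection rule of novelty search [50, 51] on a toy two-dimensional behaviour space. The population size, mutation scheme, and behaviour characterisation are illustrative assumptions, not details of the cited algorithms:

```python
# A self-contained toy sketch of novelty-search selection [50, 51].
import numpy as np

rng = np.random.default_rng(0)

def novelty(b, archive, k=5):
    """Mean distance from behaviour b to its k nearest archived behaviours."""
    if len(archive) == 0:
        return np.inf
    dists = np.linalg.norm(np.asarray(archive) - b, axis=1)
    return np.sort(dists)[:k].mean()

def behaviour(params):
    """Stand-in for rolling out a policy and recording its outcome."""
    return np.tanh(params)

archive, population = [], [rng.normal(size=2) for _ in range(20)]
for generation in range(50):
    # Score every individual purely by how novel its behaviour is.
    scored = [(novelty(behaviour(p), archive), p) for p in population]
    scored.sort(key=lambda s: -s[0])              # most novel first
    elites = [p for _, p in scored[:5]]
    archive.extend(behaviour(p) for p in elites)  # grow the archive
    # Next generation: mutate the most novel individuals.
    population = [e + 0.1 * rng.normal(size=2) for e in elites for _ in range(4)]
```

Adding a fitness term to the selection score would turn this into a simple quality-diversity method in the spirit of [73].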
4 OPEN CHALLENGES FOR DIVERSITY-AWARE AUTO-CURRICULA

Auto-curricula [5, 53, 72] provide a framework to automatically shape learning procedures for AI agents by consistently challenging them with new tasks that adapt to their capabilities. As the challenges generated by an auto-curriculum become increasingly diverse and complex over time, AI agents accumulate more diverse and effective skills. In fact, recent successes in training AIs that achieve super-human performance and acquire diverse behaviours on complex video games [5, 80, 89] provide strong justification for adopting auto-curricula as a diversifying learning protocol. However, in order to serve as a general framework to tackle more real-world problems beyond video games, the auto-curriculum technique still faces four open challenges.

Open Challenges of Designing Diversity-Aware Auto-Curricula
(1) How do we measure diversity in an auto-curriculum?
(2) How do we generate diversity-aware auto-curricula, especially in non-zero-sum settings?
(3) How do we shape an auto-curriculum to induce diverse yet effective behaviours?
(4) How do we deal with non-transitivity when learning in a diversity-aware auto-curriculum?

The first challenge is to define the correct objective to measure and promote diversity in the generated auto-curriculum. In the single-agent setting, diversity can be defined through a different reward function [50, 51], visiting a new state [27, 84, 98], or acquiring a new skill [28, 35]. However, in the MARL setting, with multiple players, each having a population of strategies, diversity should be defined in the joint policy space, considering all existing strategies of all agents. Yet, there is limited work that tries to quantify behavioural diversity at the population level. Although there are no straightforward answers, we believe one promising direction could be to leverage the determinantal point process [46] from quantum physics. This process measures diversity through a determinant value in a vector space, capturing the level of orthogonality among the input vectors, which can be represented by agents' joint-strategy profiles in terms of rewards [98] (see the sketch at the end of this section).

The second challenge involves the applicability to non-zero-sum games. The curricula in the examples of StarCraft [88] or Hide and Seek [5] are generated by competitive self-play from the players in zero-sum games. However, many real-world tasks, such as autonomous driving, are not zero-sum; in fact, they tend to be a mixed setting where cooperation outweighs competition. Therefore, creating an auto-curriculum in non-zero-sum games is an open problem. Interestingly, recent studies have shown that adapting in social dilemmas can create an effective auto-curriculum for the emergence of collective cooperation [39, 54, 71]. Through sequences of new challenges in addressing social dilemmas, agents eventually learn to achieve a socially-beneficial outcome. This resembles tasks such as discovering collective driving strategies that can mitigate congestion. For example, consider the case of solving Braess's paradox [14] (i.e., a typical example in modelling road networks and traffic flow), where agents progressively learn to sanction those who tend to over-exploit common resources, thus creating a new curriculum. Nonetheless, creating curricula for collective cooperation is still under-developed relative to auto-curricula induced by zero-sum games. Importantly, as pointed out by Leibo et al.
[53], auto-curricula induced by social dilemmas could be cursed by the "no-free-lunch" property: once you resolve a social dilemma in one place, another one crops up to take its place, a problem also known as higher-order social dilemmas [60, 66].

Thirdly, although RL techniques offer insight into how a desirable behaviour can be learned in a fixed environment, it is still unclear how complex and useful behaviours can best be developed while those behaviours are themselves influencing the environment. In fact, it is often the case that the more complex the behaviour, the less likely it is to be generated completely from scratch [53]. For example, it is highly unlikely that a world-champion-level policy is quickly generated by a curriculum when learning to act in complex environments. Moreover, this issue is only exacerbated when multiple agents (N ≫ 2) are involved in exploring the joint-strategy space. Fortunately, initial progress has been made by works on Policy Space Response Oracles (PSRO) [6, 48, 63], where different kinds of rectifiers have been proposed to shape the auto-curricula so that effective behaviours with high quality can be emphasised. For example, PSRO with a Nash rectifier [6] explores only strategies that have positive Nash support, so as to preserve strategy strength (a skeleton of the PSRO loop is sketched at the end of this section). Despite the empirical success in generating diverse yet effective strategies, PSRO methods only work in solving symmetric zero-sum games, a limitation highlighted in Open Challenge (2).

Lastly, results on game decomposition suggest that a game [6, 15] generally consists of both transitive and non-transitive structures. The topological structure of real-world tasks often resembles a spinning top if projected onto a 2D space [17], where the x-axis is the non-transitive dimension and the y-axis is the transitive dimension. The non-transitive part can harm the effectiveness of an auto-curriculum [6, 17]. For example, an auto-curriculum generated by self-play in zero-sum games [30, 77] could make a learning agent endlessly chase its own tail by creating the same tasks repetitively without breaking out. Things become even worse when the non-transitivity issue couples with the catastrophic-forgetting property of the model itself, as seen with deep neural networks [44]. As a result, agents may end up acquiring mediocre solutions or getting trapped in limited cycles within the strategy space [6, 8]. Memorising a library of all possible policies can help prevent cycling (e.g., three strategies in the toy example of Rock-Paper-Scissors), but for many real-world tasks, the dimension of the non-transitive cycles can be huge, and as a result, building such a library itself becomes an endless task [17].
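The following sketch illustrates the determinant-based diversity measure alluded to under the first challenge: each row of the payoff matrix is one strategy's expected-payoff profile against a set of opponents, and the determinant of the resulting Gram (kernel) matrix, in the spirit of determinantal point processes [46, 98], scores how orthogonal, and hence behaviourally distinct, the population is. The payoff numbers here are invented for illustration:

```python
import numpy as np

def population_diversity(payoffs):
    """Determinant of the Gram matrix of per-strategy payoff vectors.

    payoffs: (n_strategies, n_opponents) matrix; row i is strategy i's
    expected payoff against each opponent. Near-parallel rows (redundant
    behaviours) drive the determinant towards zero; orthogonal rows
    (behaviourally distinct strategies) increase it.
    """
    L = payoffs @ payoffs.T  # Gram (kernel) matrix of the population
    return np.linalg.det(L)

redundant = np.array([[1., 0., 1.],
                      [1., 0., 1.],   # a behavioural clone of row 0
                      [0., 1., 0.]])
diverse = np.array([[1., 0., 0.],
                    [0., 1., 0.],
                    [0., 0., 1.]])

print(population_diversity(redundant))  # 0: the clone adds no diversity
print(population_diversity(diverse))    # 1: maximally orthogonal profiles
```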
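Similarly, the PSRO control flow that such rectifiers reshape can be summarised in a few lines. In this hedged skeleton, `oracle`, `payoff`, and `meta_solver` are user-supplied placeholders (e.g., an RL best-response trainer, a pairwise evaluator, and a Nash solver over the empirical game); the rectification step itself is only indicated by a comment rather than implemented:

```python
import numpy as np

def psro(init_policy, oracle, payoff, meta_solver, iterations=10):
    """Skeleton of the Policy Space Response Oracles loop [48]."""
    population = [init_policy]
    for _ in range(iterations):
        # 1. Evaluate the empirical game among all strategies found so far.
        M = np.array([[payoff(pi, pj) for pj in population]
                      for pi in population])
        # 2. Solve the meta-game for a mixture sigma over the population.
        sigma = meta_solver(M)
        # A rectifier, e.g. the Nash rectifier of [6], would reshape sigma
        # here (such as keeping only opponents with positive Nash support).
        # 3. Add an (approximate) best response to the weighted mixture.
        population.append(oracle(sigma, population))
    return population
```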
5 AUTONOMOUS DRIVING AS A TEST BED

Creating a diversity-aware auto-curriculum for MARL, although still facing several open challenges, is a critical step for deploying successful multi-agent learning systems in real-world domains. For validating effectiveness, we believe AD simulation environments provide an excellent test bed.

AD technologies [4] enable a vehicle to sense its environment and move safely to a destination with little or no human intervention. Since the first DARPA competition in 2004, where the best-performing car completed only 7.3 miles of the 142-mile desert route, remarkable progress has been made. For example, the commercial company Waymo has driven more than 20 million miles on public roads under the SAE level-4 setting [41, 93].

In spite of such achievements, fully competent and natural interactions with other road users remain out of reach. Rather than embracing inter-driver interaction, current mainstream level-4 AD solutions restrict it. When encountering complex interactive scenarios, autonomous cars tend to slow down and wait for the situation to become simpler. They rarely cut in front of another car or force their way in at a merge, as human drivers routinely do. In California in 2018, 86% of crashes involving autonomous vehicles were attributable to the AD car's conservative behaviour [83], with 57% being rear-end collisions and 29% sideswipes of the AD car by other vehicles. Trial AD cars in Arizona and California are often targets of complaints for blocking other cars [79], excessive hard braking [23], hesitant highway merging, and inflexible pick-up/drop-off locations [24, 25]. While rarely illegal, the overly conservative driving style frustrates human drivers and can even pose road hazards. This also restricts AD technologies from being applied to special-purpose vehicles, such as ambulances or police cars, where aggressive driving behaviour may be required.

A key reason for this limitation is that existing AD simulators have limited capacity for modelling realistic interactions with diverse driving behaviours. Simulators are crucial for validation of the AI software controlling the autonomous vehicle (also called the ego vehicle). For validating the ego vehicle's interactive behaviour with social vehicles (i.e., other vehicles that share the same driving environment), we need diverse social agents capable of realistic and competent interaction. Conventional AD simulators [19, 81, 90] focus on modelling sensory inputs and control dynamics, rather than interaction. As a result, the behaviours of social vehicles end up being controlled by simple scripts or rule-based models (e.g., IDM for longitudinal control [86] or MOBIL for lateral control [87]), and the simulated interaction between ego and social vehicles falls short of the richness and diversity seen in the real world. AD companies heavily use replay of historical data collected from real-world trials to validate ego vehicle behaviour [2]. However, such a data-replay approach to simulation does not allow true interaction between vehicles, because the social vehicles are not controlled by intelligent agents but merely stick to historical trajectories [1, 91]. In short, how to create a population of diverse intelligent social agents that can be adopted in simulation to provide traffic with realistic interaction is still an open question.

We believe that a MARL approach powered by a diversity-aware auto-curriculum can help solve the problem of generating high-quality behaviour models that approach human-level sophistication.
Fundamentally, driving in a shared public space with other road users is a multi-agent problem wherein the behaviours of agents co-evolve. Co-evolved diverse and competent behaviours can allow AD simulation to encompass the sophisticated interactions seen among human drivers and thus alleviate the conservativeness of existing AD solutions. Crucially, solving this problem for AD will also require the key research challenges identified in Section 4 to be addressed.

For such a plan to work, we need an appropriate simulator that supports MARL auto-curricula for diversity, such as the SMARTS AD simulator [101]. Unlike other existing simulators, SMARTS is natively multi-agent. Social agents use the same APIs as the ego agent to control vehicles, and thus may use arbitrarily complex computational models, either rule-based or (MA)RL-driven. The SMARTS social agent zoo hosts a growing number of behaviour models to be used by simulated agents, regardless of their divergence in model architectures, observation/action spaces, or computational requirements. The key component that makes such computations possible is the built-in "bubble" mechanism. This mechanism defines a spatiotemporal region so that intensive computing is only activated inside the bubbles where fine-grained interaction is required, such as at unprotected left-turns, roundabouts, and highway double merges. These features make SMARTS highly suitable for multi-agent auto-curriculum studies.

To be more concrete, consider the ULTRA [26] benchmark suite built on top of SMARTS. It includes over 100,000 unprotected left-turn scenarios at different levels of complexity. These scenarios could seed an auto-curriculum that gradually increases the diversity and complexity of interaction by injecting trained behaviour models through available SMARTS mechanisms, allowing for not only more diverse agent behaviour, but also a curriculum composed of increasingly difficult scenarios (a sketch of such a difficulty-paced sampler is given at the end of this section). Through this, we expect interaction behaviours reminiscent of the
Pittsburgh Left to emerge in a fashion that could potentially reach the level of sophistication of human drivers. In turn, such emergent behavioural models can be used to support diverse interactions in AD simulations far beyond what has been possible through the rule-based models used so far.
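As a schematic of the kind of difficulty-paced sampling we have in mind, the sketch below maintains buckets of scenarios ordered by difficulty and promotes the learner once its recent success rate clears a threshold. The bucket names, scenario identifiers, and the 80% threshold are all assumptions for illustration, not part of ULTRA or SMARTS:

```python
import random

class ScenarioCurriculum:
    """Toy difficulty-paced sampler over pre-bucketed driving scenarios."""

    def __init__(self, buckets):
        self.buckets = buckets      # e.g. {"no-traffic": [...], ...}
        self.order = list(buckets)  # insertion order: easiest to hardest
        self.level = 0
        self.successes, self.trials = 0, 0

    def sample(self):
        # Mostly sample the current level, occasionally replay easier
        # levels so earlier skills are not forgotten.
        level = (self.level if random.random() < 0.8
                 else random.randrange(self.level + 1))
        return random.choice(self.buckets[self.order[level]])

    def report(self, success):
        self.successes += success
        self.trials += 1
        # Promote once the agent solves 80% of recent scenarios here.
        if self.trials >= 50 and self.successes / self.trials > 0.8:
            self.level = min(self.level + 1, len(self.order) - 1)
            self.successes, self.trials = 0, 0

curriculum = ScenarioCurriculum({
    "no-traffic": ["scenario-0001"],     # placeholder scenario identifiers
    "light-traffic": ["scenario-1017"],
    "dense-traffic": ["scenario-2402"],
})
```

Injecting progressively stronger trained behaviour models into the social-vehicle population would play the same role along the interaction-diversity axis.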
6 CONCLUSION

Despite the remarkable success shown on video games, we envision that developing a diversity-aware auto-curriculum framework is a critical step to ensure the success of MARL techniques on real-world problems. Specifically, we believe the pressing need for high-quality behaviour models in AD simulation is a great opportunity for the MARL community to make a unique contribution by (1) theoretically addressing the modelling challenges of behavioural diversity and (2) experimentally training generations of increasingly diverse and competent agents to provide interactions for real-world AD developments.
REFERENCES
[1] https://www.youtube.com/watch?v=Q0nGo2-y0xY&feature=youtu.be&t=
[2] https://www.theatlantic.com/technology/archive/2017/08/inside-waymos-secret-testing-and-simulation-facilities/537648/.
[3] Thomas Bäck, David B Fogel, and Zbigniew Michalewicz. 1997. Handbook of Evolutionary Computation. Release 97, 1 (1997), B1.
[4] Claudine Badue, Rânik Guidolini, Raphael Vivacqua Carneiro, Pedro Azevedo, Vinicius Brito Cardoso, Avelino Forechi, Luan Jesus, Rodrigo Berriel, Thiago Meireles Paixão, Filipe Mutz, et al. 2020. Self-driving cars: A survey. Expert Systems with Applications (2020), 113816.
[5] Bowen Baker, Ingmar Kanitscheider, Todor Markov, Yi Wu, Glenn Powell, Bob McGrew, and Igor Mordatch. 2019. Emergent tool use from multi-agent autocurricula. arXiv preprint arXiv:1909.07528 (2019).
[6] D Balduzzi, M Garnelo, Y Bachrach, W Czarnecki, J Pérolat, M Jaderberg, and T Graepel. 2019. Open-ended learning in symmetric zero-sum games. In ICML, Vol. 97. PMLR, 434–443.
[7] D Balduzzi, S Racaniere, J Martens, J Foerster, K Tuyls, and T Graepel. 2018. The Mechanics of n-Player Differentiable Games. In ICML, Vol. 80. JMLR.org, 363–372.
[8] David Balduzzi, Karl Tuyls, Julien Perolat, and Thore Graepel. 2018. Re-evaluating evaluation. In Advances in Neural Information Processing Systems. 3268–3279.
[9] Wolfgang Banzhaf, Bert Baumgaertner, Guillaume Beslon, René Doursat, James A Foster, Barry McMullin, Vinicius Veloso De Melo, Thomas Miconi, Lee Spector, Susan Stepney, et al. 2016. Defining and simulating open-ended novelty: requirements, guidelines, and challenges. Theory in Biosciences.
[10] Nolan Bard, Jakob N Foerster, Sarath Chandar, Neil Burch, Marc Lanctot, H Francis Song, Emilio Parisotto, Vincent Dumoulin, Subhodeep Moitra, Edward Hughes, et al. 2020. The Hanabi challenge: A new frontier for AI research. Artificial Intelligence 280 (2020), 103216.
[11] Andrew G Barto. 2013. Intrinsic motivation and reinforcement learning. In Intrinsically Motivated Learning in Natural and Artificial Systems. Springer, 17–47.
[12] Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. 2016. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems. 1471–1479.
[13] Joschka Boedecker and Minoru Asada. 2008. SimSpark – concepts and application in the RoboCup 3D soccer simulation league. Autonomous Robots (2008), 174–181.
[14] Dietrich Braess. 1968. Über ein Paradoxon aus der Verkehrsplanung. Unternehmensforschung 12, 1 (1968), 258–268.
[15] Ozan Candogan, Ishai Menache, Asuman Ozdaglar, and Pablo A Parrilo. 2011. Flows and decompositions of games: Harmonic and potential games. Mathematics of Operations Research 36, 3 (2011), 474–503.
[16] Antoine Cully, Jeff Clune, Danesh Tarapore, and Jean-Baptiste Mouret. 2015. Robots that can adapt like animals. Nature 521, 7553 (2015), 503–507.
[17] Wojciech M Czarnecki, Gauthier Gidel, Brendan Tracey, Karl Tuyls, Shayegan Omidshafiei, David Balduzzi, and Max Jaderberg. 2020. Real World Games Look Like Spinning Tops. arXiv (2020), arXiv–2004.
[18] Sudeep Dasari, Frederik Ebert, Stephen Tian, Suraj Nair, Bernadette Bucher, Karl Schmeckpeper, Siddharth Singh, Sergey Levine, and Chelsea Finn. 2019. RoboNet: Large-Scale Multi-Robot Learning. arXiv (2019), arXiv–1910.
[19] Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. 2017. CARLA: An open urban driving simulator. arXiv preprint arXiv:1711.03938 (2017).
[20] Gabriel Dulac-Arnold, Nir Levine, Daniel J Mankowitz, Jerry Li, Cosmin Paduraru, Sven Gowal, and Todd Hester. 2020. An empirical investigation of the challenges of real-world reinforcement learning. arXiv preprint arXiv:2003.11881 (2020).
[21] Gabriel Dulac-Arnold, Daniel Mankowitz, and Todd Hester. 2019. Challenges of real-world reinforcement learning. arXiv preprint arXiv:1904.12901 (2019).
[22] William H Durham. 1991. Coevolution: Genes, Culture, and Human Diversity. Stanford University Press.
[23] https://www.theinformation.com/.
[24] https://www.theinformation.com/articles/waymos-backseat-drivers-confidential-data-reveals-self-driving-taxi-hurdles.
[25] Amir Efrati. 2020. Waymo's Big Ambitions Slowed by Tech Trouble. https://bit.ly/31lKwgt.
[26] Mohamed Elsayed, Kimia Hassanzadeh, Nhat M. Nguyen, Montgomery Alban, Xiru Zhu, Daniel Graves, and Jun Luo. 2020. ULTRA: A reinforcement learning generalization benchmark for autonomous driving.
[27] Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. 2018. Diversity is All You Need: Learning Skills without a Reward Function. In International Conference on Learning Representations.
[28] Carlos Florensa, Yan Duan, and Pieter Abbeel. 2017. Stochastic Neural Networks for Hierarchical Reinforcement Learning. arXiv (2017), arXiv–1704.
[29] David B Fogel. 2006. Evolutionary Computation: Toward a New Philosophy of Machine Intelligence. Vol. 1. John Wiley & Sons.
[30] Michael E Gilpin. 1975. Limit cycles in competition communities. The American Naturalist.
[32] In ICML.
[33] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. 2018. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. In International Conference on Machine Learning. 1861–1870.
[34] Lei Han, Jiechao Xiong, Peng Sun, Xinghai Sun, Meng Fang, Qingwei Guo, Qiaobo Chen, Tengfei Shi, Hongsheng Yu, and Zhengyou Zhang. 2020. TStarBot-X: An Open-Sourced and Comprehensive Study for Efficient League Training in StarCraft II Full Game. arXiv preprint arXiv:2011.13729 (2020).
[35] Karol Hausman, Jost Tobias Springenberg, Ziyu Wang, Nicolas Heess, and Martin Riedmiller. 2018. Learning an embedding space for transferable robot skills. In International Conference on Learning Representations.
[36] Pablo Hernandez-Leal, Bilal Kartal, and Matthew E Taylor. 2019. A survey and critique of multiagent deep reinforcement learning. Autonomous Agents and Multi-Agent Systems 33, 6 (2019), 750–797.
[37] José Hernández-Orallo. 2017. The Measure of All Minds: Evaluating Natural and Artificial Intelligence. Cambridge University Press.
[38] John Henry Holland et al. 1992. Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence. MIT Press.
[39] Edward Hughes, Joel Z Leibo, Matthew Phillips, Karl Tuyls, Edgar Dueñez-Guzman, Antonio García Castañeda, Iain Dunning, Tina Zhu, Kevin McKee, Raphael Koster, et al. 2018. Inequity aversion improves cooperation in intertemporal social dilemmas. In Advances in Neural Information Processing Systems. 3326–3336.
[40] Ahmed Hussein, Mohamed Medhat Gaber, Eyad Elyan, and Chrisina Jayne. 2017. Imitation learning: A survey of learning methods. ACM Computing Surveys (CSUR) 50, 2 (2017), 1–35.
[41] SAE International. 2014. Automated Driving: Levels of Driving Automation are Defined in New SAE International Standard J3016.
[42] Max Jaderberg, Wojciech M Czarnecki, Iain Dunning, Luke Marris, Guy Lever, Antonio Garcia Castaneda, Charles Beattie, Neil C Rabinowitz, Ari S Morcos, Avraham Ruderman, et al. 2019. Human-level performance in 3D multiplayer games with population-based reinforcement learning. Science 364, 6443 (2019), 859–865.
[43] Benjamin Kerr, Margaret A Riley, Marcus W Feldman, and Brendan JM Bohannan. 2002. Local dispersal promotes biodiversity in a real-life game of rock-paper-scissors. Nature 418, 6894 (2002), 171–174.
[44] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 114, 13 (2017), 3521–3526.
[45] Petter N Kolm and Gordon Ritter. 2020. Modern Perspectives on Reinforcement Learning in Finance (September 6, 2019). The Journal of Machine Learning in Finance 1, 1 (2020).
[46] Alex Kulesza and Ben Taskar. 2012. Determinantal point processes for machine learning. arXiv preprint arXiv:1207.6083 (2012).
[47] Karol Kurach, Anton Raichuk, Piotr Stańczyk, Michał Zając, Olivier Bachem, Lasse Espeholt, Carlos Riquelme, Damien Vincent, Marcin Michalski, Olivier Bousquet, et al. 2020. Google Research Football: A novel reinforcement learning environment. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 4501–4510.
[48] Marc Lanctot, Vinicius Zambaldi, Audrunas Gruslys, Angeliki Lazaridou, Karl Tuyls, Julien Pérolat, David Silver, and Thore Graepel. 2017. A unified game-theoretic approach to multiagent reinforcement learning. In Advances in Neural Information Processing Systems. 4190–4203.
[49] Joonho Lee, Jemin Hwangbo, Lorenz Wellhausen, Vladlen Koltun, and Marco Hutter. 2020. Learning quadrupedal locomotion over challenging terrain. Science Robotics 5, 47 (2020).
[50] Joel Lehman and Kenneth O Stanley. 2008. Exploiting open-endedness to solve problems through the search for novelty. In ALIFE. 329–336.
[51] Joel Lehman and Kenneth O Stanley. 2011. Abandoning objectives: Evolution through the search for novelty alone. Evolutionary Computation 19, 2 (2011), 189–223.
[52] Joel Lehman and Kenneth O Stanley. 2011. Evolving a diversity of virtual creatures through novelty search and local competition. In Proceedings of the 13th Annual Conference on Genetic and Evolutionary Computation. 211–218.
[53] Joel Z Leibo, Edward Hughes, Marc Lanctot, and Thore Graepel. 2019. Autocurricula and the Emergence of Innovation from Social Interaction: A Manifesto for Multi-Agent Intelligence Research. arXiv (2019), arXiv–1903.
[54] Joel Z Leibo, Vinicius Zambaldi, Marc Lanctot, Janusz Marecki, and Thore Graepel. 2017. Multi-agent Reinforcement Learning in Sequential Social Dilemmas. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems. 464–473.
[55] Sergey Levine. 2018. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909 (2018).
[56] Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. 2020. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643 (2020).
[57] Yuanlong Li, Yonggang Wen, Dacheng Tao, and Kyle Guan. 2019. Transforming cooling optimization for green data center via deep reinforcement learning. IEEE Transactions on Cybernetics 50, 5 (2019), 2002–2013.
[58] Siqi Liu, Guy Lever, Josh Merel, Saran Tunyasuvunakool, Nicolas Heess, and Thore Graepel. 2018. Emergent Coordination Through Competition. In International Conference on Learning Representations.
[59] Jan Matas, Stephen James, and Andrew J Davison. 2018. Sim-to-real reinforcement learning for deformable object manipulation. arXiv preprint arXiv:1806.07851 (2018).
[60] Sarah Mathew. 2017. How the second-order free rider problem is solved in a small-scale society. American Economic Review.
[61] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013).
[62] Jean-Baptiste Mouret and Jeff Clune. 2015. Illuminating search spaces by mapping elites. arXiv (2015), arXiv–1504.
[63] Paul Muller, Shayegan Omidshafiei, Mark Rowland, Karl Tuyls, Julien Perolat, Siqi Liu, Daniel Hennes, Luke Marris, Marc Lanctot, Edward Hughes, et al. 2019. A Generalized Training Approach for Multiagent Learning. In International Conference on Learning Representations.
[64] Workshop of RL4RealLife. 2020. RL for Real Life 2020. https://sites.google.com/view/RL4RealLife.
[65] OpenAI, Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Józefowicz, Bob McGrew, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, Jonas Schneider, Szymon Sidor, Josh Tobin, Peter Welinder, Lilian Weng, and Wojciech Zaremba. 2018. Learning Dexterous In-Hand Manipulation. CoRR abs/1808.00177 (2018). arXiv:1808.00177 http://arxiv.org/abs/1808.00177.
[66] Elinor Ostrom. 2000. Collective action and the evolution of social norms. Journal of Economic Perspectives 14, 3 (2000), 137–158.
[67] Jakub Pachocki, Greg Brockman, Jonathan Raiman, Susan Zhang, Henrique Pondé, Jie Tang, Filip Wolski, Christy Dennison, Rafal Jozefowicz, Przemyslaw Debiak, et al. 2018. OpenAI Five. URL https://blog.openai.com/openai-five (2018).
[68] Jan Paredis. 1995. Coevolutionary computation. Artificial Life 2, 4 (1995), 355–375.
[69] Jack Parker-Holder, Aldo Pacchiano, Krzysztof Choromanski, and Stephen Roberts. 2020. Effective Diversity in Population-Based Reinforcement Learning. arXiv preprint arXiv:2002.00632 (2020).
[70] Peng Peng, Ying Wen, Yaodong Yang, Quan Yuan, Zhenkun Tang, Haitao Long, and Jun Wang. 2017. Multiagent bidirectionally-coordinated nets: Emergence of human-level coordination in learning to play StarCraft combat games. arXiv preprint arXiv:1703.10069 (2017).
[71] Julien Perolat, Joel Z Leibo, Vinicius Zambaldi, Charles Beattie, Karl Tuyls, and Thore Graepel. 2017. A multi-agent reinforcement learning model of common-pool resource appropriation. In Advances in Neural Information Processing Systems. 3643–3652.
[72] Rémy Portelas, Cédric Colas, Lilian Weng, Katja Hofmann, and Pierre-Yves Oudeyer. 2020. Automatic Curriculum Learning For Deep RL: A Short Survey. arXiv preprint arXiv:2003.04664 (2020).
[73] Justin K Pugh, Lisa B Soros, and Kenneth O Stanley. 2016. Quality diversity: A new frontier for evolutionary computation. Frontiers in Robotics and AI.
[74] Nature.
[75] Nature.
[76] Andrei A Rusu, Matej Večerík, Thomas Rothörl, Nicolas Heess, Razvan Pascanu, and Raia Hadsell. 2017. Sim-to-Real Robot Learning from Pixels with Progressive Nets. In Conference on Robot Learning. PMLR, 262–270.
[77] Spyridon Samothrakis, Simon Lucas, Thomas Philip Runarsson, and David Robles. 2012. Coevolving game-playing agents: Measuring performance and intransitivities. IEEE Transactions on Evolutionary Computation.
[78] https://valleydrivingschool.com/blog/main/6-times-you-can-proceed-on-a-red-light.
[79] Faiz Siddiqui. 2020. Some of the biggest critics of Waymo and other self-driving cars are the Silicon Valley residents who know how they work. https://wapo.st/30UAX6R.
[80] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529, 7587 (2016), 484–489.
[81] https://lgsvlsimulator.com/docs/.
[82] Russell K Standish. 2003. Open-ended artificial evolution. International Journal of Computational Intelligence and Applications 3, 02 (2003), 167–175.
[83] Jack Stewart. 2020. Humans Just Can't Stop Rear-Ending Self-Driving Cars – Let's Figure Out Why. https://bit.ly/3jfcfFs.
[84] Felipe Petroski Such, Vashisht Madhavan, Edoardo Conti, Joel Lehman, Kenneth O Stanley, and Jeff Clune. 2017. Deep neuroevolution: Genetic algorithms are a competitive alternative for training deep neural networks for reinforcement learning. arXiv preprint arXiv:1712.06567 (2017).
[85] Richard S Sutton and Andrew G Barto. 2018. Reinforcement Learning: An Introduction. MIT Press.
[86] Martin Treiber, Ansgar Hennecke, and Dirk Helbing. 2000. Congested traffic states in empirical observations and microscopic simulations. Physical Review E 62, 2 (2000), 1805.
[87] Martin Treiber and Arne Kesting. 2009. Modeling lane-changing decisions with MOBIL. In Traffic and Granular Flow'07. Springer, 211–221.
[88] Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. 2019. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575, 7782 (2019), 350–354.
[89] Nature.
[90] https://www.mscsoftware.com/product/virtual-test-drive.
[91] Kyle Vogt. 2020. The Disengagement Myth. https://medium.com/cruise/the-disengagement-myth-1b5cbdf8e239.
[92] Ziyu Wang, Josh S Merel, Scott E Reed, Nando de Freitas, Gregory Wayne, and Nicolas Heess. 2017. Robust imitation of diverse behaviors. Advances in Neural Information Processing Systems 30 (2017), 5320–5329.
[93] Waymo. 2020. Waymo Safety Report. https://bit.ly/2T4vRl4.
[94] Wikipedia. 2020. Pittsburgh Left. https://en.wikipedia.org/wiki/Pittsburgh_left.
[95] ICML Workshop. 2019. Reinforcement Learning for Real Life. https://icml.cc/Conferences/2019/ScheduleMultitrack?event=
[96] https://sites.google.com/view/neurips2020rwrl.
[97] Yaodong Yang and Jun Wang. 2020. An Overview of Multi-Agent Reinforcement Learning from Game Theoretical Perspective. arXiv preprint arXiv:2011.00583 (2020).
[98] Yaodong Yang, Ying Wen, Lihuan Chen, Jun Wang, Kun Shao, David Mguni, and Weinan Zhang. 2020. Multi-Agent Determinantal Q-Learning. (2020).
[99] Deheng Ye, Guibin Chen, Wen Zhang, Sheng Chen, Bo Yuan, Bo Liu, Jia Chen, Zhao Liu, Fuhao Qiu, Hongsheng Yu, et al. 2020. Towards Playing Full MOBA Games with Deep Reinforcement Learning. arXiv e-prints (2020), arXiv–2011.
[100] Wei Zhan, Liting Sun, Di Wang, Haojie Shi, Aubrey Clausse, Maximilian Naumann, Julius Kummerle, Hendrik Konigshof, Christoph Stiller, Arnaud de La Fortelle, et al. 2019. INTERACTION dataset: An international, adversarial and cooperative motion dataset in interactive driving scenarios with semantic maps. arXiv preprint arXiv:1910.03088 (2019).
[101] Ming Zhou, Jun Luo, Julian Villela, Yaodong Yang, David Rusu, Jiayu Miao, Weinan Zhang, Montgomery Alban, Iman Fadakar, Zheng Chen, et al. 2020. SMARTS: Scalable Multi-Agent Reinforcement Learning Training School for Autonomous Driving. arXiv preprint arXiv:2010.09776 (2020).
30 (2017), 5320–5329.[93] Waymo. 2020. Waymo Safety Report. https://bit . ly/2T4vRl4.[94] Wikipedia. 2020. Pittsburgh Left. https://en . wikipedia . org/wiki/Pittsburgh_left.[95] ICML Workshop. 2019. Reinforcement Learning for Real Life. https://icml . cc/Conferences/2019/ScheduleMultitrack?event = . google . com/view/neurips2020rwrl.[97] Yaodong Yang and Jun Wang. 2020. An Overview of Multi-Agent ReinforcementLearning from Game Theoretical Perspective. arXiv preprint arXiv:2011.00583 (2020).[98] Yaodong Yang, Ying Wen, Lihuan Chen, Jun Wang, Kun Shao, David Mguni,and Weinan Zhang. 2020. Multi-Agent Determinantal Q-Learning. (2020).[99] Deheng Ye, Guibin Chen, Wen Zhang, Sheng Chen, Bo Yuan, Bo Liu, Jia Chen,Zhao Liu, Fuhao Qiu, Hongsheng Yu, et al. 2020. Towards Playing Full MOBAGames with Deep Reinforcement Learning. arXiv e-prints (2020), arXiv–2011.[100] Wei Zhan, Liting Sun, Di Wang, Haojie Shi, Aubrey Clausse, Maximilian Nau-mann, Julius Kummerle, Hendrik Konigshof, Christoph Stiller, Arnaud deLa Fortelle, et al. 2019. Interaction dataset: An international, adversarial andcooperative motion dataset in interactive driving scenarios with semantic maps. arXiv preprint arXiv:1910.03088 (2019).[101] Ming Zhou, Jun Luo, Julian Villela, Yaodong Yang, David Rusu, Jiayu Miao,Weinan Zhang, Montgomery Alban, Iman Fadakar, Zheng Chen, et al. 2020.SMARTS: Scalable Multi-Agent Reinforcement Learning Training School forAutonomous Driving. arXiv preprint arXiv:2010.09776arXiv preprint arXiv:2010.09776