Fever Basketball: A Complex, Flexible, and Asynchronized Sports Game Environment for Multi-agent Reinforcement Learning
Hangtian Jia, Yujing Hu, Yingfeng Chen, Chunxu Ren, Tangjie Lv, Changjie Fan, Chongjie Zhang
Netease Fuxi AI Lab, Tsinghua University
{jiahangtian, huyujing, chenyingfeng1, renchunxu, hzlvtangjie, fanchangjie}@corp.netease.com, [email protected]
Abstract
The development of deep reinforcement learning (DRL) has benefited from the emergence of a variety of game environments where new challenging problems are proposed and new algorithms can be tested safely and quickly, such as board games, RTS, FPS, and MOBA games. However, many existing environments lack complexity and flexibility, and assume that actions are executed synchronously in multi-agent settings, which makes them less valuable. We introduce the Fever Basketball game, a novel reinforcement learning environment where agents are trained to play basketball. It is a complex and challenging environment that supports multiple characters, multiple positions, and both single-agent and multi-agent player control modes. In addition, to better simulate real-world basketball games, the execution time of actions differs among players, which makes Fever Basketball a novel asynchronized environment. We evaluate commonly used multi-agent algorithms of both independent learners and joint-action learners in three game scenarios with varying difficulties, and heuristically propose two baseline methods to diminish the extra non-stationarity brought by asynchronism in Fever Basketball Benchmarks. Besides, we propose an integrated curricula training (ICT) framework to better handle Fever Basketball problems, which includes several game-rule based cascading curricula learners and a coordination curricula switcher focusing on enhancing coordination within the team. The results show that the game remains challenging and can be used as a benchmark environment for studies of long-time horizons, sparse rewards, credit assignment, and non-stationarity in multi-agent settings.

Introduction
Deep reinforcement learning (DRL) has achieved great success in many domains, including games (Mnih et al. 2013; Silver et al. 2017; Lample and Chaplot 2017), recommendation systems (Munemasa et al. 2018), robot control (Haarnoja et al. 2018), and autonomous driving (Pan et al. 2017). Among all these domains, games are one of the most active and popular settings, because they are simulations of reality and have a relatively low cost of trial and error. Besides, games can be run in parallel to collect experience for training, which is another advantage that facilitates the success of DRL. A wide variety of game environments are now available, for example, Atari games (Mnih et al. 2013, 2015), board games (Silver et al. 2016, 2017), card games (Heinrich and Silver 2016), first-person shooting (FPS) games (Lample and Chaplot 2017), multiplayer online battle arena (MOBA) games (OpenAI 2018; Jiang, Ekwedike, and Liu 2018), and real-time strategy (RTS) games (Vinyals et al. 2019; Liu et al. 2019). However, the lack of complexity and flexibility, and the assumption of synchronized actions in many existing environments that support multi-agent training, remain potential barriers to the further development of RL.

As a typical sports game (SPG), Fever Basketball simulates the basic elements of basketball games (Figure 1), which is challenging for modern RL algorithms. First of all, the long time horizon and sparse rewards remain issues for most DRL methods. In basketball, the agents normally do not get a reward (goal in or not) until scoring, which may require a long sequence of consecutive events such as dribbling and passing the ball to teammates to break through the defense of the opponents. Second, the whole basketball game is a combination of many challenging sub-tasks based on game rules, for example, the offense sub-task, the defense sub-task, and the sub-task of fighting for ball possession when the ball is free. Third, it is a multi-agent system that requires teammates to cooperate well to win the game. Fourth, players in reality have different reaction times, which makes decision making asynchronized within the same team. Moreover, the different characters and positions classified according to players' capabilities or tactical strategies, such as center (C), power forward (PF), small forward (SF), point guard (PG), and shooting guard (SG), add extra stochasticity to the game. All of these reasons make basketball a challenging SPG.

In this paper, we propose the Fever Basketball Environment, a novel open-source asynchronized reinforcement learning environment where agents can learn to play one of the world's most popular sports: basketball. Building upon a commercial basketball game engine, our main contributions are as follows:
1) We provide the Fever Basketball Environment, an advanced and challenging basketball simulator that supports all the major basketball rules.
2) We provide the asynchronized Fever Basketball game clients to better simulate reality in multi-agent settings.
3) We provide different training curricula (such as offense, defense, freeball, ballclear) for handling the whole basketball task.
4) We provide various training scenarios (such as 1v1, 2v2, 3v3), multiple characters, and tasks of varying difficulties that can be used to compare different algorithms.
5) We evaluate common algorithms for multi-agent scenarios and propose two heuristic methods for handling the asynchronism for joint-action learners.
6) We propose an integrated curricula training (ICT) framework that reaches up to a 70% win rate during a 300-day online evaluation with human players.

Figure 1: Fever Basketball is a basketball simulator that supports major basketball rules and scenarios such as jump ball, offense, defense, passing ball, dunk, rebound, etc.
Motivation and Related Works
The development of algorithms benefits from the emergence of new challenging problems, and game environments nowadays serve as the fundamental testbeds where the reinforcement learning community tries out its ideas. However, most existing environments have certain deficiencies, which the Fever Basketball game can make up for:
Low task complexity.
As deep reinforcement learning algorithms become more sophisticated, existing environments with low task complexity and randomness become less challenging, and the benchmarks based on them become less informative (Juliani et al. 2018). For example, the canonical CartPole and MountainCar tasks (Sutton and Barto 2018) are too simple to distinguish the performance of different algorithms. Meanwhile, most of the agents of Atari games in the commonly used Arcade Learning Environment (Bellemare et al. 2013) have been trained to super-human level (Badia et al. 2020). The same applies to DeepMind Lab (Beattie et al. 2016) and Procgen (Cobbe et al. 2019): the former consists of several relatively simple first-person navigation maze environments, and the latter is a suite of several game-like environments mainly designed to benchmark generalization in reinforcement learning. Besides, games in OpenAI Retro (Nichol et al. 2018) such as Sonic The Hedgehog can be easily solved by existing algorithms (Schulman et al. 2017; Hessel et al. 2018).
Fixed number of agents.
Many existing environments only support controlling a fixed number of agents, and most of them are single-agent reinforcement learning (SARL) problems. For example, the Hard Eight (Paine et al. 2019) environment focuses on training a single agent to solve hard exploration problems. The Obstacle Tower Environment (Juliani et al. 2019) also only supports the training of a single agent to solve puzzles and make plans on multiple floors. The Atari games and OpenAI Retro games are likewise environments that support single-agent training. However, most real-world scenarios involve more than one agent, such as basketball matches and autonomous driving cars (Shalev-Shwartz, Shammah, and Shashua 2016), which can be naturally modeled as multi-agent systems (MAS) in a centralized or distributed manner. In addition to the challenges in SARL such as long time horizons and sparse rewards, MARL brings extra challenges such as non-stationarity (Papoudakis et al. 2019), credit assignment (Nguyen, Kumar, and Lau 2018), and scalability as the number of agents increases (Hernandez-Leal, Kartal, and Taylor 2019; Zhang, Yang, and Başar 2019). Thus, platforms with flexible settings for the number of controlled agents become both important and valuable for relevant studies.
Synchronized actions.
The common SARL paradigm assumes that the environment will not change between when the environment state is observed and when the action is executed. The system is treated sequentially: the agent observes a state, freezes time while computing and applying the action, and then unfreezes time. The same setting normally applies to MARL, where multiple agents compute actions together and then execute them at the same time step. However, in the real world it is usually not the case that the whole system's decision making and execution processes are synchronized. This results from agents' different reaction times and diverse action execution times under various situations, and it makes the whole MAS work asynchronously. For example, in multi-robot control, robots can behave asynchronously when executing different actions due to hardware limitations (Xiao et al. 2020).
Other related works.
There are also many other open-source environments focusing on certain game types and specific research fields. For example, SMAC (Vinyals et al. 2017), a representative of the challenging RTS games, has been used as a test-bed for MARL algorithms, despite the fact that its additive and dense reward settings could make it less challenging (Foerster et al. 2017; Rashid et al. 2018; Vinyals et al. 2019). The DeepMind Control Suite (Tassa et al. 2018), AI2Thor (Kolve et al. 2017), Habitat (Savva et al. 2019), and PyBullet (Coumans and Bai 2016) environments are all related to continuous control tasks. The platform most similar to ours is Google Research Football (Kurach et al. 2019), which offers another kind of SPG: football. However, it is a synchronized game platform, and there are many differences between football and basketball in terms of game settings such as game rules, number of players, and tactics.

Figure 2: The asynchronized actions in Fever Basketball. (a) The average execution time of different action categories (* represents action units with different directions). (b) Illustration of the action asynchronism within the team.
Fever Basketball Games
Fever Basketball is an online basketball game which simulates a half-court (length = 11.4 meters, width = 15 meters) basketball match between two teams*. The game includes the most common basketball aspects, such as jump ball, dribble, three-pointer, dunk, and rebound (see Figure 1 for a few examples). The objective of each team is to score as much as possible to win the match within a limited time.

* https://github.com/FuxiRL/FeverBasketball

Supported basketball elements.
The game offers more than 30 characters (Charles, Alex, Steven, etc.) of different positions (C, PF, SF, PG, SG) with various attributes and skills to choose from before a match, which largely enriches both the randomness and the challenges of the game. At the beginning of each match, one player from each team does the jump ball. The team which gets the ball becomes the offense team and the other team becomes the defense team. The player holding the ball in the offense team is in the attack state and the other two players are in the assist state. Players in the offense team can use offense actions (such as screen, fast break, and jockeying for position) and shooting actions (such as jump shot, layup, and dunk) to score. Meanwhile, players in the defense team should try their best to prevent the offense team from scoring by applying defense strategies like one-on-one checks, steals, rejections, and so on. Once the ball is out of the hands of the possessing player, such as after a shot or a rebound, all of the players are in the freeball state. At such a moment, if players of the defense team manage to fetch the ball, they need to go through an attack-defense switch process named ballclear to prepare for offense by dribbling out of the three-point line. Once the offense team scores, ball possession is handed to the opposite team. A typical match lasts for three minutes (with an average FPS of 60) except for overtime. A shot clock violation (20 seconds) is punished by handing ball possession to the opposite team.

Supported player control modes.
Fever Basketball offers a convenient way to control game players by modifying corresponding keywords when launching the game clients. The number of players within each team can be chosen from {1, 2, 3}, which covers both single-agent and multi-agent training scenarios with increasing complexity and difficulty in a curriculum manner. Meanwhile, the position of each player can be chosen from {C, PF, SF, PG, SG} and the game characters can be switched freely among the players we provide. Furthermore, the game provides three control modes. The first one is the Bot mode, where the agents can be trained against the built-in rule-based bots, whose difficulty levels can be chosen from {easy, medium, hard} with different reaction times and shooting rates. The second one is the SelfPlay mode, which allows the training of both teams through self-play. The third is the Human mode, where a human player can control a specific position of the home team and fight against the built-in bots or pre-trained agents.
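To make these options concrete, the hypothetical snippet below sketches how a client launch configuration might be assembled. The field names (num_players, positions, control_mode, bot_difficulty) are illustrative placeholders only, not the environment's actual keywords, which are documented in the repository.

```python
# Hypothetical launch configuration for a Fever Basketball client.
# Field names are illustrative only; consult the repository for the
# actual keywords expected by the game client.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ClientConfig:
    num_players: int = 3                 # players per team, chosen from {1, 2, 3}
    positions: List[str] = field(default_factory=lambda: ["C", "PG", "SG"])
    control_mode: str = "Bot"            # one of {"Bot", "SelfPlay", "Human"}
    bot_difficulty: str = "medium"       # one of {"easy", "medium", "hard"}

    def validate(self) -> None:
        assert self.num_players in (1, 2, 3)
        assert len(self.positions) == self.num_players
        assert all(p in {"C", "PF", "SF", "PG", "SG"} for p in self.positions)
        assert self.control_mode in {"Bot", "SelfPlay", "Human"}
        assert self.bot_difficulty in {"easy", "medium", "hard"}

# Example: a 2v2 self-play setup.
config = ClientConfig(num_players=2, positions=["PG", "SG"], control_mode="SelfPlay")
config.validate()
```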
Game states and representations.
Raw game states in Fever Basketball are data packages received from the game clients, which include information like the current scene name (attack, assist, defense, freeball, ballclear), general game information (such as attack remaining time and scores), both teams' information (such as player type, player position, facing angle, and shoot rate), ball information (such as ball position, ball velocity, owned player, and owned team), and the results of the last action. Please see the detailed description in the Appendix. Besides, we also provide a vector-based representation wrapper class corresponding to each game scene as well as some useful functions like distance calculation between two coordinates, based on which researchers can easily define their own state representations for training. We collect transition experience from 20 parallel game clients.
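As a rough illustration of how such raw packages could be turned into a vector representation, the sketch below flattens a few of the fields described above into a fixed-length feature vector. The dictionary keys and the distance helper are assumptions for illustration, not the wrapper classes shipped with the environment.

```python
import math
from typing import Dict, List

def distance(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Euclidean distance between two court coordinates (x, z)."""
    return math.hypot(a["x"] - b["x"], a["z"] - b["z"])

def encode_attack_state(pkg: Dict) -> List[float]:
    """Flatten a (hypothetical) raw state package of the attack scene
    into a fixed-length feature vector for a DRL policy."""
    me, ball, basket = pkg["me"], pkg["ball"], pkg["basket"]
    features = [
        pkg["attack_remain_time"] / 20.0,        # normalized shot clock
        me["x"] / 15.0, me["z"] / 11.4,          # own position on the half court
        me["shoot_rate"],                        # shooting percentage attribute
        ball["x"] / 15.0, ball["z"] / 11.4,      # ball position
        distance(me, basket) / 15.0,             # distance to the basket
    ]
    for mate in pkg["teammates"]:                # teammate positions and distances
        features += [mate["x"] / 15.0, mate["z"] / 11.4, distance(me, mate) / 15.0]
    for opp in pkg["opponents"]:                 # defender positions and distances
        features += [opp["x"] / 15.0, opp["z"] / 11.4, distance(me, opp) / 15.0]
    return features
```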
Asynchronized game actions.
To better simulate the real-world basketball game, one of the key features of Fever Basketball is that a player's primitive actions have different execution times (see Figure 2(a)), which makes the actions of the players within the same team asynchronized. For example, consider the offense scenario depicted in Figure 2(b), where the PG is dribbling the ball while the SG requests the ball and the C keeps moving. After receiving the ball passed by the PG, the SG makes a three-point shot that costs two time steps of execution, while the C keeps performing the OffenceScreen action (i.e., pick-and-roll) over the same interval. Unlike common MARL environments, which assume the agents' actions are synchronized, the asynchronism in Fever Basketball brings extra challenges for current MARL algorithms, especially for the ones with centralized training (Sunehag et al. 2018; Rashid et al. 2018). Each agent faces a much more non-stationary environment since the other agents' ongoing actions may have a large impact on the state transitions observed by the agent. Besides, the numbers of actions for different positions and different game scenes are listed in Table 1 (3v3 mode). A detailed description of these actions can be found in the Appendix.

Table 1: Number of actions in Fever Basketball scenes.
Scene Type   C    PF   SF   PG   SG
Attack       29   30   43   35   42
Defense      19   19   19   27   27
Freeball     10   10   9    9    9
Ballclear    22   22   25   25   25
Assist       11   11   11   11   11
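To make the effect of these heterogeneous execution times concrete, the toy simulation below (written for illustration only; the action names and durations are placeholders loosely based on Figure 2(a)) shows how each agent only chooses a new action once its previous one has finished, so joint decisions rarely align on the same frame.

```python
import random

# Hypothetical per-action execution times (in environment steps); the real
# values differ per action and direction as shown in Figure 2(a).
EXEC_TIME = {"Move": 1, "Pass": 2, "OffenceScreen": 3, "ThreePointShot": 2}

def simulate(num_steps: int = 10, seed: int = 0) -> None:
    random.seed(seed)
    busy_until = {"PG": 0, "SG": 0, "C": 0}   # step at which each player is free again
    for t in range(num_steps):
        for player, free_at in busy_until.items():
            if t >= free_at:                  # only free players make a decision
                action = random.choice(list(EXEC_TIME))
                busy_until[player] = t + EXEC_TIME[action]
                print(f"step {t}: {player} starts {action} "
                      f"(finishes at step {busy_until[player]})")
        # players still executing an action simply continue it during this step

simulate()
```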
Game rewards settings.
The reward settings in Fever Basketball are also highly flexible and can be easily customized by researchers in terms of both the shaping rewards and the game rewards. We currently offer a set of game rewards related to the corresponding game scenes. To be specific, in the offense scene (attack & assist), the agent is rewarded with 2 or 3 if the team scores, while being punished with -1 if the ball is blocked, stolen, or lost, or if time is up. Reward settings in the defense scene are the opposite of the offense scene. For the freeball scene, the agent is rewarded with 1 if it gets possession of the ball and is punished with -1 if it loses the ball (for example, the opponent gets the ball or time is up). In the ballclear scene, the agent is rewarded with 1 if it gets the ball out of the three-point line successfully and is punished similarly as in the offense scene.
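A minimal sketch of this scene-dependent reward scheme is given below, assuming hypothetical event flags in the per-step information. The key names are placeholders of our own, but the reward values mirror the settings described above.

```python
def scene_reward(scene: str, info: dict) -> float:
    """Return the game reward for one transition, following the per-scene
    settings described above. The keys of `info` are illustrative only."""
    if scene in ("attack", "assist"):                       # offense scenes
        if info.get("scored"):
            return float(info["points"])                    # +2 or +3 on a goal
        if any(info.get(k) for k in ("blocked", "stolen", "lost", "time_up")):
            return -1.0
    elif scene == "defense":                                # mirror of the offense scene
        if info.get("opponent_scored"):
            return -float(info["points"])
        if any(info.get(k) for k in ("blocked", "stolen", "time_up")):
            return 1.0
    elif scene == "freeball":
        if info.get("got_ball"):
            return 1.0
        if info.get("opponent_got_ball") or info.get("time_up"):
            return -1.0
    elif scene == "ballclear":
        if info.get("cleared"):                             # dribbled out of the 3-point line
            return 1.0
        if any(info.get(k) for k in ("blocked", "stolen", "lost", "time_up")):
            return -1.0
    return 0.0
```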
Figure 3: Proposed experience collection methods for JAL. (a) Exp-Mask (Ms). (b) Exp-Splice (Sp).
Fever Basketball Benchmarks
Fever Basketball is a complex, flexible, and highly customizable game environment which allows researchers to try new ideas and solve problems in basketball games. Compared with the single-agent mode, the multi-agent mode is more challenging and includes both competition and collaboration scenarios. In addition, it brings new challenges such as asynchronized actions. To evaluate the performance of existing algorithms and handle the asynchronism problems in Fever Basketball, we propose some heuristic methods and provide a set of benchmarks regarding the 3v3 tasks. In all of these tasks, the goal of the trained agents is to score as many points as possible in a limited amount of time (3 minutes per round) against the built-in bots, whose difficulty levels range from easy and medium to hard. In addition, to facilitate fair comparisons, we use the SG position for both teams.
Methods
Generally speaking, there are two major learning paradigms in MARL, namely the joint-action learner (JAL) and the independent learner (IL) (Claus and Boutilier 1998). In cooperative settings, JAL, which also includes the centralised training with decentralised execution paradigm, assumes all the agents' actions can be observed, as in VDN (Sunehag et al. 2018) and QMIX (Rashid et al. 2018). In contrast, an IL only relies on its own actions, and coordination can be achieved through heuristics based on optimistic and average rewards, as in HYQ (Matignon, Laurent, and Le Fort-Piat 2007) and EXCEL (Hu et al. 2019).

The modeling of asynchronized actions in Fever Basketball differs from that of MacDec-POMDPs (Amato et al. 2019; Xiao, Hoffman, and Amato 2020), where options are proposed and applied to dynamic programming problems and model-free robot-control areas, respectively. Since the asynchronized actions in Fever Basketball are still primitive actions and the decision making still focuses on a low level of granularity, proper methods of collecting transitions remain critical. In terms of experience collection, IL algorithms have advantages over JAL algorithms because their learning processes can be handled independently and do not rely on collecting other agents' on-going actions. However, it is a problem to find an appropriate time to collect the joint-action transitions for JAL. As illustrated in Figure 3, we propose two methods to collect the joint-action experience.
Figure 4: Benchmark experiments for Fever Basketball. Average score gap per round of IQL, HYQ, EXCEL, VDN_Ms, QMIX_Ms, VDN_Sp, and QMIX_Sp against the easy, medium, and hard built-in bots in the Full Game and Divide and Conquer settings.

The first method is experience-mask (EXP-Ms), which masks the on-going actions out and regards them as Idle when collecting joint transitions at a certain time step.
For example, denote $o^P_t$, $s^P_t$, $a^P_t$, $r_t$, and $d_t$ as the global observation, local observation, action, global reward, and done information of player $P$ at time step $t$, respectively, and let $t'$ denote the next time step at which the transition is completed. The global experience in the shaded area of Figure 2(a) would be:

$\mathrm{EXP}_{Ms} = [\, o^{C}_{t},\ (s^{SG}_{t}, s^{C}_{t}, s^{PG}_{t}),\ (\mathit{Idle},\ a^{C}_{t},\ a^{PG}_{t}),\ r_{t},\ o^{C}_{t'},\ (s^{SG}_{t'}, s^{C}_{t'}, s^{PG}_{t'}),\ d_{t'}\,]$

The second method is experience-splice (EXP-Sp), which means that we collect the joint transition when all the players have finished their recent on-going actions, and then splice the experience to form final transitions based on the global states observed at the time steps when the agents started to execute their actions. As illustrated in Figure 2(b), with $t_{SG}$, $t_{C}$, and $t_{PG}$ denoting the steps at which the SG, C, and PG started their most recent actions and $t'$ the step at which all of them have finished, the joint transition experience in the shaded area would be:

$\mathrm{EXP}_{Sp} = [\, o^{SG}_{t_{SG}},\ (s^{SG}_{t_{SG}}, s^{C}_{t_{SG}}, s^{PG}_{t_{SG}}),\ (a^{SG}_{t_{SG}}, a^{C}_{t_{C}}, a^{PG}_{t_{PG}}),\ r_{t_{SG}\,\&\,t'},\ o^{SG}_{t'},\ (s^{SG}_{t'}, s^{C}_{t'}, s^{PG}_{t'}),\ d_{t_{SG}\,\&\,t'}\,]$
$\mathrm{EXP}_{Sp} = [\, o^{C}_{t_{C}},\ (s^{SG}_{t_{C}}, s^{C}_{t_{C}}, s^{PG}_{t_{C}}),\ (a^{SG}_{t_{SG}}, a^{C}_{t_{C}}, a^{PG}_{t_{PG}}),\ r_{t_{C}\,\&\,t'},\ o^{C}_{t'},\ (s^{SG}_{t'}, s^{C}_{t'}, s^{PG}_{t'}),\ d_{t_{C}\,\&\,t'}\,]$
$\mathrm{EXP}_{Sp} = [\, o^{PG}_{t_{PG}},\ (s^{SG}_{t_{PG}}, s^{C}_{t_{PG}}, s^{PG}_{t_{PG}}),\ (a^{SG}_{t_{SG}}, a^{C}_{t_{C}}, a^{PG}_{t_{PG}}),\ r_{t_{PG}\,\&\,t'},\ o^{PG}_{t'},\ (s^{SG}_{t'}, s^{C}_{t'}, s^{PG}_{t'}),\ d_{t_{PG}\,\&\,t'}\,]$

By learning from the reconstructed experience, we expect these two heuristic methods to help the joint-action learners acquire a perception of the execution time of the corresponding actions and thereby facilitate better coordination.
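To make the two collection schemes concrete, here is a simplified sketch under the assumption that each agent logs (observation, action, start step, end step) for every executed action and that per-step team rewards are available. The helper and field names are ours, not part of the environment.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class ActionRecord:
    obs: list          # the agent's observation when the action was chosen
    action: int
    t_start: int       # step at which execution began
    t_end: int         # step at which execution finished

IDLE = -1  # placeholder id used to mask on-going actions

def exp_mask(records: Dict[str, List[ActionRecord]], obs_at, rewards, t: int) -> dict:
    """EXP-Ms: build a joint transition at step t, masking any action that is
    still executing (started before t) as Idle. `obs_at(agent, t)` returns observations."""
    joint_action = {}
    for agent, recs in records.items():
        current = next((r for r in recs if r.t_start <= t < r.t_end), None)
        started_now = current is not None and current.t_start == t
        joint_action[agent] = current.action if started_now else IDLE
    return {
        "obs": {a: obs_at(a, t) for a in records},
        "actions": joint_action,
        "reward": rewards[t],
        "next_obs": {a: obs_at(a, t + 1) for a in records},
    }

def exp_splice(records: Dict[str, List[ActionRecord]], obs_at, rewards, t_done: int) -> List[dict]:
    """EXP-Sp: once every agent has finished its on-going action (at t_done),
    splice one transition per agent, starting from the step at which that agent
    began its most recent action and accumulating the rewards in between."""
    transitions = []
    for agent, recs in records.items():
        last = recs[-1]                       # the agent's most recent finished action
        transitions.append({
            "obs": {a: obs_at(a, last.t_start) for a in records},
            "actions": {a: r[-1].action for a, r in records.items()},
            "reward": sum(rewards[last.t_start:t_done]),
            "next_obs": {a: obs_at(a, t_done) for a in records},
        })
    return transitions
```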
Experimental results
In this section, we provide benchmark results for both IL algorithms (IQL (Mnih et al. 2015), HYQ (Matignon, Laurent, and Le Fort-Piat 2007), and EXCEL (Hu et al. 2019)) and JAL algorithms (VDN (Sunehag et al. 2018) and QMIX (Rashid et al. 2018)) with the EXP-Ms and EXP-Sp methods. We evaluate these algorithms in both the Full Game setting and the Divide and Conquer setting. In the Full Game setting, the learners need to handle all the sub-tasks (i.e., offense (attack & assist), defense, freeball, and ballclear) through one model, with unavailable actions masked out. In the Divide and Conquer setting, each of these sub-tasks is allocated to a corresponding learner, which decreases the difficulty of training. The technical details of the training architectures and hyperparameters can be found in the Appendix.

The experimental results of the Fever Basketball Benchmarks, which are averaged over 10 game clients after training for 100 rounds, are shown in Figure 4. It can be found that the Full Game setting is much more challenging than the Divide and Conquer setting, as all of the algorithms fail to defeat the built-in bots in it. Besides, the hard bots are more difficult to defeat than the medium and easy ones. The independent learners generally outperform the joint-action learners even though we try to eliminate the action asynchronism within the team by using the EXP-Ms and EXP-Sp methods. In addition, the EXP-Ms method performs relatively better than the EXP-Sp method, which might result from the latter's neglect of some agents' short transitions when generating the global experience, such as player C's short transitions in Figure 3(b). The results indicate that the asynchronism problem is not well solved and is worth further study.

The Integrated Curricula Training (ICT)
Although the complex basketball problem can be partially solved by existing MARL algorithms under the Divide and Conquer setting, the correlations between the corresponding sub-tasks are neglected and there can be miscoordination induced by the asynchronism. Besides, the proposed EXP-Ms and EXP-Sp methods struggle to facilitate the learning of action execution times, and the asynchronism in Fever Basketball remains a critical problem, especially for joint-action learners. To make further progress, we take advantage of independent learners and propose a curriculum-learning-based framework named ICT (Figure 5), which mainly includes a set of weighted cascading curricula learners and a coordination curricula switcher. The weighted cascading curricula learners are responsible for the corresponding sub-tasks generated by the basketball game rules. The coordination curricula switcher, which has a relatively higher priority in making decisions, focuses on learning a cooperative policy over the primitive actions that result in curriculum switches, such as the pass action that triggers a switch between the attack and the assist curricula.
Methods
The weighted cascading curricula learners.
Figure 5: The Integrated Curricula Training framework (ICT).

Curriculum learning is used to solve complex and difficult problems (Wu and Tian 2016; Wu, Zhang, and Song 2018). As mentioned before, Fever Basketball offers a set of base training scenarios according to the game rules, namely attack, defense, freeball, ballclear, and assist from the perspective of a single player. All of these five base curricula are the fundamental aspects of an integrated basketball match, and only by mastering these basic curricula can one be ready for generating appropriate policies throughout an entire basketball match. Thus we first train a corresponding DRL agent $i$ to learn each of these base curricula $\tau_i$, the interaction process of which can be formulated as a finite Markov Decision Process (MDP). During each episode, agent $i$ perceives the state of the corresponding base curriculum $s_t \in S_{\tau_i}$ at each time step $t$ and outputs an action $a_t \in A_{\tau_i}$ according to policy $\pi_{\tau_i}$. A scalar reward $r_t \in R_{\tau_i}$ is then yielded by the environment and the agent transits to a new state $s_{t+1}$ with probability $P(s_{t+1} \mid s_t, a_t, \tau_i)$. The transition $(s_t, a_t, r_t, s_{t+1})$ is stored in replay buffer $D_i$. Agent $i$'s goal is to find an optimal policy $\pi^*$ that maximizes the expected accumulated (discounted) reward from each state $s$ in the corresponding curriculum $\tau_i$, namely the value function $Q^*_{\tau_i}(s, a)$, which can be formulated as:

$Q^*_{\tau_i}(s, a) = \mathbb{E}_{s' \sim \varepsilon}\big[\, r + \gamma \max_{a'} Q^*_{\tau_i}(s', a') \mid s, a, \tau_i \,\big]$

The parameters $\theta_i$ of the $Q_{\tau_i}$ network are updated by randomly sampling mini-batches from the corresponding replay buffer $D_i$ and performing a gradient descent step on $\big(y_i - Q(s, a; \theta_i)\big)^2$. The Q-value labels $y_i$ can be calculated as follows:

$y_i = \begin{cases} r_i, & \text{terminal } s \text{ of } \tau_i \\ r_i + \gamma \max_{a'} Q^*_{\tau_i}(s', a'; \theta_i), & \text{otherwise} \end{cases}$

In this way, the complicated and challenging basketball problem is decomposed into several easier curricula which can be preliminarily solved by co-training the corresponding DRL agents, similar to a divide-and-conquer strategy. Although this base curricula training enables the agents to acquire some primary skills for the corresponding sub-tasks, the whole basketball match remains a challenge. This is because a round of a basketball match can include many transitions between the corresponding sub-tasks, so these five base curricula are actually highly correlated. For example, the attack and assist curricula are normally followed by freeball after the shot of the offense team, thus the policy used in the former curricula contributes to the outcomes of the latter curricula.

To deal with the correlations between the corresponding base curricula, we propose the cascading curricula training approach. It is implemented by adding the $\max Q(s'_{\tau_j}, a'_{\tau_j}; \theta_j)$ value of the following base curriculum $\tau_j$ to the reward $r_i$ that agent $i$ receives from the environment to form the new label $y_i$ when the former base curriculum $\tau_i$ reaches a terminal state. Meanwhile, we use a weight parameter $\eta \in [0, 1]$ to adjust the ratio of the cascading Q values heuristically during training. The adjustment procedure for $\eta$ is crucial for both the stabilization and the performance of the whole training process. When $\eta$ equals 0, the cascading curricula training reduces to the base curricula training.
When $\eta$ increases to 1 by following certain heuristic procedures during training, the correlations between the corresponding base curricula are gradually established through the backup of the learned Q values and contribute to the final integrated policy throughout the whole basketball match. The new Q-value labels $y_i^{cas}$ can be formulated as:

$y_i^{cas} = \begin{cases} r_i + \eta \gamma \max_{a'} Q^*_{\tau_j}(s'_{\tau_j}, a'_{\tau_j}; \theta_j), & \text{terminal } s \text{ of } \tau_i \\ r_i + \gamma \max_{a'} Q^*_{\tau_i}(s', a'; \theta_i), & \text{otherwise} \end{cases}$
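A condensed sketch of this target computation is given below, assuming the maxima over the same-curriculum and following-curriculum Q-networks have already been evaluated. The function name is ours, and details such as target networks and replay sampling (covered in the Appendix) are omitted.

```python
def cascading_target(r_i, done_i, gamma, eta, q_next_same, q_next_following):
    """Compute the cascading Q-value label y_i^cas for curriculum tau_i.

    q_next_same:      max_a' Q_{tau_i}(s', a') within the same curriculum
    q_next_following: max_a' Q_{tau_j}(s'_{tau_j}, a'_{tau_j}) of the curriculum
                      that follows tau_i (e.g., freeball after attack)
    eta:              weight in [0, 1]; eta = 0 recovers plain base-curriculum training
    """
    if done_i:  # terminal state of tau_i: bootstrap from the following curriculum
        return r_i + eta * gamma * q_next_following
    return r_i + gamma * q_next_same          # ordinary one-step Q-learning target

# Example: a terminal attack transition handing over to the freeball curriculum.
y = cascading_target(r_i=2.0, done_i=True, gamma=0.99, eta=0.5,
                     q_next_same=0.0, q_next_following=1.2)
```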
The coordination curricula switcher.
Although the weighted cascading curricula training can decompose the complex basketball problem into relatively easier base curricula and take their correlations into account, it works only from the perspective of a single player. However, as a typical team sport, coordination within the same team plays a crucial role in every basketball match. On top of the cascading curricula training, we propose a high-level coordination curricula switcher to facilitate the training of coordination by focusing on learning the cooperative actions that can induce curricula switching within the same team, for example, the pass action, which is the core action that transfers ball possession and creates basketball tactics in the offense team. By taking over such actions, the coordination curricula switcher focuses on learning how to pass the ball to the right player at an appropriate time, which, in the meantime, also results in the curriculum switch between attack/ballclear and assist within the same team. The coordination curricula switcher has a relatively higher priority over the weighted cascading base curricula learners in action selection to ensure the performance of coordination when necessary. Meanwhile, by taking over the coordination-related actions from the original action spaces, it also enables a reduction of the original action spaces for the attack, ballclear, and assist scenes, which helps the agents learn policies more effectively in the corresponding curricula. The pseudo-code of the ICT framework for Fever Basketball can be found in the Appendix.
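The action-selection priority of the switcher can be summarized by the small sketch below; `switcher_policy` and `curriculum_policies` are stand-in callables of our own, and the rule that the switcher overrides the base learner whenever it proposes a coordination action (e.g., a pass) is our reading of the description above.

```python
def select_action(obs, scene, switcher_policy, curriculum_policies):
    """Hierarchical action selection in the ICT framework (illustrative).

    The coordination curricula switcher is queried first; if it decides that a
    coordination action (such as a pass) is needed, that action is executed.
    Otherwise the cascading learner of the current base curriculum acts on a
    reduced action space that excludes the coordination actions.
    """
    coord_action = switcher_policy(obs, scene)   # returns an action id or None
    if coord_action is not None:
        return coord_action                      # switcher has higher priority
    return curriculum_policies[scene](obs)       # scene in {"attack", "assist", ...}
```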
Figure 6: Performance of the training approaches proposed for Fever Basketball. (a) Evaluation against the hard built-in bots. (b) Win rate of the self-play models during the online evaluation with human players. (c) Match ratio of the self-play models during the online evaluation with human players.

Experimental results
In this subsection, an ablation study is first carried out to assess the effects of the different parts of our ICT framework by playing against the hard built-in bots; it compares the one-model training, the base curricula training, the cascading curricula training, and the ICT framework. The performance of the whole ICT framework is then further evaluated with online players in Fever Basketball. The APEX-Rainbow (Horgan et al. 2018; Hessel et al. 2018) algorithm is used for all learners, and we put the detailed training setups and architectures in the Appendix.

The results of the ablation experiments are shown in Figure 6(a). The horizontal axis is the number of evaluated matches along the training process (3 minutes per round). The vertical axis is the average score gap per match between the proposed training approaches and the built-in hard bots over 10 game clients. We can find that the one-model training approach performs the worst, and the players trained by this approach struggle to master the five distinct sub-tasks together. The base curricula training approach performs much better than the one-model training method since each learner can focus only on the corresponding sub-task and generate some fundamental policies towards the basic game rules. Players trained in this way tend to play solo while lacking tactical movements, since the correlations between related sub-tasks are ignored. The weighted cascading curricula training makes further improvements compared with the base curricula training because the correlation between related sub-tasks is retained and the policy can be optimized over the whole task, although coordination within one team remains a weakness. The ICT framework significantly outperforms the other training approaches since coordination performance can be essentially improved by using the coordination curricula switcher.

The results of a 300-day online evaluation with human players are illustrated in Figure 6(b and c). During this evaluation, we first deployed the model learned by the base curricula training and changed the online model to the one trained through the ICT framework on day 63. As shown in Figure 6(b), the win rate of the updated model (up to almost 70%, red dashed lines) is more than twice that of the former model (around 30%, blue dashed lines) at the beginning of each online evaluation with human players in 3v3 PVP (i.e., player vs. player) matches. The team trained with our method can generate many professional coordination tactics like give-and-go, and it is more likely to create wide-open areas to score by passing smoothly. In addition, the match ratio of human players choosing to play against the challenging AI teams keeps increasing among all the PVP matches (see Figure 6(c)), which indicates that we bring extra revenue to the game.

Discussion and Conclusion
In this paper, we present the Fever Basketball Environment, a novel open-source reinforcement learning environment based on the basketball game. It is a complex and challenging environment which supports both single-agent and multi-agent training. Besides, the actions with different execution times in this environment make it a good platform for studying the challenging problem of asynchronized multi-agent decision making. We implement and evaluate state-of-the-art MARL algorithms (such as VDN, QMIX, and EXCEL) together with two heuristic methods (i.e., EXP-Ms and EXP-Sp) that alleviate the effect of asynchronism, in both the Full Game setting and the Divide and Conquer setting of Fever Basketball. The results show that the game is challenging and that existing algorithms fail to solve the asynchronism problems. To shed light on this complex task, we take advantage of curriculum learning and propose an integrated curricula training framework to solve the problem step by step. Though progress has been made, the win rate against online human players is not high (up to 70%) and keeps decreasing as the evaluation process goes on, which may result from our model's lack of generalization to unseen opponents, and which meanwhile demonstrates the difficulty of mastering the basketball game. We expect the properties of Fever Basketball, such as its complexity, flexible settings, and asynchronism, to be useful for investigating current scientific challenges like long time horizons, sparse rewards, credit assignment, and non-stationarity.
Ethical Impact
Considering that game platforms have substantially boosted the development of reinforcement learning (RL), open-sourcing our Fever Basketball platform is expected to further enrich the types of virtual environments available to the RL community. What's more, the new challenges brought by our platform also have great potential to incubate new algorithms, which is another way of contributing to the development of RL.
References
Amato, C.; Konidaris, G.; Kaelbling, L. P.; and How, J. P. 2019. Modeling and planning with macro-actions in decentralized POMDPs. Journal of Artificial Intelligence Research 64: 817–859.
Badia, A. P.; Piot, B.; Kapturowski, S.; Sprechmann, P.; Vitvitskyi, A.; Guo, D.; and Blundell, C. 2020. Agent57: Outperforming the Atari human benchmark. arXiv preprint arXiv:2003.13350.
Beattie, C.; Leibo, J. Z.; Teplyashin, D.; Ward, T.; Wainwright, M.; Küttler, H.; Lefrancq, A.; Green, S.; Valdés, V.; Sadik, A.; et al. 2016. DeepMind Lab. arXiv preprint arXiv:1612.03801.
Bellemare, M. G.; Naddaf, Y.; Veness, J.; and Bowling, M. 2013. The Arcade Learning Environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research 47: 253–279.
Claus, C.; and Boutilier, C. 1998. The dynamics of reinforcement learning in cooperative multiagent systems. In AAAI/IAAI.
Cobbe, K.; Hesse, C.; Hilton, J.; and Schulman, J. 2019. Leveraging procedural generation to benchmark reinforcement learning. arXiv preprint arXiv:1912.01588.
Coumans, E.; and Bai, Y. 2016. PyBullet, a Python module for physics simulation for games, robotics and machine learning.
Foerster, J.; Farquhar, G.; Afouras, T.; Nardelli, N.; and Whiteson, S. 2017. Counterfactual multi-agent policy gradients. arXiv preprint arXiv:1705.08926.
Haarnoja, T.; Zhou, A.; Abbeel, P.; and Levine, S. 2018. Soft Actor-Critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290.
Heinrich, J.; and Silver, D. 2016. Deep reinforcement learning from self-play in imperfect-information games. arXiv preprint arXiv:1603.01121.
Hernandez-Leal, P.; Kartal, B.; and Taylor, M. E. 2019. A survey and critique of multiagent deep reinforcement learning. Autonomous Agents and Multi-Agent Systems.
Hessel, M.; Modayil, J.; Van Hasselt, H.; Schaul, T.; Ostrovski, G.; Dabney, W.; Horgan, D.; Piot, B.; Azar, M.; and Silver, D. 2018. Rainbow: Combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence.
Horgan, D.; Quan, J.; Budden, D.; Barth-Maron, G.; Hessel, M.; Van Hasselt, H.; and Silver, D. 2018. Distributed prioritized experience replay. arXiv preprint arXiv:1803.00933.
Hu, Y.; Chen, Y.; Fan, C.; and Hao, J. 2019. Explicitly coordinated policy iteration. In IJCAI, 357–363.
Jiang, D. R.; Ekwedike, E.; and Liu, H. 2018. Feedback-based tree search for reinforcement learning. arXiv preprint arXiv:1805.05935.
Juliani, A.; Berges, V.-P.; Vckay, E.; Gao, Y.; Henry, H.; Mattar, M.; and Lange, D. 2018. Unity: A general platform for intelligent agents. arXiv preprint arXiv:1809.02627.
Juliani, A.; Khalifa, A.; Berges, V.-P.; Harper, J.; Teng, E.; Henry, H.; Crespi, A.; Togelius, J.; and Lange, D. 2019. Obstacle Tower: A generalization challenge in vision, control, and planning. arXiv preprint arXiv:1902.01378.
Kolve, E.; Mottaghi, R.; Han, W.; VanderBilt, E.; Weihs, L.; Herrasti, A.; Gordon, D.; Zhu, Y.; Gupta, A.; and Farhadi, A. 2017. AI2-THOR: An interactive 3D environment for visual AI. arXiv preprint arXiv:1712.05474.
Kurach, K.; Raichuk, A.; Stańczyk, P.; Zając, M.; Bachem, O.; Espeholt, L.; Riquelme, C.; Vincent, D.; Michalski, M.; Bousquet, O.; et al. 2019. Google Research Football: A novel reinforcement learning environment. arXiv preprint arXiv:1907.11180.
Lample, G.; and Chaplot, D. S. 2017. Playing FPS games with deep reinforcement learning. In Thirty-First AAAI Conference on Artificial Intelligence.
Liu, T.; Zheng, Z.; Li, H.; Bian, K.; and Song, L. 2019. Playing card-based RTS games with deep reinforcement learning. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, 4540–4546. doi:10.24963/ijcai.2019/631.
Matignon, L.; Laurent, G. J.; and Le Fort-Piat, N. 2007. Hysteretic Q-learning: an algorithm for decentralized reinforcement learning in cooperative multi-agent teams. In 2007 IEEE/RSJ International Conference on Intelligent Robots and Systems, 64–69. IEEE.
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; and Riedmiller, M. 2013. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-level control through deep reinforcement learning. Nature 518(7540): 529–533.
Munemasa, I.; Tomomatsu, Y.; Hayashi, K.; and Takagi, T. 2018. Deep reinforcement learning for recommender systems. In International Conference on Information and Communications Technology (ICOIACT), 226–233. IEEE.
Nguyen, D. T.; Kumar, A.; and Lau, H. C. 2018. Credit assignment for collective multiagent RL with global rewards. In Advances in Neural Information Processing Systems, 8102–8113.
Nichol, A.; Pfau, V.; Hesse, C.; Klimov, O.; and Schulman, J. 2018. Gotta learn fast: A new benchmark for generalization in RL. arXiv preprint arXiv:1804.03720.
OpenAI. 2018. OpenAI Five. https://blog.openai.com/openai-five/.
Paine, T. L.; Gulcehre, C.; Shahriari, B.; Denil, M.; Hoffman, M.; Soyer, H.; Tanburn, R.; Kapturowski, S.; Rabinowitz, N.; Williams, D.; et al. 2019. Making efficient use of demonstrations to solve hard exploration problems. arXiv preprint arXiv:1909.01387.
Pan, X.; You, Y.; Wang, Z.; and Lu, C. 2017. Virtual to real reinforcement learning for autonomous driving. arXiv preprint arXiv:1704.03952.
Papoudakis, G.; Christianos, F.; Rahman, A.; and Albrecht, S. V. 2019. Dealing with non-stationarity in multi-agent deep reinforcement learning. arXiv preprint arXiv:1906.04737.
Rashid, T.; Samvelyan, M.; De Witt, C. S.; Farquhar, G.; Foerster, J.; and Whiteson, S. 2018. QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. arXiv preprint arXiv:1803.11485.
Savva, M.; Kadian, A.; Maksymets, O.; Zhao, Y.; Wijmans, E.; Jain, B.; Straub, J.; Liu, J.; Koltun, V.; Malik, J.; et al. 2019. Habitat: A platform for embodied AI research. In Proceedings of the IEEE International Conference on Computer Vision, 9339–9347.
Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
Shalev-Shwartz, S.; Shammah, S.; and Shashua, A. 2016. Safe, multi-agent, reinforcement learning for autonomous driving. arXiv preprint arXiv:1610.03295.
Silver, D.; Huang, A.; Maddison, C. J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529(7587): 484–489.
Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A.; et al. 2017. Mastering the game of Go without human knowledge. Nature 550(7676): 354–359.
Sunehag, P.; Lever, G.; Gruslys, A.; Czarnecki, W. M.; Zambaldi, V.; Jaderberg, M.; Lanctot, M.; Sonnerat, N.; Leibo, J. Z.; Tuyls, K.; et al. 2018. Value-decomposition networks for cooperative multi-agent learning based on team reward. In AAMAS, 2085–2087.
Sutton, R. S.; and Barto, A. G. 2018. Reinforcement learning: An introduction. MIT Press.
Tassa, Y.; Doron, Y.; Muldal, A.; Erez, T.; Li, Y.; Casas, D. d. L.; Budden, D.; Abdolmaleki, A.; Merel, J.; Lefrancq, A.; et al. 2018. DeepMind Control Suite. arXiv preprint arXiv:1801.00690.
Vinyals, O.; Babuschkin, I.; Chung, J.; Mathieu, M.; Jaderberg, M.; Czarnecki, W. M.; Dudzik, A.; Huang, A.; Georgiev, P.; Powell, R.; et al. 2019. AlphaStar: Mastering the real-time strategy game StarCraft II. DeepMind Blog.
Vinyals, O.; Ewalds, T.; Bartunov, S.; Georgiev, P.; Vezhnevets, A. S.; Yeo, M.; Makhzani, A.; Küttler, H.; Agapiou, J.; Schrittwieser, J.; et al. 2017. StarCraft II: A new challenge for reinforcement learning. arXiv preprint arXiv:1708.04782.
Wu, Y.; and Tian, Y. 2016. Training agent for first-person shooter game with actor-critic curriculum learning.
Wu, Y.; Zhang, W.; and Song, K. 2018. Master-slave curriculum design for reinforcement learning. In IJCAI, 1523–1529.
Xiao, T.; Jang, E.; Kalashnikov, D.; Levine, S.; Ibarz, J.; Hausman, K.; and Herzog, A. 2020. Thinking while moving: Deep reinforcement learning with concurrent control. arXiv preprint arXiv:2004.06089.
Xiao, Y.; Hoffman, J.; and Amato, C. 2020. Macro-action-based deep multi-agent reinforcement learning. arXiv preprint arXiv:2004.08646.
Zhang, K.; Yang, Z.; and Başar, T. 2019. Multi-agent reinforcement learning: A selective overview of theories and algorithms. arXiv preprint arXiv:1911.10635.