Creating Pro-Level AI for a Real-Time Fighting Game Using Deep Reinforcement Learning
Inseok Oh, Seungeun Rho, Sangbin Moon, Seongho Son, Hyoil Lee, Jinyun Chung
Inseok Oh*, Seungeun Rho*, Sangbin Moon, Seongho Son, Hyoil Lee and Jinyun Chung†
NCSOFT, South Korea
{ohinsuk, gloomymonday, sangbin, hingdoong, onetop21, jchung2050}@ncsoft.com
Abstract
Reinforcement learning combined with deep neural networks has recently performed remarkably well in many genres of games. It has surpassed human-level performance in fixed game environments and turn-based two-player board games. However, to the best of our knowledge, current research has yet to produce a result that surpasses human-level performance in modern complex fighting games. This is due to the inherent difficulties of real-time fighting games, including vast action spaces, action dependencies, and imperfect information. We overcame these challenges and built 1v1 battle AI agents for the commercial game “Blade & Soul”. The trained agents competed against five professional gamers and achieved a win rate of 62%. This paper presents a practical reinforcement learning method that includes a novel self-play curriculum and data skipping techniques. Through the curriculum, three different styles of agents were created by reward shaping and were trained against each other. Additionally, this paper suggests data skipping techniques that could increase data efficiency and facilitate exploration in vast spaces. Since our method can be generally applied to all two-player competitive games with vast action spaces, we anticipate its application to game development, including level design and automated balancing.
Introduction
* Equal contribution. Alphabetical ordering. † Corresponding author.
Reinforcement learning (RL) is extending its boundaries to a variety of game genres. In PVE (player versus environment) settings, such as those found in Atari 2600 games, RL agents have exceeded human-level performance using various methods (Mnih et al. 2015; Mnih et al. [...] such as StarCraft2 (Vinyals et al. [...] 2016; Kim et al. [...] 2013). However, it is hard to fulfill real-time conditions when applied to heavier modern game engines with longer query times. Additionally, a deep RL based agent (Li et al. [...] (2017) in which a self-play
Figure 1. A scene from the B&S Arena Battle
      Year  Commercial  Dimension  Pro-scene
FICE  2013  X           2D         X
LF2   1999  O           2.5D       X
SSBM  2001  O           2D         O
BAB   2013  O           3D         O
Table 1: Fighting games from other works
deep RL method was applied to “Super Smash Bros. Melee (SSBM)”. However, the complexity of the state and action space is significantly limited compared to our 3D environment with complex game rules.
We created pro-level AI agents for the real-time fighting game “Blade & Soul (B&S) Arena Battle” via novel self-play based reinforcement learning. B&S is a commercial massively multiplayer online role-playing game. It supports duels between two players called “B&S Arena Battles (BABs)”. As presented in Table 1, BAB is a more modern fighting game than the games considered in other works; hence, it has much more complex game dynamics and a heavier game engine. Additionally, a large number of people play BAB, and it has a more active professional scene than other fighting games. This stands out even more when BAB is compared to FightingICE, which was designed solely for research purposes. Figure 1 displays a scene from BAB.
BAB is a two-player zero-sum game. In BAB, two players fight against each other to reduce their opponent's HP (health points) to zero within three minutes. To master BAB, an agent must be able to deal with multiple challenges. First, an agent must manage vast action and state spaces. An agent must make skill, move, and targeting decisions simultaneously, which yields many possible combinations. As a rough estimate, there are 144 potential actions for each time step: 8 (avg. [...] (required interval for re-using a skill) and SP (skill point) consumptions, and serve one or more of five different functions: damage dealing, crowd control (which renders the opponent incapacitated; abbreviated CC), resistance (which makes the player immune or resistant to CC skills), escape, and dash.
Lastly, an agent must deal with imperfect information settings. Because BAB is a real-time game, two players make their decisions simultaneously. This means that an agent is required to make decisions without knowing the opponent's decision or strategy. Hence, BAB can be considered to be a series of rock-paper-scissors games. For example, when a player uses a resistance skill and the opponent uses a crowd control skill at the same time, the player gains an advantage over the opponent. As a result, the essence of the problem is to approximate a Nash equilibrium strategy so that the agent can respond appropriately to any opposing strategy.
To tackle these challenges, we improved the vanilla self-play algorithm by diversifying opponent pools and skipping data to facilitate exploration. The main contributions of this work are as follows:
• We devised a novel self-play curriculum with agents of different styles. The curriculum made these agents compete against each other and reinforced the agents simultaneously, rendering them capable of handling a variety of opponents. We empirically demonstrate that our curriculum outperforms the vanilla self-play method.
• We diversified the fighting style of the game-playing AIs by reward shaping (Ng et al. 1999). We created three types of agents with different fighting styles: aggressive, defensive, and balanced. We anticipate its application to game development, including level design and automated balancing.
• We introduced data skipping techniques to enhance exploration in vast spaces. These can be generally applied to any two-player real-time fighting game.
• We evaluated our agents by pitting them against professional players in the 2018 B&S World Championship Blind Match. Our AI agents won three out of seven matches, with the aggressive one beating all professional players both in the live event and in the pre-test.
Background
Reinforcement Learning
In reinforcement learning (Sutton and Barto 1998), the agent and the environment can be formalized as a Markov decision process (MDP) (Howard 1960). At every discrete time step $t$, the agent receives a state $s_t \in S$ and sends an action $a_t \in A$ to the environment. The environment then makes a state transition from $s_t$ to $s_{t+1}$ with the state transition probability $P_{ss'}^{a} = P[s' \mid s, a]$ and gives a reward signal $r_t \in \mathbb{R}$ to the agent. This process can therefore be expressed as $\{S, A, P, R, \gamma\}$, where $\gamma \in [0, 1]$ is a discount factor that represents the preference for immediate reward over long-term reward. The agent samples an action from a policy $\pi(a_t \mid s_t)$, and the learning process modifies the policy to encourage good actions and suppress bad ones. The objective of learning is to find the optimal policy $\pi^*$ that maximizes the expected discounted cumulative reward:
$$\pi^* = \arg\max_{\pi} \mathbb{E}_{\pi}\left[\sum_{t} \gamma^{t} r_t\right]$$
Footnote: [...] (fourth annual event). The winning prize was approx. 50k USD (compared to Tekken7: 30k USD).
Real-Time Two Player Game
In a real-time two-player game, there are two players, namely the agent and the opponent. Both send an action to the environment at the same time. Let us denote the policy of the agent as $\pi^{ag}$ and the policy of the opponent as $\pi^{op}$. Each samples an action from its own policy at every time step:
$$a_t^{ag} \sim \pi^{ag}(a_t^{ag} \mid s_t), \qquad a_t^{op} \sim \pi^{op}(a_t^{op} \mid s_t)$$
Then, the environment makes a state transition by considering those two actions jointly:
$$s_{t+1} \sim P(s_{t+1} \mid s_t, a_t^{ag}, a_t^{op}), \qquad r_{t+1} = R(s_t, a_t^{ag}, a_t^{op})$$
Here, the MDP can be expressed as $\{S, A^{ag}, A^{op}, P, R, \gamma\}$. If $\pi^{op}$ is fixed, we can regard the opponent as a part of the environment by marginalizing over the policy of the opponent:
$$P'(s_{t+1} \mid s_t, a_t^{ag}) = \sum_{a_t^{op}} \pi^{op}(a_t^{op} \mid s_t)\, P(s_{t+1} \mid s_t, a_t^{ag}, a_t^{op})$$
The MDP expression then takes a simpler form with $P'$: $\{S, A^{ag}, P', R', \gamma\}$. This expression is consistent with the one-player MDP, so any method for the original MDP works in this form as well. However, $\pi^{op}$ is not fixed in general, and our agent does not know which $\pi^{op}$ it is going to face. We propose a self-play curriculum with a diversified pool of $\pi^{op}$ in the following section.
BAB as MDP
If we assume that $\pi^{op}$ or the pool of $\pi^{op}$ is fixed, BAB can be expressed as an MDP. Figure 2 illustrates the agent-environment framework in BAB. LSTM (Hochreiter and Schmidhuber 1997) based agents interact with the BAB simulator, which acts as the environment. At every time step, with 0.1 sec intervals, the state $s_t$ is constructed from the history of observations $H_t = \{o_1, o_2, \ldots, o_t\}$. To be specific, $s_t$ is composed of any information that a human can access during a game, such as HP, SP, distance from the opponent, distance from the arena wall, current position, remaining game time, remaining cooldown times for all 44 skills, the agent's status info (midair, stun, down, kneel, etc.), and so on. Then, the agent decides on an action $a_t = (a_t^{skill}, a_t^{move,target})$ at every time step. Note that the targeting action space was originally continuous. We discretized it into two options (facing the opponent or facing the current moving direction) and considered it jointly with the decision to move.
The action is then sent to the environment, and a state transition occurs accordingly. Here, exact rewards should also be determined. Rewards are closely related to high performance in BAB. We provided $r_t^{WIN}$, which is the reward for winning a game, and $r_t^{HP}$, the reward for changes in the HP margin. These rewards are designed based on the assumption that the more a player wins, and the more HP remains, the better that player's performance is. $r_t^{WIN}$ is given at the terminal step of each episode, with +10 for a win and -10 for a loss. $r_t^{HP}$ may occur at every time step, when the agent deals damage to the opponent and vice versa. Since HP is normalized to [0, 10], $r_t^{WIN}$ and $r_t^{HP}$ have the same scale.
$$r_t = r_t^{WIN} + r_t^{HP}$$
$$r_t^{HP} = (HP_t^{ag} - HP_{t-1}^{ag}) - (HP_t^{op} - HP_{t-1}^{op})$$
These are fundamental rewards, and additional rewards for guiding battle styles are described in the next section.
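The two fundamental rewards can be written down almost directly from their definitions. The following is a minimal sketch, not the authors' implementation: the names `HpSnapshot`, `hp_reward`, `win_reward` and `step_reward` are hypothetical, HP values are assumed to be already normalized to [0, 10], and how a draw or time-out is scored is not specified in the text and is therefore not modeled.

```python
from dataclasses import dataclass
from typing import Optional

WIN_REWARD = 10.0  # +10 for a win, -10 for a loss, as stated in the text


@dataclass
class HpSnapshot:
    """HP of both players at one time step, normalized to [0, 10]."""
    agent_hp: float
    opponent_hp: float


def hp_reward(prev: HpSnapshot, curr: HpSnapshot) -> float:
    """r_t^HP: change in the agent's HP minus change in the opponent's HP."""
    agent_delta = curr.agent_hp - prev.agent_hp
    opponent_delta = curr.opponent_hp - prev.opponent_hp
    return agent_delta - opponent_delta


def win_reward(agent_won: Optional[bool]) -> float:
    """r_t^WIN: non-zero only at the terminal step (None means not terminal)."""
    if agent_won is None:
        return 0.0
    return WIN_REWARD if agent_won else -WIN_REWARD


def step_reward(prev: HpSnapshot, curr: HpSnapshot,
                agent_won: Optional[bool] = None) -> float:
    """r_t = r_t^WIN + r_t^HP."""
    return win_reward(agent_won) + hp_reward(prev, curr)
```

For example, dealing 1.5 points of normalized damage without taking any yields a reward of 1.5 for that step, and a finishing blow that removes the opponent's last 0.5 HP yields 10.5.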
The value of γ is set to 0.995, which is close to 1.0, since all episodes in BAB are forced to terminate after 1,800 time steps (= 3 min).
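The fixed-opponent assumption that opens this section relies on folding the opponent's policy into the transition function. As an illustration only, the sketch below performs that marginalization on a toy tabular problem; the function name `marginalize_opponent`, the nested-list layout and the two-state toy numbers are all hypothetical and have nothing to do with the real BAB simulator, whose dynamics are not available in tabular form.

```python
from typing import List

# Layout (an assumption of this sketch):
#   joint_p[s][a_ag][a_op][s_next] = P(s_next | s, a_ag, a_op)
#   opponent_policy[s][a_op]       = pi_op(a_op | s)
JointP = List[List[List[List[float]]]]


def marginalize_opponent(joint_p: JointP,
                         opponent_policy: List[List[float]]) -> List[List[List[float]]]:
    """Return P'(s_next | s, a_ag) = sum over a_op of pi_op(a_op | s) * P(s_next | s, a_ag, a_op)."""
    n_states = len(joint_p)
    marginal = []
    for s in range(n_states):
        per_agent_action = []
        for a_ag in range(len(joint_p[s])):
            row = [0.0] * n_states
            for a_op, pi in enumerate(opponent_policy[s]):
                for s_next in range(n_states):
                    row[s_next] += pi * joint_p[s][a_ag][a_op][s_next]
            per_agent_action.append(row)
        marginal.append(per_agent_action)
    return marginal


# Toy example: 2 states, 2 agent actions, 2 opponent actions.
toy_joint_p = [
    [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [0.5, 0.5]]],  # transitions from state 0
    [[[0.0, 1.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]],  # transitions from state 1
]
toy_opponent_policy = [[0.25, 0.75], [0.5, 0.5]]
toy_marginal = marginalize_opponent(toy_joint_p, toy_opponent_policy)
```

Each row of `toy_marginal` is again a probability distribution over next states, which is what allows single-agent methods to be applied once the opponent (or the opponent pool) is held fixed.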
Self-Play Curriculum with Diverse Styles
Existing self-play methods (Silver et al. 2017; Silver et al. 2018a) generally use opponent pools for training. The parameters of a network are stored at regular intervals during training to create a pool of past selves, and opponents are then sampled from this pool. Although the self-play method of RL offers a way to learn a Nash equilibrium strategy (Heinrich and Silver 2016), high coverage of the strategy space is essential to find one efficiently. Vanilla self-play alone does not guarantee enough coverage for games with large problem spaces. To tackle this problem, AlphaStar (Vinyals et al. 2019) diversified the opponent pool by imitating different human strategies and introducing three types of agents with different matchmaking schemes. The poker AI Pluribus (Brown and Sandholm 2019) hand-tuned three different strategies on top of a basic blueprint strategy; the three strategies are biased toward raising, calling, and folding, respectively. Concurrently, we devised a novel self-play curriculum. We enforced diversity of the agents' strategies by introducing a range of different battle styles, and agents of different styles were made to compete against each other.
Figure 2. Agent-environment plot in BAB
Guiding Battle Styles through Reward Shaping
One of the most noticeable dimensions along which fighting styles differ is the degree of aggressiveness. We used three dimensions of rewards to control the degree of aggressiveness. The first dimension is the “time penalty”. The aggressive agent receives larger penalties per time step, which motivates it to finish the match in a shorter period of time. The second dimension is the relative importance of the agent's HP to the opponent's HP. Aggressive players will try to reduce the opponent's HP rather than preserve their own HP, while defensive players tend to act the opposite way. The final dimension is the “distance penalty”. Defensive players tend to keep a certain distance from their opponents in order to respond appropriately to attacks, while aggressive players tend to approach their opponents and attack relentlessly. To realize these properties, the aggressive agent received larger penalties in proportion to the distance between itself and its opponent. The specific reward weights used for each style are shown in Table 2.
Note that each of these three dimensions can take continuous values. This means that it is possible to create a spectrum of different fighting styles with varying degrees of aggressiveness. However, to effectively demonstrate the viability of this method, we limited the number of fighting styles to three. By using any type of additional reward signal along with $r_t^{WIN}$ and $r_t^{HP}$, this method could be applied to other fighting games in general to create agents with various fighting styles.
Our Self-Play Curriculum
Figure 3 shows an overview of the proposed self-play curriculum with three different types of agents. Agents of each style have their own learning process, and all three agent types were trained concurrently.
Each learning process consisted of a learner and multiple simulators, which work asynchronously. In the simulators, an agent constantly plays matches against opponents randomly sampled from the shared pool. The most recent $k$ models of each style are selected uniformly with a total probability mass of $p$, while the other models are chosen uniformly with the remaining probability $1 - p$. As training goes on, $p$ is linearly annealed from 0.8 to 0.1. A higher $p$ assists in swift adaptation to the latest opponents, while a lower $p$ stabilizes the learning process by alleviating catastrophic forgetting. Each simulator sends a match log to the learner at the end of every match and updates its agent with the latest parameters received from the learner. The same procedure is repeated for subsequent games.
The learner trains its agents in an off-policy manner using logs gathered from multiple simulators and sends the latest network parameters to the simulators on request. In addition, the learner sends its network parameters to the shared pool every $C$ update steps (e.g. $C$ = 10,000). Thus, the pool holds varying policies that come from the different learning processes of the different styles. These sets of model parameters are provided as opponents to each learning process. By sharing a pool, every learning agent encounters opponents of every style during training and learns how to deal with them.
                  Aggressive  Balanced  Defensive
Time penalty      0.008       0.004     0.0
HP ratio          5:5         5:5       6:4
Distance penalty  0.002       0.0002    0.0
Table 2: Reward weights of each style
Figure 3. Overview of self-play curriculum with three different styles
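The weights in Table 2 could be combined with the fundamental rewards roughly as follows. This is a sketch under several assumptions that the text does not settle: the first number of the HP ratio is taken to weight the agent's own HP change and the second the opponent's, the HP term is rescaled so that a 5:5 ratio reproduces the unshaped HP reward, the distance penalty is multiplied by the raw distance (whose unit is unspecified), and the names `StyleWeights`, `STYLES` and `shaped_reward` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StyleWeights:
    time_penalty: float      # subtracted at every time step
    own_hp_weight: float     # first number of the HP ratio (assumed: own HP)
    opp_hp_weight: float     # second number of the HP ratio (assumed: opponent HP)
    distance_penalty: float  # multiplied by the distance to the opponent


# Values copied from Table 2.
STYLES = {
    "aggressive": StyleWeights(0.008, 5.0, 5.0, 0.002),
    "balanced": StyleWeights(0.004, 5.0, 5.0, 0.0002),
    "defensive": StyleWeights(0.0, 6.0, 4.0, 0.0),
}


def shaped_reward(style: str, own_hp_delta: float, opp_hp_delta: float,
                  distance: float, win_reward: float = 0.0) -> float:
    """Style-specific reward for one time step (illustrative combination only)."""
    w = STYLES[style]
    total = w.own_hp_weight + w.opp_hp_weight
    # Rescale so that a 5:5 ratio gives exactly own_hp_delta - opp_hp_delta.
    hp_term = 2.0 * (w.own_hp_weight * own_hp_delta
                     - w.opp_hp_weight * opp_hp_delta) / total
    return win_reward + hp_term - w.time_penalty - w.distance_penalty * distance
```

Under these assumptions, losing one point of HP costs the defensive agent 1.2 but the other two styles only 1.0, while standing far from the opponent costs only the aggressive and balanced agents anything.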
Therefore, agents trained via our self-play curriculum can ultimately learn how to face opponents with varying fighting styles while maintaining their own battle styles.
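The opponent sampling rule of the curriculum (recent models with total mass p, older models with the remainder, p annealed linearly from 0.8 to 0.1) can be sketched as below. The pool layout, the function names `annealed_p` and `sample_opponent`, and the `total_steps` annealing horizon are assumptions made for illustration; the text gives neither the value of k nor the length of the annealing schedule.

```python
import random
from typing import Dict, List


def annealed_p(step: int, total_steps: int,
               p_start: float = 0.8, p_end: float = 0.1) -> float:
    """Linearly anneal p from p_start to p_end over total_steps (hypothetical horizon)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return p_start + frac * (p_end - p_start)


def sample_opponent(pool: Dict[str, List[str]], k: int, p: float,
                    rng: random.Random) -> str:
    """Pick an opponent snapshot from the shared pool.

    pool maps a style name to its snapshots, oldest first. The most recent
    k snapshots of every style share probability mass p; all older
    snapshots share the remaining 1 - p.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    recent: List[str] = []
    older: List[str] = []
    for snapshots in pool.values():
        recent.extend(snapshots[-k:])
        older.extend(snapshots[:-k])
    if not recent:
        raise ValueError("the pool is empty")
    # Early in training there may be no older snapshots yet.
    if not older or rng.random() < p:
        return rng.choice(recent)
    return rng.choice(older)


# Usage with placeholder snapshot identifiers.
example_pool = {
    "aggressive": ["agg_0", "agg_1", "agg_2"],
    "balanced": ["bal_0", "bal_1", "bal_2"],
    "defensive": ["def_0", "def_1", "def_2"],
}
example_rng = random.Random(0)
example_opponent = sample_opponent(example_pool, k=1,
                                   p=annealed_p(0, 1000), rng=example_rng)
```

Because every style contributes its own recent and older snapshots to the two groups, an agent of any style keeps meeting all three styles throughout training, which is the property the shared pool is meant to provide.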
Data Skipping Techniques
In this section, we detail our data skipping techniques, which refer to the process of dropping certain data during the training and evaluation procedures.
Discarding Passive “No-op”
In fighting games, using skills generally consumes resources, such as SP and cooldown time. Therefore, if a player overuses a certain skill, it will not be available when it is actually needed. Players should thus use and retain their skills strategically to ensure their availability when needed. To take this aspect into account, we concatenated a “no-op” action to the output of the policy network, allowing the agent to choose “no-op” and do nothing for a certain period if necessary. This means that our action space has 44 skills, plus an additional “no-op” action. This is significant because human play logs of BAB show that “no-op” actions take up the largest portion of skill usage among human players.
“No-op” decisions can be categorized into passive and active use cases. The passive use of “no-op” means that an agent chooses “no-op” because there is no skill available for use. For example, when an agent is out of resources or is hit by an opponent's CC skill, the agent has no option but to choose “no-op”. The active use of “no-op” means that an agent selects “no-op” strategically, even though other skills are available for use.
We discarded passive “no-op” data from both the training and evaluation phases because passive “no-ops” are not used deliberately by an agent. In addition, because the discarded steps are not provided to the network, the method enables the LSTM to reflect representations of longer time horizons. We show in the experiment section that skipping passive “no-ops” greatly improves learning efficiency. Note that this methodology is generally applicable to other domains where the set of available skills changes constantly and the “no-op” action is a valid option to choose.
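The distinction between passive and active “no-op” steps can be expressed as a simple filter over an episode. The sketch below is illustrative: the `Step` record, the `NO_OP` index and the function names are hypothetical, and what happens to rewards that arrive during a discarded step is not specified in the text, so the filter leaves rewards untouched.

```python
from dataclasses import dataclass
from typing import List, Sequence

NO_OP = 44  # hypothetical index: skills are 0..43, the extra slot is "no-op"


@dataclass
class Step:
    skill_available: Sequence[bool]  # availability of each of the 44 skills
    action: int                      # chosen skill index, or NO_OP
    reward: float


def is_passive_no_op(step: Step) -> bool:
    """A no-op is passive when no skill could have been used instead."""
    return step.action == NO_OP and not any(step.skill_available)


def drop_passive_no_ops(episode: List[Step]) -> List[Step]:
    """Keep skill uses and active no-ops; discard passive no-ops."""
    return [step for step in episode if not is_passive_no_op(step)]


# Usage: one skill use, one passive no-op (e.g. stunned), one active no-op.
some_available = [False] * 44
some_available[3] = True
example_episode = [
    Step(skill_available=some_available, action=3, reward=0.5),
    Step(skill_available=[False] * 44, action=NO_OP, reward=0.0),
    Step(skill_available=some_available, action=NO_OP, reward=0.0),
]
example_kept = drop_passive_no_ops(example_episode)
```

Only the middle step is removed here: the last step is also a “no-op”, but it was chosen while a skill was available, so it counts as a deliberate decision and stays in the data.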
Maintaining Move Action
Although a single skill decision can have a substantial influence on the subsequent states, the effect of a single move decision is relatively limited. The reason is that the distance a character moves in a single time step (0.1 s) is very short considering its speed. For a move decision to have a meaningful effect, the agent must make the same move decision for several consecutive ticks. This allows the agent to literally “move” and leads to changes in subsequent states and rewards. It is therefore difficult to train a move policy starting from an initial policy with random move decisions: since the chance of a random policy making the same decision consecutively is very low, exploration is extremely limited.
We therefore propose maintaining the move decision for a fixed number of time steps. Figure 4 shows how the method works with an example. If the agent selects a move action, it skips the move decision for the following $n-1$ time steps. This means that the agent maintains the same move decision for $n$ steps in total. Note that our method has a different purpose from the frame skip technique (Mnih et al. 2015) in the Atari domain. Frame skip was introduced for the simulator's efficiency. However, we cannot simply skip frames because skill decisions must still be made. Although we do not gain any advantage in simulator efficiency, maintaining the move decision still facilitates training, solely because it increases the influence of a single move decision, as we confirm with experiments. In this sense, maintaining the move can rather be viewed as ‘amplifying advantage’ from (Mladenov et al. [...]
Experiments
Implementation Details
Network
The network is an LSTM-based architecture with four heads on top of a shared state representation layer. The four heads are $\pi^{skill}$, $Q^{skill}$, $\pi^{move,target}$, and $Q^{move,target}$. $Q^{skill}$ and $Q^{move,target}$ are used for the gradient updates of $\pi^{skill}$ and $\pi^{move,target}$, respectively. Before the network output goes into the softmax layer, a Boolean vector indicating the availability of each skill is applied to set the outputs of unavailable skills to negative infinity.
Algorithm
We used an actor-critic off-policy learning algorithm (Wang et al. [...] simulators and learner through truncated importance sampling. Moreover, we could also use the advantages of a stochastic policy, which responds more stably to changes in the environment due to smooth policy updates and works well in domains like rock-paper-scissors, where a deterministic policy is vulnerable to exploitation. For this specific algorithm, both $\pi^{skill}$ and $\pi^{move,target}$ are updated in an alternating manner with the following gradient:
$$g_t^{acer} = \bar{\rho}_t \nabla_\theta \log \pi_\theta(a_t \mid x_t)\left[Q^{ret}(x_t, a_t) - V_{\theta_v}(x_t)\right] + \mathbb{E}_{a \sim \pi}\left(\left[\frac{\rho_t(a) - c}{\rho_t(a)}\right]_+ \nabla_\theta \log \pi_\theta(a \mid x_t)\left[Q_{\theta_v}(x_t, a) - V_{\theta_v}(x_t)\right]\right),$$
where $\bar{\rho}_t = \min\{c, \rho_t\}$ with behavior policy $\mu$ and importance sampling ratio $\rho_t = \frac{\pi(a_t \mid x_t)}{\mu(a_t \mid x_t)}$. $[x]_+ = x$ if $x > 0$ and zero otherwise.
Figure 4. Examples of (a) regular move decisions and (b) maintaining decisions for 1 second
Learning System
In total, there are three learning processes, each consisting of a learner and 100 simulators. Each learning process is largely similar to that proposed by Horgan et al. (2018). The final agent was trained for two weeks, which is equivalent to four years of game play.
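Returning to the availability mask described under Network, the masking step before the softmax can be sketched in a few lines. This is a framework-free illustration rather than the authors' code: the function name `masked_softmax` is hypothetical, plain Python lists stand in for network tensors, and the sketch assumes that at least one entry (for instance the “no-op” slot) is always available.

```python
import math
from typing import List, Sequence


def masked_softmax(logits: Sequence[float], available: Sequence[bool]) -> List[float]:
    """Softmax over logits after setting unavailable entries to negative infinity."""
    masked = [x if ok else -math.inf for x, ok in zip(logits, available)]
    top = max(masked)
    if top == -math.inf:
        raise ValueError("at least one action must be available")
    # Subtracting the maximum keeps exp() numerically stable; exp(-inf) is 0.0.
    exps = [math.exp(x - top) for x in masked]
    total = sum(exps)
    return [e / total for e in exps]


# Usage: the second skill is on cooldown, so it gets exactly zero probability.
example_probs = masked_softmax([1.0, 2.0, 3.0], [True, False, True])
```

Masking the logits rather than the output probabilities means the remaining probabilities renormalize automatically, so an unavailable skill can never be sampled and never receives gradient through its own probability.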
Effect of Self-Play Curriculum with Three Styles
To demonstrate the effects of the proposed self-play curriculum, we trained agents with and without it. A baseline agent was trained with vanilla self-play, using a pool of its past selves and no style-related rewards (only the win reward and the HP reward were included). Meanwhile, three agents with different styles were trained with the proposed self-play curriculum using the shared pool. Our aggressive, balanced, and defensive agents then each played 1,000 matches against the baseline agent to measure performance. As shown in Table 3, the agents trained with our curriculum outperformed the baseline agent.
              Aggressive  Balanced  Defensive  Average
Vs. Baseline  59.5%       63.8%     63.2%      62.2%
Table 3: Win rate of three styles of agents against baseline (1,000 games each)
Next, we conducted an ablation study to observe how the shared pool helps generalization. We wanted to confirm whether an agent would be able to deal with opponents of an unseen style when it had experienced only a limited range of opponents during training. Thus, we created three styles of agents trained in exactly the same manner, except that they had their own independent opponent pools. We denote the three types of agents using shared pools as $\pi_{sh}^{agg}$, $\pi_{sh}^{bal}$, and $\pi_{sh}^{def}$, and the three types of agents using independent pools as $\pi_{ind}^{agg}$, $\pi_{ind}^{bal}$, and $\pi_{ind}^{def}$. All six agents were trained for 5M steps (equivalent to 6 days) each.
Our assumption is that an agent trained with the shared pool is more robust when it faces opponents it has never encountered. Thus, we compared the win rate of $\pi_{sh}^{agg}$ vs. $\{\pi_{ind}^{bal}, \pi_{ind}^{def}\}$ with that of $\pi_{ind}^{agg}$ vs. $\{\pi_{ind}^{bal}, \pi_{ind}^{def}\}$. This experimental setting is based on three key ideas. First, $\pi_{sh}^{agg}$ and $\pi_{ind}^{agg}$ have the same training settings except for sharing the pool. Second, $\pi_{sh}^{agg}$ and $\pi_{ind}^{agg}$ are evaluated against the same opponents. Finally, although $\pi_{sh}^{agg}$ has encountered other styles from its pool, it has not confronted $\{\pi_{ind}^{bal}, \pi_{ind}^{def}\}$, for they were trained using independent opponent pools. If our assumption is correct, $\pi_{sh}^{agg}$ should have a higher win rate. Note that $\pi_{ind}^{bal}$ and $\pi_{ind}^{def}$ are not single models, but 10 models each, sampled at the same fixed intervals from their pools. We then conducted the same experiments for the remaining two styles; the results are presented in Table 4. As shown in the table, agents trained with the shared pool outperform their counterparts.
        Aggressive  Balanced  Defensive  Average
Shared  64.8%       79.6%     75.3%      73.6%
Ind.    64.7%       72.1%     56.5%      64.4%
Table 4: Generalization performance of three styles of agents, with and without shared pool (7,000 games each)
Based on the data in Table 4, the effect of using a shared pool is marginal in the case of aggressive agents. This indicates that the strategy spaces in which training takes place are similar whether or not various opponents are provided. This is related to the nature of fighting games, in which one side must fight back if the other side approaches and initiates a brawl. Thus, for an aggressive agent that attacks consistently, there is little difference in experience regardless of the diversity of the opponents' fighting styles.
Footnote: We measured how the average game length differs for each style because game length is a good proxy for assessing the degree of defensiveness of an agent's game play. The results were as follows: 66.6 sec for the aggressive, 91.7 sec for the balanced, and 179.9 sec for the defensive agent.
Effect of Discarding Passive “No-op”
As discussed in the previous section, the “no-op” decision may be either active or passive. We conducted an experiment to investigate the effect of discarding passive “no-op” data from learning. The sparring partner for the experiment was the built-in BAB AI, whose performance is comparable to the top 20% of players. We measured how fast agents learned to defeat it; the results are shown in Figure 5 (a). If passive “no-op” ticks are discarded from the learning data, the win rate reaches 80% after 70k steps, whereas 170k steps are required when they are included. The number of time steps required to reach a 90% win rate was reduced to half when passive “no-op” data was skipped.
This experiment confirms that the training performance is improved by discarding passive “no-op” from the learning data.
Effect of the Maintaining Move
To examine the effect of maintaining the move action, we set up two learning processes, both learning through self-play. One process makes a move decision at every time step, while the other makes a move decision and then repeats the same decision 9 more times in a row. We measured the entropy of the move policy to observe the effects. The entropy of the move policy for a given state $s_t$ is as follows:
$$H(s_t) = -\sum_{a} \pi^{move}(a \mid s_t) \log \pi^{move}(a \mid s_t)$$
Generally, entropy gradually decreases as learning progresses. Figure 5 (b) shows that the entropy declines faster when the technique is applied. A noticeable difference was also observed in the quality of the movement that the agent learned. Without the technique, the agent did not improve on random motion, but with it the agent learned to approach and retreat.
The longer the decision was repeated, the less immediate the agent's reaction became, but the more consistently the agent moved. We tested 1, 3, 5, and 10 ticks for the maintaining time; 10 ticks (equivalent to 1 s) yielded the best performance.
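The second learning process above (one move decision, then nine repeats) amounts to a small wrapper around the move policy. The sketch below illustrates the idea with a hypothetical `MoveHolder` class and a caller-supplied `sample_move` function standing in for the move head; the `is_new_decision` flag is an assumption about how held steps might be told apart from real decisions when building training data. Skill decisions are not affected and would still be made at every tick.

```python
import itertools
from typing import Any, Callable, Tuple


class MoveHolder:
    """Repeat a sampled move decision for hold_ticks consecutive time steps."""

    def __init__(self, sample_move: Callable[[Any], Any], hold_ticks: int = 10):
        if hold_ticks < 1:
            raise ValueError("hold_ticks must be at least 1")
        self._sample_move = sample_move
        self._hold_ticks = hold_ticks
        self._current = None
        self._remaining = 0

    def act(self, state: Any) -> Tuple[Any, bool]:
        """Return (move, is_new_decision) for the current tick."""
        is_new_decision = self._remaining == 0
        if is_new_decision:
            self._current = self._sample_move(state)
            self._remaining = self._hold_ticks
        self._remaining -= 1
        return self._current, is_new_decision


# Usage: a stand-in "policy" that returns 0, 1, 2, ... on successive calls.
_counter = itertools.count()
example_holder = MoveHolder(lambda state: next(_counter), hold_ticks=10)
example_ticks = [example_holder.act(None) for _ in range(25)]
```

Over 25 ticks (2.5 s of game time) the stand-in policy is queried only three times, at ticks 0, 10 and 20, so each sampled direction is held long enough to change the character's position noticeably.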
Pro-Gamer Evaluation
This section addresses the results of both the pre-test and the Blind Match, as well as the conditions used to ensure fairness for human players.
Conditions for Fairness
Reaction Time
When humans confront an AI in a real-time fighting game, the most important factor that affects the result is the reaction time. Humans require some time to recognize the skill used by the opponent and to move their hands to press a button. We applied an average delay of 230 ms before a decision of the AI is reflected in the game, so that the AI does not have an advantage. This amount of delay corresponds to the average reaction time of professional players in BAB.
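One simple way to realize such a delay is to queue each decision with a release time and only hand it to the game once that time has passed. The sketch below is an assumption about how this could be done, not a description of the actual system: the class name `DelayedActionQueue` is hypothetical, and a fixed 0.23 s delay is used for simplicity even though the text speaks of an average, which suggests the real delay varied.

```python
from collections import deque
from typing import Any, Deque, List, Tuple


class DelayedActionQueue:
    """Hold AI decisions back for a fixed delay before they reach the game."""

    def __init__(self, delay_s: float = 0.23):
        self._delay_s = delay_s
        # (release_time, action) pairs in submission order.
        self._pending: Deque[Tuple[float, Any]] = deque()

    def submit(self, now_s: float, action: Any) -> None:
        """Register a decision made at time now_s."""
        self._pending.append((now_s + self._delay_s, action))

    def pop_due(self, now_s: float) -> List[Any]:
        """Return all actions whose delay has elapsed, oldest first."""
        due = []
        # The small tolerance guards against floating-point rounding.
        while self._pending and self._pending[0][0] <= now_s + 1e-9:
            due.append(self._pending.popleft()[1])
        return due


# Usage with 0.1 s ticks: a decision made at t = 0.0 is released at t = 0.3.
example_queue = DelayedActionQueue()
example_queue.submit(0.0, "skill_A")
example_released = [example_queue.pop_due(t) for t in (0.1, 0.2, 0.3)]
```

With 0.1 s time steps, a fixed 230 ms delay is effectively rounded up to three ticks in this sketch; matching a 230 ms average more closely would require varying the delay between two and three ticks.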
Classes and Skill Set
There are 11 classes in B&S, and each class has unique characteristics. Since some classes have an inherent advantage over others, we fixed the class of both the AI and the pro-gamer as “Destroyer”. Destroyer is a class that has an infighting style and appears steadily in the B&S World Championship. Additionally, the skill trees of the AI and the pro-gamer were set to be identical to ensure a fair match. The skill tree was chosen to match what the majority of users selected, based on the BAB user statistics.
Evaluation Results
We invited two prominent pro-gamers, Yuntae Son (GC Busan, Winner of the 2017 B&S World Championship) and Shingyeom Kim (GC Busan, Winner of the 2015 and 2016 B&S World Championships), to test our agents before the Blind Match. Note that the total number of games played differs for each style because the testers could play as many games as they wanted against each style. After the pre-test, we took part in the Blind Match of the 2018 World Championship. Our agents had matches against three pro-gamers: Nicholas Parkinson (EU), Shen Haoran (CHN), and Sungjin Choi (KOR). The video recording of the game highlights can be found at https://goo.gl/7VUTzV.
The results of both the pre-test and the Blind Match are provided in Table 5. As can be seen from the table, the aggressive agent dominated its games, while the other two types of agents had rather close games. According to the interviews after the pre-test, this was partly because human players often need some breaks between fights, but the aggressive agent does not permit such breaks; rather, its attacks are continuous.
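As a cross-check, the overall win rate quoted in the abstract and the Blind Match tally quoted in the introduction can both be recovered from the per-style scores in Table 5. The arithmetic below uses only the numbers printed in that table:

```latex
\begin{align*}
\text{total wins}  &= 11 + 5 + 5 = 21 \\
\text{total games} &= (11 + 1) + (5 + 7) + (5 + 5) = 34 \\
\text{overall win rate} &= \frac{21}{34} \approx 0.618 \approx 62\% \\[4pt]
\text{Blind Match wins}  &= 2 + 1 + 0 = 3 \\
\text{Blind Match games} &= (2 + 0) + (1 + 2) + (0 + 2) = 7
\end{align*}
```

The aggressive agent accounts for 11 of the 21 wins, so the 62% figure is an average over three styles with very different individual records.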
Conclusion
Using deep reinforcement learning, we created AI agents that competed evenly with professional players in a 3D real-time fighting game. To accomplish this, we proposed a method to guide the fighting style with reward shaping. With three styles of agents, we introduced a novel self-play curriculum to enhance generalization performance. We also proposed data-skipping techniques to improve data efficiency and enable efficient exploration. Consequently, our agents were able to compete with the best BAB pro-gamers in the world. The proposed training methods are generally applicable to other fighting games.
             Aggressive  Balanced   Defensive
Pro-Gamer 1  5-1         2-1        1-2
Pro-Gamer 2  4-0         2-4        4-1
Blind Match  2-0         1-2        0-2
Total        11-1 (92%)  5-7 (42%)  5-5 (50%)
Table 5: Final score of AI vs. Human
Figure 5. Results of data skipping experiments
References
Brown, N., and Sandholm, T. 2019. Superhuman AI for multiplayer poker. Science, (6456), 885-890.
Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning, I., Legg, S., and Kavukcuoglu, K. 2018. IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures. In Proceedings of the 35th International Conference on Machine Learning, 80:1407-1416.
Firoiu, V., Whitney, W. F., and Tenenbaum, J. B. 2017. Beating the world's best at Super Smash Bros. with deep reinforcement learning. arXiv preprint arXiv:1702.06230.
Heinrich, J., and Silver, D. 2016. Deep reinforcement learning from self-play in imperfect-information games. arXiv preprint arXiv:1603.01121.
Hessel, M., Modayil, J., Van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, et al. 2018. Rainbow: Combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence.
Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation, (8):1735-1780.
Horgan, D., Quan, J., Budden, D., Barth-Maron, G., Hessel, M., Van Hasselt, H., and Silver, D. 2018. Distributed prioritized experience replay. arXiv preprint arXiv:1803.00933.
Howard, R. A. 1960. Dynamic programming and Markov processes. MIT Press.
Ishihara, M., Ito, S., Ishii, R., Harada, T., and Thawonmas, R. 2018. Monte-Carlo Tree Search for Implementation of Dynamic Difficulty Adjustment Fighting Game AIs Having Believable Behaviors. In [...]
[...] et al. 2018. Human-level performance in first-person multiplayer games with population-based deep reinforcement learning. arXiv preprint arXiv:1807.01281.
Kim, M. J., and Kim, K. J. 2017. Opponent modeling based on action table for MCTS-based fighting game AI. In [...]
[...] IEEE 2nd Global Conference on Consumer Electronics: [...]
[...] arXiv preprint arXiv:1905.13559.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., et al. 2015. Human-level control through deep reinforcement learning. Nature, (7540), 529.
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Harley, T., Lillicrap, T. P., et al. 2016. Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd International Conference on International Conference on Machine Learning-Volume 48: [...]
[...] ICML Vol. 99: 278-287.
OpenAI. 2018. OpenAI five, https://blog.openai.com/openai-five.
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature, (7587), 484.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., et al. 2017. Mastering the game of Go without human knowledge. Nature, (7676), 354.
Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., et al. 2017. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815.
Sutton, R. S., and Barto, A. G. 2018. Reinforcement learning: An introduction. MIT Press.
Vinyals, O., Ewalds, T., Bartunov, S., Georgiev, P., Vezhnevets, A. S., Yeo, M., et al. 2017. StarCraft II: A new challenge for reinforcement learning. arXiv preprint arXiv:1708.04782.
Vinyals, O., Babuschkin, I., Chung, J., Mathieu, M., Jaderberg, M., Czarnecki, W. M., et al. 2019. AlphaStar: Mastering the real-time strategy game StarCraft II. DeepMind blog, 2.
Wang, Z., Bapst, V., Heess, N., Mnih, V., Munos, R., Kavukcuoglu, K., and Freitas, N. 2017. Sample efficient actor-critic with experience replay. In International Conference on Learning Representations (ICLR).
Yoshida, S., Ishihara, M., Miyazaki, T., Nakagawa, Y., Harada, T., and Thawonmas, R. 2016. Application of Monte-Carlo tree search in a fighting game AI. In