Multiplayer AlphaZero
Nick Petosa
School of Interactive Computing, Georgia Institute of Technology
[email protected]
Tucker Balch
School of Interactive Computing, Georgia Institute of Technology
[email protected]
(On leave at J.P. Morgan AI Research.)
Abstract
The AlphaZero algorithm has achieved superhuman performance in two-player, deterministic, zero-sum games where perfect information of the game state is available. This success has been demonstrated in Chess, Shogi, and Go, where learning occurs solely through self-play. Many real-world applications (e.g., equity trading) require the consideration of a multiplayer environment. In this work, we suggest novel modifications of the AlphaZero algorithm to support multiplayer environments, and evaluate the approach in two simple 3-player games. Our experiments show that multiplayer AlphaZero learns successfully and consistently outperforms a competing approach: Monte Carlo tree search. These results suggest that our modified AlphaZero can learn effective strategies in multiplayer game scenarios. Our work supports the use of AlphaZero in multiplayer games and suggests future research for more complex environments.
Accepted at the Workshop on Deep Reinforcement Learning at the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Introduction

DeepMind’s AlphaZero algorithm is a general learning algorithm for training agents to master two-player, deterministic, zero-sum games of perfect information [8]. Learning is done tabula rasa: training examples are generated exclusively through self-play, without the use of expert trajectories. Unlike its predecessor AlphaGo Zero, AlphaZero is designed to work across problem domains [7]. DeepMind has demonstrated AlphaZero’s generality by training state-of-the-art AlphaZero agents for Go, Shogi, and Chess. This result suggests that AlphaZero is applicable to other games and real-world challenges. In this paper, we explore AlphaZero’s generality further by evaluating its performance on simple multiplayer games.

Our approach is to extend the original two-player AlphaZero algorithm to support multiple players through novel modifications to its tree search and neural network architecture. Since the AlphaZero source code is not released, we implemented a single-threaded version of AlphaZero from scratch using Python 3 and PyTorch, based on DeepMind’s papers [8, 7]. Our implementation is available at https://github.com/petosa/multiplayer-alphazero. There are several notable reimplementations of DeepMind’s AlphaGo Zero algorithm by the research community, such as Leela Zero and ELF [1, 9]. However, these implementations are designed and optimized for reproducing DeepMind’s results on Go, not for general experimentation with the algorithm.

Our contribution is threefold. First, we produce an independent reimplementation of DeepMind’s AlphaZero algorithm. Second, we extend the original algorithm to support multiplayer games. And third, we present the empirical performance of this extended algorithm on two multiplayer games using some novel evaluation metrics. We conclude that the AlphaZero approach can succeed in multiplayer problems. This paper will first introduce the original AlphaZero algorithm, then discuss our novel multiplayer extensions, and lastly discuss our experiments and results.

AlphaZero

The original, two-player AlphaZero can be understood as an algorithm that learns a board-quality heuristic to guide search over the game tree. This can be interpreted as acquiring an “instinct” for which board states and moves are likely to end in victory or defeat, and then leveraging that knowledge while computing the next move to make. The resulting informed search can pick a high-quality move in a fraction of the time and steps required by an uninformed search.

This “instinct heuristic” is the output of a deep convolutional neural network, which ingests the current board state as input and outputs two values [5]. The first output is the value head ($v$), the scalar utility of this board from the perspective of the current player. The second output is the policy head ($\vec{p}$), a probability distribution over legal actions from the current board state, where higher-probability actions should lead the current player to victory. Both $v$ and $\vec{p}$ inform a Monte Carlo tree search (MCTS) over the game tree.

MCTS is a search algorithm that traverses the game tree in an exploration/exploitation fashion [2]. At each state, it prioritizes making moves with high estimated utility, or moves that have not been well explored. The upper confidence bound for trees (UCT) heuristic is often used to balance exploration and exploitation during search [4].
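For reference, the UCT rule of [4] is commonly written as follows (our notation, added for context and not reproduced from the original text; $N(s)$ is the parent visit count and $c$ is an exploration constant):

$$a^{*} = \arg\max_{a} \left[ Q(s, a) + c \sqrt{\frac{\ln N(s)}{N(s, a)}} \right]$$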
Each iteration of MCTS from a board state is called a “rollout.” AlphaZero uses most of the standard MCTS algorithm, but with a few key changes:

1. UCT is replaced with the following (state, action)-pair heuristic in MCTS to decide which move to search next (sketched in code below):

$$Q(s, a) + c_{puct} \, P(s, a) \, \frac{\sqrt{\sum_b N(s, b)}}{1 + N(s, a)}$$

where $Q$ is the average reward experienced for this move, $N$ is the number of times this move has been taken, $P$ is the policy head value, and $c_{puct}$ is an exploration constant. The design of this heuristic trades off exploration of under-visited moves with exploitation of the value and policy heads from the network.

2. Random rollouts are removed. Instead of rolling out to a terminal state, the value head $v$ is treated as the approximate value of a rollout. Because of this, during the backpropagation step of MCTS, $v$ gets incorporated into $Q$.

The result is an “informed MCTS” which incorporates the outputs of the neural network to guide search. To train, AlphaZero operates in a cycle of policy evaluation and policy improvement. AlphaZero requires full access to a simulator of the environment.
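A minimal Python sketch of the modified selection rule above (the `Node` fields and the value of `c_puct` are our own assumptions, not taken from the paper's implementation):

import math

C_PUCT = 3.0  # exploration constant; placeholder value, not specified here

def select_child(node):
    """Pick the child maximizing Q(s,a) + c_puct * P(s,a) * sqrt(sum_b N(s,b)) / (1 + N(s,a))."""
    parent_visits = sum(child.visit_count for child in node.children)
    best_child, best_score = None, -math.inf
    for child in node.children:
        # Q(s,a): average reward experienced for this move (0 if unvisited)
        q = child.total_value / child.visit_count if child.visit_count > 0 else 0.0
        # Exploration bonus, weighted by the policy head's prior P(s,a)
        u = C_PUCT * child.prior * math.sqrt(parent_visits) / (1 + child.visit_count)
        if q + u > best_score:
            best_child, best_score = child, q + u
    return best_child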
Policy evaluation
Training proceeds as follows. Our game starts at an initial board state ($s_0$). From here, several rollouts of MCTS are run to discover a probability distribution ($\vec{\pi}$) across valid actions. Uniform Dirichlet noise is added to $\vec{\pi}$ to encourage exploration (this is only done for the first move of the game). We then take our turn by sampling from $\vec{\pi}$ to get to $s_1$, and repeat the process until a terminal state is encountered. The terminal state outcome $z$ will be 1 for a win, -1 for a loss, or 0 for a tie from the perspective of the current player. A condensed sketch of this loop follows.
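The sketch below condenses one self-play game under the rules above (`run_mcts` and `add_dirichlet_noise` are hypothetical helper names standing in for the implementation):

import numpy as np

def self_play_game(game, network, rollouts_per_move=50):
    """Play one game of self-play, recording (state, pi) pairs for training."""
    examples = []
    state = game.initial_state()
    first_move = True
    while not game.is_terminal(state):
        # Hypothetical helper: run MCTS rollouts and return the visit-count
        # distribution pi over all actions (zeros for illegal moves).
        pi = run_mcts(game, network, state, rollouts_per_move)
        if first_move:  # noise is only added on the first move of the game
            pi = add_dirichlet_noise(pi)  # hypothetical helper
            first_move = False
        examples.append((state, pi))
        action = np.random.choice(len(pi), p=pi)  # take our turn by sampling from pi
        state = game.next_state(state, action)
    z = game.outcome(state)  # 1 for a win, -1 for a loss, 0 for a tie
    return [(s, pi, z) for (s, pi) in examples]  # samples destined for the replay buffer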
Policy improvement

After a game ends, we generate training samples for each turn of the game ($s_i$, $\vec{\pi}_i$, $z$) and add them to an experience replay buffer. After several games, we sample batches from the buffer to update our network parameters by minimizing the following loss function, which is just a sum of cross-entropy loss and mean squared error (a PyTorch sketch of this loss follows below):

$$L = (z - v)^2 - \vec{\pi}^{\top} \log \vec{p}$$

Through this cycle, AlphaZero refines its heuristic after each iteration and snowballs into a strong player.
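A minimal PyTorch sketch of this loss (tensor names are ours; `p_logits` is the policy head's pre-softmax output):

import torch.nn.functional as F

def alphazero_loss(v, z, p_logits, pi):
    """Mean squared error on the value head plus cross-entropy on the policy head."""
    value_loss = F.mse_loss(v, z)  # (z - v)^2, averaged over the batch
    policy_loss = -(pi * F.log_softmax(p_logits, dim=1)).sum(dim=1).mean()  # -pi^T log p
    return value_loss + policy_loss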
Multiplayer AlphaZero

Several changes are made to MCTS and the neural network for AlphaZero to support multiplayer games:

1. MCTS now rotates over the full list of players during play instead of alternating between two players.

2. Instead of completed games returning an outcome $z$, they now return a score vector ($\vec{z}$), indicating the scores of each player. For example, in a 3-player game of Tic-Tac-Toe, a tie might return [0, 0, 0] and a first-player win might return [1, -1, -1]. Note from the latter example that we are incidentally relaxing the zero-sum constraint on games. In fact, this opens the door for games that do not have binary win/lose outcomes, but this is not the focus of our work.

3. In two-player AlphaZero, the value of a state from the perspective of one player is the negation of the value for the other player. With a score vector, each player can have its own score. So when backpropagating value, MCTS uses the corresponding score in $\vec{v}$ for each player instead of flipping the sign of a scalar $v$ (a sketch appears below).

4. Instead of the value head of the neural network predicting a scalar value, it now predicts a value vector ($\vec{v}$), which contains the expected utility of a state for each player. The size of the vector equals the number of players in the game. The loss function is updated to account for this change, where $n$ is the number of players:

$$L = \frac{1}{n} \sum_{i=1}^{n} (z_i - v_i)^2 - \vec{\pi}^{\top} \log \vec{p}$$

The neural network is now trained on ($s_i$, $\vec{\pi}_i$, $\vec{z}$) tuples since value is a vector. An illustration of this change from the standard two-player case to the novel multiplayer case is shown in figure 1 for a 3-player variant of Tic-Tac-Toe.

Figure 1: The change in neural network structure with the novel multiplayer approach.

The aforementioned changes to make MCTS multiplayer have been described in previous literature as MCTS-max$^n$ [6].
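The per-player backpropagation from change 3 can be sketched as follows (field names are our own; `path` is the sequence of nodes visited during a rollout, and `value_vector` is the network's $\vec{v}$ or the terminal $\vec{z}$):

def backpropagate(path, value_vector):
    """Credit each node on the search path with its own player's entry of the
    value vector, rather than flipping the sign of a scalar value between plies."""
    for node in path:
        node.visit_count += 1
        node.total_value += value_vector[node.player_to_move]

The two-player loss sketch above carries over directly: `v` and `z` simply become length-$n$ vectors per sample.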
We define several metrics of success in training effective agents:

• Does the neural network successfully converge? A stable, decreasing loss function indicates training is proceeding as anticipated. Divergence likely indicates the network is low-capacity or over-regularized, as it cannot explain the growing variance of experience data.

• Does the agent outperform an MCTS given the same number of rollouts? Since the AlphaZero agent starts off as a standard MCTS agent and improves from there, it should outperform an MCTS agent given the same number of rollouts per turn. This experiment tests whether the experimental agent (AlphaZero) outperforms the control agent (MCTS).

• Does the agent outperform an MCTS given more rollouts (up to a point)? Since AlphaZero’s heuristic enables it to efficiently search the game tree, it should perform as well as or better than some MCTS agents that are given additional rollouts. Since AlphaZero only plays against itself during training, this experiment tests the generality of the learned strategy.

• Does the agent outperform a human? There are no human experts for the games we have created, but victory against a competent human opponent confirms that a reasonably strong and general strategy was learned.

Using these criteria to evaluate multiplayer agents, we train AlphaZero to play multiplayer versions of Tic-Tac-Toe and Connect 4.

Experiments
Testbed.
We have implemented multiplayer AlphaZero entirely in Python 3 using PyTorch. Unlike DeepMind’s AlphaZero, we do not parallelize computation or optimize the efficiency of our code beyond vectorizing with numpy. All experiments were run on a desktop machine containing an i9-9900k processor and an RTX 2080 Ti GPU. Our biggest limitation was compute: DeepMind trained AlphaZero to master Go in 13 days with 5,000 first-generation TPUs and 16 second-generation TPUs, but with our hardware, that result would take years to replicate [8]. For this reason, we experiment on multiplayer games with small state and action spaces to make this project feasible. Even on these simpler games, training takes over 15 hours. Future research with access to more compute can expand on our results by evaluating performance on more complex multiplayer games.
Hyperparameters.
The same hyperparameters are used across all games and experiments. For the neural network, we use a squeeze-and-excitation model, which has been shown to outperform existing DCNN architectures by modeling channel interdependencies with only a slight increase in model complexity [3]. The specific SENet architecture used in this project consists of 8 SE-PRE blocks and two heads (value and policy).

Table 1: Hyperparameters

Hyperparameter      Ours     DeepMind
Network             SENet    ResNet
L2 regularization   1e-4     1e-4
Batch size          64       2048
Optimizer           ADAM     SGD + Momentum
Learning rate       1e-3     1e-2 (annealed)
Replay buffer size  ∞        ?
c_puct              ?        ?
Dirichlet α         .1       ?

Multiplayer Tic-Tac-Toe

Our multiplayer Tic-Tac-Toe game, dubbed “Tic-Tac-Mo,” adds an additional player to Tic-Tac-Toe but keeps the 3-in-a-row win condition. To make games more complicated, the size of the board is expanded to 3x5 instead of 3x3. Games can therefore last up to 15 turns. Players receive a score of 1 for a win, 0 for a tie, and -1 for a loss. The state representation of the board fed into the neural network is depicted in figure 2a. We trained the AlphaZero algorithm by having it play Tic-Tac-Mo against itself for about 18 hours.
Does the neural network successfully converge?
The loss curve for the underlying heuristic network is shown in figure 2b. The loss decreases and stabilizes as AlphaZero goes through more iterations.

(a) The state representation of a Tic-Tac-Mo board passed into the neural network. Size is 3x5x6. Each player owns one "piece location" plane and one "turn indicator" plane. (b) The loss of our SENet steadily converges.
Figure 2: State representation and loss curve for Tic-Tac-Mo.
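For concreteness, the 3x5x6 plane encoding described in figure 2a can be sketched as follows (the exact plane ordering is our own assumption):

import numpy as np

def encode_board(board, player_to_move, num_players=3):
    """Encode a 3x5 board (entries are player ids 0..2, or -1 for empty) as a
    3x5x6 tensor: per player, one piece-location plane and one turn-indicator plane."""
    rows, cols = board.shape
    planes = np.zeros((rows, cols, 2 * num_players), dtype=np.float32)
    for p in range(num_players):
        planes[:, :, 2 * p] = (board == p)                    # piece-location plane
        planes[:, :, 2 * p + 1] = float(p == player_to_move)  # turn-indicator plane
    return planes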
Does the agent outperform an MCTS given more rollouts (up to a point)?
We compare the scores between AlphaZero and MCTS opponents of increasing strength. Here, “increasing strength” means that after each match, MCTS gets more rollouts to search the game tree, while AlphaZero’s computation remains fixed at 50 rollouts. Each game pits 2 MCTS agents of equal strength against our AlphaZero agent. For each match, a total of 6 games is played between the same opponents: one game for each permutation of players, to break any advantages from going first, second, or third. Figure 3a plots the scores of AlphaZero against the scores of MCTS for each match. We find AlphaZero convincingly defeats MCTS agents that have few rollouts, but the scores start to converge as MCTS strength increases.
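This seating protocol can be sketched as follows (`play_game` is a hypothetical helper that runs one 3-player game and returns each seat's score):

from itertools import permutations

def play_match(alphazero, mcts_a, mcts_b):
    """Play 6 games, one per seating order, to cancel first/second/third-move advantages."""
    agents = {"alphazero": alphazero, "mcts_1": mcts_a, "mcts_2": mcts_b}
    totals = {name: 0 for name in agents}
    for seating in permutations(agents):  # all 3! = 6 orderings of the three agents
        scores = play_game([agents[name] for name in seating])  # hypothetical helper
        for name, score in zip(seating, scores):
            totals[name] += score
    return totals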
Does the agent outperform an MCTS given the same number of rollouts?
A control MCTS agent given the same number of rollouts as AlphaZero (50 rollouts) is also played against MCTS opponents of increasing strength. The difference in points between the control and its opponents is plotted alongside the difference in points between AlphaZero and its opponents (figure 3b). We find AlphaZero always performs better than the control MCTS when playing against the same opponents.
Does the agent outperform a human?
We had several graduate students play Tic-Tac-Mo against two AlphaZero agents (table 2). We described the game to this small group of students and gave them a chance to practice before having each student play 6 games against AlphaZero opponents (the order of players was permuted each round). For these AlphaZero agents, we increased their rollouts to 500; if the learned heuristic function is general enough, it should scale given more search time and develop extremely strong strategies. In total, AlphaZero tied 42% of the games and won 58% of the games, as shown in table 2.
Results.
Multiplayer AlphaZero for Tic-Tac-Mo suffered no defeats against computer or human opponents. The decreasing network loss function is an indication that the network trained successfully, continually improving its estimates of policy and value while incorporating new experience. With just 50 rollouts, AlphaZero has performance equivalent to at least a 3000-rollout MCTS, and can pick a move in a fraction of the time. Our control experiment indicates that the learned heuristic is necessary and useful, leading us to believe our multiplayer AlphaZero algorithm successfully encoded knowledge of the game into its heuristic, creating a powerful Tic-Tac-Mo agent.

(a) AlphaZero and opponent scores accumulated over six games as opponent rollouts increase. AlphaZero’s rollouts remain fixed at 50, while its MCTS opponents use an increasing number of rollouts. The two MCTS agents have identical performance across each match. (b) Score difference as opponent rollouts increase for AlphaZero and a control MCTS using the same number of rollouts. Score difference is the difference between our score and the opponent’s score. Score differences less than 0 indicate more games lost than won, score differences greater than 0 indicate more games won than lost, and score differences of 0 indicate equal wins and losses.

Figure 3: Tic-Tac-Mo experiments against MCTS opponents of increasing strength.

Table 2: Summary of human performance against AlphaZero, Tic-Tac-Mo.

Human     Wins  Ties  Losses
Human 1   0     1     5
Human 2   0     4     2
Human 3   0     3     3
Human 4   0     2     4
Totals    0     10    14
Multiplayer Connect 4

Our multiplayer Connect 4 game, dubbed “Connect 3x3,” adds an additional player to the game and changes the win condition to 3-in-a-row instead of 4-in-a-row. The size of the board remains 6x7. We believe Connect 3x3 to be a harder game to learn than Tic-Tac-Mo: games can last up to 42 turns as opposed to 15, so the game tree is much deeper. Players receive a score of 1 for a win, 0 for a tie, and -1 for a loss. The state representation of the board fed into the neural network is depicted in figure 4a. We trained the algorithm by having it play Connect 3x3 against itself for about 18 hours.
Does the neural network successfully converge?
The loss curve for the network is shown in figure 4b. Error does not steadily decrease over time but remains relatively stable.
Does the agent outperform an MCTS given more rollouts (up to a point)?
We run the same experiment as described for Tic-Tac-Mo, but now for Connect 3x3 (figure 5a). As with Tic-Tac-Mo, AlphaZero ties or outperforms each MCTS opponent. Unlike Tic-Tac-Mo, we do not see a converging score gap between MCTS and AlphaZero; instead, the score appears to oscillate as MCTS increases in strength.

(a) The state representation of a Connect 3x3 board passed into the neural network. Size is 6x7x6. Each player owns one "piece location" plane and one "turn indicator" plane. (b) The loss of our SENet is stable.
Figure 4: State representation and loss curve for Connect 3x3.
Does the agent outperform an MCTS given the same number of rollouts?
We again run a control MCTS agent and compare it to our AlphaZero agent (figure 5b). From these results, we see that the control mostly loses to stronger MCTS agents while AlphaZero maintains a non-negative score difference. In general, AlphaZero outperforms the control; however, there is one blip where the control outperforms AlphaZero.

(a) AlphaZero and opponent scores over six games as opponent rollouts increase. (b) Score difference as opponent rollouts increase for AlphaZero and a control.
Figure 5: Connect 3x3 experiments against MCTS opponents of increasing strength.
Does the agent outperform a human?
Finally, we had the same group of graduate students play against two Connect 3x3 AlphaZero agents (table 3). In total, AlphaZero won 79% of the games and lost 21% of the games.

Table 3: Summary of human performance against AlphaZero, Connect 3x3.

Human     Wins  Ties  Losses
Human 1   0     0     6
Human 2   0     0     6
Human 3   3     0     3
Human 4   2     0     4
Totals    5     0     19

Results.
Multiplayer AlphaZero trains a strong agent to play Connect 3x3 which wins or ties most games. However, unlike with Tic-Tac-Mo, we do have humans who are able to defeat AlphaZero, and a case where the control MCTS agent outperforms AlphaZero. Both of these measurements indicate that AlphaZero did not perfect its neural network board-quality heuristic and master Connect 3x3. But the learned heuristic, though fallible, still successfully encodes knowledge of the game into search. With just 50 rollouts, AlphaZero meets or beats its MCTS opponents, and typically outperforms a control MCTS agent given the same number of rollouts. And though the loss function is not decreasing, it does not diverge either. Since training data is continually added to the replay buffer, this indicates knowledge is being incorporated and generalized into the network.

Our results for Connect 3x3 indicate that the overall multiplayer AlphaZero strategy works, but more hyperparameter tuning is needed to truly master complex games.
Conclusion

In this paper we propose a novel modification to the AlphaZero algorithm that enables it to train multiplayer agents through self-play. Our experiments show that AlphaZero can be successfully applied to multiplayer games, but more careful hyperparameter tuning is necessary to achieve stronger agents. We define measures of success that can be applied in future AlphaZero research, and create an independent AlphaZero reimplementation with a multiplayer modification. Given more computation, future work should include experiments on games with more players, more board states, and more actions. Other research directions might investigate the effectiveness of AlphaZero when other constraints, such as the zero-sum, deterministic, or perfect-information assumptions, are lifted.

References

[1] Leela Zero. https://github.com/leela-zero/leela-zero.
[2] Rémi Coulom. Efficient selectivity and backup operators in Monte-Carlo tree search. In International Conference on Computers and Games, pages 72–83. Springer, 2006.
[3] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7132–7141, 2018.
[4] Levente Kocsis and Csaba Szepesvári. Bandit based Monte-Carlo planning. In European Conference on Machine Learning, pages 282–293. Springer, 2006.
[5] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[6] Joseph Antonius Maria Nijssen. Monte-Carlo Tree Search for Multi-Player Games. PhD thesis, Maastricht University, 2013.
[7] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017.
[8] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419):1140–1144, 2018.
[9] Yuandong Tian, Jerry Ma, Qucheng Gong, Shubho Sengupta, Zhuoyuan Chen, James Pinkerton, and C. Lawrence Zitnick. ELF OpenGo: an analysis and open reimplementation of AlphaZero. arXiv preprint arXiv:1902.04522, 2019.