Mastering Terra Mystica: Applying Self-Play to Multi-agent Cooperative Board Games
Luis Perez
Stanford University
450 Serra Mall
[email protected]
Abstract
In this paper, we explore and compare multiple algorithms for solving the complex strategy game of Terra Mystica, hereafter abbreviated as TM. Previous work in the area of super-human game-play using AI has proven effective, with recent breakthroughs for generic algorithms in games such as Go, Chess, and Shogi [4]. We apply these breakthroughs directly to a novel state-representation of TM with the goal of creating an AI that will rival human players. Specifically, we present initial results of applying AlphaZero to this state-representation and analyze the strategies developed. A brief analysis is presented. We call this modified algorithm with our novel state-representation AlphaTM. In the end, we discuss the successes and shortcomings of this method by comparing against multiple baselines and typical human scores. All code used for this paper is available on GitHub.
1. Background and Overview
In this paper, we provide an overview of the infrastructure, framework, and models required to achieve super-human level game-play in the game of Terra Mystica (TM) [1], without any of its expansions. (There are multiple expansions, most of which consist of adding different Factions to the game or extending the TM game map. We hope to have the time to explore handling these expansions, but do not make this part of our goal.) The game of TM involves very little luck and is entirely based on strategy, similar to Chess, Go, and other games which have recently yielded to novel Reinforcement Learning techniques such as Deep Q-Learning and Monte-Carlo Tree Search as a form of Policy Improvement [3] [5]. In fact, the only randomness arises from pre-game set-up, such as players selecting different Factions or a different set of end-of-round bonus tiles being selected. (A Faction essentially restricts each Player to particular areas of the map, as well as to special actions and cost functions for building.)

TM is a game played between 2-5 players. For our research, we focus mostly on the adversarial 2-player version of the game. We do this mostly for computational efficiency, though for each algorithm we present, we briefly discuss strategies for generalizing it to multiple players. We also do this so we can stage the game as a zero-sum two-player game, where players are rewarded for winning/losing.

TM is a fully-deterministic game whose complexity arises from the large branching factor and a large number of possible actions A_t from a given state S_t. There is further complexity caused by the way in which actions can interact, discussed further in Section 1.2.

In order to understand the inputs and outputs of our system, the game of TM must be fully understood. We lay out the important aspects of a given state below, according to the standard rules [2].

The game of TM consists of a terrain board split into terrain tiles. The board is fixed, but each terrain tile can be terra-formed (changed) into any of the 7 distinct terrains (plus water, which cannot be modified). Players can only expand onto terrain which belongs to them. Furthermore, TM also has a mini-game which consists of a Cult Board where individual players can choose to move up in each of the cult tracks throughout game-play.

The initial state of the game consists of players selecting initial starting positions for their original dwellings. At each time-step, the player has a certain amount of resources, consisting of Workers, Priests, Power, and Coin. The player also has an associated number of VPs.

Throughout the game, the goal of each player is to accumulate as many VPs as possible. The player with the highest number of VPs at the end of the game is the winner.

From the definition above, the main emphasis of our system is the task of taking a single state representation S_t at a particular time-step and outputting an action for the current player to take to continue game-play. As such, the input of our system consists of the following information, fully representative of the state of the game:

1. Current terrain configuration. The terrain configuration consists of multiple pieces of information. For each terrain tile, we receive as input:
(a) The current color of the tile. This gives us information not only about which player currently controls the terrain, but also which terrains can be expanded into.
(b) The current level of development for the terrain.
For the state of development, we note that each terrain tile can be one of (1) UNDEVELOPED, (2) DWELLING, (3) TRADING POST, (4) TEMPLE, (5) SANCTUARY, or (6) STRONGHOLD.
(c) The current end-of-round bonus as well as future end-of-round bonus tiles.
(d) Which special actions are currently available for use.

2. For each player, we also receive the following information:
(a) The current level of shipping ability, the current level of spade ability, and the current number of VPs that the player has.
(b) The current number of towns the player has (as well as which towns are owned), the current number of workers available to the player, the current number of coins available to the player, and the current number of LV1, LV2, and LV3 power tokens.
(c) The current number of priests available to the player.
(d) Which bonus tiles are currently owned by the player.
(e) The amount of income the player currently produces. This is simply the power, coin, priest, and worker income for the player.

The above is a brief summary of the input to our algorithm. However, in general, the input to the algorithm is a complete definition of the game state at a particular turn. Note that Terra Mystica does not have any dependencies on previous moves and is completely Markovian. As such, modeling the game as an MDP is fully realizable, and is simply a question of incorporating all the features of the state.

For a given state, the output of our algorithm will consist of an action which the player can take to continue to the next state of the game. Actions in TM are quite varied, and we do not fully enumerate them here. In general, however, there are eight possible actions:
1. Convert and build.
2. Advance on the shipping track.
3. Advance on the spade track.
4. Upgrade a building.
5. Sacrifice a priest and move up on a cult track.
6. Claim a special action from the board.
7. Some other special ability (varies by faction).
8. Pass and end the current turn.

We will evaluate our agents using the standard simulator. The main metric for evaluation will be the maximum score achieved by our agent in self-play when winning, as well as the maximum score achieved against a set of human competitors.
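To make the interface above concrete, the following sketch shows one possible way to encode the eight top-level action categories and the per-player quantities listed above in Python. The class and field names are our own illustrative choices, not those of an existing implementation.

    from dataclasses import dataclass, field
    from enum import Enum, auto

    class ActionType(Enum):
        """The eight top-level action categories available on a player's turn."""
        CONVERT_AND_BUILD = auto()
        ADVANCE_SHIPPING = auto()
        ADVANCE_SPADES = auto()
        UPGRADE_BUILDING = auto()
        SEND_PRIEST_TO_CULT = auto()
        CLAIM_POWER_ACTION = auto()
        SPECIAL_ABILITY = auto()
        PASS = auto()

    @dataclass
    class PlayerState:
        """Per-player resources and progress tracked in the game state."""
        workers: int = 0
        priests: int = 0
        coins: int = 0
        power_bowls: tuple = (0, 0, 0)   # power tokens in bowls I, II, III
        victory_points: int = 0
        shipping_level: int = 0
        spade_level: int = 0
        cult_positions: dict = field(default_factory=dict)  # cult name -> track position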
2. Experimental Approach
Developing an AI agent that can play TM well is extremely challenging. Even current heuristic-based agents have difficulty scoring positions. The state-action space for TM is extremely large: games typically have trees that are many moves deep (per player) with a large branching factor.

We can approach this game as a typical min-max search problem. Simple approaches would be depth-limited alpha-beta pruning, similar to what we used in PacMan. These approaches can be tweaked for efficiency, and are essentially what the current AIs use. Further improvement can be made to these approaches by attempting to improve on the Eval functions.

However, the main contribution of this paper will be to apply more novel approaches to a custom state-space representation of the game. In fact, we will be attempting to apply Q-Learning – specifically DQNs (as per [3], [5], and [4]).
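For reference, the baseline approach mentioned above can be sketched as a standard depth-limited alpha-beta search. The eval_fn, actions_fn, successor_fn, and is_end_fn callbacks are placeholders for game-specific logic and are assumptions of this sketch, not part of our implementation.

    import math

    def alpha_beta(state, depth, alpha, beta, maximizing,
                   eval_fn, actions_fn, successor_fn, is_end_fn):
        """Depth-limited alpha-beta search returning a heuristic value for `state`."""
        if depth == 0 or is_end_fn(state):
            return eval_fn(state)
        if maximizing:
            value = -math.inf
            for action in actions_fn(state):
                value = max(value, alpha_beta(successor_fn(state, action), depth - 1,
                                              alpha, beta, False, eval_fn, actions_fn,
                                              successor_fn, is_end_fn))
                alpha = max(alpha, value)
                if alpha >= beta:
                    break  # beta cut-off
            return value
        else:
            value = math.inf
            for action in actions_fn(state):
                value = min(value, alpha_beta(successor_fn(state, action), depth - 1,
                                              alpha, beta, True, eval_fn, actions_fn,
                                              successor_fn, is_end_fn))
                beta = min(beta, value)
                if alpha >= beta:
                    break  # alpha cut-off
            return value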
Existing open-source AIs for TM are based on a combination of sophisticated search techniques (such as depth-limited alpha-beta search, domain-specific adaptations, and handcrafted evaluation functions refined by expert human players). Most of these AIs fail to play at a competitive level against human players. The space of open-source AIs is relatively small, mainly due to the newness of TM.

2.2. AlphaZero for TM
In this section, we describe the main methods we use for training our agent. In particular, we place heavy emphasis on the methods described by AlphaGo [3], AlphaGoZero [5], and AlphaZero [4], with pertinent modifications made for our specific problem domain.

Our main method is a modification of the AlphaZero [4] algorithm. We chose this algorithm over the methods described for AlphaGo [3] for two main reasons:
1. The AlphaZero algorithm is a zero-knowledge reinforcement learning algorithm. This is well-suited for our purposes, given that we can perfectly simulate game-play.
2. The AlphaZero algorithm is a simplification over the dual-network architecture used for AlphaGo.

As such, our goal is to demonstrate and develop a slightly modified general-purpose reinforcement learning algorithm which can achieve super-human performance tabula rasa on TM.
We first introduce some of the TM-specific challenges our algorithm must overcome.

1. Unlike the game of Go, the rules of TM are not translationally invariant. The rules of TM are position-dependent – the most obvious way of seeing this is that each terrain tile and pattern of terrains is different, making certain actions impossible (or extremely costly) from certain positions. This is not particularly well-suited to the weight-sharing structure of Convolutional Neural Networks.

2. Unlike the game of Go, the rules for TM are asymmetric. We can, again, trivially see this by noting that the game board itself (Figure 7) has little symmetry.

3. The game board is not easily quantized to exploit positional advantages. Unlike games where the AlphaZero algorithm has been previously applied (such as Go/Shogi/Chess), the TM map is not rectangular. In fact, each "position" has six neighbors, which is not easily representable in matrix form for CNNs.

4. The action space is significantly more complex and hierarchical, with multiple possible "mini"-games being played. Unlike other games where similar approaches have been applied, this action space is extremely complex. To see this, we detail the action spaces for other games below.

(a) The initial DQN approach for Atari games had an output action space of dimension 18 (though some games had only 4 possible actions, the maximum number of actions was 18, and this was represented simply as an 18-dimensional vector representing a softmax probability distribution).

(b) For Go, the output action space was similarly a
19 × 19 + 1 probability distribution over the locations on which to place a stone.

(c) Even for Chess and Shogi, the action space similarly consisted of all legal destinations of all the player's pieces on the board. While this is very expansive and more similar to what we expect for TM, TM nonetheless has additional complexity in that some actions are inherently hierarchical. You must first decide if you want to build, then decide where to build, and finally decide what to build. This involves defining an output action space which is significantly more complex than anything we have seen in the literature. For comparison, in [4] the output space for Chess consists of a stack of planes of size 8 × 8 × 73. Each of the 64 positions identifies a piece to be moved, with the 73 associated layers identifying exactly how the piece will be moved. As can be seen, this is essentially a two-level decision tree (select a piece, followed by selecting how to move the piece). In TM, the action space is far more varied.

5. The next challenge is that TM is not a binary win-lose situation, as is the case in Go. Instead, we must seek to maximize our score relative to other players. Additionally, in TM, there is always the possibility of a tie.

6. Another challenge present in TM but not in the other stated games is the fact that there exist a limited number of resources in the game. Each player has a limited number of workers/priests/coins with which a sequence of actions must be selected.

7. Furthermore, TM is a multi-player game (not two-player). For our purposes, however, we leave exploring this problem to later research. We focus exclusively on a game between two fixed factions (Engineers and Halflings).

Unless otherwise specified, we leave the training and search algorithms largely unmodified from those presented in [4] and [5]. We will nonetheless describe the algorithm in detail in subsequent sections. For now, we focus on presenting the input representation of our game state.

2.3.1. The Game Board
We begin by noting that the TM game board (Figure 7) is naturally represented as a hexagonal grid. As mentioned in the challenges section, this presents a unique problem, since each intuitive "tile" has six neighbors rather than the four found in Go, Chess, and Shogi. Furthermore, unlike chess, where a natural dilation of the convolution will cover the additional tangent spots equally, the hexagonal nature makes TM particularly interesting.

However, a particular peculiarity of TM is that we can think of each "row" of tiles as being shifted by "half" a tile, thereby becoming "neighbors". With this approach, we chose to instead represent the TM board as a rectangular grid in which each tile is horizontally doubled. Our terrain representation of the map then begins as a stack of binary layers over this grid, one layer per terrain type. The main types are { PLAIN, SWAMP, LAKE, FOREST, MOUNTAIN, WASTELAND, DESERT }. It is a possible action to "terraform" any of these tiles into any of the other available terrains, which is why we must maintain all of them as part of our configuration. The final layer actually remains constant throughout the game, as this layer represents the waterways and cannot be modified. Note that the even rows (B, D, F, H) are padded with WATER tiles at the edges.

The next feature we tackle is the representation of the structures which can be built on each terrain tile. As part of the rules of TM, a structure can only exist on a terrain which corresponds to its player's terrain. As such, for each tile we only need to consider the five possible structures, { DWELLING, TRADING POST, SANCTUARY, TEMPLE, STRONGHOLD }. We encode these as an additional five layers in our grid.

We now proceed to add a set of constant layers. First, to represent each of the 5 special actions, we add 5 constant layers, each of which is either all ones or all zeros, signifying whether a particular action is still available or has already been taken. To represent the scoring tiles (of which there are 8), we add constant layers (either all ones or all zeros) indicating their presence in each of the rounds. For favor tiles, there are 12 distinct favor tiles; we add 12 layers, each specifying the number of that favor tile remaining. For the bonus tiles, we add constant layers specifying which bonus cards were selected for this game (only P + 3 cards are ever in play). This completes the game-board portion of our representation.
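As an aside, a minimal sketch of how the terrain layers and the hexagonal adjacency described above might be realized in code is shown below. The grid dimensions and helper names are illustrative assumptions only; they are not the exact values used in our representation.

    import numpy as np

    TERRAIN_TYPES = ["PLAIN", "SWAMP", "LAKE", "FOREST",
                     "MOUNTAIN", "WASTELAND", "DESERT", "WATER"]
    ROWS, COLS = 9, 26          # illustrative doubled-width grid, not the paper's exact numbers

    def terrain_planes(board):
        """Encode a {(row, col): terrain_name} map as one binary plane per terrain type."""
        planes = np.zeros((len(TERRAIN_TYPES), ROWS, COLS), dtype=np.float32)
        for (r, c), terrain in board.items():
            planes[TERRAIN_TYPES.index(terrain), r, c] = 1.0
        return planes

    def hex_neighbors(r, c):
        """Six hexagonal neighbors of a tile when each hex occupies two horizontal cells."""
        offsets = [(0, -2), (0, 2),        # left / right on the same row
                   (-1, -1), (-1, 1),      # upper-left / upper-right
                   (1, -1), (1, 1)]        # lower-left / lower-right
        return [(r + dr, c + dc) for dr, dc in offsets
                if 0 <= r + dr < ROWS and 0 <= c + dc < COLS]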
We now introduce another particularity of TM, which is the fact that each player has a different amount of resources, and these must be handled with care. This is something not treated in other games, since the resource limitation does not exist in Go, Chess, or Shogi (other than those fully encoded by the state of the board).

With that in mind, we move to the task of encoding each player. To be fully generic, we scale this representation with the number of players playing the game, in our case P = 2. To do this, for each player we add constant layers specifying: (1) number of workers, (2) number of priests, (3) number of coins, (4) power in bowl I, (5) power in bowl II, (6) power in bowl III, (7) the cost to terraform, (8) shipping distance, (9-12) positions in each of the 4 cult tracks, (13-17) number of buildings built of each type, (18) current score, (19) next-round worker income, (20) next-round priest income, (21) next-round coin income, (22) next-round power income, and (23) number of available bridges. This gives us a total of 23P additional layers required to specify information about the player resources.

Next, we consider representing the location of bridges. We add P layers, one per player, in a fixed order; each layer is a bit-mask representing the existence or absence of a bridge at a particular location.

We have already considered the positions of the players on the cult tracks. The only thing left is the tiles which each player may have. We add further layers per player: the first specify which bonus card the player currently holds, the next specify which favor tiles the player currently owns, and the last specify how many town tiles of each type the player currently holds.

We end with a stack of 48P layers to represent the P players. Finally, we add layers to specify which of the possible factions the neural network should play as. This gives us an input representation with (48P + 110) layers over the board grid. See Table 1, which places this into context. In our case (P = 2), this becomes a stack of 206 layers.
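As a small illustration of how the scalar per-player quantities can share a representation with the spatial board planes, the sketch below broadcasts each scalar into a constant plane and concatenates everything into a single input stack. It reuses the hypothetical PlayerState fields from the earlier sketch and, for brevity, shows only a subset of the 23 per-player quantities.

    import numpy as np

    def constant_plane(value, rows, cols):
        """Broadcast a scalar game quantity (e.g. a player's coin count) into a constant plane."""
        return np.full((1, rows, cols), float(value), dtype=np.float32)

    def player_planes(player, rows, cols):
        """Stack a subset of the per-player scalar features as constant planes."""
        scalars = [player.workers, player.priests, player.coins,
                   *player.power_bowls, player.victory_points,
                   player.shipping_level, player.spade_level]
        return np.concatenate([constant_plane(s, rows, cols) for s in scalars], axis=0)

    def encode_state(terrain, structures, players, rows, cols):
        """Concatenate board planes and per-player constant planes into one input stack."""
        per_player = [player_planes(p, rows, cols) for p in players]
        return np.concatenate([terrain, structures, *per_player], axis=0)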
Domain          Input Dimensions                          Total Size
Atari 2600      84 x 84 x 4                               28,224
Go              19 x 19 x 17                              6,137
Chess           8 x 8 x 119                               7,616
Shogi           9 x 9 x 362                               29,322
Terra Mystica   board grid x (48P + 110) (Section 2.3)
ImageNet        224 x 224 x 3                             150,528

Table 1. Comparison of input sizes across game domains. For reference, the typical CNN input for ImageNet is also included.

Terra Mystica is a complex game, where actions are significantly varied. In fact, it is not immediately obvious how to even represent all of the possible actions. We provide a brief overview of our approach here.

In general, there are 8 possible actions in TM which are, generally speaking, quite distinct. We output all possible actions and assign each a probability. Illegal actions are removed by setting their probabilities to zero and re-normalizing the remaining actions. Actions are considered legal as long as they can be legally performed during that turn (i.e., a player can and will burn power/workers/etc. in order to perform the required action). We could technically add additional actions for each of these possibilities, but this vastly increases the complexity.

1. Terraform and Build: This action consists of (1) selecting a location, (2) selecting a terrain to terraform into (if any), and (3) optionally choosing to build. We can represent this process as a location grid combined with 7 × 2 choices per location: the spatial component selects a location, the first 7 entries give the probability of terraforming into each of the 7 terrains without building, and the last 7 give the probability of terraforming into each of the 7 terrains and building.

2. Advancing on the Shipping Track: The player may choose to advance on the shipping track. This consists of a single additional value encoding the probability of advancement.

3. Lowering the Spade Exchange Rate: The player may choose to advance on the spade track. This consists of a single additional value encoding the probability of choosing to advance on the spade track.

4. Upgrading a Structure: This action consists of (1) selecting a location and (2) selecting which structure to upgrade to. Depending on the location and the existing structure, some actions may be illegal. We can represent this process as a vector specifying the location as well as the requested upgrade (DWELLING to TRADING POST, TRADING POST to STRONGHOLD, TRADING POST to TEMPLE, or TEMPLE to SANCTUARY). Note that when a structure is built, it is possible for the opponents to trade victory points for power. While this is an interesting aspect of the game, we ignore it for our purposes and assume players will never choose to take the additional power.

5. Send a Priest to the Order of a Cult: In this action, the player chooses to send a priest to one of the four possible cults. Additionally, the player must determine how many spaces the priest advances – some of these choices may be illegal moves. We can represent this as a small vector of probabilities over the cult and advancement combinations.

6. Take a Board Power Action: There are 6 available power actions on the board. We represent this as a 6-entry vector indicating which power action the player wishes to take. Each action can only be taken once per round.

7. Take a Special Action: There are multiple possible "special" actions a player may choose to take: for example, (1) the spade bonus tile, (2) the cult favor tile, as well as (3) special actions allowed by the faction. As such, we output a vector with one entry for each of the above-mentioned actions, many of which may be illegal.

8. Pass: The player may choose to pass. If the first to pass, the player becomes the first to go in the next round. For this action, the player must also choose which bonus tile to take. Several bonus tiles are possible (some of which won't be available, either because they were never in play or because the other players have taken them). As such, we represent this action by a vector over the bonus tiles.

9. Miscellaneous: At any point during game-play for this player, it may become the case that a town is founded. For each player round, we also output a vector of probabilities specifying which town tile to take in the event this has occurred.
These probabilities are normalized independently of the other actions, as they are not exclusive, though most of the time they will be ignored since towns are founded relatively rarely (two or three times per game).

As described above, this leads to a relatively complex action-space representation: we end up outputting a large, multi-part probability vector. We summarize the action-space representation in Table 2 and provide a few other methods for reference.
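A minimal sketch of the illegal-action masking and renormalization described above is given below; the flat policy vector and the legal_mask argument are illustrative assumptions about how the action space is laid out.

    import numpy as np

    def mask_and_renormalize(policy, legal_mask):
        """Zero out illegal actions and renormalize the remaining probabilities.

        `policy` is the raw probability vector produced by the network and
        `legal_mask` is a {0, 1} vector of the same length marking legal actions.
        """
        masked = policy * legal_mask
        total = masked.sum()
        if total > 0:
            return masked / total
        # Degenerate case: the network put all mass on illegal actions,
        # so fall back to a uniform distribution over the legal ones.
        return legal_mask / legal_mask.sum()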
Domain          Output Dimensions       Total Size
Atari 2600      18 x 1                  18
Go              19 x 19 + 1             362
Chess           8 x 8 x 73              4,672
Shogi           9 x 9 x 139             11,259
Terra Mystica   (see Section 2.4)
ImageNet        1 x 1,000               1,000

Table 2. Comparison of action-space sizes across game domains. For ImageNet, we consider the class distribution as the action space.

2.5. Deep Q-Learning with MCTS Algorithm and Modifications
In this section, we present our main algorithm and the modifications we have made so far to make it better suited to our state-space and action-space. We describe the algorithm in detail for completeness, despite the algorithm remaining mostly the same as that used in [4] and presented in detail in [5].
The main modifications to the algorithm are performed mostly on the architecture of the neural network.
1. We extend the concept introduced in [5] of "dual" heads to a "multi-head" design, thereby providing us with multiple differing final processing steps.
2. We modify the output and input dimensions accordingly.
We make use of a deep neural network (architecture described in Section 2.5.5), (p, m, v) = f_θ(s), with parameters θ, state representation s as described in Section 2.3, output action-space p as described in Section 2.4, additional action information m as described in Section 2.4.1, and a scalar value v that estimates the expected outcome z from position s, v ≈ E[z | s]. The values from this neural network are then used to guide MCTS, and the resulting move probabilities and game outcomes are used to iteratively update the weights of the network.

We provide a brief overview of the MCTS algorithm used, assuming some familiarity with how MCTS works. In general, there are three phases which need to be considered. For any given state S_t, which we call the root state (this is the current state of game-play), the algorithm simulates 800 iterations of game-play. We note that AlphaGoZero [5] uses 1,600 iterations and AlphaZero [4] also uses 800.

At each node, we perform a search until a leaf node is found. A leaf node is a game state which has never been encountered before. The search algorithm is relatively simple, as shown below and as illustrated in Figure 1. Note that the algorithm plays for the best possible move, with some bias given to moves with low visit counts.

Figure 1. Monte Carlo Tree Search: Selection Phase. During this phase, starting from the root node, the algorithm selects the optimal path until reaching an un-expanded leaf node. In the case above, we select the left action, then the right action, reaching a new, un-expanded board state.

    def Search(s):
        if IsEnd(s): return Reward(s)    # terminal state: return the true game outcome
        if IsLeaf(s): return Expand(s)   # unseen state: expand it and return the network value
        # Selection: pick the action maximizing the upper-confidence bound,
        # which biases the search toward high-value and rarely-visited moves.
        max_u, best_a = -INF, None
        for a in Actions(s):
            u = Q(s, a) + c * P(s, a) * sqrt(visit_count(s)) / (1 + action_count(s, a))
            if u > max_u:
                max_u, best_a = u, a
        v = Search(Successor(s, best_a))
        return v
From the above, we can see that Search uses the Expand algorithm as a sub-routine. The expansion algorithm is used to initialize non-terminal, unseen game states s as follows.

    def Expand(s):
        v, p = NN(s)            # query the neural network for a value and a policy prior
        InitializeCounts(s)     # 1 visit for the state, 0 for each action
        StoreP(s, p)            # cache the policy prior for later selection steps
        return v

Here, InitializeCounts simply initializes the counts for the new node (1 visit, 0 for each action). We also initialize all Q(s, a) = 0 and store the values predicted by our neural network. Intuitively, we have now expanded the depth of our search tree, as illustrated in Figure 2.

After the termination of the simulation (which ends either with a value v estimated by the neural network or with an actual reward), we back-propagate the result by updating the corresponding Q(s, a) values using the formula Q(s, a) = V(Succ(s, a)). This is outlined in Figure 3.

Figure 2. Monte Carlo Tree Search: Expansion Phase. During this phase, a leaf node is "expanded". This is where the neural network comes in. At the leaf node, we process the state S_L to retrieve p, v = f_θ(S_L), a vector of probabilities over the possible actions and a value estimate.

Figure 3. Monte Carlo Tree Search: Back-propagation Phase. During this phase, we use the value v estimated at the leaf node and propagate this information back up the tree (along the path taken) to update the stored Q(s, a) values.

The training can be summarized relatively straightforwardly. We batch N (with N = 128) tuples (s, p, v) and use these to train with the loss presented in AlphaZero [4]. We use c = 1 for our training. We perform back-propagation with this batch of data and continue our games of self-play using the newly updated neural network. This is all performed synchronously.
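The following is a minimal sketch of a single training step with an AlphaZero-style loss (mean-squared value error, policy cross-entropy, and L2 regularization) on a batch of N = 128 tuples as described above. The network interface (two log-probability heads and a scalar value) and the omission of the secondary head's loss term are simplifying assumptions of ours.

    import torch
    import torch.nn.functional as F

    def training_step(net, optimizer, states, target_pis, target_zs, l2_weight=1e-4):
        """One synchronous update on a batch of (s, pi, z) tuples from self-play.

        states:     float tensor of shape (N, C, H, W) holding the input planes
        target_pis: MCTS visit-count distributions, shape (N, num_actions)
        target_zs:  final game outcomes from the player's perspective, shape (N,)
        """
        log_p, log_m, v = net(states)          # secondary head's loss is omitted for brevity
        value_loss = F.mse_loss(v.squeeze(-1), target_zs)
        policy_loss = -(target_pis * log_p).sum(dim=1).mean()
        l2 = sum((w ** 2).sum() for w in net.parameters())
        loss = value_loss + policy_loss + l2_weight * l2

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()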
We first describe the architecture of our neural network; for a brief overview, see Figure 4. The input is the image stack described in Section 2.3.3. Note that, unlike other games, our representation includes no temporal information (T = 1), both for computational efficiency and because the game is fully encoded by the current state.

Figure 4. Detailed diagram of the multi-headed architecture explored for the game of Terra Mystica.

The input features s_t are processed by a residual tower which consists of a single convolution block followed by 18 residual blocks, as per [5]. The first convolution block consists of 256 filters of kernel size 3 × 3 with stride 1, batch normalization, and a ReLU non-linearity.

Figure 5. Architecture diagram for shared processing of the state-space features. An initial convolution block is used to standardize the number of features, which is then followed by 18 residual blocks.

Each residual block applies the following modules, in sequence, to its input: a convolution of 256 filters of kernel size 3 × 3 with stride 1, batch normalization, and a ReLU non-linearity, followed by a second convolution of 256 filters of kernel size 3 × 3 with stride 1 and batch normalization. The block's input is then added to this result, a ReLU is applied, and the final output is taken as the input to the next block. See Figure 5 for a reference.

The output of the residual tower is passed into multiple separate 'heads' for computing the policy, value, and miscellaneous information. The heads in charge of computing the policy apply the following modules, which we guess at, given that the AlphaZero paper [4] does not discuss in detail how the heads are modified to handle the final values. See Figure 6 for an overview diagram.

Figure 6. Details of the multi-headed architecture for the neural network. The final output of the residual tower is fed into two paths. (1) On the left is the policy network. (2) On the right is the value estimator. The policy network is further split into two, for computing two disjoint distributions over the action space, each normalized independently.
For the policy, we have one head that applies a convolution of 64 filters with stride 2 along the horizontal direction, reducing the horizontal extent of our map by half. This convolution is followed by batch normalization and a ReLU non-linearity. We then split this head into two further heads. For the first, we apply an FC layer which outputs a vector that we interpret as discussed in Section 2.4, representing the mutually exclusive possible actions that a player can take. For the second, we apply a further convolution with a single 1 × 1 filter, reducing our input to a single plane, followed by a batch-normalization layer and a ReLU. We then apply an FC layer producing a second probability distribution, normalized independently of the first.

For the value head, we apply a 1 × 1 convolution to the tower output, followed by batch normalization and a ReLU unit. We follow this with an FC layer to 264 units, a ReLU, another FC layer to a scalar, and a tanh non-linearity to output a scalar in the range [-1, 1].
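To make the architecture description above concrete, here is a minimal PyTorch sketch of the residual tower and the multi-headed outputs. The head shapes are our own guesses (as noted above, the exact head design is not specified by [4]); the convolutional reductions described in the text are replaced by a simple flatten-plus-FC for brevity, and the constructor arguments stand in for the dimensions discussed in Sections 2.3 and 2.4.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ResidualBlock(nn.Module):
        """Two 3x3 convolutions with batch normalization and a skip connection."""
        def __init__(self, channels=256):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            self.bn1 = nn.BatchNorm2d(channels)
            self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            self.bn2 = nn.BatchNorm2d(channels)

        def forward(self, x):
            out = F.relu(self.bn1(self.conv1(x)))
            out = self.bn2(self.conv2(out))
            return F.relu(out + x)   # skip connection, then ReLU

    class AlphaTMNet(nn.Module):
        """Residual tower with a primary policy head, a secondary policy head, and a value head."""
        def __init__(self, in_planes, primary_actions, secondary_actions,
                     board_h, board_w, channels=256, num_blocks=18):
            super().__init__()
            self.stem = nn.Sequential(
                nn.Conv2d(in_planes, channels, kernel_size=3, padding=1),
                nn.BatchNorm2d(channels), nn.ReLU())
            self.tower = nn.Sequential(*[ResidualBlock(channels) for _ in range(num_blocks)])
            flat = channels * board_h * board_w
            self.policy_fc = nn.Linear(flat, primary_actions)    # mutually exclusive actions
            self.misc_fc = nn.Linear(flat, secondary_actions)    # independently normalized head
            self.value_fc = nn.Sequential(
                nn.Linear(flat, 264), nn.ReLU(), nn.Linear(264, 1), nn.Tanh())

        def forward(self, x):
            h = self.tower(self.stem(x)).flatten(start_dim=1)
            log_p = F.log_softmax(self.policy_fc(h), dim=1)   # log-probabilities over actions
            log_m = F.log_softmax(self.misc_fc(h), dim=1)     # e.g. town-tile probabilities
            v = self.value_fc(h)                              # scalar value in [-1, 1]
            return log_p, log_m, v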
3. Experimental Results
Given the complexity of the problem we are facing, the majority of the work has been spent developing the reinforcement learning pipeline with an implementation of MCTS. The pipeline does not appear to train well, even after multiple games of self-play.
For the baselines, we compare the final scores achieved by existing AI agents. We see their results in Table 3. The results demonstrate that current AIs are fairly capable of scoring highly during games of self-play.
Simulated Self-Play Average Scores - AI

Faction      Average Score    Sampled Games
Halfling     92.21            1000
Engineers    77.12            1000

Table 3. Self-play scores for an easy existing AI agent: AI Level 5 from [6].
A second comparison, shown in Table 4, demonstrates the skill we would expect to achieve. These are the average scores of the best human players, averaged over online data.
Average Human Score (2p)

Faction      Average Score    Sampled Games
Halfling     133.32           2227
Engineers    127.72           1543

Table 4. Average human scores by faction for two-player TM games online.
The results for AlphaTM are presented below. Training appears not to have taken place, at least not with the architecture and number of iterations which we executed. The AI still appears to play randomly, especially at later, and more crucial, stages of the game. See Table 5.
Simulated Self-Play Average Scores - AlphaTM

Faction      Average Score    Training Iterations
Halfling     32.11            10,000
Engineers    34.12            10,000

Table 5. Our self-play AI after 10,000 training iterations, with average score taken over the final 1,000 games.
Overall, we summarize:
• The AI plays poorly in the early stages of the game, though it seems to learn to build structures adjacent to other players.
• As the game progresses, the actions of the agent are indistinguishable from random. A cursory analysis of π reveals these states are essentially random. It appears that the AI is not learning to generalize, or has simply not played sufficient games.

4. Future Work

Given the poor results from the experiments above, many avenues exist for future work. In particular, we propose a few extensions to the above approach below.
In the general context, the reinforcement learning pipeline that performed the best (with some semblance of learning) is the one where the game was presented explicitly as a zero-sum two-player game (I win, you lose). While the neural network architecture presented can readily generalize to more players, the theory behind the learning algorithm will no longer hold, as the game is no longer zero-sum.
Another area of future work is experimenting with further architectures and general improvements, with possible hyperparameter tuning.
5. Appendices
Figure 7. The Terra Mystica Game Board and Its Representation
Figure 8. The Terra Mystica Cult Track

Figure 9. The Terra Mystica Board Representation

References

[1] BoardGeek. Terra Mystica: Statistics, 2011.
[2] M. LLC. Terra Mystica: Rule Book, 2010.
[3] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529:484-489, Jan 2016.
[4] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. P. Lillicrap, K. Simonyan, and D. Hassabis. Mastering Chess and Shogi by self-play with a general reinforcement learning algorithm. CoRR, abs/1712.01815, 2017.
[5] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis. Mastering the game of Go without human knowledge. Nature, 550:354-359, 2017.