Deep Learning for General Game Playing with Ludii and Polygames
Dennis J. N. J. Soemers, Vegard Mella, Cameron Browne, Olivier Teytaud
January 26, 2021
Abstract
Combinations of Monte-Carlo tree search and Deep Neural Networks, trained through self-play, have produced state-of-the-art results for automated game-playing in many board games. The training and search algorithms are not game-specific, but every individual game that these approaches are applied to still requires domain knowledge for the implementation of the game's rules, and for constructing the neural network's architecture – in particular the shapes of its input and output tensors. Ludii is a general game system that already contains over 500 different games, and that can grow rapidly thanks to its powerful and user-friendly game description language. Polygames is a framework with training and search algorithms, which has already produced superhuman players for several board games. This paper describes the implementation of a bridge between Ludii and Polygames, which enables Polygames to train and evaluate models for games that are implemented and run through Ludii. We no longer require any game-specific domain knowledge; instead, we leverage our domain knowledge of the Ludii system and its abstract state and move representations to write functions that can automatically determine the appropriate shapes of input and output tensors for any game implemented in Ludii. We describe experimental results for short training runs in a wide variety of different board games, and discuss several open problems and avenues for future research.

1 Introduction

Self-play training approaches such as those popularised by
AlphaGo Zero [27] and
AlphaZero [26], based on combinations of Monte-Carlo tree search (MCTS) [10, 6, 3] and Deep Learning [13], have been demonstrated to be fairly generally applicable, and have achieved state-of-the-art results in a variety of board games such as Go [27], Chess, Shogi [26], Hex, and Havannah [5]. These approaches require relatively little domain knowledge, but still require some in the form of:

1. A complete implementation of a forward model for the game, for the implementation of lookahead search as well as automated self-play to generate experience for training.

2. Knowledge of which state features are required or useful to provide as inputs for a neural network.

3. Knowledge of the action space, which is typically used to construct the policy head in such a way that every distinct possible action has a unique logit.

The first requirement, for the implementation of a forward model, is partially addressed by research on using learned simulators for tree search as in MuZero [24], but in practice a simulator is actually still required for the purpose of generating trajectories outside of the tree search. For the board games Go, Chess, and Shogi, MuZero still requires the input and output tensor shapes (for states and actions, respectively) to be manually designed per game. We remark that MuZero was also evaluated on 57 different Atari games in the Arcade Learning Environment (ALE) [1], and it can use identical tensor shapes across all these Atari games because ALE uses the same observation and action spaces for all games in this framework.

The challenge posed by General Game Playing (GGP) [21] is to build systems that can play a wide variety of games, which makes the three forms of required domain knowledge listed above difficult to provide.

[∗ Department of Data Science and Knowledge Engineering, Maastricht University, the Netherlands. ∗∗ Facebook AI Research.]
A number of systems have been proposed that can interpret and run any arbitrary game, as long as it has been described in their respective game description language; examples include the original Game Description Language (GDL) [16] from Stanford, Regular Boardgames (RBG) [11], and Ludii [20].

In this paper, we describe how we combine the GGP system Ludii and the PyTorch-based [17] state-of-the-art training algorithms in Polygames [5], with the goal of mitigating all three of the requirements for domain knowledge listed above. Section 2 provides some background information on these training techniques. Section 3 describes existing work and limitations in applying these Deep Learning approaches to general games. Section 4 presents the interface between Ludii and Polygames. Experiments and results are described in Section 5. We discuss some open problems in Section 6, and conclude the paper in Section 7.

2 Background
The basic premise behind AlphaZero and similar approaches in frameworks such as Polygames is that Deep Neural Networks (DNNs) take representations T(s) of game states s as input, and produce discrete probability distributions P(s), with probabilities P(s, a) for all actions a in states s, as well as value estimates V(s), as outputs. This is depicted in Figure 1. Both of these outputs are used to guide MCTS in different ways.

DNNs in general have a fixed architecture, requiring fixed and predetermined shapes for both the input and the output representations. The value output is always simply a scalar (assuming 2-player zero-sum games; see [18] for relaxations of this assumption), but determining the shapes of the input tensors T(s) and policy outputs P(s) typically requires game-specific domain knowledge.

Figure 1: Basic architecture of DNNs for game playing. Raw game states s are transformed into a tensor representation T(s) of some fixed shape (often 3-dimensional). The DNN learns to compute hidden representations of its inputs in hidden layers. Finally, it computes a scalar value estimate V(s), and a discrete probability distribution P(s) with probabilities P(s, a) for all actions a in the complete action space.

T(s) is generally a 3-dimensional tensor, where 2 dimensions are spatial dimensions (corresponding to e.g. a 2-dimensional playable area in a board game). The third dimension is formed by a stack of different channels which each have different semantics. For example, T(s) in AlphaZero has a shape of 19 × 19 × 17 for the game of Go played on a 19 × 19 board, with eight times two binary channels encoding the presence of the two players' pieces – for a history of up to eight successive game states ending in s – and one final channel encoding the current player to move. The spatial structure of the first two dimensions is typically assumed to be meaningful, which is exploited by the inductive bias of Convolutional Neural Networks (CNNs) [14].

For the policy head, it is customary for neural networks to first output real-valued logits L(s, a) for all possible actions a. These are subsequently converted into probabilities P(s, a) using a softmax over all legal actions a' ∈ A(s) in s:

    P(s, a) = exp(L(s, a)) / Σ_{a' ∈ A(s)} exp(L(s, a')).

It is generally assumed that every distinct possible action a that may be legal in any game state s has a unique, matching logit L(s, a). This means that domain knowledge of the game's action space is required to construct a DNN's architecture in such a way that distinct actions always have distinct logits. The logits are sometimes laid out in a structure of multiple 2-dimensional planes, like the inputs, but typically preceded by fully connected (as opposed to convolutional) layers. This is equivalent to all the logits being laid out in a single, flat vector with no spatial structure.
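As an illustration, the softmax over legal actions can be computed directly from the logits. The following is a minimal sketch of our own (function and variable names are not from Polygames):

```python
import math

def policy_probs(logits, legal_actions):
    """Softmax restricted to legal actions:
    P(s, a) = exp(L(s, a)) / sum over a' in A(s) of exp(L(s, a'))."""
    # Subtract the max logit for numerical stability; this leaves the
    # resulting probabilities unchanged.
    m = max(logits[a] for a in legal_actions)
    exps = {a: math.exp(logits[a] - m) for a in legal_actions}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}
```

Note that illegal actions receive no probability mass at all, regardless of how large their logits are.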
In addition to such typical architectures, Polygames [5] includes various different structures, such as:

• Fully convolutional networks:
In many games, actions are spatially distributed in a manner closely tied to the pieces, so the output can be given spatial coordinates matching the spatial coordinates of the input. This can be exploited in fully convolutional networks [25]: the policy head has no fully connected layer, and directly maps inputs to outputs through convolutional blocks. This has the advantage of being board-size invariant: we can train on size 13 × 13 and play on 19 × 19. Global pooling can be used to additionally make the value head size-invariant [15, 29].

• U-networks:
It is usually considered that DNNs rephrase their data in an increasingly abstract manner, layer after layer. However, in fully convolutional networks, the output is dense; it has the same low-level nature as the input, so the level of abstraction increases and then decreases again. One may therefore consider that layers might benefit from a direct connection to a layer containing information at the same level of abstraction. This can be done with skip-connections, i.e. additional connections to layers positioned symmetrically in the network (Figure 2c): this is a U-network [22].

Some of these different structures are depicted and explained in Figure 2.
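To make the board-size invariance concrete, the following PyTorch sketch combines a fully convolutional policy head with a globally pooled value head. Layer sizes and the class name are illustrative choices of ours, not the actual Polygames architectures:

```python
import torch
import torch.nn as nn

class FullyConvNet(nn.Module):
    """Board-size-invariant network: a convolutional trunk, a fully
    convolutional policy head (no fully connected layers, so logits keep
    the input's spatial shape), and a value head whose global average
    pooling makes its parameter count independent of board size.
    Channel counts are illustrative, not Polygames defaults."""

    def __init__(self, in_channels: int, hidden: int = 32, policy_channels: int = 1):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(in_channels, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Policy head: a 1x1 convolution mapping hidden features to logit planes.
        self.policy = nn.Conv2d(hidden, policy_channels, kernel_size=1)
        # Value head: global pooling collapses any H x W to 1 x 1.
        self.value = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(hidden, 1), nn.Tanh()
        )

    def forward(self, x):
        h = self.trunk(x)
        return self.policy(h), self.value(h)
```

Because no layer's parameter count depends on the spatial dimensions, the same weights can evaluate a 13 × 13 input during training and a 19 × 19 input at play time.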
3 Deep Learning for General Games

To some extent, all GGP systems mitigate the requirement for the implementation of complete forward models for every distinct game, in the sense that new games can be added and supported simply by defining them in a game description language. Ludii's game description language in particular has been designed in such a way that game descriptions for new games are fast and easy to write and understand [20], which has allowed for a significantly larger library of distinct games to be built up than would be feasible if they were all written in a programming language such as C++. Ludii's predecessor has also already demonstrated that the "ludemic" approach to game description languages used by Ludii facilitates procedural generation of complete games [4], which can be used to easily extend the set of compatible benchmark problems.

Similarly, we may argue that running games through a GGP system removes the requirements for game-specific knowledge about how to shape state inputs and action outputs, but introduces requirements for similar knowledge about the GGP system. Given any arbitrary game defined in a game description language of a GGP system, we require the ability to construct tensor representations of game states, and the ability to map from any index in a policy head to a matching action in any non-terminal game state. (Ludii has over 500 distinct built-in games at the time of this writing, with many of them having multiple variants for different board sizes, board shapes, variant rulesets, etc.)
Figure 2: (a) Standard convolutional network: convolutional layers first, with their inductive bias towards spatial invariance, followed by fully connected layers (image source: https://en.wikipedia.org/wiki/Convolutional_neural_network). (b) Fully convolutional counterpart (image from [25]; other images from Wikipedia), typically used in image segmentation, which is related to policy heads in games in that the output has the same spatial coordinates as the input; each output scalar (each logit in the case of games) is the output of the same network applied on a moving window, and there are no fully connected layers. (c) U-networks: this fully convolutional network also connects some layers to their symmetric counterparts, which are supposed to work at a similar level of abstraction. (d) Max pooling, here downsizing from 4 × 4 to 2 × 2; in the case of global pooling, the size of tensors after the global pooling layer is 1 × 1 × channels, independently of the input size, which makes the value head board-size invariant. Global pooling can use channels for mean, standard deviation, max, etc.: the number of channels is not necessarily preserved. Combinations (b+d) or (c+d) allow board-size-invariant training [5].

GDL [16] is a low-level logic-based game description language, where games are described as logic programs consisting of many low-level propositions. Many GDL-based agents convert such a GDL description into a propositional network [23, 7, 28], which can process the games more efficiently than Prolog-based reasoners or other similar techniques. Such propositional networks can be automatically constructed from GDL descriptions, and the structure of such a network remains constant across all game states of the same game. [9] therefore proposed using the internal state of a game's propositional network as the input state tensor for a deep neural network. A downside of this approach is that the state input tensor is a flat tensor, and there is no possibility to use inductive biases such as those of CNNs for inputs with spatial semantics.
Galvanise Zero [8] does exploit knowledge of spatial semantics through CNNs, but it only supports a limited selection of GDL-based games, because it requires a handwritten Python function to create the mapping from game states to input tensors for every game that it supports. The action space can automatically be inferred from GDL descriptions, which means that these approaches require no extra domain knowledge with respect to the output policy heads.

In the game description language of Ludii [20], common high-level game concepts such as boards, piece types, etc. are all "first-class citizens" of the language, as opposed to GDL, where every separate game description file encodes such concepts from scratch in low-level logic. Based on these concepts, Ludii also has an object-oriented game state representation that it uses internally, which remains consistent across all games. This enables us to write a single function that automatically constructs input tensors from Ludii's internal state representation, using our domain knowledge of Ludii as a whole instead of domain knowledge of every individual game. Unlike GDL, it is not straightforward (if at all possible) to infer the action space from game description files in Ludii. However, actions in Ludii do have an object-oriented structure, and at least an approximation of the action space can be constructed based on these properties – again, based on domain knowledge of Ludii rather than any individual game. In many games, this is sufficient to distinguish most or all legal actions from each other.
4 Interface Between Ludii and Polygames

Based on the insights described above, we developed an interface between the Ludii general game system and the Polygames framework with state-of-the-art AI training code. In Polygames, different games are normally implemented from scratch in C++. The basic idea of this interface is that there is a single "Ludii game" in Polygames, with C++ code that interacts with Ludii's Java-based API through the Java Native Interface. Polygames command-line arguments can be used to load different games and variants from Ludii into this wrapper. This section provides details on how Ludii automatically constructs tensor representations of its state and action spaces, based on its own internal representations, for any arbitrary game implemented in Ludii.

4.1 Mapping Positions to a Grid

CNNs normally operate on grid structures of "pixels", such that every position can be indexed by a row and column, and every position has a square of up to eight neighbour positions around it. This structure most closely resembles the game boards of games such as Chess, Shogi, and Go. Some other boards, such as the tilings of hexagonal cells used in games like Hex and Havannah, can also be "packed" into such a grid. This approach is used for the game-specific C++ implementations of those games in Polygames. However, Ludii supports games with arbitrary graphs as boards, and hence requires a generic solution that can map positions from graphs with any arbitrary connectivity structure into a grid structure that CNNs can work with.

For every game in Ludii, there is at least one (and possibly more than one) container, which specifies a playable "area" with positions that may contain pieces, have corresponding clickable elements in Ludii's GUI, etc. [20, 19]. The first container typically corresponds to the board that a game is played on, and is often the largest. Any other containers represent auxiliary areas, such as players' hands to hold captured pieces in Shogi.
Even games that are not generally thought of as being played on a board are still modelled in this way in Ludii. For instance, Rock-Paper-Scissors is modelled as a board with two (initially empty) cells, and two hands for the two players, each containing rock, paper, and scissors "pieces" which players can drag onto their designated cells on the board to make their move.

Every site in any such container in Ludii has x- and y-coordinates in [0, 1]. We construct a grid by sorting the x- and y-coordinates across all sites in the board in increasing order, and assigning distinct columns and rows, respectively, to distinct x- and y-coordinates. Coordinates that are within a small tolerance value of each other are treated as equal, to avoid generating excessively large and sparse tensors due to small differences resulting from floating-point arithmetic. Note that this approach is not equivalent to directly overlaying a sufficiently fine-grained grid over the [0, 1] space, because we only add rows and columns that each contain at least one site. This is depicted in Figure 3. Our approach may lose some information concerning the relative distances between sites, but because these x- and y-coordinates are only used for the graphical user interface – not for game logic – we expect the smaller and less sparse grids to be preferable due to improved computational efficiency. Note that the vast majority of games in Ludii use boards defined by regular or semiregular tilings, and for these the two approaches will have similar results. (The source code for building tensors from Ludii's internal state and action representations is available from https://github.com/Ludeme/LudiiAI. All the source code of Polygames is available from https://github.com/facebookincubator/Polygames.)
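The coordinate-sorting procedure can be sketched as follows. This is our own illustrative helper (function name, site format, and tolerance handling are assumptions, not Ludii's actual implementation):

```python
def build_grid(sites, tol=1e-4):
    """Map sites with arbitrary (x, y) coordinates in [0, 1] to (row, col)
    grid indices: sort the distinct x- and y-coordinates, merging values
    that lie within `tol` of each other, and assign one column (row) per
    distinct x- (y-) coordinate. Only coordinates that actually occur
    produce a column or row, so the grid stays small and dense."""

    def distinct(vals):
        # Sorted coordinates, with near-duplicates (within tol) merged.
        out = []
        for v in sorted(vals):
            if not out or v - out[-1] > tol:
                out.append(v)
        return out

    def index(v, axis):
        # Find the merged coordinate this value belongs to.
        for i, c in enumerate(axis):
            if abs(v - c) <= tol:
                return i
        raise ValueError(v)

    xs = distinct(x for x, _ in sites)
    ys = distinct(y for _, y in sites)
    return {s: (index(s[1], ys), index(s[0], xs)) for s in sites}
```

For three sites with two distinct x-coordinates and two distinct y-coordinates, this yields a 2 × 2 grid, regardless of how far apart the sites are in the [0, 1] space.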
Figure 3: Two different approaches for computing a grid based on a playable space defined by three sites A, B, and C, each with distinct x- and y-coordinates. The approach we use, depicted in (a), assigns a 3 × 3 grid based on the distinct x- and y-coordinates; this results in smaller, more dense tensors, but information on the relative distances between all sites is not necessarily preserved. The alternative approach, depicted in (b), directly overlays an 8 × 8 grid over the [0, 1] space; this preserves more of that information, but can result in large and sparse tensors.

In the current version of Ludii, containers other than the first one (corresponding to the "main" board) never have more than one meaningful dimension; they are always a single, contiguous sequence of cells. Each of those containers is concatenated to the grid constructed for the first container, either using one extra column or one extra row per extra container (whichever results in the lowest increase in the total size of the tensor). Additionally, one extra dummy row or column is inserted to create a more explicit separation between the main board (for which we expect there to be meaningful spatial semantics) and the other containers (for which there is no expectation that any meaningful spatial semantics exist). For example, Shogi is played on a 9 × 9 board, and each player additionally has a hand with seven slots for captured pieces (Figure 4); the two hands, plus one dummy row or column, extend the 9 × 9 board grid to a 12 × 9 (or 9 × 12) grid.

Figure 4: Shogi being played in Ludii's user interface. The game board is on the left-hand side, and each player has a "hand" with seven slots to hold captured pieces on the right-hand side.

4.2 State Tensors

Let s denote a raw game state in Ludii's object-oriented state representation [19], for a game G. Based on the properties of s, we construct a tensor representation T(s) – which can be used as input for a DNN – of shape (C, W, H), where C denotes the number of channels (variable, depends on G), W denotes the width (i.e., number of columns), and H denotes the height (i.e., number of rows). The channels are constructed as follows:
Figure 5 shows how the numbered positions get mapped to positions in a tensor.

Figure 5: Mapping from positions in Shogi's three containers to positions in a single tensor. Numbers 0 through 80 correspond to positions on the board, 81 through 87 are positions in the hand of Player 1, and 88 through 94 are positions in the hand of Player 2 (see Figure 4).

• Binary channels indicating the presence (or absence) of every piece type defined in G. Most games have one channel per piece type, where values of 1 indicate the presence of a piece of that type in a position. If G is a "stacking" game, meaning that it allows multiple pieces to form a stack on a single position, we use M + N binary channels per piece type, instead of just one. M channels are used to indicate presence of a piece type in the bottom M layers of every stack on every position, and N channels indicate the same for the top N layers. In our implementation, we use M = N = 5. If a single stack contains more than M + N pieces, this representation is not sufficient to provide information about some of the middle layers to the DNN, but this is rare in practice.

• If G is a "stacking" game, we include an additional non-binary channel containing the height of every stack in every position.

• If G is a game where positions can contain a "count" of more than one piece, we include a non-binary channel denoting the count of pieces on that position. This channel is semantically similar to the one described above for stack heights. In Ludii, positions in these games are still restricted to containing only a single piece type at a time. This is most notably used for mancala games. Games where pieces of different types can share a single position are modelled as stacking games instead.

• Ludii's state representation can include an "amount" value per player, primarily intended to represent money for games that involve betting or other similar mechanisms.
If G uses this, we add one non-binary channel per player, such that every position in the channel for player p contains the amount value of p in s.

• If G is played by n > 1 players, we include n binary channels, such that the n-th channel is filled with values of 1 if and only if player n is the current player to move in state s. This also accounts for swap rules. For example, the first player normally plays red, and the second blue, in Hex. If s is a state where the red player is the next to make a move, and a swap has occurred, the second of these channels will be filled with 1 entries instead of the first.

• In some games, every position has a "local state" variable, which is an integer value. Different games can use this in different ways to store (temporary) auxiliary information about positions. For instance, in Chess, local state values of 1 are used for positions that contain pieces that are still in their initial position, and values of 0 otherwise (this is used for castling). Most games only use low local state values, if any at all. Hence, we use separate binary channels to indicate local state values of 0, 1, 2, 3, 4, and ≥ 5.

• If the game uses a swap rule (or "pie rule"), such as Hex, we include a binary channel that is filled with values of 1 if and only if a swap has occurred in s.

• For every distinct container in G, we include one binary channel that has values of 1 for entries that correspond to a position in that container, and values of 0 everywhere else.

• For each of the last two moves m played prior to reaching s, we add one binary channel with only a single value of 1 in the entry corresponding to the "from" position of m (typically the location that a piece moves away from), and a similar channel for the "to" position of m (typically the location that a piece is placed in).

With these channels, we did not yet exhaustively cover all the state variables in Ludii's game state representation [19], but we covered the most commonly-used ones.
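To illustrate how the channel count C grows with these game properties, the following hypothetical helper mirrors the scheme above (the function and its parameter names are our own, not Ludii's API; the local-state channels are included unconditionally in this sketch):

```python
def num_state_channels(num_piece_types, num_players, stacking=False,
                       counts=False, amounts=False, swap_rule=False,
                       num_containers=1, M=5, N=5):
    """Count the input channels produced by the scheme described above."""
    c = 0
    # Piece-type presence: 1 channel each, or M + N per type in stacking games.
    c += num_piece_types * ((M + N) if stacking else 1)
    if stacking:
        c += 1               # stack-height channel
    if counts:
        c += 1               # piece-count channel
    if amounts:
        c += num_players     # per-player "amount" channels
    if num_players > 1:
        c += num_players     # current-player ("mover") channels
    c += 6                   # local state values 0, 1, 2, 3, 4, and >= 5
    if swap_rule:
        c += 1               # "swap occurred" channel
    c += num_containers      # one binary mask per container
    c += 2 * 2               # "from"/"to" channels for the last two moves
    return c
```

For a Hex-like game (two piece types, two players, swap rule, one container) this gives 16 channels, whereas a stacking game multiplies the piece-type channels by M + N = 10, which is one reason the Lasca model in Section 5 is so much larger than the others.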
Whenever new variables are added to Ludii's game state representation, engineering effort for including them in the tensor representations is only required once for Ludii as a whole – not once per game added to Ludii.

4.3 Move Tensors

In contrast to GDL [16, 8, 9], it is not straightforward – if at all possible – to automatically infer the complete action space for any arbitrary game described in Ludii's game description language. This is because in Ludii's game description language, the function that generates lists of legal moves is defined as a composite of many simple functions (ludemes), which may be arranged in any arbitrary tree structure. While each of these ludemes in principle has some domain for its possible inputs, and range for its possible outputs, these are not strictly defined in logic-based or other formats that permit automated inference.

Similar to its state representation, Ludii has an object-oriented move representation [19]. However, in contrast to the state representation, the most important variables of the move representation are arbitrarily-sized lists (of primitive modifications to be applied to a game state) and arbitrarily-sized trees (of ludemes to be evaluated after applying the initial primitive modifications). The arbitrary sizes of these variables make them difficult to encode in a fixed-size tensor representation. Hence, we ignore these properties, and only distinguish moves based on some simple properties that can easily be used for this purpose. We construct the space of output tensors to map moves to for a game G as follows:

• The action space is organised as a stack of 2-dimensional planes, with the spatial dimensions being identical to those of the state tensors (see Subsection 4.2). Every action maps to exactly one position in this space – i.e., one location in the 2-dimensional area, and one channel.
• Pass and swap moves have been identified as special cases that are sufficiently common, important, and semantically different from any other moves that they each receive their own dedicated channel. (In this document we use the terms "move" and "action" interchangeably, to refer to complete decisions that players make. Within Ludii, these are referred to only as moves, and actions are smaller parts of moves.)

• Many games only involve moves that can be identified by just a single position in the spatial dimensions; these are generally games where players place stones (Go, Hex, Havannah, etc.), but may in theory also be games like Chess if they have been defined in a way such that movements are split up into two separate decisions (picking a source and picking a destination). These games can be automatically discovered in Ludii. For these games, we only add one more channel in addition to the pass and swap move channels, to encode all other moves based on their positions in the spatial dimensions. In Ludii, this position is referred to as the "to" position.

• In all other games, moves may have distinct "from" and "to" positions; typical examples are standard implementations of Chess, Amazons, Shogi, etc. For moves that have an invalid "from" position, we assume that it is equal to the "to" position. For games that involve stacking, moves may additionally have l_min and l_max properties, which refer to the levels within a stack at which a move operates; both are assumed to equal 0 if the game does not allow stacking. The "to" position of a move is used to map the move to a location in the spatial dimensions, and the remaining properties are used to index into one of multiple channels based on the relative "distance covered" by the move. More specifically, we create (2M + 1) × (2M + 1) × (N + 1) × (N + 1) channels, where we use M = 3, and N = 2 if G involves stacking, or N = 0 otherwise. Let dx and dy denote the differences in rows and columns, respectively, between the "to" and "from" positions of a move.
Let [a]_b^c denote the value of a clipped to lie in the interval [b, c]. Then, this move gets mapped to the channel given by the 0-based index

    (([dx]_{-M}^{M} × (2M + 1) + [dy]_{-M}^{M}) × (N + 1) + [l_min]_{0}^{N}) × (N + 1) + [l_max − l_min]_{0}^{N}.

Note that this is simply one approach to constructing tensor representations of moves that we implemented; we may envision other approaches as well. For instance, in a game like Chess, it may be more important to encode the type of the piece that makes a move, rather than the distance and direction covered by a move. This could be accomplished by creating channels that are indexed based on the type of piece in the "from" location of a move, instead of the distance between the "from" and "to" positions.

While we find this approach sufficient to distinguish moves from each other in many cases, there are cases where multiple distinct moves that are legal in a single game state end up being represented by exactly the same logit. When multiple distinct moves are represented by the same logit in a DNN's output, we say that they are aliased. DNNs cannot distinguish between aliased moves, and hence always provide the same advice (in the form of the prior probabilities P(s, a)) to MCTS for these different moves. However, in Polygames [5], the MCTS itself can still distinguish between the different moves by representing them as distinct branches in the search tree, and backing up (potentially) different values throughout the tree search. This is an important difference with other frameworks, such as OpenSpiel [12], where the MCTS itself requires every possible distinct action that may ever be legal in a game to be assigned a unique integer upfront. When subsequently using the visit counts to compute the standard cross-entropy loss as proposed by [27], the visit counts for all moves that share a single logit are summed up. The softmax over the logits only counts every distinct logit once.
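The channel-index computation can be written out as follows. Note that we add an explicit +M shift to make the clipped dx and dy values non-negative; this shift is our assumption about how the 0-based index is obtained, and the function itself is an illustrative sketch rather than the actual implementation:

```python
def clip(a, lo, hi):
    """Clip a to lie in the interval [lo, hi]."""
    return max(lo, min(hi, a))

def move_channel(dx, dy, l_min=0, l_max=0, M=3, N=0):
    """0-based channel index for a move, following the formula above:
    dx and dy are clipped to [-M, M] (then shifted by +M so they index
    from 0), and the stack-level terms are clipped to [0, N]. The total
    number of channels is (2M+1) * (2M+1) * (N+1) * (N+1)."""
    cdx = clip(dx, -M, M) + M   # shift into [0, 2M]
    cdy = clip(dy, -M, M) + M
    idx = (cdx * (2 * M + 1) + cdy) * (N + 1) + clip(l_min, 0, N)
    return idx * (N + 1) + clip(l_max - l_min, 0, N)
```

With the default M = 3 and N = 0 there are 49 channels; a stationary move (dx = dy = 0) lands in the middle channel, and any displacement beyond M rows or columns is clipped, which is one source of the move aliasing discussed below.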
5 Experiments

In this section we describe experiments intended to demonstrate the potential for the approach described in the previous section to facilitate training and research in general games. We picked fifteen different games, all as implemented with their default options in Ludii [20] v1.1.6, and trained a model of the ResConvConvLogitPoolModelV2 type from Polygames [5] in each of these games. The selected games are depicted in Figure 6.

We used the same training hyperparameters across all games. Every training run used 20 hours of wall time, with 8 GPUs, 80 CPU cores, and 475GB of memory allocated per training job. Every training job used 1 server for model training, and 7 clients for the generation of self-play games. The MCTS agents used 400 MCTS iterations per move in self-play.

The final model checkpoint of every training run is evaluated in a set of 300 evaluation games played against a pure MCTS – a standard UCT agent [10, 3] without any DNNs. In evaluation games, the MCTS with a trained model used 40 iterations per move, whereas the pure MCTS used 800 iterations per move – where at the end of every iteration, the average outcome of 10 random rollouts is backed up. The final column of Table 1 reports the win percentages of the trained MCTS against the untrained MCTS. The table also provides further details on the number of trainable parameters in each of the DNNs, and for some games summarises unusual properties that these games have which we did not yet observe in much of the existing literature on learning through self-play in games.

In the majority of the evaluated games, the trained MCTS easily outperforms the untrained one, even using 20 times fewer MCTS iterations (or 200 times fewer if the number of random rollouts performed by the untrained MCTS is counted). Note that, in comparison to work that focuses on achieving superhuman playing strength [26, 5], we focused on short training runs using fewer resources and smaller networks.
Our primary aim is to demonstrate the possibility of training effectively using a single implementation without game-specific domain knowledge. (The code used by Ludii to construct state and move tensors for any game is available from https://github.com/Ludeme/LudiiAI. All the training and evaluation code of Polygames is available from https://github.com/facebookincubator/Polygames. Checkpoints of models used in these experiments are available from http://dl.fbaipublicfiles.com/polygames/ludii_checkpoints/list.txt.)

Table 1: Per-game results: unusual properties, number of trainable DNN parameters, and win percentage of the trained MCTS against the untrained MCTS.

Game             | Unusual Properties                                  | Trainable Parameters | Win Percentage
Breakthrough     | -                                                   | 188,296              | 100.00%
Connect6         | -                                                   | 180,472              | 75.67%
Dai Hasami Shogi | -                                                   | 188,296              | 99.33%
Fanorona         | Move aliasing due to choice of capture direction.   | 188,296              | 50.00%
Feed the Ducks   | Moves have global effects across entire board.      | 231,152              | 83.00%
Gomoku           | -                                                   | 180,472              | 91.00%
Hex              | -                                                   | 222,464              | 100.00%
HeXentafl        | Asymmetry in piece types, initial setup, and goals. | 231,152              | 98.67%
Konane           | -                                                   | 188,296              | 98.00%
Lasca            | Pieces (of multiple different types) can stack.     | 5,450,268            | 3.50%
Minishogi        | -                                                   | 2,009,752            | 97.00%
Pentalath        | -                                                   | 180,472              | 95.33%
Squava           | Lines of 4 win, but lines of 3 lose.                | 222,464              | 96.67%
Surakarta        | Loops around board allow for unique move patterns.  | 188,948              | 100.00%
Yavalath         | Lines of 4 win, but lines of 3 lose.                | 222,464              | 97.33%
Figure 6: First row: Breakthrough, Connect6, Dai Hasami Shogi, Fanorona, Feed the Ducks. Second row: Gomoku, Hex, HeXentafl, Konane, Lasca. Third row: Minishogi, Pentalath, Squava, Surakarta, Yavalath.

The two results that stand out most are for Lasca and Fanorona. The win percentage of 3.50% for Lasca indicates that this model is not trained nearly as well as the others. Lasca is the only game among those tested that involves stacking of multiple pieces on a single site. Our procedures for the construction of input and output tensors lead to a significantly larger number of channels in this game compared to the other games, which is also reflected in the large number of trainable parameters of this model. Further research is required to establish whether it would be sufficient to reduce the size of the model, or whether entirely different approaches for constructing the tensors would be more appropriate. In Fanorona, the win percentage of 50% for the trained model is not necessarily a poor level of performance (considering the large difference in number of MCTS iterations), but it appears to be noticeably worse than in the other games. One possible explanation for this may be that Fanorona has a more severe degree of move aliasing, because there are situations with multiple different legal moves that have identical "to" and "from" positions, but different effects, in that a player can choose in which direction they wish to capture opposing pieces. Such moves are all represented by a single, shared logit in our output tensors – which means that only the MCTS can distinguish between them, but the trained policy head cannot.
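The aliasing effect in Fanorona can be illustrated with a toy move encoding. The flat `logit_index` function below is hypothetical – Ludii/Polygames' real output tensors are spatial – but it mirrors the essential problem: the capture direction is not part of the policy index, so aliased moves collapse onto one shared logit.

```python
NUM_SITES = 45  # Fanorona is played on a 9x5 grid of intersections

def logit_index(from_site: int, to_site: int, capture_dir: str) -> int:
    """Hypothetical flat encoding: one policy logit per (from, to) pair.

    capture_dir ("approach" vs. "withdrawal") is deliberately ignored,
    mirroring how the output tensors described above index moves.
    """
    return from_site * NUM_SITES + to_site

# Two legal Fanorona moves that differ only in capture direction...
approach = logit_index(12, 13, "approach")
withdrawal = logit_index(12, 13, "withdrawal")
# ...share a single logit, so the trained policy head cannot tell them
# apart; only the search's per-move statistics can distinguish them.
assert approach == withdrawal
```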
Open Problems
Thanks to the large library of games available in Ludii [20], we can get a clear picture of categories of games that remain open problems to various extents; some are simply not yet supported by Polygames [5] and require more engineering effort, and some appear to have been neglected across the majority of recent game AI literature. All of the following types of games are supported by Ludii:

• Stochastic games: these were not included in this paper because they are temporarily unsupported by the MCTS implementation of Polygames, but were supported in earlier versions of Polygames and will be again in future versions.

• Games with more than two players: support for these can be added relatively easily [18], but is not yet available in Polygames.

• Imperfect-information games: there has been some recent work towards AlphaZero-like training approaches that support imperfect-information games [2], but tractability is still a concern for games with little common knowledge.

• Simultaneous-move games: these will at least require significant changes in the MCTS component [3] as it is typically used in AlphaZero-like training setups.

• Games with excessively large state or move tensors: games such as Taikyoku Shogi, with a 36×36 board and 402 pieces per player of 209 different types, can be modelled and run in Ludii, but produce excessively large tensors which quickly lead to memory issues when training with standard hyperparameter values that work well for "normal" games. These issues do not appear straightforward to resolve with current hardware and large DNNs.

• Games played on a mix of cells, edges, and/or vertices of graphs: while games like Chess are played only on cells, and games like Go only on vertices, there are also games such as Contagion that are played on a mix of multiple different parts of a graph. It is not clear how to directly support these with standard CNNs.

• Games without an explicitly defined board: games such as Andantino or Chex are not played in a limited area that is defined upfront, but in a playable area that grows dynamically as play progresses. Standard DNN architectures require these spatial dimensions to be predefined and fixed.

• Games with more than two spatial dimensions: games such as Spline have a third spatial dimension, which cannot be handled by standard 2D convolutional layers. While a straightforward extension to 3D convolutional layers may be sufficient, we are not aware of any existing research towards this for games, and also imagine that a third spatial dimension can rapidly lead to tensors becoming excessively large again for many non-trivial games.
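To make the Taikyoku Shogi point concrete, a back-of-envelope estimate follows. It assumes, purely for illustration, one binary input plane per (piece type, player) pair; Ludii's actual tensor construction adds further channels, so the real tensors are larger still.

```python
def input_floats(board_cells: int, piece_types: int, players: int = 2) -> int:
    """Rough lower bound on input-tensor entries per position, assuming
    one binary plane per (piece type, player) pair."""
    return board_cells * piece_types * players

taikyoku = input_floats(36 * 36, 209)  # 36x36 board, 209 piece types
chess = input_floats(8 * 8, 6)         # 8x8 board, 6 piece types

print(taikyoku)            # 541728 floats per position
print(taikyoku * 4 / 1e6)  # ~2.2 MB at float32, before batching
print(taikyoku // chess)   # ~705x the piece planes of chess
```

Multiplied by typical batch sizes and replay-buffer lengths, planes on this scale quickly exhaust GPU memory, which is why standard hyperparameter values break down for such games.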
Conclusion

We have described our approach for constructing tensor representations of states and moves for any game implemented in the Ludii general game system, and used this to implement a bridge between Ludii and the Polygames framework. This allows the state-of-the-art tree search and self-play training techniques implemented in Polygames to be used for training game-playing models in any game described in Ludii's general game description language. Whereas AlphaZero-like approaches typically require game-specific domain knowledge to define a Deep Neural Network's architecture and its input and output tensors, we only require such domain knowledge at the level of the general game system as a whole, and can now leverage Ludii's wide library of games – which can quickly grow thanks to its user-friendly game description language – to facilitate more general game AI research with minimal requirements for game-specific engineering efforts. We have identified a series of "open problems" in the form of classes of games that are already supported by Ludii, but not yet by Polygames. For some of these there is a clear path that merely requires additional engineering effort, but others are likely to require a more significant amount of extra research.
Acknowledgments
This work was partially supported by the European Research Council as part of the Digital Ludeme Project (ERC Consolidator Grant).
References

[1] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The Arcade Learning Environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.

[2] N. Brown, A. Bakhtin, A. Lerer, and Q. Gong. Combining deep reinforcement learning and search for imperfect-information games. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems 33 (NeurIPS 2020), 2020.

[3] C. Browne, E. Powley, D. Whitehouse, S. Lucas, P. I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S. Colton. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1–49, 2012.

[4] C. B. Browne. Automatic Generation and Evaluation of Recombination Games. PhD thesis, Queensland University of Technology, 2009.

[5] T. Cazenave, Y.-C. Chen, G. Chen, S.-Y. Chen, X.-D. Chiu, J. Dehos, M. Elsa, Q. Gong, H. Hu, V. Khalidov, C.-L. Li, H.-I. Lin, Y.-J. Lin, X. Martinet, V. Mella, J. Rapin, B. Roziere, G. Synnaeve, F. Teytaud, O. Teytaud, S.-C. Ye, Y.-J. Ye, S.-J. Yen, and S. Zagoruyko. Polygames: Improved zero learning. ICGA Journal, 2020. To appear.

[6] R. Coulom. Efficient selectivity and backup operators in Monte-Carlo tree search. In H. J. van den Herik, P. Ciancarini, and H. H. L. M. Donkers, editors, Computers and Games, volume 4630 of LNCS, pages 72–83. Springer, Berlin, Heidelberg, 2007.

[7] E. Cox, E. Schkufza, R. Madsen, and M. R. Genesereth. In Proceedings of the IJCAI Workshop on General Intelligence in Game-Playing Agents (GIGA), pages 13–20, 2009.

[8] R. Emslie. Galvanise zero. https://github.com/richemslie/galvanise_zero, 2019.

[9] A. Goldwaser and M. Thielscher. Deep reinforcement learning for general game playing. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, pages 1701–1708. AAAI Press, 2020.

[10] L. Kocsis and C. Szepesvári. Bandit based Monte-Carlo planning. In J. Fürnkranz, T. Scheffer, and M. Spiliopoulou, editors, Machine Learning: ECML 2006, volume 4212 of LNCS, pages 282–293. Springer, Berlin, Heidelberg, 2006.

[11] J. Kowalski, M. Mika, J. Sutowicz, and M. Szykuła. Regular boardgames. In The Thirty-Third AAAI Conference on Artificial Intelligence, pages 1699–1706. AAAI Press, 2019.

[12] M. Lanctot, E. Lockhart, J.-B. Lespiau, V. Zambaldi, S. Upadhyay, J. Pérolat, S. Srinivasan, F. Timbers, K. Tuyls, S. Omidshafiei, D. Hennes, D. Morrill, P. Muller, T. Ewalds, R. Faulkner, J. Kramár, B. de Vylder, B. Saeta, J. Bradbury, D. Ding, S. Borgeaud, M. Lai, J. Schrittwieser, T. Anthony, E. Hughes, I. Danihelka, and J. Ryan-Davis. OpenSpiel: A framework for reinforcement learning in games. http://arxiv.org/abs/1908.09453, 2019.

[13] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.

[14] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.

[15] M. Lin, Q. Chen, and S. Yan. Network in network, 2014.

[16] N. Love, T. Hinrichs, D. Haley, E. Schkufza, and M. Genesereth. General game playing: Game description language specification, 2008.

[17] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala. PyTorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.

[18] N. Petosa and T. Balch. Multiplayer AlphaZero. In Workshop on Deep Reinforcement Learning at the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), 2019.

[19] É. Piette, C. Browne, and D. J. N. J. Soemers. Ludii game logic guide. CoRR, abs/2101.02120, 2021.

[20] É. Piette, D. J. N. J. Soemers, M. Stephenson, C. F. Sironi, M. H. M. Winands, and C. Browne. Ludii – the ludemic general game system. In G. D. Giacomo, A. Catala, B. Dilkina, M. Milano, S. Barro, A. Bugarín, and J. Lang, editors, Proceedings of the 24th European Conference on Artificial Intelligence (ECAI 2020), volume 325 of Frontiers in Artificial Intelligence and Applications, pages 411–418. IOS Press, 2020.

[21] J. Pitrat. Realization of a general game-playing program. In A. J. H. Morrel, editor, Information Processing, Proceedings of IFIP Congress 1968, Edinburgh, UK, 5-10 August 1968, Volume 2 - Hardware, Applications, pages 1570–1574, 1968.

[22] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In N. Navab, J. Hornegger, W. M. Wells, and A. F. Frangi, editors, Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, pages 234–241, Cham, 2015.

[23] E. Schkufza, N. Love, and M. Genesereth. Propositional automata and cell automata: Representational frameworks for discrete dynamic systems. In W. Wobcke and M. Zhang, editors, AI 2008: Advances in Artificial Intelligence, volume 5360 of LNCS, pages 56–66. Springer, Berlin, Heidelberg, 2008.

[24] J. Schrittwieser, I. Antonoglou, T. Hubert, K. Simonyan, L. Sifre, S. Schmitt, A. Guez, E. Lockhart, D. Hassabis, T. Graepel, T. Lillicrap, and D. Silver. Mastering Atari, Go, chess and shogi by planning with a learned model. Nature, 588:604–609, 2020.

[25] E. Shelhamer, J. Long, and T. Darrell. Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4):640–651, 2017.

[26] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, and D. Hassabis. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419):1140–1144, 2018.

[27] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis. Mastering the game of Go without human knowledge. Nature, 550:354–359, 2017.

[28] C. F. Sironi and M. H. M. Winands. Optimizing propositional networks. In Computer Games, pages 133–151. Springer, 2017.

[29] D. J. Wu. Accelerating self-play learning in Go.