Derived metrics for the game of Go -- intrinsic network strength assessment and cheat-detection
ATTILA EGRI-NAGY, ANTTI TÖRMÄNEN

Abstract.
The widespread availability of superhuman AI engines is changing how we play the ancient game of Go. The open-source software packages developed after the AlphaGo series shifted focus from producing strong playing entities to providing tools for analyzing games. Here we describe two ways in which the innovations of the second-generation engines (e.g. score estimates, variable komi) can be used for defining new metrics that help deepen our understanding of the game. First, we study how much information the search component contributes in addition to the raw neural network policy output. This gives an intrinsic strength measurement for the neural network. Second, we define the effect of a move by the difference in score estimates. This gives a fine-grained, move-by-move performance evaluation of a player. We use this in combating the new challenge of detecting online cheating.

1. Introduction
The game of Go is an ancient board game with simple rules and enormous complexity. It was the last grand challenge for artificial intelligence (AI) in abstract board games, when the challenge is understood as beating the best human player, not as solving the game. AlphaGo (AG) [15] made history by being the first superhuman Go AI engine. By using deep neural networks, AG had a way to integrate the expertise of master players, further enhanced by reinforcement learning and self-play. AlphaGo Zero (AGZ) [16] improved the results by removing human expertise from the training process. These developments are revolutionary in AI. However, the real revolution came afterwards, when the technology became available to all players.

Several new implementations followed the success of AG and AGZ [4, 9, 10, 17, 18]. Given some computational resources, now anyone can build a deep-learning Go engine [13]. Moreover, just a standard gaming PC is capable of providing superhuman play and analysis.

The new implementations did not just recreate the same architecture; several of them went beyond it in terms of providing more information about the game. We call second-generation engines those that can play with variable komi (the points White gets in the beginning as compensation for not making the first move) and give information about the expected score, not just the probability of winning. This fixes the problem of 'slack' moves, which AG was famous for. These were interpreted as mistakes at first, but then it was realized that once a win is secured, the neural network has no preference for choosing efficient moves.

After AG, the focus shifted from creating a superhuman Go-playing entity to developing tools that help in understanding the game and in the learning process of human players. Now, the main usage of superhuman AIs is game analysis.
The structure of the paper. First, we review the basic measures used in deep-learning Go AIs, followed by the suggested derived measures. Then we describe two applications: one for measuring network strength intrinsically, and one for online cheat detection. Next, we describe the developed software tool and close the paper with a discussion.

2. Basic Measures
The AGZ-like systems are based on deep reinforcement learning. Therefore, we can describe their functioning in games and in analysis (we are not considering training here) in terms of neural networks and Monte-Carlo tree search. Here we describe three important measures: the visit count, the win rate, and the score mean.

We denote a state of the game (the board position) by s, specifying the turn number as an index when needed. This way, s_0 denotes the empty board. We denote a move (action) on turn i by a_i: this takes the board position s_{i-1} to s_i. In particular, the first move a_1 takes s_0 to s_1. The action can be a pass.

2.1. Visit count: N(s, a). For move a at board position s, the visit count N(s, a) is the number of times the search algorithm examined a variation starting with a. Monte-Carlo tree search methods keep track of how many times a node in the search tree gets visited. In AGZ [16], the move selection is based solely on the visit count, since the search algorithm keeps visiting the promising moves. Roughly speaking, the number of visits measures how many times a particular candidate move is considered, i.e., how 'interesting' it is. Another way to look at the visit count is to use it as a reliability measure: a move may look very promising with a high chance of winning, but with just a few visits we cannot trust its value. While analysis GUIs expose this value, it may be less used by end users.

2.2. Value function: V(s), win rate. The value function V(s) gives the probability of winning the game at a board position s. For the sake of simplicity, unless otherwise stated we consider the value function from the perspective of Black. In AG [15], a dedicated network was trained for estimating the value function. In AGZ [16] it became another head of the same network, shared with the policy head. It was realized that the same neural computation can be used both for predicting moves and for deciding who is winning.

2.3. Score mean: µ_s.
Convolutional neural networks can have different heads, giving other values beyond a probability distribution for the next move. They can be trained to predict the score lead, the score difference at the end of the game [9, 10, 18]. The score value head combined with the Monte-Carlo search method gives statistical information about the outcome of the game: the score mean value. It can be interpreted as the estimated score difference between the players at the end of the game.
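To make the visit-count measure concrete, the following sketch (with hypothetical visit counts, not taken from any engine) shows how AGZ-style move selection derives a probability distribution from N(s, a):

```python
# Hypothetical sketch: deriving the post-search move distribution from
# MCTS visit counts, as in AlphaGo Zero-style move selection.

def search_policy(visit_counts, temperature=1.0):
    """Normalize visit counts N(s, a) into a probability distribution."""
    powered = {a: n ** (1.0 / temperature) for a, n in visit_counts.items()}
    total = sum(powered.values())
    return {a: x / total for a, x in powered.items()}

visits = {"Q16": 620, "D4": 250, "C3": 100, "R4": 30}  # hypothetical N(s, a)
pi = search_policy(visits)
best_move = max(pi, key=pi.get)  # greedy selection: the most-visited move
```

With temperature 1 this is plain normalization; a lower temperature sharpens the distribution toward the most-visited move.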
How reliable is the score mean?
It is part of the loss function for the neural network's training [18], therefore the reliability of the estimate should increase with the strength of the network. Searching for an indicator, we tried several handcrafted self-play games with KataGo's final 40-block network, starting the game with a balanced integer komi. Handcrafted means that each move is selected by a human operator after extensive analysis, to make sure that the choice is the best possible by the network, with no time control. These games reliably end up in draws, indicating the stability of µ_s.

There is an analogy for the score mean in chess, where the advantage is measured in centipawns (1/100th of the value of a pawn). The score mean has a similar role in Go, with the added benefit that it fully captures the goal of the game. In chess one may need to consider the distance from checkmate as well.

2.4. Score mean vs. win rate – the human perspective.
The relationship between score mean and win rate is not straightforward. On average, a positive score mean comes with a higher than 50% winning chance, but this is not a strict rule. There can be a situation in which there is a high chance of losing by a small margin, but there is still a low-probability possibility of capturing a big group. Therefore, a positive score mean can be associated with a less than 50% win rate. Further properties of the connection can be demonstrated with two simple examples: a high-handicap game and a general consideration of the dynamics of the score mean throughout a game.

In a high-handicap game, Black's advantage might be eroding steadily (reflected in the gradual decline of the score mean), while the win rate stays flat above 90%. Then, the win rate suddenly switches when Black's score mean becomes negative. Analysing this situation without the score mean could mislead us to search for a special meaning in the last little mistake, while it is just one of many.

Every move played in a game reduces the number of its future possibilities. As a game proceeds, its score estimate becomes more likely to be realized; so, while a game's score mean might remain constant and close to even, the game's win rate will eventually drift to an extreme.

Therefore, while the win rate is a useful measure for the AI, it is often unintuitive for human players and it can be misleading. A relatively small mistake can cause a big shift in the win rate. This effect is further amplified if a game is nearing its end. The score mean is a useful measure for human players for two reasons. Firstly, strong human players themselves tend to estimate the values of moves in points, so the score mean values can be easily understood. Secondly, unlike the win rate, the score mean is not affected by the stage of the game. For example, a move that loses one point in terms of the score might cause a win rate shift of 50% in the late game, but only 5% in the early game. A human player cannot visualise this win rate shift, but the one-point loss is easy to understand.
3. Derived Measures
Based on the inner measures of deep-learning Go AI engines, we define new measures to increase their usability and explainability. These can be viewed as new perspectives from which we can understand the games and their analyses better.
3.1. The effect of a move: δ(a). The effect δ(a) is the difference between the score mean after and before a move a: δ(a) = µ_{s_{i+1}} − µ_{s_i}, when a takes board position s_i to s_{i+1}. The difference in the corresponding win rates was used first in Go GUIs, but as discussed before, the score mean is more stable and more informative, so they quickly included the effect as well.

Networks (256 channels)    20 blocks          40 blocks
Games                      early    final     early    final
1846 "Ear-reddening"       61.04%   61.96%    60.43%   64.11%
  (325 positions)          199      202       197      209
2016 "Move 37"             55.66%   55.66%    56.60%   58.01%
  (212 positions)          118      118       120      123
2019 Meijin                53.25%   58.44%    53.68%   61.03%
  (231 positions)          123      135       124      141
2020 kyu game              58.51%   57.44%    63.83%   61.17%
  (188 positions)          110      108       120      115
Table 1. Comparison of hit rate percentages of different networks.
[Figure 1: KL-divergence plotted against the number of visits (100 to 1,000,000).]

Figure 1. Analyzing the turn of move 37 in the second game of the AlphaGo–Lee Sedol match. The same turn is analyzed with different visit counts, with 7 different runs for each visit count.

By gathering statistical information about the effects throughout a game (average of the effects, deviations from the mean, cumulative moving average of the effects), we can characterize the playing skill of a player. However, this alone cannot give a rating to a player, as the effects also depend on the type of game played.
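As a minimal sketch (on hypothetical score mean values), the effect and its cumulative moving average can be computed as follows:

```python
# Sketch of the derived measures of Section 3.1, on hypothetical data.

def effects(score_means):
    """delta(a) = mu_{s_{i+1}} - mu_{s_i} for each consecutive position pair."""
    return [after - before for before, after in zip(score_means, score_means[1:])]

def cumulative_moving_average(values):
    """Running average of the effects, as plotted in the graphs."""
    out, total = [], 0.0
    for i, v in enumerate(values, start=1):
        total += v
        out.append(total / i)
    return out

mu = [7.0, 6.5, 6.8, 5.9]             # hypothetical score means (Black's view)
delta = effects(mu)                    # per-move effects
cma = cumulative_moving_average(delta)
```

In a real game the effects of Black's and White's moves alternate, so a single player's average effect is computed over every other move (with the sign flipped for White).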
3.2. Search gaps: hit rate and KL-divergence. P(s, a), the prior probability of move a at board position s, is provided by the raw network output. This probability distribution p is called the policy. The tree search guided by this policy then produces an updated policy π, the probability distribution of good moves after the search. π can be defined simply by the visit counts [13, 16], as the visit count is a good measure of the value of a candidate move, given that enough simulations were made. The disparity between p and π is the search gap, which can be measured in different ways.
[Figure 2: KL-divergence distributions for the b20 and b40 networks.]

Figure 2. Measuring the KL-divergence after 100,000 visits for a randomly chosen position in each of 915 strong amateur games.
3.2.1. Hit rate. How many times does the search select the same move as the top move of the raw policy? Clearly, this depends on the length of the search: if we allow just a couple of simulations, then this number will be high. The hit rate is therefore relative to the number of simulations.
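A sketch of the hit-rate computation over a set of analyzed positions (the records below are hypothetical):

```python
# Hit rate: fraction of positions where the search's most-visited move
# coincides with the raw policy's top move. Hypothetical example data.

def hit_rate(positions):
    hits = sum(
        1
        for pos in positions
        if max(pos["policy"], key=pos["policy"].get)
        == max(pos["visits"], key=pos["visits"].get)
    )
    return hits / len(positions)

positions = [
    {"policy": {"Q16": 0.5, "D4": 0.3}, "visits": {"Q16": 800, "D4": 200}},  # hit
    {"policy": {"C3": 0.6, "R4": 0.2}, "visits": {"C3": 100, "R4": 900}},    # miss
]
rate = hit_rate(positions)
```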
3.2.2. KL-divergence.
The Kullback-Leibler divergence [7] is a fundamental tool for comparing two discrete probability distributions, P and Q:

D_KL(P ‖ Q) = Σ_x P(x) ln ( P(x) / Q(x) ).

It is a measure of the disparity of the two distributions, although it is not a distance metric. It measures how much information is lost if we use the distribution Q instead of P. It is a nonnegative number, and it is zero exactly when the distributions are the same. The KL-divergence is a natural choice in the context of deep learning, since the closely related cross-entropy is used as a loss function for training the networks [13].

We want to measure D_KL(p ‖ π), but there are a couple of issues. Both p and π can have zero entries: there are illegal moves (a stone is already there, a suicide move, or a ko situation), and the search will also visit only a subset of the possible moves, so in general we do not have visit counts for all legal moves. Therefore, we take the moves actually visited in the search tree and define π′ by their visit counts, normalized. So π′ is the probability distribution of the moves considered by the network. Note that this is now well-defined, while we used π informally before. Then we take the set of moves included in π′, find the corresponding probabilities in p, and, restricting to those moves, normalize to get p′.

As a rough but useful analogy, we can say that the output of the neural network corresponds to human intuition, while the search algorithm resembles step-by-step logical thinking. Just as humans mix these two types of thinking, the computer combines the deep neural networks with tree search. We want to measure the strength of the intuition of the deep neural networks. This can be done by comparing the policy with and without tree search.

4. Application: Intrinsic Strength of Networks
How far are the deep neural networks from perfect minimax play?
For now, the universally agreed answer to this almost philosophical question is that they are very far. We stop training a network due to external reasons (e.g. the cost of computational resources), not because we have reached a theoretical limit for improvement. If a network played perfectly, the reported win rates would be more polarized, tending to one of the values 0, 0.5, or 1.0. Also, in that case, we would not need the tree search.

How long shall we run a game analysis?
This is a more practical, but related, question. Can we simply use the raw network output policy? As a calibration test, we analyzed a game position with different visit counts. Since the tree search is probabilistic, we repeated the analyses several times. Fig. 1 shows the results of 7 batches. A low number of visits gives a rather different policy, since it takes a few simulations for the Monte-Carlo algorithm to balance the exploitation/exploration ratio. After that we see an increasing KL-divergence value. Due to practical considerations, we chose 100,000 visits for further experiments.

We analyzed four full games with respect to the hit rates of four different networks (Table 1). The games were chosen to be different in style and strength. The first is a historical game from 1846, the famous 'ear-reddening' game [12]. The second game is from the AlphaGo vs. Lee Sedol match in 2016; the second game of the match contains the famous move 37, an example showing that a computer can also have creative ideas. The third is taken from the 44th Meijin title match, as an example of post-AG professional play. And the fourth is an amateur game. The results show a tendency of higher hit rates for stronger networks, both in terms of structure and length of training. Interestingly, the amateur game shows the opposite tendency.

The percentages in Table 1 are reminiscent of the success rate of supervised learning for predicting human expert moves used in the first version of AG [15]. In self-play-based reinforcement learning, the network is trying to predict the outcome of the tree search indirectly. So one might wonder whether it would be possible to improve the networks without any more self-play games; after all, the tree search is a short-circuited self-play. Of course, this could only work for fine-tuning networks that are already strong, since the external reward signal is not available. We invite the deep learning community to test this hypothesis.

The above analysis has the problem that moves in a game are correlated. Therefore, we also measured the KL-divergence over 915 games from the KGS server. All the games are between players of 4 dan or better, so they represent strong amateur play. We picked a random position from each and did a 100,000-visit analysis. Fig. 2 compares the KL-divergence of the early 20-block and the late 40-block KataGo networks. We can observe that the stronger network has smaller KL-divergence values on average; also, the maximal values are more extreme.
5. Application: Cheat Detection
Using an AI engine for finding the best moves and variations in a game is called analysis after the game is finished, and cheating when the game is still ongoing. The widespread availability of AI engines is beneficial in many ways. Most notably, one can improve one's playing skills by reviewing one's games with an AI. However, there are downsides to the technological progress: many players report cheating on online Go servers. With the availability of superhuman AI engines, online cheating might be rampant; but, besides the rare cases where a player admitted to cheating, there is no direct evidence for this except the gut feelings of strong human players.

Cheating defeats the purpose of online playing, where one wants to have a human opponent. On Asian servers, the top ranks are reportedly infested by cheaters. This has resulted in previously top-ranked humans dropping to lower ranks, starting a snowball effect inside the servers' ranking systems. If no countermeasures to cheating are found, it is possible that in the near future online ratings will be largely devalued.

Figure 3. Game 1: White is likely using an AI. The SGF files for the presented games are available upon request.

Figure 4. White 26, 28, 42, and 44 are exactly correct according to KataGo – even though a human player could think of many viable plans in these parts of the game.

Strong players can quickly and reliably assess an opponent's strength. Consequently, experienced players can recognize superhuman AI opponents. Could this be reproduced, or at least helped, by software tools?

Players with a rating history are easier to catch cheating, by noticing a sudden increase in their won games. However, clever cheaters who only consult an AI occasionally may be impossible to detect this way. Also, the availability of AI-based training tools may accelerate individual learning.

In this research, we do not consider players' histories, so we can deal with newly registered users as well. Our aim is therefore to be able to decide whether cheating happened in a single game solely based on the game record.
[Figure 5: White's score mean values (AI, choice, average, median) and win rate, by move.]

Figure 5. White's score mean and win rate graphs for Game 1. In order to indicate the nature of the board position (whether only a single 'forced' move is available, or there are several equally good options), we display the AI's best move, the actual choice made by the player, and the average and median of the candidate moves considered by the engine. Before move 86, when White's win rate hits 98%, his moves were almost perfect. Afterwards, White's play becomes less sharp, as indicated by the distance of the AI and choice lines; but the win rate does not change, suggesting an AI's 'safe play mode'.
Prior work. Chess has a longer history of living with superhuman AI engines, thus the integrity of online games has been investigated extensively. However, the conclusion is that fully automated cheat detection is not possible. In [2] it is demonstrated that 'false positives' are abundant. This was shown by the existence of historic games that would be classified as cheating, though that clearly could not have happened.

In [3], the theory of complex networks and the PageRank algorithm was used to find distinguishing statistical features of human and computer play. The analysis was based on local information (3 × ...).

5.1. Human ways of recognizing an AI-using cheater.
The ways that human players recognize AI-using cheaters, listed in this section, might be of help in designing software tools for automatically catching cheaters.

5.1.1. Temporal evidence.
When a cheater consults an AI, there is a near-constant time lag created by the cheater inputting their opponent's last move into the AI program and waiting a moment for the AI to come up with an answer. When done in a straightforward fashion, this results in the cheater always playing their move after, for example, five seconds – no matter whether the move is obvious or extremely difficult for a human player to come up with.
5.1.2. Playing style. It did not take long for human players to notice that AI engines have a discernible playing style, emphasising quick exchanges and maintaining a whole-board balance. At first the difference from human players was glaring, but human players have since adopted the AI's favored techniques, resulting in a human-AI blend.

Still, there are many moments during games when, according to the AI, an 'obvious' move by human intuition is wrong, with the correct move being something very unintuitive. When several such moves get played by the same player in a single game, the player is suspicious.
5.1.3. Safe play when ahead. As most AI engines choose their moves by the win rate estimate, when a game is deemed practically 'over' (at roughly 98% and above), they will start playing moves that are not optimal in terms of the score mean but that still retain the player's win rate. This leads to the AI choosing moves that a human player would consider 'slack', and often a strong human player can notice when their opponent enters this kind of 'safe play mode'.
5.1.4. Seemingly inconsistent play. Human players and AI engines choose their moves very differently. Strong human players generally:
(1) analyse and judge the current whole-board situation,
(2) try to identify the most important or valuable areas of the board,
(3) create a plan for how to develop the game, and
(4) finally choose a move that furthers the plan.
This process is then more or less repeated on each move, with adjustments made as necessary depending on what the opponent is doing.

As the opponent generally acts on a similar modus operandi, it becomes valuable for a strong player to try to infer what the opponent is planning and to adjust their own plan accordingly. For strong human players, this generates a kind of non-verbal discussion or give-and-take that takes place on the Go board. For this reason, Go is sometimes referred to as 'hand talk' in Asian countries.

As the AI does not form plans in the way humans do, it is not possible for a human to create this kind of higher-level discussion with an AI engine. The AI will constantly play moves that, to a human, seem to betray its plan – possibly only because the human player is unable to grasp it.
5.2. Case studies. In this section, we analyse four games, three of which (most likely) involve cheaters. All four games were played online and analysed by a professional Go player. As we have no input from the other player, ultimately there is no hard evidence on whether they were cheating or not.

The difficulty of identifying a cheater depends greatly on whether the cheater is trying to cover their cheating or not. A clever cheater will vary the time they use for their moves, playing 'obvious' moves quickly and taking more time for difficult moves; and they will also not always play the AI's best recommended move. Additionally, as AI engines rarely make big mistakes (especially early on in the game), a clever cheater would optimally try to include a few larger mistakes in their play.

The analysis has been performed as follows. First, the win rate graph of the game is checked. Since a cheater is using the AI to win the game, the win rate graph will generally tend to be one-sided, steadily rising to 99%; large shifts should not take place, as even a strong AI engine might not be able to beat a strong human if it falls too far behind. An exception is if both players are cheating, in which case the win rate usually progresses evenly for most of the game.

Secondly, the development of the player's average effect during the game is checked. Of particular interest are the final average effect for the whole game, which is a general indicator of the player's skill, and whether the players' average effects develop in similar stages. Also, a player's moves after their win rate has reached 98% can be indicative of AI involvement, as an AI will start playing moves that are inefficient in terms of the score mean after this point.

Thirdly, we check how the player performed in comparison to KataGo's move recommendations. If the player played moves that are roughly as good as KataGo's first recommendations, the player is suspect; whereas if the player does considerably worse than KataGo, that is evidence of either human play or at least of the player avoiding the AI's best recommended moves.
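The three checks can be combined into a rough screening heuristic. The following sketch is only illustrative: the threshold values are our assumptions for this example, not calibrated constants from the study, and the final judgment is always left to a human arbiter.

```python
# Illustrative screening sketch for the three checks described above.
# Thresholds are assumed for the example, not calibrated values.

def suspicion_flags(win_rates, effects, top_move_matches):
    """win_rates: the player's per-move win rate (0..1);
    effects: the player's per-move effects (score mean differences);
    top_move_matches: booleans, True if the move was the engine's first choice."""
    flags = []
    # 1. One-sided game: the win rate never drops sharply.
    biggest_drop = max(
        (earlier - later for earlier, later in zip(win_rates, win_rates[1:])),
        default=0.0,
    )
    if biggest_drop < 0.05:
        flags.append("one-sided win rate graph")
    # 2. Average effect close to zero suggests near-perfect play.
    if sum(effects) / len(effects) > -0.3:
        flags.append("implausibly small average effect")
    # 3. High agreement with the engine's first recommendations.
    if sum(top_move_matches) / len(top_move_matches) > 0.7:
        flags.append("high agreement with the engine's top moves")
    return flags

flags = suspicion_flags(
    [0.50, 0.55, 0.62, 0.70, 0.85, 0.98],   # steadily rising win rate
    [-0.1, 0.0, -0.2, -0.1, 0.0],           # tiny score mean losses
    [True, True, True, False, True],        # 80% top-move agreement
)
```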
5.3. Game 1. The white player in Fig. 3 is most likely consulting an AI. Firstly, an experienced human player can already find White's opening suspicious when comparing White's choices with the AI's suggestions. Moves 26 and 28, shown in Fig. 4, are non-obvious moves to a human but first options for the AI. A bit later, 42 and 44 are another combination that looks made up on the spot, but exactly matches the AI's recommendation. For a third example, 58 and its follow-up are very rarely seen in human play and, while not KataGo's first recommendation, perform just about as well.

Secondly, as shown in Fig. 5, White basically makes no mistakes up until move 86, even though Black is a professional player. This is difficult to accomplish even for a top human player.

Thirdly, after White reaches a 98% win rate at move 86, as shown in Fig. 5, White's play gets sloppy in terms of the score mean. After this point, the white average effect starts decreasing, but the win rate is firmly stuck at 99%.

All three pieces of evidence put together, it is very likely that an AI engine was involved.
5.4. Game 2. Both players in Fig. 6 are most likely consulting an AI. Most of the moves in this game are among KataGo's top picks. Furthermore, the players' average effects are extremely small (−0.25 and −0.20) even though there is a large variance in the score means of KataGo's considered moves, as shown in Fig. 7. Even the world champion of Go would find it difficult to play this well.
Figure 6. Game 2: Both players are likely using an AI.

5.5. Game 3. Most likely neither player in Fig. 8 consulted an AI – this is a game by strong human players. As shown in Fig. 9, the average effect for the players peaks at around −0.… and −0.65, which are reasonable numbers for strong human players. Comparing the players' chosen moves with KataGo's recommended alternatives, we see that both players generally perform better than the average choice but worse than the best choice, with plenty of exceptions in both directions.

An AI-using smart cheater might attempt to play bad moves from time to time, but not so much that it should threaten their win. The win rate graph in Fig. 9 shows that this is not the case here, as there are large shifts in the win rate in the first third of the game: first Black got a considerable lead, then White turned the game around, after which Black caught up again, after which White took off to a decisive lead. For further evidence, White's win rate wavers even after first hitting 99%, which is common in human games.

While it is impossible to prove that neither player used the AI at any point during the game, it does not look like an AI was consulted to decide the outcome of the game.
5.6. Game 4. The black player in Fig. 10 is most likely consulting an AI. This case is possibly the most obvious to a strong human player. First, black 31 and 35 in Fig. 11 are moves that few human players could consider. Then, Black's play from 39 to 51, after which Black lives comfortably in the centre, would also be unthinkable to most – but all of these black moves are KataGo's first recommendations. A bit later, black 63 also looks mistimed in human terms, but is among KataGo's top choices.

Secondly, looking at the win rate graph in Fig. 12, Black's win rate heads directly to 99% with practically no drops. This is evidence of a vast difference in skill between the players – even though White is a professional player who did not play particularly badly in this game, according to KataGo.

Thirdly, looking at the size of Black's average effect in Fig. 12, we see that Black manages an impressive −0.16 until move 61, at which point Black's win rate has reached 98%. After this, Black's moves get sloppier in terms of the score mean, which further suggests an AI.

All three pieces of evidence put together, it is very likely that an AI engine was involved.

[Figure 7: cumulative moving average of effects (Black and White), and both players' score mean values for Game 2.]

Figure 7. The two players' average effects and score means for Game 2. Both players' average effects are considerably small when taking into account the 'volatility' of the game, indicated by the distance between the AI, average, and median lines in the score mean graphs.
6. Software Implementation
We developed a dedicated software package for the described computations and for generating the diagrams. The source code of LambdaGo is available at https://github.com/egri-nagy/lambdago. It is a simple command-line tool. The core system (including a game engine) is written in the Clojure language. Due to its dynamic nature, this functional language is particularly suited for data-driven experimentation [5]. It is hosted on the JVM, therefore it also has convenient access to the whole Java ecosystem.

For parsing the game record SGF files [1], in order to avoid writing yet another parser, we use a parser generator, Instaparse (https://github.com/Engelberg/instaparse). This library is based on the idea of parsing with derivatives [8].
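For illustration (in Python rather than the tool's Clojure), the core of move extraction from a simple SGF record can be sketched with a regular expression; note that, unlike a real parser, this ignores variations and escaped ']' characters inside property values:

```python
import re

# Minimal SGF move extraction, for illustration only. A real parser (such as
# the Instaparse-based one in LambdaGo) must handle variations and escapes.

def extract_moves(sgf_text):
    """Return (color, coordinates) pairs from B[..] and W[..] properties."""
    return re.findall(r";\s*([BW])\[([a-z]{0,2})\]", sgf_text)

sgf = "(;GM[1]SZ[19];B[pd];W[dp];B[qp];W[])"  # W[] denotes a pass
moves = extract_moves(sgf)
```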
Figure 8. Game 3: Neither player is likely using an AI.

[Figure 9: cumulative moving average of effects, both players' score mean values, and the win rate for Game 3.]

Figure 9. The two players' average effects, score means, and the win rate for Game 3. The fairly low size of the players' average effects, the variance in the score mean graphs, and the up-and-down movement in the win rate graph suggest that this was a game by strong human players.

Figure 10. Game 4: Black is likely using an AI.

Figure 11. Most of these black moves would be difficult for a human to come up with, but they align with KataGo's recommendations.

The visualization of the graphs is done with the Vega-Lite library [14]. It is a high-level grammar of graphics that allowed us to automate the task of diagram generation. The diagrams are just data (JSON files), which can be manipulated easily in LambdaGo before rendering in a browser. The Go diagrams are made with GOWrite 2 [11], a high-quality Go publishing tool.

The workflow of the system evolved through the cheat-detection application, and it has two steps: analysis and visualization.
Analysis. The analysis can be done with the Lizzie GUI application (https://github.com/featurecat/lizzie). This was designed as an interface to Leela Zero [4], but it was later adapted to work with other engines as well. It produces SGF files with the analysis information added. The KataGo engine [18] also has a direct interface to its analysis engine, which accepts and emits information in the ubiquitous JSON format. The analysis is a GPU-intensive and time-consuming computation, so for practical reasons we need to limit the visit counts.
Visualization.
The output of the analysis can be quickly processed to generate the diagrams, also in batch mode. We expect that these visualization features will appear in other tools as well, as the analysis needs of users reach more sophisticated levels.
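The processing step can be sketched as follows (in Python rather than LambdaGo's Clojure, with made-up numbers): the effect of a move is the change it causes in the score estimate, sign-adjusted so that a negative value is always a loss for the player who moved, and the cumulative moving average of the effects is the quantity plotted in the first panel of the figures. Since the diagrams are just data, the result can be wrapped directly in a minimal Vega-Lite specification.

```python
def effects(score_means):
    """Per-move effects from score estimates (Black's point of view),
    score_means[i] being the estimate after move i. The effect of a move
    is the change in the estimate, negated for White so that a negative
    effect always means the mover lost points."""
    out = []
    for i in range(1, len(score_means)):
        delta = score_means[i] - score_means[i - 1]
        mover_is_black = (i % 2 == 1)  # moves alternate, Black plays first
        out.append(delta if mover_is_black else -delta)
    return out

def cumulative_moving_average(xs):
    """Running mean of the effects, as plotted in the figures."""
    total, out = 0.0, []
    for i, x in enumerate(xs, start=1):
        total += x
        out.append(total / i)
    return out

def effect_chart(effect_values):
    """A minimal Vega-Lite line-chart spec: the diagram really is just data."""
    values = [{"move": i + 1, "cumsum": v}
              for i, v in enumerate(cumulative_moving_average(effect_values))]
    return {"data": {"values": values},
            "mark": "line",
            "encoding": {"x": {"field": "move", "type": "quantitative"},
                         "y": {"field": "cumsum", "type": "quantitative"}}}

# made-up score estimates for a short opening sequence
scores = [0.0, -0.5, 0.25, -0.25]
print(effects(scores))  # → [-0.5, -0.75, -0.5]: both players losing points
```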
Figure 12. Black's average effect, both players' score means, and the win rate for Game 4. (Panels: cumulative moving average of effects; Black's scoreMean values; White's scoreMean values; win rate.) The straightforwardness of the win rate graph as well as Black's small average effect suggest AI involvement.

7. Conclusion
Building upon the advances in artificial intelligence and the developments in open-source software projects, we suggested novel measures for evaluating and understanding AI game analyses. Measuring the search gap (the added value of the tree search to the raw output of the neural network) allows us to measure the strength of the network intrinsically, without playing against other networks. The effect of a move can be used for assessing a player's performance with high resolution (move by move). We showed that an investigation of the effect can be helpful in detecting online cheating. Although automated cheat-detection may never be feasible due to the danger of false positives, we used these tools in a real online tournament and could catch a cheating player, who admitted the misconduct. This is an example of a successful collaboration of a human arbiter and an AI engine, according to the human-plus-machine paradigm envisioned by former chess world champion Garry Kasparov [6]. What happens in the world of the game of Go will happen in other aspects of our lives, and therefore it is valuable to understand the effects of AI technologies on the game.

In this paper we demonstrated that deriving further measures based on the inner parameters and outputs of deep learning AI engines can provide useful tools for solving practical problems and ways to advance theoretical Go knowledge.
Acknowledgment.
This paper benefited from conversations with Go AI tool developers Sander Land (KaTrain), Benjamin Teuber (AI Sensei) and David Wu (KataGo).
References

[1] The Smart Game Format.
[2] David J. Barnes and Julio Hernandez-Castro. On the limits of engine analysis for cheating detection in chess. Computers & Security, 48:58–73, 2015.
[3] C. Coquidé, B. Georgeot, and O. Giraud. Distinguishing humans from computers in the game of go: A complex network approach. EPL (Europhysics Letters), 119(4):48001, August 2017.
[4] Gian-Carlo Pascutto et al. Leela Zero – Go engine with no human-provided knowledge, modeled after the AlphaGo Zero paper. https://zero.sjeng.org, 2019. https://github.com/leela-zero/leela-zero.
[5] Rich Hickey. A history of Clojure. Proc. ACM Program. Lang., 4(HOPL), June 2020.
[6] G. Kasparov. Deep Thinking: Where Machine Intelligence Ends and Human Creativity Begins. Millennium series. Hodder & Stoughton, 2017.
[7] S. Kullback and R. A. Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86, 1951.
[8] Matthew Might, David Darais, and Daniel Spiewak. Parsing with derivatives: A functional pearl. SIGPLAN Not., 46(9):189–195, September 2011.
[9] Francesco Morandin, Gianluca Amato, Marco Fantozzi, Rosa Gini, Carlo Metta, and Maurizio Parton. SAI: a sensible artificial intelligence that plays with handicap and targets high scores in 9x9 Go (extended version), 2019.
[10] Francesco Morandin, Gianluca Amato, Rosa Gini, Carlo Metta, Maurizio Parton, and Gian-Carlo Pascutto. SAI: a sensible artificial intelligence that plays Go, July 2019.
[11] Lauri Paatero. GOWrite. https://gowrite.net/, 2009.
[12] J. Power. Invincible, the Game of Shusaku. Game Collections Series. Kiseido Publishing Company, 1998.
[13] M. Pumperla and K. Ferguson. Deep Learning and the Game of Go. Manning Publications, 2019.
[14] Arvind Satyanarayan, Dominik Moritz, Kanit Wongsuphasawat, and Jeffrey Heer. Vega-Lite: A grammar of interactive graphics. IEEE Trans. Visualization & Comp. Graphics (Proc. InfoVis), 2017.
[15] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, January 2016.
[16] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. Mastering the game of Go without human knowledge. Nature, 550:354–359, October 2017.
[17] Yuandong Tian, Jerry Ma, Qucheng Gong, Shubho Sengupta, Zhuoyuan Chen, James Pinkerton, and C. Lawrence Zitnick. ELF OpenGo: An analysis and open reimplementation of AlphaZero, 2019.
[18] David J. Wu. Accelerating self-play learning in Go, 2019.
Akita International University, Department of Mathematics and Natural Sciences, Yuwa, Akita-City 010-1292, Japan

Nihon Ki-in – Japan Go Association, 7-2 Gobancho, Chiyoda City, Tokyo 102-0076, Japan
Email address: