Improving Model and Search for Computer Go
Tristan Cazenave
LAMSADE, Université Paris-Dauphine, PSL, CNRS, France
Abstract.
The standard for Deep Reinforcement Learning in games, following Alpha Zero, is to use residual networks and to increase the depth of the network to get better results. We propose to improve mobile networks as an alternative to residual networks and experimentally show the playing strength of the networks according to both their width and their depth. We also propose a generalization of the PUCT search algorithm that improves on PUCT.
Training deep neural networks and performing tree search are the two pillars of current board games programs. Deep reinforcement learning combining self play and Monte Carlo Tree Search (MCTS) [5,9] with the PUCT algorithm [12] is the current state of the art for computer Go [14,16] and for other board games [13,4].

MobileNets have already been used in computer Go [3]. In this paper we improve on this approach, evaluating the benefits of improved MobileNets for computer Go using Squeeze and Excitation in the inverted residuals and using a multiplication factor of 6 for the planes of the trunk. We get large improvements both for the accuracy and the value, whereas previous work obtained large improvements only for the value.

Some papers only take the accuracy of networks and the number of parameters into account. For games the speed of the networks is a critical property since the networks are used in a real time search engine. There is a dilemma between the accuracy and the speed of the networks. We experiment with various depths and widths of networks and find that when increasing the size of the networks there is a balance to keep between the depth and the width of the networks.

We are also interested in improving the Monte Carlo Tree Search that uses the trained networks. We propose Generalized PUCT (GPUCT), a generalization of PUCT that makes the best constant invariant to the number of descents.

The remainder of the paper is organized as follows. The second section presents related works in Deep Reinforcement Learning for games. The third section describes the Generalized PUCT bandit. The fourth section details the neural networks we trained. The fifth section gives experimental results.
Monte Carlo Tree Search (MCTS) [5,9] revolutionized Artificial Intelligence applied to board games. A second revolution occurred when it was combined with Deep
Reinforcement Learning, which led to a superhuman level of play in the game of Go with the AlphaGo program [12].

Residual networks [2], combined with policy and value heads sharing the same network and Expert Iteration [1], improved much on AlphaGo, leading to AlphaGo Zero [14] and zero learning. With these improvements AlphaGo Zero was able to learn the game of Go from scratch and surpassed AlphaGo.

There were many replications of AlphaGo Zero, both for Go and for other games. Examples are ELF/OpenGo [15], Leela Zero [10], Crazy Zero by Coulom and the current best open source Go program KataGo [16].

The approach was also used for learning a large set of games from zero knowledge with Galvanise Zero [6] and Polygames [4].
AlphaGo used a convolutional network with 13 layers and 256 planes. Current computer Go and computer games programs use a neural network with two heads, one for the policy and one for the value, as in AlphaGo Zero [14]. Using a network with two heads instead of two networks was reported to bring a 600 ELO improvement, and using residual networks [2] brought another 600 ELO improvement. The standard for Go programs is to follow AlphaGo Zero and use a residual network with 20 or 40 blocks and 256 planes.

An innovation in the KataGo program is to use Global Average Pooling in some layers of the network, combined with the residual layers. It also uses more than two heads as it helps regularization.

Polygames also uses Global Average Pooling in the value head. Together with a fully convolutional policy head, this makes Polygames networks invariant to changes in the size of the board.
MobileNet [7] and then MobileNetV2 [11] are parameter efficient neural network architectures for computer vision. Instead of the usual convolutional layers in the blocks, they use depthwise convolutions. They also use 1x1 filters to go from a small number of channels in the trunk to 6 times more channels in the blocks.

MobileNets were successfully applied to the game of Go [3]. Our approach is a large improvement on this work, using Squeeze and Excitation and a 6 fold increase in the number of channels in the blocks. We also compare networks of different widths and depths and show that some possible choices for increasing width and depth are dominated, and that it is better to increase both width and depth when making the networks grow.
The Monte Carlo Tree Search algorithm used in current computer Go programs since AlphaGo is PUCT, a bandit that includes the policy prior in its exploration term. The PUCT bandit is:

V(s, a) = Q(s, a) + c × P(s, a) × √N(s) / (1 + N(s, a))

where P(s, a) is the probability of move a being the best move in state s given by the policy head, N(s) is the total number of descents performed in state s and N(s, a) is the number of descents for move a in state s.

We propose to generalize PUCT, replacing the square root with an exponential and using a parameter τ for the exponential. The Generalized PUCT bandit (GPUCT) is:

V(s, a) = Q(s, a) + c × P(s, a) × e^(τ × log(N(s))) / (1 + N(s, a))

This is a generalization of PUCT since for τ = 0.5 this is the PUCT bandit.

We experimented with various constants and numbers of evaluations for the PUCT bandit. We found that for small numbers of evaluations the 0.1 constant was performing well. In order to make the experiments stable we made PUCT with numbers of evaluations starting at 16 and doubling until 512 and a constant of 0.1 play 400 games against PUCT with the same number of evaluations but varying constants. The results are given in Table 1. The last line of the table is the average winning rate. On average the 0.15 constant is the best and the following constants have decreasing averages.

Table 1: Evolution of win rates with the constant and the number of descents. Average over 400 games against the 0.1 constant.
It is clear from this table that there is a drift of the best constant towards greater values. We repair this undesirable property with Generalized PUCT. In order to find the best τ for Generalized PUCT we used the following optimization:

argmin_{τ,c} ( Σ_d | c × e^(τ × log(d)) − c_d × e^(0.5 × log(d)) | )
where d is the budget (the first column of Table 1) and c_d is the best PUCT constant for budget d (for example 0.15 for d = 32).

Given the data in Table 1 the best parameter values we found are τ = 0. and c = 0. In order to verify that these values counter the drift in the best constants of PUCT, we made PUCT with different budgets and constants play against Generalized PUCT with τ = 0. and c = 0. and the same budget as PUCT. The results are given in Table 2. For small budgets the PUCT algorithm with a constant of 0.2 is close to GPUCT, but when the budget grows to 1024 or 2048 descents, GPUCT gets much better than PUCT. This is also the case for PUCT with the 0.1 and 0.15 constants.

Table 2: Comparison of PUCT with constants of 0.1 and 0.2 against GPUCT with τ = 0. and c = 0. 400 games.
Our findings are consistent with the PUCT constant used in other zero Go programs. For example ELF uses a 1.5 constant [15] for many more tree descents than we do. Our model fits this increase of the value of the best constant when the search uses more descents.

We did not try to tune the constants for GPUCT and still get better results than PUCT. Some additional strength may come from tuning the constants instead of only using the two constants coming from optimizing on the data of Table 1.

Note that having the same constant for different budgets can also be useful for programs that use pondering or adaptive thinking times, besides being more convenient for tuning the constant.
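As a concrete illustration, the following Python sketch shows how the GPUCT selection rule can be computed at a node; the function and variable names are ours, and the default constants are placeholders rather than the tuned values of the paper.

import math

def gpuct_value(q, prior, parent_descents, child_descents, c=0.2, tau=0.5):
    # exp(tau * log(N(s))) replaces sqrt(N(s)); tau = 0.5 recovers standard PUCT
    exploration = c * prior * math.exp(tau * math.log(max(parent_descents, 1)))
    return q + exploration / (1 + child_descents)

def select_move(children, c=0.2, tau=0.5):
    # children: list of (move, Q, prior, descents); N(s) is the sum of the descents
    n_parent = sum(n for _, _, _, n in children)
    best = max(children, key=lambda ch: gpuct_value(ch[1], ch[2], n_parent, ch[3], c, tau))
    return best[0]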
In this section we start by describing the dataset we built. We then detail the training and the inputs and labels. We also explain how we have added Squeeze and Excitation to the MobileNets and how we experimented with various depths and widths of the networks. We finally give experimental results for various networks.
Katago is the strongest available computer Go program. It has released its self-played games of 2020 as sgf files. We selected from these games the games played on a 19x19 board with a komi between 5.5 and 7.5. We built the Katago dataset taking the last 1,000,000 games played by Katago. The validation dataset is built by randomly selecting 100,000 games from the 1,000,000 games and taking one random state from each game. The games of the validation set are never used during training.

The Katago dataset is a better dataset than the ELF and the Leela datasets used in [3]. Katago plays Go at a much better level than ELF and Leela, so the networks trained on the Katago dataset are trained with better data.
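A minimal sketch of this filtering and splitting, assuming hypothetical game objects with board_size, komi and states fields:

import random

def build_datasets(games, n_games=1_000_000, n_validation=100_000):
    # keep only 19x19 games with a komi between 5.5 and 7.5
    games = [g for g in games if g.board_size == 19 and 5.5 <= g.komi <= 7.5]
    # take the last 1,000,000 games
    games = games[-n_games:]
    # 100,000 games are held out and never used for training;
    # one random state is taken from each held-out game
    held_out = set(random.sample(range(len(games)), n_validation))
    training_games = [g for i, g in enumerate(games) if i not in held_out]
    validation_states = [random.choice(games[i].states) for i in held_out]
    return training_games, validation_states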
Networks are trained with Keras, with each epoch containing 1,000,000 states randomly taken from the Katago dataset. The evaluation on the validation set is computed every epoch. The loss used for evaluating the value is the mean squared error (MSE) as in AlphaGo. However we train the value with the binary cross entropy loss. The loss for the policy is the categorical cross entropy loss. We evaluate the policy according to its accuracy on the validation set.

The batch size is 32 due to memory constraints. The annealing schedule is to train with a learning rate of 0.0005 for the first 100 epochs, then with a learning rate of 0.00005 for 50 epochs, then with 0.000005 for 50 epochs, and finally with 0.0000005 for another 50 epochs. An epoch takes between 3 minutes for the smallest network and 30 minutes for the larger ones using a V100 card.

All layers in the networks have L2 regularization with a weight of 0.0001. The loss is the sum of the value loss, the policy loss and the L2 loss.
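The following Keras sketch summarizes this training setup for a two-headed model; the optimizer and output ordering are assumptions, and the L2 term is assumed to come from per-layer kernel regularizers so that it is already included in the model losses.

from tensorflow.keras.callbacks import LearningRateScheduler

def learning_rate(epoch):
    # 100 epochs at 0.0005, then 50 epochs each at 0.00005, 0.000005 and 0.0000005
    if epoch < 100:
        return 0.0005
    if epoch < 150:
        return 0.00005
    if epoch < 200:
        return 0.000005
    return 0.0000005

def train(model, inputs, policy_labels, value_labels):
    # policy head: categorical cross entropy; value head: binary cross entropy;
    # the value MSE is only tracked as a metric, as in AlphaGo
    model.compile(optimizer='adam',  # the optimizer is not specified in the paper
                  loss=['categorical_crossentropy', 'binary_crossentropy'],
                  metrics=[['accuracy'], ['mse']])
    model.fit(inputs, [policy_labels, value_labels], batch_size=32, epochs=250,
              callbacks=[LearningRateScheduler(learning_rate)])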
The inputs of the networks use the colors of the stones, the liberties, the ladder status of the stone, the ladder status of adjacent strings (i.e. whether an adjacent string is in a ladder), the last 5 states and a plane for the color to play. The total number of planes used to encode a state is 21.

The labels are a 0 or a 1 for the value head: a 0 means Black has won the game and a 1 means White has won. For the policy head there is a 1 for the move played in the state and a 0 for all other moves.
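A small sketch of the label encoding; the 19x19x21 input shape follows the description above, while the flat 361-move policy vector (without a pass move) is an assumption.

import numpy as np

INPUT_SHAPE = (19, 19, 21)  # stone colors, liberties, ladder status, adjacent
                            # ladder status, last 5 states, color to play

def encode_labels(white_won, move_index, board_size=19):
    # value label: 0 if Black won the game, 1 if White won
    value = np.float32(1.0 if white_won else 0.0)
    # policy label: 1 for the move played in the state, 0 for all other moves
    policy = np.zeros(board_size * board_size, dtype=np.float32)
    policy[move_index] = 1.0
    return policy, value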
We add Squeeze and Excitation [8] to the MobileNets so as to improve their performance in computer Go. We do this by adding a Squeeze and Excitation block at the end of the MobileNet block, before the addition. The Squeeze and Excitation block starts with Global Average Pooling, followed by two dense layers and a multiplication of the input tensor by the output of the dense layers. We give here the Keras code we used for this block:
from tensorflow.keras.layers import GlobalAveragePooling2D, Reshape, Dense, multiply

def SE_Block(t, filters, ratio=16):
    se_shape = (1, 1, filters)
    # squeeze: Global Average Pooling over the spatial dimensions
    se = GlobalAveragePooling2D()(t)
    se = Reshape(se_shape)(se)
    # excitation: two dense layers ending with a sigmoid gate
    se = Dense(filters // ratio, activation='relu', use_bias=False)(se)
    se = Dense(filters, activation='sigmoid', use_bias=False)(se)
    # rescale the input tensor channel-wise
    x = multiply([t, se])
    return x

In order to improve the performance of AlphaGo Zero, its authors made the network grow from 20 residual blocks of 256 planes to 40 residual blocks of 256 planes. In this paper we claim that MobileNets with Squeeze and Excitation give much better results than residual networks for the game of Go, and we also claim that to improve the performance of a network it is better to make both the number of blocks (i.e. the depth of the network) and the number of planes (i.e. the width of the network) grow together.
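To show where the SE block sits, here is a hedged sketch of a MobileNetV2-style inverted residual block with a 6-fold expansion and the SE block applied just before the residual addition, reusing the SE_Block defined above; the batch normalization and ReLU placement are assumptions, not the exact block used in the experiments.

from tensorflow.keras.layers import Conv2D, DepthwiseConv2D, BatchNormalization, ReLU, Add

def mobile_se_block(x, trunk_filters, expansion=6):
    expanded = trunk_filters * expansion
    # 1x1 expansion from the trunk width to 6 times more channels
    t = Conv2D(expanded, 1, padding='same', use_bias=False)(x)
    t = BatchNormalization()(t)
    t = ReLU()(t)
    # depthwise 3x3 convolution instead of a full convolution
    t = DepthwiseConv2D(3, padding='same', use_bias=False)(t)
    t = BatchNormalization()(t)
    t = ReLU()(t)
    # 1x1 projection back to the trunk width
    t = Conv2D(trunk_filters, 1, padding='same', use_bias=False)(t)
    t = BatchNormalization()(t)
    # Squeeze and Excitation at the end of the block, before the addition
    t = SE_Block(t, trunk_filters)
    return Add()([x, t])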
All the networks we test use two heads, one for the policy and one for the value. The Alpha Zero like network uses fully connected layers both in the policy and the value head, as in AlphaGo and its descendants. When restricted to one million parameters this is detrimental, since it uses a little less than 300,000 parameters only for the two heads. The remainder of the AlphaZero network is 10 blocks of 44 planes. For all the networks we optimize the number of parameters so as to be as close to one million parameters as possible. The Polygames network uses a fully convolutional policy head, with no parameters in the policy head. It also uses Global Average Pooling in the value head, before a fully connected layer of 50 outputs and then a fully connected layer with one output for the value. The Polygames network uses 13 residual blocks of 64 planes. The Mobile and the Mobile+SE networks use 17 mobile blocks of 384 planes with a trunk of 64 planes and the Polygames heads.

The evolution of the policy accuracy for the four networks is given in Figure 1. The Polygames architecture gives better results than the Alpha Zero architecture. Mobile is better than Polygames and Mobile+SE is better than Mobile alone. Figure 2 gives the evolution of the MSE validation loss of the value during training. Again Mobile+SE is better than Mobile, Mobile is better than Polygames, and Polygames is better than AlphaZero.
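A rough Keras sketch of the Polygames-style heads described above; the ReLU in the value head and the sigmoid output are assumptions, and the final trunk block is assumed to already produce a single policy plane so that the policy head itself has no parameters.

from tensorflow.keras.layers import Flatten, Softmax, GlobalAveragePooling2D, Dense

def polygames_style_heads(policy_plane, trunk):
    # fully convolutional policy head: flatten the 19x19 policy plane and normalize it
    policy = Softmax(name='policy')(Flatten()(policy_plane))
    # value head: Global Average Pooling, a 50-unit dense layer, one output for the value
    v = GlobalAveragePooling2D()(trunk)
    v = Dense(50, activation='relu')(v)
    value = Dense(1, activation='sigmoid', name='value')(v)
    return policy, value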
Fig. 1: The evolution of the policy validation accuracy for the different networks with less than one million parameters on the Katago dataset.

Fig. 2: The evolution of the value validation MSE loss for the different networks with less than one million parameters on the Katago dataset.
Multiple MobileNets and two Polygames/Alpha Zero like residual networks were trained on the Katago dataset. This took a total of more than 10,000 hours of training using V100 cards.

We trained a 20 blocks and a 40 blocks residual network with the Polygames heads. The results for these two networks are given in Table 3. We also trained many MobileNets with Squeeze and Excitation and the Polygames heads. The numbers of parameters of the networks according to their width and depth are given in Table 4. The speeds of the networks are given in Table 5. The accuracies reached on the validation set at the end of training are given in Table 6. The MSE validation loss of the value is given in Table 7.

Table 3: Properties of residual networks.
Network           Parameters    Accuracy   MSE      Speed
residual.20.256   23,642,469    52.77      0.1764   21.30
residual.40.256   47,266,149    53.15      0.1763   13.54
Table 4: Parameters according to width and depth.
w \ d         16          32          48          64          80
64       908,197   1,811,365   2,714,533   3,617,701   4,520,869
96     1,958,213   3,908,933   5,859,653   7,810,373   9,761,093
128    3,405,541   6,801,125  10,196,709  13,592,293
160    5,250,181  10,487,941  15,725,701
192    7,492,133  14,969,381
224   10,131,397  20,245,445
Table 5: Speed according to width and depth. Number of batches of size 32 per second on GPU (RTX 2080 Ti).
w \ d     16      32      48      64      80      96
64     28.17   25.28   19.39   18.30   16.20   14.13
96     26.61   21.65   16.56   13.51   11.62    9.90
128    25.25   17.65   13.32   10.71    9.07    7.71
160    21.82   14.37   10.44    8.18    6.87    5.94
192    19.78   12.33    9.02    6.97    5.60    4.81
224    15.93   10.33    7.39    5.69    4.66    3.83
256    14.08    8.15    6.26    4.81    3.95    3.32
Fig. 3: The accuracy Pareto front.

Fig. 4: Networks dominated by the accuracy Pareto front.
Table 6: Accuracy according to width and depth.
w \ d     16      32      48      64      80
64     53.98   55.94   56.98   57.77   58.21
96     55.48   57.78   58.51   59.20   59.62
128    56.52   58.56   59.40   60.06
160    57.00   59.26   60.16
192    57.65   59.73
224    58.01   60.05
Figure 3 gives the Pareto front of the networks according to speed and accuracy. Figure 4 gives the networks that are dominated by other networks. The dominated networks are residual.20.256, residual.40.256, se.16.192, se.16.224, se.32.224, se.48.64, se.48.96, se.64.64, se.64.96, se.80.64 and se.80.96. We can see in this list that networks that are either too shallow and too wide or too deep and too narrow are dominated. It means that there is a balance to keep between the depth and the width of the networks: networks that are shallow and wide or networks that are deep and narrow are dominated. The optimal ratio width/depth seems to lie somewhere between 2.67 and 6.00, but we do not have enough data to assess whether it stays in the same range for greater depths and widths.

The accuracies of our MobileNets are much better than the accuracies previously reported for MobileNets, which were close to the accuracy of residual nets for similar speeds [3]. MobileNets that have the same speed as the 20 blocks residual network have an accuracy close to 57% whereas the residual net is close to 53%. Moreover the Katago dataset has better quality games and is more elaborate than the datasets used in [3]. The advantage is even greater when comparing the 40 blocks residual network to MobileNets of the same speed.

Table 7: MSE according to width and depth.
w \ d      16       32       48       64       80
64     0.1695   0.1637   0.1614   0.1602   0.1592
96     0.1657   0.1604   0.1580   0.1572   0.1563
128    0.1631   0.1583   0.1566   0.1553
160    0.1615   0.1570   0.1551
192    0.1603   0.1560
224    0.1595   0.1558
Figure 5 gives the value Pareto front. Figure 6 gives the networks dominated by the value Pareto front. The dominated networks are residual.20.256, residual.40.256, se.16.224, se.32.224, se.48.64, se.64.96, se.80.64 and se.80.96. Again we see that shallow and wide networks as well as deep and narrow networks are dominated. The residual networks are largely dominated by the Squeeze and Excitation MobileNets both for the accuracy and for the value.
Fig. 5: The value Pareto front.

Fig. 6: Networks dominated by the value Pareto front.
In order to extrapolate the accuracy of bigger networks we made a regression on a formula estimating the accuracy given the depth and the width of the networks. We assume that the increase in accuracy is logarithmic with the size of the network and we find the appropriate parameters with the following optimization (a small fitting sketch is given after Table 8):

argmin_{p0,p1,p2} ( Σ_{d,w} | p0 + p1 × log(d) + p2 × log(w) − A(d, w) | )

where d is the depth, w the width and A(d, w) the value of the accuracy in Table 6. The best parameter values we found are p0 = 32., p1 = 2. and p2 = 3., giving a minimal error of 2.70 for the 21 accuracies, an error of approximately 0.128 per accuracy.

Table 8 gives the Pareto front for the extrapolated accuracies and the accuracies we experimentally found. The extrapolated accuracies are in parentheses. The non dominated accuracies according to the speed and accuracies of the other networks are in bold. We can observe that the trend of balancing the depth and the width of the networks continues for the extrapolated values.

Table 8: Extrapolation of the accuracy Pareto front.
w \ d      16        32        48        64        80
192     57.65
224     58.01     60.05    (61.32)   (62.09)   (62.70)
256    (58.81)   (60.68)   (61.78)   (62.55)   (63.16)
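A minimal sketch of the fit referenced above, assuming the depths, widths and accuracies of Table 6 are available as arrays; the initial guess and the optimizer are arbitrary choices.

import numpy as np
from scipy.optimize import minimize

def fit_accuracy_model(depths, widths, accuracies):
    # minimize the sum of absolute errors of p0 + p1*log(d) + p2*log(w) against A(d, w)
    def total_error(p):
        predicted = p[0] + p[1] * np.log(depths) + p[2] * np.log(widths)
        return np.abs(predicted - accuracies).sum()
    return minimize(total_error, x0=np.array([30.0, 1.0, 1.0]), method='Nelder-Mead').x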
Figure 7 gives the evolution of the accuracy of our best network against the evolution of the accuracy of a state of the art residual network. We can observe that training starts better for our network and that it improves more during training. Figure 8 gives the same evolution for the mean squared error of the value, and again the residual network is largely dominated.

Fig. 7: The evolution of the policy validation accuracy for the best SE network and the 40 blocks residual network on the Katago dataset.

Fig. 8: The evolution of the value validation MSE loss for the best SE network and the 40 blocks residual network on the Katago dataset.
We now make the networks play Go. We first test the strength of the networks using only the policy to play. The policy is randomized by choosing randomly between the move with the best prior and the move with the second best prior, proportionally to the priors. The networks then play instantly and still play at the level of high amateur dan players. People enjoy playing blitz games on the internet against such networks.
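A small sketch of this randomization, with names of our own choosing:

import random

def randomized_policy_move(priors):
    # priors: mapping from legal move to its policy prior probability
    (m1, p1), (m2, p2) = sorted(priors.items(), key=lambda kv: kv[1], reverse=True)[:2]
    # choose between the two best moves proportionally to their priors
    return m1 if random.random() < p1 / (p1 + p2) else m2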
Table 9 gives the result of a round robin tournament between the policy networks. As expected the se.48.960.160 network, the one with the best accuracy, has the best winning rate, and the residual networks have poor results.

Table 9: Round robin tournament between policies.
Network           Games   Winrate   σ/√n
se.48.960.160     220     0.818     0.026
se.64.576.96      220     0.795     0.027
se.32.1344.224    220     0.727     0.030
se.64.768.128     220     0.727     0.030
se.48.768.128     220     0.705     0.031
se.32.1152.192    220     0.682     0.031
se.80.576.96      220     0.636     0.032
se.16.1152.192    220     0.591     0.033
se.32.960.160     220     0.591     0.033
se.64.384.64      220     0.591     0.033
se.16.1344.224    220     0.591     0.033
se.48.576.96      220     0.568     0.033
se.32.768.128     220     0.545     0.034
se.80.384.64      220     0.455     0.034
se.48.384.64      220     0.432     0.033
se.32.576.96      220     0.364     0.032
se.16.960.160     220     0.364     0.032
se.32.384.64      220     0.341     0.032
se.16.576.96      220     0.273     0.030
se.16.768.128     220     0.273     0.030
se.16.384.64      220     0.205     0.027
residual.40.256   220     0.136     0.023
residual.20.256   220     0.091     0.019

The se.48.960.160 network played games on the internet Kiseido Go Server (KGS) using only the policy to play and playing its moves instantly. Many people play against the network, keeping it busy 24 hours a day. It reached a 6 dan level and it is still improving its rating, winning 80% of its games. It is the best ranking reached by a policy network alone; the previous best ranking was 5 dan with a mobile network [3].

We now make the networks play using GPUCT and 32 descents per move. All networks play against a randomized se.48.768.128 network. The randomization considers the second best move if it has at least half the number of descents of the best move and then randomly chooses between the two moves according to their numbers of descents. Table 10 gives the results for all networks. The se.48.960.160 is again the best network and the residual networks are the worst networks.

In the last experiment we make all the networks use the same thinking time of 10 seconds per move. Large and slow networks make fewer descents than small and fast networks in this experiment, so there is a balance between the gain of accuracy of the policy and the improvement of the value due to increasing the size of the network, and the slowdown due to the increased time for a forward pass. Results are given in Table 11. Interestingly, the se.48.960.160 is not the best network anymore. The residual networks are still way behind. We observe the impact of the balance between the size of the network and its speed. We give for each network the ratio between the width w and the depth d of the network. We observe that a too small or a too great ratio is to be avoided and that using bigger networks with a balanced ratio is beneficial up to a limit.
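A sketch of the descent-count randomization used by the opponent network in these matches; the names are ours.

import random

def randomized_search_move(descents):
    # descents: mapping from move to its number of MCTS descents after the search
    (m1, n1), (m2, n2) = sorted(descents.items(), key=lambda kv: kv[1], reverse=True)[:2]
    # consider the second best move only if it has at least half the descents
    # of the best move, then choose proportionally to the descent counts
    if 2 * n2 < n1:
        return m1
    return m1 if random.random() < n1 / (n1 + n2) else m2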
Table 10: Making all networks play 400 games as Black with 32 descents at each move against a randomized se.48.768.128 network.
Network           Games   Winrate   σ/√n
se.48.960.160     400     0.675     0.023
se.32.1152.192    400     0.580     0.025
se.64.768.128     400     0.550     0.025
se.32.1344.224    400     0.525     0.025
se.80.576.96      400     0.515     0.025
se.32.960.160     400     0.487     0.025
se.48.768.128     400     0.430     0.025
se.64.576.96      400     0.425     0.025
se.32.768.128     400     0.365     0.024
se.48.576.96      400     0.330     0.024
se.16.1344.224    400     0.315     0.023
se.80.384.64      400     0.302     0.023
se.32.576.96      400     0.278     0.022
se.64.384.64      400     0.275     0.022
se.16.1152.192    400     0.250     0.022
se.16.960.160     400     0.240     0.021
se.48.384.64      400     0.217     0.021
se.16.768.128     400     0.212     0.020
se.32.384.64      400     0.120     0.016
se.16.576.96      400     0.113     0.016
se.16.384.64      400     0.043     0.010
residual.20.256   400     0.025     0.008
residual.40.256   400     0.010     0.005

Table 11: Making all networks play 400 games as Black with 10 seconds at each move against a randomized se.48.768.128 network.

Network            w/d     Winrate   σ/√n
se.32.768.128      4.00    0.600     0.024
se.64.576.96       1.50    0.542     0.025
se.80.576.96       1.20    0.520     0.025
se.48.768.128      2.67    0.500     0.025
se.64.768.128      2.00    0.487     0.025
se.32.960.160      5.00    0.468     0.025
se.48.576.96       2.00    0.465     0.025
se.32.1152.192     6.00    0.443     0.025
se.32.576.96       3.00    0.443     0.025
se.48.960.160      3.33    0.440     0.025
se.80.384.64       0.80    0.435     0.025
se.64.384.64       1.00    0.427     0.025
se.48.384.64       1.33    0.360     0.024
se.32.1344.224     7.00    0.355     0.024
se.16.1152.192    12.00    0.343     0.024
se.16.768.128      8.00    0.335     0.024
se.16.960.160     10.00    0.300     0.023
se.32.384.64       2.00    0.285     0.023
se.16.1344.224    14.00    0.247     0.022
se.16.576.96       6.00    0.215     0.021
se.16.384.64       4.00    0.158     0.018
residual.20.256      -     0.020     0.007
residual.40.256      -     0.013     0.006

We proposed a generalization of the PUCT bandit of AlphaGo and Alpha Zero so as to make it invariant to the number of descents. Experimental results show it is less sensitive to the budget than the usual PUCT. We also proposed improvements to MobileNets and showed that they give much better results than the commonly used residual networks. We made a detailed analysis of the balance between the depth, the width and the speed of MobileNets. We also made the networks play in order to evaluate their strengths.

In future work we plan to use MobileNets with Squeeze and Excitation in a Deep Reinforcement Learning framework, using the Expert Iteration algorithm to train the networks. We also plan to use Expert Iteration and Monte Carlo Tree Search for various games and optimization problems.
Acknowledgment
This work was granted access to the HPC resources of IDRIS under the allocation 2020-AD011012119 made by GENCI. This work was supported in part by the French government under the management of Agence Nationale de la Recherche as part of the “Investissements d’avenir” program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute).
References
1. Anthony, T., Tian, Z., Barber, D.: Thinking fast and slow with deep learning and tree search. In: Advances in Neural Information Processing Systems. pp. 5360–5370 (2017)
2. Cazenave, T.: Residual networks for computer Go. IEEE Transactions on Games 10(1), 107–110 (2018)
3. Cazenave, T.: Mobile networks for computer Go. IEEE Transactions on Games (2020)
4. Cazenave, T., Chen, Y.C., Chen, G.W., Chen, S.Y., Chiu, X.D., Dehos, J., Elsa, M., Gong, Q., Hu, H., Khalidov, V., Cheng-Ling, L., Lin, H.I., Lin, Y.J., Martinet, X., Mella, V., Rapin, J., Roziere, B., Synnaeve, G., Teytaud, F., Teytaud, O., Ye, S.C., Ye, Y.J., Yen, S.J., Zagoruyko, S.: Polygames: Improved zero learning. ICGA Journal (4) (December 2020)
5. Coulom, R.: Efficient selectivity and backup operators in Monte-Carlo tree search. In: van den Herik, H.J., Ciancarini, P., Donkers, H.H.L.M. (eds.) Computers and Games, 5th International Conference, CG 2006, Turin, Italy, May 29-31, 2006. Revised Papers. Lecture Notes in Computer Science, vol. 4630, pp. 72–83. Springer (2006)
6. Emslie, R.: Galvanise zero. https://github.com/richemslie/galvanise_zero (2019)
7. Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H.: Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017)
8. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 7132–7141 (2018)
9. Kocsis, L., Szepesvári, C.: Bandit based Monte-Carlo planning. In: 17th European Conference on Machine Learning (ECML'06). LNCS, vol. 4212, pp. 282–293. Springer (2006)
10. Pascutto, G.C.: Leela zero. https://github.com/leela-zero/leela-zero (2017)
11. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.C.: Mobilenetv2: Inverted residuals and linear bottlenecks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4510–4520 (2018)
12. Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., Hassabis, D.: Mastering the game of go with deep neural networks and tree search. Nature 529(7587), 484–489 (2016)
13. Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., Lillicrap, T., Simonyan, K., Hassabis, D.: A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science 362(6419), 1140–1144 (2018)
14. Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., Lillicrap, T.P., Simonyan, K., Hassabis, D.: Mastering the game of go without human knowledge. Nature 550(7676), 354 (2017)
15. Tian, Y., Ma, J., Gong, Q., Sengupta, S., Chen, Z., Pinkerton, J., Zitnick, C.L.: Elf opengo: An analysis and open reimplementation of alphazero. CoRR abs/1902.04522 (2019)
16. Wu, D.J.: Accelerating self-play learning in go. CoRR abs/1902.10565 (2019)