First-Order Problem Solving through Neural MCTS based Reinforcement Learning
Ruiyang Xu
Khoury College of Computer Sciences, Northeastern University, [email protected]
Prashank Kadam
Khoury College of Computer Sciences, Northeastern University, [email protected]
Karl Lieberherr
Khoury College of Computer Sciences, Northeastern University, [email protected]
ABSTRACT
The formal semantics of an interpreted first-order logic (FOL) statement can be given in Tarskian Semantics or a basically equivalent Game Semantics. The latter maps the statement and the interpretation into a two-player semantic game. Many combinatorial problems can be described using interpreted FOL statements and can be mapped into a semantic game. Therefore, learning to play a semantic game perfectly leads to the solution of a specific instance of a combinatorial problem. We adapt the AlphaZero algorithm so that it becomes better at learning to play semantic games that have different characteristics than Go and Chess. We propose a general framework, Persephone, to map the FOL description of a combinatorial problem to a semantic game so that it can be solved through a neural MCTS based reinforcement learning algorithm. Our goal for Persephone is to make it tabula-rasa, mapping a problem stated in interpreted FOL to a solution without human intervention.
KEYWORDS
Semantic Game, Reinforcement Learning, Neural MCTS
ACM Reference Format:
Ruiyang Xu, Prashank Kadam, and Karl Lieberherr. 2021. First-Order Problem Solving through Neural MCTS based Reinforcement Learning. In Proc. of the 20th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2021), London, UK, May 3–7, 2021, IFAAMAS, 9 pages.
Recent success of AlphaZero [29] sheds light on solving combinatorial games by combining deep RL with Monte Carlo Tree Search (MCTS). It is known that conventional model-free RL performs poorly on these problems: a combinatorial problem usually implies a large state space with sparse rewards, which makes sample efficiency a challenge for those algorithms. Nevertheless, neural MCTS can largely increase sample efficiency and improve performance on such problems. Since the first proposals of neural MCTS [2, 28], the algorithm has shown remarkable gameplay performance and has been improved in various ways. For instance, AlphaZero [29] uses self-play to generate learning data, which makes it self-supervised. However, the power of AlphaZero is limited to conventional board games. Later, MuZero [24] extended this idea to model-based RL, which enables the algorithm to play Atari games. At the theoretical level, Grill et al. [12]
pointed out that MCTS itself is a kind of regularized policy optimization, which explains the mysterious PUCT heuristic that appeared in the AlphaZero paper. Furthermore, based on their theory, they proposed an optimal heuristic and proved that it performs better than AlphaZero's. Their research reveals a deep connection between neural MCTS and RL, which has also been noticed by other researchers: in [14], the author uses a single Q-value network to replace the value and policy networks in AlphaZero, dynamically computing the policy from the Q-values through a softmax operator, and this variant also shows a performance improvement. Likewise, Gao et al. [9] use a three-headed neural network (i.e., policy, value, and Q-value) to increase the learning efficiency on Hex. Moreover, Goldwaser and Thielscher [11] recently adapted neural MCTS to general game playing [10]; however, their framework can currently only handle turn-based two-player zero-sum symmetric games.

Neural MCTS's ability to handle games with large state spaces and sparse rewards motivates us to extend its application to another domain: solving combinatorial problems. Applying RL to combinatorial problems has been studied for some time [5, 7, 17, 19, 22, 30]. In this paper, unlike other methods, we propose a framework, Persephone, which solves FOL-expressible combinatorial problems (first-order problems, in short) through a gamification based on logical semantics. We also point out a close connection between neural MCTS and RL and use RL concepts to improve the algorithm. Specifically, our main contributions are: 1. we propose a framework to map first-order problems to multi-agent MDPs so that solutions can be learned through neural MCTS based RL; 2. we define symmetry and asymmetry for extensive form games and exploit the asymmetry in our network designs; 3. we propose warm-start MCTS, different policy learning strategies, and asymmetric neural network structures to increase performance; 4. we carry out experiments on different designs and verify the best configuration. Our experimental results show that using asymmetric designs helps increase performance on asymmetric semantic games.
MCTS has been applied to solving combinatorial games for a long time [6], and recently, combining deep neural networks with MCTS has shown success in improving solver competence in many practical combinatorial games. The concept of neural MCTS was proposed independently in Expert Iteration [2] and AlphaZero [29]. In a nutshell, neural MCTS uses neural networks as policy and value approximators. During each learning iteration, it carries out multiple rounds of self-play. Each self-play runs several MCTS simulations to estimate an empirical policy at each state, then samples from that policy, takes a move, and continues. After each round of self-play, the game's outcome is backed up to all states in the game trajectory. The game trajectories generated during self-play are then stored in a replay buffer, which is used to train the neural networks.

In self-play, for a given state, neural MCTS runs a given number of simulations on a game tree rooted at that state to generate an empirical policy. Each simulation, guided by the policy and value networks, passes through 4 phases:

(1) SELECT: At the beginning of each iteration, the algorithm selects a path from the root (the current game state) to a leaf (either a terminal state or an unvisited state) according to an upper confidence bound (UCB) [3, 4, 18]. Specifically, suppose the root is s_0. The UCB determines a series of states {s_0, s_1, ..., s_l} by the following process:

$$a_i = \arg\max_a \left[ Q(s_i, a) + c\,\pi_\theta(s_i, a)\,\frac{\sqrt{\sum_{a'} N(s_i, a')}}{N(s_i, a) + 1} \right], \qquad s_{i+1} = \mathrm{move}(s_i, a_i) \quad (1)$$

It has been proved in [12] that selecting simulation actions using Eq. 1 is equivalent to optimizing the empirical policy

$$\hat{\pi}(s, a) = \frac{1 + N(s, a)}{|A| + \sum_{a'} N(s, a')}$$

where |A| is the size of the current action space, so that it approximates the solution of the following regularized policy optimization problem:

$$\pi^* = \arg\max_\pi \left[ Q^T(s, \cdot)\,\pi(s, \cdot) - \lambda\,\mathrm{KL}[\pi_\theta(s, \cdot), \pi(s, \cdot)] \right], \qquad \lambda = \frac{\sqrt{\sum_{a'} N(s, a')}}{|A| + \sum_{a'} N(s, a')} \quad (2)$$

That also means an MCTS simulation is a regularized policy optimization [12]: as long as the value network is accurate, the MCTS simulation optimizes the output policy so that it maximizes the action value while minimizing the change to the policy network.

(2) EXPAND: Once the selection phase ends at an unvisited state s_l, the state is fully expanded and marked as visited. All its child nodes are considered leaf nodes during the next iteration of selection.

(3) ROLL-OUT: The roll-out is carried out for every child of the expanded leaf node s_l. Starting from any child of s_l, the algorithm uses the value network to estimate the result of the game; the value is then backed up to each node in the next phase.

(4) BACKUP: This is the last phase of an iteration, in which the algorithm updates the statistics for each node among the selected states {s_0, s_1, ..., s_l} from the first phase. To illustrate this process, suppose the selected states and corresponding actions are {(s_0, a_0), (s_1, a_1), ..., (s_{l-1}, a_{l-1}), (s_l, _)}. Let V_θ(s_i) be the estimated value for child s_i. We want to update the Q-value so that it equals the averaged cumulative reward over each visit of the underlying state, i.e., $Q(s, a) = \frac{\sum_{i=1}^{N(s,a)} \sum_t r_{i,t}}{N(s, a)}$.
To rewrite this updating rule in an iterative form, for each (s_t, a_t) pair we have:

$$N(s_t, a_t) \leftarrow N(s_t, a_t) + 1, \qquad Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \frac{V_\theta(s_r) - Q(s_t, a_t)}{N(s_t, a_t)} \quad (3)$$

Such a process is carried out for all of the roll-out outcomes from the last phase.

Once the given number of iterations has been reached, the algorithm returns the empirical policy π̂(s) for the current state s. After the MCTS simulation, the action is sampled from π̂(s), and the game moves to the next state. In this way, for each self-play iteration, MCTS samples each player's states and actions alternately until the game ends, which generates a trajectory for the current self-play. After a given number of self-plays, all trajectories are stored in a replay buffer so that they can be used to train and update the neural networks.
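To make the SELECT and BACKUP steps concrete, the following is a minimal Python sketch of the PUCT rule (Eq. 1) and the incremental update (Eq. 3). The Node container and its fields are illustrative assumptions for this sketch, not the data structures of the original implementation.

```python
from dataclasses import dataclass, field
import math

@dataclass
class Node:
    actions: list          # legal actions at this state
    prior: dict            # pi_theta(s, a) predicted by the policy network
    N: dict = field(default_factory=dict)   # visit counts N(s, a)
    Q: dict = field(default_factory=dict)   # action values Q(s, a)

    def __post_init__(self):
        for a in self.actions:
            self.N.setdefault(a, 0)
            self.Q.setdefault(a, 0.0)

def select_action(node, c=1.0):
    # PUCT rule (Eq. 1): Q(s,a) + c * pi_theta(s,a) * sqrt(sum_a' N(s,a')) / (N(s,a) + 1)
    total = sum(node.N.values())
    return max(node.actions,
               key=lambda a: node.Q[a] + c * node.prior[a] * math.sqrt(total) / (node.N[a] + 1))

def backup(path, value):
    # Incremental averaging (Eq. 3) over the selected (node, action) pairs of one simulation
    for node, a in path:
        node.N[a] += 1
        node.Q[a] += (value - node.Q[a]) / node.N[a]
```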
Game semantics is an approach that recasts logical concepts in game-theoretic terms. For instance, in propositional logic, each formula is interpreted as a game between two players. The Proponent is in charge of all the "OR" operators, while the Opponent takes over all the "AND" operators. The game runs recursively following the computational order of the operators. During each move of the game, the owner of the current operator chooses one of its sides as a subformula, and the game continues in that subformula. The game ends when a primitive proposition is reached, and the Proponent wins the game if the formula evaluates to true; otherwise, the Opponent wins.

Hintikka refined game semantics and extended it to model-based first-order logic. To be specific, using Hintikka's game-theoretic semantics approach [16], one can map any first-order logic formula (with a model) into object picking and fact testing. A winning strategy for the Proponent/Opponent exists if the underlying logic formula is true/false. Under game-theoretic semantics, a semantic game is represented as a tuple (⟨Ψ, M⟩, P, OP), where the underlying formula Ψ is interpreted by the model M, and P and OP denote the game roles, namely, who is playing as the Proponent/Opponent. There are 6 types of moves for any first-order logic formula (Table 1):

(1) For universally quantified formulas, the Opponent takes a move by providing a legal value x₀ for the quantified variable x. The game continues with the subformula where all appearances of x are replaced with the particular assignment x₀.
(2) For existentially quantified formulas, the Proponent takes a move by providing a legal value x₀ for the quantified variable x. The game continues with the subformula where all appearances of x are replaced with the particular assignment x₀.
(3) For conjunctive formulas, the Opponent takes a move by picking either side of the original formula as a subformula.
(4) For disjunctive formulas, the Proponent takes a move by picking either side of the original formula as a subformula.
(5) For negated formulas, no move happens, but the two players switch roles.
(6) For primitive propositions, the truth value is evaluated directly within the given model M. This situation also indicates the end of the game, where the player who currently plays the Proponent wins if the formula evaluates to true; otherwise, the current Opponent player wins.
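Read operationally, the six rules amount to the small dispatch loop sketched below; the tuple-based formula encoding, the player objects with pick/choose methods, and the folding of the model M into each atom's test function are assumptions made for this illustration, not Persephone's actual representation.

```python
# Assumed formula encoding for this sketch:
#   ('forall', domain, body) / ('exists', domain, body), where body(value) returns a formula
#   ('and', f1, f2) / ('or', f1, f2) / ('not', f) / ('atom', test), where test() -> bool
def play(formula, proponent, opponent):
    """Play the semantic game under the six rules above; returns the winning player."""
    kind = formula[0]
    if kind == 'forall':                                   # OP instantiates the variable
        _, domain, body = formula
        return play(body(opponent.pick(domain)), proponent, opponent)
    if kind == 'exists':                                   # P instantiates the variable
        _, domain, body = formula
        return play(body(proponent.pick(domain)), proponent, opponent)
    if kind == 'and':                                      # OP chooses a subformula
        return play(opponent.choose(formula[1], formula[2]), proponent, opponent)
    if kind == 'or':                                       # P chooses a subformula
        return play(proponent.choose(formula[1], formula[2]), proponent, opponent)
    if kind == 'not':                                      # negation: the players switch roles
        return play(formula[1], opponent, proponent)
    # atomic proposition: evaluate directly; the current Proponent wins iff it is true
    return proponent if formula[1]() else opponent
```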
Proposition φ | Operation | Subgame
∀x : Ψ(x) | OP picks x₀ | (⟨Ψ[x/x₀], M⟩, P, OP)
Ψ ∧ χ | OP picks θ ∈ {Ψ, χ} | (⟨θ, M⟩, P, OP)
∃x : Ψ(x) | P picks x₀ | (⟨Ψ[x/x₀], M⟩, P, OP)
Ψ ∨ χ | P picks θ ∈ {Ψ, χ} | (⟨θ, M⟩, P, OP)
¬Ψ | N/A | (⟨Ψ, M⟩, OP, P)
φ (atomic) | N/A | N/A

Table 1: The mapping between a formula ⟨φ, M⟩ and the corresponding semantic game (⟨φ, M⟩, P, OP). In this table, OP stands for Opponent and P stands for Proponent. M is the model of the formula, which defines all non-logical symbols in the formula. The game ends at an atomic proposition φ. It is to be noted that negation switches the roles of the two players; namely, strategies for P in a game for ¬Ψ are strategies for OP in the game for Ψ.

In this section, following the definitions given by [15, 27], we extend the concept of the asymmetric game from normal form games to two-player extensive form games using first-order logic. The asymmetry of an asymmetric game is mainly reflected in two aspects: 1. the players' action spaces are asymmetric, namely, players with different game roles have different action spaces; 2. the final objectives of the game roles are asymmetric. Generally speaking, any two-player extensive form game can be abstractly described as follows:

$$G(S, P) = \begin{cases} \text{True}, & \text{if } P \text{ wins} \\ \text{False}, & \text{if } P \text{ loses} \\ \exists m \in \mathcal{A}_S^P : \neg G(\mathcal{T}(S, m), \neg P), & \text{otherwise} \end{cases}$$

where G(S, P) is a predicate claiming that, given a game state S, the current player P will win the game. In the nontrivial case, the current player picks an action m from the action space 𝒜_S^P and claims that, after taking this action through the transition operator
𝒯(S, m), the other player cannot win the game. It is to be noted that the action space is parameterized by the current state and player, which is consistent with any game definition.

Definition 2.1.
Given the space of all possible legal game states 𝔖 and the space of all possible legal actions 𝔄, an extensive form game is symmetric if there exist two involutions φ : 𝔄 → 𝔄 and ψ : 𝔖 → 𝔖 such that the following holds:

$$\forall m \in \mathfrak{A}: \varphi(m) \in \mathfrak{A} \land \varphi(\varphi(m)) = m$$
$$\forall S \in \mathfrak{S}: \psi(S) \in \mathfrak{S} \land \psi(\psi(S)) = S$$
$$\forall m \in \mathcal{A}_S^P: \varphi(m) \in \mathcal{A}_{\psi(S)}^{\neg P}$$
$$\forall S \in \mathfrak{S}: G(\psi(S), \neg P) = G(S, P) \quad (4)$$

Using the definition above, one can easily judge whether a given extensive form game is symmetric or asymmetric. For instance, Chess is considered symmetric because one can choose both ψ and φ to be a 'flipping' operator that turns the current board upside down: the White player can move by pretending to be a Black player after rotating the board 180 degrees and flipping the colors. On the other hand, Fox and Geese is asymmetric, because one cannot find any involution φ between the Fox player's action space and the Geese player's action space such that ψ(S) is a legal state in 𝔖. It can also be inferred that all impartial games (hence all Nim games) are symmetric because, by definition, all players share the same action space, so the mappings can simply be the identity. Partisan games, however, can be either symmetric or asymmetric depending on the game definition.

The semantic games we are dealing with are asymmetric in general because the two roles' action spaces differ due to the different domains of the quantified variables. Such an asymmetry introduces an intrinsic imbalance once the two players play against each other, which makes the game easier for one player but harder for the other. Asymmetric games are a challenge for the original AlphaZero algorithm, which was designed for symmetric board games like Chess and Go: for symmetric games, AlphaZero can learn on a consistent action space, objective, and player role using the color-flipping trick, so it only needs to learn a single policy that applies to both players. For an asymmetric game, however, separate policies have to be learned, adding another layer of complexity to the learning algorithm.

In this section, we propose a general framework, Persephone, to solve FOL-expressible combinatorial problems. Through game-theoretic semantics, the framework transforms the FOL description of the target problem into a two-player semantic game. The transformed two-player semantic game can then be modeled with two-player MDPs. After that, a neural MCTS algorithm is applied to play and learn the game. The algorithm finally converges to an optimal strategy for the Proponent player if the original problem has an optimal solution; otherwise, the Opponent player learns the counter strategy that demonstrates the falsehood of the original problem (Fig. 1).

Two-player MDPs can be viewed as extensions of MDPs [20], which, in the case of semantic games, can be represented as a tuple ⟨𝒮_P, 𝒜_P, 𝒮_OP, 𝒜_OP, ℛ, 𝒯, γ⟩ where:
• 𝒮_P and 𝒮_OP are the state spaces for each player, which contain all possible states in a decision problem. In terms of the semantic game, they contain all possible legal game states for each player, respectively.
• 𝒜_P and 𝒜_OP are the action spaces for each player, which contain all possible actions in a decision problem. In terms of the semantic game, they contain all possible legal moves for each player, respectively.
• The transition function 𝒯 defines the dynamics from one state to another.
In a semantic game, we have a transition function 𝒯 : 𝒮_{P/OP} × 𝒜_{P/OP} → 𝒮_P ∪ 𝒮_OP. It is to be noted that a feature of the semantic game is that, depending on the step-wise evaluation of the FOL formula under game-theoretic semantics, the next state can either belong to the same player or change to the other player.
• The reward ℛ(s, a, s′) defines the reward obtained after taking action a in state s and moving to state s′. In a semantic game, since the outcome of the game is unknown until the game ends, the rewards are sparse. In other words, we have the following reward function (supposing s ∈ 𝒮_{P/OP}; a small code sketch is given after Figure 1):

$$\mathcal{R}(s, a, s') = \begin{cases} eval(s'), & \text{if } s' \text{ is terminal and } s' \in \mathcal{S}_{P/OP} \\ -eval(s'), & \text{if } s' \text{ is terminal and } s' \in \mathcal{S}_{OP/P} \\ 0, & \text{otherwise} \end{cases} \qquad eval(s) = \begin{cases} 1, & \text{if } s \text{ evaluates to True} \\ -1, & \text{if } s \text{ evaluates to False} \end{cases}$$

• The reward discount factor γ ∈ (0, 1] weighs the importance of future rewards. Typically, the farther a reward is from the current state, the less effect it has on the current decision. In our semantic games, since the reward is sparse, γ is set to 1.

Since there are two players, solving these MDPs means finding two policies π_P and π_OP such that a Nash equilibrium is established. As a result, once the learning has converged, one of the players can always win the game while the other one is forced to lose.

Figure 1: The Persephone Framework.
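The following sketch spells out this sparse terminal reward; the state attributes (is_terminal, owner, truth_value) are names assumed for the example rather than fields of Persephone's state representation.

```python
def reward(s, s_next):
    """Sparse semantic-game reward: nonzero only when the successor state is terminal."""
    if not s_next.is_terminal:
        return 0
    value = 1 if s_next.truth_value else -1        # eval(s') in {+1, -1}
    # Positive from the mover's perspective if the terminal state belongs to the same player
    return value if s_next.owner == s.owner else -value
```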
The entrance of a semantic game is always a predicate, which can be viewed as a tree. Evaluating this predicate step-wise, each node is either a logic operator or a predicate. A predicate indicates a leaf node of the current tree, but also an entrance to another tree. Persephone uses a preorder traversal to identify each node and hence to vectorize the tree structure uniquely. For instance,
HSR(k, q, n) := ∃m ∈ [1, n): HSR(k−1, q−1, m) ∧ HSR(k, q−1, n−m)

can be represented as a tree whose root is Exists, whose child is And, and whose two children are PRED_1 and PRED_2. The corresponding preorder index is [Exists: 0, And: 1, PRED_1: 2, PRED_2: 3], which means Persephone can use a length-four vector to completely record what happened during a game played on this tree; all it needs is to store the action taken at each node in the corresponding position of the vector. For the given instance above, suppose we choose a value for m; the game is then recorded as a vector of the form [PRED_ID, PARAM_LIST, TREE_PREORDER]. Back to the previous example, suppose k = q = n = 16; we obtain such a vector when the game stops at PRED_1, i.e., at the corresponding HSR subinstance.
It is known that MCTS cannot handle large action spaces because of the sample efficiency problem. To be specific, think of a game tree with a significant branching factor where the winning leaf nodes are extremely sparse. In this case, MCTS can find the winning strategy only if it traverses as many paths as possible, which requires a large number of MCTS simulations. The number of simulations soon becomes intractable as the action space increases. To mitigate this issue, we applied a warm-start trick.

The idea is quite simple: in the original design of neural MCTS, the search tree is always re-initialized before each MCTS simulation session, which means MCTS has to recount the visits to each node from the beginning. When the action space increases, such recounting makes it impossible to locate the optimal path with a relatively small number of simulations. Therefore, we propose a warm-start MCTS to accelerate the simulation process. The warm-start mainly contains the following two components:

(1) Keeping counting info: after each MCTS simulation session, we keep the node counts N(s, a) in the search tree and reuse them in the next simulation session. We will see in the experiments that merely applying this trick yields a substantial learning-speed improvement.

(2) Q-value injection: inspired by SAVE [14], where the authors propose a variant of neural MCTS that only learns Q-values and uses a softmax operator to recover the policy from the learned Q-values, we inject the predicted Q-values into the search tree nodes as priors. Our experiments show that this makes the learning process more stable.

Neural MCTS can be viewed as an Actor-Critic (AC) algorithm [13]. The self-play phase works like an actor, which optimizes the empirical policy π̂ in the direction suggested by the value network (the critic) and then samples game trajectories from the optimized empirical policy π̂. In the training phase, the actor updates the policy network towards the empirical policy π̂; meanwhile, the critic updates the value network using the reward signal from the sampled trajectories.

An AC interpretation of neural MCTS makes alternative policy learning approaches feasible. Specifically, in the original AlphaZero implementation, the policy network is updated through a cross entropy method (CEM, [21]), where the policy loss is defined as:

$$\mathcal{L}(\theta) = -\sum_t \hat{\pi}(s_t, a_t) \log \pi_\theta(s_t, a_t)$$

However, further research on policy gradients suggests that more sophisticated methods like TRPO and PPO can improve the stability of the learning process [25, 26]. Therefore, we also implemented PPO policy updates as alternatives. In our experiments, we mainly focus on the following two PPO policy loss variants:

$$\mathcal{L}^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(r(\theta) A(s_t, a_t),\ \mathrm{clip}(r(\theta) A(s_t, a_t),\ 1-\epsilon,\ 1+\epsilon)\right)\right]$$
$$\mathcal{L}^{KL}(\theta) = \mathbb{E}_t\left[r(\theta) A(s_t, a_t) - \beta\,\mathrm{KL}[\hat{\pi}, \pi_\theta]\right] \quad (5)$$

where the ratio $r(\theta) = \frac{\pi_\theta(s_t, a_t)}{\hat{\pi}(s_t, a_t)}$ is the importance weight, and A(s, a) is the advantage term, which is defined as:

$$A(s, a) = \begin{cases} -V_\theta(s) - r, & \text{if } \mathcal{T}(s, a) \text{ ends the game with reward } r \\ -V_\theta(\mathcal{T}(s, a)) - V_\theta(s), & \text{otherwise} \end{cases} \quad (6)$$

when the next player is not the current player, and

$$A(s, a) = \begin{cases} -V_\theta(s) + r, & \text{if } \mathcal{T}(s, a) \text{ ends the game with reward } r \\ V_\theta(\mathcal{T}(s, a)) - V_\theta(s), & \text{otherwise} \end{cases} \quad (7)$$

when the next player is the current player.
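To make the two loss variants in Eq. 5 concrete, here is a minimal NumPy sketch that computes both objectives for a batch of visited states. The array shapes and argument names are assumptions for illustration, the probabilities are assumed strictly positive, and the advantages are assumed to be precomputed via Eq. 6/7.

```python
import numpy as np

def ppo_losses(pi_theta, pi_hat, actions, advantages, eps=0.2, beta=1.0):
    """Sketch of the two PPO objectives of Eq. 5 (quantities to be maximized)."""
    # pi_theta, pi_hat: (T, |A|) arrays with the policy network output and the MCTS
    # empirical policy at each visited state; actions: taken actions a_t; advantages: Eq. 6/7
    t = np.arange(len(actions))
    r = pi_theta[t, actions] / pi_hat[t, actions]             # importance ratio r(theta)
    clipped = np.clip(r * advantages, 1 - eps, 1 + eps)       # clipped term as written in Eq. 5
    l_clip = np.mean(np.minimum(r * advantages, clipped))
    kl = np.sum(pi_hat * np.log(pi_hat / pi_theta), axis=1)   # KL[pi_hat, pi_theta] per state
    l_kl = np.mean(r * advantages - beta * kl)
    return l_clip, l_kl
```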
There are mainly two reasons why we consider multiple neural networks in our implementation:

(1) As we have mentioned earlier, a semantic game can be an asymmetric extensive form game, which means the two players learn different value and policy networks. Therefore, it is useful to know whether using separate neural networks for each player would improve their learning efficiency. For instance, in the HSR(k, q, n) semantic game, the Proponent player has a policy of size n while the Opponent only needs to choose from two actions. In [30], the authors show a behavioral asymmetry in playing those games, which might indicate a performance improvement if the two players learn on different neural networks.

(2) As suggested in [1], the authors show experimentally that separate value and policy networks can generally increase learning performance. Even though AlphaZero integrates the two into one neural network, this might be a bad idea because the training signals can interfere with each other and make the learning process unstable. In our experiments, we also test this idea and confirm the observation of [1].

We test our ideas on the HSR problem, which has been introduced and used as an experimental subject in [30].
HSR(k, q, n) defines a stress-testing problem: given k jars and q test chances, one throws jars from specific rungs of a ladder of height n to locate the highest safe rung. If n is not too large, then one can locate the highest safe rung with at most k jars and q tests; otherwise, if n is too big, there is no way to locate the highest safe rung. This problem can be described with the following FOL:

$$HSR(k, q, n) = \begin{cases} \text{True}, & \text{if } n = 1 \\ \text{False}, & \text{if } n > 1 \land (k = 0 \lor q = 0) \\ \exists m \in [1, n): HSR(k-1, q-1, m) \land HSR(k, q-1, n-m), & \text{otherwise} \end{cases}$$

Moreover, in what follows we perform our experiments mainly on one fixed HSR instance, unless we specify other instances.

• For the single neural network, we use a two-head MLP for both the policy and the value network. The shared layers form a four-layer MLP with ReLU activations. The policy head has shape [|A|_max] (where |A|_max is the maximum size of the action space), with Softmax activation; the value head has shape [1], with Tanh activation.
• For the separated policy and value neural networks, we split the shared layers into two independent neural networks.
• For the separated player neural networks, to respect the game's asymmetry, we use a larger MLP for the Proponent network and a smaller one for the Opponent network. It is to be noted that separate player networks are independent of separate policy/value networks: one can have separate player networks while each player still uses a single neural network or separate policy/value networks.

Since there are multiple implementations to compare, we define each implementation setup here for later usage.

• AZ: the original implementation of AlphaZero.
• CE: the keep-counting-info warm-start variant of AZ.
• CE_Sep: the separate policy/value network variant of CE.
• CE_Q_Sep: the full warm-start (i.e., keep-counting-info and Q-value injection) variant of CE_Sep.
• PPO_CLIP_Sep: the separate policy/value network variant with PPO policy learning, which uses the L^CLIP loss.
• PPO_KL_Sep: the separate policy/value network variant with PPO policy learning, which uses the L^KL loss.
• PPO_KL_Sep_2NN: the separate player network variant of PPO_KL_Sep.

The hyperparameters are set as follows:

• The learning rate is 0.001 with the Adam optimizer.
• The mini-batch size is set to 64.
• The number of training epochs is set to 10.
• The number of MCTS simulations is set to 25.
• The number of self-plays is set to 100.
• The replay buffer limit is set to 20 iterations.
• β for L^KL is 1.
• ε for L^CLIP is 0.2.
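Returning to the HSR definition above, a direct Python transcription of the predicate looks as follows; the base-case constants (n = 1, and k = 0 or q = 0) are taken from the FOL definition as reconstructed here, so treat the exact boundary values as an assumption.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def hsr(k, q, n):
    """HSR(k, q, n): can the highest safe rung of an n-rung ladder always be
    located with at most k jars and q tests? Direct transcription of the FOL."""
    if n == 1:
        return True
    if k == 0 or q == 0:
        return False
    # Exists m in [1, n): HSR(k-1, q-1, m) and HSR(k, q-1, n-m)
    return any(hsr(k - 1, q - 1, m) and hsr(k, q - 1, n - m) for m in range(1, n))
```

For example, under this reconstruction hsr(2, 3, 7) evaluates to True while hsr(2, 3, 8) evaluates to False.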
The evaluation phase is like the self-play phase: the players optimize their moves through MCTS simulations and play against each other. However, instead of using the same neural networks for both players, we use the newly trained policy/value networks to play against the networks from the last iteration. Specifically, we first run a given number (in our experiment, 20) of games between the Proponent, who uses the newly trained networks, and the Opponent, who uses the previously trained networks; then we run the same number of games between the Proponent, who uses the previously trained networks, and the Opponent, who uses the newly trained networks.

We then use fault counting to measure the performance in each game. The concept of fault counting is based on the correctness measurement in [30], where an action is correct if it preserves a winning position:
• Proponent's correctness: Given (k, q, n), correct actions exist only if n ≤ N(k, q). In this case, all testing points in the range [n − N(k, q−1), N(k−1, q−1)] are acceptable. Otherwise, there is no correct action.
• Opponent's correctness: Given (k, q, n, m), when n > N(k, q), any action is regarded as correct if N(k−1, q−1) ≤ m ≤ n − N(k, q−1); otherwise, the OP should take "not break" if m > n − N(k, q−1) and "break" if m < N(k−1, q−1). When n ≤ N(k, q), the OP should take the action "not break" if m < n − N(k, q−1) and take the action "break" if m > N(k−1, q−1); otherwise, there is no correct action.
Fault counting then counts the cases where a player makes a mistake: there is a move that keeps the winning position, but the player chooses an incorrect one. If a player makes such a mistake and the opposing player catches it by moving to a winning position, we increase the fault count by one for the player who made the mistake (a code sketch of these rules is given below). It is to be noted that a player can lose the game without making any fault: the player may be in a losing position with no correct action to take. In other words, if a player is forced to lose, then they should not be blamed for losing that game.

The learning is considered to have converged if both the newly trained and the previously trained networks show zero faults for a certain number (in our experiment, 5) of consecutive learning iterations. In our experiments, we measure the fault counts and the number of iterations needed before convergence for each configuration; together with the loss curves, this allows us to evaluate the performance of the different configurations.

In this experiment, we measure the performance of AZ and CE. We set the maximum number of iterations to 100 and measure the number of iterations needed before the fault count first reaches 0 for both players. We ran the experiment 15 times; as shown in Fig. 2, CE converges in 35 iterations, while AZ never converges given the hyperparameters mentioned in the previous section. This result shows that merely adding the keep-counting-info warm-start already improves efficiency significantly.
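The sketch below renders these correctness rules in Python. The helper N(k, q), which the paper takes from [30], is assumed here to be the largest ladder height for which HSR(k, q, n) holds, and the action labels ('break' / 'not break') are illustrative.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def N(k, q):
    """Assumed helper: the largest n with HSR(k, q, n) = True (see [30])."""
    if k <= 0 or q <= 0:
        return 1
    return N(k - 1, q - 1) + N(k, q - 1)

def proponent_correct(k, q, n, m):
    """Is testing rung m a correct Proponent move? None means no correct move exists."""
    if n > N(k, q):
        return None
    return n - N(k, q - 1) <= m <= N(k - 1, q - 1)

def opponent_correct(k, q, n, m, reply):
    """Is the Opponent's reply ('break' or 'not break') to test point m correct?"""
    a, b = N(k - 1, q - 1), n - N(k, q - 1)
    if n > N(k, q):
        if a <= m <= b:
            return True                               # any reply keeps OP's winning position
        return reply == ('not break' if m > b else 'break')
    if m < b:
        return reply == 'not break'
    if m > a:
        return reply == 'break'
    return None                                       # P played correctly; no saving reply
```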
Figure 2: Number of iterations needed before both players first reach 0 faults. It can be seen that without keep-counting-info, AZ cannot find the optimal policy at all; in fact, AZ finds the optimal policy only when we increase the number of MCTS simulations to 100, four times the current value.
In this experiment, we measure the performance of CE and CE_Sep. We set the maximum number of iterations to 100 and measure the fault count after each iteration. We ran the experiment 15 times and found that CE can be unstable compared to CE_Sep (as shown in Fig. 3 and Fig. 4). This verifies the observation from [1], where the authors also recommend using separate policy/value neural networks. We think the spikes are due to interfering target signals in a layer-sharing neural network, where the value signal and the policy signal interfere with each other unnecessarily.
Figure 3: The fault-count measurement from one of the 15 experiments on CE. We only show one experiment because of page limits, but all 15 experiments have a similar curve where spikes appear after convergence, which signifies unstable learning.

Figure 4: The fault-count measurement from one of the 15 experiments on CE_Sep. Because of page limits, we only show one experiment, but all 15 experiments have a similar curve where no spike appears after convergence, which is a sign of stable learning.

Since keep-counting-info and separate policy/value networks are essential for efficiency and stability, we always apply these two configurations in the rest of our experiments. In this experiment, we measure the average number of iterations needed before convergence for CE_Sep, CE_Q_Sep, PPO_CLIP_Sep, PPO_KL_Sep, and
PPO_KL_Sep_2NN. We run 20 experiments for each configuration and plot a box-and-whisker graph for each of them (see Fig. 5). Our experiments provide several meaningful results:

(1) Q-value injection helps increase the efficiency of the algorithm. It can be seen from the graph that CE_Q_Sep locates the optimal policy faster than CE_Sep on average. That is because the Q-values from the value network accelerate the search process of MCTS, helping it evaluate the UCB formula without bias. As a result, we use Q-value injection in the rest of our experiments.

(2) Using L^KL is much better than using L^CLIP. As shown in the graph, PPO_CLIP_Sep has a higher average value and a larger variance than PPO_KL_Sep. We think this might be due to the reward sparsity in semantic games, which causes the advantage to become very small and hence provide less information to update the policy network for L^CLIP. On the other hand, L^KL has a KL regularization term, which works like cross-entropy and provides more information for learning the policy network.

(3) Separate player neural networks seem to perform slightly worse than using the same policy/value networks for both players. It is to be noted, however, that PPO_KL_Sep_2NN runs much faster than PPO_KL_Sep because of the smaller network size for the Opponent player: it takes on average 10s to finish one training epoch for PPO_KL_Sep, but only 2s for PPO_KL_Sep_2NN. Therefore, even though PPO_KL_Sep_2NN needs on average 2 to 3 more iterations to find the optimal strategy, it still takes less time than PPO_KL_Sep, which means separate player neural networks do increase efficiency.

Figure 5: The box-and-whisker graph of the number of iterations needed before convergence for the different configurations.
To further investigate the behavior of the different configurations, we also plot the loss curves of both the policy and value networks for each configuration (Fig. 6). CE's value loss curve drops much faster than those of the other configurations, which suggests that the value network may quickly converge to a local optimum at the beginning of learning; it then takes longer to escape that local optimum, which lowers efficiency. Another fact to notice is that the policy loss of the Opponent's networks in the PPO_KL_Q_Sep_2NN configuration converges earlier than the Proponent's, which means the separated player networks capture the asymmetry of this semantic game properly.
Figure 6: The loss curves for the different configurations. Notice that we did not plot PPO_CLIP_Q_Sep, because the loss curve of L^CLIP behaves differently, which makes it incomparable to the rest of the loss curves.

It is to be noted that HSR is a special problem for which the ground truth is already known. However, in general, for most combinatorial problems the ground truth is unknown, requiring us to find an alternative way to measure performance. In this additional experiment, we use two performance scoring techniques, Elo-rating [8] and α-Rank [23], to score the Proponent player at each learning iteration. Performance scoring allows us to measure a player's performance simply through the game results, or through the pay-off table obtained from competitions with other player instances. For Elo-rating, we run a competition between the two players after each training iteration and compute their scores; for α-Rank, we store all player instances after each iteration and run a competition among players from different training iterations to generate a pay-off table, then run the α-Rank algorithm to obtain the score. In order to generate the pay-off table more efficiently, we run the scoring process on two relatively smaller HSR instances (Fig. 7 and Fig. 8). We can see a clear phase transition where the algorithm jumps from a low score to a high score, which indicates that an optimal strategy has been found.
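For reference, the Elo-rating [8] update behind this scoring is the standard one sketched below; the K-factor and the use of per-game updates are generic assumptions, not settings reported in the paper.

```python
def elo_update(r_a, r_b, score_a, k=32):
    """Standard Elo update after one game; score_a is 1 (A wins), 0.5 (draw), or 0 (A loses)."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1 - score_a) - (1 - expected_a))
    return r_a_new, r_b_new
```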
Figure 7: For the first (smaller) HSR instance, convergence happens very quickly. The α-Rank score shows a very similar pattern. A further check of the fault-counting metric confirmed that both players converge to 0 faults.

Figure 8: For the second HSR instance, the Elo-rating starts at 600 and drops until iteration 4; after the fourth iteration the Elo-rating increases almost linearly up to iteration 7, after which there is a jump of around 550 points at iteration 8 and a 220-point jump at iteration 9, staying constant at around 1420 thereafter. The α-Rank score for this experiment stays constant at 0.02 for the first 7 iterations, then shoots to 0.97 at iteration 8 and remains constant at 0.99 thereafter.

This paper proposed a framework, Persephone, to efficiently map a first-order problem to a two-player semantic game and then play and learn an optimal game strategy through a neural MCTS based RL algorithm. The learned optimal strategy can then be mapped back to an optimal solution of the original problem. We also introduced a formal definition of symmetric/asymmetric extensive form games, which motivated us to investigate asymmetric neural network designs. We proposed several variants of the vanilla AlphaZero algorithm, such as different policy learning strategies, warm-start, separate policy/value networks, and separate player networks, and carried out experiments on different configurations. The experimental results can be measured either through ground-truth based metrics, like fault counting in our case, or through performance scoring techniques like Elo-rating and α-Rank. Our experimental results show that KL-divergence regularized PPO policy learning with warm-start MCTS and separated neural networks performs best, which justifies our improvements to the original AlphaZero algorithm.

REFERENCES
[1] Marcin Andrychowicz, Anton Raichuk, P. Stanczyk, Manu Orsini, S. Girgin, Raphael Marinier, Léonard Hussenot, M. Geist, Olivier Pietquin, M. Michalski, S. Gelly, and Olivier Bachem. 2020. What Matters In On-Policy Reinforcement Learning? A Large-Scale Empirical Study. ArXiv abs/2006.05990 (2020).
[2] Thomas Anthony, Zheng Tian, and David Barber. 2017. Thinking Fast and Slow with Deep Learning and Tree Search. In Proceedings of the 31st International Conference on Neural Information Processing Systems (Long Beach, California, USA) (NIPS'17). 5366–5376.
[3] P. Auer, N. Cesa-Bianchi, and P. Fischer. 2002. Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning 47, 2 (2002), 235–256.
[4] David Auger, Adrien Couetoux, and Olivier Teytaud. 2013. Continuous Upper Confidence Trees with Polynomial Exploration - Consistency. In ECML/PKDD (1) (Lecture Notes in Computer Science, Vol. 8188). Springer, 194–209.
[5] Irwan Bello, Hieu Pham, Quoc V. Le, Mohammad Norouzi, and Samy Bengio. 2016. Neural Combinatorial Optimization with Reinforcement Learning. arXiv preprint arXiv:1611.09940 (2016).
[6] Cameron Browne, Edward Jack Powley, Daniel Whitehouse, Simon M. Lucas, Peter I. Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez Liebana, Spyridon Samothrakis, and Simon Colton. 2012. A Survey of Monte Carlo Tree Search Methods. IEEE Trans. Comput. Intellig. and AI in Games 4, 1 (2012), 1–43.
[7] Quentin Cappart, Thierry Moisan, Louis-Martin Rousseau, Isabeau Prémont-Schwarz, and Andre Cire. 2020. Combining Reinforcement Learning and Constraint Programming for Combinatorial Optimization. arXiv:2006.01610 [cs.AI]
[8] A. E. Elo. 1978. The Rating of Chessplayers, Past and Present. Arco Pub.
[9] Chao Gao, Martin Müller, and Ryan Hayward. 2018. Three-Head Neural Network Architecture for Monte Carlo Tree Search. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18). International Joint Conferences on Artificial Intelligence Organization, 3762–3768.
[10] Michael Genesereth, Nathaniel Love, and Barney Pell. 2005. General Game Playing: Overview of the AAAI Competition. AI Magazine 26, 2 (2005), 62–62.
[11] Adrian Goldwaser and Michael Thielscher. 2020. Deep Reinforcement Learning for General Game Playing. In AAAI.
[12] Jean-Bastien Grill, Florent Altché, Yunhao Tang, T. Hubert, Michal Valko, Ioannis Antonoglou, and Rémi Munos. 2020. Monte-Carlo Tree Search as Regularized Policy Optimization. ArXiv abs/2007.12509 (2020).
[13] I. Grondman, L. Busoniu, G. A. D. Lopes, and R. Babuska. 2012. A Survey of Actor-Critic Reinforcement Learning: Standard and Natural Policy Gradients. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 42, 6 (2012), 1291–1307.
[14] Jessica B. Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Tobias Pfaff, Théophane Weber, Lars Buesing, and Peter W. Battaglia. 2020. Combining Q-Learning and Search with Amortized Value Estimates. ArXiv abs/1912.02807 (2020).
[15] Y. Heller. 2014. Stability and trembles in extensive-form games. Games Econ. Behav. 84 (2014), 132–136.
[16] Jaakko Hintikka. 1982. Game-theoretical semantics: insights and prospects. Notre Dame J. Formal Logic 23, 2 (1982), 219–241.
[17] Elias Khalil, Hanjun Dai, Yuyu Zhang, Bistra Dilkina, and Le Song. 2017. Learning Combinatorial Optimization Algorithms over Graphs. In Advances in Neural Information Processing Systems. 6348–6358.
[18] Levente Kocsis and Csaba Szepesvári. 2006. Bandit Based Monte-Carlo Planning. In Proceedings of the 17th European Conference on Machine Learning (Berlin, Germany) (ECML'06). Springer-Verlag, 282–293.
[19] Alexandre Laterre, Yunguan Fu, Mohamed Khalil Jabri, Alain-Sam Cohen, David Kas, Karl Hajjar, Torbjorn S. Dahl, Amine Kerkeni, and Karim Beguir. 2018. Ranked Reward: Enabling Self-Play Reinforcement Learning for Combinatorial Optimization. arXiv preprint arXiv:1807.01672 (2018).
[20] Michael L. Littman. 1994. Markov Games as a Framework for Multi-Agent Reinforcement Learning. In Proceedings of the Eleventh International Conference on Machine Learning (New Brunswick, NJ, USA) (ICML'94). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 157–163.
[21] Shie Mannor, R. Rubinstein, and Yohai Gat. 2003. The Cross Entropy Method for Fast Policy Search. In ICML.
[22] Nina Mazyavkina, S. Sviridov, S. Ivanov, and Evgeny Burnaev. 2020. Reinforcement Learning for Combinatorial Optimization: A Survey. ArXiv abs/2003.03600 (2020).
[23] Shayegan Omidshafiei, C. Papadimitriou, G. Piliouras, K. Tuyls, M. Rowland, Jean-Baptiste Lespiau, W. Czarnecki, Marc Lanctot, Julien Pérolat, and R. Munos. 2019. α-Rank: Multi-Agent Evaluation by Evolution. Scientific Reports (2019).
[24] Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, Timothy Lillicrap, and David Silver. 2019. Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model. ArXiv abs/1911.08265 (2019).
[25] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. 2015. Trust Region Policy Optimization. In Proceedings of the 32nd International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 37), Francis Bach and David Blei (Eds.). PMLR, Lille, France, 1889–1897.
[26] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal Policy Optimization Algorithms. arXiv:1707.06347
[27] R. Selten. 1983. Evolutionary stability in extensive two-person games.
[28] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529 (Jan. 2016), 484.
[29] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. 2017. Mastering the game of Go without human knowledge. Nature 550 (Oct. 2017), 354.
[30] Ruiyang Xu and Karl J. Lieberherr. 2019. Learning Self-Game-Play Agents for Combinatorial Optimization Problems. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems (AAMAS 2019).