An Optimal Computing Budget Allocation Tree Policy for Monte Carlo Tree Search
Yunchuan Li, Michael C. Fu and Jie Xu
Abstract—We analyze a tree search problem with an underlying Markov decision process, in which the goal is to identify the best action at the root that achieves the highest cumulative reward. We present a new tree policy that optimally allocates a limited computing budget to maximize a lower bound on the probability of correctly selecting the best action at each node. Compared to widely used Upper Confidence Bound (UCB) tree policies, the new tree policy presents a more balanced approach to managing the exploration and exploitation trade-off when the sampling budget is limited. Furthermore, UCB assumes that the support of the reward distribution is known, whereas our algorithm relaxes this assumption. Numerical experiments demonstrate the efficiency of our algorithm in selecting the best action at the root.
Index Terms—Stochastic optimal control, Monte Carlo tree search, machine learning, optimization algorithms
I. INTRODUCTION
We consider a reinforcement learning problem where an agent interacts with an underlying environment. A Markov Decision Process (MDP) with finite horizon is used to model the environment. In each move, the agent takes an action, receives a reward, and lands in a new state. The reward is usually random, and its distribution depends on both the state of the agent and the action taken. The distribution of the next state is also determined by the agent's current state and action. Our goal is to determine the optimal sequence of actions that leads to the highest expected reward. The optimality of the decision policy is evaluated by the probability of correctly selecting the best action in the first stage of the underlying MDP.

If the distributions and the dynamics of the environment are known, the optimal set of actions can be computed through dynamic programming [2]. Under more general settings where the agent does not have perfect information regarding the environment, [3] proposed an adaptive algorithm based on a Multi-Armed Bandit (MAB) model and the Upper Confidence Bound (UCB) [4]. [5] and [6] applied UCB to tree search, and [6] invented the term Monte Carlo Tree Search (MCTS) and used it in a Go-playing program for the first time. Since then, MCTS has been developed extensively and applied to various games such as Othello [7] and Go [8]. To deal with different types of problems, several variations of MCTS have been introduced, e.g., Flat UCB (and its extension, the Bandit Algorithm for Smooth Trees) [9] and Single-Player MCTS (for single-player games) [10].

However, most bandit-based MCTS algorithms are designed to minimize regret (or maximize the cumulative reward of the agent), whereas in many situations, the goal of the agent may be to efficiently determine the optimal set of actions within a limited sampling budget. To the best of our knowledge, there is limited effort in the literature aimed at addressing the latter problem. [11] first incorporated Best Arm Identification (BAI) into MCTS for a MIN-MAX game tree, and provided upper bounds on the number of play-outs under different settings. [12] had an objective similar to [11], but with a tighter bound. Their tree selection policy selects the node with the largest confidence interval, which can be seen as choosing the node with the highest variance. In some sense, this is a pure exploration policy and would not efficiently use the limited sampling budget. In our work, we are motivated to establish a tree policy that intelligently balances exploration and exploitation (analogous to the objective of UCB). The algorithms developed in [11] and [12] are only for MIN-MAX game trees, whereas our new tree policy can be applied to more general types of tree search problems. The MCTS algorithm in [13] is more general than [11] and [12], but its goal is to estimate the maximum expected cumulative reward at the root node, whereas we focus on identifying the optimal action.

Algorithms that focus on minimizing regret tend to discourage exploration. This tendency can be seen in two ways. Suppose at some point an action was performed and received a small reward. To minimize regret, the algorithm would be discouraged from taking this action again. However, the small reward could be due to the randomness in the reward distribution. Mathematically, [14] showed that for MAB algorithms, the number of times the optimal action is taken is exponentially more than that of sub-optimal ones, which makes sense when the objective is to maximize the cumulative reward, since the exploration of other actions is highly discouraged. This leads to our second motivation: is there a tree policy that explores sub-optimal actions more to ensure the optimal action is found?

Apart from the lack of exploration that results from the underlying MAB model's objective to minimize regret or maximize cumulative reward, most MCTS algorithms assume that the support of the reward distribution is bounded and known (typically assumed to be [0, 1]). With the support of the reward distribution known, the parameter in the upper confidence term in UCB is tuned or the reward is normalized. However, a general tree search problem may well have an unknown and practically unbounded range of rewards. In such a case, assuming a range can lead to very poor performance. Therefore, the third motivation of our research is to relax the known reward support assumption.

To tackle the challenge of balancing exploration and exploitation with a limited sampling budget for a tree policy, we model the tree selection problem at each stage as a statistical Ranking & Selection (R&S) problem and propose a new tree policy for MCTS based on an adaptive algorithm from the R&S community. Similar to the MAB problem, R&S assumes that we are given a set of bandit machines (often referred to as alternatives in the R&S literature) with unknown reward distributions, and the goal is to select the machine with the highest mean reward. Specifically, we develop an MCTS tree policy based on the Optimal Computing Budget Allocation (OCBA) framework [15]. OCBA was first proposed in [16], and aims at maximizing the probability of correctly selecting the action with the highest mean reward using a limited sampling budget. More recent developments of OCBA include addressing multiple objectives [17] and subset selection [18], [19].

The objective of the proposed OCBA tree policy is to maximize the Approximate Probability of Correct Selection (APCS), which is a lower bound on the probability of correctly selecting the optimal action at each node. Intuitively, the objective function of the new OCBA tree selection policy leads to an optimal balance between exploration and exploitation with a limited sampling budget, and thus helps address the drawbacks of existing work that either pursues pure exploration [11], [12] or exponentially discourages exploration [14]. Our new OCBA tree policy also removes the known and bounded support assumption on the reward distribution, because the new OCBA policy determines the sampling allocation based on the posterior distribution of each action, which is updated adaptively according to samples.

To summarize, the contributions of this paper include the following:
1) We propose a new tree policy for MCTS with the objective of maximizing APCS with a limited sampling budget. The new tree policy optimally balances exploration and exploitation to efficiently select the optimal action. The new OCBA tree selection policy also relaxes the assumption of known bounded support on the reward distribution.
2) We present a sequential algorithm to implement the new OCBA tree policy that maximizes the APCS at each sampling stage and prove that our algorithm converges to the optimal action.
3) We provide theoretical analyses, including a convergence guarantee, a proof of optimality of the proposed algorithm, and an exploration-exploitation trade-off analysis of the proposed algorithm, which works differently from bandit-based algorithms and is more suitable for identifying the best action.
4) We demonstrate the efficiency of our algorithm through numerical experiments.

This work was supported in part by the National Science Foundation under Grants CMMI-1434419 and DMS-1923145, the Air Force Office of Scientific Research under Grant FA9550-19-1-0383, and by the Defense Advanced Research Projects Agency (DARPA) under Grant N660011824024. The views, opinions, and/or findings expressed are those of the authors and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government. A preliminary version of this work [1] was published in the proceedings of the 2019 IEEE Conference on Decision and Control. Y. Li is with the Department of Electrical and Computer Engineering and the Institute for Systems Research, University of Maryland, College Park, USA (e-mail: [email protected]). M. C. Fu is with the R. H. Smith School of Business and the Institute for Systems Research, University of Maryland, College Park, USA (e-mail: [email protected]). J. Xu is with the Department of Systems Engineering and Operations Research, George Mason University, Fairfax, VA 22030, USA (e-mail: [email protected]).
Remark 1.
In much of the computer science/artificial intelligence literature, an algorithm that focuses on determining the optimal set of actions under a limited budget is defined as a pure exploration algorithm (see, e.g., [20], [21], [22]), whereas we view such algorithms as retaining a balance between exploration and exploitation, as the analysis in Section III shows. In statistical R&S, pure exploration generally implies sampling based primarily on the variance of each action, which often leads to sampling suboptimal actions more. This will be made clearer in Section V, where we show that OCBA-MCTS actually samples highly suboptimal actions less and "exploits" the potentially optimal actions more.
The rest of the paper is organized as follows. We present the problem formulation in Section II, and review the proposed OCBA-MCTS algorithm in Section III. Theoretical analyses, including convergence theorems and an exploration-exploitation analysis, are carried out in Section IV. Proofs are given in the Appendix. Numerical examples are presented in Section V to evaluate the performance of our algorithm. Section VI concludes the paper and points to future research directions.

A preliminary version of this work was presented in [1], where a simpler tree policy (not employing the node representation adopted in the current work) was used. In addition to improving the efficiency of the MCTS algorithm, here we prove that our proposed algorithm converges asymptotically to the optimal action, and provide an exploration-exploitation trade-off analysis, both analytically and through a more comprehensive set of numerical experiments.

II. PROBLEM FORMULATION
Consider a finite-horizon MDP $M = (X, A, P, R)$ with horizon length $H$, finite state space $X$, finite action space $A$ with $|A| > 1$, bounded reward function $R = \{R_t, t = 0, 1, \ldots, H\}$ such that $R_t$ maps a state-action pair to a random variable (r.v.), and transition function $P = \{P_t, t = 0, 1, \ldots, H\}$ such that $P_t$ maps a state-action pair to a probability distribution over $X$. We assume that $P_t$ is unknown and/or $|X|$ and $|A|$ are very large, and hence it is not feasible to solve the problem by dynamic programming. Further define $X_a$ and $A_x$ as the available child states when taking action $a$ and the available actions at state $x$, respectively. Denote by $P_t(x,a)(y)$ the probability of transitioning to state $y \in X_a$ from state $x \in X$ when taking action $a \in A_x$ in stage $t$, and by $R_t(x,a)$ the reward in stage $t$ obtained by taking action $a$ in state $x$. Let $\Pi$ be the set of all possible nonstationary Markovian policies $\pi = \{\pi_i \mid \pi_i : X \to A, i \ge 0\}$.

Bandit-based algorithms for MDPs seek to minimize the expected cumulative regret, whereas our objective is to identify the best action that leads to the maximum total expected reward given by $E\big[\sum_{t=0}^{H-1} R_t(x_t, \pi_t(x_t))\big]$ for a given $x_0 \in X$. We first define the optimal reward-to-go value function for state $x$ in stage $i$ by

$$V_i^*(x) = \max_{\pi \in \Pi} E\Big[\sum_{t=i}^{H-1} R_t(x_t, \pi_t(x_t)) \,\Big|\, x_i = x\Big], \quad i = 0, 1, \ldots, H-1; \qquad V_H^*(x) = 0, \ \forall x \in X. \tag{1}$$

Also define

$$Q_i(x,a) = E[R_i(x,a)] + \sum_{y \in X_a} P_i(x,a)(y)\, V_{i+1}^*(y),$$

with $Q_H(x,a) = 0$. It is well known [2] that Eq. (1) can be written via the standard Bellman optimality equation:

$$V_i^*(x) = \max_{a \in A_x}\big(E[R_i(x,a)] + E_{P_i(x,a)}[V_{i+1}^*(Y)]\big) = \max_{a \in A_x}\Big(E[R_i(x,a)] + \sum_{y \in X_a} P_i(x,a)(y)\, V_{i+1}^*(y)\Big) = \max_{a \in A_x} Q_i(x,a), \quad i = 0, 1, \ldots, H-1,$$

where $Y \sim P_i(x,a)(\cdot)$ represents the random next state.

Since we are considering a tree search problem, some additional notation and definitions beyond the MDP setting are needed. Define a state node as a tuple that contains the state and the stage number:

$$x = (x, i) \in \mathcal{X}, \quad \forall x \in X, \ 0 \le i \le H,$$

where $\mathcal{X}$ is the set of state nodes. Similarly, we define a state-action node as a tuple of state, stage number, and action (i.e., a state node followed by an action):

$$a = (x, a) = (x, i, a), \quad \forall x \in X, \ 0 \le i \le H, \ a \in A_x.$$

Now we can rewrite the immediate reward function, the value functions for a state and a state-action pair, and the state transition distribution in terms of state nodes and state-action nodes, respectively, as

$$R(a) = R(x,a) := R_i(x,a), \quad V^*(x) := V_i^*(x), \quad Q(a) = Q(x,a) := Q_i(x,a), \quad P(a) = P(x,a) := P_i(x,a).$$

Similarly, $V^*(x)$ and $Q(x,a)$ are assumed to be zero for all terminal state nodes $x$. To make our presentation clearer, we adopt the following definitions based on nodes: define $N(x)$ and $N(x,a)$ as the numbers of visits to node $x$ and $(x,a)$, respectively, $X_a$ as the set of child state nodes of parent node $a$, and $A_x$ as the set of available child actions at node $x$.

Traditionally, MCTS algorithms aim at estimating $V^*(x)$ and model the selection process in each stage as an MAB problem, i.e., they view $Q(x,a)$, where $(x,a)$ are the child state-action nodes of $x$, as a set of bandit machines ([3], [5]), and minimize the regret, namely,

$$\min_{a_1, \ldots, a_N \in A_x} \Big\{ N \max_{a \in A_x} Q(x,a) - \sum_{k=1}^{N} Q(x, a_k) \Big\} = \Big\{ N V^*(x) - \sum_{k=1}^{N} Q(x, a_k) \Big\}$$

for $x$ in stages $1, 2, \ldots, H$, where $N$ and $a_k$ are the number of rollouts/simulations (also known as the total sampling budget in much of the Ranking & Selection literature) and the $k$-th action sampled at state node $x$ by the tree policy, respectively. The meaning of a rollout will become clearer in Section III. In this paper, our goal is to identify the optimal action that achieves the highest cumulative reward at the root with initial state $x_0$, that is, to find

$$a_{x_0}^* = \arg\max_{a \in A_{x_0}} Q(x_0, a),$$

where the root state node is $x_0 = (x_0, 0)$. Let $\hat{Q}(x,a) = R(x,a) + V^*(y)$ be the random cumulative reward obtained by taking action $a$ at state node $x$, where $y$ is the random state node reached. Clearly, $\hat{Q}(x,a)$ is a random variable. We assume $\hat{Q}(x,a)$ is normally distributed with known variance, and that its mean $\mu(x,a)$ has a conjugate normal prior with mean equal to $Q(x,a)$. Hence we have $Q(x,a) = E[E[\hat{Q}(x,a) \mid \mu(x,a)]]$.

Remark 2.
For our derivations, we assume the variance of the sampling distribution of $\hat{Q}(x,a)$ is known; however, in practice, the prior variance may be unknown, in which case estimates such as the sample variance are used [23].

Consider the non-informative case, i.e., the prior mean $Q(x,a)$ is unknown. It can be shown [24] that the posterior of $\mu(x,a)$ given observations (i.e., samples) is also normal. For convenience, define the $t$-th sample by $\hat{Q}_t(x,a)$. Then the conditional distribution of $\mu(x,a)$ given the set of samples $(\hat{Q}_1(x,a), \hat{Q}_2(x,a), \ldots, \hat{Q}_{N(x,a)}(x,a))$ is

$$\tilde{Q}(x,a) \sim \mathcal{N}\Big(\bar{Q}(x,a), \frac{\sigma^2(x,a)}{N(x,a)}\Big), \tag{2}$$

where

$$\bar{Q}(x,a) = \frac{1}{N(x,a)} \sum_{t=1}^{N(x,a)} \hat{Q}_t(x,a), \qquad \tilde{Q}(x,a) = \mu(x,a) \mid \big(\hat{Q}_1(x,a), \hat{Q}_2(x,a), \ldots, \hat{Q}_{N(x,a)}(x,a)\big),$$

and $\sigma^2(x,a)$ is the variance of $\hat{Q}(x,a)$, which can be approximated by the sample variance

$$\hat{\sigma}^2(x,a) = \frac{1}{N(x,a)} \sum_{t=1}^{N(x,a)} \big(\hat{Q}_t(x,a) - \bar{Q}(x,a)\big)^2.$$

Remark 3.
If the samples of $Q(x,a)$ are not normally distributed, the normal assumption can be justified by batch sampling and the central limit theorem.

Under these settings, our objective is to maximize the Probability of Correct Selection (PCS), defined by
$$\mathrm{PCS} = P\Big[\bigcap_{a \in A_x, a \neq \hat{a}_x^*} \big(\tilde{Q}(x, \hat{a}_x^*) \geq \tilde{Q}(x,a)\big)\Big] \tag{3}$$

for a state node $x$, where $\hat{a}_x^*$ is the action that achieves the highest mean sample Q-value at that node, i.e., $\hat{a}_x^* = \arg\max_{a \in A_x} \bar{Q}(x,a)$.

PCS is hard to compute because of the intersections in the (joint) probability. We seek to simplify the joint probability by changing the intersections to sums using the Bonferroni inequality to make the problem tractable. By the Bonferroni inequality, PCS is lower bounded by the Approximate Probability of Correct Selection (APCS), that is,

$$\mathrm{PCS} \geq 1 - \sum_{a \in A_x, a \neq \hat{a}_x^*} P\big[\tilde{Q}(x, \hat{a}_x^*) \leq \tilde{Q}(x,a)\big] =: \mathrm{APCS}. \tag{4}$$

The objective of our new tree policy is to maximize APCS as given in Equation (4). Compared to MAB's objective of minimizing the expected cumulative regret, this objective function will result in an allocation of sampling budget to alternative actions in a way that optimally balances exploration and exploitation. This objective function is motivated by the OCBA algorithm [15] in the R&S literature. We will present and analyze our OCBA tree policy in the following sections.
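As a concrete illustration of Equations (2)-(4), the short sketch below (a minimal, self-contained Python example; the function names and the two-action sample data are made up for illustration, not taken from the paper) computes the posterior parameters of each action's $\tilde{Q}$ from its samples and evaluates the APCS lower bound under independent normal posteriors.

```python
import numpy as np
from scipy.stats import norm

def posterior_params(samples):
    """Posterior of the mean given N samples: Normal(sample mean, sample variance / N)."""
    samples = np.asarray(samples, dtype=float)
    n = len(samples)
    q_bar = samples.mean()            # \bar{Q}(x,a)
    sigma2 = samples.var()            # \hat{\sigma}^2(x,a), 1/N version as in the text
    return q_bar, sigma2 / n          # mean and variance of \tilde{Q}(x,a)

def apcs(stats):
    """APCS = 1 - sum_{a != a*} P[ \tilde{Q}(x,a*) <= \tilde{Q}(x,a) ]."""
    means = np.array([m for m, _ in stats])
    variances = np.array([v for _, v in stats])
    best = int(np.argmax(means))      # \hat{a}^*_x: action with largest sample mean
    total = 0.0
    for a, (m, v) in enumerate(stats):
        if a == best:
            continue
        # probability that the difference of the two normal posteriors is <= 0
        total += norm.cdf(0.0, loc=means[best] - m, scale=np.sqrt(variances[best] + v))
    return 1.0 - total

# hypothetical samples of \hat{Q}(x,a) for two actions at one state node
stats = [posterior_params([4.0, 5.5, 6.1]), posterior_params([3.2, 2.8, 4.0])]
print(apcs(stats))
```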
III. ALGORITHM DESCRIPTION
In this section, we first briefly describe the four main phases of an MCTS algorithm, i.e., selection, expansion, simulation, and backpropagation. Then, we propose a novel tree policy in the selection stage that aims at finding the optimal action at each state node.

A. Canonical MCTS algorithm

Here we briefly summarize the four phases in a typical MCTS algorithm. We refer readers to [25] for a complete illustration of these phases. Algorithm 1 presents a canonical MCTS, with detailed descriptions of the main phases below.
1) Selection:
In this phase, the algorithm navigates down the tree from the root state node to an expandable node, i.e., a node with unvisited child nodes. We assume that expansion automatically follows whenever a state-action node is encountered. Therefore, when determining the path down, there are three possible situations:
(i) If a state-action node is encountered (denoted by $(x,a)$), we land in a new state node $y$, which is obtained by calling the expansion function. Then, we continue with the selection algorithm.
(ii) If an expandable state node (which could be a leaf node) is encountered, we call the expansion function to add a new child state-action node and a state node (by automatically expanding the state-action node) to the path. Then, we stop the selection phase and return the path from the root to this state node. Finally, we proceed with the simulation and backpropagation phases.
(iii) If an unexpandable state node is encountered (denoted by $x$), we employ a tree policy to determine which child action to sample. Then we enter the new state-action node $(x,a)$ and continue the selection algorithm from this state-action node. Tree policies can be broadly categorized into two types: deterministic, such as UCB1 and several of its variants (e.g., UCB-tuned, UCB-E), and stochastic, such as $\varepsilon$-greedy and EXP3; see [25] for a review.
2) Expansion:
In this phase, a random child state or state-action node of the given node is added. If the incoming node is a state node $x$, the next node is selected randomly (usually uniformly) from the unvisited child state-action nodes. If the incoming node is a state-action node $(x,a)$, the subsequent state node is found by simply sampling from the distribution $P(x,a)(\cdot)$.
3) Simulation:
In some literature, this phase is also known as "rollout". The simulation phase starts with a state node. The purpose of this step is to simulate a path from this node to a terminal node and produce a sample of the cumulative reward obtained by taking this path (which is a sample of the value of this node). The simulated path is generated by a default policy, which usually samples the feasible child state-action nodes uniformly. With this node's value sample, we may proceed to the backpropagation phase.
4) Backpropagation:
This phase simply takes the simulated node value and updates the values of the nodes in the path (obtained in the selection step) backward.

In the next subsection, we propose our tree policy based on OCBA and illustrate the detailed implementation of the four phases.
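As a compact illustration of how the four phases fit together, the following Python skeleton (illustrative only; the `selection`, `simulate`, and `backpropagate` arguments stand in for the concrete routines given as Algorithms 2-6 in Section III-B, and the node attribute names are our own) shows the outer loop that repeats the phases for a fixed number of rollouts.

```python
def mcts(root, n_rollouts, selection, simulate, backpropagate):
    """Generic MCTS outer loop: selection -> expansion -> simulation -> backpropagation."""
    for _ in range(n_rollouts):
        path = selection(root)        # walk down, expanding nodes as needed
        leaf = path[-1]               # newly added (or terminal) state node
        reward = simulate(leaf)       # default-policy rollout to a terminal node
        backpropagate(path, reward)   # push the sampled return back up the path
    # recommend the child action of the root with the highest estimated Q value
    return max(root.children, key=lambda child: child.q_bar)
```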
B. OCBA selection algorithm
We now present an efficient tree policy to estimate the optimal actions at every state node by estimating $V^*(x)$ and $Q(x,a)$ for all possible $a \in A_x$ at the state node. Denote the estimate of $V^*(x)$ at node $x$ by $\hat{V}^*(x)$, which is initialized to 0 for all state nodes. Our algorithm estimates $Q(x,a)$ for each action $a$ by its sample mean, and selects the action that maximizes the sample mean as $\hat{a}_x^*$. During the process, the estimate of $Q(x,a)$ is given by Equation (2) and the proposed new OCBA tree policy is applied. Our algorithm follows the algorithmic framework described in Section III-A, with the tree policy changed to OCBA and other mild modifications.

The structure of the proposed OCBA-MCTS algorithm is shown in Algorithms 1 to 6. There are two major characteristics: the first is to use the proposed OCBA algorithm for the tree policy. The second is to require each state-action node to be expanded $n_0 > 1$ times. The algorithm performs $N$ simulations (which will later be referred to as the number of rollouts or the sampling budget) from the root state node $x_0$, after which a partially expanded tree is obtained and the optimal action $\hat{a}_{x_0}^*$ can be derived.

When steering down the tree, if a state node $x$ is visited, the selection phase, which is illustrated in Algorithm 2, first determines whether there is a child state-action node that has been visited fewer than $n_0$ times at the given state node. If there is, then that state-action node will be sampled and added to the path. In other words, we try to expand each state node when it is visited, and require each node to be expanded $n_0$ times. If all the state-action nodes are well expanded, Algorithm 2 calls Algorithm 3 (OCBASelection), which calculates the allocation of samples to the child state-action nodes of the current state node for a total sampling budget $\sum_{a \in A_x} N(x,a) + 1$. To determine the number of samples allocated to each state-action node, denoted by $(\tilde{N}(x,a_1), \tilde{N}(x,a_2), \ldots, \tilde{N}(x,a_{|A_x|}))$ (where $a_i \in A_x$, $i = 1, \ldots, |A_x|$), the OCBA tree policy first identifies the child state-action node with the largest sample mean (the sample optimum) and computes the differences between the sample means of the sample optimum and all other nodes:

$$\hat{a}_x^* := \arg\max_a \bar{Q}(x,a), \qquad \delta_x(\hat{a}_x^*, a) := \bar{Q}(x, \hat{a}_x^*) - \bar{Q}(x,a), \quad \forall a \neq \hat{a}_x^*.$$

The set of allocations $(\tilde{N}(x,a_1), \tilde{N}(x,a_2), \ldots, \tilde{N}(x,a_{|A_x|}))$ that maximizes APCS can be obtained by solving the following set of equations:

$$\frac{\tilde{N}(x, a_{n+1})}{\tilde{N}(x, a_n)} = \left(\frac{\sigma(x, a_{n+1}) / \delta_x(\hat{a}_x^*, a_{n+1})}{\sigma(x, a_n) / \delta_x(\hat{a}_x^*, a_n)}\right)^2, \quad \forall a_n, a_{n+1} \neq \hat{a}_x^*, \ a_n, a_{n+1} \in A_x, \tag{5}$$

$$\tilde{N}(x, \hat{a}_x^*) = \sigma(x, \hat{a}_x^*) \sqrt{\sum_{a \in A_x, a \neq \hat{a}_x^*} \frac{\big(\tilde{N}(x,a)\big)^2}{\sigma^2(x,a)}}, \tag{6}$$

$$\sum_{a \in A_x} \tilde{N}(x,a) = \sum_{a \in A_x} N(x,a) + 1. \tag{7}$$

The derivations of Equations (5) to (7) are given in the appendix. After the new budget allocation is computed, the algorithm selects the "most starving" action to sample [23], i.e., it samples

$$\hat{a} = \arg\max_{a \in A_x} \big(\tilde{N}(x,a) - N(x,a)\big). \tag{8}$$

We highlight some major modifications to the canonical MCTS in the proposed algorithm. First, in the selection phase, we try to expand all "expandable" nodes visited when obtaining a path to a leaf. Since the variances of the values of a state node's child nodes are required in the proposed tree policy, we define a state node as expandable if it has child nodes that have been visited fewer than $n_0 > 1$ times. The child state-action node is expanded by sampling from $P(x,a)(\cdot)$, and the resulting state node is subsequently added to the path. The reward obtained by taking the action in the state node is also recorded and will be used in the backpropagation stage.

In the simulation and backpropagation phases, illustrated in Algorithms 5 and 6, a leaf-to-terminal path is simulated, and its reward is used to update the value of the leaf node. If we denote the leaf node and the reward from the simulated path by $x_l$ and $r$, respectively, the leaf node value estimate is updated by

$$\hat{V}^*(x_l) \leftarrow \frac{N(x_l) - 1}{N(x_l)} \hat{V}^*(x_l) + \frac{1}{N(x_l)}\, r. \tag{9}$$

After updating the leaf state node, we update the nodes in the path collected in the selection stage in reverse order. Suppose we have a path $(x_0, (x_0, a_0), \ldots, x_i, (x_i, a_i), x_{i+1}, \ldots, x_l)$ and the node values of $x_{i+1}, \ldots, x_l$ have been updated; the preceding nodes $x_i$ and $(x_i, a_i)$ are updated through

$$\hat{Q}_{N(x_i,a_i)}(x_i, a_i) = R(x_i, a_i) + \hat{V}^*(x_{i+1}), \tag{10}$$

$$\bar{Q}(x_i, a_i) \leftarrow \frac{N(x_i,a_i) - 1}{N(x_i,a_i)} \bar{Q}(x_i, a_i) + \frac{1}{N(x_i,a_i)} \hat{Q}_{N(x_i,a_i)}(x_i, a_i), \tag{11}$$

$$\bar{V}(x_i) \leftarrow \frac{N(x_i) - 1}{N(x_i)} \bar{V}(x_i) + \frac{1}{N(x_i)} \bar{Q}(x_i, a_i), \tag{12}$$

$$\hat{V}(x_i) \leftarrow \big(1 - \alpha_{N(x_i)}\big) \bar{V}(x_i) + \alpha_{N(x_i)} \max_{a \in A_{x_i}} \bar{Q}(x_i, a), \tag{13}$$

where $\bar{V}(\cdot)$ is an intermediate variable that records the average value of the node along the root-to-leaf path, and $\alpha_{N(x_i)} \in [0,1]$ is a smoothing parameter. The updates are performed backwards to the root node.

Details of the OCBA tree policy are shown in Algorithms 1 to 6.
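To make the allocation step concrete before the pseudocode listings, the sketch below is an illustrative Python rendering of Equations (5)-(8) for a single state node; the function and argument names are our own, and ties or degenerate variances are ignored for brevity. These two calls are, in effect, what Algorithm 3 performs each time an unexpandable state node is reached.

```python
import numpy as np

def ocba_allocation(q_bar, sigma, total_budget):
    """Solve Eqs. (5)-(7) for the allocation over the child state-action nodes.

    q_bar[a]     : sample mean \bar{Q}(x,a) of each child
    sigma[a]     : (estimated) standard deviation sigma(x,a)
    total_budget : sum_a N(x,a) + 1, the budget to distribute
    """
    q_bar, sigma = np.asarray(q_bar, float), np.asarray(sigma, float)
    best = int(np.argmax(q_bar))                 # sample optimum \hat{a}^*_x
    delta = q_bar[best] - q_bar                  # delta_x(\hat{a}^*_x, a)
    sub = [a for a in range(len(q_bar)) if a != best]
    ratio = np.zeros_like(q_bar)
    ref = sub[0]                                 # arbitrary reference suboptimal action
    for a in sub:
        # Equation (5): suboptimal actions in proportion to (sigma/delta)^2
        ratio[a] = (sigma[a] / delta[a]) ** 2 / (sigma[ref] / delta[ref]) ** 2
    # Equation (6): the sample optimum gets sigma_b * sqrt(sum ratio^2 / sigma^2)
    ratio[best] = sigma[best] * np.sqrt(np.sum((ratio[sub] / sigma[sub]) ** 2))
    # Equation (7): rescale so the allocation sums to the total budget
    return total_budget * ratio / ratio.sum()

def most_starving(alloc, visits):
    """Equation (8): pick the action whose allocation exceeds its visit count the most."""
    return int(np.argmax(np.asarray(alloc) - np.asarray(visits)))
```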
Algorithm 1: MCTS
Input: simulation budget (roll-out number) N, root state node x_0
Output: â*_{x_0}, V̂*(x_0)
  Set simulation counter n ← 0
  while n < N do
    path ← selection(x_0)
    leaf ← path[end]
    r ← simulate(leaf)
    backpropagate(path, r)
    n ← n + 1
  end
  return action â*_{x_0} = arg max_{a ∈ A_{x_0}} Q̄(x_0, a)

There are a few points worth emphasizing in Algorithm 3. First, $\tilde{N}(x, a_i)$ is the total number of samples for each action $i$ after the allocation. Given the present information, i.e., all samples at state node $x$, OCBA-MCTS assumes a total budget of $\sum_{a \in A_x} N(x,a) + 1$, and the allocation $(\tilde{N}(x,a_1), \tilde{N}(x,a_2), \ldots, \tilde{N}(x,a_{|A_x|}))$ that maximizes APCS is calculated. Afterwards, one action is selected based on Equation (8) to sample and move to the next stage. This "most-starving" implementation of the OCBA policy as given in Algorithm 3 is fully sequential, as each iteration allocates only one sample to an action before the allocation decision is recomputed. It is also possible to allocate the sampling budget in a batch of size $\Delta > 1$.
We use the "most-starving" scheme because it has been shown to be more efficient than the batch sampling scheme [26]. However, the benefit of sampling in batches for MCTS is that in one iteration, multiple root-to-leaf paths can be examined, enabling parallelization of the algorithm. We will consider this in future research.

Second, updating $\hat{V}(x_i)$ involves two stages: updating the value estimate along the path (Equation (12)) and taking the maximum over the values of the child state-action nodes (the canonical way to update). Then the two values are mixed through $\alpha_{N(x_i)}$ to update $\hat{V}(x_i)$ (Equation (13)), as prior research (e.g., [27], [6]) suggests, with the mixing weight $\alpha_{N(x_i)} \to 1$.
Algorithm 2: selection(x_0)
Input: root state node x_0
Output: a root-to-leaf path
  path ← ( ); x ← x_0
  while True do
    Append state node x to path
    N(x) ← N(x) + 1
    if x is a terminal node then
      return path
    end
    if x is expandable then
      â ← expand(x)
      y ← expand((x, â))
      Append state-action node (x, â) and leaf state node y to path
      N(x, â) ← N(x, â) + 1; N(y) ← N(y) + 1
      return path
    else
      â ← OCBASelection(x)
      Append state-action node (x, â) to path
      N(x, â) ← N(x, â) + 1
      x ← expand((x, â))
    end
  end

Algorithm 3: OCBASelection(x)
Input: state node x
  Identify â*_x = arg max_a Q̄(x, a)
  δ_x(â*_x, a) ← Q̄(x, â*_x) − Q̄(x, a), ∀ a ≠ â*_x
  Compute the new sampling allocation (Ñ(x, a_1), Ñ(x, a_2), ..., Ñ(x, a_|A_x|)) by solving Equations (5) to (7)
  â ← arg max_{a ∈ A_x} (Ñ(x, a) − N(x, a))
  return â

Algorithm 4: expand(x or (x, a))
Input: a state node x or a state-action node (x, a)
Output: child node to be added to the tree
  if the input node is a state node x then
    S ← {feasible actions of state x that have been sampled less than n_0 times}
    â ← random choice from S
    Add (x, â) to the tree if it is unvisited
    return â
  else
    Sample node (x, a) and obtain the child state node y ∼ P(x, a)(·)
    Add y to the tree if it is unvisited
    return y
  end

Algorithm 5: simulate(x)
Input: state node x
  r ← 0
  while True do
    if x is not terminal then
      find a random child state-action node (x, a) of x
      r ← r + R(x, a)
      sample a and obtain the child state node y ∼ P(x, a)(·)
      x ← y
    else
      return r
    end
  end

Algorithm 6: backpropagate(path, reward)
Input: path to a leaf node path, simulated reward reward
  for node in reversed(path) do
    Update node values through Equations (9) to (13)
  end
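For completeness, the following sketch shows how the backpropagation updates of Equations (9)-(13) could be written in Python. The node attribute names (`n_visits`, `q_bar`, `v_bar`, `v_hat`, `reward`, `children`) are our own, not the paper's, and the default smoothing weight follows the choice $\alpha_{N(x)} = 1 - 1/N(x)$ used later in the experiments.

```python
def backpropagate(path, leaf_reward, alpha=lambda n: 1.0 - 1.0 / n):
    """Update node statistics along a root-to-leaf path (Equations (9)-(13)).

    `path` alternates state nodes and state-action nodes and ends at the leaf
    state node; visit counters are assumed to have been incremented already
    during selection, and `xa.reward` is the reward recorded when the
    state-action node was traversed.
    """
    leaf = path[-1]
    # Equation (9): running average of simulated returns at the leaf state node
    leaf.v_hat += (leaf_reward - leaf.v_hat) / leaf.n_visits

    # walk backwards over (state node, state-action node, next state node) triples
    for i in range(len(path) - 3, -1, -2):
        x, xa, x_next = path[i], path[i + 1], path[i + 2]
        # Equation (10): one-step sample of Q using the freshly updated child value
        q_sample = xa.reward + x_next.v_hat
        # Equation (11): running mean of Q at the state-action node
        xa.q_bar += (q_sample - xa.q_bar) / xa.n_visits
        # Equation (12): running mean of the value along the sampled path
        x.v_bar += (xa.q_bar - x.v_bar) / x.n_visits
        # Equation (13): mix the path average with the max over child Q estimates
        a_n = alpha(x.n_visits)
        x.v_hat = (1.0 - a_n) * x.v_bar + a_n * max(c.q_bar for c in x.children)
```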
IV. ANALYSIS OF OCBA-MCTS

In this section, we first analyze mathematically how the OCBA tree policy in OCBA-MCTS balances exploration and exploitation. Then, we present several theoretical results regarding OCBA-MCTS. The proofs are given in the appendix.
A. Exploration-exploitation balance
Equations (5) to (7) determine the new sampling budget allocation. First, Eq. (5) shows that the sub-optimal state-action nodes should be sampled in proportion to their variances and inversely proportional to the squared differences between their sample means and that of the optimal state-action node. This represents a different type of trade-off between exploration (sampling actions with high variances) and exploitation (sampling actions with higher sample means) compared to bandit-based algorithms.
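As a quick numerical illustration of Equation (5) (the numbers below are made up for the example, not taken from the experiments), suppose a clearly inferior, low-noise action $a_1$ has $\sigma(x,a_1) = 1$ and $\delta_x(\hat a^*_x, a_1) = 2$, while a competitive but noisy action $a_2$ has $\sigma(x,a_2) = 2$ and $\delta_x(\hat a^*_x, a_2) = 1$. Then

$$\frac{\tilde N(x,a_2)}{\tilde N(x,a_1)} = \left(\frac{\sigma(x,a_2)/\delta_x(\hat a^*_x, a_2)}{\sigma(x,a_1)/\delta_x(\hat a^*_x, a_1)}\right)^2 = \left(\frac{2/1}{1/2}\right)^2 = 16,$$

so the competitive, high-variance action receives sixteen times the budget of the clearly inferior, low-variance one, reflecting the balance between exploration and exploitation described above.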
B. Convergence analysis
In this part, we present three theorems regarding OCBA-MCTS. The first theorem ensures that the estimate of the value-to-go function converges to the true value. The second theorem proves that OCBA-MCTS will select the correct action, i.e., that the PCS converges to 1. The last theorem guarantees that the APCS, which is a lower bound on PCS, is maximized by solving Equations (5) and (6) in each step. It is shown that at each point of the tree policy when a decision needs to be made, the action that maximizes the APCS will be selected and sampled. Therefore, the OCBA tree policy gradually maximizes the overall APCS at the root, which is a lower bound for PCS.
Theorem 1 (Asymptotic consistency). Assume the expected cumulative reward at state-action node $(x,a)$ is a normal random variable with mean $\mu(x,a)$ and variance $\sigma^2(x,a) < \infty$, i.e., $\hat{Q}(x,a) \sim \mathcal{N}(\mu(x,a), \sigma^2(x,a))$ for $0 \le i < H$. Further assume $\mu(x,a)$ is also normally distributed with unknown mean and known variance. Suppose the proposed OCBA-MCTS algorithm is run with a sampling budget $N$ at root state node $x_0$. Then at any subsequent node $x$,

$$\lim_{N \to \infty} \bar{Q}(x,a) = E[\hat{Q}(x,a)] = Q(x,a), \qquad \lim_{N \to \infty} \hat{V}(x) = V^*(x), \quad \forall x \in \mathcal{X}, \ (x,a) \in \mathcal{X} \times A_x.$$

Theorem 2 (Asymptotic correctness). Under the same assumptions as Theorem 1, the PCS converges to 1 for any state node $x \in \mathcal{X}$, i.e.,

$$P\Big[\bigcap_{a \in A_x, a \neq \hat{a}_x^*} \Big(\lim_{N \to \infty} \tilde{Q}(x, \hat{a}_x^*) - \lim_{N \to \infty} \tilde{Q}(x,a)\Big) \geq 0\Big] = 1, \quad \forall x \in \mathcal{X},$$

where $\hat{a}_x^* = \arg\max_{a \in A_x} \bar{Q}(x,a)$.

Theorem 3.
Under the same assumptions as Theorem 1, the APCS defined in Equation (4) is maximized asymptotically with the simulation budget allocation $(\tilde{N}(x,a_1), \tilde{N}(x,a_2), \ldots, \tilde{N}(x,a_{|A_x|}))$ obtained by solving Equations (5) and (6) with total budget $N$, i.e., $\sum_{a \in A_x} \tilde{N}(x,a) = N$.

Theorem 3, which follows from the result originally derived in [15], shows that at each point of the algorithm when a decision needs to be made, the action that maximizes the APCS will be selected and sampled. Therefore, the OCBA tree policy gradually maximizes the overall APCS at the root, which is a lower bound for PCS.
C. Performance lower bound
We take advantage of the normal distribution assumptions on the Q functions and provide a lower bound on PCS.

Theorem 4 (Lower bound on the probability of correct selection). Under the same assumptions as Theorem 1, the PCS at each stage and state is lower bounded by

$$\mathrm{PCS} \geq 1 - \sum_{a \in A_x, a \neq \hat a^*_x} \Phi\left(\frac{-\,\delta_x(\hat a^*_x, a)\sqrt{N(x, \hat a^*_x)}}{\sqrt{\sigma^2(x, \hat a^*_x) + \sigma(x, \hat a^*_x)\,\sigma^2(x,a)\sum_{\tilde a \in A_x, \tilde a \neq \hat a^*_x} \frac{r_x(\tilde a, a)}{\sigma(x, \tilde a)}}}\right),$$

where $\Phi(\cdot)$ is the cdf of the standard normal distribution and

$$r_x(\tilde a, a) = \left(\frac{\sigma(x,\tilde a)\,\delta_x(\hat a^*_x, a)}{\sigma(x,a)\,\delta_x(\hat a^*_x, \tilde a)}\right)^2.$$
V. NUMERICAL EXAMPLES
In this section, we evaluate our proposed OCBA-MCTS on two tree search problems against the well-known UCT [5]. The effectiveness is measured by PCS, which is estimated by the fraction of times the algorithm chooses the true optimal action. We first evaluate our algorithm on an inventory control problem with random non-normal rewards. Then we apply our algorithm to the game of tic-tac-toe.

For convenience, we restate the UCT tree policy here. At a state node $x$, the UCT policy selects the child state-action node with the highest upper confidence bound, i.e.,

$$\hat{a} = \arg\max_{a \in A_x} \Big\{\bar{Q}(x,a) + w_e \sqrt{\frac{2\ln \sum_{a' \in A_x} N(x,a')}{N(x,a)}}\Big\}, \tag{14}$$

where $w_e$ is the "exploration weight". The original UCT algorithm assumes the value function in each stage is bounded in $[0, 1]$ because it sets $w_e = 1$, whereas the support is unknown in many practical problems. Therefore, in general, $w_e$ needs to be tuned to encourage exploration.

For all experiments, we set the smoothing parameter in Equation (13) in the backpropagation phase to $\alpha_{N(x)} = 1 - \frac{1}{N(x)}$. Since initial estimates of the sample variance can be less accurate with small $n_0$, we add an initial variance $\sigma_0^2 > 0$ to the variance estimate, i.e.,

$$\sigma^2(x,a) = \frac{1}{N(x,a)} \sum_{t=1}^{N(x,a)} \big(\hat{Q}_t(x,a) - \bar{Q}(x,a)\big)^2 + \frac{\sigma_0^2}{N(x,a)},$$

where the first term is the sample variance and the second term vanishes as $N(x,a)$ grows.

A. Inventory control problem
We now evaluate the performance of OCBA-MCTS using the inventory control problem in [3]. The objective is to find the initial order quantity that minimizes the total cost over a finite horizon. At decision period $i$, we denote by $D_i$ the random demand in period $i$, $x_i = (x_i, i)$ the state node, where $x_i$ is the inventory level at the end of period $i$ (which is also the inventory at the beginning of period $i+1$), $(x_i, a_i)$ the corresponding child state-action node with $a_i$ being the order amount in period $i$, $p$ the per period per unit lost-demand penalty cost, $h$ the per period per unit inventory holding cost, $K$ the fixed (set-up) cost per order, $M$ the maximum inventory level (storage capacity), and $H$ the number of simulation stages. We set $M = 20$, initial state $x_0 = $ , $h = $ , $H = $ , $D_i \sim DU(\ ,\ )$ (discrete uniform, inclusive), and consider two different settings for $p$ and $K$:
1) Experiment 1: $p = 10$ and $K = $ ;
2) Experiment 2: $p = $ and $K = $ .
The reward function in period $i$ is defined by

$$R(x_i, a_i) = -\big(h \max\{0,\, x_i + a_i - D_i\} + p \max\{0,\, D_i - x_i - a_i\} + K\,\mathbb{1}\{a_i > 0\}\big),$$

where $\mathbb{1}$ is the indicator function, and the state transition follows $x_{i+1} = \max(0,\, x_i + a_i - D_i)$, where $a_i \in A_{x_i} = \{a \mid x_i + a \le M\}$.

For UCT, to accommodate the reward support not being $[0,1]$, we adjust the exploration weight when updating a state-action node, i.e., we set $w_e$ initially to 1, and then, in the backpropagation step, update $w_e$ by $w_e = \max(w_e, |\hat{Q}_{N(x,a)}(x,a)|)$, where $\hat{Q}_{N(x,a)}(x,a)$ is obtained from Equation (10). The initial variance $\sigma_0^2$ is set to 100. For both OCBA-MCTS and UCT, we set the number of expansions ($n_0$) to 4 for depth-1 state-action nodes (i.e., the child nodes of the root) and to 2 for all other state-action nodes in Experiment 1, and set $n_0$ to 2 for all nodes in Experiment 2. The different values of $n_0$ are due to the variance decreasing with the depth of a node, and to Experiment 2 being a relatively easier problem.
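The one-period dynamics above are simple enough to state directly in code. The sketch below is an illustrative Python rendering of the reward and transition just described; $M$ and the Experiment 1 penalty $p$ are taken from the text, while the remaining parameter values are stand-ins for the values not recoverable here, not the paper's settings. This is the kind of environment model a rollout in Algorithm 5 would sample from.

```python
import random

# M and p (Experiment 1) are from the text; h, K, and the demand range are stand-ins.
M, p, h, K = 20, 10, 1.0, 5.0
DEMAND = range(0, 10)          # stand-in for the discrete-uniform demand D_i

def step(x, a):
    """One period of the inventory MDP: return (reward, next inventory level)."""
    assert x + a <= M, "order must respect the storage capacity"
    d = random.choice(DEMAND)                 # random demand D_i
    holding = h * max(0, x + a - d)           # inventory carried over
    lost = p * max(0, d - x - a)              # lost-demand penalty
    setup = K if a > 0 else 0.0               # fixed ordering cost
    reward = -(holding + lost + setup)        # R(x_i, a_i)
    x_next = max(0, x + a - d)                # x_{i+1}
    return reward, x_next

def feasible_actions(x):
    """A_{x_i} = {a : x_i + a <= M}."""
    return range(0, M - x + 1)
```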
For both experiment settings, each algorithm is repeated 1,000 times at each simulation budget level $N$ to estimate PCS. Since Experiment 1 is a much harder problem compared to Experiment 2, more rollouts (budget) are required. Therefore, $N$ ranges from 10,000 to 20,000 and from 50 to 170 for Experiments 1 and 2, respectively.

The estimated PCS curves for both experiments are illustrated in Figure 1, where the standard error ($= \sqrt{\mathrm{PCS}(1-\mathrm{PCS})/N}$) is small and thus omitted for clarity. OCBA-MCTS achieves better PCS for both experiment setups. For Experiment 1 (optimal action $a^* = 4$) and Experiment 2 (optimal action $a^* = 0$), action $i$ denotes ordering $i$ units.

Fig. 1: The estimated PCS as a function of sampling budget achieved by UCT-MCTS and OCBA-MCTS for the inventory control problem, averaged over 1,000 runs. (a) Experiment 1; (b) Experiment 2.

Figures 2 and 3 illustrate the average number of visits, average estimated value function, and average estimated standard deviation of all child state-action nodes of the root node over 1,000 repeated runs with 20,000 and 170 rollouts for Experiment 1 and Experiment 2, respectively. Note that although the estimated standard deviation does not play a role in determining the allocation for UCT, we still plot it for reference. Both figures show that the number of visits to children nodes is, to some extent, proportional to the estimated value of the node for UCT. On the other hand, OCBA-MCTS puts more effort on the estimated optimal and second optimal actions (actions 4 and 3 for Experiment 1 and actions 0 and 1 for Experiment 2, respectively), as illustrated in Figures 2b and 3b.

In Experiment 1, where there are two competing actions with similar estimated values (actions 3 and 4, with action 4 being the optimal), OCBA-MCTS spends most of its sampling budget on those two potential actions and puts much less effort on clearly inferior actions, such as actions 6 to 14, compared to UCT. This strategy makes more sense when the objective is to identify the best action, and thus is more suitable for MCTS problems, as the ultimate goal is to make a decision. It is also interesting to note that OCBA-MCTS actually allocates slightly more visits to the competing suboptimal action than to the optimal one (mean 8486 and 8468 visits for actions 3 and 4, respectively), which will not happen in bandit-based policies, as their goal is to minimize regret, and thus they put more effort on exploiting the estimated optimal action. In Experiment 2, where the optimum is slightly easier to find, although OCBA-MCTS allocates a larger fraction of samples to suboptimal actions compared to Experiment 1, most of the samples are still allocated to the top two actions, as shown in Figure 3b, whereas UCT performs similarly to Experiment 1.
B. Tic-tac-toe
In this section, we apply OCBA-MCTS and UCT to the game of tic-tac-toe to identify the optimal move. Tic-tac-toe is a game for two players who take turns marking 'X' (Player 1) and 'O' (Player 2) on a 3 × 3 grid.

Fig. 2: Sampling distribution for Experiment 1 with N = 20,000. (a) UCT; (b) OCBA.

Player 2 makes decisions at even stages (0, 2, 4, ...) and Player 1 makes decisions at odd stages (1, 3, ...). The state transitioning is deterministic and Player 1's move is modeled using a randomized policy. We consider two different policies for Player 1:
1) Experiment 3: Player 1 plays randomly, i.e., with equal probability to mark any feasible space;
2) Experiment 4: Player 1 plays UCT.
We compare the performance of OCBA-MCTS and UCT for Player 2 in both experiments. At state node $x$, the reward function for taking action $a$ is defined according to the following rules: immediately after taking the action, if Player 2 wins the game, $R(x,a) = 1$; if it leads to a draw, $R(x,a) = $ ; and $R(x,a) = $ otherwise.

Fig. 3: Sampling distribution for Experiment 2 with N = 170. (a) UCT; (b) OCBA.

Fig. 4: Tic-tac-toe board setup. (a) Action layout; (b) Root node; (c) Optimal.

The number of expansions $n_0$ is set to 2 across all nodes for both UCT and OCBA-MCTS. Since the value function for all state-action nodes is now bounded in $[0, 1]$, we set $w_e = 1$, and $\sigma_0^2$ is set to 10. For Experiment 4, where Player 1 plays UCT, its goal is to minimize the reward; therefore, Player 1 selects the action that minimizes the lower confidence bound, i.e.,

$$\hat{a} = \arg\min_{a \in A_x} \Big\{\bar{Q}(x,a) - w_e \sqrt{\frac{2\ln \sum_{a' \in A_x} N(x,a')}{N(x,a)}}\Big\}.$$

Similar to the previous section, we plot the PCS of the two algorithms as a function of the number of rollouts, which ranges from 300 to 700 for both experiments, and the PCS is estimated over 2,000 independent runs at each rollout level. The results are shown in Figure 5, which indicates that
the proposed OCBA-MCTS produces a more accurate estimate of the optimal action compared to UCT.

Fig. 5: The estimated PCS as a function of sampling budget achieved by UCT-MCTS and OCBA-MCTS for tic-tac-toe, averaged over 2,000 runs. (a) Experiment 3: Player 1 plays randomly; (b) Experiment 4: Player 1 plays UCT.

Both experiments show that OCBA-MCTS is better at finding the optimal move when the sampling budget is relatively low. The performance of UCT and OCBA-MCTS becomes comparable when more samples become available. We also note that there is a greater performance gap between UCT and OCBA-MCTS in Experiment 3 than in Experiment 4: in Experiment 3, OCBA-MCTS achieves 10% better PCS, whereas in Experiment 4, the difference is around 5% when $N <$
500 and soon catches up as $N$ increases. This is expected, as it becomes easier to determine the optimal action when the opponent applies an AI algorithm (i.e., Player 1 has a better chance of taking its optimal action). In this case, space 4 becomes a clear optimum and therefore Player 2's UCT algorithm tends to exploit it more, which leads to better performance.

The sampling distributions for OCBA-MCTS and UCT with $N = 700$ for both experiments are shown in Figures 6 and 7. In this game, since a relatively clear optimum is available, OCBA-MCTS and UCT behave differently compared to the inventory control problem. As shown in Figures 6a and 7a, UCT spends most of the sampling budget exploiting this action, whereas OCBA will still try to explore other suboptimal actions due to its tendency to better balance exploration and exploitation.

Fig. 6: Sampling distributions for Experiment 3, averaged over 2,000 runs. (a) UCT; (b) OCBA.

In summary, the proposed OCBA-MCTS outperforms UCT in both experiments in finding the optimal action at the root. Since the objective of the proposed OCBA tree policy is to maximize PCS, it leads to a different budget allocation and better PCS.
VI. CONCLUSION AND FUTURE RESEARCH
In this paper, we present a new OCBA tree policy for MCTS. Unlike bandit-based tree policies (e.g., UCT), the new policy maximizes PCS at the root node, and in doing so, balances the exploration and exploitation trade-off differently. Furthermore, the new OCBA tree policy relaxes the assumption of known bounded support on the reward distribution, and thus makes MCTS more generally applicable.

For future research, we intend to explore the use of a batch sampling scheme in Algorithm 2, which allocates a batch of $\Delta > 1$ samples in each iteration. Another direction is to show that the proposed algorithm is $(\varepsilon, \delta)$-correct and to establish an upper bound on the sample complexity.

Fig. 7: Sampling distributions for Experiment 4, averaged over 2,000 runs. (a) UCT; (b) OCBA.

APPENDIX A
CONVERGENCE ANALYSIS
To prove that our algorithm correctly selects the optimal action as the sampling budget goes to infinity, we first prove that at each stage, the PCS converges to 1. The process of our algorithm at each single stage is OCBA, adapted from [15]. OCBA tries to identify the alternative with the highest mean from a set of normal random variables (alternatives) with means $J_i$ and known variances $\sigma_i^2$, $i = 1, 2, \ldots, k$, by efficiently allocating samples to maximize APCS. OCBA assumes that $J_i$ is also normally distributed. Here we present OCBA again in Algorithm 7 for convenience. The budget allocation process is similar to Equations (5) to (7). First define

$$\bar{J}_i := \frac{1}{l_i} \sum_{m=1}^{l_i} \hat{J}_i^m, \qquad b := \arg\max_i \bar{J}_i, \qquad \delta(b,i) := \bar{J}_b - \bar{J}_i, \quad \forall i \neq b,$$

where $l_i$ is the number of samples for alternative $i$, and $\hat{J}_i^m$ is the $m$-th sample of $J_i$ for $1 \le i \le k$, $1 \le m \le l_i$. The new allocations $(\tilde{l}_1, \tilde{l}_2, \ldots, \tilde{l}_k)$ with budget $T > \sum_i l_i$ can be obtained by solving the set of equations:

$$\frac{\tilde{l}_i}{\tilde{l}_j} = \left(\frac{\sigma_i / \delta(b,i)}{\sigma_j / \delta(b,j)}\right)^2, \quad \forall i \neq j \neq b, \tag{15}$$

$$\tilde{l}_b = \sigma_b \sqrt{\sum_{i=1, i \neq b}^{k} \frac{\tilde{l}_i^2}{\sigma_i^2}}, \tag{16}$$

$$\sum_{i=1}^{k} \tilde{l}_i = T, \tag{17}$$

where $\sigma_i$ is the standard deviation of the $i$-th reward distribution. As in Remark 2, $\sigma_i$ is assumed to be known, but in practice it can be unknown and approximated by the sample standard deviation $\hat{\sigma}_i = \sqrt{\frac{1}{l_i} \sum_{m=1}^{l_i} (\hat{J}_i^m - \bar{J}_i)^2}$.
Algorithm 7: One-stage OCBA
Input: total sampling budget T, initial sample size n_0
Output: index of the estimated optimal alternative b̂
  Sample each of the k alternatives n_0 times
  Set counters l_i ← n_0, ∀ i = 1, 2, ..., k; l ← k·n_0
  Calculate J̄_i and σ̂_i, ∀ i = 1, 2, ..., k
  while l < T do
    Compute the new budget allocation (l̃_1, l̃_2, ..., l̃_k) by solving Eqs. (15)-(17) with budget l + 1
    î ← arg max_{1 ≤ i ≤ k} (l̃_i − l_i)
    Sample alternative î once; update J̄_î (and σ̂_î if the sample variance is used)
    l_î ← l_î + 1; l ← l + 1
  end
  return b̂ = arg max_{1 ≤ i ≤ k} J̄_i
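A compact Python rendering of Algorithm 7 is sketched below. It is illustrative only: `sample_alternative` stands in for whatever simulation produces a sample $\hat J_i^m$, it reuses the `ocba_allocation` sketch from Section III for Eqs. (15)-(17), and ties or zero variances are handled only crudely.

```python
import numpy as np

def one_stage_ocba(sample_alternative, k, total_budget, n0=5):
    """Sequential one-stage OCBA: return the index of the estimated best alternative."""
    samples = [[sample_alternative(i) for _ in range(n0)] for i in range(k)]
    spent = k * n0
    while spent < total_budget:
        means = np.array([np.mean(s) for s in samples])
        stds = np.array([np.std(s) + 1e-12 for s in samples])   # guard against zero variance
        counts = np.array([len(s) for s in samples])
        alloc = ocba_allocation(means, stds, spent + 1)          # Eqs. (15)-(17)
        i_star = int(np.argmax(alloc - counts))                  # most-starving alternative
        samples[i_star].append(sample_alternative(i_star))
        spent += 1
    return int(np.argmax([np.mean(s) for s in samples]))
```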
Lemma 1. Given a set of $k$ normal random variables (actions) with means $J_i$ and variances $\sigma_i^2$, $i = 1, 2, \ldots, k$, where the $J_i$ are also normally distributed, suppose OCBA is run with sampling budget $T$. Define the PCS

$$\mathrm{PCS} = P\Big[\bigcap_{i=1, i \neq b}^{k} (\tilde{J}_b - \tilde{J}_i) \geq 0\Big],$$

where $\tilde{J}_i$ is the posterior distribution of $J_i$ given $l_i$ samples, $\forall i = 1, 2, \ldots, k$. Then $\mathrm{PCS} \to 1$ as $T \to \infty$.

Proof. The PCS can be lower bounded by APCS, i.e., by the Bonferroni inequality,

$$\mathrm{PCS} = P\Big[\bigcap_{i=1, i \neq b}^{k} (\tilde{J}_b - \tilde{J}_i) \geq 0\Big] \geq 1 - \sum_{i=1, i \neq b}^{k} P\big[\tilde{J}_b - \tilde{J}_i \leq 0\big] = \mathrm{APCS}.$$

Thus, to prove that $\mathrm{PCS} \to 1$, it suffices to prove $\mathrm{APCS} \to 1$, i.e.,

$$\sum_{i=1, i \neq b}^{k} P\big[(\tilde{J}_b - \tilde{J}_i) \leq 0\big] \to 0 \quad \text{as } T \to \infty.$$

Based on the normality assumption, the posterior distribution is also normal, i.e., $\tilde{J}_i \sim \mathcal{N}(\bar{J}_i, \sigma_i^2 / l_i)$. Thus, $\tilde{J}_b - \tilde{J}_i \sim \mathcal{N}(\bar{J}_b - \bar{J}_i,\, \sigma_b^2 / l_b + \sigma_i^2 / l_i)$. Therefore,

$$\sum_{i=1, i \neq b}^{k} P\big[(\tilde{J}_b - \tilde{J}_i) \leq 0\big] = \sum_{i=1, i \neq b}^{k} \Phi\Big(-\frac{\bar{J}_b - \bar{J}_i}{\sqrt{\sigma_b^2 / l_b + \sigma_i^2 / l_i}}\Big), \tag{18}$$

where $\Phi$ is the cdf of the standard normal distribution. Since $\sum_{i=1}^{k} l_i = T$, when $T \to \infty$, at least one of the actions will be sampled infinitely many times, i.e., there exists an index $i$ such that $l_i \to \infty$. Then there are two possible cases: $i \neq b$ and $i = b$.

Case 1: $i \neq b$. According to Eq. (15),

$$l_j = \left(\frac{\sigma_j / \delta(b,j)}{\sigma_i / \delta(b,i)}\right)^2 l_i, \quad \forall j \neq i, \ j \neq b.$$

Since $\sigma_i$ and $\delta(b,i)$ are bounded for all $i$, $l_j \to \infty$, $\forall j \neq b$. Therefore, by Eq. (16), $l_b \to \infty$. Thus, $l_i \to \infty$ for all $i = 1, 2, \ldots, k$.

Case 2: $i = b$. According to (16),

$$l_b = \sigma_b \sqrt{\sum_{i=1, i \neq b}^{k} \frac{l_i^2}{\sigma_i^2}} \to \infty.$$

Thus there exists an index $i \neq b$ such that $l_i \to \infty$. By a similar argument to Case 1, we can conclude that $l_i \to \infty$ for all $i = 1, 2, \ldots, k$.

In either case, we have $l_i \to \infty$ for all $i = 1, 2, \ldots, k$. Additionally, since $\bar{J}_b$ is defined to be the maximum of all $\bar{J}_i$, i.e., $\bar{J}_b - \bar{J}_i \geq 0$ for $i \neq b$, Eq. (18) becomes

$$\sum_{i=1, i \neq b}^{k} P\big[(\tilde{J}_b - \tilde{J}_i) \leq 0\big] = \sum_{i=1, i \neq b}^{k} \Phi\Big(-\frac{\bar{J}_b - \bar{J}_i}{\sqrt{\sigma_b^2 / l_b + \sigma_i^2 / l_i}}\Big) \to 0. \qquad \blacksquare$$
Corollary 5. Suppose one-stage OCBA is run with budget $T$. Then $\bar{J}_i \to E[J_i]$ w.p. 1 as $T \to \infty$, $\forall i = 1, 2, \ldots, k$.

The proof is a simple application of the strong law of large numbers, since $l_i \to \infty$ for all $i$. With this key lemma, we are ready to prove the first two theorems proposed in Section IV. We start with Theorem 1.

Proof of Theorem 1.
The result can be proved by induction. First observe that since $N \to \infty$, each path is explored infinitely many times. Thus the number of samples in each stage also goes to infinity as $N \to \infty$.

Suppose at some point of the algorithm, all nodes are expanded. If the current state node $x$ is at stage $H-1$, then $\hat{Q}(x,a)$ can be viewed as a set of alternatives for $a \in A_x$. From Corollary 5, it is straightforward that

$$\lim_{N \to \infty} \bar{Q}(x,a) = Q(x,a).$$

Therefore, since the reward function is bounded,

$$\lim_{N \to \infty} \hat{V}(x) = \lim_{N \to \infty} \max_{a \in A_x} \bar{Q}(x,a) = \max_{a \in A_x} \lim_{N \to \infty} \bar{Q}(x,a) = V^*(x).$$

Now suppose that the statement is true for all child state nodes $y$ of a state node $x$, i.e., $\hat{V}(y) \to V^*(y)$ for every $y$ that can be reached from $x$. Then for $x$, the algorithm again reduces to OCBA. Thus, from Corollary 5 again,

$$\lim_{N \to \infty} \bar{Q}(x,a) = \lim_{N(x,a) \to \infty} \bar{Q}(x,a) = E[R(x,a)] + E_{P(x,a)}[V^*(y)] = Q(x,a)$$

for all child state-action pairs $(x,a)$. It follows that $\lim_{N \to \infty} \hat{V}(x) = V^*(x)$. $\blacksquare$

Theorem 2 is a direct result of Lemma 1.
Proof of Theorem 2.
Since we assume $\hat{Q}(x,a)$ is normally distributed with known variance, the posterior distribution of $\hat{Q}(x,a)$, i.e., $\tilde{Q}(x,a)$, is also a normal random variable. Then it follows directly from Lemma 1 that

$$P\Big[\bigcap_{a \in A_x, a \neq \hat{a}_x^*} \Big(\lim_{N \to \infty} \tilde{Q}(x, \hat{a}_x^*) - \lim_{N \to \infty} \tilde{Q}(x,a)\Big) \geq 0\Big] = 1, \quad \forall i = 1, \ldots, H, \ x \in \mathcal{X}, \ a \in A_x. \qquad \blacksquare$$

APPENDIX B
ALLOCATION STRATEGY
Proof of Theorem 3.
The problem of maximizing APCS under the budget constraint can be formulated as

$$\max_{\tilde N(x,a),\, a \in A_x} \ 1 - \sum_{a \in A_x, a \neq \hat a^*_x} P\big[\tilde Q(x, \hat a^*_x) \leq \tilde Q(x,a)\big] \quad \text{s.t.} \quad \sum_{a \in A_x} \tilde N(x,a) = N.$$

With Lagrange multiplier $\lambda$, the Lagrangian can be written as

$$L = 1 - \sum_{a \in A_x, a \neq \hat a^*_x} P\big[\tilde Q(x, \hat a^*_x) \leq \tilde Q(x,a)\big] + \lambda\Big(\sum_{a \in A_x} \tilde N(x,a) - N\Big) = 1 - \sum_{a \in A_x, a \neq \hat a^*_x} \Phi\Big(\frac{\bar Q(x,a) - \bar Q(x, \hat a^*_x)}{\sigma_x(a, \hat a^*_x)}\Big) + \lambda\Big(\sum_{a \in A_x} \tilde N(x,a) - N\Big) = 1 - \sum_{a \in A_x, a \neq \hat a^*_x} \Phi\Big(\frac{-\delta_x(\hat a^*_x, a)}{\sigma_x(a, \hat a^*_x)}\Big) + \lambda\Big(\sum_{a \in A_x} \tilde N(x,a) - N\Big),$$

where

$$\sigma_x^2(a, \hat a^*_x) = \frac{\sigma^2(x, \hat a^*_x)}{N(x, \hat a^*_x)} + \frac{\sigma^2(x,a)}{N(x,a)}.$$

Applying the Karush-Kuhn-Tucker (KKT) conditions [28]:

- primal feasibility:
$$N(x,a) \geq 0, \quad \forall a \in A_x, \tag{19}$$
$$\sum_{a \in A_x} \tilde N(x,a) - N = 0; \tag{20}$$

- stationarity:
$$\frac{\partial L}{\partial N(x,a)} = \frac{\partial L}{\partial \big(\tfrac{-\delta_x(\hat a^*_x, a)}{\sigma_x(a, \hat a^*_x)}\big)} \cdot \frac{\partial \big(\tfrac{-\delta_x(\hat a^*_x, a)}{\sigma_x(a, \hat a^*_x)}\big)}{\partial \sigma_x(a, \hat a^*_x)} \cdot \frac{\partial \sigma_x(a, \hat a^*_x)}{\partial N(x,a)} = 0. \tag{21}$$

For $a \neq \hat a^*_x$,

$$\frac{\partial L}{\partial N(x,a)} = \frac{\sigma^2(x,a)\,\delta_x(\hat a^*_x, a)}{2 N^2(x,a)\, \sigma_x^3(a, \hat a^*_x) \sqrt{2\pi}} \exp\Big\{-\frac{\delta_x^2(\hat a^*_x, a)}{2 \sigma_x^2(a, \hat a^*_x)}\Big\} + \lambda = 0, \tag{22}$$

and for $a = \hat a^*_x$,

$$\frac{\partial L}{\partial N(x, \hat a^*_x)} = \sum_{a \in A_x, a \neq \hat a^*_x} \frac{\sigma^2(x, \hat a^*_x)\,\delta_x(\hat a^*_x, a)}{2 N^2(x, \hat a^*_x)\, \sigma_x^3(a, \hat a^*_x) \sqrt{2\pi}} \exp\Big\{-\frac{\delta_x^2(\hat a^*_x, a)}{2 \sigma_x^2(a, \hat a^*_x)}\Big\} + \lambda = 0. \tag{23}$$

From (22),

$$\frac{\delta_x(\hat a^*_x, a)}{2 \sigma_x^3(a, \hat a^*_x) \sqrt{2\pi}} \exp\Big\{-\frac{\delta_x^2(\hat a^*_x, a)}{2 \sigma_x^2(a, \hat a^*_x)}\Big\} = -\lambda \frac{N^2(x,a)}{\sigma^2(x,a)}. \tag{24}$$

Plugging Equation (24) into Equation (23) yields

$$\frac{\sigma^2(x, \hat a^*_x)}{N^2(x, \hat a^*_x)} \sum_{a \in A_x, a \neq \hat a^*_x} \lambda \frac{N^2(x,a)}{\sigma^2(x,a)} = \lambda,$$

i.e.,

$$N(x, \hat a^*_x) = \sigma(x, \hat a^*_x) \sqrt{\sum_{a \in A_x, a \neq \hat a^*_x} \frac{N^2(x,a)}{\sigma^2(x,a)}}. \tag{25}$$

After a sufficiently large number of samples, we may conclude from Equation (25) that our algorithm focuses more on sampling the sample optimum. Thus, we may assume that $N(x, \hat a^*_x) \gg N(x,a)$ for all suboptimal actions $a \in A_x$.

Now, for two suboptimal actions $a \neq \tilde a \neq \hat a^*_x$, equating (22) for $a$ and $\tilde a$ gives

$$\frac{\sigma^2(x,a)\,\delta_x(\hat a^*_x, a)}{N^2(x,a)\big(\tfrac{\sigma^2(x,\hat a^*_x)}{N(x,\hat a^*_x)} + \tfrac{\sigma^2(x,a)}{N(x,a)}\big)^{3/2}} \exp\Bigg\{-\frac{\delta_x^2(\hat a^*_x, a)}{2\big(\tfrac{\sigma^2(x,\hat a^*_x)}{N(x,\hat a^*_x)} + \tfrac{\sigma^2(x,a)}{N(x,a)}\big)}\Bigg\} = \frac{\sigma^2(x,\tilde a)\,\delta_x(\hat a^*_x, \tilde a)}{N^2(x,\tilde a)\big(\tfrac{\sigma^2(x,\hat a^*_x)}{N(x,\hat a^*_x)} + \tfrac{\sigma^2(x,\tilde a)}{N(x,\tilde a)}\big)^{3/2}} \exp\Bigg\{-\frac{\delta_x^2(\hat a^*_x, \tilde a)}{2\big(\tfrac{\sigma^2(x,\hat a^*_x)}{N(x,\hat a^*_x)} + \tfrac{\sigma^2(x,\tilde a)}{N(x,\tilde a)}\big)}\Bigg\}.$$

Applying the assumption $N(x, \hat a^*_x) \gg N(x,a)$,

$$\frac{\sigma^2(x,a)\,\delta_x(\hat a^*_x, a)}{N^2(x,a)\big(\tfrac{\sigma^2(x,a)}{N(x,a)}\big)^{3/2}} \exp\Big\{-\frac{\delta_x^2(\hat a^*_x, a)\, N(x,a)}{2 \sigma^2(x,a)}\Big\} = \frac{\sigma^2(x,\tilde a)\,\delta_x(\hat a^*_x, \tilde a)}{N^2(x,\tilde a)\big(\tfrac{\sigma^2(x,\tilde a)}{N(x,\tilde a)}\big)^{3/2}} \exp\Big\{-\frac{\delta_x^2(\hat a^*_x, \tilde a)\, N(x,\tilde a)}{2 \sigma^2(x,\tilde a)}\Big\},$$

i.e.,

$$\Big(\frac{N(x,\tilde a)}{N(x,a)}\Big)^{1/2} = \frac{\sigma(x,a)\,\delta_x(\hat a^*_x, \tilde a)}{\sigma(x,\tilde a)\,\delta_x(\hat a^*_x, a)} \exp\Big\{\frac{\delta_x^2(\hat a^*_x, a)\, N(x,a)}{2 \sigma^2(x,a)} - \frac{\delta_x^2(\hat a^*_x, \tilde a)\, N(x,\tilde a)}{2 \sigma^2(x,\tilde a)}\Big\}.$$

Taking logarithms on both sides yields

$$\frac{1}{2}\log N(x,\tilde a) - \frac{1}{2}\log N(x,a) = \log\frac{\sigma(x,a)\,\delta_x(\hat a^*_x, \tilde a)}{\sigma(x,\tilde a)\,\delta_x(\hat a^*_x, a)} + \frac{\delta_x^2(\hat a^*_x, a)\, N(x,a)}{2 \sigma^2(x,a)} - \frac{\delta_x^2(\hat a^*_x, \tilde a)\, N(x,\tilde a)}{2 \sigma^2(x,\tilde a)}.$$

When the number of samples is sufficiently large ($N \to \infty$), the logarithmic terms can be neglected compared to the terms linear in $N(x,a)$ or $N(x,\tilde a)$. Therefore, dropping the logarithmic terms yields

$$\frac{\delta_x^2(\hat a^*_x, a)\, N(x,a)}{\sigma^2(x,a)} = \frac{\delta_x^2(\hat a^*_x, \tilde a)\, N(x,\tilde a)}{\sigma^2(x,\tilde a)},$$

namely,

$$\frac{\tilde N(x,a)}{\tilde N(x,\tilde a)} = \left(\frac{\sigma(x,a)/\delta_x(\hat a^*_x, a)}{\sigma(x,\tilde a)/\delta_x(\hat a^*_x, \tilde a)}\right)^2, \quad \forall a, \tilde a \neq \hat a^*_x. \qquad \blacksquare$$

APPENDIX C
PERFORMANCE BOUND ANALYSIS
Proof of Theorem 4.
When the number of samples at node x is large, we assume that N ( x , a ) satisfies Equations (5) to (6).From Equation (5), we have N ( x , ˜ a ) = (cid:16) σ ( x , ˜ a ) δ x ( ˆ a ∗ x , a ) σ ( x , a ) δ x ( ˆ a ∗ x , ˜ a ) (cid:17) N ( x , a ) , (26) ∀ ˜ a , a (cid:54) = ˆ a ∗ x . In this way, we can express the budget allocation to anysuboptimal action ˜ a as the product of the budget allocationto a particular suboptimal action a and the factor r x ( ˜ a , a ) = (cid:16) σ ( x , ˜ a ) δ x ( ˆ a ∗ x , a ) σ ( x , a ) δ x ( ˆ a ∗ x , ˜ a ) (cid:17) . From Equation (6): N ( x , ˆ a ∗ x ) = σ ( x , ˆ a ∗ x ) (cid:115) ∑ ˜ a ∈ A x , ˜ a (cid:54) = ˆ a ∗ x ( N ( x , ˜ a ) ) σ ( x , ˜ a ) . EEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. XX, NO. Y, MONTH YEAR 14
Substituting $N(x,\tilde{a})$ from Equation (26) yields
\[
N(x,\hat{a}^*_x) = N(x,a)\,\sigma(x,\hat{a}^*_x)\sqrt{\sum_{\tilde{a} \in \mathcal{A}_x,\, \tilde{a} \neq \hat{a}^*_x} \frac{r_x^2(\tilde{a},a)}{\sigma^2(x,\tilde{a})}},
\]
i.e.,
\[
N(x,a) = \frac{N(x,\hat{a}^*_x)}{\sigma(x,\hat{a}^*_x)\sqrt{\sum_{\tilde{a} \in \mathcal{A}_x,\, \tilde{a} \neq \hat{a}^*_x} r_x^2(\tilde{a},a)/\sigma^2(x,\tilde{a})}}.
\]
Since PCS is lower bounded by APCS, and the posterior $\tilde{Q}(x,a)$ is normally distributed with
\[
\tilde{Q}(x,a) \sim \mathcal{N}\!\left(\bar{Q}(x,a),\, \frac{\sigma^2(x,a)}{N(x,a)}\right),
\]
we have
\begin{align*}
\text{PCS} \ge \text{APCS} &= 1 - \sum_{a \in \mathcal{A}_x,\, a \neq \hat{a}^*_x} P\!\left[\tilde{Q}(x,\hat{a}^*_x) \le \tilde{Q}(x,a)\right] \\
&= 1 - \sum_{a \in \mathcal{A}_x,\, a \neq \hat{a}^*_x} \Phi\!\left(\frac{\bar{Q}(x,a) - \bar{Q}(x,\hat{a}^*_x)}{\sigma_x(a,\hat{a}^*_x)}\right) \\
&= 1 - \sum_{a \in \mathcal{A}_x,\, a \neq \hat{a}^*_x} \Phi\!\left(-\frac{\delta_x(\hat{a}^*_x,a)}{\sigma_x(a,\hat{a}^*_x)}\right),
\end{align*}
where the second equality holds because $\tilde{Q}(x,a) - \tilde{Q}(x,\hat{a}^*_x)$ is normally distributed with mean $\bar{Q}(x,a) - \bar{Q}(x,\hat{a}^*_x) = -\delta_x(\hat{a}^*_x,a)$ and variance
\begin{align*}
\sigma_x^2(a,\hat{a}^*_x) &= \frac{\sigma^2(x,\hat{a}^*_x)}{N(x,\hat{a}^*_x)} + \frac{\sigma^2(x,a)}{N(x,a)} \\
&= \frac{1}{N(x,\hat{a}^*_x)}\left(\sigma^2(x,\hat{a}^*_x) + \sigma(x,\hat{a}^*_x)\,\sigma^2(x,a)\sqrt{\sum_{\tilde{a} \in \mathcal{A}_x,\, \tilde{a} \neq \hat{a}^*_x} \frac{r_x^2(\tilde{a},a)}{\sigma^2(x,\tilde{a})}}\right).
\end{align*}
Applying the inequality $\sqrt{\sum_{i=1}^n c_i^2} \le \sum_{i=1}^n \sqrt{c_i^2} = \sum_{i=1}^n c_i$ for positive numbers $c_i$ yields
\[
\sigma_x^2(a,\hat{a}^*_x) \le \frac{1}{N(x,\hat{a}^*_x)}\left(\sigma^2(x,\hat{a}^*_x) + \sigma(x,\hat{a}^*_x)\,\sigma^2(x,a)\sum_{\tilde{a} \in \mathcal{A}_x,\, \tilde{a} \neq \hat{a}^*_x} \frac{r_x(\tilde{a},a)}{\sigma(x,\tilde{a})}\right).
\]
Since APCS is decreasing in $\sigma_x^2(a,\hat{a}^*_x)$, we have
\[
\text{PCS} \ge 1 - \sum_{a \in \mathcal{A}_x,\, a \neq \hat{a}^*_x} \Phi\!\left(\frac{-\delta_x(\hat{a}^*_x,a)\sqrt{N(x,\hat{a}^*_x)}}{\sqrt{\sigma^2(x,\hat{a}^*_x) + \sigma(x,\hat{a}^*_x)\,\sigma^2(x,a)\sum_{\tilde{a} \in \mathcal{A}_x,\, \tilde{a} \neq \hat{a}^*_x} r_x(\tilde{a},a)/\sigma(x,\tilde{a})}}\right),
\]
as desired. ∎
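Once an allocation is fixed, the lower bound just derived can be evaluated numerically. The following minimal sketch (illustrative only; the function name pcs_lower_bound and the example inputs are hypothetical, and the formula follows the bound as reconstructed above) computes the right-hand side of the bound at a single node.

```python
import numpy as np
from scipy.stats import norm

def pcs_lower_bound(Q_bar, sigma2, N_alloc):
    """Evaluate the PCS lower bound of Theorem 4 at one node."""
    Q_bar = np.asarray(Q_bar, dtype=float)
    sigma2 = np.asarray(sigma2, dtype=float)
    N_alloc = np.asarray(N_alloc, dtype=float)
    best = np.argmax(Q_bar)                          # sample-optimal action a*_x
    sub = np.array([i for i in range(len(Q_bar)) if i != best])
    sigma = np.sqrt(sigma2)
    delta = Q_bar[best] - Q_bar                      # gaps delta_x(a*_x, a)
    bound = 1.0
    for a in sub:
        # r_x(a_tilde, a) = (sigma(x,a_tilde) delta_x(a*,a) / (sigma(x,a) delta_x(a*,a_tilde)))^2
        r = (sigma[sub] * delta[a] / (sigma[a] * delta[sub])) ** 2
        denom = sigma2[best] + sigma[best] * sigma2[a] * np.sum(r / sigma[sub])
        bound -= norm.cdf(-delta[a] * np.sqrt(N_alloc[best]) / np.sqrt(denom))
    return bound

# Hypothetical statistics and allocation (N = 100) for three actions.
print(pcs_lower_bound([1.0, 0.8, 0.5], [0.25, 0.16, 0.36], [60.0, 25.0, 15.0]))
```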
Yunchuan Li received a bachelor's degree in Automation from the University of Electronic Science and Technology of China (UESTC). He is currently a Ph.D. candidate in the Department of Electrical and Computer Engineering at the University of Maryland, College Park. His research focuses on optimization and control with applications to operations research problems.

Michael C. Fu (S'89-M'89-SM'06-F'08) received degrees in mathematics and EECS from MIT in 1985 and a Ph.D. in applied mathematics from Harvard University in 1989. Since 1989, he has been at the University of Maryland, College Park, where he currently holds the Smith Chair of Management Science. He also served as the Operations Research Program Director at the National Science Foundation. His research interests include simulation optimization and stochastic gradient estimation. He is a Fellow of the Institute for Operations Research and the Management Sciences (INFORMS).
Jie Xu (S'01-M'09-SM'17) received the Ph.D. degree in industrial engineering and management sciences from Northwestern University, the M.S. degree in computer science from The State University of New York, Buffalo, the M.E. degree in electrical engineering from Shanghai Jiaotong University, and the B.S. degree in electrical engineering from Nanjing University. He is currently an Associate Professor of Systems Engineering and Operations Research at George Mason University. His research interests are data analytics, stochastic simulation and optimization, with applications in cloud computing, manufacturing, and power systems.
REFERENCES

[1] Y. Li, M. C. Fu, and J. Xu, "Monte Carlo tree search with optimal computing budget allocation," in Proceedings of the 58th IEEE Conference on Decision and Control, pp. 6332–6337, Dec. 2019.
[2] D. P. Bertsekas, Dynamic Programming and Optimal Control. Athena Scientific, 1995.
[3] H. S. Chang, M. C. Fu, J. Hu, and S. I. Marcus, "An adaptive sampling algorithm for solving Markov decision processes," Operations Research, vol. 53, no. 1, pp. 126–139, 2005.
[4] P. Auer, N. Cesa-Bianchi, and P. Fischer, "Finite-time analysis of the multiarmed bandit problem," Machine Learning, vol. 47, no. 2-3, pp. 235–256, 2002.
[5] L. Kocsis and C. Szepesvári, "Bandit based Monte-Carlo planning," in Proceedings of the 17th European Conference on Machine Learning (ECML'06), Berlin, Heidelberg: Springer-Verlag, 2006, pp. 282–293.
[6] R. Coulom, "Efficient selectivity and backup operators in Monte-Carlo tree search," in Computers and Games (H. J. van den Herik, P. Ciancarini, and H. H. L. M. J. Donkers, eds.), Berlin, Heidelberg: Springer, 2007, pp. 72–83.
[7] P. Hingston and M. Masek, "Experiments with Monte Carlo Othello," pp. 4059–4064, Sept. 2007.
[8] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, 2016.
[9] P.-A. Coquelin and R. Munos, "Bandit algorithms for tree search," arXiv preprint cs/0703062, 2007.
[10] M. P. D. Schadd, "Selective search in games of different complexity," 2011.
[11] K. Teraoka, K. Hatano, and E. Takimoto, "Efficient sampling method for Monte Carlo tree search problem," IEICE Transactions on Information and Systems, vol. 97, no. 3, pp. 392–398, 2014.
[12] E. Kaufmann and W. M. Koolen, "Monte-Carlo tree search by best arm identification," in Advances in Neural Information Processing Systems 30 (I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, eds.), Curran Associates, Inc., 2017, pp. 4897–4906.
[13] J.-B. Grill, M. Valko, and R. Munos, "Blazing the trails before beating the path: Sample-efficient Monte-Carlo planning," in Advances in Neural Information Processing Systems 29 (D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, eds.), Curran Associates, Inc., 2016, pp. 4680–4688.
[14] T. L. Lai and H. Robbins, "Asymptotically efficient adaptive allocation rules," Advances in Applied Mathematics, vol. 6, pp. 4–22, Mar. 1985.
[15] C.-H. Chen, J. Lin, E. Yücesan, and S. E. Chick, "Simulation budget allocation for further enhancing the efficiency of ordinal optimization," Discrete Event Dynamic Systems, vol. 10, no. 3, pp. 251–270, 2000.
[16] C.-H. Chen, "An effective approach to smartly allocate computing budget for discrete event simulation," in Proceedings of the 34th IEEE Conference on Decision and Control, vol. 3, pp. 2598–2603, Dec. 1995.
[17] L. H. Lee, E. P. Chew, S. Teng, and D. Goldsman, "Optimal computing budget allocation for multi-objective simulation models," in Proceedings of the 2004 Winter Simulation Conference, vol. 1, p. 594, Dec. 2004.
[18] C.-H. Chen, D. He, M. Fu, and L. H. Lee, "Efficient simulation budget allocation for selecting an optimal subset," INFORMS Journal on Computing, vol. 20, no. 4, pp. 579–595, 2008.
[19] S. Zhang, L. H. Lee, E. P. Chew, J. Xu, and C.-H. Chen, "A simulation budget allocation procedure for enhancing the efficiency of optimal subset selection," IEEE Transactions on Automatic Control, vol. 61, pp. 62–75, Jan. 2016.
[20] A. Shleyfman, A. Komenda, and C. Domshlak, "On interruptible pure exploration in multi-armed bandits," 2015.
[21] S. Chen, T. Lin, I. King, M. R. Lyu, and W. Chen, "Combinatorial pure exploration of multi-armed bandits," in Advances in Neural Information Processing Systems 27 (Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, eds.), Curran Associates, Inc., 2014, pp. 379–387.
[22] S. Bubeck, R. Munos, and G. Stoltz, "Pure exploration in multi-armed bandits problems," in Algorithmic Learning Theory (R. Gavaldà, G. Lugosi, T. Zeugmann, and S. Zilles, eds.), Berlin, Heidelberg: Springer, 2009, pp. 23–37.
[23] C.-H. Chen and L. H. Lee, Stochastic Simulation Optimization: An Optimal Computing Budget Allocation, 1st ed. River Edge, NJ, USA: World Scientific Publishing Co., Inc., 2010.
[24] M. H. DeGroot, Optimal Statistical Decisions, vol. 82. John Wiley & Sons, 2005.
[25] C. B. Browne, E. Powley, D. Whitehouse, S. M. Lucas, P. I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S. Colton, "A survey of Monte Carlo tree search methods," IEEE Transactions on Computational Intelligence and AI in Games, vol. 4, pp. 1–43, Mar. 2012.
[26] C.-H. Chen, D. He, and M. Fu, "Efficient dynamic simulation allocation in ordinal optimization," IEEE Transactions on Automatic Control, vol. 51, no. 12, pp. 2005–2009, 2006.
[27] D. R. Jiang, L. Al-Kanj, and W. B. Powell, "Monte Carlo tree search with sampled information relaxation dual bounds," 2017.
[28] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.