Random-Sampling Monte-Carlo Tree Search Methods for Cost Approximation in Long-Horizon Optimal Control
Shankarachary Ragi, IEEE Senior Member, and Hans D. Mittelmann
Abstract—In this paper, we develop Monte-Carlo-based heuristic approaches to approximate the objective function in long-horizon optimal control problems. In these approaches, to approximate the expectation operator in the objective function, we evolve the system state over multiple trajectories into the future while sampling the noise disturbances at each time step, and find the average (or weighted average) of the costs along all the trajectories. We call these methods random sampling - multipath hypothesis propagation, or RS-MHP. These methods (or variants) exist in the literature; however, the literature lacks results on how well these approximation strategies converge. This paper fills this knowledge gap to a certain extent. We derive convergence results for the cost approximation error of the RS-MHP methods and discuss their convergence (in probability) as the sample size increases. We consider two case studies to demonstrate the effectiveness of our methods: a) a linear quadratic control problem; b) a UAV path optimization problem.
Index Terms—Long-horizon optimal control, cost approximation, approximate dynamic programming, multipath hypothesis propagation.
I. INTRODUCTION
Long-horizon optimal control problems appear naturally in robotics, advanced manufacturing, and economics, especially in applications requiring decision making in stochastic environments. Often these problems are solved via a dynamic programming (DP) formulation [2]. DP problems are notorious for their computational complexity and require approximation approaches to make them tractable. A plethora of approximation techniques called approximate dynamic programs (ADPs) exist in the literature to solve these problems approximately. Some of the commonly used ADPs include policy rollout [3] and hindsight optimization [4], [5]. A survey of ADP approaches can be found in [2]. Feature-based techniques and deep learning methods are gaining importance in the development of ADP approaches, as discussed in [6]. These approximation techniques have been successfully adopted to solve real-time problems such as a UAV guidance control problem in [7]–[9]. Certain ADP approaches, especially the methods based on approximation in value space, require numerical approximation of the expectation in the objective function [7]. In this study, our objective is to develop Monte-Carlo-based approaches to approximate the expectation in the objective function of long (but finite) horizon optimal control problems, and to study their convergence. A preliminary version of parts of this paper was published as [1]. This paper differs from the conference paper [1] in the following ways: 1) we include detailed proofs omitted in the conference version; 2) we derive new convergence results and proofs in Section II-A; 3) we implement our methods for a new case study, a UAV path optimization problem.

This work was supported in part by the Air Force Office of Scientific Research under grant FA9550-19-1-0070. This paper was presented in part at the 4th IEEE Conference on Control Technology and Applications, 2020 [1]. Shankarachary Ragi (corresponding author) is with the Department of Electrical Engineering, South Dakota School of Mines and Technology, Rapid City, SD 57701, USA (e-mail: [email protected]). Hans D. Mittelmann is with the School of Mathematical and Statistical Sciences, Arizona State University, Tempe, AZ 85281, USA (e-mail: [email protected]).
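The core Monte-Carlo idea developed in this paper, approximating the expected cumulative cost by averaging costs over independently sampled noise trajectories, can be sketched as follows; the dynamics f, cost g, and noise model used in the demonstration are hypothetical placeholders, not models from this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_expected_cost(f, g, x0, controls, sample_noise, n_traj=1000):
    """Monte-Carlo estimate of E[ sum_k g(x_k, u_k) ]: roll the state
    forward along n_traj independently sampled noise trajectories and
    average the cumulative costs."""
    total = 0.0
    for _ in range(n_traj):
        x, cost = x0, 0.0
        for u in controls:
            cost += g(x, u)
            x = f(x, u, sample_noise())  # one sampled disturbance per step
        total += cost
    return total / n_traj

# toy linear system and quadratic cost, purely for illustration
f = lambda x, u, w: 0.8 * x + u + w
g = lambda x, u: x ** 2 + u ** 2
est = mc_expected_cost(f, g, 1.0, [0.0] * 3, lambda: rng.normal(0.0, 0.1),
                       n_traj=5000)
```

For this toy system the expectation can be computed in closed form (about 2.076 over three steps), so the quality of the Monte-Carlo estimate is easy to check; the sections below make this sampling idea systematic and analyze its convergence.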
A. Preliminaries
A long-horizon optimal control problem is described as follows. Let $x_k$ be the state vector for a system at time $k$, which evolves according to a discrete stochastic process as follows:
$$x_{k+1} = f(x_k, u_k, w_k) \quad (1)$$
where $f(\cdot)$ represents the state-transition mapping, $u_k$ is the control vector, and $w_k$ is a random disturbance. Let $g(x_k, u_k)$ represent the cost (a real value) of being in state $x_k$ and performing action $u_k$. The functions $f$ and $g$ are independent of $k$ in our study, but can generally depend on $k$. The goal is to optimize the control vectors $u_k$, $k = 0, \ldots, H-1$, i.e., to solve
$$\min_{u_k,\; k=0,\ldots,H-1} \; E\left[\sum_{k=0}^{H-1} g(x_k, u_k)\right], \quad (2)$$
where $H$ is the length of the planning horizon. Let $x_0$ be the initial state; according to the dynamic programming formulation, the optimal cost function is given by
$$J^*(x_0) = \min_{u} E\left[g(x_0, u) + J^*(x_1)\right], \quad (3)$$
where $J^*$ represents the optimal cost-to-go from time k =
1, and $x_1 = f(x_0, u, w_0)$. In this study, long horizon refers to the condition that $H$ is sufficiently large that the optimal policy is approximately stationary (independent of $k$). Solving the above optimization problem is not tractable mainly for two reasons: the expectation $E[\cdot]$ and the optimal cost-to-go $J^*$ are hard to evaluate, and both are usually approximated by numerical methods or ADP approaches.

An ADP approach called nominal belief-state optimization (NBO) [7], [10] was developed primarily to approximate the above expectation. In NBO, the expectation is replaced by a sample state trajectory generated under the assumption that the future noise variables in the system take so-called nominal or mean values, thus making the above objective function deterministic. The NBO method was developed to solve a UAV path optimization problem, which was posed as a partially observable Markov decision process (POMDP). A POMDP generalizes the long-horizon optimal control problem described in Eq. (2) in that the system state is assumed to be "partially" observable and is inferred from noisy observations via Bayes rules. Although the performance of the NBO approach was satisfactory, in that it allowed reasonably optimal control commands to be obtained for the UAVs, it ignored the uncertainty due to noise disturbances, leading to inaccurate evaluation of the objective function. To address this challenge, certain methods exist in the literature, usually referred to as Monte-Carlo tree search (MCTS) methods, as surveyed in [11].

Inspired by the NBO method and MCTS methods, we develop a new MCTS method called random sampling - multipath hypothesis propagation (RS-MHP) and derive convergence results. In this study, we use the NBO approach as a benchmark for performance assessment since RS-MHP builds on the NBO approach.

II.
RANDOM SAMPLING MULTIPATH HYPOTHESIS PROPAGATION (RS-MHP)

In the NBO method, the expectation is replaced by a sample trajectory of the states (as opposed to random states) generated by
$$\tilde{x}_{k+1} = f(\tilde{x}_k, u_k, \bar{w}_k), \quad k = 0, 1, \ldots \quad (4)$$
where $\tilde{x}_0 = x_0$ (the initial or current state) and $\bar{w}_k$ is the mean of the random variable $w_k$. Thus, with the NBO approximation, the long-horizon optimal control problem reduces to
$$\min_{u_k} \sum_{k=0}^{H-1} g(\tilde{x}_k, u_k). \quad (5)$$
The above reduced problem, which avoids evaluating the expectation, can significantly reduce the computational burden of solving long-horizon control problems. However, the downside of this approach is that it completely ignores the uncertainty in the state evolution and may generate severely sub-optimal controls. To overcome this trivialization, we develop a Monte-Carlo approach to approximate the expectation, described as follows. We follow the tree-like sampling approach shown in Figure 1(a).

Figure 1. State trajectory sampling models: (a) tree branching model, (b) non-overlapping branching model.

For time step k =
1, we sample the probability distribution of the noise disturbance $N$ times to generate the samples $w_1^i$ with corresponding probabilities $p_1^i$, $i = 1, \ldots, N$. Using these, we generate $N$ sample states at $k = 1$:
$$x_1^i = f(x_0, u_0, w_1^i), \quad \forall i. \quad (6)$$
We repeat this sampling approach for time k =
2, i.e., we generate $N$ noise samples $w_2^i$ with corresponding probabilities $p_2^i$, $i = 1, \ldots, N$. Using these noise samples and the sample states from the previous time step, we generate $N^2$ sample states at $k = 2$:
$$x_2^{i,j} = f(x_1^i, u_1, w_2^j), \quad \forall i, j. \quad (7)$$
We repeat the above sampling procedure until the last time step $k = H-1$, which generates $N^{H-1}$ possible state evolution trajectories using the $N$ noise samples drawn at each time step, as depicted in Figure 1(a). The sampling approach in Figure 1(b) will be discussed later.

One can now replace the expectation in Eq. (2) with the weighted average of the cumulative costs corresponding to the state evolution trajectories, where the weights are the probabilities or likeliness of the trajectories. Clearly, the number of possible state trajectories grows exponentially with the horizon length $H$. Although this approach is not novel, as many such methods exist in the literature, often classified as Monte-Carlo tree search methods, our study is focused on deriving convergence results for RS-MHP approaches.

To avoid this exponential growth in our RS-MHP approach, at each time step we retain only $M$ sample states and prune the rest; if the number of sample states at a given time instance is less than or equal to $M$, we do not perform pruning. For pruning, at each time $k$, we rank the state trajectories up to time $k$ according to their likeliness (obtained by multiplying the probabilities of all the noise samples that generated the trajectory), retain the top $M$ trajectories with the highest likeliness, and prune the rest.

Figure 2. Sampling probability distributions of noise variables: NBO vs. MHP.

With this procedure, at k = H −
1, there would be only $M$ state trajectories. With pruning, the number of trajectories remains constant irrespective of the horizon length. An illustration of the above RS-MHP approach is shown in Figure 2 along with the NBO approach. Here, we consider pruning based on the likeliness of the state trajectories, as the costs from the most likely trajectories contribute more to the objective function in Eq. (2) than those from the less likely trajectories. We will consider other pruning strategies to further improve the approximation error in future work.

Let $i = 1, \ldots, M$ represent the indices of the $M$ distinct state trajectories, with $q_1, q_2, \ldots$ being their likeliness indices evaluated using the probabilities of the noise samples that generate trajectory $i$ over time. Let $J$ represent the actual objective function, described below:
$$J = E\left[\sum_{k=0}^{H-1} g(x_k, u_k)\right]. \quad (8)$$
We can now approximate the objective function $J$ in four possible ways as described below (assuming $N > M$). Let $x_k^i$ represent the state at time $k$ in the $i$th state trajectory.

(I) Sample Averaging. We can simply approximate the expectation with an average over all possible trajectories as follows:
No pruning:
$$J \approx \tilde{J}_{NP} = \frac{1}{N^{H-1}} \sum_{i=1}^{N^{H-1}} \left(\sum_{k=0}^{H-1} g(x_k^i, u_k)\right)$$
With pruning:
$$J \approx \tilde{J}_{P} = \frac{1}{M} \sum_{i=1}^{M} \left(\sum_{k=0}^{H-1} g(x_k^i, u_k)\right) \quad (9)$$

(II) Weighted Sample Averaging. We can also approximate the expectation with a weighted average, with weights being the normalized likeliness indices of the state trajectories given by $q_i$, $i = 1, \ldots$ (and $\bar{q}_i$ in the pruned case), as follows:
No pruning:
$$J \approx \bar{J}_{NP} = \frac{1}{N^{H-1}} \sum_{i=1}^{N^{H-1}} q_i \left(\sum_{k=0}^{H-1} g(x_k^i, u_k)\right)$$
With pruning:
$$J \approx \bar{J}_{P} = \frac{1}{M} \sum_{i=1}^{M} \bar{q}_i \left(\sum_{k=0}^{H-1} g(x_k^i, u_k)\right), \quad (10)$$
where $\sum_{i=1}^{N^{H-1}} q_i = N^{H-1}$ and $\sum_{i=1}^{M} \bar{q}_i = M$.

For a given sequence of control decisions $u_0, u_1, \ldots$, let $g_i$ denote the cost of the $i$th trajectory, given by
$$g_i = \sum_{k=0}^{H-1} g(x_k^i, u_k). \quad (11)$$
Clearly, $g_1, g_2, \ldots$
are identically distributed random variables, but are dependent due to the overlapping state trajectories in the tree-like sampling approach in Figure 1(a), where $E[g_i] = J$, $\forall i$.

The result below suggests that with a sufficient number of sample state trajectories (large $N$), the approximation error in $\tilde{J}_{NP}$ becomes small enough to ignore.

Proposition 2.1:
For any given sequence of actions $u_0, u_1, \ldots$, if the random variables $g_1, g_2, \ldots$ have finite variances, then $\tilde{J}_{NP}$ converges to $J$ in probability.

Proof:
From Ex. 254 in [12], we know that $\tilde{J}_{NP} \xrightarrow{P} J$ if
$$\lim_{|i-j| \to \infty} \mathrm{Cov}(g_i, g_j) = 0, \quad (12)$$
where $\mathrm{Cov}(\cdot)$ represents covariance. Suppose the sequence $g_1, g_2, \ldots$ is arranged such that $g_1$ represents the cost for the left-most branch in Figure 1(a), $g_2$ represents the second branch from the left, and so on. Clearly, the first terms $g_1, g_2, \ldots, g_N$ are dependent random variables, as they share the same parent node, whereas the next $N$ terms $g_{N+1}, g_{N+2}, \ldots, g_{2N}$, although dependent among themselves, are independent of the previous $N$ terms (as these branches evolve from a separate parent node), and so on. Thus, $\mathrm{Cov}(g_i, g_j) = 0$ for $|i-j| > N$, which implies $\lim_{|i-j| \to \infty} \mathrm{Cov}(g_i, g_j) = 0$. $\blacksquare$

Furthermore, we can apply similar arguments to prove the convergence of $\bar{J}_{NP}$ in probability.

Proposition 2.2:
For a given sequence of actions $u_0, \ldots, u_{H-1}$, if $g_1, g_2, \ldots$ have finite variances, then $\bar{J}_{NP}$ converges to $J$ in probability.

Proof:
From [13], we know that if $\tilde{J}_{NP} \xrightarrow{P} J$ (which is true, as shown in Proposition 2.1), and if the weights $q_1, q_2, \ldots$ are monotonically decreasing, then $\bar{J}_{NP} \xrightarrow{P} J$. Without loss of generality, we can arrange the trajectory costs $g_i$ such that their likeliness indices are monotonically decreasing, i.e., $q_1 \geq q_2 \geq q_3 \geq \ldots$, which completes the proof. $\blacksquare$

A. Non-overlapping State Trajectories or Tree Branches

Suppose the state sample trajectories are generated independently of each other, so that the state trajectories do not share any common state samples, as depicted in Figure 1(b). In this new sampling approach, given that $u_0, u_1, \ldots$ are the control decisions over the planning horizon, let $p_i$ represent the cost associated with the $i$th state trajectory. We can approximate the long-horizon control objective function as follows:
$$\bar{J}_N = \frac{1}{N} \sum_{i=1}^{N} p_i, \qquad \tilde{J}_N = \frac{1}{N} \sum_{i=1}^{N} q_i p_i, \quad (13)$$
where $q_i$ represents the likeliness index of the $i$th trajectory and $\sum_i q_i = N$. From Propositions 2.1 and 2.2, we can verify that $\bar{J}_N \xrightarrow{P} J$ and $\tilde{J}_N \xrightarrow{P} J$. Furthermore, since $p_1, p_2, \ldots$ are i.i.d., by the strong law of large numbers we can verify that $\bar{J}_N$ converges to $J$ almost surely. We can further derive the rate of convergence (in probability) for a special case, as discussed below. Suppose the state-transition and cost functions are linear (motivated by the fact that linear models capture the state dynamics well in many control problems), as described below:
$$x_{k+1} = A x_k + B u_k + w_k, \quad w_k \sim \mathcal{N}(0, \Sigma),$$
$$g(x_k, u_k) = C x_k + D u_k, \quad (14)$$
where $g(x_k, u_k)$ is a scalar function. The cost from sample trajectory $i$ is given by
$$p_i = \sum_{k=1}^{H} g(x_k^i, u_k) = \sum_{k=1}^{H} (C x_k^i + D u_k), \quad (15)$$
where $x_k^i$ is the sampled state at time step $k$ in the $i$th trajectory. Using the linear expressions in Eq.
14, we can verify that $p_i$ satisfies the following equation:
$$p_i - E[p_i] = C\left[\sum_{k=0}^{H-1} \left(\sum_{q=0}^{H-1-k} A^q\right) w_k\right] = C\left[\sum_{k=0}^{H-1} \bar{A}_k w_k\right], \quad (16)$$
where $\bar{A}_k = \sum_{q=0}^{H-1-k} A^q$.

Proposition 2.3:
For a given sequence of actions $u_0, \ldots, u_{H-1}$,
$$P\left(|\bar{J}_N - J| \geq \varepsilon\right) \leq \frac{\text{constant}}{N \varepsilon^2}. \quad (17)$$

Proof:
Let $p$ represent the cost of a sampled state trajectory. Using Eq. (16), we can verify that
$$\mathrm{Var}(p) = E\left[(p - E[p])^T (p - E[p])\right] = C\left[\sum_{k=0}^{H-1} \bar{A}_k \Sigma \bar{A}_k^T\right] C^T, \quad (18)$$
which is a real scalar. Thus, $\mathrm{Var}(\bar{J}_N) = \mathrm{Var}(p)/N$. Using Chebyshev's inequality, we can easily verify that
$$P\left(|\bar{J}_N - J| \geq \varepsilon\right) \leq \frac{\mathrm{Var}(p)}{N \varepsilon^2} = \frac{C\left[\sum_{k=0}^{H-1} \bar{A}_k \Sigma \bar{A}_k^T\right] C^T}{N \varepsilon^2}. \quad (19)$$
Furthermore,
$$\lim_{N \to \infty} P\left(|\bar{J}_N - J| \geq \varepsilon\right) = 0, \quad (20)$$
which shows the convergence in probability as well. $\blacksquare$

III. CASE STUDIES
We implement the above-discussed MHP methods in the context of two case studies: (a) a linear quadratic Gaussian (LQG) control problem; (b) path planning for unmanned aerial vehicles (UAVs). These case studies are discussed below.
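Before presenting the case studies, the tree sampling with likeliness-based pruning of Section II can be summarized in a short sketch; the dynamics, cost, and noise grid in the demonstration below are illustrative placeholders, not the models used in either case study.

```python
def grow_and_estimate(f, g, x0, controls, noise_samples, noise_probs, M):
    """RS-MHP sketch (Section II): branch every surviving trajectory over
    the noise samples at each step, rank trajectories by likeliness (the
    product of the sample probabilities along the path), keep the top M,
    and return the pruned sample-average cost approximation of Eq. (9)."""
    trajs = [(1.0, [x0])]  # (likeliness, state history)
    for k, u in enumerate(controls):
        grown = [(q * p, hist + [f(hist[-1], u, w)])
                 for q, hist in trajs
                 for w, p in zip(noise_samples[k], noise_probs[k])]
        grown.sort(key=lambda t: -t[0])   # most likely trajectories first
        trajs = grown[:M]                 # prune only beyond the top M
    costs = [sum(g(x, u) for x, u in zip(hist, controls))
             for _, hist in trajs]
    return sum(costs) / len(costs)

# toy random-walk illustration: two noise samples (+1/-1) per step
step = lambda x, u, w: x + u + w
cost = lambda x, u: x
approx = grow_and_estimate(step, cost, 0.0, [0.0, 0.0],
                           noise_samples=[[1.0, -1.0]] * 2,
                           noise_probs=[[0.7, 0.3]] * 2, M=4)
```

With M = 4 no trajectory is pruned and all four paths of the depth-2 tree are averaged; shrinking M keeps only the most likely branches, trading approximation accuracy for computation as discussed in Section II.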
A. Linear Quadratic Problem
Although there are closed-form solutions for LQG problems, the example below allows us to quantify the benefits of using RS-MHP methods over existing similar methods, particularly NBO. Let the system state evolve according to the following linear equation:
$$x_{k+1} = (1-a) x_k + a u_k + w_k, \quad w_k \sim \mathcal{N}(0, \sigma^2), \quad (21)$$
where $0 < a < 1$ and $w_k$ is a random disturbance modeled by a zero-mean Gaussian distribution with variance $\sigma^2$. The cost function over the time horizon $H$ is defined as follows:
$$J = E\left[r (x_H - T)^2 + \sum_{k=0}^{H-1} u_k^2\right], \quad (22)$$
where $r$ and $T$ are constants. This is a simplified oven temperature control example borrowed from [14]. If we apply the traditional NBO method, assuming H =
2, the cost function $J$ is approximated (assuming nominal values, i.e., zeros, for $w_0$ and $w_1$) as
$$J_{NBO} = r\left((1-a)^2 x_0 + a(1-a) u_0 + a u_1 - T\right)^2 + u_0^2 + u_1^2 \quad (23)$$
and the exact cost function $J$ can be evaluated analytically as
$$J = r\left((1-a)^2 x_0 + a(1-a) u_0 + a u_1 - T\right)^2 + u_0^2 + u_1^2 + r \sigma^2 \left((1-a)^2 + 1\right). \quad (24)$$
We notice that the approximation error due to the NBO method is
$$|J_{NBO} - J| = r \sigma^2 \left((1-a)^2 + 1\right). \quad (25)$$
This approximation error for a generic time horizon $H$ (the above error term is derived for H =
2) is given by
$$|J_{NBO} - J| = r \sigma^2 \sum_{n=0}^{H-1} (1-a)^{2n}. \quad (26)$$
The above expression suggests that the NBO approximation error can be significantly high depending on the parameters $a$, $\sigma^2$, and $r$. With the MHP approximation, the cost function reduces to
$$J_{MHP} = \frac{1}{P} \left(\sum_{i=1}^{P} r (x_H^i - T)^2\right) + \sum_{k=0}^{H-1} u_k^2, \quad (27)$$
where $P$ is the number of state trajectories generated using the MHP approach, and $x_H^i$ is the final state in the $i$th trajectory. The convergence results in Section II show that the approximation error due to the above MHP method converges (in probability) to zero. We verify this result with a numerical simulation, in which we implement the NBO and MHP methods with fixed values of $x_0$, $r$, $T$, $H$, $u_0$, $u_1$, and $\sigma^2$, varying $P$ from 100 to 10000 with increments of 100. Figure 3 shows the cost function approximated using the MHP and NBO methods. The figure clearly demonstrates that the error due to the NBO approximation can be significantly high, while MHP performs better in cost approximation.
Figure 3. LQG problem: MHP vs. NBO
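The H = 2 comparison in Eqs. (23)-(25) is easy to reproduce numerically; the parameter values below are illustrative choices, not the simulation settings behind Figure 3.

```python
import numpy as np

rng = np.random.default_rng(1)

# illustrative parameters for the oven example of Eqs. (21)-(22)
a, sigma2, r, T, x0 = 0.7, 1.0, 1.0, 5.0, 0.0
u0, u1 = 1.0, 1.0

# NBO: evaluate the cost along the nominal (zero-noise) trajectory, Eq. (23)
xH_nom = (1 - a) ** 2 * x0 + a * (1 - a) * u0 + a * u1
J_nbo = r * (xH_nom - T) ** 2 + u0 ** 2 + u1 ** 2

# exact cost, Eq. (24): NBO misses exactly the variance term of Eq. (25)
J_exact = J_nbo + r * sigma2 * ((1 - a) ** 2 + 1)

# MHP: average the terminal cost over P independent 2-step trajectories, Eq. (27)
P = 200_000
w0 = np.sqrt(sigma2) * rng.standard_normal(P)
w1 = np.sqrt(sigma2) * rng.standard_normal(P)
x1 = (1 - a) * x0 + a * u0 + w0
x2 = (1 - a) * x1 + a * u1 + w1
J_mhp = r * np.mean((x2 - T) ** 2) + u0 ** 2 + u1 ** 2
```

Running this sketch, J_mhp lands close to J_exact while J_nbo is off by the constant r * sigma2 * ((1-a)^2 + 1), mirroring the gap between the NBO and MHP curves in Figure 3.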
B. UAV path planning problem
We consider a UAV path planning problem, where the goal is to optimize the kinematic controls of a UAV to maximize a target tracking performance measure. Here, the UAV is assumed to be equipped with an on-board sensor that generates location measurements of the target (a ground-based moving vehicle) corrupted by random noise. A detailed description of the problem can be found in [7]. In [7], we posed this problem as a partially observable Markov decision process (POMDP), where the POMDP led to solving a long-horizon optimal control problem, and we applied the NBO approach to solve the above POMDP. The resulting UAV path optimization problem is summarized as follows:
$$\min_u E\left[\sum_{k=0}^{H-1} \mathrm{tr}(P_k(u))\right] \;\xrightarrow{\text{NBO approx.}}\; \min_u \sum_{k=0}^{H-1} \mathrm{tr}(\hat{P}_k(u)),$$
where $P_k(u)$ (a random variable) represents the error covariance matrix corresponding to the state of the system, $\mathrm{tr}(\cdot)$ represents the matrix trace operator, and $u$ is the sequence of UAV kinematic controls (e.g., forward acceleration and bank angle) applied over the discrete-time planning horizon of length $H$ steps. After the NBO approximation, the expectation over the random evolution of $P_k(u)$ is replaced with the nominal sequence of state covariance matrices $\hat{P}_k(u)$.

We now approximate the above objective function using the RS-MHP approach as follows:
$$\min_u E\left[\sum_{k=0}^{H-1} \mathrm{tr}(P_k(u))\right] \;\xrightarrow{\text{RS-MHP approx.}}\; \min_u \frac{1}{N_T}\sum_{i=1}^{N_T}\sum_{k=0}^{H-1} \mathrm{tr}(\tilde{P}_k^i(u)),$$
where $\tilde{P}_k^i$ represents the state covariance matrix obtained from the $i$th state trajectory generated by the RS-MHP approach, and $N_T$ is the number of state trajectories.

We implement this RS-MHP approach in MATLAB and run a Monte-Carlo study to see the impact of $N_T$ on the performance of the above UAV path planning algorithm, which is measured by the average target location estimation error. Figure 4 shows the cumulative distribution of average target location estimation errors from the RS-MHP approach with H =
6, and with $N_T$ set to 50, 100, and 250. The figure shows a gradual increase in the UAV path optimization performance with increasing $N_T$, as expected. This result also suggests that the pruning methods (discussed in the previous section) would degrade the performance of the RS-MHP methods but can provide gains in terms of computational intensity.

RS-MHP has a better capability of approximating the expectation operator in Eq. (2) than the NBO approach, as we consider multiple hypotheses of state trajectories in RS-MHP as opposed to a single hypothesis in NBO, as demonstrated in Figure 5 and in the above case studies.

IV. CONCLUSIONS
In this paper, we developed a Monte-Carlo tree search method called random sampling - multipath hypothesis propagation, or RS-MHP, to approximate the expectation operator in long-horizon optimal control problems. Although variants of these methods exist in the literature, we focused on the convergence analysis of these approximation methods. The basic theme of these methods
Figure 4. Cumulative distribution of average target location errors. Here $N_T$ represents the number of state evolution trajectories.
Figure 5. Cumulative distribution of average target location errors: NBO vs. RS-MHP (N = 800).

is to evolve the system state over multiple trajectories into the future while sampling the noise disturbances at each time step. We derived convergence results showing that the cost approximation errors of our RS-MHP methods converge (in probability) toward zero as the sample size increases. We conducted a numerical study to assess the performance of our methods in two case studies: a linear quadratic control problem and a UAV path optimization problem. In both case studies, we demonstrated the benefits of our approach against an existing approach called nominal belief-state optimization, or NBO (used as a benchmark).

V. ACKNOWLEDGMENT
The authors would like to thank Nicolas Lanchier, Arizona State University, for his valuable inputs and feedback on the convergence results discussed in this paper.

REFERENCES

[1] S. Ragi and H. D. Mittelmann, "Random-sampling multipath hypothesis propagation for cost approximation in long-horizon optimal control," in Proc. 2020 IEEE Conference on Control Technology and Applications (CCTA), Montreal, Canada, 2020, pp. 14–18.
[2] E. K. P. Chong, C. M. Kreucher, and A. O. Hero, "Partially observable Markov decision process approximations for adaptive sensing," Discrete Event Dynamic Systems, vol. 19, no. 3, pp. 377–422, Sep. 2009.
[3] D. P. Bertsekas and D. A. Castanon, "Rollout algorithms for stochastic scheduling problems," J. Heuristics, vol. 5, pp. 89–108, 1999.
[4] E. K. P. Chong, R. L. Givan, and H. S. Chang, "A framework for simulation-based network control via hindsight optimization," in Proc. 39th IEEE Conf. Decision and Control, Sydney, Australia, 2000, pp. 1433–1438.
[5] G. Wu, E. K. P. Chong, and R. Givan, "Burst-level congestion control using hindsight optimization," IEEE Trans. Autom. Control, vol. 47, pp. 979–991, 2002.
[6] D. Bertsekas, "Feature-based aggregation and deep reinforcement learning: A survey and some new implementations," IEEE/CAA Journal of Automatica Sinica, no. 1, 2019.
[7] S. Ragi and E. K. P. Chong, "UAV path planning in a dynamic environment via partially observable Markov decision process," IEEE Trans. Aerosp. Electron. Syst., vol. 49, pp. 2397–2412, 2013.
[8] ——, "Dynamic UAV path planning for multitarget tracking," in Proc. American Control Conf., Montreal, Canada, 2012, pp. 3845–3850.
[9] S. Ragi and H. D. Mittelmann, "Mixed-integer nonlinear programming formulation of a UAV path optimization problem," in Proc. American Control Conf., Seattle, WA, 2017, pp. 406–411.
[10] S. Miller, Z. Harris, and E. K. P. Chong, "A POMDP framework for coordinated guidance of autonomous UAVs for multitarget tracking," EURASIP Journal on Advances in Signal Processing, 2009.
[11] C. B. Browne, E. Powley, D. Whitehouse, S. M. Lucas, P. I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S. Colton, "A survey of Monte Carlo tree search methods," IEEE Transactions on Computational Intelligence and AI in Games, vol. 4, no. 1, pp. 1–43, Mar. 2012.
[12] T. Cacoullos, Exercises in Probability. New York: Springer-Verlag, 1989.
[13] N. Etemadi, "Convergence of weighted averages of random variables revisited," Proc. American Mathematical Society.
[14] D. P. Bertsekas, lecture slides on reinforcement learning and optimal control, ∼dimitrib/Slides Lecture2 RLOC.pdf.