Optimal control as a graphical model inference problem
Hilbert J. Kappen · Vicenç Gómez · Manfred Opper

Abstract
We reformulate a class of non-linear stochastic optimal control problems introduced by Todorov (2007) as a Kullback-Leibler (KL) minimization problem. As a result, the optimal control computation reduces to an inference computation, and approximate inference methods can be applied to efficiently compute approximate optimal controls. We show how this KL control theory contains the path integral control method as a special case. We provide an example of a block stacking task and a multi-agent cooperative game where we demonstrate how approximate inference can be successfully applied to instances that are too complex for exact computation. We discuss the relation of the KL control approach to other inference approaches to control.
Keywords optimal control · uncontrolled dynamics · Kullback-Leibler divergence · graphical model · approximate inference · cluster variation method · belief propagation

Hilbert J. Kappen, Donders Institute for Brain, Cognition and Behaviour, Radboud University Nijmegen, 6525 EZ Nijmegen, The Netherlands. E-mail: [email protected]

Vicenç Gómez, Donders Institute for Brain, Cognition and Behaviour, Radboud University Nijmegen, 6525 EZ Nijmegen, The Netherlands. E-mail: [email protected]

Manfred Opper, Department of Computer Science, TU Berlin, D-10587 Berlin, Germany. E-mail: [email protected]
1 Introduction

Stochastic optimal control theory deals with the problem of computing an optimal set of actions to attain some future goal. With each action and each state a cost is associated, and the aim is to minimize the total future cost. Examples are found in many contexts, such as motor control tasks for robotics, planning and scheduling tasks, or managing a financial portfolio. The computation of the optimal control is typically very difficult due to the size of the state space and the stochastic nature of the problem.

The most common approach to compute the optimal control is through the Bellman equation. For the finite horizon discrete time case, this equation results from a dynamic programming argument that expresses the optimal cost-to-go (or value function) at time t in terms of the optimal cost-to-go at time t+1. For the infinite horizon case, the value function is independent of time and the Bellman equation becomes a recursive equation. In continuous time, the Bellman equation becomes a partial differential equation.

For high dimensional systems or for continuous systems the state space is huge and the above procedure cannot be directly applied. A common approach to make the computation tractable is a function approximation approach, in which the value function is parameterized in terms of a number of parameters (Bertsekas and Tsitsiklis, 1996). Another promising approach is to exploit the graphical structure that is present in the problem to make the computation more efficient (Boutilier et al., 1995; Koller and Parr, 1999). However, this graphical structure is in general not inherited by the value function, and thus the graphical representation of the value function may not be appropriate.

In this paper, we introduce a class of stochastic optimal control problems where the control is expressed as a probability distribution p over future trajectories given the current state, and where the control cost can be written as a Kullback-Leibler (KL) divergence between p and some interaction terms. The optimal control is given by minimizing the KL divergence, which is equivalent to solving a probabilistic inference problem in a dynamic Bayesian network. The optimal control is given in terms of (marginals of) a probability distribution over future trajectories. The formulation of the control problem as an inference problem directly suggests exact inference methods such as the junction tree (JT) method (Lauritzen and Spiegelhalter, 1988), or a number of well-known approximation methods, such as the variational method (Jordan, 1999), belief propagation (BP) (Murphy et al., 1999), the cluster variation method (CVM) or generalized belief propagation (GBP) (Yedidia et al., 2001), or Markov chain Monte Carlo (MCMC) sampling methods. We refer to this class of problems as KL control problems.

The class of control problems considered in this paper is identical to that of Todorov (2007, 2008, 2009), who shows that the Bellman equation can be written as a KL divergence of probability distributions between two adjacent time slices, and that the Bellman equation computes backward messages in a chain as if it were an inference problem. The novel contribution of the present paper is to identify the control cost with a KL divergence directly, instead of making this identification in the Bellman equation.
The immediate consequence is that the optimal control problem is identical to a graphical model inference problem that can be approximated using standard methods.

We also show how KL control reduces to the previously proposed path integral control problem (Kappen, 2005) when the noise is Gaussian, in the limit of continuous space and time. This class of control problems has been applied to multi-agent problems using a graphical model formulation and junction tree inference in Wiegerinck et al. (2006, 2007), and approximate inference in van den Broek et al. (2008b,a). In robotics, Theodorou et al. (2009, 2010a,b) have shown that the path integral method has great potential for application. They have compared the path integral method with some state-of-the-art reinforcement learning methods, showing very significant improvements. In addition, they have successfully implemented the path integral control method on a walking robot dog. The path integral approach has recently been applied to the control of character animation (da Silva et al., 2009).

2 KL control theory

Let x = 1, ..., N be a finite set of states, and let x_t denote the state at time t. Denote by p_t(x_{t+1}|x_t, u_t) the Markov transition probability at time t under control u_t from state x_t to state x_{t+1}. Let p(x_{1:T}|x_0, u_{0:T-1}) denote the probability to observe the trajectory x_{1:T} given the initial state x_0 and the control trajectory u_{0:T-1}.

If the system at time t is in state x and takes action u to state x', there is an associated cost \hat{R}(x, u, x', t). The control problem is to find the sequence u_{0:T-1} that minimizes the expected future cost

C(x_0, u_{0:T-1}) = \sum_{x_{1:T}} p(x_{1:T}|x_0, u_{0:T-1}) \sum_{t=0}^{T} \hat{R}(x_t, u_t, x_{t+1}, t) = \left\langle \sum_{t=0}^{T} \hat{R}(x_t, u_t, x_{t+1}, t) \right\rangle    (1)

with the convention that \hat{R}(x_T, u_T, x_{T+1}, T) = R(x_T, T) is the cost of the final state, and \langle \cdot \rangle denotes expectation with respect to p. Note that C depends on u in two ways: through \hat{R} and through the probability distribution of the controlled trajectories p(x_{1:T}|x_0, u_{0:T-1}).

The optimal control is normally computed using the Bellman equation, which results from a dynamic programming argument (Bertsekas and Tsitsiklis, 1996). Instead, we will consider the restricted class of control problems for which C in Equation (1) can be written as a KL divergence. As a particular case, we consider \hat{R} to be the sum of a control dependent term and a state dependent term. We further assume the existence of a 'free' (uncontrolled) dynamics q_t(x_{t+1}|x_t), which can be any first order Markov process that assigns zero probability to physically impossible state transitions.
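To make Equation (1) concrete, the expected cost can be evaluated by propagating the state distribution forward rather than enumerating all trajectories. The following is a minimal sketch under the assumption of a small finite state space; the function and argument names are ours, not the paper's.

```python
import numpy as np

def expected_cost(p_u, R_hat, R_final, x0):
    """Expected future cost C(x_0, u_{0:T-1}) of Equation (1).

    p_u:     list of T transition matrices; p_u[t][x, y] = p_t(x_{t+1}=y | x_t=x, u_t),
             with the chosen controls baked into the matrices.
    R_hat:   list of T matrices; R_hat[t][x, y] = Rhat(x_t=x, u_t, x_{t+1}=y, t).
    R_final: vector of final state costs R(x_T, T).
    """
    mu = np.zeros(len(R_final))
    mu[x0] = 1.0                          # state distribution at time t = 0
    C = 0.0
    for P, Rh in zip(p_u, R_hat):
        C += mu @ (P * Rh).sum(axis=1)    # adds <Rhat(x_t, u_t, x_{t+1}, t)>
        mu = mu @ P                       # propagate the distribution one step
    return C + mu @ R_final               # add the final state cost R(x_T, T)

# Example: two states, one step, stay-put dynamics, unit step cost.
P = np.eye(2); Rh = np.ones((2, 2))
print(expected_cost([P], [Rh], np.zeros(2), x0=0))   # -> 1.0
```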
We quantify the control cost as the amount of deviation between p_t(x_{t+1}|x_t, u_t) and q_t(x_{t+1}|x_t) in the KL sense. Thus,

\hat{R}(x_t, u_t, x_{t+1}, t) = \log \frac{p_t(x_{t+1}|x_t, u_t)}{q_t(x_{t+1}|x_t)} + R(x_t, t), \quad t = 0, \ldots, T-1    (2)

with R(x, t) an arbitrary state dependent cost. Equation (1) becomes

C(x_0, p) = KL(p \| \psi) = \sum_{x_{1:T}} p(x_{1:T}|x_0) \log \frac{p(x_{1:T}|x_0)}{\psi(x_{1:T}|x_0)} = KL(p \| q) + \langle R \rangle    (3)

\psi(x_{1:T}|x_0) = q(x_{1:T}|x_0) \exp\left( -\sum_{t=0}^{T} R(x_t, t) \right)    (4)

Note that C depends on the control u only through p. Thus, minimizing C with respect to u yields 0 = dC/du = (dC/dp)(dp/du), where the minimization with respect to p is subject to the normalization constraint \sum_{x_{1:T}} p(x_{1:T}|x_0) = 1. Therefore, a sufficient condition for the optimal control is to set dC/dp = 0. The result of this KL minimization is well known and yields the "Boltzmann distribution"

p(x_{1:T}|x_0) = \frac{1}{Z(x_0)} \psi(x_{1:T}|x_0)    (5)

and the optimal cost

C(x_0, p) = -\log Z(x_0) = -\log \sum_{x_{1:T}} q(x_{1:T}|x_0) \exp\left( -\sum_{t=0}^{T} R(x_t, t) \right)    (6)

where Z(x_0) is a normalization constant (see Appendix A). In other words, the optimal control solution is the (normalized) product of the free dynamics and the exponentiated costs. It is a distribution that avoids states of high R while at the same time deviating from q as little as possible. Note that since q is a first order Markov process, p in Equation (5) is a first order Markov process as well.

The optimal control in the current state x_0 at the current time t = 0 is given by the marginal probability

p(x_1|x_0) = \sum_{x_{2:T}} p(x_{1:T}|x_0)    (7)

This is a standard graphical model inference problem, with p given by Equation (5). Since \psi is a chain, we can compute p(x_1|x_0) by backward message passing:

\beta_T(x_T) = 1
\beta_t(x_t) = \sum_{x_{t+1}} \psi_t(x_t, x_{t+1}) \beta_{t+1}(x_{t+1})
p(x_{t+1}|x_t) \propto \psi_t(x_t, x_{t+1}) \beta_{t+1}(x_{t+1})

Fig. 1 Overview of the approaches to computing the optimal control. (Top left) The general optimal control problem is formulated as a state transition model p that depends on the control (or policy) u, and a cost C(u) that is the expected \hat{R} with respect to the controlled dynamics p. The optimal control is given by the u that minimizes the cost C(u). (Top right) The traditional approach is to introduce the notion of cost-to-go or value function J, which satisfies the Bellman equation. The Bellman equation is derived using a dynamic programming argument. (Bottom right) For large problems, an approximate representation of J is used to solve the Bellman equation, which yields the optimal control. (Bottom left) The approach in this paper is to consider a class of control problems for which C is written as a KL divergence. The computation of the optimal control (the optimal p) becomes a statistical inference problem that can be approximated using standard approximate inference methods.

The interpretation of the Bellman equation as message passing for KL control problems was first established in Todorov (2008). The difference between the KL control computation and the standard computation using the Bellman equation is schematically illustrated in Figure 1.

The optimal cost, Equation (6), is minus the log partition sum and is the expectation value of the exponentiated state costs \sum_{t=0}^{T} R(x_t, t) under the uncontrolled dynamics q. This is a surprising result, because it means that we have a closed form solution for the optimal cost-to-go C(x_0, p) in terms of the known quantities q and R.
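Because \psi is a chain, the backward recursion above is a few lines of code. The sketch below assumes a small discrete state space and time independent uncontrolled dynamics; the function name and array layout are our own choices, not part of the paper.

```python
import numpy as np

def kl_control(q, R, T):
    """Optimal controlled transitions p(x_{t+1} | x_t) by backward message passing.

    q: (N, N) uncontrolled transition matrix, q[x, y] = q(x_{t+1}=y | x_t=x).
    R: (T+1, N) state costs R(x, t).
    Implements beta_T = 1 and beta_t = sum_{x_{t+1}} psi_t beta_{t+1}, with the
    chain factor psi_t(x_t, x_{t+1}) = q(x_{t+1}|x_t) exp(-R(x_{t+1}, t+1)).
    """
    N = q.shape[0]
    beta = np.ones(N)                          # beta_T(x_T) = 1
    p = [None] * T
    for t in reversed(range(T)):
        psi = q * np.exp(-R[t + 1])[None, :]   # chain factor psi_t
        unnorm = psi * beta[None, :]           # proportional to p(x_{t+1}|x_t)
        beta = unnorm.sum(axis=1)              # beta_t(x_t); also the normalizer
        p[t] = unnorm / beta[:, None]
    return p

# Three states, uniform free dynamics, state 0 is cheap at every time step.
q = np.full((3, 3), 1.0 / 3)
R = np.zeros((6, 3)); R[:, 1:] = 1.0
print(kl_control(q, R, T=5)[0])                # p(x_1 | x_0), biased toward state 0
```

The first marginal p[0] is the optimal control of Equation (7), and the final value of beta is beta_0(x_0), whose negative logarithm recovers the optimal cost of Equation (6) up to the constant R(x_0, 0) term.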
A result of this type was previously obtained in Kappen (2005) for a class of continuous non-linear stochastic control problems. Here, we show that a slight generalization of this problem (g_{ia}(x, t) = 1 in Kappen (2005)) is obtained as a special case of the present KL control formulation. Let x denote an n-dimensional real vector with components x_i. We define the stochastic dynamics

dx_i = f_i(x, t) dt + \sum_a g_{ia}(x, t) (u_a dt + d\xi_a)    (8)

with f_i an arbitrary function, d\xi an m-dimensional Gaussian process with covariance matrix \langle d\xi_a d\xi_b \rangle = \nu_{ab} dt, and u an m-dimensional control vector.
The distribution over trajectories is given by

p(x_{dt:T}|x_0, u_{0:T-dt}) = \prod_{s=0}^{T-dt} \mathcal{N}(x_{s+dt} \,|\, x_s + (f_s + g_s u_s) dt, \; g_s \nu g_s^T dt)    (9)

with f_t = f(x_t, t), and the distribution over trajectories under the uncontrolled dynamics is defined as q(x_{dt:T}|x_0) = p(x_{dt:T}|x_0, u_{0:T-dt} = 0).

For this particular choice of p and q, the control cost in Equation (3) becomes (see Appendix B for a derivation)

C(x, u(t \to T)) = \left\langle \phi(x(T)) + \int_t^T ds \left( \frac{1}{2} u(x(s), s)^T \nu^{-1} u(x(s), s) + R(x(s), s) \right) \right\rangle    (10)

where \langle \cdot \rangle denotes expectation with respect to the controlled dynamics p, where the sums become integrals, and where we have defined \phi(x) = R(x, T).

Equations (8) and (10) define a stochastic optimal control problem. The solution for the optimal cost-to-go for this class of control problems can be shown to be given as a so-called path integral, an integral over trajectories, which is the continuous time equivalent of the sum over trajectories in Equation (6). Note that the cost of control is quadratic in u, but of a particular form, with the matrix \nu^{-1}, in agreement with Kappen (2005). Thus, KL control theory contains the path integral control method as a particular limit. As is shown in Kappen (2005), this class of problems admits a solution of the optimal cost-to-go as an integral over paths, which is similar to Equation (6).

2.1 Graphical model inference

In typical control problems, x has a modular structure with components x = x_1, ..., x_n. For instance, for a multi-joint arm, x_i may denote the state of each joint. For a multi-agent system, x_i may denote the state of each agent. In all such examples, x_i itself may be a multi-dimensional state vector. In such cases, the optimal control computation, Equation (7), is intractable. However, the following assumptions are likely to be true:

– The uncontrolled dynamics factorizes over components,

q_t(x_{t+1}|x_t) = \prod_{i=1}^{n} q_{ti}(x_{t+1,i}|x_{t,i}).

– The interaction between components has a (sparse) graphical structure, R(x, t) = \sum_\alpha R_\alpha(x_\alpha, t), with \alpha a subset of the indices 1, ..., n and x_\alpha the corresponding variables.
Fig. 2 Block stacking problem: the objective can be (but is not restricted to) to stack the initial block configuration (left) into a single stack (right) through a sequence of single block moves to adjacent positions (middle).
Typical examples are multi-agent systems and robot arms. In both cases the dynamics of the individual components (the individual agents and the different joints, respectively) are independent a priori. It is only through the execution of the task that the dynamics become coupled.

Thus, \psi in Equation (4) has a graphical structure that we can exploit when computing the marginals in Equation (7). For instance, one may use the junction tree (JT) method, which can be more efficient than simply using the backward messages. Alternatively, we can use any of a large number of approximate graphical model inference methods to compute the optimal control. In the following sections, we illustrate this idea by applying several approximate inference algorithms to two different tasks.

3 KL-blocks-world

Consider the example of piling blocks into a tower. This is a classic AI planning task (Russell et al., 1996). It will be instructive to see how a variant of this problem is solved as a stochastic control problem. As we will see, the optimal control solution will in general be a mixture over several actions. We define the KL-blocks-world problem in the following way: let there be n possible block locations on a one dimensional ring (a line with periodic boundaries), as in Figure 2, and let x_{ti} \geq 0, i = 1, ..., n, t = 0, ..., T denote the height of stack i at time t. Let m be the total number of blocks.

At iteration t, we allow one block to be moved from location k_t to a neighboring location k_t + l_t, with l_t = -1, 0, +1. Given k_t, l_t and the old state x_{t-1}, the new state is given by

x_{t, k_t} = x_{t-1, k_t} - 1, \qquad x_{t, k_t + l_t} = x_{t-1, k_t + l_t} + 1    (12)

with all other stacks unaltered. We use the uncontrolled distribution q to implement these allowed moves. For the purpose of memory efficiency, we introduce auxiliary variables s_{ti} = -1, 0, +1 that indicate whether x_i is decremented, unchanged, or incremented, respectively.
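Sampling from these allowed moves is straightforward. The sketch below is our own illustration (the formal definition of q via the auxiliary variables follows next); we resolve physically impossible moves by leaving the state unchanged, which is one of several conventions consistent with assigning them zero probability.

```python
import random

def uncontrolled_step(x):
    """One sample from the uncontrolled blocks-world dynamics q(x_t | x_{t-1}).

    x: stack heights on a ring of n locations. Draw k uniformly from the n
    locations and l uniformly from {-1, 0, +1}; move one block from k to k + l.
    """
    n = len(x)
    k = random.randrange(n)             # q(k_t) = U(1, ..., n)
    l = random.choice([-1, 0, 1])       # q(l_t) = U(-1, 0, +1)
    y = list(x)
    if l != 0 and y[k] > 0:             # cannot remove a block from an empty stack
        y[k] -= 1
        y[(k + l) % n] += 1             # periodic boundaries (ring)
    return y

state = [4, 0, 0, 0]                    # a tower of four blocks at location 0
for _ in range(10):
    state = uncontrolled_step(state)
print(state)
```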
Fig. 3 Block stacking problem: graphical model representation as a dynamic Bayesian network. Time runs horizontally and stack positions vertically. At each time, the transition probability of x_t to x_{t+1} is a mixture over the variables k_t, l_t. The initial state is "clamped" to a given configuration by conditioning on the variables x_0. To force a goal state or final configuration, the final state x_T can also be "clamped" (see Section 3.1.1).

The uncontrolled dynamics q becomes q(k_t) = U(1, ..., n), q(l_t) = U(-1, 0, +1), and

q(s_t|k_t, l_t) = \prod_{i=1}^{n} q(s_{ti}|k_t, l_t)

q(s_{ti}|k_t, l_t) = \begin{cases} \delta_{s_{ti}, -1} & \text{for } k_t = i, \; l_t = \pm 1 \\ \delta_{s_{ti}, +1} & \text{for } k_t + l_t = i, \; l_t = \pm 1 \\ \delta_{s_{ti}, 0} & \text{otherwise} \end{cases}

where U(\cdot) denotes the uniform distribution. The transition from x_{t-1} to x_t is a mixture over the values of k_t, l_t:

q(x_t|x_{t-1}) = \sum_{k_t, l_t} \prod_{i=1}^{n} q(x_{ti}|x_{t-1,i}, k_t, l_t) \, q(k_t) \, q(l_t)    (13)

q(x_{ti}|x_{t-1,i}, k_t, l_t) = \sum_{s_{ti}} q(x_{ti}|x_{t-1,i}, s_{ti}) \, q(s_{ti}|k_t, l_t)    (14)

q(x_{ti}|x_{t-1,i}, s_{ti}) = \delta_{x_{ti}, x_{t-1,i} + s_{ti}}    (15)

Note that there are combinations of x_{t-1,i} and s_{ti} that are forbidden: we cannot remove a block from a stack of size zero (x_{t-1,i} = 0 and s_{ti} = -1), and we cannot move a block to a stack of size m (x_{t-1,i} = m and s_{ti} = 1). If we restrict the values of x_{ti} and x_{t-1,i} in the last line above to 0, ..., m, these combinations are automatically forbidden.

Figure 3 shows the graphical model associated with this representation. Notice that the graphical structure for q is efficient compared to the naive implementation of q(x_t|x_{t-1}) as a full table. Whereas the joint table requires a number of entries that is exponential in n, the graphical model implementation requires, per time step, only n small tables for q(s_{ti}|k_t, l_t) and q(x_{ti}|x_{t-1,i}, s_{ti}). In addition, the graphical structure can be exploited by efficient approximate inference methods.

Finally, a possible state cost can be defined as the entropy of the distribution of blocks:

R(x) = -\lambda \sum_i \frac{x_i}{m} \log \frac{x_i}{m},    (16)

with \lambda a positive number that indicates the strength. Since \sum_i x_i is constant (no blocks are lost), the minimum entropy solution puts all blocks on one stack (if enough time is available). The control problem is to find the distribution p that minimizes C in Equation (3).

3.1 Numerical results

In the next sections, we consider two particular problems. First, we are interested in finding a sequence of actions that, starting in a given initial state x_0, reaches a given goal state x_T, without state cost. Then we consider the case of entropy minimization, with no defined goal state and nonzero state cost.

3.1.1 \lambda = 0

Figure 4 shows a small example where the planning task is to shift a tower composed of four blocks, which initially is at position 1, to the final position 3.

To find the KL control, we first condition the model on both the initial state and the final state by "clamping" all variables x_0 and x_T. The KL control solution is obtained by computing, for t = 1, ..., T, the marginal p(k_t, l_t|x_{t-1}). In this case, we can find the exact solution via the junction tree (JT) algorithm (Lauritzen and Spiegelhalter, 1988; Mooij, 2010). The action k_t, l_t is obtained by taking the MAP state of p(k_t, l_t|x_{t-1}), breaking ties at random, which results in a new state x_t.

These probabilities p(k_t, l_t|x_{t-1}) are shown in Figure 4b. Notice that the symmetry in the problem is captured in the optimal control, which assigns equal probability to moving the first block left or right (Figure 4b,c, t = 1). Figure 4d shows the strategy resulting from the MAP estimate, which first unpacks the tower at position 1, leaving all four locations with one block at t = 4, and then re-builds it at the goal position 3.

For larger instances, the JT method is not feasible because of too large tree widths. For instance, to stack 4 blocks on 6 locations within a horizon of 11, the junction tree has a maximal width of 12, requiring about 15 Gbytes of memory. We can nevertheless obtain approximate solutions using different approximate inference methods. In this work, we use the belief propagation algorithm (BP) and a generalization known as the cluster variation method (CVM).
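Extracting a concrete plan from the marginals, as described above, amounts to a sequential MAP choice with random tie-breaking. A minimal sketch (the array layout and names are ours):

```python
import numpy as np

def map_action(p_kl, rng):
    """MAP action from the marginal p(k_t, l_t | x^{t-1}), breaking ties at random.

    p_kl: (n, 3) array over location k and shift l, columns ordered as l = -1, 0, +1.
    """
    flat = p_kl.ravel()
    ties = np.flatnonzero(np.isclose(flat, flat.max()))  # all maximizing actions
    k, l_col = divmod(int(rng.choice(ties)), 3)
    return k, l_col - 1                                  # columns {0,1,2} -> l in {-1,0,+1}

rng = np.random.default_rng(0)
p = np.full((4, 3), 1.0 / 12)            # uniform marginal: every action ties
print(map_action(p, rng))                # a uniformly random (k, l)
```

Applying the chosen move to x_{t-1} gives x_t, on which the next marginal is conditioned.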
Fig. 4 Control for the KL-blocks-world problem with end-cost: example with m = 4, n = 4 and T = 8. (a) Initial and goal states. (b) Probability of action p(k_t, l_t|x_{t-1}) for each time step t = 1, ..., T. (c) Expected value \langle x_{ti} \rangle, i = 1, ..., n, given the initial position and desired final position, and (d) the MAP solution for all times, using a gray scale coding with white coding for zero and darker colors coding for higher values.
Fig. 5 Control for the KL-blocks-world problem with end-cost: results of approximate inference using random initial and goal states. (Left) Percentage of instances for which BP converges for all t = 1, ..., T, as a function of m, for different values of n. (Right) CPU time required for CVM to find a correct plan for different values of n, m. T was set to \lceil m \cdot n \rceil. We ran 50 instances for each pair (m, n).
Fig. 6 Example of a large block stacking instance without end cost: n = 8, m = 40, T = 80, \lambda = 10, using CVM.

We briefly summarize the main idea of the CVM method in Appendix C. We use the minimal cluster size, that is, the outer clusters are equal to the interaction potentials \psi, as shown in the graphical model of Figure 3. To compute the sequence of actions, we again follow a sequential approach.

Figure 5 shows results using BP and CVM. For n = 4, BP converges fast and finds a correct plan for all instances. For larger n, BP fails to converge, more or less independently of m. Thus, BP can be applied successfully to small instances only. Conversely, CVM is able to find a correct plan in all run instances, although at the cost of more CPU time, as Figure 5 shows. The variance in the CPU time error bars is explained by the randomness in the number of actual moves required to solve each instance, which is determined by the initial and goal states.

3.1.2 \lambda > 0: entropy minimization

We now consider the problem without conditioning on x_T and with \lambda > 0.
Although this may seem counterintuitive, removing the end constraint in fact makes the problem harder, as the number of states that have significant probability for large t is much larger. BP is not able to produce any reliable result for this problem. We applied CVM to a large block stacking problem with n = 8, m = 40, T = 80 and \lambda = 10. We use again the minimal cluster size and the double loop method of Heskes et al. (2003). The results are shown in Figure 6. The computation time was approximately 1 hour per iteration t, and memory use was approximately 27 Mb. This instance was too large to obtain exact results. We conclude that, although the CPU time is large, the CVM method is capable of yielding an apparently accurate control solution for this large instance.

4 The KL-stag-hunt game

In this section we consider a variant of the stag hunt game, a prototype game of social conflict between personal risk and mutual benefit (Skyrms, 2004). The original two-player stag hunt game proceeds as follows: there are two hunters, and each of them can choose between hunting hare or hunting stag, without knowing in advance the choice of the other hunter. The hunters can catch a hare on their own, giving them a small reward. The stag has a much larger reward, but it requires both hunters to cooperate in catching it.
Table 1
Two-player stag hunt payoff matrix example: rows and columns indicate the actions of one and the other player, respectively. The payoff describes the reward for each hunter. For instance, if both go for the stag, they both get a reward of 3. If one hunter goes for the stag and the other for the hare, they get a reward of 0 and 1, respectively.

          Stag    Hare
  Stag    3, 3    0, 1
  Hare    1, 0    1, 1

Table 1 displays a possible payoff matrix for a stag hunt game. It shows that both stag hunting and hare hunting are
Nash equilibria: that is, if the other player chooses stag, it is best to choose stag (the payoff dominant equilibrium, top left), and if the other player chooses hare, it is best to choose hare (the risk dominant equilibrium, bottom right). It is argued that these two possible outcomes make the game socially more interesting than, for example, the prisoner's dilemma, which has only one Nash equilibrium. The stag hunt allows for the study of cooperation within social structures (Skyrms, 1996) and for studying the collaborative behavior of multi-agent systems (Yoshida et al., 2008).

We define the KL-stag-hunt game as a multi-agent version of the original stag hunt game, where M agents live on a grid of N locations and can move to adjacent locations on the grid. The grid also contains H hares and S stags at certain fixed locations. Two agents can cooperate and catch a stag together, with a high payoff R_s. Catching a stag with more than two agents is also possible, but it does not increase the payoff. The agents can also catch a hare individually, obtaining a lower payoff R_h. The game is played for a finite time T, and at each time step all the agents perform an action. The optimal strategy is thus to coordinate pairs of agents to go for different stags.

Formally, let x_{ti} = 1, ..., N, i = 1, ..., M, t = 1, ..., T denote the position of agent i at time t on the grid. Also, let s_j = 1, ..., N, j = 1, ..., S, and h_k = 1, ..., N, k = 1, ..., H denote the positions of the jth stag and the kth hare, respectively. We define the following state dependent reward:

R(x_t) = R_h \sum_{k=1}^{H} \sum_{i=1}^{M} \delta_{x_{ti}, h_k} + R_s \sum_{j=1}^{S} \mathbb{I}\left\{ \left( \sum_{i=1}^{M} \delta_{x_{ti}, s_j} \right) > 1 \right\},

where \mathbb{I}\{\cdot\} denotes the indicator function. The first term accounts for the agents located at the position of a hare. The second term accounts for the rewards of the stags, which require at least two agents to be at the location of the stag. Note that the reward for a stag is not increased further if more than two agents go for the same stag. Conversely, the reward corresponding to a hare is proportional to the number of agents at its position.

The uncontrolled dynamics factorizes among the agents. It allows an agent to stay at its current position or move to an adjacent position (if possible) with equal probability, thus performing a random walk on the grid. Consider the state variables of an agent at two subsequent time steps expressed in Cartesian coordinates, x_{ti} = \langle l, m \rangle, x_{t+1,i} = \langle l', m' \rangle. We define the following function:

\psi_q(\langle l', m' \rangle, \langle l, m \rangle) := \mathbb{I}\{ ((l' = l) \wedge (m' = m)) \vee ((l' = l-1) \wedge (m' = m) \wedge (l > 1)) \vee ((l' = l) \wedge (m' = m-1) \wedge (m > 1)) \vee ((l' = l+1) \wedge (m' = m) \wedge (l < \sqrt{N})) \vee ((l' = l) \wedge (m' = m+1) \wedge (m < \sqrt{N})) \},

which evaluates to one if the agent does not move (first condition), or if it moves left, down, right, or up (subsequent conditions) inside the grid boundaries.
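For concreteness, the state reward and the single-agent move indicator can be written as follows. This is a sketch using 0-based grid coordinates (the definition above uses 1-based coordinates up to \sqrt{N}); the function names are ours.

```python
import numpy as np

def reward(x, stags, hares, R_s, R_h):
    """State dependent reward R(x^t) for the KL-stag-hunt.

    x: array of agent positions (cell indices). Each agent on a hare earns R_h;
    each stag with at least two agents on it earns R_s once (no extra payoff
    for more than two agents).
    """
    r = sum(R_h * int(np.sum(x == h)) for h in hares)
    r += sum(R_s for s in stags if np.sum(x == s) >= 2)
    return r

def psi_q(dst, src, side):
    """Indicator of a legal single-agent move on a side x side grid.

    dst, src: (l, m) coordinate pairs. Legal moves are stay, left, down,
    right, up, while remaining inside the grid boundaries.
    """
    (l2, m2), (l1, m1) = dst, src
    return int(abs(l2 - l1) + abs(m2 - m1) <= 1
               and 0 <= l2 < side and 0 <= m2 < side)

x = np.array([5, 5, 7])                  # two agents on cell 5, one on cell 7
print(reward(x, stags=[5], hares=[7], R_s=-10, R_h=-1))   # -> -11
```

Normalizing \psi_q over destinations, as in the next equation, turns it into the single-agent random walk q(x_{t+1,i}|x_{ti}).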
The uncontrolled dynamics for one agent can be written as conditional probabilities after proper normalization:

q(x_{t+1,i} = \langle l', m' \rangle \,|\, x_{ti} = \langle l, m \rangle) = \frac{\psi_q(\langle l', m' \rangle, \langle l, m \rangle)}{\sum_{a,b} \psi_q(\langle a, b \rangle, \langle l, m \rangle)},

and the joint uncontrolled dynamics becomes

q(x_{t+1}|x_t) = \prod_{i=1}^{M} q(x_{t+1,i}|x_{ti}).

Since we are interested in the final configuration at the end time T, we set the state dependent path cost to zero for t = 1, ..., T-1 and to \frac{1}{\lambda} R(x_T) for the end time.

To minimize C in Equation (3), exact inference in the joint space can be done by backward message passing, using the following equations:

\beta_t(x_t) = \begin{cases} \exp\left( -\frac{1}{\lambda} R(x_t) \right) & \text{for } t = T \\ \sum_{x_{t+1}} q(x_{t+1}|x_t) \beta_{t+1}(x_{t+1}) & \text{for } t < T, \end{cases}    (17)

and the desired marginal probabilities can be obtained from the \beta-messages:

p(x_{t+1}|x_t) \propto q(x_{t+1}|x_t) \beta_{t+1}(x_{t+1}).    (18)

To illustrate this game, we consider a small 5 × 5 grid with two hunters (Figure 7). Starting from the initial state, we select the next state according to the most probable state from p(x_{t+1,i}|x_{ti}) until the end time. We break ties randomly. Figure 7 shows one resulting trajectory for two values of \lambda.

For high values of \lambda (left plot), each hunter catches one of the hares. In this case, the cost function is dominated by the KL term. For small enough values of \lambda (right plot), both hunters cooperate to catch the stag. In this case, the state cost, the function R(x_T), governs the optimal control cost. Thus \lambda can be seen as a parameter that controls whether the optimal strategy is risk dominant or payoff dominant.

Fig. 7 Exact inference for the KL-stag-hunt: two hunters in a small grid. There are four hares, one at each corner of the grid (small diamonds), and one stag in the middle (big diamond). The initial positions of the hunters are denoted by small circles. One hunter is close to a hare; the other is at the same distance from the stag and from two hares. Final positions are denoted by asterisks. The optimal paths are drawn in blue and red (color online). (Left) For \lambda = 10 (\log Z = 0.06), the optimal control is risk dominant and the hunters go for the hares. (Right) For \lambda = 0.1 (\log Z = 93), the payoff dominant control is optimal and the hunters cooperate. N = 25, T = 4, R_s = -10 and R_h = -1.

Note that computing the exact solution using this procedure becomes infeasible even for a small number of agents, since the joint state space scales as N^M. In the next section, we show a more efficient representation using a factor graph, for which approximate inference is tractable.
Fig. 8 Factor graph representation of the KL-stag-hunt problem. Circles denote variable nodes (the states of the agents at a given time step) and squares denote factor nodes. There are two types of factor nodes: those corresponding to the uncontrolled dynamics \psi_q and the one corresponding to the state cost \psi_R. The initial configuration, in gray, denotes the states "clamped" to an initial given value. Despite the graph being a tree, exact inference and approximate inference are intractable in this model due to the complex factor \psi_R.

The factor \psi_R(x_T) := \exp(-\frac{1}{\lambda} R(x_T)) involves all the agent states, which still makes the problem intractable. Even approximate inference algorithms such as BP cannot be applied, since messages from \psi_R to one of the state variables x_{Ti} would require a marginalization over the states of all other agents, involving a sum of N^{M-1} terms.

However, we can exploit the particular structure of that factor by decomposing it into smaller factors defined on small sets of (at most three) auxiliary variables of small cardinality. This transformation becomes intuitive once the graphical model representation of the problem is identified. The procedure defines indicator functions for the allowed configurations, which are weighted according to the corresponding cost. Figure 9 illustrates the procedure for the case of one stag.

1. First, we add H × M factors \psi_{h_k}(x_{Ti}), defined for each hare location h_k and each agent variable x_{Ti}. These factors account for the hare costs:

\psi_{h_k}(x_{Ti}) := \begin{cases} \exp\left( -\frac{1}{\lambda} R_h \right) & \text{if } x_{Ti} = h_k \\ 1 & \text{otherwise.} \end{cases}
2. Second, we add factors \psi_{s_j}(x_{Ti}, d_{i,j}) for each stag j, defined on each state variable x_{Ti} and newly introduced binary variables d_{i,j} = 0, 1. These factors evaluate to one when variable d_{i,j} takes the value of a Kronecker delta of the agent's state x_{Ti} and the position of the stag s_j, and to zero otherwise:

\psi_{s_j}(x_{Ti}, d_{i,j}) := \mathbb{I}\{ d_{i,j} = \delta_{x_{Ti}, s_j} \}.

Fig. 9 Decomposition of the complex factor \psi_R into simple factors involving at most three variables of small cardinality. Each state variable is linked to H factors corresponding to the hare locations. For each stag there is a chain of factors \psi_{r_i}, i = 1, ..., M-1, and \psi_{r_M} weights the configuration of having zero, one, or more agents at the stag position (the figure shows the case of one stag only).
3. Third, for each stag, we introduce a chain of factors that involve the binary variables d_{i,j} and additional variables u_{i,j} = 0, 1, 2. The new variables u_{i,j} encode whether stag j has zero, one, or more agents after considering the (i+1)th agent. The new factors are:

\psi_{r_1}(d_{1,j}, d_{2,j}, u_{1,j}) := \mathbb{I}\{ ((d_{1,j} = 0) \wedge (d_{2,j} = 0) \wedge (u_{1,j} = 0)) \vee ((d_{1,j} = 1) \wedge (d_{2,j} = 1) \wedge (u_{1,j} = 2)) \vee ((d_{1,j} \neq d_{2,j}) \wedge (u_{1,j} = 1)) \}

and, for i = 3, ..., M,

\psi_{r_{i-1}}(u_{i-2,j}, d_{i,j}, u_{i-1,j}) := \mathbb{I}\{ ((d_{i,j} = 0) \wedge (u_{i-2,j} = u_{i-1,j})) \vee ((d_{i,j} = 1) \wedge (u_{i-2,j} = 0) \wedge (u_{i-1,j} = 1)) \vee ((d_{i,j} = 1) \wedge (u_{i-2,j} = 1) \wedge (u_{i-1,j} = 2)) \vee ((d_{i,j} = 1) \wedge (u_{i-2,j} = 2) \wedge (u_{i-1,j} = 2)) \}.
Fig. 10 Approximate inference for the KL-stag-hunt: control obtained using BP for M = 10 hunters in a large grid. See Figure 7 for a description of the symbols. (Left) Risk dominant control is obtained for \lambda = 10, where all hunters go for a hare. (Right) Payoff dominant control is obtained for \lambda = 0.1. In this case, all hunters cooperate to capture the stags except the ones in the upper-right corner, who are too far away from the stag to reach it in T = 10 steps. Their optimal choice is to go for a hare. N = 400, S = M/2, H = 2M, R_s = -10 and R_h = -1.
4. Finally, we define factors \psi_{r_M} that weight the allowed configurations:

\psi_{r_M}(u_{M-1,j}) := \begin{cases} \exp\left( -\frac{1}{\lambda} R_s \right) & \text{if } u_{M-1,j} = 2 \\ 1 & \text{otherwise.} \end{cases}

The original factor can be rewritten by marginalizing the product of the previous factors \psi_{s_j}, \psi_{h_k}, \psi_{r_i} over the auxiliary variables d_{i,j}, u_{i,j}:

\exp\left( -\frac{1}{\lambda} R(x_T) \right) = \psi_S(x_T) \, \psi_H(x_T)

\psi_S(x_T) := \prod_{j=1}^{S} \sum_{d_{1,j}, \ldots, d_{M,j}} \sum_{u_{1,j}, \ldots, u_{M-1,j}} \psi_{s_j}(x_{T1}, d_{1,j}) \, \psi_{s_j}(x_{T2}, d_{2,j}) \, \psi_{r_1}(d_{1,j}, d_{2,j}, u_{1,j}) \, \psi_{r_M}(u_{M-1,j}) \prod_{i=3}^{M} \psi_{r_{i-1}}(u_{i-2,j}, d_{i,j}, u_{i-1,j}) \, \psi_{s_j}(x_{Ti}, d_{i,j})

\psi_H(x_T) := \prod_{k=1}^{H} \prod_{i=1}^{M} \psi_{h_k}(x_{Ti}),

where for clarity of notation we have grouped the factors related to the stags and to the hares into \psi_S(x_T) and \psi_H(x_T), respectively.

The extended factor graph is tractable, since it involves factors of no more than three variables of small cardinality. Note that this transformation can also be applied if additional state costs R(x_t) \neq 0 are incorporated at each time step t = 1, ..., T. However, such a representation is not of practical interest, since it complicates the model unnecessarily.

Finally, note that the tree-width of the extended graph still grows quickly with the number of agents M, because the variables d_{i,j} and u_{i,j} are coupled. Thus, exact inference using the JT algorithm is still possible on small instances only.

4.2 Approximate inference of the KL-stag-hunt problem

In this section we analyze large systems for which exact inference is not possible using the JT algorithm. The belief propagation (BP) algorithm is an alternative approximate algorithm that we can run on the previously described extended factor graph.

We use the following setup: for a fixed number of agents M, we set the number of hares to H = 2M and the number of stags to S = M/2. Their locations, as well as the initial states x_1, are chosen randomly and non-overlapping. We then construct a factor graph with the initial states "clamped" to x_1 and build instance-dependent factors \psi_{s_j} and \psi_{h_k}. We run BP using sequential updates of the messages. If BP converges in less than 500 iterations, the optimal trajectories of the agents are computed using the estimated marginals (factor beliefs) for \psi_q(x_{t+1}|x_t) after convergence. Starting from x_1, we select the next state according to the most probable state from p_{BP}(x_{t+1,i}|x_{ti}) until the end time. We break ties randomly. We analyze the system as a function of the parameter \lambda for a number of realizations.

The globally observed behavior is qualitatively similar to that of a small system: for very large \lambda, a risk dominant control is obtained, and for small enough \lambda, payoff dominant control is obtained. This behavior is illustrated in Figure 10, which shows an example for \lambda = 10 and \lambda = 0.1. We also monitor the approximate optimal cost -\log Z_{BP}. We observe that for large systems that quantity changes abruptly at \lambda \approx 1.
Qualitatively, the optimal control obtained on the boundary between the risk dominant and payoff dominant strategies differs maximally between individual instances and depends strongly on the initial configuration. This suggests a phase transition phenomenon typical of complex physical systems, in this case separating the two kinds of optimal behavior, where \lambda plays the role of a "temperature" parameter.

Figure 11 shows this effect. The left plot shows the derivative of the expected approximate cost, averaged over 20 instances. The curve becomes sharper and its maximum gets closer to \lambda = 1 for larger systems. Error bars for the number of iterations required for convergence are shown on the right. The number of BP iterations quickly increases as we decrease \lambda, indicating that the solution in which agents cooperate is more complex to obtain. For very small \lambda, BP may fail to converge within 500 iterations.
Fig. 11 Approximate inference for the KL-stag-hunt. (Left) Change in the expected cost -\log Z_{BP} with respect to \lambda, as a function of \lambda, for \langle M = 4, N = 100 \rangle and \langle M = 10, N = 225 \rangle. The curve becomes sharper and its maximum gets closer to \lambda = 1 for larger systems, suggesting a phase transition between the risk dominant and the payoff dominant regimes. (Right) Number of BP iterations required for convergence as a function of \lambda. Results are averages over 20 runs with random initial states. R_s = -10, R_h = -1, T = 10.

5 Related work

The idea of treating a control problem as an inference problem has a long history. The best known example is the linear quadratic control problem, which is mathematically equivalent to an inference problem and can be solved as a Kalman smoothing problem (Stengel, 1994). The key insight is that the value function that is iterated in the Bellman equation becomes the (log of the) backward message in the Kalman filter. The exponential relation was generalized in Kappen (2005) for the non-linear continuous space and time (Gaussian) case, and in Todorov (2007) for a class of discrete problems.

There is a line of research on how to compute optimal action sequences in influence diagrams using the idea of probabilistic inference (Cooper, 1988; Tatman and Shachter, 1990; Shachter and Peot, 1992). Although this technique can be implemented efficiently using the junction tree approach for single decisions, the approach does not generalize in an efficient way to optimal decisions, in the expected-reward sense, in multi-step tasks. The reason is that the order in which one marginalizes and optimizes strongly affects the efficiency of the computation. For a Markov decision process (MDP) there is an efficient solution in terms of the Bellman equation.¹ For a general influence diagram, the marginalization approach proposed in Cooper (1988); Tatman and Shachter (1990); Shachter and Peot (1992) will result in an intractable optimization problem over u_{0:T-1} that cannot be solved efficiently (using dynamic programming), unless the influence diagram has an MDP structure.

¹ Here we mean by efficient that the sum or min over a sequence of states or actions can be performed as a sequence of sums or mins over states.
The KL control theory shares similarities with work in reinforcement learning on policy updating. The notion of KL divergence appears naturally in the work of Bagnell and Schneider (2003), who propose an information geometric approach to compute the natural policy gradient (for small step sizes). This idea is further developed into an Expectation-Maximization (EM) type algorithm (Dayan and Hinton, 1997) in recent work (Peters et al., 2010; Kober and Peters, 2011) using a relative entropy term. The KL divergence acts there as a regularization that weights the relative dependence of the new policy on the observed data and the old policy, respectively.

It is interesting to compare the notion of free energy in continuous-time dynamical systems with Gaussian noise considered in Friston et al. (2009) with the path integral formalism of Kappen (2005), which is a special case of KL control theory. Friston et al. (2009) advocate the optimization of free energy as a guiding principle to describe the behavior of agents. The main difference between KL control theory and Friston's free energy principle is that in KL control theory the KL divergence plays the role of an expected future cost, and its optimization yields a (time dependent) optimal control trajectory, whereas Friston's free energy computes a control that yields a time-independent equilibrium distribution, corresponding to the minimal free energy. Friston's free energy formulation is obtained as a special case of KL control theory when the dynamics and the reward/cost are time-independent and the horizon time is infinite.

The KL control approach proposed in this paper also bears some relation to the EM approach of Toussaint and Storkey (2006), who consider the discounted reward case with 0-1 rewards. The posterior can be considered a mixture over times at which rewards are incorporated. For a homogeneous Markov process and time independent costs, the backward message passing can effectively be done in a single chain, and the full mixture distribution need not be considered. We can compare the EM approach of Toussaint and Storkey (2006) (TS) and the KL control approach (KL):

– The TS approach is more general than the KL approach, in the sense that the reward considered in TS is an arbitrary function of state and action, R(x, u), whereas the reward considered in KL is the sum of a state dependent term R(x) and a KL divergence.

– The KL approach is significantly more efficient than the TS approach. In the TS approach, the backward messages are computed for a fixed policy \pi (E-step), from which an improved policy is computed (M-step). This procedure is iterated until convergence. In the KL approach, the backward messages give the optimal control directly, with no further need for iteration.

– In addition, the KL approach is more efficient than the TS approach for time-dependent problems. Using the TS approach for time-dependent problems makes the computation a factor T more time-consuming than for the time-independent case, since all mixture components must be computed. The complexity of the KL control approach does not depend on whether the problem is time-dependent or not.

– The TS and KL approaches optimize with respect to different quantities. The TS approach writes the state transition p(y|x) = \sum_u p(y|x, u) \pi(u|x) and optimizes with respect to \pi. The KL approach optimizes the state transition probability p(y|x) directly, either as a table or in a parametrized way.
6 Discussion

In this paper, we have shown the equivalence of a class of stochastic optimal control problems to a graphical model inference problem. As a result, exact or approximate inference methods can be directly applied to the intractable stochastic control computation. The class of KL control problems contains interesting special cases, such as the continuous non-linear Gaussian stochastic control problems introduced in Kappen (2005), discrete planning tasks, and multi-agent games, as illustrated in this paper.

We note that there exist many stochastic control problems outside of this class. In the basic formulation of Equation (1), one can construct control problems where the functional form of the controlled dynamics p_t(x_{t+1}|x_t, u_t) is given, as well as the cost of control \hat{R}(x_t, u_t, x_{t+1}, t). In general, there may then not exist a q_t(x_{t+1}|x_t) such that Equation (2) holds.

In this paper, we have considered the model based case only. The extension to the model free case would require a sampling based procedure. See Bierkens and Kappen (2012) for initial work in this direction.

We have demonstrated the effectiveness of approximate inference methods for computing the approximate control in a block stacking task and a multi-agent cooperative task.

For the KL-blocks-world, we have shown that an entropy minimization task is more challenging than stacking blocks at a fixed location (goal state), because the control computation needs to find out where the optimal location is. Standard BP does not give any useful results if no goal state is specified, but apparently good optimal control solutions were obtained using generalized belief propagation (CVM). We found that the marginal computation using CVM is quite difficult compared to other problems that have been studied in the past (Albers et al., 2007), in the sense that relatively many inner loop iterations were required for convergence. One can improve the CVM accuracy, if needed, by considering larger clusters (Yedidia et al., 2005), as has been demonstrated in other contexts (Albers et al., 2006), at the cost of more computational complexity.

We have given evidence that the KL control formulation is particularly attractive for multi-agent problems, where q naturally factorizes over agents and where interaction results from the fact that the reward depends on the state of more than one agent. A first step in this direction was already made in Wiegerinck et al. (2006); van den Broek et al. (2008a). In this paper, we have considered the KL-stag-hunt game and shown that BP provides a good approximation and allows the analysis of the behavior of large systems, where exact inference is not feasible.

We found that, if the game setting strongly penalizes large deviations from the baseline (random) policy, the coordinated solution is sub-optimal. That means that the optimal solution distributes the agents among the different hares rather than bringing them jointly to the stags (the risk dominant regime). On the contrary, if the agents are not constrained to stay close to the baseline policy when maximizing \langle R \rangle, the coordinated solution becomes optimal (the payoff dominant regime). We believe that this is an interesting result, since it provides an explanation of the emergence of cooperation in terms of an effective temperature parameter \lambda.

Acknowledgments
We would like to thank the anonymous reviewers for helping to improve the manuscript, Kees Albers for making available his sparse CVM code, Joris Mooij for making available the libDAI software, and Stijn Tonk for useful discussions. This work was supported in part by the ICIS/BSIK consortium.
A Boltzmann distribution
Consider the KL divergence between a normalized probability distribution p(x) and some positive function \psi(x):

C(p) = \sum_x p(x) \log \frac{p(x)}{\psi(x)}

C is a function of the distribution p. We compute the distribution that minimizes C with respect to p, subject to the normalization \sum_x p(x) = 1, by adding a Lagrange multiplier:

L(p) = C(p) + \beta \left( \sum_x p(x) - 1 \right)

\frac{dL}{dp(x)} = \log \frac{p(x)}{\psi(x)} + 1 + \beta

Setting the derivative equal to zero yields p(x) = \psi(x) \exp(-\beta - 1) = \psi(x)/Z, where we have defined Z = \exp(\beta + 1). The normalization condition \sum_x p(x) = 1 fixes Z = \sum_x \psi(x). Substituting the solution for p into the cost C yields C = -\log Z.
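This argument can be checked numerically: for any positive \psi, the normalized p = \psi/Z attains C = -\log Z, and every other normalized distribution costs more, since C(p) = KL(p \| \psi/Z) - \log Z. A small sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
psi = rng.random(8) + 0.1               # an arbitrary positive function psi(x)
Z = psi.sum()
p_opt = psi / Z                         # the Boltzmann distribution, Equation (5)

def C(p):                               # C(p) = sum_x p(x) log(p(x) / psi(x))
    return float(np.sum(p * np.log(p / psi)))

assert np.isclose(C(p_opt), -np.log(Z))   # optimal cost equals -log Z

q = rng.random(8); q /= q.sum()           # any other normalized distribution
assert C(q) >= C(p_opt)                   # ... is never cheaper
```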
B Relation to continuous path integral model

We write p(x'|x) = \mathcal{N}(x' \,|\, x + f(x,t) dt + g(x,t) u(x,t) dt, \; \Xi dt), with \Xi(x,t) = g(x,t) \nu g(x,t)^T, in Equation (9) as

p(x'|x) = \mathcal{N}(x' \,|\, x + f(x,t) dt, \; \Xi(x,t) dt) \exp\left( (\dot{x} - f(x,t))^T \Xi^{-1} g(x,t) u(x,t) dt - \frac{dt}{2} (g(x,t) u(x,t))^T \Xi^{-1} g(x,t) u(x,t) \right) = q(x'|x) \exp(U(x, x', t) dt)

U(x, x', t) = (\dot{x} - f(x,t))^T \Xi^{-1} g(x,t) u(x,t) - \frac{1}{2} (g(x,t) u(x,t))^T \Xi^{-1} g(x,t) u(x,t),

with \dot{x} = (x' - x)/dt.

In order to make the link to Equation (3), we compute

\sum_{x'} p(x'|x) \log \frac{p(x'|x)}{q(x'|x)} = \sum_{x'} p(x'|x) U(x, x', t) dt = \frac{dt}{2} (g(x,t) u(x,t))^T \Xi(x,t)^{-1} g(x,t) u(x,t) = \frac{dt}{2} u(x,t)^T \nu^{-1} u(x,t),

where we have used the fact that \sum_{x'} p(x'|x) x' = x + f(x,t) dt + g(x,t) u(x,t) dt and g^T \Xi^{-1} g = g^T (g^{-1})^T \nu^{-1} g^{-1} g = \nu^{-1}.² Therefore,

KL(p \| q) = \sum_{x_{dt:T}} p(x_{dt:T}|x_0) \log \frac{p(x_{dt:T}|x_0)}{q(x_{dt:T}|x_0)} = \sum_{s=0}^{T-dt} \sum_{x_s} p(x_s|x_0) \sum_{x_{s+dt}} p(x_{s+dt}|x_s) U(x_s, x_{s+dt}, s) dt = \sum_{s=0}^{T-dt} dt \sum_{x_s} p(x_s|x_0) \frac{1}{2} u(x_s, s)^T \nu^{-1} u(x_s, s).

In the limit dt \to 0, the KL divergence between p and q becomes

KL(p \| q) = \left\langle \int_0^T ds \; \frac{1}{2} u(x(s), s)^T \nu^{-1} u(x(s), s) \right\rangle,

in agreement with Equation (10).

² When g is not a square matrix (when the number of controls is less than the dimension of x), g^{-1} denotes the pseudo-inverse of g. For any u, the pseudo-inverse has the property that g^{-1} g u = u.
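The identity g^T \Xi^{-1} g = \nu^{-1} used above, and hence the quadratic control cost, can be verified numerically. The sketch below is ours and uses the pseudo-inverse, as in the footnote, for the case where the number of controls m is smaller than the state dimension n:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, dt = 3, 2, 0.01
g = rng.random((n, m))                    # full column rank with probability one
nu = np.diag(rng.random(m) + 0.5)         # noise covariance nu
u = rng.random(m)                         # an arbitrary control
Xi = g @ nu @ g.T                         # Xi = g nu g^T (rank m, singular for m < n)

# One-step KL between the controlled and uncontrolled Gaussians: both have
# covariance Xi dt and their means differ by g u dt, so
# KL = 1/2 (g u dt)^T (Xi dt)^+ (g u dt), which should equal dt/2 u^T nu^{-1} u.
d = g @ u * dt
kl = 0.5 * d @ np.linalg.pinv(Xi * dt) @ d

assert np.isclose(kl, 0.5 * dt * u @ np.linalg.inv(nu) @ u)
```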
C Cluster Variation Method

In this appendix, we give a brief summary of the CVM method and the double loop approach. For a more complete description, see Yedidia et al. (2001); Kappen and Wiegerinck (2002); Heskes et al. (2003).

The cluster variation method replaces the probability distribution p(x) in the minimization of Equation (3) by a large number of (overlapping) probability distributions (clusters), each describing the interaction between a small number of variables:

p(x) \approx \{ p_\alpha(x_\alpha), \; \alpha = 1, \ldots \}
Fig. 12 (Left) Example of a small network and a choice of clusters for CVM. (Middle) Intersections of clusters recursively define a set of sub-clusters. (Right) F_{cvm} is non-convex (blue curve) and is bounded by a convex function \tilde{F} (color online).

with each \alpha a subset of the indices 1, ..., n, x_\alpha the corresponding subset of variables, and p_\alpha the probability distribution on x_\alpha. The set of clusters is denoted by B, and must be such that every interaction term \psi_\alpha(x_\alpha), with \psi(x) = \prod_\alpha \psi_\alpha(x_\alpha) from Equation (4), is contained in at least one cluster. The set of all pairwise intersections of clusters in B, as well as intersections of intersections, is denoted by M. Figure 12 (left) gives an example of a small directed graphical model, where B consists of 4 clusters and M consists of 5 sub-clusters, Figure 12 (middle).

The CVM approximates the KL divergence, Equation (3), as

C(x_0, p) \approx F_{cvm}(\{p_\alpha\})

F_{cvm}(\{p_\alpha\}) = \sum_{\alpha \in B} \sum_{x_\alpha} p_\alpha(x_\alpha) \log \frac{p_\alpha(x_\alpha)}{\psi_\alpha(x_\alpha)} + \sum_{\beta \in M} a_\beta \sum_{x_\beta} p_\beta(x_\beta) \log p_\beta(x_\beta).

F_{cvm} is minimized with respect to all \{p_\alpha\} subject to normalization and consistency constraints:

\sum_{x_\alpha} p_\alpha(x_\alpha) = 1, \qquad p_\alpha(x_\beta) = p_\beta(x_\beta) \text{ for } \beta \subset \alpha, \qquad p_\alpha(x_\alpha) \geq 0.

The numbers a_\beta are called the Möbius or overcounting numbers. They can be computed recursively from the formula

1 = \sum_{\alpha \in B \cup M, \; \alpha \supseteq \beta} a_\alpha, \qquad \forall \beta \in B \cup M.

Since the a_\alpha can be both positive and negative, F_{cvm} is not convex. A guaranteed convergent approach to minimize F_{cvm} is a double loop approach, in which the outer loop upper-bounds F_{cvm} by a convex function \tilde{F}_p that touches it at the current cluster solution p = \{p_\alpha\}. Optimizing \tilde{F}_p(p) is a convex problem that can be solved using the dual approach (inner loop) and is guaranteed to decrease F_{cvm} to a local minimum. The solution p^*(p) of this convex sub-problem is guaranteed to decrease F_{cvm}:

F_{cvm}(p) = \tilde{F}_p(p) \geq \tilde{F}_p(p^*(p)) \geq F_{cvm}(p^*(p)).

Based on p^*(p), a new convex upper bound is computed (outer loop). This is called a double loop method. The approach is illustrated in Figure 12 (right).

Alternatively, one can choose to ignore the non-convexity issue. Adding Lagrange multipliers \lambda to enforce the constraints, one can minimize with respect to p = \{p_\alpha\} and obtain an explicit solution for p in terms of the interactions \psi and the \lambda's. Inserting this solution into the above constraints results in a set of non-linear equations for the \lambda's, which one may attempt to solve by fixed point iteration. It can be shown that these equations are equivalent to the message passing equations of belief propagation. Unlike the above double loop approach, belief propagation does not converge in general, but tends to give a fast and accurate solution for those problems for which it does converge.
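The overcounting numbers are easy to compute in code. A minimal sketch (our own helper, representing clusters as Python frozensets):

```python
def overcounting_numbers(clusters):
    """Moebius numbers a_beta from 1 = sum of a_alpha over alpha in B u M, alpha >= beta.

    clusters: iterable of frozensets containing the outer clusters B and all
    sub-clusters M. Processing from largest to smallest solves the recursion.
    """
    clusters = sorted(set(clusters), key=len, reverse=True)
    a = {}
    for beta in clusters:
        a[beta] = 1 - sum(a[alpha] for alpha in clusters if alpha > beta)
    return a

# Two outer clusters overlapping in one sub-cluster: a = +1, +1, -1,
# so every variable region is counted exactly once.
B = [frozenset({1, 2, 3}), frozenset({2, 3, 4})]
M = [frozenset({2, 3})]
print(overcounting_numbers(B + M))
```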
References

Albers, C. A., Heskes, T., and Kappen, H. J. (2007). Haplotype inference in general pedigrees using the cluster variation method. Genetics, 177(2):1101–1118.

Albers, C. A., Leisink, M. A. R., and Kappen, H. J. (2006). The cluster variation method for efficient linkage analysis on extended pedigrees. BMC Bioinformatics, 7(S-1).

Bagnell, J. A. and Schneider, J. (2003). Covariant policy search. In IJCAI'03: Proceedings of the 18th International Joint Conference on Artificial Intelligence, pages 1019–1024, San Francisco, CA, USA. Morgan Kaufmann.

Bertsekas, D. P. and Tsitsiklis, J. N. (1996). Neuro-Dynamic Programming. Athena Scientific, Belmont, Massachusetts.

Bierkens, J. and Kappen, B. (2012). KL-learning: online solution of Kullback-Leibler control problems. http://arxiv.org/abs/1112.1996.

Boutilier, C., Dearden, R., and Goldszmidt, M. (1995). Exploiting structure in policy construction. In IJCAI'95: Proceedings of the 14th International Joint Conference on Artificial Intelligence, pages 1104–1111, San Francisco, USA. Morgan Kaufmann.

Cooper, G. (1988). A method for using belief networks as influence diagrams. In Proceedings of the Workshop on Uncertainty in Artificial Intelligence (UAI'88), pages 55–63.

da Silva, M., Durand, F., and Popović, J. (2009). Linear Bellman combination for control of character animation. ACM Transactions on Graphics, 28(3):82:1–82:10.

Dayan, P. and Hinton, G. E. (1997). Using expectation-maximization for reinforcement learning. Neural Computation, 9(2):271–278.

Friston, K. J., Daunizeau, J., and Kiebel, S. J. (2009). Reinforcement learning or active inference? PLoS ONE, 4(7):e6421.

Heskes, T., Albers, K., and Kappen, H. J. (2003). Approximate inference and constrained optimization. In Proceedings of the 19th Conference on Uncertainty in Artificial Intelligence (UAI'03), pages 313–320, Acapulco, Mexico. Morgan Kaufmann.

Jordan, M. I., editor (1999). Learning in Graphical Models. MIT Press, Cambridge, MA, USA.

Kappen, H. J. (2005). Linear theory for control of nonlinear stochastic systems. Physical Review Letters, 95(20):200201.

Kappen, H. J. and Wiegerinck, W. (2002). Novel iteration schemes for the cluster variation method. In Advances in Neural Information Processing Systems 14, pages 415–422. MIT Press, Cambridge, MA.

Kober, J. and Peters, J. (2011). Policy search for motor primitives in robotics. Machine Learning, 84(1-2):171–203.

Koller, D. and Parr, R. (1999). Computing factored value functions for policies in structured MDPs. In IJCAI'99: Proceedings of the 16th International Joint Conference on Artificial Intelligence, pages 1332–1339, San Francisco, CA, USA. Morgan Kaufmann.

Lauritzen, S. L. and Spiegelhalter, D. J. (1988). Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society, Series B (Methodological), 50(2):154–227.

Mooij, J. M. (2010). libDAI: a free and open source C++ library for discrete approximate inference in graphical models. Journal of Machine Learning Research, 11:2169–2173.

Murphy, K., Weiss, Y., and Jordan, M. (1999). Loopy belief propagation for approximate inference: an empirical study. In Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence (UAI'99), pages 467–475, San Francisco, CA. Morgan Kaufmann.

Peters, J., Mülling, K., and Altün, Y. (2010). Relative entropy policy search. In Proceedings of the 24th AAAI Conference on Artificial Intelligence (AAAI 2010), pages 1607–1612. AAAI Press.

Russell, S. J., Norvig, P., Candy, J. F., Malik, J. M., and Edwards, D. D. (1996). Artificial Intelligence: A Modern Approach. Prentice-Hall, Inc., Upper Saddle River, NJ, USA.

Shachter, R. D. and Peot, M. A. (1992). Decision making using probabilistic inference methods. In Proceedings of the 8th Conference on Uncertainty in Artificial Intelligence (UAI'92), pages 276–283, San Francisco, CA, USA. Morgan Kaufmann.

Skyrms, B. (1996). Evolution of the Social Contract. Cambridge University Press, Cambridge/New York.

Skyrms, B. (2004). The Stag Hunt and the Evolution of Social Structure. Cambridge University Press, Cambridge, MA, USA.

Stengel, R. F. (1994). Optimal Control and Estimation. Dover Publications, Inc.

Tatman, J. and Shachter, R. (1990). Dynamic programming and influence diagrams. IEEE Transactions on Systems, Man and Cybernetics, 20(2):365–379.

Theodorou, E. A., Buchli, J., and Schaal, S. (2009). Path integral-based stochastic optimal control for rigid body dynamics. In Adaptive Dynamic Programming and Reinforcement Learning (ADPRL'09), IEEE Symposium on, pages 219–225.

Theodorou, E. A., Buchli, J., and Schaal, S. (2010a). Learning policy improvements with path integrals. In International Conference on Artificial Intelligence and Statistics (AISTATS 2010).

Theodorou, E. A., Buchli, J., and Schaal, S. (2010b). Reinforcement learning of motor skills in high dimensions: a path integral approach. In Proceedings of the International Conference on Robotics and Automation (ICRA 2010), pages 2397–2403. IEEE Press.

Todorov, E. (2007). Linearly-solvable Markov decision problems. In Advances in Neural Information Processing Systems 19, pages 1369–1376. MIT Press, Cambridge, MA.

Todorov, E. (2008). General duality between optimal control and estimation. In Proceedings of the 47th IEEE Conference on Decision and Control, pages 4286–4292.

Todorov, E. (2009). Efficient computation of optimal actions. Proceedings of the National Academy of Sciences of the United States of America, 106(28):11478–11483.

Toussaint, M. and Storkey, A. (2006). Probabilistic inference for solving discrete and continuous state Markov decision processes. In ICML'06: Proceedings of the 23rd International Conference on Machine Learning, pages 945–952, New York, NY, USA. ACM.

van den Broek, B., Wiegerinck, W., and Kappen, H. J. (2008a). Graphical model inference in optimal control of stochastic multi-agent systems. Journal of Artificial Intelligence Research, 32(1):95–122.

van den Broek, B., Wiegerinck, W., and Kappen, H. J. (2008b). Optimal control in large stochastic multi-agent systems. Adaptive Agents and Multi-Agent Systems III: Adaptation and Multi-Agent Learning, 4865:15–26.

Wiegerinck, W., van den Broek, B., and Kappen, H. J. (2006). Stochastic optimal control in continuous space-time multi-agent systems. In Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence (UAI'06), pages 528–535, Arlington, Virginia. AUAI Press.

Wiegerinck, W., van den Broek, B., and Kappen, H. J. (2007). Optimal on-line scheduling in stochastic multi-agent systems in continuous space and time. In Proceedings of the 6th International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS 07).

Yedidia, J., Freeman, W., and Weiss, Y. (2001). Generalized belief propagation. In Leen, T. K., Dietterich, T. G., and Tresp, V., editors, Advances in Neural Information Processing Systems 13, pages 689–695. MIT Press, Cambridge, MA.

Yedidia, J., Freeman, W., and Weiss, Y. (2005). Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Transactions on Information Theory, 51(7):2282–2312.

Yoshida, W., Dolan, R. J., and Friston, K. J. (2008). Game theory of mind. PLoS Computational Biology, 4(12):e1000254.