Optimal control in Markov decision processes via distributed optimization
Jie Fu, Shuo Han and Ufuk Topcu
Abstract — Optimal control synthesis in stochastic systems with respect to quantitative temporal logic constraints can be formulated as linear programming problems. However, centralized synthesis algorithms do not scale to many practical systems. To tackle this issue, we propose a decomposition-based distributed synthesis algorithm. By decomposing a large-scale stochastic system modeled as a Markov decision process into a collection of interacting sub-systems, the original control problem is formulated as a linear programming problem with a sparse constraint matrix, which can be solved through distributed optimization methods. Additionally, we propose a decomposition algorithm which automatically exploits, if it exists, the modular structure in a given large-scale system. We illustrate the proposed methods through robotic motion planning examples.

J. Fu, S. Han and U. Topcu are with the Department of Electrical and Systems Engineering, University of Pennsylvania, Philadelphia, PA 19104, USA. jief, hanshuo, [email protected]
I. INTRODUCTION
For many systems, temporal logic formulas are used to describe desirable system properties such as safety, stability, and liveness [1]. Given a stochastic system modeled as a Markov decision process (MDP), the synthesis problem is to find a policy that achieves optimal performance under a given quantitative criterion regarding given temporal logic formulas. For instance, the objective may be to find a policy that maximizes the probability of satisfying a given temporal logic formula. In such a problem, we need to keep track of the evolution of state variables that capture the system dynamics as well as predicate variables that encode properties associated with the temporal logic constraints [2], [3]. As the number of states grows exponentially in the number of variables, we often encounter large MDPs, for which the synthesis problems are impractical to solve with centralized methods. The insight for control synthesis of large-scale systems is to exploit the modular structure in a system so that we can solve the original problem by solving a set of small subproblems.

In the literature, distributed control synthesis methods were proposed in the pioneering work on MDPs with discounted rewards [4], [5]. The authors formulate a two-stage distributed reinforcement learning method: the first stage constructs and solves an abstract problem derived from the original one, and the second stage iteratively computes parameters for local problems until the collection of local problems' solutions converges to one that solves the original problem. Recently, the alternating direction method of multipliers (ADMM) was combined with a sub-gradient method for planning with average-reward objectives in large MDPs in [6]. However, the method in [6] applies only when some special conditions are satisfied
on the costs and transition kernels. Alternatively, hierarchical reinforcement learning introduces action-aggregation and action-hierarchies to address planning problems with large MDPs [7]. In action-aggregation, a macro-action is a local policy for a subset of states, and the global optimal policy maps histories of states into macro-actions. However, it is not always clear how to define the action hierarchies and how the choice of hierarchies affects the optimality of the global policy. Additionally, the aforementioned methods are in general difficult to implement and cannot handle temporal logic specifications.

For synthesis problems in MDPs with quantitative temporal logic constraints, centralized methods and tools [3], [8] have been developed and applied to control design of stochastic systems and robotic motion planning [9]–[11]. Since centralized algorithms are based on either value iteration or linear programming, they inevitably hit the barrier of scalability and are not viable for large MDPs.

In this paper, we develop a distributed optimization method for large MDPs subject to temporal logic constraints. We first introduce a decomposition method for large MDPs and prove a property of such decompositions that supports the application of the proposed distributed optimization. For a subclass of MDPs whose graph structures are planar graphs, we introduce an efficient decomposition algorithm that exploits the modular structure of the underlying MDP caused by loose coupling between subsets of states and its constituting components. Then, given a decomposition of the original system, we employ a distributed optimization method called the block splitting algorithm [12] to solve the planning problem with respect to discounted-reward objectives in large MDPs and average-reward objectives in large ergodic MDPs. Compared to the two-stage methods in [5], [6], [13], our method concurrently solves the set of sub-problems and penalizes mismatches between their solutions in a single step during each iteration, and it is easy to implement. Since the distributed control synthesis is independent of the way a large MDP is decomposed, any decomposition method can be used. Lastly, we extend the method to solve synthesis problems for MDPs with two classes of quantitative temporal logic objectives. Through case studies we investigate the performance and effectiveness of the proposed method.

II. PRELIMINARIES
Let Σ be a finite set. Let Σ* and Σ^ω be the sets of finite and infinite words over Σ, respectively. card(Σ) is the cardinality of the set Σ. A probability distribution on a finite set S is a function D : S → [0, 1] such that ∑_{s∈S} D(s) = 1. The support of D is the set Supp(D) = {s ∈ S | D(s) > 0}. The set of probability distributions on a finite set S is denoted D(S).

Markov decision process: A Markov decision process (MDP) M = ⟨S, A, u, P⟩ consists of a finite set S of states, a finite set A of actions, an initial distribution u ∈ D(S) of states, and a transition probability function P : S × A → D(S) that, for a state s ∈ S and an action a ∈ A, gives the probability P(s, a)(s') of the next state s'. Given an MDP M, we define the set of actions enabled at state s as A(s) = {a ∈ A | ∃ s' ∈ S, P(s, a)(s') > 0}. The cardinality of the set {(s, a) | s ∈ S, a ∈ A(s)} is the number of state-action pairs in the MDP.

Policies: A path is an infinite sequence s_0 s_1 ... of states such that for all i ≥ 0 there exists a ∈ A with s_{i+1} ∈ Supp(P(s_i, a)). A policy is a function f : S* → D(A) that, given a finite state sequence representing the history, chooses a probability distribution over the set A of actions. A policy f is memoryless if it only depends on the current state, i.e., f : S → D(A). Once a policy f is chosen, the MDP M is reduced to a Markov chain, denoted M_f. We denote by X_i and θ_i the random variables for the i-th state and the i-th action in this chain M_f. Given a policy f and a measurable function φ that maps paths into reals, let E^f_u[φ] (resp. E^f_s[φ]) be the expected value of φ when the policy f is used and u is the initial distribution of states (resp. s is the initial state).
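To fix ideas, the following minimal Python sketch shows one way to store an MDP as a transition dictionary. The container and its field names are purely illustrative and are not part of the formal development; they are only reused by the later sketches in this paper's margins.

```python
# Illustrative MDP container (an assumption, not the paper's implementation):
# P[s][a] maps successor states to probabilities, u0 is the initial distribution.
from dataclasses import dataclass

@dataclass
class MDP:
    states: list
    actions: list
    u0: dict    # initial distribution over states, values sum to 1
    P: dict     # P[s][a][s'] = transition probability

    def enabled(self, s):
        """Actions enabled at s, i.e., actions with at least one successor."""
        return [a for a in self.P.get(s, {}) if self.P[s][a]]

    def num_state_action_pairs(self):
        return sum(len(self.enabled(s)) for s in self.states)
```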
Rewards and objectives: Given an MDP M, a reward function R : S × A → R, a policy f, and a discounting factor γ, the discounted-reward value is defined as Val^f_γ(u) = ∑_{s∈S} Val^f_γ(s) · u(s), where Val^f_γ(s) = E^f_s(∑_{n=0}^∞ γ^n R(X_n, θ_n)); the average-reward value is defined as Val^f(u) = ∑_{s∈S} Val^f(s) · u(s), where Val^f(s) = lim_{n→∞} (1/n) E^f_s[∑_{k=0}^{n} R(X_k, θ_k)]. A discounted-reward (resp. average-reward) problem is, for a given initial state distribution, to obtain a policy that maximizes the discounted-reward (resp. average-reward) value. For discounted-reward (average-reward) problems, the optimal value can be attained by memoryless policies [14].

A solution to the discounted-reward problem can be found by solving the linear programming (LP) problem

  max_{x ∈ R^m_+}  ∑_{s∈S} ∑_{a∈A(s)} x(s, a) · R(s, a)                                        (1a)
  subject to                                                                                    (1b)
  ∑_{a∈A(s)} x(s, a) − γ · ∑_{s'∈S} ∑_{a'∈A(s')} x(s', a') · P(s', a')(s) = u(s),  ∀ s ∈ S,    (1c)

where m is the total number of state-action pairs in the MDP, R^m_+ is the non-negative orthant of R^m, and the variable x(s, a) can be interpreted as the expected discounted time of being in state s and taking action a. Once the LP problem in (1) is solved, the optimal policy is obtained as f(s, a) = x(s, a) / ∑_{a'∈A(s)} x(s, a'), and the objective function's value is the optimal discounted-reward value under policy f given the initial distribution u of states.

In an ergodic MDP, the average-reward value is a constant regardless of the initial state distribution [15] (see [15] for the definition of ergodicity). We obtain an optimal policy for an average-reward problem by solving the LP problem

  max_{x ∈ R^m_+}  ∑_{s∈S} ∑_{a∈A(s)} x(s, a) · R(s, a)                                        (2a)
  subject to
  ∑_{a∈A(s)} x(s, a) − ∑_{s'∈S} ∑_{a'∈A(s')} x(s', a') · P(s', a')(s) = 0,  ∀ s ∈ S,           (2b)
  ∑_{s∈S} ∑_{a∈A(s)} x(s, a) = 1,                                                              (2c)

where x(s, a) is understood as the long-run fraction of time that the system is in state s and the action a is taken. Once the LP problem in (2) is solved, the optimal policy is obtained as f(s, a) = x(s, a) / ∑_{a'∈A(s)} x(s, a'). The optimal objective value is the optimal average-reward value and is the same for all states.
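As an illustration of how (1) can be solved centrally, the sketch below builds the equality constraints of (1c) and calls a generic LP solver. It assumes the illustrative MDP container above and a reward table R[(s, a)]; scipy's linprog is used only as a stand-in for any LP solver.

```python
# Sketch: centralized solution of the discounted-reward LP (1).
import numpy as np
from scipy.optimize import linprog

def solve_discounted_lp(mdp, R, gamma):
    pairs = [(s, a) for s in mdp.states for a in mdp.enabled(s)]
    idx = {sa: k for k, sa in enumerate(pairs)}
    n, m = len(mdp.states), len(pairs)
    A_eq = np.zeros((n, m))
    b_eq = np.array([mdp.u0.get(s, 0.0) for s in mdp.states])
    for i, s in enumerate(mdp.states):
        for a in mdp.enabled(s):
            A_eq[i, idx[(s, a)]] += 1.0                       # outflow: sum_a x(s,a)
        for (sp, ap), k in idx.items():
            A_eq[i, k] -= gamma * mdp.P[sp][ap].get(s, 0.0)   # discounted inflow into s
    c = np.array([-R[sa] for sa in pairs])                    # maximize -> minimize
    sol = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    x = sol.x
    # Randomized memoryless policy f(s,a) = x(s,a) / sum_{a'} x(s,a')
    policy = {}
    for s in mdp.states:
        tot = sum(x[idx[(s, a)]] for a in mdp.enabled(s))
        if tot > 0:
            policy[s] = {a: x[idx[(s, a)]] / tot for a in mdp.enabled(s)}
    return policy, -sol.fun
```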
Distributed optimization: As a prelude to the distributed synthesis method developed in Section IV, we now describe the alternating direction method of multipliers (ADMM) [16] for the generic convex constrained minimization problem min_{z∈C} g(z), where the function g is closed, proper, and convex, and the set C is closed, nonempty, and convex. In iteration k of the ADMM algorithm the following updates are performed:

  z^{k+1/2} := prox_g(z^k − z̃^k),                 (3a)
  z^{k+1}   := Π_C(z^{k+1/2} + z̃^k),              (3b)
  z̃^{k+1}   := z̃^k + z^{k+1/2} − z^{k+1},         (3c)

where z^{k+1/2} and z̃^k are auxiliary variables, Π_C is the (Euclidean) projection onto C, and prox_g(v) = argmin_x (g(x) + (ρ/2)‖x − v‖²₂) is the proximal operator [16] of g with parameter ρ > 0. The algorithm handles separately the objective function g in (3a) and the constraint set C in (3b). The dual update step (3c) coordinates these two steps and results in convergence to a solution of the original problem.
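The short sketch below spells out the three updates (3a)–(3c). The quadratic objective and the box constraint used in the example are stand-ins chosen only to make the snippet self-contained; they are not part of the synthesis problem.

```python
# Sketch of the ADMM iteration (3a)-(3c) for min g(z) s.t. z in C.
import numpy as np

def admm(prox_g, proj_C, dim, rho=1.0, iters=200):
    z = np.zeros(dim)
    z_tilde = np.zeros(dim)
    for _ in range(iters):
        z_half = prox_g(z - z_tilde, rho)      # (3a) proximal step on g
        z = proj_C(z_half + z_tilde)           # (3b) projection onto C
        z_tilde = z_tilde + z_half - z         # (3c) dual update
    return z

# Example stand-ins: g(z) = 0.5*||z - t||^2 has prox (rho*v + t)/(rho + 1); C = [0, 1]^n.
t = np.array([2.0, -1.0, 0.4])
prox_g = lambda v, rho: (rho * v + t) / (rho + 1.0)
proj_C = lambda v: np.clip(v, 0.0, 1.0)
print(admm(prox_g, proj_C, dim=3))   # converges to approximately [1.0, 0.0, 0.4]
```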
Temporal logic: Linear temporal logic (LTL) formulas are defined by

  φ := p | ¬φ | φ ∨ φ | ◯φ | φ U φ,

where p ∈ AP is an atomic proposition, and ◯ and U are the temporal modal operators for "next" and "until". Additional temporal logic operators are derived from the basic ones: ♦φ := true U φ (eventually) and □φ := ¬♦¬φ (always). Given an MDP M, let AP be a finite set of atomic propositions and L : S → 2^AP be a labeling function that assigns to each state s ∈ S the set of atomic propositions L(s) ⊆ AP that are valid at s. L is extended to paths in the usual way, i.e., L(sρ) = L(s)L(ρ) for s ∈ S, ρ ∈ S^ω. A path ρ = s_0 s_1 ... ∈ S^ω satisfies a temporal logic formula φ if and only if L(ρ) satisfies φ. In an MDP M, a policy f induces a probability distribution over paths in S^ω. The probability of satisfying an LTL formula φ is the sum of the probabilities of all paths that satisfy φ in the induced Markov chain M_f.

Problem 1: Given an MDP M and an LTL formula φ, synthesize a policy that optimizes a quantitative performance measure with respect to the formula φ in the MDP M.

We consider the probability of satisfying a temporal logic formula as one quantitative performance measure. We also consider the expected frequency of satisfying certain recurrent properties specified in an LTL formula.

By formulating a product MDP that incorporates the underlying dynamics of a given MDP and its temporal logic specification, it can be shown that Problem 1 with different quantitative performance measures can be formulated, through pre-processing, as special cases of discounted-reward and average-reward problems [3], [17]. Thus, in the following, we first introduce decomposition-based distributed synthesis methods for large MDPs with discounted-reward and average-reward criteria. Then, we show the extension for solving MDPs with quantitative temporal logic constraints.

III. DECOMPOSITION OF AN MDP
A. Decomposition and its property
To exploit the modular structure of a given MDP, the initial step is to decompose the state space into small subsets of states, each of which can then be related to a small problem. In this section, we introduce some terminology for the decomposition of MDPs from [5].

Given an MDP M = ⟨S, A, u, P⟩, let Π be any partition of the state set S. That is, Π = {S_1, ..., S_N} ⊆ 2^S, ∅ ∉ Π, S_i ∩ S_j = ∅ for i ≠ j, and ∪_{i=1}^N S_i = S. A set in Π is called a region. The periphery of a region S_i is the set of states outside S_i, each of which can be reached with non-zero probability by taking some action from a state in S_i. Formally,

  Periphery(S_i) = {s' ∈ S \ S_i | ∃ (s, a) ∈ S_i × A, P(s, a)(s') > 0}.

Let K_0 = ∪_{i=1}^N Periphery(S_i). Given a region S_i ∈ Π, we call K_i = S_i \ K_0 the kernel of S_i. We denote by m_i the number of state-action pairs restricted to K_i, for each i = 0, ..., N; that is, m_i is the cardinality of the set {(s, a) | s ∈ K_i, a ∈ A(s)}. We call the partition {K_i | 0 ≤ i ≤ N} a decomposition of M. The following property of a decomposition is exploited in distributed optimization.

Lemma 1:
Given a decomposition {K_i, i = 0, 1, ..., N} obtained from a partition Π = {S_1, ..., S_N}, for a state s ∈ K_i with i ≠ 0, if there is a state s' and an action a such that P(s', a)(s) ≠ 0, then either s' ∈ K_0 or s' ∈ K_i.

Proof:
Suppose s' ∉ K_0 and s' ∉ K_i; then it must be the case that s' ∈ K_j for some j ≠ 0 and j ≠ i. Since from state s' ∈ S_j, after taking action a, the probability of reaching s ∈ S_i is non-zero, we conclude that s ∈ Periphery(S_j), which implies s ∈ K_0. This contradicts the fact that s ∈ K_i, since K_0 ∩ K_i = ∅. Hence, either s' ∈ K_i or s' ∈ K_0.

Example:
Consider the MDP in Figure 1, which is taken from [18]. The shaded region shows a partition
Π = {S_1, S_2} of the state space, in which S_1 contains three of the eight states and S_2 contains the remaining five. Then Periphery(S_1) consists of two states of S_2 and Periphery(S_2) consists of a single state of S_1. We obtain a decomposition of M as K_0 = ∪_{i=1}^{2} Periphery(S_i), K_1 = S_1 \ K_0, and K_2 = S_2 \ K_0. It can be observed that one of the periphery states is reached with non-zero probability only by actions taken from two other states, illustrating Lemma 1.

Fig. 1: Example of an MDP with states Q = {s_i, i = 0, ..., 7}, actions A = {α, β}, and transition probability function P as indicated.
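The peripheries and kernels of a given partition can be computed directly from the transition function. The sketch below does this for the illustrative MDP container introduced earlier; the function name and interface are assumptions for illustration, not the paper's implementation.

```python
# Sketch: computing the decomposition {K_0, K_1, ..., K_N} of Section III-A
# from a partition Pi = [S_1, ..., S_N].
def decompose(mdp, partition):
    peripheries = []
    for S_i in partition:
        S_i = set(S_i)
        peri = {sp
                for s in S_i for a in mdp.enabled(s)
                for sp, p in mdp.P[s][a].items()
                if p > 0 and sp not in S_i}
        peripheries.append(peri)
    K0 = set().union(*peripheries)                   # K_0: union of the peripheries
    kernels = [set(S_i) - K0 for S_i in partition]   # K_i = S_i \ K_0
    return [K0] + kernels
```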
B. A decomposition method for a subclass of MDPs

Various methods have been developed to derive a decomposition of an MDP, for example, decompositions based on partitioning the state space of an MDP according to the communicating classes in the induced graph (defined below) of that MDP (see the survey in [13]). For the distributed synthesis method developed in this paper, it will be shown in Section IV that the number of state-action pairs and the number of states in K_i are, respectively, the number of variables and the number of constraints in a sub-problem. Thus, we prefer a decomposition that meets one simple desirable property: for each i = 0, 1, ..., N, the number m_i of state-action pairs in K_i is small, in the sense that a classical linear programming algorithm can efficiently solve an MDP with this many state-action pairs.

Next, we propose a method that generates decompositions meeting this desirable property for a subclass of MDPs, namely MDPs whose induced graphs are planar (a graph is planar if it can be drawn in the plane in such a way that no two edges meet each other except at a vertex to which they are incident). It can be shown that MDPs derived from classical gridworld examples, which have many practical applications in robotic motion planning, belong to this subclass. We start by relating an MDP to a directed graph.

Definition 1:
The labeled digraph induced from an MDP M = ⟨S, A, u, P⟩ is a tuple G = ⟨S, E⟩, where S is the set of nodes and E ⊆ S × A × S is a set of labeled edges such that (s, a, s') ∈ E if and only if P(s, a)(s') > 0.

Let n = card(S) be the total number of nodes in the graph. A partition of the states in the MDP gives rise to a partition of the nodes in the graph. Given a partition Π and a region S_i ∈ Π, a node is said to be contained in S_i if some edge of the region is incident to the node. A node contained in more than one region is called a boundary node. That is, s ∈ S_i is a boundary node if and only if there exists (s, a, s') ∈ E or (s', a, s) ∈ E with s' ∉ S_i. Formally, the boundary nodes of S_i are B_i = In_i ∪ Out_i, where In_i = {s ∈ S_i | ∃ j ≠ i, s' ∈ S_j, a ∈ A(s'), (s', a, s) ∈ E} and Out_i = {s ∈ S_i | ∃ s' ∈ S \ S_i, a ∈ A(s), (s, a, s') ∈ E}. We define B = ∪_{i=1}^N B_i. Note that since ∪_{i=1}^N In_i = K_0, we have K_0 ⊆ B. We use the number of boundary nodes as an upper bound on the size of the set K_0.

Definition 2: [20] An r-division of an n-node graph is a partition of the nodes into O(n/r) subsets, each of which has O(r) nodes and O(√r) boundary nodes.

Reference [20] gives an algorithm that computes an r-division of a planar graph with n vertices in O(n log n) time.

Lemma 2:
Given a partition Π of an MDP with n states obtained with an r-division of the induced graph, the number of states in K_0 is upper bounded by O(n/√r) and the number of states in each K_i is upper bounded by O(r).

Proof:
Since each boundary node is contained in at most three regions and at least one region by the property of an r-division [20], the total number of boundary nodes is O(√r · n/r) = O(n/√r). The number of states in K_i is upper bounded by the size of S_i, which is O(r).

To obtain a decomposition, the user specifies an approximate upper bound on the number of variables for all sub-problems. The algorithm then decides whether there is an r-division, for some r, that gives rise to a decomposition with the desirable property.

Remark 1:
Although the decomposition method proposed here applies to a subclass of MDPs, the distributed synthesis method developed in this paper does not constrain the way in which a decomposition is obtained. A decomposition may be given, or obtained straightforwardly by exploiting the existing modular structure of the system. Even if a decomposition does not meet the desirable property for the distributed synthesis method, the proposed method still applies as long as each subproblem derived from that decomposition (see Section IV-C) can be solved given the limitations in memory and computational capacity.
IV. DISTRIBUTED SYNTHESIS: DISCOUNTED-REWARD AND AVERAGE-REWARD PROBLEMS
In this section, we show that under a decomposition, the original LP problem for the discounted-reward or average-reward case can be formulated as one with a sparse constraint matrix. Then, we employ the block splitting algorithm based on ADMM in [12] to solve the LP problem in a distributed manner.
A. Discounted-reward case
Given a decomposition {K_i | i = 0, 1, ..., N} of an MDP, let x_i be a vector consisting of the variables x(s, a) for all s ∈ K_i and all actions enabled at s. Let ι_i : K_i → {1, ..., card(K_i)} be an index function. The constraints in (1c) can be written as follows. For each s ∈ K_0,

  ∑_{a∈A(s)} x(s, a) = u(s) + γ · ∑_{i=0}^{N} ∑_{s'∈K_i} ∑_{a'∈A(s')} x(s', a') · P(s', a')(s),    (4)

and for each s ∈ K_i, i = 1, ..., N,

  ∑_{a∈A(s)} x(s, a) = u(s) + γ · ( ∑_{s'∈K_0} ∑_{a'∈A(s')} x(s', a') · P(s', a')(s) + ∑_{s'∈K_i} ∑_{a'∈A(s')} x(s', a') · P(s', a')(s) ).    (5)

Recall that, by Lemma 1, each s ∈ K_i with i ≠ 0 can only be reached with non-zero probability from states in K_0 and K_i. As a result, for each state s ∈ K_i with i ≠ 0 and each action a ∈ A(s), the constraint on the variable x(s, a) only involves the variables in x_i and x_0. Let x = (x_0, x_1, ..., x_N). We denote the number of variables in x_i by m_i and the number of states in the set K_i by n_i, and let m = ∑_{i=0}^N m_i. The LP problem in (1) is then

  min_{x ∈ R^m_+}  ∑_{j=0}^{N} c_j^T x_j,   subject to  A x = b,    (6)

where c_j^T x_j = ∑_{s∈K_j} ∑_{a∈A(s)} −R(s, a) x(s, a),

  A = [ A_00  A_01  A_02  ...  A_0N ;
        A_10  A_11                  ;
        A_20        A_22            ;
         ...              ...       ;
        A_N0                   A_NN ],    b = (b_0, b_1, ..., b_N),

with all blocks not shown equal to zero, and b_i ∈ R^{n_i} with b_i(k) = u(s) if ι_i(s) = k. The transformation from (4) and (5) to (6) is straightforward by rewriting the constraints, and we omit the details.
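To make the block structure of (6) concrete, the following sketch assembles the sparse constraint matrix A, the right-hand side b, and the cost vector c when both the variables and the constraints are ordered kernel by kernel (K_0 first). It assumes the illustrative MDP container and the decompose() sketch shown earlier.

```python
# Sketch: assembling the sparse constraint matrix of (6).
import numpy as np
from scipy.sparse import lil_matrix

def build_constraints(mdp, kernels, R, gamma):
    # kernels = [K_0, K_1, ..., K_N] as returned by decompose()
    pairs = [(s, a) for K in kernels for s in K for a in mdp.enabled(s)]
    col = {sa: j for j, sa in enumerate(pairs)}
    states = [s for K in kernels for s in K]
    row = {s: i for i, s in enumerate(states)}
    A = lil_matrix((len(states), len(pairs)))
    b = np.array([mdp.u0.get(s, 0.0) for s in states])
    for (sp, ap), j in col.items():
        A[row[sp], j] += 1.0                     # outflow term: sum_a x(s,a)
        for s, p in mdp.P[sp][ap].items():
            A[row[s], j] -= gamma * p            # discounted inflow into s
    c = np.array([-R[sa] for sa in pairs])       # objective of (6)
    return A.tocsr(), b, c
```

Ordering rows and columns kernel by kernel is what produces the bordered block structure of A shown above: by Lemma 1, a row associated with a state of K_i (i ≠ 0) has nonzero entries only in the columns of x_0 and x_i.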
B. Average-reward case

For an ergodic MDP, the constraints of the LP problem for maximizing the average reward, described by (2b), can be rewritten in the same way as (1c) is rewritten into (4) and (5) for the discounted-reward problem. The difference is that, for the average-reward case, we set γ = 1 and replace u(s) with 0, for all s ∈ S, in (4) and (5). An additional constraint for the average-reward case is that ∑_{s∈S} ∑_{a∈A(s)} x(s, a) = 1. Hence, for an average-reward problem in an ergodic MDP, the corresponding LP problem in (2) is formulated as

  min_{x ∈ R^m_+}  ∑_{j=0}^{N} c_j^T x_j,   subject to  [ 1^T ; A ] x = [ 1 ; 0 ],    (7)

where 1^T is a row vector of m ones, c_j^T x_j = ∑_{s∈K_j} ∑_{a∈A(s)} −R(s, a) x(s, a), and the block matrix A has the same structure as in the discounted case with γ = 1. We can compactly write the constraint in (7) as A' x = b', where A' is a sparse constraint matrix similar in structure to the matrix A of the discounted-reward case.

C. Distributed optimization algorithm

We solve the LP problems in (6) and (7) by employing the block splitting algorithm based on ADMM in [12]. We only present the algorithm for the discounted-reward case (6); the extension to the average-reward case is straightforward. First, we introduce new variables y_i and let f_i(y_i) = I_{{b_i}}(y_i), where, for a convex set C, I_C is the indicator function defined by I_C(x) = 0 for x ∈ C and I_C(x) = ∞ for x ∉ C. Adding the term f_i(y_i) to the objective function thus enforces y_i = b_i. Let g_i(x_i) = c_i^T x_i + I_{R^{m_i}_+}(x_i); the term I_{R^{m_i}_+}(x_i) enforces that x_i is a non-negative vector. We rewrite the LP problem in (6) as

  min_{x, y}  ∑_{i=0}^{N} f_i(y_i) + ∑_{i=0}^{N} g_i(x_i)
  subject to  y_0 = ∑_{i=0}^{N} A_{0i} x_i,   and, for i = 1, ..., N,  y_i = A_{i0} x_0 + A_{ii} x_i.    (8)

With this formulation, we modify the block splitting algorithm in [12] to solve (8) in a parallel and distributed manner (see the Appendix for details). The algorithm takes parameters ρ, ε_rel and ε_abs: ρ > 0 is a penalty parameter used to ensure that the constraints are satisfied, ε_rel > 0 is a relative tolerance, and ε_abs > 0 is an absolute tolerance. The choice of ε_rel and ε_abs depends on the scale of the variable values; in synthesis of MDPs they are typically chosen to be small. The algorithm is guaranteed to converge for any choice of ρ, and the value of ρ may affect the convergence rate.
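The two proximal operators needed by the splitting (8) have simple closed forms, restated in the sketch below (names chosen for illustration only; see the Appendix for how they enter the full iteration).

```python
# Sketch: the two ingredients of the splitting (8).
import numpy as np

def prox_g(v, c_i, rho):
    # argmin_x  c_i^T x + I_{x >= 0}(x) + (rho/2)*||x - v||^2
    # = projection onto the non-negative orthant of the shifted point.
    return np.maximum(v - c_i / rho, 0.0)

def prox_f(v, b_i):
    # argmin_y  I_{{b_i}}(y) + (rho/2)*||y - v||^2  =  b_i, regardless of v.
    return b_i
```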
V. EXTENSION TO QUANTITATIVE TEMPORAL LOGIC CONSTRAINTS

We now extend the distributed control synthesis methods for MDPs with discounted-reward and average-reward criteria to solve Problem 1, in which quantitative temporal logic constraints are enforced.
A. Maximizing the probability of satisfying an LTL specification
Preliminaries
Given an LTL formula φ as the system specification, one can always represent it by a deterministic Rabin automaton (DRA) A_φ = ⟨Q, 2^AP, T, I, Acc⟩, where Q is a finite state set, 2^AP is the alphabet, I ∈ Q is the initial state, and T : Q × 2^AP → Q is the transition function. The acceptance condition Acc is a set of tuples {(J_i, H_i) ∈ 2^Q × 2^Q | i = 1, ..., ℓ}. The run for an infinite word w = σ_0 σ_1 ... ∈ (2^AP)^ω is the infinite sequence of states q_0 q_1 ... ∈ Q^ω, where q_0 = I and q_{i+1} = T(q_i, σ_i) for i ≥ 0. A run ρ = q_0 q_1 ... is accepted by A_φ if there exists at least one pair (J_i, H_i) ∈ Acc such that Inf(ρ) ∩ J_i = ∅ and Inf(ρ) ∩ H_i ≠ ∅, where Inf(ρ) is the set of states that appear infinitely often in ρ.

Given an MDP M = ⟨S, A, u, P⟩ augmented with a set AP of atomic propositions and a labeling function L : S → 2^AP, one can compute the product MDP M_φ = M ⋉ A_φ = ⟨V, A, ∆, µ, Acc⟩ with the components defined as follows: V = S × Q is the set of states; A is the set of actions; the initial probability distribution of states is µ : V → [0, 1] such that, for v = (s, q) with q = T(I, L(s)), µ(v) = u(s); and ∆ : V × A → D(V) is the transition probability function, where, for v = (s, q), an action σ, and v' = (s', q') with q' = T(q, L(s')), we let ∆(v, σ)(v') = P(s, σ)(s'). The Rabin acceptance condition is Acc = {(Ĵ_i, Ĥ_i) | Ĵ_i = S × J_i, Ĥ_i = S × H_i, i = 1, ..., ℓ}.

By construction, a path ρ = v_0 v_1 ... ∈ V^ω satisfies the LTL formula φ if and only if there exists i ∈ {1, ..., ℓ} such that Inf(ρ) ∩ Ĵ_i = ∅ and Inf(ρ) ∩ Ĥ_i ≠ ∅. To maximize the probability of satisfying φ, the first step is to compute the set of end components in M_φ, each of which is a pair (W, f) where W ⊆ V is non-empty and f : W → 2^A is a function such that, for any v ∈ W and any a ∈ f(v), ∑_{v'∈W} ∆(v, a)(v') = 1 and the induced directed graph (W, →_f) is strongly connected. Here, v →_f v' is an edge in the graph if there exists a ∈ f(v) with ∆(v, a)(v') > 0. An end component (W, f) is accepting if W ∩ Ĵ_i = ∅ and W ∩ Ĥ_i ≠ ∅ for some i ∈ {1, ..., ℓ}.

Let the set of accepting end components (AECs) in M_φ be AEC(M_φ) and the set of accepting end states be C = {v | ∃ (W, f) ∈ AEC(M_φ), v ∈ W}. Once we enter some state v ∈ C, we can find an AEC (W, f) with v ∈ W and initiate the policy f such that, for some i ∈ {1, ..., ℓ}, the states in Ĵ_i are visited a finite number of times and some state in Ĥ_i is visited infinitely often.

Formulating the LP problem

An optimal policy that maximizes the probability of satisfying the specification also maximizes the probability of hitting the set C of accepting end states. Reference [21] develops GPU-based parallel algorithms that significantly speed up the computation of end components for large MDPs. After computing the set of AECs, we formulate the following LP problem to compute the optimal policy using the proposed decomposition and distributed synthesis method for discounted-reward cases.

Given a product MDP M_φ = ⟨V, A, ∆, µ, Acc⟩ and the set C of accepting end states, the modified product MDP is M̃ = ⟨(V \ C) ∪ {sink}, A, ∆̃, µ̃, R⟩, where (V \ C) ∪ {sink} is the set of states obtained by grouping the states in C into a single state sink. For all a ∈ A, ∆̃(sink, a)(sink) = 1 and ∆̃(v, a)(sink) = ∑_{v'∈C} ∆(v, a)(v'). The initial distribution µ̃ of states is defined as follows: for v ∈ V \ C, µ̃(v) = µ(v), and µ̃(sink) = ∑_{v∈C} µ(v). The reward function R : ((V \ C) ∪ {sink}) × A → R is defined such that, for all v other than sink, R(v, a) = ∑_{v'∈(V\C)∪{sink}} ∆̃(v, a)(v') · 1_{{sink}}(v'), where 1_X(v) is the indicator function that outputs 1 if and only if v ∈ X and 0 otherwise. For any action a ∈ A(sink), R(sink, a) = 0.

By the definition of the reward function, the discounted reward with γ = 1 from a state v in the modified product MDP M̃ under a policy f is the probability of reaching a state in C from v under f in the product MDP M_φ. Hence, with a decomposition of M̃, the proposed distributed synthesis method for discounted-reward problems can be used to compute the policy that maximizes the probability of satisfying a given LTL specification.

B. Average reward under Büchi acceptance conditions
Preliminaries
Consider a temporal logic formula φ that can be expressed as a deterministic Büchi automaton (DBA) A_φ = ⟨Q, 2^AP, T, I, F_φ⟩, where Q, 2^AP, T, I are defined as for a DRA and F_φ ⊆ Q is a set of accepting states. A run ρ is accepted by A_φ if and only if Inf(ρ) ∩ F_φ ≠ ∅. Given an MDP M = ⟨S, A, u, P, AP, L⟩ and a DBA A_φ = ⟨Q, 2^AP, T, I, F_φ⟩, the product MDP with Büchi objective is M_φ = M ⋉ A_φ = ⟨V, A, ∆, µ, F⟩, where the components V, A, ∆, µ are obtained as in the product MDP with Rabin objective. The difference is that F ⊆ S × F_φ is the set of accepting states. A path ρ = v_0 v_1 ... ∈ V^ω satisfies the LTL formula φ if and only if Inf(ρ) ∩ F ≠ ∅.

Formulating the LP problem

For a product MDP M_φ with Büchi objective, we aim to synthesize a policy that maximizes the expected frequency of visiting an accepting state in the product MDP M_φ = M ⋉ A_φ. This type of objective ensures that certain recurrent properties in the temporal logic formula are satisfied as frequently as possible; for example, one such objective is to require a mobile robot to maximize the frequency of visiting some critical regions. The objective can be formulated as an average-reward problem in the following way. Let the reward function R : V × A → R be defined by R(v, a) = ∑_{v'∈V} ∆(v, a)(v') · 1_F(v'). By the definition of the reward function, the optimal policy with respect to the average-reward criterion is the one that maximizes the frequency of visiting a state in F. If the product MDP is ergodic, we can then solve the resulting average-reward problem with the distributed optimization algorithm, given a decomposition of the product MDP M_φ.
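As an illustration, the sketch below forms the product of an MDP with a DBA and the frequency reward R(v, a) = ∑_{v'} ∆(v, a)(v') · 1_F(v'). The DBA (its state set, transition function, and accepting set) is assumed to come from an external LTL-to-automaton translation; all names are illustrative and not the paper's implementation.

```python
# Sketch: product MDP with Buechi objective and frequency reward (Section V-B).
def product_with_dba(mdp, label, dba_states, dba_trans, accepting):
    """States are pairs (s, q); the reward of (v, a) is the probability that
    the next product state is accepting."""
    Delta, Rwd = {}, {}
    for s in mdp.states:
        for q in dba_states:
            v = (s, q)
            for a in mdp.enabled(s):
                succ = {}
                for sp, p in mdp.P[s][a].items():
                    vp = (sp, dba_trans(q, label(sp)))
                    succ[vp] = succ.get(vp, 0.0) + p
                Delta[(v, a)] = succ
                Rwd[(v, a)] = sum(p for (_, q2), p in succ.items()
                                  if q2 in accepting)
    return Delta, Rwd
```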
VI. CASE STUDIES

We demonstrate the method with three robot motion planning examples. All the experiments were run on a machine with an Intel Xeon 4 GHz, 8-core CPU and 64 GB RAM running Linux. The distributed optimization algorithm is implemented in MATLAB; the decomposition and other operations are implemented in Python.

Figure 2a shows a fraction of a gridworld. A robot moves in this gridworld with uncertainty that depends on the terrain ('grass', 'sand', 'gravel' and 'pavement'). In each terrain and for each of the robot's actions (heading north ('N'), south ('S'), west ('W') and east ('E')), the probability of arriving at the intended cell depends on the terrain; with the remaining, relatively small, probability the robot arrives at a cell adjacent to the intended one. Figure 2b displays a larger gridworld in which the grey area and the boundary are walls. If the robot runs into a wall, it bounces back to its original cell. The walls give rise to a natural partition of the state space, as demonstrated in the figure. If no explicit modular structure in the system can be found, one can compute a decomposition using the method in Section III-B. In the following example, the wall pattern is the same as in the gridworld of Figure 2b.
Fig. 2: (a) A fraction of an m × n gridworld. The dashed arrow represents that, if the robot takes action 'N', there are non-zero probabilities for it to arrive at the NW, N, and NE cells. (b) A gridworld in which a natural partition of the state space using the walls gives rise to the subsets of states K_0, K_1, K_2, K_3, K_4. States in K_0 are enclosed by squares.

A. Discounted-reward case
We select a subset W of cells as a "restricted area" and a subset G of cells as "targets". The reward function is given as follows: for s ∉ G ∪ W, R(s, a) is a small negative constant that accounts for the time the robot spends taking action a; for s ∈ W, R(s, a) is a large negative penalty for all a ∈ A(s); and for s ∈ G, R(s, a) = 100 for all a ∈ A(s). Intuitively, this reward function encourages the robot to reach a target in as few expected steps as possible while avoiding the cells in the restricted area. We select a discounting factor γ ∈ (0, 1).

Case 1:
To show the convergence and correctness of the distributed optimization algorithm, we first consider a gridworld example that can be solved directly with a centralized algorithm. Since at each cell there are four actions for the robot to choose from, the total number of variables is four times the number of cells (the wall cells are excluded from the set of states). In this gridworld there is only one target cell, and the restricted area consists of a few cells. The resulting LP problem (1) can be solved using CVX, a package for specifying and solving convex programs [22]; we treat the CVX solution as the reference.

Next, we solve the same problem by decomposing the state space of the MDP along the walls into regions, each of which is a smaller gridworld. In what follows, we select ρ ∈ {80, 100, 200, 500, 1000} to show the convergence of the distributed optimization algorithm. Irrespective of the choice of ρ, the average time per iteration is essentially the same. The solution accuracy relative to CVX is summarized in Table I: the 'rel. error in objval' is the relative error in the objective value attained, treating the CVX solution as the accurate one, and the infeasibility is the relative primal infeasibility of the solution, measured by ‖Ax* − b‖₂/‖b‖₂. Figure 3 shows the convergence of the optimization algorithm.
Case 2:
For a much larger gridworld, the centralized method in CVX fails to produce a solution to this large-scale problem.

TABLE I: Relative resolution accuracy for solving the gridworld example of Case 1 with discounted reward.

  ρ               80      100     200     500     1000
  Iterations      12001   11014   13868   11866   12733
  objval          9.54    9.90    9.89    9.96    9.96
  rel. error (%)  4.6     1.0     1.1     0.38    0.38

Fig. 3: Relative error in the objective value versus iterations in the gridworld of Case 1 with discounted reward, under ρ = 1000. For clarity, the relative error for the initial steps, which is comparatively large, is not drawn.

We therefore solve the large-scale problem using the decomposition and distributed synthesis method. We partition the gridworld so that each region is a block of cells of equal size, and we randomly select a set of cells to be the targets and another set to be the restricted area. Choosing ρ = 1000, the distributed algorithm computes an (approximately) optimal policy; Figure 4a shows the convergence of the distributed optimization algorithm.

B. Average-reward case with quantitative temporal logic objectives

We consider a gridworld with no obstacles and four critical regions labeled "R1", "R2", "R3" and "R4". The system is given the temporal logic specification φ := □♦(R1 ∧ ♦R2) ∧ □♦(R3 ∧ ♦R4); that is, the robot has to always eventually visit region R1 and then R2, and also always eventually visit region R3 and then R4. The number of states in the corresponding DBA is reduced after trimming the unreachable states, owing to the fact that the robot cannot be at two cells simultaneously. The quantitative objective is to maximize the frequency of visiting all four regions, i.e., the frequency of visiting an accepting state in the DBA. The formulated product MDP is ergodic, and therefore our method for average-reward problems applies.

For the average-reward case, we need to satisfy the constraint 1^T x = 1 in (7). This constraint leads to slow convergence and to policies with large infeasibility measures in distributed optimization. To handle this issue, we approximate the average reward by a discounted reward [23]: for ergodic MDPs, the discounted accumulated reward, scaled by 1 − γ, is approximately the average reward. Further, if the effective horizon 1/(1 − γ) is large compared to the mixing time [15] of the Markov chain, then the policy that optimizes the discounted accumulated reward with discounting factor γ achieves an approximately optimal average reward.

Fig. 4: Objective value versus iterations in (a) the large gridworld of Case 2 (the initial steps are omitted) and (b) the gridworld with a Büchi objective. In (b), we only show the first iterations, as the objective value converges to the optimal one shortly afterwards.

Given ρ = 1000, suitable tolerances ε_rel and ε_abs, and a discounting factor γ close to one, the distributed synthesis algorithm terminates and yields the optimal discounted reward.
Scaling the optimal discounted reward by 1 − γ gives the corresponding average reward, which is approximately the optimal value for the average-reward criterion under the obtained policy. The convergence result is shown in Figure 4b.

VII. CONCLUSION
For solving large Markov decision process models of stochastic systems with temporal logic specifications, we developed a decomposition algorithm and a distributed synthesis method. The decomposition exploits the modularity in the system structure and yields sub-problems of smaller sizes. We employed the block splitting algorithm in distributed optimization, based on the alternating direction method of multipliers, to cope with the difficulty of combining the solutions of the sub-problems into a solution of the original problem. Moreover, the formal decomposition-based distributed control synthesis framework established in this paper facilitates the application of other distributed and parallel large-scale optimization algorithms [24] to further improve the rate of convergence and the feasibility of solutions for control synthesis in large MDPs. In the future, we will develop an interface to the PRISM toolbox [8] with an implementation of the proposed decomposition and distributed synthesis algorithms.

APPENDIX
At the k-th iteration, for i, j = 0, ..., N,

  y_i^{k+1/2}  := prox_{f_i}(y_i^k − ỹ_i^k) = b_i,
  x_j^{k+1/2}  := prox_{g_j}(x_j^k − x̃_j^k) = proj_{R^{m_j}_+}(x_j^k − x̃_j^k − c_j/ρ),
  (x_{ij}^{k+1/2}, y_{ij}^{k+1/2}) := proj_{ij}(x_j^k − x̃_{ij}^k, y_{ij}^k + ỹ_i^k),
  x_j^{k+1}    := avg(x_j^{k+1/2}, {x_{ij}^{k+1/2}}_{i=0}^N),
  (y_i^{k+1}, {y_{ij}^{k+1}}_{j=0}^N) := exch(y_i^{k+1/2}, {y_{ij}^{k+1/2}}_{j=0}^N),        if i = 0,
  (y_i^{k+1}, {y_{i0}^{k+1}, y_{ii}^{k+1}}) := exch(y_i^{k+1/2}, {y_{i0}^{k+1/2}, y_{ii}^{k+1/2}}),   if i = 1, ..., N,
  x̃_j^{k+1}    := x̃_j^k + x_j^{k+1/2} − x_j^{k+1},
  ỹ_i^{k+1}    := ỹ_i^k + y_i^{k+1/2} − y_i^{k+1},
  x̃_{ij}^{k+1} := x̃_{ij}^k + x_{ij}^{k+1/2} − x_j^{k+1},

where proj_{R^{m_j}_+} denotes the projection onto the non-negative orthant, proj_{ij} denotes the projection onto {(x, y) | y = A_{ij} x}, avg is the elementwise average (since the block A_{ij} is zero for some i, j, the corresponding x_{ij}^{k+1/2} are not included in the averaging), and exch is the exchange operator, i.e., the projection onto the consensus set {(y, y_1, ..., y_N) | y = ∑_{j=1}^N y_j}: exch(c, {c_j}_{j=1}^N) returns y_j := c_j + (c − ∑_{l=1}^N c_l)/(N + 1) for each j and y := c − (c − ∑_{l=1}^N c_l)/(N + 1). The variables can be initialized to zero at k = 0. Note that the computation in each iteration can be parallelized. The iteration terminates when the stopping criterion for the block splitting algorithm is met (see [12] for more details). The solution is obtained as x* = (x_0^{k+1/2}, ..., x_N^{k+1/2}).

REFERENCES

[1] Z. Manna and A. Pnueli, The Temporal Logic of Reactive and Concurrent Systems: Specifications. Springer, 1992, vol. 1.
[2] S. Thiébaux, C. Gretton, J. K. Slaney, D. Price, F. Kabanza, and others, "Decision-theoretic planning with non-Markovian rewards," Journal of Artificial Intelligence Research, vol. 25, pp. 17–74, 2006.
[3] C. Baier and J.-P. Katoen, Principles of Model Checking. MIT Press, 2008.
[4] H. J. Kushner and C.-H. Chen, "Decomposition of systems governed by Markov chains," IEEE Transactions on Automatic Control, vol. 19, no. 5, pp. 501–507, 1974.
[5] T. Dean and S.-H. Lin, "Decomposition techniques for planning in stochastic domains," in Proceedings of the 14th International Joint Conference on Artificial Intelligence, Volume 2. Morgan Kaufmann Publishers Inc., 1995, pp. 1121–1127.
[6] V. Krishnamurthy, C. Rojas, and B. Wahlberg, "Computing monotone policies for Markov decision processes by exploiting sparsity," in Australian Control Conference, Nov. 2013, pp. 1–6.
[7] A. G. Barto and S. Mahadevan, "Recent advances in hierarchical reinforcement learning," Discrete Event Dynamic Systems, vol. 13, no. 4, pp. 341–379, 2003.
[8] M. Kwiatkowska, G. Norman, and D. Parker, "PRISM 4.0: Verification of probabilistic real-time systems," in Proceedings of the International Conference on Computer Aided Verification, ser. LNCS, G. Gopalakrishnan and S. Qadeer, Eds., vol. 6806. Springer, 2011, pp. 585–591.
[9] X. C. Ding, S. L. Smith, C. Belta, and D. Rus, "MDP optimal control under temporal logic constraints," in IEEE Conference on Decision and Control and European Control Conference, 2011, pp. 532–538.
[10] E. M. Wolff, U. Topcu, and R. M. Murray, "Optimal control with weighted average costs and temporal logic specifications," in Robotics: Science and Systems, 2012.
[11] M. Lahijanian, S. Andersson, and C. Belta, "Temporal logic motion planning and control with probabilistic satisfaction guarantees," IEEE Transactions on Robotics, vol. 28, no. 2, pp. 396–409, Apr. 2012.
[12] N. Parikh and S. Boyd, "Block splitting for distributed optimization," Mathematical Programming Computation, vol. 6, no. 1, pp. 77–102, Oct. 2013.
[13] C. Daoui, M. Abbad, and M. Tkiouat, "Exact decomposition approaches for Markov decision processes: A survey," Advances in Operations Research, vol. 2010, pp. 1–19, 2010.
[14] D. P. Bertsekas, Dynamic Programming and Optimal Control, Vol. II, 3rd ed. Athena Scientific, 2007.
[15] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2009, vol. 414.
[16] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011.
[17] T. Brázdil, V. Brožek, K. Chatterjee, V. Forejt, and A. Kučera, "Two views on multiple mean-payoff objectives in Markov decision processes," in Annual IEEE Symposium on Logic in Computer Science, 2011, pp. 33–42.
[18] C. Baier, M. Größer, M. Leucker, B. Bollig, and F. Ciesinski, "Controller synthesis for probabilistic systems (extended abstract)," in Exploring New Frontiers of Theoretical Informatics, ser. IFIP International Federation for Information Processing, J.-J. Levy, E. Mayr, and J. Mitchell, Eds. Springer US, 2004, vol. 155, pp. 493–506.
[19] H. J. Kushner and P. Dupuis, Numerical Methods for Stochastic Control Problems in Continuous Time. Springer, 2001, vol. 24.
[20] G. N. Frederickson, "Fast algorithms for shortest paths in planar graphs, with applications," SIAM Journal on Computing, vol. 16, no. 6, pp. 1004–1022, 1987.
[21] A. Wijs, J.-P. Katoen, and D. Bošnački, "GPU-based graph decomposition into strongly connected and maximal end components," in Computer Aided Verification, ser. Lecture Notes in Computer Science, A. Biere and R. Bloem, Eds. Springer International Publishing, 2014, vol. 8559, pp. 310–326.
[22] M. Grant and S. Boyd, "CVX: MATLAB software for disciplined convex programming, version 2.1," Mar. 2014.
[23] S. Kakade, "Optimizing average reward using discounted rewards," in Computational Learning Theory. Springer, 2001, pp. 605–615.
[24] G. Scutari, F. Facchinei, L. Lampariello, and P. Song, "Parallel and distributed methods for nonconvex optimization," in