Optimal Control of MDPs with Temporal Logic Constraints
Mária Svoreňová, Ivana Černá and Calin Belta
Abstract — In this paper, we focus on formal synthesis of control policies for finite Markov decision processes with non-negative real-valued costs. We develop an algorithm to automatically generate a policy that guarantees the satisfaction of a correctness specification expressed as a formula of Linear Temporal Logic, while at the same time minimizing the expected average cost between two consecutive satisfactions of a desired property. The existing solutions to this problem are sub-optimal. By leveraging ideas from automata-based model checking and game theory, we provide an optimal solution. We demonstrate the approach on an illustrative example.
I. INTRODUCTION

Markov Decision Processes (MDP) are probabilistic models widely used in various areas, such as economics, biology, and engineering. In robotics, they have been successfully used to model the motion of systems with actuation and sensing uncertainty, such as ground robots [17], unmanned aircraft [21], and surgical steering needles [1]. MDPs are central to control theory [4], probabilistic model checking and synthesis in formal methods [3], [9], and game theory [13].

MDP control is a well studied area (see, e.g., [4]). The goal is usually to optimize the expected value of a cost over a finite time (e.g., the stochastic shortest path problem) or an average expected cost in infinite time (e.g., the average cost per stage problem). Recently, there has been increasing interest in developing MDP control strategies from rich specifications given as formulas of probabilistic temporal logics, such as Probabilistic Computation Tree Logic (PCTL) and Probabilistic Linear Temporal Logic (PLTL) [12], [17]. It is important to note that both optimal control and temporal logic control problems for MDPs have their counterpart in automata game theory. Specifically, optimal control translates to solving 1½-player games with payoff functions, such as discounted-payoff and mean-payoff games [6]. Temporal logic control for MDPs corresponds to solving 1½-player games with parity objectives [2].

Our aim is to optimize the behavior of a system subject to correctness (temporal logic) constraints. Such a connection between optimal and temporal logic control is an intriguing problem with potentially high impact in several applications. Consider, for example, a mobile robot involved in a persistent surveillance mission in a dangerous area under tight fuel or time constraints. The correctness requirement is expressed as a temporal logic specification, e.g., "Keep visiting A and then B and always avoid C". The resource constraints translate to minimizing a cost function over the feasible trajectories of the robot. Motivated by such applications, in this paper we focus on correctness specifications given as LTL formulae and optimization objectives expressed as average expected cumulative costs per surveillance cycle (ACPC).

The main contribution of this work is to provide a sound and complete solution to the above problem. This paper can be seen as an extension of [18], [19], [11], [8]. In [18], we focused on deterministic transition systems and developed a finite-horizon online planner to provably satisfy an LTL constraint while optimizing the behavior of the system between every two consecutive satisfactions of a given proposition. We extended this framework in [19], where we provided an algorithm to optimize the long-term average behavior of deterministic transition systems with time-varying events of known statistics. The closest to this work is [11], where the authors focus on a problem of optimal LTL control of MDPs with real-valued costs on actions. The correctness specification is assumed to include a persistent surveillance task and the goal is to minimize the long-term expected average cost between successive visits of the locations under surveillance. Using dynamic programming techniques, the authors design a solution that is sub-optimal in the general case. In [8], it is shown that, for a certain fragment of LTL, the solution becomes optimal.

M. Svoreňová and I. Černá are with the Faculty of Informatics, Masaryk University, Brno, Czech Republic, [email protected], [email protected]. C. Belta is with the Department of Mechanical Engineering and the Division of Systems Engineering, Boston University, Boston, MA, USA, [email protected]. This work was partially supported at Masaryk University by grants GAP202/11/0312, LH11065, and at Boston University by ONR grants MURI N00014-09-1051, MURI N00014-10-10952 and by NSF grant CNS-1035588.
By using recent results from game theory [5], in this paper we provide an optimal solution for full LTL.

The rest of the paper is organized as follows. In Sec. II we introduce the notation and provide necessary definitions. The problem is formulated in Sec. III. The main algorithms together with discussions on their complexity are presented in Sec. IV. Finally, Sec. V contains experimental results.

II. PRELIMINARIES
For a set S, we use S^ω and S^+ to denote the set of all infinite and all non-empty finite sequences of elements of S, respectively. For a finite sequence τ = a_0 … a_n ∈ S^+, we use |τ| = n + 1 to denote the length of τ. For 0 ≤ i ≤ n, τ(i) = a_i and τ^(i) = a_0 … a_i is the finite prefix of τ of length i + 1. We use the same notation for an infinite sequence from the set S^ω.

A. MDP Control

Definition 1: A Markov decision process (MDP) is a tuple M = (S, A, P, AP, L, g), where S is a non-empty finite set of states, A is a non-empty finite set of actions, P : S × A × S → [0, 1] is a transition probability function such that for every state s ∈ S and action α ∈ A it holds that Σ_{s′∈S} P(s, α, s′) ∈ {0, 1}, AP is a finite set of atomic propositions, L : S → 2^AP is a labeling function, and g : S × A → R≥0 is a cost function. An initialized Markov decision process is an MDP M = (S, A, P, AP, L, g) with a distinctive initial state s_init ∈ S.

An action α ∈ A is called enabled in a state s ∈ S if Σ_{s′∈S} P(s, α, s′) = 1. With a slight abuse of notation, A(s) denotes the set of all actions enabled in a state s. We assume A(s) ≠ ∅ for every s ∈ S.

A run of an MDP M is an infinite sequence of states ρ = s_0 s_1 … ∈ S^ω such that for every i ≥ 0, there exists α_i ∈ A(s_i) with P(s_i, α_i, s_{i+1}) > 0. We use Run^M(s) to denote the set of all runs of M that start in a state s ∈ S. Let Run^M = ∪_{s∈S} Run^M(s). A finite run σ = s_0 … s_n ∈ S^+ of M is a finite prefix of a run in M, and Run^M_fin(s) denotes the set of all finite runs of M starting in a state s ∈ S. Let Run^M_fin = ∪_{s∈S} Run^M_fin(s). The length |σ| = n + 1 of a finite run σ = s_0 … s_n is also referred to as the number of stages of the run. The last state of σ is denoted by last(σ) = s_n. The word induced by a run ρ = s_0 s_1 … of M is the infinite sequence L(s_0) L(s_1) … ∈ (2^AP)^ω. Similarly, a finite run of M induces a finite word from the set (2^AP)^+.

Definition 2:
Let M = (S, A, P, AP, L, g) be an MDP. An end component (EC) of the MDP M is an MDP N = (S_N, A_N, P|_N, AP, L|_N, g|_N) such that ∅ ≠ S_N ⊆ S and ∅ ≠ A_N ⊆ A; for every s ∈ S_N and α ∈ A_N(s), it holds that {s′ ∈ S | P(s, α, s′) > 0} ⊆ S_N; and for every pair of states s, s′ ∈ S_N, there exists a finite run σ ∈ Run^N_fin(s) such that last(σ) = s′. We use P|_N to denote the function P restricted to the sets S_N and A_N. Similarly, we use L|_N and g|_N with the obvious meaning. If the context is clear, we only use P, L, g instead of P|_N, L|_N, g|_N. An EC N of M is called maximal (MEC) if there is no EC N′ = (S_N′, A_N′, P, AP, L, g) of M such that N′ ≠ N, S_N ⊆ S_N′ and A_N(s) ⊆ A_N′(s) for every s ∈ S_N. The set of all end components and maximal end components of M are denoted by EC(M) and MEC(M), respectively.

The number of ECs of an MDP M can be up to exponential in the number of states of M, and they can intersect. On the other hand, MECs are pairwise disjoint, and every EC is contained in a single MEC. Hence, the number of MECs of M is bounded by the number of states of M.

Definition 3:
Let M = (S, A, P, AP, L, g) be an MDP. A control strategy for M is a function C : Run^M_fin → A such that for every σ ∈ Run^M_fin it holds that C(σ) ∈ A(last(σ)).

A strategy C for which C(σ) = C(σ′) for all finite runs σ, σ′ ∈ Run^M_fin with last(σ) = last(σ′) is called memoryless. In that case, we consider C to be a function C : S → A. A strategy is called finite-memory if it is defined as a tuple C = (M, act, ∆, start), where M is a finite set of modes, ∆ : M × S → M is a transition function, act : M × S → A selects the action to be applied, and start : S → M selects the starting mode for every s ∈ S.

A run ρ = s_0 s_1 … ∈ Run^M of an MDP M is called a run under a strategy C for M if for every i ≥ 0, it holds that P(s_i, C(ρ^(i)), s_{i+1}) > 0. A finite run under C is a finite prefix of a run under C. The sets of all infinite and finite runs of M under C starting in a state s ∈ S are denoted by Run^{M,C}(s) and Run^{M,C}_fin(s), respectively. Let Run^{M,C} = ∪_{s∈S} Run^{M,C}(s) and Run^{M,C}_fin = ∪_{s∈S} Run^{M,C}_fin(s).

Let M be an MDP, s a state of M, and C a strategy for M. The following probability measure is used to argue about the possible outcomes of applying C in M starting from s. Let σ ∈ Run^{M,C}_fin(s) be a finite run. The cylinder set Cyl(σ) of σ is the set of all runs of M under C that have σ as a finite prefix. There exists a unique probability measure Pr^{M,C}_s on the σ-algebra generated by the set of cylinder sets of all runs in Run^{M,C}_fin(s). For σ = s_0 … s_n ∈ Run^{M,C}_fin(s), it holds that

  Pr^{M,C}_s(Cyl(σ)) = ∏_{i=0}^{n−1} P(s_i, C(σ^(i)), s_{i+1})

and Pr^{M,C}_s(Cyl(s)) = 1. Intuitively, given a subset X ⊆ Run^{M,C}(s), Pr^{M,C}_s(X) is the probability that a run of M under C that starts in s belongs to the set X.

The following properties hold for any MDP M (see, e.g., [3]).
For every EC N of M, there exists a finite-memory strategy C for M such that M under C, starting from any state of N, never visits a state outside N and all states of N are visited infinitely many times with probability 1. On the other hand, for any strategy C (finite-memory or not), a state s of M, and a run ρ of M under C that starts in s, the set of states visited infinitely many times by ρ forms an end component. Let ec ⊆ EC(M) be the set of all ECs of M that correspond, in the above sense, to at least one run under the strategy C that starts in the state s. We say that the strategy C leads M from the state s to the set ec.

B. Linear Temporal Logic

Definition 4: Linear Temporal Logic (LTL) formulae over a set AP of atomic propositions are formed according to the following grammar:

  φ ::= true | a | ¬φ | φ ∧ φ | X φ | φ U φ | G φ | F φ,

where a ∈ AP is an atomic proposition, ¬ and ∧ are standard Boolean connectives, and X (next), U (until), G (always), and F (eventually) are temporal operators.

Formulae of LTL are interpreted over the words from (2^AP)^ω, such as those induced by runs of an MDP M (for details see, e.g., [3]). For example, a word w ∈ (2^AP)^ω satisfies G φ and F φ if φ holds in w always and eventually, respectively. If the word induced by a run ρ ∈ Run^M satisfies a formula φ, we say that the run ρ satisfies φ. With slight abuse of notation, we also use states or sets of states of the MDP as propositions in LTL formulae.

For every LTL formula φ, the set of all runs of M that satisfy φ is measurable in the probability measure Pr^{M,C}_s for any C and s [3]. With slight abuse of notation, we use LTL formulae as arguments of Pr^{M,C}_s. If for a state s ∈ S it holds that Pr^{M,C}_s(φ) = 1, we say that the strategy C almost-surely satisfies φ starting from s.
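As a side note, the cylinder-set probability Pr^{M,C}_s(Cyl(σ)) defined in Sec. II-A is simply a product of transition probabilities. The following is a minimal sketch, not taken from the paper; the two-state MDP, the transition table P, and the memoryless strategy C are invented for illustration:

```python
# Hedged sketch: cylinder-set probability of a finite run under a
# memoryless strategy, as the product of transition probabilities.

# Transition probability function: (state, action) -> {successor: prob}.
P = {
    ("s0", "a"): {"s0": 0.5, "s1": 0.5},
    ("s1", "b"): {"s0": 1.0},
}

# A memoryless strategy C : S -> A.
C = {"s0": "a", "s1": "b"}

def cylinder_probability(sigma):
    """Probability of all runs extending the finite run sigma under C."""
    prob = 1.0
    for s, s_next in zip(sigma, sigma[1:]):
        prob *= P[(s, C[s])].get(s_next, 0.0)
    return prob

print(cylinder_probability(["s0"]))              # Pr(Cyl(s)) = 1
print(cylinder_probability(["s0", "s0", "s1"]))  # 0.5 * 0.5 = 0.25
```

A run of length one has probability 1, matching Pr^{M,C}_s(Cyl(s)) = 1 above.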
If M is an initialized MDP and Pr^{M,C}_{s_init}(φ) = 1, we say that C almost-surely satisfies φ.

The LTL control synthesis problem for an initialized MDP M and an LTL formula φ over AP aims to find a strategy for M that almost-surely satisfies φ. This problem can be solved using principles from probabilistic model checking [3], [12]. The algorithm itself is based on the translation of φ to a Rabin automaton and the analysis of an MDP that combines the Rabin automaton and the original MDP M.

Definition 5: A deterministic Rabin automaton (DRA) is a tuple A = (Q, 2^AP, δ, q_0, Acc), where Q is a non-empty finite set of states, 2^AP is an alphabet, δ : Q × 2^AP → Q is a transition function, q_0 ∈ Q is an initial state, and Acc ⊆ 2^Q × 2^Q is an accepting condition.

A run of A is a sequence q_0 q_1 … ∈ Q^ω such that for every i ≥ 0, there exists A_i ∈ 2^AP with δ(q_i, A_i) = q_{i+1}. We say that the word A_0 A_1 … ∈ (2^AP)^ω induces the run q_0 q_1 …. A run of A is called accepting if there exists a pair (B, G) ∈ Acc such that the run visits every state from B only finitely many times and at least one state from G infinitely many times.

For every LTL formula φ over AP, there exists a DRA A_φ such that all and only words from (2^AP)^ω satisfying φ induce an accepting run of A_φ [14]. For translation algorithms see, e.g., [16], and their online implementations, e.g., [15].

Definition 6:
Let M = (S, A, P, AP, L, g) be an initialized MDP and A = (Q, 2^AP, δ, q_0, Acc) be a DRA. The product of M and A is the initialized MDP P = (S_P, A, P_P, AP_P, L_P, g_P), where S_P = S × Q; P_P((s, q), α, (s′, q′)) = P(s, α, s′) if q′ = δ(q, L(s)) and 0 otherwise; AP_P = Q; L_P((s, q)) = q; and g_P((s, q), α) = g(s, α). The initial state of P is s_P,init = (s_init, q_0).

Using the projection on the first component, every (finite) run of P projects to a (finite) run of M and, vice versa, for every (finite) run of M there exists a (finite) run of P that projects to it. An analogous correspondence exists between strategies for P and M, and the projection of a finite-memory strategy for P is also finite-memory. More importantly, for the product P of an MDP M and a DRA A_φ for an LTL formula φ, the probability of satisfying the accepting condition Acc of A_φ under a strategy C_P for P starting from the initial state s_P,init, i.e.,

  Pr^{P,C_P}_{s_P,init}( ⋁_{(B,G)∈Acc} ( FG(¬B) ∧ GF G ) ),

is equal to the probability of satisfying the formula φ in the MDP M under the projected strategy C starting from the initial state s_init.

Definition 7:
Let P = (S_P, A, P_P, AP_P, L_P, g_P) be the product of an MDP M and a DRA A. An accepting end component (AEC) of P is defined as an end component N = (S_N, A_N, P_P, AP_P, L_P, g_P) of P for which there exists a pair (B, G) in the accepting condition of A such that L_P(S_N) ∩ B = ∅ and L_P(S_N) ∩ G ≠ ∅. We say that N is accepting with respect to the pair (B, G). An AEC N = (S_N, A_N, P_P, AP_P, L_P, g_P) is called maximal (MAEC) if there is no AEC N′ = (S_N′, A_N′, P_P, AP_P, L_P, g_P) such that N′ ≠ N, S_N ⊆ S_N′, A_N((s, q)) ⊆ A_N′((s, q)) for every (s, q) ∈ S_P, and N and N′ are accepting with respect to the same pair. We use AEC(P) and MAEC(P) to denote the set of all accepting end components and maximal accepting end components of P, respectively.

Note that MAECs that are accepting with respect to the same pair are always disjoint. However, MAECs that are accepting with respect to different pairs can intersect.

From the discussion above it follows that a necessary condition for almost-sure satisfaction of the accepting condition Acc by a strategy C_P for P is that there exists a set maec ⊆ MAEC(P) of MAECs such that C_P leads the product from the initial state to maec.

III. PROBLEM FORMULATION
Consider an initialized MDP M = (S, A, P, AP, L, g) and a specification given as an LTL formula φ over AP of the form

  φ = ϕ ∧ GF π_sur,   (1)

where π_sur ∈ AP is an atomic proposition and ϕ is an LTL formula over AP. Intuitively, a formula of this form states two partial goals: a mission goal ϕ and a surveillance goal GF π_sur. To satisfy the whole formula, the system must accomplish the mission and visit the surveillance states S_sur = {s ∈ S | π_sur ∈ L(s)} infinitely many times. The motivation for this form of specification comes from applications in robotics, where persistent surveillance tasks are often a part of the specification. Note that the form in Eq. (1) does not restrict the full LTL expressivity, since every LTL formula φ can be translated into a formula of the form in Eq. (1) that is associated with the same set of runs of M. Explicitly, φ ≡ φ ∧ GF π_sur, where π_sur is such that π_sur ∈ L(s) for every state s ∈ S.

In this work, we focus on a control synthesis problem where the goal is to almost-surely satisfy a given LTL specification, while optimizing a long-term quantitative objective. The objective is to minimize the average expected cumulative cost between consecutive visits to surveillance states.

Formally, we say that every visit to a surveillance state completes a surveillance cycle. In particular, starting from the initial state, the first visit to S_sur completes the first surveillance cycle of a run. We use ♯(σ) to denote the number of completed surveillance cycles in a finite run σ, plus one. For a strategy C for M, the cumulative cost in the first n stages of applying C to M starting from a state s ∈ S is

  g^{M,C}(s, n) = Σ_{i=0}^{n} g( σ^{M,C}_{s,n}(i), C( (σ^{M,C}_{s,n})^{(i)} ) ),

where σ^{M,C}_{s,n} is the random variable whose values are finite runs of length n + 1 from the set Run^{M,C}_fin(s), and the probability of a finite run σ is Pr^{M,C}_s(Cyl(σ)). Note that g^{M,C}(s, n) is also a random variable.
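To make the quantities g^{M,C}(s, n) and ♯(σ) concrete, here is a small hedged simulation sketch; the two-state deterministic MDP, its costs, and the strategy are our own assumptions for illustration and do not appear in the paper:

```python
# Hedged sketch: cumulative cost over n stages and the surveillance-cycle
# count #(sigma) (completed cycles plus one) for a deterministic MDP
# under a memoryless strategy.

P = {  # deterministic transitions: (state, action) -> successor
    ("s0", "go"): "s1",
    ("s1", "back"): "s0",
}
g = {("s0", "go"): 2.0, ("s1", "back"): 3.0}  # cost function g(s, alpha)
C = {"s0": "go", "s1": "back"}                # memoryless strategy
SUR = {"s0"}                                  # surveillance states S_sur

def simulate(s, n):
    """Return (cumulative cost over n stages, completed cycles + 1)."""
    cost, cycles = 0.0, 0
    for _ in range(n):
        a = C[s]
        cost += g[(s, a)]
        s = P[(s, a)]
        if s in SUR:      # entering a surveillance state completes a cycle
            cycles += 1
    return cost, cycles + 1

cost, sharp = simulate("s0", 10)
print(cost, sharp)       # 25.0 and 6: five completed cycles of cost 5, plus one
print(cost / sharp)      # per-cycle ratio; tends to 5.0 as n grows
```

As n grows, the ratio of these two quantities approaches the cost of one surveillance cycle (here 2 + 3 = 5), which is the per-cycle average formalized next.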
Finally, we define the average expected cumulative cost per surveillance cycle (ACPC) in the MDP M under a strategy C as a function V^{M,C} : S → R≥0 such that for a state s ∈ S,

  V^{M,C}(s) = lim sup_{n→∞} E( g^{M,C}(s, n) / ♯(σ^{M,C}_{s,n}) ).

The problem we consider in this paper can be formally stated as follows.

Problem 1:
Let M = (S, A, P, AP, L, g) be an initialized MDP and φ be an LTL formula over AP of the form in Eq. (1). Find a strategy C for M such that C almost-surely satisfies φ and, at the same time, C minimizes the ACPC value V^{M,C}(s_init) among all strategies almost-surely satisfying φ.

The above problem was recently investigated in [11]. However, the solution presented by the authors is guaranteed to find an optimal strategy only if every MAEC N of the product P of the MDP M and the DRA for the specification satisfies certain conditions (for details see [11]). In this paper, we present a solution to Problem 1 that always finds an optimal strategy if one exists. The algorithm is based on principles from probabilistic model checking [3] and game theory [5], whereas the authors in [11] mainly use results from dynamic programming [4].

In the special case when every state of M is a surveillance state, Problem 1 aims to find a strategy that minimizes the average expected cost per stage among all strategies almost-surely satisfying φ. The problem of minimizing the average expected cost per stage (ACPS) in an MDP, without considering any correctness specification, is a well studied problem in optimal control, see, e.g., [4]. There always exists a stationary strategy that minimizes the ACPS value starting from the initial state. In our approach to Problem 1, we use techniques for solving the ACPS problem to find a strategy that minimizes the ACPC value.

IV. SOLUTION
Let M = (S, A, P, AP, L, g) be an initialized MDP and φ an LTL formula over AP of the form in Eq. (1). To solve Problem 1 for M and φ, we leverage ideas from game theory [5] and construct an optimal strategy for M as a combination of a strategy that ensures the almost-sure satisfaction of the specification φ and a strategy that guarantees the minimum ACPC value among all strategies that do not cause immediate unrepairable violation of φ.

The algorithm we present in this section works with the product P = (S_P, A, P_P, AP_P, L_P, g_P) of the MDP M and a deterministic Rabin automaton A_φ = (Q, 2^AP, δ, q_0, Acc) for the formula φ. We inherit the notion of a surveillance cycle in P by adding the proposition π_sur to the set AP_P and to the set L_P((s, q)) for every (s, q) ∈ S_P such that π_sur ∈ L(s). Using the correspondence between strategies for P and M, an optimal strategy C for M is found as a projection of a strategy C_P for P which almost-surely satisfies the accepting condition Acc of A_φ and, at the same time, minimizes the ACPC value V^{P,C_P}(s_P,init) among all strategies for P that almost-surely satisfy Acc.

Since C_P must almost-surely satisfy the accepting condition Acc, it leads from the initial state of P to a set of MAECs. For every MAEC N, the minimum ACPC value V*_N((s, q)) that can be obtained in N starting from a state (s, q) ∈ S_N is equal for all the states of N, and we denote this value V*_N. The strategy C_P is constructed in two steps. First, we find a set maec* of MAECs of P and a strategy C_0 that leads P from the initial state to the set maec*. We require that C_0 and maec* minimize the weighted average of the values V*_N for N ∈ maec*.
The strategy C_P applies C_0 from the initial state until P enters the set maec*. Second, we solve the problem of how to control the product once a state of an MAEC N ∈ maec* is visited. Intuitively, we combine two finite-memory strategies: C^φ_N for the almost-sure satisfaction of the accepting condition Acc, and C^V_N for maintaining the average expected cumulative cost per surveillance cycle. To satisfy both objectives, the strategy C_P is played in rounds. In each round, we first apply the strategy C^φ_N and then the strategy C^V_N, each for a specific (finite) number of steps.

A. Finding an optimal set of MAECs
Let MAEC(P) be the set of all MAECs of the product P, which can be computed as follows. For every pair (B, G) ∈ Acc, we create a new MDP from P by removing all its states with label in B and the corresponding actions. For the new MDP, we use one of the algorithms in [10], [9], [7] to compute the set of all its MECs. Finally, for every MEC, we check whether it contains a state with label in G.

In this section, the aim is to find a set maec* ⊆ MAEC(P) and a strategy C_0 for P that satisfy conditions formally stated below. Since the strategy C_0 will only be used to enter the set maec*, it is constructed as a partial function.

Definition 8: A partial strategy ζ for an MDP M is a partial function ζ : Run^M_fin → A, where if ζ(σ) is defined for σ ∈ Run^M_fin, then ζ(σ) ∈ A(last(σ)).

A partial stationary strategy for M can also be considered as a partial function ζ : S → A, or a subset ζ ⊆ S × A. The set Run^{M,ζ} of runs of M under ζ contains all infinite runs of M that follow ζ and all those finite runs σ of M under ζ for which ζ(last(σ)) is not defined. A finite run of M under ζ is then a finite prefix of a run under ζ. The probability measure Pr^{M,ζ}_s is defined in the same manner as in Sec. II-A. We also extend the semantics of LTL formulas to finite words. For example, a formula FG φ is satisfied by a finite word if in some non-empty suffix of the word φ always holds.

The conditions on maec* and C_0 are as follows. First, the partial strategy C_0 leads P to the set maec*, i.e.,

  Pr^{P,C_0}_{s_P,init}( FG( ∪_{N∈maec*} S_N ) ) = 1.   (2)

Second, we require that maec* and C_0 minimize the value

  Σ_{N∈maec*} Pr^{P,C_0}_{s_P,init}( FG S_N ) · V*_N.   (3)

The procedure to compute the optimal ACPC value V*_N for an MAEC N of P is described in the next section. Assume we have already computed this value for each MAEC of P. The algorithm to find the set maec* and the partial strategy C_0 is based on an algorithm for the stochastic shortest path (SSP) problem.
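For intuition about the SSP machinery used in the remainder of this subsection, here is a minimal value-iteration sketch for an SSP instance. The tiny MDP, its costs, and the terminal state below are our own assumptions for illustration; this is not the construction of P′ itself:

```python
# Hedged sketch: value iteration for a stochastic shortest path (SSP)
# problem with an absorbing, cost-free terminal state "t".

P = {  # (state, action) -> {successor: probability}
    ("s0", "a"): {"s0": 0.5, "t": 0.5},
    ("s0", "b"): {"t": 1.0},
}
g = {("s0", "a"): 1.0, ("s0", "b"): 3.0}  # action costs

def ssp_value_iteration(states, terminal, eps=1e-12):
    """Expected cost-to-go J(s) and a stationary optimal strategy."""
    J = {s: 0.0 for s in states}
    while True:
        J_new = {}
        for s in states:
            if s == terminal:
                J_new[s] = 0.0
                continue
            # Bellman update: minimize immediate cost plus expected cost-to-go.
            J_new[s] = min(
                g[(s, a)] + sum(p * J[s2] for s2, p in succ.items())
                for (s1, a), succ in P.items() if s1 == s
            )
        if max(abs(J_new[s] - J[s]) for s in states) < eps:
            break
        J = J_new
    # Extract a stationary strategy achieving the minimum.
    strategy = {}
    for s in states:
        if s == terminal:
            continue
        strategy[s] = min(
            (g[(s, a)] + sum(p * J[s2] for s2, p in succ.items()), a)
            for (s1, a), succ in P.items() if s1 == s
        )[1]
    return J, strategy

J, strat = ssp_value_iteration(["s0", "t"], "t")
print(J["s0"], strat["s0"])  # expected cost 2.0 via action "a"
```

Here action "a" yields the fixed point J(s0) = 1 + 0.5·J(s0), i.e. 2, beating the direct action "b" of cost 3; the resulting optimal strategy is stationary, as noted below.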
The SSP problem is one of the basic optimization problems for MDPs. Given an initialized MDP and a state t, the goal is to find a strategy under which the MDP almost-surely reaches the state t, the so-called terminal state, while minimizing the expected cumulative cost. If there exists at least one strategy almost-surely reaching the terminal state, then there exists a stationary optimal strategy. For details and algorithms see, e.g., [4].

The partial strategy C_0 and the set maec* are computed as follows. First, we create a new MDP P′ from P by considering only those states of P that can reach the set MAEC(P) with probability 1 and their corresponding actions. The MDP P′ can be computed using backward reachability from the set MAEC(P). If P′ does not contain the initial state s_P,init, there exists no solution to Problem 1. Otherwise, we add a new state t and, for every MAEC N ∈ MAEC(P′) = MAEC(P), a new action α_N to P′. From each state (s, q) ∈ S_N, N ∈ MAEC(P′), we define a transition under α_N to t with probability 1 and set its cost to V*_N. All other costs in the MDP are set to 0. Finally, we solve the SSP problem for P′ with the state t as the terminal state. Let C_SSP be the resulting stationary optimal strategy for P′. For every (s, q) ∈ S_P, we define C_0((s, q)) = C_SSP((s, q)) if the action C_SSP((s, q)) does not lead from (s, q) to t; otherwise, C_0((s, q)) is undefined. The set maec* is the set of all MAECs N for which there exists a state (s, q) such that C_SSP((s, q)) = α_N.

Proposition 1:
The set maec* and the partial stationary strategy C_0 resulting from the above algorithm satisfy the conditions in Eq. (2) and Eq. (3).

Proof: Both conditions follow directly from the fact that the strategy C_SSP is an optimal solution to the SSP problem for P′ and t.

B. Optimizing ACPC value in an MAEC
In this section, we compute the minimum ACPC value V*_N that can be attained in an MAEC N ∈ MAEC(P) and construct the corresponding strategy for N. Essentially, we reduce the problem of computing the minimum ACPC value to the problem of computing the minimum ACPS value, by reducing N to an MDP in which every state is labeled with the surveillance proposition π_sur.

Let N = (S_N, A_N, P_P, AP_P, L_P, g_P) be an MAEC of P. Since it is an MAEC, there exists a state (s, q) ∈ S_N with π_sur ∈ L_P((s, q)). Let S_N,sur denote the set of all such states in S_N. We reduce N to an MDP

  N_sur = (S_N,sur, A_sur, P_sur, AP_P, L_P, g_sur)

using Alg. 1. For the sake of readability, we use singletons such as v instead of pairs such as (s, q) to denote the states of N. The MDP N_sur is constructed from N by eliminating the states from S_N \ S_N,sur one by one, in arbitrary order. The actions in A_sur are partial stationary strategies for N in which we remember all the states and actions we eliminated. Later we prove that the transition probability P_sur(v, ζ, v′) for states v, v′ ∈ S_N,sur and an action ζ ∈ A_sur(v) is the probability that in N under the partial stationary strategy ζ, starting from the state v, the next state visited from the set S_N,sur is v′, i.e., the first surveillance cycle is completed by visiting v′. The cost g_sur(v, ζ) is the expected cumulative cost gained in N using the partial stationary strategy ζ from v until we reach a state in S_N,sur.

In Fig. 1, we demonstrate the reduction on an example using the notation introduced in Alg. 1. On the left side, we see a part of an MAEC N with five states and two actions. First, we build an MDP X = (S_X, A_X, P_X, AP_P, L_P, g_X) from N by transforming every action of every state into a partial stationary strategy with a single pair given by the state and the action. The MDP X is used in the algorithm as an auxiliary MDP to store the current version of the reduced system. Assume we want to reduce the state v.
We consider all "incoming" and "outgoing" actions of v and combine them pairwise as follows. There is only one outgoing action from v in X, namely ζ, and only one incoming action, namely the action ζ_old of the state v_from. Since ζ and ζ_old do not conflict as partial stationary strategies on any state of N, we merge them to create a new partial stationary strategy ζ_new that is an action of v_from. The transition probability P_X(v_from, ζ_new, v_to) for a state v_to of X is computed as the sum of the transition probability P_X(v_from, ζ_old, v_to) of transiting from v_from to v_to using the old action ζ_old, and the probability of entering v_to by first transiting from v_from to v using ζ_old and from v eventually reaching v_to using ζ. The cost g_X(v_from, ζ_new) is the expected cumulative cost gained starting from v_from by first applying the action ζ_old and, if we transit to v, applying ζ until a state different from v is reached. Now that we have considered every pair of an incoming and an outgoing action of v, the state v and its incoming and outgoing actions are eliminated. The modified MDP X is depicted on the right side of Fig. 1.

Proposition 2:
Let N = (S_N, A_N, P_P, AP_P, L_P, g_P) be an MAEC and N_sur = (S_N,sur, A_sur, P_sur, AP_P, L_P, g_sur) its reduction resulting from Alg. 1. The minimum ACPC value that can be attained in N_sur starting from any of its states is the same, and we denote it V*_N,sur. There exists a stationary strategy C^V_N,sur for N_sur that attains this value regardless of the starting state in N_sur. Both V*_N,sur and C^V_N,sur can be computed as a solution to the ACPS problem for N_sur. It holds that V*_N = V*_N,sur, and from C^V_N,sur one can construct a finite-memory strategy C^V_N for N which, regardless of the starting state in N, attains the optimal ACPC value V*_N.

Proof:
We prove the following correspondence between N and N_sur. For every v, v′ ∈ S_N,sur and ζ ∈ A_sur(v), it holds that ζ is a well-defined partial stationary strategy for N. The transition probability P_sur(v, ζ, v′) is the probability that in N, when applying ζ starting from v, the first surveillance cycle is completed by visiting v′, i.e.,

  P_sur(v, ζ, v′) = Pr^{N,ζ}_v( X( ¬S_N,sur U v′ ) ).

The cost g_sur(v, ζ) is the expected cumulative cost gained in N when applying ζ starting from v until the first surveillance cycle is completed. On the other hand, for every partial stationary strategy ζ for N such that Pr^{N,ζ}_v( F S_N,sur ) = 1 for some v ∈ S_N,sur, there exists an action ζ′ ∈ A_sur(v) such that ζ′ corresponds to the partial stationary strategy ζ in the above sense, i.e.,

  P_sur(v, ζ′, v′) = Pr^{N,ζ}_v( X( ¬S_N,sur U v′ ) )

for every v′ ∈ S_N,sur, and the cost g_sur(v, ζ′) is the expected cumulative cost gained in N when we apply ζ starting from v until we reach a state in S_N,sur.

[Fig. 1: Illustration of Alg. 1. A part of an MAEC N is shown on the left. An auxiliary MDP X is constructed by transforming the actions of N to partial stationary strategies. The MDP X after eliminating the state v is shown on the right. The costs associated with actions are depicted in blue.]

To prove the first part of the correspondence above, we prove the following invariant of Alg. 1. Let X = (S_X, A_X, P_X, AP_P, L_P, g_X) be the MDP from the algorithm after the initialization, before the first iteration of the while cycle.
It is easy to see that all actions of $X$ are well-defined partial stationary strategies. For the transition probabilities, it holds that
\[ P_X(v_{from}, \zeta, v_{to}) = \Pr^{\mathcal{N},\zeta}_{v_{from}}\big(\mathbf{X}(\neg S_X\, \mathbf{U}\, v_{to})\big) \]
for every $v_{from}, v_{to} \in S_X$ and $\zeta \in A_X(v_{from})$. The cost $g_X(v_{from}, \zeta)$ is the expected cumulative cost gained in $\mathcal{N}$ starting from $v_{from}$ when applying $\zeta$ until we reach a state in $S_X$. We show that these conditions also hold after every iteration of the while cycle.

Let $X$ satisfy the conditions above and let $v \in S_X \setminus S_{\mathcal{N}_{sur}}$. By removing the state $v$ from $S_X$, we obtain a new version of the MDP, $X' = (S_{X'}, A_{X'}, P_{X'}, AP_\mathcal{P}, L_\mathcal{P}, g_{X'})$. Note that $S_{X'} \cup \{v\} = S_X$. Let $v_{from} \in S_{X'}$ be a state of $X'$ and $\zeta_{new} \in A_{X'}(v_{from})$ one of its actions such that $\zeta_{new}$ has changed in the process of removing the state $v$. The action $\zeta_{new}$ is a well-defined partial stationary strategy because it must have been created as the union of an action $\zeta_{old}$ of $v_{from}$ and an action $\zeta$ of $v$, both from the previous version $X$, which do not conflict on any state from $S_X$.

Let $X' \to v_{to}$ denote the LTL formula $\mathbf{X}(\neg S_{X'}\, \mathbf{U}\, v_{to})$. For a state $v_{to} \in S_{X'}$, we prove that
\[ P_{X'}(v_{from}, \zeta_{new}, v_{to}) = \Pr^{\mathcal{N},\zeta_{new}}_{v_{from}}(X' \to v_{to}). \]
Since $\zeta_{new} = \zeta_{old} \cup \zeta$, the probability in $\mathcal{N}$, when applying $\zeta_{new}$ starting from $v_{from}$, of reaching the state $v_{to}$ as the next state in $S_{X'}$ is the probability of reaching it as the next state in $S_X$ when using $\zeta_{old}$ from $v_{from}$, plus the probability of reaching $v$ as the next state in $S_X$ from $v_{from}$ using $\zeta_{old}$ and then eventually reaching the state $v_{to}$ from $v$ using $\zeta$. This means
\begin{align*}
\Pr^{\mathcal{N},\zeta_{new}}_{v_{from}}(X' \to v_{to}) &= \Pr^{\mathcal{N},\zeta_{old}}_{v_{from}}(X \to v_{to}) + \Pr^{\mathcal{N},\zeta_{old}}_{v_{from}}(X \to v) \cdot \Pr^{\mathcal{N},\zeta}_v(\mathbf{F}\, v_{to}) \\
&= P_X(v_{from}, \zeta_{old}, v_{to}) + P_X(v_{from}, \zeta_{old}, v) \cdot \Big( \sum_{i=0}^{\infty} P_X(v, \zeta, v)^i \cdot P_X(v, \zeta, v_{to}) \Big) \\
&= P_X(v_{from}, \zeta_{old}, v_{to}) + P_X(v_{from}, \zeta_{old}, v) \cdot \frac{P_X(v, \zeta, v_{to})}{1 - P_X(v, \zeta, v)},
\end{align*}
which is exactly as defined in Alg. 1.

Similarly, we prove that $g_{X'}(v_{from}, \zeta_{new})$ is the expected cumulative cost gained in $\mathcal{N}$ starting from $v_{from}$ when applying $\zeta_{new}$ until we reach a state in $S_{X'}$. As $\zeta_{new} = \zeta_{old} \cup \zeta$, it is the expected cumulative cost of reaching a state in $S_X$ by using $\zeta_{old}$ plus, in the case we reach $v$, the expected cumulative cost of eventually reaching a state in $S_{X'}$, i.e., other than $v$, using $\zeta$. To be specific, we have
\begin{align*}
g_{X'}(v_{from}, \zeta_{new}) &= g_X(v_{from}, \zeta_{old}) + P_X(v_{from}, \zeta_{old}, v) \cdot \Big( \sum_{i=0}^{\infty} P_X(v, \zeta, v)^i \cdot (1 - P_X(v, \zeta, v)) \cdot (i+1) \cdot g_X(v, \zeta) \Big) \\
&= g_X(v_{from}, \zeta_{old}) + P_X(v_{from}, \zeta_{old}, v) \cdot \frac{g_X(v, \zeta)}{1 - P_X(v, \zeta, v)},
\end{align*}
just as defined in Alg. 1.
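As a quick numerical sanity check on these closed forms, the following sketch compares the truncated geometric series against the closed-form updates used in Alg. 1. All probability and cost values are hypothetical, chosen only for illustration.

```python
# Numerical check (hypothetical values) of the elimination formulas:
# the closed forms in Alg. 1 sum a geometric series over the number i
# of self-loops taken at the removed state v.

p_old_to = 0.2   # P_X(v_from, zeta_old, v_to)
p_old_v  = 0.5   # P_X(v_from, zeta_old, v)
p_self   = 0.6   # P_X(v, zeta, v), self-loop probability at v (< 1)
p_v_to   = 0.4   # P_X(v, zeta, v_to)
g_old    = 3.0   # g_X(v_from, zeta_old)
g_v      = 2.0   # g_X(v, zeta)

# Closed forms from Alg. 1.
p_new = p_old_to + p_old_v * p_v_to / (1 - p_self)
g_new = g_old + p_old_v * g_v / (1 - p_self)

# The same quantities as truncated infinite series.
p_series = p_old_to + p_old_v * sum(p_self**i * p_v_to for i in range(500))
g_series = g_old + p_old_v * sum(
    p_self**i * (1 - p_self) * (i + 1) * g_v for i in range(500)
)

assert abs(p_new - p_series) < 1e-9
assert abs(g_new - g_series) < 1e-9
```

With these values, both the probability ($0.7$) and the cost ($5.5$) agree with the series to within truncation error.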
This completes the proof of the first part of the correspondence between $\mathcal{N}$ and $\mathcal{N}_{sur}$.

The second part of the correspondence between $\mathcal{N}$ and $\mathcal{N}_{sur}$ follows directly from the fact that, in the process of removing a state $v \in S_X \setminus S_{\mathcal{N}_{sur}}$, we consider all combinations of actions of $v$ that eventually reach a state different from $v$ with all actions of all states $v_{from}$ having an action under which $v$ is reached with non-zero probability.

From the correspondence between $\mathcal{N}$ and $\mathcal{N}_{sur}$, it follows that in $\mathcal{N}_{sur}$ there exists a finite run between every two states. Therefore, the minimum ACPC value that can be obtained in $\mathcal{N}_{sur}$ from any of its states is the same, and it is denoted by $V^*_{\mathcal{N}_{sur}}$.

Algorithm 1: Reduction of an MAEC $\mathcal{N}$ to $\mathcal{N}_{sur}$
Input: $\mathcal{N} = (S_\mathcal{N}, A_\mathcal{N}, P_\mathcal{P}, AP_\mathcal{P}, L_\mathcal{P}, g_\mathcal{P})$
Output: $\mathcal{N}_{sur} = (S_{\mathcal{N}_{sur}}, A_{sur}, P_{sur}, AP_\mathcal{P}, L_\mathcal{P}, g_{sur})$
  let $X = (S_X, A_X, P_X, AP_\mathcal{P}, L_\mathcal{P}, g_X)$ be an MDP where
    • $S_X := S_\mathcal{N}$,
    • for $v \in S_X$: $A_X(v) := \{\zeta_\alpha \mid \zeta_\alpha = \{(v, \alpha)\},\ \alpha \in A_\mathcal{N}(v)\}$,
    • for $v, v' \in S_X$, $\zeta \in A_X$: $P_X(v, \zeta, v') := P_\mathcal{P}(v, \zeta(v), v')$,
    • for $v \in S_X$, $\zeta \in A_X$: $g_X(v, \zeta) := g_\mathcal{P}(v, \zeta(v))$
  while $S_X \setminus S_{\mathcal{N}_{sur}} \neq \emptyset$ do
    let $v \in S_X \setminus S_{\mathcal{N}_{sur}}$
    for all $\zeta \in A_X(v)$ do
      if $P_X(v, \zeta, v) < 1$ then
        for all $v_{from} \in S_X$, $\zeta_{old} \in A_X(v_{from})$ do
          if $P_X(v_{from}, \zeta_{old}, v) > 0$ and $\zeta_{old}, \zeta$ do not conflict for any state from $S_X$ then
            $\zeta_{new} := \zeta_{old} \cup \zeta$
            add $\zeta_{new}$ to $A_X(v_{from})$
            for every $v_{to} \in S_X$:
              $P_X(v_{from}, \zeta_{new}, v_{to}) := P_X(v_{from}, \zeta_{old}, v_{to}) + P_X(v_{from}, \zeta_{old}, v) \cdot \frac{P_X(v, \zeta, v_{to})}{1 - P_X(v, \zeta, v)}$
            $g_X(v_{from}, \zeta_{new}) := g_X(v_{from}, \zeta_{old}) + P_X(v_{from}, \zeta_{old}, v) \cdot \frac{g_X(v, \zeta)}{1 - P_X(v, \zeta, v)}$
            remove $\zeta_{old}$ from $A_X(v_{from})$
          end if
        end for
      end if
      remove $\zeta$ from $A_X(v)$
    end for
    remove $v$ from $S_X$
  end while
  return $X$

Since every state of $\mathcal{N}_{sur}$ is a surveillance state, the ACPC problem for $\mathcal{N}_{sur}$ is equivalent to solving the ACPS problem for $\mathcal{N}_{sur}$.
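The elimination step inside the while cycle of Alg. 1 can be sketched compactly. The following is a minimal illustration, not the authors' implementation: the dictionary encoding of $X$, the state names, and the toy example are all assumptions made for this sketch.

```python
# A minimal sketch of one iteration of the while cycle of Alg. 1.
# An action of the auxiliary MDP X is a partial stationary strategy,
# encoded as a frozenset of (state, action) pairs; P[(s, zeta)] maps
# successor states to probabilities, and g[(s, zeta)] is the expected
# cost of applying zeta from s until the next state in S_X.

def eliminate(P, g, v):
    """Remove state v from X, rerouting transitions that enter v through
    v's own actions, using the geometric-series closed forms of Alg. 1."""
    # Actions of v that eventually leave v (self-loop probability < 1).
    exits = [z for (s, z) in list(P) if s == v and P[(s, z)].get(v, 0.0) < 1.0]
    for (v_from, z_old) in [k for k in P if k[0] != v]:
        p_to_v = P[(v_from, z_old)].get(v, 0.0)
        if p_to_v == 0.0:
            continue
        for zeta in exits:
            # Skip combinations that conflict on some shared state.
            shared = dict(z_old).keys() & dict(zeta).keys()
            if any(dict(z_old)[s] != dict(zeta)[s] for s in shared):
                continue
            p_self = P[(v, zeta)].get(v, 0.0)
            z_new = z_old | zeta
            dist = {u: p for u, p in P[(v_from, z_old)].items() if u != v}
            for u, p in P[(v, zeta)].items():
                if u != v:
                    dist[u] = dist.get(u, 0.0) + p_to_v * p / (1.0 - p_self)
            P[(v_from, z_new)] = dist
            g[(v_from, z_new)] = (g[(v_from, z_old)]
                                  + p_to_v * g[(v, zeta)] / (1.0 - p_self))
        # z_old reaches the removed state v, so it is replaced, as in Alg. 1.
        del P[(v_from, z_old)], g[(v_from, z_old)]
    for key in [k for k in P if k[0] == v]:   # drop v's own actions
        del P[key], g[key]

# Toy example: surveillance state 's', state 'v' to be eliminated.
za = frozenset({('s', 'a')})   # from 's': go to 'v' surely, cost 1
zb = frozenset({('v', 'b')})   # from 'v': loop w.p. 0.5, to 's' w.p. 0.5, cost 2
P = {('s', za): {'v': 1.0}, ('v', zb): {'v': 0.5, 's': 0.5}}
g = {('s', za): 1.0, ('v', zb): 2.0}
eliminate(P, g, 'v')

z_new = frozenset({('s', 'a'), ('v', 'b')})
assert P == {('s', z_new): {'s': 1.0}}
assert g[('s', z_new)] == 5.0   # 1 + (expected 2 visits of v) * 2
```

In the example, the combined strategy reaches `'s'` with probability 1 and expected cost $1 + 0.5 \cdot 2 / 0.5 = 5$, matching the closed form.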
Using one of the algorithms in [4], we obtain a stationary strategy $C^V_{\mathcal{N}_{sur}}$ that attains the ACPC value $V^*_{\mathcal{N}_{sur}}$ regardless of the starting state. From the correspondence between $\mathcal{N}$ and $\mathcal{N}_{sur}$, it also follows that $V^*_{\mathcal{N}_{sur}} = V^*_\mathcal{N}$.

Now we construct the strategy $C^V_\mathcal{N}$ for $\mathcal{N}$ and show that it attains the minimum ACPC value $V^*_\mathcal{N}$ regardless of the initial state. Intuitively, the strategy $C^V_\mathcal{N}$ is constructed to lead to a single EC of $\mathcal{N}$ that provides the minimum ACPC value, namely the EC encoded by the strategy $C^V_{\mathcal{N}_{sur}}$ for $\mathcal{N}_{sur}$. Let $S_{def} \subseteq S_\mathcal{N}$ be the set of all states $v \in S_\mathcal{N}$ for which there exists a surveillance state $v_{sur} \in S_{\mathcal{N}_{sur}}$ such that the partial strategy $C^V_{\mathcal{N}_{sur}}(v_{sur})$ for $\mathcal{N}$ is defined on the state $v$. We compute a partial strategy $\zeta_{init}$ that leads from every state in $S_\mathcal{N} \setminus S_{def}$ to the set $S_{def}$ as follows. Let $\mathcal{N}'$ be an MDP created from $\mathcal{N}$ by adding a new state $t$ and a new action $\alpha_{def}$. From every state $v \in S_{def}$, we define a new transition under $\alpha_{def}$ to $t$ with probability 1 and cost 0. Let $C_{SSP}$ be a stationary optimal strategy for the SSP problem for $\mathcal{N}'$ with $t$ as the terminal state. We define $\zeta_{init}(v) = C_{SSP}(v)$ for every $v \in S_\mathcal{N} \setminus S_{def}$.

The strategy $C^V_\mathcal{N}$ is then a finite-memory strategy $C^V_\mathcal{N} = (M, act, \Delta, start)$, where $M = S_{\mathcal{N}_{sur}} \cup \{init\}$ is the set of modes and $\Delta : M \times S_\mathcal{N} \to M$ is the transition function such that for every $m \in M$, $v \in S_\mathcal{N}$,
\[ \Delta(m, v) = \begin{cases} m & \text{if } v \notin S_{\mathcal{N}_{sur}}, \\ v & \text{otherwise}. \end{cases} \]
The function $act : M \times S_\mathcal{N} \to A_\mathcal{N}$ that selects an action to be applied in $\mathcal{N}$ is, for $m \in M$, $v \in S_\mathcal{N}$, defined as
\[ act(m, v) = \begin{cases} \big(C^V_{\mathcal{N}_{sur}}(m)\big)(v) & \text{if } m \in S_{\mathcal{N}_{sur}}, \\ \zeta_{init}(v) & \text{otherwise}. \end{cases} \]
Finally, $start : S_\mathcal{N} \to M$, selecting the starting mode for $v \in S_\mathcal{N}$, is defined as
\[ start(v) = \begin{cases} v & \text{if } v \in S_{\mathcal{N}_{sur}}, \\ m & \text{if } \big(C^V_{\mathcal{N}_{sur}}(m)\big)(v) \text{ is defined}, \\ init & \text{otherwise}. \end{cases} \]
The strategy attains the ACPC value $V^*_\mathcal{N}$ since it only simulates the strategy $C^V_{\mathcal{N}_{sur}}$ by unwrapping the corresponding partial strategies.

The following property of the strategy $C^V_\mathcal{N}$ is crucial for the correctness of our approach to Problem 1.

Proposition 3:
For every $(s, q) \in S_\mathcal{N}$, it holds that
\[ \lim_{n \to \infty} \Pr^{\mathcal{N}, C^V_\mathcal{N}}_{(s,q)}\Big(\Big\{\rho \;\Big|\; \frac{g_\mathcal{P}(\rho(\sharp n))}{n} \le V^*_\mathcal{N}\Big\}\Big) = 1, \]
where $g_\mathcal{P}(\rho(\sharp n))$ denotes the cumulative cost gained in the first $n$ surveillance cycles of a run $\rho \in Run^\mathcal{N}((s,q))$. Hence, for every $\epsilon > 0$, there exists $j(\epsilon) \in \mathbb{N}$ such that if the strategy $C^V_\mathcal{N}$ is applied from a state $(s, q) \in S_\mathcal{N}$ for any $l \ge j(\epsilon)$ surveillance cycles, then the average expected cumulative cost per surveillance cycle in these $l$ surveillance cycles is at most $V^*_\mathcal{N} + \epsilon$ with probability at least $1 - \epsilon$, i.e.,
\[ \Pr^{\mathcal{N}, C^V_\mathcal{N}}_{(s,q)}\Big(\Big\{\rho \;\Big|\; \frac{g_\mathcal{P}(\rho(\sharp l))}{l} \le V^*_\mathcal{N} + \epsilon\Big\}\Big) \ge 1 - \epsilon. \]

Proof:
In [7], the authors prove that a strategy solving the ACPS problem for an MDP satisfies a property analogous to the one in the proposition. In particular, for the strategy $C^V_{\mathcal{N}_{sur}}$ for the reduced MDP $\mathcal{N}_{sur}$, it holds that for any state $(s, q) \in S_{\mathcal{N}_{sur}}$,
\[ \lim_{n \to \infty} \Pr^{\mathcal{N}_{sur}, C^V_{\mathcal{N}_{sur}}}_{(s,q)}\Big(\Big\{\rho \;\Big|\; \frac{g_{\mathcal{N}_{sur}}(\rho(n))}{n} \le V^*_{\mathcal{N}_{sur}}\Big\}\Big) = 1, \]
where $g_{\mathcal{N}_{sur}}(\rho(n))$ denotes the cumulative cost gained in the first $n$ stages of a run $\rho \in Run^{\mathcal{N}_{sur}}((s,q))$. The proposition then follows directly from the construction of the strategy $C^V_\mathcal{N}$ from the strategy $C^V_{\mathcal{N}_{sur}}$.

C. Almost-sure acceptance in an MAEC

Here we design a strategy for an MAEC
$\mathcal{N} \in MAEC(\mathcal{P})$ that guarantees almost-sure satisfaction of the acceptance condition $Acc$ of $\mathcal{A}_\phi$. Let $(B, G)$ be a pair in $Acc$ such that $\mathcal{N}$ is accepting with respect to $(B, G)$, i.e., $L_\mathcal{P}(S_\mathcal{N}) \cap B = \emptyset$ and $L_\mathcal{P}(S_\mathcal{N}) \cap G \neq \emptyset$. There exists a stationary strategy $C^\phi_\mathcal{N}$ for $\mathcal{N}$ under which a state with label in $G$ is reached with probability 1 regardless of the starting state, i.e.,
\[ \Pr^{\mathcal{N}, C^\phi_\mathcal{N}}_{(s,q)}(\mathbf{F}\, G) = 1 \quad (4) \]
for every $(s, q) \in S_\mathcal{N}$. The existence of such a strategy follows from the fact that $\mathcal{N}$ is an EC [3]. Moreover, we construct $C^\phi_\mathcal{N}$ to minimize the expected cumulative cost before reaching a state in $S_\mathcal{N} \cap S \times G$.

The strategy $C^\phi_\mathcal{N}$ is found as follows. Let $\mathcal{N}'$ be an MDP created from $\mathcal{N}$ by adding a new state $t$ and a new action $\alpha_G$. From every state $(s, q) \in S_\mathcal{N} \cap S \times G$, we define a new transition under $\alpha_G$ to $t$ with probability 1 and cost 0. Let $C_{SSP}$ be a stationary optimal strategy for the SSP problem for $\mathcal{N}'$ with $t$ as the terminal state. For a state $(s, q) \in S_\mathcal{N}$, we define $C^\phi_\mathcal{N}((s,q)) = C_{SSP}((s,q))$ if the state $(s, q)$ does not have a label in $G$; otherwise $C^\phi_\mathcal{N}((s,q)) = \alpha$ for some $\alpha \in A_\mathcal{N}((s,q))$.

Proposition 4:
The strategy $C^\phi_\mathcal{N}$ for $\mathcal{N}$ resulting from the above algorithm almost-surely reaches the set $S_\mathcal{N} \cap S \times G$ and minimizes the expected cumulative cost before reaching the set, regardless of the initial state.

Proof:
It follows directly from the fact that $C_{SSP}$ optimally solves the SSP problem for the MDP $\mathcal{N}'$ with terminal state $t$.

D. Optimal strategy for $\mathcal{P}$

Finally, we are ready to construct the strategy $C_\mathcal{P}$ for the product $\mathcal{P}$ that projects to an optimal solution for $\mathcal{M}$. First, starting from the initial state $s_{\mathcal{P}init}$, $C_\mathcal{P}$ applies the strategy $C$ resulting from the algorithm described in Sec. IV-A until a state of an MAEC in the set $maec^*$ is reached. Let $\mathcal{N} \in maec^*$ denote this MAEC and let $(B, G) \in Acc$ be a pair from the accepting condition of $\mathcal{A}_\phi$ such that $\mathcal{N}$ is accepting with respect to $(B, G)$.

Now, the strategy $C_\mathcal{P}$ starts to play in rounds. Each round consists of two phases. In the first phase, we play the strategy $C^\phi_\mathcal{N}$ from Sec. IV-C until a state with label in $G$ is reached. Let $k_i$ denote the number of steps we play $C^\phi_\mathcal{N}$ in the $i$-th round. The second phase applies the strategy $C^V_\mathcal{N}$ from Sec. IV-B until the number of completed surveillance cycles in the second phase of the current round is $l_i$. The number $l_i$ is any natural number for which
\[ l_i \ge \max\{j(1/i),\; i \cdot k_i \cdot g_{\mathcal{P}max}\}, \]
where $j(1/i)$ is from Prop. 3 and $g_{\mathcal{P}max}$ is the maximum value of the costs $g_\mathcal{P}$. After applying the strategy $C^V_\mathcal{N}$ for $l_i$ surveillance cycles, we proceed to the next round $i+1$.

Theorem 1:
The strategy $C_\mathcal{P}$ almost-surely satisfies the accepting condition $Acc$ of $\mathcal{A}_\phi$ and, at the same time, $C_\mathcal{P}$ minimizes the ACPC value $V_{\mathcal{P},C_\mathcal{P}}(s_{\mathcal{P}init})$ among all strategies for $\mathcal{P}$ almost-surely satisfying $Acc$.

Proof:
From Prop. 1 it follows that when applying the strategy $C$ from the initial state $s_{\mathcal{P}init}$, the set $maec^*$ is reached with probability 1.

Assume that $\mathcal{P}$ enters an MAEC $\mathcal{N} \in maec^*$ that is accepting with respect to a pair $(B, G) \in Acc$. Let $i$ be the current round of $C_\mathcal{P}$ and $\epsilon_i = 1/i$. According to Prop. 4, a state with a label in $G$ is almost-surely reached. In addition, using Prop. 3, the average expected cumulative cost per surveillance cycle in the $i$-th round is at most
\begin{align*}
\frac{k_i \cdot g_{\mathcal{N}max} + l_i (V^*_\mathcal{N} + \epsilon_i)}{l_i} &= V^*_\mathcal{N} + \epsilon_i + \frac{k_i \cdot g_{\mathcal{N}max}}{l_i} \\
&\le V^*_\mathcal{N} + \epsilon_i + \frac{1}{i} \qquad (\text{since } l_i \ge i \cdot k_i \cdot g_{\mathcal{N}max}) \\
&= V^*_\mathcal{N} + \frac{2}{i}
\end{align*}
with probability at least $1 - \frac{1}{i}$. Therefore, in the limit, in the MAEC $\mathcal{N}$, we both satisfy the LTL specification and reach the optimal ACPC value with probability 1. Together with the fact that $maec^*$ and $C$ satisfy the condition in Eq. (3), we have that $C_\mathcal{P}$ is an optimal strategy for $\mathcal{P}$.

E. Complexity and discussion
The size of a Rabin automaton for an LTL formula $\phi$ is in the worst case doubly exponential in the size of the set $AP$. However, studies such as [16] show that in practice, for many LTL formulas, the automata are much smaller and manageable. Once the product $\mathcal{P}$ is built, we compute the set $MAEC(\mathcal{P})$ by running $|Acc|$ times an algorithm for MEC decomposition, which is polynomial in the size of $\mathcal{P}$. The size of the set $MAEC(\mathcal{P})$ is in the worst case $|Acc| \cdot |S_\mathcal{P}|$. For each MAEC $\mathcal{N}$, we compute its reduction $\mathcal{N}_{sur}$ using Alg. 1 in time $O(|S_\mathcal{N}| \cdot |A_\mathcal{N}|^{O(|S_\mathcal{N}|)})$. The optimal ACPC value $V^*_\mathcal{N}$ and an optimal finite-memory strategy $C^V_\mathcal{N}$ are then found in time polynomial in the size of the reduced MDP. The algorithm for finding the strategy $C$ and the optimal set $maec^*$ is again polynomial in the size of $\mathcal{P}$. Similarly, computing a stationary strategy $C^\phi_\mathcal{N}$ for an MAEC $\mathcal{N} \in maec^*$ is polynomial in the size of $\mathcal{N}$.

As was proved in Sec. IV-D, the presented solution to Problem 1 is correct and complete. However, the resulting optimal strategy $C_\mathcal{P}$ for $\mathcal{P}$, and hence the projected strategy $C$ for $\mathcal{M}$ as well, is not a finite-memory strategy in general. The reason is that in the second phase of every round $i$, the strategy $C^V_\mathcal{N}$ is applied for $l_i$ surveillance cycles, and $l_i$ generally grows with $i$.

This, however, does not prevent the solution from being used effectively. The following simple rule can be applied to avoid performing all $l_i \ge \max\{i \cdot k_i \cdot g_{\mathcal{P}max},\; j(1/i)\}$ surveillance cycles in every round $i$. When the computation is in the second phase of round $i$ and the product is in an MAEC $\mathcal{N} \in maec^*$, after completion of every surveillance cycle, we can check whether the average cumulative cost per surveillance cycle in round $i$ is at most $V^*_\mathcal{N} + \frac{2}{i}$.
If yes, we can proceed to the next round $i+1$; otherwise, we continue with the second phase of round $i$. As the simulation results in Sec. V show, the use of this simple rule dramatically decreases the number of performed surveillance cycles in almost every round.

On the other hand, the complexity of the resulting strategy $C$ for $\mathcal{M}$ can be reduced from non-finite-memory to finite-memory in the following case. Assume that for every $\mathcal{N} \in maec^*$, the optimal ACPC strategy $C^V_\mathcal{N}$ leads to an EC that contains a state from $G$, where $\mathcal{N}$ is accepting with respect to the pair $(B, G) \in Acc$. In this case, the optimal strategy $C_\mathcal{P}$ can be defined as a finite-memory strategy that first applies the strategy $C$ to reach a state of an MAEC $\mathcal{N} \in maec^*$, and from that point on, only applies the strategy $C^V_\mathcal{N}$.

[Fig. 2: (a) The initialized MDP $\mathcal{M}$ with initial state 0. The costs of applying $\alpha, \beta, \gamma$ in any state are 5, 10, and 1, respectively, e.g., $g(1, \alpha) = 5$. (b) Definitions of the strategies $C_{init}, C_{p1}, C_{p2}$ for $\mathcal{M}$, the projections of the strategies $C, C^\phi_\mathcal{N}, C^V_\mathcal{N}$ for $\mathcal{P}$, respectively:

  Strategy   Condition     0  1  2  3  4  5  6  7  8  9
  C_init                   α  –  –  –  –  –  –  –  –  –
  C_p1       before job    α  β  α  α  α  γ  γ  α  α  γ
  C_p1       after job     α  α  α  α  α  γ  γ  α  α  γ
  C_p2                     α  β  α  α  α  γ  γ  α  α  γ

The condition "before job" means that the corresponding prescription is used if the job location has not yet been visited since the last visit of the base. Similarly, the prescription with condition "after job" is used if the job location was visited at least once since the last visit of the base.]

V. CASE STUDY
We implemented the solution presented in Sec. IV in Java and applied it to a persistent surveillance robotics example [20]. In this section, we report on the simulation results.

Consider a mobile robot moving in a partitioned environment. The motion of the robot is modeled by the initialized MDP $\mathcal{M}$ shown in Fig. 2a. The set $AP$ of atomic propositions contains two propositions, base and job. As depicted in Fig. 2a, state 0 is the base location and state 8 is the job location. At the job location, the robot performs some work, and at the base, it reports on its job activity. The robot's mission is to visit both the base and the job location infinitely many times. In addition, at least one job must be performed after every visit of the base, before the base is visited again. The corresponding LTL formula is
\[ \phi = \mathbf{GF}\, base \wedge \mathbf{GF}\, job \wedge \mathbf{G}\big(base \Rightarrow \mathbf{X}(\neg base\, \mathbf{U}\, job)\big). \]
While satisfying the formula, we want to minimize the expected average cost between two consecutive jobs, i.e., the surveillance proposition is $\pi_{sur} = job$.

In the simulation, we use a Rabin automaton $\mathcal{A}_\phi$ for the formula that has 5 states and whose accepting condition contains 1 pair. The product $\mathcal{P}$ of the MDP $\mathcal{M}$ and $\mathcal{A}_\phi$ has 50 states and one MAEC $\mathcal{N}$ with 19 states. The optimal set of MAECs is $maec^* = \{\mathcal{N}\}$. The optimal ACPC value is $V^*_\mathcal{N} = 40.5$. In Fig. 2b, we list the projections of the strategies $C, C^\phi_\mathcal{N}, C^V_\mathcal{N}$ for $\mathcal{P}$ to strategies $C_{init}, C_{p1}, C_{p2}$ for $\mathcal{M}$, respectively. The optimal strategy $C$ for $\mathcal{M}$ is then defined as follows. Starting from the initial state 0, apply strategy $C_{init}$ until a state is reached where $C_{init}$ is no longer defined. Start round number 1. In the $i$-th round, proceed as follows. In the first phase of the round, apply strategy $C_{p1}$ until the base is reached, and then for one more step (the product $\mathcal{P}$ has to reach a state from the Rabin pair). Let $k_i$ denote the number of steps in the first phase of round $i$.
In the second phase, use strategy $C_{p2}$ for $l_i = \max\{i \cdot k_i \cdot g_{\mathcal{P}max},\; j(1/i)\}$ surveillance cycles, i.e., until the number of jobs performed by the robot in the phase is $l_i$. We also use the rule described in Sec. IV-E to shorten the second phase, if possible.

Let us summarize the statistical results we obtained for 5 executions of the strategy $C$ for $\mathcal{M}$, each of 100 rounds. The number $k_i$ of steps in the first phase of a round $i > 1$ was always 5, because in such a case the first phase starts at the job location and the strategy $C_{p1}$ needs to be applied for exactly 4 steps to reach the base. Therefore, in every round $i > 1$, the number $l_i$ is at least $50 \cdot i$, e.g., in round 100, $l_i \ge 5000$. However, using the rule described in Sec. IV-E, the average number of jobs per round was 130 and the median was only 14. In particular, the number was not increasing with the round; on the contrary, it appears to be independent of the history of the execution. In addition, at most 2 rounds in each of the executions finished only at the point when the number of jobs performed by the robot in the second phase reached $l_i$. The average ACPC value attained after 100 rounds was 40.56.

In contrast to our solution, the algorithm proposed in [11] does not find an optimal strategy for $\mathcal{M}$. Regardless of the initialization of the algorithm, it always results in a sub-optimal strategy, namely the strategy $C_{p1}$ from Fig. 2b, which has ACPC value 50.5.

VI. CONCLUSION
In this paper, we focus on the problem of designing a control strategy for an MDP that guarantees the satisfaction of an LTL formula with a surveillance task and, at the same time, minimizes the expected average cumulative cost between visits of surveillance states. This problem was previously addressed in [11], where the authors propose a sub-optimal solution based on dynamic programming. In contrast to this work, we exploit recent results from theoretical computer science, namely game theory and probabilistic model checking, to provide a sound and complete solution to this control problem.

REFERENCES

[1] R. Alterovitz, T. Siméon, and K. Goldberg. The stochastic motion roadmap: A sampling framework for planning with Markov motion uncertainty. In Robotics: Science and Systems, 2007.
[2] K. Apt and E. Grädel. Lectures in Game Theory for Computer Scientists. Cambridge University Press, 2011.
[3] C. Baier and J. Katoen. Principles of Model Checking. The MIT Press, 2008.
[4] D. Bertsekas. Dynamic Programming and Optimal Control, vol. II. Athena Scientific Optimization and Computation Series. Athena Scientific, 2007.
[5] K. Chatterjee and L. Doyen. Energy and Mean-Payoff Parity Markov Decision Processes. In Mathematical Foundations of Computer Science 2011, volume 6907 of Lecture Notes in Computer Science, pages 206–218. Springer Berlin Heidelberg, 2011.
[6] K. Chatterjee and L. Doyen. Games and Markov Decision Processes with Mean-Payoff Parity and Energy Parity Objectives. In Mathematical and Engineering Methods in Computer Science, volume 7119 of Lecture Notes in Computer Science, pages 37–46. Springer Berlin Heidelberg, 2012.
[7] K. Chatterjee and M. Henzinger. Faster and Dynamic Algorithms for Maximal End-Component Decomposition and Related Graph Problems in Probabilistic Verification. In Proceedings of the Twenty-Second Annual ACM-SIAM Symposium on Discrete Algorithms, SODA'11, pages 1318–1336, 2011.
[8] Y. Chen, J. Tumova, and C. Belta. LTL robot motion control based on automata learning of environmental dynamics. In IEEE International Conference on Robotics and Automation, ICRA'12, pages 5177–5182, 2012.
[9] C. Courcoubetis and M. Yannakakis. The complexity of probabilistic verification. Journal of the ACM, 42(4):857–907, July 1995.
[10] L. de Alfaro. Formal Verification of Probabilistic Systems. PhD thesis, Stanford University, 1997. Technical report STAN-CS-TR-98-1601.
[11] X. C. Ding, S. L. Smith, C. Belta, and D. Rus. MDP Optimal Control under Temporal Logic Constraints. In The 50th IEEE Conference on Decision and Control and European Control Conference (CDC-ECC), pages 532–538, Dec. 2011.
[12] X. C. Ding, S. L. Smith, C. Belta, and D. Rus. LTL Control in Uncertain Environments with Probabilistic Satisfaction Guarantees. In Proceedings of the 18th IFAC World Congress, volume 18, 2011.
[13] J. Filar and K. Vrieze. Competitive Markov Decision Processes. Springer, 1996.
[14] E. Grädel, W. Thomas, and T. Wilke. Automata, Logics, and Infinite Games: A Guide to Current Research, volume 2500 of Lecture Notes in Computer Science. Springer, 2002.
[15] J. Klein. ltl2dstar – LTL to Deterministic Streett and Rabin Automata, 2007.
[16] J. Klein and C. Baier. Experiments with deterministic ω-automata for formulas of linear temporal logic. Theoretical Computer Science, 363(2):182–195, 2006.
[17] M. Lahijanian, S. B. Andersson, and C. Belta. Temporal logic motion planning and control with probabilistic satisfaction guarantees. IEEE Transactions on Robotics, 28:396–409, 2011.
[18] M. Svoreňová, J. Tůmová, J. Barnat, and I. Černá. Attraction-Based Receding Horizon Path Planning with Temporal Logic Constraints. In Proceedings of the 51st IEEE Conference on Decision and Control, CDC'12, pages 6749–6754, 2012.
[19] M. Svoreňová, I. Černá, and C. Belta. Optimal Receding Horizon Control for Finite Deterministic Systems with Temporal Logic Constraints. In