Learning Optimal Strategies for Temporal Tasks in Stochastic Games
Alper Kamil Bozkurt, Yu Wang, Miroslav Pajic
Abstract
Linear temporal logic (LTL) is widely used to formally specify complex tasks for autonomy. Unlike usual tasks defined by reward functions only, LTL tasks are noncumulative and require memory-dependent strategies. In this work, we introduce a method to learn optimal controller strategies that maximize the satisfaction probability of LTL specifications of the desired tasks in stochastic games, which are natural extensions of Markov decision processes (MDPs) to systems with adversarial inputs. Our approach constructs a product game using the deterministic automaton derived from the given LTL task and a reward machine based on the acceptance condition of the automaton, thus allowing for the use of a model-free RL algorithm to learn an optimal controller strategy. Since the rewards and the transition probabilities of the reward machine do not depend on the number of sets defining the acceptance condition, our approach is scalable to a wide range of LTL tasks, as we demonstrate on several case studies.
1. Introduction
We consider the problem of learning optimal strategies for formally specified tasks in zero-sum turn-based stochastic games (SGs). SGs are natural extensions of Markov decision processes (MDPs) to systems where an adversary can affect the outcomes of the actions to prevent performing the given task (Shapley, 1953; Filar & Vrieze, 2012). SGs can model many sequential decision-making problems where a given task needs to be performed successfully independently of the behavior of an adversary (Neyman et al., 2003).

Linear temporal logic (LTL) (Pnueli, 1977) offers a high-level language to formally specify a wide range of tasks with temporal characteristics. For instance, the agent tasks "first go to the living room and then go to the baby's room" (sequencing), "if the battery is low, go to the charger" (conditioning) and "continuously monitor the corridors of a building" (repetition) can be easily specified using LTL.
Alper Kamil Bozkurt, Yu Wang and Miroslav Pajic are with Duke University, Durham, NC 27708, USA, {alper.bozkurt, yu.wang094, miroslav.pajic}@duke.edu

Control synthesis directly from LTL specifications in stochastic systems has been an attractive topic, as it provides safe and robust strategies that are correct-by-construction, and it has thus been widely studied in robotics and cyber-physical systems (e.g., (Bacchus & Kabanza, 2000; Karaman et al., 2008; Kress-Gazit et al., 2009; Bhatia et al., 2010; Yordanov et al., 2011; Wolff et al., 2012; Ulusoy et al., 2013)). Such approaches, however, require a complete model of the system. The common situations where the agent does not know the system/environment model in advance have motivated the use of reinforcement learning (RL) to design controller strategies for LTL tasks in MDPs. Examples include model-based approaches such as (Fu & Topcu, 2014; Brázdil et al., 2014), which learn a model of the environment or require the topology of the environment; model-free methods crafting reward functions directly from the specifications expressed in a fragment of LTL (Li et al., 2017; Icarte et al., 2018; De Giacomo et al., 2020) or from the acceptance conditions of the automata derived from the specifications (Sadigh et al., 2014; Hasanbeig et al., 2019; Camacho & McIlraith, 2019; Hahn et al., 2019; Oura et al., 2020; Bozkurt et al., 2020b); and shielding approaches (Alshiekh et al., 2018; Bouton et al., 2018). All of these focus on learning in MDPs and do not consider adversarial environments where another agent may take actions to disrupt performing the given task.

On the other hand, there are only a few studies investigating the use of learning for LTL tasks in SGs. (Wen & Topcu, 2016) introduce a model-based probably approximately correct (PAC) algorithm that pre-computes the winning states with respect to the LTL specification, thereby requiring the topology of the game a priori, whereas another PAC algorithm from (Ashok et al., 2019) only considers the reachability fragment of LTL, which can be used to specify the sequencing and conditioning tasks but not the continuous tasks. More recently, two automata-based model-free methods have been proposed to learn control strategies for LTL tasks in SGs (Bozkurt et al., 2020a; Hahn et al., 2020). These approaches construct modified product games using the automata derived from the LTL specifications and learn strategies via rewards crafted based on the acceptance condition of the automaton. However, these methods require distinct powers of some small parameter ε for the sets that define the acceptance condition, which prohibits their use in practice for many tasks. For example, even for the simple surveillance task described in Section 5, the approach in (Hahn et al., 2020) may require prohibitively long episodes even for moderate values of ε.

Consequently, in this work, we introduce a model-free RL approach to learn optimal strategies for desired LTL tasks in SGs while significantly improving scalability compared to existing methods. We start by composing an SG system model with a deterministic parity automaton (DPA) that is automatically obtained from the given LTL task.
This product game is then combined with a novel reward machine suitably crafted from the acceptance condition of the DPA. The rewards and the transition probabilities in the reward machine are independent of the number of sets defining the acceptance condition, which results in the scalability improvement. We show the applicability of our approach on two case studies with LTL tasks that cannot be effectively learned via existing approaches.
2. Preliminaries and Problem Formulation
Stochastic Games.
We use SGs to model the problem of performing a given task by the controller (Player 1) against an adversary (Player 2) in a stochastic environment.
Definition 1.
A (labeled turn-based two-player) stochastic game is a tuple G = (S, (S_µ, S_ν), s₀, A, P, AP, L) where S = S_µ ∪ S_ν is a finite set of states, S_µ is the set of states in which the controller takes actions, S_ν is the set of states under the control of the adversary, and s₀ is the initial state; A is a finite set of actions and A(s) denotes the set of actions that can be taken in a state s; P : S × A × S → [0, 1] is the probabilistic transition function such that Σ_{s′∈S} P(s, a, s′) = 1 if a ∈ A(s) and 0 otherwise; finally, AP is a finite set of atomic propositions and L : S → 2^AP is the labeling function.

SGs can be considered as infinite games played by the controller and the adversary on finite directed graphs consisting of state and state-action vertices. In each state vertex, the owner of the state chooses one of the state-action vertices of the state; the game then transitions to one of the successors of the vertex according to the given transition function.
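To make Definition 1 concrete, the following is a minimal sketch of how a labeled turn-based SG could be represented in Python. The class and field names (e.g., `StochasticGame`, `controller_states`) and the toy transition probabilities are purely illustrative and are not taken from any implementation referenced in this paper.

```python
from dataclasses import dataclass

State, Action = str, str

@dataclass
class StochasticGame:
    """A labeled turn-based two-player SG G = (S, (S_mu, S_nu), s0, A, P, AP, L)."""
    controller_states: set   # S_mu
    adversary_states: set    # S_nu
    initial_state: State     # s0
    actions: dict            # A(s): state -> set of enabled actions
    transitions: dict        # P: (state, action) -> {next_state: probability}
    labels: dict             # L: state -> set of atomic propositions

    def successors(self, s: State, a: Action) -> dict:
        """Return the distribution over next states for taking action a in state s."""
        dist = self.transitions[(s, a)]
        assert abs(sum(dist.values()) - 1.0) < 1e-9, "P(s, a, .) must sum to 1"
        return dist

# A two-state toy game: the controller owns s0, the adversary owns s1.
toy = StochasticGame(
    controller_states={"s0"},
    adversary_states={"s1"},
    initial_state="s0",
    actions={"s0": {"b1", "b2"}, "s1": {"b1"}},
    transitions={("s0", "b1"): {"s1": 0.9, "s0": 0.1},
                 ("s0", "b2"): {"s0": 1.0},
                 ("s1", "b1"): {"s0": 1.0}},
    labels={"s0": {"c"}, "s1": {"a", "b", "c"}},
)
print(toy.successors("s0", "b1"))
```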
Example 1.
An example SG is shown in Fig. 1a. The game starts in s₀, where the label {c} is received. In s₀, the controller can choose between two actions, β₁ and β₂ (the state and action names follow Fig. 1a). With β₁, the game makes a transition to s₁ with some positive probability (we abbreviate "with probability" as w.p.) and stays in s₀ otherwise. If the controller persistently chooses β₁, the game eventually moves to state s₁ w.p. 1 and {a, b, c} will be received.

Each play in an SG G induces an infinite path π := s₀s₁ . . . where for all t ≥ 0, there exists an action a ∈ A(s_t) such that P(s_t, a, s_{t+1}) > 0. We denote the state s_t, the prefix s₀s₁ . . . s_t and the suffix s_t s_{t+1} . . . by π[t], π[:t] and π[t:], respectively. The players' behaviors can be considered as strategy functions mapping the history of the visited states to an action. We focus on finite-memory strategies since they suffice for the tasks considered in this work.

Definition 2. A finite-memory strategy for a game G is a tuple σ = (M, m₀, T, α) where M is a finite set of modes; m₀ is the initial mode; T : M × S → D(M) is the transition function that maps the current mode and state to a distribution over the next modes; α : M × S → D(A) is a function that maps a given mode m ∈ M and a state s ∈ S to a discrete distribution over A(s). A controller strategy µ is a finite-memory strategy that maps only the controller states to distributions over actions. Similarly, an adversary strategy ν is a finite-memory strategy mapping the adversary states to distributions over actions. A finite-memory strategy is called pure memoryless if there is only one mode (|M| = 1) and α(m₀, s) is a point distribution assigning a probability of 1 to a single action for all s ∈ S.

Intuitively, a finite-memory strategy is a finite state machine that moves from one mode (memory state) to another as the SG states are visited, outputting a distribution over the actions in each state. Unlike the standard definition of finite-memory strategies (e.g., (Chatterjee & Henzinger, 2012; Baier & Katoen, 2008)), where transitions among the modes are all deterministic, Def. 2 also allows for probabilistic transitions.

For an SG G and a pair of a controller strategy µ and an adversary strategy ν, we denote the resulting induced Markov chain (MC) by G_{µ,ν}. We write π ∼ G_{µ,ν} to denote a path drawn from G_{µ,ν}, and π_s ∼ G^s_{µ,ν} to denote a path drawn from the MC G^s_{µ,ν}, where G^s_{µ,ν} is the same as G_{µ,ν} except that the state s is designated as the initial state instead of s₀. Finally, a bottom strongly connected component (BSCC) of an MC is a set of states such that there is a path from each state to any other state in the set and there are no transitions leaving the set. We denote the set of all BSCCs of G_{µ,ν} by B(G_{µ,ν}).
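A finite-memory strategy (Definition 2) can be viewed as a small stochastic state machine. The sketch below, with hypothetical class and variable names, shows one possible way to encode such a strategy and sample actions from it; a pure memoryless strategy is the special case with a single mode and point distributions.

```python
import random

class FiniteMemoryStrategy:
    """sigma = (M, m0, T, alpha): modes, initial mode, mode-transition and action distributions."""
    def __init__(self, modes, m0, T, alpha):
        self.modes, self.mode = modes, m0
        self.T = T          # T[(mode, state)] -> {next_mode: prob}
        self.alpha = alpha  # alpha[(mode, state)] -> {action: prob}

    @staticmethod
    def _sample(dist):
        r, acc = random.random(), 0.0
        for outcome, p in dist.items():
            acc += p
            if r <= acc:
                return outcome
        return outcome  # guard against floating-point round-off

    def act(self, state):
        """Sample an action in `state`, then update the memory mode."""
        action = self._sample(self.alpha[(self.mode, state)])
        self.mode = self._sample(self.T[(self.mode, state)])
        return action

# A pure memoryless strategy: one mode, a single action chosen with probability 1.
mu = FiniteMemoryStrategy(modes={"m0"}, m0="m0",
                          T={("m0", "s0"): {"m0": 1.0}},
                          alpha={("m0", "s0"): {"b2": 1.0}})
print(mu.act("s0"))
```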
Linear Temporal Logic. LTL provides a high-level formalism to specify tasks with temporal properties by placing requirements on infinite paths. LTL specifications consist of nested combinations of Boolean and temporal operators according to the grammar

ϕ := true | a | ϕ₁ ∧ ϕ₂ | ¬ϕ | ○ϕ | ϕ₁ U ϕ₂,   a ∈ AP.   (1)

The other Boolean operators are defined via the standard equivalences (e.g., ϕ₁ ∨ ϕ₂ := ¬(¬ϕ₁ ∧ ¬ϕ₂), ϕ → ϕ′ := ¬ϕ ∨ ϕ′). A path π of an SG G satisfies an LTL specification ϕ, denoted by π ⊨ ϕ, if one of the following holds:
• ϕ = a and a ∈ L(π[0]), i.e., a immediately holds;
• ϕ = ϕ₁ ∧ ϕ₂, π ⊨ ϕ₁ and π ⊨ ϕ₂;
• ϕ = ¬ϕ′ and π ⊭ ϕ′;
• ϕ = ○ϕ′ (called next ϕ′) and π[1:] ⊨ ϕ′;
• ϕ = ϕ₁ U ϕ₂ (called ϕ₁ until ϕ₂) and there exists t ≥ 0 such that π[t:] ⊨ ϕ₂ and for all 0 ≤ i < t, π[i:] ⊨ ϕ₁.

Figure 1. (a) An example SG; (b) an example DPA for ϕ = (♦□a ∧ □♦b) ∨ ♦□c; (c) the product game of the SG in (a) and the DPA in (b).
A product game construction example. In (a) and (c), the green circles and red squares are the controller and adversary states, respectively. The blue diamonds represent actions and the outgoing black edges represent probabilistic transitions. In (a), the sets of letters are the state labels. In (b), q₀ and q₁ are the DPA states and the arrows represent the DPA transitions triggered by the labels on the edges. The colored numbers within parentheses in (b) and (c) represent the colors of the DPA transitions and the product states, respectively.

Intuitively, the temporal operator ○ϕ expresses that ϕ needs to hold in the next time step, whereas ϕ₁ U ϕ₂ specifies that ϕ₁ needs to hold until ϕ₂ holds. Other temporal operators such as eventually (♦) and always (□) are also commonly used: ♦ϕ := true U ϕ (i.e., ϕ eventually holds), and □ϕ := ¬(♦¬ϕ) (i.e., ϕ always holds).

Example 2.
The majority of robotics tasks can be expressed as an LTL formula (Kress-Gazit et al., 2009). For instance:

• Sequencing:
The agent should go to the room a, open the package b, take the item c, and place the item on the desk d: ϕ_seq = ♦(a ∧ ♦(b ∧ ♦(c ∧ ♦d)));

• Surveillance:
The agent should repeatedly visit the rooms a and b, and if an item c is seen on the floor, should report to the administrator f: ϕ_sur = □♦a ∧ □♦b ∧ □(c → ♦f);

• Persistence and Avoidance:
The agent should go to the room a and stay there, without entering a danger zone g: ϕ_per = ♦□a ∧ □¬g.

Example 3.
Consider a simple persistence task ϕ = ♦□a in the SG from Fig. 1a. If the controller and the adversary always choose the same fixed actions in their respective states, a state labeled with 'a' is eventually reached and 'a' is received after every subsequent transition; thus, the specification is satisfied. However, if the adversary occasionally chooses a different action, a state without 'a' in its label will eventually be reached from any time point in the game; this makes the probability of satisfying the specification 0. In this case, the controller can improve the probability of satisfying ϕ by switching to the action that, with positive probability, leads to an infinite loop between two states that are both labeled with 'a'.

Deterministic Parity Automata.
Any LTL task ϕ can be translated to a deterministic parity automaton (DPA) that accepts the infinite paths satisfying ϕ (Esparza et al., 2017).

Definition 3. A deterministic parity automaton derived from an LTL specification ϕ in an SG G is a tuple A_ϕ = (Q, q₀, δ, k, C) such that Q is a finite set of automaton states; q₀ ∈ Q is the initial state; δ : Q × 2^AP → Q is the transition function; k is the number of colors; and C : Q × 2^AP → [k] is the coloring function, where [k] := {1, . . . , k}.

A path π of G induces an execution τ_π = ⟨q₀, L(π[0])⟩⟨q₁, L(π[1])⟩ . . . such that for all t ≥ 0, δ(q_t, L(π[t])) = q_{t+1}. Let Inf(τ_π) denote the set of transitions ⟨q, ρ⟩ ∈ Q × 2^AP made infinitely often by τ_π; then, a path π is accepted by the DPA if max{C(⟨q, ρ⟩) | ⟨q, ρ⟩ ∈ Inf(τ_π)} is an even number.

DPAs provide a systematic way to evaluate the satisfaction of any LTL specification, which can be expressed as satisfaction of the parity condition of the constructed DPA. The parity condition is satisfied simply when the largest color among the colors visited repeatedly is an even number. This provides a natural framework to reason about LTL tasks in SGs: the controller tries to visit the states triggering the even-colored transitions as often as possible, while the adversary tries to do the opposite (i.e., trigger odd-colored transitions).
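For an eventually-periodic (lasso-shaped) label sequence, the parity acceptance condition of Definition 3 can be checked directly: run the DPA over the prefix, iterate over whole periods of the cycle until the DPA state at a period start repeats, and test whether the largest color on the recurring segment is even. The sketch below assumes the DPA is encoded as plain Python dictionaries and that its transition function is total; the encoding is hypothetical.

```python
def dpa_accepts_lasso(delta, color, q0, prefix, cycle):
    """Check whether the infinite word prefix . cycle^omega is accepted by the DPA.

    delta[(q, label)] -> next state; color[(q, label)] -> color in {1, ..., k};
    `prefix` and `cycle` are lists of labels (e.g., frozensets of atomic propositions).
    """
    q = q0
    for label in prefix:                      # consume the transient prefix
        q = delta[(q, label)]
    seen = {}
    trace = []                                # colors observed, one per transition
    # Read whole periods of the cycle until the DPA state at the start of a period repeats;
    # from that point on, exactly the same transitions are taken forever.
    while q not in seen:
        seen[q] = len(trace)
        for label in cycle:
            trace.append(color[(q, label)])
            q = delta[(q, label)]
    recurring = trace[seen[q]:]               # transitions repeated infinitely often
    return max(recurring) % 2 == 0
```

This mirrors the reasoning used later for BSCCs: only the colors that occur infinitely often matter, and acceptance depends on the parity of their maximum.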
Example 4.
An example DPA derived from the LTL formula ϕ = (♦□a ∧ □♦b) ∨ ♦□c is shown in Fig. 1b. Any infinite execution visiting both q₀ and q₁ infinitely many times is not accepted, as the transitions between the two states are colored with 5, which is an odd number and the largest color. An execution visiting q₁ infinitely often is accepting only if, after some finite number of time steps, the execution always stays in q₁ by receiving the labels {c}, {a, c}, {b, c} or {a, b, c}, thus satisfying ♦□c. Also, the accepting executions visiting q₀ infinitely often after some point either only receive {a, c}, satisfying ♦□c, or repeatedly receive {a, c} and {a, b, c} without receiving {} or {b}, hence satisfying ♦□a ∧ □♦b.

Problem Statement
We can now formulate the problem we consider in this work.
Problem 1.
For a given LTL task specification ϕ and an SG G where the transition probabilities are completely unknown, design a model-free RL approach to learn an optimal controller strategy under which the given LTL task is performed successfully with the highest probability against an optimal (i.e., worst-case) adversary.

Formally, our objective is to learn a controller strategy µ_ϕ in the SG G such that, under µ_ϕ, the probability that a path π satisfies the LTL specification of the task ϕ is maximized in the worst case; i.e.,

µ_ϕ := arg max_µ min_ν Pr_{π∼G_{µ,ν}} {π | π ⊨ ϕ}.   (2)

However, model-free RL algorithms such as minimax-Q require a reward function associating states with scalar rewards, and they learn a strategy maximizing the minimum expected value of the return, which is the sum of the discounted rewards obtained for some discount factor γ ∈ [0, 1). For a strategy pair (µ, ν), let r^(t)_{µ,ν} denote the reward collected at time step t and G_{µ,ν} denote the return; then, we have

G_{µ,ν} := Σ_{t=0}^{∞} γ^t r^(t)_{µ,ν}.   (3)

Thus, we solve Problem 1 by systematically constructing a reward machine from an LTL specification such that any strategy maximizing the minimum expected return is an optimal controller strategy as defined in (2).
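As a small illustration of the return in (3), the sketch below computes the discounted sum of a finite reward prefix; in the priority games of Section 3 the discount factor will be γ = 1 − ε. The function name and the numbers used are illustrative only.

```python
def discounted_return(rewards, gamma):
    """G = sum_t gamma^t * r_t for a (finite prefix of a) reward sequence."""
    g, weight = 0.0, 1.0
    for r in rewards:
        g += weight * r
        weight *= gamma
    return g

# A reward of eps collected in every step approaches eps / (1 - gamma) = 1 when gamma = 1 - eps.
eps = 0.01
print(discounted_return([eps] * 10_000, 1 - eps))  # close to 1
```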
3. Priority Games
In this section, we introduce a reduction from Problem 1 to the problem of learning a controller strategy maximizing the return in (3) in the worst case. We start by constructing a product game by composing the given SG G with the DPA A_ϕ derived from the LTL specification ϕ of the desired task. We then introduce a novel reward machine based on the acceptance condition of the DPA, where the transitions and the rewards are controlled by a parameter ε. We compose the reward machine with the product game to obtain Markovian rewards; for such rewards, we show that as ε approaches zero, a strategy maximizing the minimum expected return also maximizes the minimum probability of satisfying the LTL specification; i.e., the obtained optimal strategy satisfies (2). The overall approach is summarized in Algorithm 1.

The problem of learning an optimal controller strategy for an LTL task in an SG can be reduced to meeting the parity condition via the construction of a product game.
Definition 4. A product game of an SG G and a DPA A_ϕ is a tuple G× = (S×, (S×_µ, S×_ν), s×₀, A×, P×, C×) where S× = S × Q is the set of product states; S×_µ = S_µ × Q and S×_ν = S_ν × Q are the controller and adversary product states, respectively; s×₀ = ⟨s₀, q₀⟩ is the initial state; A× = A with A×(⟨s, q⟩) = A(s); P× : S× × A× × S× → [0, 1] is the probabilistic transition function such that

P×(⟨s, q⟩, a, ⟨s′, q′⟩) = P(s, a, s′) if q′ = δ(q, L(s)), and 0 otherwise;   (4)

and C× : S× → [k] is the product coloring function such that C×(⟨s, q⟩) = C(q, L(s)). A path π× = ⟨s₀, q₀⟩⟨s₁, q₁⟩ . . . in a product game meets the parity condition if the following holds:

ϕ× := "max{C×(s×) | s× ∈ Inf×(π×)} is even".   (5)

Table 1.
Frequently used symbols and notation.

Description   | Stochastic Game (G) | Product Game (G×)   | Priority Game (G⋆)
State         | s                   | s× := ⟨s, q⟩        | s⋆ := ⟨s×, j⟩
Path          | π := s₀s₁ . . .     | π× := s×₀s×₁ . . .  | π⋆ := s⋆₀s⋆₁ . . .
Strategy Pair | (µ, ν)              | (µ×, ν×)            | (µ⋆, ν⋆)
Induced MC    | G_{µ,ν}             | G×_{µ×,ν×}          | G⋆_{µ⋆,ν⋆}
Objective     | ϕ (LTL)             | ϕ× (Parity)         | G (Return)

In (5), Inf×(π×) denotes the set of product states visited infinitely many times by π×.

From the definition and the fact that whenever the DPA A_ϕ makes a transition colored with j, a product state ⟨s, q⟩ colored with j is visited in G×, the following lemma holds:

Lemma 1.
In the product game G× of the SG G and the DPA A_ϕ, consider paths π× = ⟨s₀, q₀⟩⟨s₁, q₁⟩ . . . in G× and π = s₀s₁ . . . in G. Then, (π× ⊨ ϕ×) ⇔ (π ⊨ ϕ).

The product game effectively captures the synchronous execution of the DPA A_ϕ with the SG G. The DPA starts in its initial state, and whenever the SG moves to a state, the DPA consumes the label of the state and makes a transition. For example, when G is in s and A_ϕ is in q, the product game G× is in ⟨s, q⟩. If the SG moves to s′, the DPA moves to the state q′ = δ(q, L(s′)), which is represented by the product game transition from ⟨s, q⟩ to ⟨s′, q′⟩.

A strategy in G× induces a finite-memory strategy in G where the states of A_ϕ act as the modes governed by the transition function of A_ϕ. To illustrate, let µ× denote a pure memoryless strategy in G× and µ denote its induced strategy in G. While µ is operating in mode m corresponding to the DPA state q, if a state s′ is visited in G, µ changes its mode from m to m′, which corresponds to q′ = δ(q, L(s′)), and chooses the action that µ× selects in ⟨s′, q′⟩.

Hence, from Lemma 1, the probability of satisfying the parity condition ϕ× under a strategy pair (µ×, ν×) in the product game G× is equal to the probability of satisfying the LTL specification ϕ in the SG G under the induced strategy pair (µ, ν); i.e., it holds that

Pr{π | π ⊨ ϕ} = Pr{π× | π× ⊨ ϕ×},   (6)

where π and π× are random paths drawn from the MCs G_{µ,ν} and G×_{µ×,ν×}, respectively.
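The product construction of Definition 4 only recombines objects that are already available. A minimal sketch is given below, with hypothetical function and dictionary names; it follows the convention of (4) that the DPA reads the label of the state being left, and assumes the DPA transition function is total.

```python
def product_transitions(P, L, delta, C):
    """Build the transition function P_x and coloring C_x of the product game.

    P[(s, a)] -> {s2: prob}; L[s] -> label; delta[(q, label)] -> q2; C[(q, label)] -> color.
    Product states are represented as pairs (s, q).
    """
    dpa_states = {q for (q, _) in delta}         # every DPA state appearing in delta
    P_x, C_x = {}, {}
    for (s, a), dist in P.items():
        for q in dpa_states:
            q2 = delta[(q, L[s])]                # q' = delta(q, L(s)), as in (4)
            P_x[((s, q), a)] = {(s2, q2): p for s2, p in dist.items()}
            C_x[(s, q)] = C[(q, L[s])]           # C_x(<s, q>) = C(q, L(s))
    return P_x, C_x
```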
Example 5.

Fig. 1c presents the product game G× obtained from the SG G in Fig. 1a and the DPA A_ϕ of the LTL task ϕ in Fig. 1b. In G, if the adversary follows a pure memoryless strategy, the adversary loses (i.e., the controller wins in the sense that ϕ is satisfied) w.p. 1: under any pure memoryless adversary strategy ν, G eventually alternates either between two states whose labels satisfy ♦□a ∧ □♦b or between two states whose labels satisfy ♦□c, and thus ϕ is satisfied. However, the adversary can win in G× by following a pure memoryless strategy that selects different actions in the two product states that share the same SG state but differ in the DPA mode; in this case, the maximal color in the resulting infinite cycle of product states is 5. This pure memoryless strategy in G× induces a finite-memory strategy in G that alternates between two modes m₀ and m₁, corresponding to the DPA states q₀ and q₁, under which different actions are selected in the same SG state.
We use Pr_{µ×,ν×}(s× ⊨ ϕ×) to denote the probability that a state s× ∈ S× satisfies ϕ× under (µ×, ν×); i.e.,

Pr_{µ×,ν×}(s× ⊨ ϕ×) := Pr{π×_{s×} | π×_{s×} ⊨ ϕ×},   (7)

where π×_{s×} is a random path drawn from the product MC obtained from G×_{µ×,ν×} by designating s× as the initial state. Therefore, from (6) and Lemma 1, our objective can be revised as learning a strategy µ×_{ϕ×} in the product game G× defined as

µ×_{ϕ×} := arg max_{µ×} min_{ν×} Pr_{µ×,ν×}(s×₀ ⊨ ϕ×);   (8)

the strategy µ×_{ϕ×} is then used to induce the finite-memory strategy µ_ϕ from (2). To achieve this using model-free RL, we introduce a reward machine that ensures the equivalence between the probability of satisfying the parity condition ϕ× and the expected return.

The goal of our priority reward machines (PRMs), a class of reward machines (Icarte et al., 2018) previously used for RL in MDPs, is to provide suitable rewards for learning strategies satisfying parity conditions. PRMs have a priority state for each color in a given parity condition and output positive rewards only in the priority states corresponding to even colors.
Definition 5. A priority reward machine is a tuple ρ = (k, ε, ϱ, ϑ, Γ) where
• k is the number of colors;
• ε ∈ (0, 1) is the parameter controlling the transitions between priorities;
• ϱ = {ϱ₀, ϱ₁, . . . , ϱ_k} is the set of priority states;
• ϑ : ϱ × ({0} ∪ [k]) × ϱ → [0, 1] is the probabilistic transition function such that

ϑ(ϱ_j, c, ϱ_{j′}) =
  1 − √ε   if j = 0, j′ = 0,
  √ε       if j = 0, j′ = 1,
  1 − ε    if j ≥ 1, j ≥ c, j′ = j,
  ε        if j ≥ 1, j ≥ c, j′ = 1,
  1        if j ≥ 1, j < c, j′ = c,
  0        otherwise;

• Γ is the reward function, which assigns positive rewards only to the priority states corresponding to even colors and zero rewards to all other priority states.

Definition 6. The priority game of a product game G× and a PRM ρ is a tuple G⋆ = (S⋆, (S⋆_µ, S⋆_ν), s⋆₀, A⋆, P⋆, ε, R⋆) where
• S⋆ = S× × ({0} ∪ [k]) is the augmented state set, with S⋆_µ = S×_µ × ({0} ∪ [k]) the controller states, S⋆_ν = S⋆ \ S⋆_µ the adversary states, and s⋆₀ = ⟨s×₀, 0⟩ the initial state;
• A⋆ = A× with A⋆(⟨s×, j⟩) = A×(s×) for all ⟨s×, j⟩ ∈ S⋆ is the set of actions;
• 1 − ε is the discounting factor;
• P⋆ : S⋆ × A⋆ × S⋆ → [0, 1] is the probabilistic transition function defined, for c = C×(s×), as

P⋆(⟨s×, j⟩, a, ⟨s×′, j′⟩) =
  (1 − √ε) P×(s×, a, s×′)   if j = 0, j′ = 0,
  √ε P×(s×, a, s×′)         if j = 0, j′ = 1,
  (1 − ε) P×(s×, a, s×′)    if j ≥ 1, j ≥ c, j′ = j,
  ε P×(s×, a, s×′)          if j ≥ 1, j ≥ c, j′ = 1,
  P×(s×, a, s×′)            if j ≥ 1, j < c, j′ = c,
  0                         otherwise;

• R⋆ : S⋆ → [0, 1] is the reward function, where R⋆(⟨s×, j⟩) is the PRM reward Γ of the priority state ϱ_j.

Figure 2. A PRM for a parity condition with k colors. The rewards provided in each state are displayed next to the state names. The blue outgoing edges are the transitions triggered by the colors displayed near the edges. The black edges represent the probabilistic transitions.

A strategy in G⋆ induces a finite-memory strategy in the SG G where the PRM combined with the DPA serves as the memory mechanism. Hereafter, we focus on pure memoryless strategy pairs in priority games, and we show in the next section that if µ⋆ maximizes the minimum expected return in the priority game G⋆, its induced strategy µ maximizes the minimum satisfaction probability of the LTL task in the SG G.
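Putting Definitions 5 and 6 together, one step of the priority component can be simulated as below: the priority tracks the largest color seen since the last reset, resets to 1 with probability ε (or leaves the initial priority 0 with probability √ε), and a positive reward is produced only while the priority is even. This is a sketch of the mechanism rather than a verbatim transcription; in particular, the choice of ε as the reward magnitude and the exclusion of priority 0 from the rewarded states are assumptions, motivated by the return bound of Lemma 2 below.

```python
import math, random

def prm_step(j, c, eps):
    """Update the PRM priority j after observing a product state of color c.

    Returns (next_priority, reward). Assumption: even priorities >= 2 yield reward eps
    (consistent with the upper bound of Lemma 2); all other priorities yield 0.
    """
    if j == 0:                                  # initial priority: start tracking w.p. sqrt(eps)
        j_next = 1 if random.random() < math.sqrt(eps) else 0
    elif j < c:                                 # a higher color was observed: jump to it
        j_next = c
    else:                                       # keep the running maximum, reset to 1 w.p. eps
        j_next = 1 if random.random() < eps else j
    reward = eps if (j_next >= 2 and j_next % 2 == 0) else 0.0
    return j_next, reward
```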
4. Main Result

We now state our main result. Here, we use G(π⋆) to denote the return of a path π⋆ in a priority game. The value of a state s⋆ ∈ S⋆ under a strategy pair (µ⋆, ν⋆), denoted by v^ε_{µ⋆,ν⋆}(s⋆), is defined as the expected value of the return of a path starting in s⋆; i.e.,

v^ε_{µ⋆,ν⋆}(s⋆) := E_{π⋆_{s⋆}}[G(π⋆_{s⋆})] = E_{π⋆_{s⋆}}[ Σ_{t=0}^{∞} (1 − ε)^t R⋆(π⋆_{s⋆}[t]) ],   (10)

where π⋆_{s⋆} is a random path drawn from the MC with initial state s⋆ induced by (µ⋆, ν⋆).

Theorem 1. Let µ⋆∗ be an optimal controller strategy maximizing the minimum value of the initial state in a priority game G⋆ constructed from an SG G for a desired LTL specification ϕ, which is formally defined as

µ⋆∗ := arg max_{µ⋆} min_{ν⋆} v^ε_{µ⋆,ν⋆}(⟨s×₀, 0⟩).   (11)

Then there exists ε > 0 such that the induced strategy of µ⋆∗ in G is an optimal strategy µ_ϕ as defined in (2).

Theorem 1 states that an optimal controller strategy for a given LTL task ϕ can be obtained by finding a maximin strategy in the constructed priority game. Before proving Theorem 1, we provide some useful lemmas. Throughout the lemmas, (µ⋆, ν⋆) denotes an arbitrary strategy pair in a given priority game G⋆, and (µ×, ν×) denotes its induced strategy pair in the product game G×.

Lemma 2. For a path π⋆ in G⋆, it holds that

0 ≤ G(π⋆) ≤ 1.   (12)

Proof. Since there are no negative rewards, the return is always nonnegative. The maximum return is obtained when a reward of ε is obtained in every state, which makes the upper bound ε Σ_{t=0}^{∞} (1 − ε)^t = 1.

The parity condition in the product games is defined over the states that are visited infinitely many times. Thus, we need to establish a connection between the recurrent states of the product games and their priority games. Under a strategy pair, a BSCC of the induced MC will eventually be reached and all the states in the BSCC will be visited infinitely often. We now show that there is a one-to-one correspondence between the BSCCs of the priority games and product games.

Lemma 3. There is a bijection between the BSCCs of G⋆_{µ⋆,ν⋆} and G×_{µ×,ν×} such that for any V⋆ ∈ B(G⋆_{µ⋆,ν⋆}) and its pair V× ∈ B(G×_{µ×,ν×}) it holds that

V× = {s× | ⟨s×, j⟩ ∈ V⋆}.   (13)

Proof. A key observation is that if a state ⟨s×, j⟩ belongs to a BSCC, then ⟨s×, 1⟩ belongs to the same BSCC due to the ε-transitions. Thus, any state ⟨s×, j⟩ ∈ S⋆ either is a transient state or belongs to the BSCC containing the state ⟨s×, 1⟩; note that if ⟨s×, 1⟩ is a transient state, then every ⟨s×, j⟩ ∈ S⋆ must be a transient state. Thus, if an s× ∈ S× belongs to a V× ∈ B(G×_{µ×,ν×}), then there is only one BSCC V⋆ ∈ B(G⋆_{µ⋆,ν⋆}) including an ⟨s×, j⟩ for some j ∈ [k]. Similarly, if an ⟨s×, j⟩ ∈ S⋆ belongs to a V⋆ ∈ B(G⋆_{µ⋆,ν⋆}), then the pair V× ∈ B(G×_{µ×,ν×}) is the BSCC containing s×.

The BSCCs of a product game are of particular interest because the problem of satisfying the parity condition can be reduced to the problem of reaching certain BSCCs. For a given BSCC V× ∈ B(G×_{µ×,ν×}), let j_{V×} denote the largest color among the colors of the states in V×; i.e.,

j_{V×} := max{C×(s×) | s× ∈ V×}.   (14)

If j_{V×} is an even number, then the parity condition is satisfied almost surely once V× is reached, as every state in V× will be visited infinitely many times. In this case, V× is called accepting. Similarly, if j_{V×} is an odd number, the probability of satisfying the parity condition in V× is zero and V× is called rejecting.

Lemma 4. For a BSCC V⋆ ∈ B(G⋆_{µ⋆,ν⋆}) and its pair V× in G×_{µ×,ν×}, the following holds for all ⟨s×, j⟩ ∈ V⋆:

lim_{ε→0+} v^ε_{µ⋆,ν⋆}(⟨s×, j⟩) = 1 if V× is accepting, and 0 if V× is rejecting.   (15)
Proof. From Def. 6, once a state ⟨s×, j_{V×}⟩ ∈ V⋆ carrying the highest color of V× is reached, the priority component remains at j_{V×} until an ε-transition resets it to 1; after a reset, a state with priority j_{V×} is again reached within an almost surely finite number of steps. Let N_ε denote the number of steps until such a reset occurs and N the number of steps needed to return to priority j_{V×} afterwards. If V× is accepting, j_{V×} is even and a reward of ε is collected in every step spent at priority j_{V×}; as ε → 0+, the resets become negligible relative to the effective horizon 1/ε, and the value approaches the upper bound 1 of Lemma 2. If V× is rejecting, j_{V×} is odd, rewards can only be collected during the at most N steps following a reset, and the expected reward accumulated between two resets is at most ε E[N]; hence the value vanishes as ε → 0+.

Lemma 5. Let U⋆ be the union of the pairs (in G⋆_{µ⋆,ν⋆}) of all accepting BSCCs in G×_{µ×,ν×}. Then, it holds that

Pr_{µ×,ν×}(s× ⊨ ϕ×) = Pr_{µ⋆,ν⋆}(s⋆ ⊨ ♦U⋆),   (25)

where ♦U⋆ denotes the objective of reaching a state in U⋆.

Proof. This follows from Lemma 3. If a path π⋆ = ⟨s×₀, j₀⟩⟨s×₁, j₁⟩ · · · ∼ G⋆_{µ⋆,ν⋆} enters a BSCC V⋆ ∈ B(G⋆_{µ⋆,ν⋆}) and visits an ⟨s×, j⟩ ∈ V⋆, then its projection π× = s×₀s×₁ . . . in G×_{µ×,ν×} eventually enters the pair BSCC V× and visits s×, and vice versa.

We can now state the equivalence between the satisfaction probabilities and the values in the limit.

Lemma 6. The value of the initial state in G⋆ approaches the satisfaction probability of the parity condition in G× as ε approaches 0; i.e.,

lim_{ε→0+} v^ε_{µ⋆,ν⋆}(⟨s×₀, 0⟩) = Pr_{µ×,ν×}(s×₀ ⊨ ϕ×).   (26)

Proof of Lemma 6. From Lemma 5, we can express v^ε_{µ⋆,ν⋆}(s⋆₀) as

v^ε_{µ⋆,ν⋆}(s⋆₀) = E_{π⋆}[G(π⋆) | π⋆ ⊨ ♦U⋆] · p + E_{π⋆}[G(π⋆) | π⋆ ⊭ ♦U⋆] · (1 − p),   (27)

where π⋆ ∼ G⋆_{µ⋆,ν⋆} and p := Pr_{µ×,ν×}(s×₀ ⊨ ϕ×). Let N′ be the number of time steps before reaching a state s× ∈ V× where V× is accepting and C×(s×) = j_{V×}. When such an s× is reached for the first time in G×_{µ×,ν×}, either
• ⟨s×, 0⟩ is reached, which means the √ε-transition has not happened yet and the pair BSCC V⋆ will be reached after N_√ε + N time steps;
• or an ⟨s×, j⟩ with some j between 1 and j_{V×} is reached, and then V⋆ will be reached in the next time step;
• or an ⟨s×, j⟩ with a j > j_{V×} is reached, which means a √ε-transition happened and after that a state with a higher priority was visited; thus the number of steps until reaching V⋆ is N_ε + N.
Here, N and N_ε denote the random variables defined in the proof of Lemma 4, and N_√ε is the number of steps before a √ε-transition happens.

We can observe that the probability that N_√ε < N′ goes to zero as ε goes to zero. Therefore, we can ignore the third item above in the limit. For sufficiently small ε, from Jensen's inequality and the Markov property, we have that

v^ε_{µ⋆,ν⋆}(s⋆₀) ≥ E_{π⋆}[G(π⋆) | π⋆ ⊨ ♦U⋆] · p ≥ E_{π⋆}[(1 − ε)^{N″ + N_√ε} | π⋆ ⊨ ♦U⋆] · v · p ≥ (1 − ε)^{n″} E_{π⋆}[(1 − ε)^{N_√ε}] · v · p,   (28)

where N″ = N + N′, n″ > 0 is a constant and v := min_{s⋆∈U⋆} v^ε_{µ⋆,ν⋆}(s⋆). Since lim_{ε→0+} v = 1 from Lemma 4 and lim_{ε→0+} E_{π⋆}[(1 − ε)^{N_√ε}] = 1, we have that

lim_{ε→0+} v^ε_{µ⋆,ν⋆}(⟨s×₀, 0⟩) ≥ Pr(s⋆₀ ⊨ ♦U⋆).   (29)
Similarly, for sufficiently small ε, using Jensen's inequality and the Markov property, we can obtain the upper bound

v^ε_{µ⋆,ν⋆}(s⋆₀) ≤ E_{π⋆}[G(π⋆) | π⋆ ⊭ ♦U⋆] · (1 − p) + p ≤ (1 − (1 − ε)^{n″} E_{π⋆}[(1 − ε)^{N_√ε}] (1 − v̄)) · (1 − p) + p,   (30)

where v̄ := max_{s⋆∈Ū⋆} v^ε_{µ⋆,ν⋆}(s⋆) and Ū⋆ is the union of the BSCCs in G⋆_{µ⋆,ν⋆} whose pairs in G×_{µ×,ν×} are rejecting. Since we have lim_{ε→0+} v̄ = 0 from Lemma 4 and lim_{ε→0+} E_{π⋆}[(1 − ε)^{N_√ε}] = 1, we can establish that

lim_{ε→0+} v^ε_{µ⋆,ν⋆}(⟨s×₀, 0⟩) ≤ Pr(s⋆₀ ⊨ ♦U⋆),   (31)

which concludes the proof.

Using the previous lemmas, we now prove our main result.

Proof of Theorem 1. Let Pr∗(s⋆₀ ⊨ ♦U⋆) denote the maximin satisfaction probability defined as

Pr∗(s⋆₀ ⊨ ♦U⋆) := max_{µ⋆} min_{ν⋆} Pr_{µ⋆,ν⋆}(s⋆₀ ⊨ ♦U⋆).   (32)

Since there are only finitely many different pure memoryless strategies that can be followed in a priority game, there exists d > 0 such that the following holds:

|Pr∗(s⋆₀ ⊨ ♦U⋆) − Pr_{µ⋆,ν⋆}(s⋆₀ ⊨ ♦U⋆)| ≥ d   (33)

for any µ⋆, ν⋆ with Pr_{µ⋆,ν⋆}(s⋆₀ ⊨ ♦U⋆) ≠ Pr∗(s⋆₀ ⊨ ♦U⋆). Also, from Lemma 6, there exists an ε > 0 such that

|v^ε_{µ⋆,ν⋆}(s⋆₀) − Pr_{µ⋆,ν⋆}(s⋆₀ ⊨ ♦U⋆)| < d/2   (34)

for all (µ⋆, ν⋆); thus, v^ε_{µ⋆₁,ν⋆₁}(s⋆₀) > v^ε_{µ⋆₂,ν⋆₂}(s⋆₀) implies Pr_{µ⋆₁,ν⋆₁}(s⋆₀ ⊨ ♦U⋆) ≥ Pr_{µ⋆₂,ν⋆₂}(s⋆₀ ⊨ ♦U⋆) for any strategy pairs (µ⋆₁, ν⋆₁) and (µ⋆₂, ν⋆₂). This, combined with Lemma 1, Lemma 5, and the fact that pure memoryless strategies suffice for both players for parity conditions (Chatterjee & Henzinger, 2012), concludes the proof.

Algorithm 1: Model-free RL for LTL tasks.
Input: LTL formula ϕ, stochastic game G, parameter ε
  Translate ϕ to a DPA A_ϕ
  Construct the product game G× of the SG G and the DPA A_ϕ
  Construct the PRM ρ for the acceptance condition of A_ϕ and ε
  Construct the priority game G⋆ by composing G× with ρ
  Initialize Q(s, a) on G⋆
  for i = 0 to I − 1 do
    for t = 0 to T − 1 do
      Derive an ϵ-greedy strategy pair (µ⋆, ν⋆) from Q
      Take the action a_t ← µ⋆(s_t) if s_t ∈ S⋆_µ, and a_t ← ν⋆(s_t) if s_t ∈ S⋆_ν
      Observe the next state s_{t+1}
      Q(s_t, a_t) ← (1 − α) Q(s_t, a_t) + α R⋆(s_t) + α (1 − ε) · (max_{a′} Q(s_{t+1}, a′) if s_{t+1} ∈ S⋆_µ, else min_{a′} Q(s_{t+1}, a′))
    end for
  end for
  return a greedy controller strategy µ⋆∗ from Q
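The learning loop in Algorithm 1 is standard minimax-Q applied to the priority game. The sketch below shows the core update and the ϵ-greedy action selection with hypothetical data structures (a tabular Q stored as a dictionary); it is an illustration, not a transcription of the released implementation.

```python
import random
from collections import defaultdict

def minimax_q_update(Q, s, a, r, s_next, next_actions, is_controller_next, alpha, gamma):
    """One tabular update: the controller maximizes, the adversary minimizes the future value."""
    future = [Q[(s_next, a2)] for a2 in next_actions]
    target = max(future) if is_controller_next else min(future)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * target)

def epsilon_greedy(Q, s, actions, is_controller, explore_prob):
    """Pick a random action w.p. explore_prob, otherwise act greedily for the state's owner."""
    if random.random() < explore_prob:
        return random.choice(list(actions))
    select = max if is_controller else min
    return select(actions, key=lambda a: Q[(s, a)])

Q = defaultdict(float)   # Q-values over priority-game state-action pairs, initialized to 0
```

Here gamma would be the priority-game discount factor 1 − ε from Definition 6, and r the reward R⋆ of the state being left.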
5. Case Studies

We implemented our approach, summarized in Algorithm 1, using Owl (Křetínský et al., 2018) for the LTL-to-DPA translation and the minimax-Q algorithm to learn the optimal controller strategies (the entire source code is provided in the supplementary file). The parameter ε was set to a small constant. During the learning phase, an ϵ-greedy strategy was followed, which chooses random actions with probability ϵ and chooses greedily otherwise. The learning rate α and the parameter ϵ were initialized to fixed values and gradually decreased during learning. Each episode terminated after K time steps. The strategies presented in Fig. 3 were obtained after a fixed budget of episodes; the first portion of the episodes started in a random state and the remaining episodes started in the initial state. The entire learning phase took less than an hour.

We used grid worlds to model the system. In this system, each cell represents a state, and a robot can move from one cell to a neighboring cell by taking one of four actions: up, down, right and left. An adversary can observe the actions the robot takes and can act to disturb the movement so that the robot may move in a direction perpendicular to the intended one. Specifically, there are four actions the adversary can choose: none, cw, ccw and both. For none, the robot moves in the intended direction w.p. 1; for cw (or ccw), the robot moves in the intended direction with high probability and otherwise moves in the perpendicular direction that is 90° clockwise (or counter-clockwise, respectively); for both, the robot may be pushed to either of the two perpendicular directions. The robot cannot move towards an obstacle or a grid edge. The obstacles are represented as large circles filled with gray. The labels of cells (i.e., states) are displayed as encircled letters filled with various colors.

Nursery Scenario. In this case study, the objective of the robot is to (i) enter a room labeled with e and stay there; (ii) inform the adult (labeled with a) exactly once that it has started monitoring the baby; (iii) visit the baby (labeled with b) and go back to the charger station (labeled with c) repeatedly; and (iv) avoid the danger zone (labeled with d). This can be formally captured as the LTL task

ϕ = ♦□e ∧ (♦a ∧ □(a → ○□¬a)) ∧ (□♦b ∧ □♦c) ∧ □¬d,   (35)

where the four conjuncts correspond to (i)-(iv), respectively. For the system in Figs. 3a and 3b, Algorithm 1 converged to an optimal strategy that first visits a while staying in the purple region in Fig. 3a, and then starts visiting b and c repeatedly while staying in the purple region in Fig. 3b. Under this strategy, the robot successfully stays in the room (e) and avoids the danger zone (d). Note that the robot takes the action down to leave the adult; otherwise the adversary might cause it to hit the obstacle and thus visit the adult a second time.

Stable Surveillance. In this case study, the robot needs to find a zone where it can successfully monitor some particular cells. There are three different zones (Fig. 3c), labeled with b, d, and g. The robot needs to monitor the cell labeled with a in b; the cell labeled with c in d; and the two cells labeled with e and f in g. The robot is free to choose any of the zones.
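Before giving the LTL formula for this task, the following is a sketch of the grid-world SG dynamics described at the beginning of this section: the controller picks a direction, the adversary picks a disturbance mode, and the realized move may be rotated 90° clockwise or counter-clockwise. The function name and the probability p_slip are placeholders; the exact disturbance probabilities used in the experiments are not reproduced here.

```python
import random

MOVES = {"up": (0, 1), "down": (0, -1), "right": (1, 0), "left": (-1, 0)}
CW  = {"up": "right", "right": "down", "down": "left", "left": "up"}
CCW = {v: k for k, v in CW.items()}

def grid_step(pos, direction, disturbance, obstacles, width, height, p_slip=0.1):
    """One SG transition: `direction` is the controller's action, `disturbance` the adversary's.

    disturbance in {"none", "cw", "ccw", "both"}; p_slip is a placeholder probability.
    """
    if disturbance == "none":
        realized = direction                       # intended direction w.p. 1
    elif disturbance in ("cw", "ccw"):
        rot = CW if disturbance == "cw" else CCW
        realized = rot[direction] if random.random() < p_slip else direction
    else:  # "both": may be pushed to either perpendicular direction
        r = random.random()
        if r < p_slip:
            realized = CW[direction]
        elif r < 2 * p_slip:
            realized = CCW[direction]
        else:
            realized = direction
    dx, dy = MOVES[realized]
    nxt = (pos[0] + dx, pos[1] + dy)
    # The robot cannot move into an obstacle or off the grid edge; it stays in place instead.
    if nxt in obstacles or not (0 <= nxt[0] < width and 0 <= nxt[1] < height):
        return pos
    return nxt
```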
This task can be captured in LTL as

ϕ = (♦□b ∧ □♦a) ∨ (♦□d ∧ □♦c) ∨ (♦□g ∧ □♦e ∧ □♦f),   (36)

where the three disjuncts correspond to the first, second, and third zones, respectively.

Figure 3. The grid worlds used in the case studies. The encircled letters are the labels of the cells. The circles filled with gray are obstacles. The cells visited under the optimal strategies are highlighted in purple. (a) Nursery: before a; (b) Nursery: after a; (c) Surveillance.

Under the strategy learned for this task (Fig. 3c), the robot moves to and stays in the region highlighted in purple, and repeatedly visits the cells labeled with e and f. Although the (sub)tasks in the other zones are relatively simpler (only one cell needs to be visited), due to the adversary there is always a chance of being moved outside of these zones; thus, moving to these zones cannot be part of an optimal strategy. Also, due to the unlabeled cell in the middle, the robot cannot go to the cell in the middle bottom; if it does, the adversary might occasionally succeed in pulling the robot to the middle.

6. Conclusions

In this work, we have introduced an RL approach to find controller strategies for LTL tasks in SGs. Our approach constructs a reward machine directly from the LTL task's specification and uses an off-the-shelf RL algorithm to obtain optimal controller strategies. For the introduced priority games, we have shown that any controller strategy that maximizes the discounted sum of rewards obtained in the worst case also maximizes the minimum probability of satisfying the LTL specification, for some sufficiently small parameter associated with the rewards. Finally, we have demonstrated the applicability of our approach on two case studies, which could not be handled in practice with the existing methods.

Our approach can be readily used for MDPs as they are special cases of SGs. Furthermore, our approach can be easily extended to learn to perform tasks described by a reward function but subject to LTL constraints. The LTL constraints can be reduced to lower bounds on the return obtained in priority games, which can be solved by any constrained SG/MDP technique.

Acknowledgements

This work is sponsored in part by the ONR under agreements N00014-17-1-2504 and N00014-20-1-2745, the AFOSR award number FA9550-19-1-0169, and the NSF CNS-1652544 grant.

References

Alshiekh, M., Bloem, R., Ehlers, R., Könighofer, B., Niekum, S., and Topcu, U. Safe reinforcement learning via shielding. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.

Ashok, P., Křetínský, J., and Weininger, M. PAC statistical model checking for Markov decision processes and stochastic games. In International Conference on Computer Aided Verification, pp. 497–519. Springer, 2019.

Bacchus, F. and Kabanza, F. Using temporal logics to express search control knowledge for planning. Artificial Intelligence, 116(1-2):123–191, 2000.

Baier, C. and Katoen, J.-P. Principles of Model Checking. MIT Press, Cambridge, MA, USA, 2008.

Bhatia, A., Kavraki, L. E., and Vardi, M. Y. Sampling-based motion planning with temporal goals, pp. 2689–2696.
IEEE, 2010.

Bouton, M., Karlsson, J., Nakhaei, A., Fujimura, K., Kochenderfer, M. J., and Tumova, J. Reinforcement learning with probabilistic guarantees for autonomous driving. In Workshop on Safety Risk and Uncertainty in Reinforcement Learning, 2018.

Bozkurt, A. K., Wang, Y., Zavlanos, M., and Pajic, M. Model-free reinforcement learning for stochastic games with linear temporal logic objectives. arXiv preprint arXiv:2010.01050, 2020a.

Bozkurt, A. K., Wang, Y., Zavlanos, M. M., and Pajic, M. Control synthesis from linear temporal logic specifications using model-free reinforcement learning, pp. 10349–10355. IEEE, 2020b.

Brázdil, T., Chatterjee, K., Chmelik, M., Forejt, V., Křetínský, J., Kwiatkowska, M., Parker, D., and Ujma, M. Verification of Markov decision processes using learning algorithms. In International Symposium on Automated Technology for Verification and Analysis, pp. 98–114. Springer, 2014.

Camacho, A. and McIlraith, S. A. Learning interpretable models expressed in linear temporal logic. In Proceedings of the International Conference on Automated Planning and Scheduling, volume 29, pp. 621–630, 2019.

Chatterjee, K. and Henzinger, T. A. A survey of stochastic ω-regular games. Journal of Computer and System Sciences, 78(2):394–413, 2012.

De Giacomo, G., Favorito, M., Iocchi, L., Patrizi, F., and Ronca, A. Temporal logic monitoring rewards via transducers. In Proceedings of the International Conference on Principles of Knowledge Representation and Reasoning, volume 17, pp. 860–870, 2020.

Esparza, J., Křetínský, J., Raskin, J.-F., and Sickert, S. From LTL and limit-deterministic Büchi automata to deterministic parity automata. In International Conference on Tools and Algorithms for the Construction and Analysis of Systems, pp. 426–442. Springer, 2017.

Filar, J. and Vrieze, K. Competitive Markov Decision Processes. Springer Science & Business Media, 2012.

Fu, J. and Topcu, U. Probably approximately correct MDP learning and control with temporal logic constraints. In Proceedings of Robotics: Science and Systems, Berkeley, USA, July 2014.

Hahn, E. M., Perez, M., Schewe, S., Somenzi, F., Trivedi, A., and Wojtczak, D. Omega-regular objectives in model-free reinforcement learning. In International Conference on Tools and Algorithms for the Construction and Analysis of Systems, pp. 395–412. Springer, 2019.

Hahn, E. M., Perez, M., Schewe, S., Somenzi, F., Trivedi, A., and Wojtczak, D. Model-free reinforcement learning for stochastic parity games. Schloss Dagstuhl–Leibniz-Zentrum für Informatik, 2020.

Hasanbeig, M., Kantaros, Y., Abate, A., Kroening, D., Pappas, G. J., and Lee, I. Reinforcement learning for temporal logic control synthesis with probabilistic satisfaction guarantees, pp. 5338–5343. IEEE, 2019.

Icarte, R. T., Klassen, T., Valenzano, R., and McIlraith, S. Using reward machines for high-level task specification and decomposition in reinforcement learning. In International Conference on Machine Learning, pp. 2107–2116, 2018.

Karaman, S., Sanfelice, R. G., and Frazzoli, E. Optimal control of mixed logical dynamical systems with linear temporal logic specifications, pp. 2117–2122. IEEE, 2008.

Kress-Gazit, H., Fainekos, G. E., and Pappas, G. J. Temporal-logic-based reactive mission and motion planning. IEEE Transactions on Robotics, 25(6):1370–1381, 2009.

Křetínský, J., Meggendorfer, T., and Sickert, S. Owl: A library for ω-words, automata, and LTL. In Lahiri, S. K. and Wang, C.
(eds.), Automated Technology for Verification and Analysis - 16th International Symposium, ATVA 2018, Los Angeles, CA, USA, October 7-10, 2018, Proceedings, volume 11138 of Lecture Notes in Computer Science, pp. 543–550. Springer, 2018.

Li, X., Vasile, C.-I., and Belta, C. Reinforcement learning with temporal logic rewards, pp. 3834–3839. IEEE, 2017.

Neyman, A. and Sorin, S. Stochastic Games and Applications, volume 570. Springer Science & Business Media, 2003.

Oura, R., Sakakibara, A., and Ushio, T. Reinforcement learning of control policy for linear temporal logic specifications using limit-deterministic generalized Büchi automata. IEEE Control Systems Letters, 4(3):761–766, 2020.

Pnueli, A. The temporal logic of programs, pp. 46–57. IEEE, 1977.

Sadigh, D., Kim, E. S., Coogan, S., Sastry, S. S., and Seshia, S. A. A learning based approach to control synthesis of Markov decision processes for linear temporal logic specifications, pp. 1091–1096. IEEE, 2014.

Shapley, L. S. Stochastic games. Proceedings of the National Academy of Sciences, 39(10):1095–1100, 1953.

Ulusoy, A., Smith, S. L., Ding, X. C., Belta, C., and Rus, D. Optimality and robustness in multi-robot path planning with temporal logic constraints. The International Journal of Robotics Research, 32(8):889–911, 2013.

Wen, M. and Topcu, U. Probably approximately correct learning in stochastic games with temporal logic specifications. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pp. 3630–3636, 2016.

Wolff, E. M., Topcu, U., and Murray, R. M. Robust control of uncertain Markov decision processes with temporal logic specifications, pp. 3372–3379. IEEE, 2012.

Yordanov, B., Tumova, J., Cerna, I., Barnat, J., and Belta, C. Temporal logic control of discrete-time piecewise affine systems, 2011.