Learning Optimal Strategies for Temporal Tasks in Stochastic Games
Alper Kamil Bozkurt, Yu Wang, Miroslav Pajic
Abstract
Linear temporal logic (LTL) is widely used to formally specify complex tasks for autonomy. Unlike usual tasks defined by reward functions only, LTL tasks are noncumulative and require memory-dependent strategies. In this work, we introduce a method to learn optimal controller strategies that maximize the satisfaction probability of LTL specifications of the desired tasks in stochastic games, which are natural extensions of Markov decision processes (MDPs) to systems with adversarial inputs. Our approach constructs a product game using the deterministic automaton derived from the given LTL task and a reward machine based on the acceptance condition of the automaton, thus allowing for the use of a model-free RL algorithm to learn an optimal controller strategy. Since the rewards and the transition probabilities of the reward machine do not depend on the number of sets defining the acceptance condition, our approach is scalable to a wide range of LTL tasks, as we demonstrate on several case studies.
1. Introduction
We consider the problem of learning optimal strategies for formally specified tasks in zero-sum turn-based stochastic games (SGs). SGs are natural extensions of Markov decision processes (MDPs) to systems where an adversary can affect the outcomes of the actions to prevent performing the given task (Shapley, 1953; Filar & Vrieze, 2012). SGs can model many sequential decision-making problems where a given task needs to be performed successfully independently of the behavior of an adversary (Neyman et al., 2003).

Linear temporal logic (LTL) (Pnueli, 1977) offers a high-level language to formally specify a wide range of tasks with temporal characteristics. For instance, the agent tasks "first go to the living room and then go to the baby's room" (sequencing), "if the battery is low, go to the charger" (conditioning) and "continuously monitor the corridors of a building" (repetition) can be easily specified using LTL.
Alper Kamil Bozkurt, Yu Wang and Miroslav Pajic are with Duke University, Durham, NC 27708, USA, {alper.bozkurt, yu.wang094, miroslav.pajic}@duke.edu

Control synthesis directly from LTL specifications in stochastic systems has been an attractive topic, as it provides safe and robust strategies that are correct-by-construction, and it has thus been widely studied in robotics and cyber-physical systems (e.g., (Bacchus & Kabanza, 2000; Karaman et al., 2008; Kress-Gazit et al., 2009; Bhatia et al., 2010; Yordanov et al., 2011; Wolff et al., 2012; Ulusoy et al., 2013)). Such approaches, however, require a complete model of the system. The common situations where the agent does not know the system/environment model in advance have motivated the use of reinforcement learning (RL) to design controller strategies for LTL tasks in MDPs. Examples include model-based approaches such as (Fu & Topcu, 2014; Brázdil et al., 2014), which learn a model of the environment or require the topology of the environment; model-free methods crafting reward functions directly from the specifications expressed in a fragment of LTL (Li et al., 2017; Icarte et al., 2018; De Giacomo et al., 2020) or from the acceptance conditions of the automata derived from the specifications (Sadigh et al., 2014; Hasanbeig et al., 2019; Camacho & McIlraith, 2019; Hahn et al., 2019; Oura et al., 2020; Bozkurt et al., 2020b); and shielding approaches (Alshiekh et al., 2018; Bouton et al., 2018). All of these focus on learning in MDPs and do not consider adversarial environments where another agent may take actions to disrupt performing the given task.

On the other hand, there are only a few studies investigating the use of learning for LTL tasks in SGs. (Wen & Topcu, 2016) introduce a model-based probably approximately correct (PAC) algorithm that pre-computes the winning states with respect to the LTL specification, thereby requiring the topology of the game a priori, whereas another PAC algorithm from (Ashok et al., 2019) only considers the reachability fragment of LTL, which can be used to specify the sequencing and conditioning tasks but not the continuous tasks. More recently, two automata-based model-free methods have been proposed to learn control strategies for LTL tasks in SGs (Bozkurt et al., 2020a; Hahn et al., 2020). These approaches construct modified product games using the automata derived from the LTL specifications and learn strategies via rewards crafted based on the acceptance condition of the automaton. However, these methods require distinct powers of some small parameter ε for the sets that define the acceptance condition, which prohibits their use in practice for many tasks. For example, even for the simple surveillance task described in Section 5, the approach in (Hahn et al., 2020) may require prohibitively long episodes even for moderate values of ε.

Consequently, in this work, we introduce a model-free RL approach to learn optimal strategies for desired LTL tasks in SGs while significantly improving scalability compared to existing methods. We start by composing an SG system model with a deterministic parity automaton (DPA) that is automatically obtained from the given LTL task.
This product game is then combined with a novel reward machine suitably crafted from the acceptance condition of the DPA. The rewards and the transition probabilities in the reward machine are independent of the number of sets defining the acceptance condition, which results in the scalability improvement. We show the applicability of our approach on two case studies with LTL tasks that cannot be effectively learned via existing approaches.
2. Preliminaries and Problem Formulation
Stochastic Games.
We use SGs to model the problem of performing a given task by the controller (Player 1) against an adversary (Player 2) in a stochastic environment.
Definition 1.
A (labeled turn-based two-player) stochastic game is a tuple G = (S, (S_µ, S_ν), s₀, A, P, AP, L) where S = S_µ ∪ S_ν is a finite set of states, S_µ is the set of states in which the controller takes actions, S_ν is the set of states under the control of the adversary, and s₀ is the initial state; A is a finite set of actions and A(s) denotes the set of actions that can be taken in a state s; P : S × A × S → [0, 1] is the probabilistic transition function such that Σ_{s′∈S} P(s, a, s′) = 1 if a ∈ A(s) and 0 otherwise; finally, AP is a finite set of atomic propositions and L : S → 2^AP is the labeling function.

SGs can be considered as infinite games played by the controller and the adversary on finite directed graphs consisting of state and state-action vertices. In each state vertex, the owner of the state chooses one of the state-action vertices of the state; the game then transitions to one of the successors of the vertex according to the given transition function.
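To make Definition 1 concrete, the following is a minimal sketch of how a labeled turn-based SG could be represented in Python. The class and field names (e.g., `StochasticGame`, `controller_states`) and the toy transition probabilities are purely illustrative and are not taken from any implementation referenced in this paper.

```python
from dataclasses import dataclass

State, Action = str, str

@dataclass
class StochasticGame:
    """A labeled turn-based two-player SG G = (S, (S_mu, S_nu), s0, A, P, AP, L)."""
    controller_states: set   # S_mu
    adversary_states: set    # S_nu
    initial_state: State     # s0
    actions: dict            # A(s): state -> set of enabled actions
    transitions: dict        # P: (state, action) -> {next_state: probability}
    labels: dict             # L: state -> set of atomic propositions

    def successors(self, s: State, a: Action) -> dict:
        """Return the distribution over next states for taking action a in state s."""
        dist = self.transitions[(s, a)]
        assert abs(sum(dist.values()) - 1.0) < 1e-9, "P(s, a, .) must sum to 1"
        return dist

# A two-state toy game: the controller owns s0, the adversary owns s1.
toy = StochasticGame(
    controller_states={"s0"},
    adversary_states={"s1"},
    initial_state="s0",
    actions={"s0": {"b1", "b2"}, "s1": {"b1"}},
    transitions={("s0", "b1"): {"s1": 0.9, "s0": 0.1},
                 ("s0", "b2"): {"s0": 1.0},
                 ("s1", "b1"): {"s0": 1.0}},
    labels={"s0": {"c"}, "s1": {"a", "b", "c"}},
)
print(toy.successors("s0", "b1"))
```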
Example 1.
An example SG is shown in Fig. 1a. The game starts in s₀, where the label {c} is received. In s₀, the controller can choose between two actions, β₁ and β₂ (the state and action names follow Fig. 1a). With β₁, the game makes a transition to s₁ with some positive probability (we abbreviate "with probability" as w.p.) and stays in s₀ otherwise. If the controller persistently chooses β₁, the game eventually moves to state s₁ w.p. 1 and {a, b, c} will be received.

Each play in an SG G induces an infinite path π := s₀s₁ . . . where for all t ≥ 0, there exists an action a ∈ A(s_t) such that P(s_t, a, s_{t+1}) > 0. We denote the state s_t, the prefix s₀s₁ . . . s_t and the suffix s_t s_{t+1} . . . by π[t], π[:t] and π[t:], respectively. The players' behaviors can be considered as strategy functions mapping the history of the visited states to an action. We focus on finite-memory strategies since they suffice for the tasks considered in this work.

Definition 2. A finite-memory strategy for a game G is a tuple σ = (M, m₀, T, α) where M is a finite set of modes; m₀ is the initial mode; T : M × S → D(M) is the transition function that maps the current mode and state to a distribution over the next modes; α : M × S → D(A) is a function that maps a given mode m ∈ M and a state s ∈ S to a discrete distribution over A(s). A controller strategy µ is a finite-memory strategy that maps only the controller states to distributions over actions. Similarly, an adversary strategy ν is a finite-memory strategy mapping the adversary states to distributions over actions. A finite-memory strategy is called pure memoryless if there is only one mode (|M| = 1) and α(m₀, s) is a point distribution assigning a probability of 1 to a single action for all s ∈ S.

Intuitively, a finite-memory strategy is a finite state machine that moves from one mode (memory state) to another as the SG states are visited, outputting a distribution over the actions in each state. Unlike the standard definition of finite-memory strategies (e.g., (Chatterjee & Henzinger, 2012; Baier & Katoen, 2008)), where transitions among the modes are all deterministic, Def. 2 also allows for probabilistic transitions.

For an SG G and a pair of a controller strategy µ and an adversary strategy ν, we denote the resulting induced Markov chain (MC) by G_{µ,ν}. We write π ∼ G_{µ,ν} to denote a path drawn from G_{µ,ν}, and π_s ∼ G^s_{µ,ν} to denote a path drawn from the MC G^s_{µ,ν}, where G^s_{µ,ν} is the same as G_{µ,ν} except that the state s is designated as the initial state instead of s₀. Finally, a bottom strongly connected component (BSCC) of an MC is a set of states such that there is a path from each state to any other state in the set and there are no transitions leaving the set. We denote the set of all BSCCs of G_{µ,ν} by B(G_{µ,ν}).
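A finite-memory strategy (Definition 2) can be viewed as a small stochastic state machine. The sketch below, with hypothetical class and variable names, shows one possible way to encode such a strategy and sample actions from it; a pure memoryless strategy is the special case with a single mode and point distributions.

```python
import random

class FiniteMemoryStrategy:
    """sigma = (M, m0, T, alpha): modes, initial mode, mode-transition and action distributions."""
    def __init__(self, modes, m0, T, alpha):
        self.modes, self.mode = modes, m0
        self.T = T          # T[(mode, state)] -> {next_mode: prob}
        self.alpha = alpha  # alpha[(mode, state)] -> {action: prob}

    @staticmethod
    def _sample(dist):
        r, acc = random.random(), 0.0
        for outcome, p in dist.items():
            acc += p
            if r <= acc:
                return outcome
        return outcome  # guard against floating-point round-off

    def act(self, state):
        """Sample an action in `state`, then update the memory mode."""
        action = self._sample(self.alpha[(self.mode, state)])
        self.mode = self._sample(self.T[(self.mode, state)])
        return action

# A pure memoryless strategy: one mode, a single action chosen with probability 1.
mu = FiniteMemoryStrategy(modes={"m0"}, m0="m0",
                          T={("m0", "s0"): {"m0": 1.0}},
                          alpha={("m0", "s0"): {"b2": 1.0}})
print(mu.act("s0"))
```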
Linear Temporal Logic. LTL provides a high-level formalism to specify tasks with temporal properties by placing requirements on infinite paths. LTL specifications consist of nested combinations of Boolean and temporal operators according to the grammar

ϕ := true | a | ϕ₁ ∧ ϕ₂ | ¬ϕ | ○ϕ | ϕ₁ U ϕ₂,   a ∈ AP.   (1)

The other Boolean operators are defined via the standard equivalences (e.g., ϕ₁ ∨ ϕ₂ := ¬(¬ϕ₁ ∧ ¬ϕ₂), ϕ → ϕ′ := ¬ϕ ∨ ϕ′). A path π of an SG G satisfies an LTL specification ϕ, denoted by π ⊨ ϕ, if one of the following holds:
• ϕ = a and a ∈ L(π[0]), i.e., a immediately holds;
• ϕ = ϕ₁ ∧ ϕ₂, π ⊨ ϕ₁ and π ⊨ ϕ₂;
• ϕ = ¬ϕ′ and π ⊭ ϕ′;
• ϕ = ○ϕ′ (called next ϕ′) and π[1:] ⊨ ϕ′;
• ϕ = ϕ₁ U ϕ₂ (called ϕ₁ until ϕ₂) and there exists t ≥ 0 such that π[t:] ⊨ ϕ₂ and for all 0 ≤ i < t, π[i:] ⊨ ϕ₁.

Figure 1. (a) An example SG; (b) an example DPA for ϕ = (♦□a ∧ □♦b) ∨ ♦□c; (c) the product game of the SG in (a) and the DPA in (b).
A product game construction example. In (a) and (c), the green circles and red squares are the controller and adversary states, respectively. The blue diamonds represent actions and the outgoing black edges represent probabilistic transitions. In (a), the sets of letters are the state labels. In (b), q₀ and q₁ are the DPA states and the arrows represent the DPA transitions triggered by the labels on the edges. The colored numbers within parentheses in (b) and (c) represent the colors of the DPA transitions and the product states, respectively.

Intuitively, the temporal operator ○ϕ expresses that ϕ needs to hold in the next time step, whereas ϕ₁ U ϕ₂ specifies that ϕ₁ needs to hold until ϕ₂ holds. Other temporal operators such as eventually (♦) and always (□) are also commonly used: ♦ϕ := true U ϕ (i.e., ϕ eventually holds), and □ϕ := ¬(♦¬ϕ) (i.e., ϕ always holds).

Example 2.
The majority of robotics tasks can be expressed as an LTL formula (Kress-Gazit et al., 2009). For instance:

• Sequencing:
The agent should go to the room a, open the package b, take the item c, and place the item on the desk d: ϕ_seq = ♦(a ∧ ♦(b ∧ ♦(c ∧ ♦d)));

• Surveillance:
The agent should repeatedly visit the rooms a and b, and if an item c is seen on the floor, should report to the administrator f: ϕ_sur = □♦a ∧ □♦b ∧ □(c → ♦f);

• Persistence and Avoidance:
The agent should go to the room a and stay there, without entering a danger zone g: ϕ_per = ♦□a ∧ □¬g.

Example 3.
Consider a simple persistence task ϕ = ♦□a in the SG from Fig. 1a. If the controller and the adversary always choose the same fixed actions in their respective states, a state labeled with 'a' is eventually reached and 'a' is received after every subsequent transition; thus, the specification is satisfied. However, if the adversary occasionally chooses a different action, a state without 'a' in its label will eventually be reached from any time point in the game; this makes the probability of satisfying the specification 0. In this case, the controller can improve the probability of satisfying ϕ by switching to the action that, with positive probability, leads to an infinite loop between two states that are both labeled with 'a'.

Deterministic Parity Automata.
Any LTL task ϕ can be translated to a deterministic parity automaton (DPA) that accepts the infinite paths satisfying ϕ (Esparza et al., 2017).

Definition 3. A deterministic parity automaton derived from an LTL specification ϕ in an SG G is a tuple A_ϕ = (Q, q₀, δ, k, C) such that Q is a finite set of automaton states; q₀ ∈ Q is the initial state; δ : Q × 2^AP → Q is the transition function; k is the number of colors; and C : Q × 2^AP → [k] is the coloring function, where [k] := {1, . . . , k}.

A path π of G induces an execution τ_π = ⟨q₀, L(π[0])⟩⟨q₁, L(π[1])⟩ . . . such that for all t ≥ 0, δ(q_t, L(π[t])) = q_{t+1}. Let Inf(τ_π) denote the set of transitions ⟨q, ρ⟩ ∈ Q × 2^AP made infinitely often by τ_π; then, a path π is accepted by the DPA if max{C(⟨q, ρ⟩) | ⟨q, ρ⟩ ∈ Inf(τ_π)} is an even number.

DPAs provide a systematic way to evaluate the satisfaction of any LTL specification, which can be expressed as satisfaction of the parity condition of the constructed DPA. The parity condition is satisfied simply when the largest color among the colors visited repeatedly is an even number. This provides a natural framework to reason about LTL tasks in SGs: the controller tries to visit the states triggering the even-colored transitions as often as possible, while the adversary tries to do the opposite (i.e., trigger odd-colored transitions).
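For an eventually-periodic (lasso-shaped) label sequence, the parity acceptance condition of Definition 3 can be checked directly: run the DPA over the prefix, iterate over whole periods of the cycle until the DPA state at a period start repeats, and test whether the largest color on the recurring segment is even. The sketch below assumes the DPA is encoded as plain Python dictionaries and that its transition function is total; the encoding is hypothetical.

```python
def dpa_accepts_lasso(delta, color, q0, prefix, cycle):
    """Check whether the infinite word prefix . cycle^omega is accepted by the DPA.

    delta[(q, label)] -> next state; color[(q, label)] -> color in {1, ..., k};
    `prefix` and `cycle` are lists of labels (e.g., frozensets of atomic propositions).
    """
    q = q0
    for label in prefix:                      # consume the transient prefix
        q = delta[(q, label)]
    seen = {}
    trace = []                                # colors observed, one per transition
    # Read whole periods of the cycle until the DPA state at the start of a period repeats;
    # from that point on, exactly the same transitions are taken forever.
    while q not in seen:
        seen[q] = len(trace)
        for label in cycle:
            trace.append(color[(q, label)])
            q = delta[(q, label)]
    recurring = trace[seen[q]:]               # transitions repeated infinitely often
    return max(recurring) % 2 == 0
```

This mirrors the reasoning used later for BSCCs: only the colors that occur infinitely often matter, and acceptance depends on the parity of their maximum.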
Example 4.
An example DPA derived from the LTL formula ϕ = (♦□a ∧ □♦b) ∨ ♦□c is shown in Fig. 1b. Any infinite execution visiting both q₀ and q₁ infinitely many times is not accepted, as the transitions between the two states are colored with 5, which is an odd number and the largest color. An execution visiting q₁ infinitely often is accepting only if, after some finite number of time steps, the execution always stays in q₁ by receiving the labels {c}, {a, c}, {b, c} or {a, b, c}, thus satisfying ♦□c. Also, the accepting executions visiting q₀ infinitely often after some point either only receive {a, c}, satisfying ♦□c, or repeatedly receive {a, c} and {a, b, c} without receiving {} or {b}, hence satisfying ♦□a ∧ □♦b.

Problem Statement
We can now formulate the problem we consider in this work.
Problem 1.
For a given LTL task specification ϕ and an SG G where the transition probabilities are completely unknown, design a model-free RL approach to learn an optimal controller strategy under which the given LTL task is performed successfully with the highest probability against an optimal (i.e., worst-case) adversary.

Formally, our objective is to learn a controller strategy µ_ϕ in the SG G such that, under µ_ϕ, the probability that a path π satisfies the LTL specification of the task ϕ is maximized in the worst case; i.e.,

µ_ϕ := arg max_µ min_ν Pr_{π∼G_{µ,ν}} {π | π ⊨ ϕ}.   (2)

However, model-free RL algorithms such as minimax-Q require a reward function associating states with scalar rewards, and they learn a strategy maximizing the minimum expected value of the return, which is the sum of the discounted rewards obtained for some discount factor γ ∈ [0, 1). For a strategy pair (µ, ν), let r^(t)_{µ,ν} denote the reward collected at time step t and G_{µ,ν} denote the return; then, we have

G_{µ,ν} := Σ_{t=0}^{∞} γ^t r^(t)_{µ,ν}.   (3)

Thus, we solve Problem 1 by systematically constructing a reward machine from an LTL specification such that any strategy maximizing the minimum expected return is an optimal controller strategy as defined in (2).
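As a small illustration of the return in (3), the sketch below computes the discounted sum of a finite reward prefix; in the priority games of Section 3 the discount factor will be γ = 1 − ε. The function name and the numbers used are illustrative only.

```python
def discounted_return(rewards, gamma):
    """G = sum_t gamma^t * r_t for a (finite prefix of a) reward sequence."""
    g, weight = 0.0, 1.0
    for r in rewards:
        g += weight * r
        weight *= gamma
    return g

# A reward of eps collected in every step approaches eps / (1 - gamma) = 1 when gamma = 1 - eps.
eps = 0.01
print(discounted_return([eps] * 10_000, 1 - eps))  # close to 1
```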
3. Priority Games
In this section, we introduce a reduction from Problem 1 to the problem of learning a controller strategy maximizing the return in (3) in the worst case. We start by constructing a product game by composing the given SG G with the DPA A_ϕ derived from the LTL specification ϕ of the desired task. We then introduce a novel reward machine based on the acceptance condition of the DPA, where the transitions and the rewards are controlled by a parameter ε. We compose the reward machine with the product game to obtain Markovian rewards; for such rewards, we show that as ε approaches zero, a strategy maximizing the minimum expected return also maximizes the minimum probability of satisfying the LTL specification; i.e., the obtained optimal strategy satisfies (2). The overall approach is summarized in Algorithm 1.

The problem of learning an optimal controller strategy for an LTL task in an SG can be reduced to meeting the parity condition via the construction of a product game.
Definition 4. A product game of an SG G and a DPA A_ϕ is a tuple G× = (S×, (S×_µ, S×_ν), s×₀, A×, P×, C×) where S× = S × Q is the set of product states; S×_µ = S_µ × Q and S×_ν = S_ν × Q are the controller and adversary product states, respectively; s×₀ = ⟨s₀, q₀⟩ is the initial state; A× = A with A×(⟨s, q⟩) = A(s); P× : S× × A× × S× → [0, 1] is the probabilistic transition function such that

P×(⟨s, q⟩, a, ⟨s′, q′⟩) = P(s, a, s′) if q′ = δ(q, L(s)), and 0 otherwise;   (4)

and C× : S× → [k] is the product coloring function such that C×(⟨s, q⟩) = C(q, L(s)). A path π× = ⟨s₀, q₀⟩⟨s₁, q₁⟩ . . . in a product game meets the parity condition if the following holds:

ϕ× := "max{C×(s×) | s× ∈ Inf×(π×)} is even".   (5)

Table 1.
Frequently used symbols and notation.

Description   | Stochastic Game (G) | Product Game (G×)   | Priority Game (G⋆)
State         | s                   | s× := ⟨s, q⟩        | s⋆ := ⟨s×, j⟩
Path          | π := s₀s₁ . . .     | π× := s×₀s×₁ . . .  | π⋆ := s⋆₀s⋆₁ . . .
Strategy Pair | (µ, ν)              | (µ×, ν×)            | (µ⋆, ν⋆)
Induced MC    | G_{µ,ν}             | G×_{µ×,ν×}          | G⋆_{µ⋆,ν⋆}
Objective     | ϕ (LTL)             | ϕ× (Parity)         | G (Return)

In (5), Inf×(π×) denotes the set of product states visited infinitely many times by π×.

From the definition and the fact that whenever the DPA A_ϕ makes a transition colored with j, a product state ⟨s, q⟩ colored with j is visited in G×, the following lemma holds:

Lemma 1.
In the product game G× of the SG G and the DPA A_ϕ, consider paths π× = ⟨s₀, q₀⟩⟨s₁, q₁⟩ . . . in G× and π = s₀s₁ . . . in G. Then, (π× ⊨ ϕ×) ⇔ (π ⊨ ϕ).

The product game effectively captures the synchronous execution of the DPA A_ϕ with the SG G. The DPA starts in its initial state, and whenever the SG moves to a state, the DPA consumes the label of the state and makes a transition. For example, when G is in s and A_ϕ is in q, the product game G× is in ⟨s, q⟩. If the SG moves to s′, the DPA moves to the state q′ = δ(q, L(s′)), which is represented by the product game transition from ⟨s, q⟩ to ⟨s′, q′⟩.

A strategy in G× induces a finite-memory strategy in G where the states of A_ϕ act as the modes governed by the transition function of A_ϕ. To illustrate, let µ× denote a pure memoryless strategy in G× and µ denote its induced strategy in G. While µ is operating in mode m corresponding to the DPA state q, if a state s′ is visited in G, µ changes its mode from m to m′, which corresponds to q′ = δ(q, L(s′)), and chooses the action that µ× selects in ⟨s′, q′⟩.

Hence, from Lemma 1, the probability of satisfying the parity condition ϕ× under a strategy pair (µ×, ν×) in the product game G× is equal to the probability of satisfying the LTL specification ϕ in the SG G under the induced strategy pair (µ, ν); i.e., it holds that

Pr{π | π ⊨ ϕ} = Pr{π× | π× ⊨ ϕ×},   (6)

where π and π× are random paths drawn from the MCs G_{µ,ν} and G×_{µ×,ν×}, respectively.
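The product construction of Definition 4 only recombines objects that are already available. A minimal sketch is given below, with hypothetical function and dictionary names; it follows the convention of (4) that the DPA reads the label of the state being left, and assumes the DPA transition function is total.

```python
def product_transitions(P, L, delta, C):
    """Build the transition function P_x and coloring C_x of the product game.

    P[(s, a)] -> {s2: prob}; L[s] -> label; delta[(q, label)] -> q2; C[(q, label)] -> color.
    Product states are represented as pairs (s, q).
    """
    dpa_states = {q for (q, _) in delta}         # every DPA state appearing in delta
    P_x, C_x = {}, {}
    for (s, a), dist in P.items():
        for q in dpa_states:
            q2 = delta[(q, L[s])]                # q' = delta(q, L(s)), as in (4)
            P_x[((s, q), a)] = {(s2, q2): p for s2, p in dist.items()}
            C_x[(s, q)] = C[(q, L[s])]           # C_x(<s, q>) = C(q, L(s))
    return P_x, C_x
```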
Example 5.

Fig. 1c presents the product game G× obtained from the SG G in Fig. 1a and the DPA A_ϕ of the LTL task ϕ in Fig. 1b. In G, if the adversary follows a pure memoryless strategy, the adversary loses (i.e., the controller wins in the sense that ϕ is satisfied) w.p. 1: under any pure memoryless adversary strategy ν, G eventually alternates either between two states whose labels satisfy ♦□a ∧ □♦b or between two states whose labels satisfy ♦□c, and thus ϕ is satisfied. However, the adversary can win in G× by following a pure memoryless strategy that selects different actions in the two product states that share the same SG state but differ in the DPA mode; in this case, the maximal color in the resulting infinite cycle of product states is 5. This pure memoryless strategy in G× induces a finite-memory strategy in G that alternates between two modes m₀ and m₁, corresponding to the DPA states q₀ and q₁, under which different actions are selected in the same SG state.
We use Pr_{µ×,ν×}(s× ⊨ ϕ×) to denote the probability that a state s× ∈ S× satisfies ϕ× under (µ×, ν×); i.e.,

Pr_{µ×,ν×}(s× ⊨ ϕ×) := Pr{π×_{s×} | π×_{s×} ⊨ ϕ×},   (7)

where π×_{s×} is a random path drawn from the product MC obtained from G×_{µ×,ν×} by designating s× as the initial state. Therefore, from (6) and Lemma 1, our objective can be revised as learning a strategy µ×_{ϕ×} in the product game G× defined as

µ×_{ϕ×} := arg max_{µ×} min_{ν×} Pr_{µ×,ν×}(s×₀ ⊨ ϕ×);   (8)

the strategy µ×_{ϕ×} is then used to induce the finite-memory strategy µ_ϕ from (2). To achieve this using model-free RL, we introduce a reward machine that ensures the equivalence between the probability of satisfying the parity condition ϕ× and the expected return.

The goal of our priority reward machines (PRMs), a class of reward machines (Icarte et al., 2018) previously used for RL in MDPs, is to provide suitable rewards for learning strategies satisfying parity conditions. PRMs have a priority state for each color in a given parity condition and output positive rewards only in the priority states corresponding to even colors.
Definition 5. A priority reward machine is a tuple ρ = (k, ε, ϱ, ϑ, Γ) where
• k is the number of colors;
• ε ∈ (0, 1) is the parameter controlling the transitions between priorities;
• ϱ = {ϱ₀, ϱ₁, . . . , ϱ_k} is the set of priority states;
• ϑ : ϱ × ({0} ∪ [k]) × ϱ → [0, 1] is the probabilistic transition function such that

ϑ(ϱ_j, c, ϱ_{j′}) =
  1 − √ε   if j = 0, j′ = 0,
  √ε       if j = 0, j′ = 1,
  1 − ε    if j ≥ 1, j ≥ c, j′ = j,
  ε        if j ≥ 1, j ≥ c, j′ = 1,
  1        if j ≥ 1, j < c, j′ = c,
  0        otherwise;

• Γ is the reward function, which assigns positive rewards only to the priority states corresponding to even colors and zero rewards to all other priority states.

Definition 6. The priority game of a product game G× and a PRM ρ is a tuple G⋆ = (S⋆, (S⋆_µ, S⋆_ν), s⋆₀, A⋆, P⋆, ε, R⋆) where
• S⋆ = S× × ({0} ∪ [k]) is the augmented state set, with S⋆_µ = S×_µ × ({0} ∪ [k]) the controller states, S⋆_ν = S⋆ \ S⋆_µ the adversary states, and s⋆₀ = ⟨s×₀, 0⟩ the initial state;
• A⋆ = A× with A⋆(⟨s×, j⟩) = A×(s×) for all ⟨s×, j⟩ ∈ S⋆ is the set of actions;
• 1 − ε is the discounting factor;
• P⋆ : S⋆ × A⋆ × S⋆ → [0, 1] is the probabilistic transition function defined, for c = C×(s×), as

P⋆(⟨s×, j⟩, a, ⟨s×′, j′⟩) =
  (1 − √ε) P×(s×, a, s×′)   if j = 0, j′ = 0,
  √ε P×(s×, a, s×′)         if j = 0, j′ = 1,
  (1 − ε) P×(s×, a, s×′)    if j ≥ 1, j ≥ c, j′ = j,
  ε P×(s×, a, s×′)          if j ≥ 1, j ≥ c, j′ = 1,
  P×(s×, a, s×′)            if j ≥ 1, j < c, j′ = c,
  0                         otherwise;

• R⋆ : S⋆ → [0, 1] is the reward function, where R⋆(⟨s×, j⟩) is the PRM reward Γ of the priority state ϱ_j.

Figure 2. A PRM for a parity condition with k colors. The rewards provided in each state are displayed next to the state names. The blue outgoing edges are the transitions triggered by the colors displayed near the edges. The black edges represent the probabilistic transitions.

A strategy in G⋆ induces a finite-memory strategy in the SG G where the PRM combined with the DPA serves as the memory mechanism. Hereafter, we focus on pure memoryless strategy pairs in priority games, and we show in the next section that if µ⋆ maximizes the minimum expected return in the priority game G⋆, its induced strategy µ maximizes the minimum satisfaction probability of the LTL task in the SG G.
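Putting Definitions 5 and 6 together, one step of the priority component can be simulated as below: the priority tracks the largest color seen since the last reset, resets to 1 with probability ε (or leaves the initial priority 0 with probability √ε), and a positive reward is produced only while the priority is even. This is a sketch of the mechanism rather than a verbatim transcription; in particular, the choice of ε as the reward magnitude and the exclusion of priority 0 from the rewarded states are assumptions, motivated by the return bound of Lemma 2 below.

```python
import math, random

def prm_step(j, c, eps):
    """Update the PRM priority j after observing a product state of color c.

    Returns (next_priority, reward). Assumption: even priorities >= 2 yield reward eps
    (consistent with the upper bound of Lemma 2); all other priorities yield 0.
    """
    if j == 0:                                  # initial priority: start tracking w.p. sqrt(eps)
        j_next = 1 if random.random() < math.sqrt(eps) else 0
    elif j < c:                                 # a higher color was observed: jump to it
        j_next = c
    else:                                       # keep the running maximum, reset to 1 w.p. eps
        j_next = 1 if random.random() < eps else j
    reward = eps if (j_next >= 2 and j_next % 2 == 0) else 0.0
    return j_next, reward
```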
4. Main Result

We now state our main result. Here, we use G(π⋆) to denote the return of a path π⋆ in a priority game. The value of a state s⋆ ∈ S⋆ under a strategy pair (µ⋆, ν⋆), denoted by v^ε_{µ⋆,ν⋆}(s⋆), is defined as the expected value of the return of a path starting in s⋆; i.e.,

v^ε_{µ⋆,ν⋆}(s⋆) := E_{π⋆_{s⋆}}[G(π⋆_{s⋆})] = E_{π⋆_{s⋆}}[ Σ_{t=0}^{∞} (1 − ε)^t R⋆(π⋆_{s⋆}[t]) ],   (10)

where π⋆_{s⋆} is a random path drawn from the MC with initial state s⋆ induced by (µ⋆, ν⋆).

Theorem 1. Let µ⋆∗ be an optimal controller strategy maximizing the minimum value of the initial state in a priority game G⋆ constructed from an SG G for a desired LTL specification ϕ, which is formally defined as

µ⋆∗ := arg max_{µ⋆} min_{ν⋆} v^ε_{µ⋆,ν⋆}(⟨s×₀, 0⟩).   (11)

Then there exists ε > 0 such that the induced strategy of µ⋆∗ in G is an optimal strategy µ_ϕ as defined in (2).

Theorem 1 states that an optimal controller strategy for a given LTL task ϕ can be obtained by finding a maximin strategy in the constructed priority game. Before proving Theorem 1, we provide some useful lemmas. Throughout the lemmas, (µ⋆, ν⋆) denotes an arbitrary strategy pair in a given priority game G⋆, and (µ×, ν×) denotes its induced strategy pair in the product game G×.

Lemma 2. For a path π⋆ in G⋆, it holds that

0 ≤ G(π⋆) ≤ 1.   (12)

Proof. Since there are no negative rewards, the return is always nonnegative. The maximum return is obtained when a reward of ε is obtained in every state, which makes the upper bound ε Σ_{t=0}^{∞} (1 − ε)^t = 1.

The parity condition in the product games is defined over the states that are visited infinitely many times. Thus, we need to establish a connection between the recurrent states of the product games and their priority games. Under a strategy pair, a BSCC of the induced MC will eventually be reached and all the states in the BSCC will be visited infinitely often. We now show that there is a one-to-one correspondence between the BSCCs of the priority games and product games.

Lemma 3. There is a bijection between the BSCCs of G⋆_{µ⋆,ν⋆} and G×_{µ×,ν×} such that for any V⋆ ∈ B(G⋆_{µ⋆,ν⋆}) and its pair V× ∈ B(G×_{µ×,ν×}) it holds that

V× = {s× | ⟨s×, j⟩ ∈ V⋆}.   (13)

Proof. A key observation is that if a state ⟨s×, j⟩ belongs to a BSCC, then ⟨s×, 1⟩ belongs to the same BSCC due to the ε-transitions. Thus, any state ⟨s×, j⟩ ∈ S⋆ either is a transient state or belongs to the BSCC containing the state ⟨s×, 1⟩; note that if ⟨s×, 1⟩ is a transient state, then every ⟨s×, j⟩ ∈ S⋆ must be a transient state. Thus, if an s× ∈ S× belongs to a V× ∈ B(G×_{µ×,ν×}), then there is only one BSCC V⋆ ∈ B(G⋆_{µ⋆,ν⋆}) including an ⟨s×, j⟩ for some j ∈ [k]. Similarly, if an ⟨s×, j⟩ ∈ S⋆ belongs to a V⋆ ∈ B(G⋆_{µ⋆,ν⋆}), then the pair V× ∈ B(G×_{µ×,ν×}) is the BSCC containing s×.

The BSCCs of a product game are of particular interest because the problem of satisfying the parity condition can be reduced to the problem of reaching certain BSCCs. For a given BSCC V× ∈ B(G×_{µ×,ν×}), let j_{V×} denote the largest color among the colors of the states in V×; i.e.,

j_{V×} := max{C×(s×) | s× ∈ V×}.   (14)

If j_{V×} is an even number, then the parity condition is satisfied almost surely once V× is reached, as every state in V× will be visited infinitely many times. In this case, V× is called accepting. Similarly, if j_{V×} is an odd number, the probability of satisfying the parity condition in V× is zero and V× is called rejecting.

Lemma 4. For a BSCC V⋆ ∈ B(G⋆_{µ⋆,ν⋆}) and its pair V× in G×_{µ×,ν×}, the following holds for all ⟨s×, j⟩ ∈ V⋆:

lim_{ε→0+} v^ε_{µ⋆,ν⋆}(⟨s×, j⟩) = 1 if V× is accepting, and 0 if V× is rejecting.   (15)
Proof. From Def. 6, once a state ⟨s×, j_{V×}⟩ ∈ V⋆ carrying the highest color of V× is reached, the priority component remains at j_{V×} until an ε-transition resets it to 1; after a reset, a state with priority j_{V×} is again reached within an almost surely finite number of steps. Let N_ε denote the number of steps until such a reset occurs and N the number of steps needed to return to priority j_{V×} afterwards. If V× is accepting, j_{V×} is even and a reward of ε is collected in every step spent at priority j_{V×}; as ε → 0+, the resets become negligible relative to the effective horizon 1/ε, and the value approaches the upper bound 1 of Lemma 2. If V× is rejecting, j_{V×} is odd, rewards can only be collected during the at most N steps following a reset, and the expected reward accumulated between two resets is at most ε E[N]; hence the value vanishes as ε → 0+.

Lemma 5. Let U⋆ be the union of the pairs (in G⋆_{µ⋆,ν⋆}) of all accepting BSCCs in G×_{µ×,ν×}. Then, it holds that

Pr_{µ×,ν×}(s× ⊨ ϕ×) = Pr_{µ⋆,ν⋆}(s⋆ ⊨ ♦U⋆),   (25)

where ♦U⋆ denotes the objective of reaching a state in U⋆.

Proof. This follows from Lemma 3. If a path π⋆ = ⟨s×₀, j₀⟩⟨s×₁, j₁⟩ · · · ∼ G⋆_{µ⋆,ν⋆} enters a BSCC V⋆ ∈ B(G⋆_{µ⋆,ν⋆}) and visits an ⟨s×, j⟩ ∈ V⋆, then its projection π× = s×₀s×₁ . . . in G×_{µ×,ν×} eventually enters the pair BSCC V× and visits s×, and vice versa.

We can now state the equivalence between the satisfaction probabilities and the values in the limit.

Lemma 6. The value of the initial state in G⋆ approaches the satisfaction probability of the parity condition in G× as ε approaches 0; i.e.,

lim_{ε→0+} v^ε_{µ⋆,ν⋆}(⟨s×₀, 0⟩) = Pr_{µ×,ν×}(s×₀ ⊨ ϕ×).   (26)

Proof of Lemma 6. From Lemma 5, we can express v^ε_{µ⋆,ν⋆}(s⋆₀) as

v^ε_{µ⋆,ν⋆}(s⋆₀) = E_{π⋆}[G(π⋆) | π⋆ ⊨ ♦U⋆] · p + E_{π⋆}[G(π⋆) | π⋆ ⊭ ♦U⋆] · (1 − p),   (27)

where π⋆ ∼ G⋆_{µ⋆,ν⋆} and p := Pr_{µ×,ν×}(s×₀ ⊨ ϕ×). Let N′ be the number of time steps before reaching a state s× ∈ V× where V× is accepting and C×(s×) = j_{V×}. When such an s× is reached for the first time in G×_{µ×,ν×}, either
• ⟨s×, 0⟩ is reached, which means the √ε-transition has not happened yet and the pair BSCC V⋆ will be reached after N_√ε + N time steps;
• or an ⟨s×, j⟩ with some j between 1 and j_{V×} is reached, and then V⋆ will be reached in the next time step;
• or an ⟨s×, j⟩ with a j > j_{V×} is reached, which means a √ε-transition happened and after that a state with a higher priority was visited; thus the number of steps until reaching V⋆ is N_ε + N.
Here, N and N_ε denote the random variables defined in the proof of Lemma 4, and N_√ε is the number of steps before a √ε-transition happens.

We can observe that the probability that N_√ε < N′ goes to zero as ε goes to zero. Therefore, we can ignore the third item above in the limit. For sufficiently small ε, from Jensen's inequality and the Markov property, we have that

v^ε_{µ⋆,ν⋆}(s⋆₀) ≥ E_{π⋆}[G(π⋆) | π⋆ ⊨ ♦U⋆] · p ≥ E_{π⋆}[(1 − ε)^{N″ + N_√ε} | π⋆ ⊨ ♦U⋆] · v · p ≥ (1 − ε)^{n″} E_{π⋆}[(1 − ε)^{N_√ε}] · v · p,   (28)

where N″ = N + N′, n″ > 0 is a constant and v := min_{s⋆∈U⋆} v^ε_{µ⋆,ν⋆}(s⋆). Since lim_{ε→0+} v = 1 from Lemma 4 and lim_{ε→0+} E_{π⋆}[(1 − ε)^{N_√ε}] = 1, we have that

lim_{ε→0+} v^ε_{µ⋆,ν⋆}(⟨s×₀, 0⟩) ≥ Pr(s⋆₀ ⊨ ♦U⋆).   (29)
Similarly, for sufficiently small ε, using Jensen's inequality and the Markov property, we can obtain the upper bound

v^ε_{µ⋆,ν⋆}(s⋆₀) ≤ E_{π⋆}[G(π⋆) | π⋆ ⊭ ♦U⋆] · (1 − p) + p ≤ (1 − (1 − ε)^{n″} E_{π⋆}[(1 − ε)^{N_√ε}] (1 − v̄)) · (1 − p) + p,   (30)

where v̄ := max_{s⋆∈Ū⋆} v^ε_{µ⋆,ν⋆}(s⋆) and Ū⋆ is the union of the BSCCs in G⋆_{µ⋆,ν⋆} whose pairs in G×_{µ×,ν×} are rejecting. Since we have lim_{ε→0+} v̄ = 0 from Lemma 4 and lim_{ε→0+} E_{π⋆}[(1 − ε)^{N_√ε}] = 1, we can establish that

lim_{ε→0+} v^ε_{µ⋆,ν⋆}(⟨s×₀, 0⟩) ≤ Pr(s⋆₀ ⊨ ♦U⋆),   (31)

which concludes the proof.

Using the previous lemmas, we now prove our main result.

Proof of Theorem 1. Let Pr∗(s⋆₀ ⊨ ♦U⋆) denote the maximin satisfaction probability defined as

Pr∗(s⋆₀ ⊨ ♦U⋆) := max_{µ⋆} min_{ν⋆} Pr_{µ⋆,ν⋆}(s⋆₀ ⊨ ♦U⋆).   (32)

Since there are only finitely many different pure memoryless strategies that can be followed in a priority game, there exists d > 0 such that the following holds:

|Pr∗(s⋆₀ ⊨ ♦U⋆) − Pr_{µ⋆,ν⋆}(s⋆₀ ⊨ ♦U⋆)| ≥ d   (33)

for any µ⋆, ν⋆ with Pr_{µ⋆,ν⋆}(s⋆₀ ⊨ ♦U⋆) ≠ Pr∗(s⋆₀ ⊨ ♦U⋆). Also, from Lemma 6, there exists an ε > 0 such that

|v^ε_{µ⋆,ν⋆}(s⋆₀) − Pr_{µ⋆,ν⋆}(s⋆₀ ⊨ ♦U⋆)| < d/2   (34)

for all (µ⋆, ν⋆); thus, v^ε_{µ⋆₁,ν⋆₁}(s⋆₀) > v^ε_{µ⋆₂,ν⋆₂}(s⋆₀) implies Pr_{µ⋆₁,ν⋆₁}(s⋆₀ ⊨ ♦U⋆) ≥ Pr_{µ⋆₂,ν⋆₂}(s⋆₀ ⊨ ♦U⋆) for any strategy pairs (µ⋆₁, ν⋆₁) and (µ⋆₂, ν⋆₂). This, combined with Lemma 1, Lemma 5, and the fact that pure memoryless strategies suffice for both players for parity conditions (Chatterjee & Henzinger, 2012), concludes the proof.

Algorithm 1: Model-free RL for LTL tasks.
Input: LTL formula ϕ, stochastic game G, parameter ε
  Translate ϕ to a DPA A_ϕ
  Construct the product game G× of the SG G and the DPA A_ϕ
  Construct the PRM ρ for the acceptance condition of A_ϕ and ε
  Construct the priority game G⋆ by composing G× with ρ
  Initialize Q(s, a) on G⋆
  for i = 0 to I − 1 do
    for t = 0 to T − 1 do
      Derive an ϵ-greedy strategy pair (µ⋆, ν⋆) from Q
      Take the action a_t ← µ⋆(s_t) if s_t ∈ S⋆_µ, and a_t ← ν⋆(s_t) if s_t ∈ S⋆_ν
      Observe the next state s_{t+1}
      Q(s_t, a_t) ← (1 − α) Q(s_t, a_t) + α R⋆(s_t) + α (1 − ε) · (max_{a′} Q(s_{t+1}, a′) if s_{t+1} ∈ S⋆_µ, else min_{a′} Q(s_{t+1}, a′))
    end for
  end for
  return a greedy controller strategy µ⋆∗ from Q
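The learning loop in Algorithm 1 is standard minimax-Q applied to the priority game. The sketch below shows the core update and the ϵ-greedy action selection with hypothetical data structures (a tabular Q stored as a dictionary); it is an illustration, not a transcription of the released implementation.

```python
import random
from collections import defaultdict

def minimax_q_update(Q, s, a, r, s_next, next_actions, is_controller_next, alpha, gamma):
    """One tabular update: the controller maximizes, the adversary minimizes the future value."""
    future = [Q[(s_next, a2)] for a2 in next_actions]
    target = max(future) if is_controller_next else min(future)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * target)

def epsilon_greedy(Q, s, actions, is_controller, explore_prob):
    """Pick a random action w.p. explore_prob, otherwise act greedily for the state's owner."""
    if random.random() < explore_prob:
        return random.choice(list(actions))
    select = max if is_controller else min
    return select(actions, key=lambda a: Q[(s, a)])

Q = defaultdict(float)   # Q-values over priority-game state-action pairs, initialized to 0
```

Here gamma would be the priority-game discount factor 1 − ε from Definition 6, and r the reward R⋆ of the state being left.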
5. Case Studies

We implemented our approach, summarized in Algorithm 1, using Owl (Křetínský et al., 2018) for the LTL-to-DPA translation and the minimax-Q algorithm to learn the optimal controller strategies (the entire source code is provided in the supplementary file). The parameter ε was set to a small constant. During the learning phase, an ϵ-greedy strategy was followed, which chooses random actions with probability ϵ and chooses greedily otherwise. The learning rate α and the parameter ϵ were initialized to fixed values and gradually decreased during learning. Each episode terminated after K time steps. The strategies presented in Fig. 3 were obtained after a fixed budget of episodes; the first portion of the episodes started in a random state and the remaining episodes started in the initial state. The entire learning phase took less than an hour.

We used grid worlds to model the system. In this system, each cell represents a state, and a robot can move from one cell to a neighboring cell by taking one of four actions: up, down, right and left. An adversary can observe the actions the robot takes and can act to disturb the movement so that the robot may move in a direction perpendicular to the intended one. Specifically, there are four actions the adversary can choose: none, cw, ccw and both. For none, the robot moves in the intended direction w.p. 1; for cw (or ccw), the robot moves in the intended direction with high probability and otherwise moves in the perpendicular direction that is 90° clockwise (or counter-clockwise, respectively); for both, the robot may be pushed to either of the two perpendicular directions. The robot cannot move towards an obstacle or a grid edge. The obstacles are represented as large circles filled with gray. The labels of cells (i.e., states) are displayed as encircled letters filled with various colors.

Nursery Scenario. In this case study, the objective of the robot is to (i) enter a room labeled with e and stay there; (ii) inform the adult (labeled with a) exactly once that it has started monitoring the baby; (iii) visit the baby (labeled with b) and go back to the charger station (labeled with c) repeatedly; and (iv) avoid the danger zone (labeled with d). This can be formally captured as the LTL task

ϕ = ♦□e ∧ (♦a ∧ □(a → ○□¬a)) ∧ (□♦b ∧ □♦c) ∧ □¬d,   (35)

where the four conjuncts correspond to (i)-(iv), respectively. For the system in Figs. 3a and 3b, Algorithm 1 converged to an optimal strategy that first visits a while staying in the purple region in Fig. 3a, and then starts visiting b and c repeatedly while staying in the purple region in Fig. 3b. Under this strategy, the robot successfully stays in the room (e) and avoids the danger zone (d). Note that the robot takes the action down to leave the adult; otherwise the adversary might cause it to hit the obstacle and thus visit the adult a second time.

Stable Surveillance. In this case study, the robot needs to find a zone where it can successfully monitor some particular cells. There are three different zones (Fig. 3c), labeled with b, d, and g. The robot needs to monitor the cell labeled with a in b; the cell labeled with c in d; and the two cells labeled with e and f in g. The robot is free to choose any of the zones.
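Before giving the LTL formula for this task, the following is a sketch of the grid-world SG dynamics described at the beginning of this section: the controller picks a direction, the adversary picks a disturbance mode, and the realized move may be rotated 90° clockwise or counter-clockwise. The function name and the probability p_slip are placeholders; the exact disturbance probabilities used in the experiments are not reproduced here.

```python
import random

MOVES = {"up": (0, 1), "down": (0, -1), "right": (1, 0), "left": (-1, 0)}
CW  = {"up": "right", "right": "down", "down": "left", "left": "up"}
CCW = {v: k for k, v in CW.items()}

def grid_step(pos, direction, disturbance, obstacles, width, height, p_slip=0.1):
    """One SG transition: `direction` is the controller's action, `disturbance` the adversary's.

    disturbance in {"none", "cw", "ccw", "both"}; p_slip is a placeholder probability.
    """
    if disturbance == "none":
        realized = direction                       # intended direction w.p. 1
    elif disturbance in ("cw", "ccw"):
        rot = CW if disturbance == "cw" else CCW
        realized = rot[direction] if random.random() < p_slip else direction
    else:  # "both": may be pushed to either perpendicular direction
        r = random.random()
        if r < p_slip:
            realized = CW[direction]
        elif r < 2 * p_slip:
            realized = CCW[direction]
        else:
            realized = direction
    dx, dy = MOVES[realized]
    nxt = (pos[0] + dx, pos[1] + dy)
    # The robot cannot move into an obstacle or off the grid edge; it stays in place instead.
    if nxt in obstacles or not (0 <= nxt[0] < width and 0 <= nxt[1] < height):
        return pos
    return nxt
```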
This task can be captured in LTL as

ϕ = (♦□b ∧ □♦a) ∨ (♦□d ∧ □♦c) ∨ (♦□g ∧ □♦e ∧ □♦f),   (36)

where the three disjuncts correspond to the first, second, and third zones, respectively.

Figure 3. The grid worlds used in the case studies. The encircled letters are the labels of the cells. The circles filled with gray are obstacles. The cells visited under the optimal strategies are highlighted in purple. (a) Nursery: before a; (b) Nursery: after a; (c) Surveillance.

Under the strategy learned for this task (Fig. 3c), the robot moves to and stays in the region highlighted in purple, and repeatedly visits the cells labeled with e and f. Although the (sub)tasks in the other zones are relatively simpler (only one cell needs to be visited), due to the adversary there is always a chance of being moved outside of these zones; thus, moving to these zones cannot be part of an optimal strategy. Also, due to the unlabeled cell in the middle, the robot cannot go to the cell in the middle bottom; if it does, the adversary might occasionally succeed in pulling the robot to the middle.

6. Conclusions

In this work, we have introduced an RL approach to find controller strategies for LTL tasks in SGs. Our approach constructs a reward machine directly from the LTL task's specification and uses an off-the-shelf RL algorithm to obtain optimal controller strategies. For the introduced priority games, we have shown that any controller strategy that maximizes the discounted sum of rewards obtained in the worst case also maximizes the minimum probability of satisfying the LTL specification, for some sufficiently small parameter associated with the rewards. Finally, we have demonstrated the applicability of our approach on two case studies, which could not be handled in practice with the existing methods.

Our approach can be readily used for MDPs as they are special cases of SGs. Furthermore, our approach can be easily extended to learn to perform tasks described by a reward function but subject to LTL constraints. The LTL constraints can be reduced to lower bounds on the return obtained in priority games, which can be solved by any constrained SG/MDP technique.

Acknowledgements

This work is sponsored in part by the ONR under agreements N00014-17-1-2504 and N00014-20-1-2745, the AFOSR award number FA9550-19-1-0169, and the NSF CNS-1652544 grant.

References

Alshiekh, M., Bloem, R., Ehlers, R., Könighofer, B., Niekum, S., and Topcu, U. Safe reinforcement learning via shielding. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.

Ashok, P., Křetínský, J., and Weininger, M. PAC statistical model checking for Markov decision processes and stochastic games. In International Conference on Computer Aided Verification, pp. 497–519. Springer, 2019.

Bacchus, F. and Kabanza, F. Using temporal logics to express search control knowledge for planning. Artificial Intelligence, 116(1-2):123–191, 2000.

Baier, C. and Katoen, J.-P. Principles of Model Checking. MIT Press, Cambridge, MA, USA, 2008.

Bhatia, A., Kavraki, L. E., and Vardi, M. Y. Sampling-based motion planning with temporal goals, pp. 2689–2696.
IEEE, 2010.

Bouton, M., Karlsson, J., Nakhaei, A., Fujimura, K., Kochenderfer, M. J., and Tumova, J. Reinforcement learning with probabilistic guarantees for autonomous driving. In Workshop on Safety Risk and Uncertainty in Reinforcement Learning, 2018.

Bozkurt, A. K., Wang, Y., Zavlanos, M., and Pajic, M. Model-free reinforcement learning for stochastic games with linear temporal logic objectives. arXiv preprint arXiv:2010.01050, 2020a.

Bozkurt, A. K., Wang, Y., Zavlanos, M. M., and Pajic, M. Control synthesis from linear temporal logic specifications using model-free reinforcement learning, pp. 10349–10355. IEEE, 2020b.

Brázdil, T., Chatterjee, K., Chmelik, M., Forejt, V., Křetínský, J., Kwiatkowska, M., Parker, D., and Ujma, M. Verification of Markov decision processes using learning algorithms. In International Symposium on Automated Technology for Verification and Analysis, pp. 98–114. Springer, 2014.

Camacho, A. and McIlraith, S. A. Learning interpretable models expressed in linear temporal logic. In Proceedings of the International Conference on Automated Planning and Scheduling, volume 29, pp. 621–630, 2019.

Chatterjee, K. and Henzinger, T. A. A survey of stochastic ω-regular games. Journal of Computer and System Sciences, 78(2):394–413, 2012.

De Giacomo, G., Favorito, M., Iocchi, L., Patrizi, F., and Ronca, A. Temporal logic monitoring rewards via transducers. In Proceedings of the International Conference on Principles of Knowledge Representation and Reasoning, volume 17, pp. 860–870, 2020.

Esparza, J., Křetínský, J., Raskin, J.-F., and Sickert, S. From LTL and limit-deterministic Büchi automata to deterministic parity automata. In International Conference on Tools and Algorithms for the Construction and Analysis of Systems, pp. 426–442. Springer, 2017.

Filar, J. and Vrieze, K. Competitive Markov Decision Processes. Springer Science & Business Media, 2012.

Fu, J. and Topcu, U. Probably approximately correct MDP learning and control with temporal logic constraints. In Proceedings of Robotics: Science and Systems, Berkeley, USA, July 2014.

Hahn, E. M., Perez, M., Schewe, S., Somenzi, F., Trivedi, A., and Wojtczak, D. Omega-regular objectives in model-free reinforcement learning. In International Conference on Tools and Algorithms for the Construction and Analysis of Systems, pp. 395–412. Springer, 2019.

Hahn, E. M., Perez, M., Schewe, S., Somenzi, F., Trivedi, A., and Wojtczak, D. Model-free reinforcement learning for stochastic parity games. Schloss Dagstuhl–Leibniz-Zentrum für Informatik, 2020.

Hasanbeig, M., Kantaros, Y., Abate, A., Kroening, D., Pappas, G. J., and Lee, I. Reinforcement learning for temporal logic control synthesis with probabilistic satisfaction guarantees, pp. 5338–5343. IEEE, 2019.

Icarte, R. T., Klassen, T., Valenzano, R., and McIlraith, S. Using reward machines for high-level task specification and decomposition in reinforcement learning. In International Conference on Machine Learning, pp. 2107–2116, 2018.

Karaman, S., Sanfelice, R. G., and Frazzoli, E. Optimal control of mixed logical dynamical systems with linear temporal logic specifications, pp. 2117–2122. IEEE, 2008.

Kress-Gazit, H., Fainekos, G. E., and Pappas, G. J. Temporal-logic-based reactive mission and motion planning. IEEE Transactions on Robotics, 25(6):1370–1381, 2009.

Křetínský, J., Meggendorfer, T., and Sickert, S. Owl: A library for ω-words, automata, and LTL. In Lahiri, S. K. and Wang, C.
(eds.), Automated Technology for Verification and Analysis - 16th International Symposium, ATVA 2018, Los Angeles, CA, USA, October 7-10, 2018, Proceedings, volume 11138 of Lecture Notes in Computer Science, pp. 543–550. Springer, 2018.

Li, X., Vasile, C.-I., and Belta, C. Reinforcement learning with temporal logic rewards, pp. 3834–3839. IEEE, 2017.

Neyman, A. and Sorin, S. Stochastic Games and Applications, volume 570. Springer Science & Business Media, 2003.

Oura, R., Sakakibara, A., and Ushio, T. Reinforcement learning of control policy for linear temporal logic specifications using limit-deterministic generalized Büchi automata. IEEE Control Systems Letters, 4(3):761–766, 2020.

Pnueli, A. The temporal logic of programs, pp. 46–57. IEEE, 1977.

Sadigh, D., Kim, E. S., Coogan, S., Sastry, S. S., and Seshia, S. A. A learning based approach to control synthesis of Markov decision processes for linear temporal logic specifications, pp. 1091–1096. IEEE, 2014.

Shapley, L. S. Stochastic games. Proceedings of the National Academy of Sciences, 39(10):1095–1100, 1953.

Ulusoy, A., Smith, S. L., Ding, X. C., Belta, C., and Rus, D. Optimality and robustness in multi-robot path planning with temporal logic constraints. The International Journal of Robotics Research, 32(8):889–911, 2013.

Wen, M. and Topcu, U. Probably approximately correct learning in stochastic games with temporal logic specifications. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pp. 3630–3636, 2016.

Wolff, E. M., Topcu, U., and Murray, R. M. Robust control of uncertain Markov decision processes with temporal logic specifications, pp. 3372–3379. IEEE, 2012.

Yordanov, B., Tumova, J., Cerna, I., Barnat, J., and Belta, C. Temporal logic control of discrete-time piecewise affine systems, 2011.