The Logical Options Framework
Brandon Araki, Xiao Li, Kiran Vodrahalli, Jonathan DeCastro, Micah J. Fry, Daniela Rus
Abstract
Learning composable policies for environments with complex rules and tasks is a challenging problem. We introduce a hierarchical reinforcement learning framework called the Logical Options Framework (LOF) that learns policies that are satisfying, optimal, and composable. LOF efficiently learns policies that satisfy tasks by representing the task as an automaton and integrating it into learning and planning. We provide and prove conditions under which LOF will learn satisfying, optimal policies. And lastly, we show how LOF's learned policies can be composed to satisfy unseen tasks with only 10-50 retraining steps. We evaluate LOF on four tasks in discrete and continuous domains, including a 3D pick-and-place environment.
1. Introduction
To operate in the real world, intelligent agents must be able to make long-term plans by reasoning over symbolic abstractions while also maintaining the ability to react to low-level stimuli in their environment (Zhang & Sridharan, 2020). Many environments obey rules that can be represented as logical formulae; e.g., the rules a driver follows while driving, or a recipe a chef follows to cook a dish. Traditional motion and path planning techniques struggle to plan over these long-horizon tasks, but hierarchical approaches such as hierarchical reinforcement learning (HRL) can solve lengthy tasks by planning over both the high-level rules and the low-level environment. However, solving these problems involves trade-offs among multiple desirable properties, which we identify as satisfaction, optimality, and composability (described below). Today's hierarchical planning algorithms lack at least one of these objectives. For example, Reward Machines (Icarte et al., 2018) are satisfying and optimal, but not composable; the options framework (Sutton et al., 1999) is composable and hierarchically optimal, but cannot satisfy specifications. An algorithm that achieves all three of these properties would be very powerful because it would enable a model learned on one set of rules to generalize to arbitrary rules. We introduce the Logical Options Framework, which builds upon the options framework and aims to combine symbolic reasoning and low-level control to achieve satisfaction, optimality, and composability with as few compromises as possible. Furthermore, we demonstrate that models learned with our framework generalize to arbitrary sets of rules without any further learning, and we also show that our framework is compatible with arbitrary domains and planning algorithms, from discrete domains and value iteration to continuous domains and proximal policy optimization (PPO).

CSAIL, Massachusetts Institute of Technology, Cambridge, MA; Department of Computer Science, Columbia University, New York, NY; Toyota Research Institute, Cambridge, MA; MIT Lincoln Laboratory, Lexington, MA. Correspondence to: Brandon Araki <[email protected]>. DISTRIBUTION STATEMENT A. Approved for public release. Distribution is unlimited.
Satisfaction:
An agent operating in an environment governed by rules must be able to satisfy the specified rules. Satisfaction is a concept from formal logic, in which the input to a logical formula causes the formula to evaluate to True. Logical formulae can encapsulate rules and tasks like the ones described in Fig. 1, such as "pick up the groceries" and "do not drive into a lake". In this paper, we state conditions under which our method is guaranteed to learn satisfying policies.
Optimality:
Optimality requires that the agent maximize its expected cumulative reward for each episode. In general, satisfaction can be achieved by rewarding the agent for satisfying the rules of the environment. In hierarchical planning there are several types of optimality, including hierarchical optimality (optimal with respect to the hierarchy) and optimality (optimal with respect to everything). We prove in this paper that our method is hierarchically optimal and, under certain conditions, optimal.
Composability:
Our method is also composable – once it has learned the low-level components of a task, the learned model can be rearranged to satisfy arbitrary tasks. More specifically, the rules of an environment can be factored into liveness and safety properties, which we discuss in Sec. 3. The learned model has high-level actions called options that can be composed to satisfy new liveness properties. A shortcoming of many RL models is that they are not composable – trained to solve one specific task, they are incapable of handling even small variations in the task structure. However, the real world is a dynamic and unpredictable place, so the ability to use a learned model to automatically reason over as-yet-unseen tasks is a crucial element of intelligence.

Fig. 1 gives an example of how LOF works. The environment is a world with a grocery store, your (hypothetical) kid, your house, and some lakes, and in which you, the agent, are driving a car. The propositions are divided into "subgoals", representing events that can be achieved, such as going grocery shopping; "safety" propositions, representing events that you must avoid (driving into a lake); and "event" propositions, corresponding to events that you have no control over (receiving a phone call) (Fig. 1b). In this environment, you have to follow rules (Fig. 1a). These rules can be converted into a logical formula, and from there into a finite state automaton (FSA) (Fig. 1b). LOF learns an option for each subgoal (illustrated by the arrows in Fig. 1c), and a meta-policy for choosing amongst the options to reach the goal state of the FSA. After learning, the options can be recombined to fulfill arbitrary tasks.

Figure 1. "Go grocery shopping, pick up the kid, and go home, unless your partner calls telling you that they will pick up the kid, in which case just go grocery shopping and then go home. And don't drive into the lake." (a) These natural language instructions can be transformed into an FSA, shown in (b). (b) The FSA representing the natural language instructions; the propositions are divided into "subgoal", "safety", and "event" propositions. (c) The low-level MDP and corresponding policy that satisfies the instructions. Many parents face this task after school ends – who picks up the kid, and who gets groceries? The pictorial symbols represent propositions, which are true or false depending on the state of the environment. The arrows in (c) represent sub-policies, and the colors of the arrows match the corresponding transitions in the FSA. The boxed phone at the beginning of some of the arrows represents how these sub-policies can occur only after the agent receives a phone call.
This paper introduces the Logical Options Framework (LOF) and makes four contributions to the hierarchical reinforcement learning literature:
1. The definition of a hierarchical semi-Markov Decision Process (SMDP) that is the product of a logical FSA and a low-level environment MDP.
2. A planning algorithm for learning options and meta-policies for the SMDP that allows the options to be composed to solve new tasks with only 10-50 retraining steps and no additional samples from the environment.
3. Conditions and proofs for satisfaction and optimality.
4. Experiments on a discrete delivery domain, a continuous 2D reacher domain, and a continuous 3D pick-and-place domain on four tasks demonstrating satisfaction, optimality, and composability.
2. Background
Linear Temporal Logic:
We use linear temporal logic (LTL) to formally specify rules (Clarke et al., 2001). LTL can express tasks and rules using temporal operators such as "eventually" and "always." LTL formulae are used only indirectly in LOF, as they are converted into automata that the algorithm uses directly. We chose to use LTL to represent rules because LTL corresponds closely to natural language and has proven to be a more natural way of expressing tasks and rules for engineers than designing FSAs by hand (Kansou, 2019). Formulae φ have the syntax grammar

φ := p | ¬φ | φ1 ∨ φ2 | ○φ | φ1 U φ2

where p is a proposition (a boolean-valued truth statement that can correspond to objects or events in the world), ¬ is negation, ∨ is disjunction, ○ is "next", and U is "until". The derived rules are conjunction (∧), implication (⟹), equivalence (↔), "eventually" (♦φ ≡ True U φ), and "always" (□φ ≡ ¬♦¬φ) (Baier & Katoen, 2008). φ1 U φ2 means that φ1 is true until φ2 is true, ♦φ means that there is a time where φ is true, and □φ means that φ is always true.
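To make the finite-trace reading of these operators concrete, the following sketch (ours, not the paper's) evaluates the co-safe fragment over a finite sequence of proposition sets. The tuple encoding of formulas and the function name are illustrative assumptions.

```python
# A minimal sketch of finite-trace semantics for the co-safe LTL operators
# used here. A trace is a list of sets of propositions true at each step;
# formulas are nested tuples such as ("eventually", ("prop", "a")).

def holds(formula, trace, t=0):
    """Check whether `formula` holds on `trace` starting at time t."""
    if t >= len(trace):
        return False
    op = formula[0]
    if op == "prop":                      # atomic proposition, e.g. ("prop", "a")
        return formula[1] in trace[t]
    if op == "not":
        return not holds(formula[1], trace, t)
    if op == "and":
        return holds(formula[1], trace, t) and holds(formula[2], trace, t)
    if op == "or":
        return holds(formula[1], trace, t) or holds(formula[2], trace, t)
    if op == "next":                      # ○ φ
        return holds(formula[1], trace, t + 1)
    if op == "eventually":                # ♦ φ ≡ True U φ
        return any(holds(formula[1], trace, k) for k in range(t, len(trace)))
    if op == "until":                     # φ1 U φ2
        for k in range(t, len(trace)):
            if holds(formula[2], trace, k):
                return True
            if not holds(formula[1], trace, k):
                return False
        return False
    raise ValueError(f"unknown operator {op}")

# ♦(a ∧ ♦ b): "eventually a, and after that eventually b"
task = ("eventually", ("and", ("prop", "a"), ("eventually", ("prop", "b"))))
print(holds(task, [set(), {"a"}, set(), {"b"}]))   # True
print(holds(task, [{"b"}, {"a"}]))                 # False
```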
The Options Framework:
The options framework is a framework for defining and solving semi-Markov Decision Processes (SMDPs) with a type of macro-action called an option (Sutton et al., 1999). The inclusion of options in an MDP problem turns it into an SMDP problem, because actions are dependent not just on the previous state but also on the identity of the currently active option, which could have been initiated many time steps before the current time. An option o is a variable-length sequence of actions defined as o = (I, π, β, R_o(s), T_o(s′|s)). I ⊆ S is the initiation set of the option. π : S × A → [0, 1] is the policy of the option. β : S → [0, 1] is the termination condition. R_o(s) is the reward model of the option. T_o(s′|s) is the transition model. A major challenge in option learning is that, in general, the number of time steps before the option terminates, k, is a random variable. With this in mind, R_o(s) is defined as the expected cumulative reward of option o given that the option is initiated in state s at time t and ends after k time steps. Letting r_t be the reward received by the agent at t time steps from the beginning of the option,

R_o(s) = E[r_1 + γ r_2 + ... + γ^{k−1} r_k]   (1)

T_o(s′|s) is the combined probability p_o(s′, k) that option o will terminate at state s′ after k time steps:

T_o(s′|s) = Σ_{k=1}^{∞} p_o(s′, k) γ^k   (2)

A crucial benefit of using options is that they can be composed in arbitrary ways. In the next section, we describe how LOF composes them to satisfy logical specifications.
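The option tuple and its two models can be written down schematically as follows. The dataclass layout and the Monte-Carlo estimation of Eqs. (1)-(2) are our own illustrative assumptions, not the paper's implementation.

```python
# A schematic rendering of an option o = (I, π, β, R_o, T_o) and of
# Monte-Carlo estimates of its reward and transition models from rollouts.
from dataclasses import dataclass
from collections import defaultdict
from typing import Callable, Dict, Hashable, Set, Tuple

State = Hashable
Action = Hashable

@dataclass
class Option:
    initiation_set: Set[State]             # I ⊆ S
    policy: Callable[[State], Action]       # π : S → A
    termination: Callable[[State], float]   # β : S → [0, 1]
    reward_model: Dict[State, float]                     # R_o(s), Eq. (1)
    transition_model: Dict[Tuple[State, State], float]   # T_o(s'|s), Eq. (2)

def estimate_option_models(rollouts, gamma):
    """Estimate R_o and T_o from rollouts of the option's policy.

    Each rollout is (start_state, [(r_1, s_1), ..., (r_k, s_k)]), where s_k is
    the state in which the option terminated after k steps.
      R_o(s)    = E[r_1 + γ r_2 + ... + γ^(k-1) r_k]
      T_o(s'|s) = Σ_k p_o(s', k) γ^k  ≈  average of γ^k over rollouts ending in s'
    """
    returns = defaultdict(float)
    weights = defaultdict(float)
    counts = defaultdict(int)
    for start, steps in rollouts:
        g, disc = 0.0, 1.0
        for r, _ in steps:
            g += disc * r
            disc *= gamma
        counts[start] += 1
        returns[start] += g
        end_state = steps[-1][1]
        weights[(start, end_state)] += gamma ** len(steps)
    reward_model = {s: returns[s] / counts[s] for s in counts}
    transition_model = {key: w / counts[key[0]] for key, w in weights.items()}
    return reward_model, transition_model
```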
3. Logical Options Framework
Here is a brief overview of how we will present our formulation of LOF:
1. The LTL formula is decomposed into liveness and safety properties. The liveness property defines the task specification and the safety property defines the costs for violating rules.
2. The propositions are divided into subgoals, safety propositions, and event propositions. Each subgoal is associated with its own option, whose goal is to achieve that subgoal. Safety propositions are used to define rules. Event propositions serve as control flow variables that affect the task.
3. We define an SMDP that is the product of a low-level MDP and a high-level logical FSA.
4. We define the logical options.
5. We present an algorithm for finding the hierarchically optimal policy on the SMDP.
6. We state conditions under which satisfaction of the LTL specification is guaranteed, and we prove that the planning algorithm converges to an optimal policy by showing that the hierarchically optimal SMDP policy is the same as the optimal MDP policy.
The Logic Formula:
LTL formulae can be translated into Büchi automata using automatic translation tools such as SPOT (Duret-Lutz et al., 2016). All Büchi automata can be decomposed into liveness and safety properties (Alpern & Schneider, 1987). We assume here that the LTL formula itself can be divided into liveness and safety formulae, φ = φ_liveness ∧ φ_safety. For the case where the LTL formula cannot be factored, see App. A. The liveness property describes "things that must happen" to satisfy the LTL formula. It is a task specification and is used in planning to determine which subgoals the agent must achieve. The safety property describes "things that can never happen" and is used to define costs for violating the rules. In LOF, the liveness property is written using a finite-trace subset of LTL called syntactically co-safe LTL (Bhatia et al., 2010), in which □ ("always") is not allowed and ○, U, and ♦ are only used in positive normal form. This way, the liveness property can be satisfied by finite sequences of propositions, so the property can be represented as an FSA.

Propositions:
Propositions are boolean-valued truth statements corresponding to goals, objects, and events in the environment. We distinguish between three types of propositions: subgoals P_G, safety propositions P_S, and event propositions P_E. Subgoals must be achieved in order to satisfy the liveness property. They are associated with goals such as "the agent is at the grocery store". They only appear in φ_liveness. Each subgoal may only be associated with one state. Note that in general, it may be impossible to avoid having subgoals appear in φ_safety; App. A describes how to deal with this scenario. Safety propositions are propositions that the agent must avoid – for example, driving into a lake. They only appear in φ_safety. Event propositions are not goals, but they can affect the task specification – for example, whether or not a phone call is received. They may occur in φ_liveness, and, with extensions described in App. A, in φ_safety. In the fully observable setting, event propositions are somewhat trivial because the agent knows exactly when/if the event will occur, but in the partially observable setting, they enable complex control flow. Our optimality guarantees only apply in the fully observable setting; however, LOF's properties of satisfaction and composability still apply in the partially observable setting. The goal state of the liveness FSA must be reachable from every other state using only subgoals. This means that no matter what event propositions occur, it must be possible for the agent to satisfy the liveness property. T_PG : S → P_G and T_PS : S → P_S relate states to the subgoal and safety propositions that are true at that state. T_PE : 2^{P_E} → {0, 1} assigns truth labels to the event propositions.
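A minimal sketch of what the labeling functions might look like for the grocery-world example of Fig. 1 follows; the coordinates, proposition names, and the fact that labels are returned as sets are all invented here for illustration.

```python
# Illustrative labeling functions for the grocery-world example (Fig. 1).
SUBGOAL_LOCATIONS = {"grocery_store": (2, 5), "kid": (7, 1), "home": (0, 0)}   # P_G
LAKE_CELLS = {(3, 3), (3, 4), (4, 3)}                                          # P_S
ACTIVE_EVENTS = frozenset({"phone_call"})   # event assignment realized this episode

def T_PG(state):
    """Subgoal propositions (P_G) true in a low-level state."""
    return {name for name, cell in SUBGOAL_LOCATIONS.items() if state == cell}

def T_PS(state):
    """Safety propositions (P_S) true in a low-level state."""
    return {"lake"} if state in LAKE_CELLS else set()

def T_PE(event_assignment):
    """Indicator (0/1) of which joint assignment of event propositions holds."""
    return 1 if frozenset(event_assignment) == ACTIVE_EVENTS else 0
```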
Hierarchical SMDP:
LOF defines a hierarchical semi-Markov Decision Process (SMDP), learns the options, and plans over them. The high level of the SMDP is an FSA specified with LTL. The low level is an environment MDP. We assume that the LTL specification φ can be decomposed into a liveness property φ_liveness and a safety property φ_safety. The propositions P are the union of the subgoals P_G, safety propositions P_S, and event propositions P_E. We assume that the liveness property can be translated into an FSA T = (F, P, T_F, R_F, f_0, f_g). F is the set of automaton states; P is the set of propositions; T_F is the transition function relating the current state and proposition to the next state, T_F : F × P × F → [0, 1]. In practice, T_F is deterministic despite our use of probabilistic notation. We assume that there is a single initial state f_0 and final state f_g, and that the goal state f_g is reachable from every state f ∈ F using only subgoals. The reward function assigns a reward to every FSA state, R_F : F → ℝ. In our experiments, the safety property takes the form ∧_{p_s ∈ P_S} □¬p_s, which implies that no safety proposition is allowed and that each has an associated cost, R_S : 2^{P_S} → ℝ. φ_safety is not limited to this form; App. A covers the general case. There is a low-level environment MDP E = (S, A, R̄_E, T_E, γ). S is the state space and A is the action space. They can be discrete or continuous. R_E : S × A → ℝ is a low-level reward function that characterizes, for example, distance or actuation costs. R̄_E combines the safety reward function R_S with R_E, e.g. R̄_E(s, a) = R_E(s, a) + R_S(T_PS(s)). The transition function of the environment is T_E : S × A × S → [0, 1]. From these parts we define a hierarchical SMDP M = (S × F, A, P, O, T_E × T_P × T_F, R_SMDP, γ). The hierarchical state space contains two elements: low-level states S and FSA states F. The action space is A. The set of propositions is P. The set of options (one option associated with each subgoal in P_G) is O. The transition function consists of the low-level environment transitions T_E and the FSA transitions T_F. T_P = T_PG × T_PS × T_PE. We call T_P, relating states to propositions, a transition function because it determines when FSA transitions occur. The transitions are applied in the order T_E, T_P, T_F. The reward function is R_SMDP(f, s, o) = R_F(f) R_o(s), so R_F(f) is a weighting on the option rewards. The SMDP has the same discount factor γ as E. Planning is done on the SMDP in two steps: first, the options O are learned over E using an appropriate policy-learning algorithm such as PPO or Reward Machines. Next, a meta-policy over the task specification T is found using the learned options and the reward function R_SMDP.

Algorithm 1: Learning and Planning with Logical Options

Given:
  Propositions P partitioned into subgoals P_G, safety propositions P_S, and event propositions P_E
  Logical FSA T = (F, P_G × P_E, T_F, R_F, f_0, f_g) derived from φ_liveness
  Low-level MDP E = (S, A, R̄_E, T_E, γ), where R̄_E(s, a) = R_E(s, a) + R_S(T_PS(s)) combines the environment and safety rewards
  Proposition labeling functions T_PG : S → P_G, T_PS : S → P_S, and T_PE : 2^{P_E} → {0, 1}
To learn:
  Set of options O, one for each subgoal p ∈ P_G
  Meta-policy µ(f, s, o), Q(f, s, o), and V(f, s)
Learn logical options:
  for p ∈ P_G do
    Learn an option that achieves p, o_p = (I_{o_p}, π_{o_p}, β_{o_p}, R_{o_p}(s), T_{o_p}(s′|s)):
      I_{o_p} = S
      β_{o_p}(s) = 1 if p ∈ T_PG(s); 0 otherwise
      π_{o_p} = optimal policy on E with rollouts terminating when p ∈ T_PG(s)
      T_{o_p}(s′|s) = E[γ^k] if p ∈ T_PG(s′), where k is the number of time steps to reach p; 0 otherwise
      R_{o_p}(s) = E[R̄_E(s_0, a_0) + γ R̄_E(s_1, a_1) + ... + γ^{k−1} R̄_E(s_{k−1}, a_{k−1})]
  end for
Find a meta-policy µ over the options:
  Initialize Q : F × S × O → ℝ and V : F × S → ℝ to 0
  for (k, f, s) ∈ [1, ..., n] × F × S do
    for o ∈ O do
      Q_k(f, s, o) ← R_F(f) R_o(s) + Σ_{f′∈F} Σ_{p̄_e∈2^{P_E}} Σ_{s′∈S} T_F(f′ | f, T_PG(s′), p̄_e) T_PE(p̄_e) T_o(s′|s) V_{k−1}(f′, s′)
    end for
    V_k(f, s) ← max_{o∈O} Q_k(f, s, o)
  end for
  µ(f, s, o) = argmax_{o∈O} Q(f, s, o)
Return: Options O, meta-policy µ(f, s, o), and Q- and value functions Q(f, s, o), V(f, s)
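The option-learning step of Alg. 1 can be sketched with tabular Q-learning as below. The environment interface (reset/step returning the combined reward R̄_E), the hyperparameters, and all helper names are assumptions of this sketch, not the paper's code.

```python
# A minimal tabular sketch of the option-learning step of Alg. 1: learn a
# policy for one subgoal p by Q-learning on E, with rollouts terminating only
# when the subgoal is reached (β_{o_p}(s) = 1 iff p ∈ T_PG(s)).
import random
from collections import defaultdict

def learn_logical_option(env, actions, is_subgoal, episodes=5000,
                         alpha=0.1, gamma=1.0, eps=0.1):
    Q = defaultdict(float)                       # Q[(s, a)]

    def greedy(s):
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s = env.reset()
        while not is_subgoal(s):                 # terminate only at the subgoal
            a = random.choice(actions) if random.random() < eps else greedy(s)
            s2, r_bar = env.step(a)              # r_bar = R_E(s, a) + R_S(T_PS(s'))
            target = r_bar if is_subgoal(s2) else \
                r_bar + gamma * max(Q[(s2, b)] for b in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return greedy, Q                             # π_{o_p} and its Q-function
```

The resulting rollouts can then be summarized into R_{o_p} and T_{o_p} as in the earlier option-model sketch.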
Logical Options:
The first step of Alg. 1 is to learn the logical options. We associate every subgoal p with an option o_p = (I_{o_p}, π_{o_p}, β_{o_p}, R_{o_p}, T_{o_p}). These terms are defined in Alg. 1. Every o_p has a policy π_{o_p} whose goal is to reach the state s_p where p is true. Options are learned by training on the environment MDP E and terminating only when s_p is reached. As we discuss in Sec. 3.1, under certain conditions the optimal option policy is guaranteed to always terminate at the subgoal. This allows us to simplify the transition model of Eq. 2 to the form given for T_{o_p} in Alg. 1. In the experiments, we further simplify this expression by setting γ = 1.
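For reference, here is one way the meta-policy step of Alg. 1 could be written under the simplifications used in the experiments (deterministic T_F and options that terminate at their subgoal states). The container layout is an assumption of this sketch, not the paper's implementation.

```python
# A sketch of the meta-policy step of Alg. 1: value iteration over the learned
# option models. options[o] = (s_o, R_o, T_o), where R_o[s] is the option
# reward model and T_o[s] = E[γ^k] is the discount weight for reaching the
# subgoal state s_o from s.
def logical_value_iteration(fsa_states, fsa_next, R_F, options, states,
                            event_assignments, T_PE, T_PG, n_iters=50):
    """fsa_next(f, subgoals_true, p_e) -> f' is the deterministic T_F."""
    V = {(f, s): 0.0 for f in fsa_states for s in states}
    Q = {}
    for _ in range(n_iters):
        for f in fsa_states:
            for s in states:
                for o, (s_o, R_o, T_o) in options.items():
                    q = R_F[f] * R_o[s]
                    for p_e in event_assignments:            # sum over 2^{P_E}
                        if not T_PE(p_e):
                            continue
                        f2 = fsa_next(f, T_PG(s_o), p_e)      # FSA transition at the subgoal
                        q += T_o[s] * V[(f2, s_o)]            # T_o(s_o|s) · V(f', s_o)
                    Q[(f, s, o)] = q
                V[(f, s)] = max(Q[(f, s, o)] for o in options)
    meta_policy = {(f, s): max(options, key=lambda o: Q[(f, s, o)])
                   for f in fsa_states for s in states}
    return meta_policy, Q, V
```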
Logical Value Iteration:
After finding the logical options, the next step is to find a meta-policy for FSA T over the options (see Alg. 1). Q- and value functions are found for the SMDP using the Bellman update equations:

Q_k(f, s, o) ← R_F(f) R_o(s) + Σ_{f′∈F} Σ_{p̄_e∈2^{P_E}} Σ_{s′∈S} T_F(f′ | f, T_PG(s′), p̄_e) T_PE(p̄_e) T_o(s′|s) V_{k−1}(f′, s′)   (3)

V_k(f, s) ← max_{o∈O} Q_k(f, s, o)   (4)

Eq. 3 differs from the generic equations for SMDP value iteration in that the transition function has two extra components, Σ_{f′∈F} T_F(f′ | f, T_P(s′), p̄_e) and Σ_{p̄_e∈2^{P_E}} T_PE(p̄_e). The equations are derived from Araki et al. (2019) and the fact that, on every step in the environment, three transitions are applied: the option transition T_o, the event proposition "transition" T_PE, and the FSA transition T_F. Note that R_o(s) and T_o(s′|s) compress the consequences of choosing an option o at a state s from a multi-step trajectory into two real-valued numbers, allowing for more efficient planning.

3.1. Satisfaction and Optimality

Here we give an overview of the proofs and necessary conditions for satisfaction and optimality. The full proofs and definitions are in App. B. First, we describe the condition for an optimal option to always reach its subgoal. Let π′(s|s′) be the optimal goal-conditioned policy for reaching a goal s′. If the optimal option policy equals the goal-conditioned policy for reaching the subgoal s_g, i.e. π*(s) = π_g(s|s_g), then the option will always reach the subgoal. This can be stated in terms of value functions: let V_{π′}(s|s′) be the expected return of π′(s|s′). If V_{π_g}(s|s_g) > V_{π′}(s|s′) ∀ s, s′ ≠ s_g, then π*(s) = π_g(s|s_g). This occurs, for example, if all rewards satisfy −∞ < R̄_E(s, a) < 0 and the episode terminates only when the agent reaches s_g (see Condition B.1).
Theorem 3.1.
Given that the conditions for satisfaction and hierarchical optimality are met, the LOF hierarchically optimal meta-policy µ_g with optimal option sub-policies π_g has the same expected returns as the optimal policy π*_HMDP and satisfies the task specification.
The results in Sec. 3.1 guarantee that LOF's learned model can be composed to satisfy new tasks. Furthermore, the composed policy has the same properties as the original policy – satisfaction and optimality. LOF's possession of composability along with satisfaction and optimality derives from two facts: 1) options are inherently composable because they can be executed in any order; 2) if the conditions of Thm. 3.1 are met, LOF is guaranteed to find a (hierarchically) optimal policy over the options that will satisfy any liveness property that uses subgoals associated with the options. The composability of LOF distinguishes it from other algorithms that can achieve satisfaction and optimality.
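In this framing, composing the learned model for a new task amounts to re-running only the meta-policy step on the new automaton. The snippet below is a hypothetical usage of the earlier value-iteration sketch; build_fsa_for_new_task, learned_options, env_states, and the labeling functions are assumed to exist and are not names from the paper.

```python
# Re-plan for an unseen task: keep the learned options fixed and recompute only
# the meta-policy on the new task FSA (no new environment samples are needed).
new_task_fsa, new_R_F = build_fsa_for_new_task()       # hypothetical helper
meta_policy, Q, V = logical_value_iteration(
    fsa_states=new_task_fsa.states,
    fsa_next=new_task_fsa.next_state,
    R_F=new_R_F,
    options=learned_options,        # unchanged options from the original task
    states=env_states,
    event_assignments=event_assignments,
    T_PE=T_PE,
    T_PG=T_PG,
    n_iters=30,                     # on the order of the 10-50 retraining steps reported
)
```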
4. Experiments & Results
Experiments:
We performed experiments to demonstrate satisfaction and composability. For satisfaction, we measure cumulative reward over training steps. Cumulative reward is a proxy for satisfaction, as the agent can only achieve the maximum reward when it satisfies its task. For the composability experiments, we take the trained options and record how many meta-policy retraining steps it takes to learn an optimal meta-policy for a new task.
Environments:
We measure the performance of LOF on three environments. The first environment is a discrete gridworld (Fig. 3a) called the "delivery domain," as it can represent a delivery truck delivering packages to three locations (a, b, c) and having a home base h. There are also obstacles o (the black squares). The second environment is called the reacher domain, from OpenAI Gym (Fig. 3d). It is a two-link arm that has continuous state and action spaces. There are four subgoals represented by colored balls: red r, green g, blue b, and yellow y. The third environment is called the pick-and-place domain, and it is a continuous 3D environment with a robotic Panda arm from CoppeliaSim and PyRep (James et al., 2019). It is inspired by the lunchbox-packing experiments of Araki et al. (2019), in which subgoals r, g, and b are food items that must be packed into lunchbox y. All environments also have an event proposition called can, which represents when the need to fulfill part of a task is cancelled.

Tasks:
We test satisfaction and composability on four tasks. The first task is a "sequential" task. For the delivery domain, the LTL formula is ♦(a ∧ ♦(b ∧ ♦(c ∧ ♦h))) ∧ □¬o – "deliver package a, then b, then c, and then return to home h. And always avoid obstacles." The next task is the "IF" task (equivalent to the task shown in Fig. 1b): (♦(c ∧ ♦a) ∧ □¬can) ∨ (♦c ∧ ♦can) ∧ □¬o – "deliver package c, and then a, unless a gets cancelled. And always avoid obstacles". We call the third task the "OR" task, ♦((a ∨ b) ∧ ♦c) ∧ □¬o – "deliver package a or b, then c, and always avoid obstacles". The "composite" task has elements of all three of the previous tasks: (♦((a ∨ b) ∧ ♦(c ∧ ♦h)) ∧ □¬can) ∨ (♦((a ∨ b) ∧ ♦h) ∧ ♦can) ∧ □¬o – "deliver package a or b, and then c, unless c gets cancelled, and then return to home h. And always avoid obstacles". The tasks for the reacher and pick-and-place environments are equivalent, except that there are no obstacles for the reacher and arm to avoid. The sequential task is meant to show that planning is efficient and effective even for long-time-horizon tasks. The "IF" task shows that the agent's policy can respond to event propositions, such as being alerted that a delivery is cancelled. The "OR" task is meant to demonstrate the optimality of our algorithm versus a greedy algorithm, as discussed in Fig. 2. Lastly, the composite task shows that learning and planning are efficient and effective even for complex tasks.
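As an illustration of how such a task is consumed by the planner, the sequential delivery task can be hand-encoded as a small deterministic FSA. The state names and table layout below are our own; in LOF the automaton would be produced automatically from the LTL formula (e.g. via SPOT), and the obstacle proposition o is handled by the safety costs rather than by this liveness FSA.

```python
# One possible hand-coded liveness FSA for ♦(a ∧ ♦(b ∧ ♦(c ∧ ♦h))):
# deliver a, then b, then c, then return home h.
SEQ_TASK_FSA = {
    "states": ["S0", "S1", "S2", "S3", "G"],
    "initial": "S0",
    "goal": "G",
    # deterministic transitions keyed by (state, achieved subgoal);
    # any pair not listed is a self-loop
    "delta": {
        ("S0", "a"): "S1",
        ("S1", "b"): "S2",
        ("S2", "c"): "S3",
        ("S3", "h"): "G",
    },
}

def fsa_step(fsa, f, subgoals_true):
    """Advance the FSA given the set of subgoal propositions true this step."""
    for p in subgoals_true:
        f = fsa["delta"].get((f, p), f)
    return f
```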
Baselines:
We test four baselines against our algorithm. Our algorithm is LOF-VI, short for "Logical Options Framework with Value Iteration," because it uses value iteration for high-level planning. LOF-QL uses Q-learning instead (details are in App. C.3). Unlike LOF-VI, LOF-QL does not need explicit knowledge of T_F, the FSA transition function. Greedy is a naive implementation of task satisfaction; it uses its knowledge of the FSA to select the next subgoal with the lowest cost to attain. This leaves it vulnerable to choosing suboptimal paths through the FSA, as shown in Fig. 2. Flat Options uses the options framework with no knowledge of the FSA. Its SMDP formulation is not hierarchical – the state space and transition function do not contain high-level states F or transition function T_F. The last baseline is RM, short for Reward Machines (Icarte et al., 2018). Whereas LOF learns options to accomplish subgoals, RM learns sub-policies for every FSA state. App. C.4 discusses the differences between RM and LOF in detail.

Figure 2. In this environment, the agent must either pick up the kid or go grocery shopping, and then go home (the "OR" task). Starting at S0, the greedy algorithm picks the next step in the FSA with the lowest cost (picking up the kid), which leads to a higher overall cost. LOF finds the optimal path through the FSA. (In the figure, LOF's total reward is −5 and Greedy's total reward is −7.)
Implementation:
For the delivery domain, options were learned using Q-learning with an ε-greedy exploration policy. RM was learned using the Q-learning for Reward Machines (QRM) algorithm described in (Icarte et al., 2018). For the reacher and pick-and-place domains, options were learned by using proximal policy optimization (PPO) (Schulman et al., 2017) to train goal-oriented policy and value functions, which were represented using fully connected neural networks. Deep-QRM was used to train RM. The implementation details are discussed more fully in App. C.

Satisfaction:
Results for the satisfaction experiments, averaged over all four tasks, are shown in Figs. 3b, 3e, and 3h (results on all tasks are in App. C.6). As expected, Flat Options shows no ability to satisfy tasks, as it has no knowledge of the FSAs. Greedy trains as quickly as LOF-VI and LOF-QL, but its returns plateau before the others because it chooses suboptimal paths in the composite and OR tasks. The difference is small in the continuous domains but still present. LOF-QL achieves as high a return as LOF-VI, but it is less composable (discussed below). RM learns much more slowly than the other methods. This is because for RM, a reward is only given for reaching the goal state, whereas in the LOF-based methods, options are rewarded for reaching their subgoals, so during training LOF-based methods have a richer reward function than RM. For the continuous domains, RM takes an order of magnitude more steps to train, so we left it out of the figures for clarity (see App. Figs. 14 and 16). However, in the continuous domains, RM eventually achieves a higher return than the LOF-based methods. This is because for those domains, we define the subgoals to be spherical regions rather than single states, violating one of the conditions for optimality. Therefore, for example, it is possible that the meta-policy does not take advantage of the dynamics of the arm to swing through the subgoals more efficiently. RM does not have this condition and learns a single policy that can take advantage of inter-subgoal dynamics to learn a more optimal policy.

Figure 3. Performance on the satisfaction and composability experiments, averaged over all tasks. Note that LOF-VI composes new meta-policies in just 10-50 retraining steps. The first row is the delivery domain ((a) environment, (b) satisfaction performance, (c) composability performance), the second row is the reacher domain ((d)-(f)), and the third row is the pick-and-place domain ((g)-(i)). All results, including RM performance on the reacher and pick-and-place domains, are in App. C.6.
Composability:
The composability experiments were done on the three composable baselines, LOF-VI, LOF-QL, and Greedy. App. C.4 discusses why RM is not composable. Flat Options is not composable because its formulation does not include the FSA T; therefore it is completely incapable of recognizing and adjusting to changes in the FSA. The composability results are shown in Figs. 3c, 3f, and 3i. Greedy requires no retraining steps to "learn" a meta-policy on a new FSA – given its current FSA state, it simply chooses the next available FSA state that has the lowest cost to achieve. However, its meta-policy may be arbitrarily suboptimal. LOF-QL learns optimal (or in the continuous case, close-to-optimal) policies, but it takes considerably more retraining steps than LOF-VI. Therefore LOF-VI strikes a balance between Greedy and LOF-QL, requiring far fewer steps than LOF-QL to retrain, and achieving better performance than Greedy.
5. Related Work
We distinguish our work from related work in HRL by its possession of three desirable properties – composability, satisfaction, and optimality. Most other works possess two of these properties at the cost of the third.
Not Composable:
The previous work most similar to ours is Icarte et al. (2018; 2019), which introduces a method to solve tasks defined by automata called Reward Machines. Their method learns a sub-policy for every state of the automaton that achieves satisfaction and optimality. However, the learned sub-policies have limited composability because they end up learning a specific path through the automaton, and if the structure of the automaton is changed, there is no guarantee that the sub-policies will be able to satisfy the new automaton without re-training. By contrast, LOF learns a sub-policy for every subgoal, independent of the automaton, and therefore the sub-policies can be arranged to satisfy arbitrary tasks. Another similar work is Logical Value Iteration (LVI) (Araki et al., 2019; 2020). LVI defines a hierarchical MDP and value iteration equations that find satisfying and optimal policies; however, the algorithm is limited to discrete domains and has limited composability. A number of HRL algorithms use reward shaping to guide the agent through the states of an automaton (Li et al., 2017; 2019; Camacho et al., 2019; Hasanbeig et al., 2018; Jothimurugan et al., 2019; Shah et al., 2020; Yuan et al., 2019). While these algorithms can guarantee satisfaction and sometimes optimality, they cannot be composed because their policies are not hierarchical. Another approach is to use a symbolic planner to find a satisfying sequence of tasks and use an RL agent to learn and execute that sequence of tasks (Gordon et al., 2019; Illanes et al., 2020; Lyu et al., 2019). However, the meta-controllers of Gordon et al. (2019) and Lyu et al. (2019) are not composable as they are trained together with the low-level controllers. Although the work of Illanes et al. (2020) is amenable to transfer learning, it is not composable. Paxton et al. (2017) and Mason et al. (2017) use logical constraints to guide exploration, and while these approaches are also satisfying and optimal, they are not composable as the agent is trained for a specific set of rules. LOF is composable unlike the above methods because it has a hierarchical action space with high-level options. Once the options are learned, they can be composed arbitrarily.
Not Satisfying:
Most hierarchical frameworks cannot satisfy tasks. Instead, they focus on using state and action abstractions to make learning more efficient (Dietterich, 2000; Dayan & Hinton, 1993; Parr & Russell, 1998; Diuk et al., 2008; Oh et al., 2019). The options framework (Sutton et al., 1999) stands out because of its composability and its guarantee of hierarchical optimality, which is why we based our work on it. There is also a class of HRL algorithms that builds on the idea of goal-oriented policies that can navigate to nearby subgoals (Eysenbach et al., 2019; Ghosh et al., 2018; Faust et al., 2018). By sampling sequences of subgoals and using a goal-oriented policy to navigate between them, these algorithms can travel much longer distances than a policy can travel on its own. Although these algorithms are "composable" in that they can navigate to far-away goals without further training, they are not able to solve tasks. Andreas et al. (2017) present an algorithm for solving simple policy "sketches" which is also composable; however, sketches are considerably less expressive than automata and linear temporal logic, which we use. Unlike the above methods, LOF is satisfying because it has a hierarchical state space with low-level MDP states and high-level FSA states. Therefore LOF can satisfy tasks by learning policies that reach the FSA goal state.
Not Optimal:
In HRL, there are at least three types of optimality – hierarchical, recursive, and overall. As defined in Dietterich (2000), the hierarchically optimal policy is the optimal policy given the constraints of the hierarchy, and recursive optimality is when a policy is optimal given the policies of its children. For example, the options framework is hierarchically optimal, while MAXQ and abstract MDPs (Gopalan et al., 2017) are recursively optimal. The method described in Kuo et al. (2020) is fully composable, but not optimal as it uses a recurrent neural network to generate a sequence of high-level actions and is therefore not guaranteed to find optimal policies. LOF is hierarchically optimal because it finds an optimal meta-policy over the high-level options, and as we state in the paper, there are also conditions under which the overall policy is optimal.
6. Discussion and Conclusion
In this work, we claim that LOF has a unique combination of three properties: satisfaction, optimality, and composability. We state and prove the conditions for satisfaction and optimality in Sec. 3.1. The experimental results confirm our claims while also pointing out some weaknesses. LOF-VI achieves optimal or near-optimal policies and trains an order of magnitude faster than the existing work most similar to it, RM. However, the optimality condition that each subgoal be associated with one state cannot be met for continuous domains, and therefore RM eventually outperforms LOF-VI. But even when optimality is not guaranteed, LOF-VI is hierarchically optimal, which is why it outperforms Greedy in the composite and OR tasks. Next, the composability experiments show that LOF-VI can compose its learned options to perform new tasks in about 10-50 iterations. Although Greedy requires no retraining steps, LOF-VI's 10-50 retraining steps are a tiny fraction of the tens of thousands of steps required to learn the original policy. Lastly, we have shown that LOF learns policies efficiently and that it can be used with a variety of domains and policy-learning algorithms.
Acknowledgments
Toyota Research Institute provided funds to support this work. This material is also supported by the Office of Naval Research grant ONR N00014-18-1-2830, and the Under Secretary of Defense for Research and Engineering under Air Force Contract No. FA8702-15-D-0001. Any opinions, findings, conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Under Secretary of Defense for Research and Engineering.
References
Abel, D. and Winder, J. The expected-length model of options. In IJCAI, 2019.

Alpern, B. and Schneider, F. B. Recognizing safety and liveness. Distributed Computing, 2(3):117–126, 1987.

Andreas, J., Klein, D., and Levine, S. Modular multitask reinforcement learning with policy sketches. In International Conference on Machine Learning, pp. 166–175, 2017.

Araki, B., Vodrahalli, K., Leech, T., Vasile, C. I., Donahue, M., and Rus, D. Learning to plan with logical automata. In Proceedings of Robotics: Science and Systems, Freiburg im Breisgau, Germany, June 2019. doi: 10.15607/RSS.2019.XV.064.

Araki, B., Vodrahalli, K., Leech, T., Vasile, C. I., Donahue, M., and Rus, D. Deep Bayesian nonparametric learning of rules and plans from demonstrations with a learned automaton prior. In AAAI, pp. 10026–10034, 2020.

Baier, C. and Katoen, J. Principles of Model Checking. MIT Press, 2008. ISBN 978-0-262-02649-9.

Bhatia, A., Kavraki, L. E., and Vardi, M. Y. Sampling-based motion planning with temporal goals. In IEEE International Conference on Robotics and Automation (ICRA), pp. 2689–2696. IEEE, 2010.

Camacho, A., Icarte, R. T., Klassen, T. Q., Valenzano, R. A., and McIlraith, S. A. LTL and beyond: Formal languages for reward function specification in reinforcement learning. In IJCAI, volume 19, pp. 6065–6073, 2019.

Clarke, E. M., Grumberg, O., and Peled, D. Model Checking. MIT Press, 2001. ISBN 978-0-262-03270-4.

Dayan, P. and Hinton, G. E. Feudal reinforcement learning. In Advances in Neural Information Processing Systems, pp. 271–278, 1993.

Dietterich, T. G. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13:227–303, 2000.

Diuk, C., Cohen, A., and Littman, M. L. An object-oriented representation for efficient reinforcement learning. In Proceedings of the 25th International Conference on Machine Learning, pp. 240–247, 2008.

Duret-Lutz, A., Lewkowicz, A., Fauchille, A., Michaud, T., Renault, E., and Xu, L. Spot 2.0 — a framework for LTL and ω-automata manipulation. In Proceedings of the 14th International Symposium on Automated Technology for Verification and Analysis (ATVA'16), volume 9938 of Lecture Notes in Computer Science, pp. 122–129. Springer, October 2016. doi: 10.1007/978-3-319-46520-3_8.

Eysenbach, B., Salakhutdinov, R. R., and Levine, S. Search on the replay buffer: Bridging planning and reinforcement learning. In Advances in Neural Information Processing Systems, pp. 15220–15231, 2019.

Faust, A., Oslund, K., Ramirez, O., Francis, A., Tapia, L., Fiser, M., and Davidson, J. PRM-RL: Long-range robotic navigation tasks by combining reinforcement learning and sampling-based planning. In IEEE International Conference on Robotics and Automation (ICRA), pp. 5113–5120. IEEE, 2018.

Ghosh, D., Gupta, A., and Levine, S. Learning actionable representations with goal-conditioned policies. arXiv preprint arXiv:1811.07819, 2018.

Gopalan, N., Littman, M. L., MacGlashan, J., Squire, S., Tellex, S., Winder, J., Wong, L. L., et al. Planning with abstract Markov decision processes. In Twenty-Seventh International Conference on Automated Planning and Scheduling, 2017.

Gordon, D., Fox, D., and Farhadi, A. What should I do now? Marrying reinforcement learning and symbolic planning. arXiv preprint arXiv:1901.01492, 2019.

Hasanbeig, M., Abate, A., and Kroening, D. Logically-constrained reinforcement learning. arXiv preprint arXiv:1801.08099, 2018.

Icarte, R. T., Klassen, T., Valenzano, R., and McIlraith, S. Using reward machines for high-level task specification and decomposition in reinforcement learning. In International Conference on Machine Learning, pp. 2107–2116, 2018.

Icarte, R. T., Waldie, E., Klassen, T., Valenzano, R., Castro, M., and McIlraith, S. Learning reward machines for partially observable reinforcement learning. In Advances in Neural Information Processing Systems, pp. 15523–15534, 2019.

Illanes, L., Yan, X., Icarte, R. T., and McIlraith, S. A. Symbolic plans as high-level instructions for reinforcement learning. In Proceedings of the International Conference on Automated Planning and Scheduling, volume 30, pp. 540–550, 2020.

James, S., Freese, M., and Davison, A. J. PyRep: Bringing V-REP to deep robot learning. arXiv preprint arXiv:1906.11176, 2019.

Jothimurugan, K., Alur, R., and Bastani, O. A composable specification language for reinforcement learning tasks. In Advances in Neural Information Processing Systems, pp. 13041–13051, 2019.

Kansou, B. K. A. Converting a subset of LTL formula to Buchi automata. International Journal of Software Engineering & Applications (IJSEA), 10(2), 2019.

Kuo, Y.-L., Katz, B., and Barbu, A. Encoding formulas as deep networks: Reinforcement learning for zero-shot execution of LTL formulas. arXiv preprint arXiv:2006.01110, 2020.

Li, X., Vasile, C.-I., and Belta, C. Reinforcement learning with temporal logic rewards. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3834–3839. IEEE, 2017.

Li, X., Serlin, Z., Yang, G., and Belta, C. A formal methods approach to interpretable reinforcement learning for robotic planning. Science Robotics, 4(37), 2019.

Lyu, D., Yang, F., Liu, B., and Gustafson, S. SDRL: Interpretable and data-efficient deep reinforcement learning leveraging symbolic planning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 2970–2977, 2019.

Mason, G. R., Calinescu, R. C., Kudenko, D., and Banks, A. Assured reinforcement learning with formally verified abstract policies. York, 2017.

Oh, Y., Patel, R., Nguyen, T., Huang, B., Pavlick, E., and Tellex, S. Planning with state abstractions for non-Markovian task specifications. arXiv preprint arXiv:1905.12096, 2019.

Parr, R. and Russell, S. J. Reinforcement learning with hierarchies of machines. In Advances in Neural Information Processing Systems, pp. 1043–1049, 1998.

Paxton, C., Raman, V., Hager, G. D., and Kobilarov, M. Combining neural networks and tree search for task and motion planning in challenging environments. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 6059–6066. IEEE, 2017.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Shah, A., Li, S., and Shah, J. Planning with uncertain specifications (PUnS). IEEE Robotics and Automation Letters, 5(2):3414–3421, 2020.

Sutton, R. S., Precup, D., and Singh, S. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211, 1999.

Yuan, L. Z., Hasanbeig, M., Abate, A., and Kroening, D. Modular deep reinforcement learning with temporal logic specifications. arXiv preprint arXiv:1909.11591, 2019.

Zhang, S. and Sridharan, M. A survey of knowledge-based sequential decision making under uncertainty. arXiv preprint arXiv:2008.08548, 2020.
A. Formulation of Logical Options Framework with Safety Automaton
In this section, we present a more general formulation of LOF than that presented in the paper. In the paper, we make two assumptions that simplify the formulation. The first assumption is that the LTL specification can be divided into two independent formulae, a liveness property and a safety property: φ = φ_liveness ∧ φ_safety. However, not all LTL formulae can be factored in this way. We show how LOF can be applied to LTL formulae that break this assumption. The second assumption is that the safety property takes a simple form that can be represented as a penalty on safety propositions. We show how LOF can be used with arbitrary safety properties.

A.1. Automata and Propositions
All LTL formulae can be translated into Büchi automata using automatic translation tools such as SPOT (Duret-Lutz et al., 2016). All Büchi automata can be decomposed into liveness and safety properties (Alpern & Schneider, 1987), so that automaton W = W_liveness × W_safety. This is a generalization of the assumption that all LTL formulae can be divided into liveness and safety properties φ_liveness and φ_safety. The liveness property W_liveness must be an FSA, although this assumption could also be loosened to allow it to be a deterministic Büchi automaton via some minor modifications (allowing multiple goal states to exist and continuing episodes indefinitely, even once a goal state has been reached).

As in the main text, we assume that there are three types of propositions – subgoals P_G, safety propositions P_S, and event propositions P_E. The event propositions have set values and can occur in both W_liveness and W_safety. Safety propositions only appear in W_safety. Subgoal propositions only appear in W_liveness. Each subgoal may only be associated with one state. Note that after writing a specification and decomposing it into W_liveness and W_safety, it is possible that some subgoals may unexpectedly appear in W_safety. This can be dealt with by creating "safety twins" of each subgoal – safety propositions that are associated with the same low-level states as the subgoals and can therefore substitute for them in W_safety.

Subgoals are propositions that the agent must achieve in order to reach the goal state of W_liveness. Although event propositions can also define transitions in W_liveness, we assume that "achieving" them is not necessary in order to reach the goal state. In other words, we assume that from any state in W_liveness, there is a path to the goal state that involves only subgoals. This is because in our formulation, the event propositions are meant to serve as propositions that the agent has no control over, such as receiving a phone call. If satisfaction of the liveness property were to depend on such a proposition, then it would be impossible to guarantee satisfaction. However, if the user is unconcerned with guaranteeing satisfaction, then specifying a liveness property in which satisfaction depends on event propositions is compatible with LOF.

Safety propositions may only occur in W_safety and are associated with things that the agent "must avoid". This is because every state of W_safety is an accepting state (Alpern & Schneider, 1987), so all transitions between the states are non-violating. However, any undefined transition is not allowed and is a violation of the safety property. In our formulation, we assign costs to violations, so that violations are allowed but come at a cost. In practice, it also may be the case that the agent is in a low-level state from which it is impossible to reach the goal state without violating the safety property. In our formulation, satisfaction of the liveness property (but not the safety property) is still guaranteed in this case, as the finite cost associated with violating the rule is less than the infinite cost of not satisfying the liveness property, so the optimal policy for the agent will be to violate the rule in order to satisfy the task (see the proofs, Appendix B). This scenario can be avoided in several ways. For example, do not specify an environment in which it is only possible for the agent to satisfy the task by violating a rule.
Or, instead of prioritizing satisfaction of the task, it is possible to instead prioritize satisfaction of the safety property. In this case, satisfaction of the liveness property would not be guaranteed but satisfaction of the safety property would be guaranteed. This could be accomplished by terminating the rollout if a safety violation occurs.

We assume that event propositions are observed – in other words, that we know the values of the event propositions from the start of a rollout. This is because we are planning in a fully observable setting, so we must make this assumption to guarantee convergence to an optimal policy. However, the partially observable case is much more interesting, in which the values of the event propositions are not known until the agent checks or the environment randomly reveals their values. This case is beyond the scope of this paper; however, LOF can still guarantee satisfaction and composability in this setting, just not optimality.

Proposition labeling functions relate states to propositions: T_PG : S → P_G, T_PS : S → P_S, and T_PE : 2^{P_E} → {0, 1}.

Given these definitions of propositions, it is possible to define the liveness and safety properties formally. W_liveness = (F, P_G ∪ P_E, T_F, R_F, f_0, f_g). F is the set of states of the liveness property. The propositions can be either subgoals P_G or event propositions P_E. The transition function relates the current FSA state and active propositions to the next FSA state, T_F : F × P_G × P_E × F → [0, 1]. The reward function assigns a reward to the current FSA state, R_F : F → ℝ. We assume there is one initial state f_0 and one goal state f_g.

The safety property is a Büchi automaton W_safety = (F_S, P_S ∪ P_E, T_S, R_S, F_0). F_S are the states of the automaton. The propositions can be safety propositions P_S or event propositions P_E. The transition function T_S relates the current state and active propositions to the next state, T_S : F_S × P_S × P_E × F_S → [0, 1]. The reward function relates the automaton state and safety propositions to rewards (or costs), R_S : F_S × P_S → ℝ. F_0 defines the set of initial states. We do not specify an accepting condition because for safety properties, every state is an accepting state.

A.2. The Environment MDP
There is a low-level environment MDP E = (S, A, R_E, T_E, γ). S is the state space and A is the action space. They can be either discrete or continuous. R_E is the low-level reward function that characterizes, for example, time, distance, or actuation costs. T_E : S × A × S → [0, 1] is the transition function and γ is the discount factor. Unlike in the simpler formulation in the paper, we do not combine R_E and the safety automaton reward function R_S in the MDP formulation E.

A.3. Logical Options
We associate every subgoal p_g with an option o_{p_g} = (I_{p_g}, π_{p_g}, β_{p_g}, R_{p_g}, T_{p_g}). Every o_{p_g} has a policy π_{p_g} whose goal is to reach the state s_{p_g} where p_g is true. Option policies are learned by training on the product of the environment and the safety automaton, E × W_safety, and terminating training only when s_{p_g} is reached. R̄_E : F_S × S × A → ℝ is the reward function of the product MDP E × W_safety. There are many reward-shaping policy-learning algorithms that specify how to define R̄_E. In fact, learning a policy for E × W_safety is the sort of hierarchical learning problem that many reward-shaping algorithms excel at, including Reward Machines (Icarte et al., 2018) and (Li et al., 2017). This is because in LOF, safety properties are not composable, so using a learning algorithm that is satisfying and optimal but not composable to learn the safety property is appropriate. Alternatively, there are many scenarios where W_safety is a trivial automaton in which each safety proposition is associated with its own state, as we describe in the main paper, so penalties can be assigned to propositions and the state of the agent in W_safety can be ignored.

Note that since the options are trained independently, one limitation of our formulation is that the safety properties cannot depend on the liveness state. In other words, when an agent reaches a new subgoal, the safety property cannot change. However, the workaround for this is not too complicated. First, if the liveness state affects the safety property, this implies that liveness propositions such as subgoals may be in the safety property. In this case, as we described above, the subgoals present in the safety property need to be substituted with "safety twin" propositions. Then during option training, a policy-learning algorithm must be chosen that will learn sub-policies for all of the safety property states, even if those states are only reached after completing a complicated task (for example, all of the sub-policies could be trained in parallel as in (Icarte et al., 2018)). Lastly, during meta-policy learning and during rollouts, when a new option is chosen, the current state of the safety property must be passed to the new option.

Algorithm 2: Learning and Planning with Logical Options

Given:
  Propositions P partitioned into subgoals P_G, safety propositions P_S, and event propositions P_E
  W_liveness = (F, P_G ∪ P_E, T_F, R_F, f_0, f_g)
  W_safety = (F_S, P_S ∪ P_E, T_S, R_S, F_0)
  Low-level MDP E = (S, A, R_E, T_E, γ)
  Proposition labeling functions T_PG : S → P_G, T_PS : S → P_S, and T_PE : 2^{P_E} → {0, 1}
To learn:
  Set of options O, one for each subgoal proposition p ∈ P_G
  Meta-policy µ(f, f_s, s, o) along with Q(f, f_s, s, o) and V(f, f_s, s)
Learn logical options:
  For every p in P_G, learn an option for achieving p, o_p = (I_{o_p}, π_{o_p}, β_{o_p}, R_{o_p}, T_{o_p}):
    I_{o_p} = S
    β_{o_p}(s) = 1 if p ∈ T_PG(s); 0 otherwise
    π_{o_p} = optimal policy on E × W_safety with rollouts terminating when p ∈ T_PG(s)
    T_{o_p}(f′_s, s′ | f_s, s) = Σ_{k=1}^{∞} p(f′_s, k) γ^k if p ∈ T_P(s′); 0 otherwise
    R_{o_p}(f_s, s) = E[R̄_E(f_s, s, a) + γ R̄_E(f_{s,1}, s_1, a_1) + ... + γ^{k−1} R̄_E(f_{s,k−1}, s_{k−1}, a_{k−1})]
Find a meta-policy µ over the options:
  Initialize Q : F × F_S × S × O → ℝ and V : F × F_S × S → ℝ to 0
  for (k, f, f_s, s) ∈ [1, ..., n] × F × F_S × S do
    for o ∈ O do
      Q_k(f, f_s, s, o) ← R_F(f) R_o(f_s, s) + Σ_{f′∈F} Σ_{f′_s∈F_S} Σ_{p̄_e∈2^{P_E}} Σ_{s′∈S} T_F(f′ | f, T_PG(s′), p̄_e) T_S(f′_s | f_s, T_PS(s′), p̄_e) T_PE(p̄_e) T_o(s′|s) V_{k−1}(f′, f′_s, s′)
    end for
    V_k(f, f_s, s) ← max_{o∈O} Q_k(f, f_s, s, o)
  end for
  µ(f, f_s, s, o) = argmax_{o∈O} Q(f, f_s, s, o)
Return: Options O, meta-policy µ(f, f_s, s, o), and Q(f, f_s, s, o), V(f, f_s, s)

The components of the logical options are defined in Alg. 2. Note that for stochastic low-level transitions, the number of time steps k at which the option terminates is stochastic and characterized by a distribution function. In general this distribution function must be learned, which is a challenging problem. However, there are many approaches to solving this problem; (Abel & Winder, 2019) contains an excellent discussion.

The most notable difference between the general formulation and the formulation in the paper is that the option policy, transition, and reward functions are functions of the safety automaton state f_s as well as the low-level state s. This makes Logical Value Iteration more complicated, because in the paper, we could assume we knew the final state of each option (i.e., the state of its associated subgoal s_g). But now, although we still assume that the option will terminate at s_g, we do not know which safety automaton state it will terminate in, so the transition model must learn a distribution over safety automaton states, and Logical Value Iteration must account for this uncertainty.

A.4. Hierarchical SMDP
Given a low-level environment E, a liveness property W_liveness, a safety property W_safety, and logical options O, we can define a hierarchical semi-Markov Decision Process (SMDP) M = E × W_liveness × W_safety with options O and reward function R_SMDP. This SMDP differs significantly from the SMDP in the paper in that the safety property W_safety is now an integral part of the formulation. R_SMDP(f, f_s, s, o) = R_F(f) R_o(f_s, s).

A.5. Logical Value Iteration
A value function and Q-function are found for the SMDP using the Bellman update equations:

Q_k(f, f_s, s, o) ← R_F(f) R_o(f_s, s) + Σ_{f′∈F} Σ_{f′_s∈F_S} Σ_{p̄_e∈2^{P_E}} Σ_{s′∈S} T_F(f′ | f, T_PG(s′), p̄_e) T_S(f′_s | f_s, T_PS(s′), p̄_e) T_PE(p̄_e) T_o(s′|s) V_{k−1}(f′, f′_s, s′)   (5)

V_k(f, f_s, s) ← max_{o∈O} Q_k(f, f_s, s, o)   (6)
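The generalized update of Eqs. (5)-(6) could be implemented along the lines of the sketch below, under the same deterministic-termination simplification as before; all argument names and container layouts are assumptions of this sketch rather than the paper's code.

```python
# Logical value iteration with both the liveness FSA state f and the safety
# automaton state f_s in the hierarchical state (Eqs. 5-6). Each option o is
# assumed to terminate at its subgoal state s_o, with R_o[(f_s, s)] the reward
# model and T_o[(f_s, s)] = E[γ^k] the discount weight for reaching s_o.
def logical_value_iteration_with_safety(F, FS, states, options, R_F,
                                        T_F, T_S, T_PG, T_PS, T_PE,
                                        event_assignments, n_iters=50):
    V = {(f, fs, s): 0.0 for f in F for fs in FS for s in states}
    Q = {}
    for _ in range(n_iters):
        for f in F:
            for fs in FS:
                for s in states:
                    for o, (s_o, R_o, T_o) in options.items():
                        q = R_F[f] * R_o[(fs, s)]
                        for p_e in event_assignments:        # sum over 2^{P_E}
                            if not T_PE(p_e):
                                continue
                            f2 = T_F(f, T_PG(s_o), p_e)       # liveness FSA transition
                            fs2 = T_S(fs, T_PS(s_o), p_e)     # safety automaton transition
                            q += T_o[(fs, s)] * V[(f2, fs2, s_o)]
                        Q[(f, fs, s, o)] = q
                    V[(f, fs, s)] = max(Q[(f, fs, s, o)] for o in options)
    return Q, V
```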
B. Proofs and Conditions for Satisfaction and Optimality

The proofs are based on the more general LOF formulation of Appendix A, as results on the more general formulation also apply to the simpler formulation used in the paper.
Definition B.1.
Let the reward function of the environment be R_E(f_s, s, a), which is some combination of R_E(s, a) and R_S(f_s, p̄_s) = R_S(f_s, T_{P_S}(s)). Let π': F_S × S × A × S → [0, 1] be the optimal goal-conditioned policy for reaching a state s'. In the case of a goal-conditioned policy, the reward function is R_E, and the objective is to maximize the expected reward with the constraint that s' is reached in a finite amount of time. We assume that every state s' is reachable from any state s, a standard regularity assumption in the MDP literature. Let V_{π'}(f_s, s | s') be the optimal expected cumulative reward for reaching s' from s with goal-conditioned policy π'. Let s_g be the state associated with the subgoal, and let π_g be the optimal goal-conditioned policy associated with reaching s_g. Let π* be the optimal policy for the environment E. Condition B.1.
The optimal policy for the option must be the same as the goal-conditioned policy that has subgoal s_g as its goal: π*(f_s, s) = π_g(f_s, s | s_g). In other words, V_{π_g}(f_s, s | s_g) > V_{π'}(f_s, s | s') ∀ f_s, s, s' ≠ s_g. This condition guarantees that the optimal option policy will always reach the subgoal s_g. It can be achieved by setting all rewards −∞ < R_E(f_s, s, a) < 0 and terminating the episode only when the agent reaches s_g. Therefore the expected return for reaching s_g is a bounded negative number, and the expected return for all other states is −∞.
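As a concrete illustration of this construction, the following sketch wraps a low-level environment so that every step yields a bounded negative reward and an episode ends only when the subgoal is reached; the wrapper name, interface, and the value of the step reward are hypothetical.

class SubgoalOptionEnv:
    # One way to realize Condition B.1: bounded negative rewards, terminate only at s_g.
    def __init__(self, env, label_G, subgoal, step_reward=-1.0):
        assert step_reward < 0.0          # rewards must be negative and bounded
        self.env, self.label_G, self.subgoal = env, label_G, subgoal
        self.step_reward = step_reward

    def step(self, action):
        s_next = self.env.step(action)
        done = self.subgoal in self.label_G(s_next)   # episode terminates only at the subgoal
        return s_next, self.step_reward, done

Under this wrapper, a policy that reaches s_g accumulates a bounded negative return, while a policy that never reaches it accumulates an unboundedly negative return, which is exactly the ordering Condition B.1 requires.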
Lemma B.2. Given that the goal state of W_liveness is reachable from any other state using only subgoals, that there is an option for every subgoal, and that all of the options meet Condition B.1, there exists a meta-policy that can reach the FSA goal state from any non-trap state in the FSA.

Proof. This follows from the fact that transitions in W_liveness are determined by achieving subgoals, and it is given that there exists an option for achieving every subgoal. Therefore, it is possible for the agent to execute any sequence of subgoals, and at least one of those sequences must satisfy the task specification, since the FSA representing the task specification is finite and satisfiable, and the goal state f_g is reachable from every FSA state f ∈ F using only subgoals.

Definition B.2.
From Dietterich (2000): A hierarchically optimal policy for an MDP or SMDP is a policy that achieves the highest cumulative reward among all policies consistent with the given hierarchy.
In our case, this means that the hierarchically optimal meta-policy is optimal over the available options.
Definition B.3.
Let the expected cumulative reward function of an option o started at state (f_s, s) be R_o(f_s, s). Let the reward function on the SMDP be R_SMDP(f, f_s, s, o) = R_F(f) R_o(f_s, s) with R_F(f) ≥ 0. Let µ': F × F_S × S × O × F → [0, 1] be the hierarchically optimal goal-conditioned meta-policy for achieving liveness state f'. The objective of the meta-policy is to maximize the reward function R_SMDP with the constraint that it reaches f' in a finite number of time steps. Let V_{µ'}(f, f_s, s | f') be the hierarchically optimal return for reaching f' from (f, f_s, s) with goal-conditioned meta-policy µ'. Let µ* be the hierarchically optimal policy for the SMDP. Let f_g be the goal state, and µ_g be the hierarchically optimal goal-conditioned meta-policy for achieving the goal state. Condition B.3.
The hierarchically optimal meta-policy must be the same as the goal-conditioned meta-policy that has the FSA goal state f_g as its goal: µ*(f, f_s, s) = µ_g(f, f_s, s | f_g). In other words, V_{µ_g}(f, f_s, s | f_g) > V_{µ'}(f, f_s, s | f') ∀ f, f_s, s, f' ≠ f_g. This condition guarantees that the hierarchically optimal meta-policy will always go to the FSA goal state f_g (thereby satisfying the specification). Here is an example of how this condition can be achieved: if −∞ < R_F(f) R_o(f_s, s) < 0 for all f, f_s, s, o (as is the case when Condition B.1 is met and R_F(f) > 0), and rollouts terminate only when the agent reaches f_g, then the expected return for reaching f_g is a bounded negative number, while the expected return of any meta-policy that never reaches f_g is −∞.
From (Sutton et al., 1999): Value iteration on an SMDP converges to the hierarchically optimal policy.
Therefore, the meta-policy found using the Logical Options Framework converges to a hierarchically optimal meta-policy that satisfies the task specification as long as Conditions B.1 and B.3 are met.
Definition B.4.
Consider the SMDP where planning is allowed over the low-level actions instead of the options. We will call this the hierarchical MDP (HMDP), as this MDP is the product of the low-level environment E, the liveness property W_liveness, and the safety property W_safety. Let R_F(f) > 0 ∀ f, let R_HMDP(f, f_s, s, a) = R_F(f) R_E(f_s, s, a), and let π*_HMDP be the optimal policy for the HMDP.
Theorem B.5.
Given Conditions B.1 and B.3, the hierarchically optimal meta-policy µ_g with optimal option policies π_g has the same expected returns as the HMDP optimal policy π*_HMDP and satisfies the task specification. The assumption that R_SMDP(f, f_s, s, o) = R_F(f) R_o(f_s, s) and R_HMDP(f, f_s, s, a) = R_F(f) R_E(f_s, s, a) can be relaxed so that R_SMDP and R_HMDP are functions that are monotonically increasing in the low-level rewards R_o and R_E, respectively.

Proof.
By Condition B.1, every subgoal has an option associated with it whose optimal policy is to go to the subgoal. By Condition B.3, the hierarchically optimal meta-policy will reach the FSA goal state f_g. The meta-policy can only accomplish this by going to the subgoals in a sequence that satisfies the task specification. It does this by executing a sequence of options that correspond to a satisfying sequence of subgoals and are optimal in expectation. Therefore, since R_F(f) > 0 ∀ f and R_SMDP(f, f_s, s, o) = R_F(f) R_o(f_s, s), and since the event propositions that affect the order of subgoals necessary to satisfy the task are independent random variables, the expected cumulative reward is a positive linear combination of the expected option rewards; and since all option rewards are optimal with respect to the environment and the meta-policy is optimal over the options, our algorithm attains the optimal expected cumulative reward.

C. Experimental Implementation
We discuss the implementation details of the experiments in this section. Because the setups of the domains are analogous, we discuss the delivery domain first in every section and then briefly relate how the same formulation applies to the reacher and pick-and-place domains as well. In this section, we use the simpler formulation of the main paper and not the more general formulation discussed in Appendix A.
C.1. Propositions
The delivery domain has 7 propositions plus 4 composite propositions. The subgoal propositions are P_G = {a, b, c, h}. Each of these propositions is associated with a single state in the environment (see Fig. 12a). The safety propositions are P_S = {o, e}. o is the obstacle proposition; it is associated with many states – the black squares in Fig. 12a. e is the empty proposition, associated with all of the white squares in the domain. This is the default proposition for when there are no other active propositions. The event proposition is P_E = {can}. can is the "cancelled" proposition, representing when one of the subgoals has been cancelled.

To simplify the FSAs and the implementation, we make the assumption that multiple propositions cannot be true at the same state. However, it is reasonable for can to be true at the subgoals, and therefore we introduce 4 composite propositions, ca = a ∧ can, cb = b ∧ can, cc = c ∧ can, ch = h ∧ can. These can be counted as event propositions without affecting the operation of the algorithm.

The reacher domain has analogous propositions. The subgoals are r, g, b, y and correspond to a, b, c, h. The environment does not contain obstacles o but does have the safety proposition e, and it also has the event proposition can and the composite propositions cr, cg, cb, cy for when can is true at the same time that a subgoal proposition is true. Another difference is that the subgoal propositions are associated with a small spherical region instead of a single state as in the delivery domain; this is a necessity for continuous domains and unfortunately breaks one of our conditions for optimality, because the subgoals are now associated with multiple states instead of a single state. However, the LOF meta-policy will still converge to a hierarchically optimal policy.

The pick-and-place domain has subgoals r, g, b, y like the reacher domain, and event proposition can. Like the reacher domain, the pick-and-place domain's subgoals become true in a region around the goal state, breaking one of the necessary conditions for optimality. However, the LOF meta-policy still converges to a hierarchically optimal policy.
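To illustrate how these propositions can be wired up for the delivery domain, here is a minimal sketch of the labeling functions; the grid coordinates and obstacle cells are placeholders rather than the actual layout of Fig. 12a.

# Hypothetical grid coordinates; the real layout is shown in Fig. 12a.
SUBGOAL_CELLS = {"a": (1, 1), "b": (8, 1), "c": (1, 8), "h": (8, 8)}
OBSTACLE_CELLS = {(4, 4), (4, 5), (5, 4)}      # black squares
P_G = {"a", "b", "c", "h"}                     # subgoal propositions
P_S = {"o", "e"}                               # safety propositions
P_E = {"can"}                                  # event proposition

def label_G(s):
    # T_PG: the subgoal propositions true at state s (at most one per state).
    return {p for p, cell in SUBGOAL_CELLS.items() if cell == s}

def label_S(s):
    # T_PS: obstacle proposition on black squares, otherwise the default "empty" proposition.
    return {"o"} if s in OBSTACLE_CELLS else {"e"}

def composite_events(s, can_active):
    # Composite propositions such as ca = a AND can, treated as event propositions.
    return {"c" + p for p in label_G(s)} if can_active else set()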
C.2. Reward Functions

Next, we define the reward functions of the physical environment R_E, the safety propositions R_S, and the FSA states R_F. We realize that often in reinforcement learning, the algorithm designer has no control over the reward functions of the environment. However, in our case, there are no publicly available environments such as OpenAI Gym or the DeepMind Control Suite that we know of that have a high-level FSA built in. Therefore, anyone implementing our algorithm will likely have to implement their own high-level FSA and define the rewards associated with it.

For the delivery domain, the low-level environment reward function R_E: S × A → ℝ is a constant negative reward for every s and a. In other words, it is a time/distance cost.

We assign costs to the safety propositions by defining the reward function R_S: P_S → ℝ. All of the costs are 0 except for the obstacle cost R_S(o), which is a large negative number. Therefore, there is a very high penalty for encountering an obstacle.

The full environment reward is then the sum of the time/distance cost and the safety cost, R_E(s, a) + R_S(T_P(s)). This reward function meets Condition B.1, so the optimal option policies always converge to their subgoals.

Lastly, we define R_F: F → ℝ to be R_F(f) = 1 ∀ f. Therefore the SMDP reward R_SMDP(f, s, o) = R_o(s), which meets Condition B.3, so the LOF meta-policy converges to the optimal policy.

The reacher environment has analogous reward functions. The safety reward function is R_S(p) = 0 ∀ p ∈ P_S because there is no obstacle proposition. Also, the physical environment reward function differs between option training and meta-policy learning. For meta-policy learning, the reward is an actuation cost plus a constant time cost: R_E(s, a) = −a⊤a minus a small constant. During option training, we speed learning by adding the distance to the goal state as a cost instead of a time cost: R_E(s, a) = −a⊤a − ||s − s_g||. Although the reward functions and value functions are different, the costs are analogous and lead to good performance, as seen in the results. Note that this method can't be used for Reward Machines, because RM trains sub-policies for FSA states, and the subgoals for FSA states are not known ahead of time, so the distance to the subgoal cannot be calculated.

The pick-and-place domain has reward functions analogous to the reacher domain's.
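The reward structure above can be summarized in code as follows. The numeric constants are placeholders standing in for the values used in our experiments, and the function names are illustrative.

import numpy as np

# Delivery domain (discrete); the constants below are placeholders, not the exact values.
def R_E_delivery(s, a, label_S):
    time_cost = -1.0                                        # constant time/distance cost
    safety_cost = -100.0 if "o" in label_S(s) else 0.0      # R_S: large penalty on obstacles
    return time_cost + safety_cost

def R_F(f):
    return 1.0                                              # R_F(f) = 1 for every FSA state f

# Reacher domain (continuous): actuation plus time cost for meta-policy learning,
# distance-to-subgoal shaping during option training.
def R_E_reacher_meta(s, a, time_cost=0.1):
    return -float(np.dot(a, a)) - time_cost

def R_E_reacher_option(s, a, s_g):
    return -float(np.dot(a, a)) - float(np.linalg.norm(np.asarray(s) - np.asarray(s_g)))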
C.3. Algorithm for LOF-QL

The LOF-QL baseline uses Q-learning to learn the meta-policy instead of value iteration. We therefore use "Logical Q-Learning" equations in place of the Logical Value Iteration equations described in Eqs. 3 and 4 of the main text. The algorithm is described in Alg. 3. A benefit of using Q-learning instead of value iteration is that the transition function T_F of the FSA T does not have to be explicitly known, as the algorithm samples from the transitions rather than using T_F explicitly in the update formula. However, as described in the main text, this comes at the expense of reduced composability, as LOF-QL takes substantially more iterations to converge to a new meta-policy than LOF-VI does. Q(f, s, o) is initialized to all zeros, and the Q update formulas are given in Alg. 3.

C.4. Comparison of LOF and Reward Machines
Figs. 4, 5, 6, and 7 give a visual overview of how LOF and Reward Machines work, and illustrate how they differ.
C.5. Tasks
We test the environments on four tasks: a "sequential" task (Fig. 8), an "IF" task (Fig. 9), an "OR" task (Fig. 10), and a "composite" task (Fig. 11). The reacher domain has the same tasks, except that r, g, b, y replace a, b, c, h and there are no obstacles o. Note that in the LTL formulae, □!o is the safety property φ_safety; the preceding part of the formula is the liveness property φ_liveness used to construct the FSA.
Figure 4. LOF and RM both require an environment MDP E and an automaton T that specifies a task. (a) Environment MDP E, with the natural language task "Go grocery shopping OR pick up the kid, then go home." (b) Liveness property T: the natural language rule can be represented as an LTL formula, which can be translated into an FSA.

Figure 5. In RM, sub-policies are learned for each state of the automaton. In this case, in state S0, a sub-policy is learned that goes either to the shopping cart or to the kid, whichever is closer. In state S1, the sub-policy goes to the house.

C.6. Full Experimental Results
For the satisfaction experiments for the delivery domain, 10 policies were trained for each task and for each baseline. Training was done for 1600 episodes, with 100 steps per episode. Every 2000 training steps, the policies were tested on the domain and the returns recorded. For this discrete domain, we know the minimum and maximum possible returns for each task, and we normalized the returns using these minimum and maximum returns. The error bars are the standard deviation of the returns over the 10 policies' rollouts.

For the satisfaction experiments for the reacher domain, a single policy was trained for each task and for each baseline. The baselines were trained for 900 epochs, with 50 steps per epoch. Every 2500 training steps, each policy was tested by doing 10 rollouts and recording the returns. For the RM baseline, training was for 1000 epochs with 800 steps per epoch, and the policy was tested every 8000 training steps. Because we don't know the minimum and maximum rewards for each task, we did not normalize the returns. The error bars are the standard deviation over the 10 rollouts for each baseline.

For the composability experiments, a set of options was trained once, and then meta-policy training using LOF-VI, LOF-QL, and Greedy was done for each task. Returns were recorded at every training step by rolling out each baseline 10 times. The error bars are the standard deviations on the 10 rollouts.

For the pick-and-place domain, 1 policy was trained for the satisfaction experiments, and experimental results were evaluated over 2 rollouts. Training was done for 7500 epochs with 1000 steps per epoch. Every 250,000 training steps, the policy was tested by doing 2 rollouts and recording the returns. For the composability experiments, returns were recorded by rolling out each baseline 2 times. The RM baseline was trained over 10000 epochs with 1000 steps per epoch.

Code and videos of the domains and tasks are in the supplement.

D. Further Discussion
What happens when incorrect rules are used?
One benefit of representing the rules of the environment as LTL formulae/automata is that these forms of representing rules are much more interpretable than alternatives (such as neural nets). Therefore, if an agent's learned policy has bad behavior, a user of LOF can inspect the rules to see if the bad behavior is a consequence of a bad rule specification. Furthermore, one of the consequences of composability is that any modification to the FSA will alter the resulting policy in a direct and predictable way. Therefore, for example, if an incorrect human-specified task yields undesirable behavior, with our framework it is possible to tweak the task and test the new policy without any additional low-level training (however, tweaking the safety rules would require retraining the logical options).

Figure 6. LOF has two steps. (a) Step 1 of LOF: learn a logical option for each subgoal. (b) Step 2 of LOF: use Logical Value Iteration to find a meta-policy that satisfies the liveness property. In this image, the boxed subgoals indicate that the corresponding option is the optimal option to take from that low-level state. The policy ends up being the same as RM's policy: in state S0, the optimal meta-policy chooses the "grocery shopping" option if the grocery cart is closer and the "pick up kid" option if the kid is closer. In state S1, the optimal meta-policy is to always choose the "home" option.
What happens if there is a rule conflict?
If the specified LTL formula is invalid, the LTL-to-automaton translation tool will either throw an error or return a trivial single-state automaton whose only state is not an accepting state. Rollouts would terminate immediately.
What happens if the agent can't satisfy a task without violating a rule?
The solution to this problem depends on the user's priorities. In our formulation, we have assigned finite costs to rule violations and an infinite cost to not satisfying the task (see Appendix B). We have prioritized task satisfaction over safety satisfaction. However, it is possible to flip the priorities around by terminating training/rollouts if there is a safety violation. In our proofs, we have assumed that the agent can reach every subgoal from any state, implying either that it is always possible to avoid safety violations or that safety violations are allowed.
Why is the safety property not composable?
The safety property is not composable because we allow safety propositions to be associated with more than one state in the environment (unlike subgoals). The fact that there can be multiple instances of a safety proposition in the environment means that it is impossible to guarantee that a new option policy will be optimal if retraining is done only at the level of the safety automaton and not also over the low-level states. In order to guarantee optimality, retraining would have to be done over both the high and low levels (the safety automaton and the environment). Our definition of composability involves only replanning over the high level of the FSA; therefore, safety properties are not composable. Furthermore, the rewards/costs of the safety property can be associated with propositions and not just with states (as with the liveness property). This is because a safety violation via one safety proposition (e.g., a car going onto the wrong side of the road) may incur a different penalty than a violation via a different proposition (a car going off the road). The propositions are associated with low-level states of the environment, so any retraining would have to involve retraining at both the high and low levels, once again violating our definition of composability.
Figure 7. What distinguishes LOF from RM is that the logical options of LOF can be easily composed to solve new tasks. In this example, the new task is "Go home OR pick up the kid, then go grocery shopping." (a) LOF can easily solve this new liveness property without training new options. (b) Logical Value Iteration can be used to find a meta-policy on the new task without the need to retrain the logical options. A new meta-policy can be found in 10-50 iterations. The new policy finds that in state S0, the "home" option is optimal if the agent is closer to "home", and the "kid" option is optimal if the agent is closer to "kid". In state S1, the "grocery shopping" option is optimal everywhere.
Figure 8. FSA for the sequential task. The LTL formula is ♦(a ∧ ♦(b ∧ ♦(c ∧ ♦h))) ∧ □!o. The natural language interpretation is "Deliver package a, then b, then c, and then return home h. And always avoid obstacles o."
Figure 9. FSA for the IF task. The LTL formula is (♦(c ∧ ♦a) ∧ □!can) ∨ (♦a ∧ ♦can) ∧ □!o. The natural language interpretation is "Deliver package c, and then a, unless a gets cancelled. And always avoid obstacles o."
Figure 10. FSA for the OR task. The LTL formula is ♦((a ∨ b) ∧ ♦c) ∧ □!o. The natural language interpretation is "Deliver package a or b, then c, and always avoid obstacles o."
Figure 11. FSA for the composite task. The LTL formula is (♦((a ∨ b) ∧ ♦(c ∧ ♦h)) ∧ □!can) ∨ (♦((a ∨ b) ∧ ♦h) ∧ ♦can) ∧ □!o. The natural language interpretation is "Deliver package a or b, and then c, unless c gets cancelled, and then return to home h. And always avoid obstacles."
Figure 12. All satisfaction experiments on the delivery domain. (a) Delivery domain. (b) Averaged. (c) Composite. (d) OR. (e) IF. (f) Sequential. Notice how for the composite and OR tasks (Figs. 12c and 12d), the Greedy baseline plateaus before LOF-VI and LOF-QL. This is because Greedy chooses a suboptimal path through the FSA, whereas LOF-VI and LOF-QL find an optimal path. Also, notice that RM takes many more training steps to achieve the optimal cumulative reward. This is because for RM, the only reward signal is from reaching the goal state. It takes a long time for the agent to learn an optimal policy from such a sparse reward signal. This is particularly evident for the sequential task (Fig. 12f), which requires the agent to take a longer sequence of actions/FSA states before reaching the goal. The options-based algorithms train much faster because when training the options, the agent receives a reward for reaching each subgoal, and therefore the reward signal is much richer.
Figure 13. Satisfaction experiments for the reacher domain, without RM results. (a) Reacher domain. (b) Averaged. (c) Composite. (d) OR. (e) IF. (f) Sequential. The results are equivalent to the results on the delivery domain.
Figure 14. Satisfaction experiments for the reacher domain, including RM results. (a) Reacher domain. (b) Averaged. (c) Composite. (d) OR. (e) IF. (f) Sequential. RM takes significantly more training steps to train than the other baselines, although it eventually reaches and surpasses the cumulative reward of the other baselines. This is because for the continuous domain, we violate some of the conditions required for optimality when using the Logical Options Framework – in particular, the condition that each subgoal is associated with a single state. In a continuous environment, this condition is impossible to meet, and therefore we made the subgoals small spherical regions, and we only made the subgoals associated with specific Cartesian coordinates and not velocities (which are also in the state space). Meanwhile, the optimality conditions of RM are looser and were not violated, which is why it achieves a higher final cumulative reward.
Figure 15. Satisfaction experiments for the pick-and-place domain, without RM results. The results are equivalent to the results on the delivery and reacher domains.
Simplifying the option transition model: In our experiments, we simplify the transition model by setting γ = 1, an assumption that does not affect convergence to optimality. In the case where γ = 1, Eq. 2 reduces to T_o(s'|s) = Σ_k p(s', k). Assuming that the option terminates only at state s_g, Eq. 2 further reduces to T_o(s_g|s) = 1 and T_o(s'|s) = 0 for all other s' ≠ s_g. Therefore, no learning is required for the transition model. For cases where the assumption that γ = 1 does not apply, (Abel & Winder, 2019) contains an interesting discussion.
Figure 16. Satisfaction experiments for the pick-and-place domain, including RM results. For the pick-and-place domain, RM did not converge to a solution within the training time allotted for it (10 million training steps).
Figure 17. All composability experiments for the delivery domain. (a) Delivery domain. (b) Averaged. (c) Composite. (d) OR. (e) IF. (f) Sequential.
Learning the option reward model: The option reward model R_o(s) is the expected reward of carrying out option o to termination from state s. It is equivalent to a value function. Therefore, it is convenient if the policy-learning algorithm used to learn the options learns a value function as well as a policy (e.g., Q-learning and PPO). However, as long as the expected return can be computed between pairs of states, it is not necessary to learn a complete value function. This is because during Logical Value Iteration, the reward model is only queried at discrete points in the state space (typically corresponding to the initial state and the subgoals). So as long as expected returns between the initial state and the subgoals can be computed, Logical Value Iteration will work.
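A minimal sketch of this idea is shown below: the reward model is read off a learned critic when one is available and is otherwise estimated with a handful of rollouts from the queried states. The critic.value(s) and env.reset(state=s)/env.step(a) interfaces are hypothetical names, not a specific library's API.

def option_reward_model(option, query_states, critic=None, env=None,
                        n_rollouts=20, gamma=1.0, max_steps=200):
    # Returns R_o(s) at the queried states (typically the initial state and the subgoals).
    R_o = {}
    for s in query_states:
        if critic is not None:
            R_o[s] = critic.value(s)        # value function learned alongside the option policy
        else:
            total = 0.0
            for _ in range(n_rollouts):     # Monte-Carlo estimate of the return to termination
                s_t, ret, disc = env.reset(state=s), 0.0, 1.0
                for _ in range(max_steps):
                    s_t, r, done = env.step(option.policy(s_t))
                    ret += disc * r
                    disc *= gamma
                    if done:
                        break
                total += ret
            R_o[s] = total / n_rollouts
    return R_o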
Figure 18. All composability experiments for the reacher domain.

Figure 19. All composability experiments for the pick-and-place domain.

Why is LOF-VI so much more efficient than the RM baseline? In short, LOF-VI is more efficient than RM because LOF-VI has a dense reward function during training and RM has a sparse reward function. During training, LOF-VI trains the options independently and rewards the agent for reaching the subgoals associated with the options. This is in effect a dense reward function. The generic reward function for RM only rewards the agent for reaching the goal state. There are no other high-level rewards to guide the agent through the task. This is a very sparse reward that results in less efficient training. RM's reward function could easily be made dense by rewarding every transition of the automaton. In this case, RM would probably train as efficiently as LOF-VI. However, imagine an FSA with two paths to the goal state. One path has only 1 transition but a much lower low-level cost, and the other path has 20 transitions and a much higher low-level cost. RM might learn to prefer the reward-heavy 20-transition path rather than the reward-light 1-transition path, even though the 1-transition path results in a lower low-level cost. In theory it might be possible to design an RM reward function that adjusts the automaton transition reward depending on the length of
Algorithm 3 LOF with ε-greedy Q-learning
Given:
  Propositions P partitioned into subgoals P_G, safety propositions P_S, and event propositions P_E
  Environment MDP E = (S, A, T_E, R_E, γ)
  Logical options O with reward models R_o(s) and transition models T_o(s'|s)
  Liveness property T = (F, P_G ∪ P_E, T_F, R_F, f_0, f_g) (T_F does not have to be explicitly known if it can be sampled from a simulator)
  Learning rate α, exploration probability ε
  Number of training episodes n, episode length m
To learn:
  Meta-policy µ(f, s, o) along with Q(f, s, o) and V(f, s)
Find a meta-policy µ over the options:
  Initialize Q: F × S × O → ℝ and V: F × S → ℝ to 0
  for k ∈ [1, ..., n] do
    Initialize FSA state f ← f_0 and s ← a random initial state from E
    Draw p̄_e ∼ T_{P_E}(·)
    for j ∈ [1, ..., m] do
      With probability ε let o be a random option; otherwise, o ← argmax_{o'∈O} Q(f, s, o')
      s' ∼ T_o(· | s)
      f' ∼ T_F(· | f, T_{P_G}(s'), p̄_e)
      Q_k(f, s, o) ← Q_{k−1}(f, s, o) + α ( R_F(f) R_o(s) + γ V(f', s') − Q_{k−1}(f, s, o) )
      V_k(f, s) ← max_{o'∈O} Q_k(f, s, o')
      f ← f', s ← s'
    end for
  end for
  µ(f, s, o) = argmax_{o∈O} Q(f, s, o)
Return:
  Meta-policy µ(f, s, o) along with Q(f, s, o) and V(f, s)
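For completeness, a compact Python rendering of Alg. 3's update loop is given below; the sampler interfaces for T_o, T_F, and the event propositions, as well as the hyperparameter values, are illustrative assumptions rather than the implementation used in our experiments.

import random

def lof_q_learning(F, S, O, sample_T_o, sample_T_F, sample_p_e,
                   R_F, R_o, sample_s0, f_0,
                   alpha=0.1, gamma=0.99, epsilon=0.1,
                   n_episodes=1000, episode_len=100):
    # Epsilon-greedy Q-learning over options (Alg. 3).
    Q = {(f, s, o): 0.0 for f in F for s in S for o in O}
    V = {(f, s): 0.0 for f in F for s in S}
    for _ in range(n_episodes):
        f, s = f_0, sample_s0()
        p_e = sample_p_e()                      # draw the event-proposition assignment
        for _ in range(episode_len):
            if random.random() < epsilon:       # epsilon-greedy option selection
                o = random.choice(list(O))
            else:
                o = max(O, key=lambda o_: Q[(f, s, o_)])
            s2 = sample_T_o(o, s)               # sample the state where the option terminates
            f2 = sample_T_F(f, s2, p_e)         # sample the FSA transition given the achieved subgoal
            target = R_F(f) * R_o(o, s) + gamma * V[(f2, s2)]
            Q[(f, s, o)] += alpha * (target - Q[(f, s, o)])
            V[(f, s)] = max(Q[(f, s, o_)] for o_ in O)
            f, s = f2, s2
    mu = {(f, s): max(O, key=lambda o_: Q[(f, s, o_)]) for f in F for s in S}
    return mu, Q, V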