Induction and Exploitation of Subgoal Automata for Reinforcement Learning
Daniel Furelos-Blanco, Mark Law, Anders Jonsson, Krysia Broda, Alessandra Russo
Daniel Furelos-Blanco [email protected]
Imperial College London, UK
Mark Law [email protected]
Imperial College London, UK
Anders Jonsson [email protected]
Universitat Pompeu Fabra, Spain
Krysia Broda [email protected]
Imperial College London, UK
Alessandra Russo [email protected]
Imperial College London, UK
Abstract
In this paper we present ISA, an approach for learning and exploiting subgoals in episodic reinforcement learning (RL) tasks. ISA interleaves reinforcement learning with the induction of a subgoal automaton, an automaton whose edges are labeled by the task's subgoals expressed as propositional logic formulas over a set of high-level events. A subgoal automaton also consists of two special states: a state indicating the successful completion of the task, and a state indicating that the task has finished without succeeding. A state-of-the-art inductive logic programming system is used to learn a subgoal automaton that covers the traces of high-level events observed by the RL agent. When the currently exploited automaton does not correctly recognize a trace, the automaton learner induces a new automaton that covers that trace. The interleaving process guarantees the induction of automata with the minimum number of states, and applies a symmetry breaking mechanism to shrink the search space whilst remaining complete. We evaluate ISA in several grid-world and continuous state space problems using different RL algorithms that leverage the automaton structures. We provide an in-depth empirical analysis of the automaton learning process in terms of the traces, the symmetry breaking mechanism, and specific restrictions imposed on the final learnable automaton. For each class of RL problem, we show that the learned automata can be successfully exploited to learn policies that reach the goal, achieving an average reward comparable to the case where automata are not learned but handcrafted and given beforehand.
1. Introduction
Reinforcement learning (RL) is a family of algorithms for controlling an agent that acts in an environment with the purpose of maximizing some measure of cumulative reward it receives. These algorithms have played a key role in recent breakthroughs like human-level video game playing from raw sensory input (Mnih et al., 2015) and mastering complex board games (Silver et al., 2018). However, despite these impressive advancements, RL algorithms still struggle to solve other complex tasks. A possible way to overcome this problem is through exploiting the task structure in the form of (temporal) abstractions.

Finite-state automata have been extensively used as a means for abstraction across different areas of Artificial Intelligence (AI), including the control of agents in robotics (Brooks, 1989) and games (Buckland, 2004), as well as automated planning (Bonet et al., 2009; Hu & De Giacomo, 2011; Segovia Aguas et al., 2018). In the context of RL, automata have been used for multiple purposes, such as to represent abstract decision hierarchies (Parr & Russell, 1997; Leonetti et al., 2012), act as memory in partially observable environments (Meuleau et al., 1999; Toro Icarte et al., 2019), or ease the interpretation of the policies encoded by a neural network (Koul et al., 2019). In particular, Toro Icarte et al. (2018) recently proposed reward machines (RMs), which are automata that play the role of reward functions while revealing the task's structure to the RL agent. This approach has drawn attention from the community, and several works have recently attempted to learn RMs (Toro Icarte et al., 2019; Xu et al., 2020) or similar kinds of automata (Furelos-Blanco et al., 2020; Gaon & Brafman, 2020).

In this paper we propose ISA (Induction of Subgoal Automata for Reinforcement Learning), a method for learning and exploiting a minimal automaton that encodes the subgoals of an episodic goal-oriented task. These automata are called subgoal automata since each transition is labeled by a subgoal, which is a Boolean formula over a set of high-level events that characterizes the task. A set of high-level events is sensed by the agent at each state. Besides, subgoal automata have accepting and rejecting states. The former indicate the successful completion of the task, while the latter indicate that the task has finished without succeeding.

We represent subgoal automata using Answer Set Programming (ASP) (Gelfond & Kahl, 2014), a logic programming language. A state-of-the-art inductive logic programming (ILP) system for learning ASP programs, ILASP (Law et al., 2015b), is used to learn the automata. Specifically, given a set of automaton states and a set of traces of high-level events, ILASP learns the transitions between states such that the traces are correctly classified. For instance, all traces achieving the task's goal must finish in the accepting state. To speed up the automaton learning process, we devise a symmetry breaking mechanism that discards multiple equivalent automata in order to shrink the search space.

ISA interleaves the automata and reinforcement learning processes. The automata learner is executed only when the RL agent finds a trace not correctly classified by the current automaton.
The interleaving scheme guarantees that the induced subgoal automaton is minimal (i.e., has the minimum number of states).

Importantly, subgoal automata address two types of abstraction: state abstraction and action abstraction (Konidaris, 2019; Ho, Abel, Griffiths, & Littman, 2019). The set of automaton states is an abstraction of the original state space: the automaton states determine the level of completion of a given task. That is, they indicate which subgoals have been achieved and which remain to be achieved. Conversely, the subgoals labeling the edges can be seen as local objectives of abstract actions. The latter has been successfully addressed by hierarchical reinforcement learning (HRL, Barto & Mahadevan, 2003) algorithms, which divide a single task into several subtasks that can be solved separately. We use HRL methods to exploit subgoal automata by treating each subgoal as a subtask, as well as methods that exploit similar automaton structures like the aforementioned reward machines. All in all, abstractions in the form of subgoal automata can potentially:
1. Make learning simpler since mastering a subtask should be easier than mastering the whole task.
2. Allow for better exploration since the agent moves more quickly between abstract states (i.e., different levels of completion of the task).
3. Allow for generalization between different tasks if they share common subtasks.
4. Handle partial observability by acting as an external memory.

We evaluate ISA in several grid-world and continuous state space tasks. We show that a subgoal automaton can be simultaneously induced from interaction and exploited by different RL algorithms to learn a policy that achieves the task's goal. Importantly, the performance with a learned automaton is comparable to that obtained with a handcrafted automaton. Furthermore, we make a thorough analysis of how the reinforcement learning process affects automata learning and vice versa.

The description of our approach previously appeared in a conference paper (Furelos-Blanco et al., 2020). Compared to the conference version, the present paper includes the following novel material:

• A method for breaking symmetries in our automata that speeds up the automata learning phase.
• An extensive experimental analysis of our interleaving method. The experiments include two new domains, one of which is characterized by a continuous state space and, thus, requires the use of function approximation techniques. Besides, we also evaluate a hierarchical RL algorithm that was not included in the conference version of the paper.
• A detailed description of recent related work, some of which was not yet available at the time of submission of the conference paper.
• A discussion on the limitations of our work and ideas to be developed in future work.

The paper is organized as follows. Section 2 introduces the background of our work. The formalization of the main components of our approach (the tasks, automata and traces) is given in Section 3. Section 4 describes how subgoal automata and traces are represented using logic programming, while Section 5 explains how this class of automata can be learned from traces. In Section 6 we introduce a set of constraints for checking and guaranteeing that a given automaton complies with specific structural properties (determinism and symmetry breaking constraints). Section 7 details how the automata learning process is interleaved with different RL algorithms that exploit the resulting automata. The effectiveness of our method across different tasks is evaluated in Section 8, followed by a discussion on related work in Section 9. Section 10 concludes the paper and suggests directions for future work.
2. Background
In this section we briefly summarize the key background concepts of reinforcement learning and inductive learning of answer set programs, which are the main building blocks of our approach.

Reinforcement learning (RL) (Sutton & Barto, 1998) is a family of algorithms for learning to act in an unknown environment.
Typically, this learning process is formulated as a Markov Decision Process (MDP), i.e., a tuple M = ⟨S, A, p, r, γ⟩, where S is a finite set of states, A is a finite set of actions, p : S × A → ∆(S) is a transition probability function (for any finite set X, ∆(X) = {µ ∈ ℝ^X | Σ_x µ(x) = 1, µ(x) ≥ 0 for all x} is the probability simplex over X), r : S × A × S → ℝ is a reward function, and γ ∈ [0, 1) is a discount factor. At time t, the agent observes state s_t ∈ S, executes action a_t ∈ A, transitions to the next state s_{t+1} ∼ p(·|s_t, a_t) and receives reward r(s_t, a_t, s_{t+1}).

We consider episodic MDPs that terminate in a given set of terminal states, which can be either goal or undesirable states (i.e., dead-ends). Let S_T ⊆ S be the set of terminal states and S_G ⊆ S_T the set of goal states. The aim is to find a policy π : S → ∆(A), a mapping from states to probability distributions over actions, that maximizes the expected sum of discounted reward (or return), R_t = E[Σ_{k=t}^{n} γ^{k−t} r_k], where n is the last step of the episode.

In model-free RL the transition probability function p and reward function r are unknown to the agent, and a policy is learned via interaction with the environment. Q-learning (Watkins, 1989) computes an action-value function Q(s, a) = E[R_t | s_t = s, a_t = a] that estimates the return from each state-action pair when following an approximately optimal policy. In each iteration the estimates are updated as

Q(s, a) = Q(s, a) + α ( r(s, a, s′) + γ max_{a′} Q(s′, a′) − Q(s, a) ),

where α is a learning rate and s′ is the state after applying a in s. The term r(s, a, s′) + γ max_{a′} Q(s′, a′) is the target, while the whole expression within parentheses is the Bellman error. Usually, an ε-greedy policy selects a random action with probability ε and the action maximizing Q(s, a) otherwise. The policy is induced by the action that maximizes Q(s, a) in each s.

Options (Sutton et al., 1999) address temporal abstraction in RL. Given an MDP M = ⟨S, A, p, r, γ⟩, an option is a tuple ω = ⟨I_ω, π_ω, β_ω⟩ where I_ω ⊆ S is the option's initiation set, π_ω : S → ∆(A) is the option's policy, and β_ω : S → [0, 1] is the option's termination condition. An option is available in state s ∈ S if s ∈ I_ω. If the option is started, the actions are chosen according to π_ω. The option terminates at a given state s ∈ S with probability β_ω(s). Note that an action a ∈ A can be viewed as an option that terminates in any state with probability 1.

An MDP whose action set is extended with options is a Semi-Markov Decision Process (SMDP). The learning methods for SMDPs require minimal changes with respect to MDPs. The counterpart of Q-learning for SMDPs is called SMDP Q-learning (Bradtke & Duff, 1994). The update it performs when an option ω has terminated is:

Q(s, ω) = Q(s, ω) + α ( r + γ^k max_{ω′} Q(s′, ω′) − Q(s, ω) ),

where k is the number of steps between s and s′, and r is the cumulative discounted reward over this time. Similarly to Q-learning, α is a learning rate and s′ is the state in which option ω terminates.
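To make the two update rules concrete, the following sketch implements the tabular Q-learning update and the SMDP Q-learning update for options. It is a minimal illustration of the equations above, not the implementation used in the paper; the class and method names, and the way options and cumulative rewards are passed in, are assumptions made for this example.

import random
from collections import defaultdict

class QLearner:
    """Tabular Q-learning with an epsilon-greedy policy (sketch)."""

    def __init__(self, actions, alpha=0.1, gamma=0.99, epsilon=0.1):
        self.q = defaultdict(float)          # Q(s, a), defaults to 0
        self.actions = actions
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def act(self, s):
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(s, a)])

    def update(self, s, a, r, s_next, terminal):
        # Target: r + gamma * max_a' Q(s', a'); bootstrap is 0 at terminal states.
        bootstrap = 0.0 if terminal else max(self.q[(s_next, a2)] for a2 in self.actions)
        target = r + self.gamma * bootstrap
        self.q[(s, a)] += self.alpha * (target - self.q[(s, a)])

    def smdp_update(self, s, option, disc_reward, s_next, k, options, terminal):
        # SMDP Q-learning: disc_reward is the cumulative discounted reward
        # collected while the option ran for k steps before terminating in s_next.
        bootstrap = 0.0 if terminal else max(self.q[(s_next, o)] for o in options)
        target = disc_reward + (self.gamma ** k) * bootstrap
        self.q[(s, option)] += self.alpha * (target - self.q[(s, option)])

The same table is used for both updates here only for brevity; in practice the option-level and action-level value functions are kept separate.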
Intra-option learning (Kaelbling, 1993; Sutton et al., 1998) is an often used method for learning option policies. During intra-option learning, an experience (s, a, r, s′) generated by the current option can be used not only to update its policy, but also other options' policies. Since experience accumulates faster, the convergence speed is usually increased.

In this section we describe answer set programming (ASP) and the ILASP system for learning ASP programs.
Answer Set Programming (ASP) (Gelfond & Kahl, 2014) is a declarative programming language for knowledge representation and reasoning. An ASP problem is expressed in a logical format, and the models (called answer sets) of its representation provide the solutions to that problem. In the paragraphs below we describe the main concepts of ASP used in the paper.

An atom is an expression of the form p(t_1, ..., t_n) where p is a predicate symbol of arity n and t_1, ..., t_n are terms. If n = 0, we omit the parentheses. In this paper, a term can be either a variable or a constant. By convention, variables are denoted using upper case (e.g., X or Y), while constants are written in lower case (e.g., coffee or mail). An atom is said to be ground if none of its terms is a variable. A literal is an atom a or its negation not a. The not symbol is called negation as failure (Clark, 1977).

An ASP program P is a set of rules. In this paper, we assume that this set is formed by normal rules, choice rules and constraints. Given an atom h and a set of literals b_1, ..., b_n, a normal rule is of the form h :- b_1, ..., b_n, where h is the head and b_1, ..., b_n is the body of the rule. A normal rule with an empty body is a fact. A choice rule is of the form lb {h_1, ..., h_m} ub :- b_1, ..., b_n, where lb and ub are integers, h_1, ..., h_m are atoms and b_1, ..., b_n are literals. Rules of the form :- b_1, ..., b_n are called constraints.

Given a set of ground atoms (or interpretation) I, a ground normal rule is satisfied if the head is satisfied by I when the body literals are satisfied by I. The head of a choice rule is satisfied by I if and only if the number of satisfied atoms in the head is between lb and ub (both included), i.e., lb ≤ |I ∩ {h_1, ..., h_m}| ≤ ub. A ground constraint is satisfied if the body is not satisfied by I. The reduct P^I of a program P with respect to I is built in 4 steps (Law et al., 2015a):

1. Replace the heads of all constraints with ⊥.
2. For each choice rule R:
   • if its head is not satisfied, replace its head with ⊥, or
   • if its head is satisfied, then remove R and, for each atom h in the head of R such that h ∈ I, add the rule h :- body(R) (where body(R) is the set of literals forming the body of R).
3. Remove any rule R such that the body of R contains the negation of an atom in I.
4. Remove all negation from any remaining rules.

An interpretation I is an answer set of P if and only if (1) I satisfies the rules in P^I, and (2) no proper subset of I satisfies the rules in P^I.

ILASP (Inductive Learning of Answer Set Programs) (Law et al., 2015b) is an inductive logic programming system for learning ASP programs from partial answer sets.

A context-dependent partial interpretation (CDPI) (Law et al., 2016) is a pair ⟨⟨e_inc, e_exc⟩, e_ctx⟩, where ⟨e_inc, e_exc⟩ is a partial interpretation and e_ctx is an ASP program, called a context. A program P accepts e if and only if there is an answer set A of P ∪ e_ctx such that e_inc ⊆ A and e_exc ∩ A = ∅.

An ILASP task (Law et al., 2016) is a tuple T = ⟨B, S_M, ⟨E^+, E^−⟩⟩ where

• B is the ASP background knowledge,
• S_M is the set of ASP rules allowed in the hypotheses, and
• E^+ and E^− are sets of CDPIs called, respectively, the positive and negative examples.

A hypothesis H ⊆ S_M is an inductive solution of T if and only if:

1. ∀e ∈ E^+, B ∪ H accepts e, and
2. ∀e ∈ E^−, B ∪ H does not accept e.
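As a small illustration of the reduct and answer-set definitions, the sketch below computes the reduct of a ground program and checks the answer-set condition. It is restricted to ground normal rules (no choice rules or constraints), which is enough to convey the idea; the rule representation as tuples is an assumption made for this example.

def reduct(program, interpretation):
    """Steps 3-4 for normal rules: drop rules whose negated body intersects I,
    then drop the negative literals from the remaining rules."""
    return [(h, pos) for (h, pos, neg) in program if not (neg & interpretation)]

def least_model(positive_program):
    """Least Herbrand model of a negation-free program via fixpoint iteration."""
    model, changed = set(), True
    while changed:
        changed = False
        for h, pos in positive_program:
            if pos <= model and h not in model:
                model.add(h)
                changed = True
    return model

def is_answer_set(program, interpretation):
    """For normal programs, I is an answer set iff I is the least model of P^I."""
    return least_model(reduct(program, interpretation)) == interpretation

# Example: p :- not q.   q :- not p.   (two answer sets: {p} and {q})
# Each rule is (head, positive body atoms, negated body atoms).
P = [("p", frozenset(), frozenset({"q"})),
     ("q", frozenset(), frozenset({"p"}))]
assert is_answer_set(P, {"p"}) and is_answer_set(P, {"q"})
assert not is_answer_set(P, {"p", "q"})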
3. Problem Formulation
The objectives of this work are twofold:

1. Propose a method for learning a subgoal automaton from traces of a given reinforcement learning task (Sections 4-6).
2. Propose a method that interleaves the automata learning and reinforcement learning processes (Section 7).

In this section we formalize the main components of our automata-driven RL approach, including the class of RL tasks, the notion of traces generated by the RL agent and the definition of subgoal automata. In later sections, we show how subgoal automata and traces are represented in ASP (see Section 4), and how learning a subgoal automaton can be formalized as an ILASP learning task that uses traces as examples (see Section 5).
The class of RL tasks we consider consists of episodic MDPs M = ⟨S, A, p, r, γ, S_T, S_G⟩ characterized by a set of observables O. An observable is a propositional event that the agent can sense while interacting with the environment. A labeling function L : S → 2^O maps an MDP state into a subset of observables (or observation) O ⊆ O perceived by the agent at that state.
OfficeWorld environment (Toro Icarte et al., 2018).MDP state into a subset of observables (or observation ) O ⊆ O perceived by the agent atthat state.We use the OfficeWorld environment (Toro Icarte et al., 2018) as a running exampleto explain our method. It consists of a 9 ×
12 grid (see Figure 1) where an agent ( ) canmove in the four cardinal directions; that is, the action set is A = { up , down , left , right } .The agent remains in the same location if it moves towards a wall. The set of observablesis O = { (cid:75) , (cid:66) , o, A, B, C, D, ∗} . The agent picks up the coffee and the mail when it steps onlocations (cid:75) and (cid:66) respectively, and delivers them to the office when it steps on location o .The decorations ∗ break if the agent steps on them. There are also four locations labeled A , B , C and D . The agent sees these observables when it is at their respective locations.Three tasks with different goals are defined in this environment: • Coffee : deliver coffee to the office. • CoffeeMail : deliver coffee and mail to the office. • VisitABCD : visit A , B , C and D in order.The tasks terminate when the goal is achieved or a decoration is broken (this is a dead-endstate). A reward of 1 is given when the goal is achieved, else the reward is 0. We define the different kinds of traces that can be generated in our class of RL tasks.
Definition 3.1 (Execution trace). An execution trace T = ⟨s_0, a_0, r_1, s_1, a_1, ..., a_{n−1}, r_n, s_n⟩ is a finite state-action-reward sequence induced by a (changing) policy during an episode. An execution trace can be one of the following:

• A goal execution trace T^G if s_n ∈ S_G (i.e., the final state is a goal state).
• A dead-end execution trace T^D if s_n ∈ S_T \ S_G (i.e., the final state is a dead-end state).
• An incomplete execution trace T^I if s_n ∉ S_T (i.e., the final state is not terminal).

Execution traces are the traces collected by an RL agent. However, these are not the traces that are used to learn the subgoal automata. Since subgoal automata aim to provide the RL agent with a subgoal structure that is independent from the state space, they are defined at a higher level of abstraction. Therefore, the subgoal automata learner needs traces of higher-level events as an input. These traces are called observation traces.
Definition 3.2 (Observation trace). An observation trace T_{L,O} is a sequence of observations O_i ⊆ O, 0 ≤ i ≤ n, obtained by applying a labeling function L to each state s_i ∈ S in an execution trace T = ⟨s_0, a_0, r_1, s_1, a_1, ..., a_{n−1}, r_n, s_n⟩. Formally, T_{L,O} = ⟨O_0, ..., O_n | O_i = L(s_i), s_i ∈ T, s_i ∈ S⟩.

Finally, we propose a method for shortening an observation trace based on the following assumptions:

1. Empty observations are irrelevant.
2. Seeing the same observation twice or more in a row is equivalent to seeing it once.

Given these two assumptions, we define a subtype of observation trace called compressed observation trace.
Definition 3.3 (Compressed observation trace). A compressed observation trace T̂_{L,O} = ⟨Ô_0, ..., Ô_m⟩ is the result of removing empty observations and, thereafter, removing contiguous equal observations from an observation trace T_{L,O} = ⟨O_0, ..., O_n⟩.

As we will see experimentally (Section 8), compressed observation traces are helpful to speed up automata learning. However, their applicability is limited to tasks where the assumptions above hold. Thus, this kind of trace should not be used to learn the automata of tasks where every single observation is important, such as "observe ☕ twice in a row". Importantly, the automata learning component of our method does not distinguish observation traces from compressed observation traces. Therefore, for the rest of the paper, we may simply use the term trace or observation trace when we refer to any kind of observation trace.

A set of execution traces is denoted by T = T^G ∪ T^D ∪ T^I, where T^G, T^D and T^I are sets of goal, dead-end and incomplete execution traces, respectively. The associated sets of observation traces and compressed observation traces are analogously denoted by T_{L,O} = T^G_{L,O} ∪ T^D_{L,O} ∪ T^I_{L,O} and T̂_{L,O} = T̂^G_{L,O} ∪ T̂^D_{L,O} ∪ T̂^I_{L,O}.
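A possible implementation of this compression step, assuming observations are represented as sets as in the labeling-function sketch above (the string names of the observables are again illustrative):

def compress_observation_trace(observation_trace):
    """Drop empty observations, then collapse runs of identical observations.

    This mirrors the two assumptions behind Definition 3.3; it returns a new,
    possibly shorter, list of observations.
    """
    non_empty = [obs for obs in observation_trace if obs]
    compressed = []
    for obs in non_empty:
        if not compressed or compressed[-1] != obs:
            compressed.append(obs)
    return compressed

# The observation trace of the OfficeWorld Coffee example (Example 3.1 below):
trace = [frozenset(), frozenset({"coffee"}), frozenset({"coffee"}),
         frozenset(), frozenset(), frozenset({"office"})]
assert compress_observation_trace(trace) == [frozenset({"coffee"}), frozenset({"office"})]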
Example 3.1. The first trace below is a goal execution trace for the OfficeWorld's Coffee task using the grid in Figure 1. The second trace is the associated observation trace, whereas the third trace is the resulting compressed observation trace.

T^G = ⟨s_{p_0}, ←, 0, s_{p_1}, ←, 0, s_{p_2}, →, 0, s_{p_3}, ↓, 0, s_{p_4}, ↓, 1, s_{p_5}⟩,
T^G_{L,O} = ⟨{}, {☕}, {☕}, {}, {}, {o}⟩,
T̂^G_{L,O} = ⟨{☕}, {o}⟩.

The states' subindices in the execution trace correspond to positions in the grid, whereas the arrows correspond to the different actions: up (↑), down (↓), left (←) and right (→).

Now we formally define the kind of automaton that is learned and used together with the RL component in our approach. The edges that characterize this class of automata are labeled by propositional logic formulas over a set of observables. These formulas can be interpreted as the subgoals of the task represented by the automaton. Therefore, we refer to these automata as subgoal automata.

Definition 3.4 (Subgoal automaton). A subgoal automaton is a tuple A = ⟨U, O, δ_ϕ, u_0, u_A, u_R⟩ where

• U is a finite set of states,
• O is a set of observables (or alphabet),
• δ_ϕ : U × 2^O → U is a deterministic transition function that takes as arguments a state and a subset of observables (or observation) and returns a state,
• u_0 ∈ U is the unique initial state,
• u_A ∈ U is the unique absorbing accepting state, and
• u_R ∈ U is the unique absorbing rejecting state.

The accepting state (u_A) denotes the task's goal achievement. In contrast, the rejecting state (u_R) indicates that the goal cannot be achieved from there. Therefore, these states are both absorbing, meaning that they do not have transitions to other states. That is, δ_ϕ(u, O) = u for u ∈ {u_A, u_R} and any observation O ⊆ O.

To determine whether the behavior of an agent during an episode is successful or not, we introduce the concept of automaton traversal.

Definition 3.5 (Automaton traversal). Given an observation trace T_{L,O} = ⟨O_0, ..., O_n⟩, an automaton traversal A(T_{L,O}) = ⟨v_0, v_1, ..., v_{n+1}⟩ is a unique sequence of automaton states such that

1. v_0 = u_0, and
2. δ_ϕ(v_i, O_i) = v_{i+1} for i = 0, ..., n.

A subgoal automaton A accepts an observation trace T_{L,O} if the automaton traversal A(T_{L,O}) = ⟨v_0, v_1, ..., v_{n+1}⟩ is such that v_{n+1} = u_A. Analogously, A rejects T_{L,O} if v_{n+1} = u_R.

Note that, given an automaton A, a goal trace T^G_{L,O} must be accepted by A. Similarly, a dead-end trace T^D_{L,O} must be rejected, and an incomplete trace T^I_{L,O} cannot be accepted or rejected.

The transition function δ_ϕ is constructed from a logical transition function ϕ that maps state pairs into propositional formulas over O, each representing a subgoal of the task. Our approach represents and learns logical transition functions.
2. For the rest of the paper, we may simply use the term automata when we refer to subgoal automata.

Figure 2: Subgoal automaton for the OfficeWorld's Coffee task.
Definition 3.6 (Logical transition function). A logical transition function ϕ : U × U → DNF_O is a transition function that maps a state pair into a disjunctive normal form (DNF) formula over O, where ϕ(u, u) = ⊥ for each u ∈ U (i.e., ϕ only represents transitions to different states).

Expressing ϕ(u, u′) as a DNF formula allows representing multiple edges between the same pair of nodes u and u′. That is, each of the conjunctions inside a DNF formula ϕ(u, u′) labels a different edge between u and u′. The following notation is used throughout the paper:

• |ϕ(u, u′)| denotes the number of conjunctive formulas that form the formula ϕ(u, u′),
• conj_i denotes the i-th conjunction (left-to-right) in the DNF formula ϕ(u, u′), and
• O |= ϕ(u, u′) denotes that observation O ⊆ O satisfies the DNF formula ϕ(u, u′). Note that O is used as a truth assignment where observables in the set (i.e., o ∈ O) are true and observables that are not in the set (i.e., o ∉ O) are false. Formally, O |= ϕ(u, u′) ≡ ∃ conj_i ∈ ϕ(u, u′) such that O |= conj_i.

Given a logical transition function ϕ, the transition function δ_ϕ can be formally defined in terms of ϕ as follows:

δ_ϕ(u, O) = u′ if O |= ϕ(u, u′), and δ_ϕ(u, O) = u if there is no u′ ∈ U such that O |= ϕ(u, u′).   (1)

Note that loop transitions are implicitly defined by the absence of a satisfied formula on outgoing transitions to other states. Besides, this mapping only works if ϕ is deterministic; that is, given a state u ∈ U and an observation O ⊆ O, at most one formula is satisfied. Formally, there are no two states u′, u″ ∈ U such that O |= ϕ(u, u′), O |= ϕ(u, u″), and u′ ≠ u″.

Example 3.2.
Figure 2 shows the subgoal automaton for the OfficeWorld's Coffee task. The edges are labeled according to the logical transition function given below, and loop transitions are only taken if no outgoing transition holds.

ϕ(u_0, u_1) = ☕ ∧ ¬o        ϕ(u_1, u_A) = o
ϕ(u_0, u_A) = ☕ ∧ o         ϕ(u_1, u_R) = ∗ ∧ ¬o
ϕ(u_0, u_R) = ∗ ∧ ¬☕

For all absent pairs of states (u, u′), ϕ(u, u′) = ⊥. The automaton covers two accepting cases: (1) ☕ and o are observed in the same tile (i.e., direct path from u_0 to u_A), and (2) ☕ and o are observed in different tiles (i.e., path from u_0 to u_A through u_1). Note that, in this case, there are not multiple edges from one state to another; that is, |ϕ(u, u′)| = 1 for the pairs of states shown above.

Example 3.3.
The automaton traversal for the observation trace T_{L,O} = ⟨{}, {☕}, {}, {}, {o}⟩ in the automaton of Figure 2 is A(T_{L,O}) = ⟨u_0, u_0, u_1, u_1, u_1, u_A⟩.
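The sketch below shows one way Definitions 3.4-3.6 and Equation 1 could be realized in code: DNF formulas are lists of conjunctions, each conjunction a pair of (positive, negative) observable sets. It is an illustrative data structure rather than the paper's implementation, and the observable names are the same illustrative strings used in the earlier sketches.

class SubgoalAutomaton:
    """Subgoal automaton with DNF-labeled edges (sketch of Definitions 3.4-3.6)."""

    def __init__(self, states, initial, accepting, rejecting):
        self.states, self.initial = states, initial
        self.accepting, self.rejecting = accepting, rejecting
        # edges[u][u'] is a DNF formula: a list of (positive, negative) literal sets.
        self.edges = {u: {} for u in states}

    def add_edge(self, u_from, u_to, pos, neg):
        self.edges[u_from].setdefault(u_to, []).append((frozenset(pos), frozenset(neg)))

    def _satisfies(self, observation, dnf):
        # O |= phi(u, u') iff some conjunction has all positives in O and no negatives.
        return any(pos <= observation and not (neg & observation) for pos, neg in dnf)

    def step(self, u, observation):
        # Equation 1: move to the (unique, by determinism) satisfied successor, else stay.
        for u_next, dnf in self.edges[u].items():
            if self._satisfies(observation, dnf):
                return u_next
        return u

    def traverse(self, observation_trace):
        traversal = [self.initial]
        for observation in observation_trace:
            traversal.append(self.step(traversal[-1], observation))
        return traversal

# The Coffee automaton of Figure 2 and the traversal of Example 3.3.
coffee = SubgoalAutomaton({"u0", "u1", "uA", "uR"}, "u0", "uA", "uR")
coffee.add_edge("u0", "u1", {"coffee"}, {"office"})
coffee.add_edge("u0", "uA", {"coffee", "office"}, set())
coffee.add_edge("u0", "uR", {"*"}, {"coffee"})
coffee.add_edge("u1", "uA", {"office"}, set())
coffee.add_edge("u1", "uR", {"*"}, {"office"})
trace = [set(), {"coffee"}, set(), set(), {"office"}]
assert coffee.traverse([frozenset(o) for o in trace]) == ["u0", "u0", "u1", "u1", "u1", "uA"]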
4. Representation of Subgoal Automata in Answer Set Programming
In this section we explain how subgoal automata are represented using Answer Set Programming (ASP). First, we describe how traces and subgoal automata are represented. Then, we present the general rules that describe the behavior of a subgoal automaton. Finally, we prove the correctness of the representation.
Definition 4.1 (ASP representation of an observation trace). Given an observation trace T_{L,O} = ⟨O_0, ..., O_n⟩, M(T_{L,O}) denotes the set of ASP facts that describe it:

M(T_{L,O}) = {obs(o, t). | 0 ≤ t ≤ n, o ∈ O_t} ∪ {step(t). | 0 ≤ t ≤ n} ∪ {last(n).}.

The obs(o, t) predicate indicates that observable o ∈ O is observed at step t, step(t) states that t is a step of the trace, and last(n) indicates that the trace ends at step n.

Example 4.1.
The set of ASP facts for the observation trace T_{L,O} = ⟨{a}, {}, {b, c}⟩ is

M(T_{L,O}) = {obs(a, 0)., obs(b, 2)., obs(c, 2)., step(0)., step(1)., step(2)., last(2).}.

Definition 4.2 (ASP representation of a subgoal automaton). Given a subgoal automaton A = ⟨U, O, δ_ϕ, u_0, u_A, u_R⟩, M(A) = M_U(A) ∪ M_ϕ(A) denotes the set of ASP rules that describe it, where M_U(A) = {state(u). | u ∈ U} and M_ϕ(A) contains, for each u ∈ U \ {u_A, u_R}, each u′ ∈ U \ {u}, each 1 ≤ i ≤ |ϕ(u, u′)| and each conjunction conj_i = o_1 ∧ ··· ∧ o_n ∧ ¬o_{n+1} ∧ ··· ∧ ¬o_m in ϕ(u, u′), the fact ed(u, u′, i). together with the rules

¯ϕ(u, u′, i, T) :- not obs(o_1, T), step(T).
...
¯ϕ(u, u′, i, T) :- not obs(o_n, T), step(T).
¯ϕ(u, u′, i, T) :- obs(o_{n+1}, T), step(T).
...
¯ϕ(u, u′, i, T) :- obs(o_m, T), step(T).

The rules in M(A) are described as follows:

• Facts state(u) indicate that u is an automaton state.
• Facts ed(u, u′, i) indicate that there is a transition from state u to u′ using edge i. Note that i identifies the i-th conjunction in the DNF formula ϕ(u, u′) (remember that each conjunction in ϕ(u, u′) represents a different edge between u and u′).
• Normal rules whose head is of the form ¯ϕ(u, u′, i, T) state that the transition from state u to state u′ with edge i does not hold at step T. The body of these rules consists of a single obs(o, T) literal and an atom step(T) indicating that T is a step. Remember that we represent variables using upper case letters, which is the case of the step variable T here.

Note that ¯ϕ represents the negation of the logical transition function ϕ. This is because, as discussed in more detail later in Section 5, learning the negation of the logical transition function ϕ makes the search space smaller and, thus, makes the learning process faster.

Example 4.2.
The following rules represent the automaton in Figure 2.

state(u0).   state(u1).   state(uA).   state(uR).
ed(u0, u1, 1).   ed(u0, uA, 1).   ed(u0, uR, 1).   ed(u1, uA, 1).   ed(u1, uR, 1).
¯ϕ(u0, u1, 1, T) :- not obs(☕, T), step(T).
¯ϕ(u0, u1, 1, T) :- obs(o, T), step(T).
¯ϕ(u0, uA, 1, T) :- not obs(☕, T), step(T).
¯ϕ(u0, uA, 1, T) :- not obs(o, T), step(T).
¯ϕ(u0, uR, 1, T) :- not obs(∗, T), step(T).
¯ϕ(u0, uR, 1, T) :- obs(☕, T), step(T).
¯ϕ(u1, uA, 1, T) :- not obs(o, T), step(T).
¯ϕ(u1, uR, 1, T) :- not obs(∗, T), step(T).
¯ϕ(u1, uR, 1, T) :- obs(o, T), step(T).

General Rules.
In order to check whether an automaton accepts or rejects an observation trace, it is necessary to reason about the automaton's behavior. This is done by means of a set of rules R that define how a subgoal automaton processes an observation trace. This set of rules is given by the union of different components, R = R_ϕ ∪ R_δ ∪ R_st. The subsets R_ϕ and R_δ define the rules related to the automaton transition function:

• The first rule in R_ϕ defines the logical transition function ϕ in terms of its negation ¯ϕ and the ed atoms. The second rule indicates that an outgoing transition from state X is taken at step T.

R_ϕ = { ϕ(X, Y, E, T) :- not ¯ϕ(X, Y, E, T), ed(X, Y, E), step(T).
        out_ϕ(X, T) :- ϕ(X, _, _, T). }

• The rules in R_δ define the transition function δ in terms of ϕ, as defined in Equation 1. The first rule states that X transitions to Y at step T if an outgoing transition to Y holds at that step. In contrast, the second rule indicates that state X transitions to itself at step T if no outgoing transition is satisfied at that step.

R_δ = { δ(X, Y, T) :- ϕ(X, Y, _, T).
        δ(X, X, T) :- not out_ϕ(X, T), state(X), step(T). }
The subset R_st is used to define the automaton traversal of the trace (that is, the sequence of visited automaton states), and the criteria for accepting or rejecting a trace. The st(T, X) atoms indicate that a trace is in state X at step T. The first rule defines that the agent is in u_0 at step 0. The second rule determines that at step T + 1 the agent will be in state Y if it is in state X at step T and a transition between them holds at that step. The third (resp. fourth) rule indicates that the observation trace is accepted (resp. rejected) if the state at the trace's last step is u_A (resp. u_R).

R_st = { st(0, u0).
         st(T + 1, Y) :- st(T, X), δ(X, Y, T).
         accept :- last(T), st(T + 1, uA).
         reject :- last(T), st(T + 1, uR). }

Proposition 4.1 (Correctness of the ASP encoding). Given an automaton A and a finite observation trace T^∗_{L,O}, where ∗ ∈ {G, D, I}, the program P = M(A) ∪ R ∪ M(T^∗_{L,O}) has a unique answer set AS and (1) accept ∈ AS if and only if ∗ = G, and (2) reject ∈ AS if and only if ∗ = D.

Proof. See Appendix A.
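To connect these pieces, here is a small sketch of how the ASP encoding M(T_{L,O}) of Definition 4.1 could be produced from an observation trace; the resulting facts are what the examples combine with the general rules R. The function name and the plain-text predicate spelling are assumptions for this example.

def trace_to_asp_facts(observation_trace):
    """Encode an observation trace as the ASP facts of Definition 4.1."""
    facts = []
    for t, observation in enumerate(observation_trace):
        facts.extend(f"obs({o},{t})." for o in sorted(observation))
        facts.append(f"step({t}).")
    facts.append(f"last({len(observation_trace) - 1}).")
    return facts

# The trace of Example 4.1: <{a}, {}, {b, c}>.
print("\n".join(trace_to_asp_facts([{"a"}, set(), {"b", "c"}])))
# Prints, one fact per line: obs(a,0). step(0). step(1). obs(b,2). obs(c,2). step(2). last(2).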
5. Learning Subgoal Automata from Traces
This section describes our approach for learning a subgoal automaton. Firstly, we formalize the task of learning an automaton from traces.
Definition 5.1.
An automaton learning task is a tuple T_A = ⟨U, O, u_0, u_A, u_R, T_{L,O}, κ⟩, where

• U ⊇ {u_0, u_A, u_R} is a set of states, where u_0 is the initial state, u_A is the accepting state and u_R is the rejecting state;
• O is a set of observables;
• T_{L,O} = T^G_{L,O} ∪ T^D_{L,O} ∪ T^I_{L,O} is a set of observation traces; and
• κ is the maximum number of directed edges (u, u′) from a state u ∈ U to another state u′ ∈ U \ {u}.

An automaton A is a solution of T_A if and only if it correctly classifies the traces in T_{L,O}; that is, if and only if it accepts all goal traces T^G_{L,O}, rejects all dead-end traces T^D_{L,O}, and neither accepts nor rejects incomplete traces T^I_{L,O}.

Note that κ can be seen as the maximum number of disjuncts that a DNF formula ϕ(u, u′) between two states u and u′ can have.

Given an automaton learning task T_A, we map it into an ILASP learning task M(T_A) = ⟨B, S_M, ⟨E^+, ∅⟩⟩ and use the ILASP system (Law et al., 2015b) to find a minimal inductive solution M_ϕ(A) ⊆ S_M that covers the examples. We define the different components of M(T_A) below.
4. Note that we do not use negative examples (E^− = ∅).

Background Knowledge.
The background knowledge B = B_U ∪ R is the union of two different components:

• B_U is a set of state(u) facts for each u ∈ U; and
• R is the set of general rules that encode the behavior of a subgoal automaton (defined in Section 4).

Hypothesis Space.
The hypothesis space S_M contains all ed and ¯ϕ rules that characterize a transition from a non-terminal state u ∈ U \ {u_A, u_R} to a different state u′ ∈ U \ {u} using edge i ∈ [1, κ]. Formally, it is defined as

S_M = { ed(u, u′, i).
        ¯ϕ(u, u′, i, T) :- obs(o, T), step(T).
        ¯ϕ(u, u′, i, T) :- not obs(o, T), step(T).
        | u ∈ U \ {u_A, u_R}, u′ ∈ U \ {u}, i ∈ [1, κ], o ∈ O }.

Loop transitions are not included since they are unsatisfiable formulas (see Definition 3.6). Note that it is possible to learn unlabeled transitions, which are taken unconditionally (that is, regardless of the current observation). For example, for a transition from u to u′ using edge i, an inductive solution may only include ed(u, u′, i) and no ¯ϕ(u, u′, i, T) rule.

As mentioned before, the learner induces the negation ¯ϕ of a logical transition function ϕ. A different hypothesis space where the learned rules characterize ϕ directly could have been defined. However, this requires guessing the maximum number of literals that label a transition between two states. Therefore, we represent subgoal automata using ¯ϕ and, instead of imposing a maximum size for the conjunctive formulas, we impose a limit (κ) on the number of edges from one state to another.

It is important to realize that the learned hypothesis is denoted by M_ϕ(A) and not M(A) (see Definition 4.2). The set of automaton states is given in the background knowledge B, and the hypothesis space S_M only contains transition rules. Hence, the hypothesis is a smallest subset of transition rules that covers all the examples. Since the set of automaton states is provided through the background knowledge, a minimal automaton (i.e., an automaton with the minimum number of states) is only guaranteed to be learned when the set of automaton states is the minimal one. The mechanism that interleaves reinforcement and automata learning described in Section 7 ensures that the learned automaton is minimal for a specific κ.

5. Remember that the set of rules R introduced in Section 4 defines ϕ in terms of ¯ϕ.
6. This is because ILASP has the maximum length of learnable rules as a parameter. This is a problem to enforce determinism (see Section 6.1) since we do not know how many literals are going to be needed to make two formulas mutually exclusive. Besides, allowing for an arbitrarily large number of literals to overcome the problem increases the hypothesis space massively.
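As a rough illustration of the size and shape of this hypothesis space, the sketch below enumerates the candidate ed and ¯ϕ rules (written as plain strings, with phi_bar standing in for ¯ϕ) for given U, O and κ. It is only meant to show how the space grows; it is not how ILASP represents the space internally.

from itertools import product

def hypothesis_space(states, observables, kappa, accepting="uA", rejecting="uR"):
    """Enumerate the candidate transition rules of S_M as strings."""
    rules = []
    sources = [u for u in states if u not in (accepting, rejecting)]
    for u, u2 in product(sources, states):
        if u == u2:
            continue  # no loop transitions in the hypothesis space
        for i in range(1, kappa + 1):
            rules.append(f"ed({u},{u2},{i}).")
            for o in observables:
                rules.append(f"phi_bar({u},{u2},{i},T) :- obs({o},T), step(T).")
                rules.append(f"phi_bar({u},{u2},{i},T) :- not obs({o},T), step(T).")
    return rules

# For the Coffee task: 4 states, 3 relevant observables, kappa = 1.
space = hypothesis_space(["u0", "u1", "uA", "uR"], ["coffee", "office", "*"], kappa=1)
print(len(space))  # 6 candidate edges x (1 ed fact + 2 rules per observable) = 42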
Example Sets. Given a set of traces T_{L,O} = T^G_{L,O} ∪ T^D_{L,O} ∪ T^I_{L,O}, the set of positive examples is defined as E^+ = {⟨e_∗, M(T_{L,O})⟩ | ∗ ∈ {G, D, I}, T_{L,O} ∈ T^∗_{L,O}}, where

• e_G = ⟨{accept}, {reject}⟩,
• e_D = ⟨{reject}, {accept}⟩, and
• e_I = ⟨{}, {accept, reject}⟩

are the partial interpretations for goal, dead-end and incomplete traces. The accept and reject atoms express whether a trace is accepted or rejected by the automaton; hence, goal traces must only be accepted, dead-end traces must only be rejected, and incomplete traces cannot be accepted or rejected. Note that the context of each example is the set of ASP facts M(T_{L,O}) that represents the corresponding trace (see Definition 4.1).

Correctness of the Learning Task.
The following theorem captures the correctness of the automaton learning task.
Theorem 5.1.
Given an automaton learning task T_A = ⟨U, O, u_0, u_A, u_R, T_{L,O}, κ⟩, an automaton A is a solution of T_A if and only if M_ϕ(A) is an inductive solution of M(T_A) = ⟨B, S_M, ⟨E^+, ∅⟩⟩.

Proof. Assume A is a solution of T_A.
⟺ A accepts traces in T^G_{L,O}, rejects traces in T^D_{L,O} and neither accepts nor rejects traces in T^I_{L,O}.
⟺ By Proposition 4.1, for each trace T^∗_{L,O} ∈ T^∗_{L,O} where ∗ ∈ {G, D, I}, M(A) ∪ R ∪ M(T^∗_{L,O}) has a unique answer set AS and (1) accept ∈ AS if and only if ∗ = G, and (2) reject ∈ AS if and only if ∗ = D.
⟺ For each example e ∈ E^+, R ∪ M(A) accepts e.
⟺ For each example e ∈ E^+, B ∪ M_ϕ(A) accepts e (the two programs are identical).
⟺ M_ϕ(A) is an inductive solution of M(T_A).

Optimization.
To make the formalization simpler, we have always included u_A and u_R in the set of automaton states U of the automaton learning task T_A. However, in practice, u_A is not included in U when the set of goal traces T^G_{L,O} is empty. Likewise, u_R is not included in U when the set of dead-end traces T^D_{L,O} is empty. Removing these states when they are not needed is helpful to make the hypothesis space smaller.
6. Verification of Structural Properties
The formalization of the automaton learning task described in Section 5 can induce non-deterministic automata. That is, there can exist an observation that simultaneously satisfies two formulas labeling outgoing edges to two different states. In order to comply with the definition of the logical transition function (see Definition 3.6), we need to constrain it to be deterministic. The automaton states act as a proxy for determining what is the current level of completion of the task; in other words, which subgoals have been achieved. If the automaton is non-deterministic, several different levels of completion can be active simultaneously, which is impractical from the perspective of an RL agent: the agent needs to know exactly in which stage it is in order to make an appropriate decision.

Determinism is not the only property we can impose. Note that an automaton can be easily transformed into an equivalent one by rearranging its state and edge identifiers. Therefore, ILASP can visit multiple symmetric automata during the search for a solution that covers the examples. By avoiding revisiting such parts of the search space, the automata learning time can be greatly reduced.

In this section we introduce rules that allow us to verify that a given automaton is deterministic and complies with specific properties that characterize a canonical representation (e.g., a criterion for naming the automaton states). Crucially, these verification rules can also be used to enforce these properties during search.

Both the determinism and symmetry properties are related to the automata's structure; that is, the edges and the formulas that label them. The ASP representation M(A) of a subgoal automaton A given in Definition 4.2 represents the formulas on the edges as rules. Consequently, we cannot impose constraints over them easily. To solve this problem, we map the ¯ϕ rules in M(A) to facts of the form pos(u, u′, i, o) and neg(u, u′, i, o). These facts express that observable o ∈ O appears positively (resp. negatively) in the edge i from state u to state u′. Formally, given the ASP encoding M(A) of automaton A, its corresponding mapping into facts F(M(A)) replicates the state and ed facts, and replaces each rule ¯ϕ(u, u′, i, T) :- not obs(o, T), step(T) with the fact pos(u, u′, i, o), and each rule ¯ϕ(u, u′, i, T) :- obs(o, T), step(T) with the fact neg(u, u′, i, o). Given this factual representation of a subgoal automaton, we can now define constraints for enforcing determinism and a canonical structure over these facts.

7. Note that we could have learned the factual representation F(M(A)) instead of the one based on normal rules M(A). We opted for the latter since it requires less grounding and can potentially be used to represent more complex automata in future work (see Section 10).

Example 6.1.
The set of facts below represents the formulas on the edges of the automaton in Figure 2, generated from the logical transition function in Example 4.2.

pos(u0, u1, 1, ☕).   neg(u0, u1, 1, o).
pos(u0, uA, 1, ☕).   pos(u0, uA, 1, o).
pos(u0, uR, 1, ∗).   neg(u0, uR, 1, ☕).
pos(u1, uA, 1, o).
pos(u1, uR, 1, ∗).   neg(u1, uR, 1, o).

Verifying that a learned automaton complies with a set of structural properties can be done over its factual encoding. That is, those automata that violate the properties are discarded as solutions to the automaton learning task. However, this is clearly computationally expensive. ILASP allows the above mapping to be included in the learning task through meta-program injection (Law et al., 2018). This enables the learner to verify the properties during the search for an automaton, which effectively shrinks the search space and speeds up automata learning.
In what follows, we use the factual representation above to encode the determinism constraints (Section 6.1) and a canonical structure for breaking symmetries (Section 6.2).

As described in Section 3.3, a logical transition function is deterministic if no observation can satisfy two outgoing formulas to two different states simultaneously. As these formulas are conjunctions of literals, two formulas are mutually exclusive (i.e., cannot both be satisfied at the same time) if and only if an observable appears positively in the first formula and negatively in the second (or vice versa). The set of rules below encodes this definition by means of the mutex(X, Y, EY, Z, EZ) predicate, which indicates that the formula on the edge from X to Y with index EY is mutually exclusive with the formula on the edge from X to Z with index EZ. The first and second rules specify that two outgoing edges from state X to two different states (Y and Z) are mutually exclusive if an observable O appears positively in one edge and negatively in the other. The third rule enforces edges from a state X to two different states Y and Z to be mutually exclusive.

mutex(X, Y, EY, Z, EZ) :- pos(X, Y, EY, O), neg(X, Z, EZ, O), Y < Z.
mutex(X, Y, EY, Z, EZ) :- neg(X, Y, EY, O), pos(X, Z, EZ, O), Y < Z.
:- not mutex(X, Y, EY, Z, EZ), ed(X, Y, EY), ed(X, Z, EZ), Y < Z.

In this section we describe rules which aim to enforce that the induced automata follow a canonical structure in order to make the search for a solution faster by breaking symmetries. There are several types of symmetries we are interested in breaking. Firstly, any two states except for u_0, u_A and u_R are interchangeable. For example, Figure 3 shows two automata whose states u_1, u_2 and u_3 can be used interchangeably. Secondly, there can also be symmetries in the automaton edges. The inductive solution of the automaton learning task M(T_A) contains transition rules whose edge indices range between 1 and κ (the maximum number of edges from one state to another). For instance, if κ = 2, a potentially learned representation of the Coffee automaton (see Figure 2) is shown below. Note that edges can arbitrarily be labeled 1 or 2 even though there is a single edge between every pair of states.

ed(u0, u1, 2).   ed(u0, uA, 1).   ed(u0, uR, 2).   ed(u1, uA, 2).   ed(u1, uR, 1).
¯ϕ(u0, u1, 2, T) :- not obs(☕, T), step(T).
¯ϕ(u0, u1, 2, T) :- obs(o, T), step(T).
¯ϕ(u0, uA, 1, T) :- not obs(☕, T), step(T).
¯ϕ(u0, uA, 1, T) :- not obs(o, T), step(T).
¯ϕ(u0, uR, 2, T) :- not obs(∗, T), step(T).
¯ϕ(u0, uR, 2, T) :- obs(☕, T), step(T).
¯ϕ(u1, uA, 2, T) :- not obs(o, T), step(T).
¯ϕ(u1, uR, 1, T) :- not obs(∗, T), step(T).
¯ϕ(u1, uR, 1, T) :- obs(o, T), step(T).

Finally, the indices of two edges between the same pair of states can also be interchanged.
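Before turning to symmetry breaking, the determinism condition of Section 6.1 can also be stated directly on the (pos, neg) edge encoding, as the following sketch illustrates. It mirrors the mutex rules above; the dictionary-based edge representation and the observable names are assumptions made for this example.

from itertools import combinations

def is_deterministic(edges):
    """Check the mutual-exclusion condition of Section 6.1.

    `edges` maps a source state to a list of (target, pos, neg) conjunctions.
    Two conjunctions leaving the same state towards different targets must be
    mutually exclusive: some observable must appear positively in one and
    negatively in the other.
    """
    for source, conjunctions in edges.items():
        for (t1, pos1, neg1), (t2, pos2, neg2) in combinations(conjunctions, 2):
            if t1 == t2:
                continue  # edges to the same target need not be mutually exclusive
            if not ((pos1 & neg2) or (neg1 & pos2)):
                return False
    return True

# The Coffee automaton (Example 6.1) is deterministic...
coffee_edges = {
    "u0": [("u1", {"coffee"}, {"office"}),
           ("uA", {"coffee", "office"}, set()),
           ("uR", {"*"}, {"coffee"})],
    "u1": [("uA", {"office"}, set()),
           ("uR", {"*"}, {"office"})],
}
assert is_deterministic(coffee_edges)
# ...but it would not be if u0 -> uR were labeled by * alone.
coffee_edges["u0"][2] = ("uR", {"*"}, set())
assert not is_deterministic(coffee_edges)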
8. The comparison Y < Z is done instead of Y != Z for efficiency purposes. Note that both comparisons are equivalent in this context. The former imposes a lexicographical order to evaluate the rules and thus avoids reevaluating the expression when Y and Z are interchanged.
Figure 3: Automata for two OfficeWorld tasks: (a) VisitABCD and (b) CoffeeMail. Self-loops and transitions to the rejecting state u_R in (b) are omitted for simplicity. The shaded states can be interchanged in the absence of symmetry breaking.

The idea of our symmetry breaking mechanism is to impose a unique assignment of state and edge indices given a labeling of the automaton edges. In Section 6.2.1 we propose a symmetry breaking mechanism for a particular class of labeled directed graphs, whereas in Section 6.2.2 we explain how this method applies to subgoal automata, which can be represented as graphs in this class.

In this section we propose a symmetry breaking mechanism for a particular class of labeled directed graphs. Let 𝓛 = {l_1, ..., l_k} be a set of labels, and let G = (V, E) be a labeled directed graph with a set of nodes V = {v_1, ..., v_n} and a set of edges E. Each edge in E is of the form (u, v, L), where u, v ∈ V are the two connected nodes, and L ⊆ 𝓛 is a subset of labels. For each node u ∈ V, let E_o(u) = {(v, w, L) ∈ E | u = v} be the set of outgoing labeled edges from u, and let E_i(u) = {(v, w, L) ∈ E | u = w} be the set of incoming labeled edges.

We define a class 𝒢 of labeled directed graphs by imposing three assumptions:

Assumption 1.
The node v_1 is a designated start node.

Assumption 2.
Each node u ∈ V \ {v_1} is reachable on a directed path from v_1.

Assumption 3.
Outgoing label sets from each node are unique, i.e., for each u ∈ V and label set L ⊆ 𝓛 there is at most one edge (u, v, L) ∈ E_o(u).

As a consequence of Assumption 2, it holds that |E_i(u)| ≥ 1 for each u ∈ V \ {v_1}.

Example 6.2.
The figure below is a labeled directed graph G = ⟨V, E⟩ that belongs to class 𝒢, where V = {v_1, ..., v_5}. The set of labels is 𝓛 = {a, b, c, d, e, f}.

[Figure: a labeled directed graph with start node v_1 and nodes v_2, ..., v_5; its edges carry the label sets {a, f}, {b, e}, {a, b}, {b}, {a}, {c} and {d}.]

Label Set Ordering.
Given a set of labels 𝓛 = {l_1, ..., l_k}, we impose a total order on label sets as follows.

Definition 6.1.
A label set L ⊆ 𝓛 is lower than a label set L′ ⊆ 𝓛, denoted L < L′, if there exists a label l_m ∈ 𝓛, 1 ≤ m ≤ k, such that

1. l_m ∉ L and l_m ∈ L′, and
2. there is not a label l_m′ with m′ < m such that l_m′ ∈ L and l_m′ ∉ L′.

Example 6.3. Given the label set 𝓛 = {a, b, c, d, e, f}, the following inequalities between some of its subsets hold. Note that the second column contains the corresponding binary representations of the sets.

{} < {a, d, f}       (000000 < 100101),
{d} < {c}            (000100 < 001000),
{b, e} < {a, f}      (010010 < 100001),
{a, f} < {a, b}      (100001 < 110000).

Graph Indexing. Given a graph G ∈ 𝒢, we next introduce a graph indexing that assigns unique integers to each node in V and, for each node u ∈ V, to each outgoing edge in E_o(u). Formally, a graph indexing is a tuple I(G) = ⟨f, {Γ_u}_{u ∈ V}⟩ of bijections defined as

f : V → {1, ..., |V|} such that f(v_1) = 1,
Γ_u : E_o(u) → {1, ..., |E_o(u)|}, for all u ∈ V.

Hence a graph indexing always assigns 1 to the designated start node v_1. Since outgoing label sets are unique due to Assumption 3, we use Γ_u(L) as shorthand for Γ_u(u, v, L).

Given a graph indexing I(G), we introduce an associated parent function Π_I : V \ {v_1} → {1, ..., |V|} × ℕ from nodes (excluding the start node v_1) to pairs of integers, defined as

Π_I(v) = min_{(u,v,L) ∈ E_i(v)} (f(u), Γ_u(L)).

Here, the minimum is with respect to a lexicographical ordering of integer pairs. Hence Π_I(v) = (i, e) is the smallest integer i assigned to any node on an incoming edge to v and, in the case of ties, the smallest integer e on such an edge. Note that the parent function is well-defined since |E_i(v)| ≥ 1 for each v ∈ V \ {v_1} due to the reachability assumption.

We are particularly interested in the graph indexing that corresponds to a breadth-first search (BFS) traversal of the graph G, which we proceed to define.

Definition 6.2. A graph indexing I(G) is a BFS traversal if the following conditions hold:

1. For each pair of nodes u and v in V \ {v_1}, Π_I(u) < Π_I(v) ⇔ f(u) < f(v).
2. For each node u ∈ V and each pair of outgoing edges (u, v, L) and (u, v′, L′) in E_o(u), L < L′ ⇔ Γ_u(L) < Γ_u(L′).

Due to the second condition in Definition 6.2, the bijection Γ_u of each node u ∈ V clearly orders outgoing edges by their label sets. Due to the first condition, the bijection f orders node u before node v if the parent function of u is smaller than that of v. In a BFS traversal from v_1, nodes are processed in the order they are first visited. In this context, the parent function identifies the edge used to visit a node for the first time. Together, these facts imply that f assigns integers to nodes in the order they are visited by a BFS traversal from v_1, given that the label set ordering is used to break ties among edges. This BFS traversal can be characterized by a BFS subtree whose edges are defined by the parent function.

Example 6.4. The figure below shows a graph indexing for the graph in Example 6.2, with nodes and edges labeled by their assigned integer. This graph indexing is a BFS traversal since nodes are ordered according to their distance from the start node v_1, and since the edge integers used to break ties are consistent with the label set ordering shown in Example 6.3. The parent function follows directly from this indexing, and the corresponding BFS subtree appears in bold.

[Figure: the graph of Example 6.2 with its unique BFS indexing; nodes and outgoing edges are annotated with the integers assigned by f and Γ_u, and the BFS subtree edges are shown in bold.]
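The label set ordering and the BFS indexing can be stated compactly in code. The sketch below ranks a label set by reading it as a binary number (most significant bit first, as in Example 6.3) and computes the unique BFS indexing of Definition 6.2 for a graph given as (source, target, label set) edges. It is an illustration of the definitions, not the ASP encoding used by ILASP, and the toy graph at the end is made up rather than the one in Example 6.2.

from collections import deque

def label_set_rank(labels, label_set):
    """Read a label set as a binary number over the ordered alphabet `labels`.

    Lower rank means lower in the total order of Definition 6.1, e.g.
    {d} (000100) < {c} (001000) for labels = "abcdef".
    """
    return int("".join("1" if l in label_set else "0" for l in labels), 2)

def bfs_indexing(labels, nodes, edges, start):
    """Unique BFS traversal indexing: node index f and per-node edge index Gamma."""
    out = {u: sorted((e for e in edges if e[0] == u),
                     key=lambda e: label_set_rank(labels, e[2])) for u in nodes}
    gamma = {(u, frozenset(L)): i + 1
             for u in nodes for i, (_, _, L) in enumerate(out[u])}
    f, queue = {start: 1}, deque([start])
    while queue:
        u = queue.popleft()
        for _, v, L in out[u]:          # neighbors are visited in label-set order
            if v not in f:
                f[v] = len(f) + 1
                queue.append(v)
    return f, gamma

edges = [("v1", "v2", {"b"}), ("v1", "v3", {"a", "f"}), ("v3", "v4", {"c"})]
f, gamma = bfs_indexing("abcdef", ["v1", "v2", "v3", "v4"], edges, "v1")
assert f == {"v1": 1, "v2": 2, "v3": 3, "v4": 4}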
The important property that we exploit about BFS traversals is that they are unique, which we prove in Lemma 6.1. To prove it, we use the result in Proposition 6.1, which holds when BFS is applied to any kind of directed graph where all nodes are reachable.

Proposition 6.1. If BFS visits the neighbors of each node in a fixed order, the resulting tree is unique.

Proof. By contradiction. Assume that two different BFS trees, T and T′, are produced using the same visitation criteria. Then T contains an edge (u, v) that is not in T′, and T′ contains an edge (u′, v) that is not in T. In the case of T, this means that u was visited before u′. Analogously, u′ was visited before u to produce T′. Therefore, the visitation criteria are different for each of the BFS trees. This is a contradiction.

Lemma 6.1. Each graph G ∈ 𝒢 has a unique associated BFS traversal I(G).

Proof. Intuitively, the lemma holds because there is only one way to perform a BFS traversal from v_1, given that we use the label set ordering to break ties among edges.

Formally, for each u ∈ V, since outgoing label sets are unique, there is a unique bijection Γ_u that satisfies the second condition in Definition 6.2. If Γ_u orders outgoing edges in any other way, there will always be two outgoing edges (u, v, L) and (u, v′, L′) in E_o(u) such that L < L′ and Γ_u(L) > Γ_u(L′), thus violating the condition.

Then, if we fix the correct definition of Γ_u for each u ∈ V, there is a unique bijection f that satisfies the first condition in Definition 6.2. To recover this bijection we can simply perform a BFS traversal from v_1, using the bijections Γ_u, u ∈ V, to break ties among edges. By Proposition 6.1, this results in a unique BFS tree. If f orders nodes in any other way, there will always be two nodes u and v in V \ {v_1} such that Π_I(u) < Π_I(v) and f(u) > f(v), thus violating the condition.

Appendix B.1 shows an encoding of our symmetry breaking mechanism in the form of a satisfiability (SAT) formula and formally proves several of its properties.

In this section we devise a symmetry breaking mechanism based on the graph indexing from the previous subsection. The key idea is to encode rules that force ILASP to generate a graph indexing which is also a BFS traversal. Since this graph indexing is unique due to Lemma 6.1, ILASP can only represent each graph in one way, precluding multiple symmetric variations. In fact, we could have used any unique graph indexing for this purpose.

We first show that a subgoal automaton A = ⟨U, O, δ_ϕ, u_0, u_A, u_R⟩ is a special case of a labeled directed graph G = ⟨V, E⟩ in the class 𝒢. The set of automaton states U corresponds to the set of nodes V, while the logical transition function ϕ corresponds to the set of edges E. Crucially, a subgoal automaton complies with all three assumptions we made about graphs in the class 𝒢:

• Assumption 1 holds because A has an initial state u_0.
• Assumption 2 is enforced by the set of rules below. The first rule defines u_0 to be reachable, while the second rule indicates that a state is reachable if it has an incoming edge from a reachable state. Finally, the third rule enforces all states to be reachable.

reachable(u0).
reachable(Y) :- reachable(X), ed(X, Y, _).
In this subsection we devise a symmetry breaking mechanism based on the graph indexing from the previous subsection. The key idea is to encode rules that force ILASP to generate a graph indexing which is also a BFS traversal. Since this graph indexing is unique due to Lemma 6.1, ILASP can only represent each graph in one way, precluding multiple symmetric variations. In fact, we could have used any unique graph indexing for this purpose.

We first show that a subgoal automaton A = ⟨U, O, δ_ϕ, u_0, u_A, u_R⟩ is a special case of a labeled directed graph G = ⟨V, E⟩ in the class 𝒢. The set of automaton states U corresponds to the set of nodes V, while the logical transition function ϕ corresponds to the set of edges E. Crucially, a subgoal automaton complies with all three assumptions we made about graphs in the class 𝒢:

• Assumption 1 holds because A has an initial state u_0.

• Assumption 2 is enforced by the set of rules below. The first rule defines u_0 to be reachable, the second rule indicates that a state is reachable if it has an incoming edge from a reachable state, and the third rule enforces all states to be reachable.

    reachable(u0).
    reachable(Y) :- reachable(X), ed(X, Y, _).
    :- not reachable(X), state(X).

• Assumption 3 holds because the automaton is deterministic (see Section 6.1). If the automaton is deterministic, the formulas labeling two outgoing edges from a given state to two different states are mutually exclusive and, thus, different (which fulfills the assumption). Besides, two outgoing edges from a given state to the same state cannot carry equal labels because the automaton learner, ILASP, induces a minimal hypothesis (in our case, a minimal set of transition rules). Since a hypothesis with two equally labeled edges is not minimal, the assumption still holds.

Even though a subgoal automaton complies with the three assumptions of labeled directed graphs, there are two differences between them:
1. The edges of a subgoal automaton are labeled by propositional formulas over a set of observables O, whereas the edges of a labeled directed graph are defined over a set of labels 𝓛.
2. The edge indices in an indexed labeled directed graph are different from those in our representation of a subgoal automaton. In the former, the edge indices from a given node range between 1 and the number of outgoing edges from that node. In the latter, the indices range from 1 to the number of edges between each pair of states. In other words, the edge indices from a given node in the labeled directed graph are unique, whereas in a subgoal automaton they can be repeated.

To address these differences we just need to define a mapping from the representation used by the subgoal automata to that characterizing a labeled directed graph. The set of symmetry breaking constraints is defined on top of this mapping. Appendix B.2 shows two possible ways of encoding the symmetry breaking mechanism in ASP using the factual representation introduced at the beginning of this section.

7. Interleaved Automata Learning

In this section we describe ISA (Induction of Subgoal Automata for Reinforcement Learning), a method that combines reinforcement and automaton learning. Firstly, we explain two reinforcement learning algorithms that we use to exploit a given subgoal automaton (Section 7.1). Secondly, we explain how the reinforcement and automata learning processes are interleaved to learn a minimal automaton (Section 7.2). The automata learning process includes the constraints introduced in Section 6 for enforcing determinism, while the symmetry breaking ones are optional.

In this section we describe two methods to learn a policy for an episodic MDP M = ⟨S, A, p, r, γ, S_T, S_G⟩ by exploiting the automaton structure given by its subgoals. Each method is characterized by a different way of using the options framework (Sutton et al., 1999):
1. Learning an option for each outgoing edge in the automaton and a metacontroller to choose between options in each automaton state (Section 7.1.1).
2. Learning an option for each automaton state (Section 7.1.2).

The automaton is used as an external memory: each automaton state indicates which subgoals have been achieved so far. In general, a subgoal automaton is used as follows. Firstly, the agent selects an option when it reaches an automaton state. Note that in (1) there can be multiple options to choose from, whereas in (2) there is a single option. Once an option is chosen, the agent selects actions according to that option's policy until its termination. An option terminates when either (a) a terminal MDP state is reached or (b) a formula on an outgoing edge from the current automaton state is satisfied.
After the agent experiences (s, a, s′) in the automaton state u, it transitions to the automaton state u′ = δ_ϕ(u, L(s′)) given by the transition function of the automaton. Remember that the labeling function L maps a state into an observation.

In the following sections we describe the features of the RL methods in detail, including:
• how options are modeled,
• how policies are learned, and
• which optimality guarantees they have.

The edges of a subgoal automaton are labeled by propositional formulas over a set of observables O. The idea is that each of these formulas represents a subgoal of the task encoded by the automaton. Therefore, an intuitive approach to exploit the automaton structure consists in learning an option that aims to reach a state that satisfies a given formula, and a metacontroller that learns which option to take at each automaton state. Since decisions are taken at two different hierarchical levels, we will refer to this approach as HRL (Hierarchical Reinforcement Learning).

Option Modeling. Given a subgoal automaton A = ⟨U, O, δ_ϕ, u_0, u_A, u_R⟩, the set of options in a non-terminal state u ∈ U (that is, a state with outgoing transitions to other states) is

    Ω_u = { ω_{u,φ} | φ ∈ ϕ(u, u′), u ≠ u′ },

where ω_{u,φ} is the option that attempts to satisfy the formula φ in state u ∈ U. Note that φ is a disjunct of the DNF formula ϕ(u, u′). (Remember that the logical transition function ϕ : U × U → DNF_O maps an automaton state pair into a DNF formula over O.) Each option at the automaton state u is a tuple ω_{u,φ} = ⟨I_u, π_φ, β_u⟩ where:

• The initiation set I_u is the set of all those MDP states that satisfy a formula on an incoming edge of u: I_u = { s ∈ S | L(s) ⊨ ϕ(u′, u), u′ ≠ u }. In the case of the initial automaton state u_0 ∈ U, the initiation set is the whole state space S since there is no restriction imposed by any previous automaton state.

• The policy π_φ : S → Δ(A) maps MDP states into a probability distribution over primitive actions with the goal of satisfying the propositional formula φ.

• The termination condition β_u indicates that the option terminates if any formula on an outgoing edge from u holds (i.e., the formula does not necessarily have to be φ) or a terminal state is reached:

    β_u(s) = 1 if s ∈ S_T or there exists u′ ∈ U such that L(s) ⊨ ϕ(u, u′), and 0 otherwise.

Note that the policy π_φ can be shared by different options in the automaton. This opportunity arises when two different automaton states have outgoing edges labeled by the same formula φ. Indeed, we store a dictionary that maps a formula φ into its Q-function and use it for the different options that depend on φ.

In the case of terminal states (e.g., u_A and u_R), the set of available options at automaton state u ∈ U is

    Ω_u = { ω_{u,a} | a ∈ A },

where ω_{u,a} is the option that always selects action a ∈ A in any MDP state. Each of these options is a tuple ω_{u,a} = ⟨I_u, π_a, β_a⟩ where:
• The initiation set is defined as for non-terminal automaton states.
• The policy π_a : S → A always maps MDP states into action a ∈ A.
• The termination condition β_a indicates that the option always terminates after running the selected action: β_a(s) = 1 for all s ∈ S.
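The following minimal sketch (illustrative; the automaton encoding, names and the toy Coffee-style edges are assumptions rather than the authors' code) shows how an HRL option's initiation set and termination condition follow directly from the automaton's incoming and outgoing formulas.

    # A formula is a DNF disjunct represented as a frozenset of literals such as
    # "coffee" or "~office"; an observation is the set of observables currently true.
    def satisfies(obs, disjunct):
        """True iff the observation satisfies every literal of the disjunct."""
        return all((lit[1:] not in obs) if lit.startswith("~") else (lit in obs)
                   for lit in disjunct)

    # Edges of a toy Coffee-like automaton: (source, target) -> list of disjuncts.
    edges = {
        ("u0", "u1"): [frozenset({"coffee", "~office"})],
        ("u0", "uA"): [frozenset({"coffee", "office"})],
        ("u1", "uA"): [frozenset({"office"})],
    }

    def in_initiation_set(u, obs):
        """Any incoming formula of u is satisfied (u0 can start anywhere)."""
        if u == "u0":
            return True
        return any(satisfies(obs, phi)
                   for (src, dst), phis in edges.items() if dst == u
                   for phi in phis)

    def option_terminates(u, obs, is_terminal_mdp_state):
        """Option at u ends on a terminal MDP state or any satisfied outgoing formula."""
        return is_terminal_mdp_state or any(
            satisfies(obs, phi)
            for (src, dst), phis in edges.items() if src == u
            for phi in phis)

    print(option_terminates("u0", {"coffee"}, False))   # True: the edge u0 -> u1 fires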
Policy Learning. In this approach, decisions are taken at two levels and, thus, two types of policies are learned: (1) policies over options (i.e., a metacontroller) and (2) option policies.

A policy over options Π_u : S → Δ(Ω_u) in state u ∈ U maps a given MDP state into a probability distribution over the options available at u. These policies are learned using SMDP Q-learning (Bradtke & Duff, 1994) with ε-greedy exploration. Given a state s ∈ S and an option ω ∈ Ω_u, the SMDP Q-learning update rule is the following:

    Q_u(s, ω) = Q_u(s, ω) + α ( r + γ^k max_{ω′ ∈ Ω_{u′}} Q_{u′}(s′, ω′) − Q_u(s, ω) ),

where k is the number of steps between s and s′, and r is the cumulative discounted reward over this time. Note that the discounted term depends on the next automaton state u′ and becomes 0 when s′ ∈ S_T (i.e., is a terminal state) since there is no applicable option thereafter.

An option policy π_φ : S → Δ(A) aiming to satisfy a formula φ is not learned using the rewards from the MDP. Instead, we use a pseudoreward function r_φ : S × A × S → ℝ, which is defined as follows:

    r_φ(s, a, s′) = r_success if L(s′) ⊨ φ;  r_deadend if s′ ∈ S_T \ S_G;  r_step otherwise,

where r_success > 0 and r_deadend ≤ r_step ≤ 0. (In Section 8 we instantiate r_φ in two different ways and show the impact they have on learning.) These policies are learned using Q-learning (Watkins, 1989) with ε-greedy exploration. The update rule for a given formula φ is:

    Q_φ(s, a) = Q_φ(s, a) + α ( r_φ(s, a, s′) + γ max_{a′} Q_φ(s′, a′) − Q_φ(s, a) ),   (2)

where the second term of the target becomes 0 when either s′ is a terminal MDP state (i.e., s′ ∈ S_T) or the next observation satisfies φ (i.e., L(s′) ⊨ φ).

Intra-option learning is easily applicable to update an option policy π_{φ′} while another policy π_φ is being followed. That is, given an option policy π_φ, an experience (s, a, s′) generated by this policy is used to update the Q-value of (s, a) of another formula φ′ through Equation 2.

Optimality. Since the sets of available options Ω_u in non-terminal states u ∈ U do not include any primitive actions as one-step options, optimal policies over the set of available options are in general suboptimal policies of the core MDP (Barto & Mahadevan, 2003).
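As a concrete illustration of the pseudoreward and the tabular update in Equation 2, the sketch below uses HRL_G-style constants from Section 8 (r_success = 1 and r_deadend = −N with N = 250); the exact r_step magnitude and all names are assumptions for the example, not the authors' implementation.

    from collections import defaultdict

    R_SUCCESS, R_DEADEND, R_STEP = 1.0, -250.0, -0.01   # assumed instantiation
    ALPHA, GAMMA = 0.1, 0.99

    def pseudoreward(next_obs, satisfies_phi, is_deadend):
        if satisfies_phi(next_obs):
            return R_SUCCESS            # L(s') |= phi
        if is_deadend:
            return R_DEADEND            # s' in S_T \ S_G
        return R_STEP                   # any other step

    def q_update(Q_phi, s, a, s_next, actions, r_phi, done_or_satisfied):
        # The bootstrap term is 0 when s' is terminal or the formula is satisfied.
        bootstrap = 0.0 if done_or_satisfied else max(Q_phi[(s_next, ap)] for ap in actions)
        Q_phi[(s, a)] += ALPHA * (r_phi + GAMMA * bootstrap - Q_phi[(s, a)])

    Q = defaultdict(float)              # in practice, one table per formula phi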
Instead of learning one option for each outgoing edge and a metacontroller for each automaton state, we can learn a single option for each automaton state such that its policy performs the action that appears globally best. Therefore, unlike the previous approach, there is no explicit decision hierarchy and the agent does not commit to satisfying a particular formula at every step. In the tabular case, this approach is equivalent to learning a Q-function in the augmented state space S × U.

The method we describe here is better known as QRM (Q-learning for Reward Machines) (Toro Icarte et al., 2018). QRM was created to exploit the structure of Reward Machines (RMs), a family of automata similar to our subgoal automata. The main difference is that each transition in an RM is not only labeled by a propositional formula over a set of observables, but also by a reward function. An in-depth comparison is made in Section 9.1. We explain QRM in terms of options for a better comparison with the method we presented in the previous section.

Option Modeling. Given a subgoal automaton A = ⟨U, O, δ_ϕ, u_0, u_A, u_R⟩, each state u ∈ U encapsulates an option ω_u = ⟨I_u, π_u, β_u⟩ where:

• The initiation set I_u and the termination condition β_u are defined as in Section 7.1.1 for non-terminal automaton states.

• The policy π_u : S → Δ(A) selects the action that appears globally best at a given state (i.e., the action that leads to the fastest achievement of the task's goal). In other words, the policy does not attempt to satisfy a particular formula: it will eventually satisfy the formula that appears to be the best to reach the task's goal. Therefore, the agent may act towards reaching different formulas for different regions of the state space.

Policy Learning. Given a state s ∈ S and an action a ∈ A, the policy of an option ω_u is learned through Q-learning updates of the form:

    Q_u(s, a) = Q_u(s, a) + α ( r(u, u′) + γ max_{a′} Q_{u′}(s′, a′) − Q_u(s, a) ),   (3)

where the discounted term depends on the next automaton state u′ and becomes 0 when s′ ∈ S_T (i.e., is a terminal state) since there is no applicable action thereafter. Crucially, the reward r used in the update does not come from the MDP. As said before, QRM is originally applied on automata whose transitions are also labeled by reward functions, and the reward r is obtained by evaluating those functions. Since subgoal automata are not labeled with reward functions, we assume that the reward function r : S × A × S → ℝ of the underlying MDP is as follows:

    r(s, a, s′) = 1 if s′ ∈ S_G, and 0 otherwise.

Then, given the current automaton state u ∈ U and the next automaton state u′ ∈ U, the reward r in Equation 3 is always 0 except when we transition to the accepting state:

    r(u, u′) = 1 if u ≠ u_A and u′ = u_A, and 0 otherwise.

QRM performs the update in Equation 3 for all the options given a single (s, a, s′) experience. That is, given the option ω_u of automaton state u ∈ U, the next automaton state u′ ∈ U used for bootstrapping is determined by evaluating the observation L(s′) of the next MDP state in u. Note this is a form of intra-option learning: we update the policies of all options from the experience generated by a single option's policy.

Optimality. Toro Icarte et al. (2018) proved that in the tabular case QRM is guaranteed to converge to an optimal policy in the limit.
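The sketch below illustrates (with assumed names; it is not the authors' implementation) how a single (s, a, s′) sample drives the Equation 3 update for every automaton state at once, which is the intra-option behavior just described.

    from collections import defaultdict

    ALPHA, GAMMA = 0.1, 0.99

    def qrm_step(Q, states, delta, obs_next, s, a, s_next, actions,
                 accepting, is_terminal_mdp_state):
        for u in states:                               # update all options at once
            u_next = delta(u, obs_next)                # where u would go under L(s')
            r = 1.0 if (u != accepting and u_next == accepting) else 0.0
            bootstrap = 0.0 if is_terminal_mdp_state else \
                max(Q[u_next][(s_next, ap)] for ap in actions)
            Q[u][(s, a)] += ALPHA * (r + GAMMA * bootstrap - Q[u][(s, a)])

    # Q maps each automaton state to its own tabular Q-function.
    Q = defaultdict(lambda: defaultdict(float))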
Reward Shaping. A subgoal automaton does not only provide the subgoals of a given task, but also gives an intuition of how far the agent is from achieving the task goal. Intuitively, the closer the agent is to the accepting state, the closer it is to the task goal. Therefore, we can provide the agent with an extra positive reward signal when it gets closer to the accepting state. The idea of giving additional rewards to the agent to guide its behavior is known as reward shaping. Ng et al. (1999) proposed a function that provides the agent with additional reward while guaranteeing that optimal policies remain unchanged:

    F(s, a, s′) = γ Φ(s′) − Φ(s),

where γ is the MDP's discount factor and Φ : S → ℝ is a real-valued function. The automaton structure can be exploited by defining F : (U \ {u_A, u_R}) × U → ℝ in terms of the automaton states instead of the MDP states (Camacho et al., 2019; Furelos-Blanco et al., 2020):

    F(u, u′) = γ Φ(u′) − Φ(u),

where Φ : U → ℝ. Consequently, Equation 3 is rewritten as:

    Q_u(s, a) = Q_u(s, a) + α ( r(u, u′) + F(u, u′) + γ max_{a′} Q_{u′}(s′, a′) − Q_u(s, a) ).

Since we want the value of F(u, u′) to be positive when the agent gets closer to the accepting state u_A, we define Φ as

    Φ(u) = |U| − d(u, u_A),

where |U| is the number of automaton states, and d(u, u_A) is a measure of distance between u and u_A. If u_A is unreachable from u, then d(u, u_A) = ∞. (In practice, we use a sufficiently big number to represent ∞.) Note that |U| acts as an upper bound of the maximum length (i.e., number of directed edges) of an acyclic path between u and u_A. The distance between u and u_A can be given either by the length of the shortest path between them (d_min) or by the length of the longest acyclic path between them (d_max).

Example 7.1. The figures below show the additional rewards generated by the reward shaping function using d_min (left) and d_max (right) with γ = 0.99 in the Coffee task's automaton (see Figure 2, p10). The numbers inside the states correspond to the values returned by Φ, whereas the numbers on the edges are the values returned by F. Note that |U| = 4 and that the only difference in the Φ values occurs in the initial state.

(Figures for Example 7.1 omitted: the Coffee automaton annotated with Φ values on the states and F values such as +0.96 on the edges, under d_min and d_max.)
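A minimal sketch of the automaton-based shaping terms follows, with assumed names; the printed value reproduces the +0.96 edge reward of Example 7.1 for |U| = 4 and γ = 0.99 under d_min.

    GAMMA = 0.99
    INF = float("inf")

    def phi(u, num_states, dist_to_accepting):
        d = dist_to_accepting.get(u, INF)        # infinity if u_A is unreachable
        return num_states - d

    def shaping(u, u_next, num_states, dist_to_accepting):
        return GAMMA * phi(u_next, num_states, dist_to_accepting) \
               - phi(u, num_states, dist_to_accepting)

    # Coffee-style automaton with |U| = 4 and shortest-path distances (d_min).
    d_min = {"u0": 2, "u1": 1, "uA": 0}
    print(shaping("u1", "uA", 4, d_min))         # 0.99 * 4 - 3, i.e. approximately 0.96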
In this section we describe how the ISA algorithm interleaves the reinforcement and automaton learning processes. Given an episodic MDP M = ⟨S, A, p, r, γ, S_T, S_G⟩, a set of observables O, a labeling function L, and a maximum number of edges between two states κ, ISA aims to learn and iteratively refine a subgoal automaton A = ⟨U, O, δ_ϕ, u_0, u_A, u_R⟩ from the experience of a reinforcement learning agent. The subgoal automaton A is the automaton with the smallest number of states that has at most κ edges from one state to another and correctly classifies the traces observed by the agent.

Algorithm 1 contains the pseudocode describing how the RL and automaton learning processes are interleaved. The pseudocode applies to both of the methods described in Section 7.1 and consists of two functions related to automata learning:

• The IsCounterexample function (lines 24-25) checks whether the current automaton state u correctly recognizes the current MDP state s. It returns true in the following cases:
  – s is a goal state and u is not the accepting state (s ∈ S_G ∧ u ≠ u_A), or
  – s is a dead-end state and u is not the rejecting state (s ∈ S_T \ S_G ∧ u ≠ u_R), or
  – s is not a terminal state and u is either the accepting or the rejecting state (s ∉ S_T ∧ u ∈ {u_A, u_R}).

• The OnCounterexampleFound function (lines 26-33) determines what to do when a trace T_{L,O} is not correctly recognized by the current automaton:
  (a) Add T_{L,O} to the corresponding set of traces (line 27):
    – to the set of goal traces T^G_{L,O} if s is a goal state (s ∈ S_G), or
    – to the set of dead-end traces T^D_{L,O} if s is a dead-end state (s ∈ S_T \ S_G), or
    – to the set of incomplete traces T^I_{L,O} if s is not a terminal state (s ∉ S_T).
  (b) Run the automaton learner (lines 28-32). If the automaton learning task is unsatisfiable, it means that the hypothesis space does not include the automaton we are looking for. Therefore, we add a new state to U. (Non-special states, i.e., states other than u_0, u_A and u_R, are labeled from 1 upwards: u_1, u_2, . . . .) We adopt this iterative deepening strategy to find the subgoal automaton with the lowest number of states with at most κ edges from one state to another.
  (c) When a new automaton is learned, the Q-functions are reset (line 33). We later explain what resetting the Q-functions means for the two RL algorithms we use.

Algorithm 1: ISA Algorithm
Input: An initial state (u_0), an accepting state (u_A), a rejecting state (u_R), a set of observables O, a labeling function L, and the maximum number of edges between two states (κ).
 1:  U ← {u_0, u_A, u_R}
 2:  A ← ⟨U, O, δ_ϕ, u_0, u_A, u_R⟩
 3:  𝒯_{L,O} ← {}                               ▷ Set of counterexamples
 4:  InitQFunctions(A)
 5:  for l = 0 to num_episodes do
 6:      s ← Env.InitialState()
 7:      u ← δ_ϕ(u_0, L(s))
 8:      T_{L,O} ← ⟨L(s)⟩                        ▷ Initialize trace
 9:      if IsCounterexample(s, u) then
10:          OnCounterexampleFound(T_{L,O})
11:          u ← δ_ϕ(u_0, L(s))
12:      t ← 0
13:      while t < max_episode_length ∧ s ∉ S_T do   ▷ Run episode
14:          a, s′ ← Env.Step(s, u)
15:          u′ ← δ_ϕ(u, L(s′))
16:          UpdateTrace(T_{L,O}, L(s′))
17:          if IsCounterexample(s′, u′) then
18:              OnCounterexampleFound(T_{L,O})
19:              break
20:          else
21:              UpdateQFunctions(s, a, s′, L(s′))
22:          s ← s′; u ← u′
23:          t ← t + 1
24:  function IsCounterexample(s, u)
25:      return (s ∈ S_G ∧ u ≠ u_A) ∨ (s ∈ S_T \ S_G ∧ u ≠ u_R) ∨ (s ∉ S_T ∧ u ∈ {u_A, u_R})
26:  function OnCounterexampleFound(T_{L,O})
27:      𝒯_{L,O} ← 𝒯_{L,O} ∪ {T_{L,O}}
28:      is_unsat ← true
29:      while is_unsat do
30:          A, is_unsat ← LearnAutomaton(U, u_0, u_A, u_R, 𝒯_{L,O}, κ)
31:          if is_unsat then
32:              U ← U ∪ {u_{|U|−2}}
33:      ResetQFunctions(A)

We now describe the main function of the algorithm:
1. Initially, the set of states is formed by the initial state u_0, the accepting state u_A and the rejecting state u_R (line 1). The automaton is initialized such that it does not accept nor reject anything; that is, there are no edges between the states in U (line 2). The set of counterexample traces and the Q-functions are also initialized (lines 3 and 4).
2. When an episode starts, the current automaton state u is u_0. One transition is then applied depending on the agent's initial observation L(s) (lines 6-7). For instance, if the agent initially observes the coffee in OfficeWorld's Coffee task, then u must be the state where the agent has already observed the coffee (see Figure 2, p10). The episode trace T_{L,O} is initialized with the initial observation L(s) (line 8). If a counterexample is detected at the beginning of the episode (line 9), a new automaton is learned (line 10), the automaton state is reset (line 11) and the episode continues.
3. At each episode step, we select an action a in state s (line 14). Based on the new observation L(s′), we get the next automaton state u′ (line 15) and update the observation trace T_{L,O} (line 16). If a counterexample trace T_{L,O} is found (line 17), a new automaton is learned (line 18) and the episode ends (line 19). Otherwise, the Q-functions are updated (line 21) and the episode continues.
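For illustration, the counterexample test and the iterative deepening loop of Algorithm 1 can be written compactly as follows; the function names and the learner interface are assumptions (the actual learner is ILASP), so this is a sketch rather than the authors' implementation.

    def is_counterexample(s_is_goal, s_is_deadend, s_is_terminal, u, u_acc, u_rej):
        return ((s_is_goal and u != u_acc) or
                (s_is_deadend and u != u_rej) or
                (not s_is_terminal and u in (u_acc, u_rej)))

    def on_counterexample(trace, counterexamples, states, learn_automaton, kappa):
        counterexamples.append(trace)
        while True:
            automaton, unsat = learn_automaton(states, counterexamples, kappa)
            if not unsat:
                return automaton                    # minimal automaton covering all traces
            states.append(f"u{len(states) - 2}")    # add a fresh non-special state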
Theorem 7.1 shows that if the target automaton is in the hypothesis space, there will only be a finite number of learning steps in the algorithm before it converges to such an automaton (or an equivalent one).

Theorem 7.1. Given a target finite automaton A*, there is no infinite sequence σ of automaton-counterexample pairs ⟨A_i, e_i⟩ such that for all i: (1) A_i covers all examples e_1, . . . , e_{i−1}, (2) A_i does not cover e_i, and (3) A_i is in the finite hypothesis space S_M.

Proof. By contradiction. Assume that σ is infinite. Given that S_M is finite, the number of possible automata is finite. Hence, some automaton A must appear in σ at least twice, say as A_i = A_j with i < j. By definition, A_i does not cover e_i and A_j covers e_i. This is a contradiction.

In the following paragraphs we describe important aspects regarding the implementation of the algorithm. Firstly, we describe what has to be done differently when compressed observation traces are used to learn the automaton. Secondly, we describe how the Q-functions are managed for the two different algorithms we consider (HRL and QRM). Finally, we introduce two optimizations to make automata learning more efficient.

Compressed Observation Traces. The use of compressed observation traces affects the way in which an automaton is traversed by the RL agent. Remember that a compressed observation trace results from removing empty observations and contiguous duplicated observations from an observation trace (see Definition 3.3). Therefore, the transition function δ_ϕ is only queried when the current observation (1) is not empty and (2) is different from the last observation. If these two conditions do not hold, the agent remains in the same automaton state.

Management of Q-functions. A critical aspect of the algorithm is how Q-functions are initialized, updated and reset when a new automaton is learned. In the tabular case, all the Q-values for the different state-action and/or state-option pairs are initialized to 0 for both HRL and QRM. The Q-functions are updated using the rules described in Sections 7.1.1 and 7.1.2, respectively. Note that in the case of HRL, Algorithm 1 omits the call to the function responsible for updating the metacontroller Q-functions, which occurs when the selected option terminates. (We have omitted it for the generality of the pseudocode. According to the termination condition defined in Section 7.1, an option ends when s′ ∈ S_T or u ≠ u′; therefore, such a check can be done at the end of each step, between lines 21 and 22.)

Ideally, when a new automaton is learned, we would like to transfer the knowledge gathered with the previous automata to the new one:

• In the case of HRL, as described in Section 7.1.1, the policies of all options that have been used throughout learning are stored in a dictionary. Therefore, these policies can be reused whenever their corresponding formulas appear in the automaton. There are two choices to make in this case:
  1. Which Q-functions to update. There are two alternatives: (i) update all the stored Q-functions, or (ii) update only the Q-functions of the formulas appearing in the automaton. While (i) is more costly, the fact that all Q-functions are updated makes them more reliable if they are to be used in the future.
Experimentally, we use (i) for the tabular case and (ii) in the function approximation case because of the running time.
  2. How to reuse Q-functions: we copy those defined for similar formulas. Given that options correspond to propositional formulas, we can compare how similar two formulas are and, thus, initialize a new Q-function from an existing one. We use the number of matching positive literals between two formulas (a sketch of this heuristic is given after the optimizations below); in case of a draw, we choose the formula whose Q-function has been updated the most. Experimentally, the number of matching positive literals works better than the number of all matching literals (positive and negative). Taking into account negative literals can lead to negative transfer because they usually emerge from the determinism constraints that make two formulas mutually exclusive; therefore, they do not usually represent the "important" part of the subgoal formula in the tasks we use later for evaluation.

  Unlike option policies, it is more difficult to determine when to transfer the metacontroller (i.e., a policy over options) from one automaton to another. In this case, we create a new Q-function and do not reuse any previous knowledge.

• In the case of QRM, the policy at each automaton state selects the action that appears best towards satisfying the task's final goal. Therefore, it can attempt to satisfy different formulas for different regions of the state space. This makes the transfer of policies between automata non-trivial. In this case we just reinitialize all the Q-functions, which causes the agent to forget everything it learned. Nevertheless, the reward shaping mechanism introduced before can help to alleviate this problem.

Optimizations. In practice we use two additional optimizations to make the automaton learning phase more efficient:
1. The agent does not learn an automaton for the first time until a goal trace is found (i.e., the goal is achieved). Experimentally, we have observed that in tasks where dead-ends are frequent, the learner constantly finds counterexamples that merely refine the paths to the rejecting state. Starting to learn automata only when there is a goal trace in the counterexample set has proved to be a better strategy.
2. As explained in Section 5, the rejecting state u_R is not included in the set of states U if the set of dead-end traces is empty. This avoids an unnecessary increase in the number of rules in the hypothesis space, especially in tasks without dead-end states.
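As referenced above, here is a small sketch (illustrative names and data, not the authors' code) of the formula-similarity heuristic used to pick which stored Q-function initializes the Q-function of a new formula: most shared positive literals first, ties broken by how often a Q-function has been updated.

    def best_match(new_formula, stored):
        """new_formula: set of literals; stored: dict formula -> (q_table, n_updates)."""
        def score(phi):
            positives = {l for l in phi if not l.startswith("~")}
            new_pos = {l for l in new_formula if not l.startswith("~")}
            return (len(positives & new_pos), stored[phi][1])
        return max(stored, key=score, default=None)

    stored = {frozenset({"coffee", "~decoration"}): ({}, 120),
              frozenset({"office"}): ({}, 80)}
    # Tie on positive literals (1 each) is broken by update count, so the
    # coffee formula is chosen to initialize the new Q-function.
    print(best_match({"coffee", "office"}, stored))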
8. Experiments

In this section, we evaluate the effectiveness of ISA in different domains. Our analysis focuses on evaluating how the behavior of the RL agent and the task being learned affect automata learning and vice versa. Firstly, we describe how we evaluate our approach and introduce some restrictions that we can apply on the automaton learning task. Secondly, we make a thorough analysis of the impact that the reinforcement and automata learning parameters have on the overall performance of ISA using the OfficeWorld domain. Finally, we run additional experiments in the CraftWorld (Andreas et al., 2017) and WaterWorld (Toro Icarte et al., 2018) domains.

We use ILASP2 to learn the automata with a 2 hour timeout for each automaton learning task. All experiments ran on 3.40GHz Intel Core i7-6700 processors. The code is available at https://github.com/ertsiger/induction-subgoal-automata-rl .

We consider the problem of learning an automaton given a set of MDPs D = {M_1, . . . , M_{|D|}}. All MDPs in D correspond to the same task (e.g., OfficeWorld's Coffee task) and are characterized by the same set of observables O. However, they do not need to share the same state and action spaces. Algorithm 1 is applied on a set of MDPs D by running one full episode for each MDP until reaching the maximum number of episodes for all MDPs. We impose a maximum episode length N to guarantee that all episodes terminate in a reasonable amount of time, especially in MDPs where the terminal states are hard to reach. There are two reasons for which it is useful to use a set of MDPs:

1. The automaton will generalize to several MDPs. For example, in OfficeWorld the coffee and the office (o) can sometimes be in the same location. Therefore, the learned automaton should reflect these two situations: when the coffee and o are together and when they are not (see Figure 2, p10). Furthermore, using different MDPs can help to avoid overgeneralization, which is related to the fact that a minimal automaton is learned from positive examples only (see discussion in Section 10). Figure 4 shows an OfficeWorld grid that, if used alone to learn an automaton for Coffee, would produce the automaton shown alongside it. Since the agent cannot see any trace that reaches o without having seen the coffee, it does not learn that observing the coffee is important to reach the goal.

(Figure 4 omitted: an OfficeWorld grid whose traces cause automata overgeneralization for the Coffee task, together with the resulting overgeneralized automaton whose only transition from u_0 to u_A is labeled by o.)

2. The set of MDPs may contain some MDPs in which it is easier to reach the goal than in others. Therefore, automata learning can be initialized earlier. The RL agent can then immediately exploit the automata in the harder MDPs to effectively reduce the amount of exploration needed to reach the goal in them.

The sets of MDPs used in these experiments are not handcrafted, but randomly generated (e.g., placing observables randomly in a grid); the constraints we impose in the MDP generation are explained later for each of the used domains. Therefore, (1) there is no guarantee that certain observations will be seen (e.g., two observables in the same tile of a grid), and (2) the difficulty of the MDPs is not fully controlled. Given these two consequences of using randomly generated sets of MDPs, we will evaluate how different sets of MDPs affect automata learning and RL in the following sections. Note that it is easy to handcraft tasks such that the target automaton is learned. In the case of a grid-world, like OfficeWorld, we need at most one grid for each path to the accepting state (in the absence of dead-ends). However, we believe that using a set of randomly generated MDPs reduces our bias on the automaton learning process.

We assume that the tasks' subgoals cannot be characterized only by the non-occurrence of certain observables. That is, the formula labeling an edge cannot be formed only by negated observables. The following constraint discards such solutions by enforcing an observable to occur positively whenever an observable appears negatively on a given edge (remember that the pos and neg facts were introduced in Section 6):

    :- neg(X, Y, E, _), not pos(X, Y, E, _).

The automata we consider in this paper comply with this assumption, which helps to slightly simplify the automata learning phase.
In the following paragraphs, we describe restrictions that can be made on the automaton structure, the traces and the observables, and that are later evaluated. Just like the rule above, any rules we introduce below are added to the automaton learning task defined in previous sections. We also explain which RL algorithms are used in the experiments and how we report the results.

Acyclic Automata. There are tasks whose corresponding minimal automata do not contain cycles (i.e., a previously visited automaton state cannot be revisited). For instance, the three OfficeWorld tasks we have considered so far belong to this class of automata. Therefore, the search space can be made smaller by ruling out solutions containing cycles. The following set of rules enforces acyclicity. The path(X, Y) predicate indicates that there is a directed path (i.e., a sequence of directed edges) from X to Y. The first rule states that there is a path from X to Y if there is an edge from X to Y. The second rule indicates that there is a path from X to Y if there is an edge from X to an intermediate state Z which has a path to Y. Finally, the third rule rules out those answer sets where X and Y can be reached from each other through directed edges.

    path(X, Y) :- ed(X, Y, _).
    path(X, Y) :- ed(X, Z, _), path(Z, Y).
    :- path(X, Y), path(Y, X).

Trace Compression. The traces used as counterexamples in ISA depend on the agent's behavior. While the agent has not managed to reach the goal, its behavior is random. Consequently, the counterexamples that the agent provides to the automaton learner can be long and include many observables that are irrelevant to the task at hand. These two factors, as we will see later, have a negative impact on the time required to learn an automaton. Compressed observation traces, as described in Section 3, result from removing empty and contiguous equal observations from an observation trace. Given that they are shorter, an automaton learner is usually able to induce an automaton faster. Furthermore, since the information about the number of performed steps is lost and there are no empty observations, unlabeled transitions are meaningless when these traces are used. Therefore, we rule out automata with unlabeled edges using the following constraint when trace compression is enabled:

    :- ed(X, Y, E), not pos(X, Y, E, _), not neg(X, Y, E, _).

This constraint rules out any inductive solution where an edge from X to Y with index E is not labeled by a positive or a negative literal.
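For completeness, trace compression itself (Definition 3.3) is straightforward to apply on the agent's side; the sketch below (illustrative, with assumed names) drops empty observations and collapses contiguous duplicates.

    def compress(trace):
        """Remove empty observations and contiguous duplicates from a trace."""
        compressed = []
        for obs in trace:
            obs = frozenset(obs)
            if not obs:
                continue                          # skip empty observations
            if compressed and compressed[-1] == obs:
                continue                          # skip contiguous duplicates
            compressed.append(obs)
        return compressed

    # e.g. many empty steps, the coffee seen twice in a row, then the office:
    trace = [set(), set(), {"coffee"}, {"coffee"}, set(), {"office"}]
    print(compress(trace))   # [frozenset({'coffee'}), frozenset({'office'})]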
Restricted Observable Set. Another way of simplifying the traces consists in using only those observables relevant to the task at hand. For example, if the task is CoffeeMail, then the set of observables becomes O = {coffee, mail, o, ∗} instead of O = {coffee, mail, o, A, B, C, D, ∗}. This greatly simplifies the automaton learning tasks since the hypothesis space becomes smaller and ILASP does not have to discern which are the relevant observables.

Reinforcement Learning Algorithms. We use the following nomenclature for the different RL algorithms applied in the experiments:
• HRL: HRL where r_success = 1, r_deadend = 0 and r_step = 0.
• HRL_G: HRL where r_success = 1, r_deadend = −N and r_step is a small negative constant.
• QRM: QRM without reward shaping.
• QRM_min: QRM with reward shaping based on the length of the shortest path to the accepting state (d_min).
• QRM_max: QRM with reward shaping based on the length of the longest acyclic path to the accepting state (d_max).

In the case of the HRL algorithms, the option policies are updated as follows:
• In the tabular case, we update all the Q-functions of the formulas in the dictionary after every step (i.e., all formulas discovered during learning).
• In the function approximation case, we update only the Q-functions of the formulas appearing in the current automaton.

Reporting Results. We report results using tables and figures. In the tables we report the following average automata learning statistics across runs where the automaton learner has not timed out and at least one automaton has been learned:
• Total time (in seconds) used to run the automaton learner.
• Number of examples needed to learn the final automaton.
• Length of the examples used to learn the final automaton.

For the first two cases, the numbers in brackets correspond to the standard error, while in the last case we use the standard deviation since it is an average of the example lengths across all runs. We mark with an asterisk (*) cases where either no automaton has been learned or the automaton learner has timed out in between 1 and 10 runs. A dash (-) is used if the number of such cases is higher than 10.

The figures show the average reward across the MDPs in D and the number of runs (20). Each point of the learning curve represents the sum of rewards obtained by the greedy policy at a given episode. By default, the greedy policy is evaluated after every training episode for one episode. Dotted vertical lines correspond to episodes where an automaton was learned (remember that automata only start to be learned once a goal trace has been observed). When the automaton learner times out, the reward is set to 0 for the entire interaction.

OfficeWorld

In this section we make a thorough analysis of ISA using the Coffee, CoffeeMail and VisitABCD tasks from the OfficeWorld domain (Toro Icarte et al., 2018) introduced in Section 3. These tasks constitute a good test-bed since their automata are different and they are incrementally more challenging:

• The Coffee task consists of 2 subgoals and is represented by a 4 state automaton.
• The CoffeeMail task consists of 3 subgoals and is represented by a 6 state automaton.
• The VisitABCD task consists of 4 subgoals and is represented by a 6 state automaton.

The number of subgoals above is the number of directed edges of the longest acyclic path from the initial state to the accepting state. In general, the longer the sequence of subgoals is, the harder it becomes to achieve the goal. Since the automata of these tasks involve neither disjunctions nor cycles, we will exceptionally use other tasks to show that our method can learn such structures.

Table 1 shows the parameters used throughout these experiments. The parameter set at the top of the table remains unchanged for all experiments. In contrast, the parameters in the bottom part of the table can change between experiments. The parameters α, ε and γ are the same for both the metacontrollers and the options in the case of HRL.

    Learning rate (α): 0.1
    Exploration rate (ε): 0.1
    Discount factor (γ): 0.99
    Number of episodes: 10,000
    Avoid learning purely negative formulas: yes
    ---
    Number of tasks (|D|): 50
    Maximum episode length (N): 250
    Trace compression: yes
    Enforce acyclicity: yes
    Number of disjunctions (κ): 1
    Use restricted observable set: no
    RL algorithm: HRL_G

Table 1: Parameters used in the OfficeWorld experiments. The top part of the table contains those parameters that remain unchanged, while the bottom part contains those that change across experiments.

The section is organized as follows. First, we describe the constraints used in the random generation of OfficeWorld MDPs. We then analyze in different subsections how tuning the parameters in Table 1 affects the automaton and reinforcement learning processes.

MDP Generation. The OfficeWorld tasks we consider use the 9 × 12 grid shown in Figure 1 (p7). The set of observables is O = {coffee, mail, o, A, B, C, D, ∗}. The grid contains one observable of each type except for the coffee location (2 instances) and the decoration ∗ (6 instances). The agent and the observables are randomly placed in the grid using the following criteria:
• The agent cannot be initially placed on decorations ∗ or observables A-D.
• The decorations ∗ do not share a location with any other observable.
• The decorations ∗ and observables A-D cannot be placed next to each other (including diagonals) nor in locations that connect two rooms (e.g., location (1, 2)).
• Observables A-D and the office o cannot be in the same location.

Note that the coffee, the mail and o are allowed to be in the same location, and that the coffee and the mail can share a location with any observable A-D.

Number of Tasks & Number of Steps per Episode. Let D_1 = {M_{1,1}, . . . , M_{1,100}} and D_2 = {M_{2,1}, . . . , M_{2,100}} be two sets of 100 MDPs. Let D_i^j = {M_{i,1}, . . . , M_{i,j}} ⊆ D_i denote the subset of the first j MDPs from the i-th MDP set, and N be the maximum episode length (i.e., the maximum number of steps within an episode). These experiments are performed with j ∈ {10, 50, 100} and N ∈ {100, 250, 500}.

Given the set of MDPs D_1, Figure 5 shows the learning curves for the different combinations of MDP subsets and maximum episode lengths. Although all experiments were run for 10,000 episodes, the curves are only shown for some of them to clearly display the variances between different values of N. We make the following observations:

• The lowest maximum episode length (N = 100) works fine when a goal state is easy to achieve (i.e., the number of subgoals is low, like in Coffee). As the number of subgoals in the task increases, the maximum episode length needs to be increased in order to reach a task's final goal. If N is not high enough, then there is only a low chance that the agent will observe the required counterexamples to refine an automaton. Remember that if the goal is not achieved at least once, the agent will not be able to exploit the knowledge given by an automaton since it will not learn any. For instance, in the case of VisitABCD, we observe that the learning curves barely converge when N = 100 because no automaton is learned. Even when the number of MDPs is high (100), a goal trace to start automaton learning is only found in 9 out of 20 runs.

• Small sets of MDPs are sufficient to learn an automaton and policies that achieve the goal in the Coffee task. However, in the tasks involving more subgoals, a small set of MDPs (10) does not speed up convergence as much as bigger sets of MDPs (50 or 100). For example, in VisitABCD, automata are rarely learned when the number of MDPs is 10 for any maximum episode length. Increasing the number of MDPs increases the chance that easier grids are included in the set and, thus, also increases the chance of observing counterexamples to start learning automata.

• Small values of N and small numbers of MDPs often cause automata learning to occur throughout the entire interaction. In contrast, higher values of N and bigger MDP sets usually concentrate learning early in the interaction. Again, when N is small, the agent has a lower chance of observing a counterexample trace to refine the automaton since the interaction with the environment is shorter. This is especially detrimental in the case of smaller MDP sets where the goal is particularly difficult to achieve (e.g., if the observables are sparsely distributed in the grid).

(Figure 5 omitted: learning curves for different combinations of MDP sets (D_1^{10}, D_1^{50}, D_1^{100}) and maximum episode lengths (100, 250, 500).)

A high maximum episode length seems to be the best choice to make sure that a policy that achieves the goal is learned. Intuitively, however, such a choice can produce longer traces and can make automata learning more complex since (1) the chance of observing observables irrelevant to the task at hand increases (i.e., the system has to learn that they are not important) and (2) it is more difficult to figure out the order in which subgoals must be observed. Table 2 shows the average total automata learning time, while Table 3 shows the average number of examples needed to learn the last automaton in each run. The following is observed from these tables:

• The running times generally increase with the maximum episode length. Moreover, Table 4 contains the average example length for the D_1 MDP set, which shows that the longer the episode is, the longer the counterexamples become.

• The running times increase when the set of MDPs becomes bigger, specifically when changing from D_1^{10} to D_1^{50}. This is due to the presence of observations that occur within the latter and not in the former. For instance, the coffee and o only occur together in D_1^{50} and D_1^{100} but not in D_1^{10}, so additional time is spent to learn the transition labeled by the conjunction of the coffee and o for the Coffee and CoffeeMail tasks.

• The running time increases with the number of subgoals. An automaton for the VisitABCD task takes (on average) more time than one for the CoffeeMail task even though they are both characterized by automata with the same number of states. Table 4 shows that the example length is longer for the tasks having more subgoals, which can definitely have an effect on the time needed to learn the automaton, as explained at the beginning of the section.

• The number of examples remains similar across different sets of MDPs. There is only a noticeable difference between D_1^{10} and D_1^{50} because the latter includes observations that do not happen in the former, as explained before. Note that such changes do not occur in VisitABCD because there is a single path to the accepting state.

• The number of examples increases with the number of subgoals that characterize the task. Table 5 provides the decomposition of the number of examples into goal, dead-end and incomplete traces for a specific setting. The number of goal examples is approximately the same as the number of paths to the accepting state. For instance, in VisitABCD there is only one such path, so the number of goal examples is approximately 1. (The number of goal traces can sometimes be higher than 1 for VisitABCD if the first used goal example is complex, i.e., longer than needed or with many unnecessary symbols, thus making the subgoals unclear. In such cases a simpler goal trace might be found as a counterexample.) On the other hand, the observables that characterize the CoffeeMail automaton can appear jointly or not; consequently, there are more paths to the accepting state and, thus, the required number of goal examples increases. Furthermore, while there is a relationship between the number of goal examples and the number of paths to the accepting state, we do not observe such a relationship between the number of dead-end examples and the number of paths to the rejecting state. The number of dead-end and incomplete examples is higher than that of goal examples; thus, we hypothesize that these two kinds of examples are mainly used to refine the automaton given the set of goal examples.

(Table 2 omitted: total automata learning time in seconds for different combinations of MDP sets (D_1^{10}, D_1^{50}, D_1^{100}) and maximum episode lengths (100, 250, 500).)
(Table 3 omitted: number of examples needed to learn the last automaton for different combinations of MDP sets (D_1^{10}, D_1^{50}, D_1^{100}) and maximum episode lengths (100, 250, 500).)
(Table 4 omitted: length of the goal (G), dead-end (D) and incomplete (I) examples used to learn the last automaton in the D_1 setting.)
(Table 5 omitted: number of goal (G), dead-end (D) and incomplete (I) examples for each task in the N = 250 setting.)

The results for the set of MDPs D_2 are similar to the ones described above. We refer the reader to Appendix C for a quantitative comparison. In qualitative terms, the main changes we observe occur in the VisitABCD task. While N = 100 was usually insufficient to learn an automaton in D_1 across runs, it is enough in D_2. By comparing the VisitABCD curves, it is clear that D_2 consists of MDPs requiring fewer steps than those in D_1. Furthermore, in the case of CoffeeMail, D_2 has more types of joint events than D_1; thus, more examples are needed in that case and the final automaton is slightly different. This clearly shows that the set of MDPs has an impact on automata learning.

Trace Compression. Table 6 shows the impact of compressed observation traces on automata learning. We omit the learning curves since the learned automata (if any) are similar for both types of traces and exhibit close performance. All runs using uncompressed traces finished on time for Coffee and CoffeeMail; in contrast, all such runs timed out for VisitABCD. Crucially, compressed traces allow learning an automaton for VisitABCD on time. Besides, an automaton is also learned orders of magnitude faster in CoffeeMail using trace compression. Note that a compressed trace is considerably shorter than an uncompressed one even in the simplest task (Coffee). As pointed out at the beginning of the section, compressed traces are applicable to the OfficeWorld tasks because: (1) empty observations are meaningless (i.e., there is no task where the number of steps is important), and (2) seeing an observation for more than one step is equivalent to seeing it once.

(Table 6 omitted: automata learning statistics when trace compression is on (C) and off (U).)

Restricted Observable Set. Table 7 shows how using the restricted observable set Ô for each OfficeWorld task compares to using the full observable set O. The learning curves are very similar for both cases, so we do not report them. The restricted observable set Ô causes a sensible decrease in the automata learning time, especially for the harder tasks. Similarly, fewer examples are needed to learn a helpful automaton, and the average example length is also reduced. Intuitively, using only the observables that describe the subgoals makes the learning task easier: the hypothesis space is smaller and no examples are needed to discard irrelevant observables.

(Table 7 omitted: automata learning statistics when the set of observables is unrestricted (O) or restricted to a particular task (Ô).)

Cyclicity. Table 8 shows the effect that enforcing the automaton to be acyclic has on automata learning. The tasks that we have considered so far do not require cycles. Therefore, we introduce two other tasks, CoffeeDrop and CoffeeMailDrop, whose automata involve cycles, to show that our approach can learn such structures. These tasks modify the OfficeWorld dynamics such that when the agent steps on a decoration (∗), the coffee is dropped if the agent holds it. In such a case, the agent must go back to the coffee location. Note that in these tasks there are no dead-end states and, thus, the learned automata do not have rejecting states.

(Table 8 omitted: comparison of different automata learning statistics for the cases where automata must be acyclic and where automata can have cycles.)

The results show a considerable increase in the running time for CoffeeMail and VisitABCD when the automaton is allowed to have cycles. The fact that the hypothesis space is bigger does not only cause an increase in the running times, but also in the number of examples, which must rule out solutions with cycles. In contrast, the average example length remains approximately the same.

Regarding the performance in the additional tasks, the results in the cyclic setting for CoffeeDrop and CoffeeMailDrop are similar to the ones for Coffee and CoffeeMail, respectively. Note that the average running time is lower for CoffeeMailDrop than for CoffeeMail since the former has fewer states than the latter (it does not have a rejecting state). Besides, the length of the examples is longer since there are no dead-end states to actively avoid. In contrast, in the acyclic setting an automaton is only found in 10/20 and 2/20 runs for CoffeeDrop and CoffeeMailDrop, respectively. Note that the number of automaton states depends on the trace with the maximum number of times the coffee has been picked up and dropped. Clearly, these tasks are not well-suited to be expressed by an acyclic automaton.

Maximum Number of Edges between States. Table 9 shows the effect that increasing the maximum number of edges between two states from κ = 1 to κ = 2 has on automata learning. Since the OfficeWorld tasks we have considered so far have at most one edge from a state to another, we add a new task that can only be learned with κ > 1. The CoffeeOrMail task consists in going to the coffee or the mail location (it does not matter which) and then going to the office while avoiding the decorations.

(Table 9 omitted: automata learning statistics for different maximum numbers of edges from one state to another (κ = 1 and κ = 2) in the Coffee, CoffeeMail, VisitABCD and CoffeeOrMail tasks.)

The case of VisitABCD is the most notable one: only one run has not timed out with κ = 2. The number of successful runs has also decreased in CoffeeMail from 20 to 15; besides, the running time is orders of magnitude higher. While the increase in the hypothesis space size has caused the running time to vastly increase, the average number of examples and the average example lengths remain similar with respect to κ = 1. Regarding the CoffeeOrMail task, note that it is not much harder than the Coffee task: its automaton consists of 4 and 5 states for κ = 2 and κ = 1, respectively. Remember that the Coffee task consists of 4 states (regardless of κ). Despite the increase in the number of states from κ = 1 to κ = 2, the running time and the other statistics remain almost the same. However, while κ = 1 is effective in this case (although it cannot return a minimal automaton), it would not be enough to learn an automaton for any task requiring an edge with a disjunction to the accepting state (e.g., "observe the coffee or the mail").

Symmetry Breaking. Table 10 shows the effect that symmetry breaking constraints have on the time required to learn an automaton. The average number of examples and example length barely change when symmetry breaking is used, so we do not report those results. The symmetry breaking constraints speed up automata learning by an order of magnitude in CoffeeMail (acyclic and cyclic) and VisitABCD (acyclic). Furthermore, while two runs timed out in the case where automata can contain cycles and no symmetry breaking is used (one for CoffeeMail and one for VisitABCD), no run has timed out when symmetry breaking is used.

(Table 10 omitted: automata learning times with (SB) and without (No SB) symmetry breaking in the acyclic and cyclic settings for the Coffee, CoffeeMail and VisitABCD tasks.)

Reinforcement Learning Algorithms. We compare the learning curves produced by the two RL algorithms we have previously presented (HRL and QRM).
We show howlearning the automata in an interleaved manner affects RL and how introducing guidance(e.g., reward shaping in QRM) influences when automata are learned.Figure 6 shows the average learning curves for the settings above in the OfficeWorld tasks. We observe the following: • The algorithms using auxiliary guidance (HRL G , QRM min and QRM max ) convergefaster than their respective basic versions (HRL and QRM). The use of these auxiliaryreward signals helps to explore the state space more effectively, which helps to observethe needed counterexample traces early. Consequently, automata learning is much lessfrequent at later stages of the RL process, which is convenient to avoid resets in theQ-functions. • QRM max converges faster than QRM min except in VisitABCD where they performexactly the same because there is a single path to the accepting state. As previouslyshown in Example 7.1, QRM max provides a positive reward signal for any path thatmakes the agent closer to the accepting state. In contrast, QRM min only providesa positive signal for the shortest path(s). If the shortest path is not available in acertain grid (e.g., (cid:75) and o occurring together), QRM min gives a negative reward forchoosing the only available path to the accepting state. Consequently, convergence isnot as fast as in QRM max . • HRL converges faster than QRM across the different tasks. In the absence of rewardshaping, QRM needs to satisfy the formula on an edge to the accepting state to startpropagating positive reward through the different states. On the other hand, HRLcan independently update the Q-functions of each of the formulas in the automaton.Figure 7 shows the impact of automata learning on the HRL and QRM learning curvesof the Coffee task in a single run . Remember that an HRL agent only forgets what itlearned at the metacontroller level and keeps the Q-functions of the formulas unchanged.Besides, it reuses the Q-functions between similar formulas. In contrast, QRM forgetseverything it has learned. The plot illustrates this behavior: while HRL quickly recovers,QRM requires a few more episodes to match HRL again.Figure 8 shows how the learning curves with interleaved automata learning (ISA-HRL,ISA-QRM) compare to those obtained with handcrafted automata (HRL, QRM). The curvesfor the settings involving automata learning perform closely to the ones with handcraftedautomata. Naturally, sometimes the convergence is slower due to the fact that a properautomaton cannot be exploited from episode 0. This is especially noticeable in VisitABCD ,where a stable automaton requires several relearning steps to be found. CraftWorld The CraftWorld domain (Andreas et al., 2017) consists of a 39 × 39 grid without walls.The grid contains raw materials (wood, grass, iron) and tools/workstations (toolshed, work-bench, factory, bridge, axe), which constitute the set of observables O . There are 5 labeled 18. Note that abrupt changes in the learning curves do not occur in other plots because they are averagedacross 20 runs. Hence, changes are smoother and not fully visible. nduction and Exploitation of Subgoal Automata for RL . . . . . . A v e r ag e r e w a r d Coffee . . . . . . A v e r ag e r e w a r d CoffeeMail . . . . . . A v e r ag e r e w a r d VisitABCD HRL HRL G QRM QRM min QRM max Figure 6: Learning curves for different RL algorithms in the OfficeWorld tasks usinginterleaved automata learning. . . . . . . 
Figure 7: Example of the impact that interleaved automata learning has on the learning curves of the Coffee task. An automaton is learned around episode 300. While HRL quickly recovers (it only has to relearn the policies over options), QRM needs some more episodes because it forgets everything.

Figure 8: Learning curves for different RL algorithms in the OfficeWorld tasks when interleaved automata learning is off (HRL, QRM) and on (ISA-HRL, ISA-QRM).

CraftWorld

The CraftWorld domain (Andreas et al., 2017) consists of a 39 × 39 grid without walls. The grid contains raw materials (wood, grass, iron) and tools/workstations (toolshed, workbench, factory, bridge, axe), which constitute the set of observables O. There are 5 labeled locations for each material, and 2 labeled locations for each tool/workstation. Like in OfficeWorld, the agent moves in the four cardinal directions and remains in the same location if it tries to cross the grid's limits. The agent sees the observables when it is on their respective locations. The grids are randomly generated such that all items must be in different locations (i.e., the observations consist of one observable at most).

The tasks in this domain consist of observing a specific sequence of materials and tools/workstations. We use the set of tasks in (Toro Icarte et al., 2018) for evaluation:

1. MakePlank: wood, toolshed.
2. MakeStick: wood, workbench.
3. MakeCloth: grass, factory.
4. MakeRope: grass, toolshed.
5. MakeShears: iron, wood, workbench (the iron and the wood can be observed in any order).
6. MakeBridge: iron, wood, factory (the iron and the wood can be observed in any order).
7. GetGold: iron, wood, factory, bridge (the iron and the wood can be observed in any order).
8. MakeBed: wood, toolshed, grass, workbench (the grass can be observed anytime before the workbench).
9. MakeAxe: wood, workbench, iron, toolshed (the iron can be observed anytime before the toolshed).
10. GetGem: wood, workbench, iron, toolshed, axe (the iron can be observed anytime before the toolshed).

Tasks 1-4 have 2 subgoals and are represented by 3-state automata (a small sketch of such a chain automaton is given below). Tasks 5-6 have 3 subgoals and are represented by 5-state automata. Task 7 has 4 subgoals and is represented by a 6-state automaton. Tasks 8-9 have 4 subgoals and are represented by 7-state automata. Task 10 has 5 subgoals and is represented by an 8-state automaton. The agent gets a reward of 1 upon the goal's achievement and 0 otherwise. Unlike OfficeWorld, this domain has no dead-end states, so the set of dead-end examples is always empty.

Table 11 lists the parameters used in these experiments. The only difference with respect to the default parameters used in OfficeWorld is the number of tasks |D|, which is 100 in this case. Experimentally, we observed that using 100 instead of 50 was a better choice for tasks 8-10 which, as we will explain later, occasionally time out.

Table 12 shows the automaton learning statistics for the presented CraftWorld tasks using HRL_G.

Table 12: Automata learning statistics for the CraftWorld tasks using HRL_G.
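As an illustration of how the simplest of these tasks map onto automata, the sketch below builds the 3-state chain automaton of a strictly ordered 2-subgoal task such as MakePlank and checks whether a trace of observations reaches the accepting state. The helper names and the tuple-based encoding are hypothetical; tasks whose subgoals can be observed in several orders (e.g., MakeShears) need additional states and are not covered by this chain construction.

```python
def chain_automaton(subgoals):
    """Build a chain automaton u0 -> u1 -> ... -> u_acc for a strictly ordered
    sequence of subgoals (each subgoal is a single observable)."""
    states = [f"u{i}" for i in range(len(subgoals))] + ["u_acc"]
    edges = {states[i]: (subgoals[i], states[i + 1]) for i in range(len(subgoals))}
    return states[0], "u_acc", edges

def reaches_goal(automaton, trace):
    """Advance the automaton on a trace of observations (sets of observables)."""
    state, accepting, edges = automaton
    for observation in trace:
        if state in edges and edges[state][0] in observation:
            state = edges[state][1]
    return state == accepting

make_plank = chain_automaton(["wood", "toolshed"])        # 3 states, as stated above
assert reaches_goal(make_plank, [{"grass"}, {"wood"}, set(), {"toolshed"}])
assert not reaches_goal(make_plank, [{"toolshed"}, {"wood"}])   # wrong order
```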
Note that we have divided the tasks into several groups according to the number of subgoals they have and the number of states of their corresponding automata. Figure 9 shows the learning curves for one representative of each group of tasks, with and without interleaved learning of automata.

19. The learning curves are similar between members of each group, so we report just one of them.

Table 11: Parameters used in the CraftWorld experiments.
  Learning rate (α): 0.1
  Exploration rate (ε): 0.1
  Discount factor (γ): 0.99
  Number of episodes: 10,000
  Avoid learning purely negative formulas: yes
  Number of tasks (|D|): 100
  Maximum episode length (N): 250
  Trace compression: yes
  Enforce acyclicity: yes
  Number of disjunctions (κ): 1
  Use restricted observable set: no

We observe the following:

• Like in the OfficeWorld tasks, the more subgoals and automaton states there are, the higher the automata learning statistics (running time, number of examples and example length) become. Besides, the number of goal examples still corresponds to the number of paths from the initial state to the accepting state (a small path-counting sketch is given after this list). The figure also shows that, as before, learning becomes more frequent throughout the interaction as the tasks become harder.

• The automata learning statistics are very close between groups of tasks, especially for the ones having simpler automata. As the tasks become harder, the differences between tasks in the same group become bigger (e.g., MakeBed and MakeAxe). Naturally, it is extremely unlikely that an agent observes two equivalent sets of examples for two different tasks, especially when examples become longer (as we have seen before, the more subgoals a task has, the longer the examples become). Therefore, it is normal that these differences arise for harder tasks.

• The running time increases dramatically from GetGold to MakeBed and MakeAxe. Actually, the automaton learner has timed out a few times for the latter tasks: 5 times for MakeBed and 4 for MakeAxe. Furthermore, in the case of GetGem, the hardest task, it has timed out 9 times. The number of timeouts varies between algorithms, which is probably caused by exploration. For example, standard HRL has timed out 8 times for MakeAxe and only once for MakeBed.

• The difference between the curves where interleaved automata learning is on (ISA-HRL, ISA-QRM) and off (HRL, QRM) is small for most of the tasks, like in OfficeWorld. This shows that the induced automata are useful to learn a policy that reaches the goal. Note that for the hardest tasks (MakeAxe, MakeBed and GetGem) the gap between HRL and QRM with respect to ISA-HRL and ISA-QRM is usually bigger than in other tasks. This is due to the presence of timeouts in the approaches that induce automata, as explained before.

• When an automaton is handcrafted, QRM_min performs like QRM_max because the minimum and maximum distances to the accepting state are the same in these automata. This similarity also occurs when an automaton is learned; however, they are not identical since the intermediately learned automata may cause some variance.

• The approaches not using guidance (HRL, QRM) start converging faster than the ones that use it (HRL_G, QRM_min, QRM_max). However, the latter eventually learn to reach the goal earlier in the interaction. The initially slower convergence for the approaches using guidance can be due to the fact that guidance encourages more exploration. A CraftWorld grid is bigger than an OfficeWorld grid, which causes the agent to explore the environment for longer and delays the start of convergence. However, all the knowledge acquired while exploring is later quickly exploited and the learning curves surpass those of the approaches without guidance.
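The relationship noted in the first item above between goal examples and accepting paths can be illustrated with a small path-counting routine over an acyclic automaton. The encoding and the example automaton (a MakeShears-like structure where two subgoals may be observed in either order) are assumptions made for illustration only.

```python
from functools import lru_cache

def count_accepting_paths(edges, initial, accepting):
    """Number of distinct paths from the initial to the accepting state in an
    acyclic subgoal automaton. edges: dict state -> iterable of successor states."""
    @lru_cache(maxsize=None)
    def count(state):
        if state == accepting:
            return 1
        return sum(count(s) for s in edges.get(state, ()))
    return count(initial)

# Hypothetical 5-state automaton for a 3-subgoal task where the first two subgoals
# (e.g., iron and wood) can be observed in any order, followed by a third subgoal.
edges = {
    "u0": ("u_iron", "u_wood"),
    "u_iron": ("u_both",),
    "u_wood": ("u_both",),
    "u_both": ("u_acc",),
}
print(count_accepting_paths(edges, "u0", "u_acc"))  # 2 paths -> roughly 2 goal examples
```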
Figure 9: Learning curves for different RL algorithms in the CraftWorld tasks when interleaved automata learning is off (HRL, QRM) and on (ISA-HRL, ISA-QRM). The ISA curves for MakeBed and GetGem are below the handcrafted automata curves because of the timeout runs. When a run finishes successfully, both types of curves are similar.

WaterWorld

The WaterWorld domain (Toro Icarte et al., 2018) (see Figure 10) consists of a 2D box containing 12 balls of 6 different colors (2 balls per color). Each ball moves at a constant speed in a given direction. The balls bounce only when they collide with a wall. The agent is a white ball that can change its velocity in any of the four cardinal directions. The set of observables O = {r, g, b, y, c, m} is formed by the balls' colors. The agent observes a color when it overlaps with the ball painted with it. For example, in Figure 10 the agent would observe {g} (green). Note that several balls can overlap at the same time; thus, the agent can simultaneously observe several colors (a small sketch of this labeling function is given after the task list).

Figure 10: The WaterWorld domain (Toro Icarte et al., 2018).

The tasks we consider consist of observing a sequence of colors in a specific order:

• RGB: red (r), then green (g), then blue (b). It consists of 3 subgoals and is represented by a 4-state automaton.

• RG-B: red (r) then green (g), and (independently) blue (b). Note that there are two sequences and they can be interleaved. For instance, ⟨{r}{g}{b}⟩, ⟨{b}{r}{g}⟩ and ⟨{r}{b}{g}⟩ are three possible goal traces. Note that RGB is a subcase of this task. It consists of 3 subgoals and is represented by a 6-state automaton.

• RGBC: touch red (r), then green (g), then blue (b), then cyan (c). It consists of 4 subgoals and is represented by a 5-state automaton.
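The labeling function just described (a color is observed whenever the agent overlaps the corresponding ball, possibly several at once) could be sketched roughly as follows. The ball positions, radii and the Euclidean overlap test are illustrative assumptions rather than the exact implementation used in our experiments.

```python
import math

def waterworld_observation(agent_pos, agent_radius, balls):
    """Set of observables (colors) for the current low-level state: a color is
    observed when the agent ball overlaps the corresponding colored ball.
    balls: list of (color, (x, y) position, radius)."""
    observation = set()
    ax, ay = agent_pos
    for color, (bx, by), radius in balls:
        if math.hypot(ax - bx, ay - by) <= agent_radius + radius:
            observation.add(color)
    return observation

balls = [("r", (0.1, 0.2), 0.05), ("g", (0.50, 0.52), 0.05), ("b", (0.9, 0.9), 0.05)]
print(waterworld_observation((0.5, 0.5), 0.05, balls))  # {'g'}: overlapping the green ball
```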
The agent gets a reward of 1 upon the goal's achievement and 0 otherwise. The tasks we consider have no dead-end states, like in CraftWorld. The balls start with a random position and direction at the beginning of each episode.

Unlike OfficeWorld and CraftWorld, the state space is continuous, so we cannot use tabular Q-functions. Instead, like in (Toro Icarte et al., 2018), we use a Double DQN (DDQN) (van Hasselt et al., 2016) to approximate the Q-functions in both HRL and QRM. The neural networks consist of 4 hidden layers of 64 neurons, each followed by a ReLU. A network's input is a vector containing the absolute position and velocity of the agent, and the relative positions and velocities of the other balls. Just like in a standard DQN (Mnih et al., 2015), the output contains the estimated Q-value for each action (or option in the case of a metacontroller in HRL). We train the neural networks using the Adam optimizer (Kingma & Ba, 2015) with the learning rate α given in Table 13. The target networks are updated every 100 steps. A total of 50,000 episodes are executed for each setting, with parameters ε = 0.1 and γ = 0.9 (see Table 13). The fact that the balls in WaterWorld are constantly moving makes it easier for the agent to observe all possible combinations of observables. Therefore, we do not need to use a larger MDP set to learn a general automaton.

Table 13: Parameters used in the WaterWorld experiments.
  Learning rate (α): 1 × 10⁻
  Exploration rate (ε): 0.1
  Discount factor (γ): 0.9
  Number of episodes: 50,000
  Maximum episode length (N): 150
  Replay memory size: 50,000
  Replay start size: 1,000
  Batch size: 32
  Number of tasks (|D|): 1
  Target network update frequency: 100
  Trace compression: yes
  Enforce acyclicity: yes
  Number of disjunctions (κ): 1
  Avoid learning purely negative formulas: yes
  Use restricted observable set: no

Figure 11 shows the learning curves for the WaterWorld tasks introduced before. We show how those that use interleaved automata learning (ISA-HRL, ISA-QRM) compare to those obtained with handcrafted automata (HRL, QRM). The greedy policy was evaluated every 500 episodes: the evaluation consisted of running the greedy policy for 10 different episodes and averaging the reward obtained across them. The learning curves are smoothed using a sliding window of size 1,000.
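For concreteness, the network just described (4 hidden layers of 64 ReLU units and one output per action or option) could be written as the following PyTorch sketch. The input size (agent position and velocity plus relative positions and velocities of the 12 balls) and the way the target network is synchronized are assumptions for illustration; the replay buffer and the double Q-learning update are omitted.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Q-network sketch: 4 hidden layers of 64 ReLU units, one output per action
    (or per option, for an HRL metacontroller)."""
    def __init__(self, input_size, num_outputs):
        super().__init__()
        layers, size = [], input_size
        for _ in range(4):
            layers += [nn.Linear(size, 64), nn.ReLU()]
            size = 64
        layers.append(nn.Linear(size, num_outputs))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

# Hypothetical sizes: agent position/velocity plus relative positions/velocities
# of 12 balls -> 4 + 12 * 4 = 52 inputs; 4 cardinal actions.
online, target = QNetwork(52, 4), QNetwork(52, 4)
target.load_state_dict(online.state_dict())          # target network starts as a copy
optimizer = torch.optim.Adam(online.parameters())    # learning rate as in Table 13
# ... every 100 training steps: target.load_state_dict(online.state_dict())
```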
Table 14 shows the automaton learning statistics for these tasks using HRL_G. We observe the following:

20. The automata learning statistics for the other RL algorithms are similar, so we do not report them.

Table 14: Automata learning statistics for the WaterWorld tasks using HRL_G.

• The tasks with more subgoals run the automaton learner more often. Automata learning is concentrated at the beginning of the interaction in RGB; in contrast, it is triggered at many different times in RGBC.

• Even though the target automaton of RG-B has more states than that of RGB, the reward is sparser in the latter since there are not as many ways to achieve the goal as in the former. This is why RGB's learning curve does not converge faster than RG-B's. Note that the time and the number of examples needed to learn the RG-B automaton are higher than for RGB because its set of automaton states is bigger. Finally, the fact that RG-B is the task requiring the most goal traces shows that it is the task where achieving the goal is easiest, although this also makes automata learning more difficult.

• Even though RGBC has more subgoals than RG-B, its running time is lower. The most likely reason is that the automaton for RG-B has more states and, therefore, the hypothesis space is bigger. Naturally, the bigger the hypothesis space is, the harder it becomes to find a solution. However, as we saw in the grid-world experiments, the tasks with more subgoals require more examples. In particular, RGBC needs more incomplete examples than RG-B, possibly because there are more sequences of candidate subgoals to be discarded.

• The approaches based on HRL perform better than the ones based on QRM, especially in RGBC. We hypothesize there are two possible causes for this behavior:

(i) The resetting of the Q-functions. Remember that while in HRL only the metacontrollers are reset, all the Q-functions in QRM are reset. Given that new automata are learned throughout the entire interaction, the agents using QRM rarely have the chance to converge to a stable policy.

(ii) While the HRL agent commits to satisfying a given formula (determined by the metacontroller) at a given step, the QRM agent selects the globally best action at each step. Unlike the grid-worlds, in this domain the observables are constantly changing their position, so it can be more difficult to learn a function that generalizes to many scenarios (i.e., different initializations of the task).

To determine which of these two causes is more plausible, we examine the performance of QRM with a handcrafted automaton. In general, the performance of the approaches using automata learning is very similar to that of the ones using a handcrafted automaton. This shows that the forgetting effect is not as present in WaterWorld as in the grid-world domains. The use of experience replay is likely responsible for this because the agents can update the Q-functions without having to relive successful experiences that happened in the past. Therefore, we conclude that cause (ii) is better supported by our experiments.

• There is barely any difference between the learning curves of the algorithms that do not use auxiliary guidance (HRL, QRM) and the ones that do use it (HRL_G, QRM_min, QRM_max). Camacho et al. (2019) also showed similar behavior in the case of QRM using handcrafted reward machines with a different reward shaping mechanism. We hypothesize that since the Q-functions must generalize to different settings (remember that all episodes start with a random configuration), an agent that does not use guidance might explore similarly to an agent that does use it. Therefore, using guidance does not help much in these tasks.

21. We highlight that our evaluation of QRM in the WaterWorld domain differs a bit from the one in (Toro Icarte et al., 2018). While we randomly initialize the environment at the start of every episode, they use a fixed map. In the latter case, QRM quickly converges in RGBC (we have been able to reproduce the results), but we consider that learning Q-functions that generalize to diverse scenarios is more interesting.
Figure 11: Learning curves for different RL algorithms in the WaterWorld tasks when interleaved automata learning is off (HRL, QRM) and on (ISA-HRL, ISA-QRM).

Throughout this section we have observed several commonalities between the results across domains. We provide a short summary of the main common findings below:

• The higher the number of task subgoals and the number of required states, the higher the values of the different automata learning statistics (running time, number of examples and example length).

• The number of goal examples used to learn an automaton is approximately the same as the number of paths from the initial state to the accepting state. Incomplete and dead-end examples are used to refine those paths, and their number increases as the number of subgoals and automaton states grows.

• Using auxiliary reward signals (or guidance) has proven useful for speeding up convergence in grid-world tasks. Importantly, as mainly shown in OfficeWorld, it helps to learn an automaton early in the interaction and thus reduces relearning later. This is extremely helpful in QRM since it resets all the Q-functions when a new automaton is learned. On the other hand, in WaterWorld there is not a big difference between the approaches using guidance and those which do not use it.

• HRL usually converges faster than QRM, especially in the absence of guidance. In that case, QRM needs to satisfy the goal at least once to start propagating reward through the automaton states. In contrast, HRL can update the Q-functions of the formulas independently and, importantly, keep them throughout the entire learning process.

• The difference between the approaches learning an automaton and those using a handcrafted one is not big, except for the cases where an automaton cannot normally be learned within the imposed timeout. This shows that the learned automata properly represent the subgoal structure of the tasks.

9. Related Work

The following sections describe the works that share more commonalities with ours. First, we qualitatively compare subgoal automata and ISA with other forms of automata and automata learning methods that have been recently used in RL. Second, we briefly discuss some work on discovering hierarchies in HRL. Finally, we describe some of the work on breaking symmetries in graphs and automata in particular.

In this section we describe other forms of automata that have been used in reinforcement learning and how their structure has been exploited. We compare these automata and the approaches for learning them to subgoal automata and ISA, respectively.

Reward Machines. Subgoal automata are similar to another formalism that uses automata in RL called reward machines (RMs) (Toro Icarte et al., 2018). Both automata consist of edges labeled by propositional formulas over a set of observables O. The main differences with respect to our formalism are:

1. RMs do not have explicit accepting and rejecting states. Therefore, they can be used to represent continuing tasks.

2. RMs specify a reward-transition function δr : U × U → [S × A × S → R] that maps an automaton state pair into a reward function.

Toro Icarte et al.
(2018) also define simple RMs where the reward-transition function δ r : U × U → R maps an automaton state pair into a reward instead of a reward function.The authors propose the QRM (Q-learning for Reward Machines) algorithm to exploit thestructure of RMs, which was one of the RL algorithms we have applied on our automata.Different methods for simultaneously learning and exploiting simple reward machinesfrom observation traces have been recently proposed. Toro Icarte et al. (2019) propose LRM,which formulates the RM learning problem as a discrete optimization problem and use alocal search algorithm called tabu search to solve it. They use QRM to learn the policies.Unlike our approach, LRM starts by performing random actions for a fixed number of stepsto collect some traces. These traces are used for two things:1. Learn an initial automaton.2. Know which is the set of observables O of the task. This is done by looping throughthe traces. Interestingly, if two events happen at the same time in these traces, thenthey will constitute a single observable. For example, if the traces show that (cid:75) and o happen together in OfficeWorld ’s Coffee task, then the set observables O wouldinclude a single observable (cid:75) o instead of two distinct observables (cid:75) and o .Note that these two aspects are different from our approach (ISA) because:1. ISA does not learn automata from a set of randomly collected traces. Instead, itlearns a new automaton when a counterexample is found.2. ISA is given the set of observables O in advance.Even though they learn an initial automaton from random traces, they also use counterex-amples to refine the automata. LRM is also different from ISA in the following aspects: • The traces used by LRM consist of observations formed by a single observable. Thisis due to the process for getting the set of observables O described above. Therefore,conjunctions are not learned but assumed to be given in O . This can be problematicif the automaton is to be used in a task where the set of observables is different.Keeping observables separated and explicitly learning the conjunctions as we do ismore complex but allows for better generalization. Note that the observations we usecan contain an arbitrary number of observables. • LRM assumes that the observables are mutually exclusive. Consequently, it does notenforce mutual exclusivity between the edges to two different automaton states. ISAdoes enforce mutual exclusivity between the formulas labeling the outgoing transitionsto two different states. • LRM’s optimization scheme is used to find an automaton that is good at predictingwhat will be the next different observable given a maximum number of states. Thus,they do not aim to find a minimal automaton. In contrast, ISA aims to find a minimalautomaton that covers the example traces through the use of an iterative deepeningstrategy on the number of states. This has two main consequences: nduction and Exploitation of Subgoal Automata for RL – The fact that ISA looks for a minimal automaton using only positive examples(i.e., traces obtained from the agent-environment interaction) makes it prone toovergeneralize in certain tasks (see Section 10 for a detailed discussion). LRMdoes not suffer from this problem given that they do not look for a minimalautomaton, but an automaton that is good at predicting what will be the nextdifferent observation. – Since LRM aims to be good at predicting which will be the next different obser-vation, they use compressed observation traces. 
Remember that two consecutiveobservations in a compressed observation trace are always different. Thus, theirmethod cannot be applied to tasks like counting how many times an observablehas been seen in a row, as explained in Section 3.2. • LRM does not classify examples into different categories. The reward machine it aimsto learn can also represent continuing tasks; therefore, they do not use an explicitnotion of goal or dead-end as we do. • LRM does not apply a symmetry breaking mechanism and may consider differentequivalent solutions during the search for an automaton. ISA uses symmetry breakingconstraints to shrink the search space and speed up automata learning.Xu et al. (2020) propose another algorithm to learn simple RMs called JIRP (JointInference of Reward Machines and Policies). The authors express the automaton learningproblem as a SAT problem and use QRM to learn a policy for a given task. JIRP is similarto ISA in the following aspects:1. It aims to learn a minimal automaton based on an iterative deepening strategy on thenumber of states.2. The automaton learner is triggered when a counterexample is found.3. It learns only from positive examples (i.e., attainable traces by the reinforcementlearning agent).4. It learns RMs for episodic tasks where the reward is 1 only when the goal is achievedand 0 otherwise . Therefore, like in our case, these RMs consist of absorbing accept-ing and rejecting states. The rejecting state, however, seems to be induced implicitlyand not indicated explicitly as in our case.Unlike ISA, the traces used by JIRP do not only consist of observations but they also includerewards. Crucially, these sequences of rewards are used to determine counterexamples: if atrace in the MDP yields a different sequence of rewards in the automaton, then that traceis a counterexample. Furthermore, there are four other differences: • JIRP does not call the automaton learner after every single counterexample. Instead,it accumulates them into a batch and calls the automaton learner periodically. 22. Note that we assumed 0/1 reward tasks only when we used QRM and not HRL (see Section 7.1.2). urelos-Blanco, Law, Jonsson, Broda, & Russo • JIRP learns reward machines whose edges are labeled by sets of observables, whichis similar to what LRM does . In contrast, the edges of a subgoal automaton arelabeled by propositional formulas over a set of observables. • JIRP learns conditions for the loop transitions. In contrast, our approach takes looptransitions only when outgoing transitions to other states cannot be taken. • JIRP reuses the Q-functions learned by QRM when a new automaton is learned. TheQ-function at a given state u is reused in a new state u (cid:48) if they are equivalent. Twostates are equivalent if they yield the exact same sequence of rewards for all traces inthe set of counterexamples.We decided not to include this mechanism in the QRM experiments for several reasons.As explained in Section 7.1.2, the policy at a given automaton state selects the actionthat appears to be best to achieve the task’s final goal. In other words, that policymight aim to satisfy any of the formulas on the outgoing edges, so it is unclear whatshould be transferred to the new automaton. Importantly, the transfer proposed byXu et al. is based on the sequence of rewards a trace yields from a given automatonstate, and not on what the policies are trying to achieve. 
In our view, the latter aspectis what should be taken into account.Gaon and Brafman (2020) learn a deterministic finite automaton (DFA), so the tran-sitions are labeled by symbols instead of propositional formulas. However, it still sharessome commonalities with reward machines and subgoal automata. The tasks they considerare also episodic and terminate when the goal is reached. The main difference with re-spect to subgoal automata is that the set of observables O contains an observable for eachaction. Therefore, their automata are learned from action traces. The authors use twowell-known algorithms from the grammatical inference literature (de la Higuera, 2010) tolearn a minimal DFA: L* and EDSM (Evidence Driven State Merging): • L* (Angluin, 1987) is an active learning algorithm that can learn a DFA from just apolynomial number of queries. Typically, two types of queries are considered: – Membership queries: the learner requests to label a trace (i.e., state whether thetrace belongs to the language or not). – Equivalence queries: the learner asks whether its automaton captures the targetlanguage. If it does not, a counterexample is returned.Gaon and Brafman propose to use the RL agent as the oracle. A membership queryis answered by attempting to reproduce the sequence of actions in it, whereas anequivalence query is answered by checking if a counterexample trace has been observedin the past. Note that membership queries and negative answers to equivalence queriescan be unfeasible traces and, thus, overgeneralization is controlled. However, theirmethod might be prone to make wrong guesses, specially in large state and actionspaces where it is unlikely that a given trace has been seen before. 23. Xu et al.’s paper shows RMs whose edges are labeled by propositional formulas. However, we haveverified through personal communication with the authors that the transitions are indeed labeled by setsof observables. The propositional formulas were used to make the representation simpler in the paper. nduction and Exploitation of Subgoal Automata for RL • EDSM (Lang et al., 1998) is a state-merging approach to automata learning, whichconsists of two phases:1. Build an initial DFA called Prefix Tree Acceptor (PTA), which is a tree-structuredDFA built from the prefixes of a finite set of traces such that it accepts the pos-itive traces in the set and rejects the negative ones.2. Iteratively choose pairs of equivalent states to merge and produce a new automa-ton. If the automaton does not cover all the examples, it backtracks and choosesanother pair of states. When no additional merging is possible, it stops.The automaton learned by EDSM depends on the quality of the example set and,under specific conditions, the algorithm is proved to converge to the minimal DFA.The complexity of the algorithm is polynomial in the number of examples.To apply EDSM, Gaon and Brafman keep a record of the traces that reach the goaland those that do not reach the goal. Note that this is similar to what we do, althoughour dead-end and incomplete traces would be both inside their set of traces that donot reach the goal because they do not have an explicit rejecting state.State-merging approaches follow a different path to minimality than our approach andXu et al.’s. The former start from a big set of states and aim to reduce it, whereasthe latter begin with a small set of states and increase it when becomes insufficient tocover the examples. 
Nevertheless, even though these approaches are slightly different,they can both generate overgeneralized automata if only positive examples are used,as we show in Section 10.The authors combine the automata learning approaches with Q-learning and a model-basedRL algorithm called R-max (Brafman & Tennenholtz, 2002) and test them in tabular tasks.Note that while the minimality of our automaton is dependent on the maximum number ofedges between two states ( κ ), the approaches by Xu et al. (2020) and Gaon and Brafman(2020) do not mention this dependence (they both show examples of automata involvingdisjunctions).Automaton structures have also been exploited in reward machines to give bonus re-ward signals. Camacho et al. (2019) convert reward functions expressed in various formallanguages (e.g., linear temporal logic) into RMs, and propose a reward shaping method thatruns value iteration on the RM states. Similarly, Camacho et al. (2017) use automata asrepresentations of non-markovian rewards and exploit their structure to guide the search ofan MDP planner using reward shaping. In our case, we have proposed two reward shapingmechanisms based on the maximum and minimum distances to the accepting state. Hierarchical Abstract Machines (HAMs). Throughout the paper we have shownthe connection between subgoal automata and the options framework (Sutton et al., 1999),which is one of the classical approaches for HRL along with HAMs (Parr & Russell, 1997)and MAXQ (Dietterich, 2000). Even though our method is closer to options, it is also similarto HAMs in that both use an automaton. However, HAMs are non-deterministic automatawhose transitions can invoke lower level machines and are not labeled by observables (thehigh-level policy consists in deciding which transition to fire). urelos-Blanco, Law, Jonsson, Broda, & Russo Relational Macros. Torrey et al. (2007) learn a similar kind of automata to subgoalautomata called relational macros , which are finite state machines where both states andtransitions are characterized by first-order logic formulas. These formulas are built on thefirst-order logic predicates that describe the environment states. The formulas on the au-tomaton states indicate which action to take, while the formulas on the transitions say whenthe transition is taken. The learning of the relational macros is done in two phases. The structure learning phase finds a sequence of actions that distinguishes traces reaching thegoal from those which do not and composes them into an automaton. In the ruleset learning phase, their system learns the conditions for choosing actions and for taking transitions.The authors use Aleph (Srinivasan, 2001), an inductive-logic programming (ILP) system, tolearn the rules in both phases from positive and negative examples. The learned automatais used for transfer learning: the target task follows the strategy encoded by the automatonfor some steps to estimate the Q-values of the actions in the strategy, and then stops usingthe automaton and acts according to the Q-values. This approach is different from ours inthat:1. The traces it uses to learn the automata are formed by actions and not high-levelevents (observables). However, similarly to us, the traces are divided into groupsdepending on whether they reach the goal or not.2. The transitions are labeled by first-order logic formulas instead of propositional for-mulas.3. It learns logic rules that describe what action to take in each node, while we learnpolicies to choose the actions.4. 
A relational macro requires that the target task has the same action space as the sourcetask since the rules are defined on these actions. In contrast, a subgoal automatoncan be reused in another task if the set of observables and the goal are the same, evenwhen the state and action spaces are different.Note that the first-order logic predicates are similar to our observables. Even though ob-servables are propositional, both provide a high-level abstraction of the state space. Policy Graphs. Meuleau et al. (1999) propose to represent policies with finite memoryusing a class of finite-state automata called policy graphs. These automata are appliedto Partially Observable MDPs (POMDPs), where policies does not specify the action toperform as a function of the current state but as a function of all the previous historyof observations. The states of a policy graph are labeled with actions, while each edges islabeled by a single observation. Importantly, observations are partially observed states and,thus, are low level. This differs from our case where observations are sets of propositions(named observables) and we use propositional logic formulas to label the edges. Anotherdifference is that the transition function between states in the automaton is probabilistic.This function is represented as a parametric function and is learned through stochasticgradient descent using a set of traces obtained by the RL agent. Similarly to us, policygraphs are applied to tasks with a subset of states characterizing the tasks’ goals. nduction and Exploitation of Subgoal Automata for RL Moore Machines. Koul et al. (2019) transform the policy encoded by a Recurrent NeuralNetwork (RNN) into a Moore machine, which is a quantized version of the RNN. That is,the Moore machine is defined in terms of quantized state and observation representationsof the RNN. Unlike our method and the previously presented works, the authors use theresulting machine for interpretability and do not exploit its structure. In Section 7 we described two RL algorithms for exploiting the structure of a subgoalautomaton using the options framework (Sutton et al., 1999), which is one of the classicalapproaches for HRL along with HAMs (Parr & Russell, 1997) and MAXQ (Dietterich,2000).One of the core problems in the options framework is finding a set of options that helpsto maximize the return instead of handcrafting such set. This problem is known as optiondiscovery . The method we propose in this paper, ISA, can certainly be seen as an optiondiscovery method. The family of option discovery methods where ISA fits best are bottleneck methods, which find “bridges” between regions of the state space. In particular, each stateof our automata can represent a different region of the state space, and the bottleneck isrepresented by the formula connecting two automaton states. The option discovery methodmost similar to ours, except for the reward machine related ones (see Section 9.1), is due toMcGovern and Barto (2001). Their approach uses diverse density to find landmark statesin state traces, and it is similar to ours because:1. It learns from traces.2. It classifies traces into two different categories depending one whether they achievethe goal or not.3. It interleaves option discovery and learning of policies for the discovered options.The main difference is that while our bottlenecks are propositional logic formulas, theirsare crucial states to achieve the task’s goal. 
Therefore, they do not use/require a set ofpropositional events (i.e., observables) to be provided in advance.Just like some option discovery methods (e.g., McGovern & Barto, 2001; Stolle & Precup,2002), our approach requires the task to be solved at least once. Other methods (e.g.,Menache et al., 2002; Simsek & Barto, 2004; Simsek et al., 2005; Machado et al., 2017)discover options without solving the task and, thus, are also suited to continuing tasks.Alternative formalisms to automata for expressing formal languages, like grammars,have been used to discover options. Lange and Faisal (2019) induce a straight-line gram-mar, a non-branching and loop-free context-free grammar, which can only generate a singlestring. The authors use greedy algorithms to find a straight-line grammar from the shortestsequence of actions that leads to the goal. The production rules are then flattened, lead-ing to one macro-action (a sequence of actions) per production rule. These macro-actionsconstitute the set of options.Similarly to options, there has been work on learning the structures used in other HRLframeworks. Leonetti et al. (2012) synthesize a HAM from the set of shortest solutions toa non-deterministic planning problem, and use it to refine the choices at non-deterministic urelos-Blanco, Law, Jonsson, Broda, & Russo points through RL. Mehta et al. (2008) propose a method for discovering MAXQ hierarchiesfrom a trajectory that reaches the task’s goal. The symmetry breaking mechanism that we have proposed in this paper has been shown tohelp decrease the time needed to find a subgoal automaton that covers a set of examples.In this section we briefly mention some of the most related works to ours; that is, thoseaddressing the problem of breaking symmetries in graphs (specifically, automata) or thatuse ASP to encode problems.SAT-based approaches to learning deterministic finite automata (DFA) have used sym-metry breaking constraints to shrink the search space. Heule and Verwer (2010) reduce theDFA learning problem into a graph coloring problem, which is translated into SAT. Thegraph to be colored is derived from the Prefix Tree Acceptor (PTA) of the examples: twostates are connected if they cannot be merged. The vertices in a k -clique must be coloreddifferently and there are k ! different ways of coloring them. The authors propose to breakthese symmetries by imposing a way of assigning colors after finding a large clique using anapproximation algorithm (given that the problem is NP-complete). On the other hand, sim-ilarly to us, orderings based on well-known search algorithms like breadth-first search (BFS)(Ulyantsev et al., 2015; Zakirzyanov et al., 2019) and depth-first search (DFS) (Ulyantsevet al., 2016) have also been proposed. State-merging approaches to learning DFA have alsoused BFS to break symmetries (Lambeau et al., 2008).Given the successes of the use of symmetry breaking in SAT solving, the ASP communityhas also produced some work on symmetry breaking. Drescher et al. (2011) propose sbass ,a system that detects symmetries in a grounded ASP program through a reduction to agraph automorphism problem. Then, it adds constraints to the initial program to breakthe detected symmetries.Codish et al. (2019) propose a method that breaks symmetries in undirected graphs byimposing a lexicographical order in the rows of the adjacency matrix.In our previous work (Furelos-Blanco et al., 2020), we introduced a method for breakingsymmetries in acyclic subgoal automata, which consists in:1. 
assigning an integer index to each automaton state such that u has the lowest indexand u A and u R have the highest indices; and2. imposing that a trace must visit automaton states in increasing order of indices.However, this method cannot break symmetries when there is not a trace that traverses allstates in the automaton (e.g., if there are two different paths to the accepting state). Weillustrate this drawback with the example below. Example 9.1. Figure 3 (p18) showed two automata whose states u , u and u can beused interchangeably if no symmetry breaking is used. If we assign indices , . . . , to states u , . . . , u and apply the symmetry breaking rule in (Furelos-Blanco et al., 2020), the learnedautomaton for OfficeWorld ’s VisitABCD task will always be the one showed in Fig-ure 3a. On the other hand, the automaton states u and u in Figure 3b can still be switchedsince there is no trace that traverses both of them. nduction and Exploitation of Subgoal Automata for RL The method presented in this paper does not depend on the sequence of states visitedby the traces. Therefore, it can break both symmetries given in the example above. 10. Conclusions and Future Work In this paper we have proposed ISA, a method that interleaves the learning and exploitationof an automaton whose edges encode the subgoals of an episodic goal-oriented task. Thesubgoals are expressed as propositional logic formulas over a set of high-level events thatthe agent observes when interacting with the environment. These automata are representedusing a logic programing language and learned with a state-of-the-art inductive logic pro-gramming system from traces of high-level events observed by the agent. Importantly, wehave devised a symmetry breaking mechanism that speeds up the automata learning phaseby avoiding to revisit equivalent solutions in the hypothesis space. Besides, the interleav-ing mechanism we propose ensures that the learned automata are minimal (i.e., have thefewest number of states). We have experimentally tested ISA to different types of tasks,showing that it is capable of learning automata that can be exploited by existing reinforce-ment learning techniques. We have also shown how the automata learning process affectsthe reinforcement learning process and vice versa, and that ISA achieves a performancecomparable to the setting where the automaton is handcrafted and given beforehand.We now discuss possible improvements to our algorithm and directions for future work. Learning from Positive and Negative Examples. ISA aims to learn a minimal sub-goal automaton from positive examples only (i.e., traces that the agent can observe), whichcan lead to overgeneralization (Angluin, 1980). More specifically, in our case, the learnedautomaton will be too general in tasks where there is a temporal dependency between theobservables (e.g., an observable o can only be observed if another observable o (cid:48) has beenobserved before). Previously, we also showed that overgeneral automata might be learnedwhen key traces to learn all subgoals cannot be observed (see Figure 4, p33) . Example 10.1. Imagine that the set of observables O for the OfficeWorld environmentincludes an observable g that states whether the task’s goal has been achieved. In the case ofthe Coffee task, g is observed after the agent has been in the coffee location (cid:75) and then inthe office location o ; thus, a possible goal trace is (cid:104){ (cid:75) } , {} , { o, g }(cid:105) . The automaton learnercould then output the automaton below. 
The automaton consists of an initial state and the accepting state u_A, connected by a single edge labeled g. This automaton is minimal, and there is no other positive example that contradicts it because g is only observed when the goal is reached.

To avoid overgeneralization, we need traces that are impossible to observe from the agent-environment interaction. These traces would constitute the set of negative examples, which are supported by ILASP, the system we have used to learn the automata. For instance, the trace ⟨{g}⟩ would be a negative example for the task described in Example 10.1 because it is impossible to observe g without observing the coffee observable followed by o. The usage of this negative example would lead to the learning of an automaton representing the sequence "observe the coffee observable, then observe o".

Since it is impossible to observe unfeasible traces, the agent must be able to learn or hypothesize what is unfeasible in the environment in order to form negative examples. We hypothesize that approaches which learn the MDP's model (that is, the transition probability function) would be useful to solve this problem, along with more sophisticated exploration strategies than ε-greedy.

24. Remember that we addressed this issue by learning an automaton that generalizes to a set of MDPs.

Observable Discovery. ISA assumes that an appropriate set of observables is given in advance. The same occurs with other methods that learn automaton structures, like reward machines (Toro Icarte et al., 2019; Xu et al., 2020). Discovering the set of observables from interaction is an important step towards automating the entire automaton learning process. Possible future work could extract observables from object keypoints in images (Kulkarni et al., 2019) or represent the abstract states of an abstracted state space using observables.

Improve Scalability. Learning a minimal automaton from examples is a well-known hard problem (Gold, 1978). The main factors that affect the scalability of our approach are the number of states, whether cycles are allowed, the observable set size, the length of the counterexample traces, and the maximum number of edges between two states.

More Expressive Automata. A natural extension of this work is to learn automata whose edges are labeled by first-order logic formulas instead of propositional logic formulas. This would enable features not currently supported by subgoal automata, such as counting (without using an arbitrary number of edges). The ILASP system can learn first-order logic rules.

Acknowledgments

Anders Jonsson is partially supported by the Spanish grants TIN2015-67959 and PCIN-2017-082.

Appendix A. Proof of Proposition 4.1

To prove Proposition 4.1, we use the following result due to Gelfond and Lifschitz (1988):

Theorem A.1. If an ASP program P is stratified, then it has a unique answer set.

Now, we give the definition of a stratified ASP program and proceed to prove Proposition 4.1.

Definition A.1. An ASP program P is stratified when there is a partition P = P_1 ∪ P_2 ∪ · · · ∪ P_n (P_i and P_j disjoint for all i ≠ j) such that, for every predicate p:

• the definition of p (all clauses with p in the head) is contained in one of the partitions P_i;

and, for each 1 ≤ i ≤ n:

• if a predicate occurs positively in a clause of P_i, then its definition is contained within ⋃_{j ≤ i} P_j;

• if a predicate occurs negatively in a clause of P_i, then its definition is contained within ⋃_{j < i} P_j.
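The conditions of Definition A.1 can be checked mechanically. The sketch below, which is only an illustration and is not tied to the ASP programs used in the paper, takes a candidate partition in which every rule is summarized by its head predicate and by the predicates occurring positively and negatively in its body, and verifies both conditions of the definition.

```python
def is_stratified(partition):
    """Check the conditions of Definition A.1 for a given partition of a program.
    partition: list of strata; each stratum is a list of rules represented as
    (head_predicate, positive_body_predicates, negative_body_predicates).
    Predicates with no defining rule are treated as defined in an earlier stratum."""
    stratum_of = {}                                    # predicate -> stratum of its definition
    for i, stratum in enumerate(partition):
        for head, _, _ in stratum:
            if stratum_of.setdefault(head, i) != i:    # definition split across strata
                return False
    for i, stratum in enumerate(partition):
        for _, positive, negative in stratum:
            if any(stratum_of.get(p, -1) > i for p in positive):
                return False                           # positive body defined later
            if any(stratum_of.get(p, -1) >= i for p in negative):
                return False                           # negation needs a strictly earlier stratum
    return True

# r.   q :- not r.   p :- q.        -- stratified with r before q before p
assert is_stratified([[("r", [], [])], [("q", [], ["r"])], [("p", ["q"], [])]])
# q :- not q.                        -- recursion through negation, not stratified
assert not is_stratified([[("q", [], ["q"])]])
```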
In this section we describe the details of our symmetry breaking mechanism. For ease ofpresentation, we first encode it in the form of a satisfiability (SAT) formula and formallyprove several of its properties. We later explain how to convert this SAT formula intoASP rules. Finally, we propose a more efficient ASP encoding of the symmetry breakingconstraints than the direct translation from SAT. B.1 SAT Encoding Our idea is to define a SAT formula that encodes a BFS traversal I ( G ) = (cid:104) f, { Γ u } u ∈ V (cid:105) ofa given graph G = ( V, E ) in the class G , defined on a set of labels L = { l , . . . , l k } . Sincea graph indexing I ( G ) assigns unique integers to nodes and edges, we use i to refer to anode u such that f ( u ) = i , ( i, e ) or ( i, e, L ) to refer to an edge ( u, v, L ) such that f ( u ) = i and Γ u ( L ) = e , and m to refer to a label l m ∈ L . We sometimes extend this notation in thenatural way, e.g. by writing Γ i ( L ), E o ( i ) and Π I ( i ). Variables. We first define a set X of propositional SAT variables for all combinations ofsymbols (sometimes with restrictions as indicated):1. ed ( i, j, e ), [edge ( i, e ) ends in node j ]2. label ( i, e, m ), [the label set on edge ( i, e ) includes l m ∈ L ]3. pa ( i, j ), i < j , [node i is the parent of j in the BFS subtree]4. sm ( i, j, e ), i < j , [ e is the smallest integer on a BFS edge from node i to j ]5. lt ( i, e − , e, m ), e > 1, [there is a label l m (cid:48) | m (cid:48) ≤ m on ( i, e ) and not on ( i, e − ed ( i, j, e ) and label ( i, e, m ) are used to encode a graph G together withan associated graph indexing I ( G ), variables pa ( i, j ) and sm ( i, j, e ) are used to encode theparent function Π I , and variables lt ( i, e − , e, m ) are used to encode the label set ordering. Clauses. We next define a set C of clauses on X for all combinations of symbols (some-times with restrictions as indicated). The first set of clauses (1-8) enforces the first conditionin Definition 6.2: for any two nodes i > j > 1, Π I ( i ) < Π I ( j ) ⇔ i < j . nduction and Exploitation of Subgoal Automata for RL (cid:87) i | i 1, [node j > pa ( i, j ) ⇒ ¬ pa ( i (cid:48) , j ), i < i (cid:48) < j , [incoming BFS edge is unique]3. pa ( i, j ) ⇒ (cid:87) e sm ( i, j, e ), i < j , [BFS edge implies smallest integer]4. pa ( i, j ) ⇒ ¬ ed ( i (cid:48) , j (cid:48) , e ), i (cid:48) < i < j ≤ j (cid:48) , [respect BFS order]5. sm ( i, j, e ) ⇒ pa ( i, j ), i < j , [smallest integer implies BFS edge]6. sm ( i, j, e ) ⇒ ¬ sm ( i, j, e (cid:48) ), i < j , e < e (cid:48) , [smallest integer is unique]7. sm ( i, j, e ) ⇒ ed ( i, j, e ), i < j , [smallest integer implies edge]8. sm ( i, j, e ) ⇒ ¬ ed ( i, j (cid:48) , e (cid:48) ), i < j ≤ j (cid:48) , e (cid:48) < e , [correctly break ties]Intuitively, Clauses 1 and 2 state that each node j > i in theBFS subtree. Clauses 3, 5 and 6 state that each node j > i, e ) with smallest integer e from its parent i in the BFS subtree. Clause 7 ensures that theincoming edge ( i, e ) to j in the BFS subtree corresponds to an actual edge in the graph G .Clauses 4 and 8 constitute the core of symmetry breaking by enforcing the conditionthat Π I ( i ) < Π I ( j ) should imply i < j . By definition of Π I , the incoming edge ( i, e ) to j in the BFS subtree should be the lexicographically smallest such integer pair. Hence thegraph G cannot contain any incoming edge ( i (cid:48) , e (cid:48) ) to j from a node i (cid:48) < i . 
In addition,no node j (cid:48) > j can have such an incoming edge either, since otherwise its parent functionwould be smaller than that of j , thus violating the desired condition. These two facts arejointly encoded in Clause 4 by enforcing the restriction j (cid:48) ≥ j .Likewise, if ( i, e ) is the incoming edge to j in the BFS subtree, graph G cannot containan incoming edge ( i, e (cid:48) ) from the same node i with e (cid:48) < e . Again, no node j (cid:48) > j can havesuch an incoming edge either, since otherwise its parent function would be smaller than thatof j . These two facts are jointly encoded in Clause 8 by enforcing the restriction j (cid:48) ≥ j .The second set of clauses (9-14) assigns edge integers to the outgoing edges from eachnode, enforcing the second condition in Definition 6.2: for each node i and pair of outgo-ing edges ( i, e, L ) and ( i, e (cid:48) , L (cid:48) ), L < L (cid:48) ⇔ e < e (cid:48) . Due to the transitivity of the relation < , it is sufficient to check that the condition holds for all pairs of consecutive edge inte-gers ( e − , e ). Clauses 9 and 10 enforce that edge integers are unique between 1 and | E o ( i ) | .9. ed ( i, j, e ) ⇒ (cid:87) j (cid:48) ed ( i, j (cid:48) , e − e > 1, [edge integers start at 1 and are contiguous]10. ed ( i, j, e ) ⇒ ¬ ed ( i, j (cid:48) , e ), j < j (cid:48) , [edge integers cannot be duplicated]Clauses 11-14 are used to enforce that two consecutive edges ( i, e − , L ) and ( i, e, L (cid:48) ) satisfy L < L (cid:48) . Formally, variable lt ( i, e − , e, m ) is only true if there exists m (cid:48) ≤ m such that l m (cid:48) / ∈ L and l m (cid:48) ∈ L (cid:48) . This is implemented using the following two clauses:11. lt ( i, e − , e, m ) ⇒ ¬ label ( i, e − , m ) ∨ lt ( i, e − , e, m − e > lt ( i, e − , e, m ) ⇒ label ( i, e, m ) ∨ lt ( i, e − , e, m − e > lt ( i, e − , e, m ) holds, either l m / ∈ L and l m ∈ L (cid:48) , or lt ( i, e − , e, m − 1) holds for m − 1. The disjuncts mentioning m − m > 1. The next clause en-sures that for each edge ( i, e ) with e > lt ( i, e − , e, m ) is true for at least one label l m ∈ L :13. ed ( i, j, e ) ⇒ (cid:87) m lt ( i, e − , e, m ) , e > urelos-Blanco, Law, Jonsson, Broda, & Russo Finally, the following clause encodes the second part of Definition 6.1, ensuring that thelabel set on edge ( i, e ) is not lower than that on ( i, e − lt ( i, e − , e, m ) ∨ ¬ label ( i, e − , m ) ∨ label ( i, e, m ) , e > Properties. We proceed to prove several properties about the SAT encoding. Concretely,we show that there is a one-to-one correspondence between the BFS traversal of a graphand a solution to the SAT encoding. Definition B.1. Given a graph G = ( V, E ) ∈ G defined on a set of labels L = { l , . . . , l k } and an associated graph indexing I ( G ) = (cid:104) f, { Γ u } u ∈ V (cid:105) , let X ( G, I ) be an assignment to theSAT variables in X , assigning false to all variables in X except as follows: • For each edge ( u, v, L ) ∈ E , ed ( f ( u ) , f ( v ) , Γ u ( L )) is true. • For each edge ( u, v, L ) ∈ E and each label l m ∈ L , label ( f ( u ) , Γ u ( L ) , m ) is true. • For each node v ∈ V \ { v } with Π I ( v ) = ( i, e ) , pa ( i, f ( v )) and sm ( i, f ( v ) , e ) are true. 
• For each node u ∈ V , each pair of outgoing edges ( u, v, L ) and ( u, v (cid:48) , L (cid:48) ) in E o ( u ) such that Γ u ( L ) = Γ u ( L (cid:48) ) − , and each label l m ∈ L , lt ( f ( u ) , Γ u ( L ) , Γ u ( L (cid:48) ) , m ) is trueif there exists m (cid:48) ≤ m such that l m (cid:48) / ∈ L and l m (cid:48) ∈ L (cid:48) . Example B.1. Given the graph G , graph indexing I ( G ) and set of labels L = { a, b, c, d, e, f } from Example 6.4, the assignment X ( G, I ) assigns true to the following SAT variables in X : ed (1 , , label (1 , , label (1 , , ed (1 , , label (1 , , label (1 , , ed (1 , , label (1 , , label (1 , , ed (2 , , label (2 , , ed (3 , , label (3 , , ed (4 , , label (4 , , ed (4 , , label (4 , , pa (1 , pa (1 , pa (1 , pa (4 , sm (1 , , sm (1 , , sm (1 , , sm (4 , , lt (1 , , , lt (1 , , , lt (1 , , , lt (1 , , , lt (1 , , , lt (1 , , , lt (1 , , , lt (1 , , , lt (1 , , , lt (1 , , , lt (1 , , , lt (4 , , , lt (4 , , , lt (4 , , , lt (4 , , , Theorem B.1. Given a graph G and a graph indexing I ( G ) , the assignment X ( G, I ) tothe SAT variables in X satisfies all SAT clauses in C if and only if I ( G ) is a BFS traversal.Proof. ⇐ : Assume that I ( G ) is a BFS traversal (i.e., it satisfies the conditions of Defini-tion 6.2). We show that each clause in C is satisfied:1. (cid:87) i | i 1) holds for u with f ( u ) = i since Γ u is a bijection onto { , . . . , | E o ( u ) |} .10. ed ( i, j, e ) ⇒ ¬ ed ( i, j (cid:48) , e ) holds for u with f ( u ) = i since Γ u is a bijection.11. lt ( i, e − , e, m ) ⇒ ¬ label ( i, e − , m ) ∨ lt ( i, e − , e, m − 1) holds for u , f ( u ) = i , andoutgoing edges ( u, v, L ), ( u, v (cid:48) , L (cid:48) ) with Γ u ( L ) = e − u ( L (cid:48) ) − lt ( i, e − , e, m )implies that either l m / ∈ L or there exists m (cid:48) < m such that l m (cid:48) / ∈ L and l m (cid:48) ∈ L (cid:48) .12. lt ( i, e − , e, m ) ⇒ label ( i, e, m ) ∨ lt ( i, e − , e, m − 1) holds for the same setting since lt ( i, e − , e, m ) implies that either l m ∈ L (cid:48) or there exists m (cid:48) < m such that l m (cid:48) / ∈ L and l m (cid:48) ∈ L (cid:48) .13. ed ( i, j, e ) ⇒ (cid:87) j (cid:48) ed ( i, j (cid:48) , e − 1) holds since I ( G ) is a BFS traversal, implying that thebijection Γ u for u with f ( u ) = i satisfies L < L (cid:48) whenever Γ u ( L ) < Γ u ( L (cid:48) ) (and inparticular when Γ u ( L ) = e − u ( L (cid:48) ) − L < L (cid:48) , there has to exist at leastone m, ≤ m ≤ k such that l m / ∈ L and l m ∈ L (cid:48) due to Definition 6.1.14. lt ( i, e − , e, m ) ∨ ¬ label ( i, e − , m ) ∨ label ( i, e, m ) also holds since Γ u ( L ) = e − u ( L (cid:48) ) − L < L (cid:48) . Hence for the given m , Definition 6.1 is satisfied either1) by a label l m (cid:48) ≤ m , implying that lt ( i, e − , e, m ) is true; or 2) by a label l m (cid:48) >m ,implying that l m ∈ L and l m / ∈ L (cid:48) cannot both be true. ⇒ : Assume that X ( G, I ) satisfies all SAT clauses. We show that I ( G ) is a BFS traversal.First note from above that X ( G, I ) satisfies all clauses except 4, 8, 13 and 14 even if I ( G )is not a BFS traversal. Hence we can focus exclusively on these four clauses. urelos-Blanco, Law, Jonsson, Broda, & Russo We first analyze Clauses 13 and 14. For Clause 13 to be true, any edge ( i, e ) inducedfrom graph G with e > lt ( i, e − , e, m ) for at least one label l m ∈ L . 
Let u be the node with f(u) = i and let (u, v, L) and (u, v′, L′) be the two outgoing edges in E_o(u) such that Γ_u(L) = e − 1 and Γ_u(L′) = e. By definition of X(G, I) there exists m, 1 ≤ m ≤ k, such that l_m ∉ L and l_m ∈ L′. For the smallest such m there cannot exist m′ < m such that l_{m′} ∈ L and l_{m′} ∉ L′, else Clause 14 would be violated for m′. Hence Definition 6.1 holds for label l_m, implying L < L′.

We next analyze Clauses 4 and 8. Let u be the node such that f(u) = j and Π_I(u) = (i, e), and let v be any node such that f(v) = j′ > j. Since Clause 4 holds, graph G cannot contain an edge (w, v, L) such that f(w) = i′ < i. Since Clause 8 holds, graph G cannot contain an edge (w, v, L) such that f(w) = i and Γ_w(L) < e. Since Π_I(v) cannot equal (i, e), it has to be larger than (i, e), implying Π_I(u) < Π_I(v).

We have shown that the two conditions in Definition 6.2 hold: for each pair of nodes u and v in V \ {v_1}, Π_I(u) < Π_I(v) implies f(u) < f(v), and for each node u ∈ V and pair of outgoing edges (u, v, L) and (u, v′, L′) in E_o(u), L < L′ implies Γ_u(L) < Γ_u(L′). Hence by definition I(G) is a BFS traversal.

Definition B.2. Let X be an assignment to the SAT variables X that satisfies the SAT clauses in C. Given an edge (i, e), let L(i, e) = {l_m | label(i, e, m)} be the label set induced by X. We define a mapping G(X) = (G, I(G)) from X to a graph G = (V, E) and associated graph indexing I(G) = ⟨f, {Γ_u}_{u ∈ V}⟩ as follows:

• The set of nodes is V = {v_1, . . . , v_n}, where n is the largest node index in the assignment, and f(v_i) = i for each v_i ∈ V.
• The set of edges is E = {(v_i, v_j, L(i, e)) | ed(i, j, e)} and Γ_{v_i}(L(i, e)) = e.
• The parent function of each v_j ∈ V equals Π_I(v_j) = (i, e) if sm(i, j, e) is true.

Theorem B.2. Given an assignment X to the SAT variables X that satisfies all clauses in C, the mapping G(X) = (G, I(G)) induces a graph G in the class 𝒢 and a well-defined graph indexing I(G).

Proof. We first show that the induced graph indexing I(G) is well-defined. Clearly f is a bijection onto {1, . . . , |V|} by definition. We next show that Γ_{v_i} is a bijection onto {1, . . . , |E_o(v_i)|} for each node v_i ∈ V. Clause 10 ensures that ed(i, j, e) and ed(i, j′, e) cannot be true simultaneously for j ≠ j′. Due to Clause 9, if edge (i, e) is defined for e > 1, then edge (i, e − 1) is defined as well. Hence (i, e) is uniquely defined for e ∈ {1, . . . , |E_o(v_i)|}, where |E_o(v_i)| is the largest integer of an outgoing edge from i.

We next show that the induced label set L(i, e) on each outgoing edge (i, e) from i is unique. Clause 13 implies that for each edge (i, e) with e > 1, lt(i, e − 1, e, m) is true for at least one label l_m. Clauses 11 and 12 ensure that lt(i, e − 1, e, m) is true only if there exists m′ ≤ m such that ¬label(i, e − 1, m′) and label(i, e, m′) are true. For the smallest such m′ there cannot exist m′′ < m′ such that label(i, e − 1, m′′) and ¬label(i, e, m′′) are true, else Clause 14 would be violated for m′′.
Hence label l_{m′} satisfies the condition in Definition 6.1 with respect to the induced label sets L(i, e − 1) and L(i, e), implying L(i, e − 1) < L(i, e).

In conclusion, we have shown that (i, e) is uniquely defined for e ∈ {1, . . . , |E_o(v_i)|}, and that L(i, e − 1) < L(i, e) holds for each pair of consecutive integers in {1, . . . , |E_o(v_i)|}. Since Γ_{v_i} is defined as Γ_{v_i}(L(i, e)) = e for each e ∈ {1, . . . , |E_o(v_i)|}, this implies that Γ_{v_i} is a well-defined bijection from E_o(v_i) to {1, . . . , |E_o(v_i)|}.

We also need to show that the induced parent function Π_I is well-defined, i.e. that for each j > 1, sm(i, j, e) is true for a single i and e, and that Π_I(v_j) = (i, e) is consistent with the definition of Π_I. Clauses 1 and 2 imply that pa(i, j) is true for a single i. Clauses 3, 5 and 6 imply that sm(i, j, e) can only be true for the same i and j as pa(i, j), and that sm(i, j, e) is true for a single e. For Π_I(v_j) = (i, e) to hold, G has to contain the edge (v_i, v_j, L(i, e)), which is guaranteed by Clause 7. Moreover, G cannot contain any edge (v_{i′}, v_j, L(i′, e′)) such that (i′, e′) < (i, e), which is guaranteed by Clauses 4 and 8.

We finally show that the induced graph G belongs to the class 𝒢, i.e. that the three assumptions on page 18 are satisfied. We satisfy Assumption 1 by designating v_1 as the start node. We have already shown above that the induced label set L(i, e) on each outgoing edge (i, e) from i is unique, satisfying Assumption 3. It remains to show that Assumption 2 holds, i.e. that each node v_j, j > 1, is reachable from v_1. Since sm(i, j, e) is true for a single i and e such that i < j, G has to contain the edge (v_i, v_j, L(i, e)) due to Clause 7. Aggregating these incoming edges for all nodes different from v_1 results in a BFS subtree rooted in v_1, and each node v_j, j > 1, is reachable from v_1 in this subtree.

By combining Theorems B.1 and B.2, it follows that the mapping G(X) = (G, I(G)) of a satisfying assignment X to the SAT clauses in C is such that I(G) is a BFS traversal. Since by Lemma 6.1 (p21) each graph G ∈ 𝒢 has a unique BFS traversal, it follows that the SAT encoding cannot generate two permutations of node integers that represent the same graph G. Hence the SAT encoding breaks the symmetries in graphs such as those in Figure 3 (p18).

We remark that the SAT encoding is not forced to include edges in the induced graph G other than those needed to correctly represent the parent function Π_I. However, by combining the SAT encoding with another encoding for generating automata, we ensure that the encoding cannot produce two symmetric automata.

B.2 ASP Encodings

In this section we present two different encodings of our symmetry breaking method in ASP for its application to subgoal automata. The first method (Appendix B.2.1) is a direct translation from the SAT clauses introduced in the previous section. However, there are certain aspects that can be made more efficient in ASP, so we propose a second method for encoding the symmetry breaking constraints (Appendix B.2.2).

In order to apply the proposed symmetry breaking mechanism to a subgoal automaton, we have to take the following aspects into account:
1. The edges of a subgoal automaton are labeled by propositional formulas over a set of observables O, whereas the edges of a labeled directed graph are defined over a set of labels L.

2. The graph indexing presented for labeled directed graphs assigns a different edge index to each of the outgoing edges from a node. In contrast, in the ASP representation M(A) of a subgoal automaton A the edge indices are unique only from one state to another (i.e., they can be repeated between other pairs of states).

We now partially describe how we address (1), which is common to both methods. We map the set of observables O used by the subgoal automaton into a set of labels L, which consists of integer values for easy comparison. Each of these integer values must encode either an observable or its negation. Given a set of observables O = {o_1, . . . , o_|O|}, the set of labels L is defined as:

L = { i, i + |O| | o_i ∈ O },

where i corresponds to o_i and i + |O| corresponds to its negation, ¬o_i. Therefore, L consists of integers from 1 to 2|O|, where labels 1 . . . |O| correspond to each of the observables, and labels |O| + 1 . . . 2|O| correspond to the observable negations. This mapping is encoded in ASP using the following predicates:

• obs_id(o, i) indicates that observable o ∈ O is assigned id i.
• num_obs(i) indicates that the set of observables has size i.
• valid_label(l) indicates that l is a label.

We simply ground the above predicates according to their descriptions:

{ obs_id(o_i, i). | o_i ∈ O } ∪ { num_obs(|O|). } ∪ { valid_label(l). | 1 ≤ l ≤ 2|O| }.

The mapping is not complete: we also need to map the formulas on the edges into sets of labels. Specifically, we map the factual representation of the automaton introduced in Section 6 into label facts similar to the variables of the same name used in the SAT encoding. However, these facts are slightly different between encodings, so we describe them in detail in their respective sections. The second aspect above is also addressed differently by each encoding.

Another common feature between the encodings is that states in the automaton are assigned an integer index for easy comparison. This is needed only for the rules enforcing the BFS traversal on the automaton, which corresponds to Clauses 1-8 in the SAT encoding. The predicate state_id(u, i) denotes that the automaton state u has index i, and it is grounded as follows:

{ state_id(u_i, i). | u_i ∈ U \ {u_A, u_R} }.

Note that, without loss of generality, the first state index is 0 because of the initial state u_0 (recall that in Section 6.2 the bijection f assigns indices between 1 and |V|, starting from 1), and that not all states are assigned an index. Given that the accepting state u_A and the rejecting state u_R are fixed, they cannot be interchanged with any other state. Furthermore, they cannot be the parent of any other state since they are absorbing states. Thus, they are excluded from the BFS ordering. In order to easily compare the indices of two states, we introduce the set of rules below. The first rule defines the predicate state_lt(u, u′) to express that the index of state u is lower than that of u′. Similarly, the second rule defines the predicate state_leq(u, u′) to express that the index of state u is lower than or equal to that of u′.

state_lt(X,Y) :- state_id(X,XID), state_id(Y,YID), XID < YID.
state_leq(X,Y) :- state_id(X,XID), state_id(Y,YID), XID <= YID.
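As an illustration (the observable and state names below are hypothetical), consider an automaton with observables O = {coffee, mail} and states U = {u0, u1, uA, uR}. The groundings above then consist of the following facts:

obs_id(coffee,1).  obs_id(mail,2).
num_obs(2).
valid_label(1..4).                 % 1 = coffee, 2 = mail, 3 = ¬coffee, 4 = ¬mail
state_id(u0,0).  state_id(u1,1).   % uA and uR are not assigned an index

From these facts, the rules above derive state_lt(u0,u1), state_leq(u0,u1), state_leq(u0,u0) and state_leq(u1,u1).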
B.2.1 SAT-Based Encoding

The encoding we present in this section is a direct translation from the SAT encoding introduced in Appendix B.1. The presentation of this encoding is divided into three parts. First, we introduce a mapping from the edge indices used by the subgoal automata into the edge indices used in the symmetry breaking method. Second, we describe how the previously introduced mapping from observables into labels is applied to map formulas over observables into label sets. Third, we transform the SAT clauses into ASP rules.

Edge Index Mapping. The set of rules below encodes the edge index mapping. Firstly, we define a predicate edge_id(i), where i is an edge index, and ground it for values between 1 and (|U| − 1)κ:

{ edge_id(i). | 1 ≤ i ≤ (|U| − 1)κ }.

Note that (|U| − 1)κ is the maximum number of outgoing edges from a state: each state can have edges to the other |U| − 1 states, and at most κ edges are allowed from one state to another.

We use facts of the form mapping(u, v, e, e′) to indicate that edge e between u and v is mapped into e′. The mapping is enforced using the set of rules below. The first rule describes that an edge index E from X to Y is mapped into exactly one edge index EE in the range given by the edge_id facts. The second rule enforces that two outgoing edges from a state X to two different states Y and Z must be mapped into different edge indices. The third rule enforces that two edge indices E and EP between the same pair of states X and Y must be mapped into different edge indices. The fourth rule indicates that if there are two edge indices E and EP between states X and Y such that E < EP, then the indices to which they are mapped (EE and EEP) must preserve the same ordering (EE < EEP).

1 { mapping(X,Y,E,EE) : edge_id(EE) } 1 :- ed(X,Y,E).
:- mapping(X,Y,_,EE), mapping(X,Z,_,EE), Y < Z.
:- mapping(X,Y,E,EE), mapping(X,Y,EP,EE), E < EP.
:- ed(X,Y,E), ed(X,Y,EP), E < EP, mapping(X,Y,E,EE), mapping(X,Y,EP,EEP), EE > EEP.

Given the mapping, it is straightforward to redefine the ed atoms, as well as the pos and neg facts used in the factual representation of the automata introduced in Section 6:

map_ed(X,Y,EP) :- ed(X,Y,E), mapping(X,Y,E,EP).
map_pos(X,Y,EP,O) :- pos(X,Y,E,O), mapping(X,Y,E,EP).
map_neg(X,Y,EP,O) :- neg(X,Y,E,O), mapping(X,Y,E,EP).

The predicates map_ed, map_pos and map_neg are used in the ASP encoding of the symmetry breaking constraints explained later.

Mapping of Formulas into Label Sets. The set of rules below uses the previously described observable-to-label and edge index mappings to transform formulas over a set of observables into label sets. The first rule sets OID as a label of edge E from X if the corresponding observable O appears positively in that edge. The second rule sets OID + N as a label of edge E from X if the corresponding observable O appears negatively in that edge, where N is the number of observables.

label(X,E,OID) :- map_pos(X,Y,E,O), obs_id(O,OID).
label(X,E,OID+N) :- map_neg(X,Y,E,O), obs_id(O,OID), num_obs(N).
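Continuing the hypothetical example above, suppose the automaton contains a single edge from u0 to u1, with automaton edge index 1 and labeled by the formula coffee ∧ ¬mail, i.e., the facts ed(u0,u1,1), pos(u0,u1,1,coffee) and neg(u0,u1,1,mail). One admissible mapping is mapping(u0,u1,1,1) (and it becomes the only one once the contiguity constraint corresponding to Clause 9, given below, is added). Under this mapping, the rules above derive:

map_ed(u0,u1,1).
map_pos(u0,u1,1,coffee).  map_neg(u0,u1,1,mail).
label(u0,1,1).   % coffee appears positively, so its id 1 is a label of edge 1 from u0
label(u0,1,4).   % mail appears negatively, so 2 + |O| = 4 is a label of edge 1 from u0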
Symmetry Breaking Rules. Similarly to the SAT encoding, we divide the resulting ASP rules into three sets. First, we introduce the set of rules enforcing the indexing given by the BFS traversal on the automaton. These rules are defined in terms of an auxiliary predicate ed_sb(u, u′, i), equivalent to map_ed(u, u′, i) but only defined for those states u′ that have a state index. Remember that the accepting (u_A) and rejecting (u_R) states are excluded from the traversal and, thus, they are the only states without an index. The following rule defines ed_sb in terms of map_ed:

ed_sb(X,Y,E) :- map_ed(X,Y,E), state_id(Y,_).

The set of rules below corresponds to Clauses 1-8 in the SAT encoding.

1 { pa(X,Y) : state(X), state_lt(X,Y) } :- state(Y), state_id(Y,YID), YID > 0.
:- pa(X,Y), pa(XP,Y), state_lt(X,XP), state_lt(XP,Y).
1 { sm(X,Y,E) : edge_id(E) } :- pa(X,Y).
:- pa(X,Y), ed_sb(XP,YP,_), state_lt(XP,X), state_leq(Y,YP).
:- sm(X,Y,_), not pa(X,Y).
:- sm(X,Y,E), sm(X,Y,EP), E < EP.
:- sm(X,Y,E), not ed_sb(X,Y,E).
:- sm(X,Y,E), ed_sb(X,YP,EP), state_lt(X,Y), state_leq(Y,YP), EP < E.

The next set of rules encodes Clauses 9 and 10, which enforce that edge indices are unique between 1 and the number of outgoing edges from X:

:- map_ed(X,Y,E), not map_ed(X,_,E-1), E > 1.
:- map_ed(X,Y,E), map_ed(X,Z,E), Y < Z.

The following rules encode SAT Clauses 11-14 in ASP. Note that Clauses 11 and 12 have each been divided into two rules to cover the different values of L.

:- lt(X,E-1,E,L), label(X,E-1,L), not lt(X,E-1,E,L-1), E > 1, L > 1.
:- lt(X,E-1,E,L), label(X,E-1,L), E > 1, L = 1.
:- lt(X,E-1,E,L), not label(X,E,L), not lt(X,E-1,E,L-1), E > 1, L > 1.
:- lt(X,E-1,E,L), not label(X,E,L), E > 1, L = 1.
1 { lt(X,E-1,E,L) : valid_label(L) } :- map_ed(X,Y,E), E > 1.
:- not lt(X,E-1,E,L), label(X,E-1,L), not label(X,E,L), map_ed(X,_,E), E > 1.
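As a small sanity check of the BFS rules above (the states, indices and edges below are hypothetical, and the lower bounds on the choice rules are part of the reconstruction given above), consider three indexed states and two mapped edges:

% Hypothetical fragment: u0 -> u1 -> u2, all three states indexed.
state(u0). state(u1). state(u2).
state_id(u0,0). state_id(u1,1). state_id(u2,2).
edge_id(1). edge_id(2).
map_ed(u0,u1,1). map_ed(u1,u2,1).

Here ed_sb(u0,u1,1) and ed_sb(u1,u2,1) are derived, and the assignment pa(u0,u1), sm(u0,u1,1), pa(u1,u2), sm(u1,u2,1) satisfies all of the rules above. Choosing pa(u0,u2) instead would force some sm(u0,u2,E) to hold, violating the constraint :- sm(X,Y,E), not ed_sb(X,Y,E). Similarly, if the indices of u1 and u2 were swapped, the state with index 1 would have no admissible parent, so that relabeling of the same automaton is pruned.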
B.2.2 Alternative Encoding

The encoding we describe in this section is an alternative to the one presented before. Despite the differences, it also enforces the graph indexing given by the BFS traversal described in Section 6.

Similarly to the SAT-based encoding, we need to address the fact that the mechanism uses an indexing for the edges different from the one we use to represent a subgoal automaton. In the SAT-based approach, we defined a mapping from the edge indices used by a subgoal automaton to those required by the symmetry breaking. In this approach, we do not use such a mapping and directly operate on the edge indices used by the automata. The edge indexing used in the symmetry breaking is such that each outgoing edge from a given state has a different index; in other words, each edge is uniquely identified by an integer number. Here we preserve the same uniqueness principle by expressing the edges as (u, (v, e)), meaning that there is an edge from u to v with edge index e. Note that the tuple (v, e) uniquely identifies each outgoing edge from u.

The rest of this section is divided into two parts. First, like in the SAT-based approach, we map the propositional formulas on the edges into sets of labels. Then, we describe the set of ASP rules we use for breaking symmetries.

Mapping of Formulas into Label Sets. We use the predicate label(u, (v, e), l) to express that label l appears in the edge from state u to state v with index e. Similarly to the SAT-based encoding, the set of rules below transforms the formulas over a set of observables into label sets. The first rule sets OID as a label of edge (Y, E) from X if the corresponding observable O appears positively in that edge. Likewise, the second rule sets OID + N as a label of edge (Y, E) from X if the corresponding observable O appears negatively in that edge, where N is the number of observables.

label(X,(Y,E),OID) :- pos(X,Y,E,O), obs_id(O,OID).
label(X,(Y,E),OID+N) :- neg(X,Y,E,O), obs_id(O,OID), num_obs(N).

Note that while the label predicate in the SAT encoding only used the edge index to refer to an outgoing edge, here we use a state-edge pair as explained at the beginning of the section.

Symmetry Breaking Rules. We start by describing the rules which enforce the outgoing edges from a given node to be ordered by their respective label sets. The predicate ed_lt(X, (Y,E), (YP,EP)) indicates that the edge from X to Y with edge index E is lower than the edge from X to YP with edge index EP. The set of rules below describes how this ordering is determined and what constraints we impose on it. The first rule determines that, given two outgoing edges (Y,E) and (YP,EP) from X, either (Y,E) is lower than (YP,EP) or vice versa. The order between the outgoing edges from a state X must respect two constraints:

• The second rule enforces transitivity. That is, if Edge1 is lower than Edge2, and Edge2 is lower than Edge3, then Edge1 must be lower than Edge3.
• The third rule enforces that two edges to the same state Y must be ordered according to their edge index. That is, given edges (Y,E) and (Y,EP) from X such that E < EP, edge (Y,E) must be lower than (Y,EP).

1 { ed_lt(X,(Y,E),(YP,EP)); ed_lt(X,(YP,EP),(Y,E)) } 1 :- ed(X,Y,E), ed(X,YP,EP), (Y,E) < (YP,EP).
:- ed_lt(X,Edge1,Edge2), ed_lt(X,Edge2,Edge3), not ed_lt(X,Edge1,Edge3), Edge1 != Edge3.
:- ed_lt(X,(Y,E),(Y,EP)), ed(X,Y,E), ed(X,Y,EP), E > EP.
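For illustration (the states and edges are hypothetical), suppose u0 has two outgoing edges ed(u0,u1,1) and ed(u0,u2,1). The first rule guesses one of the two orientations ed_lt(u0,(u1,1),(u2,1)) or ed_lt(u0,(u2,1),(u1,1)); the label_lt rules presented next force the chosen orientation to agree with the ordering of the corresponding label sets. If instead both edges point to the same state, e.g. ed(u0,u1,1) and ed(u0,u1,2), the third constraint rules out ed_lt(u0,(u1,2),(u1,1)), leaving ed_lt(u0,(u1,1),(u1,2)) as the only admissible orientation.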
Note that the rules above guess an ordering of the outgoing edges from a given state. However, this ordering must comply with that of the label sets given in Definition 6.1. We use the predicate label_lt(X, Edge1, Edge2, L) to indicate that there is a label L′ ≤ L that appears in Edge2 and does not appear in the lower edge Edge1, where both Edge1 and Edge2 are outgoing edges from X. Note that this predicate encodes the first condition in Definition 6.1 up to a specific label. The set of rules below prunes solutions where the outgoing edges do not follow the established label ordering criteria. The first rule indicates that label_lt(X, Edge1, Edge2, L) is true if Edge1 is lower than Edge2, and the label L does not appear in Edge1 and appears in Edge2. The second rule states that label_lt is true for a valid label L + 1 if it is true for L. The third rule states that if Edge1 is lower than Edge2, then the label set on Edge1 must be lower than that on Edge2. Note that the three last literals of the constraint enforce both conditions in Definition 6.1.

label_lt(X,Edge1,Edge2,L) :- ed_lt(X,Edge1,Edge2), not label(X,Edge1,L), label(X,Edge2,L).
label_lt(X,Edge1,Edge2,L+1) :- label_lt(X,Edge1,Edge2,L), valid_label(L+1).
:- ed_lt(X,Edge1,Edge2), label(X,Edge1,L), not label(X,Edge2,L), not label_lt(X,Edge1,Edge2,L).

The set of rules below imposes that lower edge indices cannot be left unused. First, we define a fact for each possible edge index between 1 and κ. Remember that κ is the maximum number of edges from one state to another. Second, the constraint indicates that if there is an edge from X to Y with edge index E where E > 1, then there must also be an edge between the same states with edge index E − 1.

edge_id(1..κ).
:- ed(X,Y,E), not ed(X,Y,E-1), edge_id(E), E > 1.

Finally, we introduce a set of rules for enforcing a BFS traversal of the subgoal automaton. Firstly, like in the SAT-based approach, we use a predicate ed_sb(X,Y,E) which is grounded for all the edges except for those directed to a state without an index (the accepting and rejecting states):

ed_sb(X,Y,E) :- ed(X,Y,E), state_id(Y,_).

We now introduce a predicate pa(X,Y) denoting that state X is the parent of Y in the BFS subtree. Note that it is equivalent to the variable pa(i,j) from the SAT encoding. The set of rules below uses this predicate to enforce the BFS ordering. The first rule defines that state X is the parent of Y if there is an edge from X to Y, X has a lower index than Y, and there is no state Z whose index is lower than X's and that has an edge to Y. The second rule indicates that all states with an index, except for the initial state, must have a parent. The third rule imposes the BFS ordering similarly to Clause 4 in the SAT encoding. Note that the first two rules encode Clauses 1 and 2 of the SAT encoding.

pa(X,Y) :- ed_sb(X,Y,_), state_lt(X,Y), false : ed_sb(Z,Y,_), state_lt(Z,X).
:- state_id(Y,YID), YID > 0, not pa(_,Y).
:- pa(X,Y), ed_sb(XP,YP,_), state_lt(XP,X), state_leq(Y,YP).

Note that, in the first rule, what follows false : must not hold in order to make the body of the rule true.
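The false : construct above is a conditional literal. As a sketch (this is not the authors' exact source, and the auxiliary predicate has_lower_pred is hypothetical), the first rule can be written in plain clingo syntax by materializing the condition with an auxiliary predicate:

% Some state Z with a lower index than X also has an edge to Y.
has_lower_pred(Y,X) :- ed_sb(Z,Y,_), state_lt(Z,X).
% X is the parent of Y if X has an edge to Y, a lower index than Y,
% and no state with an even lower index has an edge to Y.
pa(X,Y) :- ed_sb(X,Y,_), state_lt(X,Y), not has_lower_pred(Y,X).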
Now we need to enforce that the BFS children of a given state are correctly ordered; that is, children pointed to by lower edges should be identified by lower state indices. We use the state_ord(X) predicate to indicate that state X is properly ordered with respect to its siblings (i.e., other states with the same parent state). The set of rules below enforces this ordering. The first rule defines that a state Y is correctly ordered with respect to its siblings if there is no state YP with a higher index than Y whose edge (YP, EP) from the parent X is lower than the edge (Y, E) to Y; that is, the edges from the parent must follow the order of the state indices. The second rule enforces that all states with a parent must be correctly ordered with respect to their siblings.

state_ord(Y) :- ed_sb(X,Y,E), pa(X,Y), false : ed_sb(X,YP,EP), state_lt(Y,YP), ed_lt(X,(YP,EP),(Y,E)).
:- pa(_,Y), not state_ord(Y).

Appendix C. Additional Experimental Results

In Section 8.2 we qualitatively described the effect that two different MDP sets have on the automaton learning process. The automaton learning statistics were reported there for only one of the two sets since the results are (in general) similar; this appendix reports the analogous results for the other set. Table 15 shows the average automata learning time, whereas Table 16 contains the average number of examples needed to learn an automaton for different subsets of tasks and maximum episode lengths (N). Table 17 shows the average example length of each kind of trace for different values of N. Table 18 contains the average number of examples needed to learn the last automaton in a specific setting. Finally, Figure 12 displays the average learning curves for different combinations of MDP subsets and values of N.

Table 15: Total automata learning time in seconds for different combinations of MDP sets and maximum episode lengths (100, 250, 500).

Table 16: Number of examples needed to learn the last automaton for different combinations of MDP sets and maximum episode lengths (100, 250, 500).

Table 17: Example length of the goal, dead-end and incomplete examples used to learn the last automaton.

Table 18: Number of goal, dead-end and incomplete examples used to learn the last automaton in the N = 250 setting.

Figure 12: Learning curves for different combinations of MDP sets and maximum episode lengths (100, 250, 500).