Learning Portable Representations for High-Level Planning
Steven James and Benjamin Rosman
University of the Witwatersrand, Johannesburg, South Africa
{steven.james, benjamin.rosman1}@wits.ac.za
George Konidaris
Brown University, Providence RI 02912, USA
[email protected]
Abstract
We present a framework for autonomously learning a portable representation that describes a collection of low-level continuous environments. We show that these abstract representations can be learned in a task-independent egocentric space specific to the agent and that, when grounded with problem-specific information, they are provably sufficient for planning. We demonstrate transfer in two different domains, where an agent learns a portable, task-independent symbolic vocabulary, as well as rules expressed in that vocabulary, and then learns to instantiate those rules on a per-task basis. This reduces the number of samples required to learn a representation of a new task.
1 Introduction

A major goal of artificial intelligence is creating agents capable of acting effectively in a variety of complex environments. Robots, in particular, face the difficult task of generating behaviour while sensing and acting in high-dimensional and continuous spaces. Decision-making at this level is typically infeasible—the robot's innate action space involves directly actuating motors at a high frequency, but it would take thousands of such actuations to accomplish most useful goals. Similarly, sensors provide high-dimensional signals that are often continuous and noisy. Hierarchical reinforcement learning [Barto and Mahadevan, 2003] tackles this problem by abstracting away the low-level action space using higher-level skills, which can accelerate learning and planning. Skills alleviate the problem of reasoning over low-level actions, but the state space remains unchanged; efficient planning may therefore also require state space abstraction.

One approach is to build a state abstraction of the environment that supports planning. Such representations can then be used as input to task-level planners, which plan using much more compact abstract state descriptors. This mitigates the issue of reward sparsity and admits solutions to long-horizon tasks, but raises the question of how to build the appropriate abstract representation of a problem. This is often resolved manually, requiring substantial effort and expertise.

Fortunately, recent work demonstrates how to learn a provably sound symbolic representation autonomously, given only the data obtained by executing the high-level actions available to the agent [Konidaris et al., 2018]. A major shortcoming of that framework is the lack of generalisability—the learned symbols are grounded in the current task, so an agent must relearn the appropriate representation for each new task it encounters (see Figure 1). This is a data- and computation-intensive procedure involving clustering, probabilistic multi-class classification, and density estimation in high-dimensional spaces, and requires repeated execution of actions within the environment.
Figure 1: An illustration of the shortcomings of learning task-specific state abstractions [Konidaris et al., 2018]. (a) An agent (represented by a red circle) learns a distribution over states (x, y, θ tuples, describing its position in a room) in which it can interact with a door; the distribution covers the positions from which the agent is able to interact with the door. (b) However, this distribution cannot be reused in a new room with a differing layout; in the new task, the learned distribution is no longer useful since the door's location has changed.

The contribution of this work is twofold. First, we introduce a framework for deriving a symbolic abstraction over an egocentric state space [Agre and Chapman, 1987, Guazzelli et al., 1998, Finney et al., 2002, Konidaris et al., 2012]. Because such state spaces are relative to the agent, they provide a suitable avenue for representation transfer. However, these abstractions are necessarily non-Markov, and so are insufficient for planning. Our second contribution is thus to prove that the addition of very particular problem-specific information (learned autonomously from the task) to the portable abstractions results in a representation that is sufficient for planning. This combination of portable abstractions and task-specific information results in lifted action rules that are preserved across tasks, but which have parameters that must be instantiated on a per-task basis.

We describe our framework using a simple toy domain, and then demonstrate successful transfer in two domains. Our results show that an agent is able to learn symbols that generalise to tasks with different dynamics, reducing the experience required to learn a representation of a new task.
2 Background

We assume that the tasks faced by an agent can be modelled as a semi-Markov decision process (SMDP) M = ⟨S, O, T, R⟩, where S ⊆ R^n is the n-dimensional continuous state space and O(s) is the set of temporally-extended actions known as options available to the agent at state s. The reward function R(s, o, τ, s') specifies the feedback the agent receives from the environment when it executes option o from state s and arrives in state s' after τ steps. T describes the dynamics of the environment, specifying the probability of arriving in state s' after option o is executed from s for τ timesteps: T^o_{ss'} = Pr(s', τ | s, o). An option o is defined by the tuple ⟨I_o, π_o, β_o⟩, where I_o = {s | o ∈ O(s)} is the initiation set that specifies the states in which the option can be executed, π_o is the option policy which specifies the action to execute, and β_o is the termination condition, where β_o(s) is the probability of option o halting in state s.

We assume that tasks are related because they are faced by the same agent [Konidaris et al., 2012]. For example, consider a robot equipped with various sensors that is required to perform a number of as yet unspecified tasks. The only aspect that remains constant across all these tasks is the presence of the robot, and more importantly its sensors, which map the state space S to a portable, lossy, egocentric observation space D. To differentiate, we refer to S as problem space [Konidaris and Barto, 2007].

Augmenting an SMDP with this egocentric data produces the tuple M_i = ⟨S_i, O_i, T_i, R_i, D⟩ for each task i, where the egocentric observation space D remains constant across all tasks. We can use D to define portable options, whose option policies, initiation sets and termination conditions are all defined egocentrically. Because D remains constant regardless of the underlying SMDP, these options can be transferred across tasks [Konidaris and Barto, 2007]. Egocentric state spaces have also been adopted by recent reinforcement learning frameworks, such as VizDoom [Kempka et al., 2016], Minecraft [Johnson et al., 2016] and Deepmind Lab [Beattie et al., 2016].
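To make these definitions concrete, the sketch below shows one way a portable option and the collection of agent-space transition data might be represented in code. The class, the `sensor` mapping and the `env.step` interface are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable
import numpy as np

@dataclass
class PortableOption:
    """A temporally-extended action whose initiation set, policy and termination
    condition are all defined over egocentric observations in D, so the option
    can be reused across tasks whose problem spaces S differ."""
    name: str
    initiation: Callable[[np.ndarray], bool]    # I_o: can the option start from observation d?
    policy: Callable[[np.ndarray], np.ndarray]  # pi_o: low-level action to take at observation d
    termination: Callable[[np.ndarray], float]  # beta_o(d): probability of halting at d

def collect_transition(env, sensor, option, rng=None):
    """Execute one option to termination and return the egocentric observations
    at the start and end of execution (one agent-space transition sample).
    `env.step(action) -> next_state` and `sensor(state) -> observation` are
    assumed interfaces."""
    rng = rng or np.random.default_rng()
    d = sensor(env.state)                       # map problem-space state s in S to observation d in D
    assert option.initiation(d), "option executed outside its initiation set"
    start = d
    while True:
        d = sensor(env.step(option.policy(d)))
        if rng.random() < option.termination(d):
            return start, d
```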
Deepmind Lab [Beattie et al., 2016]. .2 Abstract Representations We wish to learn an abstract representation to facilitate planning. We define a probabilistic plan p Z = { o , . . . , o n } to be the sequence of options to be executed, starting from some state drawnfrom distribution Z . It is useful to introduce the notion of a goal option , which can only be executedwhen the agent has reached its goal. Appending this option to a plan means that the probability ofsuccessfully executing a plan is equivalent to the probability of reaching some goal.A representation suitable for planning must allow us to calculate the probability of a given plansuccessfully executing to completion. As a plan is simply a chain of options, it is therefore necessary(and sufficient) to learn when an option can be executed, as well as the outcome of doing so [Konidariset al., 2018]. This corresponds to learning the precondition Pre ( o ) = Pr( s ∈ I o ) , which expresses theprobability that option o can be executed at state s ∈ S , and the image Im ( Z, o ) , which represents thedistribution of states an agent may find itself in after executing o from states drawn from distribution Z . An illustration of this is provided in the supplementary material.In general, we cannot model the image for an arbitrary option; however, we can do so for a subclassknown as subgoal options [Precup, 2000], whose terminating states are independent of their startingstates [Konidaris et al., 2018]. That is, for any subgoal option o , Pr( s (cid:48) | s, o ) = Pr( s (cid:48) | o ) . We canthus substitute the option’s image for its effect : Eff ( o ) = Im ( Z, o ) ∀ Z .Subgoal options are not overly restrictive, since they refer to options that drive an agent to some set ofstates with high reliability, which is a common occurrence in robotics owing to the use of closed-loopcontrollers. Nonetheless, it is likely an option may not be subgoal. It is often possible, however, to partition an option’s initiation set into a finite number of subsets, so that it possesses the subgoalproperty when initiated from each of the individual subsets. That is, we partition an option’s startstates into classes C such that Pr( s (cid:48) | s, c ) ≈ Pr( s (cid:48) | c ) , c ∈ C (see Figure 2 in the supplementarymaterial). This can be practically achieved by clustering state transition samples based on effectstates, and assigning each cluster to a partition. For each pair of partitions we then check whethertheir start states overlap significantly, and if so merge them, which accounts for probabilistic effects[Andersen and Konidaris, 2017, Konidaris et al., 2018, Ames et al., 2018].Once we have partitioned subgoal options, we estimate the precondition and effect for each. Estimat-ing the precondition is a classification problem, while the effect is one of density estimation. Finally,for all valid combinations of effect distributions, we construct a forward model by computing theprobability that states drawn from their grounding lies within the learned precondition of each option,discarding rules with low probability of occurring. To aid in explanation, we use of a simple continuous task where a robot navigates the buildingillustrated in Figure 2a. The problem space is the xy -coordinates of the robot, while we use anegocentric view of the environment (nearby walls and windows) around the agent for transfer. Theseobservations are illustrated in Figure 2. 
To aid in explanation, we use a simple continuous task where a robot navigates the building illustrated in Figure 2a. The problem space is the xy-coordinates of the robot, while we use an egocentric view of the environment (nearby walls and windows) around the agent for transfer. These observations are illustrated in Figure 2.

Figure 2: (a) A continuous navigation task where an agent navigates between different regions in xy-space. Walls are represented by grey lines, while the two white bars represent windows. Arrows describe the agent's options. (b–d) Local egocentric observations. We name these window-junction, dead-end and wall-junction respectively.

The robot is equipped with options to move between different regions of the building, halting when it reaches the start or end of a corridor. It possesses the following four options: (a) Clockwise and Anticlockwise, which move the agent in a clockwise or anticlockwise direction respectively, (b) Outward, which moves the agent down a corridor away from the centre of the building, and (c) Inward, which moves it towards the centre.

We could adopt the approach of Konidaris et al. [2018] to learn an abstract representation using transition data in S. However, that procedure generates symbols that are distributions over xy-coordinates, and are thus tied directly to the particular problem configuration. If we were to simply translate the environment along the plane, the xy-coordinates would be completely different, and our learned representation would be useless.

To overcome that limitation, we propose learning a symbolic representation over D, instead of S. Transfer can be achieved in this manner because D remains consistent both within the same SMDP and across SMDPs, even if the state space or transition function do not.

Given only data produced by sensors, the agent proceeds to learn an abstract representation, identifying three portable symbols, which are exactly those illustrated by Figure 2. The learned rules are listed in Table 1, where it is clear that naïvely considering egocentric observations alone is insufficient for planning purposes: the agent does not possess an option with probabilistic outcomes, but the Inward option appears to have probabilistic effects due to aliasing.

Table 1: A list of the six subgoal options, specifying their preconditions and effects in agent space.
Option            Precondition                       Effect
Clockwise1        wall-junction                      window-junction
Clockwise2        window-junction                    wall-junction
Anticlockwise1    wall-junction                      window-junction
Anticlockwise2    window-junction                    wall-junction
Outward           wall-junction ∨ window-junction    dead-end
Inward            dead-end                           window-junction or wall-junction (probabilistic)

A further challenge appears when the goal of the task is defined in S. If we have goal G ⊆ S, then given information from D, we cannot determine whether we have achieved the goal. This follows from the fact that the egocentric observations are lossy—two states s, t ∈ S may produce the same egocentric observation d, but if s ∈ G and t ∉ G, the knowledge of d alone is insufficient to determine whether we have entered a state in G. We therefore require additional information to disambiguate such situations, allowing us to map from egocentric observations back into S.

We can accomplish this by partitioning our portable options based on their effects in S. This necessitates having access to both state and egocentric observations. Recall that options are partitioned to ensure the subgoal property holds, and so each partition defines its own unique image distribution. If we label each problem-space partition, then each label refers to a unique distribution in S and is sufficient for disambiguating our egocentric symbols. Figure 3 annotates the domain with labels according to their problem-space partitions. Note that the partition numbers are completely arbitrary.

Generating agent-space symbols results in lifted symbols such as dead-end(X), where dead-end is the name for a distribution over D, and X is a partition number that must be determined on a per-task basis. Note that the only time problem-specific information is required is to determine the values of X, which grounds the portable symbol in the current task.

The following result shows that the combination of agent-space symbols with problem-space partition numbers provides a sufficient symbolic vocabulary for planning. (The proof is given in the supplementary material.)
Theorem 1. The ability to represent the preconditions and image of each option in egocentric space, together with goal G's precondition in problem space and partitioning in S, is sufficient for determining the probability of being able to execute any probabilistic plan p from starting distribution Z.

Figure 3: Each number refers to the initiation set of an option partitioned in problem space. For readability, we merge identical partitions. For instance, a single label may refer to the initiation sets of a single problem-space partition of Outward, Clockwise and Anticlockwise.

Our approach can be viewed as a two-step process. The first phase learns portable symbolic rules using egocentric transition data from possibly several tasks, while the second phase uses problem-space transitions from the current task to partition options in S. The partition labels are then used as parameters to ground the previously-learned portable rules in the current task. We use these labels to learn linking functions that connect precondition and effect parameters: for example, when the precondition parameter of Anticlockwise2 takes a particular partition label, its effect must take the corresponding label.
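A minimal sketch of how such linking functions might be estimated from experience is shown below; it follows the count-based description given after Figure 4, but the data structures and names are our own.

```python
from collections import defaultdict

def learn_linking_functions(transitions):
    """Estimate, for each portable option, the fraction of transitions that move
    the agent from one problem-space partition label to another.

    `transitions` is an iterable of (option_name, start_label, end_label) triples
    obtained by executing options and recording the partition labels in S."""
    counts = defaultdict(lambda: defaultdict(float))
    totals = defaultdict(float)
    for option, start, end in transitions:
        counts[(option, start)][end] += 1.0
        totals[(option, start)] += 1.0
    # Normalise counts into Pr(end label | option, start label).
    return {key: {end: c / totals[key] for end, c in ends.items()}
            for key, ends in counts.items()}

# Example: ground a lifted rule such as dead-end(X) by looking up which partition
# labels the Inward option links in this particular task (labels are illustrative).
links = learn_linking_functions([
    ("Inward", 3, 1), ("Inward", 3, 2), ("Inward", 3, 1),
])
print(links[("Inward", 3)])   # -> {1: 0.666..., 2: 0.333...}
```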
Figure 4 illustrates this grounding process.

Figure 4: The full process of learning portable representations from data: given transition data collected by executing options, we partition into subgoal options, estimate preconditions and effects, and generate an abstract forward model; in parallel, we partition options based on their effects in S, learn the transitions between partition labels under each option, and ground the portable rules using the partition labels for preconditions and effects. Red nodes are learned using egocentric data from all previously encountered tasks, while green nodes use problem-space data from the current task only.

These linking functions are learned by simply executing options and recording the start and end partition labels of each transition. We use a simple count-based approach that, for each option, records the fraction of transitions from one partition label to another. A more precise description of this approach is specified in the supplementary material.

A combination of portable rules and partition numbers reduces planning to a search over the space Σ × N, where Σ is the set of generated symbols. Alternatively (and equivalently), we can generate either a factored MDP or a PPDDL representation [Younes and Littman, 2004]. To generate the latter, we use a function named partition to store the current partition number and specify predicates for the three symbols derived in the previous sections: window-junction, dead-end and wall-junction. The full domain description is provided in the supplementary material.
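To illustrate what emitting such a description might look like, the helper below assembles a deterministic PDDL-style action string in the same spirit as the operators shown later in Figures 6 and 7. The function, the symbol names, and the example grounding are ours, not part of the released system; probabilistic effects would additionally need PPDDL's (probabilistic ...) construct.

```python
def to_pddl_action(name, preconditions, add_effects, delete_effects):
    """Assemble a deterministic PDDL action string from a grounded rule."""
    pre = " ".join(f"({p})" for p in preconditions + ["notfailed"])
    eff = " ".join([f"({e})" for e in add_effects] +
                   [f"(not ({e}))" for e in delete_effects])
    return (f"(:action {name}\n"
            f"  :parameters ()\n"
            f"  :precondition (and {pre})\n"
            f"  :effect (and {eff}))")

# A hypothetical grounded version of the Inward rule from Table 1:
print(to_pddl_action("Inward_1",
                     preconditions=["dead_end_3"],
                     add_effects=["wall_junction_1"],
                     delete_effects=["dead_end_3"]))
```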
In our example, it is not clear why one would want to learn portable symbolic representations—we perform symbol acquisition in D and instantiate the rules for the given task, which requires more computation than directly learning symbols in S. We now demonstrate the advantage of doing so by learning portable models of two different domains, both of which feature continuous state spaces and probabilistic transition dynamics.

5.1 Rod-and-Block

We construct a domain we term Rod-and-Block, in which a rod is constrained to move along a track. The rod can be rotated into an upward or downward position, and a number of blocks are arranged to impede the rod's movement. Two walls are also placed at either end of the track. One such task configuration is illustrated by Figure 5.

Figure 5: The Rod-and-Block domain. This particular task consists of three obstacles that prevent the rod from moving along the track when the rod is in either the upward or downward position. Different tasks are characterised by different block placements.

The problem space consists of the rod's angle and its x position along the track. Egocentric observations return the types of objects that are in close proximity to the rod, as well as its angle. In Figure 5, for example, there is a block to the left of the rod, which has an angle of π. The high-level options given to the agent are GoLeft, GoRight, RotateUp, and RotateDown. The first two translate the rod along the rail until it encounters a block or wall while maintaining its angle. The remaining options rotate the rod into an upwards or downwards position, provided it does not collide with an object. These rotations can be done in both a clockwise and anti-clockwise direction.

We learn a symbolic representation using egocentric transitions only, using the same procedure as prior work [Konidaris et al., 2018]: first, we collect agent-space transitions by interacting with the environment. We then partition the options in agent space using the DBSCAN clustering algorithm [Ester et al., 1996] so that the subgoal property approximately holds. This produces partitioned agent-space options. Finally, we estimate the options' preconditions using a support vector machine with Platt scaling [Cortes and Vapnik, 1995, Platt, 1999], and use kernel density estimation [Rosenblatt, 1956, Parzen, 1962] to model effect distributions.
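The estimation step maps naturally onto standard library components; a minimal sketch of how it might be implemented is shown below, with illustrative parameter choices rather than the settings used in our experiments.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KernelDensity

def estimate_partition_model(positive_starts, negative_starts, effect_states, bandwidth=0.1):
    """Fit the precondition and effect of one partitioned agent-space option.

    The precondition is a probabilistic classifier over egocentric observations
    (an SVM with Platt scaling, enabled via probability=True); the effect is a
    kernel density estimate over the observations in which the option terminated."""
    X = np.vstack([positive_starts, negative_starts])
    y = np.concatenate([np.ones(len(positive_starts)), np.zeros(len(negative_starts))])

    precondition = SVC(probability=True).fit(X, y)                    # Pr(d in I_o)
    effect = KernelDensity(bandwidth=bandwidth).fit(effect_states)    # density over terminating observations
    return precondition, effect

# Usage sketch: probability the option can run from observation d, and effect samples.
# pre, eff = estimate_partition_model(pos, neg, effs)
# p_execute = pre.predict_proba(d.reshape(1, -1))[0, 1]
# next_obs = eff.sample(10)
```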
The above procedure results in portable action rules, one of which is illustrated by Figure 6. These rules can be reused for new tasks or configurations of the Rod-and-Block domain—we need not relearn them when we encounter a new task, though we can always use data from a new task to improve them. More portable rules are given in the supplementary material.

Figure 6: (a) The precondition for the RotateUpClockwise1 operator, which states that in order to execute the option, the rod must be left of a wall facing down. Note that the precondition is a conjunction of these two symbols—the first symbol is a distribution over the rod's angle only, while the second is independent of it. (b) The effect of the option, with the rod adjacent to the wall in an upward position. (c) PDDL description of the above operator, which is used for planning: (:action UpClockwise_1 :parameters () :precondition (and (symbol_18) (symbol_11) (notfailed)) :effect (and (symbol_12) (not (symbol_18)))).

Once we have learned sufficiently accurate portable rules, the rules need only be instantiated for the given task by learning the linking between partitions. This requires far fewer samples than classification and density estimation over the state space S, which is required to learn a task-specific representation.

To illustrate this, we construct a set of ten tasks ρ_1, . . . , ρ_10 by randomly selecting the number of blocks, and then randomly positioning them along the track. Because tasks have different configurations, constructing a symbolic representation in problem space requires relearning a model of each task from scratch. However, when constructing an egocentric representation, symbols learned in one task can immediately be used in subsequent tasks. We gather k transition samples from each task by executing options uniformly at random, and use these samples to build both task-specific and egocentric (portable) models.

In order to evaluate a model's accuracy, we randomly select 100 goal states for each task, as well as the optimal plans for reaching each from some start state. Each plan consists of two options, and we denote a single plan by the tuple ⟨s_1, o_1, s_2, o_2⟩. Let M^{ρ_i}_k be the model constructed for task ρ_i using k samples. We calculate the likelihood of each optimal plan under the model: Pr(s_1 ∈ I_{o_1} | M^{ρ_i}_k) × Pr(s' ∈ I_{o_2} | M^{ρ_i}_k), where s' ∼ Eff(o_1). We build models using increasing numbers of samples, varying the number of samples in steps of 250, until the likelihood averaged over all plans is greater than some acceptable threshold, at which point we continue to the next task. The results are given by Figure 8a.
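A sketch of this evaluation, under the assumption that preconditions expose predict_proba-style probabilities and effects can be sampled (as in the estimation sketch above), might look as follows.

```python
import numpy as np

def plan_likelihood(model, plan_start, option_1, option_2, n_samples=100):
    """Estimate the probability that a two-option plan executes to completion:
    Pr(s1 in I_o1) times the expected probability, over effect samples of o1,
    that the resulting state lies in I_o2."""
    pre1, eff1 = model[option_1]          # model: option name -> (precondition, effect) pair
    pre2, _ = model[option_2]
    p_first = pre1.predict_proba(plan_start.reshape(1, -1))[0, 1]
    successors = eff1.sample(n_samples)   # s' ~ Eff(o1)
    p_second = pre2.predict_proba(successors)[:, 1].mean()
    return p_first * p_second
```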
5.2 Treasure Game

We next applied our approach to the Treasure Game, where an agent navigates a continuous maze in search of treasure. The domain contains ladders and doors which impede the agent. Some doors can be opened and closed with levers, while others require a key to unlock.

The problem space consists of the xy-position of the agent, key and treasure, the angle of the levers (which determines whether a door is open) and the state of the lock. The egocentric space is a vector of length 9, the elements of which are the types of sprites in each of the nine directions around the agent, plus the "bag" of items possessed by the agent. The agent possesses a number of high-level options, such as GoLeft and DownLadder. More details are given by Konidaris et al. [2018].

We construct a set of ten tasks ρ_1, . . . , ρ_10 corresponding to different levels of the Treasure Game, and learn portable models and test their sample efficiency as in Section 5.1. An example of a portable action rule, as well as its problem-space partitioning, is given by Figure 7, while the number of samples required to learn a good model of all 10 levels is given by Figure 8b.

Figure 7: (a) The precondition (top) and positive effect (bottom) for the DownLadder operator, which states that in order to execute the option, the agent must be standing above the ladder. The option results in the agent standing on the ground below it. The black spaces refer to unchanged low-level state variables. (b) Three problem-space partitions for the DownLadder operator. Each of the circled partitions is assigned a unique label and combined with the portable rule in (a) to produce a grounded operator. (c) The PDDL representation of the operator specified in (a): (:action DownLadder_1 :parameters () :precondition (and (symbol_80) (notfailed)) :effect (and (symbol_622) (not (symbol_80)))).
Naturally, learning problem-space symbols results in a sample complexity that scales linearly with the number of tasks, since we must learn a model for each new task from scratch. Conversely, by learning and reusing portable symbols, we can reduce the number of samples we require as we encounter more tasks, leading to a sublinear increase. The number of samples the agent requires per Rod-and-Block configuration drops markedly after only two tasks; similarly, the per-level sample requirement for the Treasure Game falls substantially after four levels, and again after seven. We made no effort to design tasks in a curriculum-like fashion. The levels are given in the supplementary material.
Rod-and-Block domain. N u m b e r S a m p l e s R e q u i r e d ( C u m u l a t i v e ) Portable SymbolsTask-Specific Symbols (b) Results for the
Treasure Game domain.
Figure 8: Cumulative number of samples required to learn sufficiently accurate models as a functionof the number of tasks encountered. Results are averaged over 100 random permutations of the taskorder. Standard errors are specified by the shaded areas.Intuitively, one might expect the number of samples to plateau as the agent observes more tasks. Thatwe do not is a result of the exploration policy—the agent must observe all relevant partitions at leastonce, and selecting actions uniformly at random is naturally suboptimal. Nonetheless, we still requirefar fewer samples to learn the links between partitions than does learning a full model from scratch.In both of our experiments, we construct a set of 10 domain configurations and then test our approachby sampling 100 goals for each, for a total of 1000 tasks per domain. Our model-based approachlearns 10 forward models, and then uses them to plan a sequence of actions to achieve each goal. Bycontrast, a model-free approach [Jonschkowski and Brock, 2015, Higgins et al., 2017, Kirkpatricket al., 2017, Finn et al., 2017, de Bruin et al., 2018] would be required to learn all 1000 policies, sinceevery goal defines another unique SMDP that must be solved.
6 Related Work

There has been some work in autonomously learning parameterised representations of skills, particularly in the field of relational reinforcement learning. Finney et al. [2002], Pasula et al. [2004] and Zettlemoyer et al. [2005], for instance, learn operators that transfer across tasks. However, the high-level symbolic vocabulary is given; we show how to learn it. Ames et al. [2018] adopt a similar approach to Konidaris et al. [2018] to learn symbolic representations for parameterised actions. However, the representation learned is fully propositional (even if the actions are not) and cannot be transferred across tasks.
Relocatable action models [Sherstov and Stone, 2005, Leffler et al., 2007] assume states can be aggregated into "types" which determine the transition behaviour. State-independent representations of the outcomes from different types are learned and improve the learning rate in a single task. However, the mapping from lossy observations to states is provided to the agent, since learning this mapping is as hard as learning the full MDP.

More recently, Zhang et al. [2018] propose a method for constructing portable representations for planning. However, the mapping to abstract states is provided, and planning is restricted solely to the equivalent of an egocentric space. Similarly, Srinivas et al. [2018] learn a goal-directed latent space in which planning can occur. However, the goal must be known up front and be expressible in the latent space. We do not compare to either, since both are unsuited to tasks with goals defined in problem space, and neither provides soundness guarantees.
7 Conclusion

We have introduced a framework for autonomously learning portable symbols given only data gathered from option execution, and showed that the addition of particular problem-specific information results in a representation that is provably sufficient for learning a sound representation for planning. This allows us to leverage experience in solving new unseen tasks—an important step towards creating adaptable, long-lived agents.

References
A.G. Barto and S. Mahadevan. Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13(4):341–379, 2003.

G.D. Konidaris, L.P. Kaelbling, and T. Lozano-Pérez. From skills to symbols: Learning symbolic representations for abstract high-level planning. Journal of Artificial Intelligence Research, 61(January):215–289, 2018.

P.E. Agre and D. Chapman. Pengi: An implementation of a theory of activity. In Proceedings of the Sixth National Conference on Artificial Intelligence, volume 87, pages 268–272, 1987.

A. Guazzelli, M. Bota, F.J. Corbacho, and M.A. Arbib. Affordances, motivations, and the world graph theory. Adaptive Behavior, 6(3-4):435–471, 1998.

S. Finney, N.H. Gardiol, L.P. Kaelbling, and T. Oates. The thing that we tried didn't work very well: deictic representation in reinforcement learning. In Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence, pages 154–161, 2002.

G.D. Konidaris, I. Scheidwasser, and A.G. Barto. Transfer in reinforcement learning via shared features. Journal of Machine Learning Research, 13(May):1333–1371, 2012.

M. Kempka, M. Wydmuch, G. Runc, J. Toczek, and W. Jaśkowski. VizDoom: A Doom-based AI research platform for visual reinforcement learning. In Proceedings of the IEEE Conference on Computational Intelligence and Games, pages 1–8. IEEE, 2016.

M. Johnson, K. Hofmann, T. Hutton, and D. Bignell. The Malmo platform for artificial intelligence experimentation. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, pages 4246–4247, 2016.

C. Beattie, J.Z. Leibo, D. Teplyashin, T. Ward, M. Wainwright, H. Küttler, A. Lefrancq, S. Green, V. Valdés, A. Sadik, et al. DeepMind Lab. arXiv preprint arXiv:1612.03801, 2016.

G.D. Konidaris and A.G. Barto. Building portable options: skill transfer in reinforcement learning. In Proceedings of the Twentieth International Joint Conference on Artificial Intelligence, volume 7, pages 895–900, 2007.

D. Precup. Temporal Abstraction in Reinforcement Learning. PhD thesis, University of Massachusetts Amherst, 2000.

G. Andersen and G.D. Konidaris. Active exploration for learning symbolic representations. In Advances in Neural Information Processing Systems, pages 5016–5026, 2017.

B. Ames, A. Thackston, and G.D. Konidaris. Learning symbolic representations for planning with parameterized skills. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2018.

H.L.S. Younes and M.L. Littman. PPDDL 1.0: An extension to PDDL for expressing planning domains with probabilistic effects. Technical report, 2004.

M. Ester, H. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD), pages 226–231, 1996.

C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.

J. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3):61–74, 1999.

M. Rosenblatt. Remarks on some nonparametric estimates of a density function. The Annals of Mathematical Statistics, pages 832–837, 1956.

E. Parzen. On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 33(3):1065–1076, 1962.

R. Jonschkowski and O. Brock. Learning state representations with robotic priors. Autonomous Robots, 39(3):407–428, 2015.

I. Higgins, A. Pal, A. Rusu, L. Matthey, C. Burgess, A. Pritzel, M. Botvinick, C. Blundell, and A. Lerchner. DARLA: Improving zero-shot transfer in reinforcement learning. In International Conference on Machine Learning, pages 1480–1490, 2017.

J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A.A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, page 201611835, 2017.

C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pages 1126–1135, 2017.

T. de Bruin, J. Kober, K. Tuyls, and R. Babuška. Integrating state representation learning into deep reinforcement learning. IEEE Robotics and Automation Letters, 3(3):1394–1401, 2018.

H. Pasula, L.S. Zettlemoyer, and L.P. Kaelbling. Learning probabilistic relational planning rules. In Proceedings of the Fourteenth International Conference on Automated Planning and Scheduling, pages 73–81, 2004.

L.S. Zettlemoyer, H. Pasula, and L.P. Kaelbling. Learning planning rules in noisy stochastic worlds. In Proceedings of the Twentieth National Conference on Artificial Intelligence, pages 911–918, 2005.

A.A. Sherstov and P. Stone. Improving action selection in MDPs via knowledge transfer. In Proceedings of the Twentieth National Conference on Artificial Intelligence, volume 5, pages 1024–1029, 2005.

B.R. Leffler, M.L. Littman, and T. Edmunds. Efficient reinforcement learning with relocatable action models. In Proceedings of the Twenty-Second AAAI Conference on Artificial Intelligence, volume 7, pages 572–577, 2007.

A. Zhang, A. Lerer, S. Sukhbaatar, R. Fergus, and A. Szlam. Composable planning with attributes. In International Conference on Machine Learning, pages 5842–5851, 2018.

A. Srinivas, A. Jabri, P. Abbeel, S. Levine, and C. Finn. Universal planning networks: Learning generalizable representations for visuomotor control. In International Conference on Machine Learning, 2018.