PLOTS: Procedure Learning from Observations using Subtask Structure
Tong Mu
Department of Electrical Engineering, Stanford University, [email protected]
Karan Goel
Department of Computer Science, Stanford University, [email protected]
Emma Brunskill
Department of Computer Science, Stanford University, [email protected]
ABSTRACT
In many cases an intelligent agent may want to learn how to mimic a single observed demonstrated trajectory. In this work we consider how to perform such procedural learning from observation, which could help to enable agents to better use the enormous set of video data on observation sequences. Our approach exploits the properties of this setting to incrementally build an open loop action plan that can yield the desired subsequence, and can be used in both Markov and partially observable Markov domains. In addition, procedures commonly involve repeated extended temporal action subsequences. Our method optimistically explores actions to leverage potential repeated structure in the procedure. In comparing to some state-of-the-art approaches we find that our explicit procedural learning from observation method is about 100 times faster than policy-gradient based approaches that learn a stochastic policy, and is faster than model based approaches as well. We also find that performing optimistic action selection yields substantial speed ups when latent dynamical structure is present.
KEYWORDS
Reinforcement Learning; Learning from Demonstration; Behavior Cloning; Hierarchy
ACM Reference Format:
Tong Mu, Karan Goel, and Emma Brunskill. 2019. PLOTS: Procedure Learning from Observations using Subtask Structure. In Proc. of the 18th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2019), Montreal, Canada, May 13–17, 2019, IFAAMAS, 9 pages.
An incredible feature of human intelligence is our ability to imitate behavior, such as a procedure, simply by observing it. Whether watching someone perform CPR or observing a chef cook an omelette, people can learn to mimic such demonstrations with relative ease. While there has been extensive interest in learning from demonstration, particularly for robotics, this work typically assumes access to the demonstrator's actions and resulting impacts on the environment (observations). In contrast, there exist orders of magnitude more demonstration data that only contain the observation trajectories but not the actions – we see the result of the motor commands when cracking an egg, but not the motor commands themselves.

In this paper we focus on how an agent can efficiently learn to match a single observation sequence, which we call procedure
learning from observation. The agent has access to a simulator of the environment, and must efficiently learn to match the demonstrated behavior. For this to be possible, the dynamics of the underlying domain must be deterministic, at least at the level of the observation sequence (in other words, there must exist at least one action sequence that can deterministically achieve the demonstrated observation trajectory). There are many cases in which we would like an agent to perform such procedure learning from a single demonstration, e.g. to learn a recipe, play a musical piece, swing a golf club or fold a shirt. Often, such procedures themselves involve repeated substructure in the necessary action sequence, where subsequences of actions are repeated several times. Cracking a series of eggs to make an omelette or performing multiple rounds of chest compressions during CPR are examples of this. Our algorithm leverages the structure of such settings to improve the data-efficiency of an agent learning to mimic the desired observation sequence.

Procedure learning from a single demonstrated observation trajectory relates to two recent research threads. The prior work on learning from observations [13, 18, 22, 25] has focused on learning generalizable conditional policies. In contrast we focus on building an open-loop action plan (which must be sufficient to enable optimal behavior in procedural imitation), and find this can drastically reduce the amount of experience needed for an agent to learn. Other work has sought to leverage provided policy sketches [4], weak supervision of structure in the decision policy, in order to speed and/or improve learning in the multi-task reinforcement learning and imitation learning (with provided actions) setting [4, 23]. In this work we consider how similar policy sketches can be used in the observational learning setting, and, unlike prior related work, our focus is particularly on inferring or assuming such structure in order to speed learning of the procedure. Unlike some related work [25], our work does not assume the observation space is Markov, and it can be applied to domains with perceptual aliasing in the observation space.

Our two key ideas are to learn a plan rather than a policy, and to opportunistically drive action selection to leverage potential repeated structure in the procedure. To do so we introduce a method loosely inspired by backtracking beam search. Our method incrementally constructs a partial plan to yield observations that match the first part of the demonstrated observation sequence. To achieve this, it maintains a set of possible clusterings or alignments of the actions in that plan, using these to guide exploration to mimic the remaining part of the demonstration.

We find that our algorithm learns substantially faster than policy gradient approaches in both Markov and non-Markov simulated, deterministic domains. We find additional benefits from leveraging additional information in the form of input policy sketches. Interestingly, we also find that these benefits can be obtained even when such policy sketches are not provided, by a variant of our algorithm that opportunistically biases exploration towards potential repeated action sub-sequences.
We conclude with a brief investigation of how our approach may be useful in a continuous domain, and a discussion of limitations and future directions.

Procedure learning from observation is related to many ideas, drawing on insights from the extensive learning from demonstration literature and hierarchical learning.
Inspired by human learning from direct observation (without access to the actions), learning by observation has attracted increasing interest in the last few years [9, 13, 18, 22, 24, 25, 27]. Observational learning has the potential to allow humans or artificial agents to learn directly from raw video demonstrations. Due to the wealth of such recorded videos, successful observational learning could enable important advances in agent learning. Observational learning can potentially enable a learner to achieve the task with an entirely different set of actions than the original demonstrator (e.g. a robotic manipulator vs human hands) and to translate shifts in the observation space between the demonstrator and the learner (a new viewpoint, a different background, etc.). Several papers have focused on robotic learning from third-person demonstrations, particularly where the viewpoint of the demonstrator is different from the target context [9, 24]. Unlike such work we focus on the simpler case where the agent's observation space matches the demonstrator's observation space, at least on the subset of features required to specify the reward (e.g. for playing the piano, the visual features might not match but the audio features must). However, our work also tackles the harder case of learning from only a single demonstration. Prior work that operates on a single observational demonstration typically assumes some additional experience or training – such as prior experience used to learn a dynamics model for the learner's environment [25], a set of expert demonstrations in different contexts [18], or a batch of prior paired expert demonstrations and robot demonstrations (complete with the robot actions) [27] – that can then speed agent learning given a new observation sequence. Such learning of transferable (dynamics or policy) models is often framed as a form of meta-learning or transfer learning [11], and has led to exciting successes for one-shot imitation learning in new tasks.

In contrast, our focus in this paper is on enabling fast procedure learning from a single observation trajectory: learning to exactly mimic the trajectory as performed by the demonstrator. If the domain has stochastic dynamics, this is in general impossible, so we focus on the case where the dynamics are deterministic, at least at the level of the observations. We highlight this point because the observations may already be provided at some level of perceptual abstraction rather than low-level sensor readings. For example, the motion of a robot may be slightly jittery, but we can define extended temporal action sequences that can deterministically transition between high-level, abstract observations, like whether the agent is inside or outside a building.
A number of papers have considered hierarchical imitation learning. The majority of such work assumes the agent has access to demonstrated state-action trajectories, where behavioral cloning could be applied, but no additional supervision (though exceptions which leverage additional expert interactions exist, e.g. [3]). The agent performs unsupervised segmentation or sub-policy discovery from the observed demonstration trajectories [10, 21] using (for example) changepoint detection [15], latent temporal variable modeling (e.g. [19]), expectation-gradient approaches (e.g. [7, 8, 16]) or mixture-of-experts modeling [2]. Such methods often leverage parametric assumptions about the underlying domain to help guide the discovery of hierarchical structure. Often, the learned sub-policies have been shown to accelerate the learner on the same tasks (as demonstrated) and/or benefit transfer learning to related tasks. A related idea involves inferring a sequence of abstracted actions (named a workflow) consistent with a demonstration; there can be multiple potential workflows per demonstration [17]. The workflow structures are used to prioritize exploration for related webpage tasks, and show promising improvements, but the workflow inference presupposes particular properties of webpage tasks. In contrast to such unsupervised option discovery, recent work [23] shows that if the demonstrated state-action trajectories are weakly labeled with the sequence of subtasks required to complete the task (inspired by the labels provided in modular policy sketches [4]), this can yield performance almost as strong as if full supervision of the sub-policy segmentation is provided, though the authors did not compare to unsupervised option-discovery methods.

Our work also seeks to leverage such policy sketches in imitation learning, but, to the best of our knowledge, in contrast to the above hierarchical imitation learning research, our work is the first to consider hierarchical learning from observation.
Procedural learning from observations is possible when the domain is deterministic. Prior work has shown stronger performance guarantees when the decision process is deterministic [26] compared to more general, stochastic decision processes. Intuitively, deterministic domains imply that an open loop action plan is optimal, compared to a state-dependent policy. We find similar performance benefits in our setting. Many tasks are deterministic or can be approximated as such by employing the right observation abstraction.

Our technical approach for performing procedural learning by observation is related to backtracking beam search [29]. Backtracking beam search is a strategy for exploring graphs efficiently by only exploring a fixed number b of the most promising next nodes at each time-step while maintaining a stack of unexplored nodes to backtrack to, guaranteeing correctness.

We define the task of procedure learning from observation as: given a single fixed input observation sequence Z* = (z*_1, z*_2, ..., z*_H), the agent must learn an action plan a*_{1:H} = (a*_1, a*_2, ..., a*_H) that, when executed, yields the same observation sequence Z*. Specifically, we assume the agent is acting in a stationary, deterministic, potentially partially observable Markov decision process consisting of a fixed set of actions A, states S and observations Z. The dynamics model is deterministic: for each (s, a) tuple there is a single next state s' with P(s' | s, a) = 1. If the domain is partially observable, the state-observation mapping is also assumed to be deterministic, but may involve aliasing of two states having the same observation, e.g. p(z | s_1, a) = p(z | s_2, a). This implies that executing an action from a given state will yield a single next observation. The dynamics model is assumed to be unknown to the learning agent. Note that the learning agent's observation space must include the set of distinct observations in the observation demonstration Z*, but the action space of the learning agent may not match the demonstrator's action space. For example, a series of photos may show the steps of creating an omelette by a human chef, but a robot could learn to perform the same task and generate the same photos.

In many situations the observed procedure may itself consist of multiple subtasks which can repeat multiple times within a task. Similar to the policy sketch notation [4], we assume that there is an underlying procedure sketch K* = (b_1, b_2, ..., b_L) where each element of the sketch is a label for a particular open-loop action sequence drawn from a fixed set B, departing slightly from the original policy sketches work in which each subtask was a policy. The actual action sequence associated with each element is unknown. An example of these subtasks in one of our domains is shown in Figure 1.

Figure 1: An illustration of our Piano domain inspired by Bach's Prelude in C with the subtasks labeled. Icons used to make this image are credited in [1].

We present two versions of our online procedure learning algorithm:
(1) PLOTS-Sketch is given the task sketch and uses it to infer subtask assignments and alignments.
(2) PLOTS-NoSketch is not given the task sketch and instead infers and stores possible low level action sequences that could potentially be subtasks.

There are two main insights to our approach. The first is to leverage the deterministic structure of the procedure imitation setting to systematically search for a sequence of actions that will enable the learner to match the desired observation trajectory. The second is to strategically use the potential presence of repeated structure to guide exploration. All code is available at https://github.com/StanfordAI4HI/PLOTS.
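To make the setting concrete, the following is a minimal Python sketch of the inputs assumed in this setting. The names (e.g. ProcedureLearningProblem) are hypothetical and the sketch is illustrative only, not the released implementation: it bundles a resettable deterministic simulator, the target observation sequence Z*, and, for PLOTS-Sketch, the procedure sketch K*.

    from dataclasses import dataclass
    from typing import Any, Callable, List, Sequence

    Observation = Any
    Action = Any

    @dataclass
    class ProcedureLearningProblem:
        """Hypothetical container for the inputs assumed by PLOTS.

        reset() returns the simulator to its fixed start state s_0.
        step(a) executes action a and returns the next observation; the
        underlying dynamics and observation mapping are deterministic.
        """
        reset: Callable[[], None]
        step: Callable[[Action], Observation]
        actions: List[Action]            # the learner's action set A
        target_obs: List[Observation]    # Z* = (z*_1, ..., z*_H)
        sketch: Sequence[str] = ()       # K* = (b_1, ..., b_L); empty for PLOTS-NoSketch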
Recall the agent's goal is to learn how to imitate a fixed input sequence of observations Z* = (z*_1, z*_2, ..., z*_H). For this to be possible we assume that the dynamics of the underlying domain are deterministic, at least in terms of the actions available to the agent in order to achieve the desired observation sequence. Note that we do not assume that the observation space is necessarily Markov.

Our algorithm proceeds by incrementally learning a sequence of actions that yields the observation sequence Z*. Notice that Z* provides dense labels/rewards after an action a_t taken at time step t, since the agent sees its next observation z̃_t and can immediately identify whether z̃_t matches the desired observation z*_t. If it matches, then a_t is identified as a candidate for the correct action at time step t and is added to a partial solution action trajectory a*_{1:t}. The agent then continues, trying a new action a_{t+1} to match z*_{t+1}. If z̃_t does not match the desired observation z*_t, the agent simply plays random actions until the end of the trajectory H. It is then reset to the start state, follows the known partial solution action trajectory a*_{1:t-1} until it reaches time step t, and then with uniform probability chooses an action that has not yet been tried for t.

In general, aliasing may occur if the observation space is not Markov. In such cases, even if an action a_t yields the desired observation z*_t, the latent state underlying the agent's observation z_t may be wrong due to aliasing, preventing the agent from mimicking the rest of the sequence. This is detectable when an agent reaches a later time step t' for which no actions can yield the specified observation z*_{t'}. In this case the agent backtracks the partial solution one step, from a*_{1:t'-1} to a*_{1:t'-2}, and restarts the process from there to find new actions that yield the same remaining procedure observation sequence, possibly backtracking again when necessary. We will refer to the agent that does this as Backtracking Procedure Search (BPS).
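The following is a minimal Python sketch of the BPS loop just described, assuming the hypothetical ProcedureLearningProblem interface sketched earlier; it is an illustration of the search strategy, not the paper's released code. It incrementally extends a verified partial plan, tries untried actions at the frontier step, and backtracks one step whenever no action at some later step can produce the required observation.

    import random

    def bps(problem) -> list:
        """Backtracking Procedure Search: learn an open-loop plan matching Z*."""
        H = len(problem.target_obs)
        plan = []                            # a*_{1:t}, the verified partial plan
        tried = [set() for _ in range(H)]    # actions ruled out (or already used) at each step

        while len(plan) < H:
            t = len(plan)                    # next step to solve (0-indexed)
            untried = [a for a in problem.actions if a not in tried[t]]
            if not untried:
                # Aliasing detected: no action works here, so undo the previous step.
                if not plan:
                    raise RuntimeError("no action sequence can reproduce Z*")
                tried[t].clear()
                tried[t - 1].add(plan.pop())
                continue

            # Replay the verified prefix from the start state, then try a new action.
            problem.reset()
            for a in plan:
                problem.step(a)
            a = random.choice(untried)       # uniform over actions not yet tried at step t
            tried[t].add(a)
            if problem.step(a) == problem.target_obs[t]:
                plan.append(a)               # candidate correct action for step t
            # On a mismatch the paper's agent acts randomly until the episode ends and
            # is then reset; with a resettable simulator we simply reset on the next pass.
        return plan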
Learning Efficiency of BPS. If the observation space is Markov given the agent's actions, then once an action at time step t yields the specified desired next observation z_t, that action never needs to be revised. For a Markov state at most |A| actions must be explored. Since each "failed" action attempt requires the agent to act until the end of the episode and then replay the learned solution action sequence up to the desired time step t, it can take at most H|A| time steps for the agent to learn the right action to take at time step t. Repeating this for all H time steps yields a total sample complexity of |A|H to learn the procedure. This matches the expected sample complexity for deterministic tabular Markov decision processes, since only a single sample is needed to learn each state-action dynamics model. Note that if we were to treat this problem as a policy search problem, the number of possible policies is |A|^|S|, or |A|^H if each observation is unique in the procedure demonstration. In general this will be substantially less efficient than our method. If the observation space is not Markov and aliasing occurs, in the worst case the process of backtracking and going forward may occur repeatedly until all |A|^H possible action trajectories are explored. This matches the potential set of policies considered by direct policy search algorithms for this domain, which are also robust to non-Markovian structure. However, in practice we rarely encounter such cases, and we find that our approach only has to perform infrequent backtracking.

The BPS algorithm described above is agnostic to and does not utilize the presence of any hierarchical structure. To leverage potential repeated subsequence structure, we extend BPS by proposing the PLOTS algorithms, which provide heuristics for action selection resulting in smarter exploration. Note that accounting for repeated structure should provide significant speedups if such structure exists, but if no such structure exists then the PLOTS algorithms should perform equivalently to BPS.
PLOTS-Sketch. For PLOTS-Sketch, in addition to maintaining a search tree to build a potential solution action trajectory a*_{1:H}, our algorithm also maintains a finite set of partial action sketch instantiation hypotheses. As a concrete example, consider the observed procedural sequence (z*_1, z*_2, ..., z*_8) and the associated subtask sketch (b_1, b_2, b_1, b_3, b_1). Let the agent have learned that the first 5 actions are a*_{1:5} = (e, f, g, e, f). Then two potential partial action sketch assignments are M̃_1 = [b_1 = (e, f), b_2 = (g)] and M̃_2 = [b_1 = (e), b_2 = (f, g)]. Both of these partial action sketch hypotheses are consistent with the learned partial solution action trajectory a*_{1:5}. Yet they have different implications for the optimal action sequence in the remainder of the trajectory.

The two primary functions we must address are how to use potential action sketch hypotheses to facilitate faster learning, and how to update existing and instantiate new hypotheses.

Action Selection Using Partial Action Sketch Hypotheses. To use these partial action sketch hypotheses for action selection, at each timestep t every hypothesis tracked by the agent can potentially suggest an action to take next using the following guidelines:
• If the hypothesis estimates the current time step t is in a subtask for which it has an assignment, it will execute the next action in that subtask.
• Otherwise, the hypothesis returns NULL to the agent, indicating that it does not have any action suggestions.

In practice we found a slight variant of the above score function and action selection procedure was beneficial.
Instead of returning NULL, the hypothesis makes an optimistic assumption that the first repeating subtask that has not yet been assigned will repeat as soon as possible and will have length as long as possible. For example, consider at timestep t = 4 a hypothesis M̃_0 = [ ], which has not yet instantiated any potential mappings of subtasks to actions. At t = 4, the partial action solution is a*_{1:4} = (e, f, g, e). The first repeated subtask in this case is known to be b_1 and it is also known that b_1 aligns with the beginning of the partial action solution. Due to the non-emptiness of subtasks, we know the first e found at t = 1 in a* belongs to b_1. So we optimistically assume that b_1 is currently repeating, and that the second e found at t = 4 is b_1 repeating as opposed to belonging to b_2. We also optimistically assume b_1 is as long as possible, so the f found at t = 2 belongs to b_1 as opposed to b_2. With these optimistic assumptions, the next action should be f, which M̃_0 will suggest instead of suggesting NULL.

With each hypothesis possibly suggesting an action, the agent must select a hypothesis to follow. To this end, we compute a score for each hypothesis and use this score to select among them. The score C(M̃_i, t) is the maximum reduction in time needed to learn the remaining procedure that could result if that hypothesis M̃_i were true, and is calculated as:

C(M̃_i, t) = Σ_{b_j} N_{b_j} · l(M̃_i(b_j)),     (1)

where the sum is over the subtasks b_j in the remainder of the sketch, N_{b_j} is the number of repeats of subtask b_j in the remainder of the procedure given hypothesis M̃_i, and l(M̃_i(b_j)) is the length of the action subsequence corresponding to subtask b_j in M̃_i. Note that if M̃_i does not include a hypothesized assignment for element b_j, then its length is assigned to be 0. Continuing our running example, consider computing the score for M̃_1 after t = 5. Under this hypothesis, the remaining sketch for the rest of the trajectory is only (b_3, b_1), since M̃_1 hypothesizes that (b_1, b_2, b_1) have already been observed. Therefore,

C(M̃_1, t = 5) = N_{b_1} · l(M̃_1(b_1)) + N_{b_3} · l(M̃_1(b_3)) = 1 · 2 + 1 · 0 = 2,

since M̃_1 does not include an instantiation for b_3, so l(M̃_1(b_3)) = 0, and b_1 = (e, f) under this hypothesis.

To use this score to select a hypothesis, recall that in discrete domains, at each timestep t the agent learns the correct action by trying actions until the correct one is found and the observed next state z̃_{t+1} matches the correct state z*_{t+1}. Let A'_t be the set of all incorrect actions the agent has tried at t. Let H_{A',t} represent the set of hypotheses tracked by the agent at time t that are not suggesting an action in A'_t and that are not suggesting NULL. After all scores are computed for the tracked hypotheses, the partial sketch M̃* = arg max_{M̃_i ∈ H_{A',t}} C(M̃_i, t) with the highest score is selected. The agent then follows the action suggested by this hypothesis.
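A minimal sketch of this scoring and selection step, using hypothetical data structures of our own (a hypothesis is represented as a dict from subtask label to its hypothesized action tuple); it illustrates Equation 1 and the arg max selection rule rather than reproducing the released implementation.

    def score(hypothesis: dict, remaining_sketch: list) -> int:
        """Equation 1: sum over remaining subtasks of (repeats remaining) x (hypothesized length)."""
        total = 0
        for label in set(remaining_sketch):
            repeats = remaining_sketch.count(label)      # N_{b_j}
            length = len(hypothesis.get(label, ()))      # l(M_i(b_j)); 0 if unassigned
            total += repeats * length
        return total

    def select_action(hypotheses, remaining_sketch, suggest, ruled_out):
        """Follow the suggestion of the highest-scoring consistent hypothesis.

        suggest(h) returns the next action hypothesis h proposes, or None (NULL);
        hypotheses suggesting an action already ruled out at this step are skipped.
        """
        candidates = [h for h in hypotheses
                      if suggest(h) is not None and suggest(h) not in ruled_out]
        if not candidates:
            return None                                  # fall back to trying untried actions
        best = max(candidates, key=lambda h: score(h, remaining_sketch))
        return suggest(best)

    # Running example: after t = 5, the hypothesis with b1 = (e, f) and b3 unassigned,
    # facing the remaining sketch (b3, b1), scores 1*2 + 1*0 = 2.
    print(score({"b1": ("e", "f"), "b2": ("g",)}, ["b3", "b1"]))   # -> 2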
Hypothesis Creation and Updating. Whenever the agent reaches a time step t on which it adds a new partial solution action trajectory element a_t that is a repeat of a previously encountered action in the current solution trajectory, new subtask action hypotheses can be introduced. To reduce computational complexity, the agent only reasons about assignments for one subtask at a time, and additional subtasks get assigned only if the assignment of the main subtask immediately implies them. To reduce the memory complexity of enumerating and storing all possibilities, we only create hypotheses for subtask assignments we have consistent evidence to be true, in the sense that we have seen a consistent alignment where that subtask assignment has already repeated at least once. To continue with our example, consider again the timestep t = 4 and M̃_0, the hypothesis which has not yet instantiated any mappings. For this hypothesis, the first repeating subtask is b_1, so the main subtask it is trying to find an assignment for is b_1. At t = 4, the partial action solution is a*_{1:4} = (e, f, g, e), and from M̃_0 the agent can instantiate M̃_2 = [b_1 = (e), b_2 = (f, g)], because we have consistent evidence in a* of b_1 = (e): we have seen e repeat at least once in a*, and assigning b_1 = (e) is consistent with the assumptions made about the subtask structure. Assigning b_1 = (e) immediately implies b_2 = (f, g) in this hypothesis, so we also make an assignment for b_2. However, we do not instantiate M̃_1 = [b_1 = (e, f), b_2 = (g)] or other hypotheses that would assign b_1 = (e, f, g) or b_1 = (e, f, g, e), etc., because at t = 4 we have not yet seen those sequences for b_1 repeating in a consistent manner. We also continue to track the hypothesis which has not yet instantiated any mappings, which we will keep calling M̃_0 for clarity. Now consider moving forward to the next timestep after discovering that the next correct action is f. Now the agent is at t = 5, and a*_{1:5} = (e, f, g, e, f). At this point, from M̃_0, we can branch and instantiate M̃_1 = [b_1 = (e, f), b_2 = (g)], because we have now seen the sequence (e, f) repeat in a consistent manner. From this example, we can notice that we only need to find instantiations for the main subtask where the repeats match at the end of a*. For example at t = 5, even though from M̃_0 we have consistent evidence for b_1 = (e), we do not need to re-instantiate that hypothesis because we have already instantiated it at t = 4.

Computational Tractability. Like in beam search, for computational tractability we maintain only a finite set of subtask action hypotheses. As previously mentioned, whenever the agent finds a new partial solution action element a_t for a time step t, new subtask action hypotheses can be introduced. Each existing hypothesis can generate at most H/2 new hypotheses at a time step t. To see this, we deviate from our running example and present a new example. Consider the situation where the subtask sequence is (b_1, b_2, b_3, b_2, ...) and the agent is at timestep t = 8 with a*_{1:8} = (e, f, g, h, i, f, g, h). Let one of the hypotheses the agent is tracking be M̃ = [ ], one that has no hypothesized subtask assignments. At this timestep, the action subsequences that repeat and end at the end of a* are (h), (g, h) and (f, g, h), so we instantiate one new assignment for the main subtask b_2 from each of them, all of which we have consistent evidence for, with the assignments of b_1 and b_3 immediately implied; for example, [b_2 = (f, g, h), b_3 = (i), b_1 = (e)]. Because we only instantiate a hypothesis once we see repeats, the greatest number of branchings we can have at each step is at most H/2.
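The branching step can be illustrated with a short Python sketch (the names are ours, not the released code): the candidate assignments for the currently repeating main subtask are exactly the suffixes of a* that also occur earlier in a*, and since a repeat of length k consumes at least 2k actions, at most roughly H/2 candidates exist at any time step.

    def repeating_suffixes(partial_plan: list) -> list:
        """Suffixes of the partial plan that also occur earlier in it, i.e. the
        candidate assignments for the currently repeating main subtask."""
        n = len(partial_plan)
        candidates = []
        for k in range(1, n // 2 + 1):       # a length-k repeat needs at least 2k actions
            suffix = tuple(partial_plan[n - k:])
            earlier = partial_plan[:n - k]
            if any(tuple(earlier[i:i + k]) == suffix for i in range(len(earlier) - k + 1)):
                candidates.append(suffix)
        return candidates

    # Example from the text: t = 8 with a* = (e, f, g, h, i, f, g, h)
    print(repeating_suffixes(list("efghifgh")))   # -> [('h',), ('g', 'h'), ('f', 'g', 'h')]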
Though each individual hypothesis will only generate at most a polynomial number of additional hypotheses at each time step, repeating this across many time steps can yield exponential growth. Therefore we maintain a finite set of N potential hypotheses which we actively update, and we do not update the rest. This is done via two mechanisms. First, hypotheses are ranked according to the score function (Equation 1) and only the top N are kept active. We will refer to the hypotheses not in the top N that we are not tracking as frozen. Second, if the current hypothesis is inconsistent with the observed procedure and partial action solution trajectory, that hypothesis is eliminated. This can occur later during the procedure learning, when additional discoveries of elements of a* make it clear that an earlier hypothesis is inconsistent. To maintain the correctness of our algorithm, if we reach a point where we have no more tracked consistent hypotheses, we can unfreeze frozen hypotheses and continue. Empirically, we have found that in the domains we considered our sorting metric works well, and if a reasonable number of hypotheses are tracked, very little unfreezing needs to be done. This leads to a memory complexity of O(H) in terms of the number of hypotheses stored. Pseudocode for PLOTS-Sketch is presented in Alg 1.
Algorithm 1 PLOTS-Sketch
Input: Z* (observation sequence), K* (procedure sketch)
M = ∅, a* = ( )  // actions yielding partial match of Z*
A_p = {1, . . . , |A|}
while |a*| < H do  // haven't learned full procedure
    Reset to s_0
    Execute a*  // execute known subprocedure
    t = |a*| + 1
    Evaluate score C(M̃, t) for each hypothesis M̃ ∈ M
    a_t ← action suggested by arg max_{M̃} C(M̃, t)
    Execute a_t, observe z̃_{t+1}
    if z̃_{t+1} == z*_{t+1} then  // found action that yields observation
        M ← UpdateActiveHypotheses(M, a*, a_t)
        a* ← (a*, a_t)
    else if M == ∅ then  // no consistent active hypotheses
        Backtrack to unroll past incorrect actions & reset M
    end if
end while

PLOTS-NoSketch. The PLOTS-NoSketch algorithm is not given the task sketch and relies on the fact that the task consists of repeating subtasks. At each timestep, PLOTS-NoSketch looks into the partial solution action trajectory a*_{1:t} for repeated sequences of low level actions. Repeated low level action sequences, or hypothesized subtasks, are stored along with the number of times they are repeated. To reduce the computational complexity of this method, we only add and update the counts of repeated action sequences that also match at the end of a*_{1:t}. This is sufficient because other repeated action sequences will have been discovered and updated at previous time steps. To suggest an action, we sort all hypothesized subtasks by the number of times they have repeated. We then follow, in that order, the consistent next actions of the hypothesized subtasks until the correct action for time t is found.
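A minimal sketch of the PLOTS-NoSketch bookkeeping (our own illustration, reusing the repeating_suffixes helper sketched earlier): repeated action subsequences that end at the current end of a* are stored with repeat counts, and candidate next actions are suggested in order of how often their hypothesized subtask has repeated.

    from collections import Counter

    def update_subtask_counts(counts: Counter, partial_plan: list) -> None:
        """Count hypothesized subtasks: repeated subsequences matching the end of a*."""
        for suffix in repeating_suffixes(partial_plan):    # helper from the earlier sketch
            counts[suffix] += 1

    def suggest_actions(counts: Counter, partial_plan: list):
        """Yield candidate next actions, most-repeated hypothesized subtasks first."""
        n = len(partial_plan)
        for subtask, _ in counts.most_common():
            k = len(subtask)
            # If the plan currently ends with a proper prefix of this subtask,
            # the subtask's next element is a consistent candidate action.
            for done in range(min(k - 1, n), 0, -1):
                if tuple(partial_plan[n - done:]) == subtask[:done]:
                    yield subtask[done]
                    break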
We compare our methods against state-of-the-art baselines that can learn observational procedures, and against baseline versions of our method. Our methods are summarized below:
(1) PLOTS-Sketch is given the procedure sketch and leverages it to hypothesize about the assignments of low level actions to subtasks and the alignment of the procedure sketch to the state sequence, to perform smarter exploration.
(2) PLOTS-NoSketch is not given the procedure sketch. It leverages the fact that the task is made of repeated subtasks and hypothesizes possible action sequences that could correspond to subtasks, to use for smarter exploration.

Figure 2: Illustrations for the domains: (a) Island, (b) Gem, (c) CPR [1].

We compare with baseline versions of our method:
(1) BPS is described in Section 4.1. It is not given the procedure sketch and does not infer or leverage any of the repeated hierarchical structure.
(2) BPS with Oracle Sketch Alignment (BPSOSA) is given the oracle alignment of the procedure sketch to the state sequence, in addition to the procedure sketch. This agent is able to learn faster because it does not need to hypothesize about the alignments and only needs to learn the assignment of action sequences to subtasks.

We compare against state-of-the-art policy gradient based methods:
(1) Modular [4] leverages the sketch to learn the procedure. Originally Modular was used to learn multiple tasks with sparse rewards using the sketches, but we instead provide the method with dense per-step rewards for our setting.
(2) GAIL [14] is an imitation learning method that learns to imitate the given observation sequence by adversarial training. This method is not able to leverage the sketch.
(3) Policy Gradient (PG) is adapted from GAIL [14] and replaces the discriminator with per-step rewards, resulting in a purely policy-gradient approach. We reason that this could potentially be more efficient than GAIL since, instead of learning the reward function (discriminator), we directly provide it. This method is also not able to leverage the sketch.

These baseline methods all rely on a policy-gradient approach to learn a stochastic policy, rather than learning an open-loop plan like the PLOTS variants and baselines. Since our procedure is deterministic and our methods are specialized to learning in deterministic domains where open-loop plans are sufficient, we expect these baselines will all converge more slowly to a locally optimal policy. However, they do have the additional benefit of being able to leverage a deep neural network to internally learn a state abstraction. Because many deep neural network approaches are sensitive to hyperparameters, for the results reported for each of the baselines we did a basic hyperparameter sweep over 4-6 different sets of hyperparameters and display the set that performed best. Code for GAIL, which we additionally modified to obtain our Policy Gradient baseline, is taken from github.com/openai/baselines, and Modular from github.com/jacobandreas/psketch.
We also compare against model-based methods, which we modify to be computationally tractable in our domains, which have large state spaces. As with the policy gradient based approaches, we also provide dense, one step rewards signaling whether the agent has found the correct action to perform the procedure.
(1) RMax+ [6] is a tabular model based algorithm that initializes the values of all states optimistically. For computational tractability, we build up the Q-value, reward, and transition tables as we see new states and group all states that were not on the demonstration trajectory as the termination state.
(2) UCB+ [5] is a bandit algorithm that keeps track of confidence intervals of the rewards of the arms and chooses the arm with the highest upper confidence reward. We apply this by treating each unique state as a separate bandit problem. Because we are only considering the deterministic case, the exact reward of a state action pair (s, a) can be learned after one attempt of the action in the state, and the confidence interval shrinks to zero. For tractability we also build up the number of bandit problems as we see new states and treat all states that were not on the demonstration trajectory as the degenerate bandit where all actions lead to zero reward.

Note that we do not compare to methods that require the demonstrator's actions to be provided, such as behavior cloning and recent variations on this [12, 23], since we assume we do not have or are not able to utilize the demonstrator's actions.

Craftworld. A discrete 2D domain introduced by Andreas et al. [4] to evaluate policy learning using policy sketches in multitask domains with sparse rewards. In this domain, the agent is required to complete various tasks by moving and interacting with objects using 5 deterministic actions: up, down, left, right, use. The tasks have hierarchical structure, so each task has a corresponding policy sketch. This domain was first proposed to demonstrate the effectiveness of an algorithm that learned policies in a multitask setting. Therefore in the original tasks a single task did not have any repeated subtasks, but the agent could leverage repeated subtasks across multiple tasks to speed learning. This differs from our setting, since we are primarily interested in the single task setting where there is repeated structure within a task. Therefore, to evaluate our method, we create a new task that involves collecting multiple wood objects, forming them into planks, and using them to build a raft to reach an island (Island, Fig 2a). This procedure has length H = 67 with a policy sketch of length L = 16 over a set B of distinct subtasks.

CPR (Fig 2c). The task of the agent in CPR world is to follow the correct steps necessary to perform CPR on a patient based on standard CPR procedures. The agent has 23 actions that are used in an observation demonstration of length H, with a policy sketch of length L over a set B of subtasks. All code and more detailed environment descriptions are available at https://github.com/StanfordAI4HI/PLOTS.

Figure 3: Comparing PLOTS with policy-gradient based baselines in four discrete domains: (a) Piano, (b) Island, (c) CPR, (d) Gem. A basic hyperparameter sweep was done for the baselines and the best set of hyperparameters was chosen. The Gem domain (d) did not have any repeated structure and shows the speedup of our method, which is specialized to learning procedures, over more general approaches. Our approach was able to learn roughly 10-100 times faster. For the Island domain, we also show a hyperparameter sweep for our algorithm (e) and for GAIL (f). In all plots, PLOTS refers to PLOTS-Sketch and, for (e), the number after the dash refers to the number of hypotheses tracked.
In the Piano domain an agent learns to play the right hand component of Bach's Prelude in C (boxed in blue in Fig 1) in a simulated piano environment. The observation sequence has H = 64 notes, with a policy sketch of length L over a set B of subtasks.

Env     PLOTS-Sketch  PLOTS-NoSketch  BPS    BPSOSA  RMax+    UCB+     GAIL
Island  92            80              137    72      265      265      20198
Gem     43            43              44     43      86       86       9892
Piano   405           392             539    206     30,000+  30,000+  18526
CPR     319           286             2005   455     4230     4302     25653

Table 1: Average number of episodes until the procedure is learned for PLOTS-Sketch and baselines. Only GAIL is listed amongst the policy gradient based baselines, as it did best.
Figures 3a-3d and Table 1 display the results of running our approach and the baselines on the Craftworld, Piano and CPR simulation domains. From the figures, in all cases we observe that our procedural learning from observation approach requires at least 100 times fewer episodes to learn the desired procedure than the baseline policy learning algorithms. This clearly illustrates the enormous benefit of leveraging knowledge of the deterministic dynamics in order to incrementally compute a plan. Note this is true both in the large state space Markov domains (Craftworld) as well as the partially observable Markov domain (Piano).

Additionally, we can see from Table 1 that all our algorithms performed significantly better than the model-based baselines. This improvement results from our method not optimistically exploring all possibilities, but instead focusing on finding a single plan that achieves the desired full sequence. Additionally, the model-based baselines do not perform well in the Piano domain, where a Markov model is history-dependent and requires exploration over an exponential history space of O(|A|^H).

It has been recently observed that curriculum learning can speed reinforcement learning, and indeed the policy sketches algorithm employed hand-designed curriculum learning across different length sketches during its multi-task training procedure [4]. One might wonder if curriculum learning could be applied to improve the performance of the baselines in these domains, since our own approaches implicitly perform incremental curriculum learning as they slowly build up a correct action plan that yields the desired observation sequence. To mimic this process, one could imagine first training a policy network to correctly obtain the first observation, then training it to correctly obtain the first two observations, etc. Unfortunately, in partially observable environments, at some point it is likely that the previously trained policy for an earlier observation is incorrect. In our approaches this is where systematic backtracking can be done, to efficiently unroll/unlearn proposed solution action plans. However, in generic policy training this additional guidance about how to start searching for alternate policies, and which parts to revisit, is entirely unstructured, making it likely that this could incur a general cost of expanding all prior |A|^H decisions. In contrast, our method typically only backtracks a small number of times, yielding a final computational cost that is closer to a linear scale-up C of the Markovian decision space, C|A|H, rather than needing to explore the full exponential space.

Table 1 additionally shows a comparison between the variants of our method and our own baselines, illustrating that our algorithm variants that leverage knowledge of the subtask structure within the observation demonstration learn with substantially fewer episodes than our variant BPS, which is agnostic to potential substructure. The Gem example, which has no repeated action substructure, illustrates that if no substructure exists, all of our algorithms perform similarly, as expected. Interestingly, note that sometimes our algorithms that do not receive the ground truth alignment outperform the oracle variant, BPSOSA. We find that in practice there may be repeated action subsequences that can't yet be confidently aligned with particular observations, but that optimistically assuming such alignments can yield substantial speedups.
Indeed, in many of the problems there is additional substructure that is not reflected in the sketches. For example, in Island, one open loop action subsequence could be to travel from the workshop to the forest entrance (the area surrounding all the wood) using an action sequence that has one action repeated many times (for example Down, Down, Down, Down, Down, Down, Left). In this case there is additional substructure, (Down, Down, Down), that PLOTS-NoSketch is able to use, which can allow it to perform better than PLOTS-Sketch. However, this result is specific to problems where there is additional substructure within a subtask's open loop plan.

The above experiments illustrate the benefit of action substructure. To better understand the potential impact on agent learning of strategic action substructure hypothesis generation to inform action selection, we explored the sensitivity of the PLOTS-Sketch algorithm to the number of tracked hypotheses, our main hyperparameter (Figure 3e). We find a significant improvement from tracking at least 2 hypotheses, but tracking more yields only minor differences. This illustrates that being able to strategically suggest potentially beneficial actions given a small set of hypotheses can be beneficial and computationally tractable (due to the low number of tracked hypotheses).
Our experiments show that PLOTS-Sketch and PLOTS-NoSketch are capable of quickly learning a given procedure, leveraging the sketch to discover macro-actions that can be reused later on. Our results show that for learning procedures with deterministic dynamics, focusing on specialized algorithms can be vastly more efficient than more general policy-gradient style methods, which are additionally able to learn stochastic policies. In this work we focus on discrete domains, since many domains are naturally discrete or near discrete. We have preliminary work in successfully adapting our algorithm to domains with both continuous state and action spaces, using gradient descent on the action space in domains where the reward is continuous and convex with respect to the action. Note that in continuous state spaces it is impossible to match the observation state exactly. Thus, we approximately match observations, with a tolerance on the ℓ2 distance between the agent and demonstrator observations. Due to this approximation we cannot directly apply the learned actions of a subtask as-is, due to compounding errors; however, learning subtask assignments is still useful in that it provides a favorable initialization for the action search, allowing the number of episodes needed to find an approximately correct action to be half of the number of episodes needed with a random initialization when using some gradient based optimization methods such as COBYLA [20]. Additionally, in this work we do not consider stochastic domains; in such domains, without additional assumptions, it is impossible for any algorithm to guarantee that it can find a policy or action sequence to match the observed procedure. However, an area of future exploration is stochastic domains where the dynamics appear deterministic given an appropriate state abstraction [28].

We introduce PLOTS-Sketch and PLOTS-NoSketch, novel approaches for learning to imitate deterministic procedures in tasks that have repeated structure in the form of subtasks. PLOTS-Sketch is able to incorporate additional information in the form of a procedure sketch to help reason about action-to-subtask assignments and speed learning. PLOTS-NoSketch infers possible action sequences that could correspond to subtasks without the sketch information. We evaluated the performance of our algorithms in four different domains, including a domain that is partially observable in the state space. Our algorithm for learning procedures in discrete deterministic domains vastly outperformed related methods designed for general classes of problems.
ACKNOWLEDGEMENTS
This material is based upon work supported by the Schmidt Foundation, the NSF CAREER award and the National Physical Science Consortium fellowship.
REFERENCES
[2] OptionGAN: Learning Joint Reward-Policy Options using Generative Adversarial Inverse Reinforcement Learning.
[3] Nichola Abdo, Henrik Kretzschmar, Luciano Spinello, and Cyrill Stachniss. 2013. Learning manipulation actions from a few demonstrations. In Robotics and Automation (ICRA), 2013 IEEE International Conference on. IEEE, 1268–1275.
[4] Jacob Andreas, Dan Klein, and Sergey Levine. 2017. Modular Multitask Reinforcement Learning with Policy Sketches. In ICML.
[5] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. 2002. Finite-time analysis of the multiarmed bandit problem. Machine Learning 47, 2-3 (2002), 235–256.
[6] Ronen I Brafman and Moshe Tennenholtz. 2002. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research 3, Oct (2002), 213–231.
[7] Hung Hai Bui, Svetha Venkatesh, and Geoff West. 2002. Policy recognition in the abstract hidden markov model. Journal of Artificial Intelligence Research.
[8] Machine Learning.
[9] Advances in Neural Information Processing Systems. 1087–1098.
[10] Staffan Ekvall and Danica Kragic. 2006. Learning task models from multiple human demonstrations. In Robot and Human Interactive Communication, 2006. ROMAN 2006. The 15th IEEE International Symposium on. Citeseer, 358–363.
[11] Chelsea Finn, Tianhe Yu, Tianhao Zhang, Pieter Abbeel, and Sergey Levine. 2017. One-shot visual imitation learning via meta-learning. arXiv preprint arXiv:1709.04905 (2017).
[12] Roy Fox, Sanjay Krishnan, Ion Stoica, and Kenneth Y. Goldberg. 2017. Multi-Level Discovery of Deep Options. CoRR abs/1703.08294 (2017).
[13] Wonjoon Goo and Scott Niekum. 2018. Learning Multi-Step Robotic Tasks from Observation. arXiv preprint arXiv:1806.11244 (2018).
[14] Jonathan Ho and Stefano Ermon. 2016. Generative Adversarial Imitation Learning. In NIPS.
[15] George Konidaris, Scott Kuindersma, Roderic Grupen, and Andrew Barto. 2012. Robot learning from demonstration by constructing skill trees. The International Journal of Robotics Research 31, 3 (2012), 360–375.
[16] Sanjay Krishnan, Roy Fox, Ion Stoica, and Ken Goldberg. 2017. DDCO: Discovery of Deep Continuous Options for Robot Learning from Demonstrations. In Proceedings of the 1st Annual Conference on Robot Learning (Proceedings of Machine Learning Research), Sergey Levine, Vincent Vanhoucke, and Ken Goldberg (Eds.), Vol. 78. PMLR, 418–437. http://proceedings.mlr.press/v78/krishnan17a.html
[17] Evan Zheran Liu, Kelvin Guu, Panupong Pasupat, Tianlin Shi, and Percy Liang. 2018. Reinforcement Learning on Web Interfaces using Workflow-Guided Exploration. In ICLR.
[18] YuXuan Liu, Abhishek Gupta, Pieter Abbeel, and Sergey Levine. 2018. Imitation from Observation: Learning to Imitate Behaviors from Raw Video via Context Translation. In ICRA.
[19] Scott Niekum, Sarah Osentoski, George Konidaris, Sachin Chitta, Bhaskara Marthi, and Andrew G Barto. 2015. Learning grounded finite-state representations from unstructured demonstrations. The International Journal of Robotics Research 34, 2 (2015), 131–157.
[20] Michael JD Powell. 1994. A direct search optimization method that models the objective and constraint functions by linear interpolation. In Advances in Optimization and Numerical Analysis. Springer, 51–67.
[21] Stefan Schaal. 2006. Dynamic movement primitives - a framework for motor control in humans and humanoid robotics. In Adaptive Motion of Animals and Machines. Springer, 261–280.
[22] Pierre Sermanet, Kelvin Xu, and Sergey Levine. 2016. Unsupervised perceptual rewards for imitation learning. arXiv preprint arXiv:1612.06699 (2016).
[23] Kyriacos Shiarlis, Markus Wulfmeier, Sasha Salter, Shimon Whiteson, and Ingmar Posner. 2018. TACO: Learning Task Decomposition via Temporal Alignment for Control. In International Conference on Machine Learning.
[24] Bradly C Stadie, Pieter Abbeel, and Ilya Sutskever. 2017. Third Person Imitation Learning. In ICLR.
[25] Faraz Torabi, Garrett Warnell, and Peter Stone. 2018. Behavioral Cloning from Observation. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI).
[26] Zheng Wen and Benjamin Van Roy. 2017. Efficient Reinforcement Learning in Deterministic Systems with Value Function Generalization. Math. Oper. Res.
[27] In RSS.
[28] Amy Zhang, Adam Lerer, Sainbayar Sukhbaatar, Rob Fergus, and Arthur Szlam. 2018. Composable Planning with Attributes. In International Conference on Machine Learning.
[29] Rong Zhou and Eric A Hansen. 2005. Beam-Stack Search: Integrating Backtracking with Beam Search. In ICAPS.