Action Priors for Large Action Spaces in Robotics
Ondrej Biza, Dian Wang, Robert Platt, Jan-Willem van de Meent, Lawson L. S. Wong
Ondrej Biza, Northeastern University, Boston, MA, [email protected]
Dian Wang, Northeastern University, Boston, MA, [email protected]
Robert Platt, Northeastern University, Boston, MA, [email protected]
Jan-Willem van de Meent, Northeastern University, Boston, MA, [email protected]
Lawson L. S. Wong, Northeastern University, Boston, MA, [email protected]
ABSTRACT
In robotics, it is often not possible to learn useful policies using pure model-free reinforcement learning without significant reward shaping or curriculum learning. As a consequence, many researchers rely on expert demonstrations to guide learning. However, acquiring expert demonstrations can be expensive. This paper proposes an alternative approach where the solutions of previously solved tasks are used to produce an action prior that can facilitate exploration in future tasks. The action prior is a probability distribution over actions that summarizes the set of policies found solving previous tasks. Our results indicate that this approach can be used to solve robotic manipulation problems that would otherwise be infeasible without expert demonstrations. Source code is available at https://github.com/ondrejba/action_priors.
KEYWORDS
reinforcement learning; deep learning; action prior; robotics; robotic manipulation
ACM Reference Format:
Ondrej Biza, Dian Wang, Robert Platt, Jan-Willem van de Meent, and Lawson L. S. Wong. 2021. Action Priors for Large Action Spaces in Robotics. In Proc. of the 20th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2021), Online, May 3–7, 2021, IFAAMAS, 13 pages.
The 3rd, 4th, and 5th authors are listed in alphabetical order and contributed equally. The authors thank Yunus Terzioglu, Tarik Kelestemur, and our anonymous reviewers for helpful feedback. This work was supported by the Intel Corporation, the 3M Corporation, National Science Foundation (1724257, 1724191, 1763878, 1750649, 1835309), NASA (80NSSC19K1474), startup funds from Northeastern University, the Air Force Research Laboratory (AFRL), and DARPA.

Proc. of the 20th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2021), U. Endriss, A. Nowé, F. Dignum, A. Lomuscio (eds.), May 3–7, 2021, Online.

Figure 1: Our PyBullet block stacking setup. (a) A simulated UR5 arm and a × cm workspace with blocks, (b) a simulated top-down depth camera image of the workspace, (c) and (d) are examples of the goal states of 2 of our 16 block stacking tasks.

Advances in deep learning have made model-free robot control a viable alternative to model-based motion planning [7, 14, 33]. However, the complexity of tasks solvable by these approaches without extra supervision is limited, partly due to the sample inefficiency of deep learning. Hand-crafted temporal abstractions of end-to-end motions such as picking, placing, and pushing are a compelling alternative, as they allow agents to reason over longer timescales [13, 36, 37]. In particular, Zeng et al. [37] proposed a pixel-wise parameterization of the action space, where each pixel in the observed image of the workspace corresponds to a reaching action to that position followed by a pick or place.

While both low-level action spaces with long time horizons and pixel-wise action spaces are difficult to explore, the pixel-wise parameterization makes this challenge more explicit: the agent is presented with thousands of possible actions, and usually only a handful of them enable the agent to make progress toward its goal. Exploration challenges like this are often addressed using reward shaping, curriculum learning, or imitation learning. However, these methods require additional supervision that may be difficult to provide. Ideally, our agent could learn new skills without expert supervision.

In this paper, we construct priors over the action space – action priors – that inform the agent of actions that were useful in the context of previously learned tasks. The idea of action priors has existed for some time. Sherstov and Stone [25] considered a single action prior for all states; later, action priors were extended to state-specific priors [1, 6, 21, 22]. However, to date, action priors have not been applied outside of learning in grid-world-like environments with small action spaces [6, 21, 22, 25] and planning in factored models [1]. In contrast, we train action priors in environments with image states and pixel-wise action spaces with thousands of actions. To that end, we represent an action prior as a single fully-convolutional neural network trained to summarize a library of pre-trained policies. We distinguish between a set of training tasks, which we solve using imitation learning, and a held-out set of testing tasks to be solved without expert information. The role of action priors is to bias exploration on the testing tasks toward actions that were found to be useful when solving the training tasks.

We evaluate our approach on 16 robotic block stacking tasks. We proceed in three stages. First, our agent uses imitation learning to find near-optimal solutions to a subset of the 16 tasks. Second, we condense these near-optimal policies into a state-dependent probability distribution over actions (i.e., the action prior) that gives high probability to any action that was part of one of the near-optimal policies. Finally, we use the action prior to bias exploration when solving a new task. In the block stacking domain, this action prior gives a high probability to picking actions that are likely to lift a block of some type, or to placing actions that are likely to result in a stable placement. Although we explicitly focus on robotic manipulation, our approach should generalize well to any problem in robotics with a large action space.

This paper makes two main contributions. First, we show that a state-conditioned action prior is an effective way to transfer knowledge from previously solved tasks to new tasks in a robotic manipulation domain. Our experimental results indicate that this approach can dramatically increase the probability of visiting a goal state during exploration.
Second, we introduce a method for learning a state-conditioned action prior in situations where the previously learned policies are valid over different regions of the state/action space. This problem only occurs in large state/action spaces such as in robotic manipulation, and we believe we are the first to address it.

Action priors bias action selection during the exploration phase of learning towards actions that were previously determined to be viable. This information can either be specified by an expert [1] or extracted from policies for previously solved tasks [1, 6, 21, 22, 25]. Note that action priors refer to a different construct than policy priors [5, 34], as action priors do not involve posterior inference of policy parameters. Sherstov and Stone [25] eliminated actions not optimal for any previous task in a state-agnostic way; together with their transfer learning algorithm, the state-agnostic action prior increases learning speed in grid-world mazes. Fernández-Rebollo and Veloso [6] and Rosman and Ramamoorthy [21, 22] explored state-specific action priors in similar discrete-state-space MDPs. Fernández-Rebollo and Veloso [6] alternated between rolling out the policy being learned and a policy sampled from a library; the probability distribution over the library of policies was updated online to maximize rewards for the current task. Rosman and Ramamoorthy [21, 22] filled in the pseudo-counts of Dirichlet distributions used to select actions in each state with a weighted sum of actions selected by previously learned policies. Abel et al. [1] combined action priors and hand-crafted object-oriented representations [4] to improve the run time of dynamic programming policy search for a Minecraft environment and a real-world robotic manipulation task.

In contrast to the pre-defined factored representations in Abel et al. [1], we learn the action prior and model-free policies from pixels. The Dirichlet prior [21, 22] is not easily extensible to continuous state spaces; instead, we learn the action prior as a convolutional network. Compared to Fernández-Rebollo and Veloso [6], we cannot keep a library of policies loaded in memory, as each policy is parameterized by a large convolutional network; we therefore distill all policies into a single action prior network.

In concurrent work, Ajay and Agrawal [2] and Pertsch et al. [19] learned action priors over fixed-length sequences of actions (also called skill priors). Both approaches use variational autoencoders to learn representations for action sequences and can be used to solve composite robotic manipulation tasks. Singh et al. [26] studied action priors (there called behavior priors) in a setting where training and testing tasks differ in terms of the objects being manipulated, but are otherwise the same.

The topic of efficient exploration is closely related to action priors. Methods in this category often do not use additional information, such as prior policies; instead, they rely on a notion of surprise or of the information content of a visited state. These quantities can be measured by counting the number of times states were visited [27] or by model-based approaches [11, 17]. Our problem statement is not directly comparable with these approaches, as we exploit additional information from previously learned tasks, which facilitates much more targeted exploration than the notion of surprise alone.
Transfer learning has been studied extensively both in classical reinforcement learning [28] and in deep reinforcement learning [9, 16, 29]. Goyal et al. [9] and Teh et al. [29] learned a so-called default policy while learning multiple specialized policies in a multi-task or multi-goal RL setting. To transfer to new tasks, Teh et al. [29] used the KL-divergence between the default policy and a new policy as regularization, and Goyal et al. [9] used their default policy to quantify the notion of a "decision state": a state in which we need to make a decision based on the task we want to solve (e.g., a crossroads in a maze). Their agent is then encouraged to explore decision states by adding an intrinsic reward. Both Goyal et al. [9] and Teh et al. [29] focus their experimental evaluation on navigation tasks, with the latter transfer method only being applicable to discrete-state-space domains (due to its use of count-based exploration). Parisotto et al. [16] distill policies from training tasks into a single student, which is then used to initialize the testing policy.
We model the pick and place robotics tasks in this paper as Markov Decision Processes (MDPs) [3], M = ⟨S, A, P, R, ρ_0, γ⟩. S and A represent the sets of states and actions, P : S × A → Pr(S) is a transition function that returns a probability mass/density over states, and the reward function R : S × A → R maps state-action pairs to their expected rewards. Each MDP also has an initial state distribution ρ_0 and a discount factor γ. A policy π : S × A → [0, 1] captures the decision-making process of an agent as a probability distribution over actions for each state. Each policy has an associated state-action value function Q^π(s, a) = R(s, a) + γ E_{s'∼P, a'∼π}[Q^π(s', a')], the discounted return when executing action a in state s and following policy π thereafter.

In this paper, we consider MDPs with the following properties:
• states are represented as images,
• the action space is large, usually one action per state pixel,
• the time horizon is short, around 10 time steps, and
• rewards are sparse.

Learning an optimal policy for this class of MDPs without additional information is extremely difficult because there is only a handful of optimal actions in each state. Hence, the probability of getting a reward for a sequence of random actions is minuscule.

Let T_train = {T_1, T_2, ..., T_N} be a set of training tasks expressed as MDPs. A task T_i = ⟨S, A, P_i, R_i, ρ_0, γ⟩ shares its definition with all other tasks except for its reward and transition function. For example, a task of building a tower from blocks of height two and a task of building a tower of height three clearly have a different reward function. Even though the dynamics of picking and placing objects are the same for both tasks, the former task terminates when a tower of two is built, whereas the latter does not; therefore, there are small variations in the transition function between the tasks related to terminal states. We assume we have access to an expert policy π_i for each training task i, together with a dataset of on-policy transitions D_i. Given a testing task T_{N+1} (different in its transition and reward dynamics from the training tasks), our goal is to learn the best possible policy. We formalize this as summarizing the experience from previous tasks (D_i, π_i)_{i=1}^N in some function (such as an action prior) parameterized by φ.

Algorithm 1: Action prior learning
Input: set of training tasks T = {T_1, T_2, ..., T_N}.
Output: action prior network f_AP.
procedure LearnAP
    for T_i in T do
        Train an expert policy π_i for task T_i.
        Collect K transitions by rolling out π_i; store visited states in D_i.
    end for
    Concatenate datasets {D_1, D_2, ..., D_N} into D.
    Train task classifier f_C : S → Δ^{N−1} on D (Section 5.1).
    Collect optimal action sets for π_1, . . . , π_N on D.
    Merge optimal action sets using f_C and add the union sets to D (Section 5.2).
    Train action prior network f_AP on D (Section 5.3).
    Return f_AP.
end procedure
Algorithm 2: Action prior exploration
Input: reinforcement learning agent f_RL, action prior network f_AP, action prior probability threshold σ, task T.
procedure ExploreAP
    while stopping condition not reached do
        Get environment state s.
        if Explore then
            Ā* ← { a | a ∈ A ∧ f_AP(s, a) > σ }.
            Randomly sample a ∼ Uniform(Ā*).
        else
            Choose a according to f_RL (e.g., a with maximum Q-value in DQN).
        end if
        Execute action a in the environment; observe reward r and next state s'.
        Add the tuple (s, a, r, s') to the replay buffer.
        Perform a learning step of f_RL.
    end while
end procedure

The parameters are then used in some training process π(φ), resulting in a policy for the testing task. We then indirectly maximize the success of the testing policy by manipulating φ:

argmax_φ E_{π(φ), ρ_0, P_{N+1}} [ Σ_{t=0}^∞ γ^t R_{N+1}(s_t, a_t) ].   (1)

Given a set of expert policies π_1, . . . , π_N for training tasks T_i ∈ T_train, we define the action prior to be a policy π_AP(s, a) = η max_{i ∈ [1...N]} π_i(s, a), where η normalizes π_AP. For example, if π_1, . . . , π_N are deterministic, then π_AP assigns equal probability to each action π_1(s), . . . , π_N(s).

Algorithm 1 outlines the procedure we use to train the action prior. First, we train expert policies for the N training tasks; action priors are agnostic to the method used to train them (Section B). Then, in Step 4, for each task i and policy π_i, we obtain a sample of on-policy states by rolling out π_i. Next, in Step 7, we train a classifier f_C that predicts the task that is most likely to have caused the agent to visit a state. This is important because the policies π_i are not all valid over the entire state space. For example, a policy trained to assemble the structure in Figure 1c (a tower from two cubes and a small roof) has never seen the state shown in Figure 1d (a house built from two blocks, a brick and a large roof). To determine the set of policies applicable in a given state, we train a task classifier f_C : S → Δ^{N−1} (Section 5.1), which predicts the tasks in which a state is most frequently encountered. This allows the action prior to ignore policies for tasks that are not relevant to a particular state. Then, in Step 8 of Algorithm 1, after training the policies π_i and the task classifier f_C, we collect the training dataset for the action prior (Section 5.2). The dataset contains an equal number of states for each task, which are obtained by rolling out the learned policies π_i. For each state, we compute a binary mask that represents the union of actions that are optimal (in any task) given a set of applicable policies; the set of applicable policies is predicted by the task classifier. Step 10 of Algorithm 1 trains the action prior network f_AP : S × A → [0, 1] to predict the probability of an action being optimal for any task in a given state (Section 5.3). Finally, we create an action prior policy π_AP(s, a) based on the action prior network f_AP: by thresholding the probabilities predicted by f_AP, we get a set of proposed actions Ā*(s) for state s, and we set π_AP(s, a) to be a uniform distribution over Ā*(s), with actions outside of this set assigned zero probability.
The task classifier f_C : S → Δ^{N−1} determines which of the expert policies are relevant in the context of a particular state. To train this classifier, we use a dataset of states and categorical labels {(s_i, y_i)}_{i=1}^M. We construct this dataset by generating policy rollouts for each task: if a state s was visited during a rollout of a policy for task y ∈ {1, . . . , N}, we include the pair (s, y) in the dataset. Note that this results in a dataset where class labels for each state are not unique, as each state could have been encountered during rollouts for multiple tasks (e.g., the initial state with all objects placed on the ground appears in all tasks). We can interpret the data as samples s_i, y_i ∼ p(s, y) from a distribution p(s, y) = p(s | y) p(y), in which p(y) is a uniform prior over tasks (i.e., the training dataset is balanced) and p(s | y) is the fraction of rollouts for each task in which state s appears. The classifier then approximates the conditional distribution p(y | s) ∝ p(s, y) ∝ p(s | y).

We implement f_C as a neural network NN(s) that predicts logits for all classes, which we normalize using a softmax function,

p(y | s) ≈ f_C(s) = softmax(NN(s)),   (2)

and train the classifier using a cross-entropy loss,

L_C = −(1/M) Σ_{i=1}^M Σ_{j=1}^{|T|} I[y_i = j] log f_C(s_i)_j.   (3)

To determine if the policy for task j is applicable in state s, we check whether the predicted probability f_C(s)_j is above some threshold δ. Since neural classifiers in general do not have well-calibrated probabilities, our approximation of p(y | s) can be overconfident [10]. That said, we find that this classification strategy works sufficiently well for our purposes in practice.

For an ideal action prior, we would like to determine the set of actions A*(s) that are optimal for any task in state s. Given an expert policy π_i for task i and a corresponding value function Q^{π_i}, we define the optimal action set for state s as

A*(s) = ∪_{i=1}^N { a | Q^{π_i}(s, a) = max_{a' ∈ A} Q^{π_i}(s, a') }.   (4)

We expect the cardinality of A*(s) to be high in our domain, as there are many equivalent ways of picking and placing objects. However, since our learned expert policies tend to be noisy, it is non-trivial to determine the set of optimal actions without carefully setting a threshold for each trained model. Instead, we restrict the optimal action set to one action per task, with the optimal action for the i-th task denoted by a*_i = argmax_a π_i(s, a), with ties broken randomly. These actions form an approximate optimal action set

Ã*(s) = {a*_1, a*_2, ..., a*_N}.   (5)

Ã*(s), which we can compute, is a subset of the ground-truth set A*(s). We deal with the problem of increasing the number of proposed optimal actions in Section 5.3. Furthermore, we restrict the optimal action set to only the tasks determined to be applicable by the task classifier (Section 5.1).

We use the approximate optimal action sets Ã*(s) to learn an action prior f_AP : S × A → [0, 1], which takes the form of a multi-task classifier that returns binary predictions for all actions in A given a state s ∈ S. We train this classifier using pairs of states and optimal action masks {(s_i, m_i)}_{i=1}^L. We collect training states by rolling out the learned policies for each task for a fixed number of time steps. For each state, we then compute the optimal action mask, which for each action a contains a bit that indicates whether this action is part of the approximate optimal action set,

m_{i,a} = I[a ∈ Ã*(s_i)].   (6)
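The mask construction in Eqs. (5)-(6), combined with the task-classifier filter from Section 5.1, can be summarized in a few lines. The following is a minimal NumPy sketch under the assumption that the expert policies and the task classifier have already been evaluated on the state; the function name, the array layout, and the example threshold values are illustrative rather than taken from our implementation.

```python
import numpy as np

def optimal_action_mask(expert_probs, task_probs, delta=0.05, rng=None):
    """expert_probs: (N, |A|) action probabilities of the N expert policies in state s.
    task_probs: (N,) task classifier output f_C(s).
    Returns a binary mask over actions approximating Eq. (6)."""
    rng = rng or np.random.default_rng()
    mask = np.zeros(expert_probs.shape[1], dtype=np.float32)
    for pi_s, p_task in zip(expert_probs, task_probs):
        if p_task < delta:                 # task judged irrelevant in this state (Section 5.1)
            continue
        best = np.flatnonzero(pi_s == pi_s.max())   # argmax over actions, ties kept
        mask[rng.choice(best)] = 1.0                # one action per applicable task (Eq. 5)
    return mask

# toy usage: 3 experts, 16 actions
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(16), size=3)
task_p = np.array([0.7, 0.25, 0.01])
print(optimal_action_mask(probs, task_p, rng=rng))
```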
We train the action prior network f_AP using a standard logistic loss,

L_AP = −(1/L) Σ_{i=1}^L Σ_{j=1}^{|A|} [ m_{ij} log f_AP(s_i, a_j) + (1 − m_{ij}) log(1 − f_AP(s_i, a_j)) ].   (7)

As discussed above, the training masks contain only a subset of the union of optimal action sets for all tasks. We can view this problem from the perspective of a precision-recall trade-off. A model that achieves a low L_AP value will have high precision (i.e., it will avoid false-positive optimal actions), but it might have low recall because not every optimal action is represented in the training data. This trade-off can be controlled by moving the decision boundary I[f_AP(s, a) ≥ σ] that determines if an action is deemed optimal. Experimentally, we found that setting σ to a low value (i.e., increasing recall and decreasing precision) results in a large increase in the success rate of the action prior exploration policy.

Algorithm 2 summarizes how we use the action prior for exploration when training a reinforcement learning agent on a new task. We follow the standard recipe of alternating between selecting random actions (exploration) and actions according to the policy being learned (exploitation). The most common and one of the simplest approaches to controlling this trade-off is an ε-greedy exploration policy with the value of ε decaying over time (e.g., this approach was used in the original deep Q-network paper [15]). We use an ε-greedy strategy in which, during exploration, we select actions uniformly at random from the set of actions for which the action prior exceeds a threshold σ,

Ā*(s) = { a | a ∈ A ∧ f_AP(s, a) > σ }.   (8)

This results in more focused exploration compared to selecting actions uniformly at random from the full set A.
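During exploration, the modified ε-greedy rule of Algorithm 2 and Eq. (8) replaces the uniform draw over all |A| actions with a uniform draw over Ā*(s). Below is a minimal sketch, assuming the Q-values and the action prior probabilities for the current state are available as flat arrays; the fallback to the full action set when no action clears σ is our assumption and is not part of Algorithm 2.

```python
import numpy as np

def select_action(q_values, prior_probs, epsilon, sigma=0.1, rng=np.random):
    """q_values, prior_probs: arrays of shape (|A|,) holding Q(s, .) and f_AP(s, .).
    With probability epsilon, sample uniformly from the proposed set of Eq. (8);
    otherwise act greedily with respect to the learned Q-function."""
    if rng.random() < epsilon:
        proposed = np.flatnonzero(prior_probs > sigma)   # \bar{A}*(s)
        if proposed.size == 0:                           # assumed fallback: full action set
            proposed = np.arange(q_values.shape[0])
        return int(rng.choice(proposed))
    return int(np.argmax(q_values))
```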
Figure 2: Transfer learning results in Fruits World (learning curves of reward vs. training step for AP (ours), DQN, AM-share, AM-freeze, and AM-prog). (a) Pick up four fruits in any order. (b) Pick up four fruits in a specific order. There are five distinct fruits in the environment and the agent should pick up between one and four of them. Picking can be done either (a) in any order or (b) in a particular order. We use all possible tasks for fruit combinations (30 tasks) and sample 20 tasks for fruit sequences. The results were obtained by leave-one-out cross-validation, where we learn an action prior over N − 1 tasks and perform transfer learning on the N-th task. We plot the learning curves for the hardest tasks here; see Table 3 in the Appendix for all results. We report means and their 95% confidence intervals over 10 runs.

It is possible that a new task will require the agent to select an action that was not optimal in any previous task. In this context, it could be beneficial to occasionally select a completely random action during the exploration step. However, we find that doing so decreases the success rate of the action prior exploration policy. We hypothesize that by setting σ to a low value (around 0.1), the action prior generalizes to actions that were not necessarily optimal for any previous task but are nevertheless plausible.

We demonstrate the effectiveness of action priors in two domains: a proof-of-concept Fruits World (Section 6.2) and a block stacking robotic manipulation experiment in the PyBullet physics simulator (Section 6.3). Policies learned in PyBullet can be deployed on a real-world UR5 robotic arm (Section 6.4). The key question we aim to answer is whether action priors enable a deep Q-network (DQN) to learn tasks that were previously out of reach.
Figure 3: Sequential fruit-picking task.

Fruits World is a grid-world-like domain. The world is a 5 × ... surprisingly difficult to solve with model-free deep reinforcement learning. Hence, we also give the agent a reward of -0.1 for picking up the wrong fruit and only allow it to put the right fruits in the basket. The state is a 5 × × ...

PyBullet block stacking is a set of simulated tasks involving stacking blocks of various shapes and sizes with a robotic arm (Figure 1). The robot observes the workspace with a depth camera from above (Figure 1 (b)), receiving a 90 × 90 image with pixel values corresponding to heights. It can execute a top-down pick or place action at a specified coordinate with fixed hand rotation. Before executing a pick action, the robot takes a 24 × 24 picture centered at the coordinates where it is executing the pick. If it successfully picks up an object, it uses this in-hand image to decide where to place it.
We discretize the action space as a 90 × 90 grid, each cell corresponding to one pixel of the observation. We instantiate 16 different block stacking tasks: each one builds a structure of a width of one or two small blocks and a height of two or three blocks; each structure has a roof on top. Figure 1 (c) and (d) show goal states of building a tower from two small blocks and a small roof, and of building a structure from two small blocks followed by one long block and a large roof, respectively. Each task is represented as a string (e.g., a small roof on top of a small block is "1b1r") and we use a context-free grammar to generate all possible tasks with particular parameters (Appendix Section A).

Final success rate on task:
Method        1b1r   2b1r   2b2r   1l1r   1l2r   1b1b1r  2b1b1r  2b2b1r
DQN RS        100%   0%     0%     100%   100%   0%      0%      0%
DQN RS, WS    98%    0%     0%     99%    100%   0%      0%      0%
DQN HS        97%    0%     0%     100%   100%   0%      0%      0%
DQN HS, WS    97%    0%     2%     100%   99%    3%      0%      0%
DQN AP        95%    96%    0%     100%   99%    90%     93%     97%
DQN AP, WS    0%     97%    0%     100%   100%   98%     100%    99%
Table 1: Transfer experiments in the block stacking domain. We consider "DQN AP, WS" the main contribution of this paper. Each column reports the final success rate averaged over 100 episodes after training. The baselines and ablations of our method we consider are random action selection (RS), heuristic action selection (HS), weight sharing (WS), and action priors (AP).
Both the 90 × 90 observation and the 24 × 24 in-hand image are encoded using a modified version of U-Net [20, 31]. We chose this model because it can produce detailed segmentation maps. The task solved by our DQN (which we use for model-free learning) and by our action prior models is comparable to image segmentation: we make a prediction for each pixel of the input image. In the DQN case, the U-Net predicts a single state-action value for each pixel of the input image; in the action prior case, the model produces a logit for each pixel of the observation. The exact architecture of our U-Net is depicted in Figure 7 in the appendix.
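To make the pixel-wise parameterization concrete, the sketch below shows a small fully-convolutional stand-in for the modified U-Net of Figure 7: it consumes the 90 × 90 depth observation together with the 24 × 24 in-hand image and outputs one scalar per pixel (a Q-value for the DQN, or a logit for the action prior). The layer sizes and the way the in-hand image is broadcast over the spatial grid are illustrative assumptions; the actual architecture is the U-Net variant in Figure 7.

```python
import torch
import torch.nn as nn

class PixelQNet(nn.Module):
    """Minimal fully-convolutional stand-in for the pixel-wise network:
    maps a 1x90x90 depth observation plus a 1x24x24 in-hand image to one
    value per observation pixel. Layer sizes are illustrative only."""
    def __init__(self, hand_channels=8):
        super().__init__()
        self.hand_enc = nn.Sequential(                 # encode the in-hand image
            nn.Conv2d(1, hand_channels, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        self.trunk = nn.Sequential(
            nn.Conv2d(1 + hand_channels, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 1))                       # one output per pixel

    def forward(self, obs, in_hand):
        h = self.hand_enc(in_hand)                     # (B, C, 1, 1)
        h = h.expand(-1, -1, obs.shape[2], obs.shape[3])
        return self.trunk(torch.cat([obs, h], dim=1)).squeeze(1)   # (B, 90, 90)

# toy usage
net = PixelQNet()
q_map = net(torch.zeros(2, 1, 90, 90), torch.zeros(2, 1, 24, 24))
print(q_map.shape)   # torch.Size([2, 90, 90])
```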
We sample 20 and 30 tasks for the fruit sequences and fruit combinations tasks, respectively (Section 6.1). Action priors are tested in a leave-one-out transfer experiment: an action prior is trained on 19 and 29 tasks, respectively; it is then used for exploration on the held-out task. This process is repeated for each task. Expert policies for the training tasks are trained with imitation learning (Section B). Our comparison in Figure 2 involves action priors (AP), a randomly initialized deep Q-network (DQN), and three baselines based on Actor-Mimic [16] (AM-share, AM-freeze and AM-prog). Actor-Mimic first trains a student to mimic the policies of all experts; the student weights are then used to initialize the agent when learning the testing task. We followed the original implementation except that our student has N heads, one for each training task. URL: https://github.com/eparisotto/ActorMimic, visited 21/02/08.
This deviation from the original paper is due to the state of our environment not containing any information about which task the agent is solving. In contrast, Parisotto et al. [16] test their method on several Atari games, each with a distinct state space. In all cases, we remove the N student heads and transfer the weights of the hidden layers. In AM-share, all weights are trainable; AM-freeze only learns the final fully-connected layer. AM-prog is based on Rusu et al. [23], where each hidden layer of the agent network receives both features from the previous agent layer and from the previous student layer. Hence, there are two networks, student and agent, but only the agent is trained. We do not use the adaptation layer suggested by [23] because we only have one student network.

Action priors significantly outperform all baselines in fruit combinations (Figure 2a) and are the only method to solve any fruit sequences task (Figure 2b). In the former, AM-prog and AM-freeze have similar performance, whereas random initialization and AM-share train much slower. We hypothesize that AM-share overwrites the student weights at the start of training, causing it to perform similarly to random initialization. The negative rewards at the start of training received by action priors in Figure 2b can be explained by AP picking up the wrong fruits (small negative reward), while the other methods fail to pick up any fruit (zero reward).

Figure 4: Visualizing actions proposed by an action prior (red crosses) with the probability threshold σ set to 0.1 (top row) and 0.5 (bottom row): (a) σ = 0.1, empty hand; (b) σ = 0.1, cube in hand; (c) σ = 0.5, empty hand; (d) σ = 0.5, cube in hand. In the first column, the robot is supposed to pick up an object; in the second column, it is holding a cube in its hand and wants to place it. The images shown are pictures from a depth camera mounted at the top of the workspace in simulation.

We perform two experiments in this domain: one focused on model-free learning with action priors and the other solely on exploration using action priors. The first experiment matches the setup in Section 6.2: we use 15 out of the 16 block stacking tasks to learn an action prior, which we subsequently test on the 16th task (Table 1). An imitation learning method called SDQfD provides the expert policies for the training tasks (Section B). The second experiment evaluates the success rate of action prior policies on all tasks without additional training (Figure 5 shows two example tasks; Figure 9 contains the results for all tasks). We measure the success rate as a function of the action prior probability threshold σ and the presence or absence of a task classifier.

Experiment 1 compares our method with two baselines.
DQN AP is a deep Q-network with action prior exploration. As baselines, we test random and heuristic action selection. Since our observations are depth images taken from the top of the workspace, the heuristic only allows selecting actions that act on parts of the observation with non-zero height; it therefore forces the agent to interact with objects. However, not all interactions lead to desirable outcomes: the agent often ends up pushing objects out of the workspace. We call the two baselines DQN RS and DQN HS, respectively.

As a second mechanism for speeding up learning, we employ weight sharing in the unseen task. Because an action prior summarizes the information contained in the experts for the 15 training tasks, we initialize the DQN for the 16th (testing) task with the action prior weights. To mitigate catastrophic forgetting [8], we include an L2 penalty between the original action prior weights and the current DQN weights; it is added to the DQN loss function with a weight of ω_WS. We test weight sharing both with the random and heuristic baselines (DQN RS, WS and DQN HS, WS) and with action priors (DQN AP, WS).

Figure 5: Success rates for "1b1r" and "2b1l1r" using exploration with an action prior as a function of σ (success probability (%) vs. prediction threshold). We compare action priors trained with and without a task classifier.

A DQN with an action prior can solve all tasks except for "2b2b1r" and "2b1l2r" (Table 1). As shown in Figure 9, the success rates of action priors on these tasks without training are between 1 and 4%; hence, a more targeted action prior might be needed. Having the DQN share weights with the action prior does not lead to noticeable benefits. In fact, it fails to learn "2b2b2r", unlike the non-weight-sharing variant. Since the purpose of weight sharing is to improve the training speed of the network, it only provides minor benefits, exploration being the main challenge in our domains. As we expected, both the heuristic and random action selection baselines are unable to solve most of the tasks. Random actions succeed on the three simplest tasks ("1b1r", "2b1r" and "2b2r"), and heuristic selection increases the number to four (plus "1l1l2r"). In the rest of the tasks, the success rates of the baseline exploration policies are close to zero; therefore, no learning method, including weight sharing, can succeed.
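For reference, the weight-sharing regularizer described above amounts to an L2 penalty between the current DQN parameters and the action prior weights used for initialization, scaled by ω_WS and added to the DQN loss. The following is a hedged PyTorch sketch; the function name and the assumption that the two networks share parameter names are ours, while the default weight 0.1 follows Appendix C.2.

```python
import torch

def weight_sharing_penalty(model, prior_state_dict, omega_ws=0.1):
    """Sum of squared differences between the current parameters of `model`
    and the frozen action prior weights it was initialized from, scaled by
    omega_ws. The result is meant to be added to the DQN loss."""
    penalty = torch.zeros(())
    for name, param in model.named_parameters():
        ref = prior_state_dict[name].detach()     # assumes matching parameter names
        penalty = penalty + ((param - ref) ** 2).sum()
    return omega_ws * penalty
```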
Experiment 2 evaluates two variants of the action prior, each again trained in a leave-one-out fashion on 15 tasks and tested on the 16th task. In this experiment, we only log the success rate of the action prior policy, without learning on the testing task. Figure 5 shows examples of action prior success rates for tasks "1b1r" and "2b1l1r". We plot the success rates as a function of the action prediction threshold σ and the presence or absence of a task classifier. The success rates are measured by rolling out episodes in the test task (depicted in the top-left corner) with actions selected by the action prior policy π_AP. We show results for all tasks in Appendix Figure 9.

The main trends in Figure 5 (and Appendix Figure 9) are that (a) more complicated tasks benefit more from the task classifier and that (b) almost all action priors that use a task classifier benefit from a decrease in the prediction probability threshold. The increase in success rates for action priors with task classifiers in more complex tasks can be explained by their need for a targeted and precise exploration policy: a task classifier ensures that the training data for the action prior does not mark actions from irrelevant tasks as optimal. On the other hand, we are unsure as to why action priors without task classifiers perform well on simple tasks. See Figure 4 for examples of the sets of actions proposed by action priors.

In theory, policies trained in our PyBullet simulation can be transferred to a real-world robot, as both setups use a depth camera and a robotic arm with the same action spaces. However, in practice, noise in the real-world observations often confuses policies trained with perfect depth images. To make our models more robust, we re-train the action prior DQNs with weight sharing (DQN AP, WS) from the transfer learning experiment in Section 6.3 with two modifications to the environment. First, we add Perlin noise, a type of correlated gradient noise, to the simulated depth images [12, 18]. The noise models blobs of lower and higher estimated depth values that are present in the real-world images. Second, we initialize all objects in the environment with small rotations relative to their default orientations. In our previous experiments, all objects had the same orientation, as the agent cannot control the rotation of the robot hand; however, the initialization of the real-world scene will inevitably be less precise.

Our setup is depicted in Figure 6 and described in detail in Appendix Section C.3. We pick 8 of the 16 trained policies for testing. Each such policy is run for ten episodes and we report the fraction of successful episodes (Table 2).
Most policies reach around 80-90% success rate. Two common failure modes are the robot failing to find the small roof because of noise in the depth image, and the robot accidentally knocking down a structure when placing a block. The latter cannot always be attributed to a bad policy: our gripper sometimes releases objects off-center, and since the real-world gripper is significantly larger than the simulated one, it fails if objects get pushed too close to each other.
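The correlated depth noise used for the sim-to-real re-training can be approximated without a full Perlin implementation by upsampling a coarse grid of Gaussian noise, which produces similar smooth blobs of over- and under-estimated depth. The sketch below works under that simplifying assumption; the paper uses actual Perlin noise [12, 18], and the grid size and amplitude are placeholders.

```python
import numpy as np

def correlated_depth_noise(height=90, width=90, grid=6, scale=0.01, rng=None):
    """Low-resolution Gaussian noise bilinearly upsampled to the image size,
    so that neighbouring pixels are correlated. A simplified stand-in for the
    Perlin noise applied to simulated depth images."""
    rng = rng or np.random.default_rng()
    coarse = rng.normal(size=(grid, grid))
    ys = np.linspace(0, grid - 1, height)
    xs = np.linspace(0, grid - 1, width)
    y0, x0 = np.floor(ys).astype(int), np.floor(xs).astype(int)
    y1, x1 = np.minimum(y0 + 1, grid - 1), np.minimum(x0 + 1, grid - 1)
    wy, wx = (ys - y0)[:, None], (xs - x0)[None, :]
    top = coarse[y0][:, x0] * (1 - wx) + coarse[y0][:, x1] * wx
    bot = coarse[y1][:, x0] * (1 - wx) + coarse[y1][:, x1] * wx
    return scale * ((1 - wy) * top + wy * bot)

# noisy_depth = depth_image + correlated_depth_noise()
```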
Figure 6: Our pick and place experiment with a UR5 robotic arm. The robot observes the workspace with a depth camera mounted above it, and it can choose to either pick or place an object with a top grasp and fixed orientation.

Table 2 (columns: Task, Successes): Real-world experiment with a UR5 robotic arm (Figure 6). The environment was set up in the same way as the simulated workspace and we evaluated policies trained in simulation in 10 trials. Each trial had a budget of 20 time steps. We break down failures into two components: pick failures and place failures. Pick failures mean that the robot could not find the desired object or could not pick it with sufficient precision. Place failures occur when the robot fails to place an object on the appropriate structure or topples the structure over.

In this work, we proposed a method for efficient exploration using action priors. Our approach to learning action priors involves solving a set of training tasks with imitation learning and summarizing the learned policies in a single action prior neural network. This network is trained on sets of optimal actions predicted by the policies, with the addition of a task classifier that determines which policies are relevant in each state. In contrast to prior work on action priors, which has predominantly considered grid-world-like environments, our method is applicable to domains with image states and thousands of actions. In addition to a proof-of-concept Fruits World domain, where it always finds near-optimal policies, we demonstrate performance on a simulated robotic block stacking task. A deep Q-network augmented with an action prior can solve 14 out of the 16 block stacking tasks within 100k episodes. Moreover, the policies trained in simulation can also be deployed on the real robotic arm with only a small drop in their success rates.

A future direction of our work is to develop action priors that are suitable for learning online during training on a new task. In this way, the action prior would progressively increase the chance of hitting the goal state while exploring. A natural extension of our manipulation task is to consider a full SE(3) action space (i.e., grasping from any angle).
REFERENCES
[1] David Abel, D. Ellis Hershkowitz, Gabriel Barth-Maron, Stephen Brawner, Kevin O'Farrell, James MacGlashan, and Stefanie Tellex. 2015. Goal-Based Action Priors. In Proceedings of the Twenty-Fifth International Conference on Automated Planning and Scheduling, ICAPS 2015, Jerusalem, Israel, June 7-11, 2015. 306–314.
[2] Anurag Ajay and Pulkit Agrawal. 2020. Learning Action Priors for Visuomotor Transfer. International Conference on Machine Learning workshop on Inductive Biases, Invariances and Generalization in RL (BIG).
[3] Richard Bellman. 1957. A Markovian Decision Process. Journal of Mathematics and Mechanics 6, 5 (1957), 679–684.
[4] Carlos Diuk, Andre Cohen, and Michael L. Littman. 2008. An object-oriented representation for efficient reinforcement learning. In Machine Learning, Proceedings of the Twenty-Fifth International Conference (ICML 2008), Helsinki, Finland, June 5-9, 2008. 240–247.
[5] Finale Doshi-Velez, David Wingate, Nicholas Roy, and Joshua B. Tenenbaum. 2010. Nonparametric Bayesian Policy Priors for Reinforcement Learning. In Advances in Neural Information Processing Systems 23: 24th Annual Conference on Neural Information Processing Systems 2010, December 6-9, 2010, Vancouver, British Columbia, Canada. 532–540.
[6] Fernando Fernández-Rebollo and Manuela M. Veloso. 2006. Probabilistic policy reuse in a reinforcement learning agent. In . 720–727.
[7] Chelsea Finn and Sergey Levine. 2017. Deep visual foresight for planning robot motion. In . 2786–2793.
[8] Robert M. French. 1999. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences 3, 4 (1999), 128–135.
[9] Anirudh Goyal, Riashat Islam, Daniel Strouse, Zafarali Ahmed, Hugo Larochelle, Matthew Botvinick, Yoshua Bengio, and Sergey Levine. 2019. InfoBot: Transfer and Exploration via the Information Bottleneck. In .
[10] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. 2017. On Calibration of Modern Neural Networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017. 1321–1330.
[11] Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. 2016. VIME: Variational Information Maximizing Exploration. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain. 1109–1117.
[12] Stephen James, Andrew J. Davison, and Edward Johns. 2017. Transferring End-to-End Visuomotor Control from Simulation to Real World for a Multi-Stage Task. In . 334–343.
[13] Robert Platt Jr., Colin Kohler, and Marcus Gualtieri. 2019. Deictic Image Mapping: An Abstraction for Learning Pose Invariant Manipulation Policies. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019. 8042–8049.
[14] Sergey Levine, Peter Pastor, Alex Krizhevsky, Julian Ibarz, and Deirdre Quillen. 2018. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. Int. J. Robotics Res. 37, 4-5 (2018), 421–436.
[15] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin A. Riedmiller, Andreas Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. 2015. Human-level control through deep reinforcement learning. Nat. .
[17] Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. 2017. Curiosity-driven Exploration by Self-supervised Prediction. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017. 2778–2787.
[18] Ken Perlin. 1985. An image synthesizer. In Proceedings of the 12th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH 1985, San Francisco, California, USA, July 22-26, 1985. 287–296.
[19] Karl Pertsch, Youngwoon Lee, and Joseph J. Lim. 2020. Accelerating Reinforcement Learning with Learned Skill Priors. Conference on Robot Learning (CoRL).
[20] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention - MICCAI 2015 - 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III. 234–241.
[21] Benjamin Rosman and Subramanian Ramamoorthy. 2012. What good are actions? Accelerating learning using learned action priors. In . 1–6.
[22] Benjamin Rosman and Subramanian Ramamoorthy. 2015. Action Priors for Learning Domain Invariances. IEEE Trans. Auton. Ment. Dev. 7, 2 (2015), 107–118.
[23] Andrei A. Rusu, Neil C. Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. 2016. Progressive Neural Networks. CoRR abs/1606.04671 (2016).
[24] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. 2016. Prioritized Experience Replay. In .
[25] Alexander A. Sherstov and Peter Stone. 2005. Improving Action Selection in MDP's via Knowledge Transfer. In Proceedings, The Twentieth National Conference on Artificial Intelligence and the Seventeenth Innovative Applications of Artificial Intelligence Conference, July 9-13, 2005, Pittsburgh, Pennsylvania, USA. 1024–1029.
[26] Avi Singh, Huihan Liu, Gaoyue Zhou, Albert Yu, Nicholas Rhinehart, and Sergey Levine. 2020. Parrot: Data-Driven Behavioral Priors for Reinforcement Learning. CoRR abs/2011.10024 (2020).
[27] Alexander L. Strehl and Michael L. Littman. 2005. A theoretical analysis of Model-Based Interval Estimation. In Machine Learning, Proceedings of the Twenty-Second International Conference (ICML 2005), Bonn, Germany, August 7-11, 2005. 856–863.
[28] Matthew E. Taylor and Peter Stone. 2009. Transfer Learning for Reinforcement Learning Domains: A Survey. J. Mach. Learn. Res. 10 (2009), 1633–1685.
[29] Yee Whye Teh, Victor Bapst, Wojciech M. Czarnecki, John Quan, James Kirkpatrick, Raia Hadsell, Nicolas Heess, and Razvan Pascanu. 2017. Distral: Robust multitask reinforcement learning. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA. 4496–4506.
[30] Hado van Hasselt, Arthur Guez, and David Silver. 2016. Deep Reinforcement Learning with Double Q-Learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA. 2094–2100.
[31] Dian Wang, Colin Kohler, and Robert Platt. 2020. Policy learning in SE(3) action spaces. Conference on Robot Learning (CoRL).
[32] Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, and Nando de Freitas. 2016. Dueling Network Architectures for Deep Reinforcement Learning. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016. 1995–2003.
[33] Manuel Watter, Jost Tobias Springenberg, Joschka Boedecker, and Martin A. Riedmiller. 2015. Embed to Control: A Locally Linear Latent Dynamics Model for Control from Raw Images. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada. 2746–2754.
[34] David Wingate, Noah D. Goodman, Daniel M. Roy, Leslie Pack Kaelbling, and Joshua B. Tenenbaum. 2011. Bayesian Policy Search with Policy Priors. In IJCAI 2011, Proceedings of the 22nd International Joint Conference on Artificial Intelligence, Barcelona, Catalonia, Spain, July 16-22, 2011. 1565–1570.
[35] Kevin Zakka, Andy Zeng, Johnny Lee, and Shuran Song. 2020. Form2Fit: Learning Shape Priors for Generalizable Assembly from Disassembly. In . 9404–9410.
[36] Andy Zeng, Shuran Song, Stefan Welker, Johnny Lee, Alberto Rodriguez, and Thomas A. Funkhouser. 2018. Learning Synergies Between Pushing and Grasping with Self-Supervised Deep Reinforcement Learning. In . 4238–4245.
[37] Andy Zeng, Shuran Song, Kuan-Ting Yu, Elliott Donlon, Francois Robert Hogan, Maria Bauzá, Daolin Ma, Orion Taylor, Melody Liu, Eudald Romo, Nima Fazeli, Ferran Alet, Nikhil Chavan Dafle, Rachel Holladay, Isabella Morona, Prem Qu Nair, Druck Green, Ian J. Taylor, Weber Liu, Thomas A. Funkhouser, and Alberto Rodriguez. 2018. Robotic Pick-and-Place of Novel Objects in Clutter with Multi-Affordance Grasping and Cross-Domain Image Matching. In . 1–8.
A STACKING GRAMMAR
Instead of arbitrarily choosing tasks, we characterize the whole family of possible tasks. Here, we define a simple grammar of stacking tasks. It only captures the notion of a stack, so it cannot represent tasks such as two blocks in a row; the only notion of two blocks being near one another is if a long block or a long roof can be put on top of them. We have the following terminals: one block (1b), two blocks next to each other (2b), one brick (1l), short roof (1r) and long roof (2r). The non-terminals are ground (G), wildcard (W), short (S) and long (L). The starting symbol is the ground. The rules allow for stacking of long objects (two blocks next to each other or one long block) on long objects, short on short, and short on long. We separate the different possible outcomes of a rule by commas.
• G →
• W → S, L
• S →
• L →

B EXPERT POLICIES
Action priors as well as our weight sharing and Actor-Mimic baselines require expert policies for the training tasks. Next, we describe the methods we used to train expert policies in Fruits World and Block Stacking; action priors are agnostic to these design decisions. In both cases, we start with a replay buffer pre-populated with expert demonstrations and learn a near-optimal policy, which we also call an expert. This might seem redundant but, as we describe below, the method used to generate expert demonstrations is not always able to select the optimal action in an arbitrary state. Moreover, the expert demonstrations contain optimal actions, not distributions over all actions, which are required by Actor-Mimic [16].
Fruits World:
We use the same DQN both for the experts and for training on the testing task (Section C.1), but the expert version is given a pre-populated replay buffer. The buffer contains 50k transitions generated by a policy that executes the optimal action with 50% probability, and a random action otherwise. The optimal action can be easily computed given the full state of the environment. The expert is trained for 50k steps offline on this buffer; then, it is trained online for another 50k steps with new experiences mixed into the pre-populated buffer.
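A minimal sketch of the demonstration-generating policy described above; the function name and arguments are illustrative, and the optimal action is assumed to be supplied from the environment's full state.

```python
import random

def demo_policy(optimal_action, num_actions, p_expert=0.5):
    """With probability p_expert, execute the optimal action (computed from the
    full environment state); otherwise act uniformly at random."""
    if random.random() < p_expert:
        return optimal_action
    return random.randrange(num_actions)
```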
Simulated Block Stacking:
We use an imitation learning method called SDQfD [31] to learn expert policies. Similarly to the above, it requires a replay buffer pre-populated with expert demonstrations. Following Wang et al. [31], we start in the goal state (i.e., with a simulator initialized so that a particular goal structure is built), pick blocks from the top of the structure, and place them in empty spots on the ground. Since all actions are reversible, a deconstruction episode played backward looks like the agent is building the goal structure, and we consider the reversed episodes to be expert demonstrations. Programming an initialization function for each structure and a deconstruction policy is much easier than having a custom planner for each task. Note that the initialization function and deconstruction policy are used only for training tasks, not the testing task.

The objective of SDQfD is a weighted sum of a temporal-difference loss L_TD (as in DQN [15]) and a term L_SLM (with weight ω) that penalizes non-expert actions with high predicted values,

L_SDQfD = L_TD + ω L_SLM.   (9)

During training, the method identifies the set of non-expert actions A_{s,a_e} whose values are higher than the value of the expert action a_e minus a margin l(a_e, a) (a positive constant for a_e ≠ a; zero otherwise),

A_{s,a_e} = { a ∈ A | Q(s, a) > Q(s, a_e) − l(a_e, a) }.   (10)

The penalty, called a "strict large margin", is then applied to the values of all actions in A_{s,a_e},

L_SLM = (1 / |A_{s,a_e}|) Σ_{a ∈ A_{s,a_e}} [ Q(s, a) + l(a_e, a) − Q(s, a_e) ].   (11)

We pre-train SDQfD on a replay buffer with expert demonstrations for 10k steps. After pre-training, we alternate between taking one environment step according to the current policy and performing a training step. We maintain two separate buffers: one for expert and one for on-policy transitions. Each batch of training data contains an equal number of samples from both buffers. The strict large margin is only computed for the expert data.
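The strict large margin term of Eqs. (10)-(11) can be computed in a vectorized way over a batch of expert transitions. Below is a hedged PyTorch sketch; the tensor layout, the function name, and the per-sample averaging are assumptions on our part, with the default margin of 0.1 taken from Appendix C.2.

```python
import torch

def sdqfd_margin_loss(q_values, expert_actions, margin=0.1):
    """q_values: (B, |A|) predicted Q-values; expert_actions: (B,) expert action indices.
    Penalizes non-expert actions whose Q-value comes within a margin of the expert
    action's Q-value, averaged over the violating actions and over the batch."""
    q_expert = q_values.gather(1, expert_actions.unsqueeze(1))            # (B, 1)
    l = torch.full_like(q_values, margin)
    l.scatter_(1, expert_actions.unsqueeze(1), 0.0)                       # l(a_e, a_e) = 0
    violations = (q_values > q_expert - l).float()                        # A_{s, a_e}, Eq. (10)
    per_action = (q_values + l - q_expert).clamp(min=0.0) * violations    # Eq. (11) summand
    counts = violations.sum(dim=1).clamp(min=1.0)
    return (per_action.sum(dim=1) / counts).mean()
```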
C EXPERIMENT DETAILS

C.1 Fruits World

We use the same neural network for both the DQN and the action prior: it has two hidden layers with 256 neurons each and ReLU non-linearities in between. The DQN predicts a Q-value for each action, and the action prior predicts the log probability of each action being optimal. Baseline DQNs are trained for 100k steps with an ε-greedy policy whose ε linearly decreases from 1.0 to 0.1 over 80k steps. The learning rate was set to 5 ∗ − and we use prioritized replay with default parameters, a dueling network, and double Q-learning [24, 30, 32]. Each action prior network is trained for 10k steps with a learning rate of 0.01. For the transfer experiments, we train a DQN with the same parameters as in the training tasks, except that it uses an action prior for exploration or is initialized with Actor-Mimic weights (see Section 6.2).
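A sketch of the Fruits World network described above, assuming the observation is flattened into a vector of size obs_dim: two hidden layers of 256 units with ReLU, and one output per action (a Q-value for the DQN, or a per-action logit for the action prior). obs_dim and num_actions are placeholders.

```python
import torch.nn as nn

def fruits_world_net(obs_dim, num_actions):
    # Two hidden layers of 256 units with ReLU, one output per action.
    return nn.Sequential(
        nn.Linear(obs_dim, 256), nn.ReLU(),
        nn.Linear(256, 256), nn.ReLU(),
        nn.Linear(256, num_actions))
```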
C.2 Simulated Block Stacking

We use a modified version of the U-Net architecture [20] for all our networks in this experiment. The schema of the fully-convolutional network is included in Figure 7. To train the SDQfD, we collect 50k expert trajectories obtained by reversing deconstruction experience [35]. We pre-train the model on this experience for 10k steps. Then we train the model while it interacts with the simulator for 40k episodes; each episode has a maximum length of 20. The learning rate is set to 5 ∗ − and both the large margin weight and the margin coefficient are set to 0.1. There is no exploration, the batch size is set to 32 and the discount to 0.9. We run five simulated environments in parallel: we take one step in each environment, collect the transitions, take one training step of SDQfD, and repeat.

For the training datasets, we collect 20k steps for each of the 15 tasks used to train an action prior and concatenate the experience. The task classifier is trained on this data for 20k steps with a learning rate of 10 −, weight decay of 10 − and a batch size of 32. We use a probability threshold for relevant tasks δ of 0.05. Action priors are trained with the same settings, except for a batch size of 50 and 10k training steps.

In the transfer experiment, we use an action prior probability threshold σ of 0.1. During exploration, the model only selects actions proposed by the action prior. We train a DQN for 100k episodes with the modified ε-greedy policy; ε linearly decays from 1.0 to 0.01 over 80k episodes. The learning rate is set to 10 −, the batch size to 32 and the discount factor to 0.9. We use prioritized experience replay with default parameters, but we do not use the weighting of the sampled transitions in the loss function. In the weight sharing experiments, the weight of the L2 penalty, ω_WS, is set to 0.1.
C.3 Real-World Block Stacking
We tested the trained model on a Universal Robots UR5 arm with a Robotiq 2F-85 gripper. The observation is provided by an Occipital Structure sensor pointing at the workspace from the top down. Figure 6 shows the robot experiment setup. All task parameters mirror the simulation. We run 10 trials for each task, and the maximum number of steps for each trial is 20. Figure 8 shows an example run of the task 1l2b2r.
D ADDITIONAL RESULTS
Figure 9 shows the results for Experiment 2: decreasing σ tends to increase the success rates of the action priors that use the task classifier.

Figure 7: A modified U-Net network architecture. The inputs are a 90 × 90 depth image of the environment I and a 24 × 24 zoomed-in image of an object the robot picked up. If the robot is not holding anything, all pixel values are set to zero. "3 × 3 conv, 32" means a convolutional layer with 32 filters of size 3. All max pooling layers use a kernel of size 2 and a stride of 2. We omit the ReLU activations in the diagram.

Figure 8: One example run of the 1l2b2r task.

Figure 9: We show the same results as in Figure 5 in the main text, but for all 16 tasks (success probability (%) as a function of the prediction threshold, with and without a task classifier).

Num. fruits   AP (ours)    DQN          AM-share     AM-freeze    AM-prog
              Mid.  Fin.   Mid.  Fin.   Mid.  Fin.   Mid.  Fin.   Mid.  Fin.
Fruit combinations
1             1     1      1     1      1     1      0.88  0.94   0.98  1
2             1     1      1     1      1     1      0.75  0.85   0.9   0.99
3             1     1      0.98  1      0.98  1      0.77  0.85   0.82  0.96
4             0.97  1      0.02  0.68   0.05  0.8    0.78  0.84   0.84  0.93
Mean          0.63  0.74   0.64  0.74   0.64  0.74   0.58  0.66