Learning Manipulation Skills Via Hierarchical Spatial Attention
Marcus Gualtieri and Robert Platt
Abstract—Learning generalizable skills in robotic manipulation has long been challenging due to real-world sized observation and action spaces. One method for addressing this problem is attention focus – the robot learns where to attend its sensors, and irrelevant details are ignored. However, these methods have largely not caught on due to the difficulty of learning a good attention policy and the added partial observability induced by a narrowed window of focus. This article addresses the first issue by constraining gazes to a spatial hierarchy. For the second issue, we identify a case where the partial observability induced by attention does not prevent Q-learning from finding an optimal policy. We conclude with real-robot experiments on challenging pick-place tasks demonstrating the applicability of the approach.
I. INTRODUCTION
Learning robotic manipulation has remained an active and challenging research area. This is because the real-world environments in which robots exist are large, dynamic, and complex. Partial observability – where the robot does not at once perceive the entire environment – is common and requires reasoning over past perceptions. Additionally, the ability to generalize to new situations is critical because, in the real world, new objects can appear in different places unexpectedly.

The particular problem addressed in this paper is the large space of possible robot observations and actions – how the robot processes its past and current perceptions to make high-dimensional decisions. Visual attention has long been suggested as a solution to this problem [1]. Focused perceptions can ignore irrelevant details, and generalization is improved by the elimination of the many irrelevant combinations of object arrangements [1]. Additionally, as we later show, attention can result in a substantial reduction in the number of actions that need to be considered. Indeed, when selecting position, the number of action choices can become logarithmic rather than linear in the volume of the robot's workspace. In spite of these benefits, visual attention has largely not caught on due to (a) the additional burden of learning where to attend and (b) additional partial observability caused by the narrowed focus.
Khoury College of Computer Sciences, Northeastern University, Boston, MA 02115 USA. E-mail: [email protected].

We address the first challenge – efficiently learning where to attend – by constraining the system to a spatial hierarchy of attention. On a high level this means the robot must first see a large part of the scene in low detail, select a position within that observation, see the next observation in more detail at the position previously selected, and so on for a fixed number of gazes. We address the second challenge – partial observability induced by the narrowed focus – by identifying attention with a type of state abstraction which preserves the ability to learn optimal policies with efficient reinforcement learning (RL) algorithms.

This article extends our prior work [2], wherein we introduced the hierarchical spatial attention (HSA) approach and demonstrated it on 3 challenging, 6-DoF, pick-place tasks. New additions include (a) faster training and inference times, (b) more ablation studies and comparisons to related work, (c) better understanding of when an optimal policy can be learned when using this approach, (d) longer time horizons, and (e) improved real-robot experimental results.

The rest of the paper is organized as follows. First is related work (Section II). Next, the general manipulation problem is described and the visual attention aspect is added (Sections III and IV-A). After that, the HSA constraints are added, and this approach is viewed as a generalization of earlier approaches (Sections IV-B to IV-E). The bulk of the paper includes analysis and comparisons in 4 domains of increasing complexity (Section V). Real robot experiments are described close to the end (Sections V-C and V-D). Finally, we conclude with what we learned and future directions (Section VI).

II. RELATED WORK
This work is most related to robotic manipulation, reinforcement learning, and attention models. It extends our prior research on 6-DoF pick-place [2] and primarily builds on DQN [3] and Deictic Image Mapping [4].
A. Learning Robotic Manipulation
Traditional approaches to robotic manipulation consider known objects – a model of every object to be manipulated is provided in advance [5], [6], [7]. While these systems can be quite robust in controlled environments, they encounter failures when the shapes of the objects differ from expected. Recent work has demonstrated grasping of novel objects by employing techniques intended to address the problem of generalization in machine learning [8], [9], [10], [11], [12], [13], [14], [15], [16].

There have been attempts to extend novel object grasping to more complex tasks such as pick-place. However, these have assumed either fixed grasp choices [17] or fixed place choices [18]. The objective of the present work is to generalize these attempts – a single system that can find 6-DoF grasp and place poses.

Other research considers grasping and pushing novel objects to a target location [19]. Their approach is quite different: a predictive model of the environment is learned and used for planning, whereas we aim to learn a policy directly. Other work has considered the problem of domain transfer [20] and sparse rewards in RL [21]. We view these as complementary ideas that could be combined with our approach for an improvement.
B. Reinforcement Learning
Like several others, we apply RL techniques to the problem of robotic manipulation (see the abovementioned [10], [13], [15], [18], [21] and survey [22]). RL is appealing for robotic control for several reasons. First, several algorithms (e.g., [23], [24]) do not require a complete model of the environment. This is of particular relevance to robotics, where the environment is dynamic and difficult to describe exactly. Additionally, observations are often encoded as camera or depth sensor images. Deep Q-Networks (DQN) demonstrated an agent learning difficult tasks (Atari games) where observations were image sequences and actions were discrete [3]. An alternative to DQN that can handle continuous action spaces is the family of actor-critic methods like DDPG [25]. Finally, RL – which has its roots in optimal control – provides tools for the analysis of learning optimal behavior (e.g., [26], [27], [28]), which we refer to in Section V-A.
C. Attention Models
Our approach is inspired by models of visual attention. Following the early work of Whitehead and Ballard [1], we distinguish overt actions (which directly effect change in the environment) from perceptual actions (which retrieve information). Similar to their agent model, our abstract robot has a virtual sensor which can be used to focus attention on task-relevant parts of the scene. The present work updates their methodology to address more realistic problems, and we extend their analysis by describing a situation where an optimal policy can be learned even in the presence of "perceptual aliasing" (i.e., partial observability).

Attention mechanisms have also been used with artificial neural networks to identify an object of interest in a 2D image [29], [30], [31], [32]. Our situation is more complex in that we identify 6-DoF poses of the robot's hand. Improved grasp performance has been observed by active control of the robot's sensor [33], [34]. These methods attempt to identify the best sensor placement for grasp success. In contrast, our robot learns to control a virtual sensor for the purpose of reducing the complexity of action selection and learning.
Work contemporary with ours also considered attention for controlling high-dimensional manipulators [35]. Important differences from our approach include the use of policy gradient instead of value-based methods, sensing 2D depth instead of 3D point clouds, and learned instead of fixed attention parameters.

III. PROBLEM STATEMENT
The problem considered herein is learning to control a move-effect system (Fig. 1, cf. [4]):
Definition 1 (Move-Effect System). A move-effect system is a discrete time system consisting of a robot, equipped with a depth sensor and end effector (e.e.), and rigid objects of various shapes and configurations. The robot perceives a history of point clouds C_1, ..., C_k, where C_i ∈ R^{n_c × 3} is acquired by the depth sensor; an e.e. status, h ∈ {1, ..., n_h}; and a reward r ∈ R. The robot's action is move-effect(T_ee, o), where T_ee ∈ W is the pose of the e.e., W ⊆ SE(3) is the robot's workspace, and o ∈ {1, ..., n_o} is a preprogrammed controller for the e.e. For each stage t = 1, ..., t_max, the robot receives a new perception and takes an action.

The reward is usually instrumented by the system engineer to indicate progress toward completion of some desired task. The robot initially has no knowledge of the system's state transition dynamics. The objective is, by experiencing a sequence of episodes, for the robot to learn a policy – a mapping from observations to actions – which maximizes the expected sum of per-episode rewards.

For example, suppose the e.e. is a 2-fingered gripper, o ∈ {open, close}, h ∈ {empty, holding}, the objects are bottles and coasters, and the task is to place all the bottles on the coasters. The reward could be 1 for placing a bottle on a coaster, −1 for removing a placed bottle, and 0 otherwise.
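To make the reward specification concrete, the following is a minimal Python sketch of the example reward above; the scene encoding (each scene reduced to the set of bottles currently standing on coasters) is a hypothetical simplification for illustration, not the paper's implementation.

```python
# Minimal sketch of the example reward; a scene is encoded (hypothetically)
# as the set of bottles currently standing on coasters.

def reward(placed_before: set, placed_after: set) -> int:
    """+1 per newly placed bottle, -1 per bottle removed from a coaster."""
    return len(placed_after - placed_before) - len(placed_before - placed_after)

print(reward({"bottle1"}, {"bottle1", "bottle2"}))  # 1: bottle2 was placed
print(reward({"bottle1"}, set()))                   # -1: bottle1 was removed
```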
IV. APPROACH

Our approach has two parts. The first part is to reformulate the problem as a Markov decision process (MDP) with abstract states and actions (Section IV-A). With this reformulation, the resulting state representation is substantially reduced, and it becomes possible for the robot to learn to restrict attention to task-relevant parts of the scene. The second part is to add constraints to the actions so that e.e. pose is decided sequentially (Section IV-B). After these improvements, the problem is then amenable to solution via standard RL algorithms like DQN.

Fig. 1: The move-effect system. The robot has an e.e. which can be moved to pose T_ee to perform operation o.

A. Sense-Move-Effect MDP
The sense-move-effect system adds a controllable, virtual sensor which perceives a portion of the point cloud from a parameterizable perspective (Fig. 2).
Definition 2 (Sense-Move-Effect System). A sense-move-effect system is a move-effect system where the robot's actions are augmented with sense(T_s, z) (where T_s ∈ W and z ∈ R^3_{>0}) and the point cloud observations C_1, ..., C_k are replaced with a history of k images, I_1, ..., I_k (where I ∈ R^{n_ch × n_x × n_y}). The sense action has the effect of adding I = Proj(Crop(Trans(T_s, C_k), z)) to the history.¹

¹Proj: R^{n_c × 3} → R^{n_ch × n_x × n_y} is n_ch orthographic projections of points onto n_ch, n_x × n_y images. Each image plane is positioned at the origin with a different orientation. Image values are the point-to-plane distance, with ambiguities resolved by the nearest distance. Crop: R^{n_c × 3} → R^{n_c′ × 3} returns the n_c′ ≤ n_c points of C which lie inside a rectangular volume situated at the origin with length, width, and height z. Trans(T_s, C) expresses C (initially expressed w.r.t. the world frame) w.r.t. T_s.

Fig. 2: The sense-move-effect system adds a virtual, mobile sensor which observes points in a rectangular volume at pose T_s with size z.

The sense action makes it possible for the robot to get either a compact overview of the scene or to attend to a small part of the scene in detail. Since the resolution of the images is fixed, large values of z correspond to seeing more objects in less detail, and small values of z correspond to seeing fewer objects in more detail.

The robot's memory need not include the last k images – it can include any previous k images selected according to a predetermined strategy. Because the environment only changes after move-effect actions, we keep the latest image, I_k, and the last k − 1 images that appeared just before move-effect actions. Fig. 3 shows an example in the bottles on coasters domain.
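The footnote's operators are simple to implement. Here is a minimal NumPy sketch of Trans, Crop, and Proj for a single channel (n_ch = 1); the image resolution, the background fill value, and the placement of the image plane at the bottom face of the crop volume are illustrative choices, not values from the paper.

```python
import numpy as np

def trans(T_s, C):
    """Express point cloud C (n_c x 3, world frame) w.r.t. sensor pose T_s (4x4)."""
    C_h = np.hstack([C, np.ones((C.shape[0], 1))])   # homogeneous coordinates
    return (np.linalg.inv(T_s) @ C_h.T).T[:, :3]

def crop(C, z):
    """Keep points inside an axis-aligned box at the origin with side lengths z."""
    mask = np.all(np.abs(C) <= np.asarray(z) / 2.0, axis=1)
    return C[mask]

def proj(C, z, n_x=64, n_y=64):
    """One orthographic projection onto the x-y plane; each pixel holds the
    nearest point-to-plane distance (background = far side of the volume)."""
    image = np.full((n_x, n_y), z[2])                         # background value
    u = ((C[:, 0] / z[0] + 0.5) * (n_x - 1)).astype(int)      # pixel coordinates
    v = ((C[:, 1] / z[1] + 0.5) * (n_y - 1)).astype(int)
    d = C[:, 2] + z[2] / 2.0                                  # distance from plane
    for ui, vi, di in zip(u, v, d):                           # keep nearest point
        image[ui, vi] = min(image[ui, vi], di)
    return image
```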
Fig. 3: Scene and observed images for k = 2 and n_ch = 1. Left. Scene's initial appearance. Left center. sense image (large z) just before move-effect(T_ee, close). Right center. Scene's current appearance. Right. Current sense image, focused on the coaster (small z).
In order to apply standard RL algorithms to the problem of learning to control a sense-move-effect system, we define the sense-move-effect MDP.

Definition 3 (Sense-Move-Effect MDP). Given a sense-move-effect system, a reward function, and transition dynamics, a sense-move-effect MDP is a finite horizon MDP where states are sense-move-effect system observations and actions are sense-move-effect system actions.
The reward function and transition details are task and domain specific, respectively; examples of both are given in Section V.
B. Hierarchical Spatial Attention
The observation is now similar to that of DQN – a short history of images plus the e.e. status – and can be used by a Q-network to approximate Q-values. However, the action space remains large due to the 6-DoF choice for T_s or T_ee and the 3-DoF choice for z. Additionally, it may take a long time for the robot to learn which sense actions result in useful observations. To remedy both issues, we design constraints to the sense-move-effect actions.

Definition 4 (Hierarchical Spatial Attention). Given a sense-move-effect system, L ∈ N_{>0}, T_s^1 ∈ W, and the list of pairs [(z_1, d_1), ..., (z_L, d_L)] (where z_i ∈ R^3_{>0} and d_i ∈ R^6), hierarchical spatial attention (HSA) constrains the robot to take L sense(T_s, z) actions, with z = z_i for i = 1, ..., L, prior to each move-effect action. Furthermore, the first sensor pose in this sequence must be T_s^1; the sensor poses T_s^{i+1}, for i = 1, ..., L − 1, must be offset no more than d_i from T_s^i; and the e.e. pose T_ee must be offset no more than d_L from T_s^L.²

²Concretely, d_i = [x, y, z, θ, φ, ρ] indicates a position offset of ±x/2, ±y/2, and ±z/2 and a rotation offset of ±θ/2, ±φ/2, and ±ρ/2.

The process is thus divided into t_max overt stages, where, for each stage, L sense actions are followed by 1 move-effect action (Fig. 4). The constraints should be set such that the observation size z_i and offset d_i decrease as i increases, so the point cloud under observation decreases in size, and the volume within which the e.e. pose can be selected is also decreasing. These constraints are called hierarchical spatial attention because the robot is forced to learn to attend to a small part of the scene (e.g., Fig. 5).

Fig. 4: Initially, the state is empty. Then, L sense actions are taken; at each point the latest image is the state. After this, the robot takes 1 move-effect action. Then, the process repeats, but with the last image before move-effect saved to memory.

Fig. 5: HSA applied to grasping in the bottles on coasters domain (Section V-C). There are 4 levels (i.e., L = 4). The sensor's volume size z and the allowed position offset d (indicated by blue squares) decrease from level 1 to level 4. Orientation is selected at level 4 within a fixed range about the hand approach axis. Red crosses indicate the x, y position selected by the robot, and the red circle indicates the angle selected by the robot. Positions are sampled uniformly on a grid, and 60 orientations are uniformly sampled. Pixel values are normalized, and height selection is not shown, for improved visualization.

To see how HSA can improve action sample efficiency, consider the problem of selecting position in a 3D volume. Let α be the largest volume allowed per sample. With naive sampling, the required number of samples n_s is proportional to the workspace volume v = d(1) d(2) d(3), i.e., n_s = ⌈v/α⌉. But with HSA, we select position sequentially by, say, halving the volume size in each direction at each step, i.e., d_{i+1} = 0.5 d_i. In this case 8L samples are needed, i.e., a sample for each octant at each step. The volume represented by each sample at step i, for i = 1, ..., L, is v_i = v/8^i. To get v_L ≤ α, i.e., to get the volume represented by samples used for selecting e.e. position to be no more than α, L = ⌈log_8(v/α)⌉. Thus, with HSA, the sample complexity becomes logarithmic, rather than linear, in v.
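The following short Python computation illustrates the gap on concrete numbers (chosen for illustration, not from the paper): a 1 m³ workspace with each sample allowed to represent at most α = 1 cm³.

```python
import math

v = 1.0                 # workspace volume in m^3
alpha = 1e-6            # largest volume allowed per sample: 1 cm^3 in m^3

n_naive = math.ceil(v / alpha)           # flat sampling: one sample per cell
L = math.ceil(math.log(v / alpha, 8))    # HSA levels when halving each axis
n_hsa = 8 * L                            # 8 octant samples per level

print(n_naive)  # 1000000 samples
print(n_hsa)    # 56 samples (8 samples at each of L = 7 levels)
```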
C. Lookahead Sense-Move-Effect
So far we have not specified how the action parameters T_s, T_ee, and z are encoded. For standard sense-move-effect, these are typically encoded as 6 floating point numbers representing the pose T and 3 floating point numbers representing the volume size z. Alternatively, the pair (T, z) could be encoded as the sense image that would be seen if the sensor were to move to pose T with zoom z. This is as if the action was "looking ahead" at the pose the sensor or e.e. would move to if this action were selected.

In particular, the lookahead sense-move-effect MDP has actions sense(T_s, z_s) and move-effect(T_ee, z_ee, o), the difference being the additional parameter z_ee ∈ R^3_{>0} for move-effect. The action samples are encoded as the height map that would be generated by sense(T, z). Because the action has this rich encoding, the state is just the e.e. status and a history of k actions.

The HSA constraints for the lookahead variant have the same parameterization – an initial pose T_s^1 and a list of pairs [(z_1, d_1), ..., (z_L, d_L)]. The semantics are slightly different. z_i for i = 1, ..., L − 1 is the z_s parameter for the i-th sense, and z_L is the z_ee parameter. The d_i for i = 1, ..., L − 1 specify the offset of the sense action samples relative to the last pose decided, T_s^i. d_L specifies the offset of T_ee relative to T_s^L.

D. Relation to Other Approaches in the Literature

1) DQN:
Consider a sense-move-effect MDP with HSA constraints L = 1, T_s^1 centered in the robot's workspace, and z_1 and d_1 large enough to capture the entire workspace. The only free action parameters for this system are the e.e. pose, which is sampled uniformly and spaced appropriately for the task, and the e.e. operation. In this case, the observations and actions are similar to those of DQN [3], and the DQN algorithm can be applied to the resulting MDP.

However, this approach is problematic in robotics because the required number of action samples is large, and the image resolution would need to be high in order to capture the required details of the scene. For example, a pick-place task where e.e. poses are in SE(3), the robot workspace is 1 m³, the required position precision is 1 mm, and the required orientation resolution is 1° per Euler angle requires on the order of 10^16 actions. Adding more levels (i.e., L > 1) alleviates this problem.
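The order of magnitude follows directly from the stated resolutions; a quick check (a 360 × 180 × 360 Euler-angle grid is one reasonable convention, assumed here):

```python
positions = 1000 ** 3              # 1 m^3 at 1 mm resolution per axis
orientations = 360 * 180 * 360     # 1 degree per Euler angle (yaw, pitch, roll)
print(positions * orientations)    # about 2.3e16 actions
```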
2) Deictic Image Mapping:
With L = 1, T_s^1 centered in the robot's workspace, z_1 the deictic marker size (e.g., the size of the largest object to be manipulated), and d_1 large enough to capture the entire workspace, HSA applied to the lookahead sense-move-effect MDP is the Deictic Image Mapping representation [4]. Similar to the case with DQN, if the space of e.e. poses is large, and precise positioning is needed, many actions need to be sampled. In fact, the computational burden with the Deictic Image Mapping representation is even larger than that of DQN due to the need to create images for each action. Yet, the deictic representation has significant advantages over DQN in terms of efficient learning due to its small observations [4].

HSA generalizes and improves upon both DQN and Deictic Image Mapping by overcoming the burden for the agent to select from many actions in a single time step. Instead, the agent sequentially refines its choice of e.e. pose over a sequence of L decisions. We provide comparisons between these approaches in Section V.

E. Implementation Methods
To implement HSA for a sense-move-effect MDP, it is necessary to select values for the HSA parameters and a training algorithm. Here we provide rough guidelines for making both choices for standard HSA.
1) HSA Parameter Values:
Ideal values for T_s^1, L, and [(z_1, d_1), ..., (z_L, d_L)] depend on the position and size of the robot's workspace, the desired e.e. precision, and available computing resources. In our implementations, we have separate levels for selecting position and orientation, with position selecting levels occurring first. The procedure for deciding position selecting levels is as follows (a code sketch follows this paragraph). First, the position component of the initial sensor pose T_s^1 is set to the center of the robot's workspace. Second, the number of action samples n_s depends on computing resources, e.g., the number of Q-values that can be evaluated in parallel. If n_s = n³, where n is the number of position samples spaced evenly along an axis, then n is set to the largest integer such that n_s samples can be evaluated efficiently. Third, the number of levels L is the minimum number of times the workspace needs to be divided to achieve the desired e.e. precision. If p ∈ R^3_{>0} is the desired e.e. precision and w ∈ R^3_{>0} is the size of the workspace, L = max_{i=1,...,3} ⌈log_n(w(i)/p(i))⌉. Fourth, sampling regions d_i for i = 1, ..., L should be large enough so that, if patches of size d_i are centered on samples in level i − 1, the entire region is covered: d_i = w/n^{i−1}. Lastly, observation sizes z_i for i = 1, ..., L should be equal to d_i or the size of the largest object to be manipulated, whichever is largest. The latter condition is necessary if the entire object must be visible to determine the appropriate action. For example, when grasping bottles to be placed upright, either the top or bottom of the bottle must be visible to determine bottle orientation in the hand. Deciding orientation selecting levels is simpler: add 1 level per Euler angle, each with the desired angular e.e. precision.
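The following is a minimal Python sketch of the position-level portion of this procedure; the workspace size, precision, and per-axis sample count in the example are illustrative values, not prescriptions from the paper.

```python
import math
import numpy as np

def hsa_position_params(w, p, n):
    """Position-selecting HSA levels per Section IV-E1.

    w: workspace size per axis (m); p: desired e.e. precision per axis (m);
    n: number of evenly spaced position samples along each axis (n_s = n^3).
    Returns the number of levels L and per-level offsets d_i = w / n^(i-1).
    """
    w, p = np.asarray(w, float), np.asarray(p, float)
    L = max(math.ceil(math.log(w[i] / p[i], n)) for i in range(3))
    d = [w / n ** (i - 1) for i in range(1, L + 1)]
    return L, d

# Illustrative example: 36 cm cubic workspace, ~5.6 mm precision, n = 4.
L, d = hsa_position_params([0.36] * 3, [0.005625] * 3, 4)
print(L)   # 3 levels, since 4^3 = 64 samples per axis overall
print(d)   # offsets of [0.36, 0.09, 0.0225] m per axis
```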
2) Training Algorithm:
Algorithm 1 is a variant of DQN [3] that follows the HSA constraints. For concreteness, this implementation stores experiences for Q-learning; modification for other temporal difference (TD) update rules, such as Sarsa [24] or Monte Carlo (MC) [36], is straightforward. For simplicity of exposition, we also restrict to the case where the image history consists of the current image I and the image I_h before the last grasp, the e.e. status is binary empty or holding, and the e.e. operation is binary open or close.

Initially, the Q-function gets random weights, the experience replay database is empty, and the probability of taking random actions is ε = 1 (line 1). The environment is initialized to a scene unique to each episode (line 3). For each time step, the e.e. status is observed (line 5), and I_h is the previously observed image if the e.e. is holding something (lines 6-8). Then, for each HSA level, a sense action is taken (line 10), the pose of the next sense action is determined either randomly or according to Q (line 12), and the experience is saved (line 15).

Algorithm 1: Train standard HSA.

Input: nEpisodes, t_max, T_1, n_s, L, [(z_1, d_1), ..., (z_L, d_L)], maxExperiences, trainEvery
 1: Initialize Q, D, ε
 2: for i ← 1, ..., nEpisodes do
 3:     env ← initialize-environment(i)
 4:     for t ← 1, ..., t_max do
 5:         h ← get-ee-status(env)
 6:         I_h ← NULL
 7:         if t > 1 ∧ h = holding then
 8:             I_h ← I
 9:         for l ← 1, ..., L do
10:             I ← sense(T_l, z_l)
11:             o′ ← (h, I_h, I)
12:             T_{l+1} ← get-pose(Q, o′, T_l, d_l, n_s, ε)
13:             a′ ← T_l⁻¹ T_{l+1}
14:             if t > 1 ∨ l > 1 then
15:                 D ← D ∪ {(o, a, o′, r)}
16:             o ← o′; a ← a′; r ← 0
17:         op ← get-ee-op(h)
18:         overtAct ← move-effect(T_{L+1}, op)
19:         r ← transition(env, overtAct)
20:         D ← D ∪ {(o, a, NULL, r)}
21:     if modulo(i, trainEvery) = 0 then
22:         D ← prune-exp(D, maxExperiences)
23:         Q ← update-q-function(D, Q)
24:         ε ← decrease-epsilon(|D|)
Actions are encoded relative to the previous sense pose (line 13). Next, the robot moves the e.e. to T_{L+1} and performs an operation op, after which a reward is observed (lines 17-19). Finally, after every trainEvery episodes, the Q-function is updated with the current experiences (lines 21-23), and ε is set inversely proportional to the number of experiences (line 24).
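Two details of Algorithm 1 are easy to miss: the relative action encoding of line 13 and the ε schedule of line 24. Below is a small Python sketch of both, using 4 × 4 homogeneous transforms; the schedule constant c is an arbitrary illustrative choice, since the paper does not specify one here.

```python
import numpy as np

def relative_action(T_l, T_l1):
    """Line 13: encode the chosen pose relative to the previous sense pose,
    a' = inv(T_l) @ T_{l+1}, so that actions are pose-invariant offsets."""
    return np.linalg.inv(T_l) @ T_l1

def decrease_epsilon(num_experiences, c=10000.0):
    """Line 24: exploration probability inversely proportional to the
    experience count (c is illustrative, not the paper's value)."""
    return min(1.0, c / max(num_experiences, 1))

print(decrease_epsilon(50000))  # 0.2
```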
V. APPLICATION DOMAINS

In this section we compare the HSA approach in 4 application domains of increasing complexity. The complexity increases in terms of the size of the action space and in terms of the diversity of object poses and geometries. We analyze simpler domains because the results are more interpretable and learning is faster (Table I). More complex domains are included to demonstrate the practicality of the approach. All training is in simulation, and Sections V-C and V-D include test results for a physical robotic system. Source code for reproducing the simulated experiments is available at [37].
A. Tabular Pegs on Disks
Here we analyze the HSA approach applied to a simple, tabular domain, where the number of states and actions is finite. The domain consists of 2 types of objects – pegs and disks – which are situated on a 3D grid (Fig. 6). The robot can move its e.e. to a location on the grid and open/close its gripper. The goal is for the robot to place all the pegs onto disks.

Fig. 6: Tabular pegs on disks with an 8 × 8 × 8 grid, 1 peg (red triangle), and 1 disk (blue circle).

If this problem is described as a finite MDP, eventual convergence to the optimal policy is guaranteed for standard RL algorithms [26], [27]. However, the number of state-action pairs is too large for practical implementation unless some abstraction is applied. The main question addressed here is if convergence guarantees are maintained with the HSA abstraction.
1) Ground MDP:
Tabular pegs on disks is first described without the sense-move-effect abstraction.
• State. A set of pegs P = {p_1, ..., p_n}, a set of disks D = {d_1, ..., d_n}, and the current time t ∈ {1, ..., t_max}. A peg (resp. disk) is a location (x, y, z) ∈ {1, ..., m}³, except peg locations are augmented with a special in-hand location h. Pegs (resp. disks) cannot occupy the same location at the same time, but 1 peg and 1 disk can occupy the same location at the same time.
• Action. move-effect(x, y, z), which moves the e.e. to (x, y, z) ∈ {1, ..., m}³ and opens/closes. It opens if a peg is located at h and closes otherwise.

• Transition. t increments by 1. If no peg is at h and a peg p is at the action location, then the peg is grasped (p = h). If a peg is located at h and the action location a does not contain a peg, the peg is placed (p = a). Otherwise, the state remains unchanged.

• Reward. The reward is 1 for placing a peg on an unoccupied disk, −1 for grasping a placed peg, and 0 otherwise. t_max = 2n, so there is enough time to grasp and place each peg.

                Tabular Pegs on Disks   Upright Pegs on Disks   Bottles on Coasters   6-DoF Pick-Place
Time (hours)    0.23                    1.29                    8.12                  96.54

TABLE I: Average simulation time for the 4 test domains. Times are averaged over 10 or more simulations on 4 different workstations, each equipped with an Intel Core i7 processor and an NVIDIA GTX 1080 graphics card.

This MDP satisfies the Markov property because the next state is completely determined from the current state and action. The number of possible states is shown in Eq. 1, where C(·, ·) denotes the binomial coefficient, and the number of actions is |A| = m³. It is not practical to learn the optimal policy by enumerating all state-action pairs for this MDP: for example, if m = 16 and n = 3, the state-action value lookup table size is on the order of 10^24.

|S| = C(m³ + 1, n) C(m³, n) t_max    (1)
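A quick Python check of the claimed magnitude, using the values m = 16 and n = 3 from the text:

```python
from math import comb

m, n, t_max = 16, 3, 6                                    # t_max = 2n
num_states = comb(m**3 + 1, n) * comb(m**3, n) * t_max    # Eq. 1
num_actions = m**3
print(f"{num_states * num_actions:.1e}")                  # about 3.2e+24 pairs
```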
2) Sense-Move-Effect MDP:
We apply the sense-move-effect abstraction of Section IV-A and the HSA constraints of Section IV-B to the tabular pegs on disks problem. The process is illustrated in Fig. 7. At level 1, the sensor perceives the entire m³ grid as 8 cells, each cell summarizing the contents of an octant of space in the underlying grid. The robot then selects one of these cells to attend to. At levels 2, ..., L − 1, the sensor perceives 8 cells revealing more detail of the octant selected in the previous level. At level L, the sensor perceives 8 cells in the underlying grid, and the location of the underlying action is determined by the cell selected here. Without loss of generality, assume the grid size m of the ground MDP is a power of 2 and the number of levels L is log₂(m).

Fig. 7: HSA applied to the grid in Fig. 6. Columns correspond to levels 1, 2, and 3. The observed volume appears yellow, and the octant selected by the robot appears green.
Top. Robot selects the peg and is holding it afterward.
Bottom. Robot selects the disk.
• State. The current level l ∈ {1, ..., L}, the overt time step t ∈ {1, ..., t_max}, a bit h ∈ {0, 1} indicating if a peg is held, and the tuple G = (G_p, G_d, G_pd, G_e), where each G_i ∈ {0, 1}^8. G_p indicates the presence of unplaced pegs, G_d unoccupied disks, G_pd placed pegs, and G_e empty space.
• Action. The action is a ∈ {1, ..., 8}, a location in the observed grids.
• Transition. For levels l = 1, ..., L − 1, the robot selects a cell in G which corresponds to some partition of space in the underlying grid. The sensor perceives this part of the underlying grid and generates the observation at level l + 1. For level L, the L selections determine the location of the underlying move-effect action, l is reset to 1, and otherwise the transition is the same as in the ground MDP.
• Reward. The reward is 0 for levels 1, ..., L − 1. Otherwise, the reward is the same as for the ground MDP.

The above process is no longer Markov because a history of states and actions could be used to better predict the next state. For instance, for a sufficiently long random walk, the exact location of all pegs and disks could be determined from the history of observations, and the underlying grid could be reconstructed.

On the other hand, this abstraction results in substantial savings in terms of the number of states (Eq. 2) and actions (|A| = 8). The only nonconstant term (besides t_max) is logarithmic in m. Referring to the earlier example with m = 16 and n = 3, the state-action lookup table size is on the order of 10^12.

|S| ≤ 2^33 log₂(m) t_max    (2)
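The corresponding check for the abstract representation; the 2^33 factor counts the binary held bit times the four 8-bit grids, with L = 4 and t_max = 6 as in the example above:

```python
num_states = 2 * (2 ** 8) ** 4 * 4 * 6    # Eq. 2: h, four 8-cell grids, L, t_max
num_actions = 8
print(f"{num_states * num_actions:.1e}")  # about 1.6e+12 state-action pairs
```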
3) Theoretical Results:
The sense-move-effect MDP with HSA constraints can be classified according to the state abstraction ordering defined in Li et al. [28]. In particular, we show Q*-irrelevance, which is sufficient for the convergence of a number of RL algorithms, including Q-learning, to a policy optimal in the ground MDP.

Definition 5 (Q*-irrelevance Abstraction [28]). Given an MDP M = ⟨S, A, P, R, γ⟩, any states s_1, s_2 ∈ S, and an arbitrary but fixed weighting function w(s), a Q*-irrelevance abstraction φ_{Q*} is such that for any action a, φ_{Q*}(s_1) = φ_{Q*}(s_2) implies Q*(s_1, a) = Q*(s_2, a).³

³Although the definition is for infinite-horizon problems (due to γ), our finite-horizon problem readily converts to an infinite-horizon problem by adding an absorbing state that is reached after t_max overt stages. The weight w(s) is the probability the underlying state is s given its abstract state φ(s) is observed. Any fixed policy, e.g., ε-greedy with fixed ε, induces a valid w(s) and satisfies the definition.

φ_{Q*} is a mapping from ground states to abstract states and defines the abstract MDP.

Theorem 1 (Convergence of Q-learning under Q*-irrelevance [28]). Assume that each state-action pair is visited infinitely often and the step-size parameters decay appropriately. Q-learning with abstraction φ_{Q*} converges to the optimal state-action value function in the ground MDP. Therefore, the resulting optimal abstract policy is also optimal in the ground MDP.

Because Li et al. do not consider action abstractions, we redefine the ground MDP to have the same actions as the sense-move-effect MDP. Additionally, to keep the ground MDP Markov, we add the current level l, and the current point of focus v ∈ {1, ..., m}³, to the state. This does not essentially change the tabular pegs on disks domain but merely allows us to rigorously make the following connection.

Let states and actions of the ground MDP be denoted by s and a respectively. Similarly, let states and actions of the sense-move-effect MDP be denoted by s̄ and ā respectively. Let φ_SME: S → S̄ be the observation function.

Theorem 2 (φ_SME is Q*-irrelevant). The sense-move-effect abstraction, φ_SME, is a Q*-irrelevance abstraction.

Proof. Q*(s, a) can be computed from s̄ and ā. The reward after the current overt stage t depends on h, whether or not it is possible to select a peg/disk, and whether or not it is possible to avoid selecting a placed peg. These are known from s̄ and ā. Furthermore, whether or not a peg will be held after the current stage can be determined from s̄ and ā. Finally, due to t_max = 2n and the fact that all pegs are initially unplaced, the sum of future rewards following an optimal policy from the current stage depends only on (a) whether or not a peg will be held after the current stage and (b) the amount of time left, t_max − t. ∎
4) Simulation Results:
In these experiments, there were n = 3 objects, and the grid size was m = 16. Besides Deictic Image Mapping (where L = 1), the number of levels was L = 4. A comparison with no abstraction or HSA with L = 1 was not possible because the system quickly ran out of memory (Eq. 1). The learning algorithm was Sarsa [24], and actions were taken greedily w.r.t. the current Q-estimate. An optimistic initialization of action-values and random tie-breaking were relied on for exploration.

The proof of Theorem 2 suggests the observability of pegs, disks, placed pegs, and empty space are all important for learning the optimal policy. On the other hand, we empirically found no disadvantage to removing the G_pd (placed pegs) and G_e (empty space) grids. However, it is important to distinguish unplaced pegs from placed pegs. Fig. 8 shows learning curves for an HSA agent with G_p and G_d grids versus an HSA agent with the same grids but showing pegs/disks regardless of whether or not they are placed/occupied.

Fig. 8: Number of objects placed for the standard HSA agent (blue) and a standard HSA agent with a faulty sensor (red). Curves are first mean and ±σ over each episode in 30 realizations, then averaged over 1,000-episode segments for visualization.

Lookahead HSA and Deictic Image Mapping variants (Sections IV-C and IV-D) result in an even smaller state-action space than standard HSA. In the tabular domain, this means faster convergence (Fig. 9). Although the deictic representation seems superior in these results, it has a serious drawback. The action-selection time scales linearly with m³ because there is one action for each cell in the underlying grid. The lookahead variant captures the best of both worlds – small representation and fast execution. Thus, in the tabular domain, lookahead appears to be the satisfactory middle ground between the two approaches. However, for more complex domains, where Q-function approximation is required, the constant time needed to generate the action images becomes more significant, and the advantage of lookahead in terms of episodes to train diminishes (Section V-B).

Fig. 9: Number of objects placed for standard HSA (blue), lookahead HSA (red), and Deictic Image Mapping (yellow) agents. Curves are mean (solid) and ±σ (shaded) over 30 realizations. Plot is in log scale for lookahead and deictic results to be visible.

B. Upright Pegs on Disks
In this domain, pegs and disks are modeled as tall and flat cylinders, respectively, where the cylinder axis is always vertical (Fig. 10, left). Unlike the tabular domain, object size and position are sampled from a continuous space. Grasp and place success are checked with a set of simple conditions appropriate for upright cylinders.⁴ The reward is 1 for grasping an unplaced peg, −1 for grasping a placed peg, 1 for placing a peg on an unoccupied disk, and 0 otherwise.

⁴Grasp conditions: gripper is collision-free and the top-center of exactly 1 cylinder is in the gripper's closing region. Place conditions: entire cylinder is above an unoccupied disk and the cylinder bottom is at most 1 cm below or 2 cm above the disk surface.

Observations consist of 1 or 2 images (k = 2, n_ch = 1, n_x = n_y = 64); the current HSA level, l ∈ {1, 2, 3}; and the e.e. status, h ∈ {empty, holding}. Each HSA level selects an (x, y, z) position (Fig. 10, right). Gripper orientation is not critical for this problem.
Fig. 10: Left. Example upright pegs on disks scene. Right. Level 1, 2, and 3 images for grasping the orange peg. The red cross denotes the (x, y) position selected by the robot, and the blue rectangle denotes the allowed (x, y) offset. z_x = z_y = 36 cm for level 1 and 9 cm for levels 2 and 3. d_x = d_y = 36 cm for level 1, 9 cm for level 2, and 2.25 cm for level 3. Pixel values are normalized, and height selection is not shown, for improved visualization.
1) Network Architecture and Algorithm:
The Q-function consists of 6 convolutional neural networks (CNNs), 1 for each level and e.e. status, with identical architectures (Table II). This architecture results in faster execution time compared with our previous version [2]. The loss is the squared difference between the predicted and actual action-value target, averaged over a mini-batch. The action-value target is the reward received at the end of the current overt stage.⁵ For CNN optimization, Adam [38] is used with a fixed base learning rate, a small weight decay, and a mini-batch size of 64.

⁵With standard MC and γ = 1, the action-value target would be the sum of rewards received after the current time step [36]. Since, for this problem, no positively rewarding grasp precludes a positively rewarding place, ignoring rewards after the current overt stage is acceptable.

TABLE II: CNN architecture for the upright pegs on disks domain: five convolutional layers, conv-1 through conv-5, each specified by kernel size, stride, and output size. Each layer besides conv-4 and conv-5 has a rectified linear unit (ReLU) as the activation.
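For a concrete picture of such a network, here is a minimal PyTorch sketch of a fully convolutional Q-network of the kind described. The channel counts, kernel sizes, and strides are placeholders (Table II's original values are not reproduced here), so treat this as a shape sketch rather than the paper's network.

```python
import torch
import torch.nn as nn

class LevelQNetwork(nn.Module):
    """One Q-network per HSA level and e.e. status: maps a 1 x 64 x 64 height
    map to a grid of Q-values, one per action sample (placeholder sizes)."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Conv2d(64, 32, kernel_size=3, stride=1),
            nn.Conv2d(32, 8, kernel_size=3, stride=1),  # 8 height bins per pixel
        )

    def forward(self, image):
        return self.net(image)  # Q-values over (height bin, x, y) samples

q = LevelQNetwork()
print(q(torch.zeros(1, 1, 64, 64)).shape)  # torch.Size([1, 8, 8, 8])
```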
2) Simulation Results:
We tested standard HSA with 1, 2, and 3 levels. The number of actions (CNN outputs) per level was adjusted so that each case had the same 5.625 mm precision in positioning of the e.e.: 1 level had 64³ = 262,144 outputs, 2 levels had 512 outputs each, and 3 levels had 64 outputs each. Note that with 1 level this is the DQN (i.e., no-hierarchy) approach (Section IV-D). Exploration was ε-greedy, with a small fixed ε that was set to 0 for the final episodes.

Results are shown in Fig. 11. The 1 level case is faster in terms of episodes because learning is over fewer time steps. The 2 levels case initially learns faster for the same reason. The 1 and 2 level cases converge to higher values because, with 3 levels, there is a higher chance of taking a random action during an overt stage: more levels imply more time steps over which a random action could be selected w.p. ε. What is important is that, in the final episodes, when ε = 0, all scenarios have similar performance. However, HSA trains faster than DQN in terms of wall clock time (1.29 versus 2.55 hours) because fewer actions need to be evaluated (192 versus 262,144). This advantage becomes more staggering as the dimensionality of the action space increases, as in the following sections.

Fig. 11: Standard HSA with varying number of levels. (Blue) L = 3, (red) L = 2, and (yellow) L = 1. Curves are mean ±σ over 10 realizations, then averaged over 1,000-episode segments.

In another experiment we tested the sensitivity of standard HSA to the choice of z and d parameters. As explained in Section IV-E1, these parameters are selected based on task geometry. If z (resp. d) is too small, parts of the workspace will not be perceivable (resp. reachable). On the other hand, if z is too large, the scene will not be visible in detail (because the perceived images are of fixed resolution), and if d is too large, the samples at the last level will not be dense, resulting in low e.e. precision. Results for different values of z and d are shown in Table III. The "ideal" values are those selected according to the principles in Section IV-E1 and correspond to the 3-levels case in Fig. 11. As expected, performance is much worse when selecting z and d without consideration of task geometry.
            Small   Ideal   Large
μ Return    2.69    3.91    2.83
σ Return    1.32    0.01    1.75

TABLE III: Varying the standard HSA parameters z_xy and d_xy. "Ideal" values (36 cm at level 1; cf. Fig. 10) were selected according to Section IV-E1; "small" (resp. "large") values are smaller (resp. larger) than ideal. Rows are the average and standard deviation of the sum of rewards per episode, over 10 different training sessions.

We also compared standard HSA to lookahead HSA, both with 3 levels. We did not compare to the Deictic Image Mapping approach (lookahead HSA with 1 level) because computation of all images was prohibitively expensive. Results are shown in Fig. 12. In contrast to the tabular results, both scenarios perform similarly. We hypothesize that the advantage of lookahead HSA is lost due to the equivariance property of CNNs. Since execution time for standard HSA is less than half that of lookahead (1.29 versus 3.67 hours), from now on we only consider standard HSA.

Fig. 12: Standard HSA (blue) versus lookahead HSA (red).

C. Bottles on Coasters
The main question addressed here is if HSA can be applied to a practical problem and implemented on a physical robotic system. The bottles on coasters domain is similar to the pegs on disks domain, but now objects have complex shapes and are required to be placed upright. The reward is 1 for grasping an unplaced object more than 4 cm from the bottom (placing with bottom grasps is kinematically infeasible in the physical system), −1 for grasping a placed object, 1 for placing a bottle, and 0 otherwise.⁶

⁶Grasp conditions: gripper closing region intersects exactly 1 object and the antipodal condition from [12] holds for a fixed friction cone angle. Place conditions: bottle is upright, center of mass (CoM) (x, y) position at least 2 cm inside an unoccupied coaster, and bottom within a small tolerance of the coaster surface.

Observations are similar to before except now the image resolution is lower (n_x = n_y = 48), and the overt time step is always input to grasp networks (and never input to place networks). HSA has 3 levels selecting (x, y, z) position and 1 level selecting orientation about the gripper approach axis (Fig. 5).

To achieve the target precision in e.e. pose for grasping (3.75 mm in position and 6° in orientation), DQN (or 1-level HSA) would need to evaluate over 53 million actions. Evaluation was prohibitively expensive with our computing hardware. HSA only needs 404 actions (although we use 708 to achieve redundancy, with little loss in computation time as the evaluation is done in parallel).
1) Network Architecture and Algorithm:
The network architecture is shown in Table IV. There is 1 network for each HSA level and e.e. status. Weight decay is 0. Q-network targets are the reward after the current overt stage.

TABLE IV: CNN architecture for the bottles on coasters domain: convolutional layers conv-1 through conv-3, each specified by kernel size, stride, and output size, followed by a final layer that is a convolution layer (conv-4) for levels 1-3 (selecting position) and an inner product (IP) layer (ip-1) for level 4 (selecting orientation). Each layer besides the last has a ReLU activation. The final layer produces n_orient outputs for level 4, with n_orient = 60 for grasp networks and n_orient = 3 for place networks.
2) Simulation Results:
70 bottles from 3DNet [39] were randomly scaled to heights of 10-20 cm. Bottles were placed upright with a fixed probability and on their sides otherwise. Learning curves for 2 bottles and 2 coasters are shown in Fig. 13. Performance is lower than that of the upright pegs on disks domain, reflective of the additional problem complexity.

Fig. 13: Number of bottles grasped (blue) and placed (red). Curves are mean ±σ over 10 realizations, then averaged over 1,000-episode segments. Standard HSA with L = 4.

To test robustness of the system to background noise, we ran the same experiment with the addition of distractor objects. These distractors are 3 rectangular blocks, with side lengths 1 to 4 cm, scattered randomly in the scene (e.g., Fig. 14, left). Learning performance is only slightly lower (Fig. 14, right). However, if clutter is present at test time, it is important to train the system with clutter. The robot trained without clutter places an average of 1.24 bottles in the cluttered environment (versus 1.55 if trained with clutter). The distractors are visible at some levels (e.g., level 1), so the robot needs to learn to ignore (and avoid collisions with) them.
Fig. 14: Left. Scene with clutter. Right. Learning curves comparing the average sum of rewards when distractors are not present (blue) and present (red).
3) Top-n Sampling:
Before considering experiments on a physical robotic system, we address an important assumption of the move-effect system of Section III. The assumption is that the e.e. can move to any pose, T_ee, in the robot's workspace. Recent advances in motion planning algorithms make this a reasonable assumption for the most part; nonetheless, a pose can still sometimes be unreachable due to obstacles, motion planning failure, or IK failure.

To address this issue, multiple, high-valued actions are sampled from the policy learned in simulation. In particular, for each level l of an overt stage, we take the top-n samples according to Eq. 3, where Q_l is the action-value estimate at level l, Q_max is the maximum possible action-value, Q_min is the minimum possible action-value, and p_0 = 1.

p_l = p_{l−1} (Q_l − Q_min) / (Q_max − Q_min),  l = 1, ..., L    (3)

Preliminary tests in simulation showed sampling top-n p_l values performs better than sampling top-n Q_L values, as was done previously [2]. Sampling top-n p_l values may be viewed as an ensemble method where each level votes on the final overt action (cf. [40]).

During test time, the resulting n T_ee samples are checked for IK and motion plan solutions in descending order of p_L value. As n increases, the probability of failing to find a reachable e.e. pose decreases; however, the more poses that are unreachable, the lower the p_L value. Thus, when designing an HSA system, it is important to not over-constrain the space of actions.
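A minimal Python/NumPy sketch of this procedure follows, implemented as a beam search of width n over the levels. It assumes each level supplies a function returning a vector of Q-values per candidate, and the helpers expand_candidate and reachable are hypothetical stand-ins for the level expansion and the IK/motion-plan check.

```python
import numpy as np

def top_n_poses(q_per_level, n, q_min, q_max):
    """Rank candidate e.e. poses by the product of per-level scores (Eq. 3)."""
    candidates = [([], 1.0)]                        # (level choices, p_0 = 1)
    for q_fn in q_per_level:                        # one Q evaluator per level
        expanded = []
        for choices, p in candidates:
            q = q_fn(choices)                       # Q-values at this level
            scores = p * (q - q_min) / (q_max - q_min)
            best = np.argsort(scores)[::-1][:n]     # top-n per Eq. 3
            expanded += [(choices + [a], scores[a]) for a in best]
        candidates = sorted(expanded, key=lambda c: -c[1])[:n]
    return candidates                               # sorted by descending p_L

# At test time: try poses in descending p_L order until one is reachable.
# for choices, p_L in top_n_poses(q_per_level, 200, 0.0, 1.0):
#     T_ee = expand_candidate(choices)              # hypothetical helper
#     if reachable(T_ee):                           # hypothetical IK/plan check
#         break
```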
4) Robot Experiments:
We tested the bottles on coasters task with the physical system depicted in Fig. 15. The system consists of a Universal Robots 5 (UR5) arm, a Robotiq 85 parallel-jaw gripper, and a Structure depth sensor. The test objects (Fig. 16) were not observed during training. The CNN weight files had about average performance out of the 10 realizations (Fig. 13).

Initially, 2 coasters were randomly selected and placed in arbitrary positions in the back half of the robot's workspace (too close resulted in unreachable places). Then, 2 bottles were randomly selected and placed either upright or on their sides at random.⁷ The bottles were not allowed to be placed over a coaster. Top-n sampling with n = 200 was used. A threshold was set for the final grasp/place approach, whereby, if the magnitude of the force on the arm exceeded this threshold, the motion was canceled and the open/close action was immediately performed.

⁷Python's pseudorandom number generator was used to decide the objects used and upright/side placement. Object position was decided by a human instructed to make the scenes diverse.
Fig. 15: Test setup for the bottles on coasters task: a UR5 arm, Robotiq 85 gripper, Structure depth sensor (mounted out of view above the table and looking down), 2 bottles, and 2 coasters.

Fig. 16: Test objects used in UR5 experiments.

Results are summarized in Table V, and a successful sequence is depicted in Fig. 17. A grasp was considered successful if a bottle was lifted to the "home" configuration; a place was considered successful if a bottle was placed upright on an unoccupied coaster and remained there after the gripper withdrew. The failures were: grasping a placed object, placing too close to the edge of a coaster so the bottle fell over, placing upside-down, the object slipping in the hand after a grasp (causing a place failure), and the object falling out of the hand after a grasp.
                    Grasp         Place
Attempts            60            59
Success Rate        0.98          0.90
Number of Objects   1.97 ± 0.18   1.77

TABLE V: Performance for UR5 experiments placing 2 bottles on 2 coasters, averaged over 30 episodes with ±σ. Task success rate with t_max = 4 was 0.67.

D. 6-DoF Pick-Place
The HSA method was also implemented for 6-DoF manipulation, and the same system was tested on 3 different pick-place tasks [2].⁸ The tasks included stacking a block on top of another, placing a mug upright on the table, and (similar to Section V-C) placing a bottle on a coaster. All tasks included novel objects in light to moderate clutter (Fig. 18). To handle perceptual ambiguities in mugs, the observations were 3-channel images (k = 2, n_ch = 3, n_x = n_y = 60) projected from a point cloud obtained from 2 camera poses. HSA included 6 levels (L = 6) – 3 for (x, y, z) position and 1 for each Euler angle. Results from UR5 experiments are shown in Table VI.

⁸This section refers to an earlier version of our system, so the simulations took longer and the success rates for bottles are lower. The setup was similar to that in Fig. 15 except the sensor was mounted to the wrist. See [2] for more details.

            Blocks   Mugs   Bottles
Grasp       0.96     0.86   0.89
Place       0.67     0.89   0.64
Task        0.64     0.76   0.57
n Grasps    50       51     53
n Places    48       44     47
TABLE VI: Top. Grasp, place, and task success rates for the 3 tasks with t_max = 2 (i.e., 1 pick, 1 place). Bottom. Number of grasp and place attempts.
VI. CONCLUSION
The primary conclusion is that the sense-move-effect abstraction, when coupled with hierarchical spatial attention, is an effective way of simultaneously handling (a) high-resolution 3D observations and (b) high-dimensional, continuous action spaces. These two issues are intrinsic to realistic problems of robot learning. We provide several other considerations relevant to systems employing spatial attention:
A. Secondary Conclusions

• Compared to a flat representation, HSA can result in an exponential reduction in the number of actions that need to be sampled (Section IV-B).
• HSA generalizes DQN, and lookahead HSA generalizes Deictic Image Mapping (Section IV-D).
• The partial observability induced by an HSA observation does not preclude learning an optimal policy (Section V-A).
• HSA may take longer to learn than DQN in terms of the number of episodes to convergence, but HSA executes faster when the number of actions is large (Section V-B).
• Lookahead HSA is preferred to standard HSA in terms of the number of episodes to train, but its execution time is longer by a constant, and the learning benefit diminishes when coupled with function approximation (Sections V-A and V-B).
• HSA can be applied to realistic problems on a physical robotic system (Sections V-C and V-D).
B. Limitations and Future Work
A concern with all deep RL methods is that modeling and optimization errors induced by the use of function approximation prevent the robot from learning an optimal policy. This is true for even simple problems, such as the upright pegs on disks problem of Section V-B. Also, how manipulation skills can be automatically and efficiently transferred to different but related tasks remains an open question. Even small changes to the task, such as the inclusion of distractor objects, require complete retraining of the system for maximum performance. Finally, HSA is a competing approach to policy search methods in that both can handle high-dimensional, continuous action spaces. It would be interesting to see a comparison between these approaches. Previous value-based approaches like DQN cannot handle the high-dimensional action spaces prevalent in robotics; thus, HSA enables a comparison between value-based and policy search methods for these domains.

Fig. 17: Successful trial – all bottles placed in 4 overt stages. Images taken immediately after open/close.

Fig. 18: 6-DoF pick-place on the UR5 system.
Top. Blocks task. Bottom. Mugs task. Notice the grasp is diagonal to the mug axis, and the robot compensates for this by placing diagonally with respect to the table surface.

ACKNOWLEDGMENT
We thank Andreas ten Pas for reviewing early drafts of this paper and the anonymous reviewers for their insightful comments. Funding was provided by NSF (IIS-1724257, IIS-1724191, IIS-1750649, IIS-1830425, IIS-1763878), ONR (N00014-19-1-2131), and NASA (80NSSC19K1474).

REFERENCES
[1] S. Whitehead and D. Ballard, "Learning to perceive and act by trial and error," Machine Learning, vol. 7, no. 1, pp. 45–83, 1991.
[2] M. Gualtieri and R. Platt, "Learning 6-DoF grasping and pick-place using attention focus," in Proc. of the 2nd Conference on Robot Learning, ser. Proceedings of Machine Learning Research, vol. 87, Oct. 2018, pp. 477–486.
[3] V. Mnih, K. Kavukcuoglu, D. Silver, A. Rusu, J. Veness, M. Bellemare, A. Graves, M. Riedmiller, A. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[4] R. Platt, C. Kohler, and M. Gualtieri, "Deictic image maps: An abstraction for learning pose invariant manipulation policies," in AAAI Conf. on Artificial Intelligence, 2019.
[5] T. Lozano-Pérez, "Motion planning and the design of orienting devices for vibratory part feeders," MIT Artificial Intelligence Laboratory Technical Report, 1986.
[6] P. Tournassoud, T. Lozano-Pérez, and E. Mazer, "Regrasping," in IEEE Int'l Conf. on Robotics and Automation, vol. 4, 1987, pp. 1924–1928.
[7] J. Tremblay, T. To, B. Sundaralingam, Y. Xiang, D. Fox, and S. Birchfield, "Deep object pose estimation for semantic robotic grasping of household objects," in Proc. of the 2nd Conference on Robot Learning, ser. Proceedings of Machine Learning Research, vol. 87. PMLR, Oct. 2018, pp. 306–316.
[8] I. Lenz, H. Lee, and A. Saxena, "Deep learning for detecting robotic grasps," The Int'l Journal of Robotics Research, vol. 34, no. 4-5, pp. 705–724, 2015.
[9] L. Pinto and A. Gupta, "Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours," in IEEE Int'l Conf. on Robotics and Automation, 2016, pp. 3406–3413.
[10] S. Levine, P. Pastor, A. Krizhevsky, and D. Quillen, "Learning hand-eye coordination for robotic grasping with large-scale data collection," in Int'l Symp. on Experimental Robotics. Springer, 2016, pp. 173–184.
[11] J. Mahler, J. Liang, S. Niyaz, M. Laskey, R. Doan, X. Liu, J. Ojea, and K. Goldberg, "Dex-Net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics," Robotics: Science and Systems, vol. 13, 2017.
[12] A. ten Pas, M. Gualtieri, K. Saenko, and R. Platt, "Grasp pose detection in point clouds," The Int'l Journal of Robotics Research, vol. 36, no. 13-14, pp. 1455–1473, 2017.
[13] D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke, and S. Levine, "Scalable deep reinforcement learning for vision-based robotic manipulation," in Proc. of the 2nd Conference on Robot Learning, ser. Proceedings of Machine Learning Research, vol. 87. PMLR, Oct. 2018, pp. 651–673.
[14] D. Morrison, P. Corke, and J. Leitner, "Closing the loop for robotic grasping: A real-time, generative grasp synthesis approach," in Robotics: Science and Systems, 2018.
[15] D. Quillen, E. Jang, O. Nachum, C. Finn, J. Ibarz, and S. Levine, "Deep reinforcement learning for vision-based robotic grasping: A simulated comparative evaluation of off-policy methods," in IEEE Int'l Conf. on Robotics and Automation, 2018.
[16] A. Zeng, S. Song, K.-T. Yu, E. Donlon, F. Hogan, M. Bauza, D. Ma, O. Taylor, M. Liu, E. Romo, N. Fazeli, F. Alet, N. Chavan-Dafle, R. Holladay, I. Morona, P. Q. Nair, D. Green, I. Taylor, W. Liu, T. Funkhouser, and A. Rodriguez, "Robotic pick-and-place of novel objects in clutter with multi-affordance grasping and cross-domain image matching," in IEEE Int'l Conf. on Robotics and Automation, 2018.
[17] Y. Jiang, C. Zheng, M. Lim, and A. Saxena, "Learning to place new objects," in IEEE Int'l Conf. on Robotics and Automation, 2012, pp. 3088–3095.
[18] M. Gualtieri, A. ten Pas, and R. Platt, "Pick and place without geometric object models," in IEEE Int'l Conf. on Robotics and Automation, 2018.
[19] A. Xie, A. Singh, S. Levine, and C. Finn, "Few-shot goal inference for visuomotor learning and planning," in Proc. of the 2nd Conference on Robot Learning, ser. Proceedings of Machine Learning Research, vol. 87. PMLR, Oct. 2018, pp. 40–52.
[20] S. James, A. Davison, and E. Johns, "Transferring end-to-end visuomotor control from simulation to real world for a multi-stage task," in Conf. on Robot Learning, vol. 78. Proceedings of Machine Learning Research, 2017, pp. 334–343.
[21] M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, P. Abbeel, and W. Zaremba, "Hindsight experience replay," in Advances in Neural Information Processing Systems, 2017, pp. 5048–5058.
[22] J. Kober, A. Bagnell, and J. Peters, "Reinforcement learning in robotics: A survey," The Int'l Journal of Robotics Research, vol. 32, no. 11, pp. 1238–1274, 2013.
[23] C. Watkins, "Learning from delayed rewards," Ph.D. dissertation, King's College, Cambridge, 1989.
[24] G. Rummery and M. Niranjan, "On-line Q-learning using connectionist systems," Cambridge University Engineering Department, CUED/F-INFENG/TR 166, Sept. 1994.
[25] T. Lillicrap, J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," arXiv preprint arXiv:1509.02971, 2015.
[26] C. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.
[27] T. Jaakkola, M. Jordan, and S. Singh, "Convergence of stochastic iterative dynamic programming algorithms," in Advances in Neural Information Processing Systems, 1994, pp. 703–710.
[28] L. Li, T. Walsh, and M. Littman, "Towards a unified theory of state abstraction for MDPs," in Int'l Symp. on Artificial Intelligence and Mathematics, 2006.
[29] N. Sprague and D. Ballard, "Eye movements for reward maximization," in Advances in Neural Information Processing Systems, 2004, pp. 1467–1474.
[30] H. Larochelle and G. Hinton, "Learning to combine foveal glimpses with a third-order Boltzmann machine," in Advances in Neural Information Processing Systems, 2010, pp. 1243–1251.
[31] V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu, "Recurrent models of visual attention," in Advances in Neural Information Processing Systems, 2014, pp. 2204–2212.
[32] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, "Spatial transformer networks," in Advances in Neural Information Processing Systems, 2015, pp. 2017–2025.
[33] M. Gualtieri and R. Platt, "Viewpoint selection for grasp detection," in IEEE/RSJ Int'l Conf. on Intelligent Robots and Systems, 2017, pp. 258–264.
[34] D. Morrison, P. Corke, and J. Leitner, "Multi-view picking: Next-best-view reaching for improved grasping in clutter," in IEEE Int'l Conf. on Robotics and Automation, 2019.
[35] B. Wu, I. Akinola, and P. Allen, "Pixel-attentive policy gradient for multi-fingered grasping in cluttered scenes," in IEEE/RSJ Int'l Conf. on Intelligent Robots and Systems, 2019.
[36] R. Sutton and A. Barto, Reinforcement Learning: An Introduction, 2nd ed. MIT Press, 2018.
[37] "Source code for: Learning manipulation skills via HSA," https://github.com/mgualti/Seq6DofManip, accessed 2019-11-13.
[38] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Int'l Conf. on Learning Representations, 2015.
[39] W. Wohlkinger, A. Aldoma, R. Rusu, and M. Vincze, "3DNet: Large-scale object class recognition from CAD models," in IEEE Int'l Conf. on Robotics and Automation, 2012, pp. 5384–5391.
[40] O. Anschel, N. Baram, and N. Shimkin, "Averaged-DQN: Variance reduction and stabilization for deep reinforcement learning," in Int'l Conf. on Machine Learning, 2017, pp. 176–185.

Marcus Gualtieri is a PhD student at Northeastern University in Boston, Massachusetts. In 2017 he received the MS degree in computer science from Northeastern, and in 2008 he received the BS degree in software engineering from Florida Institute of Technology. His research interests include robot learning and planning in unstructured environments.