Robust, Occlusion-aware Pose Estimation for Objects Grasped by Adaptive Hands
Bowen Wen, Chaitanya Mitash, Sruthi Soorian, Andrew Kimmel, Avishai Sintov, Kostas E. Bekris
Abstract — Many manipulation tasks, such as placement or within-hand manipulation, require the object's pose relative to a robot hand. The task is difficult when the hand significantly occludes the object. It is especially hard for adaptive hands, for which it is not easy to detect the fingers' configuration. In addition, RGB-only approaches face issues with texture-less objects or when the hand and the object look similar. This paper presents a depth-based framework, which aims for robust pose estimation and short response times. The approach detects the adaptive hand's state via efficient parallel search given the highest overlap between the hand's model and the point cloud. The hand's point cloud is pruned and robust global registration is performed to generate object pose hypotheses, which are clustered. False hypotheses are pruned via physical reasoning. The remaining poses' quality is evaluated given agreement with the observed data. Extensive evaluation on synthetic and real data demonstrates the accuracy and computational efficiency of the framework when applied to challenging, highly-occluded scenarios for different object types. An ablation study identifies how the framework's components help performance. This work also provides a dataset for in-hand 6D object pose estimation. Code and dataset are available at: https://github.com/wenbowen123/icra20-hand-object-pose
I. INTRODUCTION

Robot manipulation often requires recognizing objects and detecting their 6D pose, i.e., position and orientation. Applications include logistics [1], where picking is a frequent task. Once picked, an object may need to be purposefully placed for packaging, sorting or restocking. Depending on the task, regrasping or within-hand manipulation may also be required. These objectives need the object's 6D pose relative to the robot's hand post-grasp. Most existing work in pose estimation focuses on the pre-grasp case [2], [3], [4], [5], [6], which is not always a good indicator of the post-grasp pose due to the effects of contact. This is especially true for adaptive hands, such as underactuated, compliant systems that naturally and safely adapt to an object's shape as in Fig. 1. Multiple challenges arise in this context:
- Severe occlusions: The hand often significantly occludes the grasped object. Thus, solutions need to robustly distinguish the target object from the robot's fingers and the noisy scene. Small objects further complicate the process, as they are mostly covered by the hand from the camera's viewpoint.
- Unpredictable contacts and dynamic tasks: Pre-grasp pose estimation does not suffer as much from occlusions. Recent work for in-hand pose estimation [7] assumes the pose does not change significantly upon grasping and can initialize ICP (Iterative Closest Point). But as the hand grasps
the object, the pose changes dynamically. This is also true if regrasping or within-hand manipulation is performed, where it is difficult to account for contacts, especially for compliant and adaptive hands. 6D pose tracking [8], [9], [10], [11] can help but also requires a good initial estimate. If the tracking loses the object, robust pose estimation given a highly-occluded snapshot is still needed.

- Robustness and Generalizability: Pose estimation based on color or texture data [7], [12], [13], [14], [15] can be sensitive to lighting conditions, and challenging for texture-less objects or when the object and the robot hand look similar. Extracting local 3D descriptors and finding correspondences [16], [17], [18] may suffer from limited object visibility.

This paper presents a framework for robust, within-hand 6D object pose estimation using a consumer-level depth sensor. It addresses the issues arising from adaptive hands and focuses on the Yale Hand T42 [19], given its use for dexterous manipulation [20], [21]. A key feature is the estimation of the hand's state to help infer the object's region on the image. The method builds a hand-SDF (Signed Distance Field) to regularize the object's pose given physical constraints. This makes the task computationally manageable even under severe occlusions. The proposed framework exhibits these properties:

• High precision: it achieves high accuracy, even for a tight error threshold of 5 mm under the
ADI metric [22].
• Computational efficiency: it returns the pose of the object and the state of the adaptive hand in 0.5 to 0.7 seconds. The hand's state estimate may also be helpful for control purposes.
• Robustness: the method works for various objects, including textureless ones, and with a cluttered background, where RGB-based methods would struggle.

This work also contributes a synthetic and a real dataset, where an adaptive hand holds various objects, with RGB-D data and ground-truth information, since no related datasets exist in the literature beyond those for objects contained in human hands [8]. Experiments on both datasets demonstrate the effectiveness, robustness and efficiency of the proposed system for multiple objects in various scenarios, compared against state-of-the-art methods. An ablation study highlights how the method's critical components help in performance.

The authors are with the Computer Science Dept. of Rutgers Univ. in NJ, USA. This work is supported by NSF awards IIS-1734492, IIS-1723869, CCF-1934924. The opinions and findings in this paper do not necessarily reflect the sponsor's views. Email: {bw344, kostas.bekris}@cs.rutgers.edu.

Fig. 1. Left: Original image showing the adaptive hand grasping and severely occluding a texture-less object. Middle: Point-cloud data. Right: Scene reconstruction given the output of the approach.
Fig. 2. The framework acquires the RGB-D point cloud and computes the configuration of the adaptive hand given its CAD model. From this estimate, the hand is removed from the point cloud and the object segment is recovered. A set of pose candidates is generated by matching the segment to the object's model. The most likely pose is returned by evaluating the consistency of the interactions between the estimated hand and the in-hand object.
II. RELATED WORK

This section covers different approaches for object pose estimation related to manipulation tasks.
Alternatives to Vision:
Various sensors have been used for in-hand pose estimation, such as proprioception [23], [24], and contact/force sensing [25], [26], [27]. Such sensors have also been combined with vision to decrease uncertainty [28], [29], [30], [31], [32], [33], [34], [35]. Nevertheless, these sensing modalities are not always accessible, as they require careful engineering of the hands and increase cost. Under-actuated adaptive hands, for instance, do not often provide information for identifying finger configurations. Thus, a vision-only solution is desirable.
Single Image Object Pose Estimation:
Recent advances in object detection [36], [37] and pose estimation [13], [38] have shown promise given access to sufficient labeled data. This allows projecting an object's 3D bounding box on the image and solving for a pose using PnP [39], [14]. This is problematic, however, under severe occlusions. Alternatively, direct 6D pose regression has been attempted [13], [40]. Nevertheless, the complexity of SO(3) results in instability in training and prediction. Recent work [41], [42] attempts to jointly estimate a human hand and the in-hand 6D object pose while accounting for physical consistency, but the resulting precision is not sufficient for manipulation. In contrast, a robotic hand's kinematic information is available, which helps increase precision.
3D Registration Methods:
Registration methods [16] often use local geometry features followed by voting, which makes them sensitive to point cloud density; this is problematic under severe occlusions. Alignment solutions can use gradient descent optimization [43] but again degrade under severe occlusions, when only few features and correspondences can be extracted from the small point cloud segment of the object. Super4PCS [53] has been shown effective in global registration, but its RANSAC nature makes it inefficient when a large number of outliers exists. This work builds upon prior efforts [43] and achieves higher accuracy at faster speeds by introducing heuristics-guided sampling.
Object Pose Tracking:
Methods have used a variety of approaches: GPU-accelerated particle filtering with a likelihood estimation based on color, distance and normals [10]; modeling occlusions to eliminate outliers [44]; Gaussian Filtering to track objects using depth [8]. Promising precision is achieved for small errors, but tracking loss arises frequently. Recent work [45] formulates the 6D object pose tracking problem in the Rao-Blackwellized particle filtering framework. This method, however, requires a reliable single-image pose for re-initialization upon tracking loss. The current work differs from the above in that it achieves fast, high-precision estimates from individual high-occlusion snapshots without knowledge about previous frames. It can be integrated with such tracking frameworks to (re-)initialize.
Visual Servoing:
A simple solution is to attach fiducial markers [46], [32] on the object [47], [48], [20], but it is not always practical to keep the marker visible, especially during in-hand manipulation. Additionally, complex surfaces make the attachment troublesome. Recent work trained an end-to-end policy network to perform within-hand manipulation while reasoning about the object pose [12]. Its computational requirements, however, prevent easy application across conditions, such as objects unseen during training or with less distinctive features. Another effort estimated the object pose by first segmenting the robot hand with a Naive Bayes classifier and then performing ICP (Iterative Closest Point) on the object segment, assuming the object does not move much upon grasping [7]. This assumption is often violated when grasping or in-hand manipulation leads to object slippage. The current work does not depend on a pre-grasp estimate.

III. PROBLEM FORMULATION

Given a depth image from camera C and a mesh model M of object O, the goal is to compute O's 6D pose, i.e., the rigid transform T_M^C, where O is grasped by an adaptive hand in C's view. The work considers under-actuated hands (the Yale Hand T42 [19]) for which a CAD model is available. The hand state, determined by the configuration of the N fingers, x_H = {q_{F_i} : i = 1, ..., N}, is initially unknown and not available. The camera is calibrated and the transform T_H^C of the hand's wrist frame H to the camera is available.

IV. APPROACH

Fig. 2 outlines the proposed approach with 3 key components: 1) parallel evolutionary optimization to estimate the hand's configuration; 2) heuristics-guided global pointset registration to generate pose hypotheses for the object; 3) scene-level physics reasoning that considers the hand-object interaction to find the most-likely object pose.

A. Hand State Estimation

Fig. 3. Adaptive hand with 2 underactuated fingers.
An adaptive hand consists of a wrist and a set of fingers. The fingers are not sensorized to a level that provides reliable state information. Each finger F is treated as an articulated chain and its configuration is the set of all joint angles, i.e., q_F = {θ_F^1, θ_F^2, ..., θ_F^n} (see Fig. 3). A 3D region-of-interest (ROI) is identified that contains the point cloud P_S of the in-hand object and fingers. The ROI is computed based on the wrist's pose T_H^C obtained from forward kinematics and the hand dimensions. ICP, performed over the point cloud and the wrist's model, refines T_H^C to compensate for errors in forward kinematics and camera calibration.
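The wrist refinement just described can be sketched with an off-the-shelf point-to-plane ICP, for instance via Open3D. This is a minimal illustration, assuming the scene and wrist-model clouds are available as numpy arrays; it is not the paper's C++ implementation, and the distance/normal-estimation parameters are placeholders.

```python
import numpy as np
import open3d as o3d

def refine_wrist_pose(scene_points, wrist_model_points, T_init):
    """Refine the forward-kinematics wrist pose T_init (4x4) against the
    observed cloud, compensating for calibration/kinematics error."""
    scene = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(scene_points))
    wrist = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(wrist_model_points))
    # Point-to-plane ICP requires normals on the target cloud.
    scene.estimate_normals(
        o3d.geometry.KDTreeSearchParamHybrid(radius=0.01, max_nn=30))
    result = o3d.pipelines.registration.registration_icp(
        wrist, scene, max_correspondence_distance=0.01, init=T_init,
        estimation_method=o3d.pipelines.registration.TransformationEstimationPointToPlane())
    return result.transformation  # refined T_H^C
```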
The next step aims to find the finger configuration which minimizes the discrepancy between the robot hand model and the observed depth image given P_S. It is possible to formalize this problem as a convex objective optimization and employ gradient descent algorithms to obtain the optimal pose, as in related work [49]. In a tracking scenario, an initial estimate from the previous frame can provide good initialization for gradient descent to converge. Nevertheless, in single-image estimation, such as in this work, no such initial guess is assumed. For this reason, this paper proposes Particle Swarm Optimization (PSO) for searching each finger's configuration, inspired by prior work on human hand pose tracking [50]. PSO is an evolutionary process where particles interact with each other to search the parameter space. In addition to being less sensitive to local optima, it is highly parallelizable and does not require the objective function to be differentiable. This allows formalizing the cost function as minimizing the negative LCP (Largest Common Pointset) [51] score, computed via an efficient KDTree implementation.

Unlike human hands, the configuration space of robot hands is more constrained. It was empirically observed that, instead of estimating the hand state globally in PSO, sequentially estimating each finger's configuration leads to more stable solutions and faster convergence (with 15 particles and 3 iterations for each finger). Therefore, PSO was applied to each finger separately to estimate its configuration, starting from the finger closest to the camera. Each PSO particle is a vector representing the current finger configuration q_F and the swarm is a collection of particles. Initially, particles are randomly sampled and their velocities are initialized to zero. In each generation, a particle's velocity is updated as a randomly weighted summation of its previous velocity, the velocity towards its own best known position, and the velocity towards the best known position of the entire swarm.

The cost function evaluation is given in Alg. 1. The inputs are the finger configuration q_F to be evaluated, the hand-region point cloud P_S and the finger model point cloud P_F. In lines 2-5, a penalty is assigned to cases when fingers have collisions. It returns a score that is linearly dependent on the penetration depth d, to encourage particles to move to a more promising parameter space that satisfies collision avoidance. The λ_c parameter is a penalization term and is assigned a very large value. P_S is first transformed into the finger frame using forward kinematics and q_F. A KDTree is built on the transformed P_S to compute the LCP score with the finger model cloud efficiently.
Algorithm 1: COST FUNCTION(q_F, P_S, P_F)
1: P_S^finger ← transform P_S to the finger frame using forward kinematics and q_F
2: for any other finger Q_F do
3:     d ← collisionCheck(P_F, Q_F)   /* collision penetration depth (negative) */
4:     if d < ε then
5:         return −λ_c − λ_c · d
6: kdtree(P_S) ← build KDTree from P_S^finger
7: LCP ← 0
8: for each p_F ∈ P_F do
9:     p_nei ← kdtree(P_S).findNearestNeighbor(p_F)
10:    if ||p_nei − p_F|| < ε and normal(p_nei) · normal(p_F) > δ then
11:        LCP ← LCP + 1
12: return −LCP

The single-shot hand state estimation is implemented for parallel execution in C++. This component can also be very useful for tracking approaches [49], [52] as initialization or re-initialization.
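As a concrete illustration of the per-finger search, the sketch below pairs a minimal PSO loop with an LCP-style cost. It assumes numpy arrays for the clouds and a caller-supplied `to_finger_frame` callable standing in for forward kinematics; the collision penalty of Alg. 1 (lines 2-5) is omitted for brevity, and the PSO coefficients are generic defaults rather than the paper's values.

```python
import numpy as np
from scipy.spatial import cKDTree

def lcp_cost(q, scene_pts, scene_normals, finger_pts, finger_normals,
             to_finger_frame, eps=0.005, delta=0.9):
    """Negative LCP score of a finger configuration q (cf. Alg. 1).
    `to_finger_frame` is an assumed stand-in for forward kinematics: it maps
    scene points/normals into the finger frame under configuration q."""
    P, N = to_finger_frame(q, scene_pts, scene_normals)
    dist, idx = cKDTree(P).query(finger_pts)
    aligned = np.einsum('ij,ij->i', N[idx], finger_normals) > delta
    return -np.count_nonzero((dist < eps) & aligned)

def pso_finger(cost, dim, bounds, n_particles=15, n_iters=3,
               w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimal PSO over one finger's joint angles: each particle's velocity
    mixes its previous velocity, the pull toward its personal best, and the
    pull toward the swarm's best known position (Sec. IV-A)."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    x = rng.uniform(lo, hi, size=(n_particles, dim))
    v = np.zeros_like(x)                      # velocities start at zero
    pbest = x.copy()
    pbest_f = np.array([cost(p) for p in x])
    g = pbest[np.argmin(pbest_f)]             # swarm best
    for _ in range(n_iters):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = np.clip(x + v, lo, hi)
        f = np.array([cost(p) for p in x])
        better = f < pbest_f
        pbest[better], pbest_f[better] = x[better], f[better]
        g = pbest[np.argmin(pbest_f)]
    return g
```

In the sequential scheme of the paper, `pso_finger` would be invoked once per finger, closest to the camera first, with `cost` closed over that finger's model cloud.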
B. Object Pose Hypotheses Generation and Clustering
Once the full hand state x_H is available, an SDF (Signed Distance Field) is computed for the hand. All points of P_S with signed distance below a threshold, SDF(p, x_H) < ε, are eliminated (ε is a millimeter-scale threshold in the accompanying experiments). The remaining point cloud P_O is assigned to the object. The new goal is to register the object mesh M_O against the point cloud P_O, despite the imperfections of P_O due to sensor noise, occlusions or errors in the hand state estimate.

This paper builds upon prior work for hypotheses generation [53], [54]. It samples sets of 4-point, co-planar bases on the object's point cloud P_O, and searches for congruent sets on the object model M_O to provide a pool of rigid alignments (Fig. 4). Bases can be sampled randomly [53] or given the stochastic output of a CNN [54]. To limit the number of samples, while maximizing the chances of sampling a valid base (where all points belong to the object), this work proposes sampling heuristics given the hand state. The base sampling process is given in Alg. 2, where the inputs are the object point cloud P_O, the heuristics π and a hash map PPF_M of Point Pair Features (PPF) [16] of the model M_O. The hash map PPF_M is precomputed. It counts the number of times a discretized PPF feature appears on M_O. The PPF for any two points p_1, p_2 on M_O is given by:

PPF(p_1, p_2) = ( ||p_1 − p_2||, ∠(n_1, d), ∠(n_2, d), ∠(n_1, n_2) ),

where n_1 and n_2 are the point normals and d is the vector between the points. This avoids outliers from P_O. For sampling one base, 4 points are sampled incrementally by using a heuristic score associated with every point on the point cloud P_O. The heuristic score follows an exponential distribution of the Euclidean Distance Transform of each point, which is computed from the hand's signed distance field SDF:

π(p_i) ∝ 1 − exp( −λ · SDF(p_i; x_H) ),

where π(p_i) returns a point's probability to be sampled. The probability distribution over all the points of the object cloud P_O is normalized and denoted as π. Points further away from the hand are more likely to belong to the object and are prioritized. To balance exploitation and exploration, a discounting factor γ is applied to the probability of points that have already been sampled.
Algorithm 2: SAMPLE ONE BASE(P_O, π, PPF_M)
1: b_1 ← sample a point from P_O according to π
2: B ← {b_1}
3: for p ∈ P_O do
4:     f ← PPF(p, b_1)
5:     if PPF_M[f] == Ø then
6:         π(p) ← 0
7: for i ← 1 to max_iter do
8:     b_2, b_3 ← sample two different points from P_O according to the updated distribution π
9:     π(b_2) ← γ · π(b_2)
10:    π(b_3) ← γ · π(b_3)
11:    f ← PPF(b_2, b_3)
12:    if PPF_M[f] ≠ Ø and ∠(b_1b_2, b_1b_3) > δ then
13:        B ← B ∪ {b_2, b_3}
14:        break
15: for i ← 1 to max_iter do
16:    b_4 ← sample a point from P_O according to the updated distribution π
17:    π(b_4) ← γ · π(b_4)
18:    if distance(plane(b_1, b_2, b_3), b_4) > ε then
19:        continue
20:    f_1 ← PPF(b_1, b_4); f_2 ← PPF(b_2, b_4)
21:    if PPF_M[f_1] ≠ Ø and PPF_M[f_2] ≠ Ø then
22:        B ← B ∪ {b_4}
23:        break
24: return B

Fig. 4. A 4-point base is heuristically sampled. For a congruent set on the model a candidate transform is defined.
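A minimal sketch of the heuristic sampling ingredients follows, assuming a caller-supplied `hand_sdf` function that maps an (N, 3) array of points to signed distances to the estimated hand. The λ value, the γ value (which is not recoverable from the extracted text), the ε threshold and the PPF discretization steps are all illustrative placeholders.

```python
import numpy as np

def remove_hand_points(scene_pts, hand_sdf, eps=0.005):
    """Keep only points sufficiently outside the estimated hand surface:
    the object segment P_O (eps is an assumed millimeter-scale threshold)."""
    return scene_pts[hand_sdf(scene_pts) >= eps]

def ppf(p1, n1, p2, n2, step_d=0.005, step_a=np.deg2rad(12)):
    """Discretized Point Pair Feature (distance and three angles), cf. [16]."""
    d = p2 - p1
    dist = np.linalg.norm(d) + 1e-12
    def ang(a, b):
        c = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
        return np.arccos(np.clip(c, -1.0, 1.0))
    feat = (dist, ang(n1, d), ang(n2, d), ang(n1, n2))
    return (int(feat[0] / step_d),) + tuple(int(a / step_a) for a in feat[1:])

def sampling_distribution(object_pts, hand_sdf, lam=100.0):
    """pi(p) ~ 1 - exp(-lambda * SDF(p; x_H)): points far from the estimated
    hand are more likely to belong to the object and receive higher weight."""
    w = 1.0 - np.exp(-lam * np.maximum(hand_sdf(object_pts), 0.0))
    w = np.maximum(w, 1e-12)
    return w / w.sum()

def sample_point(pi, rng, gamma=0.5):
    """Draw one index, then discount its weight by gamma (assumed value) and
    renormalize, so that repeated base sampling explores new points."""
    i = rng.choice(len(pi), p=pi)
    pi[i] *= gamma
    pi /= pi.sum()
    return i
```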
The sampling ensures that the 4 points are co-planar given a small threshold (Line 18 in Alg. 2). Base sampling is repeated until a desired number of bases is achieved. Given a base B, its congruent set on the object model is retrieved by hypersphere rasterization [53]. Alignment between the matching bases can be solved in a least-squares manner [53]. This returns a set of object pose hypotheses along with their LCP scores. Base sampling and alignment are executed in parallel.

The large number of generated pose candidates often contains many incorrect or redundant poses. Clustering in SE(3) is performed to group together similar poses and reduce the size of the hypotheses set. Similar to prior work [55], a fast and effective technique is adapted for this step: a round of coarse grouping is performed in R^3 via Euclidean Distance Clustering. Then, each group is split by clustering according to the minimal geodesic distance along SO(3):

d(R_1, R_2) = arccos( (trace(R_1^T R_2) − 1) / 2 ).

Different from prior work [55], however, rather than using K-means, which can be computationally expensive, the new hypotheses are formed by the poses with the highest LCP score per cluster and refined by Point-to-Plane ICP [56]. After ICP, some candidates may converge to the same pose and are merged. The top k hypotheses (k empirically set to 100) with the highest LCP score are kept to improve computational efficiency.
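The clustering step can be illustrated with the following simplified sketch. It uses the SO(3) geodesic distance above, but replaces the two-stage Euclidean clustering and split with a single greedy pass, and the thresholds are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

def rotation_geodesic(R1, R2):
    """Geodesic distance on SO(3): arccos((trace(R1^T R2) - 1) / 2)."""
    c = (np.trace(R1.T @ R2) - 1.0) / 2.0
    return np.arccos(np.clip(c, -1.0, 1.0))

def cluster_poses(poses, t_thresh=0.01, r_thresh=np.deg2rad(15)):
    """Greedy grouping of 4x4 pose hypotheses: a pose joins an existing
    cluster if both its translation distance in R^3 and its rotation
    geodesic distance to the cluster representative are small."""
    clusters = []   # list of (translation, rotation) representatives
    labels = []
    for T in poses:
        t, R = T[:3, 3], T[:3, :3]
        for k, (tc, Rc) in enumerate(clusters):
            if (np.linalg.norm(t - tc) < t_thresh and
                    rotation_geodesic(R, Rc) < r_thresh):
                labels.append(k)
                break
        else:
            clusters.append((t, R))
            labels.append(len(clusters) - 1)
    return np.array(labels)
```

The cluster representative with the highest LCP score would then be kept and refined with ICP, as described above.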
C. Pose Hypothesis Pruning and Selection
Physical reasoning is leveraged to further prune false hypotheses via collision checking and scene-level occlusion reasoning. Physical consistency is imposed by checking if the object model collides beyond a certain depth with the estimated hand state, or if the object is located above a certain distance from the hand mesh surface, indicating that the hand is not touching the object. This process can be performed efficiently by utilizing the hand state and its SDF.

Ambiguities might still arise due to several pose candidates achieving similar LCP scores with the object under high occlusions. Any (non-corrupted) observation of a non-zero depth indicates that there is nothing between the observed point and the camera, up to some noise threshold and barring sensor error [57]. This scene-level reasoning is adapted by comparing the accumulated pixel-wise discrepancy between the observed depth image and the rendered one (computed via OpenGL using both the estimated object pose and hand state). Based on this rendering score, the top 1/3rd of the pose hypotheses are retained. The final optimal pose is selected from this set according to the highest LCP score.
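A sketch of the rendering-score comparison is given below. It assumes the observed and rendered depth images are already available as numpy arrays (the paper performs the rendering itself with OpenGL); the per-pixel clipping constant is an assumption to keep a few outlier pixels from dominating.

```python
import numpy as np

def rendering_score(observed_depth, rendered_depth, tau=0.01):
    """Accumulated pixel-wise discrepancy between the observed depth image
    and one rendered from the estimated hand state plus a candidate object
    pose. Lower is better (cf. the scene-level reasoning of [57])."""
    valid = (observed_depth > 0) & (rendered_depth > 0)
    diff = np.abs(observed_depth - rendered_depth)
    return np.sum(np.minimum(diff[valid], tau))

def keep_top_third(hypotheses, scores):
    """Retain the top 1/3 of pose hypotheses by ascending rendering score."""
    order = np.argsort(scores)
    k = max(1, len(hypotheses) // 3)
    return [hypotheses[i] for i in order[:k]]
```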
V. EXPERIMENTS

This section evaluates the proposed approach and compares against state-of-the-art single-image pose estimation methods on in-hand objects. Note the difference with tracking methods [28], [49], [52], [8], since here the 6D object pose is recovered from a single static image without dependency on previous frames. To the best of the authors' knowledge, there are no relevant datasets in the literature beyond those for objects in human hands [8]. A benchmark dataset is developed that includes both simulated and real-world data for in-hand object pose estimation with adaptive hands and will be released publicly.
A. Experimental Setup
The setup consists of a robot manipulator (Yaskawa Motoman) and a Yale T42 adaptive hand (Fig. 3), which was 3D printed based on open-source designs. Objects considered for in-hand manipulation were picked to evaluate the robustness of estimation. As shown in Fig. 6, the selected set is a mix of objects: with and without texture or geometric features.

[Fig. 5 panels: recall results for each object (Cylinder, Cuboid, Ellipse, tless3, Mustard, Tomato) grasped by the Blue and the White hand, together with a table of average recall (%) for Super4PCS and its +HS / +HS+ICP variants.]

Fig. 5. Comparison on simulation dataset. For the table, +HS implies using the proposed PSO hand pose estimation to remove the hand-related cloud from the scene, +ICP implies applying Point-to-Plane ICP for pose refinement.
Fig. 6. Mesh of objects used: a cylinder with diameter 0.035 m and length 0.064 m, an ellipsoid with length 0.064 m, a cuboid with side length 0.03 m and length 0.064 m, and an industrial object.

All experiments are conducted on a standard desktop with an Intel Xeon(R) E5-1660 processor. For the comparison to deep learning methods, neural network inference is performed on a NVIDIA Tesla K40c GPU.

B. Evaluation Metric
The recall for pose estimation is measured based on the error given by the ADI metric [22], which measures the average of point distances between poses T_1 and T_2 given an object mesh model M:

e_ADI(T_1, T_2) = avg_{p_1 ∈ M} min_{p_2 ∈ M} || T_1(p_1, M) − T_2(p_2, M) ||,

where T(p, M) corresponds to point p after applying transformation T on M. Given a ground-truth pose T_g, a true positive is a returned pose T that has e_ADI(T, T_g) < ε, where ε is a tolerance threshold. ε is set to 5 mm in all experiments except in the recall curves, to evaluate the applicability of different methods for precise in-hand manipulation scenarios.
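The ADI computation can be sketched as follows; the nearest-neighbor search uses a KD-tree, and `model_pts` is assumed to be a sampled (N, 3) array of model surface points.

```python
import numpy as np
from scipy.spatial import cKDTree

def adi_error(T1, T2, model_pts):
    """ADI metric [22]: average distance from each model point under pose T1
    to its closest model point under pose T2 (handles symmetric objects)."""
    P1 = model_pts @ T1[:3, :3].T + T1[:3, 3]
    P2 = model_pts @ T2[:3, :3].T + T2[:3, 3]
    d, _ = cKDTree(P2).query(P1)
    return d.mean()

def is_true_positive(T_est, T_gt, model_pts, eps=0.005):
    """True positive if e_ADI < eps (5 mm in most of the experiments)."""
    return adi_error(T_est, T_gt, model_pts) < eps
```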
C. Simulation Dataset and Results

Simulated RGB-D data were generated by placing a virtual camera at random poses around the model of the hand. Poses are sampled from 648 viewpoints on spheres of radius 0.3 to 0.9 m centered at the hand. To generate each data point, an object is placed at a random pose between the fingers. The two articulated fingers are closed randomly until they touch the object, verified by a collision checker. Physical parameters, such as friction, gravity or any grasping stability metric, are deliberately not employed, since this work aims at any-time single-image 6D object pose estimation during the entire within-hand manipulation process, in which a stable grasp is not always a valid assumption. By randomizing the object pose relative to the hand, the dataset covers various in-hand object poses that can occur during an in-hand manipulation process. For the adaptive hand, two colors are chosen. The blue hand differs from any object color used in the experiments, whereas the white hand resembles texture-less objects and evaluates robustness to lack of texture. In addition to the RGB-D data, ground-truth object poses and semantic segmentation images are also obtained from the simulator. For each combination of the 6 objects and the 2 adaptive hands, 1000 data points are generated, resulting in 12000 test cases.

Fig. 5 reports the recall for pose estimation on the synthetic dataset. When Super4PCS is directly applied to the entire point cloud, outlier points that do not belong to the object are often sampled, leading to poor results (5.83%). On introducing the proposed
PSO hand state estimation (HS), and thereby eliminating the hand points from the scene, points belonging to the object are more likely to be sampled, which dramatically improves the performance of Super4PCS+HS.

Recent state-of-the-art learning-based approaches are also evaluated. DOPE [14] trains a neural network to predict the 3D bounding-box vertices projected on the image and recovers the 6D pose from them via Perspective-n-Point (PnP); it has been shown to outperform PoseCNN [13] on the YCB dataset. To eliminate the domain gap from the scope of evaluation, the training and test data were generated in the same simulator and domain randomization was utilized as suggested [14]. AAE [60] is another learning-based method that trains an autoencoder network to embed object 3D orientation information using extensive data augmentation and domain randomization techniques. It has been shown to be successful on textureless objects and achieved state-of-the-art results on the T-LESS dataset [58]. This approach is only able to predict the 3D orientation; the translation is based on the output of another object detection network. For the scope of this evaluation, ground-truth bounding-boxes were provided as input to AAE [60].

A dramatic performance improvement is observed for all methods when the proposed PSO hand state estimation is utilized to remove the hand-related point cloud from the scene. This demonstrates the significance of additionally estimating the robot hand state for in-hand 6D object pose estimation.
D. Real Dataset and Results
The real dataset contains 986 snapshots of 2 Yale T42 hands holding 4 types of objects: cylinder (295), ellipse (239), cuboid (187) and tless3 (265). All the objects and the adaptive hands are 3D printed. Similar to the setting in simulation, the adaptive hands are painted in two colors: blue-green and white. The images are collected with an Intel RealSense SR300 RGB-D camera and the ground-truth poses are manually annotated using a GUI developed by the authors. Before each image is taken, objects are grasped randomly and the adaptive hand performs a random within-hand manipulation. Due to the small size of the objects relative to the hand, severe occlusions occur frequently, as exhibited in Fig. 7.

TABLE I: Left: Recall percentage (e_ADI < 5 mm) on real data: +HS means using the proposed PSO hand state estimation to remove the hand's point cloud, +ICP means applying Point-to-Plane ICP at the end for pose refinement. AAE* [60] is provided a ground-truth bounding-box. Right: recall-threshold curves of the compared methods (OURS, Super4PCS+HS, Super4PCS+HS+ICP, AAE, AAE+ICP, AAE+HS+ICP) on real data.

Method                   | Modality | cylinder | cuboid | ellipse | tless3 | Avg.
Super4PCS [53] + HS      | Depth    | 52.49    | 43.85  | 62.64   | 62.64  | 55.41
Super4PCS [53] + HS+ICP  | Depth    | 70.51    | 43.85  | 54.81   | 78.49  | 61.92
AAE* [60]                | RGB      | 11.19    | 8.56   | 15.92   | 40.38  | 19.01
AAE* [60] + ICP          | RGBD     | 43.39    | 22.99  | 27.35   | 55.85  | 37.40
AAE* [60] + HS+ICP       | RGBD     | 41.02    | 29.41  | 29.80   | 81.89  | 45.53
OURS                     | Depth    | (per-object values not recoverable from the source)

Fig. 7. Qualitative results for the proposed approach showing success and failure cases (raw image, RGB-D point cloud, output) under challenges like occlusion and symmetry.
[Fig. 8 panels: pose recall (%) vs. perturbation magnitude for the Blue and White hands, with curves for cylinder, cuboid, ellipse, tless3 and their average.]
Fig. 8. Pose recall of [7] on the real dataset. As the approach requires initialization, it is evaluated over perturbations of the ground-truth pose.
Table I presents results on real data. Given the large appearance gap between synthetic training data and real scenarios, and the presence of textureless objects, the performance of DOPE does not translate well, and it was therefore dropped from the table. AAE [60] was robust to some of these challenges and, given the ground-truth bounding-boxes, it could predict the correct rotation in some cases. An additional related work [7] was evaluated on real data. It was developed to perform pose estimation for in-hand objects during robot manipulation. It assumes the initial object pose does not change much upon grasping and serves as an initialization for ICP. To evaluate this approach, pose initialization is provided by perturbing the ground-truth pose. Fig. 8 shows how the performance of this approach varies with the perturbation. The proposed approach outperforms the best case (small perturbation) of [7] even though pose initialization is not provided to our system.
TABLE II: Ablation study of critical components in our system. Results are averaged across the entire real dataset. Baseline refers to random base sampling on the entire scene cloud.

Method                        | Mean recall (%)
Baseline                      | 8.44
(+) PSO hand pose             | 61.92
(+) PPF-constrained sampling  | 75.97
(+) Heuristic sampling        | 79.80
(+) Hypothesis pruning        | 83.52
E. System Analysis
Fig. 7 exhibits examples of the output of the proposed approach on real data, where severe occlusions occur and an additional challenge is introduced by the noise of the consumer-level depth sensor. Table II shows the ablation study, where the recall percentage for the object pose (e_ADI < 5 mm) is measured by incrementally adding the critical proposed components.

TABLE III: Run-time decomposition of the system on real data.

Component                   | Mean (ms) | Std (ms)
Pointcloud processing       | 66.47     | 10.63
Hand wrist ICP              | 9.15      | 3.46
Hand pose estimation        | 45.98     | 3.58
Pose hypothesis generation  | 90.06     | 12.00
Pose clustering and ICP     | 61.41     | 11.84
Pose hypothesis pruning     | 231.47    | 62.95
Pose selection              | 73.95     | 14.05
Misc                        | 38.17     | 10.60
Total                       | 616.64    | 64.50

Table III presents the overall running time of the system and its decomposition into the components of the proposed pipeline, when tested on real data.
Misc includes transformations, building KDTrees, etc. Given the parallel implementation, the proposed technique requires a relatively short amount of time to perform single-image pose estimation without any initialization, such as in tracking.

VI. CONCLUSIONS

This work presents a framework for fast and robust 6D pose estimation of in-hand objects. Due to the lack of relevant datasets, both real and synthetic data will be released as a benchmark for 6D object pose estimation applied to robot in-hand manipulation. Extensive experiments demonstrate the advantages of the proposed method: robustness under severe occlusions and adaptation to different objects, while able to run fast as a single-image pose estimation method. Although not real-time, it could be integrated with tracking-based methods to provide initialization or recovery from lost tracking.
REFERENCES

[1] N. Correll, K. Bekris, D. Berenson, O. Brock, A. Causo, K. Hauser, K. Osada, A. Rodriguez, J. Romano, and P. Wurman, "Analysis and Observations From the First Amazon Picking Challenge," T-ASE, 2016.
[2] A. Zeng, K. Yu, S. Song, D. Suo, E. Walker, A. Rodriguez, and J. Xiao, "Multi-view self-supervised deep learning for 6D pose estimation in the Amazon Picking Challenge," in ICRA, 2019.
[3] M. Schwarz, A. Milan, A. S. Periyasamy, and S. Behnke, "RGB-D object detection and semantic segmentation for autonomous manipulation in clutter," The International Journal of Robotics Research, 2018.
[4] M.-Y. Liu, O. Tuzel, A. Veeraraghavan, Y. Taguchi, T. K. Marks, and R. Chellappa, "Fast object localization and pose estimation in heavy clutter for robotic bin picking," The International Journal of Robotics Research, 2012.
[5] M. Zhu, K. G. Derpanis, Y. Yang, S. Brahmbhatt, M. Zhang, C. Phillips, M. Lecce, and K. Daniilidis, "Single image 3D object detection and pose estimation for grasping," in ICRA, 2014.
[6] C. Mitash, A. Boularias, and K. Bekris, "Physics-based scene-level reasoning for object pose estimation in clutter," The International Journal of Robotics Research, 2019.
[7] C. Choi, J. Del Preto, and D. Rus, "Using vision for pre- and post-grasping object localization for soft hands," in ISER, 2016.
[8] J. Issac, M. Wüthrich, C. G. Cifuentes, J. Bohg, S. Trimpe, and S. Schaal, "Depth-based object tracking using a robust Gaussian filter," in ICRA, 2016.
[9] S. Trinh, F. Spindler, E. Marchand, and F. Chaumette, "A modular framework for model-based visual tracking using edge, texture and depth features," in IROS, 2018.
[10] C. Choi and H. I. Christensen, "RGB-D object tracking: A particle filter approach on GPU," in IROS, 2013.
[11] M. F. Fallon, H. Johannsson, and J. J. Leonard, "Efficient scene simulation for robust Monte Carlo localization using an RGB-D camera," IEEE, 2012, pp. 1663–1670.
[12] M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray et al., "Learning dexterous in-hand manipulation," arXiv preprint arXiv:1808.00177, 2018.
[13] Y. Xiang, T. Schmidt, V. Narayanan, and D. Fox, "PoseCNN: A convolutional neural network for 6D object pose estimation in cluttered scenes," arXiv preprint arXiv:1711.00199, 2017.
[14] J. Tremblay, T. To, B. Sundaralingam, Y. Xiang, D. Fox, and S. Birchfield, "Deep object pose estimation for semantic robotic grasping of household objects," arXiv preprint arXiv:1809.10790, 2018.
[15] M. Kokic, D. Kragic, and J. Bohg, "Learning to estimate pose and shape of hand-held objects from RGB images," arXiv preprint arXiv:1903.03340, 2019.
[16] B. Drost, M. Ulrich, N. Navab, and S. Ilic, "Model globally, match locally: Efficient and robust 3D object recognition," IEEE, 2010, pp. 998–1005.
[17] A. G. Buch, L. Kiforenko, and D. Kraft, "Rotational subgroup voting and pose clustering for robust 3D object recognition," IEEE, 2017, pp. 4137–4145.
[18] A. Zeng, S. Song, M. Nießner, M. Fisher, J. Xiao, and T. Funkhouser, "3DMatch: Learning local geometric descriptors from RGB-D reconstructions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1802–1811.
[19] L. U. Odhner, R. R. Ma, and A. M. Dollar, "Open-loop precision grasping with underactuated hands inspired by a human manipulation strategy," IEEE Transactions on Automation Science and Engineering, vol. 10, no. 3, pp. 625–633, 2013.
[20] A. Sintov, A. S. Morgan, A. Kimmel, A. M. Dollar, K. E. Bekris, and A. Boularias, "Learning a state transition model of an underactuated adaptive hand," IEEE Robotics and Automation Letters, vol. 4, pp. 1287–1294, 2019.
[21] A. Kimmel, A. Sintov, J. Tan, B. Wen, A. Boularias, and K. E. Bekris, "Belief-space planning using learned models with application to underactuated hands," in International Symposium on Robotics Research (ISRR), 2019.
[22] S. Hinterstoisser, V. Lepetit, S. Ilic, S. Holzer, G. Bradski, K. Konolige, and N. Navab, "Model based training, detection and pose estimation of texture-less 3D objects in heavily cluttered scenes," in Asian Conference on Computer Vision. Springer, 2012, pp. 548–562.
[23] M. T. Mason, A. Rodriguez, S. S. Srinivasa, and A. S. Vazquez, "Autonomous manipulation with a general-purpose simple hand," The International Journal of Robotics Research, vol. 31, no. 5, pp. 688–703, 2012.
[24] B. S. Homberg, R. K. Katzschmann, M. R. Dogar, and D. Rus, "Haptic identification of objects using a modular soft robotic gripper," IEEE, 2015, pp. 1698–1705.
[25] S. Tian, F. Ebert, D. Jayaraman, M. Mudigonda, C. Finn, R. Calandra, and S. Levine, "Manipulation by feel: Touch-based control with deep predictive models," arXiv preprint arXiv:1903.04128, 2019.
[26] J. Bimbo, S. Luo, K. Althoefer, and H. Liu, "In-hand object pose estimation using covariance-based tactile to geometry matching," IEEE Robotics and Automation Letters, vol. 1, no. 1, pp. 570–577, 2016.
[27] K. Aquilina, D. A. Barton, and N. F. Lepora, "Principal components of touch," IEEE, 2018, pp. 1–8.
[28] T. Schmidt, K. Hertkorn, R. Newcombe, Z. Marton, M. Suppa, and D. Fox, "Depth-based tracking with physical constraints for robot manipulation," IEEE, 2015, pp. 119–126.
[29] P. K. Allen, "Integrating vision and touch for object recognition tasks," The International Journal of Robotics Research, vol. 7, no. 6, pp. 15–33, 1988.
[30] P. Hebert, N. Hudson, J. Ma, and J. Burdick, "Fusion of stereo vision, force-torque, and joint sensors for estimation of in-hand object location," IEEE, 2011, pp. 5935–5941.
[31] L. Zhang and J. C. Trinkle, "The application of particle filtering to grasping acquisition with visual occlusion and tactile sensing," IEEE, 2012, pp. 3805–3812.
[32] K.-T. Yu and A. Rodriguez, "Realtime state estimation with tactile and visual sensing. Application to planar manipulation," IEEE, 2018, pp. 7778–7785.
[33] M. Pfanne, M. Chalon, F. Stulp, and A. Albu-Schäffer, "Fusing joint measurements and visual features for in-hand object pose estimation," IEEE Robotics and Automation Letters, vol. 3, no. 4, pp. 3497–3504, 2018.
[34] J. Bimbo, L. D. Seneviratne, K. Althoefer, and H. Liu, "Combining touch and vision for the estimation of an object's pose during manipulation," IEEE, 2013, pp. 4021–4026.
[35] M. Chalon, J. Reinecke, and M. Pfanne, "Online in-hand object localization," IEEE, 2013, pp. 2977–2984.
[36] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in NIPS, 2015, pp. 91–99.
[37] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in CVPR, 2016, pp. 779–788.
[38] C. Mitash, B. Wen, K. Bekris, and A. Boularias, "Scene-level pose estimation for multiple instances of densely packed objects," arXiv preprint arXiv:1910.04953, 2019.
[39] B. Tekin, S. N. Sinha, and P. Fua, "Real-time seamless single shot 6D object pose prediction," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 292–301.
[40] C. Wang, D. Xu, Y. Zhu, R. Martín-Martín, C. Lu, L. Fei-Fei, and S. Savarese, "DenseFusion: 6D object pose estimation by iterative dense fusion," arXiv preprint arXiv:1901.04780, 2019.
[41] M. Kokic, D. Kragic, and J. Bohg, "Learning to estimate pose and shape of hand-held objects from RGB images," ArXiv, vol. abs/1903.03340, 2019.
[42] Y. Hasson, G. Varol, D. Tzionas, I. Kalevatykh, M. J. Black, I. Laptev, and C. Schmid, "Learning joint reconstruction of hands and manipulated objects," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 11807–11816.
[43] Q.-Y. Zhou, J. Park, and V. Koltun, "Fast global registration," in European Conference on Computer Vision. Springer, 2016, pp. 766–782.
[44] M. Wüthrich, P. Pastor, M. Kalakrishnan, J. Bohg, and S. Schaal, "Probabilistic object tracking using a range camera," 2013, pp. 3195–3202.
[45] X. Deng, A. Mousavian, Y. Xiang, F. Xia, T. Bretl, and D. Fox, "PoseRBPF: A Rao-Blackwellized particle filter for 6D object pose tracking," arXiv preprint arXiv:1905.09304, 2019.
[46] R. Munoz-Salinas, "ArUco: a minimal library for augmented reality applications based on OpenCV," Universidad de Córdoba, 2012.
[47] B. Calli, K. Srinivasan, A. Morgan, and A. M. Dollar, "Learning modes of within-hand manipulation," 2018, pp. 3145–3151.
[48] S. Cruciani, C. Smith, D. Kragic, and K. Hang, "Dexterous manipulation graphs," 2018, pp. 2040–2047.
[49] T. Schmidt, R. A. Newcombe, and D. Fox, "DART: Dense articulated real-time tracking," in Robotics: Science and Systems, vol. 2, no. 1. Berkeley, CA, 2014.
[50] C. Qian, X. Sun, Y. Wei, X. Tang, and J. Sun, "Realtime and robust hand tracking from depth," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1106–1113.
[51] D. Aiger, N. J. Mitra, and D. Cohen-Or, "4-points congruent sets for robust surface registration," ACM Transactions on Graphics, vol. 27, no. 3, 2008.
[52] IEEE Robotics and Automation Letters, vol. 2, no. 2, pp. 577–584, 2016.
[53] N. Mellado, D. Aiger, and N. J. Mitra, "Super 4PCS: Fast global pointcloud registration via smart indexing," in Computer Graphics Forum, vol. 33, no. 5. Wiley Online Library, 2014, pp. 205–215.
[54] C. Mitash, A. Boularias, and K. Bekris, "Robust 6D object pose estimation with stochastic congruent sets," in BMVC, 2018.
[55] C. Mitash, A. Boularias, and K. E. Bekris, "Improving 6D pose estimation of objects in clutter via physics-aware Monte Carlo tree search," in ICRA, 2018.
[56] Y. Chen and G. Medioni, "Object modelling by registration of multiple range images," Image and Vision Computing, vol. 10, no. 3, pp. 145–155, 1992.
[57] V. Ganapathi, C. Plagemann, D. Koller, and S. Thrun, "Real-time human pose tracking from range data," in European Conference on Computer Vision. Springer, 2012, pp. 738–751.
[58] T. Hodaň, P. Haluza, Š. Obdržálek, J. Matas, M. Lourakis, and X. Zabulis, "T-LESS: An RGB-D dataset for 6D pose estimation of texture-less objects," IEEE Winter Conference on Applications of Computer Vision (WACV), 2017.
[59] B. Calli, A. Singh, A. Walsman, S. Srinivasa, P. Abbeel, and A. M. Dollar, "The YCB object and model set: Towards common benchmarks for manipulation research," IEEE, 2015, pp. 510–517.
[60] M. Sundermeyer, Z. Marton, M. Durner, and R. Triebel, "Implicit 3D orientation learning for 6D object detection from RGB images," in ECCV, 2018.