GRIP: Generative Robust Inference and Perception for Semantic Robot Manipulation in Adversarial Environments
Xiaotong Chen, Rui Chen, Zhiqiang Sui, Zhefan Ye, Yanqi Liu, R. Iris Bahar, Odest Chadwicke Jenkins

X. Chen, R. Chen, Z. Sui, Z. Ye, and O. C. Jenkins are with the Department of Electrical Engineering and Computer Science, Robotics Institute, University of Michigan, Ann Arbor, MI 48109-2121, USA, [cxt|richen|zsui|zhefanye|ocj]@umich.edu. Y. Liu and R. I. Bahar are with the Department of Computer Science and the School of Engineering, Brown University, Providence, RI 02912-1910, USA, [yanqi_liu|iris_bahar]@brown.edu.
Abstract — Recent advancements have led to a proliferation of machine learning systems used to assist humans in a wide range of tasks. However, we are still far from accurate, reliable, and resource-efficient operation of these systems. For robot perception, convolutional neural networks (CNNs) for object detection and pose estimation are coming into widespread use. However, neural networks are known to suffer from overfitting during the training process and are less robust under unforeseen conditions, which makes them especially vulnerable to adversarial scenarios. In this work, we propose Generative Robust Inference and Perception (GRIP), a two-stage object detection and pose estimation system that aims to combine the relative strengths of discriminative CNNs and generative inference methods to achieve robust estimation. Our results show that a second stage of sample-based generative inference is able to recover from false object detections by CNNs and produce robust estimations in adversarial conditions. We demonstrate the robustness of GRIP through comparison with state-of-the-art learning-based pose estimators and through pick-and-place manipulation in dark and cluttered environments.
I. INTRODUCTION

Taking advantage of the renaissance in deep neural networks, machine learning has achieved great progress in object detection, segmentation, and image recognition. These deep learning methods are also prevalent in robotics for problems including manipulation in clutter [1] and learning of manipulation actions [2]. For 6D object pose estimation, learning-based convolutional neural networks (CNNs) have achieved promising accuracy and real-time inference speed [3], [4], [5]. Notably, these successes rely on well-designed models and adequate training resources. The robustness and generalization capability of CNNs depend heavily on the training data, which represents a certain range of conditions that could be faced by robots. However, due to the complex and dynamic nature of the real world, robots are subject to unforeseen environmental conditions that are not present in the training data.

More specifically, CNN recognition systems are vulnerable to errors (both benign and malicious) due to the effects of overfitting during the training process. Distorted objects and/or objects captured under poor lighting conditions can be enough to defeat the recognition abilities of a CNN [6]. Such perception errors can lead to (potentially disastrous) outcomes for embodied systems acting in the real world. These challenges for robust perception become that much harder when an adversary can modify the environment to exploit the vulnerabilities of a CNN. For instance, in the context of object recognition for a robotic system, a malicious attack through simple modifications of an environment has the potential to drastically alter and even manipulate a robot's final behavior. Fig. 1 shows such a robot manipulation task in a darkened scene.

Fig. 1: Our GRIP system perceiving and grasping an object in adversarially darkened lighting. GRIP uses two stages of (a) PyramidCNN object detection bounding boxes with confidence scores greater than 0.1 (green boxes), shown along with the ground truth (red box), and (b) sample-based generative inference. The (c) resulting estimate and (d) its localized pose (highlighted in cyan) enable (e) the Michigan Progress Fetch robot to accurately grasp the potted meat can object.

Generative-discriminative algorithms [7], [8] offer a promising avenue for robust perception. Such methods combine inference by deep learning (or other discriminative techniques) with sampling and probabilistic inference models to achieve robust and adaptive perception in adversarial environments. The value proposition of generative-discriminative inference is to get the best out of existing approaches to computational perception and robotic manipulation while avoiding their shortcomings: the robustness of belief space planning [9], [10] without its computational intractability, the recall power of neural networks without excessive overfitting [2], and the efficiency of deterministic inference without its fragility to uncertainty [11], [12]. Generative-discriminative algorithms may be especially advantageous when exposed to adversarial attack, building on foundational ideas in this space [13], [14], [15], [16], [17]. Furthermore, we expect our approach to be more generally applicable to guard against broad categories of attack, with a clear pathway to explainability of the resulting perceptual estimates.
In this paper, we present Generative Robust Inference and Perception (GRIP), a two-stage method that explores generative-discriminative inference for object recognition and pose estimation in adversarial environments. Within GRIP, we represent the first stage of inference as a CNN-based recognition distribution. The CNN recognition distribution is used within a second stage of generative multi-hypothesis optimization, implemented as a particle filter with a static state process. We show that our GRIP method produces comparable and improved performance with respect to state-of-the-art pose estimation systems (PoseCNN [3] and DOPE [4]) under adversarial scenarios with varied lighting and cluttered occlusion. Moreover, we demonstrate the compatibility of GRIP with goal-directed sequential manipulation in object pick-and-place tasks with a Michigan Progress Fetch robot.

II. BACKGROUND
A. Motivation
To get the best of both worlds, we consider the relative strengths and weaknesses of deep learning and generative inference for robust perception. We are particularly interested in the complementary properties of these methods for making perceptual decisions, where the weaknesses of one can be addressed by the strengths of the other. Despite their strengths, CNNs have several shortcomings that leave them vulnerable to adversarial action: opacity in how their decisions are made, fragility when generalizing beyond overfit training examples, and inflexibility in recovering when false decisions are produced. Goodfellow et al. [18] demonstrated that adversarial examples are misclassified by models with different architectures as well as by models trained on different subsets of the training data. These weaknesses of CNNs play to the strengths of generative probabilistic inference, which is inherently explainable, general, and resilient through the process of generating, evaluating, and maintaining a distribution of many hypotheses representing possible decisions. However, this robustness comes at the cost of computational efficiency: probabilistic inference, in contrast to CNNs, is often computationally intractable, with complexity that grows exponentially in the number of variables.
GRIP aims to overcome these limitations by combining the strengths of deep learning and probabilistic inference in a two-stage algorithm, illustrated in Fig. 2 and discussed in Section IV. The remainder of this background section provides a broader overview of related work.
B. Perception for Manipulation
Perception is a critical step for robotic manipulation in unstructured environments. Ciocarlie et al. [19] proposed an architecture for reliable grasping and manipulation, where non-touching, isolated objects are estimated by clustering the surface normals of RGB-D sensor data. The MOPED framework [17] has been proposed for object detection and pose estimation using iterative clustering estimation from multi-view features. A bottom-up approach is taken in [20] using RANSAC and Iterative Closest Point (ICP) registration, relying solely on geometric information. Narayanan et al. [13] integrated global search with discriminatively trained algorithms to balance robustness and efficiency for multi-object identification, assuming known objects.

For manipulation in dense cluttered environments, ten Pas and Platt [21] showed success in detecting grasp affordances from 3D point clouds. In [22], they sample grasp pose candidates based on their geometric plausibility, from which feasible grasp poses are selected by a CNN. For manipulation with known object geometry models, [23], [24], [25] proposed generative sampling approaches to scene estimation of object poses and physical support relations. However, these methods used object detection bounding boxes with hard thresholding as the prior for generative sampling, which can cause false negatives.
C. Object Detection and Pose Estimation
Learning-based approaches have been used either as modules within object pose estimation systems or to build end-to-end systems directly. Sui et al. [7] proposed a sample-based two-stage framework for sequential manipulation tasks, where object detection results are used as the prior for sample initialization. Mitash et al. [26] developed a two-stage approach that runs stochastic sampling of congruent sets [27] to obtain object poses based on the semantic map from a segmentation network. Among end-to-end systems, PoseCNN [3] constructs a neural network that learns segmentation, object 3D translation, and 3D rotation separately. This work also contributed an object dataset, the YCB Video Dataset, for benchmarking robotic pose estimation and manipulation approaches. DOPE [4] outperformed PoseCNN in estimation robustness in dark, occluded scenes by training the network on a synthetic dataset generated with domain randomization and photo-realistic simulation. DenseFusion [5] utilizes two networks to extract RGB and depth features separately.

In this paper, we focus on the pose estimation problem in adversarial scenarios. Liu et al. [8] provided insight into handling adversarial clutter, yet provided only limited evaluation of the approach and limited comparison with state-of-the-art methods. We believe that the performance of CNNs relies highly on the consistency of the testing environment with the training set, and that the same is true for the two-stage methods in [7] and [26], since they rely on high-quality CNN output from their first stages. Our main contribution is the development of a two-stage pose estimation system that is robust under adversarial scenarios and able to recover from false detections made by its own first stage.

Fig. 2: Overview of GRIP. The robot operating in a dark and cluttered environment is to grasp the meat can from its RGB-D observation. Stage 1 takes the RGB image and generates object bounding boxes with confidence scores. Stage 2 takes the depth image and performs sample-based generative inference to estimate the pose for each object in the scene. The samples in Stage 2 are initialized according to bounding boxes from Stage 1. From this estimate, the robot performs manipulation on the meat can object.

III. PROBLEM FORMULATION

Given an RGB-D observation (Z_r, Z_d) from the robot sensor and 3D geometry models of a known object set, our aim is to estimate the conditional joint distribution P(q, b | o, Z_r, Z_d) for each object class o, where q is the six-DoF object pose and b is the object bounding box in the RGB image. The problem can be formulated as:

  P(q, b | o, Z_r, Z_d)                                    (1)
    = P(q | b, o, Z_r, Z_d) P(b | o, Z_r, Z_d)             (2)
    = P(q | b, o, Z_d) P(b | o, Z_r),                      (3)

where the first factor in Equation (3) is the pose estimation term and the second is the detection term. Equations (1) and (2) follow from the chain rule, and Equation (3) represents the factoring into object detection and pose estimation. Here, we assume that pose estimation is conditionally independent of the RGB observation, while object detection is conditionally independent of the depth observation. Ideally, we could use Markov Chain Monte Carlo (MCMC) [28] to estimate the distribution in Equation (1). However, the state space is so large that the distribution is intractable to compute directly. End-to-end neural network methods can also be used to estimate the distribution [3], [4], [5], but they place a heavy reliance on proper coverage of the input space by the training set.
This data reliance makes such methods vulnerable to unforeseen environmental changes. SUM [7] implements a combination of Equation (1) to filter over hard detections provided by a CNN, thereby enabling it to filter out false positive CNN detections. The limitation of SUM is its inability to recover from false negatives that are eliminated from consideration in the object proposal and detection stages. In contrast, our GRIP paradigm is able to compensate for data deficiency by employing a generative sampling method in the second stage.

IV. METHOD

We propose a two-stage paradigm to combine object detection and pose estimation, as shown in Fig. 2. In the first stage of inference, PyramidCNN performs object detection and generates a prior distribution P(b | o, Z_r) of 2D bounding boxes for each object label o. In the second stage, we perform generative multi-hypothesis optimization to estimate the joint distribution P(q, b | o, Z_(r,d)) for each object label o, using the first-stage output as the prior. The second stage is implemented as an iterated likelihood weighting filter [29]:

  P_0(q_0, b_0 | o, Z_(r,d)) = P(q_0 | b_0) P(b_0 | o, Z_r)                          (4)

  P(q_t, b_t | o, Z_(r,d)) = η P(Z_(r,d) | q_t, b_t, o) P̂(q_t, b_t | o, Z_(r,d))     (5)

  P̂(q_t, b_t | o, Z_(r,d)) = ∫∫ P(q_t, b_t | q_(t-1), b_(t-1)) P(q_(t-1), b_(t-1) | o, Z_(r,d)) dq_(t-1) db_(t-1)   (6)

where η is the normalizing factor. Equation (4) is the sample initialization; in Equation (5), P(Z_(r,d) | q_t, b_t, o) is the likelihood and P̂ is the proposal distribution obtained by diffusing the previous estimate, as in Equation (6). In Equation (4), the initial pose q_0 is generated from bounding boxes b_0, which are sampled from the prior distribution produced by the first stage. After the second stage, we obtain the probability distribution over pose estimates in Equation (1). We take the best estimate to be the one with the highest probability; equivalently, the best pose q* satisfies

  q*, b* = argmax_(q,b) P(q, b | o, Z_r, Z_d).                                       (7)

A. Object Detection

The goal of the first stage is to provide a probability distribution map for an object class o in a given input image. To achieve this, we exploit the discriminative power of CNNs. Inspired by region proposal networks (RPN) [30], our PyramidCNN serves as a proposal method for the second stage. We choose the VGG-16 network [31] to extract features, which are directed to two fully convolutional networks (FCN) [32]: a classifier learning the object labels and a shape network learning the bounding box aspect ratios. The structure of PyramidCNN is detailed in Fig. 2.

The input to our networks is a pyramid of images at different scales. This enables the networks to detect objects of different sizes appearing at various distances. The output is thus a pyramid of heatmaps representing bounding boxes associated with confidence scores, positions, aspect ratios, and sizes for each object class. Unlike end-to-end learning systems, we do not apply any threshold to the confidence scores, in order to avoid false negatives generated by the first stage.
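To make the first stage concrete, the following is a minimal PyTorch sketch, not the released GRIP implementation: a pyramid-style proposal network with a VGG-16 backbone feeding two fully convolutional heads, one for object labels and one for aspect-ratio bins. The pyramid scales, layer shapes, and the background class are illustrative assumptions; only the VGG-16 backbone, the two heads, the seven aspect-ratio bins, and the absence of a confidence threshold come from the text.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16

class PyramidProposalNet(nn.Module):
    """Sketch of a Stage-1 proposal network: VGG-16 features feed a classifier
    head over object labels and a shape head over bounding-box aspect-ratio
    bins (layout is hypothetical)."""
    def __init__(self, num_classes=22, num_ratios=7):  # 21 YCB objects + background (assumed)
        super().__init__()
        self.backbone = vgg16().features                             # convolutional feature extractor
        self.cls_head = nn.Conv2d(512, num_classes, kernel_size=1)   # per-location object-label scores
        self.ratio_head = nn.Conv2d(512, num_ratios, kernel_size=1)  # per-location aspect-ratio bins

    def forward(self, image, scales=(0.5, 1.0, 2.0)):                # assumed pyramid scales
        heatmaps = []
        for s in scales:                                             # image pyramid covers object sizes/distances
            x = F.interpolate(image, scale_factor=s, mode="bilinear", align_corners=False)
            feat = self.backbone(x)
            cls = torch.softmax(self.cls_head(feat), dim=1)          # class confidence heatmap
            ratio = torch.softmax(self.ratio_head(feat), dim=1)      # aspect-ratio distribution
            heatmaps.append((s, cls, ratio))
        return heatmaps  # every location is kept; no confidence threshold is applied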
B. Pose Estimation

The purpose of the second stage is to estimate the object pose by performing iterated likelihood weighting, which offers robustness and versatility over the search space. This is critical in our context, since the manipulation task heavily depends on the accuracy of the pose estimates. We expect the second stage to perform robustly even with inaccurate detections from the first stage.
1) Initial Samples:
We use a set of weighted samples {q^(i), w^(i)}_{i=1}^M to represent the belief of the object pose, where each 6-DoF sample pose q^(i) corresponds to a weight w^(i). Given an object class o, its pose q, and the corresponding geometry model, we can render a 3D point cloud observation r using the z-buffer of a 3D graphics engine. Essentially, these rendered point clouds are what would be observed if the object had the hypothesized pose; we refer to them as rendered samples hereafter. The samples are initialized according to the first-stage output. As mentioned in Section IV-A, our CNN produces a density pyramid that is essentially a list of bounding boxes with confidence scores. We perform importance sampling over the confidence scores and initialize our samples uniformly within the 3D workspaces indicated by the sampled bounding boxes, as in Equation (4). More samples are spawned within bounding boxes with higher confidence scores.
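As an illustration of the initialization in Equation (4), the following sketch assumes Stage 1 returns parallel lists of bounding boxes and confidence scores, and that box_to_workspace is a hypothetical helper mapping a box (together with the depth image) to a 3D region; drawing orientations as uniform Euler angles is a simplifying assumption, not a detail from the paper:

import numpy as np

def initialize_samples(boxes, scores, num_samples, box_to_workspace):
    """Spawn pose hypotheses: boxes with higher confidence receive more samples.
    box_to_workspace(box) is a hypothetical helper returning (xyz_min, xyz_max),
    the 3D workspace indicated by the box and the depth observation."""
    probs = np.asarray(scores, dtype=np.float64)
    probs = probs / probs.sum()                                 # importance sampling over confidences
    picks = np.random.choice(len(boxes), size=num_samples, p=probs)
    samples = []
    for i in picks:
        xyz_min, xyz_max = box_to_workspace(boxes[i])
        position = np.random.uniform(xyz_min, xyz_max)          # uniform within the box's 3D workspace
        orientation = np.random.uniform(-np.pi, np.pi, size=3)  # uniform Euler angles (assumed)
        samples.append((np.concatenate([position, orientation]), boxes[i]))
    return samples  # each sample pairs a 6-DoF pose hypothesis with its bounding box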
2) Likelihood Function:
The weight of each sample is calculated by the likelihood function, which evaluates the compatibility of a sample with observations, as in Equation (5). The likelihood function consists of several parts: a bounding box weight, a raw pixel-wise inlier ratio, and feature-based inlier ratios. We first define the raw pixel-wise inlier function as

  Inlier(p, p') = 1(||p − p'|| < ε),      (8)

where p, p' ∈ R^3 refer to a point in the observation point cloud z and a point in the point cloud rendered from the sample pose, respectively, and 1(·) is the indicator function. A rendered point is considered an inlier if it is within a certain sensor resolution range ε of an observed point. The point-wise inlier ratio of a rendered sample is then defined as

  I = (1 / |r|) Σ_(u,v) Inlier(r(u,v), z(u,v)),      (9)

where (u, v) refers to 2D image indices in the rendered sample point cloud r and observation point cloud z, and |·| denotes point cloud size.

Besides raw point-wise inliers, we extract geometric feature point clouds from both rendered samples and the observation point cloud and compute feature inlier ratios. Hereby, we enhance the robustness of the likelihood function by considering contextual geometric information from 3D point clouds. This term prunes wrong poses that agree with the observation only at individual points but neglect higher-level geometric information such as depth discontinuities and sharp object surfaces. We apply the feature point extraction introduced by Zhang et al. [33], based on local surface smoothness:

  c(u,v) = || Σ_((u',v') ∈ N(u,v)) (p(u',v') − p(u,v)) || / (|N(u,v)| · ||p(u,v)||)      (10)

c(u,v) is calculated by summing all displacement vectors from p(u,v) to each of its neighboring points N(u,v); the point cloud p here can be either a rendered sample r or the observation z. The value is then normalized by the size of N(u,v) and the length of the vector p(u,v). Intuitively, c describes the rate of depth change within a certain local range: it has larger values in areas with acute depth changes and smaller values where object surfaces are consistent. We extract two kinds of features, edge points and planar points, by selecting point sets with the largest and smallest c values, respectively. To balance feature point density in areas with different observation quality, we set a maximum number of edge points and planar points to be extracted from a given local area. Essentially, a point at (u, v) can be selected as an edge or planar point only if its c value is larger or smaller than a threshold and if the number of selected points has not exceeded the limit. We find that the algorithm is insensitive to our feature extraction parameters. Finally, we apply feature extraction on both the rendered sample and the scene observation point cloud to obtain sample features and observation features, and use the same inlier calculation in Equations (8) and (9) to compute feature inlier ratios.

The weight w of each hypothesis q is defined as

  W(q) = α_box w_box + α_b I_b + α_r I_r + α_e I_e + α_p I_p      (11)

where w_box is the confidence score of the bounding box, I_r is the ratio of pixel-wise inliers over the whole rendered sample point cloud, I_b is the inlier ratio over the portion of the rendered sample that lies within the bounding box (I_b is 0 if no rendered sample point falls into the bounding box), and I_e and I_p are inlier ratios of the sample edge and planar features with respect to the observation features.
The coefficients α_* represent the importance of each likelihood term and sum to 1. The first two terms, w_box and I_b, the network terms, are heavily determined by the bounding boxes and describe the consistency between the pose sample and the detection result. The last three terms, the geometric terms, weigh how well the current hypothesis explains the scene geometry.
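For concreteness, here is a sketch of how Equations (8), (9), and (11) could be evaluated, assuming the rendered and observed point clouds are H x W x 3 arrays aligned on the image grid. The masked per-pixel ratios are a simplified stand-in for the paper's feature-to-feature inlier ratios, and the coefficient values are placeholders that only respect the 20%/80% network/geometric split described later in the implementation section:

import numpy as np

def inlier_ratio(rendered, observed, eps=0.005, mask=None):
    """Fraction of pixels whose rendered point lies within eps of the observed
    point at the same pixel (Equations (8)-(9)); mask optionally restricts the
    ratio to a subset of pixels (bounding box, edge features, planar features)."""
    valid = np.isfinite(rendered).all(axis=-1) & np.isfinite(observed).all(axis=-1)
    if mask is not None:
        valid &= mask
    if valid.sum() == 0:
        return 0.0                                  # e.g. no rendered point falls inside the box
    dist = np.linalg.norm(rendered - observed, axis=-1)
    return float((dist[valid] < eps).mean())

def hypothesis_weight(rendered, observed, box_mask, edge_mask, planar_mask,
                      w_box, alphas=(0.1, 0.1, 0.3, 0.25, 0.25)):
    """Weighted sum of network and geometric terms (Equation (11)); the alpha
    values are placeholders, not the coefficients used in the paper."""
    a_box, a_b, a_r, a_e, a_p = alphas
    I_b = inlier_ratio(rendered, observed, mask=box_mask)     # inliers inside the detection box
    I_r = inlier_ratio(rendered, observed)                    # inliers over the whole rendering
    I_e = inlier_ratio(rendered, observed, mask=edge_mask)    # edge-feature inliers
    I_p = inlier_ratio(rendered, observed, mask=planar_mask)  # planar-feature inliers
    return a_box * w_box + a_b * I_b + a_r * I_r + a_e * I_e + a_p * I_p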
3) Update Process:
To produce object pose estimates, we follow the procedure of iterated likelihood weighting by first assigning a new weight to each sample. Resampling is done with replacement according to the sample weights. During the diffusion process in Equation (6), each pose q_t^(i) is diffused in the space subject to zero-mean Gaussian noise N_(T,t)(0, σ_(T,t)) and N_(R,t)(0, σ_(R,t)) with time-varying variances for translation and rotation, respectively. The standard deviations σ_(T,t) and σ_(R,t) at iteration t are decayed according to W(q*_t), the weight of the best pose estimate q*_t at that iteration. Bounding boxes b_t^(i) are diffused uniformly within the image. The algorithm terminates when W(q*_t) reaches a threshold w̄ or when the iteration limit is reached. Finally, we assume that the pose weights of objects present in the scene will be much higher than those of non-existing objects.
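A compact sketch of this loop is shown below, with the renderer, the weight function from Equation (11), and the noise-decay schedule passed in as callables; the names and default values are illustrative, and the uniform diffusion of bounding boxes within the image is omitted for brevity:

import numpy as np

def iterated_likelihood_weighting(samples, render, weight_fn, decay_fn,
                                  sigma_t0=0.05, sigma_r0=0.5, w_bar=0.9, max_iters=400):
    """Weight, resample with replacement, then diffuse poses with zero-mean
    Gaussian noise whose scale shrinks as the best weight improves
    (Equations (5) and (6)); the sigma defaults are placeholders."""
    poses = np.array([p for p, _ in samples])       # (M, 6) pose hypotheses
    boxes = [b for _, b in samples]
    best_pose, best_w = poses[0].copy(), 0.0
    for _ in range(max_iters):
        weights = np.array([weight_fn(render(p), b) for p, b in zip(poses, boxes)])
        i_best = int(weights.argmax())
        if weights[i_best] > best_w:
            best_w, best_pose = weights[i_best], poses[i_best].copy()
        if best_w >= w_bar:
            break                                   # converged
        total = weights.sum()
        probs = weights / total if total > 0 else np.full(len(weights), 1.0 / len(weights))
        picks = np.random.choice(len(poses), size=len(poses), p=probs)   # resample
        poses, boxes = poses[picks], [boxes[i] for i in picks]
        lam = decay_fn(best_w)                      # time-varying noise scale (see Section V-A)
        poses[:, :3] += np.random.normal(0.0, lam * sigma_t0, poses[:, :3].shape)  # translation
        poses[:, 3:] += np.random.normal(0.0, lam * sigma_r0, poses[:, 3:].shape)  # rotation
    return best_pose, best_w                        # Equation (7): highest-weight hypothesis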
V. EXPERIMENTS

A. Implementation

We use PyTorch for our CNN implementation, based on a VGG-16 model pre-trained on ImageNet [34] (more architectures are tested in [8]). The shape network branch of our CNN predicts 7 different aspect ratios. Each training image is 224 × 224 and contains a single object, so the aspect ratio of an object in a training image can be inferred from the width and height of the object. We apply a softmax at the end of the network to generate probability distributions over object classification and aspect ratio, and use cross entropy as the training loss.

Our second-stage pose estimation relies on the OpenGL graphics engine to render depth images from 3D geometry models and hypothesis poses on Nvidia GTX 1080 / RTX 2070 graphics cards. During the iterated likelihood weighting process, we allocate 625 samples per iteration and run the algorithm for 400 iterations in total, with ε set to 0.005 m. The sample size is limited by the buffer size of our rendering engine, while the iteration limit was set because our pose estimation converges after approximately 150 iterations (see, e.g., Fig. 3) in less than 10 s. ε was set to the approximate distance between adjacent points in the 3D point clouds.

Fig. 3: Plot of pose accuracy vs. iteration number over all converged trials. The point-wise distance is calculated using the ADD and ADD-S metrics [3] and is not used in pose estimation. After about 150 iterations, the point-wise distance falls below 0.005 m.

In the feature extraction described in Section IV-B.2, we extract up to 5 edge points and 2 planar points from each local region, and we consider p(u,v) an edge point if ln(c(u,v)) is above a threshold and a planar point otherwise. These hyper-parameters were determined experimentally to give a clear indication of object boundaries as well as surfaces. The likelihood coefficients α_box, α_b, α_r, α_e, α_p are set empirically (with α_p = 0.25). Through experiments, we find that system performance is sensitive to the total weight allocated to the network terms versus the geometric terms, rather than to the allocation within each category. If the first stage produces accurate detections as measured by mean average precision (mAP), one can take advantage of it by allocating more weight to the network terms. Otherwise, one should reduce the weight of the network terms to attenuate the negative impact of an underperforming first stage. Since our first stage produces low-mAP detections, we allocate only 20% of the weight to the network terms and the remaining 80% to the geometric terms, since these terms are robust to adversarial scenarios and unreliable first-stage detection. Further weight allocation within each category is done approximately evenly.

During diffusion, the standard deviations of the Gaussian noises are decayed by a common factor λ_t, which drops exponentially from 1.0 to 0.0 as W(q*_t) increases from 0.6 to 1.0. In other words, the standard deviations at iteration t are given by σ_(*,t) = λ_t σ_(*,0), where λ_t = 1 while W(q*_t) < 0.6 and decays exponentially toward 0 as W(q*_t) rises from 0.6 to 1.0, and σ_(T,0) and σ_(R,0) are the initial standard deviations for translation (in meters) and rotation (in radians), respectively. The convergence threshold w̄ is set to 0.9.
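As a concrete example of this schedule, the sketch below reproduces only the stated behavior (λ_t = 1 while W(q*_t) < 0.6, then a fall-off toward 0 as the best weight approaches 1.0); the exponential form and the rate constant are assumptions, not values from the paper:

import numpy as np

def decay_fn(best_w, w_low=0.6, rate=5.0):
    """Assumed noise-decay schedule: no annealing until the best weight exceeds
    w_low, then an exponential fall-off toward 0 as best_w approaches 1.0.
    The rate constant is a placeholder."""
    if best_w < w_low:
        return 1.0
    return float(np.exp(-rate * (best_w - w_low) / (1.0 - w_low)))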
B. Dataset and Baselines

We use the YCB Video Dataset [3] as the training data for our first-stage PyramidCNN. The YCB Video Dataset consists of 133,827 frames of 21 objects under normal conditions with balanced and adequate lighting and no occlusion. To compare our two-stage method against the baseline methods, PoseCNN [3] and DOPE [4], we collect a testing dataset (the adversarial YCB dataset) of 40 scenes with 15 of the 21 objects from the YCB Video Dataset under adversarial scenarios. In each scene, we place 5-7 different objects on a table and collect seven frames: one in normal lighting, one in darkness, two with different single light sources, and three with different cluttered object placements (see Fig. 4). The dark setting and the two single-lighting settings bias the image pixel values away from the training set and thus undermine network prediction; we refer to these settings as varied lighting for simplicity. In addition, object clutter causes occlusions as well as natural information loss, challenging the robustness of pose estimation algorithms. All the scene images and 3D point clouds are gathered by the RGB-D sensor on our Fetch robot. Ground truth bounding boxes and 6D poses are manually labeled.

Fig. 4: Testing dataset with YCB objects under adversarial settings. The base-setting data is collected with regular lighting and without occlusions. The dark-setting data is collected with the lights off in the room. The single-lighting data is collected with a flashlight. Object poses are the same in these three settings. Data in the three occlusion scenes is collected with the same objects randomly stacked.
C. Evaluation

1) Comparing accuracy with PoseCNN and DOPE on 4 YCB objects:
We compare our pose estimation accuracy with PoseCNN (with ICP refinement) and DOPE on the adversarial YCB dataset under the different adversarial settings. We use the pre-trained models from the authors' project pages for PoseCNN (https://rse-lab.cs.washington.edu/projects/posecnn/) and DOPE (https://github.com/NVlabs/Deep_Object_Pose), and train our first-stage PyramidCNN on 2500 frames from the original YCB Video Dataset. Since DOPE is trained on 5 of the 21 objects from the YCB Video Dataset, we first compare all three methods on 4 of them: 003_cracker_box, 005_tomato_soup_can, 006_mustard_bottle, and 010_potted_meat_can. The fifth object, 004_sugar_box, was unavailable on the market when this experiment was set up. We use the ADD and ADD-S metrics [3] to calculate pose error for asymmetric and symmetric objects (marked with asterisks in Table I), respectively. In manipulation tasks, the tolerable pose estimation error is bounded by the clearance that objects have when placed in the robot end effector. Based on the sizes of the Fetch robot gripper and the target objects, we choose 0.04 m as the maximum error tolerance. We then plot accuracy-threshold curves over the range [0.00 m, 0.04 m] in Fig. 5 and use the AUC (area under the accuracy-threshold curve) as the evaluation metric. GRIP outperforms the other two methods under most error thresholds, especially lower ones, and thereby facilitates robotic manipulation tasks. See Fig. 6 for a qualitative comparison of all three methods under different adversarial settings.

Fig. 5: Comparison of DOPE, PoseCNN+ICP, and our GRIP two-stage method on pose estimation accuracy for the 4 objects mentioned in Sec. V-C, under (a) base settings, (b) varied lighting settings, and (c) occlusion settings.
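For reference, here is a minimal sketch of the ADD / ADD-S point distances [3] and the area under the accuracy-threshold curve used above, assuming object model points are given as an (N, 3) array and approximating the AUC by averaging accuracy over evenly spaced thresholds up to 0.04 m:

import numpy as np

def add_distance(model_points, R_gt, t_gt, R_est, t_est, symmetric=False):
    """ADD: mean distance between corresponding transformed model points.
    ADD-S (symmetric=True): mean distance to the closest transformed point."""
    gt = model_points @ R_gt.T + t_gt
    est = model_points @ R_est.T + t_est
    if not symmetric:
        return float(np.linalg.norm(gt - est, axis=1).mean())
    pairwise = np.linalg.norm(gt[:, None, :] - est[None, :, :], axis=2)  # all-pairs distances
    return float(pairwise.min(axis=1).mean())

def auc(errors, max_threshold=0.04, steps=100):
    """Area under the accuracy-threshold curve, normalized to [0, 1]."""
    errors = np.asarray(errors)
    thresholds = np.linspace(0.0, max_threshold, steps)
    return float(np.mean([(errors < th).mean() for th in thresholds]))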
TABLE I: Overall performance (area under the accuracy-threshold curve) of 15 YCB objects for DOPE, PoseCNN with ICP, and our GRIP method under the base, varied lighting, and occlusion settings. Symmetric objects are marked with asterisks and evaluated using ADD-S; asymmetric objects are evaluated using ADD.

Fig. 6: Comparison of GRIP, DOPE, and PoseCNN under adversarial scenarios. In the varied lighting condition, DOPE only detects 006_mustard_bottle correctly, while PoseCNN makes inaccurate depth estimates (marked yellow). In the occlusion condition, DOPE misses half of the objects, while PoseCNN fails to detect object poses in clutter. GRIP correctly detects all objects under both scenes except 051_large_clamp under the occlusion setting (cyan arrow), where the sampling converges to 052_extra_large_clamp because of their geometric similarity.
2) Comparing accuracy with PoseCNN on 15 YCB objects:
Next, we perform an extensive comparison of our method with PoseCNN (with ICP) on 15 of the 21 YCB objects. Table I and Fig. 7 show our overall results and detailed accuracy evaluations for each object.
GRIP outperforms PoseCNN+ICP for most objects under all three settings. All methods perform worse under varied lighting and occlusions than under the base setting. We can infer the strengths and weaknesses of each method from its performance variance across objects. For example, PoseCNN with ICP performs better on symmetric objects such as 003_cracker_box and 061_foam_brick than on others such as 021_bleach_cleanser; symmetric objects contain repetitive features that are more likely to be captured by learning-based systems. GRIP performs better on objects that are well recognizable by a depth camera. Large and compact objects such as 006_mustard_bottle and 024_bowl naturally generate dense and continuous 3D point cloud observations that effectively capture their geometry. Objects with thin or articulated parts, such as 037_scissors, 052_extra_large_clamp, and 025_mug, produce sparse point clouds around their handle-like parts that do not effectively reveal the scene geometry, especially object orientation. Hence, our GRIP algorithm best suits scenarios where rich depth sensory data are available, owing to detectable object dimensions and surface materials or high-definition depth sensors. Finally, distinguishing near-identical objects remains challenging: 051_large_clamp and 052_extra_large_clamp have identical colors and shapes and differ only slightly in size, which results in poor estimation accuracy for all methods.
D. Robotic Manipulation

GRIP has been successfully used as the perception module in real robot manipulation tasks, such as the grocery packing task shown in the accompanying video.

VI. CONCLUSIONS

We have introduced GRIP as a two-stage method for robust 6D object pose estimation suited to adversarial settings.
GRIP demonstrated similar and improved performance with respect to state-of-the-art neural network pose estimators on the adversarial YCB dataset. The key insight of GRIP is to avoid hard thresholding, which introduces false positives and false negatives, until a final pose estimate is required. Avoiding hard thresholds increases the possibility of finding the true pose in adversarial environments. In addition, a generative second stage inherently provides an avenue for explainable perception without requiring the deciphering of network weights. This generative process also readily extends to tracking over time through the inclusion of a proper process model. The presented results are also amenable to improvement, given the limited types of features considered. These benefits come at the cost of assuming that only one instance of each object is present in the scene. For future work, we aim to address these limitations by exploring features amenable to robust inference with multiple object instances in greater clutter.

Fig. 7: Overall pose estimation accuracy of 15 YCB objects using PoseCNN with ICP and our GRIP method, plotted as accuracy (%) against distance threshold (m) under the base, varied lighting, and occlusion settings.

REFERENCES
[1] Marcus Gualtieri, Andreas ten Pas, and Robert Platt. Pick and place without geometric object models. Pages 7433-7440. IEEE, 2018.
[2] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334-1373, 2016.
[3] Yu Xiang, Tanner Schmidt, Venkatraman Narayanan, and Dieter Fox. PoseCNN: A convolutional neural network for 6D object pose estimation in cluttered scenes. arXiv preprint arXiv:1711.00199, 2017.
[4] Jonathan Tremblay, Thang To, Balakumar Sundaralingam, Yu Xiang, Dieter Fox, and Stan Birchfield. Deep object pose estimation for semantic robotic grasping of household objects. arXiv preprint arXiv:1809.10790, 2018.
[5] Chen Wang, Danfei Xu, Yuke Zhu, Roberto Martín-Martín, Cewu Lu, Li Fei-Fei, and Silvio Savarese. DenseFusion: 6D object pose estimation by iterative dense fusion. arXiv preprint arXiv:1901.04780, 2019.
[6] Ivan Evtimov, Kevin Eykholt, Earlence Fernandes, Tadayoshi Kohno, Bo Li, Atul Prakash, Amir Rahmati, and Dawn Song. Robust physical-world attacks on deep learning models. arXiv preprint arXiv:1707.08945, 2017.
[7] Zhiqiang Sui, Zheming Zhou, Zhen Zeng, and Odest Chadwicke Jenkins. SUM: Sequential scene understanding and manipulation. In Intelligent Robots and Systems (IROS), 2017 IEEE/RSJ International Conference on, pages 3281-3288. IEEE, 2017.
[8] Yanqi Liu, Alessandro Costantini, R. Bahar, Zhiqiang Sui, Zhefan Ye, Shiyang Lu, and Odest Chadwicke Jenkins. Robust object estimation using generative-discriminative inference for secure robotics applications. In Proceedings of the International Conference on Computer-Aided Design, page 75. ACM, 2018.
[9] Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1-2):99-134, 1998.
[10] Leslie Pack Kaelbling and Tomás Lozano-Pérez. Integrated task and motion planning in belief space. The International Journal of Robotics Research, 32(9-10):1194-1227, 2013.
[11] Richard E. Fikes and Nils J. Nilsson. STRIPS: A new approach to the application of theorem proving to problem solving. Artificial Intelligence, 2(3-4):189-208, 1971.
[12] Shiwali Mohan, Aaron H. Mininger, James R. Kirk, and John E. Laird. Acquiring grounded representations of words with situated interactive instruction. In Advances in Cognitive Systems. Citeseer, 2012.
[13] Venkatraman Narayanan and Maxim Likhachev. Discriminatively-guided deliberative perception for pose estimation of multiple 3D object instances. In Robotics: Science and Systems, 2016.
[14] Venkatraman Narayanan and Maxim Likhachev. PERCH: Perception via search for multi-object recognition and localization. Pages 5052-5059. IEEE, 2016.
[15] Ziyuan Liu, Dong Chen, Kai M. Wurm, and Georg von Wichert. Table-top scene analysis using knowledge-supervised MCMC. Robotics and Computer-Integrated Manufacturing, 33:110-123, 2015.
[16] Dominik Joho, Gian Diego Tipaldi, Nikolas Engelhard, Cyrill Stachniss, and Wolfram Burgard. Nonparametric Bayesian models for unsupervised scene analysis and reconstruction. Robotics, page 161, 2013.
[17] Alvaro Collet, Manuel Martinez, and Siddhartha S. Srinivasa. The MOPED framework: Object recognition and pose estimation for manipulation. The International Journal of Robotics Research, 30(10):1284-1306, 2011.
[18] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
[19] Matei Ciocarlie, Kaijen Hsiao, Edward Gil Jones, Sachin Chitta, Radu Bogdan Rusu, and Ioan A. Şucan. Towards reliable grasping and manipulation in household environments. In Experimental Robotics, pages 241-252. Springer, 2014.
[20] Chavdar Papazov, Sami Haddadin, Sven Parusel, Kai Krieger, and Darius Burschka. Rigid 3D geometry matching for grasping of known objects in cluttered scenes. The International Journal of Robotics Research, 31(4):538-553, 2012.
[21] Andreas ten Pas and Robert Platt. Localizing handle-like grasp affordances in 3D point clouds. In Experimental Robotics, pages 623-638. Springer, 2016.
[22] Andreas ten Pas, Marcus Gualtieri, Kate Saenko, and Robert Platt. Grasp pose detection in point clouds. The International Journal of Robotics Research, 36(13-14):1455-1473, 2017.
[23] Zhiqiang Sui, Odest Chadwicke Jenkins, and Karthik Desingh. Axiomatic particle filtering for goal-directed robotic manipulation. Pages 4429-4436. IEEE, 2015.
[24] Karthik Desingh, Odest Chadwicke Jenkins, Lionel Reveret, and Zhiqiang Sui. Physically plausible scene estimation for manipulation in clutter. Pages 1073-1080. IEEE, 2016.
[25] Zhen Zeng, Zheming Zhou, Zhiqiang Sui, and Odest Chadwicke Jenkins. Semantic robot programming for goal-directed manipulation in cluttered scenes. Pages 7462-7469. IEEE, 2018.
[26] Chaitanya Mitash, Abdeslam Boularias, and Kostas Bekris. Robust 6D object pose estimation with stochastic congruent sets. arXiv preprint arXiv:1805.06324, 2018.
[27] Nicolas Mellado, Dror Aiger, and Niloy J. Mitra. Super 4PCS: Fast global point cloud registration via smart indexing. In Computer Graphics Forum, volume 33, pages 205-215. Wiley Online Library, 2014.
[28] W. Keith Hastings. Monte Carlo sampling methods using Markov chains and their applications. 1970.
[29] Stephen J. McKenna and Hammadi Nait-Charif. Tracking human motion using auxiliary particle filters and iterated likelihood weighting. Image and Vision Computing, 25:852-862, 2007.
[30] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91-99, 2015.
[31] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[32] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431-3440, 2015.
[33] Ji Zhang and Sanjiv Singh. LOAM: Lidar odometry and mapping in real-time. In