Multi-FinGAN: Generative Coarse-To-Fine Sampling of Multi-Finger Grasps
Jens Lundell, Enric Corona, Tran Nguyen Le, Francesco Verdoja, Philippe Weinzaepfel, Gregory Rogez, Francesc Moreno-Noguer, Ville Kyrki
This work was supported in part by the Academy of Finland Strategic Research Council grant 314180, CHIST-ERA project IPALM (Finland 326304, Spain PCI2019-103386), and by the Spanish government with projects HuMoUR TIN2017-90086-R and Maria de Maeztu Seal of Excellence MDM-2016-0656. J. Lundell ([email protected]), T. Nguyen Le, F. Verdoja, and V. Kyrki are with the School of Electrical Engineering, Aalto University, Finland. E. Corona and F. Moreno-Noguer are with Institut de Robòtica i Informàtica Industrial, CSIC-UPC, Barcelona, Spain. G. Rogez and P. Weinzaepfel are with NAVER LABS Europe, Meylan, France.
Fig. 1: From an input RGB-D image, Multi-FinGAN generates a diverse set of grasps from all around the object in about a second, and then executes the highest scoring grasp on the real robot.
Abstract — While there exists a large number of methods for manipulating rigid objects with parallel-jaw grippers, grasping with multi-finger robotic hands remains a quite unexplored research topic. Reasoning and planning collision-free trajectories on the additional degrees of freedom of several fingers represents an important challenge that, so far, involves computationally costly and slow processes. In this work, we present Multi-FinGAN, a fast generative multi-finger grasp sampling method that synthesizes high quality grasps directly from RGB-D images in about a second. We achieve this by training in an end-to-end fashion a coarse-to-fine model composed of a classification network that distinguishes grasp types according to a specific taxonomy and a refinement network that produces refined grasp poses and joint angles. We experimentally validate and benchmark our method against standard grasp-sampling methods on 790 grasps in simulation and 20 grasps on a real Franka Emika Panda. All experimental results using our method show consistent improvements both in terms of grasp quality metrics and grasp success rate. Remarkably, our approach is up to 20-30 times faster than the baseline, a significant improvement that opens the door to feedback-based grasp re-planning and task informative grasping.
I. INTRODUCTION
Generating multi-fingered grasps for unknown objects such as the one shown in Fig. 1 is still non-trivial and considerably more challenging than using parallel-jaw grippers for the same task. However, by actuating more joints, multi-fingered grippers allow a robot to perform more advanced manipulations, including precision grasping flat disks or power grasping spherical objects. Typical methods for multi-fingered grasp generation require known 3D object models and poses in order to sample a large space of candidate grasps and then evaluate them based on physical grasping metrics such as the ε-quality [1]. In case of unknown object poses, pose estimation is the de facto standard solution [2], while for unknown models estimating the shape through, e.g., shape completion or mirroring has been shown to work well in many scenarios [3]–[5]. Nevertheless, generating good candidate grasps with these methods is computationally expensive as it usually relies on a stochastic search process such as simulated annealing over a large search space. For instance, the search space for the robotic hand we consider in this work, the
Barrett hand seen in Fig. 1, has 7 Degrees-Of-Freedom (DOF) which, together with the 6D object pose (3 rotations and 3 translations), results in a 13-dimensional search space. Despite clever solutions to reduce the search space, such as limiting the search over eigengrasps [6], the process is still inherently slow (in the order of several tens of seconds) due to the stochastic search procedure.

In this work, we present a deep network inspired by recent work from the computer vision community on human hand grasp synthesis [7], [8] that can generate and evaluate multi-finger grasps on unknown objects in roughly a second. To achieve this, we devise a generative architecture for coarse-to-fine grasp sampling named
Multi-FinGAN that is purelytrained on synthetic data. Especially the integration of anovel parameter-free finger refinement layer based on a fullydifferentiable forward kinematics layer of the
Barrett hand facilitates fast learning and robust grasp generation.The proposed sampling method is quantitatively evaluatedin both simulation, where we compare over 790 graspsagainst baselines in terms of analytical grasp quality metrics,and on a real
Franka Emika Panda equipped with a
Barrett hand, where we evaluate grasp success rate on 10 grasps per method. In both cases, our approach demonstrates a significant reduction in running time compared to baselines while still generating grasps with high quality metrics and success rate.

In summary, the main contributions of this work are: (i) a novel generative method for multi-finger grasp selection that enables fast sampling with high coverage; (ii) a novel loss function for guiding grasps towards the object while minimizing interpenetration; and (iii) an empirical evaluation of the proposed method against the state-of-the-art, presenting, both in simulation and on real hardware, improvements in terms of running time, grasp ranking, and grasp success rate.

II. RELATED WORK
When considering parallel-jaw grippers, a large corpus of data-driven generative [9]–[12] and classification-based [13], [14] grasping methods exists. Many of these approaches [9], [13] reach a grasp success rate over 90% on a wide variety of objects by constraining to top-down-only grasps. However, as recently discussed by Wu et al. [15], the simplifications made in these works exclude many solutions that could be used for applications like semantic and affordance grasping [16] or multi-finger grasping for dexterous manipulation. Despite the limitations of parallel-jaw grippers, alternative methods using multi-fingered hands have not seen as much development [15]–[21]. These approaches generally underperform both in terms of running time and grasp success rate compared to their parallel-jaw counterparts.

One of the earliest deep-learning-based multi-finger grasping works trained a network to detect the palm and fingertip positions of stable grasps directly from an RGB-D view [17]. That method achieved a 75% grasp success rate on 8 objects but relied on an external planner (GraspIt!) at run-time to generate grasp samples, which made the method slow (16.6 seconds on average to generate a grasp). Our generative method also takes RGB-D images as input but does not require any external planner, making it a much faster solution.

To remove the need for a slow external planner, recent work has also focused on generating grasps [18], [19], [21]. For instance, in [21], the authors train a network that regresses from a voxel grid representation of the object to the output pose and configuration of the gripper. Similar to ours, this work also employs the known forward kinematics equation of the gripper to compute a collision loss. At run-time the generated grasp is refined by searching over all ground-truth grasps, selecting the grasp closest to the generated one. The main drawback is that the grasps are viewpoint dependent, so to generate grasps in all possible locations around the object, the input representation needs to be rotated. Our method, on the other hand, can generate grasps from any orientation in just a single forward pass.

The work in [19] proposes a generative-evaluative model which both generates grasps and subsequently tests them. Grasps are produced through stochastic hill-climbing on a product of experts, which is a sequential and time-consuming process. Our model, on the other hand, only requires one forward pass to generate a grasp that is then evaluated by computing analytical quality metrics [1].

The work most similar to ours is by Lu et al. [18]. It proposes a deep network that, given an initial grasp configuration and an RGB-D or voxel representation of the object, optimizes a hand pose and finger joints to increase grasp success. The proposed method reached an average grasp success rate of 57.5% but requires roughly 5–10 seconds to generate a grasp. Our work, in comparison, does not require an explicit initial hand configuration as the network implicitly learns to predict such a configuration. Moreover, our method is much faster at generating grasps.

Tangential to training a multi-finger grasp sampler with supervised learning is to use Reinforcement Learning (RL). In [15], Wu et al. learned a deep 6-DOF multi-finger grasping policy directly from RGB-D inputs. That work introduced a novel attention mechanism that zooms in and focuses on sub-regions of the depth image to achieve better grasps in dense clutter.
Although the policy was trained purely in simulation, it transferred seamlessly to the real world and attained a high grasp success rate on a diverse set of objects. Nevertheless, training such a method requires an elaborate simulation setup and fine-tuning of hyper-parameters for the RL method to work well.

III. PROBLEM FORMULATION
In this work, we consider the problem of grasping unknown objects with a multi-fingered robotic hand. This implies producing a grasp that does not interpenetrate the object but has several contact points with it. More formally, we train a model M that takes as input an RGB-D image I and produces a grasp type c, a 6D gripper pose p, and a valid hand joint configuration q: M : I ⇒ {c, p, q}. We represent the hand joint configuration q by the 7-DOF Barrett hand shown in Fig. 1. c is a coarse grasp class within the 33-grasp taxonomy listed in [22]. Due to physical constraints, the Barrett hand can achieve only 7 out of these 33 grasp types, namely: small wrap, medium wrap, large wrap, power sphere, precision sphere, precision grasp, and pinch grasp. All grasps are expressed in the object's frame of reference.

Furthermore, we assume that the hand joint configuration q leaves a small gap between the fingers and the object, as the gripper executes a close-hand primitive before attempting to actually lift the object. This assumption is reasonable to limit the impact of sensing uncertainties in the real world, as we avoid the need to generate precise configurations that actually touch the surface of the object.

IV. METHOD
Our model for generating 6D multi-finger grasps, inspired by [7], is visualized in Fig. 3. It consists of 6 different sub-modules: Shape Completion, Image Encoding, Multi-Label Grasp Type Classification, Grasp Generation, Discriminator, and Finger Refinement. All these modules are novel for robotic grasping except the "shape completion", which was used in [23] to shape-complete voxelized input point-clouds using a deep network, and the "image encoding", which is a pre-trained ResNet-50 [24]. In the following subsections we present each of the novel modules and their function. Finally, we present the loss functions used to train the complete generative grasp planning architecture end-to-end.
Fig. 3: The architecture of Multi-FinGAN.
A. Multi-Label Grasp Type Classification
The task of the Multi-Label Grasp Type Classification network is to classify which of the seven predefined grasp classes are feasible for a given object. For this purpose, the classification network is fed with the image representation of an object enc(I), where enc(·) is a ResNet-50 encoding, and produces as output a grasp type c.

As objects can often be associated with multiple correct grasp categories, we frame the problem as a multi-label classification task. As such, we use a sigmoid activation function at the output and the binary cross-entropy loss L_class to train the network. To later choose one grasp among all the possible ones, we threshold the output at 0.5 and randomly choose one grasp type c that is classified as valid for that object.
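To make this step concrete, the following is a minimal PyTorch sketch of a multi-label head with a sigmoid output, binary cross-entropy training loss, and the 0.5-threshold selection described above. The class name, hidden size, and feature dimension are illustrative assumptions, not the exact architecture used in the paper.

```python
import torch
import torch.nn as nn

NUM_GRASP_TYPES = 7  # grasp types the Barrett hand can attain (Sec. III)

class GraspTypeClassifier(nn.Module):
    """Multi-label head on top of the image encoding enc(I).
    Hidden size and depth are illustrative, not the paper's exact values."""
    def __init__(self, feat_dim=2048, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, NUM_GRASP_TYPES),
        )

    def forward(self, enc_I):
        # Raw logits; the sigmoid is applied inside the loss and at test time.
        return self.net(enc_I)

# Training: binary cross-entropy over the 7 grasp-type labels (L_class).
criterion = nn.BCEWithLogitsLoss()

def pick_grasp_type(logits):
    """Threshold sigmoid outputs at 0.5 and randomly pick one valid grasp type."""
    probs = torch.sigmoid(logits)
    valid = (probs > 0.5).nonzero(as_tuple=True)[0]
    if len(valid) == 0:
        # Fallback (assumption): take the most likely type if none passes the threshold.
        return int(probs.argmax())
    return int(valid[torch.randint(len(valid), (1,))])
```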
B. Grasp Generation

The grasp type c estimated by the classification network is associated with a coarse hand configuration q_c, i.e., the average joint angles. The task of the Grasp Generation module is to generate a first refinement of the hand configuration q_r = q_c + Δq along with a 6D hand pose p = {t, r}, where t is a translation and r is a normalized axis-angle rotation. Given the input point-cloud, we can estimate the center of the object t and have the network refine this translation as t* = t + Δt. Similarly, we represent the refinement of the hand's rotation as r* = r + Δr. At training time the input rotation r is set to the rotation of a ground-truth grasp with added zero-mean Gaussian noise, while at test time we sample uniform rotations and feed these to the network. All in all, the network is a fully-connected residual network which takes as input the initial hand configuration, the object center, a noisy rotation, and the image encoding {q_c, t, r, enc(I)} and produces individual refinements {Δq, Δt, Δr}.
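As an illustration of this refinement step, here is a hedged PyTorch sketch of a fully-connected residual network mapping {q_c, t, r, enc(I)} to the refinements {Δq, Δt, Δr}; the layer widths, depth, and class name are assumptions rather than the published architecture.

```python
import torch
import torch.nn as nn

class GraspGenerator(nn.Module):
    """Fully-connected residual network predicting (Δq, Δt, Δr) from
    {q_c, t, r, enc(I)}.  Layer sizes are illustrative."""
    def __init__(self, feat_dim=2048, n_joints=7, hidden=512):
        super().__init__()
        in_dim = feat_dim + n_joints + 3 + 3   # enc(I), q_c, t, axis-angle r
        self.fc_in = nn.Linear(in_dim, hidden)
        self.res = nn.Sequential(nn.ReLU(), nn.Linear(hidden, hidden),
                                 nn.ReLU(), nn.Linear(hidden, hidden))
        self.head = nn.Linear(hidden, n_joints + 3 + 3)  # Δq, Δt, Δr

    def forward(self, enc_I, q_c, t, r):
        h = self.fc_in(torch.cat([enc_I, q_c, t, r], dim=-1))
        h = h + self.res(h)                                  # residual connection
        dq, dt, dr = torch.split(self.head(h), [q_c.shape[-1], 3, 3], dim=-1)
        q_r = q_c + dq        # refined joint configuration
        t_star = t + dt       # refined translation around the object center
        r_star = r + dr       # refined axis-angle rotation
        return q_r, t_star, r_star
```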
C. Finger Refinement Layer

The Finger Refinement Layer is responsible for further refining the hand configuration q_r. To this end, we propose a novel fully-differentiable and parameter-free layer based on the forward kinematics equation of the Barrett hand. This layer takes as input the pose of the gripper p and the coarse gripper configuration q_r and produces an optimized gripper configuration q* = q_r + Δq* that is close to the surface of the object but not in collision with it.

We optimize each finger independently with respect to the estimated object mesh. We denote Δq*_j as the optimized position of joint j. This is calculated by rotating the articulated finger within its predefined physical limit θ_j until the distance δ_{θ_j} between the finger vertices V_i^{θ_j} and the object vertices O_k implies a contact between finger and object. Hand-object contact is parameterized by a threshold hyperparameter t_d, following

Δq*_j = argmin_{θ_j} { δ_{θ_j} + ε − q_{r,j} }  ∀ θ_j s.t. δ_{θ_j} < t_d,
δ_{θ_j} = min_i ( min_k ( ‖ V_i^{θ_j}, O_k ‖ ) ).     (1)

Note that we could have simply set q* = argmin_θ δ_θ for each joint, as was proposed in [7] for a human hand model. However, this would break backward differentiability, and instead we explicitly calculate Δq* and add it to q_r. The Barrett hand consists of three fingers, each made of two links (a proximal and a distal one). As such, (1) is solved for each joint, where we first rotate the proximal joints until contact and then proceed with the distal joints. The optimized gripper configuration is then updated according to q* = q_r + Δq*. To avoid interpenetration between the object and the gripper, we add an offset ε which we heuristically set to 0.5 cm in our experiments.
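The per-joint search in (1) can be sketched as follows. This is a simplified, hedged illustration: the forward-kinematics function, the angle grid, and the way the 0.5 cm offset ε is folded into the contact threshold are assumptions made for readability, not the paper's exact implementation.

```python
import torch

def refine_joint(q_r_j, finger_verts_fn, obj_verts, theta_grid, t_d=0.01, eps=0.005):
    """Sketch of the refinement of a single joint j (cf. Eq. (1)).

    q_r_j           : current, differentiable angle of joint j
    finger_verts_fn : forward kinematics of the gripper model; maps a joint
                      angle to the finger vertex positions (assumed given)
    obj_verts       : (K, 3) vertices of the estimated object mesh
    theta_grid      : candidate angles inside the joint's physical limits,
                      ordered from open towards closed
    t_d, eps        : contact threshold and interpenetration offset (~0.5 cm);
                      here eps is simply folded into the stopping distance
    """
    contact_theta = None
    for theta in theta_grid:
        verts = finger_verts_fn(theta)                    # (V, 3) finger vertices
        delta = torch.cdist(verts, obj_verts).min()       # delta_theta in Eq. (1)
        if delta < t_d + eps:                             # close enough: treat as contact
            contact_theta = theta
            break
    if contact_theta is None:                             # object not reachable: keep q_r_j
        return q_r_j
    # Compute the correction explicitly and add it, so gradients keep flowing
    # through q_r_j while the correction itself is treated as a constant.
    dq_star = (contact_theta - q_r_j).detach()
    return q_r_j + dq_star
```

In the full layer this sweep would be applied first to the proximal joints and then to the distal ones, and the resulting q* = q_r + Δq* is what the losses described below operate on.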
D. Discriminator network

Since the network does not have any supervision except for the classification task, we need to enforce that the generated grasps are realistic. To this end, we add a Wasserstein discriminator D [25] and train it with the gradient penalty [26]. More specifically, the objective to minimize using the grasp generation module G is

L_disc = E[ D(G(enc(I), q_c, t, r)) ] − E[ D(q̂, t̂, r̂) ],
L_gp = E[ ( ‖ ∇_{q̃, t̃, r̃} D(q̃, t̃, r̃) ‖₂ − 1 )² ],     (2)

where q̂, t̂, and r̂ are samples from the ground-truth data and q̃, t̃, and r̃ are linear interpolations between predictions and those ground-truth samples.
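For reference, a hedged sketch of the gradient-penalty term following the standard formulation of [26]; the function names and the flattened (q, t, r) input layout are assumptions.

```python
import torch

def gradient_penalty(D, real, fake):
    """Gradient penalty of [26] on linear interpolations between ground-truth
    and generated grasp parameters. `real` and `fake` are (B, d) tensors that
    concatenate the joint configuration q, translation t, and rotation r."""
    real, fake = real.detach(), fake.detach()
    alpha = torch.rand(real.size(0), 1, device=real.device)
    interp = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    d_out = D(interp)
    grads = torch.autograd.grad(outputs=d_out, inputs=interp,
                                grad_outputs=torch.ones_like(d_out),
                                create_graph=True)[0]
    return ((grads.norm(2, dim=-1) - 1.0) ** 2).mean()

def critic_loss(D, real, fake, w_gp=10.0):
    """L_disc of Eq. (2) plus the weighted gradient penalty; generated samples
    are detached so that only the critic is updated here."""
    fake = fake.detach()
    return D(fake).mean() - D(real).mean() + w_gp * gradient_penalty(D, real, fake)
```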
E. Complementary loss functions

While the discriminator loss L_disc helps in producing realistic-looking grasps, it alone is not sufficient to guide the learning problem. Therefore, we propose a set of complementary losses.

To ensure that the generated grasps are close to the object, we add a contact loss

L_cont = (1 / |V_cont|) Σ_{v ∈ V_cont} min_k ‖ v, O_k ‖,     (3)

where O_k are the object vertices and V_cont are vertices on the hand that are often in contact with the object in our ground-truth grasps. We calculate V_cont as the vertices that are closer than 5 mm to the object in at least 8% of the ground-truth grasps. These are mainly located on the fingertips and the palm of the hand.

For a grasp to be successful, the gripper should be rotated towards the object of interest. To promote such behaviour, we add a loss function that penalizes the gripper if its approach direction â is pointing away from the vector ô connecting the hand to the object's center:

L_orient = 1 − â⊤ô.     (4)

Finally, for a grasp to be successful it cannot interpenetrate the object. To encourage such behaviour, we add a loss that penalizes the distance between vertices V_i that are inside the object and the closest object vertex:

L_int = (1 / |V_i|) Σ_{v ∈ V_i} A_v min_k ‖ v, O_k ‖,     (5)

where A_v is the average area of the faces incident to the vertex that is inside the object. Since uniform mesh tessellation cannot usually be assumed in robotics, e.g., Fig. 4b, we add the term A_v to be robust to non-uniform tessellation.

Finally, the total loss is a linear combination of all the separate loss functions, L_tot = w_class L_class + w_disc L_disc + w_gp L_gp + w_cont L_cont + w_int L_int + w_orient L_orient, where each individual loss contribution is given a corresponding weight. The model is trained end-to-end.
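A hedged sketch of how these complementary losses could be computed on vertex sets with PyTorch; the selection of which hand vertices are contact vertices or lie inside the object is assumed to happen elsewhere, and the function names are illustrative.

```python
import torch

def contact_loss(contact_verts, obj_verts):
    """Eq. (3): mean distance from the hand's habitual contact vertices V_cont
    to their nearest object vertex."""
    d = torch.cdist(contact_verts, obj_verts)          # (|V_cont|, K)
    return d.min(dim=1).values.mean()

def orientation_loss(approach_dir, to_object_dir):
    """Eq. (4): 1 - cosine between the approach direction and the unit vector
    from the hand to the object's center (both assumed normalized)."""
    return 1.0 - torch.sum(approach_dir * to_object_dir, dim=-1).mean()

def interpenetration_loss(inside_verts, inside_areas, obj_verts):
    """Eq. (5): area-weighted mean distance of hand vertices inside the object
    to the closest object vertex."""
    if inside_verts.numel() == 0:                      # no interpenetration
        return obj_verts.new_zeros(())
    d = torch.cdist(inside_verts, obj_verts).min(dim=1).values
    return (inside_areas * d).mean()

# Loss weights as reported in Sec. IV-F; L_tot is their weighted sum.
WEIGHTS = dict(w_class=1., w_disc=1., w_gp=10., w_cont=100., w_int=100., w_orient=1.)
```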
F. Implementation details

The network was implemented in PyTorch 1.5.1. We use a pre-trained ResNet-50 [24] as the image encoder. The model was trained on 30 objects from the YCB object set, and for each of them we synthetically rendered 100 novel viewpoints, some examples of which are shown in Fig. 4a. The images are resized to 256x256. We trained our networks with a learning rate of · − and a batch size of 100. The weights of the loss functions were set to w_class = 1, w_disc = 1, w_gp = 10, w_cont = 100, w_int = 100, and w_orient = 1. The generator is trained once every 5 forward passes to improve the relative quality of the discriminator. We trained the networks for 800 epochs, where we linearly reduced the learning rate for the last 400 epochs.

Fig. 4: The color component of three of the synthetic RGB-D images used for training (a), and a grasp generated by our method (b).
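As a small illustration of the schedule, a PyTorch sketch of the linear learning-rate decay over the last 400 of the 800 epochs; the optimizer type and the wrapped parameter are placeholders, and the actual generator/discriminator update logic is omitted.

```python
import torch

params = [torch.nn.Parameter(torch.zeros(1))]      # placeholder parameters
optimizer = torch.optim.Adam(params)               # optimizer type is an assumption

# Constant learning rate for the first 400 epochs, then a linear decay towards zero.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda epoch: 1.0 if epoch < 400 else max(0.0, (800 - epoch) / 400.0))

for epoch in range(800):
    # ... one training epoch; the generator is updated only once every
    # 5 forward passes relative to the discriminator (Sec. IV-F) ...
    optimizer.step()
    scheduler.step()
```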
V. EXPERIMENTS AND RESULTS
The three main questions we wanted to answer in the experiments were:
1) Is Multi-FinGAN able to generate high quality grasps?
2) What are the contributions of the proposed loss functions?
3) Is our generative grasp sampler, which is purely trained on synthetic data, able to transfer to real objects?

In order to provide justified answers to these questions, we conducted three separate experiments. In the first experiment (Section V-B) we evaluate grasp quality and hand-object interpenetration in simulation. In the second experiment (Section V-C) we perform an ablation study over the proposed loss functions, and finally (Section V-D) we evaluate grasp success rate on real hardware.
A. Dataset
To train our model, we manually generate a dataset of grasps on the YCB objects [27] using a Barrett hand in GraspIt! [28]. As previously mentioned, the hand can only attain 7 of the 33 grasp types listed in [22] and we therefore categorize each grasp according to these. As a final step, we generate additional grasps around the symmetry axes of the objects. In total, this amounts to over 4000 labeled grasps.

B. Grasping in Simulation
In the simulated grasping experiment, we evaluate how good our method is at producing high quality grasps that are not interpenetrating the object. We test our model on two different datasets: 30 objects from the YCB object set [27] and 49 objects from the recent EGAD! dataset [29]. The YCB object set contains both object models we trained on and models that were held out during training; the EGAD! dataset contains completely novel objects. We benchmark against the simulated annealing planner in GraspIt! [30], which ran for 75000 steps to generate 360 grasp candidates on average. To evaluate the quality of a grasp, we used the ε-quality metric, which represents the radius of the largest 6D ball centered at the origin that can be enclosed by the convex hull of the wrench space [31]. With our method, we render 5 different viewpoints for each object and 110 grasps per viewpoint. Out of this pool of grasps, we report the average performance on the 10 top-scoring grasps according to the ε-metric. We also average the performance of the 10 top-scoring grasps found with the baseline method. In total this amounts to 790 grasps per method.

(The dataset will be made publicly available upon acceptance.)

TABLE I: Simulation experiment results. ↑: the higher the better; ↓: the lower the better. Rows: ε-quality ↑, interpenetration (cm³) ↓, grasp sampling time (sec.) ↓; columns: GraspIt! and Multi-FinGAN, each on YCB and EGAD!.

Fig. 5: Histograms showing all results obtained on both datasets by our approach and the baseline in terms of ε-quality and interpenetration (best viewed in color).

Table I and Fig. 5 show the simulation results, and Fig. 4b an example grasp using our method. To analyze the statistical differences between the methods we used a one-sided Wilcoxon signed-rank test. Based on these results we can draw several interesting conclusions. Multi-FinGAN is able to achieve statistically better results (α = 0.) in terms of quality. The histogram in Fig. 5 also confirms this result, showing that our data-driven grasp planning method is more consistent than the baseline at generating high quality grasps. However, in terms of interpenetration, GraspIt! shows a statistically significant improvement over our method (α = 0.). This result is most likely due to our method reaching a higher interpenetration on the YCB objects because of the presence of large objects which were not in the training set. EGAD!, on the other hand, contains objects scaled to the size of the hand, and our method achieves a low interpenetration on those. One thing to note, though, is that when looking at the interpenetration histogram in Fig. 5, the two methods do not perform radically differently. Finally, our method generates, on average, a grasp in around a second compared to the 30 seconds required by GraspIt!, which makes Multi-FinGAN 30 times faster than the baseline; a difference which is once again statistically significant (α = 0.).
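For completeness, a minimal example of running the one-sided Wilcoxon signed-rank test with SciPy on paired per-grasp quality scores; the numbers below are placeholders, not results from the paper.

```python
from scipy.stats import wilcoxon

# Paired epsilon-quality scores for the same objects/viewpoints (placeholder values).
multi_fingan_quality = [0.31, 0.28, 0.35, 0.30, 0.27, 0.33]
graspit_quality      = [0.22, 0.25, 0.24, 0.29, 0.21, 0.30]

# One-sided test: are Multi-FinGAN's paired scores significantly larger?
stat, p_value = wilcoxon(multi_fingan_quality, graspit_quality, alternative="greater")
print(f"W = {stat:.1f}, p = {p_value:.4f}")
```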
C. Ablation study

To further evaluate the impact the proposed loss functions have on performance, we conduct an ablation study where we train models with one of the following losses removed: the interpenetration loss L_int, the contact loss L_cont, the orientation loss L_orient, and the discriminator loss L_disc. We train each of these models as described in Section V-A and evaluate them on the EGAD! dataset by sampling 110 grasps from 5 different viewpoints and calculating the ε-quality and the interpenetration for the top 10 grasps.

TABLE II: Ablation study on EGAD! Rows: ε-quality ↑, interpenetration (cm³) ↓; columns: loss removed (none, L_int, L_cont, L_orient, L_disc).

Fig. 6: The objects used in the physical experiments.

The results are presented in Table II. As expected, a network trained with no interpenetration loss L_int often intersects the object, as this results in more contacts, but the final ε-quality of 0.77 is still not higher than the 0.86 achieved with the model trained with all the losses.

Another interesting observation is that the model with no contact loss L_cont still achieves a high ε-quality. Our hypothesis was that grasps generated with this model would not interpenetrate the object at all, as the model would have learned to translate the gripper far from the object, which would also have resulted in low quality. However, this was not the case, and one possible explanation is that the orientation and discriminator losses force the grasps to be realistic and oriented towards the object.

Removing the orientation loss L_orient barely impacts the quality of the grasps but does increase the interpenetration compared to the full model. However, this loss speeds up learning in the early stages of training, as it acts as an inductive bias forcing the hand to be oriented towards the object.

The network with no discriminator loss L_disc produces the lowest quality grasps. At the same time, it also produces gripper joint configurations that are physically infeasible.

Overall, all of the ablated models produce grasps with lower ε-quality and higher interpenetration compared to a model trained with all the losses. However, the grasp quality does not heavily deteriorate. This is probably due to the inherent power of the finger refinement layer, which will always refine the gripper's fingers close to the object if that is possible. Nevertheless, the ablation study shows that all the losses have different purposes and leaving one out affects the final performance of the model. Therefore, we use the model with all losses in our following sim-to-real experiment.

Fig. 7: Some examples of the grasps proposed by Multi-FinGAN on real objects. The upper row shows the unsegmented input image and the bottom rows show grasps on the shape-completed object. The grasp shown in the red box failed as it was in collision with the object.

TABLE III: Real hardware experiment results. Rows: grasp success rate (%) ↑, shape completion time (sec.) ↓, grasp evaluation time (sec.) ↓; columns: GraspIt! and Multi-FinGAN.

D. Sim-to-Real Grasp Transfer
To understand whether grasps generated with our generative grasp sampler, trained on synthetic data, transfer well to real objects, we conducted an experiment on a real Franka Emika Panda equipped with a Barrett hand. The goal was to grasp the objects shown in Fig. 6, which were chosen as they represent a high variability in both size and shape.

To capture the RGB-D image we used an Intel RealSense D435 camera looking at the scene from the side at a 45-degree viewpoint. For the extrinsic calibration of the camera we used an Aruco marker [32]. To provide our network with an RGB image of only the object, we segment it from the scene by subtracting the background and the table. To create a mesh of the segmented object, which is needed in both our method and the baseline, we used the shape-completion method detailed in [23]. For both methods we generated 20 grasps per object. We then calculated the interpenetration and quality metric of each grasp. The first physically reachable grasp with the lowest interpenetration and highest quality metric was executed on the real robot. To evaluate if a grasp was successful, the robot had to grasp the object and, without dropping it, move to the start position and rotate the hand ±90° around the last joint. If the object was dropped during the manipulation, the grasp was considered unsuccessful.

The result of this experiment is shown in Table III. Based on these numbers we can see that our method reaches a grasp success rate of 60% compared to the baseline's 40%, while being over 20 times faster.

An example of the input image fed to the network and a generated grasp is shown in Fig. 1. Although this image is not qualitatively as good as the training data visualized in Fig. 4a, the method was still able to generate high quality grasps on such objects, showing stable sim-to-real transfer. Additional grasps generated using our method are visualized in Fig. 7. Based on our experiments, our method never produced grasps that were too far from the object to be able to grasp it, but for larger objects, such as the right-most image in Fig. 7, the grasp was always in collision with the object, showing that our method has some problems with very large objects. Despite that limitation, the results still indicate stable sim-to-real grasp transfer.
VI. CONCLUSIONS AND FUTURE WORK
We presented Multi-FinGAN, a generative grasp sampling method that produces multi-fingered 6D grasps directly from an RGB-D image. The key insight was to reduce the search space by using a coarse-to-fine grasp generation method, where we first generate coarse grasps based on a grasp taxonomy which are subsequently refined using a fully differentiable forward kinematics layer. We compared our model to the well-known simulated annealing planner in GraspIt! both in simulation and on a real robot. The results showed that our model, trained on synthetic data, was significantly better than the baseline at generating higher quality grasps in simulation, and on real hardware it achieved a higher grasp success rate. At the same time, it was also 20–30 times faster than the baseline.

Despite the good results, there is still room for improvement. Although the classification into grasp types reduced the search space and eased the learning of the model, it requires a large dataset of labeled grasps which is time-consuming to gather. A more elegant solution would be to not classify grasps according to a taxonomy but instead regress directly to joint angles, allowing the model to train on other datasets such as the Columbia grasp database [33]. Another limitation is the computational time to evaluate the grasps, which accounts for more than half the time needed to generate grasps. This time could be reduced by training a critic to evaluate multi-finger grasps, but that is still an open problem.

In conclusion, the work presented here shows that generating 6D coarse-to-fine multi-fingered grasps is both fast and leads to good grasps. This, in turn, opens the door to using dexterous hands for feedback-based grasping, task informative grasping, and grasping in clutter.
REFERENCES
[1] C. Ferrari and J. F. Canny, "Planning optimal grasps," in ICRA, vol. 3, 1992, pp. 2290–2295.
[2] K. Kleeberger, R. Bormann, W. Kraus, and M. F. Huber, "A survey on learning-based robotic grasping," Current Robotics Reports, pp. 1–11, 2020.
[3] J. Varley, C. DeChant, A. Richardson, J. Ruales, and P. Allen, "Shape completion enabled robotic grasping," in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2017, pp. 2442–2447.
[4] J. Lundell, F. Verdoja, and V. Kyrki, "Robust grasp planning over uncertain shape completions," Macau, China: IEEE, Nov. 2019.
[5] J. Bohg, M. Johnson-Roberson, B. León, J. Felip, X. Gratal, N. Bergström, D. Kragic, and A. Morales, "Mind the gap - robotic grasping under incomplete observation," IEEE, 2011, pp. 686–693.
[6] M. Ciocarlie, C. Goldfeder, and P. Allen, "Dexterous grasping via eigengrasps: A low-dimensional approach to a high-complexity problem," in Robotics: Science and Systems Manipulation Workshop - Sensing and Adapting to the Real World, 2007.
[7] E. Corona, A. Pumarola, G. Alenyà, F. Moreno-Noguer, and G. Rogez, "GanHand: Predicting human grasp affordances in multi-object scenes," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 5031–5041.
[8] Y. Hasson, G. Varol, D. Tzionas, I. Kalevatykh, M. J. Black, I. Laptev, and C. Schmid, "Learning joint reconstruction of hands and manipulated objects," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 11807–11816.
[9] D. Morrison, P. Corke, and J. Leitner, "Closing the loop for robotic grasping: A real-time, generative grasp synthesis approach," arXiv preprint arXiv:1804.05172, 2018.
[10] A. Mousavian, C. Eppner, and D. Fox, "6-DOF GraspNet: Variational grasp generation for object manipulation," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 2901–2910.
[11] L. Pinto, M. Andrychowicz, P. Welinder, W. Zaremba, and P. Abbeel, "Asymmetric actor critic for image-based robot learning," arXiv preprint arXiv:1710.06542, 2017.
[12] U. Viereck, A. t. Pas, K. Saenko, and R. Platt, "Learning a visuomotor controller for real world robotic grasping using simulated depth images," arXiv preprint arXiv:1706.04652, 2017.
[13] J. Mahler, J. Liang, S. Niyaz, M. Laskey, R. Doan, X. Liu, J. A. Ojea, and K. Goldberg, "Dex-Net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics," arXiv preprint arXiv:1703.09312, 2017.
[14] V. Satish, J. Mahler, and K. Goldberg, "On-policy dataset synthesis for learning robot grasping policies using fully convolutional deep networks," IEEE Robotics and Automation Letters, vol. 4, no. 2, pp. 1357–1364, 2019.
[15] B. Wu, I. Akinola, A. Gupta, F. Xu, J. Varley, D. Watkins-Valls, and P. K. Allen, "Generative attention learning: a "general" framework for high-performance multi-fingered grasping in clutter," Autonomous Robots, pp. 1–20, 2020.
[16] M. Kokic, J. A. Stork, J. A. Haustein, and D. Kragic, "Affordance detection for task-specific grasping using deep learning," IEEE, 2017, pp. 91–98.
[17] J. Varley, J. Weisz, J. Weiss, and P. Allen, "Generating multi-fingered robotic grasps via deep learning," IEEE, 2015.
[18] Q. Lu, M. Van der Merwe, B. Sundaralingam, and T. Hermans, "Multi-fingered grasp planning via inference in deep neural networks," arXiv preprint arXiv:2001.09242, 2020.
[19] U. R. Aktas, C. Zhao, M. Kopicki, A. Leonardis, and J. L. Wyatt, "Deep dexterous grasping of novel objects from a single view," arXiv preprint arXiv:1908.04293, 2019.
[20] Q. Lu, M. Van der Merwe, and T. Hermans, "Multi-fingered active grasp learning," arXiv preprint arXiv:2006.05264, 2020.
[21] M. Liu, Z. Pan, K. Xu, K. Ganguly, and D. Manocha, "Generating grasp poses for a high-DOF gripper using neural networks," arXiv preprint arXiv:1903.00425, 2019.
[22] T. Feix, J. Romero, H.-B. Schmiedmayer, A. M. Dollar, and D. Kragic, "The GRASP taxonomy of human grasp types," IEEE Transactions on Human-Machine Systems, vol. 46, no. 1, pp. 66–77, 2015.
[23] J. Lundell, F. Verdoja, and V. Kyrki, "Beyond top-grasps through scene completion," IEEE, 2020, pp. 545–551.
[24] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[25] M. Arjovsky, S. Chintala, and L. Bottou, "Wasserstein generative adversarial networks," in Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 2017.
[26] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, "Improved training of Wasserstein GANs," in Advances in Neural Information Processing Systems, 2017, pp. 5767–5777.
[27] B. Calli, A. Singh, A. Walsman, S. Srinivasa, P. Abbeel, and A. M. Dollar, "The YCB object and model set: Towards common benchmarks for manipulation research," IEEE, 2015, pp. 510–517.
[28] A. T. Miller and P. K. Allen, "GraspIt! A versatile simulator for robotic grasping," IEEE Robotics & Automation Magazine, vol. 11, no. 4, pp. 110–122, 2004.
[29] D. Morrison, P. Corke, and J. Leitner, "EGAD! An evolved grasping analysis dataset for diversity and reproducibility in robotic manipulation," IEEE Robotics and Automation Letters, 2020.
[30] M. Ciocarlie, C. Goldfeder, and P. Allen, "Dimensionality reduction for hand-independent dexterous robotic grasping," IEEE, 2007, pp. 3270–3275.
[31] A. T. Miller and P. K. Allen, "Examples of 3D grasp quality computations," in Proceedings 1999 IEEE International Conference on Robotics and Automation, vol. 2, 1999, pp. 1240–1246.
[32] S. Garrido-Jurado, R. Muñoz-Salinas, F. J. Madrid-Cuevas, and M. J. Marín-Jiménez, "Automatic generation and detection of highly reliable fiducial markers under occlusion," Pattern Recognition, 2014.
[33] C. Goldfeder, M. Ciocarlie, H. Dang, and P. K. Allen, "The Columbia grasp database," in 2009 IEEE International Conference on Robotics and Automation, 2009.