Real-Time, Highly Accurate Robotic Grasp Detection using Fully Convolutional Neural Network with Rotation Ensemble Module
Dongwon Park, Yonghyeok Seo, Se Young Chun*
Abstract — Rotation invariance has been an important topic in computer vision tasks. Ideally, robot grasp detection should be rotation-invariant. However, rotation-invariance in robotic grasp detection has only recently been studied, by using rotation anchor boxes that are often time-consuming and unreliable for multiple objects. In this paper, we propose a rotation ensemble module (REM) for robotic grasp detection using convolutions that rotate network weights. Our proposed REM was able to outperform current state-of-the-art methods by achieving up to 99.2% (image-wise) and 98.6% (object-wise) accuracy on the Cornell dataset with real-time computation (50 frames per second). Our proposed method also yielded reliable grasps for multiple objects and up to a 93.8% success rate in a real-time robotic grasping task with a 4-axis robot arm and small novel objects, which was 11-56% higher than the baseline methods.
I. INTRODUCTION

Robot grasping of novel objects has been investigated extensively, but it is still a challenging open problem in robotics. Humans instantly identify multiple grasps of novel objects (perception), plan how to pick them up (planning) and actually grasp them reliably (control). However, accurate robotic grasp detection, trajectory planning and reliable execution are quite challenging for robots. As the first step, detecting robotic grasps accurately and quickly from imaging sensors is an important task for successful robotic grasping.

Deep learning has been widely utilized for robotic grasp detection from an RGB-D camera and has achieved significant improvements over conventional methods. Lenz et al. first proposed deep learning classifier based robotic grasp detection methods that achieved up to 73.9% (image-wise) and 75.6% (object-wise) grasp detection accuracy on their in-house Cornell dataset [16], [17]. However, the computation time per image was still slow (13.5 sec per image) due to sliding windows. Redmon and Angelova proposed deep learning regressor based grasp detection methods that yielded up to 88.0% (image-wise) and 87.1% (object-wise) accuracy with remarkably fast computation (76 ms per image) on the Cornell dataset [23]. Since then, many works have proposed deep neural network (DNN) based methods to improve the performance in terms of detection accuracy and computation time. Fig. 1 summarizes the computation time (frames per second) vs. grasp detection accuracy on the Cornell dataset with object-wise split for some previous works (Redmon [23], Kumra [15], Asif [1], Chu [3], Zhou [33], Zhang [31]) and our proposed method.
Dongwon Park, Yonghyeok Seo and Se Young Chun are with the Department of Electrical Engineering (EE), UNIST, Ulsan, 44919, Republic of Korea. *Corresponding author: [email protected]
Fig. 1: Performance summary of computation time (frames per second) vs. grasp detection accuracy on the Cornell dataset with object-wise data split.

Note that recent works (except for our proposed method) using state-of-the-art DNNs such as [1], [3], [33], [31] seem to show a trade-off between computation time and grasp detection accuracy. For example, Zhou a and b [33] were based on ResNet-101 and ResNet-50 [11], respectively, which trade off the number of network parameters against computation time. Note that prediction accuracy is generally related to real successful grasping, and computation time is potentially related to real-time applications for fast moving objects or stand-alone applications with limited power.

Rotation invariance has been an important topic in computer vision tasks such as face detection [27], texture classification [8] and character recognition [13], to name a few. The importance of rotation-invariant properties remains for recent DNN based approaches. In general, DNNs often require many more parameters and data augmentation with rotations to yield rotation-invariant outputs. Max pooling helps to alleviate this issue, but since the pooling window is usually small [12], it only helps for images rotated by very small angles. Recently, there have been some works on rotation-invariant neural networks such as rotating weights [4], [7], enlarged receptive fields using a dilated convolutional neural network (CNN) [29] or a pyramid pooling layer [10], rotation region proposals for recognizing arbitrarily placed texts [19] and a polar transform network to extract rotation-invariant features [6].

Ideally, robot grasp detection should be rotation-invariant. Rotation angle prediction in robot grasp detection has been done by regression of a continuous angle value [23], classification of discretized angles [9], [3] or a rotation anchor box that is a hybrid of regression and classification [32], [33], [31]. Previous works either did not consider rotation-invariance or attempted rotation-invariant detection by rotating images or feature maps, which is often time-consuming, especially for multiple objects.

In this paper, we propose a rotation ensemble module (REM) for robotic grasp detection using convolutions that rotate network weights. This special structure allows the DNN to select rotation convolutions for each grid cell. Our proposed REM was evaluated on two different tasks: robotic grasp detection on the Cornell dataset [16], [17] and real robotic grasping tasks with novel objects that were not used during training. Our proposed REM was able to outperform state-of-the-art methods such as [33] by achieving up to 99.2% (image-wise) and 98.6% (object-wise) accuracy on the Cornell dataset, as shown in Fig. 1, with faster computation than [33]. Our proposed method was also able to yield up to a 93.8% success rate for the real-time robotic grasping task with a 4-axis robot arm on novel objects and to yield reliable grasps for multiple objects, unlike the rotation anchor box.

II. RELATED WORKS
A. Spatial, rotational invariance
Max pooling layers often alleviate the issue of spatial variance in CNNs. To better achieve spatial-invariant image classification, Jaderberg et al. proposed the spatial transformer network (STN), a method of image (or feature) transformation that learns (affine) transformation parameters so that it can help improve the performance of inference in the following neural network layers [12]. Lin et al. proposed to use the STN repeatedly with an inverse compositional method by propagating warp parameters rather than images (or features) for improved performance [18]. Esteves et al. proposed a rotation-invariant network by replacing the grid generation of the STN with a polar transform [6]. The input feature map (or image) was transformed into polar coordinates with the origin determined by the center of mass. Cohen and Welling proposed a method using group equivariant convolutions and pooling with weight flips and four rotations [4]. Follmann et al. proposed to use rotation-invariant features created using rotational convolutions and pooling layers [7]. Marcos et al. proposed a network with a different set of weights for each local window instead of weight rotation [21].

B. Object detection
Faster R-CNN used a region proposal network to generate region proposals and reduce computation time [26]. YOLO was faster but less accurate than Faster R-CNN by directly predicting {x, y, w, h, class} without using a region proposal network [24]. YOLO9000 stabilized the loss of YOLO by using anchor boxes inspired by the region proposal network and yielded much faster object detection than Faster R-CNN with comparable accuracy [25]. For rotation-invariant object detection, Shi et al. investigated face detection using a progressive calibration network that predicted rotation by 180°, 90° or an angle in [-45°, 45°] after sliding windows [28]. Ma et al. used a rotation region proposal network to transform regions for classification using rotation region-of-interest (ROI) pooling [19]. Note that the rotation angle was predicted using 1) a rotation anchor box, 2) regression or 3) classification.

C. Robotic grasp detection
Deep learning based robot grasp detection methods generally belong to one of two types: two stage detectors (TSD) or one stage detectors (OSD). A TSD consists of a region proposal network and a detector [9], [3], [32], [33], [31]. After extracting feature maps using proposals from the network in the first stage, objects are detected in the second stage. The region proposal network of a TSD generally helps to improve accuracy, but it is often time-consuming due to feature map extractions. An OSD detects an object on each grid cell instead of generating region proposals, which reduces computation time at the cost of decreased prediction accuracy [23].

Lenz et al. proposed a TSD model that classifies object graspability using a sparse auto-encoder (SAE) with sliding windows for brute-force region proposals [17]. Redmon et al. developed a regression based OSD [23] using AlexNet [14]. Guo et al. applied a ZFNet [30] based TSD to robot grasping and formulated angle prediction as classification [9]. Chu et al. [3] further extended the TSD model of Guo by incorporating ResNet [11]. Zhou et al. also used ResNet for a TSD, but proposed a rotation anchor box [33]. Zhang et al. [32] extended the TSD method of Zhou [33] by additionally predicting objects using ROIs. Dex-Net 2.0 is also a TSD that predicts grasp candidates from a depth image and then selects the best one using its classifier, GQ-CNN [20].

III. METHOD
A. Problem setup and reparametrization
The goal of the problem is to predict 5D representations for multiple objects from a color image, where a 5D representation consists of location (x, y), rotation θ, width w and height h, as illustrated in Fig. 2. Multi-grasp detection often directly estimates the 5D representation {x, y, θ, w, h} as well as its probability (confidence) of being a class (or being graspable) z for each grid cell. In summary, the 5D representations with their probability are {x, y, θ, w, h, z}.

Fig. 2: (a) A 5D detection representation with location (x, y), rotation θ, gripper opening width w and plate size h. (b) For one grid cell, all parameters of the 5D representation are illustrated, including a pre-defined anchor box (black dotted box) and a 5D detection representation (red box).

For a TSD, region proposal networks generate potential candidates for {x, y, w, h} [3], [9], [33], [32], and a rotation region proposal network yields possible arbitrarily-oriented proposals {x, y, θ, w, h} [19]. Then, classification is performed on the proposals to yield their graspable probabilities z. The rotation region proposal network classifies rotation anchor boxes at a fixed angle stepsize and then regresses the angles.

For an OSD, a set of {x, y, θ, w, h, z} is directly estimated [23]. Inspired by YOLO9000 [25], we propose the following reparametrization of the 5D grasp representation and its probability for robotic grasp detection as {t_x, t_y, θ, t_w, t_h, t_z}, where x = σ(t_x) + c_x, y = σ(t_y) + c_y, w = p_w exp(t_w), h = p_h exp(t_h) and z = σ(t_z). Note that σ(·) is a sigmoid function, p_h, p_w are the predefined height and width of the anchor box, respectively, and (c_x, c_y) is the top left corner of each grid cell. Therefore, a DNN directly estimates {t_x, t_y, θ, t_w, t_h, t_z} instead of {x, y, θ, w, h, z}.
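For concreteness, the following PyTorch sketch decodes the raw network outputs {t_x, t_y, θ, t_w, t_h, t_z} into {x, y, θ, w, h, z} exactly as defined above; the tensor layout (batch, S, S, 6) and the function name are illustrative assumptions rather than our released code.

```python
import torch

def decode_grasp(t, p_w, p_h):
    """Decode raw outputs t = (t_x, t_y, theta, t_w, t_h, t_z), shaped (B, S, S, 6),
    into 5D grasp parameters plus graspability z (see the reparametrization above)."""
    B, S, _, _ = t.shape
    # (c_x, c_y): top-left corner of each grid cell.
    c_y, c_x = torch.meshgrid(torch.arange(S), torch.arange(S), indexing="ij")
    x = torch.sigmoid(t[..., 0]) + c_x        # sigmoid keeps (x, y) inside the cell
    y = torch.sigmoid(t[..., 1]) + c_y
    theta = t[..., 2]                         # angle head (regression variant)
    w = p_w * torch.exp(t[..., 3])            # width relative to the anchor box
    h = p_h * torch.exp(t[..., 4])
    z = torch.sigmoid(t[..., 5])              # graspability in (0, 1)
    return x, y, theta, w, h, z
```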
B. Parameter descriptions of the proposed OSD method

For S × S grid cells, the locations (c_x, c_y) ∈ {(c_x, c_y) | c_x, c_y ∈ {0, 1, ..., S − 1}} are defined as the top left corners of the grid cells. Thus, our proposed method estimates the (x, y) offset from the top left corner of each grid cell. For a given (c_x, c_y), the range of (x, y) is c_x < x < c_x + 1, c_y < y < c_y + 1 due to the reparametrization using sigmoid functions.

We also adopt the anchor box approach [25] for robotic grasp detection. The reparametrization changes the regression of w, h into regression & classification: classification is performed to pick the best representation among all anchor box candidates that are generated using the estimated t_w, t_h and a predefined set of (p_w, p_h) values, either several pairs with various aspect ratios or a single pair with a 1:1 ratio.

We investigated three prediction methods for the rotation θ. First, a regressor predicts a continuous angle θ. Second, a classifier predicts θ from a set of discretized angles. Last, an anchor box approach with a regressor & classifier predicts both a discrete anchor angle θ_a and a residual angle θ_r to yield θ = θ_a + θ_r.

Predicting the detection (grasp) probability is crucial for multibox approaches such as MultiGrasp [23]. The conventional ground truth for the detection probability was 1 (graspable) or 0 (not graspable) [23]. Inspired by [25], we propose to use the IOU (Intersection Over Union) as the ground truth detection probability, z_g = |P ∩ G| / |P ∪ G|, where P is the predicted detection rectangle, G is the ground truth detection rectangle, and |·| is the area of a rectangle.
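A minimal sketch of this IOU-based ground-truth probability for axis-aligned rectangles is shown below; the helper name is ours, and a rotated ground-truth rectangle would need a polygon intersection instead of the axis-aligned overlap used here.

```python
def iou(rect_p, rect_g):
    """IOU of two axis-aligned rectangles (x_min, y_min, x_max, y_max).
    Used as the ground-truth graspability z_g instead of a hard 0/1 label."""
    ix_min = max(rect_p[0], rect_g[0])
    iy_min = max(rect_p[1], rect_g[1])
    ix_max = min(rect_p[2], rect_g[2])
    iy_max = min(rect_p[3], rect_g[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_p = (rect_p[2] - rect_p[0]) * (rect_p[3] - rect_p[1])
    area_g = (rect_g[2] - rect_g[0]) * (rect_g[3] - rect_g[1])
    union = area_p + area_g - inter
    return inter / union if union > 0 else 0.0
```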
C. Rotation ensemble module (REM)

We propose a rotation ensemble module (REM) with rotation convolution and rotation activation to determine an ensemble weight associated with the angle class probability for each grid cell. We added our REM to the latter part of a robot grasp detection network, since it is often effective to put geometric transform related layers in the latter part of a network, as with deformable convolutions [5]. A typical location for the REM in a DNN is illustrated in Fig. 3 (a).

Fig. 3: An illustration of incorporating our proposed REM in a DNN for robot grasp detection (a) and the architecture of our proposed REM with rotation convolutions (b).

Consider a typical scenario of convolution with input feature maps f ∈ R^{H × W × C}, where N = H × W is the number of pixels and C is the number of channels. Let us denote by g_l ∈ R^{K × K × C}, l = 1, ..., n_f, a convolution kernel, where K × K is the spatial dimension of the kernel and there are n_f kernels in each channel. Similar to group convolutions [4], we propose n_r rotations of the weights to obtain n_f · n_r rotated weights for each channel. Bilinear interpolation of four adjacent pixel values is used to generate the rotated kernels. The rotation matrix is

R(r) = [ cos(rπ/4)   −sin(rπ/4)   0
         sin(rπ/4)    cos(rπ/4)   0
         0            0           1 ]

where r is an index for rotations. Then, the rotated weights (or kernels) are g_il = R(i) g_l, i = 0, ..., n_r − 1, l = 1, ..., n_f. Finally, the output of these convolutional layers with rotation operators for the input f is d_il = g_il ⋆ f, i = 0, ..., n_r − 1, l = 1, ..., n_f, where ⋆ is the convolution operator. This pipeline of operations is called "rotation convolution". A typical kernel size is K = 5.

Our REM contains a rotation activation that aggregates all feature maps at different angles. Assume that an intermediate output for {t_x, t_y, θ, t_w, t_h, t_z}, called {t_x^m, t_y^m, θ^m, t_w^m, t_h^m, t_z^m}, is available in the REM. Note that θ_i^m ∈ R^{H × W}, where i indexes the angles 0, π/4, π/2, 3π/4. For each angle, activations are generated and all of them must be aggregated to yield one final feature map

d̂_l = Σ_i d_il ⊙ θ_i^m,

where ⊙ is the Hadamard product. Thus, our proposed method utilizes the class probability (probability to grasp) to selectively aggregate activations along with the weight of the angle classification.

Although the intermediate output is only partially used for the rotation activation in the REM, it still contains valuable, compressed information about the final output - it can be a good initial bounding box. Thus, we designed our REM to decompress it and concatenate it at the end of the REM, as illustrated in Fig. 3 (b). This pipeline delivers information about {t_x^m, t_y^m, θ^m, t_w^m, t_h^m, t_z^m} indirectly to the final layer, and this structure seemed to decrease probability errors.
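The PyTorch sketch below illustrates rotation convolution and rotation activation with n_r = 4 orientations at π/4 steps; the module name, the use of affine_grid/grid_sample for the bilinear kernel rotation and the way the angle-class probabilities are passed in are illustrative choices, not the exact implementation used in our experiments.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class RotationConv(nn.Module):
    """Sketch of REM-style rotation convolution + rotation activation.
    One learned kernel bank is resampled at n_r orientations and the resulting
    responses are merged with per-cell angle-class probabilities."""

    def __init__(self, in_ch, out_ch, k=5, n_r=4):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.01)
        self.k, self.n_r = k, n_r

    def rotated_weights(self, i):
        # Bilinear resampling of the kernel bank at angle i * pi / 4 (cf. R(r) above).
        a = i * math.pi / 4
        theta = torch.tensor([[math.cos(a), -math.sin(a), 0.0],
                              [math.sin(a),  math.cos(a), 0.0]])
        theta = theta.unsqueeze(0).repeat(self.weight.shape[0], 1, 1).to(self.weight)
        grid = F.affine_grid(theta, self.weight.shape, align_corners=False)
        return F.grid_sample(self.weight, grid, mode="bilinear", align_corners=False)

    def forward(self, f, angle_prob):
        # f: (B, C, H, W) feature map; angle_prob: (B, n_r, H, W) angle-class
        # probabilities taken from the intermediate grasp prediction.
        outs = [F.conv2d(f, self.rotated_weights(i), padding=self.k // 2)
                for i in range(self.n_r)]                 # rotated responses d_il
        # Rotation activation: weight each rotated response by its angle probability
        # and sum over orientations (Hadamard product in the equation above).
        return sum(o * angle_prob[:, i:i + 1] for i, o in enumerate(outs))
```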
D. Loss function for REM-equipped DNN

We re-designed the loss function for training robotic grasp detection DNNs to emphasize this additional REM. The output of the DNN, (t_x, t_y, θ, t_w, t_h, t_z), and the intermediate output of the REM, {t_x^m, t_y^m, θ^m, t_w^m, t_h^m, t_z^m}, should be converted into (x, y, θ, w, h, z) and {x_m, y_m, θ_m, w_m, h_m, z_m}, respectively. Then, using the ground truth (x_g, y_g, θ_g, w_g, h_g, z_g), the loss function is defined as

λ_cd (‖m ⊙ (x − x_g)‖ + ‖m ⊙ (y − y_g)‖) + λ_cd (‖m ⊙ (w − w_g)‖ + ‖m ⊙ (h − h_g)‖) + λ_pr ‖m ⊙ (z − z_g)‖ + λ_ag AngLoss(θ_g, θ; m)
+ λ_cd (‖m ⊙ (x_m − x_g)‖ + ‖m ⊙ (y_m − y_g)‖) + λ_cd (‖m ⊙ (w_m − w_g)‖ + ‖m ⊙ (h_m − h_g)‖) + λ_pr ‖m ⊙ (z_m − z_g)‖ + λ_ag AngLoss(m ⊙ θ_g, m ⊙ θ_m),

where m is a mask vector with value 1 (a ground truth exists for that grid cell) or 0 (no ground truth for that grid cell), ‖·‖ is the ℓ2 norm, CE is cross entropy, and AngLoss is one of the following: CE for classification on θ, or the ℓ2 norm for regression or the rotation anchor box on θ. We chose λ_cd = λ_ag = 1 and λ_pr = 5.
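A compact sketch of this masked loss (with the classification variant of AngLoss and a squared-error form of the ℓ2 terms) is shown below; the dictionary-based interface and the summed reduction are illustrative assumptions rather than the exact training code.

```python
import torch
import torch.nn.functional as F

def rem_loss(pred, inter, gt, mask, lam_cd=1.0, lam_ag=1.0, lam_pr=5.0):
    """Masked grasp loss applied to both the final output (pred) and the
    intermediate REM output (inter). pred/inter/gt are dicts with per-cell
    values under keys 'x', 'y', 'w', 'h', 'z' and 'theta'; mask is 1 for
    cells holding a ground-truth grasp and 0 otherwise."""
    def coord(p):   # coordinate terms weighted by lambda_cd
        return sum(((mask * (p[k] - gt[k])) ** 2).sum() for k in ("x", "y", "w", "h"))
    def prob(p):    # graspability term weighted by lambda_pr
        return ((mask * (p["z"] - gt["z"])) ** 2).sum()
    def ang(p):     # AngLoss, here cross entropy over discretized angle classes
        ce = F.cross_entropy(p["theta"], gt["theta"], reduction="none")
        return (mask * ce).sum()
    loss = 0.0
    for p in (pred, inter):
        loss = loss + lam_cd * coord(p) + lam_pr * prob(p) + lam_ag * ang(p)
    return loss
```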
IV. SIMULATIONS AND EXPERIMENTS

We evaluated our proposed REM methods on the Cornell robotic grasp dataset [16], [17] and on real robot grasping tasks with novel objects. The effectiveness of our REM was demonstrated in prediction accuracy, computation time and grasping success rate. Our proposed methods were compared with previous methods such as [17], [23], [9], [3], [33], [32], based both on the literature for the widely used Cornell dataset and on our in-house implementations of some previous works.
A. Implementation details
It is challenging to fairly compare a robot grasp detection method with previous works such as [17], [23], [9], [3], [33], [32]. Thanks to the Cornell dataset, most works have been able to compare their results with those of previous methods reported in the literature. However, considering the fast advances in computing power and DNN techniques, it is often not clear how much a proposed scheme or method actually contributed to the increase in performance.

In this paper, we not only compared our REM methods with previous works on the Cornell dataset through the literature, but also implemented the core angle prediction schemes of other previous works with modern DNNs: regression (Reg) as proposed by Redmon et al. [23], classification (Cls) as proposed by Guo et al. [9] and rotation anchor box (Rot) as proposed by Zhou et al. [33]. While Redmon [23], Guo [9] and Zhou [33] used AlexNet [14], ZFNet [30] and ResNet [11], respectively, our in-house implementations, Reg, Cls and Rot, all used DarkNet-19 [22]. While Guo and Zhou were based on Faster R-CNN (TSD) [26], our implementations were based on YOLO9000 (OSD) [25].

We performed ablation studies for our REM so that it becomes clear which part most significantly affects the performance of rotated grasp detection. We placed our proposed REM at the 6th layer from the end of the detection network. We also performed simulations with rotation activation using angle and probability. For multiple robotic grasp detection, boxes were plotted when their probabilities were 0.25 or higher. All algorithms were tested on a platform with a GPU (NVIDIA 1080Ti), a CPU (Intel i7-7700K 4.20GHz) and 32GB memory. Our REM methods and the other in-house DNNs, Reg, Cls and Rot, were implemented with PyTorch.
B. Benchmark dataset and novel objects
The Cornell robot grasp detection dataset [16], [17] consists of images (RGB color and depth) of 240 different objects, as shown in Fig. 4a, with ground truth labels of a few graspable rectangles and a few non-graspable rectangles per image. We used RG-D information without the B channel, just like the work of Redmon [23]. Each image was cropped to a fixed size and five-fold cross validation was performed. Then, mean prediction accuracy was reported for image-wise and object-wise splits. The image-wise split divides the Cornell dataset into training and testing data with a 4:1 ratio randomly, without considering whether the same object appears in both. The object-wise split divides the training and testing data with a 4:1 ratio such that the two sets do not contain the same object. We followed previous works for the accuracy metrics [17], [23], [15]. A successful grasp detection is defined as follows: if the IOU (Jaccard index) is larger than a certain threshold (e.g., 0.25, 0.3 or 0.35) and the difference between the output orientation θ and the ground truth orientation θ_g is less than 30°, then it is considered a successful grasp detection; a short sketch of this check is given at the end of this subsection.

Fig. 4: (a) Images from the Cornell dataset and (b) novel objects for the real robot grasping task.

Fig. 5: (Left) Robot experiment setup with a top-mounted RGB-D camera and a small 4-axis robot arm. (Right) Dimensional information on our robot gripper and an object.

We also performed real grasping tasks with our REM methods on 8 novel objects, as shown in Fig. 4b (toothbrush, candy, earphone cap, cable, styrofoam bowl, L-wrench, nipper, pencil). Our proposed methods were applied to a small 4-axis robot arm (Dobot Magician, China) and an RGB-D camera (Intel RealSense D435, USA) that has a field of view covering the robot and its workspace from the top. If the robot can pick and place an object, it is counted as a success. Our robot experiment setup is illustrated in Fig. 5.
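The success criterion above can be checked with a few lines of code; the sketch below uses shapely for the rotated-rectangle IOU, which is an illustrative choice (not a dependency stated in this paper), and assumes angles in degrees.

```python
import numpy as np
from shapely.geometry import Polygon

def rect_corners(x, y, theta, w, h):
    """Corner points of a rotated grasp rectangle; theta is in degrees."""
    t = np.deg2rad(theta)
    dx = np.array([np.cos(t), np.sin(t)])     # direction along the width w
    dy = np.array([-np.sin(t), np.cos(t)])    # direction along the height h
    c = np.array([x, y])
    return [tuple(c + 0.5 * (sx * w * dx + sy * h * dy))
            for sx, sy in ((-1, -1), (1, -1), (1, 1), (-1, 1))]

def is_success(pred, gt, iou_thr=0.25, ang_thr=30.0):
    """Jaccard-style check: IOU above the threshold and angle error below 30 degrees.
    pred and gt are (x, y, theta, w, h) tuples; the angle is compared modulo 180."""
    p, g = Polygon(rect_corners(*pred)), Polygon(rect_corners(*gt))
    iou = p.intersection(g).area / p.union(g).area
    d = abs(pred[2] - gt[2]) % 180.0
    return iou > iou_thr and min(d, 180.0 - d) < ang_thr
```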
C. Results for in-house implementations of previous works

Table I shows the results of ablation studies of our in-house implementations on the Cornell dataset for the anchor box of w and h with various ratios (N) vs. one ratio of 1:1 (1), and for the angle prediction methods: regression (Reg) vs. classification (Cls) vs. rotation anchor box (Rot).

TABLE I: Ablation studies on the Cornell dataset for the anchor box of w, h with various ratios or one ratio and angle prediction methods Reg, Cls, Rot.

Anchor Box | Angle Prediction | Image-wise 25% | Image-wise 35% | Object-wise 25% | Object-wise 35%
N | Reg | 91.0 | 86.5 | 88.7 | 85.6
1 | Reg | 91.8 | 87.7 | 89.2 | 86.3
N | Cls | 97.2 | 93.1 | 96.1 | 93.1
Fig. 6: Grasp detection accuracy over epochs on the Cornell dataset using various methods for angle prediction. Rot: rotation anchor box, Cls: classification, Reg: regression, REM: ours.

TABLE II: Ablation studies on the Cornell dataset for our REM with RC, RA and RL.
Angle | RC | RA | RL | Image-wise 25% | Image-wise 35% | Object-wise 25% | Object-wise 35%
Cls | - | - | - | 97.3 | 94.1 | 96.6 | 92.9
Cls | O | - | - | 97.6 | 94.1 | 97.3 | 92.7
Cls | O | O | - | 99.2 | 95.3 | 98.6 | 95.5
Cls | O | O | O | 98.6 | 94.9 | 97.3 | 94.1
Reg | O | O | - | 89.3 | 84.0 | 88.3 | 84.5
Rot | O | O | - | 98.5 | 95.6 | 98.0 | 94.0

The results in Table I show that using the single 1:1 ratio (1) yields better accuracy than using a variety of anchor boxes (N). Among the angle prediction methods, the rotation anchor box yielded the best performance while regression yielded the lowest, which is consistent with the literature. Thus, our in-house implementations seem to yield better accuracy than the original works, possibly due to the modern DNNs used in our implementations: Reg - Redmon et al. [23], Cls - Guo et al. [9] and Rot - Zhou et al. [33].

Fig. 6 shows the results of the different angle prediction methods at IOU 25% over training epochs. We observed that Rot initially increased in accuracy more slowly over epochs than Cls, and that Reg showed a slow increase in accuracy overall. These slow initial convergences of Reg and Rot may not be desirable for re-training on additional data.
D. Results for our proposed REM on the Cornell dataset
Table II shows the results of the ablation studies for our proposed REM with its different components: rotation convolution (RC) and rotation activation (RA). RA can be obtained by using the rotation activation loss (RL), as shown in Fig. 3. We observed that RC by itself did not improve the performance, while RC & RA significantly improved the accuracy. Comparable performance was observed when using RC & RA with Rot, but substantially lower performance was obtained with Reg.

Table III summarizes the evaluation results on the Cornell robotic grasp dataset for previous works and our proposed methods.

TABLE III: Performance summary on the Cornell dataset. Our proposed method yielded state-of-the-art prediction accuracy in both image-wise (Img) and object-wise (Obj) splits with real-time computation. The unit for performance is %.
Method | Angle | Type | Img 25% | Obj 25% | Speed (FPS)
Lenz [17], SAE | Cls | TSD | 73.9 | 75.6 | 0.08
Redmon [23], AlexNet | Reg | OSD | 88.0 | 87.1 | 13.2
Kumra [15], ResNet-50 | Reg | TSD | 89.2 | 88.9 | 16
Asif [2] | Reg | OSD | 90.2 | 90.6 | 41
Guo [9] | Cls | TSD | | |
Our REM, DarkNet-19 | Cls | OSD | 99.2 | 98.6 | 50

Fig. 7: Grasp detection results on the Cornell dataset for (a) Reg, a modern version of Redmon [23], (b) Cls, a modern version of Guo [9], (c) Rot, a modern version of Zhou [33] and (d) our proposed Cls+REM. (e) Ground truth labels in the Cornell dataset. Black boxes are grasp candidates and green-red boxes are the best grasp among them.

Our proposed method yielded state-of-the-art performance, with up to 99.2% prediction accuracy for the image-wise split and up to 98.6% for the object-wise split, over the reported accuracies of the previous works listed in the table. Our proposed method yielded this state-of-the-art performance with real-time computation at 50 frames per second (FPS). Note that AlexNet, DarkNet-19, ResNet-50 and ResNet-101 require 61.1, 20.8, 25.6 and 44.5 MB of parameters, respectively. Thus, our REM method achieved state-of-the-art results with a relatively small DNN (20.8 MB) compared to other recent works using large DNNs such as ResNet-101 (44.5 MB). Fig. 7 illustrates grasp detection results on the Cornell dataset. Our proposed Cls+REM yielded grasp candidates that were close to the ground truth compared to other previous methods such as Reg and Cls.

Fig. 8: Grasp detection results (cropped) on multiple novel objects including a nipper using (a) Reg, (b) Cls, (c) Rot and (d) ours (Cls + REM). Black boxes are grasp candidates and green-red boxes are the best grasp among them.

Fig. 9: Multiple robotic grasp detection results on several novel objects for (a) Reg, (b) Cls, (c) Rot and (d) our proposed Cls+REM. Black boxes are grasp candidates and green-red boxes are the best grasp among them.

TABLE IV: Performance summary of real robotic grasping tasks for 8 novel, small objects with 8 repetitions.
Object | Reg | Cls | Ours
toothbrush | 5 / 8 | |
candy | 0 / 8 | 6 / 8 |
earphone cap | 5 / 8 | 7 / 8 |
cable | 3 / 8 | 6 / 8 |
styrofoam bowl | 3 / 8 | |
L-wrench | 5 / 8 | 6 / 8 |
nipper | 0 / 8 | 5 / 8 |
pencil | 3 / 8 | |
Average | 3 / 8 | 6.6 / 8 |
E. Results for real grasping tasks on novel objects
We applied all grasp detection methods trained on the Cornell dataset to real grasping tasks with novel (multiple) objects without re-training. Fig. 8 illustrates our robot grasp experiment with novel objects, including a nipper, using our algorithm implementations. Multi-object, multi-grasp detection results on novel objects are shown in Fig. 9 for the Reg, Cls, Rot and our Cls+REM methods, respectively. Both Cls and our Cls+REM generated better grasp candidates than Reg and Rot. Our REM seems to detect more reliable grasps and angles (e.g., pencil, L-wrench) than Rot. Real grasping task results with our 4-axis robot arm are tabulated in Table IV. Possibly due to reliable angle detection, our proposed Cls+REM yielded a 93.8% grasping success rate, which is 11% higher than Cls. We did not perform real grasping with Rot, a modern version of Zhou [33], due to its unreliable angle predictions for multiple objects. However, the advantage of our Cls+REM over Rot seems clear in terms of detection accuracy, fast computation and reliable angle predictions for multiple objects.

V. CONCLUSION

We proposed the REM for robotic grasp detection, which was able to outperform state-of-the-art methods by achieving up to 99.2% (image-wise) and 98.6% (object-wise) accuracy on the Cornell dataset with fast computation (50 FPS) and reliable grasps for multiple objects. Our proposed method was able to yield up to a 93.8% success rate for the real-time robotic grasping task with a 4-axis robot arm for small novel objects, which was higher than the baseline methods by 11-56%.
ACKNOWLEDGMENTS
This work was supported partly by the Technology Innovation Program or Industrial Strategic Technology Development Program (10077533, Development of robotic manipulation algorithm for grasping/assembling with the machine learning using visual and tactile sensing information) funded by the Ministry of Trade, Industry & Energy (MOTIE, Korea) and partly by a grant of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (grant number: HI18C0316).
REFERENCES

[1] U. Asif, M. Bennamoun, and F. A. Sohel. RGB-D Object Recognition and Grasp Detection Using Hierarchical Cascaded Forests. IEEE Transactions on Robotics, 33(3):547-564, May 2017.
[2] U. Asif, J. Tang, and S. Harrer. GraspNet: An Efficient Convolutional Neural Network for Real-time Grasp Detection for Low-powered Devices. In International Joint Conference on Artificial Intelligence (IJCAI), pages 4875-4882, 2018.
[3] F.-J. Chu, R. Xu, and P. A. Vela. Real-World Multiobject, Multigrasp Detection. IEEE Robotics and Automation Letters, 3(4):3355-3362, Oct. 2018.
[4] T. Cohen and M. Welling. Group equivariant convolutional networks. In International Conference on Machine Learning (ICML), pages 2990-2999, 2016.
[5] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable convolutional networks. In IEEE International Conference on Computer Vision (ICCV), pages 764-773, 2017.
[6] C. Esteves, C. Allen-Blanchette, X. Zhou, and K. Daniilidis. Polar transformer networks. In International Conference on Learning Representations (ICLR), 2018.
[7] P. Follmann and T. Bottger. A rotationally-invariant convolution module by feature map back-rotation. In IEEE Winter Conference on Applications of Computer Vision (WACV), pages 784-792, 2018.
[8] H. Greenspan, S. Belongie, R. Goodman, P. Perona, S. Rakshit, and C. H. Anderson. Overcomplete steerable pyramid filters and rotation invariance. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 222-228, 1994.
[9] D. Guo, F. Sun, H. Liu, T. Kong, B. Fang, and N. Xi. A hybrid deep architecture for robotic grasp detection. In IEEE International Conference on Robotics and Automation (ICRA), pages 1609-1614, 2017.
[10] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In European Conference on Computer Vision (ECCV), pages 346-361, 2014.
[11] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770-778, 2016.
[12] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. In Advances in Neural Information Processing Systems 28, pages 2017-2025, 2015.
[13] W.-Y. Kim and P. Yuan. A practical pattern recognition system for translation, scale and rotation invariance. Pages 391-396. IEEE, 1994.
[14] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1097-1105, 2012.
[15] S. Kumra and C. Kanan. Robotic grasp detection using deep convolutional neural networks. In IEEE International Conference on Intelligent Robots and Systems (IROS), pages 769-776, 2017.
[16] I. Lenz, H. Lee, and A. Saxena. Deep Learning for Detecting Robotic Grasps. In Robotics: Science and Systems, page P12, June 2013.
[17] I. Lenz, H. Lee, and A. Saxena. Deep learning for detecting robotic grasps. The International Journal of Robotics Research, 34(4-5):705-724, Apr. 2015.
[18] C.-H. Lin and S. Lucey. Inverse compositional spatial transformer networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[19] J. Ma, W. Shao, H. Ye, L. Wang, H. Wang, Y. Zheng, and X. Xue. Arbitrary-oriented scene text detection via rotation proposals. IEEE Transactions on Multimedia, 2018.
[20] J. Mahler, J. Liang, S. Niyaz, M. Laskey, R. Doan, X. Liu, J. A. Ojea, and K. Goldberg. Dex-Net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics. arXiv preprint arXiv:1703.09312, 2017.
[21] D. Marcos, M. Volpi, N. Komodakis, and D. Tuia. Rotation equivariant vector field networks. In IEEE International Conference on Computer Vision (ICCV), pages 5058-5067, 2017.
[22] J. Redmon. Darknet: Open source neural networks in C. http://pjreddie.com/darknet/, 2013-2016.
[23] J. Redmon and A. Angelova. Real-time grasp detection using convolutional neural networks. In IEEE International Conference on Robotics and Automation (ICRA), pages 1316-1322, 2015.
[24] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You Only Look Once: Unified, Real-Time Object Detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 779-788, 2016.
[25] J. Redmon and A. Farhadi. YOLO9000: Better, Faster, Stronger. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6517-6525, 2017.
[26] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NIPS), pages 91-99, 2015.
[27] H. A. Rowley, S. Baluja, and T. Kanade. Rotation invariant neural network-based face detection. Technical report, Carnegie Mellon University, School of Computer Science, 1997.
[28] X. Shi, S. Shan, M. Kan, S. Wu, and X. Chen. Real-time rotation-invariant face detection with progressive calibration networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2295-2303, 2018.
[29] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. In International Conference on Learning Representations (ICLR), 2016.
[30] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision (ECCV), pages 818-833, 2014.
[31] H. Zhang, X. Lan, L. Wan, C. Yang, X. Zhou, and N. Zheng. RPRG: Toward real-time robotic perception, reasoning and grasping with one multi-task convolutional neural network. arXiv preprint arXiv:1809.07081, 2018.
[32] H. Zhang, X. Lan, X. Zhou, and N. Zheng. ROI-based robotic grasp detection in object overlapping scenes using convolutional neural network. arXiv preprint arXiv:1808.10313, 2018.
[33] X. Zhou, X. Lan, H. Zhang, Z. Tian, Y. Zhang, and N. Zheng. Fully convolutional grasp detection network with oriented anchor box. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2018.