Falling Things: A Synthetic Dataset for 3D Object Detection and Pose Estimation
Jonathan Tremblay, NVIDIA, [email protected]
Thang To, NVIDIA, [email protected]
Stan Birchfield, NVIDIA, [email protected]
Abstract
We present a new dataset, called Falling Things (FAT), for advancing the state-of-the-art in object detection and 3D pose estimation in the context of robotics. By synthetically combining object models and backgrounds of complex composition and high graphical quality, we are able to generate photorealistic images with accurate 3D pose annotations for all objects in all images. Our dataset contains 60k annotated photos of 21 household objects taken from the YCB dataset [2]. For each image, we provide the 3D poses, per-pixel class segmentation, and 2D/3D bounding box coordinates for all objects. To facilitate testing different input modalities, we provide mono and stereo RGB images, along with registered dense depth images. We describe in detail the generation process and statistical analysis of the data.
1. Introduction
Robotic manipulation of household objects in everyday environments requires accurate detection and pose estimation of multiple object categories. This presents a two-fold challenge for developing robotic perception algorithms. First, acquiring ground truth data is time-consuming, error-prone, and potentially expensive (depending upon the technique used). This limits the evaluation of algorithms, particularly with respect to new object categories or environmental conditions. Secondly, existing techniques for acquiring real-world data [16, 5, 15] do not scale. As a result, they are not capable of generating the large datasets that are needed for training deep neural networks.

We propose to overcome such limitations by using synthetically generated data. Synthetic data has been gaining traction in recent years as an efficient means of both training and evaluating DNNs for computer vision problems for which collecting ground truth data is laborious, e.g., stereo [17], optical/scene flow [10], or semantic segmentation [4]. The dataset can be downloaded from http://research.nvidia.com/publication/2018-06_Falling-Things
Figure 1. The Falling Things (FAT) dataset was generated by placing 3D household object models (e.g., mustard bottle, soup can, gelatin box, etc.) in virtual environments. Each snapshot consists of a stereo pair of RGB images (only one of which is shown, top), pixelwise segmentation of the objects (bottom left), depth (bottom center), 2D/3D bounding box coordinates (bottom right), and 3D poses of all objects (not shown).
We believe that 3D detection and pose estimation fall within this category and are thus a natural fit for synthetic data due to the difficulty of acquiring accurate ground truth.

In this paper we introduce the Falling Things (FAT) dataset, consisting of a large number (61,500) of snapshots for training and evaluating robotics scene understanding algorithms in household environments. Specifically, as shown in Fig. 1, each snapshot consists of a stereo pair of image frames with corresponding depth images, along with the 3D poses, 2D/3D bounding boxes, projected 3D bounding boxes, and pixelwise segmentation of the known objects in the scene. Unlike previous datasets, we aim for photorealistic images of real-world objects, leveraging the popular YCB object models [2] placed in different virtual environments.

To better understand the contribution of our dataset, consult Table 1. To our knowledge, only two datasets exist with accurate ground truth poses of multiple objects with significant occlusion (T-LESS [6], YCB-Video [16]). Yet neither of these datasets contains extreme lighting variations or multiple modalities. Our FAT dataset thus extends upon these existing solutions in both quantity and variety.
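To make the per-snapshot annotations concrete, the following sketch shows one way such a record could be represented in Python. It is purely illustrative: the field names and file layout here are hypothetical and do not reflect the dataset's actual schema, which is documented with the released data.

```python
# Purely illustrative sketch of a per-snapshot annotation record; the field
# names and file layout are hypothetical, not the dataset's actual schema.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class ObjectAnnotation:
    class_name: str                                     # e.g. "006_mustard_bottle"
    translation_m: Tuple[float, float, float]           # 3D position in the camera frame
    rotation_quat: Tuple[float, float, float, float]    # orientation as a quaternion
    bbox_2d: Tuple[float, float, float, float]          # (xmin, ymin, xmax, ymax) in pixels
    bbox_3d_projected: List[Tuple[float, float]]        # 8 cuboid corners projected to pixels
    visibility: float                                    # fraction of the object left unoccluded

@dataclass
class Snapshot:
    rgb_left: str          # path to the left RGB image
    rgb_right: str         # path to the right RGB image
    depth: str             # path to the registered dense depth image
    segmentation: str      # path to the per-pixel class segmentation
    objects: List[ObjectAnnotation] = field(default_factory=list)
```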
2. Falling Things Dataset
Description.
All data were generated by a custom plugin we developed for Unreal Engine 4 (UE4) [14]. By leveraging asynchronous, multithreaded sequential frame grabbing, the plugin generates data at a rate of 50–100 Hz, which is significantly faster than either the default UE4 screenshot function or the publicly available UnrealCV tool [12].

We selected three virtual environments within UE4: a kitchen, sun temple, and forest. These environments were chosen for their high-fidelity modeling and quality, as well as for the variety of indoor and outdoor scenes. For each environment we manually selected five specific locations covering a variety of terrain and lighting conditions (e.g., on a kitchen counter or tile floor, next to a rock, above a grassy field, and so forth). Together, these yielded 15 different locations consisting of a variety of 3D backgrounds, lighting conditions, and shadows.

We selected 21 household objects from the YCB [2] dataset (the same subset of 21 objects as in [16]). Since these models are not all aligned, translation was applied to each of the downloaded models to center the coordinate frame at the object centroid, and rotation was applied to align the coordinate axes with those of the object, taking product labeling into account. For each run, some of these object models were placed at random positions and orientations within a vertical cylinder of fixed radius and height placed at a fixation point. The objects' initial positions are sampled within this volume to avoid initial penetration. The objects were then allowed to fall under the force of gravity, as well as to collide with one another and with the surfaces in the scene. While the objects fell, the virtual camera system was rapidly teleported to random azimuths, elevations, and distances with respect to the fixation point to collect data. The azimuth and elevation were each restricted to fixed ranges (the azimuth limited to avoid collision with the wall, when present), and the distance ranged from 0.5 m to 1.5 m.

Our virtual camera system consists of a pair of stereo RGBD cameras. This design decision allows the dataset to support at least three different sensor modalities. Whereas single RGBD sensors are commonly used within robotics, stereo sensors have the potential to yield higher quality output with fewer distortions, and a monocular RGB camera has obvious advantages in terms of cost, simplicity, and availability. By supporting all of these options, the dataset allows researchers to use their modality of choice, as well as to explore the important topic of comparing across modalities. In our system, the baseline separating the left and right cameras is 6 cm, and each camera's horizontal field of view is 64 degrees, leading to a focal length of 768.2 pixels at an image resolution of 960 × 540 (these quantities are illustrated in the short sketch after the list below).

The dataset (consisting of 61,500 unique frames) is divided into two parts:
• Single objects. The first part of the dataset was generated by dropping each object model in isolation at each of the 15 locations, yielding ∼1,500 (∼100 × 15) unique images for each of the 21 objects, thus totaling 31,500 frames.
• Mixed objects. The second part of the dataset was generated in the same manner except with a random number of objects sampled uniformly from 2 to 10. By sampling the objects with replacement, we allow multiple instances of the same category in an image, unlike many previous datasets. For each location we generated 2,000 images, thus yielding 30,000 frames.
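As a rough sanity check of the camera parameters above, the focal length in pixels follows from the horizontal field of view and the image width, and metric depth can in principle be recovered from stereo disparity. The minimal Python sketch below assumes a 960-pixel image width (consistent with the stated 64-degree field of view and 768.2-pixel focal length); it is illustrative only and not part of the dataset tooling.

```python
import math

# Camera parameters stated in the text; the 960-pixel image width is an
# assumption, inferred from the 64 deg HFOV and the 768.2 px focal length.
HFOV_DEG = 64.0        # horizontal field of view of each camera
IMAGE_WIDTH_PX = 960   # assumed image width in pixels
BASELINE_M = 0.06      # stereo baseline: 6 cm

# Pinhole relation: f = (W / 2) / tan(HFOV / 2)
focal_px = (IMAGE_WIDTH_PX / 2.0) / math.tan(math.radians(HFOV_DEG / 2.0))
print(f"focal length ~ {focal_px:.1f} px")   # ~768.2 px, matching the text

def depth_from_disparity(disparity_px: float) -> float:
    """Metric depth (m) from stereo disparity (px): Z = f * b / d."""
    return focal_px * BASELINE_M / disparity_px

# Example: a point with 35 px of disparity lies roughly 1.3 m from the cameras,
# i.e. well within the 0.5-1.5 m camera distance range described above.
print(f"depth at 35 px disparity ~ {depth_from_disparity(35.0):.2f} m")
```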
Testing.
There are several natural ways to split the FAT dataset for training and testing. One approach would be to hold out one location per environment as the test set, leaving the other data for training. Another approach would be to hold out one environment for testing, leaving the others for training. Finally, the single-object images could be used for training and the remaining images for testing, assuming that occlusion is artificially introduced in the data augmentation process during training. Further details regarding training/testing methodology can be found with the dataset.
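A minimal sketch of the first two splitting strategies is shown below, assuming each frame can be tagged with its environment and location; the labels used in the example are hypothetical.

```python
# Hedged sketch of a train/test split by held-out environment or location.
# The per-frame (environment, location) tags are assumed to be derivable from
# the dataset's directory structure; the example labels below are hypothetical.
from typing import Dict, List, Optional, Tuple

def split_by_heldout(frames: Dict[str, Tuple[str, str]],
                     heldout_env: Optional[str] = None,
                     heldout_loc: Optional[Tuple[str, str]] = None
                     ) -> Tuple[List[str], List[str]]:
    """frames maps frame_id -> (environment, location); returns (train, test)."""
    train, test = [], []
    for frame_id, (env, loc) in frames.items():
        held_out = (heldout_env is not None and env == heldout_env) or \
                   (heldout_loc is not None and (env, loc) == heldout_loc)
        (test if held_out else train).append(frame_id)
    return train, test

# Example: hold out one (hypothetically named) kitchen location for testing.
# train_ids, test_ids = split_by_heldout(frames, heldout_loc=("kitchen", "location_4"))
```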
Statistics.
Fig. 2 displays the total number of occurrences of each object class in the FAT dataset, using the opacity of the color bars to indicate the percentage of the object that is visible (i.e., non-occluded). That is, solid color bars show the number of occurrences for which the object is at least 75% visible, whereas lighter color bars indicate the occurrences for which the visibility is between 25% and 75%. (Occurrences with visibility less than 25% are not shown.) As can be seen, smaller objects (such as the scissors or marker) are occluded more often than larger objects (such as the cracker box).

Figure 2. Total appearance count of the 21 YCB objects in the FAT dataset. Light color bars indicate object visibility greater than 25%, while solid bars indicate visibility greater than 75%. (100% visible means not occluded; 0% visible means fully occluded.)

Table 1. Datasets for object detection and pose estimation. Under 3D pose, ✓+ means that the poses of all known objects in the scene are provided, ✓ means only the pose of a single object is provided, and ✓− means that the provided poses are approximate. Note that segmentation and bounding boxes can be determined for any dataset by overlaying models according to ground truth.

dataset | # objects | # frames | type | depth | stereo | 3D pose | full rotation | occlusion | extreme lighting | segmentation | bbox coords
EPFL multi-view [11] | 20 | 2k | cars | ✗ | ✗ | ✓− | ✗ | ✗ | ✗ | ✗ | ✗
UW RGBD [9] | 300 | 250k | household | ✓ | ✗ | ✓− | ✓ | ✗ | ✗ | ✗ | ✗
LINEMOD [5] | 15 | 18k | household | ✓ | ✗ | ✓ | ✓ | ✗ | ✗ | ✗ | ✗
Object tracking [3] | 4 | 6k | household | ✓ | ✗ | ✓ | ✓ | ✗ | ✗ | ✗ | ✗
Pascal3D+ [15] | 12 | 30k | various | ✗ | ✗ | ✓ | ✗ | ✓ | ✓ | ✗ | ✗
Brachmann et al. [1] | 20 | 10k | various | ✓ | ✗ | ✓ | ✓ | ✓ | ✓ | ✗ | ✗
Occlusion [1, 7] | 8 | 1k | household | ✓ | ✗ | ✓ | ✓ | ✓ | ✗ | ✗ | ✗
Krull et al. [8] | 3 | 3k | handheld | ✓ | ✗ | ✓ | ✓ | ✓ | ✗ | ✗ | ✗
Rutgers APC [13] | 24 | 10k | warehouse | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗
T-LESS [6] | 30 | 10k | industrial | ✓ | ✗ | ✓+ | ✓ | ✓ | ✗ | ✗ | ✗
YCB-Video [16] | 21 | 134k | household | ✓ | ✗ | ✓+ | ✓ | ✓ | ✗ | ✓ | ✓
FAT (ours) | 21 | 60k | household | ✓ | ✓ | ✓+ | ✓ | ✓ | ✓ | ✓ | ✓

Fig. 3 shows statistical distributions of various parameters of one of the YCB objects (the mustard bottle) in the FAT dataset, from the left camera's perspective. Shown are the yaw, pitch, and roll of the object with respect to the front. The modes in yaw (at 0° and 180° and, to a lesser extent, at ±90°) are due to the fact that after falling, the object likely lies on either its front or back side. Similarly, the angle of the camera with respect to the resting surface biases the pitch angle toward neutrality. In contrast, the roll is uniform.

Also shown in the figure is the distribution of the distance to the camera, which extends slightly beyond the intended range of 0.5 m to 1.5 m because objects oftentimes roll or slide after impact. The final row shows the visibility, indicating that, while the single objects are fully visible, significant occlusions occur; and the distribution of the centroid within the image, which is approximately a broad Gaussian centered near the center of the image. Similar distributions were observed for other YCB objects.
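The occlusion statistics above can be reproduced by binning per-object visibility values. Below is a minimal sketch, assuming each object occurrence carries a visibility fraction in [0, 1] (e.g., visible pixel count divided by the pixel count the unoccluded object would cover); it is illustrative rather than the exact procedure used for Fig. 2.

```python
# Minimal sketch of the visibility binning used for the occlusion statistics.
# Each object occurrence is assumed to carry a visibility fraction in [0, 1]
# (e.g., visible pixel count divided by the unoccluded pixel count).
from collections import Counter
from typing import Iterable, Tuple

def count_visibility_bins(occurrences: Iterable[Tuple[str, float]]):
    """Return per-class counts for the two bins shown in Fig. 2.

    'mostly' counts occurrences that are at least 75% visible; 'partially'
    counts those between 25% and 75%; anything below 25% is ignored.
    """
    mostly, partially = Counter(), Counter()
    for class_name, visibility in occurrences:
        if visibility >= 0.75:
            mostly[class_name] += 1
        elif visibility >= 0.25:
            partially[class_name] += 1
    return mostly, partially

# Example: count_visibility_bins([("scissors", 0.4), ("cracker_box", 0.9)])
# -> mostly = {"cracker_box": 1}, partially = {"scissors": 1}
```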
3. Conclusion
We have presented a new dataset to accelerate research in object detection and pose estimation, as well as segmentation, depth estimation, and sensor modalities. The proposed dataset focuses on household items from the YCB dataset and has been rendered with high fidelity and a wide variety of backgrounds, poses, occlusions, and lighting conditions. Statistics from the dataset confirm this variety quantitatively. We hope that researchers find this dataset useful for exploring robust solutions to open problems such as object detection, pose estimation, depth estimation from monocular and/or stereo cameras, and depth-based segmentation, to advance the field of robotic manipulation.

Figure 3. Statistics for one object (mustard bottle) in the FAT dataset. In lexicographic order are the distribution of yaw, pitch, and roll angles; distance to the camera; visibility; and position of the object's centroid within the exported images.

References

[1] E. Brachmann, A. Krull, F. Michel, S. Gumhold, J. Shotton, and C. Rother. Learning 6D object pose estimation using 3D object coordinates. In ECCV, 2014.
[2] B. Calli, A. Walsman, A. Singh, S. Srinivasa, P. Abbeel, and A. M. Dollar. The YCB object and model set: Towards common benchmarks for manipulation research. In ICAR, 2015.
[3] C. Choi and H. I. Christensen. RGB-D object tracking: A particle filter approach on GPU. In IROS, 2013.
[4] A. Gaidon, Q. Wang, Y. Cabon, and E. Vig. Virtual worlds as proxy for multi-object tracking analysis. In CVPR, 2016.
[5] S. Hinterstoisser, V. Lepetit, S. Ilic, S. Holzer, G. Bradski, K. Konolige, and N. Navab. Model based training, detection and pose estimation of texture-less 3D objects in heavily cluttered scenes. In ACCV, 2012.
[6] T. Hodaň, P. Haluza, Š. Obdržálek, J. Matas, M. Lourakis, and X. Zabulis. T-LESS: An RGB-D dataset for 6D pose estimation of texture-less objects. In WACV, 2017.
[7] A. Krull, E. Brachmann, F. Michel, M. Y. Yang, S. Gumhold, and C. Rother. Learning analysis-by-synthesis for 6D pose estimation in RGB-D images. In ICCV, 2015.
[8] A. Krull, F. Michel, E. Brachmann, S. Gumhold, S. Ihrke, and C. Rother. 6-DOF model based tracking via object coordinate regression. In ACCV, 2014.
[9] K. Lai, L. Bo, X. Ren, and D. Fox. A large-scale hierarchical multi-view RGB-D object dataset. In ICRA, 2011.
[10] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. arXiv:1512.02134, 2015.
[11] M. Ozuysal, V. Lepetit, and P. Fua. Pose estimation for category specific multiview object localization. In CVPR, 2009.
[12] W. Qiu and A. Yuille. UnrealCV: Connecting computer vision to Unreal Engine. arXiv:1609.01326, 2016.
[13] C. Rennie, R. Shome, K. E. Bekris, and A. F. D. Souza. A dataset for improved RGBD-based object detection and pose estimation for warehouse pick-and-place. IEEE Robotics and Automation Letters (RA-L), 1(2):1179–1185, July 2016.
[14] T. To, J. Tremblay, D. McKay, Y. Yamaguchi, K. Leung, A. Balanon, J. Cheng, and S. Birchfield. NDDS: NVIDIA deep learning dataset synthesizer. https://github.com/NVIDIA/Dataset_Synthesizer, 2018.
[15] Y. Xiang, R. Mottaghi, and S. Savarese. Beyond PASCAL: A benchmark for 3D object detection in the wild. In WACV, 2014.
[16] Y. Xiang, T. Schmidt, V. Narayanan, and D. Fox. PoseCNN: A convolutional neural network for 6D object pose estimation in cluttered scenes. arXiv:1711.00199, 2017.
[17] Y. Zhang, W. Qiu, Q. Chen, X. Hu, and A. Yuille. UnrealStereo: A synthetic dataset for analyzing stereo vision. arXiv:1612.04647, 2016.

A. Sample Images
The figure below shows sample images from the FAT dataset, demonstrating the variety of object poses, backgrounds, composition, and lighting conditions. (Random center crops are shown.)

B. YCB Objects