Object Detection using Domain Randomization and Generative Adversarial Refinement of Synthetic Images
Fernando Camaro Nogues, Andrew Huie, Sakyasingha Dasgupta
Ascent Robotics, Inc., Japan
{fernando, andrew, sakya}@ascent.ai

Abstract
In this work, we present an application of domain randomization and generative adversarial networks (GAN) to train a near real-time object detector for industrial electric parts, entirely in a simulated environment. Large-scale availability of labelled real-world data is typically rare and difficult to obtain in many industrial settings. As such, here only a few hundred unlabelled real images are used to train a Cyclic-GAN network, in combination with varying degrees of domain randomization. We demonstrate that this enables robust translation of synthetic images to the real-world domain. We show that a combination of the original synthetic (simulation) and GAN-translated images, when used to train a Mask-RCNN object detection network, achieves greater than 0.95 mean average precision in detecting and classifying a collection of industrial electric parts. We evaluate the performance across different combinations of training data.
1. Introduction
Successful examples of deep learning require a large number of manually annotated samples, which can be prohibitive for most applications, even when starting from a model pre-trained in another domain that only requires a fine-tuning phase in the target domain. An effective way to eliminate the cost of this expensive annotation is to train the model within a simulated environment where the annotations can also be generated automatically. However, the problem with this approach is that the generated samples (in our case, images) may not follow the same distribution as the real domain, resulting in what is known as the reality gap. Several approaches exist that try to reduce this gap. One such method is domain randomization ([1], [2]), in which several rendering parameters of the scene are randomized, such as the colors of objects, textures, and lights, effectively enabling the model to see a very wide distribution during training and to treat the real distribution as one variation within it. Another approach that directly tries to minimize the reality gap is to refine the synthetic images so that they look more realistic. One possible way to build such a refiner is with a generative adversarial training framework [3]. An alternative, more indirect approach to reducing the negative effect of the reality gap is to again use the GAN framework, but in this case directly on the features of some of the last layers of the network being trained for the specific target task [4]. In this work, we present an experimental use case of an object detector in a real industrial application setting, trained with different combinations of synthetic images and refined synthetic images (synthetic images refined to look more realistic). We evaluate our method across various combinations of training data.
2. Synthetic Image Generation with Domain Randomization
The architecture to produce the synthetic images for our experiments is composed of two main parts. First, the physics simulation engine Bullet (https://pybullet.org/wordpress/) is used to place the objects in a physically consistent configuration after letting them fall from a random position. Second, the ray tracing rendering library POV-Ray is used to render an image based on this configuration. In POV-Ray we introduce domain randomization by randomizing several parameters, namely the number of lights and their color, the color and texture of each part of the target objects and of the scene floor plane, as well as the camera position. The camera position is drawn from a uniform distribution inside a rectangular prism whose 20cm-sided square base sits 10cm above the floor plane and whose height is 10cm. Although the location of the camera was uniform, the camera always pointed at the origin of the global coordinates with no roll angle. This variation of the camera position was intended to achieve robustness against different positions of the camera in the real world.
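The camera sampling described above can be sketched as follows. This is an illustrative sketch only: the coordinate conventions and the explicit look-at computation are our assumptions (in practice POV-Ray's look_at directive handles the orientation), and `sample_camera` is a hypothetical helper name.

```python
import math
import random

def sample_camera():
    """Sample a camera pose as described in Sec. 2: the position is
    uniform inside a rectangular prism whose 20cm x 20cm square base
    sits 10cm above the floor plane and whose height is 10cm; the
    camera always looks at the world origin with zero roll.
    Units are metres; axis conventions are assumed for illustration."""
    x = random.uniform(-0.10, 0.10)   # 20 cm square base, centred on origin
    y = random.uniform(-0.10, 0.10)
    z = random.uniform(0.10, 0.20)    # base 10 cm above floor, 10 cm height
    # Unit forward vector pointing from the camera to the origin
    # (what POV-Ray's look_at <0,0,0> achieves internally).
    norm = math.sqrt(x * x + y * y + z * z)
    forward = (-x / norm, -y / norm, -z / norm)
    return (x, y, z), forward
```

The remaining randomized parameters (number and color of lights, per-part colors and textures) would be drawn analogously and written into the POV-Ray scene description before each render.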
3. Refinement of synthetic images by adversarial training
An alternative way we consider to reduce the reality gap is to use the GAN framework to refine the synthetic images to look more realistic. Here, we selected the Cyclic-GAN [6] architecture, since it only requires two sets of unpaired examples, one for each domain: the synthetic and the real one. The original synthetic images of size 1024x768 were too large for training our Cyclic-GAN model; as such, instead of resizing the images, we opted for training on random crops of size 256x256. This way we can train at the original pixel density and exploit the fact that our generators are fully convolutional networks, so that during the inference phase we can still input the original full-size image.

Figure 1: Left: example of a synthetic image. Right: corresponding synthetic image after translation to the real domain. The USB socket has gained a more realistic reflection, and the switch has gained a realistic surface texture and color.

We noticed that after training, one particular target object lost its color and turned gray, while the remaining objects were refined in a realistic manner without losing their original color. We think that this was mainly due to the particular architecture of the discriminators: the final layer of the discriminator model consisted of a spatial grid of discriminator neurons whose receptive field with respect to the input image was too small to capture that object. To solve this, we added more convolutional layers to the discriminator models, which effectively increased the receptive field size. Furthermore, instead of substituting one grid of discriminators with another, we preferred to keep both: one with a small receptive field intended to discriminate details of the objects, and another with a large receptive field that can understand the objects as a whole (Fig. 3 in the Appendix). The final loss was computed as the mean over all individual discriminator units of these two layers.
This small modification enabled us to preserve the color of all the objects. The Cyclic-GAN model was trained using 10K synthetic images and 256 real images. Fig. 1 shows an example of the image produced by our model that translates from the synthetic domain to the real domain; see Fig. 4 in the Appendix for more examples.
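The effect of adding convolutional layers on the discriminator's receptive field can be made concrete with the standard back-to-front recurrence rf_in = (rf_out - 1) * stride + kernel. The layer configurations below are assumptions for illustration: the paper does not list its exact discriminator architecture, so we use CycleGAN's default 70x70 PatchGAN (kernel-4 convolutions with strides 2,2,2,1,1) as the small-receptive-field grid and show how one extra stride-2 layer roughly doubles the field.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of one output unit of a conv
    stack, given as (kernel, stride) pairs ordered input -> output.
    Walks backwards applying rf_in = (rf_out - 1) * stride + kernel."""
    rf = 1
    for k, s in reversed(layers):
        rf = (rf - 1) * s + k
    return rf

# Assumed small-field grid: CycleGAN's default 70x70 PatchGAN.
patchgan = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]
# One additional stride-2 conv layer -> much larger field, so each
# discriminator unit can see a whole object rather than local texture.
deeper = [(4, 2), (4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]
```

Keeping both output grids, as the paper does, means averaging the per-unit losses of the small-field layer (texture detail) and the large-field layer (whole objects) into one discriminator loss.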
4. Experiments
In this section we compare different combinations of training data and their impact on the mAP for object detection with a Mask-RCNN model [5]. As a test dataset we used 100 real images. The different types of datasets used for training were: S_fix: synthetic images with fixed object colors, without texture, and with a white background. S_fix→real: images translated from S_fix to the real domain. S_rand−tex: synthetic images with objects and background with randomized colors but without texture. S_rand+tex: synthetic images with objects and background with randomized colors and texture. See Fig. 2 in the Appendix for a general overview of the training architecture and Fig. 5 for examples of the different types of images employed. The target objects to be detected consisted of 12 tiny electronic parts for which accurate 3D CAD models were available (Fig. 6). In all the experiments we used 10K training samples, the same number of training iterations, and the same hyperparameters. The object detection performance for the different combinations of datasets used in the experiments is presented in Table 1. Using a training set made purely of one type of data resulted in a mAP below 0.9 in most cases, with the exception of S_rand+tex. Overall, the best detection results were obtained when the refined synthetic image set (S_fix→real) was combined with highly varied randomized data (S_rand+tex). The results indicate that neither domain randomization nor GAN-based refinement is enough on its own to reach sufficient performance. In combination, they reduce the reality gap effectively, resulting in a significant boost in performance (see the real-time object detection video at https://youtu.be/Q-WeXSSnZ0U). Refer to Fig. 7 for the training curves associated with the different experiments, and to Fig.
8 for some detection result images.

Table 1: Object detection performance for the different training data combinations.

Training data                              | mAP (0.5 IoU)
100% S_fix                                 |
100% S_fix→real                            |
100% S_rand−tex                            |
100% S_rand+tex                            |
20% S_fix and 80% S_rand+tex               |
20% S_fix→real and 80% S_rand+tex          |
50% S_fix→real and 50% S_rand+tex          |

References

[1] Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, Pieter Abbeel. Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World. Proceedings of the 30th IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, Canada, October 2017.
[2] Fereshteh Sadeghi and Sergey Levine. CAD2RL: Real Single-Image Flight without a Single Real Image. Robotics: Science and Systems (RSS), 2017.
[3] Ashish Shrivastava, Tomas Pfister, Oncel Tuzel, Joshua Susskind, Wenda Wang and Russell Webb. Learning from Simulated and Unsupervised Images through Adversarial Training. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, July 21-26, 2017.
[4] Yaroslav Ganin et al. Domain-Adversarial Training of Neural Networks. The Journal of Machine Learning Research, 2016.
[5] Kaiming He, Georgia Gkioxari, Piotr Dollár and Ross B. Girshick. Mask R-CNN. 2017 IEEE International Conference on Computer Vision (ICCV), 2017.
[6] Jun-Yan Zhu, Taesung Park, Phillip Isola and Alexei A. Efros. Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. 2017 IEEE International Conference on Computer Vision (ICCV), 2017.
Appendix
In Fig. 2 we provide a schematic overview of the object detection training data generation pipeline.
Figure 2: General architecture for training the object detector.

Figure 3: Discriminator network with two grid layers of discriminator cells, one with a small receptive field and the other with a larger receptive field.

Figure 4: Left column: images from S_fix. Right column: corresponding refined images (S_fix→real).

Figure 5: Examples of the different types of images employed in the experiments. (a) Example of an S_fix image; (b) example of an S_rand−tex image; (c) example of an S_rand+tex image; (d) example of a real image used to train the Cyclic-GAN.

Figure 6: Electronic parts used in the experiments: (a) tactile switch, (b) pin header, (c) 3-way cable mount screw terminal, (d) DC power jack, (e) DIP switch, (f) slide switch, (g) LED, (h) IC socket, (i) trimmer, (j) buzzer, (k) USB type A socket, (l) USB type C socket.

Figure 7: Mask-RCNN training loss. The model was trained by fine-tuning a Mask-RCNN model pre-trained on the COCO dataset: first only the Mask-RCNN heads were trained (without training the region proposal network or the backbone model) for 10 epochs with a learning rate of 0.002, and then the whole network was trained for another 5 epochs with a learning rate of 0.0002. We used an SGD optimizer with a momentum of 0.9. The configurations that achieved the best performance, "20% S_fix→real and 80% S_rand+tex" and "50% S_fix→real and 50% S_rand+tex", are the ones that had the worst loss values during training. We think this is because these datasets were more difficult, which in the end prepared the model better for the similarly difficult real test dataset.

Figure 8: Example of detection results for 20% S_fix→real and 80% S_rand+tex.
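The two-phase fine-tuning schedule described in the Figure 7 caption can be written down explicitly. This is a minimal sketch of the schedule only; `finetune_schedule` is a hypothetical helper, and how the frozen/trainable layer sets are applied depends on the Mask-RCNN implementation used.

```python
def finetune_schedule(epoch):
    """Two-phase Mask-RCNN fine-tuning schedule from the paper:
    epochs 0-9 train only the heads (RPN and backbone frozen) at
    lr=0.002; epochs 10-14 train the whole network at lr=0.0002.
    Both phases use SGD with momentum 0.9."""
    if epoch < 10:
        return {"layers": "heads", "lr": 2e-3, "momentum": 0.9}
    if epoch < 15:
        return {"layers": "all", "lr": 2e-4, "momentum": 0.9}
    raise ValueError("schedule only covers 15 epochs")
```

A training loop would query this per epoch, freeze or unfreeze parameters according to the "layers" key, and set the optimizer's learning rate accordingly.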