Object Detection using Domain Randomization and Generative Adversarial Refinement of Synthetic Images
Fernando Camaro Nogues, Andrew Huie, Sakyasingha Dasgupta
Ascent Robotics, Inc., Japan
{fernando, andrew, sakya}@ascent.ai

Abstract
In this work, we present an application of domain randomization and generative adversarial networks (GAN) to train a near real-time object detector for industrial electric parts, entirely in a simulated environment. Large-scale availability of labelled real-world data is typically rare and difficult to obtain in many industrial settings. As such, here only a few hundred unlabelled real images are used to train a Cyclic-GAN network, in combination with varying degrees of domain randomization. We demonstrate that this enables robust translation of synthetic images to the real-world domain. We show that a combination of the original synthetic (simulation) and GAN-translated images, when used to train a Mask-RCNN object detection network, achieves greater than 0.95 mean average precision in detecting and classifying a collection of industrial electric parts. We evaluate the performance across different combinations of training data.
1. Introduction
Successful examples of deep learning require a large number of manually annotated samples, which can be prohibitive for most applications, even when starting from a model pre-trained in another domain that only requires a fine-tuning phase in the target domain. An effective way to eliminate the cost of this expensive annotation is to train the model within a simulated environment where the annotations can also be generated automatically. However, the problem with this approach is that the generated samples (in our case, images) may not follow the same distribution as the real domain, resulting in what is known as the reality gap. Several approaches exist that try to reduce this gap. One such method is domain randomization ([1], [2]), in which several rendering parameters of the scene are randomized, such as the colors of objects, textures, and lights, effectively enabling the model to see a very wide distribution during training and to treat the real distribution as one variation within it. Another approach that directly tries to minimize the reality gap is to refine the synthetic images so that they look more realistic. One possible way to build such a refiner is with a generative adversarial training framework [3]. An alternative, more indirect approach to reducing the negative effect of the reality gap is to again use the GAN framework, but in this case directly on the features of some of the last layers of the network being trained for the specific target task [4]. In this work, we present an experimental use case of an object detector in a real industrial application setting, trained with different combinations of synthetic images and refined synthetic images (synthetic images refined to look more realistic). We evaluate our method across various combinations of training data.
2. Synthetic Image Generation with Domain Randomization
The architecture to produce the synthetic images for our experiments is composed of two main parts. First, the physics simulation engine Bullet (https://pybullet.org/wordpress/) is used to place the objects in a physically consistent configuration after letting them fall from a random position. Second, the ray tracing rendering library POV-Ray is used to render an image based on this configuration. In POV-Ray we introduce domain randomization by randomizing several parameters, namely the number of lights and their color, the color and texture of each part of the target objects and of the scene floor plane, as well as the camera position. The camera position is drawn from a uniform distribution inside a rectangular prism whose 20cm-sided square base sits 10cm above the floor plane and whose height is 10cm. Although the location of the camera was uniform, the camera always pointed at the origin of the global coordinates with no roll angle. This variation of the camera position was intended to achieve robustness against different positions of the camera in the real world.
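The camera sampling described above can be sketched as follows. This is an illustrative sketch only: the coordinate conventions and the explicit look-at computation are our assumptions (in practice POV-Ray's look_at directive handles the orientation), and `sample_camera` is a hypothetical helper name.

```python
import math
import random

def sample_camera():
    """Sample a camera pose as described in Sec. 2: the position is
    uniform inside a rectangular prism whose 20cm x 20cm square base
    sits 10cm above the floor plane and whose height is 10cm; the
    camera always looks at the world origin with zero roll.
    Units are metres; axis conventions are assumed for illustration."""
    x = random.uniform(-0.10, 0.10)   # 20 cm square base, centred on origin
    y = random.uniform(-0.10, 0.10)
    z = random.uniform(0.10, 0.20)    # base 10 cm above floor, 10 cm height
    # Unit forward vector pointing from the camera to the origin
    # (what POV-Ray's look_at <0,0,0> achieves internally).
    norm = math.sqrt(x * x + y * y + z * z)
    forward = (-x / norm, -y / norm, -z / norm)
    return (x, y, z), forward
```

The remaining randomized parameters (number and color of lights, per-part colors and textures) would be drawn analogously and written into the POV-Ray scene description before each render.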
3. Refinement of synthetic images by adversarial training
An alternative way we consider to reduce the reality gap is to use the GAN framework to refine the synthetic images to look more realistic. Here, we selected the Cyclic-GAN [6] architecture, since it only requires two sets of unpaired examples, one for each domain: the synthetic and the real one. The original synthetic images of size 1024x768 were too large for training our Cyclic-GAN model; as such, instead of resizing the images, we opted for training on random crops of size 256x256. This way we can train at the original pixel density and exploit the fact that our generators are fully convolutional networks, so that during the inference phase we can still input the original full-size image.

Figure 1: Left: example of a synthetic image. Right: corresponding synthetic image after translation to the real domain. The USB socket has gained a more realistic reflection, and the switch has gained a realistic surface texture and color.

We noticed that after training, one particular target object lost its color and turned gray, while the remaining objects were refined in a realistic manner without losing their original color. We think that this was mainly due to the particular architecture of the discriminators: the final layer of the discriminator model consisted of a spatial grid of discriminator neurons whose receptive field with respect to the input image was too small to capture that object. To solve this, we added more convolutional layers to the discriminator models, which effectively increased the receptive field size. Furthermore, instead of substituting one grid of discriminators with another, we preferred to keep both: one with a small receptive field intended to discriminate details of the objects, and another with a large receptive field that can understand the objects as a whole (Fig. 3 in the Appendix). The final loss was computed as the mean over all individual discriminator units of these two layers.
This small modification enabled us to preserve the color of all the objects. The Cyclic-GAN model was trained using 10K synthetic images and 256 real images. Fig. 1 shows an example of the image produced by our model that translates from the synthetic domain to the real domain; see Fig. 4 in the Appendix for more examples.
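The effect of adding convolutional layers on the discriminator's receptive field can be made concrete with the standard back-to-front recurrence rf_in = (rf_out - 1) * stride + kernel. The layer configurations below are assumptions for illustration: the paper does not list its exact discriminator architecture, so we use CycleGAN's default 70x70 PatchGAN (kernel-4 convolutions with strides 2,2,2,1,1) as the small-receptive-field grid and show how one extra stride-2 layer roughly doubles the field.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of one output unit of a conv
    stack, given as (kernel, stride) pairs ordered input -> output.
    Walks backwards applying rf_in = (rf_out - 1) * stride + kernel."""
    rf = 1
    for k, s in reversed(layers):
        rf = (rf - 1) * s + k
    return rf

# Assumed small-field grid: CycleGAN's default 70x70 PatchGAN.
patchgan = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]
# One additional stride-2 conv layer -> much larger field, so each
# discriminator unit can see a whole object rather than local texture.
deeper = [(4, 2), (4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]
```

Keeping both output grids, as the paper does, means averaging the per-unit losses of the small-field layer (texture detail) and the large-field layer (whole objects) into one discriminator loss.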
4. Experiments
In this section we compare different combinations of training data and their impact on the mAP for object detection with a Mask-RCNN model [5]. As a test dataset we used 100 real images. The different types of datasets used for training were: S_fix: synthetic images with fixed object colors, without texture, and with a white background. S_fix→real: images translated from S_fix to the real domain. S_rand−tex: synthetic images with objects and background with randomized colors but without texture. S_rand+tex: synthetic images with objects and background with randomized colors and texture. See Fig. 2 in the Appendix for a general overview of the training architecture and Fig. 5 for examples of the different types of images employed. The target objects to be detected consisted of 12 tiny electronic parts for which accurate 3D CAD models were available (Fig. 6). In all the experiments we used 10K training samples, the same number of training iterations, and the same hyperparameters. The object detection performance for the different combinations of datasets used in the experiments is presented in Table 1. Using a training set made purely of one type of data resulted in a mAP below 0.9 in most cases, with the exception of S_rand+tex. Overall, the best detection results were obtained when the refined synthetic image set (S_fix→real) was combined with highly varied randomized data (S_rand+tex). The results indicate that neither domain randomization nor GAN-based refinement is enough on its own to reach sufficient performance. In combination, they reduce the reality gap effectively, resulting in a significant boost in performance (see the real-time object detection video at https://youtu.be/Q-WeXSSnZ0U). Refer to Fig. 7 for the training curves associated with the different experiments, and to Fig.
8 for some detection result images.

Table 1: Object detection performance for the different training data combinations.

Training data                              | mAP (0.5 IoU)
100% S_fix                                 |
100% S_fix→real                            |
100% S_rand−tex                            |
100% S_rand+tex                            |
20% S_fix and 80% S_rand+tex               |
20% S_fix→real and 80% S_rand+tex          |
50% S_fix→real and 50% S_rand+tex          |

References

[1] Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, Pieter Abbeel. Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World. Proceedings of the 30th IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, Canada, October 2017.
[2] Fereshteh Sadeghi and Sergey Levine. CAD2RL: Real Single-Image Flight without a Single Real Image. Robotics: Science and Systems (RSS), 2017.
[3] Ashish Shrivastava, Tomas Pfister, Oncel Tuzel, Joshua Susskind, Wenda Wang and Russell Webb. Learning from Simulated and Unsupervised Images through Adversarial Training. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, July 21-26, 2017.
[4] Yaroslav Ganin et al. Domain-Adversarial Training of Neural Networks. The Journal of Machine Learning Research, 2016.
[5] Kaiming He, Georgia Gkioxari, Piotr Dollár and Ross B. Girshick. Mask R-CNN. 2017 IEEE International Conference on Computer Vision (ICCV), 2017.
[6] Jun-Yan Zhu, Taesung Park, Phillip Isola and Alexei A. Efros. Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. 2017 IEEE International Conference on Computer Vision (ICCV), 2017.
Appendix
In Fig. 2 we provide a schematic overview of the object detection training data generation pipeline.
Figure 2: General architecture for training the object detector.

Figure 3: Discriminator network with two grid layers of discriminator cells, one with a small receptive field and the other with a larger receptive field.

Figure 4: Left column: images from S_fix. Right column: corresponding refined images (S_fix→real).

Figure 5: Examples of the different types of images employed in the experiments. (a) Example of an S_fix image; (b) example of an S_rand−tex image; (c) example of an S_rand+tex image; (d) example of a real image used to train the Cyclic-GAN.

Figure 6: Electronic parts used in the experiments: (a) tactile switch, (b) pin header, (c) 3-way cable mount screw terminal, (d) DC power jack, (e) DIP switch, (f) slide switch, (g) LED, (h) IC socket, (i) trimmer, (j) buzzer, (k) USB type A socket, (l) USB type C socket.

Figure 7: Mask-RCNN training loss. The model was trained by fine-tuning a Mask-RCNN model pre-trained on the COCO dataset: first only the Mask-RCNN heads were trained (without training the region proposal network or the backbone model) for 10 epochs with a learning rate of 0.002, and then the whole network was trained for another 5 epochs with a learning rate of 0.0002. We used an SGD optimizer with a momentum of 0.9. The configurations that achieved the best performance, "20% S_fix→real and 80% S_rand+tex" and "50% S_fix→real and 50% S_rand+tex", are the ones that had the worst loss values during training. We think this is because these datasets were more difficult, which in the end prepared the model better for the similarly difficult real test dataset.

Figure 8: Example of detection results for 20% S_fix→real and 80% S_rand+tex.
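The two-phase fine-tuning schedule described in the Figure 7 caption can be written down explicitly. This is a minimal sketch of the schedule only; `finetune_schedule` is a hypothetical helper, and how the frozen/trainable layer sets are applied depends on the Mask-RCNN implementation used.

```python
def finetune_schedule(epoch):
    """Two-phase Mask-RCNN fine-tuning schedule from the paper:
    epochs 0-9 train only the heads (RPN and backbone frozen) at
    lr=0.002; epochs 10-14 train the whole network at lr=0.0002.
    Both phases use SGD with momentum 0.9."""
    if epoch < 10:
        return {"layers": "heads", "lr": 2e-3, "momentum": 0.9}
    if epoch < 15:
        return {"layers": "all", "lr": 2e-4, "momentum": 0.9}
    raise ValueError("schedule only covers 15 epochs")
```

A training loop would query this per epoch, freeze or unfreeze parameters according to the "layers" key, and set the optimizer's learning rate accordingly.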