BlendTorch: A Real-Time, Adaptive Domain Randomization Library
Christoph Heindl∗, Lukas Brunner∗, Sebastian Zambal∗ and Josef Scharinger†
∗ Visual Computing, Profactor GmbH, Austria, [email protected]
† Computational Perception, JKU, Austria, [email protected]
Abstract—Solving complex computer vision tasks by deep learning techniques relies on large amounts of (supervised) image data, typically unavailable in industrial environments. The lack of training data starts to impede the successful transfer of state-of-the-art methods in computer vision to industrial applications. We introduce BlendTorch, an adaptive Domain Randomization (DR) library, to help create infinite streams of synthetic training data. BlendTorch generates data by massively randomizing low-fidelity simulations and takes care of distributing artificial training data for model learning in real-time. We show that models trained with BlendTorch repeatedly perform better in an industrial object detection task than those trained on real or photo-realistic datasets.
I. INTRODUCTION
Recent advances in computer vision depend extensively on deep learning techniques. With sufficient modeling capacity and enough (labeled) domain datasets, deeply learned models often outperform conventional vision pipelines [1]–[3]. However, providing large enough datasets is challenging within industrial applications for several reasons: a) costly and error-prone manual annotations, b) the odds of observing rare events are low, and c) a combinatorial data explosion as vision tasks become increasingly complex. If high capacity models are trained despite low data quality, the likelihood of overfitting increases, resulting in reduced robustness in industrial applications [4].

In this work, we focus on generating artificial training images and annotations through computer simulations. Training models in simulations promises annotated data without limits, but the discrepancy between the distribution of training and real data often leads to poorly generalizing models [5]. Increasing photo realism and massive randomization of low-fidelity simulations (Domain Randomization) are two popular and contrary strategies to minimize the distributional mismatch. Recent frameworks focus on photorealism, but do not address the specifics of massive randomization such as online rendering capabilities and feedback channels between training and simulation.

We introduce BlendTorch, a general purpose open-source image synthesis framework for adaptive, real-time Domain Randomization (DR) written in Python. BlendTorch integrates probabilistic scene composition, physically plausible real-time rendering, distributed data streaming, and bidirectional communication. Our framework integrates the modelling and rendering strengths of Blender [6] with the high-performance deep learning capabilities of PyTorch [7]. We successfully apply BlendTorch to the task of learning to detect industrial objects without access to real images during training (see Figure 1). We demonstrate that data generation and model training can be done in a single online sweep, reducing the time required to ramp-up neural networks significantly. Our approach not only outperforms photo-realistic datasets on the same perception task, we also show that DR surpasses the detection performance obtained with a real image dataset.

Supported by MEDUSA, Leitprojekt Medizintechnik. https://github.com/cheind/pytorch-blender

Figure 1: We introduce BlendTorch, a real-time adaptive Domain Randomization library, for neural network training in simulated environments. We show that networks trained with BlendTorch outperform identical models trained with photo-realistic or even real datasets on the same object detection task.
A. Related Work
Using synthetically rendered images for training supervised machine learning tasks has been studied before. Tobin et al. [8] as well as Sadeghi et al. [9] introduced uniform Domain Randomization, in which image data is synthesized from low-fidelity simulations whose simulation aspects are massively randomized. These earlier DR approaches were tailored to a specific application, which severely limits their reusability in other environments. BlendTorch is based on the idea of DR, but generalizes to arbitrary applications.

Recently, general purpose frameworks focusing on generating photo-realistic images have been introduced [10]–[12].
Figure 2: BlendTorch overview. BlendTorch combines probabilistic (supervised) image generation in Blender [6] (left) with deep learning in PyTorch [7] (right) via a programmatic Python interface. Information is exchanged via a scalable network library (dotted lines). Implementation details are encapsulated in two Python subpackages, blendtorch.btb and blendtorch.btt. Additional feedback capabilities allow the simulation to adapt to current training needs.

Compared to BlendTorch, these frameworks are not real-time capable and also lack a principled way to communicate information from model training back into simulation. The work most closely related to ours is BlenderProc [10], since we share the idea of using Blender for modelling and rendering purposes. BlenderProc utilizes a non real-time, physically-based path tracer to generate photo-realistic images, but lacks the real-time support and feedback capabilities offered by BlendTorch. BlenderProc focuses on configuration file based scene generation, while BlendTorch offers a more flexible programming interface.

Enabling adaptive simulations through training feedback was introduced in Heindl et al. [13] for the specialized task of robot keypoint detection. BlendTorch generalizes this idea to arbitrary applications and to real-time rendering.
B. Contributions
This paper offers the following contributions:
1) BlendTorch, an adaptive, open-source, real-time domain randomization library that seamlessly connects modelling, rendering and learning aspects.
2) A comprehensive industrial object detection experiment that highlights the benefits of DR over photo-realistic and even real training datasets.

II. DESIGN PRINCIPLES
BlendTorch weaves several ideas into a design that enables practitioners and scientists to rapidly realize and test novel DR concepts.
Reuse and connect.
Training neural networks in simulation using DR requires several specialized software modules. For a successful experiment, modeling and rendering tools as well as powerful libraries for deep learning are required. In the recent past, excellent open-source frameworks for the aforementioned purposes have emerged independently from each other. However, these tools are not interconnected and a basic framework for their online interaction is missing. BlendTorch aims to bring these separate worlds together as seamlessly as possible without losing the benefits of either software component.
Real-time computing.
As the complexity of visual tasks increases, the combinatorial scene variety also increases exponentially. Several applications of DR separate the simulation from the actual learning process, because slow image generation stalls model learning. However, offline data generation suffers from the following shortcomings: constantly growing storage requirements and the missing possibility of online simulation adaptation. BlendTorch is designed to provide real-time, distributed integration between simulation and learning that is fast enough not to impede learning progress.
Adaptability.
The ability to adapt simulation parameters during model training has already proven beneficial [13]. Adaptability allows the simulation to synchronize with the evolving requirements of the learning process in order to learn more efficiently. The meaning of adaptability is application-dependent and ranges from adjusting the level of simulation difficulty to the generation of adversarial model examples. BlendTorch is designed with generic bidirectional communication in mind, allowing application-specific workflows to be implemented quickly.
III. ARCHITECTURE
BlendTorch connects the modeling and rendering strengths of Blender [6] and the deep learning capabilities of PyTorch [7] as depicted in Figure 2. Our architecture considers data generation to be a bidirectional exchange of information, which contrasts with conventional offline, one-way data generation. This perspective enables BlendTorch to support scenarios that go beyond pure data generation, including adaptive domain randomization and reinforcement learning applications. To seamlessly distribute rendering and learning across machine boundaries, we utilize ZeroMQ [14] and split BlendTorch into two distinctive sub-packages that exchange information via ZMQ: blendtorch.btb and blendtorch.btt, providing the Blender and PyTorch views on BlendTorch.

A typical data generation task for supervised machine learning is set up in BlendTorch as follows. First, the training procedure launches and maintains one or more Blender instances using btt.BlenderLauncher. Each Blender instance is instructed to run a particular scene and randomization script. Next, the training procedure creates a btt.RemoteIterableDataset to listen for incoming network messages from the Blender instances. BlendTorch uses a pipeline pattern that supports multiple data producers and workers employing a fair data queuing policy. It is guaranteed that only one PyTorch worker receives a particular message and that no message is lost, but the order in which messages are received is not guaranteed. To avoid out-of-memory situations, the simulation processes are temporarily stalled in case the training cannot keep up with data generation. Within every Blender process, the randomization script instantiates a btb.DataPublisher to distribute data messages. This script registers the necessary animation hooks. Typically, randomization occurs in pre-frame callbacks, while images are rendered in post-frame callbacks. Figure 3 illustrates these concepts along with a minimal working example.
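The feedback direction is not part of the minimal example in Figure 3. As a rough sketch of the idea, the trainer can publish updated simulation parameters over an additional socket that the randomization script polls once per frame. The snippet below uses plain ZeroMQ [14] rather than BlendTorch's own channel helpers; the port number, message layout and probability values are purely illustrative assumptions.

    # Trainer side (PyTorch process): publish updated class probabilities.
    import zmq

    ctx = zmq.Context()
    feedback = ctx.socket(zmq.PUB)
    feedback.bind('tcp://*:11001')   # hypothetical feedback port
    # Re-published periodically during training, e.g. after every validation step.
    feedback.send_pyobj({'class_probs': [0.1, 0.3, 0.2, 0.1, 0.2, 0.1]})

    # Simulation side (inside the Blender randomization script): poll for updates.
    sub = zmq.Context().socket(zmq.SUB)
    sub.connect('tcp://localhost:11001')
    sub.setsockopt(zmq.SUBSCRIBE, b'')
    if sub.poll(timeout=0):          # non-blocking check, e.g. in a pre-frame callback
        class_probs = sub.recv_pyobj()['class_probs']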
IV. EXPERIMENTS
We evaluate BlendTorch by studying its performance within the context of an industrial 2D object detection task. For the following experiments we choose the T-Less dataset [15], which originally consists of 30 industrial objects with no significant texture or discriminative color. For reasons of presentation, we re-group the 30 classes into 6 super-groups as depicted in Figure 5. T-Less constitutes a solid basis for comparative experiments, since real and synthetic photo-realistic images are already available for training and testing. Our evaluation methodology can be summarized as follows: We train the same state-of-the-art object detection neural network by varying only the training dataset, but keeping all other hyper-parameters fixed. We then evaluate each resulting model on the same test dataset using the mean Average Precision (mAP) [16] metric that combines localization and classification performance. To avoid biases due to random model initialization, we repeat model learning multiple times for each dataset.
A. Datasets
Throughout our evaluation we distinguish four color image T-Less datasets that we describe next.
RealKinect: A real image dataset based on T-Less images taken with a Kinect camera. It consists of 20 scenes with 500 images per scene. Images are taken in a structured way by sampling positions from a hemisphere. Each scene includes a variable number of occluders.

PBR: A publicly available photo-realistic synthetic image dataset generated by BlenderProc [10] in an offline step. It consists of images taken from random camera positions. PBR uses non-parametric occluders which are inserted randomly.

BlendTorch: This dataset corresponds to synthetic data generated by BlendTorch using Domain Randomization. Its generation details are given in Section IV-B.

BOP: A real image dataset of T-Less corresponding to the test dataset of the BOP Challenge 2020 [17], taken with a PrimeSense camera. We use this dataset solely for the final evaluation of trained models.

Illustrative examples of each dataset are shown in Figure 4.

B. BlendTorch Dataset
The BlendTorch dataset is generated as follows. For each scene we randomly draw a fixed number of objects according to the current class probabilities, which might change over the course of training. Each object is represented by its CAD model. For each generated object, there is a chance of producing an occluder. Occluding objects are based on super-shapes (https://github.com/cheind/supershape), which can vary their shape significantly based on 12 parameters. These parameters are selected uniformly at random within a sensible range. Next, we randomize each object's position, rotation and procedural material. We ensure that initially each object hovers above a ground plane. Finally, we employ Blender's physics module to let the objects fall under the influence of gravity. Once the simulation has settled, we render N images using a virtual camera positioned randomly on hemispheres with varying radius. Besides color images, we generate the following annotations: bounding boxes, class numbers and visibility scores. Figure 4c shows an example output. Each image and annotation pair is then published through BlendTorch data channels (refer to Figure 2 and Figure 3).
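The following Blender-side sketch condenses the per-scene randomization loop described above. It is an approximation under stated assumptions: compose_scene and its numeric ranges are hypothetical, the material randomization is reduced to a random base color, and a primitive mesh stands in for the 12-parameter super-shape occluders.

    import bpy
    import numpy as np

    def randomize_material(obj):
        # Reduced stand-in for the procedural materials: random base color.
        mat = bpy.data.materials.new(name='rand_mat')
        mat.use_nodes = True
        bsdf = mat.node_tree.nodes['Principled BSDF']
        bsdf.inputs['Base Color'].default_value = (*np.random.uniform(0, 1, 3), 1.0)
        obj.data.materials.clear()
        obj.data.materials.append(mat)

    def compose_scene(cad_templates, class_probs, n_objects=8, occluder_prob=0.3):
        # Draw object classes according to the (possibly adapted) class probabilities.
        classes = np.random.choice(len(cad_templates), size=n_objects, p=class_probs)
        for c in classes:
            obj = cad_templates[c].copy()                 # duplicate the CAD template
            obj.data = cad_templates[c].data.copy()
            bpy.context.scene.collection.objects.link(obj)
            # Random pose above the ground plane; gravity pulls the object down later.
            obj.location = tuple(np.random.uniform([-0.2, -0.2, 0.2], [0.2, 0.2, 0.5]))
            obj.rotation_euler = tuple(np.random.uniform(0, 2 * np.pi, size=3))
            randomize_material(obj)
            if np.random.rand() < occluder_prob:
                # Stand-in occluder; the actual pipeline samples super-shapes instead.
                bpy.ops.mesh.primitive_ico_sphere_add(
                    radius=np.random.uniform(0.02, 0.08),
                    location=tuple(np.random.uniform([-0.2, -0.2, 0.2], [0.2, 0.2, 0.5])))
        # Step frames so the (pre-configured) rigid-body physics settles before rendering.
        scene = bpy.context.scene
        for f in range(scene.frame_start, scene.frame_start + 100):
            scene.frame_set(f)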
C. Training

Object detection is performed by CenterNet [18], [19] using a DLA-34 [20] feature extracting backbone pre-trained on ImageNet [21]. The backbone features are inputs to multiple prediction heads. Each head consists of a convolution layer mapping to 256 feature channels, a rectified linear unit, and another convolution layer with a task specific number of output channels. In particular, our network uses the following heads: a) one center point heatmap per class of dimension ⌊H/4⌋ × ⌊W/4⌋, and b) a bounding box size regression head of dimension 2 × ⌊H/4⌋ × ⌊W/4⌋. During training, the total loss L_total is computed as

    L_total = L_hm + 0.1 · L_wh,    (1)

where L_hm is the focal loss of predicted and ground truth center point heatmaps, and L_wh is the L1 loss of regressed and true bounding box dimensions evaluated at ground truth center point locations.
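As a concrete illustration of the head structure and the loss in Eq. (1), the following PyTorch sketch is one possible implementation, not the authors' code: the focal loss follows the penalty-reduced formulation of [19] with its defaults α=2, β=4 assumed, and the tensor layouts and function names are illustrative.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Head(nn.Module):
        # 3x3 convolution to 256 channels, ReLU, 1x1 convolution to task-specific outputs.
        def __init__(self, in_channels, out_channels):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(in_channels, 256, 3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(256, out_channels, 1),
            )

        def forward(self, x):
            return self.net(x)

    def focal_loss(pred, gt, alpha=2, beta=4, eps=1e-6):
        # Penalty-reduced pixel-wise focal loss on center heatmaps, cf. [19].
        pos = gt.eq(1).float()
        neg_w = (1 - gt).pow(beta) * (1 - pos)
        pos_term = ((1 - pred).pow(alpha) * torch.log(pred.clamp(min=eps)) * pos).sum()
        neg_term = (pred.pow(alpha) * torch.log((1 - pred).clamp(min=eps)) * neg_w).sum()
        return -(pos_term + neg_term) / pos.sum().clamp(min=1)

    def total_loss(hm_pred, hm_gt, wh_pred, wh_gt, center_mask):
        # hm_*: (B,C,H,W) heatmaps (hm_pred after sigmoid), wh_*: (B,2,H,W) box sizes,
        # center_mask: (B,1,H,W) boolean mask of ground-truth center locations.
        l_hm = focal_loss(hm_pred, hm_gt)
        l_wh = F.l1_loss(wh_pred[center_mask.expand_as(wh_pred)],
                         wh_gt[center_mask.expand_as(wh_gt)])
        return l_hm + 0.1 * l_wh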
We split the training dataset into training and validation data (90/10 %) and train using Adam [22] with a fixed learning rate and weight decay. We perform the following augmentations independent of the dataset: scaling to 512 px, random rotation, random horizontal flip, and color jitter. Regularization and augmentation are applied to all training sets to avoid skewing of results due to different dataset sizes. Validation is performed at regular intervals. The best model is selected based on the lowest validation total loss, as defined in Equation 1. As mentioned above, each training session is repeated 10 times, and the best model of each run is stored for evaluation.

Figure 3: Minimal BlendTorch example. The training script train.py launches multiple Blender instances to simulate a simple scene driven by the randomization script cube.blend.py. The training then awaits batches of images and annotations within its main loop. The simulation script randomizes the properties of a cube and publishes color images along with corner annotations; color images and superimposed annotations are shown in the bottom-left of the figure.

train.py:

    from torch.utils import data
    import blendtorch.btt as btt

    def main():
        # Example values; adjust num_instances and batch_size as needed.
        largs = dict(
            scene='cube.blend',
            script='cube.blend.py',
            num_instances=2,
            named_sockets=['DATA'],
        )
        with btt.BlenderLauncher(**largs) as bl:
            addr = bl.launch_info.addresses['DATA']
            ds = btt.RemoteIterableDataset(addr)
            dl = data.DataLoader(ds, batch_size=4)
            for item in dl:
                img, xy = item['image'], item['xy']
                print('Received', img.shape, xy.shape)

    main()

cube.blend.py:

    import bpy
    from numpy.random import uniform
    import blendtorch.btb as btb

    def main():
        cube = bpy.data.objects["Cube"]

        def pre_frame():
            cube.rotation_euler = uniform(0, 3.14, 3)

        def post_frame():
            pub.publish(
                image=off.render(),
                xy=off.camera.object_to_pixel(cube),
            )

        btargs, _ = btb.parse_blendtorch_args()
        pub = btb.DataPublisher(btargs.btsockets['DATA'], btargs.btid)
        off = btb.OffScreenRenderer(mode='rgb')
        anim = btb.AnimationController()
        anim.pre_frame.add(pre_frame)
        anim.post_frame.add(post_frame)
        anim.play()

    main()
D. Prediction

Objects and associated classes are determined by extracting the top-K local peak values that exceed a given confidence score from all center heatmaps. Bounding box dimensions are determined from the respective channels at the center point locations.
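A common way to realize this decoding step in PyTorch is to suppress non-maxima with a 3x3 max-pooling and read the box sizes at the surviving peaks. The sketch below follows that pattern; the function name, tensor layouts and return format are assumptions rather than the exact procedure used here.

    import torch
    import torch.nn.functional as F

    def decode(heatmaps, wh, k=25, score_thresh=0.1):
        # heatmaps: (B,C,H,W) class center heatmaps after sigmoid, wh: (B,2,H,W) box sizes.
        B, C, H, W = heatmaps.shape
        pooled = F.max_pool2d(heatmaps, 3, stride=1, padding=1)
        peaks = heatmaps * (pooled == heatmaps).float()   # keep 3x3 local maxima only
        scores, idx = torch.topk(peaks.view(B, -1), k)    # top-k peaks per image
        classes = idx // (H * W)
        ys = (idx % (H * W)) // W
        xs = idx % W
        b = torch.arange(B).unsqueeze(1)
        box_wh = torch.stack((wh[b, 0, ys, xs], wh[b, 1, ys, xs]), dim=-1)
        keep = scores > score_thresh                      # confidence filtering
        return xs, ys, box_wh, classes, scores, keep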
Figure 4: Samples from the datasets used throughout our work: (a) Kinect/BOP dataset sample, (b) PBR dataset sample, (c) BlendTorch samples with annotated bounding boxes and occluders. Notice that PBR gives a much more realistic impression of the scene than BlendTorch using DR. Yet, we show DR is simpler to implement, generates images faster, and yields a better performing model.

E. Average Precision

We assess the quality of each training dataset by computing the mean Average Precision (mAP) [16] from model predictions on the unseen BOP test dataset (see Section IV-A) using the Intersection over Union (IoU) evaluation metric. The additional training runs per model allow us to compute the mAP with confidence. Unless otherwise stated, we report the mAP averaged over runs; error bars indicate ± one standard deviation. We consider up to 25 model predictions that surpass a minimum confidence score of 0.1 during computation of the mAP.

Table I compares the average precision achieved for each training dataset over the course of 10 runs. BlendTorch outperforms the photo-realistic and real image datasets. Although the BOP test dataset contains scenes similar to RealKinect, the characteristics of the cameras are different. Despite moderate model regularization, the network begins to overfit on the smaller RealKinect dataset, resulting in poor test performance. BlendTorch generated images only slightly exceed the data provided by the photo-realistic PBR dataset. However, BlendTorch trained models exhibit only half the standard deviation of all other models, making their performance more predictable in real world applications. This is shown clearly in Figure 6, which compares the precision-recall behaviour of all three training datasets.
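For reference, the IoU criterion underlying the mAP computation above reduces to a few lines for axis-aligned boxes; this is a generic sketch (boxes given as x1, y1, x2, y2), not code from the evaluated pipeline.

    def iou(a, b):
        # a, b: boxes as (x1, y1, x2, y2)
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
        return inter / union if union > 0 else 0.0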
Figure 5: T-Less dataset objects. The 30 individual objects are re-grouped into 6 super-classes (Class 0 to Class 5) in this work. Displayed in false colors for better visibility.
F. Runtime
We compare the time it takes to generate and receive synthetic images using BlendTorch. All experiments are performed on the same machine (NVIDIA GeForce GTX 1080 Ti) and software (Blender 2.90, PyTorch 1.6). Table II shows the resulting runtimes for image data synthesis. Compared to photo-realistic rendering, BlendTorch creates images at interactive frame rates, even for physics-enabled scenes involving millions of vertices.
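Frame rates of this kind can be measured directly on the training side by timing the data loader. The sketch below follows the pattern of Figure 3; the scene file names are hypothetical, while batch size 8 and 4 PyTorch workers follow the shared settings listed with Table II.

    import time
    from torch.utils import data
    import blendtorch.btt as btt

    largs = dict(scene='tless.blend', script='tless.blend.py',   # hypothetical scene files
                 num_instances=1, named_sockets=['DATA'])
    with btt.BlenderLauncher(**largs) as bl:
        ds = btt.RemoteIterableDataset(bl.launch_info.addresses['DATA'])
        dl = data.DataLoader(ds, batch_size=8, num_workers=4)
        n, t0 = 0, time.time()
        for item in dl:
            n += item['image'].shape[0]
            if n >= 512:                                          # measure over a fixed budget
                break
        print(f'{n / (time.time() - t0):.1f} images/s')

Timings obtained this way include rendering plus the time to receive the data in training, matching what Table II reports.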
V. CONCLUSION
We introduced BlendTorch, a real-time, adaptive Domain Randomization library for infinite artificial training data generation in industrial applications. Our object detection experiments show that, all other parameters being equal, models trained with BlendTorch data outperform models trained on photo-realistic or even real training datasets. Moreover, models trained with domain randomized data exhibit less performance variance over multiple runs. This makes DR a useful technique to compensate for the lack of training data in industrial machine learning applications. In the future we plan to explore the possibilities of adjusting simulation parameters with respect to training progress and to apply BlendTorch in the medical field, which exhibits similar shortcomings of real training data.

Figure 6: Precision-recall curves resulting from evaluating object detection performance on the varying training datasets (RealKinect mAP=0.415, PBR mAP=0.501, BlendTorch mAP=0.529). Despite the non-realistic appearance of domain randomized scenes in BlendTorch, our approach outperforms the photo-realistic and the real image dataset at all IoU thresholds. We repeat the process 10 times and indicate the variations as ±σ error bars. Note that models trained with BlendTorch data have significantly less variability.

Table I: Performance comparison. Average precision values over 10 training runs. mAP refers to the mean average precision computed by integrating over all classes, Intersection over Union (IoU) thresholds and all runs. σ_mAP denotes the standard deviation of mAP measured over 10 runs. The two AP columns represent average precision values for specific IoU thresholds averaged over all runs and classes. Finally, mAP_Ci are average precision values for specific classes averaged over all runs and thresholds. Except for σ_mAP, higher values indicate better performance.

Dataset     mAP   σ_mAP  AP    AP    mAP_C0  mAP_C1  mAP_C2  mAP_C3  mAP_C4  mAP_C5
RealKinect  41.1  3.4    71.7  42.9  29.9    52.5    37.0    47.3    28.4    53.7
PBR         50.1  2.9    70.8  58.8

Table II: BlendTorch vs. photo-realistic rendering times. Timings are in frames per second; higher values are better. All timings include the total time spent in rendering plus the time it takes to receive the data in training. We show timings for two different scenes, render engines and various numbers of parallel simulation instances. Cube represents a minimal scene having one object, whereas T-Less refers to a complex scene involving multiple objects, occluders and physics. Shared settings across all experiments: batch size (8), a fixed image size, and number of PyTorch workers (4).

Renderer               Instances  Cube [Hz]  T-Less [Hz]
BlendTorch (Eevee)     1          43.5       5.74
BlendTorch (Eevee)     several    111.1      18.2
Photo-realistic (PBR)  1          0.8        0.44
Photo-realistic (PBR)  several    1.9        0.6

REFERENCES
[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[2] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[3] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[4] C. Shorten and T. M. Khoshgoftaar, "A survey on image data augmentation for deep learning," Journal of Big Data, vol. 6, no. 1, p. 60, 2019.
[5] J. Tremblay, A. Prakash, D. Acuna, M. Brophy, V. Jampani, C. Anil, T. To, E. Cameracci, S. Boochoon, and S. Birchfield, "Training deep networks with synthetic data: Bridging the reality gap by domain randomization," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 969–977.
[6] Blender Online Community, Blender - a 3D modelling and rendering package.
[7] A. Paszke et al., "PyTorch: An imperative style, high-performance deep learning library," in Advances in Neural Information Processing Systems, 2019, pp. 8026–8037.
[8] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel, "Domain randomization for transferring deep neural networks from simulation to the real world," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2017, pp. 23–30.
[9] F. Sadeghi and S. Levine, "CAD2RL: Real single-image flight without a single real image," arXiv preprint arXiv:1611.04201, 2016.
[10] M. Denninger, M. Sundermeyer, D. Winkelbauer, D. Olefir, T. Hodan, Y. Zidan, M. Elbadrawy, M. Knauer, H. Katam, and A. Lodhi, "BlenderProc: Reducing the reality gap with photorealistic rendering," 2019.
[11] T. To, J. Tremblay, D. McKay, Y. Yamaguchi, K. Leung, A. Balanon, J. Cheng, W. Hodge, and S. Birchfield, NDDS: NVIDIA deep learning dataset synthesizer, https://github.com/NVIDIA/Dataset_Synthesizer, 2018.
[12] M. Schwarz and S. Behnke, "Stillleben: Realistic scene synthesis for deep learning in robotics," arXiv preprint arXiv:2005.05659, 2020.
[13] C. Heindl, S. Zambal, and J. Scharinger, "Learning to predict robot keypoints using artificially generated images," in IEEE International Conference on Emerging Technologies and Factory Automation (ETFA), IEEE, 2019, pp. 1536–1539. ISBN: 978-1-7281-0303-7. DOI: 10.1109/ETFA.2019.8868243.
[14] P. Hintjens, ZeroMQ: Messaging for Many Applications. O'Reilly Media, Inc., 2013.
[15] T. Hodan, P. Haluza, Š. Obdržálek, J. Matas, M. Lourakis, and X. Zabulis, "T-LESS: An RGB-D dataset for 6D pose estimation of texture-less objects," in IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE, 2017, pp. 880–888.
[16] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, "The PASCAL visual object classes (VOC) challenge," International Journal of Computer Vision, vol. 88, no. 2, pp. 303–338, 2010.
[17] T. Hodaň, M. Sundermeyer, B. Drost, Y. Labbé, E. Brachmann, F. Michel, C. Rother, and J. Matas, "BOP challenge 2020 on 6D object localization," European Conference on Computer Vision Workshops (ECCVW), 2020.
[18] X. Zhou, V. Koltun, and P. Krähenbühl, "Tracking objects as points," ECCV, 2020.
[19] X. Zhou, D. Wang, and P. Krähenbühl, "Objects as points," arXiv preprint arXiv:1904.07850, 2019.
[20] F. Yu, D. Wang, E. Shelhamer, and T. Darrell, "Deep layer aggregation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2403–2412.
[21] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2009, pp. 248–255.
[22] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.