A novel method for object detection using deep learning and CAD models
Igor Garcia Ballhausen Sampaio, Luigy Machaca, José Viterbo, Joris Guérin
Computing Institute, Universidade Federal Fluminense, Brazil
LAAS-CNRS, ONERA, Université de Toulouse, France
{igorgarcia, luigyarcana, jviterbo}@id.uff.br, [email protected]

Keywords: Object Detection, CAD Models, Synthetic Image Generation, Deep Learning, Convolutional Neural Network

Abstract: Object Detection (OD) is an important computer vision problem for industry, which can be used for quality control in production lines, among other applications. Recently, Deep Learning (DL) methods have enabled practitioners to train OD models that perform well on complex real-world images. However, the adoption of these models in industry is still limited by the difficulty and the significant cost of collecting high-quality training datasets. On the other hand, when applying OD in the context of production lines, CAD models of the objects to be detected are often available. In this paper, we introduce a fully automated method that takes a CAD model of an object as input and returns a fully trained OD model for detecting this object. To do this, we created a Blender script that generates realistic labeled datasets of images containing the object, which are then used to train the OD model. The method is validated experimentally on two practical examples, showing that this approach can produce OD models that perform well on real images while being trained only on synthetic images. The proposed method has the potential to facilitate the adoption of object detection models in industry, as it is easy to adapt to new objects and highly flexible. Hence, it can result in significant cost reductions, gains in productivity and improved product quality.
1 INTRODUCTION

Recently, Deep Learning (DL) has produced excellent results for Object Detection (OD) (Liu et al., 2020). On the one hand, a typical limitation of DL is the requirement of large labeled datasets for training. Indeed, although various large databases for OD are available online, specific industrial applications always require creating custom datasets containing the objects of interest. While the scenarios present in public datasets are useful from both a research and an application standpoint, it was found that industrial applications, such as bin picking or defect inspection, have quite different characteristics that are not modeled by the existing datasets (Drost et al., 2017). As a result, methods that perform well on existing datasets sometimes show different results when applied to industrial scenarios without retraining. The process of generating a specific dataset for retraining is tedious and can be error-prone when conducted by non-professional technicians. Moreover, generating a new dataset and labeling it manually can be very time-consuming and expensive (Jabbar et al., 2017).

On the other hand, OD for industrial production lines presents the specificity that manufacturers often have access to the CAD models of the objects to detect. Thanks to advances in computer graphics techniques, such as ray tracing (Shirley and Morley, 2003), the generation of photo-realistic images is now possible. In such artificially generated images, the computer can be employed to obtain bounding box labeling for free. The use of synthetic images rendered from CAD models to train OD models has already been proposed in (Peng et al., 2015), (Rajpura et al., 2017) and (Hinterstoisser et al., 2018). However, these approaches are not automated, as they require manual scene creation by Blender artists. In addition, the objects used in these works are usually generic, such as buses, airplanes, cars or animals.

The main contribution of this paper is to present a new method for training OD models on synthetic images generated from CAD models that is fully automatic and thus well suited for industrial use. The proposed method consists of the automatic generation of realistic labeled images containing the objects to be detected, followed by the fine-tuning of a pretrained OD model on the artificial dataset. An extensive study is conducted to properly select the user-defined parameters so as to maximize the performance on real-world images. Our method is evaluated using the CAD models of two industrial objects for training, as well as real labeled images containing the objects for evaluation. The results obtained are very promising, as we manage to get F1-scores above 90% on real images while training only on synthetic images.

This paper is organized as follows. Section 2 presents related work in the fields of object detection and deep learning for industry. Section 3 provides detailed explanations about the proposed method. Section 4 describes our experiments, presents the results obtained and discusses them. Finally, conclusions and directions for future work are presented in Section 5.

2 RELATED WORK

This section presents related work about OD, industrial applications of DL-based computer vision, as well as computer vision methods using CAD models.
OD is a challenging computer vision problem that consists in locating instances of objects from predefined categories in natural images (Prasad, 2012). It has many applications in various domains, such as autonomous driving, security and medical diagnosis (Xiao et al., 2020). Deep learning techniques have emerged as a powerful strategy for learning characteristic representations directly from data and have led to significant advances in the field of generic object detection (Liu et al., 2020). In the last decade, many competitions for object detection have been held to provide large annotated datasets to the community, and to unify the benchmarks and metrics for fair comparison between proposed methods (Everingham et al., 2010), (Lin et al., 2014), (Zhou et al., 2017), (Kuznetsova et al., 2018).

Some examples of OD methods proposed within the last few years include (He et al., 2015), where the authors propose a new network structure, called SPP-net, which can generate a fixed-length representation regardless of the size/scale of the image. Other works, such as (Jana et al., 2018), aim to improve processing speed while still identifying objects in the image efficiently. Finally, deeper CNNs have led to record-breaking improvements in the detection of more general object categories, a shift which came about when DCNNs began to be successfully applied to image classification (Liu et al., 2020).
Although general-purpose OD methods have greatly improved thanks to the availability of large public datasets, the detection of instances in the industrial context must be approached differently, since annotated images are generally rare or not available. Indeed, to train a deep learning model, hundreds of annotated images for each object category are needed. Specific datasets need to be collected and annotated for each target application. This process is time-consuming and laborious, and increases the burden on operators, which goes against the goal of industrial automation (Cohen et al., 2020), (Ge et al., 2020).

A public dataset adapted to the industrial context was developed in (Drost et al., 2017). Unlike other 3D object detection datasets, this work models industrial waste collection and object inspection tasks, which often face different challenges. In addition, the evaluation criteria are focused on practical aspects, such as execution times, memory consumption, and useful measures of correctness and precision. Other examples of datasets adapted to the industrial context include (Guérin et al., 2018b) and (Guérin et al., 2018a).

Finally, in (Yang et al., 2019), a method to detect defects of tiny parts in real time was developed, based on object detection and deep learning. To improve their results, the authors consider the specificities of the industrial application in their method, such as the properties of the parts, the environmental parameters and the speed of movement of the conveyor. This is a good example of adapting OD training methods to the specific constraints of the industrial context.
The first commercial CAD programs came up in the 1970s, providing functions for 2D drawing and data archival, and evolved into the main engineering design tool (Lindsay et al., 2018), (Hirz et al., 2017). These models can provide a scalable solution for intelligent and automatic object recognition, tracking and augmentation based on generic object models (Ben-Himane et al., 2010). For example, CAD models have been used to support multi-view detection (Zhang et al., 2013). In (Peng et al., 2015), 3D models were used as the primary source of information to build object models. In other works, 3D CAD models were used as the only source of labeled data (Lin et al., 2014), (Everingham et al., 2010), but they are limited to generic categories, such as cars and motorcycles.

Figure 1: Overview of the proposed method for training an object detection network using a CAD model. (a) Training. (b) Inference.
3 PROPOSED METHOD

An overview of the proposed method can be seen in Figure 1. First, a custom Blender code is used to generate labeled training images containing the rendered CAD model in context. Then, a pretrained object detection model is fine-tuned on the generated dataset. Finally, the model can be used for inference on real images (Figure 1b).
For the automatic generation of the training images, the software Blender (Blender Online Community, 2018) is used. Blender is a powerful software for 3D design, which includes features such as modeling, rigging, simulation and rendering. Blender has a good Python API, is open-source and has good GPU support.

In order to generate a synthetic training image sample, our code requires several elements. First, a CAD model of the object of interest, as well as several other industrial CAD models, need to be available. In the experiments of this paper, we use the two objects shown in Figure 2, for which we also have real-world test images. The other objects serve as distractors to help the model focus on the right object. The CAD models for the distractors are gathered from the GrabCAD website (https://grabcad.com/). Different textures for the distractors, as well as for the background, are gathered from the Poliigon website. Finally, the color and texture of the object of interest are reproduced manually.

Once we have access to all the elements above, the generation code goes as follows. A floor and a table are created and some distractors are sampled. Using physics simulation, the distractors are dropped from a random height onto the table. The position of the object of interest is also randomly sampled. Once the 3D scene is created, textures and colors are sampled for the background and the distractors, and the entire scene is textured. Light sources and cameras are also sampled and placed randomly. Constraints on the camera pose are applied in order to ensure that the object appears in the camera view. Once the scene has been created, the rendering occurs and generates an image. By removing the light sources and making the object of interest a light source itself, we can generate another image which is used for bounding box labeling. This procedure is necessary because, even if we know the location of the object, it can be partly hidden by distractors, which would distort the labeling. Example images generated using our Blender code can be seen in Figure 3.

Figure 3: Example of images generated using our custom Blender script. (a) Adblue. (b) Yamaha logo.
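To make the generation loop described above concrete, the sketch below shows how such a scene-randomization script can be structured with Blender's Python API (bpy). It is a minimal illustration written for this article, not the authors' actual script: the table is a simple plane, the distractors are primitive cubes standing in for imported CAD meshes, and all paths, counts and sampling ranges are placeholder assumptions.

```python
# Minimal Blender (bpy) sketch of the scene-randomization loop described
# above. Placeholder geometry and ranges; run inside Blender's Python.
import random
import bpy

N_IMAGES = 100       # images per run (value used in the parameter study)
N_DISTRACTORS = 20   # best-performing distractor count (see Table 1)

def reset_scene():
    """Delete every object left over from the previous sample."""
    bpy.ops.object.select_all(action='SELECT')
    bpy.ops.object.delete()

def add_table():
    """A plane acting as the table; passive rigid body so parts rest on it."""
    bpy.ops.mesh.primitive_plane_add(size=2.0, location=(0.0, 0.0, 0.0))
    table = bpy.context.active_object
    bpy.ops.rigidbody.object_add()
    table.rigid_body.type = 'PASSIVE'

def drop_distractor():
    """Placeholder distractor: a real script would import a CAD mesh
    instead of adding a primitive cube."""
    bpy.ops.mesh.primitive_cube_add(
        size=random.uniform(0.05, 0.15),
        location=(random.uniform(-0.5, 0.5),
                  random.uniform(-0.5, 0.5),
                  random.uniform(0.5, 1.5)))
    bpy.ops.rigidbody.object_add()   # active body: falls under gravity

for i in range(N_IMAGES):
    reset_scene()
    add_table()
    for _ in range(N_DISTRACTORS):
        drop_distractor()
    # Step the physics simulation so the dropped objects settle.
    for frame in range(1, 50):
        bpy.context.scene.frame_set(frame)
    # Random light; camera-pose constraints keeping the object of interest
    # in view would be enforced here.
    bpy.ops.object.light_add(type='POINT',
                             location=(random.uniform(-2, 2),
                                       random.uniform(-2, 2),
                                       random.uniform(2, 4)))
    bpy.ops.object.camera_add(location=(0.0, -2.0, 1.5),
                              rotation=(1.1, 0.0, 0.0))
    bpy.context.scene.camera = bpy.context.active_object
    bpy.context.scene.render.filepath = f"/tmp/synth_{i:04d}.png"
    bpy.ops.render.render(write_still=True)
```

A full implementation would additionally place the object of interest, sample textures for the floor, table and distractors, and render the second pass in which the object itself is the only light source, from which the bounding box label is extracted.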
In this work, we did not train a new CNN architecture from scratch. Instead, we used one of the pretrained models provided by the TensorFlow Object Detection API (Huang et al., 2017). This approach is called transfer learning and consists in starting training from a model that already has basic feature extraction skills and is less likely to overfit the synthetic datasets. Indeed, the diversity that we can create with Blender is limited, as we cannot get an infinite amount of textures and distractors, and the diversity already encountered by the network during pre-training can help reduce overfitting. In addition, using a network pre-trained on real images can prevent the network from learning detection features that depend too much on the generation procedure.

There exist several models in the TensorFlow OD model zoo. More information on the detection performance, as well as the reference execution times, for each of the available pre-trained models can be found on the GitHub page of the API (https://github.com/tensorflow/models). In practice, the model used in this paper is the faster_rcnn_inception_v2_coco model, which provides a good trade-off between performance and speed.

Faster R-CNN, the model used in this work, takes as input an entire image and a set of object proposals. The network first processes the whole image with several convolutional and max pooling layers to produce a convolutional feature map. Then, for each object proposal, a region of interest (RoI) pooling layer extracts a fixed-length feature vector from the feature map. Each feature vector is fed into a sequence of fully connected layers that finally branch into two sibling output layers: one that produces softmax probability estimates over K object classes plus a catch-all "background" class, and another that outputs four real-valued numbers for each of the K object classes. Each set of 4 values encodes refined bounding-box positions for one of the K classes. For a more detailed view of Faster R-CNN, we refer the reader to the original paper (Ren et al., 2015) or to the following tutorial (Ananth, 2019).

To train the final OD model, the TensorFlow OD API requires a specific file structure for the training images and labels. This step is carried out automatically by our script.
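As an illustration of the input format mentioned above, the TensorFlow OD API consumes TFRecord files of tf.train.Example protos with standardized feature keys. The sketch below shows one plausible way to build such an example for a single generated image; the helper name and the single-class 'part' label are our own assumptions, not the paper's code.

```python
# Hedged sketch: building one tf.train.Example in the feature layout used by
# the TensorFlow Object Detection API dataset tools.
import tensorflow as tf

def _int64(values): return tf.train.Feature(int64_list=tf.train.Int64List(value=values))
def _bytes(values): return tf.train.Feature(bytes_list=tf.train.BytesList(value=values))
def _float(values): return tf.train.Feature(float_list=tf.train.FloatList(value=values))

def make_example(jpeg_bytes, width, height, boxes, label_id=1, label=b'part'):
    """boxes: list of (xmin, ymin, xmax, ymax) in absolute pixels."""
    feature = {
        'image/height': _int64([height]),
        'image/width': _int64([width]),
        'image/encoded': _bytes([jpeg_bytes]),
        'image/format': _bytes([b'jpeg']),
        # The API expects box corners normalized to [0, 1].
        'image/object/bbox/xmin': _float([b[0] / width for b in boxes]),
        'image/object/bbox/ymin': _float([b[1] / height for b in boxes]),
        'image/object/bbox/xmax': _float([b[2] / width for b in boxes]),
        'image/object/bbox/ymax': _float([b[3] / height for b in boxes]),
        'image/object/class/text': _bytes([label] * len(boxes)),
        'image/object/class/label': _int64([label_id] * len(boxes)),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

# Usage, one call per rendered image:
# with tf.io.TFRecordWriter('train.record') as writer:
#     writer.write(make_example(jpeg, w, h, boxes).SerializeToString())
```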
The Blender script used for image generation has many hyperparameters that must be chosen before using it, such as the number of distractors, the number of scenes generated or the resolution of the synthetic images. Hence, we conducted a set of experiments to properly select these parameters in order to optimize the OD results for inference on real images. In this section, we explain the parameter selection procedure. In other words, we present the dataset on which the different sets of parameters were evaluated, as well as the metrics used to assess the quality of the results obtained with a given set of parameters. The results obtained for this parameter selection procedure are presented in Section 4.

The objective of this work is to validate that an object detector trained on synthetic images can generalize to real-world industrial cases. Hence, we use a test dataset composed of 380 real images containing the objects corresponding to the CAD models used for training. The bounding box annotation files for the test images are generated manually using a software called LabelImg (https://github.com/tzutalin/labelImg). This application allows us to draw and save the annotations of each image as xml files in the PASCAL VOC format (Everingham et al., 2010).
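For evaluation, these xml files must be read back as box coordinates. The short helper below is an illustrative way to do so with the Python standard library; the function name is ours, but the tag names (object, name, bndbox, xmin, ...) are those of the PASCAL VOC format written by LabelImg.

```python
# Read a PASCAL VOC annotation file into (label, xmin, ymin, xmax, ymax) tuples.
import xml.etree.ElementTree as ET

def load_voc_boxes(xml_path):
    root = ET.parse(xml_path).getroot()
    boxes = []
    for obj in root.findall('object'):
        name = obj.find('name').text      # class label stored by LabelImg
        bb = obj.find('bndbox')
        boxes.append((name,
                      int(float(bb.find('xmin').text)),
                      int(float(bb.find('ymin').text)),
                      int(float(bb.find('xmax').text)),
                      int(float(bb.find('ymax').text))))
    return boxes
```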
Figure 4: Intersection over Union (IoU) computation.

In order to evaluate the quality of the trained model on real images, and thus to be able to select the best hyperparameters for image generation and training, we use the standard OD metrics presented here.

Intersection over Union (IoU) is an evaluation metric used to measure how well a predicted bounding box matches a ground-truth bounding box. For a pair of bounding boxes, IoU is defined as the area of their intersection divided by the area of their union (Figure 4). If A corresponds to the ground-truth box and B to the predicted box, then IoU is computed as:
$$\mathrm{IoU} = \frac{|A \cap B|}{|A \cup B|}, \quad (1)$$

where $|\cdot|$ denotes the area of a given shape. The numerator is called the overlap area and the denominator the combined area. IoU ranges between 0 and 1, where 1 means that the bounding boxes are identical and 0 that there is no overlap.
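As a sanity check of Eq. (1), here is a direct implementation for axis-aligned boxes given as (xmin, ymin, xmax, ymax) tuples; an illustrative helper rather than code from the paper.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes, Eq. (1)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap area (intersection): clamp to zero when the boxes are disjoint.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    # Combined area (union).
    union = ((ax2 - ax1) * (ay2 - ay1)
             + (bx2 - bx1) * (by2 - by1) - inter)
    return inter / union if union > 0 else 0.0
```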
Precision, Recall, F1-Measure. We call confidence score the probability that an anchor box contains an object of a certain class; it is usually predicted by the classifier part of the object detector. The confidence score and IoU are used as the criteria to determine whether a detection is a true positive or a false positive. Given a minimal threshold on the confidence score for bounding box acceptance, and another threshold on IoU to identify matching boxes, a detection is considered a true positive (TP) if there exists a ground truth such that: confidence score > threshold; the predicted class matches the class of the ground truth; and IoU > threshold_IoU. The violation of either of the last two conditions generates a false positive (FP). In case multiple predictions correspond to the same ground truth, only the one with the highest confidence score counts as a true positive, while the others are considered false positives. When a ground-truth bounding box is left without any matching predicted detection, it counts as a false negative (FN). If we denote by TP, FP and FN respectively the number of true positives, false positives and false negatives in a dataset, we can define the following metrics:
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \quad (2)$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}, \quad (3)$$

$$F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}. \quad (4)$$

A high precision means that most of the predicted boxes had a corresponding ground truth, i.e., the object detector is not producing bad predictions. A high recall means that most of the ground-truth boxes had a corresponding prediction, i.e., the object detector finds most objects in the images. The F1-score is the harmonic mean of precision and recall; it is useful when a balance between precision and recall is sought.

In the case of object detection on production lines, a low precision means that a part might sometimes be absent without the model noticing it, whereas a low recall means that the part is sometimes present and the model raises an alert anyway. For this reason, both good recall and good precision are required, and the choice of the F1-score metric seems appropriate.
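These matching rules translate directly into code. The sketch below, an illustrative implementation rather than the authors' evaluation code, greedily matches confidence-sorted detections of a single class to ground truths (reusing the iou helper above) and then applies Eqs. (2)-(4); the default thresholds are the values used in Section 4.

```python
def match_detections(dets, gts, conf_thr=0.9, iou_thr=0.5):
    """Count TP/FP/FN for one image and one class.
    dets: list of (box, confidence); gts: list of ground-truth boxes."""
    dets = sorted((d for d in dets if d[1] >= conf_thr),
                  key=lambda d: d[1], reverse=True)
    matched, tp, fp = set(), 0, 0
    for box, _ in dets:
        best_iou, best_j = 0.0, None
        for j, gt in enumerate(gts):
            overlap = iou(box, gt)
            if j not in matched and overlap > best_iou:
                best_iou, best_j = overlap, j
        if best_j is not None and best_iou > iou_thr:
            matched.add(best_j)       # true positive: best unmatched ground truth
            tp += 1
        else:
            fp += 1                   # poor localization or duplicate detection
    fn = len(gts) - len(matched)      # ground truths left unmatched
    return tp, fp, fn

def precision_recall_f1(tp, fp, fn):
    """Eqs. (2)-(4); returns zeros when a denominator vanishes."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```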
Average Precision. After an OD model has been trained, the computation of precision, recall and F1-score depends on the values of the two thresholds defined above (for the confidence score and IoU). In order to properly choose the values of these thresholds, it is interesting to analyze the Precision x Recall curves. For each class, and for a given value of the IoU threshold, the confidence threshold is treated as a variable and sampled between 0 and 1 to plot a parametric curve with recall and precision as the x- and y-axes.

A class-specific object detector is considered good if the precision remains high as the recall increases, meaning that precision and recall stay high as the confidence threshold varies. Hence, to compare curves, we generally rely on a numerical metric called Average Precision (AP). Since 2010, the standard computation method for AP consists in calculating the area under the curve (AUC) of the Precision x Recall curve (Everingham et al., 2010).
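For completeness, one standard way to compute this area is the all-points interpolation used for PASCAL VOC evaluation since 2010; the sketch below is illustrative and operates on precision/recall pairs obtained by sweeping the confidence threshold.

```python
import numpy as np

def average_precision(precisions, recalls):
    """AUC of the Precision x Recall curve, PASCAL VOC 2010+ style."""
    order = np.argsort(recalls)
    r = np.concatenate(([0.0], np.asarray(recalls, float)[order], [1.0]))
    p = np.concatenate(([0.0], np.asarray(precisions, float)[order], [0.0]))
    # Precision envelope: make the curve monotonically non-increasing.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    idx = np.where(r[1:] != r[:-1])[0]      # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))
```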
4 EXPERIMENTS AND RESULTS
The results obtained for the parameter selection procedure, as well as our final evaluations, are presented here. These experiments were conducted on an Nvidia Quadro P5000 GPU and a 2.90 GHz Intel Xeon E3-1545M v5 processor with 16 GB of RAM.
The parameter selection procedure is conducted exclusively on the Yamaha logo object; the best set of parameters is then tested on the Adblue object to ensure that it also performs well. The influence of four tunable parameters on the final results is studied here. For each parameter, three values were selected for the tests. These parameters and their studied values are:

• Resolution
• Camera poses: 2, 5, 20
• Number of scenes: 20, 50, 200
• Number of distractors: 0, 5, 20

From simple preliminary experiments that are not presented here, we concluded that the number of textures used for the floor, the distractors and the support should be set to the maximum number of textures available (in our case, 7 for the floor and 6 for the two others). The parameter values used in this work were chosen empirically, that is, after several test scenarios, these values were the ones that produced the best performance with respect to the metrics.

In total, from the values selected for the four parameters, we sampled more than 30 combinations and compared the OD results on the test set of real images. For each combination tested, we trained the Faster R-CNN model on the synthetic images that were generated. We note that, for each hyperparameter combination, the experiments were repeated 10 times in order to attenuate the influence of the random components in the generation and training process. For reasons of space, it was not possible to present all results in this article. However, in order to demonstrate the importance of this parameter selection step, Table 1 shows the best and the worst configurations that were tested.

In Table 1, we can see that the distractors are an essential element in our proposed pipeline for image generation. Indeed, when removing them, we see a drop of around 23% in F1-score, on average across the 10 experiment samples. We also tried combinations with few distractors, but the F1-score results still dropped significantly. This makes sense, as the real evaluation images also contained several distractors.
Table 1: Best and worst hyperparameter configurations obtained and their corresponding results.

Parameter                      Best Case    Worst Case
Resolution                     960x540      960x540
Camera poses                   5            5
Number of scenes               20           20
Number of images               100          100
Number of distractors          20           0
Generation time (s)            1257.30      749.92
Number of samples              10           10
Precision % avg (std. dev.)    (15.41)      (16.27)
Recall % avg (std. dev.)       (4.07)       (6.45)
F1-Score % avg (std. dev.)     (10.57)      (11.49)
Table 2: Results obtained with the best set of hyperparameters.

Object         Precision %    Recall %    F1-Score %
Adblue         85.11          80.00       81.93
Yamaha logo    78.06          96.22       85.19
Another important point is that the resolution of the generated images should be greater than that of the inference images. In all of our tests, this configuration always produced the best results. It also makes sense, as it is easier to learn from a more detailed/complex model and then evaluate in a less detailed/complex scenario.

Finally, we also tried to increase the number of generated training images to see if this would lead to an increase in performance. Surprisingly, we observed that the performance dropped for the case with 20 distractors, 20 camera poses and 50 scenes (1000 images). This might mean that, when presented with too many synthetic images, the model starts overfitting to the biases introduced by our generation process; it also indicates that we do not need a large number of images to train our model. In addition to this performance drop, generating ten times more images also makes the proposed pipeline almost 25 times slower (31037.16 seconds).
In this section, we evaluate the results of the best combination of parameters (Best Case from Table 1) in more detail. These results are presented in Table 2; they correspond to using a confidence threshold of 0.9 and an IoU threshold of 0.5. From Table 2, we can see that the best parameters, identified using only the Yamaha logo, produce similar results when applied to another object (Adblue). This suggests that the proposed parameters are well suited for different objects and thus could generalize to various industrial use cases without additional parameter tuning.

Discussion
It is difficult to compare our results with other works in the literature. Indeed, as far as we know, the approach presented in this work is the first proposal of a fully automated pipeline that takes as input the CAD model of an object and outputs a trained object detection model for this object without any real image. A fair comparison would require other end-to-end systematic approaches for building OD models from CAD models, which do not exist yet. We hope that the results presented in this work can serve as a baseline for comparison in future works in this research direction.

However, we give a rough comparison with other relevant works to give an idea of how well our approach performs. In (Mazzetto et al., 2019), the detection of objects in an automobile production line was implemented using only real images of the objects. In that work, the estimated detection accuracy was around 90%, which is only about 5 to 10% better than the results obtained in our work using only synthetic images. In (Jabbar et al., 2017), the authors also train an OD model using synthetic images generated with Blender and evaluate the results on real images. However, their approach is not entirely automated, since the scenes are created manually by Blender artists to ensure photo-realism. The object used for evaluation in that work is a glass of wine, and the maximum AP obtained is 71.14%. Our systematic approach thus seems to work better, although we cannot reproduce their method on our objects, as we cannot create the scenes manually the way they would. The potentially better performance of our approach can be explained by the fact that the loss of photo-realism is compensated by the higher number of images in our synthetic datasets. Indeed, with our fully automated approach, generating more data is fast and requires no effort, unlike in (Jabbar et al., 2017).
5 CONCLUSION

This work presented a systematic approach to train object detection models for industrial scenarios, using only a CAD model of the object of interest as input. The method first generates realistic synthetic images using a custom Blender script, and then trains a Faster R-CNN OD model using the TensorFlow OD API. To understand and optimize the different parameters in the proposed pipeline, a systematic parameter selection study was conducted, using a Yamaha logo CAD model for training and real images containing the same object in context for evaluation. The selected hyperparameters were then tested on another object, showing that they can generalize to different scenarios.

Over the last decade, successful deep learning methods have been developed to tackle the challenging problem of generic object detection. However, when it comes to OD in an industrial environment, the availability of good quality data becomes a bottleneck. To address this issue, we proposed to use synthetic images for training, which is challenging as they might not reflect the high variability found in real industrial environments (objects, pieces, scenery, etc.). In addition, it is also difficult to find CAD models of specific industrial objects on which approaches can be trained, tested and compared. Thus, as a byproduct of this work, a dataset was produced and made publicly available for future research (https://github.com/igorgbs/systematic_approach_cadmodels).

Therefore, the main conclusion of this work is that it is possible to train an object detection model with excellent performance on a set of synthetic images generated from CAD models. In addition, it was shown that a large set of images is not needed to obtain significant results. Our experiments indicate that the proposed rendering process is sufficient to obtain good performance, and that the way of building and rendering the scenes is crucial for the final result.

ACKNOWLEDGEMENTS
Our work has benefited from the AI Interdisciplinary Institute ANITI. ANITI is funded by the French "Investing for the Future – PIA3" program under the Grant agreement n°ANR-19-PI3A-0004.
REFERENCES
Ananth, S. (2019). Faster R-CNN for object detection, a technical paper summary.

Ben-Himane, S., Hintestroisser, S., and Navab, N. (2010). Computer vision CAD models. US Patent App. 12/682,199.

Blender Online Community (2018). Blender - a 3D modelling and rendering package. Blender Foundation.

Cohen, J., Crispim-Junior, C., Grange-Faivre, C., and Tougne, L. (2020). CAD-based learning for egocentric object detection in industrial context. In Proceedings of the International Conference on Computer Vision Theory and Applications, volume 5, pages 644–651. SCITEPRESS.

Drost, B., Ulrich, M., Bergmann, P., Hartinger, P., and Steger, C. (2017). Introducing MVTec ITODD - a dataset for 3D object recognition in industry. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 2200–2208.

Everingham, M., Van Gool, L., Williams, C. K., Winn, J., and Zisserman, A. (2010). The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338.

Ge, C., Wang, J., Wang, J., Qi, Q., Sun, H., and Liao, J. (2020). Towards automatic visual inspection: A weakly supervised learning method for industrial applicable object detection. Computers in Industry, 121:103232.

Guérin, J., Gibaru, O., Nyiri, E., Thiery, S., and Palos, J. (2018a). Automatic construction of real-world datasets for 3D object localization using two cameras. In IECON 2018 - 44th Annual Conference of the IEEE Industrial Electronics Society, pages 3655–3658. IEEE.

Guérin, J., Gibaru, O., Nyiri, E., Thiery, S., and Boots, B. (2018b). Semantically meaningful view selection. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1061–1066. IEEE.

He, K., Zhang, X., Ren, S., and Sun, J. (2015). Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9):1904–1916.

Hinterstoisser, S., Lepetit, V., Wohlhart, P., and Konolige, K. (2018). On pre-trained image features and synthetic images for deep learning. In Proceedings of the European Conference on Computer Vision (ECCV).

Hirz, M., Rossbacher, P., and Gulanová, J. (2017). Future trends in CAD - from the perspective of automotive industry. Computer-Aided Design and Applications, 14(6):734–741.

Huang, J., Rathod, V., Sun, C., Zhu, M., Korattikara, A., Fathi, A., Fischer, I., Wojna, Z., Song, Y., Guadarrama, S., et al. (2017). Speed/accuracy trade-offs for modern convolutional object detectors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7310–7311.

Jabbar, A., Farrawell, L., Fountain, J., and Chalup, S. K. (2017). Training deep neural networks for detecting drinking glasses using synthetic images. In International Conference on Neural Information Processing, pages 354–363. Springer.

Jana, A. P., Biswas, A., et al. (2018). YOLO based detection and classification of objects in video records, pages 2448–2452. IEEE.

Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., Krasin, I., Pont-Tuset, J., Kamali, S., Popov, S., Malloci, M., Duerig, T., et al. (2018). The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale. arXiv preprint arXiv:1811.00982.

Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer.

Lindsay, A., Paterson, A., and Graham, I. (2018). Identifying and quantifying inefficiencies within industrial parametric CAD models. In Advances in Manufacturing Technology XXXII: Proceedings of the 16th International Conference on Manufacturing Research, volume 8, page 227. IOS Press.

Liu, L., Ouyang, W., Wang, X., Fieguth, P., Chen, J., Liu, X., and Pietikäinen, M. (2020). Deep learning for generic object detection: A survey. International Journal of Computer Vision, 128(2):261–318.

Mazzetto, M., Southier, L. F., Teixeira, M., and Casanova, D. (2019). Automatic classification of multiple objects in automotive assembly line, pages 363–369. IEEE.

Peng, X., Sun, B., Ali, K., and Saenko, K. (2015). Learning deep object detectors from 3D models. In Proceedings of the IEEE International Conference on Computer Vision, pages 1278–1286.

Prasad, D. K. (2012). Survey of the problem of object detection in real images. International Journal of Image Processing (IJIP), 6(6):441.

Rajpura, P. S., Bojinov, H., and Hegde, R. S. (2017). Object detection using deep CNNs trained on synthetic images. arXiv preprint arXiv:1706.06782.

Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99.

Shirley, P. and Morley, R. K. (2003). Realistic Ray Tracing. AK Peters/CRC Press.

Xiao, Y., Tian, Z., Yu, J., Zhang, Y., Liu, S., Du, S., and Lan, X. (2020). A review of object detection based on deep learning. Multimedia Tools and Applications, pages 1–63.

Yang, J., Li, S., Wang, Z., and Yang, G. (2019). Real-time tiny part defect detection system in manufacturing using deep learning. IEEE Access, 7:89278–89291.

Zhang, X., Yang, Y.-H., Han, Z., Wang, H., and Gao, C. (2013). Object class detection: A survey. ACM Computing Surveys (CSUR), 46(1):1–53.

Zhou, B., Lapedriza, A., Khosla, A., Oliva, A., and Torralba, A. (2017). Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence.