Blocks World Revisited: The Effect of Self-Occlusion on Classification by Convolutional Neural Networks
Markus D. Solbach, York University, [email protected]
John K. Tsotsos, York University, [email protected]
Abstract
Despite the recent successes in computer vision, there remain new avenues to explore. In this work, we propose a new dataset to investigate the effect of self-occlusion on deep neural networks. With TEOS (The Effect of Self-Occlusion), we propose a 3D blocks world dataset that focuses on the geometric shape of 3D objects and their omnipresent challenge of self-occlusion. We designed TEOS to investigate the role of self-occlusion in the context of object classification. Even though remarkable progress has been seen in object classification, self-occlusion remains a challenge. In the real world, self-occlusion of 3D objects still presents significant challenges for deep learning approaches. Humans, however, deal with this by deploying complex strategies, for instance, by changing the viewpoint or manipulating the scene to gather the necessary information. With TEOS, we present a dataset of two difficulty levels (L1 and L2), containing 36 and 12 objects, respectively. We provide 768 uniformly sampled views of each object, their masks, object and camera positions and orientations, the amount of self-occlusion, as well as the CAD model of each object. We present baseline evaluations with five well-known classification deep neural networks and show that TEOS poses a significant challenge for all of them. The dataset, as well as the pre-trained models, are made publicly available for the scientific community under https://nvision2.data.eecs.yorku.ca/TEOS.
1. Introduction
Over most of the last decade, computer vision was pushed forward by efforts put into deep learning. The advent of this deep-learning-dominated era is often dated to the ImageNet challenge ([Russakovsky et al., 2015]) in 2012. Since then, the performance of models on various tasks has been improving at unparalleled speed; for instance, image classification on the ImageNet dataset surpassed the reported human-level performance in 2015 ([He et al., 2015]).
Figure 1. Example of the proposed objects from three different viewpoints.

Two of the enablers for the recent successes are faster computers, specifically graphics processors, and the availability of large-scale and often well-curated datasets to learn from. The deep learning paradigm is omnipresent, and, with it, the need for data with specific statistics to work in certain domains. [Kuznetsova et al., 2020] go as far as saying that "Data is playing an especially critical role in enabling computers to interpret images as compositions of objects, an achievement that humans can do effortlessly while it has been elusive for machines so far."
Many domains exist in which one would like machines to perform visual tasks ([Carroll and Others, 1993]). One of these is object classification, which is defined as determining whether a particular item is present in the stimulus ([Tsotsos et al., 2005]). Object classification is an essential capability of humans, as well as of any robotic system whose goal is to be a real-world assistant; in a factory, a hospital, or at home, just to name a few. Even though very successful in many domains, deep learning methods are challenged by occlusion ([Koporec and Pers, 2019]), which is inevitable in real-world scenarios. Here, we go a step further and show that deep learning methods are also challenged by the self-occlusion of objects, hence not generalizing to objects' 3D structure.
The problem of understanding 3D structure from a 2D description, for instance, a line drawing, was first put forward independently by [Huffman, 1971] and [Clowes, 1971], who both showed that the necessary critical condition for a line drawing to represent an actual arrangement of polyhedral objects is labelability. As the human brain is very efficient at reconstructing a scene's 3D structure from a single image with no texture, colour, or shading, efforts were concentrated on computational complexity issues; one might think an efficient (e.g., polynomial-time) solution exists. [Kirousis and Papadimitriou, 1988], however, proved that this problem is NP-complete, even for simple cases like trihedral, solid scenes. To further research in this field, [Parodi et al., 1998] proposed a method to generate random instances of line drawings with a useful distribution to investigate questions related to the complexity of understanding images of polyhedral scenes. More recently, [Solbach et al., 2018] provided a 3D extension with controllable camera parameters and two different light settings. It is designed to enable research on how a program could parse a scene if it had multiple and definable viewpoints to consider. Examples of polyhedral scenes from [Solbach et al., 2018] are shown from three different viewpoints in Figure 2.
Figure 2. Six polyhedral scenes from three different viewpoints with increasing complexity ([Solbach et al., 2018]).
With the increasing successes, contemporary computer vision approaches show a healthy trend away from artificial problems and provide solutions to real-world problems, already deployed in many domains ([Andreopoulos and Tsotsos, 2013]), for example, optical character recognition, industrial inspection systems, medical imaging, and biometrics. At the same time, however, a disparagement of artificial domains can be seen ([Slaney and Thiébaux, 2001]). At the very least, these domains can support meaningful systematic experiments. Here we revisit one such artificial domain: the Blocks World.
In visual perception, the basic physical and geometric constraints of our world play a crucial role. This idea goes back at least to Helmholtz and his argument for unconscious inference. This theme of visual perception can be traced back to the early years of the discipline of computer vision. Larry Roberts argued that "the perception of solid objects is a process which can be based on the properties of three-dimensional transformations and the laws of nature" ([Roberts, 1965]). Roberts' popular Blocks World was an early attempt to build a system for complete scene understanding of a closed artificial world of textureless polyhedral shapes by using a generic library of polyhedral block shapes. This toy domain remained a staple of the AI literature for over 30 years but has since fallen into disrepute. This is due to a superficial understanding of it, leading to insufficient experimental methodology and, therefore, a failure to extract useful results from it ([Slaney and Thiébaux, 2001]).
Figure 3. Example of the objects by [Shepard and Metzler, 1971], which are used as an inspiration.
In this paper, we present TEOS: The Effect of Self-Occlusion. TEOS is a Blocks World based set of objects with known complexity, controlled viewpoints, a known level of self-occlusion, and 3D models. TEOS shares similarities in appearance with the so-called Shepard and Metzler objects ([Shepard and Metzler, 1971]), which are widely used in the literature for mental rotation tasks. See Figure 3 for an illustration of two such objects. Similarities are, for instance, the strict ninety-degree angles of the elements making up an object, the use of only cuboids, and the use of mainly one primitive (except for the base plate). However, with TEOS, we present a set of objects that go beyond the Shepard and Metzler objects. Specifically, our objects have a known complexity, share a common coordinate system, and empirical results have shown that they are challenging for visual tasks using state-of-the-art classification algorithms.
Our contributions are an investigation of the effect of self-occlusion on object classification. To accomplish this, we provide a novel set of objects and a carefully created dataset, including an in-depth explanation of the objects and the generated data with a focus on self-occlusion, and a baseline evaluation with state-of-the-art classification algorithms.
The remainder of the paper is structured as follows. First, we explain in detail the objects we have created for TEOS. We then continue by giving an overview of related work, describing the data acquisition, presenting our self-occlusion measure, evaluating the dataset against state-of-the-art classification algorithms, and finally finishing with our conclusions and future directions.

2. Related Work
To the best of our knowledge, self-occlusion has not attracted much attention in the literature. However, occlusion caused by other objects has. In addition to several datasets, a number of approaches have been introduced to deal with occlusion.
A burden of deep learning is its need for vast amounts of training data. Even though occlusion and its effect on vision tasks has been addressed for some time ([Hsiao et al., 2010, Ouyang and Wang, 2012, Brachmann et al., 2014, Hsiao and Hebert, 2014]), the datasets created are usually too small to train successful deep learning models. Furthermore, to our knowledge, datasets that do consider occlusion mostly introduce various levels of clutter but fail to define occlusion in a generic way. For instance, the CMU Kitchen Occlusion dataset (CMU KO8) by [Hsiao and Hebert, 2014] consists of 1,600 images of eight kitchen objects, such as a baking pan, scissors, etc., which yields only 200 examples per class. The dataset has explicitly been designed to challenge object recognition algorithms with strong viewpoint and illumination changes, occlusions, and clutter. Besides a novel dataset, an occlusion reasoning module is also proposed (further details in Section 2.2).
With the ICCV 2015 Occluded Object Challenge ([Hinterstoisser et al., 2013, Brachmann et al., 2014]), a dataset with eight objects positioned in a realistic setting of heavy occlusion is presented. The objects come from different domains, ranging from animals, over office supplies, to kitchenware. However, neither a definition of occlusion nor a metric is given. Figure 4 shows an example image of the dataset.
Figure 4. A scene with different objects under occlusion taken from the ICCV 2015 Occluded Object Challenge.

The majority of occlusion datasets, however, deal with the occlusion of pedestrians. Specifically, in the context of autonomous driving, detecting pedestrians, even if occluded, is crucial to detecting potential collisions. It is argued that most existing datasets are not designed for evaluating occlusion. For instance, the Caltech dataset ([Dollár et al., 2012]) contains only 105 out of 4,250 images with occluded pedestrians. The CUHK Occlusion Dataset ([Ouyang and Wang, 2012]) is specifically designed as a pedestrian dataset with occlusion. The authors selected images from popular pedestrian datasets, recorded images from surveillance cameras, and filtered them mainly for occluded pedestrians. The dataset contains 1,063 images with a binary label indicating whether the pedestrian is occluded or not.
Reasoning about occlusion has been used in many areas, from object recognition to tracking and segmentation. As reported in [Hsiao and Hebert, 2014], the literature is extensive, but until recently there had been comparatively little work on modelling occlusions from different viewpoints and using 3D information. Occlusion reasoning can be broadly classified into five categories: inconsistent object statistics, multiple images, part-based models, 3D reasoning, and convolutional neural networks.
The first category includes classical approaches, which use inconsistent object statistics to reason about potential occlusion. In general, occlusions are modelled as regions that are inconsistent with object statistics. For instance, [Meger et al., 2011] use inconsistencies in 3D sensor data to classify occlusions. [Girshick et al., 2011] introduce an occluder part in their grammar model when not all parts can be placed. [Wang et al., 2009] use a scoring metric based on individual HOG filter cells. [Hsiao and Hebert, 2014] incorporate occlusion reasoning into object detection in a two-stage manner. First, in a bottom-up stage, occluded regions are hypothesized from image data. Second, a top-down stage relies on prior knowledge to score the candidates' occlusion plausibility. Extensive evaluation on single as well as multiple views shows that incorporating occlusion reasoning yields significant improvements in recognizing texture-less objects under severe occlusions.
The use of multiple images characterizes the second category. For these approaches, a sequence of consecutive images is necessary to disambiguate the object from occluders. For instance, [Ess et al., 2009] detect the objects and extrapolate the state of occluded objects using an Extended Kalman Filter. Reliable tracklets used in a temporal sliding-window fashion are generated to disambiguate occluded objects in [Xing et al., 2009].
One of the largest categories comprises approaches using part-based models. A challenge for global object templates is occlusion, as their performance degrades significantly in its presence. A popular solution to this problem is to separate the object into a set of parts and detect the parts individually. This approach generally yields detections that are more robust to occlusion. For example, [Shu et al., 2012] analyze the contribution of each part using a linear SVM and train the classifier to use unoccluded parts to maximize the probability of detection. [Wu and Nevatia, 2009] go a step further and use multiple part detectors to maximize the joint likelihood. Binary classification of parts is introduced by [Vedaldi and Zisserman, 2009]. They decompose the HOG descriptor into small blocks that selectively switch between an object and an occlusion descriptor.
Figure 5. The effect of occlusion reasoning used in a CNN. Left: the original CNN (MaskRCNN); different (2D and 3D) occlusion reasoning approaches improve the detection ([Reddy et al., 2019]).
More recent work uses 3D information for occlusion reasoning. Hsiao et al. argue that having 3D data provides richer information about the world, such as depth discontinuities and object size. [Pepikj et al., 2013] train multiple occlusion detectors on mined, 3D-annotated urban street scenes that contain distinctive, reoccurring occlusion patterns. [Wang et al., 2013] use RGB-D information and extended Hough voting to include object location and its visibility pattern. [Radwan et al., 2013] address precisely the problem of self-occlusion in the context of human pose estimation and add an inference step for handling self-occlusion to an off-the-shelf body pose detector to increase its performance under self-occlusion. The solution leverages prior knowledge about the kinematics and orientations of the human pose to deal with self-occluding body parts.
Lastly, convolutional neural networks have shown promising results for different visual tasks like object classification, recognition, segmentation, and 3D pose estimation. However, occlusion and, as will be shown in this work, self-occlusion pose significant challenges. [Reddy et al., 2019] introduce a framework to predict 2D and 3D locations of occluded keypoints of objects to mitigate the effect of occlusion on performance. Evaluated on CAD data and a large image set of vehicles at busy city intersections, the approach increases the localization accuracy of MaskRCNN by about 10%. A self-occlusion example can be seen in Figure 5. [Li et al., 2019] use deep supervision for fine-grained image classification. In their approach, they simulate challenging occlusion configurations between objects to enable reliable data-driven occlusion reasoning. Occlusion is modelled by rendering multiple object configurations and extracting the visibility level of the object of interest. The work of [Kortylewski et al., 2020] combines a convolutional neural network with the idea of part-based models. The authors introduce CompositionalNets, proposing to replace the fully-connected classification layer with a differentiable compositional model. The idea of CompositionalNets is to decompose images into objects and context, and then decompose objects into parts and the objects' pose. Experiments with various CNN backbones show that the approach can learn features that are invariant to occlusion and discard occluders during classification, hence increasing performance, especially under occlusion. However, a trade-off is pointed out: good occluder localization lowers classification performance, because classification benefits from features that are invariant to occlusion, whereas occluder localization requires a different type of features, namely ones that are sensitive to occlusion. It is pointed out that it is essential to resolve this trade-off with new types of models.
3. Object Definitions
In this Section, we will talk about the creation of the objects as well as their characterization. With TEOS, we present in total 48 objects, split into two sets: L1 and L2. The idea of having two sets is to support the different needs of research. L1 consists of 36 objects in 18 complexity levels, hence tailored towards research exploring the effect of finely grained complexity changes. L2, on the other side, is made up of 12 objects in three complexity levels based on L1, so it is made up of complexity groups.

Figure 6. The building blocks used to create the objects of TEOS; cuboid (left) and base (right).
In a number of empirical studies, we have studied the relationship between the number of elements per object and classification accuracy. Human classification of L1 objects is reliable (accuracy of > ) for objects with up to seven elements. The classification becomes uncertain (89%) for objects consisting of around ten elements. Finally, the classification becomes challenging (57%) for objects made of around 18 elements. Based on these findings, we have created the L2 set with three complexity levels: easy with seven elements, medium with ten elements, and hard with 18 elements. For each complexity level, we have constructed three more objects with the same number of elements but with the orientation of one of the elements changed, so as to have four distinct objects for each level. We will now continue to explain the elements used to assemble the objects and describe the objects' typical characteristics. All objects consist of the following two elements:
• One 20mm x 60mm x 120mm base (Figure 6 right)
• n cuboids (Figure 6 left)

Figure 7. Possible connection points of cuboids on the base.
The complexity of an object is simply calculated as

compl = n + 1     (1)

where n is the number of cuboids used. The base has five connection points to which cuboids can be attached. All cuboids are attached upright only, sitting flush with the bottom of the base. This also makes it simple to define a coordinate system; see Figure 8 for an illustration.

Figure 8. Illustration of the common coordinate system of the objects.
All objects share the same coordinate system, which is crucial for any research that looks at the effect of orientational differences of 3D objects. The coordinate system is defined as depicted in Figure 8: the Y-axis runs orthogonally out of the base, the X-axis runs through the base from its centre of gravity towards the end with three cuboid connectors, and the Z-axis runs orthogonal to the Y- and X-axes, with the positive direction through the side of the base with two cuboid connectors.
Figure 9. Stage-wise characterization of the objects: another form of expressing the complexity of an object.
Building up the objects, a cuboid has eight connectors at which another cuboid can be attached. Consecutive cuboids are always attached orthogonally and never aligned in their direction, which is one of the differences to the Shepard and Metzler objects. Furthermore, cuboids never intersect or touch neighbouring cuboids, hence avoiding geometrical loops. In creating the objects for L1, we focused on making the complexity comparable by consecutively adding one cuboid per complexity level to the object of the previous complexity level.

Figure 10. Illustration of L1 with all 36 objects.

The objects can also be characterized in four stages of height. Each stage adds one perpendicular cuboid.
For TEOS, four height-stages were introduced. Consequently, the number of height-stages presents another way of expressing the complexity of an object. An illustration of the different stages is given in Figure 9.
Lastly, having characterized the objects, we present the L1 and L2 object sets. The L1 objects can be seen in Figure 10 and consist of 36 objects split into 18 complexity levels. For each object, there is a distractor object of the same complexity that differs only in one small detail: one of the elements is oriented differently. The introduction of the distractor objects is intended to support research in visual recognition, where merely counting the number of elements would otherwise reveal the object class.

Figure 11. Illustration of L2 with all 12 objects, split into three different complexity levels.

The L2 set of objects, on the other side, is designed with less variation in complexity but more variation within a complexity level. Twelve objects are evenly split into three complexity levels. To stick with the analogy of the distractors of the L1 set, each complexity level of L2 has three distractors: the same number of elements but slight changes in how they are assembled. As described in the introduction of this Section, we chose the complexity levels based on empirical studies of visual recognition tasks. The L2 objects can be seen in Figure 11.

Figure 12. Illustration of the amount of average self-occlusion per object of L1.

An emphasis in designing these datasets was put on the ability to use them for self-occlusion studies. To accomplish this, all objects are built around a common coordinate system, share the same base, and progressively add elements to increase the complexity and level of self-occlusion. Figure 12 shows how an increase in complexity of the L1 dataset also increases the average amount of self-occlusion across all viewpoints. However, it is worth noting that, with an increasing amount of complexity, the spread of the self-occlusion distribution per class decreases. Further information about our self-occlusion measure is given in Section 5.
4. Dataset Acquisition
TEOS is a dataset designed to be used in the virtual as well as the real world. For the former, one can use the rendered images and the provided 3D models (.STL files). For the latter, the objects are designed to be printable with a 3D printer; we have prepared the 3D models to be readable by common 3D-printing slicing software. In this Section, however, we want to focus on the generation of the rendered dataset images, for which we have used Blender ([Blender, 2020]), a free and open-source 3D computer graphics software toolset. For TEOS, each object was rendered from 768 views, totalling 36,864 images. To achieve realistic renderings of the objects, we used the Cycles path-tracing rendering engine, created a white, smooth, plastic-imitating material, set six light sources in the rendering scene, and used 4096 paths to trace each pixel in the final rendering.
Figure 13. Illustration of viewpoints used to render each object of L1 and L2. Views are evenly distributed on a sphere around an object (blue points) and point towards the object (light red). In total, 768 views are taken.

Each object is rendered from the same set of views. To determine the views, we used the Fibonacci lattice ([Stanley, 1988]) approach, which distributes points on a sphere uniformly. Other approaches, for example, sampling radial distance, polar angle, and azimuthal angle uniformly, result in an unevenly sampled sphere: dense at the poles and sparse closer to the equator. Figure 13 illustrates the views chosen to generate the dataset. Each blue-coloured point represents a location where the camera is placed and oriented towards the centre where the object (red) is. We chose a sphere radius of two such that the object is view-filling but not cropped in any view.
Further, as is sometimes practiced in the machine learning community ([Everingham et al., 2010, Lin et al., 2014, Matthey et al., 2017, Kabra et al., 2019]), we also provide the object mask and renderings with a dark and a bright background for data augmentation purposes. The annotation file contains the object type, view id, bounding-box information, object and camera positions and orientations, and object dimensions.
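The Fibonacci lattice construction is compact enough to sketch. The following Python snippet is an illustration rather than the exact script used to build TEOS (the function name fibonacci_sphere and the use of NumPy are our assumptions); it generates 768 near-uniform camera positions on a sphere of radius two, to which the actual pipeline would additionally attach an orientation pointing at the object's centre.

```python
import numpy as np

def fibonacci_sphere(n_views: int = 768, radius: float = 2.0) -> np.ndarray:
    """Distribute n_views points near-uniformly on a sphere of the given
    radius using the Fibonacci lattice."""
    golden_angle = np.pi * (3.0 - np.sqrt(5.0))   # ~2.39996 rad
    i = np.arange(n_views)
    z = 1.0 - (2.0 * i + 1.0) / n_views           # each z-step covers equal area
    r = np.sqrt(1.0 - z * z)                      # radius of the horizontal slice
    theta = golden_angle * i                      # azimuth advances by golden angle
    x, y = r * np.cos(theta), r * np.sin(theta)
    return radius * np.stack([x, y, z], axis=1)   # (n_views, 3) camera positions

views = fibonacci_sphere()
print(views.shape)  # (768, 3)
```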
5. Self-Occlusion Measure
It seems evident that if we see less of an object, it is harder to classify it. Regions of the object that are occluded might hold distinct features for telling object X apart from object Y. In other words, occlusion plays an important role in visual classification. However, it depends not only on the view but also on the object. Take, for example, a sphere: no matter from which angle we look at it, we always observe 50% of it. On the contrary, for a complex polygonal shape this cannot be answered as quickly, as it depends on its geometry.
[Gay-Bellile et al., 2010] distinguish two kinds of occlusion: "external occlusion," which is caused by an object entering the space between the camera and the object of interest, and "self-occlusion," which describes the occlusion caused by the object of interest to itself. For TEOS, we are interested in the latter, as we always have one object in the scene. To our knowledge, no standard self-occlusion measure is used for computational approaches; therefore, we specify our own intuitive measure as:

SO_{c_i} = A_φ^{c_i} / A_σ     (2)

where A_φ is defined as the occluded (not visible) surface area of the object and A_σ stands for the total surface area of the object.

Figure 14. Examples of different objects (objects 10 and 13 of L1) and poses causing the same amount of occlusion but different appearances.

An object might have different views from which it causes the same amount of self-occlusion, resulting in perhaps considerably different appearances. Figure 14 shows an example of two objects from two different views with the same amount of occlusion. Therefore, we also consider the camera's point of view, with c_i as the camera pose. Here, c_i is defined as the camera position c_i = (x_i, y_i, z_i) and is computed based on the Fibonacci lattice approach (see Figure 13). The camera orientation is automatically set such that the object is in the centre of the viewport.

Figure 15. Examples of object viewpoints and their corresponding SO_{c_i}.

For evaluation purposes, we also define a function that maps a camera position (c_i) onto one of the eight regions of an octahedral viewing sphere placed at the centre of an object. Figure 16 illustrates a mapping example for two camera positions.

Figure 16. Visualization of the octahedron-based projection used to map camera positions. Bottom: two example camera poses (c_i and c_j) mapped to two tiles of the octahedron.

We represent the viewing sphere around an object as a spherically tiled octahedron, resulting in eight uniformly distributed triangles. To map a viewpoint c_i to a tile, we perform a determinant check to see in which tile a given camera pose c_i is located. In our rendered dataset, the self-occlusion was calculated using the following steps (a sketch follows the list):
1. Iterate over all faces of the object with valid normals and calculate the total surface area (A_σ).
2. Subdivide the object into many thousands of elements.
3. Position the camera at a given location and point it at the object (see Figure 13).
4. Select all vertices that are visible through the camera viewport.
5. Divide the object into two objects: visible and not visible.
6. Iterate over all faces of the not-visible object with valid normals and calculate the occluded surface area (A_φ).
7. Lastly, calculate the self-occlusion (Equation 2).
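Once the visible and total surface areas are known (steps 1 to 6 above, which in our pipeline are computed inside Blender), the measure itself is lightweight. The sketch below is a simplified illustration rather than the released tooling: the function names are ours, and it assumes the areas are given and the octahedron is axis-aligned, in which case the determinant check reduces to a per-coordinate sign test (one tile per octant).

```python
def self_occlusion(visible_area: float, total_area: float) -> float:
    """Equation 2: SO = A_phi / A_sigma, with A_phi the occluded (not
    visible) surface area and A_sigma the total surface area."""
    occluded_area = total_area - visible_area  # A_phi
    return occluded_area / total_area

def octahedron_tile(camera_pos) -> int:
    """Map a camera position c_i = (x_i, y_i, z_i) to one of the eight
    tiles of an axis-aligned, object-centred octahedral viewing sphere
    by testing the sign of each coordinate."""
    x, y, z = camera_pos
    return (x >= 0) * 4 + (y >= 0) * 2 + (z >= 0)  # tile index in 0..7

# Example: a view from the (+x, +y, +z) octant hiding 65% of the surface.
print(self_occlusion(visible_area=0.35, total_area=1.0))  # -> 0.65
print(octahedron_tile((1.2, 0.8, 1.5)))                   # -> 7
```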
Figure 17. Illustration of the self-occlusion distribution for L1 and L2 (top), as well as the distributional relation between viewpoint mapping and self-occlusion for L1 and L2 (bottom).

Figure 15 shows eight examples of the same object (object 7) from different viewing angles, sorted by their amount of self-occlusion. As can be seen in the illustration, a single object can cast many different appearances based on the viewing angle, with a significant change in how much of it is observable. Figure 17 illustrates the self-occlusion distribution for L1 and L2 (top) and the distributional relation between viewpoint mapping and self-occlusion for L1 and L2 (bottom). Self-occlusion for L1 ranges from 49.99% to 95.16% with a mean at around 68%, and for L2 from 54.08% to 87.5% with means at 61% (Easy), 63% (Medium), and 71% (Hard). The lower half of the Figure shows that different octahedron viewpoints result in varying amounts of self-occlusion. For both L1 and L2, a sweet spot with the least self-occlusion is at oh, presumably resulting in the best classification result.
6. Baseline Evaluation
In this Section, we discuss how well state-of-the-art classification approaches perform on TEOS. We have chosen five deep learning models with different properties and carefully trained and evaluated them on TEOS: Inception-V3 ([Szegedy et al., 2016]), MobileNet-V2 ([Sandler et al., 2018]), ResNet-V2 ([He et al., 2016]), VGG16 ([Simonyan and Zisserman, 2014]), and EfficientNet ([Tan and Le, 2019]) serve as reference networks for TEOS. Their versions trained on TEOS will be made publicly available. Table 1 shows more details about the networks in ascending order of their parameter count.
Table 1. High-Level CNN Characteristics
CNN               Layers   Parameters (mil.)
MobileNet-V2      53       3.4
Inception-V3      48       24
ResNet-V2         152      58.4
EfficientNet-B7   813      66
VGG16             16       138

With TEOS's unique characteristics, it is not trivial to tell which CNN architecture will perform better or worse. Therefore, we chose networks with varying numbers of parameters and different numbers of layers.
Besides the architecture of the CNNs, a crucial element is the choice of training parameters and so-called hyperparameters. In our case, we have looked at the input size, input noise, dropout rate, learning rate, optimization algorithm and, lastly, the difference between learning from scratch and fine-tuning the networks. Hyperparameters such as input noise, drop rate, and learning rate were determined using the hyperparameter optimizer Hyperband by [Li et al., 2017]. The remaining parameters were determined empirically. Table 2 presents the parameters used to establish the baseline of TEOS. To prepare the data for training, we chose a 20% validation split and augmented the remaining 80% with standard data augmentation techniques ([Shorten and Khoshgoftaar, 2019]); see Table 3.

Table 2. Chosen Training Parameters

Parameter         Value
Input Size        224 x 224 – 800 x 800 (dependent on CNN)
Input Noise       Gaussian noise of 0.1
Drop Rate         20%
Learning Rate     1e-5
Optimizer         Adam optimization
Learning Method   Fine tuning
Table 3. Parameters for Data Augmentation

Data Augmentation Technique   Value
Rotation                      0-40 degrees
Width Shift                   0-20%
Height Shift                  0-20%
Zoom                          0-20%
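To make the setup concrete, the following is a minimal sketch of how Tables 2 and 3 could translate into a training pipeline. It assumes a TensorFlow/Keras implementation with MobileNet-V2 at a 224 x 224 input as the example backbone; the directory layout teos/L1 and all names are illustrative, not the released training code.

```python
import tensorflow as tf

NUM_CLASSES = 36  # L1; use 12 for L2

# Data augmentation (Table 3) and the 80/20 train/validation split.
datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=40,        # 0-40 degrees
    width_shift_range=0.2,    # 0-20%
    height_shift_range=0.2,   # 0-20%
    zoom_range=0.2,           # 0-20%
    validation_split=0.2)     # 20% held out for validation

# Fine-tune an ImageNet-pretrained backbone (Table 2: fine tuning).
backbone = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")

model = tf.keras.Sequential([
    tf.keras.layers.GaussianNoise(0.1),           # input noise of 0.1
    backbone,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.2),                 # 20% drop rate
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
              loss="categorical_crossentropy", metrics=["accuracy"])

train = datagen.flow_from_directory("teos/L1", target_size=(224, 224),
                                    subset="training")
val = datagen.flow_from_directory("teos/L1", target_size=(224, 224),
                                  subset="validation")
# model.fit(train, validation_data=val, epochs=...)
```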
Our results show that MobileNet-V2 performed best across L1 and L2. Specifically, it achieved a top-1 accuracy of 17.25% on L1 and 10.83% on the L2 dataset. See Figure 18 for the classification accuracies of all networks. It seems that MobileNet-V2 is the only network that was able to learn some aspects of TEOS, performing with a large (L1) or small (L2) margin above chance, whereas all other networks perform at around chance. This, perhaps, has something to do with the relatively homogeneous appearance of TEOS, which gives the more complex CNNs little to learn from. However, this needs to be investigated further in the future.
Figure 18. Evaluation results on L1 (left) and L2 (right) for five different CNNs and their accuracy across the entire datasets.

Generally, L2 is more challenging for CNNs to learn than L1. Even the best-performing CNN is only 2.53% above chance, whereas this margin for L1 was at about 14.5%. This is explainable by the high intra-class similarity of L2: objects of one class look very similar to each other and only vary in one small detail, which might be observable only from certain viewpoints; hence they will be confused with each other. The L1 dataset, on the other side, has a low inter-class similarity: the appearance of objects varies between classes. A closer look at the results of L2 reveals that the more extensive networks (VGG16 and EfficientNet-B7) were able to learn objects of class "Hard" of L2; however, they could not learn the "Medium" and "Easy" objects. The smaller networks, on the other hand (MobileNet-V2 and Inception-V3), were able to learn "Easy" and "Medium" objects but not "Hard." Except for MobileNet-V2, all networks had profound problems learning the "Easy" objects. See Figure 18 for details.

Figure 19. Evaluation results on L1 and L2 for the three top-performing CNNs. Top: accuracy across the entire datasets with respect to self-occlusion. Bottom: accuracy and how it is affected by the chosen viewpoint.

Regarding the connection between classification accuracy and the amount of self-occlusion, it can generally be said that the classification accuracy goes down as self-occlusion increases. We chose the three best-performing CNNs to analyze this connection and grouped L1 and L2 from 50% to 85% self-occlusion in 5% intervals. The < 50% bin captures all viewpoints with a self-occlusion of less than 50%; the > 85% bin includes all images with more than 85%. Furthermore, we also investigated the connection between the viewpoint mapped to an octahedral viewing sphere and accuracy. As can be seen in the example of L1 and MobileNet-V2, the viewpoint does play a vital role and can result in a considerable increase in accuracy. Across L1 and L2, the octahedral viewpoint resulting in the best performance was oh. This can be explained by the fact that all objects share a common coordinate system, and it shows once more that the viewpoint matters and, even more, that an ideal viewpoint can exist. See Figure 19 for details. Further, even though the CNNs were trained and validated on the entire dataset, their best performance is seen at lower self-occlusion rates, which shows the vital role self-occlusion plays in object classification performance.
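For reference, the self-occlusion grouping used in this analysis can be reproduced with a few lines of NumPy. This is an illustrative sketch (the function name is ours): it assumes an array of per-image self-occlusion values taken from the TEOS annotations and a boolean array marking whether each image was classified correctly.

```python
import numpy as np

def accuracy_by_occlusion(so, correct, lo=0.50, hi=0.85, step=0.05):
    """Group per-image outcomes into self-occlusion bins
    '<50%', [50%,55%), ..., [80%,85%), '>85%' and report mean accuracy."""
    so = np.asarray(so)
    correct = np.asarray(correct, dtype=float)
    edges = np.arange(lo, hi + step, step)        # 0.50, 0.55, ..., 0.85
    bins = np.digitize(so, edges)                 # 0 -> <50%, last -> >85%
    labels = (["<50%"] +
              [f"[{a:.0%},{a + step:.0%})" for a in edges[:-1]] +
              [">85%"])
    # Only report bins that actually contain images.
    return {labels[b]: correct[bins == b].mean() for b in np.unique(bins)}
```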
7. Conclusion and Future Directions
In this work, we have presented a novel 3D blocks world dataset that focuses on the geometric shape of 3D objects and their omnipresent challenge of self-occlusion. We have created two datasets, L1 and L2, including hundreds of high-resolution, realistic renderings from known camera angles. Each dataset also comes with rich annotations. Further, we have presented a simple but precise measure of self-occlusion and were able to show how self-occlusion challenges the classification accuracy of state-of-the-art CNNs and how the viewpoint can benefit the classification. Lastly, in our baseline evaluation, we have shown that CNNs are largely unable to learn TEOS, leaving much room for future improvements. We hope to have paved a way to explore the relationship between object classification, viewpoint, and self-occlusion with this work. Specifically, we hope that TEOS is useful for research in the realm of active vision: planning and reasoning about the next-best-view seems to be crucial to increase object classification performance.
References

[Andreopoulos and Tsotsos, 2013] Andreopoulos, A. and Tsotsos, J. K. (2013). 50 years of object recognition: Directions forward. Computer Vision and Image Understanding, 117(8):827–891.

[Blender, 2020] Blender (2020). Blender - a 3D modelling and rendering package. Blender Foundation, Stichting Blender Foundation, Amsterdam.

[Brachmann et al., 2014] Brachmann, E., Krull, A., Michel, F., Gumhold, S., Shotton, J., and Rother, C. (2014). Learning 6D object pose estimation using 3D object coordinates. Lecture Notes in Computer Science, 8690 LNCS(PART 2):536–551.

[Carroll and Others, 1993] Carroll, J. B. and Others (1993). Human cognitive abilities: A survey of factor-analytic studies. Cambridge University Press.

[Clowes, 1971] Clowes, M. B. (1971). On seeing things. Artificial Intelligence, 2(1):79–116.

[Dollár et al., 2012] Dollár, P., Wojek, C., Schiele, B., and Perona, P. (2012). Pedestrian detection: An evaluation of the state of the art. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(4):743–761.

[Ess et al., 2009] Ess, A., Schindler, K., Leibe, B., and Van Gool, L. (2009). Improved multi-person tracking with active occlusion handling. In ICRA Workshop on People Detection and Tracking, volume 2. Citeseer.

[Everingham et al., 2010] Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., and Zisserman, A. (2010). The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338.

[Gay-Bellile et al., 2010] Gay-Bellile, V., Bartoli, A., and Sayd, P. (2010). Direct estimation of nonrigid registrations with image-based self-occlusion reasoning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(1):87–104.

[Girshick et al., 2011] Girshick, R., Felzenszwalb, P., and McAllester, D. (2011). Object detection with grammar models. Advances in Neural Information Processing Systems, 24:442–450.

[He et al., 2015] He, K., Zhang, X., Ren, S., and Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision.

[He et al., 2016] He, K., Zhang, X., Ren, S., and Sun, J. (2016). Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630–645. Springer.

[Hinterstoisser et al., 2013] Hinterstoisser, S., Lepetit, V., Ilic, S., Holzer, S., Bradski, G., Konolige, K., and Navab, N. (2013). Model based training, detection and pose estimation of texture-less 3D objects in heavily cluttered scenes. In Lecture Notes in Computer Science, volume 7724 LNCS, pages 548–562.

[Hsiao et al., 2010] Hsiao, E., Collet, A., and Hebert, M. (2010). Making specific features less discriminative to improve point-based 3D object recognition. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 2653–2660.

[Hsiao and Hebert, 2014] Hsiao, E. and Hebert, M. (2014). Occlusion reasoning for object detection under arbitrary viewpoint. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(9):1803–1815.

[Huffman, 1971] Huffman, D. A. (1971). Impossible objects as nonsense sentences. Machine Intelligence, 6:295–324.

[Kabra et al., 2019] Kabra, R., Burgess, C., Matthey, L., Kaufman, R. L., Greff, K., Reynolds, M., and Lerchner, A. (2019). Multi-Object Datasets. https://github.com/deepmind/multi-object-datasets/.

[Kirousis and Papadimitriou, 1988] Kirousis, L. M. and Papadimitriou, C. H. (1988). The complexity of recognizing polyhedral scenes. Journal of Computer and System Sciences, 37(1):14–38.

[Koporec and Pers, 2019] Koporec, G. and Pers, J. (2019). Deep learning performance in the presence of significant occlusions - An intelligent household refrigerator case. Proceedings - 2019 International Conference on Computer Vision Workshop, ICCVW 2019, pages 2532–2540.

[Kortylewski et al., 2020] Kortylewski, A., Liu, Q., Wang, A., Sun, Y., and Yuille, A. (2020). Compositional convolutional neural networks: A robust and interpretable model for object recognition under occlusion. International Journal of Computer Vision.

[Kuznetsova et al., 2020] Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., Krasin, I., Pont-Tuset, J., Kamali, S., Popov, S., Malloci, M., Kolesnikov, A., Duerig, T., and Ferrari, V. (2020). The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale. International Journal of Computer Vision, 128(7):1956–1981.

[Li et al., 2019] Li, C., Zia, M. Z., Tran, Q. H., Yu, X., Hager, G. D., and Chandraker, M. (2019). Deep supervision with intermediate concepts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8):1828–1843.

[Li et al., 2017] Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., and Talwalkar, A. (2017). Hyperband: A novel bandit-based approach to hyperparameter optimization. The Journal of Machine Learning Research, 18(1):6765–6816.

[Lin et al., 2014] Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. Lecture Notes in Computer Science, 8693 LNCS(PART 5):740–755.

[Matthey et al., 2017] Matthey, L., Higgins, I., Hassabis, D., and Lerchner, A. (2017). dSprites: Disentanglement testing Sprites dataset. https://github.com/deepmind/dsprites-dataset/.

[Meger et al., 2011] Meger, D., Wojek, C., Little, J. J., and Schiele, B. (2011). Explicit occlusion reasoning for 3D object detection. In BMVC, pages 1–11. Citeseer.

[Ouyang and Wang, 2012] Ouyang, W. and Wang, X. (2012). A discriminative deep model for pedestrian detection with occlusion handling. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 3258–3265.

[Parodi et al., 1998] Parodi, P., Lancewicki, R., Vijh, A., and Tsotsos, J. K. (1998). Empirically-derived estimates of the complexity of labeling line drawings of polyhedral scenes. Artificial Intelligence, 105:47–75.

[Pepikj et al., 2013] Pepikj, B., Stark, M., Gehler, P., and Schiele, B. (2013). Occlusion patterns for object class detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3286–3293.

[Radwan et al., 2013] Radwan, I., Dhall, A., and Goecke, R. (2013). Monocular image 3D human pose estimation under self-occlusion. Proceedings of the IEEE International Conference on Computer Vision, pages 1888–1895.

[Reddy et al., 2019] Reddy, N. D., Vo, M., and Narasimhan, S. G. (2019). Occlusion-Net: 2D/3D occluded keypoint localization using graph networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7326–7335.

[Roberts, 1965] Roberts, L. G. (1965). Machine perception of three-dimensional solids. MIT Press, Cambridge, MA.

[Russakovsky et al., 2015] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. (2015). ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252.

[Sandler et al., 2018] Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and Chen, L.-C. (2018). MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520.

[Shepard and Metzler, 1971] Shepard, R. N. and Metzler, J. (1971). Mental rotation of three-dimensional objects. Science, 171(3972):701–703.

[Shorten and Khoshgoftaar, 2019] Shorten, C. and Khoshgoftaar, T. M. (2019). A survey on image data augmentation for deep learning. Journal of Big Data, 6(1).

[Shu et al., 2012] Shu, G., Dehghan, A., Oreifej, O., Hand, E., and Shah, M. (2012). Part-based multiple-person tracking with partial occlusion handling. In CVPR, pages 1815–1821. IEEE.

[Simonyan and Zisserman, 2014] Simonyan, K. and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

[Slaney and Thiébaux, 2001] Slaney, J. and Thiébaux, S. (2001). Blocks World revisited. Artificial Intelligence, volume 125.

[Solbach et al., 2018] Solbach, M. D., Voland, S., Edmonds, J., and Tsotsos, J. K. (2018). Random polyhedral scenes: An image generator for active vision system experiments.

[Stanley, 1988] Stanley, R. P. (1988). Differential posets. Journal of the American Mathematical Society, 1(4):919–961.

[Szegedy et al., 2016] Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. (2016). Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[Tan and Le, 2019] Tan, M. and Le, Q. V. (2019). EfficientNet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946.

[Tsotsos et al., 2005] Tsotsos, J. K., Liu, Y., Martinez-Trujillo, J. C., Pomplun, M., Simine, E., and Zhou, K. (2005). Attending to visual motion. Computer Vision and Image Understanding, 100(1-2):3–40.

[Vedaldi and Zisserman, 2009] Vedaldi, A. and Zisserman, A. (2009). Structured output regression for detection with partial truncation. Advances in Neural Information Processing Systems, 22:1928–1936.

[Wang et al., 2013] Wang, T., He, X., and Barnes, N. (2013). Learning structured Hough voting for joint object detection and occlusion reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1790–1797.

[Wang et al., 2009] Wang, X., Han, T. X., and Yan, S. (2009). An HOG-LBP human detector with partial occlusion handling. In ICCV, pages 32–39. IEEE.

[Wu and Nevatia, 2009] Wu, B. and Nevatia, R. (2009). Detection and segmentation of multiple, partially occluded objects by grouping, merging, assigning part detection responses. International Journal of Computer Vision, 82(2):185–204.

[Xing et al., 2009] Xing, J., Ai, H., and Lao, S. (2009). Multi-object tracking through occlusions by local tracklets filtering and global tracklets association with detection responses. In 2009 IEEE Conference on Computer Vision and Pattern Recognition.