Noise-resistant Deep Learning for Object Classification in 3D Point Clouds Using a Point Pair Descriptor
Dmytro Bobkov, Sili Chen, Ruiqing Jian, Muhammad Iqbal, Eckehard Steinbach
This is the author's version of an article that has been accepted to IEEE Robotics and Automation Letters 2018. Changes were made to this version by the publisher prior to publication. The final version of record is available at http://dx.doi.org/10.1109/LRA.2018.2792681.
Abstract—Object retrieval and classification in point cloud data are challenged by noise, irregular sampling density and occlusion. To address this issue, we propose a point pair descriptor that is robust to noise and occlusion and achieves high retrieval accuracy. We further show how the proposed descriptor can be used in a 4D convolutional neural network for the task of object classification. We propose a novel 4D convolutional layer that is able to learn class-specific clusters in the descriptor histograms. Finally, we provide experimental validation on three benchmark datasets, which confirms the superiority of the proposed approach.
Index Terms—Recognition; Object Detection, Segmentation and Categorization; RGB-D Perception.
I. INTRODUCTION

Object retrieval and classification are important tasks in robotics. Of all 3D data representations, point clouds are closest to the output of LiDAR and depth sensors. Unfortunately, this representation is challenging due to its irregular data structure and large data size. Because of this, many works first convert point clouds into 3D voxel grids or multi-view rendered images. While convenient for processing, this pre-processing step introduces additional computational complexity and makes the resulting data representation unnecessarily voluminous. For this reason, we focus on approaches that work directly on point clouds. There are a number of handcrafted descriptor algorithms ([1], [2], to name a few) designed for point cloud data that produce a fixed-size descriptor. Unfortunately, their description performance is limited for realistic data that suffer from high levels of noise and occlusion. These disturbances often occur in robotic applications, where an object cannot be scanned from all viewpoints due to time, energy and accessibility constraints.
Manuscript received: September 10, 2017; Revised: November 29, 2017; Accepted: December 22, 2017. This paper was recommended for publication by Editor Tamim Asfour upon evaluation of the Associate Editor and Reviewers' comments. Muhammad Z. Iqbal has been supported by a PhD scholarship provided by the Higher Education Commission (HEC) of Pakistan. Dmytro Bobkov, Ruiqing Jian, Muhammad Z. Iqbal and Eckehard Steinbach are with the Chair of Media Technology, Technical University of Munich, 80333 Munich, Germany ([email protected]). Sili Chen is with the Augmented Reality Lab, Baidu Inc., 100193 Beijing, China ([email protected]).
Digital Object Identifier (DOI): see top of this page.
Fig. 1. Overview of the proposed object classification pipeline, which combines a novel handcrafted descriptor with a 4D convolutional neural network (CNN). For details on the network architecture and layer dimensions, see Fig. 6 and Table I. Here, FC denotes a fully connected layer.

Point pair-based global descriptors [1], [2], [3], [4] achieve high performance for object matching and classification, as they do not describe the geometry explicitly and instead use point pair functions (PPFs). The descriptor values describing the shape statistics are computed by applying PPFs to sampled point pairs. In this paper, similarly to [3], we quantize the PPF values in a 4D histogram.

Deep learning-based approaches have attracted significant interest in recent years for the task of 3D object classification. While a number of approaches convert the 3D data to a regular representation for ease of processing, we argue that this introduces an unnecessary computational step. However, due to the irregular data structure of point clouds, it is not straightforward to feed this representation into a neural network. In particular, a point cloud is an unstructured set of points, which is invariant to permutations of its members. Different permutations of the set members result in different input data for the network, which makes training of the neural network complex (also known as the symmetrization problem). PointNet [5] addresses this problem by using max-pooling. However, its classification performance remains relatively low for realistic datasets that are subject to noise and occlusion. We argue that this is because the network is not able to learn sufficiently invariant representations based solely on point sets. To address these limitations, we instead feed the 4D descriptor into the neural network (see Fig. 1). For this, we employ a novel network architecture with 4D convolutional layers, which outperforms state-of-the-art deep learning-based approaches on benchmark datasets.
The contributions of this paper are as follows:
1) We present a novel 4D convolutional neural network architecture that takes a 4D descriptor as input and outperforms existing deep learning approaches on realistic point cloud datasets.
2) We design a handcrafted point pair function-based 4D descriptor that offers high robustness for realistic noisy point cloud data. The source code of the descriptor will be made publicly available (https://rebrand.ly/obj-desc).

II. RELATED WORK
Handcrafted point cloud descriptors. Point cloud descriptors are typically divided into global and local descriptors. Local descriptors usually offer superior performance, but they come with high complexity, as typically thousands of keypoints need to be computed and matched per object [6]. Global descriptors instead compute one descriptor per object, thus reducing the matching complexity. Some of the global descriptors use projection techniques [7], [8]. The descriptor proposed by Kasaei et al. [7] exhibits state-of-the-art accuracy and achieves low computational complexity. Rodríguez-Sánchez et al. [8] combine local and global descriptors into a mixed representation. Others employ PPFs, which are based on sampled point pairs [1], [2], [3], [4], [9], [10]. Such descriptors are more robust to the noise and occlusion that often occur in real point cloud datasets. Unfortunately, their recognition performance is still not sufficient for many applications.
Deep learning on point clouds. Deep learning-based approaches have shown significant success over the last few years. However, due to the irregular structure of point cloud data and issues with symmetry, many approaches convert the raw point clouds into 3D voxel grids ([11] and [12], to name a few). Volumetric representations, however, are constrained in their resolution by data sparsity and unnecessary convolutions over empty voxels. Some other approaches address the problem using field probing neural networks [13], but it remains unclear how to apply such an approach to a general problem. Multi-view CNNs achieve good performance [14], but require an additional step of view rendering. There are also a number of feature-based CNNs that first compute features based on 3D data and then feed these into a neural network. While they achieve good results [15], the presented features are only suitable for mesh structures, and it is unclear how to extend them to point sets. Finally, there are a number of end-to-end deep learning methods working directly on point sets, such as [16], [17], [5]. However, these are sensitive to noise and occlusion in real point cloud datasets.

Due to these limitations of related work, we review existing PPFs and propose new ones. The quantized PPF values of the descriptor are then fed into a 4D convolutional neural network for the task of object classification.

III. METHODOLOGY
A. Handcrafted Descriptors
Sampling-based PPFs were previously used for the construction of robust 3D descriptors in [1], [2], [4], [18]. Typically, point pairs and point triplets are randomly sampled from the point set. Functions then map the sampled pairs and triplets to scalar values, which are quantized into a histogram that describes the statistics of the shape. Such point sampling leads to a certain randomness in the result, but also enhances robustness to noise and occlusion. A further advantage of this approach is its rotation invariance.

We define a point pair function f as the mapping of a point pair to a scalar value as follows:

f : (\mathbb{R}^3, G) \times (\mathbb{R}^3, G) \to \mathbb{R},   (1)

with \mathbb{R}^3 being the Euclidean space and G denoting the manifold of surface normal orientations in 3D space. We employ the following functions f_1 to f_4 for our point pair-based descriptor:
1) Euclidean distance between the points, f_1 [4].
2) Maximum angle between the corresponding surface patches of the points and the direction vector d connecting the points, f_2.
3) Normal distance between the points, f_3.
4) Occupancy ratio along the line connecting the points, f_4.

Euclidean distance. The function value f_1 is the Euclidean distance between the two points p_1 and p_2:

f_1(p_1, p_2) = \|p_1 - p_2\|.   (2)

The statistics of the distances between point pairs represent both the geometry and the size of the object (see d in Fig. 2).

Fig. 2. Illustration of the points p_1, p_2, their normal vectors n_1, n_2 and the Euclidean distance f_1 = \|d\|.

Maximum surface angle. The function value f_2 describes the orientation of the underlying patches with respect to the line connecting both points, d. It is defined as follows:

f_2(p_1, n_1, p_2, n_2) = \max(\beta_1, \beta_2),   (3)

with \beta_1 = \arccos(n_1 \cdot d) - \pi/2 and \beta_2 = \arccos(n_2 \cdot d) - \pi/2 being the angles between the vector d and the tangent patches of the points p_1 and p_2, respectively (see Fig. 2). The direction vector is computed as d = p_2 - p_1. Function f_2 is important in cases where the Euclidean distance f_1 and the normal distance f_3 are not descriptive enough; refer to Fig. 3 for an illustration. There, the point pairs on the left and on the right have the same Euclidean distance and the same normal distance. In contrast, function f_2 takes significantly different values for the two pairs and hence provides descriptive information on the geometry. Although \beta_1 and \beta_2 may take on different values and, hence, provide an informative surface description, we choose only one value and thus strike a trade-off between compactness and accuracy. We use the maximum operation rather than the minimum or the average because we have observed that it is much more descriptive for noisy datasets.

Fig. 3. Illustration of a case with point pairs (p_1, p_2) (left, \beta_1 = \beta_2 = 0°) and (p_3, p_4) (right, \beta_1 = \beta_2 = 45°) that have similar Euclidean (f_1) and normal (f_3) distances, but still describe significantly different shapes. The maximum angle between the patches and the direction vector, f_2, is an important feature in such cases.
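To make the first two functions concrete, the following NumPy sketch computes f_1 and f_2 for a single point pair; the function names are our own, and normalizing d before the angle computation is an assumption of this sketch (arccos requires a unit-length argument).

```python
import numpy as np

def f1_euclidean(p1, p2):
    """Euclidean distance between two points, Eq. (2)."""
    return np.linalg.norm(p1 - p2)

def f2_max_surface_angle(p1, n1, p2, n2):
    """Maximum angle between the tangent patches and the direction
    vector d = p2 - p1, Eq. (3). Normals n1, n2 are assumed unit-length;
    d is normalized here (an assumption of this sketch)."""
    d = p2 - p1
    d = d / np.linalg.norm(d)
    beta1 = np.arccos(np.clip(np.dot(n1, d), -1.0, 1.0)) - np.pi / 2
    beta2 = np.arccos(np.clip(np.dot(n2, d), -1.0, 1.0)) - np.pi / 2
    return max(beta1, beta2)
```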
Normal distance. The function value f_3 describes the similarity of the surface orientations of the point neighborhoods and is defined as follows:

f_3(n_1, n_2) = \arccos(|n_1 \cdot n_2|),   (4)

which lies in the range from 0 to \pi/2. We take the absolute value of the dot product because we want to eliminate the influence of the viewpoint, which can be unreliable in multi-view point clouds [19].

Occupancy ratio. The value f_4 describes the object geometry. In particular, we voxelize the object volume using a voxel grid of dimensions N_x × N_y × N_z to enable fast lookup and occupancy checks [2]. We set N_x = N_y = N_z = 64. The value of f_4 is defined as follows:

f_4(P, p_1, p_2) = N_occ / N_total,   (5)

with P \subset \mathbb{R}^3 being the set of points in the considered point cloud, N_occ the number of occupied voxels intersected by the 3D line connecting the two points p_1 and p_2, and N_total the total number of voxels intersected by this line (see Fig. 4). We classify a voxel as occupied if it contains at least one point. Because the point density can vary significantly in indoor point clouds, this conservative rule provides higher robustness. The function describes the global object geometry, because the voxel grid occupancy is computed based on all points. For the example with p_1 and p_2 in Fig. 4, N_occ = 8 and N_total = 13, leading to f_4 = 0.62. In contrast, for the point pair (p_3, p_4), f_4 = 1, as all intersected voxels are occupied.
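The remaining two functions can be sketched in the same style. The exact line-voxel traversal for f_4 is approximated below by densely sampling the segment between the two points, which is an assumption of this sketch; the names and the grid interface are illustrative.

```python
import numpy as np

def f3_normal_distance(n1, n2):
    """Angle between surface normals, Eq. (4); the absolute value
    removes the viewpoint-dependent sign of the normals."""
    return np.arccos(np.clip(abs(np.dot(n1, n2)), 0.0, 1.0))

def f4_occupancy_ratio(occupied, p1, p2, origin, voxel_size, n_samples=256):
    """Occupancy ratio along the segment p1-p2, Eq. (5). `occupied` is a
    boolean (64, 64, 64) grid built from all object points; the voxel
    traversal is approximated by uniform sampling of the segment."""
    t = np.linspace(0.0, 1.0, n_samples)[:, None]
    samples = p1[None, :] * (1 - t) + p2[None, :] * t
    idx = np.floor((samples - origin) / voxel_size).astype(int)
    idx = np.clip(idx, 0, np.array(occupied.shape) - 1)
    idx = np.unique(idx, axis=0)          # voxels intersected by the line
    n_occ = occupied[idx[:, 0], idx[:, 1], idx[:, 2]].sum()
    return n_occ / len(idx)
```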
B. Feature Statistics

We draw 20,000 point pairs at random from the point set, as we observe that this is sufficient to describe complex shapes. After the PPF values are computed for these pairs, they need to be aggregated into a descriptor histogram. Many approaches (among others [2] and [9]) assume that the different function values are uncorrelated with each other; therefore, the function values have been discretized into bins and concatenated into a 1D histogram. We have observed that aggregation into a 1D histogram leads to a significant loss of performance, because information on the co-occurrences of different function values is neglected. To avoid this loss of information on 4D co-occurrences, similar to [4], we instead build a 4D histogram of function value occurrences (shown in Fig. 5), which can be expressed as

F = (f_1, f_2, f_3, f_4).   (6)

We denote the descriptor as Enhanced Point Pair Functions (EPPF). Clearly, a straightforward extension into a 4D histogram would result in an exponential increase of the computational complexity [3], [4]. We observe that not all function values are equally informative for the description of the object geometry; hence, a different number of bins has to be chosen for the different dimensions. In Section IV-B we provide an experimental study of the contribution of each feature to the overall performance.

Fig. 4. 2D illustration of the grid used for performing voxel occupancy checks for the voxels lying along the line (dashed line) connecting a given point pair p_1 and p_2 and another point pair p_3 and p_4.

Fig. 5. 4D histogram that is used to discretize the aggregated counts of sampled PPF values w_{i,j,k,l} into a descriptor. Blue denotes bins with a low number of counts, whereas red corresponds to a high number.

In particular, function f_1 helps to distinguish objects of different sizes; therefore we choose a relatively large number of bins, N_f1 = 20. Contrary to scale-invariant approaches, we do not scale every object to fit into a unit sphere, as we observe that in indoor environments the size of an object provides important information. For example, monitor and whiteboard can have similar geometric shapes, but different dimensions. To preserve this information, we make our descriptor scale-variant and scale all objects so that the largest one fits into a unit cube. For f_2, we observe that its descriptive ability is relatively low for noisy and occluded data; therefore we choose a relatively small number of bins, N_f2 = 4. For f_3, we set a slightly larger number, N_f3 = 5. Finally, for f_4, we observe that N_f4 = 3 is sufficient. We have experimented with larger numbers of bins and have noticed no significant performance improvement. Thus, we strike a trade-off between complexity and accuracy of the descriptor.
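A minimal sketch of the histogram aggregation follows, using the bin counts (20, 4, 5, 3) chosen above. The value ranges assigned to each dimension are our assumptions: distances after scaling into the unit cube, the angular ranges implied by Eqs. (3) and (4), and occupancy ratios in [0, 1].

```python
import numpy as np

def eppf_histogram(pairs_f):
    """Quantize sampled PPF values into the 4D EPPF histogram.
    `pairs_f` is an (N, 4) array of (f1, f2, f3, f4) values for the
    20,000 sampled point pairs."""
    bins = (20, 4, 5, 3)     # N_f1, N_f2, N_f3, N_f4 -> 1200 bins in total
    ranges = [(0.0, np.sqrt(3.0)),           # f1: diagonal of the unit cube
              (-np.pi / 2, np.pi / 2),       # f2, Eq. (3)
              (0.0, np.pi / 2),              # f3, Eq. (4)
              (0.0, 1.0)]                    # f4, Eq. (5)
    hist, _ = np.histogramdd(pairs_f, bins=bins, range=ranges)
    return hist                               # shape (20, 4, 5, 3)
```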
According to our observations, point pairs with larger Euclidean distances usually have a higher discriminative power than those with smaller Euclidean distances. This is because every object has point pairs with small distances, but only certain objects have point pairs with larger distances. Furthermore, to suppress the influence of noise in the low-distance regions and enhance the differences in the high-distance regions for better discrimination, we use the following weighting factor for the bin located at index (i, j, k, l):

\alpha_i = \ln(i / N_f1 + c),   (7)

where i is the index of the Euclidean bin, and \alpha_i is used to compute the bin weight as w^{new}_{i,j,k,l} = \alpha_i \cdot w_{i,j,k,l} (see w_{i,j,k,l} in Fig. 5). Here, c is a constant greater than 1, which guarantees that the weights are positive and mitigates noise for point pairs with smaller Euclidean distances; its value is set based on experimental validation. Finally, every weighted descriptor histogram is normalized. The total number of bins of the resulting 4D histogram is N = 1200. For matching, we employ the symmetrized form of the Kullback-Leibler (KL) divergence as a distance metric between two histograms:

d(W^1, W^2) = \sum_{i=1}^{N} (W^1_i - W^2_i) \ln \frac{W^1_i}{W^2_i},   (8)

where W^1 and W^2 are the histogram counts of objects 1 and 2, respectively. Similarly to [4], we set all zero bins of a histogram to a common minimum value that is half the smallest observed bin value in the dataset. We observed that KL outperforms the L1 and L2 distance metrics.
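The weighting of Eq. (7) and the matching metric of Eq. (8) can be sketched as follows. Since only c > 1 is specified above, the default value in this sketch is illustrative, as is the fixed floor used for zero bins.

```python
import numpy as np

N_F1 = 20   # number of Euclidean-distance bins

def weight_and_normalize(hist, c=1.5):
    """Euclidean-distance bin weighting, Eq. (7), then normalization.
    c > 1 as stated in the text; c = 1.5 is an illustrative value."""
    i = np.arange(1, N_F1 + 1)
    alpha = np.log(i / N_F1 + c)                 # Eq. (7)
    w = hist * alpha[:, None, None, None]        # weight along the f1 axis
    return w / w.sum()

def symmetric_kl(w1, w2, floor=1e-6):
    """Symmetrized Kullback-Leibler divergence, Eq. (8). Zero bins are
    replaced by a small common minimum (a fixed epsilon here; half the
    smallest observed bin value in the paper)."""
    w1 = np.where(w1 > 0, w1, floor)
    w2 = np.where(w2 > 0, w2, floor)
    return np.sum((w1 - w2) * np.log(w1 / w2))
```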
C. 4D Deep Learning Architecture

The previously computed 4D descriptor is rotation-invariant, which resolves the issue of the symmetry of point sets in neural networks. Hence, it is possible to feed this representation into a neural network for the task of object classification. To preserve the information about the 4D co-occurrences of the function values, we use 4D convolutions in the first two layers of the neural network (see Fig. 6). 4D convolution has already been successfully applied to the task of material recognition by Wang et al. [20], where it outperformed other architectures. Details on the dimensions are given in Table I. After the two 4D convolutional blocks, the resulting responses are reshaped into a 2D structure and fed into a 2D convolutional block, followed by 2D max-pooling. The output is then reshaped into a 1D representation and input into a fully connected layer. To achieve regularization and enhance the generalization ability of the network, we employ a dropout layer, followed by a final fully connected layer that provides the class prediction for the object.

For comparison, we also design 2D- and 3D-convolution-based networks for object classification. For a fair evaluation, we choose the dimensions so that the numbers of parameters of the three networks are comparable. The dimensions of the individual layers are given in Table I.

TABLE I. Layer dimensions for the 2D, 3D and 4D variants of the network. N_f denotes the number of filters.

Layer | 2D network | 3D network | 4D network | N_f
Input | (40, 30) | (20, 4, 15) | (20, 4, 5, 3) | -
1 (conv.) | (5, …) | (5, …, …) | (5, …, …, …) | 32
2 (conv.) | (5, …) | (5, …, …) | (5, …, …, …) | 64
3 (conv.) | (5, …) | (5, …, …) | 2D: (5, …) | 48
4 | max-pool: (2, 2) | | | 1
5 | fully connected (192, 1024) | | | 1
6 | dropout 0.5 | | | 1
7 | fully connected: (1024, N_classes) | | | 1
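Since mainstream deep learning frameworks provide convolutions only up to 3D, the following sketch illustrates the principle of a single stride-1 4D convolution over the EPPF descriptor using scipy.ndimage, which operates in arbitrary dimensions. The kernel extents beyond the leading 5 and the random inputs are assumptions, as Table I fixes only the leading extent.

```python
import numpy as np
from scipy.ndimage import convolve

# One stride-1 4D convolution over an EPPF descriptor of shape (20, 4, 5, 3).
descriptor = np.random.rand(20, 4, 5, 3)   # stand-in for a real EPPF histogram
kernel = np.random.rand(5, 3, 3, 3)        # one of the 32 first-layer filters
response = convolve(descriptor, kernel, mode='constant', cval=0.0)
print(response.shape)                      # (20, 4, 5, 3)
```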
Thus, for the 2D variant of the network, the input 4D descriptor is first reshaped into 2D with dimensions (40, 30) and then processed with three 2D convolutional layers. For the 3D variant, the input 4D descriptor is reshaped into 3D with dimensions (20, 4, 15) and then processed with three 3D convolutional layers. The number of filters remains the same for all three networks. We use a stride value of 1.
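As an illustration of the simplest baseline, a tf.keras sketch of the 2D variant from Table I follows. The 5×5 kernels, padding and activations are our assumptions, so the flattened size differs from the (192, 1024) fully connected layer of Table I, which depends on the padding and stride choices in the paper.

```python
import tensorflow as tf

NUM_CLASSES = 14  # e.g., the ScanNet setup; illustrative

# 2D variant: the 1200-bin EPPF descriptor reshaped to (40, 30), three
# convolutions with 32/64/48 filters, 2x2 max-pooling, two FC layers.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 5, padding='same', activation='relu',
                           input_shape=(40, 30, 1)),
    tf.keras.layers.Conv2D(64, 5, padding='same', activation='relu'),
    tf.keras.layers.Conv2D(48, 5, padding='same', activation='relu'),
    tf.keras.layers.MaxPool2D(2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1024, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(NUM_CLASSES, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
```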
IV. EXPERIMENTAL RESULTS

For the experimental comparison with state-of-the-art approaches, we choose OUR-CVFH [9], ESF [2] and Wahl et al. [4], as methods that consistently perform well across various benchmarks. For the first two, we use the implementations provided in the Point Cloud Library 1.8 [21]. For Wahl et al., we use our own C++ implementation. We further fine-tune the descriptor parameters to obtain optimal performance. For the comparison with deep learning approaches, we use PointNet [5], as it is the only approach to date that is able to work directly on 3D point sets without the additional operations of multi-view projection [14] or voxelization [22]. We use the implementation provided by its authors.

The proposed EPPF descriptor has a larger number of bins than OUR-CVFH and ESF, which raises the question whether the larger number of bins has an impact on the performance. To answer this question, we additionally evaluate a version of the EPPF descriptor with fewer bins, chosen so that its size is comparable to the other descriptors. In particular, we set N_f1 = 15, N_f2 = 3, N_f3 = 4 and N_f4 = 3, resulting in a total number of bins of N = 540 (compare to 640 in ESF and 308 in OUR-CVFH). We refer to this descriptor as "EPPF short" in the following. As evaluation metrics, we use the total accuracy, i.e., the number of correctly retrieved or classified objects divided by the total number of objects, as well as the mean accuracy and mean recall, which are averaged over the classes.

A. Datasets

For evaluation, we use the Stanford point cloud dataset [23], the ScanNet CAD dataset [24] and the ModelNet40 CAD dataset [22]. These are the most recent and largest datasets of indoor objects that can be used for evaluation.
Stanford dataset. The Stanford dataset [23] contains RGB and depth images and has been captured in 6 office areas within 3 different buildings, using structured-light sensors during a 360° rotation at each scanning location. Due to sensor noise and limited scanning time, the point density varies significantly throughout the scene. Furthermore, there is a high level of occlusion.
Fig. 6. Architecture of the proposed 4D neural network. See Table I for more details on the dimensions.

The authors of [23] propose a training/testing split according to buildings. We cannot use this split, because some objects would then never occur in the testing or training sets, which would make the evaluation of object classification less meaningful. We therefore derive our own split, in which a fixed share of the object instances per category forms the training set and the remaining instances form the testing set. We omit the category clutter, as it contains objects of different categories. Furthermore, we also skip architectural elements with a high level of planarity, such as floor, ceiling and wall, as they can easily be classified using the normal direction; their presence would make the object classification task unnecessarily complex. Thus, we have 10 classes in total.

ScanNet dataset. The ScanNet dataset [24] is a large-scale CAD dataset containing semantic annotations of indoor scenes. It contains high levels of occlusion and noise, as it was collected with a commodity RGB-D sensor in a low-cost setup. For classification, we employ the training/testing split specified by the authors [24]. We use the list of categories that are compatible with the ShapeNet55 dataset and mentioned in [24]. To avoid an unbalanced training set, we remove the category laptop, as it contains far fewer instances than the other categories. Thus, the used dataset contains 14 categories: basket, bathtub, bed, cabinet, chair, keyboard, lamp, microwave, pillow, printer, shelf, stove, table and tv.

ModelNet40 dataset. The ModelNet40 dataset [22] is a large-scale CAD model dataset. The CAD models have been manually cleaned and thus contain practically no noise or occlusion. It comprises 12,311 CAD models from 40 categories, split into 9,843 models for training and 2,468 for testing.

The ModelNet40 and ScanNet datasets contain mesh models, which need to be converted into a point cloud representation. For this, we use the mesh sampling approach from the Point Cloud Library [21] with a resolution of 1 cm. Because EPPF, Wahl et al. and OUR-CVFH require normal information, we further perform normal estimation using the method of Boulch and Marlet [25].

B. Object Retrieval using Handcrafted Descriptors
We perform leave-one-out cross-validation by querying every object in the dataset against all other objects to find the closest match. When the closest match is of the same category as the query object, we consider it a correct retrieval, and an incorrect one otherwise. Because the ESF, Wahl et al. and EPPF descriptors contain a random point pair sampling step, their performance varies between runs, as each time different pairs are chosen. To mitigate this, we repeat the experiments ten times and record the mean and the standard deviation. The retrieval performance is given in Table II; a sketch of the evaluation protocol follows at the end of this section.

TABLE II. Retrieval performance of the handcrafted descriptors. The mean value over ten runs is given in the corresponding column, while the standard deviation is given in brackets. Best performance is shown in bold.

Descriptor | OUR-CVFH [9] | ESF [2] | Wahl [4] | EPPF Short | EPPF
N_bins | 308 | 640 | 625 | 540 | 1200
Stanford [23], total accuracy (%) | 62.79 | 71.34 | … | … | …
Stanford [23], mean accuracy (%) | … | … | … | … | …
ScanNet [24] and M40 [22], accuracies (%) | … | … | … | … | …

One can observe in Table II that the proposed EPPF descriptor (in its full and short versions) outperforms ESF and OUR-CVFH on all datasets. Furthermore, the EPPF descriptor outperforms the Wahl descriptor on the Stanford and ScanNet datasets, while showing comparable performance on the M40 dataset. This is because of the low level of noise in this dataset: the PPFs employed in the Wahl descriptor are less robust to high levels of noise, but at lower noise levels they can provide a higher descriptive ability than the EPPF descriptor. Notably, there is a big difference between the total and mean accuracy values for all descriptors. This is because the datasets are unbalanced, i.e., some categories occur more often than others, so matching to a larger category is more likely. The approaches thus perform correct retrievals for the larger categories, which increases the total accuracy but results in a smaller mean accuracy.

To gain further insight into the influence of the various functions on the resulting performance, we disabled one PPF at a time and repeated the retrieval experiments. The results for EPPF are given in Fig. 7.

Fig. 7. Illustration of the influence of function removal on the retrieval performance (F1-score) for the proposed descriptor. One function is removed at a time. Averaged over 10 runs.

The largest drop in retrieval performance is observed when the Euclidean distance function is removed; the drop when removing the surface angle function is lower. Interestingly, the normal distance function behaves differently on the various datasets. On the Stanford dataset, which exhibits high levels of noise in the normal orientations, removing the normal distance function leads to a performance improvement; on the other datasets, in contrast, it causes a significant drop. Finally, the visibility ratio function f_4 contributes the least to the overall performance on all datasets. This justifies the chosen numbers of bins for the different dimensions.
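The retrieval protocol above can be summarized in a few lines; this sketch assumes precomputed, normalized EPPF histograms and reuses the symmetric_kl function from the matching sketch in Section III-B.

```python
import numpy as np

def leave_one_out_retrieval(histograms, labels):
    """Query every object against all others with the symmetrized KL
    divergence of Eq. (8); a retrieval counts as correct when the nearest
    neighbour shares the query's category. Returns the total accuracy."""
    n = len(histograms)
    correct = 0
    for q in range(n):
        others = [j for j in range(n) if j != q]
        dists = [symmetric_kl(histograms[q], histograms[j]) for j in others]
        best = others[int(np.argmin(dists))]
        correct += int(labels[best] == labels[q])
    return correct / n
```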
C. Comparison of Deep Learning Approaches

We further evaluate deep learning approaches on the task of object classification, again using the Stanford, ScanNet and M40 datasets. We use the proposed 4D CNN in combination with the handcrafted feature descriptor. For comparison, we also include the 2D- and 3D-convolution-based networks (denoted as 2D and 3D, respectively). For optimization, we employ the Adam optimizer with a dropout probability of 0.5. Training on ScanNet takes 1-3 hours to converge with TensorFlow [26] and an Nvidia Titan Xp. For comparison, we also evaluate PointNet [5] as a method that learns directly on point sets. We train the PointNet network on the given objects, taking into account the normalization into a unit cube as advised by the authors [5]. We use the standard parameters and feed the network with 1,024 points.

In Table III, we provide the object classification results for both approaches (EPPF 4D denotes the 4D convolutional network). One can observe that the 4D-convolution-based network performs better than the 2D and 3D variants. This is thanks to the fact that the 4D co-occurrences between the various dimensions are preserved, whereas this information is lost when reshaping into 2D or 3D. On the Stanford and ScanNet datasets, the 3D network performs better than the 2D-based one. One can further observe that our approach outperforms PointNet on the first two datasets. This can be explained by the fact that the proposed network can more easily learn noise-resistant class-specific patterns from the handcrafted descriptor than from raw point sets. Notably, PointNet outperforms our approach on the M40 dataset. Here, we obtain a PointNet result that differs from the one reported by the authors in [5], which is due to the random behaviour of network training. With the lower level of noise in the M40 dataset, PointNet is able to learn a more descriptive representation for object classification; the lower performance of our network is due to the loss of information when operating on features instead of point sets.

To investigate the influence of noise on the total accuracy, we add zero-mean Gaussian random noise with various standard deviation values to the 3D coordinates of the point sets and re-train the network using the noisy examples. The results for our 4D approach and for PointNet are given in Fig. 8. Even though PointNet outperforms our approach at lower noise levels, our approach suffers no significant decrease in accuracy as the noise level increases. In contrast, the PointNet performance starts to deteriorate drastically already at a standard deviation of 0.06, i.e., 6% of the unit cube size. This can be explained by the fact that the proposed point pair functions are more robust to noise than a network trained directly on point sets.
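The noise perturbation itself is a one-liner per object; a minimal sketch, assuming point coordinates normalized to the unit cube:

```python
import numpy as np

def perturb(points, sigma):
    """Add zero-mean Gaussian noise independently to every point of a
    unit-cube-normalized (N, 3) point cloud, as in the robustness test."""
    return points + np.random.normal(0.0, sigma, size=points.shape)
```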
Fig. 8. Illustration of the influence of zero-mean Gaussian random noise on the classification accuracy for the M40 dataset using 1024 points. The noise is added to each point independently. PointNet results from [5].

V. DISCUSSION
TABLE III. Classification performance of deep learning approaches using 2D, 3D and 4D convolutional layers.

Dataset | Metric | PointNet [5] | EPPF 2D | EPPF 3D | EPPF 4D
Stanford | Total accuracy (%) | 64.30 | 82.01 | 81.94 | …
Stanford | Mean accuracy (%) | 42.48 | 64.26 | … | …
Stanford | F1-score | 0.395 | 0.652 | 0.665 | …
ScanNet | Total accuracy (%) | 63.04 | 70.39 | 70.57 | …
ScanNet | Mean accuracy (%) | 37.50 | 38.98 | 44.35 | …
ScanNet | Mean recall (%) | 19.53 | … | … | …
M40 | Total accuracy (%) | … | … | … | …

Network response visualization on different layers. To gain further insight into the transformation learned by the network, we show its responses for an exemplary object. We choose the object table in ScanNet and visualize the descriptor values and the responses of the first filter in the first two layers in Fig. 9. Observe that the descriptor is very sparse, i.e., most of the quantized space takes zero values. The curse of dimensionality is not a big issue here, as the dimensionality of our function space is low (4D) and it is strongly quantized: we aggregate 20,000 4D function values into 1,200 histogram bins, i.e., about 16.7 counts per bin on average, which confirms that our space is sufficiently sampled. When feeding this descriptor into the first 4D convolutional layer, one can observe that the network smears this signal in space. In the second layer, the signal is spread even further across the different dimensions. This is followed by a max-pooling layer that achieves invariance to spatial shifts. The transformation learned by the network does not merely perform a Gaussian-like smoothing; more importantly, it amplifies the signal in certain regions and suppresses it in others. This selective behaviour benefits the generalization ability of our network, as the first 4D convolutional layer can learn the fine features that are characteristic of certain object categories, while suppressing occlusion and noise.

Fig. 9. Descriptor and 4D neural network responses for the object table in the ScanNet dataset. Left: descriptor values. Middle: response of the first filter in the first layer. Right: filter response in the second layer. The rows show slices of the fourth dimension. Transparent bins correspond to constant offset values for the response (or 0 for the descriptor values), colored bins to varying values. The bins are colored so that low values are shown in blue and high values in red.

Runtime analysis. We review the runtime performance of the proposed descriptor. We implement it in C++ with OpenMP parallelization and use a desktop PC with an Intel i7 CPU and 24 GB RAM. The descriptor computation takes 8 ms per object on average, which is comparable to the runtime of the ESF, Wahl et al. and OUR-CVFH descriptors. This still allows us to use the descriptor for real-time perception tasks in robotics. As our descriptor has a fixed size irrespective of the object dimensions, we expect a relatively constant runtime when using the neural network for object classification.
Further insights. We have experimented with a number of network architectures for object classification, but have not observed significant improvements with larger architectures, which can be explained by the limited size of the training data. Intuitively, the reshaping operations performed in the proposed neural network remove information about the structure and the feature co-occurrences. However, we have observed that 2D reshaping gives a higher classification performance than using 4D blocks throughout, which could be explained by the fact that the category-specific clusters learned by the network are spatially separated in all dimensions. We have alternatively considered a number of other strategies, such as stacking the dimensions into a 2D representation or global max-pooling, and did not observe any performance improvement.

PointNet generally took much longer to converge than our approach on all datasets. For PointNet, we also evaluated a voting scheme that applies multiple perturbations and uses the majority vote of the predictions, but observed no significant performance improvement. Furthermore, we performed experiments with feeding point pairs directly to PointNet. This indeed improved the performance on noisy datasets, but only by a small margin. To make sure that no local optima influenced the evaluation, we trained several times and reported the best test accuracy. Most of the considered datasets have unbalanced categories, i.e., some categories (such as chair in ScanNet) occur much more often than others (lamp). This leads to the effect that the network can learn very complex patterns for frequently occurring objects, while it can only learn simple patterns for rarely occurring ones.
Limitations. Tuning the hyper-parameters of the network could bring further improvements; in particular, one could expect a ResNet structure to improve the results [27]. Another important aspect is the point sampling strategy: we use straightforward random sampling in this work. We expect that non-random point sampling could improve the classification performance; for this, techniques similar to the ones described by Birdal and Ilic [28] can be applied. Finally, end-to-end learning with the goal of identifying more descriptive point pair and point triplet functions could bring further improvements in classification performance.

VI. CONCLUSIONS

We have proposed to feed the values of a global descriptor into a novel 4D neural network for object classification, which outperforms existing deep learning approaches on realistic data. We verified that 4D convolutional layers outperform 2D and 3D convolutional layers. We have also shown that by carefully selecting the PPFs and the number of bins for the different dimensions, one can enhance the performance of a point pair-based global descriptor. Experimental results on three benchmark datasets confirm the superiority of this design in high-noise and high-occlusion scenarios. By providing a compact description as input to a neural network, one can make the learning problem easier and achieve faster convergence.

REFERENCES
[1] T. Birdal and S. Ilic, "Point pair features based object detection and pose estimation revisited," in Proceedings of the International Conference on 3D Vision, 2015, pp. 527-535.
[2] W. Wohlkinger and M. Vincze, "Ensemble of shape functions for 3D object classification," in Proceedings of the IEEE International Conference on Robotics and Biomimetics, Dec 2011, pp. 2987-2992.
[3] B. Drost, M. Ulrich, N. Navab, and S. Ilic, "Model globally, match locally: Efficient and robust 3D object recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2010, pp. 998-1005.
[4] E. Wahl, U. Hillenbrand, and G. Hirzinger, "Surflet-pair-relation histograms: a statistical 3D-shape representation for rapid classification," in Proceedings of the IEEE International Conference on 3-D Digital Imaging and Modeling (3DIM), 2003, pp. 474-481.
[5] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, "PointNet: Deep learning on point sets for 3D classification and segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[6] R. B. Rusu, N. Blodow, and M. Beetz, "Fast point feature histograms (FPFH) for 3D registration," in Proceedings of the IEEE International Conference on Robotics and Automation, 2009, pp. 3212-3217.
[7] S. H. Kasaei, L. S. Lopes, A. M. Tomé, and M. Oliveira, "An orthographic descriptor for 3D object learning and recognition," in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Oct 2016, pp. 4158-4163.
[8] A. J. Rodríguez-Sánchez, S. Szedmak, and J. Piater, "SCurV: A 3D descriptor for object classification," in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Sept 2015, pp. 1320-1327.
[9] A. Aldoma, F. Tombari, R. B. Rusu, and M. Vincze, "OUR-CVFH - oriented, unique and repeatable clustered viewpoint feature histogram for object recognition and 6DOF pose estimation," in Proceedings of the Joint 34th DAGM and 36th OAGM Symposium on Pattern Recognition, 2012, pp. 113-122.
[10] T. Furuya and R. Ohbuchi, "Diffusion-on-manifold aggregation of local features for shape-based 3D model retrieval," in Proceedings of the 5th ACM International Conference on Multimedia Retrieval, 2015, pp. 171-178.
[11] N. Sedaghat, M. Zolfaghari, E. Amiri, and T. Brox, "Orientation-boosted voxel nets for 3D object recognition," in Proceedings of the British Machine Vision Conference (BMVC), 2017.
[12] M. Engelcke, D. Rao, D. Z. Wang, C. H. Tong, and I. Posner, "Vote3Deep: Fast object detection in 3D point clouds using efficient convolutional neural networks," in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2017, pp. 1355-1361.
[13] Y. Li, S. Pirk, H. Su, C. R. Qi, and L. J. Guibas, "FPNN: Field probing neural networks for 3D data," in Advances in Neural Information Processing Systems (NIPS), 2016, pp. 307-315.
[14] H. Su, S. Maji, E. Kalogerakis, and E. G. Learned-Miller, "Multi-view convolutional neural networks for 3D shape recognition," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015.
[15] J. Xie, G. Dai, F. Zhu, E. K. Wong, and Y. Fang, "DeepShape: Deep-learned shape descriptor for 3D shape retrieval," IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 39, no. 7, pp. 1335-1345, July 2016.
[16] S. Ravanbakhsh, J. Schneider, and B. Poczos, "Deep learning with sets and point clouds," in Proceedings of the International Conference on Learning Representations (ICLR) - workshop track, 2017.
[17] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, "PointNet++: Deep hierarchical feature learning on point sets in a metric space," arXiv preprint arXiv:1706.02413, 2017.
[18] S. Hinterstoisser, V. Lepetit, N. Rajkumar, and K. Konolige, Going Further with Point Pair Features. Springer International Publishing, 2016, pp. 834-848.
[19] D. Bobkov, S. Chen, M. Kiechle, S. Hilsenbeck, and E. Steinbach, "Noise-resistant unsupervised object segmentation in multi-view indoor point clouds," in Proceedings of the International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP), February 2017, pp. 149-156.
[20] T.-C. Wang, J.-Y. Zhu, E. Hiroaki, M. Chandraker, A. A. Efros, and R. Ramamoorthi, "A 4D light-field dataset and CNN architectures for material recognition," in Proceedings of the European Conference on Computer Vision. Springer, 2016, pp. 121-138.
[21] A. Aldoma, Z.-C. Marton, F. Tombari, W. Wohlkinger, C. Potthast, B. Zeisl, R. B. Rusu, S. Gedikli, and M. Vincze, "Tutorial: Point cloud library: Three-dimensional object recognition and 6 DOF pose estimation," IEEE Robotics and Automation Magazine, vol. 19, pp. 80-91, 2012.
[22] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao, "3D ShapeNets: A deep representation for volumetric shapes," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1912-1920.
[23] I. Armeni, O. Sener, A. R. Zamir, H. Jiang, I. Brilakis, M. Fischer, and S. Savarese, "3D semantic parsing of large-scale indoor spaces," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[24] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner, "ScanNet: Richly-annotated 3D reconstructions of indoor scenes," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[25] A. Boulch and R. Marlet, "Fast and robust normal estimation for point clouds with sharp features," Computer Graphics Forum, vol. 31, no. 5, pp. 1765-1774, Aug. 2012.
[26] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean et al., "TensorFlow: Large-scale machine learning on heterogeneous distributed systems," arXiv preprint arXiv:1603.04467, 2016.
[27] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016, pp. 770-778.
[28] T. Birdal and S. Ilic, "A point sampling algorithm for 3D matching of irregular geometries," in