DPointNet: A Density-Oriented PointNet for 3D Object Detection in Point Clouds
Jie Li, Yu Hu
Research Center for Intelligent Computing Systems, Institute of Computing Technology, Chinese Academy of Sciences
University of Chinese Academy of Sciences
{lijie2019, huyu}@ict.ac.cn

Abstract
For current object detectors, the scale of the receptive field of feature extraction operators usually increases layer by layer. Such operators are called scale-oriented operators in this paper, e.g., the convolution layer in CNNs and the set abstraction layer in PointNet++. Scale-oriented operators are appropriate for 2D images with multi-scale objects, but not natural for 3D point clouds, whose objects are scale-invariant but have inhomogeneous point densities. In this paper, we put forward a novel density-oriented PointNet (DPointNet) for 3D object detection in point clouds, in which the density of points increases layer by layer. In object detection experiments, DPointNet is applied to PointRCNN, and the results show that the model with the new operator achieves better performance and higher speed than the baseline PointRCNN, which verifies the effectiveness of the proposed DPointNet.
1 Introduction

As one of the key technologies of autonomous driving, 3D object detection can provide accurate spatial location information to help autonomous vehicles make effective predictions and plans. The point clouds from LiDAR are adopted in many 3D detection methods, as their range information is much more accurate and stable than that of camera images. However, some attributes of point clouds are not friendly to regular computation, such as sparsity, irregularity, large data volume, and inhomogeneity (as illustrated in Figure 1), and in this paper a novel density-oriented operator is proposed to tackle the inhomogeneity of point clouds.

Most leading 3D detection methods can generally be divided into two categories [Shi et al., 2020]: grid-based methods and point-based methods. In grid-based methods, sparse irregular point clouds are transformed into compact regular representations, such as 2D bird's-eye-view (BEV) images or 3D voxels. Generally, BEV features are defined manually, such as occupancy, height, or reflectance, as used in PIXOR [Yang et al., 2018]. Voxel features were also manually defined in early methods, but after VoxelNet [Zhou and Tuzel, 2018], Voxel Feature Encoding (VFE) layers are usually adopted to learn features, as in SECOND [Yan et al., 2018] and PointPillars [Lang et al., 2019]. Unlike grid-based methods that need such a transformation, point-based methods can directly process the raw point cloud data, and PointNet [Charles et al., 2017] or PointNet++ [Qi et al., 2017] is often adopted to learn point features, as in F-PointNet [Qi et al., 2018], PointRCNN [Shi et al., 2019a], and 3DSSD [Yang et al., 2020]. In addition, some recent methods try to combine grid-based and point-based modules in one framework. In STD [Yang et al., 2019], the PointNet++ backbone is used to extract point features, and VFE layers are introduced to convert proposal features from a sparse expression to a compact representation. PV-RCNN [Shi et al., 2020] integrates both a 3D voxel CNN and PointNet-based set abstraction to learn more discriminative point cloud features. To improve localization precision, SA-SSD [He et al., 2020] designs an auxiliary network to convert the convolutional features back to point-level representations.

Figure 1: Illustration of the density inhomogeneity of point clouds, with the image on the top and the corresponding point cloud on the bottom. Four 3D bounding boxes are plotted in the point cloud for the four cars shown in the image. A car in near areas has many more points than one in far areas.

Compared with other attributes, the density inhomogeneity of point clouds has received relatively little attention in current methods. However, the point density of objects in point clouds and the scale of objects in images are comparable, while the latter plays an important role in 2D object detection. For the same object, a change of its distance from the sensor (LiDAR or camera) leads to a change of point density in point clouds or a change of scale in images. The convolution operator for images is scale-oriented, for the scale of its receptive field usually increases layer by layer, and this property is used in FPN [Lin et al., 2017a] for detecting objects at different scales. Accordingly, we propose a density-oriented operator for point clouds, in which the point density increases layer by layer.
Figure 2: Illustration of the calculation processes of three operators, including two scale-oriented operators (convolution and PointNet++) and one density-oriented operator (our DPointNet). Note that ‘PointNet++’ in the figure refers to its set abstraction layers.
As shown in Figure 2, the calculation processes of three operators are compared: two scale-oriented operators (convolution and PointNet++) and one density-oriented operator (our DPointNet). For convolution layers, the local information in a rectangular neighborhood is aggregated through a grid-based calculation, and the scale of the receptive field increases layer by layer. For the set abstraction layers in PointNet++, the local information in a circular (spherical, actually) neighborhood is aggregated through a sampling-based calculation, and the receptive field likewise grows with the number of layers.

For DPointNet, the local information is also aggregated through a sampling-based calculation. As shown in Figure 2, 5 black points are sampled in ‘Layer 1’ and 4 black points are sampled in ‘Layer 2’, and the 5 green points in ‘Layer 2’ represent the fused information from ‘Layer 1’. Thus, only part of the neighborhood information is considered in each layer, but each layer fuses the information from the previous layer, so the point density in the receptive field increases layer by layer. In addition, the green circle in ‘Layer 1’ is the same size as the brown circle in ‘Layer 2’, which means the receptive field of different layers remains unchanged.

Our contributions can be summarized as three-fold:
- The scale and density attributes of images and point clouds are analyzed, based on which two design principles of density-oriented operators for point clouds are obtained by analogy with the convolution operator for images.
- A novel density-oriented operator (DPointNet) for point clouds is proposed, which consists of one SG (Sampling and Grouping) layer and several FA (Fusion and Abstraction) layers. To the best of our knowledge, this is the first density-oriented operator for point clouds.
- DPointNet is applied to PointRCNN to verify its effectiveness, and experiments on the KITTI dataset show that it achieves better performance and faster speed than the original PointRCNN with PointNet++.
2 Related Work

The density inhomogeneity of point clouds has been considered in several methods. In PointNet++, a multi-scale grouping (MSG) strategy for the set abstraction layer is proposed to extract features at multiple scales and combine them to enhance robustness. In Frustum ConvNet [Wang and Jia, 2019], a series of frustums at different distances is used to group local points. In Voxel-FPN [Kuang et al., 2020], multi-resolution voxelization is performed on the point cloud, and an FPN [Lin et al., 2017a] structure is adopted to fuse multi-resolution features. RT3D [Zeng et al., 2018] considers the varying numbers of valid points in different parts of a car. The method of [Wang et al., 2019] uses adversarial global adaptation and fine-grained local adaptation to make the features of far-range objects similar to those of near-range objects. In [Engels et al., 2020], two separate detectors are trained to extract features of close-range and long-range objects. SegVoxelNet [Yi et al., 2020] designs a depth-aware head with convolution layers of different kernel sizes and dilation rates to explicitly model the distribution differences of three depth ranges. In DA-PointRCNN [Li and Hu, 2020], a multi-branch backbone network is adopted to match the non-uniform density of point clouds.

In the above-mentioned methods, the density inhomogeneity of point clouds is generally tackled through macro structures instead of micro operators (only PointNet++ tries to adapt its operator to non-uniform densities), and their feature extraction operators are all scale-oriented, such as 2D/3D convolution and PointNet++. In comparison, we propose a density-oriented PointNet (DPointNet) to exploit the inhomogeneity at the operator level, which fits well with the intrinsic properties of point clouds (scale invariance and density inhomogeneity).
3 DPointNet

In this section, the proposed DPointNet is described in detail. We first introduce the design principles for density-oriented operators, and then propose the two kinds of network layers in DPointNet. Finally, the auxiliary heads and training losses of the 3D detector with DPointNet are presented.

As shown in Figure 3, DPointNet includes one SG (Sampling and Grouping) layer and several FA (Fusion and Abstraction) layers. The SG layer is used to sample seeds and their neighbors, and the FA layers are designed to fuse and abstract the features of the seeds.

Data          Characteristics          Operators         Characteristics
Images        scale inhomogeneity      Scale-oriented    changing scale
              density invariance                         unchanging density
Point clouds  scale invariance         Density-oriented  unchanging scale
              density inhomogeneity                      changing density
Table 1: Comparison of characteristics for different kinds of data and operators.
Figure 3: An overview of DPointNet, composed of one SG (Sampling and Grouping) layer and several FA (Fusion and Abstraction) layers.
3.1 Design Principles

For images and point clouds, scale and density are an interesting pair of attributes. A change in the distance from the same object to the sensor leads to a scale change in images and a density change in point clouds. Consequently, two key characteristics are closely related to the design of operators, i.e., scale inhomogeneity and density invariance for images, and scale invariance and density inhomogeneity for point clouds.

As an effective scale-oriented operator, the convolution operator matches the characteristics of images well, with regular grid-based computation (for density invariance) and a receptive field enlarged over multiple layers (for scale inhomogeneity). Table 1 compares these characteristics between images and point clouds, and between scale-oriented and density-oriented operators.

Accordingly, the design principles of density-oriented operators for point clouds can be obtained from the corresponding characteristics:
- Principle 1: the scale of the receptive field of the operator remains unchanged across layers.
- Principle 2: the point density in the receptive field of the operator increases layer by layer.
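To make the contrast concrete, the sketch below writes the two operator families as per-layer configurations. It is a minimal illustration, not the paper's configuration: the radii and sample counts on the scale-oriented side are invented for contrast, while the fixed 3.0 m radius on the density-oriented side matches the implementation details in Section 4.

```python
# Scale-oriented operator (e.g., PointNet++ set abstraction): the query radius
# grows layer by layer, so the receptive field expands while the number of
# sampled neighbors stays similar (roughly constant density per ball).
scale_oriented_layers = [
    {"radius": 0.5, "n_samples": 16},
    {"radius": 1.0, "n_samples": 16},
    {"radius": 2.0, "n_samples": 16},
]

# Density-oriented operator (DPointNet, Principles 1 and 2): the radius is
# fixed, and each FA layer fuses one more group of neighbors, so the density
# of points seen inside the unchanged receptive field rises layer by layer.
density_oriented_layers = [
    {"radius": 3.0, "groups_fused": 1},
    {"radius": 3.0, "groups_fused": 2},
    {"radius": 3.0, "groups_fused": 3},
]
```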
3.2 Sampling and Grouping Layer

Given a point cloud, the first step is to sample some seeds and gather the neighbors of the seeds, which is done in the SG (Sampling and Grouping) layer of DPointNet.

As shown in Figure 4, the seeds are sampled from the input point cloud by the farthest point sampling (FPS) algorithm, then a ball query is adopted to find the points within a radius of each seed. The number of neighbors of each seed is set to K, and repeated random sampling is used if there are not enough neighbors.
Figure 4: Illustration of the Sampling and Grouping layer in DPointNet.
The neighbors of each seed are divided into several groups according to the number of operation layers, where an operation layer refers to the FA (Fusion and Abstraction) layer introduced in the next part. For example, three groups are illustrated in Figure 4, i.e., the green group, the red group, and the blue group, which means that the SG layer is followed by three FA layers. All the neighbors used in the FA layers come from the SG layer, so there is only one sampling operation, which is more efficient than PointNet++, where sampling and grouping are required for each set abstraction layer.
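A minimal PyTorch sketch of the SG layer's sampling logic follows, using the radius (3.0 m), neighbor count (K = 24), and seed count (4,096) from the implementation details in Section 4. The per-seed Python loop and the even split of neighbors into groups are our simplifications; the actual group assignment and batched CUDA implementation may differ.

```python
import torch

def farthest_point_sampling(xyz: torch.Tensor, n_seeds: int) -> torch.Tensor:
    """Iterative FPS over one point cloud of shape (N, 3); returns seed indices."""
    n = xyz.shape[0]
    idx = torch.zeros(n_seeds, dtype=torch.long, device=xyz.device)
    dist = torch.full((n,), float("inf"), device=xyz.device)
    farthest = int(torch.randint(0, n, (1,)))
    for i in range(n_seeds):
        idx[i] = farthest
        d = ((xyz - xyz[farthest]) ** 2).sum(dim=1)  # squared distance to newest seed
        dist = torch.minimum(dist, d)                # distance to the seed set so far
        farthest = int(torch.argmax(dist))           # pick the farthest remaining point
    return idx

def sg_layer(xyz: torch.Tensor, radius=3.0, n_seeds=4096, k=24, n_groups=3):
    """Sample seeds once, ball-query K neighbors per seed, and split the
    neighbors into n_groups groups, one group per subsequent FA layer."""
    seeds = xyz[farthest_point_sampling(xyz, n_seeds)]   # (S, 3)
    dists = torch.cdist(seeds, xyz)                      # (S, N) pairwise distances
    grouped = []
    for s in range(n_seeds):
        in_ball = torch.nonzero(dists[s] < radius).squeeze(1)
        # sample K indices with replacement; this also covers balls that
        # hold fewer than K points (repeated random sampling)
        choice = in_ball[torch.randint(0, in_ball.numel(), (k,))]
        grouped.append(xyz[choice])
    grouped = torch.stack(grouped)                       # (S, K, 3)
    return seeds, list(grouped.chunk(n_groups, dim=1))   # n_groups x (S, K // n_groups, 3)
```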
3.3 Fusion and Abstraction Layers

With the neighbor information of the seeds, the next step is to fuse and abstract information for each seed, which is done in the FA (Fusion and Abstraction) layers of DPointNet.

As shown in Figure 5, we design three schemes to fuse and abstract features for the FA layer: (a) feature appending; (b) coordinate concatenation; (c) feature concatenation. Only the calculation for one seed is shown as an example; the same operations are applied to the other seeds.

In Figure 5(a), the neighbors of each seed are divided into three groups, and each group is pooled in one FA layer, so the number of groups decreases layer by layer. For each layer, the first group is pooled into one feature, the feature is appended to the second group, and then all the features are transformed by a multi-layer perceptron (MLP) network. Let $g$ denote the first group and $g_{oth}$ denote the other groups; the calculation of each FA layer is as follows:

$\mathrm{mlp}(\mathrm{append}(\mathrm{pool}(g), g_{oth}))$   (1)

In Figure 5(b), the neighbors of each seed are divided into three groups, and each group is processed in one FA layer. For each layer, the first group consists of features while the other groups are coordinates. The first group is pooled into one feature, the feature is concatenated with the coordinates of the second group, and then an MLP is used to transform the feature.
Figure 5: Illustration of different schemes to fuse and abstract features: (a) feature appending; (b) coordinate concatenation; (c) feature concatenation. Only the neighbor information of one seed is shown as an example, and the same operations are applied to all the other seeds.
Let $g$ denote the first group and $c$ denote the second group; the calculation is as follows:

$\mathrm{mlp}(\mathrm{concat}(c, \mathrm{pool}(g)))$   (2)

In Figure 5(c), the neighbors of each seed are divided into three groups, and each group is also processed in one FA layer. For each layer, the first group is pooled into one feature, the feature is concatenated with the features of the other groups, and then an MLP is used to squeeze and transform the features. Let $g$ denote the first group and $g_{oth}$ denote the other groups; the calculation is as follows:

$\mathrm{squeeze}(\mathrm{concat}(\mathrm{pool}(g), g_{oth}))$   (3)

where ‘squeeze’ is also an MLP, used to reduce the dimensions of the concatenated features.

Scheme (a) can transform the features of all remaining groups in each FA layer, but the feature fusion of the ‘appending’ mechanism is insufficient: the pooled feature from the first group only affects the pooled feature of the second group. Scheme (b) adopts the ‘concatenation’ mechanism for fusion, and only the coordinate information of the second group is considered, to save memory. However, the proportion of the newly added coordinate information is small and decreases sharply layer by layer, which is not suitable for deep networks. Scheme (c) combines the advantages of schemes (a) and (b): sufficient feature extraction from transforming the features of all remaining groups, and sufficient feature fusion from the ‘concatenation’ mechanism. An MLP is used to reduce the dimensions of the concatenated features to save memory. The performance comparison of the three schemes is given in the experimental section.
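The sketch below is one possible reading of Eq. (3) in PyTorch: max pooling for ‘pool’ and a single pointwise linear layer for the ‘squeeze’ MLP are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class FALayerC(nn.Module):
    """One FA layer of scheme (c): pool the first group, concatenate the pooled
    feature onto every point of each remaining group, then squeeze the widened
    features back to out_dim (Eq. (3))."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.squeeze = nn.Sequential(nn.Linear(2 * in_dim, out_dim), nn.ReLU())

    def forward(self, groups):
        # groups: list of (S, K_g, C) tensors; one group is consumed per layer,
        # so the list shrinks by one each time the layer is applied.
        g, g_oth = groups[0], groups[1:]
        pooled = g.max(dim=1).values                       # (S, C): pool(g)
        fused = []
        for other in g_oth:                                # concat(pool(g), g_oth)
            p = pooled.unsqueeze(1).expand(-1, other.shape[1], -1)
            fused.append(torch.cat([other, p], dim=-1))    # (S, K_g, 2C)
        return [self.squeeze(f) for f in fused]            # squeeze back to out_dim
```

In the last FA layer no groups remain to receive the pooled feature, so the pooled feature itself would serve as the seed output; the paper does not spell out this boundary case, so treat that choice as part of the sketch.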
3.4 Auxiliary Heads

Different from the SA (Set Abstraction) layers in PointNet++, the FA layers in DPointNet do not change the number of seeds, so each FA layer can provide features for all seeds. Therefore, besides the last FA layer, auxiliary heads can also be attached to the other FA layers. As shown in Figure 6, three detection heads with the same structure are assigned to three FA layers, and only the top head is used for inference while the others are auxiliary heads for training.

Figure 6: Illustration of auxiliary heads for DPointNet. All detection heads share the same structure, and only the top head is used for inference while the others are auxiliary heads for training.
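The arrangement in Figure 6 can be sketched as one identically structured head per FA layer, with all heads supervised during training and only the last head kept at inference. The head internals below (a single fully connected layer each for classification and box regression) are schematic assumptions, not the paper's exact head design.

```python
import torch.nn as nn

class MultiHead(nn.Module):
    """One detection head per FA layer; all heads share the same structure."""

    def __init__(self, feat_dim: int, n_layers: int = 3):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.ModuleDict({
                "cls": nn.Linear(feat_dim, 1),   # foreground score per seed
                "box": nn.Linear(feat_dim, 7),   # (x, y, z, l, h, w, theta) residuals
            })
            for _ in range(n_layers)
        )

    def forward(self, per_layer_feats, training: bool = True):
        # per_layer_feats: list of (S, feat_dim) seed features, one per FA layer.
        if not training:  # inference: only the top head, no extra computation
            head, feats = self.heads[-1], per_layer_feats[-1]
            return [(head["cls"](feats), head["box"](feats))]
        # training: every head predicts, and every prediction is supervised
        return [(h["cls"](f), h["box"](f))
                for h, f in zip(self.heads, per_layer_feats)]
```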
As shown in Figure 7, DPointNet and the auxiliary heads are applied to PointRCNN to demonstrate their effectiveness for 3D object detection.

Figure 7: The overall architecture of PointRCNN with the proposed DPointNet. The detector consists of two stages: stage 1 generates 3D proposals, and stage 2 refines them. DPointNet is adopted in both stages to learn seed features, and auxiliary heads are added for the training process.
3.5 Training Losses

The PointRCNN with the proposed DPointNet (see Figure 7) is trained end-to-end with the region proposal loss $L_{rpn}$ and its auxiliary part $L_{rpn\text{-}aux}$, and the proposal refinement loss $L_{rcnn}$ and its auxiliary part $L_{rcnn\text{-}aux}$, formulated as:

$L = L_{rpn} + L_{rpn\text{-}aux} + L_{rcnn} + L_{rcnn\text{-}aux}$   (4)

$L_{rpn}$ and $L_{rpn\text{-}aux}$ share the same loss function. The focal loss [Lin et al., 2017b] with default hyper-parameters is adopted to address the foreground-background imbalance, and the smooth-L1 loss is utilized for anchor box regression, with the predicted residual $\Delta u^{p}$ and the regression target $\Delta u^{t}$:

$L_{rpn} = L_{focal} + \sum_{u \in \{x,y,z,l,h,w,\theta\}} L_{smooth\text{-}L1}\big(\Delta u^{p}, \Delta u^{t}\big)$   (5)

$L_{rcnn}$ and $L_{rcnn\text{-}aux}$ also share the same loss function. The binary cross entropy loss is used for classification, and the smooth-L1 loss is adopted for box refinement:

$L_{rcnn} = L_{bce} + \sum_{u \in \{x,y,z,l,h,w,\theta\}} L_{smooth\text{-}L1}\big(\Delta u^{p}, \Delta u^{t}\big)$   (6)
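A minimal sketch of one loss term is given below: Eq. (5) as focal loss plus smooth-L1 over the seven box residuals. The sum reduction, the hard 0/1 targets, and the default focal parameters (alpha = 0.25, gamma = 2, from Lin et al., 2017b) are simplifications; the real model follows OpenPCDet's loss implementation.

```python
import torch

def smooth_l1(pred: torch.Tensor, target: torch.Tensor, beta: float = 1.0) -> torch.Tensor:
    """Smooth-L1 over box residuals (x, y, z, l, h, w, theta), summed."""
    diff = (pred - target).abs()
    return torch.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta).sum()

def rpn_loss(cls_logits, cls_targets, box_pred, box_target,
             alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """Eq. (5): focal classification loss + smooth-L1 box regression loss.
    cls_targets is a float tensor of 0/1 foreground labels."""
    p = torch.sigmoid(cls_logits)
    pt = cls_targets * p + (1 - cls_targets) * (1 - p)           # prob. of true class
    at = cls_targets * alpha + (1 - cls_targets) * (1 - alpha)   # class weighting
    focal = (-at * (1 - pt) ** gamma * torch.log(pt.clamp(min=1e-6))).sum()
    return focal + smooth_l1(box_pred, box_target)

# Eq. (4): the total objective sums main and auxiliary losses of both stages:
# L = rpn_loss(...) + rpn_aux_loss(...) + rcnn_loss(...) + rcnn_aux_loss(...)
```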
4 Experiments

In this section, the PointRCNN with the proposed DPointNet is named D-PointRCNN for convenience, and the model is evaluated on the widely used KITTI Object Detection Benchmark [Geiger et al., 2012].

Dataset
The KITTI dataset contains 7,481 training samples and 7,518 test samples, and the training samples are divided into the train split (3,712 samples) and the val split (3,769 samples). All the models in this paper are trained on the train split, and we compare D-PointRCNN and the original PointRCNN on both the val split and the test set. The dataset includes three categories, Car, Pedestrian, and Cyclist, and only the class Car is evaluated, for its rich data and scenarios, as in [Yang et al., 2020].
Implementation Details
D-PointRCNN is developed from the PointRCNN implementation in the OpenPCDet [Team, 2020] project, with DPointNet applied to learn features. For a fair comparison, the main experimental setup remains the same as for the original model, and the differing parts are detailed below.

Data sampling and augmentation strategies remain unchanged: 16,384 points are sampled from each point cloud, and random flip, scaling, rotation, and GT-AUG [Yan et al., 2018], which adds extra non-overlapping ground-truth boxes from other scenes, are employed.

For the network structure, DPointNet with two auxiliary heads is adopted to replace the original PointNet++. The numbers of FA layers of DPointNet in stage 1 and stage 2 are four and three respectively, the same as the numbers of SA layers in PointNet++. To save memory, the number of seeds sampled by the SG layer is 4,096, a quarter of the total sampled points. In addition, the sampling radius and the number of sampling points of each FA layer are 3.0 m and 24, respectively.

The training scheme is the same as that of PointRCNN in OpenPCDet. The whole D-PointRCNN is trained end-to-end for 80 epochs with batch size 4 on 1 Tesla V100 GPU, which takes around 20 G of memory and 18.7 hours (14 G and 20 hours for PointRCNN), and the learning rate is initialized to 0.01 for the Adam optimizer. The final model submitted for testing is trained for 100 epochs, taking about 23 hours.
Evaluation Metric
The average precision (AP) of class Car with a 0.7 IoU threshold is used to evaluate all results. For the recall positions of the AP settings, 40 positions are used for the test set and 11 positions for the val split, and all models are trained on the train split.
Results on KITTI Test Set

As illustrated in Table 2, on the main metric, i.e., AP on “moderate” instances, D-PointRCNN outperforms PointRCNN by 0.7%, with similar performance on “hard” instances. However, there is a drop of about 5% on “easy” instances, and we think the main cause is the mismatched distribution of the test and val data, as mentioned in Part-A² [Shi et al., 2019b] and CIA-SSD [Zheng et al., 2020].

Method        AP_3D (%)
              Easy    Moderate    Hard
PointRCNN     86.96   75.64       70.70
D-PointRCNN   81.67   –           –

Table 2: Performance comparison on the KITTI test set on class Car, drawn from the official benchmark. The evaluation metric is Average Precision (AP) with IoU threshold 0.7.
Results on KITTI Val Set

As illustrated in Table 3, D-PointRCNN outperforms PointRCNN on all three difficulties by 0.4% to 0.6%, with only about 60% of the running time, demonstrating the effectiveness of the proposed DPointNet. In addition, the memory occupation of D-PointRCNN is similar to that of PointRCNN. The inference time and memory cost are all measured on 1 Tesla V100 GPU with batch size 1.

Method        AP_3D (%)                  Time (s)
              Easy    Moderate    Hard
PointRCNN*    88.88   78.63       77.38  –
PointRCNN     88.90   78.70       77.82  0.12
D-PointRCNN   89.27   79.28       78.35  0.07
Improvement   +0.37   +0.58       +0.53  -0.05

Table 3: Performance comparison on the KITTI val split on class Car. The PointRCNN with * is the one reported in the original paper, and the PointRCNN without * is the version in OpenPCDet.
Ablation Studies

In this section, ablation experiments are carried out to analyze the components of D-PointRCNN, and all models are trained on the train split and evaluated on the val split of the KITTI dataset.
The FA Layers of Different Schemes
As stated in Sec. 3.3, three schemes are designed for the FA (Fusion and Abstraction) layer of DPointNet: (a) feature appending; (b) coordinate concatenation; (c) feature concatenation. In terms of design mechanism, scheme (c) has better feature fusion than scheme (a), and better feature abstraction than scheme (b). As shown in Table 4, scheme (c) outperforms both scheme (a) and scheme (b), demonstrating the effectiveness of the ‘feature concatenation’ mechanism.

Schemes      AP_E    AP_M    AP_H
scheme (a)   88.63   78.77   77.79
scheme (b)   88.71   78.67   77.77
scheme (c)   –       –       –

Table 4: Performance comparison of different schemes for FA layers. AP_E, AP_M, AP_H denote the Average Precision (AP) with IoU threshold 0.7 for easy, moderate, and hard difficulty on the val split.

The Layer of the Detection Head
In Figure 6, three detection heads are illustrated, named the “top head”, “middle head”, and “bottom head” in this part. We conduct experiments to study the effect of the layer at which the detection head of stage 1 is attached.

As shown in Table 5, the performance of the “top head” is the best among the three, which means that deeper features lead to better results. In D-PointRCNN, only the “top head” is used for inference, and the other two serve as auxiliary heads.

In addition, for stage 1 of D-PointRCNN, only four FA layers are adopted to extract seed features, so the numbers of FA layers below the “top head”, “middle head”, and “bottom head” are 4, 3, and 2, respectively. As shown in Table 5, the largest performance difference is about 0.5% (for AP_H, between “bottom” and “top”), which means that a few FA layers can already provide relatively good features; the computation cost can thus be further reduced at a small loss of performance.

Layer    AP_E    AP_M    AP_H
bottom   88.88   78.87   77.66
middle   88.95   78.91   77.91
top      –       –       –

Table 5: Effects of the layer of the detection head.
The Sampling Radius of FA Layers
The receptive fields of all FA layers in DPointNet share the same size, and the effects of different sampling radii, i.e., different sizes of the receptive field, are explored in this part. As shown in Table 6, the optimal value of the radius is around 3.0 m, but the performance differences among the various radii are marginal, which reflects the robustness of the operator. In addition, we found in experiments that the performance variance at 1.0 m or 5.0 m is a little larger than at the other radii, which can be compensated for by training several times.

Radius (m)    AP_E    AP_M    AP_H

Table 6: Effects of different sampling radii.
Points   AP_E    AP_M    AP_H    Memory
16       88.77   78.78   77.79   13 G
24       89.10   79.01   78.14   18 G
32       –       –       –       24 G

Table 7: Effects of different numbers of sampling points.
The Number of Sampling Points of FA Layers
The number of sampling points of each FA layer is studied in this part, and the results are shown in Table 7. More sampling points provide richer neighborhood information and better performance, but the memory occupation also increases. Compared with 24 sampling points, 32 points have a larger memory cost but only a marginal performance improvement, so 24 points are adopted in our model as a good balance between performance and memory cost. In addition, if around 0.5% performance degradation is allowed, 16 points could be a good choice to save memory and reduce computation cost.

Auxiliary Head   Extra Training   AP_E   AP_M   AP_H
×                ×                –      –      –
√                ×                –      –      –
√                √                –      –      –
Table 8: Effects of the auxiliary heads and extra training.
Auxiliary Heads and Extra Training
For the final D-PointRCNN, auxiliary heads and extra training (20 more epochs) are used to improve the performance. As shown in Table 8, the use of auxiliary heads brings about 0.2% improvement on the moderate set, and extra training further brings about 0.1% performance gain. The auxiliary heads can be removed at the inference stage and the extra epochs are only for training, so no additional computation is introduced at inference.
5 Conclusion

We have proposed a density-oriented operator (DPointNet) in this paper, which consists of one SG (Sampling and Grouping) layer and several FA (Fusion and Abstraction) layers. To our knowledge, this is the first density-oriented operator for point clouds. DPointNet is applied to PointRCNN to verify its effectiveness on 3D object detection, and it shows better performance and faster speed than the original PointRCNN with PointNet++ operators.

Several parameters of the FA layers are studied in experiments, such as the sampling radius and the number of sampling points, and the results show that the memory occupation and computation cost can be further reduced if a little performance degradation is allowed. In addition, we analyze the scale and density attributes of images and point clouds, and obtain two design principles of density-oriented operators for point clouds, which can provide guidance for future operator designs.

Acknowledgments
This work is supported in part by the National Key R&D Program of China under grant No. 2018AAA0102701, in part by the Science and Technology on Space Intelligent Control Laboratory under grant No. HTKJ2019KL502003, and in part by the Innovation Project of the Institute of Computing Technology, Chinese Academy of Sciences under grant No. 20186090.
References

[Charles et al., 2017] R. Qi Charles, Hao Su, Mo Kaichun, and Leonidas J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In CVPR, pages 77–85, 2017.

[Engels et al., 2020] Guus Engels, Nerea Aranjuelo, Ignacio Arganda-Carreras, Marcos Nieto, and Oihana Otaegui. 3D object detection from LiDAR data using distance dependent feature extraction. In Proceedings of the 6th International Conference on Vehicle Technology and Intelligent Transport Systems, pages 289–300, 2020.

[Geiger et al., 2012] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In CVPR, pages 3354–3361, 2012.

[He et al., 2020] Chenhang He, Hui Zeng, Jianqiang Huang, Xian-Sheng Hua, and Lei Zhang. Structure aware single-stage 3D object detection from point cloud. In CVPR, pages 11873–11882, 2020.

[Kuang et al., 2020] Hongwu Kuang, Bei Wang, Jianping An, Ming Zhang, and Zehan Zhang. Voxel-FPN: Multi-scale voxel feature aggregation for 3D object detection from LiDAR point clouds. Sensors, 20(3):704, 2020.

[Lang et al., 2019] Alex H. Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. PointPillars: Fast encoders for object detection from point clouds. In CVPR, pages 12697–12705, 2019.

[Li and Hu, 2020] Jie Li and Yu Hu. A density-aware PointRCNN for 3D object detection in point clouds. arXiv preprint arXiv:2009.05307, 2020.

[Lin et al., 2017a] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, pages 936–944, 2017.

[Lin et al., 2017b] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In ICCV, pages 2999–3007, 2017.

[Qi et al., 2017] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J. Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pages 5099–5108, 2017.

[Qi et al., 2018] Charles R. Qi, Wei Liu, Chenxia Wu, Hao Su, and Leonidas J. Guibas. Frustum PointNets for 3D object detection from RGB-D data. In CVPR, pages 918–927, 2018.

[Shi et al., 2019a] Shaoshuai Shi, Xiaogang Wang, and Hongsheng Li. PointRCNN: 3D object proposal generation and detection from point cloud. In CVPR, pages 770–779, 2019.

[Shi et al., 2019b] Shaoshuai Shi, Zhe Wang, Xiaogang Wang, and Hongsheng Li. Part-A² net: 3D part-aware and aggregation neural network for object detection from point cloud. arXiv preprint, 2019.

[Shi et al., 2020] Shaoshuai Shi, Chaoxu Guo, Li Jiang, Zhe Wang, Jianping Shi, Xiaogang Wang, and Hongsheng Li. PV-RCNN: Point-voxel feature set abstraction for 3D object detection. In CVPR, pages 10529–10538, 2020.

[Team, 2020] OpenPCDet Development Team. OpenPCDet: An open-source toolbox for 3D object detection from point clouds. https://github.com/open-mmlab/OpenPCDet, 2020.

[Wang and Jia, 2019] Zhixin Wang and Kui Jia. Frustum ConvNet: Sliding frustums to aggregate local point-wise features for amodal 3D object detection. arXiv preprint arXiv:1903.01864, 2019.

[Wang et al., 2019] Ze Wang, Sihao Ding, Ying Li, Minming Zhao, Sohini Roychowdhury, Andreas Wallin, Guillermo Sapiro, and Qiang Qiu. Range adaptation for 3D object detection in LiDAR. In ICCV Workshops, pages 2320–2328, 2019.

[Yan et al., 2018] Yan Yan, Yuxing Mao, and Bo Li. SECOND: Sparsely embedded convolutional detection. Sensors, 18(10):3337, 2018.

[Yang et al., 2018] Bin Yang, Wenjie Luo, and Raquel Urtasun. PIXOR: Real-time 3D object detection from point clouds. In CVPR, pages 7652–7660, 2018.

[Yang et al., 2019] Zetong Yang, Yanan Sun, Shu Liu, Xiaoyong Shen, and Jiaya Jia. STD: Sparse-to-dense 3D object detector for point cloud. In ICCV, pages 1951–1960, 2019.

[Yang et al., 2020] Zetong Yang, Yanan Sun, Shu Liu, and Jiaya Jia. 3DSSD: Point-based 3D single stage object detector. In CVPR, pages 11040–11048, 2020.

[Yi et al., 2020] Hongwei Yi, Shaoshuai Shi, Mingyu Ding, Jiankai Sun, Kui Xu, Hui Zhou, Zhe Wang, Sheng Li, and Guoping Wang. SegVoxelNet: Exploring semantic context and depth-aware features for 3D vehicle detection from point cloud. arXiv preprint arXiv:2002.05316, 2020.

[Zeng et al., 2018] Yiming Zeng, Yu Hu, Shice Liu, Jing Ye, Yinhe Han, Xiaowei Li, and Ninghui Sun. RT3D: Real-time 3-D vehicle detection in LiDAR point cloud for autonomous driving. IEEE Robotics and Automation Letters, 3(4):3434–3440, 2018.

[Zheng et al., 2020] Wu Zheng, Weiliang Tang, Sijin Chen, Li Jiang, and Chi-Wing Fu. CIA-SSD: Confident IoU-aware single-stage object detector from point cloud. arXiv preprint arXiv:2012.03015, 2020.

[Zhou and Tuzel, 2018] Yin Zhou and Oncel Tuzel. VoxelNet: End-to-end learning for point cloud based 3D object detection. In CVPR, pages 4490–4499, 2018.