InfoFocus: 3D Object Detection for Autonomous Driving with Dynamic Information Modeling
Jun Wang⋆, Shiyi Lan⋆, Mingfei Gao, and Larry S. Davis
University of Maryland, College Park, MD 20742, USA
Salesforce Research, Palo Alto, CA 94301, USA
{junwang,lsd}@umiacs.umd.edu, [email protected]
⋆ Equal contribution.

Abstract.
Real-time 3D object detection is crucial for autonomous cars. Achieving promising performance with high efficiency, voxel-based approaches have received considerable attention. However, previous methods model the input space with features extracted from equally divided sub-regions without considering that point cloud is generally non-uniformly distributed over the space. To address this issue, we propose a novel 3D object detection framework with dynamic information modeling. The proposed framework is designed in a coarse-to-fine manner. Coarse predictions are generated in the first stage via a voxel-based region proposal network. We introduce InfoFocus, which improves the coarse detections by adaptively refining features guided by the information of point cloud density. Experiments are conducted on the large-scale nuScenes 3D detection benchmark. Results show that our framework achieves state-of-the-art performance at 31 FPS and improves our baseline significantly by 9.0% mAP on the nuScenes test set.
Keywords:
3D Object Detection, Point Cloud
1 Introduction

With growing interest in autonomous vehicles, 3D object detection has received considerable attention. Due to its superior capability of modeling 3D objects, point cloud is the most popular type of data source. Most existing 3D detectors are either point-based [17,25,11,21,27] or voxel-based [12,30,26,28,7]. Point-based approaches generate features directly from raw point cloud data. Although achieving promising performance, these methods suffer from high computational complexity, which discourages their application in real-time scenarios. Voxel-based approaches [12,30,26,28,7] first convert point cloud into voxels and then employ deep convolutional neural networks (DCNN) to conduct object detection. Taking advantage of advanced DCNN architectures, voxel-based approaches achieve state-of-the-art performance with low computational cost. Our work follows the setting of voxel-based methods for their advanced balance of efficiency and effectiveness.
Fig. 1.
Left: we calculate the average point density across different parts of objects in BEV on the nuScenes training set. E1 - E4 indicate four edges sorted by their normalized density scores (summing to 100%), and "others" denotes the areas inside objects. We set each edge width to 10% of the corresponding object dimension and only count objects with more than 100 points. Middle and right: we visualize an example of the LiDAR point cloud as a 2D image and its corresponding bird's eye view (BEV). Clearly, most of the points lie on the contour of the object.

Although much progress has been made in improving the performance of voxel-based detectors, an important characteristic of point cloud is not well explored: input data points are usually not uniformly distributed over the space. The density of point cloud can be affected by different factors, e.g., the distance of objects from the LiDAR sensor and object self-occlusion. As illustrated in Fig. 1, the density of point cloud over objects highly depends on the relative locations of different parts. It is also intuitive that the amount of information is highly related to the point density. However, existing voxel-based detectors extract features from uniformly divided sub-regions, regardless of the actual distribution of the points. We believe that this leads to loss of useful information and ultimately results in sub-optimal detection performance.

To fully exploit the non-uniform distribution of point cloud, we propose a novel 3D object detection framework that adaptively models the rich features of 3D objects according to the information density of points. As illustrated in Fig. 2, our framework contains two stages. Coarse detection results are obtained in the first stage via a voxel-based region proposal network. In the second stage, we introduce InfoFocus to model and extract informative features from regions of interest (RoIs, formed by the coarse predictions) according to the distribution of point cloud, and the predictions are improved with the help of the refined features. InfoFocus is the core structure of our framework, which contains three sequentially connected modules: Point-of-Interest (PoI) Pooling, the Visibility Attentive Module, and the Adaptive Point-wise Attention.
PoI Pooling. Unlike 2D objects, which contain densely distributed information over the whole RoI, most of the points of 3D objects lie on their surfaces. Therefore, we hypothesize that the most informative features concentrate on the edges of the RoI. Motivated by this intuition, we propose PoI Pooling, which densely samples features on the edges and sparsely samples features in the middle of the RoI to accommodate the non-uniform information distribution of point cloud.
Visibility Attentive Module.
Heavy self-occlusion is present because of the nature of LiDAR data: no points exist on the side of an object facing away from the sensor. To mitigate this issue, our proposed Visibility Attentive Module applies hard attention to emphasize the visible parts of objects and eliminate the features from invisible points.
Adaptive Point-wise Attention.
PoIs may carry different amounts of information even though they are all visible. We introduce Adaptive Point-wise Attention to re-weight the features and improve the modeling of 3D objects.

We conduct extensive experiments on the largest public 3D object detection benchmark, i.e., nuScenes [1]. Experimental results show that our approach significantly outperforms the baselines, achieving 39.5% mAP at 31 FPS. Results of comprehensive ablation studies demonstrate the effectiveness of InfoFocus and show that each sub-module makes a considerable contribution to our framework.
2 Related Work

Point-based Detectors. Inspired by the powerful feature learning capability of PointNet [18,19] and the advanced modeling structure of 2D object detectors [5,4,20], Frustum PointNets [17] extrudes 2D object proposals into frustums to generate 3D bounding boxes from raw point cloud. Lan et al. [11] add a decomposition-aggregation module that models local geometry to extract the global feature descriptor of point cloud. Limited by the initial 2D box proposals, these methods yield low performance when objects are occluded. In contrast, PointRCNN [21] generates 3D proposals directly from point cloud instead of 2D images. The recent STD [27] refines the detection boxes in a coarse-to-fine manner. However, all these methods are computationally expensive due to the large number of data points to be processed.
Multi-view 3D Detectors. MV3D [2] fuses multi-view feature maps for the generation of 3D box proposals. Following [2], Ku et al. [10] explore high-resolution feature maps to compensate for the information loss on small objects. These methods address the feature alignment between modalities at a coarse level and are typically slow. Liang et al. [14] design a continuous fusion layer to deal with the continuous state of LiDAR and the discrete state of images. Later, [13,24] leverage different strategies to jointly fuse related tasks to improve feature representation.
Voxel-based Detectors. Recently, there is a trend of using regular 3D voxel grids to represent point cloud so that the input data can be easily processed by 3D/2D convolution networks. Among these, VoxelNet [30] is the pioneering work performing voxelization on the raw 3D point cloud. To improve its efficiency, SECOND [26] adopts sparse convolution and speeds up the detection process without compromising detection accuracy. PointPillars [12] dynamically converts the 3D point cloud into a 2D pseudo image, making it more suitable for the application of existing 2D object detection techniques. In [28], Ye et al. design a new voxel generator to reduce the information loss along the vertical direction. Building upon voxel-based detectors, our model captures richer information about objects by refining their feature representations in a second stage guided by the point cloud density, and ultimately improves the detection results.

There are several recent studies [3,16] focusing on fusing voxel-based features with PointNet-based features in order to extract more fine-grained 3D features. InfoFocus is complementary to these techniques and can be applied on top of them. WYSIWYG [7] is the method most related to our approach, since both drive the model to encode visibility information. However, instead of using a separate branch to generate a hidden visibility representation, our method directly aggregates the valuable point-wise features from the existing backbone network to refine the proposals in an end-to-end manner.
3 Method

The proposed framework is illustrated in Fig. 2. It consists of a deep feature extractor followed by a two-stage architecture. The deep feature extractor, containing a Pillar Feature Network and a DCNN, converts the input point cloud into representative feature maps. Specifically, the Pillar Feature Network divides the whole space into equal pillars and generates a so-called pseudo image [12]. The pseudo image is then processed by the DCNN to obtain the feature maps that are shared by the two stages, i.e., the Region Proposal Network (RPN) and InfoFocus. The RPN generates the initial coarse bounding box proposals, which are refined by InfoFocus with dynamic information modeling. Note that our Deep Feature Extractor and RPN follow the setting of [12].
The Deep Feature Extractor is composed of two parts: 1) voxelization using the Pillar Feature Network, which converts the orderless point cloud into a sparse pseudo image via a simplified PointNet-like architecture, and 2) feature extraction using a DCNN to learn informative feature maps.
Pillar Feature Network. The Pillar Feature Network operates on the raw point cloud and learns point-wise features for each pillar. After voxelizing the raw point cloud into evenly spaced pillars, we randomly sample N points from each non-empty pillar and obtain a dense tensor of size D × P × N, where D indicates the information dimension of each point, P denotes the number of non-empty pillars, and N denotes the number of points in each pillar. The Pillar Feature Network utilizes a PointNet-like block to learn a multi-dimensional feature vector for each pillar. The pillar-wise features are encoded into a 2D pseudo image of shape W × L × C, where W and L indicate the width and length of the pseudo image, and C is the number of channels of the feature map.
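To make this step concrete, the sketch below outlines how a pillar-style feature encoder could be implemented in PyTorch. It is a minimal illustration, assuming a shared Linear-BatchNorm-ReLU layer as the PointNet-like block and a scatter step that writes pillar features back into the W × L × C pseudo image; the module and variable names are ours, not taken from the released code.

```python
import torch
import torch.nn as nn

class PillarFeatureNet(nn.Module):
    """Minimal sketch of a PointNet-like pillar encoder (illustrative only)."""

    def __init__(self, in_dim=9, out_dim=64):
        super().__init__()
        # Shared per-point MLP: Linear -> BatchNorm -> ReLU.
        self.linear = nn.Linear(in_dim, out_dim, bias=False)
        self.bn = nn.BatchNorm1d(out_dim)

    def forward(self, points, coords, spatial_shape):
        """
        points: (P, N, D) sampled points of the P non-empty pillars.
        coords: (P, 2) integer (x, y) pillar indices on the BEV grid.
        spatial_shape: (W, L) size of the pseudo image.
        """
        P, N, D = points.shape
        x = self.linear(points.view(P * N, D))
        x = torch.relu(self.bn(x)).view(P, N, -1)
        # Max-pool over the N points to obtain one feature vector per pillar.
        pillar_feat = x.max(dim=1).values              # (P, C)
        # Scatter pillar features into a dense W x L x C pseudo image.
        W, L = spatial_shape
        C = pillar_feat.shape[1]
        canvas = points.new_zeros(C, W * L)
        flat_idx = coords[:, 0] * L + coords[:, 1]
        canvas[:, flat_idx] = pillar_feat.t()
        return canvas.view(C, W, L)                    # pseudo image
```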
Fig. 2.
The proposed 3D object detection framework. It consists of three parts: the Deep Feature Extractor (DFE), the Region Proposal Network, and InfoFocus. InfoFocus contains three modules: PoI Pooling, the Visibility Attentive Module, and the Adaptive Point-wise Attention Module.
Deep Convolutional Neural Network (DCNN). The DCNN learns feature maps from the generated 2D pseudo image. It uses conv-deconv layers to extract features at different levels and concatenates them to obtain the final features from different strides.
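The following sketch illustrates one plausible conv-deconv arrangement of this kind: strided convolution blocks produce feature maps at several strides, transposed convolutions bring them back to a common resolution, and the results are concatenated. Block counts and channel widths are placeholders chosen for illustration, not the exact backbone configuration.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, stride):
    # A strided conv followed by a stride-1 conv (simplified block).
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )

class ConvDeconvBackbone(nn.Module):
    """Illustrative multi-stride conv-deconv backbone over the pseudo image."""

    def __init__(self, in_ch=64):
        super().__init__()
        self.down1 = conv_block(in_ch, 64, stride=1)
        self.down2 = conv_block(64, 128, stride=2)
        self.down3 = conv_block(128, 256, stride=2)
        # Deconv (transposed conv) layers map every level back to stride 1.
        self.up1 = nn.ConvTranspose2d(64, 128, 1, stride=1)
        self.up2 = nn.ConvTranspose2d(128, 128, 2, stride=2)
        self.up3 = nn.ConvTranspose2d(256, 128, 4, stride=4)

    def forward(self, pseudo_image):
        x1 = self.down1(pseudo_image)
        x2 = self.down2(x1)
        x3 = self.down3(x2)
        # Concatenate features coming from different strides.
        return torch.cat([self.up1(x1), self.up2(x2), self.up3(x3)], dim=1)
```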
The RPN takes the feature maps provided by the DCNN as inputs and produces high-quality 3D object proposals. Similar to proposal generation in 2D object detection, anchor boxes are predefined at each position, and proposals are generated by learning the offsets between anchors and the ground truths. To handle different scales of objects, a dual-head strategy is adopted. Specifically, the small-scale head takes features from the first conv-deconv phase of the DCNN, while the large-scale head takes the features from its concatenation phase.
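A minimal sketch of such a dual-head arrangement is given below. Each head is a set of 1×1 convolutions predicting per-anchor class scores, box residuals (x, y, z, w, l, h, θ), and direction logits; which feature map feeds which head follows the description above, while channel sizes, anchor counts, and class counts are placeholders.

```python
import torch.nn as nn

class DetectionHead(nn.Module):
    """Per-anchor classification, box-offset and direction predictions."""

    def __init__(self, in_ch, num_anchors, num_classes):
        super().__init__()
        self.cls = nn.Conv2d(in_ch, num_anchors * num_classes, 1)
        # 7 regression targets per anchor: (x, y, z, w, l, h, theta) residuals.
        self.box = nn.Conv2d(in_ch, num_anchors * 7, 1)
        self.dir = nn.Conv2d(in_ch, num_anchors * 2, 1)

    def forward(self, feat):
        return self.cls(feat), self.box(feat), self.dir(feat)

class DualHeadRPN(nn.Module):
    """Small-scale head on early features, large-scale head on fused features."""

    def __init__(self, early_ch=64, fused_ch=384, num_anchors=2, num_classes=10):
        super().__init__()
        self.small_head = DetectionHead(early_ch, num_anchors, num_classes)
        self.large_head = DetectionHead(fused_ch, num_anchors, num_classes)

    def forward(self, early_feat, fused_feat):
        # early_feat: output of the first conv-deconv phase (fine resolution).
        # fused_feat: concatenated multi-stride features from the DCNN.
        return self.small_head(early_feat), self.large_head(fused_feat)
```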
InfoFocus serves as the second stage of our framework. It takes the candidate proposals from the RPN and extracts object features in a hierarchical manner from the feature maps produced by the DCNN. Specifically, given each 3D object proposal, InfoFocus dynamically focuses on the informative parts of the feature maps by gradually emphasizing the representative PoIs in the following three steps: 1) the edge points are selected from the whole proposal region by PoI Pooling; 2) the Visibility Attentive Module emphasizes the informative points according to their visibility relative to the LiDAR sensor; and 3) in the Adaptive Point-wise Attention Module, the features of the visible points are further weighted adaptively. The re-weighted features of the visible points are then fused to form the final representation of the proposal, on top of which two fully-connected layers are used to predict the refined box.
PoI Pooling. When representing a 3D proposal, the most intuitive way is to adopt the strategy commonly used in two-stage 2D object detectors, i.e., RoI Pooling (see Fig. 3, left).
Fig. 3.
RoI Pooling vs. PoI Pooling. The grid represents the feature map, and the dots denote sampled points of interest. RoI Pooling samples over the whole box, while PoI Pooling focuses on key-points from the edges of interest.
However, unlike 2D images, which have densely distributed information over the region proposals, the 3D point cloud mostly resides on object surfaces, which results in non-uniform information over the regions (most information is located on the edges of proposals).

The proposed PoI Pooling is illustrated in Fig. 3 (right). Instead of sampling points equally over a region of the feature maps, we focus on sampling points at the informative parts, including the four corners, the center point, and key-points on the edges. Note that we consider the center position an additional useful signal, since it is likely to capture semantic-level information.

We first project the 3D proposal to the bird's-eye-view coordinate system. Let p_1, p_2, p_3, and p_4 represent the positions of the top-left, top-right, bottom-right, and bottom-left corners of a proposal on the pseudo image, respectively, and let p_c denote the center point. Along each edge, n more key-points are uniformly sampled. For example, for the top edge between p_1 and p_2, the position of a sampled key-point is

kp_j = \frac{j}{n+1} p_1 + \frac{n+1-j}{n+1} p_2,

where j is an integer and 1 ≤ j ≤ n. In this way, (5 + 4n) PoIs are obtained. A high-dimensional feature is extracted for each PoI according to its relative position on the feature map, and we obtain a feature set F^{poi} = {f_1^{poi}, f_2^{poi}, ..., f_{N_{poi}}^{poi}}, where N_{poi} = 5 + 4n is the number of selected PoIs within the considered region.

Visibility Attentive Module. Severe self-occlusion typically occurs in point cloud but is ignored by most existing methods. The Visibility Attentive Module (see Fig. 4, left) is proposed to mitigate this issue by focusing on the information provided by the visible parts of objects. We argue that visible regions contain more useful information than occluded ones. Formally, we propose to re-weight the features of PoIs according to their visibility by exploiting the geometric relationship between the proposals and the LiDAR sensor in bird's eye view. As shown in Eq. (1), F^{vis} denotes the updated feature set, where v_i^{poi} indicates the visibility score of the i-th PoI. Different weighting strategies can be used; we adopt a hard attention strategy in this work for its simplicity, i.e., v_i^{poi} = 1 if the i-th PoI is visible and v_i^{poi} = 0 otherwise. In other words, we only take PoIs on the visible edges to represent the proposal.

F^{vis} = \{ f_1^{poi} v_1^{poi}, f_2^{poi} v_2^{poi}, \ldots, f_{N_{poi}}^{poi} v_{N_{poi}}^{poi} \}   (1)

For the sake of model efficiency, a simple yet effective method is used to estimate the visibility of points in the bird's eye view. To determine which sides of a proposal face the sensor, we first compute the distance of each corner to the LiDAR sensor and find the corner closest to the sensor. Then, we consider the two edges passing through this closest corner as the visible edges and the other two as occluded.
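To make the sampling and visibility rules concrete, the sketch below computes the 13 PoIs of a BEV proposal (n = 2) and the hard visibility weights from the nearest-corner rule described above. It assumes the proposal is given as BEV corner coordinates with the sensor at the origin; function names and the PoI ordering are our own conventions, not those of the released code.

```python
import numpy as np

EDGES = [(0, 1), (1, 2), (2, 3), (3, 0)]   # top, right, bottom, left edges (p1..p4)

def sample_pois(corners, n=2):
    """corners: (4, 2) BEV positions of the top-left, top-right, bottom-right
    and bottom-left corners (p1..p4) of a proposal.
    Returns (5 + 4*n, 2) PoIs ordered as [4 corners, center, edge key-points]."""
    center = corners.mean(axis=0, keepdims=True)
    pois = [corners, center]
    for a, b in EDGES:
        for j in range(1, n + 1):
            t = j / (n + 1.0)                       # kp_j = t * p_a + (1 - t) * p_b
            pois.append((t * corners[a] + (1.0 - t) * corners[b])[None])
    return np.concatenate(pois, axis=0)

def visibility_weights(corners, n=2):
    """Hard attention: PoIs on the two edges touching the corner closest to the
    sensor (assumed at the BEV origin) get weight 1, the rest get weight 0."""
    closest = int(np.argmin(np.linalg.norm(corners, axis=1)))
    v = np.zeros(5 + 4 * n)
    v[4] = 1.0                                      # center point is always kept
    for k, (a, b) in enumerate(EDGES):
        if closest in (a, b):                       # edge passes the closest corner
            v[a] = v[b] = 1.0                       # its two corners are visible
            v[5 + k * n: 5 + (k + 1) * n] = 1.0     # and so are its key-points
    return v
```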
Adaptive Point-wise Attention Module. PoI Pooling and the Visibility Attentive Module are motivated by the non-uniform density of point cloud. However, two points may offer different amounts of information even if both are visible to the sensor. The Adaptive Point-wise Attention Module allows the visible PoIs to contribute unequally to the prediction. Suppose F^{vis} = {f_1^{vis}, f_2^{vis}, ..., f_{N_{vis}}^{vis}} is the feature set of visible PoIs. The Adaptive Point-wise Attention Module learns an attention weight for each f_i^{vis} adaptively for the subsequent feature aggregation. Specifically, a shared fully connected (FC) layer with sigmoid as the activation function is used to learn the attention weights, formally expressed as v_i^{vis} = Sigmoid(W f_i^{vis} + b). We use F^{att} = {f_1^{att}, f_2^{att}, ..., f_{N_{vis}}^{att}} to represent the re-weighted feature set of visible PoIs obtained from F^{vis} and the attention weights, where f_i^{att} = f_i^{vis} · v_i^{vis}.

The final representation of each proposal aggregates the features of its visible PoIs. Let e_1, e_2, e_3, and e_4 denote the top, right, bottom, and left edges of a proposal, respectively. We first compute f_{e_i} by applying max pooling to all visible points on e_i. Then, the final representation is obtained as f_{e_1} || f_{e_2} || f_{e_3} || f_{e_4} || f_{p_c}, where f_{p_c} indicates the feature of the center point and || denotes concatenation.

Given the output PoI feature representation from the three modules above, topped by fully-connected layers, the head network consists of three branches predicting the box class, localization, and direction. The ground-truth and anchor boxes are parameterized as (x, y, z, w, l, h, θ), where (x, y, z) is the box center, (w, l, h) is the box dimension, and θ is the heading around the z-axis in the LiDAR coordinate system. The box regression targets are computed as the residuals between the ground truth and the anchors:

\Delta x = \frac{x^{gt} - x^{a}}{d^{a}}, \quad \Delta y = \frac{y^{gt} - y^{a}}{d^{a}}, \quad \Delta z = \frac{z^{gt} - z^{a}}{h^{a}},
\Delta w = \log\frac{w^{gt}}{w^{a}}, \quad \Delta l = \log\frac{l^{gt}}{l^{a}}, \quad \Delta h = \log\frac{h^{gt}}{h^{a}}, \quad \Delta\theta = \theta^{gt} - \theta^{a},   (2)

where x^{gt} and x^{a} refer to the ground truth and the anchor box, respectively, and d^{a} = \sqrt{(w^{a})^2 + (l^{a})^2}.
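A minimal sketch of the point-wise attention and the edge-wise aggregation described above is shown below. It assumes the PoI features are ordered as four corners, the center, and n key-points per edge (matching the earlier sampling sketch); layer names and feature dimensions are illustrative rather than taken from the released implementation.

```python
import torch
import torch.nn as nn

class AdaptivePointAttention(nn.Module):
    """Re-weights visible PoI features and aggregates them per proposal."""

    def __init__(self, feat_dim=128, n=2):
        super().__init__()
        self.n = n
        # Shared FC layer with a single output node -> sigmoid attention weight.
        self.att = nn.Linear(feat_dim, 1)

    def forward(self, poi_feats, vis_mask):
        """
        poi_feats: (B, 5 + 4*n, C) PoI features of B proposals,
                   ordered as [4 corners, center, edge1 kps, ..., edge4 kps].
        vis_mask:  (B, 5 + 4*n) hard visibility weights (0 or 1).
        Returns (B, 5 * C) concatenated edge and center features.
        """
        feats = poi_feats * vis_mask.unsqueeze(-1)               # Eq. (1)
        w = torch.sigmoid(self.att(feats))                       # (B, N_poi, 1)
        feats = feats * w                                         # adaptive re-weighting
        corner, center, kps = feats[:, :4], feats[:, 4], feats[:, 5:]
        kps = kps.view(feats.size(0), 4, self.n, -1)              # per-edge key-points
        edge_feats = []
        for e, (a, b) in enumerate([(0, 1), (1, 2), (2, 3), (3, 0)]):
            # Max-pool over the points lying on edge e (its two corners + key-points).
            pts = torch.cat([corner[:, [a, b]], kps[:, e]], dim=1)
            edge_feats.append(pts.max(dim=1).values)
        return torch.cat(edge_feats + [center], dim=1)            # f_e1||...||f_e4||f_pc
```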
Fig. 4.
Left: illustration of the Visibility Attentive Module. We compute hard attention for each sampled point depending on whether it is visible to the sensor. We also show the visibility map on the bottom left: points on the blue line are visible, while points on the orange line are invisible. Right: the architecture of the Adaptive Point-wise Attention Module. The point-wise attention is generated using a fully connected (FC) layer followed by a sigmoid function; the input of the FC layer is the feature of each point.

To deal with the severe class imbalance in the dataset, we adopt the focal loss [15] as the classification loss. The smooth L1 loss [5] is used as the regression loss. In addition, to compensate for the direction prediction missing from the regression, we adopt a softmax classification loss on the orientation prediction. Similar to the vanilla PointPillars network [12], we define a threefold multi-task loss for both stages,

L_{stage_i} = \frac{1}{N_{pos}} \left( \beta_{cls} L_i^{cls} + \beta_{reg} L_i^{reg} + \beta_{dir} L_i^{dir} \right),   (3)

where i is either the RPN or the InfoFocus stage, N_{pos} refers to the number of positive anchors, and β_cls, β_reg, β_dir are chosen to balance the weights among the classification, regression, and direction losses.

Point-based Approaches. Our framework uses PointNet to extract features from equally divided sub-grids and employs a DCNN to generate 2D feature maps, while point-based techniques [17,25,11,21] use only PointNet as the backbone. Both our approach and point-based approaches apply a two-stage architecture to infer objects, and both sample features considering the distribution of point cloud. However, compared to PointNet-based detectors, InfoFocus is more computationally efficient without performance degradation.

Fusion-based Approaches. Fusion-based detectors [3,16] make use of both RGB images and point cloud data for 3D object detection. InfoFocus is much faster than fusion-based approaches, since they contain two backbones to process multi-view sources and are heavily engineered. On the other hand, InfoFocus also achieves competitive results compared to fusion-based approaches.
Traditional Voxel-based Approaches. Our method shares a similar backbone with existing voxel-based architectures [12,30,26,28]. However, previous voxel-based detectors pay little attention to the distribution of LiDAR data, namely that most of the 3D point cloud lies on the surfaces of objects. Our proposed PoI Pooling, Visibility Attentive Module, and Adaptive Point-wise Attention model the non-uniform point cloud with dynamic information focus. First, PoI Pooling reduces the sampling from the inside of objects, where few points are located. Next, the Visibility Attentive Module eliminates the noise from the back of objects, where points are occluded. Last, we apply the Adaptive Point-wise Attention to learn the focus on each sampled point. Jointly, these modules contribute significantly to the superior performance of InfoFocus.
4 Experiments

Our method is mainly evaluated on the nuScenes dataset [1], which is considered the most challenging 3D object detection benchmark. We first present our implementation details. We then compare with existing approaches both quantitatively and qualitatively. Next, extensive ablation studies are conducted to demonstrate the effectiveness of each designed module. Last, we analyze the inference time and the desirable speed-accuracy trade-off provided by our method.
NuScenes [1] is one of the largest datasets for autonomous driving. It contains 1000 scenes of 20 s duration each, with 23 annotated object classes, 28,130 training samples, and 6,019 validation samples. We use the LiDAR point cloud as the only input to our method, and all experiments follow the standard protocol on the training and validation sets. Officially, nuScenes evaluates detection accuracy across classes based on the average precision (AP) metric, which is computed from the 2D center distance between the ground truth and the detection box on the ground plane. Specifically, the AP score is determined as the normalized area under the precision-recall curve above 10%. The final mean AP (mAP) is the average over the set of ten classes and over the matching thresholds D = {0.5, 1, 2, 4} meters.

We integrate InfoFocus into a state-of-the-art real-time 3D object detector [12] to improve the detection performance without largely compromising speed. Closely following the codebase recommended by the authors of PointPillars [12] (https://github.com/traveller59/second.pytorch), we use PyTorch to implement our InfoFocus modules and integrate them into the vanilla PointPillars network. More details are provided in the supplementary materials.

RPN. For each class of objects, the RPN anchor size is set to the average size of all objects of the corresponding class in the nuScenes training set. In addition, the matching thresholds follow the configuration of the suggested codebase. 1,000 proposals are obtained from the RPN, on which NMS with a threshold of 0.5 is applied to remove overlapping proposals for both training and inference. The top-ranked 300 proposals are kept for the InfoFocus stage to simultaneously predict the category, location, and direction of objects during both training and inference.

InfoFocus. The second stage is our proposed InfoFocus. The three novel modules process object-centric features sequentially based on the initial bounding box proposals from the RPN. The number of sampled key-points per edge, n, is set to 2. Thus, the total number of PoIs, N_{poi}, is 13, including a center, 4 corners, and 2 key-points on each edge. Similar to RoIAlign [6], bi-linear interpolation is used to compute the deep feature from the four neighboring regular locations of each point. As mentioned before, we apply a max-pool layer to summarize the features of the points along each edge, resulting in 5 features for each proposal: the features from the top, right, bottom, and left edges and the center. When concatenating these features, we always treat the edge closest to the sensor as the top edge. A fully connected layer with a single node is used to generate the point-wise attention weight for each point. The feature of each proposal is transformed by two consecutive FC layers with 512 nodes each and passed to three sibling linear layers: a box-regression branch, a box-classification branch, and a box-direction branch. For the regression target assignment, anchors with an Intersection over Union (IoU) greater than 0.6 with the ground truth are considered positive, and those with an IoU smaller than 0.55 are assigned negative labels.

Training Parameters. Experiments are conducted on a single NVIDIA 1080Ti GPU. The weight decay is set to 0.01. We adopt the Adam optimizer [9] and use the one-cycle scheduler proposed in [23]. We train our model for a total of 20 epochs as the default choice, taking about 40 hours from scratch.
For the first 8 epochs, the learning rate progressively increases while the momentum decreases from 0.95 to 0.85; in the remaining 12 epochs, the learning rate decreases while the momentum increases from 0.85 back to 0.95. The focal loss [15] is used with α = 0.25 and γ = 2. β_cls, β_reg, and β_dir of both stages are set to 1, 2, and 0.2, respectively.
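For reference, the hyper-parameters described in this section can be summarized in a single configuration sketch; the values below are only those stated in the text, and the key names are our own rather than those used in the second.pytorch / PointPillars config files.

```python
# Hypothetical summary of the training / second-stage settings stated above.
INFOFOCUS_CONFIG = {
    "rpn": {
        "num_proposals": 1000,        # proposals produced by the RPN
        "nms_threshold": 0.5,         # NMS applied on RPN proposals
        "proposals_kept": 300,        # top-ranked proposals passed to InfoFocus
    },
    "infofocus": {
        "keypoints_per_edge": 2,      # n = 2, so N_poi = 5 + 4 * 2 = 13
        "fc_dims": [512, 512],        # two FC layers on the proposal feature
        "iou_pos_threshold": 0.60,    # anchors above this IoU are positive
        "iou_neg_threshold": 0.55,    # anchors below this IoU are negative
    },
    "training": {
        "epochs": 20,                 # default schedule (8 warm-up + 12 decay)
        "optimizer": "Adam",
        "lr_schedule": "one-cycle",
        "weight_decay": 0.01,
        "focal_loss": {"alpha": 0.25, "gamma": 2.0},
        "loss_weights": {"cls": 1.0, "reg": 2.0, "dir": 0.2},
    },
}
```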
First, we compare our framework with the state-of-the-art methods on the nuScenes validation set, including the vanilla PointPillars [12] as our baseline and the recently published WYSIWYG [7]. As can be seen from Table 1, the baseline has an mAP of 29.5% with a single stage, while InfoFocus improves it by a massive 6.9%. This demonstrates the effectiveness of InfoFocus. We also visualize the detection results of our framework on 2D and 3D BEV images in Fig. 5. As shown in Fig. 6, compared to the vanilla PointPillars, InfoFocus removes many false positives and obtains qualitatively better results.

In addition, we submit the detection results on the test set to the nuScenes test server. The results show that our method achieves state-of-the-art performance with an inference speed of 31 FPS, improving the baseline by 7% mAP. Note that all methods listed in Table 2 are LiDAR-based except MonoDIS [22] and CenterNet [29], which are camera-based. Without bells and whistles, our approach works better than WYSIWYG [7]. Considering that our model contains more parameters than the vanilla PointPillars, we empirically double the number of training epochs. With all other settings the same, our method improves by 2% mAP on the nuScenes test set, as shown in Table 2 (Ours 2×). In total, our method outperforms WYSIWYG [7] by 4.5% mAP on the nuScenes test set. In the rest of the paper, the default setting of training epochs is adopted. To the best of our knowledge, our framework is superior to all published real-time methods with respect to mAP.
Table 1.
Object detection results (%) on the nuScenes validation set.

Method            car   peds. barri. traff. truck bus   trail. const. motor. bicyc. mAP
PointPillars [12] 70.5  59.9  33.2   29.6   25.0  34.3  16.7   4.5    20.0   1.6    29.5
WYSIWYG [7]       80.0  66.9  34.5   27.9   35.8  54.1  28.5   7.5    18.5   0      35.4
Ours              77.6  61.7  43.4   33.4   35.4  50.5  25.6   8.3    25.2   2.5    36.4

Table 2.
Object detection results (%) on the nuScenes test set. Note that MonoDIS and CenterNet are camera-based methods; the rest are LiDAR-based. Ours 2× indicates 2× training time with all other settings the same as Ours.

Method            car   peds. barri. traff. truck bus   trail. const. motor. bicyc. mAP
MonoDIS [22]      47.8  37.0  51.1   48.7   22.0  18.8  17.6   7.4    29.0   24.5   30.4
PointPillars [12] 68.4  59.7  38.9   30.8   23.0  28.2  23.4   4.1    27.4   1.1    30.5
SARPNET [28]      59.9  69.4  38.3   44.6   18.7  19.4  18.0   11.6   29.8   14.2   32.4
CenterNet [29]    53.6  37.5  53.3   58.3   27.0  24.8  25.1   8.6    29.1   20.7   33.8
WYSIWYG [7]       79.1  65.0  34.7   28.8   30.4  46.6  40.1   7.1    18.2   0.1    35.0
Ours              77.2  61.5  45.3   40.4   31.5  44.1  35.9   9.8    25.1   4.0    37.5
Ours 2×           -     -     -      -      -     -     -      -      -      -      39.5

To understand the contribution of each major component to the success of InfoFocus, Table 3 summarizes the performance of our framework when a certain module is disabled, covering PoI Pooling, the Visibility Attention Module, and the Adaptive Attention Module.
PoI Pooling. To investigate the effect of PoI Pooling, we simply add PoI Pooling on top of the vanilla PointPillars. This alone brings a 3.0% mAP improvement. However, when we vary the number of pooled key-points on each edge, we find that our framework with four key-points per edge (n = 4) is slightly worse, by 0.8% mAP, than with two key-points (n = 2). A possible reason is that a higher number of samples along each edge may introduce more noise, which harms detection performance.

Table 3.
Ablation studies on the nuScenes validation set. "Vis. Att." and "Adp. Att." refer to the Visibility Attention Module and the Adaptive Attention Module, respectively.

PoIPool  Vis. Att.  Adp. Att.  mAP
                               29.5
  ✓                            32.5
  ✓        ✓                   34.8
  ✓                   ✓        34.8
  ✓        ✓          ✓        36.4
Visibility Attention. We further add the Visibility Attention Module to filter out invisible edges before PoI pooling. Table 3 shows that when using the features from the two visible edges, the result improves by 2.3% mAP compared to baseline+PoIPool. Generally, the visible parts of objects correspond to their sides closer to the LiDAR sensor and thus capture richer information. By applying visibility attention, our method focuses more on representative information, which results in better performance.
Adaptive Point-wise Attention. Without the Adaptive Point-wise Attention Module, the framework naturally assigns the same weight to each PoI feature. As shown in Table 3, when adding this module, the result of baseline+PoIPool improves by 2.3% mAP and that of baseline+PoIPool+Vis.Att. improves by 1.6%. These results suggest that the Adaptive Point-wise Attention Module helps emphasize useful points, which leads to better performance.
Table 4.
Inference time of 3D object detectors. Note that the inference time for the baseline is measured with the network reproduced by ourselves.

Method        Input Format  mAP   Inference Time (ms)
Baseline [12] LiDAR         30.5  26.9
MonoDIS [22]  RGB           30.4  29.0
SARPNET [28]  LiDAR         32.4  70.0
Ours          LiDAR         37.5  32.9
Rotated RoIAlign Comparison. A widely used way to extract region-wise features in a two-stage architecture is RoIAlign [6], so it is natural to compare against this strategy in the 3D object detection setting. We implement a rotated RoIAlign (RRoI) operation [8] to account for rotated bounding boxes, since in our case they are often not axis-aligned. We conduct experiments with two different pooling sizes, including 4 × 4. Table 5 lists the detection results utilizing the rotated RoI with different pooling sizes. Compared with the vanilla PointPillars [12], adding an RoIAlign layer with a pooling size of 4 × 4 already improves the results.

Fig. 5.
We visualize the detection results on nuScenes with 2D and 3D BEV images. On the top, we show the 2D images with the 3D bounding boxes annotated, while the BEV of the LiDAR with ground-truth (red) and detection (blue) boxes is shown on the bottom. Note that the line in each box denotes the direction of the object.

Table 5.
Comparison with rotated RoIAlign feature extraction: results (%) on the nuScenes validation set.

Method     car   peds. barri. traff. truck bus   trail. const. motor. bicyc. mAP
RRoI 4x4   76.9  60.1  37.6   29.5   32.4  50.6  22.4   5.0    20.8   -      -
As indicated in Table 4, our framework takes about 32.9 ms to perform detection on a nuScenes point cloud sample, compared with 26.9 ms for the vanilla PointPillars, when both are evaluated on a single NVIDIA 1080Ti GPU. In detail, the pillar feature extraction takes 12.6 ms, the DCNN costs 1.1 ms, the RPN takes 7.3 ms to generate proposals, and the InfoFocus stage takes 11.9 ms. Within the second stage, the proposal generation for InfoFocus, including NMS, takes 5.1 ms, the PoI feature extraction takes 3.1 ms, and the three prediction branches take 0.7 ms. We also note that WYSIWYG [7] reports the overhead of computing visibility over a 32-beam LiDAR point cloud alone to be approximately 24 ms.
Fig. 6.
We visualize the BEV detection results for the same point cloud sample on nuScenes with the vanilla PointPillars (left) and InfoFocus (right).

5 Conclusion

The non-uniform distribution of point cloud causes varying amounts of information at different locations. We argue that this imbalanced distribution of information degrades previous voxel-based 3D detectors when modeling 3D objects. To address this issue, we propose a 3D object detection framework with InfoFocus to dynamically conduct information modeling. InfoFocus contains three effective modules: PoI Pooling, the Visibility Attentive Module, and the Adaptive Point-wise Attention. As demonstrated by comprehensive experiments, our framework achieves state-of-the-art performance among all real-time detectors on the challenging nuScenes dataset.
Acknowledgement. This work was supported by the Intelligence Advanced Research Projects Activity (IARPA) via DOI/IBC contract numbers D17PC00345 and D17PC00287. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. The authors would like to thank Zuxuan Wu and Xingyi Zhou for proofreading the manuscript.
References
1. Caesar, H., Bankiti, V., Lang, A.H., Vora, S., Liong, V.E., Xu, Q., Krishnan, A., Pan, Y., Baldan, G., Beijbom, O.: nuScenes: A multimodal dataset for autonomous driving. arXiv preprint arXiv:1903.11027 (2019)
2. Chen, X., Ma, H., Wan, J., Li, B., Xia, T.: Multi-view 3d object detection network for autonomous driving. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1907–1915 (2017)
3. Chen, Y., Liu, S., Shen, X., Jia, J.: Fast Point R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 9775–9784 (2019)
4. Girshick, R.: Fast R-CNN. In: The IEEE International Conference on Computer Vision (ICCV) (December 2015)
5. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2014)
6. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2961–2969 (2017)
7. Hu, P., Ziglar, J., Held, D., Ramanan, D.: What you see is what you get: Exploiting visibility for 3d object detection. arXiv preprint arXiv:1912.04986 (2019)
8. Huang, J., Sivakumar, V., Mnatsakanyan, M., Pang, G.: Improving rotated text detection with rotation region proposal networks. arXiv preprint arXiv:1811.07031 (2018)
9. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
10. Ku, J., Mozifian, M., Lee, J., Harakeh, A., Waslander, S.L.: Joint 3d proposal generation and object detection from view aggregation. In: 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 1–8. IEEE (2018)
11. Lan, S., Yu, R., Yu, G., Davis, L.S.: Modeling local geometric structure of 3d point clouds using Geo-CNN. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2019)
12. Lang, A.H., Vora, S., Caesar, H., Zhou, L., Yang, J., Beijbom, O.: PointPillars: Fast encoders for object detection from point clouds. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 12697–12705 (2019)
13. Liang, M., Yang, B., Chen, Y., Hu, R., Urtasun, R.: Multi-task multi-sensor fusion for 3d object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 7345–7353 (2019)
14. Liang, M., Yang, B., Wang, S., Urtasun, R.: Deep continuous fusion for multi-sensor 3d object detection. In: The European Conference on Computer Vision (ECCV) (September 2018)
15. Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2980–2988 (2017)
16. Liu, Z., Tang, H., Lin, Y., Han, S.: Point-Voxel CNN for efficient 3d deep learning. In: Advances in Neural Information Processing Systems. pp. 963–973 (2019)
17. Qi, C.R., Liu, W., Wu, C., Su, H., Guibas, L.J.: Frustum PointNets for 3d object detection from RGB-D data. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 918–927 (2018)
18. Qi, C.R., Su, H., Mo, K., Guibas, L.J.: PointNet: Deep learning on point sets for 3d classification and segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 652–660 (2017)
19. Qi, C.R., Yi, L., Su, H., Guibas, L.J.: PointNet++: Deep hierarchical feature learning on point sets in a metric space. In: Advances in Neural Information Processing Systems. pp. 5099–5108 (2017)
20. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems. pp. 91–99 (2015)
21. Shi, S., Wang, X., Li, H.: PointRCNN: 3d object proposal generation and detection from point cloud. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 770–779 (2019)
22. Simonelli, A., Bulo, S.R., Porzi, L., López-Antequera, M., Kontschieder, P.: Disentangling monocular 3d object detection. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1991–1999 (2019)
23. Smith, L.N.: A disciplined approach to neural network hyper-parameters: Part 1 – learning rate, batch size, momentum, and weight decay. arXiv preprint arXiv:1803.09820 (2018)
24. Vora, S., Lang, A.H., Helou, B., Beijbom, O.: PointPainting: Sequential fusion for 3d object detection. arXiv preprint arXiv:1911.10150 (2019)
25. Wang, Z., Jia, K.: Frustum ConvNet: Sliding frustums to aggregate local point-wise features for amodal 3d object detection. arXiv preprint arXiv:1903.01864 (2019)
26. Yan, Y., Mao, Y., Li, B.: SECOND: Sparsely embedded convolutional detection. Sensors 18(10), 3337 (2018)
27. Yang, Z., Sun, Y., Liu, S., Shen, X., Jia, J.: STD: Sparse-to-dense 3d object detector for point cloud. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1951–1960 (2019)
28. Ye, Y., Chen, H., Zhang, C., Hao, X., Zhang, Z.: SARPNET: Shape attention regional proposal network for lidar-based 3d object detection. Neurocomputing 379 (2020)