Multi-Frame to Single-Frame: Knowledge Distillation for 3D Object Detection
Yue Wang, Alireza Fathi, Jiajun Wu, Thomas Funkhouser, Justin Solomon
MIT {yuewangx,jsolomon}@mit.edu
Google {alirezafathi,tfunkhouser}@google.com
Stanford [email protected]
Abstract.
A common dilemma in 3D object detection for autonomous driving is that high-quality, dense point clouds are only available during training, but not testing. We use knowledge distillation to bridge the gap between a model trained on high-quality inputs at training time and another tested on low-quality inputs at inference time. In particular, we design a two-stage training pipeline for point cloud object detection. First, we train an object detection model on dense point clouds, which are generated from multiple frames using extra information only available at training time. Then, we train the model's identical counterpart on sparse single-frame point clouds with consistency regularization on features from both models. We show that this procedure improves performance on low-quality data during testing, without additional overhead.
Introduction

Recent advances in 3D object detection are beginning to yield practical vision systems for autonomous driving. Given the realities of data collection and allowable inference time in this application, however, most relevant 3D object detection methods take static one-frame point clouds as input, which can be extremely sparse. The goal of matching object detection rates achievable from dense point clouds constructed by fusing multiple frames remains elusive. On the other hand, aggregating point clouds from multiple frames is inherently hard in dynamic settings. For example, Figure 1 (b) shows a point cloud generated by combining 5 LiDAR scans, which exhibits significant motion blur; this blur, unavoidable in a moving environment, yields ambiguity in the object detection model.

Point cloud object detection training data often comes in sequences, with consistent labels that track moving objects between frames in each sequence; this extra information allows us to generate Figure 1 (c), which shows an aggregated point cloud after accounting for the motion of individual objects. This tracking information, however, is generally unavailable at testing time. This mismatch between training and testing data leads to an unavoidable dilemma: it is desirable to build a model that gathers dense point clouds from multiple frames, but in the absence of fine-tuned registration procedures this data is only available at training time.
Fig. 1: Visualizations of single/multi-frame point clouds. (a) single-frame point clouds; (b) multi-frame point clouds without bounding box tracking; (c) multi-frame point clouds with bounding box tracking. Two examples are labeled in blue and orange; the moving cars are extremely blurry in (b) but not in (c).

In this paper, we introduce a two-stage training pipeline to bridge the gap between training on dense point clouds and testing on sparse point clouds. In principle, our method is an intuitive extension and application of knowledge distillation [1,2] and learning using privileged information [13,6,11]. First, we train an object detection model on dense point clouds aggregated from multiple frames, using information available during training but not testing. Then, we train an identical model on single-frame sparse point clouds; we use the pre-trained model from the first stage to provide additional supervision for the second model. At test time, the second model makes predictions on single-frame point clouds, without extra inference overhead. Compared with a model trained solely on sparse point clouds, the proposed model benefits from distilled information gathered by the model trained on dense point clouds.
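To make the two-stage pipeline concrete, the following PyTorch-style sketch shows one way it could be implemented. The model interface (returning predictions and an intermediate feature map), the paired data loaders, the detection loss, and the MSE form of the feature-consistency term are illustrative assumptions, not the paper's exact implementation.

import torch
import torch.nn.functional as F

def train_teacher(teacher, multi_frame_loader, detection_loss, epochs=1):
    """Stage 1: train on dense point clouds aggregated with ground-truth tracking."""
    opt = torch.optim.Adam(teacher.parameters())
    for _ in range(epochs):
        for dense_pc, targets in multi_frame_loader:
            preds, _ = teacher(dense_pc)              # model assumed to return (predictions, features)
            loss = detection_loss(preds, targets)
            opt.zero_grad(); loss.backward(); opt.step()

def train_student(student, teacher, paired_loader, detection_loss, lam=0.1, epochs=1):
    """Stage 2: an identical model trained on sparse single-frame clouds, with a
    feature-consistency term against the frozen stage-1 model."""
    opt = torch.optim.Adam(student.parameters())
    teacher.eval()                                    # teacher is fixed during distillation
    for _ in range(epochs):
        for sparse_pc, dense_pc, targets in paired_loader:
            preds, feat_s = student(sparse_pc)
            with torch.no_grad():
                _, feat_t = teacher(dense_pc)         # teacher sees the dense input
            consistency = F.mse_loss(feat_s, feat_t)  # MSE is an assumption; the paper only specifies feature consistency
            loss = detection_loss(preds, targets) + lam * consistency
            opt.zero_grad(); loss.backward(); opt.step()

Only the student is used at test time, so the distillation adds no inference overhead.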
Method

We study how to use privileged information available at training time in the context of point cloud object detection. In contrast to existing works [11,4], which use additional image information and require extra annotations, our model takes advantage of the privileged information that is naturally available in a sequence of point clouds. Our model bridges the gap between the single-frame model and the (impractical) multi-frame model, while maintaining efficiency.

Consider a sequence of point clouds $\{X_1, X_2, \ldots, X_t, \ldots, X_T\}$ up to frame $T$ with per-frame poses $\{T_1, T_2, \ldots, T_t, \ldots, T_T\}$. Each frame is a point cloud $X_t = \{x_t^1, x_t^2, \ldots, x_t^i, \ldots, x_t^N\} \subset \mathbb{R}^3$ with additional point-wise features $F_t = \{f_t^1, f_t^2, \ldots, f_t^i, \ldots, f_t^N\} \subset \mathbb{R}^c$, such as reflectance and material. For simplicity of notation, we assume the same number of points $N$ in each point cloud. In each frame $t$, a set of ground-truth bounding boxes $\{b_t^1, b_t^2, \ldots, b_t^k, \ldots, b_t^K\}$ is provided, where each bounding box $b_t^k$ is parameterized by its center location $p_t^k$, its size $s_t^k$, and its heading angle $\theta_t^k$. For ease of presentation, assume $\{b_1^k, b_2^k, \ldots, b_t^k, \ldots, b_T^k\}$ are the bounding boxes of the $k$-th object over all frames; that is, we assume each frame in the sequence is labeled consistently. We can take advantage of these bounding box annotations to generate dense point clouds as in Figure 1 (c), but this is only possible at training time.

Fig. 2: Our object detection model: first, point clouds are encoded using a lightweight PointNet; then, points are projected into a BEV 2D grid, followed by additional 2D convolutional layers for further feature embedding; finally, a classification branch and a localization branch predict the existence of bounding boxes per BEV pixel and the bounding box parameters, respectively. cls: classification branch; loc: localization branch.

At test time, since we can no longer access the bounding box annotations, the inputs to the model are either single-frame point clouds (Figure 1 (a)) or multi-frame point clouds (Figure 1 (b)) that do not account for moving objects. So the model trained on the desired point clouds (Figure 1 (c)) can never be deployed for inference. To address this difference between training and testing, we propose a two-stage training pipeline. First, we produce dense point clouds by registering multiple frames and train an object detection model on these dense point clouds. Then, we distill the previous model into a new model by leveraging both dense and sparse point clouds. Finally, we use the new model for testing on sparse point clouds without extra cost.
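The dense point clouds of Figure 1 (c) can be generated at training time along the following lines. This NumPy sketch assumes 4x4 ego poses, a BEV-yaw box parameterization, and consistently tracked boxes; the helper names and conventions are ours, not the paper's.

import numpy as np

def yaw_pose(center, yaw):
    """4x4 transform from a box's local frame to the world frame (BEV yaw only)."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = center
    return T

def apply(T, pts):
    """Apply a 4x4 rigid transform to an (N, 3) point array."""
    return pts @ T[:3, :3].T + T[:3, 3]

def in_box(pts_world, center, size, yaw):
    """Boolean mask of points inside an oriented box of the given size."""
    local = apply(np.linalg.inv(yaw_pose(center, yaw)), pts_world)
    return np.all(np.abs(local) <= np.asarray(size) / 2.0, axis=1)

def aggregate(frames, ego_poses, tracks, ref):
    """frames[t]: (N_t, 3) points in the ego frame of frame t.
    ego_poses[t]: 4x4 ego-to-world transform T_t.
    tracks[k][t]: (center, size, yaw) of object k in frame t, in world coordinates.
    Returns all frames registered to frame `ref`, with each tracked object's points
    additionally warped to that object's pose in frame `ref` (cf. Figure 1 (c))."""
    to_ref = np.linalg.inv(ego_poses[ref])
    out = []
    for t, pts in enumerate(frames):
        world = apply(ego_poses[t], pts)                      # ego frame -> world frame
        for boxes in tracks.values():
            if t in boxes and ref in boxes:
                mask = in_box(world, *boxes[t])
                # warp the object's points from its pose at frame t to its pose at ref
                warp = yaw_pose(boxes[ref][0], boxes[ref][2]) @ np.linalg.inv(
                    yaw_pose(boxes[t][0], boxes[t][2]))
                world[mask] = apply(warp, world[mask])
        out.append(apply(to_ref, world))                      # world frame -> reference ego frame
    return np.concatenate(out, axis=0)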
Experiments

First, we introduce the dataset, metrics, implementation details, and optimization details.

Metrics.

We use birds-eye view (BEV) and 3D mean average precision (mAP) as metrics. The intersection-over-union (IoU) threshold is 0.7 for vehicles and 0.5 for pedestrians. In addition to overall scores, we provide breakdowns based on the distance between the origin and the ground-truth boxes: 0m-30m, 30m-50m, and 50m-infinity (Inf). Following existing works [5,9,14], we evaluate our method on bounding boxes that contain more than 5 points.
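For illustration, the distance breakdown and the more-than-5-points filter can be expressed as below; this sketch does not reproduce the mAP computation itself or the official evaluation protocol, and the function names are ours.

import numpy as np

def distance_bucket(center):
    """Breakdown by BEV distance between the origin and a ground-truth box center."""
    r = np.hypot(center[0], center[1])
    if r < 30.0:
        return "0-30m"
    if r < 50.0:
        return "30-50m"
    return "50m-Inf"

def evaluated_boxes(boxes, points_per_box, min_points=5):
    """Keep only boxes with more than `min_points` LiDAR points, following [5,9,14]."""
    return [b for b, n in zip(boxes, points_per_box) if n > min_points]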
Implementation details.
As shown in Figure 2, the object detection model is a variant of PointPillars [5]: first, points in 3D are projected to BEV, with a lightweight PointNet [10] to aggregate features in each BEV pixel; then, three convolutional blocks are employed to further embed the BEV features; finally, we use two branches to predict bounding box existence and bounding box parameters, respectively. The frame rate at test time is 24 FPS, on par with PointPillars [5].
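A schematic of this detector, assuming PyTorch: the layer widths, the pillar-to-BEV scatter step, and the box parameterization below are placeholders rather than the paper's exact configuration, and the returned BEV feature map is the kind of tensor the feature-consistency term of the second training stage would act on.

import torch
import torch.nn as nn

class BEVDetector(nn.Module):
    """PointPillars-style variant: per-pillar PointNet -> BEV grid -> 2D convs -> two heads."""
    def __init__(self, in_dim=9, feat_dim=64, num_box_params=7):
        super().__init__()
        # Lightweight PointNet: shared MLP over the points in each BEV pixel, then max-pool.
        self.point_mlp = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU())
        # Three convolutional blocks that further embed the BEV feature map.
        self.bev_net = nn.Sequential(
            nn.Conv2d(feat_dim, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.cls_head = nn.Conv2d(128, 1, 1)               # bounding box existence per BEV pixel
        self.loc_head = nn.Conv2d(128, num_box_params, 1)  # center, size, heading angle

    def forward(self, pillar_points, scatter_index, grid_hw):
        # pillar_points: (P, M, in_dim) points grouped per occupied pillar (single example for clarity).
        # scatter_index: (P,) flat BEV index of each pillar; grid_hw: (H, W) of the BEV grid.
        pillar_feat = self.point_mlp(pillar_points).max(dim=1).values     # (P, feat_dim)
        H, W = grid_hw
        bev = pillar_feat.new_zeros(pillar_feat.shape[1], H * W)
        bev[:, scatter_index] = pillar_feat.t()                           # scatter pillars to the BEV grid
        feat = self.bev_net(bev.view(1, -1, H, W))
        return self.cls_head(feat), self.loc_head(feat), feat             # feat: BEV features for distillation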
Method                    | BEV mAP (IoU=0.7)                 | 3D mAP (IoU=0.7)
                          | Overall  0-30m  30-50m  50m-Inf   | Overall  0-30m  30-50m  50m-Inf
--------------------------+-----------------------------------+----------------------------------
StarNet [9]               |   -        -      -       -       | 53.7      -      -       -
LaserNet† [8]             | 71.57   92.94   74.92   48.87     | 55.1    84.9    53.11   23.92
PointPillars∗ [5]         | 75.57   92.1    74.06   55.47     | 56.62   81.01   51.75   27.94
MVF [14]                  | 80.4    93.59   79.21   63.09     | 62.93   86.3    60.2    36.02
Baseline (1 frame)        | 83.16   94.2    81.22   64.43     | 63.09   84.73   58.55   33.55
+ Distillation (1 frame)  |
Improvements              | +1.63   +0.64   +1.47   +3.72     | +3.51   +2.83   +2.42   +5.39
Oracle (5 frames)‡        |

Table 1: Results on vehicle. †: our implementation. ∗: re-implemented by [14]. ‡: the model trained and tested on dense point clouds as in Figure 1 (c); this is not a realistic setup, but we provide its results to serve as an oracle.
Optimization.

We use Adam [3] to train both the multi-frame and the single-frame model. For both models, the learning rate is increased linearly from its initial value over the first 5 epochs and then decreased with cosine scheduling [7]. The model is trained for 75 epochs at each stage. The weight λ of the feature-consistency term is 0.1 for vehicles and 0.01 for pedestrians. We use 5 frames for the multi-frame model. Our batch size is 128. We train our models on TPUv3. Training the multi-frame model is about 2 times slower than training the single-frame model.

Results.

In Table 1 and Table 2, we mainly compare the model trained with our two-stage strategy to a single-frame model trained from scratch. For the sake of completeness, we also provide results of StarNet [9], LaserNet [8], PointPillars [5], and MVF [14]; compared to them, even the single-frame model achieves better performance. The model trained on 5 frames significantly outperforms the others, including the single-frame model, but this is an unrealistic setup for testing, since points from multiple frames are aggregated using ground-truth tracking information. The single-frame model with distillation improves over its vanilla counterpart on all metrics, which indicates that the model indeed learns additional information from the supervision of the multi-frame model.

Method                    | BEV mAP (IoU=0.5)                 | 3D mAP (IoU=0.5)
                          | Overall  0-30m  30-50m  50m-Inf   | Overall  0-30m  30-50m  50m-Inf
--------------------------+-----------------------------------+----------------------------------
StarNet [9]               |   -        -      -       -       | 66.8      -      -       -
LaserNet† [8]             | 70.01   78.24   69.47   52.68     | 63.4    73.47   61.55   42.69
PointPillars∗ [5]         | 68.7    75.0    66.6    58.7      | 60.0    68.9    57.6    46.0
MVF [14]                  | 74.38   80.01   72.98   62.51     | 65.33   72.51   63.35   50.62
Baseline (1 frame)        | 74.48   81.72   72.9    58.77     | 68.32   78.12   64.85   50.21
+ Distillation (1 frame)  |
Improvements              | +1.82   +1.31   +1.59   +3.41     | +1.58   +0.71   +2.11   +2.52
Oracle (5 frames)‡        |

Table 2: Results on pedestrian. †: our implementation. ∗: re-implemented by [14]. ‡: the model trained and tested on dense point clouds as in Figure 1 (c); this is not a realistic setup, but we provide its results to serve as an oracle.
References
1. Buciluǎ, C., Caruana, R., Niculescu-Mizil, A.: Model compression. In: Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD) (2006)
2. Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. In: NeurIPS Deep Learning and Representation Learning Workshop (2015)
3. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: International Conference on Learning Representations (ICLR) (2014)
4. Lambert, J., Sener, O., Savarese, S.: Deep learning under privileged information using heteroscedastic dropout. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
5. Lang, A.H., Vora, S., Caesar, H., Zhou, L., Yang, J., Beijbom, O.: PointPillars: Fast encoders for object detection from point clouds. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
6. Lopez-Paz, D., Bottou, L., Schölkopf, B., Vapnik, V.: Unifying distillation and privileged information. In: International Conference on Learning Representations (ICLR) (2016)
7. Loshchilov, I., Hutter, F.: SGDR: Stochastic gradient descent with warm restarts. In: International Conference on Learning Representations (ICLR) (2017)
8. Meyer, G.P., Laddha, A., Kee, E., Vallespi-Gonzalez, C., Wellington, C.K.: LaserNet: An efficient probabilistic 3D object detector for autonomous driving. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
9. Ngiam, J., Caine, B., Han, W., Yang, B., Chai, Y., Sun, P., Zhou, Y., Yi, X., Alsharif, O., Nguyen, P., Chen, Z., Shlens, J., Vasudevan, V.: StarNet: Targeted computation for object detection in point clouds. arXiv abs/1908.11069 (2019)
10. Qi, C.R., Su, H., Mo, K., Guibas, L.J.: PointNet: Deep learning on point sets for 3D classification and segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
11. Su, J.C., Maji, S.: Adapting models to signal degradation using distillation. In: British Machine Vision Conference (BMVC) (2017)
12. Sun, P., Kretzschmar, H., Dotiwalla, X., Chouard, A., Patnaik, V., Tsui, P., Guo, J., Zhou, Y., Chai, Y., Caine, B., Vasudevan, V., Han, W., Ngiam, J., Zhao, H., Timofeev, A., Ettinger, S., Krivokon, M., Gao, A., Joshi, A., Zhang, Y., Shlens, J., Chen, Z., Anguelov, D.: Scalability in perception for autonomous driving: Waymo open dataset. arXiv abs/1912.04838 (2019)