PixSet: An Opportunity for 3D Computer Vision to Go Beyond Point Clouds With a Full-Waveform LiDAR Dataset
Jean-Luc Déziel, Pierre Merriaux, Francis Tremblay, Dave Lessard, Dominique Plourde, Julien Stanguennec, Pierre Goulet, Pierre Olivier
LeddarTech
(Dated: February 25, 2021)

Leddar PixSet is a new publicly available dataset (dataset.leddartech.com) for autonomous driving research and development. One key novelty of this dataset is the presence of full-waveform data from the Leddar Pixell sensor, a solid-state flash LiDAR. Full-waveform data has been shown to improve the performance of perception algorithms in airborne applications but is yet to be demonstrated for terrestrial applications such as autonomous driving. The PixSet dataset contains approximately 29k frames from 97 sequences recorded in high-density urban areas, using a set of various sensors (cameras, LiDARs, radar, IMU, etc.). Each frame has been manually annotated with 3D bounding boxes.
I. INTRODUCTION
Autonomous vehicles (AVs) have the potential to transform how transportation is done for people and merchandise, while improving both safety and efficiency. In order to reach the highest levels of autonomy, one of the main challenges that AVs currently face is to leverage the data from multiple types of sensors, each of which has its own strengths and weaknesses. Sensor fusion techniques are widely used to improve the performance and robustness of computer vision algorithms.
Nowadays, the best performing computer vision algorithms are neural networks that are optimized using a deep learning approach [1–3], which requires large amounts of data. Multiple datasets have been made publicly available in order to boost research and development of such algorithms [4–8]. In particular, most of these datasets include data acquired with a LiDAR (Light Detection and Ranging) sensor, which is considered to be an essential component of highly autonomous vehicles [9, 10].
In this paper, we present our contribution to this effort, the PixSet dataset (download link: dataset.leddartech.com). What makes this new dataset unique is the use of a flash LiDAR and the inclusion of the full-waveform raw data, in addition to the usual point cloud data. The use of full-waveform data from a flash LiDAR has been shown to improve the performance of segmentation and object detection algorithms in airborne applications [11, 12], but is yet to be demonstrated for terrestrial applications such as autonomous driving. The PixSet dataset contains 97 sequences, each averaging a few hundred frames, for a total of roughly 29,000 frames. Each frame has been manually annotated with 3D bounding boxes. The sequences have been gathered in various environments and climatic conditions with the instrumented vehicle shown in Figure 1.
Our main contributions are summarized as follows:
FIG. 1: Instrumented vehicle (RAV4) used for data acquisition.

• Introduce to the community a new dual LiDAR-type dataset using solid-state and mechanical LiDARs, with 3D bounding box annotations.
• Provide full-waveform data for the solid-state LiDAR.
• Trigger exteroceptive sensors to improve 3D bounding box annotation accuracy.
• Provide an API and dataset viewer to facilitate algorithm research and development.

The paper is organized as follows: sections II A to II C present the recording setup and conditions. Section II E briefly describes how the data is stored and how to read/view it with an optional open source library we have developed for PixSet. The waveform data is described in section II F. Section III describes the dataset annotations (3D bounding boxes). Then section IV provides baseline object detection results, and a brief conclusion is presented in section V.

Sensor label       Description
pixell_bfc         Leddar Pixell solid-state LiDAR with waveforms
ouster64_bfc       Ouster OS1-64 mechanical LiDAR
flir_bfl/bfc/bfr   3 FLIR BFS-PGE-16S2C-CS cameras + 90° optics
flir_bbfc          FLIR BFS-PGE-16S2C-CS camera + Immervision panomorph 180° optic
radarTI_bfc        TI AWR1843 mmWave radar
sbgekinox_bcc      SBG Ekinox IMU with RTK GPS dual antenna
peakcan_fcc        Toyota RAV4 CAN bus

TABLE I: List of sensors used for PixSet dataset data acquisition.

FIG. 2: Spatial configuration of the sensor stack on the front of the vehicle (ouster64_bfc, flir_bfr/bfc/bfl, pixell_bfc, flir_bbfc, radarTI_bfc). Arrows show the coordinate system used by each sensor. See Table I for a description of the sensors.
II. DATASET

A. Sensors
The sensors used to collect the dataset were mounted on a car (see Figure 1) and are listed in Table I. Most sensors (cameras, LiDARs and the radar) are positioned close to each other at the front of the car in the configuration shown in Figure 2. This proximity is deliberate, in order to minimize the parallax effect. The GPS antennas for the inertial measurement unit (IMU) are located on the top of the vehicle.
Most sensors are fairly standard and well known by the community, except for the Leddar Pixell. It is a solid-state (no moving parts) flash and full-waveform LiDAR (see section II F for a description of the waveforms). Its field of view is 180° horizontally and 16° vertically. While it has a relatively low resolution (96 horizontal channels by 8 vertical channels), it should provide more information per channel than a typical non-flash LiDAR sensor. Testing this hypothesis is one of the motivations for creating this dataset.
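As a rough illustration of how such a channel grid relates to 3D points, the sketch below converts per-channel range measurements to Cartesian coordinates. It assumes a uniform 96x8 angular grid for simplicity; the actual per-channel directions are factory-calibrated and provided with the dataset (see Section II C), so this is illustrative only.

```python
import numpy as np

# Illustrative only: assume a uniform 96x8 angular grid spanning 180° x 16°.
# The real per-channel directions are factory-calibrated and shipped with the
# dataset; this sketch just shows how directions and ranges combine into points.
H_CHANNELS, V_CHANNELS = 96, 8
H_FOV, V_FOV = np.deg2rad(180.0), np.deg2rad(16.0)

azimuths = np.linspace(-H_FOV / 2, H_FOV / 2, H_CHANNELS)    # left-right angles
elevations = np.linspace(-V_FOV / 2, V_FOV / 2, V_CHANNELS)  # down-up angles
az_grid, el_grid = np.meshgrid(azimuths, elevations)          # shape (8, 96)

def channels_to_points(ranges: np.ndarray) -> np.ndarray:
    """Convert an (8, 96) array of measured distances [m] to (N, 3) points."""
    x = ranges * np.cos(el_grid) * np.cos(az_grid)  # forward
    y = ranges * np.cos(el_grid) * np.sin(az_grid)  # left
    z = ranges * np.sin(el_grid)                    # up
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

# Example: a flat wall 10 m in front of the sensor.
points = channels_to_points(np.full((V_CHANNELS, H_CHANNELS), 10.0))
```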
B. Time Synchronization

One of the objectives of PixSet was to obtain the highest possible accuracy in the positioning of the bounding boxes for data annotation. Since the environment is dynamic, combining the data from multiple sensors yields the best result when the sensors are gathering data simultaneously. This was achieved by triggering each frame acquisition from a single periodic signal source generated by the OS1-64 mechanical LiDAR. The Pixell LiDAR and the 4 cameras are triggered in this way, at a frequency of 10 Hz. The other sensors (radar, IMU and the vehicle CAN bus) are not triggered, but record at much higher frame rates, which minimizes the timing errors. The sensor triggering is used to minimize the acquisition time difference between exteroceptive sensors. Figure 3 shows the average timing offset measured for every channel of the Pixell sensor, which is 40 ms at most.
A timestamp is associated with each frame of each sensor, or with each point for the LiDARs. To make sure that all sensors are precisely synchronized, we run a server that collects the time from the GPS antenna (PTP protocol) and connect all sensors to that server to synchronize their internal clocks. The only exceptions are the radar and the vehicle CAN bus, for which the timestamps are assigned at data reception in the computer (the computer's clock is also synchronized with the same time server). Moreover, the quality of the synchronization is always monitored in real time by comparing the trigger signal to a separate time measurement from the IMU. The typical measured synchronization error was 6 µs for the cameras and 350 µs for the Pixell.
One more thing that was considered is the fact that LiDARs are not global shutter sensors, in contrast with cameras. What is referred to as a "frame" for a LiDAR, i.e. a single complete scan of the field of view, is not measured all at once. For example, a typical mechanical LiDAR such as the OS1-64 measures a single vertical line at a time, which continuously rotates during the 100 ms frame. Thanks to the accurate timestamping of the LiDAR and IMU, the motion of the car during the frame acquisition time can be compensated. The API (section II E) provides this functionality to algorithm developers.
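The following is a minimal sketch of this kind of motion compensation, not the API's implementation: it assumes per-point timestamps and IMU poses given as 4x4 world-from-vehicle matrices, interpolates the pose at each point's timestamp, and re-expresses all points in the vehicle frame at a chosen reference time. All names are illustrative.

```python
import numpy as np
from scipy.spatial.transform import Rotation, Slerp

def motion_compensate(points, point_times, pose_times, poses, t_ref):
    """Minimal motion-compensation sketch (not the dataset API's implementation).

    points:      (N, 3) LiDAR points in the vehicle frame at their capture time
    point_times: (N,)   per-point timestamps [s]
    pose_times:  (M,)   IMU pose timestamps [s]; must bracket all point_times
    poses:       (M, 4, 4) world-from-vehicle transforms at pose_times
    t_ref:       reference time at which the whole frame is expressed
    """
    slerp = Slerp(pose_times, Rotation.from_matrix(poses[:, :3, :3]))

    def pose_at(t):
        # Interpolate rotation (slerp) and translation (linear) at time t.
        T = np.eye(4)
        T[:3, :3] = slerp([t]).as_matrix()[0]
        T[:3, 3] = [np.interp(t, pose_times, poses[:, i, 3]) for i in range(3)]
        return T

    ref_from_world = np.linalg.inv(pose_at(t_ref))
    out = np.empty_like(points, dtype=float)
    for i, (p, t) in enumerate(zip(points, point_times)):
        world_p = pose_at(t) @ np.append(p, 1.0)   # vehicle(t) -> world
        out[i] = (ref_from_world @ world_p)[:3]    # world -> vehicle(t_ref)
    return out
```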
C. Calibration

FIG. 3: Average difference between the timestamp of each channel of the Pixell sensor and the timestamps of the projected overlapping OS1-64 detections (per-channel timing differences range from -40 ms to 20 ms).

FIG. 4: Projection of an OS1-64 point cloud in a camera image, using the calibration matrices provided with the dataset.

Sensor projection is mandatory for many algorithms such as sensor fusion. To obtain an accurate projection (Figure 4), a calibration must be performed, which can be done by multiple methods. The following describes the methods chosen to obtain the calibration matrices included in the dataset.
To compensate for the camera lens distortion, a set of images of a chessboard was gathered for each camera, and the OpenCV library was used to calculate the desired matrices from these images. For the Pixell sensor, the direction of each channel must be known in order to position each detection in 3D space. Thankfully, these directions are calibrated at the factory with specialized equipment and provided in the dataset.
Next, the extrinsic calibration matrices are also needed to change the referential of the coordinate system from/to any pair of sensors. The extrinsic calibration between pairs of cameras is obtained, again, by using the OpenCV library to extract the 3D coordinates of the corners of a chessboard in multiple images. A closed-loop optimization method was used to refine the transformation matrices. A similar method is used for the extrinsic calibration between the OS1-64 LiDAR and the cameras, by first extracting the 3D coordinates of the corners of a chessboard in each sensor and then solving the transformation matrices with the perspective-n-point method. The extrinsic calibration between both LiDARs (Pixell and OS1-64) is obtained by averaging multiple matrices, each obtained by using the iterative closest point (ICP) method with pairs of simultaneous scans from both sensors. The calibration between the radar and the OS1-64 LiDAR is obtained similarly, except that only the high-intensity detections from a specific metallic target are taken into account. Finally, the calibration between the OS1-64 LiDAR and the IMU is obtained by minimizing the plane thickness of multiple sequential point clouds in the world coordinates provided by the IMU. The transformations for pairs of sensors that have not been mentioned are obtained by combining other matrices (e.g., the Pixell-to-IMU transformation matrix is obtained by multiplying the matrices for Pixell-to-OS1-64 and for OS1-64-to-IMU).
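As a small illustration of this chaining of extrinsic calibrations, the sketch below composes two 4x4 affine transforms to obtain a pair that is not calibrated directly. The file names are hypothetical, not the actual names used in the extrinsics directories.

```python
import numpy as np

# Hypothetical 4x4 affine transforms, in the spirit of the dataset's
# "extrinsics" directory (file names are illustrative only).
T_os1_from_pixell = np.load("pixell_bfc-ouster64_bfc.npy")   # Pixell -> OS1-64
T_imu_from_os1 = np.load("ouster64_bfc-sbgekinox_bcc.npy")   # OS1-64 -> IMU

# A missing pair is obtained by composing existing calibrations:
T_imu_from_pixell = T_imu_from_os1 @ T_os1_from_pixell        # Pixell -> IMU

def transform(points: np.ndarray, T: np.ndarray) -> np.ndarray:
    """Apply a 4x4 affine transform to an (N, 3) array of points."""
    homogeneous = np.hstack([points, np.ones((points.shape[0], 1))])
    return (homogeneous @ T.T)[:, :3]
```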
D. Recording Conditions
The PixSet dataset has been recorded in Canada, in both Quebec City and Montreal, during the summer of 2020. A summary of the recording conditions is presented in Figure 5. Most of the dataset has been recorded in high-density urban environments, such as downtown areas where pedestrians are almost always present and on boulevards where the car density is high. Most of the dataset was recorded during daytime with dry weather, but a few thousand frames were recorded at night and/or while raining, providing a great variety of situations with real-world data for autonomous driving. Figure 6 showcases a few samples taken from the dataset.
E. Dataset Format and Access
Each of the 97 sequences of the dataset is contained in an independent directory. Each of them contains the following elements: (i) a configuration file named platform.yml that contains all the required parameters to read and process the data with the optional library provided along with the dataset (see below for details); (ii) an intrinsics directory that contains the calibration matrices for each camera (lens distortion compensation, see Section II C); (iii) an extrinsics directory that contains the extrinsic calibrations (4x4 affine transformation matrices) for the pairs of sensors that have been calibrated; (iv) a zip file for each source of data (e.g., images from a camera).
A given zip file is named after the label of the sensor (see Table I, e.g., pixell_bfc for the Pixell) and the type of data, of which there can be several for a single sensor. The raw waveforms from the Pixell are stored in the pixell_bfc_ftrr.zip file and the detections that have been processed from these waveforms are stored in the pixell_bfc_ech.zip file. Each of these files contains a series of pickle files (or .jpg files for camera images), each of them containing the data of a single frame (a full LiDAR scan or a single camera image). Along with the raw data, a timestamps.csv file is also stored, containing all timestamps.

FIG. 5: Summary of recording conditions (number of frames by speed, time of day, location, rain and cloud cover).

FIG. 6: PixSet dataset overview of the variety of scenes and environmental conditions, with samples from the cameras and the OS1-64 LiDAR with 3D box annotations.

Along with the dataset, we also provide an API to a Python library that was developed to read, process and view the data. This tool can perform several of the most complex and common operations such as referential transformations, time synchronization of the different sources of data, motion compensation for point clouds and much more.
https://github.com/leddartech/pioneer.das.api
https://github.com/leddartech/pioneer.das.view
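For orientation, here is a rough sketch of reading the files described above with only the Python standard library; the pioneer.das.api library is the recommended route. The sequence path is hypothetical, and it is assumed that the per-frame pickle files and timestamps.csv sit inside each zip archive.

```python
import csv
import io
import pickle
import zipfile
from pathlib import Path

# Hypothetical sequence directory; the internal structure of each pickled
# frame is sensor-dependent and best accessed through pioneer.das.api.
sequence_dir = Path("pixset_part01")

with zipfile.ZipFile(sequence_dir / "pixell_bfc_ech.zip") as archive:
    # Per-frame pickle files, one per full Pixell scan.
    frame_files = sorted(n for n in archive.namelist() if n != "timestamps.csv")
    first_frame = pickle.loads(archive.read(frame_files[0]))

    # Per-frame timestamps, assuming timestamps.csv is stored alongside.
    if "timestamps.csv" in archive.namelist():
        with archive.open("timestamps.csv") as f:
            timestamps = list(csv.reader(io.TextIOWrapper(f)))
```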
F. Waveforms

Typically, a LiDAR provides data in the form of point clouds, i.e. collections of three-dimensional coordinates. Usually, an intensity value is also attached to each point. Point clouds are not the raw data measured by the sensor but are rather processed from the waveforms. A waveform is the measured intensity as a function of time after the emission of the laser pulse from the LiDAR (see Figure 7 for an example). A point is the result of a peak found in a waveform. While the information about the positions and the amplitudes of the peaks is preserved in the point cloud, all additional information about the shape of the waveform is lost.

FIG. 7: (a) 3D view of detections provided by the Pixell sensor. (b) 2D front view of the amplitudes from the same detections as in (a). (c) Example of raw high- and low-intensity waveforms obtained from a single Pixell channel (intensity versus sample index). See text for details.

The Pixell sensor measures a set of two waveforms for each channel of each frame. The first waveform is gathered after the emission of a high-intensity laser pulse and the second one after the emission of a laser pulse with a quarter of the intensity. This effectively increases the dynamic range of the sensor, because the high-intensity waveforms have a tendency to saturate the sensor when looking at close and/or reflective targets. This is analogous to high-dynamic-range imagery with cameras, which can be achieved by combining images with different exposure levels.
The waveforms are sampled at an 800 MHz rate. The high-intensity waveforms contain 512 samples, while the low-intensity ones contain 256 samples. One more important aspect to keep in mind is that waveforms from different channels have slightly different time offsets with respect to each other. These offsets are calibrated and compensated for in the provided point clouds, but not in the raw waveforms. The offsets are provided with the dataset and the waveforms can easily be adjusted (by linear interpolation). The provided API (see end of section II E) contains multiple functions to deal with waveform processing, including time realignment.
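A minimal sketch of this linear-interpolation realignment is given below. The function name and the sign convention of the per-channel offset are assumptions; the provided API should be preferred in practice.

```python
import numpy as np

SAMPLE_RATE_HZ = 800e6  # waveform sampling rate given in the text

def realign_waveform(waveform: np.ndarray, offset_s: float) -> np.ndarray:
    """Shift one channel's waveform by its calibrated time offset.

    Sketch only: `offset_s` is the per-channel offset (sign convention assumed),
    and samples shifted outside the record are filled with zeros.
    """
    n = waveform.shape[0]              # 512 (high intensity) or 256 (low intensity)
    sample_idx = np.arange(n)
    shift = offset_s * SAMPLE_RATE_HZ  # offset expressed in samples
    # Resample the waveform on a grid shifted by the per-channel offset.
    return np.interp(sample_idx + shift, sample_idx, waveform, left=0.0, right=0.0)
```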
III. ANNOTATIONS
Each frame (∼29k in total) has been manually annotated with 3D bounding boxes (see Figure 8 for an overview of the annotations). One particularity of the PixSet dataset is the variable size of the bounding boxes for pedestrians. It is common to use a fixed size for a given object instance for the whole duration of a sequence. This makes sense for cars, or any solid object that has a fixed size and shape, but leads to complications for pedestrians, which are deformable objects. For example, if a pedestrian extends their arm for a few seconds, using a fixed bounding box size is problematic: should the arm be ignored, or included in the box and an oversized bounding box maintained for the remainder of the sequence? Thus, bounding boxes for pedestrians in the PixSet dataset have variable sizes, adjusted frame by frame, to avoid these issues.
IV. OBJECT DETECTION
This section presents results from an object detection algorithm in order to provide a baseline for the perception performance of the Pixell sensor. It describes the model and the metrics at a coarse level, while the source code and all details can be found in [14].

a. Preprocessing.
The raw data is prepared and preprocessed using the library that was made publicly available (see section II E). The steps performed by this tool are the following: (i) match the synchronized data from the Pixell, the annotations and the IMU for egomotion; (ii) convert the raw detection data of the Pixell into point clouds; (iii) apply motion compensation to the point clouds (see end of section II B).
Then, data augmentation is performed using the following methods: (i) insertion of bounding boxes and the points they contain from other frames (up to 5 cars, 5 pedestrians and 5 cyclists); (ii) with a 50% probability, flipping of the left/right axis; (iii) random translations along all three axes within fixed ranges (in meters, see Figure 2 for the coordinate system); (iv) random rotation of the yaw angle within a fixed range; (v) random scaling of the scene by a factor within a fixed range around 1. The exact augmentation ranges can be found in the source code [14].

FIG. 8: Overview of data annotations. (a) Total number of annotated objects per category; interestingly, the relative amounts seem to follow Zipf's law [13]. (b) Distribution of the distances and orientations of cars, pedestrians and cyclists. (c) Relative amount of each attribute value found in the dataset for cars, pedestrians and cyclists. Occlusion refers to the fraction of the object hidden behind other things. Truncation refers to the fraction of the object that is outside the LiDARs' field of view. Note that an occlusion/truncation value of 0 means 0%, a value of 1 means <50% and a value of 2 means >50%.

b. Neural network. The neural network is composed of three parts: the encoder, the backbone and the detection head. The encoder is similar to the one from PointPillars [1], except that the single encoding layer is replaced by a dense block inspired by DenseNet [15] (4 layers, with a growth parameter of 16). This results in a 2D bird's eye view of a set of encoded features. The features then go through the backbone of the network, also a dense block, with 60 layers and a growth parameter of 16. The detection head is a final dense block of 5 layers and a growth parameter of 7. The detection head is inspired by CenterNet [16] and produces bird's eye view heat maps where local maxima are interpreted as objects. Non-maximum suppression is not necessary with this method, so it is not used.
The neural network has a total of 5.2M parameters. The model was trained for 100 epochs with the Adam optimizer [17]. The learning rate was exponentially decayed from 0.001 down to 0.000174 between epochs. A test set (∼3k frames) was removed from the training data. The sequences that were kept for the test set are parts 1, 9, 26, 32, 38 and 39 (these part numbers are in the name of each downloadable directory of the dataset).
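To illustrate the CenterNet-style decoding described above, where local maxima on the bird's-eye-view heat map replace non-maximum suppression, here is a minimal sketch using PyTorch. It is not the code from [14]; the threshold and top-k values are arbitrary.

```python
import torch
import torch.nn.functional as F

def extract_detections(heatmap: torch.Tensor, threshold: float = 0.3, k: int = 50):
    """Read objects out of a bird's-eye-view heat map as local maxima.

    Sketch of the CenterNet-style decoding (not the code from [14]).
    heatmap: (num_classes, H, W) tensor of per-class center scores in [0, 1].
    Returns a list of (class_id, row, col, score) tuples.
    """
    # A cell is a local maximum if 3x3 max pooling leaves its value unchanged;
    # this replaces non-maximum suppression.
    pooled = F.max_pool2d(heatmap.unsqueeze(0), kernel_size=3, stride=1, padding=1).squeeze(0)
    peaks = (heatmap == pooled) & (heatmap > threshold)

    scores = heatmap[peaks]
    classes, ys, xs = peaks.nonzero(as_tuple=True)
    order = torch.argsort(scores, descending=True)[:k]
    return [(int(classes[i]), int(ys[i]), int(xs[i]), float(scores[i])) for i in order]
```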
c. Metrics. The most popular metric to evaluate the performance of an object detection model is the average precision (AP). This metric is calculated as a function of distance, as shown in Figure 9 (see Table II for the global AP values). This better shows how a sensor can perform quite well at short range and at what distance the model's predictions can be trusted. Admittedly, the Pixell detection capabilities seem to drop rapidly beyond 10-15 meters, but the Pixell is not meant to be used as the only sensor on an AV. These results are mostly presented as a reference baseline. Future work will focus on the performance of a fused sensor stack and the potential improvements of including full-waveform data from the Pixell.

IoU threshold   Car AP    Pedestrian AP   Cyclist AP
25%             68.33%    37.15%          35.87%
50%             52.04%    14.13%          19.69%

TABLE II: Average precision (AP) results. The evaluation only accounts for bounding boxes found within 32 meters from the Pixell sensor.

A few notes on the metric calculation: a predicted bounding box is considered a true positive if its three-dimensional intersection over union (IoU) with a ground truth box is above a certain threshold (indicated in the figures) and if the classification of the bounding box is correct. Moreover, there can be only one true positive prediction per ground truth bounding box (duplicates are false positives).
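The matching rule above can be summarized by the following sketch, which assumes a hypothetical iou_3d helper computing the 3D intersection over union of two boxes.

```python
from typing import Callable, Sequence

def count_true_positives(
    predictions: Sequence,        # predicted boxes of one class, sorted by descending confidence
    ground_truths: Sequence,      # annotated boxes of the same class
    iou_3d: Callable[[object, object], float],  # hypothetical 3D IoU helper
    iou_threshold: float = 0.25,
) -> int:
    """Sketch of the matching rule described above: a prediction is a true
    positive if its 3D IoU with an unmatched ground truth box exceeds the
    threshold; duplicate matches count as false positives."""
    matched = set()
    true_positives = 0
    for pred in predictions:
        best_iou, best_idx = 0.0, None
        for i, gt in enumerate(ground_truths):
            if i in matched:
                continue
            iou = iou_3d(pred, gt)
            if iou > best_iou:
                best_iou, best_idx = iou, i
        if best_idx is not None and best_iou >= iou_threshold:
            matched.add(best_idx)  # each ground truth can be matched only once
            true_positives += 1
    return true_positives
```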
FIG. 9: Average precision (AP) as a function of distance for cars, pedestrians and cyclists found in the test set (parts 1, 9, 26, 32, 38 and 39 of the dataset). To be considered a true positive, a prediction must have an IoU (intersection over union) larger than the thresholds indicated in the legends.
V. CONCLUSION
This work presented our contribution, the PixSet dataset, to the collective effort of developing safe autonomous vehicles. We believe that there is great potential to further improve perception algorithms by leveraging the raw data from the full waveforms provided by the Pixell flash LiDAR. We have also provided a baseline for 3D object detection performance from the Pixell point clouds.

[1] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019).
[2] Q. He, Z. Wang, H. Zeng, Y. Zeng, S. Liu, and B. Zeng, arXiv preprint arXiv:2006.04043 (2020).
[3] S. Shi, C. Guo, L. Jiang, Z. Wang, J. Shi, X. Wang, and H. Li, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020).
[4] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, The International Journal of Robotics Research, 1231 (2013), https://doi.org/10.1177/0278364913491297.
[5] M. Pitropov, D. Garcia, J. Rebello, M. Smart, C. Wang, K. Czarnecki, and S. Waslander, arXiv:2001.10117 [cs] (2020).
[6] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2020), pp. 11621–11631.
[7] P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine, et al., in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2020), pp. 2446–2454.
[8] J. Houston, G. Zuidhof, L. Bergamini, Y. Ye, A. Jain, S. Omari, V. Iglovikov, and P. Ondruska, arXiv preprint arXiv:2006.14480 (2020).
[9] D. Feng, C. Haase-Schütz, L. Rosenbaum, H. Hertlein, C. Gläser, F. Timm, W. Wiesbeck, and K. Dietmayer, IEEE Transactions on Intelligent Transportation Systems, 1 (2020).
[10] Y. Li, L. Ma, Z. Zhong, F. Liu, M. A. Chapman, D. Cao, and J. Li, IEEE Transactions on Neural Networks and Learning Systems, 1 (2020).
[11] J. Reitberger and P. Krzystek, ASPRS 2009 Annual Conference, Baltimore, MD, United States, March 9-13, 2009, 9 (2009).
[12] T. Shinohara, H. Xiu, and M. Matsuoka, Sensors, 3568 (2020).
[13] G. K. Zipf, Human Behavior and the Principle of Least Effort (Addison-Wesley, Reading, MA (USA), 1949).
[14] https://github.com/leddartech/object_detection_pixell.
[15] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017), pp. 2261–2269.
[16] K. Duan, S. Bai, L. Xie, H. Qi, Q. Huang, and Q. Tian, arXiv preprint arXiv:1904.08189 (2019).
[17] D. P. Kingma and J. Ba, in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings (2015).