Gated3D: Monocular 3D Object Detection From Temporal Illumination Cues
Frank Julca-Aguilar, Jason Taylor, Mario Bijelic, Fahim Mannan, Ethan Tseng, Felix Heide
Algolux, Daimler AG, Ulm University, Princeton University
Abstract
Today's state-of-the-art methods for 3D object detection are based on lidar, stereo, or monocular cameras. Lidar-based methods achieve the best accuracy, but have a large footprint, high cost, and mechanically-limited angular sampling rates, resulting in low spatial resolution at long ranges. Recent approaches based on low-cost monocular or stereo cameras promise to overcome these limitations but struggle in low-light or low-contrast regions as they rely on passive CMOS sensors. In this work, we propose a novel 3D object detection modality that exploits temporal illumination cues from a low-cost monocular gated imager. We propose a novel deep detector architecture, Gated3D, that is tailored to temporal illumination cues from three gated images. Gated images allow us to exploit mature 2D object feature extractors that guide the 3D predictions through a frustum segment estimation. We assess the proposed method on a novel 3D detection dataset that includes gated imagery captured in over 10,000 km of driving data. We validate that our method outperforms state-of-the-art monocular and stereo approaches at long distances. We will release our code and dataset, opening up a new sensor modality as an avenue to replace lidar in autonomous driving.
1. Introduction
3D object detection is a fundamental vision task in robotics and autonomous driving. Accurate 3D detections are critical for safe trajectory planning, with applications emerging across disciplines such as autonomous drones, assistive and health robotics, as well as warehouse and delivery robots. RGB-D cameras using correlation time-of-flight [22, 29, 33], such as Microsoft's Kinect One, enable robust 3D detection indoors [55, 56] for small ranges. In the past, autonomous driving, which requires long ranges and high depth accuracy, has relied on scanning lidar for 3D detection [50, 59, 15, 63, 34, 11, 67, 30, 32]. However, while lidar provides accurate depth, existing systems are fundamentally limited by point-by-point acquisition, resulting in spatial resolution that falls off quadratically with distance and linearly with framerate. Compared to conventional cameras, lidar systems are three orders of magnitude more expensive, suffer from low resolution at long distances, and fail in the presence of strong back-scatter, e.g. in snow or fog [4].

Promising to overcome these challenges, a recent line of work proposed pseudo-lidar sensing [60], which relies on low-cost sensors, such as stereo [10, 7, 27] or monocular [9, 20, 14] cameras, to recover dense depth maps from conventional intensity imagers. Point clouds are sampled from the depth maps and ingested by 3D detection methods that operate on point-cloud representations [32, 67]. More recent methods predict 3D boxes directly from the passive input images [35, 5, 54]. Although all of these methods promise low-cost 3D detection with the potential to replace lidar, they rely on passive camera-only sensing. Passive stereo approaches degrade at long ranges, where disparities are small, and in low-light scenarios, e.g. at night, when stereo or monocular depth cues are less visible.

In this work, we introduce the first 3D object detection method using gated imaging and evaluate it as a low-cost detection method for long ranges, outperforming recent monocular and stereo detection methods. Similar to passive approaches, we use CMOS sensors but add active temporal illumination. The proposed gated imager captures illumination distributed in three wide gates (>30 m) for all sensor pixels. Gated imaging [25, 6, 3, 62, 49, 2, 21] allows us to capture several dense high-resolution images distributed continuously across the distances in their respective temporal bin. Additionally, back-scatter can be removed by the distribution of early gates. Whereas scanning lidar trades off temporal resolution with spatial resolution and SNR, the sequential acquisition of gated cameras trades off dense spatial resolution and SNR (i.e. wide gates) with coarse temporal resolution. We demonstrate that the temporal illumination variations in gated images are a depth cue naturally suited for 3D object detection, without the need to first recover intermediate proxy depth maps [21]. Operating on 2D gated slices allows us to leverage existing 2D object detection architectures to guide the 3D object detection task with a novel frustum segmentation. The proposed architecture further exploits gated images by disentangling the semantic contextual features from depth cues in the gates through a two-stream feature extraction. Relying on the resulting high-resolution 2D feature stacks, the method outperforms existing methods especially at long ranges. The method runs at real-time frame rates and outperforms existing passive imaging methods, independent of the ambient illumination, promising low-cost CMOS sensors for 3D object detection in diverse automotive scenarios.

Specifically, we make the following contributions:

• We formulate the 3D object detection problem as a regression from a frustum segment, computed using 2D detection priors and the object dimension statistics.
• We propose a novel end-to-end deep neural network architecture that solves the regression problem by effectively integrating depth cues and semantic features from gated images, without generating intermediate depth maps.
• We validate the proposed method on real-world driving data acquired with a prototype system in challenging automotive scenarios. We show that the proposed approach detects objects with high accuracy beyond 80 m, outperforming existing monocular, stereo and pseudo-lidar low-cost methods.
• We provide a novel annotated 3D gated dataset, covering over 10,000 km of driving throughout northern Europe, along with all code.

Figure 1: We propose a novel 3D object detection method, which we dub "Gated3D", using a flood-illuminated gated camera. The high resolution of gated images enables semantic understanding at long ranges. In the figure, our gated slices are color-coded with red for slice 1, green for slice 2 and blue for slice 3. We evaluate Gated3D on real data, collected with a scanning Velodyne HDL64-S3D lidar as reference, see overlay on the right.

As an example, Figure 1 shows experimental results of the proposed method. The gated image contains dense information on objects further away in the scene. The advantage of gated sensors for nighttime scenes is also demonstrated in this example, where the pedestrians are not clearly visible in the RGB image.
2. Related Work
Depth Sensing and Estimation.
Passive acquisition methods for recovering depth from conventional intensity images operate on single monocular images [8, 20, 31, 14, 48, 5], temporal sequences of monocular images [28, 57, 58, 66], or on multi-view stereo images [23, 51, 7, 43, 35]. These methods all suffer in low-light and low-contrast scenes. Active depth sensing overcomes these limitations by actively illuminating the scene, and scanning lidar [50] has emerged as an essential depth sensor for autonomous driving, independent of ambient lighting. However, the spatial resolution of lidar is fundamentally limited by the sequential point-by-point scanning frame rate, and the sensor cost is significantly higher. Recently, gated cameras were proposed as an alternative for dense depth estimation [21]. Although promising depth estimates have been demonstrated with gated cameras, local artefacts and low-confidence regions in the outputs of Gruber et al. [21] call into question whether their performance on high-quality scene understanding tasks could surpass that of recent monocular and stereo-based methods, a gap addressed in this work in an end-to-end fashion by directly processing the gated input slices.
CNN 2D Object Detection.
Convolutional neural networks (CNNs) for efficient 2D object detection have outperformed classical methods that rely on hand-crafted features by a large margin [47]. The key concept behind such learned object detectors is the classification of image patches at varying positions and scales [52]. Discretized grid cells and pre-defined object templates (anchor boxes) are regressed and classified by fully-convolutional network architectures [39]. To this end, two popular directions of research have been explored: single-stage [38, 46, 26, 37] and proposal-based two-stage detectors [19, 18, 47]. Two-stage approaches such as R-CNN [19] and Faster R-CNN [47] generate region proposals for objects in the first stage, followed by object classification and bounding box refinement in the second stage [19]. Single-stage detectors such as SSD [38] and YOLO [46] directly predict the final detections and are usually faster than two-stage detectors, but with lower accuracy. Recently, RetinaNet [37] proposed a focal loss that effectively down-weights easily-classified background examples and showed that single-stage detectors trained with this loss can outperform two-stage detectors in terms of accuracy.
3D Object Detection.
A large body of work on 3D object detection has explored different scene and measurement representations. For lidar point cloud data, one direction is to rely on voxel-based representations [59, 15, 67, 12, 53]. Unfortunately, the computational cost of the 3D convolutions required for voxel-based approaches is prohibitive for real-time processing [59, 15]. Alternatively, the height dimension of the voxel grid can be collapsed into feature channels with 2D convolutions performed in the BEV plane [63, 32, 40], trading off height information for computational efficiency.

Although the current state of the art relies on lidar, recent work has been attempting to close the performance gap with low-cost passive sensors due to the limitations of scanning lidar, such as cost, size, low angular resolution and failure in strong back-scatter. Earlier work on monocular [9, 54, 5] and stereo [35] methods leveraged convolutional architectures from 2D object detection, extracting depth information from stereo disparity cues or geometric constraints in an end-to-end fashion. More recently, pseudo-lidar [60] showed that point cloud input representations can be used with passive imaging approaches by first estimating depth maps. Several methods have since followed this approach with monocular [61, 42] and stereo [64] depth estimation. PatchNet [41] proposed that the advantage of pseudo-lidar is the explicit depth information in its input rather than the point cloud representation. Instead, PatchNet uses a 2D convolutional architecture with the estimated (x, y, z) coordinates of each pixel as its input. Estimating the depth prior to the detection network effectively disentangles depth information from object appearance, improving the detection accuracy.

In this work, we propose a method for 3D detection using 2D gated images, offering a low-cost solution comparable to passive sensors with improved detection accuracy. This input representation allows us to leverage the rich body of efficient 2D convolutional architectures for the task of 3D object detection, while the gated slices represent depth more effectively than RGB images.
3. Gated Imaging
Gated imaging is an emerging sensor technology for self-driving cars which relies on active flash illumination to allow for low-light imaging (e.g. night driving) while reducing back-scatter in adverse weather situations such as snow or fog [21].
Figure 2: A gated system consists of a pulsed laser source and a gated imager that are time-synchronized. The range-intensity profile (RIP) C_i(r) describes the distance-dependent illumination for a slice i. A car at a certain distance appears with a different intensity in each slice according to the RIP.

As shown in Figure 2, a gated imaging system consists of a flood-illuminator and a synchronized gated image sensor that integrates photons falling in a window of round-trip path length ξc, where ξ is a delay in the gated sensor and c is the speed of light. Following [21], the range-intensity profile (RIP) C(r) describes the distance-dependent integration, which is independent of the scene and given by

C(r) = \int_{-\infty}^{\infty} g(t - \xi)\, p\!\left(t - \tfrac{2r}{c}\right) \beta(r)\, dt,   (1)

where g is the temporally modulated camera gate, p the laser pulse profile, and β models atmospheric interactions. Assuming a scene with a dominating Lambertian reflector with albedo α at distance r̃, the measurement for each pixel location is obtained as

z = \alpha C(\tilde{r}) + \eta_p(\alpha C(\tilde{r})) + \eta_g,   (2)

where η_p describes the Poissonian photon shot noise and η_g the Gaussian read-out noise [16]. In this work, we capture three images Z_i ∈ ℕ^{height × width} for i ∈ {1, 2, 3} with different profiles C_i(r) that intrinsically encode depth into the three slices.
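To make the measurement model concrete, the following is a minimal NumPy sketch of Eqs. (1) and (2), assuming a rectangular gate and a rectangular laser pulse. The gate timings, pulse width, photon scaling and noise levels are illustrative placeholders, not the settings of the prototype system.

```python
import numpy as np

C_LIGHT = 3e8  # speed of light [m/s]

def range_intensity_profile(r, gate_delay, gate_dur, pulse_dur, beta=1.0):
    """Eq. (1) for rectangular gate g and pulse p: integral of g(t - xi) * p(t - 2r/c) * beta(r) dt."""
    t = np.linspace(0.0, 2e-6, 20001)                      # time axis [s]
    dt = t[1] - t[0]
    g = ((t >= gate_delay) & (t < gate_delay + gate_dur)).astype(float)
    t0 = 2.0 * r / C_LIGHT                                 # round-trip travel time to range r
    p = ((t >= t0) & (t < t0 + pulse_dur)).astype(float)
    return beta * np.sum(g * p) * dt

def gated_measurement(albedo, r, gate_delay, gate_dur, pulse_dur,
                      photons_per_unit=1e10, read_noise=2.0, rng=None):
    """Eq. (2): noisy pixel value with Poissonian shot noise and Gaussian read-out noise."""
    rng = rng or np.random.default_rng(0)
    mean_photons = photons_per_unit * albedo * range_intensity_profile(
        r, gate_delay, gate_dur, pulse_dur)
    shot = rng.poisson(mean_photons)                        # Poisson photon count (signal + shot noise)
    return shot + rng.normal(0.0, read_noise)               # add Gaussian read-out noise

# Three overlapping gates (illustrative timings): each target distance produces a
# different intensity triplet across the slices, which is the temporal depth cue.
gates = [(0.1e-6, 0.3e-6), (0.3e-6, 0.3e-6), (0.5e-6, 0.3e-6)]
for r in (20.0, 45.0, 75.0):
    print(r, [round(gated_measurement(0.5, r, d, w, 50e-9), 1) for d, w in gates])
```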
4. 3D Object Detection from Gated Images
In this section, we introduce
Gated3D, a novel model for detecting 3D objects from temporal illumination cues in gated images. Given three gated images, the proposed network determines the 3D location, dimensions, orientation and class of the objects in the scene.
Architecture Overview
The proposed architecture is illustrated in Figure 3. Our model is composed of a 2D detection network, based on Mask R-CNN [24], and a 3D detection network designed to effectively integrate semantic, contextual, and depth information from gated images. The model is trained end-to-end using only 3D bounding box annotations with no additional depth supervision.

The 2D detector predicts bounding boxes that guide the feature extraction with an FPN [36] backbone. These 2D boxes are used to estimate frustum segments that constrain the 3D location. In addition to these geometric estimates, the 3D detection network receives the cropped and resized regions of interest extracted from both the input gated slices and the backbone features. To extract contextual, semantic and depth information from the temporal intensity variations of the gated images, our 3D detection network applies two separate convolution streams: one for the backbone features and another for the gated input slices. The resulting features are fed into a sequence of fully-connected layers that predict the 3D location, dimensions, and orientation of the objects.

The remainder of this section details our proposed 2D object detection network (Sec. 4.1), 3D prediction network architecture (Sec. 4.2), and the loss functions for training (Sec. 4.3).
The proposed 2D detection network uses an FPN [36] as a backbone and RoIAlign for extracting crops of both the features and the input gated slices. We extract four feature maps of the backbone, as defined in [36]. Our 2D object detection network follows a two-stage architecture, where the final 2D box detections are refined from proposals output by a region proposal network (RPN). In contrast to Mask R-CNN [24], we use these 2D detections instead of the RPN proposals for 3D detection. Using the refined 2D detections allows the 3D box prediction network to obtain more precise region features, especially from the input gated slices, and a more precise frustum segment, which is essential for depth estimation.

Our 3D prediction network fuses the extracted features from both the input gated slices and the backbone features. The gated stream extracts depth cues from the cropped gated input slices with a sequence of three convolutional layers per slice, without parameter sharing. The network fuses the three gated features and the backbone features by concatenating along the channel dimension and processing with 5 residual layers. Instead of pooling or flattening the resulting features, an attention sub-network produces softmax attention maps for each feature channel, which are used for a weighted sum over the height and width of the features. The resulting feature vectors are fed into two fully connected layers, followed by a final layer that generates eight 3D bounding box coefficients.

We denote an object's predicted 2D bounding box as P = (c, u, v, w_u, h_v), where c is the object's class, (u, v) is the bounding box center, and (w_u, h_v) define its width and height, respectively. The 3D detection network takes P and estimates a set of parameters Q that define a 3D bounding box whose projection is given by P. The problem of estimating Q is ill-posed: given a specific 2D bounding box P, there is an infinite number of 3D boxes that project to P. However, we can restrict the range of locations of Q to a segment of the 3D viewing frustum extracted from P, using the object's approximate dimensions and P. See Figure 4 for an illustration.

Estimating the 3D location is aided by restricting the object's location to a specific frustum region, similar to [44]. For lidar data, a frustum suffices to define an object in 3D space, as lidar provides depth values. In our case, we only have data in the image space, without absolute depth values. Instead of considering the whole frustum as in [44], we leverage the camera calibration and the object dimensions in the training set to constrain the depth. This idea is illustrated in Fig. 4, where a person is located at different distances relative to the camera. Using the object height and 2D bounding box projection, we can estimate the distance to the camera through triangulation. Assuming a bounded height, we can accurately estimate the segment of the frustum where the object is located. In the example in Fig. 4, we define the minimum and maximum height values to be 1.5 m and 2 m.

For each 2D bounding box P = (c, u, v, w_u, h_v) generated by the 2D detection network, our 3D bounding box network is trained to estimate the parameters Q' = (δu', δv', δz', δh', δw', δl', θ'), which encode the location (x, y, z), dimensions (h, w, l), and orientation (θ') of a 3D bounding box as follows.
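As a structural sketch of this fusion head (not the full Gated3D network), the tf.keras code below follows the description above: per-slice convolutions without weight sharing, channel-wise concatenation with the backbone crop, five residual blocks, channel-wise softmax attention pooling over the spatial dimensions, and a final eight-coefficient regression. The layer widths, kernel sizes, crop resolution, and helper names (`gated3d_head`, `residual_block`) are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters):
    """Plain residual block; widths and kernel sizes are assumptions."""
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    if x.shape[-1] != filters:
        x = layers.Conv2D(filters, 1, padding="same")(x)   # match channels for the skip
    return layers.ReLU()(layers.Add()([x, y]))

def gated3d_head(crop_size=14, backbone_channels=256, slice_channels=32, filters=256):
    # RoI crops: one per gated slice (single channel) and one from the FPN backbone.
    slices = [layers.Input((crop_size, crop_size, 1)) for _ in range(3)]
    backbone = layers.Input((crop_size, crop_size, backbone_channels))

    # Gated stream: three convolutions per slice, no parameter sharing across slices.
    slice_feats = []
    for s in slices:
        x = s
        for _ in range(3):  # kernel size assumed 3x3
            x = layers.Conv2D(slice_channels, 3, padding="same", activation="relu")(x)
        slice_feats.append(x)

    # Fuse gated and backbone features along the channel dimension, then 5 residual layers.
    x = layers.Concatenate(axis=-1)(slice_feats + [backbone])
    for _ in range(5):
        x = residual_block(x, filters)

    # Softmax attention map per channel, used as pooling weights over height and width.
    att = layers.Conv2D(filters, 1)(x)
    att = layers.Reshape((crop_size * crop_size, filters))(att)
    att = layers.Softmax(axis=1)(att)
    feat = layers.Reshape((crop_size * crop_size, filters))(x)
    pooled = layers.Lambda(lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([att, feat])

    # Two fully connected layers, then the eight 3D box coefficients.
    h = layers.Dense(512, activation="relu")(pooled)
    h = layers.Dense(512, activation="relu")(h)
    out = layers.Dense(8)(h)  # (du, dv, dz, dh, dw, dl, sin(theta), cos(theta))
    return tf.keras.Model(slices + [backbone], out)
```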
3D Location.
We estimate the object's location (x, y, z) using its projection onto the image space, as well as a frustum segment. Specifically, we define the target δu', δv' values as

\delta u' = \left(\mathrm{Proj}_u(x, y, z) - u\right) / w_u,   (3)
\delta v' = \left(\mathrm{Proj}_v(x, y, z) - v\right) / h_v,   (4)

where Proj_u(x, y, z) and Proj_v(x, y, z) denote the u, v coordinates of the 2D projection of (x, y, z) onto the image space.

To define the target z, we first define a frustum segment used as a reference for depth estimation. Given an object with height h, we can estimate its distance to a camera with vertical focal length f_v as

f(h_v, h) = \frac{h}{h_v} f_v.   (5)

If we assume that h follows a Gaussian distribution with mean µ_h and standard deviation σ_h, then given P = (c, u, v, w_u, h_v) and f_v, we can constrain the distance from the object to the camera to the range [f(h_v, µ_h − σ_h), f(h_v, µ_h + σ_h)] or, more generally, we deduce that the frustum segment has a length

d = f(h_v, \mu_h + k\,\sigma_h) - f(h_v, \mu_h - k\,\sigma_h),   (6)

where k is a scalar that adjusts the segment extent and is inversely proportional to our prediction confidence.

Following these observations, the z-coordinate target of the 3D bounding box, δz', is given as

\delta z' = \frac{z - f(h_v, h)}{d}.   (7)

Note that learning δz' instead of the absolute depth z has the advantage that the target value includes a good depth estimate as a prior, and it is normalized by d, which varies according to the distance from the object to the camera. We have found this normalization to be key to estimating the absolute depth of the objects. Intuitively, at larger distances there is greater localization uncertainty in the labels and, as such, the training loss needs to account for this proportionally. Analogous to 2D detectors, this frustum segment can also be considered as an anchor, except that its position and dimensions are not fixed; instead, the camera model and object statistics adjust them accordingly.

During training, we use h from the ground truth; during inference, we use the network prediction.

Figure 3: From three gated slices, the proposed Gated3D architecture detects objects and predicts their 3D location, dimensions and orientation. Our network employs a 2D detection network to detect ROIs. The resulting 2D boxes are used to crop regions from both the backbone network and the input gated slices. Our 3D network estimates the 3D object parameters using a frustum segment computed from the 2D boxes and 3D statistics of the training data. The network processes the gated slices separately, then fuses the resulting features with the backbone features and estimates the 3D bounding box parameters.

Figure 4: There is an infinite number of 3D cuboids that can project to a given bounding box P. However, the object location can be reasonably estimated using the object height, its projected height, and the vertical focal length.
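The frustum-segment parameterization of Eqs. (5)-(7) reduces to a few lines. The sketch below computes the triangulated depth prior f(h_v, h), the segment length d from the per-class height statistics, and the normalized depth target δz'. The pedestrian height statistics, the value of k, and the focal length are example values, not the dataset statistics.

```python
def depth_from_height(h_obj, h_box_px, focal_v_px):
    """Eq. (5): triangulated distance f(h_v, h) = (h / h_v) * f_v."""
    return h_obj / h_box_px * focal_v_px

def frustum_segment_length(h_box_px, focal_v_px, mu_h, sigma_h, k=1.0):
    """Eq. (6): segment length d spanned by object heights mu_h +/- k*sigma_h."""
    far = depth_from_height(mu_h + k * sigma_h, h_box_px, focal_v_px)
    near = depth_from_height(mu_h - k * sigma_h, h_box_px, focal_v_px)
    return far - near

def depth_target(z, h_obj, h_box_px, focal_v_px, mu_h, sigma_h, k=1.0):
    """Eq. (7): normalized depth residual delta_z' = (z - f(h_v, h)) / d."""
    d = frustum_segment_length(h_box_px, focal_v_px, mu_h, sigma_h, k)
    return (z - depth_from_height(h_obj, h_box_px, focal_v_px)) / d

# Example: a pedestrian (assumed mu_h = 1.75 m, sigma_h = 0.25 m) whose 2D box is
# 60 px tall under a vertical focal length of 2000 px, annotated at z = 58 m.
print(depth_from_height(1.75, 60.0, 2000.0))                     # depth prior ~58.3 m
print(depth_target(58.0, 1.75, 60.0, 2000.0, 1.75, 0.25))        # small normalized residual
```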
3D Box Dimensions and Orientation.
The target 3D box dimensions are estimated using δh', δw', δl', which are defined as the offset between the per-class mean of the object dimensions and the true dimensions:

\delta p' = \frac{p - \mu_p}{\mu_p}, \quad \forall p \in \{h, w, l\}.   (8)

To learn the target orientation (observation angle) θ', the orientation is encoded as (sin θ', cos θ'), and the network is trained to estimate each parameter separately.

Given a 3D box parameter prediction Q = (δu, δv, δz, δh, δw, δl, sin θ, cos θ) and its corresponding ground-truth box Q' = (δu', δv', δz', δh', δw', δl', θ'), we define our overall loss L(Q, Q') as

L(Q, Q') = \alpha \sum_{l \in \{u, v, z\}} L_{loc}(\delta l - \delta l') + \sum_{d \in \{h, w, l\}} L_{dim}(\delta d - \delta d') + \beta\, L_{ori}(\sin\theta, \cos\theta, \theta'),   (9)

where L_loc is the location loss, L_dim is the dimension loss, and L_ori is the orientation loss. We use α and β to weight the location and orientation losses, and define these values during training. We define L_loc and L_dim as SmoothL1, and L_ori as

L_{ori}(\sin\theta, \cos\theta, \theta') = (\sin\theta - \sin\theta')^2 + (\cos\theta - \cos\theta')^2.   (10)

The method runs at approximately 10 FPS on an Nvidia RTX 2080 GPU in TensorFlow, without implementation optimizations such as TensorRT. We refer to the Supplemental Material for additional method and implementation details. We also provide detailed ablation studies, validating the architecture components of the model, in the same document.
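As a compact reference for Eqs. (9) and (10), the sketch below implements the loss in TensorFlow. The ordering of the eight predicted coefficients, the default weights α = β = 1, and the SmoothL1 threshold of 1 are assumptions; the paper sets the weights during training.

```python
import tensorflow as tf

def smooth_l1(x):
    """SmoothL1 (Huber with delta = 1) applied elementwise."""
    absx = tf.abs(x)
    return tf.where(absx < 1.0, 0.5 * tf.square(x), absx - 0.5)

def gated3d_box_loss(pred, target, alpha=1.0, beta=1.0):
    """Eqs. (9)-(10). pred: (..., 8) = (du, dv, dz, dh, dw, dl, sin_t, cos_t);
    target: (..., 7) = (du', dv', dz', dh', dw', dl', theta')."""
    loc = tf.reduce_sum(smooth_l1(pred[..., 0:3] - target[..., 0:3]), axis=-1)
    dim = tf.reduce_sum(smooth_l1(pred[..., 3:6] - target[..., 3:6]), axis=-1)
    ori = tf.square(pred[..., 6] - tf.sin(target[..., 6])) + \
          tf.square(pred[..., 7] - tf.cos(target[..., 6]))
    return alpha * loc + dim + beta * ori

# Example with a single predicted and ground-truth box.
pred = tf.constant([[0.1, -0.05, 0.2, 0.0, 0.1, -0.1, 0.7, 0.7]])
target = tf.constant([[0.0, 0.0, 0.1, 0.0, 0.0, 0.0, 0.785]])
print(gated3d_box_loss(pred, target).numpy())
```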
5. Datasets
In this section, we describe
Gated3D, our new dataset for 3D object detection with gated images.
Sensor Setup.
Since existing automotive datasets [1, 13, 17, 65] do not include measurements from gated cameras, we collected gated image data during a large-scale data acquisition in Northern Europe. Following [21], we used the gated system
BrightEye from BrightwayVision, which consists of:

• A gated CMOS pixel array with a pixel pitch of 10 µm. Using a focal length of 23 mm defines the horizontal and vertical field of view.
• Two repetitively pulsed vertical-cavity surface-emitting lasers (VCSELs), which act as a pulsed illumination source. The VCSELs emit light at 808 nm with 500 W peak power to comply with eye-safety regulations. The pulsed illumination is diffused, resulting in a wide illumination cone. The source is mounted below the bumper of the vehicle, see Figure 5.

The gated images consist of three exposure profiles as shown in Figure 2. The corresponding gate settings (delay, laser duration, gate duration) can be found in the supplement. For each single capture, multiple laser flashes are integrated on the chip before read-out in order to increase the measurement signal-to-noise ratio.

For comparison with state-of-the-art 3D detection approaches, our test vehicle is equipped with a Velodyne HDL64 lidar scanner and a stereo camera. The stereo system consists of two cameras with OnSemi AR0230 sensors mounted at a 20.3 cm baseline. All sensor specifications are listed in Figure 5. The gated camera runs freely and cannot be triggered, so to obtain matching measurements we compensate the egomotion of the lidar point clouds. The corresponding gated images are found using an adapted ROS MessageFilter [45], see Supplemental Material.
              Gated Camera                Stereo Camera       Lidar
Sensor        BrightwayVision BrightEye   2x OnSemi AR0230    Velodyne HDL64-S3D
Wavelength    808 nm                      Color               905 nm
Frame Rate    120 Hz                      30 Hz               10 Hz
Bit Depth     10 bit uint                 12 bit uint         32 bit float

Figure 5: Sensor setup for recording the proposed Gated3D dataset. For comparisons we also capture corresponding lidar point clouds and stereo image pairs. Note that the stereo camera is located at approximately the same position as the gated camera in order to ensure a similar viewpoint.
Collection and Split
We annotated 1.4 million frames collected at a framerate of 10 Hz, covering 10,000 km of driving in Northern Europe during winter. The annotation and capture procedures for the dataset are detailed in the supplement. The gated images have been manually labeled by human annotators matching lidar, gated and RGB frames simultaneously. In total, more than 100,000 objects are labeled, comprising 4 classes. The annotations were done over 12,997 image examples. The dataset is randomly split into a training set of 10,046 frames, a validation set of 1,000 frames and a test set of 1,941 frames. In addition to the gated images, our proposed dataset contains corresponding RGB stereo images captured by the stereo camera system described in the previous paragraph. In contrast to popular datasets, such as Waymo [1], KITTI [17] and Cityscapes [13], our dataset is significantly more challenging as it also includes many nighttime images and captures under adverse weather conditions such as snow and fog.
6. Assessment
Evaluation Setting.
The BEV and 2D/3D detection metrics defined in the KITTI evaluation framework are used for evaluation, as well as the metrics described by [63], which are computed with respect to distance ranges. Following Simonelli et al. [54], average precision (AP) is based on 40 recall positions to provide a fair comparison. We consider Pedestrian and Car as our target detection classes.

The 3D metrics are based on the intersection over union (IoU) between cuboids [11], which has the disadvantage of penalizing completely wrong detections and detections with an IoU below the threshold equally. Due to the emphasis on challenging scenarios in the dataset, as well as imperfect sensor synchronization, the dataset has notably more label noise than typical public datasets for 3D object detection. This problem is mitigated by using lower IoU thresholds than in KITTI: 0.2 for Car and 0.1 for Pedestrian. To focus on detection at different depth ranges, metrics based on difficulty, as defined in KITTI, are provided in the Supplemental Document.

Table 1: Object detection performance on the Gated3D dataset (test split). Our method outperforms monocular and stereo methods (bottom part of the table) over most of the short (0-30 m), middle (30-50 m) and long (50-80 m) distance ranges, as well as pseudo-lidar based methods trained on gated images. Interestingly, our model even outperforms the PointPillars lidar reference for Pedestrian detection at long distance ranges. (a) Average Precision on the Car class; (b) Average Precision on the Pedestrian class. Columns report 2D, 3D and BEV detection AP for daytime and nighttime images; the compared methods are PointPillars [32] (lidar), M3D-RPN [5] (RGB), Stereo-RCNN [35] (stereo), Pseudo-Lidar [60], Pseudo-Lidar++ [64] and PatchNet [41] (on gated depth), and the proposed Gated3D (gated).
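As a reference for how AP over 40 recall positions is computed, the sketch below interpolates precision (the maximum precision at any recall at or above each sample point) at 40 evenly spaced recall values, following the common AP|R40 convention of [54]; the exact sampling grid used in our evaluation code is an assumption here.

```python
import numpy as np

def ap_40(recall, precision):
    """Average precision over 40 recall positions with interpolated precision."""
    recall = np.asarray(recall)
    precision = np.asarray(precision)
    ap = 0.0
    for r in np.linspace(1.0 / 40.0, 1.0, 40):   # recall = 0 is skipped, as in AP|R40
        mask = recall >= r
        ap += precision[mask].max() if mask.any() else 0.0
    return ap / 40.0

# Toy precision/recall curve.
print(ap_40([0.1, 0.4, 0.8], [0.9, 0.8, 0.5]))
```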
Baselines.
We compare our approach to monocular, stereo, lidar, and pseudo-lidar methods. As monocular baseline, we evaluate M3D-RPN [5], which performs 3D object detection from a single RGB image via "depth-aware" convolution, where weights in one branch of the network are shared across rows only, assuming objects higher up in the image tend to be further away. As stereo method, we evaluate Stereo-RCNN [35], which utilizes stereo image pairs to predict left-right 2D bounding boxes and keypoints that are then used to infer 3D bounding boxes using geometric constraints. Recent pseudo-lidar methods allow us to compare our method with recent state-of-the-art methods that use a depth map as input, and therefore more directly assess the effectiveness of our model architecture in extracting information from gated images. To this end, we use the method from Gruber et al. [21] to first generate dense depth maps from gated images, back-project all pixels of the depth maps into 3D coordinates, and follow [60] to perform 3D object detection using Frustum PointNet [44]. We also evaluate the Pseudo-Lidar++ [64] depth correction method from sparse lidar, downsampled from our 64-layer lidar to four lidar rays. Furthermore, we evaluate PatchNet [41], which implements a pseudo-lidar approach based on a 2D image-based representation. As a lidar reference method with known (measured) depth, we evaluate PointPillars [32].

We use the corresponding open-source repositories and tune the hyperparameters of each baseline model during training on our dataset.
Experimental Validation.
Tables 1a and 1b, respectively, show Car and Pedestrian AP for 2D, 3D and BEV detection on the test set. These results demonstrate the utility of gated imaging for 3D object detection. Consistent with prior work [35], both the monocular and stereo baselines show a drop in performance with increasing distance. Monocular and stereo depth cues for a small automotive baseline of 10-30 cm are challenging to recover with increasing range.

The proposed Gated3D method offers a new image modality between monocular, stereo and lidar measurements. The results demonstrate improvement over intensity-only methods, especially for pedestrians and at night. Gated3D excels at detecting objects at long distances or in low-visibility situations. Note that pseudo-lidar and stereo methods can be readily combined with the proposed method: a gated stereo pair may capture stereo cues orthogonal to the gated cues exploited by the proposed method. For additional ablation studies on the components of the proposed method, please refer to the Supplemental Document.

Figure 6: Qualitative comparison against baseline methods on the captured dataset. Bounding boxes from the proposed method are tighter and more accurate than those of the state-of-the-art methods. This is seen in the second image, where the other methods show large errors in pedestrian bounding box heights. The BEV lidar overlays show that our method offers more accurate depth and orientation than the baselines. For example, the car in the intersection of the fourth image has a 90 degree orientation error in the pseudo-lidar and stereo baselines, and is missed by the monocular baseline. The advantages of our method are most noticeable for pedestrians, as cars are easier for other methods due to being large and specular (please zoom in the electronic version for details).

Figure 6 shows qualitative examples of our proposed method and of state-of-the-art methods. The color-coded gated images illustrate the semantic and spatial information of the gated data (red tones for closer objects and blue for those farther away). Our method accurately detects objects at both close and large distances, whereas other methods struggle, particularly in the safety-critical application of detecting pedestrians at night or in adverse weather conditions.
7. Conclusions and Future Work
This work presented the first 3D object detection method for gated images. As a low-cost alternative to lidar, Gated3D outperforms recent stereo and monocular detection methods, including state-of-the-art pseudo-lidar approaches. We expand on the CMOS sensor arrays used in passive imaging approaches by flood-illuminating the scene and capturing the temporal intensity variation in coarse temporal gates. Gated images allow us to leverage existing 2D feature-extraction architectures. We distribute the resulting features in the camera frustum along the corresponding gate, a representation that naturally encodes geometric constraints between the gates, without the need to first recover intermediate proxy depth maps. The proposed method runs at real-time rates, and we validate the method experimentally on 10,000 km of driving data, demonstrating higher 3D object detection accuracy than existing monocular or stereo detection methods, including recent stereo and monocular pseudo-lidar methods with similar cost to the proposed system. The proposed method allows for accurate object detection in low-illumination scenarios, where passive methods fail, while relying on a low-cost camera with an additional flash source.

In the future, gated imaging systems could benefit from stereo cues (in a stereo system). We envision our work as a first step towards gated imaging as a new sensor modality, beyond lidar, radar and camera, useful for a broad range of tasks in robotics and autonomous driving, including tracking, motion planning, SLAM, visual odometry, and large-scale scene understanding.

References

[1] Waymo open dataset: An autonomous driving dataset, 2019.
[2] A. Adam, C. Dann, O. Yair, S. Mazor, and S. Nowozin. Bayesian time-of-flight for realtime shape, illumination and albedo. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(5):851–864, 2017.
[3] P. Andersson. Long-range three-dimensional imaging using range-gated laser radar images. Optical Engineering, 45(3):034301, 2006.
[4] M. Bijelic, T. Gruber, F. Mannan, F. Kraus, W. Ritter, K. Dietmayer, and F. Heide. Seeing through fog without seeing fog: Deep multimodal sensor fusion in unseen adverse weather. arXiv preprint arXiv:1902.08913, 2020.
[5] G. Brazil and X. Liu. M3D-RPN: Monocular 3D region proposal network for object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 9287–9296, 2019.
[6] J. Busck. Underwater 3-D optical imaging with a gated viewing laser radar. Optical Engineering, 2005.
[7] J.-R. Chang and Y.-S. Chen. Pyramid stereo matching network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5410–5418, 2018.
[8] R. Chen, F. Mahmood, A. Yuille, and N. J. Durr. Rethinking monocular depth estimation with adversarial training. arXiv preprint arXiv:1808.07528, 2018.
[9] X. Chen, K. Kundu, Z. Zhang, H. Ma, S. Fidler, and R. Urtasun. Monocular 3D object detection for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2147–2156, 2016.
[10] X. Chen, K. Kundu, Y. Zhu, H. Ma, S. Fidler, and R. Urtasun. 3D object proposals using stereo imagery for accurate object class detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(5):1259–1272, 2017.
[11] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia. Multi-view 3D object detection network for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1907–1915, 2017.
[12] Y. Chen, S. Liu, X. Shen, and J. Jia. Fast Point R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 9775–9784, 2019.
[13] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[14] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In Advances in Neural Information Processing Systems, pages 2366–2374, 2014.
[15] M. Engelcke, D. Rao, D. Z. Wang, C. H. Tong, and I. Posner. Vote3Deep: Fast object detection in 3D point clouds using efficient convolutional neural networks. In IEEE International Conference on Robotics and Automation, pages 1355–1361, 2017.
[16] A. Foi, M. Trimeche, V. Katkovnik, and K. Egiazarian. Practical Poissonian-Gaussian noise modeling and fitting for single-image raw-data. IEEE Transactions on Image Processing, 17(10):1737–1754, 2008.
[17] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3354–3361, 2012.
[18] R. Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.
[19] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 580–587, 2014.
[20] C. Godard, O. Mac Aodha, and G. J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[21] T. Gruber, F. D. Julca-Aguilar, M. Bijelic, W. Ritter, K. Dietmayer, and F. Heide. Gated2Depth: Real-time dense lidar from gated images. CoRR, abs/1902.04997, 2019.
[22] M. Hansard, S. Lee, O. Choi, and R. P. Horaud. Time-of-flight cameras: principles, methods and applications. Springer Science & Business Media, 2012.
[23] R. Hartley and A. Zisserman. Multiple view geometry in computer vision. Cambridge University Press, 2003.
[24] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 2980–2988, 2017.
[25] P. Heckman and R. T. Hodgson. Underwater optical range gating. IEEE Journal of Quantum Electronics, 3(11):445–448, 1967.
[26] L. Huang, Y. Yang, Y. Deng, and Y. Yu. DenseBox: Unifying landmark localization with end to end object detection. arXiv preprint arXiv:1509.04874, 2015.
[27] A. Kendall, H. Martirosyan, S. Dasgupta, P. Henry, R. Kennedy, A. Bachrach, and A. Bry. End-to-end learning of geometry and context for deep stereo regression. In Proceedings of the IEEE International Conference on Computer Vision, 2017.
[28] J. J. Koenderink and A. J. Van Doorn. Affine structure from motion. Journal of the Optical Society of America A, 8(2):377–385, Feb 1991.
[29] A. Kolb, E. Barth, R. Koch, and R. Larsen. Time-of-flight cameras in computer graphics. In Computer Graphics Forum, volume 29, pages 141–159. Wiley Online Library, 2010.
[30] J. Ku, M. Mozifian, J. Lee, A. Harakeh, and S. L. Waslander. Joint 3D proposal generation and object detection from view aggregation. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1–8. IEEE, 2018.
[31] Y. Kuznietsov, J. Stückler, and B. Leibe. Semi-supervised deep learning for monocular depth map prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2215–2223, 2017.
[32] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom. PointPillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 12697–12705, 2019.
[33] R. Lange. 3D time-of-flight distance measurement with custom solid-state image sensors in CMOS/CCD-technology. 2000.
[34] B. Li. 3D fully convolutional network for vehicle detection in point cloud. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1513–1518. IEEE, 2017.
[35] P. Li, X. Chen, and S. Shen. Stereo R-CNN based 3D object detection for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[36] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 936–944, 2017.
[37] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2980–2988, 2017.
[38] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In Proceedings of the IEEE European Conference on Computer Vision, pages 21–37. Springer, 2016.
[39] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
[40] W. Luo, B. Yang, and R. Urtasun. Fast and Furious: Real time end-to-end 3D detection, tracking and motion forecasting with a single convolutional net. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3569–3577, 2018.
[41] X. Ma, S. Liu, Z. Xia, H. Zhang, X. Zeng, and W. Ouyang. Rethinking pseudo-lidar representation. arXiv preprint arXiv:2008.04582, 2020.
[42] X. Ma, Z. Wang, H. Li, P. Zhang, W. Ouyang, and X. Fan. Accurate monocular 3D object detection via color-embedded 3D reconstruction for autonomous driving. In Proceedings of the IEEE International Conference on Computer Vision, pages 6851–6860, 2019.
[43] A. Pilzer, D. Xu, M. Puscas, E. Ricci, and N. Sebe. Unsupervised adversarial depth estimation using cycled generative networks. In International Conference on 3D Vision (3DV), pages 587–595, 2018.
[44] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas. Frustum PointNets for 3D object detection from RGB-D data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 918–927, 2018.
[45] M. Quigley, K. Conley, B. Gerkey, J. Faust, T. Foote, J. Leibs, R. Wheeler, and A. Y. Ng. ROS: an open-source robot operating system. In IEEE International Conference on Robotics and Automation, volume 3, page 5. Kobe, Japan, 2009.
[46] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016.
[47] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
[48] A. Saxena, S. H. Chung, and A. Y. Ng. Learning depth from single monocular images. In Advances in Neural Information Processing Systems, pages 1161–1168, 2006.
[49] M. Schober, A. Adam, O. Yair, S. Mazor, and S. Nowozin. Dynamic time-of-flight. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6109–6118, 2017.
[50] B. Schwarz. Lidar: Mapping the world in 3D. Nature Photonics, 4(7):429, 2010.
[51] S. M. Seitz, B. Curless, J. Diebel, D. Scharstein, and R. Szeliski. A comparison and evaluation of multi-view stereo reconstruction algorithms. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 519–528, 2006.
[52] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229, 2013.
[53] S. Shi, C. Guo, L. Jiang, Z. Wang, J. Shi, X. Wang, and H. Li. PV-RCNN: Point-voxel feature set abstraction for 3D object detection. arXiv preprint arXiv:1912.13192, 2019.
[54] A. Simonelli, S. R. R. Bulò, L. Porzi, M. López-Antequera, and P. Kontschieder. Disentangling monocular 3D object detection. arXiv preprint arXiv:1905.12365, 2019.
[55] S. Song and J. Xiao. Sliding shapes for 3D object detection in depth images. In Proceedings of the IEEE European Conference on Computer Vision, pages 634–651. Springer, 2014.
[56] S. Song and J. Xiao. Deep sliding shapes for amodal 3D object detection in RGB-D images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 808–816, 2016.
[57] P. H. Torr and A. Zisserman. Feature based methods for structure and motion estimation. In International Workshop on Vision Algorithms, pages 278–294. Springer, 1999.
[58] B. Ummenhofer, H. Zhou, J. Uhrig, N. Mayer, E. Ilg, A. Dosovitskiy, and T. Brox. DeMoN: Depth and motion network for learning monocular stereo. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[59] D. Z. Wang and I. Posner. Voting for voting in online point cloud object detection. In Robotics: Science and Systems, volume 1, pages 10–15607, 2015.
[60] Y. Wang, W.-L. Chao, D. Garg, B. Hariharan, M. Campbell, and K. Q. Weinberger. Pseudo-lidar from visual depth estimation: Bridging the gap in 3D object detection for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8445–8453, 2019.
[61] X. Weng and K. Kitani. Monocular 3D object detection with pseudo-lidar point cloud. In Proceedings of the IEEE International Conference on Computer Vision Workshops, 2019.
[62] W. Xinwei, L. Youfu, and Z. Yan. Triangular-range-intensity profile spatial-correlation method for 3D super-resolution range-gated imaging. Applied Optics, 52(30):7399–406, 2013.
[63] B. Yang, W. Luo, and R. Urtasun. PIXOR: Real-time 3D object detection from point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7652–7660, 2018.
[64] Y. You, Y. Wang, W.-L. Chao, D. Garg, G. Pleiss, B. Hariharan, M. Campbell, and K. Q. Weinberger. Pseudo-lidar++: Accurate depth for 3D object detection in autonomous driving. arXiv preprint arXiv:1906.06310, 2019.
[65] F. Yu, W. Xian, Y. Chen, F. Liu, M. Liao, V. Madhavan, and T. Darrell. BDD100K: A diverse driving video database with scalable annotation tooling. arXiv preprint arXiv:1805.04687, 2018.
[66] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe. Unsupervised learning of depth and ego-motion from video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[67] Y. Zhou and O. Tuzel. VoxelNet: End-to-end learning for point cloud based 3D object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.