RGB-D-E: Event Camera Calibration for Fast 6-DOF Object Tracking

Etienne Dubeau*, Mathieu Garon†, Benoit Debaque‡, Raoul de Charette§, Jean-François Lalonde¶
Université Laval, Thales Digital Solutions, Inria
Figure 1: Our RGB-D-E hardware setup uses a Kinect Azure (RGB-D) and a DAVIS346 event-based camera (E, behind an IR filter) for continuous events which are temporally binned. The RGB-D-E data streams are spatially and temporally calibrated and used for 6 degree of freedom object tracking.

ABSTRACT
Augmented reality devices require multiple sensors to perform various tasks such as localization and tracking. Currently, popular cameras are mostly frame-based (e.g. RGB and depth), which imposes a high data bandwidth and power usage. With the need for low-power and more responsive augmented reality systems, relying solely on frame-based sensors limits the algorithms that require high-frequency data from the environment. As such, event-based sensors have become increasingly popular due to their low power, bandwidth and latency, as well as their very high-frequency data acquisition capabilities. In this paper, we propose, for the first time, to use an event-based camera to increase the speed of 3D object tracking in 6 degrees of freedom. This application requires handling very high object speeds to convey compelling AR experiences. To this end, we propose a new system which combines a recent RGB-D sensor (Kinect Azure) with an event camera (DAVIS346). We develop a deep learning approach which combines an existing RGB-D network with a novel event-based network in a cascade fashion, and demonstrate that our approach significantly improves the robustness of a state-of-the-art frame-based 6-DOF object tracker using our RGB-D-E pipeline. Our code and our RGB-D-E evaluation dataset are available at https://github.com/lvsn/rgbde_tracking.

Index Terms: Event camera, calibration, 6-DOF object tracking, augmented reality
1 INTRODUCTION
Compelling augmented reality (AR) experiences are achieved through the successful execution of several tasks in parallel. Notably, simultaneous localization and mapping (SLAM) [31], hand tracking [30], and object tracking in 6 degrees of freedom (6-DOF) [7] must all be executed efficiently and concurrently with minimal latency on portable, energy-efficient devices.

This paper focuses on the task of 6-DOF rigid object tracking. In this scenario, successfully tracking the object at high speed is particularly important, since freely manipulating an object can easily result in translational and angular speeds of up to 1 m/s and 360°/s respectively. Despite recent progress on real-time 6-DOF object tracking at 30 fps [7, 20, 24], these methods still have trouble with very high object motion, and tracking failures are still common. Increasing the speed of 6-DOF object trackers is therefore of paramount importance to bring this problem closer to real-world applications.

To increase the speed of object tracking, one can trivially employ cameras with frame rates higher than 30 fps. Indeed, 90 and even 120 fps off-the-shelf cameras are available and could be used as a drop-in replacement. However, this comes with significant practical disadvantages: higher data bandwidth, increased power consumption (since the algorithms must be executed more often), and the necessity of having sufficient light in the scene since the exposure time of each frame is necessarily decreased.

In this work, we propose a system to increase the speed of 6-DOF object tracking applications with a minimal increase in bandwidth and power consumption. Specifically, we propose to combine an event camera (specifically, the DAVIS346 camera) with an RGB-D camera (the Kinect Azure) into a single "RGB-D-E" capture system. The event camera offers several key advantages: very low latency (20 µs), bandwidth, and power consumption (10–30 mW), all while having a much greater dynamic range (120 dB vs 60 dB) than frame-based cameras.

This paper makes the following contributions. First, we show how to calibrate the setup both spatially and temporally. Second, we provide a new challenging, publicly available 6-DOF evaluation dataset that contains approximately 2,500 RGB-D-E frames of a real-world object with high-speed motion and the corresponding ground truth pose at each frame. Third, we propose what we believe to be the first 6-DOF object tracker that uses event-based data. Similar to previous work [7, 20, 24], our approach assumes that the object to track is rigid (non-deforming) and that its textured 3D model is known a priori. Finally, we demonstrate through a quantitative analysis on our real evaluation dataset that extending an existing deep learning approach for 6-DOF object tracking results in a threefold decrease in the number of tracking failures and achieves robust tracking on fast free interaction motions. We believe this paper brings 6-DOF object tracking one step closer to real-world augmented reality consumer applications.

2 RELATED WORK
The majority of computer vision systems rely on established frame-based camera architectures, where the scene irradiance is captured synchronously at each pixel or in a rapid, rolling shutter sequence [21]. However, such cameras need to stream large amounts of data (most of which is redundant), making them power- and bandwidth-hungry. Recently, a newer camera architecture based on an event-based paradigm [22] has been gaining popularity. By triggering events at each pixel asynchronously when the brightness at that pixel changes by a certain threshold, event-based cameras can stream at a much higher frequency while consuming less power. A branch of computer vision research now focuses on developing algorithms that take advantage of this new type of data.
Event-based applications.
Event-based sensors bring great promise to the field, as their low power consumption makes them ideal for embedded systems such as virtual reality headsets [5], drones [3, 40] or autonomous driving [26]. Their high temporal resolution also enables the design of robust high-frequency algorithms such as SLAM [1, 5, 16, 32, 36, 41, 42] or fast 2D object tracking [9, 28]. While related to our work since we also focus on tracking, these works are restricted to tracking objects in the 2D image plane. In this paper, we extend the use of event cameras to the challenging task of fast 6-DOF object tracking by building on a state-of-the-art frame-based 6-DOF object tracker [6]. Different from other works, we benefit from RGB, depth and event data to propose the first RGB-D-E 6-DOF object tracker.
Deep learning with events.
Using event-based data is not straightforward, since the most efficient deep architectures for vision are designed for processing conventional image data (e.g. CNNs). In fact, it is still unclear how event-based data should be provided to networks, since each event is a 4-dimensional vector storing time, 2D position, and event polarity. Experimental architectures such as spiking neural networks [23] hold great promise but are currently unstable or difficult to train [18]. With conventional deep frameworks, events can be converted to 2D tensors only by discarding both the time and polarity dimensions [35], or to 3D tensors by discarding either of the two [26, 46]. Recent work [37, 39] has demonstrated that conventional grayscale frames can be reconstructed from event data, opening the way to the use of existing algorithms on these "generated" images. In this paper, we favor the Event Spike Tensor formulation from Gehrig et al. [8], where the time dimension is binned. This allows us to exploit event data directly without requiring the synthesis of intermediate images, while maintaining a fast convolutional network architecture.
Event-based datasets.
Finally, a large amount of training data is required. While a few event datasets exist, mostly for localization/odometry [3, 19, 29] or 2D object tracking [12], there is, as of yet, no 6-DOF object tracking dataset which contains event data. Instead, event data can be synthesized with a simulator such as [34], which allows various types of data augmentation [38]. Our experiments show that a network can be trained without using real data and is not critically affected by the real-synthetic domain gap.
3 SYSTEM OVERVIEW AND CALIBRATION
In this section, we describe our novel RGB-D-E hardware setup, which combines a Microsoft Kinect Azure (RGB-D) with a DAVIS346 event camera (E).
As illustrated in Fig. 1, the DAVIS346 event camera is rigidly mounted over the Kinect Azure using a custom-designed, 3D-printed mount. We observed that the modulated IR signal projected on the scene by the Time-of-Flight (ToF) sensor of the Kinect triggered multiple events in the DAVIS346 camera. To remedy this limitation, an infrared filter is placed in front of the event camera lens.

Figure 2: Comparison between the factory presets and our calibration. (a) The reprojection error from the Depth to RGB image ($T^{\mathrm{RGB}}_{\mathrm{Depth}}$), computed on 51 matching planar checkerboard images. (b) Linear regression of the Kinect depth map error compared to the expected depth, computed on calibration target corners.

Figure 3: Events from the DAVIS346 camera projected on Kinect RGB frames of a moving calibration target. The projection is obtained using the calibrated transformation $T^{\mathrm{RGB}}_{\mathrm{Event}}$. Events of a moving checkerboard are accumulated and represented as red (negative) and blue (positive) on both frames. Pixels with more than one event are shown to reduce distraction by noise. Frames are captured at a $\Delta t = 1/15$ s interval, and events between $t$ and $t + \Delta t$ are displayed. The proper alignment of events with the checkerboard demonstrates that the system is calibrated both spatially and temporally.

Our system contains 3 cameras that must be calibrated: the Kinect RGB, the Kinect Depth and the DAVIS346 sensor. In this paper, we describe coordinate system transformations with the notation $T^{b}_{a}$, denoting a transformation matrix from coordinate frame $a$ to $b$. The intrinsic parameters of each camera can be computed with a standard method [44]. The checkerboard corners can easily be found using the color frame and the IR image from the Kinect Azure. Calibrating an event-based sensor is usually more difficult; however, the DAVIS346 possesses an APS sensor (gray scale frame-based capture) that is spatially aligned with the event-based capture sensor. We thus use the APS sensor to detect the target corners that are used for the intrinsic and extrinsic calibration.

Intrinsics.
We capture images where a checkerboard target (9 × 14, with 54 mm squares) is positioned so as to spatially cover the frustum of each camera uniformly. To account for the varying fields of view, 199 images were captured for the Kinect RGB, 112 for the Kinect Depth, and 50 for the DAVIS346. For each sensor, we retrieve the intrinsic parameters (focal length and image center) with a lens distortion model including 6 radial and 2 tangential parameters.

Extrinsics.
We retrieve the rigid transformations $T^{\mathrm{Depth}}_{\mathrm{RGB}}$ and $T^{\mathrm{Depth}}_{\mathrm{Event}}$ by capturing images of the target in overlapping frustums. Once the 3D points are retrieved from the previously computed camera intrinsics and the known checkerboard geometry, PnP [4] is used to retrieve the 6-DOF transformation between each pair of cameras. Finally, we compare our calibration procedure with the factory presets of the Kinect Azure. Motivated by previous work [2] that demonstrates lower accuracy with factory preset calibrations, we capture a test dataset of 45 target images and show that we obtain a lower reprojection error in Fig. 2-(a).
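As an illustration of this step, the sketch below (not the authors' code) shows how a board pose can be recovered in each camera with OpenCV's checkerboard detector and a PnP solver, and how two such poses compose into the rigid transform between cameras; the helper names and the corner-count convention are assumptions.

```python
# Minimal sketch: recover the rigid transform between two calibrated cameras
# from a shared checkerboard view (illustrative, not the authors' code).
import cv2
import numpy as np

def board_points(rows=9, cols=14, square=0.054):
    # 3D checkerboard corner coordinates in the board frame (metres)
    pts = np.zeros((rows * cols, 3), np.float32)
    pts[:, :2] = np.mgrid[0:cols, 0:rows].T.reshape(-1, 2) * square
    return pts

def pose_from_view(gray, K, dist, rows=9, cols=14):
    # Detect corners and solve PnP to get the board pose in this camera's frame
    ok, corners = cv2.findChessboardCorners(gray, (cols, rows))
    assert ok, "checkerboard not found"
    ok, rvec, tvec = cv2.solvePnP(board_points(rows, cols), corners, K, dist)
    R, _ = cv2.Rodrigues(rvec)
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, tvec.ravel()
    return T  # board pose expressed in this camera's frame

# Composition of the two simultaneous board poses gives T^Depth_RGB
# (the transform from the RGB frame to the Depth frame):
# T_depth_board = pose_from_view(ir_img, K_depth, dist_depth)
# T_rgb_board   = pose_from_view(rgb_img, K_rgb, dist_rgb)
# T_depth_rgb   = T_depth_board @ np.linalg.inv(T_rgb_board)
```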
As [7, 11] reported for the Kinect 2, we also found that the depth from the Kinect Azure has an offset that changes linearly w.r.t. the depth, resulting in an average error of around 8.5 mm. We compare the target points from the calibration dataset with the depth pixels in each frame and fit a 2nd-degree polynomial to the errors w.r.t. their distance to the camera. In Fig. 2-(b), we show the error with and without the polynomial correction on the test calibration set. Using the correction, the mean error on the test calibration set is less than 4 mm.
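A minimal sketch of the depth-correction fit described above, assuming NumPy; the sample depth values are illustrative placeholders, not measurements from the paper.

```python
# Minimal sketch of the 2nd-degree polynomial depth correction (illustrative).
import numpy as np

# measured_depth: depth reported by the sensor at checkerboard corners (mm)
# expected_depth: depth of the same corners from the calibrated target pose (mm)
measured_depth = np.array([520.0, 1010.0, 1490.0, 1985.0])   # hypothetical values
expected_depth = np.array([512.0, 1000.5, 1478.0, 1971.0])   # hypothetical values

# Fit error = f(distance to camera) with a 2nd-degree polynomial
coeffs = np.polyfit(measured_depth, measured_depth - expected_depth, deg=2)

def correct_depth(d_mm):
    # Subtract the predicted offset from the raw sensor reading
    return d_mm - np.polyval(coeffs, d_mm)
```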
In a multi-sensor setup, each sensor acquires data at its own frequency, aligned with its internal clock. For time-critical applications such as fast object tracking, the sensor clocks must be synchronized to ensure temporal alignment of the data. This is commonly addressed with synchronization pulses emitted by a master sensor at the beginning of each data frame acquisition, subsequently triggering the acquisition of the other (slave) sensors.

In our setup, both the Kinect and the DAVIS346 support hardware synchronization, but we found that the Kinect (master) emits a variable number of pulses before the first RGB-D frame. This led to incorrect triggering of the DAVIS346 (slave) and thus temporal misalignment of the RGB-D and event data. Because pulses are always emitted at the same frequency, we fix this by computing the pulse offset $\delta$ as

$$\delta = \lfloor \mathrm{RGBD}_t \times \mathrm{RGBD}_{fps} \rfloor, \quad (1)$$

where $\mathrm{RGBD}_t$ is the timestamp of the first RGB-D frame and $\mathrm{RGBD}_{fps}$ is the Kinect framerate (here, 30). Following this, we can pair RGB-D and event frames as $(\mathrm{RGBD}_i, E_{i+\delta})$. Fig. 3 illustrates the projection of events captured on a moving checkerboard. The events are captured between the two RGB frames. Alignment with the borders of the pattern shows that the calibration is correct both temporally and spatially.
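The offset of eq. (1) and the resulting frame pairing can be written down directly; the following is a small sketch with hypothetical variable names, assuming the 30 fps master clock stated above.

```python
# Minimal sketch of the synchronization fix in eq. (1): the pulse offset delta is
# derived from the timestamp of the first RGB-D frame, then used to pair frames.
import math

RGBD_FPS = 30  # Kinect frame rate

def pulse_offset(first_rgbd_timestamp_s):
    # delta = floor(RGBD_t * RGBD_fps), eq. (1)
    return math.floor(first_rgbd_timestamp_s * RGBD_FPS)

def pair_frames(rgbd_frames, event_windows, delta):
    # Pair (RGBD_i, E_{i+delta}); event_windows[i] holds the events of one 33 ms bin
    return [(rgbd_frames[i], event_windows[i + delta])
            for i in range(len(rgbd_frames))
            if i + delta < len(event_windows)]
```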
4 FAST OBJECT TRACKING
With the sensors spatio-temporally calibrated, we enhance an existing tracking framework with the addition of the new event modality (E). We build on the work of Garon et al. [6, 7], who propose a deep learning approach for robust 6-DOF object tracking which relies on the refinement between a render of the object at the current pose estimate and the current Kinect RGB-D frame. While this method is robust to occlusion and small displacements, we notice that it is significantly impacted by larger motions (over 0.5 m/s), possibly because of the induced motion blur. Additionally, the network in [7] is fundamentally limited to a maximum pose translation of 2 cm between two frames. We note that increasing the sensor frame rate is not a practical solution either, as the network computation time is the main bottleneck. In this section, we improve the tracker reactivity and robustness with the addition of an event-specific network. In the following, we first describe the generation of synthetic data for training, then explain how the frame-based and event-based trackers are jointly used.
Despite the existence of event datasets [12, 19, 45], none of them provide event data with 6-DOF object poses. Since capturing a dataset of sufficient magnitude and variety for training a deep network is prohibitive, we rely on synthetic data generated from an event camera simulator [34]. The engine renders a stream of events that represent changes in pixel brightness, thus mimicking event-based sensors. We build a training dataset by generating sequences of events where our target object (here, a toy dragon) is moved in front of a static camera. We acquire a textured 3D model of the dragon with a Creaform GoScan™ handheld 3D scanner at 1 mm voxel resolution, subsequently cleaned manually using Creaform VxElements™ to remove background and spurious vertices. As the camera remains stationary, we simulate the scene background with a random RGB texture from the SUN3D dataset [43] applied on a plane orthogonal to the virtual camera optical axis. We next describe the simulation setup, followed by the various data augmentation strategies applied to the data samples.
Simulation details.
Event sequences are generated by first positioning the object in front of the camera at a random distance $d \sim \mathcal{U}(0.45\,\mathrm{m},\,\cdot\,)$ (where $\mathcal{U}(a, b)$ denotes a uniform distribution over the $[a, b]$ interval) and with a random orientation. The center of mass of the object is aligned with the optical axis of the camera, so the object appears in the center of the frame. The object is then displaced by a random pose transformation over 33 ms, and the generated events are recorded. The transformation is generated by first sampling two directions on the sphere using spherical coordinates $(\theta, \phi)$ with $\theta \sim \mathcal{U}(-180°, 180°)$ and $\phi = \cos^{-1}(2x - 1)$, where $x \sim \mathcal{U}(0, 1)$ as in [7], and then sampling the magnitudes of the translation and rotation from $\mathcal{U}(0, 0.04\,\mathrm{m})$ and a uniform angular interval, respectively. A 3D bounding box of size 0.207 m around the object is projected on the image plane. The event spatial axes are then cropped according to the projected bounding box and resized with bilinear interpolation to a spatial resolution of 150 × 150. Each sequence yields a set of $N$ events storing $\{t, x, y, p\}_{i=1..N}$, where $t$ is time, $x$ and $y$ are pixel coordinates and $p$ is the polarity of the event (positive or negative, indicating a brighter or darker transition respectively). A total of 10 such event sets are simulated for each background image, leading to 180,000 training and 18,000 validation sets.
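As a sketch of the sampling procedure above (not the authors' code), a direction is drawn uniformly on the sphere via $\theta$ and $\phi = \cos^{-1}(2x - 1)$ and combined with a random magnitude; the maximum translation and rotation magnitudes below are placeholder assumptions for illustration.

```python
# Minimal sketch of the random relative pose sampling described above.
import numpy as np

def random_unit_vector(rng):
    # Uniform direction on the sphere: theta uniform, phi = arccos(2x - 1)
    theta = rng.uniform(-np.pi, np.pi)
    phi = np.arccos(2.0 * rng.uniform(0.0, 1.0) - 1.0)
    return np.array([np.sin(phi) * np.cos(theta),
                     np.sin(phi) * np.sin(theta),
                     np.cos(phi)])

def random_relative_pose(rng, max_trans=0.04, max_rot_deg=20.0):  # bounds assumed
    t = random_unit_vector(rng) * rng.uniform(0.0, max_trans)      # metres
    axis = random_unit_vector(rng)
    angle = np.deg2rad(rng.uniform(0.0, max_rot_deg))
    # Rodrigues' formula for the rotation around 'axis' by 'angle'
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    R = np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T

rng = np.random.default_rng(0)
delta_pose = random_relative_pose(rng)   # 4x4 relative transformation
```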
Data augmentation.

To maximize the realism of our simulations, we randomize some parameters as in [38] to increase variability in the dataset and reduce the domain gap between synthetic and real data. The contrast threshold, which defines the change in brightness required to generate an event, is difficult to estimate precisely on real sensors [5] and is instead sampled from a Gaussian distribution $\mathcal{N}(\cdot, \cdot)$ (where $\mathcal{N}(a, b)$ denotes a Gaussian distribution with mean $a$ and standard deviation $b$). Subsequently, the proportion of ambient versus diffuse lighting for the OpenGL rendering engine (employed in the simulator) is randomly sampled from $\mathcal{U}(0, 1)$. To simulate tracking errors, the center of the bounding box is offset by a random displacement of Gaussian-distributed magnitude (in pixels). Finally, we notice the appearance of white noise captured by the DAVIS346. To quantify this noise, we capture a sequence of a static scene (which should generate no events) with the DAVIS346 and count the number of noisy events generated in each 33 ms window. A Gaussian distribution is then fit to the number of noisy events. At training time, we sample a number $k$ from the fitted distribution and randomly select $k$ elements in the set (across $t$, $x$, and $y$) to add uniformly to the input volume. This process is done separately for each polarity (positive and negative). Fig. 4 shows the qualitative similarity between real samples acquired with the DAVIS346 and our synthetic samples at the same pose.

Figure 4: Qualitative comparison between real (top) and synthetic events (bottom). Synthetic frames are generated with the event simulator of Rebecq et al. [34], where the pose of the synthetic object is adjusted to match the real one. Event polarities are displayed as blue (positive) and red (negative).
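To make the noise-injection step above concrete, the following sketch scatters a Gaussian-distributed number of spurious events uniformly over one polarity volume; the mean and standard deviation are placeholders for the statistics fitted on real static captures.

```python
# Minimal sketch of the white-noise augmentation described above (illustrative).
import numpy as np

def add_noise_events(event_tensor, rng, mean=50.0, std=15.0):
    # event_tensor: (T, H, W) count volume for a single polarity
    k = max(0, int(round(rng.normal(mean, std))))       # number of spurious events
    t = rng.integers(0, event_tensor.shape[0], size=k)
    y = rng.integers(0, event_tensor.shape[1], size=k)
    x = rng.integers(0, event_tensor.shape[2], size=k)
    np.add.at(event_tensor, (t, y, x), 1.0)             # accumulate spurious events
    return event_tensor

rng = np.random.default_rng(0)
noisy = add_noise_events(np.zeros((9, 150, 150)), rng)
```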
Figure 5: Overview of our method for high-speed tracking with both sensors. On the left, our event network predicts the relative pose change $\Delta P$ from $e_{[t-1,t]}$. The predicted pose is transformed to the RGB referential to estimate the current pose $P'_t$. On the right, the frame-based network [7] uses the improved pose $P'_t$ to compute a final pose refinement.

In this paper, we assume that the pose of the object in the previous frame, $P_{t-1}$, is known. In a full system, it could be initialized by a 3D object detector (e.g. SSD-6D [15]) at the first frame ($t = 0$). Trackers estimate the pose change $\Delta P$ between two frames such that an estimate of the current pose $P_t$ can be obtained by

$$P_t = \Delta P \, P_{t-1}. \quad (2)$$

Note that all poses $P_i$ are expressed in the RGB camera coordinate system.

In this work, we rely on two deep networks to estimate $\Delta P$. First, our novel event network $f_e(e_{[t-1,t]})$ takes event data $e_{[t-1,t]}$ accumulated during the $[t-1, t]$ time interval and cropped according to the previous object pose $T^{\mathrm{Event}}_{\mathrm{RGB}} P_{t-1}$. Here, $T^{\mathrm{Event}}_{\mathrm{RGB}} = (T^{\mathrm{Depth}}_{\mathrm{Event}})^{-1} \, T^{\mathrm{Depth}}_{\mathrm{RGB}}$ is the extrinsic camera calibration matrix from sec. 3.2, necessary to transform the pose estimate into the event camera coordinate system. Second, we also employ the RGB-D frame network of Garon et al. [7], $f_f(f_t, P_{t-1})$. Although more recent techniques exist, this choice was made because it is the only one providing both training and inference code, and it offers robust performance. Alternatively, other approaches such as [14, 25, 27] could also be considered. However, [14] is already outperformed by [7], the inference code of [25] is limited to LineMOD objects [10], and [27] is an improvement over [7] for occlusion handling, which is not the focus here. Note that our method is not limited to this specific network and could be extended to any RGB or RGB-D frame-based tracker.

Each network aims to estimate the relative 6-DOF pose of the object. Interestingly, while events are much more robust to fast displacements, they carry less textural information than RGB-D data, and we found that the event network used on its own is slightly less accurate. Therefore, we use a cascade approach where the event network first estimates $P'_t$, and the frame network is subsequently provided with this new estimate for refinement:

$$P'_t = \left(T^{\mathrm{RGB}}_{\mathrm{Event}} \, f_e(e_{[t-1,t]})\right) P_{t-1}, \quad (3)$$

$$P_t = f_f\!\left(f_t, r(P'_t)\right) P'_t, \quad (4)$$

with $T^{\mathrm{RGB}}_{\mathrm{Event}}$ obtained from the extrinsic camera calibration matrices of sec. 3.2 as before. Note that $f_f(\cdot)$ is an iterative method and can be run multiple times to refine its prediction. To simplify the notation we show a single iteration; in practice, 3 iterations are used as in the original implementation. A diagram overview of the method is provided in fig. 5.
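The cascade of eqs. (2)–(4) can be summarized with the following sketch, where `f_e`, `f_f` and the renderer `r` stand in for the event network, the frame network of [7] and the render function; this is illustrative pseudocode in Python form rather than the authors' implementation (the event cropping step is omitted).

```python
# Minimal sketch of the cascade of eqs. (2)-(4); all poses are 4x4 matrices.
import numpy as np

def track_step(P_prev, events, rgbd_frame, f_e, f_f, r, T_rgb_event, n_iter=3):
    # Event network first: predict a relative pose in the event camera frame,
    # then bring it back to the RGB referential (eq. 3).
    delta_e = f_e(events)                       # 4x4 relative pose
    P_prime = T_rgb_event @ delta_e @ P_prev

    # Frame network refinement (eq. 4), run iteratively as in [7].
    P = P_prime
    for _ in range(n_iter):
        delta_f = f_f(rgbd_frame, r(P))         # 4x4 relative pose
        P = delta_f @ P
    return P

# Toy usage with identity placeholders standing in for the networks and renderer:
I = np.eye(4)
P_t = track_step(I, None, None, lambda e: I, lambda f, rend: I, lambda P: None, I)
```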
Figure 6: Our deep network architecture for the (a) event and (b) RGB-D frames. We use the same network architecture as [7]. The main differences are that in (a) we only have one head, and a learnable kernel is added to merge the temporal information of the event input data $e_{[t-1,t]}$ [8]. (a) Input: $e_{[t-1,t]}$ → kernel-5 → conv-3-64 → fire-32-64 → fire-64-128 → fire-128-256 → fire-128-512 → FC-500 → FC-6 → Output: $\Delta P$. (b) Inputs: $r(P_{t-1})$ and $f_t$ → conv-3-64 (each) → fire-32-64 (each) → concatenation → fire-64-256 → fire-128-512 → fire-256-1024 → FC-500 → FC-6 → Output: $\Delta P$. The notation "kernel-$x$" represents a learnable kernel of dimension $x$ convolved over the temporal dimension, with weights shared for every pixel. The notation "conv-$x$-$y$" represents a 2D convolution layer with $y$ filters of size $x \times x$, "fire-$x$-$y$" are "fire" modules [13] reducing the channels to $x$ and expanding to $y$, and "FC-$x$" are fully connected layers of size $x$. Each fire module has skip-links and is followed by a 2 × 2 max pooling.

Event data is fundamentally different from frame-based data, as it possesses two extra dimensions for time and polarity ($T \times P \times X \times Y$, where $T$ is discretized time and $P$ is polarity). We use the "Event Spike Tensor" representation from [8], where the time dimension is binned (in our case, 9 bins for a 33 ms sample) and the polarity dimension is removed by simply subtracting the negative events from the positive ones. Finally, the spatial dimensions are resized as explained in the previous section. The final tensor has a shape of 9 × 150 × 150, where each voxel represents the number of events recorded per time bin. We normalize that quantity between 0 and 1 by dividing each voxel by the maximum number of events observed in a single voxel during training.
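A minimal sketch of how such an event spike tensor can be assembled from a raw event list is given below, assuming the events have already been cropped and resized to 150 × 150; the normalization constant is a placeholder for the training-set statistic mentioned above.

```python
# Minimal sketch of the event spike tensor: events over a 33 ms window are binned
# into 9 temporal slices, negative events are subtracted from positive ones.
import numpy as np

def event_spike_tensor(t, x, y, p, t0, t1, n_bins=9, size=150, max_count=10.0):
    # t: timestamps, x/y: integer pixel coordinates within the resized crop,
    # p: polarity in {+1, -1}; t0/t1: window bounds; max_count: assumed constant
    bins = np.clip(((t - t0) / (t1 - t0) * n_bins).astype(int), 0, n_bins - 1)
    vol = np.zeros((n_bins, size, size), np.float32)
    np.add.at(vol, (bins, y, x), p.astype(np.float32))  # signed accumulation
    # Normalize by the maximum per-voxel count observed during training (placeholder)
    return vol / max_count
```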
Event network architecture.
While the event spike tensor could be processed by a standard CNN, we follow [8] and first learn a 1D filter in the time dimension, then apply standard image convolutions where the time dimension acts as different channels. In practice, we use the same backbone as [7] for both the RGB-D frame network and the event network, changing only the first two input layers to match the event spike tensor. Fig. 6 (a) shows the event network architecture. The event network is optimized with ADAM [17] at a learning rate of 0.001 and a batch size of 256. We train for 40 epochs and apply learning rate scheduling by multiplying the learning rate by 0.3 every 8 epochs.
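A possible PyTorch rendition of the first two layers ("kernel-5" followed by "conv-3-64") is sketched below; it reflects one reading of the temporal-kernel front-end of [8] and is illustrative rather than the exact implementation.

```python
# Minimal PyTorch sketch of the event network front-end: a learnable 1D kernel
# applied along the 9-bin time axis (weights shared across pixels), followed by
# a standard 2D convolution where time bins act as channels.
import torch
import torch.nn as nn

class EventFrontEnd(nn.Module):
    def __init__(self, n_bins=9, kernel_size=5, out_channels=64):
        super().__init__()
        # Temporal filtering: treat each pixel independently, convolve over time
        self.temporal = nn.Conv1d(1, 1, kernel_size, padding=kernel_size // 2)
        self.spatial = nn.Conv2d(n_bins, out_channels, 3, padding=1)

    def forward(self, x):                                    # x: (B, T, H, W)
        b, t, h, w = x.shape
        x = x.permute(0, 2, 3, 1).reshape(b * h * w, 1, t)   # one sequence per pixel
        x = self.temporal(x)
        x = x.reshape(b, h, w, t).permute(0, 3, 1, 2)        # back to (B, T, H, W)
        return torch.relu(self.spatial(x))

net = EventFrontEnd()
out = net(torch.zeros(2, 9, 150, 150))                       # -> (2, 64, 150, 150)
```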
The RGB-D network (see [7] for more details) takes as input the current RGB-D frame cropped according to the previous pose, $f_t$, and a rendering of the object at the previous pose, $r(P_{t-1})$. Both inputs have a shape of 4 × 184 × 184 and are normalized by subtracting the mean and dividing by the standard deviation of a subset of the training dataset. The last layer outputs the predicted 6-DOF pose difference between both inputs.
RGB-D network architecture.
As shown in fig. 6-(b), each input is individually convolved, then passed to a "fire" module [13]. The module outputs are then concatenated before being max pooled. The resulting single feature map is fed to multiple "fire" modules before being passed to two fully connected layers. The RGB-D network is trained with the same optimizer, hyper-parameters and data augmentations as the original work (see [7] for more details).

Figure 7: Translation and rotation error as a function of object displacement speed, computed over two consecutive frames. The first row shows the test set distribution indicating the number of frames where the object has a particular (a) translation and (b) rotation speed. The second row plots the distribution of errors between the prediction and the ground truth, computed separately for (c) translation (eq. 5) and (d) rotation (eq. 6). Errors are computed on 10 sequences with a total of 2,472 frames.
The average inference time is 29.02 ms, split into 25.06 ms for the RGB-D network and 3.96 ms for the event network. Note that inference can be reduced to close to 25 ms in total by running the networks in parallel, since the total memory footprint is 152.48 MB (frame) + 54.96 MB (event) = 207.44 MB, which easily fits on a modern GPU. Runtimes are averaged over 100 samples and computed on an Intel i5 and an Nvidia GeForce GTX 1060. The numbers reported above include all preprocessing steps, such as calculating the bounding box from the last known position and rendering the image for [7]. Note that building the "Event Spike Tensor" representation can be achieved in real time while the previous frame is being processed by the networks (average of 9 ms). While our current, unoptimized implementation computes those steps serially, this operation could be trivially parallelized.
5 EXPERIMENTS
We now proceed to evaluate our RGB-D-E system for the high-speed tracking of 3D objects in 6-DOF. We first describe our real test dataset, then present quantitative and qualitative results.
In order to compare the RGB-D and RGB-D-E trackers, we capture a series of real sequences of a rigid object freely moving at various speeds with different environment perturbations, and record the corresponding RGB-D frames and events using our capture setup. To provide a quantitative evaluation, we obtain the ground truth pose of the object at each frame using the approach described below. We capture a total of 10 sequences with an average duration of 10 seconds, for a total of 2,472 frames and corresponding event data. Examples from different sequences of the dataset are shown in Fig. 8. The full dataset is publicly available.

Figure 8: Visualization of the RGB-D-E dataset. Event amplitude (top), from 0 to 5 events accumulated over 33 ms; event polarities are displayed as blue (positive) and red (negative). Timestamps (second row) associated with each event; only the last timestamp is displayed for each pixel, ranging from 0 to 33 ms. Synchronized RGB (third row) and depth map (bottom) from the Kinect. The depth map interval is from 0 to 4 m.

For each sequence, we first manually align the 3D model with the object on the first RGB frame. Then, we use ICP [33] to align the visible 3D model vertices with the depth from the RGB-D frame, back-projected in 3D. To avoid back-projecting the entire depth frame, only a bounding box of 280 mm centered around the initial pose is kept. Vertex visibility is computed using raytracing and updated at each iteration of ICP. If the angular pose difference between two successive iterations of ICP is less than 10°, it is deemed to have converged and that pose is kept. If that condition is not met after a maximum of 10 iterations, ICP has diverged and the final pose is refined manually. For all subsequent frames in the sequence, ICP is initialized with the pose from the previous frame. In all, every frame in our test dataset is manually inspected to ensure that a good quality pose is obtained, even when it has been determined automatically.

We quantitatively compare our RGB-D-E tracker with the RGB-D approach of Garon et al. [7], which is the current state of the art in 6-DOF object tracking. We represent a pose $P = [R \mid \mathbf{t}]$ by a rotation matrix $R$ and a translation vector $\mathbf{t}$. The translation error $\delta_t$ between a pose estimate and its ground truth (denoted by $*$) is reported as the L2 norm between the two translation vectors,

$$\delta_t(\mathbf{t}^*, \mathbf{t}) = \|\mathbf{t}^* - \mathbf{t}\|_2. \quad (5)$$

The rotation error between the two rotation matrices is computed using

$$\delta_R(R^*, R) = \arccos\left(\frac{\mathrm{Tr}(R^\top R^*) - 1}{2}\right), \quad (6)$$

where $\mathrm{Tr}(\cdot)$ denotes the matrix trace.

Fig. 7 compares the translation and rotation errors obtained by both approaches. These plots report the error between two adjacent frames only: the trackers are initialized to their ground truth pose at the initial frame. Our method reports lower errors at translation speeds higher than 20 mm/frame, which corresponds to approximately 600 mm/s, and similar rotation errors overall. This is not surprising, given that our method relies on the RGB-D network of Garon et al. [7] to obtain its final pose estimate.

However, visualizing the per-frame error does not tell the whole story. Indeed, in a practical scenario the trackers estimate a succession of predictions instead of being reset to the ground truth pose at every frame. Errors, even small ones, may therefore accumulate over time and result in tracking failure. Following [7], we consider a tracking failure when either $\delta_t(\mathbf{t}^*_i, \mathbf{t}_i)$ or $\delta_R(R^*_i, R_i)$ exceeds the failure thresholds of [7]. Results of this analysis are presented in Tab. 1. The experiment is done at 30 fps, 15 fps, and 10 fps, obtained by down-sampling the input frame-based frequency. To allow comparison, the event network is run at the same frequency with the same 33 ms time window to accumulate events. For all frame rates, our RGB-D-E approach has at least 61% fewer failures than the RGB-D approach of Garon et al. [7].

Fig. 9 shows representative qualitative results comparing both techniques with the ground truth. Those results show that the approach of Garon et al. [7] is affected by the strong motion blur which arises under fast object motion. In contrast, our approach remains stable and can follow the object through very fast motion. Please see the video results in the supplementary materials.

Figure 9: Qualitative comparison between the different approaches on different sequences. The overlay is the position predicted by each method. From top to bottom: Garon et al. [7] (blue), ours (pink) and the ground truth (yellow). Each frame is continuous in the sequence and cropped according to the ground truth position.
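For reference, the metrics of eqs. (5) and (6) can be computed as follows (a small sketch, assuming rotation matrices and translation vectors as NumPy arrays).

```python
# Minimal sketch of the evaluation metrics of eqs. (5) and (6): L2 translation
# error and geodesic rotation error between an estimated pose and ground truth.
import numpy as np

def translation_error(t_gt, t_est):
    return np.linalg.norm(t_gt - t_est)                      # eq. (5)

def rotation_error(R_gt, R_est):
    cos_angle = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0       # eq. (6)
    return np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
```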
Table 1: Number of tracking failures for each method (columns: Method, Failures).
6 DISCUSSION
We present a novel acquisition setup for simultaneous RGB-D-E capture, which combines a Kinect Azure camera with a DAVIS346 sensor. With the new event modality, we show that a state-of-the-art RGB-D 6-DOF object tracker can be significantly improved in terms of tracking speed. We capture an evaluation dataset with ground truth 3D object poses that mimics difficult scenarios typically encountered in augmented reality applications: a user manipulating a small object with fast free motions. Using this dataset, we demonstrate that our approach achieves a threefold decrease in loss of tracking over the previous state of the art, thereby bringing 6-DOF object tracking closer to applicability in real-life scenarios.
Limitations and future work.
First, capturing an evaluation dataset is time-consuming, and obtaining the 6-DOF ground truth pose of the object is difficult, especially when fast motions are involved. While our semi-automatic approach provided a way to acquire a small number of sequences easily, scaling up to larger RGB-D-E datasets will require a more sophisticated apparatus such as a motion capture (mocap) setup, as in [7]. Indeed, mocap systems are ideal for this use case as they can track the object robustly at high frame rates. Second, while the cascade scheme significantly improves the robustness of the tracker to large motions, it is still inherently limited in accuracy since it always relies on the frame network. The success of the cascade configuration motivates further exploration of better ways to fuse the event modality with the frame-based modalities. Third, we notice that the trackers are still sensitive to dynamic backgrounds (see the last example in the supplementary video). We anticipate that this could be partially addressed by generating training data with spurious structured events, such as those that would be created by a dynamic background (or a moving camera). These represent exciting future research directions that we plan to investigate in order to achieve even more robust and accurate object tracking systems that can be used in real-world augmented reality applications.
ACKNOWLEDGMENTS
The authors wish to thank Jérémie Roy for his help with data acquisition. This work was partially supported by a FRQ-NT Samuel de Champlain grant, the NSERC CRDPJ 524235-18 grant, and Thales. We thank Nvidia for the donation of the GPUs used in this research.

REFERENCES

[1] S. Bryner, G. Gallego, H. Rebecq, and D. Scaramuzza. Event-based, direct camera tracking from a photometric 3D map using nonlinear optimization. In International Conference on Robotics and Automation, 2019.
[2] C. Chen, B. Yang, S. Song, M. Tian, J. Li, W. Dai, and L. Fang. Calibrate multiple consumer RGB-D cameras for low-cost and efficient 3D indoor mapping. Remote Sensing, 10(2):328, 2018.
[3] J. Delmerico, T. Cieslewski, H. Rebecq, M. Faessler, and D. Scaramuzza. Are we ready for autonomous drone racing? The UZH-FPV drone racing dataset. In International Conference on Robotics and Automation (ICRA), 2019.
[4] M. A. Fischler and R. C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.
[5] G. Gallego, J. E. Lund, E. Mueggler, H. Rebecq, T. Delbruck, and D. Scaramuzza. Event-based, 6-DOF camera tracking from photometric depth maps. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(10):2402–2412, 2017.
[6] M. Garon and J.-F. Lalonde. Deep 6-DOF tracking. IEEE Transactions on Visualization and Computer Graphics, 23(11), Nov. 2017.
[7] M. Garon, D. Laurendeau, and J.-F. Lalonde. A framework for evaluating 6-DOF object trackers. In European Conference on Computer Vision, 2018.
[8] D. Gehrig, A. Loquercio, K. Derpanis, and D. Scaramuzza. End-to-end learning of representations for asynchronous event-based data. In IEEE/CVF International Conference on Computer Vision, Oct. 2019.
[9] A. Glover and C. Bartolozzi. Robust visual tracking with a freely-moving event camera. In IEEE/RSJ International Conference on Intelligent Robots and Systems, 2017.
[10] S. Hinterstoisser, V. Lepetit, S. Ilic, S. Holzer, G. Bradski, K. Konolige, and N. Navab. Model based training, detection and pose estimation of texture-less 3D objects in heavily cluttered scenes. In Asian Conference on Computer Vision, pp. 548–562. Springer, 2012.
[11] T. Hodan, P. Haluza, Š. Obdržálek, J. Matas, M. Lourakis, and X. Zabulis. T-LESS: An RGB-D dataset for 6D pose estimation of texture-less objects. In IEEE Winter Conference on Applications of Computer Vision, 2017.
[12] Y. Hu, H. Liu, M. Pfeiffer, and T. Delbruck. DVS benchmark datasets for object tracking, action recognition, and object recognition. Frontiers in Neuroscience, 10:405, 2016.
[13] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv preprint arXiv:1602.07360, 2016.
[14] D. Joseph Tan, F. Tombari, S. Ilic, and N. Navab. A versatile learning-based 3D temporal tracker: Scalable, robust, online. In Proceedings of the IEEE International Conference on Computer Vision, pp. 693–701, 2015.
[15] W. Kehl, F. Manhardt, F. Tombari, S. Ilic, and N. Navab. SSD-6D: Making RGB-based 3D detection and 6D pose estimation great again. In IEEE International Conference on Computer Vision, 2017.
[16] H. Kim, S. Leutenegger, and A. J. Davison. Real-time 3D reconstruction and 6-DOF tracking with an event camera. In European Conference on Computer Vision, 2016.
[17] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[18] J. H. Lee, T. Delbruck, and M. Pfeiffer. Training deep spiking neural networks using backpropagation. Frontiers in Neuroscience, 10:508, 2016.
[19] W. Li, S. Saeedi, J. McCormac, R. Clark, D. Tzoumanikas, Q. Ye, Y. Huang, R. Tang, and S. Leutenegger. InteriorNet: Mega-scale multi-sensor photo-realistic indoor scenes dataset. arXiv preprint arXiv:1809.00716, 2018.
[20] Y. Li, G. Wang, X. Ji, Y. Xiang, and D. Fox. DeepIM: Deep iterative matching for 6D pose estimation. In European Conference on Computer Vision, 2018.
[21] C.-K. Liang, L.-W. Chang, and H. H. Chen. Analysis and compensation of rolling shutter effect. IEEE Transactions on Image Processing, 17(8):1323–1330, 2008.
[22] P. Lichtsteiner, C. Posch, and T. Delbruck. A 128x128 120 dB 15 µs latency asynchronous temporal contrast vision sensor. IEEE Journal of Solid-State Circuits, 43(2):566–576, 2008.
[23] W. Maass and H. Markram. On the computational power of circuits of spiking neurons. Journal of Computer and System Sciences, 69(4):593–616, 2004.
[24] F. Manhardt, W. Kehl, N. Navab, and F. Tombari. Deep model-based 6D pose refinement in RGB. In European Conference on Computer Vision, 2018.
[25] F. Manhardt, W. Kehl, N. Navab, and F. Tombari. Deep model-based 6D pose refinement in RGB. Lecture Notes in Computer Science, pp. 833–849, 2018. doi: 10.1007/978-3-030-01264-9_49
[26] A. I. Maqueda, A. Loquercio, G. Gallego, N. García, and D. Scaramuzza. Event-based vision meets deep learning on steering prediction for self-driving cars. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[27] I. Marougkas, P. Koutras, N. Kardaris, G. Retsinas, G. Chalvatzaki, and P. Maragos. How to track your dragon: A multi-attentional framework for real-time RGB-D 6-DOF object pose tracking, 2020.
[28] A. Mitrokhin, C. Fermüller, C. Parameshwara, and Y. Aloimonos. Event-based moving object detection and tracking. In IEEE/RSJ International Conference on Intelligent Robots and Systems, 2018.
[29] E. Mueggler, H. Rebecq, G. Gallego, T. Delbruck, and D. Scaramuzza. The event-camera dataset and simulator: Event-based data for pose estimation, visual odometry, and SLAM. The International Journal of Robotics Research, 36(2):142–149, 2017.
[30] F. Mueller, D. Mehta, O. Sotnychenko, S. Sridhar, D. Casas, and C. Theobalt. Real-time hand tracking under occlusion from an egocentric RGB-D sensor. In IEEE International Conference on Computer Vision Workshops, 2017.
[31] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohi, J. Shotton, S. Hodges, and A. Fitzgibbon. KinectFusion: Real-time dense surface mapping and tracking. In IEEE International Symposium on Mixed and Augmented Reality, 2011.
[32] A. Nguyen, T.-T. Do, D. G. Caldwell, and N. G. Tsagarakis. Real-time 6-DOF pose relocalization for event cameras with stacked spatial LSTM networks. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019.
[33] F. Pomerleau, F. Colas, R. Siegwart, and S. Magnenat. Comparing ICP variants on real-world data sets. Autonomous Robots, 34(3):133–148, Feb. 2013.
[34] H. Rebecq, D. Gehrig, and D. Scaramuzza. ESIM: An open event camera simulator. In Conference on Robot Learning, 2018.
[35] H. Rebecq, T. Horstschaefer, and D. Scaramuzza. Real-time visual-inertial odometry for event cameras using keyframe-based nonlinear optimization. 2017.
[36] H. Rebecq, T. Horstschäfer, G. Gallego, and D. Scaramuzza. EVO: A geometric approach to event-based 6-DOF parallel tracking and mapping in real time. IEEE Robotics and Automation Letters, 2(2):593–600, 2016.
[37] H. Rebecq, R. Ranftl, V. Koltun, and D. Scaramuzza. Events-to-video: Bringing modern computer vision to event cameras. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[38] H. Rebecq, R. Ranftl, V. Koltun, and D. Scaramuzza. High speed and high dynamic range video with an event camera. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[39] C. Scheerlinck, H. Rebecq, D. Gehrig, N. Barnes, R. Mahony, and D. Scaramuzza. Fast image reconstruction with an event camera. In IEEE Winter Conference on Applications of Computer Vision, 2020.
[40] A. R. Vidal, H. Rebecq, T. Horstschaefer, and D. Scaramuzza. Hybrid, frame and event based visual inertial odometry for robust, autonomous navigation of quadrotors. arXiv preprint arXiv:1709.06310, 2017.
[41] A. R. Vidal, H. Rebecq, T. Horstschaefer, and D. Scaramuzza. Ultimate SLAM? Combining events, images, and IMU for robust visual SLAM in HDR and high-speed scenarios. IEEE Robotics and Automation Letters, 3(2):994–1001, 2018.
[42] D. Weikersdorfer, D. B. Adrian, D. Cremers, and J. Conradt. Event-based 3D SLAM with a depth-augmented dynamic vision sensor. In IEEE International Conference on Robotics and Automation, 2014.
[43] J. Xiao, A. Owens, and A. Torralba. SUN3D: A database of big spaces reconstructed using SfM and object labels. In IEEE International Conference on Computer Vision, 2013.
[44] Z. Zhang. A flexible new technique for camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(11):1330–1334, 2000.
[45] A. Z. Zhu, Z. Wang, K. Khant, and K. Daniilidis. EventGAN: Leveraging large scale image datasets for event cameras. arXiv preprint arXiv:1912.01584, 2019.
[46] A. Z. Zhu, L. Yuan, K. Chaney, and K. Daniilidis. Unsupervised event-based learning of optical flow, depth, and egomotion. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.