Standard and Event Cameras Fusion for Dense Mapping
Yan Dong
Department of Automation, Tsinghua University, China
Abstract
Event cameras are bio-inspired sensors that generate data only where and when the brightness changes. Thanks to their low latency and high dynamic range (HDR), they are widely used in the field of mobile robots. However, due to the sparse nature of the event stream, event-based mapping can only recover sparse or semi-dense edge 3D maps. By contrast, standard cameras provide complete frames. To leverage the complementarity of event-based and standard frame-based cameras, we propose a fusion strategy for dense mapping in this paper. We first generate an edge map from events, and then fill the map using frames to obtain a dense depth map.
Keywords:
Multi-sensor Fusion, Event Camera, SLAM, Depth Estimation
Introduction

Mapping is a core building block of many vision tasks such as simultaneous localization and mapping (SLAM). Frame-based mapping has been studied for decades, while event-based mapping has been gaining attention only recently. Event-based mapping can be divided into two categories, monocular and stereo. Monocular mapping methods include reconstruction-based mapping [1], the contrast maximization framework [2] [3], and multi-view stereo approaches [4]. However, the maps produced by these methods are usually sparse "edge maps". A recent work [5] used optical flow to generate dense depth, performing dense disparity estimation on frames and using events to track the disparity between frames.

We notice that standard cameras provide frames containing all the information about the scene, while event cameras only capture the changes, which usually happen at edges. We therefore fuse the two types of sensors to obtain a dense 3D map. The main contribution of this paper is a dense mapping method that uses frames to fill the sparse 3D map obtained from the event stream. More specifically, we first segment frames into different regions and then project the 3D map points obtained from events (event map points) onto the image. The depth at each pixel is calculated if its region possesses enough projected event map points. To the best of our knowledge, the proposed method is the first to fuse events and frames for monocular dense mapping.

Figure 1: Dense mapping on datasets. Top: edge map by events. Bottom: dense map after fusing frames.

Method
We choose EMVS [4] to obtain the sparse 3D map from the event stream, but we point out that any event-based 3D mapping method could be used. We explain the EMVS method briefly here; readers can refer to [4] for more details.

When the camera is moving, the events generated by the event camera are caused by an "event source", i.e., a 3D map point in the scene, if noise is not considered. If we back-project the events caused by the same event source, the rays must intersect at that 3D map point.

EMVS first chooses a reference view, then discretizes the volume containing the 3D scene and counts the number of viewing rays passing through each voxel. A voxel is determined to be a 3D point of the scene if its count is a local maximum.
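To make the ray-counting step concrete, here is a minimal Python sketch of such a voting scheme. It is a simplification for illustration, not the released EMVS implementation: events are assumed to be (x, y, t) tuples, pose_at(t) is a hypothetical function returning the 4x4 camera-to-world pose at time t, K the pinhole intrinsics, T_w_ref the reference-view pose, and depth_planes the discretized depths of the volume.

```python
import numpy as np

def build_dsi(events, pose_at, K, T_w_ref, depth_planes, img_size, n_samples=100):
    """Count, for every voxel of a volume attached to the reference view,
    how many event viewing rays pass through it (a disparity space image)."""
    H, W = img_size
    depth_planes = np.asarray(depth_planes, dtype=float)
    K_inv = np.linalg.inv(K)
    T_ref_w = np.linalg.inv(T_w_ref)                      # world -> reference camera
    dsi = np.zeros((len(depth_planes), H, W))
    z_samples = np.linspace(depth_planes[0], depth_planes[-1], n_samples)
    for x, y, t in events:                                # one event = (pixel x, pixel y, time)
        T_w_cam = pose_at(t)                              # event camera -> world at time t (4x4)
        ray = K_inv @ np.array([x, y, 1.0])               # viewing ray of the event in its camera
        for z in z_samples:                               # discretize the ray by depth samples
            p_w = T_w_cam @ np.append(ray * z, 1.0)       # 3D sample on the ray, world frame
            p_r = (T_ref_w @ p_w)[:3]                     # same point in the reference camera
            if p_r[2] <= 0.0:
                continue
            u, v, _ = K @ (p_r / p_r[2])                  # project into the reference image
            u, v = int(round(u)), int(round(v))
            k = int(np.argmin(np.abs(depth_planes - p_r[2])))  # nearest depth plane
            if 0 <= u < W and 0 <= v < H:
                dsi[k, v, u] += 1                         # one more ray through this voxel
    return dsi

def extract_map_points(dsi, depth_planes, min_count=5):
    """Keep, per pixel, the strongest depth plane if its count is high enough
    (a simplification of the local-maxima selection used by EMVS)."""
    best = dsi.argmax(axis=0)
    counts = dsi.max(axis=0)
    vs, us = np.nonzero(counts >= min_count)
    return [(u, v, depth_planes[best[v, u]]) for v, u in zip(vs, us)]
```

In EMVS proper, 3D points are taken at local maxima of this count volume; the extract_map_points helper above keeps only the strongest plane per pixel as a simplification.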
We select the frame captured at the reference view and segment the image using a region-growing segmentation method. Since events are only generated at pixels where the brightness changes, it is natural to segment regions by their grayscale values. First, each unlabeled pixel is set to be a new region, which then grows to adjacent pixels whose intensities are similar. The region stops growing when the intensities of all adjacent pixels differ too much. Finally, small regions are merged into surrounding ones.
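A minimal sketch of such a region-growing pass over the grayscale frame is given below; the 4-connectivity and the intensity tolerance are illustrative choices, not the exact parameters of our implementation.

```python
import numpy as np
from collections import deque

def region_growing(gray, intensity_tol=10):
    """Label connected regions of similar grayscale value by region growing."""
    H, W = gray.shape
    labels = np.zeros((H, W), dtype=np.int32)              # 0 = not yet labeled
    next_label = 0
    for sy in range(H):
        for sx in range(W):
            if labels[sy, sx]:
                continue
            next_label += 1                                 # every unlabeled pixel seeds a region
            seed_val = int(gray[sy, sx])
            labels[sy, sx] = next_label
            queue = deque([(sy, sx)])
            while queue:                                    # grow into similar 4-neighbours
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < H and 0 <= nx < W and labels[ny, nx] == 0
                            and abs(int(gray[ny, nx]) - seed_val) <= intensity_tol):
                        labels[ny, nx] = next_label
                        queue.append((ny, nx))
    # merging of small regions into a surrounding region is omitted in this sketch
    return labels
```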
We project all map points obtained from events onto the frame and call them "projected events". A region will be filled if there are enough projected events on its contour and only a limited number of projected events inside it. When counting the projected events on the contour, we notice that, due to the discretization error of the volume, some projected map points are not located exactly on the contour of the region. We therefore count the projected events within a 3-pixel ring around the contour. When the number of projected events on the region contour is larger than a threshold (e.g., 30% of the contour length), while the number of projected events inside the region is smaller than a threshold (e.g., 5% of the region size), the region can be filled.

A non-parametric method is adopted to fill each region. For each pixel inside the region, we compute the Euclidean distance to every projected event in this region and estimate the depth at this pixel by a weighted summation:

$$ d(p) = \omega \sum_{k} \frac{\mathrm{depth}(e_k)}{\lVert e_k - p \rVert}, \qquad p \in R_i, $$

where $d(p)$ is the estimated depth at pixel $p$ in region $R_i$, $\mathrm{depth}(e_k)$ is the depth of the $k$-th projected event (i.e., of its 3D map point), $\lVert \cdot \rVert$ is the $\ell_2$ norm, and $\omega$ is the normalization factor.
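The sketch below illustrates both steps for a single region under assumed inputs: region_mask and contour_mask are boolean images of the region and its contour, and ev_px / ev_depth hold the projected event map points associated with the region and their depths. All names and thresholds are illustrative, not the exact implementation.

```python
import numpy as np

def fill_region(region_mask, contour_mask, ev_px, ev_depth,
                ring=3, contour_ratio=0.30, interior_ratio=0.05):
    """Apply the contour/interior criterion, then fill the region with
    inverse-distance-weighted depths (the d(p) formula)."""
    H, W = region_mask.shape
    ev_px = np.asarray(ev_px, dtype=int)            # (N, 2) pixel coordinates (u, v)
    ev_depth = np.asarray(ev_depth, dtype=float)    # (N,) depths of the map points

    # 1. Count projected events in a `ring`-pixel band around the contour and inside the region.
    band = np.zeros_like(contour_mask)
    for y, x in zip(*np.nonzero(contour_mask)):
        band[max(0, y - ring):y + ring + 1, max(0, x - ring):x + ring + 1] = True
    on_band = band[ev_px[:, 1], ev_px[:, 0]]
    on_contour = int(np.sum(on_band))
    inside = int(np.sum(region_mask[ev_px[:, 1], ev_px[:, 0]] & ~on_band))
    if (on_contour < contour_ratio * contour_mask.sum()
            or inside > interior_ratio * region_mask.sum()):
        return None                                 # criterion not met, leave the region unfilled

    # 2. Inverse-distance weighted depth for every pixel of the region.
    depth = np.full((H, W), np.nan)
    pts = ev_px.astype(float)
    for y, x in zip(*np.nonzero(region_mask)):
        dist = np.linalg.norm(pts - np.array([x, y], dtype=float), axis=1) + 1e-6
        weights = 1.0 / dist
        depth[y, x] = np.sum(weights * ev_depth) / np.sum(weights)  # ω normalizes the weights
    return depth
```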
Experiments

We test our method on the Event Camera Dataset [6]. We use the released EMVS code (https://github.com/uzh-rpg/emvs) to obtain the edge 3D map. Results are shown in Fig. 2.

Figure 2: Results on the dynamic (top) and boxes (bottom) sequences. From left to right: frame at the reference view, segmentation results, regions to be filled (colored) with projected events (red pixels), depth image after filling, and 3D dense point cloud.

Conclusion

In this paper, we have introduced a novel approach for dense mapping by fusing frames and events. The sparse 3D map is obtained with the EMVS method and the dense map is generated by fusing frames: frames are segmented into different regions, and the depth of each pixel in a region is calculated from the projected 3D map points. To the best of our knowledge, this is the first dense mapping method fusing a monocular event camera and a standard camera.

We will further improve our method by adopting a better segmentation method as well as a better filling strategy. Although the current segmentation and filling methods are naive, we believe this is an important attempt to fuse the two sensors for dense mapping and hope it can inspire other researchers.

References

[1] Hanme Kim, Stefan Leutenegger, and Andrew J. Davison. Real-time 3D reconstruction and 6-DoF tracking with an event camera. In Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling, editors, Computer Vision – ECCV 2016, pages 349–364, Cham, 2016. Springer International Publishing.
[2] Guillermo Gallego, Henri Rebecq, and Davide Scaramuzza. A unifying contrast maximization framework for event cameras, with applications to motion, depth, and optical flow estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[3] Guillermo Gallego, Mathias Gehrig, and Davide Scaramuzza. Focus is all you need: Loss functions for event-based vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.

[4] Henri Rebecq, Guillermo Gallego, Elias Mueggler, and Davide Scaramuzza. EMVS: Event-based multi-view stereo - 3D reconstruction with an event camera in real-time. International Journal of Computer Vision, 126(12):1394–1414, 2018.

[5] Antea Hadviger, Ivan Marković, and Ivan Petrović. Stereo dense depth tracking based on optical flow using frames and events. Advanced Robotics, 35(3-4):141–152, 2021.

[6] Elias Mueggler, Henri Rebecq, Guillermo Gallego, Tobi Delbruck, and Davide Scaramuzza. The event-camera dataset and simulator: Event-based data for pose estimation, visual odometry, and SLAM. The International Journal of Robotics Research, 36(2):142–149, 2017.