HDRFusion: HDR SLAM using a low-cost auto-exposure RGB-D sensor
Shuda Li, Ankur Handa, Yang Zhang, Andrew Calway
University of Bristol, University of Cambridge
Abstract.
We describe a new method for comparing frame appearance in a frame-to-model 3-D mapping and tracking system using a low dynamic range (LDR) RGB-D camera, which is robust to brightness changes caused by auto exposure. It is based on a normalised radiance measure which is invariant to exposure changes. This not only robustifies the tracking under changing lighting conditions, but also enables the subsequent exposure compensation to be performed accurately, allowing online building of high dynamic range (HDR) maps. The latter facilitates frame-to-model tracking to minimise drift, as well as better capturing light variation within the scene. Results from experiments with synthetic and real data demonstrate that the method provides both improved tracking and maps with far greater dynamic range of luminosity.
Keywords: high dynamic range, 3-D mapping and tracking, auto exposure, RGB-D cameras
Most existing methods for dense visual/RGB-D 3-D mapping and tracking rely on the brightness constancy assumption, i.e. the brightness of 3-D points observed from different viewing positions is constant. These can be categorized into using either a global or a local constancy assumption. The former assumes that any two overlapping frames from a sequence fulfil the condition [1], whilst the latter requires only that consecutive frames do [2,3]. The global assumption enables frame-to-model tracking, which is known to accumulate less drift [4], whilst the local assumption is easier to meet in practice but means that the tracking is done frame-to-frame, with a consequent increase in drift.

However, both of the above assumptions are broken in reality when using cameras equipped with automatic exposure (AE). AE is designed to map the high dynamic range of scene luminance into a narrow range for display devices while remaining suitable for the human eye. When the camera moves from a bright to a dark area, the exposure time is increased automatically so that more light can be captured by the camera sensor, and vice versa when the camera moves from dark to bright regions. This breaks the global assumption, since images viewing the same scene area from different viewing positions are seldom captured at the same auto-exposure. The local assumption is more likely to be met, as exposure usually changes smoothly, but this assumption also breaks when video flickering occurs. Video flickering artefacts, also known as brightness fluctuation, happen when a camera moves across the boundary between a bright and a dark area, or moves quickly back and forth between them: in these scenarios, the exposure changes dramatically in a short period of time and results in flickering. Turning AE off can ensure brightness constancy, but it is often undesirable since it leaves bright areas over-exposed and dark areas under-exposed, leading to the loss of important visual detail.

Fig. 1: (a) shows that the proposed frame-to-model tracking using normalized radiance delivers the best tracking accuracy using visual data. The tracking is performed on a challenging synthetic flickering RGB-D sequence. (b)-(e) are screen captures from videos released with previous works. Specifically, (b) and (d) are from [5], where (b) is the raw input image and (d) is the predicted scene texture; the unrealistic artefacts, marked by red circles, indicate insufficient exposure compensation. (c) is the predicted scene texture from [7]; (e) is from [6]. Similar artefacts can be seen in these results. (f) in the top right shows the results from our implementation of [3] using an RGB-D video sequence; the artefacts are very strong due to the large camera exposure adjustment when moving from the bright area (top of the scene) to the dark area (bottom left of the scene). (g) in the bottom right shows the predicted textures using the proposed HDRFusion. It can be seen that they are free of artefacts; the HDR textures are visualized using the Mantiuk tone mapping operator [8].

AE also poses a problem when texturing a 3-D model of the scene. Overlapping images captured with inconsistent brightness will leave mosaic artefacts when projected back onto the model surface. This is a common problem for many state-of-the-art dense mapping systems, as illustrated in Fig. 1. The problem has been widely addressed in conventional model texturing, panoramic imaging [9] and video tonal stabilization [10,11].
These works tackle the problem by compensating the global brightness of input images and blending colours along the boundaries between input images to create consistent texture. But these are usually expensive offline approaches and are mainly aimed at delivering visually pleasing results rather than maintaining fidelity to the real-world luminance.

Fig. 2: Flow chart of HDRFusion. The boxes represent data structures, ellipses represent data-transforming modules and arrows represent data flow. From left, input RGB frames are converted into a radiance map. The camera pose is tracked in a frame-to-model style. Note that the confidence map is also used in the exposure estimation module, but for simplicity this data flow is not shown.

In this paper, we introduce a novel technique for appearance-based frame comparison which allows robust frame-to-model mapping and tracking using RGB frames with AE enabled. It is very robust to brightness fluctuation and is capable of capturing a consistent HDR texture on the 3-D surface of the model (Fig. 1(a)). The HDR range corresponding to real-world luminance values is illustrated in Fig. 1(g) using the Mantiuk tone mapping operator (TMO) [8].

The key assumption of the work is that the luminance of the real world is globally constant and invariant to video brightness changes due to AE. The main challenge lies in how to build a real-time system capable of tracking reliably with AE enabled, so that HDR luminance can be captured by fusing low dynamic range (LDR) frames together. Instead of jointly tracking and compensating exposure like previous work [5], which is not as robust and reliable, as shown in our tests, we propose to track normalized radiance, since it is a function which depends on luminance only. Radiance is the amount of luminance captured during the period of the exposure time. Another advantage of tracking normalized radiance is that the normalization operation can be efficiently implemented using down-sampled integral images.

Exposure compensation is therefore decoupled from tracking, which greatly improves its accuracy as well. Finally, both the tracking and the radiance fusion benefit from confidence maps derived from the sensor noise level function, which adaptively weighs the radiance map at the pixel level. Overall, the proposed HDRFusion achieves a high quality radiance map and enables a better visualization experience using TMOs. We demonstrate the improvements in both qualitative and quantitative experiments.
We now give an overview of HDRFusion. The main algorithm is shown in the flow chart in Fig. 2. The inputs are RGB-D frames from an Xtion sensor. Firstly, we estimate the inverse camera response function and the noise level function for radiometric calibration (Section 4). The RGB frames are then converted into radiance maps with estimated pixel-wise confidence. The camera poses are tracked by aligning the incoming frame with the prediction coming from the 3-D model built so far, i.e. registering the live normalized radiance with the predicted normalized radiance. The predicted normalized radiance is estimated by casting rays into the global volume. The confidence map is used to adaptively weigh the error functions for tracking, exposure compensation and radiance fusion. The ray casting module provides the predicted radiance, normalized radiance and depth. The predicted radiance and depth can be used for visualization through tone mapping on LDR devices, or output as HDR data.
There is a huge wealth of literature dealing with visual odometry and camera motion tracking. However, we focus only on the direct approaches which can track and reconstruct a dense and textured 3-D model in real time. Motion tracking using an active sensor [4,12] is independent of lighting but leaves surfaces un-textured. Approaches combining appearance and depth [2,13,3,7] for camera tracking are the most relevant. In all these approaches, it is assumed that the brightness of consecutive frames is constant, which is likely to fail when video flickering happens. In addition, [3] introduces a simple colour blending method but, as shown in our experiments, it is inadequate to deal with large exposure changes.

Kerl et al. [7] introduce a key-frame based approach that takes the rolling shutter effect into account. The approach relies on local brightness constancy when tracking live frames against a key frame; it is capable of producing sharp super-resolution frames but involves no exposure compensation.

Meilland et al. [5] propose one of the first works in real-time 3-D HDR texture capturing. We follow the same approach of transforming raw RGB into the radiance domain and tracking using radiance; their work mainly focuses on re-lighting virtual specular objects. The differences between [5] and this paper are as follows. First, in [5], a gamma function is adopted to approximate the inverse camera response function (CRF). The gamma function may introduce error when the radiance is high, and the resulting radiance is not directly proportional to scene luminance (Fig. 3). Second, in [5], the exposure is estimated jointly with the camera pose, but we find that the error function when tracking using exposure-compensated radiance has a shallow global minimum even when the exposure has been compensated for and, therefore, it is not as robust as the normalized-radiance based objective function we propose (Fig. 4). Lastly, mosaic artefacts are clearly visible in their synthetic HDR model, which indicates inadequate exposure estimation (Fig. 1(d)).

Normalized cross correlation (NCC) has been widely applied in visual tracking [14] to deal with challenging lighting conditions, but its computational cost
grows exponentially with the size of the patch. Small patches are sensitive to image noise and introduce many local minima (Fig. 4). In addition, 3-D HDR texture capturing is not addressed in that paper. HDR video capture using a high-end stereo rig [15,16] is also relevant to the topic, since it involves estimating the disparity between binocular views so that LDR frames captured by both cameras can be integrated into a single stream of HDR video [16], but high quality HDR video is the main focus of this group of approaches rather than a full 3-D model.
Starting from direct tracking using visual data under the brightness constancy assumption, camera poses can be estimated by minimizing the intensity difference between a reference frame and a live frame. The objective function $F$ can be formulated as:

$$F(\mathbf{R}, \mathbf{t}) = \int_{\Omega} \left\| I_r(\mathbf{u}) - I_l\big(\pi(\mathbf{R}\,\pi^{-1}(\mathbf{u}, D_r(\mathbf{u})) + \mathbf{t})\big) \right\| d\mathbf{u} \quad (1)$$

where $I: \Omega \to \mathbb{R}^+$ and $D: \Omega \to \mathbb{R}^+$ denote the intensity and depth functions. The whole 2-D image domain is denoted as $\Omega \subset \mathbb{R}^2$ and $\mathbf{u} \in \mathbb{R}^2$ is the pixel coordinate. Subscripts $r$ and $l$ denote the reference frame and the live frame respectively. $\mathbf{R} \in SO(3)$ and $\mathbf{t} \in \mathbb{R}^3$ are the rigid body motion that transforms a 3-D point defined in the reference coordinate system into the live coordinate system. $\pi: \mathbb{R}^3 \to \Omega$ and $\pi^{-1}: \Omega \times \mathbb{R}^+ \to \mathbb{R}^3$ are the projection function and its inverse: $\pi(\cdot)$ projects a 3-D point onto the image plane and $\pi^{-1}(\cdot)$ transforms a 2-D point back into 3-D given the depth $D$.

Equation 1 works as long as brightness constancy holds. We define the corresponding point in the live frame as $\mathbf{u}' = \pi(\mathbf{R}\,\pi^{-1}(\mathbf{u}, D_r(\mathbf{u})) + \mathbf{t})$ and $e(\mathbf{u}, \mathbf{u}') = I_r(\mathbf{u}) - I_l(\mathbf{u}')$. Equation 1 is then rewritten as $F(\mathbf{R}, \mathbf{t}) = \int_{\Omega} \| e(\mathbf{u}, \mathbf{u}') \| \, d\mathbf{u}$. NCC-based tracking can be viewed as an extension of Equation 1 obtained by replacing $e(\mathbf{u}, \mathbf{u}')$ with $\sqrt{1 - C(\mathbf{u}, \mathbf{u}')}$, where $C(\cdot)$ is the NCC score defined as follows:

$$C(\mathbf{u}, \mathbf{u}') = \frac{1}{|\Omega_N|} \int_{\Omega_N} \frac{\big(N_r(\mathbf{u}, \mathbf{v}) - \mu\big)\big(N_l(\mathbf{u}', \mathbf{v}) - \mu'\big)}{\sigma \sigma'} \, d\mathbf{v} \quad (2)$$

where $N: \Omega \times \Omega_N \to \mathbb{R}^+$ defines a small image patch, a neighbourhood centred at $\mathbf{u}$; $\Omega_N \subset \mathbb{R}^2$ is the domain of the neighbourhood $N$ and $\mathbf{v} \in \mathbb{R}^2$ is the coordinate w.r.t. $N$. $\mu$ and $\sigma$ are the mean and standard deviation of the image intensity over $N_r$, and $\mu'$ and $\sigma'$ are the mean and standard deviation over $N_l$. The NCC-based tracking can then be formulated as $F(\mathbf{R}, \mathbf{t}) = \int_{\Omega} \left\| 1 - C(\mathbf{u}, \mathbf{u}') \right\| d\mathbf{u}$.

The key observation we rely on in this paper is that the scene luminance is mostly constant and invariant to exposure settings. The idea is to replace $e(\cdot)$ with a new error function dependent on luminance only. The luminance $L$ is the radiance $R$ received at the camera sensor per unit time, $L = R / \Delta t$, where $\Delta t$ is the exposure time. The relation between luminance and image intensity $I$ can be described by the image formation model [17]:

$$I = f\big(R + n_s(R) + n_c\big) \quad (3)$$

where $f: \mathbb{R}^+ \to \mathbb{Z}^+$ is the camera response function (CRF) and $R = L\Delta t$. Essentially, it maps the radiance $R$ to an LDR intensity level $I$, which ranges from 0 to 255. $n_s$ accounts for the noise component dependent on the radiance and $n_c$ accounts for the constant noise. The statistics of the noise can be assumed as $E(n_s) = E(n_c) = 0$, $Var(n_s) = L\Delta t\,\sigma_s^2$ and $Var(n_c) = \sigma_c^2$.

Fig. 3: The CRF and PCF of the RGB camera on 3 Xtion sensors. The figures in the top row are the CRF and its derivative for the RGB channels respectively. From the figure, we can see that the gamma approximation of the CRF has a large error when the radiance is high. The figures in the bottom row are the PCF and the scaled standard deviation of the noise level captured at various exposure times.

The noise level function [18,19] measures how reliable the sensor response is at a given intensity level. For convenience, we convert it to a probability function by scaling the noise level function using a scalar $m$, where $m$ is the maximum standard deviation over the 3 colour channels.
$$p(I) = \frac{1}{m} \left. \frac{\partial f(r)}{\partial r} \right|_{r=R} \sqrt{R\,\sigma_s^2 + \sigma_c^2} \quad (4)$$

where $p: \mathbb{Z}^+ \to (0, 1]$ and $R = f^{-1}(I)$ is the radiance. In the right column of Fig. 5, the probability maps are shown. Each channel represents the probability of that channel at the pixel location: dark areas show the low-probability pixels, which usually occur around over-exposed or under-exposed parts of the image. We can also define variants of this probability function based on Equation 4: $p_0(I) = 1$, $p_1(I) = \sqrt{p(I)}$, $p_2(I) = p(I)$ and $p_3(I) = p(I)^2$. For clarity, this family of probability functions is referred to as pixel confidence functions (PCFs) from now on, since they are different from noise level functions. Their effects will be discussed in Section 5.
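To make the radiometric conversion concrete, the following is a minimal numpy sketch of converting an LDR frame into radiance and luminance via a tabulated inverse CRF and of evaluating a pixel confidence in the spirit of Equation 4. The smoothstep-shaped CRF and the noise variances `sigma_s2`, `sigma_c2` are illustrative placeholders, not the calibrated Xtion responses of Fig. 3, and all function and variable names are ours rather than part of the released implementation.

```python
import numpy as np

# Placeholder S-shaped CRF f(r) = 255 * smoothstep(r); the paper calibrates the
# CRF per sensor [20], so this curve is only a stand-in for illustration.
r_dense = np.linspace(0.0, 1.0, 4096)
f_dense = 255.0 * (3 * r_dense**2 - 2 * r_dense**3)    # monotone map: radiance -> intensity
df_dr_dense = 255.0 * (6 * r_dense - 6 * r_dense**2)   # analytic derivative of the CRF

levels = np.arange(256)
inv_crf = np.interp(levels, f_dense, r_dense)          # tabulated f^{-1}: intensity -> radiance
df_dr = np.interp(inv_crf, r_dense, df_dr_dense)       # df/dr evaluated at r = f^{-1}(I)

sigma_s2, sigma_c2 = 1e-4, 1e-6                        # assumed noise variances, not calibrated values

# Pixel confidence (Eq. 4): noise-aware reliability per intensity level,
# scaled by its maximum so that values lie in [0, 1].
pcf = df_dr * np.sqrt(inv_crf * sigma_s2 + sigma_c2)
pcf /= pcf.max()

def ldr_to_radiance(img_u8, exposure_time):
    """Convert an 8-bit LDR frame to radiance and luminance with per-pixel confidence."""
    radiance = inv_crf[img_u8]             # R = f^{-1}(I)
    luminance = radiance / exposure_time   # L = R / dt
    confidence = pcf[img_u8]               # low near over- and under-exposed pixels
    return radiance, luminance, confidence

# Example: a random frame "captured" with a 10 ms exposure.
frame = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)
R, L, w = ldr_to_radiance(frame, exposure_time=0.010)
```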
Fig. 4: Comparison of tracking errors. The family of error functions using normalized radiance (red) gives the most ideal global minimum at the ground truth. The NCC-based error function [14] also presents a strong convex shape but has many local minima. Tracking using exposure-compensated radiance [5] looks better than tracking raw RGB, but its global minimum is shallow even when the exposure is compensated with high accuracy. The left plot uses a real flickering pair and the right uses a simulated flickering pair based on [19].

The CRF and PCF depend on the specific type of camera sensor. They can be pre-calibrated before running HDRFusion [20,21,22]. Specifically, our CRF is estimated by placing the RGB-D sensor at a fixed position. A sequence of images at different exposure times is captured [20], and the noise level function and PCF are estimated using [19]. The CRF, its derivative and the PCF are shown in Fig. 3. With the estimated CRF, its inverse $f^{-1}(\cdot)$ and the PCF can be calculated straightforwardly: the inverse CRF allows us to convert intensity to radiance efficiently, and the PCFs allow us to weigh the error terms appropriately in the tracking, exposure compensation and fusion stages.

We now derive a novel error function dependent on luminance alone. The normalization of the radiance in a patch of neighbourhood $N$, centred at pixel location $\mathbf{u}$, can be formulated as follows:

$$\bar{R}_N(\mathbf{u}) = \frac{R_N(\mathbf{u}) - E(R_N)}{\sqrt{Var(R_N)}} = \frac{L_N(\mathbf{u})\,\Delta t - E(L_N \Delta t)}{\sqrt{Var(L_N \Delta t)}} = \frac{L(\mathbf{u}) - E(L_N)}{\sqrt{Var(L_N)}} \quad (5)$$

where $R_N: \Omega_N \to \mathbb{R}^+$ and $L_N: \Omega_N \to \mathbb{R}^+$ denote the radiance map and luminance map in $N$, respectively. From the above equation, it can be seen that $\bar{R}_N(\mathbf{u})$ is independent of the exposure $\Delta t$. This value is also invariant to viewpoint, due to the fact that the luminance distribution in the local region corresponding to $N$ is roughly constant with respect to viewing position, as long as the surface is Lambertian. Fig. 5 shows the mean, standard deviation, normalized radiance and confidence map of two consecutive frames captured at different exposure times. It can be seen that the normalized radiance maps extracted from frames captured at different exposures are strikingly similar, while the mean and standard deviation maps are smooth and blurry, which indicates good resistance to viewpoint changes.

Fig. 5: Radiance normalization. From left to right, the figures correspond to raw RGB, mean, standard deviation, normalized radiance, and confidence map. The 1st and 2nd rows correspond to 2 consecutive frames when flickering happens. Although the image brightness changes significantly, the normalized radiance maps are very similar thanks to Equation 5. The mean, standard deviation and normalized radiance maps are tone mapped from HDR.

Therefore, the new error function can be defined as:

$$e'(\mathbf{u}, \mathbf{u}') = \big(\bar{R}_r(\mathbf{u}) - \bar{R}_l(\mathbf{u}')\big)\, p\big(I_l(\mathbf{u}')\big) \quad (6)$$

where the probability $p(I_l(\mathbf{u}'))$ serves as a dynamic weight to balance the noise introduced during image formation, such that less reliable pixels are assigned a smaller weight. $p(\cdot)$ can be chosen from the family of PCFs defined before: $p(\cdot) \in \{p_0(\cdot), p_1(\cdot), p_2(\cdot), p_3(\cdot)\}$.
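As a sketch of how the normalization in Equation 5 can be computed from local statistics, the snippet below uses scipy's uniform (box) filter as a stand-in for the down-sampled integral images used in the real-time implementation; the patch size and epsilon are illustrative choices, and the tiny example at the end only demonstrates the exposure invariance, not the full tracker.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def normalized_radiance(radiance, patch=15, eps=1e-12):
    """Local normalization of Eq. 5: subtract the patch mean and divide by the
    patch standard deviation over an N x N neighbourhood (box filter here;
    the paper uses down-sampled integral images)."""
    mean = uniform_filter(radiance, size=patch)
    sq_mean = uniform_filter(radiance * radiance, size=patch)
    var = np.maximum(sq_mean - mean * mean, 0.0)
    return (radiance - mean) / np.sqrt(var + eps)

def weighted_error(nr_ref, nr_live, conf_live):
    """Per-pixel error of Eq. 6, weighted by the live-frame pixel confidence."""
    return (nr_ref - nr_live) * conf_live

# Because R = L * dt, the exposure dt cancels in Eq. 5: the same scene rendered
# with a 4 ms and a 32 ms exposure gives (numerically) identical normalized maps.
rng = np.random.default_rng(0)
luminance = rng.random((120, 160))
nr_short = normalized_radiance(luminance * 0.004)
nr_long = normalized_radiance(luminance * 0.032)
print(np.abs(nr_short - nr_long).max())   # ~0, i.e. equal up to numerical error
```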
The error functions using NCC, raw intensity, radiance with exposure compensated and the proposed normalized radiance are compared by plotting them against the ground truth along the x-axis in Fig. 4. Pairs of flickering consecutive frames are chosen, where one pair is real and the other is synthetic. It can be seen that our proposed error function using normalized radiance, weighted by the square-root PCF $p_1(\cdot)$, gives the most ideal error space for optimization. The camera poses can then be solved by optimizing the error functions using the forward compositional approach described in [3].

Once the camera pose is estimated, the exposure is compensated using the following equation:

$$t = \frac{1}{|\Omega|} \int_{\Omega} p_l(\mathbf{u})\, \frac{R_r(\mathbf{u})}{R_l(\mathbf{u}')}\, d\mathbf{u} \quad (7)$$

where $p_l(\cdot)$ is the PCF of the live frame. After $t$ is estimated, the radiance map of the live frame is scaled by $t$.
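A possible implementation of this exposure compensation step is sketched below: a confidence-weighted average of the ratio between the predicted (reference) radiance and the live radiance, in the spirit of Equation 7. The normalization by the total confidence and the minimum-radiance cutoff are our choices for numerical robustness, not details taken from the paper.

```python
import numpy as np

def exposure_scale(rad_ref, rad_live, conf_live, min_rad=1e-4):
    """Relative exposure t between the model prediction (reference) and the live
    frame, as a confidence-weighted mean of radiance ratios (cf. Eq. 7).
    Pixels with negligible live radiance are skipped."""
    valid = rad_live > min_rad
    ratio = rad_ref[valid] / rad_live[valid]
    w = conf_live[valid]
    return float(np.sum(w * ratio) / np.sum(w))

# Toy check: a live frame captured with half the reference exposure should give t ~ 2.
rng = np.random.default_rng(1)
rad_pred = rng.random((480, 640)) + 0.1
rad_live = rad_pred / 2.0
conf = np.ones_like(rad_live)
t = exposure_scale(rad_pred, rad_live, conf)
rad_live_compensated = t * rad_live       # radiance map rescaled before fusion
```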
The exposure-compensated radiance map $tR_l$ is then fused into a global volume using a fast parallel approach similar to [3]. The volumetric data structure stores not only the truncated signed distance function (TSDF) and its weights, but also the 3 channels of radiance, the normalized radiance and the radiance weights. The normalized radiance is also fused into the global volume so that a synthetic normalized radiance map can be efficiently extracted using ray casting. Note that the radiance weight is different from the TSDF weight. The fusion of radiance and depth for each voxel is given by the following equations:

$$F = \frac{w_F \cdot F + w'_F \cdot F'}{w_F + w'_F} \quad (8)$$
$$R = \frac{w_R \cdot R + w'_R \cdot R'}{w_R + w'_R} \quad (9)$$
$$\bar{R} = \frac{w_R \cdot \bar{R} + w'_R \cdot \bar{R}'}{w_R + w'_R} \quad (10)$$
$$w_F = w_F + w'_F \quad (11)$$
$$w_R = w_R + w'_R \quad (12)$$

where $F$ and $R$ are the TSDF value and radiance in the global volume, and $F'$ and $R'$ are those from the live frame. Similarly, $w_F$ and $w_R$ are the global weights, and $w'_F$ and $w'_R$ are the weights from the live frame. $w'_F = |\mathbf{n}^T \mathbf{v}|$ is the absolute cosine of the angle between the surface normal $\mathbf{n}$ and the viewing direction $\mathbf{v}$ at the live pixel location, where $\mathbf{n}$ and $\mathbf{v}$ are unit vectors. It down-weights TSDF values captured at a high angle between the normal and the viewing direction; its effect is illustrated in Fig. 6. $w'_R = p_r + p_g + p_b$, where $p_r$, $p_g$ and $p_b$ are the PCF values of the 3 colour channels respectively. In our experiments, we find that storing the individual PCFs of the 3 colour channels in the global volume is unnecessary and may introduce colour distortion as well.

Fig. 6: Weighting TSDFs according to the angle between the viewing direction and the surface normal improves the geometry quality around thin and corner structures.

To ensure the quality of the radiance, only pixels whose maximum PCF is above a threshold and whose angle between the surface normal and viewing direction is above a threshold are allowed to be fused into the volume:

$$\{ R \mid \max(p_r, p_g, p_b) > \tau_1 \,\wedge\, \mathbf{n}^T \mathbf{v} < \tau_2 \} \quad (13)$$
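The per-voxel updates of Equations 8 to 12 are simple running weighted averages; the sketch below applies them to a batch of voxels, with the live-frame weights computed as described above ($w'_F = |\mathbf{n}^T\mathbf{v}|$, $w'_R = p_r + p_g + p_b$). The TSDF computation, ray casting and voxel addressing are omitted, and the toy values at the end are only meant to exercise the update.

```python
import numpy as np

def fuse_voxels(F, w_F, R, R_bar, w_R, F_new, R_new, R_bar_new, w_F_new, w_R_new):
    """Running weighted averages of Eqs. 8-12 for the voxels touched by the live
    frame; all arguments are same-shape float arrays and the global state
    (F, w_F, R, R_bar, w_R) is updated in place."""
    F[:] = (w_F * F + w_F_new * F_new) / (w_F + w_F_new)                 # Eq. 8, TSDF
    R[:] = (w_R * R + w_R_new * R_new) / (w_R + w_R_new)                 # Eq. 9, radiance
    R_bar[:] = (w_R * R_bar + w_R_new * R_bar_new) / (w_R + w_R_new)     # Eq. 10, normalized radiance
    w_F[:] += w_F_new                                                     # Eq. 11
    w_R[:] += w_R_new                                                     # Eq. 12

# Live-frame weights for a single voxel: |n.v| down-weights grazing views and
# the radiance weight sums the colour-channel confidences.
n = np.array([0.0, 0.0, 1.0])
v = np.array([0.2, 0.0, 1.0]); v /= np.linalg.norm(v)
w_F_live = np.array([abs(float(n @ v))])
w_R_live = np.array([0.9 + 0.8 + 0.7])

F, w_F = np.array([0.3]), np.array([1.0])
R, R_bar, w_R = np.array([0.5]), np.array([0.1]), np.array([2.0])
fuse_voxels(F, w_F, R, R_bar, w_R,
            F_new=np.array([0.2]), R_new=np.array([0.6]),
            R_bar_new=np.array([0.2]), w_F_new=w_F_live, w_R_new=w_R_live)
```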
In all experiments, we have used 3 Xtion RGB-D sensors whose exposure can be specified. Their calibrated CRFs are plotted in Fig. 3. Apart from the CRF and the noise level function, no other parameters need to be calibrated. The camera intrinsics are set to the default values from the OpenNI library. A C++ implementation and testing data for both the main HDRFusion and its calibration are available online (https://lishuda.wordpress.com). The code was tested on two commodity systems: PC0, equipped with an NVIDIA GTX 680 GPU, and PC1, with an NVIDIA GTX Titan Black GPU. Both PCs are hosted by an i7 quad-core CPU. The volume resolution is set to 256 for PC0 and 480 for PC1, with the volume size ranging from 2 to 3 m according to the size of the scene. The frame resolution is set to QVGA for PC0 and VGA for PC1. Both of them operate at about 10 Hz. We present a qualitative comparison with [3] and demonstrate the quality of the recovered HDR radiance map in an accompanying video: https://youtu.be/ehwiFkmFQ7Q.

We first use the synthetic ICL dataset to evaluate our approach [19]. High quality CG HDR frames and ground truth camera poses are available. First, photorealistic LDR RGB frames are simulated using the real CRF and noise level function of a randomly chosen Xtion sensor. We generate two sequences of video to simulate video flickering and smooth AE behaviour. The flickering sequence is simulated by randomly choosing the exposure time from the set 3, ..., 96 ms. The second sequence is generated using the equation $\Delta t = C / L$, where $C$ is a constant and $L$ is the average HDR intensity of the 10 by 10 patch in the centre of the original HDR frames. The exposures simulated in the second way change smoothly. Kinect-like depth noise is also added using the approach from [23].
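A hedged sketch of this LDR simulation is given below: the HDR luminance is scaled by an exposure time, radiance-dependent and constant noise are added following Equation 3, and the result is passed through a CRF. The smoothstep CRF, the noise magnitudes and the constant C in the smooth auto-exposure rule are illustrative stand-ins rather than the calibrated values used for the actual sequences.

```python
import numpy as np

def crf(radiance):
    """Placeholder S-shaped CRF mapping radiance in [0, 1] to 8-bit intensity
    (the real sequences use a calibrated Xtion CRF)."""
    r = np.clip(radiance, 0.0, 1.0)
    return np.round(255.0 * (3 * r**2 - 2 * r**3)).astype(np.uint8)

def simulate_ldr(hdr_luminance, dt, sigma_s=1e-2, sigma_c=1e-3, rng=None):
    """Render one LDR frame from HDR luminance at exposure dt, following Eq. 3:
    I = f(L*dt + n_s + n_c) with radiance-dependent and constant noise."""
    rng = rng if rng is not None else np.random.default_rng()
    radiance = hdr_luminance * dt
    n_s = rng.normal(0.0, 1.0, radiance.shape) * np.sqrt(radiance) * sigma_s
    n_c = rng.normal(0.0, sigma_c, radiance.shape)
    return crf(radiance + n_s + n_c)

def smooth_auto_exposure(hdr_luminance, C=0.5):
    """Smooth AE rule dt = C / L using the mean of the central 10x10 patch;
    the constant C here is illustrative."""
    h, w = hdr_luminance.shape[:2]
    centre = hdr_luminance[h//2 - 5:h//2 + 5, w//2 - 5:w//2 + 5]
    return C / centre.mean()

rng = np.random.default_rng(0)
hdr = rng.random((480, 640)) * 20.0                        # synthetic HDR luminance map
flicker_dt = rng.choice([0.003, 0.012, 0.048, 0.096])      # random exposure -> flickering
frame_flicker = simulate_ldr(hdr, flicker_dt, rng=rng)
frame_smooth = simulate_ldr(hdr, smooth_auto_exposure(hdr), rng=rng)
```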
Typical flickering pairs are illustrated in Fig. 4. The tracking approach using normalized intensity, an NCC objective function based on [14], and an approach similar to the tracking of [5] are used as baselines. For fairness, the ICP-based frame-to-model tracking is disabled for all of the above methods. The tracking accuracy in terms of rotational and translational error is plotted in Fig. 7.

We also performed a qualitative comparison using real data between the proposed tracking and tracking using the approach from [3]. Two sequences of RGB-D video with flickering are captured. In these sequences, the sensor overlooks a floor and a white board respectively. As the camera moves from dark to bright areas, video flickering occurs.
Fig. 7: Tracking synthetic sequences. The left two figures show the rotational and translational error on the synthetic flickering sequence; the right two figures use the synthetic smooth AE sequence. In the flickering sequence, the raw RGB based tracking quickly gets lost, while NCC, the proposed frame-to-frame tracking using normalized radiance (NR) and the frame-to-model tracking using normalized radiance keep working well. The NR tracking in frame-to-model mode gives the best performance in the flickering sequence. Due to the rich geometric variance, the ICP-based frame-to-model tracking gives the best results. In the smooth sequences, NCC and ICP perform better, but the proposed tracking remains reasonably accurate. The frame-to-model tracking stays within 3 cm over the 1000-frame test sequence.

[3] fails to track when flickering happens, while the proposed method continues tracking effectively. The floor and white board reconstructed using the proposed approach are shown in Fig. 8. The tracking comparison between our approach and [3] is also available in the accompanying video.

Fig. 8: Tracking under flickering using real data. The blue curves are the camera trajectories. The frustums in the left figure show the camera pose.
The HDR radiance maps are shown both in the screenshots included in the paper and in the accompanying video. In Fig. 1, 9, 10 and 11, we run the proposed HDRFusion in three scenes, namely 'Bear', 'Desk' and 'Sofa'. The Bear sequence is illuminated by indirect sunlight. The Desk sequence is illuminated by fluorescent lighting. The Sofa sequence is illuminated by both fluorescent lighting and a Dedolight-400D metal halide lamp. In Fig. 9, the HDR scene textures are compared with the ground truth. The ground truth is captured using a Canon 5D Mark II SLR camera: three LDR images of the scene at exposures 2 f-stops apart were captured and then merged to form an HDR image. Both are rendered using the tone mapping operator (TMO) [8].

Fig. 9: The ground truth HDR radiance (left) and the HDR radiance generated using HDRFusion are rendered using [8], where the colour saturation is set to 1. We can see that the estimated HDR texture closely matches the HDR radiance captured using the high-end SLR camera.
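For reference, merging bracketed exposures into an HDR image can be sketched as a confidence-weighted average in the luminance domain, for example using the placeholder inverse CRF and PCF tables from the earlier sketch. This is a standard merge, not necessarily the exact procedure used to produce the ground truth in Fig. 9, and the exposure times in the comment are only an example of 2-f-stop spacing.

```python
import numpy as np

def merge_exposures(frames_u8, exposure_times, inv_crf, pcf, eps=1e-8):
    """Fuse bracketed LDR frames into one HDR luminance map: each frame is
    converted to luminance via the inverse CRF and its exposure time, then
    averaged with the per-pixel confidence as weight."""
    num = np.zeros(np.asarray(frames_u8[0]).shape, dtype=np.float64)
    den = np.zeros_like(num)
    for img, dt in zip(frames_u8, exposure_times):
        w = pcf[img]                    # low weight for over/under-exposed pixels
        num += w * inv_crf[img] / dt    # confidence-weighted luminance estimate
        den += w
    return num / (den + eps)

# Usage with three frames roughly 2 f-stops apart (exposure x4 per step), e.g.:
# hdr = merge_exposures([im_dark, im_mid, im_bright], [0.005, 0.02, 0.08], inv_crf, pcf)
```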
In this paper, we have proposed a novel HDRFusion system capable of capturing high quality HDR scene texture using a low-cost RGB-D sensor. Tracking normalized radiance allows the tracking to be decoupled from exposure compensation, which improves the accuracy of both. Tracking normalized radiance is also shown to be robust to video flickering due to camera AE adjustment. The tracking runs in frame-to-model mode, which accumulates less drift. In future work, calibrating the CRF online will be investigated, as in some sensors the exposure
time cannot be changed by the user. Another limitation of the system lies in its large memory footprint: storing both the normalized radiance and the radiance seems unnecessary, and reducing the memory cost by combining the two will also be investigated.

Fig. 10: Sofa. The LDR frames generated using [3] are shown in the first row, and the HDR frames produced by HDRFusion are shown in the second and third rows. The second row is generated using [8], where the colour saturation is set to 1. Compared with raw RGB fusion [3], the dynamic range of the radiance texture is much higher and the details in dark areas are well preserved. The third row is generated using [24], where the colour saturation is set to 1.25; [24] visualizes the rich details captured by HDRFusion. The bottom row shows the recovered surface geometry.

Fig. 11: Desk. The LDR frames generated using [3] are shown in the first row, and the HDR frames produced by HDRFusion are shown in the second row. The HDR radiance is rendered using [8], where the colour saturation is set to 1.5. The luminance under the desk is very low but is well preserved in the HDR radiance map.
References
1. Newcombe, R.A., Lovegrove, S.J., Davison, A.J.: DTAM: Dense Tracking and Mapping in Real-Time. In: Intl. Conf. on Computer Vision (ICCV). (2011)
2. Kerl, C., Sturm, J., Cremers, D.: Robust odometry estimation for RGB-D cameras. In: IEEE Intl. Conf. on Robotics and Automation (ICRA). (2013) 3748-3754
3. Whelan, T., Kaess, M., Johannsson, H., Fallon, M., Leonard, J.J., McDonald, J.: Real-time large-scale dense RGB-D SLAM with volumetric fusion. Intl. Journal on Robotics Research (IJRR) (4-5) (2015) 598-626
4. Newcombe, R.A., Molyneaux, D., Kim, D., Davison, A.J., Shotton, J., Hodges, S., Fitzgibbon, A.: KinectFusion: Real-Time Dense Surface Mapping and Tracking. In: IEEE/ACM Intl. Symposium on Mixed and Augmented Reality (ISMAR). (2011)
5. Meilland, M., Barat, C., Comport, A.: 3D High Dynamic Range dense visual SLAM and its application to real-time object re-lighting. In: IEEE/ACM Intl. Symposium on Mixed and Augmented Reality (ISMAR). (2013) 143-152
6. Whelan, T., Leutenegger, S., Salas-Moreno, R.F., Glocker, B., Davison, A.J.: ElasticFusion: Dense SLAM Without A Pose Graph. Robotics: Science and Systems (RSS) (2015)
7. Kerl, C., Cremers, D.: Dense Continuous-Time Tracking and Mapping with Rolling Shutter RGB-D Cameras. In: Intl. Conf. on Computer Vision (ICCV). (2015)
8. Mantiuk, R., Daly, S., Kerofsky, L.: Display adaptive tone mapping. ACM Trans. on Graphics (ToG) (3) (2008)
9. Brown, M., Lowe, D.G.: Automatic panoramic image stitching using invariant features. Intl. Journal of Computer Vision (IJCV) (1) (2007) 59-73
10. Aydin, T., Stefanoski, N., Croci, S.: Temporally coherent local tone mapping of HDR video. ACM Trans. on Graphics (ToG) (6) (2014)
11. Farbman, Z., Lischinski, D.: Tonal stabilization of video. ACM Trans. on Graphics (ToG) (4) (2011)
12. Serafin, J., Grisetti, G.: NICP: Dense Normal Based Point Cloud Registration. In: Intl. Conf. on Intelligent Robot Systems (IROS). (2015)
13. Whelan, T., Johannsson, H., Kaess, M., Leonard, J.J., McDonald, J.: Robust real-time visual odometry for dense RGB-D mapping. In: IEEE Intl. Conf. on Robotics and Automation (ICRA). (2013) 5724-5731
14. Scandaroli, G.G., Meilland, M., Richa, R.: Improving NCC-based direct visual tracking. In: European Conf. on Computer Vision (ECCV). (2012) 442-455
15. Heo, Y.S., Lee, K.M., Lee, S.U.: Robust stereo matching using adaptive normalized cross-correlation. IEEE Trans. Pattern Anal. Machine Intell. (PAMI) (4) (2011) 807-822
16. Batz, M., Richter, T., Garbas, J.U., Papst, A., Seiler, J., Kaup, A.: High dynamic range video reconstruction from a stereo camera setup. Signal Processing: Image Communication (2) (2014) 191-202
17. Hasinoff, S.W., Durand, F., Freeman, W.T.: Noise-optimal capture for high dynamic range photography. In: IEEE Intl. Conf. on Computer Vision and Pattern Recognition (CVPR). (2010) 553-560
18. Liu, C., Szeliski, R., Kang, S.B., Zitnick, C.L., Freeman, W.T.: Automatic estimation and removal of noise from a single image. IEEE Trans. Pattern Anal. Machine Intell. (PAMI) (2) (2008) 299-314