Defogging Kinect: Simultaneous Estimation of Object Region and Depth in Foggy Scenes
Yuki Fujimura · Motoharu Sonogashira · Masaaki Iiyama
Abstract
Three-dimensional (3D) reconstruction and scene depth estimation from two-dimensional (2D) images are major tasks in computer vision. However, conventional 3D reconstruction techniques become challenging in participating media such as murky water, fog, or smoke. We have developed a method that uses a time-of-flight (ToF) camera to simultaneously estimate an object region and depth in participating media. The scattering component saturates at a certain distance, so it does not depend on the scene depth, and the received signal bouncing off a distant point is negligible due to light attenuation in the participating media, so the observation of such a point contains only a scattering component. These phenomena enable us to estimate the scattering component in an object region from a background that contains only the scattering component. The problem is formulated as robust estimation in which the object region is regarded as outliers, which enables the simultaneous estimation of the object region and depth on the basis of an iteratively reweighted least squares (IRLS) optimization scheme. We demonstrate the effectiveness of the proposed method using images captured with a Kinect v2 in real foggy scenes and evaluate its applicability with synthesized data.
Yuki Fujimura
Graduate School of Informatics, Kyoto University, Japan
E-mail: [email protected]

Motoharu Sonogashira
Academic Center for Computing and Media Studies, Kyoto University, Japan
E-mail: [email protected]

Masaaki Iiyama
Academic Center for Computing and Media Studies, Kyoto University, Japan
E-mail: [email protected]
Keywords time-of-flight · depth estimation · participating media · light scattering · iteratively reweighted least squares

1 Introduction

Three-dimensional (3D) reconstruction and scene depth estimation from two-dimensional (2D) images are important tasks in computer vision. These techniques can be utilized for a variety of applications such as robot vision and self-driving vehicles. However, conventional 3D reconstruction techniques become challenging in participating media such as murky water, fog, or smoke. As shown at the bottom left of Fig. 1, the contrast of an image captured in participating media is reduced because light propagating through the media is attenuated, and the received signal contains not only reflected light from the object surface but also scattered light due to suspended particles. These effects make it difficult to use conventional 3D reconstruction techniques such as structure from motion or shape from shading.

In response to this issue, we have developed a method that enables us to acquire 3D geometry in participating media. Specifically, we use a time-of-flight (ToF) camera to detect an object and estimate its distance in participating media (see Fig. 2). The proposed method enables simultaneous automatic obstacle detection and depth reconstruction in challenging environments such as disaster sites where light is scattered.

There are several architectures for ToF cameras. We use the Microsoft Kinect v2, a continuous-wave ToF camera that emits an amplitude-modulated sinusoidal signal into a scene and then measures the amplitude of the light that bounces off an object surface and the phase shift between the illumination and the received signal.
Fig. 1 Depth error due to light scattering. Depth measurement suffers from scattering effects in participating media such as a foggy scene (panels: RGB and depth from ToF, each without fog and with fog).

These observations are represented as an amplitude image and a phase image, as shown in Fig. 2. Since the phase shift depends on the optical path, we can reconstruct the depth from the phase shift. In this study, we denote the observation of an object surface by a direct component.

This architecture assumes that each camera pixel observes a single point in a scene. As mentioned previously, however, the observed signal in participating media includes a scattering component due to light scattering as well as a direct component. The amplitude and phase shift suffer from the scattering effect, and this causes errors in depth measurement (see Fig. 1).

We aim to recover the direct component, which is an ill-posed problem because the two components must be separated. To deal with this problem, we leverage the saturation of the scattering component and light attenuation in participating media. Given a near light source in participating media, the scattering component saturates close to the camera (Treibitz and Schechner, 2009; Tsiotsios et al., 2014). This means the scattering component does not depend on an object in the scene if the object is located at a certain distance from the camera. Moreover, the intensity of light propagating through participating media is attenuated exponentially with distance. Thus, reflected light from a distant point is negligible, and the observation of such a point includes only a scattering component. In this paper, we assume that a target scene consists of an object region and a background that contains only a scattering component. The scattering component can then be estimated simply by observing the background. In addition to the above assumption, we introduce two priors to estimate the scattering component: first, a scattering component can be approximated using a quadratic function in a local image patch (local quadratic prior), and second, a scattering component has a symmetrical characteristic in the overall image (global symmetrical prior).

The estimation of the scattering component is formulated as robust estimation, where the object region is regarded as outliers because it contains a direct component. We propose an optimization scheme based on iteratively reweighted least squares (IRLS) (Holland and Welsch, 1977; Fox and Weisberg, 2002; Chartrand and Yin, 2008; Wipf and Nagarajan, 2010), which minimizes weighted least squares iteratively. The object region can then be extracted via the IRLS weights as outliers.

In section 2 of this paper, we briefly overview previous studies on computer vision applications in participating media. In section 3, an image formation model in participating media for ToF measurement is introduced. In section 4, we describe the proposed method for the simultaneous estimation of the object region and depth, and in section 5, we present the experimental results. We conclude the paper in section 6 with a brief summary and mention of future work.
Fig. 2
Overview of the proposed method. A continuous-wave ToF camera captures an amplitude image and a phase image. From these images captured in participating media, we estimate the object region and recover the depth simultaneously.

models were built in participating media for photometric stereo applications. Other studies have been based on structured light (Narasimhan et al., 2005; Gu et al., 2013), light fields (Tian et al., 2017), and the absorption of infrared light in underwater scenes (Asano et al., 2016).

2.3 Multipath interference of ToF

A ToF camera assumes that each camera pixel observes a single point in a scene. In participating media, however, the measurement also includes scattered light. This problem is known as multipath interference (MPI). MPI is caused not just by light scattering in participating media but also by subsurface scattering or interreflection in common scenes. Thus, many previous studies have tackled MPI compensation (Fuchs, 2010; Freedman et al., 2014; Naik et al., 2015; Kadambi et al., 2016; Guo et al., 2018).

In this paper, we limit our focus to MPI caused by light scattering in participating media. ToF measurement in participating media has been addressed by Heide et al. (2014) and Satat et al. (2018). Heide et al. (2014) developed a scattering model based on exponentially modified Gaussians for transient imaging using a photonic mixer device (PMD) (Heide et al., 2013). Satat et al. (2018) demonstrated that scattered photons observed with a single-photon avalanche diode (SPAD) follow a gamma distribution and leveraged this observation to separate the received photons into a directly reflected component and a scattering component. Our method differs from these approaches in that we just use an off-the-shelf ToF camera (Kinect v2) with no special hardware modification.
3 Image formation model

In this section, we describe our image formation model for a continuous-wave ToF camera in participating media. As with many previous studies (Narasimhan et al., 2005, 2006; Treibitz and Schechner, 2009; Tsiotsios et al., 2014), we assume here that forward scattering and multiple scattering are negligible and that the density of the participating medium in the scene is homogeneous.

A continuous-wave ToF camera illuminates a scene with amplitude-modulated light and then measures the amplitude α of the received signal and the phase shift ϕ between the illumination and the received signal. This observation can be described using a phasor (Gupta et al., 2015) as

α e^{jϕ} ∈ ℂ.   (1)

Since the phase shift is proportional to the depth of an object, we can compute the depth as

z = cϕ / (4πf),   (2)

where z is the depth, c is the speed of light, and f is the modulation frequency of the camera.
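As a point of reference, the phasor packing of Eq. (1) and the conversion from a phase image to depth in Eq. (2) can be written as a short NumPy sketch (a minimal illustration with function names of our choosing, not the authors' implementation):

```python
import numpy as np

C = 299_792_458.0  # speed of light [m/s]

def to_phasor(amplitude, phase):
    """Pack amplitude and phase images into the complex phasor of Eq. (1)."""
    return amplitude * np.exp(1j * phase)

def depth_from_phase(phase, mod_freq_hz=16e6):
    """Depth from the phase shift via Eq. (2): z = c * phi / (4 * pi * f).

    `phase` is a per-pixel phase image in radians; `mod_freq_hz` is the
    modulation frequency, e.g. 16e6 for the 16 MHz mode used later in the paper.
    """
    return C * phase / (4.0 * np.pi * mod_freq_hz)
```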
Fig. 3 Light scattering in participating media. Light interacts with the participating medium on the line of sight and then arrives at a camera pixel. The total scattering component is the sum of the scattered light on the red line in the figure, which depends on the limited beam angle of the light source.
In participating media, the observation contains scattered light. As shown in Fig. 3, light interacts with the medium on the line of sight and then arrives at the camera pixel. Thus, the observed scattering component is the sum of the scattered light on the line of sight. We now consider a 3D coordinate system whose origin is the camera center. When the camera observes a surface point p ∈ ℝ³ at a camera pixel (u, v), the total observation α̃(u, v; p) e^{jϕ̃(u, v; p)} can be written as

α̃(u, v; p) e^{jϕ̃(u, v; p)} = α_d(u, v; p) e^{jϕ_d(u, v; p)} + ∫_{‖x‖=‖x_0(u, v)‖}^{‖p‖} α(u, v; x) e^{jϕ(u, v; x)} d‖x‖,   (3)

where α_d(u, v; p) and ϕ_d(u, v; p) are the direct components and x_0(u, v) is the starting point of the integration on the line of sight. α_d(u, v; p) depends on the surface albedo, shading, and attenuation, which is caused by the medium as well as the inverse-square law. α(u, v; x) e^{jϕ(u, v; x)} is the observation of scattered light at a position x. Note that although the scattering component can be written using an integral, the domain of the integral (the red line in Fig. 3) depends on the relative position between the light source and the camera pixel. This is because an ideal point light source irradiates a scene with isotropic intensity, while practical illumination such as a spotlight has a limited beam angle (Tsiotsios et al., 2014).

Assuming a near light source in participating media, the observed scattering component saturates close to the camera (Treibitz and Schechner, 2009; Tsiotsios et al., 2014). That is, there exists x_saturate for which

‖x‖ ≥ ‖x_saturate‖ ⇒ α(u, v; x) = 0.   (4)

Therefore, we can rewrite Eq. (3) as

α̃(u, v; p) e^{jϕ̃(u, v; p)} = α_d(u, v; p) e^{jϕ_d(u, v; p)} + ∫_{‖x‖=‖x_0(u, v)‖}^{‖x_saturate‖} α(u, v; x) e^{jϕ(u, v; x)} d‖x‖,   (5)

where the integral defines the scattering components α_s(u, v) e^{jϕ_s(u, v)}, which depend only on the camera pixel (u, v) rather than on the object depth.

Although the observation consists of the direct component α_d(u, v; p) e^{jϕ_d(u, v; p)} and the scattering component α_s(u, v) e^{jϕ_s(u, v)}, the attenuation due to the medium reduces the direct component dramatically. Thus, if the camera observes a distant point p_far, the amplitude of the reflected light fades away, that is,

α_d(u, v; p_far) = 0.   (6)

Therefore, the observation of a distant point includes only a scattering component:

α̃(u, v; p_far) e^{jϕ̃(u, v; p_far)} = α_s(u, v) e^{jϕ_s(u, v)}.   (7)

Figure 4 shows the amplitude and phase images when the camera observes a black surface in a foggy scene. The intensity of reflected light from the black surface is very small, so this approximates a distant observation where only a scattering component can be observed. As discussed above, in both the amplitude and phase images, the scattering component is inhomogeneous because the illumination has a limited beam angle.

The measurement range of our method is between a saturation point and a background point that has no direct component. More details about the saturation of the scattering component and the measurement range are provided in section 5.4.

4 Proposed method

As explained in the previous section, the scattering component depends on the position of a camera pixel rather than on a target object.
In addition, only the scattering component is observed in the background, where any object is farther away. Thus, our goal is to estimate the scattering component in the object region from the observation of the background. After estimating the scattering components α_s(u, v) and ϕ_s(u, v) at each pixel, we compute the amplitude and phase shift of the direct component from Eq. (5):

α_d = √{(α̃ cos ϕ̃ − α_s cos ϕ_s)² + (α̃ sin ϕ̃ − α_s sin ϕ_s)²},   (8)

ϕ_d = arg((α̃ cos ϕ̃ − α_s cos ϕ_s) + j(α̃ sin ϕ̃ − α_s sin ϕ_s)),   (9)

where the operator arg returns the argument of a complex number. The depth is then recovered by substituting the phase into Eq. (2).
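Because Eqs. (8) and (9) are simply the magnitude and argument of the observed phasor minus the scattering phasor, the direct component can be recovered per pixel as in the following sketch (a minimal NumPy illustration; the function and argument names are ours):

```python
import numpy as np

def remove_scattering(amp_obs, phase_obs, amp_scat, phase_scat):
    """Recover the direct component by phasor subtraction, cf. Eqs. (8)-(9).

    All arguments are per-pixel images: the observed amplitude/phase and the
    estimated scattering amplitude/phase. Returns (alpha_d, phi_d).
    """
    direct = amp_obs * np.exp(1j * phase_obs) - amp_scat * np.exp(1j * phase_scat)
    return np.abs(direct), np.angle(direct)
```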
Fig. 4 Observation of a black surface in a foggy scene: (a) RGB image, (b) amplitude image, (c) phase image. The black surface approximates a distant observation where only a scattering component can be observed because reflected light from the scene gets attenuated. Note that the observed scattering component is inhomogeneous due to the limited beam angle of the illumination.

In this section, we describe how our method divides the camera pixels into an object region and a background and simultaneously estimates the scattering component in the object region. First, we introduce two priors to estimate the scattering component, and then the problem is formulated as robust estimation, which allows us to extract the object region as outliers. In the following, with a slight abuse of notation, we refer to both an amplitude image and a phase image as an image, since we process both images in the same manner.

4.1 Prior of scattering component

We can estimate the scattering component of an object region from the background because the component does not depend on the object. Tsiotsios et al. (2014) approximated backscatter as a quadratic function in a captured image. Similarly to their work, we introduce two priors, a local quadratic prior and a global symmetrical prior, that allow us to estimate the scattering component.
Fig. 5 Local quadratic prior. We assume that a scattering component can be represented with a quadratic function in a local image patch.
Local quadratic prior.
In our ToF setting, we found that a scattering component cannot be approximated globally with a simple function. Thus, as shown in Fig. 5, we assume that a scattering component can be represented with a quadratic function in a local image patch, that is,

x_k(u, v) = a_{k1}u² + a_{k2}uv + a_{k3}v² + a_{k4}u + a_{k5}v + a_{k6} = a_k^⊤ u,   (10)

where x_k(u, v) is the value at a pixel (u, v) in a local image patch k, u = [u², uv, v², u, v, 1]^⊤ is a 6-dimensional vector, and a_k = [a_{k1}, a_{k2}, a_{k3}, a_{k4}, a_{k5}, a_{k6}]^⊤ denotes the coefficients of the quadratic function in patch k.
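Fitting the local quadratic model of Eq. (10) to a patch is an ordinary least-squares problem, as used to initialize a_k in Algorithm 1 below; a sketch under the assumption of NumPy arrays (names are ours):

```python
import numpy as np

def fit_quadratic_patch(patch):
    """Least-squares fit of Eq. (10) to one image patch.

    `patch` is a 2-D array of amplitude (or phase) values. Returns the six
    coefficients a_k such that x_k(u, v) ~ a_k^T [u^2, uv, v^2, u, v, 1]^T.
    """
    h, w = patch.shape
    v, u = np.mgrid[0:h, 0:w]          # per-pixel coordinates inside the patch
    u, v = u.ravel().astype(float), v.ravel().astype(float)
    U = np.stack([u**2, u*v, v**2, u, v, np.ones_like(u)], axis=1)  # N_k x 6
    a_k, *_ = np.linalg.lstsq(U, patch.ravel(), rcond=None)
    return a_k
```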
Global symmetrical prior. However, this local prior is not useful when there is a large object region and the quadratic function is also fitted to the values in that region. To address this problem, we introduce a global prior on the scattering component.

As discussed in section 3, the scattering component depends on the relative position between a camera pixel and the light source. This is because the individual starting points of the integral in Eq. (3) differ from each other. Meanwhile, as shown in Fig. 6, we assume that the camera and light source are collocated on a line that is parallel to the horizontal axis of the image. Kinect v2 has this setting, and other devices can easily be built on the basis of this setting. In this case, the integral domain of a pixel is consistent with that of the symmetrical pixel with respect to the central axis of the image. Thus, the observed scattering component also has symmetry, and we leverage this symmetry as a global prior.
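The residual of the global symmetrical prior, ‖Fx − x‖², can be evaluated without building the N × N flip matrix explicitly. The sketch below is our own simplification: it reflects rows about a given symmetry axis and clips rows that have no counterpart, whereas the implementation described in section 5 excludes such rows instead.

```python
import numpy as np

def symmetry_residual(scatter_img, center_row=None):
    """Squared residual of the global symmetrical prior, || F x - x ||^2.

    `center_row` is the row index of the symmetry axis (aligned with the
    camera/light-source line); the default is the image center.
    """
    h = scatter_img.shape[0]
    if center_row is None:
        center_row = h // 2
    # reflect each row index about the axis, clipping at the image borders
    src = np.clip(2 * center_row - np.arange(h), 0, h - 1)
    flipped = scatter_img[src, :]      # equivalent to applying F
    return float(np.sum((flipped - scatter_img) ** 2))
```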
Fig. 6 Global symmetrical prior. When the camera and light source are collocated on a line that is parallel to the horizontal axis of the image, the observed scattering component has symmetry because the integral domain of a pixel is consistent with that of the symmetrical pixel with respect to the central axis of the image.

4.2 Robust estimation

The estimation of the scattering component is formulated as the following robust minimization:

min_{x, a_1, ···, a_K}  Σ_{i=1}^{N} ρ((x_i − x̃_i)/σ) + γ_1 Σ_{k=1}^{K} ‖U a_k − x_k‖² + γ_2 ‖F x − x‖² + γ_3 ‖∇x‖².   (11)

The first term of Eq. (11) is a data term, where x̃ = [x̃_1 ··· x̃_N]^⊤ and x = [x_1 ··· x_N]^⊤ are the captured image and the scattering component, respectively. N is the number of camera pixels, and σ is a scale parameter. We use a nonlinear differentiable function ρ(x) rather than the squared error x², which allows us to make the estimation robust against outliers. In this study, we simply use the residual of the observation and the scattering component as the data term, i.e., pixels that contain a direct component are regarded as outliers.

We use three additional regularization terms. The second term represents the local prior. K is the number of patches for local quadratic function fitting. U is an N_k × 6 matrix, where N_k is the number of pixels in patch k and each row of U is the vector u that corresponds to each pixel coordinate. In this study, these patches do not overlap each other. The third term represents the global prior, where F ∈ ℝ^{N×N} is a matrix that flips an image vertically. The last term is a smoothing term, where ∇ denotes a gradient operator. This smoothing accelerates the optimization. The hyperparameters γ_1, γ_2, γ_3 control the contribution of each term.

4.3 IRLS and object region estimation

We minimize Eq. (11) with respect to the scattering component x and the coefficients of the quadratic functions a_1, ···, a_K. However, the nonlinearity of ρ(x) makes it difficult to obtain a closed-form solution. For efficient computation, IRLS optimization was developed in the literature (Holland and Welsch, 1977; Fox and Weisberg, 2002). IRLS minimizes weighted least squares iteratively, and the weight is updated using the current estimate in each iteration. The objective function in Eq. (11) is transformed into weighted least squares as follows:

min_{x, a_1, ···, a_K}  (x − x̃)^⊤ W (x − x̃) + γ′_1 Σ_{k=1}^{K} ‖U a_k − x_k‖² + γ′_2 ‖F x − x‖² + γ′_3 ‖∇x‖²,   (12)

where W = diag(w) is an N × N matrix and w = [w_1, ···, w_N]^⊤ is the weight for each error x_i − x̃_i. The hyperparameters are given as γ′_* = 2σ²γ_*. Equation (12) is quadratic with respect to the scattering component x, and thus is easy to optimize. In each iteration, we solve Eq. (12) for x and a_1, ···, a_K, and the weight is updated using the current estimate as

w_i = ρ′((x_i − x̃_i)/σ) / ((x_i − x̃_i)/σ).   (13)

The specific update rule of the weight depends on the nonlinear function ρ(x). In this study, we use the following function as ρ(x):

ρ(x) = (c²/6)[1 − {1 − (x/c)²}³]   if |x| ≤ c,
ρ(x) = c²/6                        otherwise.   (14)
This function yields the following update:

w_i = {1 − (r_i/c)²}²   if |r_i| ≤ c,
w_i = 0                 otherwise,   (15)

where r_i = (x_i − x̃_i)/σ and c is a tuning parameter. This update is referred to as Tukey's biweight (Beaton and Tukey, 1974; Fox and Weisberg, 2002), where 0 ≤ w_i ≤ 1.
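To make the IRLS scheme concrete, the following toy sketch runs the loop of Eqs. (12), (13), and (15) on a 1-D signal with only the data term and the smoothing term (the priors of Eq. (11) are omitted); the constants, the 1-D setting, and all names are illustrative choices of ours rather than the paper's configuration:

```python
import numpy as np

def tukey_weights(residual, sigma, c=4.0):
    """Tukey's biweight of Eq. (15): w = (1 - (r/c)^2)^2 for |r| <= c, else 0."""
    r = residual / sigma
    return np.where(np.abs(r) <= c, (1.0 - (r / c) ** 2) ** 2, 0.0)

def irls_1d(x_obs, n_iter=20, gamma_smooth=10.0, c=4.0):
    """Estimate a smooth 'scattering' signal while down-weighting outliers.

    Each iteration solves a weighted least-squares problem in the spirit of
    Eq. (12), here (W + gamma * D^T D) x = W x_obs, with D a finite-difference
    operator (the smoothing term), and then refreshes the weights from the
    residuals. The scale sigma is fixed after the first iteration, as in
    Algorithm 1.
    """
    n = len(x_obs)
    D = np.diff(np.eye(n), axis=0)          # forward differences for ||grad x||^2
    x, w, sigma = x_obs.copy(), np.ones(n), None
    for _ in range(n_iter):
        A = np.diag(w) + gamma_smooth * D.T @ D
        x = np.linalg.solve(A, w * x_obs)
        r = x - x_obs
        if sigma is None:
            sigma = np.median(np.abs(r)) + 1e-12   # cf. Eq. (19)
        w = tukey_weights(r, sigma, c)
    return x, w   # small w flags outliers (the object region)
```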
Algorithm 1 Simultaneous estimation of scattering component and object region
Input: image x̃
Output: scattering component x and object mask w
Coarse-level optimization (Eq. (16)):
  W ← I, a_k ← argmin_{a_k} ‖U a_k − x̃_k‖²
  repeat
    Solve Eq. (12) for x
    Solve Eq. (12) for a_1, ···, a_K
    if first iteration then
      Compute σ using Eq. (20)
    end if
    Update w using Eq. (18)
  until converged
Fine-level optimization (Eq. (11)):
  Initialize w and a_1, ···, a_K with the output of the coarse level
  repeat
    Solve Eq. (12) for x
    Solve Eq. (12) for a_1, ···, a_K
    if first iteration then
      Compute σ using Eq. (19)
    end if
    Update w using Eq. (13)
  until converged
Binarize w

The weight controls the robust estimation; that is, a large error term reduces the corresponding weight. In this study, we consider the object region as outliers, and thus the weights in the object region should be small. Therefore, we can leverage the IRLS weights to extract the object region from the image.

4.4 Coarse-to-fine optimization

Accurate object region extraction is critical for the effectiveness of the scattering component estimation. In section 4.1, we introduced the local and global priors of the scattering component to deal with a large object region. To make the region extraction more robust, we developed a coarse-to-fine optimization scheme. Before solving Eq. (11), we optimize the following objective function:

min_{x, a_1, ···, a_K}  Σ_{k=1}^{K} ρ(‖x_k − x̃_k‖/σ) + γ_1 Σ_{k=1}^{K} ‖U a_k − x_k‖² + γ_2 ‖F x − x‖² + γ_3 ‖∇x‖².   (16)

This is similar to the patch-based robust regression proposed by Chaudhury and Singer (2013). The difference from Eq. (11) is that the data term consists of patch-wise errors. Equation (16) can be transformed into IRLS in the same way as Eq. (11), and the IRLS weight is updated patch-wise rather than pixel-wise. Differentiating the first term of Eq. (16) with respect to x, we obtain

∂/∂x Σ_{k=1}^{K} ρ(‖x_k − x̃_k‖/σ) = Σ_{k=1}^{K} F_k(∂/∂x_k ρ(‖x_k − x̃_k‖/σ)) = Σ_{k=1}^{K} F_k((1/σ²) w_k (x_k − x̃_k)) = (1/σ²) W(x − x̃),   (17)

where the operator F_k : ℝ^{N_k} → ℝ^N embeds an input patch into the overall image with zero padding and returns its vectorized form. The weight matrix is W = diag(w) = diag([w_1 1_{N_1}^⊤ ··· w_K 1_{N_K}^⊤]^⊤), where 1_{N_k} ∈ ℝ^{N_k} is the all-ones vector. The weight w_k for each patch k is given as

w_k = ρ′(‖x_k − x̃_k‖/σ) / (‖x_k − x̃_k‖/σ).   (18)

Therefore, we obtain the same objective function as Eq. (12), where γ′_* = 2σ²γ_*.

Algorithm 1 shows the overall procedure of the simultaneous estimation of a scattering component and an object region. We first solve Eq. (16) to obtain the weights at the patch level and then solve Eq. (11) at the pixel level. Each scale parameter is computed only at the first iteration and is fixed during subsequent iterations. We compute the scale parameters using the median absolute deviation, which is a robust measure of deviation, as follows (Fox and Weisberg, 2002):

σ = median{|x_1 − x̃_1|, ···, |x_N − x̃_N|},   (19)

σ = median{‖x_1 − x̃_1‖, ···, ‖x_K − x̃_K‖}.   (20)

At the end of the algorithm, we binarize the IRLS weights to generate an object mask. This procedure is applied to an amplitude image and a phase image in the same manner, and thus we obtain an object mask in each domain. In this study, we determine the final object mask as their intersection.
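The coarse and fine levels of Algorithm 1 differ only in how the weights and the scale parameter are computed: per patch via Eqs. (18) and (20), or per pixel via Eqs. (13)/(15) and (19). The sketch below illustrates both updates; the 4 × 4 patch grid and c = 4 follow the settings reported in section 5, while the structure and names are our own illustrative choices.

```python
import numpy as np

def tukey_w(r, c=4.0):
    """Weight rho'(r)/r for Tukey's biweight, used in Eqs. (13), (15), (18)."""
    return np.where(np.abs(r) <= c, (1.0 - (r / c) ** 2) ** 2, 0.0)

def pixel_weights(x, x_obs, c=4.0):
    """Fine level: one weight per pixel, scale sigma from Eq. (19)."""
    res = x - x_obs
    sigma = np.median(np.abs(res)) + 1e-12
    return tukey_w(res / sigma, c)

def patch_weights(x, x_obs, grid=4, c=4.0):
    """Coarse level: one weight per patch of a grid x grid division (Eqs. (18), (20)),
    repeated over each patch so it can act as a per-pixel diagonal W.
    Assumes the image dimensions are divisible by `grid`."""
    res = x - x_obs
    h, w = res.shape
    ph, pw = h // grid, w // grid
    norms = np.array([[np.linalg.norm(res[i * ph:(i + 1) * ph, j * pw:(j + 1) * pw])
                       for j in range(grid)] for i in range(grid)])
    sigma = np.median(norms) + 1e-12
    return np.kron(tukey_w(norms / sigma, c), np.ones((ph, pw)))
```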
5 Experiments

Experimental setup. We first evaluated the effectiveness of the proposed method in a controlled environment. The experimental environment is shown in Fig. 7.
Fig. 7
Experimental environment.
Fig. 8
Results of the proposed method. (a) Target foggy scene. (b)(c) Left to right: input image, estimated scattering component, and IRLS weight for the amplitude and phase image, respectively. (d) Left to right: depth without fog, depth with fog, reconstructed depth, masked depth, and estimated object mask.
We set up a fog generator and a Kinect v2 in a closed space sized 186 × 161 cm with black walls and floor. The observation of the wall includes only a scattering component because incident light onto the wall is absorbed. The Kinect v2 has three modulation frequencies: 120, 80, and 16 MHz. We used images obtained with 16 MHz. To acquire an amplitude and a phase image, we used the source code provided by Tanaka et al. (2017). Their code outputs the average image of several frames, and we modified it so that only a single frame was input. To compensate for high-frequency noise, we used a bilateral filter as preprocessing. The spatial resolution of an image captured by the Kinect v2 is
424 × 512 pixels. We divided a captured image into 4 × 4 patches (K = 16) for the local quadratic prior. In section 4.1, we assumed that the camera and the light source are collocated on a line that runs parallel to the horizontal axis of the image. In practice, the camera and light source are slightly out of alignment. Although this violates the symmetry of the scattering component, we found that the error due to this misalignment is negligibly small. In our implementation, we defined F as a matrix that flips an image with respect to the 200th row of the image.
Fig. 9 Results under different density conditions: (a) thin fog and (b) thick fog. In such a highly foggy scene, the accuracy of depth reconstruction is reduced where the scattering component has a large effect, but the proposed method can estimate an object region and improve the depth measurement regardless of the medium density.
Table 1
Mean depth error on each object under different density conditions. "w/o method" denotes the error before applying the proposed method.

             |            |  Plane     |  Chair     |  Desk      |  Hand      |  Duck
Figure 9(a)  | w/o method |  117.26 mm |  198.79 mm |  459.73 mm |  425.21 mm |  574.28 mm
(thin)       | proposed   |   17.97 mm |   65.01 mm |   83.40 mm |   32.04 mm |   45.08 mm
Figure 8     | w/o method |  253.38 mm |  372.02 mm |  656.11 mm |  669.64 mm |  798.17 mm
(medium)     | proposed   |   20.67 mm |   60.27 mm |  106.68 mm |   50.63 mm |  118.21 mm
Figure 9(b)  | w/o method |  421.31 mm |  531.48 mm |  798.85 mm |  844.32 mm |  953.56 mm
(thick)      | proposed   |   19.95 mm |   70.63 mm |  202.36 mm |   75.79 mm |  143.20 mm

In addition, we did not use the 24 rows of the lower part of the image for the third term of Eq. (11), as these pixels have no symmetric counterpart for the global symmetrical prior. For amplitude images, we set the hyperparameters of the objective function as [γ′_1, γ′_2, γ′_3] = [0. , . , ] and the tuning parameter of ρ(x) as c = 4; for phase images, [γ′_1, γ′_2, γ′_3] = [0. , . ,
50] and c = 2.

Experimental result.
The results are shown in Fig. 8. Figure 8(a) shows the foggy scene, which has five target objects: "plane," "chair," "desk," "hand," and "duck." Figures 8(b) and (c) show the estimation of the scattering component and the object region for the amplitude and phase image, respectively. The object region depicted here is the IRLS weight before binarization. As shown, the proposed method effectively estimated the object region via the weights. Of particular note is that thin regions such as the legs of the chair could be extracted. The estimation of the scattering component in the object region was also successful. Figure 8(d) shows the results of the depth reconstruction, where we show the depth measurements without and with fog, and the reconstructed depth. The depth measurement in the foggy scene had large errors due to fog. In contrast, the proposed method reconstructed the object depth correctly.

We tested the proposed method under different density conditions. Figure 9 shows the results under thin fog and thick fog. In a highly foggy scene like the one in Fig. 9(b), the accuracy of depth reconstruction was reduced where the scattering component had a large effect, but the proposed method could estimate the object region and improve the depth measurement regardless of the medium density. The mean depth error on each object with and without our method under different density conditions is listed in Table 1; here, we define the ground truth as the measured depth without fog. As shown, the proposed method reduced the error significantly regardless of the fog density.
Fig. 10
Results for a more realistic scene. (a) Target scene without and with fog. (b)(c) Left to right: input image, estimated scattering component, and IRLS weight for the amplitude and phase image, respectively. (d) Left to right: depth without fog, depth with fog, reconstructed depth, masked depth, and estimated object mask.

5.2 Realistic scene

Next, we tested the proposed method in a more realistic scene. Figure 10(a) shows the target scene. We artificially generated fog in the same manner as in the controlled experimental setting, although the scene in Fig. 10 has neither dark walls nor a dark floor. Note that the scene contains materials with various types of reflectance, including a lamp made from paper, a glossy vase, and a wooden ornament. The results of the scattering component and object region estimation for the amplitude and phase image are shown in Fig. 10(b)(c), respectively, and the depth reconstruction is shown in Fig. 10(d). The results of this experiment show that the proposed method can also extract the object region and improve the depth measurement in a scene that has a general background.

5.3 Experiments with synthesized data
Synthesized data.
To investigate the effectiveness in more varied scenes, we evaluated the proposed method with synthesized data. The procedure for generating the synthesized data is shown in Fig. 11. We assume that the scattering component does not depend on the object depth, and thus we observed a direct component and a scattering component separately and then combined them into a synthesized image.

To synthesize an image, we have to know the scattering coefficient of the scene. First, we captured a foggy scene that includes calibration objects, and the region of the calibration objects was masked manually. After that, we compensated for the defective region by solving Eq. (12) to estimate the scattering component. The weight w in Eq. (12) corresponds to the mask, and we used only the regularization of the global symmetrical prior and the smoothing prior. Using the estimated scattering component, we can compute the direct component of the amplitude using Eq. (8). Now, the relationship between the amplitude without fog and the attenuated direct component in a foggy scene is given as

α_d(u, v) = e^{−2βd(u, v)} α̂(u, v),   (21)

where α̂(u, v) is the amplitude at a pixel (u, v) without fog, d(u, v) is the distance between the camera and the calibration object, and β is the scattering coefficient. Therefore, we computed the scattering coefficient as

β = (1/|Ω|) Σ_{(u,v)∈Ω} [log α̂(u, v) − log α_d(u, v)] / (2 d(u, v)),   (22)

where Ω denotes the set of pixels in the mask and |Ω| is the number of pixels in Ω.
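Eq. (22) averages per-pixel estimates of β over the calibration mask Ω; a minimal sketch (array names are ours, and the depth d(u, v) is assumed to be in the same units as 1/β):

```python
import numpy as np

def estimate_scattering_coefficient(amp_clear, amp_direct_fog, depth, mask):
    """Scattering coefficient beta from a masked calibration region, Eq. (22).

    amp_clear:      amplitude image without fog (alpha_hat)
    amp_direct_fog: direct-component amplitude recovered in fog (alpha_d)
    depth:          camera-to-object distance d(u, v)
    mask:           boolean image selecting the calibration pixels (Omega)
    """
    ratio = np.log(amp_clear[mask]) - np.log(amp_direct_fog[mask])
    return float(np.mean(ratio / (2.0 * depth[mask])))
```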
Fig. 11
Procedure of synthesizing images. First, we captured a scene that has calibration objects in a foggy scene and masked the region of the calibration objects manually. After that, we compensated for the defective region to estimate the scattering component. Using the observation without fog, the scattering coefficient can be computed. The images of a target scene without fog were captured separately, and the attenuated direct component and the estimated scattering component were combined into synthesized images.

In the scene in Fig. 11, the scattering coefficient was computed as β = 3. × 10⁻ /mm. We also observed a scene without fog, which was used for the direct component after being attenuated by the scattering coefficient. We combined the attenuated signal and the scattering component to synthesize the amplitude and phase images.

Experimental result.
The results are shown in Figs. 12, 13, and 14. In each figure, (b)(c) show the estimated scattering component and the IRLS weight for the amplitude and phase image. We show the output of both the coarse- and fine-level optimization. (d) shows the result of the depth reconstruction. In each scene, the proposed method effectively extracted the object region and estimated the scattering component. (e) shows the result of applying only the fine-level optimization. In the scenes in Figs. 13 and 14, performing only the fine-level optimization failed to detect the accurate object region, and this deteriorated the scattering component estimation. In contrast, the coarse-to-fine approach made the region extraction more robust.

We show a failure case in Fig. 15. In a scene that has a large object region, our method was less effective because the quadratic function also fits the values in the object region. In Fig. 15, a large textureless object region exists on the left side. In addition, the global symmetrical prior did not work in this region because the object occupied the pixels from top to bottom in the image.
Fig. 12 (a) Target scene. (b)(c) Left to right: input image, estimated scattering component and weight of the coarse level, and of the fine level, for the amplitude and phase image, respectively. (d) Left to right: depth without fog, depth with fog, reconstructed depth, masked depth, and estimated object mask. (e) Result of applying only the fine-level optimization: the estimated scattering component and the weight for the amplitude image, for the phase image, and the reconstructed depth, from left to right.
Fig. 13 (a) Target scene. (b)(c) Left to right: input image, estimated scattering component and weight of the coarse level, and of the fine level, for the amplitude and phase image, respectively. (d) Left to right: depth without fog, depth with fog, reconstructed depth, masked depth, and estimated object mask. (e) Result of applying only the fine-level optimization: the estimated scattering component and the weight for the amplitude image, for the phase image, and the reconstructed depth, from left to right.
Fig. 14 (a) Target scene. (b)(c) Left to right: input image, estimated scattering component and weight of the coarse level, and of the fine level, for the amplitude and phase image, respectively. (d) Left to right: depth without fog, depth with fog, reconstructed depth, masked depth, and estimated object mask. (e) Result of applying only the fine-level optimization: the estimated scattering component and the weight for the amplitude image, for the phase image, and the reconstructed depth, from left to right.
Fig. 15
Failure case. (a) Target scene. (b)(c) Results for the amplitude and phase image, respectively. From left to right: input image, estimated scattering component, and weight. (d) Result of the depth reconstruction.
Fig. 16
Simulation setting. The measurement range of our method is between x_saturate and x_background, where the scattering component is saturated and a direct component remains.

5.4 Measurement range

The measurement range of our method is determined by two points, x_saturate and x_background. For simplicity, the camera and light source are assumed to be collocated at the same place. Beyond x_saturate, the scattering component is constant due to its saturation, while the direct component fades away beyond x_background. Therefore, the measurement range of our method is between x_saturate and x_background.

We simulated the measurement range to evaluate this capability. Similarly to the process of synthesizing images, a scattering coefficient was computed for the scene in Fig. 8 to use for the simulation (β = 3. × 10⁻ /mm). We use the following Henyey-Greenstein phase function for the scattering property:

P(θ) = (1/4π) (1 − g²) / (1 + g² − 2g cos θ)^{3/2},   (23)

where θ is the scattering angle. The parameter g was fixed in the simulation. The scattering component observed at a scene point whose depth is z is given as

α_s(z) e^{jϕ_s(z)} = ∫_{z_0}^{z} β P(π) e^{−2βz′} e^{j(4πf/c) z′} dz′.   (24)

We set the starting point of the integral as z_0 = 10 mm. The direct component from depth z is computed as

α_d(z) e^{jϕ_d(z)} = (I/z²) e^{−2βz} e^{j(4πf/c) z},   (25)

where I consists of the surface albedo and shading, and we set I = 1 in this simulation. The total observation α̃(z) e^{jϕ̃(z)} is the sum of these components, as in Eq. (3).
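A sketch of this simulation is given below. The 16 MHz modulation frequency and the 10 mm starting point follow the text; the values of β and g are placeholders because they are not legible in this copy, and all function names are ours.

```python
import numpy as np

C = 3.0e11        # approximate speed of light [mm/s]
F_MOD = 16e6      # modulation frequency [Hz], the 16 MHz mode
BETA = 3e-4       # scattering coefficient [1/mm] (placeholder value)
G = 0.9           # Henyey-Greenstein asymmetry parameter (placeholder value)

def hg_phase(theta, g=G):
    """Henyey-Greenstein phase function, Eq. (23)."""
    return (1.0 - g**2) / (4.0 * np.pi * (1.0 + g**2 - 2.0 * g * np.cos(theta)) ** 1.5)

def scattering_component(z, z0=10.0, n=2000):
    """Backscatter phasor of Eq. (24), integrated numerically from z0 to z [mm]."""
    zp = np.linspace(z0, z, n)
    integrand = BETA * hg_phase(np.pi) * np.exp(-2.0 * BETA * zp) \
                * np.exp(1j * 4.0 * np.pi * F_MOD * zp / C)
    return np.trapz(integrand, zp)

def direct_component(z, albedo=1.0):
    """Attenuated direct-component phasor of Eq. (25)."""
    return (albedo / z**2) * np.exp(-2.0 * BETA * z) * np.exp(1j * 4.0 * np.pi * F_MOD * z / C)

# total observation at a depth of 1500 mm (cf. Eq. (3)): direct + scattering
obs = direct_component(1500.0) + scattering_component(1500.0)
print(np.abs(obs), np.angle(obs))
```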
Fig. 17 (a)(b) Scattering components for amplitude and phase observed at a scene point whose depth is z. (c)(d) Residuals of the observation and the scattering component, which represent the remaining direct components.

Figure 17 shows the simulation results. In (a) and (b), the horizontal axis denotes the depth z and the vertical axis denotes the scattering component for amplitude and phase, respectively. These figures validate the saturation characteristic. Meanwhile, in Fig. 17(c) and (d), the vertical axes denote the residual of the observation and the scattering component. ∆ϕ is given by the residual angle between α̃(z)e^{jϕ̃(z)} and α_s(z)e^{jϕ_s(z)} on the complex plane. These values represent the remaining direct components from depth z. From Fig. 17(c) and (d), we can set ‖x_background‖ = 2500 mm and 5000 mm for amplitude and phase, respectively, because the direct component is close to zero beyond these depths. In contrast, in Fig. 17(a) and (b), if we set ‖x_saturate‖ = 1000 mm, the estimation error of the scattering component for amplitude due to the saturation assumption can be considered almost zero because 1 − α_s(1000)/α_s(8000) ≈ 0, while 1 − ϕ_s(1000)/ϕ_s(8000) ≈ 0.06 for phase. In our experiments, where ‖x_saturate‖ = 1000 mm and ‖x_background‖ = 2500 mm, the target objects are located within that range, and from the above discussion, we have just
6.0% error of the scattering component estimation for phase due to the saturation assumption. As shown in the experiments, we can effectively reconstruct the object depth regardless of this error. The measurement range depends on the density of the participating medium, and it becomes larger under thinner fog.
6 Conclusion

In this paper, we proposed a method that simultaneously estimates an object region and depth by using a continuous-wave ToF camera. We leveraged the saturation of the scattering component and the attenuation of the direct component from a distant point in the scene. The formulation with a robust estimator and the IRLS optimization scheme allows us to estimate the scattering component and the object region simultaneously.

The limitation of the proposed method is that we assume the density of the participating medium in the scene to be homogeneous; thus, it cannot be applied to an inhomogeneous medium or a dynamically floating medium. In these environments, the global symmetrical prior does not hold. In addition, we assume that the scene has a background region, which makes it difficult to apply the method to scenes filled with objects. In future work, we will address these problems in order to further enhance the real-world applicability of the proposed method.
Acknowledgements
This work was supported by JSPS KAKENHI Grant Number 18H03263.
References
Asano Y, Zheng Y, Nishino K, Sato I (2016) Shape from water: Bispectral light absorption for depth recovery. European Conference on Computer Vision (ECCV), pp 635–649

Beaton AE, Tukey JW (1974) The fitting of power series, meaning polynomials, illustrated on band-spectroscopic data. Technometrics 16(2):147–185

Berman D, Treibitz T, Avidan S (2016) Non-local image dehazing. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 1674–1682

Cai B, Xu X, Jia K, Qing C, Tao D (2016) DehazeNet: An end-to-end system for single image haze removal. IEEE Transactions on Image Processing 25(11):5187–5198

Chartrand R, Yin W (2008) Iteratively reweighted algorithms for compressive sensing. The IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 3869–3872

Chaudhury KN, Singer A (2013) Non-local patch regression: Robust image denoising in patch space. The IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 1345–1349

Fattal R (2014) Dehazing using color-lines. ACM Transactions on Graphics (TOG) 34(1)

Fox J, Weisberg S (2002) Robust regression: Appendix to an R and S-PLUS companion to applied regression

Freedman D, Smolin Y, Krupka E, Leichter I, Schmidt M (2014) SRA: Fast removal of general multipath for ToF sensors. European Conference on Computer Vision (ECCV), pp 234–249

Fuchs S (2010) Multipath interference compensation in time-of-flight camera images. 2010 20th International Conference on Pattern Recognition, pp 3583–3586

Fujimura Y, Iiyama M, Hashimoto A, Minoh M (2018) Photometric stereo in participating media considering shape-dependent forward scatter. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 7445–7453

Gu J, Nayar SK, Belhumeur PN, Ramamoorthi R (2013) Compressive structured light for recovering inhomogeneous participating media. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(3):555–567

Guo Q, Frosio I, Gallo O, Zickler T, Kautz J (2018) Tackling 3D ToF artifacts through learning and the FLAT dataset. The European Conference on Computer Vision (ECCV), pp 368–383

Gupta M, Nayar SK, Hullin MB, Martin J (2015) Phasor imaging: A generalization of correlation-based time-of-flight imaging. ACM Transactions on Graphics (TOG) 34(5)