Indirect Domain Shift for Single Image Dehazing
IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. X, NO. X, MONTH XXXX
Huan Liu, Student Member, IEEE, Chen Wang, and Jun Chen, Senior Member, IEEE
Abstract—Despite their remarkable expressibility, convolutional neural networks (CNNs) still fall short of delivering satisfactory results on single image dehazing, especially in terms of faithful recovery of fine texture details. In this paper, we argue that the inadequacy of conventional CNN-based dehazing methods can be attributed to the fact that the domain of hazy images is too far away from that of clear images, rendering it difficult to train a CNN to learn a direct domain shift in an end-to-end manner and recover texture details simultaneously. To address this issue, we propose to add explicit constraints inside a deep CNN model to guide the restoration process. In contrast to direct learning, the proposed mechanism shifts and narrows the candidate region for the estimation output via multiple confident neighborhoods. Therefore, it is capable of consolidating the expressibility of different architectures, resulting in a more accurate indirect domain shift (IDS) from the domain of hazy images to that of clear images. We also propose two different training schemes, including hard IDS and soft IDS, which further reveal the effectiveness of the proposed method. Our extensive experimental results indicate that the dehazing method based on this mechanism outperforms the state-of-the-art.
Index Terms—Single image dehazing, domain shift, deep neural network.
I. INTRODUCTION
Deep convolutional neural networks (CNNs) have been tremendously successful in many high-level computer vision tasks, e.g., image recognition [26], [20] and object detection [15], [39]. Although recent works have shown that it is also possible to learn an end-to-end CNN model for low-level vision tasks, e.g., image dehazing [6], [22], the resulting performance is still not completely satisfactory. For high-level vision tasks, it suffices to extract specific features and simply express them as very low dimensional vectors [26], which results in a relatively simple mapping. In contrast, low-level vision tasks require both global understanding of image content and local inference of texture details; as such, the associated mappings are more complicated.

One possible explanation for the performance discrepancy between high-level and low-level vision tasks is as follows. For high-level vision tasks such as image recognition, a slight perturbation of the output tends to be inconsequential, since the perturbed output is likely to get converted to the same one-hot vector and consequently the classification label remains unaffected. However, for low-level vision tasks such as image dehazing, any perturbation can potentially manifest in the final result, jeopardizing the image quality. From this point
Manuscript received April 19, 2005; revised August 26, 2015.
Fig. 1: The proposed indirect domain shift (IDS) method.

of view, despite the fact that a deep CNN can in principle approximate any function, it is still difficult to train an accurate mapping that lifts the input to the target domain in one shot, since the loss function is typically very close to zero in the neighborhood of the target image [31]. We argue that a different mechanism for domain shift is needed for image dehazing, which requires both memory and understanding of image contents.

To this end, we provide explicit guidance during model optimization to lead the domain shift path across several identified confident neighborhoods, resulting in the proposed framework shown in Fig. 1. More specifically, instead of only imposing the loss function on the model output, we introduce multi-scale estimation, multi-branch diversity, and adversarial loss inside the model, thereby pulling the interim outputs to specific regions and then merging them in the target domain; this yields an indirect but more accurate mapping.

The contributions of this paper include:
• By introducing loss functions inside a CNN model, we propose the framework of indirect domain shift (IDS) for image dehazing, which aggregates the powerful expressibility of different architectures, i.e., multi-scale, multi-branch, and generator, for lifting degraded images to the target domain indirectly.
• We provide theoretical justifications for IDS and show that it provides valuable guidance for network construction.
  – A multi-scale module takes advantage of a coarse-to-fine network to maintain global-local consistency.
  – A multi-branch architecture is adopted to enable precise inference of local details by providing diverse confident neighborhoods.
  – A FusionNet further improves the perceptual quality by informed 'imagination', rather than blindly pursuing a higher PSNR, as the multi-scale multi-branch structure has shifted degraded images close enough to the corresponding ground truth in terms of objective image quality metrics.
• It is demonstrated that IDS leads to remarkable performance improvements compared with the state-of-the-art algorithms.

II. RELATED WORKS
Image dehazing, which aims to recover a haze-free image from its hazy version, is a highly ill-posed restoration problem. The haze effect is often approximated using the atmospheric scattering model [36] given as follows:

I(x) = J(x) t(x) + A (1 − t(x)),    (1)

where I(x), J(x), and A are the observed hazy image, clear scene radiance, and global atmospheric light, respectively. The scene transmission t(x) describes the portion of light that is not scattered and reaches the camera. It can be expressed as t(x) = e^{−βd(x)}, where β is the medium extinction coefficient and d(x) is the depth of pixel x.

Based on this atmospheric scattering model [36], many strategies have been proposed by taking advantage of various prior knowledge. For example, the dark channel prior [19] assumes that in non-sky patches, at least one color channel has very low intensity. The color attenuation prior [53] assumes that the image saturation decreases sharply at hazy patches, so that the difference between brightness and saturation can be utilized to estimate the haze concentration. Recently, data-driven approaches to image dehazing have received increasing attention. [40] and [7] propose to use CNNs for medium transmission estimation, which is further leveraged to recover the haze-free image. In [40], a multi-scale deep neural network is proposed to learn a mapping between hazy images and their corresponding transmission maps. A densely connected pyramid network is proposed in [48] to jointly estimate the transmission map, atmospheric light, and dehazed images, while an effective iteration algorithm is developed in [33] to learn the haze-relevant priors. [10] further embeds the atmospheric model into the design of the CNN and proposes a feature dehazing unit to ensure end-to-end trainability.
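As a concrete illustration, the forward model in (1) can be simulated directly. The sketch below uses NumPy with toy depth values and parameters chosen purely for illustration; it is not part of any dehazing method, just the haze synthesis equation itself:

```python
import numpy as np

def synthesize_haze(clear, depth, beta=1.0, atmosphere=0.8):
    """Apply the atmospheric scattering model I(x) = J(x) t(x) + A (1 - t(x)),
    with per-pixel transmission t(x) = exp(-beta * d(x))."""
    t = np.exp(-beta * depth)[..., None]   # broadcast over color channels
    return clear * t + atmosphere * (1.0 - t)

# Toy example: a 2x2 RGB image whose right column is twice as far away,
# so it receives more airlight and looks hazier.
J = np.full((2, 2, 3), 0.2)
d = np.array([[1.0, 2.0], [1.0, 2.0]])
I = synthesize_haze(J, d, beta=1.0, atmosphere=0.8)
```

As expected from (1), pixels with larger depth have smaller transmission and are pulled closer to the atmospheric light A.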
However, it is known that the atmospheric scattering model (ASM) is not valid in certain scenarios [30], which limits the applicability of the aforementioned dehazing methods.

Unlike those ASM-dependent methods, [8] integrates multiple models to perform haze removal with attention, and [32] uses a GridNet-based network [14] to directly predict dehazed images via an ASM-agnostic approach. To further improve the performance in the ASM-agnostic setting, [9] proposes a multi-scale boosted dehazing network (MSBDN) with a boosting strategy and a back-projection technique. [21] makes use of knowledge distillation to tackle the dehazing problem by training the neural network under the supervision of both ground truths and teacher outputs.

Many methods that have been developed for other image restoration tasks, e.g., deblurring and denoising, are also highly relevant. To remove blurring caused by dynamic scenes, a multi-scale convolutional neural network is proposed in [35] to restore sharp images in an end-to-end manner. In [16], the weighted nuclear norm minimization (WNNM) problem is studied and applied to image denoising by exploiting non-local self-similarity. This work is later extended to handle arbitrary degradation, including blur and missing pixels [47]. To tackle the long-term dependency problem, MemNet [45] is proposed by introducing a memory block, consisting of a recursive unit and a gate unit, to explicitly mine persistent memory through an adaptive learning process. To make deep networks implementable on limited resources, a new activation unit is proposed in [25], which enables the network to capture much more complex features, thus requiring a significantly smaller number of layers to reach the same performance. A super-resolution generative adversarial network (SRGAN) is developed in [27] to recover high-frequency details and produce more natural-looking images.

III. FORMULATION FOR INDIRECT DOMAIN SHIFT
In this section, we provide a theoretical formulation of the image dehazing problem and propose the indirect domain shift method as an effective approach to obtaining an approximate solution.

Denote the prior distribution of clear images of size m × n by p_X, which is defined on a low-dimensional manifold M in ℝ^{3×m×n}. The image degradation mechanism can be modeled as a conditional distribution p_{Y|X}, i.e., given the clear image x, a distorted image y is generated according to p_{Y|X}. Note that p_X and p_{Y|X} induce the joint distribution p_{X,Y} as well as the conditional distribution p_{X|Y}; in general, both p_X and p_{Y|X} need to be learned from the training data. Image dehazing can be formulated as a maximum a posteriori estimation problem:

x̂_map = arg max_{x̂ ∈ M} p_{X|Y}(x̂ | y).    (2)

In practice, one often considers the following alternative formulation:

x̂_ℓ = arg min_{x̂ ∈ ℝ^{3×m×n}} E[ℓ(X, x̂) | Y = y] = arg min_{x̂ ∈ ℝ^{3×m×n}} ∫_M p_{X|Y}(x | y) ℓ(x, x̂) dx,    (3)

where ℓ is a loss function. In general, it is expected that both x̂_map and x̂_ℓ are close to the ground truth. However, there is no guarantee that x̂_ℓ belongs to M.

We shall describe an IDS method, which leverages multi-scale estimation and multi-branch diversity to obtain an approximate solution of (3), then lifts it into M using the adversarial loss to produce a candidate solution of (2). A network that realizes the IDS method is shown in Fig. 2.

A. Multi-scale Estimation
Note that (3) requires knowledge of p_{X|Y}, which needs to be estimated from the training data; hence we solve the following approximated version of (3):

x̂'_ℓ = arg min_{x̂ ∈ ℝ^{3×m×n}} ∫_M p'_{X|Y}(x | y) ℓ(x, x̂) dx,    (4)
Fig. 2: One example of the proposed IDS network. (a) and (b) are the multi-scale estimation branches with MSE and SSIM loss, respectively. (d) is the FusionNet with adversarial and content loss. (c) shows the legend.

where p'_{X|Y} is an approximation of p_{X|Y} learned from the training data. To ensure that x̂'_ℓ ≈ x̂_ℓ (and consequently close to the ground truth), we need p'_{X|Y}(x | y) ≈ p_{X|Y}(x | y) for x ∈ M (at least for x in a neighborhood of y that contains the ground truth). However, since the difference between the ground truth and the distorted version y is not negligible, this neighborhood could be quite large, rendering a good approximation of p_{X|Y}(· | y) in this neighborhood difficult to obtain. Indeed, the number of parameters needed to specify p_{X|Y}(· | y) in this neighborhood might be comparable to or even larger than the available training data; hence a direct approximation can be highly unreliable, especially considering the fact that the approximation is in general done in a suboptimal way. For this reason, it is sensible to first approximate p_{X̃|Y} (with x̃ being a low-resolution version of the ground truth), which itself is an approximation of p_{X|Y} and can be specified by a significantly smaller number of parameters (as compared to p_{X|Y}). In this way, we can get a good approximation of p_{X̃|Y}, denoted by p'_{X̃|Y}, and solve the following optimization problem instead:

x̃_ℓ̃ = arg min_{x̂ ∈ ℝ^{3×m×n}} ∫_M p'_{X̃|Y}(x | y) ℓ̃(x, x̂) dx.    (5)

Since p'_{X̃|Y}(x | y) is a good approximation of p_{X̃|Y}(x | y), it is expected that x̃_ℓ̃ is close to x̃ and consequently not very far away from the ground truth.
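The coarse-to-fine estimation strategy above can be sketched as follows. This is a minimal NumPy stand-in, not the actual network: average pooling replaces bilinear down-sampling, nearest-neighbour repetition replaces pixel shuffle, and `restore` is a placeholder for a learned sub-network:

```python
import numpy as np

def downsample(img, f):
    """Average pooling by factor f (a stand-in for bilinear interpolation)."""
    h, w = img.shape
    return img.reshape(h // f, f, w // f, f).mean(axis=(1, 3))

def upsample(img, f):
    """Nearest-neighbour upsampling (a stand-in for pixel shuffle)."""
    return img.repeat(f, axis=0).repeat(f, axis=1)

def coarse_to_fine(hazy, restore):
    """Estimate at 1/4 scale first, then refine at 1/2 and full scale,
    feeding each stage the up-sampled previous estimate together with
    the hazy input at the matching resolution."""
    est = restore(downsample(hazy, 4))               # coarse holistic structure
    for f in (2, 1):
        inp = downsample(hazy, f) if f > 1 else hazy
        est = restore(0.5 * (upsample(est, 2) + inp))  # fuse and refine
    return est

out = coarse_to_fine(np.ones((8, 8)), restore=lambda x: x)
```

Each stage only has to model a small correction around the previous estimate, mirroring the shrinking neighborhoods in (5) and (6).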
Now with x̃_ℓ̃ at hand, we can further convert (3) to the following problem:

x̂_ℓ = arg min_{x̂ ∈ ℝ^{3×m×n}} ∫_{N(x̃_ℓ̃)} p_{X|X̃_ℓ̃,Y}(x | x̃_ℓ̃, y) ℓ(x, x̂) dx,    (6)

where N(x̃_ℓ̃) is a neighborhood of x̃_ℓ̃ that is large enough to cover the ground truth. It suffices to have a good approximation of p_{X|X̃_ℓ̃,Y}(· | x̃_ℓ̃, y) over N(x̃_ℓ̃). The above procedure is repeated until the required neighborhood is small enough. We assume that the smaller the neighborhood becomes, the fewer parameters are needed to specify the distribution defined over this neighborhood, and consequently the approximation becomes easier. Multi-scale estimation is introduced to mimic conventional coarse-to-fine optimization methods and has been widely applied in many computer vision tasks [12], [11], [40], [35].

B. Multi-branch Diversity
The idea underlying multi-branch diversity is similar. Suppose we adopt two branches with different loss functions, denoted by ℓ_1 and ℓ_2, respectively; then (6) becomes

x̂_ℓ = arg min_{x̂ ∈ ℝ^{3×m×n}} ∫_{N(x̃_{ℓ_1}) ∩ N(x̃_{ℓ_2})} p_{X|X̃_{ℓ_1},X̃_{ℓ_2},Y}(x | x̃_{ℓ_1}, x̃_{ℓ_2}, y) ℓ(x, x̂) dx.    (7)

It should be clear that multi-branch diversity further narrows the region over which the distribution needs to be estimated. In our experiments, we choose ℓ_1 and ℓ_2 to be the mean squared error (MSE) and structural similarity index (SSIM) loss, respectively. See Fig. 2 (a) and (b) for the architecture of the two multi-scale estimation branches of the proposed IDS network.

C. Adversarial Loss
The role of the adversarial loss ℓ_ad is to lift x̂_ℓ into M. Specifically, consider a neural network subject to the weighted loss ℓ + λℓ_ad, which can be interpreted as solving the following problem:

x̂_{ℓ+λℓ_ad} = arg max_{x̂ ∈ N(x̂_ℓ, λ)} p_X(x̂),    (8)

where N(x̂_ℓ, λ) is a neighborhood of x̂_ℓ. In general, this optimization problem tends to give a reconstruction that falls into M, since p_X is only positive on M. Note that the size of N(x̂_ℓ, λ) depends on λ. Specifically, N(x̂_ℓ, λ) is large when λ is large. In the extreme case of λ → ∞, we have x̂_{ℓ+λℓ_ad} → arg max_{x̂ ∈ M} p_X(x̂); when λ is very small, N(x̂_ℓ, λ) may have no intersection with M, and in this case (8) reduces to (3). In principle, it is desirable to choose the smallest λ such that N(x̂_ℓ, λ) intersects with M. It is also worth noting that p_X is in general unknown, so one has to solve a modified version of (8) with p_X replaced by p'_X, which is an approximation of p_X learned from the training data.

The adversarial loss serves an important role in generating texture details in image restoration. One of the reasons for its success in our framework is that, by leveraging multi-scale estimation and multi-branch diversity, one can already obtain a good estimate x̂_ℓ that lies in a narrow neighboring region of M, and consequently the generator does not need much "imagination" to produce a natural-looking image. Moreover, we observe a phenomenon similar to that reported in [27]: the adversarial loss is helpful for faithful reproduction, even though the final PSNR metric is slightly lower. We therefore introduce the adversarial loss to obtain better perceptual quality rather than a higher PSNR value. The relevant ablation study can be found in Section V-C.

Fig. 3: The isolated training of one iteration in hard IDS.

IV. IMPLEMENTATION
In this section, we provide a detailed implementation of indirect domain shift (IDS). We also propose two training schemes, i.e., hard IDS and soft IDS.
A. Network Architecture
The proposed IDS network is shown in Fig. 2, which consists of three basic components, i.e., the MSE branch, the MS-SSIM branch, and the FusionNet. The MSE and SSIM branches perform multi-scale estimation under their respective losses. An adversarial loss (see (8)) is also imposed on the FusionNet to enhance the perceptual quality of the final result.

TABLE I: The configuration of the shallow, medium, and deep hard IDS corresponding to Fig. 4.

To be specific, inside each diversity branch, there are three sub-networks, each performing domain shift at a different scale level. The input of the coarse-scale sub-network is obtained from the original hazy image via bilinear interpolation with a down-sampling factor of 4. Its output is up-sampled by a factor of 2 via pixel shuffle [43], then fed into the medium-scale sub-network, together with the hazy image representation down-sampled by a factor of 2. The input of the fine-scale sub-network is the concatenation of the original hazy image representation and the up-sampled output of the medium-scale sub-network.

It is known that residual networks (ResNets) can facilitate gradient flow, while dense networks (DenseNets) help maximize the use of feature layers via concatenation and dense connection. To capitalize on their respective strengths, [51] proposes so-called residual dense networks (RDNs), which consist of contiguous memory blocks, local residual learning blocks, and global feature fusion blocks. In this work, we use RDNs as the fundamental building components of the proposed IDS network. See Table I for detailed specifications. Note that hard IDS and soft IDS adopt the same network structure, but differ in terms of the number of trainable parameters. Model depth will be detailed in Section V-D.

B. Training Scheme
To handle the coexistence of multiple loss functions, we propose two back-propagation strategies characterized by different effective ranges of the loss functions. Specifically, we can either separately update each module according to the associated loss function or jointly update all modules according to a global loss that aggregates the local ones. This results in the two IDS training schemes, i.e., hard IDS and soft IDS.
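The second strategy (one aggregated objective for a joint update) can be sketched as follows. The modules, image heads, and scalar "images" below are toy placeholders, not the actual sub-networks:

```python
def aggregate_loss(modules, heads, losses, weights, hazy, target):
    """Global objective for joint training: the signal flows through the
    chain of modules, each local loss is evaluated on that module's image
    head, and the weighted sum of local losses is returned so that one
    end-to-end update covers all modules."""
    feat, total = hazy, 0.0
    for module, head, loss_fn, w in zip(modules, heads, losses, weights):
        feat = module(feat)                        # pass forward to next stage
        total += w * loss_fn(head(feat), target)   # local loss on image output
    return total

# Toy check: two modules, identity heads, squared-error local losses.
total = aggregate_loss(
    modules=[lambda f: f + 1.0, lambda f: f * 2.0],
    heads=[lambda f: f, lambda f: f],
    losses=[lambda o, t: (o - t) ** 2] * 2,
    weights=[1.0, 0.5],
    hazy=1.0, target=4.0,
)
```

Minimizing `total` by back-propagation updates every module at once, in contrast to the isolated per-module updates of the first strategy.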
Fig. 4: The performance of hard IDS with different parameters.
1) Hard IDS:
We first present the isolated training strategy for hard IDS, shown in Fig. 3. Specifically, each module is supervised independently by the associated loss functions and delivers dehazed images to the next stage after updating its weights (see Fig. 5a). Note that in this case, the convergence of the entire network does not depend on the convergence of all loss functions, which means that the network performance may become stable before all loss functions are small enough. This is a consequence of direct mapping, since for each mapping step it suffices to enter one of many (almost) equally good confident neighborhoods, resulting in a lower computational load. One advantage of isolated updating is that the gradient vanishing problem can be alleviated. Recall that this problem is caused by the emergence of small gradients in the earlier layers of very deep networks during back-propagation. In comparison, isolated training shortens the back-propagation path but maintains the depth of forward inference, at the expense of heterogeneous convergence rates of the different loss functions.
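One hard-IDS iteration can be sketched as follows, with toy scalar "images" and a no-op optimizer standing in for the real modules and weight updates:

```python
def hard_ids_step(modules, losses, optimizers, hazy, target):
    """Isolated training: each module is updated only by its own loss and
    hands the resulting *image* to the next stage, so the back-propagation
    path never crosses a module boundary."""
    x, local_errors = hazy, []
    for module, loss_fn, opt in zip(modules, losses, optimizers):
        out = module(x)
        err = loss_fn(out, target)
        opt(module, err)        # local weight update; gradients stop here
        local_errors.append(err)
        x = out                 # deliver the dehazed image, not features
    return x, local_errors

# Toy run: two stages move the input 1.0 toward the target 0.6.
final, errs = hard_ids_step(
    modules=[lambda v: v * 0.5, lambda v: v + 0.1],
    losses=[lambda o, t: abs(o - t)] * 2,
    optimizers=[lambda m, e: None] * 2,
    hazy=1.0, target=0.6,
)
```

Each stage's local error shrinks as the estimate approaches the target, while no gradient ever propagates back through an earlier stage.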
2) Soft IDS:
In contrast to hard IDS, here a global loss function obtained by combining all local module losses is used to update the network parameters via end-to-end back-propagation. Although the local losses are evaluated based on the images output by the respective modules, only the feature map from the penultimate convolutional layer of each module is delivered to the next module (see Fig. 5b). This enables soft IDS to accomplish the desired task largely in the feature space. The fact that each module no longer has to re-map the previous module's output images back to the feature space is helpful for reducing the number of parameters and also makes the indirect shifting path 'smoother'. Another advantage of soft IDS is that there is no need to be concerned with the convergence of a specific module as in hard IDS, which facilitates the training process.

In summary, the differences between these two versions of IDS are as follows: (1) As shown in Fig. 5, hard IDS delivers images to the next stage, while soft IDS delivers features instead. (2) Hard IDS adopts isolated training (optimizing individual modules separately); in contrast, soft IDS leverages the aggregate loss to jointly optimize the constituent modules of the entire network.

V. ABLATION STUDY
We conduct ablation studies to investigate the respective contributions of multi-scale estimation, multi-branch diversity, and adversarial loss using the RESIDE-standard indoor dataset [29], which will be introduced in detail in Section VI-A. To eliminate the influence of other factors, all training configurations are kept the same as those presented in Section VI-B, including the total number of trainable parameters for each network.
A. Multi-scale Estimation
As mentioned in Section III-A, a direct mapping can be unreliable, since the number of trainable parameters might be comparable to or even larger than the available training data. To overcome this problem, a multi-scale network is applied in the first stage of IDS. Another important property of such coarse-to-fine estimation is local-global consistency: the coarse-scale network first estimates the holistic structure of the image scene, and then a fine-scale network performs refinement based on both local information and the coarse global estimation. To further study the influence of this coarse-to-fine structure, we test the performance of the IDS framework without multi-scale estimation (w/o scale).

Following the ablation principle, we remove the coarse-scale network and make the fine-scale network deeper so as to keep the same number of parameters. One output example is presented in Fig. 6a, indicating that hard IDS w/o scale is able to recover the image reasonably well, but with some local inconsistency: the haze at the upper-left corner is not removed faithfully. This verifies the above analysis that the multi-scale network is able to capture both local and global features. We present the PSNR and SSIM performance for both hard IDS and soft IDS in Table II (a) and (b), respectively. It can be seen that IDS w/o scale performs worse than IDS (especially in soft IDS), indicating that the local inconsistency has an impact on both the quantitative metrics and the perceptual quality.
B. Multi-branch Diversity
Using multi-scale estimation with the MSE loss, one can realize domain shift to a certain extent. However, some important information may get lost along the way. To preserve information diversity, we introduce one more multi-scale branch and employ the SSIM loss in this branch. This strategy enables a more precise inference of local details by providing distinctive confident neighborhoods identified by different branches. To further illustrate the effectiveness of this strategy, we test the performance of IDS without multi-branch diversity (w/o div). Similarly, we remove the second branch and make the first branch deeper. One example is presented in Fig. 6b, in which IDS w/o div sometimes delivers erroneous detail inference: the "dark area" between the light and the wall clearly should not exist. This is further verified by the overall validation shown in Table II, in which there is a large performance gap between IDS and IDS w/o div, indicating that it is well worth having two branches.
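The two branch losses can be sketched as follows. Note this is a simplified SSIM computed from global image statistics, a toy stand-in for the windowed, multi-scale SSIM used in the actual branch; c1 and c2 are the usual stabilizing constants:

```python
import numpy as np

def mse_loss(x, y):
    """Mean squared error between two images."""
    return float(np.mean((x - y) ** 2))

def ssim_loss(x, y, c1=1e-4, c2=9e-4):
    """1 - SSIM with global statistics (luminance and contrast/structure
    terms), standing in for the windowed SSIM loss."""
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()
    lum = (2 * mx * my + c1) / (mx ** 2 + my ** 2 + c1)
    cs = (2 * cov + c2) / (x.var() + y.var() + c2)
    return float(1.0 - lum * cs)

clean = np.linspace(0.0, 1.0, 16).reshape(4, 4)
noisy = clean + 0.1
```

The two losses penalize different aspects of a mismatch (pixel-wise error versus luminance/structure), which is why the two branches identify different confident neighborhoods.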
Fig. 5: The difference between (a) Hard IDS and (b) Soft IDS.

TABLE II: Ablation studies on the SSIM/PSNR performance for (a) Hard IDS and (b) Soft IDS. The best performance is shown in bold, while the second-best results are underlined.
Fig. 6: Some output examples of Hard IDS without multi-scale estimation (w/o scale), without multi-branch diversity (w/o div), and without adversarial loss (w/o adv) in the ablation study, respectively.
C. Adversarial Loss
The adversarial loss (together with the content loss) is employed at the last stage (i.e., the FusionNet) of the proposed IDS framework and serves to obtain high visual quality. The FusionNet takes the estimates from the two branches, in conjunction with the original hazy image, as the input and generates the final output with perceptually satisfactory high-frequency details via proper fusion. Since the estimates produced by the two branches are already in the neighboring domains of the target, the generator does not need to rely on pure "imagination" to create texture details; instead, it can, to a great extent, maintain perceptual reality rather than blindly pursue a higher PSNR [27].

To show this, we demonstrate that IDS without adversarial loss is able to produce a higher PSNR but NOT better perceptual quality. Following the ablation principle, we construct IDS without adversarial loss (w/o adv) by simply removing the discriminator. As can be seen in Fig. 6c, IDS w/o adv produces a slightly higher PSNR (26.508) but obviously lower perceptual quality than IDS (26.094), as the wall is painted "darker" partially to minimize the MSE distance. This demonstrates the generalization capability of the generator and provides further justification for the IDS framework. To further prove the necessity of the adversarial loss, we compare the proposed method with GridDehaze [32] on outdoor datasets. Note that GridDehaze [32] is a pure CNN-based dehazing method that does not adopt an adversarial loss to generate natural-looking outputs. It can be seen from Fig. 7 that the images generated by soft IDS tend to be closer to the ground truth, with fewer inconsistent color gradients on the road, sky, and wall. This verifies that the adversarial loss is able to induce better perceptual quality.

Fig. 7: The output examples from the SOTS outdoor testing set.
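The lifting behaviour of the weighted loss in (8) can be illustrated with a toy one-dimensional "manifold"; all points and prior values below are made up purely for illustration:

```python
# Toy illustration of (8): M is a finite set of "natural images" (points on
# the manifold), p_X their prior, and x_hat_l the estimate from the branches.
M = [0.0, 1.0, 2.0]
p_X = {0.0: 0.2, 1.0: 0.7, 2.0: 0.1}

def lift(x_hat, lam):
    """Return the most likely manifold point inside the neighbourhood
    N(x_hat, lam); if the neighbourhood misses M entirely, (8) reduces
    to (3) and the estimate is left unchanged."""
    inside = [m for m in M if abs(m - x_hat) <= lam]
    return max(inside, key=p_X.get) if inside else x_hat

x_hat_l = 0.9   # already close to M thanks to the earlier IDS stages
```

With λ too small (e.g., 0.05) the neighbourhood misses M and nothing changes; with λ = 0.2 the estimate is lifted to the nearby manifold point 1.0; as λ grows without bound, the result drifts toward the global prior mode regardless of the input.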
D. Model Depth
This section is devoted to investigating the impact of model depth on the performance of our hard IDS method. By adjusting the number of convolutional and residual dense blocks, we construct shallow, medium, and deep models with 8 M, 10.5 M, and 15 M trainable parameters, respectively. Detailed specifications are shown in Table I. As expected, the deep model achieves the best overall performance in terms of both PSNR and SSIM. As illustrated in Fig. 4, both PSNR and SSIM values improve dramatically as the number of parameters increases, which further verifies the effectiveness of the IDS framework. It is worth mentioning that, albeit with fewer trainable parameters (around 5.3 M), soft IDS still manages to outperform hard IDS, as shown in Table III.

VI. EXPERIMENTS
In this section, we further compare the proposed IDS network with several state-of-the-art dehazing algorithms, including the dark channel prior (DCP) [19], DehazeNet [7], AOD-Net [28], the gated fusion network (GFN) [41], GridDehazeNet (GridDehaze) [32], PFD [10], and MSBDN [9]. For a fair comparison, all these algorithms are evaluated on both synthetic and realistic datasets in terms of visual effect and quantitative accuracy. We adopt the peak signal-to-noise ratio (PSNR) [52] and the structural similarity index (SSIM) [46] for evaluation.
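The PSNR metric used throughout the evaluation can be computed as below (shown for images scaled to [0, 1]; SSIM is omitted here since it requires windowed statistics):

```python
import numpy as np

def psnr(ref, test, peak=1.0):
    """Peak signal-to-noise ratio in dB between a reference and a test image."""
    mse = np.mean((ref - test) ** 2)
    return float('inf') if mse == 0 else float(10.0 * np.log10(peak ** 2 / mse))

# A uniform error of 0.1 on a [0, 1] image gives MSE = 0.01, i.e. 20 dB.
value = psnr(np.zeros((4, 4)), np.full((4, 4), 0.1))
```

Higher values indicate a smaller pixel-wise error; identical images yield infinite PSNR.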
A. Benchmark Dataset
For training and testing purposes, we use the RESIDE-standard dataset [29], which is a benchmark for single image dehazing. The indoor training set (ITS) of RESIDE-standard contains 13990 synthetic hazy indoor images (together with their haze-free counterparts). These synthetic images are generated from NYU2 [37] and Middlebury stereo [42], with the medium extinction coefficient β and the global atmospheric light A each chosen uniformly from a fixed interval. The outdoor training set (OTS) of RESIDE-standard contains 296695 hazy images generated from 8477 clear counterparts, with β and A again sampled uniformly. The testing set (SOTS) of RESIDE-standard contains 500 synthetic hazy indoor/outdoor images (together with haze-free counterparts). We also perform comparisons using the real-world hazy image dataset in [13] to show the perceptual difference.

B. Training Details
Our algorithm is implemented using the PyTorch library [38], and all tests are conducted on the same Nvidia Titan Xp GPU. We train the network with the following configuration: the Adam optimizer [24] is applied with a mini-batch size of 10. For hard IDS, the learning rate decays every 120 epochs for a total of 700 epochs, while soft IDS is trained for 100 epochs with the learning rate reduced by half at the 60th, 80th, and 90th epochs. Besides, horizontal/vertical random flipping is applied for data augmentation.

C. RCAN as Substitute
The proposed IDS framework is generic in nature and admits many different concrete implementations. In this work, we have focused on a particular implementation with RDNs as the fundamental building blocks. However, this is by no means the best possible one. Indeed, the performance of our IDS network can be further improved by adopting more powerful substitutes for RDNs. To demonstrate this, we replace the RDNs in soft IDS with residual channel attention networks (RCANs) [50] with the same number of trainable parameters. The experimental results demonstrating the effectiveness of this variant of IDS can be found in the subsequent subsections.
D. Evaluation on Benchmark Dataset
We train our network from scratch on RESIDE-standard ITS and OTS, and validate it on the separate testing dataset SOTS. The quantitative and qualitative results are shown in Table III and Fig. 8, respectively. Here hard IDS corresponds to the deep model in Table I, while soft IDS is as described in Section V-D. It can be seen from Table III that soft IDS outperforms the other methods under comparison in terms of PSNR and SSIM. In particular, the PSNR achieved by soft IDS reaches 34.74 on the SOTS indoor dataset. Moreover, adopting RCANs leads to a further performance boost.
TABLE III: The SSIM/PSNR performance of different methods on SOTS-indoor and SOTS-outdoor. Our proposed methods and the improved network with RCAN outperform the others.
Dataset Metrics DCP DehazeNet AOD-Net GFN GridDeaze PFD MSBDN Hard IDS Soft IDS RCAN IDSIndoor PSNR 16.62 21.14 19.06 22.30 32.16 32.68 33.79 32.17
SSIM 0.8179 0.8472 0.8504 0.880 0.9836 0.9760 0.9840 0.9860
Outdoor PSNR 19.13 24.75 24.14 28.29 30.86 31.17 31.33 30.78
SSIM 0.8605 0.9269 0.9198 0.9621 0.9819 0.9825 0.9832 0.9815 (i) Hazy (iii) DehazeNet(ii) DCP (iv) AOD-Net (v) GFN (vi) GridDehaze (vii) PFD (viii) MSBDN (ix) RCAN IDS (x) Clear
Fig. 8: The output examples from SOTS indoor testing set of the SOTA methods.As for visual quality, prior-based methods [19] overestimatethe haze thickness, which results in color distortion (e.g., thecolor of the wall turns purple in the fifth row of Fig. 8).Although some learning-based baseline methods [7], [28]avoid the color distortion problem, they tend to deliver unsatis-factory haze removal results for shaded regions. For example,in the seventh row of Fig. 8, the area behind the arch shouldbe dark; however, the restoration results produced by mostbaseline methods show light color instead. This is probablybecause the baseline methods fail to correctly estimate thedepth information and consequently are misled by the hazeeffect. GFN generates decent results, and removes the hazein this area reasonably well. A possible explanation is thatGFN does not rely on depth estimation for haze removal; it can also be attributed to the multi-scale approach adopted byGFN, which is an important ingredient of the IDS frameworkas well. Exploiting the full strength of IDS enables us to obtainbetter dehazing results. GridDehaze [32], PFD [10] andMSBDN [9] can produce dehazed images comparable to ours.Nevertheless, they still generate inconsistent color gradients onthe venetian blinds in the fourth row of Fig. 8. In contrast, ourdehazed images can hardly be distinguished from the groundtruth.
E. Evaluation on Real-world Photographs
We further show the dehazing results on real-world images from [13] to illustrate the generalization ability of IDS. In Fig. 9, the prior-based method [19] introduces color distortion and over-enhancement. It is clear that DehazeNet [7] and AOD-Net [28] fail to remove haze completely, especially in the last column, where heavy haze can still be seen around the haystack. Moreover, they also tend to over-enhance the images (e.g., the mountains in the fourth column). Although GridDehaze [32], PFD [10] and MSBDN [9] work well on the synthetic dataset, their generalization performance on real images is not completely satisfactory; for example, one can see color distortion, incomplete haze removal or over-enhancement in the regions identified by the red boxes in Fig. 9. We also notice that the proposed IDS is able not only to remove haze successfully, regardless of whether it is dense or light, but also to restore the texture details faithfully, which further proves the effectiveness of our method.

Fig. 9: Output examples on real-world images from Fattal et al. [13], compared with SOTA DNN-based methods.

TABLE IV: Average per-image ( × ) runtime (seconds) on SOTS-indoor for CNN-based methods (GPU running time).

Method      | DehazeNet | AOD-Net | GFN   | GridDehaze | PFD   | MSBDN | Hard IDS | Soft IDS | RCAN IDS
Runtime (s) | 0.190     | 0.004   | 0.011 | 0.22       | 0.103 | 0.088 | 0.048    | 0.035    | 0.041

Fig. 10: Qualitative evaluation on the O-Haze [2] and Dense-Haze [1] datasets: (a) output samples from the O-Haze testing set; (b) output samples from the Dense-Haze testing set.

TABLE V: The PSNR/SSIM performance of different methods on the O-Haze [2] and Dense-Haze [1] datasets. Our proposed method outperforms the others. Best results are in bold and second-best results are underlined.

O-Haze:
Method         | PSNR  | SSIM
Scarlet [49]   | 24.03 | 0.775
BJTU           | 24.69 | 0.777
FKS            | 23.88 | 0.775
VICLAB [44]    | 22.71 | 0.707
Ranjanisi [34] | 23.00 | 0.701
GridDehaze     | 22.76 | 0.721
MSBDN          | 23.28 | 0.743
Ours           | 24.92 | 0.779

Dense-Haze:
Method          | PSNR  | SSIM
iPAL-AtJ [18]   | 20.26 | —
iPAL-COLOR [17] | 19.92 | 0.653
MT.MaxClear [5] | 19.47 | 0.652
BMIPL           | 18.84 | 0.633
xddqm           | 18.52 | 0.640
GridDehaze      | 16.56 | 0.582
MSBDN           | 17.36 | 0.607
Ours            | 20.27 | 0.657
F. Evaluation on Real-world Datasets
Here we perform evaluation on two real-world datasets, O-Haze [2] and Dense-Haze [1], which are very challenging due to the limited number of training images (45 and 55, respectively) and the complicated haze patterns. Therefore, the performance on these two datasets can serve as a good indicator of whether a dehazing method can reliably handle highly unfavorable situations. The quantitative and qualitative results of our methods are presented in Table V and Fig. 10, respectively. Table V also contains the top five results from the NTIRE 2018 Dehazing challenge [4] and the NTIRE 2019 challenge [3], as well as the evaluation results of GridDehaze [32] and MSBDN [9] for comparison.
Results on O-Haze.
We evaluate our proposed IDS on the O-Haze dataset [2] following the setting of the NTIRE 2018 Dehazing challenge [4]. It can be observed from Table V that our IDS outperforms all other methods in terms of PSNR and SSIM. Fig. 10a shows that our approach reconstructs faithful and sharp haze-free images with good perceptual quality.
Results on Dense-Haze.
In contrast to O-Haze, which mostly contains light haze, the Dense-Haze dataset records images with a denser and more homogeneous haze layer. We follow the setting of the NTIRE 2019 challenge [3] for evaluation. Qualitative results in Fig. 10b demonstrate that our IDS is able to restore regions covered by thick haze. In particular, for the second testing sample in Fig. 10b, the background scene is almost invisible to human eyes, yet our IDS removes the haze fairly completely and reconstructs identifiable details. Quantitative comparisons in Table V further confirm that our IDS is the top-performing method.
G. Runtime
Table IV shows runtime comparisons on the SOTS dataset. Our method ranks third among CNN-based methods. It is worth mentioning that in our implementation multi-scale estimation is performed branch by branch. A significant reduction in runtime is possible via a parallel implementation of multi-scale estimation in the two branches.

VII. CONCLUSION
In this paper, it is shown that traditional direct mapping methods cannot provide accurate mappings for image dehazing. To address this problem, an indirect domain shift (IDS) method is proposed, which adds explicit loss functions inside a deep CNN model to guide the dehazing process. Multi-scale estimation, multi-branch diversity, and the adversarial loss play important roles in this method, as shown by the ablation studies. We also propose two training schemes with their respective advantages: hard IDS is less demanding in terms of computational resources and alleviates the gradient vanishing problem, while soft IDS is easier to implement and in general yields better performance. We show that IDS achieves remarkable improvements over the state of the art. One interesting direction for future work is to explore the application of the IDS framework to other image restoration tasks.

REFERENCES

[1] Codruta O. Ancuti, Cosmin Ancuti, Mateu Sbert, and Radu Timofte. Dense-Haze: A benchmark for image dehazing with dense-haze and haze-free images. In , pages 1014–1018. IEEE, 2019.
[2] Codruta O. Ancuti, Cosmin Ancuti, Radu Timofte, and Christophe De Vleeschouwer. O-HAZE: A dehazing benchmark with real hazy and haze-free outdoor images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 754–762, 2018.
[3] Codruta O. Ancuti, Cosmin Ancuti, Radu Timofte, Luc Van Gool, Lei Zhang, and Ming-Hsuan Yang. NTIRE 2019 image dehazing challenge report. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 0–0, 2019.
[4] Cosmin Ancuti, Codruta O. Ancuti, and Radu Timofte. NTIRE 2018 challenge on image dehazing: Methods and results. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 891–901, 2018.
[5] Simone Bianco, Luigi Celona, Flavio Piccoli, and Raimondo Schettini. High-resolution single image dehazing using encoder-decoder architecture. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 0–0, 2019.
[6] Bolun Cai, Xiangmin Xu, Kui Jia, Chunmei Qing, and Dacheng Tao. DehazeNet: An end-to-end system for single image haze removal. IEEE Transactions on Image Processing, 25(11):5187–5198, September 2016.
[7] Bolun Cai, Xiangmin Xu, Kui Jia, Chunmei Qing, and Dacheng Tao. DehazeNet: An end-to-end system for single image haze removal. IEEE Transactions on Image Processing, 25(11):5187–5198, 2016.
[8] Zijun Deng, Lei Zhu, Xiaowei Hu, Chi-Wing Fu, Xuemiao Xu, Qing Zhang, Jing Qin, and Pheng-Ann Heng. Deep multi-model fusion for single-image dehazing. In Proceedings of the IEEE International Conference on Computer Vision, pages 2453–2462, 2019.
[9] Hang Dong, Jinshan Pan, Lei Xiang, Zhe Hu, Xinyi Zhang, Fei Wang, and Ming-Hsuan Yang. Multi-scale boosted dehazing network with dense feature fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2157–2167, 2020.
[10] Jiangxin Dong and Jinshan Pan. Physics-based feature dehazing networks. In European Conference on Computer Vision, pages 188–204. Springer, 2020.
[11] David Eigen and Rob Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the IEEE International Conference on Computer Vision, pages 2650–2658, 2015.
[12] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. In Advances in Neural Information Processing Systems, pages 2366–2374, 2014.
[13] Raanan Fattal. Dehazing using color-lines. ACM Transactions on Graphics (TOG), 34(1):13, 2014.
[14] Damien Fourure, Rémi Emonet, Élisa Fromont, Damien Muselet, Alain Trémeau, and Christian Wolf. Residual conv-deconv grid network for semantic segmentation. arXiv preprint arXiv:1707.07958, 2017.
[15] Ross Girshick. Fast R-CNN. In IEEE International Conference on Computer Vision, pages 1440–1448. IEEE, 2015.
[16] Shuhang Gu, Lei Zhang, Wangmeng Zuo, and Xiangchu Feng. Weighted nuclear norm minimization with application to image denoising. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2862–2869, 2014.
[17] Tiantong Guo, Venkateswararao Cherukuri, and Vishal Monga. Dense '123' color enhancement dehazing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 0–0, 2019.
[18] Tiantong Guo, Xuelu Li, Venkateswararao Cherukuri, and Vishal Monga. Dense scene information estimation network for dehazing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 0–0, 2019.
[19] Kaiming He, Jian Sun, and Xiaoou Tang. Single image haze removal using dark channel prior. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(12):2341–2353, 2011.
[20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778. IEEE, 2016.
[21] Ming Hong, Yuan Xie, Cuihua Li, and Yanyun Qu. Distilling image dehazing with heterogeneous task imitation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3462–3471, 2020.
[22] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694–711, Cham, 2016. Springer International Publishing.
[23] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016.
[24] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In , 2015.
[25] Idan Kligvasser, Tamar Rott Shaham, and Tomer Michaeli. xUnit: Learning a spatial activation function for efficient image restoration. CVPR, pages 2433–2442, 2018.
[26] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Conference on Neural Information Processing Systems, 2012.
[27] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew P. Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, and Wenzhe Shi. Photo-realistic single image super-resolution using a generative adversarial network. CVPR, pages 105–114, 2017.
[28] Boyi Li, Xiulian Peng, Zhangyang Wang, Ji-Zheng Xu, and Dan Feng. AOD-Net: All-in-one dehazing network. In Proceedings of the IEEE International Conference on Computer Vision, 2017.
[29] Boyi Li, Wenqi Ren, Dengpan Fu, Dacheng Tao, Dan Feng, Wenjun Zeng, and Zhangyang Wang. Benchmarking single-image dehazing and beyond. IEEE Transactions on Image Processing, 28(1):492–505, 2019.
[30] Yu Li, Robby T. Tan, and Michael S. Brown. Nighttime haze removal with glow and multiple light colors. In Proceedings of the IEEE International Conference on Computer Vision, pages 226–234, 2015.
[31] Tsung-Yi Lin, Priya Goyal, Ross B. Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. ICCV, pages 2999–3007, 2017.
[32] Xiaohong Liu, Yongrui Ma, Zhihao Shi, and Jun Chen. GridDehazeNet: Attention-based multi-scale network for image dehazing. In ICCV, 2019.
[33] Yang Liu, Jinshan Pan, Jimmy Ren, and Zhixun Su. Learning deep priors for image dehazing. In Proceedings of the IEEE International Conference on Computer Vision, pages 2492–2500, 2019.
[34] Ranjan Mondal, Sanchayan Santra, and Bhabatosh Chanda. Image dehazing by joint estimation of transmittance and airlight using bi-directional consistency loss minimized FCN. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 920–928, 2018.
[35] Seungjun Nah, Tae Hyun Kim, and Kyoung Mu Lee. Deep multi-scale convolutional neural network for dynamic scene deblurring. In IEEE Conference on Computer Vision and Pattern Recognition, pages 257–265. IEEE, 2017.
[36] Srinivasa G. Narasimhan and Shree K. Nayar. Vision and the atmosphere. International Journal of Computer Vision, 48(3):233–254, 2002.
[37] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from RGBD images. In ECCV, 2012.
[38] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS Workshop, 2017.
[39] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1137–1149, 2017.
[40] Wenqi Ren, Si Liu, Hua Zhang, Jinshan Pan, Xiaochun Cao, and Ming-Hsuan Yang. Single image dehazing via multi-scale convolutional neural networks. In European Conference on Computer Vision, pages 154–169. Springer, 2016.
[41] Wenqi Ren, Lin Ma, Jiawei Zhang, Jinshan Pan, Xiaochun Cao, Wei Liu, and Ming-Hsuan Yang. Gated fusion network for single image dehazing. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[42] Daniel Scharstein and Richard Szeliski. High-accuracy stereo depth maps using structured light. In , volume 1, pages I–I. IEEE, 2003.
[43] Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P. Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1874–1883, 2016.
[44] Hyeonjun Sim, Sehwan Ki, Jae-Seok Choi, Soomin Seo, Saehun Kim, and Munchurl Kim. High-resolution image dehazing with respect to training losses and receptive field sizes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 912–919, 2018.
[45] Ying Tai, Jian Yang, Xiaoming Liu, and Chunyan Xu. MemNet: A persistent memory network for image restoration. In The Conference on Computer Vision and Pattern Recognition, pages 4539–4547, 2017.
[46] Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
[47] Noam Yair and Tomer Michaeli. Multi-scale weighted nuclear norm image restoration. CVPR, 2018.
[48] He Zhang and Vishal M. Patel. Densely connected pyramid dehazing network. CVPR, 2018.
[49] He Zhang and Vishal M. Patel. Densely connected pyramid dehazing network. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[50] Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng Zhong, and Yun Fu. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 286–301, 2018.
[51] Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, and Yun Fu. Residual dense network for image super-resolution. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[52] Hang Zhao, Orazio Gallo, Iuri Frosio, and Jan Kautz. Loss functions for image restoration with neural networks. IEEE Transactions on Computational Imaging, 3(1):47–57, 2017.
[53] Qingsong Zhu, Jiaming Mai, Ling Shao, et al. A fast single image haze removal algorithm using color attenuation prior.