Segmentation-Aware Image Denoising without Knowing True Segmentation
Sicheng Wang, Student Member, IEEE, Bihan Wen, Member, IEEE, Junru Wu, Student Member, IEEE, Dacheng Tao, Fellow, IEEE, and Zhangyang Wang, Member, IEEE
Abstract—Several recent works discussed application-driven image restoration neural networks, which are capable of not only removing noise in images but also preserving their semantic-aware details, making them suitable as the pre-processing step for various high-level computer vision tasks. However, such approaches require extra annotations for their high-level vision tasks, in order to train the joint pipeline using hybrid losses. The availability of those annotations is yet often limited to a few image sets, potentially restricting the general applicability of these methods to denoising more unseen and unannotated images. Motivated by that, we propose a segmentation-aware image denoising model dubbed U-SAID, based on a novel unsupervised approach with a pixel-wise uncertainty loss. U-SAID does not need any ground-truth segmentation map, and thus can be applied to any image dataset. It generates denoised images of comparable or even better quality, and the denoised results show stronger robustness for subsequent semantic segmentation tasks, when compared to either its supervised counterpart or classical "application-agnostic" denoisers. Moreover, we demonstrate the superior generalizability of U-SAID in three ways, by plugging in its "universal" denoiser without fine-tuning: (1) denoising unseen types of images; (2) denoising as pre-processing for segmenting unseen noisy images; and (3) denoising for unseen high-level tasks. Extensive experiments demonstrate the effectiveness, robustness and generalizability of the proposed U-SAID over various popular image sets.
I. INTRODUCTION

Image denoising aims to recover the underlying clean image signal from its noisy measurement. It has traditionally been treated as an independent signal recovery problem, focusing on either the signal fidelity (e.g., PSNR) or the human perception quality of the recovered results. However, when high-level vision tasks are to be conducted on noisy images, such a separate image denoising step, typically applied as pre-processing, becomes suboptimal because of its unawareness of semantic information. A series of recent works [29], [5], [20], [12], [28], [27] discussed application-driven image restoration models that are capable of simultaneously removing noise and preserving semantic-aware details for certain high-level vision tasks. Those models achieve visually promising denoising results with richer details, in addition to better utility when supplied for high-level task pre-processing.
S. Wang, J. Wu and Z. Wang are with the Department of Computer Science and Engineering, Texas A&M University, College Station, TX 77843 USA, e-mail: {sharonwang, sandboxmaster, atlaswang}@tamu.edu. B. Wen is with the School of Electrical & Electronic Engineering, Nanyang Technological University, Singapore, e-mail: [email protected]. D. Tao is with the School of Computer Science, the University of Sydney, NSW 2006 Australia, e-mail: [email protected].

However, a common drawback of these approaches is their demand for extra annotations for the high-level vision tasks, in order to train the joint pipeline with hybrid low-level and high-level supervision. On one hand, such annotations (e.g., object bounding boxes, semantic segmentation maps) are often highly non-trivial to obtain for real images, therefore limiting current works to synthesizing noise on existing annotated clean datasets to demonstrate the effectiveness of their methods. On the other hand, training with only one annotated dataset runs the risk of overly tying the resulting denoiser to the semantic information of that specific dataset, which causes a lack of universality and may produce various artifacts due to overfitting when the denoiser is applied to other, substantially different images.

This paper attempts to break the above hurdles of existing application-driven image restoration models. We propose a novel unsupervised segmentation-aware image denoising (U-SAID) model that enforces the segmentation awareness and discriminative ability of denoisers, without actually needing any segmentation ground truth during training. It is implemented by creating a novel loss term that penalizes the pixel-wise uncertainty of the denoised outputs for segmentation. Our contributions are twofold:

• On the low-level vision side, to the best of our knowledge, U-SAID is the first unsupervised (or "self-supervised") application-driven image restoration model. In contrast to the existing peer work [29], U-SAID can be trained on any image dataset, without needing ground-truth (GT) segmentation maps. That greatly extends the applicability of U-SAID as a more "universal" denoiser that can denoise images with few semantic annotations, even when they are substantially different from the natural images in existing segmentation datasets. Compared to standard "application-agnostic" denoisers such as [52], U-SAID is observed to provide better visual details, which are also more favored under perception-driven metrics [33].

• On the high-level vision side, the U-SAID denoising network is shown to be robust and "universal" enough when applied to denoising different noisy datasets, as well as when used to boost segmentation performance on unseen noisy datasets, thanks to its weaker semantic association with any dataset annotation. Furthermore, U-SAID trained with segmentation awareness generalizes well to unseen high-level vision tasks, and can be plugged in without fine-tuning, which reduces the training effort when applied to various high-level tasks.

Extensive experiments on various popular image sets demonstrate the outstanding effectiveness, robustness, and universality of the proposed approach. We advocate that our methodology is (almost) a free lunch for image denoising, and has a plug-and-play nature that allows it to be incorporated with existing deep denoising models.

II. RELATED WORK
Image denoising has been studied intensively for decades. The earliest methods relied on various image filters [41]. Later on, many model-based methods with various priors were introduced, in either the spatial or transform domain, or their hybrid, such as spatial smoothness [38], non-local patch similarity [7], sparsity [11], [31], [47] and low-rankness [13]. More recently, a number of deep learning models have demonstrated superior performance for image denoising [3], [32], [52]. Despite this encouraging progress, most existing denoising algorithms reconstruct images by minimizing the mean square error (MSE), which is well known to be misaligned with human perception quality and often tends to over-smooth textures [17]. Moreover, while image denoising algorithms are often needed to pre-process acquired noisy visual data before subsequent high-level visual analytics, their impact on semantic visual information has been much less explored.

Lately, a handful of works have been devoted to closing the gap between low-level (e.g., image denoising, as a representative) and high-level computer vision tasks. Such a marriage leads not only to better utility performance for the high-level target tasks, but also to denoising outputs with richer visual details after receiving the extra semantic guidance from the high-level tasks, the latter being first revealed in [19], [46]. [29] presented a systematic study of the mutual influence between low-level and high-level vision networks. The authors cascaded a fixed pre-trained semantic segmentation network after a denoising network, and tuned the entire pipeline with a joint loss function of MSE and segmentation loss. In that way, they showed the denoised images to have sharper edges and clearer textural details, as well as higher segmentation and classification accuracies when such denoised images are fed to those tasks. A similar effort was described in [12], where a segmentation-aware deep fusion network was proposed to utilize the segmentation labels in MRI datasets to aid MRI compressive sensing recovery. [20] considered a joint pipeline of image dehazing and object detection. [39] proposed to incorporate global semantic priors (e.g., eyes and mouths) as an input to deblur highly structured face images. This field is now rapidly growing, with a few benchmarks launched recently [21], [45], [22], [50].

Following [29], [12], we also adopt segmentation as our high-level task, because it supplies pixel-wise feedback and is thus considered more helpful for dense regression tasks. As pointed out by [15], the availability of segmentation information can counteract the over-smoothing effects of CNNs across regions and increase their spatial precision. However, we would like to emphasize (again) that while [29], [15], [12] all exploit GT segmentation maps as extra strong supervision during training, we have only a weaker form of feedback available from the segmentation task, due to the absence of its GT as extra information. Straightforwardly, our methodology is applicable when cascaded with other high-level tasks as well.

Our work is also broadly related to training deep networks with noisy or uncertain annotations [44], [30]. Especially for the segmentation task, existing supervised models require manually labeled segmentation maps for training. But pixel-based labeling of high-resolution images is often time-consuming and error-prone, causing incorrect pixel-wise annotations. Existing works often treat them as label noise [37].
For example, [23] proposed a noise-tolerant deep model for histopathological image segmentation, using the label-flip noise models proposed in [40]. However, those algorithms still need to be given segmentation maps (though inaccurate), and often demand additional statistical estimation of the label noise.

III. THE PROPOSED MODEL: U-SAID

Our proposed unsupervised segmentation-aware image denoising (U-SAID) network follows the same cascade idea as the segmentation-guided denoising framework proposed by [29]. We replace their self-designed U-Net denoiser with the classical deep denoiser DnCNN [52] (implementation: https://github.com/cszn/DnCNN), using the 20-layer blind color image denoising model referred to as CDnCNN-B, since we favor robustness to varying noise levels. Note that the choice of denoiser network should not materially affect our conclusions. Its loss L_MSE is the reconstruction MSE between the denoised output and the clean image.

The critical difference between U-SAID and existing works lies in the high-level component of the cascade. Unlike [29], [12], which placed a pre-trained and fixed segmentation network trained with true segmentation labels, we design a new unsupervised segmentation awareness (USA) module that requires no segmentation labels to train with. The network architecture is illustrated in Figure 1.
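To make the cascade concrete, the sketch below shows one way the pipeline could be wrapped in PyTorch. This is our own illustrative wrapper, not the authors' code: the `USAID` class name is hypothetical, and `denoiser` and `usa_module` stand in for the CDnCNN-B denoiser and USA module described in the text (the USA module is assumed to return the scalar L_USA loss defined in Section III-A).

```python
import torch.nn as nn
import torch.nn.functional as F

class USAID(nn.Module):
    """Hypothetical wrapper for the U-SAID cascade: a trainable denoiser
    followed by a frozen USA module that scores the denoised output with
    the pixel-wise uncertainty loss L_USA."""
    def __init__(self, denoiser, usa_module, gamma=1.0):
        super().__init__()
        self.denoiser = denoiser          # CDnCNN-B-style, trainable
        self.usa = usa_module             # frozen during training
        for p in self.usa.parameters():
            p.requires_grad = False
        self.gamma = gamma                # weight of L_USA (default 1)

    def forward(self, noisy, clean):
        denoised = self.denoiser(noisy)
        loss_mse = F.mse_loss(denoised, clean)   # reconstruction term
        loss_usa = self.usa(denoised)            # segmentation-awareness term
        return denoised, loss_mse + self.gamma * loss_usa
```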
A. Design of USA Module
The USA module is composed of a feature embedding sub-network that transforms the input (denoised image) into the feature space, followed by an unsupervised segmentation sub-network that calculates the pixel-wise uncertainty of semantic segmentation.

For the feature embedding sub-network, we use a Feature Pyramid Network (FPN) [24] with a ResNet-101 backbone as the feature encoder. We use ImageNet-pretrained weights for the backbone (torchvision implementation: https://github.com/pytorch/vision/blob/master/torchvision/models/resnet.py), and keep all default architecture details of FPN/ResNet-101 unchanged. During training, the ResNet-101 backbone is frozen as a fixed feature extractor, and the top-down feature pyramid part of FPN starts from random Gaussian initialization and is also kept fixed. It is important to note that we do not use any image segmentation dataset to pre-train the feature embedding sub-network.
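As an illustration, a frozen ResNet-FPN embedding can be instantiated with torchvision's helper as sketched below. Treat this purely as an approximation of the module described above: the helper's signature varies across torchvision versions, its top-down initialization is the library default rather than literally Gaussian, and its pyramid outputs default to 256 channels whereas the paper reports 512-channel feature maps.

```python
import torch
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone

# Frozen feature-embedding sub-network (sketch): ResNet-101 backbone with
# ImageNet weights; the FPN top-down path keeps its initial (random) weights.
# Nothing here is updated during U-SAID training.
embed = resnet_fpn_backbone('resnet101', pretrained=True)
for p in embed.parameters():
    p.requires_grad = False
embed.eval()

x = torch.randn(1, 3, 48, 48)   # a denoised 48x48 training patch
feats = embed(x)                # dict of pyramid levels (e.g. '0', '1', ...)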
Fig. 1: The architecture of the proposed U-SAID network. A residual denoiser maps a noisy 48×48 patch to its denoised output, which is fed into the USA module: a feature embedding sub-network (a random-Gaussian-initialized feature pyramid on a ResNet-101 backbone) followed by an unsupervised segmentation sub-network producing a K-channel segmentation prediction; the cascade is trained with the joint loss combining L_MSE and L_USA.

For the unsupervised segmentation sub-network, we assume the input image resolution to be M × N, containing at most K different semantic classes. After FPN, we obtain 512 channels of feature maps of size M × N. We then apply convolution layers to reduce them to K channels, eventually leading to a (resized) K-class segmentation map.

Since the image segmentation task can be cast as pixel-wise classification, classical segmentation networks adopt a pixel-wise softmax loss to generate a K-class probability vector p_{i,j} for the (i, j)-th location (i and j range from 1 to M and 1 to N, respectively), choose the highest-probability class, and produce the final segmentation map of size M × N. However, since we have no GT pixel labels in the unsupervised case, we instead minimize the average entropy of all predicted class vectors p_{i,j}, denoted as L_USA, to encourage confident predictions at all pixels:

L_USA = \frac{1}{MN} \sum_{1 \le i \le M,\ 1 \le j \le N} -\, p_{i,j}^{T} \log p_{i,j}

All layer-wise weights in the unsupervised segmentation sub-network are randomly Gaussian initialized, while the ResNet-101 backbone uses the pre-trained ImageNet weights. Similar to [29], [12], we use a fixed high-level network, but we do not include the perceptual loss in training the network.
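In a modern autodiff framework, L_USA amounts to a few lines. A minimal sketch, assuming the sub-network emits raw per-class scores:

```python
import torch.nn.functional as F

def usa_loss(logits):
    """L_USA (sketch): mean pixel-wise Shannon entropy of the predicted
    class distributions. `logits` are raw K-class scores of shape
    (B, K, M, N) from the unsupervised segmentation sub-network."""
    log_p = F.log_softmax(logits, dim=1)   # log p_{i,j} per class
    p = log_p.exp()                        # p_{i,j}
    return (-(p * log_p).sum(dim=1)).mean()
```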
B. Training Strategy
We train the cascade of the denoising network and the USA module in an end-to-end manner, while fixing the weights in the feature embedding sub-network of the latter. The overall loss for U-SAID is L_MSE + γ L_USA, with the default γ = 1 unless otherwise specified. The training dataset for U-SAID can be any image set and need not have segmentation annotations, overcoming the limitation of [29], [12]. That said, we need an estimate of the segmentation class number K to construct L_USA; an ablation study on the estimated K will follow.

We use the Adam solver to train both the denoiser part and the USA module. The batch size is 16. The input patches are set to 48 × 48 pixels (patches are randomly sampled from images with a stride of 1). The initial learning rate is set to 1e-3 for all learnable parts of U-SAID, using a multi-step learning-rate decay strategy, i.e., dividing the learning rate by 10 at epochs 10, 40 and 80, respectively. The training is terminated after 100 epochs.
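This schedule maps directly onto standard PyTorch components; a sketch, reusing the hypothetical `USAID` wrapper from Section III and assuming `loader` yields batches of 16 matched noisy/clean 48×48 patch pairs:

```python
import torch

# Optimize only the trainable parts (the frozen USA module is excluded).
optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3)
# Divide the learning rate by 10 at epochs 10, 40 and 80.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[10, 40, 80], gamma=0.1)

for epoch in range(100):
    for noisy, clean in loader:
        optimizer.zero_grad()
        _, loss = model(noisy, clean)     # L_MSE + gamma * L_USA
        loss.backward()
        optimizer.step()
    scheduler.step()
```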
C. Why It Works?
A noteworthy feature of U-SAID is its frozen high-level network, trained in cascade with the denoiser. Without strong label supervision, one may wonder why it can regularize the denoiser training effectively: its high-level features (the randomly initialized parts kept fixed, on top of the ResNet-101 ImageNet features) can still be regressed into some unknown map that is only required to be low-entropy pixel-wise. In fact, if the network itself holds large enough capacity, one may expect to find parameters that fit any given pixel-wise map (low-entropy or not), including maps that convey little semantic information (e.g., random maps).

This may remind one of the deep image prior proposed in [43]: the authors first trained a randomly initialized convolutional network from scratch to regress from a random vector to a given corrupted image, and then used the trained network as a regularizer. Since no aspect of the network was pre-trained from data, such a deep image prior is effectively handcrafted, and was shown to work well for various image restoration tasks. The authors attributed the success to the convolutional architecture itself, which appeared to possess high noise impedance. In our case, the ImageNet features are considered highly relevant to image semantics. We therefore make a hypothesis similar to that of [43]: although the parametrization may regress to any random unstructured label map, it does so very reluctantly.
Fig. 2: Convergence plot.
To verify our hypothesis, we conduct a simple proof-of-concept experiment inspired by [51]. In the USA module, we replace L_USA with a standard pixel-wise softmax loss, keeping ResNet-101 fixed with ImageNet weights and initializing the other parts randomly. We then use the PASCAL VOC 2012 training set to train this modified USA module in a supervised way, but with three different choices of supervision: 1) the GT segmentation maps; 2) evenly cutting each GT map into 4 sub-images and randomly permuting their locations; 3) randomly permuting all pixel locations in each GT map. Notice that if we compute the L_USA values for the three target maps, they should be the same; see the sketch below for how such targets can be generated.

Figure 2 shows the value of the training loss as a function of the gradient descent iterations for the three supervisions. Apparently, the network converges much faster to GT maps; the more the GT maps were permuted, the more convergence "inertia" we observe. In other words, the network descends much more quickly towards semantically meaningful maps, and resists "bad" solutions with fewer semantics, even though their entropies might be the same.
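For concreteness, the three supervision variants can be produced as follows. This `make_target` helper is our own illustration (not the authors' code), assuming an even-sized integer label map:

```python
import torch

def make_target(seg_map, mode):
    """Generate the three supervision variants of the proof-of-concept
    experiment. `seg_map` is an (H, W) integer label map; H and W are
    assumed even for the quadrant split."""
    H, W = seg_map.shape
    if mode == "gt":          # 1) original GT segmentation map
        return seg_map
    if mode == "quadrant":    # 2) cut into 4 sub-maps, permute locations
        h, w = H // 2, W // 2
        blocks = [seg_map[:h, :w], seg_map[:h, w:],
                  seg_map[h:, :w], seg_map[h:, w:]]
        order = torch.randperm(4)
        top = torch.cat([blocks[order[0]], blocks[order[1]]], dim=1)
        bottom = torch.cat([blocks[order[2]], blocks[order[3]]], dim=1)
        return torch.cat([top, bottom], dim=0)
    if mode == "pixel":       # 3) permute all pixel locations
        perm = torch.randperm(H * W)
        return seg_map.reshape(-1)[perm].reshape(H, W)
    raise ValueError(f"unknown mode: {mode}")
```

Note that all three variants share the same per-image label histogram, which is why their entropies coincide even though only the first is semantically structured.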
IV. EXPERIMENTS
A. Denoising Study on PASCAL-VOC
The U-SAID denoiser takes RGB images as input and outputs the reconstructed images. We choose the PASCAL-VOC 2012 training set, and add i.i.d. Gaussian noise with zero mean and standard deviation σ to synthesize the noisy input images during training. Our testing set is generated similarly by adding noise to the PASCAL-VOC 2012 validation set. Since we use CDnCNN-B as the backbone denoiser, we focus on the challenging blind denoising scenario, by setting the Gaussian noise standard deviation σ to range uniformly within [0, 55] for the training set, creating a "one-for-all" denoiser that can be evaluated directly on testing sets with various σs. The PASCAL-VOC 2012 sets have 20 classes of objects of interest, plus a background class, leading to K = 21 unless otherwise specified.
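A sketch of this blind-training noise synthesis, assuming image batches in the [0, 255] range:

```python
import torch

def add_blind_gaussian_noise(clean, sigma_max=55.0):
    """Synthesize blind-training inputs (sketch): draw one sigma per image
    uniformly from [0, sigma_max] and add i.i.d. zero-mean Gaussian noise.
    `clean` is a (B, 3, H, W) batch assumed to be in [0, 255]."""
    sigma = torch.empty(clean.size(0), 1, 1, 1,
                        device=clean.device).uniform_(0.0, sigma_max)
    return clean + sigma * torch.randn_like(clean)
```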
TABLE I: The average image denoising performance comparison on the PASCAL-VOC 2012 validation set, with σ = 15, 25, 35. Red is the best and blue is the second-best result (the same hereinafter).

                 CDnCNN-B   S-SAID   U-SAID
σ=15  PSNR (dB)  33.56      33.40    33.50
      SSIM       0.9159     0.9136   0.9153
      NIQE       4.3290     4.0782   4.0049
σ=25  PSNR (dB)  31.18      31.01    31.13
      SSIM       0.8725     0.8698   0.8724
      NIQE       4.2247     3.8508   3.8975
σ=35  PSNR (dB)  29.65      29.47    29.59
      SSIM       0.8344     0.8312   0.8347
      NIQE       4.1022     3.6679   3.7612

We compare U-SAID with the original CDnCNN-B (re-trained on our training set) [52], which requires no segmentation information at all. We further create another denoiser following the same idea as [29]: cascading CDnCNN-B with a supervised segmentation network (i.e., replacing L_USA with a standard pixel-wise softmax loss), with all other training protocols and initializations the same as U-SAID. We call it supervised segmentation-aware image denoising (S-SAID), and train it with the hybrid MSE-segmentation loss (the two losses weighted equally), using the ground-truth segmentation maps available for the PASCAL training set.
Note that S-SAID is the only method that exploits "true" segmentation information, making it a natural baseline for U-SAID to show the effect of such extra information. We do not include other denoising methods such as [7], [3], [13] because: 1) their average performance has been shown to be worse than CDnCNN; and 2) most of them are not designed for the blind denoising scenario, making fair comparisons hard. We have exhaustively tuned the hyper-parameters (learning rates, etc.) for CDnCNN-B and S-SAID, to ensure the optimal performance of either baseline.

The typical metric used for image denoising is PSNR, which has been shown to correlate poorly with human assessment of visual quality [18]. Moreover, since PSNR is a monotone function of the MSE (see the sketch after the following observations), a model trained by minimizing MSE on the image domain should always outperform a model trained by minimizing a hybrid weighted loss under this metric. Therefore, we emphasize that the goal of the following experiments is not to pursue the highest PSNR, but to quantitatively demonstrate the different behaviors of models with and without segmentation awareness.

Table I reports the denoising performance in terms of PSNR, SSIM and the Naturalness Image Quality Evaluator (NIQE) [33]. The last is a well-known no-reference image quality score indicating the perceived "naturalness" of an image: a smaller score indicates better perceptual quality. Our observations from Table I are summarized as follows:

• Since CDnCNN-B is optimized towards the MSE loss, it is not surprising that it consistently achieves the best PSNR results. However, U-SAID achieves only marginally inferior PSNR/SSIM values compared to CDnCNN-B, and usually surpasses S-SAID.

• The two methods with segmentation awareness (U-SAID and S-SAID) are significantly more favored by NIQE, showing a large margin over CDnCNN-B (e.g., nearly 0.4 at σ = 25). That testifies to the benefits of considering high-level tasks for denoising.

• While not exploiting the true segmentation maps during training as S-SAID does, U-SAID is almost as competitive as S-SAID under the NIQE metric. In other words, we do not lose much without using the true segmentation as supervision.
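For reference, the PSNR-MSE relation underlying the argument above, as a minimal sketch:

```python
import torch

def psnr(x, y, peak=255.0):
    """PSNR in dB between images x and y with dynamic range [0, peak].
    PSNR = 10 * log10(peak^2 / MSE) is monotonically decreasing in MSE,
    which is why an MSE-trained model is favored under this metric."""
    mse = torch.mean((x - y) ** 2)
    return 10.0 * torch.log10(peak ** 2 / mse)
```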
TABLE II: Ablation study of varying K in U-SAID training.

K       10      15      20      21 (default)   22      25      40
NIQE    3.9878  3.8320  4.0783  3.8975         3.8455  4.1139  3.9746
PSNR    31.00   31.06   30.99   31.13          31.01   30.99   30.98

a) Ablation Study on "Unsupervised Segmentation": In training U-SAID above, we used the "true" class number K = 21. It is natural to ask: is this ground-truth value really best for training denoisers? And if the class number cannot be accurately inferred when tackling general images, how much would the denoising performance be affected? We hereby present an ablation study, training several U-SAID models with different K values (all else unchanged) and comparing their denoising performance on the testing set, as displayed in Table II. It is encouraging to observe that the U-SAID denoising performance in PSNR overall increases as K grows from smaller values (10, 15) towards the true value (21), and then gradually decreases as K gets larger. The NIQE values show a similar trend, except that the peak is slightly shifted to K = 15. This serves as side evidence that, rather than learning a semantically blind discriminator, the USA module indeed picks up the semantic class information and benefits from a correct estimate of K. On the other hand, the variation of denoising performance w.r.t. K is mild and smooth, showing a certain robustness to inaccurate Ks too.

b) More Comparison to Relevant Methods: To solidify our results, we include more off-the-shelf denoising methods for comparison. We performed these experiments on the Kodak dataset with three test sigmas: 15, 25 and 35. A detailed comparison of the methods is shown in Table III. However, while all methods mentioned previously, i.e., CDnCNN-B, S-SAID and U-SAID, are blind to the noise level, the competing methods are non-blind. Therefore, we created two settings to simulate blind denoising:

• applying the median sigma as the denoising input (σ = 25);

• assuming the oracle sigma is known in denoising.

The second setting is apparently unfair to our blind model. Even so, as the results in Table IV show, U-SAID consistently yields the best performance.
B. Segmentation Study on PASCAL-VOC
We next investigate the effectiveness of denoising as a pre-processing step for semantic segmentation over noisy images, following the setting in [29]. We first pass the noisy images in the PASCAL-VOC testing set through each of the three learned denoisers (CDnCNN-B, S-SAID, and U-SAID).
TABLE III: Comparison of different methods. The three categories (columns) verify whether the methods i) use deep learning, ii) are semantic-aware denoising methods, and iii) require extra segmentation annotation.

               Deep Learning   Semantic-Aware   Segmentation Annotation
U-SAID               ✓               ✓
S-SAID               ✓               ✓                    ✓
CDnCNN-B             ✓
MLP [3]              ✓
MC-WNNM [49]
CBM3D [8]

TABLE IV: The average image denoising performance comparison in NIQE/PSNR on the Kodak dataset, with noise σ = 15, 25, 35, respectively.

Setting I        σ=15           σ=25           σ=35
MLP [3]          4.3924/29.83   3.0205/30.09   6.5367/23.50
MC-WNNM [49]     5.6334/31.04   3.6731/31.35   8.6496/21.53
CBM3D [8]        3.7707/32.60   2.6152/31.81   6.7044/25.29

Setting II       σ=15           σ=25           σ=35
MLP [3]          4.675/29.11    3.008/30.09    3.070/28.67
MC-WNNM [49]     3.302/33.94    3.673/31.35    4.039/29.70
CBM3D [8]        2.6360/34.40   2.6620/31.81   2.6786/30.04
The average Image denoising performance comparisonin NIQE/ PSNR on the Kodak dataset, with noise σ = 15, 25, 35,respectively. U-SAID). We then apply a FPN pre-trained on the cleanPASCAL-VOC 2012 training set , on the denoised testing sets,and evaluate the segmentation performance in terms of meanintersection-over-union (mIOU).As compared in Table 3, when we apply the CDnCNN-B denoiser without considering high-level semantics, it easilyfails to achieve high segmentation accuracy due to the artifactsintroduced during denoising (even though those artifacts mightnot be reflected by PSNR or SSIM). With their segmentationawareness, both S-SAID and U-SAID have led to remarkablyhigher mIOUs. Most impressively, U-SAID is comparableto S-SAID, provided that the former has never seen truesegmentation information on this dataset (training set) , whilethe latter does. Figure 3 has visually confirmed the impact ofdenoisers on the segmentation performance.
C. Generalizability Study: Data, Semantics, and Task
In this section, we define and compare three aspects of general usability, which were often overlooked in previous research on learning-based denoisers:

• Data Generalizability: whether a denoiser trained on one dataset is applicable to restoring another.

• Semantic Generalizability: whether a denoiser trained on one dataset is effective in preserving semantics, as the pre-processing step for applying semantic segmentation over another noisy dataset (with unseen classes).

• Task Generalizability: whether a denoiser trained with segmentation awareness is also effective as pre-processing for other high-level tasks over noisy images.
Fig. 3: Visualized semantic segmentation examples from the PASCAL VOC 2012 validation set. The first row is added with noise of σ = 15, the second row σ = 25, and the third row σ = 35. Columns (a)-(b) are the ground-truth images and true segmentation maps; (c)-(e) are the results of applying the pre-trained segmentation model on the images denoised by (c) CDnCNN-B; (d) S-SAID; and (e) U-SAID.

TABLE V: Segmentation results (mIoU) after denoising noisy image inputs, averaged over the PASCAL VOC 2012 validation set.

        noisy    CDnCNN-B   S-SAID   U-SAID
σ=15    0.4227   0.4238     0.4349   0.4336
σ=25    0.4007   0.4003     0.4084   0.4047
σ=35    0.3667   0.3724     0.3802   0.3785
Throughout the whole section below, all three denoisers used are the same models trained on PASCAL-VOC 2012 above.
There is no re-training involved. Our hypothesis is that, since U-SAID is not trained with any annotation on the original training set, it is less likely to overfit the training set's semantics than S-SAID, while still preserving discriminative features, and hence could generalize better to various unseen data, semantics and tasks. a) Denoising Unseen Noisy Datasets:
We evaluate the denoising performance over the widely used Kodak dataset (http://r0k.us/graphics/kodak/), consisting of 24 color images. Table VI reports the quantitative results, which show strong consistency across all three noise levels: CDnCNN-B achieves the highest PSNR and SSIM values, while S-SAID performs the best in terms of NIQE. Interestingly, U-SAID appears to be the "balanced" solution in terms of data generalizability: it obtains PSNR and SSIM values very close to CDnCNN-B, while producing NIQE values comparable to or even better than S-SAID (especially at smaller σs). We further observe that U-SAID is usually able to preserve sharper edges and textures than CDnCNN-B, sometimes even better than S-SAID. Figure 4 displays a group of examples, where U-SAID shows clear advantages in preserving local fine details on the sail. Please refer to Figure 5 for more visualizations.

TABLE VI: The average image denoising performance comparison on the Kodak dataset, with noise σ = 15, 25, 35, respectively.

             CDnCNN-B   S-SAID   U-SAID
σ=15  PSNR   34.75      34.57    34.62
      SSIM   0.9242     0.9217   0.9222
      NIQE   2.7570     2.6288   2.5690
σ=25  PSNR   32.27      32.07    32.17
      SSIM   0.8812     0.8770   0.8790
      NIQE   2.8493     2.6006   2.6355
σ=35  PSNR   30.69      30.48    30.50
      SSIM   0.8418     0.8366   0.8395
      NIQE   2.9753     2.5619   2.6687

Fig. 4: Visual comparison on one Kodak image at σ = 25. We show the full images (top) and zoom-in regions (bottom) of the ground truth as well as the three denoised images by (a) CDnCNN-B (PSNR = 34.31, NIQE = 3.01), (b) S-SAID (PSNR = 34.02, NIQE = 2.76) and (c) U-SAID (PSNR = 34.27, NIQE = 2.71). Best viewed on a high-resolution color display; lower NIQE is better.

b) Denoising for Unseen Dataset Segmentation: We choose two recently released real-world datasets whose class categories are substantially different from PASCAL VOC: i) the ISIC 2018 dataset [6] (https://challenge2018.isic-archive.com), where we choose the validation set of Task 1: Lesion Segmentation, whose goal is to predict lesion segmentation boundaries in dermoscopic lesion images; and ii) the DeepGlobe dataset (http://deepglobe.org), where we choose the validation set of Track 3: Land Cover Classification, whose goal is to predict a pixel-level mask of land cover types (urban, agriculture, rangeland, forest, water, barren, and unknown) from satellite images.

We add σ = 25 noise to both validation sets, to create unseen testing sets for the trained denoisers. For either denoised validation set, we apply a pyramid scene parsing network (PSPNet) [53] that is pre-trained on the original clean training set.

TABLE VII: Segmentation results (mIoU) after denoising noisy image inputs, on the ISIC 2018 and DeepGlobe validation sets, respectively.

             noisy    CDnCNN-B   S-SAID   U-SAID
ISIC 2018    0.8061   0.8076     0.8084   0.8095
DeepGlobe    0.1309   0.4260     0.4198   0.4263

Table VII reports the generalization effects of the three denoisers when serving as pre-processing for segmenting unseen noisy datasets: U-SAID performs the best on both datasets, again verifying the benefits of segmentation awareness (which comes "for free", with no knowledge of true segmentation on any dataset). Notably, while we observed in the PASCAL-VOC segmentation experiment that the fully-supervised S-SAID is always superior to the segmentation-unaware CDnCNN-B, this is no longer always the case when applied to unseen datasets of different semantic categories: even CDnCNN-B is able to outperform S-SAID on DeepGlobe. Our hypothesis is that the full supervision of S-SAID might cause a certain overfitting to the PASCAL-VOC object categories. Trained in an unsupervised fashion but still equipped with segmentation awareness, U-SAID is not closely tied to the original class semantics of the training set, and might thus generalize better to extracting and preserving semantics from new categories. c) Denoising for Unseen High-Level Tasks:
We now investigate if the segmentation-aware image denoising can also enhance other high-level vision applications, and choose classification and detection as two representative examples.
TABLE VIII: Classification results after denoising noisy image inputs (σ = 25) from CIFAR-100.

            noisy    CDnCNN-B   S-SAID   U-SAID
PSNR        20.17    29.13      28.94    28.98
SSIM        0.6556   0.9232     0.9203   0.9219
Top-1 Acc   ...      ...        ...      ...
Top-5 Acc   ...      ...        ...      ...
For classification, we choose the CIFAR-100 benchmark, and add σ = 25 noise to its validation set. We then pass it through the three denoisers, followed by a ResNet-110 classification model pre-trained on the clean CIFAR-100 training set. As seen from Table VIII, while U-SAID is second best in terms of both PSNR and SSIM (marginally inferior to CDnCNN-B), it demonstrates a notable boost in terms of both top-1 and top-5 accuracies, with a good margin over CDnCNN-B and S-SAID. While S-SAID also outperforms CDnCNN-B in improving classification, U-SAID proves to have even better generalizability here.

For detection, we choose the MS COCO benchmark [26], and add σ = 15, 25, 35 noise to its validation set. We evaluate the three denoisers in the same way as for the classification experiment, using a pre-trained YOLOv3 detection model [36]. Table IX shows consistent observations: U-SAID always leads to the largest improvements in detection mean average precision (mAP), and hence has the best task generalizability among all. Another interesting observation is ...
Fig. 5: More denoised visualizations from the Kodak dataset by CDnCNN-B, S-SAID and U-SAID under three different noise levels (lower NIQE indicates better visual quality). Per-panel metrics for (a) CDnCNN-B, (b) S-SAID and (c) U-SAID: σ = 15: (PSNR = 35.52, NIQE = 3.0163), (PSNR = 35.31, NIQE = 3.0701), (PSNR = 35.45, NIQE = 2.9791); σ = 25: (PSNR = 30.71, NIQE = 2.6303), (PSNR = 30.49, NIQE = 1.9944), (PSNR = 30.66, NIQE = 2.3407); σ = 35: (PSNR = 33.75, NIQE = 3.6207), (PSNR = 33.41, NIQE = 3.0099), (PSNR = 33.63, NIQE = 3.0905).
TABLE IX: Detection results after denoising noisy MS COCO images.

              noisy    CDnCNN-B   S-SAID   U-SAID
σ=15  PSNR    24.61    35.14      34.92    35.01
      SSIM    0.4796   0.9440     0.9410   0.9411
      mAP     ...      ...        ...      ...
σ=25  PSNR    20.17    32.70      32.48    32.60
      SSIM    0.3233   0.9137     0.9095   0.9108
      mAP     ...      ...        ...      ...
σ=35  PSNR    17.25    31.12      30.89    31.02
      SSIM    0.2383   0.8861     0.8803   0.8821
      mAP     ...      ...        ...      ...

D. Statistical Significance Study of U-SAID's Improvement
TABLE X: Performance and variance on three different tasks.

                                  CDnCNN-B   S-SAID    U-SAID
PASCAL VOC segmentation
  mIoU                            39.46%     40.19%    40.35%
  Variance                        3.30E-6    3.98E-6   3.15E-6
Cross-set Kodak denoising
  NIQE                            2.87       2.60      2.62
  Variance                        1.74E-4    1.78E-4   6.00E-4
Cross-task CIFAR-100 classification
  Top-1 accuracy                  56.89%     57.82%    58.47%
  Top-1 variance                  0.03       0.06      0.02
  Top-5 accuracy                  82.89%     83.57%    83.91%
  Top-5 variance                  0.02       0.05      0.06
How consistent and statistically meaningful is U-SAID's performance advantage? To answer this, we report detailed statistics: (1) the p-values of the denoising quality improvement over different testing images; and (2) the variance of the performance improvements with different simulated noise patterns, for three representative experiments: PASCAL VOC segmentation (Table V), cross-set Kodak denoising (Table VI), and cross-task CIFAR-100 classification (Table VIII). For each test, we simulated i.i.d. random Gaussian noise (σ = 25) for each image ten times, and repeated the experiments on them accordingly. The results are shown in Table X.

In the PASCAL VOC segmentation experiment, we performed hypothesis tests to check whether U-SAID leads to better segmentation results than CDnCNN-B. At the 95% confidence level, the resulting p-value falls well below 0.05, which demonstrates the statistical significance of the improvement. On the other hand, the results of U-SAID and S-SAID do not show a significant difference (p-value > 0.05). Without using any segmentation ground truth, our method achieves statistically similar results to S-SAID, even under a disadvantageous setting.

For the cross-set Kodak denoising experiment, the NIQE of U-SAID is statistically significantly better than that of CDnCNN-B (p-value < 0.05). Similarly, S-SAID is significantly better than U-SAID in NIQE (p-value < 0.05).

In the CIFAR-100 experiment, for top-1 accuracy, U-SAID yields a mean accuracy of 58.47%, which is significantly higher than that of CDnCNN-B (mean = 56.89%), with p-value = 3.6147E-14. U-SAID also has higher accuracy than S-SAID (mean = 57.82%), with p-value = 1.3486E-6. Similarly, for top-5 accuracy, U-SAID's performance (83.91%) is statistically significantly better than CDnCNN-B (82.89%) and S-SAID (83.57%), with p-values of 1.3982E-9 and 4.3994E-3, respectively.
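The paper does not state the exact test it uses; one plausible choice for this repeated-noise protocol is a one-sided paired t-test over matched runs, sketched below:

```python
from scipy import stats

def paired_significance(scores_a, scores_b):
    """One-sided paired t-test (sketch): p-value for "method A beats
    method B" over matched noise realizations of the same images.
    `scores_a` and `scores_b` are equal-length lists of per-run scores."""
    t, p_two_sided = stats.ttest_rel(scores_a, scores_b)
    return p_two_sided / 2 if t > 0 else 1 - p_two_sided / 2
```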
V. CONCLUSION

This paper proposes a segmentation-aware image denoising model that requires no ground-truth segmentation map for training. The proposed U-SAID model achieves performance comparable to its supervised counterpart, in terms of both low-level (denoising) and high-level (segmentation) vision metrics, when trained on and applied to the same noisy dataset (without utilizing the extra segmentation information that the latter requires). Furthermore, U-SAID shows remarkable generalizability to unseen data, semantics, and high-level tasks, all of which endorse it as a highly robust, effective and general-purpose denoising option.
REFERENCES

[1] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12):2481–2495, 2017.
[2] Antoni Buades, Bartomeu Coll, and Jean-Michel Morel. A review of image denoising algorithms, with a new one. Multiscale Modeling & Simulation, 4(2):490–530, 2005.
[3] Harold C. Burger, Christian J. Schuler, and Stefan Harmeling. Image denoising: Can plain neural networks compete with BM3D? In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2392–2399. IEEE, 2012.
[4] Huaijin Chen, Jinwei Gu, Orazio Gallo, Ming-Yu Liu, Ashok Veeraraghavan, and Jan Kautz. Reblur2Deblur: Deblurring videos via self-supervised learning. In IEEE International Conference on Computational Photography (ICCP), pages 1–9. IEEE, 2018.
[5] Bowen Cheng, Zhangyang Wang, Zhaobin Zhang, Zhu Li, Ding Liu, Jianchao Yang, Shuai Huang, and Thomas S. Huang. Robust emotion recognition from low quality and low bit rate video: A deep learning approach. Pages 65–70. IEEE, 2017.
[6] Noel C. F. Codella, David Gutman, M. Emre Celebi, Brian Helba, Michael A. Marchetti, Stephen W. Dusza, Aadi Kalloo, Konstantinos Liopyris, Nabin Mishra, Harald Kittler, et al. Skin lesion analysis toward melanoma detection: A challenge at the 2017 International Symposium on Biomedical Imaging (ISBI), hosted by the International Skin Imaging Collaboration (ISIC). In IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), pages 168–172. IEEE, 2018.
[7] Kostadin Dabov, Alessandro Foi, Vladimir Katkovnik, and Karen Egiazarian. Image denoising by sparse 3-D transform-domain collaborative filtering. IEEE Transactions on Image Processing, 16(8):2080–2095, 2007.
[8] Kostadin Dabov, Alessandro Foi, Vladimir Katkovnik, and Karen O. Egiazarian. Color image denoising via sparse 3D collaborative filtering with grouping constraint in luminance-chrominance space. In ICIP (1), pages 313–316, 2007.
[9] Kostadin Dabov, Alessandro Foi, Vladimir Katkovnik, and Karen Egiazarian. Color image denoising via sparse 3D collaborative filtering with grouping constraint in luminance-chrominance space. In ICIP. IEEE, 2007.
[10] Carl Doersch and Andrew Zisserman. Multi-task self-supervised visual learning. In IEEE International Conference on Computer Vision (ICCV), 2017.
[11] Michael Elad and Michal Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 2006.
[12] Zhiwen Fan, Liyan Sun, Xinghao Ding, Yue Huang, Congbo Cai, and John Paisley. A segmentation-aware deep fusion network for compressed sensing MRI. arXiv preprint arXiv:1804.01210, 2018.
[13] Shuhang Gu, Lei Zhang, Wangmeng Zuo, and Xiangchu Feng. Weighted nuclear norm minimization with application to image denoising. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2862–2869, 2014.
[14] Shi Guo, Zifei Yan, Kai Zhang, Wangmeng Zuo, and Lei Zhang. Toward convolutional blind denoising of real photographs. arXiv preprint arXiv:1807.04686, 2018.
[15] Adam W. Harley, Konstantinos G. Derpanis, and Iasonas Kokkinos. Segmentation-aware convolutional networks using local attention masks. In IEEE International Conference on Computer Vision (ICCV), 2017.
[16] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In IEEE International Conference on Computer Vision (ICCV), pages 2980–2988. IEEE, 2017.
[17] Alain Hore and Djemel Ziou. Image quality metrics: PSNR vs. SSIM. In 20th International Conference on Pattern Recognition (ICPR), pages 2366–2369. IEEE, 2010.
[18] Quan Huynh-Thu and Mohammed Ghanbari. Scope of validity of PSNR in image/video quality assessment. Electronics Letters, 44(13):800–801, 2008.
[19] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision. Springer, 2016.
[20] Boyi Li, Xiulian Peng, Zhangyang Wang, Jizheng Xu, and Dan Feng. AOD-Net: All-in-one dehazing network. In Proceedings of the IEEE International Conference on Computer Vision, 2017.
[21] Boyi Li, Wenqi Ren, Dengpan Fu, Dacheng Tao, Dan Feng, Wenjun Zeng, and Zhangyang Wang. Benchmarking single-image dehazing and beyond. IEEE Transactions on Image Processing, 28(1):492–505, 2019.
[22] Siyuan Li, Iago Breno Araujo, Wenqi Ren, Zhangyang Wang, Eric K. Tokuda, Roberto Hirata Junior, Roberto Cesar-Junior, Jiawan Zhang, Xiaojie Guo, and Xiaochun Cao. Single image deraining: A comprehensive benchmark analysis. arXiv preprint arXiv:1903.08558, 2019.
[23] Weizhi Li, Xiaoning Qian, and Jim Ji. Noise-tolerant deep learning for histopathological image segmentation. In IEEE International Conference on Image Processing (ICIP), pages 3075–3079. IEEE, 2017.
[24] Tsung-Yi Lin, Piotr Dollár, Ross B. Girshick, Kaiming He, Bharath Hariharan, and Serge J. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
[25] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
[26] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[27] Ding Liu, Bowen Cheng, Zhangyang Wang, Haichao Zhang, and Thomas S. Huang. Enhance visual recognition under adverse conditions via deep networks. IEEE Transactions on Image Processing, 2019.
[28] Ding Liu, Bihan Wen, Jianbo Jiao, Xianming Liu, Zhangyang Wang, and Thomas S. Huang. Connecting image denoising and high-level vision tasks via deep learning. arXiv preprint arXiv:1809.01826, 2018.
[29] Ding Liu, Bihan Wen, Xianming Liu, Zhangyang Wang, and Thomas S. Huang. When image denoising meets high-level vision tasks: A deep learning approach. arXiv preprint arXiv:1706.04284, 2017.
[30] Zhiwu Lu, Zhenyong Fu, Tao Xiang, Peng Han, Liwei Wang, and Xin Gao. Learning from weak and noisy labels for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(3):486–500, 2017.
[31] Julien Mairal, Francis Bach, Jean Ponce, Guillermo Sapiro, and Andrew Zisserman. Non-local sparse models for image restoration. In IEEE 12th International Conference on Computer Vision, pages 2272–2279. IEEE, 2009.
[32] Xiaojiao Mao, Chunhua Shen, and Yu-Bin Yang. Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections. In Advances in Neural Information Processing Systems, pages 2802–2810, 2016.
[33] Anish Mittal, Rajiv Soundararajan, and Alan C. Bovik. Making a "completely blind" image quality analyzer. IEEE Signal Processing Letters, 20(3):209–212, 2013.
[34] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A. Efros. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2536–2544, 2016.
[35] Tobias Plötz and Stefan Roth. Benchmarking denoising algorithms with real photographs. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[36] Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
[37] Scott Reed, Honglak Lee, Dragomir Anguelov, Christian Szegedy, Dumitru Erhan, and Andrew Rabinovich. Training deep neural networks on noisy labels with bootstrapping. arXiv preprint arXiv:1412.6596, 2014.
[38] Leonid I. Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1-4):259–268, 1992.
[39] Ziyi Shen, Wei-Sheng Lai, Tingfa Xu, Jan Kautz, and Ming-Hsuan Yang. Deep semantic face deblurring. arXiv preprint arXiv:1803.03345, 2018.
[40] Sainbayar Sukhbaatar, Joan Bruna, Manohar Paluri, Lubomir Bourdev, and Rob Fergus. Training convolutional networks with noisy labels. arXiv preprint arXiv:1406.2080, 2014.
[41] Carlo Tomasi and Roberto Manduchi. Bilateral filtering for gray and color images. In Sixth International Conference on Computer Vision, pages 839–846. IEEE, 1998.
[42] Jonathan Tompson, Ross Goroshin, Arjun Jain, Yann LeCun, and Christoph Bregler. Efficient object localization using convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 648–656, 2015.
[43] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Deep image prior. In CVPR, 2018.
[44] Andreas Veit, Neil Alldrin, Gal Chechik, Ivan Krasin, Abhinav Gupta, and Serge J. Belongie. Learning from noisy large-scale datasets with minimal supervision. In CVPR, pages 6575–6583, 2017.
[45] Rosaura G. VidalMata, Sreya Banerjee, Brandon RichardWebster, Michael Albright, Pedro Davalos, Scott McCloskey, Ben Miller, Asong Tambo, Sushobhan Ghosh, Sudarshan Nagesh, et al. Bridging the gap between computational photography and visual recognition. arXiv preprint arXiv:1901.09482, 2019.
[46] Zhangyang Wang, Shiyu Chang, Yingzhen Yang, Ding Liu, and Thomas S. Huang. Studying very low resolution recognition using deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4792–4800, 2016.
[47] B. Wen, S. Ravishankar, and Y. Bresler. Structured overcomplete sparsifying transform learning with convergence guarantees and applications. International Journal of Computer Vision, 114(2):137–167, 2015.
[48] Jiqing Wu, Radu Timofte, Zhiwu Huang, and Luc Van Gool. On the relation between color image denoising and classification. arXiv preprint arXiv:1704.01372, 2017.
[49] Jun Xu, Lei Zhang, David Zhang, and Xiangchu Feng. Multi-channel weighted nuclear norm minimization for real color image denoising. In Proceedings of the IEEE International Conference on Computer Vision, pages 1096–1104, 2017.
[50] Ye Yuan, Wenhan Yang, Wenqi Ren, Jiaying Liu, Walter J. Scheirer, and Zhangyang Wang. UG2 track 2: A collective benchmark effort for evaluating and advancing image understanding in poor visibility environments. arXiv preprint arXiv:1904.04474, 2019.
[51] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
[52] Kai Zhang, Wangmeng Zuo, Yunjin Chen, Deyu Meng, and Lei Zhang. Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising. IEEE Transactions on Image Processing, 26(7):3142–3155, 2017.
[53] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In CVPR, 2017.