Deep learning architectural designs for super-resolution of noisy images
Angel Villar-Corrales, Franziska Schirrmacher, Christian Riess
IT Security Infrastructures Lab, University of Erlangen-Nürnberg
ABSTRACT
Recent advances in deep learning have led to significant improvements in single image super-resolution (SR) research. However, due to the amplification of noise during the upsampling steps, state-of-the-art methods often fail at reconstructing high-resolution images from noisy versions of their low-resolution counterparts. This is especially problematic for images from unknown cameras with unseen types of image degradation. In this work, we propose to jointly perform denoising and super-resolution. To this end, we investigate two architectural designs: "in-network" combines both tasks at feature level, while "pre-network" first performs denoising and then super-resolution. Our experiments show that both variants have specific advantages: The in-network design obtains the strongest results when the type of image corruption is aligned in the training and testing dataset, for any choice of denoiser. The pre-network design exhibits superior performance on unseen types of image corruption, which is a pathological failure case of existing super-resolution models. We hope that these findings help to enable super-resolution also in less constrained scenarios where source camera or imaging conditions are not well controlled. Source code and pretrained models are available at https://github.com/angelvillar96/super-resolution-noisy-images.

Index Terms — Super-resolution, Denoising, Deep learning, Image enhancement
1. INTRODUCTION
Single image super-resolution (SR) aims at recovering a high-resolution (HR) image from its low-resolution (LR) counterpart, in which high-frequency details have been lost due to degrading factors such as blur, hardware limitations, or decimation.

Early SR approaches were based on upsampling and interpolation techniques [1, 2]. However, these methods are limited in their representational power, and hence also limited in their ability to predict realistic high-resolution images. More complex methods construct mapping functions between low- and high-resolution images. Such a mapping function can be obtained from a variety of different techniques, including sparse coding [3, 4], random forests [5, 6], or embedding approaches [7, 8]. Recently, deep learning methods for super-resolution have led to considerable performance improvements [9, 10]. ResNet-like architectures [11] obtain state-of-the-art results for SR tasks while maintaining low computational complexity [12, 13, 14]. Despite these successes, it is still challenging to prevent the amplification of noise during the upsampling steps, which often leads to loss of information and the emergence of artifacts.
This work was supported in part by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – Project number 146371743 – TRR 89 Invasive Computing and the German Research Foundation, GRK Cybercrime (393541319/GRK2475/1-2019).
Fig. 1. Original WDSR architecture (top) in comparison to the pre-network architectural design (center) and the in-network architectural design (bottom). Pre-network cascades denoising and super-resolution. In-network combines low-level features from the denoised input with high-level features from the noisy input.

Several approaches have been considered to jointly perform super-resolution and denoising. Image restoration can be formulated as an inverse problem. In this approach, the data term for the respective objectives is specific to the respective task. For the prior, a more generic function can be chosen that applies to multiple tasks, for example a deep learning model [15], a denoiser employed as regularizer [16, 17], or a so-called plug-and-play prior [18]; a generic form of this formulation is given at the end of this section. Furthermore, deep learning approaches have also been considered to combine denoising with an SR model [19]. The authors in [20] propose cascading a denoiser with an SR model so that the output of the denoiser is fed to the SR network. The pre-network architectural design in our experiments is similar in spirit, but instead of using a fixed convolutional neural network (CNN) denoiser, our design allows for further flexibility regarding the choice of the integrated denoiser.

In contrast to previous works, we compare two architectures that allow for further flexibility regarding the choice of the integrated denoiser. This flexibility can be used to incorporate domain knowledge into the network by selecting a denoising technique optimized for the particular type of degradation. For scenarios where domain knowledge is missing, we investigate the generalization capability of different denoisers and the proposed architectural designs. Especially for low-resolution images from cameras in-the-wild, image degradation, such as unseen noise distributions, can lead to artifacts in the reconstructed high-resolution images.

The first architecture, "pre-network", includes a denoiser as a preprocessing step to the super-resolution. The second architecture, "in-network", reconstructs the HR image by combining low-level features extracted from the denoised input and high-level features extracted from the noisy input. To the best of our knowledge, this type of network design has not yet been studied in the context of denoising and super-resolution.

We evaluate our architectures with the Wide Activation Super-Resolution model (WDSR) [13] on noisy versions of the images from the DIV2K dataset [21]. Both architectures aim to suppress the noise corruptions while still recovering most of the high-frequency information. We show that both architectures achieve more realistic reconstructions and better PSNR values than WDSR alone.
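For concreteness, the inverse-problem view mentioned above can be written in a generic variational form; the notation here is ours, a textbook-style summary rather than an equation taken from the cited works:

$$\hat{x} \;=\; \arg\min_{x}\ \underbrace{\|Ax - y\|_2^2}_{\text{task-specific data term}} \;+\; \lambda\,\underbrace{R(x)}_{\text{generic prior}}$$

Here, $y$ is the observed degraded image, $A$ models the task-specific degradation (e.g., blur and decimation for super-resolution, the identity for denoising), $R$ is a prior that can be shared across tasks (e.g., a denoiser used as regularizer [16, 17] or a plug-and-play prior [18]), and $\lambda$ balances the two terms.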
2. METHODS
In this work, we employ the Wide Activation Super-Resolution (WDSR) model [13] as a building block to investigate architectures for joint denoising and super-resolution.

The original WDSR architecture is shown at the top of Figure 1. It consists of two paths. The main path is on top, consisting of a user-defined number B of residual blocks. Each block consists of two convolutional layers followed by weight normalization [22] and ReLU activation. The lower path is a residual connection. It provides low-level features from the input to the output, which is critically important for SR tasks [14]. Both paths contain a pixel-shuffle layer [23], which performs the upsampling for image super-resolution. WDSR is sensitive to input images that are corrupted by additive noise. However, we show in this work that it can be paired with a denoiser in two different ways, denoted as "pre-network" and "in-network", which both considerably improve the results.

Figure 1 shows both architectures. Pre-network (abbreviated pre-net) is shown in the middle. It first passes the image through the denoiser prior to branching into main path and skip connection. This is conceptually similar to the denoiser and SR concatenation by Bei et al. [20]. One potential limitation of this approach is error propagation: if the denoiser removes information that is relevant to super-resolution, it cannot be recovered afterwards.

In-network (abbreviated in-net) is shown at the bottom of Fig. 1. Here, the denoiser integrates into the residual connection. Hence, the SR model can jointly combine low-level features from the denoised input and high-level features from the noisy input.

Both designs are open to the choice of denoiser, which allows choosing a task-specific denoiser, i.e., one that performs best on an expected noise distribution. In our experiments, we evaluate three popular denoisers of varying complexity: the median filter [24], the Wiener filter [25], and denoising autoencoders (DAE) [26].
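To make the two designs concrete, the following PyTorch sketch shows how a denoiser can be wired into a simplified WDSR-style backbone in either configuration. It is a minimal illustration under several assumptions: the class names (`ResBlock`, `JointDenoiseSR`), channel counts, block count, and kernel sizes are ours, the blocks omit WDSR's weight normalization and wide-activation expansion, and `denoiser` stands for any interchangeable module (median filter, Wiener filter, or DAE).

```python
import torch
import torch.nn as nn


class ResBlock(nn.Module):
    """Simplified WDSR-style residual block: conv -> ReLU -> conv + skip."""

    def __init__(self, channels: int = 32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)


class JointDenoiseSR(nn.Module):
    """Wraps an SR backbone and a denoiser in 'pre' or 'in' configuration."""

    def __init__(self, denoiser: nn.Module, design: str = "in",
                 channels: int = 32, num_blocks: int = 4, scale: int = 2):
        super().__init__()
        assert design in ("pre", "in")
        self.design, self.denoiser = design, denoiser
        # Main path: feature extraction, residual blocks, upsampling.
        self.main = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1),
            *[ResBlock(channels) for _ in range(num_blocks)],
            nn.Conv2d(channels, 3 * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale),
        )
        # Skip path: one shallow conv + pixel shuffle, carrying
        # low-level features directly to the output.
        self.skip = nn.Sequential(
            nn.Conv2d(3, 3 * scale ** 2, 5, padding=2),
            nn.PixelShuffle(scale),
        )

    def forward(self, x):
        if self.design == "pre":
            # Pre-network: denoise first, then run both SR paths.
            x = self.denoiser(x)
            return self.main(x) + self.skip(x)
        # In-network: the main path sees the noisy input, while the
        # residual connection sees the denoised input.
        return self.main(x) + self.skip(self.denoiser(x))


# Example: use the identity as a stand-in denoiser and upscale a patch.
model = JointDenoiseSR(denoiser=nn.Identity(), design="in")
hr = model(torch.randn(1, 3, 48, 48))  # -> torch.Size([1, 3, 96, 96])
```

The only difference between the two configurations is which tensor feeds the residual connection, which is exactly the distinction drawn in Fig. 1.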
3. DATASET PREPARATION AND TRAINING PROCEDURE
The experiments use the DIV2K dataset [21]. In our experiments, we consider low-resolution images that have been downsampled by a factor of two. We use the 800 images from the training set for training or fine-tuning the models, and the 100 validation images for evaluation. Closely following [13], we feed the model RGB image patches extracted from the HR images, along with their noisy, bicubically downsampled counterparts.

We train our models using downsampled image patches corrupted with additive Gaussian noise. The testing data is corrupted using additive Gaussian noise with the same distribution as during training. Moreover, Poisson, speckle, and salt-and-pepper noise are used to degrade the testing data for the robustness analysis.

As a baseline, we consider two versions of WDSR. First, the publicly available version of WDSR, pretrained on the DIV2K dataset, denoted as "No tuning". Second, WDSR fine-tuned on DIV2K images with added noise, denoted as "No denoiser". The model weights are initialized with the pretrained WDSR weights. Each model is fine-tuned using the mean absolute error (MAE) loss function and the ADAM update rule [27]. To avoid overfitting, the fine-tuning procedure is stopped after 100 epochs. For the median and Wiener filters, we use square kernels with a side length of five pixels. For the denoising autoencoders, we construct a fully convolutional DAE composed of three convolutional layers with 64, 128, and 256 kernels, respectively. Each layer is followed by a ReLU activation function and max-pooling. DAEs are trained for 80 epochs using the MSE loss function and the ADAM update rule. For the integration into the WDSR model, the autoencoder parameters are fixed during the fine-tuning of the network.
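The following sketch illustrates how the four noise types could be synthesized on images normalized to [0, 1]. It is a plausible reference implementation, not the paper's actual corruption code; the function name `corrupt` and the default parameter values are illustrative assumptions.

```python
import numpy as np


def corrupt(img, kind, sigma=0.1, lam=50.0, p=0.05, seed=None):
    """Apply one noise model to a float image with values in [0, 1]."""
    rng = np.random.default_rng(seed)
    if kind == "gaussian":      # additive: x + n, n ~ N(0, sigma^2)
        out = img + rng.normal(0.0, sigma, img.shape)
    elif kind == "speckle":     # multiplicative: x * (1 + n)
        out = img * (1.0 + rng.normal(0.0, sigma, img.shape))
    elif kind == "poisson":     # signal-dependent shot noise
        out = rng.poisson(img * lam) / lam
    elif kind == "s&p":         # flip a fraction p of pixels to 0 or 1
        out = img.copy()
        mask = rng.random(img.shape)
        out[mask < p / 2] = 0.0
        out[mask > 1.0 - p / 2] = 1.0
    else:
        raise ValueError(f"unknown noise type: {kind}")
    return np.clip(out, 0.0, 1.0)
```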
4. EVALUATION

4.1. Aligned Noise in Training and Testing
The noise in real images can often be approximated as additive Gaussian noise, which we model in this experiment with a fixed power σ for training and testing. This evaluation assumes that noise distribution and strength are known during training. This is a strong assumption for practical cases, but this setup shows the fundamental capability of the models to capture non-ideal images.

Table 1 shows the numerical performance in terms of the peak signal-to-noise ratio (PSNR). The experiment is performed for additive Gaussian noise over a range of noise levels σ. For low noise levels, the fine-tuned WDSR ("No denoiser") and in-network with the median filter perform best. In-network with the denoising autoencoder outperforms the competing methods for higher noise levels. A consistent decrease in PSNR can be observed with increasing strength of the noise. This is expected, since image restoration becomes increasingly difficult with increasing noise strength. Pre-net and in-net are both affected by increasing noise, since the denoiser increasingly removes useful information from the image.

In comparison, the original WDSR model ("No tuning") is unable to accurately recover the HR image from its noisy LR counterpart. However, fine-tuning WDSR ("No denoiser") considerably improves the PSNR, for example by about 10 dB at the strongest tested noise level. Thus, the fine-tuned WDSR without any integrated denoiser learns to suppress noise and is on par with the best denoiser methods (i.e., in-net with autoencoder and median filter). Hence, WDSR's 16 residual blocks and 32 convolutional filters apparently possess sufficient representational power to jointly learn denoising and SR. However, we will show in the next section that these findings do not generalize to distortions that were not seen during training.

Figure 2 shows a qualitative comparison of the reconstructions of an image patch at a high noise strength. The in-network architecture yields very similar results for all denoisers. It exhibits comparable results to the fine-tuned WDSR and overall outperforms pre-network.

Table 1. Average peak signal-to-noise ratio (PSNR) for varying levels of additive Gaussian noise σ on the low-resolution testing images. Baseline WDSR fine-tuned on the noisy data ("No denoiser") performs best at low noise powers, while the in-net with an autoencoder denoiser achieves the best PSNR values at higher noise levels. Baseline WDSR without fine-tuning ("No tuning") performs worst (see text for details). [Table layout: one row per noise level σ; columns: No tuning, No denoiser, and In-net/Pre-net for each of the median filter, Wiener filter, and autoencoder denoisers.]
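For completeness, the PSNR used throughout this evaluation can be computed as follows. This small helper assumes float images in [0, 1] and is our own sketch rather than code from the released repository.

```python
import numpy as np


def psnr(reference: np.ndarray, estimate: np.ndarray) -> float:
    """Peak signal-to-noise ratio in dB for images with values in [0, 1]."""
    mse = np.mean((reference.astype(np.float64) - estimate) ** 2)
    # With a peak value of 1.0, PSNR = 10 * log10(1 / MSE).
    return float("inf") if mse == 0 else 10.0 * np.log10(1.0 / mse)
```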
Fig. 2. Comparison of different reconstructions of an image patch corrupted with Gaussian noise using different denoisers and architectural designs. "No tuning" corresponds to the reconstruction using the original WDSR trained on clean images, and "No denoiser" corresponds to the reconstruction using a WDSR model fine-tuned on noisy patches.

Upon closer examination, pre-net suffers from the denoiser's removal of relevant information at a very early processing stage. This can be seen in the holes in the hat: the holes appear blurry or are completely removed in all pre-net configurations. In-net performs much better since it complements the denoised image in the skip branch with original image information in the main branch.

In conclusion, when training and testing images are aligned in a fixed amount of Gaussian noise, the fine-tuned WDSR without any denoiser and the in-net exhibit the best PSNR results and generate the most accurate reconstructions.

4.2. Unseen Noise in Testing

In this experiment, we investigate the generalization to noise distributions that were not part of the training data. The models are trained on images with additive Gaussian noise of a fixed strength σ. Testing is performed on three different noise distributions. More specifically, we use speckle noise with strength σ, Poisson noise with rate λ, and salt-and-pepper (S&P) noise with a probability p of changing a pixel.

Table 2 shows quantitative results for this experiment. As a reference, we also include performances for testing on additive Gaussian noise, i.e., where training and testing are aligned. The fine-tuned WDSR ("No denoiser") performs best on images corrupted by Gaussian noise and speckle noise. We attribute this to the fact that speckle noise is a multiplicative distortion, but it follows the same probabilistic distribution as additive Gaussian noise. In-network with the median filter and the denoising autoencoder achieves similar results. However, pre-network with the median filter outperforms the competing methods by a large margin on images that are corrupted by salt-and-pepper noise and Poisson noise: an improvement over the baselines of more than 13 dB is achieved.

For noise distributions that significantly differ from those seen during training, it is highly beneficial to suppress as much of the input corruption as possible prior to performing SR. For this reason, pre-network performs best on Poisson noise and salt-and-pepper noise. The choice of the denoiser also affects the results. To this end, Fig. 3 shows qualitative results on Poisson noise. All configurations of in-net, as well as pre-net with the Wiener filter and the fine-tuned WDSR ("No denoiser"), suppress noise in homogeneous areas, such as the blue sky. However, the wheel exhibits more texture, and it still contains noise. Pre-net with the median filter removes the noise best, but the reconstructed image appears slightly over-smoothed.

In Fig. 4, qualitative results on salt-and-pepper noise are shown. Here, the median filter achieves the best reconstructions. The noise is completely removed, but some details of the lemon peel are also removed. We attribute both observations to the fact that median filtering is particularly well-suited for suppressing salt-and-pepper noise but does not preserve small structures. The Wiener filter in the pre-net configuration generates unwanted artifacts, since it is optimal for Gaussian denoising. Finally, it is interesting to note that the PSNR of the denoising autoencoder is oftentimes outperformed by the other denoisers.
This can be attributed to the fact that the autoencoder is trained on additive Gaussian noise, and it apparently has difficulties generalizing to other distributions without retraining.

Fig. 3. Comparison of different reconstructions of an image patch with Poisson noise using different denoisers and architectures (panels: Original, Input, No tuning, No denoiser, and the Median, Wiener, and DAE denoisers in pre-net and in-net configurations). "No tuning" corresponds to the reconstruction using the original WDSR trained on clean images, and "No denoiser" corresponds to the reconstruction using a WDSR model fine-tuned on noisy patches.
Table 2. Average PSNR values for noise distributions that differ from the training distribution. As a baseline, Gaussian noise is aligned with the training distribution. Here, and for the mathematically similar speckle noise, WDSR without a denoiser ("No denoiser") performs best. For the more challenging salt-and-pepper (S&P) noise and Poisson noise, the pre-network generalizes best, by a large margin.
[Table 2 layout: one row per noise type (Gaussian, speckle, Poisson, S&P); columns: No tuning, No denoiser, and In-net/Pre-net for each of the median filter, Wiener filter, and autoencoder denoisers. Gaussian noise, "No tuning": 18.78 dB.]
Fig. 4. Comparison of reconstructions of an image patch with salt-and-pepper noise using different denoisers and architectures. "No tuning" corresponds to the reconstruction using the original WDSR trained on clean images, and "No denoiser" corresponds to the reconstruction using WDSR fine-tuned on noisy patches. Only the pre-network with the median filter succeeds at denoising and reconstructing the HR image.

In summary, models that are trained on additive Gaussian noise do not generalize well beyond Gaussian and speckle noise. Pre-network combined with a suitable denoiser generalizes considerably better to different noise distributions than the competing methods.
5. CONCLUSION
In this paper, we compare two architectures to jointly perform image denoising and single-image super-resolution. We combine the well-known WDSR model [13] with three denoisers that can be chosen depending on the type of degradation. Both networks have specific benefits. The "pre-network" architecture sequentially removes the noise first and then recovers the high-resolution image. With a suitable denoiser, pre-network generalizes well to unseen noise distributions. However, details in the image are removed, and the reconstructions appear slightly over-smoothed. The "in-network" architecture reconstructs the high-resolution image by combining low-level features from the denoiser with high-level features from the noisy input. This enables better structure preservation and sharper reconstructed images, but is more sensitive to unseen noise distributions and strengths, independent of the chosen denoiser. We hope that these findings are useful toward enabling super-resolution in-the-wild, when camera and image conditions are not fully controlled.

6. REFERENCES

[1] Xin Li and Michael T. Orchard, "New edge-directed interpolation," IEEE Transactions on Image Processing, vol. 10, no. 10, pp. 1521–1527, 2001.
[2] Lei Zhang and Xiaolin Wu, "An edge-guided image interpolation algorithm via directional filtering and data fusion," IEEE Transactions on Image Processing, vol. 15, no. 8, pp. 2226–2238, 2006.
[3] Roman Zeyde, Michael Elad, and Matan Protter, "On single image scale-up using sparse-representations," in International Conference on Curves and Surfaces. Springer, 2010, pp. 711–730.
[4] Jianchao Yang, John Wright, Thomas S. Huang, and Yi Ma, "Image super-resolution via sparse representation," IEEE Transactions on Image Processing, vol. 19, no. 11, pp. 2861–2873, 2010.
[5] Jordi Salvador and Eduardo Perez-Pellitero, "Naive Bayes super-resolution forest," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 325–333.
[6] Samuel Schulter, Christian Leistner, and Horst Bischof, "Fast and accurate image upscaling with super-resolution forests," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3791–3799.
[7] Radu Timofte, Vincent De Smet, and Luc Van Gool, "Anchored neighborhood regression for fast example-based super-resolution," in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 1920–1927.
[8] Radu Timofte, Vincent De Smet, and Luc Van Gool, "A+: Adjusted anchored neighborhood regression for fast super-resolution," in Asian Conference on Computer Vision. Springer, 2014, pp. 111–126.
[9] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang, "Learning a deep convolutional network for image super-resolution," in European Conference on Computer Vision. Springer, 2014, pp. 184–199.
[10] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee, "Deeply-recursive convolutional network for image super-resolution," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1637–1645.
[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[12] Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee, "Enhanced deep residual networks for single image super-resolution," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 136–144.
[13] Jiahui Yu, Yuchen Fan, Jianchao Yang, Ning Xu, Zhaowen Wang, Xinchao Wang, and Thomas Huang, "Wide activation for efficient and accurate image super-resolution," arXiv preprint arXiv:1808.08718, 2018.
[14] Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, and Yun Fu, "Residual dense network for image super-resolution," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2472–2481.
[15] Tom Tirer and Raja Giryes, "Super-resolution via image-adapted denoising CNNs: Incorporating external and internal learning," IEEE Signal Processing Letters, vol. 26, no. 7, pp. 1080–1084, 2019.
[16] Yaniv Romano, Michael Elad, and Peyman Milanfar, "The little engine that could: Regularization by denoising (RED)," SIAM Journal on Imaging Sciences, vol. 10, no. 4, pp. 1804–1844, 2017.
[17] Franziska Schirrmacher, Christian Riess, and Thomas Köhler, "Adaptive quantile sparse image (AQuaSI) prior for inverse imaging problems," IEEE Transactions on Computational Imaging, vol. 6, pp. 503–517, 2020.
[18] Stanley H. Chan, Xiran Wang, and Omar A. Elgendy, "Plug-and-play ADMM for image restoration: Fixed-point convergence and applications," IEEE Transactions on Computational Imaging, vol. 3, no. 1, pp. 84–98, 2016.
[19] Kai Zhang, Wangmeng Zuo, Shuhang Gu, and Lei Zhang, "Learning deep CNN denoiser prior for image restoration," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3929–3938.
[20] Yijie Bei, Alexandru Damian, Shijia Hu, Sachit Menon, Nikhil Ravi, and Cynthia Rudin, "New techniques for preserving global structure and denoising with low information loss in single-image super-resolution," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 874–881.
[21] Radu Timofte, Eirikur Agustsson, Luc Van Gool, Ming-Hsuan Yang, and Lei Zhang, "NTIRE 2017 challenge on single image super-resolution: Methods and results," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 114–125.
[22] Tim Salimans and Durk P. Kingma, "Weight normalization: A simple reparameterization to accelerate training of deep neural networks," in Advances in Neural Information Processing Systems, 2016, pp. 901–909.
[23] Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P. Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang, "Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1874–1883.
[24] Linwei Fan, Fan Zhang, Hui Fan, and Caiming Zhang, "Brief review of image denoising techniques," Visual Computing for Industry, Biomedicine, and Art, vol. 2, no. 1, pp. 7, 2019.
[25] Norbert Wiener, Extrapolation, Interpolation, and Smoothing of Stationary Time Series, The MIT Press, 1964.
[26] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol, "Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion," Journal of Machine Learning Research, vol. 11, no. Dec, pp. 3371–3408, 2010.
[27] Diederik P. Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.