Deep learning architectural designs for super-resolution of noisy images
Angel Villar-Corrales, Franziska Schirrmacher, Christian Riess
IT Security Infrastructures Lab, University of Erlangen-Nürnberg
ABSTRACT
Recent advances in deep learning have led to significant improvements in single image super-resolution (SR) research. However, due to the amplification of noise during the upsampling steps, state-of-the-art methods often fail at reconstructing high-resolution images from noisy versions of their low-resolution counterparts. This is especially problematic for images from unknown cameras with unseen types of image degradation. In this work, we propose to jointly perform denoising and super-resolution. To this end, we investigate two architectural designs: "in-network" combines both tasks at feature level, while "pre-network" first performs denoising and then super-resolution. Our experiments show that both variants have specific advantages: The in-network design obtains the strongest results when the type of image corruption is aligned in the training and testing dataset, for any choice of denoiser. The pre-network design exhibits superior performance on unseen types of image corruption, which is a pathological failure case of existing super-resolution models. We hope that these findings help to enable super-resolution also in less constrained scenarios where source camera or imaging conditions are not well controlled. Source code and pretrained models are available at https://github.com/angelvillar96/super-resolution-noisy-images.

Index Terms — Super-resolution, Denoising, Deep learning, Image enhancement
1. INTRODUCTION
Single image super-resolution (SR) aims at recovering a high-resolution (HR) image from its low-resolution (LR) counterpart, in which high-frequency details have been lost due to degrading factors such as blur, hardware limitations, or decimation.

Early SR approaches were based on upsampling and interpolation techniques [1, 2]. However, these methods are limited in their representational power, and hence also limited in their ability to predict realistic high-resolution images. More complex methods construct mapping functions between low- and high-resolution images. Such a mapping function can be obtained from a variety of different techniques, including sparse coding [3, 4], random forests [5, 6], or embedding approaches [7, 8]. Recently, deep learning methods for super-resolution have led to considerable performance improvements [9, 10]. ResNet-like architectures [11] obtain state-of-the-art results for SR tasks while maintaining low computational complexity [12, 13, 14]. Despite these successes, it is still challenging to prevent the amplification of noise during the upsampling steps, which often leads to loss of information and the emergence of artifacts.
This work was supported in part by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – Project number 146371743 – TRR 89 Invasive Computing and the German Research Foundation, GRK Cybercrime (393541319/GRK2475/1-2019).
Fig. 1. Original WDSR architecture (top) in comparison to the pre-network architectural design (center) and the in-network architectural design (bottom). Pre-network cascades denoising and super-resolution. In-network combines low-level features from the denoised input with high-level features from the noisy input.

Several approaches have been considered to jointly perform super-resolution and denoising. Image restoration can be formulated as an inverse problem. In this approach, the data term for the respective objectives is specific to the respective task. For the prior, a more generic function can be chosen that applies to multiple tasks, for example a deep learning model [15], a denoiser employed as regularizer [16, 17], or a so-called plug-and-play prior [18]; a generic form of this formulation is given at the end of this section. Furthermore, deep learning approaches have also been considered to combine denoising with an SR model [19]. The authors in [20] propose cascading a denoiser with an SR model so that the output of the denoiser is fed to the SR network. The pre-network architectural design in our experiments is similar in spirit, but instead of using a fixed convolutional neural network (CNN) denoiser, our design allows for further flexibility regarding the choice of the integrated denoiser.

In contrast to previous works, we compare two architectures that allow for further flexibility regarding the choice of the integrated denoiser. This flexibility can be used to incorporate domain knowledge into the network by selecting a denoising technique optimized for the particular type of degradation. For scenarios where domain knowledge is missing, we investigate the generalization capability of different denoisers and the proposed architectural designs. Especially for low-resolution images from cameras in-the-wild, image degradation, such as unseen noise distributions, can lead to artifacts in the reconstructed high-resolution images.

The first architecture, "pre-network", includes a denoiser as a preprocessing step to the super-resolution. The second architecture, "in-network", reconstructs the HR image by combining low-level features extracted from the denoised input and high-level features extracted from the noisy input. To the best of our knowledge, this type of network design has not yet been studied in the context of denoising and super-resolution.

We evaluate our architectures with the Wide Activation Super-Resolution model (WDSR) [13] on noisy versions of the images from the DIV2K dataset [21]. Both architectures aim to suppress the noise corruptions while still recovering most of the high-frequency information. We show that both architectures achieve more realistic reconstructions and better PSNR values than WDSR alone.
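For concreteness, the inverse-problem view mentioned above can be written in a generic variational form; the notation here is ours, a textbook-style summary rather than an equation taken from the cited works:

$$\hat{x} \;=\; \arg\min_{x}\ \underbrace{\|Ax - y\|_2^2}_{\text{task-specific data term}} \;+\; \lambda\,\underbrace{R(x)}_{\text{generic prior}}$$

Here, $y$ is the observed degraded image, $A$ models the task-specific degradation (e.g., blur and decimation for super-resolution, the identity for denoising), $R$ is a prior that can be shared across tasks (e.g., a denoiser used as regularizer [16, 17] or a plug-and-play prior [18]), and $\lambda$ balances the two terms.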
2. METHODS
In this work, we employ the Wide Activation Super-Resolution (WDSR) model [13] as a building block to investigate architectures for joint denoising and super-resolution.

The original WDSR architecture is shown at the top of Figure 1. It consists of two paths. The main path is on top, consisting of a user-defined number B of residual blocks. Each block consists of two convolutional layers followed by weight normalization [22] and ReLU activation. The lower path is a residual connection. It provides low-level features from the input to the output, which is critically important for SR tasks [14]. Both paths contain a pixel-shuffle layer [23], which performs the upsampling for image super-resolution. WDSR is sensitive to input images that are corrupted by additive noise. However, we show in this work that it can be paired with a denoiser in two different ways, denoted as "pre-network" and "in-network", which both considerably improve the results.

Figure 1 shows both architectures. Pre-network (abbreviated pre-net) is shown in the middle. It first passes the image through the denoiser prior to branching into main path and skip connection. This is conceptually similar to the denoiser and SR concatenation by Bei et al. [20]. One potential limitation of this approach is error propagation: if the denoiser removes information that is relevant to super-resolution, it cannot be recovered afterwards.

In-network (abbreviated in-net) is shown at the bottom of Fig. 1. Here, the denoiser integrates into the residual connection. Hence, the SR model can jointly combine low-level features from the denoised input and high-level features from the noisy input.

Both designs are open to the choice of denoiser, which allows choosing a task-specific denoiser, i.e., one that performs best on an expected noise distribution. In our experiments, we evaluate three popular denoisers of varying complexity: the median filter [24], the Wiener filter [25], and denoising autoencoders (DAE) [26].
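To make the two designs concrete, the following PyTorch sketch shows how a denoiser can be wired into a simplified WDSR-style backbone in either configuration. It is a minimal illustration under several assumptions: the class names (`ResBlock`, `JointDenoiseSR`), channel counts, block count, and kernel sizes are ours, the blocks omit WDSR's weight normalization and wide-activation expansion, and `denoiser` stands for any interchangeable module (median filter, Wiener filter, or DAE).

```python
import torch
import torch.nn as nn


class ResBlock(nn.Module):
    """Simplified WDSR-style residual block: conv -> ReLU -> conv + skip."""

    def __init__(self, channels: int = 32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)


class JointDenoiseSR(nn.Module):
    """Wraps an SR backbone and a denoiser in 'pre' or 'in' configuration."""

    def __init__(self, denoiser: nn.Module, design: str = "in",
                 channels: int = 32, num_blocks: int = 4, scale: int = 2):
        super().__init__()
        assert design in ("pre", "in")
        self.design, self.denoiser = design, denoiser
        # Main path: feature extraction, residual blocks, upsampling.
        self.main = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1),
            *[ResBlock(channels) for _ in range(num_blocks)],
            nn.Conv2d(channels, 3 * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale),
        )
        # Skip path: one shallow conv + pixel shuffle, carrying
        # low-level features directly to the output.
        self.skip = nn.Sequential(
            nn.Conv2d(3, 3 * scale ** 2, 5, padding=2),
            nn.PixelShuffle(scale),
        )

    def forward(self, x):
        if self.design == "pre":
            # Pre-network: denoise first, then run both SR paths.
            x = self.denoiser(x)
            return self.main(x) + self.skip(x)
        # In-network: the main path sees the noisy input, while the
        # residual connection sees the denoised input.
        return self.main(x) + self.skip(self.denoiser(x))


# Example: use the identity as a stand-in denoiser and upscale a patch.
model = JointDenoiseSR(denoiser=nn.Identity(), design="in")
hr = model(torch.randn(1, 3, 48, 48))  # -> torch.Size([1, 3, 96, 96])
```

The only difference between the two configurations is which tensor feeds the residual connection, which is exactly the distinction drawn in Fig. 1.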
3. DATASET PREPARATION AND TRAINING PROCEDURE
The experiments use the DIV2K dataset [21]. In our experiments, we consider low-resolution images that have been downsampled by a factor of two. We use the 800 images from the training set for training or fine-tuning the models, and the 100 validation images for evaluation. Closely following [13], we feed the model RGB image patches extracted from the HR images, along with their noisy, bicubically downsampled counterparts.

We train our models using downsampled image patches corrupted with additive Gaussian noise. The testing data is corrupted using additive Gaussian noise with the same distribution as during training. Moreover, Poisson, speckle, and salt-and-pepper noise are used to degrade the testing data for the robustness analysis.

As a baseline, we consider two versions of WDSR. First, the publicly available version of WDSR, pretrained on the DIV2K dataset, denoted as "No tuning". Second, WDSR fine-tuned on DIV2K images with added noise, denoted as "No denoiser". The model weights are initialized with the pretrained WDSR weights. Each model is fine-tuned using the mean absolute error (MAE) loss function and the ADAM update rule [27]. To avoid overfitting, the fine-tuning procedure is stopped after 100 epochs. For the median and Wiener filters, we use square kernels with a side length of five pixels. For the denoising autoencoders, we construct a fully convolutional DAE composed of three convolutional layers with 64, 128, and 256 kernels, respectively. Each layer is followed by a ReLU activation function and max-pooling. DAEs are trained for 80 epochs using the MSE loss function and the ADAM update rule. For the integration into the WDSR model, the autoencoder parameters are fixed during the fine-tuning of the network.
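The following sketch illustrates how the four noise types could be synthesized on images normalized to [0, 1]. It is a plausible reference implementation, not the paper's actual corruption code; the function name `corrupt` and the default parameter values are illustrative assumptions.

```python
import numpy as np


def corrupt(img, kind, sigma=0.1, lam=50.0, p=0.05, seed=None):
    """Apply one noise model to a float image with values in [0, 1]."""
    rng = np.random.default_rng(seed)
    if kind == "gaussian":      # additive: x + n, n ~ N(0, sigma^2)
        out = img + rng.normal(0.0, sigma, img.shape)
    elif kind == "speckle":     # multiplicative: x * (1 + n)
        out = img * (1.0 + rng.normal(0.0, sigma, img.shape))
    elif kind == "poisson":     # signal-dependent shot noise
        out = rng.poisson(img * lam) / lam
    elif kind == "s&p":         # flip a fraction p of pixels to 0 or 1
        out = img.copy()
        mask = rng.random(img.shape)
        out[mask < p / 2] = 0.0
        out[mask > 1.0 - p / 2] = 1.0
    else:
        raise ValueError(f"unknown noise type: {kind}")
    return np.clip(out, 0.0, 1.0)
```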
4. EVALUATION

4.1. Aligned Noise in Training and Testing
The noise in real images can often be approximated as additive Gaussian noise, which we model in this experiment with a fixed power σ for training and testing. This evaluation assumes that noise distribution and strength are known during training. This is a strong assumption for practical cases, but this setup shows the fundamental capability of the models to capture non-ideal images.

Table 1 shows the numerical performance in terms of the peak signal-to-noise ratio (PSNR). The experiment is performed for additive Gaussian noise over a range of noise levels σ. For low noise levels, the fine-tuned WDSR ("No denoiser") and in-network with the median filter perform best. In-network with the denoising autoencoder outperforms the competing methods for higher noise levels. A consistent decrease in PSNR can be observed with increasing strength of the noise. This is expected, since image restoration becomes increasingly difficult with increasing noise strength. Pre-net and in-net are both affected by increasing noise, since the denoiser increasingly removes useful information from the image.

In comparison, the original WDSR model ("No tuning") is unable to accurately recover the HR image from its noisy LR counterpart. However, fine-tuning WDSR ("No denoiser") considerably improves the PSNR, for example by about 10 dB at the strongest tested noise level. Thus, the fine-tuned WDSR without any integrated denoiser learns to suppress noise and is on par with the best denoiser methods (i.e., in-net with autoencoder and median filter). Hence, WDSR's 16 residual blocks and 32 convolutional filters apparently possess sufficient representational power to jointly learn denoising and SR. However, we will show in the next section that these findings do not generalize to distortions that were not seen during training.

Figure 2 shows a qualitative comparison of the reconstructions of an image patch at a high noise strength. The in-network architecture yields very similar results for all denoisers. It exhibits comparable results to the fine-tuned WDSR and overall outperforms pre-network.

Table 1. Average peak signal-to-noise ratio (PSNR) for varying levels of additive Gaussian noise σ on the low-resolution testing images. Baseline WDSR fine-tuned on the noisy data ("No denoiser") performs best at low noise powers, while the in-net with an autoencoder denoiser achieves the best PSNR values at higher noise levels. Baseline WDSR without fine-tuning ("No tuning") performs worst (see text for details). [Table layout: one row per noise level σ; columns: No tuning, No denoiser, and In-net/Pre-net for each of the median filter, Wiener filter, and autoencoder denoisers.]
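For completeness, the PSNR used throughout this evaluation can be computed as follows. This small helper assumes float images in [0, 1] and is our own sketch rather than code from the released repository.

```python
import numpy as np


def psnr(reference: np.ndarray, estimate: np.ndarray) -> float:
    """Peak signal-to-noise ratio in dB for images with values in [0, 1]."""
    mse = np.mean((reference.astype(np.float64) - estimate) ** 2)
    # With a peak value of 1.0, PSNR = 10 * log10(1 / MSE).
    return float("inf") if mse == 0 else 10.0 * np.log10(1.0 / mse)
```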
Fig. 2. Comparison of different reconstructions of an image patch corrupted with Gaussian noise using different denoisers and architectural designs. "No tuning" corresponds to the reconstruction using the original WDSR trained on clean images, and "No denoiser" corresponds to the reconstruction using a WDSR model fine-tuned on noisy patches.

Upon closer examination, pre-net suffers from the denoiser's removal of relevant information at a very early processing stage. This can be seen in the holes in the hat: the holes appear blurry or are completely removed in all pre-net configurations. In-net performs much better since it complements the denoised image in the skip branch with original image information in the main branch.

In conclusion, when training and testing images are aligned in a fixed amount of Gaussian noise, the fine-tuned WDSR without any denoiser and the in-net exhibit the best PSNR results and generate the most accurate reconstructions.

4.2. Unseen Noise in Testing

In this experiment, we investigate the generalization to noise distributions that were not part of the training data. The models are trained on images with additive Gaussian noise of a fixed strength σ. Testing is performed on three different noise distributions. More specifically, we use speckle noise with strength σ, Poisson noise with rate λ, and salt-and-pepper (S&P) noise with a probability p of changing a pixel.

Table 2 shows quantitative results for this experiment. As a reference, we also include performances for testing on additive Gaussian noise, i.e., where training and testing are aligned. The fine-tuned WDSR ("No denoiser") performs best on images corrupted by Gaussian noise and speckle noise. We attribute this to the fact that speckle noise is a multiplicative distortion, but it follows the same probabilistic distribution as additive Gaussian noise. In-network with the median filter and the denoising autoencoder achieves similar results. However, pre-network with the median filter outperforms the competing methods by a large margin on images that are corrupted by salt-and-pepper noise and Poisson noise: an improvement over the baselines of more than 13 dB is achieved.

For noise distributions that significantly differ from those seen during training, it is highly beneficial to suppress as much of the input corruption as possible prior to performing SR. For this reason, pre-network performs best on Poisson noise and salt-and-pepper noise. The choice of the denoiser also affects the results. To this end, Fig. 3 shows qualitative results on Poisson noise. All configurations of in-net, as well as pre-net with the Wiener filter and the fine-tuned WDSR ("No denoiser"), suppress noise in homogeneous areas, such as the blue sky. However, the wheel exhibits more texture, and it still contains noise. Pre-net with the median filter removes the noise best, but the reconstructed image appears slightly over-smoothed.

In Fig. 4, qualitative results on salt-and-pepper noise are shown. Here, the median filter achieves the best reconstructions. The noise is completely removed, but some details of the lemon peel are also removed. We attribute both observations to the fact that median filtering is particularly well-suited for suppressing salt-and-pepper noise but does not preserve small structures. The Wiener filter in the pre-net configuration generates unwanted artifacts, since it is optimal for Gaussian denoising. Finally, it is interesting to note that the PSNR of the denoising autoencoder is oftentimes outperformed by the other denoisers.
This can be attributed to the fact that the autoencoder is trained on additive Gaussian noise, and it apparently has difficulties generalizing to other distributions without retraining.

Fig. 3. Comparison of different reconstructions of an image patch with Poisson noise using different denoisers and architectures (panels: Original, Input, No tuning, No denoiser, and the Median, Wiener, and DAE denoisers in pre-net and in-net configurations). "No tuning" corresponds to the reconstruction using the original WDSR trained on clean images, and "No denoiser" corresponds to the reconstruction using a WDSR model fine-tuned on noisy patches.
Table 2. Average PSNR values for noise distributions that differ from the training distribution. As a baseline, Gaussian noise is aligned with the training distribution. Here, and for the mathematically similar speckle noise, WDSR without a denoiser ("No denoiser") performs best. For the more challenging salt-and-pepper (S&P) noise and Poisson noise, the pre-network generalizes best, by a large margin.
[Table 2 layout: one row per noise type (Gaussian, speckle, Poisson, S&P); columns: No tuning, No denoiser, and In-net/Pre-net for each of the median filter, Wiener filter, and autoencoder denoisers. Gaussian noise, "No tuning": 18.78 dB.]
Fig. 4. Comparison of reconstructions of an image patch with salt-and-pepper noise using different denoisers and architectures. "No tuning" corresponds to the reconstruction using the original WDSR trained on clean images, and "No denoiser" corresponds to the reconstruction using WDSR fine-tuned on noisy patches. Only the pre-network with the median filter succeeds at denoising and reconstructing the HR image.

In summary, models that are trained on additive Gaussian noise do not generalize well beyond Gaussian and speckle noise. Pre-network combined with a suitable denoiser generalizes considerably better to different noise distributions than the competing methods.
5. CONCLUSION
In this paper, we compare two architectures to jointly perform image denoising and single-image super-resolution. We combine the well-known WDSR model [13] with three denoisers that can be chosen depending on the type of degradation. Both networks have specific benefits. The "pre-network" architecture sequentially removes the noise first and then recovers the high-resolution image. With a suitable denoiser, pre-network generalizes well to unseen noise distributions. However, details in the image are removed, and the reconstructions appear slightly over-smoothed. The "in-network" architecture reconstructs the high-resolution image by combining low-level features from the denoiser with high-level features from the noisy input. This enables better structure preservation and sharper reconstructed images, but is more sensitive to unseen noise distributions and strengths, independent of the chosen denoiser. We hope that these findings are useful toward enabling super-resolution in-the-wild, when camera and image conditions are not fully controlled.

6. REFERENCES

[1] Xin Li and Michael T. Orchard, "New edge-directed interpolation," IEEE Transactions on Image Processing, vol. 10, no. 10, pp. 1521–1527, 2001.
[2] Lei Zhang and Xiaolin Wu, "An edge-guided image interpolation algorithm via directional filtering and data fusion," IEEE Transactions on Image Processing, vol. 15, no. 8, pp. 2226–2238, 2006.
[3] Roman Zeyde, Michael Elad, and Matan Protter, "On single image scale-up using sparse-representations," in International Conference on Curves and Surfaces. Springer, 2010, pp. 711–730.
[4] Jianchao Yang, John Wright, Thomas S. Huang, and Yi Ma, "Image super-resolution via sparse representation," IEEE Transactions on Image Processing, vol. 19, no. 11, pp. 2861–2873, 2010.
[5] Jordi Salvador and Eduardo Perez-Pellitero, "Naive Bayes super-resolution forest," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 325–333.
[6] Samuel Schulter, Christian Leistner, and Horst Bischof, "Fast and accurate image upscaling with super-resolution forests," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3791–3799.
[7] Radu Timofte, Vincent De Smet, and Luc Van Gool, "Anchored neighborhood regression for fast example-based super-resolution," in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 1920–1927.
[8] Radu Timofte, Vincent De Smet, and Luc Van Gool, "A+: Adjusted anchored neighborhood regression for fast super-resolution," in Asian Conference on Computer Vision. Springer, 2014, pp. 111–126.
[9] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang, "Learning a deep convolutional network for image super-resolution," in European Conference on Computer Vision. Springer, 2014, pp. 184–199.
[10] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee, "Deeply-recursive convolutional network for image super-resolution," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1637–1645.
[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[12] Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee, "Enhanced deep residual networks for single image super-resolution," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 136–144.
[13] Jiahui Yu, Yuchen Fan, Jianchao Yang, Ning Xu, Zhaowen Wang, Xinchao Wang, and Thomas Huang, "Wide activation for efficient and accurate image super-resolution," arXiv preprint arXiv:1808.08718, 2018.
[14] Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, and Yun Fu, "Residual dense network for image super-resolution," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2472–2481.
[15] Tom Tirer and Raja Giryes, "Super-resolution via image-adapted denoising CNNs: Incorporating external and internal learning," IEEE Signal Processing Letters, vol. 26, no. 7, pp. 1080–1084, 2019.
[16] Yaniv Romano, Michael Elad, and Peyman Milanfar, "The little engine that could: Regularization by denoising (RED)," SIAM Journal on Imaging Sciences, vol. 10, no. 4, pp. 1804–1844, 2017.
[17] Franziska Schirrmacher, Christian Riess, and Thomas Köhler, "Adaptive quantile sparse image (AQuaSI) prior for inverse imaging problems," IEEE Transactions on Computational Imaging, vol. 6, pp. 503–517, 2020.
[18] Stanley H. Chan, Xiran Wang, and Omar A. Elgendy, "Plug-and-play ADMM for image restoration: Fixed-point convergence and applications," IEEE Transactions on Computational Imaging, vol. 3, no. 1, pp. 84–98, 2016.
[19] Kai Zhang, Wangmeng Zuo, Shuhang Gu, and Lei Zhang, "Learning deep CNN denoiser prior for image restoration," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3929–3938.
[20] Yijie Bei, Alexandru Damian, Shijia Hu, Sachit Menon, Nikhil Ravi, and Cynthia Rudin, "New techniques for preserving global structure and denoising with low information loss in single-image super-resolution," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 874–881.
[21] Radu Timofte, Eirikur Agustsson, Luc Van Gool, Ming-Hsuan Yang, and Lei Zhang, "NTIRE 2017 challenge on single image super-resolution: Methods and results," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 114–125.
[22] Tim Salimans and Durk P. Kingma, "Weight normalization: A simple reparameterization to accelerate training of deep neural networks," in Advances in Neural Information Processing Systems, 2016, pp. 901–909.
[23] Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P. Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang, "Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1874–1883.
[24] Linwei Fan, Fan Zhang, Hui Fan, and Caiming Zhang, "Brief review of image denoising techniques," Visual Computing for Industry, Biomedicine, and Art, vol. 2, no. 1, pp. 7, 2019.
[25] Norbert Wiener, Extrapolation, Interpolation, and Smoothing of Stationary Time Series, The MIT Press, 1964.
[26] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol, "Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion," Journal of Machine Learning Research, vol. 11, no. Dec, pp. 3371–3408, 2010.
[27] Diederik P. Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.