Blind Image Super-Resolution with Spatial Context Hallucination
Dong Huo, Yee-Hong Yang
Department of Computing Science, University of Alberta, Edmonton, Canada
{dhuo, herberty}@ualberta.ca

Abstract.
Deep convolutional neural networks (CNNs) play a critical role in single image super-resolution (SISR), driven by the rapid improvement of high performance computing. However, most super-resolution (SR) methods focus only on recovering bicubic degradation. Reconstructing high-resolution (HR) images from randomly blurred and noisy low-resolution (LR) images remains a challenging problem. In this paper, we propose a novel Spatial Context Hallucination Network (SCHN) for blind super-resolution without knowing the degradation kernel. We find that when the blur kernel is unknown, separate deblurring and super-resolution can limit performance because of the accumulation of error. Thus, we integrate denoising, deblurring and super-resolution within one framework to avoid this problem. We train our model on two high quality datasets, DIV2K and Flickr2K. Our method performs better than state-of-the-art methods when input images are corrupted with random blur and noise.
1 Introduction

Single image super-resolution (SISR) is a fundamental topic in computer vision and has attracted the attention of many researchers for decades. The aim of SISR is to reconstruct a high-resolution (HR) image from its low-resolution (LR) counterpart using only one image. Recently, most of the research in SISR focuses on applying deep learning methods to train an end-to-end model [1,2,3,4,5,6,7] because deep learning methods improve performance significantly. Most of these super-resolution (SR) methods recover HR images from LR images obtained by bicubic downsampling. In particular, HR images are collected from several datasets as training targets and downsampled with a bicubic kernel to generate LR training inputs. The procedure is given by

$$y = x \downarrow_{\text{downsample}} \tag{1}$$

where $x \downarrow_{\text{downsample}}$ represents downsampling the HR image $x$ with a downsample kernel (e.g. a bicubic kernel), and $y$ is the LR output. Although these models perform well in recovering bicubic degradation due to downsampling, they are not applicable to more complicated scenarios, e.g.

$$y = (x \otimes k) \downarrow_{\text{downsample}} + n. \tag{2}$$

In this case, an image is blurred with a blur kernel $k$ (e.g. a Gaussian blur kernel) before downsampling, $\otimes$ represents the convolution operation, and $n$ is additive Gaussian noise with noise level $\sigma$.
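As a concrete illustration of the degradation model in Eqn (2), the following minimal Python sketch synthesizes an LR image from an HR image. The kernel construction, scale factor, and noise level are illustrative assumptions, and strided decimation stands in for the paper's bicubic downsampling.

```python
import numpy as np
from scipy.signal import convolve2d

def gaussian_kernel(size=15, sigma=1.5):
    # Isotropic Gaussian blur kernel k of Eqn (2).
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))
    return k / k.sum()

def degrade(x, k, scale=4, noise_sigma=15.0):
    # y = (x conv k), downsampled, plus additive Gaussian noise n.
    blurred = convolve2d(x, k, mode="same", boundary="symm")
    lr = blurred[::scale, ::scale]          # decimation in place of bicubic
    n = np.random.normal(0.0, noise_sigma, lr.shape)
    return np.clip(lr + n, 0.0, 255.0)

hr = np.random.rand(256, 256) * 255.0       # stand-in HR image
lr = degrade(hr, gaussian_kernel(15, 1.5), scale=4, noise_sigma=15.0)
```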
SRMD [8] stretches the blur kernel and Gaussian noise level to the same size as the LR image, and concatenates these three to form the final input. DPSR [9] regards the deblurring module as a plug-and-play block, and then inputs the deblurred results into SRResNet [4] models trained on bicubic degradation. However, these models, which reconstruct HR images with a known blur kernel and noise level, do not satisfy the condition of blind super-resolution, where the explicit degradation model is unknown [10]. Most conventional blind SR algorithms [11,12] use optimization instead of deep learning. Michaeli and Irani [13] exploit the recurrence of patches across scales and search for similar patterns to estimate the blur kernel. However, when large noise is added after downsampling, the estimation accuracy of these models is significantly reduced. Recently, deep learning has been applied to blind SR. For example, SFTMD-IKC [14] extends the work of SRMD and builds an extra sub-network to learn the blur kernel, exploits spatial feature transform (SFT) [15] instead of direct concatenation, and updates the parameters of IKC iteratively to improve kernel estimation. It learns the blur kernel explicitly, which limits the categories of blur kernels, and hence the results can also easily be affected by noise. Besides, SFTMD-IKC contains over 9M parameters, which is extremely computationally expensive, while the proposed method uses only around 2M (after bypassing parameters that are not used in testing) and is the fastest among all compared SOTA methods. ZSSR [16] is trained on a single image to make full use of unique internal information. The model is trained on different scales of the input image and exploits the kernel estimation algorithm of Michaeli and Irani [13] when the blur kernel is unknown. The architecture of ZSSR is simple because it is difficult to train a complex model using a single image only. Zhou et al. [17] estimate blur kernels from real images with the dark-channel prior [18] to generate a kernel pool, from which kernels are randomly selected to blur the LR images. Although this method is more general, the performance of the model relies heavily on the kernel estimation algorithm [18], whose outputs are regarded as the ground truth, which may cause the accumulation of error. In addition, it still needs to upsample the LR input to the same size as the HR output, which increases computational cost and may introduce extra noise. Most of these methods apply an existing denoising algorithm [19,20,21] before SR reconstruction to reduce noise. Unfortunately, because these methods are all trained using noisy images, they cannot differentiate between noise-free images and noisy images. Hence, their performance on noise-free LR images is poor because they assume that the input images have noise.

Given the above observations, we propose a new Spatial Context Hallucination Network (SCHN) to combine noise removal and blind super-resolution. Unlike existing SR methods, the proposed method is lightweight and computationally efficient in testing by bypassing parameters that are not used during testing (details are shown in Section 4). Our method is an inverse procedure of multiview SR, like jittered lens SR [22] and video SR [23].
In [22], multiple views of a scene captured by camera jittering differ by only a sub-pixel amount (less than 1 pixel). In [23], if the fps of a video is high enough, neighboring frames also differ by sub-pixel amounts. Multiview SR is effective because different views contain more features than a single view. Indeed, our SCH module recovers multiview features to enhance the original features. In particular, it learns two offset maps (hallucination maps) that are used to recover the feature maps of neighboring frames in a video, or feature maps of jittered frames from camera jittering, so that we can obtain more simulated features to perform multiview SR. Each pixel of the feature map is assigned two vectors in the format ($x_{\text{offset}}$, $y_{\text{offset}}$), and the value of an offset is less than 1 to make sure that the shifted feature map and the original feature map differ by only a sub-pixel amount. We stack multiple SCH modules so that we can enhance the feature information by multiview simulation multiple times. Meanwhile, the shifted feature maps can also compensate for the information loss that results from noise. Since the noise is randomly applied to the image, pixels with a large noise level may have neighboring pixels with a relatively low noise level. In this case, the information of the latter can be shifted to the former, which is helpful for image denoising.

The contributions of this work are summarized as follows: (1) We propose a new spatial context hallucination (SCH) module in our network to take full advantage of spatial correlation. To the best of our knowledge, we are the first to propose such an idea. (2) The experiments show that the proposed network, with a shallower and thinner architecture, outperforms state-of-the-art (SOTA) SR methods using the proposed SCH module. (3) Our method integrates denoising, deblurring and upsampling together to solve the blind SR problem and can tolerate more general and random degradation.

2 Related Work

The first CNN-based SISR model is SRCNN [1]. Its simple architecture shows that it is easy to build an end-to-end CNN model for solving the SR problem. Since then, many CNN-based SR models have been proposed. VDSR [24] extends the depth of SRCNN and adds a long-term residual connection to avoid the vanishing/exploding gradients problem. The pixel-shuffle module of ESPCN [25] avoids introducing extra noise from upsampling the LR input to the same size as the HR output. SRResNet [4] utilizes both short-term residual blocks and long-term residual connections, followed by multiple sub-pixel modules to gradually upsample the LR input. The architecture of EDSR [26] is similar to that of SRResNet, but it removes the batch normalization module to reduce memory usage. IDN [3] trains a weight for the residual connection to distill long-term information. RDN [7] combines dense connections and long-term residual connections to stabilize training. DBPN [2] not only uses dense connections but also applies upsampling and downsampling alternately in a single network to preserve HR features inside the network. MsRN [27] extends the work of EDSR and RDN to infer multi-scale HR images at the same time. EBRN [5] utilizes multiple block residual modules to restore different frequencies in models of different complexity, which reduces the problems of over-enhancing and over-smoothing.
Fig. 1: The network architecture of the spatial context hallucination network.

Since one model may not perform best on all images, RankSRGAN [28] integrates SRGAN [4] and ESRGAN [6] with a ranker network to distill the best result from two outputs, and guides the training of SRGAN with a ranking score.

In recent years, the number of published papers on blind SR is still relatively small compared with that of non-blind SR. Some models can be regarded as a hybrid between blind and non-blind SR methods. For example, SRMD [8] assumes that the blur kernel is known and projects it onto a low dimensional vector by Principal Component Analysis (PCA). The kernel vector is concatenated with the σ value of the noise (noise level) and then the result is stretched to the same size as the LR input. It uses this extra information as another input to train an SR model. DPSR [9] utilizes the half quadratic splitting (HQS) algorithm based on the FFT (Fast Fourier Transform) as its deblurring method, which still needs an accurate blur kernel as input. Similar to DPSR, ZSSR [16] also uses a conventional kernel estimation algorithm [13] when it is applied to blind SR, but it is trained on a single image only. In particular, it has eight convolution layers, which is extremely simple compared with most SOTA (state-of-the-art) SR networks. CinCGAN [29] also exploits an unsupervised strategy for blind SR. Different from ZSSR, it is not trained on LR-HR pairs. However, due to its complicated structure and the ill-posed nature of SR, training CinCGAN is difficult and unstable [30]. SFTMD-IKC [14] splits the network into three parts: a non-blind SR network, a predictor and a corrector for kernel estimation. It converts the stretched blur kernel into two parameters of an SFT module and trains a separate subnetwork to estimate the blur kernel. Then the real kernel input of SFTMD is replaced with the estimated one. Zhou et al. [17] utilize the dark-channel prior [18] on real images to collect realistic blur kernels, and train a GAN to generate more blur kernels with the same distribution. Then the collected blur kernels are used to generate training data.

SISR is not the only research topic in SR. Li et al. [22] reconstruct HR images from several LR counterparts. In particular, they gather images by jittering the lens of a camera when the shutter is released; images obtained in this way differ in sub-pixel positions only. They then pick pixels one by one from these LR images and insert them into the HR output. Some SR methods utilize stereo image pairs as their input. Jeon et al. [31] build a 32-layer network for the stereo SR problem. They first upsample the left and right images using bicubic interpolation and then shift the right image 64 times to build a cost volume as input.

Fig. 2: The architecture of the SCH module. The hallucination map is applied to all channels of the feature maps. We only display one channel of the feature map before and after "Transformation with Bilinear Sampling" for simplification. To visualize the two hallucination maps, we enhance the magnitude of the offsets of each hallucination map. The two channels of each hallucination map are combined and displayed using the same convention as that used to display optical flow. Both the input and output feature maps contain 64 channels.
PASSRnet [32] uses matrix multiplication to fuse disparity estimation and SR together. It exploits a residual ASPP block, which is designed to enlarge the receptive field and learn multi-scale features in the same layer.

There are also some methods that focus on reconstructing high quality frames from a low-quality video. TDAN [23] utilizes deformable convolutions [33] to align consecutive neighboring LR frames with the reference frame, and concatenates the deformed output with feature maps of the reference frame to generate an HR reference frame. EDVR [34] extends the work of TDAN, using pyramid and cascading deformable convolutions to exploit alignment at different scales, and then uses an attention mechanism to fuse features from different times with different weights. Godard et al. [35] utilize a recurrent architecture to restore all the frames of an image sequence, and address the denoising and SR problems with the same network. PFNL [36] exploits a non-local residual block to extract temporal dependencies, and also uses multiple progressive fusion resblocks to make full use of spatio-temporal correlations.

In the proposed method, we apply deformable convolutions [33] to generate new feature maps with sub-pixel offsets in our spatial context hallucination (SCH) module, and use pixel-shuffle convolution [25] within each SCH module to enlarge the spatial dimension. Our proposed model achieves SOTA performance on the blind SISR problem.
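For reference, the pixel-shuffle (depth-to-space) operation of [25] can be sketched in a few lines of NumPy. This is a generic illustration of the operation, not the authors' implementation, and the array shapes are assumptions.

```python
import numpy as np

def pixel_shuffle(x, r):
    # Depth-to-space: rearrange an (H, W, C*r*r) feature map into
    # an (H*r, W*r, C) map, enlarging the spatial dimension by r.
    h, w, c = x.shape
    assert c % (r * r) == 0, "channels must be divisible by r^2"
    out_c = c // (r * r)
    x = x.reshape(h, w, r, r, out_c)
    x = x.transpose(0, 2, 1, 3, 4)   # interleave the r-blocks spatially
    return x.reshape(h * r, w * r, out_c)

feat = np.random.rand(32, 32, 64)    # hypothetical 64-channel feature map
up = pixel_shuffle(feat, 4)          # -> shape (128, 128, 4)
```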
Fig. 3: Transformation with bilinear sampling
3 Proposed Method

Our model addresses the problem of blind SR, which is formulated as shown in Eqn 2. SFTMD-IKC [14] focuses on recovering the HR image from an LR input blurred by an isotropic Gaussian blur kernel. In order to make our model more robust, our method recovers LR images blurred by random anisotropic Gaussian blur kernels, which combine Gaussian blur with motion. We also consider removing the additive random Gaussian noise caused by degradation. However, most denoising methods [19,20,21,37] are applicable to noisy images only. As a result, when they are applied to noise-free images, these methods may remove some tiny details or smooth the input inappropriately. To address this problem, our training input includes both noisy and noise-free images mixed together randomly.
The proposed spatial context hallucination network (SCHN) is shown in Figure 1. Our network begins with a 3 × 3 convolution layer followed by stacked SCH modules. We blur and downsample the HR image set $Z$ to output the noise-free LR image set $Y$. Then we add Gaussian noise with a random noise level to $Y$ to generate the noisy LR set $X$. Details are given in Section 4. To be specific, our random noise level can be 0, and we also randomly skip the blurring process during dataset generation.

Similar to the Stacked Hourglass Networks [39], which output a heatmap and calculate the loss at the end of each hourglass module, we calculate the loss of each high-resolution output. The loss function of each module is given by

$$\mathcal{L}_{sr_j} = \frac{1}{N} \sum_{i=1}^{N} \left\| S_j(O_{j-1}) - z_i \right\|_1, \quad j = 1 \ldots 8, \quad O_0 = y_i \tag{3}$$

where $z_i \in Z$, $N$ is the batch size, $S_j(\cdot)$ represents the SCH module and $j$ is the index of a module. $O_{j-1}$ represents the output of the $(j-1)$-th module. Note that the input of the 1st SCH module is the noise-free LR image $y_i \in Y$. Motivated by the work of Zhao et al. [40], we use the $L_1$ loss.

Fig. 4: Output samples of images under different degradations; PSNR and SSIM are shown below each result. (a) σ = (1.5, 1.5), sf = 2, n = 0; (b) σ = (2.0, 2.0), sf = 4, n = 0; (c) σ = (0.5, 3.0), sf = 4, n = 0; (d) σ = (0.5, 3.0), sf = 4, n = 15; (e), (f) real low-resolution images. "σ" represents the Gaussian blur kernel width; "sf" represents the scale factor; "n" represents the noise level. The best results are in red and the second best in blue.
The total loss function of our network is given by

$$\mathcal{L}_{sr} = \lambda \sum_{j=1}^{7} \mathcal{L}_{sr_j} + \mathcal{L}_{sr_8}, \tag{4}$$

in which $\lambda$ (we set it to 0.05) is a balancing coefficient that reduces the training conflict of multi-task learning [41]. The output of $S_8(\cdot)$ is selected as the final high resolution output.

Based on the procedure of Eqn 2, both the blur kernel $k$ and the downsample kernel process the HR images by convolution, while the noise component affects single pixels. Accordingly, we integrate deblurring and SR together, which is different from CinCGAN [29], which deblurs and denoises the input simultaneously. However, we not only want to test the performance of our network on a non-linear inverse problem (deblurring and SR), but also attempt to test the compatibility of our network on recovering linear (denoising) and non-linear degradation. The loss function of the former is given in Eqn 3 and Eqn 4.
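To make the combined objective of Eqns (3) and (4) concrete, here is a minimal sketch, assuming 8 SCH modules whose intermediate HR outputs are already computed; the tensors and shapes are hypothetical placeholders.

```python
import numpy as np

def l1_loss(pred, target):
    # Per-batch mean absolute error, as in Eqn (3).
    return np.mean(np.abs(pred - target))

def schn_loss(module_outputs, target, lam=0.05):
    # module_outputs: list of 8 HR predictions, one per SCH module.
    # Intermediate outputs are down-weighted by lambda (Eqn (4));
    # the last module's output is the final HR prediction.
    intermediate = sum(l1_loss(o, target) for o in module_outputs[:-1])
    return lam * intermediate + l1_loss(module_outputs[-1], target)

target = np.random.rand(4, 128, 128, 3)                  # HR batch z_i
outputs = [np.random.rand(4, 128, 128, 3) for _ in range(8)]
loss = schn_loss(outputs, target)
```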
Fig. 5: Hallucination maps of SCHN-NF with kernel width 2.0 and scale factor 4. (x, y) represents the y-th hallucination map of the x-th SCH module. Note that we enhance the magnitude of each hallucination map for visualization.

Fig. 6: (a) σ = (1.5, 2.0), sf = 4, n = 15; (b) σ = (0.5, 3.0), sf = 4, n = 0. "Keep all" represents our original SCHN-AN containing 8 SCH modules with 2 hallucination outputs inside. "Remove the 1st/2nd" represents setting the 1st/2nd hallucination output (in all SCH modules) to 0. "Remove all" represents setting both hallucination outputs to 0.

To test the latter, Eqn 3 is replaced by

$$\mathcal{L}_{sr_j} = \frac{1}{N} \sum_{i=1}^{N} \left\| S_j(O_{j-1}) - z_i \right\|_1, \quad j = 1 \ldots 8, \quad O_0 = x_i \tag{5}$$

where the only difference is the noisy input $x_i \in X$.

The spatial context hallucination (SCH) module is the core of our network. As discussed in Section 1, the SCH module is used to simulate camera lens jittering and create multiple pseudo frames, which can be regarded as the inverse procedure of TDAN [23] and jittered lens SR [22]. We extend the idea of the DCN [33], which generates only a single offset for each pixel. In our SCH module (shown in Figure 2), the feature map from the last layer is passed through two different branches, and each branch consists of two 3 × 3 convolution layers that produce a hallucination map, which is applied to the original feature map to generate the offset output by transformation with bilinear sampling. We concatenate these two outputs with the original feature map, and the concatenated result is passed to a third 3 × 3 convolution layer. The two channels of each hallucination map represent the $x_{\text{offset}}$ and $y_{\text{offset}}$ of each pixel, respectively. These offsets are used to obtain the location of the pixel from which the pixel intensity is computed. As shown in Figure 3, assume that pixel E is the location of pixel 1 after adding its corresponding offsets in the hallucination map. The intensity at pixel E is computed using bilinear interpolation with pixels 1-4. The resulting intensity is then used to replace the value at pixel 1. More precisely, based on the formula given by Press et al. [42],

$$\begin{aligned} P(E) = \; & (x_T + 1 - x)(y_T + 1 - y)\, P(x_T, y_T) + (x - x_T)(y_T + 1 - y)\, P(x_T + 1, y_T) \\ + \; & (x_T + 1 - x)(y - y_T)\, P(x_T, y_T + 1) + (x - x_T)(y - y_T)\, P(x_T + 1, y_T + 1), \end{aligned} \tag{6}$$

where $x$ and $y$ denote the coordinates of pixel 1 shifted by the offsets, i.e. $x \leftarrow x + x_{\text{offset}}$ and $y \leftarrow y + y_{\text{offset}}$, with $x_T = \lfloor x \rfloor$ and $y_T = \lfloor y \rfloor$. Pixels outside of the border are assigned a value of 0. In the above example, pixel 1 gets the value of $P(E)$.

By exploiting the SCH module, features at each pixel are enhanced by shifted features from its neighboring pixels. Details are given by

$$F(x)_{\text{enhanced}} = w_0 F(x)_{\text{original}} + \sum_{i=1}^{N} w_i H_i(x) \tag{7}$$

in which $x$ is one of the pixels, $F(x)_{\text{original}}$ is the original feature of $x$, $H_i$ is the $i$-th hallucination output and $H_i(x)$ is the feature shifted to $x$, $F(x)_{\text{enhanced}}$ is the feature of $x$ after enhancement, and the weights $w$ are learned during training. $N$ is the number of hallucination outputs. In this paper, $N = 2$ is found to give the best results. Such a strategy is able to feed more information to each pixel, which plays a critical role in recovering high frequency information, e.g. edges and textures. It also helps to compensate for the information loss at a pixel $x$ caused by a high noise level. Neighboring features with relatively low noise levels can be shifted to $x$, which significantly reduces the noise in $F(x)_{\text{enhanced}}$.
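A minimal NumPy sketch of the transformation with bilinear sampling (Eqn (6)) and the feature enhancement of Eqn (7) is given below, written for a single-channel map; the map sizes and weight values are illustrative assumptions, not the trained network's parameters.

```python
import numpy as np

def bilinear_warp(fmap, x_off, y_off):
    # For each pixel, sample fmap at (x + x_off, y + y_off) with
    # bilinear interpolation (Eqn (6)); out-of-border samples are 0.
    h, w = fmap.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    x, y = xs + x_off, ys + y_off                  # shifted coordinates
    xt, yt = np.floor(x).astype(int), np.floor(y).astype(int)

    def pick(px, py):
        # Read fmap[py, px], returning 0 outside the border.
        inside = (px >= 0) & (px < w) & (py >= 0) & (py < h)
        out = np.zeros_like(fmap)
        out[inside] = fmap[py[inside], px[inside]]
        return out

    wx, wy = x - xt, y - yt                        # fractional parts
    return ((1 - wx) * (1 - wy) * pick(xt, yt)
            + wx * (1 - wy) * pick(xt + 1, yt)
            + (1 - wx) * wy * pick(xt, yt + 1)
            + wx * wy * pick(xt + 1, yt + 1))

feat = np.random.rand(32, 32)
# Two hallucination maps with sub-pixel offsets in (-1, 1).
offsets = [np.random.uniform(-1, 1, (2, 32, 32)) for _ in range(2)]
halluc = [bilinear_warp(feat, ox, oy) for ox, oy in offsets]
w = [0.5, 0.25, 0.25]            # hypothetical learned weights w_0, w_1, w_2
enhanced = w[0] * feat + w[1] * halluc[0] + w[2] * halluc[1]   # Eqn (7)
```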
We calculate the loss for all SCH modules to limit the range of values in the hallucination maps, since no activation function is used after the second convolution layer. In this case, the initial values of the hallucination maps could be quite large, and a single loss function at the end of the network has difficulty constraining all 8 modules. Hence, we use a loss function after each SCH module to limit the output and to stabilize training.

4 Experiments

In this section, we first introduce the implementation details of data preprocessing and network training, and then we compare our model with other existing SR methods. We further discuss the ablation study to evaluate different versions of SCHN. Due to space limitations, more visual comparisons on the same datasets are shown in Sections 1-3 of the supplementary materials.
The training images are gathered from two high quality datasets, DIV2K [43] and Flickr2K [44], which consist of 2391 HR images in total. Following Eqn 2, we synthesize the LR images by blurring and downsampling the HR images, and adding additive Gaussian noise.
We first crop the HR images into 256 × 256 patches with a stride of 240. The LR-HR image pairs are generated during the training procedure. We define a probability for blurring, so that for each input patch there is a chance to blur it or to keep it unblurred.
We apply to each HR patch a random 15 × 15 anisotropic Gaussian blur kernel with kernel widths in the ranges [0.2, 3.0] and [0.2, 4.0] for SR factors 2 and 4, respectively, and downsample the result with bicubic interpolation according to the SR factor. With regard to the noise, there is also a probability of adding it or not. A noise level σ is randomly selected from the range (0, 50] for both SR factors. We also randomly rotate or flip the images before blurring. In this way, our training set is generated dynamically, which augments the data significantly.
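The random anisotropic Gaussian kernels can be sketched as below. The rotation-based parameterization is an assumption about how the anisotropy is realized, since the paper does not spell out the construction.

```python
import numpy as np

def anisotropic_gaussian_kernel(size=15, sigma_x=1.5, sigma_y=3.0, theta=0.0):
    # Rotated anisotropic Gaussian with axis widths sigma_x, sigma_y.
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    c, s = np.cos(theta), np.sin(theta)
    xr = c * xx + s * yy                    # rotate coordinates by theta
    yr = -s * xx + c * yy
    k = np.exp(-0.5 * ((xr / sigma_x) ** 2 + (yr / sigma_y) ** 2))
    return k / k.sum()

rng = np.random.default_rng()
# Random widths in [0.2, 4.0] (scale factor 4) and a random orientation.
k = anisotropic_gaussian_kernel(
    15, rng.uniform(0.2, 4.0), rng.uniform(0.2, 4.0), rng.uniform(0, np.pi))
```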
We implement the SCHN in TensorFlow [45] on a PC with an Nvidia RTX 2080Ti GPU. Our models are optimized using ADAM [46] with $\beta_1 = 0.9$, $\beta_2 = 0.999$ and a batch size of 4. We initialize the learning rate to $5 \times 10^{-4}$ and decrease it by half every 10 epochs. The training is stopped after 60 epochs.

In order to evaluate the performance of our model on the non-linear inverse problem, we first use images blurred with an isotropic Gaussian blur kernel and downsampled with bicubic interpolation. Note that we calculate the PSNR and SSIM on the RGB channels instead of only the Y channel in YCbCr color space, as we want to recover not only the luminance information but also the chrominance.

Table 1 shows the PSNR and SSIM results on two standard benchmark datasets, Set5 [47] and Set14 [48], with different scale factors and kernel widths. We compare our proposed SCHN with bicubic interpolation and four SOTA methods: ZSSR [16], SRMDNF (the noise-free version of SRMD) [8], DPSR [9] and SFTMD-IKC [14]. SCHN-NF represents the SCHN network trained on noise-free blurry images. As ZSSR, SRMDNF and DPSR are all non-blind SR methods which need the blur kernel, we apply the extreme channels prior (ECP) [49] to estimate the blur kernel after the SR procedure. As one can see, SCHN-NF achieves the best performance in most cases.
Table 1: The average PSNR and SSIM results of different methods on Set5 [47] and Set14 [48] with different scale factors and isotropic kernel widths. The size of the blur kernel is fixed at 15 and the downsampling method is bicubic interpolation. The results of SFTMD-IKC are obtained from the pretrained models at https://github.com/yuanjunchai/IKC, which provides only models trained with scale factor 4. The best results are in red and the second best in blue.
| Method | KW (×2) | Set5 (×2) | Set14 (×2) | KW (×4) | Set5 (×4) | Set14 (×4) |
|---|---|---|---|---|---|---|
| Bicubic | 0.5 | 30.81/0.8969 | 27.66/0.8267 | 0.5 | 26.06/0.7721 | 23.60/0.6661 |
| ZSSR+ECP | 0.5 | 32.64/0.9253 | 28.95/0.8714 | 0.5 | 24.01/0.7285 | 21.94/0.6417 |
| SRMDNF+ECP | 0.5 | 32.60/0.9250 | 28.87/0.8690 | 0.5 | 23.78/0.7413 | 21.15/0.6309 |
| DPSR+ECP | 0.5 | 32.20/0.9226 | 28.53/0.8685 | 0.5 | 19.31/0.4856 | 17.07/0.3593 |
| SFTMD-IKC | 0.5 | -/- | -/- | 0.5 | 22.07/0.6745 | 19.63/0.5991 |
| SCHN-NF (ours) | 0.5 | 33.70/0.9310 | 29.64/0.8689 | 0.5 | 27.81/0.8339 | 24.79/0.7056 |
| Bicubic | 1.0 | 28.88/0.8543 | 26.15/0.7661 | 1.0 | 25.99/0.7617 | 23.79/0.6555 |
| ZSSR+ECP | 1.0 | 32.20/0.9149 | 28.74/0.8486 | 1.0 | 26.36/0.8050 | 24.27/0.6962 |
| SRMDNF+ECP | 1.0 | 32.08/0.9139 | 28.95/0.8489 | 1.0 | 27.28/0.8230 | 24.29/0.7065 |
| DPSR+ECP | 1.0 | 32.08/0.9150 | 28.76/0.8510 | 1.0 | 22.75/0.6382 | 20.08/0.4976 |
| SFTMD-IKC | 1.0 | -/- | -/- | 1.0 | 26.38/0.8064 | 23.12/0.6929 |
| SCHN-NF (ours) | 1.0 | 33.64/0.9305 | 29.61/0.8600 | 1.0 | 28.69/0.8442 | 25.58/0.7129 |
| Bicubic | 1.5 | 27.11/0.8020 | 24.76/0.7001 | 2.0 | 24.65/0.7057 | 22.89/0.6003 |
| ZSSR+ECP | 1.5 | 28.44/0.8474 | 25.81/0.8032 | 2.0 | 26.23/0.8032 | 24.34/0.6662 |
| SRMDNF+ECP | 1.5 | 28.35/0.8557 | 25.97/0.8019 | 2.0 | 27.09/0.7998 | 24.89/0.6978 |
| DPSR+ECP | 1.5 | 28.36/0.8412 | 25.76/0.7497 | 2.0 | 26.45/0.7687 | 24.25/0.6621 |
| SFTMD-IKC | 1.5 | -/- | -/- | 2.0 | 28.68/0.8321 | 25.56/0.7082 |
| SCHN-NF (ours) | 1.5 | 32.98/0.9255 | 29.31/0.8596 | 2.0 | 29.08/0.8480 | 25.91/0.7197 |
| Bicubic | 2.0 | 25.72/0.7509 | 23.70/0.6441 | 3.0 | 23.36/0.6479 | 21.94/0.5501 |
| ZSSR+ECP | 2.0 | 26.36/0.8023 | 24.40/0.7371 | 3.0 | 24.02/0.7371 | 22.93/0.6190 |
| SRMDNF+ECP | 2.0 | 26.33/0.7986 | 24.37/0.7350 | 3.0 | 24.40/0.7217 | 23.07/0.6390 |
| DPSR+ECP | 2.0 | 26.33/0.7746 | 24.34/0.7356 | 3.0 | 24.44/0.7088 | 23.04/0.6251 |
| SFTMD-IKC | 2.0 | -/- | -/- | 3.0 | 28.45/0.8194 | 26.04/0.7178 |
| SCHN-NF (ours) | 2.0 | 32.29/0.9190 | 28.70/0.8451 | 3.0 | 28.70/0.8369 | 25.62/0.7046 |

With an unknown kernel width, ZSSR achieves slightly better results than SRMDNF at small kernel widths, but worse as the width grows. Since a large scale factor and a small kernel width can create more artifacts, the performance of DPSR drops significantly with kernel width 0.5 and scale factor 4. Although SFTMD-IKC achieves good performance for large kernel widths, it does not even outperform bicubic interpolation for small ones, which is a drawback of explicitly estimating the kernel width with IKC; splitting deblurring (IKC) and SR (SFTMD) may also cause the accumulation of error, which further impacts the results.

The visual comparisons are shown in Figures 4a and 4b. The ringing effect is obvious for ZSSR, SRMDNF and DPSR with kernel width 2, and even worse for ZSSR with kernel width 4. Besides, DPSR creates more artifacts than the others because of over-enhancing, and the edges of SRMDNF are over-smoothed. Compared with SFTMD-IKC, which reduces the intensity of the image because of over-smoothing, our SCHN-NF not only generates sharper edges without artifacts but also preserves the intensity of the images.

We also display the hallucination maps of our SCHN-NF in Figure 5. The two channels of each hallucination map are combined and displayed using the same convention as that used to display optical flow. (x, y) represents the y-th hallucination map of the x-th SCH module. We can see that the offsets on the edges are clearly different from those of the flat regions, which means the hallucination maps can help to restore high frequency features by feeding more information to the edges from multiple directions.
Table 2: The average PSNR and SSIM results of different methods on Set5 [47] with scale factor 4 and different anisotropic kernel widths. The size of the blur kernel is fixed at 15 and the downsampling method is bicubic interpolation. Note that before the SR procedure, DnCNN is applied at all noise levels, including 0. The best results are in red and the second best in blue.
| Method | Noise | (0.5, 3.0) | (1.0, 2.5) | (1.5, 2.0) |
|---|---|---|---|---|
| Bicubic | 0 | 24.67/0.7093 | 24.96/0.7181 | 25.05/0.7218 |
| DnCNN+ZSSR | 0 | 25.18/0.7435 | 26.30/0.7691 | 26.80/0.7858 |
| DnCNN+SRMD | 0 | 25.83/0.7683 | 27.16/0.8044 | 28.22/0.8258 |
| DnCNN+DPSR | 0 | 24.52/0.7090 | 25.80/0.7480 | 26.90/0.7789 |
| DnCNN+SFTMD-IKC | 0 | 25.13/0.7568 | 26.89/0.8037 | 28.69/0.8366 |
| SCHN-NF (ours) | 0 | 28.28/0.8385 | 28.89/0.8467 | 29.09/0.8486 |
| SCHN-AN (ours) | 0 | 28.12/0.8317 | 28.66/0.8390 | 28.81/0.8400 |
| DnCNN+SCHN-NF | 0 | 27.73/0.8149 | 28.45/0.8288 | 28.76/0.8342 |
| Bicubic | 15 | 22.90/0.5541 | 23.10/0.5623 | 23.17/0.5648 |
| DnCNN+ZSSR | 15 | 23.92/0.6745 | 24.43/0.6844 | 24.61/0.6919 |
| DnCNN+SRMD | 15 | 24.36/0.6919 | 24.99/0.7088 | 25.30/0.7212 |
| DnCNN+DPSR | 15 | 23.47/0.6464 | 24.13/0.6646 | 24.54/0.6792 |
| DnCNN+SFTMD-IKC | 15 | 23.96/0.6836 | 24.72/0.7036 | 25.23/0.7178 |
| SCHN-NF (ours) | 15 | 23.13/0.5639 | 23.40/0.5746 | 23.49/0.5754 |
| SCHN-AN (ours) | 15 | 25.17/0.7220 | 25.47/0.7286 | 25.63/0.7328 |
| DnCNN+SCHN-NF | 15 | 24.95/0.7120 | 25.44/0.7254 | 25.71/0.7334 |
| Bicubic | 30 | 20.21/0.3931 | 20.35/0.3999 | 20.39/0.4029 |
| DnCNN+ZSSR | 30 | 22.85/0.6227 | 23.14/0.6335 | 23.31/0.6392 |
| DnCNN+SRMD | 30 | 23.16/0.6428 | 23.53/0.6544 | 23.74/0.6623 |
| DnCNN+DPSR | 30 | 22.30/0.5859 | 22.79/0.6057 | 22.93/0.6084 |
| DnCNN+SFTMD-IKC | 30 | 22.80/0.6322 | 23.21/0.6454 | 23.46/0.6545 |
| SCHN-NF (ours) | 30 | 20.04/0.3859 | 20.15/0.3912 | 20.16/0.3926 |
| SCHN-AN (ours) | 30 | 23.64/0.6666 | 23.83/0.6734 | 23.87/0.6734 |
| DnCNN+SCHN-NF | 30 | 23.47/0.6556 | 23.77/0.6658 | 23.90/0.6715 |
| Bicubic | 45 | 17.98/0.2929 | 18.10/0.2975 | 18.10/0.2953 |
| DnCNN+ZSSR | 45 | 22.04/0.5891 | 22.17/0.5966 | 22.31/0.6041 |
| DnCNN+SRMD | 45 | 22.19/0.6053 | 22.40/0.6138 | 22.58/0.6200 |
| DnCNN+DPSR | 45 | 21.48/0.5459 | 21.75/0.5552 | 21.96/0.5626 |
| DnCNN+SFTMD-IKC | 45 | 21.84/0.5935 | 22.06/0.6033 | 22.27/0.6083 |
| SCHN-NF (ours) | 45 | 17.61/0.2763 | 17.67/0.2811 | 17.66/0.2832 |
| SCHN-AN (ours) | 45 | 22.45/0.6190 | 22.63/0.6247 | 22.73/0.6315 |
| DnCNN+SCHN-NF | 45 | 22.43/0.6178 | 22.56/0.6239 | 22.72/0.6301 |

To evaluate the compatibility of our network on recovering linear (denoising) and non-linear degradation, we also compare with ZSSR [16], SRMD [8], DPSR [9] and SFTMD-IKC [14] on random anisotropic Gaussian blur kernels with Gaussian noise. SCHN-AN represents the SCHN network trained on blurry images with additive noise. Although SRMD and DPSR are also designed for denoising, they need a predefined noise level. Thus, we apply DnCNN (the blind denoising version) [21] to these four methods for image denoising before the SR procedure. Different from the experiments on isotropic blur kernels, we do not use extra deblurring methods because DnCNN deblurs the images along with denoising, and extra deblurring actually degrades the performance. We also compare with SCHN-NF without any denoising method to show the effect of noise.

Table 2 shows the experimental results on anisotropic Gaussian blur kernels with scale factor 4 and Gaussian noise. We select different kernel width pairs (σ_x, σ_y) for each of the methods, and add different levels of noise. As one can see, ZSSR, SRMD, DPSR and SFTMD-IKC are all sensitive to the difference between σ_x and σ_y when the noise level is 0. For the kernel width pair (0.5, 3.0), these methods perform only slightly better than bicubic interpolation. With a non-zero noise level, these methods become more stable because the denoising procedure also recovers some details, as deblurring does, but they still cannot outperform our methods.

Table 3: Comparison of efficiency. Note that we do not collect the testing time of SRMDNF because we test it on a CPU. We still use the ECP [49] for blur kernel estimation but we ignore its running time.
| Method | Parameters | Testing Time | PSNR/SSIM |
|---|---|---|---|
| ZSSR | 0.22M | 37.53s | 30.48/0.8464 |
| SRMDNF | 1.60M | - | 30.70/0.8531 |
| DPSR | 3.49M | 2.18s | 24.44/0.6386 |
| SFTMD-IKC | 9.05M | 7.88s | 30.00/0.8461 |
| SCHN-NF (ours) | 6.31M | 1.15s | 31.36/0.8604 |
The performance of SCHN-NF is better than that of SCHN-AN and DnCNN+SCHN-NF at zero noise level, because the denoising procedure may mistakenly remove some tiny features that are regarded as noise. However, our SCHN-AN still outperforms DnCNN+SCHN-NF in most cases, which shows that our SCHN-AN works better in denoising and can protect more features from being removed. As shown in Figures 4c and 4d, the results of our methods far surpass those of the others when the difference between σ_x and σ_y is large, which shows the robustness of our model. ZSSR, SRMD, DPSR and SFTMD-IKC over-enhance edges, which appears as deformation. Although DnCNN+SCHN-NF outperforms SCHN-AN in this case, SCHN-AN can generate smoother edges with fewer artifacts.

In addition to the synthetic data, we also test our SCHN on real images. We only display the visual comparison (shown in Figures 4e and 4f) since there is no ground truth. The testing image is provided by Lebrun et al. [50]. As one can see, the result of bicubic interpolation is blurry and noisy, which is expected. All of the other four methods generate large artifacts, especially on straight lines, and cannot remove all of the noise in the flat regions. Our method enhances the edges in the image and removes more noise than the others. The experimental results show that our method is more suitable for the real image SR problem.
We also compare the number of parameters and the testing time of our SCHN with the others. Since the architecture of SCHN-NF is the same as that of SCHN-AN, we only display the efficiency of SCHN-NF. The testing time is collected using "baby.png" of Set5 [47] on an Nvidia RTX 2080Ti with isotropic kernel width 1.0 and scale factor 4. As shown in Table 3, ZSSR has the smallest number of parameters, as it contains only 8 convolution layers, while SFTMD-IKC has the largest number of parameters, as it contains three different parts: a non-blind SR network (SFTMD), and a predictor and a corrector for kernel estimation (IKC). The testing time of ZSSR is much longer than that of the others because it uses unsupervised learning and has to train a new model for each testing image. Although SFTMD-IKC uses supervised learning, it needs multiple iterations to refine the high resolution outputs, as does DPSR. Our method achieves the best performance on this image with the lowest run time. Note that we do not remove those parameters that are not used in the testing procedure; otherwise, our speed would be even faster, taking only 0.69s with 2.16M parameters.
Table 4: The average PSNR and SSIM results of SCHN-NF with different numbers of hallucination maps and SCH modules on T91 [51] with scale factor 4 and kernel width 2.0. The best results are in red and the second best in blue.
| Hallucination maps | 0 modules | 1 module | 4 modules | 8 modules | 12 modules |
|---|---|---|---|---|---|
| 0 | 26.21/0.7262 | 26.49/0.7388 | 26.77/0.7480 | 26.82/0.7496 | 26.98/0.7586 |
| 1 | -/- | 26.63/0.7457 | 27.51/0.7752 | 27.58/0.7759 | 27.49/0.7754 |
| 2 | -/- | 26.81/0.7512 | 27.59/0.7762 | 27.81/0.7851 | 27.43/0.7733 |
| 3 | -/- | 26.70/0.7536 | 27.47/0.7720 | 27.66/0.7768 | 27.10/0.7577 |
We evaluate different configurations of our network. Due to limited training time and GPU memory, we only vary the number of hallucination maps in the SCH module between 0 and 3, and set the number of SCH modules to 0, 1, 4, 8 and 12. As shown in Table 4, version 2-8 (2 maps and 8 modules) achieves the best performance and version 3-8 is the second best. Version 0-0 is the worst, as it only contains one convolution layer and one resblock before the pixel-shuffle convolution module [25]. This shows that our SCH module plays a critical role in increasing the performance of SR and in reducing the number of parameters. Even version 2-1, with only one SCH module, can achieve almost the same performance as version 0-8, which contains over 10 convolution layers but no hallucination maps. The performance becomes worse when we increase the number of hallucination maps beyond 2 and the number of SCH modules beyond 8, which may be due to overfitting with too many parameters.

Meanwhile, in order to visualize the impact of the hallucination outputs, we also set 1 or 2 of the hallucination outputs (in all SCH modules) to 0. Note that we do this after the network is well trained. As shown in Figure 6, the hallucination outputs are critical for the SOTA performance. In particular, the 1st hallucination output is more important for color tone restoration, and the 2nd is more important for deblurring and denoising. More visual comparisons on the same datasets are shown in Sections 4 and 6 of the supplementary materials.
5 Conclusion

In this paper, we propose a new spatial context hallucination network for blind SR tasks. To the best of our knowledge, we are the first to propose such an idea. Our SCHN introduces an SCH module that simulates the procedure of multi-frame SR by generating pseudo frames, and achieves SOTA performance compared with existing SR methods. However, we have not combined internal (as in ZSSR) and external learning. Hence, our future work will focus on using the internal information of the testing images. We also plan to train our network on a larger dataset to avoid overfitting. In this paper, we did not do so because we wanted to compare our model with others that are trained on the same datasets.
References
1. Dong, C., Loy, C.C., He, K., Tang, X.: Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence (2015) 295–307
2. Haris, M., Shakhnarovich, G., Ukita, N.: Deep back-projection networks for super-resolution. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2018) 1664–1673
3. Hui, Z., Wang, X., Gao, X.: Fast and accurate single image super-resolution via information distillation network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2018) 723–731
4. Ledig, C., Theis, L., Huszár, F., Caballero, J., Cunningham, A., Acosta, A., Aitken, A., Tejani, A., Totz, J., Wang, Z., et al.: Photo-realistic single image super-resolution using a generative adversarial network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2017) 4681–4690
5. Qiu, Y., Wang, R., Tao, D., Cheng, J.: Embedded block residual network: A recursive restoration model for single-image super-resolution. In: The IEEE International Conference on Computer Vision (ICCV). (2019)
6. Wang, X., Yu, K., Wu, S., Gu, J., Liu, Y., Dong, C., Qiao, Y., Change Loy, C.: ESRGAN: Enhanced super-resolution generative adversarial networks. In: Proceedings of the European Conference on Computer Vision (ECCV). (2018)
7. Zhang, Y., Tian, Y., Kong, Y., Zhong, B., Fu, Y.: Residual dense network for image super-resolution. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2018) 2472–2481
8. Zhang, K., Zuo, W., Zhang, L.: Learning a single convolutional super-resolution network for multiple degradations. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2018) 3262–3271
9. Zhang, K., Zuo, W., Zhang, L.: Deep plug-and-play super-resolution for arbitrary blur kernels. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2019) 1671–1681
10. Han, F., Fang, X., Wang, C.: Blind super-resolution for single image reconstruction. In: 2010 Fourth Pacific-Rim Symposium on Image and Video Technology, IEEE (2010) 399–403
11. Begin, I., Ferrie, F.: Blind super-resolution using a learning-based approach. In: Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004. Volume 2., IEEE (2004) 85–89
12. Wang, Q., Tang, X., Shum, H.: Patch based blind image super resolution. In: Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1., IEEE (2005) 709–716
13. Michaeli, T., Irani, M.: Nonparametric blind super-resolution. In: Proceedings of the IEEE International Conference on Computer Vision. (2013) 945–952
14. Gu, J., Lu, H., Zuo, W., Dong, C.: Blind super-resolution with iterative kernel correction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2019) 1604–1613
15. Wang, X., Yu, K., Dong, C., Change Loy, C.: Recovering realistic texture in image super-resolution by deep spatial feature transform. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2018) 606–615
16. Shocher, A., Cohen, N., Irani, M.: "Zero-shot" super-resolution using deep internal learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2018) 3118–3126
17. Zhou, R., Susstrunk, S.: Kernel modeling super-resolution on real low-resolution images. In: The IEEE International Conference on Computer Vision (ICCV). (2019)
18. He, K., Sun, J., Tang, X.: Single image haze removal using dark channel prior. IEEE Transactions on Pattern Analysis and Machine Intelligence (2010) 2341–2353
19. Liu, Z., Yan, W.Q., Yang, M.L.: Image denoising based on a CNN model. In: 2018 4th International Conference on Control, Automation and Robotics (ICCAR), IEEE (2018) 389–393
20. Tian, C., Xu, Y., Fei, L., Wang, J., Wen, J., Luo, N.: Enhanced CNN for image denoising. CAAI Transactions on Intelligence Technology (2019) 17–23
21. Zhang, K., Zuo, W., Chen, Y., Meng, D., Zhang, L.: Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising. IEEE Transactions on Image Processing (2017) 3142–3155
22. Li, N., McCloskey, S., Yu, J.: Jittered exposures for image super-resolution. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. (2018) 1852–1859
23. Tian, Y., Zhang, Y., Fu, Y., Xu, C.: TDAN: Temporally deformable alignment network for video super-resolution. arXiv preprint arXiv:1812.02898 (2018)
24. Kim, J., Kwon Lee, J., Mu Lee, K.: Accurate image super-resolution using very deep convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2016) 1646–1654
25. Shi, W., Caballero, J., Huszár, F., Totz, J., Aitken, A.P., Bishop, R., Rueckert, D., Wang, Z.: Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2016) 1874–1883
26. Lim, B., Son, S., Kim, H., Nah, S., Mu Lee, K.: Enhanced deep residual networks for single image super-resolution. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. (2017) 136–144
27. Gao, S., Zhuang, X.: Multi-scale deep neural networks for real image super-resolution. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. (2019)
28. Zhang, W., Liu, Y., Dong, C., Qiao, Y.: RankSRGAN: Generative adversarial networks with ranker for image super-resolution. In: The IEEE International Conference on Computer Vision (ICCV). (2019)
29. Yuan, Y., Liu, S., Zhang, J., Zhang, Y., Dong, C., Lin, L.: Unsupervised image super-resolution using cycle-in-cycle generative adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. (2018) 701–710
30. Wang, Z., Chen, J., Hoi, S.C.: Deep learning for image super-resolution: A survey. arXiv preprint arXiv:1902.06068 (2019)
31. Jeon, D.S., Baek, S.H., Choi, I., Kim, M.H.: Enhancing the spatial resolution of stereo images using a parallax prior. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2018) 1721–1730
32. Wang, L., Wang, Y., Liang, Z., Lin, Z., Yang, J., An, W., Guo, Y.: Learning parallax attention for stereo image super-resolution. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2019) 12250–12259
33. Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., Wei, Y.: Deformable convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision. (2017) 764–773
34. Wang, X., Chan, K.C., Yu, K., Dong, C., Change Loy, C.: EDVR: Video restoration with enhanced deformable convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. (2019)
35. Godard, C., Matzen, K., Uyttendaele, M.: Deep burst denoising. In: Proceedings of the European Conference on Computer Vision (ECCV). (2018) 538–554
36. Yi, P., Wang, Z., Jiang, K., Jiang, J., Ma, J.: Progressive fusion video super-resolution network via exploiting non-local spatio-temporal correlations. In: The IEEE International Conference on Computer Vision (ICCV). (2019)
37. Rudin, L.I., Osher, S., Fatemi, E.: Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena (1992) 259–268
38. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2016) 770–778
39. Newell, A., Yang, K., Deng, J.: Stacked hourglass networks for human pose estimation. In: European Conference on Computer Vision, Springer (2016) 483–499
40. Zhao, H., Gallo, O., Frosio, I., Kautz, J.: Loss functions for neural networks for image processing. arXiv preprint arXiv:1511.08861 (2015)
41. Sener, O., Koltun, V.: Multi-task learning as multi-objective optimization. In: Advances in Neural Information Processing Systems. (2018) 527–538
42. Press, W., Flannery, B., Teukolsky, S., Vetterling, W.: Numerical Recipes in C: The Art of Scientific Computing
43. Agustsson, E., Timofte, R.: NTIRE 2017 challenge on single image super-resolution: Dataset and study. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. (2017) 126–135
44. Timofte, R., Agustsson, E., Van Gool, L., Yang, M.H., Zhang, L.: NTIRE 2017 challenge on single image super-resolution: Methods and results. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. (2017) 114–125
45. Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al.: TensorFlow: A system for large-scale machine learning. In: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI). (2015) 1–54
51. Yang, J., Wright, J., Huang, T.S., Ma, Y.: Image super-resolution via sparse representation. IEEE Transactions on Image Processing 19