Unsupervised Image Super-Resolution using Cycle-in-Cycle Generative Adversarial Networks
Yuan Yuan*, Siyuan Liu*, Jiawei Zhang, Yongbing Zhang, Chao Dong, Liang Lin

SenseTime Research; Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen University; Graduate School at Shenzhen, Tsinghua University, Shenzhen; Department of Automation, Tsinghua University, Beijing
Abstract
We consider the single image super-resolution problem in a more general case where the low-/high-resolution pairs and the down-sampling process are unavailable. Different from the traditional super-resolution formulation, the low-resolution input is further degraded by noise and blurring. This complicated setting makes supervised learning and accurate kernel estimation impossible. To solve this problem, we resort to unsupervised learning without paired data, inspired by the recent successful image-to-image translation applications. With generative adversarial networks (GAN) as the basic component, we propose a Cycle-in-Cycle network structure to tackle the problem within three steps. First, the noisy and blurry input is mapped to a noise-free low-resolution space. Then the intermediate image is up-sampled with a pre-trained deep model. Finally, we fine-tune the two modules in an end-to-end manner to get the high-resolution output. Experiments on NTIRE2018 datasets demonstrate that the proposed unsupervised method achieves comparable results as the state-of-the-art supervised models.
1. Introduction
Recent deep learning based super-resolution (SR) methods have achieved significant improvement either on PSNR values [8, 12, 13, 16, 17, 25, 28, 30] or on visual quality [16, 20]. These methods require supervised learning on high-resolution (HR) and low-resolution (LR) image pairs. However, their common assumption that the downscaling factor is known and the input image is noise-free hinders them from practical usage.

* Yuan Yuan and Siyuan Liu are co-first authors. This work was done when they were interns at SenseTime. Contact email: [email protected]
[Figure 1 panels: Ground Truth, Bicubic, EDSR [17], BM3D+EDSR, CinCGAN; PSNR/SSIM: 29.42/0.82, 28.95/0.76, 30.94/0.91, 31.01/0.92]
Figure 1. ×4 super-resolution results of the proposed CinCGAN method for "0896" (DIV2K). For comparison, the sub-figures are cropped from the results of existing algorithms. When the input is noisy, the results of bicubic interpolation and the EDSR [17] model are both of low quality, while CinCGAN learns to reconstruct a clean result with fine details. BM3D+EDSR means using BM3D for denoising first and then using EDSR for super-resolution.

In real-world scenarios, the SR problem often has the following properties: 1) HR datasets are unavailable, 2) the downscaling method is unknown, 3) input LR images are noisy and blurry. This problem is extremely difficult if the input images suffer from different kinds of degradation. For an easier case, in this study, we assume that input images are degraded with the same processing, which is complex and unavailable.

Under the above circumstances, models learned from synthetic data tend to generate results similar to traditional methods [13, 30] or even simple interpolation. In Fig. 1, we show the results of bicubic interpolation and the state-of-the-art deep learning model, EDSR [17], with a noisy input. This is mainly due to the data bias between training and testing images. A detailed survey and analysis of deep learning based methods on real data can be found in [15].

As an alternative choice, blind SR methods [7, 19, 29] deal with real-world data by estimating the down-sampling kernel from internal or external similar patches. However, when the input is noisy, the down-sampling kernel cannot be accurately estimated, and the inverse mapping results are accompanied by amplified noise. There are also works attempting to restore LR images with additive Gaussian noise [34]. But real-world noise may neither be additive nor follow the standard Gaussian distribution, making noise estimation infeasible. More generally, LR images may suffer from complex noise, blur and non-uniform down-sampling kernels, which fail almost all existing blind SR methods.

Inspired by the development of unsupervised learning in image-to-image translation, such as CycleGAN [35] or WESPE [9], we intend to investigate unsupervised strategies to overcome this obstacle. In CycleGAN, images are translated between different domains with unpaired training data. They assume that the input image is of the same size as the output image, with only a difference in style. However, in SR, output images are several times larger than the inputs, making the direct application of CycleGAN impossible. Further, using a bicubic-upsampled image as the input also cannot obtain satisfactory results. The SR problem is specific as it requires a high quality output, not just a different style.

After exploring several training strategies, we find an effective Cycle-in-Cycle structure, named CinCGAN, which achieves superior results. The whole pipeline consists of two CycleGANs, where the second GAN covers the first one (see Fig. 2). The first CycleGAN maps the LR image to the clean and bicubic-downsampled LR space. This module ensures that the LR input is fairly denoised/deblurred. We then stack another well-trained deep model with the bicubic-downsampling assumption to up-sample the intermediate result to the desired size. Finally, we fine-tune the whole network using adversarial learning in an end-to-end manner.
We conduct experiments on the NTIRE2018 Super-Resolution Challenge dataset (https://competitions.codalab.org/competitions/18024), and show that the proposed Cycle-in-Cycle structure is much more stable during training and achieves competitive performance compared with supervised deep learning methods.

The contributions of this work are three-fold: 1) We study a more general super-resolution problem, where the high-resolution ground truth, down-sampling kernel and degradation function are unavailable. 2) We explore several unsupervised training strategies under the above assumption, and show that the super-resolution task is different from conventional image-to-image translation. 3) We propose a Cycle-in-Cycle structure that achieves comparable results to supervised CNN networks.
2. Related work
Single image super-resolution (SISR) has been widely studied for decades. Early approaches either rely on natural image statistics [33] [13] or pre-defined models [10] [5] [26]. Later, mapping functions between LR images and HR images were investigated, such as sparse coding based SR methods [30] [32].

Recently, deep convolutional neural networks (CNN) have shown explosive popularity and a powerful capability to improve the quality of SR results. Ever since Dong et al. [3] first proposed using a CNN for SR and achieved the state-of-the-art performance, plenty of CNN architectures have been studied for SISR. Inspired by the VGG [24] networks used for ImageNet classification, Kim et al. [12] present a very deep network (VDSR) that learns a residual image. To accelerate SR, FSRCNN [4] and ESPCN [23] extract feature maps in the low-resolution space and up-sample the image at the last layer by transposed convolution and sub-pixel convolution, respectively. All the above mentioned CNN based SR methods aim at minimizing the mean-square error (MSE) between the reconstructed HR image and the ground truth. Based on the observation that minimizing MSE makes the SR results overly smooth, SRGAN [16] combines an adversarial loss [6] and a perceptual loss [24] [11] as the final objective function, and generates visually pleasing images which contain more high-frequency details than the MSE-loss based methods. The champion of the NTIRE2017 Super-Resolution Challenge [27], EDSR [17], employs deeper and wider networks to achieve the state-of-the-art performance by removing unnecessary modules in SRResNet [16].
Although a lot of works focus on SR problems with known degradation/downsampling kernels, few works try to solve blind SR, where the degradation operation from HR images to LR images is unavailable. Estimating the degradation/blur kernel is an essential step for blind SR. Wang et al. [29] propose a probabilistic framework combined with an image co-occurrence prior to estimate the unknown point spread function (PSF) parameters. Based on the property that small image patches re-appear in natural images, Michaeli and Irani [19] present a method that is able to estimate the optimal blur kernel. Another relevant work [21] introduces a convolution consistency constraint and bi-l0-l2-norm regularization [22] to guide the blur kernel estimation process, achieving state-of-the-art blind SR performance. In this work, we investigate how deep learning can be beneficial for addressing blind SR problems.

Existing supervised deep learning methods cannot handle blind SR without LR-HR image pairs. In real-world scenarios, where paired data is unavailable, it is essential to find a way to realize unsupervised learning. Recent work on GAN [6] provides a feasible solution, which includes a generator and a discriminator. The generator tries to generate fake images to fool the discriminator, while the discriminator aims at distinguishing the generated results from real data. GAN is widely used to solve unsupervised learning problems. DualGAN [31] and CycleGAN [35] are two works on image-to-image translation using unsupervised learning, and both of them present an interesting network structure that contains a pair of forward and inverse generators. The forward generator maps domain X to domain Y, while the inverse generator maps the output back to domain X to maintain cycle consistency. Ignatov et al. [9] use a similar architecture to design a weakly supervised photo enhancer (WESPE) that translates ordinary photos to DSLR-quality images.

Different from the proposed method, both DualGAN [31] and CycleGAN [35] deal with input and output images of the same size, while SR requires output images several times larger than the inputs. Utilizing the property of cycle consistency, we present a Cycle-in-Cycle GAN (CinCGAN) to super-resolve LR images of which the degradation operators are unknown. Our method achieves comparable performance with state-of-the-art supervised CNN based algorithms [4, 16, 17].
3. Proposed Method
Problem formulation
The conventional formulation of SISR [30] is x = SHz + n, where x and z denote the LR and HR image respectively, SH represents the down-sampling and blurring matrix, and n is the additive noise. Blind SR methods [19, 29] follow the same assumption, only with an unknown SH. In this work, we study a more general formulation,

x = f_n(f_d(z)) + n,

where f_d is the down-sampling process and f_n is a degradation function that may introduce complex noise, shift and blur. Here, we assume that f_d, f_n and paired HR-LR training data are unavailable. Nevertheless, we can obtain a set of LR images that can be used for analysis and unsupervised training.
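To make the formulation concrete, the sketch below simulates one plausible instance of x = f_n(f_d(z)) + n in PyTorch. The Gaussian blur, pixel shift and noise level here are illustrative assumptions only; in our setting f_d, f_n and n are unknown and are never assumed by the method.

```python
# Illustrative sketch of the general degradation x = f_n(f_d(z)) + n.
# The blur kernel, shift and noise level are hypothetical stand-ins: the actual
# f_d and f_n in the track 2 data are unknown and are not used during training.
import torch
import torch.nn.functional as F

def bicubic_downsample(z, scale=4):
    # f_d: bicubic down-sampling by the scale factor.
    return F.interpolate(z, scale_factor=1.0 / scale, mode='bicubic', align_corners=False)

def gaussian_kernel(size=7, sigma=1.5):
    ax = torch.arange(size, dtype=torch.float32) - (size - 1) / 2.0
    g = torch.exp(-(ax ** 2) / (2 * sigma ** 2))
    k = g[:, None] * g[None, :]
    return (k / k.sum()).expand(3, 1, size, size).contiguous()  # one kernel per RGB channel

def degrade(z, sigma_blur=1.5, noise_std=0.03, scale=4):
    # x = f_n(f_d(z)) + n, with f_n approximated by a Gaussian blur and a small shift.
    y = bicubic_downsample(z, scale)                                  # f_d
    k = gaussian_kernel(sigma=sigma_blur)
    y = F.conv2d(F.pad(y, (3, 3, 3, 3), mode='reflect'), k, groups=3)  # blur part of f_n
    y = torch.roll(y, shifts=(1, 1), dims=(2, 3))                     # toy pixel shift
    n = noise_std * torch.randn_like(y)                               # additive noise n
    return (y + n).clamp(0, 1)

z = torch.rand(1, 3, 128, 128)   # a stand-in HR image in [0, 1]
x = degrade(z)                    # degraded LR observation at 1/4 resolution
```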
Motivation

1) Why apply unsupervised training? As the down-sampling and degradation functions are complex and coupled, it is hard to perform accurate estimation as traditional blind SR methods do [19, 29]. The unavailability of HR images in practice also makes supervised training with simulated paired data impractical. This drives us to explore unsupervised learning strategies. 2) What is the difference between SR and image-to-image translation? SR accepts an LR image and outputs an HR image with much larger resolution. Further, SR requires the output to be of high quality, not just of a different style. If we directly apply image-to-image translation methods, we need to up-sample the LR image first by interpolation, which also enlarges the noisy patterns. Directly applying existing methods like CycleGAN cannot remove such amplified noise, and training becomes very unstable. Experiments (in Sec. 4.4) also show that when the degradation function varies from image to image, it is difficult to deal with all kinds of images in a single forward pass.
Solution pipeline
Our solution pipeline consists of three steps. First, we learn a mapping from an LR image set X to a "clean" LR image set Y, where images are noise-free and down-sampled from HR images Z with the bicubic kernel. In other words, we deblur and denoise the input images at low resolution. Second, we adopt an existing SR model to super-resolve the intermediate results to the desired resolution. In the end, we combine and fine-tune these two models simultaneously to get the final HR images.

Under the guidance of the above pipeline, we propose a Cycle-in-Cycle structure named CinCGAN as shown in Fig. 2. To be specific, we adopt two coupled CycleGANs to learn the mapping from X to Y and from Y to Z, respectively. Unpaired images x_i ∈ X, y_j ∈ Y and z_j ∈ Z are used for training, where y_j is down-sampled from z_j with the bicubic kernel. For simplicity, we omit the subscripts i and j in the following. Details are given below.

Figure 2. The framework of the proposed CinCGAN, where G_1, G_2 and G_3 are generators and SR is a super-resolution network. D_1 and D_2 are discriminators. G_1, G_2 and D_1 compose the first LR → clean LR CycleGAN model, mapping the degraded LR images to clean LR images. G_1, SR, G_3 and D_2 compose the second LR → HR CycleGAN model, mapping the LR images to HR images.

The framework of the first CycleGAN that maps an LR image x to a clean LR image y is shown as LR → clean LR in Fig. 2. Given an input image x, the generator G_1 learns to generate an image ỹ that looks similar to the clean LR y, so as to fool the discriminator D_1. Meanwhile, D_1 learns to distinguish the generated sample G_1(x) from the real sample y. To stabilize the training procedure, we use the least-square loss [18] instead of the negative log-likelihood used in [6]. The generator adversarial loss is

L_{GAN}^{LR} = \frac{1}{N} \sum_{i=1}^{N} \| D_1(G_1(x_i)) - 1 \|^2 ,    (1)

where N is the number of training samples. To maintain consistency between input and output, we add a network G_2 and let x' = G_2(G_1(x)) be identical to the input x. Hence, we also use a cycle consistency loss

L_{cyc}^{LR} = \frac{1}{N} \sum_{i=1}^{N} \| G_2(G_1(x_i)) - x_i \|^2 .    (2)

In the previous work [35], the authors introduce an identity loss to preserve the color composition between input and output images when they work on painting generation. They claim that the identity loss helps preserve the color of input images. In image SR, we also need to avoid color variation among different iterations, thus we add an identity loss

L_{idt}^{LR} = \frac{1}{N} \sum_{i=1}^{N} \| G_1(y_i) - y_i \|^2 .    (3)

In addition, we add a total variation (TV) loss to impose spatial smoothness,

L_{TV}^{LR} = \frac{1}{N} \sum_{i=1}^{N} \left( \| \nabla_h G_1(x_i) \| + \| \nabla_w G_1(x_i) \| \right) ,    (4)

where ∇_h and ∇_w compute the horizontal and vertical gradients of G_1(x_i).

In summary, the final objective for the LR → clean LR model is a weighted sum of the four losses:

L_{total}^{LR} = L_{GAN}^{LR} + w_1 L_{cyc}^{LR} + w_2 L_{idt}^{LR} + w_3 L_{TV}^{LR} ,    (5)

where w_1, w_2 and w_3 are the weights of the different losses.
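As a concrete reference, the sketch below assembles the four LR-domain terms of Eqs. (1)-(5) for one batch. G1, G2 and D1 stand for the network modules described in Fig. 3; the use of squared-error norms and an L1-style TV term is an implementation assumption of this sketch.

```python
# Minimal sketch of the LR -> clean LR objective of Eqs. (1)-(5), generator side only.
# x: batch of degraded LR inputs; y: batch of unpaired clean LR images.
import torch

def tv_loss(img):
    # Eq. (4): mean horizontal and vertical gradient magnitudes of the generated image.
    dh = (img[:, :, 1:, :] - img[:, :, :-1, :]).abs().mean()
    dw = (img[:, :, :, 1:] - img[:, :, :, :-1]).abs().mean()
    return dh + dw

def lr_total_loss(G1, G2, D1, x, y, w1, w2, w3):
    """Eq. (5): weighted sum of the adversarial, cycle, identity and TV terms."""
    y_fake = G1(x)
    loss_gan = ((D1(y_fake) - 1.0) ** 2).mean()   # Eq. (1), least-square GAN loss
    loss_cyc = ((G2(y_fake) - x) ** 2).mean()     # Eq. (2), cycle consistency
    loss_idt = ((G1(y) - y) ** 2).mean()          # Eq. (3), identity on clean LR inputs
    loss_tv = tv_loss(y_fake)                     # Eq. (4), spatial smoothness
    return loss_gan + w1 * loss_cyc + w2 * loss_idt + w3 * loss_tv
```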
We then investigate how to super-resolve the intermediate image ỹ to the desired size. Recently, the enhanced deep residual network EDSR [17] won the first prize in the NTIRE 2017 challenge on single image super-resolution [1]. For simplicity, we directly adopt EDSR as the SR network stacked after G_1. Similarly, we use a discriminator D_2 for adversarial training of both the G_1 and SR networks.

We also utilize another generator G_3 to ensure cycle consistency between x and the reconstructed x''. The GAN loss, cycle loss and TV loss for the LR → HR network are formulated as follows:

L_{GAN}^{HR} = \frac{1}{N} \sum_{i=1}^{N} \| D_2(SR(G_1(x_i))) - 1 \|^2 ,    (6)

L_{cyc}^{HR} = \frac{1}{N} \sum_{i=1}^{N} \| G_3(SR(G_1(x_i))) - x_i \|^2 ,    (7)

L_{TV}^{HR} = \frac{1}{N} \sum_{i=1}^{N} \left( \| \nabla_h SR(G_1(x_i)) \| + \| \nabla_w SR(G_1(x_i)) \| \right) .    (8)

For the identity loss, instead of maintaining the tint consistency between input and output, we consider ensuring that the SR network can generate super-resolved images of adequate quality. We define a new identity loss as

L_{idt}^{HR} = \sum_{i} \| SR(z'_i) - z_i \|^2 ,    (9)

where z' is down-sampled from z with the bicubic kernel. This L_{idt}^{HR} keeps the SR network from betraying its original ambition, such that the produced z̃ remains a reasonable SR result.

To sum up, the total loss for fine-tuning the LR to HR networks is

L_{total}^{HR} = L_{GAN}^{HR} + λ_1 L_{cyc}^{HR} + λ_2 L_{idt}^{HR} + λ_3 L_{TV}^{HR} ,    (10)

where λ_i, for i = 1, 2, 3, are the weights of each loss.

Figure 3. The generators G_1, G_2 and G_3 share the same framework as (a), and the discriminators D_1 and D_2 share the same framework as (b). For the 2nd and 3rd convolution layers in generator (a), k3n64s1 is used for G_1 and G_2, while k4n64s2 is used for G_3. For the first three convolution layers in discriminator (b), k4n64s1, k4n128s1 and k4n256s1 are used for D_1, and k4n64s2, k4n128s2 and k4n256s2 are used for D_2. Please see the text for details.

The architectures of the generators G_1, G_2, G_3 and the discriminators D_1, D_2 are shown in Fig. 3. We adopt a similar architecture to the work of Zhu et al. [35], which has shown impressive results for unpaired image-to-image translation. Here, "Conv" means a convolution layer, where a LeakyReLU layer with negative slope 0.2 is added right after it, except for the last convolution layer (we omit it for simplicity). "BN" means a batch normalization layer. The numbers after the symbols k, n and s represent the kernel size, the number of filters and the stride, respectively. For example, k3n64s1 refers to a convolution layer that contains 64 filters, of which the spatial size is 3 and the stride is 1.

For the generators G_1 and G_2, we use 3 convolution layers at the head and tail, and 6 residual blocks in the middle. The generator G_3 shares the same architecture as G_1 and G_2, except for the 2nd and 3rd convolution layers, where the stride is set to 2 to perform down-sampling. As to the discriminators, we use a 70 × 70 PatchGAN for D_2. Since we up-sample LR images with a scale factor of 4, the size of the input LR images is usually less than 70 (we use 32 × 32 LR images and 128 × 128 HR images for training). Hence, we modify the stride of the first three convolution layers to 1 for discriminator D_1, such that the receptive field of D_1 is reduced to 16 × 16.
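For reference, the following sketch mirrors the Fig. 3 templates: a generator with a 3-convolution head and tail and 6 residual blocks, and a PatchGAN-style discriminator whose first three strides are configurable (stride 1 for D_1 on small LR patches, stride 2 for D_2). The kernel size of the outermost convolutions, the padding and the normalization placement are our assumptions; only the k/n/s settings follow the description above.

```python
import torch.nn as nn

class ResBlock(nn.Module):
    # One residual block of the generator body (Fig. 3(a)).
    def __init__(self, ch=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, 1, 1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(ch, ch, 3, 1, 1))
    def forward(self, x):
        return x + self.body(x)

def make_generator(mid_stride=1):
    # mid_stride=1 reproduces k3n64s1 for G1/G2; mid_stride=2 reproduces k4n64s2 for G3,
    # which down-samples its input by 4 over the 2nd and 3rd convolutions.
    k = 3 if mid_stride == 1 else 4
    layers = [nn.Conv2d(3, 64, 7, 1, 3), nn.LeakyReLU(0.2, True),            # head
              nn.Conv2d(64, 64, k, mid_stride, 1), nn.LeakyReLU(0.2, True),
              nn.Conv2d(64, 64, k, mid_stride, 1), nn.LeakyReLU(0.2, True)]
    layers += [ResBlock() for _ in range(6)]                                  # 6 residual blocks
    layers += [nn.Conv2d(64, 64, 3, 1, 1), nn.LeakyReLU(0.2, True),           # tail
               nn.Conv2d(64, 64, 3, 1, 1), nn.LeakyReLU(0.2, True),
               nn.Conv2d(64, 3, 7, 1, 3)]                                     # no activation on the last conv
    return nn.Sequential(*layers)

def make_discriminator(first_stride=2):
    # first_stride=2 gives the usual 70x70 PatchGAN (D2); first_stride=1 shrinks the
    # receptive field to 16x16 for D1, which sees 32x32 clean LR patches.
    layers, ch = [], 3
    for n in (64, 128, 256):
        layers += [nn.Conv2d(ch, n, 4, first_stride, 1), nn.BatchNorm2d(n),
                   nn.LeakyReLU(0.2, True)]
        ch = n
    layers += [nn.Conv2d(256, 512, 4, 1, 1), nn.BatchNorm2d(512), nn.LeakyReLU(0.2, True),
               nn.Conv2d(512, 1, 4, 1, 1)]   # patch-wise real/fake score map
    return nn.Sequential(*layers)
```

In this sketch, G_1 and G_2 would be built with make_generator(1), G_3 with make_generator(2), and D_1 and D_2 with make_discriminator(1) and make_discriminator(2), respectively.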
4. Experiments
In this section, we first introduce the dataset and the details used for training. We then evaluate the performance of the proposed CinCGAN model by comparing it with several state-of-the-art SISR methods. Finally, we perform an ablation study to validate the advantages of CinCGAN.

4.1. Training data
We take the track 2 dataset from the NTIRE2018 Super-Resolution Challenge for training. The challenge aims to restore an HR image given a degraded LR image. It provides a high-quality image dataset, DIV2K [1], which contains 800 training images and 100 validation images. The DIV2K dataset covers almost all kinds of natural scenarios: buildings (indoor and outdoor), forests, lakes, animals, people, etc. The track 2 dataset is degraded from the DIV2K dataset with down-sampling, blurring, pixel shifting and noise. Although the parameters of the degradation operators are fixed for all images, the blur kernels are randomly generated and the resulting pixel shifts vary from image to image. Hence, the degradation kernels of images in the track 2 dataset are unknown and diverse.

Since our purpose is to train a network in an unsupervised manner without paired LR-HR data, we take the first 400 images (numbered from 1 to 400) from the training LR set as input images X, and the other 400 images (numbered from 401 to 800) from the HR set as the desired HR images Z. The intermediate clean LR images Y are directly bicubic down-sampled from Z. Similar to [4] [24], we augment the data with 90-degree rotation and flipping. Our experiments are performed with a scaling factor of ×4. We randomly crop X and Y with size 32 × 32 and crop Z with size 128 × 128. We conduct testing on the provided 100 validation images. Note that, although DIV2K contains a paired training dataset, we do not use paired data for supervised training.

We divide our training process into two steps. We first train the models G_1, G_2 and D_1 for mapping LR images to clean LR images (shown as LR → clean LR in Fig. 2). The three parameters in (5) are set to w_1 = 10, w_2 = 5 and w_3 = 0.5, respectively. We train our model with the Adam optimizer [14], without weight decay. The learning rate is initialized to 2 × 10^{-4} and then decreased by a factor of 2 every 40000 iterations. The weights of the filters in each layer are initialized using a normal distribution and the batch size is set to 16. We train the model for over 400000 iterations, until it converges.

We then jointly fine-tune the LR to HR model (shown as LR → HR in Fig. 2). We initialize our SR network with the publicly available EDSR model (https://github.com/thstkdgus35/EDSR-PyTorch). We set the parameters in (10) to λ_1 = 10, λ_2 = 5 and λ_3 = 2. The optimizer is set almost the same as for training the LR → clean LR model, except that we initialize the learning rate to 10^{-4}. As to the weight of the identity loss L_{idt}^{LR} in (5), we set w_2 = 1. At each iteration, we update (5) and (10) in turn. We first train G_1 and G_2 to update the LR → clean LR network. We then train G_1, SR and G_3 simultaneously to update the LR → HR network.

We implement the proposed networks with PyTorch and train them on an Nvidia Tesla K80 GPU. It takes about 1 day to pre-train the LR → clean LR model and about 2 days to jointly fine-tune the LR → HR model.
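The alternating update schedule described above can be summarized by the per-iteration sketch below. It reuses the lr_total_loss helper sketched in Sec. 3; hr_total_loss (Eq. (10)), the data loader and the four Adam optimizers are assumed to be defined analogously, so this is a schematic of the update order rather than the full training script.

```python
# Schematic of one CinCGAN iteration: update the LR -> clean LR model (Eq. (5)) and
# its discriminator D1 first, then the LR -> HR model (Eq. (10)) and discriminator D2.
import torch

def lsgan_d_loss(D, fake, real):
    # Least-square discriminator loss: push real samples toward 1 and fakes toward 0.
    return ((D(real) - 1.0) ** 2).mean() + (D(fake) ** 2).mean()

def cincgan_iteration(nets, opts, batch, w, lam):
    """nets = (G1, G2, G3, SR, D1, D2); opts = four Adam optimizers;
    w and lam are the loss weights of Eq. (5) and Eq. (10)."""
    G1, G2, G3, SR, D1, D2 = nets
    opt_g_lr, opt_d1, opt_g_hr, opt_d2 = opts
    x, y, z = batch                      # degraded LR, unpaired clean LR, unpaired HR crops

    # Step 1: LR -> clean LR generators, then discriminator D1.
    opt_g_lr.zero_grad()
    lr_total_loss(G1, G2, D1, x, y, *w).backward()
    opt_g_lr.step()
    opt_d1.zero_grad()
    lsgan_d_loss(D1, G1(x).detach(), y).backward()
    opt_d1.step()

    # Step 2: LR -> HR generators G1, SR and G3, then discriminator D2.
    opt_g_hr.zero_grad()
    hr_total_loss(G1, G3, SR, D2, x, z, *lam).backward()
    opt_g_hr.step()
    opt_d2.zero_grad()
    lsgan_d_loss(D2, SR(G1(x)).detach(), z).backward()
    opt_d2.step()
```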
We compare the performance of the proposed CinCGAN model with several state-of-the-art SISR methods: FSRCNN [4], EDSR [17] and SRGAN [16]. We use the publicly available FSRCNN and EDSR models, which are trained with paired LR and HR images, where the inputs are clean LR images down-sampled from HR images. To make the results more comparable, we also fine-tune EDSR and SRGAN (labelled as EDSR+ and SRGAN+, respectively) with the paired track 2 dataset. To emphasize the effectiveness of the CinCGAN structure, we also try to first denoise the input LR images and then super-resolve the denoised images for comparison. BM3D [2] is one of the state-of-the-art image denoising approaches, which is an efficient and powerful denoiser. Hence, we pre-process the test LR images with BM3D first, and then super-resolve them using EDSR (labelled as BM3D+EDSR).

Table 1 shows the average PSNR and SSIM values of the restored test images. It shows that FSRCNN and EDSR cannot work well if the blur and noise are unknown during training. After fine-tuning on the paired track 2 dataset, EDSR+ and SRGAN+ improve their results, and our method works comparably against SRGAN+ in terms of PSNR and SSIM without paired training data. Although BM3D can remove noise, it also over-smooths the input images. The PSNR and SSIM values of BM3D+EDSR are lower than those of the proposed method. Several subjective results are illustrated in Fig. 4.

To validate the advantages of the proposed CinCGAN model for the unsupervised SISR problem, we design several other network structures for comparison.
[Figure 4 panels, left to right: (a) ground truth, (b) bicubic, (c) EDSR+ [17], (d) SRGAN+ [16], (e) BM3D+EDSR, (f) CinCGAN (ours). PSNR/SSIM for panels (b)-(f), by row: 23.22/0.64, 26.23/0.68, 24.06/0.58, 23.06/0.65, 24.83/0.65; 22.25/0.68, 29.06/0.75, 27.36/0.68, 22.18/0.72, 27.95/0.72; 26.81/0.83, 30.28/0.88, 29.05/0.85, 26.84/0.86, 28.26/0.84]

Figure 4. Super-resolution results of "0801", "0816" and "0853" (DIV2K) with scale factor ×4. EDSR+ and SRGAN+ are trained on the paired NTIRE2018 track 2 dataset. BM3D+EDSR means using BM3D for denoising first and then using EDSR for super-resolution. The proposed CinCGAN model shows comparable results with SRGAN+ and is better than the BM3D+EDSR method.

Table 1. Quantitative evaluation of the proposed CinCGAN model on the NTIRE 2018 track 2 dataset, in terms of PSNR and SSIM.

Method            PSNR   SSIM
bicubic           22.85  0.65
FSRCNN [4]        22.79  0.61
EDSR [17]         22.67  0.62
EDSR+             25.77  0.71
SRGAN+ [16]       24.33  0.67
BM3D+EDSR         22.88  0.68
CinCGAN (ours)    24.33  0.69

Structure 1

The first structure restores LR images X to HR images Z using only one CycleGAN, i.e., it denoises, deblurs and super-resolves the LR images at the same time. The structure of the model is shown in Fig. 5(a), where we feed an LR image x to the SR network directly. Correspondingly, we only minimize the total loss L_{total}^{HR} (replacing SR(G_1(·)) with SR(·) in Eqs. (6), (7) and (8)). However, during the training procedure, we found that the results z̃ are always unstable and contain a lot of undesired artifacts, as shown in Fig. 6(a). It is hard for a single network to simultaneously denoise, deblur and up-sample the degraded images, especially when the degradation kernels differ from image to image and the learning is unsupervised.
Structure 2
For our second experiment, we remove D_2 and G_3 from the proposed CinCGAN model. We map the input LR images to a set of clean LR images using the same LR → clean LR networks shown in Fig. 2; we then super-resolve the converted LR images directly using the SR network. The whole structure is shown in Fig. 5(b), and the corresponding result is illustrated in Fig. 6(b). As we can see, some noise that is negligible in the resulting clean LR images is magnified and becomes visible in the super-resolved images, which affects the visual quality.

Figure 5. Experiments for validating the advantages of the proposed structure. (a) Structure 1: transform the LR images x to HR images z directly with one CycleGAN model; (b) Structure 2: remove D_2 and G_3 from the proposed CinCGAN model; (c) Structure 3: remove D_1 and G_2 from the proposed CinCGAN model.

Figure 6. Super-resolution results of "0829" (DIV2K) with scale factor ×4, for each structure described in Fig. 5: (a) Structure 1, (b) Structure 2, (c) Structure 3, (d) CinCGAN (ours), (e) ground truth.

Structure 3
Our third experiment is performed by removing D_1 and G_2 from the proposed CinCGAN model, as shown in Fig. 5(c). We use one CycleGAN for the LR to HR model, where we take G_1 + SR as the forward network and G_3 as the inverse network. D_2 is used for distinguishing z̃ from z. We load the pre-trained G_1 (from the LR → clean LR networks) and the downloaded EDSR model for initialization. Experimental results in Fig. 6(c) show that the resulting z̃ is still noisy. Without the L_{cyc}^{LR} and L_{GAN}^{LR} constraints on the G_1 network (L_{idt}^{LR} and L_{TV}^{LR} are still used for this model), G_1 is unable to denoise and deblur, and the whole model becomes similar to Structure 1.

Proposed Method
We then propose our final solution as shown in Fig. 2: jointly fine-tuning the LR to HR networks with CinCGAN. We sequentially update the LR → clean LR and the LR → HR models. With the two constraints L_{total}^{LR} and L_{total}^{HR}, the G_1 network can denoise and deblur the degraded input image x, while the SR network can up-sample as well as further restore the resulting intermediate image ỹ. The final SR image is shown in Fig. 6(d), which shows the best visual result compared with the other three structures.
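At test time, the inverse generators and the discriminators are discarded and only the forward path is used; a minimal sketch, assuming the fine-tuned G_1 and SR modules are already loaded, is:

```python
import torch

@torch.no_grad()
def super_resolve(G1, SR, x):
    # Forward path of CinCGAN at inference: clean/deblur the degraded LR input with G1,
    # then up-sample the intermediate result by x4 with the fine-tuned SR (EDSR) network.
    G1.eval()
    SR.eval()
    y_clean = G1(x)
    return SR(y_clean)
```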
5. Conclusions
We investigate the single image super-resolution problem under a more general assumption: the low-/high-resolution image pairs and the down-sampling process are unavailable. Inspired by the recent successful image-to-image translation applications, we resort to unsupervised learning methods to solve this problem. Using generative adversarial networks (GAN), the proposed method contains two CycleGANs, where the second GAN covers the first one. The solution pipeline consists of three steps. First, we map the input LR images to the clean and bicubic-downsampled LR space with the first CycleGAN. We then stack another well-trained deep model with the bicubic-downsampling assumption to up-sample the intermediate result to the desired size. Finally, we fine-tune the two modules in an end-to-end manner to get the high-resolution output. Experimental results demonstrate that the proposed unsupervised method achieves comparable results as the state-of-the-art supervised models.
Acknowledgement.
This work is supported by SenseTime Group Limited and in part by the Projects of National Science Foundations of China (61571254), Guangdong Special Support Plan (2015TQ01X16), and Shenzhen Fundamental Research Fund (JCYJ20160513103916577).

References

[1] E. Agustsson and R. Timofte. NTIRE 2017 challenge on single image super-resolution: Dataset and study. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on, pages 1122–1131. IEEE, 2017.
[2] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian. BM3D image denoising with shape-adaptive principal component analysis. In SPARS'09 - Signal Processing with Adaptive Sparse Structured Representations, 2009.
[3] C. Dong, C. C. Loy, K. He, and X. Tang. Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2):295–307, 2016.
[4] C. Dong, C. C. Loy, and X. Tang. Accelerating the super-resolution convolutional neural network. In European Conference on Computer Vision, pages 391–407. Springer, 2016.
[5] R. Fattal. Image upsampling via imposed edge statistics. In ACM Transactions on Graphics (TOG), volume 26, page 95. ACM, 2007.
[6] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[7] Y. He, K.-H. Yap, L. Chen, and L.-P. Chau. A soft MAP framework for blind super-resolution image reconstruction. Image and Vision Computing, 27(4):364–373, 2009.
[8] J.-B. Huang, A. Singh, and N. Ahuja. Single image super-resolution from transformed self-exemplars. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5197–5206, 2015.
[9] A. Ignatov, N. Kobyshev, R. Timofte, K. Vanhoey, and L. Van Gool. WESPE: Weakly supervised photo enhancer for digital cameras. arXiv preprint arXiv:1709.01118, 2017.
[10] M. Irani and S. Peleg. Improving resolution by image registration. CVGIP: Graphical Models and Image Processing, 53(3):231–239, 1991.
[11] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694–711. Springer, 2016.
[12] J. Kim, J. K. Lee, and K. M. Lee. Accurate image super-resolution using very deep convolutional networks. In Computer Vision and Pattern Recognition, pages 1646–1654, 2016.
[13] K. I. Kim and Y. Kwon. Single-image super-resolution using sparse regression and natural image prior. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(6):1127–1133, 2010.
[14] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[15] T. Köhler, M. Bätz, F. Naderi, A. Kaup, A. K. Maier, and C. Riess. Benchmarking super-resolution algorithms on real data. arXiv preprint arXiv:1709.04881, 2017.
[16] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. arXiv preprint, 2016.
[17] B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee. Enhanced deep residual networks for single image super-resolution. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, volume 1, page 3, 2017.
[18] X. Mao, Q. Li, H. Xie, R. Y. Lau, and Z. Wang. Multi-class generative adversarial networks with the L2 loss function. CoRR, abs/1611.04076, 2016.
[19] T. Michaeli and M. Irani. Nonparametric blind super-resolution. In Computer Vision (ICCV), 2013 IEEE International Conference on, pages 945–952. IEEE, 2013.
[20] M. S. Sajjadi, B. Schölkopf, and M. Hirsch. EnhanceNet: Single image super-resolution through automated texture synthesis. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 4501–4510. IEEE, 2017.
[21] W.-Z. Shao and M. Elad. Simple, accurate, and robust nonparametric blind super-resolution. In International Conference on Image and Graphics, pages 333–348. Springer, 2015.
[22] W.-Z. Shao, H.-B. Li, and M. Elad. Bi-l0-l2-norm regularization for blind motion deblurring. Journal of Visual Communication and Image Representation, 33:42–59, 2015.
[23] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1874–1883, 2016.
[24] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[25] A. Singh and N. Ahuja. Super-resolution using sub-band self-similarity. In Asian Conference on Computer Vision, pages 552–568. Springer, 2014.
[26] J. Sun, Z. Xu, and H.-Y. Shum. Image super-resolution using gradient profile prior. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE, 2008.
[27] R. Timofte, E. Agustsson, L. Van Gool, M.-H. Yang, L. Zhang, B. Lim, S. Son, H. Kim, S. Nah, K. M. Lee, et al. NTIRE 2017 challenge on single image super-resolution: Methods and results. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on, pages 1110–1121. IEEE, 2017.
[28] R. Timofte, V. De Smet, and L. Van Gool. A+: Adjusted anchored neighborhood regression for fast super-resolution. In Asian Conference on Computer Vision, pages 111–126. Springer, 2014.
[29] Q. Wang, X. Tang, and H. Shum. Patch based blind image super resolution. In Computer Vision, 2005. ICCV 2005. Tenth IEEE International Conference on, volume 1, pages 709–716. IEEE, 2005.
[30] J. Yang, J. Wright, T. S. Huang, and Y. Ma. Image super-resolution via sparse representation. IEEE Transactions on Image Processing, 19(11):2861–2873, 2010.
[31] Z. Yi, H. Zhang, P. Tan, and M. Gong. DualGAN: Unsupervised dual learning for image-to-image translation. arXiv preprint arXiv:1704.02510, 2017.
[32] R. Zeyde, M. Elad, and M. Protter. On single image scale-up using sparse-representations. In International Conference on Curves and Surfaces, pages 711–730. Springer, 2010.
[33] H. Zhang, J. Yang, Y. Zhang, and T. S. Huang. Non-local kernel regression for image and video restoration. In European Conference on Computer Vision, pages 566–579. Springer, 2010.
[34] K. Zhang, W. Zuo, and L. Zhang. Learning a single convolutional super-resolution network for multiple degradations. arXiv preprint arXiv:1712.06116, 2017.
[35] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593, 2017.