A Generative Model for Hallucinating Diverse Versions of Super Resolution Images
Mohamed Abid, Ihsen Hedhli, Christian Gagné* — IID, Université Laval; *Canada-CIFAR AI Chair, Mila
ABSTRACT
Traditionally, the main focus of image super-resolution techniques is on recovering the most likely high-quality images from low-quality images, using a one-to-one low- to high-resolution mapping. Proceeding that way, we ignore the fact that there are generally many valid versions of high-resolution images that map to a given low-resolution image. In this work, we tackle the problem of obtaining different high-resolution versions from the same low-resolution image using Generative Adversarial Models. Our learning approach makes use of the high frequencies available in the training high-resolution images to preserve and explore, in an unsupervised manner, the structural information available within these images. Experimental results on the CelebA dataset confirm the effectiveness of the proposed method, which allows the generation of both realistic and diverse high-resolution images from low-resolution images.
Index Terms — Super-resolution, image hallucination, Generative Adversarial Models.
1. INTRODUCTION
Image super-resolution (SR) is an ill-posed problem, since each low-resolution (LR) image can have a practically infinite number of corresponding images in the high-resolution (HR) domain. Traditional SR techniques nevertheless rely on a one-to-one schema, aiming at generating (or reconstructing) only one HR image from a LR image, generally the most likely one, and thus do not account for all other possible mappings. In practice, in applications such as microscopy or medical imaging, where experts rely on SR methods, having access to more than one possible solution can lead to better strategies, where different outcomes, and even the certainty of these outcomes, are considered.

In this paper, we seek to perform a one-to-many schema by generating a diverse set of HR images out of one single LR image. For that, we propose to use a Generative Adversarial Network (GAN) that recovers an HR image from a very LR version of the same image, while being able to hallucinate a wide variety of other acceptable HR versions of this image. In the recent literature, some exploratory SR has been investigated, for example relying on SR methods that are either controlled by a user interface to manually manipulate the variance and periodicity of textures [1], or that use semantically guided style imposition [2]. The diversity generated by these methods is supervised, either by a human or by a specific criterion, in contrast to our method, where hallucination is conducted in an unsupervised way. Another approach, such as [3], searches in a pretrained GAN's manifold for a set of high-resolution images that map, when downscaled, to the same low-resolution image. While this method yields samples of great quality, it is an iterative procedure that requires a significant amount of time to generate each image. Meanwhile, our method is a one-pass prediction which works for general images and does not depend on a pretrained GAN (e.g., StyleGAN).

Methods such as [4] exploited the image's gradient as structure guidance for super-resolution. This helps the model to better learn and preserve structural and textural information in the image. We follow a similar path, but we are not limiting its use to preserving the structure of the image: we also anchor the exploration over the image-gradient manifold in order to generate other images that preserve most of the structure given by the image's gradient. The main contribution of the paper lies in the novel neural architecture that makes use of the high frequencies within the image not only to recover one HR image but also to generate a wide variety of plausible HR images that downscale to the same LR version of this image; Fig. 1 illustrates the concept of our approach.

Fig. 1. Conceptual representation of the proposed approach: starting from an original image ($I^{HR}$) and its downscaled version ($I^{LR}$), different HR images can be generated, that is, the reconstructed image ($I^{SR}$) and two hallucinated ones ($I^{H}_{z_1}$ and $I^{H}_{z_2}$).
2. APPROACH
Given a LR image $I^{LR}$, the objective of the proposed approach is twofold:
• Reconstruct the ground-truth HR image, which allows the retrieval of its corresponding image gradient;
• Hallucinate a variety of HR images that match, when downscaled, the input LR image ($I^{LR}$).
To tackle the aforementioned objectives, we propose a generator $G$ that takes as input pairs $(I^{LR}, \mathbf{z})$, where $\mathbf{z}$ is an $m$-dimensional white noise vector, $\mathbf{z} \sim \mathcal{N}_m(\mathbf{0}, \mathbf{I})$. We follow the same assumptions made in [5]: our proposed model is deemed to reconstruct the ground-truth HR image when $\mathbf{z} = \mathbf{0}$, and otherwise ($\mathbf{z} \neq \mathbf{0}$) to sample a variety of images matching the LR input but having a varied image gradient:

$G(I^{LR}, \mathbf{0}) = (I^{SR}, g^{SR})$ (reconstruction), (1)
$G(I^{LR}, \mathbf{z}) = (I^{H}_{z}, g^{H}_{z})$ (hallucination). (2)

In these equations, $I^{SR}$ is the reconstructed ground-truth image while $I^{H}_{z}$ is a hallucinated version given a noise vector $\mathbf{z}$. Also, $g^{SR}$ and $g^{H}_{z}$ are respectively the reconstructed image gradient of the ground-truth image and the hallucinated image gradient given a noise vector $\mathbf{z}$.

Like traditional SR models, the proposed model aims to minimize the distance between the ground-truth and the reconstructed HR images. In addition, our model aims to preserve the structural information of the original image by using the ground-truth image $I^{HR}$ and its corresponding gradient $g^{HR}$, computed as follows:

$g_x = I^{HR}(x+1, y) - I^{HR}(x-1, y)$, (3)
$g_y = I^{HR}(x, y+1) - I^{HR}(x, y-1)$, (4)
$g^{HR} = \|(g_x, g_y)\|_2$. (5)

As shown in (1), the generator takes as input the pair $(I^{LR}, \mathbf{0})$ and outputs $(I^{SR}, g^{SR})$. The perceptual loss [6] with VGG-16 [7] is then used, comparing the obtained $I^{SR}$ to the real $I^{HR}$ by minimizing the Euclidean distance between the features of these images:

$\mathcal{L}_{percp} = \mathbb{E}_{I^{SR}} \big\| \phi_i(I^{SR}) - \phi_i(I^{HR}) \big\|_2$, (6)

where $\phi_i$ is the output of the $i$-th layer of VGG. As for comparing the gradient images, the mean absolute error over the pixels is used:

$\mathcal{L}_{grad} = \big\| g^{SR} - g^{HR} \big\|_1$. (7)

Finally, for both domains, we use an adversarial loss for assessing the quality of the generated images and gradients, namely a non-saturating logistic loss [8] with $R_1$ regularization [9].

Unlike reconstruction, image hallucination does not rely on comparing the generated image to the ground-truth image. In addition to the low-resolution input image $I^{LR}$, the proposed method uses a random vector $\mathbf{z}$ to generate and explore other plausible solutions. However, it is worth noting that conditional GANs (cGANs) suffer from mode collapse problems [10, 11]. cGANs also tend to ignore the random vector $\mathbf{z}$ when conditioned on inputs that contain significant information about the output, as observed in image-to-image translation models [12, 13]. Without addressing this problem, the hallucinated images would be equal to the reconstructed image and the model would regress to the same one-to-one mapping behaviour. To circumvent this, we impose on the model a diversity constraint first introduced in [14], applied here only on the gradient output, to orient the exploration of the structural and textural space of the gradient image.

To enforce the effect of $\mathbf{z}$ for obtaining diverse solutions, the generator aims at maximizing the loss $\mathcal{L}_z(G)$:

$\mathcal{L}_z(G) = \mathbb{E}_{\mathbf{z}_1, \mathbf{z}_2 \sim \mathcal{N}_m(\mathbf{0}, \mathbf{I})} \left[ \min\left( d(g^{H}_{z_1}, g^{H}_{z_2}) \,/\, \|\mathbf{z}_1 - \mathbf{z}_2\|, \; \tau \right) \right]$, (8)

where $d(\cdot, \cdot)$ is a distance metric, and $g^{H}_{z_1}$ and $g^{H}_{z_2}$ are the gradient images obtained from HR images hallucinated with $\mathbf{z}_1$ and $\mathbf{z}_2$, respectively, both sampled from $\mathcal{N}_m(\mathbf{0}, \mathbf{I})$ and both hallucinated images obtained from the same LR input. The minimum with $\tau$ ensures the numerical stability of the optimization in the case where the output is not bounded by an activation function (e.g., tanh or sigmoid). Maximizing $\mathcal{L}_z(G)$ encourages the generator to explore the HR space more, producing a varied set of hallucinated samples.
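For concreteness, here is a minimal PyTorch sketch of the gradient map of Eqs. (3)-(5) and of the diversity term of Eq. (8). The paper does not specify the distance $d(\cdot,\cdot)$, the value of $\tau$, or the border handling, so the L1 distance, the default $\tau$, and the replicate padding below are our assumptions.

```python
import torch
import torch.nn.functional as F

def gradient_map(img):
    """Central-difference gradient magnitude, as in Eqs. (3)-(5).
    img: (B, C, H, W) tensor; borders use replicate padding (an assumption)."""
    p = F.pad(img, (1, 1, 1, 1), mode="replicate")
    gx = p[..., 1:-1, 2:] - p[..., 1:-1, :-2]    # I(x+1, y) - I(x-1, y)
    gy = p[..., 2:, 1:-1] - p[..., :-2, 1:-1]    # I(x, y+1) - I(x, y-1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-8)  # ||(gx, gy)||_2

def diversity_term(g_h1, g_h2, z1, z2, tau=10.0):
    """One-sample estimate of Eq. (8) for a pair (z1, z2) sharing the same
    LR input. The generator maximizes this value, so it would be
    subtracted from the generator loss during training."""
    d = torch.mean(torch.abs(g_h1 - g_h2))  # assumed L1 distance d(., .)
    ratio = d / (torch.norm(z1 - z2) + 1e-8)
    return torch.clamp(ratio, max=tau)      # min(d / ||z1 - z2||, tau)
```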
To make sure that the hallucinated samples stay faithful to the given LR input image, we impose the following constraint:

$\| \mathrm{DS}(I^{H}_{z}) - I^{LR} \| < \epsilon$, (9)

where $\mathrm{DS}(\cdot)$ is a down-scaling operation and $\epsilon$ is a hyperparameter.

Two discriminators are used in the generative model, one for assessing the HR images ($D_I$) and the other to validate the corresponding gradients ($D_g$). For both networks, the non-saturating logistic loss [8] with $R_1$ regularization [9] is used:

$\mathcal{L}^{g}_{adv} = - \mathbb{E}_{g^{HR}} \big[ \log(1 - D_g(g^{HR})) \big] - \mathbb{E}_{g^{SR}} \big[ \log D_g(g^{SR}) \big]$,
$\mathcal{L}^{I}_{adv} = - \mathbb{E}_{I^{HR}} \big[ \log(1 - D_I(I^{HR})) \big] - \mathbb{E}_{I^{SR}} \big[ \log D_I(I^{SR}) \big]$.

Fig. 2. Generator architecture: the network consists of 8 residual blocks followed by stacked upsampling blocks (UpBlocks); see Fig. 3 for the inner layers of an UpBlock.

For generating reconstructed HR images, we use a weighted sum over $\mathcal{L}_{percp}$ (6), $\mathcal{L}_{grad}$ (7) and the adversarial losses:

$\mathcal{L}_{recons} = \gamma (\mathcal{L}_{percp} + \mathcal{L}_{grad}) + \beta (\mathcal{L}^{g}_{adv} + \mathcal{L}^{I}_{adv})$. (10)

For generating hallucinated HR images, the loss is:

$\mathcal{L}_{halluc} = \mathcal{L}^{g}_{adv} + \mathcal{L}^{I}_{adv} + \alpha \mathcal{L}_z$. (11)
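The paper only names the non-saturating logistic loss [8] and $R_1$ regularization [9]; below is a standard PyTorch sketch of those two pieces, not the authors' exact code, written in the usual convention where the discriminator assigns high logits to real samples (the equations above use the mirrored orientation). The $R_1$ weight would be folded into the discriminator objective.

```python
import torch
import torch.nn.functional as F

def d_logistic_loss(real_logits, fake_logits):
    # Logistic discriminator loss: softplus(-x) = -log(sigmoid(x)),
    # so this is -E[log D(real)] - E[log(1 - D(fake))].
    return F.softplus(-real_logits).mean() + F.softplus(fake_logits).mean()

def g_nonsaturating_loss(fake_logits):
    # Non-saturating generator loss [8]: -E[log D(fake)].
    return F.softplus(-fake_logits).mean()

def r1_penalty(real_images, real_logits):
    # R1 regularization [9]: squared gradient norm of D at real samples.
    # real_images must have been created with requires_grad_(True).
    (grad,) = torch.autograd.grad(
        outputs=real_logits.sum(), inputs=real_images, create_graph=True)
    return grad.pow(2).reshape(grad.size(0), -1).sum(1).mean()
```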
3. IMPLEMENTATION

3.1. Conditioning
We choose to condition the discriminator $D_I$ on the constraint (9) by concatenating it at the last block of the network. We follow [5] by rounding the downscaled fake image to the closest colour value and dividing by $r = 2/255$ (since images are in the range $[-1, 1]$):

$F = \max\left( \big| \lfloor \mathrm{DS}(I^{H}_{z}) \rceil - I^{LR} \big| \,/\, r - \epsilon, \; 0 \right)$. (12)

This prevents the discriminator from producing exceedingly large weight values in order to measure infinitesimal differences. A straight-through estimator is used to pass the backpropagation gradient through the rounding operation. The discriminator therefore receives two inputs, the real/fake image and the map $F$ given in (12):

$D_I(I^{HR}, \mathbf{0})$ for real images, $D_I(I^{H}_{z}, F)$ for fake images.

Overall, our model consists of a generator that takes a LR image and a noise vector, and outputs two images: a HR image and its corresponding gradient. Two discriminators are also used during learning, taking real/fake images and their gradients, respectively.
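A minimal sketch of Eq. (12) with straight-through rounding is given below; the quantization step $r = 2/255$ (the spacing between adjacent 8-bit colour values mapped to $[-1, 1]$) is our reading of the truncated constant in the source.

```python
import torch

def round_ste(x):
    # Straight-through estimator: round in the forward pass,
    # identity gradient in the backward pass.
    return x + (torch.round(x) - x).detach()

def consistency_map(ds_fake, i_lr, eps, r=2.0 / 255.0):
    """F of Eq. (12). ds_fake is the downscaled fake image and i_lr the
    LR input, both in [-1, 1]; r is assumed to be the 8-bit colour step."""
    quantized = round_ste(ds_fake / r) * r  # snap to the nearest colour value
    return torch.clamp(torch.abs(quantized - i_lr) / r - eps, min=0.0)
```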
Fig. 3. UpBlock architecture: a convolution block shared between the two domains is used to enable the learning of image structures, followed by two domain-specific convolutions and a 1×1 convolution making a projection of the values from the feature space to the pixel space. The key in using a skip connection is to keep low-frequency information, such as colours, intact.

The global architecture of the generator is inspired by [15] (see Fig. 2). First, the residual blocks take the LR image with the noise vector; they are followed by up-sampling blocks, where each block outputs an image and its corresponding gradient at that scale (see Fig. 3). This way, the gradient flows to the generator at different scales, making it more stable and leading to faster learning. As mentioned before, two discriminators are used, one for each domain ($I^{HR}$ and $g^{HR}$). Their architecture is a mirrored version of the generator, but since each discriminator is domain-specific, the UpBlock is replaced by a DownBlock, where we first use a 1×1 convolution to pass the image from the pixel space to the feature space. This is followed by another two convolutions with leaky ReLUs, and then by average pooling for downsampling. After the downsampling blocks, the residual blocks are just like those of the generator. The discriminator directly outputs the features of the last convolutions, without using linear layers.
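Based on the description of Fig. 3, an UpBlock could be sketched as below; channel widths, kernel sizes, the upsampling modes, and the exact placement of the skip connections are assumptions, since the paper does not spell them out.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpBlock(nn.Module):
    """Sketch of the UpBlock of Fig. 3: one convolution shared between the
    image and gradient domains, two domain-specific convolutions, and 1x1
    projections from feature space to pixel space, with skip connections
    carrying the low-frequency content (e.g., colours)."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.shared = nn.Conv2d(in_ch, out_ch, 3, padding=1)      # shared structure conv
        self.img_conv = nn.Conv2d(out_ch, out_ch, 3, padding=1)   # image-domain conv
        self.grad_conv = nn.Conv2d(out_ch, out_ch, 3, padding=1)  # gradient-domain conv
        self.to_img = nn.Conv2d(out_ch, 3, 1)   # 1x1: features -> RGB pixels
        self.to_grad = nn.Conv2d(out_ch, 1, 1)  # 1x1: features -> gradient map

    def forward(self, feats, img_skip, grad_skip):
        up = lambda x: F.interpolate(x, scale_factor=2, mode="bilinear",
                                     align_corners=False)
        feats = F.leaky_relu(self.shared(up(feats)), 0.2)
        img_f = F.leaky_relu(self.img_conv(feats), 0.2)
        grad_f = F.leaky_relu(self.grad_conv(feats), 0.2)
        # skip connections keep the low frequencies of the previous scale
        img = self.to_img(img_f) + up(img_skip)
        grad = self.to_grad(grad_f) + up(grad_skip)
        return feats, img, grad
```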
4. EXPERIMENTS

4.1. Settings
Experiments were conducted on the CelebA [16] dataset to assess the capability of our model, which was trained on upscaling images 8 times and then compared to other SR methods (i.e., [17] and [18]). Our model is trained with the Adam optimizer [19] ($lr = 1e{-}$, $\beta_1 = 0.$, $\beta_2 = 0.$) with TTUR [20], using batch sizes equal to 8. We use a leaky ReLU for all activation functions in the generator and both discriminators, with a leak equal to 0.2. We fix $\epsilon$ in (12) to $0.$, while for $\mathcal{L}_{recons}$ we set $\gamma = 10$ and $\beta = 0.$, and for $\mathcal{L}_{halluc}$, $\alpha$ is equal to $.$
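Since the exact optimizer values are truncated in the source, the following TTUR [20] setup uses placeholder numbers only; the one property taken from TTUR itself is that the discriminators use a larger learning rate than the generator.

```python
import torch

def make_optimizers(generator, d_image, d_grad):
    # Hypothetical TTUR setup: discriminators step faster than the generator.
    # All numeric values below are placeholders, not the paper's settings.
    g_opt = torch.optim.Adam(generator.parameters(),
                             lr=1e-4, betas=(0.0, 0.99))
    d_opt = torch.optim.Adam(
        list(d_image.parameters()) + list(d_grad.parameters()),
        lr=4e-4, betas=(0.0, 0.99))
    return g_opt, d_opt
```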
Fig. 4. Results on 8x scaling: reconstruction (SR) and hallucinations using different $\mathbf{z}$ vectors (panels: HR, LR, SR, and hallucinated images with different latent vectors).

Fig. 5. Results on CelebA 8x scaling: (a) ground truth, (b) bicubic reconstruction, (c) SRGAN, (d) ESRGAN, (e) ours, reconstruction ($\mathbf{z} = \mathbf{0}$), and (f) ours, hallucination ($\mathbf{z} \neq \mathbf{0}$).

We use the traditional SR metrics: peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), and learned perceptual image patch similarity (LPIPS) [21]. Although our method is optimized to generate images with high perceptual quality, which means that the results do not correlate very well with PSNR and SSIM [21], we still achieve excellent performance, with an improvement by a large margin on all metrics; see Table 1 for details.

Method        SSIM ↑   PSNR ↑   LPIPS ↓
Bicubic       0.43     18.11    0.51
SRGAN [17]    0.62     20.15    0.24
ESRGAN [18]   0.55     18.43    0.23
Ours

Table 1. Comparison of results for 8x scaling on CelebA [16]; all results are reported on the test set.

We report the results of our model and compare them with the results obtained using the approaches described in [17, 18]. We show that when the input is a very low-resolution image, where most of the information is missing, the use of the gradient has a huge impact on the results. As illustrated in Fig. 5, the reconstructions of our model are closer to the real images than those of other methods, and also more coherent, since the structural properties of the images are preserved by making the model learn the gradient of the image. In addition to achieving better reconstruction, given the use of $\epsilon = 0.$ for (9) in our experiments, the model has more freedom to hallucinate other semantics that do not appear in the low-resolution image, such as changing the facial expressions or even changing the person's identity (see Fig. 4).
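To reproduce the metrics of Table 1, one could use scikit-image and the lpips package; the sketch below assumes HxWx3 uint8 arrays, scikit-image >= 0.19 (for the channel_axis argument), and the AlexNet LPIPS backbone, which the paper does not confirm.

```python
import torch
import lpips  # pip install lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(sr, hr, lpips_net):
    """PSNR, SSIM and LPIPS for one (SR, HR) pair of HxWx3 uint8 arrays."""
    psnr = peak_signal_noise_ratio(hr, sr, data_range=255)
    ssim = structural_similarity(hr, sr, channel_axis=-1, data_range=255)
    to_t = lambda a: (torch.from_numpy(a).permute(2, 0, 1)[None].float()
                      / 127.5 - 1.0)  # LPIPS expects [-1, 1] tensors
    dist = lpips_net(to_t(sr), to_t(hr)).item()
    return psnr, ssim, dist

lpips_net = lpips.LPIPS(net="alex")  # assumed backbone, per [21]'s reference code
```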
5. CONCLUSION
In this paper, we proposed a new super-resolution method that makes use of a novel GAN architecture able to recover images from extremely low-resolution images and to hallucinate a wide variety of other possibilities. We showed the superiority of our approach on both tasks on the CelebA dataset.

6. REFERENCES

[1] Yuval Bahat and Tomer Michaeli, "Explorable Super Resolution," arXiv e-prints, p. arXiv:1912.01839, Dec. 2019.
[2] Marcel Christoph Bühler, A. Romero, and R. Timofte, "DeepSEE: Deep disentangled semantic explorative extreme super-resolution," ArXiv, vol. abs/2004.04433, 2020.
[3] Sachit Menon, Alexandru Damian, Shijia Hu, Nikhil Ravi, and Cynthia Rudin, "PULSE: Self-Supervised Photo Upsampling via Latent Space Exploration of Generative Models," arXiv e-prints, p. arXiv:2003.03808, Mar. 2020.
[4] Cheng Ma, Yongming Rao, Yean Cheng, Ce Chen, Jiwen Lu, and Jie Zhou, "Structure-Preserving Super Resolution with Gradient Guidance," arXiv e-prints, p. arXiv:2003.13081, Mar. 2020.
[5] David Berthelot, Peyman Milanfar, and Ian Goodfellow, "Creating High Resolution Images with a Latent Adversarial Generator," arXiv e-prints, p. arXiv:2003.02365, Mar. 2020.
[6] Justin Johnson, Alexandre Alahi, and Fei-Fei Li, "Perceptual losses for real-time style transfer and super-resolution," CoRR, vol. abs/1603.08155, 2016.
[7] Karen Simonyan and Andrew Zisserman, "Very deep convolutional networks for large-scale image recognition," in International Conference on Learning Representations (ICLR), Yoshua Bengio and Yann LeCun, Eds., 2015.
[8] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Q. Weinberger, Eds., 2014, vol. 27, pp. 2672–2680, Curran Associates, Inc.
[9] Lars M. Mescheder, "On the convergence properties of GAN training," CoRR, vol. abs/1801.04406, 2018.
[10] Tim Salimans, Ian J. Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen, "Improved techniques for training GANs," CoRR, vol. abs/1606.03498, 2016.
[11] Martin Arjovsky and Léon Bottou, "Towards Principled Methods for Training Generative Adversarial Networks," arXiv e-prints, p. arXiv:1701.04862, Jan. 2017.
[12] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros, "Image-to-image translation with conditional adversarial networks," CoRR, vol. abs/1611.07004, 2016.
[13] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," CoRR, vol. abs/1703.10593, 2017.
[14] Dingdong Yang, Seunghoon Hong, Yunseok Jang, Tianchen Zhao, and Honglak Lee, "Diversity-sensitive conditional generative adversarial networks," CoRR, vol. abs/1901.09024, 2019.
[15] Animesh Karnewar, Oliver Wang, and Raghu Sesha Iyengar, "MSG-GAN: Multi-scale gradient GAN for stable image synthesis," CoRR, vol. abs/1903.06048, 2019.
[16] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang, "Deep learning face attributes in the wild," in Proceedings of the International Conference on Computer Vision (ICCV), December 2015.
[17] Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Caballero, Andrew P. Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, and Wenzhe Shi, "Photo-realistic single image super-resolution using a generative adversarial network," CoRR, vol. abs/1609.04802, 2016.
[18] Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Chen Change Loy, Yu Qiao, and Xiaoou Tang, "ESRGAN: Enhanced super-resolution generative adversarial networks," CoRR, vol. abs/1809.00219, 2018.
[19] Diederik P. Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," in International Conference on Learning Representations (ICLR), Yoshua Bengio and Yann LeCun, Eds., 2015.
[20] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, Günter Klambauer, and Sepp Hochreiter, "GANs trained by a two time-scale update rule converge to a local Nash equilibrium," CoRR, vol. abs/1706.08500, 2017.
[21] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang, "The unreasonable effectiveness of deep features as a perceptual metric," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.