Real-World Super-Resolution of Face-Images from Surveillance Cameras
Andreas Aakerberg, Kamal Nasrollahi, Thomas B. Moeslund
Visual Analysis and Perception, Aalborg University, Rendsburggade 14, Aalborg, Denmark
Research Department, Milestone Systems A/S, Brøndby, Denmark
{anaa, tbm}@create.aau.dk, [email protected]

Abstract
Most existing face image Super-Resolution (SR) methods assume that the Low-Resolution (LR) images were artificially downsampled from High-Resolution (HR) images with bicubic interpolation. This operation changes the natural image characteristics and reduces noise. Hence, SR methods trained on such data most often fail to produce good results when applied to real LR images. To solve this problem, we propose a novel framework for generation of realistic LR/HR training pairs. Our framework estimates realistic blur kernels, noise distributions, and JPEG compression artifacts to generate LR images with similar image characteristics as the ones in the source domain. This allows us to train a SR model using high-quality face images as Ground-Truth (GT). For better perceptual quality, we use a Generative Adversarial Network (GAN) based SR model where we have exchanged the commonly used VGG-loss [24] with LPIPS-loss [52]. Experimental results on both real and artificially corrupted face images show that our method results in more detailed reconstructions with less noise compared to existing State-of-the-Art (SoTA) methods. In addition, we show that the traditional non-reference Image Quality Assessment (IQA) methods fail to capture this improvement, and demonstrate that the more recent NIMA metric [16] correlates better with human perception via Mean Opinion Rank (MOR).
1. Introduction
Face Super-Resolution (SR) is a special case of SR which aims to restore High-Resolution (HR) face images from their Low-Resolution (LR) counterparts. This is useful in many different applications such as video surveillance and face enhancement. Current State-of-the-Art (SoTA) face SR methods based on Convolutional Neural Networks (CNNs) are able to reconstruct images with photo-realistic appearance from artificially generated LR images. However, these methods often assume that the LR images were downsampled with bicubic interpolation, and therefore fail to produce good results when applied to real-world LR images.

Figure 1: × SR of a real low-quality face image ( × pixels) from the Chokepoint DB [48]. Our method enhances details and removes noise, while ESRGAN [45] amplifies the corruptions. (Panels: Original, ESRGAN [45], Ours.)

This is mostly due to the fact that the downsampling operation with bicubic downscaling changes the natural image characteristics and reduces the amount of artifacts. Hence, when using algorithms trained with supervised learning on such artificial LR/HR image pairs, the reconstructed images usually contain strong artifacts due to the domain gap.

This paper is about SR of real low-resolution, noisy, and corrupted images, also known as Real-World Super-Resolution (RWSR). We apply our proposed method to face images, but the method is also applicable to other image domains. To create a SR model that is robust against the corruptions found in real images, we create a degradation framework that can produce LR images that have the same image characteristics as the images that we want to super-resolve, i.e., the source domain images. Creating LR images from clean high-quality images, i.e., the target domain, allows us to train a SR model that learns to super-resolve images with similar characteristics. This approach is inspired by the work of Ji et al. [22], who propose to perform RWSR via kernel estimation and noise injection.
However, we observe that their framework for image degradation is not ideal for SR of LR face images from surveillance cameras, as these are often also corrupted by compression artifacts. Hence, we extend the degradation framework from [22] to include JPEG compression artifacts. We use the ESRGAN [45] model, one of the SoTA models for perceptual quality, as our backbone SR model. However, we find that the combination of loss functions in the ESRGAN is not ideal for optimal perceptual quality. To this end, we exchange the VGG-based discriminator loss with a PatchGAN [53] loss, similar to [22]. Inspired by Jo et al. [23], we additionally exchange the VGG-loss [24] with the Learned Perceptual Image Patch Similarity (LPIPS) loss [52] for better perceptual quality. Different from existing models for face SR [7, 12, 8], we do not restrict our model to only work for face images of fixed input sizes, which makes our model more useful in practice. To the best of our knowledge, we are the first to propose a method for SR of real LR face images of arbitrary sizes.

We evaluate our method on two different datasets. To enable comparison of the SR performance against Ground-Truth (GT) reference images, we artificially corrupt high-quality face images from the Flickr-Faces-HQ (FFHQ) dataset [25] and report quantitative results using conventional Image Quality Assessment (IQA) methods and the most recent methods for assessment of perceptual quality. For evaluation on real LR face images from surveillance cameras, we use the Chokepoint DB [48]. In this case, as no GT image is available, we report the results using Mean Opinion Rank (MOR) and several non-reference IQA methods. In both cases we show the effectiveness of our method via quantitative and qualitative evaluations.
Furthermore, our evaluations show that most existing non-reference IQA methods correlate poorly with human perception, while the recent Neural Image Assessment (NIMA) [16] metric provides a good correlation with human judgment, as shown via MOR.

In summary, our contributions are:

• We propose a novel framework for generation of LR/HR training pairs that includes the most common image degradation types in real-world face images. Our framework includes blur kernel estimation, noise injection, and compression artifacts.

• We also propose an improved ESRGAN [45] based SR model with PatchGAN [53] and LPIPS loss [52] for better perceptual quality, and show the benefit on real LR face images from the Chokepoint DB [48] and artificially corrupted face images from the FFHQ DB [25].

• Quantitatively, we evaluate our method using the most popular non-reference IQA methods, and find that only the recent NIMA [16] metric correlates with human judgment via MOR.
2. Related Work
Recent advancements within deep learning have proven very successful for super-resolution, and models of this type often achieve SoTA results. The first deep-learning based method for super-resolution was proposed by Dong et al. [15], who successfully trained a CNN to learn a non-linear mapping from LR to HR images. Later proposals relied on deeper networks and residual learning [27, 33], recursive learning [28], multi-path learning [21], and different loss functions [29] to reduce the reconstruction error between the super-resolved image and the GT image. However, while these methods yield high Peak Signal-to-Noise Ratio (PSNR) values, they tend to produce over-smoothed images which lack high-frequency details. To overcome this, Ledig et al. [32] proposed to use Generative Adversarial Networks (GANs) for SR with the SRGAN, to achieve realistic-looking images according to human perception. The ESRGAN [45] further improves the SRGAN [32] by several changes to the discriminator and generator. The LR images needed for training the aforementioned deep-learning based super-resolution models are typically created by downsampling HR images with an ideal downscaling kernel, typically bicubic downscaling. However, the images generated by this kernel do not necessarily match real LR images. Additionally, in the downscaling process, important natural image characteristics, such as image sensor noise, are removed, which the super-resolution algorithms are then prevented from learning. This results in poor reconstruction results and unwanted artifacts when a real-world noisy LR image is super-resolved [35].
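The domain gap introduced by ideal downscaling can be seen in a toy experiment (our own illustration, not from the paper): bicubically downscaling a noisy image suppresses its sensor noise, so a model trained on such synthetic LR images never learns to handle real noise.

```python
import numpy as np
from PIL import Image

# Toy demonstration of the bicubic domain gap: the low-pass filtering in the
# resize suppresses sensor noise, so synthetic LR images are much cleaner
# than LR images captured by a real camera.
np.random.seed(0)
clean = np.full((128, 128), 128.0)
noisy = clean + np.random.normal(0.0, 10.0, clean.shape)  # synthetic sensor noise

img = Image.fromarray(np.clip(noisy, 0, 255).astype(np.uint8))
small = np.asarray(img.resize((32, 32), Image.BICUBIC), dtype=np.float32)

print(np.std(noisy))  # roughly the injected noise level (~10)
print(np.std(small))  # much smaller: the 4x bicubic downscale removed most noise
```

The same effect holds for any low-pass downscaling kernel, which is why the paper estimates realistic kernels and injects real noise instead.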
Real-World Super-Resolution
One way to address the lack of a proper imaging model for RWSR is to create datasets that consist of real LR/HR image pairs captured using two cameras with different focal lengths [9, 43, 47]. However, this method is cumbersome and has inherent problems with the alignment of the image pairs. To overcome the problem of missing real-world training data, Shocher et al. [2] propose a zero-shot approach where a small CNN is trained at test time on LR/HR pairs extracted from the LR image itself. Soh et al. [42] extend the work of [2] by using a meta-transfer learning phase to exploit information from an external dataset. Gu et al. [20] train kernel estimator and corrector CNNs under the assumption that the downscaling kernel belongs to a certain family of Gaussian filters, and use the estimated kernel as input to a super-resolution model. To super-resolve LR images with arbitrary blur kernels, Zhang et al. [50] propose a deep plug-and-play framework which takes advantage of existing blind deblurring methods for blur kernel estimation. Bell-Kligler et al. [4] train a GAN to estimate blur kernels from LR images and combine it with the ZSSR model [2]. Fritsche et al. [17] train a GAN to introduce natural image characteristics to images downsampled with bicubic downscaling, which are then used to train a super-resolution model for improved performance on real-world images. Zhang et al. [49] propose an iterative network for SR of blurry, noisy images for different scaling factors by leveraging both learning- and model-based methods. Most recently, Ji et al. [22] propose a degradation framework for the creation of LR/HR image pairs for training. The degradation framework estimates blur kernels and noise distributions from real LR images in the source domain, which are used to degrade HR images in the target domain. This enables training of a GAN based SR model which is shown to perform better on real LR images.
However, a key limitation of this method is that it does not address the compression artifacts often found in real-world images.
Face Super-Resolution
Face SR is a SR technique specialized for reconstruction of face images. One of the first methods for face SR was proposed by Baker and Kanade [3]. This method reconstructed face details by searching for the most optimal mapping between LR and HR patches. More recent work relies on deep-learning based methods with CNNs and GANs. Dahl et al. [13] use pixel recursive learning with two CNNs to synthesize realistic hair and skin details. Chen et al. [11] combine face SR and face alignment to achieve previously unseen PSNR values. By searching the latent space of a generative model for images that downscale correctly, Menon et al. [37] are able to create face images of high resolution and perceptual quality. However, the problem with this approach is that the generated faces are often far from the true identity of the actual person, as illustrated in Figure 2. Additionally, none of the above-mentioned methods are robust against noise or other corruptions in the input images [19].

There are very few publications available in the literature which address the problem of RWSR of face images [19]. Furthermore, the few existing face RWSR methods are only compatible with LR images that have been squared to × pixels, meaning that the reconstructed image will be only × or × pixels depending on the scaling factor [7, 12, 8]. Hence, these models cannot perform true SR directly on the LR images, which limits the actual usefulness of the existing face SR models. On the contrary, our work presents one possible solution for × RWSR of face images of arbitrary sizes, which we evaluate on real LR face images from surveillance cameras without any prior re-scaling.
3. The Proposed Framework
This section describes our two-step framework for RWSR. The first step aims to generate LR images from clean HR images in the target domain Y, such that these have similar image characteristics as the ones in the source domain X. The second step involves training a SR model on the constructed paired data, optimizing for perceptual quality.

Figure 2: An example of SR of a real low-quality face image from the Chokepoint DB [48], where it can be seen that the PULSE [37] method changes the identity of the person, while our method preserves the identity and enhances details. (Panels: Original, PULSE [37], Ours.)

Traditional approaches for SR assume that a LR image I_LR is the result of a downscaling operation on the corresponding HR image I_HR using some kernel k and scaling factor s, namely:

I_LR = (I_HR ∗ k) ↓s    (1)

However, real LR images from cameras are influenced by multiple other factors that degrade the image as well. The RealSR [22] framework tries to address this issue by considering realistic noise distributions and blur kernels in the downscaling process. However, we observe that real images from surveillance cameras are often also degraded by compression artifacts, which makes the RealSR framework perform poorly on such images. To this end, we extend the degradation framework from [22] to include JPEG compression artifacts in addition to estimation of realistic noise distributions and blur kernels. Thus, we extend the basic SR formulation from Equation 1, and assume that the following image degradation model was used to create I_LR:

I_LR = (I_HR ∗ k) ↓s + n + c    (2)

where k, s, n, and c denote the blur kernel, scaling factor, noise, and compression artifacts, respectively. I_HR is unknown, together with k, n, and c. In our degradation framework, we estimate the kernel and noise directly from the images in the source domain X.
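The degradation model of Equation 2 can be sketched as follows. This is an illustrative re-implementation (function name and array conventions are our own, not the authors' released code): blur with an estimated kernel, subsample by the scaling factor, inject a noise patch, and finish with a JPEG encode/decode round-trip.

```python
import io

import numpy as np
from PIL import Image
from scipy.signal import fftconvolve


def degrade(hr, kernel, scale=4, noise_patch=None, jpeg_quality=30):
    """Sketch of Eq. (2): I_LR = (I_HR * k) downscaled by s, plus n and c.

    `hr`: float array (H, W, 3) in [0, 1]; `kernel`: normalized 2-D blur
    kernel; `noise_patch`: optional zero-mean patch from the noise pool.
    """
    # (I_HR * k): blur each channel with the estimated kernel.
    blurred = np.stack(
        [fftconvolve(hr[..., c], kernel, mode="same") for c in range(hr.shape[-1])],
        axis=-1,
    )
    # Downscale by s via direct subsampling.
    lr = blurred[::scale, ::scale]
    # + n: inject a real noise patch from the pool, if provided.
    if noise_patch is not None:
        if noise_patch.ndim == 2:
            noise_patch = noise_patch[..., None]
        lr = lr + noise_patch[: lr.shape[0], : lr.shape[1]]
    # + c: JPEG encode/decode round-trip to simulate compression artifacts.
    buf = io.BytesIO()
    Image.fromarray((np.clip(lr, 0.0, 1.0) * 255).astype(np.uint8)).save(
        buf, format="JPEG", quality=jpeg_quality
    )
    buf.seek(0)
    return np.asarray(Image.open(buf), dtype=np.float32) / 255.0
```

Applying this to clean HR images from the target domain yields LR/HR training pairs whose LR side mimics the source-domain corruptions.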
We build a pool of the estimated kernels and noise patches, which is used to generate corrupted LR images from clean HR images; finally, we JPEG compress the images in order to create image pairs for training the SR model.

Figure 3: Comparison with SoTA methods for SR of a small face image ( × pixels) from the Chokepoint DB [48]. As visible, our method hallucinates more realistic face details than the existing methods. (Panels: Original, MZSR [42], EDSR [33], ESRGAN [45], USRNet [49], RealSR [22], DPSR [51], Ours.)

For estimation of realistic blur kernels, we adopt the KernelGAN method by Bell-Kligler et al. [4]. This method estimates an image-specific SR kernel k_i using an unsupervised approach. More specifically, a GAN is trained to down-scale the input image in a way that best preserves the image patch distributions across scales. We estimate realistic blur kernels from training images in X to form a pool of kernels that can be used to degrade the HR images in Y.

Downsampling
To create the downsampled image I_D, we randomly choose a blur kernel k_i from the pool of estimated kernels and perform cross-correlation with images in Y. More formally, the process is described as:

I_D = (Y_n ∗ k_i) ↓s ,  i ∈ {1, ..., m}    (3)

where I_D is the downscaled image, Y_n is a HR image, k_i refers to a kernel from the degradation pool {k_1, k_2, ..., k_m}, and s is the scaling factor.

For degradation with realistic image noise, we adopt the method from [10] to extract noise patches from the source images X. Here the assumption is that an approximate noise patch can be obtained from a noisy image by extracting an area with weak background and then subtracting the mean. We define two patches, p_i and q_ij. We obtain p_i by a sliding-window approach across images in X, and similarly obtain q_ij by scanning p_i. p_i is considered a smooth patch if the following constraints are met:

|Mean(q_ij) − Mean(p_i)| ≤ µ · Mean(p_i)    (4)

and

|Var(q_ij) − Var(p_i)| ≤ γ · Var(p_i)    (5)

where Mean and Var denote the mean and variance, respectively, and µ and γ are scaling factors. Different from [10], we add an additional constraint to ensure that saturated patches are not extracted:

Var(p_i) ≥ φ    (6)

where φ denotes a minimum variance threshold. If all constraints are satisfied, p_i will be considered a smooth patch. We then create a pool of noise patches n_i by subtracting the mean value from all valid p_i.

Degradation with Noise
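The smooth-patch test of Equations 4-6 can be sketched as follows (a minimal implementation of our own; the non-overlapping sub-patch stride and array layout are illustrative assumptions):

```python
import numpy as np


def is_smooth_patch(p, mu=0.1, gamma=0.25, phi=0.5, q_size=8):
    """Check the smooth-patch constraints of Eqs. (4)-(6) for a patch p.

    p is a 2-D array scanned with q_size x q_size sub-patches q_ij.
    """
    p_mean, p_var = p.mean(), np.var(p)
    if p_var < phi:  # Eq. (6): reject saturated / flat patches
        return False
    for i in range(0, p.shape[0] - q_size + 1, q_size):
        for j in range(0, p.shape[1] - q_size + 1, q_size):
            q = p[i:i + q_size, j:j + q_size]
            # Eq. (4): the local mean must stay close to the patch mean.
            if abs(q.mean() - p_mean) > mu * p_mean:
                return False
            # Eq. (5): the local variance must stay close to the patch variance.
            if abs(np.var(q) - p_var) > gamma * p_var:
                return False
    return True
```

A patch that passes all three constraints then contributes n_i = p_i − Mean(p_i) to the noise pool.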
We degrade the LR images by injecting real noise patches from the noise pool. For better regularization of the SR model, we randomly pick a noise patch from the noise pool and inject it into the LR image during training. The downscaled and noisy LR image I_N is created as follows:

I_N = I_D + n_i ,  i ∈ {1, ..., l}    (7)

where I_D is a downscaled image, and n_i is a noise patch from the noise pool {n_1, n_2, ..., n_l}.

Finally, we introduce compression artifacts to the LR training images to close the domain gap between these and the real JPEG-compressed LR images in the source domain X. As there is no way of determining the compression strength of existing JPEG images, we empirically compare images from X to similar images with different JPEG compression strengths applied, and find that a compression strength of 30 results in similar compression artifacts.

We base our SR model on the ESRGAN [45], which is one of the SoTA networks for perceptual SR with × upscaling, and train it on the paired LR and HR images generated with our degradation framework. Different from the SRGAN [32], the ESRGAN uses Residual-in-Residual Dense Blocks (RRDBs) in the generator network, and the discriminator predicts the relative realness instead of an absolute value. Additionally, the ESRGAN removes the batch normalization layers used in SRGAN.

Loss Functions
While traditional supervised SR models are trained with a pixel loss to minimize the Mean Squared Error (MSE) between the reconstructed HR image and the GT image, we rely on loss functions that maximize the perceptual quality. The original ESRGAN [45] model uses several different loss functions during training. More specifically, the generator uses an adversarial loss L_adv [18] in combination with a VGG perceptual loss L_vgg [24] and a pixel loss L_pix, while the discriminator uses a VGG-128 [41] loss. However, we find that this combination of loss functions is not ideal for high perceptual quality. Following the work of [22], we first exchange the VGG-128 [41] discriminator with a PatchGAN discriminator [53] to reduce the amount of artifacts in the reconstructed images. Different from the VGG loss, the PatchGAN loss L_patch has a fully convolutional structure and only penalizes structure differences at the scale of patches to determine if an image is real or fake. For optimization of the generator, the loss from all patches is averaged and fed back to the generator. Continuing this track, we seek to also replace the VGG-loss in the generator. Inspired by [23], we find that using the LPIPS perceptual loss L_lpips [52] results in less noise and richer textures compared to using the VGG-loss for the generator. This is mainly because the VGG network is trained for image classification, while LPIPS is trained to score image patches based on human perceptual similarity judgements. The LPIPS perceptual loss is formulated as:

L_lpips = Σ_k τ_k (φ_k(I_gen) − φ_k(I_gt))    (8)

where I_gen is a generated image, I_gt is the corresponding GT image, φ is a feature extractor, and τ is a transformation from embeddings to a scalar LPIPS score. The score is computed from k layers and averaged. In our implementation of LPIPS, we use the pre-trained AlexNet model provided by the authors.
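Equation 8 can be illustrated with a toy stand-in. Real LPIPS extracts the features φ_k with a pre-trained AlexNet and uses learned weights for τ_k; in this sketch of our own, both are placeholder arrays.

```python
import numpy as np


def lpips_like(feats_gen, feats_gt, weights):
    """Toy illustration of Eq. (8), not the real LPIPS network.

    feats_gen / feats_gt: per-layer feature maps phi_k(I_gen), phi_k(I_gt);
    weights: stand-ins for the learned transforms tau_k.
    """
    scores = []
    for fg, ft, w in zip(feats_gen, feats_gt, weights):
        # tau_k: weight the squared feature difference, then average spatially.
        scores.append(np.mean(w * (fg - ft) ** 2))
    # Average the per-layer scores into a single scalar distance.
    return float(np.mean(scores))
```

Identical feature stacks give a distance of 0, and the score grows as the generated image's features drift from those of the GT image, which is what the generator is trained to minimize.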
In total, our full training loss for the generator is as follows:

L_generator = λ_pix · L_pix + λ_adv · L_adv + λ_lpips · L_lpips    (9)

where λ_pix, λ_adv, and λ_lpips are scaling parameters.

This section describes the datasets used for training and testing. For our experiments on real LR face images from surveillance cameras, we use the Chokepoint dataset [48] as our source domain images X. This dataset contains images of 29 different persons captured with three cameras in a real-world surveillance setting. All images have a resolution of × . We use a face detection algorithm to extract the faces from the images, and randomly split the dataset to obtain 72,282 images for training and 3,805 images for testing. The average resolution of the cropped faces is ≈ × . We only use the Chokepoint training images to estimate realistic blur kernels and noise distributions for our degradation framework, and not for direct training of our SR model.

For the target domain of high-quality face images Y, we combine 571 face images from the SiblingsDB [44], 8,040 face images from the Radboud Faces Database [30], and 5,000 randomly selected face images from the FFHQ database [25], for a total of 13,611 images. Both the SiblingsDB and the Radboud Faces Database contain portrait face images professionally captured in a studio setting with controlled lighting. The face images from the FFHQ are more diverse in appearance and ethnicity of the subjects. We augment all images in the target domain by downsampling by 25, 50, and 75% with bicubic downscaling to obtain a more diverse dataset. We then apply our degradation framework described in Section 3.1 on the images in Y to obtain LR/HR image pairs for training of our SR model.

For evaluation on artificially corrupted face images, we use the first 1,000 images from the FFHQ dataset. To generate LR/GT images, we introduce three kinds of corruptions, namely downsampling, sensor noise, and compression artifacts.
For downsampling, we randomly choose a kernel from our blur kernel pool. For modeling of sensor noise, we follow the protocol from [34] and use pixel-wise independent Gaussian noise with zero mean and a standard deviation of 8 pixels. For compression artifacts, we convert the images to JPEG using a compression strength of 30.

Real-World Images
Due to the nature of RWSR, no GT reference image exists, which makes it impossible to compare the different methods using traditional SR IQA methods, e.g., PSNR and the Structural Similarity index (SSIM). To this end, we follow the non-reference IQA evaluation protocol from the NTIRE 2020 RWSR challenge [1]. In particular, we assess the image quality using NIQE [39], BRISQUE [38], PIQE [40], NRQM [36], and PI [5], where PI is a weighted score computed as ½((10 − NRQM) + NIQE). However, these methods are known to correlate poorly with human ratings [1]. To address this issue, we supplement our evaluation protocol with MOR and NIMA [16], where NIMA is a learned metric based on human opinion scores, which can quantify image quality with high correlation to human perception. We use the pre-trained model for rating of the technical image quality. For the MOR, we ask the participants to rank the overall image quality of the SR results. To simplify the ranking, we only include the predictions of the top-5 methods based on NIMA scores. To avoid bias, the order of the methods is randomly shuffled. We average the assigned rank of each method over all images and participants to compute the MOR.

Figure 4: Comparison with SoTA methods for × SR of real low-quality face images from the Chokepoint DB [48]. As visible, our method generates superior reconstructions over the existing methods for different faces. (Panels: Original, MZSR [42], EDSR [33], ESRGAN [45], USRNet [49], RealSR [22], DPSR [51], Ours.)

Figure 5: Comparison with SoTA methods for × SR of artificially corrupted face images from the FFHQ [25] testset. As seen, our method hallucinates faces with richer detail and less artifacts compared to the existing methods. (Panels: Original, MZSR [42], ESRGAN [45], USRNet [49], RealSR [22], DPSR [51], Ours, GT.)
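For reference, the PI score and MOR described above can be computed as follows (the dictionary layout for MOR is our own illustrative choice):

```python
import numpy as np


def perceptual_index(nrqm, niqe):
    """PI [5], lower is better: PI = 1/2 ((10 - NRQM) + NIQE)."""
    return 0.5 * ((10.0 - nrqm) + niqe)


def mean_opinion_rank(ranks_per_method):
    """MOR: average the rank a method received over all images and
    participants (lower is better). Maps method name -> list of ranks."""
    return {m: float(np.mean(r)) for m, r in ranks_per_method.items()}
```

A method with a high NRQM and a low NIQE thus receives a low (good) PI, while MOR simply aggregates the human rankings.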
Artificially Corrupted Images
For our experiments on artificially corrupted images, we evaluate the performance using three conventional IQA methods: PSNR, SSIM, and the more recent Multi-Scale Structural Similarity index (MS-SSIM) [46]. However, these metrics focus more on signal fidelity than on perceptual quality [6]. As our method is optimized towards perceptual quality, we also include three of the most recent full-reference metrics targeting perceptual quality, namely the Normalized Laplacian Pyramid Distance (NLPD) [31], LPIPS [52], and the Deep Image Structure and Texture Similarity (DISTS) [14].

Method       | NIQE ↓ | BRISQUE ↓ | PIQE ↓ | NRQM ↑ | PI ↓ | NIMA ↑ | MOR ↓
Bicubic [26] | 5.77   | 56.77     | 86.28  | 3.09   | 6.34 | 3.92   | -
MZSR [42]    | 7.36   | 50.09     | 77.63  | 3.75   | 6.81 | 3.97   | -
EDSR [33]    | 5.43   | 50.63     | 81.97  | 3.82   | 5.81 | 4.08   | -
ESRGAN [45]  | 3.75   | 19.35     | 19.20  | 7.08   |      |        |
Table 1: Quantitative results on the Chokepoint testset. ↑ and ↓ indicate whether higher or lower values are desired, respectively. Our model scores lower on the traditional IQA metrics while being superior on the more recent NIMA metric and MOR, which indicates that the traditional IQA metrics are not ideal for evaluation of perceptual quality.

Method       | PSNR ↑ | SSIM ↑ | MS-SSIM ↑ | NLPD ↓ | LPIPS ↓ | DISTS ↓
Bicubic [26] | 28.39  | 0.79   | 0.88      | 0.32   | 0.52    | 0.20
MZSR [42]    | 29.56  | 0.78   | 0.89      | 0.29   | 0.43    | 0.18
EDSR [33]    | 28.27  | 0.78   | 0.88      | 0.33   | 0.50    | 0.19
ESRGAN [45]  | 28.09  | 0.77   | 0.88      | 0.34   | 0.40    | 0.19
USRNet [49]  | 28.53  |        |           |        |         |
Table 2: Quantitative results on the FFHQ testset. ↑ and ↓ indicate whether higher or lower values are desired, respectively.

Figure 6: Ablation study of the effect of including compression artifacts in the degradation framework and exchanging the VGG-loss with LPIPS-loss for the generator in the SR model, compared to the baseline and the original LR image ( × pixels). (Panels: Original, Baseline, Compression, LPIPS Loss.)
4. Experiments and Results
Implementation Details
We perform all our experiments with a scaling factor s = 4. For our SR model, we jointly train the generator and discriminator for 400K iterations with a batch size of 16. We initialize the weights from the PSNR-optimized RRDB model from [45]. We use LR patches of size × , and empirically set λ_pix, λ_adv, and λ_lpips to 0.01, 0.005, and 0.001, respectively. For noise estimation, we set p_i to match the LR patch size and q_ij to 8. Similar to [10], we set µ and γ to 0.1 and 0.25, respectively. We empirically set the minimum variance threshold φ to 0.5. For degradation with compression artifacts, we JPEG compress the LR training images with a strength of 30 during training with a probability of 0.9, for better regularization of the SR model.

We did not find any other × face-image-specific RWSR methods in the literature. Instead, we compare our method to bicubic upscaling, as well as to different groups of SoTA super-resolution methods, including two generic SR models (ESRGAN [45], EDSR [33]), one SR method for arbitrary blur kernels (DPSR [51]), and three real-world SR models (MZSR [42], USRNet [49], and RealSR [22]). For a fair comparison, we adjust the competing models for optimal performance. For MZSR [42], which is an unsupervised method, we enable back-projection with 10 iterations and set a noise level of 0.5. For DPSR [51], we use the pre-trained DPSRGAN model with settings for real-world images. With USRNet [49], we set the noise value to 15 for best results. The results for RealSR [22] are based on our re-implementation of the framework, as the training code was not available. We adapt the RealSR method to our face data for a fair comparison. For ESRGAN, we use the pre-trained weights provided by the authors to better illustrate the difference from our method.

Real-World Images
In this experiment, we evaluate the SR performance on LR face images from the Chokepoint testset. Quantitative results can be seen in Table 1. Qualitative results for multiple images are shown in Figure 4, while a close-up view of facial components can be seen in Figure 3. Our method clearly outperforms the other methods in terms of perceptual quality. However, while the traditional non-reference IQA methods (NIQE [39], BRISQUE [38], PIQE [40], and NRQM [36]) fail to capture this, scores from the more recent NIMA [16] method correlate well with human perception, which is also backed by our MOR rankings. This shows that the traditional IQA metrics are not ideal for judgement of perceptual quality.
Artificially Corrupted Images
This experiment evaluates the SR performance on artificially corrupted images from the FFHQ testset. We show quantitative results of all methods in Table 2. Qualitative results for multiple images are shown in Figure 5. Our method produces sharp and detailed images with few artifacts, which closely resemble the GT images; this is also reflected in the quantitative results. Most noteworthy are the DISTS results, which correlate strongly with human perception of image quality. The results show that the reconstructed images produced by our method are superior in comparison to the other methods.
We evaluate the effect of our proposed method for realistic image degradation and our improved ESRGAN-based SR model in the same setting as described in Section 4.1. A qualitative comparison can be seen in Figure 6.
Baseline
Here, we use kernel estimation and noise injection to generate training data for the ESRGAN with a patch discriminator, similar to [22]. This SR model is fine-tuned to our face image dataset and serves as our baseline. The resulting HR images contain unpleasant noise and lack detail.
Compression Artifacts
In this setting, we add JPEG compression artifacts to the LR images during training of the baseline model. This results in more noise-free reconstructions compared to the baseline.
LPIPS loss
Here, we use the LPIPS loss function for the generator instead of the VGG-loss, combined with the addition of compression artifacts. When the baseline model is retrained under these settings, the resulting reconstructions become sharper, with better texture and details.

Figure 7: Examples of failure cases. Figures (a) and (b) illustrate cases where only parts of the image are super-resolved. Figure (c) shows a case where almost no high-frequency details are restored. Figure (d) shows a case where unrealistic facial features are introduced.

While our method produces reconstructed faces of better visual quality than the compared SoTA methods, it does not solve the problem of RWSR of face images. Figure 7 shows several failure cases of our method. These occur when the input image is severely corrupted, e.g., by motion blur or harsh lighting, or when out of focus. In these cases, our method might only super-resolve some parts of the face, e.g., a single eye, or even hallucinate unrealistic facial features.
5. Conclusion
In this paper, we have presented a novel framework for RWSR, which we have evaluated on low-quality face images from surveillance cameras and on artificially corrupted face images. Our method shows SoTA performance in both cases, which is achieved by using the LPIPS-loss and by making the SR model robust against the most common degradation types present in real LR images. Moreover, our model is the first to perform SR on real LR face images of arbitrary sizes, which makes it useful for practical applications. In the future, even better reconstructions could possibly be obtained by including more image degradation types in the framework, e.g., chromatic aberration.
6. Acknowledgments
This work was supported by Danmarks Frie Forskningsfond under grant number 8022-00360B, and the Milestone Research Programme at Aalborg University (MRPA).
References

[1] A. Lugmayr et al. NTIRE 2020 challenge on real-world image super-resolution: Methods and results. CVPR Workshops, 2020.
[2] Assaf Shocher, Nadav Cohen, and Michal Irani. "Zero-shot" super-resolution using deep internal learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[3] Simon Baker and Takeo Kanade. Limits on super-resolution and how to break them. In , pages 2372-2379. IEEE Computer Society, 2000.
[4] Sefi Bell-Kligler, Assaf Shocher, and Michal Irani. Blind super-resolution kernel estimation using an internal-GAN. In Advances in Neural Information Processing Systems 32, pages 284-293. Curran Associates, Inc., 2019.
[5] Yochai Blau, Roey Mechrez, Radu Timofte, Tomer Michaeli, and Lihi Zelnik-Manor. The 2018 PIRM challenge on perceptual image super-resolution. In Laura Leal-Taixé and Stefan Roth, editors, Computer Vision - ECCV 2018 Workshops, pages 334-355, Cham, 2019. Springer International Publishing.
[6] Yochai Blau and Tomer Michaeli. The perception-distortion tradeoff. In , pages 6228-6237. IEEE Computer Society, 2018.
[7] Adrian Bulat and Georgios Tzimiropoulos. Super-FAN: Integrated facial landmark localization and super-resolution of real-world low resolution faces in arbitrary poses with GANs. In , pages 109-117. IEEE Computer Society, 2018.
[8] Adrian Bulat, Jing Yang, and Georgios Tzimiropoulos. To learn image super-resolution, use a GAN to learn how to do image degradation first. In Vittorio Ferrari, Martial Hebert, Cristian Sminchisescu, and Yair Weiss, editors, Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part VI, volume 11210 of Lecture Notes in Computer Science, pages 187-202. Springer, 2018.
[9] J. Cai, H. Zeng, H. Yong, Z. Cao, and L. Zhang. Toward real-world single image super-resolution: A new benchmark and a new model. In , pages 3086-3095, 2019.
[10] Jingwen Chen, Jiawei Chen, Hongyang Chao, and Ming Yang. Image blind denoising with generative adversarial network based noise modeling. In , pages 3155-3164. IEEE Computer Society, 2018.
[11] Yu Chen, Ying Tai, Xiaoming Liu, Chunhua Shen, and Jian Yang. FSRNet: End-to-end learning face super-resolution with facial priors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[12] Zhiyi Cheng, Xiatian Zhu, and Shaogang Gong. Characteristic regularisation for super-resolving face images. In IEEE Winter Conference on Applications of Computer Vision, WACV 2020, Snowmass Village, CO, USA, March 1-5, 2020, pages 2424-2433. IEEE, 2020.
[13] Ryan Dahl, Mohammad Norouzi, and Jonathon Shlens. Pixel recursive super resolution. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 5449-5458. IEEE Computer Society, 2017.
[14] Keyan Ding, Kede Ma, Shiqi Wang, and Eero P. Simoncelli. Image quality assessment: Unifying structure and texture similarity. CoRR, abs/2004.07728, 2020.
[15] Chao Dong, C.C. Loy, Kaiming He, and Xiaoou Tang. Image super-resolution using deep convolutional networks. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 38(2):295-307, Feb 2016.
[16] Hossein Talebi Esfandarani and Peyman Milanfar. NIMA: Neural image assessment. IEEE Trans. Image Process., 27(8):3998-4011, 2018.
[17] Manuel Fritsche, Shuhang Gu, and Radu Timofte. Frequency separation for real-world super-resolution. In IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2019.
[18] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial nets. In Zoubin Ghahramani, Max Welling, Corinna Cortes, Neil D. Lawrence, and Kilian Q. Weinberger, editors,
Advances inNeural Information Processing Systems 27: Annual Confer-ence on Neural Information Processing Systems 2014, De-cember 8-13 2014, Montreal, Quebec, Canada , pages 2672–2680, 2014.[19] Klemen Grm, Martin Pernus, Leo Cluzel, Walter J. Scheirer,Simon Dobrisek, and Vitomir Struc. Face hallucination re-visited: An exploratory study on dataset bias. In
IEEE Con-ference on Computer Vision and Pattern Recognition Work-shops, CVPR Workshops 2019, Long Beach, CA, USA, June16-20, 2019 , pages 2405–2413. Computer Vision Founda-tion / IEEE, 2019.[20] Jinjin Gu, Hannan Lu, Wangmeng Zuo, and Chao Dong.Blind super-resolution with iterative kernel correction. In
The IEEE Conference on Computer Vision and PatternRecognition (CVPR) , June 2019.[21] W. Han, S. Chang, D. Liu, M. Yu, M. Witbrock, and T. S.Huang. Image super-resolution via dual-state recurrent net-works. In , pages 1654–1663, 2018.[22] Xiaozhong Ji, Yun Cao, Ying Tai, Chengjie Wang, Jilin Li,and Feiyue Huang. Real-world super-resolution via kernelestimation and noise injection. In
The IEEE/CVF Confer-ence on Computer Vision and Pattern Recognition (CVPR)Workshops , June 2020.[23] Younghyun Jo, Sejong Yang, and Seon Joo Kim. Inves-tigating loss functions for extreme super-resolution. In , pages 1705–1712. IEEE, 2020.[24] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptuallosses for real-time style transfer and super-resolution. InBastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling, ed-itors,
Computer Vision - ECCV 2016 - 14th European Con-ference, Amsterdam, The Netherlands, October 11-14, 2016,Proceedings, Part II , volume 9906 of
Lecture Notes in Com-puter Science , pages 694–711. Springer, 2016.[25] Tero Karras, Samuli Laine, and Timo Aila. A style-basedgenerator architecture for generative adversarial networks. In
IEEE Conference on Computer Vision and Pattern Recogni-tion, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019 ,pages 4401–4410. Computer Vision Foundation / IEEE,2019.[26] R. G. Keys. Cubic Convolution Interpolation for Digital Im-age Processing.
IEEE Transactions on Acoustics Speech andSignal Processing , 29:1153–1160, Jan. 1981.[27] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Accurateimage super-resolution using very deep convolutional net-works. In
The IEEE Conference on Computer Vision andPattern Recognition (CVPR Oral) , June 2016.[28] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Deeply-recursive convolutional network for image super-resolution.In
The IEEE Conference on Computer Vision and PatternRecognition (CVPR Oral) , June 2016.[29] Wei-Sheng Lai, Jia-Bin Huang, Narendra Ahuja, and Ming-Hsuan Yang. Deep laplacian pyramid networks for fast andaccurate super-resolution. In
IEEE Conference on ComputerVision and Pattern Recognition , 2017.[30] Oliver Langner, Ron Dotsch, Gijsbert Bijlstra, Daniel H. J.Wigboldus, Skyler T. Hawk, and Ad van Knippenberg. Pre-sentation and validation of the radboud faces database.
Cog-nition and Emotion , 24(8):1377–1388, 2010.[31] Valero Laparra, Johannes Ball´e, Alexander Berardino, andEero P. Simoncelli. Perceptual image quality assessmentusing a normalized laplacian pyramid. In Huib de Rid-der, Thrasyvoulos N. Pappas, and Bernice E. Rogowitz, edi-tors,
Human Vision and Electronic Imaging, HVEI 2016, SanFrancisco, California, USA, February 14-18, 2016 , pages 1–6. Ingenta, 2016.[32] C. Ledig, L. Theis, F. Husz´ar, J. Caballero, A. Cunningham,A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W.Shi. Photo-realistic single image super-resolution using agenerative adversarial network. In , pages105–114, 2017.[33] B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee. Enhanceddeep residual networks for single image super-resolution.In , pages 1132–1140, 2017.[34] Andreas Lugmayr, Martin Danelljan, and Radu Timofte. Un-supervised learning for real-world super-resolution. In , pages 3408–3416. IEEE, 2019.[35] A. Lugmayr, M. Danelljan, R. Timofte, M. Fritsche, S.Gu, K. Purohit, P. Kandula, M. Suin, A. N. Rajagoapalan, N. H. Joon, Y. S. Won, G. Kim, D. Kwon, C. Hsu, C.Lin, Y. Huang, X. Sun, W. Lu, J. Li, X. Gao, S. Bell-Kligler, A. Shocher, and M. Irani. Aim 2019 challenge onreal-world image super-resolution: Methods and results. In , pages 3575–3583, 2019.[36] Chao Ma, Chih-Yuan Yang, Xiaokang Yang, and Ming-Hsuan Yang. Learning a no-reference quality metric forsingle-image super-resolution.
Comput. Vis. Image Underst. ,158:1–16, 2017.[37] Sachit Menon, Alexandru Damian, Shijia Hu, Nikhil Ravi,and Cynthia Rudin. PULSE: self-supervised photo upsam-pling via latent space exploration of generative models. In , pages 2434–2442. IEEE, 2020.[38] Anish Mittal, Anush Krishna Moorthy, and Alan ConradBovik. No-reference image quality assessment in the spatialdomain.
IEEE Trans. Image Process. , 21(12):4695–4708,2012.[39] A. Mittal, R. Soundararajan, and A. C. Bovik. Making a“Completely Blind” Image Quality Analyzer.
IEEE SignalProcessing Letters , 20(3):209–212, Mar. 2013.[40] Venkatanath N., Praneeth D., Maruthi Chandrasekhar Bh.,Sumohana S. Channappayya, and Swarup S. Medasani.Blind image quality evaluation using perception based fea-tures. In
Twenty First National Conference on Communica-tions, NCC 2015, Mumbai, India, February 27 - March 1,2015 , pages 1–6. IEEE, 2015.[41] Karen Simonyan and Andrew Zisserman. Very deep convo-lutional networks for large-scale image recognition.
CoRR ,abs/1409.1556, 2014.[42] Jae Woong Soh, Sunwoo Cho, and Nam Ik Cho. Meta-transfer learning for zero-shot super-resolution. In
Proceed-ings of the IEEE Conference on Computer Vision and PatternRecognition , 2020.[43] Hamid Vaezi Joze, Ilya Zharkov, Karlton Powell, CarlRingler, Luming Liang, Andy Roulston, Moshe Lutz, andVivek Pradeep. Imagepairs: Realistic super resolutiondataset via beam splitter camera rig. In
The IEEE Confer-ence on Computer Vision and Pattern Recognition (CVPR)Workshops , June 2020.[44] Tiago F. Vieira, Andrea Bottino, Aldo Laurentini, and Mat-teo De Simone. Detecting siblings in image pairs.
The VisualComputer , 30(12):1333–1345, Dec 2014.[45] Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu,Chao Dong, Yu Qiao, and Chen Change Loy. Esrgan: En-hanced super-resolution generative adversarial networks. InLaura Leal-Taix´e and Stefan Roth, editors,
Computer Vi-sion – ECCV 2018 Workshops , pages 63–79, Cham, 2019.Springer International Publishing.[46] Zhou Wang, Eero P. Simoncelli, and Alan C. Bovik. Multi-scale structural similarity for image quality assessment. In inProc. IEEE Asilomar Conf. on Signals, Systems, and Com-puters, (Asilomar , pages 1398–1402, 2003.[47] Pengxu Wei, Ziwei Xie, Hannan Lu, ZongYuan Zhan, Qixi-ang Ye, Wangmeng Zuo, and Liang Lin. Component divide-nd-conquer for real-world image super-resolution. In
Pro-ceedings of the European Conference on Computer Vision ,2020.[48] Yongkang Wong, Shaokang Chen, Sandra Mau, ConradSanderson, and Brian C. Lovell. Patch-based probabilisticimage quality assessment for face selection and improvedvideo-based face recognition. In
IEEE Biometrics Workshop,Computer Vision and Pattern Recognition (CVPR) Work-shops , pages 81–88. IEEE, June 2011.[49] Kai Zhang, Luc Van Gool, and Radu Timofte. Deep unfold-ing network for image super-resolution. In
IEEE Conferenceon Computer Vision and Pattern Recognition , pages 3217–3226, 2020.[50] Kai Zhang, Wangmeng Zuo, and Lei Zhang. Deep plug-and-play super-resolution for arbitrary blur kernels. In
IEEE Con-ference on Computer Vision and Pattern Recognition, CVPR2019, Long Beach, CA, USA, June 16-20, 2019 , pages 1671–1681. Computer Vision Foundation / IEEE, 2019.[51] Kai Zhang, Wangmeng Zuo, and Lei Zhang. Deep plug-and-play super-resolution for arbitrary blur kernels. In
IEEE Con-ference on Computer Vision and Pattern Recognition, CVPR2019, Long Beach, CA, USA, June 16-20, 2019 , pages 1671–1681. Computer Vision Foundation / IEEE, 2019.[52] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shecht-man, and Oliver Wang. The unreasonable effectiveness ofdeep features as a perceptual metric. In , pages 586–595. IEEE Computer Society, 2018.[53] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A.Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In