Face hallucination using cascaded super-resolution and identity priors
Klemen Grm, Simon Dobrišek, Walter J. Scheirer, Vitomir Štruc
University of Ljubljana, Faculty of Electrical Engineering
University of Notre Dame, Department of Computer Science and Engineering
Fig. 1.
Sample face hallucination results generated with the proposed method.
Abstract.
In this paper we address the problem of hallucinating high-resolution facial images from unaligned low-resolution inputs at high magnification factors. We approach the problem with convolutional neural networks (CNNs) and propose a novel (deep) face hallucination model that incorporates identity priors into the learning procedure. The model consists of two main parts: i) a cascaded super-resolution network that upscales the low-resolution images, and ii) an ensemble of face recognition models that act as identity priors for the super-resolution network during training. Different from competing super-resolution approaches that typically rely on a single model for upscaling (even with large magnification factors), our network uses a cascade of multiple SR models that progressively upscale the low-resolution images in steps of 2×. This characteristic allows us to apply supervision signals (target appearances) at different resolutions and incorporate identity constraints at multiple scales. Our model is able to upscale (very) low-resolution images captured in unconstrained conditions and produces visually convincing results. We rigorously evaluate the proposed model on a large dataset of facial images and report superior performance compared to the state-of-the-art.

1 Introduction

Face hallucination represents a domain-specific super-resolution (SR) problem where the goal is to recover high-resolution (HR) face images from low-resolution (LR) inputs [1]. It has important applications in image enhancement, compression and face recognition [2], but also in surveillance and security [3,4]. Similar to other single-image super-resolution tasks, face hallucination is inherently ill-posed: given a fixed image-degradation model, every LR facial image can be shown to have many possible HR counterparts. Thus, the solution space for SR problems is extremely large, and existing solutions commonly try to produce plausible reconstructions by "hallucinating" high-frequency information based on the provided LR evidence. While significant progress has been made in recent years in the area of super-resolution and face hallucination [5,6,7,8,9,10,11,12,13,14,15,16,17,18,19], super-resolving arbitrary facial images, especially at high magnification factors, is still an open and challenging problem, mainly due to:

– The ill-posed nature of the face hallucination problem, where the solution space is known to grow exponentially with an increase in the desired magnification factor [20]. Even with strong reconstruction constraints it is exceptionally difficult to find good solutions and devise methods that work well under a broad range of conditions. Even for domain-specific SR problems, such as face hallucination, where the solution space is constrained by facial appearances, there is still an overwhelming number of possible solutions.

– The difficulty of learning and integrating strong priors into face hallucination models that sufficiently constrain the solution space beyond solely the visual quality of the reconstructions. Most of the existing priors utilized for super-resolution relate to specific image characteristics, such as gradient distribution [21], total variation [22] or smoothness [23], and hence focus on the perceptual quality of the super-resolved results. If discernibility of the semantic content is the goal of the SR procedure, such priors may not be the optimal choice, as they are not sufficiently task-oriented.

The outlined limitations are most evident for challenging face hallucination problems where tiny low-resolution images (e.g., 24 × 24 pixels) of arbitrary characteristics need to be super-resolved at high magnification factors (e.g., 8×). In this paper, we address some of these limitations with a new hallucination model built around deep convolutional neural networks (CNNs). Our model, called C-SRIP, uses a Cascade of simple Super-Resolution models (referred to as SR modules hereafter) for image upscaling and Identity Priors in the form of pretrained recognition networks as constraints for the training procedure. The SR modules super-resolve the LR input images in magnification increments of 2× and, consequently, allow for intermediate supervision at every scale. This intermediate supervision confines the explosion of the solution-space size and contributes towards more accurate hallucination results. To preserve identity-related features in the SR images, we incorporate pretrained recognition models into the training procedure, which act as identity constraints for the face hallucination problem. The recognition models are trained to respond only to the hallucinated high-frequency parts of the SR images and ensure that the added facial details are not only plausible, but as close to the true details as possible. Due to the availability of intermediate SR results, we incorporate the identity constraints at multiple scales in the C-SRIP model. Additionally, we introduce a novel loss function derived from the structural similarity index (SSIM, [24]) that provides a stronger error signal for model training than the loss functions commonly used in this area.

Overall, we make three main contributions in this paper:
1. We propose a new CNN-based face hallucination model, C-SRIP, that integrates identity priors at multiple scales into the training procedure of a super-resolution network. To the best of our knowledge, this is the first attempt to exploit multi-scale identity information to constrain the solution space of deep-learning based SR models.
2. We introduce a cascaded SR network architecture that super-resolves images in magnification steps of 2× and offers a convenient and transparent way of incorporating supervision signals at multiple scales. Once trained, the SR network is able to hallucinate tiny unaligned 24 × 24 pixel LR images at magnification factors of 8× and produce realistic and visually convincing hallucination results, as illustrated in Fig. 1.
3. We formulate a novel differentiable loss function for SR models based on the concept of structural similarity (SSIM). The novel loss drives our SR model towards solutions of higher perceived quality, as it relates to a measure designed explicitly with the goal of modeling human image-quality perception.

2 Related work

In this section we discuss recent research on super-resolution and face hallucination with the goal of providing the necessary context for our work. For a more comprehensive coverage the reader is referred to the existing surveys on super-resolution and face hallucination, e.g., [25,26,27,28].
Super-resolution:
Recent solutions to the problem of single-image super-resolution (SR) are dominated by learning-based methods that use pairs of corresponding HR and LR images to train machine learning models capable of predicting HR outputs given LR evidence [5,6,7,8,9,10]. The learning procedures used with these models typically aim to minimize an objective function that quantifies the error between the ground truth HR images and the SR predictions. Common objectives in this area include the mean-squared-error (MSE), the mean-absolute-error (MAE) and other related error metrics. Our SR model follows the outlined learning paradigm but, different from existing SR methods, exploits a novel objective related to structural similarity (SSIM, [29]), which better models human image perception than simple pixel-based metrics, such as MSE or MAE.

Our C-SRIP model is based on convolutional neural networks (CNNs) and in this sense is related to recent SR models that exploit CNNs for image upscaling, e.g., [9,6,11,12,14,15,16,17,18,19]. A common aspect of these models is that they super-resolve images in a single step and, while capable of producing impressive SR results, rely only on LR-HR image pairs for training. Our model, on the other hand, upscales the LR inputs in a cascaded manner and allows for supervision signals and constraints to be incorporated at multiple scales during training.

Recent CNN-based SR models, e.g., [6,12], exploit contemporary network architectures such as ResNets [30] and Generative Adversarial Networks (GANs, [31]). These models are closely related to our work, as we also make heavy use of residual connections and incorporate a generative and a discriminative network in our model. While we do not rely on GANs per se, our model does include a discriminative (classification) model that constrains the solution space of the generative SR network. However, our discriminative model is pre-trained and then frozen, rather than optimized alternately with the generator, which greatly improves training stability and still results in realistic SR outputs.

Our work can also be seen as an extreme case of the perceptual-loss (ℓp) image transformation model from [11], which relies on comparisons of high-level features extracted from a pretrained secondary network as the learning objective for SR, instead of comparisons at the pixel level. Our model follows a similar idea, but uses identity (information at the highest possible semantic level) to constrain the solution space of the generative SR network. Thus, instead of network features, our model considers the outputs of a pretrained network during training.

Face hallucination and identity constraints:

Because the solution space of face hallucination models is typically constrained to a set of plausible facial appearances, remarkable performance has been achieved with hallucination models at much higher magnification factors than for general single-image SR tasks [32]. Similarly to other vision problems, the research is moving increasingly towards deep learning, and considerable improvements have been achieved recently with CNN-based models, such as [32,33,34,35,36,37,38,39,40]. We contribute to this body of work with a novel deep face hallucination model. While the SR network of our model is general and applicable to arbitrary input images, we infuse domain-specific knowledge into the model through face recognition models.

It needs to be noted that using identity information as a prior (or constraint) for SR models has been examined before [41,42]. Hennings-Yeomans et al. [43], for example, formulated a joint optimization approach that maximizes super-resolution and face recognition performance simultaneously. This approach is conceptually similar to our work, but ours is more general in the sense that it can be applied with any differentiable classification model, whereas the approach from [43] is focused only on linear feature extraction techniques, e.g., PCA [44]. Recent CNN-based face hallucination methods [32] have included secondary networks as constraints, which are trained jointly with the SR network. We found this to decrease training stability, so we instead use separately trained recognition and SR networks, where the former acts as a constraint for the latter.
3 The C-SRIP model

Our C-SRIP face hallucination model consists of two main components: i) a generative SR network for image upscaling, built around a powerful cascaded residual architecture, and ii) an ensemble of face recognition models that serve as identity priors for the C-SRIP model (see Fig. 2). In the following sections we describe all components of C-SRIP in detail and elaborate on the training procedure used to learn the model parameters.
Fig. 2.
Illustration of the proposed C-SRIP model. The model consists of a generative SR network and an ensemble of face recognition models that serve as identity priors during training. The figure shows all architectural details (best viewed electronically).
3.1 The SR network

The generative part of our C-SRIP model is a 53-layer deep convolutional neural network (CNN) that takes a LR facial image as input and super-resolves it at a magnification factor of 8×. The network progressively upscales the images using a cascaded series of so-called SR modules. Each module upscales the image by a factor of 2×, which makes it possible to apply a loss function on the intermediate SR results and ensures better control of the training procedure in comparison to competing solutions that exploit supervision only at the final scale. The cascaded architecture allows us to solve a series of easier and better conditioned problems using repeated bottom-up inference with top-down supervision, instead of one complex problem with an overwhelming number of possible solutions.

We design our SR network around a fully-convolutional architecture that relies heavily on residual blocks [30] for all processing within one SR module and on sub-pixel convolutions [45] for image upscaling. Our design choices are motivated by the success of fully-convolutional CNN models in various vision problems [30,46,47] and the state-of-the-art performance ensured by the sub-pixel convolutions in prior SR work [45,12]. Similarly to [12], the residual blocks of the SR modules consist of two convolution–batch-norm–activation sub-blocks, followed by a post-activation element-wise sum. We ensure a constant memory footprint of all SR modules by decreasing the number of filters in the convolutional layers by a factor of 2 with every upscaling step. This maximizes the capacity of the network and balances the computational complexity across the SR modules. To upscale the feature maps at the output of each SR module, we rely on the sub-pixel convolution layers proposed in [45]. These layers increase the spatial dimensions of the feature maps by reshuffling and aggregating pixels from multiple LR feature maps and, thus, for every upscaling step of 2× reduce the number of available feature maps by a factor of 4×. We counteract this effect by doubling the number of filters in the convolutional layer preceding the sub-pixel convolutions and, consequently, ensure that the capacity of the SR modules is not compromised due to the upscaling. After reaching the target resolution, the feature maps are passed through one last residual block and a convolutional layer with 3 output channels that produces the final 8× super-resolved RGB image.

The network branches off after each SR module to allow for intermediate top-down supervision during training. Each branch applies a series of large-filter convolutions to produce intermediate SR results at different scales (i.e., 2× and 4× the initial scale) that are incorporated into the loss functions discussed in Section 3.3. However, these branches are not used at test time. The entire architecture of our network is illustrated in detail in Fig. 2.
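To make the module structure concrete, the following PyTorch sketch shows how one such SR module could be assembled. It is a minimal sketch under stated assumptions, not the authors' implementation: the ResidualBlock/SRModule names and the leaky-ReLU slope are illustrative, and the branch convolutions that produce the intermediate RGB outputs are omitted.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two conv-BN-LReLU sub-blocks followed by a post-activation element-wise sum."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.LeakyReLU(0.2),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.LeakyReLU(0.2),
        )

    def forward(self, x):
        return x + self.body(x)

class SRModule(nn.Module):
    """One 2x upscaling step: p residual blocks, filter doubling, sub-pixel convolution."""
    def __init__(self, in_channels, p=7):
        super().__init__()
        self.blocks = nn.Sequential(*[ResidualBlock(in_channels) for _ in range(p)])
        # Doubling the filters before the 2x pixel shuffle (which divides the
        # channel count by 4) leaves in_channels // 2 feature maps for the next module.
        self.expand = nn.Conv2d(in_channels, 2 * in_channels, kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(2)

    def forward(self, x):
        return self.shuffle(self.expand(self.blocks(x)))
```

Chaining three such modules reproduces the 8× cascade: each 2× pixel shuffle divides the channel count by four, so doubling the filters beforehand leaves half as many feature maps for the next module (512 → 256 → 128), as described above.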
Fig. 3.
Each SR module adds fine facial details during upscaling (left). The recognition models are pre-trained to respond to these details only (right) and can therefore be used as identity constraints when learning the parameters of the SR network.

3.2 Identity priors

Using prior information to constrain the solution space of SR models during training is a key mechanism in the area of super-resolution [48,22,23,49,50,51,21]. The main motivation for incorporating priors into SR models is to provide a source of additional information for the learning procedure that complements the commonly used reconstruction-oriented objectives and contributes towards sharper and more accurate SR results.

An exceptionally strong prior in this context (also used in our model) is identity. Because identity information relates to the semantic content (i.e., who is in the image) and not the perceptual quality (i.e., how visually convincing the image is) of the SR images, it represents a natural choice for constraining the solution space of SR models. In fact, it seems intuitive to think about SR from both i) an image-enhancement as well as ii) a content-preservation perspective and to incorporate both views into the SR model for optimal results. While the image-enhancement perspective is covered in our model by a reconstruction-based loss (discussed in Section 3.3), the content-preservation aspect is addressed through an ensemble of CNN-based face recognition models that ensure that identity information is not altered during upscaling.

For C-SRIP we associate each recognition model with one of the SR modules and use it as an identity prior for the corresponding SR output, as illustrated in Fig. 2. Since each SR module can be shown to add only high-frequency details to the input images (see Fig. 3, left), we pretrain all recognition models to respond only to the hallucinated details and ignore the low-resolution content that is shared by the input and SR images (see Fig. 3, right). By focusing exclusively on the added details, we are able to directly link the recognition models to the desired SR outputs and penalize the results in case they alter the facial identity. This mechanism allows us to learn the parameters of the SR network by considering an identity-dependent loss in the overall learning objective.
Fig. 4.
We generate training images for the SR network at four different spatial resolutions (left). For recognition-model training we compute residual images that correspond to the facial details that are hallucinated by the SR modules (right).
While in principle any differentiable recognition model could be used as the identity prior for our face hallucination model, we select SqueezeNet models for this work [52]. The main reason for our choice is the lightweight architecture of SqueezeNet, which does not impose significant runtime slowdowns due to its relatively small memory and FLOPS footprint.
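For context, the basic building block of SqueezeNet [52] is the "fire" module: a 1×1 squeeze convolution followed by parallel 1×1 and 3×3 expand convolutions whose outputs are concatenated. A minimal sketch follows; the filter counts are illustrative arguments, not the configuration used in the paper.

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    """SqueezeNet fire module: squeeze with 1x1 filters, then expand with 1x1 and 3x3."""
    def __init__(self, in_ch, squeeze_ch, expand_ch):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
        self.expand1 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=1)
        self.expand3 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=3, padding=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        s = self.act(self.squeeze(x))
        # Concatenate the two expand branches along the channel dimension.
        return torch.cat([self.act(self.expand1(s)), self.act(self.expand3(s))], dim=1)
```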
3.3 Model training

We train the C-SRIP model in two stages. In the first stage, we learn the parameters of the SqueezeNet models for all three SR outputs. In the second stage, we freeze the weights of the recognition models and train the SR network with a combined loss. The details of both stages are presented next.
Recognition-model training.
In addition to LR and HR image pairs, we also require two intermediate reference images between the lowest and the highest resolution to learn the parameters of the recognition models and SR modules. To this end, we apply a simple degradation model to the available HR images x^hr_i and generate N image quadruplets for training, i.e., {x^lr_i, x^2×_i, x^4×_i, x^hr_i}^N_{i=1}, where x^lr_i represents the LR input image, x^2×_i and x^4×_i stand for the intermediate reference images at 2× and 4× magnification factors, respectively, and the HR image x^hr_i corresponds to the ground truth for the magnification factor of 8×. Our degradation model uses Gaussian blurring followed by image decimation for down-sampling and produces training data as shown on the left side of Fig. 4.

To train the recognition models, we construct residual images that reflect the facial details that need to be learned by the SR modules. The residual images, shown on the right side of Fig. 4, are computed by smoothing the ground truth images with a Gaussian kernel g and subtracting the smoothed image from the original, i.e., Δx^j_i = x^j_i − g ∗ x^j_i, for j ∈ {2×, 4×, hr}, where a separate kernel bandwidth σ_j is used for the residual images at 2×, 4× and 8× the LR image size, respectively. We train the SqueezeNet models on the generated residual images using the categorical cross-entropy loss L_CE:

L_CE(θ_SN, Δx) = − Σ_{k=1..K} p_Δx(k) log p̂_Δx(k),   (1)

where p_Δx denotes the ground truth class probability distribution of the residual image Δx (i.e., p_Δx ∈ {0, 1}^K is a class-encoded one-hot vector), p̂_Δx ∈ R^K stands for the output probability distribution produced by SqueezeNet's softmax layer based on Δx, K stands for the number of classes in the training data, and θ_SN represents the parameters of the network. We learn the parameters of all three recognition models through backpropagation by minimizing the L_CE loss over the training dataset, i.e.:

θ̂^j_SN = arg min_{θ^j_SN} E_{Δx^j}[L_CE(θ^j_SN, Δx^j)].

The result of this first training stage are three SqueezeNet face recognition models θ̂^2×_SN, θ̂^4×_SN and θ̂^hr_SN, one for each image resolution, that respond only to the hallucinated facial details and serve as identity constraints for the SR network.
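As an illustration of this residual-image construction, the sketch below implements Δx = x − g ∗ x with a depthwise Gaussian convolution in PyTorch. It is a minimal sketch rather than the authors' code: the kernel size of 11 is an assumption (only the SSIM kernel size is specified in the paper), and the per-scale bandwidths σ_j are left as arguments since their exact values were lost in extraction.

```python
import torch
import torch.nn.functional as F

def gaussian_kernel(size: int, sigma: float) -> torch.Tensor:
    """Discrete 2D Gaussian kernel, normalized to sum to one."""
    coords = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    g1d = torch.exp(-(coords ** 2) / (2 * sigma ** 2))
    g2d = torch.outer(g1d, g1d)
    return g2d / g2d.sum()

def residual_image(x: torch.Tensor, sigma: float, size: int = 11) -> torch.Tensor:
    """Residual detail image Delta-x = x - g * x for an (N, C, H, W) batch."""
    g = gaussian_kernel(size, sigma).to(x.device)
    weight = g.expand(x.shape[1], 1, size, size)  # one depthwise kernel per channel
    smoothed = F.conv2d(x, weight, padding=size // 2, groups=x.shape[1])
    return x - smoothed
```

The same helper can be reused to build the residual inputs for the recognition models at all three scales.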
SR network training.
Standard reconstruction-oriented loss functions used for learning SR models, such as MSE or MAE, are known to produce overly smooth and often blurry SR results [12]. We therefore design a new loss function for our SR network around the structural similarity index (SSIM, [29]) and integrate it directly into our learning algorithm. Specifically, we use our SSIM approximation as a loss function for the C-SRIP hallucination model.

Given a ground truth image x and the corresponding SR network prediction x̂ = f_θ_SR(x), we compute the SSIM-based loss as follows:

L_SSIM(θ_SR, x) = (1/2) (1 − E_x[ŜSIM(x, x̂)]),   (2)

where the SR network f is parametrized by θ_SR, E_x[·] stands for the expectation operator over the spatial coordinates, and ŜSIM(x, x̂) is a spatial similarity map between x and x̂ defined as:

ŜSIM(x, x̂) = [(2 μ_xx̂ + C₁) ⊙ (2 σ_xx̂ + C₂)] / [(μ²_x + μ²_x̂ + C₁) ⊙ (σ²_x + σ²_x̂ + C₂)],   (3)

where
μ_x = x ∗ g,    μ²_x = μ_x ⊙ μ_x,    σ²_x = (x ⊙ x) ∗ g − μ²_x,
μ_x̂ = x̂ ∗ g,    μ²_x̂ = μ_x̂ ⊙ μ_x̂,    σ²_x̂ = (x̂ ⊙ x̂) ∗ g − μ²_x̂,
μ_xx̂ = μ_x ⊙ μ_x̂,    σ_xx̂ = (x ⊙ x̂) ∗ g − μ_xx̂.

In the above equations, ∗ denotes the convolution operator, ⊙ denotes the Hadamard product, and the open parameters g, C₁ and C₂ are defined as per the SSIM reference implementation provided by the authors of [24], i.e., g is an 11 × 11 Gaussian kernel with σ = 1.5, C₁ ≈ 6.5 and C₂ ≈ 58.5. Fig. 5 compares the error maps produced by the squared error used with MSE-based losses and by our ŜSIM approximation. The examples show that the SSIM approximation results in error maps that are less sparse compared to the squared differences used with MSE-based losses, which, as we discuss in the experimental section, results in better training characteristics.

Fig. 5.
Error maps generated by the squared error used by the MSE loss and the proposed ŜSIM function (the error map between x and the ground truth x_g is defined as 1 − ŜSIM(x, x_g)). The figure shows degraded images (left), the corresponding squared-error maps (center) and the error maps generated by ŜSIM (right).

Based on the pretrained SqueezeNet models and the loss introduced above, we define the overall loss of our C-SRIP face hallucination model as follows:

L(θ_SR, {x^j}) = Σ_{j∈D} [ L_SSIM(θ_SR, x^j) + α L_CE(θ^j_SN, Δx^j) ],   (4)

where D = {2×, 4×, hr}, α is a weight parameter that balances the relative impact of the reconstruction- and recognition-based losses, and θ_SR stands for the parameters of the SR network that we aim to learn. The residual images Δx^j are constructed during training as illustrated in Fig. 4 (right). We use backpropagation to minimize the loss over our training data and find the parameters of the SR network θ̂_SR, i.e., θ̂_SR = arg min_{θ_SR} E_{x^j}[L(θ_SR, {x^j})].

Once the training is complete, we remove the recognition models and the network branches used to generate the intermediate SR results at 2× and 4× magnification factors and use only the main output of the SR network for face hallucination. The final SR network takes a LR image x_lr of size 24 × 24 pixels as input and returns an 8× upscaled 192 × 192 facial image x_hr at the output.
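As a concrete illustration of Eqs. (2) and (3), the sketch below implements the ŜSIM map and the derived loss with standard convolution primitives. It is a minimal sketch under stated assumptions, not the authors' code: it reuses the gaussian_kernel helper from the sketch above, assumes inputs in the [0, 255] range, and uses the reference constants C₁ = (0.01·255)² and C₂ = (0.03·255)².

```python
import torch
import torch.nn.functional as F

C1, C2 = (0.01 * 255) ** 2, (0.03 * 255) ** 2  # SSIM reference constants

def ssim_map(x: torch.Tensor, y: torch.Tensor, size: int = 11, sigma: float = 1.5):
    """Spatial SSIM map between image batches x and y, per Eq. (3)."""
    c = x.shape[1]
    g = gaussian_kernel(size, sigma).to(x.device).expand(c, 1, size, size)
    blur = lambda t: F.conv2d(t, g, padding=size // 2, groups=c)
    mu_x, mu_y = blur(x), blur(y)
    var_x = blur(x * x) - mu_x ** 2
    var_y = blur(y * y) - mu_y ** 2
    cov_xy = blur(x * y) - mu_x * mu_y
    return ((2 * mu_x * mu_y + C1) * (2 * cov_xy + C2)) / (
        (mu_x ** 2 + mu_y ** 2 + C1) * (var_x + var_y + C2)
    )

def ssim_loss(x: torch.Tensor, x_hat: torch.Tensor) -> torch.Tensor:
    """L_SSIM of Eq. (2): 0.5 * (1 - mean SSIM over spatial coordinates)."""
    return 0.5 * (1.0 - ssim_map(x, x_hat).mean())
```

Because every operation is differentiable, the loss can be backpropagated through the SR network like any pixel-wise objective.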
3.4 Implementation details

All three SqueezeNet models are implemented in accordance with the so-called complex SqueezeNet architecture from [52]. The models consist of 9 fire modules with intermediate shortcut connections, followed by a global average pooling layer and a softmax classifier on top. We train the first recognition model to classify residual images at 2× the initial LR scale, i.e., 48 × 48 pixels, the second to classify images at 4× the initial scale, i.e., 96 × 96 pixels, and the third to classify images at the full 8× scale, i.e., 192 × 192 pixels in size. To learn the model parameters we use backpropagation and the Adam [53] mini-batch gradient descent algorithm with a batch size of 128; the learning rate is reduced by a constant factor every 20 epochs. To avoid over-fitting, we resort to data augmentation in the form of random horizontal flipping and random crops. We employ an early stopping criterion based on accuracy improvements on the validation set: if no improvements are observed over 10 consecutive training epochs, we stop the learning procedure and assume the recognition model has converged.
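To make the two-stage procedure concrete, the following sketch shows how one SR-network update could combine the multi-scale SSIM losses of Eq. (2) with the cross-entropy terms of Eq. (4), using the frozen recognition models. It is a sketch under stated assumptions: the names sr_net and recognizers, the dictionary-style interfaces, the default alpha value (the paper's value was lost in extraction), and the choice to compute the residuals from the SR predictions (one plausible reading of Eq. (4)) are all illustrative; ssim_loss and residual_image are the helpers sketched above.

```python
import torch.nn.functional as F

def c_srip_step(sr_net, recognizers, sigmas, batch, labels, optimizer, alpha=0.1):
    """One SR-network update with the combined loss of Eq. (4).

    `batch` maps each scale in ('2x', '4x', 'hr') to its ground-truth images,
    plus 'lr' to the 24x24 inputs; `recognizers` holds the frozen SqueezeNets
    (their parameters have requires_grad=False, but gradients still flow back
    through the residual details to the SR network).
    """
    preds = sr_net(batch['lr'])  # assumed to return intermediate and final SR outputs
    loss = 0.0
    for scale in ('2x', '4x', 'hr'):
        loss = loss + ssim_loss(batch[scale], preds[scale])
        details = residual_image(preds[scale], sigmas[scale])
        logits = recognizers[scale](details)
        loss = loss + alpha * F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```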
The SR network.
The SR network consists of three SR modules that are preceded by a convolutional layer with 512 large-scale filters of size 9 × 9. Each SR module contains p = 7 residual blocks, with 512 filters per block in the first SR module, 256 filters in the second SR module, and 128 filters in the last SR module, as shown in Fig. 2. We set the number of filters in the final convolutional layer of the SR modules to 1024 for the first, 512 for the second and 256 for the third module. All other filters are of size 3 × 3; the final residual block of the SR network has 128 such filters. We set the weight α of the identity-prior term in Eq. (4) to a fixed value below one, train the network with Adam, and reduce the learning rate at the end of epochs 10, 25, 50 and 80. We use a combined early stopping criterion that assumes the model has converged if both SSIM and MSE show no improvements over 10 epochs.

4 Experiments

4.1 Datasets and experimental setup

We select two datasets for our experiments. To train the C-SRIP model we use the CASIA WebFace dataset [54], which features 494,414 images of 10,575 identities (i.e., N = 494,414 and K = 10,575), while for the final evaluation we use the Labeled Faces in the Wild (LFW) dataset [55], which features 13,233 images of 5,749 subjects. The two datasets are selected for the experiments because they feature images of variable quality captured in unconstrained conditions and thus represent a significant challenge for SR models. More importantly, they are designed to contain zero overlap in terms of identity, which is paramount to ensure a fair and unbiased evaluation of the C-SRIP model.

For SqueezeNet training we randomly sample identities from CASIA WebFace and utilize 90% of the images for training and 10% for validation. The recognition models converge to stable rank-one recognition rates on the training and validation data for residual images of size 48 × 48, 96 × 96 and 192 × 192 pixels. As expected, the performance decreases with a decreasing size of the residual images and is adversely affected by the lack of low-frequency information during training (see, e.g., [56] for the expected performance of SqueezeNet for face recognition). Nevertheless, the models contribute towards accurate and visually convincing SR results, as evidenced by the results in the next sections. Since we also need identity information when learning the parameters of the SR network of C-SRIP, we again use the 90%/10% data split per identity for training and validation. With this setup we train the SR network on 494,414 CASIA WebFace images.

We train all models on a workstation with two Nvidia GTX Titan Xp GPUs. On this hardware, the SqueezeNet training takes 1, 2, and 5 days, respectively, for the 2×, 4× and 8× scale models. The training of the SR network with the identity constraints included takes around 8 days. Once trained, the SR network is capable of processing images at an average speed of 15 ms/image on GPU in batch mode, or 30 ms/image in real-time (i.e., single-sample batch) mode.
LR Input   Bicubic   NBSRF   SRCNN   VDSR   ℓp   SRGAN   URDGN   C-SRIP   Target
Fig. 6.
Qualitative comparison of state-of-the-art SR models on sample images from the LFW dataset. The first column shows the input 24 × 24 pixel LR image (upscaled with nearest-neighbor interpolation). Best viewed zoomed in.

4.2 Comparison with the state-of-the-art

We compare our C-SRIP model with six state-of-the-art SR and face hallucination models, i.e.: the Naive Bayes Super-Resolution Forest (NBSRF) from [10], the Super-Resolution Convolutional Neural Network (SRCNN) from [9], the Very Deep Super Resolution network (VDSR) from [6], the perceptual-loss based SR model (ℓp) from [11], the Super-Resolution Generative Adversarial Network (SRGAN) from [12], and the Ultra Resolving Discriminative Generative Network (URDGN) from [32]. We train all models with the same data as C-SRIP and use the open-source implementations of the authors (where available) for a fair comparison. For ℓp we use features from the fire2, fire3 and fire4 layers of SqueezeNet for the learning criterion. We include results for bicubic interpolation as a baseline.

Qualitative comparison.
A few sample SR images are presented in Fig. 6. We see that with magnification factors of 8×, interpolation methods are insufficient and result in the loss of facial details. Furthermore, general SR models, such as NBSRF, SRCNN and VDSR, fail to provide substantial improvements and are seen to amplify noise present in the LR images. These models fail to make use of the available facial context due to their relatively small receptive fields. The SRGAN, URDGN and ℓp models improve on this by including secondary networks as constraints during SR training. ℓp is consistently the best-performing model included in our comparison, only slightly behind C-SRIP. However, we notice it often adds high-frequency noise when trying to minimize the perceptual loss over the convolutional maps of the secondary network. We speculate that the reason our model is not susceptible to these errors is the global cross-entropy loss of the secondary networks, as opposed to the local convolutional features exploited by ℓp.

Fig. 7.
Qualitative comparison of the evaluated SR models on sample images from the LFW dataset with highlighted image details. Best viewed electronically.
Table 1.
Averaged PSNR and SSIM scores for the tested SR models computed over the LFW dataset. The highest PSNR and SSIM values are achieved by C-SRIP.
Model   Bicubic   NBSRF [10]   SRCNN [9]   VDSR [6]   ℓp [11]   SRGAN [12]   URDGN [32]   C-SRIP (ours)
PSNR    24.256    25.092       24.812      25.415     26.985    25.669       25.…         …
SSIM    0.…       0.…          0.…         0.…        0.…       0.…          0.…          0.…

Quantitative comparison.
We report average peak-signal-to-noise-ratio (PSNR) and structural similarity (SSIM) scores computed over the LFW images for all tested models in Table 1. C-SRIP results in the best overall performance in terms of PSNR and SSIM, followed by ℓp and URDGN. While providing reasonably convincing visual results, SRGAN produces only an average PSNR score and the lowest SSIM score among all tested models. This result is expected and is observed regularly in the literature [12] with GAN-based SR methods. NBSRF, SRCNN and VDSR improve upon the bicubic baseline in terms of performance metrics, but are less competitive in comparison to the three top performers of our experiments.

The summary statistics in Table 1 show only a partial picture of the performance of the tested models. To get better insight we present Cumulative Score (PSNR and SSIM) Distribution (CSD) curves of the experiments in Fig. 8. Since SR models increasingly rely on learning-based techniques, which are expected to perform inconsistently across images of different characteristics, CSD curves provide a reasonable way of visualizing this performance variability. From the presented curves we see that all tested methods vary significantly in PSNR and SSIM scores across the LFW dataset, with a large fraction of images producing below-average performance scores. The ℓp and the proposed C-SRIP models are superior to the other models and very close in terms of the PSNR-based CSD curve. However, the difference becomes significantly larger with the SSIM-based CSD curve, where C-SRIP is the top performer.
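For reference, a CSD curve of the kind shown in Fig. 8 can be computed as in the short sketch below; the score array name is illustrative.

```python
import numpy as np

def csd_curve(scores, num_points=200):
    """Cumulative Score Distribution: fraction of images scoring at most each threshold.

    `scores` is a 1-D array of per-image PSNR or SSIM values; curves that sit
    further to the right (higher scores at a given fraction) indicate a better model.
    """
    scores = np.asarray(scores, dtype=np.float64)
    thresholds = np.linspace(scores.min(), scores.max(), num_points)
    fractions = (scores[None, :] <= thresholds[:, None]).mean(axis=1)
    return thresholds, fractions
```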
Fig. 8.
Cumulative Score Distribution (CSD) curves for the PSNR (left) and SSIM (right) scores over the LFW dataset. Curves further to the right are better.
4.3 Ablation study

We perform an ablation study with the goal of assessing the contribution of the individual components of our proposed C-SRIP model. Towards this end, we train the following models and evaluate their performance on the LFW dataset:

1. Baseline: A baseline SR model without the cascaded SR modules. The model consists of 21 residual blocks, similarly to our C-SRIP model, but the three sub-pixel convolution layers for upscaling are all at the end of the model. The model is trained using the standard MSE loss.
2. B+SSIM: Same as above, but trained with the proposed SSIM-based loss.
3. C+SSIM: Our cascaded SR model, trained with the proposed SSIM-based loss, but without the identity prior networks and without multi-scale supervision, i.e., the loss function is only applied at the output of the model.
4. C+SSIM+M: Our cascaded SR model, trained with multi-scale supervision and the proposed SSIM-based loss function, but without the identity priors.
5. C-SRIP: The full C-SRIP model with multi-scale SSIM and identity supervision.

The results of the ablation study in Table 2 and the corresponding sample images in Fig. 9 show that each added component improves performance. The only decrease we see is when we switch from the MSE loss to the SSIM-based loss, which slightly lowers the average PSNR score, but results in a higher SSIM score. This result is expected, as PSNR is a direct function of MSE and, thus, SR models optimizing for MSE typically achieve higher PSNR values than models using other loss functions. Nevertheless, we observe much better training characteristics with the SSIM loss, since the models converged faster and achieved significantly better SSIM and MSE scores on the training and validation data than the MSE-based models. Among the evaluated components, we see the biggest increase in the PSNR and SSIM scores with the multi-scale identity supervision. This addition also results in the biggest visual improvement of the SR images, as seen in Fig. 9.
4.4 Analysis of failure cases

To evaluate the weaknesses of the proposed C-SRIP model, we examine in Fig. 10 a few example images that result in the worst SR results according to the SSIM score. We identify a few potential reasons for the poor SR performance, i.e.:
Table 2.
Ablation study on the LFW dataset. The table shows the impact of different model components on the average PSNR and SSIM scores.

           Baseline   B+SSIM   C+SSIM   C+SSIM+M   C-SRIP
PSNR [dB]  26.…       …        …        …          …
SSIM       …          …        …        …          …

LR Input   Baseline   B+SSIM   C+SSIM   C+SSIM+M   C-SRIP   Target
Fig. 9.
Visual results of the ablation study.
Fig. 10.
Examples of poor SR results obtained with the C-SRIP model according to the SSIM value. The four columns of each image correspond to (from left to right): the input LR image, bicubic interpolation, C-SRIP and the target HR image.

– High-frequency details. Images 10a, 10b and 10d contain a large amount of high-frequency details (background, hair). Our SR network is guided by the face-recognition models that focus on the face and ignore other regions.
– Significant occlusion. In images 10a and 10f, the face is partially occluded by a foreground object. The occlusion changes the global facial appearance, which adversely affects the reconstruction capabilities of C-SRIP.
– Significant pose variations. In image 10e, the subject's face is partially obscured due to the profile pose. Few samples in our training dataset feature profile poses, which deteriorates performance on this type of facial image.
– Low-quality HR image. Image 10c contains a significant amount of noise, which is reduced during down-sampling and cannot be reconstructed.
5 Conclusion

We have presented a novel CNN-based model for identity-preserving face hallucination from very low-resolution images (i.e., 24 × 24 pixels) at high magnification factors. We have shown that the proposed model improves SR results on face images compared to both existing general super-resolution and face hallucination models. In terms of future work, we see the possibility of adapting our model to other modalities, e.g., to video sequences via recurrent attention models.
References
1. Baker, S., Kanade, T.: Hallucinating faces. In: Automatic Face and Gesture Recognition (FG), IEEE (2000) 83–88
2. Liu, C., Shum, H.Y., Freeman, W.T.: Face hallucination: Theory and practice. International Journal of Computer Vision (1) (2007) 115
3. Baker, S., Kanade, T.: Limits on super-resolution and how to break them. IEEE Transactions on Pattern Analysis and Machine Intelligence (9) (2002) 1167–1183
4. Gunturk, B.K., Batur, A.U., Altunbasak, Y., Hayes, M.H., Mersereau, R.M.: Eigenface-domain super-resolution for face recognition. IEEE Transactions on Image Processing (5) (2003) 597–606
5. Yang, J., Wright, J., Huang, T.S., Ma, Y.: Image super-resolution via sparse representation. IEEE Transactions on Image Processing (11) (2010) 2861–2873
6. Kim, J., Kwon Lee, J., Mu Lee, K.: Accurate image super-resolution using very deep convolutional networks. In: Computer Vision and Pattern Recognition (CVPR). (2016) 1646–1654
7. Yang, S., Wang, M., Chen, Y., Sun, Y.: Single-image super-resolution reconstruction via learned geometric dictionaries and clustered sparse coding. IEEE Transactions on Image Processing (9) (2012) 4016–4028
8. Timofte, R., De Smet, V., Van Gool, L.: Anchored neighborhood regression for fast example-based super-resolution. In: International Conference on Computer Vision (ICCV). (2013) 1920–1927
9. Dong, C., Loy, C.C., He, K., Tang, X.: Learning a deep convolutional network for image super-resolution. In: European Conference on Computer Vision (ECCV), Springer (2014) 184–199
10. Salvador, J., Perez-Pellitero, E.: Naive Bayes super-resolution forest. In: International Conference on Computer Vision (ICCV). (2015) 325–333
11. Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and super-resolution. In: European Conference on Computer Vision (ECCV), Springer (2016) 694–711
12. Ledig, C., Theis, L., Huszár, F., Caballero, J., Cunningham, A., Acosta, A., Aitken, A., Tejani, A., Totz, J., Wang, Z., et al.: Photo-realistic single image super-resolution using a generative adversarial network. In: Computer Vision and Pattern Recognition (CVPR). (2017) 4681–4690
13. Cao, Q., Lin, L., Shi, Y., Liang, X., Li, G.: Attention-aware face hallucination via deep reinforcement learning. arXiv preprint arXiv:1708.03132 (2017)
14. Xu, X., Sun, D., Pan, J., Zhang, Y., Pfister, H., Yang, M.H.: Learning to super-resolve blurry face and text images. In: Computer Vision and Pattern Recognition (CVPR). (2017) 251–260
15. Sajjadi, M.S., Schölkopf, B., Hirsch, M.: EnhanceNet: Single image super-resolution through automated texture synthesis. In: International Conference on Computer Vision (ICCV), IEEE (2017) 4501–4510
16. Tong, T., Li, G., Liu, X., Gao, Q.: Image super-resolution using dense skip connections. In: International Conference on Computer Vision (ICCV), IEEE (2017) 4809–4817
17. Lai, W.S., Huang, J.B., Ahuja, N., Yang, M.H.: Deep Laplacian pyramid networks for fast and accurate super-resolution. In: Computer Vision and Pattern Recognition (CVPR). (July 2017)
18. Tai, Y., Yang, J., Liu, X.: Image super-resolution via deep recursive residual network. In: Computer Vision and Pattern Recognition (CVPR). (July 2017)
19. Huang, Y., Shao, L., Frangi, A.F.: Simultaneous super-resolution and cross-modality synthesis of 3D medical images using weakly-supervised joint convolutional sparse coding. In: Computer Vision and Pattern Recognition (CVPR). (July 2017)
20. Baker, S., Kanade, T.: Limits on super-resolution and how to break them. IEEE Transactions on Pattern Analysis and Machine Intelligence (9) (2002) 1167–1183
21. Cho, T.S., Zitnick, C.L., Joshi, N., Kang, S.B., Szeliski, R., Freeman, W.T.: Image restoration by matching gradient distributions. IEEE Transactions on Pattern Analysis and Machine Intelligence (4) (2012) 683–694
22. Wang, Y., Yin, W., Zhang, Y.: A fast algorithm for image deblurring with total variation regularization. (2007)
23. Dai, S., Han, M., Xu, W., Wu, Y., Gong, Y.: Soft edge smoothness prior for alpha channel super resolution. In: Computer Vision and Pattern Recognition (CVPR), IEEE (2007) 1–8
24. Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing (4) (2004) 600–612
25. Tian, J., Ma, K.K.: A survey on super-resolution imaging. Signal, Image and Video Processing (3) (2011) 329–342
26. Nasrollahi, K., Moeslund, T.B.: Super-resolution: a comprehensive survey. Machine Vision and Applications (6) (2014) 1423–1468
27. Wang, N., Tao, D., Gao, X., Li, X., Li, J.: A comprehensive survey to face hallucination. International Journal of Computer Vision (1) (2014) 9–30
28. Nguyen, K., Fookes, C., Sridharan, S., Tistarelli, M., Nixon, M.: Super-resolution for biometrics: A comprehensive survey. Pattern Recognition (2018)
29. Wang, Z., Simoncelli, E.P., Bovik, A.C.: Multiscale structural similarity for image quality assessment. In: Asilomar Conference on Signals, Systems and Computers. Volume 2., IEEE (2003) 1398–1402
30. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition (CVPR). (2016) 770–778
31. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in Neural Information Processing Systems (NIPS). (2014) 2672–2680
32. Yu, X., Porikli, F.: Ultra-resolving face images by discriminative generative networks. In: European Conference on Computer Vision (ECCV), Springer (2016) 318–333
33. Jia, K., Gong, S.: Generalized face super-resolution. IEEE Transactions on Image Processing (6) (2008) 873–886
34. Jin, Y., Bouganis, C.S.: Robust multi-image based blind face hallucination. In: Computer Vision and Pattern Recognition (CVPR), IEEE (2015) 5252–5260
35. Zhu, S., Liu, S., Loy, C.C., Tang, X.: Deep cascaded bi-network for face hallucination. In: European Conference on Computer Vision (ECCV), Springer (2016) 614–630
36. Yang, C.Y., Liu, S., Yang, M.H.: Structured face hallucination. In: Computer Vision and Pattern Recognition (CVPR), IEEE (2013) 1099–1106
37. Zhou, E., Fan, H., Cao, Z., Jiang, Y., Yin, Q.: Learning face hallucination in the wild. In: AAAI Conference on Artificial Intelligence. (2015) 3871–3877
38. Farrugia, R.A., Guillemot, C.: Face hallucination using linear models of coupled sparse support. IEEE Transactions on Image Processing (9) (2017) 4562–4577
39. Yu, X., Porikli, F.: Face hallucination with tiny unaligned images by transformative discriminative neural networks. In: AAAI Conference on Artificial Intelligence. Volume 2. (2017) 3
40. Yu, X., Porikli, F.: Imagining the unimaginable faces by deconvolutional networks. IEEE Transactions on Image Processing (2018)
41. Liu, W., Lin, D., Tang, X.: Neighbor combination and transformation for hallucinating faces. In: IEEE International Conference on Multimedia and Expo (ICME), IEEE (2005) 4 pp.
42. Li, B., Chang, H., Shan, S., Chen, X.: Aligning coupled manifolds for face hallucination. IEEE Signal Processing Letters (11) (2009) 957–960
43. Hennings-Yeomans, P.H., Baker, S., Kumar, B.V.: Simultaneous super-resolution and feature extraction for recognition of low-resolution faces. In: Computer Vision and Pattern Recognition (CVPR), IEEE (2008) 1–8
44. Turk, M., Pentland, A.: Eigenfaces for recognition. Journal of Cognitive Neuroscience (1) (1991) 71–86
45. Shi, W., Caballero, J., Huszár, F., Totz, J., Aitken, A.P., Bishop, R., Rueckert, D., Wang, Z.: Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In: Computer Vision and Pattern Recognition (CVPR). (2016) 1874–1883
46. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems (NIPS). (2012) 1097–1105
47. Parkhi, O.M., Vedaldi, A., Zisserman, A., et al.: Deep face recognition. In: British Machine Vision Conference (BMVC). Volume 1. (2015) 6
48. Rudin, L.I., Osher, S., Fatemi, E.: Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena (1-4) (1992) 259–268
49. Shan, Q., Jia, J., Agarwala, A.: High-quality motion deblurring from a single image. In: ACM Transactions on Graphics. Volume 27., ACM (2008) 73
50. Lee, D.C., Hebert, M., Kanade, T.: Geometric reasoning for single image structure recovery. In: Computer Vision and Pattern Recognition (CVPR), IEEE (2009) 2136–2143
51. Sun, J., Xu, Z., Shum, H.Y.: Image super-resolution using gradient profile prior. In: Computer Vision and Pattern Recognition (CVPR), IEEE (2008) 1–8
52. Iandola, F.N., Han, S., Moskewicz, M.W., Ashraf, K., Dally, W.J., Keutzer, K.: SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360 (2016)
53. Kingma, D., Ba, J.: Adam: A method for stochastic optimization. In: International Conference on Learning Representations (ICLR). (2015)
54. Yi, D., Lei, Z., Liao, S., Li, S.Z.: Learning face representation from scratch. arXiv preprint arXiv:1411.7923 (2014)
55. Huang, G.B., Ramesh, M., Berg, T., Learned-Miller, E.: Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst (2007)
56. Grm, K., Štruc, V., Artiges, A., Caron, M., Ekenel, H.K.: Strengths and weaknesses of deep learning models for face recognition against image degradations. IET Biometrics (1) (2017) 81–89

A Appendix
In this section we present some additional results to further highlight the merits of our C-SRIP model. Similarly to the main paper, we use images from the LFW dataset [55] (down-sampled by smoothing the original HR images followed by sub-sampling) as our test data. All inputs to the C-SRIP model are of size 24 × 24 pixels.

A.1 Results for small magnification factors
We first demonstrate the performance of the C-SRIP model for lower magnification factors, i.e., 2× and 4×, that produce images of size 48 × 48 pixels and 96 × 96 pixels, respectively, given 24 × 24 pixel LR inputs. These images correspond to the intermediate results of the C-SRIP model that were not used for the experiments in the main part of the paper and are generated by the first and second SR modules of C-SRIP, as shown in Fig. 11. A few illustrative SR examples generated at 2× and 4× the input scale are presented in Fig. 12.

Fig. 11.
Illustration of the intermediate results generated with the SR modules of the C-SRIP model. The top row shows the output at a magnification factor of 2× and the bottom row shows the output at 4×. We again use the kXnXsX notation introduced in [12] to denote convolutional layers with n filters of size k × k, applied with stride s.

We observe that our model achieves realistic SR results even for small magnification factors. That is, even when the images are upscaled to a (still modest) size of 48 × 48 or 96 × 96 pixels, the hallucinated images preserve the identity of the subjects reasonably well, despite the limited performance of the SqueezeNet models at these scales and, consequently, the relatively weak identity constraint applied during training. It needs to be noted that none of the presented subjects has been included in our training data.
Fig. 12.
Qualitative results for the intermediate scales generated by our C-SRIP model. The columns correspond to (from left to right): the 24 × 24 input image, bicubic interpolation, our results and the ground truth (GT) at either 48 × 48 or 96 × 96 pixels.

A.2 Improving the visual quality of the hallucinated images
It is possible to further improve the (perceived) visual quality of the SR images produced by the C-SRIP model (for large magnification factors of 8×) by utilizing simple image enhancement techniques. In Fig. 13 and Fig. 14 we show some examples where a standard 3 × 3 sharpening filter is applied to the generated images. The results of applying such post-processing steps are significantly sharper and crisper SR images. However, in terms of summary statistics (i.e., average SSIM and PSNR scores) these are not competitive with the results reported in the main part of the paper: the sharpening operation deteriorates the (quantitatively measured) performance. In Fig. 13 and Fig. 14 we also include results for some examples that were already presented in the main part of the paper to facilitate implicit comparisons with competing methods.

Table 3.
PSNR and SSIM scores obtained on the training data with the MSE- and SSIM-based losses.

           MSE-based loss   SSIM-based loss
PSNR [dB]  28.…             …
SSIM       …                …
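A post-processing step of this kind is easy to reproduce. The sketch below applies a common 3 × 3 sharpening kernel per channel; the exact coefficients used in the paper were lost in extraction, so this standard choice is an assumption, as is the [0, 255] intensity range.

```python
import torch
import torch.nn.functional as F

# A common 3x3 sharpening kernel; an assumption, since the paper's exact
# coefficients were lost in extraction.
SHARPEN = torch.tensor([[ 0., -1.,  0.],
                        [-1.,  5., -1.],
                        [ 0., -1.,  0.]])

def sharpen(img: torch.Tensor) -> torch.Tensor:
    """Apply the sharpening filter per channel to an (N, C, H, W) batch."""
    c = img.shape[1]
    weight = SHARPEN.to(img.device).expand(c, 1, 3, 3)
    out = F.conv2d(img, weight, padding=1, groups=c)
    return out.clamp(0.0, 255.0)  # keep the result in the valid intensity range
```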
Fig. 13.
Qualitative results for SR outputs post-processed with a standard image enhancement technique (i.e., with a sharpening filter). For each 24 × 24 LR input image (on the far left of each quadruplet) the following columns correspond to (from left to right): C-SRIP, C-SRIP with image enhancement, and the target HR image. Best viewed in high resolution.
Interestingly, after the post-processing some of the SR images appear sharper than the original HR targets. This can be partially explained by the presence of noise in the target images that is not present in the SR reconstructions, and by the higher image contrast after enhancement, which contributes towards the perception of higher-quality images.
A.3 Quantitative results on the impact of the SSIM loss
Next, we present some additional quantitative results related to the proposed SSIM loss. Our SSIM formulation uses convolutions with a discrete Gaussian kernel g (see Eq. (3)) to approximate the local averages used in the original SSIM and is, therefore, easily implementable using standard deep learning frameworks.
Fig. 14.
Qualitative results for SR outputs post-processed with a standard image enhancement technique (i.e., with a sharpening filter), with highlighted image details. For each 24 × 24 LR input image (on the far left of each quadruplet) the following columns correspond to (from left to right): C-SRIP, C-SRIP with image enhancement, and the target HR image. Best viewed in high resolution.
Table 4.
Comparison of the PSNR and SSIM scores on the test data obtained with the MSE- and SSIM-based losses.

           MSE-based loss   SSIM-based loss
PSNR [dB]  26.…             …
SSIM       …                …

As emphasized in the main part of the paper, using the proposed SSIM-based loss results in significantly better training characteristics in terms of faster convergence and better PSNR and SSIM scores on the training data, as shown in Table 3. Here, the results are presented for the simplest architecture from the ablation study (Section 4.3), where i) the images are processed through a series of 21 residual blocks, ii) all three upscaling layers are placed at the end of the SR network, and iii) supervision is applied only at the output of the model.

The proposed SSIM-based loss ensures significantly better performance scores during training. Even though the MSE-based loss directly optimizes for PSNR, our SSIM-based loss results in a higher average PSNR score on the training data, which suggests that a better optimum was found by the backpropagation-based learning procedure. On the test data the proposed loss still improves on the SSIM score, but offers no improvements in terms of PSNR.