Super-Resolved Image Perceptual Quality Improvement via Multi-Feature Discriminators
Xuan Zhu, Yue Cheng, Jinye Peng, Rongzhi Wang, Mingnan Le, Xin Liu
Xuan Zhu a*, Yue Cheng a, Jinye Peng a, Rongzhi Wang a, Mingnan Le a, Xin Liu a; a School of Information Science and Technology, Northwest University, Xi'an, People's Republic of China
Abstract.
Generative adversarial networks (GANs) for image super-resolution (SR) have attracted enormous interest in recent years. However, existing GAN-based SR methods use only an image discriminator to distinguish SR images from high-resolution (HR) images. An image discriminator alone cannot discriminate images accurately, since image features are not fully expressed. In this paper, we design a new GAN-based SR framework, GAN-IMC, which includes a generator, an image discriminator, a morphological component discriminator and a color discriminator. The combination of multiple feature discriminators improves the accuracy of image discrimination. Adversarial training between the generator and the multi-feature discriminators forces SR images to converge with HR images in terms of data and feature distribution. Moreover, in some cases, feature enhancement of salient regions is also worth considering. GAN-IMC is further optimized by a weighted content loss (GAN-IMCW), which effectively restores and enhances salient regions in SR images. The effectiveness and robustness of our method are confirmed by extensive experiments on public datasets. Compared with state-of-the-art methods, the proposed method not only achieves competitive Perceptual Index (PI) and Natural Image Quality Evaluator (NIQE) values but also obtains pleasant visual perception of image edges, textures, colors and salient regions.
Keywords: image super-resolution, multi-feature discriminators, perceptual quality, weighted content loss. *Xuan Zhu, E-mail: [email protected]
1 Introduction
Super-resolution (SR) is the task of estimating a high-resolution (HR) image from one or a sequence of low-resolution (LR) observations at a lower cost. The improvement of image resolution contributes to accurately recognizing and understanding the image. SR is in urgent demand in many fields, such as computer vision, medical image processing and remote sensing image processing. SR methods can be mainly divided into three categories: interpolation-based methods, reconstruction-based methods and learning-based methods. Learning-based methods can be further divided into shallow learning and deep learning SR methods. The shallow learning methods, such as neighbor embedding methods, sparse coding methods and anchored neighborhood regression methods, learn the nonlinear mapping relationship between HR and LR image patches. Deep learning-based methods directly learn an end-to-end mapping function between HR and LR images, represented by the parameters of convolutional neural networks (CNNs). Recently, CNNs for SR have made remarkable improvements. In 2015, Dong et al. first applied a lightweight CNN to super-resolve an LR image, and the SR network was optimized by minimizing a pixel loss function. It demonstrated excellent reconstruction quality compared with the state-of-the-art methods of that time, due to the close correlation between the pixel loss function and the Peak Signal-to-Noise Ratio (PSNR). This research is a milestone for SR. Subsequently, considerable research minimizing pixel loss functions to train CNNs was conducted, and PSNR values were dramatically improved. However, studies pointed out that SR results with good quality as reflected by PSNR values were inconsistent with, or even contrary to, the subjective evaluation of human observers: blurry edges and over-smooth textures appeared in SR results despite high PSNR values.
Both the Perceptual Index (PI) and the Natural Image Quality Evaluator (NIQE) have been brought up to evaluate SR results in terms of perceptual quality. To improve the visual quality of SR images, researchers have introduced different loss functions to optimize SR networks. In 2017, Ledig et al. presented a generative adversarial network (GAN), composed of a generator and an image discriminator, for SR. The generator is used to generate SR results; the discriminator is used to distinguish SR images from HR images. Adversarial learning between the generator and the discriminator encourages the data distribution of SR images to be similar to that of HR images. Park et al. proposed a feature discriminator that distinguishes SR images from HR images by feature maps to produce high-frequency details. Considering that morphological components and color are highly correlated with image visual quality, a natural idea is to introduce a morphological component discriminator and a color discriminator to identify images. In this paper, we design a new GAN-based SR framework, GAN-IMC, composed of a generator and multi-feature discriminators. The multi-feature discriminators consist of an image discriminator, a morphological component discriminator and a color discriminator. The image discriminator, as in the standard GAN, discriminates images by pixel values. The morphological component discriminator discriminates images by their edge and texture information. The color discriminator discriminates images by their color. The multi-feature discriminators in GAN-IMC ensure that the edges and textures of SR results are enhanced and color misalignment is avoided. SR results generated by GAN-IMC are more consistent with HR images in pixel, edge, texture and color, and image visual quality is improved. Moreover, it is well known that visual attention is drawn to salient regions where multiple features are aggregated.
Therefore, we propose a weighted content loss function based on the human visual attention mechanism to enlarge the difference between salient and non-salient regions in the image. GAN-IMCW is obtained by introducing the weighted content loss to optimize GAN-IMC. GAN-IMCW significantly improves the visual perceptual quality of salient regions in SR results. Our main contributions are as follows:
1. We design a new SR framework with multi-feature discriminators to improve image visual perceptual quality. Adversarial learning in terms of morphological component and color contributes to producing pleasant SR results.
2. We propose a weighted content loss function based on the human visual attention mechanism, which highlights the feature-rich regions in SR results.
A large number of experimental results show that the proposed method achieves significant improvement on the image perceptual quality assessment metrics (PI and NIQE). This paper is organized as follows. Sec. 2 briefly reviews the development of SR and image quality evaluation. We describe the proposed SR framework and elaborate on the training procedure in Sec. 3. The experimental results and their evaluation are shown in Sec. 4. Finally, we conclude our work in Sec. 5.
2 Related Work
The disagreement between objective evaluation results and human subjective observation leads to two research directions: PSNR-driven SR and perceptual-quality driven SR. PSNR-driven SR methods. The PSNR-driven SR methods optimize SR networks by minimizing a pixel loss function. Kim et al. exploited recursive learning and residual learning to build deeper networks to improve the PSNR values of SR results. Zhang et al. applied a residual channel attention network that rescales channel-wise features to learn high-frequency information. Tong et al. combined different levels of feature maps using dense skip connections to boost model performance. Haris et al. exploited iterative up- and down-sampling layers to feed back projection errors and concatenated feature maps across the up- and down-sampling stages to super-resolve images at large scaling factors (×8). All these methods aim to improve the PSNR value while often producing visually unpleasing SR results. Perceptual-quality driven SR methods.
The perceptual-quality driven SR methods introduce different loss functions to optimize the SR network for improved visual quality. Johnson et al. proposed a content loss that measures the Euclidean distance between the feature maps of SR images and HR images. Mechrez et al. proposed a contextual loss that measures the cosine distance between the feature maps of images. Ledig et al. applied an adversarial loss to optimize the SR network. Cheon et al. proposed a perceptual image content loss that measures the difference between images after applying the Discrete Cosine Transform (DCT) and a differential operation to SR images and HR images. Sajjadi et al. used a texture loss to ensure a consistent style between images. A combination of multiple loss functions has been widely used to produce visually satisfactory SR results. For SR, the classic objective evaluation methods (e.g., PSNR) evaluate the image by statistically measuring distortion between the SR image and the HR image, mainly concerning the difference between pixel values at the same positions in the two images. To evaluate the perceptual quality of SR images accurately, the perceptual index (PI) was proposed in the PIRM Challenge on perceptual SR, defined as follows:

$PI = \frac{1}{2}\left((10 - Ma(I)) + NIQE(I)\right)$ (1)

where $I$ is the evaluated image, $Ma(\cdot)$ is the no-reference image quality evaluation function of Ref. 27, and $NIQE(\cdot)$ is the quality evaluator score function. Ref. 25 discussed the correlation between image quality assessment and human opinion scores. Both NIQE and PI have a high correlation with human opinion scores; lower PI and NIQE values denote better perceptual quality. In this study, we use the NIQE and PI indicators to evaluate the visual quality of SR images.
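As a concrete illustration, the PI of Eq. (1) reduces to a one-line computation once the two no-reference scores are available. Here `ma_score` and `niqe_score` are assumed to come from external implementations of Ma et al.'s metric and NIQE; the function name and example values are ours:

```python
def perceptual_index(ma_score: float, niqe_score: float) -> float:
    """Eq. (1) from the PIRM challenge: PI = 1/2 * ((10 - Ma) + NIQE).
    Lower PI denotes better perceptual quality."""
    return 0.5 * ((10.0 - ma_score) + niqe_score)

# Hypothetical image with Ma = 8.3 and NIQE = 5.1
print(perceptual_index(8.3, 5.1))  # ≈ 3.4
```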
3 GAN-IMC
The HR image $I^{HR}$ is degraded into the LR image $I^{LR}$. The degradation process is defined as follows:

$I^{LR} = (I^{HR} \otimes k)\downarrow_s + n$ (2)

where $k$ denotes the blur kernel, $\downarrow_s$ denotes down-sampling by the scaling factor $s$, and $n$ denotes additive noise. The SR network is used to predict the high-frequency information lost during degradation. The CNN-based SR implementation is described as follows:

$I^{SR} = G(I^{LR})$ (3)

where $G$ denotes the SR network that takes $I^{LR}$ as input and outputs the SR image $I^{SR}$. Our GAN-IMC architecture is composed of a generator network $G$ and a multi-feature discriminator network $D$, as shown in Fig. 1. The multi-feature discriminator network includes an image discriminator $D_{img}$, a morphological component discriminator $D_{mc}$ and a color discriminator $D_c$. GAN-IMC aims to make SR results deceive the discriminators $D_{img}$, $D_{mc}$ and $D_c$, so that SR results become similar to HR images in data distribution. GAN-IMC is obtained by alternately optimizing the discriminators $D_{img}$, $D_{mc}$, $D_c$ and the generator $G$. Multi-feature discriminator network training is given in Sec. 3.1. The parameters of the generator network are updated by optimizing the perceptual loss function, which is described in Sec. 3.2. Fig. 1
The network architecture of GAN-IMC.
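The degradation model of Eq. (2) can be sketched as a toy NumPy routine: blur with a kernel, down-sample by keeping every $s$-th pixel, and optionally add Gaussian noise. This is a minimal 2-D (single-channel) sketch for illustration only; the box kernel and function names are our own stand-ins, not the paper's actual degradation pipeline:

```python
import numpy as np

def degrade(hr: np.ndarray, kernel: np.ndarray, s: int,
            noise_std: float = 0.0, rng=None) -> np.ndarray:
    """Toy version of Eq. (2): I_LR = (I_HR ⊗ k) ↓_s + n for a 2-D image."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(hr, ((ph, ph), (pw, pw)), mode="edge")
    blurred = np.zeros_like(hr, dtype=float)
    for i in range(hr.shape[0]):          # naive convolution (⊗) for clarity
        for j in range(hr.shape[1]):
            blurred[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    lr = blurred[::s, ::s]                # ↓_s: keep every s-th pixel
    if noise_std > 0:                     # n: additive Gaussian noise
        rng = rng or np.random.default_rng(0)
        lr = lr + rng.normal(0.0, noise_std, lr.shape)
    return lr

hr = np.ones((8, 8))
box = np.full((3, 3), 1.0 / 9.0)          # simple box blur as a stand-in kernel
lr = degrade(hr, box, s=4)
print(lr.shape)  # (2, 2)
```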
Multi-feature discriminators $D_{img}$, $D_{mc}$ and $D_c$ are trained to distinguish SR images from HR images in terms of pixel, edge and texture, and color, respectively. The combination of multiple discriminators avoids the limitation of using $D_{img}$ alone and improves the accuracy of discrimination. Image discriminator $D_{img}$. The image discriminator takes images as input and outputs the probability that the input image is an HR image. The architecture of $D_{img}$ is shown in Table 1(a). Morphological component discriminator $D_{mc}$. Human vision is sensitive to the morphological component of images, which mainly contains edge and texture information. $D_{mc}$ is built to discriminate images by their edge and texture information. A gray image, without color and brightness, better highlights edge and texture, so we take the gray image as the input of $D_{mc}$. The gray image is obtained as:

$I_g = 0.299 \cdot I_r + 0.587 \cdot I_g + 0.114 \cdot I_b$ (4)

where $I_g$ is the gray image, and $I_r$, $I_g$ and $I_b$ denote the red, green and blue components of the input image, respectively. To train $D_{mc}$, we minimize the loss function $L_{D_{mc}}$ as follows:

$L_{D_{mc}} = -\log\left(D_{mc}(I^{HR}_g)\right) - \log\left(1 - D_{mc}(I^{SR}_g)\right)$ (5)

where $I^{SR}_g$ and $I^{HR}_g$ are the gray SR image and the gray HR image. $D_{mc}(\cdot)$ denotes the probability that the morphological component of the input image belongs to an HR image. The architecture of $D_{mc}$ is shown in Table 1(b). Table 1
The architecture of multi-feature discriminators.
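The grayscale conversion of Eq. (4) and the discriminator loss of Eq. (5) can be sketched directly. The discriminator itself is treated as a black box here; `d_mc_loss` simply takes its two scalar probability outputs, and the small `eps` for numerical stability is our addition:

```python
import numpy as np

def to_gray(img: np.ndarray) -> np.ndarray:
    """Eq. (4): standard luma weighting of the RGB channels
    (img is H x W x 3, floats in [0, 1])."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 0.299 * r + 0.587 * g + 0.114 * b

def d_mc_loss(p_hr: float, p_sr: float, eps: float = 1e-12) -> float:
    """Eq. (5): binary cross-entropy form of the morphological component
    discriminator loss, where p_hr = D_mc(I_HR_g) and p_sr = D_mc(I_SR_g)."""
    return -np.log(p_hr + eps) - np.log(1.0 - p_sr + eps)

white = np.ones((2, 2, 3))
print(to_gray(white)[0, 0])  # ≈ 1.0, since 0.299 + 0.587 + 0.114 = 1
```

A perfect discriminator ($p_{hr} = 1$, $p_{sr} = 0$) drives the loss to zero, which is what alternating optimization pushes toward.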
Both $D_{img}$ and $D_{mc}$ have deep network architectures, which apply deep semantic feature maps to distinguish $I^{SR}$ and $I^{HR}$. Color discriminator $D_c$. The visual system is also sensitive to the main color, brightness and contrast of objects in natural images. Therefore, we apply a Gaussian blur kernel to blur the natural image, preserving its main color, brightness and contrast. The color discriminator $D_c$ takes the blurred image as input and discriminates images by color, brightness and contrast. The blurred image is obtained via convolution with a blur filter $B$:

$I_B = I * B$ (6)

where $I$ is the input image, $I_B$ is the blurred image, and $*$ denotes the convolution operation. The size of $B$ is $21 \times 21$, the stride is 1, and the weights follow a Gaussian distribution, calculated as follows:

$B(x, y) = \frac{1}{2\pi\sigma_x\sigma_y}\exp\left(-\frac{(x-\mu_x)^2}{2\sigma_x^2} - \frac{(y-\mu_y)^2}{2\sigma_y^2}\right)$ (7)

where $x$ and $y$ are the horizontal and vertical indexes of the blur filter, and $\mu_{x,y}$ and $\sigma_{x,y}$ are the means and standard deviations in $x$ and $y$, respectively. We set $\mu_{x,y} = 0$ and $\sigma_{x,y} = \sqrt{3}$. To train the color discriminator, we minimize the loss function $L_{D_c}$ as follows:

$L_{D_c} = -\log\left(D_c(I^{HR}_B)\right) - \log\left(1 - D_c(I^{SR}_B)\right)$ (8)

where $I^{SR}_B$ and $I^{HR}_B$ are the blurred SR image and the blurred HR image. $D_c(\cdot)$ denotes the probability that the color of the input image belongs to an HR image. The architecture of $D_c$ is shown in Table 1(c). The architecture of our generator network $G$ is derived from Ref. 18, but the loss function is different. Our generator network $G$ is trained with a perceptual loss function composed of the pixel loss $L_{pixel}$, the adversarial loss $L_{adv}$ and the weighted content loss $L_{wc}$, as shown in Fig. 2. Fig. 2
The generator network training.
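The blur filter of Eq. (7) with $\mu = 0$ and $\sigma = \sqrt{3}$ is straightforward to construct. A sketch follows; note that we additionally normalize the kernel to sum to 1 so that blurring preserves mean intensity, a common practical step the paper does not state explicitly:

```python
import numpy as np

def gaussian_kernel(size: int = 21, sigma: float = np.sqrt(3.0)) -> np.ndarray:
    """Eq. (7) with mu = 0: a size x size Gaussian blur filter B(x, y)."""
    ax = np.arange(size) - (size - 1) / 2.0      # centred coordinates x, y
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx ** 2) / (2 * sigma ** 2) - (yy ** 2) / (2 * sigma ** 2))
    k /= 2 * np.pi * sigma ** 2                   # the 1/(2*pi*sigma_x*sigma_y) factor
    return k / k.sum()                            # extra step: normalize to sum to 1

k = gaussian_kernel()
print(k.shape)  # (21, 21), peaked at the centre entry k[10, 10]
```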
The pixel loss function $L_{pixel}$ constrains the SR image $I^{SR}$ to be close enough to the HR image $I^{HR}$ in pixel values. We measure the Mean Square Error (MSE) between $I^{SR}$ and $I^{HR}$, and $L_{pixel}$ is defined as follows:

$L_{pixel} = \frac{1}{hwc}\sum\left(I^{SR} - I^{HR}\right)^2$ (9)

where $h$, $w$ and $c$ are the height, width and number of channels of the SR image, respectively. Our adversarial loss function is composed of the image adversarial loss $L_{adv}^{img}$, the morphological component adversarial loss $L_{adv}^{mc}$ and the color adversarial loss $L_{adv}^{color}$. Minimizing the adversarial loss function makes the generator $G$ learn to create solutions that are highly similar to HR images in terms of image, edge, texture and color. Image adversarial loss.
The image adversarial loss $L_{adv}^{img}$ is defined as follows:

$L_{adv}^{img} = -\log D_{img}(I^{SR})$ (10)

where $D_{img}(I^{SR})$ is the output of the discriminator $D_{img}$ when $I^{SR}$ is taken as input. Morphological component adversarial loss.
The morphological component adversarial loss $L_{adv}^{mc}$ is defined as follows:

$L_{adv}^{mc} = -\log D_{mc}(I^{SR}_g)$ (11)

where $D_{mc}(I^{SR}_g)$ is the output of the discriminator $D_{mc}$ when the morphological component of $I^{SR}$ is taken as input. Color adversarial loss.
The color adversarial loss $L_{adv}^{color}$ is defined as follows:

$L_{adv}^{color} = -\log D_c(I^{SR}_B)$ (12)

where $D_c(I^{SR}_B)$ is the output of the discriminator $D_c$ when the blurred $I^{SR}$ is taken as input. The total adversarial loss function $L_{adv}$ is calculated as follows:

$L_{adv} = L_{adv}^{img} + 10^{-1} L_{adv}^{mc} + 4 \times 10^{-3} L_{adv}^{color}$ (13)

Studies have proved that introducing a content loss function improves the visual quality of SR results. In
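Eqs. (10)-(13) combine into a single generator-side objective. A sketch, treating the three discriminator outputs on the SR image as given scalar probabilities (the `eps` stabilizer is our addition):

```python
import numpy as np

def adversarial_loss(d_img: float, d_mc: float, d_c: float,
                     eps: float = 1e-12) -> float:
    """Generator adversarial loss, Eqs. (10)-(13). Each argument is the
    corresponding discriminator's probability output on the SR image."""
    l_img = -np.log(d_img + eps)                  # Eq. (10)
    l_mc = -np.log(d_mc + eps)                    # Eq. (11)
    l_color = -np.log(d_c + eps)                  # Eq. (12)
    return l_img + 1e-1 * l_mc + 4e-3 * l_color   # Eq. (13) weighting
```

When the generator fully fools all three discriminators (every probability is 1), the loss vanishes; less convincing SR images incur a larger penalty, dominated by the image term under the Eq. (13) weights.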
Ref. 22, the low-level and high-level feature maps of the image are extracted from the $\varphi_{2,2}$ and $\varphi_{5,4}$ layers of a pre-trained VGG-19 network*, respectively. We propose a modified weighted content loss function which combines weighted low-level features with high-level semantic features. Minimizing the Euclidean distance between the two weighted low-level feature maps of $I^{SR}$ and $I^{HR}$ enhances the features of salient regions while weakening those of non-salient regions. The high-level feature maps are also applied to constrain the semantics of the whole image. The weighted content loss function $L_{wc}$ is formulated as follows:

$L_{wc} = L_{low\text{-}level} + 10^{-5} L_{high\text{-}level}$ (14)

$L_{low\text{-}level} = \frac{1}{WH}\sum\left(\alpha^{SR}_{i,j}\varphi_{2,2}(I^{SR}) - \alpha^{HR}_{i,j}\varphi_{2,2}(I^{HR})\right)^2$ (15)

$L_{high\text{-}level} = \frac{1}{WHC}\sum\left(\varphi_{5,4}(I^{SR}) - \varphi_{5,4}(I^{HR})\right)^2$ (16)

where $\varphi(\cdot)$ denotes the feature maps, $W$, $H$ and $C$ are the width, height and number of channels of the feature maps, respectively, $\alpha_{i,j}$ denotes the spatial weight applied to each channel of the feature maps $\varphi_{2,2}(\cdot)$, and $i$ and $j$ denote the horizontal and vertical indexes of the spatial weight, respectively.
The calculation of $\alpha_{i,j}$ is summarized in Algorithm 1. Fig. 3 shows the comparison experiment between the content loss function and the weighted content loss function. As shown in Fig. 3(c), the difference between salient and non-salient regions in the image is enlarged after weighting the feature maps. The feature map shown is obtained by fusing the feature maps (56×56×128) extracted from $\varphi_{2,2}$ at a ratio of 1:1. Compared with the content loss, the weighted content loss helps enhance the details of salient regions and improve SR image visual quality, as shown in Fig. 3(e). Fig. 3
Effect of weighted content loss and SR result. *$\varphi_{2,2}$ and $\varphi_{5,4}$ denote the 2nd convolution (after activation) before the 2nd max-pooling layer and the 4th convolution (after activation) before the 5th max-pooling layer of the VGG-19 network, respectively.

Algorithm 1: The calculation of the spatial weight $\alpha_{i,j}$.
Input: Low-level feature maps $\varphi_{k,i,j}$. Let $\varphi_{k,i,j} \in \mathbb{R}^{C \times W \times H}$ denote the 3-dimensional feature maps extracted from the selected layer $\varphi_{2,2}$, where $i$, $j$ and $k$ denote the horizontal, vertical and channel indexes of the feature maps, and $W$, $H$ and $C$ are their width, height and number of channels, respectively.
1. Obtain the accumulated feature response $S_{i,j} \in \mathbb{R}^{W \times H}$ by summing the feature maps $\varphi_{k,i,j}$ over channels at each location $(i = 1, \cdots, W;\ j = 1, \cdots, H)$:
$S_{i,j} = \sum_{k=1}^{C} \varphi_{k,i,j}$
2. Normalize the accumulated feature map by the $L2$ norm to calculate the spatial weight $\alpha_{i,j}$:
$\alpha_{i,j} = \frac{S_{i,j}}{\sqrt{\sum_{i=1}^{W}\sum_{j=1}^{H} S_{i,j}^2}}$
3. Obtain the weighted feature maps $\hat{\varphi}_{k,i,j}$:
$\hat{\varphi}_{k,i,j} = \varphi_{k,i,j} \cdot \alpha_{i,j}$
Output: Spatial weight $\alpha_{i,j}$.

Our perceptual loss function is a weighted sum of the loss functions mentioned above, defined as follows:
$L = L_{pixel} + 10^{-3} L_{adv} + 2 \times 10^{-4} L_{wc}$ (17)
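Algorithm 1 above amounts to a channel sum followed by a global L2 normalization. A minimal NumPy sketch (function names are ours):

```python
import numpy as np

def spatial_weight(features: np.ndarray) -> np.ndarray:
    """Algorithm 1: per-location weight alpha_{i,j} from C x W x H low-level
    feature maps. Locations where many channels respond strongly get larger
    weights, so salient regions are emphasized."""
    s = features.sum(axis=0)                  # step 1: accumulate over channels
    alpha = s / np.sqrt((s ** 2).sum())       # step 2: L2-normalize over all locations
    return alpha

def weighted_maps(features: np.ndarray) -> np.ndarray:
    """Step 3: broadcast alpha_{i,j} over every channel."""
    return features * spatial_weight(features)[None, :, :]

# Toy feature maps with a single "salient" location at (0, 0)
f = np.zeros((4, 2, 2))
f[:, 0, 0] = 1.0
a = spatial_weight(f)
print(a)  # only the (0, 0) entry is non-zero
```

Plugging `weighted_maps` applied to $\varphi_{2,2}$ features of $I^{SR}$ and $I^{HR}$ into Eq. (15) yields the low-level term of the weighted content loss.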
4 Experiments
In this section, we conduct numerous experiments on benchmark datasets to verify the performance of the proposed methods and compare them with a series of state-of-the-art methods based on different loss functions. All experiments are implemented on an NVIDIA GeForce GTX 1080 Ti (12 GB memory). During training, we train our SR model with a scale factor of ×4 using 800 images from the DIV2K dataset. The DIV2K dataset includes many high-resolution RGB images with a large diversity of content (such as flora, fauna, handmade objects, people, scenery, etc.). It captures sufficient variability of natural images and provides abundant information for SR network learning. The corresponding LR images are obtained using the imresize function in MATLAB R2016b. During testing, we perform experiments on three benchmark datasets: Set5, Set14 and BSD100. The Set5 dataset contains 5 images: "baby", "bird", "butterfly", "head" and "woman". The Set14 dataset has 14 images. Some include complex edges and textures (e.g., "baboon", "comic", "face"), some include more edges than textures (e.g., "monarch", "barbara"), and others include rich textures (e.g., "coastguard", "zebra", "flowers"). The BSD100 dataset, which has 100 images, was built by the UC Berkeley Computer Vision Group. It contains different categories (such as animals, buildings, food, landscapes, people, plants, etc.). The information contained in these three datasets is wide-ranging, making them suitable for measuring the robustness of different SR methods. GAN-IMC introduces the morphological component adversarial loss and the color adversarial loss; it is trained with a combination of pixel loss, image adversarial loss, morphological component adversarial loss, color adversarial loss and content loss. The optimized method GAN-IMCW is trained using the weighted content loss instead of the content loss in GAN-IMC.
We compare the proposed methods with the bicubic method and several state-of-the-art SR methods, including VDSR, EDSR, SRGAN-MSE, SRGAN-VGG22, SRGAN, EnhanceNet and SRFeat.
Table 2 Comparison of the perceptual loss functions of the compared methods.
Table 2 shows the loss functions of the compared methods. Both VDSR and EDSR, trained with pixel loss, belong to the PSNR-driven SR methods. SRGAN-MSE, SRGAN, EnhanceNet and SRFeat, trained with combinations of different losses, belong to the perceptual-quality driven SR methods. SRGAN-MSE, trained with a combination of pixel loss and image adversarial loss, is a baseline method. We apply the PI and NIQE indicators to compare the performance of the SR methods. PSNR is also used to evaluate image distortion in Sec. 4.3. More details about the calculation of PI and NIQE are given in Sec. 2.2. All evaluation values of the competing methods are taken from their publications, and the SR results of the competing methods are from the authors' websites.
The experimental parameters of our method are set as follows. We randomly crop sixteen 96×96 sub-images from the 800 training images for each training batch. The Adam optimizer with $\beta_1 = 0.9$ is applied. The SR network is pre-trained using the pixel loss function with a learning rate of $10^{-4}$ for $10^4$ update iterations to initialize the generator. We then alternately update the generator and the discriminators with a learning rate of $10^{-4}$. After $10^5$ update iterations, the learning rate is reduced by a factor of ten. We conduct numerous experiments on the benchmark datasets Set5, Set14 and BSD100 and show some SR results. For clear comparison, we magnify a local region by two times and place it in the upper left corner of each figure.
We experimented with all 5 LR test images in the Set5 dataset. In Figs. 4-5, we show ×4 SR results on the images "baby" and "head". Table 3 shows the PI and NIQE values of nine methods on all images in Set5.
Fig. 4 The SR results of "baby" (upscaling factor of 4): (a) LR, (b) Bicubic, (c) SRGAN-MSE, (d) SRGAN, (e) SRFeat, (f) HR, (g) EDSR, (h) EnhanceNet, (i) GAN-IMC, (j) GAN-IMCW.
Fig. 5 The SR results of "head" (upscaling factor of 4): (a) LR, (b) Bicubic, (c) SRGAN-MSE, (d) SRGAN, (e) SRFeat, (f) HR, (g) EDSR, (h) EnhanceNet, (i) GAN-IMC, (j) GAN-IMCW.
Table 3
PI and NIQE values on Set5.
We experimented with all 14 LR test images in the Set14 dataset. Their ×4 SR results on the images "coastguard", "flowers" and "monarch" are shown in Figs. 6-8. Table 4 shows the PI and NIQE values of nine methods on all images in Set14.
Fig. 6 The SR results of "coastguard" (upscaling factor of 4): (a) LR, (b) Bicubic, (c) SRGAN-MSE, (d) SRGAN, (e) SRFeat, (f) HR, (g) EDSR, (h) EnhanceNet, (i) GAN-IMC, (j) GAN-IMCW.
Fig. 7 The SR results of "flowers" (upscaling factor of 4): (a) LR, (b) Bicubic, (c) SRGAN-MSE, (d) SRGAN, (e) SRFeat, (f) HR, (g) EDSR, (h) EnhanceNet, (i) GAN-IMC, (j) GAN-IMCW.
Fig. 8 The SR results of "monarch" (upscaling factor of 4): (a) LR, (b) Bicubic, (c) SRGAN-MSE, (d) SRGAN, (e) SRFeat, (f) HR, (g) EDSR, (h) EnhanceNet, (i) GAN-IMC, (j) GAN-IMCW.
We experimented with all 100 LR test images in the BSD100 dataset. Their ×4
SR results on the images "14037" and "106024" are shown in Figs. 9-10. Table 5 shows the PI and NIQE values of nine methods on some images in BSD100.
Table 4 PI and NIQE values on Set14.
Table 5 PI and NIQE values on BSD100.
Fig. 9 The SR results of "14037" (upscaling factor of 4): (a) LR, (b) Bicubic, (c) SRGAN-MSE, (d) SRGAN, (e) SRFeat, (f) HR, (g) EDSR, (h) EnhanceNet, (i) GAN-IMC, (j) GAN-IMCW.
Fig. 10 The SR results of "106024" (upscaling factor of 4): (a) LR, (b) Bicubic, (c) SRGAN-MSE, (d) SRGAN, (e) SRFeat, (f) HR, (g) EDSR, (h) EnhanceNet, (i) GAN-IMC, (j) GAN-IMCW.
Table 6
The average PSNR values (dB) on three datasets.
Visual Quality.
By comparing and analyzing the experimental results in Figs. 4-10, we draw the following conclusions about our methods: (i) The large-scale structural edges are sharp and natural. GAN-IMC and GAN-IMCW generate more abundant and realistic texture details while avoiding fake textures and the loss of texture detail. The edges of the eye, the flowers and the butterfly are significantly sharper in the highlighted windows of Figs. 5, 7 and 8. The highlighted windows of Figs. 4, 5, 6, 9 and 10 show that textured areas are well recovered without introducing visible grid artifacts, and texture details are neither weakened nor lost. (ii) Local color transitions are natural and smooth. As shown in the highlighted windows of Figs. 5, 7, 8 and 9, the color transitions are comfortable. In Fig. 5, the eyelid region generated by GAN-IMCW is highly similar to the corresponding position in the HR image. (iii) The recovery of feature-rich regions is remarkable. As shown in the highlighted windows of Figs. 7-8, GAN-IMCW enlarges the difference between salient regions (e.g., the flowers and the butterfly) and the background, enhancing the visual quality of the salient regions and accurately recovering their details. (iv) GAN-IMCW shows good robustness on the three test datasets, which contain a large number of different types of images. It makes full use of the edge, texture and color features of images and highlights feature-rich regions to produce visually pleasant SR results.
In summary, compared with the competing methods, GAN-IMC recovers sharp large-scale structural edges and realistic texture details, and its SR results have smooth color transitions. Furthermore, GAN-IMCW significantly improves the visual quality of feature-rich regions in SR results. GAN-IMCW achieves better performance in terms of both effectiveness and robustness, in good agreement with the quantitative evaluation results in Tables 3, 4 and 5.
PI and NIQE.
The PI, NIQE, average PI and average NIQE values of the competing methods on all images in Set5, Set14 and BSD100 are shown in Tables 3, 4 and 5. GAN-IMC and GAN-IMCW achieve much better PI and NIQE values than SRGAN-MSE (the baseline) on all three datasets. For PI, GAN-IMCW is better than SRGAN and EnhanceNet on Set14 and BSD100, and better than SRFeat on Set5, Set14 and BSD100. On BSD100, our average PI is 0.067, 0.578 and 0.19 lower than that of SRGAN, EnhanceNet and SRFeat, respectively. For NIQE, GAN-IMCW is better than EnhanceNet on Set14 and BSD100, and better than both SRFeat and SRGAN on Set5, Set14 and BSD100. These results demonstrate that GAN-IMCW is robust across different kinds of test images and significantly improves the perceptual quality of SR results.
PSNR.
The average PSNR values of the different methods at ×4 SR on the three benchmark datasets are shown in Table 6. Compared with the competing methods, the PSNR values of the proposed method decrease rather than increase. As discussed in Sec. 2.1, this reflects the trade-off between pixel-wise fidelity and perceptual quality: the proposed method favors perceptual quality, which is consistent with its improved PI and NIQE results.
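For completeness, the PSNR values in Table 6 follow the standard definition; a minimal sketch (assuming 8-bit images represented as NumPy arrays) is:

```python
import numpy as np

def psnr(reference: np.ndarray, estimate: np.ndarray, max_val: float = 255.0) -> float:
    """Peak Signal-to-Noise Ratio in dB between a reference HR image
    and an SR estimate: PSNR = 10 * log10(max_val^2 / MSE)."""
    mse = np.mean((reference.astype(np.float64) - estimate.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

# Hypothetical example: a constant offset of 10 gray levels gives MSE = 100,
# i.e. PSNR = 10 * log10(255^2 / 100) ~ 28.13 dB.
hr = np.full((8, 8), 100.0)
sr = hr + 10.0
value = psnr(hr, sr)
```

Since PSNR is a monotone function of the pixel-wise MSE, it rewards exactly the over-smoothed solutions that perceptual metrics penalize, which explains the opposite trends between Table 6 and Tables 3-5.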
5 Conclusion
In this paper, we designed a novel SR network framework, GAN-IMC, which includes a generator and multi-feature discriminators. During adversarial learning between the generator network and the multi-feature discriminator networks, the data distribution of the image and of its component features, including color, edge and texture, is learned. Moreover, the optimized method GAN-IMCW improves the visual quality of feature-rich regions by using a weighted content loss. Extensive experimental results indicate the superiority of GAN-IMCW over the other competing methods: it not only achieves competitive PI and NIQE values but also produces more pleasing visual quality in terms of edge, texture, color and feature-rich regions.
Acknowledgments
This work was supported by the key project of the Natural Science Foundation of Shaanxi Province (Grant No. 2018JZ6007).
References
1. L. Yue, H. Shen, and J. Li, "Image super-resolution: The techniques, applications, and future," Signal Processing, 389-408 (2016) [doi: 10.1016/j.sigpro.2016.05.002].
2. Z. Wang, J. Chen, and S. C. H. Hoi, "Deep Learning for Image Super-resolution: A Survey," preprint (2019).
3. W. Witwit, Y. Zhao, K. Jenkins, and Y. Zhao, "Satellite image resolution enhancement using discrete wavelet transform and new edge-directed interpolation," J. Electron. Imaging, (2) (2017) [doi: 10.1117/1.JEI.26.2.023014].
4. H. Chang, D. Y. Yeung, and Y. Xiong, "Super-resolution through neighbor embedding," Proc. CVPR 2004 (2004) [doi: 10.1109/CVPR.2004.1315043].
5. R. Timofte, V. De Smet, and L. Van Gool, "Anchored Neighborhood Regression for Fast Example-Based Super-Resolution," Proc. ICCV 2013 (2013) [doi: 10.1109/ICCV.2013.241].
6. J. Yang, J. Wright, and T. S. Huang, et al., "Image super-resolution via sparse representation," IEEE Trans. Image Process., (11) (2010) [doi: 10.1109/TIP.2010.2050625].
7. D. Zhang and J. He, "Hybrid sparse-representation-based approach to image super-resolution reconstruction," J. Electron. Imaging, (2) (2017) [doi: 10.1117/1.JEI.26.2.023008].
8. I. J. Goodfellow, J. Pouget-Abadie, and M. Mirza, et al., "Generative Adversarial Nets," in Advances in Neural Information Processing Systems, 2672-2680 (2014).
9. C. Dong, C. C. Loy, and K. He, et al., "Image Super-Resolution Using Deep Convolutional Networks," IEEE Trans. Pattern Anal. Mach. Intell., 38(2), 295-301 (2016) [doi: 10.1109/TPAMI.2015.2439281].
10. J. Kim, J. K. Lee, and K. M. Lee, "Deeply-Recursive Convolutional Network for Image Super-Resolution," Proc. CVPR 2016 (2016) [doi: 10.1109/CVPR.2016.181].
11. K. He, X. Zhang, and S. Ren, et al., "Deep Residual Learning for Image Recognition," Proc. CVPR 2016 (2016) [doi: 10.1109/CVPR.2016.90].
12. Y. Zhang, K. Li, and K. Li, et al., "Image Super-Resolution Using Very Deep Residual Channel Attention Networks," Proc. ECCV 2018, 286-301 (2018).
13. J. Kim, J. K. Lee, and K. M. Lee, "Accurate Image Super-Resolution Using Very Deep Convolutional Networks," Proc. CVPR 2016 (2016) [doi: 10.1109/CVPR.2016.182].
14. Y. Tai, J. Yang, and X. Liu, "Image Super-Resolution via Deep Recursive Residual Network," Proc. CVPR 2017 (2017) [doi: 10.1109/CVPR.2017.298].
15. T. Tong, G. Li, and X. Liu, et al., "Image Super-Resolution Using Dense Skip Connections," Proc. ICCV 2017, 4799-4807 (2017) [doi: 10.1109/ICCV.2017.514].
16. B. Lim, S. Son, and H. Kim, et al., "Enhanced Deep Residual Networks for Single Image Super-Resolution," Proc. CVPRW 2017, 136-144 (2017) [doi: 10.1109/CVPRW.2017.151].
17. C. Ledig, L. Theis, and F. Huszar, et al., "Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network," Proc. CVPR 2017, 4681-4690 (2017) [doi: 10.1109/CVPR.2017.19].
18. A. Dosovitskiy and T. Brox, "Generating Images with Perceptual Similarity Metrics based on Deep Networks," in Advances in Neural Information Processing Systems, 658-666 (2016).
19. M. S. M. Sajjadi, B. Schölkopf, and M. Hirsch, "EnhanceNet: Single Image Super-Resolution Through Automated Texture Synthesis," Proc. ICCV 2017, 4491-4500 (2017) [doi: 10.1109/ICCV.2017.481].
20. J. Johnson, A. Alahi, and L. Fei-Fei, "Perceptual losses for real-time style transfer and super-resolution," Proc. ECCV 2016, 694-711 (2016) [doi: 10.1007/978-3-319-46475-6_43].
21. L. Gatys, A. S. Ecker, and M. Bethge, "Texture synthesis using convolutional neural networks," in Advances in Neural Information Processing Systems, 262-270 (2015).
22. S. J. Park, H. Son, and S. Cho, et al., "SRFeat: Single Image Super-Resolution with Feature Discrimination," Proc. ECCV 2018, 439-455 (2018).
23. R. Mechrez, I. Talmi, and F. Shama, et al., "Maintaining Natural Image Statistics with the Contextual Loss," Proc. ACCV 2018, 427-443 (2018) [doi: 10.1007/978-3-030-20893-6_27].
24. M. Haris, G. Shakhnarovich, and N. Ukita, "Deep Back-Projection Networks for Super-Resolution," Proc. CVPR 2018, 1664-1673 (2018) [doi: 10.1109/CVPR.2018.00179].
25. Y. Blau, R. Mechrez, and R. Timofte, et al., "2018 PIRM Challenge on Perceptual Image Super-resolution," Proc. ECCV 2018 (2018) [doi: 10.1007/978-3-030-11021-5_21].
26. M. Cheon, J. H. Kim, and J. H. Choi, et al., "Generative adversarial network-based image super-resolution using perceptual content losses," Proc. ECCV 2018 (2018).
27. C. Ma, C. Y. Yang, and X. Yang, et al., "Learning a no-reference quality metric for single-image super-resolution," Comput. Vis. Image Underst., 1-16 (2017).
28. A. Mittal, R. Soundararajan, and A. C. Bovik, "Making a 'Completely Blind' Image Quality Analyzer," IEEE Signal Process. Lett., (3), 209-212 (2013) [doi: 10.1109/LSP.2012.2227726].
29. M. Bevilacqua, A. Roumy, C. Guillemot, and M. L. Alberi, "Low-Complexity Single-Image Super-Resolution based on Nonnegative Neighbor Embedding," Proc. BMVC 2012 (2012).
30. R. Zeyde, M. Elad, and M. Protter, "On Single Image Scale-Up using Sparse-Representations," Proc. ICCS, 711-730 (2010) [doi: 10.1007/978-3-642-27413-8_47].
31. D. Martin, C. Fowlkes, D. Tal, and J. Malik, "A Database of Human Segmented Natural Images and its Application to Evaluating Segmentation Algorithms and Measuring Ecological Statistics," in