An Attention-Driven Approach of No-Reference Image Quality Assessment
Diqi Chen, Yizhou Wang, Tianfu Wu, Wen Gao
Nat'l Engineering Laboratory for Video Technology, Key Laboratory of Machine Perception (MoE), Sch'l of EECS, Peking University, Beijing, 100871, China
Department of ECE and the Visual Narrative Cluster, North Carolina State University
Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing, 100190, China
{cdq, Yizhou.Wang, wgao}@pku.edu.cn, [email protected]

Abstract
In this paper, we present a novel method of no-reference image quality assessment (NR-IQA), which is to predict the perceptual quality score of a given image without using any reference image. The proposed method harnesses three functions: (i) the visual attention mechanism, which affects many aspects of visual perception including image quality assessment, yet is overlooked in the NR-IQA literature; the method assumes that the fixation areas on an image contain key information for the process of IQA; (ii) the robust averaging strategy, a means, supported by psychology studies, of integrating multiple/step-wise evidence to make a final perceptual judgment; (iii) multi-task learning, which is believed to be an effectual means to shape representation learning and could result in a more generalized model. To exploit the synergy of the three, we consider NR-IQA as a dynamic perception process, in which the model samples a sequence of "informative" areas and aggregates the information to learn a representation for the tasks of jointly predicting the image quality score and the distortion type. The model learning is implemented by a reinforcement strategy, in which the rewards of both tasks guide the learning of the optimal sampling policy to acquire the "task-informative" image regions so that the predictions can be made accurately and efficiently (in terms of the sampling steps). The reinforcement learning is realized by a deep network with the policy gradient method and trained through back-propagation. In experiments, the model is tested on the TID2008 dataset and it outperforms several state-of-the-art methods. Furthermore, the model is very efficient in the sense that only a small number of fixations are used in NR-IQA.
1. Introduction
In the era of big data, an enormous amount of visual data is making its way to end consumers through mobile devices, social media, HDTV, etc. Since the applications are so broad and diverse, it becomes increasingly important to improve the quality of experience for consumers. Automatic IQA becomes an indispensable module of a service system, so that the system can tell the perceptual quality of its content (here, images) and then optimize the delivered services accordingly.
There are two main schemes in IQA: full-reference (FR) IQA [20, 23, 22] and no-reference (NR) IQA [17, 6, 11, 8, 9, 18, 5, 24]. The former requires a "clean", pristine reference image with respect to which the quality of the distorted image is assessed; the latter takes only the distorted image to be assessed as input and is thus more widely applicable. This paper focuses on NR-IQA.
The challenges in NR-IQA involve several factors: the "what" issue, i.e., unknown types of distortion (some are local, e.g., only certain image regions are distorted; some are global, e.g., pervasive additive noise contaminating all the pixels in an image); the "where" issue, i.e., the unknown spatial distribution of distortions (e.g., where the degraded regions are located in an image); and the "how" issue, i.e., the unknown mechanism for aggregating the information collected from the distorted regions as well as from other regions of an image for quality assessment.
In this paper, we address the above issues by presenting a novel method of NR-IQA. The proposed method is inspired by the following three streams of studies.

Figure 1. The overall architecture of the proposed model for NR-IQA.
Left: An image with local block-wise distortion (the "block masks") on the two foreground objects. Three fixation areas are illustrated by the orange squares. Centered at each fixation position, three multi-scale patches are extracted and then normalized to the same size to simulate the foveal vision.
Right: Illustration of the proposed network composed of three components: the multi-scale image analysis module (the green boxes), a weight-shared CNN used to extract the features that update the recurrent layer; the location sampling module (the orange boxes), a stochastic node that learns to select where to attend next; and the information aggregation module (the black box), a recurrent neural network (RNN) that aggregates information along the attentional saccadic path for multi-task learning.

(i) The visual attention mechanism, which affects many aspects of visual perception including image quality assessment. By observing how human subjects assess the quality of images, we assume that eye fixation areas on an image contain "key information" for IQA. However, in the IQA literature there are few studies on how to organically integrate the attention mechanism into IQA, e.g., learning a strategy for where to attend on an image in relation to the IQA task.
(ii) The robust averaging strategy [2], which is a computational mechanism of perceptual judgment, supported by psychology studies, that integrates multiple/step-wise evidence to make a final judgment. By adopting this strategy, the final image quality score is the weighted average of the scores predicted from a number of attended areas in an image.
(iii) Multi-task learning, which is believed to be an effectual means to shape representation learning and could result in a more generalized model. Here, besides predicting image scores, we empower the model to classify the distortion type of an image to be assessed.
To exploit the synergy of the three, we consider NR-IQA as a dynamic perception process, in which the model samples a sequence of "informative" areas and aggregates the information to learn a representation for the tasks of jointly predicting the image quality score and the distortion type. Figure 1 illustrates the proposed model. It is composed of three components:
• The multi-scale image analysis module: it is implemented by a weight-sharing CNN (the green boxes in Figure 1), which extracts multi-scale image features around a fixation point. We extract three image patches of different scales centered at a fixation point. The CNN learns the feature representation for quality assessment. This component aims at solving the "what" issue mentioned above in an end-to-end learning fashion.
• The location sampling module: it is implemented by a stochastic node (the orange boxes in Figure 1), which learns to select "where" to attend next based on the integrated information about what the model has seen so far. It predicts the IQA-task-related regions such that the next selected fixation will be sufficiently informative.
• The information aggregation module: it is implemented by an RNN (the black box in Figure 1), which aggregates information along a saccadic path to compute the final predictions, i.e., the image quality score and the distortion type. It captures both local information and global information in the sequential unfolding. It learns to resolve the "how" issue in NR-IQA stated above. The representation is shaped through multi-task learning. Inspired by the robust averaging strategy for perceptual judgment [2] (which takes both the "strength" and "reliability" of evidence into consideration when making a final perceptual judgment), our model predicts the final score as the weighted average of the scores predicted at the attended areas. The weights of the scores are learned to signify the "reliability" of the score prediction at the attended regions.
Inspired by [21], the model learning is implemented by a reinforcement strategy.
The rewards of both tasks (score prediction and distortion type classification) guide the learning of the optimal location sampling policy to acquire the "task-informative" image regions so that the predictions can be made accurately and efficiently (in terms of the sampling steps). The reinforcement learning is realized by a deep network with the policy gradient method and trained through back-propagation. In experiments, the model is tested on the TID2008 dataset [14] and it outperforms several state-of-the-art methods. Furthermore, the model is very efficient in the sense that a small number of fixations are used in NR-IQA.
2. Related Work
We briefly review the application of deep learning models for NR-IQA, the IQA methods using objectness/saliency, and related attentional models.
Neural Networks for NR-IQA:
Deep learning provides an approach to learning a mapping from raw input or low-level features to scores of image perceptual quality. These methods avoid delicately designed hand-crafted features. Kang et al. [8] propose a patch-based NR-IQA method: they first uniformly sample image patches at a predefined scale, then train a CNN to predict a quality score for each image patch and average the scores of the patches as the holistic image score. They further propose a multi-task CNN [9] to classify the distortion type of each patch in addition to the quality score prediction. Our method is quite different from theirs; even in the distortion type classification part, we do not classify the distortion type of each attended patch; instead, we classify the whole image based on the "aggregated information" over a sequence of attended regions.
In addition to the patch-based methods, some methods combine hand-crafted low-level features with deep networks as an alternative approach. For example, Tang et al. [18] first extract the LBIQ features [17] and feed the features into a Restricted Boltzmann Machine to predict image quality scores. Hou et al. [6] pose the IQA problem as a classification problem: they slot images into different categories according to image quality and propose a quality pooling method under the Bayesian framework to predict quality scores.
Objectness and Saliency in IQA:
Although there is little literature that organically fuses the attentional mechanism into NR-IQA, semantic objectness or saliency has been applied. Objectness and saliency are static properties of image regions, whereas attention is an active perception process of an observer. Liu et al. [12] determine the final score of an image by averaging the predicted patch scores with weights, where the patch weights are the saliency values obtained from eye-tracking data. The performance gain of the method justifies the importance of introducing visual attention to IQA. Zhang and Li [22] argue that visual saliency and perceptual quality are highly related, and they utilize the relationship between the saliency map of a reference image and that of its distorted image to predict image quality scores. Zhang et al. [24] propose an IQA algorithm using object-like regions; they assume that semantic regions contribute to perceptual quality assessment. Hou and Gao [5] propose a saliency-guided framework whose idea is similar to [24]. In summary, these methods exploit image saliency maps in post-processing, i.e., they adopt a saliency-weighted average score rather than a uniform average score as the final prediction. Zhang et al. [25] study different combinations of saliency models and IQA methods.
We also believe that image quality assessment heavily depends on the way we attend to images. Hence, we explicitly model the attention process and learn the attention policy from data.
Recurrent Attentional Models:
Recently, deep learning models with attentional mechanisms have received a lot of interest. The soft attentional models [15, 10] implement a deterministic attention mechanism trained by normal back-propagation. Kuen et al. [10] realize the attention mechanism through the differentiable spatial transformer [7] and recurrent connections to refine a saliency map step by step. Stochastic attention in the hard attentional models [13, 1, 15] is often optimized by the REINFORCE algorithm [21]. Implementing a similar idea, Mnih et al. [13] propose a well-designed attentional model with an RNN for object recognition, and Ba et al. [1] recognize and localize multiple objects by maximizing a variational lower bound. Sorokin et al. [15] propose a soft attention mechanism designed as element-wise multiplication with importance vectors and a hard attention mechanism optimized by the REINFORCE algorithm.
Compared to the above models, our model also integrates the attentional mechanism but with different ingredients. (i) Our model is multi-task, i.e., it jointly optimizes the performance of two closely related tasks to learn a representation that leads to a more powerful attention policy. (ii) Consequently, the reward function of the reinforcement learning is enriched with multi-task rewards. Such enriched rewards enable the learned policy to capture the "task-informative" regions so that the information is aggregated and the predictions are made more accurately and efficiently. (iii) The robust averaging mechanism of perceptual judgment is implemented in the network architecture and learning. (iv) Multi-scale analysis is introduced into the network to emulate the foveal vision and provide contextual information around fixations.
3. The Proposed Model and Learning
In this section, we introduce the problem definition, illustrate each component of our model in detail, and explain how to jointly learn knowledge about distortion type, perceptual quality, and attention policy.
As shown in Figure 1, the proposed model consists of three main parts: a CNN for multi-scale image feature extraction, a stochastic node for location sampling, and recurrent connections for information aggregation. The ultimate goal is to predict the quality score s of an input image x. Starting from an initial location l_0, which can be randomly selected in the image during training, at each time step t the proposed model extracts features from three normalized multi-resolution patches x_t clipped around l_t and updates the two recurrent layers h_t^(1) and h_t^(2). Based on h_t^(2), our model predicts the next location l_{t+1}. The model also predicts the step-wise image quality score s_t and the weight α_t signifying the reliability of the score prediction. Repeating this procedure for T steps, we obtain a sequence of locations l = {l_i}_{i=0}^{T-1}, and the information at each location is aggregated into h_T. The label y denoting the distortion type of x is predicted based on h_T as an auxiliary task. The final quality score of the input image is then computed by s = Σ_{t=1}^{T} α_t s_t.
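To make the dynamic perception process concrete, the following is a minimal PyTorch-style sketch of one forward pass under our reading of the text. The module names (glimpse_cnn, rnn, locator, scorer) are hypothetical placeholders, not the authors' released code; the individual modules are sketched in the subsections below.

```python
# A minimal sketch of the dynamic perception loop (placeholder module names).
import torch

def assess_quality(image, glimpse_cnn, rnn, locator, scorer, T=5):
    """Run T fixation steps and return the robust-averaged quality score s."""
    batch = image.size(0)
    l_t = torch.zeros(batch, 2)                 # initial fixation l_0 (center / random in training)
    h_t = rnn.init_hidden(batch)
    step_scores, step_weights = [], []
    for _ in range(T):
        g_t = glimpse_cnn(image, l_t)           # multi-scale features of the attended area
        h_t = rnn(g_t, h_t)                     # aggregate evidence along the saccadic path
        l_t, _ = locator(h_t)                   # where to attend next (stochastic in training)
        s_t, a_t = scorer(h_t)                  # step-wise score s_t and unnormalized weight alpha_t
        step_scores.append(s_t)
        step_weights.append(a_t)
    alpha = torch.softmax(torch.stack(step_weights, dim=1), dim=1)   # weights sum to one
    s = (alpha * torch.stack(step_scores, dim=1)).sum(dim=1)         # s = sum_t alpha_t * s_t
    return s
```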
The Multi-Scale Image Analysis Module: This module learns a multi-scale representation of an attended region. At step t, the output is g_t = f_g(x, l_t; θ_g), where θ_g denotes the parameters.
We use multi-resolution images to emulate the foveal vision of human eyes. The fovea is at the center of the retina, where visual signals are captured with high resolution and processed in detail. Regions outside the fovea are peripheral regions, which perceive visual patterns with less detail, and the degradation grows with eccentricity. Human beings move and fixate their eyes at the task-informative areas with the aid of the peripheral vision, so that they are able to acquire and process task-related information efficiently [4]. Here, we extract three patches at different scales centered at the same fixation point and normalize them to the same size. These multi-scale patches emulate the foveal and peripheral signals of an attended area. The normalized patches are stacked together and fed to the CNN.
We adopt multi-scale convolution kernels as in [16] to make the computation efficient. We also treat the sampled fixation location as a feature and feed it into a fully connected layer of the CNN. Our model concatenates the two hidden layers of x_t and l_t in the CNN and connects them to another fully connected layer, which then outputs g_t.
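As a rough illustration of this foveated glimpse, the sketch below crops three concentric patches and rescales them to a common resolution before stacking them along the channel axis. The scale and output sizes are illustrative placeholders, not the sizes used in the paper.

```python
# A rough sketch of foveated multi-scale glimpse extraction (placeholder sizes).
import torch
import torch.nn.functional as F

def extract_glimpse(image, center, scales=(32, 64, 128), out_size=32):
    """image: (B, C, H, W); center: (x, y) in pixels. Crops patches of growing
    extent around the fixation and resizes them to one resolution."""
    _, _, H, W = image.shape
    cx, cy = int(center[0]), int(center[1])
    patches = []
    for s in scales:
        half = s // 2
        x0, x1 = max(cx - half, 0), min(cx + half, W)
        y0, y1 = max(cy - half, 0), min(cy + half, H)
        patch = image[:, :, y0:y1, x0:x1]
        patch = F.interpolate(patch, size=(out_size, out_size),
                              mode="bilinear", align_corners=False)
        patches.append(patch)
    # stack the scales along the channel axis before feeding them to the CNN
    return torch.cat(patches, dim=1)
```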
The Location Sampling Module: This module samples the locations of the attention areas in an image. The output l_{t+1} is determined by the hidden state h_t^(2) of the last recurrent layer and the parameters θ_l: l_{t+1} = f_l(h_t^(2); θ_l). We assume that each dimension of the next location independently follows a Gaussian distribution with the same fixed standard deviation.
The locations are sampled stochastically in the training stage, and the mean of the Gaussian is used during testing. The stochastic sampling is a common strategy to enable exploration in reinforcement learning. First, we predict the mean of the Gaussian distribution by µ_{t+1} = φ(W_rl h_t^(2) + b_l), where φ is the HardTanh activation function limiting µ_{t+1} to an appropriate range ([-1, 1] in this work). Then the next attention location is sampled from a Gaussian distribution a_{t+1} ~ p(· | µ_{t+1}, σ), where σ is the standard deviation for the x-y dimensions of the location.
We learn the location sampling policy by reinforcement learning guided by the enriched multi-task rewards, so that the model is able to sample a sequence of "informative" areas and aggregate the information to jointly predict the image quality score and the distortion type. The details of the learning are discussed in Section 3.2.
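A minimal sketch of this stochastic head is given below; the hidden size and the value of the fixed standard deviation are assumptions for illustration, and only the overall mechanism follows the text.

```python
# A minimal sketch of the Gaussian location-sampling head (assumed hidden size and std).
import torch
import torch.nn as nn
import torch.nn.functional as F

class Locator(nn.Module):
    def __init__(self, hidden_dim=256, sigma=0.15):
        super().__init__()
        self.fc = nn.Linear(hidden_dim, 2)     # mu_{t+1} = HardTanh(W_rl h_t + b_l)
        self.sigma = sigma                     # fixed std shared by both location dimensions

    def forward(self, h_t):
        mu = F.hardtanh(self.fc(h_t), -1.0, 1.0)          # keep the mean in [-1, 1]
        if not self.training:
            return mu, None                               # use the mean at test time
        dist = torch.distributions.Normal(mu, self.sigma)
        l_next = dist.sample()                            # stochastic exploration
        log_prob = dist.log_prob(l_next).sum(dim=-1)      # log p(a_{t+1} | mu_{t+1}, sigma)
        return l_next, log_prob
```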
The Information Aggregation Module: The RNN is adopted to learn the internal mechanism of information aggregation across the fixation areas.
A human expert judges the perceptual quality of an image after scanning a sequence of attended areas. We employ a two-layer RNN to aggregate information at each time step and then predict the distortion type and the image quality score. At the same time, the model also predicts the next fixation location l_{t+1} at time t. The first recurrent layer is computed as

h_t^(1) = ϕ(W_gh g_t + W_hh h_{t-1}^(1) + b_h),   (1)

where W_gh denotes the connection weights from g_t to the hidden layer h^(1), W_hh denotes the connection of the hidden layer to itself, b_h is the bias, and ϕ is the ReLU activation function. The second recurrent layer h_t^(2) is computed in the same way, with h_t^(1) as its input.
The distortion classification and the quality prediction are two different tasks. Although predicting the quality score is our ultimate goal, we observe that the classification task not only helps to generalize the learning but also enriches the reward for learning the attention policy. The model predicts the distortion type ŷ through two fully connected layers and a softmax layer on top of the RNN.
For quality score prediction, we adopt the perceptual judgement mechanism of robust averaging [2]. According to [2], an optimal agent makes judgments based on the strength and reliability of decision-relevant evidence, and a plausible computational mechanism of such perceptual judgement is a multi-element averaging model, where the weights of the variables/strengths correspond to the reliability of the evidence. In the NR-IQA context, the decision-relevant evidence consists of the attended image regions, the strengths of the evidence are the predicted quality scores, and the reliability is captured by the weights {α_t} of the linear averaging model. The weights measure the reliability of the score prediction, and they are learned to optimize the overall reward of the score prediction and distortion classification.
The unnormalized weight at time t is estimated by α_t = Linear(h_t^(2)) and the predicted step-wise score is s_t = ϕ(Linear(ϕ(Linear(h_t^(2))))). We use a softmax layer to normalize the weights {α_t} so that they sum to one. The final score is predicted by s = Σ_{t=1}^{T} α_t s_t.
There are three terms in the final loss function L = L_cla + λ L_reg - α J_rein, where λ and α are free parameters, L_cla is the log-softmax loss of the distortion classification, and L_reg is the mean absolute error of the quality score prediction. J_rein is the reinforcement learning term, which is the expectation of the accumulated reward. From the perspective of reinforcement learning, the hidden layer h_t in our framework represents the state, the location prediction is the action, and the predicted Gaussian distribution represents the policy. We define the reward function as

R = Σ_{t=1}^{T} r_t = r_T = 1 if y = y′ or |s - s′| < σ, and 0 otherwise,   (2)

where σ here denotes a threshold (distinct from the sampling standard deviation) controlling when the score prediction is rewarded, y′ is the ground-truth distortion type, and s′ is the ground-truth quality score; the threshold is set empirically. The reward equals 1 when the classification is correct or the score prediction is accurate enough, and 0 otherwise.
Because our model only makes predictions at the final step, the cumulative reward R is actually the reward for the final step, r_T. Therefore, the goal of the learning is to classify the distortion and predict the score accurately. The gradient of J_rein is approximated by

∇_θ J_rein ≈ (1/M) Σ_{j=1}^{M} Σ_{t=1}^{T} ∇_θ log p(a_t^j | µ_t^j, σ) R_j,   (3)

where j indexes the training images and M is the number of images. Intuitively, we learn the sampling policy of the selective attention mechanism by maximizing the above log-likelihood function guided by the reward. In our case, a_t^j follows the Gaussian distribution parameterized by µ_t^j and σ, so the derivative of J_rein w.r.t. a_t^j is

∂J_rein / ∂a_t^j = R_j ∂ log p(a_t^j | µ_t^j, σ) / ∂a_t^j = -(R_j / σ²)(a_t^j - µ_t^j),   (4)

which indicates that the proposed model tends to learn the mean of the Gaussian as the center of the informative attended locations.
The model is trained with the Back-Propagation Through Time (BPTT) algorithm.

Figure 2. Illustration of the back-propagation gradient flow: the red arrows denote the gradients based on the supervised signals and the blue arrows denote the gradients of the REINFORCE algorithm.

As shown in Figure 2, the black arrows denote the forward computation flow, the red arrows represent the back-propagation flow based on the supervised loss, and the blue arrows represent the back-propagation flow of the reinforcement learning loss.
To preprocess the images, we first convert the RGB images to gray scale and then apply a local contrast normalization method to them.
In the multi-scale image analysis module, patches are extracted at three scales and all of them are normalized to the same size. We use four multi-scale convolution layers with kernels of three different sizes, with spatial pooling applied in the first, third, and last convolution layers. The two hidden layers of the RNN for both tasks have 256 neurons, and all of the other hidden layers have 128 neurons. We use ReLU for all the convolution and linear layers in the multi-scale image analysis module and in the RNN.
We use the adaptive gradient descent method Adam [3] with momentum as our optimization method. In the loss function, the parameter λ for the score prediction task and the weight α for the reinforcement loss are set as fixed constants, and the learning rate is decayed linearly over the training epochs. To encourage exploration of the location sampling policy, the standard deviation σ of the Gaussian distribution is linearly decreased during the early training epochs, and we additionally apply the ε-greedy method for location sampling, with ε linearly decreased to zero during the early training epochs. The number of sampled locations T is set to five. We find that at the beginning of training the predicted locations sometimes overflow severely, so we apply a small trick to make learning more stable at the very start: if the sum of µ_t (the mean of the Gaussian) exceeds a threshold, we randomly re-initialize the parameters of the location sampling module.
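For concreteness, the sketch below assembles the multi-task objective and the reward/REINFORCE terms of Eqs. (2)-(3). The hyper-parameter values (lam, alpha, and the score threshold) are illustrative placeholders rather than the values used in the paper.

```python
# A minimal sketch of the multi-task objective L = L_cla + lambda * L_reg - alpha * J_rein.
import torch
import torch.nn.functional as F

def multitask_loss(logits, y, s_pred, s_gt, log_probs, lam=1.0, alpha=0.1, threshold=0.2):
    """logits: (B, K) distortion-type predictions; s_pred, s_gt: (B,) quality scores;
    log_probs: list of per-step location log-probabilities, each of shape (B,)."""
    L_cla = F.cross_entropy(logits, y)                     # log-softmax classification loss
    L_reg = torch.abs(s_pred - s_gt).mean()                # mean absolute error of the score
    # reward of Eq. (2): 1 if the type is correct or the score is accurate enough, else 0
    correct = logits.argmax(dim=1) == y
    accurate = torch.abs(s_pred - s_gt) < threshold
    R = (correct | accurate).float().detach()
    # REINFORCE term of Eq. (3): batch average of sum_t log p(a_t | mu_t, sigma) * R
    J_rein = (torch.stack(log_probs, dim=1).sum(dim=1) * R).mean()
    return L_cla + lam * L_reg - alpha * J_rein
```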
Because our model only makespredictions at the final step, the cumulative reward R is ac-tually the reward just for the final step r T . Therefore, the goal of the learning is to classify distortion and predict scoreaccurately. J rein is approximated by ∇ θ R J rein ≈ M M (cid:88) j =1 T (cid:88) t =1 ∇ θ R log p ( a tj | µ tj , σ ) R j , (3)where j is index of the training images and M is the numberof images. Intuitively, we learn the sampling policy of theselective attention mechanism by maximize the above log likelihood function guided by the reward. In our case, a tj follows the Gaussian distribution parameterized by µ tj and σ , so the derivative of J rein w.r.t. a tj is ∂J rein ∂ a tj = R j ∂ log p ( a tj | µ tj , σ ) ∂ a tj = − R j σ ( a tj − µ tj ) , (4)which indicates that the proposed model tends to learn themean of Gaussian as the center of the informative attendedlocations.The model is trained with the Back-Propagation ThroughTime (BPTT) algorithm. As shown in Figure 2, the black ar-rows denote the forward computation flow, the red arrowsrepresent the backpropagation flow based on the supervisedloss and the blue arrows represent the back-propagationflow of the reinforcement learning loss. To preprocess the images, we first turn the RGB imagesinto gray scale images, then apply a local contrast normal-ization method on them.In the multi-scale image analysis module, the patchessizes at three scales are × , × and × .All the patches are normalized to × . We use fourmulti-scale convolution layers with × , × and × convolution kernels. The ratio of numbers of the three typesof kernels is , and the numbers of the kernels in lay-ers of CNN are , , , . The spatial pooling size is × in the last convolution layer and × in the first andthe third convolution layers. The two hidden layers of RNNfor both tasks have 256 neurons and all of the other hiddenlayers have 128 neurons. We use ReLU for all the convolu-tion and the linear layers in the multi-scale image analysismodule and the RNN.We use an adaptive gradient descent methods Adam [3]with momentum . as our optimization method. In theloss function, parameter λ for the score prediction task isset to and α for the reinforcement loss is set to . . Theinitial learning rate is . , and we train the model with epochs and linearly decay the learning rate to . .To encourage the exploration of the location sampling pol-icy, the standard deviation σ of the Gaussian distributionis linearly declined from . to . after training for R-IQAMethod PSNR SSIM[20] VSI[23]SROCC 0.749 0.814
LCC 0.738 0.821
NR-IQAMethod CNN[8] CNN++[9] Tang et al . [18]
CNN MT CNN MT+S RL RL+M RL+M+R
SROCC 0.548 0.633
Table 1. Evaluation on TID2008.
Figure 3. SROCC on TID2008 for each distortion type, comparing CNN++ [9], our CNN_MT, and our RL+M+R.
4. Experimental Results
TID2008 [14]: This dataset consists of 25 reference images, 17 types of distortions, and four levels of each type of distortion. There are in total 1700 distorted images, each of which is labeled with a Mean Opinion Score (MOS) between 0 and 9.
Evaluation: We choose the Pearson linear correlation coefficient (LCC) to measure the prediction accuracy and the Spearman rank order correlation coefficient (SROCC) to measure the prediction monotonicity.
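As a quick reference, both criteria can be computed with SciPy; the MOS arrays below are placeholder values.

```python
# Computing the two evaluation criteria with SciPy (placeholder MOS values).
import numpy as np
from scipy.stats import pearsonr, spearmanr

mos_true = np.array([5.2, 3.1, 6.7, 4.4, 2.0])   # ground-truth MOS (placeholders)
mos_pred = np.array([5.0, 3.5, 6.1, 4.8, 2.3])   # predicted scores (placeholders)

lcc, _ = pearsonr(mos_true, mos_pred)     # Pearson linear correlation (prediction accuracy)
srocc, _ = spearmanr(mos_true, mos_pred)  # Spearman rank correlation (prediction monotonicity)
print(f"LCC = {lcc:.3f}, SROCC = {srocc:.3f}")
```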
Table 2. Results (SROCC and LCC) on TID2008 for the local distortion types 12-15 and overall.
Model                                SROCC    Class. Acc.
RL+M+R without multi-resolution      0.774    80.7%
RL+M+R
Table 3. Comparison between learning with and without multi-resolution context information on TID2008.

The local contrast normalization used in our method is not applicable to the "mean shift" and "contrast change" distortions, so they are neglected in our experiments. We also ignore the last reference image because it is not a natural image. Part of the reference images and the associated distorted images are used as the training set, and the rest are split into the validation set and the testing set. The results are reported based on the median of five random splits. During testing, we set the initial location to the image center and choose the model parameters with the highest SROCC on the validation set.
We train the model with the images of the remaining 15 distortion types together. The overall results are presented in Table 1, and the SROCC evaluation for each specific distortion type is presented in Figure 3. We compare our reinforcement learning model with multi-task learning and robust averaging (RL+M+R) against several FR-IQA methods (PSNR, SSIM [20], VSI [23]) and NR-IQA methods (CNN [8], CNN++ [9], Tang et al.'s method [18]). The CNN [8] and CNN++ [9] are implemented by ourselves, strictly following the original settings. Our model outperforms most of the state-of-the-art NR-IQA and even FR-IQA methods on the TID2008 dataset. Tang et al.'s method [18] performs better than ours, but their model is pre-trained on a large-scale external dataset.

Figure 4. The last attended regions of four images with masked noise of different levels. Our model locates on or near the most salient region around the black letters.
Figure 5. The last attended regions of four images with local block-wise distortions of different levels. Our model locates on or near the block masks.
Figure 6. The sampled locations of four images with high frequency noise of different levels. The fixation moves out into the background only in the image with the most serious degradation.

In order to demonstrate the benefit of the robust averaging strategy, we implement an RL+M model which uses multi-task learning but not the robust averaging strategy. As shown in Table 1, the results of our RL+M+R are better than those of the RL+M in both SROCC and LCC.
In order to justify the importance of multi-task learning, we implement an RL model without multi-task learning and the robust averaging strategy. As shown in Table 1, the RL model performs much worse than the RL+M and the RL+M+R in both SROCC and LCC.
In order to show that the boosted performance is due to our task-driven attentional mechanism, we implement a multi-task CNN (CNN_MT) with a structure similar to our RNN. The training and testing procedures for the CNN_MT are the same as those for CNN++ [9]. As shown in Table 1, the results of the CNN_MT are worse than those of our RL+M and RL+M+R. Furthermore, we combine a saliency model [19] with the CNN_MT and name it CNN_MT+S: we first apply the saliency method to compute saliency maps of the TID2008 images, then use the saliency values as the weights to average the scores predicted by CNN_MT (a sketch of this pooling is given below). As shown in Table 1, the CNN_MT+S is better than the CNN_MT, but worse than our RL+M+R.
Figure 3 shows the SROCC values of each distortion type for the different methods. It can be seen that our model performs better on the images with local distortions, especially for Type 14, i.e., non-eccentricity pattern noise, and Type 15, i.e., local block-wise distortions of different intensity. The CNN_MT outperforms our model on a few distortion types but performs worse in the overall result. This may indicate that averaging the scores of every patch is not a good strategy for obtaining the quality score of an image. Instead, our attention-driven model, which only uses "informative" patches, is a better method.
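As referenced above, the following is a minimal sketch of the saliency-weighted pooling used for the CNN_MT+S baseline, assuming per-patch quality scores and per-patch mean saliency values are already available.

```python
# A minimal sketch of saliency-weighted score pooling (placeholder inputs).
import numpy as np

def saliency_weighted_score(patch_scores, patch_saliency):
    """Average per-patch scores with weights proportional to patch saliency."""
    scores = np.asarray(patch_scores, dtype=float)
    weights = np.asarray(patch_saliency, dtype=float)
    weights = weights / (weights.sum() + 1e-8)     # normalize the weights to sum to one
    return float(np.dot(weights, scores))
```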
Experiments on Local Distortion Types: We train our model on the images of the four local distortion types of
TID2008. As shown in Table 2, the results of our RL+M+R are better than those of the CNN_MT in both SROCC and LCC.
Learning without Multi-Resolution Information: In the proposed method, we extract multi-resolution patches. As a reference, we train a model operating on only a single fixed-size patch at each step. This model is compared with the proposed one in Table 3. Both the quality assessment results and the distortion type prediction results decline when learning without multi-resolution patches.
Classification Task: On the testing set, our model obtains 87.7% classification accuracy over the 15 distortion types. The confusion matrix is shown in Figure 7. Half of the images of Type 2 are misclassified as Type 1 because both are additive Gaussian noise, while Type 2 is operated in the luminance channel and Type 1 operates on the color components. The lower right corner of the confusion matrix shows that images with local distortions of very small extent are hard to classify correctly.

Figure 7. Confusion matrix on the testing set.
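A confusion matrix like the one in Figure 7 can be produced with scikit-learn; the label arrays below are placeholders, with distortion types indexed 0-14.

```python
# Building a (row-normalized) confusion matrix with scikit-learn (placeholder labels).
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 1, 1, 2, 13, 14])    # ground-truth distortion types (placeholders)
y_pred = np.array([0, 0, 1, 2, 13, 13])    # predicted distortion types (placeholders)

cm = confusion_matrix(y_true, y_pred, labels=np.arange(15))
cm_norm = cm / np.maximum(cm.sum(axis=1, keepdims=True), 1)   # normalize each true-class row
```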
We show some results of the attended patches selected by our model in Figures 4, 5 and 6. In Figures 4 and 5, we magnify the sampled patches at the bottom right corner of each image. Figure 4 shows the last attended regions of four images with masked noise of different intensities. The masked noise is strong in regions of high spatial frequency, and the highest spatial frequency regions are the areas around the letters on the left cap; our model locates this most salient region for all four distortion levels.
The local block-wise distortion degrades image quality by adding annoying blocks of different intensities. Figure 5 shows that our model locates the artifact blocks in the last attended region. Notice that in the last two images, even a distortion of only a few blocks is captured.
Figure 6 displays the attentional scanpaths on an image with different levels of high frequency noise. Notice that the scanpaths are different, which indicates that different levels of degradation affect the attention differently.
5. Conclusion
In this paper, we propose an attention-driven model with multi-task learning and a robust averaging strategy for general no-reference image quality assessment. We consider NR-IQA as a dynamic perception process. The model learning is implemented by a reinforcement strategy, in which the rewards of both tasks guide the learning of the optimal sampling policy to acquire the task-informative image regions so that the predictions can be made accurately and efficiently.
References

[1] J. Ba, V. Mnih, and K. Kavukcuoglu. Multiple object recognition with visual attention. arXiv preprint arXiv:1412.7755, 2014.
[2] V. De Gardelle and C. Summerfield. Robust averaging during perceptual judgment. Proceedings of the National Academy of Sciences, 108(32):13341–13346, 2011.
[3] T. Dozat. Incorporating Nesterov momentum into Adam.
[4] J. Freeman and E. P. Simoncelli. Metamers of the ventral stream. Nature Neuroscience, 14(9):1195–1201, 2011.
[5] W. Hou and X. Gao. Saliency-guided deep framework for image quality assessment. IEEE MultiMedia, 22(2):46–55, 2015.
[6] W. Hou, X. Gao, D. Tao, and X. Li. Blind image quality assessment via deep learning. IEEE Transactions on Neural Networks and Learning Systems, 26(6):1275–1286, 2015.
[7] M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. In Advances in Neural Information Processing Systems, pages 2017–2025, 2015.
[8] L. Kang, P. Ye, Y. Li, and D. Doermann. Convolutional neural networks for no-reference image quality assessment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1733–1740, 2014.
[9] L. Kang, P. Ye, Y. Li, and D. Doermann. Simultaneous estimation of image quality and distortion via multi-task convolutional neural networks. In Image Processing (ICIP), 2015 IEEE International Conference on, pages 2791–2795. IEEE, 2015.
[10] J. Kuen, Z. Wang, and G. Wang. Recurrent attentional networks for saliency detection. arXiv preprint arXiv:1604.03227, 2016.
[11] Y. Li, L.-M. Po, X. Xu, L. Feng, F. Yuan, C.-H. Cheung, and K.-W. Cheung. No-reference image quality assessment with shearlet transform and deep neural networks. Neurocomputing, 154:94–109, 2015.
[12] H. Liu and I. Heynderickx. Visual attention in objective image quality assessment: Based on eye-tracking data. IEEE Transactions on Circuits and Systems for Video Technology, 21(7):971–982, 2011.
[13] V. Mnih, N. Heess, A. Graves, et al. Recurrent models of visual attention. In Advances in Neural Information Processing Systems, pages 2204–2212, 2014.
[14] N. Ponomarenko, V. Lukin, A. Zelensky, K. Egiazarian, M. Carli, and F. Battisti. TID2008 - a database for evaluation of full-reference visual quality assessment metrics. Advances of Modern Radioelectronics, 10(4):30–45, 2009.
[15] I. Sorokin, A. Seleznev, M. Pavlov, A. Fedorov, and A. Ignateva. Deep attention recurrent Q-network. arXiv preprint arXiv:1512.01693, 2015.
[16] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
[17] H. Tang, N. Joshi, and A. Kapoor. Learning a blind measure of perceptual image quality. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 305–312. IEEE, 2011.
[18] H. Tang, N. Joshi, and A. Kapoor. Blind image quality assessment using semi-supervised rectifier networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2877–2884, 2014.
[19] W. Wang, Y. Wang, Q. Huang, and W. Gao. Measuring visual saliency by site entropy rate. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 2368–2375. IEEE, 2010.
[20] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
[21] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.
[22] L. Zhang and H. Li. SR-SIM: A fast and high performance IQA index based on spectral residual. In Image Processing (ICIP), 2012 IEEE International Conference on, pages 1473–1476. IEEE, 2012.
[23] L. Zhang, Y. Shen, and H. Li. VSI: A visual saliency-induced index for perceptual image quality assessment. IEEE Transactions on Image Processing, 23(10):4270–4281, 2014.
[24] P. Zhang, W. Zhou, L. Wu, and H. Li. SOM: Semantic obviousness metric for image quality assessment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2394–2402, 2015.
[25] W. Zhang, Y. Tian, X. Zha, and H. Liu. Benchmarking state-of-the-art visual saliency models for image quality assessment. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).