An Attention-Driven Approach of No-Reference Image Quality Assessment
Diqi Chen, Yizhou Wang, Tianfu Wu, Wen Gao
Nat'l Engineering Laboratory for Video Technology, Key Laboratory of Machine Perception (MoE), Sch'l of EECS, Peking University, Beijing, 100871, China
Department of ECE and the Visual Narrative Cluster, North Carolina State University
Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing, 100190, China
{cdq, Yizhou.Wang, wgao}@pku.edu.cn, [email protected]

Abstract
In this paper, we present a novel method of no-reference image quality assessment (NR-IQA), which is to predict the perceptual quality score of a given image without using any reference image. The proposed method harnesses three functions: (i) the visual attention mechanism, which affects many aspects of visual perception including image quality assessment, yet is overlooked in the NR-IQA literature; the method assumes that the fixation areas on an image contain key information for the process of IQA; (ii) the robust averaging strategy, a means, supported by psychology studies, of integrating multiple/step-wise evidence to make a final perceptual judgment; (iii) multi-task learning, which is believed to be an effectual means to shape representation learning and could result in a more generalized model. To exploit the synergy of the three, we consider NR-IQA as a dynamic perception process, in which the model samples a sequence of "informative" areas and aggregates the information to learn a representation for the tasks of jointly predicting the image quality score and the distortion type. The model learning is implemented by a reinforcement strategy, in which the rewards of both tasks guide the learning of the optimal sampling policy to acquire the "task-informative" image regions so that the predictions can be made accurately and efficiently (in terms of the sampling steps). The reinforcement learning is realized by a deep network with the policy gradient method and trained through back-propagation. In experiments, the model is tested on the TID2008 dataset and it outperforms several state-of-the-art methods. Furthermore, the model is very efficient in the sense that only a small number of fixations are used in NR-IQA.
1. Introduction
In the era of big data, an enormous amount of visual data is making its way to end consumers through mobile devices, social media, HDTV, etc. Since the applications are so broad and diverse, it becomes increasingly important to improve the quality of experience for consumers. Automatic IQA becomes an indispensable module of a service system, so that the system can tell the perceptual quality of its content (here, images) and then optimize the delivered services accordingly.
There are two main schemes in IQA: full-reference (FR) IQA [20, 23, 22] and no-reference (NR) IQA [17, 6, 11, 8, 9, 18, 5, 24]. The former requires a "clean", pristine reference image with respect to which the quality of the distorted image is assessed; the latter takes only the distorted image to be assessed as input and is thus more widely applicable. This paper focuses on NR-IQA.
The challenges in NR-IQA involve several factors: the "what" issue, i.e., unknown types of distortion (some are local, e.g., only certain image regions are distorted; some are global, e.g., pervasive additive noise contaminating all the pixels in an image); the "where" issue, i.e., the unknown spatial distribution of distortions (e.g., where the degraded regions are located in an image); and the "how" issue, i.e., the unknown mechanism for aggregating the information collected from the distorted regions as well as from other regions of an image for quality assessment.
In this paper, we address the above issues by presenting a novel method of NR-IQA. The proposed method is inspired by the following three streams of studies.

Figure 1. The overall architecture of the proposed model for NR-IQA.
Left: An image with local block-wise distortion (the "block masks") on the two foreground objects. Three fixation areas are illustrated by the orange squares. Centered at each fixation position, three multi-scale patches are extracted and then normalized to the same size to simulate the foveal vision.
Right: Illustration of the proposed network composed of three components: the multi-scale image analysis module (the green boxes), a weight-shared CNN used to extract the features that update the recurrent layer; the location sampling module (the orange boxes), a stochastic node that learns to select where to attend next; and the information aggregation module (the black box), a recurrent neural network (RNN) that aggregates information along the attentional saccadic path for multi-task learning.

(i) The visual attention mechanism, which affects many aspects of visual perception including image quality assessment. By observing how human subjects assess the quality of images, we assume that eye fixation areas on an image contain "key information" for IQA. However, in the IQA literature there are few studies on how to organically integrate the attention mechanism into IQA, e.g., learning a strategy for where to attend on an image in relation to the IQA task.
(ii) The robust averaging strategy [2], which is a computational mechanism of perceptual judgment, supported by psychology studies, that integrates multiple/step-wise evidence to make a final judgment. By adopting this strategy, the final image quality score is the weighted average of the scores predicted from a number of attended areas in an image.
(iii) Multi-task learning, which is believed to be an effectual means to shape representation learning and could result in a more generalized model. Here, besides predicting image scores, we empower the model to classify the distortion type of an image to be assessed.
To exploit the synergy of the three, we consider NR-IQA as a dynamic perception process, in which the model samples a sequence of "informative" areas and aggregates the information to learn a representation for the tasks of jointly predicting the image quality score and the distortion type. Figure 1 illustrates the proposed model. It is composed of three components:
• The multi-scale image analysis module: it is implemented by a weight-sharing CNN (the green boxes in Figure 1), which extracts multi-scale image features around a fixation point. We extract three image patches of different scales centered at a fixation point. The CNN learns the feature representation for quality assessment. This component aims at solving the "what" issue mentioned above in an end-to-end learning fashion.
• The location sampling module: it is implemented by a stochastic node (the orange boxes in Figure 1), which learns to select "where" to attend next based on the integrated information about what the model has seen so far. It predicts the IQA-task-related regions such that the next selected fixation will be sufficiently informative.
• The information aggregation module: it is implemented by an RNN (the black box in Figure 1), which aggregates information along a saccadic path to compute the final predictions, i.e., the image quality score and the distortion type. It captures both local information and global information in the sequential unfolding. It learns to resolve the "how" issue in NR-IQA stated above. The representation is shaped through multi-task learning. Inspired by the robust averaging strategy for perceptual judgment [2] (which takes both the "strength" and "reliability" of evidence into consideration when making a final perceptual judgment), our model predicts the final score as the weighted average of the scores predicted at the attended areas. The weights of the scores are learned to signify the "reliability" of the score prediction at the attended regions.
Inspired by [21], the model learning is implemented by a reinforcement strategy.
The rewards of both tasks (score prediction and distortion type classification) guide the learning of the optimal location sampling policy to acquire the "task-informative" image regions so that the predictions can be made accurately and efficiently (in terms of the sampling steps). The reinforcement learning is realized by a deep network with the policy gradient method and trained through back-propagation. In experiments, the model is tested on the TID2008 dataset [14] and it outperforms several state-of-the-art methods. Furthermore, the model is very efficient in the sense that a small number of fixations are used in NR-IQA.
2. Related Work
We briefly review the application of deep learning models for NR-IQA, the IQA methods using objectness/saliency, and related attentional models.
Neural Networks for NR-IQA:
Deep learning provides an approach to learning a mapping from raw input or low-level features to scores of image perceptual quality. These methods avoid delicately designed hand-crafted features. Kang et al. [8] propose a patch-based NR-IQA method: they first uniformly sample image patches at a predefined scale, then train a CNN to predict a quality score for each image patch and average the scores of the patches as the holistic image score. They further propose a multi-task CNN [9] to classify the distortion type of each patch in addition to the quality score prediction. Our method is quite different from theirs; even in the distortion type classification part, we do not classify the distortion type of each attended patch; instead, we classify the whole image based on the "aggregated information" over a sequence of attended regions.
In addition to the patch-based methods, some methods combine hand-crafted low-level features with deep networks as an alternative approach. For example, Tang et al. [18] first extract the LBIQ features [17] and feed the features into a Restricted Boltzmann Machine to predict image quality scores. Hou et al. [6] pose the IQA problem as a classification problem: they slot images into different categories according to image quality and propose a quality pooling method under the Bayesian framework to predict quality scores.
Objectness and Saliency in IQA:
Although there is little literature that organically fuses the attentional mechanism into NR-IQA, semantic objectness or saliency has been applied. Objectness and saliency are static properties of image regions, whereas attention is an active perception process of an observer. Liu et al. [12] determine the final score of an image by averaging the predicted patch scores with weights, where the patch weights are the saliency values obtained from eye-tracking data. The performance gain of the method justifies the importance of introducing visual attention to IQA. Zhang and Li [22] argue that visual saliency and perceptual quality are highly related, and they utilize the relationship between the saliency map of a reference image and that of its distorted image to predict image quality scores. Zhang et al. [24] propose an IQA algorithm using object-like regions; they assume that semantic regions contribute to perceptual quality assessment. Hou and Gao [5] propose a saliency-guided framework whose idea is similar to [24]. In summary, these methods exploit image saliency maps in post-processing, i.e., they adopt a saliency-weighted average score rather than a uniform average score as the final prediction. Zhang et al. [25] study different combinations of saliency models and IQA methods.
We also believe that image quality assessment heavily depends on the way we attend to images. Hence, we explicitly model the attention process and learn the attention policy from data.
Recurrent Attentional Models:
Recently, deep learning models with attentional mechanisms have received a lot of interest. The soft attentional models [15, 10] implement a deterministic attention mechanism trained by normal back-propagation. Kuen et al. [10] realize the attention mechanism through the differentiable spatial transformer [7] and recurrent connections to refine a saliency map step by step. Stochastic attention in the hard attentional models [13, 1, 15] is often optimized by the REINFORCE algorithm [21]. Implementing a similar idea, Mnih et al. [13] propose a well-designed attentional model with an RNN for object recognition, and Ba et al. [1] recognize and localize multiple objects by maximizing a variational lower bound. Sorokin et al. [15] propose a soft attention mechanism designed as element-wise multiplication with importance vectors and a hard attention mechanism optimized by the REINFORCE algorithm.
Compared to the above models, our model also integrates the attentional mechanism but with different ingredients. (i) Our model is multi-task, i.e., it jointly optimizes the performance of two closely related tasks to learn a representation that leads to a more powerful attention policy. (ii) Consequently, the reward function of the reinforcement learning is enriched with multi-task rewards. Such enriched rewards enable the learned policy to capture the "task-informative" regions so that the information is aggregated and the predictions are made more accurately and efficiently. (iii) The robust averaging mechanism of perceptual judgment is implemented in the network architecture and learning. (iv) Multi-scale analysis is introduced into the network to emulate the foveal vision and provide contextual information around fixations.
3. The Proposed Model and Learning
In this section, we introduce the problem definition, illustrate each component of our model in detail, and explain how to jointly learn knowledge about distortion type, perceptual quality, and attention policy.
As shown in Figure 1, the proposed model consists of three main parts: a CNN for multi-scale image feature extraction, a stochastic node for location sampling, and recurrent connections for information aggregation. The ultimate goal is to predict the quality score s of an input image x. Starting from an initial location l_0, which can be randomly selected in the image during training, at each time step t the proposed model extracts features from three normalized multi-resolution patches x_t clipped around l_t and updates the two recurrent layers h_t^(1) and h_t^(2). Based on h_t^(2), our model predicts the next location l_{t+1}. The model also predicts the step-wise image quality score s_t and the weight α_t signifying the reliability of the score prediction. Repeating this procedure for T steps, we obtain a sequence of locations l = {l_i}_{i=0}^{T-1}, and the information at each location is aggregated into h_T. The label y denoting the distortion type of x is predicted based on h_T as an auxiliary task. The final quality score of the input image is then computed by s = Σ_{t=1}^{T} α_t s_t.
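To make the dynamic perception process concrete, the following is a minimal PyTorch-style sketch of one forward pass under our reading of the text. The module names (glimpse_cnn, rnn, locator, scorer) are hypothetical placeholders, not the authors' released code; the individual modules are sketched in the subsections below.

```python
# A minimal sketch of the dynamic perception loop (placeholder module names).
import torch

def assess_quality(image, glimpse_cnn, rnn, locator, scorer, T=5):
    """Run T fixation steps and return the robust-averaged quality score s."""
    batch = image.size(0)
    l_t = torch.zeros(batch, 2)                 # initial fixation l_0 (center / random in training)
    h_t = rnn.init_hidden(batch)
    step_scores, step_weights = [], []
    for _ in range(T):
        g_t = glimpse_cnn(image, l_t)           # multi-scale features of the attended area
        h_t = rnn(g_t, h_t)                     # aggregate evidence along the saccadic path
        l_t, _ = locator(h_t)                   # where to attend next (stochastic in training)
        s_t, a_t = scorer(h_t)                  # step-wise score s_t and unnormalized weight alpha_t
        step_scores.append(s_t)
        step_weights.append(a_t)
    alpha = torch.softmax(torch.stack(step_weights, dim=1), dim=1)   # weights sum to one
    s = (alpha * torch.stack(step_scores, dim=1)).sum(dim=1)         # s = sum_t alpha_t * s_t
    return s
```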
The Multi-Scale Image Analysis Module: This module learns a multi-scale representation of an attended region. At step t, the output is g_t = f_g(x, l_t; θ_g), where θ_g denotes the parameters.
We use multi-resolution images to emulate the foveal vision of human eyes. The fovea is at the center of the retina, where visual signals are captured with high resolution and processed in detail. Regions outside the fovea are peripheral regions, which perceive visual patterns with less detail, and the degradation grows with eccentricity. Human beings move and fixate their eyes at the task-informative areas with the aid of the peripheral vision, so that they are able to acquire and process task-related information efficiently [4]. Here, we extract three patches at different scales centered at the same fixation point and normalize them to the same size. These multi-scale patches emulate the foveal and peripheral signals of an attended area. The normalized patches are stacked together and fed to the CNN.
We adopt multi-scale convolution kernels as in [16] to make the computation efficient. We also treat the sampled fixation location as a feature and feed it into a fully connected layer of the CNN. Our model concatenates the two hidden layers of x_t and l_t in the CNN and connects them to another fully connected layer, which then outputs g_t.
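As a rough illustration of this foveated glimpse, the sketch below crops three concentric patches and rescales them to a common resolution before stacking them along the channel axis. The scale and output sizes are illustrative placeholders, not the sizes used in the paper.

```python
# A rough sketch of foveated multi-scale glimpse extraction (placeholder sizes).
import torch
import torch.nn.functional as F

def extract_glimpse(image, center, scales=(32, 64, 128), out_size=32):
    """image: (B, C, H, W); center: (x, y) in pixels. Crops patches of growing
    extent around the fixation and resizes them to one resolution."""
    _, _, H, W = image.shape
    cx, cy = int(center[0]), int(center[1])
    patches = []
    for s in scales:
        half = s // 2
        x0, x1 = max(cx - half, 0), min(cx + half, W)
        y0, y1 = max(cy - half, 0), min(cy + half, H)
        patch = image[:, :, y0:y1, x0:x1]
        patch = F.interpolate(patch, size=(out_size, out_size),
                              mode="bilinear", align_corners=False)
        patches.append(patch)
    # stack the scales along the channel axis before feeding them to the CNN
    return torch.cat(patches, dim=1)
```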
The Location Sampling Module: This module samples the locations of the attention areas in an image. The output l_{t+1} is determined by the hidden state h_t^(2) of the last recurrent layer and the parameters θ_l: l_{t+1} = f_l(h_t^(2); θ_l). We assume that each dimension of the next location independently follows a Gaussian distribution with the same fixed standard deviation.
The locations are sampled stochastically in the training stage, and the mean of the Gaussian is used during testing. The stochastic sampling is a common strategy to enable exploration in reinforcement learning. First, we predict the mean of the Gaussian distribution by µ_{t+1} = φ(W_rl h_t^(2) + b_l), where φ is the HardTanh activation function limiting µ_{t+1} to an appropriate range ([-1, 1] in this work). Then the next attention location is sampled from a Gaussian distribution a_{t+1} ~ p(· | µ_{t+1}, σ), where σ is the standard deviation for the x-y dimensions of the location.
We learn the location sampling policy by reinforcement learning guided by the enriched multi-task rewards, so that the model is able to sample a sequence of "informative" areas and aggregate the information to jointly predict the image quality score and the distortion type. The details of the learning are discussed in Section 3.2.
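A minimal sketch of this stochastic head is given below; the hidden size and the value of the fixed standard deviation are assumptions for illustration, and only the overall mechanism follows the text.

```python
# A minimal sketch of the Gaussian location-sampling head (assumed hidden size and std).
import torch
import torch.nn as nn
import torch.nn.functional as F

class Locator(nn.Module):
    def __init__(self, hidden_dim=256, sigma=0.15):
        super().__init__()
        self.fc = nn.Linear(hidden_dim, 2)     # mu_{t+1} = HardTanh(W_rl h_t + b_l)
        self.sigma = sigma                     # fixed std shared by both location dimensions

    def forward(self, h_t):
        mu = F.hardtanh(self.fc(h_t), -1.0, 1.0)          # keep the mean in [-1, 1]
        if not self.training:
            return mu, None                               # use the mean at test time
        dist = torch.distributions.Normal(mu, self.sigma)
        l_next = dist.sample()                            # stochastic exploration
        log_prob = dist.log_prob(l_next).sum(dim=-1)      # log p(a_{t+1} | mu_{t+1}, sigma)
        return l_next, log_prob
```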
The Information Aggregation Module: The RNN is adopted to learn the internal mechanism of information aggregation across the fixation areas.
A human expert judges the perceptual quality of an image after scanning a sequence of attended areas. We employ a two-layer RNN to aggregate information at each time step and then predict the distortion type and the image quality score. At the same time, the model also predicts the next fixation location l_{t+1} at time t. The first recurrent layer is computed as

h_t^(1) = ϕ(W_gh g_t + W_hh h_{t-1}^(1) + b_h),   (1)

where W_gh denotes the connection weights from g_t to the hidden layer h^(1), W_hh denotes the connection of the hidden layer to itself, b_h is the bias, and ϕ is the ReLU activation function. The second recurrent layer h_t^(2) is computed in the same way, with h_t^(1) as its input.
The distortion classification and the quality prediction are two different tasks. Although predicting the quality score is our ultimate goal, we observe that the classification task not only helps to generalize the learning but also enriches the reward for learning the attention policy. The model predicts the distortion type ŷ through two fully connected layers and a softmax layer on top of the RNN.
For quality score prediction, we adopt the perceptual judgement mechanism of robust averaging [2]. According to [2], an optimal agent makes judgments based on the strength and reliability of decision-relevant evidence, and a plausible computational mechanism of such perceptual judgement is a multi-element averaging model, where the weights of the variables/strengths correspond to the reliability of the evidence. In the NR-IQA context, the decision-relevant evidence consists of the attended image regions, the strengths of the evidence are the predicted quality scores, and the reliability is captured by the weights {α_t} of the linear averaging model. The weights measure the reliability of the score prediction, and they are learned to optimize the overall reward of the score prediction and distortion classification.
The unnormalized weight at time t is estimated by α_t = Linear(h_t^(2)) and the predicted step-wise score is s_t = ϕ(Linear(ϕ(Linear(h_t^(2))))). We use a softmax layer to normalize the weights {α_t} so that they sum to one. The final score is predicted by s = Σ_{t=1}^{T} α_t s_t.
There are three terms in the final loss function L = L_cla + λ L_reg - α J_rein, where λ and α are free parameters, L_cla is the log-softmax loss of the distortion classification, and L_reg is the mean absolute error of the quality score prediction. J_rein is the reinforcement learning term, which is the expectation of the accumulated reward. From the perspective of reinforcement learning, the hidden layer h_t in our framework represents the state, the location prediction is the action, and the predicted Gaussian distribution represents the policy. We define the reward function as

R = Σ_{t=1}^{T} r_t = r_T = 1 if y = y′ or |s - s′| < σ, and 0 otherwise,   (2)

where σ here denotes a threshold (distinct from the sampling standard deviation) controlling when the score prediction is rewarded, y′ is the ground-truth distortion type, and s′ is the ground-truth quality score; the threshold is set empirically. The reward equals 1 when the classification is correct or the score prediction is accurate enough, and 0 otherwise.
Because our model only makes predictions at the final step, the cumulative reward R is actually the reward for the final step, r_T. Therefore, the goal of the learning is to classify the distortion and predict the score accurately. The gradient of J_rein is approximated by

∇_θ J_rein ≈ (1/M) Σ_{j=1}^{M} Σ_{t=1}^{T} ∇_θ log p(a_t^j | µ_t^j, σ) R_j,   (3)

where j indexes the training images and M is the number of images. Intuitively, we learn the sampling policy of the selective attention mechanism by maximizing the above log-likelihood function guided by the reward. In our case, a_t^j follows the Gaussian distribution parameterized by µ_t^j and σ, so the derivative of J_rein w.r.t. a_t^j is

∂J_rein / ∂a_t^j = R_j ∂ log p(a_t^j | µ_t^j, σ) / ∂a_t^j = -(R_j / σ²)(a_t^j - µ_t^j),   (4)

which indicates that the proposed model tends to learn the mean of the Gaussian as the center of the informative attended locations.
The model is trained with the Back-Propagation Through Time (BPTT) algorithm.

Figure 2. Illustration of the back-propagation gradient flow: the red arrows denote the gradients based on the supervised signals and the blue arrows denote the gradients of the REINFORCE algorithm.

As shown in Figure 2, the black arrows denote the forward computation flow, the red arrows represent the back-propagation flow based on the supervised loss, and the blue arrows represent the back-propagation flow of the reinforcement learning loss.
To preprocess the images, we first convert the RGB images to gray scale and then apply a local contrast normalization method to them.
In the multi-scale image analysis module, patches are extracted at three scales and all of them are normalized to the same size. We use four multi-scale convolution layers with kernels of three different sizes, with spatial pooling applied in the first, third, and last convolution layers. The two hidden layers of the RNN for both tasks have 256 neurons, and all of the other hidden layers have 128 neurons. We use ReLU for all the convolution and linear layers in the multi-scale image analysis module and in the RNN.
We use the adaptive gradient descent method Adam [3] with momentum as our optimization method. In the loss function, the parameter λ for the score prediction task and the weight α for the reinforcement loss are set as fixed constants, and the learning rate is decayed linearly over the training epochs. To encourage exploration of the location sampling policy, the standard deviation σ of the Gaussian distribution is linearly decreased during the early training epochs, and we additionally apply the ε-greedy method for location sampling, with ε linearly decreased to zero during the early training epochs. The number of sampled locations T is set to five. We find that at the beginning of training the predicted locations sometimes overflow severely, so we apply a small trick to make learning more stable at the very start: if the sum of µ_t (the mean of the Gaussian) exceeds a threshold, we randomly re-initialize the parameters of the location sampling module.
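For concreteness, the sketch below assembles the multi-task objective and the reward/REINFORCE terms of Eqs. (2)-(3). The hyper-parameter values (lam, alpha, and the score threshold) are illustrative placeholders rather than the values used in the paper.

```python
# A minimal sketch of the multi-task objective L = L_cla + lambda * L_reg - alpha * J_rein.
import torch
import torch.nn.functional as F

def multitask_loss(logits, y, s_pred, s_gt, log_probs, lam=1.0, alpha=0.1, threshold=0.2):
    """logits: (B, K) distortion-type predictions; s_pred, s_gt: (B,) quality scores;
    log_probs: list of per-step location log-probabilities, each of shape (B,)."""
    L_cla = F.cross_entropy(logits, y)                     # log-softmax classification loss
    L_reg = torch.abs(s_pred - s_gt).mean()                # mean absolute error of the score
    # reward of Eq. (2): 1 if the type is correct or the score is accurate enough, else 0
    correct = logits.argmax(dim=1) == y
    accurate = torch.abs(s_pred - s_gt) < threshold
    R = (correct | accurate).float().detach()
    # REINFORCE term of Eq. (3): batch average of sum_t log p(a_t | mu_t, sigma) * R
    J_rein = (torch.stack(log_probs, dim=1).sum(dim=1) * R).mean()
    return L_cla + lam * L_reg - alpha * J_rein
```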
Because our model only makespredictions at the final step, the cumulative reward R is ac-tually the reward just for the final step r T . Therefore, the goal of the learning is to classify distortion and predict scoreaccurately. J rein is approximated by ∇ θ R J rein ≈ M M (cid:88) j =1 T (cid:88) t =1 ∇ θ R log p ( a tj | µ tj , σ ) R j , (3)where j is index of the training images and M is the numberof images. Intuitively, we learn the sampling policy of theselective attention mechanism by maximize the above log likelihood function guided by the reward. In our case, a tj follows the Gaussian distribution parameterized by µ tj and σ , so the derivative of J rein w.r.t. a tj is ∂J rein ∂ a tj = R j ∂ log p ( a tj | µ tj , σ ) ∂ a tj = − R j σ ( a tj − µ tj ) , (4)which indicates that the proposed model tends to learn themean of Gaussian as the center of the informative attendedlocations.The model is trained with the Back-Propagation ThroughTime (BPTT) algorithm. As shown in Figure 2, the black ar-rows denote the forward computation flow, the red arrowsrepresent the backpropagation flow based on the supervisedloss and the blue arrows represent the back-propagationflow of the reinforcement learning loss. To preprocess the images, we first turn the RGB imagesinto gray scale images, then apply a local contrast normal-ization method on them.In the multi-scale image analysis module, the patchessizes at three scales are × , × and × .All the patches are normalized to × . We use fourmulti-scale convolution layers with × , × and × convolution kernels. The ratio of numbers of the three typesof kernels is , and the numbers of the kernels in lay-ers of CNN are , , , . The spatial pooling size is × in the last convolution layer and × in the first andthe third convolution layers. The two hidden layers of RNNfor both tasks have 256 neurons and all of the other hiddenlayers have 128 neurons. We use ReLU for all the convolu-tion and the linear layers in the multi-scale image analysismodule and the RNN.We use an adaptive gradient descent methods Adam [3]with momentum . as our optimization method. In theloss function, parameter λ for the score prediction task isset to and α for the reinforcement loss is set to . . Theinitial learning rate is . , and we train the model with epochs and linearly decay the learning rate to . .To encourage the exploration of the location sampling pol-icy, the standard deviation σ of the Gaussian distributionis linearly declined from . to . after training for R-IQAMethod PSNR SSIM[20] VSI[23]SROCC 0.749 0.814
LCC 0.738 0.821
NR-IQAMethod CNN[8] CNN++[9] Tang et al . [18]
CNN MT CNN MT+S RL RL+M RL+M+R
SROCC 0.548 0.633
Table 1. Evaluation on TID2008.
Figure 3. SROCC on TID2008 for each distortion type, comparing CNN++ [9], our CNN_MT, and our RL+M+R.
4. Experimental Results
TID2008 [14]: This dataset consists of 25 reference images, 17 types of distortions, and four levels of each type of distortion. There are in total 1700 distorted images, each of which is labeled with a Mean Opinion Score (MOS) between 0 and 9.
Evaluation: We choose the Pearson linear correlation coefficient (LCC) to measure the prediction accuracy and the Spearman rank order correlation coefficient (SROCC) to measure the prediction monotonicity.
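As a quick reference, both criteria can be computed with SciPy; the MOS arrays below are placeholder values.

```python
# Computing the two evaluation criteria with SciPy (placeholder MOS values).
import numpy as np
from scipy.stats import pearsonr, spearmanr

mos_true = np.array([5.2, 3.1, 6.7, 4.4, 2.0])   # ground-truth MOS (placeholders)
mos_pred = np.array([5.0, 3.5, 6.1, 4.8, 2.3])   # predicted scores (placeholders)

lcc, _ = pearsonr(mos_true, mos_pred)     # Pearson linear correlation (prediction accuracy)
srocc, _ = spearmanr(mos_true, mos_pred)  # Spearman rank correlation (prediction monotonicity)
print(f"LCC = {lcc:.3f}, SROCC = {srocc:.3f}")
```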
Table 2. Results (SROCC and LCC) on TID2008 for the local distortion types 12-15 and overall.
Model                                SROCC    Class. Acc.
RL+M+R without multi-resolution      0.774    80.7%
RL+M+R
Table 3. Comparison between learning with and without multi-resolution context information on TID2008.

The local contrast normalization used in our method is not applicable to the "mean shift" and "contrast change" distortions, so they are neglected in our experiments. We also ignore the last reference image because it is not a natural image. Part of the reference images and the associated distorted images are used as the training set, and the rest are split into the validation set and the testing set. The results are reported based on the median of five random splits. During testing, we set the initial location to the image center and choose the model parameters with the highest SROCC on the validation set.
We train the model with the images of the remaining 15 distortion types together. The overall results are presented in Table 1, and the SROCC evaluation for each specific distortion type is presented in Figure 3. We compare our reinforcement learning model with multi-task learning and robust averaging (RL+M+R) against several FR-IQA methods (PSNR, SSIM [20], VSI [23]) and NR-IQA methods (CNN [8], CNN++ [9], Tang et al.'s method [18]). The CNN [8] and CNN++ [9] are implemented by ourselves, strictly following the original settings. Our model outperforms most of the state-of-the-art NR-IQA and even FR-IQA methods on the TID2008 dataset. Tang et al.'s method [18] performs better than ours, but their model is pre-trained on a large-scale external dataset.

Figure 4. The last attended regions of four images with masked noise of different levels. Our model locates on or near the most salient region around the black letters.
Figure 5. The last attended regions of four images with local block-wise distortions of different levels. Our model locates on or near the block masks.
Figure 6. The sampled locations of four images with high frequency noise of different levels. The fixation moves out into the background only in the image with the most serious degradation.

In order to demonstrate the benefit of the robust averaging strategy, we implement an RL+M model which uses multi-task learning but not the robust averaging strategy. As shown in Table 1, the results of our RL+M+R are better than those of the RL+M in both SROCC and LCC.
In order to justify the importance of multi-task learning, we implement an RL model without multi-task learning and the robust averaging strategy. As shown in Table 1, the RL model performs much worse than the RL+M and the RL+M+R in both SROCC and LCC.
In order to show that the boosted performance is due to our task-driven attentional mechanism, we implement a multi-task CNN (CNN_MT) with a structure similar to our RNN. The training and testing procedures for the CNN_MT are the same as those for CNN++ [9]. As shown in Table 1, the results of the CNN_MT are worse than those of our RL+M and RL+M+R. Furthermore, we combine a saliency model [19] with the CNN_MT and name it CNN_MT+S: we first apply the saliency method to compute saliency maps of the TID2008 images, then use the saliency values as the weights to average the scores predicted by CNN_MT (a sketch of this pooling is given below). As shown in Table 1, the CNN_MT+S is better than the CNN_MT, but worse than our RL+M+R.
Figure 3 shows the SROCC values of each distortion type for the different methods. It can be seen that our model performs better on the images with local distortions, especially for Type 14, i.e., non-eccentricity pattern noise, and Type 15, i.e., local block-wise distortions of different intensity. The CNN_MT outperforms our model on a few distortion types but performs worse in the overall result. This may indicate that averaging the scores of every patch is not a good strategy for obtaining the quality score of an image. Instead, our attention-driven model, which only uses "informative" patches, is a better method.
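As referenced above, the following is a minimal sketch of the saliency-weighted pooling used for the CNN_MT+S baseline, assuming per-patch quality scores and per-patch mean saliency values are already available.

```python
# A minimal sketch of saliency-weighted score pooling (placeholder inputs).
import numpy as np

def saliency_weighted_score(patch_scores, patch_saliency):
    """Average per-patch scores with weights proportional to patch saliency."""
    scores = np.asarray(patch_scores, dtype=float)
    weights = np.asarray(patch_saliency, dtype=float)
    weights = weights / (weights.sum() + 1e-8)     # normalize the weights to sum to one
    return float(np.dot(weights, scores))
```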
Experiments on Local Distortion Types: We train our model on the images of the four local distortion types of
TID2008. As shown in Table 2, the results of our RL+M+R are better than those of the CNN_MT in both SROCC and LCC.
Learning without Multi-Resolution Information: In the proposed method, we extract multi-resolution patches. As a reference, we train a model operating on only a single fixed-size patch at each step. This model is compared with the proposed one in Table 3. Both the quality assessment results and the distortion type prediction results decline when learning without multi-resolution patches.
Classification Task: On the testing set, our model obtains 87.7% classification accuracy over the 15 distortion types. The confusion matrix is shown in Figure 7. Half of the images of Type 2 are misclassified as Type 1 because both are additive Gaussian noise, while Type 2 is operated in the luminance channel and Type 1 operates on the color components. The lower right corner of the confusion matrix shows that images with local distortions of very small extent are hard to classify correctly.

Figure 7. Confusion matrix on the testing set.
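A confusion matrix like the one in Figure 7 can be produced with scikit-learn; the label arrays below are placeholders, with distortion types indexed 0-14.

```python
# Building a (row-normalized) confusion matrix with scikit-learn (placeholder labels).
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 1, 1, 2, 13, 14])    # ground-truth distortion types (placeholders)
y_pred = np.array([0, 0, 1, 2, 13, 13])    # predicted distortion types (placeholders)

cm = confusion_matrix(y_true, y_pred, labels=np.arange(15))
cm_norm = cm / np.maximum(cm.sum(axis=1, keepdims=True), 1)   # normalize each true-class row
```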
We show some results of the attended patches selected by our model in Figures 4, 5 and 6. In Figures 4 and 5, we magnify the sampled patches at the bottom right corner of each image. Figure 4 shows the last attended regions of four images with masked noise of different intensities. The masked noise is strong in regions of high spatial frequency, and the highest spatial frequency regions are the areas around the letters on the left cap; our model locates this most salient region for all four distortion levels.
The local block-wise distortion degrades image quality by adding annoying blocks of different intensities. Figure 5 shows that our model locates the artifact blocks in the last attended region. Notice that in the last two images, even a distortion of only a few blocks is captured.
Figure 6 displays the attentional scanpaths on an image with different levels of high frequency noise. Notice that the scanpaths are different, which indicates that different levels of degradation affect the attention differently.
5. Conclusion
In this paper, we propose an attention-driven model with multi-task learning and a robust averaging strategy for general no-reference image quality assessment. We consider NR-IQA as a dynamic perception process. The model learning is implemented by a reinforcement strategy, in which the rewards of both tasks guide the learning of the optimal sampling policy to acquire the task-informative image regions so that the predictions can be made accurately and efficiently.
References

[1] J. Ba, V. Mnih, and K. Kavukcuoglu. Multiple object recognition with visual attention. arXiv preprint arXiv:1412.7755, 2014.
[2] V. De Gardelle and C. Summerfield. Robust averaging during perceptual judgment. Proceedings of the National Academy of Sciences, 108(32):13341–13346, 2011.
[3] T. Dozat. Incorporating Nesterov momentum into Adam.
[4] J. Freeman and E. P. Simoncelli. Metamers of the ventral stream. Nature Neuroscience, 14(9):1195–1201, 2011.
[5] W. Hou and X. Gao. Saliency-guided deep framework for image quality assessment. IEEE MultiMedia, 22(2):46–55, 2015.
[6] W. Hou, X. Gao, D. Tao, and X. Li. Blind image quality assessment via deep learning. IEEE Transactions on Neural Networks and Learning Systems, 26(6):1275–1286, 2015.
[7] M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. In Advances in Neural Information Processing Systems, pages 2017–2025, 2015.
[8] L. Kang, P. Ye, Y. Li, and D. Doermann. Convolutional neural networks for no-reference image quality assessment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1733–1740, 2014.
[9] L. Kang, P. Ye, Y. Li, and D. Doermann. Simultaneous estimation of image quality and distortion via multi-task convolutional neural networks. In Image Processing (ICIP), 2015 IEEE International Conference on, pages 2791–2795. IEEE, 2015.
[10] J. Kuen, Z. Wang, and G. Wang. Recurrent attentional networks for saliency detection. arXiv preprint arXiv:1604.03227, 2016.
[11] Y. Li, L.-M. Po, X. Xu, L. Feng, F. Yuan, C.-H. Cheung, and K.-W. Cheung. No-reference image quality assessment with shearlet transform and deep neural networks. Neurocomputing, 154:94–109, 2015.
[12] H. Liu and I. Heynderickx. Visual attention in objective image quality assessment: Based on eye-tracking data. IEEE Transactions on Circuits and Systems for Video Technology, 21(7):971–982, 2011.
[13] V. Mnih, N. Heess, A. Graves, et al. Recurrent models of visual attention. In Advances in Neural Information Processing Systems, pages 2204–2212, 2014.
[14] N. Ponomarenko, V. Lukin, A. Zelensky, K. Egiazarian, M. Carli, and F. Battisti. TID2008 - a database for evaluation of full-reference visual quality assessment metrics. Advances of Modern Radioelectronics, 10(4):30–45, 2009.
[15] I. Sorokin, A. Seleznev, M. Pavlov, A. Fedorov, and A. Ignateva. Deep attention recurrent Q-network. arXiv preprint arXiv:1512.01693, 2015.
[16] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
[17] H. Tang, N. Joshi, and A. Kapoor. Learning a blind measure of perceptual image quality. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 305–312. IEEE, 2011.
[18] H. Tang, N. Joshi, and A. Kapoor. Blind image quality assessment using semi-supervised rectifier networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2877–2884, 2014.
[19] W. Wang, Y. Wang, Q. Huang, and W. Gao. Measuring visual saliency by site entropy rate. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 2368–2375. IEEE, 2010.
[20] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
[21] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.
[22] L. Zhang and H. Li. SR-SIM: A fast and high performance IQA index based on spectral residual. In Image Processing (ICIP), 2012 IEEE International Conference on, pages 1473–1476. IEEE, 2012.
[23] L. Zhang, Y. Shen, and H. Li. VSI: A visual saliency-induced index for perceptual image quality assessment. IEEE Transactions on Image Processing, 23(10):4270–4281, 2014.
[24] P. Zhang, W. Zhou, L. Wu, and H. Li. SOM: Semantic obviousness metric for image quality assessment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2394–2402, 2015.
[25] W. Zhang, Y. Tian, X. Zha, and H. Liu. Benchmarking state-of-the-art visual saliency models for image quality assessment. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).