Predicting Eye Fixations Under Distortion Using Bayesian Observers
Zhengzhong Tu
University of Texas at Austin
Abstract.
Visual attention is an essential factor that affects how humans perceive visual signals. This report investigates how distortions in an image can distract human visual attention, using Bayesian visual search models, specifically Maximum-a-posteriori (MAP) [1] [2] and Entropy Limit Minimization (ELM) [3], which predict eye fixation movements under a Bayesian probabilistic framework. We conducted experiments with modified MAP and ELM models on JPEG-compressed images containing blocking or ringing artifacts, and observed that compression artifacts can affect visual attention. We hope this work sheds light on the interactions between visual attention and perceptual quality.
Keywords:
Visual attention, saliency map, visual search, eye movements, optimal fixation selection, image quality assessment
Visual attention (VA) plays a critical role in the human vision system (HVS), given the brain's limited capacity for selective neural processing and the narrow, high-resolution foveated receptive field of the retina. Visual attention is, by definition, the behavioral and cognitive process of selectively concentrating on a discrete aspect of information, whether deemed subjective or objective, while ignoring other perceivable information [4]. While visual attention guides the anatomical structures in the HVS to capture more detailed information about important scene regions in the context of a specific task, our main interest is the mechanism underlying this guidance.

Recent decades have witnessed a growing interest in exploring the mechanisms of visual attention. Studies on modeling visual attention span many facets of scientific research, including psychology, neuroscience, biology, computer vision, and robotics. Generally, there are two primary roots of motivation in the taxonomy of visual attention [5]: one is the use of attentive methods in computer vision, and the other is the development of attention models in the biological vision community. Although the two have developed independently, and the use of attention indeed appears in the computer vision literature before most biological models, their major point of intersection is the class of computational models. A computational model of visual attention, according to [6], not only includes a formal or mathematical description of how attention is computed, but also has to be testable, i.e., it can be tested by providing input stimuli such as images or videos to subjects and comparing how the model performs. We refer the reader to [7] for a comprehensive review of computational models of visual attention.

Here we restrict our scope to models built on the saliency-map assumption, which originated from Treisman and Gelade's "Feature Integration Theory" [8]. Koch and Ullman [9] then proposed a feed-forward model to combine these features and introduced the concept of the saliency map, a topographic map that represents the conspicuousness of scene locations. The first complete implementation and verification of the Koch and Ullman model was proposed by Itti et al. [10] and applied to both synthetic and natural scenes. Since then, there has been increasing interest in saliency detection and salient object detection tasks in the computer vision field.
In the computer vision community, stimulus-driven, saliency-based attention has been a popular research area over the past 25 years. Given the difficulty of accurately measuring, or even quantifying, the internal states of organisms, bottom-up attention, which is independent of these biological aspects, is much easier to understand and evaluate. More specifically, we study visual attention models that compute saliency maps, which characterize the parts of a scene that stand out relative to their neighboring parts. Here we give a clear definition of the kind of problem this paper approaches from a computational standpoint; these definitions generally follow [11].

Suppose $K$ subjects have viewed a dataset consisting of $N$ images $I = \{I_i\}_{i=1}^{N}$. Let $L_i^k = \{p_{ij}^k, t_{ij}^k\}_{j=1}^{n_i^k}$ be the vector of eye fixations $p_{ij}^k = (x_{ij}^k, y_{ij}^k)$ and their corresponding occurrence times $t_{ij}^k$ of the $k$th subject over image $I_i$, and let $n_i^k$ be the number of fixations of this subject over the $i$th image. The goal of saliency-based attention modeling is to find a function (stimulus-to-saliency mapping) $f \in F$ that minimizes the error of eye fixation prediction, i.e., minimizes $\sum_{k=1}^{K}\sum_{i=1}^{N} m\!\left(f(I_i), L_i^k\right)$, where $m \in M$ is a distance measure. An important point is that this definition better suits bottom-up models of overt visual attention, and may not cover some other aspects of visual attention (e.g., covert attention or top-down factors).

The problem statement above is for general saliency research, either given some computer vision task or without any instructions to the subjects. Many technologies and applications of these models have been developed for computer vision and robotics purposes such as image segmentation, object detection, and human-robot interaction, but few have been developed for visual quality research. Hence, we restrict our interest to image quality assessment, since little attention has been paid to the variation of the saliency map under degradation of image quality. In the next section, we articulate the models investigated in this paper, as well as their applications to image quality research.
Visual search models provide a very intuitive way to generate saliency maps, in that they directly predict eye fixation positions, from which a saliency map can be constructed by simple Gaussian-blurred linear operations. In the visual search literature, many researchers have proposed models of searchers' fixation selection strategies, from random search [12] and tiling search [13] to feature-based search [8] [14] [2]. Despite decades of research, however, relatively little is known about how humans actually select fixation locations in visual search.

In recent years, several works [15] [16] [3] [17] have shown that a Bayesian ideal searcher under a general probabilistic framework can perform statistically as well as human eye movements. We therefore develop an eye fixation prediction algorithm based on the Bayesian inference model proposed by Najemnik and Geisler [15]. Conceptually, this Bayesian model, depicted in the flowchart of Fig. 1, works as follows: a searcher starts with some initial prior over where the target would be. The searcher then encodes the image with a foveated visual system to optimally update the prior, producing posterior probabilities. If the posterior at some location becomes large enough, the search is stopped and the position with the largest posterior is chosen as the next fixation; if the posteriors are all below the given threshold, the searcher uses some cost function to optimally pick the next fixation location. Under this framework, popular searchers such as MAP [1], ELM [3], and nELM [17] have been developed to model human eye movement strategies. However, this optimal fixation search framework has generally been used for target search tasks over natural image backgrounds, and is not directly applicable to free viewing of distorted images. In the next section, we slightly modify and redefine some notation so that the framework better fits the context of predicting free-viewing eye fixations on pristine or distorted images.
This section describes in detail how we implemented the eye fixation search algorithm based on the Bayesian ideal search model of Fig. 1 [15]. We first define the visibility map (Section 2.1), the image patch response (Section 2.2), the Bayesian updating scheme (Section 2.3), and the optimal selection strategy (Section 2.4) as the essential modules of this Bayesian framework, and then integrate them to present the overall procedure of each simulated search trial for the three representative searchers MAP [1], ELM [3], and nELM [17] in Section 2.5.
The detrimental effect of retinal eccentricity on the visibility of a target or region is implemented by modeling detectability, $d'$, as a function of eccentricity $\epsilon(i, k(t))$:
Fig. 1: Flow chart of the ideal Bayesian searcher in a probabilistic framework [15]

Fig. 2: Representation of (a) the visibility map and (b) the inhibition function, both for a fixation at the image center
\[
d'_{ik(t)} = \max\left\{0.01,\ \mu \exp\!\left[-\frac{\epsilon(i, k(t))}{\sigma}\right]\right\} \tag{1}
\]
where the magnitude $\mu$ and standard deviation $\sigma$ are fit to the measured visibility map of each human subject, and $\epsilon(i, k(t))$ is simply the distance between target response location $i$ and fixation location $k(t)$. Note that we restrict the visibility value to be no less than
0.01 for stabilization. For simplicity, we fix the parameters in our simulations to $\mu = 5$, $\sigma = 50$, and take $\epsilon$ to be the Euclidean distance $\|\cdot\|$. The simulated visibility map for a fixation at the center of the image is shown in Fig. 2a.
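To make Eq. 1 concrete, the following is a minimal sketch of the visibility map computation in Python/NumPy; the parameter values $\mu = 5$ and $\sigma = 50$ follow the simulation settings above, while the patch-tiling helper and the function names are our own illustration rather than the exact implementation.

```python
import numpy as np

def patch_centers(width, height, patch=16):
    """Centers of the 16x16 patches that tile a width x height image."""
    xs = np.arange(patch // 2, width, patch)
    ys = np.arange(patch // 2, height, patch)
    gx, gy = np.meshgrid(xs, ys)
    return np.stack([gx.ravel(), gy.ravel()], axis=1)  # (M, 2) patch centers

def visibility_map(centers, fixation, mu=5.0, sigma=50.0, floor=0.01):
    """Eq. 1: detectability d'_{ik(t)} as a function of retinal eccentricity.

    Eccentricity is the Euclidean distance between each candidate patch
    center i and the current fixation k(t); values are floored at 0.01.
    """
    ecc = np.linalg.norm(centers - np.asarray(fixation, float), axis=1)
    return np.maximum(floor, mu * np.exp(-ecc / sigma))

if __name__ == "__main__":
    centers = patch_centers(768, 512)                        # e.g. a 768x512 image
    d_prime = visibility_map(centers, fixation=(384, 256))   # fixate the center
    print(d_prime.shape, float(d_prime.max()), float(d_prime.min()))
```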
First, we divide the image $I$ (resolution $W \times H$) into small patches of size $16 \times 16$, so that $M = (W \times H)/(16 \times 16)$ patches $\{P_i\}_{i=1}^{M}$ are obtained. We limit the potential fixation positions to the centers of these patches to simplify both the analysis and the computational load. We therefore obtain the fixation set $k(T) = \{k(t)\}_{t=1}^{T}$, where $k(t)$ denotes the location of fixation $t$. Note that here we have $M = T$.

We can now define the natural image response at a given image patch $P_i$. Unlike the template response settings in [16], here we instead choose three features that are highly related to region conspicuousness to roughly define the response variable $W_{ik(t)}$ at patch $P_i$, modeled as a random variable sampled from a Gaussian distribution:
\[
W_{ik(t)} = E(C_i, L_i, H_i) + N_{ik(t)}, \quad \forall i = 1, \ldots, M, \qquad
E(C_i, L_i, H_i) = \alpha \cdot C_i^{\beta} \cdot L_i^{\gamma} \cdot H_i^{\eta}, \qquad
N_{ik(t)} \sim \mathcal{N}\!\left(0,\ 1/d'^{\,2}_{ik(t)}\right) \tag{2}
\]
where $W_{ik(t)}$ is the response variable with expectation $E(C_i, L_i, H_i)$, and $N_{ik(t)}$ is the simulated response noise. $C_i$, $L_i$, and $H_i$ are the saliency-related features defined as the RMS contrast, average luminance, and image entropy calculated for each patch $P_i$. $\alpha$ is a constant that scales the response map of the whole image to have mean one, whereas $\beta, \gamma, \eta$ are weighting constants that can be fit to data. $N_{ik(t)}$ is a sample of internal noise drawn from a normal distribution with standard deviation proportional to $1/d'_{ik(t)}$; we ignore external noise for simplicity. Fig. 3a shows the example image 'parrots.bmp' from the LIVE image quality database [18], and Fig. 3c displays the three defined response maps as well as the overall normalized response map. It can be seen that this response map is highly correlated with visual saliency.

In many image and video communication applications, images often need to be compressed to save streaming bandwidth. However, heavy compression results in a large drop in visual quality; blocking and ringing artifacts are the most common visual degradations introduced by modern image coding standards such as JPEG and JPEG2000. Thus, given a compressed image with noticeable blocking or ringing artifacts, like Fig. 3b, it is natural to consider both the original response map defined in Eq. 2 and a blocking-artifact attractor. Since there is no good metric for evaluating the perceived blocking or ringing artifacts of a given image patch, we use the local BRISQUE [19] value to approximate the local visual response to blockiness or ringing. This is reasonable because BRISQUE can capture image quality degradation, including compression artifacts, using natural scene statistics features and a regression approach. We prefer BRISQUE to NIQE [20] because BRISQUE is trained and predicted by SVR [21] under the supervision of human scores, and therefore exhibits more stability and linearity than NIQE (in other words, NIQE is highly nonlinear and unstable).

Fig. 3: Proposed image patch response features. (a) Natural image 'parrots'; (b) JPEG-compressed 'parrots'; (c) RMS contrast, luminance, and entropy responses for image (a); (d) RMS contrast, luminance, entropy, and blockiness responses for image (b)
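As an illustration of how the per-patch features and the noisy response of Eq. 2 (and its blockiness-modulated variant, Eq. 3) could be computed, here is a minimal sketch assuming an 8-bit grayscale image; the exact definitions of RMS contrast, patch entropy, and the normalization step are our own simplifications, and the blockiness score $B_i$ (a per-patch BRISQUE-style value) is taken as a given input rather than reimplemented.

```python
import numpy as np

def patch_features(gray, patch=16, eps=1e-6):
    """RMS contrast C_i, mean luminance L_i, and entropy H_i per 16x16 patch."""
    H, W = gray.shape
    C, L, E = [], [], []
    for y in range(0, H - patch + 1, patch):
        for x in range(0, W - patch + 1, patch):
            p = gray[y:y + patch, x:x + patch].astype(float)
            L.append(p.mean())
            C.append((p / 255.0).std() + eps)              # RMS contrast of normalized intensities
            hist, _ = np.histogram(p, bins=32, range=(0, 255))
            prob = hist[hist > 0] / hist.sum()
            E.append(-(prob * np.log2(prob)).sum())        # patch entropy (bits)
    return np.array(C), np.array(L), np.array(E)

def expected_response(C, L, E, B=None, beta=1.0, gamma=1.0, eta=1.0, tau=1.0):
    """E(C_i, L_i, H_i[, B_i]) in Eq. 2 / Eq. 3; B_i is an optional blockiness score."""
    base = (C ** beta) * (L ** gamma) * (E ** eta)
    if B is not None:                                      # Eq. 3: normalize, then modulate
        base = base / base.mean() * (B ** tau)
    return base / base.mean()                              # alpha chosen so the map has mean one

def noisy_response(expected, d_prime, rng=np.random.default_rng(0)):
    """W_{ik(t)} = E(...) + N_{ik(t)}, noise std proportional to 1/d'_{ik(t)}."""
    return expected + rng.normal(0.0, 1.0 / d_prime)
```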
From the 'Patch brisque map' shown in Fig. 3d, we can see that it is fairly well correlated with the annoyance of blocking/ringing artifacts in local image patches. Therefore, considering all four factors that drive our attention simultaneously and jointly, the response variable $W_{ik(t)}$ in Eq. 2 can be further modulated by a 'blockiness' index $B_i$ modeled by the local BRISQUE value, which gives Eq. 3:
\[
W_{ik(t)} = E(C_i, L_i, H_i, B_i) + N_{ik(t)}, \quad \forall i = 1, \ldots, M, \qquad
E(C_i, L_i, H_i, B_i) = \alpha \cdot \mathrm{Normalize}\!\left(C_i^{\beta} \cdot L_i^{\gamma} \cdot H_i^{\eta}\right) \cdot B_i^{\tau}, \qquad
N_{ik(t)} \sim \mathcal{N}\!\left(0,\ 1/d'^{\,2}_{ik(t)}\right) \tag{3}
\]

Determining the next fixation requires the two key components of the ideal Bayesian searcher: optimally updating the posterior distribution, and optimally selecting successive fixation locations. These two parts are the most critical components in the flowchart of Fig. 1. First, we discuss how to optimally update the posterior probability density given the current fixation $T$.

Suppose $H_i$ is the hypothesis that the potential interesting target position is at the $i$th location (here $i \in \{1, \ldots, M\}$). By Bayes' formula, the posterior $p_{H_i}$ (abbreviated $p_i$) is proportional to the product of the prior probability at location $i$ and the joint likelihood of all block responses at all possible $T$ fixation locations. Let the vector $W(t) = (W_{1k(t)}, \ldots, W_{Mk(t)})$ collect all block responses at fixation $t$. Given the current fixation $T$, we can update the posterior as
\[
p_i(T) \propto \mathrm{prior}(i) \cdot p_i\!\left(W(1), \ldots, W(T) \mid i\right) \tag{4}
\]
which, if the responses are independent across fixations, becomes
\[
p_i(T) \propto \mathrm{prior}(i) \cdot \prod_{t=1}^{T} \prod_{i=1}^{M} p\!\left(W_{ik(t)} \mid i\right) \tag{5}
\]
Najemnik and Geisler have shown in the supplement of their paper [15] that Eq. 5 is equivalent to updating by a running summation of the visibility-weighted responses at the $i$th location over all fixations $t = 1, \ldots, T$, i.e.,
\[
p_i(T) \propto \mathrm{prior}(i) \cdot \exp\!\left(\sum_{t=1}^{T} d'_{ik(t)} W_{ik(t)}\right) \tag{6}
\]
where $t$ is the fixation index, and $d'_{ik(t)}$ and $W_{ik(t)}$ are the retinal visibility and the response variable at patch location $i$ when the fixation is at display location $k(t)$, as defined in Sections 2.1 and 2.2. The second term in Eq. 6 can be seen as the joint likelihood of the patch responses from all potential fixations $k(t)$. Note that Eq. 6 holds for the case in which the hypotheses are mutually exclusive, and both stimulus noise and internal noise are statistically independent over time [15].
Here the prior for each location is initialized to be uniform (i.e., $\mathrm{prior}(i) = 1/M$, $i = 1, \ldots, M$). To account for 'inhibition of return' [22], however, we suppress the priors in the neighborhood of the most recent historical fixations (assuming a time interval of 250 ms per saccade). For consistency, we use the same function and parameters as in the visibility map of Eq. 1. Hence, the priors are modulated by the 'inhibition of return' effect as shown in Eq. 7, where $n$ is the number of most recent fixations retained for inhibition. Following [23], we use a linearly decreasing weight $\alpha$, proportional to how recent the historical fixation is, to weight the inhibition effect, as shown in Eq. 8. Here we set the inhibition history to $n = 8$, i.e., assuming the searcher can remember the last 2 seconds in the free-viewing situation, with the inhibition effect descending linearly with weights $\{1, \tfrac{n-1}{n}, \ldots, \tfrac{1}{n}\}$.
\[
\mathrm{prior}_{\mathrm{inhib}}(i) := \mathrm{prior}(i) \cdot \prod_{t=T-n}^{T} \left\{1 - \alpha \cdot \exp\!\left[-\frac{\epsilon(i, k(t))}{\sigma}\right]\right\} \tag{7}
\]
\[
\alpha = 1 - \frac{T - t}{n} \in \left\{\tfrac{1}{n}, \tfrac{2}{n}, \ldots, 1\right\} \tag{8}
\]
Thus, the posterior for each potential target location is optimally updated by multiplying the history-suppressed prior by the joint likelihood of the current target patch from all potential fixations. That is, the overall posterior is updated according to Eq. 9, which should be normalized by the sum of the posteriors over all potential target locations, $\sum_{j=1}^{M} p_j(T)$, so that it is a probability measure.
\[
p_i(T) \propto \mathrm{prior}_{\mathrm{inhib}}(i) \cdot \exp\!\left(\sum_{t=1}^{T} d'_{ik(t)} W_{ik(t)}\right) \tag{9}
\]
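The following minimal sketch implements the inhibition-of-return priors (Eqs. 7-8) and the posterior update (Eq. 9), keeping the running sum of visibility-weighted responses explicitly and subtracting its maximum before exponentiation for numerical stability; the function names and that bookkeeping detail are our own, but the update is algebraically the same as Eq. 9.

```python
import numpy as np

def inhibited_prior(prior, centers, past_fixations, sigma=50.0, n=8):
    """Eqs. 7-8: suppress the prior around the n most recent fixations,
    with linearly decreasing inhibition weights alpha in {1, (n-1)/n, ..., 1/n}."""
    prior = prior.copy()
    recent = past_fixations[-n:]
    for age, fix in enumerate(reversed(recent)):          # age 0 = most recent fixation
        alpha = 1.0 - age / float(n)
        ecc = np.linalg.norm(centers - np.asarray(fix, float), axis=1)
        prior *= 1.0 - alpha * np.exp(-ecc / sigma)
    return prior

def update_posterior(prior_inhib, running_sum, d_prime, response):
    """Eq. 9: posterior proportional to the inhibited prior times
    exp(sum_t d'_{ik(t)} W_{ik(t)}), kept as a running sum across fixations."""
    running_sum = running_sum + d_prime * response
    post = prior_inhib * np.exp(running_sum - running_sum.max())  # constant factor, removed by normalization
    return post / post.sum(), running_sum
```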
Since we are not conducting a target search task here, we do not set a probability criterion for stopping the search chain; instead, the searcher keeps iteratively computing the next optimal fixation until the iterations are cut off. We select three representative Bayesian ideal searchers, dubbed Maximum-a-posteriori (MAP) [1], [2], Entropy Limit Minimization (ELM) [3], and normalized ELM (nELM) [17], under the ideal Bayesian probabilistic framework (Fig. 1) for our simulations. We now present the optimization strategy and the corresponding reward function for each of them.

To compute the optimal next fixation point, $k_{opt}(T+1)$, the MAP searcher always fixates the location with the maximum posterior probability after the optimal Bayesian update, i.e.,
\[
k^{(\mathrm{MAP})}_{opt}(T+1) = \arg\max_i \; p_i(T) \tag{10}
\]
where $p_i(T)$ is the posterior probability distribution updated by Eq. 9. According to this rule, the MAP searcher always fixates the scene location containing the most salient content, since these areas produce the largest responses.

The ideal searcher [15], however, considers each possible next fixation and picks the location that, given its knowledge of the current posterior probabilities and the visibility map, maximizes the probability of correctly identifying the target location after the fixation, as shown in Eq. 11:
\[
k^{(\mathrm{Ideal})}_{opt}(T+1) = \arg\max_{k(T+1)} \; \sum_{i=1}^{N} p_i(T)\, p\!\left(C \mid i, k(T+1)\right) \tag{11}
\]
The ideal searcher of Eq. 11 maximizes accuracy in the target search task. Given its computational complexity, however, Najemnik and Geisler derived a simple heuristic called the entropy limit minimization (ELM) searcher [3], which produces near-optimal fixation selection and fixation statistics similar to those of humans. They proved that, under some simple summation-rule assumptions, the expected information gain of fixation $k(T+1)$ relative to the previous fixation $k(T)$ (Eq. 12) is equivalent to linearly filtering the current posterior across the possible target locations (Eq. 15); Eq. 15 is derived from Eqs. 12-14. Therefore, the ELM searcher chooses the next fixation location $k(T+1)$ that maximizes the expected information gain as approximated by the summation in Eq. 16, where the posteriors are updated in the same manner as for MAP (Eq. 9).
\[
E\!\left[\Delta H(T+1) \mid k(T+1)\right] = E\!\left[H(T+1) \mid k(T+1)\right] - H(T) \tag{12}
\]
\[
E\!\left[H(T+1) \mid k(T+1)\right] = -E\!\left[\sum_{i=1}^{n} P_i(T+1) \log P_i(T+1) \,\Big|\, k(T+1)\right] \tag{13}
\]
\[
P_i(T+1) = \frac{p_i(T) \exp\!\left[d'_{ik(T+1)} W_{ik(T+1)}\right]}{\sum_{j=1}^{n} p_j(T) \exp\!\left[d'_{jk(T+1)} W_{jk(T+1)}\right]} \tag{14}
\]
\[
E\!\left[\Delta H(T+1) \mid k(T+1)\right] \approx \sum_{i=1}^{N} p_i(T)\, d'_{ik(T+1)} \tag{15}
\]
\[
k^{(\mathrm{ELM})}_{opt}(T+1) = \arg\max_{k(T+1)} \; \sum_{i=1}^{M} p_i(T)\, d'_{ik(T+1)} \tag{16}
\]
The normalized ELM (nELM) searcher [17], which accounts for the variation of the detectability map modulated by the local contrast at the gaze point, was built on ELM to further generalize it to non-stationary natural backgrounds. The nELM searcher first normalizes the posteriors by local contrast and then picks the fixation location with the maximum expected information gain, as expressed in Eq. 17:
\[
p^{(\mathrm{Norm})}_i(T) := p_i(T)/C_i, \quad i = 1, \ldots, M, \qquad
k^{(\mathrm{nELM})}_{opt}(T+1) = \arg\max_{k(T+1)} \; \sum_{i=1}^{M} p^{(\mathrm{Norm})}_i(T)\, d'_{ik(T+1)} \tag{17}
\]
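Once the posterior and visibility map are available, the three selection rules (Eqs. 10, 16, 17) reduce to a few lines. The sketch below uses our own function names and assumes the visibility_map helper of Eq. 1 sketched earlier; ELM's information gain is computed by brute force over all candidate fixation locations.

```python
import numpy as np

def next_fixation_map(posterior, centers):
    """Eq. 10: MAP fixates the patch with the largest posterior probability."""
    return centers[int(np.argmax(posterior))]

def next_fixation_elm(posterior, centers, visibility_map, **vis_kw):
    """Eq. 16: ELM fixates the candidate location k(T+1) maximizing
    sum_i p_i(T) * d'_{i k(T+1)}, i.e. the posterior filtered by the
    visibility map centered at each candidate fixation."""
    gains = [np.dot(posterior, visibility_map(centers, c, **vis_kw))
             for c in centers]
    return centers[int(np.argmax(gains))]

def next_fixation_nelm(posterior, contrast, centers, visibility_map, **vis_kw):
    """Eq. 17: nELM first normalizes the posterior by local contrast C_i,
    then applies the ELM rule."""
    p_norm = posterior / contrast
    return next_fixation_elm(p_norm, centers, visibility_map, **vis_kw)
```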
Algorithm 1: The procedure of each simulated search trial for the three searchers {MAP, ELM, nELM}
1: Fixation begins at the center of the image (center prior).
2: Initialize the visibility map $d'_{ik(t)}$ by Eq. 1 for each possible fixation location $k(t)$.
3: Parallel encoding of the image: at the current fixation $k(t)$, an independent Gaussian noise sample $N_{ik(t)}$ (with zero mean and variance $1/d'^{\,2}_{ik(t)}$) is generated for each target patch $P_i$, and the response variable $W_{ik(t)}$ at each patch $P_i$ is computed by Eq. 2 or Eq. 3 (Eq. 2 for a pristine image such as Fig. 3a; Eq. 3 for a compressed image such as Fig. 3b).
4: Optimal Bayesian updating: after fixation $T$ is made, the posterior probability at each image patch location $i$ is updated by the Bayes formula of Eq. 9, with priors modulated by Eq. 7, as follows. In other words, optimal integration of information across fixations is achieved by keeping a running sum of the visibility-weighted responses at each potential location.
\[
p_i(T) \propto \mathrm{prior}_{\mathrm{inhib}}(i) \cdot \exp\!\left(\sum_{t=1}^{T} d'_{ik(t)} W_{ik(t)}\right), \qquad
\mathrm{prior}_{\mathrm{inhib}}(i) := \mathrm{prior}(i) \cdot \prod_{t=T-n}^{T} \left\{1 - \alpha \cdot \exp\!\left[-\frac{\epsilon(i, k(t))}{\sigma}\right]\right\}
\]
5: Optimally choose the next fixation: to compute the optimal next fixation point, $k_{opt}(T+1)$, the searcher considers each possible next fixation and picks the location with the highest expected value, a product of sensory evidence and potentially earned reward, defined differently for each searcher:
SWITCH method ∈ {'MAP', 'ELM', 'nELM'} DO
  CASE 'MAP': $k_{\mathrm{MAP}}(T+1) = \arg\max_i \; p_i(T)$
  CASE 'ELM': $k_{\mathrm{ELM}}(T+1) = \arg\max_{k(T+1)} \sum_{i=1}^{M} p_i(T)\, d'_{ik(T+1)}$
  CASE 'nELM': $p^{(\mathrm{Norm})}_i(T) := p_i(T)/C_i$ for $i = 1, \ldots, M$; $k_{\mathrm{nELM}}(T+1) = \arg\max_{k(T+1)} \sum_{i=1}^{M} p^{(\mathrm{Norm})}_i(T)\, d'_{ik(T+1)}$
6: The process jumps back to Step 3 and repeats Steps 3-5 to predict the next optimal fixation location.
Based on the above assumptions and definitions, the steps of each simulated search trial are summarized in Algorithm 1.
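Putting the pieces together, here is a minimal end-to-end sketch of one simulated free-viewing trial with the ELM searcher, assuming the helper functions sketched in the previous subsections (patch_centers, visibility_map, patch_features, expected_response, noisy_response, inhibited_prior, update_posterior, next_fixation_elm) and an image whose dimensions are multiples of the patch size. It is an illustration of Algorithm 1 on a pristine image, not a reference implementation.

```python
import numpy as np

def simulate_trial_elm(gray, n_fixations=12, patch=16, sigma=50.0, n_inhib=8):
    """One simulated free-viewing trial with the ELM searcher (Algorithm 1)."""
    centers = patch_centers(gray.shape[1], gray.shape[0], patch)
    M = len(centers)
    C, L, E = patch_features(gray, patch)
    expected = expected_response(C, L, E)                  # Eq. 2 (pristine image)

    prior = np.full(M, 1.0 / M)                            # uniform prior over patch locations
    running_sum = np.zeros(M)
    fix = (gray.shape[1] // 2, gray.shape[0] // 2)         # step 1: start at the center
    fixations = [fix]

    for _ in range(n_fixations - 1):
        d_prime = visibility_map(centers, fix, sigma=sigma)            # step 2
        response = noisy_response(expected, d_prime)                   # step 3
        prior_inh = inhibited_prior(prior, centers, fixations,         # step 4 (Eqs. 7-8)
                                    sigma=sigma, n=n_inhib)
        posterior, running_sum = update_posterior(prior_inh, running_sum,
                                                  d_prime, response)   # step 4 (Eq. 9)
        fix = tuple(next_fixation_elm(posterior, centers, visibility_map,
                                      sigma=sigma))                    # step 5 (Eq. 16)
        fixations.append(fix)                                          # step 6: repeat
    return fixations
```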
As an initial test to validate our implementation of the searchers, we generate a spatial 1/f noise image, which has the same spatial power spectrum shape as natural images. Note that the 1/f image has stationary local contrast, so there is no need to test the nELM algorithm on it. We can see from Fig. 4a and 4b that ELM often chooses the same fixation point as MAP, because the information-gain map is simply the posteriors blurred by the visibility map. However, MAP can saccade to locations near the image border, whereas ELM tends to jump only a moderate distance, since nearby posteriors are pushed down by 'inhibition of return' and the information gain at distant locations tends to be low. Fig. 4d shows the history-inhibited information-gain map, which has much lower values near the borders and at previously visited locations.
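For reference, one common way to generate such a test image is to impose a 1/f amplitude spectrum (1/f² power spectrum) on random phases; the sketch below follows that recipe, though the exact procedure used in our simulations may differ.

```python
import numpy as np

def pink_noise_image(height=512, width=512, seed=0):
    """2-D noise whose amplitude spectrum falls off as 1/f (power as 1/f^2)."""
    rng = np.random.default_rng(seed)
    fy = np.fft.fftfreq(height)[:, None]
    fx = np.fft.fftfreq(width)[None, :]
    f = np.sqrt(fx ** 2 + fy ** 2)
    f[0, 0] = 1.0                                  # avoid division by zero at DC
    amplitude = 1.0 / f
    phase = rng.uniform(0, 2 * np.pi, size=(height, width))
    spectrum = amplitude * np.exp(1j * phase)
    img = np.real(np.fft.ifft2(spectrum))          # real part keeps the 1/f falloff
    img -= img.min()
    img /= img.max()                               # normalize to [0, 1]
    return img
```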
We then selected the image 'parrots.bmp' from the LIVE Image Quality Database [18], on which we tested Algorithm 1 with an inhibition history of n = 8 and linearly decreasing weights. Fig. 5 shows the search results using MAP (Fig. 5a), ELM (Fig. 5c), and nELM (Fig. 5e), respectively. The right column of Fig. 5 shows examples of the likelihood map, posteriors, information-gain map, and normalized posteriors, all at the last predicted fixation. As the figures show, MAP tends to choose fixations only in the most salient areas, whereas ELM sometimes fixates locations with lower posterior probability but higher information gain. Note that in Fig. 5e we do not regard the fixations predicted by nELM as good enough, since almost all of the saccades land on uninteresting points and the saccade distances are relatively long. This is likely because we simply normalize the posteriors by dividing by the local contrast, which can make the posterior of a low-contrast region unexpectedly large after normalization. This problem may be improved by fine-tuning the normalization function based on subjective human studies.

Since our goal is to study the distracting effects of image compression artifacts on predicted scanpaths, we used the same image 'parrots.bmp' and compressed it with Quality = 5 using the imwrite function in MATLAB.
Fig. 4: Fixation prediction test on the 1/f noise image

After obtaining the patch response map, we applied histogram equalization to make its distribution more dispersed and smooth. We then used Eq. 3 with parameters (β, γ, η, τ) = (1, ·, ·, 1) to calculate the responses and updated the posteriors accordingly. Fig. 6 shows the first 12 fixations predicted by MAP (Fig. 6a), ELM (Fig. 6c), and nELM (Fig. 6e). We observed that all of the searchers are distracted to some extent by the noticeable blocking or ringing artifacts, while each searcher remains identifiable by its own search pattern. The zoomed image region in Fig. 6c, which is full of clearly perceptible blocking artifacts, is selected as the first fixation location by ELM, which supports our expectation that noticeable image quality degradation indeed distracts visual attention away from otherwise salient areas.
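A minimal sketch of this preprocessing (heavy JPEG compression and histogram equalization of the patch response map) is given below for reproducibility; the report itself used MATLAB's imwrite with Quality = 5, and here Pillow's quality parameter and a plain CDF remapping stand in for those steps. The file names are hypothetical.

```python
import numpy as np
from PIL import Image

def compress_jpeg(src_path, dst_path, quality=5):
    """Re-encode an image as a heavily compressed JPEG (blocking/ringing artifacts)."""
    Image.open(src_path).convert("RGB").save(dst_path, "JPEG", quality=quality)

def equalize_map(values, bins=256):
    """Histogram-equalize a 1-D patch response map so its distribution is
    more dispersed and smooth before computing Eq. 3."""
    hist, edges = np.histogram(values, bins=bins)
    cdf = np.cumsum(hist).astype(float)
    cdf /= cdf[-1]
    return np.interp(values, edges[:-1], cdf)

# Example (hypothetical file names):
# compress_jpeg("parrots.bmp", "parrots_q5.jpg", quality=5)
```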
Fig. 5: Fixation prediction results on the natural image 'parrots'. (a) First 12 fixations predicted by MAP; (b) likelihood map (top) and posteriors (bottom); (c) first 12 fixations predicted by ELM; (d) posteriors (top) and information-gain map (bottom); (e) first 12 fixations predicted by nELM; (f) normalized posteriors (top) and information-gain map (bottom)

Fig. 6: Fixation prediction results on the compressed natural image 'parrots'

Fig. 7: More eye fixation results using the ELM searcher on images from the LIVE database. (a) ELM on 'bikes.bmp'; (b) ELM on compressed 'bikes.jpg'; (c) ELM on 'caps.bmp'; (d) ELM on compressed 'caps.jpg'; (e) ELM on 'ocean.bmp'; (f) ELM on compressed 'ocean.jpg'; (g) ELM on 'sailing4.bmp'; (h) ELM on compressed 'sailing4'

In conclusion, this report explores the Bayesian probabilistic framework shown in Fig. 1 and slightly modifies it to predict human eye fixations during free viewing of natural images degraded by compression artifacts. We defined the response variables of a distortion-free image patch in a manner similar to Najemnik and Geisler [15]. Given a compressed image with noticeable distortions, we further defined a 'blockiness response map' as an approximation of how salient the blocking artifacts appear to a human observer, and used this distortion map to compete with the traditional saliency map (i.e., the distortion-free response map). Integrating this additional visual factor, dubbed 'blockiness', into the basic response variable, we simulated three classic fixation searchers under a Bayesian probabilistic framework: MAP [1], ELM [3], and nELM [17]. The experimental results on different images from the LIVE database show that Bayesian visual search is indeed disturbed by the annoying blocking artifacts. These simulations provide preliminary evidence for the distracting effects of image distortions on visual search, and build a preliminary model based on the Bayesian probabilistic framework for subsequent explorations of the interactions between visual attention and image quality.

Lastly, we discuss some deficiencies of the implementations in this report, as well as their potential improvements:

– Accuracy of the detectability map.
For mathematical convenience, we assume a simplified model, shown as the Gaussian-shaped map in Fig. 2a, in which target detectability depends only on eccentricity and is isotropic. However, this is not quite right, since the real retinotopic detectability map looks more like an ellipse and is also tuned by background contrast. The searchers implemented in this report would be expected to achieve more human-like performance if a more accurate retinotopic detectability map were used. – Accuracy of the redefined response map.
Here we redefine the image response map, which indicates the saliency of a patch, as the product of patch contrast, luminance, and entropy, which is a very rough approximation; the 'blockiness response map' is likewise quite loose. One could improve this, for example, by leveraging a state-of-the-art saliency model. In our view, given a distorted image, how to define the response map is the trickiest part of both searchers within this Bayesian framework, since it directly weights the posteriors and thereby largely decides the next optimal fixation location. – Effects of 'inhibition of return'.
We noticed that the 'inhibition of return' phenomenon has a larger impact on driving the next eye movement than maximizing the reward function does. Specifically, attention is encouraged toward new locations that fixations have not yet visited. Here we simply set the memorized fixation history to last 2 seconds (i.e., the 8 previous fixations) with descending weights, which needs further study. – Subjective verification.
Currently, we only show visual examples of the eye fixations predicted by the Bayesian search framework; the accuracy of approximating human fixations has not been validated in this report. A subjective experiment recording human eye fixations on distorted images should be conducted to verify the hypotheses in this paper. – Improving video quality models.
Objective video quality models have attracted increasing research interest recently [24, 25]. It would be of great significance to study the effect of eye fixations on the performance of video quality models, especially for the recently popular user-generated content (UGC). We may also adopt a content-adaptive streaming scheme [26] if human eye fixations can be accurately predicted, in order to save bitrate.
References
1. Findlay, J.M.: Global visual processing for saccadic eye movements. Vision Research (8) (1982) 1033-1045
2. Eckstein, M.P., Beutter, B.R., Stone, L.S.: Quantifying the performance limits of human saccadic targeting during visual search. Perception (11) (2001) 1389-1401
3. Najemnik, J., Geisler, W.S.: Simple summation rule for optimal fixation selection in visual search. Vision Research (10) (2009) 1286-1294
4. Wikipedia contributors: Attention — Wikipedia, the free encyclopedia (2019) [Online; accessed 6-May-2019]
5. Tsotsos, J.K., Rothenstein, A.: Computational models of visual attention. Scholarpedia 6(1) (2011) 6201
6. Itti, L., Koch, C.: Computational modelling of visual attention. Nature Reviews Neuroscience (3) (2001) 194
7. Frintrop, S., Rome, E., Christensen, H.I.: Computational visual attention systems and their cognitive foundations: A survey. ACM Transactions on Applied Perception (TAP) (1) (2010) 6
8. Treisman, A.M., Gelade, G.: A feature-integration theory of attention. Cognitive Psychology (1) (1980) 97-136
9. Koch, C., Ullman, S.: Shifts in selective visual attention: towards the underlying neural circuitry. In: Matters of Intelligence. Springer (1987) 115-141
10. Itti, L., Koch, C., Niebur, E.: A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis & Machine Intelligence (11) (1998) 1254-1259
11. Borji, A., Itti, L.: State-of-the-art in visual attention modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence (1) (2013) 185-207
12. Engel, F.L.: Visual conspicuity, visual search and fixation tendencies of the eye. Vision Research (1) (1977) 95-108
13. Geisler, W.S., Chou, K.L.: Separation of low-level and high-level factors in complex tasks: visual search. Psychological Review (2) (1995) 356
14. Wolfe, J.M.: What can 1 million trials tell us about visual search? Psychological Science (1) (1998) 33-39
15. Najemnik, J., Geisler, W.S.: Optimal eye movement strategies in visual search. Nature (7031) (2005) 387
16. Najemnik, J., Geisler, W.S.: Eye movement statistics in humans are consistent with an optimal search strategy. Journal of Vision (3) (2008) 4-4
17. Abrams, J., Geisler, W.: Visual search in natural scenes: a double-dissociation paradigm for comparing observer models. Journal of Vision (12) (2015) e755-e755
18. Sheikh, H.R., Sabir, M.F., Bovik, A.C.: A statistical evaluation of recent full reference image quality assessment algorithms. IEEE Transactions on Image Processing (11) (2006) 3440-3451
19. Mittal, A., Moorthy, A.K., Bovik, A.C.: No-reference image quality assessment in the spatial domain. IEEE Transactions on Image Processing (12) (2012) 4695-4708
20. Mittal, A., Soundararajan, R., Bovik, A.C.: Making a "completely blind" image quality analyzer. IEEE Signal Processing Letters (3) (2013) 209-212
21. Smola, A.J., Schölkopf, B.: A tutorial on support vector regression. Statistics and Computing (3) (2004) 199-222
22. Klein, R.M.: Inhibition of return. Trends in Cognitive Sciences (4) (2000) 138-147
23. Pylyshyn, Z.: The role of location indexes in spatial perception: A sketch of the FINST spatial-index model. Cognition (1) (1989) 65-97
24. Tu, Z., Wang, Y., Birkbeck, N., Adsumilli, B., Bovik, A.C.: UGC-VQA: Benchmarking blind video quality assessment for user generated content. arXiv preprint arXiv:2005.14354 (2020)
25. Chen, L.H., Bampis, C.G., Li, Z., Norkin, A., Bovik, A.C.: ProxIQA: A proxy approach to perceptual optimization of learned image compression. IEEE Transactions on Image Processing 30