Predicting Eye Fixations Under Distortion Using Bayesian Observers
Zhengzhong Tu
University of Texas at Austin
Abstract.
Visual attention is an essential factor that affects how humans perceive visual signals. This report investigates how distortions in an image can distract human visual attention, using Bayesian visual search models, specifically Maximum-a-posteriori (MAP) [1] [2] and Entropy Limit Minimization (ELM) [3], which predict eye fixation movements under a Bayesian probabilistic framework. We conducted experiments with modified MAP and ELM models on JPEG-compressed images containing blocking or ringing artifacts, and observed that compression artifacts can affect visual attention. We hope this work sheds light on the interactions between visual attention and perceptual quality.
Keywords:
Visual attention, saliency map, visual search, eye movements, optimal fixation selection, image quality assessment
Visual attention (VA) plays a critical role in the human vision system (HVS), given the brain's limited capacity for selective neural processing and the narrow, high-resolution foveated receptive field of the retina. Visual attention is, by definition, the behavioral and cognitive process of selectively concentrating on a discrete aspect of information, whether deemed subjective or objective, while ignoring other perceivable information [4]. While visual attention guides the anatomical structures in the HVS to capture more detailed information about important scene regions in the context of a specific task, our main interest is the mechanism underlying this guidance.

Recent decades have witnessed a growing interest in exploring the mechanisms of visual attention. Studies on modeling visual attention span many facets of scientific research, including psychology, neuroscience, biology, computer vision, and robotics. Generally, there are two primary roots of motivation in the taxonomy of visual attention [5]: one is the use of attentive methods in computer vision, and the other is the development of attention models in the biological vision community. Although the two have developed independently, and the use of attention indeed appears in the computer vision literature before most biological models, their major point of intersection is the class of computational models. A computational model of visual attention, according to [6], not only includes a formal or mathematical description of how attention is computed, but also has to be testable, i.e., it can be tested by providing input stimuli such as images or videos to subjects and comparing how the model performs. We refer the reader to [7] for a comprehensive review of computational models of visual attention.

Here we restrict our scope to models built on the saliency-map assumption, which originated from Treisman and Gelade's "Feature Integration Theory" [8]. Koch and Ullman [9] then proposed a feed-forward model to combine these features and introduced the concept of the saliency map, a topographic map that represents the conspicuousness of scene locations. The first complete implementation and verification of the Koch and Ullman model was proposed by Itti et al. [10] and applied to both synthetic and natural scenes. Since then, there has been increasing interest in saliency detection and salient object detection tasks in the computer vision field.
In the computer vision community, stimulus-driven, saliency-based attention has been a popular research area over the past 25 years. Given the difficulty of accurately measuring, or even quantifying, the internal states of organisms, bottom-up attention, which is independent of these biological aspects, is much easier to understand and evaluate. More specifically, we study visual attention models that compute saliency maps, which characterize the parts of a scene that stand out relative to their neighboring parts. Here we give a clear definition of the kind of problem this paper approaches from a computational standpoint; these definitions generally follow [11].

Suppose $K$ subjects have viewed a dataset consisting of $N$ images $I = \{I_i\}_{i=1}^{N}$. Let $L_i^k = \{p_{ij}^k, t_{ij}^k\}_{j=1}^{n_i^k}$ be the vector of eye fixations $p_{ij}^k = (x_{ij}^k, y_{ij}^k)$ and their corresponding occurrence times $t_{ij}^k$ of the $k$th subject over image $I_i$, and let $n_i^k$ be the number of fixations of this subject over the $i$th image. The goal of saliency-based attention modeling is to find a function (stimulus-to-saliency mapping) $f \in F$ that minimizes the error of eye fixation prediction, i.e., minimizes $\sum_{k=1}^{K}\sum_{i=1}^{N} m\!\left(f(I_i), L_i^k\right)$, where $m \in M$ is a distance measure. An important point is that this definition better suits bottom-up models of overt visual attention, and may not cover some other aspects of visual attention (e.g., covert attention or top-down factors).

The problem statement above is for general saliency research, either given some computer vision task or without any instructions to the subjects. Many technologies and applications of these models have been developed for computer vision and robotics purposes such as image segmentation, object detection, and human-robot interaction, but few have been developed for visual quality research. Hence, we restrict our interest to image quality assessment, since little attention has been paid to the variation of the saliency map under degradation of image quality. In the next section, we articulate the models investigated in this paper, as well as their applications to image quality research.
Visual search models provide a very intuitive way to generate saliency maps, in that they directly predict eye fixation positions, from which a saliency map can be constructed by simple Gaussian-blurred linear operations. In the visual search literature, many researchers have proposed models of searchers' fixation selection strategies, from random search [12] and tiling search [13] to feature-based search [8] [14] [2]. Despite decades of research, however, relatively little is known about how humans actually select fixation locations in visual search.

In recent years, several works [15] [16] [3] [17] have shown that a Bayesian ideal searcher under a general probabilistic framework can perform statistically as well as human eye movements. We therefore develop an eye fixation prediction algorithm based on the Bayesian inference model proposed by Najemnik and Geisler [15]. Conceptually, this Bayesian model, depicted in the flowchart of Fig. 1, works as follows: a searcher starts with some initial prior over where the target would be. The searcher then encodes the image with a foveated visual system to optimally update the prior, producing posterior probabilities. If the posterior at some location becomes large enough, the search is stopped and the position with the largest posterior is chosen as the next fixation; if the posteriors are all below the given threshold, the searcher uses some cost function to optimally pick the next fixation location. Under this framework, popular searchers such as MAP [1], ELM [3], and nELM [17] have been developed to model human eye movement strategies. However, this optimal fixation search framework has generally been used for target search tasks over natural image backgrounds, and is not directly applicable to free viewing of distorted images. In the next section, we slightly modify and redefine some notation so that the framework better fits the context of predicting free-viewing eye fixations on pristine or distorted images.
This section describes in detail how we implemented the eye fixation search algorithm based on the Bayesian ideal search model of Fig. 1 [15]. We first define the visibility map (Section 2.1), the image patch response (Section 2.2), the Bayesian updating scheme (Section 2.3), and the optimal selection strategy (Section 2.4) as the essential modules of this Bayesian framework, and then integrate them to present the overall procedure of each simulated search trial for the three representative searchers MAP [1], ELM [3], and nELM [17] in Section 2.5.
The detrimental effect of retinal eccentricity on the visibility of a target or region is implemented by modeling detectability, $d'$, as a function of eccentricity $\epsilon(i, k(t))$:
Fig. 1: Flow chart of the ideal Bayesian searcher in a probabilistic framework [15]

Fig. 2: Representation of (a) the visibility map and (b) the inhibition function, both for a fixation at the image center
\[
d'_{ik(t)} = \max\left\{0.01,\ \mu \exp\!\left[-\frac{\epsilon(i, k(t))}{\sigma}\right]\right\} \tag{1}
\]
where the magnitude $\mu$ and standard deviation $\sigma$ are fit to the measured visibility map of each human subject, and $\epsilon(i, k(t))$ is simply the distance between target response location $i$ and fixation location $k(t)$. Note that we restrict the visibility value to be no less than
0.01 for stabilization. For simplicity, we fix the parameters in our simulations to $\mu = 5$, $\sigma = 50$, and take $\epsilon$ to be the Euclidean distance $\|\cdot\|$. The simulated visibility map for a fixation at the center of the image is shown in Fig. 2a.
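To make Eq. 1 concrete, the following is a minimal sketch of the visibility map computation in Python/NumPy; the parameter values $\mu = 5$ and $\sigma = 50$ follow the simulation settings above, while the patch-tiling helper and the function names are our own illustration rather than the exact implementation.

```python
import numpy as np

def patch_centers(width, height, patch=16):
    """Centers of the 16x16 patches that tile a width x height image."""
    xs = np.arange(patch // 2, width, patch)
    ys = np.arange(patch // 2, height, patch)
    gx, gy = np.meshgrid(xs, ys)
    return np.stack([gx.ravel(), gy.ravel()], axis=1)  # (M, 2) patch centers

def visibility_map(centers, fixation, mu=5.0, sigma=50.0, floor=0.01):
    """Eq. 1: detectability d'_{ik(t)} as a function of retinal eccentricity.

    Eccentricity is the Euclidean distance between each candidate patch
    center i and the current fixation k(t); values are floored at 0.01.
    """
    ecc = np.linalg.norm(centers - np.asarray(fixation, float), axis=1)
    return np.maximum(floor, mu * np.exp(-ecc / sigma))

if __name__ == "__main__":
    centers = patch_centers(768, 512)                        # e.g. a 768x512 image
    d_prime = visibility_map(centers, fixation=(384, 256))   # fixate the center
    print(d_prime.shape, float(d_prime.max()), float(d_prime.min()))
```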
First, we divide the image $I$ (resolution $W \times H$) into small patches of size $16 \times 16$, so that $M = (W \times H)/(16 \times 16)$ patches $\{P_i\}_{i=1}^{M}$ are obtained. We limit the potential fixation positions to the centers of these patches to simplify both the analysis and the computational load. We therefore obtain the fixation set $k(T) = \{k(t)\}_{t=1}^{T}$, where $k(t)$ denotes the location of fixation $t$. Note that here we have $M = T$.

We can now define the natural image response at a given image patch $P_i$. Unlike the template response settings in [16], here we instead choose three features that are highly related to region conspicuousness to roughly define the response variable $W_{ik(t)}$ at patch $P_i$, modeled as a random variable sampled from a Gaussian distribution:
\[
W_{ik(t)} = E(C_i, L_i, H_i) + N_{ik(t)}, \quad \forall i = 1, \ldots, M, \qquad
E(C_i, L_i, H_i) = \alpha \cdot C_i^{\beta} \cdot L_i^{\gamma} \cdot H_i^{\eta}, \qquad
N_{ik(t)} \sim \mathcal{N}\!\left(0,\ 1/d'^{\,2}_{ik(t)}\right) \tag{2}
\]
where $W_{ik(t)}$ is the response variable with expectation $E(C_i, L_i, H_i)$, and $N_{ik(t)}$ is the simulated response noise. $C_i$, $L_i$, and $H_i$ are the saliency-related features defined as the RMS contrast, average luminance, and image entropy calculated for each patch $P_i$. $\alpha$ is a constant that scales the response map of the whole image to have mean one, whereas $\beta, \gamma, \eta$ are weighting constants that can be fit to data. $N_{ik(t)}$ is a sample of internal noise drawn from a normal distribution with standard deviation proportional to $1/d'_{ik(t)}$; we ignore external noise for simplicity. Fig. 3a shows the example image 'parrots.bmp' from the LIVE image quality database [18], and Fig. 3c displays the three defined response maps as well as the overall normalized response map. It can be seen that this response map is highly correlated with visual saliency.

In many image and video communication applications, images often need to be compressed to save streaming bandwidth. However, heavy compression results in a large drop in visual quality; blocking and ringing artifacts are the most common visual degradations introduced by modern image coding standards such as JPEG and JPEG2000. Thus, given a compressed image with noticeable blocking or ringing artifacts, like Fig. 3b, it is natural to consider both the original response map defined in Eq. 2 and a blocking-artifact attractor. Since there is no good metric for evaluating the perceived blocking or ringing artifacts of a given image patch, we use the local BRISQUE [19] value to approximate the local visual response to blockiness or ringing. This is reasonable because BRISQUE can capture image quality degradation, including compression artifacts, using natural scene statistics features and a regression approach. We prefer BRISQUE to NIQE [20] because BRISQUE is trained and predicted by SVR [21] under the supervision of human scores, and therefore exhibits more stability and linearity than NIQE (in other words, NIQE is highly nonlinear and unstable).

Fig. 3: Proposed image patch response features. (a) Natural image 'parrots'; (b) JPEG-compressed 'parrots'; (c) RMS contrast, luminance, and entropy responses for image (a); (d) RMS contrast, luminance, entropy, and blockiness responses for image (b)
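As an illustration of how the per-patch features and the noisy response of Eq. 2 (and its blockiness-modulated variant, Eq. 3) could be computed, here is a minimal sketch assuming an 8-bit grayscale image; the exact definitions of RMS contrast, patch entropy, and the normalization step are our own simplifications, and the blockiness score $B_i$ (a per-patch BRISQUE-style value) is taken as a given input rather than reimplemented.

```python
import numpy as np

def patch_features(gray, patch=16, eps=1e-6):
    """RMS contrast C_i, mean luminance L_i, and entropy H_i per 16x16 patch."""
    H, W = gray.shape
    C, L, E = [], [], []
    for y in range(0, H - patch + 1, patch):
        for x in range(0, W - patch + 1, patch):
            p = gray[y:y + patch, x:x + patch].astype(float)
            L.append(p.mean())
            C.append((p / 255.0).std() + eps)              # RMS contrast of normalized intensities
            hist, _ = np.histogram(p, bins=32, range=(0, 255))
            prob = hist[hist > 0] / hist.sum()
            E.append(-(prob * np.log2(prob)).sum())        # patch entropy (bits)
    return np.array(C), np.array(L), np.array(E)

def expected_response(C, L, E, B=None, beta=1.0, gamma=1.0, eta=1.0, tau=1.0):
    """E(C_i, L_i, H_i[, B_i]) in Eq. 2 / Eq. 3; B_i is an optional blockiness score."""
    base = (C ** beta) * (L ** gamma) * (E ** eta)
    if B is not None:                                      # Eq. 3: normalize, then modulate
        base = base / base.mean() * (B ** tau)
    return base / base.mean()                              # alpha chosen so the map has mean one

def noisy_response(expected, d_prime, rng=np.random.default_rng(0)):
    """W_{ik(t)} = E(...) + N_{ik(t)}, noise std proportional to 1/d'_{ik(t)}."""
    return expected + rng.normal(0.0, 1.0 / d_prime)
```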
From the 'Patch brisque map' shown in Fig. 3d, we can see that it is fairly well correlated with the annoyance of blocking/ringing artifacts in local image patches. Therefore, considering all four factors that drive our attention simultaneously and jointly, the response variable $W_{ik(t)}$ in Eq. 2 can be further modulated by a 'blockiness' index $B_i$ modeled by the local BRISQUE value, which gives Eq. 3:
\[
W_{ik(t)} = E(C_i, L_i, H_i, B_i) + N_{ik(t)}, \quad \forall i = 1, \ldots, M, \qquad
E(C_i, L_i, H_i, B_i) = \alpha \cdot \mathrm{Normalize}\!\left(C_i^{\beta} \cdot L_i^{\gamma} \cdot H_i^{\eta}\right) \cdot B_i^{\tau}, \qquad
N_{ik(t)} \sim \mathcal{N}\!\left(0,\ 1/d'^{\,2}_{ik(t)}\right) \tag{3}
\]

Determining the next fixation requires the two key components of the ideal Bayesian searcher: optimally updating the posterior distribution, and optimally selecting successive fixation locations. These two parts are the most critical components in the flowchart of Fig. 1. First, we discuss how to optimally update the posterior probability density given the current fixation $T$.

Suppose $H_i$ is the hypothesis that the potential interesting target position is at the $i$th location (here $i \in \{1, \ldots, M\}$). By Bayes' formula, the posterior $p_{H_i}$ (abbreviated $p_i$) is proportional to the product of the prior probability at location $i$ and the joint likelihood of all block responses at all possible $T$ fixation locations. Let the vector $W(t) = (W_{1k(t)}, \ldots, W_{Mk(t)})$ collect all block responses at fixation $t$. Given the current fixation $T$, we can update the posterior as
\[
p_i(T) \propto \mathrm{prior}(i) \cdot p_i\!\left(W(1), \ldots, W(T) \mid i\right) \tag{4}
\]
which, if the responses are independent across fixations, becomes
\[
p_i(T) \propto \mathrm{prior}(i) \cdot \prod_{t=1}^{T} \prod_{i=1}^{M} p\!\left(W_{ik(t)} \mid i\right) \tag{5}
\]
Najemnik and Geisler have shown in the supplement of their paper [15] that Eq. 5 is equivalent to updating by a running summation of the visibility-weighted responses at the $i$th location over all fixations $t = 1, \ldots, T$, i.e.,
\[
p_i(T) \propto \mathrm{prior}(i) \cdot \exp\!\left(\sum_{t=1}^{T} d'_{ik(t)} W_{ik(t)}\right) \tag{6}
\]
where $t$ is the fixation index, and $d'_{ik(t)}$ and $W_{ik(t)}$ are the retinal visibility and the response variable at patch location $i$ when the fixation is at display location $k(t)$, as defined in Sections 2.1 and 2.2. The second term in Eq. 6 can be seen as the joint likelihood of the patch responses from all potential fixations $k(t)$. Note that Eq. 6 holds for the case in which the hypotheses are mutually exclusive, and both stimulus noise and internal noise are statistically independent over time [15].
Here the prior for each location is initialized to be uniform (i.e., $\mathrm{prior}(i) = 1/M$, $i = 1, \ldots, M$). To account for 'inhibition of return' [22], however, we suppress the priors in the neighborhood of the most recent historical fixations (assuming a time interval of 250 ms per saccade). For consistency, we use the same function and parameters as in the visibility map of Eq. 1. Hence, the priors are modulated by the 'inhibition of return' effect as shown in Eq. 7, where $n$ is the number of most recent fixations retained for inhibition. Following [23], we use a linearly decreasing weight $\alpha$, proportional to how recent the historical fixation is, to weight the inhibition effect, as shown in Eq. 8. Here we set the inhibition history to $n = 8$, i.e., assuming the searcher can remember the last 2 seconds in the free-viewing situation, with the inhibition effect descending linearly with weights $\{1, \tfrac{n-1}{n}, \ldots, \tfrac{1}{n}\}$.
\[
\mathrm{prior}_{\mathrm{inhib}}(i) := \mathrm{prior}(i) \cdot \prod_{t=T-n}^{T} \left\{1 - \alpha \cdot \exp\!\left[-\frac{\epsilon(i, k(t))}{\sigma}\right]\right\} \tag{7}
\]
\[
\alpha = 1 - \frac{T - t}{n} \in \left\{\tfrac{1}{n}, \tfrac{2}{n}, \ldots, 1\right\} \tag{8}
\]
Thus, the posterior for each potential target location is optimally updated by multiplying the history-suppressed prior by the joint likelihood of the current target patch from all potential fixations. That is, the overall posterior is updated according to Eq. 9, which should be normalized by the sum of the posteriors over all potential target locations, $\sum_{j=1}^{M} p_j(T)$, so that it is a probability measure.
\[
p_i(T) \propto \mathrm{prior}_{\mathrm{inhib}}(i) \cdot \exp\!\left(\sum_{t=1}^{T} d'_{ik(t)} W_{ik(t)}\right) \tag{9}
\]
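The following minimal sketch implements the inhibition-of-return priors (Eqs. 7-8) and the posterior update (Eq. 9), keeping the running sum of visibility-weighted responses explicitly and subtracting its maximum before exponentiation for numerical stability; the function names and that bookkeeping detail are our own, but the update is algebraically the same as Eq. 9.

```python
import numpy as np

def inhibited_prior(prior, centers, past_fixations, sigma=50.0, n=8):
    """Eqs. 7-8: suppress the prior around the n most recent fixations,
    with linearly decreasing inhibition weights alpha in {1, (n-1)/n, ..., 1/n}."""
    prior = prior.copy()
    recent = past_fixations[-n:]
    for age, fix in enumerate(reversed(recent)):          # age 0 = most recent fixation
        alpha = 1.0 - age / float(n)
        ecc = np.linalg.norm(centers - np.asarray(fix, float), axis=1)
        prior *= 1.0 - alpha * np.exp(-ecc / sigma)
    return prior

def update_posterior(prior_inhib, running_sum, d_prime, response):
    """Eq. 9: posterior proportional to the inhibited prior times
    exp(sum_t d'_{ik(t)} W_{ik(t)}), kept as a running sum across fixations."""
    running_sum = running_sum + d_prime * response
    post = prior_inhib * np.exp(running_sum - running_sum.max())  # constant factor, removed by normalization
    return post / post.sum(), running_sum
```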
Since we are not conducting a target search task here, we do not set a probability criterion for stopping the search chain; instead, the searcher keeps iteratively computing the next optimal fixation until the iterations are cut off. We select three representative Bayesian ideal searchers, dubbed Maximum-a-posteriori (MAP) [1], [2], Entropy Limit Minimization (ELM) [3], and normalized ELM (nELM) [17], under the ideal Bayesian probabilistic framework (Fig. 1) for our simulations. We now present the optimization strategy and the corresponding reward function for each of them.

To compute the optimal next fixation point, $k_{opt}(T+1)$, the MAP searcher always fixates the location with the maximum posterior probability after the optimal Bayesian update, i.e.,
\[
k^{(\mathrm{MAP})}_{opt}(T+1) = \arg\max_i \; p_i(T) \tag{10}
\]
where $p_i(T)$ is the posterior probability distribution updated by Eq. 9. According to this rule, the MAP searcher always fixates the scene location containing the most salient content, since these areas produce the largest responses.

The ideal searcher [15], however, considers each possible next fixation and picks the location that, given its knowledge of the current posterior probabilities and the visibility map, maximizes the probability of correctly identifying the target location after the fixation, as shown in Eq. 11:
\[
k^{(\mathrm{Ideal})}_{opt}(T+1) = \arg\max_{k(T+1)} \; \sum_{i=1}^{N} p_i(T)\, p\!\left(C \mid i, k(T+1)\right) \tag{11}
\]
The ideal searcher of Eq. 11 maximizes accuracy in the target search task. Given its computational complexity, however, Najemnik and Geisler derived a simple heuristic called the entropy limit minimization (ELM) searcher [3], which produces near-optimal fixation selection and fixation statistics similar to those of humans. They proved that, under some simple summation-rule assumptions, the expected information gain of fixation $k(T+1)$ relative to the previous fixation $k(T)$ (Eq. 12) is equivalent to linearly filtering the current posterior across the possible target locations (Eq. 15); Eq. 15 is derived from Eqs. 12-14. Therefore, the ELM searcher chooses the next fixation location $k(T+1)$ that maximizes the expected information gain as approximated by the summation in Eq. 16, where the posteriors are updated in the same manner as for MAP (Eq. 9).
\[
E\!\left[\Delta H(T+1) \mid k(T+1)\right] = E\!\left[H(T+1) \mid k(T+1)\right] - H(T) \tag{12}
\]
\[
E\!\left[H(T+1) \mid k(T+1)\right] = -E\!\left[\sum_{i=1}^{n} P_i(T+1) \log P_i(T+1) \,\Big|\, k(T+1)\right] \tag{13}
\]
\[
P_i(T+1) = \frac{p_i(T) \exp\!\left[d'_{ik(T+1)} W_{ik(T+1)}\right]}{\sum_{j=1}^{n} p_j(T) \exp\!\left[d'_{jk(T+1)} W_{jk(T+1)}\right]} \tag{14}
\]
\[
E\!\left[\Delta H(T+1) \mid k(T+1)\right] \approx \sum_{i=1}^{N} p_i(T)\, d'_{ik(T+1)} \tag{15}
\]
\[
k^{(\mathrm{ELM})}_{opt}(T+1) = \arg\max_{k(T+1)} \; \sum_{i=1}^{M} p_i(T)\, d'_{ik(T+1)} \tag{16}
\]
The normalized ELM (nELM) searcher [17], which accounts for the variation of the detectability map modulated by the local contrast at the gaze point, was built on ELM to further generalize it to non-stationary natural backgrounds. The nELM searcher first normalizes the posteriors by local contrast and then picks the fixation location with the maximum expected information gain, as expressed in Eq. 17:
\[
p^{(\mathrm{Norm})}_i(T) := p_i(T)/C_i, \quad i = 1, \ldots, M, \qquad
k^{(\mathrm{nELM})}_{opt}(T+1) = \arg\max_{k(T+1)} \; \sum_{i=1}^{M} p^{(\mathrm{Norm})}_i(T)\, d'_{ik(T+1)} \tag{17}
\]
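Once the posterior and visibility map are available, the three selection rules (Eqs. 10, 16, 17) reduce to a few lines. The sketch below uses our own function names and assumes the visibility_map helper of Eq. 1 sketched earlier; ELM's information gain is computed by brute force over all candidate fixation locations.

```python
import numpy as np

def next_fixation_map(posterior, centers):
    """Eq. 10: MAP fixates the patch with the largest posterior probability."""
    return centers[int(np.argmax(posterior))]

def next_fixation_elm(posterior, centers, visibility_map, **vis_kw):
    """Eq. 16: ELM fixates the candidate location k(T+1) maximizing
    sum_i p_i(T) * d'_{i k(T+1)}, i.e. the posterior filtered by the
    visibility map centered at each candidate fixation."""
    gains = [np.dot(posterior, visibility_map(centers, c, **vis_kw))
             for c in centers]
    return centers[int(np.argmax(gains))]

def next_fixation_nelm(posterior, contrast, centers, visibility_map, **vis_kw):
    """Eq. 17: nELM first normalizes the posterior by local contrast C_i,
    then applies the ELM rule."""
    p_norm = posterior / contrast
    return next_fixation_elm(p_norm, centers, visibility_map, **vis_kw)
```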
Algorithm 1: The procedure of each simulated search trial for the three searchers {MAP, ELM, nELM}
1: Fixation begins at the center of the image (center prior).
2: Initialize the visibility map $d'_{ik(t)}$ by Eq. 1 for each possible fixation location $k(t)$.
3: Parallel encoding of the image: at the current fixation $k(t)$, an independent Gaussian noise sample $N_{ik(t)}$ (with zero mean and variance $1/d'^{\,2}_{ik(t)}$) is generated for each target patch $P_i$, and the response variable $W_{ik(t)}$ at each patch $P_i$ is computed by Eq. 2 or Eq. 3 (Eq. 2 for a pristine image such as Fig. 3a; Eq. 3 for a compressed image such as Fig. 3b).
4: Optimal Bayesian updating: after fixation $T$ is made, the posterior probability at each image patch location $i$ is updated by the Bayes formula of Eq. 9, with priors modulated by Eq. 7, as follows. In other words, optimal integration of information across fixations is achieved by keeping a running sum of the visibility-weighted responses at each potential location.
\[
p_i(T) \propto \mathrm{prior}_{\mathrm{inhib}}(i) \cdot \exp\!\left(\sum_{t=1}^{T} d'_{ik(t)} W_{ik(t)}\right), \qquad
\mathrm{prior}_{\mathrm{inhib}}(i) := \mathrm{prior}(i) \cdot \prod_{t=T-n}^{T} \left\{1 - \alpha \cdot \exp\!\left[-\frac{\epsilon(i, k(t))}{\sigma}\right]\right\}
\]
5: Optimally choose the next fixation: to compute the optimal next fixation point, $k_{opt}(T+1)$, the searcher considers each possible next fixation and picks the location with the highest expected value, a product of sensory evidence and potentially earned reward, defined differently for each searcher:
SWITCH method ∈ {'MAP', 'ELM', 'nELM'} DO
  CASE 'MAP': $k_{\mathrm{MAP}}(T+1) = \arg\max_i \; p_i(T)$
  CASE 'ELM': $k_{\mathrm{ELM}}(T+1) = \arg\max_{k(T+1)} \sum_{i=1}^{M} p_i(T)\, d'_{ik(T+1)}$
  CASE 'nELM': $p^{(\mathrm{Norm})}_i(T) := p_i(T)/C_i$ for $i = 1, \ldots, M$; $k_{\mathrm{nELM}}(T+1) = \arg\max_{k(T+1)} \sum_{i=1}^{M} p^{(\mathrm{Norm})}_i(T)\, d'_{ik(T+1)}$
6: The process jumps back to Step 3 and repeats Steps 3-5 to predict the next optimal fixation location.
Based on the above assumptions and definitions, the steps of each simulated search trial are summarized in Algorithm 1.
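Putting the pieces together, here is a minimal end-to-end sketch of one simulated free-viewing trial with the ELM searcher, assuming the helper functions sketched in the previous subsections (patch_centers, visibility_map, patch_features, expected_response, noisy_response, inhibited_prior, update_posterior, next_fixation_elm) and an image whose dimensions are multiples of the patch size. It is an illustration of Algorithm 1 on a pristine image, not a reference implementation.

```python
import numpy as np

def simulate_trial_elm(gray, n_fixations=12, patch=16, sigma=50.0, n_inhib=8):
    """One simulated free-viewing trial with the ELM searcher (Algorithm 1)."""
    centers = patch_centers(gray.shape[1], gray.shape[0], patch)
    M = len(centers)
    C, L, E = patch_features(gray, patch)
    expected = expected_response(C, L, E)                  # Eq. 2 (pristine image)

    prior = np.full(M, 1.0 / M)                            # uniform prior over patch locations
    running_sum = np.zeros(M)
    fix = (gray.shape[1] // 2, gray.shape[0] // 2)         # step 1: start at the center
    fixations = [fix]

    for _ in range(n_fixations - 1):
        d_prime = visibility_map(centers, fix, sigma=sigma)            # step 2
        response = noisy_response(expected, d_prime)                   # step 3
        prior_inh = inhibited_prior(prior, centers, fixations,         # step 4 (Eqs. 7-8)
                                    sigma=sigma, n=n_inhib)
        posterior, running_sum = update_posterior(prior_inh, running_sum,
                                                  d_prime, response)   # step 4 (Eq. 9)
        fix = tuple(next_fixation_elm(posterior, centers, visibility_map,
                                      sigma=sigma))                    # step 5 (Eq. 16)
        fixations.append(fix)                                          # step 6: repeat
    return fixations
```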
As an initial test to validate our implementation of the searchers, we generate a spatial 1/f noise image, which has the same spatial power spectrum shape as natural images. Note that the 1/f image has stationary local contrast, so there is no need to test the nELM algorithm on it. We can see from Fig. 4a and 4b that ELM often chooses the same fixation point as MAP, because the information-gain map is simply the posteriors blurred by the visibility map. However, MAP can saccade to locations near the image border, whereas ELM tends to jump only a moderate distance, since nearby posteriors are pushed down by 'inhibition of return' and the information gain at distant locations tends to be low. Fig. 4d shows the history-inhibited information-gain map, which has much lower values near the borders and at previously visited locations.
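For reference, one common way to generate such a test image is to impose a 1/f amplitude spectrum (1/f² power spectrum) on random phases; the sketch below follows that recipe, though the exact procedure used in our simulations may differ.

```python
import numpy as np

def pink_noise_image(height=512, width=512, seed=0):
    """2-D noise whose amplitude spectrum falls off as 1/f (power as 1/f^2)."""
    rng = np.random.default_rng(seed)
    fy = np.fft.fftfreq(height)[:, None]
    fx = np.fft.fftfreq(width)[None, :]
    f = np.sqrt(fx ** 2 + fy ** 2)
    f[0, 0] = 1.0                                  # avoid division by zero at DC
    amplitude = 1.0 / f
    phase = rng.uniform(0, 2 * np.pi, size=(height, width))
    spectrum = amplitude * np.exp(1j * phase)
    img = np.real(np.fft.ifft2(spectrum))          # real part keeps the 1/f falloff
    img -= img.min()
    img /= img.max()                               # normalize to [0, 1]
    return img
```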
We then selected the image 'parrots.bmp' from the LIVE Image Quality Database [18], on which we tested Algorithm 1 with an inhibition history of n = 8 and linearly decreasing weights. Fig. 5 shows the search results using MAP (Fig. 5a), ELM (Fig. 5c), and nELM (Fig. 5e), respectively. The right column of Fig. 5 shows examples of the likelihood map, posteriors, information-gain map, and normalized posteriors, all at the last predicted fixation. As the figures show, MAP tends to choose fixations only in the most salient areas, whereas ELM sometimes fixates locations with lower posterior probability but higher information gain. Note that in Fig. 5e we do not regard the fixations predicted by nELM as good enough, since almost all of the saccades land on uninteresting points and the saccade distances are relatively long. This is likely because we simply normalize the posteriors by dividing by the local contrast, which can make the posterior of a low-contrast region unexpectedly large after normalization. This problem may be improved by fine-tuning the normalization function based on subjective human studies.

Since our goal is to study the distracting effects of image compression artifacts on predicted scanpaths, we used the same image 'parrots.bmp' and compressed it with Quality = 5 using the imwrite function in MATLAB.
Fig. 4: Fixation prediction test on the 1/f noise image

After obtaining the patch response map, we applied histogram equalization to make its distribution more dispersed and smooth. We then used Eq. 3 with parameters (β, γ, η, τ) = (1, ·, ·, 1) to calculate the responses and updated the posteriors accordingly. Fig. 6 shows the first 12 fixations predicted by MAP (Fig. 6a), ELM (Fig. 6c), and nELM (Fig. 6e). We observed that all of the searchers are distracted to some extent by the noticeable blocking or ringing artifacts, while each searcher remains identifiable by its own search pattern. The zoomed image region in Fig. 6c, which is full of clearly perceptible blocking artifacts, is selected as the first fixation location by ELM, which supports our expectation that noticeable image quality degradation indeed distracts visual attention away from otherwise salient areas.
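A minimal sketch of this preprocessing (heavy JPEG compression and histogram equalization of the patch response map) is given below for reproducibility; the report itself used MATLAB's imwrite with Quality = 5, and here Pillow's quality parameter and a plain CDF remapping stand in for those steps. The file names are hypothetical.

```python
import numpy as np
from PIL import Image

def compress_jpeg(src_path, dst_path, quality=5):
    """Re-encode an image as a heavily compressed JPEG (blocking/ringing artifacts)."""
    Image.open(src_path).convert("RGB").save(dst_path, "JPEG", quality=quality)

def equalize_map(values, bins=256):
    """Histogram-equalize a 1-D patch response map so its distribution is
    more dispersed and smooth before computing Eq. 3."""
    hist, edges = np.histogram(values, bins=bins)
    cdf = np.cumsum(hist).astype(float)
    cdf /= cdf[-1]
    return np.interp(values, edges[:-1], cdf)

# Example (hypothetical file names):
# compress_jpeg("parrots.bmp", "parrots_q5.jpg", quality=5)
```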
Fig. 5: Fixation prediction results on the natural image 'parrots'. (a) First 12 fixations predicted by MAP; (b) likelihood map (top) and posteriors (bottom); (c) first 12 fixations predicted by ELM; (d) posteriors (top) and information-gain map (bottom); (e) first 12 fixations predicted by nELM; (f) normalized posteriors (top) and information-gain map (bottom)

Fig. 6: Fixation prediction results on the compressed natural image 'parrots'

Fig. 7: More eye fixation results using the ELM searcher on images from the LIVE database. (a) ELM on 'bikes.bmp'; (b) ELM on compressed 'bikes.jpg'; (c) ELM on 'caps.bmp'; (d) ELM on compressed 'caps.jpg'; (e) ELM on 'ocean.bmp'; (f) ELM on compressed 'ocean.jpg'; (g) ELM on 'sailing4.bmp'; (h) ELM on compressed 'sailing4'

In conclusion, this report explores the Bayesian probabilistic framework shown in Fig. 1 and slightly modifies it to predict human eye fixations during free viewing of natural images degraded by compression artifacts. We defined the response variables of a distortion-free image patch in a manner similar to Najemnik and Geisler [15]. Given a compressed image with noticeable distortions, we further defined a 'blockiness response map' as an approximation of how salient the blocking artifacts appear to a human observer, and used this distortion map to compete with the traditional saliency map (i.e., the distortion-free response map). Integrating this additional visual factor, dubbed 'blockiness', into the basic response variable, we simulated three classic fixation searchers under a Bayesian probabilistic framework: MAP [1], ELM [3], and nELM [17]. The experimental results on different images from the LIVE database show that Bayesian visual search is indeed disturbed by the annoying blocking artifacts. These simulations provide preliminary evidence for the distracting effects of image distortions on visual search, and build a preliminary model based on the Bayesian probabilistic framework for subsequent explorations of the interactions between visual attention and image quality.

Lastly, we discuss some deficiencies of the implementations in this report, as well as their potential improvements:

– Accuracy of the detectability map.
For mathematical convenience, we assume a simplified model, shown as the Gaussian-shaped map in Fig. 2a, in which target detectability depends only on eccentricity and is isotropic. However, this is not quite right, since the real retinotopic detectability map looks more like an ellipse and is also tuned by background contrast. The searchers implemented in this report would be expected to achieve more human-like performance if a more accurate retinotopic detectability map were used. – Accuracy of the redefined response map.
Here we redefine the image response map, which indicates the saliency of a patch, as the product of patch contrast, luminance, and entropy, which is a very rough approximation; the 'blockiness response map' is likewise quite loose. One could improve this, for example, by leveraging a state-of-the-art saliency model. In our view, given a distorted image, how to define the response map is the trickiest part of both searchers within this Bayesian framework, since it directly weights the posteriors and thereby largely decides the next optimal fixation location. – Effects of 'inhibition of return'.
We noticed that the 'inhibition of return' phenomenon has a larger impact on driving the next eye movement than maximizing the reward function does. Specifically, attention is encouraged toward new locations that fixations have not yet visited. Here we simply set the memorized fixation history to last 2 seconds (i.e., the 8 previous fixations) with descending weights, which needs further study. – Subjective verification.
Currently, we only show visual examples of the eye fixations predicted by the Bayesian search framework; the accuracy of approximating human fixations has not been validated in this report. A subjective experiment recording human eye fixations on distorted images should be conducted to verify the hypotheses in this paper. – Improving video quality models.
Objective video quality models have attracted increasing research interest recently [24, 25]. It would be of great significance to study the effect of eye fixations on the performance of video quality models, especially for the recently popular user-generated content (UGC). We may also adopt a content-adaptive streaming scheme [26] if human eye fixations can be accurately predicted, in order to save bitrate.
References
1. Findlay, J.M.: Global visual processing for saccadic eye movements. Vision Research (8) (1982) 1033-1045
2. Eckstein, M.P., Beutter, B.R., Stone, L.S.: Quantifying the performance limits of human saccadic targeting during visual search. Perception (11) (2001) 1389-1401
3. Najemnik, J., Geisler, W.S.: Simple summation rule for optimal fixation selection in visual search. Vision Research (10) (2009) 1286-1294
4. Wikipedia contributors: Attention — Wikipedia, the free encyclopedia (2019) [Online; accessed 6-May-2019]
5. Tsotsos, J.K., Rothenstein, A.: Computational models of visual attention. Scholarpedia 6(1) (2011) 6201
6. Itti, L., Koch, C.: Computational modelling of visual attention. Nature Reviews Neuroscience (3) (2001) 194
7. Frintrop, S., Rome, E., Christensen, H.I.: Computational visual attention systems and their cognitive foundations: A survey. ACM Transactions on Applied Perception (TAP) (1) (2010) 6
8. Treisman, A.M., Gelade, G.: A feature-integration theory of attention. Cognitive Psychology (1) (1980) 97-136
9. Koch, C., Ullman, S.: Shifts in selective visual attention: towards the underlying neural circuitry. In: Matters of Intelligence. Springer (1987) 115-141
10. Itti, L., Koch, C., Niebur, E.: A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis & Machine Intelligence (11) (1998) 1254-1259
11. Borji, A., Itti, L.: State-of-the-art in visual attention modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence (1) (2013) 185-207
12. Engel, F.L.: Visual conspicuity, visual search and fixation tendencies of the eye. Vision Research (1) (1977) 95-108
13. Geisler, W.S., Chou, K.L.: Separation of low-level and high-level factors in complex tasks: visual search. Psychological Review (2) (1995) 356
14. Wolfe, J.M.: What can 1 million trials tell us about visual search? Psychological Science (1) (1998) 33-39
15. Najemnik, J., Geisler, W.S.: Optimal eye movement strategies in visual search. Nature (7031) (2005) 387
16. Najemnik, J., Geisler, W.S.: Eye movement statistics in humans are consistent with an optimal search strategy. Journal of Vision (3) (2008) 4-4
17. Abrams, J., Geisler, W.: Visual search in natural scenes: a double-dissociation paradigm for comparing observer models. Journal of Vision (12) (2015) e755-e755
18. Sheikh, H.R., Sabir, M.F., Bovik, A.C.: A statistical evaluation of recent full reference image quality assessment algorithms. IEEE Transactions on Image Processing (11) (2006) 3440-3451
19. Mittal, A., Moorthy, A.K., Bovik, A.C.: No-reference image quality assessment in the spatial domain. IEEE Transactions on Image Processing (12) (2012) 4695-4708
20. Mittal, A., Soundararajan, R., Bovik, A.C.: Making a "completely blind" image quality analyzer. IEEE Signal Processing Letters (3) (2013) 209-212
21. Smola, A.J., Schölkopf, B.: A tutorial on support vector regression. Statistics and Computing (3) (2004) 199-222
22. Klein, R.M.: Inhibition of return. Trends in Cognitive Sciences (4) (2000) 138-147
23. Pylyshyn, Z.: The role of location indexes in spatial perception: A sketch of the FINST spatial-index model. Cognition (1) (1989) 65-97
24. Tu, Z., Wang, Y., Birkbeck, N., Adsumilli, B., Bovik, A.C.: UGC-VQA: Benchmarking blind video quality assessment for user generated content. arXiv preprint arXiv:2005.14354 (2020)
25. Chen, L.H., Bampis, C.G., Li, Z., Norkin, A., Bovik, A.C.: ProxIQA: A proxy approach to perceptual optimization of learned image compression. IEEE Transactions on Image Processing 30