SID4VAM: A Benchmark Dataset with Synthetic Images for Visual Attention Modeling
David Berga, Xosé R. Fdez-Vidal, Xavier Otazu, Xosé M. Pardo
Computer Vision Center, Universitat Autònoma de Barcelona, Spain
CiTIUS, Universidade de Santiago de Compostela, Spain
{dberga,xotazu}@cvc.uab.es, {xose.vidal,xose.pardo}@usc.es

Abstract
A benchmark of saliency model performance with a synthetic image dataset is provided. Model performance is evaluated through saliency metrics as well as the influence of model inspiration and consistency with human psychophysics. SID4VAM is composed of 230 synthetic images with known salient regions. Images were generated with 15 distinct types of low-level features (e.g. orientation, brightness, color, size...) with a target-distractor pop-out type of synthetic patterns. We have used Free-Viewing and Visual Search task instructions and 7 feature contrasts for each feature category. Our study reveals that state-of-the-art Deep Learning saliency models do not perform well with synthetic pattern images; instead, models with Spectral/Fourier inspiration outperform others in saliency metrics and are more consistent with human psychophysical experimentation. This study proposes a new way to evaluate saliency models in the forthcoming literature, accounting for synthetic images with uniquely low-level feature contexts, distinct from previous eye tracking image datasets.
1. Introduction
Although eye movements are indicators of "where people look at", a more complex question arises as a consequence for understanding bottom-up visual attention:

Are all eye movements equally valuable for determining saliency?
According to the initial hypotheses in visual attention [53, 58], we could define visual saliency as the perceptual quality that makes our human visual system (HVS) gaze towards certain areas that pop out in a scene due to their distinctive visual characteristics. Therefore, this capacity (saliency) cannot be influenced by top-down factors, which seemingly guide eye movements regardless of stimulus characteristics [60]. Prior knowledge of whether a stimulus area is salient or not, when it becomes salient, and why, are issues that need to be accounted for in saliency evaluation [7, 2].

Common frameworks for predicting saliency have been created since Koch & Ullman's seminal work [31]. This framework defined a theoretical basis for modeling the early visual stages of the HVS in order to obtain a representation of the saliency map. By extracting sensory signals as feature maps, processing the conspicuous objects and selecting the maximally-active locations through winner-take-all (WTA) mechanisms, it is possible to obtain a unique/master saliency map. However, it was hypothesized that visual attention combines both bottom-up (saliency) and top-down (relevance) mechanisms in a central representation (priority) [14, 17]. These top-down specificities (e.g. world, object, task, etc.) were later accounted for in the selective tuning model as a hierarchy of WTA-like processes [54]. Although the neural correlates simultaneously involved in saliency have been investigated [55], the direct relation between saliency and eye movements defined in a unique computational framework requires further study.

Itti et al. initially introduced a computational biologically-inspired model [27] composed of 3 main steps: First, feature maps are extracted using oriented linear DoG filters for each chromatic channel. Second, feature conspicuity is computed using center-surround differences. Third, conspicuity maps are integrated with linear WTA mechanisms. This architecture has been the main inspiration for current saliency models [62, 43], which alternatively use distinct mechanisms (accounting for different levels of processing, context or tuning depending on the scene) while preserving the same or a similar structure for these steps. Although current state-of-the-art models precisely resemble eye-tracking fixation data [6, 9], we question whether these models represent saliency. We will test this hypothesis with a novel synthetic image dataset.
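To make these three steps concrete, below is a minimal sketch of such a pipeline (our own illustration under simplifying assumptions, not Itti et al.'s reference implementation; orientation/Gabor channels are omitted and all filter parameters are illustrative):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def center_surround(channel, sigma_center=2.0, sigma_surround=8.0):
    """Step 2: conspicuity as a center-surround (DoG-like) difference."""
    return np.abs(gaussian_filter(channel, sigma_center)
                  - gaussian_filter(channel, sigma_surround))

def master_saliency_map(rgb):
    """Steps 1-3 of an Itti-style pipeline (illustrative parameters)."""
    # Step 1: feature maps (intensity and two color-opponent channels).
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    features = [(r + g + b) / 3.0, r - g, b - (r + g) / 2.0]
    # Step 2: per-feature conspicuity maps.
    conspicuity = [center_surround(f) for f in features]
    # Step 3: normalize and linearly integrate; a strict WTA would instead
    # iteratively select the most active location of the combined map.
    return sum(c / (c.max() + 1e-8) for c in conspicuity) / len(conspicuity)
```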
In order to determine whether an object or a feature attracts attention, initial experimentation assessed feature discriminability upon display characteristics (e.g. display size, feature contrast...) during visual search tasks [53, 58]. Parallel search occurs when features are processed preattentively, so search targets are found efficiently regardless of distractor properties. Instead, serial search happens when attention is directed to one item at a time, requiring a "binding" process to allow each object to be discriminated. In this case, search time decreases with higher target-distractor contrast and/or lower set size (following the Weber law [16]). More recent studies replicated these experiments by providing real images with parametrization of feature contrast and/or set size (iLab USC, UCL, VAL Harvard, ADA KCL), combining visual search or visual segmentation tasks, however without providing eye tracking data (Table 1B). Rather, current eye movement datasets provide fixations and scanpaths from real scenes during free-viewing tasks. These image datasets are usually composed of real image scenes (Table 1A), either from indoor/outdoor scenes (Toronto, MIT1003, MIT300), nature scenes (KTH) or semantically-specific categories such as faces (NUSEF) and several others (CAT2000). A complete list of eye tracking datasets is given in Winkler & Subramanian's overview [57]. The CAT2000 training subset of "Pattern" images (CAT2000p) provides eye movement data with psychophysical/synthetic image patterns during 5 seconds of free-viewing. However, no parametrization of feature contrast nor stimulus properties is given. A synthetic image dataset could provide information on how attention depends on feature contrast and other stimulus properties under distinct tasks. We describe in Section 2 how we do so with our novel SID4VAM dataset.
Table 1: Characteristics of eye tracking datasets

A: Real Images
Dataset           Task      TS    PP   PM   DO
MIT1003 [30]      FV        1003  15        ✓
NUSEF [41]        FV        758   25        ✓
KTH [32]          FV        99    31        ✓
MIT300 [29]       FV        300   39        ✓
CAT2000 [5]       FV        4000  24        ✓

B: Psychophysical Pattern / Synthetic Images
Dataset           Task      TS    PP   PM   DO
UCL [64]          VS & SG   2784  5    ✓
VAL Harvard [59]  VS        4000  30   ✓
ADA KCL [50]      -         ~430  -    ✓
CAT2000p [5]      FV        100   18        ✓
SID4VAM (Ours)    FV & VS   230   34   ✓    ✓

TS: total number of stimuli, PP: participants, PM: parametrization, DO: fixation data is available online, FV: Free-Viewing, VS: Visual Search, SG: visual segmentation.
Being inspired by Itti et al.'s architecture, a myriad of computational models has been proposed with distinct computational approaches, from biological, mathematical and physical inspiration [62, 43]. By processing global and/or local image features to calculate feature conspicuity, these models are able to generate a master saliency map to predict human fixations (Table 2). Taking up Judd et al.'s [29] and Borji & Itti's [4] reviews, we have grouped saliency model inspiration into five general categories according to how saliency is computed:

• Cognitive/Biological (C): Saliency is usually generated by mimicking HVS neuronal mechanisms or specific patterns found in human eye movement behavior. Feature extraction is generally based on Gabor-like filters and their integration with WTA-like mechanisms.

• Information-Theoretic (I): These models compute saliency by selecting the regions that maximize the visual information of scenes.

• Probabilistic (P): Probabilistic models generate saliency by optimizing the probability of performing certain tasks and/or finding certain patterns. These models use graphs, Bayesian, decision-theoretic and other approaches for their computations.

• Spectral/Fourier-based (F): Spectral analysis or Fourier-based models derive saliency by extracting or manipulating features in the frequency domain (e.g. spectral frequency or phase); a minimal sketch of one such model is shown after this list.

• Machine/Deep Learning (D): These techniques are based on training existing machine/deep learning architectures (e.g. CNN, RNN, GAN...) by minimizing the error of predicting fixations of images from existing eye tracking data or labeled salient regions.
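As a reference point for the Spectral/Fourier category, the following is a compact sketch of the spectral residual approach of Hou & Zhang [23] (our own condensed reading of the method; smoothing sizes are illustrative assumptions):

```python
import numpy as np
from scipy.ndimage import uniform_filter, gaussian_filter

def spectral_residual_saliency(gray):
    """Spectral residual saliency (after Hou & Zhang [23])."""
    spectrum = np.fft.fft2(gray)
    log_amplitude = np.log(np.abs(spectrum) + 1e-8)
    phase = np.angle(spectrum)            # the phase spectrum is kept intact
    # The residual is the log amplitude minus its local average.
    residual = log_amplitude - uniform_filter(log_amplitude, size=3)
    # Back to the image domain; squaring and blurring give the saliency map.
    saliency = np.abs(np.fft.ifft2(np.exp(residual + 1j * phase))) ** 2
    return gaussian_filter(saliency, sigma=3)
```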
Visual saliency is a term coined on a perceptual basis. According to this principle, a correct modelization of saliency should consider specific experimental conditions upon a visual attention task. The output of such a model can vary per stimulus or task, but must arise as a common behavioral phenomenon in order to validate the general hypothesis definition from Treisman, Wolfe, Itti and colleagues [53, 58, 26]. Eye movements have been considered the main behavioral markers of visual attention. But understanding saliency means not only proving how visual fixations can be predicted, but simulating which patterns of eye movements are gathered from vision and its sensory signals (here avoiding any top-down influences). This challenge requires eye tracking researchers to consider several experimental issues (with respect to contextual, contrast, temporal, oculomotor and task-related biases) when capturing bottom-up attention, largely explained by Borji et al. [4], Bruce et al. [7] and lately by Berga et al. [2]. Computational models advance several ways to predict, to some extent, human visual fixations.

Table 2: Description of saliency models
Model         Authors                    Year   Inspiration (C I P F D) and Type (G L)
IKN           Itti et al. [27, 26]       1998   ✓ ✓ ✓
AIM           Bruce & Tsotsos [8]        2005   ✓ ✓ ✓
GBVS          Harel et al. [21]          2006   ✓ ✓ ✓
SDLF          Torralba et al. [52]       2006   ✓ ✓ ✓
SR & PFT      Hou & Zhang [23]           2007   ✓ ✓
PQFT          Guo & Zhang [20]           2008   ✓ ✓
ICL           Hou & Zhang [24]           2008   ✓ ✓ ✓ ✓
SUN           Zhang et al. [63]          2008   ✓ ✓
SDSR          Seo & Milanfar [48]        2009   ✓ ✓ ✓ ✓
FT            Achanta et al. [1]         2009   ✓ ✓
DCTS/SIGS     Hou et al. [22]            2011   ✓ ✓
SIM           Murray et al. [39]         2011   ✓ ✓ ✓
WMAP          Lopez-Garcia et al. [38]   2011   ✓ ✓ ✓ ✓
AWS           Garcia-Diaz et al. [18]    2012   ✓ ✓ ✓
CASD          Goferman et al. [19]       2012   ✓ ✓ ✓ ✓ ✓ ✓
RARE          Riche et al. [45]          2012   ✓ ✓ ✓
QDCT          Schauerte et al. [47]      2012   ✓ ✓
HFT           Li et al. [37]             2013   ✓ ✓
BMS           Zhang & Sclaroff [61]      2013   ✓ ✓
SALICON       Jiang et al. [28, 51]      2015   ✓ ✓
ML-Net        Cornia et al. [12]         2016   ✓ ✓
DeepGazeII    Kümmerer et al. [33]       2016   ✓ ✓
SalGAN        Pan et al. [40]            2017   ✓ ✓
ICF           Kümmerer et al. [33]       2017   ✓ ✓ ✓
SAM           Cornia et al. [13]         2018   ✓ ✓
NSWAM         Berga & Otazu [3]          2018   ✓ ✓ ✓
Sal-DCNN      Jiang et al. [34]          2019   ✓ ✓ ✓ ✓
Inspiration: {C: Cognitive/Biological, I: Information-Theoretic, P: Probabilistic, F: Fourier/Spectral, D: Machine/Deep Learning}; Type: {G: Global, L: Local}.

However, the limits of the prediction capability of these saliency models arise as a consequence of the validity of the evaluation from eye tracking experimentation. We aim to provide a new dataset with uniquely synthetic images and a benchmark, studying for each saliency model:
1. How do model inspiration and feature processing influence model predictions?
2. How does the temporality of fixations affect model predictions?
3. How do low-level feature type and contrast influence a model's psychophysical measurements?
2. SID4VAM: Synthetic Image Dataset for Visual Attention Modeling
Fixations were collected from 34 participants in a dataset of 230 images [2]. Images were displayed in a resolution of × px and fixations were captured at about pixels per degree of visual angle using an SMI RED binocular eye tracker. The dataset has been split into two tasks: Free-Viewing (FV) and Visual Search (VS). For the FV task, participants had to freely look at the image during 5 seconds. On each stimulus there was a salient area of interest (AOI). For the VS task, participants had the instruction to visually locate the AOI, setting the salient region as the different object. For this task, the trigger for prompting the transition to the next image was gazing inside the AOI or pressing a key (for reporting absence of the target). The stimuli generated for both tasks can be observed in Figs. 1-2.

The dataset was divided into 15 stimulus types, 5 corresponding to FV and 10 to VS. Some of these blocks had distinct subsets of images (due to the alteration of either target or distractor shape, color, configuration and background properties), enabling a total of 33 subtypes. Each of these blocks was individually generated as a low-level feature category, which had its own type of feature contrast between the salient region and the rest of the distractors/background. FV categories were mainly designed for analyzing preattentive effects (Fig. 1): 1) Corner Salience, 2) Visual Segmentation by Bar Angle, 3) Visual Segmentation by Bar Length, 4) Contour Integration by Bar Continuity and 5) Perceptual Grouping by Distance. VS categories were based on feature-singleton search stimuli, where there was a unique salient target and a set of distractors and/or altered background (Fig. 2). These categories were: 6) Feature and Conjunctive Search, 7) Search Asymmetries, 8) Search in a Rough Surface, 9) Color Search, 10) Brightness Search, 11) Orientation Search, 12) Dissimilar Size Search, 13) Orientation Search with Heterogeneous distractors, 14) Orientation Search with Non-linear patterns, 15) Orientation Search with distinct Categorization. Stimuli for SID4VAM's dataset were inspired by previous psychophysical experimentation [64, 58, 50].

Dataset stimuli were generated with 7 specific instances of feature contrast (Ψ), corresponding to hard (Ψ_h = {..}) and easy (Ψ_e = {..}) difficulties of finding the salient regions. These feature contrasts had their own parametrization (following Berga et al.'s psychophysical formulation [2, Section 2.4]), corresponding to the feature differences between the salient target and the rest of the distractors (e.g. differences of target orientation, size, saturation, brightness...) or to global effects (e.g. overall distractor scale, shape, background color, background brightness).
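To illustrate what such a parametrization looks like, the hypothetical mapping below converts a contrast level Ψ into a target-distractor orientation difference (the linear scaling and the 90-degree maximum are stand-in assumptions; the actual values follow [2, Section 2.4]):

```python
# Hypothetical illustration of the contrast parametrization; the actual
# mapping follows Berga et al. [2, Section 2.4].
PSI_LEVELS = list(range(1, 8))   # 7 contrast instances, from hard to easy
MAX_DELTA_THETA = 90.0           # assumed maximal orientation contrast (deg)

def orientation_contrast(psi):
    """Target-distractor orientation difference for contrast level psi."""
    return MAX_DELTA_THETA * psi / max(PSI_LEVELS)

trials = [{"feature": "orientation", "psi": p,
           "delta_theta_deg": orientation_contrast(p)} for p in PSI_LEVELS]
```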
3. Experiments
Fixation maps from eye tracking data are generated by assigning each fixation location to a binary map. Fixation density maps are created by convolving a Gaussian filter with the fixation maps; this simulates the smoothing caused by deviations of σ = deg given from eye tracking experimentation, as recommended by LeMeur & Baccino [36].

Typically, location-based saliency metrics (AUC_Judd, AUC_Borji, NSS) increase their score when fixation locations fall inside the predicted saliency maps (TP). Conversely, scores decrease when fixation locations are not captured by the saliency maps (FN) or when saliency appears at locations with no fixations present (FP). In distribution-based metrics (CC, SIM, KL), saliency maps score higher when they have higher correlations with respect to the fixation density map distributions. We have to point out that shuffled metrics (sAUC, InfoGain) consider FP values when saliency maps coincide with other fixation map locations or a baseline (here, corresponding to the center bias), which are not representative data for saliency prediction. Prediction metrics and their calculations are largely explained by Bylinskii et al. [11]. Our saliency metric scores and the pre-processing used for this experimentation have been replicated from the official saliency benchmarking procedure [10].

Figure 1: Free-Viewing stimuli (types 1-5; columns 1-7 span feature contrast Ψ from hard to easy).

Psychometric evaluation of saliency predictions has been done with the Saliency Index (SI) [49, 50]. This metric evaluates the energy of a saliency map inside a salient region (S_t), which would enclose a salient object, compared to the energy outside the salient region (S_b). It allows evaluation of a saliency map when the salient region is known, considering in absolute terms the distribution of saliency of a particular AOI/mask. Here we show the formula of the SI: $SI(S_t, S_b) = (S_t - S_b)/S_b$.

Saliency maps have been computed for the models shown in Table 2. Model evaluations have been divided according to their inspiration, and prediction scores have been evaluated with saliency metrics and in psychophysical terms.
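A minimal sketch of the two computations above (fixation density maps and the SI), assuming fixations come as pixel coordinates and that the pixels-per-degree factor of the setup is known:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def fixation_density_map(fixations, shape, px_per_deg):
    """Binary fixation map smoothed with a Gaussian whose deviation spans
    one degree of visual angle (an assumption here; the recommended value
    follows LeMeur & Baccino [36])."""
    fixation_map = np.zeros(shape)
    for x, y in fixations:
        fixation_map[int(y), int(x)] = 1.0
    return gaussian_filter(fixation_map, sigma=px_per_deg)

def saliency_index(saliency_map, aoi_mask):
    """SI = (S_t - S_b) / S_b, where S_t and S_b are the saliency energies
    inside and outside the salient AOI (the mean is an assumption)."""
    s_t = saliency_map[aoi_mask > 0].mean()
    s_b = saliency_map[aoi_mask == 0].mean()
    return (s_t - s_b) / s_b
```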
Previous saliency benchmarks [6, 42, 9, 7, 10] reveal that Deep Learning models such as SALICON, ML-Net, SAM-ResNet, SAM-VGG, DeepGazeII or SalGAN score highest on both shuffled and unshuffled metrics. In this section we aim to evaluate whether saliency maps that scored highly on fixation prediction do so with a synthetic image dataset, and whether their inspiration influences their performance. We present metric scores of saliency map predictions over the whole dataset in Table 3 and plots in Fig. 3. Saliency metric scores reveal that, overall, Spectral/Fourier-based saliency models predict fixations better on a synthetic image dataset. (Code for metrics: https://github.com/dberga/saliency)

Figure 2: Visual Search stimuli (columns span feature contrast Ψ from hard to easy).

Table 3: Saliency metric scores for SID4VAM
Model          AUCj   AUCb   CC     NSS    KL      SIM    sAUC   InfoGain
GT
Baseline-CG
IKN            0.686  0.678  0.283  0.878  1.748   0.380  0.608  -0.233
SIM            0.650  0.641  0.189  0.694  1.702   0.357  0.619  -0.148
AWS            0.679  0.667  0.255  1.088  1.592   0.373  0.672  0.013
NSWAM          0.614  0.610  0.136  0.529  1.686   0.335  0.622  -0.150
AIM            0.570  0.566  0.122  0.473  14.472  0.224  0.557  -18.182
ICL            0.737  0.717  0.343  1.100  1.788   0.405  0.624  -0.313
RARE           0.707  0.622  0.204  1.046  1.736   0.444  0.633  -0.158
CASD           0.733  0.669  0.408  1.904  2.395   0.403  0.652  -1.046
GBVS           0.747  0.718  0.400  1.464  1.363   0.413  0.628  0.331
SDLF           0.620  0.607  0.156  0.585  3.954   0.322  0.596  -3.244
SUN            0.542  0.532  0.080  0.333  16.408  0.165  0.530  -21.024
SDSR           0.672  0.665  0.192  0.639  1.904   0.365  0.642  -0.467
BMS            0.677  0.643  0.274  1.143  2.306   0.397  0.627  -0.958
ICF            0.618  0.566  0.141  0.700  3.274   0.306  0.564  -2.300
SR             0.748  0.694  0.420  1.916  1.432   0.431  0.685  0.348
PFT            0.705  0.692  0.398  1.885  2.227   0.377  0.684  -0.893
PQFT           0.701  0.693  0.387  1.774  2.197   0.373  0.684  -0.856
FT             0.521  0.518  0.072  0.331  7.552   0.129  0.517  -8.498
DCTS           0.729  0.724  0.439  2.004  1.363   0.396  0.708  0.337
WMAP           0.729  0.709  0.468  2.136  2.283   0.397         -0.981
QDCT           0.717  0.706  0.425  1.986  1.677   0.391  0.695  -0.105
HFT
SalGAN         0.715  0.662  0.287  0.883  2.506   0.373  0.593  -1.350
OpenSALICON    0.692  0.673  0.284  0.956  1.549   0.375  0.615  0.052
DeepGazeII     0.639  0.606  0.176  0.714  2.023   0.346  0.597  -0.587
SAM-VGG        0.537  0.523  0.026  0.070  11.947  0.216  0.503  -14.954
SAM-ResNet     0.727  0.673  0.305  0.967  2.610   0.388  0.600  -1.475
ML-Net         0.700  0.676  0.283  0.883  2.169   0.373  0.595  -0.837
Sal-DCNN       0.726  0.650  0.288  0.961  3.676   0.359  0.580  -3.05

Rows are grouped by model inspiration: Cognitive/Biological, Information-Theoretic, Probabilistic, Fourier/Spectral, Machine/Deep Learning.
Models such as HFT and WMAP remarkably outperform other saliency models. Regarding other model inspirations, AWS scores higher than other Cognitive/Biologically-inspired models, while GBVS and CASD outperform other Probabilistic/Bayesian and Information-Theoretic saliency models respectively. Among Deep Learning models, SAM-ResNet and OpenSALICON are the ones with the highest scores. Although there are differences in terms of model performance and model inspiration, similarities in model mechanisms can reveal phenomena of increasing and decreasing prediction statistics. This phenomenon is present for Spectral/Fourier-based and Cognitive/Biologically-inspired models, which all present similar performance and balanced scores throughout the distinct metrics. It should be considered that the sAUC and InfoGain metrics are more reliable than the others (for which the baseline center Gaussian sometimes acquires higher performance than most saliency models). In these terms, the models shown in Fig. 4 are efficient saliency predictors for this dataset. We can also point out that models which process uniquely local feature conspicuity scored lower on SID4VAM fixation predictions, whereas the ones that processed global conspicuity scored higher. This phenomenon might be related to the distinction between focal (near the fovea) and ambient (away from the fovea) fixations, relative to the fixation order and the spatial locations of fixations [15].

The evaluation of gaze-wise model predictions has been done by grouping the fixations of every instance separately (a sketch of this grouping is given at the end of this subsection). We have plotted results of the sAUC saliency metric for each model (Fig. 5), and it is observable that model performance decreases with fixation number, meaning that saliency is more likely to be predicted during the first fixations. For evaluating the temporal relationship between human and model performance (sAUC), we have performed Spearman's (ρ) correlation tests for each fixation; it can be observed that IKN, ICL, GBVS, QDCT and ML-Net follow a similar slope to the GT, contrary to the case of the baseline center Gaussian.

Previous studies [4, 7, 2] found that several factors such as feature type, feature contrast, task, temporality of fixations and the center bias alternatively contribute to eye movement guidance. The HVS has a specific contrast sensitivity to each stimulus feature, so saliency models should adapt in the same way in order to be plausible in psychometric terms. Here we show how saliency prediction varies significantly upon feature contrast and the type of low-level features found in images. In Fig. 6a it can be seen that saliency models increase SI with feature contrast "Ψ", following the distribution of human fixations. Most prediction SI scores show a higher slope with easy targets (salient objects with higher contrast with respect to the rest, when Ψ > ), with CASD and HFT being the models with the highest SI at higher contrasts.

Contextual influences (here represented as the distinct low-level features that appear in the image) contribute distinctively to the saliency induced by objects that appear in the scene [25]. We suggest that not only the semantic content that appears in the scene affects saliency, but that feature characteristics also significantly impact how salient objects are. This phenomenon is observable in Fig. 6b and occurs for both human fixations and model predictions, specifically with the highest SI for human fixations in 1) Corner Salience, 6) Feature and Conjunctive Search, 7) Search Asymmetries, 10) Brightness Search, 12) Dissimilar Size Search and 13) Orientation Search with Heterogeneous distractors. HFT and CASD have the highest SI when GT is higher (when human fixations are more probable to fall inside the AOI), even outperforming GT probabilities in the cases of 1) and 7).

Figure 3: Plots for saliency metric scores (best performance marked for each model inspiration).

Figure 6: Results of Saliency Index of model predictions upon Feature Contrast (a) and Feature Type (b).

We show in Fig. 7a that the overall Saliency Index of most saliency models is distinct when we vary the type of feature contrast (easy vs. hard) and the performed stimulus task (free-viewing vs. visual search). Spectral/Fourier-based models outperform other saliency models also in the SI metric. In line with the saliency metrics shown in the previous subsection, AWS, CASD, BMS, HFT and SAM-ResNet are the most efficient models for each model inspiration category respectively. It is observable in Fig. 7b that saliency models show higher performance for easy targets, with increased overall model performance differences with respect to hard targets (Fig. 7c). Similarly, visual search targets show lower difficulty (higher SI) for finding predicted fixations inside the AOI than the free-viewing cases (Fig. 7d-e). Distinct SI curves upon feature contrast are also reported, revealing that contrast sensitivities are distinct for each low-level feature. Spearman's correlation tests in Fig. 6b show which models correlate with human performance over feature contrast and which ones do so with the baseline (designating higher center biases). These results show that models such as AWS, CASD, BMS, DCTS or DeepGazeII highly correlate with human contrast sensitivities and do not correlate with the baseline center Gaussian. Matching human contrast sensitivities on low-level visual features would be an interesting direction for making future saliency models predict saliency accurately, as well as for better understanding how the HVS processes visual scenes.

Figure 7: Results of Saliency Index metric scores from dataset model predictions (a), for easy/hard difficulties (b-c) and Free-Viewing/Visual Search tasks (d-e).
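A sketch of the gaze-wise analysis described above, under stated assumptions (per-fixation metric scores and fixation-order indices are available as flat lists; function and variable names are ours):

```python
import numpy as np
from scipy.stats import spearmanr

def gazewise_curve(scores, fixation_orders, max_order=10):
    """Average a per-fixation metric (e.g. sAUC) grouped by fixation order,
    yielding one value for the 1st, 2nd, ... fixation."""
    curve = []
    for k in range(1, max_order + 1):
        vals = [s for s, o in zip(scores, fixation_orders) if o == k]
        curve.append(np.mean(vals) if vals else np.nan)
    return np.array(curve)

# Temporal consistency between model and human (GT) performance curves:
# rho, pval = spearmanr(gazewise_curve(model_scores, orders),
#                       gazewise_curve(gt_scores, orders),
#                       nan_policy="omit")
```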
4. SIG4VAM: Generating synthetic image patterns for training saliency models
We have also provided a synthetic image generator (SIG4VAM), able to generate similar psychophysical images with other types of patterns. A larger set of images can be created by parametrizing factors such as stimulus size, number of distractors, feature contrast, etc. For instance, if the same 15 types (33 subtypes) of stimuli are selected with 28 contrast (Ψ) instances instead, a dataset of 33 × 28 = 924 stimuli is generated. In addition, synthetic images with high-level features can be created using SIG4VAM (Fig. 8), by changing background properties, setting specific object instances for targets/distractors, as well as their low-level properties (orientation, brightness, color, etc.). (Code for generating synthetic stimuli: https://github.com/dberga/sig4vam)

Figure 4: Examples of dataset stimuli and saliency map predictions (rows: image, human fixations, and the AWS, NSWAM, RARE, CASD, GBVS, SDSR, WMAP, HFT, OpenSALICON and SAM-ResNet models). Only the two models of each inspiration category with the highest performance on shuffled saliency metric scores (sAUC and InfoGain) are shown.

Figure 5: sAUC gaze-wise prediction scores.

Figure 8: Examples of generating synthetic images with high-level features (i.e. objects as targets/distractors), changing low-level feature properties (a-b) or background (c).

SID4VAM has been proposed as a possible initial test set for saliency prediction, where fixation data and binary masks are available for benchmarking. Training sets can be obtained with SIG4VAM (GT binary masks of pop-out/salient regions are automatically generated), enabling the fitting of contrast sensitivities and the construction of loss functions upon scores of fixation probability distributions [11] and salient region detection metrics [56] (e.g. SI, PR, MAE, S-/F-measures, etc.). Recent strategies [46] that synthetically modify real scenes have shown dramatic changes in object detection scores, using "object transplanting" (superposing an object on distinct locations of the scene). In these terms, SIG4VAM could be extended for evaluating model predictions over distinct contexts and tasks.
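To give an idea of the kind of parametrized pattern involved, here is an illustrative re-implementation (not SIG4VAM's actual code; grid size, patch size and bar thickness are assumptions) of an orientation-singleton search display together with its GT binary mask:

```python
import numpy as np

def oriented_bar(size=21, theta_deg=0.0, thickness=3.0):
    """Square patch containing one bar at the given orientation."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    t = np.deg2rad(theta_deg)
    distance = np.abs(x * np.sin(t) - y * np.cos(t))  # distance to bar axis
    return (distance < thickness / 2.0).astype(float)

def singleton_display(grid=7, patch=21, distractor_theta=0.0,
                      delta_theta=45.0, seed=0):
    """Orientation-singleton search display and its GT target mask."""
    rng = np.random.default_rng(seed)
    image = np.zeros((grid * patch, grid * patch))
    mask = np.zeros_like(image)
    ti, tj = rng.integers(grid), rng.integers(grid)   # random target cell
    for i in range(grid):
        for j in range(grid):
            theta = distractor_theta + (delta_theta if (i, j) == (ti, tj) else 0.0)
            image[i*patch:(i+1)*patch, j*patch:(j+1)*patch] = oriented_bar(patch, theta)
    mask[ti*patch:(ti+1)*patch, tj*patch:(tj+1)*patch] = 1.0
    return image, mask
```

Randomizing the target cell, as in this sketch, also reduces the center bias effect discussed in Section 5.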
5. Discussion
Previous saliency benchmarks show that eye movements are efficiently predicted by the latest Deep Learning saliency models. This is not the case with synthetic images, even for models pre-trained with sets of psychophysical patterns (e.g. SAM with CAT2000). This suggests that their computations of saliency do not arise as a general mechanism. These methods have been trained with eye tracking data (real images containing high-level features), and although several factors guide eye movements, it has been shown [58] that low-level saliency (i.e. pop-out effects) is one of the most influential for determining bottom-up attention. Another possibility is that we randomly parametrized the salient object location, lowering the center bias effect. With this benchmark we can evaluate how salient a particular object is by parametrizing its low-level feature contrast with respect to the rest of the distractors and/or background. Therefore, the evaluation of saliency can be done accounting for feature contrast, analyzing the importance of objects that are easier to detect or found preattentively. Previous saliency benchmarks usually evaluate eye tracking data spatially across all fixations; we also propose the evaluation of saliency across fixations, which is an issue for further study.

Future steps for this study would include the evaluation of saliency in dynamic scenes [44, 35] using synthetic videos with either static or dynamic cameras. This would allow us to investigate the impact of temporally-variant features (e.g. flicker and motion) on saliency predictions. Another analysis to consider is the impact of the spatial location of salient features (in terms of eccentricity towards the image center), which might affect each model distinctively. Each of the steps in saliency modelization (i.e. feature extraction, conspicuity computation and feature fusion) might have a distinct influence on eye movement predictions. Acknowledging that conspicuity computations are the key factor for computing saliency, a future evaluation of how each mechanism contributes to model performance might be of interest.
6. Conclusion
Contrary to the current state of the art, we reveal that saliency models are far from acquiring HVS performance in terms of predicting bottom-up attention. We prove this with a novel dataset, SID4VAM, which contains uniquely synthetic images, generated with specific low-level feature contrasts. In this study, we show that overall Spectral/Fourier-based saliency models (i.e. HFT and WMAP) clearly outperform other saliency models when detecting a salient region with a particular conspicuous object. Other models such as AWS, CASD, GBVS and SAM-ResNet are the best predictor candidates for each of the other saliency model inspiration categories respectively (Cognitive/Biological, Information-Theoretic, Probabilistic and Deep Learning). In particular, visual features learned with deep learning models might not be suitable for efficiently predicting saliency using psychophysical images. Here we pose that saliency detection might not be directly related to object detection; therefore, training upon high-level object features might not be significantly favorable for predicting saliency in these terms. Future saliency modelization and evaluation should account for low-level feature distinctiveness in order to accurately model bottom-up attention. Here we remark the need for analyzing other factors such as the order of fixations, the influences of the task and the psychometric parameters of the salient regions.
7. Acknowledgements
This work was funded by the MINECO (DPI2017-89867-C2-1-R, TIN2015-71130-REDT), AGAUR (2017-SGR-649), CERCA Programme / Generalitat de Catalunya, in part by Xunta de Galicia under Project ED431C2017/69, in part by the Consellería de Cultura, Educación e Ordenación Universitaria (accreditation 2016-2019, ED431G/08) and the European Regional Development Fund, and in part by Xunta de Galicia and the European Union (European Social Fund). We also acknowledge the generous GPU support from NVIDIA.

References

[1] Radhakrishna Achanta, Sheila Hemami, Francisco Estrada, and Sabine Süsstrunk. Frequency-tuned salient region detection. In CVPR, jun 2009.
[2] David Berga, Xosé R. Fdez-Vidal, Xavier Otazu, Víctor Leborán, and Xosé M. Pardo. Psychophysical evaluation of individual low-level feature influences on visual attention. Vision Research, 154:60-79, 2019.
[3] David Berga and Xavier Otazu. A neurodynamical model of saliency prediction in V1. In Review, 2018. arXiv:1811.06308.
[4] Ali Borji and Laurent Itti. State-of-the-art in visual attention modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):185-207, jan 2013.
[5] Ali Borji and Laurent Itti. CAT2000: A large scale fixation dataset for boosting saliency research. CVPR 2015 workshop on "Future of Datasets", 2015.
[6] Ali Borji, D. N. Sihite, and L. Itti. Quantitative analysis of human-model agreement in visual saliency modeling: A comparative study. IEEE Transactions on Image Processing, 22(1):55-69, jan 2013.
[7] Neil D. B. Bruce, Calden Wloka, Nick Frosst, Shafin Rahman, and John K. Tsotsos. On computational modeling of visual saliency: Examining what's right, and what's left. Vision Research, 116:95-112, nov 2015.
[8] Neil D. B. Bruce and John K. Tsotsos. Saliency based on information maximization. In Advances in Neural Information Processing Systems 18, pages 155-162. MIT Press, 2005.
[9] Z. Bylinskii, E. M. DeGennaro, R. Rajalingham, H. Ruda, J. Zhang, and J. K. Tsotsos. Towards the quantitative evaluation of visual attention models. Vision Research, 116:258-268, nov 2015.
[10] Zoya Bylinskii, Tilke Judd, Ali Borji, Laurent Itti, Frédo Durand, Aude Oliva, and Antonio Torralba. MIT saliency benchmark. http://saliency.mit.edu/.
[11] Zoya Bylinskii, Tilke Judd, Aude Oliva, Antonio Torralba, and Frédo Durand. What do different evaluation metrics tell us about saliency models? IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1-1, 2018.
[12] Marcella Cornia, Lorenzo Baraldi, Giuseppe Serra, and Rita Cucchiara. A deep multi-level network for saliency prediction. In International Conference on Pattern Recognition (ICPR), 2016.
[13] Marcella Cornia, Lorenzo Baraldi, Giuseppe Serra, and Rita Cucchiara. Predicting human eye fixations via an LSTM-based saliency attentive model. IEEE Transactions on Image Processing, 27(10):5142-5154, 2018.
[14] Howard E. Egeth and Steven Yantis. Visual attention: Control, representation, and time course. Annual Review of Psychology, 48(1):269-297, feb 1997.
[15] Michelle L. Eisenberg and Jeffrey M. Zacks. Ambient and focal visual processing of naturalistic activity. Journal of Vision, 16(2):5, mar 2016.
[16] G. T. Fechner. Elements of Psychophysics, Volume 1. Holt, Rinehart and Winston, the University of Michigan, 1966.
[17] J. H. Fecteau and D. P. Munoz. Salience, relevance, and firing: a priority map for target selection. Trends in Cognitive Sciences, 10(8):382-390, aug 2006.
[18] Anton Garcia-Diaz, Xosé R. Fdez-Vidal, Xosé M. Pardo, and Raquel Dosil. Saliency from hierarchical adaptation through decorrelation and variance normalization. Image and Vision Computing, 30(1):51-64, jan 2012.
[19] Stas Goferman, Lihi Zelnik-Manor, and Ayellet Tal. Context-aware saliency detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(10):1915-1926, oct 2012.
[20] Chenlei Guo, Qi Ma, and Liming Zhang. Spatio-temporal saliency detection using phase spectrum of quaternion Fourier transform. In CVPR, jun 2008.
[21] Jonathan Harel, Christof Koch, and Pietro Perona. Graph-based visual saliency. In B. Schölkopf, J. C. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 545-552. MIT Press, 2007.
[22] Xiaodi Hou, J. Harel, and C. Koch. Image signature: Highlighting sparse salient regions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(1):194-201, jan 2012.
[23] Xiaodi Hou and Liqing Zhang. Saliency detection: A spectral residual approach. In CVPR, jun 2007.
[24] Xiaodi Hou and Liqing Zhang. Dynamic visual attention: searching for coding length increments. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 681-688. Curran Associates, Inc., 2009.
[25] Alex D. Hwang, Hsueh-Cheng Wang, and Marc Pomplun. Semantic guidance of eye movements in real-world scenes. Vision Research, 51(10):1192-1205, may 2011.
[26] Laurent Itti and Christof Koch. A saliency-based search mechanism for overt and covert shifts of visual attention. Vision Research, 40(10-12):1489-1506, jun 2000.
[27] Laurent Itti, Christof Koch, and Ernst Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11):1254-1259, 1998.
[28] Ming Jiang, Shengsheng Huang, Juanyong Duan, and Qi Zhao. SALICON: Saliency in context. In CVPR, jun 2015.
[29] Tilke Judd, Frédo Durand, and Antonio Torralba. A benchmark of computational models of saliency to predict human fixations. CSAIL Technical Reports, jan 2012.
[30] Tilke Judd, Krista Ehinger, Frédo Durand, and Antonio Torralba. Learning to predict where humans look. In ICCV, sep 2009.
[31] Christof Koch and Shimon Ullman. Shifts in selective visual attention: Towards the underlying neural circuitry. In Matters of Intelligence, pages 115-141. Springer, 1987.
[32] Gert Kootstra, Bart de Boer, and Lambert R. B. Schomaker. Predicting eye fixations on complex visual stimuli using local symmetry. Cognitive Computation, 3(1):223-240, jan 2011.
[33] Matthias Kümmerer, Thomas S. A. Wallis, Leon A. Gatys, and Matthias Bethge. Understanding low- and high-level contributions to fixation prediction. In ICCV, oct 2017.
[34] Lai Jiang, Zhe Wang, Mai Xu, and Zulin Wang. Image saliency prediction in transformed domain: A deep complex neural network method. February 2019.
[35] Víctor Leborán, Anton Garcia-Diaz, Xosé R. Fdez-Vidal, and Xosé M. Pardo. Dynamic whitening saliency. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(5):893-907, may 2017.
[36] Olivier LeMeur and Thierry Baccino. Methods for comparing scanpaths and saliency maps: strengths and weaknesses. Behavior Research Methods, 45(1):251-266, jul 2012.
[37] Jian Li, Martin D. Levine, Xiangjing An, Xin Xu, and Hangen He. Visual saliency based on scale-space analysis in the frequency domain. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(4):996-1010, apr 2013.
[38] Fernando Lopez-Garcia, Xosé Ramón Fdez-Vidal, Xosé Manuel Pardo, and Raquel Dosil. Scene recognition through visual attention and image features: A comparison between SIFT and SURF approaches. In Object Recognition. InTech, apr 2011.
[39] Naila Murray, Maria Vanrell, Xavier Otazu, and C. Alejandro Parraga. Saliency estimation using a non-parametric low-level vision model. In CVPR 2011, jun 2011.
[40] Junting Pan, Cristian Canton, Kevin McGuinness, Noel E. O'Connor, Jordi Torres, Elisa Sayrol, and Xavier Giró-i-Nieto. SalGAN: Visual saliency prediction with generative adversarial networks. In CVPR 2017 Scene Understanding Workshop (SUNw), January 2017.
[41] Subramanian Ramanathan, Harish Katti, Nicu Sebe, Mohan Kankanhalli, and Tat-Seng Chua. An eye fixation database for saliency detection in images. In Computer Vision - ECCV 2010, pages 30-43. Springer, 2010.
[42] Nicolas Riche, Matthieu Duvinage, Matei Mancas, Bernard Gosselin, and Thierry Dutoit. Saliency and human fixations: State-of-the-art and study of comparison metrics. In ICCV, dec 2013.
[43] Nicolas Riche and Matei Mancas. Bottom-up saliency models for still images: A practical review. In From Human Attention to Computational Attention, pages 141-175. Springer New York, 2016.
[44] Nicolas Riche and Matei Mancas. Bottom-up saliency models for videos: A practical review. In From Human Attention to Computational Attention, pages 177-190. Springer New York, 2016.
[45] Nicolas Riche, Matei Mancas, Bernard Gosselin, and Thierry Dutoit. RARE: A new bottom-up saliency model. In ICIP, sep 2012.
[46] Amir Rosenfeld, Richard Zemel, and John K. Tsotsos. The elephant in the room, 2018. arXiv:1808.03305.
[47] Boris Schauerte and Rainer Stiefelhagen. Quaternion-based spectral saliency detection for eye fixation prediction. In Computer Vision - ECCV 2012, pages 116-129. Springer, 2012.
[48] Hae Jong Seo and Peyman Milanfar. Static and space-time visual saliency detection by self-resemblance. Journal of Vision, 9(12):15-15, nov 2009.
[49] Alireza Soltani and Christof Koch. Visual saliency computations: Mechanisms, constraints, and the effect of feedback. Journal of Neuroscience, 30(38):12831-12843, sep 2010.
[50] Michael W. Spratling. Predictive coding as a model of the V1 saliency map hypothesis. Neural Networks, 26:7-28, feb 2012.
[51] Christopher Lee Thomas. OpenSALICON: An open source implementation of the SALICON saliency model. Technical Report TR-2016-02, University of Pittsburgh, 2016.
[52] Antonio Torralba, Aude Oliva, Monica S. Castelhano, and John M. Henderson. Contextual guidance of eye movements and attention in real-world scenes: The role of global features in object search. Psychological Review, 113(4):766-786, 2006.
[53] Anne M. Treisman and Garry Gelade. A feature-integration theory of attention. Cognitive Psychology, 12(1):97-136, jan 1980.
[54] John K. Tsotsos, Sean M. Culhane, Winky Yan Kei Wai, Yuzhong Lai, Neal Davis, and Fernando Nuflo. Modeling visual attention via selective tuning. Artificial Intelligence, 78(1-2):507-545, oct 1995.
[55] Richard Veale, Ziad M. Hafed, and Masatoshi Yoshida. How is visual salience computed in the brain? Insights from behaviour, neurobiology and modelling. Philosophical Transactions of the Royal Society B: Biological Sciences, 372(1714):20160113, jan 2017.
[56] Wenguan Wang, Qiuxia Lai, Huazhu Fu, Jianbing Shen, and Haibin Ling. Salient object detection in the deep learning era: An in-depth survey, 2019. arXiv:1904.09146.
[57] Stefan Winkler and Ramanathan Subramanian. Overview of eye tracking datasets. In QoMEX, jul 2013.
[58] J. M. Wolfe. Guided search 4.0: A guided search model that does not require memory for rejected distractors. Journal of Vision, 1(3):349-349, mar 2010.
[59] Jeremy M. Wolfe, Evan M. Palmer, and Todd S. Horowitz. Reaction time distributions constrain models of visual search. Vision Research, 50(14):1304-1311, jun 2010.
[60] Steven Yantis and Howard E. Egeth. On the distinction between visual salience and stimulus-driven attentional capture. Journal of Experimental Psychology: Human Perception and Performance, 25(3):661-676, 1999.
[61] Jianming Zhang and Stan Sclaroff. Saliency detection: A Boolean map approach. In ICCV, dec 2013.
[62] Liming Zhang and Weisi Lin. Selective Visual Attention. John Wiley & Sons (Asia) Pte Ltd, mar 2013.
[63] Lingyun Zhang, Matthew H. Tong, Tim K. Marks, Honghao Shan, and Garrison W. Cottrell. SUN: A Bayesian framework for saliency using natural statistics. Journal of Vision, 8(7):32, dec 2008.
[64] Li Zhaoping and Keith A. May. Psychophysical tests of the hypothesis of a bottom-up saliency map in primary visual cortex. PLoS Computational Biology, 3(4):e62, 2007.