SID4VAM: A Benchmark Dataset with Synthetic Images for Visual Attention Modeling
David Berga, Xosé R. Fdez-Vidal, Xavier Otazu, Xosé M. Pardo
Computer Vision Center, Universitat Autònoma de Barcelona, Spain
CiTIUS, Universidade de Santiago de Compostela, Spain
{dberga,xotazu}@cvc.uab.es, {xose.vidal,xose.pardo}@usc.es

Abstract
A benchmark of saliency model performance with a synthetic image dataset is provided. Model performance is evaluated through saliency metrics as well as the influence of model inspiration and consistency with human psychophysics. SID4VAM is composed of 230 synthetic images with known salient regions. Images were generated with 15 distinct types of low-level features (e.g. orientation, brightness, color, size...) with a target-distractor pop-out type of synthetic patterns. We have used Free-Viewing and Visual Search task instructions and 7 feature contrasts for each feature category. Our study reveals that state-of-the-art Deep Learning saliency models do not perform well with synthetic pattern images; instead, models with Spectral/Fourier inspiration outperform others in saliency metrics and are more consistent with human psychophysical experimentation. This study proposes a new way to evaluate saliency models in the forthcoming literature, accounting for synthetic images with uniquely low-level feature contexts, distinct from previous eye tracking image datasets.
1. Introduction
Although eye movements are indicators of "where people look at", a more complex question arises as a consequence for understanding bottom-up visual attention:

Are all eye movements equally valuable for determining saliency?
According to the initial hypotheses in visual attention [53, 58], we could define visual saliency as the perceptual quality that makes our human visual system (HVS) gaze towards certain areas that pop out in a scene due to their distinctive visual characteristics. Therefore, this capacity (saliency) cannot be influenced by top-down factors, which seemingly guide eye movements regardless of stimulus characteristics [60]. Prior knowledge of whether a stimulus area is salient or not, when it becomes salient, and why, are issues that need to be accounted for in saliency evaluation [7, 2].

Common frameworks for predicting saliency have been created since Koch & Ullman's seminal work [31]. This framework defined a theoretical basis for modeling the early visual stages of the HVS in order to obtain a representation of the saliency map. By extracting sensory signals as feature maps, processing the conspicuous objects and selecting the maximally-active locations through winner-take-all (WTA) mechanisms, it is possible to obtain a unique/master saliency map. However, it was hypothesized that visual attention combines both bottom-up (saliency) and top-down (relevance) mechanisms in a central representation (priority) [14, 17]. These top-down specificities (e.g. world, object, task, etc.) were later accounted for in the selective tuning model as a hierarchy of WTA-like processes [54]. Although the neural correlates simultaneously involved in saliency have been investigated [55], the direct relation between saliency and eye movements defined in a unique computational framework requires further study.

Itti et al. initially introduced a computational biologically-inspired model [27] composed of 3 main steps: First, feature maps are extracted using oriented linear DoG filters for each chromatic channel. Second, feature conspicuity is computed using center-surround differences. Third, conspicuity maps are integrated with linear WTA mechanisms. This architecture has been the main inspiration for current saliency models [62, 43], which alternatively use distinct mechanisms (accounting for different levels of processing, context or tuning depending on the scene) while preserving the same or a similar structure for these steps. Although current state-of-the-art models precisely resemble eye-tracking fixation data [6, 9], we question whether these models represent saliency. We will test this hypothesis with a novel synthetic image dataset.
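To make these three steps concrete, below is a minimal sketch of such a pipeline (our own illustration under simplifying assumptions, not Itti et al.'s reference implementation; orientation/Gabor channels are omitted and all filter parameters are illustrative):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def center_surround(channel, sigma_center=2.0, sigma_surround=8.0):
    """Step 2: conspicuity as a center-surround (DoG-like) difference."""
    return np.abs(gaussian_filter(channel, sigma_center)
                  - gaussian_filter(channel, sigma_surround))

def master_saliency_map(rgb):
    """Steps 1-3 of an Itti-style pipeline (illustrative parameters)."""
    # Step 1: feature maps (intensity and two color-opponent channels).
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    features = [(r + g + b) / 3.0, r - g, b - (r + g) / 2.0]
    # Step 2: per-feature conspicuity maps.
    conspicuity = [center_surround(f) for f in features]
    # Step 3: normalize and linearly integrate; a strict WTA would instead
    # iteratively select the most active location of the combined map.
    return sum(c / (c.max() + 1e-8) for c in conspicuity) / len(conspicuity)
```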
In order to determine whether an object or a feature attracts attention, initial experimentation assessed feature discriminability upon display characteristics (e.g. display size, feature contrast...) during visual search tasks [53, 58]. Parallel search occurs when features are processed preattentively, so search targets are found efficiently regardless of distractor properties. Instead, serial search happens when attention is directed to one item at a time, requiring a "binding" process to allow each object to be discriminated. In this case, search time decreases with higher target-distractor contrast and/or lower set size (following the Weber law [16]). More recent studies replicated these experiments by providing real images with parametrization of feature contrast and/or set size (iLab USC, UCL, VAL Harvard, ADA KCL), combining visual search or visual segmentation tasks, however without providing eye tracking data (Table 1B). Rather, current eye movement datasets provide fixations and scanpaths from real scenes during free-viewing tasks. These image datasets are usually composed of real image scenes (Table 1A), either from indoor/outdoor scenes (Toronto, MIT1003, MIT300), nature scenes (KTH) or semantically-specific categories such as faces (NUSEF) and several others (CAT2000). A complete list of eye tracking datasets is given in Winkler & Subramanian's overview [57]. The CAT2000 training subset of "Pattern" images (CAT2000p) provides eye movement data with psychophysical/synthetic image patterns during 5 seconds of free-viewing. However, no parametrization of feature contrast nor stimulus properties is given. A synthetic image dataset could provide information on how attention depends on feature contrast and other stimulus properties under distinct tasks. We describe in Section 2 how we do so with our novel SID4VAM dataset.
Table 1: Characteristics of eye tracking datasets

A: Real Images
Dataset           Task      TS    PP   PM   DO
MIT1003 [30]      FV        1003  15        ✓
NUSEF [41]        FV        758   25        ✓
KTH [32]          FV        99    31        ✓
MIT300 [29]       FV        300   39        ✓
CAT2000 [5]       FV        4000  24        ✓

B: Psychophysical Pattern / Synthetic Images
Dataset           Task      TS    PP   PM   DO
UCL [64]          VS & SG   2784  5    ✓
VAL Harvard [59]  VS        4000  30   ✓
ADA KCL [50]      -         ~430  -    ✓
CAT2000p [5]      FV        100   18        ✓
SID4VAM (Ours)    FV & VS   230   34   ✓    ✓

TS: total number of stimuli, PP: participants, PM: parametrization, DO: fixation data is available online, FV: Free-Viewing, VS: Visual Search, SG: visual segmentation.
Being inspired by Itti et al.'s architecture, a myriad of computational models has been proposed with distinct computational approaches, from biological, mathematical and physical inspiration [62, 43]. By processing global and/or local image features to calculate feature conspicuity, these models are able to generate a master saliency map to predict human fixations (Table 2). Taking up Judd et al.'s [29] and Borji & Itti's [4] reviews, we have grouped saliency model inspiration into five general categories according to how saliency is computed:

• Cognitive/Biological (C): Saliency is usually generated by mimicking HVS neuronal mechanisms or specific patterns found in human eye movement behavior. Feature extraction is generally based on Gabor-like filters and their integration with WTA-like mechanisms.

• Information-Theoretic (I): These models compute saliency by selecting the regions that maximize the visual information of scenes.

• Probabilistic (P): Probabilistic models generate saliency by optimizing the probability of performing certain tasks and/or finding certain patterns. These models use graphs, Bayesian, decision-theoretic and other approaches for their computations.

• Spectral/Fourier-based (F): Spectral analysis or Fourier-based models derive saliency by extracting or manipulating features in the frequency domain (e.g. spectral frequency or phase); a minimal sketch of one such model is shown after this list.

• Machine/Deep Learning (D): These techniques are based on training existing machine/deep learning architectures (e.g. CNN, RNN, GAN...) by minimizing the error of predicting fixations of images from existing eye tracking data or labeled salient regions.
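As a reference point for the Spectral/Fourier category, the following is a compact sketch of the spectral residual approach of Hou & Zhang [23] (our own condensed reading of the method; smoothing sizes are illustrative assumptions):

```python
import numpy as np
from scipy.ndimage import uniform_filter, gaussian_filter

def spectral_residual_saliency(gray):
    """Spectral residual saliency (after Hou & Zhang [23])."""
    spectrum = np.fft.fft2(gray)
    log_amplitude = np.log(np.abs(spectrum) + 1e-8)
    phase = np.angle(spectrum)            # the phase spectrum is kept intact
    # The residual is the log amplitude minus its local average.
    residual = log_amplitude - uniform_filter(log_amplitude, size=3)
    # Back to the image domain; squaring and blurring give the saliency map.
    saliency = np.abs(np.fft.ifft2(np.exp(residual + 1j * phase))) ** 2
    return gaussian_filter(saliency, sigma=3)
```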
Visual saliency is a term coined on a perceptual basis. According to this principle, a correct modelization of saliency should consider specific experimental conditions upon a visual attention task. The output of such a model can vary per stimulus or task, but must arise as a common behavioral phenomenon in order to validate the general hypothesis definition from Treisman, Wolfe, Itti and colleagues [53, 58, 26]. Eye movements have been considered the main behavioral markers of visual attention. But understanding saliency means not only proving how visual fixations can be predicted, but simulating which patterns of eye movements are gathered from vision and its sensory signals (here avoiding any top-down influences). This challenge requires eye tracking researchers to consider several experimental issues (with respect to contextual, contrast, temporal, oculomotor and task-related biases) when capturing bottom-up attention, largely explained by Borji et al. [4], Bruce et al. [7] and lately by Berga et al. [2]. Computational models advance several ways to predict, to some extent, human visual fixations.

Table 2: Description of saliency models
Model         Authors                    Year   Inspiration (C I P F D) and Type (G L)
IKN           Itti et al. [27, 26]       1998   ✓ ✓ ✓
AIM           Bruce & Tsotsos [8]        2005   ✓ ✓ ✓
GBVS          Harel et al. [21]          2006   ✓ ✓ ✓
SDLF          Torralba et al. [52]       2006   ✓ ✓ ✓
SR & PFT      Hou & Zhang [23]           2007   ✓ ✓
PQFT          Guo & Zhang [20]           2008   ✓ ✓
ICL           Hou & Zhang [24]           2008   ✓ ✓ ✓ ✓
SUN           Zhang et al. [63]          2008   ✓ ✓
SDSR          Seo & Milanfar [48]        2009   ✓ ✓ ✓ ✓
FT            Achanta et al. [1]         2009   ✓ ✓
DCTS/SIGS     Hou et al. [22]            2011   ✓ ✓
SIM           Murray et al. [39]         2011   ✓ ✓ ✓
WMAP          Lopez-Garcia et al. [38]   2011   ✓ ✓ ✓ ✓
AWS           Garcia-Diaz et al. [18]    2012   ✓ ✓ ✓
CASD          Goferman et al. [19]       2012   ✓ ✓ ✓ ✓ ✓ ✓
RARE          Riche et al. [45]          2012   ✓ ✓ ✓
QDCT          Schauerte et al. [47]      2012   ✓ ✓
HFT           Li et al. [37]             2013   ✓ ✓
BMS           Zhang & Sclaroff [61]      2013   ✓ ✓
SALICON       Jiang et al. [28, 51]      2015   ✓ ✓
ML-Net        Cornia et al. [12]         2016   ✓ ✓
DeepGazeII    Kümmerer et al. [33]       2016   ✓ ✓
SalGAN        Pan et al. [40]            2017   ✓ ✓
ICF           Kümmerer et al. [33]       2017   ✓ ✓ ✓
SAM           Cornia et al. [13]         2018   ✓ ✓
NSWAM         Berga & Otazu [3]          2018   ✓ ✓ ✓
Sal-DCNN      Jiang et al. [34]          2019   ✓ ✓ ✓ ✓
Inspiration: {C: Cognitive/Biological, I: Information-Theoretic, P: Probabilistic, F: Fourier/Spectral, D: Machine/Deep Learning}; Type: {G: Global, L: Local}.

However, the limits of the prediction capability of these saliency models arise as a consequence of the validity of the evaluation from eye tracking experimentation. We aim to provide a new dataset with uniquely synthetic images and a benchmark, studying for each saliency model:
1. How do model inspiration and feature processing influence model predictions?
2. How does the temporality of fixations affect model predictions?
3. How do low-level feature type and contrast influence a model's psychophysical measurements?
2. SID4VAM: Synthetic Image Dataset for Visual Attention Modeling
Fixations were collected from 34 participants in a dataset of 230 images [2]. Images were displayed in a resolution of × px and fixations were captured at about pixels per degree of visual angle using an SMI RED binocular eye tracker. The dataset has been split into two tasks: Free-Viewing (FV) and Visual Search (VS). For the FV task, participants had to freely look at the image during 5 seconds. On each stimulus there was a salient area of interest (AOI). For the VS task, participants had the instruction to visually locate the AOI, setting the salient region as the different object. For this task, the trigger for prompting the transition to the next image was gazing inside the AOI or pressing a key (for reporting absence of the target). The stimuli generated for both tasks can be observed in Figs. 1-2.

The dataset was divided into 15 stimulus types, 5 corresponding to FV and 10 to VS. Some of these blocks had distinct subsets of images (due to the alteration of either target or distractor shape, color, configuration and background properties), enabling a total of 33 subtypes. Each of these blocks was individually generated as a low-level feature category, which had its own type of feature contrast between the salient region and the rest of the distractors/background. FV categories were mainly designed for analyzing preattentive effects (Fig. 1): 1) Corner Salience, 2) Visual Segmentation by Bar Angle, 3) Visual Segmentation by Bar Length, 4) Contour Integration by Bar Continuity and 5) Perceptual Grouping by Distance. VS categories were based on feature-singleton search stimuli, where there was a unique salient target and a set of distractors and/or altered background (Fig. 2). These categories were: 6) Feature and Conjunctive Search, 7) Search Asymmetries, 8) Search in a Rough Surface, 9) Color Search, 10) Brightness Search, 11) Orientation Search, 12) Dissimilar Size Search, 13) Orientation Search with Heterogeneous distractors, 14) Orientation Search with Non-linear patterns, 15) Orientation Search with distinct Categorization. Stimuli for SID4VAM's dataset were inspired by previous psychophysical experimentation [64, 58, 50].

Dataset stimuli were generated with 7 specific instances of feature contrast (Ψ), corresponding to hard (Ψ_h = {..}) and easy (Ψ_e = {..}) difficulties of finding the salient regions. These feature contrasts had their own parametrization (following Berga et al.'s psychophysical formulation [2, Section 2.4]), corresponding to the feature differences between the salient target and the rest of the distractors (e.g. differences of target orientation, size, saturation, brightness...) or to global effects (e.g. overall distractor scale, shape, background color, background brightness).
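To illustrate what such a parametrization looks like, the hypothetical mapping below converts a contrast level Ψ into a target-distractor orientation difference (the linear scaling and the 90-degree maximum are stand-in assumptions; the actual values follow [2, Section 2.4]):

```python
# Hypothetical illustration of the contrast parametrization; the actual
# mapping follows Berga et al. [2, Section 2.4].
PSI_LEVELS = list(range(1, 8))   # 7 contrast instances, from hard to easy
MAX_DELTA_THETA = 90.0           # assumed maximal orientation contrast (deg)

def orientation_contrast(psi):
    """Target-distractor orientation difference for contrast level psi."""
    return MAX_DELTA_THETA * psi / max(PSI_LEVELS)

trials = [{"feature": "orientation", "psi": p,
           "delta_theta_deg": orientation_contrast(p)} for p in PSI_LEVELS]
```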
3. Experiments
Fixation maps from eye tracking data are generated by assigning each fixation location to a binary map. Fixation density maps are created by convolving a Gaussian filter with the fixation maps; this simulates the smoothing caused by deviations of σ = deg given from eye tracking experimentation, as recommended by LeMeur & Baccino [36].

Typically, location-based saliency metrics (AUC_Judd, AUC_Borji, NSS) increase their score when fixation locations fall inside the predicted saliency maps (TP). Conversely, scores decrease when fixation locations are not captured by the saliency maps (FN) or when saliency appears at locations with no fixations present (FP). In distribution-based metrics (CC, SIM, KL), saliency maps score higher when they have higher correlations with respect to the fixation density map distributions. We have to point out that shuffled metrics (sAUC, InfoGain) consider FP values when saliency maps coincide with other fixation map locations or a baseline (here, corresponding to the center bias), which are not representative data for saliency prediction. Prediction metrics and their calculations are largely explained by Bylinskii et al. [11]. Our saliency metric scores and the pre-processing used for this experimentation have been replicated from the official saliency benchmarking procedure [10].

Figure 1: Free-Viewing stimuli (types 1-5; columns 1-7 span feature contrast Ψ from hard to easy).

Psychometric evaluation of saliency predictions has been done with the Saliency Index (SI) [49, 50]. This metric evaluates the energy of a saliency map inside a salient region (S_t), which would enclose a salient object, compared to the energy outside the salient region (S_b). It allows evaluation of a saliency map when the salient region is known, considering in absolute terms the distribution of saliency of a particular AOI/mask. Here we show the formula of the SI: $SI(S_t, S_b) = (S_t - S_b)/S_b$.

Saliency maps have been computed for the models shown in Table 2. Model evaluations have been divided according to their inspiration, and prediction scores have been evaluated with saliency metrics and in psychophysical terms.
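A minimal sketch of the two computations above (fixation density maps and the SI), assuming fixations come as pixel coordinates and that the pixels-per-degree factor of the setup is known:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def fixation_density_map(fixations, shape, px_per_deg):
    """Binary fixation map smoothed with a Gaussian whose deviation spans
    one degree of visual angle (an assumption here; the recommended value
    follows LeMeur & Baccino [36])."""
    fixation_map = np.zeros(shape)
    for x, y in fixations:
        fixation_map[int(y), int(x)] = 1.0
    return gaussian_filter(fixation_map, sigma=px_per_deg)

def saliency_index(saliency_map, aoi_mask):
    """SI = (S_t - S_b) / S_b, where S_t and S_b are the saliency energies
    inside and outside the salient AOI (the mean is an assumption)."""
    s_t = saliency_map[aoi_mask > 0].mean()
    s_b = saliency_map[aoi_mask == 0].mean()
    return (s_t - s_b) / s_b
```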
Previous saliency benchmarks [6, 42, 9, 7, 10] reveal that Deep Learning models such as SALICON, ML-Net, SAM-ResNet, SAM-VGG, DeepGazeII or SalGAN score highest on both shuffled and unshuffled metrics. In this section we aim to evaluate whether saliency maps that scored highly on fixation prediction do so with a synthetic image dataset, and whether their inspiration influences their performance. We present metric scores of saliency map predictions over the whole dataset in Table 3 and plots in Fig. 3. Saliency metric scores reveal that, overall, Spectral/Fourier-based saliency models predict fixations better on a synthetic image dataset. (Code for metrics: https://github.com/dberga/saliency)

Figure 2: Visual Search stimuli (columns span feature contrast Ψ from hard to easy).

Table 3: Saliency metric scores for SID4VAM
Model          AUCj   AUCb   CC     NSS    KL      SIM    sAUC   InfoGain
GT
Baseline-CG
IKN            0.686  0.678  0.283  0.878  1.748   0.380  0.608  -0.233
SIM            0.650  0.641  0.189  0.694  1.702   0.357  0.619  -0.148
AWS            0.679  0.667  0.255  1.088  1.592   0.373  0.672  0.013
NSWAM          0.614  0.610  0.136  0.529  1.686   0.335  0.622  -0.150
AIM            0.570  0.566  0.122  0.473  14.472  0.224  0.557  -18.182
ICL            0.737  0.717  0.343  1.100  1.788   0.405  0.624  -0.313
RARE           0.707  0.622  0.204  1.046  1.736   0.444  0.633  -0.158
CASD           0.733  0.669  0.408  1.904  2.395   0.403  0.652  -1.046
GBVS           0.747  0.718  0.400  1.464  1.363   0.413  0.628  0.331
SDLF           0.620  0.607  0.156  0.585  3.954   0.322  0.596  -3.244
SUN            0.542  0.532  0.080  0.333  16.408  0.165  0.530  -21.024
SDSR           0.672  0.665  0.192  0.639  1.904   0.365  0.642  -0.467
BMS            0.677  0.643  0.274  1.143  2.306   0.397  0.627  -0.958
ICF            0.618  0.566  0.141  0.700  3.274   0.306  0.564  -2.300
SR             0.748  0.694  0.420  1.916  1.432   0.431  0.685  0.348
PFT            0.705  0.692  0.398  1.885  2.227   0.377  0.684  -0.893
PQFT           0.701  0.693  0.387  1.774  2.197   0.373  0.684  -0.856
FT             0.521  0.518  0.072  0.331  7.552   0.129  0.517  -8.498
DCTS           0.729  0.724  0.439  2.004  1.363   0.396  0.708  0.337
WMAP           0.729  0.709  0.468  2.136  2.283   0.397         -0.981
QDCT           0.717  0.706  0.425  1.986  1.677   0.391  0.695  -0.105
HFT
SalGAN         0.715  0.662  0.287  0.883  2.506   0.373  0.593  -1.350
OpenSALICON    0.692  0.673  0.284  0.956  1.549   0.375  0.615  0.052
DeepGazeII     0.639  0.606  0.176  0.714  2.023   0.346  0.597  -0.587
SAM-VGG        0.537  0.523  0.026  0.070  11.947  0.216  0.503  -14.954
SAM-ResNet     0.727  0.673  0.305  0.967  2.610   0.388  0.600  -1.475
ML-Net         0.700  0.676  0.283  0.883  2.169   0.373  0.595  -0.837
Sal-DCNN       0.726  0.650  0.288  0.961  3.676   0.359  0.580  -3.05

Rows are grouped by model inspiration: Cognitive/Biological, Information-Theoretic, Probabilistic, Fourier/Spectral, Machine/Deep Learning.
Models such as HFT and WMAP remarkably outperform other saliency models. Regarding other model inspirations, AWS scores higher than other Cognitive/Biologically-inspired models, while GBVS and CASD outperform other Probabilistic/Bayesian and Information-Theoretic saliency models respectively. Among Deep Learning models, SAM-ResNet and OpenSALICON are the ones with the highest scores. Although there are differences in terms of model performance and model inspiration, similarities in model mechanisms can reveal phenomena of increasing and decreasing prediction statistics. This phenomenon is present for Spectral/Fourier-based and Cognitive/Biologically-inspired models, which all present similar performance and balanced scores throughout the distinct metrics. It should be considered that the sAUC and InfoGain metrics are more reliable than the others (for which the baseline center Gaussian sometimes acquires higher performance than most saliency models). In these terms, the models shown in Fig. 4 are efficient saliency predictors for this dataset. We can also point out that models which process uniquely local feature conspicuity scored lower on SID4VAM fixation predictions, whereas the ones that processed global conspicuity scored higher. This phenomenon might be related to the distinction between focal (near the fovea) and ambient (away from the fovea) fixations, relative to the fixation order and the spatial locations of fixations [15].

The evaluation of gaze-wise model predictions has been done by grouping the fixations of every instance separately (a sketch of this grouping is given at the end of this subsection). We have plotted results of the sAUC saliency metric for each model (Fig. 5), and it is observable that model performance decreases with fixation number, meaning that saliency is more likely to be predicted during the first fixations. For evaluating the temporal relationship between human and model performance (sAUC), we have performed Spearman's (ρ) correlation tests for each fixation; it can be observed that IKN, ICL, GBVS, QDCT and ML-Net follow a similar slope to the GT, contrary to the case of the baseline center Gaussian.

Previous studies [4, 7, 2] found that several factors such as feature type, feature contrast, task, temporality of fixations and the center bias alternatively contribute to eye movement guidance. The HVS has a specific contrast sensitivity to each stimulus feature, so saliency models should adapt in the same way in order to be plausible in psychometric terms. Here we show how saliency prediction varies significantly upon feature contrast and the type of low-level features found in images. In Fig. 6a it can be seen that saliency models increase SI with feature contrast "Ψ", following the distribution of human fixations. Most prediction SI scores show a higher slope with easy targets (salient objects with higher contrast with respect to the rest, when Ψ > ), with CASD and HFT being the models with the highest SI at higher contrasts.

Contextual influences (here represented as the distinct low-level features that appear in the image) contribute distinctively to the saliency induced by objects that appear in the scene [25]. We suggest that not only the semantic content that appears in the scene affects saliency, but that feature characteristics also significantly impact how salient objects are. This phenomenon is observable in Fig. 6b and occurs for both human fixations and model predictions, specifically with the highest SI for human fixations in 1) Corner Salience, 6) Feature and Conjunctive Search, 7) Search Asymmetries, 10) Brightness Search, 12) Dissimilar Size Search and 13) Orientation Search with Heterogeneous distractors. HFT and CASD have the highest SI when GT is higher (when human fixations are more probable to fall inside the AOI), even outperforming GT probabilities in the cases of 1) and 7).

Figure 3: Plots for saliency metric scores (best performance marked for each model inspiration).

Figure 6: Results of Saliency Index of model predictions upon Feature Contrast (a) and Feature Type (b).

We show in Fig. 7a that the overall Saliency Index of most saliency models is distinct when we vary the type of feature contrast (easy vs. hard) and the performed stimulus task (free-viewing vs. visual search). Spectral/Fourier-based models outperform other saliency models also in the SI metric. In line with the saliency metrics shown in the previous subsection, AWS, CASD, BMS, HFT and SAM-ResNet are the most efficient models for each model inspiration category respectively. It is observable in Fig. 7b that saliency models show higher performance for easy targets, with increased overall model performance differences with respect to hard targets (Fig. 7c). Similarly, visual search targets show lower difficulty (higher SI) for finding predicted fixations inside the AOI than the free-viewing cases (Fig. 7d-e). Distinct SI curves upon feature contrast are also reported, revealing that contrast sensitivities are distinct for each low-level feature. Spearman's correlation tests in Fig. 6b show which models correlate with human performance over feature contrast and which ones do so with the baseline (designating higher center biases). These results show that models such as AWS, CASD, BMS, DCTS or DeepGazeII highly correlate with human contrast sensitivities and do not correlate with the baseline center Gaussian. Matching human contrast sensitivities on low-level visual features would be an interesting direction for making future saliency models predict saliency accurately, as well as for better understanding how the HVS processes visual scenes.

Figure 7: Results of Saliency Index metric scores from dataset model predictions (a), for easy/hard difficulties (b-c) and Free-Viewing/Visual Search tasks (d-e).
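A sketch of the gaze-wise analysis described above, under stated assumptions (per-fixation metric scores and fixation-order indices are available as flat lists; function and variable names are ours):

```python
import numpy as np
from scipy.stats import spearmanr

def gazewise_curve(scores, fixation_orders, max_order=10):
    """Average a per-fixation metric (e.g. sAUC) grouped by fixation order,
    yielding one value for the 1st, 2nd, ... fixation."""
    curve = []
    for k in range(1, max_order + 1):
        vals = [s for s, o in zip(scores, fixation_orders) if o == k]
        curve.append(np.mean(vals) if vals else np.nan)
    return np.array(curve)

# Temporal consistency between model and human (GT) performance curves:
# rho, pval = spearmanr(gazewise_curve(model_scores, orders),
#                       gazewise_curve(gt_scores, orders),
#                       nan_policy="omit")
```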
4. SIG4VAM: Generating synthetic image patterns for training saliency models
We have also provided a synthetic image generator (SIG4VAM), able to generate similar psychophysical images with other types of patterns. A larger set of images can be created by parametrizing factors such as stimulus size, number of distractors, feature contrast, etc. For instance, if the same 15 types (33 subtypes) of stimuli are selected with 28 contrast (Ψ) instances instead, a dataset of 33 × 28 = 924 stimuli is generated. In addition, synthetic images with high-level features can be created using SIG4VAM (Fig. 8), by changing background properties, setting specific object instances for targets/distractors, as well as their low-level properties (orientation, brightness, color, etc.). (Code for generating synthetic stimuli: https://github.com/dberga/sig4vam)

Figure 4: Examples of dataset stimuli and saliency map predictions (rows: image, human fixations, and the AWS, NSWAM, RARE, CASD, GBVS, SDSR, WMAP, HFT, OpenSALICON and SAM-ResNet models). Only the two models of each inspiration category with the highest performance on shuffled saliency metric scores (sAUC and InfoGain) are shown.

Figure 5: sAUC gaze-wise prediction scores.

Figure 8: Examples of generating synthetic images with high-level features (i.e. objects as targets/distractors), changing low-level feature properties (a-b) or background (c).

SID4VAM has been proposed as a possible initial test set for saliency prediction, where fixation data and binary masks are available for benchmarking. Training sets can be obtained with SIG4VAM (GT binary masks of pop-out/salient regions are automatically generated), enabling the fitting of contrast sensitivities and the construction of loss functions upon scores of fixation probability distributions [11] and salient region detection metrics [56] (e.g. SI, PR, MAE, S-/F-measures, etc.). Recent strategies [46] that synthetically modify real scenes have shown dramatic changes in object detection scores, using "object transplanting" (superposing an object on distinct locations of the scene). In these terms, SIG4VAM could be extended for evaluating model predictions over distinct contexts and tasks.
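To give an idea of the kind of parametrized pattern involved, here is an illustrative re-implementation (not SIG4VAM's actual code; grid size, patch size and bar thickness are assumptions) of an orientation-singleton search display together with its GT binary mask:

```python
import numpy as np

def oriented_bar(size=21, theta_deg=0.0, thickness=3.0):
    """Square patch containing one bar at the given orientation."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    t = np.deg2rad(theta_deg)
    distance = np.abs(x * np.sin(t) - y * np.cos(t))  # distance to bar axis
    return (distance < thickness / 2.0).astype(float)

def singleton_display(grid=7, patch=21, distractor_theta=0.0,
                      delta_theta=45.0, seed=0):
    """Orientation-singleton search display and its GT target mask."""
    rng = np.random.default_rng(seed)
    image = np.zeros((grid * patch, grid * patch))
    mask = np.zeros_like(image)
    ti, tj = rng.integers(grid), rng.integers(grid)   # random target cell
    for i in range(grid):
        for j in range(grid):
            theta = distractor_theta + (delta_theta if (i, j) == (ti, tj) else 0.0)
            image[i*patch:(i+1)*patch, j*patch:(j+1)*patch] = oriented_bar(patch, theta)
    mask[ti*patch:(ti+1)*patch, tj*patch:(tj+1)*patch] = 1.0
    return image, mask
```

Randomizing the target cell, as in this sketch, also reduces the center bias effect discussed in Section 5.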
5. Discussion
Previous saliency benchmarks show that eye movements are efficiently predicted by the latest Deep Learning saliency models. This is not the case with synthetic images, even for models pre-trained with sets of psychophysical patterns (e.g. SAM with CAT2000). This suggests that their computations of saliency do not arise as a general mechanism. These methods have been trained with eye tracking data (real images containing high-level features), and although several factors guide eye movements, it has been shown [58] that low-level saliency (i.e. pop-out effects) is one of the most influential for determining bottom-up attention. Another possibility is that we randomly parametrized the salient object location, lowering the center bias effect. With this benchmark we can evaluate how salient a particular object is by parametrizing its low-level feature contrast with respect to the rest of the distractors and/or background. Therefore, the evaluation of saliency can be done accounting for feature contrast, analyzing the importance of objects that are easier to detect or found preattentively. Previous saliency benchmarks usually evaluate eye tracking data spatially across all fixations; we also propose the evaluation of saliency across fixations, which is an issue for further study.

Future steps for this study would include the evaluation of saliency in dynamic scenes [44, 35] using synthetic videos with either static or dynamic cameras. This would allow us to investigate the impact of temporally-variant features (e.g. flicker and motion) on saliency predictions. Another analysis to consider is the impact of the spatial location of salient features (in terms of eccentricity towards the image center), which might affect each model distinctively. Each of the steps in saliency modelization (i.e. feature extraction, conspicuity computation and feature fusion) might have a distinct influence on eye movement predictions. Acknowledging that conspicuity computations are the key factor for computing saliency, a future evaluation of how each mechanism contributes to model performance might be of interest.
6. Conclusion
Contrary to the current state of the art, we reveal that saliency models are far from acquiring HVS performance in terms of predicting bottom-up attention. We prove this with a novel dataset, SID4VAM, which contains uniquely synthetic images, generated with specific low-level feature contrasts. In this study, we show that overall Spectral/Fourier-based saliency models (i.e. HFT and WMAP) clearly outperform other saliency models when detecting a salient region with a particular conspicuous object. Other models such as AWS, CASD, GBVS and SAM-ResNet are the best predictor candidates for each of the other saliency model inspiration categories respectively (Cognitive/Biological, Information-Theoretic, Probabilistic and Deep Learning). In particular, visual features learned with deep learning models might not be suitable for efficiently predicting saliency using psychophysical images. Here we pose that saliency detection might not be directly related to object detection; therefore, training upon high-level object features might not be significantly favorable for predicting saliency in these terms. Future saliency modelization and evaluation should account for low-level feature distinctiveness in order to accurately model bottom-up attention. Here we remark the need for analyzing other factors such as the order of fixations, the influences of the task and the psychometric parameters of the salient regions.
7. Acknowledgements
This work was funded by the MINECO (DPI2017-89867-C2-1-R, TIN2015-71130-REDT), AGAUR (2017-SGR-649), CERCA Programme / Generalitat de Catalunya, in part by Xunta de Galicia under Project ED431C2017/69, in part by the Consellería de Cultura, Educación e Ordenación Universitaria (accreditation 2016-2019, ED431G/08) and the European Regional Development Fund, and in part by Xunta de Galicia and the European Union (European Social Fund). We also acknowledge the generous GPU support from NVIDIA.

References

[1] Radhakrishna Achanta, Sheila Hemami, Francisco Estrada, and Sabine Süsstrunk. Frequency-tuned salient region detection. In CVPR, jun 2009.
[2] David Berga, Xosé R. Fdez-Vidal, Xavier Otazu, Víctor Leborán, and Xosé M. Pardo. Psychophysical evaluation of individual low-level feature influences on visual attention. Vision Research, 154:60-79, 2019.
[3] David Berga and Xavier Otazu. A neurodynamical model of saliency prediction in V1. In Review, 2018. arXiv:1811.06308.
[4] Ali Borji and Laurent Itti. State-of-the-art in visual attention modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):185-207, jan 2013.
[5] Ali Borji and Laurent Itti. CAT2000: A large scale fixation dataset for boosting saliency research. CVPR 2015 workshop on "Future of Datasets", 2015.
[6] Ali Borji, D. N. Sihite, and L. Itti. Quantitative analysis of human-model agreement in visual saliency modeling: A comparative study. IEEE Transactions on Image Processing, 22(1):55-69, jan 2013.
[7] Neil D. B. Bruce, Calden Wloka, Nick Frosst, Shafin Rahman, and John K. Tsotsos. On computational modeling of visual saliency: Examining what's right, and what's left. Vision Research, 116:95-112, nov 2015.
[8] Neil D. B. Bruce and John K. Tsotsos. Saliency based on information maximization. In Advances in Neural Information Processing Systems 18, pages 155-162. MIT Press, 2005.
[9] Z. Bylinskii, E. M. DeGennaro, R. Rajalingham, H. Ruda, J. Zhang, and J. K. Tsotsos. Towards the quantitative evaluation of visual attention models. Vision Research, 116:258-268, nov 2015.
[10] Zoya Bylinskii, Tilke Judd, Ali Borji, Laurent Itti, Frédo Durand, Aude Oliva, and Antonio Torralba. MIT saliency benchmark. http://saliency.mit.edu/.
[11] Zoya Bylinskii, Tilke Judd, Aude Oliva, Antonio Torralba, and Frédo Durand. What do different evaluation metrics tell us about saliency models? IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1-1, 2018.
[12] Marcella Cornia, Lorenzo Baraldi, Giuseppe Serra, and Rita Cucchiara. A deep multi-level network for saliency prediction. In International Conference on Pattern Recognition (ICPR), 2016.
[13] Marcella Cornia, Lorenzo Baraldi, Giuseppe Serra, and Rita Cucchiara. Predicting human eye fixations via an LSTM-based saliency attentive model. IEEE Transactions on Image Processing, 27(10):5142-5154, 2018.
[14] Howard E. Egeth and Steven Yantis. Visual attention: Control, representation, and time course. Annual Review of Psychology, 48(1):269-297, feb 1997.
[15] Michelle L. Eisenberg and Jeffrey M. Zacks. Ambient and focal visual processing of naturalistic activity. Journal of Vision, 16(2):5, mar 2016.
[16] G. T. Fechner. Elements of Psychophysics, Volume 1. Holt, Rinehart and Winston, the University of Michigan, 1966.
[17] J. H. Fecteau and D. P. Munoz. Salience, relevance, and firing: a priority map for target selection. Trends in Cognitive Sciences, 10(8):382-390, aug 2006.
[18] Anton Garcia-Diaz, Xosé R. Fdez-Vidal, Xosé M. Pardo, and Raquel Dosil. Saliency from hierarchical adaptation through decorrelation and variance normalization. Image and Vision Computing, 30(1):51-64, jan 2012.
[19] Stas Goferman, Lihi Zelnik-Manor, and Ayellet Tal. Context-aware saliency detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(10):1915-1926, oct 2012.
[20] Chenlei Guo, Qi Ma, and Liming Zhang. Spatio-temporal saliency detection using phase spectrum of quaternion Fourier transform. In CVPR, jun 2008.
[21] Jonathan Harel, Christof Koch, and Pietro Perona. Graph-based visual saliency. In B. Schölkopf, J. C. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 545-552. MIT Press, 2007.
[22] Xiaodi Hou, J. Harel, and C. Koch. Image signature: Highlighting sparse salient regions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(1):194-201, jan 2012.
[23] Xiaodi Hou and Liqing Zhang. Saliency detection: A spectral residual approach. In CVPR, jun 2007.
[24] Xiaodi Hou and Liqing Zhang. Dynamic visual attention: searching for coding length increments. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 681-688. Curran Associates, Inc., 2009.
[25] Alex D. Hwang, Hsueh-Cheng Wang, and Marc Pomplun. Semantic guidance of eye movements in real-world scenes. Vision Research, 51(10):1192-1205, may 2011.
[26] Laurent Itti and Christof Koch. A saliency-based search mechanism for overt and covert shifts of visual attention. Vision Research, 40(10-12):1489-1506, jun 2000.
[27] Laurent Itti, Christof Koch, and Ernst Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11):1254-1259, 1998.
[28] Ming Jiang, Shengsheng Huang, Juanyong Duan, and Qi Zhao. SALICON: Saliency in context. In CVPR, jun 2015.
[29] Tilke Judd, Frédo Durand, and Antonio Torralba. A benchmark of computational models of saliency to predict human fixations. CSAIL Technical Reports, jan 2012.
[30] Tilke Judd, Krista Ehinger, Frédo Durand, and Antonio Torralba. Learning to predict where humans look. In ICCV, sep 2009.
[31] Christof Koch and Shimon Ullman. Shifts in selective visual attention: Towards the underlying neural circuitry. In Matters of Intelligence, pages 115-141. Springer, 1987.
[32] Gert Kootstra, Bart de Boer, and Lambert R. B. Schomaker. Predicting eye fixations on complex visual stimuli using local symmetry. Cognitive Computation, 3(1):223-240, jan 2011.
[33] Matthias Kümmerer, Thomas S. A. Wallis, Leon A. Gatys, and Matthias Bethge. Understanding low- and high-level contributions to fixation prediction. In ICCV, oct 2017.
[34] Lai Jiang, Zhe Wang, Mai Xu, and Zulin Wang. Image saliency prediction in transformed domain: A deep complex neural network method. February 2019.
[35] Víctor Leborán, Anton Garcia-Diaz, Xosé R. Fdez-Vidal, and Xosé M. Pardo. Dynamic whitening saliency. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(5):893-907, may 2017.
[36] Olivier LeMeur and Thierry Baccino. Methods for comparing scanpaths and saliency maps: strengths and weaknesses. Behavior Research Methods, 45(1):251-266, jul 2012.
[37] Jian Li, Martin D. Levine, Xiangjing An, Xin Xu, and Hangen He. Visual saliency based on scale-space analysis in the frequency domain. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(4):996-1010, apr 2013.
[38] Fernando Lopez-Garcia, Xosé Ramón Fdez-Vidal, Xosé Manuel Pardo, and Raquel Dosil. Scene recognition through visual attention and image features: A comparison between SIFT and SURF approaches. In Object Recognition. InTech, apr 2011.
[39] Naila Murray, Maria Vanrell, Xavier Otazu, and C. Alejandro Parraga. Saliency estimation using a non-parametric low-level vision model. In CVPR 2011, jun 2011.
[40] Junting Pan, Cristian Canton, Kevin McGuinness, Noel E. O'Connor, Jordi Torres, Elisa Sayrol, and Xavier Giró-i-Nieto. SalGAN: Visual saliency prediction with generative adversarial networks. In CVPR 2017 Scene Understanding Workshop (SUNw), January 2017.
[41] Subramanian Ramanathan, Harish Katti, Nicu Sebe, Mohan Kankanhalli, and Tat-Seng Chua. An eye fixation database for saliency detection in images. In Computer Vision - ECCV 2010, pages 30-43. Springer, 2010.
[42] Nicolas Riche, Matthieu Duvinage, Matei Mancas, Bernard Gosselin, and Thierry Dutoit. Saliency and human fixations: State-of-the-art and study of comparison metrics. In ICCV, dec 2013.
[43] Nicolas Riche and Matei Mancas. Bottom-up saliency models for still images: A practical review. In From Human Attention to Computational Attention, pages 141-175. Springer New York, 2016.
[44] Nicolas Riche and Matei Mancas. Bottom-up saliency models for videos: A practical review. In From Human Attention to Computational Attention, pages 177-190. Springer New York, 2016.
[45] Nicolas Riche, Matei Mancas, Bernard Gosselin, and Thierry Dutoit. RARE: A new bottom-up saliency model. In ICIP, sep 2012.
[46] Amir Rosenfeld, Richard Zemel, and John K. Tsotsos. The elephant in the room, 2018. arXiv:1808.03305.
[47] Boris Schauerte and Rainer Stiefelhagen. Quaternion-based spectral saliency detection for eye fixation prediction. In Computer Vision - ECCV 2012, pages 116-129. Springer, 2012.
[48] Hae Jong Seo and Peyman Milanfar. Static and space-time visual saliency detection by self-resemblance. Journal of Vision, 9(12):15-15, nov 2009.
[49] Alireza Soltani and Christof Koch. Visual saliency computations: Mechanisms, constraints, and the effect of feedback. Journal of Neuroscience, 30(38):12831-12843, sep 2010.
[50] Michael W. Spratling. Predictive coding as a model of the V1 saliency map hypothesis. Neural Networks, 26:7-28, feb 2012.
[51] Christopher Lee Thomas. OpenSALICON: An open source implementation of the SALICON saliency model. Technical Report TR-2016-02, University of Pittsburgh, 2016.
[52] Antonio Torralba, Aude Oliva, Monica S. Castelhano, and John M. Henderson. Contextual guidance of eye movements and attention in real-world scenes: The role of global features in object search. Psychological Review, 113(4):766-786, 2006.
[53] Anne M. Treisman and Garry Gelade. A feature-integration theory of attention. Cognitive Psychology, 12(1):97-136, jan 1980.
[54] John K. Tsotsos, Sean M. Culhane, Winky Yan Kei Wai, Yuzhong Lai, Neal Davis, and Fernando Nuflo. Modeling visual attention via selective tuning. Artificial Intelligence, 78(1-2):507-545, oct 1995.
[55] Richard Veale, Ziad M. Hafed, and Masatoshi Yoshida. How is visual salience computed in the brain? Insights from behaviour, neurobiology and modelling. Philosophical Transactions of the Royal Society B: Biological Sciences, 372(1714):20160113, jan 2017.
[56] Wenguan Wang, Qiuxia Lai, Huazhu Fu, Jianbing Shen, and Haibin Ling. Salient object detection in the deep learning era: An in-depth survey, 2019. arXiv:1904.09146.
[57] Stefan Winkler and Ramanathan Subramanian. Overview of eye tracking datasets. In QoMEX, jul 2013.
[58] J. M. Wolfe. Guided search 4.0: A guided search model that does not require memory for rejected distractors. Journal of Vision, 1(3):349-349, mar 2010.
[59] Jeremy M. Wolfe, Evan M. Palmer, and Todd S. Horowitz. Reaction time distributions constrain models of visual search. Vision Research, 50(14):1304-1311, jun 2010.
[60] Steven Yantis and Howard E. Egeth. On the distinction between visual salience and stimulus-driven attentional capture. Journal of Experimental Psychology: Human Perception and Performance, 25(3):661-676, 1999.
[61] Jianming Zhang and Stan Sclaroff. Saliency detection: A Boolean map approach. In ICCV, dec 2013.
[62] Liming Zhang and Weisi Lin. Selective Visual Attention. John Wiley & Sons (Asia) Pte Ltd, mar 2013.
[63] Lingyun Zhang, Matthew H. Tong, Tim K. Marks, Honghao Shan, and Garrison W. Cottrell. SUN: A Bayesian framework for saliency using natural statistics. Journal of Vision, 8(7):32, dec 2008.
[64] Li Zhaoping and Keith A. May. Psychophysical tests of the hypothesis of a bottom-up saliency map in primary visual cortex. PLoS Computational Biology, 3(4):e62, 2007.