Modeling human visual search: A combined Bayesian searcher and saliency map approach for eye movement guidance in natural scenes
Melanie Sclar*, Gastón Bujia*, Sebastián Vita, Guillermo Solovey, Juan Esteban Kamienkowski
Laboratorio de Inteligencia Artificial Aplicada, Instituto de Ciencias de la Computación, Universidad de Buenos Aires – CONICET, Argentina
Instituto del Cálculo, Universidad de Buenos Aires – CONICET, Argentina
Departamento de Física, FCEyN, Universidad de Buenos Aires, Argentina
* Indicates equal contribution. Correspondence should be addressed to M.S. ([email protected]) or G.B. ([email protected]).
Abstract
Finding objects is essential for almost any daily-life visual task. Saliency models have been useful to predict fixation locations in natural images, but they are static, i.e., they provide no information about the time sequence of fixations. Nowadays, one of the biggest challenges in the field is to go beyond saliency maps and predict a sequence of fixations related to a visual task, such as searching for a given target. Bayesian observer models have been proposed for this task, as they represent visual search as an active sampling process. Nevertheless, they were mostly evaluated on artificial images, and how they adapt to natural images remains largely unexplored. Here, we propose a unified Bayesian model for visual search guided by saliency maps as prior information. We validated our model with a visual search experiment in natural scenes, recording eye movements. We show that, although state-of-the-art saliency models perform well in predicting the first two fixations in a visual search task, their performance degrades to chance afterward. This suggests that saliency maps alone are good at modeling bottom-up first impressions, but are not enough to explain the scanpaths when top-down task information is critical. Thus, we propose to use them as priors of Bayesian searchers. This approach leads to behavior very similar to humans for the whole scanpath, both in the percentage of targets found as a function of the fixation rank and in the scanpath similarity, reproducing the entire sequence of eye movements.
Introduction
Visual search is a natural task that humans perform in everyday life, from looking for someone in a photograph to searching where you left your favorite mug in the kitchen. Finding our goal relies on our ability to gather visual information through a sequence of eye movements, performing a discrete sampling of the scene. This sampling of information is not carried out on random points. The fixations of the gaze follow different strategies, trying to minimize the number of steps needed to find the target (Borji and Itti 2014; Rolfs 2015; Tatler et al. 2010; Yarbus 1967). This process is a classical example of the active sensing or sampling paradigm, in which humans decide between different ways of sampling information, making inferences with the information gathered so far, in order to fulfill a necessity (Yang, Wolpert, and Lengyel 2016; Gottlieb and Oudeyer 2018). This way, each decision is expected to reduce uncertainty about the environment (Gottlieb and Oudeyer 2018; Najemnik and Geisler 2005). Moreover, this decision-making behavior depends on the purpose; for instance, whether it is task-driven or simple curiosity. Predicting the eye movements necessary to meet a goal is a computationally complex task, since it must combine the bottom-up information-capturing processes with the top-down integration of information and the updating of expectations at each fixation.

A related task is the prediction of the most likely fixation positions in the scene, whose purpose is to build a saliency map, identifying regions that draw our attention within an image. The first saliency models were built based on computer vision strategies, combining different filters over the image (Itti and Koch 2001). Some of these filters could be very general, such as a low-pass filter that gives the idea of the horizon (Itti, Koch, and Niebur 1998; Torralba and Sinha 2001), or more specific, such as detecting high-level features like faces (Cerf et al. 2008). In recent years, deep neural networks (DNNs) have advanced the development of saliency maps. Many saliency models have successfully incorporated pre-trained convolutional DNNs in order to extract low- and high-level features of the images (Kummerer et al. 2017; Cornia et al. 2016, 2018). These novel approaches were summarized in the MIT/Tuebingen collaboration website (Kummerer, Wallis, and Bethge 2018). Nevertheless, they produce accurate results only in the first few fixations of free exploration tasks (Torralba et al. 2006), as they can make use of neither the sequential nature of the task nor the combination of information through the sequence of fixations. Thus, they may not be able to replicate their good results when performing a complex task, such as a visual search.

Recently, Zhang et al. (2018) proposed an extension of those models to predict the sequence of fixations in a visual search in natural images. They use a greedy algorithm based on DNNs, mimicking the behavior of the visual system by elaborating an attention map related to the search goal. Using a greedy algorithm implied forcing some known
behaviors of human visual search, like inhibition of return, that arise naturally with longer-sighted objective functions.

Nowadays, there is a growing interest in Bayesian models, given their good results at modeling human behavior and also their straightforward interpretation in terms of human information processing (Ullman 2019). For instance, different Bayesian models have been applied to decision making or perceptual tasks (O'Reilly, Jbabdi, and Behrens 2012; Rohe and Noppeney 2015; Samad, Chung, and Shams 2015; Wiecki, Poland, and Frank 2015; Turgeon, Lustig, and Meck 2016; Knill and Pouget 2004; Tenenbaum, Griffiths, and Kemp 2006; Meyniel, Sigman, and Mainen 2015). In the case of visual search, Najemnik and Geisler (2005) proposed a model that decides the next eye movement based on its prior knowledge, a visibility map, and the current state of a posterior probability that is updated after every fixation. In this model, inhibition of return and moderate saccade length, among other human characteristics of visual search, arise naturally (Najemnik and Geisler 2005). The results of this Ideal Bayesian Searcher (IBS) have had a wide impact, but the images used in their experiments were all artificial. Recently, Hoppe and Rothkopf (2019) proposed a visual search model that incorporates planning. It uses the uncertainty of the current observation to select the upcoming gaze locations in order to maximize the probability of detecting the location of the target after a sequence of two saccades. As in Najemnik and Geisler (2005), their task is specifically designed for maximizing the difference between models, i.e., finding the target in very few fixations on artificial stimuli. In order to extend these results to natural images, it is necessary to incorporate the information available in the scene.

Here, we show that an IBS model combined with state-of-the-art saliency maps as priors performs similarly to humans in a visual search task on natural scenes. Moreover, we incorporate a different update rule into the IBS model, based on a correlation for the template response. This modification could incorporate the effect of distractors, even though we did not specifically test this hypothesis in our experiments. We also simplify assumptions from prior work by avoiding measuring each subject's visibility map beforehand, using an a priori approximation instead. Previous work compared the general performance between humans and the model (targets found and total number of fixations). Moving one step further, we quantitatively compare the scanpaths (ordered sequences of fixations) produced by the model with the ones recorded by human observers.

Visual search in natural indoor images: Human data
Paradigm and human data acquisition and preprocessing
We set up a visual search experiment in which participants have to search for an object in a crowded indoor scene. First, the target was presented in the center of the screen, subtending × pixels of visual angle (Fig. S1). After 3 seconds, the target was replaced by a fixation dot at a pseudo-random position at least 300 pixels away from the actual target position in the image (Fig. S1). This was done to avoid starting the search too close to the target. The initial position was the same for a given image and all participants. The search image appears after the participant fixates the dot. Thus, all observers initiate the search in the same place for each image (Fig. S1).

Saccades and fixations were parsed online. The search period finishes when the participant fixates the target or after N saccades, allowing an extra 200 ms in order to let observers process information from this last fixation (Kotowicz, Rutishauser, and Koch 2010). The maximum number of saccades allowed (N) varied between 2, 4, 8, and 12. These values were randomized for each participant, independently of the image. The experiment was programmed using PsychToolbox and EyeLink libraries in MATLAB (Brainard 1997; Kleiner, Brainard, and Pelli 2007).

The images correspond to 134 indoor pictures from Wikimedia Commons, indoor design blogs, and the LabelMe database (Russell et al. 2008), which have several objects and no human figures or text. The images were presented at a × resolution (subtending . × . degrees of visual angle). For each image, a single target was selected among objects of size equal to or smaller than × pixels. Also, we excluded targets with almost exact copies within the image (e.g., a cup within a set of cups) to prevent high working memory requirements for humans. For all of them we considered a surrounding region of × pixels. Finally, we checked that there were no consistent spatial biases across the images.

Fifty-seven subjects (34 male; age . ± . years old) participated in the visual search task. All were students or teachers from the Facultad de Ciencias Exactas y Naturales de la Universidad de Buenos Aires. All subjects were naïve to the objectives of the experiment, had normal or corrected-to-normal vision, and provided written informed consent according to the recommendations of the Declaration of Helsinki to participate in the study.

See more details on the paradigm, data acquisition, and preprocessing in the Supplemental Information.

Human behavior results
During the experiment, observers have to search for a given target object within natural indoor scenes. The trial stops when the observer finds the target or after N saccades (N = 2, 4, 8, or 12). As expected, the proportion of targets found increases as a function of the saccades allowed (Fig. 1A), reaching a plateau from 8 to 12 saccades allowed and on (Fig. 1A and data from a preliminary experiment with up to 64 saccades, not shown).

Overall, the recorded eye movements behave as expected. First, the amplitude decreases with the fixation rank, presenting the so-called coarse-to-fine effect (Fig. 1B), and the saccades tended to be horizontal more than vertical (Fig. 1C). Finally, the initial spatial distribution of fixations had a central bias, and then extended first over the horizon until it covered the whole image, as the targets are uniformly distributed along the scene (Fig. 1D). This effect could be partially due to the organization of the task (the central drift correction and presentation of the target), the setup (the central position of the monitor with respect to the eyes/head), and the images (the photographer typically centers the image); it could also be due to processing benefits, as the center is the optimal position to acquire low-level information about the whole scene or to start the exploration.

Figure 1: General behavior. (A) Proportion of targets found as a function of the number of saccades allowed. Distributions of (B) saccade length, (C) saccade direction (measured in degrees from the positive horizontal axis), and (D) fixation positions for different fixation ranks (1, 2, 3, 4, 5-8, 9-12).

Searcher Modeling Approach
The proposed model involves two main aspects of the visual search implementation: a saliency map estimation step, as a first, glimpse-like, information extraction, and successive search steps in order to build the full scanpath.
Exploring saliency maps
Saliency maps are usually estimated in a scene-viewing task, where observers freely move their eyes, exploring whatever captures their interest. The motivation for using the full saliency map as a prior in a visual search task lies in the results of the flash-preview moving-window paradigm, which show that even less than a few hundred milliseconds' glimpse of a scene can guide search, as long as sufficient time is subsequently available to combine prior knowledge with the current visual input (Oliva and Torralba 2006; Torralba et al. 2006; Castelhano and Henderson 2007). Importantly, it was also shown that this is more relevant in the first saccades of a full-scene search, and that its predictive power decays with the fixation rank (Torralba et al. 2006).
Saliency Maps
In the last few years several saliency models appeared in the literature and made their code available. Many of them were nicely summarized and compared in the https://saliency.tuebingen.ai/ repository (Judd, Durand, and Torralba 2012; Kummerer, Wallis, and Bethge 2018; Bylinskii et al. 2018). With the purpose of understanding which features guide the search, in this section we choose and compare five different state-of-the-art saliency maps for our task: DeepGaze 2 (Kummerer et al. 2017), MLNet (Cornia et al. 2016), SAM-VGG and SAM-ResNet (Cornia et al. 2018), and ICF (Intensity Contrast Feature) (Kummerer et al. 2017).

All the saliency models considered (except for ICF) are based on neural network architectures, using different convolutional networks (CNNs) pretrained on object recognition tasks. These CNNs play the role of computing a fixed feature-space representation (feature extractor) of the image, which is then fed to a predictor function (in the models we consider, also a neural network). DeepGaze uses a VGG-19 (Simonyan and Zisserman 2014) as feature extractor, and the predictor is a simpler four-layer CNN (Kummerer et al. 2017). The MLNet model uses a modified VGG-16 (Simonyan and Zisserman 2014) that returns several feature maps, and a simpler CNN is used as a predictor that incorporates a learnable center prior (Cornia et al. 2016). Finally, SAM can use both VGG-16 and ResNet-50 (He et al. 2016) as two different feature extractors, and the predictor is a neural network with attentive and convolutional mechanisms (Cornia et al. 2018). ICF has an architecture similar to DeepGaze, but it uses Gaussian filters instead of a neural network. This way, ICF extracts purely low-level image information (intensity and intensity contrast). We also include a saliency model with just the center bias, modeled by a 2D Gaussian distribution.

As the control model, we built a human-based saliency map using the accumulated fixation positions of all observers for a given image, smoothed with a Gaussian kernel (SD = 25 px). Given that observers were forced to begin each trial in the same position, we did not use the first fixations but the third. This way we capture the regions that attract human attention.
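As a concrete illustration, here is a minimal sketch of how such a human-based saliency map can be computed from recorded fixations; the function name and the use of NumPy/SciPy are our own choices, and the only parameter taken from the text is the 25 px kernel width.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def human_saliency_map(fixations, image_shape, sigma=25.0):
    """Accumulate fixation positions into a count map and smooth it.

    fixations: iterable of (x, y) pixel coordinates (e.g., the third
    fixations of all observers on one image); image_shape: (height, width);
    sigma: Gaussian kernel SD in pixels (25 px in the text).
    """
    counts = np.zeros(image_shape, dtype=float)
    for x, y in fixations:
        row, col = int(round(y)), int(round(x))
        if 0 <= row < image_shape[0] and 0 <= col < image_shape[1]:
            counts[row, col] += 1.0
    smoothed = gaussian_filter(counts, sigma=sigma)
    return smoothed / smoothed.sum()  # normalize to a probability map
```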
Prediction of observers’ fixation positions
We evaluated how well the saliency models by themselves predict fixations along the search. Thus, we considered each saliency map S as a binary classifier on every pixel and used Receiver Operating Characteristic (ROC) curves and the Area Under the Curve (AUC) to measure their performance. This comes with the difficulty that there is not a unique way of defining the false positive rate (fpr). In dealing with this problem, previous work on this task has used many different definitions of (the ROC and its corresponding) AUC (Borji et al. 2013; Riche et al. 2013; Bylinskii et al. 2018; Kummerer, Wallis, and Bethge 2018). Briefly, to build our ROC we considered the true positive rate (tpr) as the proportion of saliency map values above each threshold at fixation locations, and the fpr as the proportion of saliency map values above the threshold at non-fixated pixels (Fig. 2A).

As expected, the saliency map built from the distribution of third fixations performed by humans (human-based saliency map) is superior to all other saliency maps, and the center bias map was clearly worse than the rest (Fig. 2B). This is consistent with the idea that the first steps in visual search are mostly guided by image saliency. The rest of the models have similar performance in AUC, with DeepGaze 2 performing slightly better than the others (Fig. 2B). Using different definitions of AUC (Borji et al. 2013; Riche et al. 2013; Bylinskii et al. 2018; Kummerer, Wallis, and Bethge 2018) showed the same trend (Table S1). If we consider all fixations, the AUC is reduced for all models, including the human-based saliency map built on the third fixations (Fig. S2).

Figure 2: Saliency maps. A) Example of how to estimate the TPR for the ROC curve, B) ROC curves and C) AUC values for the third fixation, and D) AUC for each model as a function of the current fixation rank. Color mapping for models is consistent over B, C, and D.

All models reached a maximum AUC value at the second fixation, except the human-based model, which peaked at the third fixation, as expected (Fig. 2C). Interestingly, the center bias begins at a similar level as the other saliency maps but decays more rapidly, reaching 0.5 at the fourth fixation. Thus, the other saliency maps must capture some other relevant visual information. Nevertheless, the AUC values of all saliency maps decay smoothly (Fig. 2C). This suggests that the gist the observers are able to collect in the first fixations is largely modified by the search. Top-down mechanisms must take control and play major roles in eye movement guidance as the number of fixations increases (Itti and Koch 2000). The DeepGaze 2 model performed better over all fixation ranks, becoming indistinguishable from human performance at the second fixation (Fig. 2C).
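The tpr/fpr definition above can be written compactly; the following sketch (our own naming, NumPy assumed) computes the resulting AUC for one image by sweeping thresholds over the saliency map.

```python
import numpy as np

def saliency_auc(saliency, fixation_mask, n_thresholds=100):
    """AUC of a saliency map treated as a per-pixel binary classifier.

    saliency: 2D array; fixation_mask: boolean 2D array, True at fixated pixels.
    tpr: fraction of fixated pixels above the threshold;
    fpr: fraction of non-fixated pixels above the threshold.
    """
    pos = saliency[fixation_mask]
    neg = saliency[~fixation_mask]
    thresholds = np.linspace(saliency.min(), saliency.max(), n_thresholds)
    tpr = np.array([(pos > t).mean() for t in thresholds])
    fpr = np.array([(neg > t).mean() for t in thresholds])
    order = np.argsort(fpr)  # traverse the ROC from left to right
    return float(np.trapz(tpr[order], fpr[order]))
```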
Bayesian Searcher models
In the previous section, we showed that saliency models alone are not able to predict the fixation positions in a visual scene when a task, even a simple task like visual search, is performed. Moreover, these models are not designed to predict the order of those fixations. In the next section, we develop models that integrate the saliency maps as priors but implement rules to update those probabilities as the search progresses.
Description of the Ideal Bayesian searcher (IBS)
Najemnik and Geisler (2005)'s IBS computes the optimal next fixation location at each step. It considers each possible next fixation and picks the one that will maximize the probability of correctly identifying the location of the target after that fixation. The optimal fixation location at step T + 1, k_opt(T + 1), is computed as (eq. 1):

k_{opt}(T+1) = \arg\max_{k(T+1)} \sum_{i=1}^{n} p_i(T)\, p(C \mid i, k(T+1))    (1)

where p_i(T) is the posterior probability that the target is at the i-th location within the grid after T fixations, and p(C | i, k(T+1)) is the probability of being correct given that the true target location is i and the location of the next fixation is k(T+1). p_i(T) involves the prior, the visibility map (d_{ik(t)}), and a notion of the target location (W_{ik(t)}):

p_i(T) = \frac{prior(i) \prod_{t=1}^{T} \exp\left(d_{ik(t)}^2 W_{ik(t)}\right)}{\sum_{j=1}^{n} prior(j) \prod_{t=1}^{T} \exp\left(d_{jk(t)}^2 W_{jk(t)}\right)}    (2)

The template response, W_{ik(t)}, quantifies the similarity between a given position i and the target image from the fixated position k(t) (t is any previous fixation). It is defined as W_{ik(t)} ~ N(\mu_{ik(t)}, \sigma_{ik(t)}), where:

\mu_{ik(t)} = \mathbb{1}(i = \text{target location}) - 0.5, \qquad \sigma_{ik(t)} = \frac{1}{d_{ik(t)}}    (3)

Abusing notation, in eq. 2, W_{ik(t)} refers to a value drawn from this distribution.

IBS has only been tested on artificial images, where subjects need to find a Gabor patch among 1/f noise in one out of 25 possible locations. This work is, to our knowledge, the first one to test this approach in natural scenes. Below, we discuss the modifications needed to apply IBS to eye movements in natural images.
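Before turning to those modifications, here is a minimal sketch of the update and decision rule of eqs. 1-2, assuming the grid, prior, and visibility maps are given as arrays; how p(C | i, k(T+1)) is estimated is left abstract (it is passed in precomputed), and the squared-visibility weighting follows Najemnik and Geisler's formulation.

```python
import numpy as np

def posterior(prior, d_hist, W_hist):
    """Posterior over the n grid cells after T fixations (eq. 2).

    prior: (n,) prior over cells; d_hist: list of (n,) visibility maps, one per
    past fixation k(t); W_hist: list of (n,) sampled template responses W_{ik(t)}.
    """
    log_p = np.log(prior)
    for d, W in zip(d_hist, W_hist):
        log_p += d ** 2 * W      # evidence weighted by (squared) visibility
    log_p -= log_p.max()         # numerical stability before normalizing
    p = np.exp(log_p)
    return p / p.sum()

def next_fixation(p_T, p_correct):
    """Optimal next fixation (eq. 1).

    p_T: (n,) current posterior; p_correct: (n_locations, n) array whose entry
    [k, i] approximates p(C | target at i, next fixation k).
    Returns the index k maximizing the expected probability of being correct.
    """
    expected = p_correct @ p_T   # sum_i p_i(T) * p(C | i, k) for every k
    return int(np.argmax(expected))
```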
Modifications to the IBS to handle natural scenes (correlation-based IBS, cIBS). Since it would be both computationally intractable to compute the probability of fixating on every pixel of a × image, and ineffective to do so (as useful information spans regions larger than a pixel), we restrict the possible fixation locations to the center points of a grid of cells of δ × δ pixels each. We collapse the eye movements to these points accordingly: consecutive fixations within a cell were merged into one fixation to be fair with the model's behavior.

The original IBS model had a uniform prior distribution. Since we are trying to model fixation locations in a natural scene, we introduced a saliency model as the prior: prior(i) is the average of the saliency in the i-th grid cell.

Importantly, the presence of the target at a certain position is not as straightforward in natural images as in artificial stimuli, where all the incorrect locations are equally dissimilar. In natural images there are often distractors, i.e., positions in the image that are visually similar to the target, especially when seen with low visibility. Therefore, we propose a redefined template response \tilde{W}_{ik(t)} ~ N(\tilde{\mu}_{ik(t)}, \tilde{\sigma}_{ik(t)}), where \tilde{\mu}_{ik(t)} is defined as (eq. 4):

\tilde{\mu}_{ik(t)} = \mu_{ik(t)} \cdot \frac{d_{ik(t)} + 1}{2} + corr_i \cdot \left(1 - d_{ik(t)}\right)    (4)

corr_i ∈ [−0.5, 0.5] is the cross-correlation between location i and the target image, used as a measure of image similarity. Moreover, we modified σ_{ik(t)} to keep the variance dependent on the visibility, but we incorporate two parameters (eq. 5):

\tilde{\sigma}_{ik(t)} = \frac{1}{a \cdot d_{ik(t)} + b}    (5)

The parameters a and b jointly modulate the inverse of the visibility and prevent 1/d from diverging. These parameters were not included in the original model, probably because d was estimated empirically (from thousands of trials and independently for each subject) and was never exactly equal to zero. Recently, Bradley, Abrams, and Geisler (2014) simplified the task by fitting a visibility map built from a first-principles model, which proposed an analytic function with several parameters that should still be fitted for each participant. Here, we further simplified it by using a two-dimensional Gaussian with the same parameters for every participant, avoiding a potential leak of information about the viewing patterns into the model. The parameters were taken a priori (estimated from the parameters in Najemnik and Geisler (2005) and Bradley, Abrams, and Geisler (2014)). We chose the parameters of the model using a classical grid search procedure on a previous experiment with a smaller dataset, and the same best parameters (δ = 32, a = 3, and b = 4) are used for all the models.

We call this variation of the model correlation-based IBS (cIBS). The data, models, and code will be publicly available upon publication. It is worth mentioning that, to our knowledge, not only is an implementation of the Najemnik and Geisler (2005) model publicly available for the first time here, but it is also largely optimized. More details in the Supplemental Information.
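A minimal sketch of the modified template response, assuming the reconstruction of eqs. 4-5 given above and a visibility map normalized so that d lies in [0, 1]; function and argument names are ours, and a = 3, b = 4 are the values reported in the text.

```python
import numpy as np

def cibs_template_response(is_target, corr, d, a=3.0, b=4.0, rng=None):
    """Draw W~_{ik(t)} for one grid cell i seen from fixation k(t).

    is_target: whether cell i contains the target; corr: cross-correlation
    between cell i and the target, rescaled to [-0.5, 0.5]; d: visibility of
    cell i from k(t); a, b: parameters of eq. 5.
    """
    rng = np.random.default_rng() if rng is None else rng
    mu = 0.5 if is_target else -0.5                      # eq. 3
    mu_tilde = mu * (d + 1.0) / 2.0 + corr * (1.0 - d)   # eq. 4 (as reconstructed)
    sigma_tilde = 1.0 / (a * d + b)                      # eq. 5
    return rng.normal(mu_tilde, sigma_tilde)
```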
Evaluating searcher models on human data

We first evaluate the updating of probabilities and the decision rule for the next fixation position of the proposed cIBS model. For comparison, we used the previous IBS model, in which the template response accounts only for the presence or absence of the target, and not for the similarity of the given region to the target. Also, we implemented two other basic models: a Greedy searcher and a Saliency-based searcher. The Greedy searcher bases its decision on maximizing the probability of finding the target in the next fixation. It only considers the present posterior probabilities and the visibility map, and does not take into account how the probability map is going to be updated after that. The Saliency-based searcher simply goes through the most salient regions of the image, adding an inhibition-of-return effect to each visited region. In these models, we used DeepGaze 2 as the prior, because it is the best performing saliency map of the previous section. We also evaluate the use of different priors with the cIBS model, comparing with the center bias alone (a centered two-dimensional Gaussian distribution), a uniform (flat) distribution, and a white noise distribution.

Figure 3: Model performance comparison. Proportion of targets found for each threshold considered. The boxes represent the human behavior distribution. The curves are the performance achieved by the models considered, (A) using different search strategies with DeepGaze 2 (DG2) as prior, and (B) using different priors with the cIBS strategy.

The proportion of targets found for each of the possible numbers of saccades allowed was used as a measure of overall performance (Fig. 3). In Table 1 we summarize the metrics comparing humans and models. The Weighted Distance measures the mean difference between curves (Fig. 3). The Jaccard Index represents the proportion of targets found by the model relative to the total targets found by the subjects, and the Mean Agreement metric measures the proportion of trials in which subject and model had the same performance: both subject and model found (or did not find) the target (see Supplemental Information). When comparing different searchers with the same prior, cIBS has the best agreement with the humans' performance as a compromise among the different metrics (Table 1). In particular, the performance of cIBS+DeepGaze2 is the closest to the human mean agreement ( . ± . ). Nevertheless, the curves of the IBS and Greedy models were also very close to humans (Fig. 3A), and each of them performed better in one metric (Table 1). It is important to note that the Weighted Distance significantly improves when adding the distractor component (cIBS vs. IBS). Only the basic Saliency-based model had poorer agreement with the human performance, showing that template matching weighted by visibility is a plausible mechanism for searching potential targets in the scene.

Then, we explored the importance of the prior, comparing the best searcher model with the chosen prior against the different basic priors. Again, cIBS+DeepGaze2 had better overall agreement with humans' behavior (Fig. 3B and Table 1) and, interestingly, it is the only model that presented the step-like function characteristic of humans (Fig. 3B).

Figure 4: Human-model scanpath dissimilarity comparison. All metrics are computed first between each pair of participants/model and then averaged. A) Distribution of scanpath dissimilarity between humans and different search strategies using DeepGaze 2 as prior. Each dot represents a trial image. B-E) Distribution of scanpath dissimilarity between humans (bhSD) and between humans and model (hmSD) for each trial image. F) Distribution of scanpath dissimilarity between different priors using cIBS as search strategy. G-J) Same plots as (B-E) for each different prior considered. Only correct trials were considered in (B-E, G-J).

Going one step further, we compared the scanpaths using the pairwise scanpath dissimilarity metric proposed by Jarodzka, Holmqvist, and Nyström (2010). Briefly, the metric represents scanpaths as time-aligned vectors: each scanpath is defined as a sequence of fixations u_i ((x, y) coordinates), where the i-th saccade is the shortest path (vector) going from u_i to u_{i+1}. Each u_i may not be exactly a fixation, but the center of a cluster of several fixations, making this measure more robust. Depending on the objective of the comparison, it can be used with different summary measures based on those characteristics. As we aim to compare the sequence of explored locations between humans and our model, we use shape dissimilarity. It is calculated as the normalized difference between the saccade vectors, where a scanpath dissimilarity of 0 indicates that the scanpaths are highly similar, and 1 indicates that there is no correspondence between them.
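For reference, a simplified sketch of the shape-dissimilarity computation; it pairs saccade vectors in temporal order rather than using the full alignment procedure of Jarodzka, Holmqvist, and Nyström (2010), and normalizes by the screen diagonal so the result lies in [0, 1].

```python
import numpy as np

def shape_dissimilarity(scan_a, scan_b, screen_diag):
    """Simplified scanpath shape dissimilarity.

    scan_a, scan_b: (n, 2) arrays of fixation (x, y) positions; saccades are
    the difference vectors between consecutive fixations. 0 means identical
    shape, 1 means no correspondence.
    """
    sacc_a = np.diff(np.asarray(scan_a, dtype=float), axis=0)
    sacc_b = np.diff(np.asarray(scan_b, dtype=float), axis=0)
    m = min(len(sacc_a), len(sacc_b))
    if m == 0:
        return 1.0
    diffs = np.linalg.norm(sacc_a[:m] - sacc_b[:m], axis=1)
    return float(np.clip(diffs / (2.0 * screen_diag), 0.0, 1.0).mean())
```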
Comparing across searcher models, both cIBS and IBS were almost indistinguishable from humans, the Greedy model was still very close, and only the Saliency-based model resulted in significantly different behavior (Fig. 4A-E). We quantified this relation by measuring the correlation and the slope of a linear regression (with null intercept) between the dissimilarity among humans (bhSD) and the dissimilarity between humans and the model (hmSD) (Table 1). The rationale for these measures is that the ground truth for each image (the human scanpaths) has different variability across images, and we cannot expect that, in the case of very diverse scanpaths among humans (higher dissimilarity), the model would be close to all of them. A close look at the correlation between the dissimilarity measures in humans and models showed that only a small fraction of images departed from the humans' scanpaths (Fig. 4B-E). These images (using µ + 3σ for cIBS+DG2) correspond to cases where few people found the target or, interestingly, where there are different possible scanpath behaviors (i.e., some people start looking for a cup on the cupboard and others on the table) (Supplemental Figure S3). Further studies should be performed to explore individual differences between observers.

When comparing priors, we observed that both the models with DeepGaze 2 and Center priors were closer to the humans' values, and the flat and noisy priors had larger scanpath dissimilarities (Fig. 4F). The model with a flat prior had a slightly better correlation, but both the models with DeepGaze 2 and Center priors had good correlations and slopes closer to 1 (Table 1). This suggests that the initial center bias is a fair approximation of the human prior. Nonetheless, although both scanpaths were almost indistinguishable from humans, saliency adds information that makes the model with DeepGaze 2 considerably better at finding the target.

Conclusions
We introduced the cIBS model, an expansion of the IBS model to natural scenes. In summary, we used saliency maps as priors to model the information collected in the first glimpse that guides the first saccades, and we modified the computation of the template response to be able to, first, use a simpler model of visibility and, second, give graded responses to regions similar to the target, incorporating the notion of distractors. To evaluate the model, we created a dataset of 57 subjects searching in 134 images, on which we compared cIBS to IBS and other strong baselines. We observed that saliency models performed well in predicting initial fixations, in particular the third fixation. Humans seemed to start from the initial forced fixation position, move to the center, and then to the most (bottom-up) salient location. After that, the performance of all the saliency models decays to almost chance, as is expected from their conception. They mainly encode bottom-up information of the image (Itti and Koch 2000) and not the aim of the task, and they are not able to change and update as it progresses. This is also consistent with previous results from Torralba et al. (2006), who implemented a saliency model for a visual search task.

As saliency models are good at predicting first bottom-up impressions, they are ideal candidates to be included as priors in the proposed Bayesian framework. The central bias performed well in the first two fixations, and it is included in all the saliency models (Cornia et al. 2016, 2018; Kummerer et al. 2017). The central bias by itself resulted in a good prior in terms of scanpath similarity, but not in terms of the performance of finding the target. However, DeepGaze 2 had the better compromise between both measures. This suggests that bottom-up cues provided by the saliency maps are relevant to the search.
Searcher:          IBS           Greedy        Saliency      cIBS          cIBS          cIBS          cIBS
Prior:             DG2           DG2           DG2           DG2           Center        Flat          Noisy
Weighted Distance  0.78          0.13          2.14          0.31          1.38          0.72          0.15
Mean Agreement     0.64 (0.096)  0.63 (0.082)  0.55 (0.092)  0.64 (0.084)  0.60 (0.086)  0.60 (0.109)  0.61 (0.082)
Jaccard Index      0.54 (0.105)  0.47 (0.091)  0.32 (0.073)  0.51 (0.098)  0.38 (0.086)  0.51 (0.109)  0.46 (0.089)
Linear Regression  0.85          0.76          0.52          0.84          0.88          0.69          0.61

Table 1: Different measures of performance and scanpath dissimilarity between humans and models. Weighted Distance is the distance between human and model performance weighted by the dispersion across humans. Mean Agreement and Jaccard Index measure the coincidence between humans' and the model's correct target detections (see Supplemental Information). Linear Regression corresponds to the slope of a simple y ∼ x model between the humans' scanpath dissimilarity (bhSD in Fig. 4 B-E and G-J) and the human-model scanpath dissimilarity (hmSD in Fig. 4 B-E and G-J). ρ is the Spearman correlation coefficient between the scanpath dissimilarities bhSD and hmSD. Only correct trials were considered and averaged for each image (N = 134).

Regarding the update and decision mechanism, it is clear that the simple saliency-based searcher, i.e., wandering around the most salient regions until bumping into the target, is not a good model of visual search. Humans use not only information about the scene but also information about the target and previous fixations. Then, a rule for comparing peripheral information of the scene with the target should be implemented, along with a mechanism for combining that information. Those mechanisms are implemented in the Greedy, IBS, and cIBS models, with the difference that the Greedy model maximizes the probability of finding the target in the next fixation, whereas IBS and cIBS maximize the probability of finding the target after the next fixation. Overall, the Greedy model had worse performance than both the IBS and cIBS models, suggesting that humans actually implement longer plans when searching in a visual scene. When comparing Bayesian models, we showed that, when searching in a cluttered image, it is important to account for the distractors present in the scene, extending the previous proposal from Najemnik and Geisler (2005).

Previous efforts at including contextual information aimed mainly to predict image regions likely to be fixated. For instance, they statically combined a spatial filter-based saliency map with previous knowledge of target object positions in the scene (Torralba et al. 2006). Some other works aimed to predict the sequence of fixations, but efforts at non-Bayesian modeling mainly used greedy algorithms (Rasouli and Tsotsos 2014; Zhang et al. 2018). Here, we compared an example of a greedy algorithm with others with a more long-sighted objective function, which has the additional benefit that some known behaviors of human visual search arise naturally. For example, Zhang et al. (2018) forced inhibition of return, while our model has it implicitly incorporated. Crucially, Bayesian frameworks are highly interpretable and connect our work to other efforts in modeling top-down influences in perception and decision-making.

It is also important to note that, although the Najemnik and Geisler (2005) model was a very insightful and influential proposal, to our knowledge our work is the first one that uses a Bayesian framework to predict eye movements during visual search in natural images. It is a leap in terms of applications, since prior work on Bayesian models was done in very constrained artificial environments (looking for a tiny Gabor patch embedded in background 1/f noise) (Najemnik and Geisler 2005). Moreover, we addressed the modifications needed when considering the complexities of natural images: specifically, the addition of a saliency map as prior, the modification of the template response's mean, and a shift in visibility. We also simplified assumptions from Najemnik and Geisler (2005) by not having to measure each person's visibility map beforehand. We use the same visibility map across subjects, which also avoids a potential leak of information about the viewing patterns into the model.
Finally, we also share optimized code for the models, which should be useful for others to replicate both Najemnik and Geisler (2005) and our results.

Our model aligns with the work of many researchers who propose probabilistic solutions to model human behavior from first principles. For instance, Bruce and Tsotsos (2006, 2009) proposed a saliency model based on an information maximization principle, which demonstrates great efficacy in predicting fixation patterns across both pictures and movies. Also, Ma et al. (2011) implemented a near-optimal visual search model for a fixed-gaze search task (i.e., exploring the allocation of covert attention), extending previous models to deal with the reliability of visual information across items and displays, and proposing an implementation of how information should be combined across objects and spatial locations, through marginalization. Interestingly, in both attempts to explain overt and covert allocation of attention, they proposed implementations through physiologically plausible neural networks.

More generally, the present work expands the growing notion of the brain as an organ capable of generalizing and performing inferences in noisy and cluttered scenarios through Bayesian inference, building complete and abstract models of its environment. Nowadays, those models cover a broad spectrum of perceptual and cognitive functions, such as decision making and confidence, learning, multisensory perception, and others (Knill and Pouget 2004; Tenenbaum, Griffiths, and Kemp 2006; Meyniel, Sigman, and Mainen 2015).

Code / Data availability

Code, data, and chosen parameters will be made publicly available upon publication for full reproduction of our results. Data will include the image dataset with targets, as well as fixation data for all of the subjects.
Author contributions
M.S. prepared the indoor images/targets dataset and coded the first implementation of the presented models, including numerical optimizations and speed-ups. M.S., G.S., and J.K. designed the task, collected the human data, and defined the model idea. G.B. and S.V. pruned the model's code, extended it, and explored its parameters. M.S., G.B., S.V., and J.K. performed the analysis. The manuscript was written by G.B., M.S., and J.K.
Acknowledgments
We thank P. Lagomarsino and J. Laurino for their collaboration with the data acquisition; C. Diuk (Facebook), A. Salles (OpenZeppelin), and Matias J. Ison (Univ. of Nottingham, UK) for their feedback and insight on the work; and K. Ball (UT Austin) for English editing of this article. The authors were supported by the National Science and Technology Research Council (CONICET) and the University of Buenos Aires (UBA). The research was supported by the ARL (W911NF-19-2-0240), UBA (20020170200259BA), and the National Agency of Promotion of Science and Technology (PICT 2016-1256).
References
Borji, A.; and Itti, L. 2014. Defending Yarbus: Eye movements reveal observers' task. Journal of Vision.
Borji, A.; Tavakoli, H. R.; Sihite, D. N.; and Itti, L. 2013. Analysis of scores, datasets, and models in visual saliency prediction. In Proceedings of the IEEE International Conference on Computer Vision, 921–928.
Bradley, C.; Abrams, J.; and Geisler, W. 2014. Retina-V1 model of detectability across the visual field. Journal of Vision.
Brainard, D. H. 1997. The Psychophysics Toolbox. Spatial Vision 10: 433–436.
Bruce, N.; and Tsotsos, J. 2006. Saliency based on information maximization. In Advances in Neural Information Processing Systems, 155–162.
Bruce, N. D.; and Tsotsos, J. K. 2009. Saliency, attention, and visual search: An information theoretic approach. Journal of Vision.
Bylinskii, Z.; Judd, T.; Oliva, A.; Torralba, A.; and Durand, F. 2018. What do different evaluation metrics tell us about saliency models? IEEE Transactions on Pattern Analysis and Machine Intelligence.
Castelhano, M. S.; and Henderson, J. M. 2007. Initial scene representations facilitate eye movement guidance in visual search. Journal of Experimental Psychology: Human Perception and Performance.
Cerf, M.; Harel, J.; Einhäuser, W.; and Koch, C. 2008. Predicting human gaze using low-level saliency combined with face detection. In Advances in Neural Information Processing Systems, 241–248.
Cornia, M.; Baraldi, L.; Serra, G.; and Cucchiara, R. 2016. A deep multi-level network for saliency prediction. In International Conference on Pattern Recognition (ICPR), 3488–3493. IEEE.
Cornia, M.; Baraldi, L.; Serra, G.; and Cucchiara, R. 2018. Predicting human eye fixations via an LSTM-based saliency attentive model. IEEE Transactions on Image Processing.
Gottlieb, J.; and Oudeyer, P.-Y. 2018. Towards a neuroscience of active sampling and curiosity. Nature Reviews Neuroscience.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.
Hoppe, D.; and Rothkopf, C. A. 2019. Multi-step planning of eye movements in visual search. Scientific Reports.
Itti, L.; and Koch, C. 2000. A saliency-based search mechanism for overt and covert shifts of visual attention. Vision Research.
Itti, L.; and Koch, C. 2001. Computational modelling of visual attention. Nature Reviews Neuroscience.
Itti, L.; Koch, C.; and Niebur, E. 1998. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Jarodzka, H.; Holmqvist, K.; and Nyström, M. 2010. A vector-based, multidimensional scanpath similarity measure. In Proceedings of the 2010 Symposium on Eye-Tracking Research & Applications, 211–218.
Judd, T.; Durand, F.; and Torralba, A. 2012. A Benchmark of Computational Models of Saliency to Predict Human Fixations. MIT Technical Report.
Kleiner, M.; Brainard, D.; and Pelli, D. 2007. What's new in Psychtoolbox-3? Perception 36 ECVP Abstract Supplement.
Knill, D. C.; and Pouget, A. 2004. The Bayesian brain: the role of uncertainty in neural coding and computation. TRENDS in Neurosciences.
Kotowicz, A.; Rutishauser, U.; and Koch, C. 2010. Time course of target recognition in visual search. Frontiers in Human Neuroscience 4: 31.
Kummerer, M.; Wallis, T. S.; and Bethge, M. 2018. Saliency benchmarking made easy: Separating models, maps and metrics. In Proceedings of the European Conference on Computer Vision (ECCV), 770–787.
Kummerer, M.; Wallis, T. S.; Gatys, L. A.; and Bethge, M. 2017. Understanding low- and high-level contributions to fixation prediction. In Proceedings of the IEEE International Conference on Computer Vision, 4789–4798.
Ma, W. J.; Navalpakkam, V.; Beck, J. M.; Van Den Berg, R.; and Pouget, A. 2011. Behavior and neural basis of near-optimal visual search. Nature Neuroscience.
Meyniel, F.; Sigman, M.; and Mainen, Z. F. 2015. Confidence as Bayesian probability: From neural origins to behavior. Neuron.
Najemnik, J.; and Geisler, W. S. 2005. Optimal eye movement strategies in visual search. Nature.
Oliva, A.; and Torralba, A. 2006. Building the gist of a scene: The role of global image features in recognition. Progress in Brain Research.
O'Reilly, J. X.; Jbabdi, S.; and Behrens, T. E. 2012. How can a Bayesian approach inform neuroscience? European Journal of Neuroscience.
Rasouli, A.; and Tsotsos, J. K. 2014. Visual saliency improves autonomous visual search. In Canadian Conference on Computer and Robot Vision, 111–118. IEEE.
Riche, N.; Duvinage, M.; Mancas, M.; Gosselin, B.; and Dutoit, T. 2013. Saliency and human fixations: State-of-the-art and study of comparison metrics. In Proceedings of the IEEE International Conference on Computer Vision, 1153–1160.
Rohe, T.; and Noppeney, U. 2015. Cortical hierarchies perform Bayesian causal inference in multisensory perception. PLoS Biology.
Rolfs, M. 2015. Attention in active vision: A perspective on perceptual continuity across saccades. Perception.
Russell, B. C.; Torralba, A.; Murphy, K. P.; and Freeman, W. T. 2008. LabelMe: a database and web-based tool for image annotation. International Journal of Computer Vision.
Samad, M.; Chung, A. J.; and Shams, L. 2015. Perception of body ownership is driven by Bayesian sensory inference. PLoS ONE.
Simonyan, K.; and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
Tatler, B. W.; Wade, N. J.; Kwan, H.; Findlay, J. M.; and Velichkovsky, B. M. 2010. Yarbus, eye movements, and vision. i-Perception.
Tenenbaum, J. B.; Griffiths, T. L.; and Kemp, C. 2006. Theory-based Bayesian models of inductive learning and reasoning. Trends in Cognitive Sciences.
Torralba, A.; Oliva, A.; Castelhano, M. S.; and Henderson, J. M. 2006. Contextual guidance of eye movements and attention in real-world scenes: The role of global features in object search. Psychological Review.
Torralba, A.; and Sinha, P. 2001. Statistical context priming for object detection. In Proceedings Eighth IEEE International Conference on Computer Vision (ICCV 2001), volume 1, 763–770. IEEE.
Turgeon, M.; Lustig, C.; and Meck, W. H. 2016. Cognitive aging and time perception: roles of Bayesian optimization and degeneracy. Frontiers in Aging Neuroscience 8: 102.
Ullman, S. 2019. Using neuroscience to develop artificial intelligence. Science.
Wiecki, T. V.; Poland, J.; and Frank, M. J. 2015. Model-based cognitive neuroscience approaches to computational psychiatry: Clustering and classification. Clinical Psychological Science.
Yang, S. C.-H.; Wolpert, D. M.; and Lengyel, M. 2016. Theoretical perspectives on active sensing. Current Opinion in Behavioral Sciences 11: 100–108.
Yarbus, A. L. 1967. Eye movements during perception of complex objects. In Eye Movements and Vision, 171–211. Springer.
Zhang, M.; Feng, J.; Ma, K. T.; Lim, J. H.; Zhao, Q.; and Kreiman, G. 2018. Finding any Waldo with zero-shot invariant and efficient visual search. Nature Communications.

Supplemental Information
We set up a visual search experiment in which participants have to search for an object in a crowded indoor scene. First, the target is presented in the center of the screen, subtending × pixels of visual angle (Fig. S1). After 3 seconds, the target is replaced by a fixation dot at a pseudo-random position at least 300 pixels away from the actual target position in the image (Fig. S1). This is done to avoid starting the search close to the target. The initial position was the same for a given image and all participants. Moreover, the search image appears after the participant fixates the dot. Thus, all observers initiate the search in the same place for a given image. The image is presented at a × resolution (subtending . × . degrees of visual angle) (Fig. S1).

The program automatically detects the end of each saccade during the target search. This period finishes when the participant fixates the target or after N saccades, allowing an extra 200 ms for the participant to be able to process the information in that last fixation (Kotowicz, Rutishauser, and Koch, 2010). The maximum number of saccades allowed (N) varied between 2 (13.4% of the trials), 4 (14.9%), 8 (29.9%), and 12 (41.8%) for most of the participants. These values were randomized for each participant, independently of the image.

Supplementary Figure S1: Paradigm schema.

After each trial, the participants are forced to guess the position of the target, even if they had already found it. They are instructed to cover the target position with a Gaussian blur, first by clicking on the center and then by choosing its radius. This is done by showing a screen with only the frame of the image and a mouse pointer (a small black dot) to select the desired center of the blur (Fig. S1). When choosing a position with the mouse, a Gaussian blur centered at that position is shown, and the participants are required to indicate the uncertainty of their decision by increasing or decreasing the size of the blur using the keyboard. Position and uncertainty reports were not analyzed in the present study.

A training block of 5 trials was performed at the beginning of each session with the experimenter present in the room. After the training block, the experiment started and the experimenter moved
Also, a pilot experiment with5 participants was performed to select images that usually require several fixations to find the target.The original images were all larger or equal than × pixels, and all were cropped and/orscaled to × pixels. For each image, a single target was manually selected among the objectsof × pixels or less that were not repeated in the image –because we weren’t evaluating theaccuracy of memory retrieval–. For all targets, we considered a surrounding region of × pixels. Participants were seated in a dark room, cm away from a 19-inch Samsung SyncMaster 997MBmonitor (refresh rate = 60Hz), with a resolution of × . A chin and forehead rest was usedto stabilize the head. Eye movements were acquired with an Eye Link 1000 (SR Research, Ontario,Canada) monocular at 1000 Hz. The saccade detection was performed online with the native EyeLink algorithm with the defaultparameters for cognitive tasks. Fixations were collapsed into a grid with cells of × pixels,resulting in a grid size of × cells. We explored the size of the grid in terms of model performance.Consecutive fixations within a cell were collapsed into one fixation to be fair with the model behavior.Also, fixations outside the image region were displaced to the closest cell. As we considered fixations,blinks periods were excluded.The trial was considered correct (target found) if the participant fixated into the target region ( × pixels). Only correct trials were analyzed in terms of eye movements. Different definitions of ROC-AUC found in https://saliency.tuebingen.ai/ showed the sametrend (Borji et al., 2013; Riche et al., 2013; Bylinskii et al., 2018; Kummerer, Wallis, and Bethge,2018). All those ROC curves are built base on the idea of considering the saliency map as a binaryclassifier by applying a threshold. Here, we report three of them: AUC-Judd, AUC-Borji, andshuffled-AUC (or sAUC). These metrics differ mainly on the definition of the true positive rate andthe false positive rate for the corresponding ROC curves. AUC-Judd considers human fixationsas ground truth and all non-fixated pixels as negative cases. This way, the true positive rate is theproportion of pixels with saliency values above a certain threshold that were fixated. The falsepositive (fp) rate is the proportion of pixels with saliency values above a certain threshold that werenot fixated. AUC-Borji keeps the same definition of the true positive rate, but uses a uniform randomsample of image pixels as negatives and defines the saliency map values above a certain threshold atthese pixels as false positives. Thus, the false positive rate is the proportion of those cases that werenot fixated. Finally, sAUC is similar to AUC-Borji, instead of sampling pixels from the same imageto define the fpr, it samples over fixation’s locations on other images.2upplementary Table S1: Saliency maps: Different AUC metrics estimated for the saliency maps onthe third fixation (Fig. 2)
Supplementary Table S1: Saliency maps: different AUC metrics estimated for the saliency maps on the third fixation (Fig. 2).

Saliency Maps   AUC-Judd   AUC-Borji   sAUC
MLNet           0.7464     0.6797      0.6008
SAM-VGG         0.7321     0.6305      0.5666
SAM-ResNet      0.7339     0.6501      0.5820
DeepGaze 2      0.7637     0.6537      0.5883
ICF             0.7509     0.7078      0.5808
Humans          0.8076     0.7792      0.7727
Center          0.6866     0.6739      0.5208

Supplementary Figure S2: Saliency maps. A) ROC curves and B) AUC values for all fixations. Color mapping for models is consistent with Fig. 2.
We used three measures to compare the performance (i.e., the probability of detecting the target) of each model with the human participants. Each of them focuses on a slightly different aspect.
In order to directly compare the performance curve of each model with the human participants, for each possible number of saccades allowed N ∈ {2, 4, 8, 12}, we calculate the difference between the mean proportion of targets found by participants and by the model m (P_subj(N) and P_m(N), respectively). For each number of saccades allowed, the difference is weighted by the standard deviation across participants σ. Then, the weighted distance WD(m) is the mean of those values:

WD(m) = \frac{1}{4} \sum_{N \in \{2, 4, 8, 12\}} \frac{\left| P_{subj}(N) - P_m(N) \right|}{\sigma}    (1)

The Jaccard Index is a metric that allows us to measure the proportion of targets found by humans that are explained by the models. We represent each participant and each model as a boolean S-dimensional vector with a one in the i-th position if they found the target in the i-th image, and zero otherwise. S is the number of images: in our experiment, S = 134.

Every participant has the same proportion of images with each maximum number of saccades allowed N (13.4% of the trials with N = 2, 14.9% with N = 4, 29.9% with N = 8, and 41.8% with N = 12). Nonetheless, the subset that gets each N is chosen uniformly at random for each subject. This way, for each image we have subjects that were interrupted after 2, 4, 8, and 12 saccades. As each participant has a different sample of maximum saccades allowed across images, we apply these same constraints to the model. That is, when we want to compare with participant p, we apply their constraints to our model m. Then, we decide for each image whether the model can find the target in fewer saccades than allowed. Given a model m, we define sacc(m, i) as the number of saccades that the model needs to find the target in the i-th image. Then, for each participant we have max_sacc(i, p) as the number of saccades allowed for image i and participant p, with p = 1 ... 57 in our data. From these values, we construct the vector of targets found by the model using the saccade threshold distribution of each participant p, called TFM(m)_p (Targets Found by Model). TFM(m)_p ∈ {0, 1}^S is defined as:

TFM(m)_p(i) = \begin{cases} 1 & \text{if } sacc(m, i) \le max\_sacc(i, p) \\ 0 & \text{otherwise} \end{cases}    (2)

Each participant p is also represented as an S-dimensional vector TFP_p ∈ {0, 1}^S (Targets Found by Participant):

TFP_p(i) = \begin{cases} 1 & \text{if participant } p \text{ found the target in the } i\text{-th image} \\ 0 & \text{otherwise} \end{cases}    (3)

Then, we compute the Jaccard Index (Real and Vargas, 1996) between those vectors (eq. 4):

jaccard(p, m) = \frac{\left| TFP_p \cap TFM(m)_p \right|}{\left| TFP_p \cup TFM(m)_p \right|}    (4)

Another measure we considered to compare a model's performance against humans is what we called the Mean Agreement Score (MAS), inspired by the Mean Absolute Error. This measure calculates the mean proportion of trials in which both the participant and the model had the same performance, with the purpose of measuring the agreement between our model and the participants. We compute the absolute difference between the boolean vectors defined for the Jaccard Index (eqs. 2-3) and calculate its mean. Finally, the Mean Agreement Score between model m and participant p is 1 minus that value:

MAS(p, m) = 1 - \frac{1}{S} \sum_{i=1}^{S} \left| TFP_p(i) - TFM(m)_p(i) \right|    (5)
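The three measures translate directly into code; below is a sketch assuming NumPy and the boolean vectors defined above (the 1/4 factor in eq. 1 and the form of eq. 5 follow the reconstructions given here).

```python
import numpy as np

def weighted_distance(p_subj, p_model, sigma):
    """Eq. (1): mean human-model difference in proportion of targets found,
    weighted by the across-participant SD; all arguments are dicts keyed by
    the number of saccades allowed N."""
    return float(np.mean([abs(p_subj[n] - p_model[n]) / sigma[n]
                          for n in (2, 4, 8, 12)]))

def jaccard_index(tfp, tfm):
    """Eq. (4): Jaccard index between the boolean vectors TFP_p and TFM(m)_p."""
    tfp, tfm = np.asarray(tfp, bool), np.asarray(tfm, bool)
    return (tfp & tfm).sum() / (tfp | tfm).sum()

def mean_agreement(tfp, tfm):
    """Eq. (5): 1 minus the mean absolute difference between the boolean
    vectors, i.e., the proportion of images with the same outcome."""
    tfp, tfm = np.asarray(tfp, float), np.asarray(tfm, float)
    return 1.0 - float(np.abs(tfp - tfm).mean())
```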
Some images showed an overall disagreement but, looking a little deeper, we can see that participants performed two different but consistent patterns. As the present implementation of the model is deterministic, it chose only one of those patterns (Figure S3). In Figure S3, we show some of the human scanpaths and the model (cIBS+DG2) scanpath for one image to illustrate that specific behavior. In this case, the cup is the search target and there are two surfaces where, a priori, it is equally likely to be found. We selected six human scanpaths with different hmSD values to show how the initial decision determines the behavior and the overall hmSD for that scanpath (Fig. S3, left panel), but almost all participants end up exploring both regions (Fig. S3, right panel). Note that dark green traces are scanpaths very similar to the model's, while yellow traces are scanpaths that differ from the model's. Further development of the model should focus on mimicking these individual differences in visual search.
The model was fully developed in MATLAB. Saliency map calculations were performed using the public code provided by the respective authors (Cornia et al., 2016, 2018; Kummerer et al., 2017).

Supplementary Figure S3: Scanpath prediction comparison. The figure is meshed by a fixed grid of δ = 32 px. Each curve represents a scanpath: in red, the cIBS+DG2 model's scanpath, and six scanpaths of participants colored according to their dissimilarity to the model's scanpath. The left panel shows the first four fixations of each scanpath, and the right panel shows the whole scanpaths. The search target is represented by the blue square and the approximate first fixation by the red square. Above the image, the hmSD and bhSD for this trial are reported. Image taken from Wikimedia Commons.
All code, data, and parameters needed to reproduce this paper's results and visualizations will be available on GitHub.

References
Borji, A.; Tavakoli, H. R.; Sihite, D. N.; and Itti, L. 2013. Analysis of scores, datasets, and models in visual saliency prediction. In Proceedings of the IEEE International Conference on Computer Vision, 921–928.
Brainard, D. H. 1997. The Psychophysics Toolbox. Spatial Vision 10: 433–436.
Bylinskii, Z.; Judd, T.; Oliva, A.; Torralba, A.; and Durand, F. 2018. What do different evaluation metrics tell us about saliency models? IEEE Transactions on Pattern Analysis and Machine Intelligence.
Cornia, M.; Baraldi, L.; Serra, G.; and Cucchiara, R. 2016. A deep multi-level network for saliency prediction. In International Conference on Pattern Recognition (ICPR), 3488–3493. IEEE.
Cornia, M.; Baraldi, L.; Serra, G.; and Cucchiara, R. 2018. Predicting human eye fixations via an LSTM-based saliency attentive model. IEEE Transactions on Image Processing.
Kleiner, M.; Brainard, D.; and Pelli, D. 2007. What's new in Psychtoolbox-3? Perception 36 ECVP Abstract Supplement.
Kotowicz, A.; Rutishauser, U.; and Koch, C. 2010. Time course of target recognition in visual search. Frontiers in Human Neuroscience 4: 31.
Kummerer, M.; Wallis, T. S.; and Bethge, M. 2018. Saliency benchmarking made easy: Separating models, maps and metrics. In Proceedings of the European Conference on Computer Vision (ECCV), 770–787.
Kummerer, M.; Wallis, T. S.; Gatys, L. A.; and Bethge, M. 2017. Understanding low- and high-level contributions to fixation prediction. In Proceedings of the IEEE International Conference on Computer Vision, 4789–4798.
Real, R.; and Vargas, J. M. 1996. The probabilistic basis of Jaccard's index of similarity. Systematic Biology.
Riche, N.; Duvinage, M.; Mancas, M.; Gosselin, B.; and Dutoit, T. 2013. Saliency and human fixations: State-of-the-art and study of comparison metrics. In Proceedings of the IEEE International Conference on Computer Vision, 1153–1160.
Russell, B. C.; Torralba, A.; Murphy, K. P.; and Freeman, W. T. 2008. LabelMe: a database and web-based tool for image annotation. International Journal of Computer Vision.