Modeling human visual search: A combined Bayesian searcher and saliency map approach for eye movement guidance in natural scenes
Melanie Sclar*, Gastón Bujia*, Sebastián Vita, Guillermo Solovey, Juan Esteban Kamienkowski
Laboratorio de Inteligencia Artificial Aplicada, Instituto de Ciencias de la Computación, Universidad de Buenos Aires – CONICET, Argentina
Instituto del Cálculo, Universidad de Buenos Aires – CONICET, Argentina
Departamento de Física, FCEyN, Universidad de Buenos Aires, Argentina
* Indicates equal contribution. Correspondence should be addressed to M.S. ([email protected]) or G.B. ([email protected]).
Abstract
Finding objects is essential for almost any daily-life visual task. Saliency models have been useful to predict fixation locations in natural images, but they are static, i.e., they provide no information about the time sequence of fixations. Nowadays, one of the biggest challenges in the field is to go beyond saliency maps and predict a sequence of fixations related to a visual task, such as searching for a given target. Bayesian observer models have been proposed for this task, as they represent visual search as an active sampling process. Nevertheless, they were mostly evaluated on artificial images, and how they adapt to natural images remains largely unexplored. Here, we propose a unified Bayesian model for visual search guided by saliency maps as prior information. We validated our model with a visual search experiment in natural scenes, recording eye movements. We show that, although state-of-the-art saliency models perform well in predicting the first two fixations in a visual search task, their performance degrades to chance afterward. This suggests that saliency maps alone are good at modeling bottom-up first impressions, but are not enough to explain the scanpaths when top-down task information is critical. Thus, we propose to use them as priors of Bayesian searchers. This approach leads to behavior very similar to humans for the whole scanpath, both in the percentage of targets found as a function of the fixation rank and in the scanpath similarity, reproducing the entire sequence of eye movements.
Introduction
Visual search is a natural task that humans perform in everyday life, from looking for someone in a photograph to searching where you left your favorite mug in the kitchen. Finding our goal relies on our ability to gather visual information through a sequence of eye movements, performing a discrete sampling of the scene. This sampling of information is not carried out on random points. The fixations of the gaze follow different strategies, trying to minimize the number of steps needed to find the target (Borji and Itti 2014; Rolfs 2015; Tatler et al. 2010; Yarbus 1967). This process is a classical example of the active sensing or sampling paradigm, in which humans decide between different ways of sampling information, making inferences with the information gathered so far, in order to fulfill a necessity (Yang, Wolpert, and Lengyel 2016; Gottlieb and Oudeyer 2018). This way, each decision is expected to reduce uncertainty about the environment (Gottlieb and Oudeyer 2018; Najemnik and Geisler 2005). Moreover, this decision-making behavior depends on the purpose; for instance, whether it is task-driven or simple curiosity. Predicting the eye movements necessary to meet a goal is a computationally complex task, since it must combine the bottom-up information-capturing processes with the top-down integration of information and the updating of expectations at each fixation.

A related task is the prediction of the most likely fixation positions in the scene, whose purpose is to build a saliency map, identifying regions that draw our attention within an image. The first saliency models were built based on computer vision strategies, combining different filters over the image (Itti and Koch 2001). Some of these filters could be very general, such as a low-pass filter that gives the idea of the horizon (Itti, Koch, and Niebur 1998; Torralba and Sinha 2001), or more specific, such as detecting high-level features like faces (Cerf et al. 2008). In recent years, deep neural networks (DNNs) have advanced the development of saliency maps. Many saliency models have successfully incorporated pre-trained convolutional DNNs in order to extract low- and high-level features of the images (Kummerer et al. 2017; Cornia et al. 2016, 2018). These novel approaches were summarized in the MIT/Tuebingen collaboration website (Kummerer, Wallis, and Bethge 2018). Nevertheless, they produce accurate results only in the first few fixations of free exploration tasks (Torralba et al. 2006), as they can make use of neither the sequential nature of the task nor the combination of information through the sequence of fixations. Thus, they may not be able to replicate their good results when performing a complex task, such as a visual search.

Recently, Zhang et al. (2018) proposed an extension of those models to predict the sequence of fixations in a visual search in natural images. They use a greedy algorithm based on DNNs, mimicking the behavior of the visual system by elaborating an attention map related to the search goal. Using a greedy algorithm implied forcing some known
behaviors of human visual search, like inhibition of return, that arise naturally with longer-sighted objective functions.

Nowadays, there is a growing interest in Bayesian models, given their good results at modeling human behavior and also their straightforward interpretation in terms of human information processing (Ullman 2019). For instance, different Bayesian models have been applied to decision making or perceptual tasks (O'Reilly, Jbabdi, and Behrens 2012; Rohe and Noppeney 2015; Samad, Chung, and Shams 2015; Wiecki, Poland, and Frank 2015; Turgeon, Lustig, and Meck 2016; Knill and Pouget 2004; Tenenbaum, Griffiths, and Kemp 2006; Meyniel, Sigman, and Mainen 2015). In the case of visual search, Najemnik and Geisler (2005) proposed a model that decides the next eye movement based on its prior knowledge, a visibility map, and the current state of a posterior probability that is updated after every fixation. In this model, inhibition of return and moderate saccade length, among other human characteristics of visual search, arise naturally (Najemnik and Geisler 2005). The results of this Ideal Bayesian Searcher (IBS) have had a wide impact, but the images used in their experiments were all artificial. Recently, Hoppe and Rothkopf (2019) proposed a visual search model that incorporates planning. It uses the uncertainty of the current observation to select the upcoming gaze locations in order to maximize the probability of detecting the location of the target after a sequence of two saccades. As in Najemnik and Geisler (2005), their task is specifically designed for maximizing the difference between models, i.e., finding the target in very few fixations on artificial stimuli. In order to extend these results to natural images, it is necessary to incorporate the information available in the scene.

Here, we show that an IBS model combined with state-of-the-art saliency maps as priors performs similarly to humans in a visual search task on natural scenes. Moreover, we incorporate a different update rule into the IBS model, based on a correlation for the template response. This modification could incorporate the effect of distractors, even though we did not specifically test this hypothesis in our experiments. We also simplify assumptions from prior work by avoiding measuring each subject's visibility map beforehand, using an a priori approximation instead. Previous work compared the general performance between humans and the model (targets found and total number of fixations). Moving one step further, we quantitatively compare the scanpaths (ordered sequences of fixations) produced by the model with the ones recorded by human observers.

Visual search in natural indoor images: Human data
Paradigm and human data acquisition and preprocessing
We set up a visual search experiment in which participants have to search for an object in a crowded indoor scene. First, the target was presented in the center of the screen, subtending × pixels of visual angle (Fig. S1). After 3 seconds, the target was replaced by a fixation dot at a pseudo-random position at least 300 pixels away from the actual target position in the image (Fig. S1). This was done to avoid starting the search too close to the target. The initial position was the same for a given image and all participants. The search image appears after the participant fixates the dot. Thus, all observers initiate the search in the same place for each image (Fig. S1).

Saccades and fixations were parsed online. The search period finishes when the participant fixates the target or after N saccades, allowing an extra 200 ms in order to let observers process information from this last fixation (Kotowicz, Rutishauser, and Koch 2010). The maximum number of saccades allowed (N) varied between 2, 4, 8, and 12. These values were randomized for each participant, independently of the image. The experiment was programmed using PsychToolbox and EyeLink libraries in MATLAB (Brainard 1997; Kleiner, Brainard, and Pelli 2007).

The images correspond to 134 indoor pictures from Wikimedia Commons, indoor design blogs, and the LabelMe database (Russell et al. 2008), which have several objects and no human figures or text. The images were presented at a × resolution (subtending . × . degrees of visual angle). For each image, a single target was selected among objects of size equal to or smaller than × pixels. Also, we excluded targets with almost exact copies within the image (e.g., a cup within a set of cups) to prevent high working memory requirements for humans. For all of them we considered a surrounding region of × pixels. Finally, we checked that there were no consistent spatial biases across the images.

Fifty-seven subjects (34 male; age . ± . years old) participated in the visual search task. All were students or teachers from the Facultad de Ciencias Exactas y Naturales de la Universidad de Buenos Aires. All subjects were naïve to the objectives of the experiment, had normal or corrected-to-normal vision, and provided written informed consent according to the recommendations of the Declaration of Helsinki to participate in the study.

See more details on the paradigm, data acquisition, and preprocessing in the Supplemental Information.

Human behavior results
During the experiment, observers have to search for a given target object within natural indoor scenes. The trial stops when the observer finds the target or after N saccades (N = 2, 4, 8, or 12). As expected, the proportion of targets found increases as a function of the saccades allowed (Fig. 1A), reaching a plateau from 8 to 12 saccades allowed and on (Fig. 1A and data from a preliminary experiment with up to 64 saccades, not shown).

Overall, the recorded eye movements behave as expected. First, the amplitude decreases with the fixation rank, presenting the so-called coarse-to-fine effect (Fig. 1B), and the saccades tended to be horizontal more than vertical (Fig. 1C). Finally, the initial spatial distribution of fixations had a central bias, and then extended first over the horizon until it covered the whole image, as the targets are uniformly distributed along the scene (Fig. 1D). This effect could be partially due to the organization of the task (the central drift correction and presentation of the target), the setup (the central position of the monitor with respect to the eyes/head), and the images (the photographer typically centers the image); it could also be due to processing benefits, as the center is the optimal position to acquire low-level information about the whole scene or to start the exploration.

Figure 1: General behavior. (A) Proportion of targets found as a function of the number of saccades allowed. Distributions of (B) saccade length, (C) saccade direction (measured in degrees from the positive horizontal axis), and (D) fixation positions for different fixation ranks (1, 2, 3, 4, 5-8, 9-12).

Searcher Modeling Approach
The proposed model involves two main aspects of the visual search implementation: a saliency map estimation step, as a first, glimpse-like, information extraction, and successive search steps in order to build the full scanpath.
Exploring saliency maps
Saliency maps are usually estimated in a scene-viewing task, where observers freely move their eyes, exploring whatever captures their interest. The motivation for using the full saliency map as a prior in a visual search task lies in the results of the flash-preview moving-window paradigm, which show that even less than a few hundred milliseconds' glimpse of a scene can guide search, as long as sufficient time is subsequently available to combine prior knowledge with the current visual input (Oliva and Torralba 2006; Torralba et al. 2006; Castelhano and Henderson 2007). Importantly, it was also shown that this is more relevant in the first saccades of a full-scene search, and that its predictive power decays with the fixation rank (Torralba et al. 2006).
Saliency Maps
In the last few years several saliency models appeared in the literature and made their code available. Many of them were nicely summarized and compared in the https://saliency.tuebingen.ai/ repository (Judd, Durand, and Torralba 2012; Kummerer, Wallis, and Bethge 2018; Bylinskii et al. 2018). With the purpose of understanding which features guide the search, in this section we choose and compare five different state-of-the-art saliency maps for our task: DeepGaze 2 (Kummerer et al. 2017), MLNet (Cornia et al. 2016), SAM-VGG and SAM-ResNet (Cornia et al. 2018), and ICF (Intensity Contrast Feature) (Kummerer et al. 2017).

All the saliency models considered (except for ICF) are based on neural network architectures, using different convolutional networks (CNNs) pretrained on object recognition tasks. These CNNs play the role of computing a fixed feature-space representation (feature extractor) of the image, which is then fed to a predictor function (in the models we consider, also a neural network). DeepGaze uses a VGG-19 (Simonyan and Zisserman 2014) as feature extractor, and the predictor is a simpler four-layer CNN (Kummerer et al. 2017). The MLNet model uses a modified VGG-16 (Simonyan and Zisserman 2014) that returns several feature maps, and a simpler CNN is used as a predictor that incorporates a learnable center prior (Cornia et al. 2016). Finally, SAM can use both VGG-16 and ResNet-50 (He et al. 2016) as two different feature extractors, and the predictor is a neural network with attentive and convolutional mechanisms (Cornia et al. 2018). ICF has an architecture similar to DeepGaze, but it uses Gaussian filters instead of a neural network. This way, ICF extracts purely low-level image information (intensity and intensity contrast). We also include a saliency model with just the center bias, modeled by a 2D Gaussian distribution.

As the control model, we built a human-based saliency map using the accumulated fixation positions of all observers for a given image, smoothed with a Gaussian kernel (SD = 25 px). Given that observers were forced to begin each trial in the same position, we did not use the first fixations but the third. This way we capture the regions that attract human attention.
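As a concrete illustration, here is a minimal sketch of how such a human-based saliency map can be computed from recorded fixations; the function name and the use of NumPy/SciPy are our own choices, and the only parameter taken from the text is the 25 px kernel width.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def human_saliency_map(fixations, image_shape, sigma=25.0):
    """Accumulate fixation positions into a count map and smooth it.

    fixations: iterable of (x, y) pixel coordinates (e.g., the third
    fixations of all observers on one image); image_shape: (height, width);
    sigma: Gaussian kernel SD in pixels (25 px in the text).
    """
    counts = np.zeros(image_shape, dtype=float)
    for x, y in fixations:
        row, col = int(round(y)), int(round(x))
        if 0 <= row < image_shape[0] and 0 <= col < image_shape[1]:
            counts[row, col] += 1.0
    smoothed = gaussian_filter(counts, sigma=sigma)
    return smoothed / smoothed.sum()  # normalize to a probability map
```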
Prediction of observers’ fixation positions
We evaluated how well the saliency models by themselves predict fixations along the search. Thus, we considered each saliency map S as a binary classifier on every pixel and used Receiver Operating Characteristic (ROC) curves and the Area Under the Curve (AUC) to measure their performance. This comes with the difficulty that there is not a unique way of defining the false positive rate (fpr). In dealing with this problem, previous work on this task has used many different definitions of (the ROC and its corresponding) AUC (Borji et al. 2013; Riche et al. 2013; Bylinskii et al. 2018; Kummerer, Wallis, and Bethge 2018). Briefly, to build our ROC we considered the true positive rate (tpr) as the proportion of saliency map values above each threshold at fixation locations, and the fpr as the proportion of saliency map values above the threshold at non-fixated pixels (Fig. 2A).

As expected, the saliency map built from the distribution of third fixations performed by humans (human-based saliency map) is superior to all other saliency maps, and the center bias map was clearly worse than the rest (Fig. 2B). This is consistent with the idea that the first steps in visual search are mostly guided by image saliency. The rest of the models have similar performance in AUC, with DeepGaze 2 performing slightly better than the others (Fig. 2B). Using different definitions of AUC (Borji et al. 2013; Riche et al. 2013; Bylinskii et al. 2018; Kummerer, Wallis, and Bethge 2018) showed the same trend (Table S1). If we consider all fixations, the AUC is reduced for all models, including the human-based saliency map built on the third fixations (Fig. S2).

Figure 2: Saliency maps. A) Example of how to estimate the TPR for the ROC curve, B) ROC curves and C) AUC values for the third fixation, and D) AUC for each model as a function of the current fixation rank. Color mapping for models is consistent over B, C, and D.

All models reached a maximum AUC value at the second fixation, except the human-based model, which peaked at the third fixation, as expected (Fig. 2C). Interestingly, the center bias begins at a similar level as the other saliency maps but decays more rapidly, reaching 0.5 at the fourth fixation. Thus, the other saliency maps must capture some other relevant visual information. Nevertheless, the AUC values of all saliency maps decay smoothly (Fig. 2C). This suggests that the gist the observers are able to collect in the first fixations is largely modified by the search. Top-down mechanisms must take control and play major roles in eye movement guidance as the number of fixations increases (Itti and Koch 2000). The DeepGaze 2 model performed better over all fixation ranks, becoming indistinguishable from human performance at the second fixation (Fig. 2C).
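The tpr/fpr definition above can be written compactly; the following sketch (our own naming, NumPy assumed) computes the resulting AUC for one image by sweeping thresholds over the saliency map.

```python
import numpy as np

def saliency_auc(saliency, fixation_mask, n_thresholds=100):
    """AUC of a saliency map treated as a per-pixel binary classifier.

    saliency: 2D array; fixation_mask: boolean 2D array, True at fixated pixels.
    tpr: fraction of fixated pixels above the threshold;
    fpr: fraction of non-fixated pixels above the threshold.
    """
    pos = saliency[fixation_mask]
    neg = saliency[~fixation_mask]
    thresholds = np.linspace(saliency.min(), saliency.max(), n_thresholds)
    tpr = np.array([(pos > t).mean() for t in thresholds])
    fpr = np.array([(neg > t).mean() for t in thresholds])
    order = np.argsort(fpr)  # traverse the ROC from left to right
    return float(np.trapz(tpr[order], fpr[order]))
```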
Bayesian Searcher models
In the previous section, we showed that saliency models alone are not able to predict the fixation positions in a visual scene when a task, even a simple task like visual search, is performed. Moreover, these models are not designed to predict the order of those fixations. In the next section, we develop models that integrate the saliency maps as priors but implement rules to update those probabilities as the search progresses.
Description of the Ideal Bayesian searcher (IBS)
Najemnik and Geisler (2005)'s IBS computes the optimal next fixation location at each step. It considers each possible next fixation and picks the one that will maximize the probability of correctly identifying the location of the target after that fixation. The optimal fixation location at step T + 1, k_opt(T + 1), is computed as (eq. 1):

k_{opt}(T+1) = \arg\max_{k(T+1)} \sum_{i=1}^{n} p_i(T)\, p(C \mid i, k(T+1))    (1)

where p_i(T) is the posterior probability that the target is at the i-th location within the grid after T fixations, and p(C | i, k(T+1)) is the probability of being correct given that the true target location is i and the location of the next fixation is k(T+1). p_i(T) involves the prior, the visibility map (d_{ik(t)}), and a notion of the target location (W_{ik(t)}):

p_i(T) = \frac{prior(i) \prod_{t=1}^{T} \exp\left(d_{ik(t)}^2 W_{ik(t)}\right)}{\sum_{j=1}^{n} prior(j) \prod_{t=1}^{T} \exp\left(d_{jk(t)}^2 W_{jk(t)}\right)}    (2)

The template response, W_{ik(t)}, quantifies the similarity between a given position i and the target image from the fixated position k(t) (t is any previous fixation). It is defined as W_{ik(t)} ~ N(\mu_{ik(t)}, \sigma_{ik(t)}), where:

\mu_{ik(t)} = \mathbb{1}(i = \text{target location}) - 0.5, \qquad \sigma_{ik(t)} = \frac{1}{d_{ik(t)}}    (3)

Abusing notation, in eq. 2, W_{ik(t)} refers to a value drawn from this distribution.

IBS has only been tested on artificial images, where subjects need to find a Gabor patch among 1/f noise in one out of 25 possible locations. This work is, to our knowledge, the first one to test this approach in natural scenes. Below, we discuss the modifications needed to apply IBS to eye movements in natural images.
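Before turning to those modifications, here is a minimal sketch of the update and decision rule of eqs. 1-2, assuming the grid, prior, and visibility maps are given as arrays; how p(C | i, k(T+1)) is estimated is left abstract (it is passed in precomputed), and the squared-visibility weighting follows Najemnik and Geisler's formulation.

```python
import numpy as np

def posterior(prior, d_hist, W_hist):
    """Posterior over the n grid cells after T fixations (eq. 2).

    prior: (n,) prior over cells; d_hist: list of (n,) visibility maps, one per
    past fixation k(t); W_hist: list of (n,) sampled template responses W_{ik(t)}.
    """
    log_p = np.log(prior)
    for d, W in zip(d_hist, W_hist):
        log_p += d ** 2 * W      # evidence weighted by (squared) visibility
    log_p -= log_p.max()         # numerical stability before normalizing
    p = np.exp(log_p)
    return p / p.sum()

def next_fixation(p_T, p_correct):
    """Optimal next fixation (eq. 1).

    p_T: (n,) current posterior; p_correct: (n_locations, n) array whose entry
    [k, i] approximates p(C | target at i, next fixation k).
    Returns the index k maximizing the expected probability of being correct.
    """
    expected = p_correct @ p_T   # sum_i p_i(T) * p(C | i, k) for every k
    return int(np.argmax(expected))
```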
Modifications to the IBS to handle natural scenes (correlation-based IBS, cIBS). Since it would be both computationally intractable to compute the probability of fixating on every pixel of a × image, and ineffective to do so (as useful information spans regions larger than a pixel), we restrict the possible fixation locations to the center points of a grid of cells of δ × δ pixels each. We collapse the eye movements to these points accordingly: consecutive fixations within a cell were merged into one fixation to be fair with the model's behavior.

The original IBS model had a uniform prior distribution. Since we are trying to model fixation locations in a natural scene, we introduced a saliency model as the prior: prior(i) is the average of the saliency in the i-th grid cell.

Importantly, the presence of the target at a certain position is not as straightforward in natural images as in artificial stimuli, where all the incorrect locations are equally dissimilar. In natural images there are often distractors, i.e., positions in the image that are visually similar to the target, especially when seen with low visibility. Therefore, we propose a redefined template response \tilde{W}_{ik(t)} ~ N(\tilde{\mu}_{ik(t)}, \tilde{\sigma}_{ik(t)}), where \tilde{\mu}_{ik(t)} is defined as (eq. 4):

\tilde{\mu}_{ik(t)} = \mu_{ik(t)} \cdot \frac{d_{ik(t)} + 1}{2} + corr_i \cdot \left(1 - d_{ik(t)}\right)    (4)

corr_i ∈ [−0.5, 0.5] is the cross-correlation between location i and the target image, used as a measure of image similarity. Moreover, we modified σ_{ik(t)} to keep the variance dependent on the visibility, but we incorporate two parameters (eq. 5):

\tilde{\sigma}_{ik(t)} = \frac{1}{a \cdot d_{ik(t)} + b}    (5)

The parameters a and b jointly modulate the inverse of the visibility and prevent 1/d from diverging. These parameters were not included in the original model, probably because d was estimated empirically (from thousands of trials and independently for each subject) and was never exactly equal to zero. Recently, Bradley, Abrams, and Geisler (2014) simplified the task by fitting a visibility map built from a first-principles model, which proposed an analytic function with several parameters that should still be fitted for each participant. Here, we further simplified it by using a two-dimensional Gaussian with the same parameters for every participant, avoiding a potential leak of information about the viewing patterns into the model. The parameters were taken a priori (estimated from the parameters in Najemnik and Geisler (2005) and Bradley, Abrams, and Geisler (2014)). We chose the parameters of the model using a classical grid search procedure on a previous experiment with a smaller dataset, and the same best parameters (δ = 32, a = 3, and b = 4) are used for all the models.

We call this variation of the model correlation-based IBS (cIBS). The data, models, and code will be publicly available upon publication. It is worth mentioning that, to our knowledge, not only is an implementation of the Najemnik and Geisler (2005) model publicly available for the first time here, but it is also largely optimized. More details in the Supplemental Information.
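A minimal sketch of the modified template response, assuming the reconstruction of eqs. 4-5 given above and a visibility map normalized so that d lies in [0, 1]; function and argument names are ours, and a = 3, b = 4 are the values reported in the text.

```python
import numpy as np

def cibs_template_response(is_target, corr, d, a=3.0, b=4.0, rng=None):
    """Draw W~_{ik(t)} for one grid cell i seen from fixation k(t).

    is_target: whether cell i contains the target; corr: cross-correlation
    between cell i and the target, rescaled to [-0.5, 0.5]; d: visibility of
    cell i from k(t); a, b: parameters of eq. 5.
    """
    rng = np.random.default_rng() if rng is None else rng
    mu = 0.5 if is_target else -0.5                      # eq. 3
    mu_tilde = mu * (d + 1.0) / 2.0 + corr * (1.0 - d)   # eq. 4 (as reconstructed)
    sigma_tilde = 1.0 / (a * d + b)                      # eq. 5
    return rng.normal(mu_tilde, sigma_tilde)
```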
Evaluating searcher models on human data

We first evaluate the updating of probabilities and the decision rule for the next fixation position of the proposed cIBS model. For comparison, we used the previous IBS model, in which the template response accounts only for the presence or absence of the target, and not for the similarity of the given region to the target. Also, we implemented two other basic models: a Greedy searcher and a Saliency-based searcher. The Greedy searcher bases its decision on maximizing the probability of finding the target in the next fixation. It only considers the present posterior probabilities and the visibility map, and does not take into account how the probability map is going to be updated after that. The Saliency-based searcher simply goes through the most salient regions of the image, adding an inhibition-of-return effect to each visited region. In these models, we used DeepGaze 2 as the prior, because it is the best performing saliency map of the previous section. We also evaluate the use of different priors with the cIBS model, comparing with the center bias alone (a centered two-dimensional Gaussian distribution), a uniform (flat) distribution, and a white noise distribution.

Figure 3: Model performance comparison. Proportion of targets found for each threshold considered. The boxes represent the human behavior distribution. The curves are the performance achieved by the models considered, (A) using different search strategies with DeepGaze 2 (DG2) as prior, and (B) using different priors with the cIBS strategy.

The proportion of targets found for each of the possible numbers of saccades allowed was used as a measure of overall performance (Fig. 3). In Table 1 we summarize the metrics comparing humans and models. The Weighted Distance measures the mean difference between curves (Fig. 3). The Jaccard Index represents the proportion of targets found by the model relative to the total targets found by the subjects, and the Mean Agreement metric measures the proportion of trials in which subject and model had the same performance: both subject and model found (or did not find) the target (see Supplemental Information). When comparing different searchers with the same prior, cIBS has the best agreement with the humans' performance as a compromise among the different metrics (Table 1). In particular, the performance of cIBS+DeepGaze2 is the closest to the human mean agreement ( . ± . ). Nevertheless, the curves of the IBS and Greedy models were also very close to humans (Fig. 3A), and each of them performed better in one metric (Table 1). It is important to note that the Weighted Distance significantly improves when adding the distractor component (cIBS vs. IBS). Only the basic Saliency-based model had poorer agreement with the human performance, showing that template matching weighted by visibility is a plausible mechanism for searching potential targets in the scene.

Then, we explored the importance of the prior, comparing the best searcher model with the chosen prior against the different basic priors. Again, cIBS+DeepGaze2 had better overall agreement with humans' behavior (Fig. 3B and Table 1) and, interestingly, it is the only model that presented the step-like function characteristic of humans (Fig. 3B).

Figure 4: Human-model scanpath dissimilarity comparison. All metrics are computed first between each pair of participants/model and then averaged. A) Distribution of scanpath dissimilarity between humans and different search strategies using DeepGaze 2 as prior. Each dot represents a trial image. B-E) Distribution of scanpath dissimilarity between humans (bhSD) and between humans and model (hmSD) for each trial image. F) Distribution of scanpath dissimilarity between different priors using cIBS as search strategy. G-J) Same plots as (B-E) for each different prior considered. Only correct trials were considered in (B-E, G-J).

Going one step further, we compared the scanpaths using the pairwise scanpath dissimilarity metric proposed by Jarodzka, Holmqvist, and Nyström (2010). Briefly, the metric represents scanpaths as time-aligned vectors: each scanpath is defined as a sequence of fixations u_i ((x, y) coordinates), where the i-th saccade is the shortest path (vector) going from u_i to u_{i+1}. Each u_i may not be exactly a fixation, but the center of a cluster of several fixations, making this measure more robust. Depending on the objective of the comparison, it can be used with different summary measures based on those characteristics. As we aim to compare the sequence of explored locations between humans and our model, we use shape dissimilarity. It is calculated as the normalized difference between the saccade vectors, where a scanpath dissimilarity of 0 indicates that the scanpaths are highly similar, and 1 indicates that there is no correspondence between them.
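For reference, a simplified sketch of the shape-dissimilarity computation; it pairs saccade vectors in temporal order rather than using the full alignment procedure of Jarodzka, Holmqvist, and Nyström (2010), and normalizes by the screen diagonal so the result lies in [0, 1].

```python
import numpy as np

def shape_dissimilarity(scan_a, scan_b, screen_diag):
    """Simplified scanpath shape dissimilarity.

    scan_a, scan_b: (n, 2) arrays of fixation (x, y) positions; saccades are
    the difference vectors between consecutive fixations. 0 means identical
    shape, 1 means no correspondence.
    """
    sacc_a = np.diff(np.asarray(scan_a, dtype=float), axis=0)
    sacc_b = np.diff(np.asarray(scan_b, dtype=float), axis=0)
    m = min(len(sacc_a), len(sacc_b))
    if m == 0:
        return 1.0
    diffs = np.linalg.norm(sacc_a[:m] - sacc_b[:m], axis=1)
    return float(np.clip(diffs / (2.0 * screen_diag), 0.0, 1.0).mean())
```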
Comparing across searcher models, both cIBS and IBS were almost indistinguishable from humans, the Greedy model was still very close, and only the Saliency-based model resulted in significantly different behavior (Fig. 4A-E). We quantified this relation by measuring the correlation and the slope of a linear regression (with null intercept) between the dissimilarity among humans (bhSD) and the dissimilarity between humans and the model (hmSD) (Table 1). The rationale for these measures is that the ground truth for each image (the human scanpaths) has different variability across images, and we cannot expect that, in the case of very diverse scanpaths among humans (higher dissimilarity), the model would be close to all of them. A close look at the correlation between the dissimilarity measures in humans and models showed that only a small fraction of images departed from the humans' scanpaths (Fig. 4B-E). These images (using µ + 3σ for cIBS+DG2) correspond to cases where few people found the target or, interestingly, where there are different possible scanpath behaviors (i.e., some people start looking for a cup on the cupboard and others on the table) (Supplemental Figure S3). Further studies should be performed to explore individual differences between observers.

When comparing priors, we observed that both the models with DeepGaze 2 and Center priors were closer to the humans' values, and the flat and noisy priors had larger scanpath dissimilarities (Fig. 4F). The model with a flat prior had a slightly better correlation, but both the models with DeepGaze 2 and Center priors had good correlations and slopes closer to 1 (Table 1). This suggests that the initial center bias is a fair approximation of the human prior. Nonetheless, although both scanpaths were almost indistinguishable from humans, saliency adds information that makes the model with DeepGaze 2 considerably better at finding the target.

Conclusions
We introduced the cIBS model, an expansion of the IBS model to natural scenes. In summary, we used saliency maps as priors to model the information collected in the first glimpse that guides the first saccades, and we modified the computation of the template response to be able to, first, use a simpler model of visibility and, second, give graded responses to regions similar to the target, incorporating the notion of distractors. To evaluate the model, we created a dataset of 57 subjects searching in 134 images, on which we compared cIBS to IBS and other strong baselines. We observed that saliency models performed well in predicting initial fixations, in particular the third fixation. Humans seemed to start from the initial forced fixation position, move to the center, and then to the most (bottom-up) salient location. After that, the performance of all the saliency models decays to almost chance, as is expected from their conception. They mainly encode bottom-up information of the image (Itti and Koch 2000) and not the aim of the task, and they are not able to change and update as it progresses. This is also consistent with previous results from Torralba et al. (2006), who implemented a saliency model for a visual search task.

As saliency models are good at predicting first bottom-up impressions, they are ideal candidates to be included as priors in the proposed Bayesian framework. The central bias performed well in the first two fixations, and it is included in all the saliency models (Cornia et al. 2016, 2018; Kummerer et al. 2017). The central bias by itself resulted in a good prior in terms of scanpath similarity, but not in terms of the performance of finding the target. However, DeepGaze 2 had the better compromise between both measures. This suggests that bottom-up cues provided by the saliency maps are relevant to the search.
Searcher:          IBS           Greedy        Saliency      cIBS          cIBS          cIBS          cIBS
Prior:             DG2           DG2           DG2           DG2           Center        Flat          Noisy
Weighted Distance  0.78          0.13          2.14          0.31          1.38          0.72          0.15
Mean Agreement     0.64 (0.096)  0.63 (0.082)  0.55 (0.092)  0.64 (0.084)  0.60 (0.086)  0.60 (0.109)  0.61 (0.082)
Jaccard Index      0.54 (0.105)  0.47 (0.091)  0.32 (0.073)  0.51 (0.098)  0.38 (0.086)  0.51 (0.109)  0.46 (0.089)
Linear Regression  0.85          0.76          0.52          0.84          0.88          0.69          0.61

Table 1: Different measures of performance and scanpath dissimilarity between humans and models. Weighted Distance is the distance between human and model performance weighted by the dispersion across humans. Mean Agreement and Jaccard Index measure the coincidence between humans' and the model's correct target detections (see Supplemental Information). Linear Regression corresponds to the slope of a simple y ∼ x model between the humans' scanpath dissimilarity (bhSD in Fig. 4 B-E and G-J) and the human-model scanpath dissimilarity (hmSD in Fig. 4 B-E and G-J). ρ is the Spearman correlation coefficient between the scanpath dissimilarities bhSD and hmSD. Only correct trials were considered and averaged for each image (N = 134).

Regarding the update and decision mechanism, it is clear that the simple saliency-based searcher, i.e., wandering around the most salient regions until bumping into the target, is not a good model of visual search. Humans use not only information about the scene but also information about the target and previous fixations. Then, a rule for comparing peripheral information of the scene with the target should be implemented, along with a mechanism for combining that information. Those mechanisms are implemented in the Greedy, IBS, and cIBS models, with the difference that the Greedy model maximizes the probability of finding the target in the next fixation, whereas IBS and cIBS maximize the probability of finding the target after the next fixation. Overall, the Greedy model had worse performance than both the IBS and cIBS models, suggesting that humans actually implement longer plans when searching in a visual scene. When comparing Bayesian models, we showed that, when searching in a cluttered image, it is important to account for the distractors present in the scene, extending the previous proposal from Najemnik and Geisler (2005).

Previous efforts at including contextual information aimed mainly to predict image regions likely to be fixated. For instance, they statically combined a spatial filter-based saliency map with previous knowledge of target object positions in the scene (Torralba et al. 2006). Some other works aimed to predict the sequence of fixations, but efforts at non-Bayesian modeling mainly used greedy algorithms (Rasouli and Tsotsos 2014; Zhang et al. 2018). Here, we compared an example of a greedy algorithm with others with a more long-sighted objective function, which has the additional benefit that some known behaviors of human visual search arise naturally. For example, Zhang et al. (2018) forced inhibition of return, while our model has it implicitly incorporated. Crucially, Bayesian frameworks are highly interpretable and connect our work to other efforts in modeling top-down influences in perception and decision-making.

It is also important to note that, although the Najemnik and Geisler (2005) model was a very insightful and influential proposal, to our knowledge our work is the first one that uses a Bayesian framework to predict eye movements during visual search in natural images. It is a leap in terms of applications, since prior work on Bayesian models was done in very constrained artificial environments (looking for a tiny Gabor patch embedded in background 1/f noise) (Najemnik and Geisler 2005). Moreover, we addressed the modifications needed when considering the complexities of natural images: specifically, the addition of a saliency map as prior, the modification of the template response's mean, and a shift in visibility. We also simplified assumptions from Najemnik and Geisler (2005) by not having to measure each person's visibility map beforehand. We use the same visibility map across subjects, which also avoids a potential leak of information about the viewing patterns into the model.
Finally, we also share optimized code for the models, which should be useful for others to replicate both Najemnik and Geisler (2005) and our results.

Our model aligns with the work of many researchers who propose probabilistic solutions to model human behavior from first principles. For instance, Bruce and Tsotsos (2006, 2009) proposed a saliency model based on an information maximization principle, which demonstrates great efficacy in predicting fixation patterns across both pictures and movies. Also, Ma et al. (2011) implemented a near-optimal visual search model for a fixed-gaze search task (i.e., exploring the allocation of covert attention), extending previous models to deal with the reliability of visual information across items and displays, and proposing an implementation of how information should be combined across objects and spatial locations, through marginalization. Interestingly, in both attempts to explain overt and covert allocation of attention, they proposed implementations through physiologically plausible neural networks.

More generally, the present work expands the growing notion of the brain as an organ capable of generalizing and performing inferences in noisy and cluttered scenarios through Bayesian inference, building complete and abstract models of its environment. Nowadays, those models cover a broad spectrum of perceptual and cognitive functions, such as decision making and confidence, learning, multisensory perception, and others (Knill and Pouget 2004; Tenenbaum, Griffiths, and Kemp 2006; Meyniel, Sigman, and Mainen 2015).

Code / Data availability

Code, data, and chosen parameters will be made publicly available upon publication for full reproduction of our results. Data will include the image dataset with targets, as well as fixation data for all of the subjects.
Author contributions
M.S. prepared the indoor images/targets dataset and coded the first implementation of the presented models, including numerical optimizations and speed-ups. M.S., G.S., and J.K. designed the task, collected the human data, and defined the model idea. G.B. and S.V. pruned the model's code, extended it, and explored its parameters. M.S., G.B., S.V., and J.K. performed the analysis. The manuscript was written by G.B., M.S., and J.K.
Acknowledgments
We thank P. Lagomarsino and J. Laurino for their collaboration with the data acquisition; C. Diuk (Facebook), A. Salles (OpenZeppelin), and Matias J. Ison (Univ. of Nottingham, UK) for their feedback and insight on the work; and K. Ball (UT Austin) for English editing of this article. The authors were supported by the National Science and Technology Research Council (CONICET) and the University of Buenos Aires (UBA). The research was supported by the ARL (W911NF-19-2-0240), UBA (20020170200259BA), and the National Agency of Promotion of Science and Technology (PICT 2016-1256).
References
Borji, A.; and Itti, L. 2014. Defending Yarbus: Eye movements reveal observers' task. Journal of Vision.
Borji, A.; Tavakoli, H. R.; Sihite, D. N.; and Itti, L. 2013. Analysis of scores, datasets, and models in visual saliency prediction. In Proceedings of the IEEE International Conference on Computer Vision, 921–928.
Bradley, C.; Abrams, J.; and Geisler, W. 2014. Retina-V1 model of detectability across the visual field. Journal of Vision.
Brainard, D. H. 1997. The Psychophysics Toolbox. Spatial Vision 10: 433–436.
Bruce, N.; and Tsotsos, J. 2006. Saliency based on information maximization. In Advances in Neural Information Processing Systems, 155–162.
Bruce, N. D.; and Tsotsos, J. K. 2009. Saliency, attention, and visual search: An information theoretic approach. Journal of Vision.
Bylinskii, Z.; Judd, T.; Oliva, A.; Torralba, A.; and Durand, F. 2018. What do different evaluation metrics tell us about saliency models? IEEE Transactions on Pattern Analysis and Machine Intelligence.
Castelhano, M. S.; and Henderson, J. M. 2007. Initial scene representations facilitate eye movement guidance in visual search. Journal of Experimental Psychology: Human Perception and Performance.
Cerf, M.; Harel, J.; Einhäuser, W.; and Koch, C. 2008. Predicting human gaze using low-level saliency combined with face detection. In Advances in Neural Information Processing Systems, 241–248.
Cornia, M.; Baraldi, L.; Serra, G.; and Cucchiara, R. 2016. A deep multi-level network for saliency prediction. In International Conference on Pattern Recognition (ICPR), 3488–3493. IEEE.
Cornia, M.; Baraldi, L.; Serra, G.; and Cucchiara, R. 2018. Predicting human eye fixations via an LSTM-based saliency attentive model. IEEE Transactions on Image Processing.
Gottlieb, J.; and Oudeyer, P.-Y. 2018. Towards a neuroscience of active sampling and curiosity. Nature Reviews Neuroscience.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.
Hoppe, D.; and Rothkopf, C. A. 2019. Multi-step planning of eye movements in visual search. Scientific Reports.
Itti, L.; and Koch, C. 2000. A saliency-based search mechanism for overt and covert shifts of visual attention. Vision Research.
Itti, L.; and Koch, C. 2001. Computational modelling of visual attention. Nature Reviews Neuroscience.
Itti, L.; Koch, C.; and Niebur, E. 1998. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Jarodzka, H.; Holmqvist, K.; and Nyström, M. 2010. A vector-based, multidimensional scanpath similarity measure. In Proceedings of the 2010 Symposium on Eye-Tracking Research & Applications, 211–218.
Judd, T.; Durand, F.; and Torralba, A. 2012. A Benchmark of Computational Models of Saliency to Predict Human Fixations. MIT Technical Report.
Kleiner, M.; Brainard, D.; and Pelli, D. 2007. What's new in Psychtoolbox-3? Perception 36 ECVP Abstract Supplement.
Knill, D. C.; and Pouget, A. 2004. The Bayesian brain: the role of uncertainty in neural coding and computation. TRENDS in Neurosciences.
Kotowicz, A.; Rutishauser, U.; and Koch, C. 2010. Time course of target recognition in visual search. Frontiers in Human Neuroscience 4: 31.
Kummerer, M.; Wallis, T. S.; and Bethge, M. 2018. Saliency benchmarking made easy: Separating models, maps and metrics. In Proceedings of the European Conference on Computer Vision (ECCV), 770–787.
Kummerer, M.; Wallis, T. S.; Gatys, L. A.; and Bethge, M. 2017. Understanding low- and high-level contributions to fixation prediction. In Proceedings of the IEEE International Conference on Computer Vision, 4789–4798.
Ma, W. J.; Navalpakkam, V.; Beck, J. M.; Van Den Berg, R.; and Pouget, A. 2011. Behavior and neural basis of near-optimal visual search. Nature Neuroscience.
Meyniel, F.; Sigman, M.; and Mainen, Z. F. 2015. Confidence as Bayesian probability: From neural origins to behavior. Neuron.
Najemnik, J.; and Geisler, W. S. 2005. Optimal eye movement strategies in visual search. Nature.
Oliva, A.; and Torralba, A. 2006. Building the gist of a scene: The role of global image features in recognition. Progress in Brain Research.
O'Reilly, J. X.; Jbabdi, S.; and Behrens, T. E. 2012. How can a Bayesian approach inform neuroscience? European Journal of Neuroscience.
Rasouli, A.; and Tsotsos, J. K. 2014. Visual saliency improves autonomous visual search. In Canadian Conference on Computer and Robot Vision, 111–118. IEEE.
Riche, N.; Duvinage, M.; Mancas, M.; Gosselin, B.; and Dutoit, T. 2013. Saliency and human fixations: State-of-the-art and study of comparison metrics. In Proceedings of the IEEE International Conference on Computer Vision, 1153–1160.
Rohe, T.; and Noppeney, U. 2015. Cortical hierarchies perform Bayesian causal inference in multisensory perception. PLoS Biology.
Rolfs, M. 2015. Attention in active vision: A perspective on perceptual continuity across saccades. Perception.
Russell, B. C.; Torralba, A.; Murphy, K. P.; and Freeman, W. T. 2008. LabelMe: a database and web-based tool for image annotation. International Journal of Computer Vision.
Samad, M.; Chung, A. J.; and Shams, L. 2015. Perception of body ownership is driven by Bayesian sensory inference. PLoS ONE.
Simonyan, K.; and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
Tatler, B. W.; Wade, N. J.; Kwan, H.; Findlay, J. M.; and Velichkovsky, B. M. 2010. Yarbus, eye movements, and vision. i-Perception.
Tenenbaum, J. B.; Griffiths, T. L.; and Kemp, C. 2006. Theory-based Bayesian models of inductive learning and reasoning. Trends in Cognitive Sciences.
Torralba, A.; Oliva, A.; Castelhano, M. S.; and Henderson, J. M. 2006. Contextual guidance of eye movements and attention in real-world scenes: The role of global features in object search. Psychological Review.
Torralba, A.; and Sinha, P. 2001. Statistical context priming for object detection. In Proceedings Eighth IEEE International Conference on Computer Vision (ICCV 2001), volume 1, 763–770. IEEE.
Turgeon, M.; Lustig, C.; and Meck, W. H. 2016. Cognitive aging and time perception: roles of Bayesian optimization and degeneracy. Frontiers in Aging Neuroscience 8: 102.
Ullman, S. 2019. Using neuroscience to develop artificial intelligence. Science.
Wiecki, T. V.; Poland, J.; and Frank, M. J. 2015. Model-based cognitive neuroscience approaches to computational psychiatry: Clustering and classification. Clinical Psychological Science.
Yang, S. C.-H.; Wolpert, D. M.; and Lengyel, M. 2016. Theoretical perspectives on active sensing. Current Opinion in Behavioral Sciences 11: 100–108.
Yarbus, A. L. 1967. Eye movements during perception of complex objects. In Eye Movements and Vision, 171–211. Springer.
Zhang, M.; Feng, J.; Ma, K. T.; Lim, J. H.; Zhao, Q.; and Kreiman, G. 2018. Finding any Waldo with zero-shot invariant and efficient visual search. Nature Communications.

Supplemental Information
We set up a visual search experiment in which participants have to search for an object in a crowded indoor scene. First, the target is presented in the center of the screen, subtending × pixels of visual angle (Fig. S1). After 3 seconds, the target is replaced by a fixation dot at a pseudo-random position at least 300 pixels away from the actual target position in the image (Fig. S1). This is done to avoid starting the search close to the target. The initial position was the same for a given image and all participants. Moreover, the search image appears after the participant fixates the dot. Thus, all observers initiate the search in the same place for a given image. The image is presented at a × resolution (subtending . × . degrees of visual angle) (Fig. S1).

The program automatically detects the end of each saccade during the target search. This period finishes when the participant fixates the target or after N saccades, allowing an extra 200 ms for the participant to be able to process the information in that last fixation (Kotowicz, Rutishauser, and Koch, 2010). The maximum number of saccades allowed (N) varied between 2 (13.4% of the trials), 4 (14.9%), 8 (29.9%), and 12 (41.8%) for most of the participants. These values were randomized for each participant, independently of the image.

Supplementary Figure S1: Paradigm schema.

After each trial, the participants are forced to guess the position of the target, even if they had already found it. They are instructed to cover the target position with a Gaussian blur, first by clicking on the center and then by choosing its radius. This is done by showing a screen with only the frame of the image and a mouse pointer (a small black dot) to select the desired center of the blur (Fig. S1). When choosing a position with the mouse, a Gaussian blur centered at that position is shown, and the participants are required to indicate the uncertainty of their decision by increasing or decreasing the size of the blur using the keyboard. Position and uncertainty reports were not analyzed in the present study.

A training block of 5 trials was performed at the beginning of each session with the experimenter present in the room. After the training block, the experiment started and the experimenter moved
Also, a pilot experiment with5 participants was performed to select images that usually require several fixations to find the target.The original images were all larger or equal than × pixels, and all were cropped and/orscaled to × pixels. For each image, a single target was manually selected among the objectsof × pixels or less that were not repeated in the image –because we weren’t evaluating theaccuracy of memory retrieval–. For all targets, we considered a surrounding region of × pixels. Participants were seated in a dark room, cm away from a 19-inch Samsung SyncMaster 997MBmonitor (refresh rate = 60Hz), with a resolution of × . A chin and forehead rest was usedto stabilize the head. Eye movements were acquired with an Eye Link 1000 (SR Research, Ontario,Canada) monocular at 1000 Hz. The saccade detection was performed online with the native EyeLink algorithm with the defaultparameters for cognitive tasks. Fixations were collapsed into a grid with cells of × pixels,resulting in a grid size of × cells. We explored the size of the grid in terms of model performance.Consecutive fixations within a cell were collapsed into one fixation to be fair with the model behavior.Also, fixations outside the image region were displaced to the closest cell. As we considered fixations,blinks periods were excluded.The trial was considered correct (target found) if the participant fixated into the target region ( × pixels). Only correct trials were analyzed in terms of eye movements. Different definitions of ROC-AUC found in https://saliency.tuebingen.ai/ showed the sametrend (Borji et al., 2013; Riche et al., 2013; Bylinskii et al., 2018; Kummerer, Wallis, and Bethge,2018). All those ROC curves are built base on the idea of considering the saliency map as a binaryclassifier by applying a threshold. Here, we report three of them: AUC-Judd, AUC-Borji, andshuffled-AUC (or sAUC). These metrics differ mainly on the definition of the true positive rate andthe false positive rate for the corresponding ROC curves. AUC-Judd considers human fixationsas ground truth and all non-fixated pixels as negative cases. This way, the true positive rate is theproportion of pixels with saliency values above a certain threshold that were fixated. The falsepositive (fp) rate is the proportion of pixels with saliency values above a certain threshold that werenot fixated. AUC-Borji keeps the same definition of the true positive rate, but uses a uniform randomsample of image pixels as negatives and defines the saliency map values above a certain threshold atthese pixels as false positives. Thus, the false positive rate is the proportion of those cases that werenot fixated. Finally, sAUC is similar to AUC-Borji, instead of sampling pixels from the same imageto define the fpr, it samples over fixation’s locations on other images.2upplementary Table S1: Saliency maps: Different AUC metrics estimated for the saliency maps onthe third fixation (Fig. 2)
Supplementary Table S1: Saliency maps: different AUC metrics estimated for the saliency maps on the third fixation (Fig. 2).

Saliency Maps   AUC-Judd   AUC-Borji   sAUC
MLNet           0.7464     0.6797      0.6008
SAM-VGG         0.7321     0.6305      0.5666
SAM-ResNet      0.7339     0.6501      0.5820
DeepGaze 2      0.7637     0.6537      0.5883
ICF             0.7509     0.7078      0.5808
Humans          0.8076     0.7792      0.7727
Center          0.6866     0.6739      0.5208

Supplementary Figure S2: Saliency maps. A) ROC curves and B) AUC values for all fixations. Color mapping for models is consistent with Fig. 2.
We used three measures to compare the performance (i.e., the probability of detecting the target) of each model with the human participants. Each of them focuses on a slightly different aspect.
In order to directly compare the performance curve of each model with the human participants, for each possible number of saccades allowed N ∈ {2, 4, 8, 12}, we calculate the difference between the mean proportion of targets found by participants and by the model m (P_subj(N) and P_m(N), respectively). For each number of saccades allowed, the difference is weighted by the standard deviation across participants σ. Then, the weighted distance WD(m) is the mean of those values:

WD(m) = \frac{1}{4} \sum_{N \in \{2, 4, 8, 12\}} \frac{\left| P_{subj}(N) - P_m(N) \right|}{\sigma}    (1)

The Jaccard Index is a metric that allows us to measure the proportion of targets found by humans that are explained by the models. We represent each participant and each model as a boolean S-dimensional vector with a one in the i-th position if they found the target in the i-th image, and zero otherwise. S is the number of images: in our experiment, S = 134.

Every participant has the same proportion of images with each maximum number of saccades allowed N (13.4% of the trials with N = 2, 14.9% with N = 4, 29.9% with N = 8, and 41.8% with N = 12). Nonetheless, the subset that gets each N is chosen uniformly at random for each subject. This way, for each image we have subjects that were interrupted after 2, 4, 8, and 12 saccades. As each participant has a different sample of maximum saccades allowed across images, we apply these same constraints to the model. That is, when we want to compare with participant p, we apply their constraints to our model m. Then, we decide for each image whether the model can find the target in fewer saccades than allowed. Given a model m, we define sacc(m, i) as the number of saccades that the model needs to find the target in the i-th image. Then, for each participant we have max_sacc(i, p) as the number of saccades allowed for image i and participant p, with p = 1 ... 57 in our data. From these values, we construct the vector of targets found by the model using the saccade threshold distribution of each participant p, called TFM(m)_p (Targets Found by Model). TFM(m)_p ∈ {0, 1}^S is defined as:

TFM(m)_p(i) = \begin{cases} 1 & \text{if } sacc(m, i) \le max\_sacc(i, p) \\ 0 & \text{otherwise} \end{cases}    (2)

Each participant p is also represented as an S-dimensional vector TFP_p ∈ {0, 1}^S (Targets Found by Participant):

TFP_p(i) = \begin{cases} 1 & \text{if participant } p \text{ found the target in the } i\text{-th image} \\ 0 & \text{otherwise} \end{cases}    (3)

Then, we compute the Jaccard Index (Real and Vargas, 1996) between those vectors (eq. 4):

jaccard(p, m) = \frac{\left| TFP_p \cap TFM(m)_p \right|}{\left| TFP_p \cup TFM(m)_p \right|}    (4)

Another measure we considered to compare a model's performance against humans is what we called the Mean Agreement Score (MAS), inspired by the Mean Absolute Error. This measure calculates the mean proportion of trials in which both the participant and the model had the same performance, with the purpose of measuring the agreement between our model and the participants. We compute the absolute difference between the boolean vectors defined for the Jaccard Index (eqs. 2-3) and calculate its mean. Finally, the Mean Agreement Score between model m and participant p is 1 minus that value:

MAS(p, m) = 1 - \frac{1}{S} \sum_{i=1}^{S} \left| TFP_p(i) - TFM(m)_p(i) \right|    (5)
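The three measures translate directly into code; below is a sketch assuming NumPy and the boolean vectors defined above (the 1/4 factor in eq. 1 and the form of eq. 5 follow the reconstructions given here).

```python
import numpy as np

def weighted_distance(p_subj, p_model, sigma):
    """Eq. (1): mean human-model difference in proportion of targets found,
    weighted by the across-participant SD; all arguments are dicts keyed by
    the number of saccades allowed N."""
    return float(np.mean([abs(p_subj[n] - p_model[n]) / sigma[n]
                          for n in (2, 4, 8, 12)]))

def jaccard_index(tfp, tfm):
    """Eq. (4): Jaccard index between the boolean vectors TFP_p and TFM(m)_p."""
    tfp, tfm = np.asarray(tfp, bool), np.asarray(tfm, bool)
    return (tfp & tfm).sum() / (tfp | tfm).sum()

def mean_agreement(tfp, tfm):
    """Eq. (5): 1 minus the mean absolute difference between the boolean
    vectors, i.e., the proportion of images with the same outcome."""
    tfp, tfm = np.asarray(tfp, float), np.asarray(tfm, float)
    return 1.0 - float(np.abs(tfp - tfm).mean())
```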
Some images showed an overall disagreement but, looking a little deeper, we can see that participants performed two different but consistent patterns. As the present implementation of the model is deterministic, it chose only one of those patterns (Figure S3). In Figure S3, we show some of the human scanpaths and the model (cIBS+DG2) scanpath for one image to illustrate that specific behavior. In this case, the cup is the search target and there are two surfaces where, a priori, it is equally likely to be found. We selected six human scanpaths with different hmSD values to show how the initial decision determines the behavior and the overall hmSD for that scanpath (Fig. S3, left panel), but almost all participants end up exploring both regions (Fig. S3, right panel). Note that dark green traces are scanpaths very similar to the model's, while yellow traces are scanpaths that differ from the model's. Further development of the model should focus on mimicking these individual differences in visual search.
The model was fully developed in MATLAB. Saliency map calculations were performed using the public code provided by the respective authors (Cornia et al., 2016, 2018; Kummerer et al., 2017).

Supplementary Figure S3: Scanpath prediction comparison. The figure is meshed by a fixed grid of δ = 32 px. Each curve represents a scanpath: in red, the cIBS+DG2 model's scanpath, and six scanpaths of participants colored according to their dissimilarity to the model's scanpath. The left panel shows the first four fixations of each scanpath, and the right panel shows the whole scanpaths. The search target is represented by the blue square and the approximate first fixation by the red square. Above the image, the hmSD and bhSD for this trial are reported. Image taken from Wikimedia Commons.
All code, data, and parameters needed to reproduce this paper's results and visualizations will be available on GitHub.

References
Borji, A.; Tavakoli, H. R.; Sihite, D. N.; and Itti, L. 2013. Analysis of scores, datasets, and models in visual saliency prediction. In Proceedings of the IEEE International Conference on Computer Vision, 921–928.
Brainard, D. H. 1997. The Psychophysics Toolbox. Spatial Vision 10: 433–436.
Bylinskii, Z.; Judd, T.; Oliva, A.; Torralba, A.; and Durand, F. 2018. What do different evaluation metrics tell us about saliency models? IEEE Transactions on Pattern Analysis and Machine Intelligence.
Cornia, M.; Baraldi, L.; Serra, G.; and Cucchiara, R. 2016. A deep multi-level network for saliency prediction. In International Conference on Pattern Recognition (ICPR), 3488–3493. IEEE.
Cornia, M.; Baraldi, L.; Serra, G.; and Cucchiara, R. 2018. Predicting human eye fixations via an LSTM-based saliency attentive model. IEEE Transactions on Image Processing.
Kleiner, M.; Brainard, D.; and Pelli, D. 2007. What's new in Psychtoolbox-3? Perception 36 ECVP Abstract Supplement.
Kotowicz, A.; Rutishauser, U.; and Koch, C. 2010. Time course of target recognition in visual search. Frontiers in Human Neuroscience 4: 31.
Kummerer, M.; Wallis, T. S.; and Bethge, M. 2018. Saliency benchmarking made easy: Separating models, maps and metrics. In Proceedings of the European Conference on Computer Vision (ECCV), 770–787.
Kummerer, M.; Wallis, T. S.; Gatys, L. A.; and Bethge, M. 2017. Understanding low- and high-level contributions to fixation prediction. In Proceedings of the IEEE International Conference on Computer Vision, 4789–4798.
Real, R.; and Vargas, J. M. 1996. The probabilistic basis of Jaccard's index of similarity. Systematic Biology.
Riche, N.; Duvinage, M.; Mancas, M.; Gosselin, B.; and Dutoit, T. 2013. Saliency and human fixations: State-of-the-art and study of comparison metrics. In Proceedings of the IEEE International Conference on Computer Vision, 1153–1160.
Russell, B. C.; Torralba, A.; Murphy, K. P.; and Freeman, W. T. 2008. LabelMe: a database and web-based tool for image annotation. International Journal of Computer Vision.