Dicky N. Sihite
University of Southern California
Publications
Featured research published by Dicky N. Sihite.
european conference on computer vision | 2012
Ali Borji; Dicky N. Sihite; Laurent Itti
Several salient object detection approaches have been published, assessed using different evaluation scores and datasets, which has resulted in discrepancies in model comparison. This calls for a methodological framework to compare existing models and evaluate their pros and cons. We analyze benchmark datasets and scoring techniques and, for the first time, provide a quantitative comparison of 35 state-of-the-art saliency detection models. We find that some models perform consistently better than others. Saliency models that intend to predict eye fixations perform lower on segmentation datasets compared to salient object detection algorithms. Further, we propose combined models and show that integrating the few best models outperforms all models over other datasets. By analyzing the consistency among the best models and among humans for each scene, we identify the scenes where models or humans fail to detect the most salient object. We highlight the current issues and propose future research directions.
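As a rough illustration of the kind of model combination mentioned above (not the authors' actual scheme), one could average min-max-normalized saliency maps from several models; the map sizes and random "model outputs" below are placeholders:

    import numpy as np

    def normalize(sal_map):
        # Rescale a saliency map to [0, 1] so maps from different models are comparable.
        lo, hi = sal_map.min(), sal_map.max()
        return (sal_map - lo) / (hi - lo + 1e-12)

    def combine_models(maps):
        # Combine saliency maps from several models by averaging their normalized outputs.
        return np.mean([normalize(m) for m in maps], axis=0)

    # Toy usage with three random "model outputs" of the same spatial size.
    rng = np.random.default_rng(0)
    maps = [rng.random((60, 80)) for _ in range(3)]
    combined = combine_models(maps)
    print(combined.shape, combined.min(), combined.max())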
IEEE Transactions on Image Processing | 2013
Ali Borji; Dicky N. Sihite; Laurent Itti
Visual attention is a process that enables biological and machine vision systems to select the most relevant regions from a scene. Relevance is determined by two components: 1) top-down factors driven by task and 2) bottom-up factors that highlight image regions that are different from their surroundings. The latter are often referred to as “visual saliency.” Modeling bottom-up visual saliency has been the subject of numerous research efforts during the past 20 years, with many successful applications in computer vision and robotics. Available models have been tested with different datasets (e.g., synthetic psychological search arrays, natural images, or videos) using different evaluation scores (e.g., search slopes, comparison to human eye tracking) and parameter settings. This has made direct comparison of models difficult. Here, we perform an exhaustive comparison of 35 state-of-the-art saliency models over 54 challenging synthetic patterns, three natural image datasets, and two video datasets, using three evaluation scores. We find that although model rankings vary, some models consistently perform better. Analysis of datasets reveals that existing datasets are highly center-biased, which influences some of the evaluation scores. Computational complexity analysis shows that some models are very fast, yet yield competitive eye movement prediction accuracy. Different models often have common easy/difficult stimuli. Furthermore, several concerns in visual saliency modeling, eye movement datasets, and evaluation scores are discussed and insights for future work are provided. Our study allows one to assess the state-of-the-art, helps to organize this rapidly growing field, and sets a unified comparison framework for gauging future efforts, similar to the PASCAL VOC challenge in the object recognition and detection domains.
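For readers unfamiliar with fixation-based evaluation, here is a minimal sketch of an AUC-style score (one of several scores used in such benchmarks, not the paper's exact implementation): saliency values at fixated pixels are treated as positives and values at uniformly sampled pixels as negatives, and the score is the probability that a positive outranks a negative.

    import numpy as np

    def auc_score(sal_map, fixations, n_neg=1000, rng=None):
        # AUC-style fixation prediction score: saliency at fixated pixels (positives)
        # versus saliency at uniformly sampled pixels (negatives).
        if rng is None:
            rng = np.random.default_rng(0)
        h, w = sal_map.shape
        pos = np.array([sal_map[y, x] for y, x in fixations])
        neg = sal_map[rng.integers(0, h, n_neg), rng.integers(0, w, n_neg)]
        # Mann-Whitney form of AUC: fraction of positive/negative pairs won, ties count half.
        wins = (pos[:, None] > neg[None, :]).mean()
        ties = (pos[:, None] == neg[None, :]).mean()
        return wins + 0.5 * ties

    sal = np.random.default_rng(1).random((60, 80))
    fixations = [(10, 20), (30, 40), (50, 70)]   # (row, column) pixel coordinates
    print(auc_score(sal, fixations))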
international conference on computer vision | 2013
Ali Borji; Hamed Rezazadegan Tavakoli; Dicky N. Sihite; Laurent Itti
Significant recent progress has been made in developing high-quality saliency models. However, less effort has been undertaken on fair assessment of these models, over large standardized datasets and correctly addressing confounding factors. In this study, we pursue a critical and quantitative look at challenges (e.g., center-bias, map smoothing) in saliency modeling and the way they affect model accuracy. We quantitatively compare 32 state-of-the-art models (using the shuffled AUC score to discount center-bias) on 4 benchmark eye movement datasets, for prediction of human fixation locations and scan path sequence. We also account for the role of map smoothing. We find that, although model rankings vary, some (e.g., AWS, LG, AIM, and HouNIPS) consistently outperform other models over all datasets. Some models work well for prediction of both fixation locations and scan path sequence (e.g., Judd, GBVS). Our results show low prediction accuracy for models over emotional stimuli from the NUSEF dataset. Our last benchmark, for the first time, gauges the ability of models to decode the stimulus category from statistics of fixations, saccades, and model saliency values at fixated locations. In this test, ITTI and AIM models win over other models. Our benchmark provides a comprehensive high-level picture of the strengths and weaknesses of many popular models, and suggests future research directions in saliency modeling.
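A minimal sketch of the shuffled AUC idea referenced above, assuming fixations are given as (row, column) pixel coordinates: negatives are drawn from fixation locations on other images rather than uniformly, so a map that merely reproduces the central bias no longer scores well. This is an illustrative reimplementation, not the benchmark's official code.

    import numpy as np

    def shuffled_auc(sal_map, fixations, other_image_fixations):
        # Shuffled AUC: negatives are fixation locations taken from *other* images,
        # which discounts the advantage of a pure center-bias map.
        pos = np.array([sal_map[y, x] for y, x in fixations])
        neg = np.array([sal_map[y, x] for y, x in other_image_fixations])
        wins = (pos[:, None] > neg[None, :]).mean()
        ties = (pos[:, None] == neg[None, :]).mean()
        return wins + 0.5 * ties

    sal = np.random.default_rng(2).random((60, 80))
    fix_this_image = [(12, 30), (25, 55)]
    fix_other_images = [(30, 40), (29, 41), (31, 39), (28, 42)]  # typically clustered near the center
    print(shuffled_auc(sal, fix_this_image, fix_other_images))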
computer vision and pattern recognition | 2012
Ali Borji; Dicky N. Sihite; Laurent Itti
Despite a considerable amount of previous work on bottom-up saliency modeling for predicting human fixations over static and dynamic stimuli, few studies have thus far attempted to model top-down and task-driven influences of visual attention. Here, taking advantage of the sequential nature of real-world tasks, we propose a unified Bayesian approach for modeling task-driven visual attention. Several sources of information, including the global context of a scene, previously attended locations, and previous motor actions, are integrated over time to predict the next attended location. Recording eye movements while subjects engage in 5 contemporary 2D and 3D video games, as modest counterparts of everyday tasks, we show that our approach is able to predict human attention and gaze better than the state-of-the-art, by a large margin (about 15% increase in prediction accuracy). The advantage of our approach is that it is automatic and applicable to arbitrary visual tasks.
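The exact model is described in the paper; purely as an illustrative toy, one might fuse per-cue probability maps over image locations (a gist-based prior, a previous-fixation prior, and an action-based prior, all hypothetical placeholders here) under a naive independence assumption and renormalize:

    import numpy as np

    def combine_cues(cue_maps, eps=1e-12):
        # Fuse per-cue probability maps under a naive independence assumption:
        # P(next location) is proportional to the product of the cue maps.
        fused = np.ones_like(cue_maps[0])
        for m in cue_maps:
            fused *= m + eps
        return fused / fused.sum()

    rng = np.random.default_rng(3)
    gist_prior = rng.random((30, 40))
    previous_fixation_prior = rng.random((30, 40))
    action_prior = rng.random((30, 40))
    posterior = combine_cues([gist_prior, previous_fixation_prior, action_prior])
    next_location = np.unravel_index(posterior.argmax(), posterior.shape)
    print(next_location, posterior.sum())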
systems man and cybernetics | 2014
Ali Borji; Dicky N. Sihite; Laurent Itti
Several visual attention models have been proposed for describing eye movements over simple stimuli and tasks such as free viewing or visual search. Yet, to date, there exists no computational framework that can reliably mimic human gaze behavior in more complex environments and tasks such as urban driving. In addition, benchmark datasets, scoring techniques, and top-down model architectures are not yet well understood. In this paper, we describe new task-dependent approaches for modeling top-down overt visual attention based on graphical models for probabilistic inference and reasoning. We describe a dynamic Bayesian network that infers probability distributions over attended objects and spatial locations directly from observed data. Probabilistic inference in our model is performed over object-related functions that are fed from manual annotations of objects in video scenes or by state-of-the-art object detection/recognition algorithms. Evaluating over approximately 3 hours (approximately 315,000 eye fixations and 12,000 saccades) of observers playing three video games (time-scheduling, driving, and flight combat), we show that our approach is significantly more predictive of eye fixations compared to: 1) simpler classifier-based models also developed here that map a signature of a scene (multimodal information from gist, bottom-up saliency, physical actions, and events) to eye positions; 2) 14 state-of-the-art bottom-up saliency models; and 3) brute-force algorithms such as mean eye position. Our results show that the proposed model is more effective in employing and reasoning over spatio-temporal visual data compared with the state-of-the-art.
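As a hedged sketch of the filtering idea behind such a dynamic Bayesian network (greatly simplified relative to the paper's model), one forward step over a handful of candidate attended objects could look like the following; the transition and likelihood values are invented for illustration:

    import numpy as np

    def forward_step(belief, transition, likelihood):
        # One step of filtering over the attended object: propagate the previous
        # belief through the transition model, weight by the current observation
        # likelihood, and renormalize.
        predicted = transition.T @ belief       # P(object at t | past)
        posterior = predicted * likelihood      # weight by evidence from the frame
        return posterior / posterior.sum()

    # Toy example with three candidate objects in the scene.
    belief = np.array([0.6, 0.3, 0.1])                   # belief at time t-1
    transition = np.array([[0.8, 0.1, 0.1],
                           [0.2, 0.7, 0.1],
                           [0.1, 0.2, 0.7]])             # row: from object, column: to object
    likelihood = np.array([0.2, 0.5, 0.3])               # per-object evidence at time t
    print(forward_step(belief, transition, likelihood))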
Vision Research | 2013
Ali Borji; Dicky N. Sihite; Laurent Itti
Eye tracking has become the de facto standard measure of visual attention in tasks that range from free viewing to complex daily activities. In particular, saliency models are often evaluated by their ability to predict human gaze patterns. However, fixations are not only influenced by bottom-up saliency (computed by the models), but also by many top-down factors. Thus, comparing bottom-up saliency maps to eye fixations is challenging and has required that one tries to minimize top-down influences, for example by focusing on early fixations on a stimulus. Here we propose two complementary procedures to evaluate visual saliency. We ask whether humans have explicit and conscious access to the saliency computations believed to contribute to guiding attention and eye movements. In the first experiment, 70 observers were asked to choose which object stands out the most based on its low-level features in 100 images each containing only two objects. Using several state-of-the-art bottom-up visual saliency models that measure local and global spatial image outliers, we show that maximum saliency inside the selected object is significantly higher than inside the non-selected object and the background. Thus, spatial outliers are a predictor of human judgments. Performance of this predictor is boosted by including object size as an additional feature. In the second experiment, observers were asked to draw a polygon circumscribing the most salient object in cluttered scenes. For each of 120 images, we show that a map built from annotations of 70 observers explains eye fixations of another 20 observers freely viewing the images, significantly above chance (dataset by Bruce and Tsotsos (2009); shuffled AUC score 0.62±0.07, chance 0.50, t-test p<0.05). We conclude that fixations agree with saliency judgments, and classic bottom-up saliency models explain both. We further find that computational models specifically designed for fixation prediction slightly outperform models designed for salient object detection over both types of data (i.e., fixations and objects).
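A minimal sketch of the first experiment's comparison, under the assumption that each object is given as a boolean pixel mask (the masks and saliency map below are synthetic placeholders): compute the maximum saliency inside each object's mask and compare.

    import numpy as np

    def max_saliency_in_mask(sal_map, mask):
        # Maximum saliency value inside a boolean object mask.
        return sal_map[mask].max()

    rng = np.random.default_rng(4)
    sal = rng.random((60, 80))
    mask_selected = np.zeros((60, 80), bool)
    mask_selected[10:25, 10:30] = True       # object chosen by observers (placeholder region)
    mask_other = np.zeros((60, 80), bool)
    mask_other[35:50, 50:70] = True          # the non-selected object (placeholder region)
    print(max_saliency_in_mask(sal, mask_selected), max_saliency_in_mask(sal, mask_other))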
computer vision and pattern recognition | 2012
Ali Borji; Simone Frintrop; Dicky N. Sihite; Laurent Itti
One challenge when tracking objects is to adapt the object representation depending on the scene context to account for changes in illumination, coloring, scaling, etc. Here, we present a solution that is based on our earlier approach for object tracking using particle filters and component-based descriptors. We extend the approach to deal with changing backgrounds by using a quick training phase with user interaction at the beginning of an image sequence. During this phase, some background clusters are learned along with object representations for those clusters. Next, for the rest of the sequence, the best-fitting background cluster is determined for each frame and the corresponding object representation is used for tracking. Experiments show that a particle filter adapting to background changes can efficiently track objects and persons in natural scenes and achieves better tracking performance than the basic approach. Additionally, using an object tracker to follow the main character in video games, we were able to explain a larger share of eye fixations than other saliency models in terms of NSS score, showing that tracking is an important top-down attention component.
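Since the NSS score is used here (and in several of the papers listed on this page), a small illustrative implementation may help; fixations are assumed to be (row, column) pixel coordinates, and this is a generic sketch rather than the authors' evaluation code.

    import numpy as np

    def nss(sal_map, fixations):
        # Normalized Scanpath Saliency: z-score the map, then average the values
        # at the fixated pixels; values well above 0 mean fixations land on salient regions.
        z = (sal_map - sal_map.mean()) / (sal_map.std() + 1e-12)
        return np.mean([z[y, x] for y, x in fixations])

    sal = np.random.default_rng(5).random((60, 80))
    fixations = [(15, 20), (40, 60)]
    print(nss(sal, fixations))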
british machine vision conference | 2011
Ali Borji; Dicky N. Sihite; Laurent Itti
Modeling how visual saliency guides the deployment of attention over visual scenes has attracted much interest recently, among both computer vision and experimental/computational researchers, since visual attention is a key function of both machine and biological vision systems. Research efforts in computer vision have mostly been focused on modeling bottom-up saliency. Strong influences on attention and eye movements, however, come from instantaneous task demands. Here, we propose models of top-down visual guidance considering task influences. The new models estimate the state of a human subject performing a task (here, playing video games), and map that state to an eye position. Factors influencing state come from scene gist, physical actions, events, and bottom-up saliency. Proposed models fall into two categories. In the first category, we use classical discriminative classifiers, including Regression, kNN and SVM. In the second category, we use Bayesian Networks to combine all the multi-modal factors in a unified framework. Our approaches significantly outperform 15 competing bottom-up and top-down attention models in predicting future eye fixations on 18,000 and 75,000 video frames and eye movement samples from a driving and a flight combat video game, respectively. We further test and validate our approaches on 1.4M video frames and 11M fixation samples and in all cases obtain higher prediction scores than reference models.
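As a hedged sketch of the first category of models (a kNN-style mapping from a frame's state signature to an eye position), the feature dimensionality, data, and variable names below are placeholders rather than the paper's actual setup:

    import numpy as np

    def knn_predict_gaze(train_features, train_gaze, query, k=5):
        # Predict a gaze position for a query frame as the average gaze of the
        # k training frames whose state feature vectors are closest in Euclidean distance.
        dists = np.linalg.norm(train_features - query, axis=1)
        nearest = np.argsort(dists)[:k]
        return train_gaze[nearest].mean(axis=0)

    rng = np.random.default_rng(6)
    train_features = rng.random((500, 32))   # e.g. gist + actions + saliency summary per frame
    train_gaze = rng.random((500, 2))        # normalized (x, y) eye positions
    query = rng.random(32)
    print(knn_predict_gaze(train_features, train_gaze, query))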
international conference on robotics and automation | 2012
Ali Borji; Dicky N. Sihite; Laurent Itti
A large number of studies have been reported on top-down influences of visual attention. However, less progress has been made in understanding and modeling its mechanisms in real-world tasks. In this paper, we propose an approach for learning spatial attention, taking into account the influences of physical actions on top-down attention. For this purpose, we focus on interactive visual environments (video games), which are modest real-world simulations, where a player has to attend to certain aspects of visual stimuli and perform actions to achieve a goal. The basic idea is to learn a mapping from the current mental state of the game player, represented by past actions and observations, to their gaze fixation. A data-driven approach is followed in which we train a model on the data of some players and test it on a new subject. In particular, the two contributions this paper makes are: 1) employing multi-modal information including mean eye position, gist of a scene, physical actions, bottom-up saliency, and tagged events for state representation and 2) analysis of different methods of combining bottom-up and top-down influences. Compared with other top-down task-driven and bottom-up spatio-temporal models, our approach shows higher NSS scores in predicting eye positions.
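One simple way to illustrate the analysis of combining bottom-up and top-down influences mentioned above (a generic convex blend, not necessarily the scheme used in the paper) is a weighted sum of the two normalized maps:

    import numpy as np

    def fuse_maps(bottom_up, top_down, alpha=0.5):
        # Blend a bottom-up saliency map with a learned top-down gaze-probability map.
        # alpha controls the relative weight of the top-down component.
        bu = bottom_up / (bottom_up.sum() + 1e-12)
        td = top_down / (top_down.sum() + 1e-12)
        return (1 - alpha) * bu + alpha * td

    rng = np.random.default_rng(7)
    bottom_up_map = rng.random((30, 40))
    top_down_map = rng.random((30, 40))
    fused = fuse_maps(bottom_up_map, top_down_map, alpha=0.7)
    print(np.unravel_index(fused.argmax(), fused.shape))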
Journal of Vision | 2013
Ali Borji; Dicky N. Sihite; Laurent Itti