Behjat Siddiquie
SRI International
Publications
Featured research published by Behjat Siddiquie.
Computer Vision and Pattern Recognition | 2011
Behjat Siddiquie; Rogério Schmidt Feris; Larry S. Davis
We propose a novel approach for ranking and retrieval of images based on multi-attribute queries. Existing image retrieval methods train separate classifiers for each word and heuristically combine their outputs when retrieving multi-word queries; moreover, these approaches ignore the interdependencies among the query terms. In contrast, we propose a principled approach to multi-attribute retrieval which explicitly models the correlations that are present between the attributes. Given a multi-attribute query, we also utilize other attributes in the vocabulary that are not present in the query for ranking and retrieval. Furthermore, we integrate ranking and retrieval within the same formulation by posing them as structured prediction problems. Extensive experimental evaluation on the Labeled Faces in the Wild (LFW), FaceTracer, and PASCAL VOC datasets shows that our approach significantly outperforms several state-of-the-art ranking and retrieval methods.
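As a rough illustration of the idea of scoring a multi-attribute query with attribute correlations, the sketch below combines per-attribute classifier outputs with a pairwise correlation term so that non-query attributes also influence the ranking. This is a minimal stand-in, not the paper's structured prediction formulation; all array shapes and names are assumptions.

```python
# Minimal sketch (not the paper's exact formulation): rank images for a
# multi-attribute query using per-attribute classifier outputs plus pairwise
# correlation terms, so correlated non-query attributes also contribute.
import numpy as np

def score_images(attr_scores, pairwise, query_idx):
    """attr_scores: (n_images, n_attrs) classifier outputs for every attribute.
    pairwise: (n_attrs, n_attrs) attribute co-occurrence weights (assumed learned).
    query_idx: indices of the attributes named in the query."""
    # Unary term: how strongly each image expresses the queried attributes.
    unary = attr_scores[:, query_idx].sum(axis=1)
    # Pairwise term: queried attributes interacting with *all* attributes,
    # which lets correlated non-query attributes influence the score.
    interaction = attr_scores @ pairwise[:, query_idx].sum(axis=1)
    return unary + interaction

rng = np.random.default_rng(0)
scores = score_images(rng.normal(size=(100, 20)), rng.normal(size=(20, 20)), [3, 7])
ranking = np.argsort(-scores)  # retrieve images in decreasing score order
```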
Computer Vision and Pattern Recognition | 2010
Behjat Siddiquie; Abhinav Gupta
We present an active learning framework to simultaneously learn appearance and contextual models for scene understanding tasks (multi-class classification). Existing multi-class active learning approaches have focused on utilizing classification uncertainty of regions to select the most ambiguous region for labeling. These approaches, however, ignore the contextual interactions between different regions of the image and the fact that knowing the label for one region provides information about the labels of other regions. For example, the knowledge of a region being sea is informative about regions satisfying the “on” relationship with respect to it, since they are highly likely to be boats. We explicitly model the contextual interactions between regions and select the question which leads to the maximum reduction in the combined entropy of all the regions in the image (image entropy). We also introduce a new methodology of posing labeling questions, mimicking the way humans actively learn about their environment. In these questions, we utilize the regions linked to a concept with high confidence as anchors, to pose questions about the uncertain regions. For example, if we can recognize water in an image then we can use the region associated with water as an anchor to pose questions such as “what is above water?”. Our active learning framework also introduces questions which help in actively learning contextual concepts. For example, our approach asks the annotator: “What is the relationship between boat and water?” and utilizes the answer to reduce the image entropies throughout the training dataset and obtain more relevant training examples for appearance models.
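The question-selection criterion can be made concrete with a toy sketch: sum the entropies of all region label distributions ("image entropy") and pick the question whose expected answer reduces that total the most. The contextual update below is a deliberately simple stand-in for the paper's learned region-interaction model; all names and shapes are illustrative assumptions.

```python
# Toy sketch of entropy-based question selection; the contextual propagation
# here (a label-compatibility reweighting) is a placeholder, not the paper's model.
import numpy as np

def entropy(p, eps=1e-12):
    return -(p * np.log(p + eps)).sum(axis=-1)

def expected_entropy(region_probs, r, compat):
    """Expected total entropy of all regions after asking for region r's label.
    compat[l] reweights the other regions' label distributions when the
    answer is label l (a stand-in for the learned contextual model)."""
    exp_h = 0.0
    for l, p_ans in enumerate(region_probs[r]):
        updated = region_probs * compat[l]               # contextual propagation
        updated /= updated.sum(axis=1, keepdims=True)
        updated[r] = np.eye(region_probs.shape[1])[l]    # region r is now certain
        exp_h += p_ans * entropy(updated).sum()
    return exp_h

rng = np.random.default_rng(1)
probs = rng.dirichlet(np.ones(5), size=8)    # 8 regions, 5 labels
compat = rng.dirichlet(np.ones(5), size=5)   # label-label compatibilities
best_region = min(range(8), key=lambda r: expected_entropy(probs, r, compat))
```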
IEEE Transactions on Multimedia | 2012
Rogério Schmidt Feris; Behjat Siddiquie; James Petterson; Yun Zhai; Ankur Datta; Lisa M. Brown; Sharath Pankanti
We present a novel approach for visual detection and attribute-based search of vehicles in crowded surveillance scenes. Large-scale processing is addressed along two dimensions: 1) large-scale indexing, where hundreds of billions of events need to be archived per month to enable effective search, and 2) learning vehicle detectors with large-scale feature selection, using a feature pool containing millions of feature descriptors. Our method for vehicle detection also explicitly models occlusions and multiple vehicle types (e.g., buses, trucks, SUVs, cars), while requiring very little manual labeling. It runs quite efficiently at an average of 66 Hz on a conventional laptop computer. Once a vehicle is detected and tracked over the video, fine-grained attributes are extracted and ingested into a database to allow future search queries such as “Show me all blue trucks longer than 7 ft. traveling at high speed northbound last Saturday, from 2 pm to 5 pm”. We perform a comprehensive quantitative analysis to validate our approach, showing its usefulness in realistic urban surveillance settings.
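To make the ingest-then-search idea tangible, the hedged sketch below stores detected vehicle events with their attributes in a database and expresses the quoted example query as a SQL filter. The schema and column names are illustrative assumptions, not taken from the paper.

```python
# Hypothetical schema: detected vehicles and their fine-grained attributes are
# ingested into a table, and "blue trucks longer than 7 ft, northbound, at high
# speed, 2-5 pm" becomes a simple SQL filter.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE vehicle_events (
    ts TEXT, color TEXT, vtype TEXT, length_ft REAL,
    speed_mph REAL, direction TEXT)""")
db.execute("INSERT INTO vehicle_events VALUES "
           "('2012-06-02 14:35', 'blue', 'truck', 8.5, 52.0, 'north')")

rows = db.execute("""
    SELECT * FROM vehicle_events
    WHERE color = 'blue' AND vtype = 'truck' AND length_ft > 7
      AND speed_mph > 45 AND direction = 'north'
      AND time(ts) BETWEEN '14:00' AND '17:00'""").fetchall()
```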
International Conference on Computer Vision | 2009
Aniruddha Kembhavi; Behjat Siddiquie; Roland Miezianko; Scott McCloskey; Larry S. Davis
A good training dataset, representative of the test images expected in a given application, is critical for ensuring good performance of a visual categorization system. Obtaining task specific datasets of visual categories is, however, far more tedious than obtaining a generic dataset of the same classes. We propose an Incremental Multiple Kernel Learning (IMKL) approach to object recognition that initializes on a generic training database and then tunes itself to the classification task at hand. Our system simultaneously updates the training dataset as well as the weights used to combine multiple information sources. We demonstrate our system on a vehicle classification problem in a video stream overlooking a traffic intersection. Our system updates itself with images of vehicles in poses more commonly observed in the scene, as well as with image patches of the background, leading to an increase in performance. A considerable change in the kernel combination weights is observed as the system gathers scene specific training data over time. The system is also seen to adapt itself to the illumination change in the scene as day transitions to night.
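A very rough sketch of the kernel-combination step is given below: base kernels from several feature channels are combined with weights, a precomputed-kernel SVM is trained, and the weights are re-estimated as new scene-specific samples are added. This is not the paper's IMKL algorithm; the alignment-based weight update and all names are assumptions used only to illustrate how the weights can shift as scene data accumulates.

```python
# Rough stand-in for incremental multiple kernel learning: retrain a
# precomputed-kernel SVM on the growing training set and re-estimate kernel
# weights via (uncentered) kernel-target alignment.
import numpy as np
from sklearn.svm import SVC

def alignment(K, y):
    """Kernel-target alignment: a simple proxy for how useful a base kernel is."""
    Y = np.outer(y, y)
    return (K * Y).sum() / (np.linalg.norm(K) * np.linalg.norm(Y))

def fit_combined(kernels, y):
    w = np.array([max(alignment(K, y), 0.0) for K in kernels])
    w = w / w.sum() if w.sum() > 0 else np.ones(len(kernels)) / len(kernels)
    K = sum(wi * Ki for wi, Ki in zip(w, kernels))
    return SVC(kernel="precomputed").fit(K, y), w

rng = np.random.default_rng(2)
X1, X2 = rng.normal(size=(40, 10)), rng.normal(size=(40, 5))
y = rng.choice([-1, 1], size=40)
kernels = [X1 @ X1.T, X2 @ X2.T]   # two base kernels from two feature channels
clf, w = fit_combined(kernels, y)
# Incremental step: append confidently classified scene samples, recompute the
# base kernels on the enlarged set, and refit; w drifts toward the channels
# that best explain the scene-specific data.
```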
Workshop on Applications of Computer Vision | 2011
Rogério Schmidt Feris; James Petterson; Behjat Siddiquie; Lisa M. Brown; Sharath Pankanti
We present a novel approach for vehicle detection in urban surveillance videos, capable of handling unstructured and crowded environments with large occlusions, different vehicle shapes, and environmental conditions such as lighting changes, rain, shadows, and reflections. This is achieved with virtually no manual labeling effort. The system runs quite efficiently at an average of 66 Hz on a conventional laptop computer. Our proposed approach relies on three key contributions: 1) a co-training scheme where data is automatically captured based on motion and shape cues and used to train a detector based on appearance information; 2) an occlusion handling technique based on synthetically generated training samples obtained through Poisson image reconstruction from image gradients; 3) massively parallel feature selection over multiple feature planes which allows the final detector to be more accurate and more efficient. We perform a comprehensive quantitative analysis to validate our approach, showing its usefulness in realistic urban surveillance settings.
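The occlusion-synthesis contribution can be illustrated with gradient-domain blending: paste one vehicle over another with Poisson blending to create a partially occluded training sample. The sketch uses OpenCV's seamlessClone, which implements Poisson blending; the paper's own reconstruction-from-gradients pipeline may differ in detail, and the file paths and placement below are placeholders.

```python
# Hedged illustration: synthesize an occluded training sample by Poisson-blending
# one vehicle crop onto another (placeholder image paths).
import cv2
import numpy as np

background = cv2.imread("vehicle_a.png")   # un-occluded training sample (placeholder path)
occluder   = cv2.imread("vehicle_b.png")   # vehicle used as the occluder (placeholder path)
occluder = cv2.resize(occluder, (background.shape[1] // 2, background.shape[0] // 2))
mask = 255 * np.ones(occluder.shape[:2], dtype=np.uint8)

# Place the occluder over the lower-right part of the background vehicle.
center = (int(background.shape[1] * 0.7), int(background.shape[0] * 0.7))
occluded_sample = cv2.seamlessClone(occluder, background, mask, center, cv2.NORMAL_CLONE)
cv2.imwrite("vehicle_a_occluded.png", occluded_sample)
```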
Journal of Diabetes Science and Technology | 2015
Weiyu Zhang; Qian Yu; Behjat Siddiquie; Ajay Divakaran; Harpreet S. Sawhney
We present snap-n-eat, a mobile food recognition system. The system can recognize food and estimate the calorific and nutritional content of foods automatically, without any user intervention. To identify food items, the user simply snaps a photo of the food plate. The system detects the salient region, crops its image, and subtracts the background accordingly. Hierarchical segmentation is performed to segment the image into regions. We then extract features at different locations and scales and classify these regions into different kinds of foods using a linear support vector machine classifier. In addition, the system determines the portion size, which is then used to estimate the calorific and nutritional content of the food present on the plate. Previous approaches have mostly worked with either images captured in a lab setting, or they require additional user input (e.g., user-provided crop bounding boxes). Our system achieves automatic food detection and recognition in real-life settings containing cluttered backgrounds. When multiple food items appear in an image, our system can identify them and estimate their portion sizes simultaneously. We implemented this system as both an Android smartphone application and as a web service. In our experiments, we achieved above 85% accuracy when detecting 15 different kinds of foods.
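A simplified sketch of the classification stage follows: segmented food regions are classified with a linear SVM and the region area is used as a crude portion proxy. Feature extraction, segmentation, and the nutrition lookup are stubbed out, and all names and numbers are illustrative assumptions.

```python
# Classification-stage sketch only: random features stand in for the real
# per-region descriptors; region area scales a per-class calorie density.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(3)
train_feats = rng.normal(size=(300, 128))      # e.g. pooled local descriptors per region
train_labels = rng.integers(0, 15, size=300)   # 15 food classes
clf = LinearSVC().fit(train_feats, train_labels)

def recognize_plate(region_feats, region_areas, kcal_per_unit_area):
    """Classify each segmented region and scale a per-class calorie density
    by the region's area as a rough portion/calorie estimate."""
    labels = clf.predict(region_feats)
    calories = [kcal_per_unit_area[l] * a for l, a in zip(labels, region_areas)]
    return labels, calories
```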
Content-Based Multimedia Indexing | 2007
V.S.N. Prasad; Behjat Siddiquie; J. Golbeck; Larry S. Davis
We present an approach for classifying images of charts based on the shape and spatial relationships of their primitives. Five categories are considered: bar-charts, curve-plots, pie-charts, scatter-plots and surface-plots. We introduce two novel features to represent the structural information based on (a) region segmentation and (b) curve saliency. The local shape is characterized using the Histograms of Oriented Gradients (HOG) and the Scale Invariant Feature Transform (SIFT) descriptors. Each image is represented by sets of feature vectors of each modality. The similarity between two images is measured by the overlap in the distribution of the features, measured using the Pyramid Match algorithm. A test image is classified based on its similarity with training images from the categories. The approach is tested with a database of images collected from the Internet.
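The pyramid match idea can be shown on a toy one-dimensional feature space: histogram intersections are computed at increasingly coarse resolutions, and newly matched features at finer levels are weighted more heavily. The real method matches multi-dimensional descriptor sets; this sketch only conveys the weighting scheme.

```python
# Toy 1-D pyramid match: count histogram-intersection matches level by level,
# weighting new matches at level i by 1/2**i (finest level i = 0).
import numpy as np

def pyramid_match(x, y, levels=4, lo=0.0, hi=1.0):
    score, prev = 0.0, 0.0
    for level in range(levels):
        bins = 2 ** (levels - level)                  # fine -> coarse
        hx, _ = np.histogram(x, bins=bins, range=(lo, hi))
        hy, _ = np.histogram(y, bins=bins, range=(lo, hi))
        inter = np.minimum(hx, hy).sum()              # matches at this resolution
        score += (inter - prev) / (2 ** level)        # only the *new* matches count
        prev = inter
    return score

rng = np.random.default_rng(4)
print(pyramid_match(rng.random(200), rng.random(180)))
```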
Workshop on Applications of Computer Vision | 2014
Mohamed R. Amer; Behjat Siddiquie; Saad M. Khan; Ajay Divakaran; Harpreet S. Sawhney
We propose a novel hybrid model that exploits the strength of discriminative classifiers along with the representational power of generative models. Our focus is on detecting multimodal events in time-varying sequences. Discriminative classifiers have been shown to achieve higher performance than the corresponding generative likelihood-based classifiers. On the other hand, generative models learn a rich informative space which allows for data generation and joint feature representation that discriminative models lack. We employ a deep temporal generative model for unsupervised learning of a shared representation across multiple modalities with time-varying data. The temporal generative model takes into account short-term temporal phenomena and allows for filling in missing data by generating data within or across modalities. The hybrid model augments the temporal generative model with a temporal discriminative model for event detection and classification, which enables modeling long-range temporal dynamics. We evaluate our approach on audio-visual datasets (AVEC, AVLetters, and CUAVE) and demonstrate its superiority compared to the state-of-the-art.
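A very loose sketch of the hybrid generative/discriminative structure is given below: a shared representation is trained to reconstruct both modalities (the generative part), while a temporal head classifies the event from the shared codes (the discriminative part). The paper uses deep temporal generative models rather than the autoencoder-plus-GRU stand-in shown here; all dimensions and names are assumptions.

```python
# Stand-in for the hybrid model: reconstruction loss over both modalities plus
# a discriminative temporal classification loss over the shared representation.
import torch
import torch.nn as nn

class HybridModel(nn.Module):
    def __init__(self, audio_dim, video_dim, hidden=64, n_events=5):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(audio_dim + video_dim, hidden), nn.ReLU())
        self.decode = nn.Linear(hidden, audio_dim + video_dim)    # generative part
        self.temporal = nn.GRU(hidden, hidden, batch_first=True)  # long-range dynamics
        self.classify = nn.Linear(hidden, n_events)               # discriminative part

    def forward(self, audio, video):
        x = torch.cat([audio, video], dim=-1)    # (batch, time, audio+video)
        z = self.encode(x)
        recon = self.decode(z)
        _, h = self.temporal(z)
        return recon, x, self.classify(h[-1])

model = HybridModel(audio_dim=20, video_dim=30)
audio, video = torch.randn(4, 50, 20), torch.randn(4, 50, 30)
recon, target, logits = model(audio, video)
loss = nn.functional.mse_loss(recon, target) + nn.functional.cross_entropy(
    logits, torch.randint(0, 5, (4,)))
```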
Workshop on Applications of Computer Vision | 2009
Behjat Siddiquie; Shiv Naga Prasad Vitaladevuni; Larry S. Davis
We investigate the problem of combining multiple feature channels for the purpose of efficient image classification. Discriminative kernel-based methods, such as SVMs, have been shown to be quite effective for image classification. To use these methods with several feature channels, one needs to combine base kernels computed from them. Multiple kernel learning is an effective method for combining the base kernels. However, the cost of computing the kernel similarities of a test image with each of the support vectors for all feature channels is extremely high. We propose an alternate method, where training data instances are selected, using AdaBoost, for each of the base kernels. A composite decision function, which can be evaluated by computing kernel similarities with respect to only these chosen instances, is learnt. This method significantly reduces the number of kernel computations required during testing. Experimental results on the benchmark UCI datasets, as well as on a challenging painting dataset, are included to demonstrate the effectiveness of our method.
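The test-time saving can be illustrated with a sketch of the composite decision function: kernel similarities are computed only against a small set of AdaBoost-selected prototypes per feature channel, rather than against every support vector in every channel. The selection step is omitted, and the prototype format and weights below are illustrative assumptions.

```python
# Sketch of the composite decision function: only the selected prototypes
# require kernel evaluations at test time.
import numpy as np

def rbf(a, b, gamma=0.5):
    return np.exp(-gamma * np.sum((a - b) ** 2))

def composite_decision(test_feats, selected, bias=0.0):
    """test_feats: dict channel -> test feature vector.
    selected: list of (channel, prototype_vector, prototype_label, alpha)."""
    score = bias
    for channel, proto, label, alpha in selected:
        score += alpha * label * rbf(test_feats[channel], proto)
    return np.sign(score)

rng = np.random.default_rng(5)
selected = [("hog", rng.normal(size=8), +1, 0.7),
            ("color", rng.normal(size=4), -1, 0.4),
            ("hog", rng.normal(size=8), +1, 0.3)]
test_feats = {"hog": rng.normal(size=8), "color": rng.normal(size=4)}
print(composite_decision(test_feats, selected))
```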
International Conference on Multimedia Retrieval | 2014
Behjat Siddiquie; Brandyn White; Abhishek Sharma; Larry S. Davis
We propose a unified framework for image retrieval capable of handling complex and descriptive queries of multiple modalities in a scalable manner. A novel aspect of our approach is that it supports query specification in terms of objects, attributes and spatial relationships, thereby allowing for substantially more complex and descriptive queries. We allow these complex queries to be specified in three different modalities: images, sketches, and structured textual descriptions. Furthermore, we propose a unique multi-modal hashing algorithm capable of mapping queries of different modalities to the same binary representation, enabling efficient and scalable image retrieval based on multi-modal queries. Extensive experimental evaluation shows that our approach outperforms state-of-the-art image retrieval and hashing techniques on the MSRC and SUN09 datasets by about 100%, while its performance on a dataset of 1M Flickr images demonstrates its scalability.
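The multi-modal hashing idea can be sketched in a few lines: each modality has its own projection into a shared space, the sign of the projection yields a common binary code, and retrieval is a Hamming-distance scan over the indexed codes. The projections below are random stand-ins for the learned ones, and the feature dimensions are assumptions.

```python
# Minimal sketch: modality-specific projections into a shared binary code,
# followed by Hamming-distance retrieval (random projections stand in for
# the learned hash functions).
import numpy as np

def hash_code(x, W):
    return (x @ W > 0).astype(np.uint8)           # one bit per projection

def hamming_search(query_bits, db_bits, k=5):
    dists = (query_bits ^ db_bits).sum(axis=1)    # Hamming distance to each image
    return np.argsort(dists)[:k]

rng = np.random.default_rng(6)
bits = 64
W_image, W_text = rng.normal(size=(512, bits)), rng.normal(size=(300, bits))

db_codes = hash_code(rng.normal(size=(10000, 512)), W_image)  # indexed image set
text_query = rng.normal(size=300)                             # e.g. an embedded textual description
top_k = hamming_search(hash_code(text_query, W_text), db_codes)
```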