Santosh Kumar Divvala
Carnegie Mellon University
Publications
Featured research published by Santosh Kumar Divvala.
computer vision and pattern recognition | 2016
Joseph Redmon; Santosh Kumar Divvala; Ross B. Girshick; Ali Farhadi
We present YOLO, a new approach to object detection. Prior work on object detection repurposes classifiers to perform detection. Instead, we frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance. Our unified architecture is extremely fast. Our base YOLO model processes images in real-time at 45 frames per second. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors. Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background. Finally, YOLO learns very general representations of objects. It outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains like artwork.
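As a rough, unofficial sketch of this single-evaluation design (in NumPy; S, B and C follow the paper's PASCAL VOC setting, but the decoding details are simplified and the confidence threshold is arbitrary), the network's one output tensor can be read off as detections like so:

    import numpy as np

    S, B, C = 7, 2, 20  # grid size, boxes per cell, classes (PASCAL VOC setting)

    def decode_predictions(pred, conf_thresh=0.2):
        """pred: (S, S, B*5 + C) output of one forward pass on one image."""
        detections = []
        for row in range(S):
            for col in range(S):
                cell = pred[row, col]
                class_probs = cell[B * 5:]        # per-cell class scores, shared by its boxes
                for b in range(B):
                    x, y, w, h, conf = cell[b * 5:(b + 1) * 5]
                    cx = (col + x) / S            # x, y are offsets within the grid cell
                    cy = (row + y) / S            # w, h are relative to the full image
                    scores = conf * class_probs   # class-specific confidence
                    cls = int(np.argmax(scores))
                    if scores[cls] > conf_thresh:
                        detections.append((cx, cy, w, h, cls, float(scores[cls])))
        return detections

    # A single forward pass yields every box: no proposals, no per-region rescoring.
    boxes = decode_predictions(np.random.rand(S, S, B * 5 + C))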
computer vision and pattern recognition | 2009
Santosh Kumar Divvala; Derek Hoiem; James Hays; Alexei A. Efros; Martial Hebert
This paper presents an empirical evaluation of the role of context in a contemporary, challenging object detection task - the PASCAL VOC 2008. Previous experiments with context have mostly been done on home-grown datasets, often with non-standard baselines, making it difficult to isolate the contribution of contextual information. In this work, we present our analysis on a standard dataset, using top-performing local appearance detectors as baseline. We evaluate several different sources of context and ways to utilize it. While we employ many contextual cues that have been used before, we also propose a few novel ones including the use of geographic context and a new approach for using object spatial support.
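Purely as an illustration of "utilizing" context, and not the paper's exact formulation, one common pattern is to rescore each local detection with a learned combination of its appearance score and contextual features:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def rescore(det_score, context_feats, w, b):
        """Blend the raw detector score with context features for the same
        window (e.g. scene gist, estimated surface geometry at the box,
        relative object location); w and b would be learned on held-out data."""
        x = np.concatenate(([det_score], context_feats))
        return sigmoid(w @ x + b)

    # Hypothetical weights: appearance dominates, context nudges the score.
    w, b = np.array([2.0, 0.5, 0.5]), -1.0
    print(rescore(0.8, np.array([0.3, 0.6]), w, b))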
computer vision and pattern recognition | 2014
Santosh Kumar Divvala; Ali Farhadi; Carlos Guestrin
Recognition is graduating from labs to real-world applications. While it is encouraging to see its potential being tapped, it brings forth a fundamental challenge to the vision researcher: scalability. How can we learn a model for any concept that exhaustively covers all its appearance variations, while requiring minimal or no human supervision for compiling the vocabulary of visual variance, gathering the training images and annotations, and learning the models? In this paper, we introduce a fully-automated approach for learning extensive models for a wide range of variations (e.g. actions, interactions, attributes and beyond) within any concept. Our approach leverages vast resources of online books to discover the vocabulary of variance, and intertwines the data collection and modeling steps to alleviate the need for explicit human supervision in training the models. Our approach organizes the visual knowledge about a concept in a convenient and useful way, enabling a variety of applications across vision and NLP. Our online system has been queried by users to learn models for several interesting concepts including breakfast, Gandhi, beautiful, etc. To date, our system has models available for over 50,000 variations within 150 concepts, and has annotated more than 10 million images with bounding boxes.
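A toy sketch of the vocabulary-discovery step (our own simplification: a few hard-coded sentences stand in for the books corpus, and the count threshold is arbitrary) mines frequent modifiers of a concept from raw text:

    from collections import Counter

    def discover_variations(corpus_sentences, concept, min_count=2):
        """Collect bigrams ending in the concept word, e.g. 'jumping horse'."""
        counts = Counter()
        for sentence in corpus_sentences:
            tokens = sentence.lower().split()
            for prev, word in zip(tokens, tokens[1:]):
                if word == concept:
                    counts[f"{prev} {concept}"] += 1
        return [ngram for ngram, n in counts.items() if n >= min_count]

    corpus = ["A jumping horse cleared the fence",
              "The jumping horse landed softly",
              "A grazing horse stood near the grazing horse pasture"]
    print(discover_variations(corpus, "horse"))  # ['jumping horse', 'grazing horse']

Each discovered variation would then seed its own image search and model training, which is what removes the human from the loop.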
international conference on computer vision | 2012
Santosh Kumar Divvala; Alexei A. Efros; Martial Hebert
The Deformable Parts Model (DPM) has recently emerged as a very useful and popular tool for tackling the intra-category diversity problem in object detection. In this paper, we summarize the key insights from our empirical analysis of the important elements constituting this detector. More specifically, we study the relationship between the role of deformable parts and the mixture model components within this detector, and understand their relative importance. First, we find that by increasing the number of components, and switching the initialization step from aspect-ratio and left-right flipping heuristics to appearance-based clustering, a considerable improvement in performance is obtained. More intriguingly, we observe that with these new components, the part deformations can be turned off while still obtaining results that are almost on par with the original DPM detector.
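A minimal sketch of the appearance-based initialization the analysis points to (our own toy k-means; a real system would warp each instance and extract actual HOG descriptors rather than random features):

    import numpy as np

    def kmeans(feats, k, iters=20, seed=0):
        rng = np.random.default_rng(seed)
        centers = feats[rng.choice(len(feats), size=k, replace=False)]
        for _ in range(iters):
            # assign each example to its nearest center, then recompute centers
            assign = ((feats[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
            for j in range(k):
                if np.any(assign == j):
                    centers[j] = feats[assign == j].mean(axis=0)
        return assign

    # Each row stands in for the HOG template of one warped training instance.
    hog_feats = np.random.rand(200, 128)
    components = kmeans(hog_feats, k=6)  # train one mixture component per cluster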
computer vision and pattern recognition | 2008
Santosh Kumar Divvala; Alexei A. Efros; Martial Hebert
We describe a preliminary investigation of utilising large amounts of unlabelled image data to help in the estimation of rough scene layout. We take the single-view geometry estimation system of Hoiem et al. (2007) as the baseline and ask whether its performance can be improved by considering a set of similar scenes gathered from the Web. The two complementary approaches being considered are 1) improving surface classification by using average geometry estimated from the matches, and 2) improving surface segmentation by injecting segments generated from the average of the matched images. The system is evaluated using the labelled 300-image dataset of Hoiem et al. and shows promising results.
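A minimal sketch of the first idea, with illustrative names of our own: average the per-pixel surface-layout estimates of the matched Web scenes and blend them with the single-view classifier's output:

    import numpy as np

    def average_geometry(match_label_maps):
        """match_label_maps: list of (H, W, 3) per-pixel probability maps over
        {support, vertical, sky}, one per matched scene."""
        return np.mean(np.stack(match_label_maps), axis=0)

    def rescore_layout(local_probs, matches, alpha=0.5):
        # Blend the input image's own estimate with the matched-scene prior;
        # alpha is an illustrative mixing weight, not a value from the paper.
        return alpha * local_probs + (1 - alpha) * average_geometry(matches)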
computer vision and pattern recognition | 2017
Gunnar A. Sigurdsson; Santosh Kumar Divvala; Ali Farhadi; Abhinav Gupta
Actions are more than just movements and trajectories: we cook to eat and we hold a cup to drink from it. A thorough understanding of videos requires going beyond appearance modeling and necessitates reasoning about the sequence of activities, as well as higher-level constructs such as intentions. But how do we model and reason about these? We propose a fully-connected temporal CRF model for reasoning over various aspects of activities that includes objects, actions, and intentions, where the potentials are predicted by a deep network. End-to-end training of such structured models is a challenging endeavor: for inference and learning we need to construct mini-batches consisting of whole videos, leading to mini-batches with only a few videos. This causes high correlation between data points, leading to a breakdown of the backprop algorithm. To address this challenge, we present an asynchronous variational inference method that allows efficient end-to-end training. Our method achieves a classification mAP of 22.4% on the Charades [42] benchmark, outperforming the state-of-the-art (17.2% mAP), and offers equal gains on the task of temporal localization.
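A much-simplified sketch of the inference pattern (the real model couples objects, actions and intentions with learned potentials and trains end-to-end; this toy version only shows mean-field updates in a fully-connected temporal model with one shared compatibility matrix):

    import numpy as np

    def softmax(z, axis=-1):
        z = z - z.max(axis=axis, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=axis, keepdims=True)

    def mean_field(unary, pairwise, iters=10):
        """unary: (T, K) per-frame label scores from a deep net;
        pairwise: (K, K) label-compatibility matrix shared by all frame pairs."""
        q = softmax(unary)
        for _ in range(iters):
            # each frame receives messages from every other frame at once
            msg = (q.sum(axis=0, keepdims=True) - q) @ pairwise
            q = softmax(unary + msg)
        return q  # per-frame marginals over the K labels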
international conference on computer vision | 2015
Hamid Izadinia; Fereshteh Sadeghi; Santosh Kumar Divvala; Hannaneh Hajishirzi; Yejin Choi; Ali Farhadi
We introduce Segment-Phrase Table (SPT), a large collection of bijective associations between textual phrases and their corresponding segmentations. Leveraging recent progress in object recognition and natural language semantics, we show how we can successfully build a high-quality segment-phrase table using minimal human supervision. More importantly, we demonstrate the unique value unleashed by this rich bimodal resource, for both vision as well as natural language understanding. First, we show that fine-grained textual labels facilitate contextual reasoning that helps in satisfying semantic constraints across image segments. This feature enables us to achieve state-of-the-art segmentation results on benchmark datasets. Next, we show that the association of high-quality segmentations to textual phrases aids in richer semantic understanding and reasoning of these textual phrases. Leveraging this feature, we motivate the problem of visual entailment and visual paraphrasing, and demonstrate its utility on a large dataset.
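A toy rendering, entirely our own, of what such a table stores and how either side can index the other (the masks here are empty placeholders):

    import numpy as np

    segment_phrase_table = {
        "horse grazing": [np.zeros((64, 64), dtype=bool)],
        "horse jumping": [np.zeros((64, 64), dtype=bool)],
    }

    def segments_for(phrase):
        return segment_phrase_table.get(phrase, [])

    def phrases_for(mask, iou_thresh=0.5):
        """Reverse lookup: phrases whose stored masks overlap the query mask."""
        def iou(a, b):
            inter = np.logical_and(a, b).sum()
            union = np.logical_or(a, b).sum()
            return inter / union if union else 0.0
        return [p for p, masks in segment_phrase_table.items()
                if any(iou(mask, m) >= iou_thresh for m in masks)]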
british machine vision conference | 2012
Santosh Kumar Divvala; Alexei A. Efros; Martial Hebert
Most contemporary object detection approaches assume each object instance in the training data to be uniquely represented by a single bounding box. In this paper, we go beyond this conventional view by allowing an object instance to be described by multiple bounding boxes. The new bounding box annotations are determined based on the alignment of an object instance with the other training instances in the dataset. Our proposal enables the training data to be reused multiple times for training richer multi-component category models. We operationalize this idea by two complementary operations: bounding box shrinking, which finds subregions of an object instance that could be shared; and bounding box enlarging, which enlarges object instances to include local contextual cues. We empirically validate our approach on the PASCAL VOC detection dataset.
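A small sketch of the enlarging operation in isolation (the paper derives its new boxes from cross-instance alignment; the fixed scale factor here is purely illustrative):

    def enlarge_box(box, factor, img_w, img_h):
        """Grow (x1, y1, x2, y2) about its center to pull in local context,
        clipped to the image bounds."""
        x1, y1, x2, y2 = box
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        w, h = (x2 - x1) * factor, (y2 - y1) * factor
        return (max(0, cx - w / 2), max(0, cy - h / 2),
                min(img_w, cx + w / 2), min(img_h, cy + h / 2))

    print(enlarge_box((40, 40, 80, 80), 1.5, 640, 480))  # (30.0, 30.0, 90.0, 90.0)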
european conference on computer vision | 2016
Noah Siegel; Zachary Horvitz; Roie Levin; Santosh Kumar Divvala; Ali Farhadi
‘Which are the pedestrian detectors that yield a precision above 95% at 25% recall?’ Answering such a complex query involves identifying and analyzing the results reported in figures within several research papers. Despite the availability of excellent academic search engines, retrieving such information poses a cumbersome challenge today as these systems have primarily focused on understanding the text content of scholarly documents. In this paper, we introduce FigureSeer, an end-to-end framework for parsing result-figures that enables powerful search and retrieval of results in research papers. Our proposed approach automatically localizes figures from research papers, classifies them, and analyses the content of the result-figures. The key challenge in analyzing the figure content is the extraction of the plotted data and its association with the legend entries. We address this challenge by formulating a novel graph-based reasoning approach using a CNN-based similarity metric. We present a thorough evaluation on a real-world annotated dataset to demonstrate the efficacy of our approach.
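The curve-to-legend association can be pictured as a bipartite matching, sketched below with SciPy's Hungarian solver; the similarity matrix is simply an input here, whereas the paper learns it with a CNN and embeds it in a richer graph-based formulation:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def associate(similarity):
        """similarity: (num_curves, num_legend_entries), higher = better match."""
        rows, cols = linear_sum_assignment(-similarity)  # maximize total similarity
        return list(zip(rows.tolist(), cols.tolist()))

    sim = np.array([[0.9, 0.1],
                    [0.2, 0.8]])
    print(associate(sim))  # [(0, 0), (1, 1)]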
british machine vision conference | 2012
Karthik Sheshadri; Santosh Kumar Divvala
Character recognition in natural scenes continues to represent a formidable challenge in computer vision. Traditional optical character recognition (OCR) methods fail to perform well on characters from scene text owing to a variety of difficulties in background clutter, binarisation, and arbitrary skew. Further, English characters group into only 62 classes, whereas many of the world's languages have several hundred classes. In particular, most Indic-script languages such as Kannada exhibit large intra-class diversity, while the only difference between two classes may be a minor contour above or below the character. These considerations motivate an exemplar approach to classification: one which does not seek intra-class commonality among extreme examples that are essentially subclasses of their own.

Exemplar SVMs have recently been introduced in the object recognition context. The essence of the exemplar approach is that, rather than seeking to establish commonality within classes, a separate classifier is learnt for each exemplar in the dataset. To keep individual classification simple, linear SVMs are used, and each classifier is hence an exemplar-specific weight vector. Each exemplar in the dataset is resized to standard dimensions, and HOG features are then densely extracted to create a rigid template x_E. A set of negative samples N_E is created by the same process from classes not corresponding to the exemplar. Each classifier (w_E, b_E) maximizes the separation between x_E and every window in N_E. This is equivalent to optimizing the convex objective [4]:
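\Omega_E(w_E, b_E) = \|w_E\|^2 + C_1\, h\big(w_E^\top x_E + b_E\big) + C_2 \sum_{x \in N_E} h\big(-w_E^\top x - b_E\big)

where h(x) = \max(0, 1 - x) is the hinge loss and C_1, C_2 weight the single positive exemplar against the many negative windows (the equation is reconstructed here from the Exemplar-SVM formulation that [4] introduces).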