Ishan Misra
Carnegie Mellon University
Publications
Featured research published by Ishan Misra.
European Conference on Computer Vision | 2016
Ishan Misra; C. Lawrence Zitnick; Martial Hebert
In this paper, we present an approach for learning a visual representation from the raw spatiotemporal signals in videos. Our representation is learned without supervision from semantic labels. We formulate our method as an unsupervised sequential verification task, i.e., we determine whether a sequence of frames from a video is in the correct temporal order. With this simple task and no semantic labels, we learn a powerful visual representation using a Convolutional Neural Network (CNN). The representation contains complementary information to that learned from supervised image datasets like ImageNet. Qualitative results show that our method captures information that is temporally varying, such as human pose. When used as pre-training for action recognition, our method gives significant gains over learning without external data on benchmark datasets like UCF101 and HMDB51. To demonstrate its sensitivity to human pose, we show results for pose estimation on the FLIC and MPII datasets that are competitive with, or better than, approaches using significantly more supervision. Our method can be combined with supervised representations to provide an additional boost in accuracy.
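The sequential verification task can be sketched in a few lines: a shared CNN embeds each frame of a sampled tuple, the features are concatenated, and a binary classifier predicts whether the tuple is in the correct temporal order. The toy backbone, feature sizes, and sampling details below are illustrative assumptions; the paper used an AlexNet-style network.

```python
import random
import torch
import torch.nn as nn

class OrderVerificationNet(nn.Module):
    """Shared per-frame CNN; concatenated features -> in-order vs. shuffled."""
    def __init__(self, feat_dim=128, num_frames=3):
        super().__init__()
        self.backbone = nn.Sequential(          # toy stand-in for the paper's AlexNet-style backbone
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim), nn.ReLU(),
        )
        self.classifier = nn.Linear(feat_dim * num_frames, 2)

    def forward(self, frames):                  # frames: (batch, num_frames, 3, H, W)
        feats = [self.backbone(frames[:, t]) for t in range(frames.shape[1])]
        return self.classifier(torch.cat(feats, dim=1))

def sample_tuple(video_frames, num_frames=3):
    """Pick frames from one video; label 1 if kept in temporal order, 0 if shuffled."""
    idx = sorted(random.sample(range(len(video_frames)), num_frames))
    label = random.randint(0, 1)
    if label == 0:
        while idx == sorted(idx):               # ensure the shuffled tuple is actually out of order
            random.shuffle(idx)
    return torch.stack([video_frames[i] for i in idx]), label
```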
Computer Vision and Pattern Recognition | 2015
Ishan Misra; Abhinav Shrivastava; Martial Hebert
We present a semi-supervised approach that localizes multiple unknown object instances in long videos. We start with a handful of labeled boxes and iteratively learn and label hundreds of thousands of object instances. We propose criteria for reliable object detection and tracking for constraining the semi-supervised learning process and minimizing semantic drift. Our approach does not assume exhaustive labeling of each object instance in any single frame, or any explicit annotation of negative data. Working in such a generic setting allows us to tackle multiple object instances in video, many of which are static. In contrast, existing approaches either do not consider multiple object instances per video, or rely heavily on the motion of the objects present. The experiments demonstrate the effectiveness of our approach by evaluating the automatically labeled data on a variety of metrics like quality, coverage (recall), diversity, and relevance to training an object detector.
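A minimal skeleton of the iterative labeling loop described above, under the assumption that the caller supplies the detector training routine, the detector, the tracker, and the reliability criteria (all hypothetical placeholders, not the paper's exact components):

```python
def grow_labels(videos, seed_boxes, train_detector, detect, track, is_reliable,
                num_iterations=5):
    """Iteratively expand a handful of labeled boxes into a large labeled set.

    train_detector, detect, track, and is_reliable are caller-supplied callables;
    is_reliable stands in for the detection/tracking agreement criteria that
    keep semantic drift in check.
    """
    labeled = list(seed_boxes)                       # start from a few labeled boxes
    for _ in range(num_iterations):
        detector = train_detector(labeled)           # retrain on everything labeled so far
        new_boxes = []
        for video in videos:
            for detection in detect(detector, video):
                trajectory = track(detection, video)
                if is_reliable(detection, trajectory):   # keep only confident, consistent labels
                    new_boxes.append(detection)
        labeled.extend(new_boxes)
    return labeled
```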
Workshop on Applications of Computer Vision | 2014
Ishan Misra; Abhinav Shrivastava; Martial Hebert
We consider the problem of discovering discriminative exemplars suitable for object detection. Due to the diversity in appearance of real-world objects, an object detector must capture variations in scale, viewpoint, illumination, etc. Current approaches do this by using mixtures of models, where each mixture is designed to capture one (or a few) axes of variation. Current methods usually rely on heuristics to capture these variations; however, it is unclear which axes of variation exist and are relevant to a particular task. Another issue is the requirement of a large set of training images to capture such variations. Current methods do not scale to large training sets either because of training time complexity [31] or test time complexity [26]. In this work, we explore the idea of compactly capturing task-appropriate variation from the data itself. We propose a two-stage data-driven process, which selects and learns a compact set of exemplar models for object detection. These selected models have an inherent ranking, which can be used for anytime/budgeted detection scenarios. Another benefit of our approach (beyond the computational speedup) is that the selected set of exemplar models performs better than the entire set.
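One way to picture the selection stage is a greedy, coverage-driven ranking of exemplars; the objective below is an illustrative assumption (a facility-location-style gain), not the paper's exact two-stage procedure:

```python
import numpy as np

def select_exemplars(similarity, budget):
    """similarity[i, j]: how well exemplar i covers training sample j.

    Greedily pick exemplars by marginal coverage gain; the pick order
    doubles as a ranking for anytime/budgeted detection scenarios.
    """
    num_exemplars, num_samples = similarity.shape
    covered = np.zeros(num_samples)
    ranking = []
    for _ in range(budget):
        gains = np.maximum(similarity, covered).sum(axis=1) - covered.sum()
        gains[ranking] = -np.inf                 # never re-pick an exemplar
        best = int(np.argmax(gains))
        ranking.append(best)
        covered = np.maximum(covered, similarity[best])
    return ranking
```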
Computer Vision and Pattern Recognition | 2017
Ishan Misra; Abhinav Gupta; Martial Hebert
Compositionality and contextuality are key building blocks of intelligence. They allow us to compose known concepts to generate new and complex ones. However, traditional learning methods do not model both of these properties and require copious amounts of labeled data to learn new concepts. A large fraction of existing techniques, e.g., those using late fusion, compose concepts but fail to model contextuality. For example, red in red wine is different from red in red tomatoes. In this paper, we present a simple method that respects contextuality in order to compose classifiers of known visual concepts. Our method builds upon the intuition that classifiers lie in a smooth space where compositional transforms can be modeled. We show how it can generalize to unseen combinations of concepts. Our results on composing attributes and objects, as well as subject, predicate, and object triples, demonstrate strong generalization performance compared to baselines. Finally, we present a detailed analysis of our method and highlight its properties.
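A minimal sketch of composing two linear classifiers with a small transformation network, in the spirit of the method described above; the layer sizes and the random placeholder weights are assumptions:

```python
import torch
import torch.nn as nn

class ClassifierComposer(nn.Module):
    """Map (attribute classifier, object classifier) -> composite classifier."""
    def __init__(self, feat_dim=512, hidden_dim=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, feat_dim),
        )

    def forward(self, attr_w, obj_w):
        # Context matters: the same attribute vector yields a different
        # composite depending on the object it is paired with.
        return self.net(torch.cat([attr_w, obj_w], dim=-1))

# Usage: score an image feature with a composed "red wine" classifier.
composer = ClassifierComposer()
red_w, wine_w = torch.randn(512), torch.randn(512)   # placeholder classifier weights
image_feat = torch.randn(512)                         # placeholder image feature
score = image_feat @ composer(red_w, wine_w)
```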
Proceedings of SPIE | 2016
Ishan Misra; Yu-Xiong Wang; Martial Hebert
Current computer vision systems rely primarily on fixed models learned in a supervised fashion, i.e., with extensive manually labelled data. This is appropriate in scenarios in which the information about all possible visual queries can be anticipated in advance, but it does not scale to scenarios in which new objects need to be added during the operation of the system, as in dynamic interaction with UGVs. For example, the user might have found a new type of object of interest, e.g., a particular vehicle, which needs to be added to the system right away. In such cases, acquiring and annotating extensive data for a fully supervised approach is not practical. In this paper, we describe techniques for rapidly updating or creating models using sparsely labelled data. The techniques address scenarios in which only a few annotated training samples are available and need to be used to generate models suitable for recognition. These approaches are crucial for on-the-fly insertion of models by users and for on-line learning.
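As a rough illustration of the setting (not the specific techniques surveyed in the paper), one generic way to insert a new object model on the fly is to freeze a pretrained feature extractor and fit only a small linear head on the handful of labelled samples:

```python
import torch
import torch.nn as nn

def fit_new_class_head(frozen_features, labels, num_classes, epochs=100, lr=0.1):
    """frozen_features: (N, D) tensor from a pretrained backbone; labels: (N,)."""
    head = nn.Linear(frozen_features.shape[1], num_classes)
    optimizer = torch.optim.SGD(head.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(head(frozen_features), labels)
        loss.backward()
        optimizer.step()
    return head

# Example: five labelled samples for a newly introduced vehicle class (placeholder data).
feats = torch.randn(5, 512)
labels = torch.zeros(5, dtype=torch.long)
head = fit_new_class_head(feats, labels, num_classes=2)
```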
Frontiers in Computational Neuroscience | 2015
Elissa Aminoff; Mariya Toneva; Abhinav Shrivastava; Xinlei Chen; Ishan Misra; Abhinav Gupta; Michael J. Tarr
How do we understand the complex patterns of neural responses that underlie scene understanding? Studies of the network of brain regions held to be scene-selective, namely the parahippocampal/lingual region (PPA), the retrosplenial complex (RSC), and the occipital place area (TOS), have typically focused on single visual dimensions (e.g., size), rather than the high-dimensional feature space in which scenes are likely to be neurally represented. Here we leverage well-specified artificial vision systems to explicate a more complex understanding of how scenes are encoded in this functional network. We correlated similarity matrices within three different scene-spaces arising from: (1) BOLD activity in scene-selective brain regions; (2) behaviorally measured judgments of visually-perceived scene similarity; and (3) several different computer vision models. These correlations revealed: (1) models that relied on mid- and high-level scene attributes showed the highest correlations with the patterns of neural activity within the scene-selective network; (2) NEIL and SUN, the models that best accounted for the patterns obtained from PPA and TOS, were different from the GIST model that best accounted for the pattern obtained from RSC; (3) the best-performing models outperformed behaviorally-measured judgments of scene similarity in accounting for neural data. One computer vision method, NEIL (“Never-Ending-Image-Learner”), which incorporates visual features learned as statistical regularities across web-scale numbers of scenes, showed significant correlations with neural activity in all three scene-selective regions and was one of the two models best able to account for variance in the PPA and TOS. We suggest that these results are a promising first step in explicating more fine-grained models of neural scene understanding, including developing a clearer picture of the division of labor among the components of the functional scene-selective brain network.
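The core comparison is a representational-similarity analysis: correlate the off-diagonal entries of two scene-by-scene similarity matrices, e.g., one derived from BOLD activity and one from a computer vision model. The sketch below uses Spearman correlation and random placeholder data as assumptions; the exact correlation measure is not specified in the abstract.

```python
import numpy as np
from scipy.stats import spearmanr

def similarity_matrix_correlation(sim_a, sim_b):
    """Spearman correlation between the off-diagonal entries of two
    symmetric scene-similarity matrices."""
    iu = np.triu_indices_from(sim_a, k=1)      # upper triangle, no diagonal
    rho, p_value = spearmanr(sim_a[iu], sim_b[iu])
    return rho, p_value

n_scenes = 100
neural_sim = np.corrcoef(np.random.randn(n_scenes, 50))    # placeholder BOLD response patterns
model_sim = np.corrcoef(np.random.randn(n_scenes, 4096))   # placeholder CNN features
print(similarity_matrix_correlation(neural_sim, model_sim))
```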
Computer Vision and Pattern Recognition | 2016
Ishan Misra; Abhinav Shrivastava; Abhinav Gupta; Martial Hebert
Computer Vision and Pattern Recognition | 2016
Ishan Misra; C. Lawrence Zitnick; Margaret Mitchell; Ross B. Girshick
International Conference on Computer Vision | 2017
Debidatta Dwibedi; Ishan Misra; Martial Hebert
arXiv: Computer Vision and Pattern Recognition | 2016
Ishan Misra; C. Lawrence Zitnick; Martial Hebert