Publications


Featured research published by Olga Russakovsky.


European Conference on Computer Vision | 2012

Object-centric spatial pooling for image classification

Olga Russakovsky; Yuanqing Lin; Kai Yu; Li Fei-Fei

Spatial pyramid matching (SPM) based pooling has been the dominant choice for state-of-the-art image classification systems. In contrast, we propose a novel object-centric spatial pooling (OCP) approach, following the intuition that knowing the location of the object of interest can be useful for image classification. OCP consists of two steps: (1) inferring the location of the objects, and (2) using the location information to pool foreground and background features separately to form the image-level representation. Step (1) is particularly challenging in a typical classification setting where precise object location annotations are not available during training. To address this challenge, we propose a framework that learns object detectors using only image-level class labels, or so-called weak labels. We validate our approach on the challenging PASCAL07 dataset. Our learned detectors are comparable in accuracy with state-of-the-art weakly supervised detection methods. More importantly, the resulting OCP approach significantly outperforms SPM-based pooling in image classification.
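
The pooling step (2) is easy to picture in code. Below is a minimal NumPy sketch of object-centric pooling over a grid of local descriptors, assuming an object box has already been inferred; all names and shapes are illustrative, not the paper's implementation.

```python
import numpy as np

def object_centric_pool(features, box):
    """Pool foreground and background features separately.

    features: (H, W, D) array of local descriptors on a spatial grid.
    box: (top, left, bottom, right) inferred object location, in grid cells.
    Returns a concatenated image representation of size 2*D.
    """
    H, W, D = features.shape
    mask = np.zeros((H, W), dtype=bool)
    t, l, b, r = box
    mask[t:b, l:r] = True

    fg = features[mask]    # cells inside the inferred object box
    bg = features[~mask]   # cells outside the box

    # Average-pool each region; fall back to zeros if a region is empty.
    fg_pool = fg.mean(axis=0) if fg.size else np.zeros(D)
    bg_pool = bg.mean(axis=0) if bg.size else np.zeros(D)
    return np.concatenate([fg_pool, bg_pool])

# Example: a 10x10 grid of 128-d descriptors, object box near the centre.
rep = object_centric_pool(np.random.rand(10, 10, 128), (3, 3, 8, 8))
print(rep.shape)  # (256,)
```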


Computer Vision and Pattern Recognition | 2016

End-to-End Learning of Action Detection from Frame Glimpses in Videos

Serena Yeung; Olga Russakovsky; Greg Mori; Li Fei-Fei

In this work we introduce a fully end-to-end approach for action detection in videos that learns to directly predict the temporal bounds of actions. Our intuition is that the process of detecting actions is naturally one of observation and refinement: observing moments in video, and refining hypotheses about when an action is occurring. Based on this insight, we formulate our model as a recurrent neural network-based agent that interacts with a video over time. The agent observes video frames and decides both where to look next and when to emit a prediction. Since backpropagation is not adequate in this non-differentiable setting, we use REINFORCE to learn the agent's decision policy. Our model achieves state-of-the-art results on the THUMOS14 and ActivityNet datasets while observing only a fraction (2% or less) of the video frames.
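
The abstract names REINFORCE as the learning rule for the non-differentiable glimpse decisions. Below is a minimal PyTorch sketch of a single REINFORCE update for a toy glimpse policy; the tiny network, observation, and reward are placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

# Hypothetical glimpse policy: maps an observation vector to a distribution
# over 10 candidate frame offsets ("where to look next").
policy = nn.Sequential(nn.Linear(64, 32), nn.Tanh(), nn.Linear(32, 10))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

obs = torch.randn(1, 64)                 # placeholder observation feature
dist = Categorical(logits=policy(obs))   # distribution over next glimpses
action = dist.sample()                   # sampling is non-differentiable
reward = 1.0                             # placeholder detection reward

# REINFORCE: raise the log-probability of actions that earned reward.
loss = (-dist.log_prob(action) * reward).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

In practice a baseline is usually subtracted from the reward to reduce gradient variance, but the update above is the core of the policy-gradient step.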


European Conference on Computer Vision | 2010

Attribute learning in large-scale datasets

Olga Russakovsky; Li Fei-Fei

We consider the task of learning visual connections between object categories using the ImageNet dataset, a large-scale ontology containing more than 15 thousand object classes. We want to discover visual relationships between classes that the ontology currently lacks, such as shared colors, shapes, or textures. In this work we learn 20 visual attributes and use them in a zero-shot transfer learning experiment, as well as to make visual connections between semantically unrelated object categories.
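
As a rough illustration of how learned attributes enable zero-shot transfer, here is a NumPy sketch in the spirit of direct attribute prediction; the attribute signatures and probabilities are invented for the example and are not taken from the paper.

```python
import numpy as np

# Hypothetical binary attribute signatures for unseen classes
# (rows: classes, columns: attributes such as "red", "striped", "round").
signatures = np.array([
    [1, 0, 1],   # class A: red, round
    [0, 1, 0],   # class B: striped
    [1, 1, 0],   # class C: red, striped
])

def zero_shot_predict(attr_probs, signatures):
    """Score each unseen class by how well the predicted attribute
    probabilities match its signature, then pick the best class."""
    # Likelihood of each signature under independent attribute predictions.
    scores = np.prod(np.where(signatures == 1, attr_probs, 1 - attr_probs),
                     axis=1)
    return scores.argmax()

attr_probs = np.array([0.9, 0.2, 0.8])  # outputs of learned attribute classifiers
print(zero_shot_predict(attr_probs, signatures))  # -> 0 (class A)
```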


European Conference on Computer Vision | 2016

What’s the Point: Semantic Segmentation with Point Supervision

Amy Bearman; Olga Russakovsky; Vittorio Ferrari; Li Fei-Fei

The semantic image segmentation task presents a trade-off between test time accuracy and training time annotation cost. Detailed per-pixel annotations enable training accurate models but are very time-consuming to obtain; image-level class labels are an order of magnitude cheaper but result in less accurate models. We take a natural step from image-level annotation towards stronger supervision: we ask annotators to point to an object if one exists. We incorporate this point supervision along with a novel objectness potential in the training loss function of a CNN model. Experimental results on the PASCAL VOC 2012 benchmark reveal that the combined effect of point-level supervision and objectness potential yields an improvement of 12.9% mIOU over image-level supervision. Further, we demonstrate that models trained with point-level supervision are more accurate than models trained with image-level, squiggle-level or full supervision given a fixed annotation budget.
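
Below is a minimal PyTorch sketch of the kind of training loss described, combining cross-entropy at the clicked points with an objectness term. The exact form of the paper's objectness potential differs; all shapes, conventions, and names here are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def point_supervised_loss(logits, points, objectness, obj_weight=1.0):
    """Illustrative loss combining point supervision with an objectness prior.

    logits:     (C, H, W) per-pixel class scores from the CNN.
    points:     list of (y, x, class_id) annotator clicks.
    objectness: (H, W) prior probability that each pixel lies on *some*
                object (assumed precomputed; class 0 is background here).
    """
    log_probs = F.log_softmax(logits, dim=0)

    # Supervised term: cross-entropy at the clicked pixels only.
    point_loss = -sum(log_probs[c, y, x] for y, x, c in points) / max(len(points), 1)

    # Objectness term: likely-background pixels should score class 0,
    # likely-object pixels should score any non-background class.
    log_p_bg = log_probs[0]                              # (H, W)
    log_p_fg = torch.logsumexp(log_probs[1:], dim=0)     # (H, W)
    obj_loss = -(objectness * log_p_fg + (1 - objectness) * log_p_bg).mean()

    return point_loss + obj_weight * obj_loss

loss = point_supervised_loss(torch.randn(21, 64, 64, requires_grad=True),
                             [(10, 12, 5), (40, 33, 9)],
                             torch.rand(64, 64))
loss.backward()
```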


International Journal of Computer Vision | 2018

Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos

Serena Yeung; Olga Russakovsky; Ning Jin; Mykhaylo Andriluka; Greg Mori; Li Fei-Fei

Every moment counts in action recognition. A comprehensive understanding of human activity in video requires labeling every frame according to the actions occurring, placing multiple labels densely over a video sequence. To study this problem we extend the existing THUMOS dataset and introduce MultiTHUMOS, a new dataset of dense labels over unconstrained internet videos. Modeling multiple, dense labels benefits from temporal relations within and across classes. We define a novel variant of long short-term memory deep networks for modeling these temporal relations via multiple input and output connections. We show that this model improves action labeling accuracy and further enables deeper understanding tasks ranging from structured retrieval to action prediction.
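
A minimal PyTorch sketch of dense per-frame multi-label action labeling with a plain LSTM follows. The paper's model adds multiple input and output connections across time; this single-connection stand-in keeps only the core recurrent multi-label structure, and the sizes are illustrative (65 is MultiTHUMOS's action count).

```python
import torch
import torch.nn as nn

class DenseActionLabeler(nn.Module):
    """Per-frame multi-label action classifier over a video sequence.
    A plain single-connection LSTM stand-in for the paper's variant."""
    def __init__(self, feat_dim=512, hidden=256, num_actions=65):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_actions)

    def forward(self, frames):           # frames: (B, T, feat_dim)
        h, _ = self.lstm(frames)         # (B, T, hidden)
        return self.head(h)              # per-frame logits, (B, T, num_actions)

model = DenseActionLabeler()
frames = torch.randn(2, 100, 512)                    # two clips, 100 frames each
labels = torch.randint(0, 2, (2, 100, 65)).float()   # dense multi-label targets
loss = nn.functional.binary_cross_entropy_with_logits(model(frames), labels)
loss.backward()
```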


International Conference on Computer Vision | 2013

Detecting Avocados to Zucchinis: What Have We Done, and Where Are We Going?

Olga Russakovsky; Jia Deng; Zhiheng Huang; Alexander C. Berg; Li Fei-Fei

The growth of detection datasets and the multiple directions of object detection research provide both an unprecedented need and a great opportunity for a thorough evaluation of the current state of the field of categorical object detection. In this paper we strive to answer two key questions. First, where are we currently as a field: what have we done right, what still needs to be improved? Second, where should we be going in designing the next generation of object detectors? Inspired by the recent work of Hoiem et al. on the standard PASCAL VOC detection dataset, we perform a large-scale study on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) data. First, we quantitatively demonstrate that this dataset provides many of the same detection challenges as the PASCAL VOC. Due to its scale of 1000 object categories, ILSVRC also provides an excellent test bed for understanding the performance of detectors as a function of several key properties of the object classes. We conduct a series of analyses looking at how different detection methods perform on a number of image-level and object-class-level properties such as texture, color, deformation, and clutter. We draw important lessons about current object detection methods and propose a number of insights for designing the next generation of object detectors.
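
The per-property analysis boils down to grouping classes by a property value and comparing average AP. A small illustrative sketch, with invented class names and numbers:

```python
from collections import defaultdict

def mean_ap_by_property(ap_per_class, property_per_class):
    """Average detection AP over classes sharing a property value,
    e.g. 'textured' vs 'untextured' object classes."""
    groups = defaultdict(list)
    for cls, ap in ap_per_class.items():
        groups[property_per_class[cls]].append(ap)
    return {value: sum(aps) / len(aps) for value, aps in groups.items()}

# Hypothetical numbers, for illustration only.
ap = {"zebra": 0.62, "avocado": 0.31, "zucchini": 0.28, "tiger": 0.59}
texture = {"zebra": "textured", "tiger": "textured",
           "avocado": "untextured", "zucchini": "untextured"}
print(mean_ap_by_property(ap, texture))
# {'textured': 0.605, 'untextured': 0.295}
```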


Computer Vision and Pattern Recognition | 2015

Best of both worlds: Human-machine collaboration for object annotation

Olga Russakovsky; Li-Jia Li; Li Fei-Fei

The long-standing goal of localizing every object in an image remains elusive. Manually annotating objects is quite expensive despite crowd engineering innovations. Current state-of-the-art automatic object detectors can accurately detect at most a few objects per image. This paper brings together the latest advancements in object detection and in crowd engineering into a principled framework for accurately and efficiently localizing objects in images. The input to the system is an image to annotate and a set of annotation constraints: desired precision, utility and/or human cost of the labeling. The output is a set of object annotations, informed by human feedback and computer vision. Our model seamlessly integrates multiple computer vision models with multiple sources of human input in a Markov Decision Process. We empirically validate the effectiveness of our human-in-the-loop labeling approach on the ILSVRC2014 object detection dataset.
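
A toy sketch of the human-in-the-loop decision logic follows: accept confident detections, spend human budget on uncertain ones. The paper solves this as a Markov Decision Process over multiple vision models and input types; the greedy loop below is only a simplified stand-in, and `detector` and `ask_human` are hypothetical interfaces.

```python
def annotate(image, detector, ask_human, accept_threshold, budget):
    """Toy human-in-the-loop annotation loop (not the paper's MDP solver).

    Greedily sends low-confidence detections to a human verifier while
    budget remains; high-confidence ones are accepted automatically.
    """
    annotations, cost = [], 0.0
    for box, score in detector(image):
        if score >= accept_threshold:
            annotations.append(box)          # trust the detector
        elif cost < budget:
            cost += 1.0                      # one unit of human effort
            if ask_human(image, box):        # human verifies the box
                annotations.append(box)
        # else: drop the uncertain detection, budget exhausted
    return annotations, cost
```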


Computer Vision and Pattern Recognition | 2010

A Steiner tree approach to efficient object detection

Olga Russakovsky; Andrew Y. Ng

We propose an approach to speeding up object detection, with an emphasis on settings where multiple object classes are being detected. Our method uses a segmentation algorithm to select a small number of image regions on which to run a classifier. Compared to the classical sliding window approach, this results in a significantly smaller number of rectangles examined, and thus significantly faster object detection. Further, in the multiple object class setting, we show that the computational cost of proposing candidate regions can be amortized across object classes, resulting in an additional speedup. At the heart of our approach is a reduction to a directed Steiner tree optimization problem, which we solve approximately in order to select the segmentation algorithm parameters. The solution gives a small set of segmentation strategies that can be shared across object classes. Compared to the sliding window approach, our method results in two orders of magnitude fewer regions considered, and significant (10–15x) running time speedups on challenging object detection datasets (LabelMe and StreetScenes) while maintaining comparable detection accuracy.
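
The paper selects segmentation parameters via an approximate directed Steiner tree solution. As a rough intuition for what "sharing strategies across classes" means, here is a greedy set-cover-style stand-in (explicitly not the paper's algorithm; the strategy names and coverage sets are invented):

```python
def select_strategies(covers, classes):
    """Greedy stand-in for shared segmentation strategy selection:
    pick few strategies whose regions cover every object class.

    covers: {strategy_name: set of class names it segments well}
    """
    remaining, chosen = set(classes), []
    while remaining:
        # Pick the strategy covering the most still-uncovered classes.
        best = max(covers, key=lambda s: len(covers[s] & remaining))
        if not covers[best] & remaining:
            break                  # no strategy helps; give up on the rest
        chosen.append(best)
        remaining -= covers[best]
    return chosen

covers = {"fine":   {"mug", "phone"},
          "coarse": {"car", "sofa"},
          "medium": {"phone", "car", "chair"}}
print(select_strategies(covers, ["mug", "phone", "car", "sofa", "chair"]))
# ['medium', 'fine', 'coarse']
```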


International Conference on Robotics and Automation | 2010

Autonomous operation of novel elevators for robot navigation

Ellen Klingbeil; Blake Carpenter; Olga Russakovsky; Andrew Y. Ng

Although robot navigation in indoor environments has achieved great success, robots are unable to fully navigate these spaces without the ability to operate elevators, including those which the robot has not seen before. In this paper, we focus on the key challenge of autonomous interaction with an unknown elevator button panel. A number of factors, such as lack of useful 3D features, variety of elevator panel designs, variation in lighting conditions, and small size of elevator buttons, render this goal quite difficult. To address the task of detecting, localizing, and labeling the buttons, we use state-of-the-art vision algorithms along with machine learning techniques to take advantage of contextual features. To verify our approach, we collected a dataset of 150 pictures of elevator panels from more than 60 distinct elevators, and performed extensive offline testing. On this very diverse dataset, our algorithm succeeded in correctly localizing and labeling 86.2% of the buttons. Using a mobile robot platform, we then validate our algorithms in experiments where, using only its on-board sensors, the robot autonomously interprets the panel and presses the appropriate button in elevators never seen before by the robot. In a total of 14 trials performed on 3 different elevators, our robot succeeded in localizing the requested button in all 14 trials and in pressing it correctly in 13 of the 14 trials.
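
One plausible way to exploit contextual features, sketched below purely for illustration: rescore button candidates by how well they align into the rows and columns typical of elevator panels. This is an assumption about the flavor of context used, not the paper's method, and the tolerances and scores are made up.

```python
def rescore_with_context(candidates, tol=5.0, bonus=0.1):
    """Illustrative contextual rescoring for button detection:
    a candidate gains confidence for every other candidate that roughly
    shares its row or column, since elevator buttons form grids.

    candidates: list of (x, y, score) detections in image coordinates.
    """
    rescored = []
    for x, y, score in candidates:
        aligned = sum(1 for x2, y2, _ in candidates
                      if (abs(x - x2) < tol or abs(y - y2) < tol)
                      and (x2, y2) != (x, y))
        rescored.append((x, y, score + bonus * aligned))
    return rescored

panel = [(20, 10, 0.7), (20, 40, 0.6), (20, 70, 0.4), (90, 90, 0.5)]
for x, y, s in rescore_with_context(panel):
    print(x, y, round(s, 2))   # the isolated (90, 90) candidate gains nothing
```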


Computer Vision and Pattern Recognition | 2017

Predictive-Corrective Networks for Action Detection

Achal Dave; Olga Russakovsky; Deva Ramanan

While deep feature learning has revolutionized techniques for static-image understanding, the same does not quite hold for video processing. Architectures and optimization techniques used for video are largely based on those for static images, potentially underutilizing rich video information. In this work, we rethink both the underlying network architecture and the stochastic learning paradigm for temporal data. To do so, we draw inspiration from classic theory on linear dynamic systems for modeling time series. By extending such models to include nonlinear mappings, we derive a series of novel recurrent neural networks that sequentially make top-down predictions about the future and then correct those predictions with bottom-up observations. Predictive-corrective networks have a number of desirable properties: (1) they can adaptively focus computation on surprising frames where predictions require large corrections, (2) they simplify learning in that only residual-like corrective terms need to be learned over time and (3) they naturally decorrelate an input data stream in a hierarchical fashion, producing a more reliable signal for learning at each layer of a network. We provide an extensive analysis of our lightweight and interpretable framework, and demonstrate that our model is competitive with the two-stream network on three challenging datasets without the need for computationally expensive optical flow.
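
A minimal PyTorch sketch of the predictive-corrective idea at a single layer: carry the previous activation forward as the top-down prediction and add a learned correction computed from the bottom-up frame difference. Layer sizes and the exact form of the correction function are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class PredictiveCorrective(nn.Module):
    """One predictive-corrective layer (simplified sketch).

    Prediction: reuse the previous time step's activation.
    Correction: a conv applied to the bottom-up frame difference.
    """
    def __init__(self, in_ch=3, out_ch=16):
        super().__init__()
        self.init = nn.Conv2d(in_ch, out_ch, 3, padding=1)      # first frame
        self.correct = nn.Conv2d(in_ch, out_ch, 3, padding=1)   # residual path

    def forward(self, frames):                 # frames: (T, C, H, W)
        outputs = [self.init(frames[0:1])]     # initialize from frame 0
        for t in range(1, frames.shape[0]):
            residual = frames[t:t+1] - frames[t-1:t]  # bottom-up observation
            outputs.append(outputs[-1] + self.correct(residual))
        return torch.cat(outputs, dim=0)       # (T, out_ch, H, W)

feats = PredictiveCorrective()(torch.randn(8, 3, 32, 32))
print(feats.shape)  # torch.Size([8, 16, 32, 32])
```

Note how a static video yields zero residuals, so computation beyond the first frame reduces to reusing the prediction; surprising frames produce large corrections.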

Collaboration


Dive into Olga Russakovsky's collaborations.

Top Co-Authors

Abhinav Gupta, Carnegie Mellon University
Greg Mori, Simon Fraser University
Alexander C. Berg, University of North Carolina at Chapel Hill
Deva Ramanan, Carnegie Mellon University
Jia Deng, University of Michigan