
Publication


Featured research published by Deva Ramanan.


computer vision and pattern recognition | 2017

Finding Tiny Faces

Peiyun Hu; Deva Ramanan

Though tremendous strides have been made in object recognition, one of the remaining open challenges is detecting small objects. We explore three aspects of the problem in the context of finding small faces: the role of scale invariance, image resolution, and contextual reasoning. While most recognition approaches aim to be scale-invariant, the cues for recognizing a 3px tall face are fundamentally different from those for recognizing a 300px tall face. We take a different approach and train separate detectors for different scales. To maintain efficiency, detectors are trained in a multi-task fashion: they make use of features extracted from multiple layers of a single (deep) feature hierarchy. While training detectors for large objects is straightforward, the crucial challenge remains training detectors for small objects. We show that context is crucial, and define templates that make use of massively-large receptive fields (where 99% of the template extends beyond the object of interest). Finally, we explore the role of scale in pre-trained deep networks, providing ways to extrapolate networks tuned for limited scales to rather extreme ranges. We demonstrate state-of-the-art results on massively-benchmarked face datasets (FDDB and WIDER FACE). In particular, when compared to prior art on WIDER FACE, our results reduce error by a factor of 2 (our models produce an AP of 82% while prior art ranges from 29-64%).
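The scale-specific detector idea can be made concrete with a short sketch. Below is a minimal PyTorch illustration, assuming a toy two-stage backbone; the class name MultiScaleFaceDetector and all layer sizes and scale factors are hypothetical, not the authors' released architecture. It shows the two ingredients the abstract describes: per-scale prediction heads trained jointly on features drawn from multiple layers of a single shared hierarchy, plus an image pyramid to reach extreme scales.

```python
# A minimal sketch of scale-specific detection with shared multi-layer
# features, loosely following the ideas in "Finding Tiny Faces".
# All layer sizes, scale factors, and names are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFaceDetector(nn.Module):
    def __init__(self, num_scales=3):
        super().__init__()
        # Shared (deep) feature hierarchy: two stages of a toy backbone.
        self.stage1 = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                                    nn.MaxPool2d(2))
        self.stage2 = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
                                    nn.MaxPool2d(2))
        # One lightweight 1x1-conv head per target scale, trained jointly
        # ("multi-task") on features pooled from both stages.
        self.heads = nn.ModuleList(
            [nn.Conv2d(32 + 64, 1, 1) for _ in range(num_scales)])

    def forward(self, x):
        f1 = self.stage1(x)                   # fine, high-resolution features
        f2 = self.stage2(f1)                  # coarse, more contextual features
        f2_up = F.interpolate(f2, size=f1.shape[-2:], mode="bilinear",
                              align_corners=False)
        feats = torch.cat([f1, f2_up], dim=1) # combine layers of the hierarchy
        # One heatmap of face scores per scale-specific template.
        return [head(feats) for head in self.heads]

detector = MultiScaleFaceDetector()
image = torch.randn(1, 3, 256, 256)
# Extreme scales are additionally covered by an image pyramid.
for factor in (0.5, 1.0, 2.0):
    scaled = F.interpolate(image, scale_factor=factor, mode="bilinear",
                           align_corners=False)
    heatmaps = detector(scaled)
```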


computer vision and pattern recognition | 2017

3D Human Pose Estimation = 2D Pose Estimation + Matching

Ching-Hang Chen; Deva Ramanan

We explore 3D human pose estimation from a single RGB image. While many approaches try to directly predict 3D pose from image measurements, we explore a simple architecture that reasons through intermediate 2D pose predictions. Our approach is based on two key observations: (1) deep neural nets have revolutionized 2D pose estimation, producing accurate 2D predictions even for poses with self-occlusions; (2) big datasets of 3D mocap data are now readily available, making it tempting to lift predicted 2D poses to 3D through simple memorization (e.g., nearest neighbors). The resulting architecture is straightforward to implement with off-the-shelf 2D pose estimation systems and 3D mocap libraries. Importantly, we demonstrate that such methods outperform almost all state-of-the-art 3D pose estimation systems, most of which directly try to regress 3D pose from 2D measurements.
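The memorization step lends itself to a very small sketch. The NumPy code below, assuming an orthographic camera, 17 joints, and a randomly generated stand-in for the mocap library (all illustrative choices, not the paper's exact setup), lifts a 2D pose to 3D by nearest-neighbor matching against projected exemplars:

```python
# A minimal sketch of "lifting" a predicted 2D pose to 3D by nearest-neighbor
# matching against a mocap library. The orthographic camera, joint count, and
# random library are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
NUM_JOINTS = 17

# Hypothetical mocap library: N exemplar 3D poses (N x J x 3).
library_3d = rng.normal(size=(10000, NUM_JOINTS, 3))

def project_orthographic(pose_3d):
    """Drop depth: an orthographic projection of a 3D pose to 2D."""
    return pose_3d[..., :2]

def lift_2d_to_3d(pose_2d, library_3d):
    """Return the library 3D pose whose 2D projection best matches pose_2d."""
    projections = project_orthographic(library_3d)      # N x J x 2
    # Sum of squared joint distances to every exemplar's projection.
    errors = ((projections - pose_2d[None]) ** 2).sum(axis=(1, 2))
    return library_3d[np.argmin(errors)]

# A 2D pose as it might come from an off-the-shelf 2D pose estimator.
pose_2d = rng.normal(size=(NUM_JOINTS, 2))
pose_3d = lift_2d_to_3d(pose_2d, library_3d)
print(pose_3d.shape)  # (17, 3)
```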


computer vision and pattern recognition | 2017

ActionVLAD: Learning Spatio-Temporal Aggregation for Action Classification

Rohit Girdhar; Deva Ramanan; Abhinav Gupta; Josef Sivic; Bryan C. Russell

In this work, we introduce a new video representation for action classification that aggregates local convolutional features across the entire spatio-temporal extent of the video. We do so by integrating state-of-the-art two-stream networks [42] with learnable spatio-temporal feature aggregation [6]. The resulting architecture is end-to-end trainable for whole-video classification. We investigate different strategies for pooling across space and time and combining signals from the different streams. We find that: (i) it is important to pool jointly across space and time, but (ii) appearance and motion streams are best aggregated into their own separate representations. Finally, we show that our representation outperforms the two-stream base architecture by a large margin (13% relative) and outperforms other baselines with comparable base architectures on the HMDB51, UCF101, and Charades video classification benchmarks.
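A rough sketch of the learnable aggregation is below, assuming NetVLAD-style soft assignment; the feature dimensions, cluster count, and class name SpatioTemporalVLAD are illustrative, and the two-stream fusion (the paper finds appearance and motion are best given separate aggregations) is omitted. The key point it illustrates is pooling descriptor residuals jointly over all spatial positions and frames:

```python
# A minimal sketch of learnable VLAD-style aggregation pooled jointly over
# space and time, in the spirit of ActionVLAD. Dimensions are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatioTemporalVLAD(nn.Module):
    def __init__(self, dim=64, num_clusters=8):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_clusters, dim))
        self.assign = nn.Linear(dim, num_clusters)  # soft-assignment logits

    def forward(self, feats):
        # feats: (batch, time, height, width, dim) local conv features.
        b, t, h, w, d = feats.shape
        x = feats.reshape(b, t * h * w, d)           # pool jointly over space-time
        soft = F.softmax(self.assign(x), dim=-1)     # (b, n, K)
        # Residuals of every descriptor from every cluster center.
        resid = x.unsqueeze(2) - self.centers        # (b, n, K, d)
        vlad = (soft.unsqueeze(-1) * resid).sum(1)   # (b, K, d)
        vlad = F.normalize(vlad, dim=-1)             # intra-normalization
        return F.normalize(vlad.flatten(1), dim=-1)  # final descriptor (b, K*d)

pool = SpatioTemporalVLAD()
clip_feats = torch.randn(2, 5, 7, 7, 64)  # e.g. features from one stream
video_descriptor = pool(clip_feats)
print(video_descriptor.shape)  # torch.Size([2, 512])
```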


computer vision and pattern recognition | 2017

Growing a Brain: Fine-Tuning by Increasing Model Capacity

Yu-Xiong Wang; Deva Ramanan; Martial Hebert

CNNs have made an undeniable impact on computer vision through the ability to learn high-capacity models with large annotated training sets. One of their remarkable properties is the ability to transfer knowledge from a large source dataset to a (typically smaller) target dataset. This is usually accomplished through fine-tuning a fixed-size network on new target data. Indeed, virtually every contemporary visual recognition system makes use of fine-tuning to transfer knowledge from ImageNet. In this work, we analyze what components and parameters change during fine-tuning, and discover that increasing model capacity allows for more natural model adaptation through fine-tuning. By making an analogy to developmental learning, we demonstrate that growing a CNN with additional units, either by widening existing layers or deepening the overall network, significantly outperforms classic fine-tuning approaches. But in order to properly grow a network, we show that newly-added units must be appropriately normalized to allow for a pace of learning that is consistent with existing units. We empirically validate our approach on several benchmark datasets, producing state-of-the-art results.
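A minimal sketch of the growing step, for a single fully-connected layer in PyTorch: pretrained units are copied over unchanged and the new units are rescaled to match the average weight norm of the existing ones, so their learning pace is comparable. The specific normalization rule and the helper widen_linear are illustrative assumptions, not the paper's exact scheme.

```python
# A minimal sketch of "growing" a network for fine-tuning by widening one
# layer and normalizing the new units to the scale of existing ones.
import torch
import torch.nn as nn

def widen_linear(layer, extra_units):
    """Return a wider copy of `layer` with `extra_units` new output units."""
    out_f, in_f = layer.weight.shape
    wider = nn.Linear(in_f, out_f + extra_units)
    with torch.no_grad():
        # Keep pretrained units untouched.
        wider.weight[:out_f] = layer.weight
        wider.bias[:out_f] = layer.bias
        # New units: random init, then rescale each new row to the average
        # norm of the existing rows (an assumed normalization rule).
        new_w = torch.randn(extra_units, in_f)
        target = layer.weight.norm(dim=1).mean()
        new_w *= target / new_w.norm(dim=1, keepdim=True)
        wider.weight[out_f:] = new_w
        wider.bias[out_f:] = 0.0
    return wider

pretrained = nn.Linear(512, 256)   # stands in for a pretrained layer
grown = widen_linear(pretrained, extra_units=64)
print(grown.weight.shape)          # torch.Size([320, 512])
```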


computer vision and pattern recognition | 2017

Predictive-Corrective Networks for Action Detection

Achal Dave; Olga Russakovsky; Deva Ramanan

While deep feature learning has revolutionized techniques for static-image understanding, the same does not quite hold for video processing. Architectures and optimization techniques used for video are largely based on those for static images, potentially underutilizing rich video information. In this work, we rethink both the underlying network architecture and the stochastic learning paradigm for temporal data. To do so, we draw inspiration from classic theory on linear dynamic systems for modeling time series. By extending such models to include nonlinear mappings, we derive a series of novel recurrent neural networks that sequentially make top-down predictions about the future and then correct those predictions with bottom-up observations. Predictive-corrective networks have a number of desirable properties: (1) they can adaptively focus computation on surprising frames where predictions require large corrections, (2) they simplify learning in that only residual-like corrective terms need to be learned over time, and (3) they naturally decorrelate an input data stream in a hierarchical fashion, producing a more reliable signal for learning at each layer of the network. We provide an extensive analysis of our lightweight and interpretable framework, and demonstrate that our model is competitive with the two-stream network on three challenging datasets without the need for computationally expensive optical flow.
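The predict-then-correct recurrence can be sketched in a few lines of PyTorch. The cell below, with hypothetical linear prediction/correction maps and dimensions, makes a top-down prediction of the next frame's features and then applies a learned function of the bottom-up residual; the adaptive-computation idea corresponds to skipping the update when the residual is small:

```python
# A minimal sketch of a predictive-corrective recurrent cell: predict the
# next frame's features top-down, then correct with the bottom-up
# observation. Layer sizes and maps are illustrative assumptions.
import torch
import torch.nn as nn

class PredictiveCorrectiveCell(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.predict = nn.Linear(dim, dim)   # top-down prediction of next state
        self.correct = nn.Linear(dim, dim)   # maps residual correction to state

    def forward(self, frames):
        # frames: (time, batch, dim) per-frame features.
        state = frames[0]
        states = [state]
        for obs in frames[1:]:
            pred = self.predict(state)
            residual = obs - pred            # bottom-up corrective term
            # Computation could be skipped on unsurprising frames, e.g.
            # when residual.norm() is small; here we always correct.
            state = pred + self.correct(residual)
            states.append(state)
        return torch.stack(states)

cell = PredictiveCorrectiveCell()
video = torch.randn(10, 2, 128)  # 10 frames, batch of 2
out = cell(video)
print(out.shape)  # torch.Size([10, 2, 128])
```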


computer vision and pattern recognition | 2017

Expecting the Unexpected: Training Detectors for Unusual Pedestrians with Adversarial Imposters

Shiyu Huang; Deva Ramanan

As autonomous vehicles become an every-day reality, high-accuracy pedestrian detection is of paramount practical importance. Pedestrian detection is a highly researched topic with mature methods, but most datasets (for both training and evaluation) focus on common scenes of people engaged in typical walking poses on sidewalks. But performance is most crucial for dangerous scenarios that are rarely observed, such as children playing in the street and people using bicycles/skateboards in unexpected ways. Such in-the-tail data is notoriously hard to observe, making both training and testing difficult. To analyze this problem, we have collected a novel annotated dataset of dangerous scenarios called the Precarious Pedestrian dataset. Even given a dedicated collection effort, it is relatively small by contemporary standards (≈ 1000 images). To explore large-scale data-driven learning, we explore the use of synthetic data generated by a game engine. A significant challenge is selecting the right priors or parameters for synthesis: we would like realistic data with realistic poses and object configurations. Inspired by Generative Adversarial Networks, we generate a massive amount of synthetic data and train a discriminative classifier to select a realistic subset (that fools the classifier), which we deem Synthetic Imposters. We demonstrate that this pipeline allows one to generate realistic (or adversarial) training data by making use of rendering/animation engines. Interestingly, we also demonstrate that such data can be used to rank algorithms, suggesting that Synthetic Imposters can also be used for in-the-tail validation at test-time, a notoriously difficult challenge for real-world deployment.
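The imposter-selection step can be sketched with any off-the-shelf classifier. The code below, using scikit-learn logistic regression on random stand-in feature vectors (the features, dimensions, and 10% selection fraction are all illustrative assumptions), trains a real-vs-synthetic discriminator and keeps the synthetic samples that best fool it:

```python
# A minimal sketch of selecting "Synthetic Imposters": train a classifier to
# separate real from synthetic feature vectors, then keep the synthetic
# samples it scores most "real" (i.e. those that fool it).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
real = rng.normal(loc=0.0, size=(1000, 32))       # features of real images
synthetic = rng.normal(loc=0.5, size=(5000, 32))  # features of rendered images

X = np.vstack([real, synthetic])
y = np.concatenate([np.ones(len(real)), np.zeros(len(synthetic))])  # 1 = real

clf = LogisticRegression(max_iter=1000).fit(X, y)

# Probability that each synthetic sample is "real" under the classifier.
p_real = clf.predict_proba(synthetic)[:, 1]
keep = np.argsort(p_real)[-len(synthetic) // 10:]  # top 10% most realistic
imposters = synthetic[keep]
print(imposters.shape)  # (500, 32)
```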


international conference on computer vision | 2017

Learning Policies for Adaptive Tracking with Deep Feature Cascades

Chen Huang; Simon Lucey; Deva Ramanan


neural information processing systems | 2017

Attentional Pooling for Action Recognition

Rohit Girdhar; Deva Ramanan


international conference on computer vision | 2017

Tracking as Online Decision-Making: Learning a Policy from Streaming Videos with Reinforcement Learning

James Steven Supancic; Deva Ramanan


neural information processing systems | 2017

Learning to Model the Tail

Yu-Xiong Wang; Deva Ramanan; Martial Hebert

Collaboration


Dive into Deva Ramanan's collaborations.

Top Co-Authors

Peiyun Hu (Carnegie Mellon University)
Aayush Bansal (Carnegie Mellon University)
Martial Hebert (Carnegie Mellon University)
Rohit Girdhar (Carnegie Mellon University)
Simon Lucey (Carnegie Mellon University)
Yaser Sheikh (Carnegie Mellon University)
Yu-Xiong Wang (Carnegie Mellon University)
A. R. Dhamija (University of Colorado Colorado Springs)
Abhinav Gupta (Carnegie Mellon University)