
Publication


Featured research published by Silvio Savarese.


Computer Vision and Pattern Recognition | 2011

Recognizing human actions by attributes

Jingen Liu; Benjamin Kuipers; Silvio Savarese

In this paper we explore the idea of using high-level semantic concepts, also called attributes, to represent human actions from videos and argue that attributes enable the construction of more descriptive models for human action recognition. We propose a unified framework wherein manually specified attributes are: i) selected in a discriminative fashion so as to account for intra-class variability; ii) coherently integrated with data-driven attributes to make the attribute set more descriptive. Data-driven attributes are automatically inferred from the training data using an information theoretic approach. Our framework is built upon a latent SVM formulation where latent variables capture the degree of importance of each attribute for each action class. We also demonstrate that our attribute-based action representation can be effectively used to design a recognition procedure for classifying novel action classes for which no training samples are available. We test our approach on several publicly available datasets and obtain promising results that quantitatively demonstrate our theoretical claims.
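To make the information-theoretic selection of data-driven attributes concrete, here is a minimal Python sketch (the toy data and all names are hypothetical, not the authors' code): candidate binary attributes are scored by their empirical mutual information with the action labels, and the most informative ones are kept.

```python
import numpy as np

def mutual_information(a, y):
    """Empirical mutual information between a binary attribute a and labels y."""
    a, y = np.asarray(a), np.asarray(y)
    mi = 0.0
    for av in np.unique(a):
        for yv in np.unique(y):
            p_joint = np.mean((a == av) & (y == yv))
            if p_joint > 0:
                mi += p_joint * np.log(p_joint / (np.mean(a == av) * np.mean(y == yv)))
    return mi

# Toy data: 200 clips, 10 candidate data-driven attributes, 4 action classes.
rng = np.random.default_rng(0)
labels = rng.integers(0, 4, size=200)
candidates = rng.integers(0, 2, size=(200, 10))
candidates[:, 0] = (labels > 1).astype(int)   # one genuinely informative attribute

scores = [mutual_information(candidates[:, j], labels) for j in range(10)]
selected = np.argsort(scores)[::-1][:3]        # keep the most informative attributes
print("selected data-driven attributes:", selected)
```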


Computer Vision and Pattern Recognition | 2011

Cross-view action recognition via view knowledge transfer

Jingen Liu; Mubarak Shah; Benjamin Kuipers; Silvio Savarese

In this paper, we present a novel approach to recognizing human actions from different views by view knowledge transfer. An action is originally modelled as a bag of visual-words (BoVW), which is sensitive to view changes. We argue that, as opposed to visual words, there exist some higher level features which can be shared across views and enable the connection of action models for different views. To discover these features, we use a bipartite graph to model two view-dependent vocabularies, then apply bipartite graph partitioning to co-cluster two vocabularies into visual-word clusters called bilingual-words (i.e., high-level features), which can bridge the semantic gap across view-dependent vocabularies. Consequently, we can transfer a BoVW action model into a bag-of-bilingual-words (BoBW) model, which is more discriminative in the presence of view changes. We tested our approach on the IXMAS data set and obtained very promising results. Moreover, to further fuse view knowledge from multiple views, we apply a Locally Weighted Ensemble scheme to dynamically weight transferred models based on the local distribution structure around each test example. This process can further improve the average recognition rate by about 7%.
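The bilingual-word construction can be sketched with off-the-shelf spectral co-clustering; the snippet below is an illustrative stand-in, not the authors' implementation, and the cross-view co-occurrence matrix is simulated.

```python
import numpy as np
from sklearn.cluster import SpectralCoclustering

rng = np.random.default_rng(0)
# Hypothetical co-occurrence counts between 50 view-A words and 60 view-B words,
# e.g. estimated from temporally aligned clips of the same actions in both views.
cooc = rng.poisson(1.0, size=(50, 60)).astype(float) + 1e-3

model = SpectralCoclustering(n_clusters=20, random_state=0)
model.fit(cooc)
# model.row_labels_ maps each view-A word to a bilingual word; likewise columns.

def to_bobw(bovw_hist, word_to_bilingual, n_bilingual=20):
    """Fold a view-dependent BoVW histogram into a bag of bilingual words."""
    bobw = np.zeros(n_bilingual)
    for word, count in enumerate(bovw_hist):
        bobw[word_to_bilingual[word]] += count
    return bobw

hist_a = rng.integers(0, 5, size=50)           # toy BoVW histogram from view A
print(to_bobw(hist_a, model.row_labels_))
```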


Computer Vision and Pattern Recognition | 2006

Discriminative Object Class Models of Appearance and Shape by Correlatons

Silvio Savarese; John Winn; Antonio Criminisi

This paper presents a new model of object classes which incorporates appearance and shape information jointly. Modeling object appearance by distributions of visual words has recently proven successful. Here, appearance-based models are augmented by capturing the spatial arrangement of visual words. Compact spatial modeling without loss of discrimination is achieved through the introduction of adaptive vector quantized correlograms, which we call correlatons. Efficiency is further improved by means of integral images. The robustness of our new models to geometric transformations, severe occlusions and missing information is also demonstrated. The accuracy of discrimination of the proposed models is assessed on existing databases with large numbers of object classes viewed under general conditions, and shown to outperform appearance-only models.
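As a rough illustration of the correlogram idea underlying correlatons, the toy version below counts visual-word co-occurrences per distance band directly, without the integral-image speedup or the adaptive quantization the paper describes.

```python
import numpy as np

def correlogram(word_map, n_words, radii=(1, 2, 4, 8)):
    """Co-occurrence of visual-word pairs as a function of spatial distance
    (a simplified correlogram; the paper quantizes such curves into correlatons)."""
    H, W = word_map.shape
    hist = np.zeros((n_words, n_words, len(radii)))
    ys, xs = np.indices((H, W))
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1)
    words = word_map.ravel()
    for i in range(len(coords)):
        d = np.abs(coords - coords[i]).max(axis=1)   # chessboard distance
        for r_idx, r in enumerate(radii):
            for w in words[d == r]:
                hist[words[i], w, r_idx] += 1
    return hist

rng = np.random.default_rng(0)
word_map = rng.integers(0, 5, size=(16, 16))   # toy map of visual-word ids per pixel
print(correlogram(word_map, n_words=5).shape)  # (5, 5, 4)
```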


Workshop on Applications of Computer Vision | 2014

Beyond PASCAL: A benchmark for 3D object detection in the wild

Yu Xiang; Roozbeh Mottaghi; Silvio Savarese

3D object detection and pose estimation methods have become popular in recent years since they can handle ambiguities in 2D images and also provide a richer description for objects compared to 2D object detectors. However, most of the datasets for 3D recognition are limited to a small number of images per category or are captured in controlled environments. In this paper, we contribute the PASCAL3D+ dataset, a novel and challenging dataset for 3D object detection and pose estimation. PASCAL3D+ augments 12 rigid categories of PASCAL VOC 2012 [4] with 3D annotations. Furthermore, more images are added for each category from ImageNet [3]. PASCAL3D+ images exhibit much more variability compared to the existing 3D datasets, and on average there are more than 3,000 object instances per category. We believe this dataset will provide a rich testbed to study 3D detection and pose estimation and will help to significantly push forward research in this area. We provide the results of variations of DPM [6] on our new dataset for object detection and viewpoint estimation in different scenarios, which can be used as baselines by the community. Our benchmark is available online at http://cvgl.stanford.edu/projects/pascal3d.
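Benchmarks of this kind are commonly scored by angular viewpoint error; the sketch below shows a generic azimuth-error metric as an illustration, not the official PASCAL3D+ evaluation protocol.

```python
import numpy as np

def azimuth_error(pred_deg, gt_deg):
    """Smallest angular difference between predicted and ground-truth azimuth."""
    d = np.abs(np.asarray(pred_deg) - np.asarray(gt_deg)) % 360
    return np.minimum(d, 360 - d)

pred = np.array([10, 170, 350])
gt   = np.array([350, 180, 20])
err = azimuth_error(pred, gt)
print(err)                           # [20 10 30]
print("Acc@30:", np.mean(err < 30))  # fraction within a 30-degree threshold
```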


European Conference on Computer Vision | 2012

A unified framework for multi-target tracking and collective activity recognition

Wongun Choi; Silvio Savarese

We present a coherent, discriminative framework for simultaneously tracking multiple people and estimating their collective activities. Instead of treating the two problems separately, our model is grounded in the intuition that a strong correlation exists between a person's motion, their activity, and the motion and activities of other nearby people. Instead of directly linking the solutions to these two problems, we introduce a hierarchy of activity types that creates a natural progression leading from a specific person's motion to the activity of the group as a whole. Our model is capable of jointly tracking multiple people, recognizing individual activities (atomic activities), the interactions between pairs of people (interaction activities), and finally the behavior of groups of people (collective activities). We also propose an algorithm for solving this otherwise intractable joint inference problem by combining belief propagation with a version of the branch and bound algorithm equipped with integer programming. Experimental results on challenging video datasets demonstrate our theoretical claims and indicate that our model achieves the best collective activity classification results to date.
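To see the shape of the joint inference problem, here is a toy energy over atomic and collective labels, solved by brute-force enumeration as a stand-in for the paper's belief propagation plus branch and bound; all quantities are simulated.

```python
import numpy as np
from itertools import product

# Toy joint model: each of 3 people has an atomic activity in {0,1} and the
# group has a collective activity in {0,1}; unary evidence terms plus a
# compatibility term linking atomic and collective labels.
rng = np.random.default_rng(0)
atomic_unary = rng.random((3, 2))     # per-person evidence
collective_unary = rng.random(2)      # scene-level evidence
compat = rng.random((2, 2))           # atomic/collective compatibility

best_score, best_assign = -np.inf, None
for collective in range(2):
    for atomics in product(range(2), repeat=3):
        score = collective_unary[collective]
        score += sum(atomic_unary[i, a] + compat[a, collective]
                     for i, a in enumerate(atomics))
        if score > best_score:
            best_score, best_assign = score, (collective, atomics)
print(best_assign, best_score)
```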


IEEE Workshop on Motion and Video Computing | 2008

Spatial-Temporal correlatons for unsupervised action classification

Silvio Savarese; Andrey DelPozo; Juan Carlos Niebles; Li Fei-Fei

Spatial-temporal local motion features have shown promising results in complex human action classification. Most of the previous works [6], [16], [21] treat these spatial-temporal features as a bag of video words, omitting any long-range, global information in either the spatial or temporal domain. Other ways of learning the temporal signature of motion tend to impose a fixed trajectory of the features or of parts of the human body returned by tracking algorithms. This leaves little flexibility for the algorithm to learn the optimal temporal pattern describing these motions. In this paper, we propose the use of spatial-temporal correlograms to encode flexible long-range temporal information into the spatial-temporal motion features. This results in a much richer description of human actions. We then apply an unsupervised generative model to learn different classes of human actions from these ST-correlograms. The KTH dataset, one of the most challenging and popular human action datasets, is used for experimental evaluation. Our algorithm achieves the highest classification accuracy reported for this dataset under an unsupervised learning scheme.
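A one-dimensional toy analogue of an ST-correlogram, counting video-word pairs at coarse temporal offsets (hypothetical code, not the paper's implementation):

```python
import numpy as np

def temporal_correlogram(word_seq, n_words, offset_bins=((1, 2), (3, 8), (9, 30))):
    """Counts of video-word pairs (w_t, w_{t+dt}) with dt falling in coarse
    temporal ranges -- a 1-D analogue of the paper's ST-correlograms."""
    hist = np.zeros((len(offset_bins), n_words, n_words))
    for b, (lo, hi) in enumerate(offset_bins):
        for dt in range(lo, hi + 1):
            for t in range(len(word_seq) - dt):
                hist[b, word_seq[t], word_seq[t + dt]] += 1
    return hist

rng = np.random.default_rng(0)
seq = rng.integers(0, 4, size=100)                  # toy video-word sequence
print(temporal_correlogram(seq, n_words=4).shape)   # (3, 4, 4)
```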


Computer Vision and Pattern Recognition | 2011

Learning context for collective activity recognition

Wongun Choi; Khuram Shahid; Silvio Savarese

In this paper we present a framework for the recognition of collective human activities. A collective activity is defined or reinforced by the existence of coherent behavior of individuals in time and space. We call such coherent behavior ‘crowd context’. Examples of collective activities are “queuing in a line” or “talking”. Following [7], we propose to recognize collective activities using the crowd context and introduce a new scheme for learning it automatically. Our scheme is constructed upon a Random Forest structure which randomly samples variable-volume spatio-temporal regions to pick the most discriminating attributes for classification. Unlike previous approaches, our algorithm automatically finds the optimal configuration of spatio-temporal bins, over which to sample the evidence, by randomization. This enables a methodology for modeling crowd context. We employ a 3D Markov Random Field to regularize the classification and localize collective activities in the scene. We demonstrate the flexibility and scalability of the proposed framework in a number of experiments and show that our method outperforms state-of-the-art action classification techniques [7, 19].
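A loose sketch of the crowd-context idea: histogram neighbours' atomic activities inside randomly sampled spatio-temporal volumes and feed the result to a random forest. Everything below (the feature layout, the toy people, the labels) is an illustrative assumption, not the paper's pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

def crowd_context_feature(anchor, people, volumes):
    """Histogram of neighbours' atomic activities inside randomly sampled
    spatio-temporal volumes around an anchor person."""
    feats = []
    for (dr, dt) in volumes:               # (spatial radius, temporal window)
        hist = np.zeros(3)                 # 3 atomic activity types
        for p in people:
            if (abs(p["t"] - anchor["t"]) <= dt
                    and np.hypot(p["x"] - anchor["x"], p["y"] - anchor["y"]) <= dr):
                hist[p["activity"]] += 1
        feats.append(hist)
    return np.concatenate(feats)

# Randomly sampled volumes, echoing the paper's randomization step.
volumes = [(rng.uniform(1, 5), rng.integers(1, 10)) for _ in range(4)]
people = [{"t": rng.integers(0, 20), "x": rng.uniform(0, 10),
           "y": rng.uniform(0, 10), "activity": rng.integers(0, 3)}
          for _ in range(200)]
X = np.stack([crowd_context_feature(p, people, volumes) for p in people])
y = rng.integers(0, 2, size=len(people))   # toy collective-activity labels
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print("train accuracy:", clf.score(X, y))
```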


Computer Vision and Pattern Recognition | 2016

Deep Metric Learning via Lifted Structured Feature Embedding

Hyun Oh Song; Yu Xiang; Stefanie Jegelka; Silvio Savarese

Learning the distance metric between pairs of examples is of great importance for learning and visual recognition. With the remarkable success of state-of-the-art convolutional neural networks, recent works [1, 31] have shown promising results on discriminatively training the networks to learn semantic feature embeddings where similar examples are mapped close to each other and dissimilar examples are mapped farther apart. In this paper, we describe an algorithm for taking full advantage of the training batches in neural network training by lifting the vector of pairwise distances within the batch to the matrix of pairwise distances. This step enables the algorithm to learn a state-of-the-art feature embedding by optimizing a novel structured prediction objective on the lifted problem. Additionally, we collected the Stanford Online Products dataset: 120k images of 23k classes of online products for metric learning. Our experiments on the CUB-200-2011 [37], CARS196 [19], and Stanford Online Products datasets demonstrate significant improvement over existing deep feature embedding methods at all embedding sizes tested, with the GoogLeNet [33] network. The source code and the dataset are available at: https://github.com/rksltnl/Deep-Metric-Learning-CVPR16.
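The lifted objective is compact enough to sketch. The PyTorch snippet below implements the batch loss as described (per positive pair, a log-sum-exp over both examples' negatives plus the positive distance, squared-hinged); the released code differs in details, so treat this as an approximation.

```python
import torch

def lifted_structured_loss(embeddings, labels, margin=1.0):
    """Lifted structured loss over all pairs in the batch (a sketch)."""
    d = torch.cdist(embeddings, embeddings)           # pairwise distance matrix
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    pos = torch.triu(same, diagonal=1)                # positive pairs (i < j)
    neg = ~same
    losses = []
    for i, j in pos.nonzero():
        neg_d = torch.cat([d[i][neg[i]], d[j][neg[j]]])   # negatives of both ends
        j_ij = torch.logsumexp(margin - neg_d, dim=0) + d[i, j]
        losses.append(torch.clamp(j_ij, min=0) ** 2)
    return torch.stack(losses).mean() / 2

emb = torch.randn(8, 16, requires_grad=True)          # toy batch of embeddings
labels = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
loss = lifted_structured_loss(emb, labels)
loss.backward()                                       # gradients flow to emb
print(float(loss))
```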


International Conference on Computer Vision | 2009

Learning a dense multi-view representation for detection, viewpoint classification and synthesis of object categories

Hao Su; Min Sun; Li Fei-Fei; Silvio Savarese

Recognizing object classes and their 3D viewpoints is an important problem in computer vision. Based on a part-based probabilistic representation [31], we propose a new 3D object class model that is capable of recognizing unseen views by pose estimation and synthesis. We achieve this by using a dense, multi-view representation of the viewing sphere parameterized by a triangular mesh of viewpoints. Each triangle of viewpoints can be morphed to synthesize new viewpoints. By incorporating 3D geometrical constraints, our model establishes explicit correspondences among object parts across viewpoints. We propose an incremental learning algorithm to train the generative model. A cellphone video clip of an object is first used to initialize model learning. Then the model is updated by a set of unsorted training images without viewpoint labels. We demonstrate the robustness of our model on object detection, viewpoint classification and synthesis tasks. In object detection, our model performs on par with or better than state-of-the-art algorithms on the Savarese et al. 2007 and PASCAL datasets. It outperforms all previous work in viewpoint classification and offers promising results in viewpoint synthesis.
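Viewpoint synthesis inside a triangle of the viewing-sphere mesh amounts to a barycentric blend of what is stored at the triangle's three vertices; here is a toy numpy version with a hypothetical part representation (2-D part locations).

```python
import numpy as np

def morph_viewpoint(parts_a, parts_b, parts_c, bary):
    """Barycentric blend of part locations stored at the three vertices of a
    viewpoint triangle -- a toy analogue of the paper's viewpoint morphing."""
    a, b, c = bary
    assert abs(a + b + c - 1.0) < 1e-9    # barycentric weights must sum to 1
    return a * parts_a + b * parts_b + c * parts_c

rng = np.random.default_rng(0)
# 2-D locations of 5 object parts as seen from three nearby viewpoints.
va, vb, vc = (rng.random((5, 2)) for _ in range(3))
print(morph_viewpoint(va, vb, vc, bary=(0.5, 0.3, 0.2)))
```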


European Conference on Computer Vision | 2016

3D-R2N2: A Unified Approach for Single and Multi-view 3D Object Reconstruction

Christopher Bongsoo Choy; Danfei Xu; JunYoung Gwak; Kevin Chen; Silvio Savarese

Inspired by the recent success of methods that employ shape priors to achieve robust 3D reconstructions, we propose a novel recurrent neural network architecture that we call the 3D Recurrent Reconstruction Neural Network (3D-R2N2). The network learns a mapping from images of objects to their underlying 3D shapes from a large collection of synthetic data [13]. Our network takes in one or more images of an object instance from arbitrary viewpoints and outputs a reconstruction of the object in the form of a 3D occupancy grid. Unlike most previous works, our network does not require any image annotations or object class labels for training or testing. Our extensive experimental analysis shows that our reconstruction framework (i) outperforms the state-of-the-art methods for single-view reconstruction, and (ii) enables the 3D reconstruction of objects in situations when traditional SfM/SLAM methods fail (because of lack of texture and/or wide baseline).
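A heavily reduced PyTorch sketch of the data flow (per-view CNN encoder, recurrent aggregation over views, decoder to an occupancy grid); the real 3D-R2N2 uses a 3D convolutional LSTM and a 3D deconvolutional decoder, so this is a shape-level illustration only, with all sizes chosen arbitrarily.

```python
import torch
import torch.nn as nn

class TinyR2N2(nn.Module):
    """Simplified stand-in for 3D-R2N2: CNN per view, GRU over the view
    sequence, linear decoder to occupancy logits on a 16^3 grid."""
    def __init__(self, feat=128, grid=16):
        super().__init__()
        self.grid = grid
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat))
        self.gru = nn.GRU(feat, feat, batch_first=True)
        self.decoder = nn.Linear(feat, grid ** 3)

    def forward(self, views):                 # views: (B, T, 3, H, W)
        B, T = views.shape[:2]
        f = self.encoder(views.flatten(0, 1)).view(B, T, -1)
        _, h = self.gru(f)                    # final state summarizes all views
        return self.decoder(h[-1]).view(B, self.grid, self.grid, self.grid)

model = TinyR2N2()
voxels = model(torch.randn(2, 3, 3, 64, 64))  # 2 objects, 3 views each
print(voxels.shape)                           # torch.Size([2, 16, 16, 16])
```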
