Amir Roshan Zamir
Stanford University
Publications
Featured research published by Amir Roshan Zamir.
European Conference on Computer Vision | 2010
Amir Roshan Zamir; Mubarak Shah
Finding an image's exact GPS location is a challenging computer vision problem with many real-world applications. In this paper, we address the problem of finding the GPS location of images with an accuracy comparable to hand-held GPS devices. We leverage a structured data set of about 100,000 images built from Google Maps Street View as the reference images. We propose a localization method in which the SIFT descriptors of the detected SIFT interest points in the reference images are indexed using a tree. To localize a query image, the tree is queried using the SIFT descriptors detected in the query image. A novel GPS-tag-based pruning method removes the less reliable descriptors. Then, a smoothing step with an associated voting scheme is utilized: each query descriptor votes for the location its nearest neighbor belongs to, in order to accurately localize the query image. A parameter called Confidence of Localization, based on the kurtosis of the distribution of votes, is defined to determine how reliable the localization of a particular image is. In addition, we propose a novel approach to localizing groups of images accurately in a hierarchical manner. First, each image is localized individually; then, the rest of the images in the group are matched against images in the neighboring area of the first found match. The final location is determined based on the Confidence of Localization parameter. The proposed image group localization method can handle very unclear queries that cannot be geolocated individually.
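As a rough illustration of the voting scheme and the kurtosis-based Confidence of Localization, here is a minimal sketch; the NumPy/SciPy usage, the synthetic votes, and the smoothing-by-averaging step are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.stats import kurtosis

def localize_with_confidence(vote_locations):
    """Estimate a query location from per-descriptor votes.

    vote_locations: (N, 2) array of (lat, lon) votes, one per matched
    query descriptor (each descriptor votes for the GPS tag of its
    nearest-neighbor reference descriptor)."""
    votes = np.asarray(vote_locations)
    estimate = votes.mean(axis=0)  # crude stand-in for the smoothing step
    # A peaked (leptokurtic) vote distribution indicates agreement among
    # descriptors, so kurtosis serves as a Confidence of Localization.
    confidence = kurtosis(votes, axis=0, fisher=False).mean()
    return estimate, confidence

# Synthetic votes tightly clustered around one location -> high confidence.
votes = np.random.normal(loc=[28.6, -81.2], scale=0.001, size=(200, 2))
location, col = localize_with_confidence(votes)
print(location, col)
```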
European Conference on Computer Vision | 2012
Amir Roshan Zamir; Afshin Dehghan; Mubarak Shah
Data association is an essential component of any human tracking system. The majority of current methods, such as bipartite matching, incorporate only a limited temporal locality of the sequence into the data association problem, which makes them inherently prone to ID switches and to difficulties caused by long-term occlusion, cluttered backgrounds, and crowded scenes. We propose an approach to data association that incorporates both motion and appearance in a global manner. Unlike limited-temporal-locality methods, which incorporate only a few frames into the data association problem, we incorporate the whole temporal span and solve the data association problem for one object at a time, while implicitly incorporating the rest of the objects. To achieve this, we utilize Generalized Minimum Clique Graphs to solve the optimization problem of our data association method. The proposed method yields a better-formulated approach to data association, which is supported by our superior results. Experiments show the proposed method makes significant improvements in tracking on the diverse sequences of Town Center [1], TUD-Crossing [2], TUD-Stadtmitte [2], PETS2009 [3], and a new sequence called Parking Lot, compared to state-of-the-art methods.
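The Generalized Minimum Clique idea can be pictured with a toy brute-force version: pick exactly one detection per frame so that the total pairwise motion-plus-appearance cost of the selected clique is minimal. The cost weighting, feature vectors, and exhaustive search below are assumptions for illustration; the paper solves the underlying NP-hard problem approximately.

```python
from itertools import product
import numpy as np

def clique_cost(track, alpha=0.5):
    """Sum of pairwise costs over all selected detections (a clique)."""
    cost = 0.0
    for i in range(len(track)):
        for j in range(i + 1, len(track)):
            pos_i, app_i = track[i]
            pos_j, app_j = track[j]
            motion = np.linalg.norm(pos_i - pos_j)      # spatial coherence
            appearance = np.linalg.norm(app_i - app_j)  # appearance coherence
            cost += alpha * motion + (1 - alpha) * appearance
    return cost

def gmcp_associate(frames):
    """frames: list (over frames) of lists of (position, appearance)
    candidate detections. Returns the one-detection-per-frame selection
    with minimal clique cost (exhaustive; toy-sized instances only)."""
    return min(product(*frames), key=clique_cost)

# Two candidate detections per frame over four frames.
frames = [[(np.array([x + t, 10.0]), np.full(4, x)) for x in (0.0, 5.0)]
          for t in range(4)]
print(clique_cost(gmcp_associate(frames)))
```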
Computer Vision and Pattern Recognition | 2016
Ashesh Jain; Amir Roshan Zamir; Silvio Savarese; Ashutosh Saxena
Deep Recurrent Neural Network architectures, though remarkably capable of modeling sequences, lack an intuitive high-level spatio-temporal structure, even though many problems in computer vision inherently have such an underlying high-level structure and can benefit from it. Spatio-temporal graphs are a popular tool for imposing such high-level intuitions in the formulation of real-world problems. In this paper, we propose an approach for combining the power of high-level spatio-temporal graphs with the sequence-learning success of Recurrent Neural Networks (RNNs). We develop a scalable method for casting an arbitrary spatio-temporal graph as a rich RNN mixture that is feedforward, fully differentiable, and jointly trainable. The proposed method is generic and principled, as it can transform any spatio-temporal graph through a well-defined set of steps. Evaluations of the proposed approach on a diverse set of problems, ranging from modeling human motion to object interactions, show improvements over the state of the art by a large margin. We expect this method to empower new approaches to problem formulation through high-level spatio-temporal graphs and Recurrent Neural Networks.
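A highly simplified sketch of the node-RNN/edge-RNN factorization (assuming PyTorch; the shapes, the single edge type, and the prediction head are invented for illustration and omit the paper's full graph-to-mixture transformation):

```python
import torch
import torch.nn as nn

class StructuralRNNSketch(nn.Module):
    """One edge RNN feeds one node RNN; a real spatio-temporal graph would
    instantiate an RNN per node type and per edge type and wire them by
    the graph's structure."""
    def __init__(self, node_dim, edge_dim, hidden=64):
        super().__init__()
        self.edge_rnn = nn.LSTM(edge_dim, hidden, batch_first=True)
        self.node_rnn = nn.LSTM(node_dim + hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, node_dim)  # e.g., predict the next pose

    def forward(self, node_feats, edge_feats):
        # node_feats: (B, T, node_dim); edge_feats: (B, T, edge_dim)
        edge_out, _ = self.edge_rnn(edge_feats)
        node_in = torch.cat([node_feats, edge_out], dim=-1)
        node_out, _ = self.node_rnn(node_in)
        return self.head(node_out)

model = StructuralRNNSketch(node_dim=16, edge_dim=8)
pred = model(torch.randn(2, 10, 16), torch.randn(2, 10, 8))  # (2, 10, 16)
```

The whole stack is feedforward and differentiable, so node and edge RNNs train jointly with ordinary backpropagation, which is the property the abstract emphasizes.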
IEEE Transactions on Pattern Analysis and Machine Intelligence | 2014
Amir Roshan Zamir; Mubarak Shah
In this paper, we present a new framework for geo-locating an image using a novel multiple-nearest-neighbor feature matching method based on Generalized Minimum Clique Graphs (GMCP). First, we extract local features (e.g., SIFT) from the query image and retrieve a number of nearest neighbors for each query feature from the reference data set. Next, we apply our GMCP-based feature matching to select a single nearest neighbor for each query feature such that all matches are globally consistent. Our approach to feature matching is based on the proposition that the first nearest neighbors are not necessarily the best choices for finding correspondences in image matching. Therefore, the proposed method considers multiple reference nearest neighbors as potential matches and selects the correct ones by enforcing consistency among their global features (e.g., GIST) using GMCP. In this context, we argue that using a robust distance function for measuring the similarity between the global features is essential for the cases where the query matches multiple reference images with dissimilar global features. To this end, we propose a robust distance function based on the Gaussian Radial Basis Function (G-RBF). We evaluated the proposed framework on a new data set of 102k street view images; the experiments show it outperforms the state of the art by 10 percent.
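To make the multiple-nearest-neighbor selection concrete, here is a toy sketch: each query feature keeps k candidate reference matches, and one candidate per feature is chosen so that the selected references' global features are mutually consistent under a G-RBF-derived distance. The sigma value, the exact form of the robust distance, and the brute-force search are assumptions for illustration.

```python
from itertools import product
import numpy as np

def grbf_distance(g1, g2, sigma=1.0):
    # Robust distance derived from a Gaussian RBF kernel: it saturates at 1
    # for very dissimilar global features instead of growing without bound.
    return 1.0 - np.exp(-np.sum((g1 - g2) ** 2) / (2 * sigma ** 2))

def gmcp_match(candidates):
    """candidates: list over query features; each entry holds the global
    (GIST-like) feature vectors of that feature's k nearest reference
    images. Returns one chosen reference per query feature."""
    def total_cost(selection):
        return sum(grbf_distance(a, b)
                   for i, a in enumerate(selection)
                   for b in selection[i + 1:])
    return min(product(*candidates), key=total_cost)
```

As in the tracking paper above, the exhaustive `product` search is only feasible for a handful of query features; the paper solves the GMCP approximately.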
Computer Vision and Pattern Recognition | 2016
Iro Armeni; Ozan Sener; Amir Roshan Zamir; Helen Jiang; Ioannis Brilakis; Silvio Savarese
In this paper, we propose a method for semantically parsing the 3D point cloud of an entire building using a hierarchical approach: first, the raw data is parsed into semantically meaningful spaces (e.g., rooms) that are aligned into a canonical reference coordinate system; second, the spaces are parsed into their structural and building elements (e.g., walls, columns). Performing these steps with a strong notion of global 3D space is the backbone of our method. The alignment in the first step injects strong 3D priors from the canonical coordinate system into the second step for discovering elements. This makes the method robust to diverse, challenging scenarios, as man-made indoor spaces often exhibit recurrent geometric patterns while appearance features can change drastically. We also argue that identification of structural elements in indoor spaces is essentially a detection problem, rather than the segmentation that is commonly used. We evaluated our method on a new dataset of several buildings with a covered area of over 6,000 m² and over 215 million points, demonstrating robust results readily useful for practical applications.
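As a toy illustration of why detection in a canonical frame helps: once a space is axis-aligned, walls appear as sharp density peaks along the coordinate axes. The bin size, peak threshold, and synthetic cloud below are invented for illustration, not taken from the paper.

```python
import numpy as np
from scipy.signal import find_peaks

def wall_candidates(points, axis=0, bin_size=0.05, min_points=500):
    """points: (N, 3) axis-aligned point cloud in meters. Returns the
    coordinates along `axis` where point density peaks, i.e. likely
    wall planes perpendicular to that axis."""
    coords = points[:, axis]
    hist, edges = np.histogram(
        coords, bins=np.arange(coords.min(), coords.max() + bin_size, bin_size))
    peaks, _ = find_peaks(hist, height=min_points)
    return (edges[peaks] + edges[peaks + 1]) / 2.0

# Toy cloud: two dense "walls" near x = 0.5 and x = 5.5 plus uniform clutter.
wall_x = np.concatenate([np.full(3000, 0.5), np.full(3000, 5.5)])
x = np.concatenate([wall_x, np.random.rand(4000) * 6])
cloud = np.column_stack([x, np.random.rand(10000) * 4, np.random.rand(10000) * 3])
print(wall_candidates(cloud))  # approximately [0.5, 5.5]
```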
Computer Vision and Pattern Recognition | 2012
Gonzalo Vaca-Castano; Amir Roshan Zamir; Mubarak Shah
This paper presents a novel method for estimating the geospatial trajectory of a moving camera with unknown intrinsic parameters in a city-scale urban environment. The proposed method is based on a three-step process: 1) finding the best visual matches of individual images in a dataset of geo-referenced street view images, 2) Bayesian tracking to estimate the frame localization and its temporal evolution, and 3) a trajectory reconstruction algorithm to eliminate inconsistent estimations. In the first step, by matching features in the query image with features in the reference geo-tagged images, we obtain a distribution of geolocated votes of matching features, which is interpreted as the likelihood of the location (latitude and longitude) given the current observation. In the second step, a Bayesian tracking framework is used to estimate the temporal evolution of the frame geolocalization based on the previous state probabilities and the current likelihood. Finally, once a trajectory is estimated, we apply a Minimum Spanning Tree (MST) based trajectory reconstruction algorithm to eliminate trajectory loops and noisy estimations. The proposed method was tested on sixty minutes of video, including footage downloaded from YouTube and footage captured by random users in Orlando and Pittsburgh.
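The second step can be pictured as a discrete (histogram) Bayes filter over a lat/lon grid; the Gaussian motion model and the synthetic vote likelihood below are simplifying assumptions rather than the paper's exact formulation.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def bayes_update(prior, vote_likelihood, motion_sigma=2.0):
    """prior: (H, W) belief over a lat/lon grid from the previous frame.
    vote_likelihood: (H, W) histogram of geolocated feature votes for the
    current frame. Returns the normalized posterior belief."""
    predicted = gaussian_filter(prior, motion_sigma)  # diffuse under camera motion
    posterior = predicted * vote_likelihood           # Bayes update
    return posterior / posterior.sum()

belief = np.full((50, 50), 1.0 / 2500)                # uniform initial belief
votes = np.random.rand(50, 50)                        # stand-in vote histogram
for _ in range(10):                                   # ten video frames
    belief = bayes_update(belief, votes)
print(np.unravel_index(belief.argmax(), belief.shape))
```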
International Conference on Computer Vision | 2015
Tian Lan; Yuke Zhu; Amir Roshan Zamir; Silvio Savarese
Realistic videos of human actions exhibit rich spatiotemporal structures at multiple levels of granularity: an action can always be decomposed into multiple finer-grained elements in both space and time. To capture this intuition, we propose to represent videos by a hierarchy of mid-level action elements (MAEs), where each MAE corresponds to an action-related spatiotemporal segment in the video. We introduce an unsupervised method to generate this representation from videos. Our method is capable of distinguishing action-related segments from background segments and representing actions at multiple spatiotemporal resolutions. Given a set of spatiotemporal segments generated from the training data, we introduce a discriminative clustering algorithm that automatically discovers MAEs at multiple levels of granularity. We develop structured models that capture a rich set of spatial, temporal and hierarchical relations among the segments, where the action label and multiple levels of MAE labels are jointly inferred. The proposed model achieves state-of-the-art performance in multiple action recognition benchmarks. Moreover, we demonstrate the effectiveness of our model in real-world applications such as action recognition in large-scale untrimmed videos and action parsing.
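A rough sketch of a discriminative clustering step in the spirit described above (assuming scikit-learn; the alternation scheme, descriptor dimensions, and iteration counts are illustrative and differ from the paper's actual formulation):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def discriminative_clusters(X, k=5, iters=3):
    # Initialize with k-means, then alternate: train one-vs-rest linear
    # classifiers on the current cluster labels and reassign each segment
    # to the cluster whose classifier scores it highest.
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
    for _ in range(iters):
        clf = LinearSVC().fit(X, labels)
        scores = clf.decision_function(X)            # (n_samples, n_clusters)
        labels = clf.classes_[scores.argmax(axis=1)]
    return labels

X = np.random.randn(300, 64)  # stand-ins for spatiotemporal segment descriptors
print(np.bincount(discriminative_clusters(X)))
```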
Archive | 2014
Khurram Soomro; Amir Roshan Zamir
The ability to analyze the actions that occur in a video is essential for the automatic understanding of sports. Action localization and recognition in videos are the two main research topics in this context. In this chapter, we provide a detailed study of the prominent methods devised for these two tasks that yield superior results for sports videos. We adopt UCF Sports, a dataset of realistic sports videos collected from broadcast television channels, as our evaluation benchmark. First, we present an overview of UCF Sports along with comprehensive statistics of the techniques tested on this dataset, as well as the evolution of their performance over time. To provide further details about existing action recognition methods in this area, we decompose the action recognition framework into three main steps: feature extraction, dictionary learning to represent a video, and classification; we overview several successful techniques for each of these steps. We also overview the problem of spatio-temporal localization of actions and argue that, in general, it is a more challenging problem than action recognition. We study several recent methods for action localization that have shown promising results on sports videos. Finally, we discuss a number of forward-thinking insights drawn from our overview of the action recognition and localization methods. In particular, we argue that performing recognition on temporally untrimmed videos, and attempting to describe an action instead of conducting a forced-choice classification, are essential for analyzing human actions in realistic environments.
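The three-step pipeline described above can be sketched end to end; the random "descriptors", dictionary size, and classifier choice below are placeholders for real components such as dense trajectory features.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(0)
train_feats = [rng.normal(size=(200, 64)) for _ in range(40)]  # per-video local descriptors
train_labels = rng.integers(0, 5, size=40)                     # 5 hypothetical action classes

# Step 1 is assumed done (the 64-D rows stand in for extracted features).
# Step 2: learn a visual dictionary over all training descriptors.
dictionary = KMeans(n_clusters=100, n_init=4).fit(np.vstack(train_feats))

def encode(feats):
    # Bag-of-words histogram over dictionary assignments, L1-normalized.
    words = dictionary.predict(feats)
    hist = np.bincount(words, minlength=100).astype(float)
    return hist / hist.sum()

# Step 3: classify the encoded videos.
X = np.array([encode(f) for f in train_feats])
clf = SVC(kernel="rbf").fit(X, train_labels)
print(clf.predict(encode(rng.normal(size=(200, 64)))[None]))
```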
Computer Vision and Image Understanding | 2017
Haroon Idrees; Amir Roshan Zamir; Yu-Gang Jiang; Alex Gorban; Ivan Laptev; Rahul Sukthankar; Mubarak Shah
Automatically recognizing and localizing a wide range of human actions is crucial for video understanding. Towards this goal, the THUMOS challenge was introduced in 2013 to serve as a benchmark for action recognition. Until then, video action recognition, including the THUMOS challenge, had focused primarily on the classification of pre-segmented (i.e., trimmed) videos, which is an artificial task. In THUMOS 2014, we elevated action recognition to a more practical level by introducing temporally untrimmed videos. These also include ‘background videos’, which share scenes and backgrounds similar to the action videos but are devoid of the specific actions. The three editions of the challenge organized in 2013–2015 have made THUMOS a common benchmark for action classification and detection, and the annual challenge is widely attended by teams from around the world. In this paper, we describe the THUMOS benchmark in detail and give an overview of the data collection and annotation procedures. We present the evaluation protocols used to quantify results in the two THUMOS tasks of action classification and temporal action detection. We also present results of submissions to the THUMOS 2015 challenge and review the participating approaches. Additionally, we include a comprehensive empirical study evaluating the differences in action recognition between trimmed and untrimmed videos, and how well methods trained on trimmed videos generalize to untrimmed videos. We conclude by proposing several directions and improvements for future THUMOS challenges.
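Temporal action detection is conventionally scored with temporal intersection-over-union: a predicted segment counts as correct when its overlap with a same-class ground-truth segment meets a threshold. A minimal sketch of that matching criterion (the 0.5 threshold and the segments are illustrative):

```python
def temporal_iou(a, b):
    """a, b: (start, end) segments in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def is_correct(prediction, ground_truths, threshold=0.5):
    # A prediction is a true positive if it overlaps any same-class
    # ground-truth segment by at least the IoU threshold.
    return any(temporal_iou(prediction, gt) >= threshold for gt in ground_truths)

print(temporal_iou((10.0, 20.0), (15.0, 25.0)))  # 5 / 15 ≈ 0.33
```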
European Conference on Computer Vision | 2014
Shervin Ardeshir; Amir Roshan Zamir; Alejandro Torroella; Mubarak Shah
Geographical Information System (GIS) databases contain information about many objects in urban areas, such as traffic signals, road signs, and fire hydrants. This wealth of information can be utilized to assist various computer vision tasks. In this paper, we propose a method for improving object detection using a set of priors acquired from GIS databases. Given a database of object locations from GIS and a query image with metadata, we compute the expected spatial locations of the visible objects in the image. We also perform object detection in the query image (e.g., using DPM) and obtain a set of candidate bounding boxes for the objects. Then, we fuse the GIS priors with the potential detections to find the final object bounding boxes. To cope with various inaccuracies and practical complications, such as noisy metadata, occlusion, inaccuracies in GIS, and poor candidate detections, we formulate the fusion as a higher-order graph matching problem, which we robustly solve using RANSAC. We demonstrate that this approach outperforms well-established object detectors, such as DPM, by a large margin.
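A much-simplified sketch of the RANSAC fusion idea: hypothesize an offset between GIS-projected object locations and detection centers, and keep the hypothesis with the most inliers. The paper applies RANSAC to a higher-order graph matching problem; this toy version only aligns point sets under translation, and the tolerance is assumed.

```python
import numpy as np

def ransac_align(gis_pts, det_pts, iters=500, tol=20.0, rng=None):
    """gis_pts, det_pts: (N, 2) and (M, 2) pixel locations.
    Returns the translation mapping GIS priors onto detections with the
    largest inlier set, along with the inlier count."""
    if rng is None:
        rng = np.random.default_rng()
    best_offset, best_inliers = np.zeros(2), 0
    for _ in range(iters):
        g = gis_pts[rng.integers(len(gis_pts))]
        d = det_pts[rng.integers(len(det_pts))]
        offset = d - g                                 # one-point hypothesis
        shifted = gis_pts + offset
        dists = np.linalg.norm(shifted[:, None] - det_pts[None], axis=2)
        inliers = int((dists.min(axis=1) < tol).sum())
        if inliers > best_inliers:
            best_offset, best_inliers = offset, inliers
    return best_offset, best_inliers
```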