Hamed Pirsiavash
University of California, Irvine
Publications
Featured research published by Hamed Pirsiavash.
computer vision and pattern recognition | 2011
Hamed Pirsiavash; Deva Ramanan; Charless C. Fowlkes
We analyze the computational problem of multi-object tracking in video sequences. We formulate the problem using a cost function that requires estimating the number of tracks, as well as their birth and death states. We show that the global solution can be obtained with a greedy algorithm that sequentially instantiates tracks using shortest path computations on a flow network. Greedy algorithms allow one to embed pre-processing steps, such as non-maximum suppression, within the tracking algorithm. Furthermore, we give a near-optimal algorithm based on dynamic programming which runs in time linear in the number of objects and linear in the sequence length. Our algorithms are fast, simple, and scalable, allowing us to process dense input data. This results in state-of-the-art performance.
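The greedy idea can be illustrated with a minimal sketch. This is not the paper's min-cost flow implementation: the detections, costs, and transition function below are hypothetical, and each track is found by a plain dynamic program over frames rather than a shortest path on an explicit flow network. Tracks are instantiated one at a time as long as adding one lowers the total cost.

```python
# Minimal sketch of greedy track instantiation (hypothetical costs;
# the paper computes shortest paths on a min-cost flow network).

def shortest_track(frames, birth=1.0, death=1.0, trans=lambda a, b: abs(a - b)):
    """Find one minimum-cost track through per-frame detection lists.

    frames: list over frames of [(detection_id, detection_cost), ...].
    Confident detections carry negative cost. Returns (total_cost, path).
    """
    # DP over frames: best holds, for each detection of the current frame,
    # the cheapest track ending at that detection.
    best = [(birth + c, [d]) for d, c in frames[0]]
    for frame in frames[1:]:
        new_best = []
        for d, c in frame:
            cost, path = min(
                (pc + trans(path[-1], d), path) for pc, path in best
            )
            new_best.append((cost + c, path + [d]))
        best = new_best
    cost, path = min(best)
    return cost + death, path


def greedy_tracking(frames, max_tracks=10, **kw):
    """Sequentially instantiate tracks while each new track has negative cost."""
    tracks = []
    remaining = [list(f) for f in frames]
    for _ in range(max_tracks):
        if any(not f for f in remaining):
            break
        cost, path = shortest_track(remaining, **kw)
        if cost >= 0:  # a track only helps if its total cost is negative
            break
        tracks.append(path)
        # Remove used detections; pre-processing such as non-maximum
        # suppression could be embedded at this step.
        remaining = [[(d, c) for d, c in f if d != p]
                     for f, p in zip(remaining, path)]
    return tracks
```

With two well-separated, confident detections per frame, the sketch recovers two tracks and then stops.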
computer vision and pattern recognition | 2012
Hamed Pirsiavash; Deva Ramanan
We present a novel dataset and novel algorithms for the problem of detecting activities of daily living (ADL) in first-person camera views. We have collected a dataset of 1 million frames of dozens of people performing unscripted, everyday activities. The dataset is annotated with activities, object tracks, hand positions, and interaction events. ADLs differ from typical actions in that they can involve long-scale temporal structure (making tea can take a few minutes) and complex object interactions (a fridge looks different when its door is open). We develop novel representations including (1) temporal pyramids, which generalize the well-known spatial pyramid to approximate temporal correspondence when scoring a model and (2) composite object models that exploit the fact that objects look different when being interacted with. We perform an extensive empirical evaluation and demonstrate that our novel representations produce a two-fold improvement over traditional approaches. Our analysis suggests that real-world ADL recognition is “all about the objects,” and in particular, “all about the objects being interacted with.”
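The temporal pyramid can be sketched in a few lines. Like the spatial pyramid it generalizes, each level splits the video into finer segments, pools features within each segment, and concatenates the results; the per-frame feature matrix below is a hypothetical stand-in for the paper's object-centric features.

```python
import numpy as np

def temporal_pyramid(features, levels=3):
    """Pool per-frame features into a temporal pyramid descriptor.

    features: (num_frames, dim) array of per-frame feature vectors.
    Level k splits the video into 2**k equal segments; each segment is
    mean-pooled and all pooled vectors are concatenated, giving a
    descriptor of size (2**levels - 1) * dim.
    """
    chunks = []
    for level in range(levels):
        for seg in np.array_split(features, 2 ** level, axis=0):
            chunks.append(seg.mean(axis=0))
    return np.concatenate(chunks)
```

Coarse levels tolerate temporal misalignment while fine levels preserve ordering, which is the intended approximate temporal correspondence.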
computer vision and pattern recognition | 2011
Sangmin Oh; Anthony Hoogs; A. G. Amitha Perera; Naresh P. Cuntoor; Chia-Chih Chen; Jong Taek Lee; Saurajit Mukherjee; Jake K. Aggarwal; Hyungtae Lee; Larry S. Davis; Eran Swears; Xiaoyang Wang; Qiang Ji; Kishore K. Reddy; Mubarak Shah; Carl Vondrick; Hamed Pirsiavash; Deva Ramanan; Jenny Yuen; Antonio Torralba; Bi Song; Anesco Fong; Amit K. Roy-Chowdhury; Mita Desai
We introduce a new large-scale video dataset designed to assess the performance of diverse visual event recognition algorithms with a focus on continuous visual event recognition (CVER) in outdoor areas with wide coverage. Previous datasets for action recognition are unrealistic for real-world surveillance because they consist of short clips showing one action by one individual [15, 8]. Datasets have been developed for movies [11] and sports [12], but these actions and scene conditions do not apply effectively to surveillance videos. Our dataset consists of many outdoor scenes with actions occurring naturally by non-actors in continuously captured videos of the real world. The dataset includes large numbers of instances for 23 event types distributed throughout 29 hours of video. This data is accompanied by detailed annotations which include both moving object tracks and event examples, which will provide a solid basis for large-scale evaluation. Additionally, we propose different types of evaluation modes for visual recognition tasks and evaluation metrics along with our preliminary experimental results. We believe that this dataset will stimulate diverse aspects of computer vision research and help us to advance the CVER tasks in the years ahead.
international conference on computer vision | 2013
Joseph J. Lim; Hamed Pirsiavash; Antonio Torralba
We address the problem of localizing and estimating the fine pose of objects in the image with exact 3D models. Our main focus is to unify contributions from the 1970s with recent advances in object detection: use local keypoint detectors to find candidate poses and score global alignment of each candidate pose to the image. Moreover, we also provide a new dataset containing fine-aligned objects with their exactly matched 3D models, and a set of models for widely used objects. We also evaluate our algorithm both on object detection and fine pose estimation, and show that our method outperforms state-of-the-art algorithms.
computer vision and pattern recognition | 2014
Hamed Pirsiavash; Deva Ramanan
Real-world videos of human activities exhibit temporal structure at various scales: long videos are typically composed of multiple action instances, where each instance is itself composed of sub-actions with variable durations and orderings. Temporal grammars can presumably model such hierarchical structure, but are computationally difficult to apply to long video streams. We describe simple grammars that capture hierarchical temporal structure while admitting inference with a finite-state machine. This makes parsing linear-time, constant-storage, and naturally online. We train grammar parameters using a latent structural SVM, where latent sub-actions are learned automatically. We illustrate the effectiveness of our approach over common baselines on a new half-million frame dataset of continuous YouTube videos.
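The linear-time, online character of finite-state parsing can be sketched with a plain Viterbi-style dynamic program. The per-frame scores and allowed transitions below are hypothetical, and this sketch omits the paper's grammar and latent-SVM training; it only shows why inference is linear in the number of frames and processes them one at a time.

```python
def parse_online(frame_scores, transitions):
    """Viterbi-style labeling over sub-action states, linear in video length.

    frame_scores: list over frames of {state: score}.
    transitions: {(prev_state, state): score} for allowed transitions.
    Returns the highest-scoring state sequence.
    """
    # best[s] = (score, path ending in state s); updated one frame at a
    # time, so parsing is naturally online with per-frame constant work
    # in the number of states.
    best = {s: (sc, [s]) for s, sc in frame_scores[0].items()}
    for scores in frame_scores[1:]:
        new_best = {}
        for s, sc in scores.items():
            cands = [(p_sc + transitions[(p, s)] + sc, path + [s])
                     for p, (p_sc, path) in best.items()
                     if (p, s) in transitions]
            if cands:
                new_best[s] = max(cands)
        best = new_best
    return max(best.values())[1]
```

Disallowing a transition (omitting it from the dictionary) plays the role of a grammar rule: below, "b" cannot return to "a", so the parse commits to the ordering a-then-b.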
computer vision and pattern recognition | 2016
Carl Vondrick; Hamed Pirsiavash; Antonio Torralba
Anticipating actions and objects before they start or appear is a difficult problem in computer vision with several real-world applications. This task is challenging partly because it requires leveraging extensive knowledge of the world that is difficult to write down. We believe that a promising resource for efficiently learning this knowledge is through readily available unlabeled video. We present a framework that capitalizes on temporal structure in unlabeled video to learn to anticipate human actions and objects. The key idea behind our approach is that we can train deep networks to predict the visual representation of images in the future. Visual representations are a promising prediction target because they encode images at a higher semantic level than pixels yet are automatic to compute. We then apply recognition algorithms on our predicted representation to anticipate objects and actions. We experimentally validate this idea on two datasets, anticipating actions one second in the future and objects five seconds in the future.
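The core training signal can be sketched with a toy stand-in: regress the representation of a future frame from the representation of the current frame, then run a recognizer on the prediction. The paper trains deep networks on learned visual representations; the closed-form ridge-regularized linear map below is only a hypothetical, minimal version of that idea.

```python
import numpy as np

def fit_future_predictor(feats_now, feats_future, reg=1e-3):
    """Least-squares map from present to future representations.

    feats_now, feats_future: (n, d) arrays of paired feature vectors
    sampled from unlabeled video (e.g. frame t and frame t + 1 second).
    Solves the ridge system W = (X^T X + reg*I)^-1 X^T Y.
    """
    d = feats_now.shape[1]
    W = np.linalg.solve(feats_now.T @ feats_now + reg * np.eye(d),
                        feats_now.T @ feats_future)
    return W  # predict with feats_now @ W, then classify the prediction
```

No labels are needed to fit the predictor, which is what lets the approach capitalize on readily available unlabeled video.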
acm multimedia | 2010
John Adcock; Matthew Cooper; Laurent Denoue; Hamed Pirsiavash; Lawrence A. Rowe
The design and implementation of a search engine for lecture webcasts is described. A searchable text index is created allowing users to locate material within lecture videos found on a variety of websites such as YouTube and Berkeley webcasts. The index is created from words on the presentation slides appearing in the video along with any associated metadata such as the title and abstract when available. The video is analyzed to identify a set of distinct slide images, to which OCR and lexical processing are applied to generate a list of indexable terms. Several problems were discovered when trying to identify distinct slides in the video stream. For example, picture-in-picture compositing of a speaker and a presentation slide, switching cameras, and slide builds confuse basic frame-differencing algorithms for extracting keyframe slide images. Algorithms are described that improve slide identification. A prototype system was built to test the algorithms and the utility of the search engine. Users can browse lists of lectures, slides in a specific lecture, or play the lecture video. Over 10,000 lecture videos have been indexed from a variety of sources. A public website will be published in mid-2010 that allows users to experiment with the search engine.
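The basic frame-differencing baseline that the improved algorithms build on can be sketched as follows. The grayscale frames and threshold are hypothetical, and this is exactly the naive detector the paper notes is confused by picture-in-picture compositing, camera switches, and slide builds.

```python
import numpy as np

def keyframe_indices(frames, threshold=0.1):
    """Baseline slide-change detection by mean absolute frame difference.

    frames: list of (H, W) grayscale arrays with values in [0, 1].
    Returns indices of frames that differ enough from the previously
    kept keyframe. This naive baseline fires spuriously on speaker
    motion in picture-in-picture layouts and on incremental slide
    builds, which the paper's improved algorithms address.
    """
    keep = [0]
    for i in range(1, len(frames)):
        if np.abs(frames[i] - frames[keep[-1]]).mean() > threshold:
            keep.append(i)
    return keep
```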
IEEE Transactions on Image Processing | 2009
Shahriar Negahdaripour; Hicham Sekkati; Hamed Pirsiavash
Utilization of an acoustic camera for range measurements is a key advantage for 3-D shape recovery of underwater targets by opti-acoustic stereo imaging, where the associated epipolar geometry of optical and acoustic image correspondences can be described in terms of conic sections. In this paper, we propose methods for system calibration and 3-D scene reconstruction by maximum likelihood estimation from noisy image measurements. The recursive 3-D reconstruction method uses as its initial condition a closed-form solution that integrates the advantages of two other closed-form solutions, referred to as the range and azimuth solutions. Synthetic data tests are given to provide insight into the merits of the new target imaging and 3-D reconstruction paradigm, while experiments with real data confirm the findings based on computer simulations, and demonstrate the merits of this novel 3-D reconstruction paradigm.
computer vision and pattern recognition | 2012
Hamed Pirsiavash; Deva Ramanan
We describe a method for learning steerable deformable part models. Our models exploit the fact that part templates can be written as linear filter banks. We demonstrate that one can enforce steerability and separability during learning by applying rank constraints. These constraints are enforced with a coordinate descent learning algorithm, where each step can be solved with an off-the-shelf structured SVM solver. The resulting models are orders of magnitude smaller than their counterparts, greatly simplifying learning and reducing run-time computation. Limiting the degrees of freedom also reduces overfitting, which is useful for learning large part vocabularies from limited training data. We learn steerable variants of several state-of-the-art models for object detection, human pose estimation, and facial landmark estimation. Our steerable models are smaller, faster, and often improve performance.
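The link between rank constraints and separability can be illustrated with the SVD on a toy filter. The paper enforces the rank constraint during structured-SVM learning via coordinate descent, not post hoc as below; the sketch only shows why a low-rank 2D filter factors into small 1D filter banks, cutting convolution cost from O(k^2) to O(2rk) per pixel for a k-by-k, rank-r filter.

```python
import numpy as np

def separate_filter(f, rank=1):
    """Approximate a 2D filter as a sum of `rank` separable filters.

    Uses the SVD: each retained singular triplet gives one vertical and
    one horizontal 1D filter whose outer product reconstructs part of f.
    rank=1 yields a fully separable approximation (exact if f is rank 1).
    """
    U, s, Vt = np.linalg.svd(f)
    verts = [np.sqrt(s[i]) * U[:, i] for i in range(rank)]
    horzs = [np.sqrt(s[i]) * Vt[i, :] for i in range(rank)]
    approx = sum(np.outer(v, h) for v, h in zip(verts, horzs))
    return verts, horzs, approx
```

Sharing the 1D filters across a part vocabulary is what makes the learned models orders of magnitude smaller than their unconstrained counterparts.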
european conference on computer vision | 2014
Hamed Pirsiavash; Carl Vondrick; Antonio Torralba
While recent advances in computer vision have provided reliable methods to recognize actions in both images and videos, the problem of assessing how well people perform actions has been largely unexplored in computer vision. Since methods for assessing action quality have many real-world applications in healthcare, sports, and video retrieval, we believe the computer vision community should begin to tackle this challenging problem. To spur progress, we introduce a learning-based framework that takes steps towards assessing how well people perform actions in videos. Our approach works by training a regression model from spatiotemporal pose features to scores obtained from expert judges. Moreover, our approach can provide interpretable feedback on how people can improve their action. We evaluate our method on a new Olympic sports dataset, and our experiments suggest our framework is able to rank the athletes more accurately than a non-expert human. While promising, our method is still a long way from rivaling the performance of expert judges, indicating that there is significant opportunity in computer vision research to improve on this difficult yet important task.
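The framework's core, a regression from per-video features to judge scores followed by ranking, can be sketched with a simple regularized linear model. The features and regressor below are hypothetical stand-ins: the paper uses spatiotemporal pose features, and its regressor need not be this ridge solution.

```python
import numpy as np

def fit_quality_model(features, judge_scores, reg=1.0):
    """Ridge regression from per-video feature vectors to expert scores.

    features: (n, d) array, one row per training video.
    judge_scores: (n,) array of scores from expert judges.
    """
    d = features.shape[1]
    w = np.linalg.solve(features.T @ features + reg * np.eye(d),
                        features.T @ judge_scores)
    return w

def rank_athletes(w, features):
    """Order videos from best to worst predicted quality score."""
    return list(np.argsort(-(features @ w)))
```

Ranking, rather than absolute scoring, is the evaluation the abstract highlights: the learned model need only order athletes more accurately than a non-expert human.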