Publication


Featured research published by Amir Ghodrati.


Computer Vision and Pattern Recognition | 2015

Modeling video evolution for action recognition

Basura Fernando; Efstratios Gavves; M José Oramas; Amir Ghodrati; Tinne Tuytelaars

In this paper we present a method to capture video-wide temporal information for action recognition. We postulate that a function capable of ordering the frames of a video temporally (based on the appearance) captures well the evolution of the appearance within the video. We learn such ranking functions per video via a ranking machine and use the parameters of these as a new video representation. The proposed method is easy to interpret and implement, fast to compute and effective in recognizing a wide variety of actions. We perform a large number of evaluations on datasets for generic action recognition (Hollywood2 and HMDB51), fine-grained actions (MPII cooking activities) and gestures (Chalearn). Results show that the proposed method brings an absolute improvement of 7-10%, while being compatible with and complementary to further improvements in appearance and local motion based methods.
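
The ranking-machine idea can be illustrated with a minimal sketch. Everything below is an assumption for illustration (random stand-in features, scikit-learn's LinearSVR as the ranking machine, time-varying mean smoothing); it is not the authors' exact implementation.

```python
# Minimal rank-pooling-style sketch: fit a linear function whose scores grow
# with frame index and use its parameters as the video representation.
import numpy as np
from sklearn.svm import LinearSVR

def rank_pool(frame_features):
    """frame_features: (T, D) array of per-frame descriptors."""
    T = len(frame_features)
    # Time-varying mean smoothing before fitting the ranker (an assumption).
    smoothed = np.cumsum(frame_features, axis=0) / np.arange(1, T + 1)[:, None]
    # Fit w so that w . smoothed[t] roughly increases with t.
    svr = LinearSVR(C=1.0, fit_intercept=False, max_iter=10000)
    svr.fit(smoothed, np.arange(T, dtype=float))
    return svr.coef_  # (D,) vector: the new video representation

video = np.random.randn(120, 512)   # 120 frames of hypothetical 512-D features
representation = rank_pool(video)   # feed this vector to any action classifier
```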


IEEE Transactions on Pattern Analysis and Machine Intelligence | 2017

Rank Pooling for Action Recognition

Basura Fernando; Efstratios Gavves; M José Oramas; Amir Ghodrati; Tinne Tuytelaars

We propose a function-based temporal pooling method that captures the latent structure of the video sequence data - e.g., how frame-level features evolve over time in a video. We show how the parameters of a function that has been fit to the video data can serve as a robust new video representation. As a specific example, we learn a pooling function via ranking machines. By learning to rank the frame-level features of a video in chronological order, we obtain a new representation that captures the video-wide temporal dynamics of a video, suitable for action recognition. Other than ranking functions, we explore different parametric models that could also explain the temporal changes in videos. The proposed functional pooling methods, and rank pooling in particular, are easy to interpret and implement, fast to compute and effective in recognizing a wide variety of actions. We evaluate our method on various benchmarks for generic action, fine-grained action and gesture recognition. Results show that rank pooling brings an absolute improvement of 7-10% over the average pooling baseline. At the same time, rank pooling is compatible with and complementary to several appearance and local motion based methods and features, such as improved trajectories and deep learning features.
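
As a toy illustration of the "other parametric models" mentioned here, one could fit a straight line to each feature dimension over time and keep the fitted parameters as the descriptor; the snippet below is only a sketch with hypothetical features, not a model from the paper.

```python
# Alternative parametric pooling sketch: per-dimension linear fit over time,
# keeping slopes and intercepts as the video descriptor.
import numpy as np

def line_pool(frame_features):
    """frame_features: (T, D) -> (2*D,) vector of slopes and intercepts."""
    T = len(frame_features)
    t = np.linspace(0.0, 1.0, T)
    coeffs = np.polyfit(t, frame_features, deg=1)  # shape (2, D): [slope; intercept]
    return coeffs.ravel()

descriptor = line_pool(np.random.randn(120, 512))  # hypothetical features
```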


International Conference on Computer Vision | 2015

DeepProposal: Hunting Objects by Cascading Deep Convolutional Layers

Amir Ghodrati; Ali Diba; Marco Pedersoli; Tinne Tuytelaars; Luc Van Gool

In this paper we evaluate the quality of the activation layers of a convolutional neural network (CNN) for the generation of object proposals. We generate hypotheses in a sliding-window fashion over different activation layers and show that the final convolutional layers can find the object of interest with high recall but poor localization due to the coarseness of the feature maps. Instead, the first layers of the network can better localize the object of interest but with a reduced recall. Based on this observation we design a method for proposing object locations that is based on CNN features and that combines the best of both worlds. We build an inverse cascade that, going from the final to the initial convolutional layers of the CNN, selects the most promising object locations and refines their boxes in a coarse-to-fine manner. The method is efficient, because i) it uses the same features extracted for detection, ii) it aggregates features using integral images, and iii) it avoids a dense evaluation of the proposals due to the inverse coarse-to-fine cascade. The method is also accurate: it outperforms most of the previously proposed object proposal approaches and, when plugged into a CNN-based detector, produces state-of-the-art detection performance.
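
A rough sketch of the inverse coarse-to-fine idea is given below; the maps, window sizes and scoring are placeholders, and only the mechanics (integral-image window scoring on a coarse map, re-scoring the survivors on a finer map) follow the description above.

```python
# Sketch of a coarse-to-fine window cascade with integral-image scoring.
import numpy as np

def integral_image(fmap):
    """fmap: (H, W) saliency map, e.g. a channel-averaged activation map."""
    return np.pad(fmap, ((1, 0), (1, 0))).cumsum(0).cumsum(1)

def window_score(ii, y0, x0, y1, x1):
    # Sum of activations inside [y0:y1, x0:x1) in O(1) via the integral image.
    return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]

def propose(coarse, fine, win=8, stride=4, keep=50):
    ii_c, ii_f = integral_image(coarse), integral_image(fine)
    scale = fine.shape[0] // coarse.shape[0]
    H, W = coarse.shape
    candidates = []
    for y in range(0, H - win + 1, stride):        # dense windows on the coarse map
        for x in range(0, W - win + 1, stride):
            candidates.append((window_score(ii_c, y, x, y + win, x + win), y, x))
    candidates.sort(reverse=True)
    refined = []
    for s, y, x in candidates[:keep]:              # survivors only, finer map
        box = (y * scale, x * scale, (y + win) * scale, (x + win) * scale)
        refined.append((window_score(ii_f, *box), box))
    return sorted(refined, reverse=True)

proposals = propose(np.random.rand(16, 16), np.random.rand(64, 64))
```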


British Machine Vision Conference | 2014

Is 2D information enough for viewpoint estimation?

Amir Ghodrati; Marco Pedersoli; Tinne Tuytelaars

Recent top-performing methods for viewpoint estimation make use of 3D information like 3D CAD models or 3D landmarks to build a 3D representation of the class. These 3D annotations are expensive and not readily available for many classes. In this paper we investigate whether and how comparable performance can be obtained without any 3D information. We consider viewpoint estimation as a 1-vs-all classification problem on the previously detected object bounding box. In this framework we compare several features and parameter configurations and show that modern representations based on Fisher encoding and convolutional neural network features, together with a neighbor-viewpoint suppression strategy on the training data, lead to comparable or even better performance than 3D methods.
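
A hedged sketch of the 1-vs-all formulation with neighbor-viewpoint suppression is shown below; the number of bins, the features and the use of scikit-learn's LinearSVC are illustrative assumptions, not the paper's exact setup.

```python
# Viewpoint estimation as 1-vs-all classification over discretised bins,
# removing the adjacent bins from the negative set of each classifier.
import numpy as np
from sklearn.svm import LinearSVC

def train_viewpoint_classifiers(X, bins, n_bins=16):
    """X: (N, D) features of detected boxes, bins: (N,) viewpoint bin labels."""
    bins = np.asarray(bins)
    classifiers = []
    for v in range(n_bins):
        neighbours = {(v - 1) % n_bins, v, (v + 1) % n_bins}
        # Positives: bin v. Negatives: all bins except v and its two neighbours.
        keep = np.array([b == v or b not in neighbours for b in bins])
        y = (bins[keep] == v).astype(int)
        classifiers.append(LinearSVC(C=1.0).fit(X[keep], y))
    return classifiers

def predict_viewpoint(classifiers, x):
    scores = [clf.decision_function(x[None])[0] for clf in classifiers]
    return int(np.argmax(scores))
```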


European Conference on Computer Vision | 2016

Online Action Detection

Roeland De Geest; Efstratios Gavves; Amir Ghodrati; Zhenyang Li; Cees G. M. Snoek; Tinne Tuytelaars

In online action detection, the goal is to detect the start of an action in a video stream as soon as it happens. For instance, if a child is chasing a ball, an autonomous car should recognize what is going on and respond immediately. This is a very challenging problem for four reasons. First, only partial actions are observed. Second, there is a large variability in negative data. Third, the start of the action is unknown, so it is unclear over what time window the information should be integrated. Finally, in real world data, large within-class variability exists. This problem has been addressed before, but only to some extent. Our contributions to online action detection are threefold. First, we introduce a realistic dataset composed of 27 episodes from 6 popular TV series. The dataset spans over 16 h of footage annotated with 30 action classes, totaling 6,231 action instances. Second, we analyze and compare various baseline methods, showing this is a challenging problem for which none of the methods provides a good solution. Third, we analyze the change in performance when there is a variation in viewpoint, occlusion, truncation, etc. We introduce an evaluation protocol for fair comparison. The dataset, the baselines and the models will all be made publicly available to encourage (much needed) further research on online action detection on realistic data.
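
To make the online constraint concrete, here is a tiny causal scoring sketch (not one of the paper's baselines): at each new frame only past frames are pooled and scored. The feature extractor and classifier are placeholders.

```python
# Causal per-frame scoring: no access to future frames at any point.
import numpy as np

def online_scores(frame_stream, classifier, window=16):
    """Yield a score vector as soon as each frame's features arrive."""
    history = []
    for features in frame_stream:                  # frames arrive one by one
        history.append(features)
        clip = np.mean(history[-window:], axis=0)  # pool only the recent past
        yield classifier(clip)                     # e.g. scores for 30 classes

stream = (np.random.randn(512) for _ in range(100))   # hypothetical feature stream
dummy_classifier = lambda clip: clip[:30]              # stand-in classifier
scores_over_time = list(online_scores(stream, dummy_classifier))
```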


British Machine Vision Conference | 2016

Towards Automatic Image Editing: Learning to See another You

Amir Ghodrati; Xu Jia; Marco Pedersoli; Tinne Tuytelaars

Learning the distribution of images in order to generate new samples is a challenging task due to the high dimensionality of the data and the highly non-linear relations that are involved. Nevertheless, some promising results have been reported in the literature recently, building on deep network architectures. In this work, we zoom in on a specific type of image generation: given an image and knowing the category of objects it belongs to (e.g. faces), our goal is to generate a similar and plausible image, but with some altered attributes. This is particularly challenging, as the model needs to learn to disentangle the effect of each attribute and to apply a desired attribute change to a given input image, while keeping the other attributes and overall object appearance intact. To this end, we learn a convolutional network, where the desired attribute information is encoded and then merged with the encoded image at feature map level. We show promising results, both qualitatively and quantitatively, in the context of a retrieval experiment, on two face datasets (MultiPie and CAS-PEAL-R1).
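
The "encode the attribute, merge at feature-map level, decode" structure can be sketched as follows; the layer sizes, attribute count and PyTorch usage are assumptions for illustration, not the architecture from the paper.

```python
# Hypothetical encoder / attribute-merge / decoder sketch in PyTorch.
import torch
import torch.nn as nn

class AttributeEditor(nn.Module):
    def __init__(self, n_attrs=40):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.attr_embed = nn.Linear(n_attrs, 128)
        self.decode = nn.Sequential(
            nn.ConvTranspose2d(256, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, image, attrs):
        fmap = self.encode(image)                          # (B, 128, H/4, W/4)
        a = self.attr_embed(attrs)                         # (B, 128)
        a = a[:, :, None, None].expand(-1, -1, *fmap.shape[2:])
        merged = torch.cat([fmap, a], dim=1)               # merge at feature-map level
        return self.decode(merged)                         # edited image

editor = AttributeEditor()
out = editor(torch.randn(1, 3, 64, 64), torch.rand(1, 40))  # (1, 3, 64, 64)
```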


Lecture Notes in Computer Science | 2016

Cross-modal supervision for learning active speaker detection in video

Roeland De Geest; Efstratios Gavves; Amir Ghodrati; Zhenyang Li; Cees G. M. Snoek; Tinne Tuytelaars



International Conference on Multimedia Retrieval | 2015

Swap Retrieval: Retrieving Images of Cats When the Query Shows a Dog

Amir Ghodrati; Xu Jia; Marco Pedersoli; Tinne Tuytelaars

Query-by-example remains popular in image retrieval because it can exploit contextual information encoded in the image that is difficult to express in a traditional textual query. Textual queries, on the other hand, give more flexibility in that it is easy to reformulate and refine a text query based on initial results. In this work we make a first step towards getting the best of both worlds: we use an image to specify the context, but let the user specify a related category as the main search criterion. For instance, starting from an image of a dog in a certain situation/context, the goal is to find images of cats with a similar situation/context. We present an evaluation scheme for this new and challenging task, which we call swap retrieval, and use it to compare various methods. Results show that standard query-by-example techniques do not adapt well to the new task. Instead, techniques based on semantic knowledge extracted from textual descriptions available at training time perform reasonably well, although they are still far from the performance needed for practical use.
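
A minimal sketch of the swap-retrieval setting follows; the context embeddings (for example, derived from textual descriptions at training time) and the category labels are assumed to be precomputed and are not the paper's actual models.

```python
# Rank database images of the target category by context similarity to the query.
import numpy as np

def swap_retrieve(query_context, db_contexts, db_categories, target_category, top_k=10):
    """query_context: (D,), db_contexts: (N, D), db_categories: list of N labels."""
    mask = np.array([c == target_category for c in db_categories])
    candidates = db_contexts[mask]
    # Cosine similarity between the query context and each candidate's context.
    sims = candidates @ query_context / (
        np.linalg.norm(candidates, axis=1) * np.linalg.norm(query_context) + 1e-8)
    order = np.argsort(-sims)[:top_k]
    return np.flatnonzero(mask)[order]   # indices into the full database
```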


Workshop on Applications of Computer Vision | 2014

Coupling video segmentation and action recognition

Amir Ghodrati; Marco Pedersoli; Tinne Tuytelaars

Recently a lot of progress has been made in the field of video segmentation. The question then arises whether and how these results can be exploited for another video processing challenge, action recognition. In this paper we show that a good segmentation is actually very important for recognition. We propose and evaluate several ways to integrate and combine the two tasks: i) recognition using a standard, bottom-up segmentation, ii) using a top-down segmentation geared towards actions, iii) using a segmentation based on inter-video similarities (co-segmentation), and iv) tight integration of recognition and segmentation via iterative learning. Our results clearly show that, on the one hand, the two tasks are interdependent and therefore an iterative optimization of the two makes sense and gives better results. On the other hand, comparable results can also be obtained with two separate steps by mapping the feature space with a non-linear kernel.
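
Option iv), the iterative coupling, can be sketched schematically as below; segmentation, pooling and the classifier are crude stand-ins chosen only to show the alternation, not the paper's actual components.

```python
# Schematic alternation between segmentation and recognition (stand-ins only).
import numpy as np
from sklearn.svm import LinearSVC

def segment(video, weights):
    # Stand-in segmentation: keep frames whose response to the model is high.
    scores = video @ weights
    return video[scores > np.median(scores)]

def iterative_training(videos, labels, n_iters=5, dim=128):
    weights = np.random.randn(dim)            # crude initial (bottom-up) model
    clf = None
    for _ in range(n_iters):
        X = np.stack([segment(v, weights).mean(axis=0) for v in videos])
        clf = LinearSVC(C=1.0).fit(X, labels)
        weights = clf.coef_.mean(axis=0)      # feed the classifier back into segmentation
    return clf

videos = [np.random.randn(80, 128) for _ in range(20)]   # hypothetical clips
labels = np.arange(20) % 3                                 # 3 dummy action classes
model = iterative_training(videos, labels)
```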


International Journal of Computer Vision | 2017

DeepProposals: Hunting objects and actions by cascading deep convolutional layers

Amir Ghodrati; Ali Diba; Marco Pedersoli; Tinne Tuytelaars; Luc Van Gool

Collaboration


Dive into Amir Ghodrati's collaborations.

Top Co-Authors

Tinne Tuytelaars (Katholieke Universiteit Leuven)
Marco Pedersoli (Katholieke Universiteit Leuven)
Zhenyang Li (University of Amsterdam)
Ali Diba (Katholieke Universiteit Leuven)
M José Oramas (Katholieke Universiteit Leuven)
Roeland De Geest (Katholieke Universiteit Leuven)
Xu Jia (Katholieke Universiteit Leuven)
Basura Fernando (Australian National University)