Network


Latest external collaborations at the country level. Dive into the details by clicking on the dots.

Hotspot


Dive into the research topics where Ali Diba is active.

Publication


Featured research published by Ali Diba.


International Conference on Computer Vision | 2015

DeepProposal: Hunting Objects by Cascading Deep Convolutional Layers

Amir Ghodrati; Ali Diba; Marco Pedersoli; Tinne Tuytelaars; Luc Van Gool

In this paper we evaluate the quality of the activation layers of a convolutional neural network (CNN) for the generation of object proposals. We generate hypotheses in a sliding-window fashion over different activation layers and show that the final convolutional layers can find the object of interest with high recall but poor localization due to the coarseness of the feature maps. Instead, the first layers of the network can better localize the object of interest but with a reduced recall. Based on this observation we design a method for proposing object locations that is based on CNN features and that combines the best of both worlds. We build an inverse cascade that, going from the final to the initial convolutional layers of the CNN, selects the most promising object locations and refines their boxes in a coarse-to-fine manner. The method is efficient, because (i) it uses the same features extracted for detection, (ii) it aggregates features using integral images, and (iii) it avoids a dense evaluation of the proposals thanks to the inverse coarse-to-fine cascade. The method is also accurate: it outperforms most of the previously proposed object proposal approaches and, when plugged into a CNN-based detector, produces state-of-the-art detection performance.
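The integral-image trick in (ii) is easy to illustrate. The following is a minimal sketch, not the authors' code: it scores every sliding window on a toy activation map in O(1) per window, the way the coarse stage of such a cascade could rank candidate locations; `act` and the window size are placeholder values.

```python
import numpy as np

def integral_image(act):
    """Cumulative-sum table: any rectangular sum becomes four lookups."""
    ii = np.zeros((act.shape[0] + 1, act.shape[1] + 1))
    ii[1:, 1:] = act.cumsum(axis=0).cumsum(axis=1)
    return ii

def window_score(ii, y, x, h, w):
    """Sum of act[y:y+h, x:x+w] in O(1) via the integral image."""
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

# Score every h-by-w window on a toy activation map and keep the best ones,
# as the coarse (final-layer) stage of an inverse cascade would.
act = np.random.rand(32, 32)               # stand-in for one CNN activation map
ii = integral_image(act)
h, w = 8, 8
scores = [(window_score(ii, y, x, h, w), y, x)
          for y in range(act.shape[0] - h + 1)
          for x in range(act.shape[1] - w + 1)]
top10 = sorted(scores, reverse=True)[:10]  # candidates refined at finer layers
```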


Computer Vision and Pattern Recognition | 2013

Multi-attribute Queries: To Merge or Not to Merge?

Mohammad Rastegari; Ali Diba; Devi Parikh; Ali Farhadi

Users often have very specific visual content in mind that they are searching for. The most natural way to communicate this content to an image search engine is to use keywords that specify various properties or attributes of the content. A naive way of dealing with such multi-attribute queries is the following: train a classifier for each attribute independently, and then combine their scores on images to judge their fit to the query. We argue that this may not be the most effective or efficient approach. Conjunctions of attributes often correspond to very characteristic appearances. It would thus be beneficial to train classifiers that detect these conjunctions as a whole. But not all conjunctions result in such tight appearance clusters. So given a multi-attribute query, which conjunctions should we model? An exhaustive evaluation of all possible conjunctions would be time consuming. Hence we propose an optimization approach that identifies beneficial conjunctions without explicitly training the corresponding classifier. It reasons about geometric quantities that capture notions similar to intra- and inter-class variances. We exploit a discriminative binary space to compute these geometric quantities efficiently. Experimental results on two challenging datasets of objects and birds show that our proposed approach can improve performance significantly over several strong baselines, while being an order of magnitude faster than exhaustively searching through all possible conjunctions.
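As a rough illustration of the geometric reasoning, the sketch below computes a hypothetical "merge benefit" score: a Fisher-ratio-style comparison of the conjunction cluster's tightness (intra-class variance) against its separation from single-attribute samples (inter-class distance). The function name, data, and threshold logic are invented for illustration and are not the paper's actual optimization.

```python
import numpy as np

def merge_benefit(feats_a, feats_b, feats_ab):
    """Crude Fisher-ratio proxy: does the conjunction form a tighter,
    better-separated cluster than either attribute alone?"""
    rest = np.vstack([feats_a, feats_b])           # samples with one attribute only
    inter = np.linalg.norm(feats_ab.mean(0) - rest.mean(0)) ** 2
    intra = feats_ab.var(axis=0).sum()
    return inter / (intra + 1e-8)                  # high => train a joint classifier

# Toy example: descriptors for "red", "shiny", and the conjunction "red shiny".
rng = np.random.default_rng(0)
red = rng.random((50, 64))
shiny = rng.random((50, 64))
red_shiny = rng.random((20, 64)) * 0.2 + 0.4       # tight cluster => higher score
print(merge_benefit(red, shiny, red_shiny))
```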


Computer Vision and Pattern Recognition | 2017

Deep Temporal Linear Encoding Networks

Ali Diba; Vivek Sharma; Luc Van Gool

The CNN-encoding of features from entire videos for the representation of human actions has rarely been addressed. Instead, CNN work has focused on approaches to fuse spatial and temporal networks, but these were typically limited to processing shorter sequences. We present a new video representation, called temporal linear encoding (TLE), embedded inside CNNs as a new layer, which captures the appearance and motion throughout entire videos. It encodes this aggregated information into a robust video feature representation via end-to-end learning. Advantages of TLEs are: (a) they encode the entire video into a compact feature representation, learning the semantics and a discriminative feature space; (b) they are applicable to all kinds of networks, such as 2D and 3D CNNs, for video classification; and (c) they model feature interactions in a more expressive way and without loss of information. We conduct experiments on two challenging human action datasets, HMDB51 and UCF101. The experiments show that TLE outperforms current state-of-the-art methods on both datasets.
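A much-simplified sketch of a TLE-style layer is given below (PyTorch, assuming per-segment features have already been extracted by a 2D or 3D CNN). The element-wise product aggregation follows the paper's spirit; the plain linear encoding is a stand-in for the paper's more expressive encoding, and all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class TemporalLinearEncoding(nn.Module):
    """Simplified TLE-style layer: aggregate per-segment CNN features over
    the whole video with an element-wise product, then encode the result.
    The choice of encoder and all sizes here are illustrative."""

    def __init__(self, feat_dim: int, encode_dim: int):
        super().__init__()
        self.encode = nn.Linear(feat_dim, encode_dim)

    def forward(self, segment_feats: torch.Tensor) -> torch.Tensor:
        # segment_feats: (batch, num_segments, feat_dim), one row per segment
        aggregated = segment_feats.prod(dim=1)   # element-wise feature interaction
        return self.encode(aggregated)           # compact whole-video descriptor

video = torch.randn(4, 3, 2048)                  # 4 videos, 3 segments each
tle = TemporalLinearEncoding(feat_dim=2048, encode_dim=512)
print(tle(video).shape)                          # torch.Size([4, 512])
```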


Computer Vision and Pattern Recognition | 2017

Weakly Supervised Cascaded Convolutional Networks

Ali Diba; Vivek Sharma; Ali Mohammad Pazandeh; Hamed Pirsiavash; Luc Van Gool

Object detection is a challenging task in the visual understanding domain, and even more so if the supervision is to be weak. Recently, a few efforts to handle the task without expensive human annotations have been made using promising deep neural networks. We propose a new architecture of cascaded networks to learn a convolutional neural network (CNN) under such conditions. We introduce two such architectures, with either two or three cascade stages, which are trained in an end-to-end pipeline. The first stage of both architectures extracts the best candidates of class-specific region proposals by training a fully convolutional network. In the case of the three-stage architecture, the middle stage provides object segmentation, using the output of the activation maps of the first stage. The final stage of both architectures is a part of a convolutional neural network that performs multiple instance learning on proposals extracted in the previous stage(s). Our experiments on PASCAL VOC 2007, 2010, and 2012, as well as the large-scale ILSVRC 2013 and 2014 object datasets, show improvements in weakly supervised object detection, classification, and localization.
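The multiple-instance-learning idea in the final stage can be sketched as follows: treat each image as a bag of region proposals and pool per-proposal class scores into a single image-level score that can be trained against image-level labels only. This is a hypothetical simplification, not the paper's exact head; names and dimensions are placeholders.

```python
import torch
import torch.nn as nn

class MILDetectionHead(nn.Module):
    """Sketch of a MIL stage: each image is a bag of region proposals,
    and only image-level labels exist, so per-proposal class scores are
    max-pooled into one image-level score per class."""

    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.cls = nn.Linear(feat_dim, num_classes)

    def forward(self, proposal_feats: torch.Tensor) -> torch.Tensor:
        # proposal_feats: (num_proposals, feat_dim) for one image
        scores = self.cls(proposal_feats)        # per-proposal class scores
        return scores.max(dim=0).values          # pool instances into the bag

head = MILDetectionHead(feat_dim=1024, num_classes=20)
proposals = torch.randn(300, 1024)               # candidates from earlier stages
image_logits = head(proposals)                   # trainable with image labels only
loss = nn.BCEWithLogitsLoss()(image_logits, torch.zeros(20))
```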


Computer Vision and Pattern Recognition | 2016

DeepCAMP: Deep Convolutional Action & Attribute Mid-Level Patterns

Ali Diba; Ali Mohammad Pazandeh; Hamed Pirsiavash; Luc Van Gool

The recognition of human actions and the determination of human attributes are two tasks that call for fine-grained classification. Indeed, often rather small and inconspicuous objects and features have to be detected to tell their classes apart. In order to deal with this challenge, we propose a novel convolutional neural network that mines mid-level image patches that are sufficiently dedicated to resolve the corresponding subtleties. In particular, we train a newly designed CNN (DeepPattern) that learns discriminative patch groups. There are two innovative aspects to this. On the one hand we pay attention to contextual information in an original fashion. On the other hand, we let an iteration of feature learning and patch clustering purify the set of dedicated patches that we use. We validate our method for action classification on two challenging datasets: PASCAL VOC 2012 Action and Stanford 40 Actions, and for attribute recognition we use the Berkeley Attributes of People dataset. Our discriminative mid-level mining CNN obtains state-of-the-art results on these datasets, without a need for annotations about parts and poses.
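The iterate-and-purify idea can be caricatured with scikit-learn: cluster patch features, drop patches that sit far from their centroid, and re-cluster. In the actual method the CNN features would be re-learned between rounds; here the features, cluster count, and purification threshold are all illustrative stand-ins.

```python
import numpy as np
from sklearn.cluster import KMeans

def refine_patch_clusters(patch_feats, n_clusters=10, n_iters=3):
    """Toy version of an iterate-cluster-and-purify loop: cluster patch
    features, discard patches far from their centroid, then re-cluster.
    A real system would re-extract CNN features between rounds."""
    feats = patch_feats
    for _ in range(n_iters):
        km = KMeans(n_clusters=n_clusters, n_init=10).fit(feats)
        dists = np.linalg.norm(feats - km.cluster_centers_[km.labels_], axis=1)
        keep = dists < np.percentile(dists, 80)   # purify: drop outlier patches
        feats = feats[keep]
    return km, feats

rng = np.random.default_rng(0)
km, kept = refine_patch_clusters(rng.random((500, 128)))
print(len(kept), "patches survive purification")
```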


International Conference on Machine Vision | 2017

Deep visual words: Improved fisher vector for image classification

Ali Diba; Ali Mohammad Pazandeh; Luc Van Gool

Image classification has been revolutionized by deep convolutional neural networks. Using previous state-of-the-art classification methods, like Fisher vector encoding, in combination with deep CNNs has been shown to be promising. Motivated by the recent work on dense CNN features for Fisher encoding (FV-CNN), we present a scheme to discover better visual words with CNNs, to obtain improved Fisher vector features. Our method, Deep Visual Words (DVW), learns semantic visual clusters per category by iteratively learning and refining groups of visual patches. DVW represents an efficient feature space embedding to capture the discriminative potential between meaningful visual clusters. We evaluate our approach on popular datasets for object, scene, and action classification and outperform the state of the art: MIT Indoor for scene classification, PASCAL VOC 2007 for object categorization, and Stanford 40 for human actions.
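For context, a bare-bones Fisher vector can be computed from a GMM fitted on dense patch features. The sketch below uses only the gradients with respect to the component means (a full FV also includes weight and covariance terms) and random data as a stand-in for CNN features; it illustrates the baseline DVW improves on, not the DVW method itself.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(patches, gmm):
    """Simplified Fisher vector: gradients w.r.t. the GMM means only."""
    gamma = gmm.predict_proba(patches)           # (N, K) soft assignments
    n = patches.shape[0]
    fv = []
    for j in range(gmm.n_components):
        # Whitened deviations of each patch from component j's mean.
        diff = (patches - gmm.means_[j]) / np.sqrt(gmm.covariances_[j])
        grad = (gamma[:, j:j + 1] * diff).sum(axis=0)
        fv.append(grad / (n * np.sqrt(gmm.weights_[j])))
    return np.concatenate(fv)

rng = np.random.default_rng(0)
train_patches = rng.random((1000, 64))           # stand-in for dense CNN features
gmm = GaussianMixture(n_components=8, covariance_type="diag").fit(train_patches)
print(fisher_vector(rng.random((200, 64)), gmm).shape)   # (8 * 64,)
```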


Conference on Multimedia Modeling | 2018

The CAMETRON Lecture Recording System: High Quality Video Recording and Editing with Minimal Human Supervision

Dries Hulens; Bram Aerts; Punarjay Chakravarty; Ali Diba; Toon Goedemé; Tom Roussel; Jeroen Zegers; Tinne Tuytelaars; Luc Van Eycken; Luc Van Gool; Hugo Van hamme; Joost Vennekens

In this paper, we demonstrate a system that automates the process of recording video lectures in classrooms. Through special hardware (lecturer- and audience-facing cameras and microphone arrays), we record multiple points of view of the lecture. Person detection and tracking, along with recognition of different human actions, are used to digitally zoom in on the lecturer and to alternate focus between the lecturer and the slides or the blackboard. Audio sound-source localization, along with face detection and tracking, is used to detect questions from the audience, to digitally zoom in on the member of the audience asking the question, and to improve the quality of the sound recording. Finally, an automatic video editing system is used to naturally switch between the different video streams and to compose a compelling end product. We demonstrate the working system in two classrooms, over two two-hour lectures given by two lecturers.
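The stream-switching step of such an editing system can be sketched with a simple hysteresis rule: cut to the most "active" stream, but hold each shot for a minimum duration to avoid jittery cuts. The activity scores and hold length below are hypothetical and not CAMETRON's actual editing logic.

```python
def edit_streams(activity, hold=3):
    """Toy cut-decision rule: pick the most active stream per time step,
    but keep each shot for at least `hold` steps before cutting away.
    activity[t][s] is a hypothetical score (e.g. lecturer motion, audience
    sound-source energy) for stream s at time step t."""
    timeline, current, held = [], 0, 0
    for scores in activity:
        best = max(range(len(scores)), key=lambda s: scores[s])
        if best != current and held >= hold:
            current, held = best, 0            # cut to the new stream
        timeline.append(current)
        held += 1
    return timeline

# Streams: 0 = lecturer camera, 1 = slides, 2 = audience camera.
activity = [[0.9, 0.1, 0.0]] * 4 + [[0.2, 0.1, 0.8]] * 4
print(edit_streams(activity))   # cuts to the audience once the hold expires
```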


arXiv: Computer Vision and Pattern Recognition | 2016

Efficient two-stream motion and appearance 3D CNNs for video classification

Ali Diba; Ali Mohammad Pazandeh; Luc Van Gool


International Journal of Computer Vision | 2017

DeepProposals: Hunting objects and actions by cascading deep convolutional layers

Amir Ghodrati; Ali Diba; Marco Pedersoli; Tinne Tuytelaars; Luc Van Gool


arXiv: Computer Vision and Pattern Recognition | 2017

Temporal 3D ConvNets: New Architecture and Transfer Learning for Video Classification

Ali Diba; Mohsen Fayyaz; Vivek Sharma; Amir Hossein Karami; Mohammad Mahdi Arzani; Rahman Yousefzadeh; Luc Van Gool

Collaboration


Dive into Ali Diba's collaborations.

Top Co-Authors

Vivek Sharma (Gjøvik University College)
Tinne Tuytelaars (Katholieke Universiteit Leuven)
Rainer Stiefelhagen (Karlsruhe Institute of Technology)
Amir Ghodrati (Katholieke Universiteit Leuven)
Marco Pedersoli (Katholieke Universiteit Leuven)
Bram Aerts (Katholieke Universiteit Leuven)
Davy Neven (Katholieke Universiteit Leuven)
Dries Hulens (Katholieke Universiteit Leuven)