Jordi Gonzàlez
Autonomous University of Barcelona
Publications
Featured research published by Jordi Gonzàlez.
Computer Vision and Image Understanding | 2012
Bhaskar Chakraborty; Michael Boelstoft Holte; Thomas B. Moeslund; Jordi Gonzàlez
Recent progress in the field of human action recognition points towards the use of Spatio-Temporal Interest Points (STIPs) for local descriptor-based recognition strategies. In this paper, we present a novel approach for robust and selective STIP detection, applying surround suppression combined with local and temporal constraints. This new method differs significantly from existing STIP detection techniques and improves performance by detecting more repeatable, stable and distinctive STIPs for human actors, while suppressing unwanted background STIPs. For action representation we use a bag-of-video-words (BoV) model of local N-jet features to build a vocabulary of visual words. To this end, we introduce a novel vocabulary-building strategy that combines spatial pyramid and vocabulary compression techniques, resulting in improved performance and efficiency. Action-class-specific Support Vector Machine (SVM) classifiers are trained for the categorization of human actions. A comprehensive set of experiments on popular benchmark datasets (KTH and Weizmann), more challenging datasets of complex scenes with background clutter and camera motion (CVC and CMU), movie and YouTube video clips (Hollywood 2 and YouTube), and complex scenes with multiple actors (MSR I and Multi-KTH) validates our approach and shows state-of-the-art performance. Due to the unavailability of ground-truth action annotation data for the Multi-KTH dataset, we introduce an actor-specific spatio-temporal clustering of STIPs to address the problem of automatic action annotation of multiple simultaneous actors. Additionally, we perform cross-data action recognition by training on source datasets (KTH and Weizmann) and testing on completely different and more challenging target datasets (CVC, CMU, MSR I and Multi-KTH). This demonstrates the robustness of our approach in realistic scenarios with separate training and test datasets, a setting that has generally been a shortcoming in the performance evaluation of human action recognition techniques.
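As a rough illustration of the bag-of-video-words pipeline described above, the sketch below quantizes local descriptors into a visual vocabulary and trains linear SVM classifiers. It is a minimal stand-in, assuming generic descriptors and plain k-means in place of the paper's N-jet features and compressed spatial-pyramid vocabulary; the data is synthetic.

```python
# Minimal bag-of-video-words sketch: cluster local STIP descriptors into a
# vocabulary, represent each video as a word histogram, train linear SVMs.
# Descriptors and labels here are synthetic placeholders.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

def bov_histogram(descriptors, vocab):
    """Quantize descriptors and return a normalized visual-word histogram."""
    words = vocab.predict(descriptors)
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

# Toy data: each "video" is a set of local descriptors extracted at STIPs.
videos = [rng.normal(loc=c, size=(40, 16)) for c in (0, 0, 1, 1, 2, 2)]
labels = [0, 0, 1, 1, 2, 2]

vocab = KMeans(n_clusters=32, n_init=5, random_state=0)
vocab.fit(np.vstack(videos))                  # build the visual vocabulary

X = np.array([bov_histogram(v, vocab) for v in videos])
clf = LinearSVC().fit(X, labels)              # one-vs-rest linear SVMs
print(clf.predict(X))
```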
Pattern Recognition | 2012
Noha M. Elfiky; Fahad Shahbaz Khan; Joost van de Weijer; Jordi Gonzàlez
Spatial pyramids have been successfully applied to incorporate spatial information into bag-of-words based image representations. However, a major drawback is that they lead to high-dimensional image representations. In this paper, we present a novel framework for obtaining compact pyramid representations. First, we investigate the use of the divisive information-theoretic feature clustering (DITC) algorithm to create a compact pyramid representation. In many cases this method allows us to reduce the size of a high-dimensional pyramid representation by up to an order of magnitude with little or no loss in accuracy. Furthermore, comparison with clustering based on the agglomerative information bottleneck (AIB) shows that our method obtains superior results at significantly lower computational cost. Moreover, we investigate the optimal combination of multiple features in the context of our compact pyramid representation. Finally, experiments show that the method obtains state-of-the-art results on several challenging data sets.
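The following is a simplified sketch of the idea behind DITC-style vocabulary compression, assuming we have word-class co-occurrence counts: visual words with similar class-conditional distributions p(c|w) are merged, so a long pyramid histogram collapses into a short one. The counts, sizes and the plain alternating-minimization loop are illustrative, not the paper's exact algorithm.

```python
# Simplified divisive information-theoretic word clustering: merge visual
# words whose class-conditional distributions are close in KL divergence.
import numpy as np

rng = np.random.default_rng(0)
counts = rng.integers(1, 50, size=(200, 10))         # word x class counts (toy)
p_c_w = counts / counts.sum(axis=1, keepdims=True)   # p(c|w) per visual word
p_w = counts.sum(axis=1) / counts.sum()              # word priors

def kl(p, q):
    """Row-wise KL divergence, broadcasting over cluster prototypes."""
    return np.sum(p * np.log(p / q), axis=-1)

k = 20                                        # compressed vocabulary size
assign = rng.integers(0, k, size=len(p_c_w))  # random initial partition
for _ in range(30):
    # Prototype of each cluster: prior-weighted mean of member distributions.
    protos = np.array([
        np.average(p_c_w[assign == j], axis=0, weights=p_w[assign == j])
        if np.any(assign == j) else np.full(p_c_w.shape[1], 1e-9)
        for j in range(k)])
    # Reassign every word to the prototype with the smallest KL divergence.
    new = np.argmin(kl(p_c_w[:, None, :], protos[None]), axis=1)
    if np.array_equal(new, assign):
        break
    assign = new

# Compress a 200-bin (pyramid) histogram into 20 bins by summing merged words.
hist = rng.integers(0, 5, size=200).astype(float)
compact = np.bincount(assign, weights=hist, minlength=k)
print(len(hist), "->", len(compact), "bins")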
Image and Vision Computing | 2009
Jordi Gonzàlez; Daniel Rowe; Javier Varona; F. Xavier Roca
In this paper, a Cognitive Vision System (CVS) is presented which describes the human behaviour observed in monitored scenes using natural-language text. This cognitive analysis of human movements recorded in image sequences is here referred to as Human Sequence Evaluation (HSE), which defines a set of transformation modules involved in the automatic generation of semantic descriptions from pixel values. In essence, the trajectories of human agents are obtained to generate textual interpretations of their motion, and also to infer the conceptual relationships of each agent with respect to its environment. For this purpose, a human behaviour model based on Situation Graph Trees (SGTs) is considered, which permits both bottom-up (hypothesis generation) and top-down (hypothesis refinement) analysis of dynamic scenes. The resulting system prototype interprets different kinds of behaviour and reports textual descriptions in multiple languages.
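As a toy illustration of the situation-based reasoning that SGTs enable, the snippet below checks a predicate on the agent state for the current situation (bottom-up confirmation) and otherwise tries the predicted successor situations (top-down refinement), emitting a text description. Real SGTs organize many such nodes hierarchically; the two situations, predicates and templates here are invented.

```python
# Each situation: (state predicate, text template, predicted successors).
situations = {
    "walking": (lambda a: a["speed"] > 0.5, "{name} is walking.",
                ["waiting"]),
    "waiting": (lambda a: a["speed"] <= 0.5, "{name} is standing still.",
                ["walking"]),
}

def describe(agent, current):
    pred, template, successors = situations[current]
    if pred(agent):                   # bottom-up: state confirms hypothesis
        return template.format(**agent), current
    for nxt in successors:            # top-down: try predicted successors
        pred, template, _ = situations[nxt]
        if pred(agent):
            return template.format(**agent), nxt
    return "{name}: behaviour unknown.".format(**agent), current

state = "walking"
for speed in (1.2, 0.1):              # simulated agent speeds over time
    text, state = describe({"name": "Agent-1", "speed": speed}, state)
    print(text)
```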
Image and Vision Computing | 2013
Javier Orozco; Ognjen Rudovic; Jordi Gonzàlez; Maja Pantic
In this paper, we propose an On-line Appearance-Based Tracker (OABT) for simultaneous tracking of 3D head pose, lips, eyebrows, eyelids and irises in monocular video sequences. In contrast to previously proposed tracking approaches, which deal with face and gaze tracking separately, our OABT handles eyelid and iris tracking in addition to tracking the 3D head pose and the facial actions of the lips and eyebrows. Furthermore, our approach learns changes in the appearance of the tracked target on-line. Hence, the prior training of appearance models, which usually requires a large number of labeled facial images, is avoided. Moreover, the proposed method is built upon a hierarchical combination of three OABTs, which are optimized using a Levenberg-Marquardt Algorithm (LMA) enhanced with line-search procedures. This, in turn, makes the proposed method robust to changes in lighting conditions, occlusions and translucent textures, as evidenced by our experiments. Finally, the proposed method achieves head and facial action tracking in real time.
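A sketch of the optimizer the tracker is built on: a Levenberg-Marquardt step with a backtracking line search. The residual being fitted below (a two-parameter exponential) is a placeholder, assuming nothing about the paper's actual appearance model.

```python
# Levenberg-Marquardt (damped Gauss-Newton) with backtracking line search.
import numpy as np

def lm_fit(residual, jacobian, x0, iters=50, lam=1e-3):
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        r, J = residual(x), jacobian(x)
        A = J.T @ J + lam * np.eye(len(x))        # damped normal equations
        step = np.linalg.solve(A, -J.T @ r)
        t, cost = 1.0, r @ r
        while t > 1e-6:                           # backtracking line search
            r_new = residual(x + t * step)
            if r_new @ r_new < cost:
                break
            t *= 0.5
        x = x + t * step
    return x

# Placeholder problem: fit y = a * exp(b t) to noiseless samples of 2 e^{-3t}.
t_data = np.linspace(0.0, 1.0, 30)
y = 2.0 * np.exp(-3.0 * t_data)
res = lambda p: p[0] * np.exp(p[1] * t_data) - y
jac = lambda p: np.stack([np.exp(p[1] * t_data),
                          p[0] * t_data * np.exp(p[1] * t_data)], axis=1)
print(lm_fit(res, jac, [1.0, -1.0]))              # converges near [2, -3]
```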
Interacting with Computers | 2009
Javier Varona; Antoni Jaume-i-Capó; Jordi Gonzàlez; Francisco J. Perales
In most existing human-computer interfaces, enactive knowledge, a new paradigm for natural interaction, has not yet been fully exploited. Recent technological advances make it possible to enhance interface perception naturally and significantly by means of visual inputs, through so-called Vision-Based Interfaces (VBI). In the present paper, we explore the recovery of the user's body posture by combining robust computer vision techniques with a well-known inverse kinematics algorithm in real time. Specifically, we focus on recognizing the user's motions that carry a particular meaning, that is, body gestures. Defining an appropriate representation of the user's body posture based on a temporal parameterization, we apply non-parametric techniques to learn and recognize the user's body gestures. This recognition scheme has been applied to controlling a computer videogame in real time, demonstrating the viability of the presented approach.
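A minimal sketch of the non-parametric recognition scheme: posture sequences are resampled to a fixed temporal parameterization and classified by nearest neighbour. The joint-angle-like sequences below are synthetic; the real system derives posture features from vision and inverse kinematics.

```python
# 1-NN gesture recognition over temporally parameterized posture sequences.
import numpy as np

rng = np.random.default_rng(0)

def resample(seq, n=20):
    """Fixed temporal parameterization: resample a (T, D) sequence to n frames."""
    idx = np.linspace(0, len(seq) - 1, n).astype(int)
    return seq[idx].ravel()

def make_gesture(freq, length):
    """Synthetic 2-DOF 'joint angle' trajectory standing in for real postures."""
    t = np.linspace(0.0, 1.0, length)
    return np.stack([np.sin(2 * np.pi * freq * t),
                     np.cos(2 * np.pi * freq * t)], axis=1)

# Training set: five variable-length examples of each of three gestures.
train = [(resample(make_gesture(f, int(rng.integers(30, 60)))), f)
         for f in (1, 2, 3) for _ in range(5)]
X = np.array([x for x, _ in train])
y = np.array([label for _, label in train])

# Recognize a noisy query gesture by nearest neighbour.
query = resample(make_gesture(2, 45) + rng.normal(0, 0.05, (45, 2)))
print("recognized gesture:", y[np.argmin(np.linalg.norm(X - query, axis=1))])
```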
Pattern Recognition | 2009
Ignasi Rius; Jordi Gonzàlez; Javier Varona; F. Xavier Roca
In this paper, we aim to reconstruct the 3D motion parameters of a human body model from the known 2D positions of a reduced set of joints in the image plane. To this end, an action-specific motion model is trained from a database of real motion-captured performances, and used within a particle filtering framework as a priori knowledge of human motion. First, our dynamic model guides the particles according to similar situations previously learnt. Then, the state space is constrained so that only feasible human postures are accepted as valid solutions at each time step. As a result, we are able to track the 3D configuration of the full human body from several cycles of walking motion sequences using only the 2D positions of a very reduced set of joints from lateral or frontal viewpoints.
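A skeletal version of the particle-filter update described: particles are propagated by a learned dynamic model (here assumed to be a trivial random walk), infeasible postures are filtered out by state-space limits, and weights come from the reprojection error of a reduced set of 2D joints. Dynamics, limits and the camera projection are all toy placeholders.

```python
# Particle filter with feasibility constraints and 2D-joint likelihood.
import numpy as np

rng = np.random.default_rng(0)
N, D = 200, 6                        # particles, pose dimensionality (toy)
LIMITS = (-np.pi, np.pi)             # feasible joint-angle range (assumption)

def dynamics(x):                     # stand-in for the learned motion model
    return x + rng.normal(0.0, 0.05, x.shape)

def project(x):                      # toy "camera": observe the first 2 dims
    return x[:, :2]

def pf_step(particles, z_2d, sigma=0.1):
    particles = dynamics(particles)
    ok = np.all((particles > LIMITS[0]) & (particles < LIMITS[1]), axis=1)
    err = np.linalg.norm(project(particles) - z_2d, axis=1)
    w = np.exp(-0.5 * (err / sigma) ** 2) * ok + 1e-12
    w /= w.sum()
    idx = rng.choice(len(particles), size=len(particles), p=w)
    return particles[idx]            # resampled posterior sample set

particles = rng.uniform(-0.5, 0.5, (N, D))
for z in ([0.1, 0.0], [0.2, 0.1], [0.3, 0.2]):   # observed 2D joint track
    particles = pf_step(particles, np.array(z))
print(particles.mean(axis=0)[:2])    # estimate follows the observations
```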
Neurocomputing | 2013
Ivan Huerta; Ariel Amato; Xavier Roca; Jordi Gonzàlez
This paper presents a novel algorithm for mobile-object segmentation from static background scenes, which is both robust and accurate under most of the common problems found in motion segmentation. In our first contribution, a case analysis of motion segmentation errors is presented, taking into account the inaccuracies associated with different cues, namely colour, edge and intensity. Our second contribution is a hybrid architecture which copes with the main issues observed in the case analysis by fusing the knowledge from the aforementioned three cues with a temporal difference algorithm. On the one hand, we enhance the colour and edge models to solve not only global and local illumination changes (i.e. shadows and highlights) but also camouflage in intensity. In addition, local information is also exploited to solve camouflage in chroma. On the other hand, the intensity cue is applied when the colour and edge cues are not available because their values are beyond the dynamic range. Additionally, a temporal difference scheme is included to segment motion where these three cues cannot be reliably computed, for example in background regions not visible during the training period. Lastly, our approach is extended to handle ghost detection. The proposed method obtains very accurate and robust motion segmentation results in multiple indoor and outdoor scenarios, while outperforming the most frequently cited state-of-the-art approaches.
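The per-pixel decision rule below sketches the cue-fusion idea: colour and edge differences against a background model decide where chroma is reliable, intensity takes over where it is not (values near the ends of the dynamic range), and temporal differencing covers pixels lacking a trained background. Thresholds and the background statistics are illustrative assumptions, not the paper's models.

```python
# Toy per-pixel fusion of colour, edge, intensity and temporal-difference cues.
import numpy as np

rng = np.random.default_rng(0)
H = W = 64
bg_rgb = rng.uniform(0.3, 0.7, (H, W, 3))          # background colour model
frame = bg_rgb.copy()
frame[20:40, 20:40] += 0.3                          # synthetic moving object
prev = bg_rgb.copy()                                # previous frame

def grad_mag(img):
    """Crude gradient magnitude as a stand-in for the edge cue."""
    g = img.mean(axis=2)
    return (np.abs(np.diff(g, axis=0, prepend=g[:1])) +
            np.abs(np.diff(g, axis=1, prepend=g[:, :1])))

intensity = frame.mean(axis=2)
chroma_ok = (intensity > 0.1) & (intensity < 0.9)   # colour cue is reliable
colour_fg = np.linalg.norm(frame - bg_rgb, axis=2) > 0.2
edge_fg = np.abs(grad_mag(frame) - grad_mag(bg_rgb)) > 0.15
inten_fg = np.abs(intensity - bg_rgb.mean(axis=2)) > 0.15
temp_fg = np.abs(frame - prev).mean(axis=2) > 0.1   # temporal difference cue

fg = np.where(chroma_ok, colour_fg | edge_fg, inten_fg) | temp_fg
print(int(fg.sum()), "foreground pixels")           # roughly the 20x20 object
```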
Signal Processing: Image Communication | 2008
Carles Fernández; Pau Baiget; Xavier Roca; Jordi Gonzàlez
The integration of cognitive capabilities in computer vision systems requires both high semantic expressiveness and the ability to cope with high computational costs, as large amounts of data are involved in the analysis. This contribution describes a cognitive vision system conceived to automatically provide high-level interpretations of complex real-time situations in outdoor and indoor scenarios, and eventually to maintain communication with casual end users in multiple languages. The main contributions are: (i) the design of an integrative multilevel architecture for cognitive surveillance purposes; (ii) the proposal of a coherent taxonomy of knowledge to guide the process of interpretation, which leads to the conception of a situation-based ontology; (iii) the use of situational analysis for content detection and the progressive interpretation of semantically rich scenes, by managing incomplete or uncertain knowledge; and (iv) the use of such an ontological background to enable multilingual capabilities and advanced end-user interfaces. Experimental results are provided to show the feasibility of the proposed approach.
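As a toy illustration of how a situation-based ontology can decouple detected situations from surface language, the snippet below stores one entry per situation with role slots and per-language templates. Situations, roles and phrasings are invented examples, not the paper's ontology.

```python
# A minimal situation-based "ontology": situations with role slots and
# per-language templates, enabling multilingual report generation.
ontology = {
    "enter_zone": {
        "roles": ("agent", "zone"),
        "en": "{agent} enters the {zone}.",
        "ca": "{agent} entra a {zone}.",
        "es": "{agent} entra en {zone}.",
    },
    "leave_object": {
        "roles": ("agent", "object"),
        "en": "{agent} leaves a {object} behind.",
        "ca": "{agent} abandona un {object}.",
        "es": "{agent} abandona un {object}.",
    },
}

def report(situation, lang="en", **bindings):
    entry = ontology[situation]
    missing = set(entry["roles"]) - set(bindings)
    if missing:
        raise ValueError(f"unbound roles: {missing}")
    return entry[lang].format(**bindings)

print(report("enter_zone", "en", agent="Pedestrian-3", zone="crosswalk"))
print(report("leave_object", "ca", agent="Pedestrian-3", object="bag"))
```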
Pattern Recognition Letters | 2011
Carles Fernández; P. Baiget; F. X. Roca; Jordi Gonzàlez
The fields of segmentation, tracking and behavior analysis demand challenging video resources for testing, in a scalable manner, complex scenarios such as crowded environments or scenes with rich semantics. Nevertheless, existing public databases cannot scale the number of agents present, which would be useful for studying long-term occlusions and crowds. Moreover, creating these resources is expensive and often too tailored to specific needs. We propose an augmented reality framework to increase the complexity of image sequences in terms of occlusions and crowds, in a scalable and controllable manner. Existing datasets can be extended with augmented sequences containing virtual agents. Such sequences are automatically annotated, thus facilitating evaluation in terms of segmentation, tracking, and behavior recognition. In order to easily specify the desired contents, we propose a natural language interface to convert input sentences into virtual agent behaviors. Experimental tests and validation in indoor, street, and soccer environments are provided to show the feasibility of the proposed approach in terms of robustness, scalability, and semantics.
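A minimal, pattern-based sketch of the kind of natural-language interface described: an input sentence is mapped to a virtual-agent specification that a simulator could instantiate and annotate automatically. The grammar and specification fields are invented for illustration.

```python
# Map a restricted natural-language sentence to a virtual-agent behaviour spec.
import re

PATTERN = re.compile(
    r"(?P<count>\d+) agents? (?P<action>walk|run) "
    r"from (?P<src>\w+) to (?P<dst>\w+)")

def parse_behaviour(sentence):
    m = PATTERN.match(sentence.lower())
    if not m:
        raise ValueError("unsupported sentence")
    return {
        "n_agents": int(m["count"]),
        "action": m["action"],
        "waypoints": [m["src"], m["dst"]],
        "annotate": True,   # ground truth comes free with synthetic agents
    }

print(parse_behaviour("3 agents walk from entrance to kiosk"))
```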
Pattern Recognition Letters | 2011
Marco Pedersoli; Jordi Gonzàlez; Andrew D. Bagdanov; Xavier Roca
Human detection is fundamental in many machine vision applications, such as video surveillance, driving assistance, action recognition and scene understanding. However, in most of these applications real-time performance is necessary, and current detection methods do not yet achieve it. This paper presents a new method for human detection based on a multiresolution cascade of Histograms of Oriented Gradients (HOG) that can greatly reduce the computational cost of the detection search without affecting accuracy. The method consists of a cascade of sliding-window detectors. Each detector is a linear Support Vector Machine (SVM) composed of HOG features at different resolutions, from coarse at the first level to fine at the last one. In contrast to previous methods, our approach uses a non-uniform stride of the sliding window that is defined by the feature resolution and allows the detection to be incrementally refined as the search proceeds from coarse to fine resolution. In this way, the speed-up of the cascade is due not only to the smaller number of features computed at the first levels of the cascade, but also to the reduced number of windows that need to be evaluated at the coarse resolution. Experimental results show that our method reaches a detection rate comparable with state-of-the-art HOG-based detectors, while at the same time the detection search is up to 23 times faster.
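The sketch below captures the coarse-to-fine search schematically: a cheap coarse stage scans with a large stride tied to its feature resolution, and only windows that pass are re-scored at progressively finer strides. The stage scorers are placeholders for the paper's multiresolution HOG features and linear SVMs.

```python
# Coarse-to-fine sliding-window cascade with stride tied to stage resolution.
import numpy as np

IMG_W, IMG_H, WIN = 320, 240, 64
TARGET = (128, 96)                    # hidden "person" location (toy)

def stage_score(x, y, blur):
    """Placeholder stage response: peaks at TARGET, flatter for coarse stages."""
    d = np.hypot(x - TARGET[0], y - TARGET[1])
    return 1.0 - d / (blur * WIN)

def cascade(strides=(32, 16, 8), thresh=0.5):
    step = strides[0]
    cands = [(x, y) for x in range(0, IMG_W - WIN, step)
                    for y in range(0, IMG_H - WIN, step)]
    for level in range(len(strides)):
        blur = 2.0 ** (len(strides) - 1 - level)   # coarse stages are blurrier
        passed = [(x, y) for (x, y) in cands
                  if stage_score(x, y, blur) > thresh]
        if level + 1 == len(strides):
            return passed                          # fine-resolution detections
        s = strides[level + 1]                     # refine at the finer stride
        cands = [(x + dx, y + dy) for (x, y) in passed
                 for dx in (0, s) for dy in (0, s)]
    return cands

hits = cascade()
print(len(hits), "fine-stride windows near", TARGET)
```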