Gemma Roig | Researchain

Publication

Featured research published by Gemma Roig.

International Conference on Computer Vision | 2013

Online Video SEEDS for Temporal Window Objectness

Gemma Roig; Xavier Boix; Santiago Manen; Luc Van Gool

Superpixel and objectness algorithms are broadly used as a pre-processing step to generate support regions and to speed up further computations. Recently, many algorithms have been extended to video in order to exploit the temporal consistency between frames. However, most methods are computationally too expensive for real-time applications. We introduce an online, real-time video superpixel algorithm based on the recently proposed SEEDS superpixels. A new capability is incorporated which delivers multiple diverse samples (hypotheses) of superpixels in the same image or video sequence. The multiple samples are shown to provide a strong cue to efficiently measure the objectness of image windows, and we introduce the novel concept of objectness in temporal windows. Experiments show that the video superpixels achieve comparable performance to state-of-the-art offline methods while running at 30 fps on a single 2.8 GHz i7 CPU. State-of-the-art performance on objectness is also demonstrated, yet orders of magnitude faster and extended to temporal windows in video.

International Conference on Computer Vision | 2013

Active MAP Inference in CRFs for Efficient Semantic Segmentation

Gemma Roig; Xavier Boix; Roderick de Nijs; Sebastian Ramos; Kolja Kühnlenz; Luc Van Gool

Most MAP inference algorithms for CRFs optimize an energy function knowing all the potentials. In this paper, we focus on CRFs where the computational cost of instantiating the potentials is orders of magnitude higher than that of MAP inference. This is often the case in semantic image segmentation, where most potentials are instantiated by slow classifiers fed with costly features. We introduce Active MAP inference 1) to select on the fly a subset of potentials to be instantiated in the energy function, leaving the rest of the parameters of the potentials unknown, and 2) to estimate the MAP labeling from such an incomplete energy function. Results for semantic segmentation benchmarks, namely PASCAL VOC 2010 and MSRC-21, show that Active MAP inference achieves similar levels of accuracy but with major efficiency gains.

International Conference on Computer Vision | 2011

Conditional Random Fields for multi-camera object detection

Gemma Roig; Xavier Boix; Horesh Ben Shitrit; Pascal Fua

We formulate a model for multi-class object detection in a multi-camera environment. To our knowledge, this is the first time that this problem has been addressed taking different object classes into account simultaneously. Given several images of the scene taken from different angles, our system estimates the ground plane location of the objects from the output of several object detectors applied at each viewpoint. We cast the problem as an energy minimization modeled with a Conditional Random Field (CRF). Instead of predicting the presence of an object at each image location independently, we simultaneously predict the labeling of the entire scene. Our CRF is able to take into account occlusions between objects and contextual constraints among them. We propose an effective iterative strategy that renders the underlying optimization problem tractable, and learn the parameters of the model with the max-margin paradigm. We evaluate the performance of our model on several challenging multi-camera pedestrian detection datasets, namely PETS 2009 [5] and the EPFL terrace sequence [9]. We also introduce a new dataset in which multiple classes of objects appear simultaneously in the scene. On this dataset, we show that our method effectively handles occlusions in the multi-class case.

Computer Vision and Pattern Recognition | 2013

Sparse Quantization for Patch Description

Xavier Boix; Michael Gygli; Gemma Roig; Luc Van Gool

The representation of local image patches is crucial for the good performance and efficiency of many vision tasks. Patch descriptors have been designed to generalize towards diverse variations, depending on the application, as well as the desired compromise between accuracy and efficiency. We present a novel formulation of patch description that serves such issues well. Sparse quantization lies at its heart. This allows for efficient encodings, leading to powerful, novel binary descriptors, yet also to the generalization of existing descriptors like SIFT or BRIEF. We demonstrate the capabilities of our formulation for both keypoint matching and image classification. Our binary descriptors achieve state-of-the-art results for two keypoint matching benchmarks, namely those by Brown and Mikolajczyk. For image classification, we propose new descriptors that perform similarly to SIFT on Caltech101 and PASCAL VOC07.

Intelligent Robots and Systems | 2012

On-line semantic perception using uncertainty

Roderick de Nijs; Sebastian Ramos; Gemma Roig; Xavier Boix; Luc Van Gool; Kolja Kühnlenz

Visual perception capabilities are still highly unreliable in unconstrained settings, and solutions might not be accurate in all regions of an image. Awareness of the uncertainty of perception is a fundamental requirement for proper high-level decision making in a robotic system. Yet, the uncertainty measure is often sacrificed to account for dependencies between object/region classifiers. This is the case for Conditional Random Fields (CRFs), whose success stems from their ability to infer the most likely world configuration, but which do not directly allow estimating the uncertainty of the solution. In this paper, we consider the setting of assigning semantic labels to the pixels of an image sequence. Instead of using a CRF, we employ a Perturb-and-MAP Random Field, a recently introduced probabilistic model that allows performing fast approximate sampling from its probability density function. This allows us to effectively compute the uncertainty of the solution, indicating the reliability of the most likely labeling in each region of the image. We report results on the CamVid dataset, a standard benchmark for semantic labeling of urban image sequences. In our experiments, we show the benefits of exploiting the uncertainty by putting more computational effort on the regions of the image that are less reliable, and using more efficient techniques for other regions, with little decrease in performance.
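
To make the sampling idea concrete, here is a minimal sketch of how per-pixel uncertainty could be estimated from Perturb-and-MAP samples. It is an illustration only, not the authors' implementation. It assumes a model with unary scores only (no pairwise terms), so the MAP step reduces to a per-pixel argmax; in that special case, adding Gumbel noise and taking the argmax samples exactly from the softmax of the scores. The function names `perturb_and_map_sample` and `label_entropy`, the number of samples, and the use of entropy as the uncertainty measure are all assumptions made for this example.

```python
import math
import random
from typing import List


def perturb_and_map_sample(unary: List[List[float]], rng: random.Random) -> List[int]:
    """Draw one labeling: add Gumbel noise to every score, then take the MAP.

    Assumption: there are no pairwise terms, so the MAP is a per-pixel argmax.
    A full model would call a MAP solver on the perturbed energy instead.
    """
    labeling = []
    for scores in unary:
        perturbed = []
        for s in scores:
            u = max(rng.random(), 1e-12)  # guard against log(0)
            perturbed.append(s - math.log(-math.log(u)))  # s + Gumbel(0, 1) noise
        labeling.append(max(range(len(scores)), key=perturbed.__getitem__))
    return labeling


def label_entropy(unary: List[List[float]], n_samples: int = 500, seed: int = 0) -> List[float]:
    """Estimate per-pixel uncertainty as the entropy (in nats) of sampled labels."""
    rng = random.Random(seed)
    n_labels = len(unary[0])
    counts = [[0] * n_labels for _ in unary]
    for _ in range(n_samples):
        for pixel, label in enumerate(perturb_and_map_sample(unary, rng)):
            counts[pixel][label] += 1
    entropies = []
    for pixel_counts in counts:
        h = 0.0
        for n in pixel_counts:
            if n:
                p = n / n_samples
                h -= p * math.log(p)
        entropies.append(h)
    return entropies


if __name__ == "__main__":
    # Hypothetical scores: pixel 0 is confident, pixel 1 is ambiguous.
    scores = [[5.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    print(label_entropy(scores))
```

Pixels with high entropy are the ones where the most likely labeling is least reliable, which is where a system of this kind would spend additional computation.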

European Conference on Computer Vision | 2012

Nested sparse quantization for efficient feature coding

Xavier Boix; Gemma Roig; Christian Leistner; Luc Van Gool

Many state-of-the-art methods in object recognition extract features from an image and encode them, followed by a pooling step and classification. Within this processing pipeline, the encoding step is often the bottleneck, for both computational efficiency and performance. We present a novel assignment-based encoding formulation. It allows for the fusion of assignment-based encoding and sparse coding into one formulation. We also use this to design a new, very efficient encoding. At the heart of our formulation lies a quantization into a set of k-sparse vectors, which we denote as sparse quantization. We design the new encoding as two nested sparse quantizations. Its efficiency stems from leveraging bit-wise representations. In a series of experiments on standard recognition benchmarks, namely Caltech 101, PASCAL VOC 07 and ImageNet, we demonstrate that our method achieves results that are competitive with the state of the art, while requiring orders of magnitude less time and memory. Our method is able to encode one million images using 4 CPUs in a single day, while maintaining good performance.
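
As a rough illustration of quantizing into k-sparse vectors, the sketch below maps a real-valued feature vector to its nearest binary vector with exactly k ones, then packs the result into an integer bit mask. It is a minimal sketch under stated assumptions: a binary codebook, Euclidean distance, and the hypothetical function names `sparse_quantize` and `to_bitmask`. It does not reproduce the nested encoding or the authors' implementation.

```python
from typing import List, Sequence


def sparse_quantize(x: Sequence[float], k: int) -> List[int]:
    """Return the binary vector with exactly k ones that is closest to x.

    For b in {0,1}^d with k ones, ||x - b||^2 = ||x||^2 - 2 * sum(x_i : b_i = 1) + k,
    so the minimiser places the ones at the k largest entries of x.
    """
    if not 0 <= k <= len(x):
        raise ValueError("k must be between 0 and len(x)")
    # Indices of the k largest entries (ties broken by lower index).
    top = sorted(range(len(x)), key=lambda i: (-x[i], i))[:k]
    code = [0] * len(x)
    for i in top:
        code[i] = 1
    return code


def to_bitmask(code: Sequence[int]) -> int:
    """Pack a binary code into an integer so codes can be compared bit-wise."""
    mask = 0
    for i, bit in enumerate(code):
        if bit:
            mask |= 1 << i
    return mask


if __name__ == "__main__":
    feature = [0.1, 0.9, -0.3, 0.7, 0.2]  # hypothetical feature values
    code = sparse_quantize(feature, k=2)
    print(code, bin(to_bitmask(code)))
```

Because each code is a small set of active bits, operations such as counting the shared active entries of two codes reduce to a bit-wise AND followed by a population count, which is one way bit-wise representations can make such encodings cheap.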

Face and Gesture 2011 | 2011

Hierarchical CRF with product label spaces for parts-based models

Gemma Roig; Xavier Boix; Fernando De la Torre; Joan Serrat; C. Vilella

Non-rigid object detection is a challenging open research problem in computer vision. It is a critical part of many applications such as image search, surveillance, human-computer interaction or image auto-annotation. Most successful approaches to non-rigid object detection make use of part-based models. In particular, Conditional Random Fields (CRFs) have been successfully embedded into a discriminative parts-based model framework due to their effectiveness for learning and inference (usually based on a tree structure). However, CRF-based approaches do not incorporate global constraints and only model pairwise interactions. This matters especially when modeling object classes that may have complex part interactions (e.g. facial features or body articulations), because neglecting them yields an oversimplified model with suboptimal performance. To overcome this limitation, this paper proposes a novel hierarchical CRF (HCRF). The main contribution is to build a hierarchy of part combinations by extending the label set to a hierarchy of product label spaces. In order to keep the inference computation tractable, we propose an effective method to reduce the new label set. We test our method on two applications: facial feature detection on the Multi-PIE database and human pose estimation on the Buffy dataset.

IEEE Transactions on Neural Networks | 2016

Learning to Predict Sequences of Human Visual Fixations

Ming Jiang; Xavier Boix; Gemma Roig; Juan Xu; Luc Van Gool; Qi Zhao

Most state-of-the-art visual attention models estimate the probability distribution of fixating the eyes on a location of the image, the so-called saliency map. Yet, these models do not predict the temporal sequence of eye fixations, which may be valuable for better predicting human eye fixations, as well as for understanding the role of the different cues during visual exploration. In this paper, we present a method for predicting the sequence of human eye fixations, which is learned from recorded human eye-tracking data. We use least-squares policy iteration (LSPI) to learn a visual exploration policy that mimics the recorded eye-fixation examples. The model uses a different set of parameters for the different stages of visual exploration, which capture the importance of the cues during the scanpath. In a series of experiments, we demonstrate the effectiveness of using LSPI for combining multiple cues at different stages of the scanpath. The learned parameters suggest that the low-level and high-level cues (semantics) are similarly important at the first eye fixation of the scanpath, and that the contribution of high-level cues keeps increasing during the visual exploration. Results show that our approach achieves state-of-the-art performance on two challenging data sets: 1) the OSIE data set and 2) the MIT data set.

International Conference on Computer Vision | 2009

Optimal feature selection for subspace image matching

Gemma Roig; Xavier Boix; Fernando De la Torre

Image matching has been a central research topic in computer vision over the last decades. Typical approaches to correspondence involve matching features between images. In this paper, we present a novel problem for establishing correspondences between a sparse set of image features and a previously learned subspace model. We formulate the matching task as an energy minimization, and jointly optimize over all possible feature assignments and parameters of the subspace model. This problem is in general NP-hard. We propose a convex relaxation approximation, and develop two optimization strategies: naive gradient descent and quadratic programming. Alternatively, we reformulate the optimization criterion as a sparse eigenvalue problem, and solve it using a recently proposed backward greedy algorithm. Experimental results on facial feature detection show that the quadratic programming solution provides a better selection mechanism for relevant features.

bioRxiv | 2018

Task-specific vision models explain task-specific areas of visual cortex

Kshitij Dwivedi; Gemma Roig

Computational models such as deep neural networks (DNNs) trained for classification are often used to explain responses of the visual cortex. However, not all the areas of the visual cortex are involved in object/scene classification. For instance, the scene-selective occipital place area (OPA) plays a role in mapping navigational affordances. Therefore, to explain the responses of such a task-specific brain area, we investigate whether a model that performs a related task can serve as a better computational model than a model that performs an unrelated task. We found that a DNN trained on a task (scene parsing) related to the function (navigational affordances) of a brain region (OPA) explains its responses better than a DNN trained on a task (scene classification) which is not explicitly related. In a subsequent analysis, we found that the DNNs that showed high correlation with a particular brain region were trained on a task that was consistent with the functions of that brain region reported in previous neuroimaging studies. Our results demonstrate that the task is paramount for selecting a computational model of a brain area. Further, explaining the responses of a brain area by a diverse set of tasks has the potential to shed some light on its functions.

Author summary

Areas in the human visual cortex are specialized for specific behaviors either due to supervision and interaction with the world or due to evolution. A standard way to gain insight into the function of these brain regions is to design experiments related to a particular behavior, and localize the regions showing significant relative activity corresponding to that behavior. In this work, we investigate whether we can figure out the function of a brain area in the visual cortex using computational vision models. From our results, we find that explaining the responses of a brain region using DNNs trained on a diverse set of possible vision tasks can help us gain insights into its function. The consistency of our results using DNNs with previous neuroimaging studies suggests that the brain region may be specialized for behavior similar to the tasks for which DNNs showed a high correlation with its responses.
