Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Nicolas Thome is active.

Publication


Featured research published by Nicolas Thome.


Computer Vision and Image Understanding | 2013

Pooling in image representation: The visual codeword point of view

Sandra Eliza Fontes de Avila; Nicolas Thome; Matthieu Cord; Eduardo Valle; Arnaldo de Albuquerque Araújo

In this work, we propose BossaNova, a novel representation for content-based concept detection in images and videos that enriches the Bag-of-Words model. Relying on the quantization of highly discriminant local descriptors by a codebook, and on the aggregation of those quantized descriptors into a single pooled feature vector, the Bag-of-Words model has emerged as the most promising approach for concept detection in visual documents. BossaNova enhances that representation by keeping a histogram of distances between the descriptors found in the image and those in the codebook, thus preserving important information about the distribution of the local descriptors around each codeword. Contrary to other approaches found in the literature, the non-parametric histogram representation is compact and simple to compute. BossaNova compares well with the state of the art on several standard datasets: MIRFLICKR, ImageCLEF 2011, PASCAL VOC 2007 and 15-Scenes, even without using complex combinations of different local descriptors. It also complements the cutting-edge Fisher Vector descriptors well, showing even better results when employed in combination with them. BossaNova also shows good results in the challenging real-world application of pornography detection.
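
As a toy illustration of the distance-histogram pooling idea, the sketch below gives each codeword a histogram of its distances to the image's local descriptors and concatenates those histograms into one signature. All names, dimensions, and bin settings are hypothetical, not the paper's tuned configuration:

```python
import numpy as np

def bossanova_pool(descriptors, codebook, n_bins=4, r_max=1.0):
    """Pool local descriptors into per-codeword histograms of
    descriptor-to-codeword distances, concatenated into one signature."""
    # Pairwise Euclidean distances: (n_descriptors, n_codewords).
    dists = np.linalg.norm(
        descriptors[:, None, :] - codebook[None, :, :], axis=2)
    signature = [np.histogram(dists[:, k], bins=n_bins,
                              range=(0.0, r_max))[0]
                 for k in range(codebook.shape[0])]
    return np.concatenate(signature).astype(float)

# Toy example: 500 SIFT-like descriptors against a 64-word codebook.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 128))
C = rng.normal(size=(64, 128))
print(bossanova_pool(X, C, r_max=20.0).shape)  # (64 * 4,) = (256,)
```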


IEEE Transactions on Circuits and Systems for Video Technology | 2008

A Real-Time, Multiview Fall Detection System: A LHMM-Based Approach

Nicolas Thome; Serge Miguet; Sebastien Ambellouis

Automatic detection of a falling person in video sequences has interesting applications in video surveillance and is an important part of future pervasive home-monitoring systems. In this paper, we propose a multiview approach to achieve this goal, in which motion is modeled using a layered hidden Markov model (LHMM). The posture classification is performed by a fusion unit that merges, in a fuzzy-logic context, the decisions provided by the cameras, each processed independently. In each view, fall detection is optimized in a given plane by performing a metric image rectification, which makes it possible to extract simple and robust features and is convenient for real-time operation. A theoretical analysis of the chosen descriptor enables us to define the optimal camera placement for detecting people falling in unspecified situations, and we prove that two cameras are sufficient in practice. Regarding event detection, the LHMM offers a principled way of solving the inference problem. Moreover, the hierarchical architecture decouples the motion analysis into different levels of temporal granularity, making the algorithm able to detect very sudden changes and robust to errors in the low-level steps.
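
An LHMM layers several HMMs, feeding one level's posteriors to the next as observations. As a reminder of the inference machinery underneath, here is a standard single-level HMM forward pass with a toy "upright"/"fallen" model; all probabilities are made up for illustration:

```python
import numpy as np

def forward(pi, A, B, obs):
    """Standard HMM forward pass: likelihood of an observation sequence.
    pi: initial state probs (S,); A: transition matrix (S, S);
    B: emission matrix (S, O); obs: sequence of observation indices."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

# Toy 2-state model ("upright" vs "fallen") with 3 observation symbols.
pi = np.array([0.9, 0.1])
A = np.array([[0.95, 0.05],
              [0.10, 0.90]])
B = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.2, 0.7]])
print(forward(pi, A, B, [0, 0, 2, 2]))
```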


Pattern Recognition | 2013

T-HOG: An effective gradient-based descriptor for single line text regions

Rodrigo Minetto; Nicolas Thome; Matthieu Cord; Neucimar J. Leite; Jorge Stolfi

We discuss the use of histogram of oriented gradients (HOG) descriptors as an effective tool for text description and recognition. Specifically, we propose a HOG-based texture descriptor (T-HOG) that uses a partition of the image into overlapping horizontal cells with gradual boundaries to characterize single-line texts in outdoor scenes. The input of our algorithm is a rectangular image presumed to contain a single line of text in Roman-like characters. The output is a relatively short descriptor that provides an effective input to an SVM classifier. Extensive experiments show that the T-HOG is more accurate than Dalal and Triggs's original HOG-based classifier for any descriptor size. In addition, we show that the T-HOG is an effective tool for text/non-text discrimination and can be used in various text-detection applications. In particular, combining the T-HOG with a permissive bottom-up text detector is shown to outperform state-of-the-art text-detection systems on two major publicly available databases.
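
A minimal sketch of the core T-HOG idea, orientation histograms pooled over overlapping horizontal bands with soft Gaussian boundaries, might look as follows; the cell count, bin count, and weighting are illustrative assumptions, not the paper's tuned values:

```python
import numpy as np

def t_hog(gray, n_cells=3, n_bins=9, sigma_frac=0.5):
    """T-HOG-like descriptor sketch: orientation histograms pooled over
    overlapping horizontal bands with soft (Gaussian) boundaries."""
    gy, gx = np.gradient(gray.astype(float))
    mag = np.hypot(gx, gy)                            # gradient magnitude
    ang = np.mod(np.arctan2(gy, gx), np.pi)           # unsigned orientation
    bins = np.minimum((ang / np.pi * n_bins).astype(int), n_bins - 1)

    h = gray.shape[0]
    rows = np.arange(h)
    centers = (np.arange(n_cells) + 0.5) * h / n_cells
    sigma = sigma_frac * h / n_cells

    desc = []
    for c in centers:
        w = np.exp(-0.5 * ((rows - c) / sigma) ** 2)  # soft band weights
        weighted = (mag * w[:, None]).ravel()
        desc.append(np.bincount(bins.ravel(), weights=weighted,
                                minlength=n_bins))
    desc = np.concatenate(desc)
    return desc / (np.linalg.norm(desc) + 1e-8)       # L2 normalization

# A 24x96 text-line crop yields a (3 * 9,) = (27,) descriptor.
print(t_hog(np.random.default_rng(0).random((24, 96))).shape)
```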


European Conference on Computer Vision | 2012

Unsupervised and supervised visual codes with restricted Boltzmann machines

Hanlin Goh; Nicolas Thome; Matthieu Cord; Joo-Hwee Lim

Recently, the coding of local features (e.g. SIFT) for image categorization tasks has been extensively studied. Incorporated within the Bag of Words (BoW) framework, these techniques optimize the projection of local features into the visual codebook, leading to state-of-the-art performance on many benchmark datasets. In this work, we propose a novel visual codebook learning approach that uses the restricted Boltzmann machine (RBM) as our generative model. Our contribution is three-fold. Firstly, we steer the unsupervised RBM learning using a regularization scheme, which decomposes into a combined prior for the sparsity of each feature's representation as well as the selectivity of each codeword. The codewords are then fine-tuned to be discriminative through supervised learning from top-down labels. Secondly, we evaluate the proposed method on the Caltech-101 and 15-Scenes datasets, either matching or outperforming state-of-the-art results. The codebooks are compact and inference is fast. Finally, we introduce an original method to visualize the codebooks and decipher what each visual codeword encodes.
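
For readers unfamiliar with RBM training, the sketch below shows one contrastive-divergence (CD-1) update with a simple sparsity penalty pulling the mean hidden activity toward a target rate; it is a generic stand-in that omits biases and the paper's exact sparsity/selectivity regularizer:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# Toy RBM coding layer: 128-d descriptors -> 64 codewords (hidden units).
# Hyperparameters are made up for illustration.
W = 0.01 * rng.normal(size=(128, 64))
lr, target_sparsity, lam = 0.05, 0.05, 0.1

def cd1_step(v0):
    """One CD-1 update with a simple sparsity penalty on hidden activity."""
    global W
    h0 = sigmoid(v0 @ W)                       # hidden probabilities
    h_sample = (rng.random(h0.shape) < h0).astype(float)
    v1 = sigmoid(h_sample @ W.T)               # mean-field reconstruction
    h1 = sigmoid(v1 @ W)
    grad = v0.T @ h0 - v1.T @ h1               # positive - negative phase
    sparsity_grad = v0.T @ (h0 - target_sparsity)
    W += lr * (grad - lam * sparsity_grad) / v0.shape[0]

batch = rng.random((32, 128))                  # 32 fake descriptors
for _ in range(100):
    cd1_step(batch)
print(sigmoid(batch @ W).mean())               # drifts toward the target
```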


Computer Vision and Pattern Recognition | 2013

Dynamic Scene Classification: Learning Motion Descriptors with Slow Features Analysis

Christian Theriault; Nicolas Thome; Matthieu Cord

In this paper, we address the challenging problem of categorizing video sequences composed of dynamic natural scenes. Contrary to previous methods that rely on hand-crafted descriptors, we propose to represent videos using unsupervised learning of motion features. Our method encompasses three main contributions: 1) Based on the Slow Feature Analysis principle, we introduce a learned local motion descriptor which represents the principal and most stable motion components of training videos. 2) We integrate our local motion feature into a global coding/pooling architecture in order to provide an effective signature for each video sequence. 3) We report state-of-the-art classification performance on two challenging natural-scene datasets. In particular, an outstanding improvement of 11% in classification score is reached on a dataset introduced in 2012.
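
Slow Feature Analysis itself has a closed-form linear solution: whiten the signal, then keep the directions along which the temporal derivative has the smallest variance. A textbook sketch (not the paper's learned video features) is shown below:

```python
import numpy as np

def sfa(X, n_components=1):
    """Linear Slow Feature Analysis: whiten the signal, then keep the
    directions along which the temporal derivative is smallest."""
    X = X - X.mean(axis=0)
    eigval, eigvec = np.linalg.eigh(np.cov(X, rowvar=False))
    Wht = eigvec / np.sqrt(eigval + 1e-8)      # whitening matrix
    Z = X @ Wht
    dZ = np.diff(Z, axis=0)                    # temporal derivative
    dval, dvec = np.linalg.eigh(np.cov(dZ, rowvar=False))
    return Wht @ dvec[:, :n_components]        # slowest direction(s)

# Toy signal: a slow sinusoid hidden inside a random linear mixture.
rng = np.random.default_rng(0)
t = np.linspace(0, 8 * np.pi, 2000)
S = np.c_[np.sin(0.05 * t), np.sin(3.0 * t), rng.normal(size=2000)]
X = S @ rng.normal(size=(3, 6)) + 0.01 * rng.normal(size=(2000, 6))
y = (X - X.mean(0)) @ sfa(X)                   # recovers the slow component
print(y.shape, np.abs(np.diff(y, axis=0)).mean())
```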


International Conference on Image Processing | 2011

BOSSA: Extended BoW formalism for image classification

S. Avila; Nicolas Thome; Matthieu Cord; Eduardo Valle; A. De Albuquerque Araujo

In image classification, the most powerful statistical learning approaches are based on the Bag-of-Words paradigm. In this article, we propose an extension of this formalism. Considering the Bag-of-Features extraction, dictionary coding, and pooling steps, we focus on the pooling step. Instead of using the classical sum or max pooling strategies, we introduce a density-function-based pooling strategy. This flexible formalism allows us to better represent the links between dictionary codewords and local descriptors in the resulting image signature. We evaluate our approach on two very challenging video and image classification tasks, involving very high-level semantic categories with large and nuanced visual diversity.
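
For contrast, the classical sum and max pooling baselines mentioned above collapse the coding matrix to a single statistic per codeword, which is exactly the distributional information a density-based pooling retains; a minimal sketch with hypothetical shapes:

```python
import numpy as np

# Hypothetical coding matrix: one row per local descriptor, one column
# per codeword (e.g. soft-assignment coefficients).
codes = np.random.default_rng(0).random((500, 256))

sum_pooled = codes.sum(axis=0)   # classical sum pooling -> (256,)
max_pooled = codes.max(axis=0)   # classical max pooling -> (256,)

# A density-based pooling instead keeps, per codeword, a histogram of
# the activation values, retaining their distribution rather than a
# single statistic (cf. the BossaNova sketch above).
print(sum_pooled.shape, max_pooled.shape)
```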


IEEE Transactions on Image Processing | 2013

Extended Coding and Pooling in the HMAX Model

Christian Theriault; Nicolas Thome; Matthieu Cord

This paper presents an extension of the HMAX model, a neural network model for image classification. The HMAX model can be described as a four-level architecture, with the first level consisting of multiscale and multiorientation local filters. We introduce two main contributions to this model. First, we improve the way the local filters at the first level are integrated into more complex filters at the last level, providing a flexible description of object regions and combining local information of multiple scales and orientations. These new filters are discriminative and yet invariant, two key aspects of visual classification. We evaluate their discriminative power and their level of invariance to geometrical transformations on a synthetic image set. Second, we introduce a multiresolution spatial pooling. This pooling encodes both local and global spatial information to produce discriminative image signatures. Classification results are reported on three image datasets: Caltech-101, Caltech-256, and 15-Scenes. We show significant improvements over previous architectures using a similar framework.
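
A common way to realize multiresolution spatial pooling is spatial-pyramid style: pool a feature map over grids of increasing resolution and concatenate the per-cell results. The sketch below follows that generic recipe; the grid levels and the use of max pooling are assumptions, not necessarily the paper's exact scheme:

```python
import numpy as np

def multires_pool(fmap, levels=(1, 2, 4)):
    """Spatial-pyramid-style pooling: max-pool a feature map over grids
    of increasing resolution and concatenate the per-cell results."""
    h, w, c = fmap.shape
    parts = []
    for n in levels:
        for i in range(n):
            for j in range(n):
                cell = fmap[i * h // n:(i + 1) * h // n,
                            j * w // n:(j + 1) * w // n]
                parts.append(cell.max(axis=(0, 1)))   # (c,) per cell
    return np.concatenate(parts)

# 32x32 map with 16 channels -> (1 + 4 + 16) cells * 16 = 336 values.
print(multires_pool(np.random.default_rng(0).random((32, 32, 16))).shape)
```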


Machine Vision and Applications | 2011

A cognitive and video-based approach for multinational License Plate Recognition

Nicolas Thome; Antoine Vacavant; Lionel Robinault; Serge Miguet

License Plate Recognition (LPR) is generally regarded as a solved problem. However, robust solutions able to face real-world scenarios still need to be proposed. Systems are mostly designed for a specific country, which lets them reach (artificially) high recognition rates but strictly limits their applicability. In this paper, we propose an approach that can deal with plates from various nations. There are three main areas of novelty. First, the Optical Character Recognition (OCR) is managed by a hybrid strategy combining statistical and structural algorithms. Secondly, an efficient probabilistic edit distance is proposed to provide explicit video-based LPR. Last but not least, cognitive loops are introduced at critical stages of the algorithm. These feedback steps take advantage of context modeling to increase the overall system performance and to overcome the inextricable parameter settings of the low-level processing. The system has been tested on more than 1200 static images with difficult illumination conditions and complex backgrounds, as well as on six different videos containing 525 moving vehicles. The evaluations prove our system to be very competitive among non-country-specific approaches.
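
The probabilistic edit distance can be pictured as a weighted Levenshtein distance whose substitution costs come from a character-confusion model; the sketch below uses a made-up confusion table (O/0, B/8, I/1) rather than probabilities learned from the OCR:

```python
import numpy as np

def prob_edit_distance(a, b, sub_cost, ins_del=1.0):
    """Weighted Levenshtein distance whose substitution costs come from
    a character-confusion model (e.g. -log P(read x | true y))."""
    n, m = len(a), len(b)
    D = np.zeros((n + 1, m + 1))
    D[:, 0] = np.arange(n + 1) * ins_del
    D[0, :] = np.arange(m + 1) * ins_del
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = min(D[i - 1, j] + ins_del,          # deletion
                          D[i, j - 1] + ins_del,          # insertion
                          D[i - 1, j - 1] + sub_cost(a[i - 1], b[j - 1]))
    return D[n, m]

# Made-up confusion table: typical OCR look-alikes are cheap to swap.
confusable = {frozenset("O0"), frozenset("B8"), frozenset("I1")}
def sub_cost(x, y):
    return 0.0 if x == y else (0.2 if frozenset((x, y)) in confusable else 1.0)

print(prob_edit_distance("AB123CD", "A8123CD", sub_cost))  # 0.2
```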


IEEE Transactions on Neural Networks | 2014

Learning deep hierarchical visual feature coding.

Hanlin Goh; Nicolas Thome; Matthieu Cord; Joo-Hwee Lim

In this paper, we propose a hybrid architecture that combines the image-modeling strengths of the bag-of-words framework with the representational power and adaptability of deep learning architectures. Local gradient-based descriptors, such as SIFT, are encoded via a hierarchical coding scheme composed of spatially aggregating restricted Boltzmann machines (RBMs). For each coding layer, we regularize the RBM by encouraging representations to fit both sparse and selective distributions. Supervised fine-tuning is used to enhance the quality of the visual representation for the categorization task. We performed a thorough experimental evaluation using three image categorization datasets. The hierarchical coding scheme achieved competitive categorization accuracies of 79.7% and 86.4% on the Caltech-101 and 15-Scenes datasets, respectively. The visual representations learned are compact and the model's inference is fast, compared with sparse coding methods. The low-level representations of descriptors learned with this method are generic features that we empirically found to be transferable between different image datasets. Further analysis reveals the significance of supervised fine-tuning when the architecture has two layers of representation as opposed to a single layer.
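
Once the coding layers are trained (e.g. by contrastive divergence, as sketched earlier), encoding amounts to stacking their deterministic activations and pooling the top-layer codes into an image signature; a minimal sketch with hypothetical layer sizes and randomly initialized stand-in weights:

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
rng = np.random.default_rng(1)

# Hypothetical pre-trained weights for two stacked coding layers
# (random stand-ins here; in the paper they come from regularized
# RBM training plus supervised fine-tuning).
W1 = 0.1 * rng.normal(size=(128, 64))   # SIFT-like input -> layer-1 codes
W2 = 0.1 * rng.normal(size=(64, 32))    # layer-1 codes -> layer-2 codes

def encode(descriptors):
    """Deterministic two-layer encoding of local descriptors."""
    h1 = sigmoid(descriptors @ W1)
    return sigmoid(h1 @ W2)

# Image signature: pool the top-layer codes over all local descriptors.
codes = encode(rng.random((500, 128)))
signature = codes.max(axis=0)            # max pooling -> (32,)
print(signature.shape)
```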


International Conference on Multimedia and Expo | 2015

Recipe recognition with large multimodal food dataset

Xin Wang; Devinder Kumar; Nicolas Thome; Matthieu Cord; Frédéric Precioso

This paper deals with automatic systems for image-based recipe recognition. For this purpose, we compare and evaluate leading vision-based and text-based technologies on a new, very large multimodal dataset (UPMC Food-101) containing about 100,000 recipes for a total of 101 food categories. Each item in this dataset is represented by one image plus textual information. We present extensive recipe-recognition experiments on our dataset using visual information, textual information, and their fusion. Additionally, we present experiments with text-based embedding technology to represent any food word in a continuous semantic space. We also compare our dataset features with a twin dataset provided by ETHZ: we revisit their data-collection protocols and carry out transfer-learning schemes to highlight similarities and differences between the two datasets. Finally, we propose a real application for daily users to identify recipes: a web search engine that allows any mobile device to send a query image and retrieve the most relevant recipes in our dataset.
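
A simple way to combine the visual and textual classifiers is late fusion of their per-class scores; the sketch below uses a convex combination with a hypothetical mixing weight (the paper's actual fusion setup may differ):

```python
import numpy as np

def late_fusion(visual_scores, text_scores, alpha=0.5):
    """Convex combination of per-class scores from two modalities.
    alpha is a hypothetical mixing weight, not a value from the paper."""
    return alpha * visual_scores + (1.0 - alpha) * text_scores

# Toy example: 101 food classes, scores from two stand-in classifiers.
rng = np.random.default_rng(0)
visual = rng.random(101)
textual = rng.random(101)
print(int(np.argmax(late_fusion(visual, textual))))  # predicted class
```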

Collaboration


Dive into Nicolas Thome's collaborations.
