
Publication


Featured research published by Gauthier Doquire.


Neurocomputing | 2013

Letters: Mutual information-based feature selection for multilabel classification

Gauthier Doquire; Michel Verleysen

This paper introduces a new methodology to perform feature selection in multi-label classification problems. Unlike previous works based on the χ² statistic, the proposed approach uses the multivariate mutual information criterion combined with a problem transformation and a pruning strategy. This allows us to consider the possible dependencies between the class labels and between the features during the feature selection process. A way to automatically set the pruning parameter is also proposed, based on a permutation test combined with a resampling strategy. Experiments carried out on both artificial and real-world datasets show the advantage of our approach over existing methods.
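The transformation-plus-pruning pipeline described above can be sketched as follows. This is a minimal illustration with hypothetical helper names, assuming discrete features; it uses a plain label-powerset transformation, a simple frequency-based pruning rule, and a plug-in MI estimator, not the authors' exact procedure (which also sets the pruning parameter automatically via a permutation test).

```python
from collections import Counter
import math

def label_powerset_prune(Y, min_count=2):
    """Map each label set (tuple of 0/1) to a single class; label
    combinations seen fewer than min_count times are merged into a
    catch-all class (a crude stand-in for the paper's pruning)."""
    counts = Counter(map(tuple, Y))
    return [t if counts[t] >= min_count else ('rare',) for t in map(tuple, Y)]

def mutual_information(x, y):
    """Plug-in MI estimate (in nats) for two discrete sequences."""
    n = len(x)
    pxy = Counter(zip(x, y)); px = Counter(x); py = Counter(y)
    return sum((c / n) * math.log((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())

def rank_features(X, Y, min_count=2):
    """Rank feature indices by MI with the transformed single label."""
    y = label_powerset_prune(Y, min_count)
    cols = list(zip(*X))
    return sorted(range(len(cols)),
                  key=lambda j: mutual_information(cols[j], y), reverse=True)
```

With a feature that fully determines the label sets and a noisy one, the informative feature is ranked first.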


International Conference on Artificial Neural Networks | 2011

Feature selection for multi-label classification problems

Gauthier Doquire; Michel Verleysen

This paper proposes the use of mutual information for feature selection in multi-label classification, a problem that has received surprisingly little attention. A pruned problem transformation method is first applied, transforming the multi-label problem into a single-label one. A greedy feature selection procedure based on multidimensional mutual information is then conducted. Results on three databases clearly demonstrate the benefit of the approach, which sharply reduces the dimension of the problem and enhances the performance of classifiers.
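The greedy search can be sketched as below, assuming discrete data and a plug-in estimate of the multidimensional mutual information; the function names are illustrative, not from the paper.

```python
from collections import Counter
import math

def joint_mi(cols, y):
    """Plug-in MI (in nats) between a tuple of discrete features and y."""
    x = list(zip(*cols))                 # rows of the selected sub-matrix
    n = len(y)
    pxy = Counter(zip(x, y)); px = Counter(x); py = Counter(y)
    return sum((c / n) * math.log((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())

def greedy_forward(X, y, k):
    """At each step, add the feature whose inclusion maximises the MI
    between the selected subset and the class label."""
    cols = list(zip(*X))
    selected, remaining = [], set(range(len(cols)))
    for _ in range(k):
        best = max(remaining,
                   key=lambda j: joint_mi([cols[i] for i in selected + [j]], y))
        selected.append(best)
        remaining.remove(best)
    return selected
```

Note that estimating the joint MI directly, rather than summing pairwise terms, is what lets the procedure account for feature interactions.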


Neurocomputing | 2013

A graph Laplacian based approach to semi-supervised feature selection for regression problems

Gauthier Doquire; Michel Verleysen

Feature selection is a task of fundamental importance for many data mining or machine learning applications, including regression. Surprisingly, most existing feature selection algorithms assume the problems they address are either supervised or unsupervised, while supervised and unsupervised samples are often simultaneously available in real-world applications. Semi-supervised feature selection methods are thus necessary, and many solutions have been proposed recently. However, almost all of them exclusively tackle classification problems. This paper introduces a semi-supervised feature selection algorithm specifically designed for regression problems. It relies on the notion of the Laplacian score, a quantity recently introduced in the unsupervised framework. Experimental results demonstrate the efficiency of the proposed algorithm.
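The Laplacian score that the method builds on can be sketched in numpy as below. This is the plain unsupervised score, under the common RBF-similarity graph construction; it does not reproduce the paper's semi-supervised regression variant, which additionally injects output information into the graph.

```python
import numpy as np

def laplacian_scores(X, sigma=1.0):
    """Unsupervised Laplacian score per feature; lower = the feature
    varies smoothly over the similarity graph (i.e. respects its
    structure) and is thus preferred."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise sq. dists
    S = np.exp(-d2 / (2 * sigma ** 2))                    # RBF similarities
    D = np.diag(S.sum(1))
    L = D - S                                             # graph Laplacian
    ones = np.ones(len(X))
    scores = []
    for r in range(X.shape[1]):
        f = X[:, r]
        f = f - (f @ D @ ones) / (ones @ D @ ones)        # remove weighted mean
        denom = f @ D @ f
        scores.append((f @ L @ f) / denom if denom > 0 else np.inf)
    return np.array(scores)
```

On data with two clusters separated along the first feature and noise in the second, the first feature receives the lower (better) score.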


Neurocomputing | 2013

Theoretical and empirical study on the potential inadequacy of mutual information for feature selection in classification

Benoît Frénay; Gauthier Doquire; Michel Verleysen

Mutual information is a widely used performance criterion for filter feature selection. However, despite its popularity and its appealing properties, mutual information is not always the most appropriate criterion. Indeed, contrary to what is sometimes hypothesized in the literature, looking for a feature subset maximizing the mutual information does not always guarantee a decrease in the misclassification probability, which is often the objective of interest. The first objective of this paper is thus to clearly illustrate this potential inadequacy and to emphasize that mutual information remains a heuristic, coming with no guarantee in terms of classification accuracy. Through extensive experiments, a deeper analysis of the cases in which mutual information is not a suitable criterion is then conducted. This analysis confirms the general interest of mutual information for feature selection. It also helps us better apprehend the behaviour of mutual information throughout a feature selection process and consequently make better use of it as a feature selection criterion.
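A small worked example makes the potential inadequacy concrete. The two joint distributions below are hand-picked illustrations, not taken from the paper; both give P(Y=1) = 0.5. Feature B carries strictly more mutual information about the class than feature A, yet its Bayes misclassification probability is strictly higher, so ranking by MI would pick the worse feature.

```python
import math

def mi_and_bayes_error(pmf):
    """pmf maps (x, y) -> probability; returns (MI in bits, Bayes error).
    Bayes error = sum over x of p(x) minus the largest joint mass at x."""
    px, py = {}, {}
    for (x, y), p in pmf.items():
        px[x] = px.get(x, 0) + p
        py[y] = py.get(y, 0) + p
    mi = sum(p * math.log2(p / (px[x] * py[y]))
             for (x, y), p in pmf.items() if p > 0)
    err = sum(px[x] - max(pmf.get((x, y), 0) for y in py) for x in px)
    return mi, err

# Feature A: P(A=0)=P(A=1)=0.5, P(Y=1|A=0)=0.9, P(Y=1|A=1)=0.1
A = {(0, 1): 0.45, (0, 0): 0.05, (1, 1): 0.05, (1, 0): 0.45}
# Feature B: P(B=0)=0.38 with Y=0 surely; P(B=1)=0.62 with P(Y=1|B=1)=0.5/0.62
B = {(0, 0): 0.38, (1, 1): 0.50, (1, 0): 0.12}

mi_a, err_a = mi_and_bayes_error(A)   # ≈ 0.531 bits, error 0.10
mi_b, err_b = mi_and_bayes_error(B)   # ≈ 0.561 bits, error 0.12
```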


Neurocomputing | 2012

Feature selection with missing data using mutual information estimators

Gauthier Doquire; Michel Verleysen

Feature selection is an important preprocessing task for many machine learning and pattern recognition applications, including regression and classification. Missing data are encountered in many real-world problems and have to be considered in practice. This paper addresses feature selection in prediction problems where some feature values are missing. To this end, the well-known mutual information criterion is used. More precisely, it is shown how a recently introduced nearest neighbours based mutual information estimator can be extended to handle missing data. This estimator has the advantage over traditional ones that it does not directly estimate any probability density function. Consequently, the mutual information may be reliably estimated even when the dimension of the space increases. Results on artificial as well as real-world datasets indicate that the method is able to select important features without the need for any imputation algorithm, under the assumption that data are missing completely at random. Moreover, experiments show that selecting the features before imputing the data generally increases the precision of the prediction models, in particular when the proportion of missing data is high.
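The nearest-neighbours estimator in question is the one of Kraskov et al.; a brute-force sketch of its plain, complete-data version is given below (the paper's extension to missing values is not reproduced). It never estimates a density: it only compares neighbour distances in the joint and marginal spaces.

```python
import numpy as np
from scipy.special import digamma

def knn_mi(x, y, k=3):
    """Kraskov et al. k-NN estimator (algorithm 1) of I(X;Y) in nats,
    for 1-D samples x and y; O(n^2) distance computation for clarity."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    dx = np.abs(x[:, None] - x[None, :])
    dy = np.abs(y[:, None] - y[None, :])
    dz = np.maximum(dx, dy)                      # max-norm in the joint space
    np.fill_diagonal(dz, np.inf)                 # exclude each point itself
    eps = np.sort(dz, axis=1)[:, k - 1]          # distance to k-th neighbour
    np.fill_diagonal(dx, np.inf)
    np.fill_diagonal(dy, np.inf)
    nx = (dx < eps[:, None]).sum(1)              # marginal neighbour counts
    ny = (dy < eps[:, None]).sum(1)
    return digamma(k) + digamma(n) - np.mean(digamma(nx + 1) + digamma(ny + 1))
```

Strongly dependent pairs yield a clearly positive estimate while independent pairs stay near zero.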


Neural Networks | 2013

Neural networks letter: Is mutual information adequate for feature selection in regression?

Benoît Frénay; Gauthier Doquire; Michel Verleysen

Feature selection is an important preprocessing step for many high-dimensional regression problems. One of the most common strategies is to select a relevant feature subset based on the mutual information criterion. However, no connection has been established yet between the use of mutual information and a regression error criterion in the machine learning literature. This is an important gap, since minimising such a criterion is ultimately the objective one is interested in. This paper demonstrates that, under some reasonable assumptions, features selected with the mutual information criterion are the ones minimising the mean squared error and the mean absolute error. On the contrary, it is also shown that the mutual information criterion can fail to select optimal features in some situations, which we characterise. The theoretical developments presented in this work are expected to lead in practice to a critical and efficient use of mutual information for feature selection.


Computational Statistics & Data Analysis | 2014

Estimating mutual information for feature selection in the presence of label noise

Benoît Frénay; Gauthier Doquire; Michel Verleysen

A method for feature selection in classification problems polluted by label noise is proposed. The performance of traditional feature selection algorithms often decreases sharply when some samples are wrongly labelled. A method based on a probabilistic label noise model combined with a nearest neighbours-based entropy estimator is introduced to robustly evaluate the mutual information, a popular relevance criterion for feature selection. A backward greedy search procedure is used in combination with this criterion to find relevant sets of features. Experiments establish that (i) there is a real need to take possible label noise into account when selecting features and (ii) the proposed methodology is effectively able to reduce the negative impact of the mislabelled data points on the feature selection process.
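The backward greedy search can be sketched generically as follows; `relevance` is a hypothetical stand-in for a criterion such as the robust mutual information estimate described above.

```python
def backward_greedy(n_features, relevance, n_keep):
    """Start from all features and repeatedly drop the one whose removal
    hurts the relevance criterion least, until n_keep features remain.
    `relevance` maps a sorted list of feature indices to a score."""
    selected = set(range(n_features))
    while len(selected) > n_keep:
        # The feature to drop is the one whose removal leaves the
        # highest-scoring remaining subset.
        worst = max(selected,
                    key=lambda j: relevance(sorted(selected - {j})))
        selected.remove(worst)
    return sorted(selected)
```

Backward elimination is a natural fit here because the criterion is evaluated on whole subsets, so redundant features are discarded in the context of the ones that remain.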


Information Sciences | 2013

Distance estimation in numerical data sets with missing values

Emil Eirola; Gauthier Doquire; Michel Verleysen; Amaury Lendasse

The possibility of missing or incomplete data is often ignored when describing statistical or machine learning methods, but as missing data are common in practice, the problem deserves consideration. A popular strategy is to fill in the missing values by imputation as a pre-processing step, but for many methods imputation is unnecessary and can yield sub-optimal results. Instead, appropriately estimating pairwise distances in a data set directly enables the use of any machine learning method using nearest neighbours or otherwise based on distances between samples. In this paper, it is shown that directly estimating distances tends to yield more accurate results than calculating distances from an imputed data set, and an algorithm to calculate the estimated distances is presented. The theoretical framework operates under the assumption of a multivariate normal distribution, but the algorithm is shown to be robust to violations of this assumption. The focus is on numerical data with a considerable proportion of missing values, and simulated experiments demonstrate accurate performance on several data sets.
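A heavily simplified sketch of the expected-distance idea: if each missing coordinate is modelled as an independent Gaussian with its feature's sample mean and variance, then E[(x_i − y_i)²] = (m_x − m_y)² + v_x + v_y per coordinate. The paper's actual method conditions on the observed coordinates of a multivariate normal, which this sketch deliberately does not do.

```python
import numpy as np

def expected_sq_dist(x, y, mu, var):
    """Expected squared Euclidean distance between two samples with
    missing values (np.nan). mu, var: per-feature sample mean/variance
    used as the marginal Gaussian parameters for missing coordinates."""
    mx = np.where(np.isnan(x), mu, x)    # mean of each coordinate of x
    my = np.where(np.isnan(y), mu, y)
    vx = np.where(np.isnan(x), var, 0.0) # variance: 0 where observed
    vy = np.where(np.isnan(y), var, 0.0)
    return float(((mx - my) ** 2 + vx + vy).sum())
```

Feeding these expected distances to a nearest-neighbour learner lets it run on incomplete data without a separate imputation step; note the variance terms make uncertain pairs look farther apart than plain mean imputation would.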


International Conference on Artificial Neural Networks | 2011

Graph Laplacian for semi-supervised feature selection in regression problems

Gauthier Doquire; Michel Verleysen

Feature selection is fundamental in many data mining or machine learning applications. Most of the algorithms proposed for this task assume that the data are either supervised or unsupervised, while in practice supervised and unsupervised samples are often simultaneously available. Semi-supervised feature selection is thus needed; it has been studied quite intensively these past few years, almost exclusively for classification problems. In this paper, a supervised and then a semi-supervised feature selection algorithm, specially designed for regression problems, are presented. Both are based on the Laplacian score, a quantity recently introduced in the unsupervised framework. Experimental evidence shows the efficiency of the two algorithms.


Data Warehousing and Knowledge Discovery | 2011

Feature selection with mutual information for uncertain data

Gauthier Doquire; Michel Verleysen

In many real-world situations, the data cannot be assumed to be precise. Indeed, uncertain data are often encountered, due for example to the imprecision of measurement devices or to continuously moving objects whose exact position is impossible to obtain. One way to model this uncertainty is to represent each data value as a probability distribution; recent works show that adequately taking the uncertainty into account generally leads to improved classification performance. Working with such a representation, this paper proposes to perform feature selection based on mutual information. Experiments on 8 UCI data sets show that the proposed approach is effective at selecting relevant features.

Collaboration


Dive into Gauthier Doquire's collaboration.

Top Co-Authors

Michel Verleysen
Université catholique de Louvain

Benoît Frénay
Université catholique de Louvain

Damien François
Université catholique de Louvain

Gaël de Lannoy
Université catholique de Louvain