Damien François | Researchain

Publication

Featured researches published by Damien François.

IEEE Transactions on Knowledge and Data Engineering | 2007

The Concentration of Fractional Distances

Damien François; Vincent Wertz; Michel Verleysen

Nearest neighbor search and many other numerical data analysis tools most often rely on the Euclidean distance. When data are high dimensional, however, Euclidean distances seem to concentrate: all distances between pairs of data elements seem to be very similar. The relevance of the Euclidean distance has therefore been questioned in the past, and fractional norms (Minkowski-like norms with an exponent less than one) were introduced to fight the concentration phenomenon. This paper justifies the use of alternative distances to fight concentration by showing that concentration is indeed an intrinsic property of the distances and not an artifact of a finite sample. Furthermore, an estimation of the concentration as a function of the exponent of the distance and of the distribution of the data is given. It leads to the conclusion that, contrary to what is generally assumed, fractional norms are not always less concentrated than the Euclidean norm; a counterexample is given to prove this claim. Theoretical arguments are presented which show that the concentration phenomenon can appear for real data that do not match the hypotheses of the theorems, in particular the assumption of independent and identically distributed variables. Finally, some insights about how to choose an optimal metric are given.
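The concentration effect described in this abstract can be observed numerically. The sketch below is an illustration only, not the authors' experiment: it assumes points drawn uniformly from the unit hypercube and uses the ratio of the standard deviation to the mean of the norm as one possible concentration measure. The function names `minkowski_norm` and `relative_variance` are hypothetical.

```python
import random
import statistics


def minkowski_norm(x, p):
    """Minkowski norm for p >= 1, or fractional 'norm' for 0 < p < 1."""
    return sum(abs(v) ** p for v in x) ** (1.0 / p)


def relative_variance(dim, p, n_points=500, seed=0):
    """Std/mean of the norm of random points (assumed uniform in [0, 1]^dim).

    Smaller values mean the norms are more concentrated around their mean.
    """
    rng = random.Random(seed)
    norms = [
        minkowski_norm([rng.random() for _ in range(dim)], p)
        for _ in range(n_points)
    ]
    return statistics.pstdev(norms) / statistics.fmean(norms)


# Print the measure for a fractional, a Manhattan, and the Euclidean exponent.
for p in (0.5, 1.0, 2.0):
    row = [round(relative_variance(d, p), 3) for d in (2, 20, 200)]
    print(f"p={p}: dims (2, 20, 200) -> {row}")
```

For this particular uniform distribution the measure shrinks as the dimension grows for every exponent shown. Which exponent is least concentrated depends on the data distribution, which is the point the abstract makes.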

Chemometrics and Intelligent Laboratory Systems | 2006

Mutual information for the selection of relevant variables in spectrometric nonlinear modelling

Fabrice Rossi; Amaury Lendasse; Damien François; Vincent Wertz; Michel Verleysen

Data from spectrophotometers form vectors of a large number of exploitable variables. Building quantitative models using these variables most often requires using a smaller set of variables than the initial one. Indeed, too many input variables result in too many model parameters, leading to overfitting and poor generalization. In this paper, we suggest the use of the mutual information measure to select variables from the initial set. The mutual information measures the information content in input variables with respect to the model output, without making any assumption on the model that will be used; it is thus suitable for nonlinear modelling. In addition, it leads to the selection of variables among the initial set, and not to linear or nonlinear combinations of them. It therefore allows greater interpretability of the results, without decreasing model performance compared to variable projection methods.

International Conference on Artificial Neural Networks | 2005

The curse of dimensionality in data mining and time series prediction

Michel Verleysen; Damien François

Modern data analysis tools have to work on high-dimensional data, whose components are not independently distributed. High-dimensional spaces show surprising, counter-intuitive geometrical properties that have a large influence on the performance of data analysis tools. Among these properties, the concentration of the norm phenomenon means that Euclidean norms and Gaussian kernels, both commonly used in models, become inappropriate in high-dimensional spaces. This paper presents alternative distance measures and kernels, together with geometrical methods to decrease the dimension of the space. The methodology is applied to a typical time series prediction example.

Neurocomputing | 2007

Resampling methods for parameter-free and robust feature selection with mutual information

Damien François; Fabrice Rossi; Vincent Wertz; Michel Verleysen

Combining the mutual information criterion with a forward feature selection strategy offers a good trade-off between optimality of the selected feature subset and computation time. However, it requires setting the parameter(s) of the mutual information estimator and determining when to halt the forward procedure. These two choices are difficult to make because, as the dimensionality of the subset increases, the estimation of the mutual information becomes less and less reliable. This paper proposes to use resampling methods, namely K-fold cross-validation and the permutation test, to address both issues. The resampling methods bring information about the variance of the estimator, which can then be used to automatically set the parameter and to calculate a threshold to stop the forward procedure. The procedure is illustrated on a synthetic data set as well as on real-world examples.
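The permutation-test idea can be illustrated with a minimal sketch. It is not the authors' procedure: it assumes a simple histogram (plug-in) mutual information estimator with a fixed number of bins, and it tests a single candidate feature rather than running a full forward search. All names (`binned`, `mutual_information`, `permutation_threshold`) and the synthetic data are hypothetical.

```python
import math
import random
from collections import Counter


def binned(values, n_bins):
    """Map real values to equal-width bin indices 0 .. n_bins - 1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # guard against constant input
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def mutual_information(x, y, n_bins=5):
    """Plug-in histogram estimate of I(X; Y) in nats (an assumed estimator)."""
    bx, by = binned(x, n_bins), binned(y, n_bins)
    n = len(x)
    cx, cy, cxy = Counter(bx), Counter(by), Counter(zip(bx, by))
    return sum(
        (c / n) * math.log(c * n / (cx[a] * cy[b])) for (a, b), c in cxy.items()
    )


def permutation_threshold(x, y, n_perm=200, quantile=0.95, seed=0):
    """Quantile of the estimate under shuffled targets (no real dependence)."""
    rng = random.Random(seed)
    y_perm = list(y)
    null_values = []
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        null_values.append(mutual_information(x, y_perm))
    null_values.sort()
    return null_values[int(quantile * (n_perm - 1))]


# Synthetic data: the target depends nonlinearly on `relevant` only.
rng = random.Random(1)
relevant = [rng.uniform(-1, 1) for _ in range(300)]
noise = [rng.uniform(-1, 1) for _ in range(300)]
target = [v * v + 0.05 * rng.gauss(0, 1) for v in relevant]

for name, feature in (("relevant", relevant), ("noise", noise)):
    mi = mutual_information(feature, target)
    threshold = permutation_threshold(feature, target)
    print(name, round(mi, 3), round(threshold, 3), mi > threshold)
```

In a forward search, a check of this kind could be applied at each step: the procedure stops once the best remaining candidate no longer exceeds the threshold obtained from permuted targets.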

IEEE Transactions on Biomedical Engineering | 2012

Weighted Conditional Random Fields for Supervised Interpatient Heartbeat Classification

G. de Lannoy; Damien François; Jean Delbeke; Michel Verleysen

This paper proposes a method for the automatic classification of heartbeats in an ECG signal. Since this task has specific characteristics such as temporal dependencies between observations and a strong class imbalance, a specific classifier is proposed and evaluated on real ECG signals from the MIT arrhythmia database. This classifier is a weighted variant of the conditional random fields classifier. Experiments show that the proposed method outperforms previously reported heartbeat classification methods, especially for the pathological heartbeats.

International Work Conference on Artificial and Natural Neural Networks | 2009

On the effects of dimensionality on data analysis with neural networks

Michel Verleysen; Damien François; Geoffroy Simon; Vincent Wertz

Modern data analysis often faces high-dimensional data. Nevertheless, most neural network data analysis tools are not adapted to high-dimensional spaces, because they rely on conventional concepts (such as the Euclidean distance) that scale poorly with dimension. This paper shows some limitations of such concepts and suggests some research directions, such as the use of alternative distance definitions and of non-linear dimension reduction.

Chemometrics and Intelligent Laboratory Systems | 2008

A data-driven functional projection approach for the selection of feature ranges in spectra with ICA or cluster analysis

Catherine Krier; Fabrice Rossi; Damien François; Michel Verleysen

Prediction problems from spectra are frequently encountered in chemometrics. In addition to accurate predictions, it is often necessary to extract information about which wavelengths in the spectra contribute effectively to the quality of the prediction. This requires selecting wavelengths (or wavelength intervals), a problem associated with variable selection. In this paper, it is shown how this problem may be tackled in the specific case of smooth (for example infrared) spectra. The functional character of the spectra (their smoothness) is taken into account through a functional variable projection procedure. Contrary to standard approaches, the projection is performed on a basis that is driven by the spectra themselves, in order to best fit their characteristics. The methodology is illustrated by two examples of functional projection, using Independent Component Analysis and functional variable clustering, respectively. Performance on two standard infrared spectra benchmarks is illustrated.

Workshop on Self-Organizing Maps | 2009

Fault Prediction in Aircraft Engines Using Self-Organizing Maps

Marie Cottrell; Patrice Gaubert; Cédric Eloy; Damien François; Geoffroy Hallaux; Jérôme Lacaille; Michel Verleysen

Aircraft engines are designed to be used for several decades. Their maintenance is a challenging and costly task, for obvious safety reasons. The goal is to ensure proper operation of the engines, in all conditions, with a zero probability of failure, while taking aging into account. The fact that the same engine is sometimes used on several aircraft has to be taken into account too. Maintenance can be improved if an efficient procedure for the prediction of failures is implemented. The primary source of information on the health of the engines is measurements taken during flights. Several variables such as the core speed, the oil pressure and quantity, the fan speed, etc. are measured, together with environmental variables such as the outside temperature, altitude, aircraft speed, etc. In this paper, we describe the design of a procedure for visualizing successive data measured on aircraft engines. The data are multi-dimensional measurements on the engines, which are projected on a self-organizing map so that the trajectories of these data can be followed over time. The trajectories consist of a succession of points on the map, each of them corresponding to the two-dimensional projection of the multi-dimensional vector of engine measurements. Analyzing the trajectories makes it possible to visualize any deviation from normal behavior, and thus to anticipate an operation failure. However, raw engine measurements are inappropriate for such an analysis: they are influenced by external conditions, and may in addition vary between engines. In this work, we first process the data with a General Linear Model (GLM) to eliminate the effect of engines and of measured environmental conditions. The residuals are then used as inputs to a Self-Organizing Map for the easy visualization of trajectories.

arXiv: Learning | 2009

Advances in Feature Selection with Mutual Information

Michel Verleysen; Fabrice Rossi; Damien François

The selection of features that are relevant for a prediction or classification problem is an important problem in many domains involving high-dimensional data. Selecting features helps fight the curse of dimensionality, improve the performance of prediction or classification methods, and interpret the application. In a nonlinear context, the mutual information is widely used as a relevance criterion for features and sets of features. Nevertheless, it suffers from at least three major limitations: mutual information estimators depend on smoothing parameters, there is no theoretically justified stopping criterion in the greedy feature selection procedure, and the estimation itself suffers from the curse of dimensionality. This chapter shows how to deal with these problems. The first two are addressed by using resampling techniques that provide a statistical basis to select the estimator parameters and to stop the search procedure. The third one is addressed by modifying the mutual information criterion into a measure of how features are complementary (and not only informative) for the problem at hand.

Future Generation Computer Systems | 2005

Vector quantization: a weighted version for time-series forecasting

Amaury Lendasse; Damien François; Vincent Wertz; Michel Verleysen

Nonlinear time-series prediction offers potential performance increases compared to linear models. Nevertheless, the enhanced complexity and computation time often prohibit an efficient use of nonlinear tools. In this paper, we present a simple nonlinear procedure for time-series forecasting, based on the use of vector quantization techniques; the values to predict are considered as missing data, and the vector quantization methods are shown to be compatible with such missing data. This method offers an alternative to more complex prediction tools, while maintaining reasonable complexity and computation time.
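The "value to predict as missing data" idea can be sketched as follows. This is a simplified, unweighted illustration and not the weighted method of the paper: it assumes a plain Lloyd (k-means style) codebook, lagged windows of a toy sine series, and nearest-centroid lookup restricted to the known components. All function names and parameter values are hypothetical.

```python
import math
import random


def make_windows(series, lag):
    """Vectors of `lag` past values followed by the next value."""
    return [series[i:i + lag + 1] for i in range(len(series) - lag)]


def sq_dist(a, b, n_components):
    """Squared Euclidean distance over the first `n_components` entries."""
    return sum((a[i] - b[i]) ** 2 for i in range(n_components))


def train_codebook(vectors, n_codes, n_iter=30, seed=0):
    """Plain Lloyd iterations on the complete (past + next value) vectors."""
    rng = random.Random(seed)
    codebook = [list(v) for v in rng.sample(vectors, n_codes)]
    dim = len(vectors[0])
    for _ in range(n_iter):
        groups = [[] for _ in codebook]
        for v in vectors:
            best = min(range(n_codes), key=lambda k: sq_dist(v, codebook[k], dim))
            groups[best].append(v)
        for k, members in enumerate(groups):
            if members:  # keep the old centroid if a cell is empty
                codebook[k] = [sum(col) / len(members) for col in zip(*members)]
    return codebook


def predict_next(past, codebook):
    """Treat the next value as missing: match on past values only."""
    lag = len(past)
    nearest = min(codebook, key=lambda c: sq_dist(past, c, lag))
    return nearest[lag]  # read the missing component from the centroid


# Toy series (an assumption for illustration): a sampled sine wave.
series = [math.sin(0.2 * t) for t in range(400)]
lag = 4
train, test = series[:300], series[300:]
codebook = train_codebook(make_windows(train, lag), n_codes=40)
errors = [
    abs(predict_next(w[:lag], codebook) - w[lag]) for w in make_windows(test, lag)
]
mean_abs_error = sum(errors) / len(errors)
print("mean absolute error:", round(mean_abs_error, 3))
```

The forecast is simply the last component of the centroid that best matches the known past values, so prediction cost is one nearest-neighbor search in the codebook.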
