Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Michel Verleysen is active.

Publications


Featured research published by Michel Verleysen.


Scientific Reports | 2013

Unique in the Crowd: The privacy bounds of human mobility

Yves-Alexandre de Montjoye; César A. Hidalgo; Michel Verleysen; Vincent D. Blondel

We study fifteen months of human mobility data for one and a half million individuals and find that human mobility traces are highly unique. In fact, in a dataset where the location of an individual is specified hourly, and with a spatial resolution equal to that given by the carrier's antennas, four spatio-temporal points are enough to uniquely identify 95% of the individuals. We coarsen the data spatially and temporally to find a formula for the uniqueness of human mobility traces given their resolution and the available outside information. This formula shows that the uniqueness of mobility traces decays approximately as the 1/10 power of their resolution. Hence, even coarse datasets provide little anonymity. These findings represent fundamental constraints to an individual's privacy and have important implications for the design of frameworks and institutions dedicated to protecting the privacy of individuals.
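
The decay-law claim above can be made concrete with a toy experiment. The sketch below is a hedged illustration rather than the paper's protocol: it draws synthetic hourly traces and measures how often a handful of random spatio-temporal points singles out one individual. All sizes and the uniform antenna model are assumptions; real traces are far more structured, which is precisely what makes so few points identifying.

```python
# Toy uniqueness experiment on synthetic mobility traces. The synthetic
# data and all sizes are illustrative assumptions; the study itself used
# fifteen months of carrier call-detail records.
import numpy as np

rng = np.random.default_rng(0)
n_users, n_hours, n_antennas = 10_000, 24 * 7, 500

# Each user's trace: the antenna observed in each hourly time slot.
traces = rng.integers(0, n_antennas, size=(n_users, n_hours))

def fraction_unique(n_points, n_trials=200):
    """Fraction of sampled users uniquely identified by n_points random
    (time slot, antenna) observations taken from their own trace."""
    unique = 0
    for _ in range(n_trials):
        user = rng.integers(n_users)
        slots = rng.choice(n_hours, size=n_points, replace=False)
        matches = np.all(traces[:, slots] == traces[user, slots], axis=1)
        unique += matches.sum() == 1          # only the user matches
    return unique / n_trials

for p in (1, 2, 3, 4):
    print(f"{p} point(s): {fraction_unique(p):.2f} unique")
```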


IEEE Transactions on Neural Networks | 2014

Classification in the Presence of Label Noise: A Survey

Benoît Frénay; Michel Verleysen

Label noise is an important issue in classification, with many potential negative consequences. For example, the accuracy of predictions may decrease, whereas the complexity of inferred models and the number of necessary training samples may increase. Many works in the literature have been devoted to the study of label noise and the development of techniques to deal with it. However, the field lacks a comprehensive survey of the different types of label noise, their consequences, and the algorithms that consider label noise. This paper proposes to fill this gap. First, the definitions and sources of label noise are considered and a taxonomy of the types of label noise is proposed. Second, the potential consequences of label noise are discussed. Third, label noise-robust, label noise cleansing, and label noise-tolerant algorithms are reviewed. For each category of approaches, a short discussion is provided to help practitioners choose the most suitable technique for their own particular field of application. Finally, the design of experiments is also discussed, which may interest researchers who would like to test their own algorithms. In this paper, label noise consists of mislabeled instances: no additional information, such as confidences on labels, is assumed to be available.
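
To make one branch of this taxonomy concrete, the sketch below illustrates classification filtering, a common label noise cleansing approach: instances misclassified under cross-validation are treated as suspect and removed. The dataset, noise rate, and choice of k-NN as the filter are assumptions for illustration, not a method prescribed by the survey.

```python
# Minimal label-noise cleansing by classification filtering: discard
# training instances whose label disagrees with a cross-validated
# prediction. All settings here are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, random_state=0)
noisy = y.copy()
flip = rng.random(len(y)) < 0.10      # inject 10% symmetric label noise
noisy[flip] ^= 1

# Each instance is predicted by a model trained on the other folds.
pred = cross_val_predict(KNeighborsClassifier(n_neighbors=5), X, noisy, cv=5)
keep = pred == noisy                  # keep labels consistent with neighbours
print(f"kept {keep.sum()}/{len(y)} instances; "
      f"{(~keep & flip).sum()} of {flip.sum()} flipped labels removed")
```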


IEEE Transactions on Knowledge and Data Engineering | 2007

The Concentration of Fractional Distances

Damien François; Vincent Wertz; Michel Verleysen

Nearest neighbor search and many other numerical data analysis tools most often rely on the use of the Euclidean distance. When data are high-dimensional, however, Euclidean distances seem to concentrate: all distances between pairs of data elements seem to be very similar. Therefore, the relevance of the Euclidean distance has been questioned in the past, and fractional norms (Minkowski-like norms with an exponent less than one) were introduced to fight the concentration phenomenon. This paper justifies the use of alternative distances to fight concentration by showing that the concentration is indeed an intrinsic property of the distances and not an artifact of a finite sample. Furthermore, an estimation of the concentration as a function of the exponent of the distance and of the distribution of the data is given. It leads to the conclusion that, contrary to what is generally admitted, fractional norms are not always less concentrated than the Euclidean norm; a counterexample is given to prove this claim. Theoretical arguments are presented which show that the concentration phenomenon can appear for real data that do not match the hypotheses of the theorems, in particular the assumption of independent and identically distributed variables. Finally, some insights about how to choose an optimal metric are given.
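
The concentration effect is easy to reproduce numerically. The sketch below, under the i.i.d. uniform assumption that is one of the settings analysed in the paper, computes the relative contrast (max minus min, over min) of distances from a query to a sample for several Minkowski exponents, including fractional ones; the contrast shrinks as the dimension grows for every exponent.

```python
# Relative contrast of Minkowski distances as dimension grows; i.i.d.
# uniform data is an assumption matching one analysed setting.
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(dim, p, n=500):
    X = rng.uniform(size=(n, dim))                     # sample
    q = rng.uniform(size=dim)                          # query point
    d = np.sum(np.abs(X - q) ** p, axis=1) ** (1 / p)  # Minkowski distances
    return (d.max() - d.min()) / d.min()

for dim in (2, 20, 200, 2000):
    print(dim, {p: round(relative_contrast(dim, p), 3)
                for p in (0.5, 1.0, 2.0)})
```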


Chemometrics and Intelligent Laboratory Systems | 2006

Mutual information for the selection of relevant variables in spectrometric nonlinear modelling

Fabrice Rossi; Amaury Lendasse; Damien François; Vincent Wertz; Michel Verleysen

Data from spectrophotometers form vectors of a large number of exploitable variables. Building quantitative models from these variables most often requires using a smaller set of variables than the initial one. Indeed, too many input variables to a model result in too many parameters, leading to overfitting and poor generalization abilities. In this paper, we suggest the use of the mutual information measure to select variables from the initial set. The mutual information measures the information content of input variables with respect to the model output, without making any assumption about the model that will be used; it is thus suitable for nonlinear modelling. In addition, it leads to the selection of variables from the initial set, and not to linear or nonlinear combinations of them. It therefore allows greater interpretability of the results, without decreasing model performance compared to other variable projection methods.
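
As a minimal sketch of the idea, the snippet below ranks candidate variables by their estimated mutual information with the output, making no assumption on the downstream model. scikit-learn's k-NN based mutual_info_regression stands in for the estimator; the synthetic "spectral" data and the nonlinear target are assumptions.

```python
# Rank variables by estimated mutual information with a nonlinear output.
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 50))                  # 50 candidate variables
y = np.sin(X[:, 3]) + X[:, 17] ** 2 + 0.1 * rng.normal(size=400)

mi = mutual_info_regression(X, y, random_state=0)
top = np.argsort(mi)[::-1][:5]
print("top variables by mutual information:", top)   # 3 and 17 should lead
```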


Neurocomputing | 2004

Nonlinear projection with curvilinear distances: Isomap versus curvilinear distance analysis

John Aldo Lee; Amaury Lendasse; Michel Verleysen

Dimension reduction techniques are widely used for the analysis and visualization of complex sets of data. This paper compares two recently published methods for nonlinear projection: Isomap and Curvilinear Distance Analysis (CDA). In contrast to traditional linear PCA, these methods work like multidimensional scaling (MDS), by reproducing in the projection space the pairwise distances measured in the data space. However, they differ from classical linear MDS in the metrics they use and in the way they build the mapping (algebraic or neural). While Isomap relies directly on traditional MDS, CDA is based on a nonlinear variant of MDS, called Curvilinear Component Analysis (CCA). Although Isomap and CDA share the same metric, the comparison highlights their respective strengths and weaknesses.
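
The metric difference is easy to see on a classic benchmark. CDA is not available in scikit-learn, so in the hedged sketch below plain metric MDS (Euclidean distances) stands in for the Euclidean baseline while Isomap uses graph-based curvilinear distances; the Swiss roll and all parameters are assumptions, not the paper's experiments.

```python
# Isomap (geodesic distances) versus metric MDS (Euclidean distances)
# on a Swiss roll; only Isomap can "unroll" the manifold.
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import MDS, Isomap

X, _ = make_swiss_roll(n_samples=1000, random_state=0)

Z_iso = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
Z_mds = MDS(n_components=2, random_state=0).fit_transform(X)
print(Z_iso.shape, Z_mds.shape)       # both (1000, 2)
```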


International Conference on Artificial Neural Networks | 2005

The curse of dimensionality in data mining and time series prediction

Michel Verleysen; Damien François

Modern data analysis tools have to work on high-dimensional data, whose components are not independently distributed. High-dimensional spaces show surprising, counter-intuitive geometrical properties that have a large influence on the performance of data analysis tools. Among these properties, the concentration of the norm phenomenon means that Euclidean norms and Gaussian kernels, both commonly used in models, become inappropriate in high-dimensional spaces. This paper presents alternative distance measures and kernels, together with geometrical methods to decrease the dimension of the space. The methodology is applied to a typical time series prediction example.
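
The kernel side of this phenomenon can be checked directly: as the dimension grows, pairwise distances cluster around their mean, so Gaussian kernel values become nearly constant and lose discriminative power. The sketch below uses i.i.d. Gaussian data and a median-distance bandwidth, both assumptions for illustration.

```python
# Concentration of Euclidean distances and the resulting flattening of
# Gaussian kernel values in high dimension.
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for dim in (2, 20, 200, 2000):
    d = pdist(rng.normal(size=(200, dim)))        # pairwise distances
    k = np.exp(-d**2 / (2 * np.median(d)**2))     # Gaussian kernel values
    print(dim, "dist spread:", round(d.std() / d.mean(), 4),
          "kernel spread:", round(k.std(), 4))
```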


Neurocomputing | 2009

Quality assessment of dimensionality reduction: Rank-based criteria

John Aldo Lee; Michel Verleysen

Dimensionality reduction aims at providing low-dimensional representations of high-dimensional data sets. Many new nonlinear methods have been proposed in recent years, yet the question of their assessment and comparison remains open. This paper first reviews some of the existing quality measures that are based on distance ranking and K-ary neighborhoods. Next, the definition of the co-ranking matrix provides a tool for comparing the ranks in the initial data set and in some low-dimensional embedding. Rank errors and concepts such as neighborhood intrusions and extrusions can then be associated with different blocks of the co-ranking matrix. Several quality criteria can be cast within this unifying framework; they are shown to involve one or several of these characteristic blocks. Following this line, simple criteria are proposed, which quantify two aspects of the embedding quality, namely its overall quality and its tendency to favor intrusions or extrusions. They are applied to several recent dimensionality reduction methods in two experiments, with both artificial and real data.
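
A minimal sketch of the central object is given below: entry (i, j) of the co-ranking matrix counts point pairs whose neighborhood rank is i in the data space and j in the embedding, so mass off the diagonal corresponds to rank errors (intrusions and extrusions). The brute-force construction and the coordinate-dropping "embedding" are assumptions for illustration only.

```python
# Brute-force co-ranking matrix between a data set and an embedding.
import numpy as np
from scipy.spatial.distance import cdist

def ranks(X):
    d = cdist(X, X)
    return d.argsort(axis=1).argsort(axis=1)    # rank of j as neighbour of i

def coranking(X_high, X_low):
    rh, rl = ranks(X_high), ranks(X_low)
    n = len(X_high)
    Q = np.zeros((n - 1, n - 1), dtype=int)
    for i in range(n):
        for j in range(n):
            if i != j:                          # self has rank 0, skipped
                Q[rh[i, j] - 1, rl[i, j] - 1] += 1
    return Q

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
Q = coranking(X, X[:, :2])                      # crude embedding: drop coords
print("preserved ranks:", Q.trace() / Q.sum())  # diagonal mass fraction
```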


IEEE Transactions on Neural Networks | 1998

Image compression by self-organized Kohonen map

Christophe Amerijckx; Michel Verleysen; Philippe Thissen; Jean-Didier Legat

This paper presents a compression scheme for digital still images that uses Kohonen's neural network algorithm, not only for its vector quantization feature, but also for its topological property. This property allows an increase of about 80% in the compression rate. Compared to the JPEG standard, this compression scheme shows better performance (in terms of PSNR) for compression rates higher than 30.
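
The vector quantization half of such a scheme fits in a few lines. The sketch below trains a small self-organized map on 4x4 image blocks and assigns each block a codebook index; the paper's key extra step, exploiting the map's topological ordering so that the index stream compresses well downstream, is omitted. The random stand-in image and all hyperparameters are assumptions.

```python
# Minimal SOM vector quantizer for 4x4 image blocks (codebook part only).
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((64, 64))                            # stand-in image
blocks = img.reshape(16, 4, 16, 4).transpose(0, 2, 1, 3).reshape(-1, 16)

side = 8                                              # 8x8 map of 16-D units
W = rng.random((side * side, 16))                     # codebook weights
grid = np.array([(i, j) for i in range(side) for j in range(side)])

for t in range(2000):                                 # online SOM training
    x = blocks[rng.integers(len(blocks))]
    win = np.argmin(((W - x) ** 2).sum(axis=1))       # best-matching unit
    lr = 0.5 * (1 - t / 2000)                         # decaying learning rate
    sigma = 3.0 * (1 - t / 2000) + 0.5                # shrinking neighbourhood
    h = np.exp(-((grid - grid[win]) ** 2).sum(axis=1) / (2 * sigma ** 2))
    W += lr * h[:, None] * (x - W)                    # pull neighbours toward x

codes = ((blocks[:, None, :] - W[None]) ** 2).sum(-1).argmin(1)
print(f"{len(blocks)} blocks quantized onto {len(W)} topologically ordered codes")
```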


Neurocomputing | 2007

Resampling methods for parameter-free and robust feature selection with mutual information

Damien François; Fabrice Rossi; Vincent Wertz; Michel Verleysen

Combining the mutual information criterion with a forward feature selection strategy offers a good trade-off between optimality of the selected feature subset and computation time. However, it requires setting the parameter(s) of the mutual information estimator and determining when to halt the forward procedure. These two choices are difficult to make because, as the dimensionality of the subset increases, the estimation of the mutual information becomes less and less reliable. This paper proposes using resampling methods, namely K-fold cross-validation and the permutation test, to address both issues. The resampling methods bring information about the variance of the estimator, which can then be used to automatically set the parameter and to calculate a threshold at which to stop the forward procedure. The procedure is illustrated on a synthetic data set as well as on real-world examples.
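
A hedged sketch of the stopping rule is given below: the forward search halts when the best candidate's estimated mutual information no longer exceeds a permutation-test threshold, i.e. the MI obtained with shuffled outputs (chance level). For brevity, candidates are scored by their individual MI with the output rather than the joint MI of the extended subset; the estimator, data, and all settings are assumptions.

```python
# Forward variable selection with a permutation-test stopping rule.
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
y = np.sin(X[:, 2]) + X[:, 7] ** 2 + 0.1 * rng.normal(size=300)

selected, remaining = [], list(range(X.shape[1]))
while remaining:
    scores = {j: mutual_info_regression(X[:, [j]], y, random_state=0)[0]
              for j in remaining}
    best = max(scores, key=scores.get)
    # Chance level: MI of the best candidate with permuted outputs.
    perm = [mutual_info_regression(X[:, [best]], rng.permutation(y),
                                   random_state=0)[0] for _ in range(20)]
    if scores[best] <= np.quantile(perm, 0.95):
        break                          # gain indistinguishable from chance
    selected.append(best)
    remaining.remove(best)

print("selected variables:", selected)   # ideally [7, 2] or [2, 7]
```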


European Symposium on Artificial Neural Networks | 2005

Representation of functional data in neural networks

Fabrice Rossi; Nicolas Delannay; Brieuc Conan-Guez; Michel Verleysen

Functional data analysis (FDA) is an extension of traditional data analysis to functional data, for example spectra, temporal series, spatio-temporal images, gesture recognition data, etc. Functional data are rarely known exactly in practice; usually only a regular or irregular sampling is available. For this reason, some processing is needed in order to benefit from the smooth character of functional data in the analysis methods. This paper shows how to extend radial-basis function network (RBFN) and multi-layer perceptron (MLP) models to functional data inputs, in particular when the latter are known through lists of input-output pairs. Various possibilities for functional processing are discussed, including projection on smooth bases, functional principal component analysis, functional centering and reduction, and the use of differential operators. It is shown how to incorporate this functional processing into the RBFN and MLP models. The functional approach is illustrated on a benchmark of spectrometric data analysis.
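
The "projection on smooth bases" route can be sketched in a few lines: each sampled curve is least-squares projected onto a small smooth basis, and the coefficient vector, rather than the raw samples, feeds a standard network. The Gaussian basis, synthetic curves, target, and MLP settings below are all assumptions for illustration.

```python
# Functional preprocessing: project sampled curves on a smooth basis,
# then train a plain MLP on the basis coefficients.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 100)                       # sampling grid (e.g. wavelengths)
curves = np.array([np.sin(2 * np.pi * f * t) + 0.1 * rng.normal(size=t.size)
                   for f in rng.uniform(1, 3, size=200)])
y = curves[:, :10].mean(axis=1)                  # some functional of each curve

centers = np.linspace(0, 1, 12)                  # 12 Gaussian basis functions
B = np.exp(-((t[:, None] - centers[None, :]) ** 2) / (2 * 0.05 ** 2))
coef, *_ = np.linalg.lstsq(B, curves.T, rcond=None)   # smooth coefficients

mlp = MLPRegressor(hidden_layer_sizes=(20,), max_iter=2000, random_state=0)
mlp.fit(coef.T, y)                               # 12 inputs instead of 100
print("train R^2:", round(mlp.score(coef.T, y), 3))
```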

Collaboration


Dive into Michel Verleysen's collaborations.

Top Co-Authors

John Aldo Lee (Université catholique de Louvain)
Vincent Wertz (Université catholique de Louvain)
Damien François (Université catholique de Louvain)
Benoît Frénay (Université catholique de Louvain)
Jean-Didier Legat (Université catholique de Louvain)
Paul Jespers (Université catholique de Louvain)
Frédéric Vrins (Université catholique de Louvain)
Gauthier Doquire (Université catholique de Louvain)