Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Peter J. Rousseeuw is active.

Publications


Featured research published by Peter J. Rousseeuw.


Journal of Computational and Applied Mathematics | 1987

Silhouettes: a graphical aid to the interpretation and validation of cluster analysis

Peter J. Rousseeuw

A new graphical display is proposed for partitioning techniques. Each cluster is represented by a so-called silhouette, which is based on the comparison of its tightness and separation. This silhouette shows which objects lie well within their cluster, and which ones are merely somewhere in between clusters. The entire clustering is displayed by combining the silhouettes into a single plot, allowing an appreciation of the relative quality of the clusters and an overview of the data configuration. The average silhouette width provides an evaluation of clustering validity, and might be used to select an ‘appropriate’ number of clusters.
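
The silhouette width described above can be computed directly from its definition, s(i) = (b(i) − a(i)) / max(a(i), b(i)), where a(i) is the average dissimilarity of object i to its own cluster and b(i) the smallest average dissimilarity to any other cluster. A minimal pure-Python sketch, assuming Euclidean dissimilarities and the common convention s(i) = 0 for singleton clusters (the function name is my own):

```python
def silhouette_widths(points, labels):
    """Silhouette width s(i) = (b - a) / max(a, b) for each object i,
    where a = mean distance to i's own cluster (excluding i) and
    b = smallest mean distance to any other cluster."""
    def dist(i, j):
        return sum((u - v) ** 2 for u, v in zip(points[i], points[j])) ** 0.5

    members = {}
    for i, lab in enumerate(labels):
        members.setdefault(lab, []).append(i)

    widths = []
    for i, lab in enumerate(labels):
        own = members[lab]
        if len(own) == 1:
            widths.append(0.0)  # convention: singletons get s(i) = 0
            continue
        a = sum(dist(i, j) for j in own if j != i) / (len(own) - 1)
        b = min(sum(dist(i, j) for j in grp) / len(grp)
                for other, grp in members.items() if other != lab)
        widths.append((b - a) / max(a, b))
    return widths

# Two tight, well-separated clusters give widths close to 1.
pts = [(0, 0), (0, 1), (10, 0), (10, 1)]
labs = [0, 0, 1, 1]
widths = silhouette_widths(pts, labs)
avg_width = sum(widths) / len(widths)
```

The average silhouette width (`avg_width` here) is the quantity the abstract suggests for validation: compute it for several candidate numbers of clusters and keep the value of k with the largest average.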


Archive | 1990

Finding Groups in Data

Leonard Kaufman; Peter J. Rousseeuw

This book is a standard introduction to cluster analysis. It presents partitioning methods built around medoids (PAM and, for large datasets, the sampling-based CLARA), the fuzzy clustering method FANNY, the agglomerative and divisive hierarchical methods AGNES and DIANA, and the monothetic method MONA, together with graphical displays such as silhouettes for judging the quality of a clustering.


Journal of the American Statistical Association | 1984

Least Median of Squares Regression

Peter J. Rousseeuw

Abstract Classical least squares regression consists of minimizing the sum of the squared residuals. Many authors have produced more robust versions of this estimator by replacing the square by something else, such as the absolute value. In this article a different approach is introduced in which the sum is replaced by the median of the squared residuals. The resulting estimator can resist the effect of nearly 50% of contamination in the data. In the special case of simple regression, it corresponds to finding the narrowest strip covering half of the observations. Generalizations are possible to multivariate location, orthogonal regression, and hypothesis testing in linear models.
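
In the simple-regression case, a common way to approximate the LMS fit is to try many elemental subsets: each pair of observations determines a candidate line, and the line with the smallest median squared residual is kept. The following sketch illustrates that scheme (the trial count and names are my choices, not the paper's exact algorithm):

```python
import random

def lms_line(xs, ys, n_trials=500, seed=0):
    """Approximate least median of squares for simple regression:
    sample pairs of points, fit the exact line through each pair,
    and keep the line minimizing the median of the squared residuals."""
    rng = random.Random(seed)
    n = len(xs)
    best_med, best_slope, best_intercept = float("inf"), 0.0, 0.0
    for _ in range(n_trials):
        i, j = rng.sample(range(n), 2)
        if xs[i] == xs[j]:
            continue  # vertical candidate line; skip
        slope = (ys[j] - ys[i]) / (xs[j] - xs[i])
        intercept = ys[i] - slope * xs[i]
        sq_res = sorted((ys[k] - slope * xs[k] - intercept) ** 2
                        for k in range(n))
        med = sq_res[n // 2]
        if med < best_med:
            best_med, best_slope, best_intercept = med, slope, intercept
    return best_slope, best_intercept

# y = 2x + 1 with two gross outliers: LMS still recovers the clean line.
xs = list(range(10))
ys = [2 * x + 1 for x in xs]
ys[0], ys[1] = 50.0, 60.0
slope, intercept = lms_line(xs, ys)
```

Because 8 of the 10 points lie exactly on y = 2x + 1, any clean pair yields a median squared residual of 0, so the fit is exact here; ordinary least squares would be pulled far off by the two outliers.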


Technometrics | 1999

A fast algorithm for the minimum covariance determinant estimator

Peter J. Rousseeuw; Katrien Van Driessen

The minimum covariance determinant (MCD) method of Rousseeuw is a highly robust estimator of multivariate location and scatter. Its objective is to find h observations (out of n) whose covariance matrix has the lowest determinant. Until now, applications of the MCD were hampered by the computation time of existing algorithms, which were limited to a few hundred objects in a few dimensions. We discuss two important applications of larger size, one about a production process at Philips with n = 677 objects and p = 9 variables, and a dataset from astronomy with n = 137,256 objects and p = 27 variables. To deal with such problems we have developed a new algorithm for the MCD, called FAST-MCD. The basic ideas are an inequality involving order statistics and determinants, and techniques which we call “selective iteration” and “nested extensions.” For small datasets, FAST-MCD typically finds the exact MCD, whereas for larger datasets it gives more accurate results than existing algorithms and is faster by orders of magnitude.
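
The heart of FAST-MCD is the concentration step ("C-step"): refit location and scatter on the current h-subset, then keep the h points closest to the new fit; this never increases the covariance determinant. A one-dimensional toy version follows (the real algorithm works with full covariance matrices, many random starts, and the selective-iteration and nested-extension techniques named above; the function name is my own):

```python
def c_step_mcd_1d(data, h, n_iter=20):
    """One-dimensional concentration step: alternate between fitting
    mean/variance on the current h-subset and keeping the h points
    closest to the fitted mean. Stops when the subset is stable."""
    subset = list(range(h))  # arbitrary starting subset
    for _ in range(n_iter):
        vals = [data[i] for i in subset]
        mu = sum(vals) / h
        order = sorted(range(len(data)), key=lambda i: abs(data[i] - mu))
        new_subset = sorted(order[:h])
        if new_subset == subset:
            break  # converged
        subset = new_subset
    vals = [data[i] for i in subset]
    mu = sum(vals) / h
    var = sum((v - mu) ** 2 for v in vals) / h
    return subset, mu, var

# Six inliers near 0 and two gross outliers near 100: the concentrated
# subset settles on inliers, so the estimates ignore the outliers.
data = [0.0, 0.1, -0.1, 0.2, -0.2, 0.05, 100.0, 101.0]
subset, mu, var = c_step_mcd_1d(data, h=5)
```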


Journal of the American Statistical Association | 1990

Unmasking Multivariate Outliers and Leverage Points

Peter J. Rousseeuw; Bert C. van Zomeren

Abstract Detecting outliers in a multivariate point cloud is not trivial, especially when there are several outliers. The classical identification method does not always find them, because it is based on the sample mean and covariance matrix, which are themselves affected by the outliers. That is how the outliers get masked. To avoid the masking effect, we propose to compute distances based on very robust estimates of location and covariance. These robust distances are better suited to expose the outliers. In the case of regression data, the classical least squares approach masks outliers in a similar way. Also here, the outliers may be unmasked by using a highly robust regression method. Finally, a new display is proposed in which the robust regression residuals are plotted versus the robust distances. This plot classifies the data into regular observations, vertical outliers, good leverage points, and bad leverage points. Several examples are discussed.
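
A crude way to see why robust centering unmasks outliers: standardize each coordinate by its median and MAD instead of the mean and standard deviation, then take the norm of the robust z-scores. This coordinatewise sketch is far weaker than the MCD-based robust Mahalanobis distances the paper uses (it ignores correlation entirely), but it conveys the principle:

```python
def robust_z_distances(points):
    """Distance of each point from a robust center: each coordinate is
    standardized by median and MAD (1.4826 * median absolute deviation),
    then the Euclidean norm of the robust z-scores is taken. NOTE: this
    ignores correlation; the paper uses a full robust covariance instead."""
    coords = list(zip(*points))
    meds = [sorted(c)[len(c) // 2] for c in coords]
    mads = [1.4826 * sorted(abs(v - m) for v in c)[len(c) // 2]
            for c, m in zip(coords, meds)]
    return [sum(((x - m) / s) ** 2
                for x, m, s in zip(p, meds, mads)) ** 0.5
            for p in points]

# Nine points near the origin plus one far outlier: the outlier's
# robust distance stands out while the inliers stay small, because
# the median and MAD are barely affected by the contamination.
points = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1),
          (1, 1), (-1, -1), (0.5, 0.5), (-0.5, 0.5), (10, 10)]
dists = robust_z_distances(points)
```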


Journal of the American Statistical Association | 1993

Alternatives to the median absolute deviation

Peter J. Rousseeuw; Christophe Croux

Abstract In robust estimation one frequently needs an initial or auxiliary estimate of scale. For this one usually takes the median absolute deviation MADn = 1.4826 med_i {|x_i − med_j x_j|}, because it has a simple explicit formula, needs little computation time, and is very robust as witnessed by its bounded influence function and its 50% breakdown point. But there is still room for improvement in two areas: the fact that MADn is aimed at symmetric distributions, and its low (37%) Gaussian efficiency. In this article we set out to construct explicit and 50% breakdown scale estimators that are more efficient. We consider the estimator Sn = 1.1926 med_i {med_j |x_i − x_j|} and the estimator Qn given by the 0.25 quantile of the distances {|x_i − x_j|; i < j}. Note that Sn and Qn do not need any location estimate. Both Sn and Qn can be computed using O(n log n) time and O(n) storage. The Gaussian efficiency of Sn is 58%, whereas Qn attains 82%. We study Sn and Qn by means of their influence functions, their b...
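
The definitions above translate directly into naive O(n^2) implementations (the paper's own algorithms reach O(n log n); the finite-sample correction factors are omitted here):

```python
def s_n(x):
    """Rousseeuw-Croux Sn = 1.1926 * lomed_i ( himed_j |x_i - x_j| ),
    computed naively in O(n^2); lomed/himed are the low/high medians."""
    n = len(x)
    inner = []
    for i in range(n):
        d = sorted(abs(x[i] - x[j]) for j in range(n))
        inner.append(d[n // 2])                      # high median over j
    return 1.1926 * sorted(inner)[(n + 1) // 2 - 1]  # low median over i

def q_n(x):
    """Rousseeuw-Croux Qn = 2.2219 * k-th order statistic of the
    n(n-1)/2 pairwise distances, with k = h(h-1)/2 and h = n//2 + 1
    (roughly the first quartile of the pairwise distances)."""
    n = len(x)
    d = sorted(abs(x[i] - x[j]) for i in range(n) for j in range(i + 1, n))
    h = n // 2 + 1
    k = h * (h - 1) // 2
    return 2.2219 * d[k - 1]

# Both estimators shrug off a gross outlier in the sample.
x = [1, 2, 3, 4, 5, 6, 7, 1000]
sn, qn = s_n(x), q_n(x)
```

Note that neither function subtracts a location estimate: both work purely on pairwise differences, as the abstract points out.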


Archive | 1984

ROBUST REGRESSION BY MEANS OF S-ESTIMATORS

Peter J. Rousseeuw; Victor J. Yohai

There are at least two reasons why robust regression techniques are useful tools in robust time series analysis. First of all, one often wants to estimate autoregressive parameters in a robust way, and secondly, one sometimes has to fit a linear or nonlinear trend to a time series. In this paper we shall develop a class of methods for robust regression, and briefly comment on their use in time series. These new estimators are introduced because of their invulnerability to large fractions of contaminated data. We propose to call them “S-estimators” because they are based on estimators of scale.


Technometrics | 2005

ROBPCA: A New Approach to Robust Principal Component Analysis

Mia Hubert; Peter J. Rousseeuw; Karlien Vanden Branden

We introduce a new method for robust principal component analysis (PCA). Classical PCA is based on the empirical covariance matrix of the data and hence is highly sensitive to outlying observations. Two robust approaches have been developed to date. The first approach is based on the eigenvectors of a robust scatter matrix such as the minimum covariance determinant or an S-estimator and is limited to relatively low-dimensional data. The second approach is based on projection pursuit and can handle high-dimensional data. Here we propose the ROBPCA approach, which combines projection pursuit ideas with robust scatter matrix estimation. ROBPCA yields more accurate estimates on noncontaminated datasets and more robust estimates on contaminated data. ROBPCA can be computed rapidly, and is able to detect exact-fit situations. As a by-product, ROBPCA produces a diagnostic plot that displays and classifies the outliers. We apply the algorithm to several datasets from chemometrics and engineering.


Data Mining and Knowledge Discovery | 2006

Computing LTS Regression for Large Data Sets

Peter J. Rousseeuw; Katrien Van Driessen

Data mining aims to extract previously unknown patterns or substructures from large databases. In statistics, this is what methods of robust estimation and outlier detection were constructed for, see e.g. Rousseeuw and Leroy (1987). Here we will focus on least trimmed squares (LTS) regression, which is based on the subset of h cases (out of n) whose least squares fit possesses the smallest sum of squared residuals. The coverage h may be set between n/2 and n. The computation time of existing LTS algorithms grows too much with the size of the data set, precluding their use for data mining. In this paper we develop a new algorithm called FAST-LTS. The basic ideas are an inequality involving order statistics and sums of squared residuals, and techniques which we call ‘selective iteration’ and ‘nested extensions’. We also use an intercept adjustment technique to improve the precision. For small data sets FAST-LTS typically finds the exact LTS, whereas for larger data sets it gives more accurate results than existing algorithms for LTS and is faster by orders of magnitude. This allows us to apply FAST-LTS to large databases.
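
The concentration step for LTS works like its MCD counterpart: fit least squares on the current h-subset, then keep the h observations with the smallest squared residuals under that fit; the trimmed sum of squares cannot increase. A simple-regression sketch using a single start (the real FAST-LTS draws many random elemental starts and keeps the best fit; the names here are my own):

```python
def c_step_lts(xs, ys, h, n_iter=20):
    """Concentration step for least trimmed squares, simple-regression
    case: fit OLS on the current h-subset, then keep the h points with
    the smallest squared residuals under that fit, until stable."""
    n = len(xs)
    subset = list(range(h))  # single deterministic start
    slope = intercept = 0.0
    for _ in range(n_iter):
        sx = sum(xs[i] for i in subset)
        sy = sum(ys[i] for i in subset)
        sxx = sum(xs[i] ** 2 for i in subset)
        sxy = sum(xs[i] * ys[i] for i in subset)
        slope = (h * sxy - sx * sy) / (h * sxx - sx * sx)
        intercept = (sy - slope * sx) / h
        order = sorted(range(n),
                       key=lambda i: (ys[i] - slope * xs[i] - intercept) ** 2)
        new_subset = sorted(order[:h])
        if new_subset == subset:
            break  # converged
        subset = new_subset
    return slope, intercept, subset

# y = 2x + 1 with two gross outliers at the end; the starting subset is
# clean by construction, so one concentration run recovers the line.
xs = list(range(10))
ys = [2.0 * x + 1 for x in xs]
ys[8], ys[9] = 0.0, 0.0
slope, intercept, subset = c_step_lts(xs, ys, h=7)
```

C-steps only reach a local minimum of the objective, which is why the full algorithm restarts from many random subsets before reporting the best solution found.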


The American Statistician | 1999

The Bagplot: A Bivariate Boxplot

Peter J. Rousseeuw; Ida Ruts; John W. Tukey

Abstract We propose the bagplot, a bivariate generalization of the univariate boxplot. The key notion is the half space location depth of a point relative to a bivariate dataset, which extends the univariate concept of rank. The “depth median” is the deepest location, and it is surrounded by a “bag” containing the n/2 observations with largest depth. Magnifying the bag by a factor 3 yields the “fence” (which is not plotted). Observations between the bag and the fence are marked by a light gray loop, whereas observations outside the fence are flagged as outliers. The bagplot visualizes the location, spread, correlation, skewness, and tails of the data. It is equivariant for linear transformations, and not limited to elliptical distributions. Software for drawing the bagplot is made available for the S-Plus and MATLAB environments. The bagplot is illustrated on several datasets—for example, in a scatterplot matrix of multivariate data.
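
Halfspace location depth, the notion underlying the bagplot, can be approximated in two dimensions by scanning directions: the depth of a point is the minimum, over closed halfplanes through it, of the number of data points in the halfplane. A sketch using a finite set of sampled directions (exact algorithms exist; this only conveys the idea):

```python
import math

def halfspace_depth(point, data, n_dirs=360):
    """Approximate Tukey halfspace depth of a 2-D point: minimize, over
    sampled directions u, the number of data points in the closed
    halfplane {x : <x - point, u> >= 0}. The bagplot's depth median is
    the point of maximal depth; deeper points are more central."""
    px, py = point
    depth = len(data)
    for k in range(n_dirs):
        ang = 2 * math.pi * k / n_dirs
        ux, uy = math.cos(ang), math.sin(ang)
        count = sum(1 for (x, y) in data
                    if (x - px) * ux + (y - py) * uy >= 0)
        depth = min(depth, count)
    return depth

# The center of a square has higher depth than one of its corners.
data = [(0, 0), (2, 0), (0, 2), (2, 2), (1, 1)]
center_depth = halfspace_depth((1, 1), data)
corner_depth = halfspace_depth((0, 0), data)
```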

Collaboration


Dive into Peter J. Rousseeuw's collaborations.

Top Co-Authors

Tim Verdonck, Katholieke Universiteit Leuven

Pieter Segaert, Katholieke Universiteit Leuven

Leonard Kaufman, Vrije Universiteit Brussel