Is this you? Create Your Porfile

Cathy Maugis

Institut de Mathématiques de Toulouse

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Cathy Maugis is active.

Explore More

Publication

Featured researches published by Cathy Maugis.

Biometrics | 2009

Variable Selection for Clustering with Gaussian Mixture Models

Cathy Maugis; Gilles Celeux; Marie-Laure Martin-Magniette

This article is concerned with variable selection for cluster analysis. The problem is regarded as a model selection problem in the model-based cluster analysis context. A model generalizing the model of Raftery and Dean (2006, Journal of the American Statistical Association 101, 168-178) is proposed to specify the role of each variable. This model does not need any prior assumptions about the linear link between the selected and discarded variables. Models are compared with Bayesian information criterion. Variable role is obtained through an algorithm embedding two backward stepwise algorithms for variable selection for clustering and linear regression. The model identifiability is established and the consistency of the resulting criterion is proved under regularity conditions. Numerical experiments on simulated datasets and a genomic application highlight the interest of the procedure.

Journal of Multivariate Analysis | 2011

Variable selection in model-based discriminant analysis

Cathy Maugis; Gilles Celeux; Marie-Laure Martin-Magniette

A general methodology for selecting predictors for Gaussian generative classification models is presented. The problem is regarded as a model selection problem. Three different roles for each possible predictor are considered: a variable can be a relevant classification predictor or not, and the irrelevant classification variables can be linearly dependent on a part of the relevant predictors or independent variables. This variable selection model was inspired by a previous work on variable selection in model-based clustering. A BIC-like model selection criterion is proposed. It is optimized through two embedded forward stepwise variable selection algorithms for classification and linear regression. The model identifiability and the consistency of the variable selection criterion are proved. Numerical experiments on simulated and real data sets illustrate the interest of this variable selection methodology. In particular, it is shown that this well ground variable selection model can be of great interest to improve the classification performance of the quadratic discriminant analysis in a high dimension context.

Journal of the American Statistical Association | 2011

Letter to the editor: ''{A} framework for feature selection in clustering''

Gilles Celeux; Marie-Laure Martin-Magniette; Cathy Maugis; Adrian E. Raftery

Witten and Tibshirani (2010), hereafter WT, proposed a framework for feature or variable selection in clustering, designed for situations with a large number of features, p. In a simulation study reported in their Table 4, their sparse K-means method outperformed others, including the model-based clustering variable selection method of Raftery and Dean (2006), hereafter RD. The RD method was developed with the traditional, small p situation in mind. At each iteration, the RD method involves a regression of the variable being considered for inclusion in the clustering model on all the already selected clustering variables, given by RD’s Equation (7). When p is large, this regression can involve many predictors and so can be heavily penalized, leading to too many clustering variables being selected. This can cause the RD method to perform poorly with large p. Maugis, Celeux, and Martin-Magniette (2009a, 2009b), hereafter MCM, proposed a modification of the RD method that can make a big difference in the large p situation. In RD’s Equation (7), they replace regression on all the already selected clustering variables by regression on a subset of them, chosen by a stepwise method (MCM 2009a). Moreover, they allow for the possibility of the proposed variable being independent of all the clustering variables (MCM 2009b). Otherwise, their method is essentially the same as that of RD, and we call it the RD-MCM method. WT acknowledged MCM (2009a) with the statement that “A related proposal [to that of RD] is made in MCM.” However, WT did not include RD-MCM in their simulation study. RD-MCM is implemented in SelvarClustIndep, which uses the MIXMOD library (Biernacki et al. 2006) and has been available at http://www.math.univ-toulouse.fr/~maugis/ SelvarClustIndepHomepage.html since May 2009. This may have left some readers with the mistaken impression that RDMCM performs similarly to RD, and so is worse than that of sparse K-means. In fact RD-MCM performs much better than RD in large p settings. To show this we partially replicated the small simulation study in the top panel of WT’s Table 4, this time including the RD-MCM method. The results are shown in Table 1. We include the results from WT’s Table 4 as well as our own for comparison. For sparse K-means and the RD method, our results are similar to those of WT, and the differences can be explained by simulation variation. The RD-MCM method performed much better than the original RD method, and its performance was very similar to that of sparse K-means. This suggests that modelTable 1. Mean and standard error of the classification error rate (CER) and the number of nonzero coefficients over the simulated datasets

Statistics and Computing | 2012