Cathy Maugis-Rabusseau

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Cathy Maugis-Rabusseau is active.

Explore More

Publication

Featured researches published by Cathy Maugis-Rabusseau.

Bioinformatics | 2015

Co-expression analysis of high-throughput transcriptome sequencing data with Poisson mixture models.

Andrea Rau; Cathy Maugis-Rabusseau; Marie-Laure Martin-Magniette; Gilles Celeux

MOTIVATION In recent years, gene expression studies have increasingly made use of high-throughput sequencing technology. In turn, research concerning the appropriate statistical methods for the analysis of digital gene expression (DGE) has flourished, primarily in the context of normalization and differential analysis. RESULTS In this work, we focus on the question of clustering DGE profiles as a means to discover groups of co-expressed genes. We propose a Poisson mixture model using a rigorous framework for parameter estimation as well as the choice of the appropriate number of clusters. We illustrate co-expression analyses using our approach on two real RNA-seq datasets. A set of simulation studies also compares the performance of the proposed model with that of several related approaches developed to cluster RNA-seq or serial analysis of gene expression data. AVAILABILITY AND AND IMPLEMENTATION The proposed method is implemented in the open-source R package HTSCluster, available on CRAN. CONTACT [email protected] SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.

Computational Statistics & Data Analysis | 2016

On the estimation of mixtures of Poisson regression models with large number of components

Panagiotis Papastamoulis; Marie-Laure Martin-Magniette; Cathy Maugis-Rabusseau

Modelling heterogeneity in large datasets of counts under the presence of covariates demands advanced clustering methods. Towards this direction a mixture of Poisson regressions is proposed. Conditionally on the covariates and a cluster, the multivariate distribution is a product of independent Poisson distributions. A variety of different parameterizations is taken into account for the slope of the conditional log-means. Also considered is the case of partitioning the response variables into sets of replicates sharing the same conditional log-mean up to an additive constant. Model parameters are estimated via an Expectation-Maximization algorithm with Newton-Raphson steps. In particular, an efficient initialization is introduced in order to improve the inference: a splitting scheme is combined with a Small-EM strategy. Simulations and application on two real high-throughput sequencing datasets highlight improvements of parameter estimations. The proposed methodology is implemented in the R package poisson.glm.mix, available on CRAN.

Briefings in Bioinformatics | 2017

Transformation and model choice for RNA-seq co-expression analysis

Andrea Rau; Cathy Maugis-Rabusseau

Although a large number of clustering algorithms have been proposed to identify groups of co-expressed genes from microarray data, the question of if and how such methods may be applied to RNA sequencing (RNA-seq) data remains unaddressed. In this work, we investigate the use of data transformations in conjunction with Gaussian mixture models for RNA-seq co-expression analyses, as well as a penalized model selection criterion to select both an appropriate transformation and number of clusters present in the data. This approach has the advantage of accounting for per-cluster correlation structures among samples, which can be strong in RNA-seq data. In addition, it provides a rigorous statistical framework for parameter estimation, an objective assessment of data transformations and number of clusters and the possibility of performing diagnostic checks on the quality and homogeneity of the identified clusters. We analyze four varied RNA-seq data sets to illustrate the use of transformations and model selection in conjunction with Gaussian mixture models. Finally, we propose a Bioconductor package coseq (co-expression of RNA-seq data) to facilitate implementation and visualization of the recommended RNA-seq co-expression analyses.

Briefings in Bioinformatics | 2016

Synthetic data sets for the identification of key ingredients for RNA-seq differential analysis

Guillem Rigaill; Sandrine Balzergue; Véronique Brunaud; Eddy Blondet; Andrea Rau; Odile Rogier; José Caius; Cathy Maugis-Rabusseau; Ludivine Soubigou-Taconnat; Sébastien Aubourg; Claire Lurin; Marie-Laure Martin-Magniette; Etienne Delannoy

Numerous statistical pipelines are now available for the differential analysis of gene expression measured with RNA-sequencing technology. Most of them are based on similar statistical frameworks after normalization, differing primarily in the choice of data distribution, mean and variance estimation strategy and data filtering. We propose an evaluation of the impact of these choices when few biological replicates are available through the use of synthetic data sets. This framework is based on real data sets and allows the exploration of various scenarios differing in the proportion of non-differentially expressed genes. Hence, it provides an evaluation of the key ingredients of the differential analysis, free of the biases associated with the simulation of data using parametric models. Our results show the relevance of a proper modeling of the mean by using linear or generalized linear modeling. Once the mean is properly modeled, the impact of the other parameters on the performance of the test is much less important. Finally, we propose to use the simple visualization of the raw P-value histogram as a practical evaluation criterion of the performance of differential analysis methods on real data sets.

Advanced Data Analysis and Classification | 2018

Variable selection in model-based clustering and discriminant analysis with a regularization approach

Gilles Celeux; Cathy Maugis-Rabusseau; Mohammed Sedki

Several methods for variable selection have been proposed in model-based clustering and classification. These make use of backward or forward procedures to define the roles of the variables. Unfortunately, such stepwise procedures are slow and the resulting algorithms inefficient when analyzing large data sets with many variables. In this paper, we propose an alternative regularization approach for variable selection in model-based clustering and classification. In our approach the variables are first ranked using a lasso-like procedure in order to avoid slow stepwise algorithms. Thus, the variable selection methodology of Maugis et al. (Comput Stat Data Anal 53:3872–3882, 2000b) can be efficiently applied to high-dimensional data sets.

arXiv: Applications | 2014