Julie Josse
Agrocampus Ouest
Publications
Featured research published by Julie Josse.
Computational Statistics & Data Analysis | 2008
Julie Josse; Jérôme Pagès; François Husson
The relationship between two sets of variables defined for the same individuals can be evaluated by the RV coefficient. However, it is impossible to assess from the RV value alone whether or not the two sets of variables are significantly correlated, which is why a test is required. Asymptotic tests do exist but fail in many situations, hence the interest in permutation tests. The main drawback of permutation tests, however, is that they are time-consuming. It is therefore interesting to approximate the permutation distribution with continuous distributions (without performing any permutation). The current approximations (normal approximation, log-transformation and Pearson type III approximation) are discussed and a new one is described: an Edgeworth expansion. Finally, these different approximations are compared on both simulations and a sensory example.
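A minimal sketch of the quantities discussed above, the RV coefficient between two sets of variables and a naive permutation test of its significance; the data and function names are illustrative, not the paper's code, and the permutation loop is exactly the costly step the continuous approximations aim to avoid.

```python
import numpy as np

def rv_coefficient(X, Y):
    """RV coefficient between two sets of variables measured on the same individuals."""
    X = X - X.mean(axis=0)            # centre each variable
    Y = Y - Y.mean(axis=0)
    WX, WY = X @ X.T, Y @ Y.T          # n x n cross-product (configuration) matrices
    return np.trace(WX @ WY) / np.sqrt(np.trace(WX @ WX) * np.trace(WY @ WY))

def rv_permutation_pvalue(X, Y, n_perm=999, seed=0):
    """p-value obtained by permuting the rows of Y and recomputing the RV coefficient."""
    rng = np.random.default_rng(seed)
    observed = rv_coefficient(X, Y)
    perms = [rv_coefficient(X, Y[rng.permutation(len(Y))]) for _ in range(n_perm)]
    return (1 + sum(r >= observed for r in perms)) / (n_perm + 1)

# toy example: two weakly related sets of variables on 30 individuals
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))
Y = 0.3 * X[:, :3] + rng.normal(size=(30, 3))
print(rv_coefficient(X, Y), rv_permutation_pvalue(X, Y))
```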
Computational Statistics & Data Analysis | 2012
Julie Josse; François Husson
Cross-validation is a tried and tested approach to selecting the number of components in principal component analysis (PCA); however, its main drawback is its computational cost. In a regression (or nonparametric regression) setting, criteria such as the generalized cross-validation (GCV) criterion provide convenient approximations to leave-one-out cross-validation. They are based on the relation between the prediction error and the residual sum of squares weighted by elements of a projection matrix (or a smoothing matrix). Such a relation is established here in PCA using an original presentation of PCA with a unique projection matrix. It enables the definition of two cross-validation approximation criteria: the smoothing approximation of the cross-validation criterion (SACV) and the GCV criterion. The method is assessed with simulations and gives promising results.
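A short sketch of the generic idea the abstract builds on, in the familiar regression setting rather than the paper's PCA criterion: for ordinary least squares, leave-one-out error can be recovered from the residuals and the projection ("hat") matrix alone, and GCV replaces the individual leverages by their average.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=50)

H = X @ np.linalg.inv(X.T @ X) @ X.T                      # projection (hat) matrix
residuals = y - H @ y
loo = np.mean((residuals / (1 - np.diag(H))) ** 2)        # exact leave-one-out MSE
gcv = np.mean(residuals ** 2) / (1 - np.trace(H) / len(y)) ** 2  # GCV approximation
print(loo, gcv)
```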
Advanced Data Analysis and Classification | 2011
Julie Josse; Jérôme Pagès; François Husson
The available methods for handling missing values in principal component analysis only provide point estimates of the parameters (axes and components) and estimates of the missing values. To take into account the variability due to missing values, a multiple imputation method is proposed. First, a method to generate multiple imputed data sets from a principal component analysis model is defined. Then, two ways to visualize the uncertainty due to missing values on the principal component analysis results are described. The first consists in projecting the imputed data sets onto a reference configuration as supplementary elements, to assess the stability of the individuals (respectively, of the variables). The second consists in performing a principal component analysis on each imputed data set and fitting each resulting configuration onto the reference one with a Procrustes rotation. The latter strategy makes it possible to assess the variability of the principal component analysis parameters induced by the missing values. The methodology is then evaluated on a real data set.
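A hedged sketch of the second visualisation idea: align the PCA configuration of one imputed data set onto a reference configuration with an orthogonal Procrustes rotation. The data, the two-component choice and the per-individual instability measure below are stand-ins for illustration, not the paper's procedure.

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

def pca_scores(X, ncp=2):
    X = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :ncp] * s[:ncp]                  # principal component scores

rng = np.random.default_rng(0)
reference = rng.normal(size=(40, 5))             # stand-in for the reference analysis
imputed = reference + rng.normal(scale=0.2, size=reference.shape)  # one imputed data set

F_ref, F_imp = pca_scores(reference), pca_scores(imputed)
R, _ = orthogonal_procrustes(F_imp, F_ref)       # rotation aligning F_imp onto F_ref
aligned = F_imp @ R
spread = np.linalg.norm(aligned - F_ref, axis=1) # per-individual instability after alignment
print(spread.mean())
```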
Journal of Classification | 2012
Julie Josse; Marie Chavent; Benoit Liquet; François Husson
A common approach to dealing with missing values in multivariate exploratory data analysis consists in minimizing the loss function over all non-missing elements, which can be achieved by EM-type algorithms in which an iterative imputation of the missing values is performed during the estimation of the axes and components. This paper proposes such an algorithm, named iterative multiple correspondence analysis, to handle missing values in multiple correspondence analysis (MCA). The algorithm, based on an iterative PCA algorithm, is described and its properties are studied. We point out the overfitting problem and propose a regularized version of the algorithm to overcome this major issue. Finally, the performance of the regularized iterative MCA algorithm (implemented in the R package missMDA) is assessed on both simulations and a real data set. Results are promising in comparison with other methods such as the missing-data passive modified margin method, an adaptation of the missing passive method used in Gifi's homogeneity analysis framework.
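An illustrative sketch of the EM-type iterative PCA imputation loop that the iterative MCA algorithm builds on, written for a plain continuous matrix; missMDA's functions are the reference implementation, and the paper's regularized version additionally shrinks the singular values to avoid the overfitting mentioned above.

```python
import numpy as np

def iterative_pca_impute(X, ncp=2, n_iter=100, tol=1e-6):
    """Alternate between a rank-ncp PCA reconstruction and imputation of the missing cells."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    filled = np.where(missing, np.nanmean(X, axis=0), X)   # start from column means
    for _ in range(n_iter):
        mean = filled.mean(axis=0)
        U, s, Vt = np.linalg.svd(filled - mean, full_matrices=False)
        low_rank = (U[:, :ncp] * s[:ncp]) @ Vt[:ncp] + mean  # rank-ncp reconstruction
        new = np.where(missing, low_rank, X)                  # impute only the missing cells
        if np.sum((new - filled) ** 2) < tol:
            break
        filled = new
    return filled

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 6))
X[rng.random(X.shape) < 0.2] = np.nan            # poke ~20% missing cells
completed = iterative_pca_impute(X, ncp=2)
print(np.isnan(completed).any())                 # False: all cells imputed
```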
Advanced Data Analysis and Classification | 2016
Vincent Audigier; François Husson; Julie Josse
We propose a new method to impute missing values in mixed data sets. It is based on a principal component method, factorial analysis for mixed data, which balances the influence of the continuous and categorical variables in the construction of the principal components. Because the imputation uses the principal axes and components, the prediction of the missing values is based on the similarity between individuals and on the relationships between variables. The properties of the method are illustrated via simulations and the quality of the imputation is assessed on real data sets. The method is compared to a recent method based on random forests (Stekhoven and Bühlmann, Bioinformatics 28:113–118, 2011) and shows better performance, especially for the imputation of categorical variables and in situations with highly linear relationships between continuous variables.
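A sketch of the weighting that lets a single principal component method balance continuous and categorical variables, in the spirit of factorial analysis for mixed data: continuous columns are standardised and dummy-coded categorical columns are rescaled by their category proportions. The exact preprocessing in the paper and in missMDA may differ in details; the data frame below is a toy.

```python
import numpy as np
import pandas as pd

def famd_like_matrix(df):
    """Standardise continuous columns and rescale dummy-coded categorical columns
    so that neither variable type dominates the principal components."""
    num = df.select_dtypes(include="number")
    cat = df.select_dtypes(exclude="number")
    Z_num = (num - num.mean()) / num.std(ddof=0)     # unit-variance continuous part
    dummies = pd.get_dummies(cat).astype(float)
    props = dummies.mean()                            # category proportions
    Z_cat = (dummies - props) / np.sqrt(props)        # FAMD-style scaling of the indicators
    return pd.concat([Z_num, Z_cat], axis=1)

toy = pd.DataFrame({"size": [1.2, 3.4, 2.2, 0.7],
                    "weight": [10.0, 12.0, 9.0, 11.0],
                    "colour": ["red", "blue", "red", "blue"]})
print(famd_like_matrix(toy).round(2))
```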
Plant Ecology | 2015
Stéphane Dray; Julie Josse
Principal component analysis (PCA) is a standard technique for summarizing the main structures of a data table containing the measurements of several quantitative variables for a number of individuals. Here, we study the case where some of the data values are missing and propose a review of methods which accommodate PCA to missing data. In plant ecology, this statistical challenge relates to the ongoing effort to compile global plant functional trait databases, producing matrices with a large number of missing values. We present several techniques to handle or estimate (impute) missing values in PCA and compare them using theoretical considerations. We carried out a simulation study to evaluate the relative merits of the different approaches in various situations (correlation structure, number of variables and individuals, and percentage of missing values) and also applied them to a real data set. Lastly, we discuss the advantages and drawbacks of these approaches, the potential pitfalls and the future challenges that need to be addressed.
Statistics and Computing | 2016
Julie Josse; Sylvain Sardy
To recover a low-rank structure from a noisy matrix, truncated singular value decomposition has been extensively used and studied. Recent studies suggested that the signal can be better estimated by shrinking the singular values as well. We pursue this line of research and propose a new estimator offering a continuum of thresholding and shrinking functions. To avoid an unstable and costly cross-validation search, we propose new rules to select two thresholding and shrinking parameters from the data. In particular we propose a generalized Stein unbiased risk estimation criterion that does not require knowledge of the variance of the noise and that is computationally fast. A Monte Carlo simulation reveals that our estimator outperforms the tested methods in terms of mean squared error on both low-rank and general signal matrices across different signal-to-noise ratio regimes. In addition, it accurately estimates the rank of the signal when it is detectable.
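A minimal member of the thresholding/shrinking family that the paper generalises: soft-thresholding of the singular values of a noisy matrix. The threshold lambda is fixed by hand here, whereas the paper's contribution is precisely to select its thresholding and shrinking parameters from the data (via a generalized Stein unbiased risk estimate); everything below is an illustrative stand-in.

```python
import numpy as np

def soft_threshold_svd(Y, lam):
    """Reconstruct Y after shrinking every singular value towards zero by lam."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    s_shrunk = np.maximum(s - lam, 0.0)         # threshold and shrink the singular values
    return (U * s_shrunk) @ Vt

rng = np.random.default_rng(0)
signal = rng.normal(size=(50, 2)) @ rng.normal(size=(2, 30))    # rank-2 signal
Y = signal + rng.normal(scale=0.5, size=signal.shape)           # noisy observation
X_hat = soft_threshold_svd(Y, lam=5.0)
print(np.linalg.matrix_rank(X_hat, tol=1e-8))   # rank retained after thresholding
```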
Statistics and Computing | 2015
Marie Verbanck; Julie Josse; François Husson
Principal component analysis (PCA) is a well-established dimensionality reduction method commonly used to denoise and visualise data. A classical PCA model is the fixed effect model, in which data are generated as a fixed structure of low rank corrupted by noise. Under this model, PCA does not provide the best recovery of the underlying signal in terms of mean squared error. Following the same principle as in ridge regression, we suggest a regularised version of PCA that essentially selects a certain number of dimensions and shrinks the corresponding singular values. Each singular value is multiplied by a term which can be seen as the ratio of the signal variance over the total variance of the associated dimension. The regularisation term is derived analytically using asymptotic results and can also be justified from a Bayesian treatment of the model. Regularised PCA provides promising results in terms of the recovery of the true signal and the graphical outputs in comparison with classical PCA and with a soft thresholding estimation strategy. The distinction between PCA and regularised PCA becomes especially important in the case of very noisy data.
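A hedged sketch of the shrinkage idea described above: keep S dimensions and multiply each retained singular value by an estimate of the signal-to-total variance ratio of that dimension. The crude noise-variance estimate used below (the mean of the discarded eigenvalues) is a simplification of the estimate derived in the paper.

```python
import numpy as np

def regularised_pca_reconstruct(X, S=2):
    """Rank-S reconstruction with each retained singular value shrunk by an
    estimated (signal variance / total variance) ratio."""
    mean = X.mean(axis=0)
    U, sv, Vt = np.linalg.svd(X - mean, full_matrices=False)
    eig = sv ** 2 / (X.shape[0] - 1)                        # variances of the dimensions
    sigma2 = eig[S:].mean()                                  # crude noise-variance estimate
    shrink = np.zeros_like(sv)
    shrink[:S] = np.maximum((eig[:S] - sigma2) / eig[:S], 0.0)
    return (U * (sv * shrink)) @ Vt + mean

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3)) @ rng.normal(size=(3, 10)) + rng.normal(scale=0.3, size=(60, 10))
print(regularised_pca_reconstruct(X, S=3).shape)
```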
Journal of Immunology | 2016
Hiroko Fujii; Julie Josse; Miki Tanioka; Yoshiki Miyachi; François Husson; Masahiro Ono
CD4+ T cells that express the transcription factor FOXP3 (FOXP3+ T cells) are commonly regarded as immunosuppressive regulatory T cells (Tregs). FOXP3+ T cells are reported to be increased in tumor-bearing patients or animals and are considered to suppress antitumor immunity, but the evidence is often contradictory. In addition, accumulating evidence indicates that FOXP3 is induced by antigenic stimulation and that some non-Treg FOXP3+ T cells, especially memory-phenotype FOXP3low cells, produce proinflammatory cytokines. Accordingly, the subclassification of FOXP3+ T cells is fundamental for revealing the significance of FOXP3+ T cells in tumor immunity, but the arbitrariness and complexity of manual gating have complicated the issue. In this article, we report a computational method to automatically identify and classify FOXP3+ T cells into subsets using clustering algorithms. By analyzing flow cytometric data of melanoma patients, the proposed method showed that the FOXP3+ subpopulation that had relatively high FOXP3, CD45RO, and CD25 expressions was increased in melanoma patients, whereas manual gating did not produce significant results on the FOXP3+ subpopulations. Interestingly, the computationally identified FOXP3+ subpopulation included not only classical FOXP3high Tregs, but also memory-phenotype FOXP3low cells by manual gating. Furthermore, the proposed method successfully analyzed an independent data set, showing that the same FOXP3+ subpopulation was increased in melanoma patients, validating the method. Collectively, the proposed method successfully captured an important feature of melanoma without relying on the existing criteria of FOXP3+ T cells, revealing a hidden association between the T cell profile and melanoma, and providing new insights into FOXP3+ T cells and Tregs.
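An illustrative sketch only of the kind of automatic subclassification described above: cluster cells on their marker intensities instead of manual gating. The marker names match the abstract, but the simulated intensities, preprocessing and choice of k-means are stand-ins, not the paper's pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# fake fluorescence intensities for 1,000 CD4+ cells: columns FOXP3, CD45RO, CD25
cells = np.vstack([
    rng.normal([2.5, 2.0, 2.5], 0.3, size=(300, 3)),   # FOXP3-high, CD25-high cells
    rng.normal([1.2, 2.0, 1.0], 0.3, size=(400, 3)),   # memory-phenotype FOXP3-low cells
    rng.normal([0.3, 0.5, 0.5], 0.3, size=(300, 3)),   # FOXP3-negative cells
])
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(cells))
for k in range(3):
    print(k, cells[labels == k].mean(axis=0).round(2))  # mean marker profile per cluster
```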
Journal of Statistical Computation and Simulation | 2016
Vincent Audigier; François Husson; Julie Josse
We propose a multiple imputation method based on principal component analysis (PCA) to deal with incomplete continuous data. To reflect the uncertainty of the parameters from one imputation to the next, we use a Bayesian treatment of the PCA model. Using a simulation study and real data sets, the method is compared to two classical approaches: multiple imputation based on joint modelling and on fully conditional modelling. Unlike the other methods, the proposed one can be easily used on data sets where the number of individuals is less than the number of variables and when the variables are highly correlated. In addition, it provides unbiased point estimates of quantities of interest, such as an expectation, a regression coefficient or a correlation coefficient, with a smaller mean squared error. Furthermore, the widths of the confidence intervals built for the quantities of interest are often smaller while ensuring valid coverage.
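A hedged sketch of how estimates from several imputed data sets are pooled (Rubin's rules) for a quantity of interest such as a mean or a regression coefficient; the imputed data sets themselves would come from the Bayesian PCA model described in the abstract, not from this toy pooling step.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool M point estimates and their within-imputation variances (Rubin's rules)."""
    estimates, variances = np.asarray(estimates), np.asarray(variances)
    M = len(estimates)
    qbar = estimates.mean()                        # pooled point estimate
    within = variances.mean()                      # average within-imputation variance
    between = estimates.var(ddof=1)                # between-imputation variance
    total = within + (1 + 1 / M) * between         # total variance of the pooled estimate
    return qbar, total

# e.g. the estimated mean of an incomplete variable from M = 5 imputed data sets
print(pool_rubin([1.02, 0.97, 1.05, 1.01, 0.99], [0.04, 0.05, 0.04, 0.05, 0.04]))
```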