Peter Bühlmann
ETH Zurich
Network
Latest external collaboration on country level. Dive into details by clicking on the dots.
Publication
Featured researches published by Peter Bühlmann.
Annals of Statistics | 2006
Nicolai Meinshausen; Peter Bühlmann
The pattern of zero entries in the inverse covariance matrix of a multivariate normal distribution corresponds to conditional independence restrictions between variables. Covariance selection aims at estimating those structural zeros from data. We show that neighborhood selection with the Lasso is a computationally attractive alternative to standard covariance selection for sparse high-dimensional graphs. Neighborhood selection estimates the conditional independence restrictions separately for each node in the graph and is hence equivalent to variable selection for Gaussian linear models. We show that the proposed neighborhood selection scheme is consistent for sparse high-dimensional graphs. Consistency hinges on the choice of the penalty parameter. The oracle value for optimal prediction does not lead to a consistent neighborhood estimate. Controlling instead the probability of falsely joining some distinct connectivity components of the graph, consistent estimation for sparse graphs is achieved (with exponential rates), even when the number of variables grows as the number of observations raised to an arbitrary power.
Bioinformatics | 2006
Amela Prelić; Stefan Bleuler; Philip Zimmermann; Anja Wille; Peter Bühlmann; Wilhelm Gruissem; Lars Hennig; Lothar Thiele; Eckart Zitzler
MOTIVATION In recent years, there have been various efforts to overcome the limitations of standard clustering approaches for the analysis of gene expression data by grouping genes and samples simultaneously. The underlying concept, which is often referred to as biclustering, allows to identify sets of genes sharing compatible expression patterns across subsets of samples, and its usefulness has been demonstrated for different organisms and datasets. Several biclustering methods have been proposed in the literature; however, it is not clear how the different techniques compare with each other with respect to the biological relevance of the clusters as well as with other characteristics such as robustness and sensitivity to noise. Accordingly, no guidelines concerning the choice of the biclustering method are currently available. RESULTS First, this paper provides a methodology for comparing and validating biclustering methods that includes a simple binary reference model. Although this model captures the essential features of most biclustering approaches, it is still simple enough to exactly determine all optimal groupings; to this end, we propose a fast divide-and-conquer algorithm (Bimax). Second, we evaluate the performance of five salient biclustering algorithms together with the reference model and a hierarchical clustering method on various synthetic and real datasets for Saccharomyces cerevisiae and Arabidopsis thaliana. The comparison reveals that (1) biclustering in general has advantages over a conventional hierarchical clustering approach, (2) there are considerable performance differences between the tested methods and (3) already the simple reference model delivers relevant patterns within all considered settings.
Bioinformatics | 2007
Jelle J. Goeman; Peter Bühlmann
MOTIVATION Many statistical tests have been proposed in recent years for analyzing gene expression data in terms of gene sets, usually from Gene Ontology. These methods are based on widely different methodological assumptions. Some approaches test differential expression of each gene set against differential expression of the rest of the genes, whereas others test each gene set on its own. Also, some methods are based on a model in which the genes are the sampling units, whereas others treat the subjects as the sampling units. This article aims to clarify the assumptions behind different approaches and to indicate a preferential methodology of gene set testing. RESULTS We identify some crucial assumptions which are needed by the majority of methods. P-values derived from methods that use a model which takes the genes as the sampling unit are easily misinterpreted, as they are based on a statistical model that does not resemble the biological experiment actually performed. Furthermore, because these models are based on a crucial and unrealistic independence assumption between genes, the P-values derived from such methods can be wildly anti-conservative, as a simulation experiment shows. We also argue that methods that competitively test each gene set against the rest of the genes create an unnecessary rift between single gene testing and gene set testing.
Journal of the American Statistical Association | 2003
Peter Bühlmann; Bin Yu
This article investigates a computationally simple variant of boosting, L2Boost, which is constructed from a functional gradient descent algorithm with the L2-loss function. Like other boosting algorithms, L2Boost uses many times in an iterative fashion a prechosen fitting method, called the learner. Based on the explicit expression of refitting of residuals of L2Boost, the case with (symmetric) linear learners is studied in detail in both regression and classification. In particular, with the boosting iteration m working as the smoothing or regularization parameter, a new exponential bias-variance trade-off is found with the variance (complexity) term increasing very slowly as m tends to infinity. When the learner is a smoothing spline, an optimal rate of convergence result holds for both regression and classification and the boosted smoothing spline even adapts to higher-order, unknown smoothness. Moreover, a simple expansion of a (smoothed) 0–1 loss function is derived to reveal the importance of the decision boundary, bias reduction, and impossibility of an additive bias-variance decomposition in classification. Finally, simulation and real dataset results are obtained to demonstrate the attractiveness of L2Boost. In particular, we demonstrate that L2Boosting with a novel component-wise cubic smoothing spline is both practical and effective in the presence of high-dimensional predictors.
Statistical Science | 2007
Peter Bühlmann; Torsten Hothorn
We present a statistical perspective on boosting. Special emphasis is given to estimating potentially complex parametric or nonparametric models, including generalized linear and additive models as well as regression models for survival analysis. Concepts of degrees of freedom and corresponding Akaike or Bayesian information criteria, particularly useful for regularization and variable selection in high-dimensional covariate spaces, are discussed as well. The practical aspects of boosting procedures for fitting statistical models are illustrated by means of the dedicated open-source software package mboost. This package implements functions which can be used for model fitting, prediction and variable selection. It is flexible, allowing for the implementation of new boosting algorithms optimizing user-specified loss functions.
Bioinformatics | 2012
Daniel Stekhoven; Peter Bühlmann
MOTIVATION Modern data acquisition based on high-throughput technology is often facing the problem of missing data. Algorithms commonly used in the analysis of such large-scale data often depend on a complete set. Missing value imputation offers a solution to this problem. However, the majority of available imputation methods are restricted to one type of variable only: continuous or categorical. For mixed-type data, the different types are usually handled separately. Therefore, these methods ignore possible relations between variable types. We propose a non-parametric method which can cope with different types of variables simultaneously. RESULTS We compare several state of the art methods for the imputation of missing values. We propose and evaluate an iterative imputation method (missForest) based on a random forest. By averaging over many unpruned classification or regression trees, random forest intrinsically constitutes a multiple imputation scheme. Using the built-in out-of-bag error estimates of random forest, we are able to estimate the imputation error without the need of a test set. Evaluation is performed on multiple datasets coming from a diverse selection of biological fields with artificially introduced missing values ranging from 10% to 30%. We show that missForest can successfully handle missing values, particularly in datasets including different types of variables. In our comparative study, missForest outperforms other methods of imputation especially in data settings where complex interactions and non-linear relations are suspected. The out-of-bag imputation error estimates of missForest prove to be adequate in all settings. Additionally, missForest exhibits attractive computational efficiency and can cope with high-dimensional data. AVAILABILITY The package missForest is freely available from http://stat.ethz.ch/CRAN/. CONTACT [email protected]; [email protected]
Electronic Journal of Statistics | 2009
Sara van de Geer; Peter Bühlmann
Oracle inequalities and variable selection properties for the Lasso in linear models have been established under a variety of different assumptions on the design matrix. We show in this paper how the different conditions and concepts relate to each other. The restricted eigenvalue condition (Bickel et al., 2009) or the slightly weaker compatibility condition (van de Geer, 2007) are sufficient for oracle results. We argue that both these conditions allow for a fairly general class of design matrices. Hence, optimality of the Lasso for prediction and estimation holds for more general situations than what it appears from coherence (Bunea et al, 2007b,c) or restricted isometry (Candes and Tao, 2005) assumptions.
Bernoulli | 1997
Peter Bühlmann
We study a bootstrap method which is based on the method of sieves. A linear process is approximated by a sequence of autoregressive processes of order p = p(n), where p(n) -* 00, p(n) = o(n) as the sample size n -+ oo. For given data, we then estimate such an AR(p(n)) model and generate a bootstrap sample by resampling from the residuals. This sieve bootstrap enjoys a nice nonparametric property, being model-free within a class of linear processes. We show its consistency for a class of nonlinear estimators and compare the procedure with the blockwise bootstrap, which has been proposed by Kiinsch in 1989. In particular, the sieve bootstrap variance of the mean is shown to have a better rate of convergence if the dependence between separated values of the underlying process decreases sufficiently fast with growing separation. Finally, a simulation study helps to illustrate the advantages and disadvantages of the sieve compared to the blockwise bootstrap.
Journal of the American Statistical Association | 2011
Peter Bühlmann; Bin Yu
This article investigates a computationally simple variant of boosting, L2Boost, which is constructed from a functional gradient descent algorithm with the L2-loss function. Like other boosting algorithms, L2Boost uses many times in an iterative fashion a prechosen fitting method, called the learner. Based on the explicit expression of refitting of residuals of L2Boost, the case with (symmetric) linear learners is studied in detail in both regression and classification. In particular, with the boosting iteration m working as the smoothing or regularization parameter, a new exponential bias-variance trade-off is found with the variance (complexity) term increasing very slowly as m tends to infinity. When the learner is a smoothing spline, an optimal rate of convergence result holds for both regression and classification and the boosted smoothing spline even adapts to higher-order, unknown smoothness. Moreover, a simple expansion of a (smoothed) 0–1 loss function is derived to reveal the importance of the decision boundary, bias reduction, and impossibility of an additive bias-variance decomposition in classification. Finally, simulation and real dataset results are obtained to demonstrate the attractiveness of L2Boost. In particular, we demonstrate that L2Boosting with a novel component-wise cubic smoothing spline is both practical and effective in the presence of high-dimensional predictors.
Genome Biology | 2004
Anja Wille; Philip Zimmermann; Eva Vranová; Andreas Fürholz; Oliver Laule; Stefan Bleuler; Lars Hennig; Amela Prelić; Peter von Rohr; Lothar Thiele; Eckart Zitzler; Wilhelm Gruissem; Peter Bühlmann
We present a novel graphical Gaussian modeling approach for reverse engineering of genetic regulatory networks with many genes and few observations. When applying our approach to infer a gene network for isoprenoid biosynthesis in Arabidopsis thaliana, we detect modules of closely connected genes and candidate genes for possible cross-talk between the isoprenoid pathways. Genes of downstream pathways also fit well into the network. We evaluate our approach in a simulation study and using the yeast galactose network.