Michel Lang
Technical University of Dortmund
Publications
Featured research published by Michel Lang.
BMC Bioinformatics | 2011
Kai Kammers; Michel Lang; Jan G. Hengstler; Marcus Schmidt; Jörg Rahnenführer
Background: An important application of high dimensional gene expression measurements is the risk prediction and the interpretation of the variables in the resulting survival models. A major problem in this context is the typically large number of genes compared to the number of observations (individuals). Feature selection procedures can generate predictive models with high prediction accuracy and at the same time low model complexity. However, interpretability of the resulting models is still limited due to little knowledge on many of the remaining selected genes. Thus, we summarize genes as gene groups defined by the hierarchically structured Gene Ontology (GO) and include these gene groups as covariates in the hazard regression models. Since expression profiles within GO groups are often heterogeneous, we present a new method to obtain subgroups with coherent patterns. We apply preclustering to genes within GO groups according to the correlation of their gene expression measurements.
Results: We compare Cox models for modeling disease free survival times of breast cancer patients. Besides classical clinical covariates we consider genes, GO groups and preclustered GO groups as additional genomic covariates. Survival models with preclustered gene groups as covariates have similar prediction accuracy as models built only with single genes or GO groups.
Conclusions: The preclustering information enables a more detailed analysis of the biological meaning of covariates selected in the final models. Compared to models built only with single genes there is additional functional information contained in the GO annotation, and compared to models using GO groups as covariates the preclustering yields coherent representative gene expression profiles.
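The preclustering idea can be sketched in a few lines. This is a minimal illustration in Python with toy data (the paper's analyses use R and real expression data): genes within one GO group are clustered by correlation distance, and each resulting subgroup contributes its mean expression profile as a representative covariate.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)

# Toy expression matrix for one GO group: 6 genes x 20 samples.
# Genes 0-2 share one pattern, genes 3-5 another (a heterogeneous group).
base_a = rng.normal(size=20)
base_b = rng.normal(size=20)
expr = np.vstack([base_a + 0.1 * rng.normal(size=20) for _ in range(3)]
                 + [base_b + 0.1 * rng.normal(size=20) for _ in range(3)])

# Correlation distance between genes: d = 1 - Pearson correlation.
dist = 1 - np.corrcoef(expr)
# Condensed distance vector for average-linkage hierarchical clustering.
cond = dist[np.triu_indices_from(dist, k=1)]
labels = fcluster(linkage(cond, method="average"), t=0.5, criterion="distance")

# One representative covariate per subgroup: the mean expression profile.
covariates = np.array([expr[labels == k].mean(axis=0)
                       for k in np.unique(labels)])
print(len(np.unique(labels)), covariates.shape)
```

The cutoff `t=0.5` and the average-linkage choice are illustrative assumptions; in practice the representative profiles would then enter a Cox model alongside clinical covariates.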
Journal of Statistical Computation and Simulation | 2015
Michel Lang; Helena Kotthaus; Peter Marwedel; Claus Weihs; Jörg Rahnenführer; Bernd Bischl
Many different models for the analysis of high-dimensional survival data have been developed over the past years. While some of the models and implementations come with an internal parameter tuning automatism, others require the user to accurately adjust defaults, which often feels like a guessing game. Exhaustively trying out all model and parameter combinations will quickly become tedious or infeasible in computationally intensive settings, even if parallelization is employed. Therefore, we propose to use modern algorithm configuration techniques, e.g. iterated F-racing, to efficiently move through the model hypothesis space and to simultaneously configure algorithm classes and their respective hyperparameters. In our application we study four lung cancer microarray data sets. For these we configure a predictor based on five survival analysis algorithms in combination with eight feature selection filters. We parallelize the optimization and all comparison experiments with the BatchJobs and BatchExperiments R packages.
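The racing idea behind such configuration techniques can be illustrated with a toy sketch. This is not iterated F-racing itself (which eliminates candidates via Friedman tests); it is a crude stand-in in which candidate model/hyperparameter configurations are evaluated on successive instances and clearly worse ones are dropped, so the budget concentrates on promising configurations.

```python
import random
import statistics

random.seed(1)

# Hypothetical candidate configurations: (model, hyperparameter).
candidates = [("lasso", 0.01), ("lasso", 1.0), ("ridge", 0.1), ("ridge", 10.0)]

def evaluate(config, instance):
    """Stand-in for a cross-validated error on one data set (instance).
    Here a noisy score where ('lasso', 0.01) is best by construction."""
    bias = {("lasso", 0.01): 0.10, ("lasso", 1.0): 0.30,
            ("ridge", 0.1): 0.20, ("ridge", 10.0): 0.40}[config]
    return bias + random.gauss(0, 0.02)

# Simplified racing: after each instance, drop candidates whose mean error
# exceeds the current best mean by a margin (a crude replacement for the
# statistical elimination test used in iterated F-racing).
scores = {c: [] for c in candidates}
alive = list(candidates)
for instance in range(10):
    for c in alive:
        scores[c].append(evaluate(c, instance))
    if instance >= 2:  # need a few observations before eliminating
        means = {c: statistics.mean(scores[c]) for c in alive}
        best = min(means.values())
        alive = [c for c in alive if means[c] <= best + 0.05]

print(alive)
```

The margin 0.05 and the bias values are arbitrary illustrative choices; the point is that inferior configurations stop consuming evaluations early.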
Journal of Open Source Software | 2017
Michel Lang; Bernd Bischl; Dirk Surmann
Supported execution backends include:
• Local (blocking) execution in the current R session or in an externally spawned R process (intended for debugging and prototyping)
• Local (non-blocking) parallel execution using parallel's multicore backend (R Core Team 2016) or snow's socket mode (Tierney et al. 2016)
• Execution on loosely connected machines using SSH (including basic resource usage control)
• Docker Swarm
• IBM Spectrum LSF
• OpenLava
• Univa Grid Engine (formerly Oracle Grid Engine and Sun Grid Engine)
• Slurm Workload Manager
• TORQUE/PBS Resource Manager
Journal of Statistical Computation and Simulation | 2015
Helena Kotthaus; Ingo Korb; Michel Lang; Bernd Bischl; Jörg Rahnenführer; Peter Marwedel
R is a multi-paradigm language with a dynamic type system, several object systems and functional characteristics. These characteristics support the development of statistical algorithms at a high level of abstraction. Although R is commonly used in the statistics domain, a major disadvantage is its runtime behaviour when handling computation-intensive algorithms. Especially in the domain of machine learning, the execution of pure R programs is often unacceptably slow. Our long-term goal is to resolve these issues; in this contribution we used the traceR tool to analyse the bottlenecks arising in this domain. We measured the runtime and overall memory consumption of a well-defined set of classical machine learning applications and gained detailed insights into the performance issues of these programs.
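The kind of metrics collected by such a profiling study can be illustrated in Python (traceR itself profiles R programs; this sketch merely shows runtime and peak-memory measurement around a stand-in workload):

```python
import time
import tracemalloc

def workload():
    # Stand-in for a computation-intensive task: build a large table of
    # squared differences and reduce it.
    data = [[(i - j) ** 2 for j in range(300)] for i in range(300)]
    return sum(row[0] for row in data)

# Measure wall-clock runtime and peak traced memory of the workload.
tracemalloc.start()
t0 = time.perf_counter()
result = workload()
runtime = time.perf_counter() - t0
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"runtime: {runtime:.4f}s, peak memory: {peak / 1e6:.1f} MB")
```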
PLOS ONE | 2014
Sangkyun Lee; Jörg Rahnenführer; Michel Lang; Katleen De Preter; Pieter Mestdagh; Jan Koster; Rogier Versteeg; Raymond L Stallings; Luigi Varesio; Shahab Asgharzadeh; Johannes H. Schulte; Kathrin Fielitz; Melanie Schwermer; Katharina Morik; Alexander Schramm
Identifying relevant signatures for clinical patient outcome is a fundamental task in high-throughput studies. Signatures, composed of features such as mRNAs, miRNAs, SNPs or other molecular variables, are often non-overlapping, even though they have been identified from similar experiments considering samples with the same type of disease. The lack of a consensus is mostly due to the fact that sample sizes are far smaller than the numbers of candidate features to be considered, and therefore signature selection suffers from large variation. We propose a robust signature selection method that enhances the selection stability of penalized regression algorithms for predicting survival risk. Our method is based on an aggregation of multiple, possibly unstable, signatures obtained with the preconditioned lasso algorithm applied to random (internal) subsamples of a given cohort data, where the aggregated signature is shrunken by a simple thresholding strategy. The resulting method, RS-PL, is conceptually simple and easy to apply, relying on parameters automatically tuned by cross validation. Robust signature selection using RS-PL operates within an (external) subsampling framework to estimate the selection probabilities of features in multiple trials of RS-PL. These probabilities are used for identifying reliable features to be included in a signature. Our method was evaluated on microarray data sets from neuroblastoma, lung adenocarcinoma, and breast cancer patients, extracting robust and relevant signatures for predicting survival risk. Signatures obtained by our method achieved high prediction performance and robustness, consistently over the three data sets. Genes with high selection probability in our robust signatures have been reported as cancer-relevant. 
The ordering of predictor coefficients associated with signatures was well-preserved across multiple trials of RS-PL, demonstrating the capability of our method for identifying a transferable consensus signature. The software is available as an R package rsig at CRAN (http://cran.r-project.org).
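The subsample-and-aggregate scheme can be sketched as follows. This Python illustration substitutes a simple correlation-based selector for the preconditioned lasso used by RS-PL (an explicit simplification): sparse selections from many random subsamples are aggregated into selection probabilities, and the final signature keeps only features above a threshold.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy cohort: 100 samples, 50 features; only features 0 and 1 carry signal.
n, p = 100, 50
X = rng.normal(size=(n, p))
y = X[:, 0] + X[:, 1] + 0.5 * rng.normal(size=n)

def select(Xs, ys, k=5):
    """Stand-in for a sparse fit on one subsample (the paper uses the
    preconditioned lasso): keep the k features most correlated with y."""
    corr = np.abs([np.corrcoef(Xs[:, j], ys)[0, 1] for j in range(Xs.shape[1])])
    return set(np.argsort(corr)[-k:])

# Aggregate selections over random subsamples into selection probabilities;
# the final signature keeps features above a (thresholding) cutoff.
trials = 50
counts = np.zeros(p)
for _ in range(trials):
    idx = rng.choice(n, size=n // 2, replace=False)
    for j in select(X[idx], y[idx]):
        counts[j] += 1
probs = counts / trials
signature = np.flatnonzero(probs >= 0.8)
print(signature)
```

The subsample fraction, number of trials, and the 0.8 cutoff are illustrative; in RS-PL these choices are tuned by cross validation and wrapped in an additional external subsampling layer.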
Learning and Intelligent Optimization | 2016
Jakob Richter; Helena Kotthaus; Bernd Bischl; Peter Marwedel; Jörg Rahnenführer; Michel Lang
We present a Resource-Aware Model-Based Optimization framework RAMBO that leads to efficient utilization of parallel computer architectures through resource-aware scheduling strategies. Conventional MBO fits a regression model on the set of already evaluated configurations and their observed performances to guide the search. Due to its inherent sequential nature, an efficient parallel variant can not directly be derived, as only the most promising configuration w.r.t. an infill criterion is evaluated in each iteration. This issue has been addressed by generalized infill criteria in order to propose multiple points simultaneously for parallel execution in each sequential step. However, these extensions in general neglect systematic runtime differences in the configuration space which often leads to underutilized systems. We estimate runtimes using an additional surrogate model to improve the scheduling and demonstrate that our framework approach already yields improved resource utilization on two exemplary classification tasks.
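The conventional MBO loop described above can be sketched minimally: fit a regression surrogate on all evaluated points, then evaluate the surrogate's most promising point next. This toy Python version uses a quadratic least-squares surrogate and the plain surrogate minimum as infill criterion (real MBO implementations use richer models and criteria such as expected improvement):

```python
import numpy as np

rng = np.random.default_rng(3)

def objective(x):
    """Expensive black-box stand-in; true minimum at x = 2."""
    return (x - 2.0) ** 2

xs = list(rng.uniform(-5, 5, size=4))   # initial design
ys = [objective(x) for x in xs]
grid = np.linspace(-5, 5, 1001)

# Sequential MBO loop: fit surrogate, propose its minimizer, evaluate it.
for _ in range(5):
    coeffs = np.polyfit(xs, ys, deg=2)              # regression surrogate
    proposal = grid[np.argmin(np.polyval(coeffs, grid))]  # infill optimum
    xs.append(proposal)
    ys.append(objective(proposal))

print(round(float(xs[-1]), 2), round(min(ys), 4))
```

Because each iteration proposes a single point, parallel workers would sit idle here; RAMBO's contribution is to propose several points and schedule them by predicted runtime.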
Learning and Intelligent Optimization | 2017
Helena Kotthaus; Jakob Richter; Andreas Lang; Janek Thomas; Bernd Bischl; Peter Marwedel; Jörg Rahnenführer; Michel Lang
Sequential model-based optimization is a popular technique for global optimization of expensive black-box functions. It uses a regression model to approximate the objective function and iteratively proposes new interesting points. Deviating from the original formulation, it is often indispensable to apply parallelization to speed up the computation. This is usually achieved by evaluating as many points per iteration as there are workers available. However, if runtimes of the objective function are heterogeneous, resources might be wasted by idle workers. Our new knapsack-based scheduling approach aims at increasing the effectiveness of parallel optimization through efficient resource utilization. Using runtime predictions derived from an extra regression model, we efficiently map evaluations to workers and reduce idling. We compare our approach to five established parallelization strategies on a set of continuous functions with heterogeneous runtimes. Our benchmark covers comparisons of synchronous and asynchronous model-based approaches and investigates scalability.
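The mapping of predicted runtimes to workers can be illustrated with a simple scheduling sketch. This uses a greedy longest-processing-time heuristic as a simplified stand-in for the paper's knapsack formulation: each proposed evaluation, sorted by predicted runtime, goes to the currently least-loaded worker.

```python
import heapq

# Predicted runtimes (seconds) for proposed point evaluations; in the
# paper these come from an extra regression model.
predicted = [8.0, 5.0, 4.0, 3.0, 3.0, 2.0, 1.0]
n_workers = 3

# Longest-processing-time-first: assign each job to the least-loaded
# worker, tracked in a min-heap keyed by current load.
heap = [(0.0, w, []) for w in range(n_workers)]  # (load, worker id, jobs)
heapq.heapify(heap)
for job in sorted(predicted, reverse=True):
    load, w, jobs = heapq.heappop(heap)
    heapq.heappush(heap, (load + job, w, jobs + [job]))

assignment = sorted(heap, key=lambda t: t[1])
makespan = max(load for load, _, _ in assignment)
for load, w, jobs in assignment:
    print(f"worker {w}: {jobs} (busy {load}s)")
print("makespan:", makespan)
```

With balanced loads (here 9, 8 and 9 seconds) no worker idles for long; naive round-robin assignment of the same jobs would leave one worker waiting considerably longer.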
Computational and Mathematical Methods in Medicine | 2017
Andrea Bommert; Jörg Rahnenführer; Michel Lang
Finding a good predictive model for a high-dimensional data set can be challenging. For genetic data, it is not only important to find a model with high predictive accuracy, but it is also important that this model uses only few features and that the selection of these features is stable. This is because, in bioinformatics, the models are used not only for prediction but also for drawing biological conclusions, which makes the interpretability and reliability of the model crucial. We suggest using three target criteria when fitting a predictive model to a high-dimensional data set: the classification accuracy, the stability of the feature selection, and the number of chosen features. As it is unclear which measure is best for evaluating stability, we first compare a variety of stability measures. We conclude that the Pearson correlation has the best theoretical and empirical properties. We also find that, for assessing stability, it is most important that a measure corrects for chance or for large numbers of chosen features. Finally, we analyse Pareto fronts and conclude that it is possible to find models with a stable selection of few features without losing much predictive accuracy.
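The Pearson-correlation stability measure can be sketched directly: represent each resampling run's feature selection as a 0/1 indicator vector over all features and average the pairwise Pearson correlations (which, unlike raw overlap counts, corrects for the agreement expected by chance). The selections below are hypothetical toy data.

```python
import numpy as np
from itertools import combinations

# Feature selections from four hypothetical resampling runs on p = 8 features.
p = 8
runs = [{0, 1, 2}, {0, 1, 3}, {0, 1, 2}, {0, 2, 3}]

def indicator(selected, p):
    """0/1 vector marking which of the p features were selected."""
    v = np.zeros(p)
    v[list(selected)] = 1.0
    return v

# Stability = mean pairwise Pearson correlation of the indicator vectors.
stability = np.mean([np.corrcoef(indicator(a, p), indicator(b, p))[0, 1]
                     for a, b in combinations(runs, 2)])
print(round(float(stability), 3))
```

Identical selections give correlation 1; selections overlapping only as much as chance predicts give correlation near 0, which is the correction-for-chance property highlighted above.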
Journal of Machine Learning Research | 2016
Bernd Bischl; Michel Lang; Lars Kotthoff; Julia Schiffner; Jakob Richter; Erich Studerus; Giuseppe Casalicchio; Zachary M. Jones
Journal of Statistical Software | 2015
Bernd Bischl; Michel Lang; Olaf Mersmann; Jörg Rahnenführer; Claus Weihs