Lars Rosenbaum
University of Tübingen
Publications
Featured research published by Lars Rosenbaum.
Clinical Chemistry | 2013
Peiyuan Yin; Andreas Peter; Holger Franken; Xinjie Zhao; Sabine S. Neukamm; Lars Rosenbaum; Marianna Lucio; Andreas Zell; Hans-Ulrich Häring; Guowang Xu; Rainer Lehmann
BACKGROUND: Metabolomics is a powerful tool that is increasingly used in clinical research. Although excellent sample quality is essential, it can easily be compromised by undetected preanalytical errors. We set out to identify critical preanalytical steps and biomarkers that reflect preanalytical inaccuracies.
METHODS: We systematically investigated the effects of preanalytical variables (blood collection tubes, hemolysis, temperature and time before further processing, and number of freeze-thaw cycles) on metabolomics studies of clinical blood and plasma samples using a nontargeted LC-MS approach.
RESULTS: Serum and heparinate blood collection tubes led to chemical noise in the mass spectra. Distinct, significant changes of 64 features in the EDTA-plasma metabolome were detected when blood was exposed to room temperature for 2, 4, 8, and 24 h. The resulting pattern was characterized by increases in hypoxanthine and sphingosine 1-phosphate (800% and 380%, respectively, at 2 h). In contrast, the plasma metabolome was stable for up to 4 h when EDTA blood samples were immediately placed in iced water. Hemolysis also caused numerous changes in the metabolic profile. Unexpectedly, up to 4 freeze-thaw cycles changed the EDTA-plasma metabolome only slightly, but increased the individual variability.
CONCLUSIONS: Nontargeted metabolomics investigations led to the following recommendations for the preanalytical phase: test the blood collection tubes, avoid hemolysis, place whole blood immediately in ice water, use EDTA plasma, and preferably use nonrefrozen biobank samples. To exclude outliers due to preanalytical errors, inspect the biomarker signal intensities that reflect systematic as well as accidental preanalytical inaccuracies before bioinformatic data processing.
Journal of Cheminformatics | 2011
Georg Hinselmann; Lars Rosenbaum; Andreas Jahn; Nikolas Fechner; Andreas Zell
Background: The decomposition of a chemical graph is a convenient approach to encode information about the corresponding organic compound. While several commercial toolkits exist to encode molecules as so-called fingerprints, only a few open source implementations are available. The aim of this work is to introduce a library for exactly defined molecular decompositions, with a strong focus on the application of these features in machine learning and data mining. It provides several options, such as search depth, distance cut-offs, and atom and pharmacophore typing. Furthermore, it provides the functionality to combine, compare, or export the fingerprints into several formats.
Results: We provide a Java 1.6 library for the decomposition of chemical graphs based on the open source Chemistry Development Kit toolkit. We reimplemented popular fingerprinting algorithms such as depth-first search fingerprints, extended connectivity fingerprints, autocorrelation fingerprints (e.g., CATS2D), radial fingerprints (e.g., Molprint2D), geometrical Molprint, atom pairs, and pharmacophore fingerprints. We also implemented custom fingerprints such as the all-shortest-path fingerprint, which includes only the subset of shortest paths from the full set of paths of the depth-first search fingerprint. As an application of jCompoundMapper, we provide a command-line executable binary. We measured the conversion speed and number of features for each encoding and described the composition of the features in detail. The quality of the encodings was tested using the default parametrizations in combination with a support vector machine on the Sutherland QSAR data sets. Additionally, we benchmarked the fingerprint encodings on the large-scale Ames toxicity benchmark using a large-scale linear support vector machine. The results were promising and could often compete with literature results. On the large Ames benchmark, for example, we obtained an AUC ROC of 0.87 with a reimplementation of the extended connectivity fingerprint. This result is comparable to the performance achieved by a non-linear support vector machine using state-of-the-art descriptors. On the Sutherland QSAR data sets, the best fingerprint encodings showed a comparable or better performance on 5 of the 8 benchmarks when compared against the results of the best descriptors published in the paper of Sutherland et al.
Conclusions: jCompoundMapper is a library for chemical graph fingerprints with several tweaking possibilities and export options for open source data mining toolkits. The quality of the data mining results, the conversion speed, the LGPL software license, the command-line interface, and the exporters should be useful for many applications in cheminformatics, such as benchmarks against literature methods, comparison of data mining algorithms, similarity searching, and similarity-based data mining.
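jCompoundMapper itself is a Java/CDK library; as a rough illustration of the kind of encoding the paper reimplements, the following Python sketch computes extended-connectivity (Morgan) fingerprints with RDKit, a stand-in toolkit not used in the paper, and compares two toy compounds by Tanimoto similarity.

```python
# Minimal sketch: ECFP-style circular fingerprints, analogous to the
# extended-connectivity encoding reimplemented in jCompoundMapper.
# RDKit stands in for the Java/CDK-based library described above.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

smiles = ["CCO", "c1ccccc1O", "CC(=O)Nc1ccc(O)cc1"]  # toy molecules
mols = [Chem.MolFromSmiles(s) for s in smiles]

# radius=2 corresponds to ECFP4; nBits controls the hashed vector length
fps = [AllChem.GetMorganFingerprintAsBitVect(m, radius=2, nBits=2048)
       for m in mols]

# Tanimoto similarity between the first two compounds
sim = DataStructs.TanimotoSimilarity(fps[0], fps[1])
print(f"Tanimoto(ethanol, phenol) = {sim:.3f}")
```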
Diabetes Care | 2013
Rainer Lehmann; Holger Franken; Sascha Dammeier; Lars Rosenbaum; Konstantinos Kantartzis; Andreas Peter; Andreas Zell; Patrick Adam; Jia Li; Guowang Xu; Alfred Königsrainer; Jürgen Machann; Fritz Schick; Martin Hrabé de Angelis; Matthias Schwab; Harald Staiger; Erwin Schleicher; Amalia Gastaldelli; Andreas Fritsche; Hans-Ulrich Häring; Norbert Stefan
OBJECTIVE: Nonalcoholic fatty liver (NAFL) is thought to contribute to insulin resistance and its metabolic complications. However, some individuals with NAFL remain insulin sensitive. Mechanisms involved in the susceptibility to develop insulin resistance in humans with NAFL are largely unknown. We investigated circulating markers and mechanisms of a metabolically benign and malignant NAFL by applying a metabolomic approach.
RESEARCH DESIGN AND METHODS: A total of 265 metabolites were analyzed before and after a 9-month lifestyle intervention in plasma from 20 insulin-sensitive and 20 insulin-resistant subjects with NAFL. The relevant plasma metabolites were then tested for relationships with insulin sensitivity in 17 subjects without NAFL and in plasma from 29 subjects with liver tissue samples.
RESULTS: The best separation of the insulin-sensitive from the insulin-resistant NAFL group was achieved by a metabolite pattern including the branched-chain amino acids leucine and isoleucine, ornithine, the acylcarnitines C3:0-, C16:0-, and C18:0-carnitine, and lysophosphatidylcholine (lyso-PC) C16:0 (area under the ROC curve, 0.77 [P = 0.00023] at baseline and 0.80 [P = 0.000019] at follow-up). Among the individual metabolites, predominantly higher levels of lyso-PC C16:0, both at baseline (P = 0.0039) and at follow-up (P = 0.001), were found in the insulin-sensitive compared with the insulin-resistant subjects. In the non-NAFL groups, no differences in lyso-PC C16:0 levels were found between the insulin-sensitive and insulin-resistant subjects, and these relationships were replicated in plasma from subjects with liver tissue samples.
CONCLUSIONS: From a plasma metabolomic pattern, particularly lyso-PCs are able to separate metabolically benign from malignant NAFL in humans and may highlight important pathways in the pathogenesis of fatty liver–induced insulin resistance.
Journal of Cheminformatics | 2011
Lars Rosenbaum; Georg Hinselmann; Andreas Jahn; Andreas Zell
Background: Model-based virtual screening plays an important role in the early drug discovery stage. The outcomes of high-throughput screens are a valuable source for machine learning algorithms to infer such models. Besides strong performance, the interpretability of a machine learning model is a desired property for guiding the optimization of a compound in later drug discovery stages. Linear support vector machines have shown convincing performance on large-scale data sets. The goal of this study is to present a heat map molecule coloring technique for interpreting linear support vector machine models. Based on the weights of a linear model, the visualization approach colors each atom and bond of a compound according to its importance for activity.
Results: We evaluated our approach on a toxicity data set, a chromosome aberration data set, and the maximum unbiased validation data sets. The experiments show that our method sensibly visualizes structure-property and structure-activity relationships of a linear support vector machine model. The coloring of ligands in the binding pocket of several crystal structures of a maximum unbiased validation data set target indicates that our approach helps to determine the correct ligand orientation in the binding pocket. Additionally, the heat map coloring enables the identification of substructures important for the binding of an inhibitor.
Conclusions: In combination with heat map coloring, linear support vector machine models can help to guide the modification of a compound in later stages of drug discovery. In particular, substructures identified as important by our method might be a starting point for the optimization of a lead compound. The heat map coloring should be considered complementary to structure-based modeling approaches. As such, it helps to achieve a better understanding of the binding mode of an inhibitor.
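A minimal sketch of the general weight-based coloring idea, assuming RDKit Morgan fingerprints and scikit-learn's LinearSVC in place of the paper's exact encoding and solver: each atom is scored by summing the weights of the fingerprint bits whose circular environments are centered on it. Molecules and labels below are toy placeholders, and crediting only the environment's center atom is a simplification of the paper's atom-and-bond coloring.

```python
# Sketch: train a linear SVM on hashed circular fingerprints, then map
# the learned bit weights back onto atoms to obtain a heat-map coloring.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.svm import LinearSVC

N_BITS = 2048

def fingerprint(mol):
    bit_info = {}
    fp = AllChem.GetMorganFingerprintAsBitVect(
        mol, radius=2, nBits=N_BITS, bitInfo=bit_info)
    return np.array(fp), bit_info

smiles = ["CCO", "CCN", "c1ccccc1O", "c1ccccc1N"]   # toy compounds
labels = [1, 1, 0, 0]                                # toy activity labels
mols = [Chem.MolFromSmiles(s) for s in smiles]
X = np.array([fingerprint(m)[0] for m in mols])
w = LinearSVC(C=1.0).fit(X, labels).coef_.ravel()

mol = mols[0]
_, bit_info = fingerprint(mol)
atom_score = np.zeros(mol.GetNumAtoms())
for bit, environments in bit_info.items():
    for center_atom, _radius in environments:
        atom_score[center_atom] += w[bit]            # atom importance
print(atom_score)  # map these scores onto a color scale for the heat map
```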
Journal of Cheminformatics | 2013
Lars Rosenbaum; Alexander Dörr; Matthias Bauer; Frank M. Boeckler; Andreas Zell
Background: A plethora of studies indicate that the development of multi-target drugs is beneficial for complex diseases like cancer. Accurate QSAR models for each of the desired targets assist the optimization of a lead candidate by the prediction of affinity profiles. Often, the targets of a multi-target drug are sufficiently similar that, in principle, knowledge can be transferred between the QSAR models to improve model accuracy. In this study, we present two different multi-task algorithms from the field of transfer learning that can exploit the similarity between several targets to transfer knowledge between the target-specific QSAR models.
Results: We evaluated the two methods on simulated data and a data set of 112 human kinases assembled from the public database ChEMBL. The relatedness between the kinase targets was derived from the taxonomy of the human kinome. The experiments show that multi-task learning increases the performance compared to training separate models on both types of data, given a sufficient similarity between the tasks. On the kinase data, the best multi-task approach improved the mean squared error of the QSAR models for 58 of the kinase targets.
Conclusions: Multi-task learning is a valuable approach for inferring multi-target QSAR models for lead optimization. The application of multi-task learning is most beneficial if knowledge can be transferred from a similar task with a lot of in-domain knowledge to a task with little in-domain knowledge. Furthermore, the benefit increases with a decreasing overlap between the chemical spaces spanned by the tasks.
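The abstract does not spell out the two algorithms, so the sketch below illustrates the general principle with a different, standard construction: the feature-augmentation trick for multi-task ridge regression, where each task's weight vector is modeled as a shared part plus a task-specific deviation. All data and dimensions here are synthetic; this is not the paper's method.

```python
# Multi-task sketch via feature augmentation: each sample carries a
# shared feature block plus its own task block, so the solver learns
# w_t = w_shared + v_t and related tasks share statistical strength.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n_tasks, n_per_task, d = 3, 40, 20
w_shared = rng.normal(size=d)

X_blocks, y_blocks = [], []
for t in range(n_tasks):
    X = rng.normal(size=(n_per_task, d))
    w_t = w_shared + 0.2 * rng.normal(size=d)        # related tasks
    y = X @ w_t + 0.1 * rng.normal(size=n_per_task)
    # augmented design: [shared block | one block per task]
    aug = np.zeros((n_per_task, d * (1 + n_tasks)))
    aug[:, :d] = X
    aug[:, d * (1 + t): d * (2 + t)] = X
    X_blocks.append(aug)
    y_blocks.append(y)

model = Ridge(alpha=1.0).fit(np.vstack(X_blocks), np.concatenate(y_blocks))
# effective weights for task 0: shared part plus task-specific part
w_task0 = model.coef_[:d] + model.coef_[d:2 * d]
```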
Journal of Chemical Information and Modeling | 2011
Georg Hinselmann; Lars Rosenbaum; Andreas Jahn; Nikolas Fechner; Claude Ostermann; Andreas Zell
The goal of this study was to adapt a recently proposed linear large-scale support vector machine to large-scale binary cheminformatics classification problems and to assess its performance on various benchmarks using virtual screening performance measures. We extended the large-scale linear support vector machine library LIBLINEAR with state-of-the-art virtual high-throughput screening metrics to train classifiers on whole, large, unbalanced data sets. The formulation of this linear support vector machine performs excellently when applied to high-dimensional sparse feature vectors. An additional advantage is that a prediction has, on average, linear complexity in the number of non-zero features. Nevertheless, the approach assumes that a problem is linearly separable. Therefore, we conducted extensive benchmarking to evaluate the performance on large-scale problems of up to 175,000 samples. To examine the virtual screening performance, we determined the chemotype clusters using Feature Trees and integrated this information to compute weighted AUC-based performance measures and a leave-cluster-out cross-validation. We also considered the BEDROC score, a metric that was suggested to tackle the early enrichment problem. The performance on each problem was evaluated by a nested cross-validation and a nested leave-cluster-out cross-validation. We compared LIBLINEAR against a Naïve Bayes classifier, a random decision forest classifier, and a maximum similarity ranking approach. In a direct comparison, LIBLINEAR outperformed these reference approaches. A comparison to literature results showed that the LIBLINEAR performance is competitive, though it did not achieve results as good as the top-ranked nonlinear machines on these benchmarks. However, considering the overall convincing performance and computation time of the large-scale support vector machine, the approach provides an excellent alternative to established large-scale classification approaches.
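As a small illustration of the evaluation setup described here, the sketch below trains a linear SVM on a synthetic unbalanced data set and scores it with ROC AUC under stratified cross-validation. scikit-learn's LinearSVC wraps the same liblinear solver; the paper's weighted AUC variants, BEDROC score, and leave-cluster-out splits are omitted for brevity.

```python
# Sketch: linear SVM on unbalanced data, scored with ROC AUC under CV,
# in the spirit of the LIBLINEAR benchmarks described above.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import LinearSVC

# synthetic stand-in for an unbalanced HTS data set (~5% actives)
X, y = make_classification(n_samples=2000, n_features=500,
                           weights=[0.95], random_state=0)

aucs = []
for train, test in StratifiedKFold(n_splits=5, shuffle=True,
                                   random_state=0).split(X, y):
    clf = LinearSVC(C=1.0, class_weight="balanced").fit(X[train], y[train])
    score = clf.decision_function(X[test])   # ranking score for the AUC
    aucs.append(roc_auc_score(y[test], score))
print(f"mean AUC = {np.mean(aucs):.3f}")
```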
Journal of Cheminformatics | 2015
Alexander Dörr; Lars Rosenbaum; Andreas Zell
Background: In this study, we present an SVM-based ranking algorithm for the concurrent learning of compounds with different activity profiles and their varying prioritization. To this end, a specific labeling of each compound was elaborated in order to infer virtual screening models against multiple targets. We compared the method with several state-of-the-art SVM classification techniques that are capable of inferring multi-target screening models on three chemical data sets (cytochrome P450s, dehydrogenases, and a trypsin-like protease data set), each containing three different biological targets.
Results: The experiments show that ranking-based algorithms achieve increased performance for single- and multi-target virtual screening. Moreover, compared to other multi-target SVM methods, compounds that do not completely fulfill the desired activity profile are still ranked higher than decoys or compounds with an entirely undesired profile.
Conclusions: SVM-based ranking methods constitute a valuable approach for virtual screening in multi-target drug design. Such methods are most helpful when dealing with compounds with various activity profiles and when many ligands with an already perfectly matching activity profile cannot be expected.
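A hedged sketch of the generic pairwise ranking-SVM construction this family of methods builds on (not the paper's specific multi-target labeling scheme): preference pairs are converted into difference vectors and a linear SVM is fit on them, yielding a scoring function that ranks compounds. Data and utilities below are synthetic.

```python
# Minimal pairwise RankSVM sketch: learn a linear scoring function from
# "i is preferred over j" pairs encoded as difference vectors.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 30))           # toy compound descriptors
utility = X @ rng.normal(size=30)        # stand-in for profile desirability

# build difference vectors for random pairs with distinct utilities
pairs = rng.integers(0, len(X), size=(300, 2))
diffs, labels = [], []
for i, j in pairs:
    if utility[i] == utility[j]:
        continue
    diffs.append(X[i] - X[j])
    labels.append(1 if utility[i] > utility[j] else -1)

ranker = LinearSVC(C=1.0, fit_intercept=False).fit(np.array(diffs), labels)
scores = X @ ranker.coef_.ravel()        # higher score = ranked earlier
```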
Journal of Cheminformatics | 2011
Andreas Jahn; Lars Rosenbaum; Georg Hinselmann; Andreas Zell
Background: The performance of 3D-based virtual screening similarity functions is affected by the conformations of the compounds to which they are applied. Therefore, the results of 3D approaches are often less robust than those of 2D approaches. Applying 3D methods to multiple-conformer data sets normally reduces this weakness, but entails a significant computational overhead. Therefore, we developed a special conformational space encoding by means of Gaussian mixture models and a similarity function that operates on these models. The model-based encoding allows an efficient comparison of the conformational space of compounds.
Results: Comparisons of our 4D flexible atom-pair approach with over 15 state-of-the-art 2D- and 3D-based virtual screening similarity functions on the 40 data sets of the Directory of Useful Decoys show a robust performance of our approach. Even 3D-based approaches that operate on multiple conformers yield inferior results. The 4D flexible atom-pair method achieves an average AUC of 0.78 on the filtered Directory of Useful Decoys data sets. The best 2D- and 3D-based approaches of this study yield AUC values of 0.74 and 0.72, respectively. As a result, the 4D flexible atom-pair approach achieves an average rank of 1.25 with respect to 15 other state-of-the-art similarity functions and four different evaluation metrics.
Conclusions: Our 4D method yields a robust performance on 40 pharmaceutically relevant targets. The conformational space encoding enables an efficient comparison of conformational spaces, thereby circumventing the weakness of 3D-based approaches on single conformations. With over 100,000 similarity calculations on a single desktop CPU, the use of the 4D flexible atom-pair approach in real-world applications is feasible.
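One way to picture the model-based encoding: approximate the distance profile of a single flexible atom pair across conformers by a one-dimensional Gaussian mixture. The sketch below uses scikit-learn's GaussianMixture on synthetic distances; the paper's actual fitting procedure and model-to-model similarity function may differ.

```python
# Sketch of the conformational-space encoding idea: compress one atom
# pair's per-conformer distances into a compact 1D Gaussian mixture.
import numpy as np
from sklearn.mixture import GaussianMixture

# toy distance profile: one value per conformer, two preferred geometries
rng = np.random.default_rng(0)
distances = np.concatenate([rng.normal(3.1, 0.1, 60),
                            rng.normal(5.4, 0.2, 40)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(distances)
print(gmm.means_.ravel(), gmm.weights_)  # compact model of the profile

# Two such profiles can then be compared through their mixture densities
# (e.g., an overlap integral of the two GMMs) instead of enumerating
# conformer pairs, which is what makes the 4D comparison efficient.
```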
Molecular Informatics | 2011
Andreas Jahn; Georg Hinselmann; Lars Rosenbaum; Nikolas Fechner; Andreas Zell
3D-QSAR methods have, in comparison to 2D-QSAR approaches, an additional dimension as a source of information. This extra dimension can be seen as both a curse and a blessing. Depending on the quality of the geometrical information, this additional source can either provide valuable data or introduce undesirable noise into the final model. As a consequence, the conformational space of molecules was integrated as a fourth dimension, yielding several extensions of traditional 3D-QSAR approaches like CoMFA or GRIND. In a recently published paper, we introduced a 4D kernel function that additionally uses the information of the conformational ensembles of molecules in order to remove the dependency on biologically active conformations. The results demonstrated that this approach is able to create more robust models than the solely 3D-based approach on minimized conformations. However, further investigations and tests revealed two main limitations of the 4D kernel function. The aim of this communication is twofold. First, we present the modifications and new developments of the 4D kernel function that fix, or at least reduce, the impact of those limitations. Second, QSAR results on nine different benchmark data sets are compared to the previous version as well as to literature results. Furthermore, the complete source code is publicly available on our department web site http://www.ra.cs.uni-tuebingen.de/software/4DFAP/.
4D Flexible Atom-Pair Kernel. To clarify the limitations of the previous version, we recapitulate the functionality of the 4D flexible atom-pair kernel (4D FAP); for a detailed explanation of the approach we refer to our previous publication. The aim of the 4D FAP is to incorporate the complete conformational space of molecules into pair-wise similarity calculations as used in instance-based machine learning methods such as support vector machines. For this purpose, we developed a conformational space encoding that enables the 4D FAP to compute similarity values based on an implicit comparison of the complete conformational space. Our approach can be subdivided into two disjoint steps: a preprocessing step that computes the encoding of the conformational space, and the actual similarity calculation based on that encoding. The conformational space encoding is based on intramolecular atom-pair (AP) distances as visualized in Figure 1. Given a series of conformers that represent the conformational space of a molecule M, the approach measures, for all APs, the distance in each conformation of the given molecule. This step results in |M|(|M| - 1)/2 AP distance profiles (the histograms of Figures 3, 4, and 5), where |M| is the number of heavy atoms of the molecule M. An AP distance profile contains one distance value per conformer and represents the distance behavior in the conformational space. AP distance profiles within rigid parts of a molecule, like neighboring atoms or ring systems, show only a small variability. As a result, the distance between the atoms that form the AP can be regarded as fixed in the conformational space. To detect those conserved APs, the 4D FAP uses a heuristic to analyze and divide the APs into three classes: flexible, neighboring, and rigid. The heuristic counts the rotatable bonds on the shortest path between both atoms of the AP (Figure 1). A bond is assumed to be rotatable if it is a single bond and not within a ring system.
Based on this heuristic, the 4D FAP treats each AP with more than one rotatable bond as a flexible AP. The APs of neighboring atoms form the neighboring class, and the remaining pairs (non-neighboring atoms with fewer than two rotatable bonds on the shortest path) form the rigid class. The APs of the neighboring and rigid classes need no special encoding because the corresponding distance value is conserved in the conformational space; the distance can therefore be taken from an arbitrary conformation of the molecule. The information of the flexible APs has to be encoded in a way that allows for efficient storage and comparison within a kernel function.
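The classification heuristic just described is easy to prototype. The following sketch uses RDKit (an assumption; the original work builds on other tooling) to count rotatable bonds on the shortest path between two atoms and assign the pair to the flexible, neighboring, or rigid class as defined above.

```python
# Sketch of the AP classification heuristic: count single, non-ring
# bonds on the shortest path between two atoms of a molecule.
from rdkit import Chem

def classify_atom_pair(mol, a, b):
    if mol.GetBondBetweenAtoms(a, b) is not None:
        return "neighboring"                      # directly bonded atoms
    path = Chem.GetShortestPath(mol, a, b)        # tuple of atom indices
    rotatable = 0
    for u, v in zip(path[:-1], path[1:]):
        bond = mol.GetBondBetweenAtoms(u, v)
        if bond.GetBondType() == Chem.BondType.SINGLE and not bond.IsInRing():
            rotatable += 1
    return "flexible" if rotatable > 1 else "rigid"

mol = Chem.MolFromSmiles("c1ccccc1CCO")           # 2-phenylethanol
print(classify_atom_pair(mol, 0, 8))              # across the flexible chain
```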
Molecular Informatics | 2010
Nikolas Fechner; Georg Hinselmann; Andreas Jahn; Lars Rosenbaum; Andreas Zell
The high-dimensional, implicit feature space in which a kernel-based machine learning model is defined has many beneficial properties, such as the possible non-linearity of the model in the input space. However, this implicit mapping is also the cause of one of the key drawbacks of kernel-based approaches compared to other machine learning approaches. Machine learning is not only about the development of a model that can be used to predict a property of unknown data, but also about gaining insight into the cause of that property. This becomes more apparent in the alternative term information retrieval, which, though not fully equivalent, describes a task closely related to machine learning but more focused on the causality of the pattern-property relationship contained in the data. This information retrieval aspect of data mining is not trivial to realize using kernel-based techniques. Non-kernel machine learning approaches often infer models that consist of a weighting of features and an integration of these weighted features. This behavior is apparent in polynomial regression, including partial least squares regression, but decision trees and Bayesian models can also be viewed this way. The weighting of the features has the advantage that it reveals the cause of a prediction and thus allows for an interpretation of the model. Therefore, these models allow retrieving the information considered relevant for the pattern-property relationship. In contrast to these feature-based machine learning techniques, kernel-based models are based on a weighting of the training data (i.e., the kernel similarities of a data item to the training data). Therefore, the model gives information only about the contribution of the training samples to the prediction. The relationships between the features of the items and the target properties are hidden in the implicit feature space. This drawback is a consequence of the kernel trick, which allows for replacing the dot product (e.g., <x_i, x_j> in the basic dual SVM formulation) by an arbitrary Mercer kernel and thus prevents an expression of the model in terms of the primal variables, which would lead to an interpretable weighting of the features. Therefore, a kernel model that is based on the dot product kernel, and thus does not employ the kernel trick, can be expressed in primal terms and analyzed further. Interestingly, the cardinality of the intersection of two pattern sets obtained by a graph decomposition corresponds to the dot product kernel of the respective non-hashed fingerprints and can thus be applied in this framework. Decomposition kernels have the advantage that they combine the avoidance of an explicit construction of the feature space with the possibility to obtain the feature space expression for an arbitrary data sample. This combination allows for an approach that is capable of performing a structural comparison that is not biased by a fixed set of possible features, and that allows for expressing the model by means of the primal variables. This primal model can be formulated analogously to the QSAR equation obtained by the Free–Wilson approach to chemometrics [2]. The QSAR equation q : {0,1}^m -> R for a molecule with m substructures, represented as a binary feature map x, is given as a Free–Wilson model by q(x) = c + sum_{i=1}^{m} w_i x_i, where w_i is the contribution of substructure i and c is a constant offset.
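To make the primal/dual point concrete, the sketch below fits a dual SVM with a plain dot-product kernel on synthetic binary "fingerprints" and recovers the primal weight vector as w = sum_i alpha_i y_i x_i, the interpretable Free–Wilson-style weighting discussed above. scikit-learn is used purely for illustration; the data and target are toy placeholders.

```python
# Sketch: with a dot-product (linear) kernel, the dual SVM solution can
# be folded back into primal feature weights, making the model readable
# as a Free-Wilson-style weighting of substructures.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(80, 15)).astype(float)  # binary "fingerprints"
y = np.sign(X[:, 0] + X[:, 1] - X[:, 2] - 0.5)       # toy activity rule

clf = SVC(kernel="linear", C=1.0).fit(X, y)
# dual_coef_ holds alpha_i * y_i for the support vectors, so the primal
# weights are their weighted sum over the support vectors:
w = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w, clf.coef_))  # matches sklearn's own primal weights
```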