Network


Latest external collaborations at the country level. Dive into the details by clicking on the dots.

Hotspot


Dive into the research topics where Nikolas Fechner is active.

Publication


Featured research published by Nikolas Fechner.


Journal of Cheminformatics | 2011

jCompoundMapper: An open source Java library and command-line tool for chemical fingerprints

Georg Hinselmann; Lars Rosenbaum; Andreas Jahn; Nikolas Fechner; Andreas Zell

Background: The decomposition of a chemical graph is a convenient approach to encode information about the corresponding organic compound. While several commercial toolkits exist to encode molecules as so-called fingerprints, only a few open source implementations are available. The aim of this work is to introduce a library for exactly defined molecular decompositions, with a strong focus on the application of these features in machine learning and data mining. It provides several options such as search depth, distance cut-offs, and atom and pharmacophore typing. Furthermore, it provides the functionality to combine, compare, or export the fingerprints into several formats.

Results: We provide a Java 1.6 library for the decomposition of chemical graphs based on the open source Chemistry Development Kit toolkit. We reimplemented popular fingerprinting algorithms such as depth-first search fingerprints, extended connectivity fingerprints, autocorrelation fingerprints (e.g. CATS2D), radial fingerprints (e.g. Molprint2D), geometrical Molprint, atom pairs, and pharmacophore fingerprints. We also implemented custom fingerprints such as the all-shortest-path fingerprint, which only includes the subset of shortest paths from the full set of paths of the depth-first search fingerprint. As an application of jCompoundMapper, we provide a command-line executable binary. We measured the conversion speed and number of features for each encoding and described the composition of the features in detail. The quality of the encodings was tested using the default parametrizations in combination with a support vector machine on the Sutherland QSAR data sets. Additionally, we benchmarked the fingerprint encodings on the large-scale Ames toxicity benchmark using a large-scale linear support vector machine. The results were promising and could often compete with literature results. On the large Ames benchmark, for example, we obtained an AUC ROC of 0.87 with a reimplementation of the extended connectivity fingerprint. This result is comparable to the performance achieved by a non-linear support vector machine using state-of-the-art descriptors. On the Sutherland QSAR data sets, the best fingerprint encodings showed comparable or better performance on 5 of the 8 benchmarks when compared against the results of the best descriptors published in the paper of Sutherland et al.

Conclusions: jCompoundMapper is a library for chemical graph fingerprints with several tweaking possibilities and export options for open source data mining toolkits. The quality of the data mining results, the conversion speed, the LGPL software license, the command-line interface, and the exporters should be useful for many applications in cheminformatics, such as benchmarks against literature methods, comparison of data mining algorithms, similarity searching, and similarity-based data mining.
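An illustrative sketch of the kind of pipeline described above (circular, ECFP-like fingerprints fed to a linear support vector machine), written with RDKit and scikit-learn rather than jCompoundMapper's own API; molecules and labels are toy placeholders:

```python
# Sketch only: ECFP-like fingerprints + linear SVM, using RDKit/scikit-learn
# (not jCompoundMapper). SMILES strings and labels below are placeholders.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.svm import LinearSVC

smiles = ["CCO", "c1ccccc1O", "CC(=O)Nc1ccc(O)cc1", "CCN(CC)CC"]  # toy molecules
labels = [0, 1, 1, 0]                                             # toy activity labels

def ecfp_bits(smi, radius=2, n_bits=2048):
    """Encode a SMILES string as a Morgan/circular fingerprint bit vector."""
    mol = Chem.MolFromSmiles(smi)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    return np.array(list(fp), dtype=np.int8)

X = np.vstack([ecfp_bits(s) for s in smiles])
clf = LinearSVC(C=1.0).fit(X, labels)   # linear SVM on binary fingerprint features
print(clf.predict(X))
```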


Journal of Chemical Information and Modeling | 2013

Heterogeneous Classifier Fusion for Ligand-Based Virtual Screening: Or, How Decision Making by Committee Can Be a Good Thing

Sereina Riniker; Nikolas Fechner; Gregory A. Landrum

The concept of data fusion - the combination of information from different sources describing the same object with the expectation of generating a more accurate representation - has found application in a very broad range of disciplines. In the context of ligand-based virtual screening (VS), data fusion has been applied to combine knowledge from either different active molecules or different fingerprints to improve similarity search performance. Machine-learning (ML) methods based on the fusion of multiple homogeneous classifiers, in particular random forests, have also been widely applied in the ML literature. The heterogeneous version of classifier fusion - fusing the predictions from different model types - has been less explored. Here, we investigate heterogeneous classifier fusion for ligand-based VS using three different ML methods, random forest (RF), naïve Bayes (NB), and logistic regression (LR), with four 2D fingerprints: atom pairs, topological torsions, the RDKit fingerprint, and a circular fingerprint. The methods are compared using a previously developed benchmarking platform for 2D fingerprints, which is extended to ML methods in this article. The original data sets are filtered for difficulty, and a new set of challenging data sets from ChEMBL is added. Data sets were also generated for a second use case: starting from a small set of related actives instead of diverse actives. The final fused model consistently outperforms the other approaches across the broad variety of targets studied, indicating that heterogeneous classifier fusion is a very promising approach for ligand-based VS. The new data sets together with the adapted source code for the ML methods are provided in the Supporting Information.
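A minimal sketch of heterogeneous classifier fusion, assuming rank averaging as the fusion rule (the paper's exact rule may differ) and random placeholder fingerprints: three model types are trained on the same features and their score ranks are combined.

```python
# Heterogeneous classifier fusion sketch: RF + NB + LR, fused by averaging
# the rank of each model's activity probability. Data are random placeholders.
import numpy as np
from scipy.stats import rankdata
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.integers(0, 2, size=(200, 512))   # placeholder fingerprint bits
y_train = rng.integers(0, 2, size=200)          # placeholder activity labels
X_screen = rng.integers(0, 2, size=(50, 512))   # placeholder screening set

models = [
    RandomForestClassifier(n_estimators=100, random_state=0),
    BernoulliNB(),
    LogisticRegression(max_iter=1000),
]
for m in models:
    m.fit(X_train, y_train)

# Fuse: rank each model's probability of activity, then average the ranks.
ranks = [rankdata(m.predict_proba(X_screen)[:, 1]) for m in models]
fused_score = np.mean(ranks, axis=0)            # higher = more likely active
top_hits = np.argsort(fused_score)[::-1][:10]   # top-ranked compounds
print(top_hits)
```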


Journal of Cheminformatics | 2010

Estimation of the applicability domain of kernel-based machine learning models for virtual screening

Nikolas Fechner; Andreas Jahn; Georg Hinselmann; Andreas Zell

Background: The virtual screening of large compound databases is an important application of structure–activity relationship models. Due to the high structural diversity of these data sets, it is impossible for machine learning based QSAR models, which rely on a specific training set, to give reliable results for all compounds. Thus, it is important to consider the subset of the chemical space in which the model is applicable. The approaches to this problem that have been published so far mostly use vectorial descriptor representations to define this domain of applicability of the model. Unfortunately, these cannot be extended easily to structured kernel-based machine learning models. For this reason, we propose three approaches to estimate the domain of applicability of a kernel-based QSAR model.

Results: We evaluated three kernel-based applicability domain estimations using three different structured kernels on three virtual screening tasks. Each experiment consisted of the training of a kernel-based QSAR model using support vector regression and the ranking of a disjoint screening data set according to the predicted activity. For each prediction, the applicability of the model for the respective compound is quantitatively described by a score obtained from an applicability domain formulation. The suitability of the applicability domain estimation is evaluated by comparing the model performance on subsets of the screening data sets obtained with different thresholds for the applicability scores. This comparison indicates that it is possible to separate the part of the chemical space in which the model gives reliable predictions from the part consisting of structures too dissimilar to the training set to apply the model successfully. A closer inspection reveals that the virtual screening performance of the model is considerably improved if the half of the molecules with the lowest applicability scores is omitted from the screening.

Conclusion: The proposed applicability domain formulations for kernel-based QSAR models can successfully identify compounds for which no reliable predictions can be expected from the model. The resulting reduction of the search space and the elimination of some of the active compounds should not be considered a drawback, because the results indicate that, in most cases, these omitted ligands would not be found by the model anyway.
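A minimal sketch of one applicability-domain score in the spirit described above (the paper proposes three formulations; this simple variant, an assumption for illustration, scores a test compound by its mean kernel similarity to the k most similar training compounds and drops the least-applicable half before screening):

```python
# Applicability-domain sketch: Tanimoto kernel similarity to the training set
# as an applicability score; keep only the more applicable half of the
# screening set. Fingerprints are random placeholders.
import numpy as np

def tanimoto_kernel(A, B):
    """Tanimoto similarity between rows of two binary fingerprint matrices."""
    inter = A @ B.T
    union = A.sum(1)[:, None] + B.sum(1)[None, :] - inter
    return inter / np.maximum(union, 1)

def applicability_scores(K_test_train, k=5):
    """Mean similarity to the k most similar training compounds."""
    topk = np.sort(K_test_train, axis=1)[:, -k:]
    return topk.mean(axis=1)

rng = np.random.default_rng(1)
X_train = rng.integers(0, 2, size=(100, 256))   # placeholder training fingerprints
X_screen = rng.integers(0, 2, size=(400, 256))  # placeholder screening fingerprints

scores = applicability_scores(tanimoto_kernel(X_screen, X_train))
keep = scores >= np.median(scores)              # keep the more applicable half
print(f"screening {keep.sum()} of {len(scores)} compounds")
```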


Journal of Chemical Information and Modeling | 2009

Atomic Local Neighborhood Flexibility Incorporation into a Structured Similarity Measure for QSAR

Nikolas Fechner; Andreas Jahn; Georg Hinselmann; Andreas Zell

In this work, we introduce a new method to incorporate geometry into a structural similarity measure by approximating the conformational space of a molecule. Our idea is to break down the molecular conformation into the local conformations of neighbor atoms with respect to core atoms. This local geometry can be implicitly accessed through the trajectories of the neighboring atoms, which emerge from rotatable bonds. In our approach, the physicochemical atomic similarity, which can be used in structured similarity measures, is augmented by a local flexibility similarity, which gives a rough estimate of the similarity of the local conformational space. We incorporated this new way of encoding flexibility into the optimal assignment molecular similarity approach, which can be used as a pseudokernel in support vector machines. The impact of the local flexibility was evaluated on several published QSAR data sets. This led to an improvement of the model quality on 9 out of 10 data sets compared to the unmodified optimal assignment kernel.
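A hedged sketch of the optimal-assignment idea: atoms of two molecules are matched one-to-one so that the total pairwise atom similarity is maximal (here via the Hungarian algorithm). The atom-similarity and flexibility terms below are placeholders, not the paper's exact definitions.

```python
# Optimal-assignment similarity sketch: maximize summed atom-pair similarity
# under a one-to-one matching. Similarity matrices are random placeholders.
import numpy as np
from scipy.optimize import linear_sum_assignment

def optimal_assignment_similarity(atom_sim):
    """atom_sim[i, j]: similarity of atom i in molecule A to atom j in B."""
    row, col = linear_sum_assignment(-atom_sim)   # maximize total similarity
    return atom_sim[row, col].sum() / max(atom_sim.shape)

rng = np.random.default_rng(2)
phys_sim = rng.random((6, 8))      # placeholder physicochemical atom similarity
flex_sim = rng.random((6, 8))      # placeholder local-flexibility similarity
combined = phys_sim * flex_sim     # one simple way to augment the atom kernel
print(optimal_assignment_similarity(combined))
```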


Journal of Chemical Information and Modeling | 2011

Large-Scale Learning of Structure–Activity Relationships Using a Linear Support Vector Machine and Problem-Specific Metrics

Georg Hinselmann; Lars Rosenbaum; Andreas Jahn; Nikolas Fechner; Claude Ostermann; Andreas Zell

The goal of this study was to adapt a recently proposed linear large-scale support vector machine to large-scale binary cheminformatics classification problems and to assess its performance on various benchmarks using virtual screening performance measures. We extended the large-scale linear support vector machine library LIBLINEAR with state-of-the-art virtual high-throughput screening metrics to train classifiers on whole large and unbalanced data sets. The formulation of this linear support vector machine performs excellently when applied to high-dimensional sparse feature vectors. An additional advantage is the average linear complexity in the number of non-zero features of a prediction. Nevertheless, the approach assumes that a problem is linearly separable. Therefore, we conducted extensive benchmarking to evaluate the performance on large-scale problems of up to 175,000 samples. To examine the virtual screening performance, we determined the chemotype clusters using Feature Trees and integrated this information to compute weighted AUC-based performance measures and a leave-cluster-out cross-validation. We also considered the BEDROC score, a metric that was suggested to tackle the early enrichment problem. The performance on each problem was evaluated by a nested cross-validation and a nested leave-cluster-out cross-validation. We compared LIBLINEAR against a Naïve Bayes classifier, a random decision forest classifier, and a maximum similarity ranking approach. These reference approaches were outperformed in a direct comparison by LIBLINEAR. A comparison to literature results showed that the LIBLINEAR performance is competitive, but without achieving results as good as the top-ranked nonlinear machines on these benchmarks. However, considering the overall convincing performance and computation time of the large-scale support vector machine, the approach provides an excellent alternative to established large-scale classification approaches.
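A hedged sketch of this kind of evaluation setup, using scikit-learn's LIBLINEAR-backed LinearSVC on sparse features and scoring with ROC AUC and BEDROC (the latter via RDKit's scoring helper). Data are random placeholders; the chemotype-aware weighting, nested cross-validation, and leave-cluster-out splits of the paper are not reproduced here.

```python
# Sketch: linear SVM (LIBLINEAR backend) on sparse placeholder features,
# evaluated with ROC AUC and BEDROC on a held-out split.
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import roc_auc_score
from rdkit.ML.Scoring.Scoring import CalcBEDROC

rng = np.random.default_rng(0)
X = sparse_random(2000, 4096, density=0.01, format="csr", random_state=0)
y = rng.integers(0, 2, size=2000)               # placeholder labels (unbalanced in practice)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = LinearSVC(C=1.0, class_weight="balanced").fit(X_tr, y_tr)
scores = clf.decision_function(X_te)

print("ROC AUC:", roc_auc_score(y_te, scores))
ranked = sorted(zip(scores, y_te), reverse=True)                 # sort by decreasing score
print("BEDROC(alpha=20):", CalcBEDROC([[s, lbl] for s, lbl in ranked], 1, 20.0))
```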


Chemistry Central Journal | 2008

Beyond descriptor vectors: QSAR modelling using structural similarity

Andreas Zell; Georg Hinselmann; Nikolas Fechner; Andreas Jahn

21st CIC-Workshop meeting abstracts. A single PDF containing all abstracts in this Supplement is available at http://www.biomedcentral.com/content/pdf/1752-153X-2-S1-info.pdf


Neurocomputing | 2010

Graph kernels for chemical compounds using topological and three-dimensional local atom pair environments

Georg Hinselmann; Nikolas Fechner; Andreas Jahn; Matthias Eckert; Andreas Zell

Approaches that can predict the biological activity or properties of a chemical compound are an important application of machine learning. In this paper, we introduce a new kernel function for measuring the similarity between chemical compounds and for learning their related properties and activities. The method is based on local atom pair environments, which can be rapidly computed by using the topological all-shortest-paths matrix and the geometrical distance matrix of a molecular graph as lookup tables. The local atom pair environments are stored in prefix search trees, so-called tries, for an efficient comparison. The kernel can be computed either as an optimal assignment kernel or as a corresponding convolution kernel over all local atom similarities. We implemented the Tanimoto kernel, min kernel, minmax kernel, and dot product kernel as local kernels, which are computed recursively by traversing the tries. We tested the approach on eight structure-activity and structure-property molecule benchmark data sets from the literature. The models were trained with ε-support vector regression and support vector classification. The local atom pair kernels proved to be at least competitive with state-of-the-art kernels in seven out of eight cases in a direct comparison. A comparison against literature results using experimental setups similar to those in the original works confirmed these findings. The method is easy to implement and has robust default parameters.
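A minimal sketch of the min/max (minmax) and Tanimoto kernels over count dictionaries of local environments. How the environments are extracted (shortest-path lookups, tries) is not reproduced; the feature dictionaries below are placeholders.

```python
# Minmax and Tanimoto kernels over count dictionaries of local environments.
def minmax_kernel(a, b):
    """a, b: dicts mapping an environment label to its count in a molecule."""
    keys = set(a) | set(b)
    num = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    den = sum(max(a.get(k, 0), b.get(k, 0)) for k in keys)
    return num / den if den else 0.0

def tanimoto_kernel(a, b):
    """Tanimoto kernel on the dot product of count vectors."""
    dot = sum(a[k] * b[k] for k in set(a) & set(b))
    na = sum(v * v for v in a.values())
    nb = sum(v * v for v in b.values())
    return dot / (na + nb - dot) if (na + nb - dot) else 0.0

mol_a = {"C-C:1": 3, "C-O:2": 1, "C-N:3": 2}   # placeholder environment counts
mol_b = {"C-C:1": 2, "C-O:2": 2, "C-S:1": 1}
print(minmax_kernel(mol_a, mol_b), tanimoto_kernel(mol_a, mol_b))
```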


Molecular Informatics | 2010

Probabilistic Modeling of Conformational Space for 3D Machine Learning Approaches

Andreas Jahn; Georg Hinselmann; Nikolas Fechner; Carsten Henneges; Andreas Zell

We present a new probabilistic encoding of the conformational space of a molecule that allows for the integration into common similarity calculations. The method uses distance profiles of flexible atom‐pairs and computes generative models that describe the distance distribution in the conformational space. The generative models permit the use of probabilistic kernel functions and, therefore, our approach can be used to extend existing 3D molecular kernel functions, as applied in support vector machines, to build QSAR models. The resulting kernels are valid 4D kernel functions and reduce the dependency of the model quality on suitable conformations of the molecules. We showed in several experiments the robust performance of the 4D kernel function, which was extended by our approach, in comparison to the original 3D‐based kernel function. The new method compares the conformational space of two molecules within one kernel evaluation. Hence, the number of kernel evaluations is significantly reduced in comparison to common kernel‐based conformational space averaging techniques. Additionally, the performance gain of the extended model correlates with the flexibility of the data set and enables an a priori estimation of the model improvement.
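A hedged sketch of the idea: each flexible atom pair's distance profile across conformers is summarized by a generative model (here a single Gaussian, an illustrative assumption), and two molecules are compared with a closed-form probability product kernel between those Gaussians.

```python
# Probability product kernel (rho = 1) between two 1D Gaussians fitted to
# placeholder atom-pair distance profiles sampled over conformers.
import numpy as np

def gaussian_product_kernel(mu1, var1, mu2, var2):
    """Integral of the product of two 1D Gaussian densities (closed form)."""
    v = var1 + var2
    return np.exp(-0.5 * (mu1 - mu2) ** 2 / v) / np.sqrt(2 * np.pi * v)

# Placeholder distances (angstroms) of one flexible atom pair across
# sampled conformers of molecules A and B.
dists_a = np.array([4.1, 4.4, 5.0, 5.3, 4.8])
dists_b = np.array([4.0, 4.2, 4.5, 4.3])

k = gaussian_product_kernel(dists_a.mean(), dists_a.var(),
                            dists_b.mean(), dists_b.var())
print(k)
```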


Chemistry Central Journal | 2008

Estimating the applicability domain of kernel based QSPR models using classical descriptor vectors

Nikolas Fechner; Georg Hinselmann; C Schmiedl; Andreas Zell

21st CIC-Workshop meeting abstracts. A single PDF containing all abstracts in this Supplement is available at http://www.biomedcentral.com/content/pdf/1752-153X-2-S1-info.pdf


Molecular Informatics | 2010

A Free-Wilson-like Approach to Analyze QSAR Models Based on Graph Decomposition Kernels.

Nikolas Fechner; Georg Hinselmann; Andreas Jahn; Lars Rosenbaum; Andreas Zell

The high-dimensional, implicit feature space in which a kernel-based machine learning model is defined has many beneficial properties, such as the possible non-linearity of the model in the input space. However, this implicit mapping is also the cause of one of the key drawbacks of kernel-based approaches compared to other machine learning approaches. Machine learning is not only about the development of a model that can be used to predict a property of unknown data, but also about gaining insight into the cause of that property. This becomes more apparent in the alternative term information retrieval, which, though not fully equivalent, describes a task closely related to machine learning but more focused on the causality of the pattern-property relationship contained in the data. This information retrieval aspect of data mining is not trivial to realize using kernel-based techniques.

Non-kernel machine learning approaches often infer models that consist of a weighting of features and an integration of these weighted features. This behaviour is apparent in polynomial regression, including partial-least-squares regression, but decision trees and Bayesian models can also be seen that way. The weighting of the features has the advantage that it reveals the cause of a prediction and thus allows for an interpretation of the model. Therefore, these models allow retrieving the information considered relevant for the pattern-property relationship. In contrast to these feature-based machine learning techniques, kernel-based models are based on a weighting of the training data (i.e. the kernel similarities of a data item to the training data). Therefore, the model gives information only about the contribution of the training samples to the prediction. The relationships between the features of the items and the target properties remain hidden in the implicit feature space. This drawback is a consequence of the kernel trick, which allows the dot product (e.g., between training examples in the basic dual SVM formulation) to be replaced by an arbitrary Mercer kernel and thus prevents an expression of the model in terms of the primal variables, which would lead to an interpretable weighting of the features. Therefore, a kernel model that is based on the dot product kernel, and thus does not employ the kernel trick, can be expressed in primal terms and analyzed further.

Interestingly, the cardinality of the intersection of two pattern sets obtained by a graph decomposition corresponds to the dot product kernel of the respective non-hashed fingerprints and can thus be applied in this framework. Decomposition kernels have the advantage that they combine the abdication of an explicit construction of the feature space with the possibility to obtain the feature space expression for an arbitrary data sample. This combination allows for an approach that is capable of performing a structural comparison that is not biased by a fixed set of possible features and that allows for expressing the model by means of the primal variables. This primal model can be formulated analogously to the QSAR equation obtained by the Free–Wilson approach to chemometrics [2]. The QSAR equation q: {0,1}^m → R for a molecule with m substructures, represented as a binary feature map, is given as a Free–Wilson model by the equation below.
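The excerpt breaks off before the equation itself. For reference, a standard Free–Wilson formulation (a reconstruction, not quoted from the paper; the coefficient symbols are illustrative) is:

```latex
% Free-Wilson model: the predicted property is a constant term plus the
% summed contributions of the substructures present in the molecule.
q(\mathbf{x}) = c_0 + \sum_{i=1}^{m} c_i\, x_i,
\qquad \mathbf{x} \in \{0,1\}^m ,
```

where x_i indicates the presence of substructure i and c_i is its fitted contribution.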

Collaboration


Dive into Nikolas Fechner's collaborations.

Top Co-Authors

Andreas Zell
University of Tübingen

Andreas Jahn
University of Tübingen

C Schmiedl
University of Tübingen

M Eckert
University of Tübingen