Publication


Featured research published by Andreas Karwath.


Computational Intelligence and Data Mining | 2009

Large-scale attribute selection using wrappers

Martin Gütlein; Eibe Frank; Mark A. Hall; Andreas Karwath

Scheme-specific attribute selection with the wrapper and variants of forward selection is a popular attribute selection technique for classification that yields good results. However, it can run the risk of overfitting because of the extent of the search and the extensive use of internal cross-validation. Moreover, although wrapper evaluators tend to achieve superior accuracy compared to filters, they face a high computational cost. The problems of overfitting and high runtime occur in particular on high-dimensional datasets, like microarray data. We investigate Linear Forward Selection, a technique to reduce the number of attribute expansions in each forward selection step. Our experiments demonstrate that this approach is faster, finds smaller subsets, and can even increase the accuracy compared to standard forward selection. We also investigate a variant that applies explicit subset size determination in forward selection to combat overfitting, where the search is forced to stop at a precomputed “optimal” subset size. We show that this technique reduces subset size while maintaining comparable accuracy.
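
The idea of limiting attribute expansions in each forward selection step can be illustrated with a minimal sketch. It assumes one simple way of limiting them: rank the attributes once by their individual score and let the forward search consider only the top k. The abstract does not spell out the exact procedure, so this is an illustration and not the authors' implementation. The names `linear_forward_selection`, `evaluate`, `toy_evaluate` and the parameter `k` are hypothetical; in a real wrapper, `evaluate` would be the cross-validated accuracy of a classifier trained on the given attribute subset.

```python
from typing import Callable, FrozenSet, List

# Hypothetical evaluator type: maps an attribute subset to a merit score.
Evaluator = Callable[[FrozenSet[int]], float]


def linear_forward_selection(n_attributes: int, evaluate: Evaluator, k: int) -> List[int]:
    """Greedy forward selection restricted to the k best-ranked attributes."""
    # Rank every attribute once by its single-attribute score (assumed ranking step).
    ranked = sorted(
        range(n_attributes),
        key=lambda a: evaluate(frozenset([a])),
        reverse=True,
    )
    candidates = ranked[:k]  # only these attributes are ever expanded

    selected: List[int] = []
    best_score = float("-inf")
    while candidates:
        # Expand the current subset by each remaining candidate and keep the best.
        scored = [(evaluate(frozenset(selected + [a])), a) for a in candidates]
        score, attr = max(scored, key=lambda pair: pair[0])
        if score <= best_score:
            break  # no candidate improves the subset: stop the search
        selected.append(attr)
        candidates.remove(attr)
        best_score = score
    return selected


def toy_evaluate(subset: FrozenSet[int]) -> float:
    """Toy stand-in for a wrapper evaluator: three useful attributes, small size penalty."""
    weights = {2: 0.3, 5: 0.2, 7: 0.1}
    return sum(weights.get(a, 0.0) for a in subset) - 0.01 * len(subset)


if __name__ == "__main__":
    print(linear_forward_selection(10, toy_evaluate, k=4))
```

With n attributes, standard forward selection evaluates up to n expansions per step, whereas this sketch evaluates at most k, which is where the runtime saving would come from under the stated assumption.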


Journal of Cheminformatics | 2010

Collaborative development of predictive toxicology applications

Barry Hardy; Nicki Douglas; Christoph Helma; Micha Rautenberg; Nina Jeliazkova; Vedrin Jeliazkov; Ivelina Nikolova; Romualdo Benigni; Olga Tcheremenskaia; Stefan Kramer; Tobias Girschick; Fabian Buchwald; Jörg Wicker; Andreas Karwath; Martin Gütlein; Andreas Maunz; Haralambos Sarimveis; Georgia Melagraki; Antreas Afantitis; Pantelis Sopasakis; David Gallagher; Vladimir Poroikov; Dmitry Filimonov; Alexey V. Zakharov; Alexey Lagunin; Tatyana A. Gloriozova; Sergey V. Novikov; Natalia Skvortsova; Dmitry Druzhilovsky; Sunil Chawla

OpenTox provides an interoperable, standards-based Framework for the support of predictive toxicology data management, algorithms, modelling, validation and reporting. It is relevant to satisfying the chemical safety assessment requirements of the REACH legislation as it supports access to experimental data, (Quantitative) Structure-Activity Relationship models, and toxicological information through an integrating platform that adheres to regulatory requirements and OECD validation principles. Initial research defined the essential components of the Framework including the approach to data access, schema and management, use of controlled vocabularies and ontologies, architecture, web service and communications protocols, and selection and integration of algorithms for predictive modelling. OpenTox provides end-user oriented tools to non-computational specialists, risk assessors, and toxicological experts in addition to Application Programming Interfaces (APIs) for developers of new applications. OpenTox actively supports public standards for data representation, interfaces, vocabularies and ontologies, Open Source approaches to core platform components, and community-based collaboration approaches, so as to progress system interoperability goals. The OpenTox Framework includes APIs and services for compounds, datasets, features, algorithms, models, ontologies, tasks, validation, and reporting which may be combined into multiple applications satisfying a variety of different user needs. OpenTox applications are based on a set of distributed, interoperable OpenTox API-compliant REST web services. The OpenTox approach to ontology allows for efficient mapping of complementary data coming from different datasets into a unifying structure having a shared terminology and representation. Two initial OpenTox applications are presented as an illustration of the potential impact of OpenTox for high-quality and consistent structure-activity relationship modelling of REACH-relevant endpoints: ToxPredict which predicts and reports on toxicities for endpoints for an input chemical structure, and ToxCreate which builds and validates a predictive toxicity model based on an input toxicology dataset. Because of the extensible nature of the standardised Framework design, barriers of interoperability between applications and content are removed, as the user may combine data, models and validation from multiple sources in a dependable and time-effective way.


Bioinformatics | 2001

The utility of different representations of protein sequence for predicting functional class

Ross D. King; Andreas Karwath; Amanda Clare; Luc Dehaspe

MOTIVATION: Data Mining Prediction (DMP) is a novel approach to predicting protein functional class from sequence. DMP works even in the absence of a homologous protein of known function. We investigate the utility of different ways of representing protein sequence in DMP (residue frequencies, phylogeny, predicted structure) using the Escherichia coli genome as a model. RESULTS: Using the different representations, DMP learnt prediction rules that were more accurate than default at every level of function using every type of representation. The most effective way to represent sequence was using phylogeny (75% accuracy and 13% coverage of unassigned ORFs at the most general level of function; 69% accuracy and 7% coverage at the most detailed). We tested different methods for combining predictions from the different types of representation. These improved both the accuracy and coverage of predictions, e.g. 40% of all unassigned ORFs could be predicted at an estimated accuracy of 60% and 5% of unassigned ORFs could be predicted at an estimated accuracy of 86%.


Knowledge Discovery and Data Mining | 2000

Genome scale prediction of protein functional class from sequence using data mining

Ross D. King; Andreas Karwath; Amanda Clare; Luc Dehaspe

The ability to predict protein function from amino acid sequence is a central research goal of molecular biology. Such a capability would greatly aid the biological interpretation of the genomic data and accelerate its medical exploitation. For the existing sequenced genomes, function can typically be assigned to only 40-60% of the genes [4,8,12,7]. The new science of functional genomics is dedicated to discovering the function of these genes, and to further detailing gene function [10,27,17,6]. Here we present a novel data-mining [24,18] approach to predicting protein functional class from sequence. We demonstrate the effectiveness of this approach on the Mycobacterium tuberculosis [8] genome. Biologically interpretable rules are identified that can predict protein function even in the absence of identifiable sequence homology. These rules predict 65% of the genes with no previously assigned function in Mycobacterium tuberculosis (the bacterium that causes TB) with an estimated accuracy of 60-80% (depending on the level of functional assignment). The rules give insight into the evolutionary history of the organism.


Yeast | 2000

Accurate prediction of protein functional class from sequence in the Mycobacterium tuberculosis and Escherichia coli genomes using data mining.

Ross D. King; Andreas Karwath; Amanda Clare; Luc Dehaspe

The analysis of genomics data needs to become as automated as its generation. Here we present a novel data‐mining approach to predicting protein functional class from sequence. This method is based on a combination of inductive logic programming clustering and rule learning. We demonstrate the effectiveness of this approach on the M. tuberculosis and E. coli genomes, and identify biologically interpretable rules which predict protein functional class from information only available from the sequence. These rules predict 65% of the ORFs with no assigned function in M. tuberculosis and 24% of those in E. coli, with an estimated accuracy of 60–80% (depending on the level of functional assignment). The rules are founded on a combination of detection of remote homology, convergent evolution and horizontal gene transfer. We identify rules that predict protein functional class even in the absence of detectable sequence or structural homology. These rules give insight into the evolutionary history of M. tuberculosis and E. coli.


Journal of Cheminformatics | 2012

CheS-Mapper - Chemical Space Mapping and Visualization in 3D

Martin Gütlein; Andreas Karwath; Stefan Kramer

Analyzing chemical datasets is a challenging task for scientific researchers in the field of chemoinformatics. It is important, yet difficult, to understand the relationship between the structure of chemical compounds, their physico-chemical properties, and biological or toxic effects. In this respect, visualization tools can help to better comprehend the underlying correlations. Our recently developed 3D molecular viewer CheS-Mapper (Chemical Space Mapper) divides large datasets into clusters of similar compounds and then arranges them in 3D space, such that their spatial proximity reflects their similarity. The user can indirectly determine similarity by selecting which features to employ in the process. The tool can use and calculate different kinds of features, like structural fragments as well as quantitative chemical descriptors. These features can be highlighted within CheS-Mapper, which helps the chemist to better understand patterns and regularities and relate the observations to established scientific knowledge. As a final function, the tool can also be used to select and export specific subsets of a given dataset for further analysis.


Bioinformatics | 2006

Functional bioinformatics for Arabidopsis thaliana

Amanda Clare; Andreas Karwath; Helen J. Ougham; Ross D. King

MOTIVATION: Arabidopsis thaliana, which has the best understood plant genome, still has approximately one-third of its genes with no functional annotation at all from either MIPS or TAIR. We have applied our Data Mining Prediction (DMP) method to the problem of predicting the functional classes of these protein sequences. This method is based on using a hybrid machine-learning/data-mining method to identify patterns in the bioinformatic data about sequences that are predictive of function. We use data about sequence, predicted secondary structure, predicted structural domain, InterPro patterns, sequence similarity profile and expression data. RESULTS: We predicted the functional class of a high percentage of the Arabidopsis genes with currently unknown function. These predictions are interpretable and have good test accuracies. We describe in detail seven of the rules produced.


BMC Bioinformatics | 2002

Homology induction: The use of machine learning to improve sequence similarity searches

Andreas Karwath; Ross D. King

Background: The inference of homology between proteins is a key problem in molecular biology. The current best approaches only identify ~50% of homologies (with a false positive rate set at 1/1000). Results: We present Homology Induction (HI), a new approach to inferring homology. HI uses machine learning to bootstrap from standard sequence similarity search methods. First a standard method is run, then HI learns rules which are true for sequences of high similarity to the target (assumed homologues) and not true for general sequences; these rules are then used to discriminate sequences in the twilight zone. To learn the rules, HI describes the sequences in a novel way based on a bioinformatic knowledge base and the machine learning method of inductive logic programming. To evaluate HI we used the PDB40D benchmark, which lists sequences of known homology but low sequence similarity. We compared the HI methodology with PSI-BLAST alone and found HI performed significantly better. In addition, Receiver Operating Characteristic (ROC) curve analysis showed that these improvements were robust for all reasonable error costs. The predictive homology rules learnt by HI can be interpreted biologically to provide insight into conserved features of homologous protein families. Conclusions: HI is a new technique for the detection of remote protein homology, a central bioinformatic problem. HI with PSI-BLAST is shown to outperform PSI-BLAST for all error costs. It is expected that similar improvements would be obtained using HI with any sequence similarity method.


Molecular Informatics | 2013

A Large-Scale Empirical Evaluation of Cross-Validation and External Test Set Validation in (Q)SAR.

Martin Gütlein; Christoph Helma; Andreas Karwath; Stefan Kramer

(Q)SAR model validation is essential to ensure the quality of inferred models and to indicate future model predictivity on unseen compounds. Proper validation is also one of the requirements of regulatory authorities in order to accept the (Q)SAR model and to approve its use in real-world scenarios as an alternative testing method. However, the question of how to validate a (Q)SAR model, in particular whether to employ variants of cross-validation or external test set validation, is still under discussion. In this paper, we empirically compare k-fold cross-validation with external test set validation. To this end we introduce a workflow that realistically simulates the common problem setting of building predictive models for relatively small datasets. The workflow makes it possible to apply the built and validated models to large amounts of unseen data, and to compare the performance of the different validation approaches. The experimental results indicate that cross-validation produces better-performing (Q)SAR models than external test set validation and reduces the variance of the results, while at the same time underestimating the performance on unseen compounds. The experimental results reported in this paper suggest that, contrary to current conception in the community, cross-validation may play a significant role in evaluating the predictivity of (Q)SAR models.
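
The two validation schemes compared in the abstract above can be sketched in a few lines. This is a toy illustration only, under loud assumptions: the "model" is a constant mean predictor standing in for a real (Q)SAR learner, the error measure is mean absolute error, and all names (`fit_mean`, `mae`, `k_fold_cv`, `external_test`) are hypothetical. It does not reproduce the paper's workflow.

```python
import random
from statistics import mean
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[float, float]      # (descriptor value, activity): toy data
Model = Callable[[float], float]  # maps a descriptor value to a predicted activity


def fit_mean(train: Sequence[Sample]) -> Model:
    """Toy learner (assumption): always predicts the mean training activity."""
    avg = mean(y for _, y in train)
    return lambda x: avg


def mae(model: Model, test: Sequence[Sample]) -> float:
    """Mean absolute error of the model on the given samples."""
    return mean(abs(model(x) - y) for x, y in test)


def k_fold_cv(data: Sequence[Sample], k: int, seed: int = 0) -> float:
    """k-fold cross-validation: every sample is used for testing exactly once."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    folds: List[List[int]] = [idx[i::k] for i in range(k)]
    errors = []
    for fold in folds:
        held_out = set(fold)
        train = [data[i] for i in idx if i not in held_out]
        test = [data[i] for i in fold]
        errors.append(mae(fit_mean(train), test))
    return mean(errors)


def external_test(data: Sequence[Sample], test_fraction: float, seed: int = 0) -> float:
    """External test set validation: a single split into training and test data."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    n_test = max(1, int(len(data) * test_fraction))
    test = [data[i] for i in idx[:n_test]]
    train = [data[i] for i in idx[n_test:]]
    return mae(fit_mean(train), test)


if __name__ == "__main__":
    rng = random.Random(42)
    toy = [(float(i), 2.0 + rng.gauss(0.0, 1.0)) for i in range(30)]
    print("5-fold CV error:    ", k_fold_cv(toy, k=5))
    print("external test error:", external_test(toy, test_fraction=0.2))
```

In the cross-validation scheme the final model can be built on all available samples, while the external scheme permanently withholds part of a small dataset from training; the sketch only makes that structural difference concrete.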


International Conference on Robotics and Automation | 2010

Mapping indoor environments based on human activity

Slawomir Grzonka; Frederic Dijoux; Andreas Karwath; Wolfram Burgard

We present a novel approach to build approximate maps of structured environments utilizing human motion and activity. Our approach uses data recorded with a data suit which is equipped with several IMUs to detect movements of a person and door opening and closing events. In our approach we interpret the movements as motion constraints and door handling events as landmark detections in a graph-based SLAM framework. As we cannot distinguish between individual doors, we employ a multi-hypothesis approach on top of the SLAM system to deal with the high data-association uncertainty. As a result, our approach is able to accurately and robustly recover the trajectory of the person. We additionally take advantage of the fact that people traverse free space and that doors separate rooms to recover the geometric structure of the environment after the graph optimization. We evaluate our approach in several experiments carried out with different users and in environments of different types.

Collaboration


Dive into Andreas Karwath's collaborations.

Top Co-Authors

Ross D. King

University of Manchester

Kristian Kersting

Technical University of Dortmund

Luc De Raedt

Katholieke Universiteit Leuven
