Roman Klinger
Fraunhofer Society
Publication
Featured research published by Roman Klinger.
Genome Biology | 2008
Larry Smith; Lorraine K. Tanabe; Rie Johnson nee Ando; Cheng-Ju Kuo; I-Fang Chung; Chun-Nan Hsu; Yu-Shi Lin; Roman Klinger; Christoph M. Friedrich; Kuzman Ganchev; Manabu Torii; Hongfang Liu; Barry Haddow; Craig A. Struble; Richard J. Povinelli; Andreas Vlachos; William A. Baumgartner; Lawrence Hunter; Bob Carpenter; Richard Tzong-Han Tsai; Hong-Jie Dai; Feng Liu; Yifei Chen; Chengjie Sun; Sophia Katrenko; Pieter W. Adriaans; Christian Blaschke; Rafael Torres; Mariana Neves; Preslav Nakov
Nineteen teams presented results for the Gene Mention Task at the BioCreative II Workshop. In this task participants designed systems to identify substrings in sentences corresponding to gene name mentions. A variety of different methods were used and the results varied with a highest achieved F1 score of 0.8721. Here we present brief descriptions of all the methods used and a statistical analysis of the results. We also demonstrate that, by combining the results from all submissions, an F score of 0.9066 is feasible, and furthermore that the best result makes use of the lowest scoring submissions.
Intelligent Systems in Molecular Biology | 2008
Roman Klinger; Corinna Kolářik; Juliane Fluck; Martin Hofmann-Apitius; Christoph M. Friedrich
Motivation: Chemical compounds like small signal molecules or other biologically active chemical substances are an important entity class in life science publications and patents. Several representations and nomenclatures for chemicals exist, such as SMILES, InChI, IUPAC, or trivial names. Only SMILES and InChI names allow a direct structure search, but in biomedical texts trivial names and IUPAC-like names are used more frequently. While trivial names can be found with a dictionary-based approach and in such a way mapped to their corresponding structures, it is not possible to enumerate all IUPAC names. In this work, we present a new machine learning approach based on conditional random fields (CRF) to find mentions of IUPAC and IUPAC-like names in scientific text, as well as its evaluation and the conversion rate with available name-to-structure tools. Results: We present an IUPAC name recognizer with an F1 measure of 85.6% on a MEDLINE corpus. The evaluation of different CRF orders and offset conjunction orders demonstrates the importance of these parameters. An evaluation of hand-selected patent sections containing large enumerations and terms with mixed nomenclature shows good performance on these cases (F1 measure 81.5%). The remaining recognition problems concern detecting the correct borders of the typically long terms, especially when they occur in parentheses or enumerations. We demonstrate the scalability of our implementation by providing results from a full MEDLINE run. Availability: We plan to publish the corpora, the annotation guideline, and the conditional random field model as a UIMA component. Contact: [email protected]
Meeting of the Association for Computational Linguistics | 2014
Konstantin Buschmeier; Philipp Cimiano; Roman Klinger
Irony is an important device in human communication, both in everyday spoken conversations and in written texts including books, websites, chats, reviews, and Twitter messages, among others. Specific cases of irony and sarcasm have been studied in different contexts but, to the best of our knowledge, the first publicly available corpus including annotations about whether a text is ironic or not was published only recently by Filatova (2012). However, no baseline for classification of ironic or sarcastic reviews has been provided. With this paper, we aim to close this gap. We formulate the problem as a supervised classification task and evaluate different classifiers, reaching an F1 measure of up to 74% using logistic regression. We analyze the impact of a number of features which have been proposed in previous research, as well as combinations of them.
BMC Bioinformatics | 2011
Philippe Thomas; Roman Klinger; Laura I. Furlong; Martin Hofmann-Apitius; Christoph M. Friedrich
Background: Most information on genomic variations and their associations with phenotypes is covered exclusively in scientific publications rather than in structured databases. These texts commonly describe variations using natural language; database identifiers are seldom mentioned. This complicates the retrieval of variations and associated articles, as well as information extraction, e.g. the search for biological implications. To overcome these challenges, procedures to map textual mentions of variations to database identifiers need to be developed. Results: This article describes a workflow for normalization of variation mentions, i.e. their association with unique database identifiers. Common pitfalls in the interpretation of single nucleotide polymorphism (SNP) mentions are highlighted and discussed. The developed normalization procedure achieves a precision of 98.1% and a recall of 67.5% for unambiguous association of variation mentions with dbSNP identifiers on a text corpus based on 296 MEDLINE abstracts containing 527 mentions of SNPs. The annotated corpus is freely available at http://www.scai.fraunhofer.de/snp-normalization-corpus.html. Conclusions: Comparable approaches usually focus on variations mentioned on the protein sequence and neglect problems for other SNP mentions. The results presented here indicate that normalizing SNPs described on DNA level is more difficult than the normalization of SNPs described on protein level. The challenges associated with normalization are exemplified with ambiguities and errors which occur in this corpus.
Philosophical Transactions of the Royal Society A | 2008
Martin Hofmann-Apitius; Juliane Fluck; Laura I. Furlong; Fornes O; Corinna Kolarik; Susanne Hanser; Martin Boeker; Stefan Schulz; Ferran Sanz; Roman Klinger; Mevissen T; Gattermayer T; Baldo Oliva; Christoph M. Friedrich
In essence, the virtual physiological human (VPH) is a multiscale representation of human physiology spanning from the molecular level via cellular processes and multicellular organization of tissues to complex organ function. The different scales of the VPH deal with different entities, relationships and processes, and in consequence the models used to describe and simulate biological functions vary significantly. Here, we describe methods and strategies to generate knowledge environments representing molecular entities that can be used for modelling the molecular scale of the VPH. Our strategy to generate knowledge environments representing molecular entities is based on the combination of information extraction from scientific text and the integration of information from biomolecular databases. We introduce @neuLink, a first prototype of an automatically generated, disease-specific knowledge environment combining biomolecular, chemical, genetic and medical information. Finally, we provide a perspective for the future implementation and use of knowledge environments representing molecular entities for the VPH.
Journal of Bioinformatics and Computational Biology | 2007
Roman Klinger; Christoph M. Friedrich; Heinz-Theodor Mevissen; Juliane Fluck; Martin Hofmann-Apitius; Laura I. Furlong; Ferran Sanz
The influence of genetic variations on diseases or cellular processes is the main focus of many investigations, and results of biomedical studies are often only accessible through scientific publications. Automatic extraction of this information requires recognition of the gene names and the accompanying allelic variant information. In a previous work, the OSIRIS system for the detection of allelic variation in text based on a query expansion approach was communicated. Challenges associated with this system are the relatively low recall for variation mentions and gene name recognition. To tackle this challenge, we integrate the ProMiner system developed for the recognition and normalization of gene and protein names with a conditional random field (CRF)-based recognition of variation terms in biomedical text. Following the newly developed normalization of variation entities, we can link textual entities to Single Nucleotide Polymorphism database (dbSNP) entries. The performance of this novel approach is evaluated, and improved results in comparison to state-of-the-art systems are reported.
BMC Bioinformatics | 2009
Corinna Kolářik; Roman Klinger; Martin Hofmann-Apitius
Background: Posttranslational modifications of histones influence the structure of chromatin and in such a way take part in the regulation of gene expression. Certain histone modification patterns, distributed over the genome, are connected to cell as well as tissue differentiation and to the adaptation of organisms to their environment. Abnormal changes instead influence the development of disease states like cancer. The regulation mechanisms for modifying histones and their functionalities are the subject of epigenomics investigation and are still not completely understood. Text provides a rich resource of knowledge on epigenomics and modifications of histones in particular. It contains information about experimental studies, the conditions used, and results. To our knowledge, no approach has been published so far for identifying histone modifications in text. Results: We have developed an approach for identifying histone modifications in biomedical literature with Conditional Random Fields (CRF) and for resolving the recognized histone modification term variants by term standardization. For the term identification, F1 measures of 0.84 by 10-fold cross-validation on the training corpus and 0.81 on an independent test corpus have been obtained. The standardization enabled the correct transformation of 96% of the terms from the training corpus and 98% from the test corpus. Due to the lack of terminologies exhaustively covering specific histone modification types, we developed a histone modification term hierarchy for use in a semantic text retrieval system. Conclusion: The developed approach highly improves the retrieval of articles describing histone modifications. Since text contains context information about performed studies and experiments, the identification of histone modifications is the basis for supporting literature-based knowledge discovery and hypothesis generation to accelerate epigenomic research.
International Conference on Data Mining | 2013
Roman Klinger; Philipp Cimiano
Sentiment analysis and opinion mining are often addressed as a text classification or entity recognition problem, involving the detection or classification of aspects and subjective phrases. Many approaches do not model the relation between aspects and subjective phrases explicitly, implicitly assuming that a subjective phrase refers to a certain aspect if they co-occur in the same sentence, thus potentially sacrificing accuracy. Instead, in the approach presented in this paper, we model the relation between aspects and subjective phrases explicitly, exploiting a flexible model based on imperatively defined factor graphs (IDF). The extraction of subjective phrases, aspects and the relation between them is modeled as a joint inference problem and compared to a pure pipeline architecture. Our goal is to analyse and quantify to what extent a joint model outperforms a pipeline model in terms of extraction of aspects, subjective phrases and the relation between them. Our results show that, while we have a substantial improvement on predicting targets using a joint inference model, the performance on subjective phrase detection and relation extraction actually decreases only slightly.
Information Retrieval Facility Conference | 2010
Bernd Müller; Roman Klinger; Harsha Gurulingappa; Heinz-Theodor Mevissen; Martin Hofmann-Apitius; Juliane Fluck; Christoph M. Friedrich
In information retrieval, named entity recognition gives the opportunity to apply semantic search in domain-specific corpora. Recently, more full-text patents and journal articles have become freely available. As the information distribution amongst the different sections is unknown, an analysis of the diversity is of interest. This paper investigates the density and variety of relevant life science terminologies in Medline abstracts, PubMedCentral journal articles and patents from the TREC Chemistry Track. For this purpose, named entity recognition for various bio, pharmaceutical, and chemical entity classes has been conducted and the frequencies and distributions in the different text zones analyzed. The full texts from PubMedCentral comprise information to a greater extent than their abstracts while containing almost all given content from their abstracts. In the patents from the TREC Chemistry Track, this effect is even more extreme. In particular, the description section includes almost all entities mentioned in a patent and, in comparison to the claim section, contains at least 79% of all entities exclusively.
International Conference on Computational Linguistics | 2014
Benjamin Paassen; Andreas Stöckel; Raphael Dickfelder; Jan Philip Göpfert; Nicole Brazda; Tarek Kirchhoffer; Hans Werner Müller; Roman Klinger; Matthias Hartung; Philipp Cimiano
Preclinical research in the field of central nervous system trauma advances at a fast pace, currently yielding over 8,000 new publications per year, at an exponentially growing rate. This amount of published information by far exceeds the capacity of individual scientists to read and understand the relevant literature. So far, no clinical trial has led to therapeutic approaches which achieve functional recovery in human patients. In this paper, we describe a first prototype of an ontology-based information extraction system that automatically extracts relevant preclinical knowledge about spinal cord injury treatments from natural language text by recognizing participating entity classes and linking them to each other. The evaluation on an independent test corpus of manually annotated full text articles shows a macroaverage F1 measure of 0.74 with precision 0.68 and recall 0.81 on the task of identifying entities participating in relations.
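Several of the abstracts above report precision, recall, and F1 together. As a minimal sketch of how these figures relate, the following computes F1 as the harmonic mean of precision and recall, using the values quoted in the last abstract. The function name `f1_score` is hypothetical, and the abstract does not say whether its macroaveraged F1 was derived this way or by averaging per-class F1 scores, so this is only a consistency check under that assumption.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall.

    Returns 0.0 when both inputs are zero to avoid division by zero.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as quoted in the abstract above (assumed to be
# combinable directly; the paper may average per-class scores instead).
f1 = f1_score(0.68, 0.81)
print(round(f1, 2))
```

Under this assumption the result rounds to the F1 of 0.74 reported in the abstract.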