Harald Kirsch

European Bioinformatics Institute

Publication

Featured research published by Harald Kirsch.

Bioinformatics | 2008

Text processing through Web services

Dietrich Rebholz-Schuhmann; Miguel Arregui; Sylvain Gaudan; Harald Kirsch; Antonio Jimeno

MOTIVATION: Text-mining (TM) solutions are developing into efficient services for researchers in the biomedical research community. Such solutions have to scale with the growing number and size of resources (e.g. available controlled vocabularies), with the amount of literature to be processed (e.g. about 17 million documents in PubMed) and with the demands of the user community (e.g. different methods for fact extraction). These demands motivated the development of a server-based solution for literature analysis. Whatizit is a suite of modules that analyse text for contained information, e.g. any scientific publication or Medline abstracts. Special modules identify terms and then link them to the corresponding entries in bioinformatics databases such as UniProtKB/Swiss-Prot data entries and Gene Ontology concepts. Other modules identify a set of selected annotation types like the set produced by the EBIMed analysis pipeline for proteins. In the case of Medline abstracts, Whatizit offers access to EBI's in-house installation via PMID or term query. For large quantities of the user's own text, the server can be operated in a streaming mode (http://www.ebi.ac.uk/webservices/whatizit).

Bioinformatics | 2007

EBIMed---text crunching to gather facts for proteins from Medline

Dietrich Rebholz-Schuhmann; Harald Kirsch; Miguel Arregui; Sylvain Gaudan; Mark Riethoven; Peter Stoehr

To allow efficient and systematic retrieval of statements from Medline we have developed EBIMed, a service that combines document retrieval with co-occurrence-based analysis of Medline abstracts. Upon keyword query, EBIMed retrieves the abstracts from EMBL-EBI's installation of Medline and filters for sentences that contain biomedical terminology maintained in public bioinformatics resources. The extracted sentences and terminology are used to generate an overview table on proteins, Gene Ontology (GO) annotations, drugs and species used in the same biological context. All terms in retrieved abstracts and extracted sentences are linked to their entries in biomedical databases. We assessed the quality of the identification of terms and relations in the retrieved sentences. More than 90% of the protein names found indeed represented a protein. According to the analysis of four protein-protein pairs from the Wnt pathway we estimated that 37% of the statements containing such a pair mentioned a meaningful interaction and clarified the interaction of Dkk with LRP. We conclude that EBIMed improves access to information where proteins and drugs are involved in the same biological process, e.g. statements with GO annotations of proteins, protein-protein interactions and effects of drugs on proteins. AVAILABILITY: Available at http://www.ebi.ac.uk/Rebholz-srv/ebimed
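
As a rough illustration of sentence-level co-occurrence analysis, the following minimal sketch counts how often two lexicon terms appear in the same sentence. It is not the EBIMed implementation: the toy lexicon, the naive sentence splitter and all function names are assumptions made for this example.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical toy lexicon; a real system would load terminology
# from public bioinformatics resources.
LEXICON = {
    "Dkk": "protein",
    "LRP": "protein",
    "Wnt": "protein",
    "lithium": "drug",
}

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def find_terms(sentence):
    """Return the lexicon terms that occur as whole words, in sorted order."""
    return sorted(
        term for term in LEXICON
        if re.search(r"\b" + re.escape(term) + r"\b", sentence)
    )

def cooccurrence_counts(text):
    """Count, per pair of terms, the sentences in which both occur."""
    counts = Counter()
    for sentence in split_sentences(text):
        for pair in combinations(find_terms(sentence), 2):
            counts[pair] += 1
    return counts
```

Calling `cooccurrence_counts` on a handful of abstracts yields pair counts that could feed an overview table. As the abstract notes, a co-occurring pair does not necessarily express a meaningful interaction.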

PLOS Biology | 2005

Facts from Text—Is Text Mining Ready to Deliver?

Dietrich Rebholz-Schuhmann; Harald Kirsch; Francisco M. Couto

The mining of information from scientific literature using computational tools has tremendous potential for knowledge discovery, but how close are we to realizing this potential?

Bioinformatics | 2005

Resolving abbreviations to their senses in Medline

Sylvain Gaudan; Harald Kirsch; Dietrich Rebholz-Schuhmann

MOTIVATION: Biological literature contains many abbreviations with one particular sense in each document. However, most abbreviations do not have a unique sense across the literature. Furthermore, many documents do not contain the long forms of the abbreviations. Resolving an abbreviation in a document consists of retrieving its sense in use. Abbreviation resolution improves accuracy of document retrieval engines and of information extraction systems. RESULTS: We combine an automatic analysis of Medline abstracts and linguistic methods to build a dictionary of abbreviation/sense pairs. The dictionary is used for the resolution of abbreviations occurring with their long forms. Ambiguous global abbreviations are resolved using support vector machines that have been trained on the context of each instance of the abbreviation/sense pairs, previously extracted for the dictionary set-up. The system disambiguates abbreviations with a precision of 98.9% at a recall of 98.2% (98.5% accuracy). This performance is superior to that of previously reported work. AVAILABILITY: The abbreviation resolution module is available at http://www.ebi.ac.uk/Rebholz/software.html.
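
The dictionary-building step can be pictured with a much simpler heuristic than the one the paper describes. The sketch below assumes abbreviations appear as "long form (ABBR)" and accepts a pair only when the abbreviation letters match the initials of the preceding words. The function names are hypothetical, and the SVM-based disambiguation of ambiguous abbreviations is not covered.

```python
import re

def extract_pairs(text):
    """Collect abbreviation/long-form pairs of the shape 'long form (ABBR)'.

    A pair is kept only if each letter of ABBR equals the initial of one
    of the words directly before the parenthesis (a deliberately strict,
    assumed heuristic).
    """
    pairs = {}
    for match in re.finditer(r"\(([A-Z]{2,})\)", text):
        abbr = match.group(1)
        words = text[:match.start()].split()
        candidate = words[-len(abbr):]
        if len(candidate) == len(abbr) and all(
            word[0].upper() == letter for word, letter in zip(candidate, abbr)
        ):
            pairs[abbr] = " ".join(candidate)
    return pairs

def resolve(abbr, dictionary):
    """Look up an abbreviation that occurs without its long form."""
    return dictionary.get(abbr)
```

A dictionary built this way maps each abbreviation to a single sense, so it cannot resolve abbreviations whose sense varies across documents. That case is what the context-trained classifiers in the abstract address.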

Journal of Biomedical Discovery and Collaboration | 2006

GOAnnotator: linking protein GO annotations to evidence text

Francisco M. Couto; Mário J. Silva; Vivian Lee; Emily Dimmer; Evelyn Camon; Rolf Apweiler; Harald Kirsch; Dietrich Rebholz-Schuhmann

BACKGROUND: Annotation of proteins with gene ontology (GO) terms is ongoing work and a complex task. Manual GO annotation is precise and precious, but it is time-consuming. Therefore, instead of curated annotations most of the proteins come with uncurated annotations, which have been generated automatically. Text-mining systems that use literature for automatic annotation have been proposed but they do not satisfy the high quality expectations of curators. RESULTS: In this paper we describe an approach that links uncurated annotations to text extracted from literature. The selection of the text is based on the similarity of the text to the term from the uncurated annotation. Besides substantiating the uncurated annotations, the extracted texts also lead to novel annotations. In addition, the approach uses the GO hierarchy to achieve high precision. Our approach is integrated into GOAnnotator, a tool that assists the curation process for GO annotation of UniProt proteins. CONCLUSION: The GO curators assessed GOAnnotator with a set of 66 distinct UniProt/SwissProt proteins with uncurated annotations. GOAnnotator provided correct evidence text at 93% precision. This high precision results from using the GO hierarchy to only select GO terms similar to GO terms from uncurated annotations in GOA. Our approach is the first one to achieve high precision, which is crucial for the efficient support of GO curators. GOAnnotator was implemented as a web tool that is freely available at http://xldb.di.fc.ul.pt/rebil/tools/goa/.

International Journal of Medical Informatics | 2006

Distributed modules for text annotation and IE applied to the biomedical domain

Harald Kirsch; Sylvain Gaudan; Dietrich Rebholz-Schuhmann

Biological databases contain facts from scientific literature that have been curated by hand to ensure high quality. Curation is time-consuming and can be supported by information extraction methods. We present a server software infrastructure into which modules can easily be plugged to identify biologically interesting pieces of text, which are then presented to the curator in a web interface. There are modules which identify UniProt, UMLS and GO terminology, gene and protein names, mutations and protein-protein interactions. UniProt, UMLS and GO concepts are automatically linked to the original source. The module for mutations is based on syntax patterns and the one for protein-protein interactions relies on chunk parsing. All modules work as separate servers, possibly distributed on different machines, and can be combined into processing pipelines as necessary. Communication is based on XML-annotated text streams, with each server processing the XML elements it is designed for and possibly adding more information in the form of XML annotation. The server and the underlying software are available to the public.
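
The pipeline idea, where each module handles only the XML elements it is designed for and passes annotated text on, can be sketched as follows. This is a minimal in-process illustration, not the published server software: modules are plain functions rather than separate servers, and the element names (`s`, `protein`), the lexicon and the identifiers are all made up for the example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical lexicon mapping names to made-up identifiers.
PROTEIN_IDS = {"Dkk1": "EX0001", "LRP6": "EX0002"}
PATTERN = re.compile("|".join(re.escape(name) for name in PROTEIN_IDS))

def tag_proteins(xml_text):
    """Module 1: wrap known protein names inside <s> elements in <protein>."""
    root = ET.fromstring(xml_text)
    for sentence in list(root.iter("s")):
        text = sentence.text or ""
        if len(sentence) > 0 or not text:
            continue  # skip sentences that are empty or already carry markup
        pos, last = 0, None
        for match in PATTERN.finditer(text):
            before = text[pos:match.start()]
            if last is None:
                sentence.text = before
            else:
                last.tail = before
            last = ET.SubElement(sentence, "protein")
            last.text = match.group(0)
            pos = match.end()
        if last is not None:
            last.tail = text[pos:]
    return ET.tostring(root, encoding="unicode")

def link_proteins(xml_text):
    """Module 2: add an id attribute to every <protein> element."""
    root = ET.fromstring(xml_text)
    for protein in root.iter("protein"):
        protein.set("id", PROTEIN_IDS.get(protein.text, "unknown"))
    return ET.tostring(root, encoding="unicode")

def run_pipeline(xml_text, modules):
    """Pass the annotated text through each module in order."""
    for module in modules:
        xml_text = module(xml_text)
    return xml_text
```

Because every module consumes and produces annotated XML text, modules can be reordered or replaced without touching the others, for example `run_pipeline(doc, [tag_proteins, link_proteins])`.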

Nature Biotechnology | 2006

Protein annotation by EBIMed

Dietrich Rebholz-Schuhmann; Harald Kirsch; Miguel Arregui; Sylvain Gaudan; Mark Rynbeek; Peter Stoehr

Nature Biotechnology, Volume 24, Number 8, August 2006.

NLPXML '06 Proceedings of the 5th Workshop on NLP and XML: Multi-Dimensional Markup in Natural Language Processing | 2006

Annotation and disambiguation of semantic types in biomedical text: a cascaded approach to named entity recognition

Dietrich Rebholz-Schuhmann; Harald Kirsch; Sylvain Gaudan; Miguel Arregui; Goran Nenadic

Publishers of biomedical journals increasingly use XML as the underlying document format. We present a modular text-processing pipeline that inserts XML markup into such documents in every processing step, leading to multi-dimensional markup. The markup introduced is used to identify and disambiguate named entities of several semantic types (protein/gene, Gene Ontology terms, drugs and species) and to communicate data from one module to the next. Each module independently adds, changes or removes markup, which allows for modularization and a flexible setup of the processing pipeline. We also describe how the cascaded approach is embedded in a large-scale XML-based application (EBIMed) used for on-line access to biomedical literature. We discuss the lessons learnt so far, as well as the open problems that need to be resolved. In particular, we argue that pragmatic and tailored solutions reduce the need for overlapping annotations, although not completely without cost.

Database | 2013

PCorral—interactive mining of protein interactions from MEDLINE

Chen Li; Antonio Jimeno-Yepes; Miguel Arregui; Harald Kirsch; Dietrich Rebholz-Schuhmann

The extraction of information from the scientific literature is a complex task, both for researchers doing manual curation and for automatic text processing solutions. The identification of protein–protein interactions (PPIs) requires the extraction of protein named entities and their relations. Semi-automatic interactive support is one approach to combine both solutions for efficient working processes to generate reliable database content. In principle, the extraction of PPIs can be achieved with different methods that can be combined to deliver high precision and/or high recall results in different combinations at the same time. Interactive use can be achieved if the analytical methods are fast enough to process the retrieved documents. PCorral provides interactive mining of PPIs from the scientific literature, allowing curators to skim MEDLINE for PPIs with low overhead. The keyword query to PCorral steers the selection of documents, and the subsequent text analysis generates high recall and high precision results for the curator. The underlying components of PCorral process the documents on the fly and are also available as a web service from the Whatizit infrastructure. The human interface summarizes the identified PPI results, and the involved entities are linked to relevant resources and databases. Altogether, PCorral serves curators at both the beginning and the end of the curation workflow for information retrieval and information extraction. Database URL: http://www.ebi.ac.uk/Rebholz-srv/pcorral.

JNLPBA '04 Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications | 2004

Distributed modules for text annotation and IE applied to the biomedical domain

Harald Kirsch; Dietrich Rebholz-Schuhmann

Biological databases contain facts from scientific literature, which have been curated by hand to ensure high quality. Curation is time-consuming and can be supported by information extraction methods. We present a server which identifies biological facts in scientific text and presents the annotation to the curator. These facts comprise UniProt, UMLS and GO terminology, gene and protein names, mutations and protein-protein interactions. UniProt, UMLS and GO concepts are automatically linked to the original source. The module for mutations is based on syntax patterns and the one for protein-protein interactions relies on NLP. All modules work independently of each other in single threads and are combined in a pipeline to ensure proper metadata integration. For fast response time the modules are distributed on a Linux cluster. The server is at present available to curation teams of biomedical data and will be opened to the public in the future.
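
A syntax-pattern approach to mutation mentions might look like the sketch below. The abstract does not list the actual patterns, so this example assumes the common single-letter point-mutation notation (wild-type residue, position, mutant residue, as in "V600E"); the names are hypothetical.

```python
import re

# The 20 standard amino acids in single-letter code.
AMINO_1 = "ACDEFGHIKLMNPQRSTVWY"

# Assumed pattern: residue letter, position digits, residue letter.
POINT_MUTATION = re.compile(r"\b([%s])(\d+)([%s])\b" % (AMINO_1, AMINO_1))

def find_point_mutations(text):
    """Return (wild_type, position, mutant) tuples found in the text."""
    return [
        (m.group(1), int(m.group(2)), m.group(3))
        for m in POINT_MUTATION.finditer(text)
    ]
```

A pattern this simple also matches unrelated tokens of the same shape, such as some cell line names, so a practical module would need additional context checks.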