Publication


Featured research published by Guido Sautter.


ZooKeys | 2011

Interlinking journal and wiki publications through joint citation: Working examples from ZooKeys and Plazi on Species-ID

Lyubomir Penev; Gregor Hagedorn; Daniel Mietchen; Teodor Georgiev; Pavel Stoev; Guido Sautter; Donat Agosti; Andreas Plank; Michael Balke; Lars Hendrich; Terry L. Erwin

Abstract: Scholarly publishing and citation practices have developed largely in the absence of versioned documents. The digital age requires new practices to combine the old and the new. We describe how the original published source and a versioned wiki page based on it can be reconciled and combined into a single citation reference. We illustrate the citation mechanism by way of practical examples focusing on journal and wiki publishing of taxon treatments. Specifically, we discuss mechanisms for permanent cross-linking between the static original publication and the dynamic, versioned wiki; for automated export of journal content to the wiki, to reduce the workload on authors; for combining the journal and wiki citations; and for integrating them with the attribution of wiki contributors.
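The joint citation described above could be assembled mechanically from its two parts. A minimal sketch in Python, assuming a MediaWiki-style permalink (`?oldid=`) for the versioned wiki page; the function name, field layout and the placeholder DOI are illustrative, not the scheme actually used by ZooKeys or Species-ID:

```python
def joint_citation(authors, year, title, journal, doi, wiki_url, revision_id):
    """Build a combined reference: the static journal part plus a link
    pinned to one permanent revision of the derived wiki page."""
    journal_part = f"{authors} ({year}) {title}. {journal}. doi:{doi}"
    # MediaWiki permalinks address a fixed revision via ?oldid=<revision>.
    wiki_part = f"Versioned wiki page: {wiki_url}?oldid={revision_id}"
    return f"{journal_part} {wiki_part}"

citation = joint_citation(
    authors="Penev L. et al.",
    year=2011,
    title="Interlinking journal and wiki publications through joint citation",
    journal="ZooKeys 90",
    doi="10.1234/example",  # placeholder DOI, not the article's real one
    wiki_url="http://species-id.net/wiki/Example_taxon",
    revision_id=12345,
)
```

Because the revision id is part of the citation, later edits to the wiki page cannot silently change what was cited.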


ZooKeys | 2011

XML schemas and mark-up practices of taxonomic literature

Lyubomir Penev; Christopher H. C. Lyal; Anna L. Weitzman; David R. Morse; Guido Sautter; Teodor Georgiev; Robert A. Morris; Terry Catapano; Donat Agosti

Abstract: We review the three most widely used XML schemas for marking up taxonomic texts: TaxonX, TaxPub and taXMLit. These are described from the viewpoint of their development history, current status, implementation, and use cases. We also discuss the concept of the “taxon treatment” from the viewpoint of XML mark-up of taxonomy. TaxonX and taXMLit are primarily designed for legacy literature, the former being more lightweight and focused on the recovery of taxon treatments, the latter providing a much more detailed set of tags to facilitate data extraction and analysis. TaxPub is an extension of the National Library of Medicine Document Type Definition (NLM DTD) for taxonomy, focused on layout and recovery, and as such is best suited for marking up new publications and archiving them in PubMed Central. All three schemas have their advantages and shortcomings and can be used for different purposes.
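What such treatment-level markup buys in practice can be sketched with a toy example. The fragment below is a simplified, TaxonX-flavoured treatment; the tag names and the absence of namespaces are assumptions for illustration, not the normative schema:

```python
import xml.etree.ElementTree as ET

# A simplified, TaxonX-flavoured taxon treatment. Real schemas define
# namespaces and far richer element sets; these tags are illustrative only.
doc = """<treatment>
  <nomenclature>
    <name>Aphaenogaster umphreyi</name>
    <status>sp. nov.</status>
  </nomenclature>
  <div type="materials_examined">
    <materialsCitation country="USA">Holotype worker, Florida.</materialsCitation>
  </div>
</treatment>"""

root = ET.fromstring(doc)
# With treatment-level markup, fine-grained facts become directly
# addressable -- the point of using these schemas for data extraction.
treated_name = root.findtext("nomenclature/name")
collecting_country = root.find(".//materialsCitation").get("country")
```

The same queries against flat OCR text would require fragile string matching; against the marked-up version they are one-line path expressions.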


Pacific Symposium on Biocomputing | 2006

Semi-automated XML markup of biosystematic legacy literature with the GoldenGATE editor.

Guido Sautter; Klemens Böhm; Donat Agosti

Today, digitization of legacy literature is a big issue. This also applies to the domain of biosystematics, where the process has only just started. Digitized biosystematics literature requires very precise and fine-grained markup in order to be useful for detailed search, data linkage and mining. However, manual markup at sentence level and below is cumbersome and time-consuming. In this paper, we present and evaluate the GoldenGATE editor, which is designed for the special needs of marking up OCR output with XML. It is built to support the user in this process as far as possible: its functionality ranges from easy, intuitive tagging through markup conversion to dynamic binding of configurable plug-ins provided by third parties. Our evaluation shows that marking up an OCR document with GoldenGATE is three to four times faster than with an off-the-shelf XML editor such as XML-Spy. With domain-specific NLP-based plug-ins, the gains are even higher.
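The semi-automated workflow described, where a plug-in proposes markup and a human corrects it, can be approximated with even a crude pattern-based tagger. A minimal sketch; the regex and the `taxonName` tag are assumptions, far simpler than GoldenGATE's actual plug-ins:

```python
import re

# Deliberately crude heuristic for Latin binomials: a capitalised genus
# name followed by a lowercase epithet of at least three letters. Real
# plug-ins add dictionaries, context rules and statistical models.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+ [a-z]{3,})\b")

def tag_taxon_names(ocr_text):
    """Wrap candidate taxon names in XML tags; a human then reviews and
    corrects the suggestions instead of tagging everything by hand."""
    return BINOMIAL.sub(r"<taxonName>\1</taxonName>", ocr_text)

tagged = tag_taxon_names("Workers of Camponotus gigas forage at night.")
```

Patterns like this over-generate on real text, which is exactly why the editor pairs automatic suggestion with efficient manual correction rather than relying on either alone.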


Biodiversity Data Journal | 2014

Enriched biodiversity data as a resource and service

Rutger A. Vos; Jordan Biserkov; Bachir Balech; Niall Beard; Matthew Blissett; Christian Y. A. Brenninkmeijer; Tom van Dooren; David Eades; George Gosline; Quentin Groom; Thomas Hamann; Hannes Hettling; Robert Hoehndorf; Ayco Holleman; Peter Hovenkamp; Patricia Kelbert; Don Kirkup; Youri Lammers; Thibaut DeMeulemeester; Daniel Mietchen; Jeremy Miller; Ross Mounce; Nicola Nicolson; Rod Page; Aleksandra Pawlik; Serrano Pereira; Lyubomir Penev; Kevin Richards; Guido Sautter; David P. Shorthouse

Abstract:

Background: Recent years have seen a surge in projects that produce large volumes of structured, machine-readable biodiversity data. To make these data amenable to processing by generic, open source “data enrichment” workflows, they are increasingly being represented in a variety of standards-compliant interchange formats. Here, we report on an initiative in which software developers and taxonomists came together to address the challenges and highlight the opportunities in the enrichment of such biodiversity data by engaging in intensive, collaborative software development: the Biodiversity Data Enrichment Hackathon.

Results: The hackathon brought together 37 participants (including developers and taxonomists, i.e. scientific professionals who gather, identify, name and classify species) from 10 countries: Belgium, Bulgaria, Canada, Finland, Germany, Italy, the Netherlands, New Zealand, the UK, and the US. The participants brought expertise in processing structured data, text mining, development of ontologies, digital identification keys, geographic information systems, niche modeling, natural language processing, provenance annotation, semantic integration, taxonomic name resolution, web service interfaces, workflow tools and visualisation. Most use cases and exemplar data were provided by taxonomists. One goal of the meeting was to facilitate re-use and enhancement of biodiversity knowledge by a broad range of stakeholders, such as taxonomists, systematists, ecologists, niche modelers, informaticians and ontologists. The suggested use cases resulted in nine breakout groups addressing three main themes: i) mobilising heritage biodiversity knowledge; ii) formalising and linking concepts; and iii) addressing interoperability between service platforms. Another goal was to further foster a community of experts in biodiversity informatics and to build human links between research projects and institutions, in response to recent calls to further such integration in this research domain.

Conclusions: Beyond deriving prototype solutions for each use case, areas of inadequacy were discussed and are being pursued further. It was striking how many possible applications for biodiversity data there were and how quickly solutions could be put together when the normal constraints to collaboration were broken down for a week. Conversely, mobilising biodiversity knowledge from its silos in heritage literature and natural history collections will continue to require formalisation of the concepts (and the links between them) that define the research domain, as well as increased interoperability between the software platforms that operate on these concepts.


European Conference on Research and Advanced Technology for Digital Libraries | 2007

Empirical evaluation of semi-automated XML annotation of text documents with the GoldenGATE editor

Guido Sautter; Klemens Böhm; Frank Padberg; Walter F. Tichy

Digitized scientific documents should be marked up according to domain-specific XML schemas to make maximum use of their content. Such markup allows for advanced, semantics-based access to the document collection. Many NLP applications have been developed to support automated annotation, but NLP results often are not accurate enough, and manual corrections are indispensable. We have therefore developed the GoldenGATE editor, a tool that integrates NLP applications and assistance features for manual XML editing. Plain XML editors do not feature such a tight integration: users have to create the markup manually or move the documents back and forth between the editor and (mostly command-line) NLP tools. This paper features the first empirical evaluation of how users benefit from such a tight integration when creating semantically rich digital libraries. We conducted experiments in which participants performed markup tasks on a document collection from a generic domain. The results clearly show that markup editing assistance in tight combination with NLP functionality significantly reduces the user effort in annotating documents.


Biodiversity Data Journal | 2015

Integrating and visualizing primary data from prospective and legacy taxonomic literature

Jeremy Miller; Donat Agosti; Lyubomir Penev; Guido Sautter; Teodor Georgiev; Terry Catapano; David J. Patterson; Serrano Pereira; Rutger A. Vos; Soraya Sierra

Abstract: Specimen data in taxonomic literature are among the highest quality primary biodiversity data. Innovative cybertaxonomic journals are using workflows that maintain data structure and disseminate electronic content to aggregators and other users; such structure is lost in traditional taxonomic publishing. Legacy taxonomic literature is a vast repository of knowledge about biodiversity. Currently, access to that resource is cumbersome, especially for non-specialist data consumers. Markup is a mechanism that makes this content more accessible, and is especially suited to machine analysis. Fine-grained XML (Extensible Markup Language) markup was applied to all (37) open-access articles published in the journal Zootaxa containing treatments on spiders (Order: Araneae). The markup approach was optimized to extract primary specimen data from legacy publications. These data were combined with data from articles containing treatments on spiders published in Biodiversity Data Journal, where XML structure is part of the routine publication process. A series of charts was developed to visualize the content of specimen data in XML-tagged taxonomic treatments, either singly or in aggregate. The data can be filtered by several fields (including journal, taxon, institutional collection, collecting country, collector, author, article and treatment) to query particular aspects of the data. We demonstrate here that XML markup using GoldenGATE can address the challenge presented by unstructured legacy data, can extract structured primary biodiversity data which can be aggregated with and jointly queried with data from other Darwin Core-compatible sources, and show how visualization of these data can communicate key information contained in biodiversity literature. We complement recent studies on aspects of biodiversity knowledge using XML structured data to explore 1) the time lag between species discovery and description, and 2) the prevalence of rarity in species descriptions.
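The faceted filtering described, querying aggregated specimen data by journal, country, collector and so on, can be sketched with plain dictionaries. The field names below are modelled loosely on Darwin Core terms and the records are invented illustrations, not data from the study:

```python
# Specimen records as might be extracted from XML-tagged treatments.
# Field names loosely follow Darwin Core; all values are illustrative.
records = [
    {"scientificName": "Araneus diadematus", "country": "Netherlands",
     "journal": "Zootaxa", "collector": "J. Miller"},
    {"scientificName": "Latrodectus geometricus", "country": "Brazil",
     "journal": "Biodiversity Data Journal", "collector": "A. Smith"},
    {"scientificName": "Araneus diadematus", "country": "Germany",
     "journal": "Zootaxa", "collector": "J. Miller"},
]

def filter_records(records, **criteria):
    """Filter aggregated specimen data by any combination of fields,
    mirroring the faceted queries behind the visualizations."""
    return [r for r in records
            if all(r.get(k) == v for k, v in criteria.items())]

zootaxa_spiders = filter_records(records, journal="Zootaxa")
```

Because legacy and born-digital records share one schema, a single query runs across both sources, which is the point of harmonising the markup.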


International Conference on Contemporary Computing | 2009

A New Approach towards Bibliographic Reference Identification, Parsing and Inline Citation Matching

Deepank Gupta; Bob Morris; Terry Catapano; Guido Sautter

A number of algorithms and approaches have been proposed for the problem of scanning and digitizing research papers. Work done in the past can be classified into three major approaches: regular-expression-based heuristics, learning-based algorithms, and knowledge-based systems. Our findings point to the inadequacy of existing open-source solutions such as ParaCite for papers with “micro-citations” in various European languages. This paper describes the work done as part of the Google Summer of Code 2008, using a combination of regular-expression-based heuristics and knowledge-based systems to develop a system which matches inline citations to their corresponding bibliographic references and identifies and extracts metadata from references. We present the description, implementation and results of our approach, which enhances accuracy and achieves better recognition rates.
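The core idea, matching an inline citation to a reference-list entry via a (surname, year) key extracted with regular expressions, can be sketched as follows. The patterns are simplified assumptions handling only one citation style; the actual system covers many more, including accented European surnames, which the character range below only gestures at:

```python
import re

references = [
    "Smith, J. (2008) On micro-citations. Journal of Examples 12: 1-10.",
    "Dupont, P. (2005) Sur les fourmis. Revue Exemplaire 3: 55-60.",
]

# Heuristic patterns: surname up front, four-digit year in parentheses.
REF_KEY = re.compile(r"([A-Za-zÀ-ÿ-]+),.*?\((\d{4})\)")
INLINE = re.compile(r"\(?([A-Za-zÀ-ÿ-]+),?\s+(\d{4})\)?")

def match_inline(citation, references):
    """Resolve an inline citation like '(Smith 2008)' to its full
    bibliographic reference via a (surname, year) key."""
    m = INLINE.match(citation)
    if not m:
        return None
    key = m.groups()
    for ref in references:
        r = REF_KEY.match(ref)
        if r and r.groups() == key:
            return ref
    return None

hit = match_inline("(Smith 2008)", references)
```

A knowledge-based layer would sit on top of this, e.g. resolving abbreviated journal names or disambiguating authors with the same surname and year.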


European Semantic Web Conference | 2009

Creating Digital Resources from Legacy Documents: An Experience Report from the Biosystematics Domain

Guido Sautter; Klemens Böhm; Donat Agosti; Christiana Klingenberg

Digitized legacy documents marked up with XML can be used in many ways, e.g., to generate RDF statements about the world they describe. A prerequisite for doing so is that the document markup is of sufficient quality. Since fully automated markup-generation methods cannot ensure this, manual correction and cleaning are indispensable. In this paper, we report on our experiences from a digitization and markup project for a large corpus of legacy documents from the biosystematics domain, with a focus on the use of modern tools. The markup created covers both document structure and semantic details. In contrast to previous markup projects reported in the literature, our corpus consists of large publications that comprise many different semantic units, and the documents contain OCR noise and layout artifacts. A core insight is that digitization and automated markup on the one hand and manual cleaning and correction on the other should be tightly interleaved, and that tools supporting this integration yield a significant improvement.
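The step from cleaned markup to RDF statements can be sketched mechanically. The XML tag names and the subject-URI scheme below are assumptions for illustration; the predicates reuse the real Darwin Core term namespace, though the actual project may have chosen differently:

```python
import xml.etree.ElementTree as ET

# A marked-up fragment of a legacy document (illustrative tag names).
xml_doc = """<treatment id="t1">
  <taxonName>Pheidole antipodum</taxonName>
  <collectingCountry>Australia</collectingCountry>
</treatment>"""

DWC = "http://rs.tdwg.org/dwc/terms/"  # Darwin Core term namespace

def treatment_to_ntriples(xml_text, base="http://example.org/treatment/"):
    """Emit N-Triples derived from the markup. The subject URI scheme
    and the tag-to-term mapping are assumptions for this sketch."""
    root = ET.fromstring(xml_text)
    subject = f"<{base}{root.get('id')}>"
    triples = []
    for tag, term in [("taxonName", "scientificName"),
                      ("collectingCountry", "country")]:
        value = root.findtext(tag)
        if value:
            triples.append(f'{subject} <{DWC}{term}> "{value}" .')
    return triples

triples = treatment_to_ntriples(xml_doc)
```

This also shows why markup quality is a prerequisite: every OCR error or misplaced tag that survives cleaning becomes a wrong statement in the generated RDF.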


Journal of Biomedical Semantics | 2018

OpenBiodiv-O: ontology of the OpenBiodiv knowledge management system

Viktor Senderov; Kiril Simov; Nico M. Franz; Pavel Stoev; Terry Catapano; Donat Agosti; Guido Sautter; Robert A. Morris; Lyubomir Penev

Background: The biodiversity domain, and in particular biological taxonomy, is moving in the direction of semantization of its research outputs. The present work introduces OpenBiodiv-O, the ontology that serves as the basis of the OpenBiodiv Knowledge Management System. Our intent is to provide an ontology that fills the gaps between ontologies for biodiversity resources, such as DarwinCore-based ontologies, and semantic publishing ontologies, such as the SPAR Ontologies. We bridge this gap by providing an ontology focusing on biological taxonomy.

Results: OpenBiodiv-O introduces classes, properties, and axioms in the domains of scholarly biodiversity publishing and biological taxonomy and aligns them with several important domain ontologies (FaBiO, DoCO, DwC, Darwin-SW, NOMEN, ENVO). By doing so, it bridges the ontological gap across scholarly biodiversity publishing and biological taxonomy, allows for the creation of a Linked Open Dataset (LOD) of biodiversity information (a biodiversity knowledge graph), and enables the creation of the OpenBiodiv Knowledge Management System. A key feature of the ontology is that it is an ontology of the scientific process of biological taxonomy and not of any particular state of knowledge. This feature allows it to express a multiplicity of scientific opinions. The resulting OpenBiodiv knowledge system may gain a high level of trust in the scientific community, as it does not force a scientific opinion on its users (e.g. practicing taxonomists, library researchers, etc.), but rather provides the tools for experts to encode different views as science progresses.

Conclusions: OpenBiodiv-O provides a conceptual model of the structure of a biodiversity publication and the development of related taxonomic concepts. It also serves as the basis for the OpenBiodiv Knowledge Management System.
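The key design point, modelling taxonomy as a process that admits multiple opinions rather than a single current truth, can be illustrated by keeping each statement with its provenance. The record fields, placeholder DOIs and the query helper below are hypothetical simplifications, not OpenBiodiv-O's actual classes and properties:

```python
# Two conflicting taxonomic opinions, each kept with the article that
# asserts it rather than merged into one "current truth". All field
# names and DOIs are hypothetical simplifications.
opinions = [
    {"name": "Drosophila melanogaster", "status": "accepted",
     "assertedBy": "doi:10.1234/articleA", "year": 2005},
    {"name": "Sophophora melanogaster", "status": "proposed combination",
     "assertedBy": "doi:10.1234/articleB", "year": 2010},
]

def opinions_about(epithet, opinions):
    """Return every recorded opinion whose name contains the epithet,
    letting the user, not the system, weigh the alternatives."""
    return [o for o in opinions if epithet in o["name"]]

views = opinions_about("melanogaster", opinions)
```

A user querying the species epithet sees both generic placements with their sources, which is the behaviour the abstract describes: the system provides tools to encode differing views instead of forcing one opinion.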


Biodiversity Data Journal | 2015

Corrected data re-harvested: curating literature in the era of networked biodiversity informatics.

Jeremy Miller; Teodor Georgiev; Pavel Stoev; Guido Sautter; Lyubomir Penev

Science makes progress through a constant process of re-evaluation. Revision and error correction are inevitable and generally healthy for the advancement of science. In biodiversity literature, re-evaluation of earlier work can lead to new conclusions, such as a revised taxonomic determination. When significant errors are discovered, conscientious authors may correct the record by publishing an erratum or corrigendum.

Collaboration


Dive into Guido Sautter's collaborations.

Top Co-Authors

Klemens Böhm, Karlsruhe Institute of Technology
Donat Agosti, American Museum of Natural History
Teodor Georgiev, Bulgarian Academy of Sciences
Lyubomir Penev, Bulgarian Academy of Sciences
Pavel Stoev, National Museum of Natural History
Viktor Senderov, Bulgarian Academy of Sciences