
Publications


Featured research published by Christoph Ringlstetter.


International Journal on Document Analysis and Recognition | 2011

Towards information retrieval on historical document collections: the role of matching procedures and special lexica

Annette Gotscharek; Ulrich Reffle; Christoph Ringlstetter; Klaus U. Schulz; Andreas Neumann

Due to the large number of spelling variants found in historical texts, standard methods of Information Retrieval (IR) fail to produce satisfactory results on historical document collections. In order to improve recall for search engines, modern words used in queries have to be associated with corresponding historical variants found in the documents. In the literature, the use of (1) special matching procedures and (2) lexica for historical language has been suggested as two alternative ways to solve this problem. In the first part of the paper, we show how the construction of matching procedures and lexica may benefit from each other, leading the way to a combination of both approaches. A tool is presented where matching rules and a historical lexicon are built in an interleaved way based on corpus analysis. In the second part of the paper, we ask whether matching procedures alone suffice to lift IR on historical texts to a satisfactory level. Since historical language changes over centuries, it is not simple to obtain an answer. We present experiments where the performance of matching procedures in text collections from four centuries is studied. After classifying missed vocabulary, we measure the precision and recall of the matching procedure for each period. Results indicate that for earlier periods, matching procedures alone do not lead to satisfactory results. We then describe experiments where the gain in recall obtained from historical lexica of distinct sizes is estimated.
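To make the idea of a matching procedure concrete, here is a minimal sketch: a handful of hypothetical rewrite patterns (the paper derives its actual rules from corpus analysis) expand a modern query term into candidate historical spellings that can then be looked up in the document index.

```python
import itertools

# Hypothetical rewrite patterns for historical German spellings;
# the paper's rules are derived from corpus analysis.
PATTERNS = [("t", "th"), ("ei", "ey"), ("u", "v")]

def historical_variants(word, patterns=PATTERNS):
    """Expand a modern word into candidate historical spellings by
    optionally applying each pattern at every occurrence."""
    variants = {word}
    for modern, historical in patterns:
        for w in list(variants):
            parts = w.split(modern)
            # choose, for each occurrence, the original or the rewrite
            for mask in itertools.product([modern, historical], repeat=len(parts) - 1):
                variants.add("".join(p + s for p, s in zip(parts, mask)) + parts[-1])
    return variants

print(sorted(historical_variants("teil")))
# ['teil', 'teyl', 'theil', 'theyl']
```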


Computational Linguistics | 2006

Orthographic Errors in Web Pages: Toward Cleaner Web Corpora

Christoph Ringlstetter; Klaus U. Schulz; Stoyan Mihov

Since the Web is by far the largest public repository of natural language texts, recent experiments, methods, and tools in the area of corpus linguistics often use the Web as a corpus. For applications where high accuracy is crucial, one has to face the problem that a non-negligible number of orthographic and grammatical errors occur in Web documents. In this article we investigate the distribution of orthographic errors of various types in Web pages. As a by-product, methods are developed for efficiently detecting erroneous pages and for marking orthographic errors in acceptable Web documents, thus reducing the number of errors in corpora and linguistic knowledge bases automatically retrieved from the Web.
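A toy illustration of the page-filtering idea, assuming a prebuilt error dictionary (a set of known misspellings); the article's actual detection methods are more refined.

```python
import re

def estimated_error_rate(text, error_lexicon):
    """Fraction of tokens found in an error dictionary (a set of
    known misspellings); a crude stand-in for the article's detectors."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    return sum(t in error_lexicon for t in tokens) / len(tokens)

def clean_corpus(pages, error_lexicon, threshold=0.01):
    """Drop pages whose estimated error rate exceeds an illustrative
    threshold before adding them to a web-derived corpus."""
    return [p for p in pages if estimated_error_rate(p, error_lexicon) <= threshold]
```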


Analytics for Noisy Unstructured Text Data | 2009

Enabling information retrieval on historical document collections: the role of matching procedures and special lexica

Annette Gotscharek; Andreas Neumann; Ulrich Reffle; Christoph Ringlstetter; Klaus U. Schulz

Due to the large number of spelling variants found in historical texts, standard methods of Information Retrieval (IR) fail to produce satisfactory results on historical document collections. In order to improve recall for search engines, modern words used in queries have to be associated with corresponding historical variants found in the documents. In the literature, the use of (1) special matching procedures and (2) lexica for historical language has been suggested as two ways to solve this problem. In the first part of the paper we show how the construction of matching procedures and lexica may benefit from each other, leading the way to a combination of both approaches. A tool is presented where matching rules and a historical lexicon are built in an interleaved way based on corpus analysis. A crucial question considered in the second part of the paper is whether matching procedures alone suffice to lift IR on historical texts to a satisfactory level. Since historical language changes over centuries, it is not simple to obtain an answer. We present experiments where the performance of matching procedures in text collections from four centuries is studied. After classifying missed vocabulary, we measure the precision and recall of the matching procedure for each period. Our results indicate that for earlier periods historical lexica represent an important corrective to matching procedures in IR applications.
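The per-period evaluation can be pictured along these lines; the `matcher` interface and the gold-standard layout are assumptions, not the paper's actual setup.

```python
def precision_recall(matcher, vocabulary, gold):
    """Evaluate a matching procedure on one time period.
    `matcher(word, vocabulary)` returns the set of tokens it links to
    `word`; `gold` maps each modern query word to its true historical
    variants in `vocabulary`. Both layouts are illustrative."""
    tp = fp = fn = 0
    for word, true_variants in gold.items():
        matched = matcher(word, vocabulary)
        tp += len(matched & true_variants)
        fp += len(matched - true_variants)
        fn += len(true_variants - matched)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall
```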


International Conference on Document Analysis and Recognition | 2003

Lexical postcorrection of OCR-results: the web as a dynamic secondary dictionary?

Christian M. Strohmaier; Christoph Ringlstetter; Klaus U. Schulz; Stoyan Mihov

Postcorrection of OCR-results for text documents is usually based on electronic dictionaries. When scanning texts from a specific thematic area, conventional dictionaries often miss a considerable number of tokens. Furthermore, if word frequencies are stored with the entries, these frequencies will not properly reflect the frequencies found in the given thematic area. Correction adequacy suffers from these two shortcomings. We report on a series of experiments where we compare (1) the use of fixed, static large-scale dictionaries (including proper names and abbreviations) with (2) the use of dynamic dictionaries retrieved via an automated analysis of the vocabulary of web pages from a given domain, and (3) the use of mixed dictionaries. Our experiments, which address English and German document collections from a variety of fields, show that dynamic dictionaries of the above mentioned form can improve the coverage for the given thematic area in a significant way and help to improve the quality of lexical postcorrection methods.
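A rough sketch of the dynamic-dictionary idea under simple assumptions: vocabulary is harvested from web pages of the document's domain, merged with a static lexicon, and used for candidate lookup. Here difflib stands in for the paper's similarity measure.

```python
import re
from difflib import get_close_matches

def dynamic_dictionary(static_lexicon, domain_pages, min_freq=2):
    """Merge a static lexicon (a set of words) with vocabulary
    harvested from domain web pages; illustrative construction."""
    counts = {}
    for page in domain_pages:
        for tok in re.findall(r"[a-zäöüß]+", page.lower()):
            counts[tok] = counts.get(tok, 0) + 1
    return static_lexicon | {t for t, c in counts.items() if c >= min_freq}

def correction_candidates(token, lexicon, n=3):
    """Rank candidates from the mixed dictionary by string similarity."""
    return get_close_matches(token.lower(), lexicon, n=n, cutoff=0.8)
```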


Document Engineering | 2009

On lexical resources for digitization of historical documents

Annette Gotscharek; Ulrich Reffle; Christoph Ringlstetter; Klaus U. Schulz

Many European libraries are currently engaged in mass digitization projects that aim to make historical documents and corpora available online. In this context, appropriate lexical resources play a double role. First, they are needed to improve OCR recognition of historical documents, which currently does not lead to satisfactory results. Second, even assuming perfect OCR recognition, since historical language differs considerably from modern language, the matching process between queries submitted to search engines and variants of the search terms found in historical documents needs special support. While the usefulness of special dictionaries for both problems seems undisputed, concrete knowledge and experience are still missing: there are no hints as to what optimal lexical resources for historical documents should look like, and the real benefit gained from optimized lexical resources is unclear. Both questions are rather complex, since the answers depend on the point in history when the documents were created. We present a series of experiments which illuminate these points. For our evaluations we collected a large corpus covering German historical documents from before 1500 to 1950 and constructed various types of dictionaries. We present the coverage reached with each dictionary for ten subperiods of time. Additional experiments illuminate the improvements in OCR accuracy and Information Retrieval that can be reached, again looking at distinct dictionaries and periods of time. For both OCR and IR, our lexical resources lead to substantial improvements.
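The coverage measurements could be reproduced along these lines; the (year, text) corpus layout and the 50-year binning are illustrative choices, not the paper's.

```python
import re
from collections import defaultdict

def coverage_by_period(documents, lexicon):
    """Token coverage of a lexicon per subperiod. `documents` is an
    assumed list of (year, text) pairs; bins are 50-year spans."""
    hits, totals = defaultdict(int), defaultdict(int)
    for year, text in documents:
        period = (year // 50) * 50
        for tok in re.findall(r"[a-zäöüß]+", text.lower()):
            totals[period] += 1
            hits[period] += tok in lexicon
    return {p: hits[p] / totals[p] for p in sorted(totals)}
```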


International Conference on Document Analysis and Recognition | 2005

A corpus for comparative evaluation of OCR software and postcorrection techniques

Stoyan Mihov; Klaus U. Schulz; Christoph Ringlstetter; V. Dojchinova; V. Nakova; K. Kalpakchieva; O. Gerasimov; A. Gotscharek; C. Gercke

We describe a new corpus collected for the comparative evaluation of OCR software and postcorrection techniques. The corpus is freely available for academic use. The major part of the corpus (2306 files) consists of Bulgarian documents. Many of these documents contain both Cyrillic and Latin symbols. A smaller corpus with German documents has been added. All original documents represent real-life paper documents collected from enterprises and organizations. Most genres of written language and various document types are covered. The corpus contains the corresponding image files, rich meta-data, textual files obtained via OCR recognition, ground truth data for hundreds of example pages, and alignment software for experiments.
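A standard way to use such ground truth data is to align OCR output with the reference text via edit distance and report character accuracy; this is a generic measure sketched below, not the corpus's bundled alignment software.

```python
def char_accuracy(ocr, truth):
    """Character accuracy = 1 - edit_distance / len(truth), computed
    with plain dynamic-programming Levenshtein distance."""
    prev = list(range(len(truth) + 1))
    for i, c_ocr in enumerate(ocr, 1):
        curr = [i]
        for j, c_truth in enumerate(truth, 1):
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + (c_ocr != c_truth)))  # substitution
        prev = curr
    return 1.0 - prev[-1] / max(len(truth), 1)

# 'З' is the Cyrillic letter often confused with the digit 3
print(char_accuracy("Sofla 1ЗЗ6", "Sofia 1336"))  # 0.7
```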


Computer Vision and Pattern Recognition | 2003

A visual and interactive tool for optimizing lexical postcorrection of OCR results

Christian M. Strohmaier; Christoph Ringlstetter; Klaus U. Schulz; Stoyan Mihov

Systems for postcorrection of OCR results can be fine-tuned and adapted to new recognition tasks in many respects. One issue is the selection and adaptation of a suitable background dictionary. Another issue is the choice of a correction model, which includes, among other decisions, the selection of an appropriate distance measure for strings and the choice of a scoring function for ranking distinct correction alternatives. When combining the results obtained from distinct OCR engines, further parameters have to be fixed. Due to all these degrees of freedom, adaptation and fine-tuning of systems for lexical postcorrection is a difficult process. Here we describe a visual and interactive tool that semi-automates the generation of ground truth data, partially automates the adjustment of parameters, yields active support for error analysis, and thus helps to find correction strategies that lead to high accuracy with realistic effort.
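One of the degrees of freedom mentioned above, the scoring function, might look like this in its simplest form; the score shape and the alpha weight are illustrative, and `dist` can be any string metric (e.g. the Levenshtein sketch above).

```python
import math

def rank_candidates(token, candidates, freq, dist, alpha=1.0):
    """Rank correction alternatives by one illustrative scoring
    function: score = dist(token, c) - alpha * log(freq(c) + 1);
    lower scores are better. The paper's tool lets the user tune
    both the distance measure and the scoring function."""
    scored = sorted((dist(token, c) - alpha * math.log(freq.get(c, 0) + 1), c)
                    for c in candidates)
    return [c for _, c in scored]
```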


Analytics for Noisy Unstructured Text Data | 2007

Genre as noise: noise in genre

Andrea Stubbe; Christoph Ringlstetter; Klaus U. Schulz

Given a specific information need, documents of the wrong genre can be considered noise. From this perspective, genre classification helps to separate relevant documents from noise. Orthographic errors represent a second, finer notion of noise. Since specific genres often include documents with many errors, an interesting question is whether this “micro-noise” can help to classify genre. In this paper we consider both problems. After introducing a comprehensive hierarchy of genres, we present an intuitive method to build specialized and distinctive classifiers that also work for very small training corpora. Special emphasis is given to the selection of intelligent high-level features. We then investigate the correlation between genre and micro-noise. Using special error dictionaries, we estimate the typical error rates for each genre. Finally, we test whether the error rate of a document represents a useful feature for genre classification.
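As a toy example of using micro-noise for classification, a document's error rate can be turned into a feature vector for any off-the-shelf classifier; the error-dictionary lookup and the second feature are illustrative assumptions, not the paper's feature set.

```python
import re

def micro_noise_features(text, error_lexicon):
    """Turn a document into a tiny feature vector for a genre
    classifier: its error rate against an error dictionary plus a
    type/token ratio (the second feature is an illustrative extra)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return [0.0, 0.0]
    error_rate = sum(t in error_lexicon for t in tokens) / len(tokens)
    return [error_rate, len(set(tokens)) / len(tokens)]
```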


ACM Transactions on Speech and Language Processing | 2007

Adaptive text correction with Web-crawled domain-dependent dictionaries

Christoph Ringlstetter; Klaus U. Schulz; Stoyan Mihov

For the success of lexical text correction, high coverage of the underlying background dictionary is crucial. Still, most correction tools are built on top of static dictionaries that represent fixed collections of expressions of a given language. When treating texts from specific domains and areas, a significant part of the vocabulary is often missed. In this situation, both automated and interactive correction systems produce suboptimal results. In this article, we describe strategies for crawling Web pages that fit the thematic domain of the given input text. Special filtering techniques are introduced to avoid pages with many orthographic errors. By collecting the vocabulary of filtered pages that matches the vocabulary of the input text, dynamic dictionaries of modest size are obtained that reach excellent coverage values. A tool has been developed that automatically builds such dictionaries from crawled pages. Our correction experiments with crawled dictionaries, which address English and German document collections from a variety of thematic fields, show that with these dictionaries even the error rate of highly accurate texts can be reduced using completely automated correction methods. For interactive text correction, more sensible candidate sets for correcting erroneous words are obtained and the manual effort is significantly reduced. To complete the picture, we study the effect of using word trigram models for correction. Again, trigram models from crawled corpora outperform those obtained from static corpora.
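The trigram experiments rest on a standard idea: score each correction candidate in its left context with a word-trigram model trained on the crawled corpus. Below is a toy add-one-smoothed version, not the article's actual model.

```python
from collections import Counter

class TrigramModel:
    """Word-trigram model with add-one smoothing; a toy stand-in for
    the models trained on crawled corpora in the article."""
    def __init__(self, sentences):          # sentences: lists of tokens
        self.tri, self.bi = Counter(), Counter()
        vocab = set()
        for s in sentences:
            vocab.update(s)
            toks = ["<s>", "<s>"] + s + ["</s>"]
            for i in range(2, len(toks)):
                self.tri[tuple(toks[i - 2:i + 1])] += 1
                self.bi[tuple(toks[i - 2:i])] += 1
        self.v = len(vocab) + 2

    def prob(self, w1, w2, w3):
        return (self.tri[(w1, w2, w3)] + 1) / (self.bi[(w1, w2)] + self.v)

def best_candidate(model, left2, left1, candidates):
    """Pick the candidate the model prefers in its left context."""
    return max(candidates, key=lambda c: model.prob(left2, left1, c))
```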


International Conference on Document Analysis and Recognition | 2005

The same is not the same - postcorrection of alphabet confusion errors in mixed-alphabet OCR recognition

Christoph Ringlstetter; Klaus U. Schulz; Stoyan Mihov; Katerina Louka

Character sets for Eastern European languages typically contain symbols that are optically almost or fully identical to Latin letters. When scanning documents with mixed Cyrillic-Latin or Greek-Latin alphabets, even high-quality OCR software often fails to correctly distinguish between Cyrillic (Greek) and Latin symbols. This effect leads to an error rate far beyond the usual error rates observed when recognizing single-alphabet documents. In this paper we first survey similarities between Latin and Cyrillic (Greek) letters and words for distinct languages and fonts. After briefly introducing a new public corpus, collected by our groups for evaluating OCR technology on mixed-alphabet documents, we describe how to adapt general algorithms and tools for postcorrection of OCR results to the new context of mixed-alphabet recognition. Experimental results on Bulgarian documents from the corpus and from other sources demonstrate that a drastic reduction of error rates can be achieved.
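The core of alphabet-confusion postcorrection can be sketched as homoglyph normalization: force a mixed-alphabet token into one alphabet and accept the reading that is a known word. The homoglyph table below is a small illustrative subset, not the paper's full survey.

```python
# A few Cyrillic letters that are optically identical to their Latin
# counterparts (illustrative subset).
CYR_TO_LAT = str.maketrans("АВЕКМНОРСТХаеорсух", "ABEKMHOPCTXaeopcyx")
LAT_TO_CYR = str.maketrans("ABEKMHOPCTXaeopcyx", "АВЕКМНОРСТХаеорсух")

def fix_alphabet(token, latin_lexicon, cyrillic_lexicon):
    """If forcing a mixed-alphabet OCR token into one alphabet yields
    a known word, return that reading; otherwise leave the token for
    downstream correction."""
    as_lat = token.translate(CYR_TO_LAT)
    as_cyr = token.translate(LAT_TO_CYR)
    if as_lat.lower() in latin_lexicon:
        return as_lat
    if as_cyr.lower() in cyrillic_lexicon:
        return as_cyr
    return token
```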

Collaboration


Dive into Christoph Ringlstetter's collaborations.

Top Co-Authors

Stoyan Mihov (Bulgarian Academy of Sciences)
Ying Xu (University of Alberta)
Grzegorz Kondrak (Ludwig Maximilian University of Munich)
Luanne Freund (University of British Columbia)