Publication


Featured research published by Stoyan Mihov.


Finite State Methods and Natural Language Processing | 2000

Incremental construction of minimal acyclic finite-state automata

Jan Daciuk; Bruce W. Watson; Stoyan Mihov; Richard E. Watson

In this paper, we describe a new method for constructing minimal, deterministic, acyclic finite-state automata from a set of strings. Traditional methods consist of two phases: the first to construct a trie, the second one to minimize it. Our approach is to construct a minimal automaton in a single phase by adding new strings one by one and minimizing the resulting automaton on-the-fly. We present a general algorithm as well as a specialization that relies upon the lexicographical ordering of the input strings. Our method is fast and significantly lowers memory requirements in comparison to other methods.
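The sorted-input specialization mentioned in the abstract can be sketched compactly. The following Python fragment is a simplified illustration, not the authors' reference implementation: words are inserted in lexicographic order, only the path of the most recent word is left unminimized, and a register of state signatures merges equivalent states on the fly.

```python
# A minimal sketch (assumed simplification) of incremental construction of a
# minimal acyclic DFA from a lexicographically sorted word list.

class State:
    __slots__ = ("edges", "final")
    def __init__(self):
        self.edges = {}      # label -> State
        self.final = False

    def signature(self):
        # Right-language signature used to detect equivalent states.
        return (self.final, tuple(sorted((c, id(s)) for c, s in self.edges.items())))

class IncrementalDAWG:
    def __init__(self):
        self.root = State()
        self.register = {}        # signature -> registered State
        self.previous_word = ""
        self.unchecked = []       # (parent, label, child) path of the last word

    def insert(self, word):
        # Words must arrive in strict lexicographic order for this variant.
        assert word > self.previous_word or not self.previous_word
        # 1. Length of the common prefix with the previous word.
        common = 0
        while (common < len(word) and common < len(self.previous_word)
               and word[common] == self.previous_word[common]):
            common += 1
        # 2. Minimize the states of the previous suffix that cannot change any more.
        self._replace_or_register(common)
        # 3. Append the new suffix as a fresh chain of states.
        node = self.unchecked[-1][2] if self.unchecked else self.root
        for ch in word[common:]:
            nxt = State()
            node.edges[ch] = nxt
            self.unchecked.append((node, ch, nxt))
            node = nxt
        node.final = True
        self.previous_word = word

    def finish(self):
        self._replace_or_register(0)

    def _replace_or_register(self, down_to):
        while len(self.unchecked) > down_to:
            parent, label, child = self.unchecked.pop()
            sig = child.signature()
            if sig in self.register:
                parent.edges[label] = self.register[sig]   # reuse equivalent state
            else:
                self.register[sig] = child

    def accepts(self, word):
        node = self.root
        for ch in word:
            node = node.edges.get(ch)
            if node is None:
                return False
        return node.final

# Usage: build from a sorted word list, then query membership.
dawg = IncrementalDAWG()
for w in ["tap", "taps", "top", "tops"]:
    dawg.insert(w)
dawg.finish()
assert dawg.accepts("tops") and not dawg.accepts("to")
```

Because at most one word's path is ever unregistered, peak memory stays close to the size of the final minimal automaton rather than the size of a full trie, which is the saving the abstract points to.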


International Journal on Document Analysis and Recognition | 2002

Fast string correction with Levenshtein automata

Klaus U. Schulz; Stoyan Mihov

The Levenshtein distance between two words is the minimal number of insertions, deletions or substitutions that are needed to transform one word into the other. Levenshtein automata of degree n for a word W are defined as finite state automata that recognize the set of all words V where the Levenshtein distance between V and W does not exceed n. We show how to compute, for any fixed bound n and any input word W, a deterministic Levenshtein automaton of degree n for W in time linear in the length of W. Given an electronic dictionary that is implemented in the form of a trie or a finite state automaton, the Levenshtein automaton for W can be used to control search in the lexicon in such a way that exactly the lexical words V are generated where the Levenshtein distance between V and W does not exceed the given bound. This leads to a very fast method for correcting corrupted input words of unrestricted text using large electronic dictionaries. We then introduce a second method that avoids the explicit computation of Levenshtein automata and leads to even better efficiency. Evaluation results are given that also address variants of both methods that are based on modified Levenshtein distances where further primitive edit operations (transpositions, merges and splits) are used.
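As a point of reference for what such an automaton recognizes, the check below answers the same membership question ("is the Levenshtein distance between V and W at most n?") with an ordinary dynamic-programming simulation and an early cut-off. It is an illustrative sketch only and does not reproduce the paper's linear-time deterministic construction.

```python
def within_distance(w, v, n):
    """Return True iff the Levenshtein distance between w and v is <= n."""
    # row[j] = distance between the current prefix of w and v[:j]
    row = list(range(len(v) + 1))
    for i, wc in enumerate(w, start=1):
        new_row = [i]
        for j, vc in enumerate(v, start=1):
            new_row.append(min(
                row[j] + 1,                # delete wc
                new_row[j - 1] + 1,        # insert vc
                row[j - 1] + (wc != vc),   # substitute / match
            ))
        if min(new_row) > n:               # no completion can get back under n
            return False
        row = new_row
    return row[-1] <= n

# Words accepted by a degree-1 Levenshtein automaton for "hello":
assert within_distance("hello", "helo", 1)       # one deletion
assert within_distance("hello", "hallo", 1)      # one substitution
assert not within_distance("hello", "hzllz", 1)  # two substitutions
```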


Computational Linguistics | 2004

Fast Approximate Search in Large Dictionaries

Stoyan Mihov; Klaus U. Schulz

The need to correct garbled strings arises in many areas of natural language processing. If a dictionary is available that covers all possible input tokens, a natural set of candidates for correcting an erroneous input P is the set of all words in the dictionary for which the Levenshtein distance to P does not exceed a given (small) bound k. In this article we describe methods for efficiently selecting such candidate sets. After introducing as a starting point a basic correction method based on the concept of a universal Levenshtein automaton, we show how two filtering methods known from the field of approximate text search can be used to improve the basic procedure in a significant way. The first method, which uses standard dictionaries plus dictionaries with reversed words, leads to very short correction times for most classes of input strings. Our evaluation results demonstrate that correction times for fixed-distance bounds depend on the expected number of correction candidates, which decreases for longer input words. Similarly, the choice of an optimal filtering method depends on the length of the input words.
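To make the notion of a candidate set concrete, here is a simplified sketch of the basic selection step for a trie-shaped dictionary: a Levenshtein DP row is carried along each trie path and a branch is pruned as soon as no extension can stay within the bound k. The word list and bound below are invented for illustration; the paper's universal Levenshtein automaton and filtering methods replace this per-node row computation, but the selected candidate set is the same.

```python
def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = True                       # end-of-word marker
    return root

def candidates(trie, p, k):
    """All dictionary words whose Levenshtein distance to p is at most k."""
    results = []
    first_row = list(range(len(p) + 1))

    def walk(node, prefix, row):
        if "$" in node and row[-1] <= k:
            results.append((prefix, row[-1]))
        for ch, child in node.items():
            if ch == "$":
                continue
            new_row = [row[0] + 1]
            for j in range(1, len(p) + 1):
                new_row.append(min(row[j] + 1,
                                   new_row[j - 1] + 1,
                                   row[j - 1] + (p[j - 1] != ch)))
            if min(new_row) <= k:              # otherwise prune this branch
                walk(child, prefix + ch, new_row)

    walk(trie, "", first_row)
    return sorted(results, key=lambda x: x[1])

trie = build_trie(["found", "sound", "round", "rounds", "founder"])
print(candidates(trie, "gound", 1))   # [('found', 1), ('sound', 1), ('round', 1)]
```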


Computational Linguistics | 2006

Orthographic Errors in Web Pages: Toward Cleaner Web Corpora

Christoph Ringlstetter; Klaus U. Schulz; Stoyan Mihov

Since the Web by far represents the largest public repository of natural language texts, recent experiments, methods, and tools in the area of corpus linguistics often use the Web as a corpus. For applications where high accuracy is crucial, the problem has to be faced that a non-negligible number of orthographic and grammatical errors occur in Web documents. In this article we investigate the distribution of orthographic errors of various types in Web pages. As a by-product, methods are developed for efficiently detecting erroneous pages and for marking orthographic errors in acceptable Web documents, thus reducing the number of errors in corpora and linguistic knowledge bases automatically retrieved from the Web.


International Conference on Document Analysis and Recognition | 2003

Lexical postcorrection of OCR-results: the web as a dynamic secondary dictionary?

Christian M. Strohmaier; Christoph Ringlstetter; Klaus U. Schulz; Stoyan Mihov

Postcorrection of OCR-results for text documents is usually based on electronic dictionaries. When scanning texts from a specific thematic area, conventional dictionaries often miss a considerable number of tokens. Furthermore, if word frequencies are stored with the entries, these frequencies will not properly reflect the frequencies found in the given thematic area. Correction adequacy suffers from these two shortcomings. We report on a series of experiments where we compare (1) the use of fixed, static large-scale dictionaries (including proper names and abbreviations) with (2) the use of dynamic dictionaries retrieved via an automated analysis of the vocabulary of web pages from a given domain, and (3) the use of mixed dictionaries. Our experiments, which address English and German document collections from a variety of fields, show that dynamic dictionaries of the above mentioned form can improve the coverage for the given thematic area in a significant way and help to improve the quality of lexical postcorrection methods.


International Conference on Document Analysis and Recognition | 2005

A corpus for comparative evaluation of OCR software and postcorrection techniques

Stoyan Mihov; Klaus U. Schulz; Christoph Ringlstetter; V. Dojchinova; V. Nakova; K. Kalpakchieva; O. Gerasimov; A. Gotscharek; C. Gercke

We describe a new corpus collected for comparative evaluation of OCR software and postcorrection techniques. The corpus is freely available for academic use. The major part of the corpus (2306 files) consists of Bulgarian documents. Many of these documents contain both Cyrillic and Latin symbols. A smaller corpus with German documents has been added. All original documents represent real-life paper documents collected from enterprises and organizations. Most genres of written language and various document types are covered. The corpus contains the corresponding image files, rich meta-data, textual files obtained via OCR, ground truth data for hundreds of example pages, and alignment software for experiments.


Computer Vision and Pattern Recognition | 2003

A visual and interactive tool for optimizing lexical postcorrection of OCR results

Christian M. Strohmaier; Christoph Ringlstetter; Klaus U. Schulz; Stoyan Mihov

Systems for postcorrection of OCR-results can be fine-tuned and adapted to new recognition tasks in many respects. One issue is the selection and adaptation of a suitable background dictionary. Another issue is the choice of a correction model, which includes, among other decisions, the selection of an appropriate distance measure for strings and the choice of a scoring function for ranking distinct correction alternatives. When combining the results obtained from distinct OCR engines, further parameters have to be fixed. Due to all these degrees of freedom, adaptation and fine-tuning of systems for lexical postcorrection is a difficult process. Here we describe a visual and interactive tool that semi-automates the generation of ground truth data, partially automates the adjustment of parameters, offers active support for error analysis, and thus helps to find correction strategies that lead to high accuracy with realistic effort.


ACM Transactions on Speech and Language Processing | 2007

Adaptive text correction with Web-crawled domain-dependent dictionaries

Christoph Ringlstetter; Klaus U. Schulz; Stoyan Mihov

For the success of lexical text correction, high coverage of the underlying background dictionary is crucial. Still, most correction tools are built on top of static dictionaries that represent fixed collections of expressions of a given language. When treating texts from specific domains and areas, often a significant part of the vocabulary is missed. In this situation, both automated and interactive correction systems produce suboptimal results. In this article, we describe strategies for crawling Web pages that fit the thematic domain of the given input text. Special filtering techniques are introduced to avoid pages with many orthographic errors. Collecting the vocabulary of filtered pages that meet the vocabulary of the input text, dynamic dictionaries of modest size are obtained that reach excellent coverage values. A tool has been developed that automatically crawls dictionaries in the indicated way. Our correction experiments with crawled dictionaries, which address English and German document collections from a variety of thematic fields, show that with these dictionaries even the error rate of highly accurate texts can be reduced, using completely automated correction methods. For interactive text correction, more sensible candidate sets for correcting erroneous words are obtained and the manual effort is reduced in a significant way. To complete this picture, we study the effect when using word trigram models for correction. Again, trigram models from crawled corpora outperform those obtained from static corpora.


International Conference on Document Analysis and Recognition | 2007

Fast Selection of Small and Precise Candidate Sets from Dictionaries for Text Correction Tasks

Stoyan Mihov; Petar Mitankin; Klaus U. Schulz

Lexical text correction relies on a central step where approximate search in a dictionary is used to select the best correction suggestions for an ill-formed input token. In previous work we introduced the concept of a universal Levenshtein automaton and showed how to use these automata for efficiently selecting from a dictionary all entries within a fixed Levenshtein distance to the garbled input word. In this paper we look at refinements of the basic Levenshtein distance that yield more sensible notions of similarity in distinct text correction applications, e.g. OCR. We show that the concept of a universal Levenshtein automaton can be adapted to these refinements. In this way we obtain a method for selecting correction candidates which is very efficient while still producing small candidate sets with high recall.
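One refinement of this kind, already mentioned for the earlier method above, treats the transposition of two adjacent characters as a single primitive edit. As an illustration (under the assumption that this is representative of the refinements meant here), the plain dynamic-programming version of that distance, often called the restricted Damerau-Levenshtein or optimal string alignment distance, is sketched below; the paper's adaptation of the universal automaton to such distances is not shown.

```python
def osa_distance(a, b):
    # Restricted Damerau-Levenshtein ("optimal string alignment") distance:
    # a transposition of two adjacent characters counts as one edit.
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]

# "form" vs "from": one transposition instead of two substitutions.
assert osa_distance("form", "from") == 1
```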


International Conference on Implementation and Application of Automata | 2000

Direct Construction of Minimal Acyclic Subsequential Transducers

Stoyan Mihov; Denis Maurel

This paper presents an algorithm for the direct construction of a minimal acyclic subsequential transducer that represents a finite relation given as a sorted list of words with their outputs. The algorithm constructs the minimal transducer directly, without constructing intermediate tree-like or pseudo-minimal transducers. In NLP applications our algorithm is significantly more efficient than other algorithms that build minimal transducers for large-scale natural language dictionaries. Some experimental comparisons are presented at the end of the paper.
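For readers unfamiliar with the device itself, the toy example below shows what an acyclic subsequential transducer computes: each transition carries an output string, accepting states carry a final output, and the result for an input word is the concatenation of outputs along its unique path. The states and the word-to-output relation are invented for illustration; the construction algorithm described in the paper is not reproduced here.

```python
# transitions: state -> {input symbol: (output string, next state)}
TRANSITIONS = {
    0: {"t": ("T", 1)},
    1: {"h": ("H", 2), "r": ("R", 5)},
    2: {"i": ("", 3)},
    3: {"n": ("", 4), "s": ("S", 4)},
    5: {"a": ("", 6)},
    6: {"y": ("", 4)},
}
FINAL_OUTPUT = {4: ""}     # accepting states and their final outputs

def lookup(word, start=0):
    """Return the output string for word, or None if word is not in the relation."""
    state, out = start, []
    for ch in word:
        edge = TRANSITIONS.get(state, {}).get(ch)
        if edge is None:
            return None
        output, state = edge
        out.append(output)
    if state not in FINAL_OUTPUT:
        return None
    out.append(FINAL_OUTPUT[state])
    return "".join(out)

assert lookup("thin") == "TH"
assert lookup("this") == "THS"
assert lookup("tray") == "TR"
assert lookup("the") is None
```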

Collaboration


Dive into Stoyan Mihov's collaborations.

Top Co-Authors

Petar Mitankin (Bulgarian Academy of Sciences)
Galia Angelova (Bulgarian Academy of Sciences)
Veselka Doychinova (Bulgarian Academy of Sciences)
A. Gotscharek (Bulgarian Academy of Sciences)
C. Gercke (Bulgarian Academy of Sciences)
Elena Paskaleva (Bulgarian Academy of Sciences)