Petar Mitankin | Researchain

Publication

Featured research published by Petar Mitankin.

International Conference on Management of Data | 2014

State-of-the-art in string similarity search and join

Sebastian Wandelt; Dong Deng; Stefan Gerdjikov; Shashwat Mishra; Petar Mitankin; Manish Patil; Enrico Siragusa; Alexander Tiskin; Wei Wang; Jiaying Wang; Ulf Leser

String similarity search and its variants are fundamental problems with many applications in areas such as data integration, data quality, computational linguistics, and bioinformatics. A plethora of methods have been developed over the last decades. Obtaining an overview of the state of the art in this field is difficult: results are published in various domains without much cross-talk, papers use different data sets and often study subtle variations of the core problems, and the sheer number of proposed methods exceeds the capacity of a single research group. In this paper, we report on the results of what is probably the largest benchmark ever performed in this field. To overcome the resource bottleneck, we organized the benchmark as an international competition, held as a workshop at EDBT/ICDT 2013. Various teams from different fields and from all over the world developed or tuned programs for two crisply defined problems. All algorithms were evaluated by an external group on two machines. Altogether, we compared 14 different programs on two string matching problems (k-approximate search and k-approximate join) using data sets of increasing sizes and with different characteristics from two different domains. We compare programs primarily by wall clock time, but also provide results on memory usage, indexing time, batch query effects and scalability in terms of CPU cores. Results were averaged over several runs and confirmed on a second, different hardware platform. A particularly interesting observation is that disciplines can and should learn more from each other, with the three best teams rooted in computational linguistics, databases, and bioinformatics, respectively.
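The two benchmark problems can be illustrated with a brute-force baseline. The sketch below assumes that similarity is measured by plain edit (Levenshtein) distance, which the abstract does not state explicitly; the function names are hypothetical, and the competing programs presumably used indexes and filters rather than exhaustive comparison.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def k_approximate_search(query: str, strings: list[str], k: int) -> list[str]:
    """Return every string within edit distance k of the query."""
    return [s for s in strings if edit_distance(query, s) <= k]


def k_approximate_join(strings: list[str], k: int) -> list[tuple[str, str]]:
    """Return every unordered pair of strings within edit distance k."""
    pairs = []
    for i in range(len(strings)):
        for j in range(i + 1, len(strings)):
            if edit_distance(strings[i], strings[j]) <= k:
                pairs.append((strings[i], strings[j]))
    return pairs


if __name__ == "__main__":
    cities = ["berlin", "bern", "merlin"]
    print(k_approximate_search("berlin", cities, 1))
    print(k_approximate_join(cities, 1))
```

The join compares all pairs and therefore scales quadratically, which is the kind of cost a benchmark measuring wall clock time on growing data sets would expose.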

International Conference on Document Analysis and Recognition | 2007

Fast Selection of Small and Precise Candidate Sets from Dictionaries for Text Correction Tasks

Stoyan Mihov; Petar Mitankin; Klaus U. Schulz

Lexical text correction relies on a central step where approximate search in a dictionary is used to select the best correction suggestions for an ill-formed input token. In previous work we introduced the concept of a universal Levenshtein automaton and showed how to use these automata for efficiently selecting from a dictionary all entries within a fixed Levenshtein distance to the garbled input word. In this paper we look at refinements of the basic Levenshtein distance that yield more sensible notions of similarity in distinct text correction applications, e.g. OCR. We show that the concept of a universal Levenshtein automaton can be adapted to these refinements. In this way we obtain a method for selecting correction candidates that is very efficient and yields small candidate sets with high recall.
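One way to picture a refined distance is a dynamic program in which substitutions are restricted to known confusions and two extra operations, merges and splits, are allowed. This is only a sketch under assumptions: the abstract does not say which refinements are used (substitutions, merges and splits are named in the error-profiling abstract further down this page), the operation sets and function names are hypothetical, and the code is a plain dynamic program rather than the automaton-based method the paper describes.

```python
def refined_distance(word, garbled, subs=frozenset(), merges=frozenset(),
                     splits=frozenset()):
    """Cost of turning the dictionary `word` into the `garbled` token.

    Assumed operations, each of cost 1:
      - insertion or deletion of any character
      - substitution (a, b), only if the pair is listed in `subs`
      - merge ("xy", "z"): two word characters read as one, listed in `merges`
      - split ("x", "yz"): one word character read as two, listed in `splits`
    """
    rows, cols = len(word) + 1, len(garbled) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            if i == 0 and j == 0:
                continue
            options = []
            if i > 0:
                options.append(d[i - 1][j] + 1)      # delete word[i-1]
            if j > 0:
                options.append(d[i][j - 1] + 1)      # insert garbled[j-1]
            if i > 0 and j > 0:
                a, b = word[i - 1], garbled[j - 1]
                if a == b:
                    options.append(d[i - 1][j - 1])  # exact match
                elif (a, b) in subs:
                    options.append(d[i - 1][j - 1] + 1)
            if i > 1 and j > 0 and (word[i - 2:i], garbled[j - 1]) in merges:
                options.append(d[i - 2][j - 1] + 1)
            if i > 0 and j > 1 and (word[i - 1], garbled[j - 2:j]) in splits:
                options.append(d[i - 1][j - 2] + 1)
            d[i][j] = min(options)
    return d[-1][-1]


def select_candidates(garbled, dictionary, bound, **ops):
    """Dictionary entries whose refined distance to the token is <= bound."""
    return [w for w in dictionary if refined_distance(w, garbled, **ops) <= bound]


if __name__ == "__main__":
    ocr_merges = {("rn", "m")}  # hypothetical OCR confusion: "rn" read as "m"
    print(select_candidates("modem", ["modern", "model", "mode"], 1,
                            merges=ocr_merges))
```

With the hypothetical merge in place, "modern" is one operation away from "modem", while "model" drops out because the substitution of "l" by "m" is not among the listed confusions. This shows how a refined distance can shrink the candidate set without losing the plausible correction.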

Theoretical Computer Science | 2011

Deciding word neighborhood with universal neighborhood automata

Petar Mitankin; Stoyan Mihov; Klaus U. Schulz

Given some form of distance between words, a fundamental operation is to decide whether the distance between two given words w and v is within a given bound. In earlier work, we introduced the concept of a universal Levenshtein automaton for a given distance bound n. This deterministic automaton takes as input a sequence of bitvectors computed from w and v. The sequence is accepted iff the Levenshtein distance between w and v does not exceed n. The automaton is called universal since the same automaton can be used for arbitrary input words w and v, regardless of the underlying input alphabet. Here, we extend this picture. After introducing a large abstract family of generalized word distances, we exactly characterize those members where word neighborhood can be decided using universal neighborhood automata similar to universal Levenshtein automata. Our theoretical results establish several bridges to the theory of synchronized finite-state transducers and dynamic programming. For small neighborhood bounds, universal neighborhood automata can be held in main memory. This leads to very efficient algorithms for the above decision problem. Evaluation results show that these algorithms are much faster than those based on dynamic programming.
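For orientation, the decision problem and the dynamic programming baseline mentioned in the abstract can be sketched as follows. The banded table is a standard cut-off technique, not the paper's automaton construction. The helper `characteristic_vector` only illustrates the kind of bitvector such an automaton could consume; how the windows over v are chosen is not given in the abstract and is left as an assumption.

```python
def within_distance(w: str, v: str, n: int) -> bool:
    """Decide whether the Levenshtein distance between w and v is at most n.

    Only cells within n of the main diagonal are filled; all values are
    capped at n + 1, since anything larger cannot change the answer.
    """
    if abs(len(w) - len(v)) > n:
        return False
    inf = n + 1
    prev = [j if j <= n else inf for j in range(len(v) + 1)]
    for i in range(1, len(w) + 1):
        curr = [inf] * (len(v) + 1)
        if i <= n:
            curr[0] = i
        for j in range(max(1, i - n), min(len(v), i + n) + 1):
            cost = 0 if w[i - 1] == v[j - 1] else 1
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost, inf)
        if min(curr) > n:  # no cell in this row can lead to acceptance
            return False
        prev = curr
    return prev[len(v)] <= n


def characteristic_vector(symbol: str, window: str) -> list[int]:
    """Bit j is 1 iff window[j] equals symbol (window choice is an assumption)."""
    return [1 if c == symbol else 0 for c in window]


if __name__ == "__main__":
    print(within_distance("kitten", "sitting", 3))
    print(characteristic_vector("a", "banana"))
```

The bitvector records only where a symbol matches, not which symbol it is, which is one way to see why an automaton reading such vectors can be independent of the input alphabet.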

Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage | 2014

An approach to unsupervised historical text normalisation

Petar Mitankin; Stefan Gerdjikov; Stoyan Mihov

We present a novel approach to unsupervised noisy text correction. Our approach is based on automatic extraction of historical variation patterns by analysing the structure of the words from a historical corpus and comparing it with the structure of the contemporary dictionary. Based on the extracted variation patterns the core candidate generator, REBELS, produces correction candidates even outside the modern dictionary. Further, the sentence correction is complemented with a modern language model combined in a log-linear model. The quality of our unsupervised approach is empirically compared against a supervised system competitive with the state-of-the-art supervised text normalisation systems. The experiments show that our system delivers 81.79% normalisation accuracy of 17th century English historical texts in a fully unsupervised setup.
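The log-linear combination mentioned in the abstract can be pictured as a weighted sum of log-scores per candidate. The sketch below is an assumption-laden illustration: the feature names (`channel`, `lm`), the weights and all numbers are made up, and nothing here reflects the actual REBELS interface or the features the authors used.

```python
import math


def log_linear_score(features: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of feature values (here: log-probabilities)."""
    return sum(weights[name] * value for name, value in features.items())


def best_candidate(candidates, weights):
    """Pick the (word, features) pair with the highest log-linear score."""
    return max(candidates, key=lambda c: log_linear_score(c[1], weights))[0]


if __name__ == "__main__":
    # Hypothetical candidates for the historical spelling "vertue".
    candidates = [
        ("virtue", {"channel": math.log(0.6), "lm": math.log(0.3)}),
        ("vertu", {"channel": math.log(0.3), "lm": math.log(0.01)}),
    ]
    weights = {"channel": 1.0, "lm": 1.0}  # made-up weights
    print(best_candidate(candidates, weights))
```

Changing the weights shifts the balance between the candidate generator and the language model, which is the main design lever a log-linear model offers.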

Australasian Joint Conference on Artificial Intelligence | 2007

Using automated error profiling of texts for improved selection of correction candidates for garbled tokens

Stoyan Mihov; Petar Mitankin; Annette Gotscharek; Ulrich Reffle; Klaus U. Schulz; Christoph Ringlstetter

Lexical text correction systems are typically based on a central step: when finding a malformed token in the input text, a set of correction candidates for the token is retrieved from the given background dictionary. In previous work we introduced a method for the selection of correction candidates which is fast and leads to small candidate sets with high recall. As a prerequisite, ground truth data were used to find a set of important substitutions, merges and splits that represent characteristic errors found in the text. This prior knowledge was then used to fine-tune the meaningful selection of correction candidates. Here we show that an appropriate set of possible substitutions, merges and splits for the input text can be retrieved without any ground truth data. In the new approach, we compute an error profile of the erroneous input text in a fully automated way, using so-called error dictionaries. From this profile, suitable sets of substitutions, merges and splits are derived. Error profiling with error dictionaries is simple and very fast. As an overall result we obtain an adaptive form of candidate selection which is very efficient, does not need ground truth data and leads to small candidate sets with high recall.

Document Analysis Systems | 2014

Flexible Noisy Text Correction

Andrey Sariev; Vladislav Nenchev; Stefan Gerdjikov; Petar Mitankin; Hristo Ganchev; Stoyan Mihov; Tinko Tinchev

We present a new general and language independent approach to the noisy text correction problem developed and implemented in the framework of the CULTURA project. We briefly describe the core candidate generator, REBELS, the complete system concept, its efficient implementation based on functional automata and its immediate applications. The quality of the whole system is empirically established in different experimental settings where language and noise sources are varied.

Archive | 2016

A New Method for Real-Time Lattice Rescoring in Speech Recognition

Petar Mitankin; Stoyan Mihov

We introduce a novel, efficient method that improves the performance of speech recognition systems by providing the option to partially compile the word lattice into a deterministic finite-state automaton, making it suitable for the rescoring step in the speech recognition process. In contrast to the widely used n-best method, our method permits the consideration of a significantly larger number of alternatives within the same time constraint and thus provides better recognition results. In this paper we present a description of the new method and an empirical evaluation of its performance in comparison with the n-best method. The achieved WER reduction is up to 3.77% at a p-value below 3%. An important advantage of our method is its suitability for real-time applications.

DH | 2013