Andrei Mikheev | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Andrei Mikheev is active.

Explore More

Publication

Featured researches published by Andrei Mikheev.

conference of the european chapter of the association for computational linguistics | 1999

Named Entity recognition without gazetteers

Andrei Mikheev; Marc Moens; Claire Grover

It is often claimed that Named Entity recognition systems need extensive gazetteers---lists of names of people, organisations, locations, and other named entities. Indeed, the compilation of such gazetteers is sometimes mentioned as a bottleneck in the design of Named Entity recognition systems.We report on a Named Entity recognition system which combines rule-based grammars with statistical (maximum entropy) models. We report on the systems performance with gazetteers of different types and different sizes, using test material from the MUC-7 competition. We show that, for the text type and task of this competition, it is sufficient to use relatively small gazetteers of well-known names, rather than large gazetteers of low-frequency names. We conclude with observations about the domain independence of the competition and of our experiments.

Computational Linguistics | 2002

Periods, capitalized words, etc.

Andrei Mikheev

In this article we present an approach for tackling three important aspects of text normalization: sentence boundary disambiguation, disambiguation of capitalized words in positions where capitalization is expected, and identification of abbreviations. As opposed to the two dominant techniques of computing statistics or writing specialized grammars, our document-centered approach works by considering suggestive local contexts and repetitions of individual words within a document. This approach proved to be robust to domain shifts and new lexica and produced performance on the level with the highest reported results. When incorporated into a part-of-speech tagger, it helped reduce the error rate significantly on capitalized words and sentence boundaries. We also investigated the portability to other languages and obtained encouraging results.

conference on applied natural language processing | 1997

A Workbench for Finding Structure in Texts

Andrei Mikheev; Steven Finch

In this paper we report on a set of computational tools with (n)SGML pipeline data flow for uncovering internal structure in natural language texts. The main idea behind the workbench is the independence of the text representation and text analysis phases. At the representation phase the text is converted from a sequence of characters to features of interest by means of the annotation tools. At the analysis phase those features are used by statistics gathering and inference tools for finding significant correlations in the texts. The analysis tools are independent of particular assumptions about the nature of the feature-set and work on the abstract level of feature-elements represented as SGML items.

meeting of the association for computational linguistics | 1999

A Knowledge-free Method for Capitalized Word Disambiguation

Andrei Mikheev

In this paper we present an approach to the disambiguation of capitalized words when they are used in the positions where capitalization is expected, such as the first word in a sentence or after a period, quotes, etc.. Such words can act as proper names or can be just capitalized variants of common words. The main feature of our approach is that it uses a minimum of prebuilt resources and tires to dynamically infer the disambiguation clues from the entire document. The approach was thoroughly tested and achieved about 98.5% accuracy on unseen texts from The New York Times 1996 corpus.

international acm sigir conference on research and development in information retrieval | 2000

Document centered approach to text normalization

Andrei Mikheev

In this paper we present an approach to tackle three important problems of text normalization: sentence boundary disambiguation, disambiguation of capitalized words when they are used in positions where capitalization is expected, and identification of abbreviations. The main feature of our approach is that it uses a minimum of pre-built resources, instead dynamically inferring disambiguation clues from the entire document itself. This makes it domain independent, closely targeted to each individual document and portable to other languages. We thoroughly evaluated this approach on several corpora and it showed high accuracy.

meeting of the association for computational linguistics | 1996

Unsupervised Learning of Word-Category Guessing Rules

Andrei Mikheev

Words unknown to the lexicon present a substantial problem to part-of-speech tagging. In this paper we present a technique for fully unsupervised statistical acquisition of rules which guess possible parts-of-speech for unknown words. Three complementary sets of word-guessing rules are induced from the lexicon and a raw corpus: prefix morphological rules, suffix morphological rules and ending-guessing rules. The learning was performed on the Brown Corpus data and rule-sets, with a highly competitive performance, were produced and compared with the state-of-the-art.

meeting of the association for computational linguistics | 1998

Feature Lattices for Maximum Entropy Modelling

Andrei Mikheev

Maximum entropy framework proved to be expressive and powerful for the statistical language modelling, but it suffers from the computational expensiveness of the model building. The iterative scaling algorithm that is used for the parameter estimation is computationally expensive while the feature selection process might require to estimate parameters for many candidate features many times. In this paper we present a novel approach for building maximum entropy models. Our approach uses the feature collocation lattice and builds complex candidate features without resorting to iterative scaling.

conference of the european chapter of the association for computational linguistics | 1995

Towards a workbench for acquisition of domain knowledge from natural language

Andrei Mikheev; Steven Finch

In this paper we describe an architecture and functionality of main components of a workbench for an acquisition of domain knowledge from large text corpora. The workbench supports an incremental process of corpus analysis starting from a rough automatic extraction and organization of lexico-semantic regularities and ending with a computer supported analysis of extracted data and a semiautomatic refinement of obtained hypotheses. For doing this the workbench employs methods from computational linguistics, information retrieval and knowledge engineering. Although the work-bench is currently under implementation some of its components are already implemented and their performance is illustrated with samples from engineering for a medical domain.

Natural Language Engineering | 1995

Russian morphology: An engineering approach

Andrei Mikheev; Liubov Liubushkina

Morphological analysis, which is at the heart of the processing of natural language requires computationally effective morphological processors. In this paper an approach to the organization of an inflectional morphological model and its application for the Russian language are described. The main objective of our morphological processor is not the classification of word constituents, but rather an efficient computational recognition of morpho-syntactic features of words and the generation of words according to requested morpho-syntactic features. Another major concern that the processor aims to address is the ease of extending the lexicon. The templated word-paradigm model used in the system has an engineering flavour: paradigm formation rules are of a bottom-up (word specific) nature rather than general observations about the language, and word formation units are segments of words rather than proper morphemes. This approach allows us to handle uniformly both general cases and exceptions, and requires extremely simple data structures and control mechanisms which can be easily implemented as a finite-state automata. The morphological processor described in this paper is fully implemented for a substantial subset of Russian (more then 1,500,000 word-tokens – 95,000 word paradigms) and provides an extensive list of morpho-syntactic features together with stress positions for words utilized in its lexicon. Special dictionary management tools were built for browsing, debugging and extension of the lexicon. The actual implementation was done in C and C++, and the system is available for the MS-DOS, MS-Windows and UNIX platforms.

international conference on computational linguistics | 1996

Learning part-of-speech guessing rules from lexicon: extension to non-concatenative operations

Andrei Mikheev

One of the problems in part-of-speech tagging of real-word texts is that of unknown to the lexicon words. In (Mikheev, 1996), a technique for fully unsupervised statistical acquisition of rules which guess possible parts-of-speech for unknown words was proposed. One of the over-simplification assumed by this learning technique was the acquisition of morphological rules which obey only simple concatenative regularities of the main word with an affix. In this paper we extend this technique to the non-concatenative cases of suffixation and assess the gain in the performance.

Explore More