Featured Researches

Computation And Language

Automatically Creating Bilingual Lexicons for Machine Translation from Bilingual Text

A method is presented for automatically augmenting the bilingual lexicon of an existing Machine Translation system, by extracting bilingual entries from aligned bilingual text. The proposed method only relies on the resources already available in the MT system itself. It is based on the use of bilingual lexical templates to match the terminal symbols in the parses of the aligned sentences.

Read more
Computation And Language

Automating Coreference: The Role of Annotated Training Data

We report here on a study of interannotator agreement in the coreference task as defined by the Message Understanding Conference (MUC-6 and MUC-7). Based on feedback from annotators, we clarified and simplified the annotation specification. We then performed an analysis of disagreement among several annotators, concluding that only 16% of the disagreements represented genuine disagreement about coreference; the remainder of the cases were mostly typographical errors or omissions, easily reconciled. Initially, we measured interannotator agreement in the low 80s for precision and recall. To try to improve upon this, we ran several experiments. In our final experiment, we separated the tagging of candidate noun phrases from the linking of actual coreferring expressions. This method shows promise - interannotator agreement climbed to the low 90s - but it needs more extensive validation. These results position the research community to broaden the coreference task to multiple languages, and possibly to different kinds of coreference.

Read more
Computation And Language

Bayesian Stratified Sampling to Assess Corpus Utility

This paper describes a method for asking statistical questions about a large text corpus. We exemplify the method by addressing the question, "What percentage of Federal Register documents are real documents, of possible interest to a text researcher or analyst?" We estimate an answer to this question by evaluating 200 documents selected from a corpus of 45,820 Federal Register documents. Stratified sampling is used to reduce the sampling uncertainty of the estimate from over 3100 documents to fewer than 1000. The stratification is based on observed characteristics of real documents, while the sampling procedure incorporates a Bayesian version of Neyman allocation. A possible application of the method is to establish baseline statistics used to estimate recall rates for information retrieval systems.

Read more
Computation And Language

Best-First Surface Realization

Current work in surface realization concentrates on the use of general, abstract algorithms that interpret large, reversible grammars. Only little attention has been paid so far to the many small and simple applications that require coverage of a small sublanguage at different degrees of sophistication. The system TG/2 described in this paper can be smoothly integrated with deep generation processes, it integrates canned text, templates, and context-free rules into a single formalism, it allows for both textual and tabular output, and it can be parameterized according to linguistic preferences. These features are based on suitably restricted production system techniques and on a generic backtracking regime.

Read more
Computation And Language

Better Language Models with Model Merging

This paper investigates model merging, a technique for deriving Markov models from text or speech corpora. Models are derived by starting with a large and specific model and by successively combining states to build smaller and more general models. We present methods to reduce the time complexity of the algorithm and report on experiments on deriving language models for a speech recognition task. The experiments show the advantage of model merging over the standard bigram approach. The merged model assigns a lower perplexity to the test set and uses considerably fewer states.

Read more
Computation And Language

Beyond Word N-Grams

We describe, analyze, and evaluate experimentally a new probabilistic model for word-sequence prediction in natural language based on prediction suffix trees (PSTs). By using efficient data structures, we extend the notion of PST to unbounded vocabularies. We also show how to use a Bayesian approach based on recursive priors over all possible PSTs to efficiently maintain tree mixtures. These mixtures have provably and practically better performance than almost any single model. We evaluate the model on several corpora. The low perplexity achieved by relatively small PST mixture models suggests that they may be an advantageous alternative, both theoretically and practically, to the widely used n-gram models.

Read more
Computation And Language

Bi-Lexical Rules for Multi-Lexeme Translation in Lexicalist MT

The paper presents a prototype lexicalist Machine Translation system (based on the so-called `Shake-and-Bake' approach of Whitelock (1992) consisting of an analysis component, a dynamic bilingual lexicon, and a generation component, and shows how it is applied to a range of MT problems. Multi-Lexeme translations are handled through bi-lexical rules which map bilingual lexical signs into new bilingual lexical signs. It is argued that much translation can be handled by equating translationally equivalent lists of lexical signs, either directly in the bilingual lexicon, or by deriving them through bi-lexical rules. Lexical semantic information organized as Qualia structures (Pustejovsky 1991) is used as a mechanism for restricting the domain of the rules.

Read more
Computation And Language

Bridging as Coercive Accommodation

In this paper we discuss the notion of "bridging" in Discourse Representation Theory as a tool to account for discourse referents that have only been established implicitly, through the lexical semantics of other referents. In doing so, we use ideas from Generative Lexicon theory, to introduce antecedents for anaphoric expressions that cannot be "linked" to a proper antecedent, but that do not need to be "accommodated" because they have some connection to the network of discourse referents that is already established.

Read more
Computation And Language

Building Accurate Semantic Taxonomies from Monolingual MRDs

This paper presents a method that combines a set of unsupervised algorithms in order to accurately build large taxonomies from any machine-readable dictionary (MRD). Our aim is to profit from conventional MRDs, with no explicit semantic coding. We propose a system that 1) performs fully automatic exraction of taxonomic links from MRD entries and 2) ranks the extracted relations in a way that selective manual refinement is allowed. Tested accuracy can reach around 100% depending on the degree of coverage selected, showing that taxonomy building is not limited to structured dictionaries such as LDOCE.

Read more
Computation And Language

Building Knowledge Bases for the Generation of Software Documentation

Automated text generation requires a underlying knowledge base from which to generate, which is often difficult to produce. Software documentation is one domain in which parts of this knowledge base may be derived automatically. In this paper, we describe \drafter, an authoring support tool for generating user-centred software documentation, and in particular, we describe how parts of its required knowledge base can be obtained automatically.

Read more

Ready to get started?

Join us today