Featured Researches

Computation And Language

A Lexicon for Underspecified Semantic Tagging

The paper defends the notion that semantic tagging should be viewed as more than disambiguation between senses. Instead, semantic tagging should be a first step in the interpretation process by assigning each lexical item a representation of all of its systematically related senses, from which further semantic processing steps can derive discourse dependent interpretations. This leads to a new type of semantic lexicon (CoreLex) that supports underspecified semantic tagging through a design based on systematic polysemous classes and a class-based acquisition of lexical knowledge for specific domains.

Read more
Computation And Language

A Linear Observed Time Statistical Parser Based on Maximum Entropy Models

This paper presents a statistical parser for natural language that obtains a parsing accuracy---roughly 87% precision and 86% recall---which surpasses the best previously published results on the Wall St. Journal domain. The parser itself requires very little human intervention, since the information it uses to make parsing decisions is specified in a concise and simple manner, and is combined in a fully automatic way under the maximum entropy framework. The observed running time of the parser on a test sentence is linear with respect to the sentence length. Furthermore, the parser returns several scored parses for a sentence, and this paper shows that a scheme to pick the best parse from the 20 highest scoring parses could yield a dramatically higher accuracy of 93% precision and recall.

Read more
Computation And Language

A Linguistically Interpreted Corpus of German Newspaper Text

In this paper, we report on the development of an annotation scheme and annotation tools for unrestricted German text. Our representation format is based on argument structure, but also permits the extraction of other kinds of representations. We discuss several methodological issues and the analysis of some phenomena. Additional focus is on the tools developed in our project and their applications.

Read more
Computation And Language

A Machine Learning Approach to the Classification of Dialogue Utterances

The purpose of this paper is to present a method for automatic classification of dialogue utterances and the results of applying that method to a corpus. Superficial features of a set of training utterances (which we will call cues) are taken as the basis for finding relevant utterance classes and for extracting rules for assigning these classes to new utterances. Each cue is assumed to partially contribute to the communicative function of an utterance. Instead of relying on subjective judgments for the tasks of finding classes and rules, we opt for using machine learning techniques to guarantee objectivity.

Read more
Computation And Language

A Matching Technique in Example-Based Machine Translation

This paper addresses an important problem in Example-Based Machine Translation (EBMT), namely how to measure similarity between a sentence fragment and a set of stored examples. A new method is proposed that measures similarity according to both surface structure and content. A second contribution is the use of clustering to make retrieval of the best matching example from the database more efficient. Results on a large number of test cases from the CELEX database are presented.

Read more
Computation And Language

A Maximum Entropy Approach to Identifying Sentence Boundaries

We present a trainable model for identifying sentence boundaries in raw text. Given a corpus annotated with sentence boundaries, our model learns to classify each occurrence of ., ?, and ! as either a valid or invalid sentence boundary. The training procedure requires no hand-crafted rules, lexica, part-of-speech tags, or domain-specific information. The model can therefore be trained easily on any genre of English, and should be trainable on any other Roman-alphabet language. Performance is comparable to or better than the performance of similar systems, but we emphasize the simplicity of retraining for new domains.

Read more
Computation And Language

A Maximum-Entropy Partial Parser for Unrestricted Text

This paper describes a partial parser that assigns syntactic structures to sequences of part-of-speech tags. The program uses the maximum entropy parameter estimation method, which allows a flexible combination of different knowledge sources: the hierarchical structure, parts of speech and phrasal categories. In effect, the parser goes beyond simple bracketing and recognises even fairly complex structures. We give accuracy figures for different applications of the parser.

Read more
Computation And Language

A Memory-Based Approach to Learning Shallow Natural Language Patterns

Recognizing shallow linguistic patterns, such as basic syntactic relationships between words, is a common task in applied natural language and text processing. The common practice for approaching this task is by tedious manual definition of possible pattern structures, often in the form of regular expressions or finite automata. This paper presents a novel memory-based learning method that recognizes shallow patterns in new text based on a bracketed training corpus. The training data are stored as-is, in efficient suffix-tree data structures. Generalization is performed on-line at recognition time by comparing subsequences of the new text to positive and negative evidence in the corpus. This way, no information in the training is lost, as can happen in other learning systems that construct a single generalized model at the time of training. The paper presents experimental results for recognizing noun phrase, subject-verb and verb-object patterns in English. Since the learning approach enables easy porting to new domains, we plan to apply it to syntactic patterns in other languages and to sub-language patterns for information extraction.

Read more
Computation And Language

A Model of Lexical Attraction and Repulsion

This paper introduces new methods based on exponential families for modeling the correlations between words in text and speech. While previous work assumed the effects of word co-occurrence statistics to be constant over a window of several hundred words, we show that their influence is nonstationary on a much smaller time scale. Empirical data drawn from English and Japanese text, as well as conversational speech, reveals that the ``attraction'' between words decays exponentially, while stylistic and syntactic contraints create a ``repulsion'' between words that discourages close co-occurrence. We show that these characteristics are well described by simple mixture models based on two-stage exponential distributions which can be trained using the EM algorithm. The resulting distance distributions can then be incorporated as penalizing features in an exponential language model.

Read more
Computation And Language

A Model-Theoretic Framework for Theories of Syntax

A natural next step in the evolution of constraint-based grammar formalisms from rewriting formalisms is to abstract fully away from the details of the grammar mechanism---to express syntactic theories purely in terms of the properties of the class of structures they license. By focusing on the structural properties of languages rather than on mechanisms for generating or checking structures that exhibit those properties, this model-theoretic approach can offer simpler and significantly clearer expression of theories and can potentially provide a uniform formalization, allowing disparate theories to be compared on the basis of those properties. We discuss $\LKP$, a monadic second-order logical framework for such an approach to syntax that has the distinctive virtue of being superficially expressive---supporting direct statement of most linguistically significant syntactic properties---but having well-defined strong generative capacity---languages are definable in $\LKP$ iff they are strongly context-free. We draw examples from the realms of GPSG and GB.

Read more

Ready to get started?

Join us today