Featured Research

Computation And Language

A Constraint-based Case Frame Lexicon Architecture

In Turkish (and possibly in many other languages), verbs often convey several meanings (some totally unrelated) when they are used with subjects, objects, oblique objects, and adverbial adjuncts that carry certain lexical, morphological, and semantic features and co-occurrence restrictions. In addition to the usual sense variations due to selectional restrictions on verbal arguments, in most cases the meaning conveyed by a case frame is idiomatic and not compositional, with subtle constraints. In this paper, we present an approach to building a constraint-based case frame lexicon for use in natural language processing in Turkish, whose prototype we have implemented under the TFS system developed at the University of Stuttgart. A number of observations that we have made on Turkish have indicated that we need something beyond the traditional transitive and intransitive distinction, and we utilize a framework in which verb valence is considered as the obligatory co-existence of an arbitrary subset of possible arguments along with the obligatory exclusion of certain others, relative to a verb sense. Additional morphological, lexical, and semantic constraints on the syntactic constituents, organized as a 5-tier constraint hierarchy, are utilized to map a given syntactic structure case frame to a specific verb sense.

Computation And Language

A Corpus Study of Negative Imperatives in Natural Language Instructions

In this paper, we define the notion of a preventative expression and discuss a corpus study of such expressions in instructional text. We discuss our coding schema, which takes into account both form and function features, and present measures of inter-coder reliability for those features. We then discuss the correlations that exist between the function and the form features.

Computation And Language

A Corpus-Based Approach for Building Semantic Lexicons

Semantic knowledge can be a great asset to natural language processing systems, but it is usually hand-coded for each application. Although some semantic information is available in general-purpose knowledge bases such as WordNet and Cyc, many applications require domain-specific lexicons that represent words and categories for a particular topic. In this paper, we present a corpus-based method that can be used to build semantic lexicons for specific categories. The input to the system is a small set of seed words for a category and a representative text corpus. The output is a ranked list of words that are associated with the category. A user then reviews the top-ranked words and decides which ones should be entered in the semantic lexicon. In experiments with five categories, users typically found about 60 words per category in 10-15 minutes to build a core semantic lexicon.

Computation And Language

A Corpus-Based Investigation of Definite Description Use

We present the results of a study of definite description use in written texts aimed at assessing the feasibility of annotating corpora with information about definite description interpretation. We ran two experiments, in which subjects were asked to classify the uses of definite descriptions in a corpus of 33 newspaper articles, containing a total of 1412 definite descriptions. We measured the agreement among annotators about the classes assigned to definite descriptions, as well as the agreement about the antecedent assigned to those definites that the annotators classified as being related to an antecedent in the text. The most interesting result of this study from a corpus annotation perspective was the rather low agreement (K=0.63) that we obtained using versions of Hawkins' and Prince's classification schemes; better results (K=0.76) were obtained using the simplified scheme proposed by Fraurud that includes only two classes, first-mention and subsequent-mention. The agreement about antecedents was also not complete. These findings raise questions concerning the strategy of evaluating systems for definite description interpretation by comparing their results with a standardized annotation. From a linguistic point of view, the most interesting observations were the great number of discourse-new definites in our corpus (in one of our experiments, about 50% of the definites in the collection were classified as discourse-new, 30% as anaphoric, and 18% as associative/bridging) and the presence of definites which did not seem to require a complete disambiguation.

Computation And Language

A Czech Morphological Lexicon

In this paper, a treatment of Czech phonological rules in the two-level morphology approach is described. First, the possible phonological alternations in Czech are listed, and then their treatment in a practical application of a Czech morphological lexicon is described.

Computation And Language

A Data-Oriented Approach to Semantic Interpretation

In Data-Oriented Parsing (DOP), an annotated language corpus is used as a stochastic grammar. The most probable analysis of a new input sentence is constructed by combining sub-analyses from the corpus in the most probable way. This approach has been successfully used for syntactic analysis, using corpora with syntactic annotations such as the Penn Treebank. If a corpus with semantically annotated sentences is used, the same approach can also generate the most probable semantic interpretation of an input sentence. The present paper explains this semantic interpretation method and summarizes the results of a preliminary experiment. Semantic annotations were added to the syntactic annotations of most of the sentences of the ATIS corpus. A data-oriented semantic interpretation algorithm was successfully tested on this semantically enriched corpus.

Computation And Language

A Descriptive Characterization of Tree-Adjoining Languages (Full Version)

Since the early Sixties and Seventies it has been known that the regular and context-free languages are characterized by definability in the monadic second-order theory of certain structures. More recently, these descriptive characterizations have been used to obtain complexity results for constraint- and principle-based theories of syntax and to provide a uniform model-theoretic framework for exploring the relationship between theories expressed in disparate formal terms. These results have been limited, to an extent, by the lack of descriptive characterizations of language classes beyond the context-free. Recently, we have shown that tree-adjoining languages (in a mildly generalized form) can be characterized by recognition by automata operating on three-dimensional tree manifolds, a three-dimensional analog of trees. In this paper, we exploit these automata-theoretic results to obtain a characterization of the tree-adjoining languages by definability in the monadic second-order theory of these three-dimensional tree manifolds. This not only opens the way to extending the tools of model-theoretic syntax to the level of TALs, but provides a highly flexible mechanism for defining TAGs in terms of logical constraints. This is the full version of a paper to appear in the proceedings of COLING-ACL'98 as a project note.

Computation And Language

A Divide-and-Conquer Strategy for Parsing

In this paper, we propose a novel strategy designed to enhance the accuracy of a parser by simplifying complex sentences before parsing. This approach involves the separate parsing of the constituent sub-sentences within a complex sentence. To achieve that, the divide-and-conquer strategy first disambiguates the roles of the link words in the sentence and segments the sentence based on these roles. The separate parse trees of the segmented sub-sentences and the noun phrases within them are then synthesized to form the final parse. To evaluate the effects of this strategy on parsing, we compare the original performance of a dependency parser with its performance when enhanced with the divide-and-conquer strategy. When tested on 600 sentences of the IPSM'95 data sets, the enhanced parser saw a considerable error reduction of 21.2%.

Computation And Language

A Dynamic Approach to Rhythm in Language: Toward a Temporal Phonology

It is proposed that the theory of dynamical systems offers appropriate tools to model many phonological aspects of both speech production and perception. A dynamic account of speech rhythm is shown to be useful for description of both Japanese mora timing and English timing in a phrase repetition task. This orientation contrasts fundamentally with the more familiar symbolic approach to phonology, in which time is modeled only with sequentially arrayed symbols. It is proposed that an adaptive oscillator offers a useful model for perceptual entrainment (or 'locking in') to the temporal patterns of speech production. This helps to explain why speech is often perceived to be more regular than experimental measurements seem to justify. Because dynamic models deal with real time, they also help us understand how languages can differ in their temporal detail, contributing to foreign accents, for example. The fact that languages differ greatly in their temporal detail suggests that these effects are not mere motor universals, but that dynamical models are intrinsic components of the phonological characterization of language.

Computation And Language

A Faster Structured-Tag Word-Classification Method

Several methods have been proposed for processing a corpus to induce a tagset for the sub-language represented by the corpus. This paper examines a structured-tag word classification method introduced by McMahon (1994) and discussed further by McMahon & Smith (1995) in cmp-lg/9503011. Two major variations, (1) non-random initial assignment of words to classes and (2) moving multiple words in parallel, together provide robust non-random results with a speed increase of 200% to 450%, at the cost of slightly lower quality than the average quality of McMahon's method. Two further variations, (3) retaining information from less-frequent words and (4) avoiding reclustering closed classes, are proposed for further study. Note: The speed increases quoted above are relative to my implementation of my understanding of McMahon's algorithm; this takes time measured in hours and days on a home PC. A revised version of the McMahon & Smith (1995) paper has appeared (June 1996) in Computational Linguistics 22(2):217-247; this refers to a time of "several weeks" to cluster 569 words on a Sparc-IPC.
