Featured Research

Computation And Language

Attempto Controlled English (ACE)

Attempto Controlled English (ACE) allows domain specialists to interactively formulate requirements specifications in domain concepts. ACE can be accurately and efficiently processed by a computer, but is expressive enough to allow natural usage. The Attempto system translates specification texts in ACE into discourse representation structures and optionally into Prolog. Translated specification texts are incrementally added to a knowledge base. This knowledge base can be queried in ACE for verification, and it can be executed for simulation, prototyping and validation of the specification.

Computation And Language

Automatic Alignment of English-Chinese Bilingual Texts of CNS News

In this paper we present a method for aligning English-Chinese bilingual news reports from China News Service, combining both lexical and statistical approaches. Because of the sentential structure differences between English and Chinese, matching at the sentence level, as in many other works, often results in several sentences being matched en masse. In view of this, the current work also attempts to create shorter alignment pairs by permitting finer matching between clauses of the two texts where possible. The method is based on the statistical correlation between sentence or clause lengths in the two texts, and at the same time uses obvious anchors, such as the numbers and place names that appear frequently in news reports, as lexical cues.
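The length-plus-anchor idea can be sketched as follows. This is a minimal illustration, not the paper's actual formulation: the scoring function, the choice of digit strings as anchors, and the `expected_ratio` parameter are all assumptions made here for demonstration.

```python
import re

def anchor_tokens(text):
    # Digit strings usually survive translation verbatim, which makes
    # them cheap, reliable anchors between the two texts.
    return set(re.findall(r"\d+", text))

def pair_score(src, tgt, expected_ratio=1.0):
    # Length correlation: penalise pairs whose character-length ratio
    # deviates from an expected target-to-source ratio.
    ratio = len(tgt) / max(len(src), 1)
    length_penalty = abs(ratio - expected_ratio)
    # Lexical cue: shared anchors strongly support a match.
    shared = len(anchor_tokens(src) & anchor_tokens(tgt))
    return shared - length_penalty

def best_match(src, candidates, expected_ratio=1.0):
    # Pick the candidate sentence or clause with the highest score.
    return max(candidates, key=lambda t: pair_score(src, t, expected_ratio))
```

A real aligner would search over whole sequences (e.g. by dynamic programming) rather than scoring candidates independently; this sketch only shows how length and lexical cues combine in a single score.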

Computation And Language

Automatic Construction of Clean Broad-Coverage Translation Lexicons

Word-level translational equivalences can be extracted from parallel texts by surprisingly simple statistical techniques. However, these techniques are easily fooled by indirect associations: pairs of unrelated words whose statistical properties resemble those of mutual translations. Indirect associations pollute the resulting translation lexicons, drastically reducing their precision. This paper presents an iterative lexicon cleaning method. On each iteration, most of the remaining incorrect lexicon entries are filtered out, without significant degradation in recall. This lexicon cleaning technique can produce translation lexicons with recall and precision both exceeding 90%, as well as dictionary-sized translation lexicons that are over 99% correct.
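One well-known way to suppress indirect associations is greedy competitive linking over association scores. The sketch below is a simplified stand-in for the paper's iterative filtering, with a hypothetical score table as input:

```python
def competitive_link(scores):
    """Greedy competitive linking: repeatedly accept the strongest
    remaining association, then discard every other pair involving
    either of its words. Indirect associations lose to the direct
    associations they shadow, because the direct pair scores higher."""
    lexicon = []
    remaining = dict(scores)
    while remaining:
        (e, f), _ = max(remaining.items(), key=lambda kv: kv[1])
        lexicon.append((e, f))
        remaining = {(e2, f2): s for (e2, f2), s in remaining.items()
                     if e2 != e and f2 != f}
    return lexicon
```

For example, if "white" co-occurs with both "blanc" and "maison" in aligned sentences, the weaker "white"/"maison" entry is discarded as soon as the stronger "white"/"blanc" entry is accepted.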

Computation And Language

Automatic Detection of Omissions in Translations

ADOMIT is an algorithm for Automatic Detection of OMIssions in Translations. The algorithm relies solely on geometric analysis of bitext maps and uses no linguistic information. This property allows it to deal equally well with omissions that do not correspond to linguistic units, such as might result from word-processing mishaps. ADOMIT has proven itself by discovering many errors in a hand-constructed gold standard for evaluating bitext mapping algorithms. Quantitative evaluation on simulated omissions showed that, even with today's poor bitext mapping technology, ADOMIT is a valuable quality control tool for translators and translation bureaus.
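The geometric intuition can be illustrated with a toy detector: in a bitext map of corresponding character positions, an omission shows up as a span where the source side advances but the target side barely moves relative to the map's overall slope. The threshold constants and the point-pair representation here are illustrative assumptions, not ADOMIT's actual procedure.

```python
def find_omissions(bitext_map, slope, min_gap=50):
    """Flag source spans whose target-side advance is far smaller than
    the bitext map's overall slope predicts: candidate omissions.

    bitext_map: sorted list of (source_pos, target_pos) correspondence
    points; slope: expected target chars per source char."""
    omissions = []
    for (x1, y1), (x2, y2) in zip(bitext_map, bitext_map[1:]):
        expected = (x2 - x1) * slope
        # A long source gap with almost no target progress is suspect.
        if (x2 - x1) >= min_gap and (y2 - y1) < 0.2 * expected:
            omissions.append((x1, x2))
    return omissions
```

Because the test is purely geometric, it flags any large untranslated span, whether it corresponds to a linguistic unit or to a word-processing mishap.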

Computation And Language

Automatic Detection of Text Genre

As the text databases available to users become larger and more heterogeneous, genre becomes increasingly important for computational linguistics as a complement to topical and structural principles of classification. We propose a theory of genres as bundles of facets, which correlate with various surface cues, and argue that genre detection based on surface cues is as successful as detection based on deeper structural properties.
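Surface cues of the kind such facets correlate with can be computed without any parsing. The particular feature set below (mean sentence length, type-token ratio, first-person pronoun rate) is a hypothetical illustration, not the paper's cue inventory:

```python
import re

def surface_cues(text):
    """Extract cheap surface cues from raw text: counts only,
    no deep structural analysis required."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        # Long sentences tend to mark formal or expository genres.
        "mean_sent_len": len(words) / max(len(sentences), 1),
        # Vocabulary variety differs across genres.
        "type_token_ratio": len(set(words)) / max(len(words), 1),
        # First-person pronouns suggest editorial or personal genres.
        "first_person_rate": sum(w in ("i", "we") for w in words)
                             / max(len(words), 1),
    }
```

A genre classifier would then be trained on such feature vectors with any standard learning method.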

Computation And Language

Automatic Discovery of Non-Compositional Compounds in Parallel Data

Automatic segmentation of text into minimal content-bearing units is an unsolved problem even for languages like English. Spaces between words offer an easy first approximation, but this approximation is not good enough for machine translation (MT), where many word sequences are not translated word-for-word. This paper presents an efficient automatic method for discovering sequences of words that are translated as a unit. The method proceeds by comparing pairs of statistical translation models induced from parallel texts in two languages. It can discover hundreds of non-compositional compounds on each iteration, and constructs longer compounds out of shorter ones. Objective evaluation on a simple machine translation task has shown the method's potential to improve the quality of MT output. The method makes few assumptions about the data, so it can be applied to parallel data other than parallel texts, such as word spellings and pronunciations.
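The model-comparison step can be sketched in miniature: induce one translation table that treats a candidate word sequence as a fused unit and one that does not, then keep the compound only if the fused model predicts the parallel data better. The log-likelihood objective, the smoothing floor, and the table format below are simplifying assumptions for illustration.

```python
import math

def model_score(pairs, t):
    """Log-likelihood of word-aligned pairs under translation table t,
    with a small floor for unseen pairs (a toy stand-in for the
    objective used to compare the two induced models)."""
    return sum(math.log(t.get((e, f), 1e-6)) for e, f in pairs)

def keep_compound(pairs_fused, t_fused, pairs_plain, t_plain):
    # Fuse the candidate word sequence into one token only if the model
    # that treats it as a unit explains the parallel data better.
    return model_score(pairs_fused, t_fused) > model_score(pairs_plain, t_plain)
```

A non-compositional compound like "hot dog" wins this comparison because its word-for-word translation assigns the observed data very low probability.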

Computation And Language

Automatic Extraction of Subcategorization from Corpora

We describe a novel technique and implemented system for constructing a subcategorization dictionary from textual corpora. Each dictionary entry encodes the relative frequency of occurrence of a comprehensive set of subcategorization classes for English. An initial experiment, on a sample of 14 verbs which exhibit multiple complementation patterns, demonstrates that the technique achieves accuracy comparable to previous approaches, which are all limited to a highly restricted set of subcategorization classes. We also demonstrate that a subcategorization dictionary built with the system improves the accuracy of a parser by an appreciable amount.
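The shape of such a dictionary entry, relative frame frequencies per verb, can be sketched from (verb, frame) observations emitted by a parser. The frame labels and the noise-filtering threshold here are illustrative choices, not the paper's:

```python
from collections import Counter, defaultdict

def scf_lexicon(observations, min_rel_freq=0.05):
    """Build dictionary entries mapping each verb to the relative
    frequency of its observed subcategorization frames, filtering
    very rare frames that are likely parser noise."""
    counts = defaultdict(Counter)
    for verb, frame in observations:
        counts[verb][frame] += 1
    lexicon = {}
    for verb, frames in counts.items():
        total = sum(frames.values())
        lexicon[verb] = {f: n / total for f, n in frames.items()
                         if n / total >= min_rel_freq}
    return lexicon
```

The resulting relative frequencies are what a parser can consult to prefer, say, the double-object analysis of "give" over a rarer attachment.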

Computation And Language

Automatic Extraction of Tagset Mappings from Parallel-Annotated Corpora

This paper describes some of the recent work of project AMALGAM (automatic mapping among lexico-grammatical annotation models). We are investigating ways to map between the leading corpus annotation schemes in order to improve their reusability. Collating all the included corpora into a single large annotated corpus will allow a more detailed language model to be developed for tasks such as speech and handwriting recognition. In particular, we focus here on a method of extracting mappings from corpora that have been annotated according to more than one annotation scheme.
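The core extraction step can be sketched very simply: over tokens annotated in both schemes, count tag co-occurrences and map each tag of scheme A to its most frequent counterpart in scheme B. Real tagset mappings are often one-to-many and context-dependent; this most-frequent-counterpart rule is a deliberate simplification.

```python
from collections import Counter, defaultdict

def extract_mapping(parallel_tags):
    """From (tag_a, tag_b) pairs observed on the same tokens,
    map each scheme-A tag to its most frequent scheme-B tag."""
    co = defaultdict(Counter)
    for tag_a, tag_b in parallel_tags:
        co[tag_a][tag_b] += 1
    return {a: bs.most_common(1)[0][0] for a, bs in co.items()}
```

Inspecting the full co-occurrence counts, rather than only the winner, is what reveals where two schemes draw their category boundaries differently.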

Computation And Language

Automatic Inference of DATR Theories

This paper presents an approach for the automatic acquisition of linguistic knowledge from unstructured data. The acquired knowledge is represented in the lexical knowledge representation language DATR. A set of transformation rules that establish inheritance relationships and a default-inference algorithm make up the basis components of the system. Since the overall approach is not restricted to a special domain, the heuristic inference strategy uses criteria to evaluate the quality of a DATR theory, where different domains may require different criteria. The system is applied to the linguistic learning task of German noun inflection.

Computation And Language

Automatic summarising: factors and directions

This position paper suggests that progress with automatic summarising demands a better research methodology and a carefully focussed research strategy. In order to develop effective procedures it is necessary to identify and respond to the context factors, i.e. input, purpose, and output factors, that bear on summarising and its evaluation. The paper analyses and illustrates these factors and their implications for evaluation. It then argues that this analysis, together with the state of the art and the intrinsic difficulty of summarising, implies a nearer-term strategy concentrating on shallow, but not surface, text analysis and on indicative summarising. This is illustrated with current work, from which a potentially productive research programme can be developed.

