Dekai Wu
Hong Kong University of Science and Technology
Publication
Featured research published by Dekai Wu.
Meeting of the Association for Computational Linguistics | 1996
Dekai Wu
We introduce a polynomial-time algorithm for statistical machine translation. This algorithm can be used in place of the expensive, slow best-first search strategies in current statistical translation architectures. The approach employs the stochastic bracketing transduction grammar (SBTG) model we recently introduced to replace earlier word alignment channel models, while retaining a bigram language model. The new algorithm in our experience yields major speed improvement with no significant loss of accuracy.
Meeting of the Association for Computational Linguistics | 1994
Dekai Wu
We describe our experience with automatic alignment of sentences in parallel English-Chinese texts. Our report concerns three related topics: (1) progress on the HKUST English-Chinese Parallel Bilingual Corpus; (2) experiments addressing the applicability of Gale & Church's (1991) length-based statistical method to the task of alignment involving a non-Indo-European language; and (3) an improved statistical method that also incorporates domain-specific lexical cues.
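As a rough illustration of what a length-based alignment method does, the sketch below aligns two sentence lists using only their lengths and a small dynamic program. It is a simplified sketch in the spirit of length-based alignment, not the method evaluated in the paper: the function names, the bead priors, the length ratio, and the variance are all placeholder assumptions, and no lexical cues are used.

```python
import math

# Placeholder parameters (assumptions for illustration, not values from the paper).
BEAD_PRIORS = {
    (1, 1): 0.89,   # one source sentence <-> one target sentence
    (1, 0): 0.005,  # source sentence with no counterpart
    (0, 1): 0.005,  # target sentence with no counterpart
    (2, 1): 0.045,  # two source sentences merged into one target sentence
    (1, 2): 0.045,  # one source sentence split into two target sentences
    (2, 2): 0.01,
}
MEAN_RATIO = 1.0  # assumed expected target length per unit of source length
VARIANCE = 6.8    # assumed variance of the length difference per unit length


def bead_cost(len_src, len_tgt, prior):
    """Negative log of (bead prior x two-sided tail probability of the length mismatch)."""
    mean = max((len_src + len_tgt / MEAN_RATIO) / 2.0, 1.0)
    delta = (len_tgt - len_src * MEAN_RATIO) / math.sqrt(mean * VARIANCE)
    # erfc(|d| / sqrt(2)) is the two-sided tail of a standard normal; clamp to avoid log(0).
    tail = max(math.erfc(abs(delta) / math.sqrt(2.0)), 1e-300)
    return -math.log(prior) - math.log(tail)


def align_by_length(src_lens, tgt_lens):
    """Return the minimum-cost list of beads as (source indices, target indices) pairs."""
    n, m = len(src_lens), len(tgt_lens)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == math.inf:
                continue
            for (di, dj), prior in BEAD_PRIORS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                step = bead_cost(sum(src_lens[i:ni]), sum(tgt_lens[j:nj]), prior)
                if cost[i][j] + step < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + step
                    back[ni][nj] = (i, j)
    # Follow the back-pointers from the end of both texts to recover the beads.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    beads.reverse()
    return beads


# Hypothetical sentence lengths in characters: the second source sentence
# corresponds to the second and third target sentences together.
beads = align_by_length([20, 45, 30], [21, 20, 24, 31])
```

Because the cost depends only on lengths, a sketch like this cannot distinguish two candidate beads with similar length ratios, which is the kind of ambiguity that additional lexical cues are meant to resolve.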
Computational Linguistics | 1988
Robert Wilensky; David N. Chin; Marc Luria; James H. Martin; James Mayfield; Dekai Wu
UC (UNIX Consultant) is an intelligent, natural language interface that allows naive users to learn about the UNIX operating system. UC was undertaken because the task was thought to be both a fertile domain for artificial intelligence (AI) research and a useful application of AI work in planning, reasoning, natural language processing, and knowledge representation. The current implementation of UC comprises the following components: a language analyzer, called ALANA, that produces a representation of the content contained in an utterance; an inference component, called a concretion mechanism, that further refines this content; a goal analyzer, PAGAN, that hypothesizes the plans and goals under which the user is operating; an agent, called UCEgo, that decides on UC's goals and proposes plans for them; a domain planner, called KIP, that computes a plan to address the user's request; an expression mechanism, UCExpress, that determines the content to be communicated to the user; and a language production mechanism, UCGen, that expresses UC's response in English. UC also contains a component, called KNOME, that builds a model of the user's knowledge state with respect to UNIX. Another mechanism, UCTeacher, allows a user to add knowledge of both English vocabulary and facts about UNIX to UC's knowledge base. This is done by interacting with the user in natural language. All these aspects of UC make use of knowledge represented in a knowledge representation system called KODIAK. KODIAK is a relation-oriented system that is intended to have wide representational range and a clear semantics, while maintaining a cognitive appeal. All of UC's knowledge, ranging from its most general concepts to the content of a particular utterance, is represented in KODIAK.
Meeting of the Association for Computational Linguistics | 2005
Marine Carpuat; Dekai Wu
We directly investigate a subject of much recent debate: do word sense disambiguation models help statistical machine translation quality? We present empirical results casting doubt on this common, but unproved, assumption. Using a state-of-the-art Chinese word sense disambiguation model to choose translation candidates for a typical IBM statistical MT system, we find that word sense disambiguation does not yield significantly better translation quality than the statistical machine translation system alone. Error analysis suggests several key factors behind this surprising finding, including inherent limitations of current statistical MT architectures.
North American Chapter of the Association for Computational Linguistics | 2009
Dekai Wu; Pascale Fung
We present results on a novel hybrid semantic SMT model that incorporates the strengths of both semantic role labeling and phrase-based statistical machine translation. The approach avoids major complexity limitations via a two-pass architecture. The first pass is performed using a conventional phrase-based SMT model. The second pass is performed by a re-ordering strategy guided by shallow semantic parsers that produce both semantic frame and role labels. Evaluation on a Wall Street Journal newswire genre test set showed the hybrid model to yield an improvement of roughly half a point in BLEU score over a strong pure phrase-based SMT baseline -- to our knowledge, the first successful application of semantic role labeling to SMT.
Meeting of the Association for Computational Linguistics | 1998
Dekai Wu; Hongsing Wong
We introduce a stochastic grammatical channel model for machine translation that synthesizes several desirable characteristics of both statistical and grammatical machine translation. As with the pure statistical translation model described by Wu (1996) (in which a bracketing transduction grammar models the channel), alternative hypotheses compete probabilistically, exhaustive search of the translation hypothesis space can be performed in polynomial time, and robustness heuristics arise naturally from a language-independent inversion-transduction model. However, unlike pure statistical translation models, the generated output string is guaranteed to conform to a given target grammar. The model employs only (1) a translation lexicon, (2) a context-free grammar for the target language, and (3) a bigram language model. The fact that no explicit bilingual translation rules are used makes the model easily portable to a variety of source languages. Initial experiments show that it also achieves significant speed gains over our earlier model.
Meeting of the Association for Computational Linguistics | 1995
Dekai Wu
We describe a grammarless method for simultaneously bracketing both halves of a parallel text and giving word alignments, assuming only a translation lexicon for the language pair. We introduce inversion-invariant transduction grammars which serve as generative models for parallel bilingual sentences with weak order constraints. Focusing on transduction grammars for bracketing, we formulate a normal form, and a stochastic version amenable to a maximum-likelihood bracketing algorithm. Several extensions and experiments are discussed.
arXiv: Computation and Language | 1999
Pascale Fung; Dekai Wu
We describe a method of using statistically-collected Chinese character groups from a corpus to augment a Chinese dictionary. The method is particularly useful for extracting domain-specific and regional words not readily available in machine-readable dictionaries. Output was evaluated both using human evaluators and against a previously available dictionary. We also evaluated performance improvement in automatic Chinese tokenization. Results show that our method outputs legitimate words, acronymic constructions, idioms, names and titles, as well as technical compounds, many of which were lacking from the original dictionary.
North American Chapter of the Association for Computational Linguistics | 2003
Dekai Wu; Grace Ngai; Marine Carpuat
This paper investigates stacking and voting methods for combining strong classifiers like boosting, SVM, and TBL, on the named-entity recognition task. We demonstrate several effective approaches, culminating in a model that achieves error rate reductions on the development and test sets of 63.6% and 55.0% (English) and 47.0% and 51.7% (German) over the CoNLL-2003 standard baseline respectively, and 19.7% over a strong AdaBoost baseline model from CoNLL-2002.
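The simplest form of voting can be sketched in a few lines. The example below combines per-token tag sequences from several classifiers by plurality vote; it is a minimal sketch, assuming each classifier has already produced one tag per token. The function name, the tie-breaking rule, and the sample tags are hypothetical, and the paper's stacking and weighted variants are not reproduced here.

```python
from collections import Counter


def plurality_vote(predictions):
    """Combine tag sequences (one per classifier) by per-token plurality vote.

    Ties are broken in favor of the classifier listed first, so callers
    can order the list from most to least trusted model.
    """
    if not predictions:
        return []
    length = len(predictions[0])
    if any(len(tags) != length for tags in predictions):
        raise ValueError("all classifiers must tag the same number of tokens")
    combined = []
    for token_tags in zip(*predictions):
        counts = Counter(token_tags)
        top = max(counts.values())
        # Pick the first tag, in classifier order, that reaches the top count.
        combined.append(next(tag for tag in token_tags if counts[tag] == top))
    return combined


# Hypothetical outputs of three classifiers on a three-token sentence.
votes = plurality_vote([
    ["B-PER", "O", "B-LOC"],
    ["B-PER", "O", "O"],
    ["O", "O", "B-LOC"],
])
# votes == ["B-PER", "O", "B-LOC"]
```

Stacking differs from this in that a second-level learner is trained on the base classifiers' outputs rather than applying a fixed counting rule.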
Pacific-Asia Conference on Knowledge Discovery and Data Mining | 2000
Yanlei Diao; Hongjun Lu; Dekai Wu
This paper addresses personal E-mail filtering by casting it in the framework of text classification. Modeled as semi-structured documents, E-mail messages consist of a set of fields with predefined semantics and a number of variable-length free-text fields. While most work on classification concentrates either on structured data or on free text, the work in this paper deals with both of them. To perform classification, a naive Bayesian classifier was designed and implemented, and a decision-tree-based classifier was implemented. The design considerations and implementation issues are discussed. Using a relatively large amount of real personal E-mail data, a comprehensive comparative study was conducted using the two classifiers. The importance of different features is reported. Results of other issues related to building an effective personal E-mail classifier are presented and discussed. It is shown that both classifiers can perform filtering with reasonable accuracy. While the decision-tree-based classifier outperforms the Bayesian classifier when features and training size are selected optimally for both, a carefully designed naive Bayesian classifier is more robust.
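To make the idea of classifying semi-structured messages concrete, here is a minimal sketch of a multinomial naive Bayes classifier with add-one smoothing. It assumes each message is a dict of field names to text and tags every token with its field, so that the same word in the sender field and in the subject counts as two different features. The class, the field names, and the sample data are hypothetical and do not reflect the feature design used in the paper.

```python
import math
from collections import Counter, defaultdict


def featurize(message):
    """Turn a dict of E-mail fields into field-tagged tokens, e.g. 'subject:meeting'."""
    tokens = []
    for field, text in message.items():
        tokens.extend(f"{field}:{word}" for word in text.lower().split())
    return tokens


class NaiveBayesFilter:
    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # additive smoothing constant
        self.class_counts = Counter()            # messages seen per folder
        self.token_counts = defaultdict(Counter) # token counts per folder
        self.vocab = set()

    def fit(self, messages, labels):
        for message, label in zip(messages, labels):
            self.class_counts[label] += 1
            for token in featurize(message):
                self.token_counts[label][token] += 1
                self.vocab.add(token)
        return self

    def predict(self, message):
        # Tokens never seen in training are ignored.
        tokens = [t for t in featurize(message) if t in self.vocab]
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            denom = sum(self.token_counts[label].values()) + self.alpha * len(self.vocab)
            score = math.log(count / total)
            for token in tokens:
                score += math.log((self.token_counts[label][token] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training data: four messages filed into two folders.
clf = NaiveBayesFilter().fit(
    [
        {"from": "boss@work.example", "subject": "project meeting"},
        {"from": "boss@work.example", "subject": "budget meeting"},
        {"from": "friend@home.example", "subject": "weekend hiking"},
        {"from": "friend@home.example", "subject": "dinner plans"},
    ],
    ["work", "work", "personal", "personal"],
)
folder = clf.predict({"from": "boss@work.example", "subject": "meeting agenda"})
```

Scores are accumulated in log space so that long messages do not underflow to zero probability.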