
Publication

Featured research published by Pascale Fung.


Meeting of the Association for Computational Linguistics | 1998

An IR Approach for Translating New Words from Nonparallel, Comparable Texts

Pascale Fung; Lo Yuen Yee

In recent years there has been phenomenal growth in the amount of online text available from the greatest information repository known, the World Wide Web. Various traditional information retrieval (IR) techniques combined with natural language processing (NLP) techniques have been re-targeted to enable efficient access of the WWW: search engines, indexing, relevance feedback, query term and keyword weighting, document analysis, document classification, and so on. Most of these techniques aim at efficient online search for information already on the Web. Meanwhile, the corpus linguistics community regards the WWW as a vast potential source of corpus resources. It is now possible to download a large amount of text with automatic tools when one needs to compute, for example, a list of synonyms, or to download domain-specific monolingual texts by supplying a keyword to a search engine and then use the text to extract domain-specific terms. It remains to be seen how we can also make use of multilingual texts as NLP resources. In the years since the first papers on using statistical models for bilingual lexicon compilation and machine translation (Brown et al., 1993; Brown et al., 1991; Gale and Church, 1993; Church, 1993; Simard et al., 1992), a large amount of human effort and time has been invested in collecting parallel corpora of translated texts. Our goal is to alleviate this effort and enlarge the scope of corpus resources by looking into monolingual, comparable texts, known as nonparallel corpora. Such nonparallel, monolingual texts should be much more prevalent than parallel texts. However, previous attempts at using nonparallel corpora for terminology translation were constrained by the inadequate availability of same-domain, comparable texts in electronic form. The nonparallel texts obtained from the LDC or university libraries were often restricted, and were usually out of date as soon as they became available.
For new-word translation, the timeliness of corpus resources is a prerequisite, as is the continuous and automatic availability of nonparallel, comparable texts in electronic form. The data-collection effort should not inhibit the actual translation effort. Fortunately, the World Wide Web nowadays provides a daily increase of fresh, up-to-date multilingual material, together with archived versions, all easily downloadable by software tools running in the background. It is possible to specify the URL of a newspaper's online site, along with start and end dates, and automatically download all the daily newspaper material between those dates. In this paper, we describe a new method that combines IR and NLP techniques to extract new-word translations from automatically downloaded English-Chinese nonparallel newspaper texts.
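The IR-style mining described above can be illustrated with a toy sketch: represent an unknown word by its co-occurrence counts with a small bilingual seed lexicon, map the source-side dimensions into the target language through the lexicon, and rank target-language words by cosine similarity. All names and data below are illustrative, not taken from the paper.

```python
from math import sqrt

def cosine(u, v):
    # cosine similarity between two sparse seed-word co-occurrence vectors
    keys = set(u) | set(v)
    dot = sum(u.get(k, 0) * v.get(k, 0) for k in keys)
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_translations(src_vec, tgt_vecs, seed_lexicon):
    # Map source-side seed dimensions into target-side seed dimensions
    # via the bilingual seed lexicon, then rank target words by cosine.
    mapped = {seed_lexicon[w]: c for w, c in src_vec.items() if w in seed_lexicon}
    scored = [(t, cosine(mapped, v)) for t, v in tgt_vecs.items()]
    return sorted(scored, key=lambda x: -x[1])

# toy data: hypothetical seed lexicon and co-occurrence counts
seeds = {"bank": "银行", "money": "钱", "river": "河"}
eng_vec = {"bank": 5, "money": 7}          # unknown English word's seed contexts
chi_vecs = {
    "贷款": {"银行": 4, "钱": 6},           # finance-like contexts
    "堤岸": {"河": 9},                      # river-like contexts
}
print(rank_translations(eng_vec, chi_vecs, seeds)[0][0])  # 贷款 ranks first
```

The seed lexicon supplies the cross-lingual bridge; the ranking itself needs no aligned sentences, which is what makes the approach usable on nonparallel texts.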


The 3rd Workshop on Very Large Corpora | 1995

Compiling Bilingual Lexicon Entries From a Non-Parallel English-Chinese Corpus

Pascale Fung

We propose a novel context heterogeneity similarity measure between words and their translations to help compile bilingual lexicon entries from a non-parallel English-Chinese corpus. Current algorithms for bilingual lexicon compilation rely on occurrence frequencies, length, or positional statistics derived from parallel texts. There is little correlation between such statistics of a word and of its translation in non-parallel corpora. On the other hand, we suggest that words with productive context in one language translate to words with productive context in another language, and words with rigid context translate into words with rigid context. Context heterogeneity measures how productive the context of a word is in a given domain, independent of its absolute occurrence frequency in the text. Based on this information, we derive statistics of bilingual word pairs from a non-parallel corpus. These statistics can be used to bootstrap a bilingual dictionary compilation algorithm.
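A minimal sketch of the context-heterogeneity idea, assuming the measure is the number of distinct left/right neighbor types per occurrence of a word (a simplification of the paper's definition), with word pairs compared by distance between their heterogeneity vectors:

```python
from math import dist  # Euclidean distance, Python 3.8+

def context_heterogeneity(sentences, word):
    # (left, right) heterogeneity: distinct neighbor types per occurrence,
    # independent of the word's absolute frequency in the corpus.
    left, right, occ = set(), set(), 0
    for toks in sentences:
        for i, t in enumerate(toks):
            if t == word:
                occ += 1
                if i > 0:
                    left.add(toks[i - 1])
                if i + 1 < len(toks):
                    right.add(toks[i + 1])
    if occ == 0:
        return (0.0, 0.0)
    return (len(left) / occ, len(right) / occ)

def heterogeneity_distance(h1, h2):
    # smaller distance => more plausible translation pair
    return dist(h1, h2)

corpus = [["the", "cat", "sat"], ["a", "cat", "ran"], ["the", "cat", "sat"]]
print(context_heterogeneity(corpus, "cat"))  # both components are 2/3
```

Because the measure is a ratio of neighbor types to occurrences, a frequent word with repetitive contexts and a rare word with varied contexts are kept distinguishable, which is the property the abstract relies on.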


Journal of Visual Languages and Computing | 1997

Finding Terminology Translations from Non-parallel Corpora

Kathleen R. McKeown; Pascale Fung

We present a statistical word feature, the Word Relation Matrix, which can be used to find translated pairs of words and terms from non-parallel corpora, across language groups. Online dictionary entries are used as seed words to generate Word Relation Matrices for the unknown words according to correlation measures. Word Relation Matrices are then mapped across the corpora to find translation pairs. Translation accuracies are around 30% when only the top candidate is counted. Nevertheless, the top-20 candidate output gives a 50.9% average increase in the accuracy of human translators.


North American Chapter of the Association for Computational Linguistics | 2009

Semantic Roles for SMT: A Hybrid Two-Pass Model

Dekai Wu; Pascale Fung

We present results on a novel hybrid semantic SMT model that incorporates the strengths of both semantic role labeling and phrase-based statistical machine translation. The approach avoids major complexity limitations via a two-pass architecture. The first pass is performed using a conventional phrase-based SMT model. The second pass is performed by a re-ordering strategy guided by shallow semantic parsers that produce both semantic frame and role labels. Evaluation on a Wall Street Journal newswire genre test set showed the hybrid model to yield an improvement of roughly half a point in BLEU score over a strong pure phrase-based SMT baseline; to our knowledge, this is the first successful application of semantic role labeling to SMT.


International Conference on Acoustics, Speech, and Signal Processing | 1993

The estimation of powerful language models from small and large corpora

Paul Placeway; Richard M. Schwartz; Pascale Fung; Long Nguyen

The authors consider the estimation of powerful statistical language models using a technique that scales from very small to very large amounts of domain-dependent data. They begin with improved modeling of the grammar statistics, based on a combination of the backing-off technique and zero-frequency techniques. These are extended to be more amenable to the particular system considered here. The resulting technique is greatly simplified, more robust, and gives improved recognition performance over either of the previous techniques. The authors also consider the problem of robustness of a model based on a small training corpus by grouping words into obvious semantic classes. This significantly improves the robustness of the resulting statistical grammar. A technique that allows the estimation of a high-order model on modest computational resources is also presented. This makes it possible to run a 4-gram statistical model trained on a 50-million-word corpus on a workstation of only modest capability and cost. Finally, the authors discuss results from applying a 2-gram statistical language model integrated in the HMM (hidden Markov model) search, obtaining a list of the N-best recognition results, and rescoring this list with a higher-order statistical model.
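The backing-off idea the authors build on can be sketched as follows. This toy uses a fixed absolute discount on seen bigrams and redistributes the freed mass to the unigram distribution; it illustrates the mechanism but is not the normalized Katz model the paper extends.

```python
from collections import Counter

def train_bigram_backoff(tokens, discount=0.5):
    # Simplified absolute-discounting back-off: subtract a fixed discount
    # from each seen bigram count and give the freed probability mass to
    # the unigram distribution (a sketch of backing off, not exact Katz).
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)

    def prob(w_prev, w):
        c_prev = unigrams[w_prev]
        c_bi = bigrams[(w_prev, w)]
        p_uni = unigrams[w] / total
        if c_prev == 0:
            return p_uni                        # unseen history: back off fully
        seen_types = sum(1 for (a, _b) in bigrams if a == w_prev)
        alpha = discount * seen_types / c_prev  # mass reserved for back-off
        if c_bi > 0:
            return (c_bi - discount) / c_prev + alpha * p_uni
        return alpha * p_uni

    return prob

p = train_bigram_backoff("a b a b a c".split())
print(p("a", "b") > p("a", "c"))  # True: a seen bigram outranks a rarer one
```

Seen events keep most of their relative-frequency estimate, while histories with many distinct continuations reserve more mass for unseen events, which is exactly the intuition behind the back-off and zero-frequency techniques named in the abstract.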


Machine Translation | 1998

A Technical Word- and Term-Translation Aid Using Noisy Parallel Corpora across Language Groups

Pascale Fung; Kathleen R. McKeown

Technical-term translation represents one of the most difficult tasks for human translators since (1) most translators are not familiar with terms and domain-specific terminology and (2) such terms are not adequately covered by printed dictionaries. This paper describes an algorithm for translating technical words and terms from noisy parallel corpora across language groups. Given any word which is part of a technical term in the source language, the algorithm produces a ranked list of candidate matches for it in the target language. Potential translations for the term are compiled from the matched words and are also ranked. We show how this ranked list helps translators in technical-term translation. Most algorithms for lexical and term translation focus on Indo-European language pairs, and most use a sentence-aligned clean parallel corpus without insertion, deletion, or OCR noise. Our algorithm is language- and character-set-independent, and is robust to noise in the corpus. We show how our algorithm requires minimal preprocessing and is able to obtain technical-word translations without sentence-boundary identification or sentence alignment, both from the English–Japanese awk manual corpus, which contains noise from text insertions and deletions, and from the English–Chinese HKUST bilingual corpus. We obtain a precision of 55.35% from the awk corpus for word translation including rare words, counting only the best candidate and direct translations. Translation precision of the best-candidate translation is 89.93% from the HKUST corpus. Potential term translations produced by the program help bilingual speakers achieve a 47% improvement in translating technical terms.


International Conference on Acoustics, Speech, and Signal Processing | 1999

Fast accent identification and accented speech recognition

Liu Wai Kat; Pascale Fung

The performance of speech recognition systems degrades when the speaker's accent differs from that in the training set. Both accent-independent and accent-dependent recognition require the collection of more training data. In this paper, we propose a faster accent-classification approach using phoneme-class models. We also present our findings on acoustic features sensitive to a Cantonese accent, and possibly to other Asian-language accents. In addition, we show how to rapidly transform a native-accent pronunciation dictionary into one for accented speech simply by using knowledge of the foreign speaker's native language. The use of this accent-adapted dictionary reduces the recognition error rate by 13.5%, similar to the results obtained from a longer, data-driven process.
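The dictionary-transformation step can be sketched as a table of phone substitutions applied to every entry of the native-accent dictionary. The rules below are hypothetical illustrations of Cantonese-influenced substitutions, not the rules derived in the paper.

```python
# Hypothetical phone substitutions a Cantonese-accented speaker might apply
# to English (illustrative only; not the paper's actual rule set).
SUBSTITUTIONS = {"th": "f", "r": "w", "v": "w"}

def adapt_dictionary(native_dict, rules):
    # Rewrite each pronunciation with the native-language substitution
    # rules to approximate accented pronunciations without collecting
    # any accented training data.
    adapted = {}
    for word, phones in native_dict.items():
        adapted[word] = [rules.get(p, p) for p in phones]
    return adapted

native = {"three": ["th", "r", "iy"], "very": ["v", "eh", "r", "iy"]}
print(adapt_dictionary(native, SUBSTITUTIONS))
# {'three': ['f', 'w', 'iy'], 'very': ['w', 'eh', 'w', 'iy']}
```

Because the transformation is a pure table lookup, the adapted dictionary can be produced instantly for any speaker group whose native phonology is known, which is the "rapid" alternative to the data-driven process the abstract compares against.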


arXiv: Computation and Language | 1999

Statistical Augmentation of a Chinese Machine-Readable Dictionary

Pascale Fung; Dekai Wu

We describe a method of using statistically collected Chinese character groups from a corpus to augment a Chinese dictionary. The method is particularly useful for extracting domain-specific and regional words not readily available in machine-readable dictionaries. Output was evaluated both by human evaluators and against a previously available dictionary. We also evaluated performance improvement in automatic Chinese tokenization. Results show that our method outputs legitimate words, acronymic constructions, idioms, names and titles, as well as technical compounds, many of which were lacking from the original dictionary.
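A toy stand-in for the statistical character grouping: collect frequent character bigrams that are absent from the existing dictionary as candidate new words. The threshold, n-gram length, and data are illustrative; the paper's statistics are richer than raw counts.

```python
from collections import Counter

def candidate_words(text, known, n=2, min_count=2):
    # Collect frequent character n-grams not already in the dictionary
    # as candidate new words (a toy version of statistical character
    # grouping; real systems also use association measures).
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return {g for g, c in grams.items() if c >= min_count and g not in known}

text = "互联网互联网发展"
print(sorted(candidate_words(text, known={"发展"})))  # ['互联', '联网']
```

Overlapping candidates such as these would then be filtered by the evaluation steps the abstract describes (human judgment and comparison against an existing dictionary) before entering the augmented lexicon.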


International Conference on Computational Linguistics | 2004

Multi-level bootstrapping for extracting parallel sentences from a quasi-comparable corpus

Pascale Fung; Percy Chi Shun Cheung

We propose a completely unsupervised method for mining parallel sentences from quasi-comparable bilingual texts, which have very different sizes and include both in-topic and off-topic documents. We discuss and analyze different bilingual corpora with various levels of comparability. We propose that while better document matching leads to better parallel sentence extraction, better sentence matching also leads to better document matching. Based on this, we use multi-level bootstrapping to iteratively improve the alignments between documents, sentences, and bilingual word pairs. Ours is the first method that does not rely on any supervised training data, such as a sentence-aligned corpus, or on temporal information, such as the publishing date of a news article. It is validated by experimental results that show a 23% improvement over a method without multi-level bootstrapping.
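The multi-level loop can be sketched structurally as follows. In this toy, identical tokens stand in for bilingual word pairs, documents are matched by lexicon overlap, and the lexicon is regrown from matched sentence pairs; the paper's actual matching models are far richer than these stand-ins.

```python
def overlap(doc_a, doc_b, lexicon):
    # number of lexicon words shared by the two documents' vocabularies
    va = {w for s in doc_a for w in s}
    vb = {w for s in doc_b for w in s}
    return len(va & vb & lexicon)

def bootstrap(docs_a, docs_b, seed_lexicon, iters=2):
    # Toy multi-level bootstrapping: (1) match documents by lexicon
    # overlap, (2) keep sentence pairs sharing a lexicon word, (3) grow
    # the lexicon from matched pairs, then repeat so each level improves
    # the others.
    lexicon = set(seed_lexicon)
    sent_pairs = []
    for _ in range(iters):
        doc_pairs = [(da, max(docs_b, key=lambda db: overlap(da, db, lexicon)))
                     for da in docs_a]
        sent_pairs = [(sa, sb)
                      for da, db in doc_pairs
                      for sa in da for sb in db
                      if set(sa) & set(sb) & lexicon]
        for sa, sb in sent_pairs:
            lexicon |= set(sa) & set(sb)
    return sent_pairs

docs_a = [[["bank", "loan"], ["cat", "sat"]]]            # one doc, two sentences
docs_b = [[["bank", "loan", "rate"]], [["dog", "ran"]]]  # two candidate docs
print(len(bootstrap(docs_a, docs_b, ["bank"])))  # 1 mined sentence pair
```

The point of the structure is the feedback direction described in the abstract: the lexicon grown at the sentence level changes the document-level scores on the next iteration.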


International Joint Conference on Natural Language Processing | 2005

Inversion transduction grammar constraints for mining parallel sentences from quasi-comparable corpora

Dekai Wu; Pascale Fung

We present a new implication of Wu's (1997) Inversion Transduction Grammar (ITG) Hypothesis for the problem of retrieving truly parallel sentence translations from large collections of highly non-parallel documents. Our approach leverages a strong language-universal constraint posited by the ITG Hypothesis, which can serve as a strong inductive bias for various language learning problems, resulting in both efficiency and accuracy gains. The task we attack is highly practical, since non-parallel multilingual data exists in far greater quantities than parallel corpora, but parallel sentences are a much more useful resource. Our aim here is to mine truly parallel sentences, as opposed to comparable sentence pairs or loose translations as in most previous work. The method we introduce exploits Bracketing ITGs to produce the first known results for this problem. Experiments show that it obtains large accuracy gains on this task compared to the expected performance of state-of-the-art models that were developed for the less stringent task of mining comparable sentence pairs.
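The reordering restriction behind the ITG constraint can be made concrete with a small checker: a permutation of target positions is parsable by a Bracketing ITG exactly when it can be recursively split into two halves that each map to a contiguous target span, in straight or inverted order. This is an illustrative sketch of the constraint, not code from the paper.

```python
def itg_alignable(perm):
    # Check whether a target-position permutation can be produced by a
    # Bracketing ITG: recursively split the source span so that each half
    # maps to a contiguous target span (straight or inverted order).
    n = len(perm)
    if n <= 1:
        return True
    lo = min(perm)
    vals = [p - lo for p in perm]                 # normalize to 0..n-1
    for i in range(1, n):
        left = set(vals[:i])
        straight = left == set(range(i))          # left half maps low
        inverted = left == set(range(n - i, n))   # left half maps high
        if (straight or inverted) and itg_alignable(perm[:i]) and itg_alignable(perm[i:]):
            return True
    return False

print(itg_alignable([1, 0, 3, 2]))  # True: two inverted pairs compose fine
print(itg_alignable([1, 3, 0, 2]))  # False: the classic "inside-out" case
```

Sentence pairs whose best word alignment violates this check cannot be truly parallel under the hypothesis, which is what lets the constraint filter candidates efficiently.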

Collaboration


An overview of Pascale Fung's collaborations.

Top Co-Authors (all at the Hong Kong University of Science and Technology):

- Yi Liu
- Dario Bertero
- Justin Jian Zhang
- Dekai Wu
- Anik Dey
- Percy Chi Shun Cheung
- Chien-Sheng Wu
- Farhad Bin Siddique
- Ricky Ho Yin Chan
- Ying Li