Grace Ngai
Hong Kong Polytechnic University
Publication
Featured research published by Grace Ngai.
international conference on human language technology research | 2001
David Yarowsky; Grace Ngai; Richard Wicentowski
This paper describes a system and set of algorithms for automatically inducing stand-alone monolingual part-of-speech taggers, base noun-phrase bracketers, named-entity taggers and morphological analyzers for an arbitrary foreign language. Case studies include French, Chinese, Czech and Spanish. Existing text analysis tools for English are applied to bilingual text corpora and their output projected onto the second language via statistically derived word alignments. Simple direct annotation projection is quite noisy, however, even with optimal alignments. Thus this paper presents noise-robust tagger, bracketer and lemmatizer training procedures capable of accurate system bootstrapping from noisy and incomplete initial projections. Performance of the induced stand-alone part-of-speech tagger applied to French achieves 96% core part-of-speech (POS) tag accuracy, and the corresponding induced noun-phrase bracketer exceeds 91% F-measure. The induced morphological analyzer achieves over 99% lemmatization accuracy on the complete French verbal system. This achievement is particularly noteworthy in that it required absolutely no hand-annotated training data in the given language, and virtually no language-specific knowledge or resources beyond raw text. Performance also significantly exceeds that obtained by direct annotation projection.
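For intuition, the direct annotation projection step the abstract describes can be sketched as below; the alignment format, tag names, and function are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of direct annotation projection across word alignments.
# Assumes a list of (english_index, foreign_index) alignment pairs per
# sentence pair; names and data format are illustrative, not the paper's.

def project_tags(english_tags, alignment, foreign_length):
    """Copy each English POS tag onto its aligned foreign word."""
    projected = [None] * foreign_length   # None marks unaligned words
    for en_i, fr_j in alignment:
        projected[fr_j] = english_tags[en_i]
    return projected

# Example: "the red house" -> "la maison rouge"
english_tags = ["DET", "ADJ", "NOUN"]
alignment = [(0, 0), (1, 2), (2, 1)]   # the->la, red->rouge, house->maison
print(project_tags(english_tags, alignment, 3))
# ['DET', 'NOUN', 'ADJ']
```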
north american chapter of the association for computational linguistics | 2001
Grace Ngai; Radu Florian
Transformation-based learning has been successfully employed to solve many natural language processing problems. It achieves state-of-the-art performance on many natural language processing tasks and does not overtrain easily. However, it does have a serious drawback: the training time is often intolerably long, especially on the large corpora which are often used in NLP. In this paper, we present a novel and realistic method for speeding up the training time of a transformation-based learner without sacrificing performance. The paper compares and contrasts the training time needed and performance achieved by our modified learner with two other systems: a standard transformation-based learner, and the ICA system (Hepple, 2000). The results of these experiments show that our system is able to achieve a significant improvement in training time while still achieving the same performance as a standard transformation-based learner. This is a valuable contribution to systems and algorithms which utilize transformation-based learning in any part of their execution.
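For readers unfamiliar with transformation-based learning, the sketch below shows the rule-scoring loop whose cost dominates training (the step such speedups target); the rule template, tags, and data are illustrative assumptions, not the paper's method.

```python
# Toy sketch of one transformation-based learning iteration for POS
# tagging: score each candidate rule by its net error reduction on the
# training data, then apply the best one.

def score_rule(rule, tags, gold, words):
    """Net corrections minus new errors if `rule` were applied."""
    from_tag, to_tag, trigger = rule
    gain = 0
    for i in range(len(words)):
        if tags[i] == from_tag and trigger(words, tags, i):
            if gold[i] == to_tag:
                gain += 1          # rule fixes an error
            elif gold[i] == from_tag:
                gain -= 1          # rule breaks a correct tag
    return gain

def apply_rule(rule, tags, words):
    from_tag, to_tag, trigger = rule
    return [to_tag if t == from_tag and trigger(words, tags, i) else t
            for i, t in enumerate(tags)]

# One candidate rule: change NN to VB when the previous word is "to".
rule = ("NN", "VB", lambda ws, ts, i: i > 0 and ws[i - 1] == "to")
words = ["to", "run", "a", "run"]
gold  = ["TO", "VB", "DT", "NN"]   # reference tags
tags  = ["TO", "NN", "DT", "NN"]   # initial-state tagger output
if score_rule(rule, tags, gold, words) > 0:
    tags = apply_rule(rule, tags, words)
print(tags)  # ['TO', 'VB', 'DT', 'NN']
```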
north american chapter of the association for computational linguistics | 2001
David Yarowsky; Grace Ngai
This paper investigates the potential for projecting linguistic annotations including part-of-speech tags and base noun phrase bracketings from one language to another via automatically word-aligned parallel corpora. First, experiments assess the accuracy of unmodified direct transfer of tags and brackets from the source language English to the target languages French and Chinese, both for noisy machine-aligned sentences and for clean hand-aligned sentences. Performance is then substantially boosted over both of these baselines by using training techniques optimized for very noisy data, yielding 94-96% core French part-of-speech tag accuracy and 90% French bracketing F-measure for stand-alone monolingual tools trained without the need for any human-annotated data in the given language.
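One simple flavor of training from very noisy projections can be sketched as follows: aggregate the noisy projected tags for each word across the corpus and keep the majority tag. This majority-tag filtering is an illustrative simplification in the spirit of the paper, not the authors' exact procedure.

```python
# Sketch of a simple noise-robustness idea: replace each word's noisy
# projected tags with the word's corpus-wide majority tag, skipping
# unaligned words. Illustrative only, not the paper's exact method.
from collections import Counter, defaultdict

def majority_tag_lexicon(projected_corpus):
    """projected_corpus: iterable of (word, tag) pairs with noisy tags."""
    counts = defaultdict(Counter)
    for word, tag in projected_corpus:
        if tag is not None:          # skip unaligned words
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

noisy = [("chat", "NOUN"), ("chat", "NOUN"), ("chat", "VERB"),
         ("le", "DET"), ("le", None)]
print(majority_tag_lexicon(noisy))  # {'chat': 'NOUN', 'le': 'DET'}
```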
meeting of the association for computational linguistics | 2000
Grace Ngai; David Yarowsky
This paper presents a comprehensive empirical comparison between two approaches for developing a base noun phrase chunker: human rule writing and active learning using interactive real-time human annotation. Several novel variations on active learning are investigated, and underlying cost models for cross-modal machine learning comparison are presented and explored. Results show that it is more efficient and more successful by several measures to train a system using active learning annotation rather than hand-crafted rule writing at a comparable level of human labor investment.
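A minimal runnable sketch of the pool-based active learning loop underlying such a comparison, using least-confidence uncertainty sampling; the 2-D toy data, logistic-regression learner, and synthetic oracle are illustrative assumptions, not the paper's chunker or human annotators.

```python
# Pool-based active learning with least-confidence uncertainty sampling:
# repeatedly train, pick the items the current model is least sure
# about, and "ask the annotator" for just those.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)        # synthetic "annotator"

order = np.argsort(X[:, 0] + X[:, 1])
labeled = list(order[:5]) + list(order[-5:])   # seed: 5 clear examples per class
pool = [i for i in range(200) if i not in set(labeled)]

for _ in range(5):                             # 5 annotation rounds
    clf = LogisticRegression().fit(X[labeled], y[labeled])
    proba = clf.predict_proba(X[pool])
    uncertainty = 1 - proba.max(axis=1)        # least-confident first
    pick = np.argsort(-uncertainty)[:10]       # query these 10 items
    labeled += [pool[i] for i in pick]
    pool = [p for i, p in enumerate(pool) if i not in set(pick)]

clf = LogisticRegression().fit(X[labeled], y[labeled])
print("accuracy on full set:", clf.score(X, y))
```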
north american chapter of the association for computational linguistics | 2003
Dekai Wu; Grace Ngai; Marine Carpuat
This paper investigates stacking and voting methods for combining strong classifiers like boosting, SVM, and TBL, on the named-entity recognition task. We demonstrate several effective approaches, culminating in a model that achieves error rate reductions on the development and test sets of 63.6% and 55.0% (English) and 47.0% and 51.7% (German) over the CoNLL-2003 standard baseline respectively, and 19.7% over a strong AdaBoost baseline model from CoNLL-2002.
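The simplest of the combination schemes studied, per-token majority voting, can be sketched as follows (the tags and inputs are toy assumptions); stacking instead feeds the base classifiers' outputs as features to a meta-classifier.

```python
# Per-token majority voting over the outputs of several classifiers.
from collections import Counter

def vote(predictions):
    """predictions: list of tag sequences, one per base classifier."""
    return [Counter(tags).most_common(1)[0][0]
            for tags in zip(*predictions)]

boost = ["B-PER", "I-PER", "O", "B-LOC"]
svm   = ["B-PER", "I-PER", "O", "O"]
tbl   = ["B-PER", "O",     "O", "B-LOC"]
print(vote([boost, svm, tbl]))
# ['B-PER', 'I-PER', 'O', 'B-LOC']
```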
international world wide web conferences | 2011
Cane Wing-ki Leung; Stephen Chi-fai Chan; Fu-Lai Chung; Grace Ngai
We propose a novel Probabilistic Rating infErence Framework, known as Pref, for mining user preferences from reviews and then mapping such preferences onto numerical rating scales. Pref applies existing linguistic processing techniques to extract opinion words and product features from reviews. It then estimates the sentimental orientations (SO) and strength of the opinion words using our proposed relative-frequency-based method. This method allows semantically similar words to have different SO, thereby addressing a major limitation of existing methods. Pref takes the intuitive relationships between class labels, which are scalar ratings, into consideration when assigning ratings to reviews. Empirical results validate the effectiveness of Pref against several related algorithms, and suggest that Pref can produce reasonably good results using a small training corpus. We also describe a useful application of Pref as a rating inference framework. Rating inference transforms user preferences described as natural language texts into numerical rating scales. This allows Collaborative Filtering (CF) algorithms, which operate mostly on databases of scalar ratings, to utilize textual reviews as an additional source of user preferences. We integrated Pref with a classical CF algorithm, and empirically demonstrated the advantages of using rating inference to augment ratings for CF.
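For intuition, a relative-frequency-style orientation estimate might look like the sketch below, which scores a word by how its occurrences distribute over rating classes; this is an illustrative simplification for intuition only, not Pref's actual estimator.

```python
# Score each word by the rating-weighted mean of the review classes
# it appears in; data and function are illustrative assumptions.
from collections import Counter, defaultdict

def orientation(word_counts_by_rating):
    """word_counts_by_rating: {rating: Counter of word frequencies}."""
    totals = defaultdict(float)
    mass = defaultdict(float)
    for rating, counts in word_counts_by_rating.items():
        for word, n in counts.items():
            totals[word] += rating * n
            mass[word] += n
    return {w: totals[w] / mass[w] for w in totals}

counts = {1: Counter({"awful": 9, "small": 3}),
          5: Counter({"superb": 8, "small": 1})}
print(orientation(counts))
# {'awful': 1.0, 'small': 2.0, 'superb': 5.0}
```

Note how "small" lands between the extremes: under a scheme like this, a word's orientation is driven by usage statistics rather than by its nearest synonyms, which is the intuition behind letting semantically similar words carry different SO.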
ACM Transactions on Speech and Language Processing | 2006
Pascale Fung; Grace Ngai
This article presents a multidocument, multilingual, theme-based summarization system based on modeling text cohesion (story flow). Conventional extractive summarization systems, which pick out salient sentences to include in a summary, often disregard any flow or sequence that might exist between these sentences. We argue that such inherent text cohesion exists and is (1) specific to a particular story and (2) specific to a particular language. Documents within the same story, and in the same language, share a common story flow, and this flow differs across stories and across languages. We propose using Hidden Markov Models (HMMs) as story models. An unsupervised segmental K-means method is used to iteratively cluster multiple documents into different topics (stories) and learn the parameters of parallel Hidden Markov Story Models (HMSM), one for each story. We compare story models within and across stories and within and across languages (English and Chinese). The experimental results support our “one story, one flow” and “one language, one flow” hypotheses. We also propose a Naïve Bayes classifier for document summarization. The performance of our summarizer is superior to conventional methods that do not incorporate text cohesion information. Our HMSM method also provides a simple way to compile a single metasummary for multiple documents from individual summaries via state-labeled sentences.
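Decoding with such a story model reduces to standard HMM inference: given per-sentence emission likelihoods under each discourse state and the learned transitions, Viterbi recovers the most likely state sequence. The sketch below uses toy probabilities; the state inventory and numbers are illustrative assumptions.

```python
# Viterbi decoding of a story-flow HMM: label each sentence with its
# most likely discourse state given emissions and transitions.
import numpy as np

def viterbi(log_trans, log_emit, log_init):
    """log_emit: (T sentences, K states) log-likelihoods."""
    T, K = log_emit.shape
    score = log_init + log_emit[0]
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_trans      # cand[from, to]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_emit[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# 3 discourse states, 4 sentences, toy numbers
log_trans = np.log([[.8, .15, .05], [.1, .8, .1], [.05, .15, .8]])
log_emit = np.log([[.7, .2, .1], [.6, .3, .1], [.1, .7, .2], [.05, .1, .85]])
log_init = np.log([.8, .1, .1])
print(viterbi(log_trans, log_emit, log_init))  # [0, 0, 1, 2]
```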
technical symposium on computer science education | 2009
Winnie W.Y. Lau; Grace Ngai; Stephen Chi-fai Chan; Joey C.Y. Cheung
As enrollments in engineering and computer science programs around the world have fallen in recent years, those who wish to see this trend reversed take heart from findings that children are more likely to develop an abiding interest in technology if they are exposed to it at an early age [3, 9]. In line with this research, we now see more summer camps and workshops being offered to middle school students with the objective of teaching programming and computer technology [1, 6, 8, 12]. To offer students a stimulating and interesting environment while teaching computing subjects, the learning tools in these camps usually revolve around robots and graphical programming of animations or games. These tools tend to mainly attract youngsters who already like robotics or game design. However, we believe that we can improve the diversity of the student pool by introducing other topics. In this paper, we describe our experience in designing and organizing a programming course for middle school students that focuses on wearable computing, fashion and design. We show that 1) wearable computing is interesting and inspiring to the students, 2) wearable computing motivates both boys and girls to learn technology and computing, which implies that it may be able to increase the potential computer science population, and 3) wearable computing can provide a space for students to exercise their creativity while at the same time teaching them about technology and programming.
international conference on computational linguistics | 2002
Dekai Wu; Grace Ngai; Marine Carpuat; Jeppe Larsen; Yongsheng Yang
This paper presents a system that applies boosting to the task of named-entity identification. The CoNLL-2002 shared task, for which the system is designed, is language-independent named-entity recognition. Using a set of features which are easily obtainable for almost any language, the presented system uses boosting to combine a set of weak classifiers into a final system that performs significantly better than an off-the-shelf maximum entropy classifier.
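The core idea, combining many weak learners trained on simple token features into one strong classifier, can be sketched with an off-the-shelf AdaBoost implementation; the features, labels, and data below are toy assumptions, not the system's actual feature set.

```python
# Boosting for a per-token named-entity decision. scikit-learn's
# AdaBoostClassifier uses depth-1 decision stumps as its default
# weak learner.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

# Toy per-token features: [is_capitalized, word_length, prev_is_capitalized]
X = np.array([[1, 5, 0], [0, 3, 1], [1, 7, 1],
              [0, 2, 0], [1, 6, 0], [0, 4, 1]])
y = np.array([1, 0, 1, 0, 1, 0])   # 1 = token is part of a named entity

clf = AdaBoostClassifier(n_estimators=50)
clf.fit(X, y)
print(clf.predict([[1, 4, 1]]))    # classify a new capitalized token
```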
ACM Transactions on Asian Language Information Processing | 2004
Pascale Fung; Grace Ngai; Yongsheng Yang; Benfeng Chen
Parsing, the task of identifying syntactic components, e.g., noun and verb phrases, in a sentence, is one of the fundamental tasks in natural language processing. Many natural language applications, such as spoken-language understanding, machine translation, and information extraction, would benefit from, or even require, high-accuracy parsing as a preprocessing step. Even though most state-of-the-art statistical parsers were initially constructed for parsing in English, most of them are not language-specific, in that they do not rely on properties of the language that are specific to English. Therefore, constructing a parser for a given language becomes a matter of retraining the statistical parameters with a treebank in the corresponding language. The development of the Chinese Treebank [Xia et al. 2000] spurred the construction of parsers for Chinese. However, Chinese as a language poses some unique problems for the development of a statistical parser, the most apparent being word segmentation. Since words in written Chinese are not delimited in the same way as in Western languages, the first problem that needs to be solved before an existing statistical method can be applied to Chinese is to identify the word boundaries. This step is neglected by most pre-existing Chinese parsers, which assume that the input data has already been pre-segmented. This article describes a character-based statistical parser, which gives the best performance to date on the Chinese Treebank data. We augment an existing maximum entropy parser with transformation-based learning, creating a parser that can operate at the character level. We present experiments showing that our parser achieves results close to those achievable under perfect word segmentation conditions.
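The character-level view can be illustrated by recasting segmentation as per-character tagging, so that segmentation decisions live in the same sequence model as parsing; the B/I tag scheme and example below are illustrative assumptions, not the parser's actual representation.

```python
# Word segmentation as per-character tagging: B begins a word,
# I continues one. A tagger over characters then induces the
# segmentation directly.

def chars_to_words(chars, tags):
    """Rebuild words from characters and B/I segmentation tags."""
    words = []
    for ch, tag in zip(chars, tags):
        if tag == "B" or not words:    # also tolerates a leading "I"
            words.append(ch)
        else:
            words[-1] += ch
    return words

chars = list("我喜欢香港")              # "I like Hong Kong"
tags = ["B", "B", "I", "B", "I"]       # 我 | 喜欢 | 香港
print(chars_to_words(chars, tags))     # ['我', '喜欢', '香港']
```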