Tom B. Y. Lai
City University of Hong Kong
Publication
Featured research published by Tom B. Y. Lai.
International Conference on Computational Linguistics | 2004
Raymond W.M. Yuen; Terence Y.W. Chan; Tom B. Y. Lai; Oi Yee Kwong; Benjamin Ka-Yin T'sou
The evaluative character of a word is called its semantic orientation (SO). A positive SO indicates desirability (e.g., Good, Honest) and a negative SO indicates undesirability (e.g., Bad, Ugly). This paper presents a method, based on Turney (2003), for inferring the SO of a word from its statistical association with strongly polarized words and morphemes in Chinese. Morphemes are far less numerous than words, and a small number of fundamental morphemes can be used in the modified system to great advantage. The algorithm was tested on 1,249 words (604 positive and 645 negative) in a corpus of 34 million words. Run with 20 and 40 polarized words respectively, it gave high precision (79.96% to 81.05%) but low recall (45.56% to 59.57%). Run with 20 polarized morphemes, or single characters, in the same corpus, it gave a high precision of 80.23% and a high recall of 85.03%. We conclude that morphemes in Chinese, as in any language, constitute a distinct sub-lexical unit which, though small in number, has greater linguistic significance than words, as seen in the significant improvement in results with a much smaller corpus than that required by Turney.
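The association-based scoring described in this abstract can be illustrated with a minimal sketch in the style of Turney's SO-PMI: a word's orientation is its summed pointwise mutual information with positive seeds minus that with negative seeds. The function name `so_pmi`, the document-level co-occurrence counting, the additive smoothing constant, and the toy corpus are all assumptions made for illustration; they are not taken from the paper.

```python
import math

def so_pmi(word, docs, pos_seeds, neg_seeds, smoothing=0.01):
    """Semantic orientation from co-occurrence with polarized seeds.

    docs is a list of token sets; two items co-occur when they appear in
    the same set (an assumed, simplified co-occurrence window).
    """
    n = len(docs)

    def count(*terms):
        # Number of documents containing every given term.
        return sum(1 for d in docs if all(t in d for t in terms))

    def pmi(a, b):
        # Smoothed pointwise mutual information, in bits.
        joint = count(a, b) + smoothing
        return math.log2(joint * n / ((count(a) + smoothing) * (count(b) + smoothing)))

    return (sum(pmi(word, p) for p in pos_seeds)
            - sum(pmi(word, q) for q in neg_seeds))

# Toy corpus (hypothetical); seeds may be whole words or single characters.
toy_docs = [{"kind", "good"}, {"kind", "good"}, {"cruel", "bad"}, {"cruel", "bad"}]
kind_score = so_pmi("kind", toy_docs, ["good"], ["bad"])
cruel_score = so_pmi("cruel", toy_docs, ["good"], ["bad"])
```

A positive score is read as a positive SO and a negative score as a negative SO. Using single characters as seeds only changes what is passed in `pos_seeds` and `neg_seeds` and how the token sets are built.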
North American Chapter of the Association for Computational Linguistics | 2000
Samuel W. K. Chan; Tom B. Y. Lai; W. J. Gao; Benjamin K. Tsou
Discourse markers foreshadow the message thrust of texts and saliently guide their rhetorical structure, which is important for content filtering and text abstraction. This paper reports on efforts to automatically identify and classify discourse markers in Chinese texts using heuristic-based and corpus-based data-mining methods, as an integral part of automatic text summarization via rhetorical structure and discourse markers. Encouraging results are reported.
Meeting of the Association for Computational Linguistics | 2000
Benjamin K. Tsou; Tom B. Y. Lai; Samuel W. K. Chan; Weijun Gao; Xuegang Zhan
Discourse markers are complex discontinuous linguistic expressions which are used to explicitly signal the discourse structure of a text. This paper describes efforts to improve an automatic tagging system which identifies and classifies discourse markers in Chinese texts by applying machine learning (ML) to the disambiguation of discourse markers, as an integral part of automatic text summarization via rhetorical structure. Encouraging results are reported.
International Journal of Computer Processing of Languages | 2005
Benjamin Ka-Yin T'sou; Oi Yee Kwong; Wei Lung Wong; Tom B. Y. Lai
Typical news coverage contains both objective facts and subjective sentiments. This is especially true of coverage of newsworthy individuals and organizations, and of media opinion on strategic subjects. Analysis, either on demand or on a longitudinal basis, provides a critical source of information that has so far been neither readily nor economically obtainable for a range of meaningful purposes. One application is the monitoring of positive or negative summative news coverage on targeted subjects.
International Joint Conference on Natural Language Processing | 2004
Benjamin K. Tsou; Tom B. Y. Lai; Ka-po Chow
Using a large synchronous Chinese corpus, we show how word and character entropy variations exhibit interesting differences in terms of time and space for different Chinese speech communities. We find that word entropy values are affected by the quality of the segmentation process. We also note that word entropies can be affected by proper nouns, which are the most volatile segment of the stable lexicon of the language. Our word and character entropy results provide an interesting comparison with earlier results. The average joint character entropies (a.k.a. entropy rates) of Chinese up to order 20 that we provide indicate that the limits of the conditional character entropies of Chinese for the different speech communities should be about 1 (or less). This invites questions on whether early convergence of character entropies would also entail word entropy convergence.
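For readers unfamiliar with the quantities named in this abstract, the following is a minimal sketch of a plug-in estimate of conditional character entropy, H(next character | preceding `order` characters), in bits. The function name and the plain maximum-likelihood estimate are assumptions for illustration; the paper's own estimation procedure is not given here, and estimates at high orders are unreliable without a very large corpus.

```python
import math
from collections import Counter

def conditional_entropy(text, order):
    """Plug-in estimate of H(X_t | X_{t-order} .. X_{t-1}) in bits.

    With order=0 this reduces to the ordinary single-character entropy.
    """
    # Count each (context, next character) pair observed in the text.
    pairs = Counter((text[i:i + order], text[i + order])
                    for i in range(len(text) - order))
    # Context counts are derived from the same positions as the pairs.
    contexts = Counter()
    for (ctx, _), c in pairs.items():
        contexts[ctx] += c
    total = sum(pairs.values())
    return -sum(c / total * math.log2(c / contexts[ctx])
                for (ctx, _), c in pairs.items())

# Toy string (hypothetical): two equiprobable symbols that strictly alternate.
h0 = conditional_entropy("abababab", 0)
h1 = conditional_entropy("abababab", 1)
```

In the toy string each symbol alone carries one bit, but once the previous symbol is known the next one is fully determined, so the order-1 estimate drops to zero. The same function works on Chinese text, since Python strings index by character.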
Software - Practice and Experience | 2003
Robert W. P. Luk; Benjamin K. Tsou; Tom B. Y. Lai; Oi Yee Kwong; Francis C. Y. Chik; Lawrence Y. L. Cheung
In certain bilingual and multi‐lingual societies, translated legal documents are as important as the original legal documents because they have the same legal status as the originals. However, there is little reported work on the retrieval and management of bilingual legal documents. We describe the design and development of a bilingual document retrieval and management prototype, called ELDoS, which is used by court interpreters and judges from the Hong Kong Judiciary. Since the speed of retrieval is a major concern for user acceptance, and therefore for widespread deployment of the system, the architecture of the prototype is designed to balance the workload of the client and server. Extensible Markup Language (XML) is used to mark up the bilingual legal documents for a variety of document retrieval and management tasks. XML enables the use of Extensible Stylesheet Language Transformations (XSLT) to align bilingual data in the client, instead of the server, and improve alignment speed linearly with respect to the size of the document, using a high‐end PC, when the server has no concurrent access. The design of the interface was continually improved after extensive consultation with court interpreters and after the user acceptance tests. In our evaluation, the facilities for highlighting translated terms have a macro‐averaged precision of 90+% and a macro‐averaged recall of 80+%, which were considered acceptable by our users. We believe that the experience in the design and development of this prototype is applicable to other language pairs as well as to other domains.
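The macro-averaged figures quoted in this abstract are averages of per-unit scores rather than scores computed from pooled counts. The sketch below shows that calculation under the assumption that the unit is a document and that each document contributes counts of correctly highlighted terms, spurious highlights, and missed terms; the abstract does not state the unit, and the function name and toy counts are hypothetical.

```python
def macro_precision_recall(per_unit):
    """per_unit: list of (true_pos, false_pos, false_neg) tuples, one per unit."""
    # Score every unit separately, then average, so each unit weighs equally.
    precisions = [tp / (tp + fp) if tp + fp else 0.0 for tp, fp, fn in per_unit]
    recalls = [tp / (tp + fn) if tp + fn else 0.0 for tp, fp, fn in per_unit]
    n = len(per_unit)
    return sum(precisions) / n, sum(recalls) / n

# Toy counts (hypothetical): one large, well-handled unit and one small, poor one.
macro_p, macro_r = macro_precision_recall([(9, 1, 1), (1, 1, 3)])
```

Because every unit carries the same weight, a short document with poor highlighting lowers the macro average as much as a long one would, which is the usual reason for reporting it alongside or instead of a pooled (micro) average.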
中文計算語言學期刊 | 1998
Hing-lung Lin; Tom B. Y. Lai; Samuel W. K. Chan
In Chinese text, discourse connectives constitute a major linguistic device available for a writer to explicitly indicate the structure of a discourse. This set of discourse connectives, consisting of a few hundred entries in modern Chinese, is relatively stable and domain independent. In a recently published paper [Tsou 1996], a computational procedure was introduced to generate the abstract of an input text using mainly the discourse connectives appearing in the text. This paper attempts to demonstrate the validity of this approach to full-text abstraction by means of an evaluation method that compares human efforts in text abstraction with the performance of an experimental system called ACFAS. Specifically, we are concerned with the relationship between the perceived importance of each individual sentence, as judged by human beings, and the sentences containing discourse connectives within an argumentative discourse.
International Conference on Computational Linguistics | 2002
Lawrence Y. L. Cheung; Tom B. Y. Lai; Robert W. P. Luk; Oi Yee Kwong; King Kui Sin; Benjamin K. Tsou
Despite progress in the development of computational means, human input is still critical in the production of consistent and useable aligned corpora and term banks. This is especially true for specialized corpora and term banks whose end-users are often professionals with very stringent requirements for accuracy, consistency and coverage. In the compilation of a high-quality Chinese-English legal glossary for the ELDoS project, we have identified a number of issues that make the role of human input critical for term alignment and extraction. They include the identification of low-frequency terms, paraphrastic expressions and discontinuous units, and the maintenance of consistent term granularity. Although manual intervention can more satisfactorily address these issues, steps must also be taken to address intra- and inter-annotator inconsistency.
International Journal of Computer Processing of Languages | 2006
Benjamin K. Tsou; Tom B. Y. Lai; King Kui Sin; Lawrence Y. L. Cheung
Implementation of legal bilingualism in Hong Kong after 1997 has necessitated the production of voluminous and extensive court proceedings and judgments in both Chinese and English. For the Chinese records, Cantonese, a dialect of Chinese, is the home language of more than 90% of the population in Hong Kong and is thus officially used in the courts. For the court proceedings, Cantonese speech has to be recorded, and a Cantonese Computer-Aided Transcription system has been developed for this purpose. The transcription system converts stenographic codes into Chinese text, i.e. from a phonetic to an orthographic representation of the language. The main challenge lies in resolving the severe ambiguity that results from homocode problems in the conversion process, since Cantonese is typified by extensive homonymy. The N-gram statistical model is employed to estimate the most probable character string for the input transcription codes. Domain-specific corpora have been compiled to support the statistical computation. To improve accuracy, scalable techniques such as domain-specific transcription and special encoding are used. Put together, these techniques deliver 96% transcription accuracy.
International Conference on Computational Linguistics | 2000
Benjamin K. Tsou; King Kui Sin; Samuel W. K. Chan; Tom B. Y. Lai; Caesar Suen Lun; K. T. Ko; Gary K. K. Chan; Lawrence Y. L. Cheung
A Cantonese Chinese transcription system to automatically convert stenograph code to Chinese characters is reported. The major challenge in developing such a system is the critical homocode problem because of homonymy. The statistical N-gram model is used to compute the best combination of characters. Supplemented with a 0.85 million character corpus of domain-specific training data and enhancement measures, the bigram and trigram implementations achieve 95% and 96% accuracy respectively, as compared with 78% accuracy in the baseline model. The system performance is comparable with other advanced Chinese Speech-to-Text input applications under development. The system meets an urgent need of the Judiciary of post-1997 Hong Kong.
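The homocode problem and its N-gram resolution can be sketched as a Viterbi search over candidate characters under a bigram model. Everything concrete below is an assumption made for illustration: the code strings `FAAT` and `JYUN`, the candidate table, the bigram probabilities, the probability floor for unseen pairs, and the function name `decode` are hypothetical and do not reproduce the system's actual stenograph codes or training data.

```python
import math

def decode(codes, candidates, bigram, floor=1e-6):
    """Return the most probable character string for a code sequence.

    candidates maps each code to the characters that share it (homocodes);
    bigram maps (previous_char, char) to a probability; unseen pairs get floor.
    """
    # best[ch] = (log-probability of the best path ending in ch, that path)
    best = {"<s>": (0.0, [])}
    for code in codes:
        step = {}
        for ch in candidates[code]:
            # Keep only the best-scoring way of reaching this character.
            step[ch] = max(
                (score + math.log(bigram.get((prev, ch), floor)), path + [ch])
                for prev, (score, path) in best.items()
            )
        best = step
    return "".join(max(best.values())[1])

# Toy data (hypothetical): each code is shared by two characters.
toy_candidates = {"FAAT": ["法", "發"], "JYUN": ["院", "願"]}
toy_bigram = {("<s>", "法"): 0.5, ("<s>", "發"): 0.5,
              ("法", "院"): 0.6, ("發", "願"): 0.1}
result = decode(["FAAT", "JYUN"], toy_candidates, toy_bigram)
```

Neither code is resolvable alone, since both characters of each pair are equally likely in isolation; the bigram context is what selects 法院 ("court"). A trigram variant would track the last two characters in each path state instead of one.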