
Publication


Featured research published by Guohong Fu.


SIGKDD Explorations | 2005

Chinese named entity recognition using lexicalized HMMs

Guohong Fu; K.K. Luke

This paper presents a lexicalized HMM-based approach to Chinese named entity recognition (NER). To tackle the problem of unknown words, we unify unknown word identification and NER as a single tagging task on a sequence of known words. To do this, we first employ a known-word bigram-based model to segment a sentence into a sequence of known words, and then apply the uniformly lexicalized HMMs to assign each known word a proper hybrid tag that indicates its pattern in forming an entity and the category of the formed entity. Our system is able to integrate both the internal formation patterns and the surrounding contextual clues for NER under the framework of HMMs. As a result, the performance of the system can be improved without losing its efficiency in training and tagging. We have tested our system using different public corpora. The results show that lexicalized HMMs can substantially improve NER performance over standard HMMs. The results also indicate that character-based tagging (viz. tagging based on pure single-character words) is comparable to, and can even outperform, the corresponding known-word-based tagging when a lexicalization technique is applied.
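The hybrid-tag decoding described above amounts to a standard Viterbi pass in which each known word is assigned one tag that jointly encodes formation pattern and entity category. The following is a minimal sketch of that idea only; the tiny tag set and all probability tables are invented toy values, not the paper's trained model:

```python
TAGS = ["O", "B-PER", "E-PER"]  # tiny illustrative hybrid tag set

def log_emit(tag, word):
    # Lexicalized emission: scores condition on the word itself.
    table = {("B-PER", "guo"): -0.5, ("E-PER", "hong"): -0.5,
             ("O", "said"): -0.3}
    return table.get((tag, word), -3.0)

def log_trans(prev, cur):
    table = {("B-PER", "E-PER"): -0.2, ("E-PER", "O"): -0.5,
             ("<s>", "B-PER"): -0.7, ("<s>", "O"): -0.9, ("O", "O"): -0.4}
    return table.get((prev, cur), -4.0)

def viterbi(words):
    # history[t] maps tag -> (best log score, backpointer into previous layer)
    history = [{t: (log_trans("<s>", t) + log_emit(t, words[0]), None)
                for t in TAGS}]
    for w in words[1:]:
        layer = {}
        for t in TAGS:
            score, prev = max((history[-1][p][0] + log_trans(p, t)
                               + log_emit(t, w), p) for p in TAGS)
            layer[t] = (score, prev)
        history.append(layer)
    # Backtrace from the best final tag.
    tag = max(TAGS, key=lambda t: history[-1][t][0])
    path = [tag]
    for layer in reversed(history[1:]):
        tag = layer[tag][1]
        path.append(tag)
    return list(reversed(path))
```

With these toy scores, `viterbi(["guo", "hong", "said"])` yields `["B-PER", "E-PER", "O"]`, i.e. the first two words are grouped into one person entity.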


International Joint Conference on Natural Language Processing | 2004

Chinese unknown word identification using class-based LM

Guohong Fu; K.K. Luke

This paper presents a modified class-based LM approach to Chinese unknown word identification. In this work, Chinese unknown word identification is viewed as a classification problem and the part-of-speech of each unknown word is defined as its class. Furthermore, three types of features, including contextual class feature, word juncture model and word formation patterns, are combined in a framework of class-based LM to perform correct unknown word identification on a sequence of known words. In addition to unknown word identification, the class-based LM approach also provides a solution for unknown word tagging. The results of our experiments show that most unknown words in Chinese texts can be resolved effectively by the proposed approach.
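The abstract describes combining three feature types (contextual class, word juncture, word formation) to pick a part-of-speech class for an unknown word. A hypothetical sketch of such a combination, with all component scores invented for illustration, might look like:

```python
def candidate_score(context_logp, juncture_logp, formation_logp,
                    weights=(1.0, 1.0, 1.0)):
    """Combine the three log-probability features; higher is better."""
    w1, w2, w3 = weights
    return w1 * context_logp + w2 * juncture_logp + w3 * formation_logp

def best_class(candidates):
    """candidates: {pos_class: (context, juncture, formation) log-probs}."""
    return max(candidates, key=lambda c: candidate_score(*candidates[c]))

# Invented example scores for one unknown-word candidate:
example = {"noun": (-1.0, -0.5, -0.8), "verb": (-2.0, -1.5, -1.0)}
```

Here `best_class(example)` selects `"noun"`, the class whose combined log score is highest.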


Proceedings of the Second SIGHAN Workshop on Chinese Language Processing | 2003

A Two-stage Statistical Word Segmentation System for Chinese

Guohong Fu; K.K. Luke

In this paper we present a two-stage statistical word segmentation system for Chinese based on word bigram and word-formation models. This system was evaluated on Peking University corpora at the First International Chinese Word Segmentation Bakeoff. We also give results and discussions on this evaluation.
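The first stage described above, bigram-based segmentation, can be sketched as a dynamic program over dictionary matches. The vocabulary and bigram scores below are invented toy values, not the system's actual model:

```python
VOCAB = {"中", "国", "中国", "人", "中国人"}  # toy dictionary

def log_bigram(prev, word):
    # Invented word-bigram log scores; unseen pairs get a low default.
    table = {("<s>", "中国人"): -0.5, ("<s>", "中国"): -0.7,
             ("中国", "人"): -0.6, ("<s>", "中"): -2.0,
             ("中", "国"): -2.0, ("国", "人"): -2.0}
    return table.get((prev, word), -5.0)

def segment(text):
    n = len(text)
    # best[i] = (log score, best segmentation of text[:i])
    best = {0: (0.0, [])}
    for i in range(1, n + 1):
        candidates = []
        for j in range(max(0, i - 4), i):  # words up to 4 characters
            w = text[j:i]
            if w in VOCAB and j in best:
                prev = best[j][1][-1] if best[j][1] else "<s>"
                candidates.append((best[j][0] + log_bigram(prev, w),
                                   best[j][1] + [w]))
        if candidates:
            best[i] = max(candidates)
    return best[n][1]
```

With these toy scores, `segment("中国人")` keeps the whole string as one word rather than splitting it, because the single-word path scores highest.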


International Conference on Machine Learning and Cybernetics | 2004

Chinese unknown word identification as known word tagging

Guohong Fu; K.K. Luke

This work presents a tagging approach to Chinese unknown word identification based on lexicalized hidden Markov models (LHMMs). In this work, Chinese unknown word identification is represented as a tagging task on a sequence of known words by introducing word-formation patterns and part-of-speech. Based on the lexicalized HMMs, a statistical tagger is further developed to assign each known word an appropriate tag that indicates its pattern in forming a word and the part-of-speech of the formed word. The experimental results on the Peking University corpus indicate that the use of the lexicalization technique and the introduction of part-of-speech are helpful for unknown word identification. The experiment on the SIGHAN-PK open test data also shows that our system can achieve state-of-the-art performance.


Asia Information Retrieval Symposium | 2006

Automatic expansion of abbreviations in Chinese news text

Guohong Fu; K.K. Luke; Guodong Zhou; Ruifeng Xu

This paper presents an n-gram based approach to Chinese abbreviation expansion. In this study, we distinguish reduced abbreviations from non-reduced abbreviations that are created by elimination or generalization. For a reduced abbreviation, a mapping table is compiled to map each short-word in it to a set of long-words, and a bigram-based Viterbi algorithm is then applied to decode an appropriate combination of long-words as its full-form. For a non-reduced abbreviation, a dictionary of non-reduced abbreviation/full-form pairs is used to generate its expansion candidates, and a disambiguation technique is further employed to select a proper expansion based on bigram word segmentation. The evaluation on an abbreviation-expanded corpus built from the PKU corpus showed that the proposed system achieved a recall of 82.9% and a precision of 85.5% on average for different types of abbreviations in Chinese news text.
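The reduced-abbreviation path described above (short-word to long-word mapping plus a bigram Viterbi pass over the candidate lattice) can be sketched as follows. The mapping table and bigram scores are invented toy data for illustration, not the compiled resources the paper uses:

```python
def expand(short_words, mapping, log_bigram):
    # lattice[i] maps each candidate long word -> (score, backpointer)
    lattice = [{c: (log_bigram("<s>", c), None)
                for c in mapping[short_words[0]]}]
    for sw in short_words[1:]:
        layer = {}
        for c in mapping[sw]:
            score, prev = max((lattice[-1][p][0] + log_bigram(p, c), p)
                              for p in lattice[-1])
            layer[c] = (score, prev)
        lattice.append(layer)
    # Backtrace the best combination of long words.
    word = max(lattice[-1], key=lambda c: lattice[-1][c][0])
    path = [word]
    for layer in reversed(lattice[1:]):
        word = layer[word][1]
        path.append(word)
    return list(reversed(path))

# Invented toy data: the abbreviation "北大" read as short words "北" + "大".
MAPPING = {"北": ["北京", "北方"], "大": ["大学", "大门"]}

def toy_bigram(prev, cur):
    table = {("<s>", "北京"): -0.4, ("<s>", "北方"): -0.6,
             ("北京", "大学"): -0.3}
    return table.get((prev, cur), -3.0)
```

With these toy scores, `expand(["北", "大"], MAPPING, toy_bigram)` recovers `["北京", "大学"]` as the full form, since that long-word bigram scores highest.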


International Conference on Machine Learning and Cybernetics | 2005

Chinese text chunking using lexicalized HMMs

Guohong Fu; Ruifeng Xu; K.K. Luke; Qin Lu

This paper presents a lexicalized HMM-based approach to Chinese text chunking. To tackle the problem of unknown words, we formalize Chinese text chunking as a tagging task on a sequence of known words. To do this, we employ the uniformly lexicalized HMMs and develop a lattice-based tagger to assign each known word a proper hybrid tag, which involves four types of information: word boundary, POS, chunk boundary and chunk type. In comparison with most previous approaches, our approach is able to integrate different features such as part-of-speech information, chunk-internal cues and contextual information for text chunking under the framework of HMMs. As a result, the performance of the system can be improved without losing its efficiency in training and tagging. Our preliminary experiments on the PolyU Shallow Treebank show that the use of the lexicalization technique can substantially improve the performance of an HMM-based chunking system.


International Journal of Technology and Human Interaction | 2006

Chinese POS Disambiguation and Unknown Word Guessing with Lexicalized HMMs

Guohong Fu; K.K. Luke

This article presents a lexicalized HMM-based approach to Chinese part-of-speech (POS) disambiguation and unknown word guessing (UWG). In order to explore word-internal morphological features for Chinese POS tagging, four types of pattern tags are defined to indicate the way lexicon words are used in a segmented sentence. Such patterns are further combined with POS tags. Thus, Chinese POS disambiguation and UWG can be unified as a single task of assigning each known word in the input a proper hybrid tag. Furthermore, a uniformly lexicalized HMM-based tagger is also developed to perform this task, which can incorporate both internal word-formation patterns and surrounding contextual information for Chinese POS tagging under the framework of HMMs. Experiments on the Peking University Corpus indicate that the tagging precision can be improved with efficiency by the proposed approach.


International Conference on Natural Language Processing | 2003

Integrated approaches to prosodic word prediction for Chinese TTS

Guohong Fu; K.K. Luke

We focus on integrated prosodic word prediction for Chinese TTS. To avoid the problem of inconsistency between lexical words and prosodic words in Chinese, lexical word segmentation and prosodic word prediction are taken as one process instead of two independent tasks. Furthermore, two word-based approaches are proposed to drive this integrated prosodic word prediction: the first follows the notion of lexicalized hidden Markov models, and the second is borrowed from unknown word identification for Chinese. The results of our preliminary experiments show that these integrated approaches are effective.


International Journal of Computer Processing of Languages | 2007

Automatic Expansion of Abbreviations in Chinese News Text: A Hybrid Approach

Guohong Fu; K.K. Luke; Jonathan J. Webster

Chinese news texts often contain a number of abbreviations without explicitly defining their full-forms. Therefore, expanding abbreviations to their original full-forms plays an important role in improving the accuracy of information extraction and retrieval systems for Chinese. In this paper, we present a hybrid approach to automatic expansion of abbreviations in Chinese news texts. Generally, Chinese abbreviations are produced from their original full-forms via reduction, elimination or generalization. To ensure that every abbreviation can successfully be expanded, each abbreviation under expansion is assumed to be created by each of these three methods in turn. Based on this assumption, a mapping table between shortened words and their matrix words, and a dictionary of short-form/full-form pairs, are used to generate all possible expansions for abbreviations. For an ambiguous abbreviation with multiple expansion candidates, hidden Markov models are then employed to rank all its expansion candidates and select the one with the maximum score. In order to further improve expansion performance, linguistic knowledge such as discourse information and abbreviation patterns is utilized to correct possible expansion errors. Evaluation on an abbreviation-expanded corpus built from the Peking University Corpus showed that our approach can achieve 86.3% and 83.8% on average in precision and recall respectively for various types of abbreviations in Chinese news texts.


Journal of Chinese Language and Computing | 2003

An integrated approach for Chinese word segmentation

Guohong Fu; Kang-Kwong Luke

Collaboration

Top Co-Authors

K.K. Luke, Nanyang Technological University
Ruifeng Xu, Harbin Institute of Technology Shenzhen Graduate School
Jonathan J. Webster, City University of Hong Kong
Qin Lu, Hong Kong Polytechnic University