Publication


Featured research published by Chunyu Kit.


International Conference on Computational Linguistics | 1992

Tokenization as the initial phase in NLP

Jonathan J. Webster; Chunyu Kit

In this paper, the authors address the significance and complexity of tokenization, the initial step of NLP. The notions of word and token are discussed and defined from the viewpoints of lexicography and pragmatic implementation, respectively. Automatic segmentation of Chinese words is presented as an illustration of tokenization. Practical approaches to the identification of compound tokens in English, such as idioms, phrasal verbs and fixed expressions, are developed.
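The compound-token identification described above can be sketched as a greedy longest-match over a multiword-expression lexicon. The toy lexicon below is invented for illustration; the paper's actual resources and matching rules are richer:

```python
# Hypothetical sketch: greedy longest-match identification of compound
# tokens (idioms, phrasal verbs, fixed expressions) during tokenization.
MWE_LEXICON = {("kick", "the", "bucket"), ("give", "up"), ("in", "spite", "of")}
MAX_MWE_LEN = max(len(m) for m in MWE_LEXICON)

def tokenize(text):
    words = text.lower().split()
    tokens, i = [], 0
    while i < len(words):
        # Try the longest candidate first, backing off to shorter spans.
        for n in range(min(MAX_MWE_LEN, len(words) - i), 1, -1):
            cand = tuple(words[i:i + n])
            if cand in MWE_LEXICON:
                tokens.append("_".join(cand))  # emit one compound token
                i += n
                break
        else:
            tokens.append(words[i])  # no compound starts here
            i += 1
    return tokens

print(tokenize("He decided to give up in spite of the odds"))
# ['he', 'decided', 'to', 'give_up', 'in_spite_of', 'the', 'odds']
```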


International Joint Conference on Natural Language Processing | 2009

Cross Language Dependency Parsing using a Bilingual Lexicon

Hai Zhao; Yan Song; Chunyu Kit; Guodong Zhou

This paper proposes an approach to enhance dependency parsing in one language by using a treebank translated from another language. A simple statistical machine translation method, word-by-word decoding, which requires only a bilingual lexicon rather than a parallel corpus, is adopted for the treebank translation. Using an ensemble method, the key information extracted from word pairs with dependency relations in the translated text is effectively integrated into the parser for the target language. The proposed method is evaluated on English and Chinese treebanks. It is shown that a translated English treebank helps a Chinese parser obtain a state-of-the-art result.
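The word-by-word decoding step might look like this minimal sketch; the lexicon is invented for illustration, and the point is that no parallel corpus or alignment model is needed:

```python
# Hypothetical sketch of word-by-word treebank translation with only a
# bilingual lexicon: each source token is mapped independently, so the
# dependency arcs over token positions carry over unchanged.
def translate_word_by_word(tokens, lexicon):
    # Unknown words are kept as-is rather than dropped.
    return [lexicon.get(t, t) for t in tokens]

toy_lexicon = {"我": "I", "喜欢": "like", "苹果": "apples"}
print(translate_word_by_word(["我", "喜欢", "苹果"], toy_lexicon))
# ['I', 'like', 'apples']
```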


Information Sciences | 2011

Integrating unsupervised and supervised word segmentation: The role of goodness measures

Hai Zhao; Chunyu Kit

This study explores the feasibility of integrating unsupervised and supervised segmentation of Chinese texts to push performance beyond the present state of the art, focusing on the critical role of the former in enhancing the latter. Guided only by a pre-defined goodness measure, unsupervised segmentation has the advantage of discovering many new words in raw texts, but the disadvantage of inevitably corrupting many known ones. By contrast, supervised segmentation, conventionally trained only on a pre-segmented corpus, is particularly good at identifying known words but has little intrinsic mechanism to deal with unseen ones until it is formulated as character tagging. To combine their strengths, we empirically evaluate a set of goodness measures, among which description length gain excels in word discovery, and simple strategies such as word candidate pruning and ensemble segmentation can further improve it. Interestingly, however, accessor variety and boundary entropy, two other goodness measures, prove more effective in enhancing the supervised learning of character tagging with the conditional random fields model. All goodness scores are discretized into feature values to enrich this model. The success of this approach is verified by our experiments on the benchmark data sets of the last two Bakeoffs: on average, it achieves an error reduction of 6.39% over the best closed-test performance in Bakeoff-3, and it ranks first in all five closed-test tracks in Bakeoff-4, outperforming other participants significantly and consistently with an error reduction of 8.96%.
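As one concrete example of a goodness measure, boundary entropy can be computed roughly as follows. This is a simplified sketch: the paper's formulation operates over full corpora with normalization, and also covers description length gain and accessor variety:

```python
import math
from collections import Counter

def boundary_entropy(corpus, s):
    # Entropy of the distribution of characters immediately following s.
    # A high value suggests s is an independent unit and hence a
    # plausible word candidate (illustrative sketch only).
    followers = Counter()
    start = corpus.find(s)
    while start != -1:
        nxt = start + len(s)
        if nxt < len(corpus):
            followers[corpus[nxt]] += 1
        start = corpus.find(s, start + 1)
    total = sum(followers.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in followers.values())
```

For example, in the toy corpus "ababac" the string "a" is followed twice by "b" and once by "c", giving an entropy of about 0.918 bits.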


Information Sciences | 2008

Chinese word segmentation as morpheme-based lexical chunking

Guo-Hong Fu; Chunyu Kit; Jonathan J. Webster

Chinese word segmentation plays an important role in many Chinese language processing tasks such as information retrieval and text mining. Recent research in Chinese word segmentation focuses on tagging approaches with either characters or words as tagging units. In this paper we present a morpheme-based chunking approach and implement it in a two-stage system. It consists of two main components: a morpheme segmentation component that segments an input sentence into a sequence of morphemes based on morpheme-formation models and bigram language models, and a lexical chunking component that labels each segmented morpheme's position in a word of a special type with the aid of lexicalized hidden Markov models. To facilitate these tasks, a statistically-based technique is also developed for automatically compiling a morpheme dictionary from a segmented or tagged corpus. To evaluate this approach, we conduct a closed test and an open test using the 2005 SIGHAN Bakeoff data. Our system demonstrates state-of-the-art performance on different test sets, showing the benefits of choosing morphemes as tagging units. Furthermore, the open test results indicate significant performance enhancement from lexicalization and part-of-speech features.
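The position-labeling view of lexical chunking can be illustrated by mapping each morpheme to a tag for its position within its word. A common B/M/E/S scheme is assumed here for illustration; the paper's tag set is richer:

```python
def position_tags(words):
    # words: a list of words, each word a sequence of morphemes.
    # Tags: S = single-morpheme word, B = begin, M = middle, E = end.
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags

print(position_tags([["香", "港"], ["是"], ["大", "学", "城"]]))
# ['B', 'E', 'S', 'B', 'M', 'E']
```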


International Joint Conference on Natural Language Processing | 2004

Unsupervised segmentation of Chinese corpus using accessor variety

Haodi Feng; Kang Chen; Chunyu Kit; Xiaotie Deng

The lack of word delimiters such as spaces in Chinese texts makes word segmentation a special issue in Chinese text processing. As the volume of Chinese texts on the Internet grows rapidly, the number of unknown words increases accordingly, yet word segmentation approaches relying solely on existing dictionaries are helpless in handling them. In this paper, we propose a novel unsupervised method to segment large Chinese corpora using contextual information. In particular, the number of distinct characters preceding and following a string, known as the accessors of the string, is used to measure the independence of the string: the greater the independence, the more likely it is that the string is a word. The segmentation problem is then cast as an optimization problem that maximizes a target function of this number over all word candidates in an utterance. Our purpose here is to explore the best such function in terms of segmentation performance, which is evaluated with the word token recall measure in addition to word type precision and word type recall. Among the three types of target functions we have explored, polynomial functions turn out to outperform the others. This simple method is effective for unsupervised segmentation of Chinese texts, and its performance is highly comparable to other recently reported unsupervised segmentation methods.
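A minimal sketch of the accessor-variety score follows. It is simplified: the original formulation also counts sentence boundaries as accessors and distinguishes left and right accessor variety before taking the minimum:

```python
def accessor_variety(corpus, s):
    # Number of distinct left/right neighbor characters of s in the
    # corpus; the min of the two is the accessor-variety score.
    # Corpus boundaries are counted as pseudo-accessors here.
    left, right = set(), set()
    i = corpus.find(s)
    while i != -1:
        left.add(corpus[i - 1] if i > 0 else "<s>")
        j = i + len(s)
        right.add(corpus[j] if j < len(corpus) else "</s>")
        i = corpus.find(s, i + 1)
    return min(len(left), len(right))

print(accessor_variety("abcab", "ab"))  # 2
```

A string that occurs in many distinct contexts scores high and is thus a stronger word candidate.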


Conference on Computational Natural Language Learning | 2008

Parsing Syntactic and Semantic Dependencies with Two Single-Stage Maximum Entropy Models

Hai Zhao; Chunyu Kit

This paper describes our system for the joint parsing of syntactic and semantic dependencies, built for our participation in the CoNLL-2008 shared task. We illustrate that both syntactic parsing and semantic parsing can be transformed into a word-pair classification problem and implemented as a single-stage system with the aid of maximum entropy modeling. Our system ranks fourth in the closed track, with the following performance on the WSJ+Brown test set: 81.44% labeled macro F1 for the overall task, 86.66% labeled attachment for syntactic dependencies, and 76.16% labeled F1 for semantic dependencies.
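The transformation of dependency parsing into word-pair classification can be sketched as instance generation. The `heads` encoding below (heads[j] is the index of token j's head, -1 for the root) is an assumption for illustration; real systems add rich features and a maximum entropy classifier over these instances:

```python
def word_pair_instances(sentence, heads):
    # Turn a dependency tree into binary word-pair classification
    # instances: ((head_word, dep_word), label) with label 1 iff
    # token i is actually the head of token j.
    words = sentence.split()
    instances = []
    for j, dep in enumerate(words):
        for i, head in enumerate(words):
            if i != j:
                instances.append(((head, dep), int(heads[j] == i)))
    return instances

print(word_pair_instances("I eat", [1, -1]))
# [(('eat', 'I'), 1), (('I', 'eat'), 0)]
```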


International Conference on Natural Language Processing | 2003

Transductive HMM based Chinese text chunking

Heng Li; Jonathan J. Webster; Chunyu Kit; Tianshun Yao

We present a novel methodology to enhance Chinese text chunking with the aid of transductive hidden Markov models (transductive HMMs henceforth). We treat chunking as a special tagging problem and attempt to utilize, via a number of transformation functions, as much relevant contextual information as possible for model training. These functions enable the models to exploit contextual information to a greater extent while avoiding costly changes to the original training and tagging process. Each of them results in an individual model with certain pros and cons. Through a number of experiments, we succeed in integrating the two best models into a significantly better one. We carry out the chunking experiments on the HIT Chinese Treebank corpus. Experimental results show that this is an effective approach, achieving an F-score of 82.38%.


Journal of Artificial Intelligence Research | 2013

Integrative semantic dependency parsing via efficient large-scale feature selection

Hai Zhao; Xiaotian Zhang; Chunyu Kit

Semantic parsing, i.e., the automatic derivation of a meaning representation such as an instantiated predicate-argument structure for a sentence, plays a critical role in deep processing of natural language. Unlike other top systems for semantic dependency parsing, which rely on a pipeline framework chaining up a series of submodels each specialized for a specific subtask, the one presented in this article integrates everything into one model, in hopes of achieving desirable integrity and practicality for real applications while maintaining competitive performance. This integrative approach tackles semantic parsing as a word-pair classification problem using a maximum entropy classifier. We leverage adaptive pruning of argument candidates and large-scale feature selection engineering to allow the largest feature space in use so far in this field. The model achieves state-of-the-art performance on the evaluation data set for the CoNLL-2008 shared task, surpassing all but one of the top pipeline systems and confirming its feasibility and effectiveness.
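Large-scale feature selection of the kind described is often realized as greedy forward selection over feature templates. A minimal sketch, assuming `evaluate` is a callback returning a development-set score for a candidate feature set (both names are hypothetical, not the paper's API):

```python
def greedy_feature_selection(features, evaluate):
    # Greedy forward selection: repeatedly add any feature template
    # that improves the dev-set score, until no addition helps.
    selected, best = [], evaluate([])
    improved = True
    while improved:
        improved = False
        for f in features:
            if f in selected:
                continue
            score = evaluate(selected + [f])
            if score > best:
                best, selected = score, selected + [f]
                improved = True
    return selected

# Toy scorer: only features "a" and "c" help; the rest are noise.
chosen = greedy_feature_selection(
    ["a", "b", "c"], lambda fs: sum(1 for f in fs if f in ("a", "c"))
)
print(chosen)  # ['a', 'c']
```

This first-improvement variant is the simplest form; a best-improvement variant re-scores all remaining templates each round at higher cost.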


International Conference on the Computer Processing of Oriental Languages | 2009

An Extractive Text Summarizer Based on Significant Words

Xiaoyue Liu; Jonathan J. Webster; Chunyu Kit

Document summarization can be viewed as a reductive distilling of source text through content condensation, while words carrying high quantities of information are believed to convey more content and thereby importance. In this paper, we propose a new quantification measure for word significance in natural language processing (NLP) tasks and successfully apply it to an extractive text summarization approach. In a query-based summarization setting, the correlation between user queries and sentences to be scored is established from both the micro (i.e., word-level) and the macro (i.e., sentence-level) perspectives, resulting in an effective ranking formula. The experiments, both on a generic single-document summarization evaluation and on a query-based multi-document evaluation, verify the effectiveness of the proposed measures and show that the proposed approach achieves state-of-the-art performance.
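A toy version of significance-based extractive summarization follows. The significance function here, a simple frequency-based information weight, is an invented stand-in, not the paper's measure:

```python
import math
from collections import Counter

def summarize(sentences, k=1):
    # Weight each word by -p*log2(p) over its corpus-level relative
    # frequency (a hypothetical significance score), then rank
    # sentences by the total significance of their distinct words
    # and extract the top k as the summary.
    tf = Counter(w for s in sentences for w in s.lower().split())
    total = sum(tf.values())

    def significance(w):
        p = tf[w] / total
        return -p * math.log2(p)

    ranked = sorted(
        sentences,
        key=lambda s: sum(significance(w) for w in set(s.lower().split())),
        reverse=True,
    )
    return ranked[:k]
```

Query-based scoring would add a term correlating each sentence with the query at both the word and sentence levels, as the abstract describes.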


Machine Translation | 2009

ATEC: automatic evaluation of machine translation via word choice and word order

Billy Tak-Ming Wong; Chunyu Kit

We propose a novel metric, ATEC, for automatic MT evaluation based on explicit assessment of word choice and word order in an MT output in comparison to its reference translation(s), the two most fundamental factors in the construction of meaning for a sentence. The former is assessed by matching word forms at various linguistic levels, including surface form, stem, sound and sense, and further by weighing the informativeness of each word. The latter is quantified in terms of the discordance of word position and word sequence between a translation candidate and its reference. In evaluations using the MetricsMATR08 data set and the LDC MTC2 and MTC4 corpora, ATEC demonstrates an impressive positive correlation with human judgments at the segment level, highly comparable to the few state-of-the-art evaluation metrics.
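The word-order component can be illustrated by a simple position-discordance score, a hypothetical simplification of ATEC's actual formulation (which also weighs word informativeness and matches at the stem, sound and sense levels):

```python
def order_discordance(candidate, reference):
    # Average normalized difference between each matched word's relative
    # position in the candidate and in the reference; 0.0 means the
    # matched words appear in identical relative positions.
    cand, ref = candidate.split(), reference.split()
    diffs = []
    for i, w in enumerate(cand):
        if w in ref:
            j = ref.index(w)  # first matching position in the reference
            diffs.append(abs(i / len(cand) - j / len(ref)))
    return sum(diffs) / len(diffs) if diffs else 1.0

print(order_discordance("a b c", "a b c"))  # 0.0
```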

Collaboration


Dive into Chunyu Kit's collaborations.

Top Co-Authors

Hai Zhao (Shanghai Jiao Tong University)
Jonathan J. Webster (City University of Hong Kong)
Xiaoyue Liu (City University of Hong Kong)
Yan Song (City University of Hong Kong)
Ruifeng Xu (City University of Hong Kong)
Billy Tak-Ming Wong (City University of Hong Kong)
Xiao Chen (City University of Hong Kong)
Haihua Pan (City University of Hong Kong)
Zhi-Ming Xu (Harbin Institute of Technology)