Keh-Jiann Chen | Researchain

Publication

Featured researches published by Keh-Jiann Chen.

International Conference on Computational Linguistics | 1992

Word identification for Mandarin Chinese sentences

Keh-Jiann Chen; Shing-Huan Liu

Chinese sentences are composed of strings of characters, with no blanks to mark word boundaries. However, the basic unit for sentence parsing and understanding is the word, so the first step in processing Chinese sentences is to identify the words. The difficulties of word identification include (1) identifying complex words, such as determinative-measure compounds, reduplications, and derived words; (2) identifying proper names; and (3) resolving ambiguous segmentations. In this paper, we propose possible solutions to these difficulties. We adopt a matching algorithm with six heuristic rules to resolve the ambiguities and achieve a success rate of 99.77%. The statistical data support the conclusion that maximal matching is the most effective heuristic.
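To make the maximal matching heuristic named in this abstract concrete, here is a minimal sketch of greedy forward maximal matching. It is an illustration only, not the paper's six-rule algorithm: the function name `max_match`, the toy lexicon, and the fallback to single characters are all assumptions made for the example.

```python
def max_match(sentence, lexicon, max_len=None):
    """Greedy forward maximal matching (illustrative sketch, not the paper's system).

    At each position, take the longest lexicon entry that starts there;
    fall back to a single character when nothing in the lexicon matches.
    """
    if max_len is None:
        max_len = max((len(w) for w in lexicon), default=1)
    words = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first, shrinking until a lexicon word matches.
        for length in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + length]
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words
```

With the hypothetical lexicon `{"研究", "研究生", "生命", "起源"}`, `max_match("研究生命起源", lexicon)` returns `["研究生", "命", "起源"]` rather than the intended `["研究", "生命", "起源"]`, which shows the kind of segmentation ambiguity that additional heuristic rules are meant to resolve.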

Proceedings of the Second SIGHAN Workshop on Chinese Language Processing | 2003

Introduction to CKIP Chinese Word Segmentation System for the First International Chinese Word Segmentation Bakeoff

Wei-Yun Ma; Keh-Jiann Chen

In this paper, we briefly describe the procedures of our segmentation system, including the methods for resolving segmentation ambiguities and identifying unknown words. The CKIP group of Academia Sinica participated in testing on the open and closed tracks of Beijing University (PK) and Hong Kong CityU (HK). The evaluation results show that our system performs very well in both the HK open track and the HK closed track, and only acceptably in the PK tracks. Some explanations and analysis are presented in this paper.

International Conference on Computational Linguistics | 2002

Unknown word extraction for Chinese documents

Keh-Jiann Chen; Wei-Yun Ma

There are no blanks to mark word boundaries in Chinese text. As a result, identifying words is difficult because of segmentation ambiguities and occurrences of unknown words. Conventionally, unknown words were extracted by statistical methods because such methods are simple and efficient. However, statistical methods that do not use linguistic knowledge suffer from low precision and low recall, since character strings with statistical significance might be phrases or partial phrases instead of words, and low-frequency new words are hardly identifiable by statistical methods. In addition to statistical information, we try to use as much information as possible, such as morphology, syntax, semantics, and world knowledge. The identification system fully utilizes the context and content information of unknown words in the detection, extraction, and verification processes. A practical unknown word extraction system was implemented that identifies new words online, including low-frequency new words, with high precision and high recall.

International Conference on Speech, Image Processing and Neural Networks | 1994

Golden Mandarin (II)-an intelligent Mandarin dictation machine for Chinese character input with adaptation/learning functions

Lin-Shan Lee; Keh-Jiann Chen; Chiu-yu Tseng; Ren-Yuan Lyu; Lee-Feng Chien; Hsin-Min Wang; Jia-Lin Shen; Sung-Chien Lin; Yen-Ju Yang; Bo-Ren Bai; Chi-ping Nee; Chun-Yi Liao; Shueh-Sheng Lin; Chung-Shu Yang; I-Jung Hung; Ming-Yu Lee; Rei-Chang Wang; Bo-Shen Lin; Yuan-Cheng Chang; Rung-Chiung Yang; Yung-Chi Huang; Chen-Yuan Lou; Tung-Sheng Lin

Golden Mandarin (II) is an intelligent, single-chip, real-time Mandarin dictation machine with a very large vocabulary, designed for voice input of unlimited Chinese text into computers. The machine can be installed on any personal computer and uses only a single Motorola DSP 96002D chip, with a preliminary character correct rate of around 95% at a speed of 0.6 sec per character. Various adaptation/learning functions have been developed for this machine, including fast adaptation to new speakers and on-line learning of the users' voice characteristics, task domains, word patterns, and noise environments, so the machine can be easily personalized for each user. These adaptation/learning functions are the major subjects of the paper.

Computational Linguistics and Chinese Language Processing (中文計算語言學期刊) | 1998

Unknown Word Detection for Chinese by a Corpus-based Learning Method

Keh-Jiann Chen; Ming-Hong Bai

One of the most prominent problems in computer processing of the Chinese language is identification of the words in a sentence. Since there are no blanks to mark word boundaries, identifying words is difficult because of segmentation ambiguities and occurrences of out-of-vocabulary words (i.e., unknown words). In this paper, a corpus-based learning method is proposed which derives sets of syntactic rules that are applied to distinguish monosyllabic words from monosyllabic morphemes which may be parts of unknown words or typographical errors. The corpus-based learning approach has the advantages of: 1. automatic rule learning, 2. automatic evaluation of the performance of each rule, and 3. balancing of recall and precision rates through dynamic rule set selection. The experimental results show that the rule set derived using the proposed method outperformed hand-crafted rules produced by human experts in detecting unknown words.
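The idea of evaluating each rule automatically and then trading recall against precision by choosing which rules to keep can be sketched roughly as follows. This is a simplified illustration, not the paper's procedure: the `RuleStats` record, the accuracy threshold, and the assumption that rules fire on disjoint instances are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuleStats:
    name: str     # hypothetical rule identifier
    correct: int  # times the rule fired correctly on an annotated corpus
    fired: int    # total times the rule fired

    @property
    def accuracy(self) -> float:
        return self.correct / self.fired if self.fired else 0.0


def select_rules(rules, min_accuracy):
    """Keep only the rules whose measured accuracy reaches the threshold."""
    return [r for r in rules if r.accuracy >= min_accuracy]


def precision_recall(selected, total_targets):
    """Precision and recall of a rule set, assuming rules fire on disjoint instances."""
    fired = sum(r.fired for r in selected)
    correct = sum(r.correct for r in selected)
    precision = correct / fired if fired else 0.0
    recall = correct / total_targets if total_targets else 0.0
    return precision, recall
```

Raising `min_accuracy` keeps fewer but more reliable rules (higher precision, lower recall); lowering it admits more rules (higher recall, lower precision).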

International Conference on Acoustics, Speech, and Signal Processing | 1993

Golden Mandarin (II)-an improved single-chip real-time Mandarin dictation machine for Chinese language with very large vocabulary

Lin-Shan Lee; Chiu-yu Tseng; Keh-Jiann Chen; I-Jung Hung; Ming-Yu Lee; Lee-Feng Chien; Yumin Lee; Ren-Yuan Lyu; Hsin-Min Wang; Yung-Chuan Wu; Tung-Sheng Lin; Hung-yan Gu; Chi-ping Nee; Chun-Yi Liao; Yeng-Ju Yang; Yuan-Cheng Chang; Rung-Chiung Yang

Golden Mandarin (II) is an improved single-chip real-time Mandarin dictation machine with a very large vocabulary for the input of unlimited Chinese sentences into computers using voice. The machine uses only a single Motorola DSP 96002D chip on an Ariel DSP-96 card, with a preliminary character correct rate of around 95% in speaker-dependent mode at a speed of 0.36 s per character. This is achieved by many new techniques, primarily a segmental probability modeling technique for syllable recognition that takes into account the characteristics of Mandarin syllables, and a word-lattice-based Chinese character bigram for character identification that takes into account the structure of the Chinese language.

Proceedings of the Second SIGHAN Workshop on Chinese Language Processing | 2003

A Bottom-up Merging Algorithm for Chinese Unknown Word Extraction

Wei-Yun Ma; Keh-Jiann Chen

Statistical methods for extracting Chinese unknown words usually suffer from the problem that superfluous character strings with strong statistical associations are extracted as well. To solve this problem, this paper proposes using a set of general morphological rules to broaden coverage; at the same time, the rules are augmented with linguistic and statistical constraints to increase the precision of the representation. To disambiguate rule applications and reduce the complexity of rule matching, a bottom-up merging algorithm for extraction is proposed, which merges possible morphemes recursively by consulting the general rules and dynamically decides which rule should be applied first according to the priorities of the rules. The effects of different priority strategies are compared in our experiment, and the experimental results show that the performance of the proposed method is very promising.
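The priority-driven, bottom-up merging described in this abstract might look roughly like the sketch below. It is a toy illustration under stated assumptions: rules are represented as hypothetical `(priority, predicate)` pairs over adjacent tokens, whereas the paper's rules carry linguistic and statistical constraints that are not modeled here.

```python
def bottom_up_merge(tokens, rules):
    """Repeatedly merge adjacent tokens, always trying higher-priority rules first.

    rules: list of (priority, predicate) pairs, where predicate(left, right)
    returns True if the two adjacent tokens should be merged (hypothetical format).
    """
    tokens = list(tokens)
    ordered = sorted(rules, key=lambda rule: rule[0], reverse=True)
    merged = True
    while merged:
        merged = False
        for _, predicate in ordered:
            for i in range(len(tokens) - 1):
                if predicate(tokens[i], tokens[i + 1]):
                    # Replace the pair with the merged token and restart from
                    # the highest-priority rule, since new merges may now apply.
                    tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
                    merged = True
                    break
            if merged:
                break
    return tokens
```

Each merge shortens the token list by one, so the loop always terminates. Restarting from the highest-priority rule after every merge is one possible priority strategy; the abstract notes that several strategies were compared.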

Meeting of the Association for Computational Linguistics | 2000

Sinica Treebank: Design Criteria, Annotation Guidelines, and On-line Interface

Chu-Ren Huang; Fengyi Chen; Keh-Jiann Chen; Zhao-Ming Gao; Kuang-Yu Chen

This paper describes the design criteria and annotation guidelines of Sinica Treebank. The three design criteria are: Maximal Resource Sharing, Minimal Structural Complexity, and Optimal Semantic Information. One of the important design decisions following these criteria is the encoding of thematic role information. An on-line interface facilitating empirical studies of Chinese phrase structure is also described.

International Conference on Acoustics, Speech, and Signal Processing | 1997

Internet Chinese information retrieval using unconstrained Mandarin speech queries based on a client-server architecture and a PAT-tree-based language model

Lee-Feng Chien; Sung-Chien Lin; Jenn-Chau Hong; Ming-Chiuan Chen; Hsin-Min Wang; Jia-Lin Shen; Keh-Jiann Chen; Lin-Shan Lee

To achieve high-performance Chinese information access on the Internet, this paper presents an approach that integrates efficient speech recognition and information retrieval techniques. A working system based on the proposed approach for speech retrieval of real-time Chinese netnews services has been implemented and tested. Very encouraging performance has been achieved.

Computational Linguistics and Chinese Language Processing (中文計算語言學期刊) | 2000

The Module-Attribute Representation of Verbal Semantics: From Semantics to Argument Structure

Chu-Ren Huang; Kathleen Ahrens; Li-Li Chang; Keh-Jiann Chen; Meichun Liu; Mei-Chih Tsai

In this paper, we set forth a theory of lexical knowledge. We propose two types of modules: event structure modules and role modules, as well as two sets of attributes: event-internal attributes and role-internal attributes, which are linked to the event structure module and role module, respectively. These module-attribute semantic representations have associated grammatical consequences. Our data is drawn from a comprehensive corpus-based study of Mandarin Chinese verbal semantics, and four particular case studies are presented.
