Cheng-Wei Shih | Researchain

Publication

Featured research published by Cheng-Wei Shih.

中文計算語言學期刊 (International Journal of Computational Linguistics and Chinese Language Processing) | 2004

Mencius: A Chinese Named Entity Recognizer Using the Maximum Entropy-based Hybrid Model

Tzong-Han Tsai; Shih-Hung Wu; Cheng-Wei Lee; Cheng-Wei Shih; Wen-Lian Hsu

This paper presents Mencius, a Chinese named entity recognizer (NER). It aims to address Chinese NER problems by combining the advantages of rule-based and machine learning (ML) based NER systems. Rule-based NER systems can explicitly encode human comprehension and can be tuned conveniently, while ML-based systems are robust, portable and inexpensive to develop. Our hybrid system incorporates a rule-based knowledge representation and template-matching tool, called InfoMap [Wu et al. 2002], into a maximum entropy (ME) framework. Named entities are represented in InfoMap as templates, which serve as ME features in Mencius. These features are edited manually, and their weights are estimated by the ME framework according to the training data. To understand how word segmentation might influence Chinese NER and the differences between a pure template-based method and our hybrid method, we configure Mencius using four distinct settings. The F-measures for person names (PER), location names (LOC) and organization names (ORG) under the best configuration in our experiment were 94.3%, 77.8% and 75.3%, respectively. Comparing the experimental results obtained using these configurations reveals that hybrid NER systems always perform better in identifying person names. On the other hand, they have a little difficulty identifying location and organization names. Furthermore, using a word segmentation module improves the performance of pure template-based NER systems but has little effect on hybrid NER systems.

ACM Transactions on Asian Language Information Processing | 2012

Validating Contradiction in Texts Using Online Co-Mention Pattern Checking

Cheng-Wei Shih; Cheng-Wei Lee; Richard Tzong-Han Tsai; Wen-Lian Hsu

Detecting contradictory statements is a foundational and challenging task for text understanding applications such as textual entailment. In this article, we aim to address the shortage of specific background knowledge in contradiction detection. To tackle the problem, we propose a novel contradiction detection approach based on the distribution on the Internet of queries composed of critical mismatch combinations. By measuring the availability of mismatch conjunction phrases (MCPs), the background knowledge about two target statements can be implicitly obtained for identifying contradictions. Experiments on three different configurations show that the MCP-based approach achieves remarkable improvement on contradiction detection and can significantly improve the performance of textual entailment recognition.

ACM Transactions on Asian Language Information Processing | 2008

Boosting Chinese Question Answering with Two Lightweight Methods: ABSPs and SCO-QAT

Cheng-Wei Lee; Min-Yuh Day; Cheng-Lung Sung; Yi-Hsun Lee; Tian-Jian Jiang; Chia-Wei Wu; Cheng-Wei Shih; Yu-Ren Chen; Wen-Lian Hsu

Question Answering (QA) research has been conducted in many languages. Nearly all the top-performing systems use heavy methods that require sophisticated techniques, such as parsers or logic provers. However, such techniques are usually unavailable or unaffordable for under-resourced languages or in resource-limited situations. In this article, we describe how a top-performing Chinese QA system can be designed by using lightweight methods effectively. We propose two lightweight methods, namely the Sum of Co-occurrences of Question and Answer Terms (SCO-QAT) and Alignment-based Surface Patterns (ABSPs). SCO-QAT is a co-occurrence-based answer-ranking method that does not need extra knowledge, word-ignoring heuristic rules, or tools. It calculates co-occurrence scores based on the passage retrieval results. ABSPs are syntactic patterns trained from question-answer pairs with a multiple alignment algorithm. They capture the relations between terms, and these relations are then used to filter answers. We attribute the success of the ABSPs and SCO-QAT methods to the effective use of local syntactic information and global co-occurrence information. By using SCO-QAT and ABSPs, we improved the RU-Accuracy of our testbed QA system, ASQA, from 0.445 to 0.535 on the NTCIR-5 dataset. ASQA also achieved the top RU-Accuracy of 0.5 on the NTCIR-6 dataset. The results show that lightweight methods are not only cheaper to implement, but also have the potential to achieve state-of-the-art performance.
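To make the idea of co-occurrence-based answer ranking more concrete, here is a minimal Python sketch. It is not the SCO-QAT formula from the article, which the abstract does not give. It is a simplified stand-in that assumes each candidate answer is scored by summing, over the question terms, the fraction of retrieved passages containing the term that also contain the candidate. The function names, the substring matching, and the English toy passages are all hypothetical.

```python
def co_occurrence_score(candidate, question_terms, passages):
    """Simplified co-occurrence score (an assumption, not the published formula).

    For each question term, take the share of passages containing that term
    which also contain the candidate answer, then sum the shares.
    """
    score = 0.0
    for term in question_terms:
        passages_with_term = [p for p in passages if term in p]
        if not passages_with_term:
            continue  # the term never appears, so it contributes nothing
        both = sum(1 for p in passages_with_term if candidate in p)
        score += both / len(passages_with_term)
    return score


def rank_candidates(candidates, question_terms, passages):
    """Sort candidate answers from highest to lowest co-occurrence score."""
    return sorted(
        candidates,
        key=lambda c: co_occurrence_score(c, question_terms, passages),
        reverse=True,
    )


# Toy data standing in for passage retrieval results.
passages = [
    "Taipei is the capital of Taiwan",
    "The capital of Taiwan is Taipei",
    "Kaohsiung is a port city in Taiwan",
]
question_terms = ["capital", "Taiwan"]
ranking = rank_candidates(["Kaohsiung", "Taipei"], question_terms, passages)
```

The sketch uses only the retrieved passages, with no parser or external knowledge base, which is the sense in which such a method is lightweight.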

International Journal of Computational Linguistics & Chinese Language Processing, Volume 17, Number 3, September 2012 | 2012

Enhancement of Feature Engineering for Conditional Random Field Learning in Chinese Word Segmentation Using Unlabeled Data

Mike Tian-Jian Jiang; Cheng-Wei Shih; Ting-Hao Yang; Chan-Hung Kuo; Richard Tzong-Han Tsai; Wen-Lian Hsu

This work proposes a unified view of several features based on frequent strings extracted from unlabeled data that improve the conditional random fields (CRF) model for Chinese word segmentation (CWS). These features include character-based n-gram (CNG), accessor variety based string (AVS) and its variation of left-right co-existed feature (LRAVS), term-contributed frequency (TCF), and term-contributed boundary (TCB) with a specific manner of boundary overlapping. For the experiments, the baseline uses the 6-tag scheme, a state-of-the-art labeling scheme for CRF-based CWS, and the data sets are taken from the 2005 CWS Bakeoff of the Special Interest Group on Chinese Language Processing (SIGHAN) of the Association for Computational Linguistics (ACL) and from the SIGHAN CWS Bakeoff 2010. The experimental results show that all of these features improve the performance of the baseline system in terms of recall, precision, and their harmonic mean (the F1 score), both for overall accuracy (F) and for out-of-vocabulary recognition (FOOV). In particular, this work presents compound features involving LRAVS/AVS and TCF/TCB that are competitive with other types of features for CRF-based CWS in terms of F and FOOV, respectively.
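Two ingredients named in this abstract can be illustrated with a short Python sketch: the 6-tag labeling scheme and an accessor-variety count over unlabeled text. Both functions are assumptions for illustration, not the authors' implementation. The tag names (B, B2, B3, M, E, S) follow the commonly used 6-tag convention. The accessor variety of a string is taken here as the smaller of its distinct left-neighbor and distinct right-neighbor counts, with a single shared marker for sentence boundaries as a simplification. Latin letters stand in for Chinese characters.

```python
from collections import defaultdict


def six_tags(word):
    """Label each character of a segmented word with the 6-tag scheme."""
    n = len(word)
    if n == 1:
        return ["S"]  # single-character word
    head = ["B", "B2", "B3"][: n - 1]        # up to three leading position tags
    middle = ["M"] * (n - 1 - len(head))     # remaining interior characters
    return head + middle + ["E"]             # last character ends the word


def accessor_variety(corpus, max_len=4):
    """Map each substring (up to max_len characters) to min(left AV, right AV).

    Assumption: sentence boundaries count as one shared neighbor symbol.
    """
    left = defaultdict(set)
    right = defaultdict(set)
    for sentence in corpus:
        n = len(sentence)
        for i in range(n):
            for j in range(i + 1, min(i + max_len, n) + 1):
                s = sentence[i:j]
                left[s].add(sentence[i - 1] if i > 0 else "<s>")
                right[s].add(sentence[j] if j < n else "</s>")
    return {s: min(len(left[s]), len(right[s])) for s in left}


# Toy unlabeled corpus.
av = accessor_variety(["abcd", "xbcy", "bc"])
tags = six_tags("abcde")
```

A string such as "bc" that appears in many different contexts receives a high accessor variety, which suggests it behaves like an independent word. Scores of this kind could then be discretized and supplied to a CRF as extra features alongside the 6-tag labels.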

NTCIR | 2005

ASQA: Academia Sinica Question Answering System for NTCIR-5 CLQA

Cheng-Wei Lee; Cheng-Wei Shih; Min-Yuh Day; Tzong-Han Tsai; Mike Tian-Jian Jiang; Chia-Wei Wu; Cheng-Lung Sung; Yu-Ren Chen; Shih-Hung Wu; Wen-Lian Hsu

NTCIR | 2008