Cheng-Lung Sung | Researchain

Publication

Featured research published by Cheng-Lung Sung.

decision support systems | 2007

Reference metadata extraction using a hierarchical knowledge representation framework

Min-Yuh Day; Richard Tzong-Han Tsai; Cheng-Lung Sung; Chiu-Chen Hsieh; Cheng-Wei Lee; Shih-Hung Wu; Kuen-Pin Wu; Chorng-Shyong Ong; Wen-Lian Hsu

The integration of bibliographical information on scholarly publications available on the Internet is an important task in the academic community. Accurate reference metadata extraction from such publications is essential for the integration of metadata from heterogeneous reference sources. In this paper, we propose a hierarchical template-based reference metadata extraction method for scholarly publications. We adopt a hierarchical knowledge representation framework called INFOMAP, which automatically extracts metadata. The experimental results show that, by using INFOMAP, we can extract author, title, journal, volume, number (issue), year, and page information from different kinds of reference styles with a high degree of precision. The overall average accuracy is 92.39% for the six major reference styles compared in this study.
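As a rough illustration of template-based field extraction, the sketch below matches one reference style with a single regular expression. It is not the INFOMAP framework, which the abstract describes as a hierarchical knowledge representation; the pattern `APA_LIKE`, the function `extract_reference`, and the sample reference are all hypothetical and chosen only for demonstration.

```python
import re

# Hypothetical template for one APA-like reference style (an assumption,
# not taken from the paper): "Author (Year). Title. Journal, Vol(No), Pages."
APA_LIKE = re.compile(
    r"^(?P<author>.+?)\s+\((?P<year>\d{4})\)\.\s+"
    r"(?P<title>.+?)\.\s+"
    r"(?P<journal>[^,]+),\s+"
    r"(?P<volume>\d+)\((?P<number>\d+)\),\s+"
    r"(?P<pages>\d+-\d+)\.$"
)


def extract_reference(ref):
    """Return a dict of metadata fields, or None if the template does not match."""
    match = APA_LIKE.match(ref.strip())
    return match.groupdict() if match else None


# Made-up reference string used purely as sample input.
sample = "Doe, J. (2001). A sample title. Journal of Examples, 12(3), 45-67."
fields = extract_reference(sample)
print(fields)
```

A practical system would presumably need one such template per reference style plus a way to choose among them, which is the part a hierarchical framework is meant to organize.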

BMC Bioinformatics | 2007

BIOSMILE: A semantic role labeling system for biomedical verbs using a maximum-entropy model with automatically generated template features

Richard Tzong-Han Tsai; Wen-Chi Chou; Ying-Shan Su; Yu-Chun Lin; Cheng-Lung Sung; Hong-Jie Dai; Irene Tzu-Hsuan Yeh; Wei Ku; Ting-Yi Sung; Wen-Lian Hsu

Background: Bioinformatics tools for automatic processing of biomedical literature are invaluable for both the design and interpretation of large-scale experiments. Many information extraction (IE) systems that incorporate natural language processing (NLP) techniques have thus been developed for use in the biomedical field. A key IE task in this field is the extraction of biomedical relations, such as protein-protein and gene-disease interactions. However, most biomedical relation extraction systems usually ignore adverbial and prepositional phrases and words identifying location, manner, timing, and condition, which are essential for describing biomedical relations. Semantic role labeling (SRL) is a natural language processing technique that identifies the semantic roles of these words or phrases in sentences and expresses them as predicate-argument structures. We construct a biomedical SRL system called BIOSMILE that uses a maximum entropy (ME) machine-learning model to extract biomedical relations. BIOSMILE is trained on BioProp, our semi-automatic, annotated biomedical proposition bank. Currently, we are focusing on 30 biomedical verbs that are frequently used or considered important for describing molecular events. Results: To evaluate the performance of BIOSMILE, we conducted two experiments to (1) compare the performance of SRL systems trained on newswire and biomedical corpora; and (2) examine the effects of using biomedical-specific features. The experimental results show that using BioProp improves the F-score of the SRL system by 21.45% over an SRL system that uses a newswire corpus. It is noteworthy that adding automatically generated template features improves the overall F-score by a further 0.52%. Specifically, ArgM-LOC, ArgM-MNR, and Arg2 achieve statistically significant performance improvements of 3.33%, 2.27%, and 1.44%, respectively. Conclusion: We demonstrate the necessity of using a biomedical proposition bank for training SRL systems in the biomedical domain. Besides the different characteristics of biomedical and newswire sentences, factors such as cross-domain framesets and verb usage variations also influence the performance of SRL systems. For argument classification, we find that NE (named entity) features indicating if the target node matches with NEs are not effective, since NEs may match with a node of the parsing tree that does not have semantic role labels in the training set. We therefore incorporate templates composed of specific words, NE types, and POS tags into the SRL system. As a result, the classification accuracy for adjunct arguments, which is especially important for biomedical SRL, is improved significantly.

conference on information and knowledge management | 2000

Semantic search on Internet tabular information extraction for answering queries

Huei-Long Wang; Shih-Hung Wu; I. C. Wang; Cheng-Lung Sung; Wen-Lian Hsu; Wei-Kuan Shih

Although extracting information from tables is essential for Internet information agents, most tables are designed for human eyes and their layout and semantic meanings are not well defined. In practice, encoding the layout of each information source is impossible. This work presents a novel semantic search approach capable of extracting information from general tables. Semantic ontology allows our agents to read tables in the same knowledge domain with different layouts. In addition, a system of layout syntax and a set of transformation rules are defined to transform tables into databases without losing their semantic meanings.

information reuse and integration | 2005

A knowledge-based approach to citation extraction

Min-Yuh Day; Tzong-Han Tsai; Cheng-Lung Sung; Cheng-Wei Lee; Shih-Hung Wu; Chorng-Shyong Ong; Wen-Lian Hsu

Integration of the bibliographical information of scholarly publications available on the Internet is an important task in academic research. To accomplish this task, accurate reference metadata extraction for scholarly publications is essential for the integration of information from heterogeneous reference sources. In this paper, we propose a knowledge-based approach to literature mining and focus on reference metadata extraction methods for scholarly publications. We adopt an ontological knowledge representation framework called INFOMAP to automatically extract the reference metadata. The experimental results show that, by using INFOMAP, we can extract author, title, journal, volume, number (issue), year, and page information from different reference styles with a high degree of accuracy. The overall average field accuracy of citation extraction for a bioinformatics dataset is 97.87% for six reference styles.

information reuse and integration | 2008

An alignment-based surface pattern for a question answering system

Cheng-Lung Sung; Cheng-Wei Lee; Hsu Chun Yen; Wen-Lian Hsu

In this paper, we propose an alignment-based surface pattern approach, called ABSP, which integrates semantic information into syntactic patterns for question answering (QA). ABSP uses surface patterns to extract important terms from questions, and constructs the terms’ relations from sentences in the corpus. The relations are then used to filter appropriate answer candidates. Experiments show that ABSP can achieve high accuracy and can be incorporated into other QA systems that have high coverage. It can also be used in cross-lingual QA systems. The approach is both robust and portable to other domains.

ACM Transactions on Asian Language Information Processing | 2008

Boosting Chinese Question Answering with Two Lightweight Methods: ABSPs and SCO-QAT

Cheng-Wei Lee; Min-Yuh Day; Cheng-Lung Sung; Yi-Hsun Lee; Tian-Jian Jiang; Chia-Wei Wu; Cheng-Wei Shih; Yu-Ren Chen; Wen-Lian Hsu

Question Answering (QA) research has been conducted in many languages. Nearly all the top performing systems use heavy methods that require sophisticated techniques, such as parsers or logic provers. However, such techniques are usually unavailable or unaffordable for under-resourced languages or in resource-limited situations. In this article, we describe how a top-performing Chinese QA system can be designed by using lightweight methods effectively. We propose two lightweight methods, namely the Sum of Co-occurrences of Question and Answer Terms (SCO-QAT) and Alignment-based Surface Patterns (ABSPs). SCO-QAT is a co-occurrence-based answer-ranking method that does not need extra knowledge, word-ignoring heuristic rules, or tools. It calculates co-occurrence scores based on the passage retrieval results. ABSPs are syntactic patterns trained from question-answer pairs with a multiple alignment algorithm. They are used to capture the relations between terms and then use the relations to filter answers. We attribute the success of the ABSPs and SCO-QAT methods to the effective use of local syntactic information and global co-occurrence information. By using SCO-QAT and ABSPs, we improved the RU-Accuracy of our testbed QA system, ASQA, from 0.445 to 0.535 on the NTCIR-5 dataset. It also achieved the top 0.5 RU-Accuracy on the NTCIR-6 dataset. The result shows that lightweight methods are not only cheaper to implement, but also have the potential to achieve state-of-the-art performances.
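The abstract does not give the SCO-QAT formula, so the following is only a simplified stand-in for co-occurrence-based answer ranking: each candidate is scored by how many retrieved passages contain it together with each question term. The function names, the token-set representation of passages, and the toy data are assumptions made for this sketch.

```python
def cooccurrence_score(candidate, question_terms, passages):
    """Sum, over question terms, the number of passages that contain
    both the term and the candidate (a simplified co-occurrence count)."""
    score = 0
    for term in question_terms:
        score += sum(1 for p in passages if term in p and candidate in p)
    return score


def rank_candidates(candidates, question_terms, passages):
    """Order candidates from highest to lowest co-occurrence score."""
    return sorted(
        candidates,
        key=lambda c: cooccurrence_score(c, question_terms, passages),
        reverse=True,
    )


# Toy passage-retrieval results, each passage represented as a set of tokens.
passages = [
    {"capital", "France", "Paris"},
    {"Paris", "France", "city"},
    {"Lyon", "France"},
]
question_terms = ["capital", "France"]
ranking = rank_candidates(["Lyon", "Paris"], question_terms, passages)
print(ranking)
```

The sketch needs only the passages themselves, with no parser or external knowledge source, which is the sense in which the abstract calls such a method lightweight.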

intelligent systems design and applications | 2008

Compute the Term Contributed Frequency

Cheng-Lung Sung; Hsu Chun Yen; Wen-Lian Hsu

In this paper, we propose an algorithm and data structure for computing the term contributed frequency (tcf) for all N-grams in a text corpus. Although term frequency is one of the standard notions of frequency in corpus-based natural language processing (NLP), applying the concept to N-gram approaches raises problems, such as the distortion of phrase frequencies. We attempt to overcome this drawback by building a DAG containing the proposed data structure and using it to retrieve more reliable term frequencies. Our proposed algorithm and data structure are more efficient than traditional term frequency extraction approaches and are portable to various languages.

intelligence and security informatics | 2008

A template alignment algorithm for question classification

Cheng-Lung Sung; Min-Yuh Day; Hsu-Chun Yen; Wen-Lian Hsu

Question classification (QC) plays a key role in automated question answering (QA) systems. In Chinese QC, for example, a question is analyzed and then labeled with the question type it belongs to and the expected answer type. In this paper, we propose a novel method of Chinese QC that integrates syntactic tags and semantic tags into an alignment-based approach. We adopt a template alignment (TA) algorithm to process large collections of Chinese questions and compare the classification results with those of INFOMAP, a human-annotated knowledge inference engine for Chinese questions. We experimented with two approaches for the proposed system: a majority algorithm and a machine learning method that uses a Support Vector Machine (SVM). The TA algorithm performs well with both approaches. The experimental results show that the accuracy achieved by TA (85.5%) is comparable to that of INFOMAP (88%). In contrast, QC based on the SVM approach, which incorporates syntactic features and TA, yields an accuracy rate of 91.5%.

information reuse and integration | 2006

Chinese Word Segmentation with Minimal Linguistic Knowledge: An Improved Conditional Random Fields Coupled with Character Clustering and Automatically Discovered Template Matching

Richard Tzong-Han Tsai; Hong-Jie Dai; Hsieh-Chuan Hung; Cheng-Lung Sung; Min-Yuh Day; Wen-Lian Hsu

This paper addresses three major problems of closed task Chinese word segmentation (CWS): word overlap, tagging sentences interspersed with non-Chinese words, and long named entity (NE) identification. For the first, we use additional bigram features to approximate trigram and tetragram features. For the second, we first apply K-means clustering to identify non-Chinese characters. Then, we employ a two-tagger architecture: one for Chinese text and the other for non-Chinese text. Finally, we post-process our CWS output using automatically generated templates. Our results show that additional bigrams can effectively identify more unknown words. Secondly, using our two-tagger method, segmentation performance on sentences containing non-Chinese words is significantly improved when non-Chinese characters are sparse in the training corpus. Lastly, identification of long NEs and long words is also enhanced by template-based post-processing. Using corpora in the closed task of SIGHAN CWS, our best system achieves F-scores of 0.956, 0.947, and 0.965 on the AS, HK, and MSR corpora respectively, compared to the best context scores of 0.952, 0.943, and 0.964 in SIGHAN Bakeoff 2005. In AS, this performance is comparable to the best result (F = 0.956) in the open task.
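The F-scores quoted above are the standard word-level measure for segmentation. The sketch below shows one common way to compute it, by comparing character spans of gold and predicted words. It is not the SIGHAN scoring script; the function names and the three-word example sentence are assumptions for illustration.

```python
def word_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def segmentation_f_score(gold, predicted):
    """Harmonic mean of word-level precision and recall.

    A predicted word counts as correct only if both of its boundaries
    agree with a word in the gold segmentation.
    """
    gold_spans, pred_spans = word_spans(gold), word_spans(predicted)
    correct = len(gold_spans & pred_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


# Illustrative sentence: the prediction wrongly splits the last gold word.
gold = ["我們", "喜歡", "自然語言"]
predicted = ["我們", "喜歡", "自然", "語言"]
f_score = segmentation_f_score(gold, predicted)
print(round(f_score, 4))
```

In this example two of the four predicted words are correct (precision 0.5) and two of the three gold words are recovered (recall 2/3), so over-splitting a single long word lowers both figures.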

BMC Bioinformatics | 2006