Jui-Feng Yeh | Researchain

Publication

Featured researches published by Jui-Feng Yeh.

ACM Transactions on Asian Language Information Processing | 2005

Domain-specific FAQ retrieval using independent aspects

Chung-Hsien Wu; Jui-Feng Yeh; Ming-Jun Chen

This investigation presents an approach to domain-specific FAQ (frequently-asked question) retrieval using independent aspects. The data analysis classifies the questions in the collected QA (question-answer) pairs into ten question types in accordance with question stems. The answers in the QA pairs are then paragraphed and clustered using latent semantic analysis and the K-means algorithm. For semantic representation of the aspects, a domain-specific ontology is constructed based on WordNet and HowNet. A probabilistic mixture model is then used to interpret the query and QA pairs based on independent aspects; hence the retrieval process can be viewed as a maximum likelihood estimation problem. The expectation-maximization (EM) algorithm is employed to estimate the optimal mixing weights in the probabilistic mixture model. Experimental results indicate that the proposed approach outperformed the FAQ-Finder system in medical FAQ retrieval.
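
As a rough illustration of the EM step mentioned in the abstract, the sketch below estimates the mixing weights of a mixture whose per-component likelihoods are held fixed. It is not the authors' model: the function name `em_mixing_weights`, the two-component setup, and the toy likelihood values are all assumptions made for this example.

```python
def em_mixing_weights(likelihoods, n_iter=50):
    """Estimate mixture weights when per-component likelihoods are fixed.

    likelihoods[i][k] is p(observation i | component k). These inputs are
    hypothetical; the abstract does not define them.
    """
    n_components = len(likelihoods[0])
    weights = [1.0 / n_components] * n_components  # uniform starting point
    for _ in range(n_iter):
        # E-step: accumulate each component's posterior responsibility
        totals = [0.0] * n_components
        for row in likelihoods:
            joint = [w * p for w, p in zip(weights, row)]
            norm = sum(joint)
            for k, j in enumerate(joint):
                totals[k] += j / norm
        # M-step: the new weight is the average responsibility
        weights = [t / len(likelihoods) for t in totals]
    return weights


# Toy data (assumed): four observations, two components
toy = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.2, 0.8]]
weights = em_mixing_weights(toy)
print(weights)
```

Because three of the four toy observations are better explained by the first component, the estimated weight of that component ends up above 0.5.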

IEEE Transactions on Knowledge and Data Engineering | 2006

Semantic segment extraction and matching for Internet FAQ retrieval

Chung-Hsien Wu; Jui-Feng Yeh; Yu-Sheng Lai

This investigation presents a novel approach to semantic segment extraction and matching for retrieving information from Internet FAQs with natural language queries. Two semantic segments, the question category segment (QS) and the keyword segment (KS), are extracted from the input queries and the FAQ questions with a semiautomatically derived question-semantic grammar. A semantic matching method is presented to estimate the similarity between the semantic segments of the query and the questions in the FAQ collection. Additionally, the vector space model (VSM) is adopted to measure the similarity between the query and the answers of the QA pairs. Finally, a multistage ranking strategy is adopted to determine the optimally performing combination of similarity metrics. The experimental results illustrate that the proposed method achieves an average rank of 4.52 and a top-10 recall rate of 90.89 percent. Compared with the query-expansion method, this method improves the performance by 4.82 places in the average rank of correct answers, 25.34 percent in the top-5 recall rate, and 5.21 percent in the top-10 recall rate.
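
The two evaluation measures quoted above, average rank and top-k recall, can be computed as in the following minimal sketch. The function names and the list of ranks are made up for illustration and are unrelated to the paper's data.

```python
def average_rank(ranks):
    """Mean position of the correct answer across queries (1 = best)."""
    return sum(ranks) / len(ranks)


def top_k_recall(ranks, k):
    """Fraction of queries whose correct answer appears in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical ranks of the correct answer for five queries
ranks = [1, 3, 12, 5, 7]
print(average_rank(ranks))      # 5.6
print(top_k_recall(ranks, 5))   # 0.6
print(top_k_recall(ranks, 10))  # 0.8
```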

IEEE Transactions on Audio, Speech, and Language Processing | 2006

Edit disfluency detection and correction using a cleanup language model and an alignment model

Jui-Feng Yeh; Chung-Hsien Wu

This investigation presents a novel approach to detecting and correcting the edit disfluency in spontaneous speech. Hypothesis testing using acoustic features is first adopted to detect potential interruption points (IPs) in the input speech. The word order of the cleanup utterance is then cleaned up based on the potential IPs using a class-based cleanup language model, and the deletable region and the correction are aligned using an alignment model. Finally, log-linear weighting is applied to optimize the performance. Using the acoustic features, the IP detection rate is significantly improved, especially in recall rate. Based on the positions of the potential IPs, the cleanup language model and the alignment model are able to detect and correct the edit disfluency efficiently. Experimental results demonstrate that the proposed approach has achieved error rates of 0.33 and 0.21 for IP detection and edit word deletion, respectively.

IEEE Transactions on Evolutionary Computation | 2008

HAL-Based Evolutionary Inference for Pattern Induction From Psychiatry Web Resources

Liang-Chih Yu; Chung-Hsien Wu; Jui-Feng Yeh; Fong-Lin Jang

Negative and stressful life events play a significant role in triggering depressive episodes. Psychiatric services that can identify such events efficiently are vital for mental health care and prevention. Meaningful patterns, e.g., <lost, parents>, must be extracted from psychiatric texts before these services can be provided. This study presents an evolutionary text-mining framework capable of inducing variable-length patterns from unannotated psychiatry Web resources. The proposed framework can be divided into two parts: 1) a cognitively motivated model such as hyperspace analog to language (HAL) and 2) an evolutionary inference algorithm (EIA). The HAL model constructs a high-dimensional context space to represent words as well as combinations of words. Based on the HAL model, the EIA bootstraps with a small set of seed patterns, and then iteratively induces additional relevant patterns. To avoid moving in the wrong direction, the EIA further incorporates relevance feedback to guide the induction process. Experimental results indicate that combining the HAL model and relevance feedback enables the EIA to not only induce patterns from the unannotated Web corpora, but also achieve useful results in a reasonable amount of time. The proposed framework thus significantly reduces reliance on annotated corpora.

IEEE Transactions on Audio, Speech, and Language Processing | 2011

Speaker Clustering Using Decision Tree-Based Phone Cluster Models With Multi-Space Probability Distributions

Han-Ping Shen; Jui-Feng Yeh; Chung-Hsien Wu

This paper presents an approach to speaker clustering using decision tree-based phone cluster models (DT-PCMs). In this approach, phone clustering is first applied to construct the universal phone cluster models to accommodate acoustic characteristics from different speakers. Since the pitch feature is highly speaker-related and beneficial for speaker identification, decision trees based on multi-space probability distributions (MSDs), which can model both pitch and cepstral features for voiced and unvoiced speech simultaneously, are constructed. In speaker clustering based on DT-PCMs, contextual, phonetic, and prosodic features of each input speech segment are used to select the speaker-related MSDs from the MSD decision trees to construct the initial phone cluster models. The maximum-likelihood linear regression (MLLR) method is then employed to adapt the initial models to the speaker-adapted phone cluster models according to the input speech segment. Finally, the agglomerative clustering algorithm is applied to all speaker-adapted phone cluster models, each representing one input speech segment, for speaker clustering. In addition, an efficient estimation method for phone model merging is proposed for model parameter combination. Experimental results show that the MSD-based DT-PCMs outperform the conventional GMM- and HMM-based approaches for speaker clustering on the RT09 tasks.

ACM Transactions on Asian Language Information Processing | 2011

Interruption Point Detection of Spontaneous Speech Using Inter-Syllable Boundary-Based Prosodic Features

Chung-Hsien Wu; Wei-Bin Liang; Jui-Feng Yeh

This article presents a probabilistic scheme for detecting the interruption point (IP) in spontaneous speech based on inter-syllable boundary-based prosodic features. Because of the high error rate in spontaneous speech recognition, a combined acoustic model considering both syllable and subsyllable recognition units is first used to determine the inter-syllable boundaries and output the recognition confidence of the input speech. Based on the finding that IPs always occur at inter-syllable boundaries, a probability distribution of the prosodic features at the current potential IP is estimated. The Conditional Random Field (CRF) model, which employs the clustered prosodic features of the current potential IP and its preceding and succeeding inter-syllable boundaries, is employed to output the IP likelihood measure. Finally, the confidence of the recognized speech, the probability distribution of the prosodic features, and the CRF-based IP likelihood measure are integrated to determine the optimal IP sequence of the input spontaneous speech. In addition, pitch reset and lengthening are applied to improve the IP detection performance. The Mandarin Conversational Dialogue Corpus is adopted for evaluation. Experimental results show that the proposed IP detection approach obtains 10.56% and 6.5% more effective results than the hidden Markov model and the Maximum Entropy model, respectively, under the same experimental conditions. Moreover, the IP detection error rate can be further reduced by 9.15% using pitch reset and lengthening information. The experimental results confirm that the proposed model based on inter-syllable boundary-based prosodic features can effectively detect the interruption point in spontaneous Mandarin speech.

Knowledge Based Systems | 2016

Near-synonym substitution using a discriminative vector space model

Liang-Chih Yu; Lung-Hao Lee; Jui-Feng Yeh; Hsiu-Min Shih; Yu-Ling Lai

Near-synonyms are fundamental and useful knowledge resources for computer-assisted language learning (CALL) applications. For example, in online language learning systems, learners may need to express a similar meaning using different words. However, it is usually difficult to choose suitable near-synonyms to fit a given context because the differences between near-synonyms are not easily grasped in practical use, especially for second language (L2) learners. Accordingly, it is worth developing algorithms to verify whether near-synonyms match given contexts. Such algorithms could be used in applications to assist L2 learners in discovering the collocational differences between near-synonyms. We propose a discriminative vector space model for the near-synonym substitution task, and treat this task as a classification task. There are two components: a vector space model and discriminative training. The vector space model is used as a baseline classifier to classify test examples into one of the near-synonyms in a given near-synonym set. A discriminative training technique is then employed to improve the vector space model by distinguishing positive and negative features for each near-synonym. Experimental results show that the resulting model (DT-VSM) achieves higher accuracy than both pointwise mutual information and n-gram-based methods that have been used in previous studies.
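
The baseline idea of classifying a context into one of several near-synonyms with a vector space model can be sketched as follows. This is only a cosine-similarity toy under assumed data: the word profiles, the example context, and the function names are hypothetical, and the discriminative training step described in the abstract is not reproduced here.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def classify(context, profiles):
    """Pick the near-synonym whose context profile is closest to the input."""
    vec = Counter(context)
    return max(profiles, key=lambda word: cosine(vec, profiles[word]))


# Hypothetical context profiles for one near-synonym set
profiles = {
    "strong": Counter(["coffee", "wind", "opinion", "coffee"]),
    "powerful": Counter(["engine", "computer", "nation", "engine"]),
}
choice = classify(["a", "cup", "of", "coffee"], profiles)
print(choice)  # strong
```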

International Conference on Multimedia and Expo | 2008

Interruption point detection of spontaneous speech using prior knowledge and multiple features

Wei-Bin Liang; Jui-Feng Yeh; Chung-Hsien Wu; Chi-Chiuan Liou

This paper presents an approach to interruption point (IP) detection of spontaneous speech based on conditional random fields using prior knowledge and multiple features. The features adopted in this study consist of subsyllable boundaries and prosodic features. Conditional random fields (CRFs) and variable-length contextual features are employed for IP modeling. In order to apply the features with continuous values to the CRF models, the K-means clustering algorithm is adopted for the quantization of the prosodic features. The Mandarin Conversational Dialogue Corpus (MCDC) was used to evaluate the proposed method. The IP detection error rate was reduced by almost 20% under the Rt04 measure. The experimental results show that the proposed model can effectively detect the interruption point in spontaneous speech.
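
The quantization step, turning continuous prosodic values into discrete symbols that a CRF can consume, might look like the one-dimensional K-means sketch below. The pitch values, the choice of three clusters, the deterministic initialization, and the function names are all assumptions made for this example rather than details from the paper.

```python
def kmeans_1d(values, k, n_iter=20):
    """Plain 1-D K-means returning k centroids in ascending order."""
    ordered = sorted(values)
    # Deterministic start: spread initial centroids across the sorted data
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centroids[c]))
            clusters[nearest].append(v)
        # Keep the old centroid if a cluster happens to be empty
        centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids


def quantize(value, centroids):
    """Map a continuous value to the index of its nearest centroid."""
    return min(range(len(centroids)), key=lambda c: abs(value - centroids[c]))


# Hypothetical pitch measurements in Hz
pitch = [110.0, 115.0, 120.0, 180.0, 185.0, 190.0, 250.0, 255.0, 260.0]
centroids = kmeans_1d(pitch, k=3)
symbols = [quantize(v, centroids) for v in pitch]
print(centroids)  # [115.0, 185.0, 255.0]
print(symbols)    # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```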

International Conference Natural Language Processing | 2003

Semantic inference based on ontology for medical FAQ mining

Jui-Feng Yeh; Ming-Jun Chen; Chung-Hsien Wu

We present an approach to semantic inference for FAQ mining based on ontology. The questions are classified into ten intension categories using predefined question stemming keywords. The answers in the FAQ database are also clustered using latent semantic analysis (LSA) and the K-means algorithm. For FAQ mining, given a query, the question part and the answer part of each FAQ question-answer pair are matched separately with the input query. Finally, the probabilities estimated from these two parts are integrated and used to choose the most likely answer for the input query. The approach is evaluated on a medical FAQ system. The results show that the proposed approach achieved a retrieval rate of 90% and outperformed the keyword-based approach.

International Journal of Computational Linguistics & Chinese Language Processing, Volume 13, Number 4, December 2008 | 2008

Corpus Cleanup of Mistaken Agreement Using Word Sense Disambiguation

Liang-Chih Yu; Chung-Hsien Wu; Jui-Feng Yeh; Eduard H. Hovy

Word sense annotated corpora are useful resources for many text mining applications. Such corpora are only useful if their annotations are consistent. Most large-scale annotation efforts take special measures to reconcile inter-annotator disagreement. To date, however, nobody has investigated how to automatically determine exemplars in which the annotators agree but are wrong. In this paper, we use OntoNotes, a large-scale corpus of semantic annotations, including word senses, predicate-argument structure, ontology linking, and coreference. To determine the mistaken agreements in word sense annotation, we employ word sense disambiguation (WSD) to select a set of suspicious candidates for human evaluation. Experiments are conducted from three aspects (precision, cost-effectiveness ratio, and entropy) to examine the performance of WSD. The experimental results show that WSD is most effective in identifying erroneous annotations for highly ambiguous words, while a baseline is better for other cases. The two methods can be combined to improve the cleanup process. This procedure allows us to find approximately 2% of the remaining erroneous agreements in the OntoNotes corpus. A similar procedure can be easily defined to check other annotated corpora.
