Yue-Shi Lee
Ming Chuan University
Publication
Featured research published by Yue-Shi Lee.
Expert Systems With Applications | 2009
Show-Jane Yen; Yue-Shi Lee
For classification problems, the training data significantly influence classification accuracy. However, data in real-world applications often have an imbalanced class distribution; that is, most of the data belong to the majority class and few belong to the minority class. In this case, if all the data are used as training data, the classifier tends to predict that most incoming data belong to the majority class. Hence, it is important to select suitable training data for classification under an imbalanced class distribution. In this paper, we propose cluster-based under-sampling approaches that select representative data as training data to improve classification accuracy for the minority class, and we investigate the effect of under-sampling methods in the imbalanced class distribution environment. The experimental results show that our cluster-based under-sampling approaches outperform the other under-sampling techniques from previous studies.
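The idea of cluster-based under-sampling can be illustrated with a minimal sketch. The abstract does not state the selection rule, so the rule below (each cluster contributes majority samples in proportion to its majority-to-minority ratio) is an assumption, as are the function name `cluster_based_undersample`, the `ratio` parameter, and the toy data. Cluster assignments are taken as given here; in practice they would come from a clustering step such as k-means.

```python
import random
from collections import defaultdict

def cluster_based_undersample(samples, labels, cluster_ids,
                              minority_label, ratio=1.0, seed=0):
    """Keep every minority sample and draw majority samples per cluster.

    Assumed rule (not taken from the abstract): a cluster's share of the
    majority quota is proportional to its majority/minority size ratio.
    """
    rng = random.Random(seed)
    majority_by_cluster = defaultdict(list)
    minority_count = defaultdict(int)
    minority_idx = []
    for i, (y, c) in enumerate(zip(labels, cluster_ids)):
        if y == minority_label:
            minority_idx.append(i)
            minority_count[c] += 1
        else:
            majority_by_cluster[c].append(i)

    # Clusters with no minority samples are treated as having one,
    # purely to avoid division by zero (an assumption of this sketch).
    weights = {c: len(idx) / max(minority_count[c], 1)
               for c, idx in majority_by_cluster.items()}
    total_weight = sum(weights.values())
    target = ratio * len(minority_idx)  # majority samples wanted overall

    chosen = []
    for c, idx in majority_by_cluster.items():
        quota = min(len(idx), round(target * weights[c] / total_weight))
        chosen.extend(rng.sample(idx, quota))

    keep = sorted(minority_idx + chosen)
    return [samples[i] for i in keep], [labels[i] for i in keep]

# Toy data: cluster 0 holds 8 majority and 2 minority samples,
# cluster 1 holds 4 majority and 2 minority samples.
labels = [0] * 8 + [1] * 2 + [0] * 4 + [1] * 2
cluster_ids = [0] * 10 + [1] * 6
samples = list(range(16))
xs, ys = cluster_based_undersample(samples, labels, cluster_ids, minority_label=1)
print(ys.count(0), ys.count(1))
```

With `ratio=1.0` the selected majority samples roughly match the number of minority samples, so the classifier is trained on a more balanced set.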
Pattern Recognition | 2008
Yu-Chieh Wu; Yue-Shi Lee; Jie-Chi Yang
Phrase pattern recognition (phrase chunking) refers to automatic approaches for identifying predefined phrase structures in a stream of text. Support vector machine (SVM)-based methods have shown excellent performance in many sequential text pattern recognition tasks, such as protein name finding and noun phrase (NP) chunking. Even though they yield very accurate results, they are not efficient for online applications, which need to handle hundreds of thousands of words in a limited time. In this paper, we first re-examine five typical multiclass SVM methods and their adaptation to phrase chunking. However, most of them are inefficient when the number of phrase types grows. We therefore introduce two new multiclass SVM models that make the system substantially faster in training and testing while keeping the SVM accurate. The two methods can also be applied to similar tasks such as named entity recognition and Chinese word segmentation. Experiments on the CoNLL-2000 chunking and Chinese base-chunking tasks show that our method achieves very competitive accuracy and is at least 100 times faster than the state-of-the-art SVM-based phrase chunking method. In addition, the computational time complexity and a time cost analysis of our methods are given in this paper.
Data Warehousing and Knowledge Discovery | 2007
Show-Jane Yen; Yue-Shi Lee
Mining weighted association rules considers the profits of items in a transaction database, so that association rules about important items can be discovered. However, high-profit items are not always high-revenue products, since the purchased quantities of items also influence their revenue. This paper considers both the profits and the purchased quantities of items to calculate the utility of the items. Mining high utility quantitative association rules discovers that when some items are purchased in certain quantities, other items are also purchased in certain quantities, and that these combinations have high utility. In this paper, we propose a data mining algorithm to find high utility itemsets with purchased quantities, from which high utility quantitative association rules can also be generated. Our algorithm does not need to generate candidate itemsets and needs to scan the original database only twice. The experimental results show that our algorithm is more efficient than other algorithms, which discover only high utility association rules.
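To make the notion of utility concrete, the following sketch computes the utility of an itemset as the sum of quantity times unit profit over the transactions that contain every item in the set. This is the common definition in high utility mining and is assumed here; it is a brute-force illustration, not the candidate-free two-scan algorithm the paper proposes. The function name, item names, and numbers are hypothetical.

```python
def itemset_utility(itemset, transactions, profit):
    """Sum quantity * unit profit over transactions containing the whole itemset.

    transactions: list of dicts mapping item -> purchased quantity
    profit:       dict mapping item -> profit per unit
    """
    total = 0
    for t in transactions:
        if all(item in t for item in itemset):
            total += sum(t[item] * profit[item] for item in itemset)
    return total

# Hypothetical transaction database and unit profits.
transactions = [
    {"A": 2, "B": 1},
    {"A": 1, "C": 3},
    {"A": 1, "B": 2, "C": 1},
]
profit = {"A": 5, "B": 2, "C": 1}

print(itemset_utility({"A", "B"}, transactions, profit))  # 21
print(itemset_utility({"A"}, transactions, profit))       # 20
```

An itemset counts as high utility when this value reaches a user-chosen threshold. Here {A, B} yields more utility than {A} alone even though it appears in fewer transactions, which is the effect of taking quantities into account.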
International Conference on Intelligent Computing | 2006
Show-Jane Yen; Yue-Shi Lee
The training data are the most important factor in improving classification accuracy. However, data in real-world applications often have an imbalanced class distribution; that is, most of the data belong to the majority class and few belong to the minority class. In this case, if all the data are used as training data, the classifier tends to predict that most incoming data belong to the majority class. Hence, it is important to select suitable training data for classification under an imbalanced class distribution. In this paper, we propose cluster-based under-sampling approaches that select representative data as training data to improve classification accuracy for the minority class in the imbalanced class distribution problem. The experimental results show that our cluster-based under-sampling approaches outperform the other under-sampling techniques from previous studies.
Information Sciences | 2013
Show-Jane Yen; Yu-Chieh Wu; Jie-Chi Yang; Yue-Shi Lee; Chung-Jung Lee; Jui-Jung Liu
Modern information technologies and Internet services suffer from the problem of selecting and managing a growing amount of textual information, to which access is often critical. Machine learning techniques have recently shown excellent performance and flexibility in many applications, such as artificial intelligence and pattern recognition. Question answering (QA) is a method of locating exact answer sentences in vast document collections. This paper presents a machine learning-based question-answering framework that integrates a question classifier, simple document/passage retrievers, and the proposed context-ranking models. The question classifier is trained to categorize the answer type of the given question and instructs the context-ranking model to re-rank the passages retrieved by the initial retrievers. This method provides flexible features to learners, such as word forms, syntactic features, and semantic word features. The proposed context-ranking model, which is based on the sequential labeling of tasks, combines rich features to predict whether the input passage is relevant to the question type. We employ TREC-QA tracks and question classification benchmarks to evaluate the proposed method. The experimental results show that the question classifier achieves 85.60% accuracy without any additional semantic or syntactic taggers, and reaches 88.60% after we employ the proposed term expansion techniques and a predefined related-word set. In the TREC-10 QA task, using the gold TREC-provided relevant document set, the QA model achieves a mean reciprocal rank (MRR) score of 0.563, and it achieves an MRR score of 0.342 when using the simple document and passage retrieval algorithms.
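For readers unfamiliar with the MRR scores reported above, the sketch below shows the standard computation: for each question, take the reciprocal of the rank of the first correct answer (zero if none is returned), then average over all questions. The function name and the example ranks are hypothetical and are not data from the paper.

```python
def mean_reciprocal_rank(first_correct_ranks):
    """Average of 1/rank over all questions.

    first_correct_ranks: one entry per question, giving the 1-based rank of
    the first correct answer, or None when no correct answer was returned.
    """
    if not first_correct_ranks:
        return 0.0
    total = sum(1.0 / r for r in first_correct_ranks if r is not None)
    return total / len(first_correct_ranks)

# Hypothetical ranks for four questions; the third has no correct answer.
print(mean_reciprocal_rank([1, 2, None, 4]))  # 0.4375
```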
Pacific-Asia Conference on Knowledge Discovery and Data Mining | 2006
Yu-Chieh Wu; Teng-Kai Fan; Yue-Shi Lee; Show-Jane Yen
Identifying proper names, such as gene names, DNA, or proteins, helps researchers mine text information. Learning to extract proper names from natural language text is a named entity recognition (NER) task. Previous studies focus on combining abundant human-made rules and trigger words to enhance system performance. However, these methods require domain experts to build the rules and word sets, which takes considerable human effort. In this paper, we present a robust named entity recognition system based on support vector machines (SVMs). By integrating a rich feature set with the proposed mask method, the system performs satisfactorily on the MUC-7 and biology named entity recognition tasks and outperforms well-known machine learning-based methods such as the hidden Markov model (HMM) and the maximum entropy model (MEM). We compare our method with previous systems evaluated on the same data sets. The experiments show that our system achieves an F(β=1) rate of 86.4 when trained on the MUC-7 data set and 81.57 on the biology corpus. In addition, our named entity system can handle real-time processing applications: the turnaround time on a 63 K-word document set is less than 30 seconds.
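The F(β=1) rate quoted above is the harmonic mean of precision and recall. A minimal sketch of the general F-beta formula follows; the function name and the precision and recall values are hypothetical and do not come from the paper.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score: (1 + beta^2) * P * R / (beta^2 * P + R)."""
    if precision == 0 and recall == 0:
        return 0.0  # conventional value when both are zero
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical precision and recall; beta=1 weights them equally.
print(round(f_beta(0.9, 0.8), 4))  # 0.8471
```

NER evaluations commonly report this score multiplied by 100, which matches the scale of figures such as 86.4.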
IEEE International Conference on E-Commerce Technology for Dynamic E-Business | 2004
Yue-Shi Lee; Show-Jane Yen; Min-Chi Hsieh; Ghi-Hua Tu
The World Wide Web has had the most exciting impact on human society in the past few years. It provides another way of doing business. This paper proposes a new algorithm, IPA (integrating path traversal patterns and association rules), for mining Web transaction patterns in the electronic commerce environment. The IPA algorithm takes both the traveling and purchasing behaviors of customers into consideration at the same time to overcome the disadvantages of pure association rule mining and pure path traversal pattern mining. The experimental results show that the IPA algorithm can simultaneously and efficiently capture users' traversing and purchasing behaviors.
Systems, Man and Cybernetics | 2006
Show-Jane Yen; Yue-Shi Lee; Cheng-Han Lin; Jia-Ching Ying
Classification is an important and well-known technique in the field of machine learning, and the training data significantly influence classification accuracy. However, the training data in real-world applications often have an imbalanced class distribution. It is important to select suitable training data for classification in the imbalanced class distribution problem. In this paper, we propose a cluster-based sampling approach that selects representative data as training data to improve classification accuracy, and we investigate the effect of under-sampling methods in the imbalanced class distribution problem. In the experiments, we evaluate the performance of our cluster-based sampling approach and of the other sampling methods from previous studies.
Expert Systems With Applications | 2006
Show-Jane Yen; Yue-Shi Lee
Mining association rules and mining sequential patterns both aim to discover customer purchasing behaviors from a transaction database, so that the quality of business decisions can be improved. However, the transaction database can be very large. Finding all the association rules and sequential patterns in a large database is very time consuming, and users may be interested in only some of the information. Moreover, the criteria that discovered association rules and sequential patterns must meet may differ from one user requirement to another. Much information that is uninteresting with respect to the user requirements can be generated when traditional mining methods are applied. Hence, a data mining language is needed so that users can query only the knowledge that interests them from a large database of customer transactions. In this paper, such a data mining language is presented. With it, users can specify the items of interest and the criteria for the association rules or sequential patterns to be discovered. Efficient data mining techniques are also proposed to extract the association rules and sequential patterns according to the user requirements.
Data Warehousing and Knowledge Discovery | 2002
Yue-Shi Lee; Show-Jane Yen
Decision-tree learning algorithms, e.g., C5, are good at dataset classification. However, those algorithms usually work with only one attribute at a time and do not consider the dependencies among attributes. Unfortunately, most real-world datasets contain dependent attributes. Generally, these dependencies are classified into two types: categorical-type and numerical-type dependencies. Thus, it is very important to construct a model that discovers the dependencies among attributes and improves the accuracy of decision-tree learning algorithms. A neural network model is a good choice for handling these two types of dependencies. In this paper, we propose a Neural Decision Tree (NDT) model to deal with the problems described above. The NDT model combines neural network technologies with traditional decision-tree learning capabilities to handle complicated, real-world cases. The experimental results show that the NDT model can significantly improve the accuracy of C5.