
Ya-Han Hu

National Chung Cheng University


Publication

Featured research published by Ya-Han Hu.

Systems, Man, and Cybernetics | 2012

Machine Learning in Financial Crisis Prediction: A Survey

Wei-Yang Lin; Ya-Han Hu; Chih-Fong Tsai

For financial institutions, the ability to predict or forecast business failures is crucial, as incorrect decisions can have direct financial consequences. Bankruptcy prediction and credit scoring are the two major research problems in the accounting and finance domain. In the literature, a number of models have been developed to predict whether borrowers are in danger of bankruptcy and whether they should be considered a good or bad credit risk. Since the 1990s, machine-learning techniques, such as neural networks and decision trees, have been studied extensively as tools for bankruptcy prediction and credit score modeling. This paper reviews 130 related journal papers from the period between 1995 and 2010, focusing on the development of state-of-the-art machine-learning techniques, including hybrid and ensemble classifiers. Related studies are compared in terms of classifier design, datasets, baselines, and other experimental factors. This paper presents the current achievements and limitations associated with the development of bankruptcy-prediction and credit-scoring models employing machine learning. We also provide suggestions for future research.

Journal of Clinical Epidemiology | 2015

Developing a stroke severity index based on administrative data was feasible using data mining techniques

Sheng Feng Sung; Cheng Yang Hsieh; Yea Huei Kao Yang; Huey Juan Lin; Chih Hung Chen; Yu Wei Chen; Ya-Han Hu

OBJECTIVES: Case-mix adjustment is difficult for stroke outcome studies using administrative data. However, relevant prescription, laboratory, procedure, and service claims might be surrogates for stroke severity. This study proposes a method for developing a stroke severity index (SSI) by using administrative data. STUDY DESIGN AND SETTING: We identified 3,577 patients with acute ischemic stroke from a hospital-based registry and analyzed claims data with plenty of features. Stroke severity was measured using the National Institutes of Health Stroke Scale (NIHSS). We used two data mining methods and conventional multiple linear regression (MLR) to develop prediction models, comparing the model performance according to the Pearson correlation coefficient between the SSI and the NIHSS. We validated these models in four independent cohorts by using hospital-based registry data linked to a nationwide administrative database. RESULTS: We identified seven predictive features and developed three models. The k-nearest neighbor model (correlation coefficient, 0.743; 95% confidence interval: 0.737, 0.749) performed slightly better than the MLR model (0.742; 0.736, 0.747), followed by the regression tree model (0.737; 0.731, 0.742). In the validation cohorts, the correlation coefficients were between 0.677 and 0.725 for all three models. CONCLUSION: The claims-based SSI enables adjusting for disease severity in stroke studies using administrative data.

Data and Knowledge Engineering | 2009

On mining multi-time-interval sequential patterns

Ya-Han Hu; Tony Cheng-Kui Huang; Hui-Ru Yang; Yen-Liang Chen

Sequential pattern mining is essential in many applications, including computational biology, consumer behavior analysis, and web log analysis. Although sequential patterns can tell us which items are frequently purchased together and in what order, they cannot provide information about the time span between items for decision support. Previous studies dealing with this problem either set time constraints to restrict the patterns discovered or define time-intervals between two successive items to provide time information. Accordingly, the first approach falls short in providing clear time-interval information while the second cannot discover time-interval information between two non-successive items in a sequential pattern. To provide more time-related knowledge, we define a new variant of time-interval sequential patterns, called multi-time-interval sequential patterns, which can reveal the time-intervals between all pairs of items in a pattern. Accordingly, we develop two efficient algorithms, called the MI-Apriori and MI-PrefixSpan algorithms, to solve this problem. The experimental results show that the MI-PrefixSpan algorithm is faster than the MI-Apriori algorithm, but the MI-Apriori algorithm has better scalability in long sequence data.

Information Processing and Management | 2017

Opinion mining from online hotel reviews: A text summarization approach

Ya-Han Hu; Yen-Liang Chen; Hui-Ling Chou

Text summarization technique can extract essential information from online reviews. Our method can identify top-k most informative sentences from online hotel reviews. We jointly considered author, review time, usefulness, and opinion factors. Online hotel reviews were collected from TripAdvisor in experimental evaluation. The results show that our approach provides more comprehensive hotel information.

Online travel forums and social networks have become the most popular platform for sharing travel information, with enormous numbers of reviews posted daily. Automatically generated hotel summaries could aid travelers in selecting hotels. This study proposes a novel multi-text summarization technique for identifying the top-k most informative sentences of hotel reviews. Previous studies on review summarization have primarily examined content analysis, which disregards critical factors like author credibility and conflicting opinions. We considered such factors and developed a new sentence importance metric. Both the content and sentiment similarities were used to determine the similarity of two sentences. To identify the top-k sentences, the k-medoids clustering algorithm was used to partition sentences into k groups. The medoids from these groups were then selected as the final summarization results. To evaluate the performance of the proposed method, we collected two sets of reviews for the two hotels posted on TripAdvisor.com. A total of 20 subjects were invited to review the text summarization results from the proposed approach and two conventional approaches for the two hotels. The results indicate that the proposed approach outperforms the other two, and most of the subjects believed that the proposed approach can provide more comprehensive hotel information.

Information Sciences | 2017

Clustering-based undersampling in class-imbalanced data

Wei-Chao Lin; Chih-Fong Tsai; Ya-Han Hu; Jing-Shang Jhang

Class imbalance is often a problem in various real-world data sets, where one class (i.e. the minority class) contains a small number of data points and the other (i.e. the majority class) contains a large number of data points. It is notably difficult to develop an effective model using current data mining and machine learning algorithms without considering data preprocessing to balance the imbalanced data sets. Random undersampling and oversampling have been used in numerous studies to ensure that the different classes contain the same number of data points. A classifier ensemble (i.e. a structure containing several classifiers) can be trained on several different balanced data sets for later classification purposes. In this paper, we introduce two undersampling strategies in which a clustering technique is used during the data preprocessing step. Specifically, the number of clusters in the majority class is set to be equal to the number of data points in the minority class. The first strategy uses the cluster centers to represent the majority class, whereas the second strategy uses the nearest neighbors of the cluster centers. A further study was conducted to examine the effect on performance of the addition or deletion of 5 to 10 cluster centers in the majority class. The experimental results obtained using 44 small-scale and 2 large-scale data sets revealed that the clustering-based undersampling approach with the second strategy outperformed five state-of-the-art approaches. Specifically, this approach combined with a single multilayer perceptron classifier and C4.5 decision tree classifier ensembles delivered optimal performance over both small- and large-scale data sets.
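The two undersampling strategies described in this abstract can be illustrated with a minimal sketch. It assumes a plain Lloyd's k-means written from scratch, Euclidean distance, and toy 2-D points; the function names `kmeans` and `undersample` and all data values are hypothetical and are not taken from the paper.

```python
import math
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on tuples of floats; returns k centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        # Recompute each center as its group mean; keep the old center
        # if a group happens to be empty.
        centers = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return centers


def undersample(majority, minority, use_nearest_neighbor=False, seed=0):
    """Reduce the majority class to len(minority) representatives."""
    # The number of clusters equals the number of minority points.
    centers = kmeans(majority, k=len(minority), seed=seed)
    if not use_nearest_neighbor:
        return centers  # strategy 1: the cluster centers themselves
    # Strategy 2: the real majority point closest to each center.
    return [min(majority, key=lambda p: math.dist(p, c)) for c in centers]


# Hypothetical toy data: six majority points, two minority points.
majority = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
minority = [(2.0, 2.0), (3.0, 3.0)]
centers = undersample(majority, minority)
neighbors = undersample(majority, minority, use_nearest_neighbor=True)
```

Strategy 1 may return synthetic points that do not occur in the data, whereas strategy 2 always returns observed majority points, which is one plausible reason to prefer it when downstream models expect real records.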

Knowledge-Based Systems | 2014

Discovering valuable frequent patterns based on RFM analysis without customer identification information

Ya-Han Hu; Tzu-Wei Yeh

RFM analysis and market basket analysis (i.e., frequent pattern mining) are two of the most important tasks in database marketing. Based on customers’ historical purchasing behavior, RFM analysis can identify a valuable customer group, while market basket analysis can find interesting purchasing patterns. Previous studies reveal that recency, frequency, and monetary (RFM) analysis and frequent pattern mining can be successfully integrated to discover valuable patterns, denoted as RFM-customer-patterns. However, since many retailers record transactions without collecting customer information, the RFM-customer-patterns cannot be discovered by existing approaches. Therefore, the aim of this study was to define the RFM-pattern and develop a novel algorithm to discover the complete set of RFM-patterns that can approximate the set of RFM-customer-patterns without customer identification information. Instead of evaluating values of patterns from a customer point of view, this study directly measures pattern ratings by considering RFM features. An RFM-pattern is defined as a pattern that not only occurs frequently but also involves a recent purchase and a higher percentage of revenue. This study also proposes a tree structure, called an RFM-pattern-tree, to compress and store the entire transactional database, and develops a pattern growth-based algorithm, called RFMP-growth, to discover all the RFM-patterns in an RFM-pattern-tree. Experimental results show that the proposed approach is efficient and can effectively discover the greater part of RFM-customer-patterns.
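To make the idea of rating a pattern directly on recency, frequency, and monetary features more concrete, here is a small sketch. The three measures used (transaction count, day of the latest matching transaction, and share of total revenue), the thresholds, and the names `rfm_scores` and `is_rfm_pattern` are all illustrative assumptions; the abstract does not give the paper's exact definitions, and this brute-force scan is not the RFMP-growth algorithm.

```python
def rfm_scores(pattern, transactions):
    """Return (frequency, recency, monetary share) for an itemset.

    Each transaction is a (day, items, amount) tuple. The three measures
    are illustrative stand-ins, not the paper's definitions.
    """
    wanted = set(pattern)
    hits = [t for t in transactions if wanted <= set(t[1])]
    total_revenue = sum(amount for _, _, amount in transactions)
    frequency = len(hits)
    # Day of the most recent transaction containing the pattern, if any.
    recency = max((day for day, _, _ in hits), default=None)
    monetary = sum(amount for _, _, amount in hits) / total_revenue
    return frequency, recency, monetary


def is_rfm_pattern(pattern, transactions, min_freq, min_day, min_share):
    """Check all three (assumed) thresholds at once."""
    frequency, recency, monetary = rfm_scores(pattern, transactions)
    return (
        frequency >= min_freq
        and recency is not None
        and recency >= min_day
        and monetary >= min_share
    )


# Hypothetical transactions without any customer identifier.
transactions = [
    (1, {"tea", "cup"}, 30.0),
    (2, {"tea"}, 10.0),
    (8, {"tea", "cup"}, 40.0),
    (9, {"pen"}, 20.0),
]
tea_cup = rfm_scores({"tea", "cup"}, transactions)
```

Note that no customer field is needed anywhere in this sketch, which mirrors the setting the abstract describes.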

Journal of Systems and Software | 2013

An efficient tree-based algorithm for mining sequential patterns with multiple minimum supports

Ya-Han Hu; Fan Wu; Yi-Jiun Liao

Sequential pattern mining (SPM) is an important technique for determining time-related behavior in sequence databases. In real-life applications, the frequencies for various items in a sequence database are not exactly equal. If all items are set with the same minimum support, the rare item problem may result, meaning that we are unable to effectively retrieve interesting patterns regardless of whether minsup is set too high or too low. Liu (2006) first introduced the concept of multiple minimum supports (MMSs) into SPM. It allows users to specify the minimum item support (MIS) for each item according to its natural frequency. A generalized sequential pattern-based algorithm, named Multiple Supports - Generalized Sequential Pattern (MS-GSP), was also developed to mine the complete set of sequential patterns. However, MS-GSP adopts a candidate generate-and-test approach, which has been recognized as a costly and time-consuming method in pattern discovery. For the efficient mining of sequential patterns with MMSs, this study first proposes a compact data structure, called a Preorder Linked Multiple Supports tree (PLMS-tree), to store and compress the entire sequence database. Based on a PLMS-tree, we develop an efficient algorithm, Multiple Supports - Conditional Pattern growth (MSCP-growth), to discover the complete set of patterns. The experimental results show that the proposed approach yields preferable results compared with MS-GSP and conventional SPM.
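The rare item problem and the effect of per-item thresholds can be shown with a short sketch. It assumes that a pattern's threshold is the lowest MIS among its items, which is how the multiple-minimum-support model is commonly stated, and it simplifies each sequence element to a single item. The helper names and the toy database are hypothetical; the code does a naive scan and is neither MS-GSP nor MSCP-growth.

```python
def is_subsequence(pattern, sequence):
    """True if the items of `pattern` appear in `sequence` in the same order."""
    remaining = iter(sequence)
    # `item in remaining` consumes the iterator up to the first match,
    # so later items must occur after earlier ones.
    return all(item in remaining for item in pattern)


def support(pattern, database):
    """Fraction of sequences in `database` that contain `pattern`."""
    return sum(is_subsequence(pattern, s) for s in database) / len(database)


def is_frequent(pattern, database, mis):
    """Compare the pattern's support with the lowest MIS among its items."""
    threshold = min(mis[item] for item in pattern)  # assumed rule
    return support(pattern, database) >= threshold


# Hypothetical data: "caviar" is rare, so it gets a lower MIS.
database = [
    ["bread", "milk", "caviar"],
    ["bread", "milk"],
    ["milk", "bread"],
    ["bread", "caviar"],
    ["bread", "milk"],
]
mis = {"bread": 0.6, "milk": 0.6, "caviar": 0.3}
rare_pattern_found = is_frequent(["bread", "caviar"], database, mis)
```

With a single minimum support of 0.6 the pattern bread followed by caviar (support 0.4) would be discarded, while lowering the single threshold to 0.3 for every item would admit many uninteresting patterns among the common items.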

Journal of Systems and Software | 2013

Knowledge discovery of weighted RFM sequential patterns from customer sequence databases

Ya-Han Hu; Tony Cheng-Kui Huang; Yu-Hua Kao

In today's business environment, there is tremendous interest in the mining of interesting patterns for superior decision making. Although many successful customer relationship management (CRM) applications have been developed based on sequential pattern mining techniques, they basically assume that the importance of each customer is the same. Previous studies in CRM show that not all customers make the same contribution to a business, and it is indispensable to evaluate customer value before developing effective marketing strategies. Therefore, this study includes the concepts of recency, frequency, and monetary (RFM) analysis in the sequential pattern mining process. For a given subsequence, each customer sequence contributes its own recency, frequency, and monetary scores to represent customer importance. An efficient algorithm is developed to discover sequential patterns with high recency, frequency, and monetary scores. Empirical results show that the proposed method is efficient and can effectively discover more valuable patterns than conventional frequent pattern mining.

Artificial Intelligence in Medicine | 2012

Predicting warfarin dosage from clinical data: A supervised learning approach

Ya-Han Hu; Fan Wu; Chia-Lun Lo; Chun-Tien Tai

OBJECTIVE: Safety of anticoagulant administration has been a primary concern of the Joint Commission on Accreditation of Healthcare Organizations. Among all anticoagulants, warfarin has long been listed among the top ten drugs causing adverse drug events. Due to its narrow therapeutic range and significant side effects, warfarin dosage determination becomes a challenging task in clinical practice. For superior clinical decision making, this study attempts to build a warfarin dosage prediction model utilizing a number of supervised learning techniques. METHODS AND MATERIALS: The data consists of complete historical records of 587 Taiwan clinical cases who received warfarin treatment as well as warfarin dose adjustment. A number of supervised learning techniques were investigated, including multilayer perceptron, model tree, k nearest neighbors, and support vector regression (SVR). To achieve higher prediction accuracy, we further consider both homogeneous and heterogeneous ensembles (i.e., bagging and voting). For performance evaluation, the initial dose of warfarin prescribed by clinicians is established as the baseline. The mean absolute error (MAE) and standard deviation of errors (σ(E)) are considered as evaluation indicators. RESULTS: The overall evaluation results show that all of the learning based systems are significantly more accurate than the baseline (MAE=0.394, σ(E)=0.558). Among all prediction models, both Bagged Voting (MAE=0.210, σ(E)=0.357) with four classifiers and Bagged SVR (MAE=0.210, σ(E)=0.366) are suggested as the two most effective prediction models due to their lower MAE and σ(E). CONCLUSION: The investigated models can not only facilitate clinicians in dosage decision-making, but also help reduce patient risk from adverse drug events.
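The two evaluation indicators named in this abstract can be computed as in the following sketch. It assumes that σ(E) is the population standard deviation of the signed errors (predicted minus actual); the abstract does not say whether signed or absolute errors are used, nor which estimator. The function name `evaluate` and the dose values are hypothetical and unrelated to the study's data.

```python
from statistics import mean, pstdev


def evaluate(predicted, actual):
    """Return (MAE, sigma_E) for paired dose predictions.

    sigma_E is taken here as the population standard deviation of the
    signed errors, which is an assumption of this sketch.
    """
    errors = [p - a for p, a in zip(predicted, actual)]
    mae = mean([abs(e) for e in errors])
    sigma_e = pstdev(errors)
    return mae, sigma_e


# Hypothetical doses, for illustration only.
predicted = [2.5, 3.0, 5.0, 4.0]
actual = [2.0, 3.5, 5.0, 3.0]
mae, sigma_e = evaluate(predicted, actual)
```

A lower MAE indicates predictions that are closer to the reference dose on average, and a lower σ(E) indicates that the size and direction of the errors vary less from case to case.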

Pediatric Emergency Care | 2015

Predicting Factors and Risk Stratification for Return Visits to the Emergency Department Within 72 Hours in Pediatric Patients.

Sheng-Feng Sung; Kang Ernest Liu; Solomon Chih-Cheng Chen; Chia-Lun Lo; Kuei-Chih Lin; Ya-Han Hu

Objectives: A return visit (RV) to the emergency department (ED) is usually used as a quality indicator for EDs. A thorough comprehension of factors affecting RVs is beneficial to enhancing the quality of emergency care. We performed this study to identify pediatric patients at high risk of RVs using readily available characteristics during an ED visit. Methods: We retrospectively collected data of pediatric patients visiting 6 branches of an urban hospital during 2007. Potential variables were analyzed using a multivariable logistic regression analysis to determine factors associated with RVs and a classification and regression tree technique to identify high-risk groups. Results: Of the 35,435 visits from which patients were discharged home, 2291 (6.47%) visits incurred an RV within 72 hours. On multivariable analysis, younger age, weekday visits, diagnoses belonging to the category of symptoms, signs, and ill-defined conditions, and being seen by a female physician were associated with a higher probability of RVs. Children younger than 6.5 years who visited on weekdays or between midnight and 8:00 AM on weekends or holidays had the highest probability of returning to the ED within 72 hours. Conclusions: Our study reexamined several important factors that could affect RVs of pediatric patients to the ED and identified high-risk groups of RVs. Further intervention studies or qualitative research could be targeted on these at-risk groups.
