Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Shih-Wen Ke is active.

Publication


Featured research published by Shih-Wen Ke.


Knowledge-Based Systems | 2015

CANN: An intrusion detection system based on combining cluster centers and nearest neighbors

Wei-Chao Lin; Shih-Wen Ke; Chih-Fong Tsai

The aim of an intrusion detection system (IDS) is to detect various types of malicious network traffic and computer usage that cannot be detected by a conventional firewall. Many IDSs have been developed based on machine learning techniques. Specifically, advanced detection approaches created by combining or integrating multiple learning techniques have shown better detection performance than general single learning techniques. Feature representation is an important factor in pattern classification that facilitates correct classification; however, there have been very few related studies focusing on how to extract more representative features for normal connections and effective detection of attacks. This paper proposes a novel feature representation approach, namely the cluster center and nearest neighbor (CANN) approach. In this approach, two distances are measured and summed: the first is the distance between each data sample and its cluster center, and the second is the distance between the data sample and its nearest neighbor in the same cluster. This new one-dimensional distance-based feature is then used to represent each data sample for intrusion detection by a k-nearest neighbor (k-NN) classifier. The experimental results based on the KDD-Cup 99 dataset show that the CANN classifier not only performs better than or comparably to k-NN and support vector machines trained and tested on the original feature representation, in terms of classification accuracy, detection rates, and false alarms, but also provides high computational efficiency for classifier training and testing (i.e., detection).
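To make the CANN feature concrete, the following is a minimal sketch of the idea in Python, assuming scikit-learn and a synthetic dataset; the cluster count, k, and the choice of searching nearest neighbors within the given set are illustrative simplifications, not the paper's exact setup.

```python
# A sketch of the CANN feature: each sample is reduced to one number,
# d(x, its cluster center) + d(x, its nearest neighbor in the same
# cluster), and a k-NN classifier is trained on that 1-D feature.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def cann_feature(X, kmeans):
    labels = kmeans.predict(X)
    d_center = np.linalg.norm(X - kmeans.cluster_centers_[labels], axis=1)
    d_nn = np.zeros(len(X))
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        if len(idx) < 2:
            continue  # a singleton cluster has no within-cluster neighbor
        pts = X[idx]
        D = np.linalg.norm(pts[:, None] - pts[None, :], axis=2)
        np.fill_diagonal(D, np.inf)  # ignore self-distance
        d_nn[idx] = D.min(axis=1)
    # the new one-dimensional distance-based feature
    return (d_center + d_nn).reshape(-1, 1)

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X_tr)
knn = KNeighborsClassifier(n_neighbors=5).fit(cann_feature(X_tr, km), y_tr)
print("accuracy:", knn.score(cann_feature(X_te, km), y_te))
```

Because the classifier then operates on a single scalar per sample, the k-NN training and detection steps become very cheap, which is where the computational efficiency claimed above comes from.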


Journal of Systems and Software | 2016

Big data mining with parallel computing

Chih-Fong Tsai; Wei-Chao Lin; Shih-Wen Ke

Highlights: The performances of the distributed and MapReduce methodologies over big datasets are compared, examining in particular their mining accuracy and efficiency. The MapReduce based procedure performs very stably across different numbers of nodes and requires the least computational cost to process big datasets.

Mining with big data, or big data mining, has become an active research area. It is very difficult for current methodologies and data mining software tools running on a single personal computer to efficiently deal with very large datasets. Parallel and cloud computing platforms are considered a better solution for big data mining. The concept of parallel computing is based on dividing a large problem into smaller ones, each of which is carried out by a single processor individually; these processes are performed concurrently in a distributed and parallel manner. There are two common methodologies used to tackle the big data problem. The first is the distributed procedure based on the data parallelism paradigm, where a given big dataset is manually divided into n subsets and n algorithms are executed on the corresponding n subsets, with the final result obtained by combining the outputs produced by the n algorithms. The second is the MapReduce based procedure under the cloud computing platform, composed of the map and reduce processes, in which the former performs filtering and sorting and the latter performs a summary operation to produce the final result. In this paper, we aim to compare the performance differences between the distributed and MapReduce methodologies over large scale datasets in terms of mining accuracy and efficiency. The experiments are based on four large scale datasets used for data classification problems. The results show that the classification performance of the MapReduce based procedure is very stable regardless of the number of computer nodes used, and better than the baseline single machine and distributed procedures except on the class imbalance dataset. In addition, the MapReduce procedure requires the least computational cost to process these big datasets.
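As an illustration of the first, data-parallel methodology, here is a minimal sketch assuming scikit-learn, a synthetic dataset, and decision trees as the per-subset learner; the number of partitions and the majority-vote combiner are illustrative choices.

```python
# The distributed, data-parallel procedure: split the training data into
# n subsets, fit one model per subset (these fits could run on separate
# processors), and combine the n predictions by majority vote.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=10000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

n = 4  # number of partitions / processors
models = [DecisionTreeClassifier(random_state=0).fit(Xi, yi)
          for Xi, yi in zip(np.array_split(X_tr, n),
                            np.array_split(y_tr, n))]

votes = np.stack([m.predict(X_te) for m in models])
y_pred = (votes.mean(axis=0) > 0.5).astype(int)  # majority vote, 0/1 labels
print("accuracy:", (y_pred == y_te).mean())
```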


PLOS ONE | 2017

SVM and SVM Ensembles in Breast Cancer Prediction

Min-Wei Huang; Chih-Wen Chen; Wei-Chao Lin; Shih-Wen Ke; Chih-Fong Tsai

Breast cancer is an all too common disease in women, making its effective prediction an active research problem. A number of statistical and machine learning techniques have been employed to develop various breast cancer prediction models. Among them, support vector machines (SVM) have been shown to outperform many related techniques. To construct the SVM classifier, it is first necessary to decide on the kernel function, and different kernel functions can result in different prediction performance. However, there have been very few studies focused on examining the prediction performance of SVM based on different kernel functions. Moreover, it is unknown whether SVM classifier ensembles, which have been proposed to improve the performance of single classifiers, can outperform single SVM classifiers in terms of breast cancer prediction. Therefore, the aim of this paper is to fully assess the prediction performance of SVM and SVM ensembles over small and large scale breast cancer datasets. The classification accuracy, ROC, F-measure, and computational times of training SVM and SVM ensembles are compared. The experimental results show that linear kernel based SVM ensembles using the bagging method and RBF kernel based SVM ensembles using the boosting method are the better choices for a small scale dataset, where feature selection should be performed in the data pre-processing stage. For a large scale dataset, RBF kernel based SVM ensembles based on boosting perform better than the other classifiers.
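A minimal sketch of the two ensemble styles compared above, assuming scikit-learn >= 1.2 (for the estimator argument) and using the small-scale Wisconsin breast cancer data bundled with scikit-learn; ensemble sizes and kernel parameters are illustrative rather than the paper's tuned settings.

```python
# The two ensemble styles contrasted above: bagging over linear-kernel
# SVMs and boosting over RBF-kernel SVMs, evaluated by cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

bag_linear = make_pipeline(
    StandardScaler(),
    BaggingClassifier(estimator=SVC(kernel="linear"),
                      n_estimators=10, random_state=0))
# probability=True lets boosting variants that need class probabilities work
boost_rbf = make_pipeline(
    StandardScaler(),
    AdaBoostClassifier(estimator=SVC(kernel="rbf", probability=True),
                       n_estimators=10, random_state=0))

for name, clf in [("bagged linear-SVM", bag_linear),
                  ("boosted RBF-SVM", boost_rbf)]:
    print(name, cross_val_score(clf, X, y, cv=5).mean())
```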


SpringerPlus | 2016

The distance function effect on k-nearest neighbor classification for medical datasets

Li-Yu Hu; Min-Wei Huang; Shih-Wen Ke; Chih-Fong Tsai

Introduction: K-nearest neighbor (k-NN) classification is a conventional non-parametric classifier that has been used as the baseline classifier in many pattern classification problems. It is based on measuring the distances between the test data and each of the training data to decide the final classification output.

Case description: Although the Euclidean distance function is the most widely used distance metric in k-NN, no study has examined the classification performance of k-NN with different distance functions, especially for various medical domain problems. Therefore, the aim of this paper is to investigate whether the distance function affects k-NN performance over different medical datasets. Our experiments are based on three different types of medical datasets containing categorical, numerical, and mixed types of data, and four different distance functions, Euclidean, cosine, Chi square, and Minkowsky, are used individually during k-NN classification.

Discussion and evaluation: The experimental results show that the Chi square distance function is the best choice for the three different types of datasets, whereas the cosine, Euclidean, and Minkowsky distance functions perform the worst over the mixed type of datasets.

Conclusions: In this paper, we demonstrate that the chosen distance function can affect the classification accuracy of the k-NN classifier. For medical domain datasets containing categorical, numerical, and mixed types of data, k-NN based on the Chi square distance function performs the best.
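A minimal sketch of k-NN under the four distance functions, assuming scikit-learn; the chi-square definition below is one common symmetric variant (valid for non-negative features), and the dataset and subsampling are illustrative choices to keep the Python-level metric fast.

```python
# k-NN classification under four distance functions. The chi-square
# distance assumes non-negative feature values, which holds for the
# pixel-count features used here.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def chi_square(x, y, eps=1e-10):
    # symmetric chi-square distance; eps guards against division by zero
    return np.sum((x - y) ** 2 / (x + y + eps))

X, y = load_digits(return_X_y=True)
X, y = X[:500], y[:500]  # subsample so the Python-level metric stays fast

settings = {
    "euclidean":       dict(metric="euclidean"),
    "cosine":          dict(metric="cosine", algorithm="brute"),
    "minkowsky (p=3)": dict(metric="minkowski", p=3),
    "chi square":      dict(metric=chi_square, algorithm="brute"),
}
for name, kwargs in settings.items():
    knn = KNeighborsClassifier(n_neighbors=5, **kwargs)
    print(name, cross_val_score(knn, X, y, cv=3).mean())
```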


Journal of Systems and Software | 2015

Learning to detect representative data for large scale instance selection

Wei-Chao Lin; Chih-Fong Tsai; Shih-Wen Ke; Chia-Wen Hung; William Eberle

Instance selection is an important data pre-processing step in the knowledge discovery process. However, the dataset sizes of various domain problems are usually very large, and some are even non-stationary, composed of both old data and a large amount of new data samples. Current algorithms for solving this type of scalability problem have certain limitations, requiring a very high computational cost over very large scale datasets during instance selection. To this end, we introduce the ReDD (Representative Data Detection) approach, which is based on outlier pattern analysis and prediction. First, a machine learning model, or detector, is used to learn the patterns of (un)representative data selected by a specific instance selection method from a small amount of training data. The detector can then be used to detect the rest of the large amount of training data, or newly added data. We empirically evaluate ReDD over 50 domain datasets to examine the effectiveness of the learned detector, using four very large scale datasets for validation. The experimental results show that ReDD not only reduces the computational cost by a factor of nearly two or three compared with three baselines, but also maintains the final classification accuracy.
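The paper's exact instance selection method is not reproduced here, so the following sketch substitutes edited nearest neighbour (ENN) for "a specific instance selection method" and a random forest for the detector, on synthetic data, just to show the learn-then-detect workflow.

```python
# Learn-then-detect: run real instance selection on a small subset,
# train a detector on the resulting keep/discard labels, then apply the
# cheap detector to the rest of the data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

def enn_keep_mask(X, y, k=3):
    """Edited nearest neighbour: keep a sample iff its class agrees with
    the majority of its k nearest neighbours."""
    nn = KNeighborsClassifier(n_neighbors=k + 1).fit(X, y)
    idx = nn.kneighbors(X, return_distance=False)[:, 1:]  # drop self
    return (y[idx] == y[:, None]).mean(axis=1) >= 0.5

X, y = make_classification(n_samples=5000, n_features=20, flip_y=0.1,
                           random_state=0)
small, rest = np.arange(500), np.arange(500, 5000)

keep = enn_keep_mask(X[small], y[small])    # 1) real selection, small subset
detector = RandomForestClassifier(random_state=0).fit(X[small], keep)  # 2)
keep_rest = detector.predict(X[rest])       # 3) cheap detection on the rest
print("fraction of remaining data kept:", keep_rest.mean())
```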


Information Sciences | 2016

Keypoint selection for efficient bag-of-words feature generation and effective image classification

Wei-Chao Lin; Chih-Fong Tsai; Zong-Yao Chen; Shih-Wen Ke

One of the most popular image representations for image classification is based on the bag-of-words (BoW) features. However, the number of keypoints that need to be detected from images to generate the BoW features is usually very large, which causes two problems. First, the computational cost during the vector quantization step is high. Second, some of the detected keypoints are not helpful for recognition. To resolve these limitations, we introduce a framework, called iterative keypoint selection (IKS), with which to select representative keypoints for accelerating the computational time to generate the BoW features, leading to more discriminative feature representation. Each iteration in IKS comprises two steps. In the first step, some representative keypoint(s) are identified from each image. Then, keypoints are filtered out if their distances to the identified representative keypoint(s) are less than a pre-defined distance. The iteration process continues until no unrepresentative keypoints can be found. Two specific approaches are proposed to perform the first step of IKS: IKS1 randomly selects one representative keypoint, whereas IKS2 is based on a clustering algorithm in which the representative keypoints are the points closest to their cluster centers. Experiments carried out on the Caltech 101, Caltech 256, and PASCAL 2007 datasets demonstrate that performing keypoint selection using IKS1 and IKS2 to generate both the BoW and spatial-based BoW features allows the support vector machine (SVM) classifier to provide better classification accuracy than the baseline features without keypoint selection. However, it is found that the computational cost of IKS1 is higher than that of the baseline methods. On the other hand, IKS2 not only efficiently generates the BoW and spatial-based features, reducing the computational time for vector quantization over these datasets, but also provides better classification results than IKS1 over the PASCAL 2007 and Caltech 256 datasets.
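A minimal sketch of the IKS2-style loop on random stand-in descriptors; the cluster count, radius, and termination condition are illustrative assumptions rather than the paper's settings.

```python
# IKS2-style selection: repeatedly cluster the remaining keypoints, keep
# the point closest to each cluster centre, and drop all keypoints
# within `radius` of a kept one, until (almost) none remain.
import numpy as np
from sklearn.cluster import KMeans

def iks2_select(desc, n_clusters=5, radius=8.0):
    selected, remaining = [], desc.copy()
    while len(remaining) > n_clusters:
        km = KMeans(n_clusters=n_clusters, n_init=5,
                    random_state=0).fit(remaining)
        # representative keypoint = closest remaining point to each centre
        reps = np.array([remaining[np.linalg.norm(remaining - c,
                                                  axis=1).argmin()]
                         for c in km.cluster_centers_])
        selected.append(reps)
        # filter out keypoints near (and including) the representatives
        d = np.linalg.norm(remaining[:, None] - reps[None, :],
                           axis=2).min(axis=1)
        remaining = remaining[d > radius]
    if len(remaining):
        selected.append(remaining)
    return np.vstack(selected)

rng = np.random.default_rng(0)
desc = rng.normal(size=(2000, 64))  # stand-in for SIFT-like descriptors
kept = iks2_select(desc)
print("kept", len(kept), "of", len(desc), "keypoints")
```

The kept keypoints would then feed the usual vector quantization step, which is cheaper because far fewer descriptors remain.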


Soft Computing | 2015

Instance selection by genetic-based biological algorithm

Zong-Yao Chen; Chih-Fong Tsai; William Eberle; Wei-Chao Lin; Shih-Wen Ke

Instance selection is an important research problem of data pre-processing in the data mining field. The aim of instance selection is to reduce the data size by filtering out noisy data, which may degrade the mining performance, from a given dataset. Genetic algorithms have proven to be an effective instance selection approach for improving the performance of data mining algorithms. However, current approaches only pursue the simplest evolutionary process based on the most reasonable and simplest rules. In this paper, we introduce a novel instance selection algorithm, namely a genetic-based biological algorithm (GBA). GBA fits "biological evolution" into the evolutionary process, in which the most streamlined process also complies with reasonable rules; in other words, after long-term evolution, organisms find the most efficient way to allocate resources and evolve. Consequently, we can closely simulate natural evolution in an algorithm, such that the algorithm is both efficient and effective. Our experiments compare GBA with five state-of-the-art algorithms over 50 different domain datasets from the UCI Machine Learning Repository. The experimental results demonstrate that GBA outperforms these baselines, providing the lowest classification error rate and the least storage requirement. Moreover, GBA is computationally efficient, requiring only a slightly larger computational cost than GA.
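GBA's biological operators are not specified in this abstract, so the sketch below shows only the plain GA baseline it extends: a binary mask over training instances evolved under a fitness that trades 1-NN validation accuracy against storage. All parameters are illustrative.

```python
# A plain GA for instance selection: each chromosome is a boolean mask
# over the training set; fitness rewards 1-NN validation accuracy and
# penalizes the fraction of instances retained.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

def fitness(mask):
    if mask.sum() < 2:
        return 0.0
    knn = KNeighborsClassifier(n_neighbors=1).fit(X_tr[mask], y_tr[mask])
    return knn.score(X_val, y_val) - 0.1 * mask.mean()

pop = rng.random((20, len(X_tr))) < 0.5            # initial population
for generation in range(30):
    scores = np.array([fitness(m) for m in pop])
    parents = pop[np.argsort(scores)[-10:]]        # truncation selection
    children = []
    for _ in range(len(pop)):
        a, b = parents[rng.integers(10, size=2)]
        cut = rng.integers(len(X_tr))              # one-point crossover
        child = np.concatenate([a[:cut], b[cut:]])
        flip = rng.random(len(child)) < 0.01       # bit-flip mutation
        children.append(np.where(flip, ~child, child))
    pop = np.array(children)

best = pop[np.argmax([fitness(m) for m in pop])]
print("fraction of instances kept:", best.mean())
```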


Kybernetes | 2014

Dimensionality and data reduction in telecom churn prediction

Wei-Chao Lin; Chih-Fong Tsai; Shih-Wen Ke

Purpose – Churn prediction is a very important task for successful customer relationship management. In general, churn prediction can be achieved by many data mining techniques. However, during data mining, dimensionality reduction (or feature selection) and data reduction are two important data preprocessing steps. In particular, the aims of feature selection and data reduction are to filter out irrelevant features and noisy data samples, respectively. The purpose of performing these data preprocessing tasks is to enable the mining algorithm to produce good quality mining results. Design/methodology/approach – Based on a real telecom customer churn data set, seven different data sets, preprocessed by performing feature selection and data reduction with different priorities, are used to train artificial neural networks as the churn prediction model. Findings – The results show that performing data reduction first by self-organizing maps and feature selection second by principal component ...
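A minimal sketch of the "data reduction first, feature reduction second" ordering, with loudly labeled substitutions: scikit-learn ships no self-organizing map, so KMeans prototypes stand in for the SOM, PCA plays the feature reduction step, and a synthetic table replaces the telecom data.

```python
# Data reduction first (prototype-based outlier filtering, standing in
# for the SOM), feature reduction second (PCA), then an MLP as the
# churn prediction model.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=3000, n_features=30, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 1) data reduction: drop samples far from their cluster prototype
km = KMeans(n_clusters=20, n_init=10, random_state=0).fit(X_tr)
dist = np.linalg.norm(X_tr - km.cluster_centers_[km.labels_], axis=1)
keep = dist < np.quantile(dist, 0.8)  # discard the noisiest 20%
X_red, y_red = X_tr[keep], y_tr[keep]

# 2) dimensionality reduction, fitted on the reduced data only
pca = PCA(n_components=10).fit(X_red)

# 3) churn prediction model
mlp = MLPClassifier(random_state=0, max_iter=500).fit(
    pca.transform(X_red), y_red)
print("accuracy:", mlp.score(pca.transform(X_te), y_te))
```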


Expert Systems | 2016

Data preprocessing issues for incomplete medical datasets

Min-Wei Huang; Wei-Chao Lin; Chih-Wen Chen; Shih-Wen Ke; Chih-Fong Tsai; William Eberle

While there is an ample amount of medical information available for data mining, many of the datasets are unfortunately incomplete, missing relevant values needed by many machine learning algorithms. Several approaches have been proposed for the imputation of missing values, using various reasoning steps to provide estimations from the observed data. One of the important steps in data mining is data preprocessing, where unrepresentative data are filtered out of the data to be mined. However, none of the related studies on missing value imputation considers performing a data preprocessing step before imputation. Therefore, the aim of this study is to examine the effect of two preprocessing steps, feature and instance selection, on missing value imputation. Specifically, eight different medical-related datasets are used, containing categorical, numerical, and mixed types of data. Our experimental results show that imputation after instance selection can produce better classification performance than imputation alone. In addition, we demonstrate that imputation after feature selection does not have a positive impact on the imputation result.
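A minimal sketch of "instance selection before imputation", assuming scikit-learn and synthetic data: ENN-style filtering is applied to the complete cases, and the k-NN imputer then draws neighbours only from the selected instances. The missingness rate and parameters are illustrative.

```python
# Instance selection before imputation: filter the complete cases with
# an ENN-style rule, then impute missing entries using only the
# selected instances as the neighbour pool.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import KNNImputer
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_miss = X.copy()
X_miss[rng.random(X.shape) < 0.1] = np.nan  # inject 10% missing values

complete = ~np.isnan(X_miss).any(axis=1)
Xc, yc = X_miss[complete], y[complete]

# ENN: keep complete cases whose class agrees with the majority of
# their 3 nearest complete-case neighbours
nn = KNeighborsClassifier(n_neighbors=4).fit(Xc, yc)
idx = nn.kneighbors(Xc, return_distance=False)[:, 1:]
keep = (yc[idx] == yc[:, None]).mean(axis=1) >= 0.5

imputer = KNNImputer(n_neighbors=5).fit(Xc[keep])
X_imputed = imputer.transform(X_miss)
print("NaNs remaining:", int(np.isnan(X_imputed).sum()))
```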


Neurocomputing | 2015

The effect of low-level image features on pseudo relevance feedback

Wei-Chao Lin; Zong-Yao Chen; Shih-Wen Ke; Chih-Fong Tsai; Wei-Yang Lin

Relevance feedback (RF) is a technique popularly used to improve the effectiveness of traditional content-based image retrieval systems. However, users must provide relevant and/or irrelevant images as feedback for their queries, which is a tedious task. To alleviate this problem, pseudo relevance feedback (PRF) can be utilized: it not only automates the manual component of RF, but can also provide reasonably good retrieval performance. Specifically, it is assumed that a fraction of the top-ranked images in the initial search results are pseudo-positive. The Rocchio algorithm is a classic approach for the implementation of RF/PRF based on the query vector modification discipline; the aim is to produce a new query vector by taking the weighted sum of the original query and the mean vectors of the relevant and irrelevant sets. Image feature representation is the key factor affecting PRF performance. This study is the first to examine the retrieval performance of 63 different image feature descriptors, ranging from 64 to 10426 dimensions, in the context of PRF. Experimental results are obtained on the NUS-WIDE dataset, which contains 22156 Flickr images associated with 69 concepts. It is shown that the combination of color moments, edges, wavelet textures, and locality-constrained linear coding of the bag-of-words model provides the optimal feature representation, giving relatively good retrieval effectiveness and reasonably good retrieval efficiency for Rocchio based PRF.
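The Rocchio update itself is standard, so a compact sketch is possible; the cosine similarity measure, top-k cutoff, and alpha/beta/gamma weights below are common illustrative defaults, and random vectors stand in for the image feature descriptors.

```python
# Rocchio-based PRF: rank the collection by cosine similarity, treat the
# top-k results as pseudo-relevant and the rest as irrelevant, and move
# the query toward the relevant mean and away from the irrelevant mean.
import numpy as np

def rocchio_prf(query, db, k=5, alpha=1.0, beta=0.75, gamma=0.15):
    sims = db @ query / (np.linalg.norm(db, axis=1)
                         * np.linalg.norm(query) + 1e-10)
    order = np.argsort(-sims)
    rel, nonrel = db[order[:k]], db[order[k:]]
    # q' = alpha*q + beta*mean(relevant) - gamma*mean(irrelevant)
    return (alpha * query + beta * rel.mean(axis=0)
            - gamma * nonrel.mean(axis=0))

rng = np.random.default_rng(0)
db = rng.random((1000, 64))  # stand-in image feature vectors
q = rng.random(64)
q_new = rocchio_prf(q, db)
print("query moved by:", np.linalg.norm(q_new - q))
```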

Collaboration


Dive into Shih-Wen Ke's collaborations.

Top Co-Authors

Chih-Fong Tsai, National Central University
Zong-Yao Chen, National Central University
Ya-Han Hu, National Chung Cheng University
Chih-Wen Chen, Kaohsiung Medical University
William Eberle, Tennessee Technological University
Cheng-Che Shen, National Chung Cheng University
Ming-Yi Lin, National Central University
Wei-Yang Lin, National Chung Cheng University