Hsing-Kuo Pao

National Taiwan University of Science and Technology

Publication

Featured research published by Hsing-Kuo Pao.

international conference on big data | 2013

Malicious URL filtering — A big data application

Min-Sheng Lin; Chien-Yi Chiu; Yuh-Jye Lee; Hsing-Kuo Pao

Malicious URLs have become a channel for Internet criminal activities such as drive-by-download, spamming and phishing. Applications for the detection of malicious URLs are accurate but slow (because they need to download the content or query some Internet host information). In this paper we present a novel lightweight filter based only on the URL string itself to use before existing processing methods. We run experiments on a large dataset and demonstrate a 75% reduction in workload size while retaining at least 90% of malicious URLs. Existing methods do not scale well with the hundreds of millions of URLs encountered every day as the problem is a heavily-imbalanced, large-scale binary classification problem. Our proposed method is able to handle nearly two million URLs in less than five minutes. We generate two filtering models by using lexical features and descriptive features, and then combine the filtering results. The on-line learning algorithms are applied here not only for dealing with large-scale data sets but also for fitting the very short lifetime characteristics of malicious URLs. Our filter can significantly reduce the volume of URL queries on which further analysis needs to be performed, saving both computing time and bandwidth used for content retrieval.
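
As a rough illustration of filtering on the URL string alone with an online learner, the sketch below trains a simple perceptron on word-like tokens split out of each URL. The perceptron, the tokenizer, the class name `OnlineUrlFilter`, and the toy URLs are all assumptions made for this example; the abstract does not say which online algorithm or which lexical and descriptive features the authors use.

```python
import re
from typing import Dict, List


def lexical_tokens(url: str) -> List[str]:
    """Split a URL into lowercase alphanumeric tokens (an assumed feature set)."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]


class OnlineUrlFilter:
    """Hypothetical online perceptron over bag-of-token URL features."""

    def __init__(self) -> None:
        self.weights: Dict[str, float] = {}
        self.bias = 0.0

    def score(self, url: str) -> float:
        # Positive score: suspicious, forward to the slower content-based analysis.
        tokens = set(lexical_tokens(url))
        return self.bias + sum(self.weights.get(t, 0.0) for t in tokens)

    def update(self, url: str, label: int) -> None:
        # label is +1 (malicious) or -1 (benign); weights change only on a mistake.
        if label * self.score(url) <= 0:
            for t in set(lexical_tokens(url)):
                self.weights[t] = self.weights.get(t, 0.0) + label
            self.bias += label


# Toy stream of labeled URLs (made up for illustration).
stream = [
    ("http://paypa1-secure.example.test/login/verify.php", 1),
    ("http://example.org/docs/index.html", -1),
    ("http://free-prizes.example.test/login/claim.php", 1),
    ("http://example.org/blog/post.html", -1),
]

url_filter = OnlineUrlFilter()
for _ in range(2):  # two passes over the toy stream
    for url, label in stream:
        url_filter.update(url, label)

suspicious = url_filter.score("http://example.test/login/update.php") > 0
harmless = url_filter.score("http://example.org/news/index.html") <= 0
print(suspicious, harmless)
```

Because each update touches only the tokens of one URL, a model of this kind can be refreshed continuously as new labeled URLs arrive, which is the property the abstract associates with the short lifetime of malicious URLs.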

Archive | 2009

Adaptive Alarm Filtering by Causal Correlation Consideration in Intrusion Detection

Heng-Sheng Lin; Hsing-Kuo Pao; Ching-Hao Mao; Hahn-Ming Lee; Tsuhan Chen; Yuh-Jye Lee

One of the main difficulties in most modern Intrusion Detection Systems is the massive number of alarms they generate. The alarms may be false alarms, wrongly raised by an overly sensitive model, or duplicated alarms, issued by different intrusion detectors or at different times for the same attack. We focus on a learning-based alarm filtering system. The system takes alarms as input, which may come from several intrusion detectors or be issued at different times, as in multi-step attacks. The goal is to filter those alarms with high accuracy while keeping enough representative capability, so that the number of false and duplicated alarms is reduced and the effort required from alarm analysts is significantly lowered. To achieve that, we consider the causal correlation between relevant alarms in the temporal domain and re-label each alarm as a false alarm, a duplicated alarm, or a representative true alarm. Recognizing the importance of causal correlation can also help us find novel attacks. In addition, our system can deal with frequent changes in the network environment: the framework judges attacks adaptively, using an ensemble of classifiers for the purpose. Accordingly, we propose a system consisting mainly of two components: one performs alarm filtering to reduce the number of false and duplicated alarms, and the other is an ensemble-based adaptive learner that adapts to environment changes through automatic tuning given expert feedback. We evaluate the system on two datasets.

web intelligence | 2012

Malicious URL Detection Based on Kolmogorov Complexity Estimation

Hsing-Kuo Pao; Yan-Lin Chou; Yuh-Jye Lee

Malicious URL detection has drawn significant research attention in recent years. It is helpful if we can use the URL string alone to make a preliminary judgment about how dangerous a website is. By doing that, we can save effort on website content analysis and bandwidth for content retrieval. We propose a detection method based on an estimate of the conditional Kolmogorov complexity of URL strings. To overcome the incomputability of Kolmogorov complexity, we approximate it with a compression method; we call the result the conditional Kolmogorov measure. Used as a single feature for detection, it achieves a decent performance that cannot be achieved by any other single feature that we know of. Moreover, the proposed Kolmogorov measure can work together with other features for successful detection. The experiment was conducted on a private dataset from a commercial company that can collect more than one million unclassified URLs in a typical hour. On average, the proposed measure can process such hourly data within a few minutes.
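
A compression-based approximation of conditional complexity can be sketched as follows. The code estimates the cost of a URL given a reference set as the growth in compressed size when the URL is appended to that set. The use of `zlib`, the function names, the difference-of-two-costs score, and the toy URLs are assumptions for illustration; the abstract does not name the compressor or the exact form of the measure.

```python
import zlib
from typing import Sequence


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (a stand-in for complexity)."""
    return len(zlib.compress(data, 9))


def conditional_complexity(url: str, reference: Sequence[str]) -> int:
    """Approximate K(url | reference) as C(reference + url) - C(reference)."""
    base = "\n".join(reference).encode("utf-8") + b"\n"
    return compressed_size(base + url.encode("utf-8")) - compressed_size(base)


def kolmogorov_measure(url: str, benign: Sequence[str], malicious: Sequence[str]) -> int:
    # Positive when the URL compresses better alongside the malicious set,
    # i.e. when it looks more like known malicious URLs than known benign ones.
    return conditional_complexity(url, benign) - conditional_complexity(url, malicious)


# Toy reference sets (made up for illustration).
benign_urls = [
    "http://example.org/index.html",
    "http://example.org/about/team.html",
    "http://example.net/docs/guide.html",
]
malicious_urls = [
    "http://xq7z-login.example.test/verify.php?id=83921&session=aa91fz",
    "http://xq7z-login.example.test/verify.php?id=10477&session=bc02ke",
]

query = "http://xq7z-login.example.test/verify.php?id=83921&session=aa91fz"
print(kolmogorov_measure(query, benign_urls, malicious_urls))
```

The resulting number can be used on its own with a threshold or passed as one feature among others to a classifier, matching the two uses the abstract describes.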

international conference on technologies and applications of artificial intelligence | 2010

An Intrinsic Graphical Signature Based on Alert Correlation Analysis for Intrusion Detection

Hsing-Kuo Pao; Ching-Hao Mao; Hahn-Ming Lee; Chi-Dong Chen; Christos Faloutsos

We propose a graphical signature for intrusion detection given alert sequences. By correlating alerts according to their temporal proximity, we build a probabilistic graph-based model to describe a group of alerts that form an attack or normal behavior. Using the models, we design a pairwise measure based on manifold learning to measure the dissimilarities between different groups of alerts. A large dissimilarity implies different behaviors between the two groups of alerts. Such a measure can therefore be combined with regular classification methods for intrusion detection. We evaluate our framework mainly on Acer 2007, a private dataset gathered from a well-known Security Operation Center in Taiwan. The performance on the real data suggests that the proposed method can achieve high detection accuracy. Moreover, the graphical structures and the representation from manifold learning naturally provide a visualization suitable for further analysis by domain experts.

international conference on big data | 2014

Efficient traffic speed forecasting based on massive heterogenous historical data

Xing-Yu Chen; Hsing-Kuo Pao; Yuh-Jye Lee

Drivers dream of foreseeing traffic conditions so that they can enjoy an efficient driving experience at all times. Given the historical patterns for different locations and different times, we should be able to estimate the traffic speed in the near future. What makes this task difficult and interesting is that, from a massive amount of historical data, we need to select the data that is useful for predicting the traffic speed at the next moment. Moreover, traffic conditions can be highly dynamic, so a reliable prediction requires the most up-to-date model, which implies that frequent retraining is necessary. To tackle the task, we propose a lazy learning approach for traffic speed prediction given massive historical data. The approach integrates kNN and Gaussian process regression for efficient and robust traffic speed prediction. Using a big data framework, kNN helps us select the most informative data for Gaussian process regression. Thanks to recent progress in big data research, processing massive data for prediction in close to real time has become possible. We aim to use a Hadoop framework for prediction given heterogeneous data, including traffic data such as speed, flow, and occupancy, as well as weather data.
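
The combination of kNN selection with Gaussian process regression can be sketched in a few lines. The version below is a single-machine toy, not the Hadoop-based system the abstract describes: the RBF kernel, the noise level, the hour-of-day feature, and all function names are assumptions chosen for illustration.

```python
import math
from typing import List, Sequence


def rbf(a: Sequence[float], b: Sequence[float], length_scale: float) -> float:
    """Squared-exponential kernel between two feature vectors."""
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-d2 / (2.0 * length_scale ** 2))


def solve(A: List[List[float]], b: List[float]) -> List[float]:
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def knn_gp_predict(query, X, y, k=3, noise=1e-3, length_scale=1.0) -> float:
    """Lazy prediction: pick the k nearest records, then fit a GP on them only."""
    nearest = sorted(
        range(len(X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(X[i], query)),
    )[:k]
    Xk = [X[i] for i in nearest]
    yk = [y[i] for i in nearest]
    mean = sum(yk) / len(yk)  # constant prior mean taken from the neighbors
    K = [
        [rbf(a, b, length_scale) + (noise if i == j else 0.0) for j, b in enumerate(Xk)]
        for i, a in enumerate(Xk)
    ]
    alpha = solve(K, [v - mean for v in yk])
    k_star = [rbf(query, a, length_scale) for a in Xk]
    return mean + sum(ks * al for ks, al in zip(k_star, alpha))


# Toy history: (hour of day,) -> observed speed in km/h (made-up numbers).
hours = [(8.0,), (8.5,), (9.0,), (17.0,), (17.5,), (18.0,)]
speeds = [40.0, 35.0, 45.0, 30.0, 25.0, 32.0]
print(knn_gp_predict((8.5,), hours, speeds, k=3, length_scale=0.1))
```

Restricting the Gaussian process to k neighbors keeps the linear solve at size k rather than the size of the full history, which is what makes retraining at every query affordable in a lazy learning setting.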

privacy and security issues in data mining and machine learning | 2010

SBAD: sequence based attack detection via sequence comparison

Ching-Hao Mao; Hsing-Kuo Pao; Christos Faloutsos; Hahn-Ming Lee

Given a stream of time-stamped events, like alerts in a network monitoring setting, how can we isolate a sequence of alerts that form a network attack? We propose a Sequence Based Attack Detection (SBAD) method, which makes the following contributions: (a) it automatically identifies groups of alerts that are frequent; (b) it summarizes them into a suspicious sequence of activity, representing them with graph structures; and (c) it suggests a novel graph-based dissimilarity measure. As a whole, SBAD is able to group suspicious alerts, visualize them, and spot anomalies at the sequence level. Evaluations on three datasets (two benchmark datasets, DARPA 1999 and PKDD 2007, and a private dataset, Acer 2007, gathered from a Security Operation Center in Taiwan) support our approach. The method performs well even without the help of IP and payload information. Because it needs no private information as input, the method is easy to plug into existing systems such as an intrusion detector. In terms of efficiency, the proposed method can deal with large-scale problems, such as processing 300K alerts within 20 minutes on a regular PC.

international conference on technologies and applications of artificial intelligence | 2010

A Passive-Aggressive Algorithm for Semi-supervised Learning

Chien-Chung Chang; Yuh-Jye Lee; Hsing-Kuo Pao

In this paper, we propose a novel semi-supervised learning algorithm, named the passive-aggressive semi-supervised learner, which combines the concepts of passive-aggressive learning, down-weighting, and a multi-view scheme. Our approach performs the labeling and training procedures iteratively. In the labeling procedure, we use two views, known as teacher classifiers, for consensus training to obtain a set of guessed labeled points. In the training procedure, we use the idea of down-weighting to retrain the third view, i.e., the student classifier, on the given initial labeled points and the guessed labeled points. Following the idea of the passive-aggressive algorithm, we also want the retrained classifier to stay as close as possible to the original classifier produced from the initial labeled data. The experimental results show that our method uses only a small portion of the labeled training data points, yet its test accuracy is comparable to that of a purely supervised learning scheme that uses all the labeled data points for training.
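
For readers unfamiliar with the passive-aggressive idea that the abstract builds on, the sketch below shows the standard supervised passive-aggressive update for a linear binary classifier. It is not the semi-supervised learner proposed in the paper: the teacher and student views and the down-weighting step are omitted, and the function name `pa_update` is an assumption.

```python
from typing import List


def pa_update(w: List[float], x: List[float], y: int) -> List[float]:
    """One passive-aggressive step for a linear classifier with label y in {+1, -1}.

    Passive: if the example already has margin >= 1, the weights are unchanged.
    Aggressive: otherwise, move w by the smallest amount that gives margin 1.
    Assumes x is not the all-zero vector.
    """
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    loss = max(0.0, 1.0 - margin)  # hinge loss
    if loss == 0.0:
        return list(w)
    tau = loss / sum(xi * xi for xi in x)  # step size of the closest correcting update
    return [wi + tau * y * xi for wi, xi in zip(w, x)]


w = [0.0, 0.0]
w = pa_update(w, [1.0, 2.0], 1)    # misclassified: aggressive step
w = pa_update(w, [1.0, 2.0], 1)    # now at margin 1: passive, no change
print(w)
```

The "stay close to the previous classifier" behavior is what the abstract reuses: the retrained student classifier is kept near the one learned from the initial labeled data.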

Archive | 2008

Data Visualization via Kernel Machines

Yuan-chin Ivan Chang; Yuh-Jye Lee; Hsing-Kuo Pao; Mei-Hsien Lee; Su-Yun Huang

Due to the rapid development of information technology in recent years, it is common to encounter enormous amounts of data collected from diverse sources. This has led to a great demand for innovative analytic tools that can handle the kinds of complex data sets that cannot be tackled using traditional statistical methods. Modern data visualization techniques face a similar situation and must also provide adequate solutions.

Archive | 2012

Introduction to Support Vector Machines and Their Applications in Bankruptcy Prognosis

Yuh-Jye Lee; Yi-Ren Yeh; Hsing-Kuo Pao

We aim to provide a comprehensive introduction to Support Vector Machines and their applications in computational finance. Building on advances in statistical learning theory, one of the first SVM algorithms was proposed in the mid-1990s. Since then, SVMs have drawn a lot of research interest in both theoretical and application domains and have become state-of-the-art techniques for solving classification and regression problems. Their success is due not only to their sound theoretical foundation but also to their good generalization performance in many real applications. In this chapter, we address the theoretical, algorithmic and computational issues and try our best to make the article self-contained. Moreover, at the end of the chapter, we present a case study on default prediction. We discuss the issues that arise when SVM algorithms are applied to bankruptcy prognosis, such as how to deal with an unbalanced dataset, how to tune the parameters for better performance, and how to deal with large-scale datasets.

international conference on big data | 2016

Compressed learning for time series classification

Yuh-Jye Lee; Hsing-Kuo Pao; Shueh-Han Shih; Jing-Yao Lin; Xin-Rong Chen

Time series classification has been studied for various applications in recent decades. In the time series classification problem, we decide the class information based on a small piece of the time series inputs. In general, the approaches to time series classification fall into three types: distance-based, model-based, and feature-based. In this research, we focus on feature-based methods, which represent a time series as a set of characteristic values. Quite often, the features generated by existing representation techniques are not transparent to domain experts, and the features selected for classification are not completely interpretable. To solve this problem, we propose a novel time series representation called Envelope. The proposed supervised feature extraction method transforms time series into simple 1/0/-1 values. A heuristic is introduced to determine the most appropriate representation, which includes the features that best discriminate data of different labels. Moreover, the new representation is sparse, which is an essential property when we need to apply compressed sensing techniques. With this advantage, we benefit from high transmission efficiency and from reduced storage requirements and model complexity. We conduct a series of tests on various benchmark time series data to show the effectiveness of the proposed method. Beyond classification effectiveness, we demonstrate how the proposed Envelope method can be used to visualize the similarity between time series of the same and of different kinds.
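
One plausible way to turn a time series into 1/0/-1 values is to compare it, step by step, against a band built from reference series of one class. The sketch below does this with a band of mean plus or minus a multiple of the standard deviation. The band construction, the `width` parameter, and the function names are assumptions for illustration; the abstract does not describe how Envelope is built or how its heuristic selects the representation.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def build_band(reference: Sequence[Sequence[float]], width: float = 1.0) -> Tuple[List[float], List[float]]:
    """Per-time-step lower and upper bounds from equal-length reference series."""
    columns = list(zip(*reference))
    lower = [mean(c) - width * pstdev(c) for c in columns]
    upper = [mean(c) + width * pstdev(c) for c in columns]
    return lower, upper


def envelope_encode(series: Sequence[float], lower: Sequence[float], upper: Sequence[float]) -> List[int]:
    """1 above the band, -1 below it, 0 inside it (the sparse part of the code)."""
    return [
        1 if v > hi else -1 if v < lo else 0
        for v, lo, hi in zip(series, lower, upper)
    ]


# Toy reference class and query series (made-up numbers).
reference_class = [
    [1.0, 2.0, 3.0],
    [1.0, 2.0, 5.0],
    [1.0, 2.0, 4.0],
]
lower, upper = build_band(reference_class, width=1.0)
print(envelope_encode([1.0, 3.0, 2.0], lower, upper))
```

A series that behaves like the reference class encodes mostly to zeros, which illustrates the kind of sparsity the abstract relies on for compressed sensing.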