Jen Wei Huang | Researchain

Publication

Featured research published by Jen Wei Huang.

IEEE Transactions on Knowledge and Data Engineering | 2006

Adaptive Clustering for Multiple Evolving Streams

Bi-Ru Dai; Jen Wei Huang; Mi-Yen Yeh; Ming-Syan Chen

In the data stream environment, the patterns generated at different time instances are different due to data evolution. As time progresses, the behavior and members of clusters usually change. Hence, clustering continuous data streams allows us to observe the changes of group behavior. In order to support flexible clustering requirements, we devise in this paper a clustering on demand framework, abbreviated as COD framework, to dynamically cluster multiple data streams. While providing a general framework of clustering on multiple data streams, the COD framework has two advantageous features, namely, one data scan for online statistics collection and compact multiresolution approximations, which are designed to address, respectively, the time and the space constraints in a data stream environment. The COD framework consists of two phases, i.e., the online maintenance phase and the offline clustering phase. The online maintenance phase provides an efficient mechanism to maintain summary hierarchies of data streams with multiple resolutions in time linear in both the number of streams and the number of data points in each stream. On the other hand, an adaptive clustering algorithm is devised for the offline phase to retrieve approximations of desired substreams from summary hierarchies according to clustering queries. We propose two summarization techniques, based on wavelet and regression analyses, to construct the summary hierarchies. The regression-based summary hierarchy approximates the data stream more precisely and provides better clustering results, at the cost of slightly longer running time than, and twice the storage space of, the wavelet-based one.
An adaptive version of the COD framework is designed to select between a wavelet-based model and a regression-based model for building the summary hierarchy. With the adaptive COD, we can obtain clustering results of almost the same quality as the regression-based COD while using much less storage space for the summary hierarchy. As shown in the complexity analyses and also validated by our empirical studies, the COD framework performs very efficiently in the data stream environment while producing clustering results of very high quality.
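
To make the idea of a multiresolution summary hierarchy concrete, here is a minimal sketch in Python. It is not the COD algorithm from the paper: it builds the hierarchy in one batch pass instead of maintaining it online, it keeps only the pairwise averages (the approximation part of a Haar wavelet decomposition), and it assumes the window length is a power of two. The function names `haar_summary_hierarchy` and `approximate` are hypothetical.

```python
from typing import List


def haar_summary_hierarchy(values: List[float]) -> List[List[float]]:
    """Build levels of pairwise averages, finest resolution first.

    Level 0 is the raw window; each further level halves the resolution.
    Assumption: len(values) is a power of two.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("window length must be a power of two")
    levels = [list(values)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        # Average each adjacent pair to get the next, coarser level.
        levels.append([(prev[i] + prev[i + 1]) / 2 for i in range(0, len(prev), 2)])
    return levels


def approximate(levels: List[List[float]], level: int) -> List[float]:
    """Expand the averages stored at `level` back to the original length."""
    width = 2 ** level  # number of raw points covered by one average
    return [avg for avg in levels[level] for _ in range(width)]
```

For the window `[2.0, 4.0, 6.0, 8.0]`, the hierarchy is `[[2.0, 4.0, 6.0, 8.0], [3.0, 7.0], [5.0]]`, and `approximate(levels, 1)` returns `[3.0, 3.0, 7.0, 7.0]`. A clustering query at a coarser resolution would read fewer stored values, which is the space and time trade-off the abstract describes.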

IEEE Transactions on Knowledge and Data Engineering | 2008

A General Model for Sequential Pattern Mining with a Progressive Database

Jen Wei Huang; Chi Yao Tseng; Jian Chih Ou; Ming-Syan Chen

Although there have been many recent studies on the mining of sequential patterns in a static database and in a database with increasing data, these works, in general, do not fully explore the effect of deleting old data from the sequences in the database. When sequential patterns are generated, the newly arriving patterns may not be identified as frequent sequential patterns due to the existence of old data and sequences. Even worse, obsolete sequential patterns that have not been frequent recently may stay in the reported results. In practice, users are usually more interested in recent data than in old data. To capture the dynamic nature of data addition and deletion, we propose a general model of sequential pattern mining with a progressive database, in which the data may be static, inserted, or deleted. In addition, we present a progressive algorithm, Pisa, which stands for progressive mining of sequential patterns, to progressively discover sequential patterns in a defined period of interest (POI). The POI is a sliding window that advances continuously as time goes by. Pisa utilizes a progressive sequential tree to efficiently maintain the latest data sequences, discover the complete set of up-to-date sequential patterns, and delete obsolete data and patterns accordingly. The height of the proposed sequential pattern tree is bounded by the length of the POI. This effectively limits the memory space required by Pisa, which is significantly smaller than the memory needed by the alternative method, direct appending (DirApp). Note that sequential pattern mining with a static database and with an incremental database are special cases of progressive sequential pattern mining. By changing the start time and end time of the POI, Pisa can easily deal with a static database or an incremental database as well. The complexity of the proposed algorithms is analyzed.
The experimental results show that Pisa not only significantly outperforms the prior methods in execution time by orders of magnitude but also possesses graceful scalability.
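
The sliding-window behavior of the POI can be illustrated with a small Python sketch. This is a naive illustration, not Pisa: it recounts support from scratch on every query instead of maintaining a progressive sequential tree, and it only finds ordered item pairs. The class name `SlidingWindowPairMiner`, its parameters, and the assumption that events for each sequence arrive in timestamp order are all hypothetical.

```python
from collections import defaultdict, deque
from itertools import combinations


class SlidingWindowPairMiner:
    """Count ordered item pairs (a before b) inside a sliding period of interest."""

    def __init__(self, poi: int, min_support: int):
        self.poi = poi                  # window length in timestamps
        self.min_support = min_support  # minimum number of supporting sequences
        self.sequences = defaultdict(deque)  # sequence id -> deque of (timestamp, item)

    def add(self, seq_id, timestamp, item):
        # Assumption: events for a given sequence arrive in timestamp order.
        self.sequences[seq_id].append((timestamp, item))

    def frequent_pairs(self, now):
        """Drop events older than the POI, then return pairs meeting min_support."""
        start = now - self.poi + 1
        support = defaultdict(int)
        for seq_id in list(self.sequences):
            events = self.sequences[seq_id]
            while events and events[0][0] < start:
                events.popleft()        # delete obsolete data
            if not events:
                del self.sequences[seq_id]
                continue
            seen = set()
            for (t1, a), (t2, b) in combinations(events, 2):
                if t1 < t2:             # a strictly precedes b
                    seen.add((a, b))
            for pair in seen:           # each sequence counts at most once per pair
                support[pair] += 1
        return {p: c for p, c in support.items() if c >= self.min_support}
```

With `poi=3` and `min_support=2`, two sequences that each contain A followed by B make `("A", "B")` frequent. Once the window advances far enough that one of those A events falls outside the POI, the pair loses support and disappears from the result, which mirrors the removal of obsolete patterns described above.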

IEEE Transactions on Knowledge and Data Engineering | 2008

Hardware-Enhanced Association Rule Mining with Hashing and Pipelining

Ying Hsiang Wen; Jen Wei Huang; Ming-Syan Chen

Generally speaking, to implement Apriori-based association rule mining in hardware, one has to load candidate itemsets and a database into the hardware. Since the capacity of the hardware architecture is fixed, if the number of candidate itemsets or the number of items in the database is larger than the hardware capacity, the items are loaded into the hardware separately. The time complexity of those steps that need to load candidate itemsets or database items into the hardware is in proportion to the number of candidate itemsets multiplied by the number of items in the database. Too many candidate itemsets and a large database would create a performance bottleneck. In this paper, we propose a HAsh-based and PiPelIned (abbreviated as HAPPI) architecture for hardware-enhanced association rule mining. We apply the pipeline methodology in the HAPPI architecture to compare itemsets with the database and collect useful information for reducing the number of candidate itemsets and items in the database simultaneously. When the database is fed into the hardware, candidate itemsets are compared with the items in the database to find frequent itemsets. At the same time, trimming information is collected from each transaction. In addition, itemsets are generated from transactions and hashed into a hash table. The useful trimming information and the hash table enable us to reduce the number of items in the database and the number of candidate itemsets. Therefore, we can effectively reduce the frequency of loading the database into the hardware. As such, HAPPI solves the bottleneck problem in Apriori-based hardware schemes. We also derive some properties to investigate the performance of this hardware implementation. As shown by the experimental results, HAPPI significantly outperforms the previous hardware approach and the software algorithm in terms of execution time.
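
The hash-table idea can be sketched in software. The Python below is a simplified, software-only illustration of hash-based candidate pruning for 2-itemsets. It does not model the HAPPI hardware pipeline or its transaction trimming. The function name `hash_filtered_candidates`, the bucket count, and the use of `zlib.crc32` as the hash function are assumptions made for the example.

```python
import zlib
from collections import Counter
from itertools import combinations


def hash_filtered_candidates(transactions, min_support, num_buckets=7):
    """Return candidate 2-itemsets that survive a hash-bucket count filter.

    One scan counts single items and hashes every 2-itemset of each
    transaction into a bucket. A pair can only be frequent if both of its
    items are frequent and its bucket count reaches min_support.
    """
    def bucket_of(pair):
        # Deterministic hash; crc32 is an arbitrary choice for this sketch.
        return zlib.crc32("|".join(pair).encode()) % num_buckets

    item_counts = Counter()
    buckets = [0] * num_buckets
    for transaction in transactions:
        items = sorted(set(transaction))
        item_counts.update(items)
        for pair in combinations(items, 2):
            buckets[bucket_of(pair)] += 1

    frequent_items = sorted(i for i, c in item_counts.items() if c >= min_support)
    return {
        pair
        for pair in combinations(frequent_items, 2)
        if buckets[bucket_of(pair)] >= min_support
    }
```

Because several pairs may share a bucket, the filter can keep a pair that is not actually frequent, but it never discards a frequent one. The surviving candidates still need an exact support count in a later pass.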

International Conference on Data Mining | 2004

Clustering on demand for multiple data streams

Bi-Ru Dai; Jen Wei Huang; Mi-Yen Yeh; Ming-Syan Chen

In the data stream environment, the patterns generated by the mining techniques usually differ over time because of the evolution of data. In order to deal with various types of multiple data streams and to support flexible mining requirements, we devise in this paper a clustering on demand framework, abbreviated as COD framework, to dynamically cluster multiple data streams. While providing a general framework of clustering on multiple data streams, the COD framework has two major features, namely one data scan for online statistics collection and compact multiresolution approximations, which are designed to address, respectively, the time and the space constraints in a data stream environment. Furthermore, with the multiresolution approximations of data streams, flexible clustering demands can be supported.

ACM Transactions on Knowledge Discovery From Data | 2007

Twain: Two-end association miner with precise frequent exhibition periods

Jen Wei Huang; Bi-Ru Dai; Ming-Syan Chen

We investigate the general model of mining associations in a temporal database, where the exhibition periods of items are allowed to differ from one another. The database is divided into partitions according to the time granularity imposed. Such temporal association rules allow us to observe short-term but interesting patterns that are absent when the whole range of the database is evaluated altogether. Prior work may omit some temporal association rules and thus has limited practicability. To remedy this and to give more precise frequent exhibition periods of frequent temporal itemsets, we devise an efficient algorithm Twain (standing for TWo end AssocIation miNer). Twain not only generates frequent patterns with more precise frequent exhibition periods, but also discovers more interesting frequent patterns. Twain employs the start time and end time of each item to provide a precise frequent exhibition period while progressively handling itemsets from one partition to another. Along with one scan of the database, Twain can generate frequent 2-itemsets directly according to the cumulative filtering threshold. Then, Twain adopts the scan reduction technique to generate all frequent k-itemsets (k > 2) from the generated frequent 2-itemsets. Theoretical properties of Twain are derived as well in this article. The experimental results show that Twain outperforms the prior works in the quality of frequent patterns, execution time, I/O cost, CPU overhead and scalability.

Knowledge Discovery and Data Mining | 2010

DPSP: distributed progressive sequential pattern mining on the cloud

Jen Wei Huang; Su Chen Lin; Ming-Syan Chen

The progressive sequential pattern mining problem has been discussed in previous research works. With the increasing amount of data, single processors struggle to scale up. Traditional algorithms running on a single machine may have scalability troubles. Therefore, mining progressive sequential patterns intrinsically suffers from the scalability problem. In view of this, we design a distributed mining algorithm to address the scalability problem of mining progressive sequential patterns. The proposed algorithm DPSP, standing for Distributed Progressive Sequential Pattern mining algorithm, is implemented on top of the Hadoop platform, which realizes the cloud computing environment. We propose Map/Reduce jobs in DPSP to delete obsolete itemsets, update current candidate sequential patterns and report up-to-date frequent sequential patterns within each POI. The experimental results show that DPSP possesses great scalability and consequently increases the performance and the practicability of mining algorithms.

ACM Symposium on Applied Computing | 2006

Scheduling dependent items in data broadcasting environments

Hao Ping Hung; Jen Wei Huang; Jung Long Huang; Ming-Syan Chen

Most prior research on data broadcasting assumes that the disseminated items are independent of one another. Since in many applications a mobile user will be interested in more than one item simultaneously, we discuss in this paper the issue of dependency in generating a broadcast program. Algorithm PBA, standing for Placement-Based Allocation, is proposed to generate a broadcast program with high quality and low complexity in the dependent data broadcasting environment. The experimental results show that the proposed placement-based allocation for scheduling dependent items leads to better execution efficiency and solution quality than prior works.

Knowledge Discovery and Data Mining | 2007

ProMail: using progressive email social network for spam detection

Chi Yao Tseng; Jen Wei Huang; Ming-Syan Chen

The spam problem continues to grow drastically. Owing to the ever-changing tricks of spammers, a filtering technique with continual updates is imperative nowadays. In this paper, a server-oriented spam detection system, ProMail, which investigates the human email social network, is presented. According to the recent email interaction and reputation of users, arriving emails can be classified as spam or non-spam (ham). To capture the dynamic email communication, a progressive update scheme is introduced to include the latest arriving emails through the feedback mechanism and delete obsolete ones. This not only effectively limits the memory space, but also keeps the most up-to-date information. For better efficiency, it is not required to sort the scores of each email user and acquire the exact ones. Instead, the reputation procedure, SpGrade, is proposed to accelerate the progressive rating process. In addition, ProMail is able to deal with huge amounts of emails without delaying the delivery time and possesses higher attack resilience against spammers. A real dataset of 1,500,000 emails is used to evaluate the performance of ProMail, and the experimental results show that ProMail is more accurate and efficient.

Sensors | 2016

Transportation Modes Classification Using Sensors on Smartphones

Shih-Hau Fang; Hao Hsiang Liao; Yu Xiang Fei; Kai Hsiang Chen; Jen Wei Huang; Yu Ding Lu; Yu Tsao

This paper investigates the classification of transportation and vehicular modes by using big data from smartphone sensors. The three types of sensors used in this paper are the accelerometer, magnetometer, and gyroscope. This study proposes improved features and uses three machine learning algorithms, namely decision trees, K-nearest neighbor, and support vector machine, to classify the user’s transportation and vehicular modes. In the experiments, we discussed and compared the performance from different perspectives including the accuracy for both modes, the execution time, and the model size. Results show that the proposed features enhance the accuracy, and that the support vector machine provides the best classification accuracy but requires the longest prediction time. This paper also investigates the vehicle classification mode and compares the results with those of the transportation modes.

Business Intelligence for the Real-Time Enterprises | 2008

Simplifying Information Integration: Object-Based Flow-of-Mappings Framework for Integration

Bogdan Alexe; Michael N. Gubanov; Mauricio A. Hernández; C. T. Howard Ho; Jen Wei Huang; Yannis Katsis; Lucian Popa; Barna Saha; Ioana Stanoi

The Clio project at IBM Almaden investigates foundational aspects of data transformation, with particular emphasis on the design and execution of schema mappings. We now use Clio as part of a broader data-flow framework in which mappings are just one component. These data-flows express complex transformations between several source and target schemas and require multiple mappings to be specified. This paper describes research issues we have encountered as we try to create and run these mapping-based data-flows. In particular, we describe how we use Unified Famous Objects (UFOs), a schema abstraction similar to business objects, as our data model, how we reason about flows of mappings over UFOs, and how we create and deploy transformations into different run-time engines.
