Publication


Featured research published by Cheng-Ru Lin.


Conference on Information and Knowledge Management | 2001

Sliding-window filtering: an efficient algorithm for incremental mining

Chang-Hung Lee; Cheng-Ru Lin; Ming-Syan Chen

We explore in this paper an effective sliding-window filtering (abbreviated as SWF) algorithm for incremental mining of association rules. In essence, by partitioning a transaction database into several partitions, algorithm SWF employs a filtering threshold in each partition to deal with candidate itemset generation. Under SWF, the cumulative information from mining previous partitions is selectively carried over toward the generation of candidate itemsets for subsequent partitions. Algorithm SWF not only significantly reduces I/O and CPU cost through cumulative filtering and scan reduction techniques but also effectively controls memory utilization through sliding-window partitioning. Algorithm SWF is particularly powerful for efficient incremental mining of an ongoing time-variant transaction database. By utilizing proper scan reduction techniques, only one scan of the incremented dataset is needed by algorithm SWF. The I/O cost of SWF is orders of magnitude smaller than that required by prior methods, thus resolving the performance bottleneck. Experimental studies are performed to evaluate the performance of algorithm SWF. It is noted that the improvement achieved by algorithm SWF is even more prominent as the incremented portion of the dataset increases and also as the size of the database increases.
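
As a rough illustration of the cumulative-filtering idea, the Python sketch below accumulates candidate 2-itemsets across partitions: a pair is admitted as a candidate only if it is frequent within the partition where it first shows up, and it is pruned as soon as its count, accumulated since that partition, drops below the threshold. The function name and interface are hypothetical and the scan reduction machinery is omitted; this is a sketch of the concept, not the published SWF implementation.

from collections import defaultdict
from itertools import combinations

def swf_candidate_2_itemsets(partitions, min_support):
    # `partitions` is a time-ordered list of partitions, each a list of
    # transactions (sets of items); `min_support` is a fraction in (0, 1].
    candidates = {}  # pair -> (cumulative count, index of start partition)
    sizes = []
    for idx, part in enumerate(partitions):
        sizes.append(len(part))
        counts = defaultdict(int)
        for t in part:
            for pair in combinations(sorted(t), 2):
                counts[pair] += 1
        for pair, c in counts.items():
            if pair in candidates:  # carry cumulative information over
                cnt, start = candidates[pair]
                candidates[pair] = (cnt + c, start)
            elif c >= min_support * len(part):  # locally frequent: admit
                candidates[pair] = (c, idx)
        for pair in list(candidates):  # filtering threshold per partition
            cnt, start = candidates[pair]
            if cnt < min_support * sum(sizes[start:idx + 1]):
                del candidates[pair]
    return candidates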


IEEE Transactions on Knowledge and Data Engineering | 2005

Combining partitional and hierarchical algorithms for robust and efficient data clustering with cohesion self-merging

Cheng-Ru Lin; Ming-Syan Chen

Data clustering has attracted a lot of research attention in the field of computational statistics and data mining. In most related studies, the dissimilarity between two clusters is defined as the distance between their centroids or the distance between their two closest (or farthest) data points. However, all of these measures are vulnerable to outliers, and removing the outliers precisely is yet another difficult task. In view of this, we propose a new similarity measure, referred to as cohesion, to measure the intercluster distances. By using this new measure of cohesion, we have designed a two-phase clustering algorithm, called cohesion-based self-merging (abbreviated as CSM), which runs in time linear in the size of the input data set. Combining the features of partitional and hierarchical clustering methods, algorithm CSM partitions the input data set into several small subclusters in the first phase and then continuously merges the subclusters based on cohesion in a hierarchical manner in the second phase. The time and space complexities of algorithm CSM are analyzed. As shown by our performance studies, the cohesion-based clustering is very robust and possesses excellent tolerance to outliers in various workloads. More importantly, algorithm CSM is shown to be able to cluster data sets of arbitrary shapes very efficiently and to provide better clustering results than prior methods.
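
The two-phase structure is straightforward to prototype: over-partition with a partitional method, then merge hierarchically under a robust inter-cluster similarity. The sketch below uses scikit-learn's k-means for phase one and a simplified stand-in for cohesion (based on the smallest ten percent of cross-cluster point distances); the paper's actual cohesion measure differs, so treat this only as an illustration of the pipeline.

import numpy as np
from sklearn.cluster import KMeans

def csm_like_clustering(X, n_sub=30, k_final=3):
    # Phase 1: over-partition X into many small subclusters with k-means.
    labels = KMeans(n_clusters=n_sub, n_init=10).fit_predict(X)
    clusters = {int(i): X[labels == i] for i in np.unique(labels)}

    def cohesion(a, b):
        # stand-in measure: mean of the smallest 10% of cross distances,
        # negated so that closer (more cohesive) pairs score higher
        d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2).ravel()
        m = max(1, len(d) // 10)
        return -np.sort(d)[:m].mean()

    # Phase 2: hierarchically merge the most cohesive pair until k_final.
    while len(clusters) > k_final:
        i, j = max(((i, j) for i in clusters for j in clusters if i < j),
                   key=lambda p: cohesion(clusters[p[0]], clusters[p[1]]))
        clusters[i] = np.vstack([clusters[i], clusters[j]])
        del clusters[j]
    return clusters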


IEEE Transactions on Knowledge and Data Engineering | 2003

Progressive partition miner: an efficient algorithm for mining general temporal association rules

Chang-Hung Lee; Ming-Syan Chen; Cheng-Ru Lin

We explore a new problem of mining general temporal association rules in publication databases. In essence, a publication database is a set of transactions where each transaction T is a set of items, each of which has an individual exhibition period. The current model of association rule mining is not able to handle publication databases due to two fundamental problems: 1) lack of consideration of the exhibition period of each individual item and 2) lack of an equitable support counting basis for each item. To remedy this, we propose an innovative algorithm, progressive-partition-miner (abbreviated as PPM), to discover general temporal association rules in a publication database. The basic idea of PPM is to first partition the publication database in light of the exhibition periods of items and then progressively accumulate the occurrence count of each candidate 2-itemset based on the intrinsic partitioning characteristics. Algorithm PPM is also designed to employ a filtering threshold in each partition to prune out cumulatively infrequent 2-itemsets early. The feature that the number of candidate 2-itemsets generated by PPM is very close to the number of frequent 2-itemsets allows us to employ the scan reduction technique to effectively reduce the number of database scans. Explicitly, the execution time of PPM is orders of magnitude smaller than that required by other competitive schemes directly extended from existing methods. The correctness of PPM is proven and some of its theoretical properties are derived. Sensitivity analysis of various parameters is conducted to provide many insights into algorithm PPM.
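
The equitable support counting basis can be made concrete as follows: the support of a 2-itemset is measured only over the partitions in which both of its items are on exhibition. The Python sketch below uses a hypothetical interface (item_start maps each item to the partition where its exhibition period begins) and omits PPM's progressive filtering threshold for brevity.

from collections import defaultdict
from itertools import combinations

def ppm_temporal_support(partitions, item_start, min_support):
    # `partitions` is a time-ordered list of transaction lists;
    # `item_start[i]` is the partition index where item i's exhibition
    # period begins (hypothetical representation of the exhibition data)
    counts = defaultdict(int)
    for idx, part in enumerate(partitions):
        for t in part:
            for x, y in combinations(sorted(t), 2):
                if max(item_start[x], item_start[y]) <= idx:
                    counts[(x, y)] += 1
    sizes = [len(p) for p in partitions]
    frequent = {}
    for (x, y), c in counts.items():
        # equitable basis: count support only from the partition where
        # both items are on exhibition, not over the whole database
        base = sum(sizes[max(item_start[x], item_start[y]):])
        if base and c / base >= min_support:
            frequent[(x, y)] = c / base
    return frequent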


International Conference on Data Mining | 2001

On mining general temporal association rules in a publication database

Chang-Hung Lee; Cheng-Ru Lin; Ming-Syan Chen

In this paper, we explore a new problem of mining general temporal association rules in publication databases. In essence, a publication database is a set of transactions where each transaction T is a set of items, each with an individual exhibition period. The current model of association rule mining is not able to handle a publication database due to the following fundamental problems: (1) lack of consideration of the exhibition period of each individual item; and (2) lack of an equitable support counting basis for each item. To remedy this, we propose an innovative algorithm, progressive-partition-miner (PPM), to discover general temporal association rules in a publication database. The basic idea of PPM is to first partition the publication database in light of the exhibition periods of items and then progressively accumulate the occurrence count of each candidate 2-itemset based on the intrinsic partitioning characteristics. PPM is also designed to employ a filtering threshold in each partition to prune out cumulatively infrequent 2-itemsets at an early stage. Explicitly, the execution time of PPM is orders of magnitude smaller than that required by schemes directly extended from existing methods.


Information Systems | 2005

Sliding window filtering: an efficient method for incremental mining on a time-variant database

Chang-Hung Lee; Cheng-Ru Lin; Ming-Syan Chen

Recently, several important database applications have called for the design of efficient techniques for incremental mining of association rules. In response to this need, we explore in this paper an effective sliding-window filtering (abbreviated as SWF) algorithm for incremental mining of association rules. In essence, by partitioning a transaction database into several partitions, algorithm SWF employs a filtering threshold in each partition to deal with candidate itemset generation. Under SWF, the cumulative information from mining previous partitions is selectively carried over toward the generation of candidate itemsets for subsequent partitions. Algorithm SWF not only significantly reduces I/O and CPU cost through cumulative filtering and scan reduction techniques but also effectively controls memory utilization through sliding-window partitioning. More importantly, algorithm SWF is particularly powerful for efficient incremental mining of an ongoing time-variant transaction database. By utilizing proper scan reduction techniques, only one scan of the incremented dataset is needed by algorithm SWF. The I/O cost of SWF is orders of magnitude smaller than that required by prior methods, thus resolving the performance bottleneck. Extensive experimental studies are performed to evaluate the performance of algorithm SWF. Sensitivity analysis of various parameters is conducted to provide many insights into algorithm SWF. It is noted that the improvement achieved by algorithm SWF is even more prominent as the incremented portion of the dataset increases and also as the size of the database increases.


IEEE Transactions on Knowledge and Data Engineering | 2005

Dual clustering: integrating data clustering over optimization and constraint domains

Cheng-Ru Lin; Ken-Hao Liu; Ming-Syan Chen

Spatial clustering has attracted a lot of research attention due to its various applications. In most conventional clustering problems, the similarity measurement mainly takes the geometric attributes into consideration. However, in many real applications, the nongeometric attributes are what users are concerned about. In conventional spatial clustering, the input data set is partitioned into several compact regions, and data points that are similar to one another in their nongeometric attributes may be scattered over different regions, making the corresponding objective difficult to achieve. To remedy this, we propose and explore in this paper a new clustering problem on two domains, called dual clustering, where one domain refers to the optimization domain and the other refers to the constraint domain. Attributes on the optimization domain are those involved in the optimization of the objective function, while those on the constraint domain specify the application-dependent constraints. Our goal is to optimize the objective function in the optimization domain while satisfying the constraint specified in the constraint domain. We devise an efficient and effective algorithm, named Interlaced Clustering-Classification (abbreviated as ICC), to solve this problem. The proposed ICC algorithm combines the information in both domains and iteratively performs a clustering algorithm on the optimization domain and a classification algorithm on the constraint domain to reach the target clustering effectively. The time and space complexities of the ICC algorithm are formally analyzed. Several experiments are conducted to provide insights into the dual clustering problem and the proposed algorithm.
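
The interlacing can be sketched with off-the-shelf components: cluster the optimization-domain attributes, train a classifier on the constraint-domain attributes from the resulting labels, and feed the classifier's predictions back as seeds for the next clustering round. The use of k-means and k-nearest neighbors below is an assumption made for illustration; the published ICC algorithm defines its own clustering and classification steps.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

def icc_like(X_opt, X_con, k=4, n_iter=5):
    # start from a clustering of the optimization-domain attributes
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(X_opt)
    for _ in range(n_iter):
        # classification pass on the constraint domain with current labels
        clf = KNeighborsClassifier(n_neighbors=5).fit(X_con, labels)
        labels = clf.predict(X_con)
        # re-cluster the optimization domain, seeded by the centroids the
        # constraint-domain labels imply (random point if a label vanished)
        seeds = np.array([X_opt[labels == c].mean(axis=0)
                          if np.any(labels == c)
                          else X_opt[np.random.randint(len(X_opt))]
                          for c in range(k)])
        labels = KMeans(n_clusters=k, init=seeds, n_init=1).fit_predict(X_opt)
    return labels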


Knowledge Discovery and Data Mining | 2002

A robust and efficient clustering algorithm based on cohesion self-merging

Cheng-Ru Lin; Ming-Syan Chen

Data clustering has attracted a lot of research attention in the field of computational statistics and data mining. In most related studies, the dissimilarity between two clusters is defined as the distance between their centroids, or the distance between their two closest (or farthest) data points. However, all of these measures are vulnerable to outliers, and removing the outliers precisely is yet another difficult task. In view of this, we propose a new similarity measure, referred to as cohesion, to measure the inter-cluster distances. By using this new measure of cohesion, we design a two-phase clustering algorithm, called cohesion-based self-merging (abbreviated as CSM), which runs in time linear in the size of the input data set. Combining the features of partitional and hierarchical clustering methods, algorithm CSM partitions the input data set into several small subclusters in the first phase, and then continuously merges the subclusters based on cohesion in a hierarchical manner in the second phase. As shown by our performance studies, the cohesion-based clustering is very robust and possesses excellent tolerance to outliers in various workloads. More importantly, algorithm CSM is shown to be able to cluster data sets of arbitrary shapes very efficiently, and to provide better clustering results than prior methods.

Index Terms: data mining, data clustering, hierarchical clustering, partitional clustering


IEEE Transactions on Consumer Electronics | 2004

Design and performance study of rate staggering storage for scalable video in a disk-array-based video server

Xin-Mao Huang; Cheng-Ru Lin; Ming-Syan Chen

This paper provides a data placement method based on rate staggering to store scalable video data in a disk-array-based video server. Scalable video is video coded in such a way that subsets of the full bit stream can be decoded to create lower-quality/resolution videos. Supporting layered multiple resolutions from a video server is very desirable in many applications. Note that in a disk array, the video data corresponding to different rates of the same video clip are not required to reside on the same disk. In view of this, we propose and explore in this paper the approach of rate staggering, i.e., staggering video data in the disk array based on data rates. It is shown that the advantages of the proposed rate staggering method include: (1) minimizing the intermediate buffer space required by the server; (2) achieving better load balancing due to finer scheduling granularity; and (3) alleviating disk bandwidth fragmentation. These advantages enable a video server using the rate staggering method to provide feasible solutions to some video stream requests that cannot be met otherwise. The system throughput can thus be increased. We also conduct several simulations for various applications. These experimental results show that the rate staggering method can significantly improve the performance of a video-on-demand (VOD) system.
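
The placement idea itself is easy to state in code. In the toy table below, conventional striping would put block b of every layer on disk b mod n_disks, so all layers of a clip hit the disks in lockstep; staggering offsets each layer by its index, so streams decoding different numbers of layers spread their load across different disks in the same service round. The function is hypothetical and ignores block sizing and scheduling.

def rate_staggered_placement(n_disks, n_layers, n_blocks):
    # offset each layer by its index so per-layer load is spread out
    return {(layer, block): (block + layer) % n_disks
            for layer in range(n_layers)
            for block in range(n_blocks)}

# with 4 disks, layer 1's blocks start one disk after layer 0's:
table = rate_staggered_placement(n_disks=4, n_layers=3, n_blocks=8)
print([table[(1, b)] for b in range(8)])  # [1, 2, 3, 0, 1, 2, 3, 0]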


Very Large Data Bases | 2007

Constrained data clustering by depth control and progressive constraint relaxation

Bi-Ru Dai; Cheng-Ru Lin; Ming-Syan Chen

In order to incorporate domain knowledge or application-dependent parameters into data mining systems, constraint-based mining has attracted a lot of research attention recently. In this paper, the attributes employed to model the constraints are called constraint attributes, and those involved in the objective function to be optimized are called optimization attributes. The constrained clustering considered in this paper is conducted in such a way that the objective function over the optimization attributes is optimized subject to the condition that the imposed constraint is satisfied. Explicitly, we address the problem of constrained clustering with numerical constraints, in which the constraint attribute values of any two data items in the same cluster are required to be within the corresponding constraint range. This numerical constrained clustering problem, however, cannot be dealt with by any conventional clustering algorithms. Consequently, we devise several effective and efficient algorithms to solve such a clustering problem. It is noted that, due to the intrinsic nature of numerical constrained clustering, there is an order dependency in the process of attaining the clustering, which in many cases degrades the clustering results. In view of this, we devise a progressive constraint relaxation technique to remedy this drawback and improve the overall quality of the clustering results. Explicitly, by using a smaller (tighter) constraint range in earlier merge iterations, we have more room to relax the constraint and seek better solutions in subsequent iterations. It is empirically shown that the progressive constraint relaxation technique improves not only the execution efficiency but also the clustering quality.
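
The relaxation schedule can be illustrated with a greedy constrained merge: clusters merge closest-first, a merge is rejected when the merged cluster's spread of constraint values would exceed the currently allowed range, and that range grows stage by stage toward the full constraint. The interface below is hypothetical and far simpler than the paper's algorithms; points and constraint_vals are assumed to be NumPy arrays.

import numpy as np

def constrained_merge(points, constraint_vals, full_range,
                      k_final=2, n_stages=3):
    clusters = [[i] for i in range(len(points))]
    for stage in range(1, n_stages + 1):
        # tighter range in early stages, the full range in the last one
        allowed = full_range * stage / n_stages
        while len(clusters) > k_final:
            best, best_d = None, float("inf")
            for a in range(len(clusters)):
                for b in range(a + 1, len(clusters)):
                    vals = constraint_vals[clusters[a] + clusters[b]]
                    if vals.max() - vals.min() > allowed:
                        continue  # merge would violate the constraint
                    d = np.linalg.norm(points[clusters[a]].mean(0) -
                                       points[clusters[b]].mean(0))
                    if d < best_d:
                        best, best_d = (a, b), d
            if best is None:
                break  # nothing mergeable now; relax in the next stage
            a, b = best
            clusters[a] += clusters[b]
            del clusters[b]
    return clusters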


IEEE Transactions on Parallel and Distributed Systems | 2002

On the asymptotical optimality of multilayered decentralized consensus protocol

Cheng-Ru Lin; Ming-Syan Chen

A decentralized consensus protocol refers to a process by which all nodes in a distributed system collect the information/status of every other node and reach a consensus among themselves. Two classes of decentralized consensus protocols have been studied before: those without an initiator and those with an initiator. While the case without an initiator has been well studied in the literature, prior protocols with an initiator mainly relied upon the initiator-free protocols and thus did not fully exploit the intrinsic properties of having an initiator. By exploiting the concept of multilayered execution, we develop in this paper an efficient multilayered decentralized consensus protocol for a distributed system with an initiator. By adapting itself to the number of nodes in the system, the proposed protocol can determine a proper layer for execution and reach consensus in the minimal number of message steps while incurring a much smaller number of messages than required by prior works. Several illustrative examples are given, and performance analysis of the proposed algorithm is conducted to provide many insights into the problem studied. It is shown that the decentralized consensus protocols developed in this paper for the case of having an initiator significantly outperform prior schemes. Specifically, it is proven that (1) the ratio of the average number of messages incurred by the proposed algorithm to that of the prior method approaches zero as the number of nodes increases, and (2) the proposed algorithm is asymptotically optimal in the sense that the message count required by the proposed algorithm and that of the optimal one are asymptotically of the same complexity with respect to the number of nodes in the system, showing the very important advantage of the proposed algorithm.
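
The benefit of layering can be seen with a back-of-the-envelope message count. The toy model below is not the paper's protocol: it compares a flat all-to-all exchange against a two-layer scheme in which each group exchanges internally, group leaders exchange among themselves, and leaders broadcast the result back to their groups. Balancing the two layers by the choice of group size is the intuition behind adapting the execution layer to the system size.

def flat_messages(n):
    # all-to-all: every node sends its status to every other node
    return n * (n - 1)

def two_layer_messages(n, g):
    # toy scheme (not the paper's protocol); assumes g divides n: groups
    # of size g run an all-to-all, the group leaders run an all-to-all
    # among themselves, then each leader informs its g - 1 group members
    n_groups = n // g
    return (n_groups * g * (g - 1)       # within-group exchange
            + n_groups * (n_groups - 1)  # leader-to-leader exchange
            + n_groups * (g - 1))        # leaders inform their groups

n = 256
print(flat_messages(n))  # 65280
best_g = min((g for g in range(2, n) if n % g == 0),
             key=lambda g: two_layer_messages(n, g))
print(best_g, two_layer_messages(n, best_g))  # 8 3008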

Collaboration


Dive into Cheng-Ru Lin's collaborations.

Top Co-Authors

Ming-Syan Chen
National Taiwan University

Chang-Hung Lee
National Taiwan University

Bi-Ru Dai
National Taiwan University of Science and Technology

Ken-Hao Liu
National Taiwan University

Xin-Mao Huang
National Taiwan University