Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Sau Dan Lee is active.

Publication


Featured research published by Sau Dan Lee.


Database Systems for Advanced Applications | 1997

A General Incremental Technique for Maintaining Discovered Association Rules

David W. Cheung; Sau Dan Lee; Ben Kao

A more general incremental updating technique is developed for maintaining the association rules discovered in a database when transactions are inserted into, deleted from, or modified in the database. A previously proposed algorithm, FUP, can only handle the maintenance problem in the case of insertion. The proposed algorithm, FUP2, makes use of the previous mining result to cut down the cost of finding the new rules in an updated database. In the insertion-only case, FUP2 is equivalent to FUP. In the deletion-only case, FUP2 is a complementary algorithm to FUP that is very efficient when the deleted transactions are a small part of the database, which is the most applicable case. In the general case, FUP2 can efficiently update the discovered rules when new transactions are added to a transaction database and obsolete transactions are removed from it. The proposed algorithm has been implemented, and its performance is studied and compared with the best algorithms for mining association rules studied so far. The study shows that the new incremental algorithm is significantly faster than the traditional approach of mining the whole updated database.
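
The central idea of reusing the previous mining result can be illustrated with a minimal sketch, assuming a relative minimum support and transactions represented as Python sets; function names such as update_frequent are illustrative, not from the paper. Only the inserted and deleted transactions are scanned for itemsets that were frequent before; itemsets that were previously infrequent would require a scan of the unchanged part of the database, which is exactly the work that FUP2's candidate pruning tries to avoid.

    # Minimal sketch of incremental support maintenance in the spirit of FUP2.
    # Transactions are sets of items; itemsets are frozensets.

    def count_support(itemsets, transactions):
        """Count how many of the given transactions contain each itemset."""
        counts = {x: 0 for x in itemsets}
        for t in transactions:
            for x in itemsets:
                if x <= t:
                    counts[x] += 1
        return counts

    def update_frequent(old_counts, inserted, deleted, new_db_size, min_sup):
        """Adjust old support counts using only the changed transactions.

        Previously frequent itemsets stay decidable without rescanning the
        unchanged part of the database; previously infrequent itemsets that
        might become frequent would need a full scan (omitted here).
        """
        itemsets = list(old_counts)
        plus = count_support(itemsets, inserted)
        minus = count_support(itemsets, deleted)
        new_counts = {x: old_counts[x] + plus[x] - minus[x] for x in itemsets}
        threshold = min_sup * new_db_size
        return {x: c for x, c in new_counts.items() if c >= threshold}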


International Conference on Data Mining | 2009

Naive Bayes Classification of Uncertain Data

Jiangtao Ren; Sau Dan Lee; Xianlu Chen; Ben Kao; Reynold Cheng; David W. Cheung

Traditional machine learning algorithms assume that data are exact or precise. However, this assumption may not hold in some situations because of data uncertainty arising from measurement errors, data staleness, repeated measurements, etc. With uncertainty, the value of each data item is represented by a probability distribution function (pdf). In this paper, we propose a novel naive Bayes classification algorithm for uncertain data with a pdf. Our key solution is to extend the class conditional probability estimation in the Bayes model to handle pdf's. Extensive experiments on UCI datasets show that the accuracy of the naive Bayes model can be improved by taking into account the uncertainty information.
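
One way to fold a per-attribute pdf into the class-conditional term is to average the likelihood over samples drawn from that pdf. The sketch below assumes Gaussian class-conditional models and Gaussian attribute pdfs and uses Monte Carlo averaging; it illustrates the general idea of extending the Bayes model to pdfs, not the paper's exact estimator.

    import math, random

    def gaussian_pdf(x, mu, sigma):
        return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

    def expected_likelihood(attr_mu, attr_sigma, class_mu, class_sigma, n_samples=1000):
        """Approximate E[p(x | class)] when the attribute value is itself uncertain,
        modelled here (an assumption) as a Gaussian pdf around the observed value."""
        total = 0.0
        for _ in range(n_samples):
            x = random.gauss(attr_mu, attr_sigma)            # draw from the attribute's pdf
            total += gaussian_pdf(x, class_mu, class_sigma)  # class-conditional likelihood
        return total / n_samples

    def classify(instance, class_params, priors):
        """instance: list of (mu, sigma) per attribute;
        class_params[c]: list of (mu, sigma) per attribute for class c."""
        best, best_score = None, float("-inf")
        for c, params in class_params.items():
            score = math.log(priors[c])
            for (a_mu, a_sigma), (c_mu, c_sigma) in zip(instance, params):
                score += math.log(expected_likelihood(a_mu, a_sigma, c_mu, c_sigma) + 1e-300)
            if score > best_score:
                best, best_score = c, score
        return best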


Data Mining and Knowledge Discovery | 1998

Is Sampling Useful in Data Mining? A Case in the Maintenance of Discovered Association Rules

Sau Dan Lee; David W. Cheung; Ben Kao

By nature, sampling is an appealing technique for data mining, because approximate solutions may in most cases already satisfy the users' needs. We attempt to use sampling techniques to address the problem of maintaining discovered association rules. Some studies have been done on the problem of maintaining the discovered association rules when updates are made to the database. All proposed methods must examine not only the changed part but also the unchanged part of the original database, which is very large, and hence take much time. Worse yet, if the rules are updated frequently but the underlying rule set has not changed much, most of this effort is wasted. In this paper, we devise an algorithm which employs sampling techniques to estimate the difference between the association rules in a database before and after the database is updated. The estimated difference can be used to determine whether we should update the mined association rules. If the estimated difference is small, then the rules in the original database are still a good approximation to those in the updated database. Hence, we do not have to spend the resources to update the rules. We can accumulate more updates before actually updating the rules, thereby avoiding the overhead of updating the rules too frequently. Experimental results show that our algorithm is very efficient and highly accurate.
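
A minimal sketch of the sampling idea, assuming relative supports and a dictionary of previously mined itemset supports (names such as estimate_rule_drift are illustrative): estimate, from a random sample of the updated database, what fraction of previously mined itemsets have changed their frequent/infrequent status, and defer re-mining when that fraction is small.

    import random

    def estimate_rule_drift(old_supports, updated_db, min_sup, sample_size=1000):
        """Estimate how much the frequent-itemset collection has drifted,
        using only a random sample of the updated database.
        old_supports maps itemset (frozenset) -> old relative support.
        The drift measure and thresholds are illustrative, not the paper's."""
        sample = random.sample(updated_db, min(sample_size, len(updated_db)))
        flipped = 0
        for itemset, old_sup in old_supports.items():
            new_sup = sum(1 for t in sample if itemset <= t) / len(sample)
            if (new_sup >= min_sup) != (old_sup >= min_sup):
                flipped += 1
        return flipped / max(len(old_supports), 1)

    # Usage: if estimate_rule_drift(...) stays below a tolerance, defer re-mining
    # and accumulate more updates before recomputing the rules.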


IEEE Transactions on Knowledge and Data Engineering | 2010

Clustering Uncertain Data Using Voronoi Diagrams and R-Tree Index

Ben Kao; Sau Dan Lee; Foris K. F. Lee; David W. Cheung; Wai-Shing Ho

We study the problem of clustering uncertain objects whose locations are described by probability density functions (pdfs). We show that the UK-means algorithm, which generalizes the k-means algorithm to handle uncertain objects, is very inefficient. The inefficiency comes from the fact that UK-means computes expected distances (EDs) between objects and cluster representatives. For arbitrary pdfs, expected distances are computed by numerical integrations, which are costly operations. We propose pruning techniques that are based on Voronoi diagrams to reduce the number of expected distance calculations. These techniques are analytically proven to be more effective than the basic bounding-box-based technique previously known in the literature. We then introduce an R-tree index to organize the uncertain objects so as to reduce pruning overheads. We conduct experiments to evaluate the effectiveness of our novel techniques. We show that our techniques are additive and, when used in combination, significantly outperform previously known methods.
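
The pruning idea can be sketched as a bisector test: if every point of an object's bounding region is closer to representative p than to representative q, then the expected distance to p must also be smaller, so q can be discarded without any numerical integration. The sketch below assumes 2D axis-aligned bounding boxes and uses illustrative helper names; the paper's Voronoi-diagram and R-tree machinery organizes exactly this kind of test more efficiently.

    def box_side_of_bisector(box, p, q):
        """True if every corner of `box` is strictly closer to p than to q,
        i.e. the box lies entirely in p's half-plane of the p/q bisector.
        box = (xmin, ymin, xmax, ymax); p, q are (x, y) points."""
        corners = [(box[0], box[1]), (box[0], box[3]), (box[2], box[1]), (box[2], box[3])]
        return all((cx - p[0])**2 + (cy - p[1])**2 < (cx - q[0])**2 + (cy - q[1])**2
                   for cx, cy in corners)

    def candidate_representatives(box, reps):
        """Keep only representatives not dominated for this object's box;
        expected distances need to be evaluated only for the survivors."""
        survivors = list(range(len(reps)))
        for i in list(survivors):
            for j in survivors:
                if i != j and box_side_of_bisector(box, reps[j], reps[i]):
                    survivors.remove(i)  # rep j beats rep i for every point in the box
                    break
        return [reps[i] for i in survivors]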


International Conference on Data Mining | 2008

Clustering Uncertain Data Using Voronoi Diagrams

Ben Kao; Sau Dan Lee; David W. Cheung; Wai-Shing Ho; K. F. Chan

We study the problem of clustering uncertain objects whose locations are described by probability density functions (pdfs). We show that the UK-means algorithm, which generalises the k-means algorithm to handle uncertain objects, is very inefficient. The inefficiency comes from the fact that UK-means computes expected distances (EDs) between objects and cluster representatives. For arbitrary pdfs, expected distances are computed by numerical integrations, which are costly operations. We propose pruning techniques that are based on Voronoi diagrams to reduce the number of expected distance calculations. These techniques are analytically proven to be more effective than the basic bounding-box-based technique previously known in the literature. We conduct experiments to evaluate the effectiveness of our pruning techniques and to show that our techniques significantly outperform previous methods.


International Conference on Data Mining | 2007

Reducing UK-Means to K-Means

Sau Dan Lee; Ben Kao; Reynold Cheng

This paper proposes an optimisation to the UK-means algorithm, which generalises the k-means algorithm to handle objects whose locations are uncertain. The location of each object is described by a probability density function (pdf). The UK-means algorithm needs to compute expected distances (EDs) between each object and the cluster representatives. The evaluation of an ED from first principles is a very costly operation, because the pdfs are different and arbitrary, and UK-means needs to evaluate a large number of EDs. This is a major performance burden of the algorithm. In this paper, we derive a formula for evaluating EDs efficiently. This tremendously reduces the execution time of UK-means, as demonstrated by our preliminary experiments. We also illustrate that this optimised formula effectively reduces the UK-means problem to the traditional clustering problem addressed by the k-means algorithm.
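
The reduction rests on a standard decomposition of the expected squared distance, E[||X - c||^2] = ||E[X] - c||^2 + E[||X - E[X]||^2]: the second term does not depend on the cluster representative c, so assignments under expected squared distance coincide with k-means assignments on the objects' mean locations. The snippet below is a small numerical check of that identity on a sampled pdf (an illustration, assuming the squared-distance formulation).

    import random

    def expected_sq_dist_mc(samples, c):
        """Empirical E[||X - c||^2] from samples of the object's pdf."""
        return sum((x - c[0])**2 + (y - c[1])**2 for x, y in samples) / len(samples)

    def decomposed(samples, c):
        """||E[X] - c||^2 + E[||X - E[X]||^2]: the same quantity, where only
        the first term depends on the representative c."""
        n = len(samples)
        mx = sum(x for x, _ in samples) / n
        my = sum(y for _, y in samples) / n
        variance = sum((x - mx)**2 + (y - my)**2 for x, y in samples) / n
        return (mx - c[0])**2 + (my - c[1])**2 + variance

    samples = [(random.gauss(2.0, 0.5), random.gauss(-1.0, 0.3)) for _ in range(10000)]
    for c in [(0.0, 0.0), (3.0, 1.0)]:
        print(expected_sq_dist_mc(samples, c), decomposed(samples, c))  # agree up to rounding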


Conference on Information and Knowledge Management | 2010

Accelerating probabilistic frequent itemset mining: a model-based approach

Liang Wang; Reynold Cheng; Sau Dan Lee; David W. Cheung

Data uncertainty is inherent in emerging applications such as location-based services, sensor monitoring systems, and data integration. To handle a large amount of imprecise information, uncertain databases have been recently developed. In this paper, we study how to efficiently discover frequent itemsets from large uncertain databases, interpreted under the Possible World Semantics. This is technically challenging, since an uncertain database induces an exponential number of possible worlds. To tackle this problem, we propose a novel method to capture the itemset mining process as a Poisson binomial distribution. This model-based approach extracts frequent itemsets with a high degree of accuracy, and supports large databases. We apply our techniques to improve the performance of the algorithms for: (1) finding itemsets whose frequentness probabilities are larger than some threshold; and (2) mining itemsets with the k highest frequentness probabilities. Our approaches support both tuple and attribute uncertainty models, which are commonly used to represent uncertain databases. Extensive evaluation on real and synthetic datasets shows that our methods are highly accurate. Moreover, they are orders of magnitude faster than previous approaches.
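
The support count of an itemset over an uncertain database is a Poisson binomial random variable, with one containment probability per transaction. A minimal sketch of how a model-based approximation can turn this into a cheap frequentness-probability test (here a normal approximation with continuity correction, one of several approximations such a method could use; the function name is illustrative):

    import math

    def frequentness_probability(item_probs, min_sup_count):
        """Approximate P(support count >= min_sup_count) for an itemset whose
        per-transaction containment probabilities are item_probs.
        The exact count is Poisson binomial; here it is approximated by a
        normal distribution with continuity correction (an assumption)."""
        mu = sum(item_probs)
        var = sum(p * (1 - p) for p in item_probs)
        if var == 0:
            return 1.0 if mu >= min_sup_count else 0.0
        z = (min_sup_count - 0.5 - mu) / math.sqrt(var)
        return 0.5 * math.erfc(z / math.sqrt(2))  # P(N(mu, var) >= min_sup_count - 0.5)

    # An itemset is reported as a probabilistic frequent itemset (PFI) if this
    # probability exceeds a user-chosen threshold.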


IEEE Transactions on Knowledge and Data Engineering | 2012

Efficient Mining of Frequent Item Sets on Large Uncertain Databases

Liang Wang; D. W-L Cheung; Reynold Cheng; Sau Dan Lee; Xuan S. Yang

The data handled in emerging applications like location-based services, sensor monitoring systems, and data integration are often inexact in nature. In this paper, we study the important problem of extracting frequent item sets from a large uncertain database, interpreted under the Possible World Semantics (PWS). This issue is technically challenging, since an uncertain database contains an exponential number of possible worlds. By observing that the mining process can be modeled as a Poisson binomial distribution, we develop an approximate algorithm, which can efficiently and accurately discover frequent item sets in a large uncertain database. We also study the important issue of maintaining the mining result for a database that is evolving (e.g., by inserting a tuple). Specifically, we propose incremental mining algorithms, which enable Probabilistic Frequent Item Set (PFI) results to be refreshed. This reduces the need to re-execute the whole mining algorithm on the new database, which is often expensive and unnecessary. We examine how an existing algorithm that extracts exact item sets, as well as our approximate algorithm, can support incremental mining. All our approaches support both tuple and attribute uncertainty, which are two common uncertain database models. We also perform extensive evaluation on real and synthetic data sets to validate our approaches.
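
The incremental flavour of the model-based approach can be hinted at with a small sketch: if the approximating distribution is summarized by the mean and variance of the support count, inserting or deleting a tuple only adjusts those two running sums by that tuple's containment probability. This is a simplified illustration of the refresh idea, not the paper's exact procedure.

    class ItemsetModel:
        """Running normal-approximation model of an itemset's support count,
        maintained incrementally as tuples enter or leave the uncertain database."""

        def __init__(self):
            self.mu = 0.0   # expected support count
            self.var = 0.0  # variance of the support count

        def insert(self, p):
            """p = probability that the inserted tuple contains the itemset."""
            self.mu += p
            self.var += p * (1 - p)

        def delete(self, p):
            self.mu -= p
            self.var -= p * (1 - p)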


IEEE Transactions on Knowledge and Data Engineering | 2002

Effect of data skewness and workload balance in parallel data mining

David W. Cheung; Sau Dan Lee; Yongqiao Xiao

To mine association rules efficiently, we have developed a new parallel mining algorithm, FPM, on a distributed, shared-nothing parallel system in which data are partitioned across the processors. FPM is an enhancement of the FDM algorithm, which we previously proposed for distributed mining of association rules (Cheung et al., 1996). FPM requires fewer rounds of message exchanges than FDM and, hence, has a better response time in a parallel environment. The algorithm has been experimentally found to outperform CD, a representative parallel algorithm for the same goal (Agrawal and Srikant, 1994). The efficiency of FPM is attributed to the incorporation of two powerful candidate-set pruning techniques: distributed and global pruning. The two techniques are sensitive to two data distribution characteristics: data skewness and workload balance. Metrics based on entropy are proposed for these two characteristics. The prunings are very effective when both the skewness and balance are high. In order to increase the efficiency of FPM, we have developed methods to partition a database so that the resulting partitions have high balance and skewness. Experiments have shown empirically that our partitioning algorithms can achieve these aims very well; in particular, the results are consistently better than random partitioning. Moreover, the partitioning algorithms incur little overhead, so, using our partitioning algorithms and FPM together, we can mine association rules from a database efficiently.
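
Skewness here captures how unevenly an itemset's support is spread across the data partitions. Below is a rough, illustrative sketch of an entropy-based skewness measure in the spirit of the paper's metrics; the paper's exact definitions and normalization may differ.

    import math

    def entropy(shares):
        return -sum(p * math.log(p) for p in shares if p > 0)

    def skewness(local_counts):
        """Entropy-based skewness of one itemset's support across partitions:
        0 when support is spread evenly, approaching 1 when it is concentrated
        in a single partition (an illustrative variant, not the paper's formula)."""
        total = sum(local_counts)
        n = len(local_counts)
        if total == 0 or n <= 1:
            return 0.0
        shares = [c / total for c in local_counts]
        h_max = math.log(n)
        return (h_max - entropy(shares)) / h_max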


Data and Knowledge Engineering | 2001

Towards the building of a dense-region-based OLAP system

David W. Cheung; Bo Zhou; Ben Kao; Hu Kan; Sau Dan Lee

On-line analytical processing (OLAP) has become a very useful tool in decision support systems built on data warehouses. Relational OLAP (ROLAP) and multidimensional OLAP (MOLAP) are two popular approaches for building OLAP systems. These two approaches have very different performance characteristics: MOLAP has good query performance but poor space efficiency, while ROLAP can be built on mature RDBMS technology but needs sizable indices to support it. Many data warehouses contain many small clusters of multidimensional data (dense regions), with sparse points scattered around in the rest of the space. For these databases, we propose that the dense regions be located and separated from the sparse points. The dense regions can subsequently be represented by small MOLAPs, while the sparse points are put in a ROLAP table. Thus the MOLAP and ROLAP approaches can be integrated in one structure to build a high-performance and space-efficient dense-region-based data cube. In this paper, we define the dense-region location problem as an optimization problem and develop a chunk-scanning algorithm to compute dense regions. We prove a lower bound on the accuracy of the dense regions computed. Also, we analyze the sensitivity of the accuracy to user inputs. Finally, extensive experiments are performed to study the efficiency and accuracy of the proposed algorithm.
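
The thresholding core of dense-region detection can be sketched by bucketing points into fixed-size grid chunks and keeping the chunks whose counts reach a density threshold; the paper's chunk-scanning algorithm additionally grows and merges such chunks into rectangular dense regions and bounds the accuracy of the result. Names and parameters below are illustrative.

    from collections import Counter

    def dense_chunks(points, chunk_size, density_threshold):
        """Group multidimensional points into fixed-size grid chunks and return
        the chunk coordinates whose point count reaches the density threshold.
        Only the thresholding core of a dense-region finder, for illustration."""
        counts = Counter(tuple(int(x // chunk_size) for x in p) for p in points)
        return [chunk for chunk, c in counts.items() if c >= density_threshold]

    # Dense chunks would then be stored in compact MOLAP arrays, while the
    # remaining sparse points go into a ROLAP table.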

Collaboration


Dive into Sau Dan Lee's collaboration.

Top Co-Authors

Ben Kao, University of Hong Kong
Kevin Y. Yip, The Chinese University of Hong Kong
Wai-Shing Ho, University of Hong Kong
Thomas Lee, University of Hong Kong
Liang Wang, University of Hong Kong
Patrick Yee, University of Hong Kong
Wenjun Yuan, University of Hong Kong