David W. Cheung | Researchain

Publication

Featured researches published by David W. Cheung.

GigaScience | 2012

SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler

Ruibang Luo; Binghang Liu; Yinlong Xie; Zhenyu Li; Weihua Huang; Jianying Yuan; Guangzhu He; Yanxiang Chen; Qi Pan; Yunjie Liu; Jingbo Tang; Gengxiong Wu; Hao Zhang; Yujian Shi; Yong Liu; Chang Yu; Bo Wang; Yao Lu; Changlei Han; David W. Cheung; Siu-Ming Yiu; Shaoliang Peng; Zhu Xiao-qian; Guangming Liu; Xiangke Liao; Yingrui Li; Huanming Yang; Jian Wang; Tak Wah Lam; Jun Wang

Background: There is a rapidly increasing amount of de novo genome assembly using next-generation sequencing (NGS) short reads; however, several big challenges remain to be overcome in order for this to be efficient and accurate. SOAPdenovo has been successfully applied to assemble many published genomes, but it still needs improvement in continuity, accuracy and coverage, especially in repeat regions. Findings: To overcome these challenges, we have developed its successor, SOAPdenovo2, which has the advantage of a new algorithm design that reduces memory consumption in graph construction, resolves more repeat regions in contig assembly, increases coverage and length in scaffold construction, improves gap closing, and optimizes for large genomes. Conclusions: Benchmarks using the Assemblathon1 and GAGE datasets showed that SOAPdenovo2 greatly surpasses its predecessor SOAPdenovo and is competitive with other assemblers on both assembly length and accuracy. We also provide an updated assembly version of the 2008 Asian (YH) genome using SOAPdenovo2. Here, the contig and scaffold N50 of the YH genome were ~20.9 kbp and ~22 Mbp, respectively, which is 3-fold and 50-fold longer than the first published version. The genome coverage increased from 81.16% to 93.91%, and memory consumption was ~2/3 lower during the point of largest memory consumption.
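The abstract above reports contig and scaffold N50 values. As a reminder of what that statistic measures, here is a minimal sketch of the usual N50 definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembled length. The function name `n50` is illustrative and is not taken from SOAPdenovo2.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first sequence that brings the running sum to half the total.
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")
```

Comparing `2 * running` with `total` keeps the arithmetic in integers, so no rounding is involved when the total length is odd.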

International Conference on Data Engineering | 1996

Maintenance of discovered association rules in large databases: an incremental updating technique

David W. Cheung; Jiawei Han; Vincent T. Y. Ng; C. Y. Wong

An incremental updating technique is developed for maintenance of the association rules discovered by database mining. There have been many studies on efficient discovery of association rules in large databases. However, it is nontrivial to maintain such discovered rules in large databases because a database may allow frequent or occasional updates and such updates may not only invalidate some existing strong association rules but also turn some weak rules into strong ones. An incremental updating technique is proposed for efficient maintenance of discovered association rules when new transaction data are added to a transaction database.
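The idea of reusing earlier mining results when transactions are appended can be illustrated with a small sketch. It is a simplified illustration restricted to single items, not the algorithm from the paper. It relies only on an arithmetic fact: an item that was below the support threshold in the old database can reach the threshold in the combined database only if it meets the threshold within the increment. All names (`update_large_items`, `old_large`, and so on) are hypothetical.

```python
from collections import Counter

def update_large_items(old_db, old_large, increment, min_support):
    """Update large (frequent) single items after appending `increment`.

    old_db and increment are lists of item sets; old_large maps each item
    that was large in old_db to its support count there.
    """
    inc_counts = Counter(item for t in increment for item in t)
    threshold = min_support * (len(old_db) + len(increment))
    new_large = {}

    # Previously large items: only the increment has to be scanned.
    for item, old_count in old_large.items():
        count = old_count + inc_counts[item]
        if count >= threshold:
            new_large[item] = count

    # Previously small items qualify only if they are large in the increment,
    # so the old database is rescanned for this reduced candidate set alone.
    candidates = {
        item for item, c in inc_counts.items()
        if item not in old_large and c >= min_support * len(increment)
    }
    if candidates:
        old_counts = Counter(
            item for t in old_db for item in t if item in candidates
        )
        for item in candidates:
            count = old_counts[item] + inc_counts[item]
            if count >= threshold:
                new_large[item] = count
    return new_large
```

In this sketch a previously large item can drop out and a previously small one can enter, which mirrors the point in the abstract that updates may invalidate strong rules and promote weak ones.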

International Conference on Parallel and Distributed Information Systems | 1996

A fast distributed algorithm for mining association rules

David W. Cheung; Jiawei Han; Vincent T. Y. Ng; Ada Wai-Chee Fu; Yongjian Fu

With the existence of many large transaction databases, the huge amounts of data, the high scalability of distributed systems, and the easy partitioning and distribution of a centralized database, it is important to investigate efficient methods for distributed mining of association rules. The study discloses some interesting relationships between locally large and globally large item sets and proposes an interesting distributed association rule mining algorithm, FDM (fast distributed mining of association rules), which generates a small number of candidate sets and substantially reduces the number of messages to be passed when mining association rules. A performance study shows that FDM has a superior performance over the direct application of a typical sequential algorithm. Further performance enhancement leads to a few variations of the algorithm.
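One relationship between locally and globally large item sets follows directly from the definitions: if an item set meets the minimum support over the whole database, it must meet it at one site or more, so the union of locally large sets is a sufficient candidate pool. The sketch below illustrates only that pruning step on in-memory partitions. It is not FDM itself, it ignores message passing, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def local_counts(transactions, k):
    """Count every k-item set occurring in one site's transactions."""
    counts = Counter()
    for t in transactions:
        for itemset in combinations(sorted(t), k):
            counts[itemset] += 1
    return counts

def globally_large(sites, k, min_support):
    """Return globally large k-item sets with their total support counts."""
    per_site = [local_counts(site, k) for site in sites]

    # Candidates: item sets that are locally large at one site or more.
    candidates = set()
    for site, counts in zip(sites, per_site):
        local_threshold = min_support * len(site)
        candidates |= {s for s, c in counts.items() if c >= local_threshold}

    # Only the candidates have their counts summed across sites.
    total = sum(len(site) for site in sites)
    result = {}
    for itemset in candidates:
        support = sum(counts[itemset] for counts in per_site)
        if support >= min_support * total:
            result[itemset] = support
    return result
```

Brute-force enumeration with `combinations` is used here for brevity; a real miner would generate candidates level by level.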

Database Systems for Advanced Applications | 1997

A General Incremental Technique for Maintaining Discovered Association Rules

David W. Cheung; Sau Dan Lee; Ben Kao

A more general incremental updating technique is developed for maintaining the association rules discovered in a database in the cases including insertion, deletion, and modification of transactions in the database. A previously proposed algorithm FUP can only handle the maintenance problem in the case of insertion. The proposed algorithm FUP2 makes use of the previous mining result to cut down the cost of finding the new rules in an updated database. In the insertion only case, FUP2 is equivalent to FUP. In the deletion only case, FUP2 is a complementary algorithm of FUP which is very efficient when the deleted transactions are a small part of the database, which is the most applicable case. In the general case, FUP2 can efficiently update the discovered rules when new transactions are added to a transaction database, and obsolete transactions are removed from it. The proposed algorithm has been implemented and its performance is studied and compared with the best algorithms for mining association rules studied so far. The study shows that the new incremental algorithm is significantly faster than the traditional approach of mining the whole updated database.

IEEE Transactions on Knowledge and Data Engineering | 1996

Efficient mining of association rules in distributed databases

David W. Cheung; Vincent T. Y. Ng; Ada Wai-Chee Fu; Yongjian Fu

Many sequential algorithms have been proposed for the mining of association rules. However, very little work has been done in mining association rules in distributed databases. A direct application of sequential algorithms to distributed databases is not effective, because it requires a large amount of communication overhead. In this study, an efficient algorithm called DMA (Distributed Mining of Association rules) is proposed. It generates a small number of candidate sets and requires only O(n) messages for support-count exchange for each candidate set, where n is the number of sites in a distributed database. The algorithm has been implemented on an experimental testbed, and its performance is studied. The results show that DMA has superior performance, when compared with the direct application of a popular sequential algorithm, in distributed databases.

Knowledge Discovery and Data Mining | 2004

Mining, indexing, and querying historical spatiotemporal data

Nikos Mamoulis; Huiping Cao; George Kollios; Marios Hadjieleftheriou; Yufei Tao; David W. Cheung

In many applications that track and analyze spatiotemporal data, movements obey periodic patterns; the objects follow the same routes (approximately) over regular time intervals. For example, people wake up at the same time and follow more or less the same route to their work every day. The discovery of hidden periodic patterns in spatiotemporal data, apart from unveiling important information to the data analyst, can facilitate data management substantially. Based on this observation, we propose a framework that analyzes, manages, and queries object movements that follow such patterns. We define the spatiotemporal periodic pattern mining problem and propose an effective and fast mining algorithm for retrieving maximal periodic patterns. We also devise a novel, specialized index structure that can benefit from the discovered patterns to support more efficient execution of spatiotemporal queries. We evaluate our methods experimentally using datasets with object trajectories that exhibit periodicity.

International Conference on Data Mining | 2005

Mining frequent spatio-temporal sequential patterns

Huiping Cao; Nikos Mamoulis; David W. Cheung

Many applications track the movement of mobile objects, which can be represented as sequences of timestamped locations. Given such a spatiotemporal series, we study the problem of discovering sequential patterns, which are routes frequently followed by the object. Sequential pattern mining algorithms for transaction data are not directly applicable to this setting. The challenges to address are: (i) the fuzziness of locations in patterns, and (ii) the identification of non-explicit pattern instances. In this paper, we define pattern elements as spatial regions around frequent line segments. Our method first transforms the original sequence into a list of sequence segments, and detects frequent regions in a heuristic way. Then, we propose algorithms to find patterns by employing a newly proposed substring tree structure and improving the Apriori technique. A performance evaluation demonstrates the effectiveness and efficiency of our approach.

Knowledge Discovery and Data Mining | 2002

Enhancing Effectiveness of Outlier Detections for Low Density Patterns

Jian Tang; Zhixiang Chen; Ada Wai-Chee Fu; David W. Cheung

Outlier detection is concerned with discovering exceptional behaviors of objects in data sets. It is becoming an increasingly useful tool in applications such as credit card fraud detection, discovering criminal behaviors in e-commerce, identifying computer intrusion, detecting health problems, etc. In this paper, we introduce a connectivity-based outlier factor (COF) scheme that improves the effectiveness of an existing local outlier factor (LOF) scheme when a pattern itself has similar neighbourhood density as an outlier. We give theoretical and empirical analysis to demonstrate the improvement in effectiveness and the capability of the COF scheme in comparison with the LOF scheme.

IEEE Transactions on Knowledge and Data Engineering | 2004

An efficient and scalable algorithm for clustering XML documents by structure

Wang Lian; David W. Cheung; Nikos Mamoulis; Siu-Ming Yiu

With the standardization of XML as an information exchange language over the Internet, a huge amount of information is formatted in XML documents. In order to analyze this information efficiently, decomposing the XML documents and storing them in relational tables is a popular practice. However, query processing becomes expensive since, in many cases, an excessive number of joins is required to recover information from the fragmented data. If a collection consists of documents with different structures (for example, they come from different DTDs), mining clusters in the documents could alleviate the fragmentation problem. We propose a hierarchical algorithm (S-GRACE) for clustering XML documents based on structural information in the data. The notion of structure graph (s-graph) is proposed, supporting a computationally efficient distance metric defined between documents and sets of documents. This simple metric yields our new clustering algorithm which is efficient and effective, compared to other approaches based on tree-edit distance. Experiments on real data show that our algorithm can discover clusters not easily identified by manual inspection.
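To make the notion of a structure-based distance concrete, here is a minimal sketch. It rests on two assumptions that the abstract does not spell out: that an s-graph can be approximated by the set of distinct parent-child element edges in a document, and that the distance is one minus the number of shared edges divided by the size of the larger edge set. The function names are hypothetical, and the hierarchical clustering step of S-GRACE is not shown.

```python
import xml.etree.ElementTree as ET

def s_graph(xml_text):
    """Return the set of distinct (parent tag, child tag) edges in a document."""
    root = ET.fromstring(xml_text)
    edges = set()
    stack = [root]
    while stack:
        node = stack.pop()
        for child in node:
            edges.add((node.tag, child.tag))
            stack.append(child)
    return edges

def distance(g1, g2):
    """Assumed metric: 1 - |common edges| / max(|g1|, |g2|)."""
    larger = max(len(g1), len(g2))
    if larger == 0:
        return 0.0  # two edgeless documents are treated as identical
    return 1.0 - len(g1 & g2) / larger
```

Because the edge sets ignore repetition and ordering of elements, computing the distance is a set intersection, which is much cheaper than a tree-edit distance between the full document trees.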

Computers & Mathematics With Applications | 1998

Uncertainty reasoning based on cloud models in controllers

D. Li; David W. Cheung; Xuemei Shi; V. Ng

The methodology of fuzzy reasoning has been shown to be a very useful technology for modeling complex nonlinear systems. However, the most commonly used method for reasoning with fuzzy systems models, the Mamdani-Zadeh paradigm, faces many criticisms, particularly from the probability community. A new mathematical representation of linguistic concepts is presented in this paper. With the new model of normal compatibility clouds and a virtual rule engine, a novel uncertainty reasoning technology is proposed. It not only serves as a foundation of linguistic control, but also integrates fuzziness and randomness in an inseparable way. A case study is given to clear up many doubts raised in the debate between fuzzy theory and probability theory researchers, and to give a good interpretation of the Mamdani-Zadeh operations for the defuzzification strategy as well. The architecture of such a controller shows the advantages in hardware implementations.
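How a cloud can combine randomness with fuzziness may be easier to see in code. The sketch below follows the forward normal cloud generator as it is commonly described in the cloud-model literature, with three parameters assumed here: expectation `ex`, entropy `en`, and hyper-entropy `he`. The abstract does not give these details, so the parameterisation and the function name are assumptions rather than the paper's exact formulation.

```python
import math
import random

def normal_cloud(ex, en, he, n, seed=None):
    """Generate n cloud drops (x, membership degree) for one linguistic concept."""
    rng = random.Random(seed)
    drops = []
    while len(drops) < n:
        # Randomness: the spread itself is drawn from a normal distribution.
        en_prime = abs(rng.gauss(en, he))
        if en_prime == 0.0:
            continue  # avoid dividing by zero in the membership formula
        x = rng.gauss(ex, en_prime)
        # Fuzziness: a bell-shaped membership degree computed with the drawn spread.
        mu = math.exp(-((x - ex) ** 2) / (2.0 * en_prime ** 2))
        drops.append((x, mu))
    return drops
```

Each call with the same `seed` yields the same drops. With `he` set to zero, every drop lies on a single Gaussian membership curve, while a larger `he` thickens the cloud around that curve.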
