
Edward Hung

Hong Kong Polytechnic University


Publication

Featured research published by Edward Hung.

knowledge discovery and data mining | 2007

Mining frequent itemsets from uncertain data

Chun Kit Chui; Ben Kao; Edward Hung

We study the problem of mining frequent itemsets from uncertain data under a probabilistic framework. We consider transactions whose items are associated with existential probabilities and give a formal definition of frequent patterns under such an uncertain data model. We show that traditional algorithms for mining frequent itemsets are either inapplicable or computationally inefficient under such a model. A data trimming framework is proposed to improve mining efficiency. Through extensive experiments, we show that the data trimming technique can achieve significant savings in both CPU cost and I/O cost.
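As a rough illustration of the uncertain data model described above, the sketch below computes an expected support for an itemset. It assumes that item occurrences are independent, so the probability that a transaction contains the whole itemset is the product of the item probabilities. The function name `expected_support` and the toy transactions are hypothetical, and the abstract itself does not spell out the paper's formal definition or its data trimming framework, neither of which is reproduced here.

```python
from math import prod


def expected_support(transactions, itemset):
    """Sum, over all transactions, of the probability that every item
    of `itemset` is present (assuming independent item occurrences)."""
    return sum(
        prod(t.get(item, 0.0) for item in itemset)
        for t in transactions
    )


# Hypothetical toy database: each transaction maps an item to its
# existential probability; a missing item has probability 0.
transactions = [
    {"a": 0.9, "b": 0.5},
    {"a": 0.4, "b": 1.0, "c": 0.2},
    {"b": 0.8},
]

support_ab = expected_support(transactions, {"a", "b"})
print(support_ab)
```

Under this assumed definition, an itemset would count as frequent when its expected support reaches a user-chosen threshold.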

international conference on data engineering | 2003

PXML: a probabilistic semistructured data model and algebra

Edward Hung; Lise Getoor; V. S. Subrahmanian

Despite the recent proliferation of work on semistructured data models, there has been little work to date on supporting uncertainty in these models. We propose a model for probabilistic semistructured data (PSD). The advantage of our approach is that it supports a flexible representation that allows the specification of a wide class of distributions over semistructured instances. We provide two semantics for the model and show that the semantics are probabilistically coherent. Next, we develop an extension of the relational algebra to handle probabilistic semistructured data and describe efficient algorithms for answering queries that use this algebra. Finally, we present experimental results showing the efficiency of our algorithms.

Distributed and Parallel Databases | 2002

Parallel Mining of Outliers in Large Database

Edward Hung; David W. Cheung

Data mining is a new, important and fast-growing database application. Outlier (exception) detection is one kind of data mining, which can be applied in a variety of areas such as monitoring credit card fraud and criminal activities in electronic commerce. With the ever-increasing size and number of attributes (dimensions) of databases, previously proposed detection methods for two dimensions are no longer applicable. The time complexity of the Nested-Loop (NL) algorithm (Knorr and Ng, in Proc. 24th VLDB, 1998) is linear in the dimensionality but quadratic in the dataset size, incurring an unacceptable cost for large datasets. A more efficient version (ENL) and its parallel version (PENL) are introduced. In theory, the performance improvement of PENL is linear in the number of processors, as shown in a performance comparison between ENL and PENL using the Bulk Synchronous Parallel (BSP) model. The improvement is further verified by experiments on an IBM 9076 SP2 parallel computer system. The results show that the approach is a very good choice for mining outliers on a low-cost cluster of workstations interconnected by a commodity communication network.
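The cost profile mentioned above (quadratic in the dataset size, linear in the dimensionality) can be seen in a naive nested-loop sketch of distance-based outlier detection. It assumes the usual distance-based notion in which an object is an outlier when at least a fraction `p` of the other objects lie farther than distance `d` from it. The function name, the exclusion of the object itself from the count, and the sample points are illustrative assumptions. This is neither the ENL nor the PENL algorithm from the paper.

```python
from math import dist


def nested_loop_outliers(points, p, d):
    """Return the points for which at least a fraction p of the other
    points lie farther away than d. Each of the N * (N - 1) comparisons
    costs time proportional to the number of dimensions."""
    outliers = []
    for i, a in enumerate(points):
        far = sum(
            1
            for j, b in enumerate(points)
            if i != j and dist(a, b) > d
        )
        if far >= p * (len(points) - 1):
            outliers.append(a)
    return outliers


# Hypothetical data: four clustered points and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(nested_loop_outliers(points, p=0.75, d=2.0))
```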

very large data bases | 2009

An audit environment for outsourcing of frequent itemset mining

Wai Kit Wong; David W. Cheung; Edward Hung; Ben Kao; Nikos Mamoulis

Finding frequent itemsets is the most costly task in association rule mining. Outsourcing this task to a service provider brings several benefits to the data owner, such as cost relief and a reduced commitment to storage and computational resources. Mining results, however, can be corrupted if the service provider (i) is honest but makes mistakes in the mining process, or (ii) is lazy and reduces costly computation, returning incomplete results, or (iii) is malicious and contaminates the mining results. We address the integrity issue in the outsourcing process, i.e., how the data owner verifies the correctness of the mining results. For this purpose, we propose and develop an audit environment, which consists of a database transformation method and a result verification method. The main component of our audit environment is an artificial itemset planting (AIP) technique. We provide a theoretical foundation for our technique by proving its appropriateness and showing probabilistic guarantees about the correctness of the verification process. Through analytical and experimental studies, we show that our technique is both effective and efficient.

ACM Transactions on Computational Logic | 2007

Probabilistic interval XML

Edward Hung; Lise Getoor; V. S. Subrahmanian

Interest in XML databases has been expanding rapidly over the last few years. In this paper, we study the problem of incorporating probabilistic information into XML databases. We propose the Probabilistic Interval XML (PIXML for short) data model. Using this data model, users can express probabilistic information within XML markups. In addition, we provide two alternative formal model-theoretic semantics for PIXML data. The first semantics is a “global” semantics which is relatively intuitive, but is not directly amenable to computation. The second semantics is a “local” semantics which supports efficient computation. We prove several correspondence results between the two semantics. To our knowledge, this is the first formal model-theoretic semantics for probabilistic interval XML. We then provide an operational semantics that may be used to compute answers to queries and that is correct for a large class of probabilistic instances.

international conference on data engineering | 2005

RDF aggregate queries and views

Edward Hung; Yu Deng; V. S. Subrahmanian

Resource description framework (RDF) is a rapidly expanding Web standard. RDF databases attempt to track the massive amounts of Web data and services available. In this paper, we study the problem of aggregate queries. We develop an algorithm to compute answers to aggregate queries over RDF databases and algorithms to maintain views involving those aggregates. Though RDF data can be stored in a standard relational DBMS (and hence we can execute standard relational aggregate queries and view maintenance methods on them), we show experimentally that our algorithms that operate directly on the RDF representation exhibit significantly superior performance.

computational intelligence and data mining | 2007

An Efficient Distance Calculation Method for Uncertain Objects

Lurong Xiao; Edward Hung

Recently, the academic community has paid more attention to queries and mining on uncertain data. In tasks such as clustering or nearest-neighbor queries, expected distance is often used as a distance measure among uncertain data objects. Traditional database systems store uncertain objects using their expected (average) locations in the data space. Distances can be calculated easily from the expected locations, but they poorly approximate the real expected distance values. Recent research calculates the expected distance as the weighted average of the pairwise distances among samples of two uncertain objects. However, the pairwise distance calculations take much longer than the former method. In this paper, we propose an efficient method, approximation by single Gaussian (ASG), to calculate the expected distance as a function of the means and variances of samples of uncertain objects. Theoretical and experimental studies show that ASG combines the high accuracy of the latter method with the fast execution time of the former. We suggest that ASG plays an important role in significantly reducing computational costs in query processing and in various data mining tasks such as clustering and outlier detection.
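To make the contrast between pairwise averaging and a means-and-variances formula concrete, the sketch below uses the squared Euclidean distance with equally weighted samples. In that special case the pairwise average equals the squared distance between the sample means plus the two total (population) variances. This is a standard identity chosen for illustration and is not the ASG method itself, whose formula is not given in the abstract. All function names and sample values are hypothetical.

```python
def mean_pairwise_sq_dist(xs, ys):
    """Average squared Euclidean distance over all sample pairs:
    O(len(xs) * len(ys)) distance evaluations."""
    total = sum(
        sum((a - b) ** 2 for a, b in zip(x, y))
        for x in xs
        for y in ys
    )
    return total / (len(xs) * len(ys))


def moment_sq_dist(xs, ys):
    """Same quantity from the means and total population variances:
    one pass over each sample set."""
    def mean(samples):
        return [sum(col) / len(samples) for col in zip(*samples)]

    def total_variance(samples, mu):
        return sum(
            (v - m) ** 2 for s in samples for v, m in zip(s, mu)
        ) / len(samples)

    mx, my = mean(xs), mean(ys)
    gap = sum((a - b) ** 2 for a, b in zip(mx, my))
    return gap + total_variance(xs, mx) + total_variance(ys, my)


# Hypothetical samples of two uncertain objects in two dimensions.
xs = [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)]
ys = [(5.0, 5.0), (7.0, 4.0)]
print(mean_pairwise_sq_dist(xs, ys), moment_sq_dist(xs, ys))
```

The two functions agree up to floating-point rounding, while the second one avoids iterating over every pair of samples.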

international conference on move to meaningful internet systems | 2005

Probabilistic ontologies and relational databases

Octavian Udrea; Yu Deng; Edward Hung; V. S. Subrahmanian

The relational algebra and calculus do not take the semantics of terms into account when answering queries. As a consequence, not all tuples that should be returned in response to a query are always returned, leading to low recall. In this paper, we propose the novel notion of a constrained probabilistic ontology (CPO). We develop the concept of a CPO-enhanced relation in which each attribute of a relation has an associated CPO. These CPOs describe relationships between terms occurring in the domain of that attribute. We show that the relational algebra can be extended to handle CPO-enhanced relations. This allows queries to yield sets of tuples, each of which has a probability of being correct.

Expert Systems With Applications | 2011

A subspace decision cluster classifier for text classification

Yan Li; Edward Hung; Korris Fu-Lai Chung

In this paper, a new classification method for high-dimensional text data with multiple classes is proposed. In this method, a subspace decision cluster classification (SDCC) model consists of a set of disjoint subspace decision clusters, each labeled with a dominant class to determine the class of new objects falling in the cluster. A cluster tree is first generated from a training data set by recursively calling a subspace clustering algorithm, the Entropy Weighting k-Means algorithm. Then, the SDCC model is extracted from the subspace decision cluster tree. Various tests, including the Anderson-Darling test, are used to determine the stopping condition for growing the tree. A series of experiments on real text data sets has been conducted. The results show that SDCC outperforms existing methods such as decision trees and SVM. SDCC is particularly suitable for large, high-dimensional, sparse text data with many classes.

Neurocomputing | 2015

Large margin clustering on uncertain data by considering probability distribution similarity

Lei Xu; Qinghua Hu; Edward Hung; Baowen Chen; Xu Tan; Changrui Liao

In this paper, the problem of clustering uncertain objects whose locations are uncertain and described by probability density functions (pdf) is studied. Though some existing methods (e.g., K-means, DBSCAN) have been extended to handle uncertain object clustering, some limitations remain. K-means assumes that the objects are described by reasonably separated spherical balls. Thus, UK-means, which is based on K-means, is limited in handling objects with non-spherical shapes. On the other hand, the probability density function is an important characteristic of uncertain data, but few existing clustering methods consider the difference between objects based on their probability density functions. Therefore, in this article, a clustering algorithm based on probability distribution similarity is proposed. Our method aims at finding the largest margin between clusters to overcome the limitation of UK-means. Extensive experimental results verify the performance of our method in terms of effectiveness, efficiency and scalability on both synthetic and real data sets.

Highlights: We study the problem of clustering on uncertain objects. We consider the difference between objects based on probability density functions. We aim at finding the largest margin between clusters to overcome the limitation of UK-means. The experimental results verify the performance of our method in terms of effectiveness, efficiency and scalability on both synthetic and real data sets.
