Aoying Zhou | Researchain

Publication

Featured research published by Aoying Zhou.

very large data bases | 2004

False positive or false negative: mining frequent itemsets from high speed transactional data streams

Jeffrey Xu Yu; Zhihong Chong; Hongjun Lu; Aoying Zhou

The problem of finding frequent items over high-speed data streams has been studied recently. However, mining frequent itemsets from transactional data streams has not yet been well addressed in terms of its bounds on memory consumption. The main difficulty lies in the exponential explosion of itemsets: given a domain of I unique items, the number of possible itemsets can be up to 2^I - 1. When the length of a data stream approaches a very large number N, the probability that an itemset becomes frequent grows, and such itemsets are difficult to track with limited memory. However, the real killer of effective frequent itemset mining is that most existing algorithms are false-positive oriented. That is, they control memory consumption in the counting process by an error parameter ε, and allow items with support below the specified minimum support s but above s - ε to be counted as frequent. Such false-positive items increase the number of false-positive frequent itemsets exponentially, which may make the problem computationally intractable with bounded memory consumption. In this paper, we develop algorithms that can effectively mine frequent item(set)s from high-speed transactional data streams with a bound on memory consumption. Our algorithms are false-negative oriented, that is, certain frequent itemsets may not appear in the results, but the number of false-negative itemsets can be controlled by a predefined parameter so that a desired recall rate of frequent itemsets can be guaranteed. The algorithms are based on the Chernoff bound. Our extensive experimental studies show that the proposed algorithms have high accuracy, require less memory, and consume less CPU time. They significantly outperform the existing false-positive algorithms.
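The false-positive behaviour this abstract criticises can be illustrated with Lossy Counting (Manku and Motwani), a well-known false-positive counting scheme over single items. This is only a sketch of that baseline, not the false-negative, Chernoff-bound algorithm the paper proposes. The function names and the toy stream are assumptions made for illustration, and the abstract's error parameter is called `epsilon` here.

```python
import math
from typing import Dict, Hashable, Iterable, List, Tuple

Entries = Dict[Hashable, Tuple[int, int]]


def lossy_count(stream: Iterable[Hashable], epsilon: float) -> Tuple[Entries, int]:
    """Lossy Counting over single items (hypothetical helper).

    Returns (entries, n) where entries maps item -> (count, max_undercount)
    and n is the number of items seen.
    """
    width = math.ceil(1 / epsilon)  # number of items per bucket
    entries: Entries = {}
    n = 0
    for item in stream:
        n += 1
        bucket = math.ceil(n / width)  # id of the current bucket
        if item in entries:
            count, delta = entries[item]
            entries[item] = (count + 1, delta)
        else:
            # The item may have been pruned before, at most bucket - 1 times.
            entries[item] = (1, bucket - 1)
        if n % width == 0:
            # At each bucket boundary, drop entries that cannot be frequent.
            entries = {k: (c, d) for k, (c, d) in entries.items() if c + d > bucket}
    return entries, n


def frequent_items(entries: Entries, n: int, s: float, epsilon: float) -> List[Hashable]:
    """Report every item whose stored count reaches (s - epsilon) * n."""
    return sorted(k for k, (c, _) in entries.items() if c >= (s - epsilon) * n)


# Toy stream of 100 items: "a" has support 0.5, "b" has support 0.3.
stream = ["a", "b", "a", "c", "a", "b", "a", "d", "a", "b"] * 10
entries, n = lossy_count(stream, epsilon=0.1)
# With s = 0.35, "b" (true support 0.3) is still reported: a false positive.
print(frequent_items(entries, n, s=0.35, epsilon=0.1))
```

In this toy run the item `b` lies between s - ε and s, so it is reported even though it is not truly frequent. When such items are later combined into itemsets, the number of spurious candidates can grow quickly, which is the effect the abstract describes.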

conference on information and knowledge management | 2003

Dynamically maintaining frequent items over a data stream

Cheqing Jin; Weining Qian; Chaofeng Sha; Jeffrey Xu Yu; Aoying Zhou

It is a challenge to maintain frequent items over a data stream with a small, bounded memory in a dynamic environment where both insertion and deletion of items are allowed. In this paper, we propose a novel algorithm, called hCount, which can handle both insertion and deletion of items with much less memory space than the best reported algorithm. Our algorithm is also superior in terms of precision, recall and processing time. In addition, our approach does not require prior knowledge of the size of the value range of a data stream, and can handle range extension dynamically. With a small modification, hCount can be improved to hCount*, which achieves significantly better performance.
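To make the idea of a small, fixed-size summary that supports both insertion and deletion concrete, here is a minimal hash-based counting sketch in the style of Count-Min. It is a generic illustration under stated assumptions, not hCount itself: the abstract does not give hCount's internals, and the class name, table sizes, and use of SHA-256 are choices made here.

```python
import hashlib
from typing import Hashable, List


class CountingSketch:
    """Fixed-size table of counters indexed by several hash functions."""

    def __init__(self, rows: int = 4, cols: int = 256) -> None:
        self.rows = rows
        self.cols = cols
        self.table: List[List[int]] = [[0] * cols for _ in range(rows)]

    def _index(self, row: int, item: Hashable) -> int:
        # One hash function per row, derived by salting with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.cols

    def update(self, item: Hashable, delta: int = 1) -> None:
        """Insert (delta > 0) or delete (delta < 0) occurrences of item."""
        for r in range(self.rows):
            self.table[r][self._index(r, item)] += delta

    def estimate(self, item: Hashable) -> int:
        """Smallest counter across rows; never below the true count
        as long as no item is deleted more often than it was inserted."""
        return min(self.table[r][self._index(r, item)] for r in range(self.rows))


sketch = CountingSketch()
for _ in range(5):
    sketch.update("x")       # five insertions
sketch.update("x", -2)       # two deletions
print(sketch.estimate("x"))  # estimate of the remaining count
```

Memory use depends only on `rows * cols`, not on the number of distinct items, which is the property that makes this family of summaries attractive for streams.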

international conference on data engineering | 2006

VBI-Tree: A Peer-to-Peer Framework for Supporting Multi-Dimensional Indexing Schemes

H. V. Jagadish; Beng Chin Ooi; Quang Hieu Vu; Rong Zhang; Aoying Zhou

Multi-dimensional data indexing has received much attention in centralized databases. However, much less work has been done on this topic in the context of Peer-to-Peer systems. In this paper, we propose a new Peer-to-Peer framework based on a balanced tree structure overlay, which can support extensible centralized mapping methods and query processing based on a variety of multi-dimensional tree structures, including R-Tree, X-Tree, SS-Tree, and M-Tree. Specifically, in a network with N nodes, our framework guarantees that point queries and range queries can be answered within O(log N) hops. We also provide an effective load balancing strategy to allow nodes to balance their workload efficiently. An experimental assessment validates the practicality of our proposal.

Knowledge and Information Systems | 2008

Tracking clusters in evolving data streams over sliding windows

Aoying Zhou; Feng Cao; Weining Qian; Cheqing Jin

Mining data streams poses great challenges due to the limited memory availability and real-time query response requirement. Clustering an evolving data stream is especially interesting because it captures not only the changing distribution of clusters but also the evolving behaviors of individual clusters. In this paper, we present a novel method for tracking the evolution of clusters over sliding windows. In our SWClustering algorithm, we combine the exponential histogram with the temporal cluster features and propose a novel data structure, the Exponential Histogram of Cluster Features (EHCF). The exponential histogram is used to handle the in-cluster evolution, and the temporal cluster features represent the change of the cluster distribution. Our approach has several advantages over existing methods: (1) the quality of the clusters is improved because the EHCF captures the distribution of recent records precisely; (2) compared with previous methods, the mechanism employed to adaptively maintain the in-cluster synopsis can track the cluster evolution better, while consuming much less memory; (3) the EHCF provides a flexible framework for analyzing the cluster evolution and tracking a specific cluster efficiently without interfering with other clusters, thus reducing the consumption of computing resources for data stream clustering. Both the theoretical analysis and extensive experiments show the effectiveness and efficiency of the proposed method.
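The building block named in this abstract, a cluster feature that also carries time information, can be sketched as an additive summary with a timestamp. The field layout below (record count, linear sum, sum of squares, latest timestamp) is an assumption based on the usual cluster-feature vector, and the sketch leaves out the exponential-histogram bucketing that the EHCF adds on top.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class TemporalCF:
    """Additive summary of a group of records plus their latest timestamp."""

    n: int                    # number of summarized records
    linear_sum: List[float]   # per-dimension sum of values
    square_sum: List[float]   # per-dimension sum of squared values
    latest: int               # timestamp of the most recent record

    @classmethod
    def from_point(cls, point: Sequence[float], timestamp: int) -> "TemporalCF":
        return cls(1, list(point), [x * x for x in point], timestamp)

    def merge(self, other: "TemporalCF") -> "TemporalCF":
        # All numeric fields are sums, so merging is component-wise addition.
        return TemporalCF(
            self.n + other.n,
            [a + b for a, b in zip(self.linear_sum, other.linear_sum)],
            [a + b for a, b in zip(self.square_sum, other.square_sum)],
            max(self.latest, other.latest),
        )

    def centroid(self) -> List[float]:
        return [x / self.n for x in self.linear_sum]

    def expired(self, now: int, window: int) -> bool:
        # True when even the newest summarized record has left the window.
        return self.latest <= now - window


a = TemporalCF.from_point([1.0, 2.0], timestamp=1)
b = TemporalCF.from_point([3.0, 4.0], timestamp=5)
merged = a.merge(b)
print(merged.centroid(), merged.expired(now=20, window=10))
```

Because the summary is additive, two summaries can be merged without revisiting the raw records, and a whole summary can be dropped once its newest record falls outside the sliding window.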

international conference on data engineering | 2005

Bloom filter-based XML packets filtering for millions of path queries

Xueqing Gong; Weining Qian; Ying Yan; Aoying Zhou

The filtering of XML data is the basis of many complex applications, and many algorithms have been proposed to solve this problem. One important challenge is that the number of path queries is huge, so an efficient data structure is needed to represent them. Another challenge is that these path queries usually vary with time. The maintenance of path queries determines the flexibility and capacity of a filtering system. In this paper, we introduce a novel approximate method for XML data filtering, which uses Bloom filters to represent path queries. In this method, millions of path queries can be stored efficiently. At the same time, it is easy to deal with changes to these path queries. To improve the filtering performance, we introduce a new data structure, Prefix Filters, to decrease the number of candidate paths. Experiments show that our Bloom filter-based method takes less time to build the routing table than the automaton-based method. Our method also performs well, with an acceptable false-positive rate, when filtering XML packets of relatively small depth against millions of path queries.
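A minimal sketch of storing path queries in a Bloom filter and probing it with the paths of an incoming document follows. It assumes simple root-to-node paths written as strings such as `/book/title`; the class, the helper `matching_prefixes`, the filter size, and the hash construction are illustrative assumptions, and the paper's Prefix Filters are not reproduced.

```python
import hashlib
from typing import Iterator, List, Sequence


class BloomFilter:
    """Bit array with several hash functions; membership tests may yield
    false positives but never false negatives."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key: str) -> Iterator[int]:
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, key: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))


def matching_prefixes(bloom: BloomFilter, steps: Sequence[str]) -> List[str]:
    """Return each root-to-node prefix of a document path that the filter
    reports as a (possible) stored query."""
    prefixes = ("/" + "/".join(steps[:i]) for i in range(1, len(steps) + 1))
    return [p for p in prefixes if p in bloom]


queries = BloomFilter()
queries.add("/book/title")
queries.add("/book/author/name")
print(matching_prefixes(queries, ["book", "author", "name"]))
```

Adding a query only sets bits, so inserting new queries is cheap. Removing a query is not supported by a plain Bloom filter and would need a counting variant or a rebuild.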

very large data bases | 2002

DTD-directed publishing with attribute translation grammars

Michael Benedikt; Chee Yong Chan; Wenfei Fan; Rajeev Rastogi; Shihui Zheng; Aoying Zhou

We present a framework for publishing relational data in XML with respect to a fixed DTD. In data exchange on the Web, XML views of relational data are typically required to conform to a predefined DTD. The presence of recursion in a DTD as well as non-determinism makes it challenging to generate DTD-directed, efficient transformations. Our framework provides a language for defining views that are guaranteed to be DTD-conformant, as well as middleware for evaluating these views. It is based on a novel notion of attribute translation grammars (ATGs). An ATG extends a DTD by associating semantic rules via SQL queries. Directed by the DTD, it extracts data from a relational database, and constructs an XML document. We provide algorithms for efficiently evaluating ATGs, along with methods for statically analyzing them. This yields a systematic and effective approach to publishing data with respect to a predefined DTD.

database systems for advanced applications | 2003

M-kernel merging: towards density estimation over data streams

Aoying Zhou; Zhiyuan Cai; Li Wei; Weining Qian

Density estimation is a costly operation for computing the distribution information of data sets, and it underlies many important data mining applications, such as clustering and biased sampling. However, traditional density estimation methods are inapplicable to streaming data, which arrive continuously and in large volume, because they require linear storage and quadratic computation. This shortcoming limits the application of many existing effective algorithms to data streams, for which mining is both an urgent need for applications and a challenge for research. In this paper, the problem of computing density functions over data streams is examined. A novel method addressing this shortcoming is developed to enable density estimation for large volumes of data in linear time and fixed-size memory, without loss of accuracy. The method is based on M-Kernel merging, so that the limited number of kernel functions to be maintained is determined intelligently. The application of the new method to different streaming data models is discussed, and the results of intensive experiments are presented. The analytical and empirical results show that this new density estimation algorithm for data streams can calculate density functions on demand at any time with high accuracy for different streaming data models.
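The general idea of keeping a bounded number of kernels by merging them can be sketched as follows. The abstract does not state how M-Kernel merging chooses the merged kernel, so this sketch swaps in plain moment matching (the merged Gaussian keeps the weight, mean, and variance of the pair it replaces). The names `Kernel`, `merge`, and `density` are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Kernel:
    weight: float     # number of points the kernel stands for
    mean: float       # kernel centre
    bandwidth: float  # standard deviation of the Gaussian kernel


def merge(a: Kernel, b: Kernel) -> Kernel:
    """Replace two Gaussian kernels by one with the same total weight,
    mean, and variance (moment matching, an assumption of this sketch)."""
    w = a.weight + b.weight
    mean = (a.weight * a.mean + b.weight * b.mean) / w
    second_moment = (
        a.weight * (a.bandwidth ** 2 + a.mean ** 2)
        + b.weight * (b.bandwidth ** 2 + b.mean ** 2)
    ) / w
    return Kernel(w, mean, math.sqrt(second_moment - mean ** 2))


def density(kernels: List[Kernel], x: float) -> float:
    """Weighted Gaussian kernel density estimate at x."""
    total = sum(k.weight for k in kernels)
    return sum(
        k.weight
        * math.exp(-((x - k.mean) ** 2) / (2 * k.bandwidth ** 2))
        / (k.bandwidth * math.sqrt(2 * math.pi))
        for k in kernels
    ) / total


pair = [Kernel(1.0, 0.0, 1.0), Kernel(1.0, 2.0, 1.0)]
merged = merge(*pair)
# Compare the estimate before and after merging at the midpoint.
print(density(pair, 1.0), density([merged], 1.0))
```

Merging the closest pair whenever the kernel count exceeds a fixed budget keeps memory constant, at the cost of some approximation error that depends on how the merged kernel is chosen.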

international conference on data engineering | 2007

Distributed Data Stream Clustering: A Fast EM-based Approach

Aoying Zhou; Feng Cao; Ying Yan; Chaofeng Sha; Xiaofeng He

Clustering data streams has attracted a lot of research effort recently. However, this problem has not received enough consideration when the data streams are generated in a distributed fashion, although such a scenario is very common in real-life applications. There are several constraining factors in clustering data streams in a distributed environment: the data records generated are noisy or incomplete due to the unreliable distributed system; the system needs to process a huge volume of data online; and communication is potentially a bottleneck of the system. All these factors pose great challenges for clustering distributed data streams. In this paper, we propose an EM-based (Expectation Maximization) framework to effectively cluster distributed data streams, with the above fundamental challenges in mind. In the presence of noisy or incomplete data records, our algorithms learn the distribution of the underlying data streams by maximizing the likelihood of the data clusters. A test-and-cluster strategy is proposed to reduce the average processing cost, which is especially effective for online clustering over large data streams. Our extensive experimental studies show that the proposed algorithms can achieve high accuracy with less communication cost, memory consumption and CPU time.
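For readers unfamiliar with EM, the sketch below runs textbook Expectation Maximization on a one-dimensional Gaussian mixture held in memory. It shows only the likelihood-maximizing core that the abstract refers to; the distributed setting, the handling of incomplete records, and the test-and-cluster strategy are not reproduced. The function names, the variance floor `min_var`, and the toy data are assumptions.

```python
import math
from typing import List, Sequence, Tuple


def gaussian_pdf(x: float, mean: float, var: float) -> float:
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(
    data: Sequence[float],
    init_means: Sequence[float],
    iterations: int = 50,
    min_var: float = 1e-3,
) -> Tuple[List[float], List[float], List[float]]:
    """Fit a 1-D Gaussian mixture; returns (weights, means, variances)."""
    k = len(init_means)
    means = list(init_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            p = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(p) or 1e-300  # guard against underflow to zero
            resp.append([pi / total for pi in p])
        # M-step: re-estimate parameters from the responsibilities.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj == 0:
                continue  # leave an empty component unchanged
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(min_var, var)  # floor keeps the pdf well defined
            weights[j] = nj / len(data)
    return weights, means, variances


data = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
weights, means, variances = em_gmm_1d(data, init_means=[0.0, 6.0])
print(weights, means)
```

Each iteration needs only per-component sums over the data, which is one reason EM-style models are a natural fit when sites exchange compact summaries rather than raw records.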

Knowledge and Information Systems | 2006

Finding centric local outliers in categorical/numerical spaces

Jeffrey Xu Yu; Weining Qian; Hongjun Lu; Aoying Zhou

Outlier detection techniques are widely used in many applications such as credit-card fraud detection and monitoring criminal activities in electronic commerce. These applications attempt to identify outliers as noise, exceptions, or objects around the border. The existing density-based local outlier detection assigns the degree to which an object is an outlier in a numerical space. In this paper, we propose a novel mutual-reinforcement-based local outlier detection approach. Instead of detecting local outliers as noise, we attempt to identify local outliers in the center, where they are similar to some clusters of objects on the one hand, and are unique on the other. Our technique can be used in bank investment to identify a unique body, similar to many good competitors, in which to invest. We attempt to detect local outliers in categorical, ordinal, and numerical data. In categorical data, the challenge is that there are many similar but different ways to specify relationships among the data items. Our mutual-reinforcement-based approach is stable under similar but different user-defined relationships. Our technique can reduce the burden on users of determining the relationships among data items, and can explain why the outliers are found. We conducted extensive experimental studies using real datasets.

very large data bases | 2007

An adaptive and dynamic dimensionality reduction method for high-dimensional indexing

Heng Tao Shen; Xiaofang Zhou; Aoying Zhou

The notorious “dimensionality curse” is a well-known phenomenon for any multi-dimensional index attempting to scale up to high dimensions. One well-known approach to overcoming the degradation in performance as the number of dimensions increases is to reduce the dimensionality of the original dataset before constructing the index. However, identifying the correlation among the dimensions and effectively reducing them are challenging tasks. In this paper, we present an adaptive Multi-level Mahalanobis-based Dimensionality Reduction (MMDR) technique for high-dimensional indexing. Our MMDR technique has four notable features compared to existing methods. First, it discovers elliptical clusters for more effective dimensionality reduction by using only the low-dimensional subspaces. Second, data points in the different axis systems are indexed using a single B+-tree. Third, our technique is highly scalable in terms of data size and dimension. Finally, it is also dynamic and adaptive to insertions. An extensive performance study was conducted using both real and synthetic datasets, and the results show that our technique not only achieves higher precision, but also enables queries to be processed efficiently.
