Zhenjie Zhang

National University of Singapore

Publication

Featured research published by Zhenjie Zhang.

international conference on management of data | 2006

Finding k-dominant skylines in high dimensional space

Chee Yong Chan; H. V. Jagadish; Kian-Lee Tan; Anthony K. H. Tung; Zhenjie Zhang

Given a d-dimensional data set, a point p dominates another point q if it is better than or equal to q in all dimensions and better than q in at least one dimension. A point is a skyline point if there does not exist any point that dominates it. Skyline queries, which return skyline points, are useful in many decision-making applications. Unfortunately, as the number of dimensions increases, the chance of one point dominating another becomes very low. As such, the number of skyline points becomes too large to offer any interesting insights. To find more important and meaningful skyline points in high dimensional space, we propose a new concept, called k-dominant skyline, which relaxes the idea of dominance to k-dominance. A point p is said to k-dominate another point q if there are k ≤ d dimensions in which p is better than or equal to q and is better in at least one of these k dimensions. A point that is not k-dominated by any other point is in the k-dominant skyline. We prove various properties of k-dominant skyline. In particular, because the k-dominance relation is not transitive, existing skyline algorithms cannot be adapted for k-dominant skyline. We then present several new algorithms for finding k-dominant skyline and its variants. Extensive experiments show that our methods can answer different queries on both synthetic and real data sets efficiently.
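The k-dominance definition above can be illustrated with a naive quadratic-time check. This is only a sketch of the definition, not one of the algorithms proposed in the paper; the function names are hypothetical, and it assumes that smaller values are better in every dimension.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def k_dominates(p: Sequence[float], q: Sequence[float], k: int) -> bool:
    """Return True if p k-dominates q (assumption: smaller is better).

    p k-dominates q when there are k dimensions in which p is no worse
    than q, with p strictly better in at least one of them. Any strictly
    better dimension is also a "no worse" dimension, so it suffices to
    count both kinds.
    """
    no_worse = sum(a <= b for a, b in zip(p, q))
    strictly_better = sum(a < b for a, b in zip(p, q))
    return no_worse >= k and strictly_better >= 1


def k_dominant_skyline(points: List[Point], k: int) -> List[Point]:
    """Naive O(n^2 * d) scan: keep points not k-dominated by any other."""
    return [
        p
        for i, p in enumerate(points)
        if not any(
            k_dominates(q, p, k) for j, q in enumerate(points) if j != i
        )
    ]


pts = [(1, 1, 9), (2, 2, 2), (3, 3, 3)]
print(k_dominant_skyline(pts, 3))  # k = d gives the conventional skyline
print(k_dominant_skyline(pts, 2))  # relaxed dominance prunes more points
```

With k = d the check reduces to conventional dominance. Because k-dominance is not transitive, points can k-dominate each other in a cycle, so the k-dominant skyline may even be empty for small k.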

extending database technology | 2006

On high dimensional skylines

Chee Yong Chan; H. V. Jagadish; Kian-Lee Tan; Anthony K. H. Tung; Zhenjie Zhang

In many decision-making applications, the skyline query is frequently used to find a set of dominating data points (called skyline points) in a multi-dimensional dataset. In a high-dimensional space, skyline points no longer offer any interesting insights as there are too many of them. In this paper, we introduce a novel metric, called skyline frequency, that compares and ranks the interestingness of data points based on how often they are returned in the skyline when different subsets of the dimensions (i.e., subspaces) are considered. Intuitively, a point with a high skyline frequency is more interesting as it can be dominated on fewer combinations of the dimensions. Thus, the problem becomes one of finding the top-k frequent skyline points. But the algorithms proposed thus far for skyline computation typically do not scale well with dimensionality. Moreover, frequent skyline computation requires that skylines be computed for each of an exponential number of subsets of the dimensions. We present efficient approximate algorithms to address these twin difficulties. Our extensive performance study shows that our approximate algorithm can run fast and compute the correct result on large data sets in high-dimensional spaces.
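To make the skyline frequency metric concrete, the following brute-force sketch counts, for each point, the non-empty subspaces in which it belongs to the skyline. It enumerates all 2^d - 1 subspaces, which is exactly the exponential cost the paper's approximate algorithms are designed to avoid. The function names are hypothetical, and smaller values are assumed to be better.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def dominates_in(p: Sequence[float], q: Sequence[float],
                 dims: Tuple[int, ...]) -> bool:
    """True if p dominates q when only the dimensions in `dims` count."""
    return (all(p[i] <= q[i] for i in dims)
            and any(p[i] < q[i] for i in dims))


def skyline_frequency(points: List[Sequence[float]]) -> List[int]:
    """For each point, count the subspaces whose skyline contains it."""
    d = len(points[0])
    freq = [0] * len(points)
    for size in range(1, d + 1):
        for dims in combinations(range(d), size):
            for i, p in enumerate(points):
                dominated = any(
                    dominates_in(q, p, dims)
                    for j, q in enumerate(points) if j != i
                )
                if not dominated:
                    freq[i] += 1
    return freq


pts = [(1, 3), (2, 2), (3, 1), (3, 3)]
print(skyline_frequency(pts))  # one count per point, in input order
```

Sorting the points by these counts in descending order and keeping the first k gives the top-k frequent skyline points for small inputs.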

international conference on management of data | 2010

Bed-tree: an all-purpose index structure for string similarity search based on edit distance

Zhenjie Zhang; Marios Hadjieleftheriou; Beng Chin Ooi; Divesh Srivastava

Strings are ubiquitous in computer systems and hence string processing has attracted extensive research effort from computer scientists in diverse areas. One of the most important problems in string processing is to efficiently evaluate the similarity between two strings based on a specified similarity measure. String similarity search is a fundamental problem in information retrieval, database cleaning, biological sequence analysis, and more. While a large number of dissimilarity measures on strings have been proposed, edit distance is the most popular choice in a wide spectrum of applications. Existing indexing techniques for similarity search queries based on edit distance, e.g., approximate selection and join queries, rely mostly on n-gram signatures coupled with inverted list structures. These techniques are tailored for specific query types only, and their performance remains unsatisfactory, especially in scenarios with strict memory constraints or frequent data updates. In this paper, we propose the Bed-tree, a B+-tree based index structure for evaluating all types of similarity queries on edit distance and normalized edit distance. We identify the necessary properties of a mapping from the string space to the integer space for supporting searching and pruning for these queries. Three transformations are proposed that capture different aspects of information inherent in strings, enabling efficient pruning during the search process on the tree. Compared to state-of-the-art methods on string similarity search, the Bed-tree is a complete solution that meets the requirements of all applications, providing high scalability and fast response time.
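For reference, the edit distance that these queries are built on can be computed with the standard dynamic program below. This sketch shows only the distance measure (unit-cost insertions, deletions, and substitutions), not the Bed-tree index or its string transformations; the function name is hypothetical.

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance between s and t, using two rolling rows."""
    # prev[j] holds the distance between the first i-1 chars of s
    # and the first j chars of t.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]  # turning s[:i] into the empty string takes i deletions
        for j, ct in enumerate(t, start=1):
            substitution = prev[j - 1] + (cs != ct)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))
```

Each comparison costs O(|s| · |t|) time, which is why an index that prunes candidates before the distance is computed matters for large string collections.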

very large data bases | 2012

Functional mechanism: regression analysis under differential privacy

Jun Zhang; Zhenjie Zhang; Xiaokui Xiao; Yin Yang; Marianne Winslett

ε-differential privacy is the state-of-the-art model for releasing sensitive information while protecting privacy. Numerous methods have been proposed to enforce ε-differential privacy in various analytical tasks, e.g., regression analysis. Existing solutions for regression analysis, however, are either limited to non-standard types of regression or unable to produce accurate regression results. Motivated by this, we propose the Functional Mechanism, a differentially private method designed for a large class of optimization-based analyses. The main idea is to enforce ε-differential privacy by perturbing the objective function of the optimization problem, rather than its results. As case studies, we apply the functional mechanism to address the two most widely used regression models, namely, linear regression and logistic regression. Both theoretical analysis and thorough experimental evaluations show that the functional mechanism is highly effective and efficient, and it significantly outperforms existing solutions.

very large data bases | 2009

Effectively indexing uncertain moving objects for predictive queries

Meihui Zhang; Su Chen; Christian S. Jensen; Beng Chin Ooi; Zhenjie Zhang

Moving object indexing and query processing is a well-studied research topic, with applications in areas such as intelligent transport systems and location-based services. While much existing work explicitly or implicitly assumes a deterministic object movement model, real-world objects often move in more complex and stochastic ways. This paper investigates the possibility of a marriage between moving-object indexing and probabilistic object modeling. Given the distributions of the current locations and velocities of moving objects, we devise an efficient inference method for the prediction of future locations. We demonstrate that such prediction can be seamlessly integrated into existing index structures designed for moving objects, thus improving the meaningfulness of range and nearest neighbor query results in highly dynamic and uncertain environments. The paper reports on extensive experiments on the Bx-tree that offer insights into the properties of the paper's proposal.

international conference on management of data | 2009

Minimizing the communication cost for continuous skyline maintenance

Zhenjie Zhang; Reynold Cheng; Dimitris Papadias; Anthony K. H. Tung

Existing work in the skyline literature focuses on optimizing the processing cost. This paper aims at minimizing the communication overhead in client-server architectures, where a server continuously maintains the skyline of dynamic objects. Our first contribution is a Filter method that avoids transmission of updates from objects that cannot influence the skyline. Specifically, each object is assigned a filter so that it needs to issue an update only if it violates its filter. Filter achieves significant savings over the naive approach of transmitting all updates. Going one step further, we introduce the concept of frequent skyline query over a sliding window (FSQW). The motivation is that snapshot skylines are not very useful in streaming environments because they keep changing over time. Instead, FSQW reports the objects that appear in the skylines of at least θ ⋅ s of the s most recent timestamps (0 < θ ≤ 1). Filter can be easily adapted to FSQW processing, but with potentially high overhead for large and frequently updated datasets. To further reduce the communication cost, we propose a Sampling method, which returns approximate FSQW results without computing each snapshot skyline. Finally, we integrate Filter and Sampling in a Hybrid approach that combines their individual advantages.
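The FSQW semantics can be stated compactly in code. The sketch below assumes the snapshot skyline of every timestamp is already available as a set of object identifiers, which is precisely the exact computation that the Sampling method avoids. The function name and input layout are hypothetical.

```python
from collections import Counter
from typing import Hashable, List, Set


def fsqw(snapshot_skylines: List[Set[Hashable]], s: int,
         theta: float) -> Set[Hashable]:
    """Objects appearing in at least theta * s of the s latest skylines.

    `snapshot_skylines` is ordered from oldest to newest; each entry is
    the set of object ids in the skyline at that timestamp.
    """
    if not 0 < theta <= 1:
        raise ValueError("theta must satisfy 0 < theta <= 1")
    window = snapshot_skylines[-s:]  # the s most recent timestamps
    counts = Counter(obj for skyline in window for obj in skyline)
    return {obj for obj, c in counts.items() if c >= theta * s}


snapshots = [{"a", "b"}, {"a"}, {"a", "c"}, {"b", "c"}]
print(fsqw(snapshots, s=4, theta=0.75))
print(fsqw(snapshots, s=4, theta=0.5))
```

A larger θ keeps only objects that stay in the skyline persistently, which is what makes the result more stable than any single snapshot skyline.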

very large data bases | 2010

Efficient and effective similarity search over probabilistic data based on earth mover's distance

Jia Xu; Zhenjie Zhang; Anthony K. H. Tung; Ge Yu

Advances in geographical tracking, multimedia processing, information extraction, and sensor networks have created a deluge of probabilistic data. While similarity search is an important tool to support the manipulation of probabilistic data, it poses new challenges for traditional relational databases. The problem stems from the limited effectiveness of the distance metrics employed by existing database systems. On the other hand, several more complicated distance operators have proven their value by offering better distinguishing ability in specific probabilistic domains. In this paper, we discuss the similarity search problem with respect to Earth Mover's Distance (EMD). EMD is the most successful distance metric for probability distribution comparison but is an expensive operator as it has cubic time complexity. We present a new database indexing approach to answer EMD-based similarity queries, including range queries and k-nearest neighbor queries on probabilistic data. Our solution utilizes primal-dual theory from linear programming and employs a group of B+ trees for effective candidate pruning. We also apply our filtering technique to the processing of continuous similarity queries, especially with applications to frame copy detection in real-time videos. Extensive experiments show that our proposals dramatically improve the usefulness and scalability of probabilistic data management.

conference on information and knowledge management | 2005

Discovering strong skyline points in high dimensional spaces

Zhenjie Zhang; Xinyu Guo; Hua Lu; Anthony K. H. Tung; Nan Wang

Current interest in skyline computation arises from its relation to preference queries. Since it is guaranteed that a skyline point will not lose out in all dimensions when compared to any other point in the data set, for each skyline point there exists a set of weight assignments to the dimensions such that the point becomes the top user preference. We believe that the usefulness of skyline points is not limited to such applications and can be extended to data analysis and knowledge discovery as well. However, since the skyline of high dimensional datasets (which are common in data analysis applications) can contain too many points, various means must be developed to filter out the less interesting skyline points in high dimensions. In this paper, we propose algorithms to find a set of interesting skyline points called strong skyline points. Extensive experiments show that our proposal is both effective and efficient.

international conference on management of data | 2009

Kernel-based skyline cardinality estimation

Zhenjie Zhang; Yin Yang; Ruichu Cai; Dimitris Papadias; Anthony K. H. Tung

The skyline of a d-dimensional dataset consists of all points not dominated by others. The incorporation of the skyline operator into practical database systems necessitates an efficient and effective cardinality estimation module. However, existing theoretical work on this problem is limited to the case where all d dimensions are independent of each other, which rarely holds for real datasets. The state-of-the-art Log Sampling (LS) technique simply applies theoretical results for independent dimensions to non-independent data, sometimes leading to large estimation errors. To solve this problem, we propose a novel Kernel-Based (KB) approach that approximates the skyline cardinality with nonparametric methods. Extensive experiments with various real datasets demonstrate that KB achieves high accuracy, even in cases where LS fails. At the same time, despite its numerical nature, the efficiency of KB is comparable to that of LS. Furthermore, we extend both LS and KB to the k-dominant skyline, which is commonly used instead of the conventional skyline for high-dimensional data.

IEEE Transactions on Knowledge and Data Engineering | 2008

Continuous k-Means Monitoring over Moving Objects

Zhenjie Zhang; Yin Yang; Anthony K. H. Tung; Dimitris Papadias

Given a data set P, a k-means query returns k points in space (called centers), such that the average squared distance between each point in P and its nearest center is minimized. Since this problem is NP-hard, several approximate algorithms have been proposed and used in practice. In this paper, we study continuous k-means computation at a server that monitors a set of moving objects. Reevaluating k-means every time there is an object update imposes a heavy burden on the server (for computing the centers from scratch) and the clients (for continuously sending location updates). We overcome these problems with a novel approach that significantly reduces the computation and communication costs, while guaranteeing that the quality of the solution, with respect to the reevaluation approach, is bounded by a user-defined tolerance. The proposed method assigns each moving object a threshold (i.e., range) such that the object sends a location update only when it crosses the range boundary. First, we develop an efficient technique for maintaining the k-means. Then, we present mathematical formulas and algorithms for deriving the individual thresholds. Finally, we justify our performance claims with extensive experiments.
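Two pieces of this setting are easy to sketch: the k-means objective as defined above, and the client-side rule that suppresses location updates. The code below does not derive the thresholds, which is the paper's contribution. It takes the threshold as an input and assumes, for illustration only, that the range is a circle around the last reported location. All names are hypothetical.

```python
import math
from typing import Sequence


def kmeans_cost(points: Sequence[Sequence[float]],
                centers: Sequence[Sequence[float]]) -> float:
    """Average squared distance from each point to its nearest center."""
    total = 0.0
    for p in points:
        total += min(
            sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers
        )
    return total / len(points)


def needs_update(last_reported: Sequence[float],
                 current: Sequence[float],
                 threshold: float) -> bool:
    """Client-side check: report only after leaving the assigned range.

    Assumption: the range is a circle of radius `threshold` centered at
    the last location the object reported to the server.
    """
    return math.dist(last_reported, current) > threshold


points = [(0, 0), (0, 2), (10, 0), (10, 2)]
centers = [(0, 1), (10, 1)]
print(kmeans_cost(points, centers))
print(needs_update((0, 0), (3, 4), threshold=4.9))
```

As long as `needs_update` returns False, the object stays silent and the server keeps its current centers, which is where the communication savings come from.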
