Zhiyuan Chen
University of Maryland, Baltimore County
Publications
Featured research published by Zhiyuan Chen.
International Conference on Data Engineering | 2001
Zhiyuan Chen; H. V. Jagadish; Flip Korn; Nick Koudas; S. Muthukrishnan; Raymond T. Ng; Divesh Srivastava
We describe efficient algorithms for accurately estimating the number of matches of a small node-labeled tree, i.e., a twig, in a large node-labeled tree, using a summary data structure. This problem is of interest for queries on XML and other hierarchical data, both to provide query feedback and for cost-based query optimization. Our summary data structure scalably represents approximate frequency information about twiglets (i.e., small twigs) in the data tree. Given a twig query, the number of matches is estimated by creating a set of query twiglets and combining two complementary approaches: set hashing, used to estimate the number of matches of each query twiglet, and maximal overlap, used to combine the query twiglet estimates into an estimate for the twig query. We propose several estimation algorithms that apply these approaches to query twiglets formed using variations on different twiglet decomposition techniques. We present an extensive experimental evaluation using several real XML data sets with a variety of twig queries. Our results demonstrate that accurate and robust estimates can be achieved, even with limited space.
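The set-hashing step can be illustrated with a small min-hash sketch. The code below is a minimal, hypothetical illustration (the twiglet match sets and hash-function count are invented, and the maximal-overlap combination step is omitted): each twiglet's match set is summarized by a short signature, and the agreement rate between two signatures estimates the overlap of the underlying sets.

```python
import random

def minhash_signature(ids, hash_funcs):
    # One value per hash function: the minimum hash over the set.
    return [min(h(i) for i in ids) for h in hash_funcs]

def estimate_jaccard(sig_a, sig_b):
    # The fraction of positions where two signatures agree estimates
    # the resemblance |A ∩ B| / |A ∪ B| of the underlying sets.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

rnd = random.Random(7)
P = 2_147_483_647  # large prime for the linear hash family
hash_funcs = [(lambda a, b: (lambda x: (a * x + b) % P))(
    rnd.randrange(1, P), rnd.randrange(P)) for _ in range(256)]

# Hypothetical node ids matching two twiglets in the data tree.
matches_a = set(range(0, 1000))
matches_b = set(range(500, 1500))

sig_a = minhash_signature(matches_a, hash_funcs)
sig_b = minhash_signature(matches_b, hash_funcs)
true_j = len(matches_a & matches_b) / len(matches_a | matches_b)  # 1/3
est_j = estimate_jaccard(sig_a, sig_b)
print(round(true_j, 3), round(est_j, 3))
```

More hash functions tighten the estimate at the cost of signature space, which is the space-accuracy tradeoff the summary data structure navigates.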
International Conference on Management of Data | 2001
Zhiyuan Chen; Johannes Gehrke; Flip Korn
Over the last decades, improvements in CPU speed have outpaced improvements in main memory and disk access rates by orders of magnitude, enabling the use of data compression techniques to improve the performance of database systems. Previous work describes the benefits of compression for numerical attributes, where data is stored in compressed format on disk. Despite the abundance of string-valued attributes in relational schemas, there is little work on compression for string attributes in a database context. Moreover, none of the previous work suitably addresses the role of the query optimizer: during query execution, data is either eagerly decompressed when it is read into main memory, or data lazily stays compressed in main memory and is decompressed on demand only. In this paper, we present an effective approach for database compression based on lightweight, attribute-level compression techniques. We propose a Hierarchical Dictionary Encoding strategy that intelligently selects the most effective compression method for string-valued attributes. We show that eager and lazy decompression strategies produce sub-optimal plans for queries involving compressed string attributes. We then formalize the problem of compression-aware query optimization and propose one provably optimal and two fast heuristic algorithms for selecting a query plan for relational schemas with compressed attributes; our algorithms can easily be integrated into existing cost-based query optimizers. Experiments using TPC-H data demonstrate the impact of our string compression methods and show the importance of compression-aware query optimization. Our approach results in up to an order of magnitude speedup over existing approaches.
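The core idea of attribute-level dictionary coding, and why lazy plans can avoid decompression, can be sketched in a few lines. This is a deliberately simplified illustration, not the paper's Hierarchical Dictionary Encoding: a single flat dictionary per attribute, with the city values invented for the example.

```python
def dict_encode(values):
    # Map each distinct string to a small integer code; store only codes.
    codebook, codes = {}, []
    for v in values:
        if v not in codebook:
            codebook[v] = len(codebook)
        codes.append(codebook[v])
    return codebook, codes

def select_eq(codes, codebook, constant):
    # An equality predicate can be evaluated directly on the codes:
    # encode the constant once instead of decompressing every row,
    # which is what a compression-aware ("lazy") plan exploits.
    target = codebook.get(constant)
    return [i for i, c in enumerate(codes) if c == target]

cities = ["Paris", "Oslo", "Paris", "Lima", "Oslo", "Paris"]
codebook, codes = dict_encode(cities)
print(codes)                                # [0, 1, 0, 2, 1, 0]
print(select_eq(codes, codebook, "Paris"))  # [0, 2, 5]
```

Operators that need the original strings (e.g., producing output for the user) still pay a decompression cost, which is why the choice of where to decompress belongs in the optimizer.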
Very Large Data Bases | 2006
Shibnath Mukherjee; Zhiyuan Chen; Aryya Gangopadhyay
Privacy preserving data mining has become increasingly popular because it allows sharing of privacy-sensitive data for analysis purposes. However, existing techniques such as random perturbation do not fare well for simple yet widely used and efficient Euclidean distance-based mining algorithms. Although original data distributions can be fairly accurately reconstructed from the perturbed data, distances between individual data points are not preserved, leading to poor accuracy for distance-based mining methods. Moreover, these techniques generally do not address data reduction. Other studies on secure multi-party computation often concentrate on techniques useful only to very specific mining algorithms and scenarios, so they require modification of the mining algorithms and are often difficult to generalize to other mining algorithms or scenarios. This paper proposes a novel generalized approach that uses the well-known energy compaction power of Fourier-related transforms to hide sensitive data values and to approximately preserve Euclidean distances, in both centralized and distributed scenarios, to a great degree of accuracy. Three algorithms to select the most important transform coefficients are presented: one for a centralized database, one for a horizontally partitioned database, and one for a vertically partitioned database. Experimental results demonstrate the effectiveness of the proposed approach.
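The energy-compaction idea can be sketched with a plain discrete Fourier transform. The records, the coefficient count k, and the selection rule below are illustrative stand-ins for the paper's three selection algorithms; the point is only that publishing a few high-energy coefficients approximately preserves pairwise Euclidean distances (Parseval's theorem makes them exact when all coefficients are kept).

```python
import cmath
import math

def dft(x):
    # Naive O(n^2) discrete Fourier transform, enough for a sketch.
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n)
                for t in range(n)) for f in range(n)]

def euclid(a, b):
    return math.sqrt(sum(abs(p - q) ** 2 for p, q in zip(a, b)))

# Hypothetical sensitive numeric records.
rows = [[1.0, 2.0, 4.0, 3.0, 1.0, 0.0, 2.0, 5.0],
        [1.5, 2.5, 3.5, 3.0, 1.5, 0.5, 2.0, 4.5],
        [9.0, 7.0, 1.0, 0.0, 2.0, 8.0, 6.0, 3.0]]
n = len(rows[0])
spectra = [dft(r) for r in rows]

# Energy compaction: keep the k frequencies carrying the largest total
# energy, the same set for every record, and publish only those.
k = 4
energy = [sum(abs(s[f]) ** 2 for s in spectra) for f in range(n)]
kept = sorted(range(n), key=lambda f: -energy[f])[:k]
published = [[s[f] for f in kept] for s in spectra]

# Full spectra preserve distances exactly up to a factor sqrt(n);
# truncated spectra preserve them approximately (never overestimating,
# since truncation only discards energy of the difference vector).
true_d = euclid(rows[0], rows[1])
approx_d = euclid(published[0], published[1]) / math.sqrt(n)
print(round(true_d, 3), round(approx_d, 3))
```

Distance-based methods such as k-nearest-neighbor or k-means can then run directly on the published coefficients, while the discarded coefficients hide the exact original values.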
International Conference on Management of Data | 2007
Zhiyuan Chen; Tao Li
Database queries are often exploratory, and users often find that their queries return too many answers, many of them irrelevant. Existing work either categorizes or ranks the results to help users locate interesting results. The success of both approaches depends on the utilization of user preferences. However, most existing work assumes that all users have the same preferences, while in real life different users often have different preferences. This paper proposes a two-step solution to address the diversity of user preferences for the categorization approach. The proposed solution does not require explicit user involvement. The first step analyzes the query history of all users in the system offline and generates a set of clusters over the data, each corresponding to one type of user preference. When a user issues a query, the second step presents a navigational tree over the clusters generated in the first step, such that the user can easily select the subset of clusters matching his or her needs. The user can then browse, rank, or categorize the results in the selected clusters. The navigational tree is automatically constructed using a cost-based algorithm that considers the cost of visiting both intermediate nodes and leaf nodes in the tree. An empirical study demonstrates the benefits of our approach.
ACM Computing Surveys | 2011
Pimwadee Chaovalit; Aryya Gangopadhyay; George Karabatis; Zhiyuan Chen
Time series are values of a phenomenon of interest, such as stock prices, household incomes, or patient heart rates, recorded over a period of time. Time series data mining focuses on discovering interesting patterns in such data. This article introduces wavelet-based time series analysis to interested readers. It provides a systematic survey of various analysis techniques that use the discrete wavelet transform (DWT) in time series data mining, and outlines the benefits of this approach as demonstrated by previous studies on diverse application domains, including image classification, multimedia retrieval, and computer network anomaly detection.
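As a concrete, minimal example of the DWT machinery such studies rely on, the sketch below implements one level of the Haar wavelet transform, the simplest wavelet: pairwise scaled averages capture the trend at half the resolution, pairwise differences capture local detail, and the transform is exactly invertible. The sample series is invented.

```python
import math

def haar_dwt(x):
    # One level of the Haar DWT: approximation (trend) and detail.
    assert len(x) % 2 == 0
    avg = [(x[i] + x[i + 1]) / math.sqrt(2) for i in range(0, len(x), 2)]
    det = [(x[i] - x[i + 1]) / math.sqrt(2) for i in range(0, len(x), 2)]
    return avg, det

def haar_idwt(avg, det):
    # Exact inverse: recover each adjacent pair from its average/detail.
    x = []
    for a, d in zip(avg, det):
        x.append((a + d) / math.sqrt(2))
        x.append((a - d) / math.sqrt(2))
    return x

series = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
avg, det = haar_dwt(series)
restored = haar_idwt(avg, det)
print([round(v, 3) for v in avg])  # trend at half the resolution
print([round(v, 3) for v in det])  # local fluctuations
```

Recursing on the trend coefficients yields the full multi-resolution decomposition used in the surveyed techniques; discarding small detail coefficients is what gives compression and denoising.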
Symposium on Principles of Database Systems | 2000
Zhiyuan Chen; Nick Koudas; Flip Korn; S. Muthukrishnan
In a variety of applications ranging from optimizing queries on alphanumeric attributes to providing approximate counts of documents containing several query terms, there is an increasing need to quickly and reliably estimate the number of strings (tuples, documents, etc.) matching a Boolean query. Boolean queries in this context consist of substring predicates composed using Boolean operators. While there has been some work in estimating the selectivity of substring queries, the more general problem of estimating the selectivity of Boolean queries over substring predicates has not been studied. Our approach is to extract selectivity estimates from relationships between the substring predicates of the Boolean query. However, storing the correlation between all possible predicates in order to provide an exact answer to such queries is clearly infeasible, as there is a super-exponential number of possible combinations of these predicates. Instead, our novel idea is to capture correlations in a space-efficient but approximate manner. We employ a Monte Carlo technique called set hashing to succinctly represent the set of strings containing a given substring as a signature vector of hash values. Correlations among substring predicates can then be generated on-the-fly by operating on these signatures. We formalize our approach and propose an algorithm for estimating the selectivity of any Boolean query using the signatures of its substring predicates. We then experimentally demonstrate the superiority of our approach over a straightforward approach based on the independence assumption wherein correlations are not explicitly captured.
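A minimal sketch of the signature idea, with invented predicates and match sets: each substring predicate's set of matching string ids is summarized by a min-hash signature, the signatures yield an estimated resemblance J on the fly, and J combines with the known per-predicate counts into Boolean selectivities.

```python
import random

def make_hashes(m, seed=11):
    # A family of m random linear hash functions over a large prime field.
    rnd = random.Random(seed)
    p = 2_147_483_647
    return [(lambda a, b: (lambda x: (a * x + b) % p))(
        rnd.randrange(1, p), rnd.randrange(p)) for _ in range(m)]

def signature(ids, hashes):
    # Min-hash signature: one minimum per hash function.
    return [min(h(i) for i in ids) for h in hashes]

def resemblance(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

N = 10_000                           # strings in the database
match_p1 = set(range(0, 2_000))      # ids matching, say, LIKE '%data%'
match_p2 = set(range(1_000, 2_500))  # ids matching, say, LIKE '%base%'

hashes = make_hashes(200)
sig1, sig2 = signature(match_p1, hashes), signature(match_p2, hashes)

# Combine per-predicate counts with the estimated resemblance J:
# |A ∪ B| = (|A| + |B|) / (1 + J) and |A ∩ B| = J * |A ∪ B|.
J = resemblance(sig1, sig2)
union = (len(match_p1) + len(match_p2)) / (1 + J)
inter = J * union
print(round(inter / N, 3), round(union / N, 3))  # AND / OR selectivity
```

For comparison, an independence-assumption baseline would estimate the AND selectivity here as 0.2 × 0.15 = 0.03, well below the true 0.1; the correlation between the two match sets is exactly what the signatures recover.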
International Journal of Information Security and Privacy | 2008
Tamas S. Gal; Zhiyuan Chen; Aryya Gangopadhyay
The identity of patients must be protected when patient data are shared. The two most commonly used models for protecting patient identity are L-diversity and K-anonymity. However, existing work mainly considers data sets with a single sensitive attribute, while patient data often contain multiple sensitive attributes (e.g., diagnosis and treatment). This article shows that although the K-anonymity model can be trivially extended to multiple sensitive attributes, the L-diversity model cannot. The reason is that achieving L-diversity for each individual sensitive attribute does not guarantee L-diversity over all sensitive attributes. We propose a new model that extends L-diversity and K-anonymity to multiple sensitive attributes, and we propose a practical method to implement this model. Experimental results demonstrate the effectiveness of our approach.
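The failure mode is easy to demonstrate. In the hypothetical equivalence class below, diagnosis and treatment are each 2-diverse on their own, yet an adversary who learns a patient's diagnosis pins down the treatment exactly, so the group is not 2-diverse over both attributes together (distinct-value l-diversity is used for simplicity).

```python
from collections import defaultdict

def distinct_l(rows, attr):
    # Distinct-value l-diversity of a single sensitive attribute.
    return len({r[attr] for r in rows})

def conditional_l(rows, known, target):
    # Worst-case diversity of `target` once an adversary learns `known`.
    groups = defaultdict(set)
    for r in rows:
        groups[r[known]].add(r[target])
    return min(len(v) for v in groups.values())

# Hypothetical equivalence class (one k-anonymous group of patients).
group = [
    {"diagnosis": "flu",  "treatment": "rest"},
    {"diagnosis": "flu",  "treatment": "rest"},
    {"diagnosis": "cold", "treatment": "syrup"},
    {"diagnosis": "cold", "treatment": "syrup"},
]

print(distinct_l(group, "diagnosis"))  # 2: looks 2-diverse on its own
print(distinct_l(group, "treatment"))  # 2: looks 2-diverse on its own
# But knowing a patient's diagnosis determines the treatment exactly:
print(conditional_l(group, "diagnosis", "treatment"))  # 1
```

A multi-attribute model therefore has to bound this conditional diversity, not just the per-attribute (marginal) diversity.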
Data & Knowledge Engineering | 2007
Zhiyuan Chen; Johannes Gehrke; Flip Korn; Nick Koudas; Jayavel Shanmugasundaram; Divesh Srivastava
Various index structures have been proposed to speed up the evaluation of XML path expressions. However, existing XML path indices suffer from at least one of three limitations: they focus only on indexing the structure (relying on a separate index for node content), they are useful only for simple path expressions such as root-to-leaf paths, or they cannot be tightly integrated with a relational query processor. Moreover, there is no unified framework to compare these index structures. In this paper, we present a framework defining a family of index structures that includes most existing XML path indices. We also propose two novel index structures in this family, with different space-time tradeoffs, that are effective for the evaluation of XML branching path expressions (i.e., twigs) with value conditions. We also show how this family of index structures can be implemented using the access methods of the underlying relational database system. Finally, we present an experimental evaluation that shows the performance tradeoff between index space and matching time. The experimental results show that our novel indices achieve orders of magnitude improvement in performance for evaluating twig queries, albeit at a higher space cost, over the use of previously proposed XML path indices that can be tightly integrated with a relational query processor.
Government Information Quarterly | 2007
Zhiyuan Chen; Aryya Gangopadhyay; Stephen H. Holden; George Karabatis; Michael P. McGuire
Normative models of e-government typically assert that horizontal (i.e., inter-agency) and vertical (i.e., inter-governmental) integration of data flows and business processes represent the most sophisticated form of e-government, delivering the greatest payoff for both governments and users. This paper concentrates on the integration of data supporting water quality management as an example of how such integration can enable higher levels of e-government. It describes a prototype system that allows users to integrate water monitoring data across many federal, state, and local government organizations and provides novel techniques for information discovery, thus improving information quality and availability for decision making. Specifically, this paper outlines techniques to integrate numerous water quality monitoring data sources, to resolve data disparities, and to retrieve data using semantic relationships among data sources taking advantage of customized user profiles. Preliminary user feedback indicates that these techniques enhance quantity and quality of information available for water quality management.
IEEE Transactions on Knowledge and Data Engineering | 2014
Liang Tang; Tao Li; Yexi Jiang; Zhiyuan Chen
Modern scientific databases and web databases maintain large and heterogeneous data. These real-world databases contain hundreds or even thousands of relations and attributes. Traditional predefined query forms are not able to satisfy the various ad hoc queries users pose on those databases. This paper proposes DQF, a novel database query form interface that is able to dynamically generate query forms. The essence of DQF is to capture a user's preferences and rank query form components, assisting him or her in making decisions. The generation of a query form is an iterative process guided by the user. At each iteration, the system automatically generates ranking lists of form components, and the user then adds the desired form components into the query form. The ranking of form components is based on the captured user preferences. A user can also fill in the query form and submit queries to view the query results at each iteration. In this way, a query form can be dynamically refined until the user is satisfied with the query results. We use the expected F-measure to quantify the goodness of a query form, and develop a probabilistic model for estimating this goodness in DQF. Our experimental evaluation and user study demonstrate the effectiveness and efficiency of the system.
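The goodness measure can be sketched as follows. The tuple ids, preference probabilities, and candidate forms below are invented, and the paper's probabilistic model for estimating the probabilities is not reproduced; the sketch only shows how an expected F-measure replaces exact relevant counts with probability mass.

```python
def expected_f_measure(returned, want_prob, beta=1.0):
    # `returned`: tuple ids a candidate query form would output.
    # `want_prob`: estimated probability that the user wants each tuple.
    # Expected precision/recall use probability mass instead of counts.
    exp_rel_returned = sum(want_prob.get(t, 0.0) for t in returned)
    exp_rel_total = sum(want_prob.values())
    if not returned or exp_rel_total == 0:
        return 0.0
    precision = exp_rel_returned / len(returned)
    recall = exp_rel_returned / exp_rel_total
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical preference estimates mined from query history.
want_prob = {1: 0.9, 2: 0.8, 3: 0.1, 4: 0.05}
narrow_form = [1, 2]       # returns only the likely-wanted tuples
broad_form = [1, 2, 3, 4]  # returns everything
print(round(expected_f_measure(narrow_form, want_prob), 3))
print(round(expected_f_measure(broad_form, want_prob), 3))
```

The narrow form scores higher than the broad one; ranking candidate form components by how much they change this score is the kind of decision an iterative form generator can automate.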