Publication


Featured research published by Jiexing Li.


very large data bases | 2012

Robust estimation of resource consumption for SQL queries using statistical techniques

Jiexing Li; Arnd Christian König; Vivek R. Narasayya; Surajit Chaudhuri

The ability to estimate resource consumption of SQL queries is crucial for a number of tasks in a database system such as admission control, query scheduling and costing during query optimization. Recent work has explored the use of statistical techniques for resource estimation in place of the manually constructed cost models used in query optimization. Such techniques, which require as training data examples of resource usage in queries, offer the promise of superior estimation accuracy since they can account for factors such as hardware characteristics of the system or bias in cardinality estimates. However, the proposed approaches lack robustness in that they do not generalize well to queries that are different from the training examples, resulting in significant estimation errors. Our approach aims to address this problem by combining knowledge of database query processing with statistical models. We model resource-usage at the level of individual operators, with different models and features for each operator type, and explicitly model the asymptotic behavior of each operator. This results in significantly better estimation accuracy and the ability to estimate resource usage of arbitrary plans, even when they are very different from the training instances. We validate our approach using various large scale real-life and benchmark workloads on Microsoft SQL Server.


international conference on data engineering | 2012

GSLPI: A Cost-Based Query Progress Indicator

Jiexing Li; Rimma V. Nehme; Jeffrey F. Naughton

Progress indicators for SQL queries were first published in 2004 with the simultaneous and independent proposals from Chaudhuri et al. and Luo et al. In this paper, we implement both progress indicators in the same commercial RDBMS to investigate their performance. We summarize common cases in which they are both accurate and cases in which they fail to provide reliable estimates. Although there are differences in their performance, much more striking is the similarity in the errors they make due to a common simplifying uniform future speed assumption. While the developers of these progress indicators were aware that this assumption could cause errors, they neither explored how large the errors might be nor did they investigate the feasibility of removing the assumption. To rectify this we propose a new query progress indicator, similar to these early progress indicators but without the uniform speed assumption. Experiments show that on the TPC-H benchmark, on queries for which the original progress indicators have errors up to 30X the query running time, the new progress indicator is accurate to within 10 percent. We also discuss the sources of the errors that still remain and shed some light on what would need to be done to eliminate them.
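
The effect of the uniform future speed assumption described above can be illustrated with a small sketch. It is not the GSLPI algorithm itself: the function names, the work units, and the per-pipeline speeds are all hypothetical, and the speeds are simply assumed to be known rather than derived from optimizer costs.

```python
def remaining_uniform(elapsed_s, work_done, work_total):
    """Remaining time if future work proceeds at the average speed seen so far."""
    return elapsed_s * (work_total - work_done) / work_done


def remaining_per_pipeline(remaining_work, speeds):
    """Remaining time when each pending pipeline has its own speed estimate."""
    return sum(work / speed for work, speed in zip(remaining_work, speeds))


# Hypothetical query: pipeline 1 processed 1000 units at 100 units/s and has
# just finished; pipeline 2 still has 1000 units and runs at 10 units/s.
uniform_estimate = remaining_uniform(elapsed_s=10.0, work_done=1000, work_total=2000)
pipeline_estimate = remaining_per_pipeline(remaining_work=[1000], speeds=[10.0])

print(uniform_estimate)   # estimate under the uniform speed assumption
print(pipeline_estimate)  # estimate with a separate speed for pipeline 2
```

In this made-up case the uniform assumption predicts 10 more seconds while the slower second pipeline actually needs 100, which is the kind of gap the abstract attributes to the early progress indicators.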


very large data bases | 2014

Resource bricolage for parallel database systems

Jiexing Li; Jeffrey F. Naughton; Rimma V. Nehme

Running parallel database systems in an environment with heterogeneous resources has become increasingly common, due to cluster evolution and increasing interest in moving applications into public clouds. For database systems running in a heterogeneous cluster, the default uniform data partitioning strategy may overload some of the slow machines while at the same time it may under-utilize the more powerful machines. Since the processing time of a parallel query is determined by the slowest machine, such an allocation strategy may result in a significant query performance degradation. We take a first step to address this problem by introducing a technique we call resource bricolage that improves database performance in heterogeneous environments. Our approach quantifies the performance differences among machines with various resources as they process workloads with diverse resource requirements. We formalize the problem of minimizing workload execution time and view it as an optimization problem, and then we employ linear programming to obtain a recommended data partitioning scheme. We verify the effectiveness of our technique with an extensive experimental study on a commercial database system.
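
A minimal sketch of the partitioning idea, assuming the simplest possible setting: one workload and a single measured cost per machine (seconds needed to process one unit of data). The paper's linear program covers multiple resources and workloads; in this one-dimensional special case the optimum has a closed form, so no solver is needed. All names and numbers below are hypothetical.

```python
def partition_fractions(seconds_per_unit):
    """Return data fractions that make every machine finish at the same time.

    Minimizing max_i(fraction_i * seconds_per_unit[i]) subject to the
    fractions summing to 1 yields fractions proportional to machine speed.
    """
    speeds = [1.0 / cost for cost in seconds_per_unit]
    total_speed = sum(speeds)
    return [speed / total_speed for speed in speeds]


def makespan(fractions, seconds_per_unit):
    """Query time is determined by the slowest machine."""
    return max(f * cost for f, cost in zip(fractions, seconds_per_unit))


# Hypothetical cluster: the third machine is four times slower than the first.
costs = [1.0, 2.0, 4.0]
uniform = [1.0 / len(costs)] * len(costs)
balanced = partition_fractions(costs)

print(makespan(uniform, costs))   # uniform partitioning waits on the slow machine
print(makespan(balanced, costs))  # speed-proportional partitioning
```

With these assumed costs, uniform partitioning finishes in 4/3 time units and the speed-proportional split in 4/7, since every machine then finishes together.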


international conference on data engineering | 2009

Privacy Preserving Publishing on Multiple Quasi-identifiers

Jian Pei; Yufei Tao; Jiexing Li; Xiaokui Xiao

In some applications of privacy preserving data publishing, a practical demand is to publish a data set on multiple quasi-identifiers for multiple users simultaneously, which poses several challenges. Can we generate one anonymized version of the data so that the privacy preservation requirement like k-anonymity is satisfied for all users and the information loss is reduced as much as possible? In this paper, we identify and tackle the novel problem by an elegant solution. The full paper is available at http://www.cs.sfu.ca/~jpei/publications/butterfly-tr.pdf


international conference on data engineering | 2010

Correlation hiding by independence masking

Yufei Tao; Jian Pei; Jiexing Li; Xiaokui Xiao; Ke Yi; Zhengzheng Xing

Extracting useful correlation from a dataset has been extensively studied. In this paper, we deal with the opposite, namely, a problem we call correlation hiding (CH), which is fundamental in numerous applications that need to disseminate data containing sensitive information. In this problem, we are given a relational table T whose attributes can be classified into three disjoint sets A, B, and C. The objective is to distort some values in T so that A becomes independent from B, and yet, their correlation with C is preserved as much as possible. CH is different from all the problems studied previously in the area of data privacy, in that CH demands complete elimination of the correlation between two sets of attributes, whereas the previous research focuses on partial elimination up to a certain level. A new operator called independence masking is proposed to solve the CH problem. Implementations of the operator with good worst case guarantees are described in the full version of this short note.


international conference on management of data | 2016

Operator and Query Progress Estimation in Microsoft SQL Server Live Query Statistics

Kukjin Lee; Arnd Christian König; Vivek R. Narasayya; Bolin Ding; Surajit Chaudhuri; Brent Ellwein; Alexey Eksarevskiy; Manbeen Kohli; Jacob Wyant; Praneeta Prakash; Rimma V. Nehme; Jiexing Li; Jeffrey F. Naughton

We describe the design and implementation of the new Live Query Statistics (LQS) feature in Microsoft SQL Server 2016. The functionality includes the display of overall query progress as well as progress of individual operators in the query execution plan. We describe the overall functionality of LQS, give usage examples and detail all areas where we had to extend the current state-of-the-art to build the complete LQS feature. Finally, we evaluate the effect these extensions have on progress estimation accuracy with a series of experiments using a large set of synthetic and real workloads.


very large data bases | 2017

Resource bricolage and resource selection for parallel database systems

Jiexing Li; Jeffrey F. Naughton; Rimma V. Nehme

Running parallel database systems in an environment with heterogeneous resources has become increasingly common, due to cluster evolution and increasing interest in moving applications into public clouds. Performance differences among machines in the same cluster pose new challenges for parallel database systems. First, for database systems running in a heterogeneous cluster, the default uniform data partitioning strategy may overload some of the slow machines, while at the same time it may underutilize the more powerful machines. Since the processing time of a parallel query is determined by the slowest machine, such an allocation strategy may result in a significant query performance degradation. Second, since machines might have varying resources or performance, different choices of machines may lead to different costs or performance for executing the same workload. By carefully selecting the most suitable machines for running a workload, we may achieve better performance with the same budget, or we may meet the same performance requirements with a lower cost. We address these challenges by introducing techniques we call resource bricolage and resource selection that improve database performance in heterogeneous environments. Our approaches quantify the performance differences among machines with various resources as they process workloads with diverse resource requirements. For the purpose of better resource utilization, we formalize the problem of minimizing workload execution time and view it as an optimization problem, and then, we employ linear programming to obtain a recommended data partitioning scheme. For the purpose of better resource selection, we formalize two problems: One minimizes the total workload execution time with a given budget, and the other minimizes the total budget with a given performance target. We then employ different mixed-integer programs to search for the optimal resource selection decisions. We verify the effectiveness of both resource bricolage and resource selection techniques with an extensive experimental study.


very large data bases | 2018

F1 Query: Declarative Querying at Scale

Bart Samwel; Himani Apte; Felix Weigel; David Wilhite; Jiacheng Yang; Jun Xu; Jiexing Li; Zhan Yuan; Craig Chasseur; Qiang Zeng; Ian Rae; John Cieslewicz; Anurag Biyani; Andrew Harn; Yang Xia; Andrey Gubichev; Amr El-Helw; Orri Erling; Zhepeng Yan; Mohan Yang; Yiqun Wei; Thanh Do; Ben Handy; Colin Zheng; Goetz Graefe; Somayeh Sardashti; Ahmed M. Aly; Divy Agrawal; Ashish Gupta; Shiv Venkataraman

F1 Query is a stand-alone, federated query processing platform that executes SQL queries against data stored in different file-based formats as well as different storage systems at Google (e.g., Bigtable, Spanner, Google Spreadsheets, etc.). F1 Query eliminates the need to maintain the traditional distinction between different types of data processing workloads by simultaneously supporting: (i) OLTP-style point queries that affect only a few records; (ii) low-latency OLAP querying of large amounts of data; and (iii) large ETL pipelines. F1 Query has also significantly reduced the need for developing hard-coded data processing pipelines by enabling declarative queries integrated with custom business logic. F1 Query satisfies key requirements that are highly desirable within Google: (i) it provides a unified view over data that is fragmented and distributed over multiple data sources; (ii) it leverages datacenter resources for performant query processing with high throughput and low latency; (iii) it provides high scalability for large data sizes by increasing computational parallelism; and (iv) it is extensible and uses innovative approaches to integrate complex business logic in declarative query processing. This paper presents the end-to-end design of F1 Query. Evolved out of F1, the distributed database originally built to manage Google's advertising data, F1 Query has been in production for multiple years at Google and serves the querying needs of a large number of users and systems.


international conference on management of data | 2016

Resource Bricolage for Parallel DBMSs on Heterogeneous Clusters

Jiexing Li; Jeffrey F. Naughton; Rimma V. Nehme


international conference on management of data | 2008

Preservation of proximity privacy in publishing numerical sensitive data

Jiexing Li; Yufei Tao; Xiaokui Xiao

Collaboration


Dive into Jiexing Li's collaborations.

Top Co-Authors

Jeffrey F. Naughton

University of Wisconsin-Madison

Yufei Tao

The Chinese University of Hong Kong

Xiaokui Xiao

Nanyang Technological University

Jian Pei

Simon Fraser University

Ian Rae

University of Wisconsin-Madison
