IEEE Transactions on Knowledge and Data Engineering | 2019

Bounded Approximate Query Processing

Abstract


OLAP is a core functionality of database systems, and its performance is crucial for timely decision making. However, OLAP queries are time consuming, especially on large datasets, and traditional exact solutions usually cannot meet high-performance requirements. Recently, approximate query processing (AQP) has been proposed to enable approximate OLAP. However, existing AQP methods have several limitations. First, they may incur unacceptable errors on skewed data (e.g., long-tail distributions). Second, they need to store large amounts of data and yield no significant performance improvement. Third, they support only a small subset of SQL aggregation queries. To overcome these limitations, we propose a bounded approximate query processing framework, BAQ. Given a predefined error bound and a set of queries, BAQ judiciously selects high-quality samples from the data to generate a unified synopsis offline, and then uses the synopsis to answer online queries. Compared with existing methods, BAQ has the following salient features. (1) BAQ does not need to generate a synopsis for each query; it generates only a single unified synopsis, and thus its synopsis is much smaller. (2) BAQ achieves much smaller errors than existing studies. Specifically, BAQ can provide deterministic approximate results (i.e., the estimated query results are guaranteed to be within the error bound with 100 percent confidence) for SQL aggregation queries that do not contain selection conditions on numerical columns. For queries with selection conditions on numerical columns, we propose effective grouping-based techniques, and the estimated results are also within the error bound in practice. Experimental results on both real and synthetic datasets show that BAQ significantly outperforms state-of-the-art approaches. For example, on a Microsoft production dataset (a real dataset with synthetic queries), BAQ achieves a 10-100× improvement in synopsis size and a 10-100× improvement in error compared with state-of-the-art algorithms.
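
To make the idea of a deterministic error bound concrete, the following Python sketch shows one simple way a synopsis can guarantee a relative error bound for SUM queries that filter only on a categorical column. It is an illustration under stated assumptions, not BAQ's algorithm, which the abstract does not spell out. The function names `build_synopsis` and `approx_sum`, the parameter `eps`, and the geometric bucketing scheme are all hypothetical, and the sketch assumes strictly positive numerical values.

```python
import math
from collections import Counter


def build_synopsis(rows, eps):
    """Offline step (illustrative only, not BAQ's method).

    rows: iterable of (category, value) pairs with value > 0.
    Values are placed in geometric buckets of ratio (1 + eps), so every
    value is at most a factor (1 + eps) above its bucket's lower edge.
    Returns a Counter mapping (category, bucket_index) -> row count.
    """
    base = 1.0 + eps
    synopsis = Counter()
    for category, value in rows:
        bucket = math.floor(math.log(value) / math.log(base))
        synopsis[(category, bucket)] += 1
    return synopsis


def approx_sum(synopsis, eps, category=None):
    """Online step: estimate SUM(value), optionally for one category.

    Each row is replaced by the lower edge of its bucket, so each row
    (and hence the total) is underestimated by a relative error below eps.
    """
    base = 1.0 + eps
    total = 0.0
    for (cat, bucket), count in synopsis.items():
        if category is None or cat == category:
            total += count * base ** bucket
    return total


if __name__ == "__main__":
    # Hypothetical skewed data: one very large value among small ones.
    data = [("a", 1.0), ("a", 250.0), ("b", 3.7), ("b", 100000.0), ("a", 42.0)]
    eps = 0.1
    syn = build_synopsis(data, eps)
    exact = sum(v for _, v in data)
    estimate = approx_sum(syn, eps)
    print(abs(estimate - exact) <= eps * exact)
```

In this sketch the synopsis size depends on the number of distinct (category, bucket) pairs rather than on the number of rows, and the bound holds on skewed data because the bucket width scales with the value. A selection condition on the numerical column itself could cut through a bucket, which is one way to see why such queries are harder to bound deterministically.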

Volume 31
Pages 2262-2276
DOI 10.1109/TKDE.2018.2877362
Language English
Journal IEEE Transactions on Knowledge and Data Engineering
