
Publication


Featured research published by Volker Markl.


International Conference on Management of Data | 2004

CORDS: automatic discovery of correlations and soft functional dependencies

Ihab F. Ilyas; Volker Markl; Peter J. Haas; Paul Brown; Ashraf Aboulnaga

The rich dependency structure found in the columns of real-world relational databases can be exploited to great advantage, but can also cause query optimizers---which usually assume that columns are statistically independent---to underestimate the selectivities of conjunctive predicates by orders of magnitude. We introduce CORDS, an efficient and scalable tool for automatic discovery of correlations and soft functional dependencies between columns. CORDS searches for column pairs that might have interesting and useful dependency relations by systematically enumerating candidate pairs and simultaneously pruning unpromising candidates using a flexible set of heuristics. A robust chi-squared analysis is applied to a sample of column values in order to identify correlations, and the number of distinct values in the sampled columns is analyzed to detect soft functional dependencies. CORDS can be used as a data mining tool, producing dependency graphs that are of intrinsic interest. We focus primarily on the use of CORDS in query optimization. Specifically, CORDS recommends groups of columns on which to maintain certain simple joint statistics. These column-group statistics are then used by the optimizer to avoid naive selectivity estimates based on inappropriate independence assumptions. This approach, because of its simplicity and judicious use of sampling, is relatively easy to implement in existing commercial systems, has very low overhead, and scales well to the large numbers of columns and large table sizes found in real-world databases. Experiments with a prototype implementation show that the use of CORDS in query optimization can speed up query execution times by an order of magnitude. CORDS can be used in tandem with query feedback systems such as the LEO learning optimizer, leveraging the infrastructure of such systems to correct bad selectivity estimates and ameliorating the poor performance of feedback systems during slow learning phases.
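
The two per-pair tests the abstract mentions are easy to picture in a few lines. The sketch below is only an illustration, not the CORDS implementation: a chi-squared statistic over a column-pair sample and a distinct-value ratio that hints at a soft functional dependency. The sample data and function names are invented, and CORDS' candidate enumeration and pruning heuristics are omitted.

```python
# Minimal sketch (not the CORDS implementation) of the two per-pair tests:
# a chi-squared correlation statistic and a distinct-value ratio that hints
# at a soft functional dependency.
from collections import Counter

def chi_squared(pairs):
    """Pearson chi-squared statistic over a sample of (c1, c2) value pairs."""
    n = len(pairs)
    c1_counts = Counter(a for a, _ in pairs)
    c2_counts = Counter(b for _, b in pairs)
    cell_counts = Counter(pairs)
    stat = 0.0
    for a, n_a in c1_counts.items():
        for b, n_b in c2_counts.items():
            expected = n_a * n_b / n              # expected count under independence
            observed = cell_counts.get((a, b), 0)
            stat += (observed - expected) ** 2 / expected
    return stat

def soft_fd_strength(pairs):
    """|distinct(c1)| / |distinct(c1, c2)|: values near 1.0 suggest c1 -> c2 holds softly."""
    return len({a for a, _ in pairs}) / len(set(pairs))

sample = [("Berlin", "DE"), ("Munich", "DE"), ("Paris", "FR"), ("Lyon", "FR")] * 25
print(chi_squared(sample), soft_fd_strength(sample))
```

On this toy sample the distinct-value ratio is 1.0, flagging the soft dependency city → country.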


International Conference on Management of Data | 2004

Robust query processing through progressive optimization

Volker Markl; Vijayshankar Raman; David E. Simmen; Guy M. Lohman; Hamid Pirahesh; Miso Cilimdzic

Virtually every commercial query optimizer chooses the best plan for a query using a cost model that relies heavily on accurate cardinality estimation. Cardinality estimation errors can occur due to the use of inaccurate statistics, invalid assumptions about attribute independence, parameter markers, and so on. Cardinality estimation errors may cause the optimizer to choose a sub-optimal plan. We present an approach to query processing that is extremely robust because it is able to detect and recover from cardinality estimation errors. We call this approach progressive query optimization (POP). POP validates cardinality estimates against actual values as measured during query execution. If there is significant disagreement between estimated and actual values, execution might be stopped and re-optimization might occur. Oscillation between optimization and execution steps can occur any number of times. A re-optimization step can exploit both the actual cardinality and partial results computed during a previous execution step. Checkpoint operators (CHECK) validate the optimizer's cardinality estimates against actual cardinalities. Each CHECK has a condition that indicates the cardinality bounds within which a plan is valid. We compute this validity range through a novel sensitivity analysis of query plan operators. If the CHECK condition is violated, CHECK triggers re-optimization. POP has been prototyped in a leading commercial DBMS. An experimental evaluation of POP using TPC-H queries illustrates the robustness POP adds to query processing, while incurring only negligible overhead. A case study applying POP to a real-world database and workload shows the potential of POP, accelerating complex OLAP queries by almost two orders of magnitude.
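
To make the CHECK-then-re-optimize loop concrete, here is a toy sketch of the control flow the abstract describes; the plan names, validity ranges, and cardinalities are invented and stand in for what a real optimizer and executor would provide.

```python
# Toy sketch of progressive optimization (POP): a CHECK compares the plan's validity
# range against the cardinality actually observed at run time and, on violation, the
# observed value is fed back into re-optimization. All values here are illustrative.

def optimize(card_estimate):
    # Pretend optimizer: a nested-loop join is only chosen for small inputs.
    plan = "nested-loop-join" if card_estimate < 1_000 else "hash-join"
    validity_range = (0, 1_000) if plan == "nested-loop-join" else (1_000, float("inf"))
    return plan, validity_range

def execute(plan, actual_cardinality, validity_range):
    low, high = validity_range
    if not (low <= actual_cardinality <= high):    # the CHECK condition fires
        return None                                # stop; caller re-optimizes
    return f"result computed with {plan}"

estimate, actual = 200, 50_000                     # badly underestimated cardinality
plan, valid = optimize(estimate)
result = execute(plan, actual, valid)
if result is None:
    plan, valid = optimize(actual)                 # re-optimize with the observed value
    result = execute(plan, actual, valid)
print(result)                                      # -> result computed with hash-join
```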


International Conference on Management of Data | 2008

Damia: data mashups for intranet applications

David E. Simmen; Mehmet Altinel; Volker Markl; Sriram Padmanabhan; Ashutosh Singh

Increasingly large numbers of situational applications are being created by enterprise business users as a by-product of solving day-to-day problems. In efforts to address the demand for such applications, corporate IT is moving toward Web 2.0 architectures. In particular, the corporate intranet is evolving into a platform of readily accessible data and services where communities of business users can assemble and deploy situational applications. Damia is a web-style data integration platform being developed to address the data problem presented by such applications, which often access and combine data from a variety of sources. Damia allows business users to quickly and easily create data mashups that combine data from desktop, web, and traditional IT sources into feeds that can be consumed by AJAX and other types of web applications. This paper describes the key features and design of Damia's data integration engine, which has been packaged with Mashup Hub, an enterprise feed server currently available for download on IBM alphaWorks. Mashup Hub exposes Damia's data integration capabilities in the form of a service that allows users to create hosted data mashups.


IBM Systems Journal | 2003

LEO: An autonomic query optimizer for DB2

Volker Markl; Guy M. Lohman; Vijayshankar Raman

Structured Query Language (SQL) has emerged as an industry standard for querying relational database management systems, largely because a user need only specify what data are wanted, not the details of how to access those data. A query optimizer uses a mathematical model of query execution to determine automatically the best way to access and process any given SQL query. This model is heavily dependent upon the optimizer's estimates for the number of rows that will result at each step of the query execution plan (QEP), especially for complex queries involving many predicates and/or operations. These estimates rely upon statistics on the database and modeling assumptions that may or may not be true for a given database. In this paper, we discuss an autonomic query optimizer that automatically self-validates its model without requiring any user interaction to repair incorrect statistics or cardinality estimates. By monitoring queries as they execute, the autonomic optimizer compares the optimizer's estimates with actual cardinalities at each step in a QEP, and computes adjustments to its estimates that may be used during future optimizations of similar queries. Moreover, the detection of estimation errors can also trigger reoptimization of a query in mid-execution. The autonomic refinement of the optimizer's model can result in a reduction of query execution time by orders of magnitude at negligible additional run-time cost. We discuss various research issues and practical considerations that were addressed during our implementation of a first prototype of LEO, a LEarning Optimizer for DB2® (Database 2™) that learns table access cardinalities and, for future queries, corrects the estimation error for simple predicates by adjusting the database statistics of DB2.
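
The feedback loop can be pictured as a table of correction factors keyed by predicate, consulted the next time a similar query is optimized. The interface below is an assumption of mine for illustration, not LEO's actual mechanism for adjusting DB2's statistics.

```python
# Minimal sketch of learning-optimizer feedback (assumed interface, not LEO's):
# record the ratio of actual to estimated cardinality per predicate and apply it
# as a multiplicative correction the next time a similar predicate is estimated.
adjustments = {}   # predicate signature -> multiplicative correction factor

def record_feedback(predicate, estimated, actual):
    adjustments[predicate] = actual / max(estimated, 1)

def adjusted_estimate(predicate, estimated):
    return estimated * adjustments.get(predicate, 1.0)

record_feedback("make = 'Honda'", estimated=10_000, actual=150_000)
print(adjusted_estimate("make = 'Honda'", 10_000))   # -> 150000.0 for future optimizations
```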


International Conference on Data Engineering | 2006

ISOMER: Consistent Histogram Construction Using Query Feedback

Utkarsh Srivastava; Peter J. Haas; Volker Markl; Marcel Kutsch; Tam Minh Tran

Database columns are often correlated, so that cardinality estimates computed by assuming independence often lead to a poor choice of query plan by the optimizer. Multidimensional histograms can help solve this problem, but the traditional approach of building such histograms using a data scan often scales poorly and does not always yield the best histogram for a given workload. An attractive alternative is to gather feedback from the query execution engine about the observed cardinality of predicates and use this feedback as the basis for a histogram. In this paper we describe ISOMER, a new feedback-based algorithm for collecting optimizer statistics by constructing and maintaining multidimensional histograms. ISOMER uses the maximum-entropy principle to approximate the true data distribution by a histogram distribution that is as simple as possible while being consistent with the observed predicate cardinalities. ISOMER adapts readily to changes in the underlying data, automatically detecting and eliminating inconsistent feedback information in an efficient manner. The algorithm controls the size of the histogram by retaining only the most important feedback. Our experiments indicate that, unlike previous methods for feedback-driven histogram maintenance, ISOMER imposes little overhead, is extremely scalable, and yields highly accurate cardinality estimates while using only a modest amount of storage.
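
A maximum-entropy fit to consistent feedback can be approximated with iterative proportional fitting, which is the spirit of the sketch below; the bucket layout and feedback values are invented, and ISOMER's handling of inconsistent feedback and histogram pruning are not modeled.

```python
# Iterative-proportional-fitting sketch of a maximum-entropy histogram fit to
# query feedback; bucket layout and observed cardinalities are invented.
buckets = {b: 1.0 for b in range(4)}                 # start from a uniform histogram
feedback = [({0, 1}, 120.0), ({1, 2, 3}, 60.0)]      # (buckets a predicate covers, rows observed)
total_rows = 160.0

for _ in range(200):                                  # repeatedly enforce each constraint
    for covered, observed in feedback + [(set(buckets), total_rows)]:
        current = sum(buckets[b] for b in covered)
        scale = observed / current
        for b in covered:
            buckets[b] *= scale

print({b: round(v, 1) for b, v in buckets.items()})   # -> roughly {0: 100, 1: 20, 2: 20, 3: 20}
```

Starting from a uniform histogram, the constrained buckets converge to the feedback values while the unconstrained remainder stays as evenly spread as possible, which is the maximum-entropy behavior the abstract describes.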


International Conference on Data Engineering | 2007

Adaptively Reordering Joins during Query Execution

Quanzhong Li; Minglong Shao; Volker Markl; Kevin S. Beyer; Latha S. Colby; Guy M. Lohman

Traditional query processing techniques based on static query optimization are ineffective in applications where statistics about the data are unavailable at the start of query execution or where the data characteristics are skewed and change dynamically. Several adaptive query processing techniques have been proposed in recent years to overcome the limitations of static query optimizers through either explicit re-optimization of plans during execution or by using a row-routing based approach. In this paper, we present a novel method for processing pipelined join plans that dynamically arranges the join order of both inner and outer-most tables at run-time. We extend the Eddies concept of moments of symmetry to reorder indexed nested-loop joins, the join method used by all commercial DBMSs for building pipelined query plans for applications for which low latencies are crucial. Unlike row-routing techniques, our approach achieves adaptability by changing the pipeline itself which avoids the bookkeeping and routing decision associated with each row. Operator selectivities monitored during query execution are used to change the execution plan at strategic points, and the change of execution plans utilizes a novel and efficient technique for avoiding duplicates in the query results. Our prototype implementation in a commercial DBMS shows a query execution speedup of up to 8 times.
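
A rough intuition for run-time reordering, with none of the paper's machinery for moments of symmetry or duplicate avoidance: track the selectivity each join exhibits on the rows seen so far and, at a safe point, move the most selective joins earlier in the pipeline. The class names and row counts below are illustrative.

```python
# Toy sketch of selectivity-driven join reordering; it omits the paper's treatment
# of moments of symmetry and of avoiding duplicate results after a plan change.
from dataclasses import dataclass

@dataclass
class Join:
    name: str
    rows_in: int
    rows_out: int
    def observed_selectivity(self):
        return self.rows_out / self.rows_in if self.rows_in else 1.0

def reorder(pipeline):
    # Run the most selective joins first so intermediate results shrink earliest.
    return sorted(pipeline, key=lambda j: j.observed_selectivity())

pipeline = [Join("orders", 1000, 900), Join("lineitem", 900, 90), Join("region", 90, 30)]
print([j.name for j in reorder(pipeline)])   # -> ['lineitem', 'region', 'orders']
```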


Very Large Data Bases | 2007

Consistent selectivity estimation via maximum entropy

Volker Markl; Peter J. Haas; Marcel Kutsch; Nimrod Megiddo; Utkarsh Srivastava; Tam Minh Tran

Cost-based query optimizers need to estimate the selectivity of conjunctive predicates when comparing alternative query execution plans. To this end, advanced optimizers use multivariate statistics to improve information about the joint distribution of attribute values in a table. The joint distribution for all columns is almost always too large to store completely, and the resulting use of partial distribution information raises the possibility that multiple, non-equivalent selectivity estimates may be available for a given predicate. Current optimizers use cumbersome ad hoc methods to ensure that selectivities are estimated in a consistent manner. These methods ignore valuable information and tend to bias the optimizer toward query plans for which the least information is available, often yielding poor results. In this paper we present a novel method for consistent selectivity estimation based on the principle of maximum entropy (ME). Our method exploits all available information and avoids the bias problem. In the absence of detailed knowledge, the ME approach reduces to standard uniformity and independence assumptions. Experiments with our prototype implementation in DB2 UDB show that use of the ME approach can improve the optimizer’s cardinality estimates by orders of magnitude, resulting in better plan quality and significantly reduced query execution times. For almost all queries, these improvements are obtained while adding only tens of milliseconds to the overall time required for query optimization.
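
Stated as an optimization problem, the maximum-entropy approach amounts to the following program. The notation is mine, chosen to match the abstract rather than the paper's exact formalism: choose selectivities for the atoms of the predicate space that maximize entropy while reproducing every selectivity known from the available statistics.

```latex
% Maximum-entropy selectivity estimation as an optimization problem (notation mine):
% x_b is the selectivity assigned to atom b of the n single predicates, and s_T is a
% selectivity known from the available statistics for the conjunction of predicates in T.
\begin{aligned}
\max_{x \ge 0} \;& -\sum_{b \in \{0,1\}^n} x_b \log x_b \\
\text{s.t.} \;& \sum_{b \,:\, b_i = 1 \ \forall i \in T} x_b = s_T
   \quad \text{for every } T \text{ with known selectivity}, \\
 & \sum_{b} x_b = 1 .
\end{aligned}
```

With no constraints beyond normalization, the solution is the uniform distribution, which is how the approach reduces to the standard uniformity and independence assumptions mentioned above.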


Very Large Data Bases | 2004

Automated statistics collection in DB2 UDB

Ashraf Aboulnaga; Peter J. Haas; Mokhtar Kandil; Sam Lightstone; Guy M. Lohman; Volker Markl; Ivan Popivanov; Vijayshankar Raman

The use of inaccurate or outdated database statistics by the query optimizer in a relational DBMS often results in a poor choice of query execution plans and hence unacceptably long query processing times. Configuration and maintenance of these statistics has traditionally been a time-consuming manual operation, requiring that the database administrator (DBA) continually monitor query performance and data changes in order to determine when to refresh the statistics values and when and how to adjust the set of statistics that the DBMS maintains. In this paper we describe the new Automated Statistics Collection (ASC) component of IBM® DB2® Universal Database™ (DB2 UDB). This autonomic technology frees the DBA from the tedious task of manually supervising the collection and maintenance of database statistics. ASC monitors both the update-delete-insert (UDI) activities on the data as well as query feedback (QF), i.e., the results of the queries that are executed on the data. ASC uses these two sources of information to automatically decide which statistics to collect and when to collect them. This combination of UDI-driven and QF-driven autonomic processes ensures that the system can handle unforeseen queries while also ensuring good performance for frequent and important queries. We present the basic concepts, architecture, and key implementation details of ASC in DB2 UDB, and present a case study showing how the use of ASC can speed up a query workload by orders of magnitude without requiring any DBA intervention.
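
The decision logic can be caricatured as two triggers, one on UDI volume and one on observed estimation error; the thresholds, table layout, and function below are invented for illustration and are not ASC's actual policy.

```python
# Hand-wavy sketch of the two triggers the abstract describes: heavy UDI activity
# since the last statistics collection, or large estimation errors reported via
# query feedback. Thresholds and the table record are made up.
UDI_THRESHOLD = 0.2        # refresh when more than 20% of rows have changed
ERROR_THRESHOLD = 10.0     # or when some estimate is off by more than 10x

def needs_statistics_refresh(table):
    udi_ratio = table["udi_count"] / max(table["rows"], 1)
    worst_error = max(
        (max(est, act) / max(min(est, act), 1) for est, act in table["feedback"]),
        default=1.0,
    )
    return udi_ratio > UDI_THRESHOLD or worst_error > ERROR_THRESHOLD

orders = {"rows": 1_000_000, "udi_count": 50_000, "feedback": [(1_000, 120_000)]}
print(needs_statistics_refresh(orders))   # -> True (a 120x estimation error)
```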


Very Large Data Bases | 2002

Processing star queries on hierarchically-clustered fact tables

Nikos Karayannidis; Aris Tsois; Timos K. Sellis; Roland Pieringer; Volker Markl; Frank Ramsak; Robert Fenk; Klaus Elhardt; Rudolf Bayer

Star queries are the most prevalent kind of queries in data warehousing, OLAP and business intelligence applications. Thus, there is an imperative need for efficiently processing star queries. To this end, a new class of fact table organizations has emerged that exploits path-based surrogate keys in order to hierarchically cluster the fact table data of a star schema [DRSN98, MRB99, KS01]. In the context of these new organizations, star query processing changes radically. In this paper, we present a complete abstract processing plan that captures all the necessary steps in evaluating such queries over hierarchically clustered fact tables. Furthermore, we present optimizations for surrogate key processing and a novel early grouping transformation for grouping on the dimension hierarchies. Our algorithms have already been implemented in a commercial relational database management system (RDBMS), and the experimental evaluation, as well as customer feedback, indicates speedups of orders of magnitude for typical star queries in real-world applications.
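
One way to picture path-based surrogate keys: encode each dimension row's full hierarchy path into a single key so that facts sharing a path prefix are clustered together, and a restriction on any hierarchy level becomes a contiguous key range. The date hierarchy and bit widths below are made up for the example and are not the encoding used in the paper.

```python
# Illustrative sketch of path-based surrogate keys for hierarchical clustering:
# pack the hierarchy path (year, month, day) into one integer so that prefix order
# equals hierarchy order, and a star-query restriction becomes a key-range scan.
def path_surrogate_key(year, month, day):
    return (year << 9) | (month << 5) | day     # 4 bits for month, 5 bits for day

facts = sorted(
    (path_surrogate_key(2002, m, d), f"sale-{m}-{d}") for m in (1, 2) for d in (1, 15)
)
# All of February 2002 is one contiguous key range of the clustered fact table:
low, high = path_surrogate_key(2002, 2, 0), path_surrogate_key(2002, 3, 0)
print([row for key, row in facts if low <= key < high])   # -> ['sale-2-1', 'sale-2-15']
```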


Very Large Data Bases | 2008

Parallelizing query optimization

Wook-Shin Han; Wooseong Kwak; Jinsoo Lee; Guy M. Lohman; Volker Markl

Many commercial RDBMSs employ cost-based query optimization exploiting dynamic programming (DP) to efficiently generate the optimal query execution plan. However, optimization time increases rapidly for queries joining more than 10 tables. Randomized or heuristic search algorithms reduce query optimization time for large join queries by considering fewer plans, sacrificing plan optimality. Though commercial systems executing query plans in parallel have existed for over a decade, the optimization of such plans still occurs serially. While modern microprocessors employ multiple cores to accelerate computations, parallelizing query optimization to exploit multi-core parallelism is not as straightforward as it may seem. The DP used in join enumeration belongs to the challenging nonserial polyadic DP class because of its non-uniform data dependencies. In this paper, we propose a comprehensive and practical solution for parallelizing query optimization in the multi-core processor architecture, including a parallel join enumeration algorithm and several alternative ways to allocate work to threads to balance their load. We also introduce a novel data structure called a skip vector array to significantly reduce the generation of join partitions that are infeasible. This solution has been prototyped in PostgreSQL. Extensive experiments using various query graph topologies confirm that our algorithms allocate the work evenly, thereby achieving almost linear speed-up. Our parallel join enumeration algorithm enhanced with our skip vector array outperforms the conventional generate-and-filter DP algorithm by up to two orders of magnitude for star queries: linear speedup due to parallelism and an order-of-magnitude performance improvement due to the skip vector array.
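
The parallelization opportunity in DP join enumeration is that all plans for table subsets of size k depend only on plans for smaller subsets, so each size level can be fanned out across threads. The sketch below shows only that skeleton with a unit-cost model; the paper's work-allocation strategies and skip vector array pruning are not represented, and the table names are placeholders.

```python
# Simplified sketch of parallelized DP join enumeration: within one subset size,
# every subset's best plan can be computed independently from smaller subsets,
# so that level of the DP table is filled in parallel. Unit join costs only.
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

tables = ["A", "B", "C", "D"]
best = {frozenset([t]): (0.0, t) for t in tables}      # subset -> (cost, plan text)

def plan_for(subset):
    candidates = []
    for k in range(1, len(subset)):
        for left in combinations(subset, k):
            left, right = frozenset(left), subset - frozenset(left)
            (lc, lp), (rc, rp) = best[left], best[right]
            candidates.append((lc + rc + 1.0, f"({lp} JOIN {rp})"))   # unit join cost
    return subset, min(candidates)

with ThreadPoolExecutor() as pool:
    for size in range(2, len(tables) + 1):              # level-by-level, smallest first
        subsets = [frozenset(c) for c in combinations(tables, size)]
        best.update(dict(pool.map(plan_for, subsets)))

print(best[frozenset(tables)][1])
```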
