Marcin Zukowski
Centrum Wiskunde & Informatica
Publications
Featured research published by Marcin Zukowski.
International Conference on Data Engineering | 2006
Marcin Zukowski; Sándor Héman; Niels Nes; Peter A. Boncz
High-performance data-intensive query processing tasks like OLAP, data mining or scientific data analysis can be severely I/O bound, even when high-end RAID storage systems are used. Compression can alleviate this bottleneck only if encoding and decoding speeds significantly exceed RAID I/O bandwidth. For this purpose, we propose three new versatile compression schemes (PDICT, PFOR, and PFOR-DELTA) that are specifically designed to extract maximum IPC from modern CPUs. We compare these algorithms with compression techniques used in (commercial) database and information retrieval systems. Our experiments on the MonetDB/X100 database system, using both DSM and PAX disk storage, show that these techniques strongly accelerate TPC-H performance to the point that the I/O bottleneck is eliminated.
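The patching idea behind these schemes can be sketched briefly. The following is a minimal illustration of PFOR-style encoding, not the paper's implementation (offsets are kept as plain integers here, where a real codec would bit-pack them): values that fit in a small offset from a frame-of-reference base decode in a tight loop free of per-value branches, and the rare values that do not fit are patched in afterwards as exceptions.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Illustrative PFOR-style compression sketch: values are stored as small
// offsets from a frame-of-reference base; values too large for `bits` become
// exceptions that are patched back in after a tight decoding loop.
struct PforBlock {
    uint32_t base;                   // frame-of-reference base value
    std::vector<uint32_t> offsets;   // small offsets (bit-packed in a real codec)
    std::vector<size_t>   exc_pos;   // positions of exception values
    std::vector<uint32_t> exc_val;   // original exception values
};

PforBlock pfor_encode(const std::vector<uint32_t>& in, int bits) {
    PforBlock b{*std::min_element(in.begin(), in.end()), {}, {}, {}};
    uint32_t limit = (1u << bits) - 1;
    for (size_t i = 0; i < in.size(); ++i) {
        uint32_t off = in[i] - b.base;
        if (off > limit) {                    // does not fit: record an exception
            b.exc_pos.push_back(i);
            b.exc_val.push_back(in[i]);
            b.offsets.push_back(0);           // placeholder slot
        } else {
            b.offsets.push_back(off);
        }
    }
    return b;
}

std::vector<uint32_t> pfor_decode(const PforBlock& b) {
    std::vector<uint32_t> out(b.offsets.size());
    for (size_t i = 0; i < out.size(); ++i)        // branch-free main loop
        out[i] = b.base + b.offsets[i];
    for (size_t e = 0; e < b.exc_pos.size(); ++e)  // patch exceptions afterwards
        out[b.exc_pos[e]] = b.exc_val[e];
    return out;
}
```

Keeping the exception handling out of the main decode loop is what lets decompression run at CPU speed rather than being limited by branch mispredictions.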
International Conference on Management of Data | 2010
Sándor Héman; Marcin Zukowski; Niels Nes; Lefteris Sidirourgos; Peter A. Boncz
In this paper we investigate techniques that allow for on-line updates to columnar databases while leaving their high read-only performance intact. Rather than keeping differential structures organized by the table key values, the core proposition of this paper is that this is better done by tracking the tuple positions of the modifications. Not only does this minimize the computational overhead of merging differences into read-only queries, it also makes the differential structure oblivious to the values of the order keys, allowing it to avoid disk I/O for retrieving the order keys in read-only queries that otherwise do not need them - a crucial advantage for a column-store. We describe a new data structure for maintaining such positional updates, called the Positional Delta Tree (PDT), and give detailed algorithms for PDT/column merging, updating PDTs, and using PDTs in transaction management. In experiments with a columnar DBMS, we perform micro-benchmarks on PDTs, and show in a TPC-H workload that PDTs allow quick on-line updates while significantly reducing their performance impact on read-only queries compared with classical value-based differential methods.
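The gain of positional over value-based differencing is easiest to see in a merge loop. Below is a minimal sketch (the Delta layout and merge_deltas helper are illustrative assumptions, not the PDT structure itself): because deltas are keyed by tuple position, merging is one sequential pass that never reads or compares sort-key values.

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// Illustrative positional-delta merge: deltas are keyed by tuple position in
// the stable column image, so merging needs no access to sort-key columns.
enum class Op { Insert, Delete };
struct Delta { size_t pos; Op op; int value; };  // pos refers to the stable image

std::vector<int> merge_deltas(const std::vector<int>& stable,
                              const std::vector<Delta>& deltas) {  // sorted by pos
    std::vector<int> out;
    size_t d = 0;
    for (size_t i = 0; i <= stable.size(); ++i) {
        // emit all inserts that land before stable position i
        while (d < deltas.size() && deltas[d].pos == i && deltas[d].op == Op::Insert)
            out.push_back(deltas[d++].value);
        if (i == stable.size()) break;
        if (d < deltas.size() && deltas[d].pos == i && deltas[d].op == Op::Delete) {
            ++d;                                 // skip the deleted tuple
            continue;
        }
        out.push_back(stable[i]);
    }
    return out;
}

int main() {
    std::vector<int> column = {10, 20, 30};
    std::vector<Delta> deltas = {{1, Op::Insert, 15}, {2, Op::Delete, 0}};
    for (int v : merge_deltas(column, deltas)) std::cout << v << ' ';  // 10 15 20
}
```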
Data Management on New Hardware | 2006
Marcin Zukowski; Sándor Héman; Peter A. Boncz
Hashing is one of the fundamental techniques used to implement query processing operators such as grouping, aggregation and join. This paper studies the interaction between modern computer architecture and hash-based query processing techniques. First, we focus on extracting maximum hashing performance from super-scalar CPUs. In particular, we discuss fast hash functions and ways to efficiently handle multi-column keys, and propose the use of the recently introduced Cuckoo Hashing scheme over the commonly used bucket-chained hashing. In the second part of the paper, we focus on CPU cache usage, dynamically partitioning data streams such that the partial hash tables fit in the CPU cache. Conventional partitioning works as a separate preparatory phase, forcing materialization, which may require I/O if the stream does not fit in RAM. We introduce best-effort partitioning, a technique that interleaves partitioning with the execution of hash-based query processing operators and avoids I/O. In the process, we show how to prevent cacheline-alignment issues in partitioning that can strongly decrease throughput. Finally, we demonstrate overall query processing performance when CPU-efficient hashing and best-effort partitioning are combined.
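A minimal sketch of cuckoo hashing as advocated here over bucket chaining (the hash constants and table layout are illustrative assumptions, not the paper's code): every key has exactly two candidate slots, so a lookup costs at most two probes and never follows a chain, which suits super-scalar CPUs well.

```cpp
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

// Illustrative two-table cuckoo hash set: each key lives in one of exactly
// two candidate slots, so a probe touches at most two locations.
class CuckooSet {
    std::vector<std::optional<uint64_t>> t1, t2;
    size_t n;
    size_t h1(uint64_t k) const { return (k * 0x9E3779B97F4A7C15ull) % n; }
    size_t h2(uint64_t k) const { return ((k ^ (k >> 31)) * 0xC2B2AE3D27D4EB4Full) % n; }
public:
    explicit CuckooSet(size_t slots) : t1(slots), t2(slots), n(slots) {}

    bool contains(uint64_t k) const {            // at most two probes, no chains
        return t1[h1(k)] == k || t2[h2(k)] == k;
    }

    bool insert(uint64_t k) {
        if (contains(k)) return true;
        for (int kicks = 0; kicks < 32; ++kicks) {  // bounded eviction loop
            auto& s1 = t1[h1(k)];
            if (!s1) { s1 = k; return true; }
            std::swap(k, *s1);                      // evict occupant of table 1
            auto& s2 = t2[h2(k)];
            if (!s2) { s2 = k; return true; }
            std::swap(k, *s2);                      // evict from table 2, retry
        }
        return false;    // insertion failed: a real table would grow and rehash
    }
};
```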
Very Large Data Bases | 2008
Roberto Cornacchia; Sándor Héman; Marcin Zukowski; Arjen P. de Vries; Peter A. Boncz
The Matrix Framework is a recent proposal by Information Retrieval (IR) researchers to flexibly represent information retrieval models and concepts in a single multi-dimensional array framework. We provide computational support for exactly this framework with the array database system SRAM (Sparse Relational Array Mapping), which works on top of a DBMS. Information retrieval models can be specified in its comprehension-based array query language, in a way that directly corresponds to the underlying mathematical formulas. SRAM efficiently stores sparse arrays in (compressed) relational tables and translates and optimizes array queries into relational queries. In this work, we describe a number of array query optimization rules. To demonstrate their effect on text retrieval, we apply them in the TREC TeraByte track (TREC-TB) efficiency task, using the Okapi BM25 model as our example. It turns out that these optimization rules enable SRAM to automatically translate the BM25 array queries into the relational equivalent of inverted-list processing, including compression, score materialization and quantization, as employed by custom-built IR systems. The use of the high-performance MonetDB/X100 relational backend, which provides transparent database compression, allows the system to achieve very fast response times with good precision and low resource usage.
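For reference, the Okapi BM25 weighting used in these experiments is the standard textbook formula; the self-contained sketch below is not tied to SRAM's array representation, and the parameter defaults k1 = 1.2 and b = 0.75 are the common ones, not necessarily those of the TREC runs.

```cpp
#include <cmath>
#include <iostream>
#include <vector>

// Textbook Okapi BM25 term weight. N = collection size, df = document
// frequency of the term, tf = term frequency within the scored document.
double bm25(double tf, double doc_len, double avg_len, double df, double N,
            double k1 = 1.2, double b = 0.75) {
    double idf = std::log((N - df + 0.5) / (df + 0.5) + 1.0);
    return idf * (tf * (k1 + 1.0))
               / (tf + k1 * (1.0 - b + b * doc_len / avg_len));
}

int main() {
    // Score one term over a tiny posting list: {doc id, tf, doc length}.
    struct Posting { int doc; double tf, len; };
    std::vector<Posting> plist = {{1, 3, 100}, {2, 1, 150}};
    for (const auto& p : plist)
        std::cout << "doc " << p.doc << ": "
                  << bm25(p.tf, p.len, 120.0, plist.size(), 1000.0) << '\n';
}
```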
International Conference on Management of Data | 2013
Bogdan Răducanu; Peter A. Boncz; Marcin Zukowski
Performance of query processing functions in a DBMS can be affected by many factors, including the hardware platform, data distributions, predicate parameters, compilation method, algorithmic variations and the interactions between these. Given that different function implementations are often possible, there is latent performance diversity, which represents both a threat to performance robustness if ignored (as is usual now) and an opportunity to increase performance if one is able to use the best-performing implementation in each situation. Micro Adaptivity, proposed here, is a framework that keeps many alternative function implementations (flavors) in a system. It uses a learning algorithm to choose the most promising flavor, potentially at each function call, guided by the actual costs observed so far. We argue that Micro Adaptivity both increases performance robustness and saves development time spent finding and tuning heuristics and cost-model thresholds in query optimization. In this paper, we (i) characterize a number of factors that cause performance diversity between primitive flavors, (ii) describe an ε-greedy learning algorithm that casts flavor selection as a multi-armed bandit problem, and (iii) describe the software framework for Micro Adaptivity that we implemented in the Vectorwise system. We provide micro-benchmarks, and an overall evaluation on TPC-H, showing consistent improvements.
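The bandit formulation can be sketched compactly. The following ε-greedy selector illustrates the idea, not the Vectorwise framework itself (the names and exploration rate are assumptions): each call either explores a random flavor or exploits the flavor with the lowest average observed cost.

```cpp
#include <cstdlib>
#include <limits>
#include <vector>

// Illustrative epsilon-greedy flavor selection: explore a random flavor with
// probability epsilon, otherwise exploit the cheapest flavor observed so far.
struct FlavorStats { double total_cost = 0.0; long calls = 0; };

size_t pick_flavor(const std::vector<FlavorStats>& stats, double epsilon) {
    if ((double)std::rand() / RAND_MAX < epsilon)
        return (size_t)std::rand() % stats.size();     // explore
    size_t best = 0;
    double best_avg = std::numeric_limits<double>::infinity();
    for (size_t i = 0; i < stats.size(); ++i) {
        double avg = stats[i].calls ? stats[i].total_cost / stats[i].calls
                                    : 0.0;             // untried flavors look free
        if (avg < best_avg) { best_avg = avg; best = i; }
    }
    return best;                                       // exploit
}
```

After each call, the observed cost (e.g. CPU cycles) of the chosen flavor would be accumulated into its FlavorStats entry, so the exploit branch adapts to the actual hardware and data at hand.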
International Conference on Data Engineering | 2015
Renata Borovica-Gajic; Stratos Idreos; Anastasia Ailamaki; Marcin Zukowski; Campbell Fraser
Query optimizers depend heavily on statistics representing column distributions to create efficient query plans. In many cases, though, statistics are outdated or non-existent, and the process of refreshing statistics is very expensive, especially for ad-hoc workloads on ever bigger data. This results in suboptimal plans that severely hurt performance. The main problem is that any decision, once made by the optimizer, is fixed throughout the execution of a query. In particular, each logical operator translates into a fixed choice of a physical operator at run-time. In this paper, we advocate continuous adaptation and morphing of physical operators throughout their lifetime, by adjusting their behavior in accordance with the statistical properties of the data. We demonstrate the benefits of the new paradigm by designing and implementing an adaptive access path operator called Smooth Scan, which morphs continuously within the space of traditional index access and full table scan. Smooth Scan behaves similarly to an index scan for low selectivity; if selectivity increases, however, Smooth Scan progressively morphs its behavior toward a sequential scan. As a result, a system with Smooth Scan requires no access path decisions up front, nor does it need accurate statistics to provide good performance. We implement Smooth Scan in PostgreSQL and, using both synthetic benchmarks and TPC-H, show that it achieves robust performance while remaining statistics-oblivious.
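The morphing policy can be illustrated with a small decision function. This is only a sketch: the mode names and thresholds below are assumptions for illustration, not the paper's exact mechanism. The key point is that the selectivity estimate comes only from tuples already processed, so no up-front statistics are needed.

```cpp
#include <cstddef>

// Illustrative morphing policy in the spirit of Smooth Scan: as observed
// selectivity grows, the operator widens each index probe from a single
// tuple to a full page, and finally to sequential ranges.
enum class Mode { IndexProbe, PageProbe, RangeScan };

Mode choose_mode(size_t matches_seen, size_t tuples_seen) {
    double sel = tuples_seen ? (double)matches_seen / tuples_seen : 0.0;
    if (sel < 0.01) return Mode::IndexProbe;  // few matches: stay index-like
    if (sel < 0.20) return Mode::PageProbe;   // read the whole page per probe
    return Mode::RangeScan;                   // behave like a sequential scan
}
```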
International Conference on Management of Data | 2012
Marcin Zukowski; Peter A. Boncz
In 2008 a group of researchers behind the X100 database kernel created Vectorwise: a spin-off which, together with the Actian corporation (previously Ingres), worked on bringing this technology to the market. Today, Vectorwise is a popular product and one of the examples of a research prototype converted into successful commercial software. We describe here some of the interesting aspects of the work performed by the Vectorwise development team in the process, and discuss the opportunities and challenges resulting from the decision to integrate a prototype-quality kernel with Ingres, an established commercial product. We also discuss how requirements coming from real-life scenarios sometimes clashed with design choices and simplifications often found in research projects, and how the Vectorwise team addressed some of them.
Very Large Data Bases | 2012
Michał Świtakowski; Peter A. Boncz; Marcin Zukowski
In analytical applications, database systems often need to sustain workloads with multiple concurrent scans hitting the same table. The Cooperative Scans (CScans) framework, which introduces an Active Buffer Manager (ABM) component into the database architecture, has been the most effective and elaborate response to this problem, and was initially developed in the X100 research prototype. We now report on the experiences of integrating Cooperative Scans into its industrial-strength successor, the Vectorwise database product. During this implementation we invented a simpler optimization of concurrent scan buffer management, called Predictive Buffer Management (PBM). PBM is based on the observation that in a workload with long-running scans, the buffer manager has quite a bit of information on the workload in the immediate future, such that an approximation of the ideal OPT algorithm becomes feasible. In an evaluation on both synthetic benchmarks and a TPC-H throughput run, we compare the benefits of naive buffer management (LRU) versus CScans, PBM and OPT, showing that PBM achieves benefits close to Cooperative Scans, while incurring much lower architectural impact.
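The observation that makes PBM possible is easy to state in code. A minimal sketch follows (assuming forward-only sequential scans; this is an illustration, not the Vectorwise implementation): each active scan's current position tells the buffer manager when any given page will next be needed, so eviction can approximate Belady's OPT by dropping the page whose next use lies furthest in the future.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <vector>

// Illustrative OPT-style eviction in the spirit of PBM: a scan's current
// position predicts when it will next need any page of its table.
struct Scan { int64_t current_page; int64_t last_page; };

// Distance until `page` is next needed by any active scan.
int64_t next_use_distance(int64_t page, const std::vector<Scan>& scans) {
    int64_t best = std::numeric_limits<int64_t>::max();   // max = never needed
    for (const auto& s : scans)
        if (page >= s.current_page && page <= s.last_page)
            best = std::min(best, page - s.current_page);
    return best;
}

// Evict the buffered page whose next use lies furthest in the future.
int64_t pick_victim(const std::vector<int64_t>& buffered,
                    const std::vector<Scan>& scans) {
    int64_t victim = buffered.front();
    int64_t worst = -1;
    for (int64_t p : buffered) {
        int64_t d = next_use_distance(p, scans);
        if (d > worst) { worst = d; victim = p; }
    }
    return victim;
}
```

Unlike LRU, which only looks backwards, this exploits the scans' known forward progress; the real PBM must additionally cope with varying scan speeds and data sharing, which this sketch ignores.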
Very Large Data Bases | 2018
Renata Borovica-Gajic; Stratos Idreos; Anastasia Ailamaki; Marcin Zukowski; Campbell Fraser
Query optimizers depend heavily on statistics representing column distributions to create good query plans. In many cases, though, statistics are outdated or nonexistent, and the process of refreshing statistics is very expensive, especially for ad hoc workloads on ever bigger data. This results in suboptimal plans that severely hurt performance. The core of the problem is the fixed decision on the type of physical operators that comprise a query plan. This paper makes a case for continuous adaptation and morphing of physical operators throughout their lifetime, by adjusting their behavior in accordance with the observed statistical properties of the data at run time. We demonstrate the benefits of the new paradigm by designing and implementing an adaptive access path operator called Smooth Scan, which morphs continuously within the space of index access and full table scan. Smooth Scan behaves similarly to an index scan for low selectivity; if selectivity increases, however, Smooth Scan progressively morphs its behavior toward a sequential scan. As a result, a system with Smooth Scan requires no optimization decisions on the access paths up front. Additionally, by depending only on the result distribution and eschewing statistics and cardinality estimates altogether, Smooth Scan ensures repeatable execution across multiple query invocations. Smooth Scan implemented in PostgreSQL demonstrates robust, near-optimal performance on micro-benchmarks and real-life workloads, while remaining statistics-oblivious.
IEEE Data(base) Engineering Bulletin | 2005
Marcin Zukowski; Peter A. Boncz; Niels Nes; Sándor Héman