Sándor Héman
Centrum Wiskunde & Informatica
Publication
Featured research published by Sándor Héman.
international conference on data engineering | 2006
Marcin Zukowski; Sándor Héman; Niels Nes; Peter A. Boncz
High-performance data-intensive query processing tasks like OLAP, data mining or scientific data analysis can be severely I/O bound, even when high-end RAID storage systems are used. Compression can alleviate this bottleneck only if encoding and decoding speeds significantly exceed RAID I/O bandwidth. For this purpose, we propose three new versatile compression schemes (PDICT, PFOR, and PFOR-DELTA) that are specifically designed to extract maximum IPC from modern CPUs. We compare these algorithms with compression techniques used in (commercial) database and information retrieval systems. Our experiments on the MonetDB/X100 database system, using both DSM and PAX disk storage, show that these techniques strongly accelerate TPC-H performance to the point that the I/O bottleneck is eliminated.
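The abstract names the schemes but does not describe their encoding. As a rough illustration only, the sketch below shows the general idea of frame-of-reference coding with separately stored exceptions: most values are kept as small offsets from a base, and the few that do not fit are patched in afterwards. The function names, the choice of the minimum as base, and the plain exception list are assumptions made for this example; it is not the authors' implementation and does no bit-packing.

```python
from typing import List, Tuple

Encoded = Tuple[int, List[int], List[Tuple[int, int]]]


def pfor_encode(values: List[int], bits: int) -> Encoded:
    """Encode a non-empty list as offsets from a base, with outliers as exceptions.

    Hypothetical helper: values whose offset fits in `bits` bits become codes,
    everything else is recorded as a (position, value) exception.
    """
    base = min(values)          # assumption: use the minimum as the frame of reference
    limit = 1 << bits           # offsets must be in [0, limit)
    codes: List[int] = []
    exceptions: List[Tuple[int, int]] = []
    for pos, v in enumerate(values):
        offset = v - base
        if offset < limit:
            codes.append(offset)
        else:
            codes.append(0)     # placeholder slot, overwritten during decode
            exceptions.append((pos, v))
    return base, codes, exceptions


def pfor_decode(encoded: Encoded) -> List[int]:
    """Decode in two passes: a uniform pass over all slots, then patching."""
    base, codes, exceptions = encoded
    out = [base + c for c in codes]   # pass 1: no per-value branching
    for pos, v in exceptions:         # pass 2: fix up the outliers
        out[pos] = v
    return out


if __name__ == "__main__":
    data = [100, 101, 103, 100, 5000, 102]
    enc = pfor_encode(data, bits=2)
    print(enc)
    print(pfor_decode(enc) == data)
```

Keeping the first decode pass free of data-dependent branches is one plausible way to keep a CPU pipeline busy, which matches the stated goal of high IPC; whether the published schemes work exactly this way should be checked against the paper itself.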
international conference on management of data | 2010
Sándor Héman; Marcin Zukowski; Niels Nes; Lefteris Sidirourgos; Peter A. Boncz
In this paper we investigate techniques that allow for on-line updates to columnar databases, leaving intact their high read-only performance. Rather than keeping differential structures organized by the table key values, the core proposition of this paper is that this can better be done by keeping track of the tuple position of the modifications. Not only does this minimize the computational overhead of merging in differences into read-only queries, but this makes the differential structure oblivious of the value of the order keys, allowing it to avoid disk I/O for retrieving the order keys in read-only queries that otherwise do not need them - a crucial advantage for a column-store. We describe a new data structure for maintaining such positional updates, called the Positional Delta Tree (PDT), and describe detailed algorithms for PDT/column merging, updating PDTs, and for using PDTs in transaction management. In experiments with a columnar DBMS, we perform microbenchmarks on PDTs, and show in a TPC-H workload that PDTs allow quick on-line updates, yet significantly reduce their performance impact on read-only queries compared with classical value-based differential methods.
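To make the positional idea concrete, here is a minimal sketch of merging position-keyed differences into a scan of one column. It uses a flat, position-sorted list of deltas rather than a tree, so it is not the PDT data structure itself; the delta tuple layout and the name `merge_scan` are assumptions made for this example.

```python
from typing import Any, List, Optional, Tuple

# Hypothetical delta format: (kind, base_position, value)
#   ("ins", p, v)    insert v before the base tuple at position p
#   ("del", p, None) drop the base tuple at position p
#   ("mod", p, v)    replace the base tuple at position p with v
Delta = Tuple[str, int, Optional[Any]]


def merge_scan(base: List[Any], deltas: List[Delta]) -> List[Any]:
    """Merge position-sorted deltas into a scan of `base`.

    Only positions are compared; the sort-key values of the table are never read.
    Assumes deltas are sorted by position, with inserts listed before a
    delete or modify at the same position.
    """
    out: List[Any] = []
    d = 0
    for pos, value in enumerate(base):
        # Emit every insert that belongs in front of this base tuple.
        while d < len(deltas) and deltas[d][1] == pos and deltas[d][0] == "ins":
            out.append(deltas[d][2])
            d += 1
        if d < len(deltas) and deltas[d][1] == pos:
            kind, _, new_value = deltas[d]
            d += 1
            if kind == "del":
                continue            # tuple was deleted: skip it
            out.append(new_value)   # kind == "mod": emit the new value
            continue
        out.append(value)           # untouched base tuple
    # Remaining deltas are inserts appended after the last base tuple.
    while d < len(deltas):
        out.append(deltas[d][2])
        d += 1
    return out


if __name__ == "__main__":
    column = [10, 20, 30, 40]
    changes = [("ins", 1, 15), ("del", 2, None), ("mod", 3, 45), ("ins", 4, 50)]
    print(merge_scan(column, changes))
```

Because the merge compares integer positions only, a query that does not otherwise need the order-key columns has no reason to read them, which is the advantage the abstract highlights.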
data management on new hardware | 2006
Marcin Zukowski; Sándor Héman; Peter A. Boncz
Hashing is one of the fundamental techniques used to implement query processing operators such as grouping, aggregation and join. This paper studies the interaction between modern computer architecture and hash-based query processing techniques. First, we focus on extracting maximum hashing performance from super-scalar CPUs. In particular, we discuss fast hash functions, ways to efficiently handle multi-column keys and propose the use of a recently introduced hashing scheme called Cuckoo Hashing over the commonly used bucket-chained hashing. In the second part of the paper, we focus on the CPU cache usage, by dynamically partitioning data streams such that the partial hash tables fit in the CPU cache. Conventional partitioning works as a separate preparatory phase, forcing materialization, which may require I/O if the stream does not fit in RAM. We introduce best-effort partitioning, a technique that interleaves partitioning with execution of hash-based query processing operators and avoids I/O. In the process, we show how to prevent issues in partitioning with cacheline alignment, that can strongly decrease throughput. We also demonstrate overall query processing performance when both CPU-efficient hashing and best-effort partitioning are combined.
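For readers unfamiliar with Cuckoo Hashing, the following is a small textbook-style sketch: each key has one candidate slot in each of two tables, lookups probe at most those two slots, and inserts evict and relocate occupants when needed. The class name, the way the two hash functions are derived, the `max_kicks` limit, and the grow-and-rehash policy are all assumptions made for this example, not details taken from the paper.

```python
from typing import Any, Hashable, List, Optional, Tuple

Entry = Tuple[Hashable, Any]


class CuckooHashTable:
    """Minimal two-table cuckoo hash map (illustrative sketch only)."""

    def __init__(self, capacity: int = 16, max_kicks: int = 32) -> None:
        self.capacity = capacity
        self.max_kicks = max_kicks
        self.tables: List[List[Optional[Entry]]] = [
            [None] * capacity, [None] * capacity
        ]

    def _slot(self, which: int, key: Hashable) -> int:
        # Assumption: derive two hash functions by salting with the table index.
        return hash((which, key)) % self.capacity

    def get(self, key: Hashable, default: Any = None) -> Any:
        # A lookup inspects at most two slots, one per table.
        for which in (0, 1):
            entry = self.tables[which][self._slot(which, key)]
            if entry is not None and entry[0] == key:
                return entry[1]
        return default

    def put(self, key: Hashable, value: Any) -> None:
        # Overwrite in place if the key is already stored.
        for which in (0, 1):
            i = self._slot(which, key)
            entry = self.tables[which][i]
            if entry is not None and entry[0] == key:
                self.tables[which][i] = (key, value)
                return
        entry: Optional[Entry] = (key, value)
        which = 0
        for _ in range(self.max_kicks):
            i = self._slot(which, entry[0])
            # Place the entry and pick up whatever occupied the slot.
            entry, self.tables[which][i] = self.tables[which][i], entry
            if entry is None:
                return
            which = 1 - which   # the evicted entry tries its other table
        # Too many evictions: enlarge the tables, then retry the homeless entry.
        self._grow()
        self.put(*entry)

    def _grow(self) -> None:
        old = [e for table in self.tables for e in table if e is not None]
        self.capacity *= 2
        self.tables = [[None] * self.capacity, [None] * self.capacity]
        for k, v in old:
            self.put(k, v)


if __name__ == "__main__":
    table = CuckooHashTable()
    for k in range(100):
        table.put(k, k * k)
    print(table.get(7), table.get(1000))
```

The bounded number of probes per lookup, in contrast to walking a bucket chain of unknown length, is the property that makes the scheme interesting for the CPU-efficiency questions the abstract raises.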
very large data bases | 2008
Roberto Cornacchia; Sándor Héman; Marcin Zukowski; Arjen P. de Vries; Peter A. Boncz
The Matrix Framework is a recent proposal by Information Retrieval (IR) researchers to flexibly represent information retrieval models and concepts in a single multi-dimensional array framework. We provide computational support for exactly this framework with the array database system SRAM (Sparse Relational Array Mapping), that works on top of a DBMS. Information retrieval models can be specified in its comprehension-based array query language, in a way that directly corresponds to the underlying mathematical formulas. SRAM efficiently stores sparse arrays in (compressed) relational tables and translates and optimizes array queries into relational queries. In this work, we describe a number of array query optimization rules. To demonstrate their effect on text retrieval, we apply them in the TREC TeraByte track (TREC-TB) efficiency task, using the Okapi BM25 model as our example. It turns out that these optimization rules enable SRAM to automatically translate the BM25 array queries into the relational equivalent of inverted list processing including compression, score materialization and quantization, such as employed by custom-built IR systems. The use of the high-performance MonetDB/X100 relational backend, that provides transparent database compression, allows the system to achieve very fast response times with good precision and low resource usage.
IEEE Data(base) Engineering Bulletin | 2005
Marcin Zukowski; Peter A. Boncz; Niels Nes; Sándor Héman
very large data bases | 2007
Marcin Zukowski; Sándor Héman; Niels Nes; Peter A. Boncz
Journal of Computational Physics | 2007
Marcin Zukowski; Sándor Héman; Niels Nes; Peter A. Boncz
Archive | 2010
Sándor Héman; Peter A. Boncz; Marcin Zukowski; Nicolaas J. Nes
conference on innovative data systems research | 2007
Sándor Héman; Marcin Zukowski; Arjen P. de Vries; Peter A. Boncz
Archive | 2012
Sándor Héman; Peter A. Boncz; Marcin Zukowski; Nicolaas J. Nes