
Publications


Featured research published by Caetano Sauer.


Datenbank-Spektrum | 2013

Compilation of Query Languages into MapReduce

Caetano Sauer; Theo Härder

The introduction of MapReduce as a tool for Big Data Analytics, combined with the new requirements of emerging application scenarios such as the Web 2.0 and scientific computing, has motivated the development of data processing languages which are more flexible and widely applicable than SQL. Based on the Big Data context, we discuss the points in which SQL is considered too restrictive. Furthermore, we provide a qualitative evaluation of how recent query languages overcome these restrictions. Having established the desired characteristics of a query language, we provide an abstract description of the compilation into the MapReduce programming model, which, up to minor variations, is essentially the same in all approaches. Given the requirements of query processing, we introduce simple generalizations of the model, which allow the reuse of well-established query evaluation techniques, and discuss strategies to generate optimized MapReduce plans.
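As a minimal, hypothetical illustration of the compilation idea described above (not the paper's actual compiler), a declarative GROUP-BY aggregation such as `SELECT dept, SUM(salary) FROM emp GROUP BY dept` can be expressed as a key-extracting map function and an aggregating reduce function; the names and data here are invented for the sketch.

```python
from collections import defaultdict

def map_fn(record):
    # Emit (grouping key, aggregation input) pairs.
    yield record["dept"], record["salary"]

def reduce_fn(key, values):
    # Aggregate all values that share a grouping key.
    return key, sum(values)

def run_mapreduce(records, map_fn, reduce_fn):
    # Minimal sequential simulation of the MR dataflow:
    # map, shuffle (group by key), then reduce.
    groups = defaultdict(list)
    for rec in records:
        for k, v in map_fn(rec):
            groups[k].append(v)
    return [reduce_fn(k, vs) for k, vs in sorted(groups.items())]

emp = [
    {"dept": "db", "salary": 100},
    {"dept": "db", "salary": 150},
    {"dept": "os", "salary": 120},
]
print(run_mapreduce(emp, map_fn, reduce_fn))
# → [('db', 250), ('os', 120)]
```

The shuffle step, which a real MR framework performs in a distributed fashion, is simulated here by the in-memory grouping.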


Synthesis Lectures on Data Management | 2014

Instant Recovery with Write-Ahead Logging: Page Repair, System Restart, and Media Restore

Goetz Graefe; Wey Guy; Caetano Sauer

Traditional theory and practice of write-ahead logging and of database recovery techniques revolve around three failure classes: transaction failures resolved by rollback; system failures (typically software faults) resolved by restart with log analysis, “redo,” and “undo” phases; and media failures (typically hardware faults) resolved by restore operations that combine multiple types of backups and log replay. The recent addition of single-page failures and single-page recovery has opened new opportunities far beyond its original aim of immediate, lossless repair of single-page wear-out in novel or traditional storage hardware. In the contexts of system and media failures, efficient single-page recovery enables on-demand incremental “redo” and “undo” as part of system restart or media restore operations. This can give the illusion of practically instantaneous restart and restore: instant restart permits processing new queries and updates seconds after system reboot and instant restore permits resuming queries and updates on empty replacement media as if those were already fully recovered. In addition to these instant recovery techniques, the discussion introduces much faster offline restore operations without slowdown in backup operations and with hardly any slowdown in log archiving operations. The new restore techniques also render differential and incremental backups obsolete, complete backup commands on the database server practically instantly, and even permit taking full backups without imposing any load on the database server.
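The on-demand single-page "redo" mentioned above can be sketched roughly as follows; this is a hedged toy model (all class and field names are hypothetical), assuming each page records the LSN of its last applied update and the log is indexed per page, so a stale page is repaired lazily on first access.

```python
class Page:
    def __init__(self, pid):
        self.pid = pid
        self.lsn = 0      # LSN of the last update applied to this page
        self.data = {}

class LogRecord:
    def __init__(self, lsn, pid, key, value):
        self.lsn, self.pid, self.key, self.value = lsn, pid, key, value

def fetch(page, per_page_log):
    # Instant-restart idea: repair the page on demand by replaying
    # only its own log records newer than the page's current LSN.
    for rec in per_page_log.get(page.pid, []):
        if rec.lsn > page.lsn:
            page.data[rec.key] = rec.value
            page.lsn = rec.lsn
    return page

# A page persisted up to LSN 1, with two newer log records pending.
p = Page(7)
p.lsn, p.data = 1, {"a": "old"}
log = {7: [LogRecord(1, 7, "a", "old"),
           LogRecord(2, 7, "a", "new"),
           LogRecord(3, 7, "b", "x")]}
fetch(p, log)
print(p.lsn, p.data)   # → 3 {'a': 'new', 'b': 'x'}
```

Because repair happens per page and only when a page is touched, new transactions can run while most of the database is still unrecovered, which is the source of the "instant" illusion.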


Advances in Databases and Information Systems | 2013

Versatile XQuery Processing in MapReduce

Caetano Sauer; Sebastian Bächle; Theo Härder

The MapReduce (MR) framework has become a standard tool for performing large batch computations, usually of an aggregative nature, in parallel over a cluster of commodity machines. A significant share of typical MR jobs involves standard database-style queries, where it becomes cumbersome to specify map and reduce functions from scratch. To overcome this burden, higher-level languages such as HiveQL, Pig Latin, and JAQL have been proposed to allow the automatic generation of MR jobs from declarative queries. We identify two major problems of these existing solutions: (i) they introduce new query languages and implement systems from scratch for the sole purpose of expressing MR jobs; and (ii) despite solving some of the major limitations of SQL, they still lack the flexibility required by big data applications. We propose BrackitMR, an approach based on the XQuery language with extended JSON support. XQuery is not only an established query language, but also has a more expressive data model and more powerful language constructs, enabling a much greater degree of flexibility. From a system design perspective, we extend an existing single-node query processor, Brackit, adding MR as a distributed coordination layer. Such heavy reuse of the standard query processor not only provides performance, but also allows for a more elegant design which transparently integrates MR processing into a generic query engine.


Advances in Databases and Information Systems | 2017

Instant Restore After a Media Failure

Caetano Sauer; Goetz Graefe; Theo Härder

Media failures usually leave database systems unavailable for several hours until recovery is complete, especially in applications with large devices and high transaction volume. Previous work introduced a technique called single-pass restore, which increases restore bandwidth and thus substantially decreases time to repair. Instant restore goes further as it permits read/write access to any data on a device undergoing restore—even data not yet restored—by restoring individual data segments on demand. Thus, the restore process is guided primarily by the needs of applications, and the observed mean time to repair is effectively reduced from several hours to a few seconds.


Datenbank-Spektrum | 2014

Unleashing XQuery for Data-Independent Programming

Sebastian Bächle; Caetano Sauer

The XQuery language was initially developed as an SQL equivalent for XML data, but its roots in functional programming make it also a perfect choice for processing almost any kind of structured and semi-structured data. Apart from standard XML processing, however, advanced language features make it hard to efficiently implement the complete language for large data volumes. This work proposes a novel compilation strategy that provides both flexibility and efficiency to unleash XQuery’s potential as data programming language. It combines the simplicity and versatility of a storage-independent data abstraction with the scalability advantages of set-oriented processing. Expensive iterative sections in a query are unrolled to a pipeline of relational-style operators, which is open for optimized join processing, index use, and parallelization. The remaining aspects of the language are processed in a standard fashion, yet can be compiled anytime to more efficient native operations of the actual runtime environment. This hybrid compilation mechanism yields an efficient and highly flexible query engine that is able to drive any computation from simple XML transformation to complex data analysis, even on non-XML data. Experiments with our prototype and state-of-the-art competitors in classic XML query processing and business analytics over relational data attest the generality and efficiency of the design.


International Conference on Database Theory | 2009

Enhanced Statistics for Element-Centered XML Summaries

José de Aguiar Moraes Filho; Theo Härder; Caetano Sauer

Element-centered XML summaries collect statistical information for document nodes and their axes relationships and aggregate them separately for each distinct element/attribute name. They have already partially proven their superiority in quality, space consumption, and evaluation performance. This kind of inversion offers more service capability than conventional approaches. Therefore, we refined and extended element-centered XML summaries to capture more statistical information and propose new estimation methods. We tested our ideas on a set of documents with largely varying characteristics.


Advances in Databases and Information Systems | 2016

Update Propagation Strategies for High-Performance OLTP

Caetano Sauer; Lucas Lersch; Theo Härder; Goetz Graefe

Traditional transaction processing architectures employ a buffer pool where page updates are absorbed in main memory and asynchronously propagated to the persistent database. In a scenario where transaction throughput is limited by I/O bandwidth—which was typical when OLTP systems first arrived—such propagation usually happens on demand, as a consequence of evicting a page. However, as the cost of main memory decreases and larger portions of an application’s working set fit into the buffer pool, running transactions are less likely to depend on page I/O to make progress. In this scenario, update propagation plays a more independent and proactive role, where the main goal is to control the amount of cached dirty data. This is crucial to maintain high performance as well as to reduce recovery time in case of a system failure. In this paper, we analyze different propagation strategies and measure their effectiveness in reducing the number of dirty pages in the buffer pool. We show that typical strategies have a complex parametrization space, yet fail to robustly deliver high propagation rates. As a solution, we propose a propagation strategy based on efficient log replay rather than writing page images from the buffer pool. This novel technique not only maximizes propagation efficiency, but also has interesting properties that can be exploited for novel logging and recovery schemes.
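The log-replay propagation strategy proposed above can be sketched as follows; this is a hedged toy model under assumed names (`propagate`, the page/LSN bookkeeping), not the paper's implementation. A background propagator replays the recovery log in log order onto the persistent database instead of writing dirty page images, and every buffered page whose last update falls within the replayed prefix becomes clean.

```python
def propagate(log, persistent_db, dirty_pages, up_to_lsn):
    # Replay log records sequentially onto the persistent database.
    for lsn, pid, key, value in log:
        if lsn > up_to_lsn:
            break
        persistent_db.setdefault(pid, {})[key] = value
    # Pages fully covered by the replayed prefix are now clean;
    # only pages updated after up_to_lsn remain dirty.
    return {pid: lsn for pid, lsn in dirty_pages.items() if lsn > up_to_lsn}

# Log records as (LSN, page id, key, value).
log = [(1, "A", "x", 1), (2, "B", "y", 2), (3, "A", "x", 3)]
dirty = {"A": 3, "B": 2}        # page id -> LSN of its last update
db = {}
dirty = propagate(log, db, dirty, up_to_lsn=2)
print(db, dirty)   # → {'A': {'x': 1}, 'B': {'y': 2}} {'A': 3}
```

Because replay proceeds in log order, it turns random page writes into sequential work, which is one reason this approach can sustain higher propagation rates than flushing page images.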


Web-Age Information Management | 2013

BrackitMR: Flexible XQuery Processing in MapReduce

Caetano Sauer; Sebastian Bächle; Theo Härder

We present BrackitMR, a framework that executes XQuery programs over distributed data using MapReduce. The main goal is to provide flexible MapReduce-based data processing with minimal performance penalties. Based on the Brackit query engine, a generic query compilation and optimization infrastructure, our system allows for a transparent integration of multiple data sources, such as XML, JSON, and CSV files, as well as relational databases, NoSQL stores, and lower-level record APIs such as BerkeleyDB.


Very Large Data Bases | 2018

FineLine: log-structured transactional storage and recovery

Caetano Sauer; Goetz Graefe; Theo Härder

Recovery is an intricate aspect of transaction processing architectures. In its traditional implementation, recovery requires the management of two persistent data stores—a write-ahead log and a materialized database—which must be carefully orchestrated to maintain transactional consistency. Furthermore, the design and implementation of recovery algorithms have deep ramifications into almost every component of the internal system architecture, from concurrency control to buffer management and access path implementation. Such complexity not only incurs high costs for development, testing, and training, but also unavoidably affects system performance, introducing overheads and limiting scalability. This paper proposes a novel approach for transactional storage and recovery called FineLine. It simplifies the implementation of transactional database systems by eliminating the log-database duality and maintaining all persistent data in a single, log-structured data structure. This approach not only provides more efficient recovery with less overhead, but also decouples the management of persistent data from in-memory access paths. As such, it blurs the lines that separate in-memory from disk-based database systems, providing the efficiency of the former with the reliability of the latter.
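The single-log-structure idea can be sketched in a few lines; this is a hedged toy model (the `IndexedLog` class and its methods are invented for illustration, not FineLine's actual design). Commit appends a transaction's updates to an indexed log, and fetching a node reconstructs it by replaying that node's entries, so there is no separate materialized database to keep consistent with the log.

```python
from collections import defaultdict

class IndexedLog:
    def __init__(self):
        # The index makes per-node retrieval of log entries efficient.
        self.entries_by_node = defaultdict(list)

    def commit(self, updates):
        # Commit = append the transaction's updates to the log.
        for node_id, key, value in updates:
            self.entries_by_node[node_id].append((key, value))

    def fetch(self, node_id):
        # Reconstruct a node by replaying its log entries in order.
        node = {}
        for key, value in self.entries_by_node[node_id]:
            node[key] = value
        return node

store = IndexedLog()
store.commit([("n1", "balance", 100)])
store.commit([("n1", "balance", 80), ("n2", "balance", 20)])
print(store.fetch("n1"), store.fetch("n2"))
# → {'balance': 80} {'balance': 20}
```

In a real system the per-node entry lists would of course be compacted and cached in memory; the point of the sketch is that one append-only, indexed structure serves as both log and database.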


Advances in Databases and Information Systems | 2015

Optimizing Sort in Hadoop using Replacement Selection

Pedro Martins Dusso; Caetano Sauer; Theo Härder

This paper presents and evaluates an alternative sorting component for Hadoop based on the replacement selection algorithm. In comparison with the default quicksort-based implementation, replacement selection generates runs which are on average twice as large. This makes the merge phase more efficient, since the amount of data that can be merged in one pass increases on average by a factor of two. For almost-sorted inputs, replacement selection is often capable of sorting an arbitrarily large file in a single pass, eliminating the need for a merge phase. This paper evaluates an implementation of replacement selection for MapReduce computations in the Hadoop framework. We show that the performance is comparable to quicksort for random inputs, but with substantial gains for inputs which are either almost sorted or require two merge passes in quicksort.
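The classic replacement selection algorithm behind this work can be sketched with a min-heap; this is a simplified in-memory model (the function name and memory budget are illustrative, not Hadoop's sorter). A record smaller than the last one written to the current run cannot join that run and is held back for the next one, which is why runs grow beyond the memory budget.

```python
import heapq

def replacement_selection(records, capacity):
    # Fill the heap up to the memory budget of `capacity` records.
    it = iter(records)
    heap = []
    for _ in range(capacity):
        try:
            heap.append(next(it))
        except StopIteration:
            break
    heapq.heapify(heap)
    runs, current, next_run = [], [], []
    for rec in it:
        smallest = heapq.heappop(heap)
        current.append(smallest)
        if rec >= smallest:
            heapq.heappush(heap, rec)    # still fits in the current run
        else:
            next_run.append(rec)         # must wait for the next run
            if not heap:                 # current run is exhausted
                heap, next_run = next_run, []
                heapq.heapify(heap)
                runs.append(current)
                current = []
    # Drain the remaining records.
    current.extend(sorted(heap))
    runs.append(current)
    if next_run:
        runs.append(sorted(next_run))
    return runs

print(replacement_selection([3, 1, 4, 1, 5, 9, 2, 6], capacity=3))
# → [[1, 1, 3, 4, 5, 6, 9], [2]]
```

With a budget of only 3 records, the first run holds 7 of the 8 inputs; on almost-sorted input the held-back case rarely triggers, which is how a single run can cover an arbitrarily large file.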

Collaboration


Top co-authors of Caetano Sauer.


Theo Härder, Kaiserslautern University of Technology

Sebastian Bächle, Kaiserslautern University of Technology

José de Aguiar Moraes Filho, Kaiserslautern University of Technology

Lucas Lersch, Dresden University of Technology

Henrique Valer, Kaiserslautern University of Technology

Karsten Schmidt, Kaiserslautern University of Technology

Martin Hiller, Kaiserslautern University of Technology

Pedro Martins Dusso, Universidade Federal do Rio Grande do Sul