Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Fabian Hueske is active.

Publication


Featured research published by Fabian Hueske.


very large data bases | 2014

The Stratosphere platform for big data analytics

Alexander Alexandrov; Rico Bergmann; Stephan Ewen; Johann Christoph Freytag; Fabian Hueske; Arvid Heise; Odej Kao; Marcus Leich; Ulf Leser; Volker Markl; Felix Naumann; Mathias Peters; Astrid Rheinländer; Matthias J. Sax; Sebastian Schelter; Mareike Hoger; Kostas Tzoumas; Daniel Warneke

We present Stratosphere, an open-source software stack for parallel data analysis. Stratosphere brings together a unique set of features that allow the expressive, easy, and efficient programming of analytical applications at very large scale. Stratosphere’s features include “in situ” data processing, a declarative query language, treatment of user-defined functions as first-class citizens, automatic program parallelization and optimization, support for iterative programs, and a scalable and efficient execution engine. Stratosphere covers a variety of “Big Data” use cases, such as data warehousing, information extraction and integration, data cleansing, graph analysis, and statistical analysis applications. In this paper, we present the overall system architecture design decisions, introduce Stratosphere through example queries, and then dive into the internal workings of the system’s components that relate to extensibility, programming model, optimization, and query execution. We experimentally compare Stratosphere against popular open-source alternatives, and we conclude with a research outlook for the next years.
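
The abstract emphasizes that user-defined functions are first-class citizens plugged into parallelizable operators. As a rough illustration of that program shape only, here is a plain-Java sketch (hypothetical data and field names, not the Stratosphere API) of the kind of small analytical pipeline such a system would parallelize and optimize:

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Plain-Java sketch of the *shape* of such a program: UDFs are ordinary values handed
// to higher-order operators (filter, grouped aggregation). This is NOT the Stratosphere
// API; the data and field names are made up for illustration.
public class PipelineSketch {
    record Visit(String country, String url, int bytes) {}

    public static void main(String[] args) {
        List<Visit> visits = List.of(
                new Visit("DE", "/docs", 1200),
                new Visit("DE", "/blog", 300),
                new Visit("US", "/docs", 800));

        // UDFs as first-class citizens: defined separately, then passed to the operators.
        Predicate<Visit> largeVisit = v -> v.bytes() > 500;
        Function<Visit, String> byCountry = Visit::country;

        // filter -> group by country -> count: a small data flow an optimizer could reorder.
        Map<String, Long> largeVisitsPerCountry = visits.stream()
                .filter(largeVisit)
                .collect(Collectors.groupingBy(byCountry, Collectors.counting()));

        System.out.println(largeVisitsPerCountry); // counts per country: DE=1, US=1
    }
}
```

In Stratosphere, the decision of how such a pipeline is parallelized and in which order reorderable operators run is left to the optimizer rather than to the author of the UDFs.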


symposium on cloud computing | 2010

Nephele/PACTs: a programming model and execution framework for web-scale analytical processing

Dominic Battré; Stephan Ewen; Fabian Hueske; Odej Kao; Volker Markl; Daniel Warneke

We present a parallel data processor centered around a programming model of so-called Parallelization Contracts (PACTs) and the scalable parallel execution engine Nephele [18]. The PACT programming model is a generalization of the well-known map/reduce programming model, extending it with further second-order functions, as well as with Output Contracts that give guarantees about the behavior of a function. We describe methods to transform a PACT program into a data flow for Nephele, which executes its sequential building blocks in parallel and deals with communication, synchronization and fault tolerance. Our definition of PACTs allows applying several types of optimizations on the data flow during the transformation. The system as a whole is designed to be as generic as (and compatible with) map/reduce systems, while overcoming several of their major weaknesses: 1) The functions map and reduce alone are not sufficient to express many data processing tasks both naturally and efficiently. 2) Map/reduce ties a program to a single fixed execution strategy, which is robust but highly suboptimal for many tasks. 3) Map/reduce makes no assumptions about the behavior of the functions. Hence, it offers only very limited optimization opportunities. With a set of examples and experiments, we illustrate how our system is able to naturally represent and efficiently execute several tasks that do not fit the map/reduce model well.
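
The characteristic pieces of the model are additional second-order functions beyond map and reduce (such as a match over two keyed inputs) and Output Contracts that describe UDF behavior to the system. The sketch below is a minimal, hypothetical illustration of those two ideas in plain Java; it is not the actual PACT API, and all names are made up:

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

// Hypothetical sketch, NOT the actual Stratosphere/PACT API: a "match" second-order
// function over two keyed inputs, plus an output-contract-style annotation that tells
// an optimizer the UDF preserves the key of its inputs.
public class PactSketch {

    @Retention(RetentionPolicy.RUNTIME)
    @interface SameKey {} // stands in for an output contract such as "same key"

    // Match: the user function is invoked once for every pair of records that share a key
    // (roughly an equi-join whose join condition is fixed by the contract, not by the UDF).
    static <K, A, B, R> List<R> match(Map<K, List<A>> left, Map<K, List<B>> right,
                                      BiFunction<A, B, R> udf) {
        List<R> out = new ArrayList<>();
        for (Map.Entry<K, List<A>> e : left.entrySet()) {
            List<B> rights = right.getOrDefault(e.getKey(), List.of());
            for (A a : e.getValue())
                for (B b : rights)
                    out.add(udf.apply(a, b)); // the UDF only builds the result record
        }
        return out;
    }

    // The annotation is the only extra information the system needs about this UDF.
    @SameKey
    static class JoinUdf implements BiFunction<String, String, String> {
        @Override
        public String apply(String order, String user) { return user + " placed " + order; }
    }

    public static void main(String[] args) {
        Map<Integer, List<String>> orders = Map.of(1, List.of("order-17"), 2, List.of("order-42"));
        Map<Integer, List<String>> users  = Map.of(1, List.of("alice"), 2, List.of("bob"));
        System.out.println(match(orders, users, new JoinUdf()));
    }
}
```

The annotation plays the role of an output contract: it hands the optimizer a guarantee about the UDF's behavior without requiring the system to understand the UDF's code.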


very large data bases | 2012

Opening the black boxes in data flow optimization

Fabian Hueske; Mathias Peters; Matthias J. Sax; Astrid Rheinländer; Rico Bergmann; Aljoscha Krettek; Kostas Tzoumas

Many systems for big data analytics employ a data flow abstraction to define parallel data processing tasks. In this setting, custom operations expressed as user-defined functions are very common. We address the problem of performing data flow optimization at this level of abstraction, where the semantics of operators are not known. Traditionally, query optimization is applied to queries with known algebraic semantics. In this work, we find that a handful of properties, rather than a full algebraic specification, suffice to establish reordering conditions for data processing operators. We show that these properties can be accurately estimated for black box operators by statically analyzing the general-purpose code of their user-defined functions. We design and implement an optimizer for parallel data flows that does not assume knowledge of semantics or algebraic properties of operators. Our evaluation confirms that the optimizer can apply common rewritings such as selection reordering, bushy join-order enumeration, and limited forms of aggregation push-down, hence yielding similar rewriting power as modern relational DBMS optimizers. Moreover, it can optimize the operator order of nonrelational data flows, a unique feature among today's systems.
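
The key point is that a small summary per UDF, rather than algebraic semantics, is enough to decide whether two black-box operators may be swapped. A simplified sketch of such a conflict test is below; the types and field numbering are hypothetical, and the paper's actual conditions rest on a handful of properties beyond the read and write sets used here:

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Simplified sketch (hypothetical types, not the paper's implementation): each black-box
// operator is summarized by the record fields its UDF reads and writes, as derived by
// static code analysis, and a conflict test decides whether two operators may be swapped.
public class ReorderCheck {

    record UdfSummary(String name, Set<Integer> reads, Set<Integer> writes) {}

    // Safe to swap if neither operator writes a field the other one reads or writes.
    static boolean canReorder(UdfSummary a, UdfSummary b) {
        return Collections.disjoint(a.writes(), union(b.reads(), b.writes()))
            && Collections.disjoint(b.writes(), a.reads());
    }

    static Set<Integer> union(Set<Integer> x, Set<Integer> y) {
        Set<Integer> u = new HashSet<>(x);
        u.addAll(y);
        return u;
    }

    public static void main(String[] args) {
        UdfSummary filter = new UdfSummary("filter", Set.of(2), Set.of());  // reads field 2 only
        UdfSummary mapA   = new UdfSummary("mapA",   Set.of(0), Set.of(1)); // rewrites field 1
        UdfSummary mapB   = new UdfSummary("mapB",   Set.of(2), Set.of(2)); // rewrites field 2
        System.out.println(canReorder(filter, mapA)); // true: no conflicting field access
        System.out.println(canReorder(filter, mapB)); // false: mapB writes the field the filter reads
    }
}
```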


international conference on data engineering | 2013

Peeking into the optimization of data flow programs with MapReduce-style UDFs

Fabian Hueske; Mathias Peters; Aljoscha Krettek; Matthias Ringwald; Kostas Tzoumas; Volker Markl; Johann Christoph Freytag

Data flows are a popular abstraction to define data-intensive processing tasks. In order to support a wide range of use cases, many data processing systems feature MapReduce-style user-defined functions (UDFs). In contrast to UDFs as known from relational DBMS, MapReduce-style UDFs have less strict templates. These templates do not alone provide all the information needed to decide whether they can be reordered with relational operators and other UDFs. However, it is well-known that reordering operators such as filters, joins, and aggregations can yield runtime improvements by orders of magnitude. We demonstrate an optimizer for data flows that is able to reorder operators with MapReduce-style UDFs written in an imperative language. Our approach leverages static code analysis to extract information from UDFs which is used to reason about the reorderability of UDF operators. This information is sufficient to enumerate a large fraction of the search space covered by conventional RDBMS optimizers including filter and aggregation push-down, bushy join orders, and choice of physical execution strategies based on interesting properties. We demonstrate our optimizer and a job submission client that allows users to peek step-by-step into each phase of the optimization process: the static code analysis of UDFs, the enumeration of reordered candidate data flows, the generation of physical execution plans, and their parallel execution. For the demonstration, we provide a selection of relational and nonrelational data flow programs which highlight the salient features of our approach.
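
To make the role of static code analysis concrete, here is a hypothetical imperative filter UDF in plain Java (not the demonstrated system's API), together with the kind of access summary such an analysis could derive from its code:

```java
import java.util.Map;
import java.util.function.Predicate;

// A hypothetical imperative filter UDF (plain Java, not the demonstrated system's API).
// A static code analysis of this method body could derive the conservative summary
//   reads  = { "price" },  writes = { }
// which is enough to treat the operator as a selection and consider pushing it below
// any operator that does not write the "price" field.
public class PriceFilterUdf implements Predicate<Map<String, Object>> {
    @Override
    public boolean test(Map<String, Object> record) {
        // Only the "price" field is read; the record itself is never modified.
        return ((Number) record.get("price")).doubleValue() > 100.0;
    }
}
```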


Information Systems | 2015

SOFA: An extensible logical optimizer for UDF-heavy data flows

Astrid Rheinländer; Arvid Heise; Fabian Hueske; Ulf Leser; Felix Naumann

Recent years have seen an increased interest in large-scale analytical data flows on non-relational data. These data flows are compiled into execution graphs scheduled on large compute clusters. In many novel application areas the predominant building blocks of such data flows are user-defined predicates or functions (UDFs). However, the heavy use of UDFs is not well taken into account for data flow optimization in current systems. SOFA is a novel and extensible optimizer for UDF-heavy data flows. It builds on a concise set of properties for describing the semantics of Map/Reduce-style UDFs and a small set of rewrite rules, which use these properties to find a much larger number of semantically equivalent plan rewrites than possible with traditional techniques. A salient feature of our approach is extensibility: we arrange user-defined operators and their properties into a subsumption hierarchy, which considerably eases integration and optimization of new operators. We evaluate SOFA on a selection of UDF-heavy data flows from different domains and compare its performance to three other algorithms for data flow optimization. Our experiments reveal that SOFA finds efficient plans, outperforming the best plans found by its competitors by a factor of up to six.
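
A sketch of the extensibility idea: because operators and their properties sit in a subsumption hierarchy, a rewrite rule stated against an abstract operator class automatically covers every operator subsumed by it. The types below are hypothetical and only mirror the structure of the idea, not the SOFA code base:

```java
import java.util.List;

// Hypothetical types illustrating a subsumption hierarchy of operators: a rewrite rule
// stated against an abstract class applies to every operator subsumed by it, so newly
// integrated UDF operators inherit applicable rewrites without new rules.
public class OperatorHierarchy {

    static abstract class Operator {}

    // Filter-like operators: emit at most one record per input record, leave the schema
    // unchanged. Selection-style rewrites apply to every subclass.
    static abstract class FilterLike extends Operator {}

    static class RelationalSelection extends FilterLike {}
    static class StopWordRemoval extends FilterLike {} // a domain-specific UDF operator

    // The rule needs only the abstract class, not the concrete operator.
    static boolean pushDownApplies(Operator op) {
        return op instanceof FilterLike;
    }

    public static void main(String[] args) {
        List<Operator> ops = List.of(new RelationalSelection(), new StopWordRemoval());
        for (Operator op : ops) {
            System.out.println(op.getClass().getSimpleName()
                    + " push-down applicable: " + pushDownApplies(op));
        }
    }
}
```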


Large-Scale Data Analytics | 2014

Optimization of Massively Parallel Data Flows

Fabian Hueske; Volker Markl

Massively parallel data analysis is an emerging research topic that is motivated by the continuous growth of data sets and the rising complexity of data analysis tasks. To facilitate the analysis of big data, several parallel data processing frameworks, such as MapReduce and parallel data flow processors, have emerged. However, the implementation and tuning of parallel data analysis tasks requires expert knowledge and is very time-consuming and costly. Higher-level abstraction frameworks have been designed to ease the definition of analysis tasks. Optimizers can automatically generate efficient parallel execution plans from higher-level task definitions. Therefore, optimization is a crucial technology for massively parallel data analysis. This chapter presents the state of the art in optimization of parallel data flows. It covers higher-level languages for MapReduce, approaches to optimize plain MapReduce jobs, and optimization for parallel data flow systems. The optimization capabilities of those approaches are discussed and compared with each other. The chapter concludes with directions for future research on parallel data flow optimization.
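
As a toy illustration of what such an optimizer does (made-up plan representation and names, not an example from the chapter), consider pushing a filter below a join in a small logical plan, a rewrite the user never has to spell out in the task definition:

```java
// Toy plan representation and rewrite (made-up names, not from the chapter): the user
// states the task at a high level; the optimizer rewrites it, here by evaluating a
// filter before the join instead of after it.
public class PushDownToy {

    interface Plan {}
    record Scan(String source) implements Plan {}
    record Filter(String predicate, Plan input) implements Plan {}
    record Join(String condition, Plan left, Plan right) implements Plan {}

    // If the filter only touches the join's left input, evaluate it before the join.
    static Plan pushFilterDown(Plan p) {
        if (p instanceof Filter f && f.input() instanceof Join j
                && f.predicate().startsWith("left.")) {
            return new Join(j.condition(), new Filter(f.predicate(), j.left()), j.right());
        }
        return p;
    }

    public static void main(String[] args) {
        Plan written = new Filter("left.price > 100",
                new Join("left.id = right.id", new Scan("orders"), new Scan("users")));
        System.out.println(pushFilterDown(written)); // the filter now sits below the join
    }
}
```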


very large data bases | 2010

Massively parallel data analysis with PACTs on Nephele

Alexander Alexandrov; Max Heimel; Volker Markl; Dominic Battré; Fabian Hueske; Erik Nijkamp; Stephan Ewen; Odej Kao; Daniel Warneke


business intelligence for the real-time enterprises | 2008

Situational Business Intelligence

Alexander Löser; Fabian Hueske; Volker Markl


BTW | 2011

MapReduce and PACT - Comparing Data Parallel Programming Models.

Alexander Alexandrov; Stephan Ewen; Max Heimel; Fabian Hueske; Odej Kao; Volker Markl; Erik Nijkamp; Daniel Warneke


very large data bases | 2007

Detecting attribute dependencies from query feedback

Peter J. Haas; Fabian Hueske; Volker Markl

Collaboration


Dive into Fabian Hueske's collaborations.

Top Co-Authors

Volker Markl (Technical University of Berlin)
Daniel Warneke (Technical University of Berlin)
Kostas Tzoumas (Technical University of Berlin)
Odej Kao (Technical University of Berlin)
Stephan Ewen (Technical University of Berlin)
Alexander Alexandrov (Technical University of Berlin)
Aljoscha Krettek (Technical University of Berlin)
Astrid Rheinländer (Humboldt University of Berlin)
Mathias Peters (Humboldt University of Berlin)
Arvid Heise (Hasso Plattner Institute)