Network


Latest external collaborations at the country level. Dive into the details by clicking on the dots.

Hotspot


Dive into the research topics where Daniel Warneke is active.

Publication


Featured research published by Daniel Warneke.


very large data bases | 2014

The Stratosphere platform for big data analytics

Alexander Alexandrov; Rico Bergmann; Stephan Ewen; Johann Christoph Freytag; Fabian Hueske; Arvid Heise; Odej Kao; Marcus Leich; Ulf Leser; Volker Markl; Felix Naumann; Mathias Peters; Astrid Rheinländer; Matthias J. Sax; Sebastian Schelter; Mareike Höger; Kostas Tzoumas; Daniel Warneke

We present Stratosphere, an open-source software stack for parallel data analysis. Stratosphere brings together a unique set of features that allow the expressive, easy, and efficient programming of analytical applications at very large scale. Stratosphere’s features include “in situ” data processing, a declarative query language, treatment of user-defined functions as first-class citizens, automatic program parallelization and optimization, support for iterative programs, and a scalable and efficient execution engine. Stratosphere covers a variety of “Big Data” use cases, such as data warehousing, information extraction and integration, data cleansing, graph analysis, and statistical analysis applications. In this paper, we present the overall system architecture and design decisions, introduce Stratosphere through example queries, and then dive into the internal workings of the system’s components that relate to extensibility, programming model, optimization, and query execution. We experimentally compare Stratosphere against popular open-source alternatives, and we conclude with a research outlook for the coming years.
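
Stratosphere's code base later evolved into Apache Flink, and the programming model sketched above survives largely intact in Flink's DataSet API. As a hedged illustration of what "user-defined functions as first-class citizens" and automatic parallelization look like to a programmer, here is a minimal word-count sketch against that API; Stratosphere's own Java API at the time used different package names, so this is a stand-in rather than the paper's original code.

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class WordCountSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<String> lines = env.fromElements("big data analytics", "parallel data analysis");

        // The user-defined function is a first-class object handed to the
        // second-order function flatMap; grouping and aggregation are
        // parallelized and optimized by the system, not by the user.
        DataSet<Tuple2<String, Integer>> counts = lines
            .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                @Override
                public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                    for (String word : line.split("\\s+")) {
                        out.collect(new Tuple2<>(word, 1));
                    }
                }
            })
            .groupBy(0)
            .sum(1);

        counts.print();
    }
}
```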


IEEE Transactions on Parallel and Distributed Systems | 2011

Exploiting Dynamic Resource Allocation for Efficient Parallel Data Processing in the Cloud

Daniel Warneke; Odej Kao

In recent years, ad hoc parallel data processing has emerged as one of the killer applications for Infrastructure-as-a-Service (IaaS) clouds. Major cloud computing companies have started to integrate frameworks for parallel data processing into their product portfolios, making it easy for customers to access these services and to deploy their programs. However, the processing frameworks which are currently used have been designed for static, homogeneous cluster setups and disregard the particular nature of a cloud. Consequently, the allocated compute resources may be inadequate for big parts of the submitted job and unnecessarily increase processing time and cost. In this paper, we discuss the opportunities and challenges for efficient parallel data processing in clouds and present our research project Nephele. Nephele is the first data processing framework to explicitly exploit the dynamic resource allocation offered by today's IaaS clouds for both task scheduling and execution. Particular tasks of a processing job can be assigned to different types of virtual machines which are automatically instantiated and terminated during the job execution. Based on this new framework, we perform extended evaluations of MapReduce-inspired processing jobs on an IaaS cloud system and compare the results to the popular data processing framework Hadoop.
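
The key idea, assigning the tasks of one job to different virtual-machine types that only live as long as the corresponding stage needs them, can be illustrated with a small, purely hypothetical sketch. The job description, instance-type names, and scheduler loop below are invented for illustration and are not Nephele's actual API.

```java
// Hypothetical sketch of per-stage resource allocation, not Nephele code.
import java.util.List;

public class StageScheduler {

    // Each stage of the job requests its own VM type and degree of parallelism.
    record Stage(String name, String instanceType, int instanceCount) {}

    public static void main(String[] args) {
        List<Stage> job = List.of(
            new Stage("extract",   "m1.small",  4),   // I/O-bound: many cheap VMs
            new Stage("transform", "c1.xlarge", 2),   // CPU-bound: fewer large VMs
            new Stage("load",      "m1.small",  1));

        for (Stage stage : job) {
            System.out.printf("allocating %d x %s for stage %s%n",
                stage.instanceCount(), stage.instanceType(), stage.name());
            // ... run the stage's tasks on the allocated VMs ...
            System.out.printf("deallocating VMs after stage %s%n", stage.name());
        }
    }
}
```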


symposium on cloud computing | 2010

Nephele/PACTs: a programming model and execution framework for web-scale analytical processing

Dominic Battré; Stephan Ewen; Fabian Hueske; Odej Kao; Volker Markl; Daniel Warneke

We present a parallel data processor centered around a programming model of so-called Parallelization Contracts (PACTs) and the scalable parallel execution engine Nephele [18]. The PACT programming model is a generalization of the well-known map/reduce programming model, extending it with further second-order functions, as well as with Output Contracts that give guarantees about the behavior of a function. We describe methods to transform a PACT program into a data flow for Nephele, which executes its sequential building blocks in parallel and deals with communication, synchronization, and fault tolerance. Our definition of PACTs allows several types of optimizations to be applied to the data flow during the transformation. The system as a whole is designed to be as generic as (and compatible with) map/reduce systems, while overcoming several of their major weaknesses: 1) The functions map and reduce alone are not sufficient to express many data processing tasks both naturally and efficiently. 2) Map/reduce ties a program to a single fixed execution strategy, which is robust but highly suboptimal for many tasks. 3) Map/reduce makes no assumptions about the behavior of the functions. Hence, it offers only very limited optimization opportunities. With a set of examples and experiments, we illustrate how our system is able to naturally represent and efficiently execute several tasks that do not fit the map/reduce model well.
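
To make the split between second-order functions and the user code they invoke concrete, the following self-contained sketch implements an in-memory "Match"-style contract: the framework pairs records from two inputs by key and hands each pair to a user-defined function. It illustrates the PACT idea only and is not Stratosphere's actual interface; expressing the same join with map and reduce alone would require tagging records and performing a reduce-side join.

```java
// Illustrative, in-memory sketch of a PACT-style second-order function.
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

public class PactSketch {

    // MATCH: pairs records from two inputs that share a key, then applies the UDF.
    static <K, A, B, R> List<R> match(Map<K, List<A>> left, Map<K, List<B>> right,
                                      BiFunction<A, B, R> udf) {
        List<R> out = new ArrayList<>();
        for (K key : left.keySet()) {
            for (A a : left.get(key)) {
                for (B b : right.getOrDefault(key, List.of())) {
                    out.add(udf.apply(a, b));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> orders = Map.of("alice", List.of(10, 20), "bob", List.of(5));
        Map<String, List<String>> cities = Map.of("alice", List.of("Berlin"), "bob", List.of("Paris"));

        // The user supplies only the first-order function; the contract defines
        // how the framework may parallelize and reorder the work.
        List<String> joined = match(orders, cities,
            (amount, city) -> amount + " EUR shipped to " + city);
        joined.forEach(System.out::println);
    }
}
```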


many task computing on grids and supercomputers | 2009

Nephele: efficient parallel data processing in the cloud

Daniel Warneke; Odej Kao

In recent years, cloud computing has emerged as a promising new approach for ad-hoc parallel data processing. Major cloud computing companies have started to integrate frameworks for parallel data processing into their product portfolios, making it easy for customers to access these services and to deploy their programs. However, the processing frameworks which are currently used stem from the field of cluster computing and disregard the particular nature of a cloud. As a result, the allocated compute resources may be inadequate for big parts of the submitted job and unnecessarily increase processing time and cost. In this paper, we discuss the opportunities and challenges for efficient parallel data processing in clouds and present our ongoing research project Nephele. Nephele is the first data processing framework to explicitly exploit the dynamic resource allocation offered by today's compute clouds for both task scheduling and execution. It allows assigning the particular tasks of a processing job to different types of virtual machines and takes care of their instantiation and termination during the job execution. Based on this new framework, we perform evaluations on a compute cloud system and compare the results to the existing data processing framework Hadoop.


international conference on management of data | 2013

Iterative parallel data processing with stratosphere: an inside look

Stephan Ewen; Sebastian Schelter; Kostas Tzoumas; Daniel Warneke; Volker Markl

Iterative algorithms occur in many domains of data analysis, such as machine learning or graph analysis. With increasing interest in running those algorithms on very large data sets, we see a need for new techniques to execute iterations in a massively parallel fashion. In prior work, we have shown how to extend and use a parallel data flow system to efficiently run iterative algorithms in a shared-nothing environment. Our approach supports the incremental processing nature of many of those algorithms. In this demonstration proposal, we illustrate the process of implementing, compiling, optimizing, and executing iterative algorithms on Stratosphere using examples from graph analysis and machine learning. For the first step, we show the algorithms' code and a visualization of the produced data flow programs. The second step shows the optimizer's execution plan choices, while the last phase monitors the execution of the program, visualizing the state of the operators and additional metrics, such as per-iteration runtime and number of updates. To show that the data flow abstraction supports easy creation of custom programming APIs, we also present programs written against a lightweight Pregel API that is layered on top of our system with a small programming effort.
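
The iterate-until-fixpoint pattern that the demonstration visualizes can be summarized with a small, self-contained example: connected components via label propagation, where the number of label updates per iteration is exactly the kind of per-iteration metric mentioned above. This is an illustrative sketch, not Stratosphere code.

```java
// Self-contained sketch of iterating a graph algorithm until no more updates occur.
import java.util.HashMap;
import java.util.Map;

public class ConnectedComponentsSketch {
    public static void main(String[] args) {
        // Undirected edges of a small graph; each vertex starts with its own id as label.
        int[][] edges = {{1, 2}, {2, 3}, {4, 5}};
        Map<Integer, Integer> label = new HashMap<>();
        for (int[] e : edges) { label.putIfAbsent(e[0], e[0]); label.putIfAbsent(e[1], e[1]); }

        int iteration = 0;
        int updates;
        do {
            updates = 0;
            for (int[] e : edges) {
                int min = Math.min(label.get(e[0]), label.get(e[1]));
                if (label.get(e[0]) > min) { label.put(e[0], min); updates++; }
                if (label.get(e[1]) > min) { label.put(e[1], min); updates++; }
            }
            iteration++;
            // The per-iteration update count is the kind of metric an incremental
            // ("delta") execution exploits: only changed elements need reprocessing.
            System.out.printf("iteration %d: %d updates%n", iteration, updates);
        } while (updates > 0);

        System.out.println("components: " + label);
    }
}
```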


high performance distributed computing | 2012

Massively-parallel stream processing under QoS constraints with Nephele

Björn Lohrmann; Daniel Warneke; Odej Kao

Today, a growing number of commodity devices, like mobile phones or smart meters, are equipped with rich sensors and capable of producing continuous data streams. The sheer number of these devices and the resulting overall data volumes of the streams raise new challenges with respect to the scalability of existing stream processing systems. At the same time, massively-parallel data processing systems like MapReduce have proven that they scale to large numbers of nodes and efficiently organize data transfers between them. Many of these systems also provide streaming capabilities. However, unlike traditional stream processors, these systems have so far disregarded the QoS requirements of prospective stream processing applications. In this paper, we address this gap. First, we analyze common design principles of today's parallel data processing frameworks and identify those principles that provide degrees of freedom in trading off the QoS goals latency and throughput. Second, we propose a scheme which allows these frameworks to detect violations of user-defined latency constraints and optimize the job execution without manual interaction in order to meet these constraints while keeping the throughput as high as possible. As a proof of concept, we implemented our approach for our parallel data processing framework Nephele and evaluated its effectiveness through a comparison with Hadoop Online. For a multimedia streaming application, we demonstrate an improvement in processing latency by a factor of at least 15 while preserving high data throughput when needed.
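
The essence of the constraint check, comparing the measured end-to-end latency along a chain of tasks against a user-defined bound and reacting when it is exceeded, can be sketched as follows. The task names, measurements, and reaction are hypothetical and much simpler than the runtime mechanism described in the paper.

```java
// Hypothetical sketch of checking a user-defined latency constraint, not Nephele code.
import java.util.List;
import java.util.Map;

public class LatencyConstraintCheck {
    public static void main(String[] args) {
        long latencyBoundMillis = 200;                    // user-defined QoS constraint
        List<String> chain = List.of("decode", "overlay", "encode");
        Map<String, Long> measuredMillis = Map.of(        // runtime statistics per task
            "decode", 60L, "overlay", 90L, "encode", 120L);

        long total = chain.stream().mapToLong(measuredMillis::get).sum();
        if (total > latencyBoundMillis) {
            // A real implementation would now adapt the running job automatically
            // instead of just reporting the violation.
            System.out.printf("constraint violated: %d ms > %d ms%n", total, latencyBoundMillis);
        } else {
            System.out.printf("constraint satisfied: %d ms <= %d ms%n", total, latencyBoundMillis);
        }
    }
}
```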


vehicular technology conference | 2008

Mobile Cooperative WLANs - MAC and Transceiver Design, Prototyping, and Field Measurements

Stefan Valentin; Hermann S. Lichte; Daniel Warneke; Thorsten Biermann; Rafael Funke; Holger Karl

We propose a practical medium access control (MAC) protocol and transceiver design for mobile cooperative WLANs. Our MAC protocol integrates selection decode-and-forward (SDF) cooperative relaying into IEEE 802.11. Unlike previous approaches, its cooperative signaling cycle allows communication if the direct link fails, even for small signaling frames. Further SDF functions are efficiently supported by our ready-to-use transceiver design. We implement this design, including our MAC protocol, on a software defined radio (SDR), resulting in a cooperative IEEE 802.11g prototype. Using this prototype, we validate the feasibility and performance of our approaches through extensive field measurements in indoor and railroad scenarios with an actual train moving the cooperating terminals.
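
The core relaying rule of selection decode-and-forward, that the relay retransmits only frames it has decoded correctly and that the destination still needs, can be stated in a few lines. The sketch below is purely illustrative; it is not the MAC or transceiver implementation from the paper, and the predicate names are invented.

```java
// Illustrative sketch of the selection decode-and-forward relaying rule.
public class SdfRelayDecision {

    static boolean shouldRelay(boolean checksumOk, boolean destinationAcked) {
        // Relay only when decoding succeeded and the destination has not acknowledged.
        return checksumOk && !destinationAcked;
    }

    public static void main(String[] args) {
        System.out.println(shouldRelay(true, false));   // direct link failed -> relay forwards
        System.out.println(shouldRelay(false, false));  // relay could not decode -> stay silent
        System.out.println(shouldRelay(true, true));    // destination already has it -> stay silent
    }
}
```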


many-task computing on grids and supercomputers | 2010

Detecting bottlenecks in parallel DAG-based data flow programs

Dominic Battré; Matthias Hovestadt; Björn Lohrmann; Alexander Stanik; Daniel Warneke

In recent years, several frameworks have been introduced to facilitate massively-parallel data processing on shared-nothing architectures like compute clouds. While these frameworks generally offer good support in terms of task deployment and fault tolerance, they only provide poor assistance in finding reasonable degrees of parallelization for the tasks to be executed. However, as billing models of clouds enable the utilization of many resources for a short period of time for the same cost as utilizing few resources for a long time, proper levels of parallelization are crucial to achieve short processing times while maintaining good resource utilization and therefore good cost efficiency. In this paper, we present and evaluate a solution for detecting CPU and I/O bottlenecks in parallel DAG-based data flow programs assuming capacity constrained communication channels. The detection of bottlenecks represents an important foundation for manually or automatically scaling out and tuning parallel data flow programs in order to increase performance and cost efficiency.
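
A minimal illustration of the kind of per-task classification such a detector produces is sketched below, assuming each DAG vertex reports a CPU utilization and an output-channel utilization. The thresholds, record fields, and rule are hypothetical simplifications, not the paper's algorithm.

```java
// Hypothetical sketch of classifying DAG vertices as CPU- or I/O-bound.
import java.util.List;

public class BottleneckSketch {

    record TaskStats(String name, double cpuUtilization, double channelUtilization) {}

    static String classify(TaskStats t) {
        if (t.cpuUtilization() > 0.9) return "CPU bottleneck";
        if (t.channelUtilization() > 0.9) return "I/O bottleneck";
        return "no bottleneck";
    }

    public static void main(String[] args) {
        List<TaskStats> dagVertices = List.of(
            new TaskStats("parse",     0.95, 0.40),
            new TaskStats("aggregate", 0.30, 0.92),
            new TaskStats("sink",      0.20, 0.10));

        for (TaskStats t : dagVertices) {
            // A scale-out controller could raise the degree of parallelism of
            // CPU-bound tasks while leaving I/O-bound ones unchanged.
            System.out.println(t.name() + ": " + classify(t));
        }
    }
}
```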


international conference on cloud computing | 2011

Evaluation of Network Topology Inference in Opaque Compute Clouds through End-to-End Measurements

Dominic Battré; Natalia Frejnik; Siddhant Goel; Odej Kao; Daniel Warneke

Modern Infrastructure-as-a-Service (IaaS) clouds offer an unprecedented flexibility and elasticity in terms of resource provisioning through the use of hardware virtualization. However, for the cloud customer, this virtualization also introduces an opaqueness which imposes serious obstacles for data-intensive distributed applications. In particular, the lack of network topology information, i.e., information on how the rented virtual machines are physically interconnected, can easily cause network bottlenecks, as common techniques to exploit data locality cannot be applied. In this paper, we study to what extent the underlying network topology of virtual machines inside an IaaS cloud can be inferred based on end-to-end measurements. To this end, we experimentally evaluate the impact of hardware virtualization on the measurable link characteristics packet loss and delay, using the popular open-source hypervisors KVM and Xen. Afterwards, we compare the accuracy of different topology inference approaches and propose an extension to improve the inference accuracy for typical network structures in datacenters. We find that common assumptions for end-to-end measurements do not hold in the presence of virtualization and that RTT-based measurements in paravirtualized environments lead to the most accurate inference results.
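
As a rough illustration of inferring hidden structure from end-to-end measurements, the sketch below groups virtual machines whose pairwise round-trip times fall under a threshold, treating low-latency pairs as likely co-located. The RTT values, threshold, and grouping rule are invented for illustration and are far simpler than the inference approaches evaluated in the paper.

```java
// Illustrative sketch: group VMs by low pairwise RTT as a proxy for shared switches.
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class RttGroupingSketch {
    public static void main(String[] args) {
        String[] vms = {"vm1", "vm2", "vm3", "vm4"};
        double[][] rttMillis = {           // symmetric matrix of measured RTTs
            {0.0, 0.2, 1.5, 1.6},
            {0.2, 0.0, 1.4, 1.5},
            {1.5, 1.4, 0.0, 0.3},
            {1.6, 1.5, 0.3, 0.0}};
        double sameSwitchThreshold = 0.5;

        // Greedy grouping: merge the groups of any two VMs connected by a "fast" link.
        int[] group = {0, 1, 2, 3};
        for (int i = 0; i < vms.length; i++)
            for (int j = i + 1; j < vms.length; j++)
                if (rttMillis[i][j] < sameSwitchThreshold) {
                    int from = group[j], to = group[i];
                    for (int k = 0; k < group.length; k++)
                        if (group[k] == from) group[k] = to;
                }

        Map<Integer, List<String>> clusters = new TreeMap<>();
        for (int i = 0; i < vms.length; i++)
            clusters.computeIfAbsent(group[i], g -> new ArrayList<>()).add(vms[i]);
        System.out.println("inferred co-location groups: " + clusters.values());
    }
}
```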


Cluster Computing | 2014

Nephele streaming: stream processing under QoS constraints at scale

Björn Lohrmann; Daniel Warneke; Odej Kao

The ability to process large numbers of continuous data streams in a near-real-time fashion has become a crucial prerequisite for many scientific and industrial use cases in recent years. While the individual data streams are usually trivial to process, their aggregated data volumes easily exceed the scalability of traditional stream processing systems. At the same time, massively-parallel data processing systems like MapReduce or Dryad currently enjoy a tremendous popularity for data-intensive applications and have proven to scale to large numbers of nodes. Many of these systems also provide streaming capabilities. However, unlike traditional stream processors, these systems have disregarded QoS requirements of prospective stream processing applications so far. In this paper we address this gap. First, we analyze common design principles of today’s parallel data processing frameworks and identify those principles that provide degrees of freedom in trading off the QoS goals latency and throughput. Second, we propose a highly distributed scheme which allows these frameworks to detect violations of user-defined QoS constraints and optimize the job execution without manual interaction. As a proof of concept, we implemented our approach for our massively-parallel data processing framework Nephele and evaluated its effectiveness through a comparison with Hadoop Online. For an example streaming application from the multimedia domain running on a cluster of 200 nodes, our approach improves the processing latency by a factor of at least 13 while preserving high data throughput when needed.

Collaboration


Dive into Daniel Warneke's collaborations.

Top Co-Authors

Odej Kao, Technical University of Berlin
Dominic Battré, Technical University of Berlin
Stephan Ewen, Technical University of Berlin
Volker Markl, Technical University of Berlin
Fabian Hueske, Technical University of Berlin
Matthias Hovestadt, Technical University of Berlin
Alexander Alexandrov, Technical University of Berlin
Björn Lohrmann, Technical University of Berlin
Andreas Kliem, Technical University of Berlin
Erik Nijkamp, Technical University of Berlin