Stephan Ewen
Technical University of Berlin
Publication
Featured research published by Stephan Ewen.
very large data bases | 2014
Alexander Alexandrov; Rico Bergmann; Stephan Ewen; Johann Christoph Freytag; Fabian Hueske; Arvid Heise; Odej Kao; Marcus Leich; Ulf Leser; Volker Markl; Felix Naumann; Mathias Peters; Astrid Rheinländer; Matthias J. Sax; Sebastian Schelter; Mareike Höger; Kostas Tzoumas; Daniel Warneke
We present Stratosphere, an open-source software stack for parallel data analysis. Stratosphere brings together a unique set of features that allow the expressive, easy, and efficient programming of analytical applications at very large scale. Stratosphere’s features include “in situ” data processing, a declarative query language, treatment of user-defined functions as first-class citizens, automatic program parallelization and optimization, support for iterative programs, and a scalable and efficient execution engine. Stratosphere covers a variety of “Big Data” use cases, such as data warehousing, information extraction and integration, data cleansing, graph analysis, and statistical analysis applications. In this paper, we present the overall system architecture and design decisions, introduce Stratosphere through example queries, and then dive into the internal workings of the system’s components that relate to extensibility, programming model, optimization, and query execution. We experimentally compare Stratosphere against popular open-source alternatives, and we conclude with a research outlook for the coming years.
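For readers unfamiliar with the system, the flavor of a Stratosphere program can be seen in a minimal word count written against the Apache Flink DataSet API, Stratosphere's direct descendant (a toy sketch, not code from the paper; the original eu.stratosphere packages differed slightly in naming):

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class WordCount {
    public static void main(String[] args) throws Exception {
        // The environment hides cluster details; parallelization is automatic.
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<String> text = env.fromElements("big data", "big analytics");

        DataSet<Tuple2<String, Integer>> counts = text
            // The user-defined function is a first-class citizen of the program.
            .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                @Override
                public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                    for (String word : line.split("\\s+")) {
                        out.collect(new Tuple2<>(word, 1));
                    }
                }
            })
            .groupBy(0)   // declarative grouping; the optimizer picks the execution strategy
            .sum(1);

        counts.print();   // triggers execution of the optimized plan
    }
}
```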
symposium on cloud computing | 2010
Dominic Battré; Stephan Ewen; Fabian Hueske; Odej Kao; Volker Markl; Daniel Warneke
We present a parallel data processor centered around a programming model of so-called Parallelization Contracts (PACTs) and the scalable parallel execution engine Nephele [18]. The PACT programming model is a generalization of the well-known map/reduce programming model, extending it with further second-order functions, as well as with Output Contracts that give guarantees about the behavior of a function. We describe methods to transform a PACT program into a data flow for Nephele, which executes its sequential building blocks in parallel and deals with communication, synchronization and fault tolerance. Our definition of PACTs allows us to apply several types of optimizations to the data flow during the transformation. The system as a whole is designed to be as generic as (and compatible with) map/reduce systems, while overcoming several of their major weaknesses: 1) The functions map and reduce alone are not sufficient to express many data processing tasks both naturally and efficiently. 2) Map/reduce ties a program to a single fixed execution strategy, which is robust but highly suboptimal for many tasks. 3) Map/reduce makes no assumptions about the behavior of the functions. Hence, it offers only very limited optimization opportunities. With a set of examples and experiments, we illustrate how our system is able to naturally represent and efficiently execute several tasks that do not fit the map/reduce model well.
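As a rough illustration of how a second-order function beyond map and reduce works, here is a single-machine toy model of the MATCH contract, which pairs records from two inputs sharing a key and invokes the user-defined first-order function once per pair. All names are illustrative; this models the semantics described in the paper, not the actual PACT/Nephele API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

/** Toy single-machine model of the MATCH contract's semantics;
 *  the real system runs each UDF invocation in parallel on Nephele. */
public class MatchContract {

    /** MATCH pairs records from two inputs that share a key and calls the
     *  user-defined first-order function once per matching pair. */
    static <K, A, B, R> List<R> match(Map<K, List<A>> left, Map<K, List<B>> right,
                                      BiFunction<A, B, R> udf) {
        List<R> out = new ArrayList<>();
        for (K key : left.keySet()) {
            if (!right.containsKey(key)) continue;       // keys must appear in both inputs
            for (A a : left.get(key))
                for (B b : right.get(key))
                    out.add(udf.apply(a, b));            // independent, parallelizable calls
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> orders = Map.of("alice", List.of(10, 20));
        Map<String, List<String>> names  = Map.of("alice", List.of("Alice A."));
        // A relational join expressed via MATCH, which plain map/reduce cannot state directly.
        System.out.println(match(orders, names, (amount, name) -> name + " spent " + amount));
    }
}
```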
very large data bases | 2012
Stephan Ewen; Kostas Tzoumas; Moritz Kaufmann; Volker Markl
Parallel dataflow systems are a central part of most analytic pipelines for big data. The iterative nature of many analysis and machine learning algorithms, however, is still a challenge for current systems. While certain types of bulk iterative algorithms are supported by novel dataflow frameworks, these systems cannot exploit the computational dependencies present in many algorithms, such as graph algorithms. As a result, these algorithms are executed inefficiently, which has led to specialized systems based on other paradigms, such as message passing or shared memory. We propose a method to integrate incremental iterations, a form of workset iterations, with parallel dataflows. After showing how to integrate bulk iterations into a dataflow system and its optimizer, we present an extension to the programming model for incremental iterations. The extension compensates for the lack of mutable state in dataflows and allows exploiting the sparse computational dependencies inherent in many iterative algorithms. The evaluation of a prototypical implementation shows that, when exploited, these aspects lead to up to two orders of magnitude speedup in algorithm runtime. In our experiments, the improved dataflow system is highly competitive with specialized systems while maintaining a transparent and unified dataflow abstraction.
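The workset (incremental) iteration described here survives in the Apache Flink DataSet API as the delta iteration. The following connected-components sketch, adapted from Flink's canonical example, shows the two data structures involved: a solution set holding the current state and a workset holding only the elements that changed in the last superstep (a condensed sketch; minor signatures vary across Flink versions):

```java
import org.apache.flink.api.common.functions.FlatJoinFunction;
import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.aggregation.Aggregations;
import org.apache.flink.api.java.operators.DeltaIteration;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class ConnectedComponents {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // (vertexId, componentId): each vertex starts in its own component.
        DataSet<Tuple2<Long, Long>> vertices = env.fromElements(
                new Tuple2<>(1L, 1L), new Tuple2<>(2L, 2L), new Tuple2<>(3L, 3L));
        DataSet<Tuple2<Long, Long>> edges = env.fromElements(
                new Tuple2<>(1L, 2L), new Tuple2<>(2L, 3L));

        // Solution set: current components; workset: vertices changed last superstep.
        DeltaIteration<Tuple2<Long, Long>, Tuple2<Long, Long>> iteration =
                vertices.iterateDelta(vertices, 100, 0);   // solution set keyed on field 0

        DataSet<Tuple2<Long, Long>> changes = iteration.getWorkset()
                .join(edges).where(0).equalTo(0)
                .with(new JoinFunction<Tuple2<Long, Long>, Tuple2<Long, Long>, Tuple2<Long, Long>>() {
                    @Override
                    public Tuple2<Long, Long> join(Tuple2<Long, Long> vertex, Tuple2<Long, Long> edge) {
                        return new Tuple2<>(edge.f1, vertex.f1);   // offer component id to neighbor
                    }
                })
                .groupBy(0).aggregate(Aggregations.MIN, 1)         // keep the smallest offer
                .join(iteration.getSolutionSet()).where(0).equalTo(0)
                .with(new FlatJoinFunction<Tuple2<Long, Long>, Tuple2<Long, Long>, Tuple2<Long, Long>>() {
                    @Override
                    public void join(Tuple2<Long, Long> candidate, Tuple2<Long, Long> current,
                                     Collector<Tuple2<Long, Long>> out) {
                        if (candidate.f1 < current.f1) out.collect(candidate);  // only real changes
                    }
                });

        // The delta updates the solution set and becomes the next workset;
        // the iteration terminates as soon as the workset is empty.
        iteration.closeWith(changes, changes).print();
    }
}
```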
conference on information and knowledge management | 2013
Sebastian Schelter; Stephan Ewen; Kostas Tzoumas; Volker Markl
Executing data-parallel iterative algorithms on large datasets is crucial for many advanced analytical applications in the fields of data mining and machine learning. Current systems for executing iterative tasks in large clusters typically achieve fault tolerance through rollback recovery. The principle behind this pessimistic approach is to periodically checkpoint the algorithm state. Upon failure, the system restores a consistent state from a previously written checkpoint and resumes execution from that point. We propose an optimistic recovery mechanism using algorithmic compensations. Our method leverages the robust, self-correcting nature of a large class of fixpoint algorithms used in data mining and machine learning, which converge to the correct solution from various intermediate consistent states. In the case of a failure, we apply a user-defined compensate function that algorithmically creates such a consistent state, instead of rolling back to a previous checkpointed state. Our optimistic recovery does not checkpoint any state and hence achieves optimal failure-free performance with respect to the overhead necessary for guaranteeing fault tolerance. We illustrate the applicability of this approach for three broad classes of problems. Furthermore, we show how to implement the proposed optimistic recovery mechanism in a data flow system. Similar to the Combine operator in MapReduce, our proposed functionality is optional and can be applied to increase performance without changing the semantics of programs. In an experimental evaluation on large datasets, we show that our proposed approach provides optimal failure-free performance. In the absence of failures, our optimistic scheme is able to outperform a pessimistic approach by a factor of two to five. In the presence of failures, our approach provides fast recovery and outperforms pessimistic approaches in the majority of cases.
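The compensate function itself is user code. A plausible shape for PageRank, where any rank vector summing to one is a valid intermediate state, might look as follows; the interface name and signature are hypothetical, chosen only to illustrate the paper's idea:

```java
import java.util.Arrays;

/** Hypothetical shape of a user-defined compensation function: after a failure,
 *  it algorithmically rebuilds a consistent state for the lost partitions
 *  instead of restoring a checkpoint. Interface and signature are illustrative. */
interface CompensationFunction {
    double[] compensate(int partitionId, int partitionSize, long totalVertices);
}

class PageRankCompensation implements CompensationFunction {
    @Override
    public double[] compensate(int partitionId, int partitionSize, long totalVertices) {
        // PageRank converges from any rank vector; re-initializing the lost
        // ranks uniformly yields a consistent state to continue from.
        double[] ranks = new double[partitionSize];
        Arrays.fill(ranks, 1.0 / totalVertices);
        return ranks;
    }
}
```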
international conference on management of data | 2013
Stephan Ewen; Sebastian Schelter; Kostas Tzoumas; Daniel Warneke; Volker Markl
Iterative algorithms occur in many domains of data analysis, such as machine learning or graph analysis. With increasing interest in running those algorithms on very large data sets, we see a need for new techniques to execute iterations in a massively parallel fashion. In prior work, we have shown how to extend and use a parallel data flow system to efficiently run iterative algorithms in a shared-nothing environment. Our approach supports the incremental processing nature of many of those algorithms. In this demonstration proposal we illustrate the process of implementing, compiling, optimizing, and executing iterative algorithms on Stratosphere using examples from graph analysis and machine learning. For the first step, we show the algorithms' code and a visualization of the produced data flow programs. The second step shows the optimizer's execution plan choices, while the last phase monitors the execution of the program, visualizing the state of the operators and additional metrics, such as per-iteration runtime and number of updates. To show that the data flow abstraction supports easy creation of custom programming APIs, we also present programs written against a lightweight Pregel API that is layered on top of our system with little programming effort.
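To give a sense of what such a lightweight Pregel layer exposes, here is a hypothetical vertex-centric API with single-source shortest paths as the user program. All class and method names are illustrative; the demo's actual API is not reproduced in the abstract:

```java
/** Hypothetical minimal Pregel-style layer of the kind the demo describes;
 *  all names are illustrative. */
abstract class VertexProgram<V, M> {
    abstract void compute(long vertexId, V value, Iterable<M> messages, Context<V, M> ctx);

    interface Context<V, M> {
        void setValue(V newValue);        // update the vertex state
        void sendToNeighbors(M message);  // delivered in the next superstep
        void voteToHalt();                // deactivate until a message arrives
    }
}

/** Single-source shortest paths with unit edge weights in this model. */
class SSSP extends VertexProgram<Double, Double> {
    @Override
    void compute(long vertexId, Double distance, Iterable<Double> messages,
                 Context<Double, Double> ctx) {
        double best = distance;
        for (double m : messages) best = Math.min(best, m);  // shortest offer from neighbors
        if (best < distance) {
            ctx.setValue(best);
            ctx.sendToNeighbors(best + 1.0);  // relax neighbors
        }
        ctx.voteToHalt();  // reactivated only by incoming messages
    }
}
```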
very large data bases | 2017
Paris Carbone; Stephan Ewen; Gyula Fóra; Seif Haridi; Stefan Richter; Kostas Tzoumas
Stream processors are emerging in industry as an apparatus that drives not only analytical but also mission-critical services handling the core of persistent application logic. Thus, apart from scalability and low latency, a rising system need is first-class support for application state together with strong consistency guarantees, and adaptivity to cluster reconfigurations, software patches and partial failures. Although prior systems research has addressed some of these specific problems, the practical challenge lies in how such guarantees can be materialized in a transparent, non-intrusive manner that relieves the user from unnecessary constraints. Such needs served as the main design principles of state management in Apache Flink, an open source, scalable stream processor. We present Flink's core pipelined, in-flight mechanism which guarantees the creation of lightweight, consistent, distributed snapshots of application state, progressively, without impacting continuous execution. Consistent snapshots cover all needs for system reconfiguration, fault tolerance and version management through coarse-grained rollback recovery. Application state is declared explicitly to the system, allowing efficient partitioning and transparent commits to persistent storage. We further present Flink's backend implementations and mechanisms for high availability, external state queries and output commit. Finally, we demonstrate how these mechanisms behave in practice with metrics and large-deployment insights exhibiting the low performance trade-offs of our approach and the general benefits of exploiting asynchrony in continuous, yet sustainable system deployments.
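In Apache Flink's DataStream API, the explicit state declaration the abstract refers to looks roughly like the following condensed sketch (signatures vary slightly across Flink versions). Because the state is registered with the runtime, it is automatically partitioned by key, included in distributed snapshots, and restored on recovery:

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

/** Per-key running sum whose state is declared to the runtime and therefore
 *  covered by Flink's consistent distributed snapshots. */
public class RunningSum extends RichFlatMapFunction<Tuple2<String, Long>, Tuple2<String, Long>> {
    private transient ValueState<Long> sum;

    @Override
    public void open(Configuration parameters) {
        // Explicit declaration enables partitioning, snapshotting and recovery.
        sum = getRuntimeContext().getState(new ValueStateDescriptor<>("sum", Types.LONG));
    }

    @Override
    public void flatMap(Tuple2<String, Long> in, Collector<Tuple2<String, Long>> out) throws Exception {
        long current = (sum.value() == null) ? 0L : sum.value();
        current += in.f1;
        sum.update(current);                        // update participates in snapshots
        out.collect(new Tuple2<>(in.f0, current));
    }
}
```

Applied on a keyed stream as `stream.keyBy(t -> t.f0).flatMap(new RunningSum())`, with periodic snapshotting enabled via `env.enableCheckpointing(...)`.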
international conference on management of data | 2015
Sergey Dudoladov; Chen Xu; Sebastian Schelter; Asterios Katsifodimos; Stephan Ewen; Kostas Tzoumas; Volker Markl
Over the past years, parallel dataflow systems have been employed for advanced analytics in the field of data mining where many algorithms are iterative. These systems typically provide fault tolerance by periodically checkpointing the algorithm's state and, in case of failure, restoring a consistent state from a checkpoint. In prior work, we presented an optimistic recovery mechanism that in certain cases eliminates the need to checkpoint the intermediate state of an iterative algorithm. In case of failure, our mechanism uses a compensation function to transition the algorithm to a consistent state, from which the execution can continue and successfully converge. Since this recovery mechanism does not checkpoint any state, it achieves optimal failure-free performance while guaranteeing fault tolerance. In this paper, we demonstrate our recovery mechanism with the Apache Flink data processing engine. During our demonstration, attendees will be able to run graph algorithms and trigger failures to observe the algorithms recovering with compensation functions instead of checkpoints.
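For connected components, one of the graph algorithms such a demo typically runs, a compensation could simply reset lost labels to the vertex ids; since labels only decrease toward the fixpoint, the algorithm re-converges. A minimal hypothetical sketch, not the demo's actual code:

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of a compensation for connected components, the kind of function the
 *  demo lets attendees trigger: lost labels are reset to the vertex ids. */
public class ComponentsCompensation {
    static Map<Long, Long> compensate(Iterable<Long> lostVertexIds) {
        Map<Long, Long> labels = new HashMap<>();
        for (long v : lostVertexIds) labels.put(v, v);  // label(v) = v is a consistent state
        return labels;
    }
}
```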
Proceedings of the 1st Workshop on Architectures and Systems for Big Data | 2011
Alexander Alexandrov; Berni Schiefer; John Poelman; Stephan Ewen; Thomas O. Bodner; Volker Markl
The need for efficient data generation for the purposes of testing and benchmarking newly developed massively-parallel data processing systems has increased with the emergence of Big Data problems. As synthetic data model specifications evolve over time, the data generator programs implementing these models have to be adapted continuously, a task that often becomes more tedious as the set of model constraints grows. In this paper we present Myriad, a new parallel data generation toolkit. Data generators created with the toolkit can quickly produce very large datasets in a shared-nothing parallel execution environment, while at the same time preserving cross-partition dependencies, correlations and distributions in the generated data. In addition, we report on our efforts towards a benchmark suite for large-scale parallel analysis systems that uses Myriad for the generation of OLAP-style relational datasets.
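The core trick behind generators of this kind is to make every record a pure function of a seed and its absolute position, so any worker can regenerate any record and cross-partition references stay consistent without communication. The sketch below approximates this with a hash-seeded PRNG; Myriad itself uses hierarchically seeded PRNGs with efficient skip-ahead, and the API shown is illustrative, not Myriad's:

```java
import java.util.SplittableRandom;

/** Sketch of the core idea behind parallel generation: record i is a pure
 *  function of (seed, i), so any worker can regenerate any record and
 *  cross-partition references stay consistent without communication. */
public class PartitionedGenerator {
    private final long seed;

    PartitionedGenerator(long seed) { this.seed = seed; }

    /** Deterministic record at absolute position i, regardless of the caller. */
    String record(long i) {
        SplittableRandom rng = new SplittableRandom(seed ^ (i * 0x9E3779B97F4A7C15L));
        long customerId = rng.nextLong(1_000_000);      // e.g. a foreign key into another table,
        double amount   = 10 + 990 * rng.nextDouble();  // reproducible on every node
        return i + "," + customerId + "," + String.format("%.2f", amount);
    }

    public static void main(String[] args) {
        PartitionedGenerator gen = new PartitionedGenerator(42L);
        long total = 1_000, workers = 4, worker = 2;     // this node's partition id
        long from = worker * (total / workers), to = (worker + 1) * (total / workers);
        for (long i = from; i < Math.min(to, from + 3); i++)
            System.out.println(gen.record(i));           // first rows of partition 2
    }
}
```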
international conference on management of data | 2014
Vasiliki Kalavri; Stephan Ewen; Kostas Tzoumas; Vladimir Vlassov; Volker Markl; Seif Haridi
Iterative computations are at the core of large-scale graph processing. In these applications, a set of parameters is continuously refined until a fixed point is reached. Such fixed point iterations often exhibit non-uniform computational behavior, where changes propagate with different speeds throughout the parameter set, making parameters active or inactive during iterations. This asymmetrical behavior can lead to many redundant computations if not exploited. Many specialized graph processing systems and APIs exist that run iterative algorithms efficiently by exploiting this asymmetry. However, their functionality is sometimes vaguely defined, and due to their differing programming models and terminology, it is often challenging to establish equivalences between them. We describe an optimization framework for iterative graph processing, which utilizes dataset dependencies. We explain several optimization techniques that exploit the asymmetrical behavior of graph algorithms. We formally specify the conditions under which an algorithm can use a certain technique. We also design template execution plans using a canonical set of dataflow operators, and we evaluate them using real-world datasets and applications. Our experiments show that optimized plans can significantly reduce execution time, often by an order of magnitude. Based on our experiments, we identify a trade-off that can be easily captured and could serve as the basis for automatic optimization of large-scale graph-processing applications.
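The asymmetry the paper exploits can be made concrete with an active-set loop: instead of recomputing every parameter each round, only vertices whose inputs changed are revisited. A self-contained toy version for minimum-label propagation (not code from the paper):

```java
import java.util.*;

/** Toy active-set fixpoint: track a frontier of changed vertices and recompute
 *  only their neighbors, instead of sweeping the whole parameter set. */
public class ActiveSetFixpoint {
    public static void main(String[] args) {
        // Tiny undirected graph as adjacency lists; labels converge to the
        // minimum vertex id reachable from each vertex.
        Map<Integer, List<Integer>> adj = Map.of(
            1, List.of(2), 2, List.of(1, 3), 3, List.of(2), 4, List.of());
        Map<Integer, Integer> label = new HashMap<>();
        adj.keySet().forEach(v -> label.put(v, v));

        Deque<Integer> active = new ArrayDeque<>(adj.keySet());  // initially all vertices
        Set<Integer> queued = new HashSet<>(adj.keySet());
        while (!active.isEmpty()) {
            int v = active.poll();
            queued.remove(v);
            for (int n : adj.get(v)) {
                if (label.get(v) < label.get(n)) {      // a change propagates...
                    label.put(n, label.get(v));
                    if (queued.add(n)) active.add(n);   // ...so only n is reactivated
                }
            }
        }
        System.out.println(label);  // converges to {1=1, 2=1, 3=1, 4=4} (map order may vary)
    }
}
```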
IEEE Data(base) Engineering Bulletin | 2015
Paris Carbone; Asterios Katsifodimos; Stephan Ewen; Volker Markl; Seif Haridi; Kostas Tzoumas