Asterios Katsifodimos

Publication

Featured research published by Asterios Katsifodimos.

International Conference on Management of Data | 2015

Implicit Parallelism through Deep Language Embedding

Alexander Alexandrov; Andreas Kunft; Asterios Katsifodimos; Felix Schüler; Lauritz Thamsen; Odej Kao; Tobias Herb; Volker Markl

The appeal of MapReduce has spawned a family of systems that implement or extend it. In order to enable parallel collection processing with User-Defined Functions (UDFs), these systems expose extensions of the MapReduce programming model as library-based dataflow APIs that are tightly coupled to their underlying runtime engine. Expressing data analysis algorithms with complex data and control flow structure using such APIs reveals a number of limitations that impede programmer productivity. In this paper we show that the design of data analysis languages and APIs from a runtime engine point of view bloats the APIs with low-level primitives and affects programmer productivity. Instead, we argue that an approach based on deeply embedding the APIs in a host language can address the shortcomings of current data analysis languages. To demonstrate this, we propose a language for complex data analysis embedded in Scala, which (i) allows for declarative specification of dataflows and (ii) hides the notion of data-parallelism and distributed runtime behind a suitable intermediate representation. We describe a compiler pipeline that facilitates efficient data-parallel processing without imposing runtime engine-bound syntactic or semantic restrictions on the structure of the input programs. We present a series of experiments with two state-of-the-art systems that demonstrate the optimization potential of our approach.

International Conference on Management of Data | 2015

Optimistic Recovery for Iterative Dataflows in Action

Sergey Dudoladov; Chen Xu; Sebastian Schelter; Asterios Katsifodimos; Stephan Ewen; Kostas Tzoumas; Volker Markl

Over the past years, parallel dataflow systems have been employed for advanced analytics in the field of data mining where many algorithms are iterative. These systems typically provide fault tolerance by periodically checkpointing the algorithm's state and, in case of failure, restoring a consistent state from a checkpoint. In prior work, we presented an optimistic recovery mechanism that in certain cases eliminates the need to checkpoint the intermediate state of an iterative algorithm. In case of failure, our mechanism uses a compensation function to transition the algorithm to a consistent state, from which the execution can continue and successfully converge. Since this recovery mechanism does not checkpoint any state, it achieves optimal failure-free performance while guaranteeing fault tolerance. In this paper, we demonstrate our recovery mechanism with the Apache Flink data processing engine. During our demonstration, attendees will be able to run graph algorithms and trigger failures to observe the algorithms recovering with compensation functions instead of checkpoints.
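
The idea of recovering through a compensation function can be illustrated with a small, single-machine sketch. It is not the mechanism from the paper or any Apache Flink API: the graph `LINKS`, the functions `pagerank_step`, `compensate`, and `run`, and the choice of PageRank are all assumptions made for illustration. A simulated failure wipes the ranks of some vertices; the compensation resets them to a uniform value and rescales, and the iteration then continues without any checkpoint.

```python
def pagerank_step(ranks, links, damping=0.85):
    """One synchronous PageRank iteration over a graph without dangling vertices."""
    n = len(ranks)
    new_ranks = {v: (1.0 - damping) / n for v in ranks}
    for v, targets in links.items():
        share = damping * ranks[v] / len(targets)
        for t in targets:
            new_ranks[t] += share
    return new_ranks


def compensate(ranks, lost):
    """Hypothetical compensation function: give lost vertices a uniform rank,
    then rescale so that all ranks sum to 1 again."""
    n = len(ranks)
    repaired = {v: (1.0 / n if v in lost else r) for v, r in ranks.items()}
    total = sum(repaired.values())
    return {v: r / total for v, r in repaired.items()}


def run(links, iterations, fail_at=None, lost=()):
    """Iterate PageRank; optionally simulate a failure before iteration `fail_at`."""
    ranks = {v: 1.0 / len(links) for v in links}
    for i in range(iterations):
        if i == fail_at:
            # No checkpoint is restored; the state is only made consistent again.
            ranks = compensate(ranks, set(lost))
        ranks = pagerank_step(ranks, links)
    return ranks


# Illustrative graph; every vertex has at least one outgoing link.
LINKS = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
clean = run(LINKS, 200)
recovered = run(LINKS, 200, fail_at=5, lost={"b", "c"})
```

In this sketch both runs converge to the same ranks because the PageRank fixpoint does not depend on the starting distribution, which is the property that makes such a compensation safe here.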

International Conference on Management of Data | 2016

Bridging the gap: towards optimization across linear and relational algebra

Andreas Kunft; Alexander Alexandrov; Asterios Katsifodimos; Volker Markl

Advanced data analysis typically requires some form of pre-processing in order to extract and transform data before processing it with machine learning and statistical analysis techniques. Pre-processing pipelines are naturally expressed in dataflow APIs (e.g., MapReduce or Flink), while machine learning is expressed in linear algebra with iterations. Programmers therefore perform end-to-end data analysis utilizing multiple programming paradigms and systems. This impedance mismatch not only hinders productivity but also rules out optimization opportunities, such as sharing of physical data layouts (e.g., partitioning) and data structures among different parts of a data analysis program. The goal of this work is twofold. First, it aims to alleviate the impedance mismatch by allowing programmers to author complete end-to-end programs in one engine-independent language that is automatically parallelized. Second, it aims to enable joint optimizations over both relational and linear algebra. To achieve this goal, we present the design of Lara, a deeply embedded language in Scala which enables authoring scalable programs using two abstract data types (DataBag and Matrix) and control flow constructs. Programs written in Lara are compiled to an intermediate representation (IR) which enables optimizations across linear and relational algebra. The IR is finally used to compile code for different execution engines.

Conference on Information and Knowledge Management | 2016

Cutty: Aggregate Sharing for User-Defined Windows

Paris Carbone; Jonas Traub; Asterios Katsifodimos; Seif Haridi; Volker Markl

Aggregation queries on data streams are evaluated over evolving and often overlapping logical views called windows. While the aggregation of periodic windows was extensively studied in the past through the use of aggregate sharing techniques such as Panes and Pairs, little to no work has been put into optimizing the aggregation of very common, non-periodic windows. Typical examples of non-periodic windows are punctuations and sessions, which can implement complex business logic and are often expressed as user-defined operators on platforms such as Google Dataflow or Apache Storm. The aggregation of such non-periodic or user-defined windows either falls back to expensive, best-effort aggregate sharing methods, or is not optimized at all. In this paper we present a technique to perform efficient aggregate sharing for data stream windows, which are declared as user-defined functions (UDFs) and can contain arbitrary business logic. To this end, we first introduce the concept of User-Defined Windows (UDWs), a simple, UDF-based programming abstraction that allows users to programmatically define custom windows. We then define semantics for UDWs, based on which we design Cutty, a low-cost aggregate sharing technique. Cutty improves and outperforms the state of the art for aggregate sharing on single and multiple queries. Moreover, it enables aggregate sharing for a broad class of non-periodic UDWs. We implemented our techniques on Apache Flink, an open source stream processing system, and performed experiments demonstrating orders of magnitude of reduction in aggregation costs compared to the state of the art.
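
The general idea behind aggregate sharing can be sketched as follows. This is a simplified illustration of slicing with shared partial aggregates, not the Cutty algorithm itself; the function `shared_window_sums`, the half-open index ranges used as windows, and the sample data are assumptions. The stream is cut at every window boundary, each slice is aggregated once, and overlapping windows reuse those partial sums instead of re-reading the raw values.

```python
from bisect import bisect_left


def shared_window_sums(values, windows):
    """Sum `values` over possibly overlapping windows, sharing partial aggregates.

    `windows` is a list of half-open (start, end) index ranges into `values`.
    """
    # Cut the stream at every window start and end.
    cuts = sorted({bound for window in windows for bound in window})
    # Aggregate each slice between two consecutive cuts exactly once.
    partials = [sum(values[lo:hi]) for lo, hi in zip(cuts, cuts[1:])]
    results = []
    for start, end in windows:
        first = bisect_left(cuts, start)  # index of the slice beginning at `start`
        last = bisect_left(cuts, end)     # index of the first slice past the window
        results.append(sum(partials[first:last]))
    return results


values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
windows = [(0, 4), (2, 7), (3, 10)]  # irregular, overlapping windows
sums = shared_window_sums(values, windows)
```

The windows here are deliberately non-periodic: no fixed size or slide is assumed, only that each window can be described by its boundaries.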

International Conference on Management of Data | 2016

Emma in Action: Declarative Dataflows for Scalable Data Analysis

Alexander Alexandrov; Andreas Salzmann; Georgi Krastev; Asterios Katsifodimos; Volker Markl

Parallel dataflow APIs based on second-order functions were originally seen as a flexible alternative to SQL. Over time, however, their complexity increased due to the number of physical aspects that had to be exposed by the underlying engines in order to facilitate efficient execution. To retain a sufficient level of abstraction and lower the barrier of entry for data scientists, projects like Spark and Flink currently offer domain-specific APIs on top of their parallel collection abstractions. This demonstration highlights the benefits of an alternative design based on deep language embedding. We showcase Emma, a programming language embedded in Scala. Emma promotes parallel collection processing through native constructs like Scala's for-comprehensions, a declarative syntax akin to SQL. In addition, Emma also advocates quasi-quoting the entire data analysis algorithm rather than its individual dataflow expressions. This allows for decomposing the quoted code into (sequential) control flow and (parallel) dataflow fragments, optimizing the dataflows in context, and transparently offloading them to an engine like Spark or Flink. The proposed design promises increased programmer productivity due to avoiding an impedance mismatch, thereby reducing the lag times and cost of data analysis.

IEEE International Conference on Cloud Engineering | 2016

Apache Flink: Stream Analytics at Scale

Asterios Katsifodimos; Sebastian Schelter

Apache Flink is an open source system for expressive, declarative, fast, and efficient data analysis on both historical (batch) and real-time (streaming) data. Flink combines the scalability and programming flexibility of distributed MapReduce-like platforms with the efficiency, out-of-core execution, and query optimization capabilities found in parallel databases. At its core, Flink builds on a distributed dataflow runtime that unifies batch and incremental computations over a true-streaming pipelined execution. Its programming model allows for stateful, fault-tolerant computations, flexible user-defined windowing semantics for streaming, and unique support for iterations. Flink is converging into a use-case complete system for parallel data processing with a wide range of top-level libraries, from machine learning to graph processing. Apache Flink originates from the Stratosphere project led by TU Berlin and has led to various scientific papers (e.g., in VLDBJ, SIGMOD, (P)VLDB, ICDE, and HPDC). In this half-day tutorial we will introduce Apache Flink and give a tutorial on its streaming capabilities using concrete examples of application scenarios, focusing on concepts such as stream windowing and stateful operators.

Very Large Data Bases | 2017

BlockJoin: efficient matrix partitioning through joins

Andreas Kunft; Asterios Katsifodimos; Sebastian Schelter; Tilmann Rabl; Volker Markl

Linear algebra operations are at the core of many Machine Learning (ML) programs. At the same time, a considerable amount of the effort for solving data analytics problems is spent in data preparation. As a result, end-to-end ML pipelines often consist of (i) relational operators used for joining the input data, (ii) user-defined functions used for feature extraction and vectorization, and (iii) linear algebra operators used for model training and cross-validation. Often, these pipelines need to scale out to large datasets. In this case, these pipelines are usually implemented on top of dataflow engines like Hadoop, Spark, or Flink. These dataflow engines implement relational operators on row-partitioned datasets. However, efficient linear algebra operators use block-partitioned matrices. As a result, pipelines combining both kinds of operators require rather expensive changes to the physical representation, in particular re-partitioning steps. In this paper, we investigate the potential of reducing shuffling costs by fusing relational and linear algebra operations into specialized physical operators. We present BlockJoin, a distributed join algorithm which directly produces block-partitioned results. To minimize shuffling costs, BlockJoin applies database techniques known from columnar processing, such as index-joins and late materialization, in the context of parallel dataflow engines. Our experimental evaluation shows speedups of up to 6× and the skew resistance of BlockJoin compared to state-of-the-art pipelines implemented in Spark.
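
The difference between the two physical layouts can be made concrete with a small sketch. It only shows the re-partitioning step that the abstract describes as expensive, not BlockJoin; the function `to_blocks`, the dictionary-based block format, and the sample rows are assumptions for illustration. Each input row is split across several blocks, which in a distributed engine means that every row has to be shuffled to multiple destinations.

```python
from collections import defaultdict


def to_blocks(rows, block_size):
    """Convert row-partitioned data into square blocks.

    `rows` is an iterable of (row_index, values) pairs. The result maps a
    block key (block_row, block_col) to a dict from the position inside the
    block to the cell value.
    """
    blocks = defaultdict(dict)
    for r, values in rows:
        for c, v in enumerate(values):
            key = (r // block_size, c // block_size)
            blocks[key][(r % block_size, c % block_size)] = v
    return dict(blocks)


rows = [(0, [1, 2, 3]), (1, [4, 5, 6]), (2, [7, 8, 9])]
blocks = to_blocks(rows, block_size=2)
```

With a block size of 2, the 3×3 example matrix ends up in four blocks, and each row contributes cells to two of them.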

Information Technology | 2016

Apache Flink in current research

Tilmann Rabl; Jonas Traub; Asterios Katsifodimos; Volker Markl

Recent trends in data collection and the decreasing prices of storage result in constantly growing amounts of analyzable data. These masses of data cannot easily be processed by traditional database systems as these do not allow for a sufficient degree of scalability. Programs especially designed for parallel data analysis on large-scale distributed systems are required. Developing such programs on clusters of commodity hardware is a complex challenge for even the most experienced system developers. Frameworks such as Apache Hadoop are scalable, but, when compared to SQL, extremely hard to program. The open-source platform Apache Flink is a link between conventional database systems and big data analysis frameworks. Flink is based on a fault-tolerant runtime for data stream processing, which manages the distribution of data as well as communications within the cluster. A high diversity of use cases can be supported through various interfaces that allow for the implementation of data analysis processes. In this paper, we present an overview of Apache Flink as well as some current research activities on top of the Apache Flink ecosystem.

Handbook of Big Data Technologies | 2017

Large-Scale Data Stream Processing Systems

Paris Carbone; Gábor E. Gévay; Gábor Hermann; Asterios Katsifodimos; Juan Soto; Volker Markl; Seif Haridi

In our data-centric society, online services, decision making, and other aspects are becoming heavily dependent on trends and patterns extracted from data. A broad class of societal-scale data management problems requires system support for processing unbounded data with low latency and high throughput. Large-scale data stream processing systems perceive data as infinite streams and are designed to satisfy such requirements. They have further evolved substantially, both in terms of expressive programming model support and in terms of efficient and durable runtime execution on commodity clusters. Expressive programming models offer convenient ways to declare continuous data properties and applied computations, while hiding details on how these data streams are physically processed and orchestrated in a distributed environment. Execution engines provide a runtime for such models, further allowing for scalable yet durable execution of any declared computation. In this chapter we introduce the major design aspects of large-scale data stream processing systems, covering programming model abstraction levels and runtime concerns. We then present a detailed case study on stateful stream processing with Apache Flink, an open-source stream processor that is used for a wide variety of processing tasks. Finally, we address the main challenges of disruptive applications that large-scale data streaming enables from a systemic point of view.

Symposium on Cloud Computing | 2017

Optimized on-demand data streaming from sensor nodes

Jonas Traub; Sebastian Breß; Tilmann Rabl; Asterios Katsifodimos; Volker Markl

Real-time sensor data enables diverse applications such as smart metering, traffic monitoring, and sports analysis. In the Internet of Things, billions of sensor nodes form a sensor cloud and offer data streams to analysis systems. However, it is impossible to transfer all available data with maximal frequencies to all applications. Therefore, we need to tailor data streams to the demand of applications. We contribute a technique that optimizes communication costs while maintaining the desired accuracy. Our technique schedules reads across large numbers of sensors based on the data demands of a large number of concurrent queries. We introduce user-defined sampling functions that define the data demand of queries and facilitate various adaptive sampling techniques, which decrease the amount of transferred data. Moreover, we share sensor reads and data transfers among queries. Our experiments with real-world data show that our approach saves up to 87% in data transmissions.
