Lauritz Thamsen
Technical University of Berlin
Publications
Featured research published by Lauritz Thamsen.
international conference on management of data | 2015
Alexander Alexandrov; Andreas Kunft; Asterios Katsifodimos; Felix Schüler; Lauritz Thamsen; Odej Kao; Tobias Herb; Volker Markl
The appeal of MapReduce has spawned a family of systems that implement or extend it. In order to enable parallel collection processing with User-Defined Functions (UDFs), these systems expose extensions of the MapReduce programming model as library-based dataflow APIs that are tightly coupled to their underlying runtime engine. Expressing data analysis algorithms with complex data and control flow structure using such APIs reveals a number of limitations that impede programmers' productivity. In this paper we show that the design of data analysis languages and APIs from a runtime engine point of view bloats the APIs with low-level primitives and affects programmers' productivity. Instead, we argue that an approach based on deeply embedding the APIs in a host language can address the shortcomings of current data analysis languages. To demonstrate this, we propose a language for complex data analysis embedded in Scala, which (i) allows for declarative specification of dataflows and (ii) hides the notion of data-parallelism and distributed runtime behind a suitable intermediate representation. We describe a compiler pipeline that facilitates efficient data-parallel processing without imposing runtime engine-bound syntactic or semantic restrictions on the structure of the input programs. We present a series of experiments with two state-of-the-art systems that demonstrate the optimization potential of our approach.
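The core idea of deep embedding can be illustrated with a toy sketch (in Python rather than the paper's Scala, and with invented class names): API calls do not execute immediately but build an intermediate representation, so the whole plan is available for inspection and optimization before anything runs.

```python
# Toy sketch of a deeply embedded dataflow API: map/filter calls record
# operations into an IR instead of executing eagerly, hiding the eventual
# (possibly distributed) execution behind that representation.

class Dataflow:
    def __init__(self, source, ops=()):
        self.source = list(source)
        self.ops = list(ops)  # IR: list of (operation name, function) pairs

    def map(self, f):
        return Dataflow(self.source, self.ops + [("map", f)])

    def filter(self, p):
        return Dataflow(self.source, self.ops + [("filter", p)])

    def execute(self):
        # A real engine would parallelize here; we interpret the IR serially.
        data = self.source
        for op, f in self.ops:
            if op == "map":
                data = [f(x) for x in data]
            else:
                data = [x for x in data if f(x)]
        return data

plan = Dataflow(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print([op for op, _ in plan.ops])  # the plan is data: ['map', 'filter']
print(plan.execute())              # [0, 4, 16, 36, 64]
```

Because the plan is plain data before `execute()`, a compiler pipeline can reorder or fuse operators without syntactic restrictions on the user's program.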
acm conference on systems programming languages and applications software for humanity | 2015
Tim Felgentreff; Jens Lincke; Robert Hirschfeld; Lauritz Thamsen
Development environments which aim to provide short feedback loops to developers must strike a balance between immediacy and the ability to abstract and reuse behavioral modules. The Lively Kernel, a self-supporting, browser-based environment for explorative development, supports standard object-oriented programming with classes or prototypes, but also a more immediate, object-centric approach for modifying and programming visible objects directly. This allows users to quickly create graphical prototypes with concrete objects. However, when developing with the object-centric approach, sharing behavior between similar objects becomes cumbersome. Developers must choose to either abstract behavior into classes, scatter code across collaborating objects, or manually copy code between multiple objects. That is, they must choose between less concrete development, reduced maintainability, or code duplication. In this paper, we propose Lively Groups, an extension to the object-centric development tools of Lively to work on multiple concrete objects. In our approach, developers may dynamically group live objects that share behavior using tags. They can then modify and program such groups as if they were single objects. Our approach scales the Lively Kernel's explorative development approach from one to many objects, while preserving the maintainability of abstractions and the immediacy of concrete objects.
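The tag-based grouping idea can be sketched in a few lines (a Python analogue; the actual system is JavaScript inside Lively, and all names below are invented): a group collects objects by tag and broadcasts state changes and behavior definitions to every member, so many concrete objects are programmed as one.

```python
# Illustrative sketch of tag-based object groups: set() and define() apply
# to all members, sharing behavior without a class abstraction.

class LiveObject:
    def __init__(self, name, tags):
        self.name = name
        self.tags = set(tags)

class Group:
    def __init__(self, objects, tag):
        self.members = [o for o in objects if tag in o.tags]

    def set(self, attr, value):
        for o in self.members:
            setattr(o, attr, value)

    def define(self, name, func):
        # Attach shared behavior to each member object directly.
        for o in self.members:
            setattr(o, name, func.__get__(o))

shapes = [LiveObject("a", {"button"}), LiveObject("b", {"button"}),
          LiveObject("c", {"label"})]
buttons = Group(shapes, "button")
buttons.set("color", "red")
buttons.define("describe", lambda self: f"{self.name}:{self.color}")
print([o.describe() for o in buttons.members])  # ['a:red', 'b:red']
```

The untagged label object is untouched, which mirrors how grouping avoids both code duplication and premature class abstractions.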
international conference on big data | 2016
Lauritz Thamsen; Thomas Renner; Marvin Byfeld; Markus Paeschke; Daniel Schröder; Felix Böhm
Distributed dataflow systems like Spark and Flink allow users to analyze large datasets using clusters of computers. These frameworks provide automatic program parallelization and manage distributed workers, including worker failures. Moreover, they provide high-level programming abstractions and execute programs efficiently. Yet, the programming abstractions remain textual while the dataflow model is essentially a graph of transformations, so there is a mismatch between the presented abstraction and the underlying model. One can also argue that developing dataflow programs with these textual abstractions requires needless amounts of coding and coding skills. A dedicated programming environment could instead allow constructing dataflow programs more interactively and visually. In this paper, we therefore investigate how visual programming can make the development of parallel dataflow programs more accessible. In particular, we built a prototypical visual programming environment for Flink, which we call Flision. Flision provides a graphical user interface for creating dataflow programs, a code generation engine that generates code for Flink, and seamless deployment to a connected cluster. Users of this environment can effectively create jobs by dragging, dropping, and visually connecting operator components. To evaluate the applicability of this approach, we interviewed ten potential users. Our impressions from this qualitative user testing strengthened our belief that visual programming can be a valuable tool for users of scalable data analysis tools.
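The code-generation step of such an environment can be sketched as follows (a hypothetical illustration, not Flision's actual templates; the operator names and emitted API are invented): a visual graph of operator nodes and edges is linearized into a chain of dataflow calls.

```python
# Hypothetical sketch: turn a visual dataflow graph (nodes = operators,
# edges = data links) into chained dataflow code for a linear pipeline.

def generate_code(nodes, edges):
    # nodes: {id: (operator, argument)}; edges: [(from_id, to_id)]
    successor = dict(edges)
    order = []
    node = next(n for n in nodes if n not in {dst for _, dst in edges})  # source
    while node is not None:
        order.append(node)
        node = successor.get(node)
    calls = [f"{op}({arg})" for op, arg in (nodes[n] for n in order)]
    return "env." + ".".join(calls)

nodes = {1: ("readTextFile", '"in.txt"'),
         2: ("filter", "line -> !line.isEmpty()"),
         3: ("writeAsText", '"out.txt"')}
edges = [(1, 2), (2, 3)]
print(generate_code(nodes, edges))
# env.readTextFile("in.txt").filter(line -> !line.isEmpty()).writeAsText("out.txt")
```

A real generator would also handle branching and joining dataflows; the point here is only that the visual graph, not text, is the primary program representation.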
dynamic languages symposium | 2015
Bastian Steinert; Lauritz Thamsen; Tim Felgentreff; Robert Hirschfeld
We present object versioning as a generic approach to preserve access to previous development and application states. Version-aware references can manage the modifications made to the target object and record versions as desired. Such references can be provided without modifications to the virtual machine. We used proxies to implement the proposed concepts and demonstrate the Lively Kernel running on top of this object versioning layer. This enables Lively users to undo the effects of direct manipulation and other programming actions.
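The proxy-based versioning idea can be sketched briefly (a Python analogue, since the paper's implementation targets the Lively Kernel in JavaScript; class and method names here are invented): writes through a version-aware reference first snapshot the target's state, so effects can be undone without virtual machine support.

```python
# Minimal sketch of a version-aware reference: a proxy that records a
# version of the target object's state before every write, enabling undo.

class VersionedProxy:
    def __init__(self, target):
        object.__setattr__(self, "_target", target)
        object.__setattr__(self, "_versions", [])

    def __getattr__(self, name):
        return getattr(self._target, name)

    def __setattr__(self, name, value):
        self._versions.append(dict(self._target.__dict__))  # snapshot first
        setattr(self._target, name, value)

    def undo(self):
        self._target.__dict__.clear()
        self._target.__dict__.update(self._versions.pop())

class Morph:
    def __init__(self):
        self.x = 0

m = VersionedProxy(Morph())
m.x = 10
m.x = 20
m.undo()
print(m.x)  # 10
```

Because all access goes through the reference, the target object itself stays unmodified and ordinary, which is what makes the approach feasible without VM changes.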
ubiquitous intelligence and computing | 2016
Ilya Verbitskiy; Lauritz Thamsen; Odej Kao
With the increasing amount of available data, distributed data processing systems like Apache Flink and Apache Spark have emerged that allow users to analyze large-scale datasets. However, such engines introduce significant computational overhead compared to non-distributed implementations. Therefore, the question arises when using a distributed processing approach is actually beneficial. This paper helps to answer this question with an evaluation of the performance of the distributed data processing framework Apache Flink. In particular, we compare Apache Flink executed on up to 50 cluster nodes to single-threaded implementations executed on a typical laptop for three different benchmarks: TPC-H Query 10, Connected Components, and Gradient Descent. The evaluation shows that the performance of Apache Flink is highly problem-dependent, varying from early outperformance in the case of TPC-H Query 10 to slower runtimes in the case of Connected Components. The reported results indicate for which problems, input sizes, and cluster resources using a distributed data processing system like Apache Flink or Apache Spark is sensible.
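To make the comparison concrete, a single-threaded Connected Components baseline of the kind the paper benchmarks against fits in a dozen lines (this is an illustrative label-propagation variant, not the authors' exact implementation): every vertex repeatedly adopts the smallest label among itself and its neighbors until labels stabilize.

```python
# Single-threaded Connected Components by label propagation: each vertex
# starts with its own id as label and converges to the minimum id of its
# component.

def connected_components(vertices, edges):
    labels = {v: v for v in vertices}
    changed = True
    while changed:
        changed = False
        for a, b in edges:
            low = min(labels[a], labels[b])
            if labels[a] != low or labels[b] != low:
                labels[a] = labels[b] = low
                changed = True
    return labels

labels = connected_components([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)])
print(labels)  # {1: 1, 2: 1, 3: 1, 4: 4, 5: 4}
```

Such a baseline has no serialization, shuffling, or coordination overhead, which is exactly why small inputs can favor the laptop over the cluster.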
international conference on big data | 2015
Thomas Renner; Lauritz Thamsen; Odej Kao
Sharing cluster resources between multiple frameworks, applications and datasets is important for organizations doing large scale data analytics. It improves cluster utilization, avoids standalone clusters running only a single framework and allows data scientists to choose the best framework for each analysis task. Current systems for cluster resource management like YARN or Mesos achieve resource sharing using containers. Analytics frameworks execute their tasks in these containers. However, container placement is currently based predominantly on available computing capabilities in terms of cores and memory and neglects to take the network topology and data locations into account. In this paper, we propose a container placement approach that (a) takes the network topology into account to prevent congestion in the core network and (b) places containers close to input data to improve data locality and reduce remote disk reads in distributed file systems. The main advantage of introducing topology- and data-awareness at the level of container placement is that multiple application frameworks benefit from these improvements. We present a prototype integrated with Hadoop YARN and an evaluation with workloads consisting of different applications and datasets using Apache Flink. Our evaluation on a 64-core cluster, in which nodes are connected through a fat-tree topology, shows promising results with speedups of up to 67% for network-intensive workloads.
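The placement preference can be sketched as a simple scoring function (names and weights below are invented for illustration, not the paper's policy): among candidate nodes, prefer one that stores a block of the input data, then one in the same rack, so remote disk reads and core-switch hops are avoided.

```python
# Illustrative topology- and data-aware placement: score candidate nodes
# by data locality first, rack locality second, and pick the best.

def place_container(nodes, block_locations, rack_of, preferred_rack):
    def score(node):
        s = 0
        if node in block_locations:          # data-local: no remote disk read
            s += 2
        if rack_of[node] == preferred_rack:  # rack-local: no core-network hop
            s += 1
        return s
    return max(nodes, key=score)

rack_of = {"n1": "r1", "n2": "r1", "n3": "r2"}
chosen = place_container(["n1", "n2", "n3"], block_locations={"n3"},
                         rack_of=rack_of, preferred_rack="r1")
print(chosen)  # n3: data locality outweighs rack locality here
```

Because the policy sits at the container-placement level, any framework running in the containers inherits the locality benefit without changes.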
international congress on big data | 2017
Lauritz Thamsen; Benjamin Rabier; Florian Schmidt; Thomas Renner; Odej Kao
Resource management systems like YARN or Mesos enable users to share cluster infrastructures by running analytics jobs in temporarily reserved containers. These containers are typically not isolated, in order to achieve high overall resource utilization despite the often fluctuating resource usage of individual analytics jobs. However, some combinations of jobs utilize the resources better and interfere less with each other than others when running on the same nodes. This paper presents an approach for improving resource utilization and job throughput when scheduling recurring data analysis jobs in shared cluster environments. Using a reinforcement learning algorithm, the scheduler continuously learns which jobs are best executed simultaneously on the cluster. Our evaluation of an implementation built on Hadoop YARN shows that this approach can increase resource utilization and decrease job runtimes. While interference between jobs can be avoided, co-locations of jobs with complementary resource usage are not yet always fully recognized. However, with a better measure of co-location goodness, our solution can be used to automatically adapt the scheduling to workloads with recurring batch jobs.
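The learning loop can be sketched as a simple bandit (a deliberately reduced illustration, not the paper's algorithm; job names and the reward signal are invented): each candidate pairing of recurring job types is an arm, rewarded by a co-location "goodness" measure, and the scheduler picks pairings epsilon-greedily.

```python
# Sketch of learning good co-locations: epsilon-greedy selection over
# pairings of recurring job types, rewarded by a simulated goodness signal.

import random

class CoLocationScheduler:
    def __init__(self, pairs, epsilon=0.1):
        self.epsilon = epsilon
        self.stats = {p: [0.0, 0] for p in pairs}  # pair -> [reward sum, runs]

    def choose(self):
        if random.random() < self.epsilon:
            return random.choice(list(self.stats))  # explore a random pairing
        # exploit: pairing with the highest observed mean reward
        return max(self.stats,
                   key=lambda p: self.stats[p][0] / max(self.stats[p][1], 1))

    def update(self, pair, reward):
        self.stats[pair][0] += reward
        self.stats[pair][1] += 1

random.seed(0)
sched = CoLocationScheduler([("cpu-job", "io-job"), ("cpu-job", "cpu-job")])
for _ in range(100):
    pair = sched.choose()
    # Simulated signal: complementary resource profiles interfere less.
    sched.update(pair, 1.0 if pair == ("cpu-job", "io-job") else 0.3)
means = {p: s / max(n, 1) for p, (s, n) in sched.stats.items()}
print(means[("cpu-job", "io-job")] > means[("cpu-job", "cpu-job")])  # True
```

As the abstract notes, the hard part in practice is the reward: a better measure of co-location goodness directly improves what the learner converges to.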
international conference on data science | 2017
Thomas Renner; Lauritz Thamsen; Odej Kao
Many distributed data analysis jobs are executed repeatedly in production clusters. Examples include daily executed batch jobs and iterative programs. These jobs present an opportunity to learn workload characteristics through continuous fine-grained cluster monitoring. Therefore, based on detailed profiles of resource utilization, data placement, and job runtimes, resource management can in fact adapt to actual workloads. In this paper, we present a system architecture that contains four mechanisms for adaptive resource management, encompassing data placement, resource allocation, and container as well as job scheduling. In particular, we extended Apache Hadoop's scheduling and data placement to improve resource utilization and job runtimes for recurring analytics jobs. Furthermore, we developed a Hadoop submission tool that allows users to reserve resources for specific target runtimes and which uses historical data available from cluster monitoring.
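The reservation idea can be sketched as follows (a hedged illustration; the function name and numbers are invented, and a real tool would interpolate between profiled points): from historical (containers, runtime) profiles of a recurring job, pick the smallest allocation whose observed runtime still meets the user's target.

```python
# Sketch: choose a resource reservation for a recurring job from its
# monitoring history, given a user-provided target runtime.

def containers_for_target(history, target_runtime):
    # history: {container count: observed runtime in seconds}
    feasible = [n for n, runtime in history.items() if runtime <= target_runtime]
    return min(feasible) if feasible else max(history)  # fall back to max scale-out

history = {4: 610, 8: 320, 16: 180, 32: 130}
print(containers_for_target(history, target_runtime=300))  # 16
```

Taking the smallest feasible allocation is what turns monitoring data into better utilization: the job meets its deadline without over-reserving.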
ieee international conference on cloud computing technology and science | 2017
Lauritz Thamsen; Ilya Verbitskiy; Jossekin Beilharz; Thomas Renner; Andreas Polze; Odej Kao
Distributed dataflow systems like MapReduce, Spark, and Flink help users in analyzing large datasets with a set of cluster resources. Performance modeling and runtime prediction are then used for automatically allocating resources for specific performance goals. However, the actual performance of distributed dataflow jobs can vary significantly due to factors like interference with co-located workloads, varying degrees of data locality, and failures. We address this problem with Ellis, a system that allocates an initial set of resources for a specific runtime target, yet also continuously monitors a job's progress towards the target and if necessary dynamically adjusts the allocation. For this, Ellis models the scale-out behavior of individual stages of distributed dataflow jobs based on previous executions. Our evaluation of Ellis with iterative Spark jobs shows that dynamic adjustments can reduce the number of constraint violations by 30.7-75.0% and the magnitude of constraint violations by 70.6-94.5%.
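A per-stage scale-out model in this spirit can be sketched briefly (the model form and all numbers are assumed for illustration, not taken from the paper): fit runtime(n) = a/n + b by linear regression over x = 1/n from previous executions, then pick the smallest scale-out that meets the remaining-time target.

```python
# Sketch: fit a simple scale-out model per stage from past runs and use it
# to choose an adjusted allocation for a runtime target.

def fit_scaleout(samples):
    # samples: [(containers n, runtime t)]; least squares for t = a*(1/n) + b
    xs = [1.0 / n for n, _ in samples]
    ts = [t for _, t in samples]
    mx, mt = sum(xs) / len(xs), sum(ts) / len(ts)
    a = (sum((x - mx) * (t - mt) for x, t in zip(xs, ts))
         / sum((x - mx) ** 2 for x in xs))
    b = mt - a * mx
    return lambda n: a / n + b

predict = fit_scaleout([(2, 110), (4, 60), (8, 35)])
# Scale out until the predicted stage runtime fits the remaining target.
n = next(n for n in range(1, 65) if predict(n) <= 45)
print(n)  # 6
```

Because each stage gets its own model, the allocation can be adjusted mid-job as monitored progress deviates from the prediction.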
international performance computing and communications conference | 2016
Lauritz Thamsen; Ilya Verbitskiy; Florian Schmidt; Thomas Renner; Odej Kao
Distributed dataflow systems like Spark or Flink enable users to analyze large datasets. Users create programs by providing sequential user-defined functions for a set of well-defined operations, select a set of resources, and the systems automatically distribute the jobs across these resources. However, selecting resources for specific performance needs is inherently difficult and users consequently tend to overprovision, which results in poor cluster utilization. At the same time, many important jobs are executed recurringly in production clusters. This paper presents Bell, a practical system that monitors job execution, models the scale-out behavior of jobs based on previous runs, and selects resources according to user-provided runtime targets. Bell automatically chooses between different runtime prediction models to optimally support different distributed dataflow systems. Bell is implemented as a job submission tool for YARN and, thus, works with existing cluster setups. We evaluated Bell's runtime prediction with six exemplary data analytics jobs using both Spark and Flink. We present the learned scale-out models for these jobs and evaluate the relative prediction error using cross-validation, showing that our model selection approach provides better overall performance than the individual prediction models.
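The model-selection step can be sketched as follows (the candidate model forms and data are assumed for illustration, not Bell's actual models): evaluate each candidate scale-out model with leave-one-out cross-validation and keep the one with the lowest relative prediction error.

```python
# Sketch: choose among runtime prediction models by cross-validated
# relative error on historical (containers, runtime) samples.

def loo_error(fit, samples):
    errors = []
    for i, (n, t) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]
        predicted = fit(rest)(n)
        errors.append(abs(predicted - t) / t)  # relative error, held-out point
    return sum(errors) / len(errors)

def fit_inverse(samples):   # t = a/n + b, linear regression over x = 1/n
    xs = [1.0 / n for n, _ in samples]
    ts = [t for _, t in samples]
    mx, mt = sum(xs) / len(xs), sum(ts) / len(ts)
    a = (sum((x - mx) * (t - mt) for x, t in zip(xs, ts))
         / sum((x - mx) ** 2 for x in xs))
    return lambda n, a=a, b=mt - a * mx: a / n + b

def fit_mean(samples):      # baseline: always predict the mean runtime
    mean = sum(t for _, t in samples) / len(samples)
    return lambda n: mean

samples = [(2, 110), (4, 60), (8, 35), (16, 22)]
models = {"inverse": fit_inverse, "mean": fit_mean}
best = min(models, key=lambda name: loo_error(models[name], samples))
print(best)  # inverse
```

Selecting per job rather than fixing one model is what lets a single submission tool support systems with different scaling behavior.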