Network


Latest external collaborations at the country level. Dive into the details by clicking on the dots.

Hotspot


Dive into the research topics where Thilo Kielmann is active.

Publication


Featured research published by Thilo Kielmann.


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 1999

MagPIe: MPI's collective communication operations for clustered wide area systems

Thilo Kielmann; Rutger F. H. Hofman; Henri E. Bal; Aske Plaat; Raoul Bhoedjang

Writing parallel applications for computational grids is a challenging task. To achieve good performance, algorithms designed for local area networks must be adapted to the differences in link speeds. An important class of algorithms is collective operations, such as broadcast and reduce. We have developed MAGPIE, a library of collective communication operations optimized for wide area systems. MAGPIE's algorithms send the minimal amount of data over the slow wide area links, and only incur a single wide area latency. Using our system, existing MPI applications can be run unmodified on geographically distributed systems. On moderate cluster sizes, using a wide area latency of 10 milliseconds and a bandwidth of 1 MByte/s, MAGPIE executes operations up to 10 times faster than MPICH, a widely used MPI implementation; application kernels improve by up to a factor of 4. Due to the structure of our algorithms, MAGPIE's advantage increases for higher wide area latencies.
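
As an illustration of the wide-area optimization described above, here is a minimal Python sketch (not MagPIe code; the LAN latency figure is an assumption) contrasting a naive per-node WAN broadcast with a two-level scheme that pays the wide-area latency only once per cluster.

```python
# Minimal sketch (not the MagPIe implementation): a two-level broadcast
# that crosses the slow WAN only once per remote cluster, then fans out
# locally. The LAN latency below is an illustrative assumption.

import math

WAN_LATENCY = 10e-3   # 10 ms, as in the paper's experiments
LAN_LATENCY = 0.1e-3  # assumed local-area latency

def flat_broadcast_time(num_clusters, nodes_per_cluster):
    # Naive approach: root sends to every remote node over the WAN,
    # one message after another -> many WAN messages.
    remote_nodes = (num_clusters - 1) * nodes_per_cluster
    return remote_nodes * WAN_LATENCY

def hierarchical_broadcast_time(num_clusters, nodes_per_cluster):
    # Wide-area-optimal: one WAN message per remote cluster (overlapped,
    # so a single WAN latency), then a binomial-tree broadcast inside
    # each cluster over the fast LAN.
    local_rounds = math.ceil(math.log2(nodes_per_cluster))
    return WAN_LATENCY + local_rounds * LAN_LATENCY

if __name__ == "__main__":
    for clusters, nodes in [(2, 8), (4, 16)]:
        t_flat = flat_broadcast_time(clusters, nodes)
        t_hier = hierarchical_broadcast_time(clusters, nodes)
        print(f"{clusters} clusters x {nodes} nodes: "
              f"flat {t_flat*1e3:.1f} ms vs hierarchical {t_hier*1e3:.2f} ms")
```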


Mobile Computing, Applications, and Services | 2010

Cuckoo: A Computation Offloading Framework for Smartphones

Roelof Kemp; Nicholas Palmer; Thilo Kielmann; Henri E. Bal

Offloading computation from smartphones to remote cloud resources has recently been rediscovered as a technique to enhance the performance of smartphone applications while reducing their energy usage.
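
The offloading trade-off can be sketched with a simple cost model. This is an illustrative decision rule, not Cuckoo's actual heuristic, and all numbers in the example are assumptions.

```python
# Illustrative offloading decision (not Cuckoo's actual policy): run a
# method remotely only when the estimated remote time, including data
# transfer, beats local execution. All figures are assumptions.

def should_offload(local_time_s, remote_compute_time_s,
                   payload_bytes, bandwidth_bps, rtt_s):
    transfer_time = payload_bytes * 8 / bandwidth_bps + rtt_s
    return remote_compute_time_s + transfer_time < local_time_s

# Example: a 2 s local computation, 10x faster in the cloud, shipping
# 500 kB over a 1 Mbit/s link with 100 ms RTT: transfer dominates.
print(should_offload(2.0, 0.2, 500_000, 1_000_000, 0.1))  # -> False
```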


Concurrency and Computation: Practice and Experience | 2005

Ibis: a Flexible and Efficient Java-based Grid Programming Environment

Rob V. van Nieuwpoort; Jason Maassen; Gosia Wrzesińska; Rutger F. H. Hofman; Ceriel J. H. Jacobs; Thilo Kielmann; Henri E. Bal

In computational Grids, performance‐hungry applications need to simultaneously tap the computational power of multiple, dynamically available sites. The crux of designing Grid programming environments stems exactly from the dynamic availability of compute cycles: Grid programming environments (a) need to be portable to run on as many sites as possible, (b) they need to be flexible to cope with different network protocols and dynamically changing groups of compute nodes, while (c) they need to provide efficient (local) communication that enables high‐performance computing in the first place. Existing programming environments are either portable (Java), or flexible (Jini, Java Remote Method Invocation (RMI)), or highly efficient (Message Passing Interface). No system combines all three properties that are necessary for Grid computing. In this paper, we present Ibis, a new programming environment that combines Java's ‘run everywhere’ portability both with flexible treatment of dynamically available networks and processor pools, and with highly efficient, object‐based communication. Ibis can transfer Java objects very efficiently by combining streaming object serialization with a zero‐copy protocol. Using RMI as a simple test case, we show that Ibis outperforms existing RMI implementations, achieving up to nine times higher throughput with trees of objects.
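
The zero-copy ingredient can be illustrated in a few lines. This is a hypothetical Python analogue, not Ibis's Java implementation: a memoryview lets the transport layer stream a large buffer without making intermediate copies.

```python
# Minimal sketch of the zero-copy idea (illustrative, not Ibis's
# protocol): expose a large buffer to the transport as memoryview
# slices, so no intermediate copy is made while streaming it out.

import socket

def stream_buffer(sock: socket.socket, buf: bytes, chunk: int = 64 * 1024) -> None:
    view = memoryview(buf)  # zero-copy window over the data
    for off in range(0, len(view), chunk):
        # Slicing a memoryview copies nothing; sendall reads in place.
        sock.sendall(view[off:off + chunk])
```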


IEEE International Conference on High Performance Computing, Data, and Analytics | 2003

Enabling Applications on the Grid: A GridLab Overview

Gabrielle Allen; Tom Goodale; Thomas Radke; Michael Russell; Edward Seidel; Kelly Davis; Konstantinos Dolkas; Nikolaos D. Doulamis; Thilo Kielmann; Andre Merzky; Jarek Nabrzyski; Juliusz Pukacki; John Shalf; Ian J. Taylor

Grid technology is emerging rapidly. Still, there is an evident shortage of real Grid users, mostly due to the lack of a “critical mass” of widely deployed and reliable higher-level Grid services, tailored to application needs. The GridLab project aims to provide fundamentally new capabilities for applications to exploit the power of Grid computing, thus bridging the gap between application needs and existing Grid middleware. We present an overview of GridLab, a large-scale, EU-funded Grid project spanning over a dozen groups in Europe and the US. We first outline our vision of Grid-empowered applications and then discuss GridLab’s general architecture and its Grid Application Toolkit (GAT). We illustrate how applications can be Grid-enabled with the GAT and discuss GridLab’s scheduler as an example of GAT services.


International Parallel and Distributed Processing Symposium | 2000

Fast Measurement of LogP Parameters for Message Passing Platforms

Thilo Kielmann; Henri E. Bal; Kees Verstoep

Performance modeling is important for implementing efficient parallel applications and runtime systems. The LogP model captures the relevant aspects of message passing in distributed-memory architectures. In this paper we describe an efficient method that measures LogP parameters for a given message passing platform. Measurements are performed for messages of different sizes, as covered by the parameterized LogP model, a slight extension of LogP and LogGP. To minimize both intrusiveness and completion time of the measurement, we propose a procedure that sends as few messages as possible. An implementation of this procedure, called the MPI LogP benchmark, is available from our WWW site.
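
The measurement idea can be sketched as follows. Here `send` and `recv_ack` are hypothetical user-supplied blocking callables; this is a sketch of the flooding/ping-pong procedure, not the actual MPI LogP benchmark.

```python
# Sketch of the measurement idea (assumptions: `send` and `recv_ack`
# are user-supplied blocking callables; this is not the actual
# MPI LogP benchmark).

import time

def measure_gap(send, msg, n=1000):
    # Flood n messages back-to-back; at saturation the per-message
    # send interval approximates the gap g(m) for messages of this size.
    t0 = time.perf_counter()
    for _ in range(n):
        send(msg)
    return (time.perf_counter() - t0) / n

def measure_rtt(send, recv_ack, msg, n=100):
    # Average round-trip time of a ping-pong; the latency L can then be
    # derived by subtracting the send/receive overheads from RTT/2.
    t0 = time.perf_counter()
    for _ in range(n):
        send(msg)
        recv_ack()
    return (time.perf_counter() - t0) / n
```

Repeating both measurements for a range of message sizes yields the size-dependent parameters of the parameterized LogP model while sending as few messages as possible.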


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2001

Efficient load balancing for wide-area divide-and-conquer applications

Rob V. van Nieuwpoort; Thilo Kielmann; Henri E. Bal

Divide-and-conquer programs are easily parallelized by letting the programmer annotate potential parallelism in the form of spawn and sync constructs. To achieve efficient program execution, the generated work load has to be balanced evenly among the available CPUs. For single cluster systems, Random Stealing (RS) is known to achieve optimal load balancing. However, RS is inefficient when applied to hierarchical wide-area systems where multiple clusters are connected via wide-area networks (WANs) with high latency and low bandwidth. In this paper, we experimentally compare RS with existing load-balancing strategies that are believed to be efficient for multi-cluster systems, Random Pushing and two variants of Hierarchical Stealing. We demonstrate that, in practice, they obtain less than optimal results. We introduce a novel load-balancing algorithm, Cluster-aware Random Stealing (CRS) which is highly efficient and easy to implement. CRS adapts itself to network conditions and job granularities, and does not require manually-tuned parameters. Although CRS sends more data across the WANs, it is faster than its competitors for 11 out of 12 test applications with various WAN configurations. It has at most 4% overhead in run time compared to RS on a single, large cluster, even with high wide-area latencies and low wide-area bandwidths. These strong results suggest that divide-and-conquer parallelism is a useful model for writing distributed supercomputing applications on hierarchical wide-area systems.
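
The CRS policy itself is compact enough to sketch. The callables below are hypothetical runtime hooks, and no real communication is performed; the point is the asymmetry between synchronous local steals and a single asynchronous wide-area steal.

```python
# Illustrative sketch of the CRS policy (structure only, no real
# communication; all callables are hypothetical hooks supplied by
# the runtime).

import random

def crs_idle_step(local_peers, remote_clusters,
                  try_local_steal, start_wan_steal, wan_steal_pending):
    """One scheduling step of an idle node under Cluster-aware
    Random Stealing."""
    # At most one outstanding wide-area steal, issued asynchronously so
    # the high WAN latency is overlapped with useful local work.
    if remote_clusters and not wan_steal_pending():
        start_wan_steal(random.choice(remote_clusters))
    # Local steals stay synchronous: LAN round-trips are cheap to block on.
    return try_local_steal(random.choice(local_peers))
```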


Operating Systems Review | 2000

The Distributed ASCI Supercomputer project

Henri E. Bal; Raoul Bhoedjang; Rutger F. H. Hofman; Ceriel J. H. Jacobs; Thilo Kielmann; Jason Maassen; Rob V. van Nieuwpoort; John W. Romein; Luc Renambot; Tim Rühl; Ronald Veldema; Kees Verstoep; Aline Baggio; G.C. Ballintijn; Ihor Kuz; Guillaume Pierre; Maarten van Steen; Andrew S. Tanenbaum; G. Doornbos; Desmond Germans; Hans J. W. Spoelder; Evert Jan Baerends; Stan J. A. van Gisbergen; Hamideh Afsermanesh; Dick Van Albada; Adam Belloum; David Dubbeldam; Z.W. Hendrikse; Bob Hertzberger; Alfons G. Hoekstra

The Distributed ASCI Supercomputer (DAS) is a homogeneous wide-area distributed system consisting of four cluster computers at different locations. DAS has been used for research on communication software, parallel languages and programming systems, schedulers, parallel applications, and distributed applications. The paper gives an overview of the most interesting research results obtained so far in the DAS project.


ACM Transactions on Programming Languages and Systems | 2001

Efficient Java RMI for Parallel Programming

Jason Maassen; Rob V. van Nieuwpoort; Ronald Veldema; Henri E. Bal; Thilo Kielmann; Ceriel J. H. Jacobs; Rutger F. H. Hofman

Java offers interesting opportunities for parallel computing. In particular, Java Remote Method Invocation (RMI) provides a flexible kind of remote procedure call (RPC) that supports polymorphism. Sun's RMI implementation achieves this kind of flexibility at the cost of a major runtime overhead. The goal of this article is to show that RMI can be implemented efficiently, while still supporting polymorphism and allowing interoperability with Java Virtual Machines (JVMs). We study a new approach for implementing RMI, using a compiler-based Java system called Manta. Manta uses a native (static) compiler instead of a just-in-time compiler. To implement RMI efficiently, Manta exploits compile-time type information for generating specialized serializers. Also, it uses an efficient RMI protocol and fast low-level communication protocols. A difficult problem with this approach is how to support polymorphism and interoperability. One of the consequences of polymorphism is that an RMI implementation must be able to download remote classes into an application during runtime. Manta solves this problem by using a dynamic bytecode compiler, which is capable of compiling and linking bytecode into a running application. To allow interoperability with JVMs, Manta also implements the Sun RMI protocol (i.e., the standard RMI protocol), in addition to its own protocol. We evaluate the performance of Manta using benchmarks and applications that run on a 32-node Myrinet cluster. The time for a null-RMI (without parameters or a return value) of Manta is 35 times lower than for the Sun JDK 1.2, and only slightly higher than for a C-based RPC protocol. This high performance is accomplished by pushing almost all of the runtime overhead of RMI to compile time. We study the performance differences between the Manta and the Sun RMI protocols in detail. The poor performance of the Sun RMI protocol is in part due to an inefficient implementation of the protocol. To allow a fair comparison, we compiled the applications and the Sun RMI protocol with the native Manta compiler. The results show that Manta's null-RMI latency is still eight times lower than for the compiled Sun RMI protocol and that Manta's efficient RMI protocol results in 1.8 to 3.4 times higher speedups for four out of six applications.
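
The serializer-specialization idea can be mimicked in a small sketch (illustrative Python, not Manta's generated code): the reflective path discovers fields at runtime, while the specialized path has the field layout baked in ahead of time.

```python
# Sketch of the specialization idea (illustrative, not Manta's
# generated code): a generic serializer inspects the object at
# runtime, while a specialized one is emitted per class with the
# field layout known statically.

import struct

class Point:
    def __init__(self, x: int, y: int):
        self.x, self.y = x, y

def generic_serialize(obj) -> bytes:
    # Reflective path: walk the object's fields at runtime, which is
    # the kind of overhead that makes stock RMI serialization slow.
    out = b""
    for _name, value in vars(obj).items():
        out += struct.pack("!i", value)
    return out

def serialize_point(p: Point) -> bytes:
    # "Compiled" path: field order and types are baked in, so the
    # whole record is packed in one call with no introspection.
    return struct.pack("!ii", p.x, p.y)

assert generic_serialize(Point(1, 2)) == serialize_point(Point(1, 2))
```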


IEEE International Conference on Cloud Computing Technology and Science | 2010

Bag-of-Tasks Scheduling under Budget Constraints

Ana-Maria Oprescu; Thilo Kielmann

Commercial cloud offerings, such as Amazon’s EC2, let users allocate compute resources on demand, charging based on reserved time intervals. While this gives great flexibility to elastic applications, users lack guidance for choosing between multiple offerings, in order to complete their computations within given budget constraints. In this work, we present BaTS, our budget-constrained scheduler. BaTS can schedule large bags of tasks onto multiple clouds with different CPU performance and cost, minimizing completion time while respecting an upper bound for the budget to be spent. BaTS requires no a priori information about task completion times, and learns to estimate them at runtime. We evaluate BaTS by emulating different cloud environments on the DAS-3 multi-cluster system. Our results show that BaTS is able to schedule within a user-defined budget (if such a schedule is possible at all). At the expense of extra compute time, significant cost savings can be achieved compared with a cost-oblivious round-robin scheduler.
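
The scheduling question BaTS answers can be posed as a toy brute-force search. Unlike BaTS, which learns task completion times at runtime, the sketch below takes per-machine-type throughputs as given, and both offerings are hypothetical.

```python
# Toy version of the budget-constrained scheduling problem (not the
# BaTS algorithm): choose how many instances of each cloud type to
# rent so that a bag of tasks finishes as early as possible within
# the budget. Throughputs are given here; BaTS estimates them online.

from itertools import product

def best_plan(n_tasks, budget, machine_types, max_per_type=10):
    """machine_types: list of (cost_per_hour, tasks_per_hour) tuples."""
    best = None
    counts = range(max_per_type + 1)
    for plan in product(counts, repeat=len(machine_types)):
        rate = sum(k * tph for k, (_, tph) in zip(plan, machine_types))
        if rate == 0:
            continue
        hours = n_tasks / rate
        cost = hours * sum(k * cph for k, (cph, _) in zip(plan, machine_types))
        if cost <= budget and (best is None or hours < best[0]):
            best = (hours, cost, plan)
    return best  # (makespan_hours, cost, instances per type) or None

# Two hypothetical offerings: fast/expensive vs slow/cheap.
print(best_plan(n_tasks=1000, budget=50.0,
                machine_types=[(0.80, 40), (0.10, 8)]))
```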


International Parallel and Distributed Processing Symposium | 2000

Bandwidth-efficient collective communication for clustered wide area systems

Thilo Kielmann; Henri E. Bal; Sergei Gorlatch

Metacomputing infrastructures couple multiple clusters (or MPPs) via wide-area networks. A major problem in programming parallel applications for such platforms is their hierarchical network structure: latency and bandwidth of WANs often are orders of magnitude worse than those of local networks. Our goal is to optimize MPI's collective operations for such platforms. In this paper we focus on optimized utilization of the (scarce) wide-area bandwidth. We use two techniques: selecting suitable communication graph shapes, and splitting messages into multiple segments that are sent in parallel over different WAN links. To determine the best graph shape and segment size, we introduce a performance model called parameterized LogP (P-LogP), a hierarchical extension of the LogP model that covers messages of arbitrary length. With P-LogP, the optimal segment size and the best broadcast tree shape can be determined at runtime. (For conciseness, we restrict our discussion to the broadcast operation.) An experimental performance evaluation shows that the new broadcast has significantly improved performance (for large messages) and that there is a close match between the theoretical model and the measured completion times.
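
The segment-size optimization can be sketched numerically. The model below is a simplified stand-in for P-LogP: the latency, bandwidth, and gap function g(s) are assumed values, where in practice they come from measurements.

```python
# Sketch of choosing a segment size from a P-LogP-style model
# (illustrative; the parameters and the gap function g(s) below are
# assumptions, in practice they are measured per platform).

L = 10e-3          # WAN latency: 10 ms
BANDWIDTH = 1e6    # 1 MByte/s, as in the paper's setting
OVERHEAD = 50e-6   # assumed fixed per-message cost

def g(s):
    # Per-segment gap: fixed overhead plus transfer time.
    return OVERHEAD + s / BANDWIDTH

def pipelined_time(m, s, depth):
    # Segments of size s pipeline through a broadcast path of `depth`
    # WAN hops: the first segment pays depth*(L + g(s)), and each of
    # the remaining segments adds one more gap.
    n_seg = -(-m // s)  # ceiling division
    return depth * (L + g(s)) + (n_seg - 1) * g(s)

def best_segment_size(m, depth,
                      candidates=tuple(1 << k for k in range(8, 21))):
    # Small segments pay the per-segment overhead many times; large
    # segments lose pipelining. Pick the numeric minimum.
    return min(candidates, key=lambda s: pipelined_time(m, s, depth))

print(best_segment_size(m=1_000_000, depth=3))
```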

Collaboration


Dive into Thilo Kielmann's collaborations.

Top Co-Authors

Henri E. Bal

VU University Amsterdam

Roelof Kemp

VU University Amsterdam

Andre Merzky

Louisiana State University
