Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Sascha Hunold is active.

Publication


Featured research published by Sascha Hunold.


International Conference on Supercomputing | 2014

Implementing a classic: zero-copy all-to-all communication with MPI datatypes

Jesper Larsson Träff; Antoine Rougier; Sascha Hunold

We investigate the use of the derived datatype mechanism of MPI (the Message-Passing Interface) in the implementation of the classic all-to-all communication algorithm of Bruck et al. (1997). Through a series of improvements to the canonical implementation of the algorithm, we gradually eliminate initial and final processor-local data reorganizations, culminating in a zero-copy version that contains no explicit, process-local data movement or copy operations: all necessary data movements are implied by MPI derived datatypes and carried out as part of the communication operations. We furthermore show how the improved algorithm can be used to solve irregular all-to-all communication problems (that are not too irregular). The Bruck algorithm serves as a vehicle to demonstrate the descriptive and performance advantages of MPI datatypes in the implementation of complex algorithms, and to discuss shortcomings and inconveniences in the current MPI datatype mechanism. In particular, we use and implement three new derived datatypes not present in MPI (bounded vector, circular vector, and bucket) that might be useful in other contexts. We also discuss the role of persistent collectives, currently not found in MPI, for amortizing type-creation (and other) overheads, and implement a persistent variant of the MPI_Alltoall collective. On two small systems we experimentally compare the algorithmic improvements to the Bruck et al. algorithm when implemented on top of MPI, showing the zero-copy version to perform significantly better than the initial, straightforward implementation. One of our variants has also been implemented inside mvapich, and we show it to perform better than the mvapich implementation of the Bruck et al. algorithm for the range of processes and problem sizes where it is enabled. The persistent version of MPI_Alltoall has no overhead and outperforms all other variants; in particular, it improves upon the standard implementation by 50% to 15% across the full range of problem sizes considered.
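
As a rough illustration of the zero-copy idea (a minimal sketch, not the paper's Bruck implementation; the strided layout and buffer sizes are invented for the example), a derived datatype can describe a non-contiguous send layout so that MPI_Alltoall itself performs the data movement that would otherwise require explicit packing:

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    const int blocklen = 4, stride = 8;   /* invented example layout */
    double *sendbuf = malloc((size_t)p * stride * sizeof(double));
    double *recvbuf = malloc((size_t)p * blocklen * sizeof(double));
    for (int i = 0; i < p * stride; i++) sendbuf[i] = rank;

    /* One contiguous block per destination, resized so that consecutive
     * send elements start 'stride' doubles apart: a strided,
     * non-contiguous layout described entirely by the datatype. */
    MPI_Datatype blk, sendtype;
    MPI_Type_contiguous(blocklen, MPI_DOUBLE, &blk);
    MPI_Type_create_resized(blk, 0, (MPI_Aint)(stride * sizeof(double)),
                            &sendtype);
    MPI_Type_commit(&sendtype);

    /* No explicit pack/unpack: the communication operation itself moves
     * the non-contiguous data. */
    MPI_Alltoall(sendbuf, 1, sendtype, recvbuf, blocklen, MPI_DOUBLE,
                 MPI_COMM_WORLD);

    MPI_Type_free(&sendtype);
    MPI_Type_free(&blk);
    free(sendbuf); free(recvbuf);
    MPI_Finalize();
    return 0;
}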


Proceedings of the 22nd European MPI Users' Group Meeting | 2015

Isomorphic, Sparse MPI-like Collective Communication Operations for Parallel Stencil Computations

Jesper Larsson Träff; Felix Donatus Lübbe; Antoine Rougier; Sascha Hunold

We propose a specification and discuss implementations of collective operations for parallel stencil-like computations that are not well supported by the current MPI 3.1 neighborhood collectives. In our isomorphic, sparse collectives, all processes partaking in the communication operation use similar neighborhoods of processes with which to exchange data. Our interface assumes the p processes to be arranged in a d-dimensional torus (mesh) over which neighborhoods are specified per process by identical lists of relative coordinates. This extends significantly on the functionality for Cartesian communicators, and is a much lighter mechanism than distributed graph topologies. It allows for fast, local computation of communication schedules, and can be used in more dynamic contexts than current MPI functionality. We sketch three algorithms for neighborhoods with s source and target neighbors, namely (a) a direct algorithm taking s communication rounds, (b) a message-combining algorithm that communicates only along torus coordinates, and (c) a message-combining algorithm using between ⌈log s⌉ and ⌈log p⌉ communication rounds. Our concrete interface has been implemented using the direct algorithm (a). We benchmark our implementations and compare them to the MPI neighborhood collectives. We demonstrate significant advantages in set-up times, and comparable communication times. Finally, we use our isomorphic, sparse collectives to implement a stencil computation with a deep halo, and discuss the derived datatypes required for this application.
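
To make the direct algorithm (a) concrete, here is a hedged sketch for the one-dimensional case: every process applies the same list of relative offsets on a torus of p processes and performs one MPI_Sendrecv round per neighbor. The function and parameter names are invented for the sketch; the paper's actual interface is d-dimensional and not part of MPI.

#include <mpi.h>
#include <stddef.h>

static int torus_rank(int rank, int rel, int p)
{
    return ((rank + rel) % p + p) % p;   /* wrap around in both directions */
}

/* rel[0..s-1] holds the identical list of relative coordinates used by
 * every process ("isomorphic" neighborhoods); count elements go to each
 * neighbor. */
void iso_sparse_exchange(const double *sendbuf, double *recvbuf, int count,
                         const int *rel, int s, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    for (int i = 0; i < s; i++) {                 /* s communication rounds */
        int to   = torus_rank(rank,  rel[i], p);  /* my target neighbor */
        int from = torus_rank(rank, -rel[i], p);  /* who sends to me */
        MPI_Sendrecv(sendbuf + (size_t)i * count, count, MPI_DOUBLE, to, 0,
                     recvbuf + (size_t)i * count, count, MPI_DOUBLE, from, 0,
                     comm, MPI_STATUS_IGNORE);
    }
}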


Proceedings of the 21st European MPI Users' Group Meeting | 2014

Reproducible MPI Micro-Benchmarking Isn't As Easy As You Think

Sascha Hunold; Alexandra Carpen-Amarie; Jesper Larsson Träff

The Message Passing Interface (MPI) is the prevalent programming model for supercomputers. Optimizing the performance of individual MPI functions is therefore of great interest to the HPC community. However, a fair comparison of different algorithms and implementations requires a statistically sound analysis. It is often overlooked that the time to complete an MPI communication function depends not only on internal factors, such as the algorithm used, but also on external factors, such as system noise. Most noise produced by the system is uncontrollable without changing the software stack, e.g., the memory allocation method used by the operating system. Possibly controllable factors have not yet been identified as such in this context. We investigate whether several possible factors, which have been discovered in other micro-benchmarks, have a significant effect on the execution time of MPI functions. We show, experimentally and statistically, that results obtained with other common benchmarking methods for MPI functions can be misleading when comparing alternatives. To overcome these issues, we explain how to carefully design MPI micro-benchmarking experiments and how to make a fair, statistically sound comparison of MPI implementations.
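
The following minimal measurement loop (a sketch, not the paper's full methodology) illustrates one basic design point: synchronizing processes before each repetition and taking the maximum across processes as the runtime of that repetition, leaving the resulting samples to a proper statistical analysis.

#include <mpi.h>
#include <stdio.h>

#define NREP 1000

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    double buf[1024] = {0};
    double t[NREP];

    for (int i = 0; i < NREP; i++) {
        MPI_Barrier(MPI_COMM_WORLD);   /* coarse process synchronization */
        double start = MPI_Wtime();
        MPI_Bcast(buf, 1024, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        double local = MPI_Wtime() - start;
        /* runtime of repetition i = time of the slowest process */
        MPI_Reduce(&local, &t[i], 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    }
    if (rank == 0)
        printf("first sample: %g s (analyze all %d samples statistically)\n",
               t[0], NREP);
    MPI_Finalize();
    return 0;
}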


IEEE Transactions on Parallel and Distributed Systems | 2016

Reproducible MPI Benchmarking is Still Not as Easy as You Think

Sascha Hunold; Alexandra Carpen-Amarie

The Message Passing Interface (MPI) is the prevalent programming model used on today's supercomputers. Therefore, MPI library developers are looking for the best possible performance (shortest run-time) of individual MPI functions across many different supercomputer architectures. Several MPI benchmark suites have been developed to assess the performance of MPI implementations. Unfortunately, the outcome of these benchmarks is often neither reproducible nor statistically sound. To overcome these issues, we show which experimental factors have an impact on the run-time of blocking collective MPI operations and how to measure their effect. Finally, we present a new experimental method that allows us to obtain reproducible and statistically sound measurements of MPI functions.


European Conference on Parallel Processing | 2016

Automatic Verification of Self-consistent MPI Performance Guidelines

Sascha Hunold; Alexandra Carpen-Amarie; Felix Donatus Lübbe; Jesper Larsson Träff

The Message Passing Interface (MPI) is the most commonly used application programming interface for process communication on current large-scale parallel systems. Due to the scale and complexity of modern parallel architectures, it is becoming increasingly difficult to optimize MPI libraries, as many factors can influence communication performance. To assist MPI developers and users, we propose an automatic way to check whether MPI libraries respect self-consistent performance guidelines for collective communication operations. We introduce the PGMPI framework to detect violations of performance guidelines through benchmarking. Our experimental results show that PGMPI can pinpoint undesired and often unexpected performance degradations of collective MPI operations. We demonstrate how to overcome the performance issues of several libraries by adapting the algorithmic implementations of their respective collective MPI calls.
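
To illustrate what such a self-consistent guideline looks like (a sketch in the spirit of PGMPI, not its actual code): MPI_Allreduce should not be slower than the semantically equivalent MPI_Reduce followed by MPI_Bcast. A single-run comparison might look as follows; in practice, many repetitions and a statistical test are needed.

#include <mpi.h>
#include <stdio.h>

static double time_op(void (*op)(double *, double *, int, MPI_Comm),
                      double *in, double *out, int n, MPI_Comm comm)
{
    MPI_Barrier(comm);
    double start = MPI_Wtime();
    op(in, out, n, comm);
    double local = MPI_Wtime() - start, max;
    MPI_Allreduce(&local, &max, 1, MPI_DOUBLE, MPI_MAX, comm);
    return max;                        /* slowest process defines run-time */
}

static void allreduce(double *in, double *out, int n, MPI_Comm comm)
{
    MPI_Allreduce(in, out, n, MPI_DOUBLE, MPI_SUM, comm);
}

static void reduce_bcast(double *in, double *out, int n, MPI_Comm comm)
{
    MPI_Reduce(in, out, n, MPI_DOUBLE, MPI_SUM, 0, comm);
    MPI_Bcast(out, n, MPI_DOUBLE, 0, comm);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    enum { N = 1 << 16 };
    static double in[N], out[N];
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t_all = time_op(allreduce,    in, out, N, MPI_COMM_WORLD);
    double t_rb  = time_op(reduce_bcast, in, out, N, MPI_COMM_WORLD);
    if (rank == 0 && t_all > t_rb)     /* one run only; repeat in practice */
        printf("guideline violated: Allreduce %.3gs > Reduce+Bcast %.3gs\n",
               t_all, t_rb);
    MPI_Finalize();
    return 0;
}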


Concurrency and Computation: Practice and Experience | 2015

One step toward bridging the gap between theory and practice in moldable task scheduling with precedence constraints

Sascha Hunold

Because of the increasing number of cores in current parallel machines and the growing need for concurrent execution of tasks, the problem of parallel task scheduling is more relevant than ever, especially under the moldable task model, in which each task is allocated a fixed number of processors before execution. Much research has been conducted to develop efficient scheduling algorithms for moldable tasks, both in theory and in practice. The problem is that both theoretical and practical approaches expose shortcomings: many approximation algorithms only guarantee bounds under assumptions that are unrealistic in practice, and most heuristics have not been rigorously compared with competing approximation algorithms. In particular, it is often assumed that the speedup function of moldable tasks is either non-decreasing, sub-linear, or concave. In practice, however, the resulting speedup of parallel programs on current hardware with deep memory hierarchies is most often neither non-decreasing nor concave. We present a new algorithm for the problem of scheduling moldable tasks with precedence constraints for the makespan objective and for arbitrary speedup functions. We show through simulation that the algorithm not only creates competitive schedules for moldable tasks with arbitrary speedup functions but also outperforms other published heuristics and approximation algorithms for non-decreasing speedup functions.
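
For readers unfamiliar with the moldable model, the following toy example (an illustration of the model only, not the paper's scheduling algorithm; the speedup values are invented) shows why arbitrary, non-monotonic speedup functions matter: the best allocation minimizes t(p) = t(1)/s(p), and it need not be the largest p.

#include <stdio.h>

/* Given the sequential time t1 and measured speedups s(1..pmax), return
 * the processor count that minimizes the task's runtime t(p) = t1/s(p). */
int best_allocation(double t1, const double *speedup, int pmax)
{
    int best = 1;
    for (int p = 2; p <= pmax; p++)
        if (t1 / speedup[p - 1] < t1 / speedup[best - 1])
            best = p;
    return best;
}

int main(void)
{
    /* Invented speedups for p = 1..8; note the dip at p = 5..6, as can
     * happen on machines with deep memory hierarchies. */
    double s[8] = {1.0, 1.9, 2.7, 3.4, 3.1, 3.0, 3.6, 3.8};
    printf("best allocation: %d processors\n", best_allocation(10.0, s, 8));
    return 0;
}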


Proceedings of the 22nd European MPI Users' Group Meeting | 2015

On the Impact of Synchronizing Clocks and Processes on Benchmarking MPI Collectives

Sascha Hunold; Alexandra Carpen-Amarie

We consider the problem of accurately measuring the time to complete an MPI collective operation, as the result strongly depends on how the time is measured. Our goal is to develop an experimental method that allows for reproducible measurements of MPI collectives. When executing large parallel codes, MPI processes are often skewed in time when entering a collective operation. However, to obtain reproducible measurements, it is a common approach to synchronize all processes before they call the MPI collective operation. We therefore take a closer look at two commonly used process synchronization schemes: (1) relying on MPI_Barrier or (2) applying a window-based scheme using a common global time. We analyze both schemes experimentally and show the strengths and weaknesses of each approach. As window-based schemes require the notion of global time, we thoroughly evaluate different clock synchronization algorithms in various experiments. We also propose a novel clock synchronization algorithm that combines two advantages of known algorithms, which are (1) taking the inherent clock drift into account and (2) using a tree-based synchronization scheme to reduce the synchronization duration.
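
A sketch of the window-based scheme discussed above follows (an illustration, not the paper's benchmark code). It assumes a prior clock-synchronization phase has estimated a per-process clock offset; clock_offset and get_global_time are placeholders for that step, not the paper's tree-based, drift-aware algorithm.

#include <mpi.h>

static double clock_offset = 0.0; /* filled by a clock-sync step (omitted) */

static double get_global_time(void)
{
    return MPI_Wtime() + clock_offset;
}

double measure_windowed(double *buf, int n, double window, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    /* The root proposes a start time one window into the future... */
    double start = 0.0;
    if (rank == 0) start = get_global_time() + window;
    MPI_Bcast(&start, 1, MPI_DOUBLE, 0, comm);

    /* ...and every process spins on global time until then, so that all
     * processes enter the measured collective together, without the skew
     * that an MPI_Barrier can introduce. */
    while (get_global_time() < start)
        ;  /* busy-wait */

    double t0 = MPI_Wtime();
    MPI_Bcast(buf, n, MPI_DOUBLE, 0, comm);
    return MPI_Wtime() - t0;
}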


Lecture Notes in Computer Science | 2015

Euro-Par 2015: Parallel Processing

Jesper Larsson Träff; Sascha Hunold; Francesco Versaci

As they allow processes to communicate and synchronize, concurrent objects are, de facto, the most important objects of concurrent programming. This paper presents and illustrates two important notions associated with concurrent objects. The first one, which is related to their implementation, is the notion of a hybrid implementation. The second one, which is related to their definition, is the notion of an abortable object.


Parallel Computing | 2017

On expected and observed communication performance with MPI derived datatypes

Alexandra Carpen-Amarie; Sascha Hunold; Jesper Larsson Träff

We are interested in the cost of communicating simple, common, non-contiguous data layouts in various scenarios using the MPI derived datatype mechanism. Our aim is twofold. First, we provide a framework for studying communication performance for non-contiguous data layouts described with MPI derived datatypes in comparison to baseline performance with the same amount of contiguously stored data. Second, we explicate natural expectations on derived datatype communication performance that any MPI library implementation should arguably fulfill. These expectations are stated semi-formally as MPI datatype performance guidelines. Using our framework, we examine several MPI libraries on two different systems. Our findings are in many ways surprising and disappointing. First, using derived datatypes as intended by the MPI standard sometimes performs worse than the semantically equivalent packing and unpacking with the corresponding MPI functionality followed by contiguous communication. Second, communication performance with a single, contiguous datatype can be significantly worse than a repetition of its constituent datatype. Third, the heuristics that are typically employed by MPI libraries at type-commit time turn out to be insufficient to enforce the performance guidelines, showing room for better algorithms and heuristics for representing and processing derived datatypes in MPI libraries. In particular, we show cases where all MPI type constructors are necessary to achieve the expected performance. Our findings provide useful information to MPI library implementers, and hints to application programmers on good use of derived datatypes. Improved MPI libraries can be validated using our framework and approach.
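
The pack/unpack baseline mentioned above can be made concrete as follows (a sketch under the paper's framing, not its benchmark code; function and parameter names are invented): a strided layout is sent either by explicitly packing it with MPI_Pack and sending contiguously, or directly with the derived datatype. A natural guideline is that the direct version should not be slower.

#include <mpi.h>
#include <stdlib.h>

void send_packed(const double *data, int blocks, int blocklen, int stride,
                 int dest, MPI_Comm comm)
{
    MPI_Datatype strided;
    MPI_Type_vector(blocks, blocklen, stride, MPI_DOUBLE, &strided);
    MPI_Type_commit(&strided);

    int bytes;
    MPI_Pack_size(1, strided, comm, &bytes);
    char *tmp = malloc(bytes);
    int pos = 0;

    /* Explicit local copy: the datatype only drives MPI_Pack here. */
    MPI_Pack(data, 1, strided, tmp, bytes, &pos, comm);
    MPI_Send(tmp, pos, MPI_PACKED, dest, 0, comm);

    free(tmp);
    MPI_Type_free(&strided);
}

void send_direct(const double *data, int blocks, int blocklen, int stride,
                 int dest, MPI_Comm comm)
{
    MPI_Datatype strided;
    MPI_Type_vector(blocks, blocklen, stride, MPI_DOUBLE, &strided);
    MPI_Type_commit(&strided);
    /* Guideline: this should not be slower than send_packed(). */
    MPI_Send(data, 1, strided, dest, 0, comm);
    MPI_Type_free(&strided);
}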


European Conference on Parallel Processing | 2015

Euro-Par 2015: Parallel Processing Workshops - Euro-Par 2015 International Workshops, Vienna, Austria, August 24-25, 2015, Revised Selected Papers

Sascha Hunold; Alexandru Iosup; Stefan Lankes; Josef Weidendorfer; Michael Alexander; Domingo Giménez; Vittorio Scarano; Stephen L. Scott; María Engracia Gómez Requena; Ana Lucia Varbanescu; Alexandru Costan; Laura Ricci

Large-scale interactive applications and online graph processing require fast access to billions of small data objects. DXRAM addresses this challenge by always keeping all data in RAM, potentially aggregated across many nodes in a data center. Such storage clusters need space-efficient and fast meta-data management. In this paper we propose a range-based meta-data management scheme that allows fast node lookups while remaining space efficient by combining data object IDs into ranges. A super-peer overlay network is used to manage these ranges together with backup-node information, allowing parallel and fast recovery of meta-data and data of failed peers. Furthermore, the same concept can also be used for client-side caching. The measurement results show the benefits of the proposed concepts compared to other meta-data management strategies, as well as their very good overall performance, evaluated using the social network benchmark BG.
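
The range-based idea can be sketched as follows (an illustration, not DXRAM's code; all type and field names are invented): consecutive object IDs created on the same node are collapsed into ranges, and a lookup becomes a binary search over the range end points instead of a per-object table access.

#include <stdint.h>

struct range {            /* objects (prev.end_id, end_id] live on node_id */
    uint64_t end_id;
    uint32_t node_id;
};

/* ranges[] is sorted by end_id; returns the owning node of object 'id'. */
uint32_t lookup_node(const struct range *ranges, int n, uint64_t id)
{
    int lo = 0, hi = n - 1;
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (ranges[mid].end_id < id)
            lo = mid + 1;
        else
            hi = mid;
    }
    return ranges[lo].node_id;
}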

Collaboration


Dive into Sascha Hunold's collaborations.

Top Co-Authors

Jesper Larsson Träff, Vienna University of Technology
Alexandra Carpen-Amarie, Vienna University of Technology
Antoine Rougier, Vienna University of Technology
Felix Donatus Lübbe, Vienna University of Technology
Michael Alexander, Vienna University of Technology
Stephen L. Scott, Oak Ridge National Laboratory
Alexandru Costan, Institut de Recherche en Informatique et Systèmes Aléatoires