Timo Schneider
Indiana University Bloomington
Network
Latest external collaboration on country level. Dive into details by clicking on the dots.
Publication
Featured researches published by Timo Schneider.
ieee international conference on high performance computing data and analytics | 2010
Torsten Hoefler; Timo Schneider; Andrew Lumsdaine
This paper presents an in-depth analysis of the impact of system noise on large-scale parallel application performance in realistic settings. Our analytical model shows that not only collective operations but also point-to-point communications influence the applications sensitivity to noise. We present a simulation toolchain that injects noise delays from traces gathered on common large-scale architectures into a LogGPS simulation and allows new insights into the scaling of applications in noisy environments. We investigate collective operations with up to 1 million processes and three applications (Sweep3D, AMG, and POP) with up to 32,000 processes.We show that the scale at which noise becomes a bottleneck is system-specific and depends on the structure of the noise. Simulations with different network speeds show that a 10x faster network does not improve application scalability. We quantify noise and conclude that our tools can be utilized to tune the noise signatures of a specific system.
high performance distributed computing | 2010
Torsten Hoefler; Timo Schneider; Andrew Lumsdaine
We introduce LogGOPSim---a fast simulation framework for parallel algorithms at large-scale. LogGOPSim utilizes a slightly extended version of the well-known LogGPS model in combination with full MPI message matching semantics and detailed simulation of collective operations. In addition, it enables simulation in the traditional LogP, LogGP, and LogGPS models. Its simple and fast single-queue design computes more than 1 million events per second on a single processor and enables large-scale simulations of more than 8 million processes. LogGOPSim also supports the simulation of full MPI applications by reading and simulating MPI profiling traces. We analyze the accuracy and the performance of the simulation and propose a simple extrapolation scheme for parallel applications. Our scheme extrapolates collective operations with high accuracy by rebuilding the communication pattern. Point-to-point operation patterns can be copied in the extrapolation and thus retain the main characteristics of scalable parallel applications.
international conference on cluster computing | 2008
Torsten Hoefler; Timo Schneider; Andrew Lumsdaine
Multistage interconnection networks based on central switches are ubiquitous in high-performance computing. Applications and communication libraries typically make use of such networks without consideration of the actual internal characteristics of the switch. However, application performance of these networks, particularly with respect to bisection bandwidth, does depend on communication paths through the switch. In this paper we discuss the limitations of the hardware definition of bisection bandwidth (capacity-based) and introduce a new metric: effective bisection bandwidth. We assess the effective bisection bandwidth of several large-scale production clusters by simulating artificial communication patterns on them. Networks with full bisection bandwidth typically provided effective bisection bandwidth in the range of 55-60%. Simulations with application-based patterns showed that the difference between effective and rated bisection bandwidth could impact overall application performance by up to 12%.
international parallel and distributed processing symposium | 2008
Torsten Hoefler; Timo Schneider; Andrew Lumsdaine
Accurate, reproducible and comparable measurement of collective operations is a complicated task. Although different measurement schemes are implemented in well- known benchmarks, many of these schemes introduce different systematic errors in their measurements. We characterize these errors and select a window-based approach as the most accurate method. However, this approach complicates measurements significantly and introduces a clock synchronization as a new source of systematic errors. We analyze approaches to avoid or correct those errors and develop a scalable synchronization scheme to conduct benchmarks on massively parallel systems. Our results are compared to the window-based scheme implemented in the SKaMPI benchmarks and show a reduction of the synchronization overhead by a factor of 16 on 128 processes.
ieee international conference on high performance computing data and analytics | 2012
Torsten Hoefler; Timo Schneider
Many scientific applications operate in a bulk-synchronous mode of iterative communication and computation steps. Even though the communication steps happen at the same logical time, important patterns such as stencil computations cannot be expressed as collective communications in MPI. We demonstrate how neighborhood collective operations allow to specify arbitrary collective communication relations during runtime and enable optimizations similar to traditional collective calls. We show a number of optimization opportunities and algorithms for different communication scenarios. We also show how users can assert constraints that provide additional optimization opportunities in a portable way. We demonstrate the utility of all described optimizations in a highly optimized implementation of neighborhood collective operations. Our communication and protocol optimizations result in a performance improvement of up to a factor of two for small stencil communications. We found that, for some patterns, our optimization heuristics automatically generate communication schedules that are comparable to hand-tuned collectives. With those optimizations in place, we are able to accelerate arbitrary collective communication patterns, such as regular and irregular stencils with optimization methods for collective communications. We expect that our methods will influence the design of future MPI libraries and provide a significant performance benefit on large-scale systems.
Proceedings of the 20th European MPI Users' Group Meeting on | 2013
Timo Schneider; Fredrik Kjolstad; Torsten Hoefler
Data packing before and after communication can make up as much as 90% of the communication time on modern computers. Despite MPIs well-defined datatype interface for non-contiguous data access, many codes use manual pack loops for performance reasons. Programmers write access-pattern specific pack loops (e.g., do manual unrolling) for which compilers emit optimized code. In contrast, MPI implementations in use today interpret datatypes at pack time, resulting in high overheads. In this work we explore the effectiveness of using runtime compilation techniques to generate efficient and optimized pack code for MPI datatypes at commit time. Thus, none of the overhead of datatype interpretation is incurred at pack time and pack setup is as fast as calling a function pointer. We have implemented a library called libpack that can be used to compile and (un)pack MPI datatypes. The library optimizes the datatype representation and uses the LLVM framework to produce vectorized machine code for each datatype at commit time. We show several examples of how MPI datatype pack functions benefit from runtime compilation and analyze the performance of compiled pack functions for the data access patterns in many applications. We show that the pack/unpack functions generated by our packing library are seven times faster than those of prevalent MPI implementations for 73% of the datatypes used in a scientific application and in many cases outperform manual pack loops.
international parallel and distributed processing symposium | 2009
Torsten Hoefler; Timo Schneider; Andrew Lumsdaine
The impact of operating system noise on the performance of large-scale applications is a growing concern and ameliorating the effects of OS noise is a subject of active research. A related problem is that of network noise, which arises from shared use of an interconnection network by parallel processes. To characterize the impact of network noise on parallel applications we conducted a series of simulations and experiments using a newly-developed benchmark. Experiment results show a decrease in the communication performance of a parallel reduction operation by a factor of two on 246 nodes. In addition, simulations show that influence of network noise grows with the system size. Although network noise is not as well-studied as OS noise, our results clearly show that it is an important factor that must be considered when running large-scale applications.
ubiquitous computing | 2010
Torsten Hoefler; Timo Schneider; Andrew Lumsdaine
Accurate, reproducible and comparable measurement of the overheads, communication times and progression behaviour of blocking and nonblocking collective operations is a complicated task. Although different measurement schemes for blocking collective operations are implemented in well-known benchmarks, many of these schemes introduce different systematic errors in their measurements. We characterise these errors and select a window-based approach as the most accurate method. However, this approach complicates measurements significantly and introduces clock synchronisation as a new source of errors. We analyse approaches to avoid or correct those errors and develop a scalable synchronisation scheme to conduct benchmarks on massively parallel systems. Our results are compared to the window-based scheme implemented in the SKaMPI benchmarks and show a reduction of the synchronisation overhead by a factor of 16 on 128 processes. We also describe two different measurement schemes for the overhead and asynchronous progress of nonblocking collective communications. An implementation and results of both measurement schemes are presented.
international conference on parallel architectures and compilation techniques | 2012
Torsten Hoefler; Timo Schneider
Parallelism is steadily growing, remote-data access will soon dominate the execution time of large-scale applications. Many large-scale communication patterns expose significant structure that can be used to schedule communications accordingly. In this work, we identify concurrent communication patterns and transform them to semantically equivalent but faster communications. We show a directed acyclic graph formulation for communication schedules and concisely define their synchronization and data movement semantics. Our dataflow solver computes an internal representation (IR) that is amenable to pattern detection. We demonstrate a detection algorithm for our IR that is guaranteed to detect communication kernels on subsets of the graph and replace the subgraph with hardware accelerated or hand-tuned kernels. Those techniques are implemented in an open-source detection and transformation framework to optimize communication patterns. Experiments show that our techniques can improve the performance of representative example codes by several orders of magnitude on two different systems. However, we also show that some collective detection problems on process subsets are NP-hard. The developed analysis techniques are a first important step towards automatic large-scale communication transformations. Our developed techniques open several avenues for additional transformation heuristics and analyses. We expect that such communication analyses and transformations will become as natural as pattern detection, just-in-time compiler optimizations, and autotuning are today for serial codes.
international conference on parallel processing | 2011
Timo Schneider; Sven Eckelmann; Torsten Hoefler; Wolfgang Rehm
Optimized implementations of blocking and nonblocking collective operations are most important for scalable high-performance applications. Offloading such collective operations into the communication layer can improve performance and asynchronous progression of the operations. However, it is most important that such offloading schemes remain flexible in order to support user-defined (sparse neighbor) collective communications. In this work, we describe an operating system kernel-based architecture for implementing an interpreter for the flexible Group Operation Assembly Language (GOAL) framework to offload collective communications. We describe an optimized scheme to store the schedules that define the collective operations and show an extension to profile the performance of the kernel layer. Our microbenchmarks demonstrate the effectiveness of the approach and we show performance improvements over traditional progression in user-space. We also discuss complications with the design and offloading strategies in general.