Publication


Featured research published by Christopher Dahnken.


Archive | 2014

Optimizing HPC Applications with Intel® Cluster Tools

Alexander V. Supalov; Andrey Semin; Michael Klemm; Christopher Dahnken

Abstraction and Generalization of the Platform Architecture

Middleware and software architectures play a big role in HPC and in other application areas. Today, almost nobody interacts with the hardware directly. Instead, the interaction between the programmer and the hardware is facilitated via an application programming interface (API). If you think that programming in assembly language today is direct interaction with hardware, we have to disappoint you; it is not. The instruction stream is decoded into sequences of special microcode operations that in the end serve as the commands to the execution units. Software abstractions are an unavoidable part of modern application design, and in this part of the book we will look at the software architecture from the point of view of abstraction and the consequences of using one set of abstractions over others. Selecting some abstractions may result in performance penalties because of the added translation steps; for others, the impact may be hidden by efficient pipelining (such as happens with microcode translation inside processors) and cause almost no visible overhead.

Types of Abstractions

An abstraction is a technique used to separate conceptualized ideas from specific instances and implementations of those at hand. These conceptualizations are used to hide the internal complexity of the hardware, allow portability of software, and increase the productivity of development via better reuse of components. Abstractions implemented in software, middleware, or firmware also allow fixing hardware bugs in software, which reduces the time to market for very complex systems, such as supercomputers.

We believe it is generally good to have the right level of abstraction. Abstractions today are essentially unavoidable: we have to use different kinds of APIs because interaction with the raw hardware is (almost) impossible. During performance optimization work, any performance overhead must be quantified to judge whether there is a need to consider a lower level of abstraction that could gain more performance and increase efficiency.

Abstractions apply to both control flow and data structures. Control abstraction hides the order in which the individual statements, instructions, or function calls of a program are executed. Data abstraction allows us to use high-level types, classes, and complex structures without needing to know the details of how they are stored in computer memory or on disk, or transferred over the network. One can regard the notion of an object in object-oriented programming as an attempt to combine abstractions of data and code, and to deal with instances of objects through their specific properties and methods. Object-oriented programming is sometimes a convenient approach that improves code modularity, encourages reuse of software components, and increases productivity of development and support of the application.

Some examples of control flow abstractions that a typical developer in high-performance computing will face include the following:

• Decoding of the processor instruction set into microcode. These are specific to a microarchitecture implementation of different processors. The details of the mapping between processor instructions and microcode operations are discussed in Chapter 7. The mapping is not a simple one-to-one or one-to-many relation. With technologies like macro fusion, the number of internal micro-operations may end up smaller than the number of incoming instructions. This abstraction allows processor designers to preserve a common instruction set architecture (ISA) across different implementations and to extend the ISA while preserving backwards compatibility. The decoding of processor instructions into micro-operations is a pipelined process, and it usually does not cause performance penalties in HPC codes.

• Virtual machines, involving just-in-time compilation (JIT, widely used, for example, in Java or in the Microsoft Common Language Runtime [CLR] virtual machines) or dynamic translation (such as in scripting or interpreted languages like Python or Perl). Here, compilation is done during execution of a program, rather than prior to execution. With JIT, the program can be stored in a higher-level compressed byte-code that is usually a portable representation, and a virtual machine translates it into processor instructions on the fly. JIT implementations can be sufficiently fast for use even in HPC applications, and we have seen large HPC apps written in Java and Python. And, by the way, the number of such applications keeps growing.

• Programming languages. These are control abstractions that offer notions such as functions, looping, conditional execution, and so on, to make it easier and more productive to write programs. Higher-level languages, such as Fortran or C, often require compilation of programs to translate code into a stream of processor-specific instructions to achieve high performance. Unlike instruction decoding or just-in-time compilation, this happens ahead of time, before the program executes. This approach ensures that overheads related to compiling the program code to machine instructions do not impact application execution.

• Libraries of routines and modules. Most programming languages support extending programs with subprograms, modules, or libraries of routines. This enables a modular architecture of the final program for faster development, better test coverage, and greater portability. Several well-known libraries provide de facto standard sets of routines for many HPC programs, such as the basic linear algebra subprograms (BLAS), the linear algebra package (LAPACK), and the FFTW software library for computing discrete Fourier transforms (DFTs). These libraries not only hide the complexity of the underlying algorithms but also enable hardware vendors to provide highly tuned implementations for best performance on their computer architectures. For example, the Intel Math Kernel Library (MKL), included in Intel Parallel Studio XE, provides optimized linear algebra (BLAS, LAPACK, sparse solvers, and ScaLAPACK for clusters), multidimensional (up to 7D) fast Fourier transformations and FFTW interfaces, vector math (including trigonometric, hyperbolic, exponential, logarithmic, power, root, and rounding) routines, random number generators (such as congruent, recursive, Wichmann-Hill, Mersenne twister, Sobol sequences, etc.), statistics (quantiles, min/max, variance-covariance, etc.), and data fitting (spline, interpolation, cell search) routines for the latest Intel microprocessors.

• API calls. Any kind of API provided by the operating system (OS) hides the complexity of the interaction between operating system tasks and the hardware-supported execution context exposed by the processors. Examples include calls from the OS to the basic input/output system (BIOS), abstracting the implementation of the underlying hardware platform, or a threading API that creates, controls, and coordinates the threads of execution within the application.

• Operating system. The OS, and specifically its scheduler, makes every program believe that it runs continuously on the system without any interruptions. In fact, the OS scheduler does interrupt execution, and even puts execution of a program on hold to give other programs access to the processor resources.

• Full system virtualization. This includes virtual machine monitors (VMMs), such as Xen, KVM, VMware, and others. VMMs usually abstract the entire platform so that every operating system believes it is the only one running on the system, while, in fact, the VMM performs both control and data abstraction among all the different OS instances currently executing on the platform.

Data abstraction allows handling of data bits in meaningful ways. For example, data abstraction can be found behind:
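To make the library abstraction described above concrete, here is a minimal sketch contrasting a hand-written kernel with the equivalent BLAS call. A CBLAS-style interface is assumed (Intel MKL provides one, as does the reference BLAS); the header name, link flags, and the tiny matrices are illustrative, not taken from the book.

```c
/* The same matrix multiply at two levels of abstraction: a plain C loop
 * nest versus a call into a tuned BLAS library. <cblas.h> is assumed;
 * the actual header and link line depend on the BLAS vendor. */
#include <stdio.h>
#include <cblas.h>

/* Naive triple loop: C = A * B for n x n row-major matrices. */
static void dgemm_naive(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}

int main(void)
{
    const int n = 2;
    double A[] = {1, 2, 3, 4}, B[] = {5, 6, 7, 8};
    double C1[4], C2[4];

    dgemm_naive(n, A, B, C1);

    /* Same operation behind the BLAS abstraction: the library chooses
     * blocking, vectorization, and threading for the current processor. */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C2, n);

    printf("naive C[0][0] = %g, blas C[0][0] = %g\n", C1[0], C2[0]);
    return 0;
}
```

The point of the abstraction is that the call site stays identical whether it is backed by the reference BLAS or by a tuned vendor library such as MKL; only the link line changes.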


European Conference on Parallel Processing | 2010

An approach to visualize remote socket traffic on the Intel Nehalem-EX

Christian Iwainsky; Thomas Reichstein; Christopher Dahnken; Dieter an Mey; Christian Terboven; Andrey Semin; Christian H. Bischof

The integration of the memory controller on the processor die enables ever larger core counts in commodity shared memory systems with Non-Uniform Memory Access (NUMA) properties. Shared memory parallelization with OpenMP is an elegant and widely used approach to leverage the power of such systems. The binding of OpenMP threads to compute cores and the corresponding memory association become even more critical for obtaining optimal performance. In this work we provide a method to measure the number of remote socket memory accesses a thread generates. We use the available performance monitoring CPU counters in combination with thread binding on a quad-socket Nehalem-EX system. For visualization of the collected data we use Vampir.
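A hedged sketch of the thread-binding and memory-association issue the paper measures: on a multi-socket NUMA system, pinned OpenMP threads (for example via OMP_PROC_BIND or KMP_AFFINITY) combined with first-touch initialization keep each thread's pages on its local socket. The array size and schedules below are illustrative assumptions, not code from the paper.

```c
/* First-touch placement under OpenMP: each pinned thread initializes the
 * pages it will later compute on, so the OS allocates them on that
 * thread's local NUMA node. Compile with an OpenMP flag, e.g. -fopenmp. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const long n = 1L << 26;              /* ~0.5 GiB of doubles */
    double *a = malloc(n * sizeof *a);
    if (!a) return 1;

    /* First touch: pages land on the initializing thread's socket. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++)
        a[i] = 0.0;

    /* Same static schedule: each thread now reads mostly local memory. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++)
        a[i] = 2.0 * a[i] + 1.0;

    printf("a[0] = %f\n", a[0]);
    free(a);
    return 0;
}
```

Unpinned threads, or a different schedule in the second loop, would turn many of these accesses into remote-socket traffic, which is exactly what the paper's counter-based measurement makes visible.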


Archive | 2014

Addressing System Bottlenecks

Alexander V. Supalov; Andrey Semin; Michael Klemm; Christopher Dahnken

We start with a bold statement: every application has a bottleneck. By this we mean that there is always something that limits the performance of a given application on a system. Even if the application is well optimized and it seems that no additional improvements are possible at the other tuning levels, it still has a bottleneck, and that bottleneck lies in the system the program runs on. Tuning starts and ends at the system level.


Archive | 2014

No Time to Read This Book

Alexander V. Supalov; Andrey Semin; Michael Klemm; Christopher Dahnken

We know what it feels like to be under pressure. Try out a few quick and proven optimization stunts described below. They may provide a good enough performance gain right away.


Archive | 2014

Top-Down Software Optimization

Alexander V. Supalov; Andrey Semin; Michael Klemm; Christopher Dahnken

The tuning of a previously unoptimized hardware/software combination is a difficult task, one that even experts struggle with. Anything can go wrong here, from the proper setup to the compilation and execution of individual machine instructions. It is, therefore, of paramount importance to follow a logical and systematic approach to improve performance incrementally, continuously exposing the next bottleneck to be fixed.


Archive | 2014

Addressing Application Bottlenecks: Distributed Memory

Alexander V. Supalov; Andrey Semin; Michael Klemm; Christopher Dahnken

The first application optimization level accessible to the ever-busy performance analyst is the distributed memory one, normally expressed in terms of the Message Passing Interface (MPI). By its very nature, the distributed memory paradigm is concerned with communication. Some people consider all communication to be overhead, that is, something intrinsically harmful that needs to be eliminated. We prefer to call it "investment": by moving data around in the right manner, you hope to get more computational power in return. The main point, then, is to optimize this investment so that your returns are maximized.
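As one illustration of making the communication investment pay off, the sketch below overlaps a nonblocking halo exchange with independent computation. The periodic neighbor ranks, buffer sizes, and compute kernel are hypothetical stand-ins, not code from the chapter.

```c
/* Overlapping communication with computation using nonblocking MPI. */
#include <stdio.h>
#include <mpi.h>

static void exchange_and_compute(double *halo_in, double *halo_out, int n,
                                 int left, int right,
                                 double *interior, int m)
{
    MPI_Request reqs[2];

    /* Post the receive and send up front so messages can make progress. */
    MPI_Irecv(halo_in,  n, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(halo_out, n, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    /* Work that does not depend on the halo runs while data is in flight. */
    for (int i = 0; i < m; i++)
        interior[i] *= 0.5;

    /* Wait only when the boundary data is actually needed. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double halo_in[4] = {0}, halo_out[4] = {1, 1, 1, 1}, interior[100] = {0};
    int left  = (rank - 1 + size) % size;   /* periodic ring of ranks */
    int right = (rank + 1) % size;

    exchange_and_compute(halo_in, halo_out, 4, left, right, interior, 100);

    printf("rank %d done\n", rank);
    MPI_Finalize();
    return 0;
}
```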


Archive | 2014

Addressing Application Bottlenecks: Microarchitecture

Alexander V. Supalov; Andrey Semin; Michael Klemm; Christopher Dahnken

Microarchitectural performance tuning is one of the most difficult parts of the performance tuning process. In contrast to other tuning activities, it is not immediately clear what the bottlenecks are. Usually, discovering this requires study of processor manuals, which provide the details of the execution flow. Furthermore, a certain understanding of assembly language is needed to reflect the findings back onto the original source code. Each processor model will also have its own microarchitectural characteristics that have to be considered when writing efficient software.
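For a flavor of what reflecting findings back onto the source means, consider a simple kernel and its generated assembly, obtained for instance with `gcc -O3 -S`. The kernel below is an illustrative stand-in, not an example from the chapter.

```c
/* Illustrative kernel. The profiler attributes stalls and cycles to the
 * instructions the compiler emits for this loop, not to the C line
 * itself; whether it vectorizes, and which execution units it keeps
 * busy, depends on the compiler, flags, and target microarchitecture. */
void saxpy(int n, float a, const float *restrict x, float *restrict y)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];  /* may map to a fused multiply-add (FMA)
                                    where the ISA provides one */
}
```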


Archive | 2014

Overview of Platform Architectures

Alexander V. Supalov; Andrey Semin; Michael Klemm; Christopher Dahnken

In order to optimize software, you need to understand the hardware. In this chapter we give you a brief overview of the typical system architectures found in high-performance computing (HPC) today. We also introduce terminology that will be used throughout the book.


Archive | 2014

Application Design Considerations

Alexander V. Supalov; Andrey Semin; Michael Klemm; Christopher Dahnken

In Chapters 5 to 7 we reviewed the methods, tools, and techniques for application tuning, explained by using examples of HPC applications and benchmarks. The whole process followed the top-down software optimization framework explained in Chapter 3. The general approach to the tuning process is based on a quantitative analysis of execution resources required by an application and how these match the capabilities of the platform the application is run on. The blueprint analysis of platform capabilities and system-level tuning considerations were provided in Chapter 4, based on several system architecture metrics discussed in Chapter 2.


Archive | 2014

Addressing Application Bottlenecks: Shared Memory

Alexander V. Supalov; Andrey Semin; Michael Klemm; Christopher Dahnken

Chapter 5 talks about the potential bottlenecks in your application and the system it runs on. In this chapter, we will have a close look at how the application code performs on the level of an individual cluster node. It is a fair assumption that there will also be bottlenecks on this level. Removing these bottlenecks will usually translate directly to increased performance, in addition to the optimizations discussed in the previous chapters.

Collaboration


Dive into Christopher Dahnken's collaboration.

Top Co-Authors


Christian H. Bischof

Technische Universität Darmstadt
