Sandra Wienke
RWTH Aachen University
Publication
Featured research published by Sandra Wienke.
international conference on parallel processing | 2012
Sandra Wienke; Paul Springer; Christian Terboven; Dieter an Mey
Today's trend to use accelerators like GPGPUs in heterogeneous computer systems has given rise to several low-level APIs for accelerator programming. However, programming with these APIs is often tedious and therefore unproductive. To tackle this problem, recent approaches employ directive-based high-level programming for accelerators. In this work, we present our first experiences with OpenACC, an API consisting of compiler directives to offload loops and regions of C/C++ and Fortran code to accelerators. We compare the performance of OpenACC to PGI Accelerator and OpenCL for two real-world applications and evaluate programmability and productivity. We find that OpenACC offers a promising ratio of development effort to performance and that a directive-based approach to programming accelerators is more efficient than low-level APIs, even if suboptimal performance is achieved.
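To illustrate the directive-based style evaluated here, a minimal OpenACC sketch in C (illustrative only, not code from the paper; the saxpy loop stands in for the real-world kernels):

    /* A single directive offloads the loop to an attached accelerator;
       the compiler generates the device code and the data transfers
       named in the clauses. */
    void saxpy(int n, float a, const float *x, float *y) {
        #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }

By contrast, the same operation in OpenCL requires a separate kernel source plus host code for platform, context, queue, buffer, and launch management.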
international conference on parallel processing | 2013
Dirk Schmidl; Tim Cramer; Sandra Wienke; Christian Terboven; Matthias S. Müller
The Intel Xeon Phi has been introduced as a new type of compute accelerator that is capable of executing native x86 applications. It supports programming models that are well-established in the HPC community, namely MPI and OpenMP, thus removing the necessity to refactor codes for accelerator-specific programming paradigms. Because of its native x86 support, the Xeon Phi may also be used stand-alone, meaning codes can be executed directly on the device without the need for interaction with a host. In this sense, the Xeon Phi resembles a big SMP on a chip if its 240 logical cores are compared to a common Xeon-based compute node offering up to 32 logical cores. In this work, we compare a Xeon-based two-socket compute node with the Xeon Phi stand-alone in scalability and performance using OpenMP codes. Considering both as individual SMP systems, they come at a very similar price and power envelope, but our results show significant differences in absolute application performance and scalability. We also show to what extent common programming idioms for the Xeon multi-core architecture are applicable to the Xeon Phi many-core architecture and which challenges the changing ratio of core count to single-core performance poses for the application programmer.
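As a flavor of code that runs unchanged on both platforms, a minimal OpenMP sketch (illustrative, not from the paper):

    #include <omp.h>

    /* The same loop runs natively on a two-socket Xeon node (up to 32
       logical cores) and on a Xeon Phi (240 logical cores); only the
       thread count changes, e.g. via OMP_NUM_THREADS. */
    void triad(int n, double *a, const double *b, const double *c, double s) {
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n; ++i)
            a[i] = b[i] + s * c[i];
    }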
european conference on parallel processing | 2014
Sandra Wienke; Christian Terboven; James C. Beyer; Matthias S. Müller
Nowadays, HPC systems frequently emerge as clusters of commodity processors with attached accelerators. For moving from tedious low-level accelerator programming to increased development productivity, the directive-based programming models OpenACC and OpenMP are promising candidates. While OpenACC was completed about two years ago, OpenMP only recently added support for accelerator programming. To assist developers in deciding which approach to take, we compare both models with respect to their programmability. Besides investigating their expressiveness by putting their constructs side by side, we focus on evaluating their power based on structured parallel programming patterns (aka algorithmic skeletons). These patterns describe the basic entities of parallel algorithms, of which we cover map, stencil, reduction, fork-join, superscalar sequence, nesting and geometric decomposition. The architectural targets of this work are NVIDIA-type accelerators (GPUs) and the specifics of Intel-type accelerators (Xeon Phis). Additionally, we assess the prospects of OpenACC and OpenMP concerning future developments in software and hardware design.
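For instance, the reduction pattern maps to closely related constructs in both models; a hedged side-by-side sketch (directive spellings per the OpenACC 2.0 and OpenMP 4.0 specifications; the loop is illustrative):

    /* Reduction expressed in OpenACC ... */
    double sum_acc(int n, const double *x) {
        double s = 0.0;
        #pragma acc parallel loop reduction(+:s) copyin(x[0:n])
        for (int i = 0; i < n; ++i) s += x[i];
        return s;
    }

    /* ... and in OpenMP with device offloading. */
    double sum_omp(int n, const double *x) {
        double s = 0.0;
        #pragma omp target teams distribute parallel for reduction(+:s) map(to: x[0:n])
        for (int i = 0; i < n; ++i) s += x[i];
        return s;
    }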
ieee international conference on high performance computing, data, and analytics | 2014
Guido Juckeland; William C. Brantley; Sunita Chandrasekaran; Barbara M. Chapman; Shuai Che; Mathew E. Colgrove; Huiyu Feng; Alexander Grund; Robert Henschel; Wen-mei W. Hwu; Huian Li; Matthias S. Müller; Wolfgang E. Nagel; Maxim Perminov; Pavel Shelepugin; Kevin Skadron; John A. Stratton; Alexey Titov; Ke Wang; G. Matthijs van Waveren; Brian Whitney; Sandra Wienke; Rengan Xu; Kalyan Kumaran
Hybrid nodes with hardware accelerators are becoming very common in systems today. Users often find it difficult to characterize and understand the performance advantage of such accelerators for their applications. The SPEC High Performance Group (HPG) has developed a set of performance metrics to evaluate the performance and power consumption of accelerators for various science applications. The new benchmark comprises two suites of applications written in OpenCL and OpenACC and measures the performance of accelerators with respect to a reference platform. The first set of published results demonstrates the viability and relevance of the new metrics for comparing accelerator performance. This paper discusses the benchmark suites and selected published results in detail.
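As a sketch of how such metrics typically relate a system to a reference platform (the general SPEC scoring scheme; the exact SPEC ACCEL rules are defined in the suite's run documentation):

    \mathrm{ratio}_i = \frac{T_i^{\mathrm{ref}}}{T_i^{\mathrm{sys}}}, \qquad
    \mathrm{score} = \Bigl(\prod_{i=1}^{n} \mathrm{ratio}_i\Bigr)^{1/n}

i.e., each benchmark's runtime on the system under test is normalized to its runtime on the reference platform, and the overall score is the geometric mean of the per-benchmark ratios.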
Computer Science - Research and Development | 2011
Sandra Wienke; Dmytro Plotnikov; Dieter an Mey; Christian H. Bischof; Ario Hardjosuwito; Christof Gorgels; Christian Brecher
The desire for general-purpose computation on graphics processing units has driven the development of new programming paradigms, e.g. OpenCL C/C++, CUDA C and the PGI Accelerator Model. In this paper, we apply these programming approaches to the software KegelSpan for simulating bevel gear cutting. This engineering application simulates an important manufacturing process in the automotive industry. The results obtained are compared to an OpenMP implementation on various hardware configurations. The discussion covers performance results as well as the productivity of the code development realized in this effort.
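To convey the low-level style these models involve compared to the OpenMP baseline, a minimal OpenCL C kernel (illustrative; not KegelSpan code):

    /* OpenCL C device kernel: each work-item handles one element.
       The host-side setup (platform, context, queue, buffers, build,
       launch) adds substantial further code, which is part of the
       development-productivity cost discussed in the paper. */
    __kernel void scale_add(__global float *y, __global const float *x,
                            const float a, const int n) {
        int i = get_global_id(0);
        if (i < n)
            y[i] = a * x[i] + y[i];
    }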
international supercomputing conference | 2013
Sandra Wienke; Dieter an Mey; Matthias S. Müller
Nowadays, HPC systems emerge in a great variety, including commodity processors with attached accelerators that promise to improve the performance-per-watt ratio. These heterogeneous architectures are often far more complex to employ. Therefore, a hardware purchase decision should take into account not only capital expenses and operational costs such as power consumption, but also manpower. In this work, we take a look at the total cost of ownership (TCO), which includes costs for administration and programming effort. From that, we compute the cost per program run, which can be used as a comparison metric for a purchase decision. In a case study, we evaluate our approach on two real-world simulation applications on Intel Xeon architectures, NVIDIA GPUs and Intel Xeon Phis using different programming models: OpenCL, OpenACC, OpenMP and Intel's Language Extensions for Offload.
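A minimal sketch of the resulting metric (symbols are illustrative; the paper's actual model is more detailed):

    \mathrm{TCO} = C_{\mathrm{acquisition}} + C_{\mathrm{operation}} + C_{\mathrm{administration}} + C_{\mathrm{programming}}, \qquad
    \mathrm{cost\ per\ run} = \frac{\mathrm{TCO}}{n_{\mathrm{runs}}}

where n_runs is the number of program runs achievable over the system's lifetime, so that two architecture/programming-model combinations can be compared by their cost per run.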
ieee international conference on high performance computing, data, and analytics | 2014
Sandra Wienke; Marcel Spekowius; Alesja Dammer; Dieter an Mey; Christian Hopmann; Matthias S. Müller
The simulation of the crystallisation process during the injection moulding of plastic components is time consuming, with the result that only small parts of a component can be simulated. To remove this constraint and enable the simulation of complex parts, the computing power of high-performance computers is required. A further design objective is high scalability in performance and memory consumption on today's and future high-performance computing architectures to allow precise predictions of global part properties. In this work, we present a simulation tool for the crystallisation process and the parallelisation of the tool by a hybrid MPI-Pthreads approach that meets this design objective. We verify the performance and memory consumption of our parallelisation using a large simulation area of a realistic plastic component as a case study, and can further predict that entire parts will also be computable.
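A hedged sketch of the hybrid MPI-Pthreads structure described above (the domain decomposition, names, and cell update are illustrative, not the tool's code):

    #include <mpi.h>
    #include <pthread.h>
    #include <stdlib.h>

    #define NTHREADS 4

    /* Each MPI rank owns a slab of the simulation domain; each Pthread
       updates one slice of that slab. */
    typedef struct { double *cells; int begin, end; } slice_t;

    static void *update(void *arg) {
        slice_t *s = (slice_t *)arg;
        for (int i = s->begin; i < s->end; ++i)
            s->cells[i] += 1.0;   /* placeholder for the crystallisation update */
        return NULL;
    }

    int main(int argc, char **argv) {
        int provided;
        /* FUNNELED: only the main thread makes MPI calls. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

        int local_n = 1 << 20;                 /* cells per rank (illustrative) */
        double *cells = calloc(local_n, sizeof(double));

        pthread_t tid[NTHREADS];
        slice_t sl[NTHREADS];
        int chunk = local_n / NTHREADS;
        for (int t = 0; t < NTHREADS; ++t) {
            sl[t] = (slice_t){ cells, t * chunk,
                               t == NTHREADS - 1 ? local_n : (t + 1) * chunk };
            pthread_create(&tid[t], NULL, update, &sl[t]);
        }
        for (int t = 0; t < NTHREADS; ++t)
            pthread_join(tid[t], NULL);

        /* Reduce a global quantity across all ranks. */
        double local_sum = 0.0, global_sum = 0.0;
        for (int i = 0; i < local_n; ++i) local_sum += cells[i];
        MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);

        free(cells);
        MPI_Finalize();
        return 0;
    }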
ieee international conference on high performance computing, data, and analytics | 2015
Sandra Wienke; Hristo Iliev; Dieter an Mey; Matthias S. Müller
In pursuit of exaflop computing, the expenses of HPC centers increase in terms of acquisition, energy, employment, and programming. Thus, a quantifiable metric for productivity as value per cost becomes more important for making an informed decision on how to invest available budgets. In this work, we model overall productivity from a computing center's perspective. The productivity model takes as value the number of application runs possible during the lifetime of a given supercomputer. The cost is the total cost of ownership (TCO) of an HPC center, including costs for administration and programming effort. For the latter, we include techniques for software cost estimation of large codes taken from the domain of software engineering. As tuning effort increases when more performance is required, we further focus on the impact of the 80-20 rule on development effort. Here, performance can be expressed with respect to Amdahl's law. Moreover, we include an asymptotic analysis for parameters like the number of compute nodes and lifetime. We evaluate our approach on a real-world case: an engineering application in our integrative hosting environment.
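In outline (a sketch using the quantities named above; the full model in the paper carries more parameters):

    \mathrm{productivity} = \frac{\mathrm{value}}{\mathrm{cost}} = \frac{n_{\mathrm{runs}}(\mathrm{lifetime})}{\mathrm{TCO}}, \qquad
    S(p) = \frac{1}{(1 - f) + f/p}

where n_runs is the number of application runs over the system's lifetime and S(p) is Amdahl's speedup for a code whose tunable fraction f is accelerated by a factor p; the 80-20 rule then concerns how much development effort the last increments of f and p demand.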
ieee international conference on high performance computing, data, and analytics | 2016
Guido Juckeland; Oscar R. Hernandez; Arpith C. Jacob; Daniel Neilson; Verónica G. Vergara Larrea; Sandra Wienke; Alexander Bobyr; William C. Brantley; Sunita Chandrasekaran; Mathew E. Colgrove; Alexander Grund; Robert Henschel; Wayne Joubert; Matthias S. Müller; Dave Raddatz; Pavel Shelepugin; Brian Whitney; Bo Wang; Kalyan Kumaran
Current and next-generation HPC systems will exploit accelerators and self-hosting devices within their compute nodes to accelerate applications. This comes at a time when programmer productivity and the ability to produce portable code have been recognized as major concerns. One of the goals of OpenMP and OpenACC is to allow the user to specify parallelism via directives so that compilers can generate device-specific code and optimizations. However, the challenge of porting codes becomes more complex because of the different types of parallelism and memory hierarchies available on different architectures. In this paper we discuss our experience with porting the SPEC ACCEL benchmarks from OpenACC to OpenMP 4.5 in a performance-portable style that lets the compiler make platform-specific optimizations to achieve good performance on a variety of systems. The ported SPEC ACCEL OpenMP benchmarks were validated on different platforms including Xeon Phi, GPUs and CPUs. We believe that this experience can help the community and compiler vendors understand how users plan to write OpenMP 4.5 applications in a performance-portable style.
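A hedged sketch of the kind of one-to-one translation such a port involves (directives per the OpenACC 2.x and OpenMP 4.5 specifications; the loop is illustrative, not a SPEC ACCEL kernel):

    /* Original OpenACC form: */
    #pragma acc data copyin(a[0:n]) copyout(b[0:n])
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i)
            b[i] = 2.0 * a[i];
    }

    /* Performance-portable OpenMP 4.5 form: the combined construct
       leaves the scheduling decisions to the compiler. */
    #pragma omp target data map(to: a[0:n]) map(from: b[0:n])
    {
        #pragma omp target teams distribute parallel for
        for (int i = 0; i < n; ++i)
            b[i] = 2.0 * a[i];
    }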
ieee international conference on high performance computing, data, and analytics | 2016
Marco Nicolini; Julian Miller; Sandra Wienke; Michael Schlottke-Lakemper; Matthias Meinke; Matthias S. Müller
Aeroacoustics simulations leverage the tremendous computational power of today's supercomputers, e.g., to predict the noise emissions of airplanes. The emergence of GPUs that are usable through directive-based programming models like OpenACC promises a cost-efficient solution for flow-induced noise simulations with respect to hardware expenditure and development time. However, OpenACC's capabilities for real-world C++ codes have scarcely been investigated so far, and software costs are rarely evaluated and modeled for this kind of high-performance project. In this paper, we present our OpenACC parallelization of ZFS, an aeroacoustics simulation framework written in C++, and early performance results. From our implementation work, we derive common pitfalls and lessons learned for real-world C++ codes using OpenACC. Furthermore, we borrow software cost estimation techniques from software engineering to evaluate the development effort needed in a directive-based HPC environment. We discuss the applicability and challenges of the popular COCOMO II model applied to the parallelization of ZFS.
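For reference, the core effort equation of the COCOMO II model mentioned above (post-architecture form; constants per the COCOMO II.2000 calibration):

    PM = A \cdot \mathrm{Size}^{E} \cdot \prod_i EM_i, \qquad
    E = B + 0.01 \sum_j SF_j

with effort PM in person-months, Size in KSLOC, effort multipliers EM_i, scale factors SF_j, and calibrated constants A \approx 2.94 and B \approx 0.91; the paper's discussion concerns how well these calibrations transfer to directive-based HPC parallelization.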