Stefan Kral
Vienna University of Technology
Publications
Featured research published by Stefan Kral.
Conference on High Performance Computing (Supercomputing) | 2006
Francois Gygi; Erik W. Draeger; Martin Schulz; Bronis R. de Supinski; John A. Gunnels; Vernon Austel; James C. Sexton; Franz Franchetti; Stefan Kral; Christoph W. Ueberhuber; Juergen Lorenz
First-principles simulations of high-Z metallic systems using the Qbox code on the BlueGene/L supercomputer demonstrate unprecedented performance and scaling for a quantum simulation code. Specifically designed to take advantage of massively parallel systems like BlueGene/L, Qbox demonstrates excellent parallel efficiency and peak performance. A sustained peak performance of 207.3 TFlop/s was measured on 65,536 nodes, corresponding to 56.5% of the theoretical full-machine peak using all 128k CPUs.
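The quoted 56.5% figure can be checked from the machine parameters. As a minimal sketch (the 2.8 GFlop/s per-CPU peak is an assumption about BlueGene/L not stated in the abstract: a 700 MHz PowerPC 440 with a two-way fused multiply-add double FPU, i.e. 4 flops per cycle):

```c
/* Sanity check of the quoted figures. Assumption (not from the abstract):
   each BlueGene/L CPU peaks at 2.8 GFlop/s (700 MHz PowerPC 440 with a
   two-way fused multiply-add double FPU, 4 flops/cycle). The function
   names here are illustrative, not from the papers. */

/* Theoretical machine peak in Tflop/s. */
double bgl_peak_tflops(int nodes, int cpus_per_node) {
    return (double)nodes * cpus_per_node * 2.8e9 / 1e12;
}

/* Fraction of peak attained by a measured sustained rate. */
double sustained_fraction(double sustained_tflops, double peak_tflops) {
    return sustained_tflops / peak_tflops;
}

/* bgl_peak_tflops(65536, 2) gives about 367 Tflop/s for all 128k CPUs;
   sustained_fraction(207.3, 367.0) gives about 0.565, matching the
   56.5% figure quoted above. */
```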
Proceedings of the IEEE | 2005
Franz Franchetti; Stefan Kral; Juergen Lorenz; Christoph W. Ueberhuber
This paper targets automatic performance tuning of numerical kernels in the presence of multilayered memory hierarchies and single-instruction, multiple-data (SIMD) parallelism. The studied SIMD instruction set extensions include Intel's SSE family, AMD's 3DNow!, Motorola's AltiVec, and IBM's BlueGene/L SIMD instructions. FFTW, ATLAS, and SPIRAL demonstrate that near-optimal performance of numerical kernels across a variety of modern computers featuring deep memory hierarchies can be achieved only by means of automatic performance tuning. These software packages generate and optimize ANSI C code and feed it into the target machine's general-purpose C compiler to maintain portability. The scalar C code produced by performance tuning systems poses a severe challenge for vectorizing compilers. The particular code structure hampers automatic vectorization and, thus, inhibits satisfactory performance on processors featuring short vector extensions. This paper describes special-purpose compiler technology that supports automatic performance tuning on machines with vector instructions. The work described includes: 1) symbolic vectorization of digital signal processing transforms; 2) straight-line code vectorization for numerical kernels; and 3) compiler back ends for straight-line code with vector instructions. Methods from all three areas were combined with FFTW, SPIRAL, and ATLAS to optimize both for memory hierarchy and vector instructions. Experiments show that the presented methods lead to substantial speedups (up to 1.8 for two-way and 3.3 for four-way vector extensions) over the best scalar C codes generated by the original systems, as well as roughly matching the performance of hand-tuned vendor libraries.
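The core idea of straight-line code vectorization (point 2 above) is to pair isomorphic scalar statements in a basic block and replace each pair with one two-way SIMD instruction. A minimal hand-written sketch of that transformation using SSE2 intrinsics (the function names are hypothetical, not from the papers):

```c
#include <emmintrin.h>  /* SSE2 intrinsics (two-way double SIMD) */

/* Scalar straight-line code of the kind emitted by FFTW/SPIRAL codelet
   generators: two independent, isomorphic additions. */
void add2_scalar(const double *a, const double *b, double *t) {
    t[0] = a[0] + b[0];
    t[1] = a[1] + b[1];
}

/* The same computation after pairing: one two-way SIMD add replaces
   both scalar adds. */
void add2_simd(const double *a, const double *b, double *t) {
    __m128d va = _mm_loadu_pd(a);         /* load a[0], a[1]            */
    __m128d vb = _mm_loadu_pd(b);         /* load b[0], b[1]            */
    _mm_storeu_pd(t, _mm_add_pd(va, vb)); /* t[0..1] = a[0..1] + b[0..1] */
}
```

A modern compiler may auto-vectorize this trivial case on its own; the papers' point is that real codelet basic blocks contain hundreds of statements with interleaved shuffles and sign changes, which defeat auto-vectorizers, so the pairing must be found by a dedicated search.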
Conference on High Performance Computing (Supercomputing) | 2005
Francois Gygi; Robert Kim Yates; Juergen Lorenz; Erik W. Draeger; Franz Franchetti; Christoph W. Ueberhuber; Bronis R. de Supinski; Stefan Kral; John A. Gunnels; James C. Sexton
We demonstrate that the Qbox code supports unprecedented large-scale First-Principles Molecular Dynamics (FPMD) applications on the BlueGene/L supercomputer. Qbox is an FPMD implementation specifically designed for large-scale parallel platforms such as BlueGene/L. Strong scaling tests for a Materials Science application show an 86% scaling efficiency between 1024 and 32,768 CPUs. Measurements of performance by means of hardware counters show that 36% of the peak FPU performance can be attained.
IBM Journal of Research and Development | 2005
Juergen Lorenz; Stefan Kral; Franz Franchetti; Christoph W. Ueberhuber
This paper presents vectorization techniques tailored to meet the specifics of the two-way single-instruction multiple-data (SIMD) double-precision floating-point unit (FPU), which is a core element of the node application-specific integrated circuit (ASIC) chips of the IBM 360-teraflops Blue Gene®/L supercomputer. This paper focuses on the general-purpose basic-block vectorization and optimization methods as they are incorporated in the Vienna MAP vectorizer and optimizer. The innovative technologies presented here, which have consistently delivered superior performance and portability across a wide range of platforms, were carried over to prototypes of Blue Gene/L and joined with the automatic performance-tuning system known as Fastest Fourier Transform in the West (FFTW). FFTW performance-optimization facilities working with the compiler technologies presented in this paper are able to produce vectorized fast Fourier transform (FFT) codes that are tuned automatically to single Blue Gene/L processors and are up to 80% faster than the best-performing scalar FFT codes generated by FFTW.
European Conference on Parallel Processing | 2003
Stefan Kral; Franz Franchetti; Juergen Lorenz; Christoph W. Ueberhuber
This paper presents compiler technology that targets general-purpose microprocessors augmented with SIMD execution units for exploiting data-level parallelism. FFT kernels are accelerated by automatically vectorizing blocks of straight-line code for processors featuring two-way short vector SIMD extensions like AMD's 3DNow! and Intel's SSE2. Additionally, a special compiler backend is introduced which is able to (i) utilize particular code properties, (ii) generate optimized address computation, and (iii) apply specialized register allocation and instruction scheduling.
Compiler Construction | 2004
Stefan Kral; Franz Franchetti; Juergen Lorenz; Christoph W. Ueberhuber; Peter Wurzinger
This paper presents compiler technology that targets general-purpose microprocessors augmented with SIMD execution units for exploiting data-level parallelism. Numerical applications are accelerated by automatically vectorizing blocks of straight-line code to be run on processors featuring two-way short vector SIMD extensions like Intel's SSE2 on Pentium 4, SSE3 on Intel Prescott, AMD's 3DNow!, and IBM's SIMD operations implemented on the new processors of the BlueGene/L supercomputer.
IEEE International Conference on High Performance Computing, Data, and Analytics | 2008
Bronis R. de Supinski; Martin Schulz; Vasily V. Bulatov; William H. Cabot; Bor Chan; Andrew W. Cook; Erik W. Draeger; James N. Glosli; Jeffrey Greenough; Keith Henderson; Alison Kubota; Steve Louis; Brian Miller; Mehul Patel; Thomas E. Spelce; Frederick H. Streitz; Peter L. Williams; Robert Kim Yates; Andy Yoo; George S. Almasi; Gyan Bhanot; Alan Gara; John A. Gunnels; Manish Gupta; José E. Moreira; James C. Sexton; Bob Walkup; Charles J. Archer; Francois Gygi; Timothy C. Germann
BlueGene/L (BG/L), developed through a partnership between IBM and Lawrence Livermore National Laboratory (LLNL), is currently the world's largest system both in terms of scale, with 131,072 processors, and absolute performance, with a peak rate of 367 Tflop/s. BG/L has led the last four Top500 lists with a Linpack rate of 280.6 Tflop/s for the full machine installed at LLNL and is expected to remain the fastest computer in the next few editions. However, the real value of a machine such as BG/L derives from the scientific breakthroughs that real applications can produce by successfully using its unprecedented scale and computational power. In this paper, we describe our experiences with eight large-scale applications on BG/L from several application domains, ranging from molecular dynamics to dislocation dynamics and turbulence simulations to searches in semantic graphs. We also discuss the challenges we faced when scaling these codes and present several successful optimization techniques. All applications show excellent scaling behavior, even at very large processor counts, with one code even achieving a sustained performance of more than 100 Tflop/s, clearly demonstrating the real success of the BG/L design.
IEEE International Conference on High Performance Computing, Data, and Analytics | 2004
Franz Franchetti; Stefan Kral; Juergen Lorenz; Markus Püschel; Christoph W. Ueberhuber
IBM is currently developing the new line of BlueGene/L supercomputers. The top-of-the-line installation is planned to be a 65,536-processor system featuring a peak performance of 360 Tflop/s. This system is supposed to lead the Top 500 list when being installed in 2005 at the Lawrence Livermore National Laboratory. This paper presents one of the first numerical kernels run on a prototype BlueGene/L machine. We tuned our formal vectorization approach as well as the Vienna MAP vectorizer to support BlueGene/L's custom two-way short vector SIMD "double" floating-point unit and connected the resulting methods to the automatic performance tuning systems Spiral and Fftw. Our approach produces automatically tuned high-performance FFT kernels for BlueGene/L that are up to 45% faster than the best scalar Spiral-generated code and up to 75% faster than Fftw when run on a single BlueGene/L processor.
International Conference on Parallel Processing | 2006
Stefan Kral; Markus Triska; Christoph W. Ueberhuber
Standard compilers are incapable of fully harnessing the enormous performance potential of Blue Gene systems. To reach the leading position in the Top500 supercomputing list, IBM had to put considerable effort into coding and tuning a limited range of low-level numerical kernel routines by hand. In this paper the Vienna MAP compiler is presented, which particularly targets signal transform codes ubiquitous in compute-intensive scientific applications. Compiling Fftw code, MAP reaches as much as 80% of the optimum performance of Blue Gene systems. In an application code, MAP enabled a sustained performance of 60 Tflop/s to be reached on BlueGene/L.
International Conference on Acoustics, Speech, and Signal Processing | 2001
Franz Franchetti; Herbert Karner; Stefan Kral; Christoph W. Ueberhuber