Roman Iakymchuk
RWTH Aachen University
Publications
Featured research published by Roman Iakymchuk.
IEEE International Conference on Cloud Computing Technology and Science | 2010
Paolo Bientinesi; Roman Iakymchuk; Jeff Napper
Computing as a utility has reached the mainstream. Scientists can now easily rent time on large commercial clusters that can be expanded and reduced on demand in real time. However, current commercial cloud computing performance falls short of systems specifically designed for scientific applications. Scientific computing needs are quite different from those of the web applications that have been the focus of cloud computing vendors. In this chapter we demonstrate, through empirical evaluation, the computational efficiency of high-performance numerical applications in a commercial cloud environment when resources are shared under high contention. Using the Linpack benchmark as a case study, we show that cache utilization becomes highly unpredictable and affects computation time in a correspondingly unpredictable way. For some problems, not only is it more efficient to underutilize resources, but the solution is also reached sooner in real time (wall time). We also show that the smallest, cheapest (64-bit) instance in the studied environment offers the best price-to-performance ratio. In light of the high contention we witness, we believe that alternative definitions of efficiency should be introduced for commercial cloud environments where strong performance guarantees do not exist. Concepts such as average and expected performance, expected execution time, expected cost to completion, and variance measures, traditionally ignored in the high-performance computing context, should now complement or even replace the standard definitions of efficiency.
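To make the efficiency concepts mentioned above concrete, the sketch below computes an expected execution time, its variance, and an expected cost to completion from a set of repeated runs; the sample wall times and the hourly instance price are illustrative assumptions, not figures from the paper.

```cpp
// Illustrative sketch (not from the paper): expected execution time, variance,
// and expected cost to completion for repeated runs of the same job under
// cloud contention. All numbers below are made up for demonstration.
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    // Hypothetical wall times (seconds) of the same Linpack run on a shared node.
    std::vector<double> wall_times = {412.0, 655.0, 389.0, 1020.0, 450.0};
    const double price_per_hour = 0.68;   // assumed hourly instance price

    double mean = 0.0;
    for (double t : wall_times) mean += t;
    mean /= wall_times.size();

    double var = 0.0;
    for (double t : wall_times) var += (t - mean) * (t - mean);
    var /= wall_times.size();

    // Expected cost to completion = expected run time (hours) * hourly price.
    const double expected_cost = (mean / 3600.0) * price_per_hour;

    std::printf("expected time: %.1f s, std dev: %.1f s, expected cost: $%.4f\n",
                mean, std::sqrt(var), expected_cost);
}
```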
ACM Symposium on Applied Computing | 2011
Roman Iakymchuk; Jeff Napper; Paolo Bientinesi
We investigate the effects of shared resources for high-performance computing in a commercial cloud environment where multiple virtual machines share a single hardware node. Although good performance is occasionally obtained, contention degrades the expected performance and introduces significant variance. Using the DGEMM kernel and the HPL benchmark, we show that underutilizing resources considerably improves expected performance by reducing contention for the CPU and cache space. For instance, for some cluster configurations, the solution is reached almost an order of magnitude earlier on average when the available resources are underutilized. The performance benefits for single-node computations are even more impressive: underutilization improves the expected execution time by two orders of magnitude. Finally, in contrast to unshared clusters, extending underutilized clusters by adding more nodes often improves the execution time due to increased parallelism, even with a slow interconnect. In the best case, underutilizing the nodes improved performance enough to entirely offset the cost of an extra node in the cluster.
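As an illustration of what "underutilizing" a shared node means in practice (this is not the benchmark code used in the paper), the sketch below runs a naive DGEMM-like kernel with only half of the available cores via OpenMP, the kind of configuration the study found can reduce contention for CPU and cache.

```cpp
// Illustrative sketch: deliberately underutilizing a node by running a
// DGEMM-like kernel with fewer OpenMP threads than hardware cores.
#include <omp.h>
#include <cstdio>
#include <vector>

void naive_dgemm(int n, const std::vector<double>& A,
                 const std::vector<double>& B, std::vector<double>& C) {
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < n; ++k)
            for (int j = 0; j < n; ++j)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}

int main() {
    const int n = 1024;
    std::vector<double> A(n * n, 1.0), B(n * n, 1.0), C(n * n, 0.0);

    const int cores = omp_get_num_procs();
    omp_set_num_threads(cores > 1 ? cores / 2 : 1);   // use only half the cores

    const double t0 = omp_get_wtime();
    naive_dgemm(n, A, B, C);
    std::printf("n=%d, threads=%d, time=%.3f s\n",
                n, omp_get_max_threads(), omp_get_wtime() - t0);
}
```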
Parallel Computing | 2015
Sylvain Collange; David Defour; Stef Graillat; Roman Iakymchuk
Highlights:
- A parallel algorithm to compute correctly rounded floating-point sums.
- Highly optimized implementations for modern CPUs, GPUs, and Xeon Phi.
- As fast as memory bandwidth allows for large sums with moderate dynamic range.
- Scales well with the problem size and the resources used on a cluster of compute nodes.

On modern multi-core, many-core, and heterogeneous architectures, floating-point computations, especially reductions, may become non-deterministic and, therefore, non-reproducible, mainly due to the non-associativity of floating-point operations. We introduce an approach to compute the correctly rounded sums of large floating-point vectors accurately and efficiently, achieving deterministic results by construction. Our multi-level algorithm consists of two main stages: first, a filtering stage that relies on fast vectorized floating-point expansions; second, an accumulation stage based on superaccumulators in a high-radix carry-save representation. We present implementations on recent Intel desktop and server processors, Intel Xeon Phi co-processors, and both AMD and NVIDIA GPUs. We show that numerical reproducibility and bit-perfect accuracy can be achieved at no additional cost for large sums that have dynamic ranges of up to 90 orders of magnitude by leveraging arithmetic units that are left underused by standard reduction algorithms.
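A minimal sketch of the filtering idea, not the authors' optimized implementation: Knuth's TwoSum error-free transformation propagates each input through a small floating-point expansion so no rounding error is lost; in the full algorithm, any remainder is flushed into a high-radix carry-save superaccumulator, which this sketch omits.

```cpp
// Sketch of the filtering stage: accumulate into a short floating-point
// expansion using the TwoSum error-free transformation.
#include <array>
#include <cstdio>
#include <vector>

// TwoSum: s = fl(a + b), e = exact rounding error, so a + b == s + e.
inline void two_sum(double a, double b, double& s, double& e) {
    s = a + b;
    const double z = s - a;
    e = (a - (s - z)) + (b - z);
}

double expansion_sum(const std::vector<double>& x) {
    std::array<double, 4> acc{};          // small expansion, most significant first
    for (double v : x) {
        double carry = v;
        for (double& a : acc) {           // propagate v through the expansion
            double e;
            two_sum(a, carry, a, e);
            carry = e;
        }
        // In the full algorithm a non-zero leftover carry would be added to the
        // superaccumulator; this sketch assumes it stays representable.
    }
    double result = 0.0;                  // fold from least to most significant
    for (auto it = acc.rbegin(); it != acc.rend(); ++it) result += *it;
    return result;
}

int main() {
    std::vector<double> x = {1e16, 1.0, -1e16, 1.0};   // a naive sum loses the 1.0s
    std::printf("expansion sum = %.1f\n", expansion_sum(x));  // prints 2.0
}
```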
International Conference on Information Technology: New Generations | 2015
Roman Iakymchuk; David Defour; Sylvain Collange; Stef Graillat
On modern parallel architectures, floating-point computations may become non-deterministic and, therefore, non-reproducible, mainly due to the non-associativity of floating-point operations. We propose an algorithm to solve dense triangular systems by leveraging the standard parallel triangular solver and our recently introduced multi-level exact summation approach. Finally, we present implementations of the proposed fast reproducible triangular solver and results on recent NVIDIA GPUs.
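For illustration, the sketch below shows plain forward substitution for a lower-triangular system; the paper combines the standard parallel solver with the authors' multi-level exact summation, whereas this stand-in uses a simple compensated (Kahan) accumulation for the row sums and runs serially on the CPU.

```cpp
// Illustrative serial forward substitution for L x = b with a compensated
// row accumulation (a stand-in for the paper's exact summation).
#include <cstdio>
#include <vector>

std::vector<double> forward_substitution(const std::vector<double>& L,
                                         const std::vector<double>& b, int n) {
    std::vector<double> x(n);
    for (int i = 0; i < n; ++i) {
        double sum = 0.0, comp = 0.0;                 // Kahan-compensated sum
        for (int j = 0; j < i; ++j) {
            const double y = -L[i * n + j] * x[j] - comp;
            const double t = sum + y;
            comp = (t - sum) - y;
            sum = t;
        }
        x[i] = (b[i] + sum) / L[i * n + i];
    }
    return x;
}

int main() {
    // 2x2 example: [2 0; 1 3] x = [2; 7]  ->  x = [1, 2]
    std::vector<double> L = {2, 0, 1, 3}, b = {2, 7};
    auto x = forward_substitution(L, b, 2);
    std::printf("x = [%.1f, %.1f]\n", x[0], x[1]);
}
```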
International Conference on Conceptual Structures | 2016
Stefano Markidis; Ivy Bo Peng; Roman Iakymchuk; Erwin Laure; Gokcen Kestor; Roberto Gioiosa
Streaming computing models allow for on-the-fly processing of large data sets. With the increased demand for processing large amounts of data in a reasonable period of time, streaming models are more ...
The Journal of Supercomputing | 2018
Peter Thoman; Kiril Dichev; Thomas Heller; Roman Iakymchuk; Xavier Aguilar; Khalid Hasanov; Philipp Gschwandtner; Pierre Lemarinier; Stefano Markidis; Herbert Jordan; Thomas Fahringer; Kostas Katrinis; Erwin Laure; Dimitrios S. Nikolopoulos
Task-based programming models for shared memory—such as Cilk Plus and OpenMP 3—are well established and documented. However, with the increase in parallel, many-core, and heterogeneous systems, a number of research-driven projects have developed more diversified task-based support, employing various programming and runtime features. Unfortunately, despite the fact that dozens of different task-based systems exist today and are actively used for parallel and high-performance computing (HPC), no comprehensive overview or classification of task-based technologies for HPC exists. In this paper, we provide an initial task-focused taxonomy for HPC technologies, which covers both programming interfaces and runtime mechanisms. We demonstrate the usefulness of our taxonomy by classifying state-of-the-art task-based environments in use today.
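As a reminder of what the surveyed task-based model looks like in its most established form, here is a minimal OpenMP tasking example; OpenMP is one of the shared-memory systems named above, and the snippet is illustrative rather than taken from the paper.

```cpp
// Minimal task-based example: work is expressed as tasks that the OpenMP
// runtime schedules onto the threads of the parallel region.
#include <cstdio>
#include <omp.h>

long fib(int n) {
    if (n < 2) return n;
    long a, b;
    #pragma omp task shared(a)
    a = fib(n - 1);
    #pragma omp task shared(b)
    b = fib(n - 2);
    #pragma omp taskwait            // wait for both child tasks
    return a + b;
}

int main() {
    long result;
    #pragma omp parallel
    #pragma omp single              // one thread spawns the root task
    result = fib(20);
    std::printf("fib(20) = %ld\n", result);
}
```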
International Parallel and Distributed Processing Symposium | 2017
Dana Akhmetova; Roman Iakymchuk; Örjan Ekeberg; Erwin Laure
With the large variety and complexity of existing HPC machines and uncertainty regarding the exact future Exascale hardware, it is not clear whether existing parallel scientific codes will perform well on future Exascale systems: they may need to be heavily modified or even completely rewritten from scratch. It is therefore important to ensure now that software is ready for Exascale computing and will utilize all Exascale resources well. Many parallel programming models try to take into account all possible hardware features and nuances. However, the HPC community does not yet have a precise answer as to whether Exascale computing calls for a natural evolution of existing, mutually interoperable models or for a disruptive approach. Here, we focus on the first option, particularly on a practical assessment of how some parallel programming models can coexist with each other. This work describes two API combination scenarios using the example of iPIC3D [26], an implicit Particle-in-Cell code for space weather applications written in C++ and MPI plus OpenMP. The first scenario enables multiple OpenMP threads to call MPI functions simultaneously, with no restrictions, using the MPI_THREAD_MULTIPLE thread-safety level. The second scenario utilizes the OpenMP tasking model on top of the first scenario. The paper reports a step-by-step methodology and experience with these API combinations in iPIC3D; provides scaling tests for these implementations with up to 2048 physical cores; discusses the interoperability issues encountered; and offers suggestions to programmers and scientists who may adopt these API combinations in their own codes.
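A hedged sketch of the first scenario, not code from iPIC3D: MPI is initialized with the MPI_THREAD_MULTIPLE thread-safety level, after which every OpenMP thread issues its own MPI calls concurrently and without mutual exclusion.

```cpp
// Hybrid MPI + OpenMP sketch: multiple threads call MPI simultaneously.
#include <cstdio>
#include <mpi.h>
#include <omp.h>

int main(int argc, char** argv) {
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        std::fprintf(stderr, "MPI_THREAD_MULTIPLE not supported\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Each OpenMP thread exchanges a message with the same thread id on the
    // neighbouring ranks, with no mutual exclusion around the MPI calls.
    #pragma omp parallel
    {
        const int tid  = omp_get_thread_num();
        const int next = (rank + 1) % size;
        const int prev = (rank - 1 + size) % size;
        int send = rank * 100 + tid, recv = -1;
        MPI_Sendrecv(&send, 1, MPI_INT, next, /*tag=*/tid,
                     &recv, 1, MPI_INT, prev, /*tag=*/tid,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::printf("rank %d thread %d received %d\n", rank, tid, recv);
    }

    MPI_Finalize();
    return 0;
}
```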
Scanning | 2014
Roman Iakymchuk; David Defour; Sylvain Collange; Stef Graillat
Due to the non-associativity of floating-point operations and dynamic scheduling on parallel architectures, obtaining a bit-wise reproducible floating-point result for multiple executions of the same code on different, or even similar, parallel architectures is challenging. In this paper, we address the problem of reproducibility in the context of matrix multiplication and propose an algorithm that yields both reproducible and accurate results. This algorithm is composed of two main stages: a filtering stage that uses fast vectorized floating-point expansions in conjunction with error-free transformations, and an accumulation stage based on Kulisch long accumulators in a high-radix carry-save representation. Finally, we provide implementations and performance results in parallel environments such as GPUs.
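To illustrate the filtering stage for products (this is not the authors' GPU implementation), the sketch below uses the FMA-based TwoProd error-free transformation to capture the exact rounding error of each multiplication; the full algorithm would feed both streams into Kulisch long accumulators, for which a long double serves here as a crude stand-in.

```cpp
// TwoProd-based filtering sketch: keep each product and its rounding error.
#include <cmath>
#include <cstdio>
#include <vector>

// TwoProd: p = fl(a*b), e = exact error, so a*b == p + e (requires an FMA).
inline void two_prod(double a, double b, double& p, double& e) {
    p = a * b;
    e = std::fma(a, b, -p);
}

// Dot product that keeps products and their errors in separate streams.
double dot_with_errors(const std::vector<double>& x, const std::vector<double>& y) {
    long double hi = 0.0L, lo = 0.0L;     // stand-in for the long accumulator
    for (std::size_t i = 0; i < x.size(); ++i) {
        double p, e;
        two_prod(x[i], y[i], p, e);
        hi += p;                          // main stream of products
        lo += e;                          // stream of rounding errors
    }
    return static_cast<double>(hi + lo);
}

int main() {
    const double a = 1.0 + std::ldexp(1.0, -27);   // 1 + 2^-27
    double p, e;
    two_prod(a, a, p, e);
    // a*a = 1 + 2^-26 + 2^-54; the 2^-54 term is the captured rounding error.
    std::printf("p = %.17g, e = %.17g\n", p, e);

    std::vector<double> x = {a, -a}, y = {a, a};
    std::printf("dot = %.17g\n", dot_with_errors(x, y));   // exactly 0
}
```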
International Conference on Parallel Processing | 2017
Peter Thoman; Khalid Hasanov; Kiril Dichev; Roman Iakymchuk; Xavier Aguilar; Philipp Gschwandtner; Pierre Lemarinier; Stefano Markidis; Herbert Jordan; Erwin Laure; Kostas Katrinis; Dimitrios S. Nikolopoulos; Thomas Fahringer
Task-based programming models for shared memory – such as Cilk Plus and OpenMP 3 – are well established and documented. However, with the increase in heterogeneous, many-core and parallel systems, a number of research-driven projects have developed more diversified task-based support, employing various programming and runtime features. Unfortunately, despite the fact that dozens of different task-based systems exist today and are actively used for parallel and high-performance computing, no comprehensive overview or classification of task-based technologies for HPC exists.
IEEE International Conference on High Performance Computing, Data and Analytics | 2011
Roman Iakymchuk; Paolo Bientinesi
We aim at modeling the performance of linear algebra algorithms without executing either the algorithms or any parts of them. The performance of an algorithm can be expressed in terms of the time spent on CPU execution and memory stalls. The main concern of the study is to build analytical models that accurately predict memory stalls. We construct an analytical formula for modeling the cache misses of fundamental linear algebra operations such as those included in the Basic Linear Algebra Subprograms (BLAS) library. The number of cache misses occurring in higher-level algorithms, such as a matrix factorization, is then predicted by combining the models for the appropriate BLAS subroutines. As case studies, we consider the LU factorization and GER, a BLAS operation and a building block of the LU factorization. We validate the models on both Intel and AMD processors, attaining remarkably accurate performance predictions.
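The sketch below shows the general shape of such a model rather than the paper's actual formulas: a hypothetical cache-miss count and made-up machine parameters (flop rate, miss penalty, cache-line width) are combined into a predicted kernel time, and the LU prediction is obtained by summing the GER model over the factorization's iterations.

```cpp
// Hedged sketch of an analytical performance model: CPU work plus memory
// stalls per kernel, composed over the steps of an unblocked LU factorization.
#include <cstdio>

struct Machine {
    double flops_per_sec;     // assumed sustained floating-point rate
    double miss_penalty_sec;  // assumed cost of one last-level cache miss
    double line_doubles;      // doubles per cache line
};

// Illustrative GER (rank-1 update) model: 2*m*n flops, and roughly one miss
// per cache line of the updated m x n matrix when it does not fit in cache.
double ger_time(int m, int n, const Machine& mc) {
    const double flops  = 2.0 * m * n;
    const double misses = static_cast<double>(m) * n / mc.line_doubles;  // crude assumption
    return flops / mc.flops_per_sec + misses * mc.miss_penalty_sec;
}

// Unblocked LU modeled as a sequence of GER updates on shrinking trailing matrices.
double lu_time(int n, const Machine& mc) {
    double t = 0.0;
    for (int k = 1; k < n; ++k) t += ger_time(n - k, n - k, mc);
    return t;
}

int main() {
    const Machine mc{8e9, 60e-9, 8};   // made-up machine parameters
    std::printf("predicted GER(4096): %.4f s\n", ger_time(4096, 4096, mc));
    std::printf("predicted LU(4096) : %.4f s\n", lu_time(4096, mc));
}
```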