Roman Iakymchuk
RWTH Aachen University
Publications
Featured research published by Roman Iakymchuk.
IEEE International Conference on Cloud Computing Technology and Science | 2010
Paolo Bientinesi; Roman Iakymchuk; Jeff Napper
Computing as a utility has reached the mainstream. Scientists can now easily rent time on large commercial clusters that can be expanded and reduced on demand in real time. However, current commercial cloud computing performance falls short of systems specifically designed for scientific applications. Scientific computing needs are quite different from those of the web applications that have been the focus of cloud computing vendors. In this chapter we demonstrate, through empirical evaluation, the computational efficiency of high-performance numerical applications in a commercial cloud environment when resources are shared under high contention. Using the Linpack benchmark as a case study, we show that cache utilization becomes highly unpredictable and affects computation time in a correspondingly unpredictable way. For some problems, not only is it more efficient to underutilize resources, but the solution is also reached sooner in real time (wall time). We also show that the smallest, cheapest (64-bit) instance in the studied environment offers the best price-to-performance ratio. In light of the high contention we witness, we believe that alternative definitions of efficiency should be introduced for commercial cloud environments where strong performance guarantees do not exist. Concepts such as average and expected performance, expected execution time, expected cost to completion, and variance measures, traditionally ignored in the high-performance computing context, should now complement or even replace the standard definitions of efficiency.
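To make the efficiency concepts mentioned above concrete, the sketch below computes an expected execution time, its variance, and an expected cost to completion from a set of repeated runs; the sample wall times and the hourly instance price are illustrative assumptions, not figures from the paper.

```cpp
// Illustrative sketch (not from the paper): expected execution time, variance,
// and expected cost to completion for repeated runs of the same job under
// cloud contention. All numbers below are made up for demonstration.
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    // Hypothetical wall times (seconds) of the same Linpack run on a shared node.
    std::vector<double> wall_times = {412.0, 655.0, 389.0, 1020.0, 450.0};
    const double price_per_hour = 0.68;   // assumed hourly instance price

    double mean = 0.0;
    for (double t : wall_times) mean += t;
    mean /= wall_times.size();

    double var = 0.0;
    for (double t : wall_times) var += (t - mean) * (t - mean);
    var /= wall_times.size();

    // Expected cost to completion = expected run time (hours) * hourly price.
    const double expected_cost = (mean / 3600.0) * price_per_hour;

    std::printf("expected time: %.1f s, std dev: %.1f s, expected cost: $%.4f\n",
                mean, std::sqrt(var), expected_cost);
}
```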
ACM Symposium on Applied Computing | 2011
Roman Iakymchuk; Jeff Napper; Paolo Bientinesi
We investigate the effects of shared resources for high-performance computing in a commercial cloud environment where multiple virtual machines share a single hardware node. Although good performance is occasionally obtained, contention degrades the expected performance and introduces significant variance. Using the DGEMM kernel and the HPL benchmark, we show that underutilizing resources considerably improves expected performance by reducing contention for the CPU and cache space. For instance, for some cluster configurations, the solution is reached almost an order of magnitude earlier on average when the available resources are underutilized. The performance benefits for single-node computations are even more impressive: underutilization improves the expected execution time by two orders of magnitude. Finally, in contrast to unshared clusters, extending underutilized clusters by adding more nodes often improves the execution time due to increased parallelism, even with a slow interconnect. In the best case, underutilizing the nodes improved performance enough to entirely offset the cost of an extra node in the cluster.
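As an illustration of what "underutilizing" a shared node means in practice (this is not the benchmark code used in the paper), the sketch below runs a naive DGEMM-like kernel with only half of the available cores via OpenMP, the kind of configuration the study found can reduce contention for CPU and cache.

```cpp
// Illustrative sketch: deliberately underutilizing a node by running a
// DGEMM-like kernel with fewer OpenMP threads than hardware cores.
#include <omp.h>
#include <cstdio>
#include <vector>

void naive_dgemm(int n, const std::vector<double>& A,
                 const std::vector<double>& B, std::vector<double>& C) {
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < n; ++k)
            for (int j = 0; j < n; ++j)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}

int main() {
    const int n = 1024;
    std::vector<double> A(n * n, 1.0), B(n * n, 1.0), C(n * n, 0.0);

    const int cores = omp_get_num_procs();
    omp_set_num_threads(cores > 1 ? cores / 2 : 1);   // use only half the cores

    const double t0 = omp_get_wtime();
    naive_dgemm(n, A, B, C);
    std::printf("n=%d, threads=%d, time=%.3f s\n",
                n, omp_get_max_threads(), omp_get_wtime() - t0);
}
```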
Parallel Computing | 2015
Sylvain Collange; David Defour; Stef Graillat; Roman Iakymchuk
Highlights:
- A parallel algorithm to compute correctly rounded floating-point sums.
- Highly optimized implementations for modern CPUs, GPUs, and Xeon Phi.
- As fast as memory bandwidth allows for large sums with moderate dynamic range.
- Scales well with the problem size and the resources used on a cluster of compute nodes.

On modern multi-core, many-core, and heterogeneous architectures, floating-point computations, especially reductions, may become non-deterministic and, therefore, non-reproducible, mainly due to the non-associativity of floating-point operations. We introduce an approach to compute the correctly rounded sums of large floating-point vectors accurately and efficiently, achieving deterministic results by construction. Our multi-level algorithm consists of two main stages: first, a filtering stage that relies on fast vectorized floating-point expansions; second, an accumulation stage based on superaccumulators in a high-radix carry-save representation. We present implementations on recent Intel desktop and server processors, Intel Xeon Phi co-processors, and both AMD and NVIDIA GPUs. We show that numerical reproducibility and bit-perfect accuracy can be achieved at no additional cost for large sums that have dynamic ranges of up to 90 orders of magnitude by leveraging arithmetic units that are left underused by standard reduction algorithms.
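A minimal sketch of the filtering idea, not the authors' optimized implementation: Knuth's TwoSum error-free transformation propagates each input through a small floating-point expansion so no rounding error is lost; in the full algorithm, any remainder is flushed into a high-radix carry-save superaccumulator, which this sketch omits.

```cpp
// Sketch of the filtering stage: accumulate into a short floating-point
// expansion using the TwoSum error-free transformation.
#include <array>
#include <cstdio>
#include <vector>

// TwoSum: s = fl(a + b), e = exact rounding error, so a + b == s + e.
inline void two_sum(double a, double b, double& s, double& e) {
    s = a + b;
    const double z = s - a;
    e = (a - (s - z)) + (b - z);
}

double expansion_sum(const std::vector<double>& x) {
    std::array<double, 4> acc{};          // small expansion, most significant first
    for (double v : x) {
        double carry = v;
        for (double& a : acc) {           // propagate v through the expansion
            double e;
            two_sum(a, carry, a, e);
            carry = e;
        }
        // In the full algorithm a non-zero leftover carry would be added to the
        // superaccumulator; this sketch assumes it stays representable.
    }
    double result = 0.0;                  // fold from least to most significant
    for (auto it = acc.rbegin(); it != acc.rend(); ++it) result += *it;
    return result;
}

int main() {
    std::vector<double> x = {1e16, 1.0, -1e16, 1.0};   // a naive sum loses the 1.0s
    std::printf("expansion sum = %.1f\n", expansion_sum(x));  // prints 2.0
}
```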
International Conference on Information Technology: New Generations | 2015
Roman Iakymchuk; David Defour; Sylvain Collange; Stef Graillat
On modern parallel architectures, floating-point computations may become non-deterministic and, therefore, non-reproducible, mainly due to the non-associativity of floating-point operations. We propose an algorithm to solve dense triangular systems by leveraging the standard parallel triangular solver and our recently introduced multi-level exact summation approach. Finally, we present implementations of the proposed fast reproducible triangular solver and results on recent NVIDIA GPUs.
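For illustration, the sketch below shows plain forward substitution for a lower-triangular system; the paper combines the standard parallel solver with the authors' multi-level exact summation, whereas this stand-in uses a simple compensated (Kahan) accumulation for the row sums and runs serially on the CPU.

```cpp
// Illustrative serial forward substitution for L x = b with a compensated
// row accumulation (a stand-in for the paper's exact summation).
#include <cstdio>
#include <vector>

std::vector<double> forward_substitution(const std::vector<double>& L,
                                         const std::vector<double>& b, int n) {
    std::vector<double> x(n);
    for (int i = 0; i < n; ++i) {
        double sum = 0.0, comp = 0.0;                 // Kahan-compensated sum
        for (int j = 0; j < i; ++j) {
            const double y = -L[i * n + j] * x[j] - comp;
            const double t = sum + y;
            comp = (t - sum) - y;
            sum = t;
        }
        x[i] = (b[i] + sum) / L[i * n + i];
    }
    return x;
}

int main() {
    // 2x2 example: [2 0; 1 3] x = [2; 7]  ->  x = [1, 2]
    std::vector<double> L = {2, 0, 1, 3}, b = {2, 7};
    auto x = forward_substitution(L, b, 2);
    std::printf("x = [%.1f, %.1f]\n", x[0], x[1]);
}
```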
International Conference on Conceptual Structures | 2016
Stefano Markidis; Ivy Bo Peng; Roman Iakymchuk; Erwin Laure; Gokcen Kestor; Roberto Gioiosa
Streaming computing models allow for on-the-fly processing of large data sets. With the increased demand for processing large amounts of data in a reasonable period of time, streaming models are more ...
The Journal of Supercomputing | 2018
Peter Thoman; Kiril Dichev; Thomas Heller; Roman Iakymchuk; Xavier Aguilar; Khalid Hasanov; Philipp Gschwandtner; Pierre Lemarinier; Stefano Markidis; Herbert Jordan; Thomas Fahringer; Kostas Katrinis; Erwin Laure; Dimitrios S. Nikolopoulos
Task-based programming models for shared memory—such as Cilk Plus and OpenMP 3—are well established and documented. However, with the increase in parallel, many-core, and heterogeneous systems, a number of research-driven projects have developed more diversified task-based support, employing various programming and runtime features. Unfortunately, despite the fact that dozens of different task-based systems exist today and are actively used for parallel and high-performance computing (HPC), no comprehensive overview or classification of task-based technologies for HPC exists. In this paper, we provide an initial task-focused taxonomy for HPC technologies, which covers both programming interfaces and runtime mechanisms. We demonstrate the usefulness of our taxonomy by classifying state-of-the-art task-based environments in use today.
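As a reminder of what the surveyed task-based model looks like in its most established form, here is a minimal OpenMP tasking example; OpenMP is one of the shared-memory systems named above, and the snippet is illustrative rather than taken from the paper.

```cpp
// Minimal task-based example: work is expressed as tasks that the OpenMP
// runtime schedules onto the threads of the parallel region.
#include <cstdio>
#include <omp.h>

long fib(int n) {
    if (n < 2) return n;
    long a, b;
    #pragma omp task shared(a)
    a = fib(n - 1);
    #pragma omp task shared(b)
    b = fib(n - 2);
    #pragma omp taskwait            // wait for both child tasks
    return a + b;
}

int main() {
    long result;
    #pragma omp parallel
    #pragma omp single              // one thread spawns the root task
    result = fib(20);
    std::printf("fib(20) = %ld\n", result);
}
```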
International Parallel and Distributed Processing Symposium | 2017
Dana Akhmetova; Roman Iakymchuk; Örjan Ekeberg; Erwin Laure
With the large variety and complexity of existing HPC machines and uncertainty regarding the exact future Exascale hardware, it is not clear whether existing parallel scientific codes will perform well on future Exascale systems: they may need to be heavily modified or even completely rewritten from scratch. It is therefore important to ensure now that software is ready for Exascale computing and will utilize all Exascale resources well. Many parallel programming models try to take into account all possible hardware features and nuances. However, the HPC community does not yet have a precise answer as to whether Exascale computing calls for a natural evolution of existing, mutually interoperable models or for a disruptive approach. Here, we focus on the first option, particularly on a practical assessment of how some parallel programming models can coexist with each other. This work describes two API combination scenarios using the example of iPIC3D [26], an implicit Particle-in-Cell code for space weather applications written in C++ and MPI plus OpenMP. The first scenario enables multiple OpenMP threads to call MPI functions simultaneously, with no restrictions, using the MPI_THREAD_MULTIPLE thread-safety level. The second scenario utilizes the OpenMP tasking model on top of the first scenario. The paper reports a step-by-step methodology and experience with these API combinations in iPIC3D; provides scaling tests for these implementations with up to 2048 physical cores; discusses the interoperability issues encountered; and offers suggestions to programmers and scientists who may adopt these API combinations in their own codes.
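A hedged sketch of the first scenario, not code from iPIC3D: MPI is initialized with the MPI_THREAD_MULTIPLE thread-safety level, after which every OpenMP thread issues its own MPI calls concurrently and without mutual exclusion.

```cpp
// Hybrid MPI + OpenMP sketch: multiple threads call MPI simultaneously.
#include <cstdio>
#include <mpi.h>
#include <omp.h>

int main(int argc, char** argv) {
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        std::fprintf(stderr, "MPI_THREAD_MULTIPLE not supported\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Each OpenMP thread exchanges a message with the same thread id on the
    // neighbouring ranks, with no mutual exclusion around the MPI calls.
    #pragma omp parallel
    {
        const int tid  = omp_get_thread_num();
        const int next = (rank + 1) % size;
        const int prev = (rank - 1 + size) % size;
        int send = rank * 100 + tid, recv = -1;
        MPI_Sendrecv(&send, 1, MPI_INT, next, /*tag=*/tid,
                     &recv, 1, MPI_INT, prev, /*tag=*/tid,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::printf("rank %d thread %d received %d\n", rank, tid, recv);
    }

    MPI_Finalize();
    return 0;
}
```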
Scanning | 2014
Roman Iakymchuk; David Defour; Sylvain Collange; Stef Graillat
Due to the non-associativity of floating-point operations and dynamic scheduling on parallel architectures, obtaining a bit-wise reproducible floating-point result for multiple executions of the same code on different, or even similar, parallel architectures is challenging. In this paper, we address the problem of reproducibility in the context of matrix multiplication and propose an algorithm that yields both reproducible and accurate results. This algorithm is composed of two main stages: a filtering stage that uses fast vectorized floating-point expansions in conjunction with error-free transformations, and an accumulation stage based on Kulisch long accumulators in a high-radix carry-save representation. Finally, we provide implementations and performance results in parallel environments such as GPUs.
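To illustrate the filtering stage for products (this is not the authors' GPU implementation), the sketch below uses the FMA-based TwoProd error-free transformation to capture the exact rounding error of each multiplication; the full algorithm would feed both streams into Kulisch long accumulators, for which a long double serves here as a crude stand-in.

```cpp
// TwoProd-based filtering sketch: keep each product and its rounding error.
#include <cmath>
#include <cstdio>
#include <vector>

// TwoProd: p = fl(a*b), e = exact error, so a*b == p + e (requires an FMA).
inline void two_prod(double a, double b, double& p, double& e) {
    p = a * b;
    e = std::fma(a, b, -p);
}

// Dot product that keeps products and their errors in separate streams.
double dot_with_errors(const std::vector<double>& x, const std::vector<double>& y) {
    long double hi = 0.0L, lo = 0.0L;     // stand-in for the long accumulator
    for (std::size_t i = 0; i < x.size(); ++i) {
        double p, e;
        two_prod(x[i], y[i], p, e);
        hi += p;                          // main stream of products
        lo += e;                          // stream of rounding errors
    }
    return static_cast<double>(hi + lo);
}

int main() {
    const double a = 1.0 + std::ldexp(1.0, -27);   // 1 + 2^-27
    double p, e;
    two_prod(a, a, p, e);
    // a*a = 1 + 2^-26 + 2^-54; the 2^-54 term is the captured rounding error.
    std::printf("p = %.17g, e = %.17g\n", p, e);

    std::vector<double> x = {a, -a}, y = {a, a};
    std::printf("dot = %.17g\n", dot_with_errors(x, y));   // exactly 0
}
```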
International Conference on Parallel Processing | 2017
Peter Thoman; Khalid Hasanov; Kiril Dichev; Roman Iakymchuk; Xavier Aguilar; Philipp Gschwandtner; Pierre Lemarinier; Stefano Markidis; Herbert Jordan; Erwin Laure; Kostas Katrinis; Dimitrios S. Nikolopoulos; Thomas Fahringer
Task-based programming models for shared memory – such as Cilk Plus and OpenMP 3 – are well established and documented. However, with the increase in heterogeneous, many-core and parallel systems, a number of research-driven projects have developed more diversified task-based support, employing various programming and runtime features. Unfortunately, despite the fact that dozens of different task-based systems exist today and are actively used for parallel and high-performance computing, no comprehensive overview or classification of task-based technologies for HPC exists.
IEEE International Conference on High Performance Computing, Data and Analytics | 2011
Roman Iakymchuk; Paolo Bientinesi
We aim at modeling the performance of linear algebra algorithms without executing either the algorithms or any parts of them. The performance of an algorithm can be expressed in terms of the time spent on CPU execution and memory stalls. The main concern of the study is to build analytical models that accurately predict memory stalls. We construct an analytical formula for modeling the cache misses of fundamental linear algebra operations such as those included in the Basic Linear Algebra Subprograms (BLAS) library. The number of cache misses occurring in higher-level algorithms, such as a matrix factorization, is then predicted by combining the models for the appropriate BLAS subroutines. As case studies, we consider the LU factorization and GER, a BLAS operation and a building block of the LU factorization. We validate the models on both Intel and AMD processors, attaining remarkably accurate performance predictions.
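The sketch below shows the general shape of such a model rather than the paper's actual formulas: a hypothetical cache-miss count and made-up machine parameters (flop rate, miss penalty, cache-line width) are combined into a predicted kernel time, and the LU prediction is obtained by summing the GER model over the factorization's iterations.

```cpp
// Hedged sketch of an analytical performance model: CPU work plus memory
// stalls per kernel, composed over the steps of an unblocked LU factorization.
#include <cstdio>

struct Machine {
    double flops_per_sec;     // assumed sustained floating-point rate
    double miss_penalty_sec;  // assumed cost of one last-level cache miss
    double line_doubles;      // doubles per cache line
};

// Illustrative GER (rank-1 update) model: 2*m*n flops, and roughly one miss
// per cache line of the updated m x n matrix when it does not fit in cache.
double ger_time(int m, int n, const Machine& mc) {
    const double flops  = 2.0 * m * n;
    const double misses = static_cast<double>(m) * n / mc.line_doubles;  // crude assumption
    return flops / mc.flops_per_sec + misses * mc.miss_penalty_sec;
}

// Unblocked LU modeled as a sequence of GER updates on shrinking trailing matrices.
double lu_time(int n, const Machine& mc) {
    double t = 0.0;
    for (int k = 1; k < n; ++k) t += ger_time(n - k, n - k, mc);
    return t;
}

int main() {
    const Machine mc{8e9, 60e-9, 8};   // made-up machine parameters
    std::printf("predicted GER(4096): %.4f s\n", ger_time(4096, 4096, mc));
    std::printf("predicted LU(4096) : %.4f s\n", lu_time(4096, mc));
}
```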