Lubomír Říha
Technical University of Ostrava
Publication
Featured research published by Lubomír Říha.
Computing | 2017
Joseph Schuchart; Michael Gerndt; Per Gunnar Kjeldsberg; Michael Lysaght; David Horák; Lubomír Říha; Andreas Gocht; Mohammed Sourouri; Madhura Kumaraswamy; Anamika Chowdhury; Magnus Jahre; Kai Diethelm; Othman Bouizi; Umbreen Sabir Mian; Jakub Kružík; Radim Sojka; Martin Beseda; Venkatesh Kannan; Zakaria Bendifallah; Daniel Hackenberg; Wolfgang E. Nagel
Energy efficiency is an important aspect of future exascale systems, mainly due to rising energy cost. Although High performance computing (HPC) applications are compute centric, they still exhibit varying computational characteristics in different regions of the program, such as compute-, memory-, and I/O-bound code regions. Some of today’s clusters already offer mechanisms to adjust the system to the resource requirements of an application, e.g., by controlling the CPU frequency. However, manually tuning for improved energy efficiency is a tedious and painstaking task that is often neglected by application developers. The European Union’s Horizon 2020 project READEX (Runtime Exploitation of Application Dynamism for Energy-efficient eXascale computing) aims at developing a tools-aided approach for improved energy efficiency of current and future HPC applications. To reach this goal, the READEX project combines technologies from two ends of the compute spectrum, embedded systems and HPC, constituting a split design-time/runtime methodology. From the HPC domain, the Periscope Tuning Framework (PTF) is extended to perform dynamic auto-tuning of fine-grained application regions using the systems scenario methodology, which was originally developed for improving the energy efficiency in embedded systems. This paper introduces the concepts of the READEX project, its envisioned implementation, and preliminary results that demonstrate the feasibility of this approach.
Proceedings of the Platform for Advanced Scientific Computing Conference on | 2016
Lubomír Říha; Tomáš Brzobohatý; Alexandros Markopoulos; Ondřej Meca; Tomáš Kozubek
This paper describes the Hybrid Total FETI (HTFETI) method and its parallel implementation in the ESPRESO library. HTFETI is a variant of the FETI-type domain decomposition method in which a small number of neighboring subdomains is aggregated into clusters. This can also be viewed as a multilevel decomposition approach that results in a smaller coarse problem, the main scalability bottleneck of the FETI and FETI-DP methods. The efficiency of our implementation, which employs hybrid parallelization in the form of MPI and Cilk++, is evaluated using both weak and strong scalability tests. The weak scalability of the solver is shown on a three-dimensional linear elasticity problem of up to 30 billion degrees of freedom (DOF) executed on 4096 compute nodes. The strong scalability is evaluated on a problem of 2.6 billion DOF scaled from 1000 to 4913 compute nodes. The results show super-linear scaling of the single-iteration time and linear scalability of the solver runtime. The latter combines both numerical and parallel scalability and shows the overall HTFETI solver performance. The large-scale tests use our own parallel synthetic benchmark generator, which is also described in the paper. The last set of results shows that HTFETI is very efficient for problems of up to 1.7 billion DOF and provides a better time to solution than the TFETI method.
Parallel Computing | 2016
Lubomír Říha; Tomáš Brzobohatý; Alexandros Markopoulos; Marta Jarošová; Tomáš Kozubek; David Horák; Václav Hapla
Highlights: Implementation, performance, and scalability results of a communication layer for the Total FETI and Hybrid Total FETI solvers. In HTFETI, several neighboring subdomains are aggregated into clusters, which reduces the size of the coarse problem and improves scalability. Optimization of the nearest-neighbor communication (multiplication with a global gluing matrix). Implementation of communication hiding and avoiding techniques inside the communication layer. Benchmarks: an elastic 3D cube of up to 1.6 billion DOF and a realistic car engine benchmark. A large test executed on Total FETI to see the real potential of the communication layer on smaller clusters.
This paper describes the implementation, performance, and scalability of our communication layer developed for Total FETI (TFETI) and Hybrid Total FETI (HTFETI) solvers. HTFETI is based on our variant of the Finite Element Tearing and Interconnecting (FETI) type domain decomposition method. In this approach a small number of neighboring subdomains is aggregated into clusters, which results in a smaller coarse problem. To solve the original problem, the TFETI method is applied twice: to the clusters and then to the subdomains in each cluster. The current implementation of the solver is focused on the performance optimization of the main CG iteration loop, including: implementation of communication hiding and avoiding techniques for global communications; optimization of the nearest-neighbor communication, that is, multiplication with a global gluing matrix; and optimization of the parallel CG algorithm to iterate over local Lagrange multipliers only. The performance is demonstrated on a linear elasticity 3D cube and real-world benchmarks.
IEEE International Conference on High Performance Computing Data and Analytics | 2015
Lubomír Říha; Tomáš Brzobohatý; Alexandros Markopoulos; Tomáš Kozubek; Ondřej Meca; Olaf Schenk; Wim Vanroose
This paper presents a new approach for accelerating FETI solvers on Graphics Processing Units (GPUs) using the Schur complement (SC) technique. By using SCs, FETI solvers can avoid working with the sparse Cholesky decomposition of the stiffness matrices. Instead, a dense structure in the form of an SC is computed and used by the conjugate gradient (CG) solver. In every iteration of the CG solver, the forward and backward substitutions, which are sequential, are replaced by a highly parallel General Matrix Vector Multiplication (GEMV) routine. This results in a 4.1 times speedup when the Tesla K20X GPU accelerator is used and its performance is compared to a single 16-core AMD Opteron 6274 (Interlagos) CPU.
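The idea of trading triangular solves for a dense matrix-vector product can be illustrated with a small pure-Python sketch. It is not the paper's GPU implementation: it assumes the per-subdomain operator has the form S = B K^{-1} B^T, where K is a symmetric positive definite stiffness matrix and B a gluing matrix, and all function names are hypothetical. The dense S is assembled once from the Cholesky factor; afterwards each application is a single GEMV-style product.

```python
def cholesky(k):
    """Return lower-triangular L with K = L L^T (K must be SPD)."""
    n = len(k)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(l[i][p] * l[j][p] for p in range(j))
            if i == j:
                l[i][j] = (k[i][i] - s) ** 0.5
            else:
                l[i][j] = (k[i][j] - s) / l[j][j]
    return l


def forward_sub(l, b):
    """Solve L y = b; each row depends on the rows before it."""
    y = [0.0] * len(b)
    for i in range(len(b)):
        s = sum(l[i][j] * y[j] for j in range(i))
        y[i] = (b[i] - s) / l[i][i]
    return y


def backward_sub_lt(l, y):
    """Solve L^T x = y, walking the rows from last to first."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(l[j][i] * x[j] for j in range(i + 1, n))
        x[i] = (y[i] - s) / l[i][i]
    return x


def apply_factorized(l, b, lam):
    """y = B K^{-1} B^T lam using two sequential triangular solves."""
    m, n = len(b), len(b[0])
    rhs = [sum(b[r][i] * lam[r] for r in range(m)) for i in range(n)]
    x = backward_sub_lt(l, forward_sub(l, rhs))
    return [sum(b[r][i] * x[i] for i in range(n)) for r in range(m)]


def build_dense_sc(l, b):
    """Assemble the dense S = B K^{-1} B^T once, one column per unit vector."""
    m = len(b)
    cols = [apply_factorized(l, b, [1.0 if r == c else 0.0 for r in range(m)])
            for c in range(m)]
    return [[cols[c][r] for c in range(m)] for r in range(m)]


def gemv(s, v):
    """y = S v; every row is independent, so this parallelizes well."""
    return [sum(row[j] * v[j] for j in range(len(v))) for row in s]


# Illustrative data only (not taken from the paper).
K = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
B = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
L = cholesky(K)
S = build_dense_sc(L, B)
```

Once `S` is built, `gemv(S, lam)` gives the same result as `apply_factorized(L, B, lam)` up to rounding. The trade-off is the one-time assembly cost and the memory needed for the dense matrix.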
Computer Information Systems and Industrial Management Applications | 2015
Milan Jaros; Lubomír Říha; Petr Strakos; Tomas Karasek; Alena Vašatová; Marta Jarošová; Tomáš Kozubek
This paper describes the acceleration of the most computationally intensive kernels of the Blender rendering engine, Blender Cycles, using Intel Many Integrated Core architecture (MIC). The proposed parallelization, which uses OpenMP technology, also improves the performance of the rendering engine when running on multi-core CPUs and multi-socket servers. Although the GPU acceleration is already implemented in Cycles, its functionality is limited. Our proposed implementation for MIC architecture contains all features of the engine with improved performance. The paper presents performance evaluation for three architectures: multi-socket server, server with MIC (Intel Xeon Phi 5100p) accelerator and server with GPU accelerator (NVIDIA Tesla K20m).
Concurrency and Computation: Practice and Experience | 2018
Madhura Kumaraswamy; Anamika Chowdhury; Michael Gerndt; Zakaria Bendifallah; Othman Bouizi; Uldis Locans; Lubomír Říha; Ondřej Vysocký; Martin Beseda; Jan Zapletal
To overcome the challenges of energy consumption of HPC systems, the European Union Horizon 2020 READEX (Runtime Exploitation of Application Dynamism for Energy‐efficient Exascale computing) project uses an online auto‐tuning approach to improve the energy efficiency of HPC applications. The READEX methodology pre‐computes optimal system configurations, such as the CPU frequency, at design time for instances of program regions, and at runtime switches to the configuration given in the tuning model when the region is executed. READEX goes beyond previous approaches by exploiting dynamic changes in a region's characteristics, leveraging region‐ and characteristic‐specific system configurations. While the tool suite supports an automatic approach, specifying domain knowledge, such as the structure and characteristics of the application and application tuning parameters, can significantly help to create a more refined tuning model. This paper presents the means available for an application expert to provide domain knowledge and presents tuning results for some benchmarks.
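The split between a design-time tuning model and a runtime switch can be sketched as a lookup table keyed by region name. This is an illustration only, not the READEX tool suite: the region names, the frequency values, and the `FakeSystem` class are hypothetical, and a real system would set hardware parameters where this sketch merely records the requested configuration.

```python
from contextlib import contextmanager

# Hypothetical tuning model produced at design time: region -> configuration.
TUNING_MODEL = {
    "assemble": {"cpu_freq_ghz": 2.4},  # assumed compute-bound region
    "exchange": {"cpu_freq_ghz": 1.6},  # assumed communication-bound region
}
DEFAULT_CONFIG = {"cpu_freq_ghz": 2.0}


class FakeSystem:
    """Stand-in for the hardware interface; it only records what was applied."""

    def __init__(self):
        self.current = dict(DEFAULT_CONFIG)
        self.history = []

    def apply(self, config):
        self.current = dict(config)
        self.history.append(dict(config))


@contextmanager
def region(system, name, model=TUNING_MODEL):
    """Switch to the region's configuration on entry and restore it on exit."""
    previous = dict(system.current)
    system.apply(model.get(name, DEFAULT_CONFIG))
    try:
        yield
    finally:
        system.apply(previous)


system = FakeSystem()
with region(system, "assemble"):
    pass  # the region's work would run here under its tuned configuration
```

Regions missing from the model fall back to the default configuration, so un-instrumented code keeps running unchanged.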
IEEE International Conference on High Performance Computing Data and Analytics | 2017
Ondrej Vysocky; Martin Beseda; Lubomír Říha; Jan Zapletal; Michael Lysaght; Venkatesh Kannan
This paper introduces two tools for manual energy evaluation and runtime tuning developed at IT4Innovations in the READEX project. The MERIC library can be used for manual instrumentation and analysis of any application from the energy and time consumption point of view. Besides tracing, MERIC can also change environment and hardware parameters during the application runtime, which leads to energy savings.
IEEE International Conference on High Performance Computing Data and Analytics | 2017
Ondřej Meca; Lubomír Říha; Alexandros Markopoulos; Tomáš Brzobohatý; Tomáš Kozubek
ESPRESO is a FEM package that includes a Hybrid Total FETI (HTFETI) linear solver targeted at solving large-scale engineering problems. The scalability of the solver was tested on several of the world’s largest supercomputers. To make our scalable implementation of the HTFETI algorithms available to all potential users, a simple C API was developed and is presented. The paper describes the API methods and the compilation and linking process.
PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON NUMERICAL ANALYSIS AND APPLIED MATHEMATICS 2014 (ICNAAM-2014) | 2015
Lubomír Říha; Tomáš Brzobohatý; Alexandros Markopoulos; Marta Jarošová; Tomáš Kozubek
We describe the implementation, performance, and scalability results of a hybrid FETI (Finite Element Tearing and Interconnecting) solver based on our variant of the FETI-type domain decomposition method called Total FETI. In our approach a small number of neighboring subdomains is aggregated into clusters, which results in a smaller coarse problem. To solve the original problem, the Total FETI method is applied twice: to the clusters and then to the subdomains in each cluster. The current implementation of the solver is focused on the performance optimization of the main CG iteration loop, including: implementation of communication hiding and avoiding techniques for global communications; optimization of the nearest-neighbor communication, that is, multiplication with a global gluing matrix; and optimization of the parallel CG algorithm to iterate over local Lagrange multipliers only. The performance is demonstrated on a synthetic linear elasticity 3D cube and real-world benchmarks.
PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON NUMERICAL ANALYSIS AND APPLIED MATHEMATICS 2014 (ICNAAM-2014) | 2015
Tomáš Brzobohatý; Lubomír Říha; Tomas Karasek; Tomáš Kozubek
This article presents the application of the Open Source Field Operation and Manipulation (OpenFOAM) C++ libraries to solving engineering problems on many-core architectures. Its objective is to present the scalability of OpenFOAM on parallel platforms when solving real engineering problems in fluid dynamics. Scalability tests of OpenFOAM are performed on various hardware with different implementations of the standard PCG and PBiCG Krylov iterative methods. The speedup of various implementations of the linear solvers using GPU and MIC accelerators is presented. Numerical experiments on 3D lid-driven cavity flow are presented for several cases with various numbers of cells.