Lubomir Riha
Technical University of Ostrava
Publications
Featured researches published by Lubomir Riha.
International Parallel and Distributed Processing Symposium | 2012
Maria Malik; Lubomir Riha; Colin Shea; Tarek A. El-Ghazawi
OLAP (On-Line Analytical Processing) is a powerful method for analyzing the massive amounts of data involved in business intelligence applications. OLAP uses an efficient multidimensional data structure, the OLAP cube, to answer multi-faceted analytical queries. As queries become more complex and the dimensionality and size of the cube grow, the processing time required to aggregate queries increases. In this paper we propose: (1) a parallel implementation of the MOLAP cube using OpenMP, (2) a text-to-integer translation method that enables efficient string processing on the GPU, and (3) a new scheduling algorithm that supports these features. To process string queries on the GPU, we introduce a text-to-integer translation method that works with multiple dictionaries; the translation is needed only on the GPU side of the system. To support the translation and the parallel CPU implementation, a new scheduling algorithm is proposed. The scheduler divides the multi-core processor(s) of a shared memory system into a processing partition and a preprocessing (or translation) partition. The performance of the new system is evaluated. The text-to-integer translation adds vital new functionality to our system, although it slows down GPU processing by 7% compared to the original implementation without string support. The performance measurements indicate that, thanks to the parallel implementation, the processing rate of the CPU partition improves from 12 to 110 queries per second. Moreover, the CPU partition can now process OLAP cubes of size 32 GB at a rate of 11 queries per second. The total performance of the entire hybrid system (CPU + GPU) increased from 102 to 228 queries per second.
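The dictionary-based translation described in the abstract can be pictured with a small sketch: each string dimension gets its own dictionary mapping distinct strings to dense integer ids, so string predicates can be evaluated on the GPU as integer comparisons. The class and function names below are illustrative, not taken from the paper's implementation.

```python
class ColumnDictionary:
    """Maps distinct strings of one OLAP dimension to dense integer ids."""

    def __init__(self):
        self.str_to_id = {}
        self.id_to_str = []

    def encode(self, value):
        # Assign the next free integer id the first time a string is seen.
        if value not in self.str_to_id:
            self.str_to_id[value] = len(self.id_to_str)
            self.id_to_str.append(value)
        return self.str_to_id[value]

    def decode(self, idx):
        return self.id_to_str[idx]


def translate_query(predicates, dictionaries):
    """Translate string predicates {dimension: value} into integer form."""
    return {dim: dictionaries[dim].encode(val) for dim, val in predicates.items()}


# Build per-dimension dictionaries while loading fact rows
dims = {"region": ColumnDictionary(), "product": ColumnDictionary()}
for region, product in [("EU", "cpu"), ("US", "gpu"), ("EU", "gpu")]:
    encoded_row = (dims["region"].encode(region), dims["product"].encode(product))

# A string query becomes a purely integer query before being shipped to the GPU
q = translate_query({"region": "EU", "product": "gpu"}, dims)
```

With one dictionary per dimension, the integer ids stay small and dense, which is what makes the GPU-side comparisons cheap.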
Advances in Engineering Software | 2017
Michal Merta; Lubomir Riha; Ondrej Meca; Alexandros Markopoulos; Tomas Brzobohaty; Tomáš Kozubek; Vít Vondrák
This paper describes an approach for accelerating the Hybrid Total FETI (HTFETI) domain decomposition method using Intel Xeon Phi coprocessors. HTFETI is a memory-bound algorithm that uses sparse BLAS operations with irregular memory access patterns. The presented local Schur complement (LSC) method has a regular memory access pattern, which allows the solver to fully utilize the fast memory bandwidth of the Xeon Phi. This translates to a speedup of over 10.9 for the HTFETI iterative solver when solving a 3-billion-unknown heat transfer problem (3D Laplace equation) on almost 400 compute nodes. The comparison is between CPU computation using sparse data structures (the PARDISO sparse direct solver) and LSC computation on the Xeon Phi. For a structural mechanics problem (3D linear elasticity) with 1 billion DOFs, the respective speedup is 3.4. The presented speedups are asymptotic and are reached for problems requiring a high number of iterations (e.g., ill-conditioned, transient, or contact problems). For problems solvable in under a hundred iterations the local Schur complement method is not optimal; for these cases we have also implemented sparse matrix processing with PARDISO on the Xeon Phi accelerators.
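The trade-off behind the LSC method can be shown with a toy example: the interior unknowns of a subdomain are eliminated once in preprocessing, so every later application of the subdomain operator becomes a dense, regular-access multiplication instead of an irregular sparse solve. The tiny matrices and pure-Python helpers below are illustrative only, not the paper's data structures.

```python
def matmul(A, B):
    """Naive dense matrix-matrix product on nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def inv2(A):
    # Inverse of a 2x2 matrix; stands in for the sparse factorization of K_ii.
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [[A[1][1] / det, -A[0][1] / det],
            [-A[1][0] / det, A[0][0] / det]]

# Subdomain stiffness matrix partitioned into interior (i) and boundary (b) blocks
K_ii = [[4.0, 1.0], [1.0, 3.0]]
K_ib = [[1.0, 0.0], [0.0, 1.0]]
K_bi = [[1.0, 0.0], [0.0, 1.0]]
K_bb = [[5.0, 2.0], [2.0, 6.0]]

# One-time preprocessing: dense Schur complement S = K_bb - K_bi K_ii^{-1} K_ib
T = matmul(K_bi, matmul(inv2(K_ii), K_ib))
S = [[K_bb[i][j] - T[i][j] for j in range(2)] for i in range(2)]

# Every solver iteration afterwards only needs the regular dense product S x_b,
# which streams memory predictably -- the access pattern the Xeon Phi favors.
x_b = [1.0, 1.0]
y_b = [sum(S[i][j] * x_b[j] for j in range(2)) for i in range(2)]
```

The dense product costs more flops than a sparse solve, but its regular memory traffic is what lets a high-bandwidth accelerator win once enough iterations amortize the preprocessing, matching the abstract's note that the speedups are asymptotic.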
IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing | 2016
Lubomir Riha; Jacqueline Le Moigne; Tarek A. El-Ghazawi
This paper evaluates the potential of the embedded graphics processing unit (GPU) in Nvidia's Tegra K1 for onboard processing. Its performance is compared to a general-purpose multicore central processing unit (CPU), a full-fledged GPU accelerator, and an Intel Xeon Phi coprocessor on two representative potential applications: wavelet spectral dimension reduction of hyperspectral imagery and automated cloud-cover assessment (ACCA). On these applications, the Tegra K1 achieved 51% of the performance of a high-end eight-core server Intel Xeon CPU for the ACCA algorithm and 20% for the dimension reduction algorithm, while the Xeon has a 13.5 times higher power consumption. This paper also shows the potential of modern high-performance computing accelerators for algorithms such as those for which the paper presents an optimized parallel implementation. The two algorithms tested mostly contain spatially localized computations, and one can expect that other image processing algorithms with localized computations would exhibit similar speedups when implemented on these parallel architectures.
Concurrency and Computation: Practice and Experience | 2015
Joseph Schneible; Lubomir Riha; Maria Malik; Tarek A. El-Ghazawi; Andrei Alexandru
In recent years, the use of accelerators in conjunction with CPUs, known as heterogeneous computing, has brought significant performance increases for scientific applications. One of the best examples is lattice quantum chromodynamics (QCD), a simulation based on stencil operations. These simulations have a large memory footprint, necessitating the use of many graphics processing units (GPUs) in parallel on a heterogeneous cluster with one or more GPUs per node. To obtain optimal performance, it is necessary to determine an efficient communication pattern between GPUs on the same node and between nodes. In this paper, we present a performance-model-based method for minimizing the communication time of applications with stencil operations, such as lattice QCD, on heterogeneous computing systems with a non-blocking InfiniBand interconnection network. The proposed method increases the performance of the most computationally intensive kernel of lattice QCD by 25% due to improved overlapping of communication and computation. We also demonstrate that the performance model and the efficient communication patterns can be used to determine a cost-efficient heterogeneous system design for stencil-based applications.
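A toy version of this kind of performance model can make the overlap argument concrete: given the compute time of a stencil kernel and per-neighbor message sizes, estimate the step time with and without overlapping the halo exchange. The linear latency/bandwidth model and all the numbers below are illustrative assumptions, not measurements from the paper.

```python
def comm_time(msg_bytes, latency=1.5e-6, bandwidth=6.0e9):
    """Simple latency/bandwidth model for transferring one halo message."""
    return latency + msg_bytes / bandwidth

def step_time(t_interior, t_boundary, messages, overlap=True):
    """Estimated time of one stencil step on one GPU."""
    t_comm = sum(comm_time(m) for m in messages)
    if overlap:
        # Boundary sites are computed first, then their halos are exchanged
        # while the interior kernel runs; only the slower of the two matters.
        return t_boundary + max(t_interior, t_comm)
    # Blocking schedule: compute everything, then communicate.
    return t_interior + t_boundary + t_comm

# Six face halos of a hypothetical 48^3 local lattice, 24 doubles per site
msgs = [8 * 48**2 * 24] * 6
blocking = step_time(2.0e-3, 0.2e-3, msgs, overlap=False)
overlapped = step_time(2.0e-3, 0.2e-3, msgs, overlap=True)
```

Evaluating such a model for each candidate communication pattern (and each candidate node configuration) is what lets one pick the schedule, and even the system design, analytically before running the full code.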
Application Specific Systems, Architectures and Processors | 2013
Gabriel Yessin; Lubomir Riha; Tarek A. El-Ghazawi; David E. Mayhew
The current trend in computing has been to add more and more to the CPU, especially bigger caches and more cache levels. Based on these observations, we sought to determine whether bigger is always better. We test this by performing an architectural design space exploration of various cache and frequency configurations for ARM processors. Analyzing the data, we made the surprising discovery that bigger is not always better, and that for some applications we should in fact take a step back in the architectural evolutionary roadmap. In this study, we analyzed web-browser performance as a function of architectural configuration and related it to end-user satisfaction. We determined that a scaled-back modern core would not only be sufficient but would improve the performance of the web browser. In doing this, we also developed GW-GEM5, a set of tools for the creation, monitoring, and analysis of concurrent gem5 simulations on compute clusters for use in design space parameter studies.
International Conference on High Performance Computing and Simulation | 2016
David Horák; Lubomir Riha; Radim Sojka; Jakub Kruzik; Martin Beseda
The energy consumption of supercomputers is one of the critical problems for the upcoming exascale supercomputing era. Awareness of power and energy consumption is required on both the software and hardware sides. This poster deals with the energy consumption evaluation of the Total Finite Element Tearing and Interconnecting (TFETI) based solvers [2] of linear systems implemented in the PERMON toolbox [1], an established method for solving real-world engineering problems, and with the energy consumption evaluation of BLAS routines. The experiments in the poster focus on CPU frequency. This work is performed within the scope of the READEX project (Runtime Exploitation of Application Dynamism for Energy-efficient eXascale computing) [6]. The measurements were performed on the Intel Xeon E5-2680 (Intel Haswell microarchitecture) based Taurus system installed at TU Dresden. The system contains over 1400 nodes with FPGA-based power instrumentation called HDEEM (High Definition Energy Efficiency Monitoring), which allows for fine-grained and more accurate power and energy measurements. The measurements can be accessed through the HDEEM library, allowing developers to take energy readings before and after a region of interest. We have evaluated the effect of the CPU frequency on the energy consumption of the TFETI solver for a linear elasticity 3D cube synthetic benchmark. On the dualized problem MPFX=MPd, we have evaluated the effect of frequency tuning on the energy consumption of the essential processing kernels of the TFETI method. TFETI has two main phases: preprocessing and solve. In preprocessing, it is necessary to regularize and factorize the stiffness matrix K, and to assemble the G and GGT matrices and factorize the latter. Both operations are among the most time- and energy-consuming.
The solve phase employs the Preconditioned Conjugate Gradient (PCG) algorithm, which consists of sparse matrix-vector multiplications (by the F, P, ML, and MD matrices), vector dot products, and AXPY operations. In each iteration, the direct solver is applied twice: for the forward and backward solves of the pseudoinverse action K+, and for the coarse problem solution, the (GGT)-1 action. Multiplication by the dense Schur complement matrix adds an additional operator with different computational characteristics, potentially increasing the exploitable dynamism. The poster provides results for two types of frequency tuning: (1) static tuning, where the frequency is set before execution and kept constant during the runtime, and (2) dynamic tuning, where the frequency is changed during program execution to adapt the system to the actual needs of the application. The poster shows that static tuning brings up to 11.84% energy savings compared to the default CPU settings (the highest clock rate). Dynamic tuning improves this further by up to 2.68%. In total, the presented approach shows the potential to save up to 14.52% of energy for TFETI-based solvers; see Table 1. Further energy consumption evaluations were performed with selected sparse and dense BLAS Level 1, 2, and 3 routines. For benchmarking we used a set of matrices from the University of Florida collection [4]. We employed the AXPY, Sparse Matrix-Vector, Sparse Matrix-Matrix, Dense Matrix-Vector, Dense Matrix-Matrix, and Sparse Matrix-Dense Matrix multiplication routines from the Intel Math Kernel Library (MKL) [3]. The measured characteristics illustrate the differing energy consumption of BLAS routines, as some operations are memory-bound and others compute-bound. Based on our recommendations, one can exploit dynamic frequency switching to achieve significant energy savings of up to 23%; for more details see Table 2.
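The kernel mix named above (operator applications, dot products, AXPY updates) is visible in a textbook PCG loop. The sketch below is a generic PCG, not the TFETI code: apply_F stands in for the dual operator and apply_M for the preconditioner, and the toy 2x2 operator is illustrative only.

```python
def pcg(apply_F, apply_M, d, tol=1e-10, max_it=200):
    """Textbook Preconditioned Conjugate Gradient for F x = d.

    Each iteration performs one operator application (apply_F), one
    preconditioner application (apply_M), two dot products, and three
    AXPY-style vector updates -- the kernels whose energy profiles differ.
    """
    n = len(d)
    x = [0.0] * n
    r = d[:]                                   # r = d - F x with x = 0
    z = apply_M(r)
    p = z[:]
    rz = sum(ri * zi for ri, zi in zip(r, z))  # dot product
    for _ in range(max_it):
        Fp = apply_F(p)                        # operator application
        alpha = rz / sum(pi * qi for pi, qi in zip(p, Fp))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]    # AXPY
        r = [ri - alpha * qi for ri, qi in zip(r, Fp)]   # AXPY
        if max(abs(ri) for ri in r) < tol:
            break
        z = apply_M(r)                         # preconditioner application
        rz_new = sum(ri * zi for ri, zi in zip(r, z))
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]  # AXPY
        rz = rz_new
    return x

# Toy SPD operator and Jacobi preconditioner standing in for F and M
apply_F = lambda v: [4.0 * v[0] + v[1], v[0] + 3.0 * v[1]]
apply_M = lambda v: [v[0] / 4.0, v[1] / 3.0]
x = pcg(apply_F, apply_M, [1.0, 2.0])
```

Because the operator applications are memory-bound while the dot products and AXPYs have yet different compute/traffic ratios, dynamic frequency tuning can pick a different CPU clock for each kernel class within the iteration.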
International Journal of High Performance Computing Applications | 2018
Lubomir Riha; Michal Merta; Radim Vavrik; Tomas Brzobohaty; Alexandros Markopoulos; Ondrej Meca; Ondrej Vysocky; Tomáš Kozubek; Vít Vondrák
In this article, we present the ExaScale PaRallel finite element tearing and interconnecting SOlver (ESPRESO) finite element method (FEM) library, which includes an FEM toolbox with interfaces to professional and open-source simulation tools, and a massively parallel hybrid total finite element tearing and interconnecting (HTFETI) solver that can fully utilize the Oak Ridge Leadership Computing Facility's Titan supercomputer and achieve superlinear scaling. The article presents several new techniques for finite element tearing and interconnecting (FETI) solvers designed for efficient utilization of supercomputers, focusing on (i) performance: we present a fivefold reduction in solver runtime for the Laplace equation, achieved by redesigning the FETI solver and offloading the key workload to an accelerator; we compare Intel Xeon Phi 7120p and Tesla K80 and P100 accelerators to Intel Xeon E5-2680v3 and Xeon Phi 7210 central processing units; and (ii) memory efficiency: we present two techniques that increase the efficiency of the HTFETI solver 1.8 times and push the limit on the largest problem that ESPRESO can solve from 124 to 223 billion unknowns for problems with unstructured meshes. Finally, we show that by dynamically tuning hardware parameters, we can reduce energy consumption by up to 33%.
Advances in Engineering Software | 2018
Lukas Maly; Jan Zapletal; Michal Merta; Lubomir Riha; Vít Vondrák
In this paper we provide a comparison of several runtimes that can be used for offloading computationally intensive kernels to Intel Xeon Phi coprocessors. The presented benchmark application is a stripped-down version of an iterative solver used within the Schur complement finite or boundary element tearing and interconnecting (FETI, BETI) domain decomposition methods, where the sparse solve with local stiffness matrices is replaced by multiplication with dense matrices in order to exploit coalesced memory access patterns. We present offload approaches based on the Intel Language Extension for Offload (LEO), the Hetero Streams Library (hStreams), and Heterogeneous Active Messages (HAM), and compare their performance and ease of use.
Proceedings of the 2017 International Conference on Computer Graphics and Digital Image Processing | 2017
Milan Jaros; Lubomir Riha; Tomas Karasek; Petr Strakos; Daniel Krpelik
Scene rendering is a demanding procedure used to create images and movies from scenes modelled in software environments such as the open-source 3D creation suite Blender. Rendering is generally used in two basic modes: off-line, for final production, and interactive, run in real time for preliminary insight during modelling. Both modes pose a computationally challenging task, especially for large scenes. In this paper, we describe our parallel solution, which utilizes Intel Xeon Phi coprocessors on either stand-alone compute nodes or HPC (High Performance Computing) clusters in both off-line and interactive rendering modes. Blender was used for modelling the scenes and for their rendering. We have extended Blender's native Cycles renderer into what we call CyclesPhi [1], which is developed to support and utilize the Intel Xeon Phi in HPC clusters. The parallelization described in this paper uses a hybrid MPI/OpenMP approach and exploits two typical modes of the Intel Xeon Phi: the offload and symmetric modes. To demonstrate the efficiency of our implementation, a runtime comparison as well as strong scalability results are presented.
International Conference of Numerical Analysis and Applied Mathematics (ICNAAM 2016) | 2017
Radim Sojka; Lubomir Riha; David Horák; Jakub Kruzik; Martin Beseda; Martin Čermák
This paper deals with the energy consumption evaluation of selected sparse and dense BLAS Level 1, 2, and 3 routines. We employed the AXPY, Sparse Matrix-Vector, Sparse Matrix-Matrix, Dense Matrix-Vector, Dense Matrix-Matrix, and Sparse Matrix-Dense Matrix multiplication routines from the Intel Math Kernel Library (MKL). The measured characteristics illustrate the differing energy consumption of BLAS routines, as some operations are memory-bound and others compute-bound. Based on our recommendations, one can exploit dynamic frequency switching to achieve significant energy savings of up to 23%.
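The memory-bound vs. compute-bound distinction can be quantified with a back-of-the-envelope arithmetic intensity (flops per byte of memory traffic) for two of the benchmarked routines. The traffic counts below are idealized assumptions (double precision, no cache reuse beyond reading each operand once), not measurements from the paper.

```python
def axpy_intensity(n):
    """Arithmetic intensity of y = a*x + y on vectors of length n."""
    flops = 2 * n                  # one multiply and one add per element
    bytes_moved = 3 * 8 * n        # read x, read y, write y (8-byte doubles)
    return flops / bytes_moved

def gemm_intensity(n):
    """Arithmetic intensity of C = A*B + C for dense n x n matrices."""
    flops = 2 * n ** 3
    bytes_moved = 4 * 8 * n ** 2   # read A, B, C and write C once each
    return flops / bytes_moved

ai_axpy = axpy_intensity(10**6)    # constant: ~0.083 flop/byte, memory-bound
ai_gemm = gemm_intensity(2048)     # grows as n/16: compute-bound for large n
```

Since AXPY's intensity is constant and tiny, lowering the CPU frequency barely affects its (bandwidth-limited) runtime but cuts power, whereas GEMM's runtime scales with frequency; this is exactly the asymmetry that frequency switching between routines exploits.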