Michal Merta
Technical University of Ostrava
Publications
Featured research published by Michal Merta.
Advances in Engineering Software | 2015
Michal Merta; Jan Zapletal
Highlights: the in-core vectorization of the Galerkin BEM using the Vc library is proposed; fully numerical and semi-analytical integration schemes are discussed; numerical experiments show a significant speedup of the BEM computation. Although parallelization of computationally intensive algorithms has become a standard within the scientific community, the possibility of in-core vectorization is often overlooked. With the development of modern HPC architectures, however, neglecting such programming techniques may lead to inefficient code that hardly utilizes the theoretical performance of today's CPUs. The presented paper reports on explicit vectorization for quadratures stemming from the Galerkin formulation of boundary integral equations in 3D. To deal with the singular integral kernels, two common approaches, the semi-analytic and the fully numerical scheme, are used. We exploit modern SIMD (Single Instruction Multiple Data) instruction sets to speed up the assembly of system matrices based on both of these regularization techniques. The efficiency of the code is further increased by standard shared-memory parallelization techniques and is demonstrated on a set of numerical experiments.
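The core idea of the paper is to evaluate whole batches of quadrature points with SIMD instructions instead of one point at a time. The sketch below only illustrates that idea for the 3D Laplace single-layer kernel 1/(4π|x − y|); the paper itself relies on explicit vectorization via the Vc library, whereas this sketch uses the portable OpenMP simd directive, and the function and argument names are made up for the example.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative sketch: evaluate the 3D Laplace single-layer kernel
// 1 / (4*pi*|x - y|) for a fixed point x against a batch of quadrature
// points y, letting the compiler generate SIMD code for the loop.
// The paper uses explicit SIMD via the Vc library instead.
void laplaceSLKernelBatch(const double x[3],
                          const std::vector<double>& yx,  // y coordinates,
                          const std::vector<double>& yy,  // structure-of-arrays
                          const std::vector<double>& yz,  // layout
                          std::vector<double>& values) {
  const double coeff = 0.25 / 3.141592653589793;
  const std::size_t n = values.size();
#pragma omp simd
  for (std::size_t i = 0; i < n; ++i) {
    const double dx = x[0] - yx[i];
    const double dy = x[1] - yy[i];
    const double dz = x[2] - yz[i];
    values[i] = coeff / std::sqrt(dx * dx + dy * dy + dz * dz);
  }
}
```

The structure-of-arrays layout of the quadrature points (separate coordinate arrays) keeps the loads unit-strided, which is typically what makes such in-core vectorization pay off.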
Advances in Engineering Software | 2017
Michal Merta; Lubomir Riha; Ondrej Meca; Alexandros Markopoulos; Tomas Brzobohaty; Tomáš Kozubek; Vít Vondrák
This paper describes an approach for accelerating the Hybrid Total FETI (HTFETI) domain decomposition method using the Intel Xeon Phi coprocessors. The HTFETI method is a memory-bound algorithm which uses sparse linear BLAS operations with an irregular memory access pattern. The presented local Schur complement (LSC) method has a regular memory access pattern, which allows the solver to fully utilize the fast memory bandwidth of the Intel Xeon Phi. This translates to a speedup of over 10.9 for the HTFETI iterative solver when solving a heat transfer problem (the 3D Laplace equation) with 3 billion unknowns on almost 400 compute nodes. The comparison is between the CPU computation using sparse data structures (the PARDISO sparse direct solver) and the LSC computation on the Xeon Phi. For a structural mechanics problem (3D linear elasticity) with 1 billion degrees of freedom, the respective speedup is 3.4. The presented speedups are asymptotic and are reached for problems requiring a high number of iterations (e.g., ill-conditioned, transient, or contact problems). For problems which can be solved in under a hundred iterations the local Schur complement method is not optimal; for these cases we have also implemented sparse matrix processing with PARDISO on the Xeon Phi accelerators.
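The point of the LSC approach is to replace per-subdomain sparse solves with irregular memory access by dense matrix-vector products with regular, streaming access. Below is a minimal sketch of such a dense apply, assuming a precomputed row-major Schur complement and the standard CBLAS interface; the function name and data layout are illustrative, not the solver's actual API.

```cpp
#include <vector>
#include <cblas.h>  // CBLAS interface, e.g. from OpenBLAS

// Illustrative sketch: apply a precomputed dense local Schur complement
// S (n x n, row-major) to a vector x, computing y = S * x. The dense
// GEMV streams through memory with a regular access pattern, unlike a
// sparse PARDISO solve with irregular accesses.
void applyLocalSchurComplement(const std::vector<double>& S,
                               const std::vector<double>& x,
                               std::vector<double>& y, int n) {
  cblas_dgemv(CblasRowMajor, CblasNoTrans, n, n,
              1.0, S.data(), n,   // alpha, matrix, leading dimension
              x.data(), 1,        // x and its stride
              0.0, y.data(), 1);  // beta, y and its stride
}
```

On a bandwidth-bound accelerator such a dense apply runs close to the memory bandwidth limit, which is why it pays off once the iteration count, and hence the number of applies, is large, matching the asymptotic speedups reported in the abstract.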
IEEE International Conference on High Performance Computing, Data and Analytics | 2015
Michal Merta; Jan Zapletal; Jiri Jaros
The paper presents the boundary element method accelerated by the Intel Xeon Phi coprocessors. An overview of the boundary element method for the 3D Laplace equation is given, followed by a discussion of its discretization and of the parallelization using OpenMP and the offload features of the Xeon Phi coprocessor. The results of numerical experiments for both single- and double-layer boundary integral operators are presented. In most cases the accelerated code significantly outperforms the original code running solely on Intel Xeon processors.
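As an indication of the overall offload structure only: the sketch below pushes a dense assembly-like loop to an accelerator with standard OpenMP target directives. This is an assumption made for illustration; the paper relies on the Intel offload features of the Xeon Phi coprocessor, and the kernel entries here are placeholders rather than the actual regularized boundary integral quadratures.

```cpp
#include <cmath>
#include <cstddef>

// Illustrative sketch of offloading a dense assembly loop to an
// accelerator with standard OpenMP target directives. The paper uses
// the Intel offload features of the Xeon Phi coprocessor; the kernel
// entry below is a placeholder, not an actual boundary integral.
void assembleOnDevice(const double* nodes, std::size_t nNodes,
                      double* matrix /* nNodes x nNodes, row-major */) {
#pragma omp target teams distribute parallel for collapse(2) \
    map(to: nodes[0:3 * nNodes]) map(from: matrix[0:nNodes * nNodes])
  for (std::size_t i = 0; i < nNodes; ++i) {
    for (std::size_t j = 0; j < nNodes; ++j) {
      const double dx = nodes[3 * i] - nodes[3 * j];
      const double dy = nodes[3 * i + 1] - nodes[3 * j + 1];
      const double dz = nodes[3 * i + 2] - nodes[3 * j + 2];
      const double r = std::sqrt(dx * dx + dy * dy + dz * dz);
      // Placeholder entry; a real code evaluates a regularized
      // (singular) quadrature on the diagonal.
      matrix[i * nNodes + j] = (i == j) ? 0.0 : 0.25 / (3.141592653589793 * r);
    }
  }
}
```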
Numerical Algorithms | 2015
Dalibor Lukáš; Petr Kovář; Tereza Kovářová; Michal Merta
We propose a method of a parallel distribution of densely populated matrices arising in boundary element discretizations of partial differential equations. In our method the underlying boundary element mesh consisting of n elements is decomposed into N submeshes. The related N×N submatrices are assigned to N concurrent processes to be assembled. Additionally we require each process to hold exactly one diagonal submatrix, since its assembling is typically most time consuming when applying fast boundary elements. We obtain a class of such optimal parallel distributions of the submeshes and corresponding submatrices by cyclic decompositions of undirected complete graphs. It results in a method the theoretical complexity of which is O((n/√N) log(n/√N)) in terms of time for the setup, assembling, matrix action, as well as memory consumption per process. Nevertheless, numerical experiments up to n = 2744832 and N = 273 on a real-world geometry document that the method exhibits superior parallel scalability O((n/N) log n) of the overall time, while the memory consumption scales accordingly to the theoretical estimate.
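As a toy illustration of the distribution requirement stated above (not the cyclic graph decomposition construction of the paper): for an odd number of processes N, assigning block (i, j) to the process p with i + j ≡ 2p (mod N) gives every process exactly N of the N×N blocks and exactly one diagonal block.

```cpp
#include <cstdio>

// Toy illustration only (not the graph-based construction of the paper):
// for an odd number of processes N, assign submatrix block (i, j) to the
// process p with i + j == 2p (mod N). Each process then owns exactly N
// of the N x N blocks and exactly one diagonal block (p, p).
int blockOwner(int i, int j, int N) {
  const int s = (i + j) % N;
  // Solve 2p == s (mod N); for odd N the inverse of 2 modulo N is (N + 1) / 2.
  return (s * ((N + 1) / 2)) % N;
}

int main() {
  const int N = 5;  // number of processes (odd)
  for (int p = 0; p < N; ++p) {
    std::printf("process %d owns blocks:", p);
    for (int i = 0; i < N; ++i)
      for (int j = 0; j < N; ++j)
        if (blockOwner(i, j, N) == p) std::printf(" (%d,%d)", i, j);
    std::printf("\n");
  }
  return 0;
}
```

For N = 5 the program prints five anti-diagonals of the block grid, each containing exactly one diagonal block (p, p).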
Mathematics and Computers in Simulation | 2018
Michal Merta; Jan Zapletal
Computers & Mathematics With Applications | 2017
Jan Zapletal; Michal Merta; Lukáš Malý
Proceedings of the International Conference on Numerical Analysis and Applied Mathematics 2014 (ICNAAM-2014) | 2015
Martin Čermák; Michal Merta; Jan Zapletal
International Journal of High Performance Computing Applications | 2018
Lubomir Riha; Michal Merta; Radim Vavrik; Tomas Brzobohaty; Alexandros Markopoulos; Ondrej Meca; Ondrej Vysocky; Tomáš Kozubek; Vít Vondrák
Advances in Engineering Software | 2018
Lukas Maly; Jan Zapletal; Michal Merta; Lubomir Riha; Vít Vondrák
In this paper we present a software library for the parallel solution of engineering problems based on the boundary element method. The library is written in C++ and utilizes OpenMP and MPI for parallelization in both shared and distributed memory. We give an overview of the structure of the library and present numerical results related to 3D sound-hard scattering in an unbounded domain represented by a boundary value problem for the Helmholtz equation. Scalability results for the assembly of system matrices sparsified by the adaptive cross approximation are also presented.
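A minimal sketch of the hybrid MPI + OpenMP pattern mentioned in the abstract, with each MPI rank assembling a block of matrix rows and OpenMP threads filling the rows of that block; entryValue() and all sizes are placeholders, not the library's interface.

```cpp
#include <mpi.h>
#include <algorithm>
#include <cstdlib>
#include <vector>

// Illustrative sketch of the hybrid MPI + OpenMP pattern: each MPI rank
// assembles a contiguous block of rows of a dense system matrix and
// OpenMP threads fill the rows of that block in parallel.
// entryValue() is a placeholder for the actual boundary integral.
static double entryValue(int i, int j) {
  return (i == j) ? 2.0 : 1.0 / (1.0 + std::abs(i - j));
}

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank = 0, size = 1;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  const int n = 1024;                              // global matrix size
  const int rowsPerRank = (n + size - 1) / size;   // block row distribution
  const int firstRow = std::min(n, rank * rowsPerRank);
  const int lastRow = std::min(n, firstRow + rowsPerRank);

  std::vector<double> localBlock(
      static_cast<std::size_t>(lastRow - firstRow) * n, 0.0);

#pragma omp parallel for schedule(dynamic)
  for (int i = firstRow; i < lastRow; ++i)
    for (int j = 0; j < n; ++j)
      localBlock[static_cast<std::size_t>(i - firstRow) * n + j] =
          entryValue(i, j);

  MPI_Finalize();
  return 0;
}
```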
International Conference on Parallel Processing | 2017
Michal Kravcenko; Lukas Maly; Michal Merta; Jan Zapletal
In the paper we study the performance of the regularized boundary element quadrature routines implemented in the BEM4I library developed by the authors. Apart from the results obtained on the classical multi-core architecture represented by the Intel Xeon processors, we concentrate on the portability of the code to the many-core Intel Xeon Phi family. Contrary to the GP-GPU programming that accelerates many scientific codes, the standard x86 architecture of the Xeon Phi processors allows the already existing multi-core implementation to be reused. Although in many cases a simple recompilation would lead to an inefficient utilization of the Xeon Phi, the effort invested in the optimization usually leads to a better performance on the multi-core Xeon processors as well. This makes the Xeon Phi an interesting platform for scientists developing a software library aimed at both modern portable PCs and high performance computing environments. Here we focus on the manually vectorized assembly of the local element contributions and the parallel assembly of the global matrices on shared memory systems. Due to the quadratic complexity of the standard assembly, we also present an assembly sparsified by the adaptive cross approximation based on the same acceleration techniques. The numerical results obtained on the Xeon multi-core processor and two generations of the Xeon Phi many-core platform validate the proposed implementation and highlight the importance of vectorization for exploiting the features of modern hardware.
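For context on the sparsified assembly, here is a didactic sketch of adaptive cross approximation with full pivoting on an explicitly stored m × n block, producing A ≈ Σ_k u_k v_kᵀ. Practical BEM codes use the partially pivoted variant, which evaluates only the rows and columns it needs, so this version, its stopping criterion, and its names are for illustration only.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Didactic sketch: adaptive cross approximation with full pivoting on an
// explicitly stored m x n block R (row-major copy of the block). The
// result is a low-rank factorization A ~ sum_k u_k * v_k^T with the
// vectors collected in U and V. Practical BEM codes use the partially
// pivoted variant, which generates only the rows/columns it needs.
void acaFullPivoting(std::vector<double> R, int m, int n, double eps,
                     std::vector<std::vector<double>>& U,
                     std::vector<std::vector<double>>& V) {
  for (int k = 0; k < std::min(m, n); ++k) {
    // Find the largest entry of the current residual (the pivot).
    int pi = 0, pj = 0;
    double pmax = 0.0;
    for (int i = 0; i < m; ++i)
      for (int j = 0; j < n; ++j)
        if (std::fabs(R[i * n + j]) > pmax) {
          pmax = std::fabs(R[i * n + j]);
          pi = i;
          pj = j;
        }
    if (pmax < eps) break;  // simplified stopping criterion
    // Rank-one cross: u = pivot column, v = pivot row scaled by the pivot.
    std::vector<double> u(m), v(n);
    for (int i = 0; i < m; ++i) u[i] = R[i * n + pj];
    for (int j = 0; j < n; ++j) v[j] = R[pi * n + j] / R[pi * n + pj];
    // Subtract the rank-one update from the residual and store the cross.
    for (int i = 0; i < m; ++i)
      for (int j = 0; j < n; ++j) R[i * n + j] -= u[i] * v[j];
    U.push_back(u);
    V.push_back(v);
  }
}
```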