Michal Merta
Technical University of Ostrava
Publications
Featured research published by Michal Merta.
Advances in Engineering Software | 2015
Michal Merta; Jan Zapletal
Highlights: the in-core vectorization of the Galerkin BEM using the Vc library is proposed; fully numerical and semi-analytical integration schemes are discussed; numerical experiments show a significant speedup of the BEM computation. Although parallelization of computationally intensive algorithms has become a standard within the scientific community, the possibility of in-core vectorization is often overlooked. With the development of modern HPC architectures, however, neglecting such programming techniques may lead to inefficient code that hardly utilizes the theoretical performance of today's CPUs. The presented paper reports on explicit vectorization for quadratures stemming from the Galerkin formulation of boundary integral equations in 3D. To deal with the singular integral kernels, two common approaches, the semi-analytic and the fully numerical scheme, are used. We exploit modern SIMD (Single Instruction Multiple Data) instruction sets to speed up the assembly of system matrices based on both of these regularization techniques. The efficiency of the code is further increased by standard shared-memory parallelization techniques and is demonstrated on a set of numerical experiments.
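The core idea of the paper is to evaluate whole batches of quadrature points with SIMD instructions instead of one point at a time. The sketch below only illustrates that idea for the 3D Laplace single-layer kernel 1/(4π|x − y|); the paper itself relies on explicit vectorization via the Vc library, whereas this sketch uses the portable OpenMP simd directive, and the function and argument names are made up for the example.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative sketch: evaluate the 3D Laplace single-layer kernel
// 1 / (4*pi*|x - y|) for a fixed point x against a batch of quadrature
// points y, letting the compiler generate SIMD code for the loop.
// The paper uses explicit SIMD via the Vc library instead.
void laplaceSLKernelBatch(const double x[3],
                          const std::vector<double>& yx,  // y coordinates,
                          const std::vector<double>& yy,  // structure-of-arrays
                          const std::vector<double>& yz,  // layout
                          std::vector<double>& values) {
  const double coeff = 0.25 / 3.141592653589793;
  const std::size_t n = values.size();
#pragma omp simd
  for (std::size_t i = 0; i < n; ++i) {
    const double dx = x[0] - yx[i];
    const double dy = x[1] - yy[i];
    const double dz = x[2] - yz[i];
    values[i] = coeff / std::sqrt(dx * dx + dy * dy + dz * dz);
  }
}
```

The structure-of-arrays layout of the quadrature points (separate coordinate arrays) keeps the loads unit-strided, which is typically what makes such in-core vectorization pay off.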
Advances in Engineering Software | 2017
Michal Merta; Lubomir Riha; Ondrej Meca; Alexandros Markopoulos; Tomas Brzobohaty; Tomáš Kozubek; Vít Vondrák
This paper describes an approach for accelerating the Hybrid Total FETI (HTFETI) domain decomposition method using the Intel Xeon Phi coprocessors. The HTFETI method is a memory-bound algorithm which uses sparse linear BLAS operations with an irregular memory access pattern. The presented local Schur complement (LSC) method has a regular memory access pattern, which allows the solver to fully utilize the fast memory bandwidth of the Intel Xeon Phi. This translates to a speedup of over 10.9 for the HTFETI iterative solver when solving a heat transfer problem (the 3D Laplace equation) with 3 billion unknowns on almost 400 compute nodes. The comparison is between the CPU computation using sparse data structures (the PARDISO sparse direct solver) and the LSC computation on the Xeon Phi. For a structural mechanics problem (3D linear elasticity) with 1 billion degrees of freedom, the respective speedup is 3.4. The presented speedups are asymptotic and are reached for problems requiring a high number of iterations (e.g., ill-conditioned, transient, or contact problems). For problems which can be solved in under a hundred iterations the local Schur complement method is not optimal; for these cases we have also implemented sparse matrix processing with PARDISO on the Xeon Phi accelerators.
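The point of the LSC approach is to replace per-subdomain sparse solves with irregular memory access by dense matrix-vector products with regular, streaming access. Below is a minimal sketch of such a dense apply, assuming a precomputed row-major Schur complement and the standard CBLAS interface; the function name and data layout are illustrative, not the solver's actual API.

```cpp
#include <vector>
#include <cblas.h>  // CBLAS interface, e.g. from OpenBLAS

// Illustrative sketch: apply a precomputed dense local Schur complement
// S (n x n, row-major) to a vector x, computing y = S * x. The dense
// GEMV streams through memory with a regular access pattern, unlike a
// sparse PARDISO solve with irregular accesses.
void applyLocalSchurComplement(const std::vector<double>& S,
                               const std::vector<double>& x,
                               std::vector<double>& y, int n) {
  cblas_dgemv(CblasRowMajor, CblasNoTrans, n, n,
              1.0, S.data(), n,   // alpha, matrix, leading dimension
              x.data(), 1,        // x and its stride
              0.0, y.data(), 1);  // beta, y and its stride
}
```

On a bandwidth-bound accelerator such a dense apply runs close to the memory bandwidth limit, which is why it pays off once the iteration count, and hence the number of applies, is large, matching the asymptotic speedups reported in the abstract.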
IEEE International Conference on High Performance Computing, Data and Analytics | 2015
Michal Merta; Jan Zapletal; Jiri Jaros
The paper presents the boundary element method accelerated by the Intel Xeon Phi coprocessors. An overview of the boundary element method for the 3D Laplace equation is given, followed by a discussion of its discretization and of the parallelization using OpenMP and the offload features of the Xeon Phi coprocessor. The results of numerical experiments for both single- and double-layer boundary integral operators are presented. In most cases the accelerated code significantly outperforms the original code running solely on Intel Xeon processors.
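As an indication of the overall offload structure only: the sketch below pushes a dense assembly-like loop to an accelerator with standard OpenMP target directives. This is an assumption made for illustration; the paper relies on the Intel offload features of the Xeon Phi coprocessor, and the kernel entries here are placeholders rather than the actual regularized boundary integral quadratures.

```cpp
#include <cmath>
#include <cstddef>

// Illustrative sketch of offloading a dense assembly loop to an
// accelerator with standard OpenMP target directives. The paper uses
// the Intel offload features of the Xeon Phi coprocessor; the kernel
// entry below is a placeholder, not an actual boundary integral.
void assembleOnDevice(const double* nodes, std::size_t nNodes,
                      double* matrix /* nNodes x nNodes, row-major */) {
#pragma omp target teams distribute parallel for collapse(2) \
    map(to: nodes[0:3 * nNodes]) map(from: matrix[0:nNodes * nNodes])
  for (std::size_t i = 0; i < nNodes; ++i) {
    for (std::size_t j = 0; j < nNodes; ++j) {
      const double dx = nodes[3 * i] - nodes[3 * j];
      const double dy = nodes[3 * i + 1] - nodes[3 * j + 1];
      const double dz = nodes[3 * i + 2] - nodes[3 * j + 2];
      const double r = std::sqrt(dx * dx + dy * dy + dz * dz);
      // Placeholder entry; a real code evaluates a regularized
      // (singular) quadrature on the diagonal.
      matrix[i * nNodes + j] = (i == j) ? 0.0 : 0.25 / (3.141592653589793 * r);
    }
  }
}
```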
Numerical Algorithms | 2015
Dalibor Lukáš; Petr Kovář; Tereza Kovářová; Michal Merta
We propose a method of a parallel distribution of densely populated matrices arising in boundary element discretizations of partial differential equations. In our method the underlying boundary element mesh consisting of n elements is decomposed into N submeshes. The related N×N submatrices are assigned to N concurrent processes to be assembled. Additionally we require each process to hold exactly one diagonal submatrix, since its assembling is typically most time consuming when applying fast boundary elements. We obtain a class of such optimal parallel distributions of the submeshes and corresponding submatrices by cyclic decompositions of undirected complete graphs. It results in a method the theoretical complexity of which is O((n/√N) log(n/√N)) in terms of time for the setup, assembling, matrix action, as well as memory consumption per process. Nevertheless, numerical experiments up to n = 2744832 and N = 273 on a real-world geometry document that the method exhibits superior parallel scalability O((n/N) log n) of the overall time, while the memory consumption scales accordingly to the theoretical estimate.
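As a toy illustration of the distribution requirement stated above (not the cyclic graph decomposition construction of the paper): for an odd number of processes N, assigning block (i, j) to the process p with i + j ≡ 2p (mod N) gives every process exactly N of the N×N blocks and exactly one diagonal block.

```cpp
#include <cstdio>

// Toy illustration only (not the graph-based construction of the paper):
// for an odd number of processes N, assign submatrix block (i, j) to the
// process p with i + j == 2p (mod N). Each process then owns exactly N
// of the N x N blocks and exactly one diagonal block (p, p).
int blockOwner(int i, int j, int N) {
  const int s = (i + j) % N;
  // Solve 2p == s (mod N); for odd N the inverse of 2 modulo N is (N + 1) / 2.
  return (s * ((N + 1) / 2)) % N;
}

int main() {
  const int N = 5;  // number of processes (odd)
  for (int p = 0; p < N; ++p) {
    std::printf("process %d owns blocks:", p);
    for (int i = 0; i < N; ++i)
      for (int j = 0; j < N; ++j)
        if (blockOwner(i, j, N) == p) std::printf(" (%d,%d)", i, j);
    std::printf("\n");
  }
  return 0;
}
```

For N = 5 the program prints five anti-diagonals of the block grid, each containing exactly one diagonal block (p, p).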
Mathematics and Computers in Simulation | 2018
Michal Merta; Jan Zapletal
Computers & Mathematics With Applications | 2017
Jan Zapletal; Michal Merta; Lukáš Malý
Proceedings of the International Conference on Numerical Analysis and Applied Mathematics 2014 (ICNAAM-2014) | 2015
Martin Čermák; Michal Merta; Jan Zapletal
International Journal of High Performance Computing Applications | 2018
Lubomir Riha; Michal Merta; Radim Vavrik; Tomas Brzobohaty; Alexandros Markopoulos; Ondrej Meca; Ondrej Vysocky; Tomáš Kozubek; Vít Vondrák
Advances in Engineering Software | 2018
Lukas Maly; Jan Zapletal; Michal Merta; Lubomir Riha; Vít Vondrák
In this paper we present a software library for the parallel solution of engineering problems based on the boundary element method. The library is written in C++ and utilizes OpenMP and MPI for parallelization in both shared and distributed memory. We give an overview of the structure of the library and present numerical results related to 3D sound-hard scattering in an unbounded domain represented by a boundary value problem for the Helmholtz equation. Scalability results for the assembly of system matrices sparsified by the adaptive cross approximation are also presented.
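A minimal sketch of the hybrid MPI + OpenMP pattern mentioned in the abstract, with each MPI rank assembling a block of matrix rows and OpenMP threads filling the rows of that block; entryValue() and all sizes are placeholders, not the library's interface.

```cpp
#include <mpi.h>
#include <algorithm>
#include <cstdlib>
#include <vector>

// Illustrative sketch of the hybrid MPI + OpenMP pattern: each MPI rank
// assembles a contiguous block of rows of a dense system matrix and
// OpenMP threads fill the rows of that block in parallel.
// entryValue() is a placeholder for the actual boundary integral.
static double entryValue(int i, int j) {
  return (i == j) ? 2.0 : 1.0 / (1.0 + std::abs(i - j));
}

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank = 0, size = 1;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  const int n = 1024;                              // global matrix size
  const int rowsPerRank = (n + size - 1) / size;   // block row distribution
  const int firstRow = std::min(n, rank * rowsPerRank);
  const int lastRow = std::min(n, firstRow + rowsPerRank);

  std::vector<double> localBlock(
      static_cast<std::size_t>(lastRow - firstRow) * n, 0.0);

#pragma omp parallel for schedule(dynamic)
  for (int i = firstRow; i < lastRow; ++i)
    for (int j = 0; j < n; ++j)
      localBlock[static_cast<std::size_t>(i - firstRow) * n + j] =
          entryValue(i, j);

  MPI_Finalize();
  return 0;
}
```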
International Conference on Parallel Processing | 2017
Michal Kravcenko; Lukas Maly; Michal Merta; Jan Zapletal
In the paper we study the performance of the regularized boundary element quadrature routines implemented in the BEM4I library developed by the authors. Apart from the results obtained on the classical multi-core architecture represented by the Intel Xeon processors, we concentrate on the portability of the code to the many-core Intel Xeon Phi family. Contrary to the GP-GPU programming that accelerates many scientific codes, the standard x86 architecture of the Xeon Phi processors allows the already existing multi-core implementation to be reused. Although in many cases a simple recompilation would lead to an inefficient utilization of the Xeon Phi, the effort invested in the optimization usually leads to a better performance on the multi-core Xeon processors as well. This makes the Xeon Phi an interesting platform for scientists developing a software library aimed at both modern portable PCs and high performance computing environments. Here we focus on the manually vectorized assembly of the local element contributions and the parallel assembly of the global matrices on shared memory systems. Due to the quadratic complexity of the standard assembly, we also present an assembly sparsified by the adaptive cross approximation based on the same acceleration techniques. The numerical results obtained on the Xeon multi-core processor and two generations of the Xeon Phi many-core platform validate the proposed implementation and highlight the importance of vectorization for exploiting the features of modern hardware.
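For context on the sparsified assembly, here is a didactic sketch of adaptive cross approximation with full pivoting on an explicitly stored m × n block, producing A ≈ Σ_k u_k v_kᵀ. Practical BEM codes use the partially pivoted variant, which evaluates only the rows and columns it needs, so this version, its stopping criterion, and its names are for illustration only.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Didactic sketch: adaptive cross approximation with full pivoting on an
// explicitly stored m x n block R (row-major copy of the block). The
// result is a low-rank factorization A ~ sum_k u_k * v_k^T with the
// vectors collected in U and V. Practical BEM codes use the partially
// pivoted variant, which generates only the rows/columns it needs.
void acaFullPivoting(std::vector<double> R, int m, int n, double eps,
                     std::vector<std::vector<double>>& U,
                     std::vector<std::vector<double>>& V) {
  for (int k = 0; k < std::min(m, n); ++k) {
    // Find the largest entry of the current residual (the pivot).
    int pi = 0, pj = 0;
    double pmax = 0.0;
    for (int i = 0; i < m; ++i)
      for (int j = 0; j < n; ++j)
        if (std::fabs(R[i * n + j]) > pmax) {
          pmax = std::fabs(R[i * n + j]);
          pi = i;
          pj = j;
        }
    if (pmax < eps) break;  // simplified stopping criterion
    // Rank-one cross: u = pivot column, v = pivot row scaled by the pivot.
    std::vector<double> u(m), v(n);
    for (int i = 0; i < m; ++i) u[i] = R[i * n + pj];
    for (int j = 0; j < n; ++j) v[j] = R[pi * n + j] / R[pi * n + pj];
    // Subtract the rank-one update from the residual and store the cross.
    for (int i = 0; i < m; ++i)
      for (int j = 0; j < n; ++j) R[i * n + j] -= u[i] * v[j];
    U.push_back(u);
    V.push_back(v);
  }
}
```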