Dimitar Lukarski
Uppsala University
Publications
Featured research published by Dimitar Lukarski.
Parallel Tools Workshop | 2012
Hartwig Anzt; Werner Augustin; Martin Baumann; Thomas Gengenbach; Tobias Hahn; Andreas Helfrich-Schkarbanenko; Vincent Heuveline; Eva Ketelaer; Dimitar Lukarski; Andreas Nestler; Sebastian Ritterbusch; Staffan Ronnas; Michael Schick; Mareike Schmidtobreick; Chandramowli Subramanian; Jan-Philipp Weiss; Florian Wilhelm; Martin Wlotzka
The goal of this paper is to describe the hardware-aware parallel C++ finite element package HiFlow3. HiFlow3 aims at providing a powerful platform for simulating processes modelled by partial differential equations. Our vision is to solve boundary value problems in an appropriate way by coupling numerical simulations with modern software design and state-of-the-art hardware technologies. The main functionalities for mapping the mathematical model into parallel software are implemented in the three core modules Mesh, DoF/FEM and Linear Algebra (LA). Parallelism is realized on two levels. The modules provide efficient MPI-based distributed data structures to achieve performance on large HPC systems as well as on stand-alone workstations. Additionally, the hardware-aware cross-platform approach in the LA module accelerates the solution process by exploiting the computing power of emerging technologies like multi-core CPUs and GPUs. In this context, a performance evaluation on different hardware architectures is presented.
field programmable gate arrays | 2017
Dennis Weller; Fabian Oboril; Dimitar Lukarski; Jürgen Becker; Mehdi Baradaran Tahoori
An indispensable part of our modern life is scientific computing, which is used in large-scale high-performance systems as well as in low-power smart cyber-physical systems. Hence, accelerators for scientific computing need to be fast and energy efficient. Therefore, partial differential equations (PDEs), as an integral component of many scientific computing tasks, require efficient implementation. In this regard, FPGAs are well suited for the data-parallel computations that occur in PDE solvers. However, including FPGAs in the programming flow is not trivial, as hardware description languages (HDLs) have to be employed, which requires detailed knowledge of the underlying hardware. This issue is tackled by OpenCL, which allows developers to write standardized code in a C-like fashion, rendering experience with HDLs unnecessary. Yet, hiding the underlying hardware from the developer makes it challenging to implement solvers that exploit the full FPGA potential. Therefore, we propose in this work a comprehensive set of generic and specific optimization techniques for PDE solvers using OpenCL that improve the FPGA performance and energy efficiency by orders of magnitude. Based on these optimizations, our study shows that, despite the high abstraction level of OpenCL, very energy efficient PDE accelerators can be designed on the FPGA fabric, making the FPGA an ideal solution for power-constrained applications.
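The data-parallel structure that makes PDE solvers attractive FPGA targets can be seen in a Jacobi relaxation of the 2D Poisson equation. The following NumPy sketch (an illustration of the kernel class, not the paper's OpenCL code) shows the stencil update an FPGA pipeline would stream over:

```python
import numpy as np

def jacobi_poisson_2d(f, h, iters):
    """Jacobi relaxation for -Laplace(u) = f on the unit square with
    zero Dirichlet boundary values (5-point stencil, grid spacing h)."""
    u = np.zeros_like(f)
    for _ in range(iters):
        # Every interior point is updated independently of the others --
        # exactly the data parallelism an FPGA pipeline can exploit.
        u_new = u.copy()
        u_new[1:-1, 1:-1] = 0.25 * (u[2:, 1:-1] + u[:-2, 1:-1] +
                                    u[1:-1, 2:] + u[1:-1, :-2] +
                                    h * h * f[1:-1, 1:-1])
        u = u_new
    return u

n = 33                      # grid points per direction
h = 1.0 / (n - 1)
f = np.ones((n, n))         # constant right-hand side
u = jacobi_poisson_2d(f, h, 2000)
```

Because each update reads only the previous iterate, the inner loop has no sequential dependencies, which is what the paper's optimization techniques build on.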
parallel computing | 2016
Stefan Engblom; Dimitar Lukarski
We develop and implement in this paper a fast algorithm for sparse assembly, the fundamental operation that creates a compressed matrix from raw index data. Since it is often a quite demanding and sometimes critical operation, it is of interest to design a highly efficient implementation. We show how to do this, and moreover, we show how our implementation can be parallelized to utilize the power of modern multicore computers. Our freely available, fully Matlab-compatible code achieves a speedup of about 5× on a typical 6-core machine and 10× on a dual-socket 16-core machine compared to the built-in serial implementation.
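The semantics of the assembly operation, summing duplicate (row, col) triplets while compressing, can be illustrated with SciPy's COO-to-CSR conversion, which behaves like MATLAB's sparse(i, j, v). This is a sketch of the operation being accelerated, not the paper's parallel implementation:

```python
import numpy as np
from scipy.sparse import coo_matrix

# Raw triplet data as an element-by-element assembly loop produces it:
# repeated (row, col) pairs must be summed during compression, exactly
# as MATLAB's sparse(i, j, v) does.
rows = np.array([0, 1, 1, 0, 2])
cols = np.array([0, 1, 1, 0, 2])
vals = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

A = coo_matrix((vals, (rows, cols)), shape=(3, 3)).tocsr()
# The duplicates at (0, 0) and (1, 1) have been summed during
# compression, so A stores only 3 entries.
```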
Facing the Multicore-Challenge II | 2012
Vincent Heuveline; Dimitar Lukarski; Nico Trost; Jan-Philipp Weiss
Multigrid methods are efficient and fast solvers for problems typically modeled by partial differential equations of elliptic type. We use the approach of matrix-based geometric multigrid, which offers high flexibility with respect to complex geometries and local singularities. Furthermore, it adapts well to the demands of modern computing platforms. In this work we investigate multi-colored Gauss-Seidel type smoothers, the power(q)-pattern enhanced multi-colored ILU(p,q) smoothers with fill-ins, and factorized sparse approximate inverse (FSAI) smoothers. These approaches provide efficient smoothers with a high degree of parallelism. We describe the configuration of our smoothers in the context of the portable lmpLAtoolbox and the HiFlow3 parallel finite element package. In our approach, a single source code can be used across diverse platforms including multicore CPUs and GPUs. Highly optimized implementations are hidden behind a unified user interface. Efficiency and scalability of our multigrid solvers are demonstrated by means of a comprehensive performance analysis on multicore CPUs and GPUs.
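A red-black sweep is the simplest instance of a multi-colored Gauss-Seidel smoother: on a structured grid, two colors suffice, and all points of one color can be relaxed simultaneously. This NumPy sketch shows the idea (a minimal illustration, not the lmpLAtoolbox implementation):

```python
import numpy as np

def redblack_gauss_seidel(u, f, h, sweeps):
    """Red-black Gauss-Seidel smoothing for the 5-point Poisson stencil.
    All points of one color are mutually independent and can therefore
    be updated in parallel -- the property multi-colored smoothers exploit."""
    n = u.shape[0]
    I, J = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    for _ in range(sweeps):
        for color in (0, 1):
            mask = (I + J) % 2 == color
            mask[0, :] = mask[-1, :] = mask[:, 0] = mask[:, -1] = False
            update = 0.25 * (np.roll(u, 1, 0) + np.roll(u, -1, 0) +
                             np.roll(u, 1, 1) + np.roll(u, -1, 1) + h * h * f)
            u[mask] = update[mask]   # one whole color updated at a time
    return u

n = 17
h = 1.0 / (n - 1)
u = redblack_gauss_seidel(np.zeros((n, n)), np.ones((n, n)), h, 400)
```

Within one color, the update reads only values of the other color, so the assignment over the masked points is an exact Gauss-Seidel color sweep with no sequential dependency.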
international conference on cluster computing | 2010
Vincent Heuveline; Chandramowli Subramanian; Dimitar Lukarski; Jan-Philipp Weiss
Heterogeneous clusters with multiple sockets and multicore processors, accelerated by dedicated coprocessors such as GPUs, Cell BE, or FPGAs, nowadays provide unrivaled computing power in terms of floating point operations. Specific capabilities of additional processor technologies enable dedicated exploitation with respect to particular application and data characteristics. However, resource utilization, programmability, and scalability of applications across heterogeneous platforms are major concerns. In the framework of the HiFlow finite element software package we have developed a portable software approach that implements efficient parallel solvers for partial differential equations by means of unified and modular user interfaces across a variety of heterogeneous platforms — in particular on GPU-accelerated clusters. We detail our concept and provide performance analysis for various test scenarios that prove performance capabilities, scalability, viability, and user friendliness.
european conference on parallel processing | 2010
Vincent Heuveline; Dimitar Lukarski; Jan-Philipp Weiss
Krylov space methods like conjugate gradient and GMRES are efficient and parallelizable approaches for solving huge and sparse linear systems of equations. But since condition numbers grow polynomially with problem size, sophisticated preconditioning techniques are essential building blocks. However, many preconditioning approaches like Gauss-Seidel/SSOR and ILU are based on sequential algorithms, and introducing parallelism into preconditioners often hampers their mathematical efficiency. In the era of multi-core and many-core processors like GPUs there is a strong need for scalable and fine-grained parallel preconditioning approaches. In the framework of the multi-platform capable finite element package HiFlow3 we are investigating multi-coloring techniques for block Gauss-Seidel type preconditioners. Our approach proves efficiency and scalability across hybrid multi-core and GPU platforms.
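The coloring step that makes Gauss-Seidel parallelizable can be sketched as a greedy coloring of the matrix adjacency graph: unknowns sharing a color have no direct coupling and can be relaxed simultaneously. A minimal sketch (not the HiFlow3 code):

```python
import numpy as np
from scipy.sparse import diags

def greedy_coloring(A):
    """Greedy multi-coloring of the adjacency graph of a CSR matrix.
    Unknowns sharing a color have no direct coupling, so a Gauss-Seidel
    sweep can relax each color level in parallel."""
    n = A.shape[0]
    colors = -np.ones(n, dtype=int)
    for i in range(n):
        # Colors already taken by neighbors of unknown i
        taken = {colors[j] for j in A.indices[A.indptr[i]:A.indptr[i + 1]] if j != i}
        c = 0
        while c in taken:
            c += 1
        colors[i] = c
    return colors

# 1-D Laplacian: the graph is a path, so two colors (red-black) suffice.
A = diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(8, 8)).tocsr()
colors = greedy_coloring(A)
```

The number of colors bounds the number of sequential stages per sweep; within each stage, all same-colored rows are updated in parallel.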
european conference on parallel processing | 2014
Ali Dorostkar; Dimitar Lukarski; Björn Lund; Maya Neytcheva; Yvan Notay; Peter Schmidt
In this work we benchmark the performance of a preconditioned iterative method used in large-scale computer simulations of a geophysical application, namely, the elastic Glacial Isostatic Adjustment model. The model is discretized using the finite element method, which gives rise to algebraic systems of equations with matrices that are large, sparse, nonsymmetric, indefinite and with a saddle point structure. The efficiency of solving systems of the latter type is crucial, as the solver is embedded in a time-evolution procedure where systems with matrices of similar type have to be solved repeatedly many times. The implementation is based on available open source software packages — Deal.II, Trilinos, PARALUTION and AGMG. These packages provide toolboxes with state-of-the-art implementations of iterative solution methods and preconditioners for multicore computer platforms and GPUs. We present performance results in terms of numerical and computational efficiency (number of iterations and execution time) and compare the timing results against a sparse direct solver from a commercial finite element package that is often used by applied scientists in their simulations.
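The structure of a preconditioned Krylov solve for a saddle-point system can be sketched on a small synthetic model problem. This sketch uses a symmetric saddle-point matrix with MINRES and an ideal block-diagonal preconditioner (the GIA systems in the paper are nonsymmetric and far larger, and use the listed packages rather than this construction):

```python
import numpy as np
from scipy.sparse.linalg import minres, LinearOperator

rng = np.random.default_rng(0)
n, m = 12, 4
G = rng.standard_normal((n, n))
K = G @ G.T + n * np.eye(n)               # SPD (1,1) block ("stiffness")
B = rng.standard_normal((m, n))           # full-rank constraint block
A = np.block([[K, B.T], [B, np.zeros((m, m))]])   # symmetric, indefinite

# Ideal block-diagonal preconditioner: inv(K) and the inverse of the
# Schur complement S = B inv(K) B^T.
Kinv = np.linalg.inv(K)
Sinv = np.linalg.inv(B @ Kinv @ B.T)

def apply_prec(v):
    return np.concatenate([Kinv @ v[:n], Sinv @ v[n:]])

M = LinearOperator(A.shape, matvec=apply_prec)
b = rng.standard_normal(n + m)
x, info = minres(A, b, M=M)
```

With the exact Schur complement, the preconditioned operator has only three distinct eigenvalues, so MINRES converges in a handful of iterations; practical solvers replace both inverses by cheap approximations (e.g. AMG cycles).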
high performance computing for computational science (vector and parallel processing) | 2014
Hartwig Anzt; Dimitar Lukarski; Stanimire Tomov; Jack J. Dongarra
Based on the premise that preconditioners needed for scientific computing are not only required to be robust in the numerical sense, but also scalable for up to thousands of light-weight cores, we argue that this twofold goal is achieved for the recently developed self-adaptive multi-elimination preconditioner. For this purpose, we revise the underlying idea and analyze the performance of implementations realized in the PARALUTION and MAGMA open-source software libraries on GPU architectures (using either CUDA or OpenCL), Intel’s Many Integrated Core Architecture, and Intel’s Sandy Bridge processor. The comparison with other well-established preconditioners like multi-colored Gauss-Seidel, ILU(0) and multi-colored ILU(0) shows that the twofold goal of a numerically stable, cross-platform performant algorithm is achieved.
symposium on computer animation | 2013
Stefan Suwelack; Dimitar Lukarski; Vincent Heuveline; Rüdiger Dillmann; Stefanie Speidel
In this paper we present a novel approach to efficiently simulate the deformation of highly detailed meshes using higher order finite elements (FE). An efficient algorithm based on non-linear optimization is proposed in order to find the closest point in the curved computational FE mesh for each surface vertex. In order to extrapolate deformations to surface points outside the FE mesh, we introduce a mapping scheme that generates smooth surface deformations and preserves local shape even for low-resolution computational meshes. The mapping is constructed by representing each surface vertex in terms of points on the computational mesh and its distance to the FE mesh in normal direction. A numerical analysis shows that the mapping can be robustly constructed using the proposed non-linear optimization technique. Furthermore, it is demonstrated that the numerical complexity of the mapping scheme is linear in the number of surface nodes and independent of the size of the coarse computational mesh.
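The closest-point search on a curved element can be posed as a small bound-constrained optimization over the element parameter. This sketch uses a hypothetical quadratic edge with made-up control points and SciPy's general-purpose optimizer; the paper's own algorithm and mesh data structures are not reproduced here:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical quadratic (curved) element edge, parametrized over
# t in [0, 1] by three control points and quadratic Lagrange shapes.
P = np.array([[0.0, 0.0], [0.5, 0.4], [1.0, 0.0]])

def edge(t):
    # Quadratic Lagrange shape functions at nodes t = 0, 0.5, 1
    N = np.array([2 * (t - 0.5) * (t - 1), -4 * t * (t - 1), 2 * t * (t - 0.5)])
    return N @ P

def closest_point(x):
    """Closest point on the curved edge to the surface vertex x, found
    by minimizing the squared distance over the element parameter t."""
    res = minimize(lambda t: np.sum((edge(t[0]) - x) ** 2),
                   x0=[0.2], bounds=[(0.0, 1.0)])
    return res.x[0], edge(res.x[0])

t_opt, p = closest_point(np.array([0.5, 1.0]))   # vertex above the apex
```

For the symmetric test vertex, the optimizer recovers the apex of the curved edge at t = 0.5; in the full scheme this projection is computed once per surface vertex and reused during the simulation.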
international parallel and distributed processing symposium | 2014
Dimitar Lukarski; Hartwig Anzt; Stanimire Tomov; Jack J. Dongarra
Iterative solvers for sparse linear systems often benefit from using preconditioners. While there exist implementations for many iterative methods that leverage the computing power of accelerators, porting the latest developments in preconditioners to accelerators has been challenging. In this paper we develop a self-adaptive multi-elimination preconditioner for graphics processing units (GPUs). The preconditioner is based on a multi-level incomplete LU factorization and uses a direct dense solver for the bottom-level system. For test matrices from the University of Florida matrix collection, we investigate the influence of handling the triangular solvers in the distinct iteration steps in either single or double precision arithmetic. Integrated into a Conjugate Gradient method, we show that our multi-elimination algorithm is highly competitive against popular preconditioners, including multi-colored symmetric Gauss-Seidel relaxation preconditioners, and (multi-colored symmetric) ILU for numerous problems.
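The elimination structure underlying such a preconditioner can be sketched with a single level: an independent set is eliminated trivially, and the resulting Schur-complement system plays the role of the bottom-level dense solve. For clarity this sketch eliminates exactly, whereas the actual preconditioner drops fill-in (making the factorization incomplete) and recurses over several levels:

```python
import numpy as np

def multi_elimination_solve(A, b, indep):
    """One elimination level: unknowns in the independent set `indep`
    have no coupling among themselves, so their block D is diagonal and
    is eliminated trivially (in parallel); the remaining Schur-complement
    system is handed to a dense direct solver -- the 'bottom level'."""
    n = A.shape[0]
    rest = np.setdiff1d(np.arange(n), indep)
    D = A[np.ix_(indep, indep)]                # diagonal by construction
    F = A[np.ix_(indep, rest)]
    E = A[np.ix_(rest, indep)]
    C = A[np.ix_(rest, rest)]
    Dinv = np.diag(1.0 / np.diag(D))
    S = C - E @ Dinv @ F                       # Schur complement
    y = np.linalg.solve(S, b[rest] - E @ (Dinv @ b[indep]))  # bottom level
    x = np.empty(n)
    x[rest] = y
    x[indep] = Dinv @ (b[indep] - F @ y)       # parallel back-substitution
    return x

# Red-black ordering of a 1-D Laplacian: the even-numbered unknowns
# form an independent set, so their block is diagonal.
A = np.diag(2.0 * np.ones(8)) + np.diag(-np.ones(7), 1) + np.diag(-np.ones(7), -1)
b = np.arange(1.0, 9.0)
x = multi_elimination_solve(A, b, np.array([0, 2, 4, 6]))
```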