Network


Latest external collaboration at the country level.

Hotspot


Dive into the research topics where Endre László is active.

Publication


Featured research published by Endre László.


Concurrency and Computation: Practice and Experience | 2016

Vectorizing unstructured mesh computations for many-core architectures

I. Z. Reguly; Endre László; Gihan R. Mudalige; Michael B. Giles

Achieving optimal performance on the latest multi-core and many-core architectures increasingly depends on making efficient use of the hardware's vector units. This paper presents results on achieving high performance through vectorization on CPUs and the Xeon Phi on a key class of irregular applications: unstructured mesh computations. Using the single instruction multiple thread (SIMT) and single instruction multiple data (SIMD) programming models, we show how unstructured mesh computations map to OpenCL or vector intrinsics through the use of code generation techniques in the OP2 Domain Specific Library, and explore how irregular memory accesses and race conditions can be organized on different hardware. We benchmark Intel Xeon CPUs and the Xeon Phi using a tsunami simulation and a representative CFD benchmark. Results are compared with previous work on CPUs and NVIDIA GPUs to provide a comparison of achievable performance on current many-core systems. We show that auto-vectorization and the OpenCL SIMT model do not map efficiently to CPU vector units because of vectorization issues and threading overheads. In contrast, using SIMD vector intrinsics imposes some restrictions and requires more involved programming techniques, but results in efficient code and near-optimal performance, two times faster than non-vectorized code. We observe that the Xeon Phi does not provide good performance for these applications, but is still comparable with a pair of mid-range Xeon chips.
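
As a rough illustration of the SIMD-intrinsics approach discussed above (not the OP2-generated code itself), the following C++ sketch vectorizes the indirect gather in an unstructured-mesh edge loop with AVX2 intrinsics; the kernel, array names, and connectivity layout are hypothetical.

```cpp
// Minimal sketch of vectorizing an indirect (gather) access with AVX2
// intrinsics, in the spirit of the SIMD approach discussed above.
// Array names and the edge kernel are illustrative, not from OP2.
#include <immintrin.h>
#include <vector>

void edge_kernel_avx2(const std::vector<int>&   edge2node, // per-edge node index
                      const std::vector<float>& node_val,  // per-node data
                      std::vector<float>&       edge_out)  // per-edge result
{
    const std::size_t n_edges = edge_out.size();
    std::size_t e = 0;
    for (; e + 8 <= n_edges; e += 8) {
        // Gather eight indirectly addressed node values in one instruction.
        __m256i idx = _mm256_loadu_si256(
            reinterpret_cast<const __m256i*>(&edge2node[e]));
        __m256  v   = _mm256_i32gather_ps(node_val.data(), idx, sizeof(float));
        // Toy per-edge computation; a real kernel would evaluate the flux here.
        __m256  r   = _mm256_mul_ps(v, _mm256_set1_ps(0.5f));
        _mm256_storeu_ps(&edge_out[e], r);
    }
    for (; e < n_edges; ++e)                  // scalar remainder loop
        edge_out[e] = 0.5f * node_val[edge2node[e]];
}
```

Scattering results back to shared nodes would additionally need the colouring or conflict-avoidance schemes the paper discusses for handling race conditions.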


IEEE International Conference on High Performance Computing, Data and Analytics | 2012

An Analytical Study of Loop Tiling for a Large-Scale Unstructured Mesh Application

Michael B. Giles; Gihan R. Mudalige; Carlo Bertolli; Paul H. J. Kelly; Endre László; I. Z. Reguly

Increasingly, the main bottleneck limiting performance on emerging multi-core and many-core processors is the movement of data between cores and main memory. As the number of cores increases, more and more data needs to be exchanged with memory to keep them fully utilized. This critical bottleneck already limits the utility of processors and our ability to leverage increased parallelism to achieve higher performance. On the other hand, considerable computer science research exists on tiling techniques (also known as sparse tiling) for reducing data transfers. Such work demonstrates how the growing memory bottleneck could be avoided, but the difficulty has been in extending these ideas to real-world applications: the resulting algorithms quickly become highly complicated, and it has been very difficult for a compiler to automatically detect the opportunities and implement the execution strategy. Focusing on the unstructured mesh application class, this paper presents a preliminary analytical investigation into the performance benefits of tiling (or loop-blocking) algorithms on a real-world industrial CFD application. We analytically estimate the reductions in communication or memory accesses for the main parallel loops in this application and quantitatively predict the performance benefits that can be gained on modern multi-core and many-core hardware. The analysis demonstrates that, in general, a factor-of-four reduction in data movement can be achieved by tiling parallel loops. A major part of the savings comes from contraction of temporary or transient data arrays that need not be written back to main memory, by holding them in the last level cache (LLC) of modern processors.
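
To make the contraction argument concrete, here is a minimal C++ sketch of the tiling idea on a regular 1-D example (unstructured meshes need inspector/executor machinery on top of this); the kernels and the tile size are illustrative assumptions.

```cpp
// Illustrative sketch of loop tiling: two back-to-back loops share a
// temporary array. Fusing them into tiles keeps each tile's slice of `tmp`
// in cache (or the LLC) instead of writing it back to main memory.
#include <algorithm>
#include <vector>

void two_loops_tiled(const std::vector<double>& in,
                     std::vector<double>&       out,
                     std::size_t                tile = 4096)
{
    const std::size_t n = in.size();
    std::vector<double> tmp(tile);               // contracted: tile-sized, not n-sized
    for (std::size_t start = 0; start < n; start += tile) {
        const std::size_t end = std::min(start + tile, n);
        // Loop 1: produce the transient data for this tile only.
        for (std::size_t i = start; i < end; ++i)
            tmp[i - start] = 2.0 * in[i];
        // Loop 2: consume it while it is still cache-resident.
        for (std::size_t i = start; i < end; ++i)
            out[i] = tmp[i - start] + 1.0;
    }
}
```

The saving the paper quantifies comes from `tmp` never touching main memory; for unstructured meshes the tile boundaries are irregular, which is what makes automating the transformation hard.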


High Performance Computational Finance | 2014

GPU implementation of finite difference solvers

Michael B. Giles; Endre László; I. Z. Reguly; Jeremy Appleyard; Julien Demouth

This paper discusses the implementation of one-factor and three-factor PDE models on GPUs. Both explicit and implicit time-marching methods are considered, with the latter requiring the solution of multiple tridiagonal systems of equations. Because of the small amount of data involved, one-factor models are primarily compute-limited, with a very good fraction of the peak compute capability being achieved. The key to the performance lies in the heavy use of registers and shuffle instructions for the explicit method, and a non-standard hybrid Thomas/PCR algorithm for solving the tridiagonal systems for the implicit solver. The three-factor problems involve much more data, and hence their execution is more evenly balanced between computation and data communication to/from the main graphics memory. However, it is again possible to achieve a good fraction of the theoretical peak performance on both measures. The high performance requires particularly careful attention to coalescence in the data transfers, using local shared memory for small array transpositions, and padding to avoid shared memory bank conflicts. Computational results include comparisons to computations on Sandy Bridge and Haswell Intel Xeon processors, using both multithreading and AVX vectorisation.
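
For reference, the implicit time-marching step reduces to tridiagonal solves of the kind below. This is a minimal serial Thomas algorithm in C++, assuming a[0] = c[n-1] = 0, and is the textbook baseline rather than the hybrid Thomas/PCR GPU kernel described in the paper.

```cpp
// Serial Thomas algorithm for one tridiagonal system
//   a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i],  with a[0] = c[n-1] = 0.
// The caller provides x with size n; d is taken by value and modified.
#include <vector>

void thomas(const std::vector<double>& a, const std::vector<double>& b,
            const std::vector<double>& c, std::vector<double> d,
            std::vector<double>& x)
{
    const std::size_t n = b.size();
    std::vector<double> cp(n);
    cp[0] = c[0] / b[0];
    d[0] /= b[0];
    for (std::size_t i = 1; i < n; ++i) {        // forward elimination
        const double m = 1.0 / (b[i] - a[i] * cp[i - 1]);
        cp[i] = c[i] * m;
        d[i]  = (d[i] - a[i] * d[i - 1]) * m;
    }
    x[n - 1] = d[n - 1];
    for (std::size_t i = n - 1; i-- > 0; )       // back substitution
        x[i] = d[i] - cp[i] * x[i + 1];
}
```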


13th International Workshop on Cellular Nanoscale Networks and their Applications | 2012

Analysis of a GPU based CNN implementation

Endre László; Péter Szolgay; Zoltán Nagy

The CNN (Cellular Neural Network) is a powerful image processing architecture whose hardware implementation is extremely fast. The lack of such a hardware device in a development process can be substituted by an efficient simulator implementation, and commercially available graphics cards with high computing capabilities make such a simulator feasible. The aim of this work is to present a GPU-based implementation of a CNN simulator using NVIDIA's Fermi architecture. Different implementation approaches are considered and compared to a multi-core, multi-threaded CPU implementation and to some earlier GPU implementations. A detailed analysis of the introduced GPU implementation is presented.
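
As background for what such a simulator computes, here is a hedged C++ sketch of one forward-Euler step of the standard CNN state equation for a single 3x3 template pair; the clamped boundary handling and data layout are assumptions, and the GPU simulator parallelizes this update across cells.

```cpp
// One forward-Euler step of the standard CNN state equation on a 2-D grid:
//   dx/dt = -x + sum(A .* y_neighbourhood) + sum(B .* u_neighbourhood) + z
// with the piecewise-linear output y = 0.5*(|x+1| - |x-1|).
// Minimal scalar reference; boundary clamping is an illustrative choice.
#include <algorithm>
#include <cmath>
#include <vector>

void cnn_euler_step(std::vector<float>&       x,   // state, row-major W*H
                    const std::vector<float>& u,   // input image, row-major W*H
                    const float A[3][3], const float B[3][3],
                    float z, float dt, int W, int H)
{
    auto out = [](float s) { return 0.5f * (std::fabs(s + 1.0f) - std::fabs(s - 1.0f)); };
    std::vector<float> xn(x.size());
    for (int i = 0; i < H; ++i)
        for (int j = 0; j < W; ++j) {
            float acc = z;
            for (int di = -1; di <= 1; ++di)
                for (int dj = -1; dj <= 1; ++dj) {
                    const int ni = std::min(std::max(i + di, 0), H - 1);  // clamp at edges
                    const int nj = std::min(std::max(j + dj, 0), W - 1);
                    acc += A[di + 1][dj + 1] * out(x[ni * W + nj])
                         + B[di + 1][dj + 1] * u[ni * W + nj];
                }
            xn[i * W + j] = x[i * W + j] + dt * (-x[i * W + j] + acc);
        }
    x.swap(xn);
}
```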


ACM Transactions on Mathematical Software | 2016

Manycore Algorithms for Batch Scalar and Block Tridiagonal Solvers

Endre László; Michael B. Giles; Jeremy Appleyard

Engineering, scientific, and financial applications often require the simultaneous solution of a large number of independent tridiagonal systems of equations with varying coefficients. Since the number of systems is large enough to offer considerable parallelism on manycore systems, the choice between different tridiagonal solution algorithms, such as Thomas, Cyclic Reduction (CR), or Parallel Cyclic Reduction (PCR), needs to be reexamined. This work investigates the optimal choice of tridiagonal algorithm for CPU, Intel MIC, and NVIDIA GPU, with a focus on minimizing the amount of data transferred to and from main memory using novel algorithms and the register-blocking mechanism, and on maximizing the achieved bandwidth. It also considers block tridiagonal solutions, which are sometimes required in Computational Fluid Dynamics (CFD) applications. A novel work-sharing and register-blocking-based Thomas solver is also presented.
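
As a point of reference for the algorithm comparison, below is a serial C++ sketch of Parallel Cyclic Reduction (PCR) for a single system; on a GPU the inner loop over equations runs in parallel (one thread per row), and the paper's register-blocked variants are considerably more involved. The sketch assumes a[0] = c[n-1] = 0 and a well-conditioned (e.g. diagonally dominant) system.

```cpp
// Serial reference of PCR for one tridiagonal system
//   a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i].
// Each reduction step doubles the coupling distance; after ceil(log2 n)
// steps every equation is decoupled and x[i] = d[i]/b[i].
#include <vector>

std::vector<double> pcr(std::vector<double> a, std::vector<double> b,
                        std::vector<double> c, std::vector<double> d)
{
    const int n = static_cast<int>(b.size());
    for (int s = 1; s < n; s *= 2) {
        std::vector<double> a2(a), b2(b), c2(c), d2(d);
        for (int i = 0; i < n; ++i) {            // each i is independent
            if (i - s >= 0) {                    // eliminate x[i-s] using row i-s
                const double k = a[i] / b[i - s];
                a2[i]  = -k * a[i - s];
                b2[i] -=  k * c[i - s];
                d2[i] -=  k * d[i - s];
            }
            if (i + s < n) {                     // eliminate x[i+s] using row i+s
                const double k = c[i] / b[i + s];
                c2[i]  = -k * c[i + s];
                b2[i] -=  k * a[i + s];
                d2[i] -=  k * d[i + s];
            }
        }
        a.swap(a2); b.swap(b2); c.swap(c2); d.swap(d2);
    }
    std::vector<double> x(n);
    for (int i = 0; i < n; ++i) x[i] = d[i] / b[i];   // fully decoupled
    return x;
}
```

PCR does more arithmetic than Thomas but exposes parallelism within a single system, which is why the best choice depends on the architecture and on how many independent systems are available.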


International Symposium on Circuits and Systems | 2015

Analysis of parallel processor architectures for the solution of the Black-Scholes PDE

Endre László; Zoltán Nagy; Michael B. Giles; I. Z. Reguly; Jeremy Appleyard; Péter Szolgay

Common parallel computer microarchitectures offer a wide variety of ways to implement numerical algorithms. The efficiency of different algorithms applied to the same problem varies with the underlying architecture, which can be a multi-core CPU, many-core GPU, Intel's MIC (Many Integrated Core), or an FPGA. Significant differences between these architectures exist in the ISA (Instruction Set Architecture) and in the way the compute flow is executed. The way parallelism is expressed changes with the ISA, thread management, and customization available on the device, and these differences constrain the algorithms that can be implemented. The aim of this work is to analyze the efficiency of the algorithms in light of these architectural differences. The problem at hand is the one-factor Black-Scholes option pricing equation, a parabolic PDE solved with explicit and implicit time-marching algorithms; in the implicit solution a scalar tridiagonal system of equations needs to be solved. Possible CPU and GPU implementations are presented along with novel FPGA solutions using HLS (High Level Synthesis). Performance is also analyzed and remarks on efficiency are made.
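
To illustrate the explicit branch of the time-marching, here is a hedged C++ sketch of an explicit finite-difference solve of the one-factor Black-Scholes equation, written in time-to-maturity form so the scheme steps forward from the payoff; the uniform grid, European call payoff, boundary conditions, and time step (which must satisfy the explicit stability limit) are illustrative assumptions.

```cpp
// Explicit time-marching of the one-factor Black-Scholes PDE in
// time-to-maturity tau:  dV/dtau = 0.5*sigma^2*S^2*V_SS + r*S*V_S - r*V,
// starting from the call payoff at tau = 0.  Returns V on the S grid at tau = T.
#include <algorithm>
#include <cmath>
#include <vector>

std::vector<double> bs_explicit(double sigma, double r, double K, double T,
                                double Smax, int nS, int nT)
{
    const double dS = Smax / nS, dt = T / nT;
    std::vector<double> V(nS + 1), Vn(nS + 1);
    for (int i = 0; i <= nS; ++i)                      // payoff at tau = 0
        V[i] = std::max(i * dS - K, 0.0);
    for (int m = 1; m <= nT; ++m) {
        for (int i = 1; i < nS; ++i) {
            const double S   = i * dS;
            const double dV  = (V[i + 1] - V[i - 1]) / (2.0 * dS);
            const double d2V = (V[i + 1] - 2.0 * V[i] + V[i - 1]) / (dS * dS);
            Vn[i] = V[i] + dt * (0.5 * sigma * sigma * S * S * d2V
                                 + r * S * dV - r * V[i]);
        }
        Vn[0]  = 0.0;                                  // call worthless at S = 0
        Vn[nS] = Smax - K * std::exp(-r * m * dt);     // deep in-the-money boundary
        V.swap(Vn);
    }
    return V;
}
```

The implicit counterpart replaces the pointwise update with a tridiagonal solve per time step, which is where the solver choices discussed in the other papers come in.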


14th International Workshop on Cellular Nanoscale Networks and Their Applications (CNNA) | 2014

Methods to utilize SIMT and SIMD instruction level parallelism in tridiagonal solvers

Endre László; Michael B. Giles; Jeremy Appleyard; Péter Szolgay

The most widely used parallel architectures in today's High Performance Computing systems rely on multi-core CPUs, many-core GPUs, or Intel's MIC (Many Integrated Core). The effort put into new algorithms and implementations greatly influences performance on these architectures, and the differences in their underlying instruction-level parallelism, namely SIMT (Single Instruction Multiple Thread) and SIMD (Single Instruction Multiple Data), require different approaches. The aim of the work presented here is to show how high performance can be achieved in solving multiple scalar- and block-tridiagonal systems of equations. The Thomas algorithm is implemented on all three hardware platforms, and for the GPU we also implement a hybrid algorithm based on Parallel Cyclic Reduction and the Thomas algorithm for solving scalar problems, and a thread-level, work-sharing-based algorithm for block-tridiagonal problems. Performance comparisons and a discussion on efficiency are also included.
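
One common way to expose SIMD (or SIMT) parallelism across many independent systems is to interleave them in memory so that consecutive lanes hold consecutive systems; the batched Thomas sketch below illustrates that layout in C++ with OpenMP SIMD hints, but it is not necessarily the exact scheme used in the paper.

```cpp
// Batched Thomas solve over `nsys` interleaved tridiagonal systems of size n:
// element [i*nsys + s] is row i of system s, so the loop over s vectorizes.
// Compile with OpenMP SIMD support (e.g. -fopenmp-simd); the layout is an
// illustrative assumption.  a, b, c, d, x all have n*nsys elements.
#include <vector>

void thomas_batched(const std::vector<double>& a, const std::vector<double>& b,
                    const std::vector<double>& c, std::vector<double> d,
                    std::vector<double>& x, int n, int nsys)
{
    std::vector<double> cp(static_cast<std::size_t>(n) * nsys);
    for (int s = 0; s < nsys; ++s) {                 // row 0 of every system
        cp[s] = c[s] / b[s];
        d[s] /= b[s];
    }
    for (int i = 1; i < n; ++i) {                    // forward elimination
        #pragma omp simd
        for (int s = 0; s < nsys; ++s) {
            const int k = i * nsys + s, km = k - nsys;
            const double m = 1.0 / (b[k] - a[k] * cp[km]);
            cp[k] = c[k] * m;
            d[k]  = (d[k] - a[k] * d[km]) * m;
        }
    }
    for (int s = 0; s < nsys; ++s)
        x[(n - 1) * nsys + s] = d[(n - 1) * nsys + s];
    for (int i = n - 2; i >= 0; --i) {               // back substitution
        #pragma omp simd
        for (int s = 0; s < nsys; ++s) {
            const int k = i * nsys + s;
            x[k] = d[k] - cp[k] * x[k + nsys];
        }
    }
}
```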


The Fourth International Conference on Communication Theory, Reliability, and Quality of Service (CTRQ) | 2011

Novel Load Balancing Scheduling Algorithms for Wireless Sensor Networks

Endre László; Kalman Tornai; Gergely Treplán; János Levendovszky


Programming Models and Applications | 2007

Vectorizing Unstructured Mesh Computations for Many-core Architectures

I. Z. Reguly; Endre László; Gihan R. Mudalige; Michael B. Giles


Archive | 2013

Tsunami simulation using the OP2 parallel framework

Michael B. Giles; Endre László; Gihan R. Mudalige; Serge Guillas; Carlo Bertolli; Paul H. J. Kelly

Collaboration


Dive into Endre László's collaboration.

Top Co-Authors

I. Z. Reguly

Pázmány Péter Catholic University

Péter Szolgay

Pázmány Péter Catholic University

Zoltán Nagy

Hungarian Academy of Sciences

Gergely Treplán

Pázmány Péter Catholic University

János Levendovszky

Budapest University of Technology and Economics