I. Z. Reguly
Pázmány Péter Catholic University
Publications
Featured research published by I. Z. Reguly.
IEEE Transactions on Parallel and Distributed Systems | 2016
I. Z. Reguly; Gihan R. Mudalige; Carlo Bertolli; Michael B. Giles; Adam Betts; Paul H. J. Kelly; David Radford
Hydra is a full-scale industrial CFD application used for the design of turbomachinery at Rolls-Royce plc, capable of performing complex simulations over highly detailed unstructured mesh geometries. Hydra presents major challenges in data organization and movement that need to be overcome for continued high performance on emerging platforms. We present research in achieving this goal through the OP2 domain-specific high-level framework, demonstrating the viability of such a high-level programming approach. OP2 targets the domain of unstructured mesh problems and enables execution on a range of back-end hardware platforms. We chart the conversion of Hydra to OP2 and map out the key difficulties encountered in the process. Specifically, we show how different parallel implementations can be achieved with an active library framework, even for a highly complicated industrial application, and how different optimizations targeting contrasting parallel architectures can be applied to the whole application, seamlessly, reducing developer effort and increasing code longevity. Performance results demonstrate not only that the runtime performance of the hand-tuned original code can be matched, but that it can be significantly improved on both conventional processor systems and many-core systems. Our results provide evidence of how high-level frameworks such as OP2 enable portability across a wide range of contrasting platforms, and of their significant utility in achieving high performance without the intervention of the application programmer.
Parallel Computing | 2013
Gihan R. Mudalige; Michael B. Giles; Jeyarajan Thiyagalingam; I. Z. Reguly; Carlo Bertolli; Paul H. J. Kelly; Anne E. Trefethen
- Discuss the main design issues in parallelizing unstructured mesh applications.
- Present OP2 for developing applications for heterogeneous parallel systems.
- Analyze the performance gained with OP2 for two industrial-representative benchmarks.
- Compare runtime, scaling and runtime breakdowns of the applications.
- Present energy consumption of OP2 applications on CPU and GPU clusters.
OP2 is a high-level domain-specific library framework for the solution of unstructured mesh-based applications. It utilizes source-to-source translation and compilation so that a single application code written using the OP2 API can be transformed into multiple parallel implementations for execution on a range of back-end hardware platforms. In this paper we present the design and performance of OP2's recent developments facilitating code generation and execution on distributed-memory heterogeneous systems. OP2 targets the solution of numerical problems based on static unstructured meshes. We discuss the main design issues in parallelizing this class of applications, including handling data dependencies in accessing indirectly referenced data, and design considerations in generating code for execution on a cluster of multi-threaded CPUs and GPUs. Two representative CFD applications, written using the OP2 framework, are utilized to provide a contrasting benchmarking and performance analysis study on a number of heterogeneous systems, including a large-scale Cray XE6 system and a large GPU cluster. A range of performance metrics are benchmarked, including runtime, scalability, achieved compute and bandwidth performance, runtime bottlenecks and system energy consumption. We demonstrate that an application written once at a high level using OP2 is easily portable across a wide range of contrasting platforms and is capable of achieving near-optimal performance without the intervention of the domain application programmer.
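The central data dependency the abstract mentions can be illustrated with a minimal sketch: an edge loop that increments data on the two nodes each edge touches. Two edges sharing a node would race if run concurrently, so frameworks like OP2 color the edges so that same-colored edges touch disjoint nodes and can execute in parallel. The code below is an illustrative pure-Python sketch of that idea, not OP2's actual API; all names are hypothetical.

```python
def color_edges(edges):
    """Greedy coloring: no two same-colored edges share a node."""
    colors = []
    for a, b in edges:
        # colors already used by earlier edges that touch node a or b
        used = {colors[i] for i, (x, y) in enumerate(edges[:len(colors)])
                if {x, y} & {a, b}}
        c = 0
        while c in used:
            c += 1
        colors.append(c)
    return colors

def edge_loop(edges, node_vals, edge_w):
    """Serial reference: each edge adds its weight to both endpoints."""
    for (a, b), w in zip(edges, edge_w):
        node_vals[a] += w
        node_vals[b] += w
    return node_vals

def colored_edge_loop(edges, node_vals, edge_w):
    """Same computation, executed color by color (parallel-safe per color)."""
    colors = color_edges(edges)
    for c in range(max(colors) + 1):
        # all edges in one color set could safely run concurrently
        for i, (a, b) in enumerate(edges):
            if colors[i] == c:
                node_vals[a] += edge_w[i]
                node_vals[b] += edge_w[i]
    return node_vals
```

Within each color, iterations write to disjoint nodes, which is what makes OpenMP threading or GPU execution of the loop safe.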
Philosophical Transactions of the Royal Society A | 2014
Michael B. Giles; I. Z. Reguly
High-performance computing has evolved remarkably over the past 20 years, and that progress is likely to continue. However, in recent years, this progress has been achieved through greatly increased hardware complexity with the rise of multicore and manycore processors, and this is affecting the ability of application developers to achieve the full potential of these systems. This article outlines the key developments on the hardware side, both in the recent past and in the near future, with a focus on two key issues: energy efficiency and the cost of moving data. It then discusses the much slower evolution of system software, and the implications of all of this for application developers.
Concurrency and Computation: Practice and Experience | 2016
I. Z. Reguly; Endre László; Gihan R. Mudalige; Michael B. Giles
Achieving optimal performance on the latest multi-core and many-core architectures increasingly depends on making efficient use of the hardware's vector units. This paper presents results on achieving high performance through vectorization on CPUs and the Xeon Phi for a key class of irregular applications: unstructured mesh computations. Using the single instruction multiple thread (SIMT) and single instruction multiple data (SIMD) programming models, we show how unstructured mesh computations map to OpenCL or vector intrinsics through the use of code generation techniques in the OP2 Domain Specific Library, and explore how irregular memory accesses and race conditions can be handled on different hardware. We benchmark Intel Xeon CPUs and the Xeon Phi using a tsunami simulation and a representative CFD benchmark. Results are compared with previous work on CPUs and NVIDIA GPUs to provide a comparison of achievable performance on current many-core systems. We show that auto-vectorization and the OpenCL SIMT model do not map efficiently to CPU vector units because of vectorization issues and threading overheads. In contrast, using SIMD vector intrinsics imposes some restrictions and requires more involved programming techniques, but results in efficient code and near-optimal performance, two times faster than non-vectorized code. We observe that the Xeon Phi does not provide good performance for these applications, but is still comparable with a pair of mid-range Xeon chips.
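The reason indirect accesses complicate SIMD code is that vector units want contiguous operands: indirectly addressed data must first be gathered into vector-width "lanes", processed element-wise, and results written back. The sketch below simulates that gather-compute-scatter shape in plain Python (the vector width, kernel, and names are illustrative assumptions, not the paper's code; a direct per-iteration output avoids the race-condition case).

```python
VEC = 4  # assumed vector width, standing in for an AVX register

def gathered_kernel(cell_vals, indices, out):
    """Process iterations in VEC-wide batches, as vector intrinsics would."""
    for base in range(0, len(indices), VEC):
        batch = indices[base:base + VEC]
        lanes = [cell_vals[j] for j in batch]   # gather indirect operands
        lanes = [2.0 * v + 1.0 for v in lanes]  # element-wise "vector" math
        for lane in range(len(batch)):          # write results back
            out[base + lane] = lanes[lane]
    return out
```

The gather and scatter steps are pure data movement overhead; the paper's observation is that paying it explicitly with intrinsics still beats auto-vectorization, which often fails to vectorize such loops at all.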
IEEE International Conference on High Performance Computing, Data and Analytics | 2014
I. Z. Reguly; Gihan R. Mudalige; Michael B. Giles; Dan Curran; Simon N McIntosh-Smith
Code maintainability, performance portability and future proofing are some of the key challenges in this era of rapid change in High Performance Computing. Domain Specific Languages and Active Libraries address these challenges by focusing on a single application domain and providing a high-level programming approach, and then subsequently using domain knowledge to deliver high performance on various hardware. In this paper, we introduce the OPS high-level abstraction and active library aimed at multi-block structured grid computations, and discuss some of its key design points; we demonstrate how OPS can be embedded in C/C++ and the API made to look like a traditional library, and how through a combination of simple text manipulation and back-end logic we can enable execution on a diverse range of hardware using different parallel programming approaches. Relying on the access-execute description of the OPS abstraction, we introduce a number of automated execution techniques that enable distributed memory parallelization, optimization of communication patterns, checkpointing and cache-blocking. Using performance results from CloverLeaf from the Mantevo suite of benchmarks, we demonstrate the utility of OPS.
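The "access-execute" idea the abstract relies on can be sketched minimally: the user supplies an element-wise kernel plus declarations of which neighboring grid points each argument reads or writes, and the framework owns the loop, so it can later substitute tiled, threaded, or distributed schedules without touching user code. The API below is a hypothetical miniature for illustration only; it is not the real OPS interface.

```python
READ, WRITE = "read", "write"

def par_loop(kernel, grid_range, args):
    """args: list of (array, stencil_offsets, access_mode) descriptors."""
    lo, hi = grid_range
    for i in range(lo, hi):  # the framework, not the user, owns this loop
        views = []
        for arr, offsets, mode in args:
            if mode == READ:
                # hand the kernel exactly the stencil points it declared
                views.append([arr[i + o] for o in offsets])
            else:
                views.append(arr)  # kernel writes arr[i] directly
        kernel(i, *views)

def avg_kernel(i, neighbors, out):
    """Three-point average stencil."""
    out[i] = sum(neighbors) / 3.0

u = [0.0, 1.0, 2.0, 3.0, 4.0]
v = [0.0] * 5
par_loop(avg_kernel, (1, 4), [(u, (-1, 0, 1), READ), (v, (), WRITE)])
```

Because the stencil footprint of every argument is declared up front, the framework can compute halo exchanges, checkpoints, and cache-blocking schedules automatically, which is exactly the leverage the paper describes.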
International Journal of Parallel Programming | 2015
I. Z. Reguly; Michael B. Giles
The finite element method (FEM) is one of the most commonly used techniques for the solution of partial differential equations on unstructured meshes. This paper discusses both the assembly and the solution phases of the FEM with special attention to the balance of computation and data movement. We present a GPU assembly algorithm that scales to arbitrary degree polynomials used as basis functions, at the expense of redundant computations. We show how the storage of the stiffness matrix affects the performance of both the assembly and the solution. We investigate two approaches: global assembly into the CSR and ELLPACK matrix formats and matrix-free algorithms, and show the trade-off between the amount of indexing data and stiffness data. We discuss the performance of different approaches in light of the implicit caches on Fermi GPUs and show a speedup over a two-socket 12-core CPU of up to 10 times in the assembly and up to 6 times in the solution phase. We present our sparse matrix-vector multiplication algorithms that are part of a conjugate gradient iteration and show that a matrix-free approach may be up to two times faster than global assembly approaches and up to 4 times faster than NVIDIA’s cuSPARSE library, depending on the preconditioner used.
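The CSR-versus-ELLPACK trade-off discussed above shows up most clearly in the sparse matrix-vector product at the heart of the conjugate gradient iteration. The sketch below (a plain-Python illustration, not the paper's GPU code) shows both formats: CSR stores only nonzeros behind row pointers, while ELLPACK pads every row to a common width, wasting some storage in exchange for the regular, coalescable access pattern that suits GPUs.

```python
def spmv_csr(row_ptr, col_idx, vals, x):
    """y = A*x with A in compressed sparse row form."""
    y = []
    for r in range(len(row_ptr) - 1):
        y.append(sum(vals[k] * x[col_idx[k]]
                     for k in range(row_ptr[r], row_ptr[r + 1])))
    return y

def spmv_ellpack(cols, vals, x):
    """y = A*x with per-row lists padded to equal width (pad val = 0)."""
    return [sum(v * x[c] for c, v in zip(crow, vrow))
            for crow, vrow in zip(cols, vals)]
```

A matrix-free variant instead recomputes local stiffness contributions on the fly, trading the stiffness-data traffic both formats incur for extra arithmetic, which is the balance the paper quantifies.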
IEEE International Conference on High Performance Computing, Data and Analytics | 2012
Michael B. Giles; Gihan R. Mudalige; Carlo Bertolli; Paul H. J. Kelly; Endre László; I. Z. Reguly
Increasingly, the main bottleneck limiting performance on emerging multi-core and many-core processors is the movement of data between their different cores and main memory. As the number of cores increases, more and more data needs to be exchanged with memory to keep them fully utilized. This critical bottleneck is already limiting the utility of processors and our ability to leverage increased parallelism to achieve higher performance. On the other hand, considerable computer science research exists on tiling techniques (also known as sparse tiling) for reducing data transfers. Such work demonstrates how the increasing memory bottleneck could be avoided, but the difficulty has been in extending these ideas to real-world applications: the algorithms quickly become highly complicated, and it has proved very difficult for a compiler to automatically detect the opportunities and implement the execution strategy. Focusing on the unstructured mesh application class, in this paper we present a preliminary analytical investigation into the performance benefits of tiling (or loop-blocking) algorithms on a real-world industrial CFD application. We analytically estimate the reductions in communications or memory accesses for the main parallel loops in this application and predict quantitatively the performance benefits that can be gained on modern multi-core and many-core hardware. The analysis demonstrates that, in general, a factor-of-four reduction in data movement can be achieved by tiling parallel loops. A major part of the savings comes from the contraction of temporary or transient data arrays that need not be written back to main memory, by holding them in the last-level cache (LLC) of modern processors.
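The contraction of temporaries described above can be sketched with two dependent loops executed tile by tile, so the temporary produced by the first loop stays tile-local (cache-resident) instead of round-tripping through main memory. This is an illustrative pointwise example under assumed kernels; a real sparse-tiling schedule must additionally handle stencil dependencies across tile boundaries, which this sketch elides.

```python
def untiled(u):
    """Two sweeps over full arrays: tmp is written to and read from memory."""
    tmp = [2.0 * x for x in u]     # loop 1 produces a whole temporary array
    return [t + 1.0 for t in tmp]  # loop 2 reads it all back

def tiled(u, tile=4):
    """Same computation fused per tile: the temporary never leaves the tile."""
    v = []
    for base in range(0, len(u), tile):
        chunk = u[base:base + tile]
        tmp = [2.0 * x for x in chunk]   # tile-sized temporary (stays in LLC)
        v.extend(t + 1.0 for t in tmp)
    return v
```

The results are identical; the difference is purely in data movement, which is why the paper's analysis counts memory traffic rather than operations.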
International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems | 2014
Gihan R. Mudalige; I. Z. Reguly; Michael B. Giles; Ac Mallinson; Wp Gaudin; J. A. Herdman
In this paper we present research on applying a domain-specific high-level abstraction (HLA) development strategy with the aim of "future-proofing" a key class of high performance computing (HPC) applications that simulate hydrodynamics computations at AWE plc. We build on an existing high-level abstraction framework, OPS, that is being developed for the solution of multi-block structured mesh-based applications at the University of Oxford. OPS uses an "active library" approach, where a single application code written using the OPS API can be transformed into different highly optimized parallel implementations which can then be linked against the appropriate parallel library, enabling execution on different back-end hardware platforms. The target application in this work is the CloverLeaf mini-app from Sandia National Laboratories' Mantevo suite of codes, which consists of algorithms of interest from hydrodynamics workloads. Specifically, we present (1) the lessons learnt in re-engineering an industry-representative hydrodynamics application to utilize the OPS high-level framework and subsequent code generation to obtain a range of parallel implementations, and (2) the performance of the auto-generated OPS versions of CloverLeaf compared to that of the hand-coded original CloverLeaf implementations on a range of platforms. Benchmarked systems include Intel multi-core CPUs and NVIDIA GPUs, the Archer (Cray XC30) CPU cluster and the Titan (Cray XK7) GPU cluster, with different parallelizations (OpenMP, OpenACC, CUDA, OpenCL and MPI). Our results show that developing parallel HPC applications using a high-level framework such as OPS is no more time-consuming or difficult than writing a one-off parallel program targeting only a single parallel implementation. However, the OPS strategy pays off with a highly maintainable single application source, through which multiple parallelizations can be realized without compromising performance portability on a range of parallel systems.
High Performance Computational Finance | 2014
Michael B. Giles; Endre László; I. Z. Reguly; Jeremy Appleyard; Julien Demouth
This paper discusses the implementation of one-factor and three-factor PDE models on GPUs. Both explicit and implicit time-marching methods are considered, with the latter requiring the solution of multiple tridiagonal systems of equations. Because of the small amount of data involved, one-factor models are primarily compute-limited, with a very good fraction of the peak compute capability being achieved. The key to the performance lies in the heavy use of registers and shuffle instructions for the explicit method, and a non-standard hybrid Thomas/PCR algorithm for solving the tridiagonal systems for the implicit solver. The three-factor problems involve much more data, and hence their execution is more evenly balanced between computation and data communication to/from the main graphics memory. However, it is again possible to achieve a good fraction of the theoretical peak performance on both measures. The high performance requires particularly careful attention to coalescence in the data transfers, using local shared memory for small array transpositions, and padding to avoid shared memory bank conflicts. Computational results include comparisons to computations on Sandy Bridge and Haswell Intel Xeon processors, using both multithreading and AVX vectorisation.
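One building block of the hybrid solver named above is the classic scalar Thomas algorithm: O(n) forward elimination and back substitution for a tridiagonal system (the paper combines it with parallel cyclic reduction for GPU execution; this plain-Python version only illustrates the scalar component). Here a, b and c are the sub-, main and super-diagonals and d is the right-hand side.

```python
def thomas(a, b, c, d):
    """Solve a tridiagonal system A x = d. a[0] and c[-1] are unused."""
    n = len(d)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):                      # forward elimination
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):             # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```

The data dependence running the full length of both sweeps is what makes the algorithm inherently serial, and hence why a hybrid with PCR is needed to expose parallelism on a GPU.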
International Journal of Computational Fluid Dynamics | 2016
Satya P. Jammy; Gihan R. Mudalige; I. Z. Reguly; Neil D. Sandham; Michael B. Giles
In this paper, we report the development and validation of a compressible solver with shock capturing, using a domain-specific high-level abstraction framework, OPS, that is being developed at the University of Oxford. OPS uses an active library approach for block-structured meshes, and is capable of generating code for a variety of parallel implementations with different parallelisation strategies. Performance results on various architectures are reported for the 1D Shu–Osher test case.