
Publication


Featured research published by Gihan R. Mudalige.


Measurement and Modeling of Computer Systems | 2011

Performance analysis of a hybrid MPI/CUDA implementation of the NAS-LU benchmark

Simon J. Pennycook; Simon D. Hammond; Stephen A. Jarvis; Gihan R. Mudalige

We present the performance analysis of a port of the LU benchmark from the NAS Parallel Benchmark (NPB) suite to NVIDIA's Compute Unified Device Architecture (CUDA), and report on the optimisation efforts employed to take advantage of this platform. Execution times are reported for several different GPUs, ranging from low-end consumer-grade products to high-end HPC-grade devices, including the Tesla C2050 built on NVIDIA's Fermi processor. We also utilise recently developed performance models of LU to facilitate a comparison between future large-scale distributed clusters of GPU devices and existing clusters built on traditional CPU architectures, including a quad-socket, quad-core AMD Opteron cluster and an IBM BlueGene/P.
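
As background, the sketch below (my illustration, not the authors' port) shows the dependency pattern that makes LU a wavefront code: cell (i, j) depends on (i-1, j) and (i, j-1), so the sweep advances one anti-diagonal at a time and only cells on the same diagonal run in parallel. The stencil body is a placeholder.

#include <algorithm>
#include <vector>

// Cell (i,j) depends on (i-1,j) and (i,j-1): the sweep must advance one
// anti-diagonal d = i + j at a time, but all cells on a diagonal are
// independent and may execute in parallel.
void wavefront_sweep(std::vector<double>& g, int nx, int ny) {
    auto at = [&](int i, int j) -> double& { return g[i * ny + j]; };
    for (int d = 2; d <= nx + ny - 2; ++d) {        // diagonals, in dependency order
        int ilo = std::max(1, d - (ny - 1));
        int ihi = std::min(nx - 1, d - 1);
        #pragma omp parallel for                     // cells on diagonal d are independent
        for (int i = ilo; i <= ihi; ++i)
            at(i, d - i) = 0.5 * (at(i - 1, d - i) + at(i, d - i - 1)); // placeholder stencil
    }
}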


Simulation Tools and Techniques for Communications, Networks and Systems | 2009

WARPP: a toolkit for simulating high-performance parallel scientific codes

Simon D. Hammond; Gihan R. Mudalige; J. A. Smith; Stephen A. Jarvis; J. A. Herdman; A. Vadgama

There are a number of challenges facing the High Performance Computing (HPC) community, including increasing levels of concurrency (threads, cores, nodes), deeper and more complex memory hierarchies (register, cache, disk, network), mixed hardware sets (CPUs and GPUs) and increasing scale (tens or hundreds of thousands of processing elements). Assessing the performance of complex scientific applications on specialised high-performance computing architectures is difficult. In many cases, traditional computer benchmarking is insufficient as it typically requires access to physical machines of equivalent (or similar) specification and rarely relates to the potential capability of an application. A technique known as application performance modelling addresses many of these additional requirements. Modelling allows future architectures and/or applications to be explored in a mathematical or simulated setting, thus enabling hypothetical questions relating to the configuration of a potential future architecture to be assessed in terms of its impact on key scientific codes. This paper describes the Warwick Performance Prediction (WARPP) simulator, which is used to construct application performance models for complex industry-strength parallel scientific codes executing on thousands of processing cores. The capability and accuracy of the simulator is demonstrated through its application to a scientific benchmark developed by the United Kingdom Atomic Weapons Establishment (AWE). The results of the simulations are validated for two different HPC architectures, each case demonstrating a greater than 90% accuracy for run-time prediction. Simulation results, collected from runs on a standard PC, are provided for up to 65,000 processor cores. It is also shown how the addition of operating system jitter to the simulator can improve the quality of the application performance model results.
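
To make the modelling idea concrete, here is a toy analytic model of the kind such simulators evaluate. The structure and every parameter value below are illustrative assumptions of mine; WARPP itself replays far richer compute and communication event streams.

#include <cstdio>

// Toy machine description; all numbers below are hypothetical.
struct Machine {
    double flops_per_core;   // sustained FLOP/s per core
    double latency;          // network latency per message (s)
    double bandwidth;        // network bandwidth (bytes/s)
};

// Predicted time for one iteration: local compute plus a halo exchange,
// assuming no compute/communication overlap.
double iteration_time(const Machine& m, double flops_per_rank,
                      double halo_bytes, int messages) {
    double t_compute = flops_per_rank / m.flops_per_core;
    double t_comm    = messages * m.latency + halo_bytes / m.bandwidth;
    return t_compute + t_comm;
}

int main() {
    Machine node{2.0e9, 6.0e-6, 1.5e9};                     // hypothetical figures
    std::printf("predicted iteration time: %.4f s\n",
                iteration_time(node, 5.0e8, 4.0e6, 8));
}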


The Computer Journal | 2012

Performance Analysis and Optimization of the OP2 Framework on Many-Core Architectures

Michael B. Giles; Gihan R. Mudalige; Z. Sharif; Graham Markall; Paul H. J. Kelly

This paper presents a benchmarking, performance analysis and optimization study of the OP2 ‘active’ library, which provides an abstraction framework for the parallel execution of unstructured mesh applications. OP2 aims to decouple the scientific specification of the application from its parallel implementation, and thereby achieve code longevity and near-optimal performance through re-targeting the application to execute on different multi-core/many-core hardware. Runtime performance results are presented for a representative unstructured mesh application on a variety of many-core processor systems, including traditional x86 architectures from Intel (Xeon based on the older Penryn and current Nehalem micro-architectures) and GPU offerings from NVIDIA (GTX260, Tesla C2050). Our analysis demonstrates the contrasting performance between the use of CPU (OpenMP) and GPU (CUDA) parallel implementations for the solution of an industrial-sized unstructured mesh consisting of about 1.5 million edges. Results show the significance of choosing the correct partition and thread-block configuration, the factors limiting the GPU performance and insights into optimizations for improved performance.
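
For readers unfamiliar with OP2's programming model, the abridged sketch below illustrates the decoupling described above: the user writes a per-element kernel and declares how each dataset is accessed, and the library generates the OpenMP or CUDA execution. Signatures are simplified from OP2's C/C++ API and may not match it exactly.

#include "op_seq.h"   // OP2 header (sequential reference backend)

// User kernel: per-edge computation, written with no reference to any
// particular parallel backend (the body is a placeholder).
void res_calc(const double* x1, const double* x2, double* res) {
    *res += 0.5 * (x1[0] - x2[0]);
}

// Application code: the loop over the edge set is declared together with
// access descriptors; OP2 turns this into backend-specific parallel code.
void compute(op_set edges, op_map edge_to_node, op_dat x, op_dat res) {
    op_par_loop(res_calc, "res_calc", edges,
                op_arg_dat(x,   0, edge_to_node, 2, "double", OP_READ),   // node 0 of edge
                op_arg_dat(x,   1, edge_to_node, 2, "double", OP_READ),   // node 1 of edge
                op_arg_dat(res, -1, OP_ID,       1, "double", OP_INC));   // direct, per-edge
}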


International Parallel and Distributed Processing Symposium | 2008

A plug-and-play model for evaluating wavefront computations on parallel architectures

Gihan R. Mudalige; Mary K. Vernon; Stephen A. Jarvis

This paper develops a plug-and-play reusable LogGP model that can be used to predict the runtime and scaling behavior of different MPI-based pipelined wavefront applications running on modern parallel platforms with multi-core nodes. A key new feature of the model is that it requires only a few simple input parameters to project performance for wavefront codes that differ in the structure of the sweeps in each iteration, as well as in behavior during each wavefront computation and/or between iterations. We apply the model to three key benchmark applications that are used in high performance computing procurement, illustrating that the model parameters yield insight into the key differences among the codes. We also develop new, simple and highly accurate models of MPI send, receive, and group communication primitives on the dual-core Cray XT system. We validate the reusable model applied to each benchmark on up to 8192 processors on the XT3/XT4. Results show excellent accuracy for all high performance application and platform configurations that we were able to measure. Finally we use the model to assess application and hardware configurations, develop new metrics for procurement and configuration, identify bottlenecks, and assess new application design modifications that, to our knowledge, have not previously been explored.
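
For reference, the LogGP cost of a single point-to-point message, from which models of this kind are assembled (the textbook form, not reproduced from the paper; L is the latency, o the per-message CPU overhead, and G the gap per byte):

% LogGP cost of one k-byte message (textbook form, an assumption here).
T_{\mathrm{msg}}(k) = 2o + L + (k-1)\,G

% A pipelined wavefront over a P_x \times P_y processor grid must fill a
% pipeline of P_x + P_y - 2 stages before the last processor starts, so a
% single sweep costs roughly (T_block is one processor's compute block):
T_{\mathrm{sweep}} \approx (P_x + P_y - 2)\,\bigl(T_{\mathrm{block}} + T_{\mathrm{msg}}(k)\bigr)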


IEEE Transactions on Parallel and Distributed Systems | 2016

Acceleration of a Full-Scale Industrial CFD Application with OP2

I. Z. Reguly; Gihan R. Mudalige; Carlo Bertolli; Michael B. Giles; Adam Betts; Paul H. J. Kelly; David Radford

Hydra is a full-scale industrial CFD application used for the design of turbomachinery at Rolls-Royce plc, capable of performing complex simulations over highly detailed unstructured mesh geometries. Hydra presents major challenges in data organization and movement that need to be overcome for continued high performance on emerging platforms. We present research in achieving this goal through the OP2 domain-specific high-level framework, demonstrating the viability of such a high-level programming approach. OP2 targets the domain of unstructured mesh problems and enables execution on a range of back-end hardware platforms. We chart the conversion of Hydra to OP2, and map out the key difficulties encountered in the process. Specifically we show how different parallel implementations can be achieved with an active library framework, even for a highly complicated industrial application, and how different optimizations targeting contrasting parallel architectures can be applied to the whole application, seamlessly, reducing developer effort and increasing code longevity. Performance results demonstrate not only that the runtime performance of the hand-tuned original code can be matched, but that it can be significantly improved upon on both conventional processor systems and many-core systems. Our results provide evidence of how high-level frameworks such as OP2 enable portability across a wide range of contrasting platforms, and of their significant utility in achieving high performance without the intervention of the application programmer.


Parallel Computing | 2013

Design and initial performance of a high-level unstructured mesh framework on heterogeneous parallel systems

Gihan R. Mudalige; Michael B. Giles; Jeyarajan Thiyagalingam; I. Z. Reguly; Carlo Bertolli; Paul H. J. Kelly; Anne E. Trefethen

Highlights:
- Discuss the main design issues in parallelizing unstructured mesh applications.
- Present OP2 for developing applications for heterogeneous parallel systems.
- Analyze the performance gained with OP2 for two industrial-representative benchmarks.
- Compare runtime, scaling and runtime break-downs of the applications.
- Present energy consumption of OP2 applications on CPU and GPU clusters.

OP2 is a high-level domain specific library framework for the solution of unstructured mesh-based applications. It utilizes source-to-source translation and compilation so that a single application code written using the OP2 API can be transformed into multiple parallel implementations for execution on a range of back-end hardware platforms. In this paper we present the design and performance of OP2's recent developments facilitating code generation and execution on distributed memory heterogeneous systems. OP2 targets the solution of numerical problems based on static unstructured meshes. We discuss the main design issues in parallelizing this class of applications. These include handling data dependencies in accessing indirectly referenced data and design considerations in generating code for execution on a cluster of multi-threaded CPUs and GPUs. Two representative CFD applications, written using the OP2 framework, are utilized to provide a contrasting benchmarking and performance analysis study on a number of heterogeneous systems including a large scale Cray XE6 system and a large GPU cluster. A range of performance metrics are benchmarked including runtime, scalability, achieved compute and bandwidth performance, runtime bottlenecks and systems energy consumption. We demonstrate that an application written once at a high-level using OP2 is easily portable across a wide range of contrasting platforms and is capable of achieving near-optimal performance without the intervention of the domain application programmer.
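
The data-dependency issue mentioned in the abstract is worth unpacking: two edges that update the same node must not execute concurrently. A standard remedy in this setting is colouring: assign edges colours so that no two edges of one colour share a node, then run one parallel pass per colour. The greedy sketch below is my illustration, not OP2 code.

#include <utility>
#include <vector>

// Greedily colour edges so that no two edges of the same colour touch a
// common node; edges within one colour can then be processed in parallel
// without atomics, at the cost of one synchronisation per colour.
std::vector<int> colour_edges(const std::vector<std::pair<int, int>>& edges,
                              int nnodes) {
    std::vector<int> colour(edges.size(), -1);
    std::vector<std::vector<bool>> used;   // used[c][n]: node n taken by colour c
    for (std::size_t e = 0; e < edges.size(); ++e) {
        for (int c = 0; ; ++c) {
            if (c == (int)used.size())
                used.emplace_back(nnodes, false);
            if (!used[c][edges[e].first] && !used[c][edges[e].second]) {
                used[c][edges[e].first] = used[c][edges[e].second] = true;
                colour[e] = c;
                break;
            }
        }
    }
    return colour;
}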


The Computer Journal | 2012

On the Acceleration of Wavefront Applications using Distributed Many-Core Architectures

Simon J. Pennycook; Simon D. Hammond; Gihan R. Mudalige; Steven A. Wright; Stephen A. Jarvis

In this paper we investigate the use of distributed graphics processing unit (GPU)-based architectures to accelerate pipelined wavefront applications—a ubiquitous class of parallel algorithms used for the solution of a number of scientific and engineering applications. Specifically, we employ a recently developed port of the LU solver (from the NAS Parallel Benchmark suite) to investigate the performance of these algorithms on high-performance computing solutions from NVIDIA (Tesla C1060 and C2050) as well as on traditional clusters (AMD/InfiniBand and IBM BlueGene/P). Benchmark results are presented for problem classes A to C and a recently developed performance model is used to provide projections for problem classes D and E, the latter of which represents a billion-cell problem. Our results demonstrate that while the theoretical performance of GPU solutions will far exceed those of many traditional technologies, the sustained application performance is currently comparable for scientific wavefront applications. Finally, a breakdown of the GPU solution is conducted, exposing PCIe overheads and decomposition constraints. A new k-blocking strategy is proposed to improve the future performance of this class of algorithm on GPU-based architectures.


Concurrency and Computation: Practice and Experience | 2016

Vectorizing unstructured mesh computations for many-core architectures

I. Z. Reguly; Endre László; Gihan R. Mudalige; Michael B. Giles

Achieving optimal performance on the latest multi-core and many-core architectures increasingly depends on making efficient use of the hardware's vector units. This paper presents results on achieving high performance through vectorization on CPUs and the Xeon Phi on a key class of irregular applications: unstructured mesh computations. Using single instruction multiple thread (SIMT) and single instruction multiple data (SIMD) programming models, we show how unstructured mesh computations map to OpenCL or vector intrinsics through the use of code generation techniques in the OP2 Domain Specific Library, and explore how irregular memory accesses and race conditions can be organized on different hardware. We benchmark Intel Xeon CPUs and the Xeon Phi, using a tsunami simulation and a representative CFD benchmark. Results are compared with previous work on CPUs and NVIDIA GPUs to provide a comparison of achievable performance on current many-core systems. We show that auto-vectorization and the OpenCL SIMT model do not map efficiently to CPU vector units because of vectorization issues and threading overheads. In contrast, using SIMD vector intrinsics imposes some restrictions and requires more involved programming techniques, but results in efficient code and near-optimal performance, two times faster than non-vectorized code. We observe that the Xeon Phi does not provide good performance for these applications, but is still comparable with a pair of mid-range Xeon chips.
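
To illustrate what SIMD vector intrinsics mean for this application class, the sketch below (my illustration, not OP2's generated code) processes four mesh edges per AVX2 vector: node values are gathered through the index array, and results are scattered serially, both because AVX2 lacks a scatter instruction and because two lanes may target the same node, which is the race condition discussed above.

#include <immintrin.h>

// Gather/compute/scatter over mesh edges with AVX2; the flux computation
// is a placeholder. Requires a CPU and compiler flags supporting AVX2.
void edge_kernel_simd(const double* node_val, double* node_res,
                      const int* edge_node, int nedges) {
    int e = 0;
    for (; e + 4 <= nedges; e += 4) {
        __m128i idx = _mm_loadu_si128((const __m128i*)&edge_node[e]);
        __m256d v   = _mm256_i32gather_pd(node_val, idx, 8);  // gather 4 node values
        __m256d r   = _mm256_mul_pd(v, _mm256_set1_pd(0.5));  // placeholder flux
        double tmp[4];
        _mm256_storeu_pd(tmp, r);
        for (int k = 0; k < 4; ++k)        // serial scatter avoids write conflicts
            node_res[edge_node[e + k]] += tmp[k];
    }
    for (; e < nedges; ++e)                // scalar remainder loop
        node_res[edge_node[e]] += 0.5 * node_val[edge_node[e]];
}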


IEEE International Conference on High Performance Computing, Data and Analytics | 2014

The OPS domain specific abstraction for multi-block structured grid computations

I. Z. Reguly; Gihan R. Mudalige; Michael B. Giles; Dan Curran; Simon N McIntosh-Smith

Code maintainability, performance portability and future proofing are some of the key challenges in this era of rapid change in High Performance Computing. Domain Specific Languages and Active Libraries address these challenges by focusing on a single application domain and providing a high-level programming approach, and then subsequently using domain knowledge to deliver high performance on various hardware. In this paper, we introduce the OPS high-level abstraction and active library aimed at multi-block structured grid computations, and discuss some of its key design points; we demonstrate how OPS can be embedded in C/C++ and the API made to look like a traditional library, and how through a combination of simple text manipulation and back-end logic we can enable execution on a diverse range of hardware using different parallel programming approaches. Relying on the access-execute description of the OPS abstraction, we introduce a number of automated execution techniques that enable distributed memory parallelization, optimization of communication patterns, checkpointing and cache-blocking. Using performance results from CloverLeaf from the Mantevo suite of benchmarks, we demonstrate the utility of OPS.
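
The access-execute separation at the heart of OPS can be captured in a few lines: the application states what happens at each grid point, while the library owns how the iteration space is traversed, which is what allows the same call to be backed by MPI, OpenMP, CUDA or a cache-blocked schedule. The toy below is my illustration of the principle, not the OPS API.

#include <functional>
#include <vector>

// "Library" side: owns the traversal. An OPS-like implementation would
// substitute tiled, threaded or offloaded variants of this loop nest.
void par_loop(const int range[4], const std::function<void(int, int)>& kernel) {
    for (int j = range[2]; j < range[3]; ++j)
        for (int i = range[0]; i < range[1]; ++i)
            kernel(i, j);
}

int main() {
    const int nx = 64, ny = 64;
    std::vector<double> u(nx * ny, 1.0), u_new(nx * ny, 0.0);
    const int interior[4] = {1, nx - 1, 1, ny - 1};
    // "Application" side: a per-gridpoint kernel, here a 5-point stencil.
    par_loop(interior, [&](int i, int j) {
        u_new[j * nx + i] = 0.25 * (u[j * nx + i - 1] + u[j * nx + i + 1] +
                                    u[(j - 1) * nx + i] + u[(j + 1) * nx + i]);
    });
}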


IEEE International Conference on High Performance Computing, Data and Analytics | 2012

An Analytical Study of Loop Tiling for a Large-Scale Unstructured Mesh Application

Michael B. Giles; Gihan R. Mudalige; Carlo Bertolli; Paul H. J. Kelly; Endre László; I. Z. Reguly

Increasingly, the main bottleneck limiting performance on emerging multi-core and many-core processors is the movement of data between their different cores and main memory. As the number of cores increases, more and more data needs to be exchanged with memory to keep them fully utilized. This critical bottleneck is already limiting the utility of processors and our ability to leverage increased parallelism to achieve higher performance. On the other hand, considerable computer science research exists on tiling techniques (also known as sparse tiling) for reducing data transfers. Such work demonstrates how the increasing memory bottleneck could be avoided, but the difficulty has been in extending these ideas to real-world applications. The resulting algorithms quickly become highly complicated, and it has been very difficult for a compiler to automatically detect the opportunities and implement the execution strategy. Focusing on the unstructured mesh application class, in this paper we present a preliminary analytical investigation into the performance benefits of tiling (or loop-blocking) algorithms on a real-world industrial CFD application. We analytically estimate the reductions in communications or memory accesses for the main parallel loops in this application and predict quantitatively the performance benefits that can be gained on modern multi-core and many-core hardware. The analysis demonstrates that in general a factor of four reduction in data movement can be achieved by tiling parallel loops. A major part of the savings comes from contraction of temporary or transient data arrays that need not be written back to main memory, by holding them in the last level cache (LLC) of modern processors.
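
The contraction of temporary arrays that drives much of the predicted saving is easiest to see on a simplified structured 1D example; the paper's sparse-tiling setting for unstructured meshes is considerably more involved. Untiled, a full-size temporary streams through main memory between the two loops; tiled, each block of it stays resident in cache.

#include <algorithm>
#include <vector>

// Two dependent loops, fused into tiles of size B. The temporary t is
// block-sized rather than n-sized, so it never leaves the cache between
// the producing and consuming loop of a tile.
void two_loops_tiled(const std::vector<double>& a, std::vector<double>& b, int B) {
    const int n = (int)a.size();
    std::vector<double> t(B + 1);                  // contracted temporary
    for (int start = 0; start + 1 < n; start += B) {
        const int end = std::min(start + B, n - 1);
        for (int i = start; i <= end; ++i)         // loop 1: produce t for this tile
            t[i - start] = 2.0 * a[i];
        for (int i = start; i < end; ++i)          // loop 2: consume t immediately
            b[i] = t[i - start] + t[i - start + 1];
    }
}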

Collaboration


Dive into Gihan R. Mudalige's collaborations.

Top Co-Authors

I. Z. Reguly
Pázmány Péter Catholic University

Simon D. Hammond
Sandia National Laboratories

J. A. Herdman
Atomic Weapons Establishment

Endre László
Pázmány Péter Catholic University

Adam Betts
Imperial College London