Publications


Featured research published by Michele Weiland.


International Supercomputing Conference | 2013

Achieving Efficient Strong Scaling with PETSc Using Hybrid MPI/OpenMP Optimisation

Gerard J. Gorman; Michele Weiland; Lawrence Mitchell; James Southern

The increasing number of processing elements and decreasing memory to core ratio in modern high-performance platforms makes efficient strong scaling a key requirement for numerical algorithms. In order to achieve efficient scalability on massively parallel systems scientific software must evolve across the entire stack to exploit the multiple levels of parallelism exposed in modern architectures. In this paper we demonstrate the use of hybrid MPI/OpenMP parallelisation to optimise parallel sparse matrix-vector multiplication in PETSc, a widely used scientific library for the scalable solution of partial differential equations. Using large matrices generated by Fluidity, an open source CFD application code which uses PETSc as its linear solver engine, we evaluate the effect of explicit communication overlap using task-based parallelism and show how to further improve performance by explicitly load balancing threads within MPI processes. We demonstrate a significant speedup over the pure-MPI mode and efficient strong scaling of sparse matrix-vector multiplication on Fujitsu PRIMEHPC FX10 and Cray XE6 systems.
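
A minimal sketch of the hybrid MPI/OpenMP pattern described above, written in plain C rather than taken from PETSc: each MPI rank owns a block of rows of a toy 1D Laplacian in CSR format, OpenMP threads share the local row loop, and a replicated input vector stands in for the halo exchange and the task-based communication overlap a real distributed solver would need. All names and sizes here are illustrative assumptions.

```c
/* Minimal sketch of hybrid MPI/OpenMP sparse matrix-vector multiplication.
 * Illustrative only: this is not the PETSc kernel evaluated in the paper.
 * Each rank owns a contiguous block of rows in CSR format; OpenMP threads
 * share the local multiply, and a dense, replicated input vector stands in
 * for the halo exchange a real distributed code would perform. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

/* y = A*x for the locally owned rows, threads splitting the row loop */
static void local_spmv(int nrows, const int *rowptr, const int *colidx,
                       const double *vals, const double *x, double *y)
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < nrows; ++i) {
        double sum = 0.0;
        for (int j = rowptr[i]; j < rowptr[i + 1]; ++j)
            sum += vals[j] * x[colidx[j]];
        y[i] = sum;
    }
}

int main(int argc, char **argv)
{
    int provided, rank, size;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Toy problem: global tridiagonal matrix of dimension n, block row-distributed. */
    const int n = 1000;
    int nlocal = n / size + (rank < n % size ? 1 : 0);
    int first  = rank * (n / size) + (rank < n % size ? rank : n % size);

    int *rowptr  = malloc((nlocal + 1) * sizeof *rowptr);
    int *colidx  = malloc(3 * nlocal * sizeof *colidx);
    double *vals = malloc(3 * nlocal * sizeof *vals);
    double *x    = malloc(n * sizeof *x);       /* replicated input vector */
    double *y    = malloc(nlocal * sizeof *y);

    for (int i = 0; i < n; ++i) x[i] = 1.0;

    int nnz = 0;
    rowptr[0] = 0;
    for (int i = 0; i < nlocal; ++i) {
        int g = first + i;
        if (g > 0) { colidx[nnz] = g - 1; vals[nnz++] = -1.0; }
        colidx[nnz] = g;
        vals[nnz++] = 2.0;
        if (g < n - 1) { colidx[nnz] = g + 1; vals[nnz++] = -1.0; }
        rowptr[i + 1] = nnz;
    }

    local_spmv(nlocal, rowptr, colidx, vals, x, y);

    /* Check: the 1D Laplacian applied to a vector of ones sums to 2 globally. */
    double local = 0.0, global = 0.0;
    for (int i = 0; i < nlocal; ++i) local += y[i];
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum(y) = %.1f with %d ranks x %d threads\n",
               global, size, omp_get_max_threads());

    free(rowptr); free(colidx); free(vals); free(x); free(y);
    MPI_Finalize();
    return 0;
}
```

Build with an MPI compiler wrapper and OpenMP enabled, for example mpicc -fopenmp; the number of threads per rank is then controlled with OMP_NUM_THREADS, which is the knob the hybrid mode trades off against the number of MPI processes.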


Concurrency and Computation: Practice and Experience | 2018

In situ data analytics for highly scalable cloud modelling on Cray machines

Nicholas J. Brown; Michele Weiland; Adrian Hill; Ben Shipway

MONC is a highly scalable modelling tool for the investigation of atmospheric flows, turbulence, and cloud microphysics. Typical simulations produce very large amounts of raw data, which must then be analysed for scientific investigation. For performance and scalability reasons, this analysis and subsequent writing to disk should be performed in situ on the data as it is generated; however, one does not wish to pause the computation whilst analysis is carried out. In this paper, we present the analytics approach of MONC, where cores of a node are shared between computation and data analytics. By asynchronously sending their data to an analytics core, the computational cores can run continuously without having to pause for data writing or analysis. We describe our IO server framework and analytics workflow, which is highly asynchronous, along with solutions to challenges that this approach raises and the performance implications of some common configuration choices. The result of this work is a highly scalable analytics approach, and we illustrate on up to 32 768 computational cores of a Cray XC30 that there is minimal performance impact on the runtime when enabling data analytics in MONC and also investigate the performance and suitability of our approach on the KNL.
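
The general pattern of sharing a node between computation and analytics can be illustrated with a small sketch. This is not MONC's IO server framework; it is an assumed toy setup in which the last MPI rank plays the analytics core, compute ranks post non-blocking sends of their raw field data and carry on computing into a second buffer, and the analytics rank performs a simple reduction.

```c
/* Sketch of the general pattern: compute ranks ship raw field data to a
 * dedicated analytics rank with non-blocking sends and keep computing.
 * Illustration of the idea only, not MONC's IO server framework; the tags,
 * buffer layout and the reduction performed are assumptions. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define FIELD_TAG 42
#define NSTEPS    5
#define NLOCAL    1024

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) {
        if (rank == 0) fprintf(stderr, "run with at least two MPI ranks\n");
        MPI_Finalize();
        return 1;
    }
    int analytics = size - 1;              /* last rank acts as the analytics core */

    if (rank == analytics) {
        /* Analytics rank: receive one buffer per compute rank per step and reduce it. */
        double *buf = malloc(NLOCAL * sizeof *buf);
        for (int step = 0; step < NSTEPS; ++step) {
            double mean = 0.0;
            for (int src = 0; src < size - 1; ++src) {
                MPI_Recv(buf, NLOCAL, MPI_DOUBLE, src, FIELD_TAG,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                for (int i = 0; i < NLOCAL; ++i) mean += buf[i];
            }
            mean /= (double)NLOCAL * (size - 1);
            printf("step %d: field mean = %f\n", step, mean);
        }
        free(buf);
    } else {
        /* Compute ranks: double-buffer so the send of one step overlaps the next. */
        double *field[2];
        field[0] = malloc(NLOCAL * sizeof *field[0]);
        field[1] = malloc(NLOCAL * sizeof *field[1]);
        MPI_Request req = MPI_REQUEST_NULL;
        for (int step = 0; step < NSTEPS; ++step) {
            double *cur = field[step % 2];
            for (int i = 0; i < NLOCAL; ++i)      /* stand-in for the real computation */
                cur[i] = rank + step + 0.001 * i;
            MPI_Wait(&req, MPI_STATUS_IGNORE);    /* complete the send posted last step */
            MPI_Isend(cur, NLOCAL, MPI_DOUBLE, analytics, FIELD_TAG,
                      MPI_COMM_WORLD, &req);
        }
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        free(field[0]); free(field[1]);
    }

    MPI_Finalize();
    return 0;
}
```

Run with at least two MPI ranks, e.g. mpirun -n 4 ./a.out; the compute ranks only ever block on a send that has already had a full timestep to complete, which is the overlap the in situ approach relies on.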


Proceedings of the Second Workshop on Accelerator Programming using Directives | 2015

A directive based hybrid Met Office NERC cloud model

Nicholas Brown; Angus Lepper; Michele Weiland; Adrian Hill; Ben Shipway; Chris Maynard

Large Eddy Simulation is a critical modelling tool for the investigation of atmospheric flows, turbulence and cloud microphysics. The models used by the UK atmospheric research community are homogeneous and the latest model, MONC, is designed to run on substantial HPC systems with very high CPU core counts. In order to future-proof these codes it is worth investigating other technologies and architectures which might support the community in running its codes at the exascale. In this paper we present a hybrid version of MONC, where the most computationally intensive aspect is offloaded to the GPU while the rest of the functionality runs concurrently on the CPU. Developed using the directive-driven OpenACC, we consider the suitability and maturity of this technology for modern Fortran scientific codes, as well as general software engineering techniques which aid this type of porting work. The performance of our hybrid model at scale is compared against the CPU version before considering other tuning options and comparing the energy usage of the homogeneous and heterogeneous versions. The result of this work is a promising hybrid model that shows the performance benefits of our approach when the GPU has a significant computational workload; the approach can be applied not only to the MONC model but also to other weather and climate simulations used by the community.
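
MONC itself is a Fortran code; purely to illustrate the directive-based offload style described above, the sketch below annotates a stand-in 3D stencil loop in C with OpenACC, offloading the intensive loop nest to the accelerator asynchronously so the host could continue with other work. The kernel, sizes and clauses are assumptions, not taken from the paper.

```c
/* Minimal sketch of directive-based GPU offload with OpenACC.
 * Illustrative only: this is not the MONC kernel ported in the paper.
 * The "intensive" loop is a toy 3D stencil-style update; the async clause
 * lets the host continue with other work while the device computes. */
#include <stdio.h>
#include <stdlib.h>

#define NX 128
#define NY 128
#define NZ 128
#define IDX(i, j, k) ((size_t)(i) * NY * NZ + (size_t)(j) * NZ + (k))

int main(void)
{
    size_t n = (size_t)NX * NY * NZ;
    double *in  = malloc(n * sizeof *in);
    double *out = malloc(n * sizeof *out);
    for (size_t i = 0; i < n; ++i) { in[i] = (double)(i % 7); out[i] = 0.0; }

    /* Offload the computationally intensive part to the accelerator;
     * copyin/copy manage the device copies of the two arrays. */
    #pragma acc parallel loop collapse(3) copyin(in[0:n]) copy(out[0:n]) async(1)
    for (int i = 1; i < NX - 1; ++i)
        for (int j = 1; j < NY - 1; ++j)
            for (int k = 1; k < NZ - 1; ++k)
                out[IDX(i, j, k)] = (in[IDX(i - 1, j, k)] + in[IDX(i + 1, j, k)] +
                                     in[IDX(i, j - 1, k)] + in[IDX(i, j + 1, k)] +
                                     in[IDX(i, j, k - 1)] + in[IDX(i, j, k + 1)]) / 6.0;

    /* ... the host could run unrelated CPU work here, concurrently ... */

    #pragma acc wait(1)   /* ensure the device kernel and the copy back have finished */
    printf("out[centre] = %f\n", out[IDX(NX / 2, NY / 2, NZ / 2)]);

    free(in);
    free(out);
    return 0;
}
```

Compiled with an OpenACC-capable compiler (for example nvc -acc) the loop runs on the GPU; with an ordinary C compiler the pragmas are ignored and the same loop runs on the host, which is part of what makes the directive approach attractive for incremental porting.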


Computer Science - Research and Development | 2015

Benchmarking for power consumption monitoring

Michele Weiland; Nicholas Johnson

This paper presents a set of benchmarks that are designed to measure power consumption in parallel systems. The benchmarks range from low-level, single instructions or operations, to small kernels. In addition to describing the motivation behind developing the benchmarks and the design principles that were followed, the paper also introduces a metric to quantify the power-performance of a parallel system. Initial results are presented and help to illustrate the contribution of the paper.
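
The specific metric defined in the paper is not reproduced here, but the kind of quantity it motivates can be illustrated with a short sketch that computes two commonly used figures of merit, energy to solution and the energy-delay product, from a measured runtime and an average power draw. All numbers in the sketch are invented.

```c
/* Illustrative sketch of simple power-performance figures of merit.
 * Not the metric defined in the paper: it just computes energy to solution
 * and the energy-delay product from a runtime and an average power draw.
 * The measurements below are made up for the example. */
#include <stdio.h>

struct run_measurement {
    const char *system;
    double runtime_s;     /* time to solution, seconds */
    double avg_power_w;   /* average node power during the run, watts */
};

int main(void)
{
    /* Hypothetical measurements for the same benchmark on two systems. */
    struct run_measurement runs[] = {
        { "system A", 120.0, 310.0 },
        { "system B",  95.0, 420.0 },
    };

    for (int i = 0; i < 2; ++i) {
        double energy_j = runs[i].runtime_s * runs[i].avg_power_w;   /* E = P * t */
        double edp      = energy_j * runs[i].runtime_s;              /* energy-delay product */
        printf("%s: time %.1f s, energy %.0f J, EDP %.2e J*s\n",
               runs[i].system, runs[i].runtime_s, energy_j, edp);
    }
    /* Lower energy to solution favours efficiency; the EDP additionally
     * penalises slow runs, trading power off against performance. */
    return 0;
}
```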


Concurrency and Computation: Practice and Experience | 2018

Leveraging MPI RMA to optimize halo-swapping communications in MONC on Cray machines

Nicholas J. Brown; Michael R. Bareford; Michele Weiland

Remote Memory Access (RMA), also known as single‐sided communications, provides a way for reading and writing directly into the memory of other processes without having to issue explicit message passing style communication calls. Previous studies have concluded that MPI RMA can provide increased communication performance over traditional MPI Point to Point (P2P), but these are based on synthetic benchmarks rather than real‐world codes. In this work, we replace the existing non‐blocking P2P communication calls in the Met Office NERC Cloud model, a mature code for modeling the atmosphere, with MPI RMA. We describe our approach in detail and discuss the options taken for correctness and performance. Experiments are performed on ARCHER, a Cray XC30, and Cirrus, an SGI ICE machine. We demonstrate on ARCHER that, by using RMA, we can obtain between a 5% and 10% reduction in communication time at each timestep on up to 32768 cores, which over the entirety of a run (with many timesteps) results in a significant improvement in performance compared to P2P on the Cray. However, RMA is not a silver bullet, and there are challenges when integrating RMA calls into existing codes: important optimizations are necessary to achieve good performance and library support is not universally mature, as is the case on Cirrus. In this paper, we discuss, in the context of a real‐world code, the lessons learned converting P2P to RMA, explore performance and scaling challenges, and contrast alternative RMA synchronization approaches in detail.
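
A minimal sketch of a halo swap written with MPI RMA, assuming a 1D decomposition with one halo cell on each side. It is not the MONC halo-swapping code and it uses MPI_Win_fence, only one of the synchronisation options the paper contrasts, but it shows the basic shape: expose the local array as a window and let neighbours write their edge values straight into the halo slots with MPI_Put.

```c
/* Minimal sketch of a halo swap done with MPI RMA instead of point-to-point.
 * Illustrative only (1D decomposition, one halo cell per side), not the MONC
 * code; synchronisation is done with MPI_Win_fence. */
#include <mpi.h>
#include <stdio.h>

#define NLOCAL 8   /* interior cells per rank */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* field[0] and field[NLOCAL+1] are halo cells filled by the neighbours. */
    double field[NLOCAL + 2];
    for (int i = 1; i <= NLOCAL; ++i) field[i] = rank * 100.0 + i;
    field[0] = field[NLOCAL + 1] = -1.0;

    MPI_Win win;
    MPI_Win_create(field, sizeof field, sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    MPI_Win_fence(0, win);
    /* Put my first interior cell into the right halo slot of my left neighbour,
     * and my last interior cell into the left halo slot of my right neighbour. */
    MPI_Put(&field[1],      1, MPI_DOUBLE, left,  NLOCAL + 1, 1, MPI_DOUBLE, win);
    MPI_Put(&field[NLOCAL], 1, MPI_DOUBLE, right, 0,          1, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);   /* halos are now valid on every rank */

    printf("rank %d: left halo %.1f, right halo %.1f\n",
           rank, field[0], field[NLOCAL + 1]);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```

Run with two or more ranks; the outer halos of the end ranks have MPI_PROC_NULL neighbours and keep their initial value of -1.0. In the point-to-point equivalent each rank would instead post matching non-blocking sends and receives, which is exactly the exchange the paper replaces.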


Irregular Applications: Architectures and Algorithms | 2017

Progressive load balancing of asynchronous algorithms

Justs Zarins; Michele Weiland

Synchronisation in the presence of noise and hardware performance variability is a key challenge that prevents applications from scaling to large problems and machines. Using asynchronous or semi-synchronous algorithms can help overcome this issue, but at the cost of reduced stability or convergence rate. In this paper we propose progressive load balancing to manage progress imbalance in asynchronous algorithms dynamically. In our technique the balancing is done over time, not instantaneously. Using Jacobi iterations as a test case, we show that, with CPU performance variability present, this approach leads to higher iteration rate and lower progress imbalance between parts of the solution space. We also show that under these conditions the balanced asynchronous method outperforms synchronous, semi-synchronous and totally asynchronous implementations in terms of time to solution.
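
The asynchronous Jacobi setting used as the test case can be sketched in a few lines of C with OpenMP: threads relax disjoint blocks of a 1D Laplace problem with no barrier between sweeps, reading whatever values their neighbours have most recently published, and each thread keeps a progress counter. The progressive load balancing scheme itself is not reproduced here; the per-thread counters are simply the progress measure such a scheme would act on, and all sizes and the run duration are assumptions.

```c
/* Sketch of asynchronous Jacobi with per-thread progress counters.
 * Illustrative only: the progressive load balancing scheme from the paper is
 * not implemented, this just exposes the progress imbalance it would manage. */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N 4096

int main(void)
{
    static double u[N + 2], unew[N + 2];    /* u[0] and u[N+1] are fixed boundaries */
    u[0] = 0.0;
    u[N + 1] = 1.0;

    int maxt = omp_get_max_threads();
    long *progress = calloc(maxt, sizeof *progress);
    int nthreads = 1;

    #pragma omp parallel shared(u, unew, progress, nthreads)
    {
        #pragma omp single
        nthreads = omp_get_num_threads();

        int t     = omp_get_thread_num();
        int nt    = omp_get_num_threads();
        int block = N / nt;
        int lo    = 1 + t * block;
        int hi    = (t == nt - 1) ? N : lo + block - 1;

        double deadline = omp_get_wtime() + 1.0;   /* run each thread for ~1 second */
        long sweeps = 0;

        while (omp_get_wtime() < deadline) {
            /* No barrier: the two cells owned by neighbouring blocks may come
             * from an older or newer sweep than our own, which is the defining
             * property of the asynchronous scheme. They are accessed atomically. */
            double left, right;
            #pragma omp atomic read
            left = u[lo - 1];
            #pragma omp atomic read
            right = u[hi + 1];

            unew[lo] = 0.5 * (left + u[lo + 1]);
            for (int i = lo + 1; i < hi; ++i)
                unew[i] = 0.5 * (u[i - 1] + u[i + 1]);
            unew[hi] = 0.5 * (u[hi - 1] + right);

            for (int i = lo + 1; i < hi; ++i)
                u[i] = unew[i];
            #pragma omp atomic write
            u[lo] = unew[lo];
            #pragma omp atomic write
            u[hi] = unew[hi];

            progress[t] = ++sweeps;
        }
    }

    long min = progress[0], max = progress[0];
    for (int t = 1; t < nthreads; ++t) {
        if (progress[t] < min) min = progress[t];
        if (progress[t] > max) max = progress[t];
    }
    printf("u[N/2] = %f, sweeps per thread: min %ld, max %ld (imbalance %ld)\n",
           u[N / 2], min, max, max - min);
    free(progress);
    return 0;
}
```

Under performance variability (for example, other processes sharing the cores) the max-min gap grows, and rebalancing it gradually over time rather than instantaneously is the idea behind progressive load balancing.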


Proceedings of the Exascale Applications and Software Conference 2016 | 2016

On the trade-offs between energy to solution and runtime for real-world CFD test-cases

Michael R. Bareford; Nicholas Johnson; Michele Weiland

This paper provides insight into the optimisation of runtime and energy performance for two widely used CFD codes. Energy efficiency is a hot topic in HPC and methods to reduce the energy consumption of large machines are an active area of research. In this paper, we examine the energy efficiency and runtime performance of the SBLI and Nektar++ codes, using small but real-world test cases. The codes are instrumented in sufficient detail to reveal significant variability in energy usage during the course of the simulations. In addition, the test cases are run at different CPU frequencies and the consequences of changing this parameter, for both runtime performance (time to solution) and power performance (energy to solution), are presented.
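
The kind of instrumentation behind such measurements can be sketched as follows. The example reads the Linux powercap/RAPL energy counter around a code region and reports time to solution, energy to solution and average power; the sysfs path and the stand-in workload are assumptions, not the measurement infrastructure used in the paper (on a Cray system the data would come from its own power counters), and the CPU frequency is assumed to be set externally before the run.

```c
/* Minimal sketch of measuring energy to solution for a code region via the
 * Linux powercap/RAPL sysfs counter. Assumed instrumentation, not the setup
 * used in the paper; the counter may require elevated permissions. */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

static long long read_energy_uj(void)
{
    /* Package 0 energy counter in microjoules; the path can differ per system. */
    FILE *f = fopen("/sys/class/powercap/intel-rapl:0/energy_uj", "r");
    long long uj = -1;
    if (f) {
        if (fscanf(f, "%lld", &uj) != 1) uj = -1;
        fclose(f);
    }
    return uj;
}

int main(void)
{
    struct timespec t0, t1;
    long long e0 = read_energy_uj();
    clock_gettime(CLOCK_MONOTONIC, &t0);

    /* Stand-in for the CFD solve being instrumented. */
    volatile double acc = 0.0;
    for (long i = 0; i < 200000000L; ++i) acc += 1.0 / (double)(i + 1);

    clock_gettime(CLOCK_MONOTONIC, &t1);
    long long e1 = read_energy_uj();

    double runtime = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("time to solution: %.2f s\n", runtime);
    if (e0 >= 0 && e1 >= e0)
        printf("energy to solution: %.1f J (average power %.1f W)\n",
               (e1 - e0) * 1e-6, (e1 - e0) * 1e-6 / runtime);
    else
        printf("energy counters not readable on this system\n");
    return 0;
}
```

Repeating such a run at several CPU frequencies gives exactly the two curves the paper compares: time to solution, which typically grows as the frequency drops, and energy to solution, which may either fall or rise depending on how long the run is stretched.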


IEEE International Conference on High Performance Computing, Data, and Analytics | 2015

Feasibility Study of Porting a Particle Transport Code to FPGA

Iakovos Panourgias; Michele Weiland; Mark Parsons; David Turland; Dave Barrett; Wayne Gaudin

In this paper we discuss porting a particle transport code, which is based on a wavefront sweep algorithm, to FPGA. The original code is written in Fortran90. We describe the key differences between general purpose CPUs and Field Programmable Gate Arrays (FPGAs) and provide a detailed performance model of the FPGA. We describe the steps we took when porting the Fortran90 code to FPGA. Finally, the paper will present results from an extensive benchmarking exercise using a Virtex 6 FPGA.
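
The dependency structure of a wavefront sweep, which is what makes the algorithm attractive for FPGA pipelining, can be shown with a toy 2D recurrence: each cell depends only on its west and south neighbours, so all cells on the same anti-diagonal are independent. The recurrence below is an illustration under that assumption, not the particle transport kernel ported in the paper.

```c
/* Illustrative 2D wavefront sweep: each cell depends on its west and south
 * neighbours, so cells on the same anti-diagonal are independent and can be
 * processed in parallel or pipelined on an FPGA. Toy recurrence only. */
#include <stdio.h>

#define NX 6
#define NY 6

int main(void)
{
    double cell[NX][NY] = {{0}};

    /* Boundary conditions: incoming value 1.0 along one edge, 0.0 along the other. */
    for (int j = 0; j < NY; ++j) cell[0][j] = 1.0;

    /* Sweep anti-diagonal by anti-diagonal: diagonal d contains all (i, j) with
     * i + j == d, and every cell on it only needs values from diagonal d - 1. */
    for (int d = 2; d <= (NX - 1) + (NY - 1); ++d) {
        for (int i = 1; i < NX; ++i) {
            int j = d - i;
            if (j < 1 || j >= NY) continue;
            cell[i][j] = 0.5 * (cell[i - 1][j] + cell[i][j - 1]);
        }
    }

    printf("cell[%d][%d] = %f\n", NX - 1, NY - 1, cell[NX - 1][NY - 1]);
    return 0;
}
```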


Computer Science - Research and Development | 2010

Profile of scientific applications on HPC architectures using the DEISA Benchmark Suite

Michele Weiland

The European HPC infrastructure is composed of a wide range of architectures. In this paper, we discuss the usage of the DEISA Benchmark Suite, a selection of widely used HPC applications in a range of scientific areas, to understand and quantify the relationship between a system’s theoretical peak performance and its actual performance for the applications in the Benchmark Suite. The results show that for some applications relative performance can vary greatly between systems, underlining the fact that maintaining diversity in the HPC infrastructure in the future will be beneficial to scientific progress in Europe.
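
A trivial sketch of the comparison described above, relating a measured application performance figure to a system's theoretical peak; the numbers are invented and do not come from the DEISA Benchmark Suite.

```c
/* Tiny illustration of comparing achieved performance to theoretical peak.
 * All figures are hypothetical, not DEISA Benchmark Suite results. */
#include <stdio.h>

int main(void)
{
    const char *name[]       = { "system A", "system B" };
    double peak_gflops[]     = { 60.0e3, 95.0e3 };   /* hypothetical system peaks */
    double achieved_gflops[] = {  4.2e3,  3.1e3 };   /* hypothetical application rates */

    for (int i = 0; i < 2; ++i)
        printf("%s: %.1f%% of theoretical peak\n",
               name[i], 100.0 * achieved_gflops[i] / peak_gflops[i]);
    /* A higher peak does not automatically mean better application performance,
     * which is the argument for maintaining architectural diversity. */
    return 0;
}
```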


Computers & Fluids | 2015

Developing a scalable hybrid MPI/OpenMP unstructured finite element model

Xiaohu Guo; Gerard J. Gorman; Lawrence Mitchell; Michele Weiland

Collaboration


Dive into Michele Weiland's collaboration.

Top Co-Authors

Mark Parsons

EPCC, The University of Edinburgh


Fiona Reid

University of Edinburgh


Xiaohu Guo

Science and Technology Facilities Council
