
Publication


Featured research published by Brendan Harding.


International Parallel and Distributed Processing Symposium | 2014

Application Level Fault Recovery: Using Fault-Tolerant Open MPI in a PDE Solver

Md. Mohsin Ali; James Southern; Peter E. Strazdins; Brendan Harding

A fault-tolerant version of Open Message Passing Interface (Open MPI), based on the draft User Level Failure Mitigation (ULFM) proposal of the MPI Forum's Fault Tolerance Working Group, is used to create fault-tolerant applications. This allows applications and libraries to design their own recovery methods and control them at the user level. However, only a limited amount of research on user-level failure recovery (including the implementation and performance evaluation of this prototype) has been carried out. This paper contributes a fault-tolerant implementation of an application solving 2D partial differential equations (PDEs) by means of a sparse grid combination technique, capable of surviving multiple process failures. Our fault recovery involves reconstructing the faulty communicators without shrinking the global size by re-spawning failed MPI processes on the same physical processors where they ran before the failure (for load balancing). It also involves restoring lost data from either exact checkpointed data on disk, approximated data in memory (via an alternate sparse grid combination technique), or a near-exact copy of replicated data in memory. The experimental results show that the faulty-communicator reconstruction time is currently large in the draft ULFM, especially for multiple process failures. They also show that the alternate combination technique has the lowest data recovery overhead, except on a system with very low disk write latency, for which checkpointing has the lowest overhead. Furthermore, the errors due to the recovery of approximated data are within a factor of 10 in all cases, with the surprising result that the alternate combination technique is more accurate than the near-exact replication method.
The implementation details contributed here, including the analysis of the experimental results, will help application developers resolve design and implementation issues in fault-tolerant applications built with the draft Open MPI ULFM specification.


SIAM Journal on Imaging Sciences | 2011

How to Transform and Filter Images Using Iterated Function Systems

Michael F. Barnsley; Brendan Harding; Konstantin Igudesman

We generalize the mathematics of fractal transformations and illustrate how it leads to a new approach to the representation and processing of digital images, and consequent novel methods for filtering, watermarking, and encryption. This work substantially generalizes earlier work on fractal tops. The approach involves fractal geometry, chaotic dynamics, and an interplay between discrete and continuous representations. The underlying mathematics is established and some applications to digital imaging are described and exemplified.


International Conference on Conceptual Structures | 2013

Fault-Tolerant Grid-Based Solvers: Combining Concepts from Sparse Grids and MapReduce

Jay Walter Larson; Markus Hegland; Brendan Harding; Stephen Roberts; Linda Stals; Alistair P. Rendell; Peter E. Strazdins; Md. Mohsin Ali; Christoph Kowitz; Ross Nobes; James Southern; Nicholas Wilson; Michael Li; Yasuyuki Oishi

A key issue confronting petascale and exascale computing is the growth in the probability of soft and hard faults with increasing system size. A promising approach to this problem is the use of algorithms that are inherently fault tolerant. We introduce such an algorithm for the solution of partial differential equations, based on the sparse grid approach. Here, the solutions on multiple component grids are efficiently combined to achieve a solution on a full grid. The technique also lends itself to a (modified) MapReduce framework on a cluster of processors, with the map stage corresponding to allocating each component grid for solution over a subset of the processors, and the reduce stage corresponding to their combination. We describe how the sparse grid combination method can be modified to robustly solve partial differential equations in the presence of faults, based on a modified combination formula that can accommodate the loss of one or two component grids. We also discuss accuracy issues associated with this formula. We give details of a prototype implementation within a MapReduce framework using the dynamic process features and asynchronous message-passing facilities of MPI. Results on a two-dimensional advection problem show that the errors after the loss of one or two sub-grids are within a factor of 3 of the fault-free sparse grid solution. They also indicate that the sparse grid technique with four times the resolution has approximately the same error as a full grid, while requiring (for sufficiently high resolution) much less computation and memory. We finally outline a MapReduce variant capable of responding to faults in ways other than re-scheduling failed tasks, and discuss the likely software requirements for such a flexible MapReduce framework, the requirements it will impose on users' legacy codes, and the system's runtime behavior.
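The combination of component grids described in this abstract can be illustrated with a short sketch (ours, not from the paper): in the classical 2D combination of level n, component grids with levels i + j = n receive coefficient +1 and those with i + j = n - 1 receive -1, so the coefficients always sum to 1.

```python
def combination_coefficients_2d(n):
    """Classical 2D sparse grid combination coefficients:
    +1 for component grids with levels i + j == n,
    -1 for component grids with levels i + j == n - 1."""
    coeffs = {}
    for i in range(n + 1):
        coeffs[(i, n - i)] = 1
    for i in range(n):
        coeffs[(i, n - 1 - i)] = -1
    return coeffs

# The modified combination formula the paper describes recomputes such
# coefficients over the grids that survive a fault; the consistency
# condition (coefficients summing to 1) must still hold afterwards.
```

The map stage would solve the PDE on each keyed grid independently; the reduce stage would sum the interpolated solutions weighted by these coefficients.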


SIAM Journal on Scientific Computing | 2015

Fault Tolerant Computation with the Sparse Grid Combination Technique

Brendan Harding; Markus Hegland; Jay Walter Larson; James Southern

This paper continues to develop a fault tolerant extension of the sparse grid combination technique recently proposed in [B. Harding and M. Hegland, ANZIAM J. Electron. Suppl., 54 (2013), pp. C394--C411]. This approach to fault tolerance is novel for two reasons: First, the combination technique adds an additional level of parallelism, and second, it provides algorithm-based fault tolerance so that solutions can still be recovered if failures occur during computation. Previous work indicates how the combination technique may be adapted for a low number of faults. In this paper we develop a generalization of the combination technique for which arbitrary collections of coarse approximations may be combined to obtain an accurate approximation. A general fault tolerant combination technique for large numbers of faults is a natural consequence of this work. Using a renewal model for the time between faults on each node of a high performance computer, we also provide bounds on the expected error for interpolati...


International Parallel and Distributed Processing Symposium | 2015

Highly Scalable Algorithms for the Sparse Grid Combination Technique

Peter E. Strazdins; Md. Mohsin Ali; Brendan Harding

Many petascale and exascale scientific simulations involve the time evolution of systems modelled as partial differential equations (PDEs). The sparse grid combination technique (SGCT) is a cost-effective method for solving time-evolving PDEs, especially for higher-dimensional problems. It consists of evolving the PDE over a set of grids of differing resolution in each dimension, and then combining the results to approximate the solution of the PDE on a grid of high resolution in all dimensions. It can also be extended to support algorithm-based fault tolerance, which is also important for computations at this scale. In this paper, we present two new parallel algorithms for the SGCT that support full distributed-memory parallelization over the dimensions of the component grids as well as over the grids themselves. The direct algorithm is so called because it directly implements the SGCT combination formula. The second algorithm converts each component grid into its hierarchical surpluses, and then uses the direct algorithm on each of the hierarchical surpluses. The conversion to/from the hierarchical surpluses is an important algorithm in its own right. An analysis of both indicates that the direct algorithm minimizes the number of messages, whereas the hierarchical-surplus algorithm minimizes memory consumption and offers a reduction in bandwidth by a factor of 1 - 2^{-d}, where d is the dimensionality of the SGCT. However, this is offset by its incomplete parallelism and a factor-of-two load imbalance in practical scenarios. Our analysis also indicates that both are suitable in a bandwidth-limited regime. Experimental results, including the strong and weak scalability of the algorithms, indicate that, for scenarios of practical interest, both are sufficiently scalable to support the large-scale SGCT, but the direct algorithm generally performs better, to within a factor of 2.
Hierarchical surplus formation is much less communication intensive, but shows less scalability with increasing core counts.
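To make the hierarchical-surplus conversion concrete, here is a minimal 1D sketch (ours, not the paper's parallel implementation): each interior point's surplus is its nodal value minus the average of its two hierarchical parents, processed from the finest level down so parents are still nodal when read. Most surpluses of a smooth function are small, which is the source of the bandwidth saving.

```python
def hierarchize_1d(u):
    """Convert nodal values on a level-n 1D dyadic grid (2**n + 1
    points, boundaries included) into hierarchical surpluses.
    Boundary values are left unchanged."""
    n = (len(u) - 1).bit_length() - 1  # grid level
    v = list(u)
    for level in range(n, 0, -1):
        step = 1 << (n - level)  # distance to hierarchical parents
        for i in range(step, len(u) - 1, 2 * step):
            # surplus = nodal value minus linear interpolant of parents
            v[i] -= 0.5 * (v[i - step] + v[i + step])
    return v
```

For a linear function the hat-function interpolant is exact, so all interior surpluses vanish; only the boundary values remain.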


Parallel Computing | 2014

Managing complexity in the Parallel Sparse Grid Combination Technique

Jay Walter Larson; Peter E. Strazdins; Markus Hegland; Brendan Harding; Stephen Roberts; Linda Stals; Alistair P. Rendell; Md. Mohsin Ali; James Southern



Archive | 2016

Adaptive Sparse Grids and Extrapolation Techniques

Brendan Harding

In this paper we extend the study of (dimension) adaptive sparse grids by building a lattice framework around projections onto hierarchical surpluses. Using this we derive formulas for the explicit calculation of combination coefficients, in particular providing a simple formula for the coefficient update used in the adaptive sparse grids algorithm. Further, we are able to extend error estimates for classical sparse grids to adaptive sparse grids. Multi-variate extrapolation has been well studied in the context of sparse grids. This too can be studied within the adaptive sparse grids framework and doing so leads to an adaptive extrapolation algorithm.
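The explicit coefficient calculation mentioned in this abstract has a compact inclusion-exclusion form: for a downward-closed index set I, the coefficient of grid l is the sum over z in {0,1}^d of (-1)^{|z|} when l + z lies in I. The following sketch (ours, hedged; not the paper's lattice formulation) computes these coefficients:

```python
from itertools import product

def combination_coefficients(index_set):
    """Combination coefficients for a downward-closed multi-index set I:
    c_l = sum over z in {0,1}^d with l + z in I of (-1)^{|z|}.
    Grids with coefficient 0 are omitted from the result."""
    I = set(index_set)
    d = len(next(iter(I)))  # dimensionality
    coeffs = {}
    for l in I:
        c = 0
        for z in product((0, 1), repeat=d):
            if tuple(a + b for a, b in zip(l, z)) in I:
                c += (-1) ** sum(z)
        if c != 0:
            coeffs[l] = c
    return coeffs
```

Applied to the classical 2D level-2 index set this reproduces the familiar +1/-1 diagonal coefficients; the adaptive algorithm's coefficient update corresponds to re-evaluating this formula as indices are added to I.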


Archive | 2014

Robust Solutions to PDEs with Multiple Grids

Brendan Harding; Markus Hegland

In this paper we will discuss some approaches to fault-tolerance for solving partial differential equations. In particular we will discuss how one can combine the solution from multiple grids using ideas related to the sparse grid combination technique and multivariate extrapolation. By utilising the redundancy between the solutions on different grids we will demonstrate how this approach can be adapted for fault-tolerance. Much of this will be achieved by assuming error expansions and examining the extrapolation of these when various solutions from different grids are combined.


International Journal of High Performance Computing Applications | 2016

Complex scientific applications made fault-tolerant with the sparse grid combination technique

Md. Mohsin Ali; Peter E. Strazdins; Brendan Harding; Markus Hegland

Ultra-large-scale simulations via solving partial differential equations (PDEs) require very large computational systems for their timely solution. Studies have shown that the rate of failure grows with the system size, and these trends are likely to worsen in future machines. Thus, as systems, and the problems solved on them, continue to grow, the ability to survive failures is becoming a critical aspect of algorithm development. The sparse grid combination technique (SGCT), a cost-effective method for solving higher-dimensional PDEs, can be easily modified to provide algorithm-based fault tolerance. In this article, we describe how the SGCT can produce fault-tolerant versions of the Gyrokinetic Electromagnetic Numerical Experiment plasma application, the Taxila Lattice Boltzmann Method application, and the Solid Fuel Ignition application. We use an alternate component-grid combination formula, adding some redundancy to the SGCT, to recover data from lost processes. User-level failure mitigation (ULFM) message passing interface (MPI) is used to recover the processes, and our implementation is robust over multiple failures and recoveries (of processes and nodes). An acceptable degree of modification of the applications is required. Results using the 2D SGCT show competitive execution times with acceptable error (within 0.1% to 1.0%), compared to the same simulation on a single full-resolution grid. The benefits improve when the 3D SGCT is used. Experiments show the applications' ability to successfully recover from multiple failures, and applying the SGCT multiple times reduces the computed solution error. Process recovery via ULFM MPI increases from approximately 1.5 s at 64 cores to approximately 5 s at 2048 cores for a one-off failure. This compares with the applications' built-in checkpointing with job restart in conjunction with the classical SGCT on failure, which has overheads four times as large for a single failure, excluding the recomputation overhead.
An analysis for a long-running application that accounts for recomputation times indicates a reduction in overhead of over an order of magnitude.


Software for Exascale Computing | 2016

Handling Silent Data Corruption with the Sparse Grid Combination Technique

Alfredo Parra Hinojosa; Brendan Harding; Markus Hegland; Hans-Joachim Bungartz

We describe two algorithms to detect and filter silent data corruption (SDC) when solving time-dependent PDEs with the Sparse Grid Combination Technique (SGCT). The SGCT solves a PDE on many regular full grids of different resolutions, which are then combined to obtain a high quality solution. The algorithm can be parallelized and run on large HPC systems. We investigate silent data corruption and show that the SGCT can be used with minor modifications to filter corrupted data and obtain good results. We apply sanity checks before combining the solution fields to make sure that the data is not corrupted. These sanity checks are derived from well-known error bounds of the classical theory of the SGCT and do not rely on checksums or data replication. We apply our algorithms on a 2D advection equation and discuss the main advantages and drawbacks.
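The paper's actual sanity checks are derived from SGCT error bounds; purely as an illustration of the filtering idea, here is a generic outlier check in the same spirit (a median-absolute-deviation test of our choosing, not the paper's criterion) applied to scalar summaries of the component-grid solutions before they are combined:

```python
from statistics import median

def sanity_filter(summaries, tol=3.0):
    """Hypothetical pre-combination sanity check: keep only the
    component-grid summary values (e.g. solution norms) whose deviation
    from the median is at most tol times the median absolute deviation.
    Silently corrupted grids typically show up as gross outliers."""
    m = median(summaries)
    mad = median(abs(s - m) for s in summaries) or 1e-12  # guard mad == 0
    return [s for s in summaries if abs(s - m) <= tol * mad]
```

A grid flagged by such a check would be dropped and the combination coefficients recomputed over the remaining grids, exactly as for a hard process failure.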

Collaboration



Top Co-Authors

Markus Hegland, Australian National University
Peter E. Strazdins, Australian National University
Md. Mohsin Ali, Australian National University
Michael F. Barnsley, Australian National University
Jay Walter Larson, Argonne National Laboratory
Alistair P. Rendell, Australian National University
Linda Stals, Australian National University
Stephen Roberts, Australian National University