Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Josh Milthorpe is active.

Publication


Featured research published by Josh Milthorpe.


International Parallel and Distributed Processing Symposium | 2011

X10 as a Parallel Language for Scientific Computation: Practice and Experience

Josh Milthorpe; V. Ganesh; Alistair P. Rendell; David Grove

X10 is an emerging Partitioned Global Address Space (PGAS) language intended to significantly increase the productivity of developing scalable HPC applications. The language has now matured to a point where it is meaningful to consider writing large-scale scientific application codes in X10. This paper reports our experiences writing three codes from the chemistry/materials science domain entirely in X10: Fast Multipole Method (FMM), Particle Mesh Ewald (PME), and Hartree-Fock (HF). Performance results are presented for up to 256 places on a Blue Gene/P system. During the course of this work our experiences were shared with the X10 development team, so that application requirements could inform language design discussions as the language capabilities influenced algorithm design. This resulted in improvements in the language implementation and standard class libraries, including the design of the array API and support for complex math. Data constructs in X10 such as places and distributed arrays, and parallel constructs such as finish and async, simplify implementation of the applications in comparison with MPI. However, current implementation limitations in X10 2.1.2 make it difficult to achieve scalable performance using the most natural expressions of the algorithms. The most serious limitation is the use of point-to-point communication patterns, rather than collectives, to implement parallel constructs and array operations. This issue will be addressed in future releases of X10.


Proceedings of ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming | 2014

Supporting Array Programming in X10

David Grove; Josh Milthorpe; Olivier Tardieu

Effective support for array-based programming has long been one of the central design concerns of the X10 programming language. After significant research and exploration, X10 has adopted an approach based on providing arrays via user definable and extensible class libraries. This paper surveys the range of array abstractions available to the programmer in X10 2.4 and describes the key language features and language implementation techniques necessary to make efficient and productive implementations of these abstractions possible.


International Parallel and Distributed Processing Symposium | 2015

A Resilient Framework for Iterative Linear Algebra Applications in X10

Sara S. Hamouda; Josh Milthorpe; Peter E. Strazdins; Vijay A. Saraswat

The Global Matrix Library (GML) is a distributed matrix library in the X10 language. GML is designed to simplify the development of scalable linear algebra applications. Because GML hides the communication and parallelism details, GML programs are written in a sequential style that is easy for non-expert programmers to use and understand. Resilience is becoming a major challenge for HPC applications as the number of components in a typical system continues to increase. To address this challenge, we improved GML's adaptability to process failure and provided a mechanism for automatic data recovery. As iterative algorithms are commonly used in linear algebra applications, we also created a checkpoint/restore framework for developing resilient iterative applications using GML. Using three example machine learning applications, we demonstrate that this framework supports resilient application development with minimal additional code compared to a non-resilient implementation. Performance measurements in a typical cluster environment show that the major cost of resilient execution is due to resilient X10 itself, and that the additional cost due to our framework is acceptable.
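As a loose illustration of the checkpoint/restore idea described in the abstract (in plain Python, not GML's actual API), the sketch below periodically snapshots the application state and, on a simulated process failure, resumes from the last checkpoint rather than from iteration zero. The function name and parameters are hypothetical.

```python
import copy

def resilient_iterate(state, step, max_iters, checkpoint_every=10, fail_at=None):
    """Run an iterative computation with periodic in-memory checkpoints.

    On a (simulated) process failure, roll back to the last checkpoint
    and resume, rather than restarting from iteration zero.
    """
    checkpoint = (0, copy.deepcopy(state))  # (iteration, saved state)
    i = 0
    failed_once = False
    while i < max_iters:
        if i % checkpoint_every == 0:
            checkpoint = (i, copy.deepcopy(state))  # snapshot current state
        try:
            if fail_at is not None and i == fail_at and not failed_once:
                failed_once = True
                raise RuntimeError("simulated process failure")
            state = step(state)
            i += 1
        except RuntimeError:
            # Restore: resume from the last checkpoint instead of iteration 0.
            i, state = checkpoint[0], copy.deepcopy(checkpoint[1])
    return state
```

For example, with `step` as one sweep of an iterative solver, a failure at iteration 37 costs only the work since the checkpoint at iteration 30.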


Concurrency and Computation: Practice and Experience | 2014

PGAS-FMM: Implementing a distributed fast multipole method using the X10 programming language

Josh Milthorpe; Alistair P. Rendell; Thomas Huber

The fast multipole method (FMM) is a complex, multi-stage algorithm over a distributed tree data structure, with multiple levels of parallelism and inherent data locality. X10 is a modern partitioned global address space language with support for asynchronous activities. The parallel tasks comprising FMM may be expressed in X10 using a scalable pattern of activities. This paper demonstrates the use of X10 to implement FMM for simulation of electrostatic interactions between ions in a cyclotron resonance mass spectrometer. X10's task-parallel model is used to express parallelism through a pattern of activities mapping directly onto the tree. X10's work-stealing runtime handles load balancing of fine-grained parallel activities, avoiding the need for explicit work sharing. The use of global references and active messages to create and synchronize parallel activities over a distributed tree structure is also demonstrated. In contrast to previous simulations of ion trajectories in cyclotron resonance mass spectrometers, our code enables both simulation of realistic particle numbers and guaranteed error bounds. Single-node performance is comparable with the fastest published FMM implementations, and critical expansion operators are faster for high-accuracy calculations. A comparison of parallel and sequential codes shows that the overhead of activity management and work stealing in this application is low. Scalability is evaluated for 8k cores on a Blue Gene/Q system and 512 cores on a Nehalem/InfiniBand cluster.


Journal of Chemical Theory and Computation | 2013

Resolutions of the Coulomb Operator: VII. Evaluation of Long-Range Coulomb and Exchange Matrices

Taweetham Limpanuparb; Josh Milthorpe; Alistair P. Rendell; Peter M. W. Gill

Use of the resolution of the Ewald operator for computing long-range Coulomb and exchange interactions is presented. We show that the accuracy of this method can be controlled by a single parameter, in a manner similar to that used by conventional algorithms that compute two-electron integrals. Significant performance advantages over conventional algorithms are observed, particularly for high-quality basis sets and globular systems. The approach is directly applicable to hybrid density functional theory.
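The resolution method itself is beyond a short snippet, but the standard erf/erfc range separation underlying Ewald-type treatments of the long-range Coulomb interaction can be sketched in a few lines of Python. The function name `coulomb_split` and the splitting parameter `omega` are illustrative, not part of the paper's code.

```python
import math

def coulomb_split(charges, positions, omega=0.5):
    """Split the pairwise Coulomb energy of point charges into
    short-range (erfc) and long-range (erf) parts:

        sum_{i<j} q_i q_j / r_ij
            = sum q_i q_j * erfc(omega*r)/r  (short range, decays fast)
            + sum q_i q_j * erf(omega*r)/r   (long range, smooth)
    """
    short, long_ = 0.0, 0.0
    n = len(charges)
    for i in range(n):
        for j in range(i + 1, n):
            r = math.dist(positions[i], positions[j])
            qq = charges[i] * charges[j]
            short += qq * math.erfc(omega * r) / r
            long_ += qq * math.erf(omega * r) / r
    return short, long_
```

The two parts sum exactly to the unsplit Coulomb energy; the smooth long-range part is the piece that methods such as the resolution of the Ewald operator treat separately.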


IEEE International Conference on High Performance Computing, Data, and Analytics | 2012

Efficient update of ghost regions using active messages

Josh Milthorpe; Alistair P. Rendell

The use of ghost regions is a common feature of many distributed grid applications. A ghost region holds local read-only copies of remotely-held boundary data which are exchanged and cached many times over the course of a computation. X10 is a modern parallel programming language intended to support productive development of distributed applications. X10 supports the “active message” paradigm, which combines data transfer and computation in one-sided communications. A central feature of X10 is the distributed array, which distributes array data across multiple places, providing standard read and write operations as well as powerful high-level operations. We used active messages to implement ghost region updates for X10 distributed arrays using two different update algorithms. Our implementation exploits multiple levels of parallelism and avoids global synchronization; it also supports split-phase ghost updates, which allows for overlapping computation and communication. We compare the performance of these algorithms on two platforms: an Intel x86-64 cluster over QDR InfiniBand, and a Blue Gene/P system, using both stand-alone benchmarks and an example computational chemistry application code. Our results suggest that on a dynamically threaded architecture, a ghost region update using only pairwise synchronization exhibits superior scaling to an update that uses global collective synchronization.
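As a minimal illustration of the ghost-region concept (plain Python over lists, not X10 distributed arrays or the paper's active-message implementation), the sketch below refreshes one-cell ghost regions for a 1D block decomposition; `exchange_ghosts` is a hypothetical name.

```python
def exchange_ghosts(subdomains):
    """Update one-cell ghost regions for a 1D block-decomposed grid.

    Each subdomain is [left_ghost, interior..., right_ghost]; the ghost
    cells hold read-only copies of the neighbouring subdomain's boundary
    cells, so a stencil can read them without remote access.
    """
    last = len(subdomains) - 1
    for rank, sub in enumerate(subdomains):
        if rank > 0:
            # copy left neighbour's last interior cell into our left ghost
            sub[0] = subdomains[rank - 1][-2]
        if rank < last:
            # copy right neighbour's first interior cell into our right ghost
            sub[-1] = subdomains[rank + 1][1]
```

After each computation step that modifies boundary cells, the update is repeated; in a distributed setting each copy becomes a one-sided communication, which is where pairwise versus collective synchronization matters.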


International Conference on Computational Science and Its Applications | 2007

Interval Arithmetic and Computational Science: Rounding and Truncation Errors in N-Body Methods

Alistair P. Rendell; Bill Clarke; Pete P. Janes; Josh Milthorpe; Rui Yang

Interval arithmetic is an alternative computational paradigm that enables arithmetic operations to be performed with guaranteed error bounds. In this paper, interval arithmetic is used to compare the accuracy of various methods for computing the electrostatic energy of a system of point charges. A number of summation approaches that scale as O(N²) are considered, as is an O(N)-scaling Fast Multipole Method (FMM). Results are presented for various sizes of water cluster in which each water molecule is described using the popular TIP3P water model. For FMM, a subtle balance between the dominance of rounding and truncation errors is demonstrated.
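A minimal interval type with outward rounding conveys the "guaranteed error bounds" idea: every operation widens its result to the next representable floats, so the exact real-arithmetic result always lies inside the bounds. This Python sketch (requiring Python 3.9+ for `math.nextafter`) is illustrative, not the implementation used in the paper.

```python
import math

def _down(x):  # next float toward -infinity
    return math.nextafter(x, -math.inf)

def _up(x):    # next float toward +infinity
    return math.nextafter(x, math.inf)

class Interval:
    """Closed interval [lo, hi] with outward rounding: the exact real
    result of each operation is guaranteed to lie within the bounds."""

    def __init__(self, lo, hi=None):
        self.lo = lo
        self.hi = lo if hi is None else hi

    def __add__(self, other):
        return Interval(_down(self.lo + other.lo), _up(self.hi + other.hi))

    def __mul__(self, other):
        # The extreme products of the endpoints bound all products.
        p = [self.lo * other.lo, self.lo * other.hi,
             self.hi * other.lo, self.hi * other.hi]
        return Interval(_down(min(p)), _up(max(p)))

    @property
    def width(self):
        return self.hi - self.lo
```

Summing pairwise energies as intervals rather than floats yields a final interval whose width bounds the accumulated rounding error, which is the kind of comparison the paper applies to O(N²) summation and FMM.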


Proceedings of the 6th ACM SIGPLAN Workshop on X10 | 2016

Resilient X10 over MPI User Level Failure Mitigation

Sara S. Hamouda; Benjamin Herta; Josh Milthorpe; David Grove; Olivier Tardieu

Many PGAS languages and libraries rely on high-performance transport layers such as GASNet and MPI to achieve low communication latency, portability, and scalability. As systems increase in scale, failures are expected to become normal events rather than exceptions. Unfortunately, GASNet and standard MPI do not provide fault tolerance capabilities. This limitation hinders PGAS languages and other high-level programming models from supporting resilience at scale. For this reason, Resilient X10 has previously been supported over sockets only, not over MPI. This paper describes the use of a fault-tolerant MPI implementation, called ULFM (User Level Failure Mitigation), as a transport layer for Resilient X10. By providing fault-tolerant collective and agreement algorithms, on-demand failure propagation, and support for InfiniBand, ULFM provides the required infrastructure to create a high-performance transport layer for Resilient X10. We show that replacing X10's emulated collectives with ULFM's blocking collectives results in significant performance improvements. For three iterative SPMD-style applications running on 1000 X10 places, the improvement ranged between 30% and 51%. The per-step overhead for resilience was less than 9%. A proposal for adding ULFM to the forthcoming MPI-4 standard is currently under assessment by the MPI Forum. Our results show that adding user-level fault tolerance support to MPI makes it a suitable base for resilience in high-level programming models.


Proceedings of the ACM SIGPLAN Workshop on X10 | 2015

Local parallel iteration in X10

Josh Milthorpe

X10 programs have achieved high efficiency on petascale clusters by making significant use of parallelism between places; however, there has been less focus on exploiting local parallelism within a place. This paper introduces a standard mechanism, foreach, for efficient local parallel iteration in X10, including support for worker-local data. Library code transforms parallel iteration into an efficient pattern of activities for execution by X10's work-stealing runtime. Parallel reductions and worker-local data help to avoid unnecessary synchronization between worker threads. The foreach mechanism is compared with leading programming technologies for shared-memory parallelism using kernel codes from high-performance scientific applications. Experiments on a typical Intel multicore architecture show that X10 with foreach achieves parallel speedup comparable with OpenMP and TBB for several important patterns of iteration. foreach is composable with X10's asynchronous partitioned global address space model, and therefore represents a step towards a parallel programming model that can express the full range of parallelism in modern high-performance computing systems.
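As a conceptual analogy to parallel iteration with worker-local data (in Python rather than X10, and not the foreach API itself), the sketch below iterates over contiguous chunks with one accumulator per worker, combining partial results only at the end so workers never synchronize on shared state mid-loop; `parallel_reduce` is a hypothetical name.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_reduce(data, body, combine, identity, workers=4):
    """Apply `body` to each element in contiguous chunks, one chunk per
    worker. Each worker keeps a local accumulator (no shared state, no
    locks); partial results are combined once after all chunks finish."""
    def run_chunk(chunk):
        acc = identity               # worker-local accumulator
        for x in chunk:
            acc = combine(acc, body(x))
        return acc

    size = max(1, len(data) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(run_chunk, chunks)
    result = identity
    for p in partials:               # single final reduction
        result = combine(result, p)
    return result
```

The design point mirrors the abstract: per-worker accumulators replace fine-grained synchronization, at the cost of one extra reduction over the number of workers.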


Journal of Computational Chemistry | 2014

Resolutions of the Coulomb operator: VIII. Parallel implementation using the modern programming language X10

Taweetham Limpanuparb; Josh Milthorpe; Alistair P. Rendell

Use of the modern parallel programming language X10 for computing long-range Coulomb and exchange interactions is presented. By using X10, a partitioned global address space language with support for task parallelism and the explicit representation of data locality, the resolution of the Ewald operator can be parallelized in a straightforward manner, including the use of both intranode and internode parallelism. We evaluate four different schemes for dynamic load balancing of the integral calculation using X10's work-stealing runtime, and report performance results for long-range HF energy calculations of a large molecule with a high-quality basis set running on up to 1024 cores of a high-performance cluster.

Collaboration


Dive into Josh Milthorpe's collaborations.

Top Co-Authors

Alistair P. Rendell, Australian National University
Taweetham Limpanuparb, Australian National University
Bill Clarke, Australian National University
Sara S. Hamouda, Australian National University
Pete P. Janes, Australian National University
Peter E. Strazdins, Australian National University
Peter M. W. Gill, Australian National University