Joseph Antony
Australian National University
Publications
Featured research published by Joseph Antony.
IEEE International Conference on High Performance Computing, Data, and Analytics | 2006
Joseph Antony; Pete P. Janes; Alistair P. Rendell
Modern shared memory multiprocessor systems commonly have non-uniform memory access (NUMA) with asymmetric memory bandwidth and latency characteristics. Operating systems now provide application programmer interfaces allowing the user to perform specific thread and memory placement. To date, however, there have been relatively few detailed assessments of the importance of memory/thread placement for complex applications. This paper outlines a framework for performing memory and thread placement experiments on Solaris and Linux. Thread binding and location-specific memory allocation, together with their verification, are discussed and contrasted. Using the framework, the performance characteristics of serial versions of lmbench, Stream and various BLAS libraries (ATLAS, GOTO, ACML on Opteron/Linux and Sunperf on Opteron, UltraSPARC/Solaris) are measured on two different hardware platforms (UltraSPARC/FirePlane and Opteron/HyperTransport). A simple model describing performance as a function of memory distribution is proposed and assessed for both the Opteron and UltraSPARC.
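The Solaris and Linux placement interfaces referred to above differ in detail; purely as an illustrative sketch of the Linux side (not the paper's actual framework, and with the CPU and node numbers as placeholders), thread binding and node-local allocation can be combined roughly as follows using sched_setaffinity and libnuma:

    /* Hedged sketch: pin the calling thread to a CPU and allocate memory on a
     * chosen NUMA node on Linux.  CPU/node numbers are illustrative; link with -lnuma. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <numa.h>
    #include <stdio.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "libnuma not available\n");
            return 1;
        }

        cpu_set_t mask;                        /* bind this thread to CPU 0 */
        CPU_ZERO(&mask);
        CPU_SET(0, &mask);
        if (sched_setaffinity(0, sizeof mask, &mask) != 0) {
            perror("sched_setaffinity");
            return 1;
        }

        size_t bytes = 64UL << 20;             /* 64 MiB backed on node 0 */
        double *buf = numa_alloc_onnode(bytes, 0);
        if (!buf) {
            perror("numa_alloc_onnode");
            return 1;
        }

        for (size_t i = 0; i < bytes / sizeof *buf; i++)
            buf[i] = 0.0;                      /* fault pages in on the chosen node */

        numa_free(buf, bytes);
        return 0;
    }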
International Symposium on Parallel Architectures, Algorithms and Networks | 2008
Rui Yang; Joseph Antony; Pete P. Janes; Alistair P. Rendell
In this work we study the effect of cache blocking and memory/thread placement on a modern multicore shared memory parallel system, the SunFire X4600 M2, using the Gaussian 03 computational chemistry code. A protocol for performing memory and thread placement studies is outlined, as is a scheme for characterizing a particular memory and thread placement pattern. Results for parallel Gaussian 03 runs with up to 16 threads are presented.
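Cache blocking of the kind studied here restructures loops so that the working set of an inner computation stays resident in cache; as a generic illustration only (not the Gaussian 03 integral code, and with the tile size B as a tunable placeholder), a blocked matrix multiply in C looks like:

    /* Hedged illustration of cache blocking (loop tiling): each B x B tile of
     * the matrices is processed while it remains cache-resident. */
    #include <stddef.h>

    #define B 64   /* illustrative tile size, tuned to the cache in practice */

    void matmul_blocked(size_t n, const double *A, const double *Bm, double *C)
    {
        for (size_t ii = 0; ii < n; ii += B)
            for (size_t kk = 0; kk < n; kk += B)
                for (size_t jj = 0; jj < n; jj += B)
                    for (size_t i = ii; i < ii + B && i < n; i++)
                        for (size_t k = kk; k < kk + B && k < n; k++)
                            for (size_t j = jj; j < jj + B && j < n; j++)
                                C[i*n + j] += A[i*n + k] * Bm[k*n + j];
    }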
International Parallel and Distributed Processing Symposium | 2011
Rui Yang; Joseph Antony; Alistair P. Rendell; Danny Robson; Peter E. Strazdins
The parallel performance of applications running on Non-Uniform Memory Access (NUMA) platforms is strongly influenced by the relative placement of memory pages to the threads that access them. As a consequence there are Linux application programmer interfaces (APIs) to control this. For large parallel codes it can, however, be difficult to determine how and when to use these APIs. In this paper we introduce the NUMAgrind profiling tool which can be used to simplify this process. It extends the Valgrind binary translation framework to include a model which incorporates cache coherency, memory locality domains and interconnect traffic for arbitrary NUMA topologies. Using NUMAgrind, cache misses can be mapped to memory locality domains, page access modes determined, and pages that are referenced by multiple threads quickly determined. We show how the NUMAgrind tool can be used to guide the use of Linux memory and thread placement APIs in the Gaussian computational chemistry code. The performance of the code before and after use of these APIs is also presented for three different commodity NUMA platforms.
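NUMAgrind derives its page-to-locality-domain mapping from binary translation, but a coarse runtime view of the same information is available on Linux; as a rough sketch only (buffer size is a placeholder, and this is not the NUMAgrind tool), move_pages() with a NULL node list reports which NUMA node currently backs each page:

    /* Hedged sketch: ask the kernel which NUMA node backs each page of a buffer.
     * move_pages() with nodes == NULL only queries placement.  Link with -lnuma. */
    #include <numaif.h>
    #include <numa.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        long page = sysconf(_SC_PAGESIZE);
        size_t npages = 16;                    /* illustrative buffer size */
        char *buf = numa_alloc_local(npages * page);

        for (size_t i = 0; i < npages * page; i++)
            buf[i] = 0;                        /* fault the pages in */

        void *pages[16];
        int status[16];
        for (size_t i = 0; i < npages; i++)
            pages[i] = buf + i * page;

        /* nodes == NULL: report the backing node of each page in status[]. */
        if (move_pages(0, npages, pages, NULL, status, 0) == 0) {
            for (size_t i = 0; i < npages; i++)
                printf("page %zu -> node %d\n", i, status[i]);
        }

        numa_free(buf, npages * page);
        return 0;
    }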
High Performance Computing and Communications | 2009
Rui Yang; Joseph Antony; Alistair P. Rendell
In this work, we extend and evaluate a simple performance model to account for NUMA and bandwidth effects for single and multi-threaded calculations within the Gaussian 03 computational chemistry code on a contemporary multi-core, NUMA platform. By using the thread and memory placement APIs in Solaris, we present results for a set of calculations from which we analyze on-chip interconnect and intra-core bandwidth contention and show the importance of load-balancing between threads. The extended model predicts single-threaded performance to within 1% error and most multi-threaded experiments to within 15% error. Our results and modeling show that accounting for bandwidth constraints within user-space code is beneficial.
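The abstract does not state the extended model's exact form; purely as a hedged sketch, a counter-based linear model extended with a bandwidth-contention term, in which p threads share a sustainable memory bandwidth B, might be written as

    % Hedged sketch only, not the paper's actual model: per-thread time as a
    % linear counter cost plus a term for p threads contending for bandwidth B.
    T_{\mathrm{thread}}(p) \approx \alpha\, N_{\mathrm{inst}}
        + \beta\, N_{\mathrm{miss}}
        + \frac{V_{\mathrm{mem}}}{B / p}

where V_mem is the data volume a thread moves to and from memory, and alpha and beta are coefficients fitted per platform; all symbols here are assumptions introduced for illustration.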
International Symposium on Environmental Software Systems | 2015
Benjamin J. K. Evans; Lesley Wyborn; Tim Pugh; Chris Allen; Joseph Antony; Kashif Gohar; David Porter; Jon Smillie; Claire Trenham; Jingbo Wang; Alex Ip; Gavin Bell
The National Computational Infrastructure (NCI) at the Australian National University (ANU) has co-located a priority set of over 10 PetaBytes (PBytes) of national data collections within a HPC research facility. The facility provides an integrated high-performance computational and storage platform, or a High Performance Data (HPD) platform, to serve and analyse the massive amounts of data across the spectrum of environmental collections – in particular from the climate, environmental and geoscientific domains. The data is managed in concert with the government agencies, major academic research communities and collaborating overseas organisations. By co-locating the vast data collections with high performance computing environments and harmonising these large valuable data assets, new opportunities have arisen for Data-Intensive interdisciplinary science at scales and resolutions not hitherto possible.
International Symposium on Pervasive Systems, Algorithms, and Networks | 2009
Rui Yang; Joseph Antony; Alistair P. Rendell
In this work we study the effect of data locality on the performance of Gaussian 03 code running on a multi-core Non-Uniform Memory Access (NUMA) system. A user-space protocol which affects runtime data locality, through the use of dynamic page migration and interleaving techniques, is considered. Using this protocol results in a significant performance improvement. Results for parallel Gaussian 03 using up to 16 threads are presented. The overhead of page migration and effect of dual-core contention are also examined.
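Interleaved allocation and explicit page migration of the kind referred to above are exposed through libnuma on Linux; as a rough sketch only (buffer size and target node are placeholders, and this is not the paper's user-space protocol), the two techniques can be combined like this:

    /* Hedged sketch: interleave a buffer's pages across all NUMA nodes, then
     * migrate one page to an explicit target node.  Link with -lnuma. */
    #include <numaif.h>
    #include <numa.h>
    #include <unistd.h>
    #include <stdio.h>

    int main(void)
    {
        long page = sysconf(_SC_PAGESIZE);
        size_t bytes = 8 * page;                  /* illustrative size */

        /* Pages are spread round-robin across all allowed nodes. */
        char *buf = numa_alloc_interleaved(bytes);
        for (size_t i = 0; i < bytes; i++)
            buf[i] = 0;                           /* fault pages in */

        /* Migrate the first page to node 1 (placeholder target). */
        void *pages[1] = { buf };
        int   nodes[1] = { 1 };
        int   status[1];
        if (move_pages(0, 1, pages, nodes, status, MPOL_MF_MOVE) == 0)
            printf("page now on node %d\n", status[0]);

        numa_free(buf, bytes);
        return 0;
    }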
International Parallel and Distributed Processing Symposium | 2012
Peter E. Strazdins; Jie Cai; Muhammad Atif; Joseph Antony
The ubiquity of on-demand cloud computing resources enables scientific researchers to dynamically provision and consume compute and storage resources in response to science needs, whereas traditional HPC compute resources are often centrally managed with a priori CPU-time allocations and use policies. A long-term goal of our work is to assess the efficacy of preserving the user environment (compilers, support libraries, runtimes and application codes) available at a traditional HPC facility for deployment into a VM environment, which can then subsequently be used in both private and public scientific clouds. This would afford users greater flexibility in choosing hardware resources that better suit their science needs, as well as aiding them in transitioning onto private/public cloud resources. In this paper we present work-in-progress performance results for a set of benchmark kernels and scientific applications running in a traditional HPC environment, a private VM cluster and an Amazon HPC EC2 cluster. These are the OSU MPI micro-benchmarks, the NAS Parallel macro-benchmarks and two large scientific application codes (the UK Met Office's MetUM global climate model and the Chaste multi-scale computational biology code) respectively. We discuss parallel scalability and runtime information obtained using the IPM performance monitoring framework for MPI applications. We were also able to successfully build application codes in a traditional HPC environment and package these into VMs which ran on both private and public cloud resources.
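The OSU micro-benchmarks mentioned above measure, among other things, point-to-point latency between MPI ranks; a minimal ping-pong loop in the same spirit (not the OSU code itself; message size and iteration count are placeholders) looks like:

    /* Hedged sketch of a ping-pong latency measurement between two MPI ranks,
     * in the spirit of the OSU point-to-point benchmarks (not the OSU code). */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int iters = 1000;          /* illustrative iteration count */
        char msg[8] = {0};               /* small message to expose latency */

        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(msg, sizeof msg, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(msg, sizeof msg, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(msg, sizeof msg, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(msg, sizeof msg, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();

        if (rank == 0)
            printf("avg round-trip latency: %g us\n", (t1 - t0) / iters * 1e6);

        MPI_Finalize();
        return 0;
    }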
Proceedings of International Conference on High Performance Scientific Computing (HPSC 2006) | 2008
Joseph Antony; Michael J. Frisch; Alistair P. Rendell
Gaussian is a widely used scientific code with application areas in chemistry, biochemistry and material sciences. To operate efficiently on modern architectures Gaussian employs cache blocking in the generation and processing of the two-electron integrals that are used by many of its electronic structure methods. This study uses hardware performance counters to characterise the cache and memory behavior of the integral generation code used by Gaussian for Hartree-Fock calculations. A simple performance model is proposed that aims to predict overall performance as a function of total instruction and cache miss counts. The model is parameterised for three different x86 processors — the Intel Pentium M, the P4 and the AMD Opteron. Results suggest that the model is capable of predicting execution times to an accuracy of between 5 and 15%.
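The abstract describes the model only as a function of total instruction and cache miss counts; a hedged sketch of such a linear execution-time model (the coefficient names are placeholders, fitted separately for each processor) is

    % Hedged sketch of a counter-based linear execution-time model; alpha, beta
    % and gamma are per-processor fitted coefficients, not the paper's notation.
    T_{\mathrm{exec}} \approx \alpha\, N_{\mathrm{inst}}
        + \beta\, N_{\mathrm{L1\,miss}}
        + \gamma\, N_{\mathrm{L2\,miss}}

with the counts obtained from the hardware performance counters mentioned above.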
Journal of Physics: Conference Series | 2008
Alistair P. Rendell; Joseph Antony; Warren Armstrong; Pete P. Janes; Rui Yang
Building fast, reliable, and adaptive software is a constant challenge for computational science, especially given recent developments in computer architecture. This paper outlines some of our efforts to address these three issues in the context of computational chemistry. First, a simple linear performance model that can be used to model and predict the performance of Hartree-Fock calculations is discussed. Second, the use of interval arithmetic to assess the numerical reliability of the sort of integrals used in electronic structure methods is presented. Third, the use of dynamic code modification as part of a framework to support adaptive software is outlined.
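Interval arithmetic, as used above to assess numerical reliability, brackets each quantity between rigorous lower and upper bounds by directing the floating-point rounding mode; a minimal sketch of an interval addition in C (not the paper's implementation) is:

    /* Hedged sketch of directed-rounding interval addition: the exact sum of any
     * x in [a.lo, a.hi] and y in [b.lo, b.hi] lies within the returned interval. */
    #include <fenv.h>
    #include <stdio.h>

    #pragma STDC FENV_ACCESS ON

    typedef struct { double lo, hi; } interval;

    interval interval_add(interval a, interval b)
    {
        interval r;
        int saved = fegetround();

        fesetround(FE_DOWNWARD);   /* round the lower bound towards -infinity */
        r.lo = a.lo + b.lo;

        fesetround(FE_UPWARD);     /* round the upper bound towards +infinity */
        r.hi = a.hi + b.hi;

        fesetround(saved);
        return r;
    }

    int main(void)
    {
        interval a = {1.0, 1.0}, b = {0.1, 0.1};
        interval c = interval_add(a, b);
        printf("[%.17g, %.17g]\n", c.lo, c.hi);
        return 0;
    }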
International Conference on Conceptual Structures | 2011
Joseph Antony; Alistair P. Rendell; Rui Yang; Gary W. Trucks; Michael J. Frisch
This paper explores the use of a simple linear performance model, which determines execution time based on instruction and cache miss counts, for describing the behaviour of the two-electron integral evaluation algorithm in the Gaussian computational chemistry package. Four different microarchitecture platforms are considered with a total of seven individual microprocessors. Both Hartree-Fock and hybrid Hartree-Fock/Density Functional Theory electronic structure methods are assessed. In most cases the model is found to be accurate to within 3%. The least agreement is for an Athlon64 system (with errors ranging from 1.8% to 6.5%) and a periodic boundary computation on an Opteron, where errors of up to 6.8% are observed. These errors arise because the model does not account for the intricacies of out-of-order execution, on-chip write-back buffers and prefetch techniques that modern microprocessors implement. The parameters from the linear performance model are combined with instruction and cache miss counts obtained from functional cache simulation to predict the effect of cache modification on total execution time. Variations in level 1 and level 2 linesize and level 2 total size are considered; we find there is some benefit if linesizes are increased (L1: 8%, L2: 4%). Increasing the level 2 cache size is also predicted to be beneficial, although the cache blocking approach already implemented in the Gaussian integral evaluation code was found to be working well.