
Publications


Featured research published by Brian R. Toonen.


Conference on High Performance Computing (Supercomputing) | 2001

Supporting Efficient Execution in Heterogeneous Distributed Computing Environments with Cactus and Globus

Gabrielle Allen; Thomas Dramlitsch; Ian T. Foster; Nicholas T. Karonis; Matei Ripeanu; Edward Seidel; Brian R. Toonen

Improvements in the performance of processors and networks make it both feasible and interesting to treat collections of workstations, servers, clusters, and supercomputers as integrated computational resources, or Grids. However, the highly heterogeneous and dynamic nature of such Grids can make application development difficult. Here we describe an architecture and prototype implementation for a Grid-enabled computational framework based on Cactus, the MPICH-G2 Grid-enabled message-passing library, and a variety of specialized features to support efficient execution in Grid environments. We have used this framework to perform record-setting computations in numerical relativity, running across four supercomputers and achieving scaling of 88% (1,140 CPUs) and 63% (1,500 CPUs). The problem size we were able to compute was about five times larger than any previous run. Further, we introduce and demonstrate adaptive methods that automatically adjust computational parameters during run time, dramatically increasing the efficiency of a distributed Grid simulation without modification of the application and without any knowledge of the underlying network connecting the distributed computers.
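
The adaptive methods mentioned above adjust computational parameters from measurements taken while the simulation runs. The fragment below is only a hypothetical sketch of one such adaptation, widening a ghost-zone exchange depth when communication time dominates; it is not the Cactus/MPICH-G2 implementation, and the ghost_depth parameter, the 0.5 threshold, and the loop structure are illustrative assumptions.

/* Hypothetical sketch: adapt a ghost-zone depth from measured communication
 * time. Not the Cactus/MPICH-G2 code; names and thresholds are illustrative. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int ghost_depth = 1;                       /* assumed tunable parameter */
    for (int step = 0; step < 100; step++) {
        double t0 = MPI_Wtime();
        /* ... exchange ghost zones of width ghost_depth here ... */
        double comm = MPI_Wtime() - t0;

        double t1 = MPI_Wtime();
        /* ... local computation for this step ... */
        double comp = MPI_Wtime() - t1;

        /* Agree on a global view of the communication/computation ratio. */
        double ratio = comm / (comp + 1e-9), global_ratio;
        MPI_Allreduce(&ratio, &global_ratio, 1, MPI_DOUBLE, MPI_MAX,
                      MPI_COMM_WORLD);

        /* If communication dominates, trade extra redundant computation
         * (deeper ghost zones, fewer exchanges) for fewer messages. */
        if (global_ratio > 0.5 && ghost_depth < 8)
            ghost_depth++;
    }

    MPI_Finalize();
    return 0;
}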


Parallel Computing | 1995

Design and performance of a scalable parallel community climate model

John B. Drake; Ian T. Foster; John Michalakes; Brian R. Toonen; Patrick H. Worley

We describe the design of a parallel global atmospheric circulation model, PCCM2. This parallel model is functionally equivalent to the National Center for Atmospheric Research's Community Climate Model, CCM2, but is structured to exploit distributed-memory multicomputers. PCCM2 incorporates parallel spectral transform, semi-Lagrangian transport, and load balancing algorithms. We present detailed performance results on the IBM SP2 and Intel Paragon. These results provide insights into the scalability of the individual parallel algorithms and of the parallel model as a whole.
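
Parallel spectral transform algorithms of the kind used in PCCM2 depend on redistributing data between decompositions, typically with an all-to-all transpose. The fragment below is a generic illustration of that transpose step using MPI_Alltoall; it is not the PCCM2 code, and the block size and buffer contents are placeholder assumptions.

/* Generic illustration of the transpose step used by parallel spectral
 * transform algorithms; not the PCCM2 implementation. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int nprocs;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int block = 1024;  /* placeholder: values each rank exchanges with every other rank */
    double *sendbuf = malloc((size_t)nprocs * block * sizeof *sendbuf);
    double *recvbuf = malloc((size_t)nprocs * block * sizeof *recvbuf);

    /* Fill sendbuf with this rank's slice of the data (placeholder values). */
    for (int i = 0; i < nprocs * block; i++)
        sendbuf[i] = 0.0;

    /* Redistribute so each rank holds the complete columns it needs for its
     * part of the transform. */
    MPI_Alltoall(sendbuf, block, MPI_DOUBLE,
                 recvbuf, block, MPI_DOUBLE, MPI_COMM_WORLD);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}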


International Parallel and Distributed Processing Symposium | 2004

Design and implementation of MPICH2 over InfiniBand with RDMA support

Jiuxing Liu; Weihang Jiang; Pete Wyckoff; Dhabaleswar K. Panda; David Ashton; Darius Buntinas; William Gropp; Brian R. Toonen

For several years, MPI has been the de facto standard for writing parallel applications. One of the most popular MPI implementations is MPICH. Its successor, MPICH2, features a completely new design that provides more performance and flexibility. To ensure portability, it has a hierarchical structure based on which porting can be done at different levels. In this paper, we present our experiences in designing and implementing MPICH2 over InfiniBand. Because of its high performance and open standard, InfiniBand is gaining popularity in the area of high-performance computing. Our study focuses on optimizing the performance of MPI-1 functions in MPICH2. One of our objectives is to exploit remote direct memory access (RDMA) in InfiniBand to achieve high performance. We have based our design on the RDMA channel interface provided by MPICH2, which encapsulates architecture-dependent communication functionalities into a very small set of functions. Starting with a basic design, we apply different optimizations and also propose a zero-copy-based design. We characterize the impact of our optimizations and designs using microbenchmarks. We have also performed an application-level evaluation using the NAS parallel benchmarks. Our optimized MPICH2 implementation achieves 7.6 μs latency and 857 MB/s bandwidth, which are close to the raw performance of the underlying InfiniBand layer. Our study shows that the RDMA channel interface in MPICH2 provides a simple, yet powerful, abstraction that enables implementations with high performance by exploiting RDMA operations in InfiniBand. To the best of our knowledge, this is the first high-performance design and implementation of MPICH2 on InfiniBand using RDMA support.
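
The latency and bandwidth figures above come from microbenchmarks. A minimal ping-pong test of the kind commonly used for such measurements is sketched below; it is generic benchmarking code, not the authors' test suite, and the message size and iteration count are arbitrary assumptions.

/* Minimal ping-pong microbenchmark of the kind used to measure MPI latency
 * and bandwidth; generic code, not the authors' benchmark suite.
 * Run with at least two processes. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 1000;
    const int size = 4;              /* bytes; vary to trace a bandwidth curve */
    char *buf = calloc(size, 1);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("one-way latency: %.2f us\n", (t1 - t0) / (2.0 * iters) * 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}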


IBM Journal of Research and Development | 2005

Design and implementation of message-passing services for the Blue Gene/L supercomputer

George S. Almasi; Charles J. Archer; José G. Castaños; John A. Gunnels; C. Christopher Erway; Philip Heidelberger; Xavier Martorell; José E. Moreira; Kurt Walter Pinnow; Joe Ratterman; Burkhard Steinmacher-Burow; William Gropp; Brian R. Toonen

The Blue Gene/L (BG/L) supercomputer, with 65,536 dual-processor compute nodes, was designed from the ground up to support efficient execution of massively parallel message-passing programs. Part of this support is an optimized implementation of the Message Passing Interface (MPI), which leverages the hardware features of BG/L. MPI for BG/L is implemented on top of a more basic message-passing infrastructure called the message layer. The message layer can be used to implement other higher-level libraries and can also be used directly by applications. MPI and the message layer are used in the two BG/L modes of operation: coprocessor mode and virtual node mode. Performance measurements show that our message-passing services deliver performance close to the hardware limits of the machine. They also show that dedicating one of the processors of a node to communication functions (coprocessor mode) greatly improves the message-passing bandwidth, whereas running two processes per compute node (virtual node mode) can have a positive impact on application performance.


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 1997

Relaxed consistency and coherence granularity in DSM systems: a performance evaluation

Yuanyuan Zhou; Liviu Iftode; Jaswinder Pal Singh; Kai Li; Brian R. Toonen; Ioannis Schoinas; Mark D. Hill; David A. Wood

During the past few years, two main approaches have been taken to improve the performance of software shared memory implementations: relaxing consistency models and providing fine-grained access control. Their performance tradeoffs, however, were not well understood. This paper studies these tradeoffs on a platform that provides access control in hardware but runs coherence protocols in software. We compare the performance of three protocols across four coherence granularities, using 12 applications on a 16-node cluster of workstations. Our results show that no single combination of protocol and granularity performs best for all the applications. The combination of a sequentially consistent (SC) protocol and fine granularity works well with 7 of the 12 applications. The combination of a multiple-writer, home-based lazy release consistency (HLRC) protocol and page granularity works well with 8 out of the 12 applications. For applications that suffer performance losses in moving to coarser granularity under sequential consistency, the performance can usually be regained quite effectively using relaxed protocols, particularly HLRC. We also find that the HLRC protocol performs substantially better than a single-writer lazy release consistent (SW-LRC) protocol at coarse granularity for many irregular applications. For our applications and platform, when we use the original versions of the applications ported directly from hardware-coherent shared memory, we find that the SC protocol with 256-byte granularity performs best on average. However, when the best versions of the applications are compared, the balance shifts in favor of HLRC at page granularity.
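
Page-granularity software shared memory systems commonly detect writes by write-protecting pages and catching the resulting fault (the platform studied here instead provides access control in hardware). The snippet below is a generic illustration of that page-based write-trapping building block using POSIX mprotect and a SIGSEGV handler; it is not the authors' system, and the region size and bookkeeping are placeholder assumptions.

/* Generic illustration of page-granularity write detection, a building block
 * of page-based software shared memory; not the system evaluated in the paper. */
#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static long page_size;
static char *shared_region;

static void on_fault(int sig, siginfo_t *info, void *ctx)
{
    (void)sig; (void)ctx;
    /* A real protocol would first check that si_addr lies in the shared
     * region, then record the page as dirty (e.g., make a twin or diff).
     * Here we simply re-enable writes so the faulting store can proceed. */
    uintptr_t page = (uintptr_t)info->si_addr & ~(uintptr_t)(page_size - 1);
    mprotect((void *)page, page_size, PROT_READ | PROT_WRITE);
}

int main(void)
{
    page_size = sysconf(_SC_PAGESIZE);

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_fault;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    /* Map a small read-only region standing in for shared pages. */
    shared_region = mmap(NULL, 4 * page_size, PROT_READ,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    shared_region[0] = 1;   /* first write faults, is trapped, then succeeds */
    printf("write trapped and completed\n");
    return 0;
}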


Lecture Notes in Computer Science | 2003

MPI on BlueGene/L: Designing an efficient general purpose messaging solution for a large cellular system

George S. Almasi; Charles J. Archer; José G. Castaños; Manish Gupta; Xavier Martorell; José E. Moreira; William Gropp; Silvius Rus; Brian R. Toonen

The BlueGene/L computer uses system-on-a-chip integration and a highly scalable 65,536-node cellular architecture to deliver 360 Tflops of peak computing power. Efficient operation of the machine requires a fast, scalable, and standards-compliant MPI library. In this paper, we discuss our efforts to port the MPICH2 library to BlueGene/L.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2005

Optimizing the Synchronization Operations in Message Passing Interface One-Sided Communication

Rajeev Thakur; William Gropp; Brian R. Toonen

One-sided communication in Message Passing Interface (MPI) requires the use of one of three different synchronization mechanisms, which indicate when the one-sided operation can be started and when the operation is completed. Efficient implementation of the synchronization mechanisms is critical to achieving good performance with one-sided communication. However, our performance measurements indicate that in many MPI implementations, the synchronization functions add significant overhead, resulting in one-sided communication performing much worse than point-to-point communication for short- and medium-sized messages. In this paper, we describe our efforts to minimize the overhead of synchronization in our implementation of one-sided communication in MPICH2. We describe our optimizations for all three synchronization mechanisms defined in MPI: fence, post-start-complete-wait, and lock-unlock. Our performance results demonstrate that, for short messages, MPICH2 performs six times faster than LAM for fence synchronization and 50% faster for post-start-complete-wait synchronization, and it performs more than twice as fast as Sun MPI for all three synchronization methods.
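
For reference, fence synchronization brackets a set of one-sided operations between collective MPI_Win_fence calls, as in the minimal example below. It illustrates only the standard MPI interface whose overhead the paper measures, not the MPICH2 internals, and it assumes at least two processes.

/* Minimal illustration of MPI one-sided communication with fence
 * synchronization; shows the interface, not the MPICH2 internals.
 * Run with at least two processes. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int local = rank, target_buf = -1;
    MPI_Win win;
    MPI_Win_create(&target_buf, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    /* The fence pair opens and closes the access/exposure epoch; its cost is
     * exactly the synchronization overhead the paper targets. */
    MPI_Win_fence(0, win);
    if (rank == 0)
        MPI_Put(&local, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    MPI_Win_fence(0, win);

    if (rank == 1)
        printf("rank 1 received %d via MPI_Put\n", target_buf);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}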


Conference on High Performance Computing (Supercomputing) | 2000

MPICH-GQ: Quality-of-Service for Message Passing Programs

Alain Roy; Ian T. Foster; William Gropp; Brian R. Toonen; Nicholas T. Karonis; Volker Sander

Parallel programmers typically assume that all resources required for a program's execution are dedicated to that purpose. However, in local and wide area networks, contention for shared networks, CPUs, and I/O systems can result in significant variations in availability, with consequent adverse effects on overall performance. We describe a new message-passing architecture, MPICH-GQ, that uses quality of service (QoS) mechanisms to manage contention and hence improve performance of message passing interface (MPI) applications. MPICH-GQ combines new QoS specification, traffic shaping, QoS reservation, and QoS implementation techniques to deliver QoS capabilities to the high-bandwidth bursty flows, complex structures, and reliable protocols used in high-performance applications, characteristics very different from the low-bandwidth, constant bit-rate media flows and unreliable protocols for which QoS mechanisms were designed. Results obtained on a differentiated services testbed demonstrate our ability to maintain application performance in the face of heavy network contention.
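
QoS reservation in a differentiated services network ultimately depends on marking each flow's packets with a DSCP so that routers apply the configured forwarding treatment. The fragment below only illustrates that marking step on an ordinary socket with the standard IP_TOS option; it is not the MPICH-GQ code, and the Expedited Forwarding code point is an arbitrary example.

/* Illustration of DSCP marking for a differentiated-services flow; not the
 * MPICH-GQ implementation. The DSCP value is an arbitrary example. */
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        perror("socket");
        return 1;
    }

    /* The DSCP occupies the upper six bits of the (former) TOS byte.
     * 46 is the Expedited Forwarding code point, used here as an example. */
    int tos = 46 << 2;
    if (setsockopt(fd, IPPROTO_IP, IP_TOS, &tos, sizeof tos) < 0)
        perror("setsockopt(IP_TOS)");
    else
        printf("outgoing packets on this socket will carry DSCP 46\n");

    return 0;
}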


European Conference on Parallel Processing | 2004

Implementing MPI on the BlueGene/L Supercomputer

George S. Almasi; Charles J. Archer; José G. Castaños; C. Christopher Erway; Philip Heidelberger; Xavier Martorell; José E. Moreira; Kurt Walter Pinnow; Joe Ratterman; Nils Smeds; Burkhard Steinmacher-Burow; William Gropp; Brian R. Toonen

The BlueGene/L supercomputer will consist of 65,536 dual-processor compute nodes interconnected by two high-speed networks: a three-dimensional torus network and a tree topology network. Each compute node can only address its own local memory, making message passing the natural programming model for BlueGene/L. In this paper we present our implementation of MPI for BlueGene/L. In particular, we discuss how we leveraged the architectural features of BlueGene/L to arrive at an efficient implementation of MPI in this machine. We validate our approach by comparing MPI performance against the hardware limits and also the relative performance of the different modes of operation of BlueGene/L. We show that dedicating one of the processors of a node to communication functions greatly improves the bandwidth achieved by MPI operation, whereas running two MPI tasks per compute node can have a positive impact on application performance.
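
Because the compute nodes are connected by a three-dimensional torus, applications often arrange their MPI ranks in a matching Cartesian topology. The example below shows the standard MPI calls for that arrangement; it is a generic illustration rather than the BlueGene/L MPI internals, and letting MPI_Dims_create pick the grid shape is an assumption made for portability.

/* Generic illustration of mapping MPI ranks onto a 3D torus with the standard
 * Cartesian topology interface; not the BlueGene/L MPI internals. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int nprocs;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int dims[3] = {0, 0, 0};        /* let MPI choose a 3D factorization */
    MPI_Dims_create(nprocs, 3, dims);

    int periods[3] = {1, 1, 1};     /* periodic in all three dimensions: a torus */
    MPI_Comm torus;
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1 /* allow reorder */,
                    &torus);

    if (torus != MPI_COMM_NULL) {
        int rank, coords[3];
        MPI_Comm_rank(torus, &rank);
        MPI_Cart_coords(torus, rank, 3, coords);
        printf("rank %d sits at torus coordinates (%d,%d,%d)\n",
               rank, coords[0], coords[1], coords[2]);
        MPI_Comm_free(&torus);
    }

    MPI_Finalize();
    return 0;
}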


Cluster Computing and the Grid | 2005

Implementing MPI-IO atomic mode without file system support

Robert B. Ross; Robert Latham; William Gropp; Rajeev Thakur; Brian R. Toonen

The ROMIO implementation of the MPI-IO standard provides a portable infrastructure for use on top of any number of different underlying storage targets. These different targets vary widely in their capabilities, and in some cases, additional effort is needed within ROMIO to support the complete MPI-IO semantics. One aspect of the interface that can be problematic to implement is the MPI-IO atomic mode. This mode requires enforcing strict consistency semantics. For some file systems, native locks may be used to enforce these semantics, but not all file systems have lock support. In this work, we describe two algorithms for implementing efficient mutex locks using MPI-1 and MPI-2 capabilities. We then show how these algorithms may be used to implement a portable MPI-IO atomic mode for ROMIO. We evaluate the performance of these algorithms and show that they impose little additional overhead on the system. Because of the low-overhead nature of these algorithms, they are likely useful in a variety of situations where distributed locks are needed in the MPI-2 environment.
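
Atomic mode itself is requested through the standard MPI-IO interface, as the short example below shows. It demonstrates only the user-visible MPI_File_set_atomicity call, not ROMIO's internal locking algorithms; the file name and write pattern are placeholders.

/* Requesting MPI-IO atomic mode through the standard interface; the file name
 * is a placeholder, and ROMIO's internal locking is not shown. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                  MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);

    /* Enable atomic mode: conflicting, overlapping writes from different
     * processes must now behave as if each write were indivisible. */
    MPI_File_set_atomicity(fh, 1);

    /* All ranks write to the same offset; atomic mode makes the outcome of
     * such overlapping writes well defined. */
    int value = rank;
    MPI_File_write_at(fh, 0, &value, 1, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}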

Collaboration


Dive into Brian R. Toonen's collaborations.

Top Co-Authors

Ian T. Foster, Argonne National Laboratory
Patrick H. Worley, Oak Ridge National Laboratory
Rajeev Thakur, Argonne National Laboratory
Xavier Martorell, Polytechnic University of Catalonia
Nicholas T. Karonis, Northern Illinois University
Joseph A. Insley, Argonne National Laboratory