William Gropp
University of Illinois at Urbana–Champaign
Publications
Featured research published by William Gropp.
Parallel Computing | 1996
William Gropp; Ewing L. Lusk; Nathan E. Doss; Anthony Skjellum
MPI (Message Passing Interface) is a specification for a standard library for message passing that was defined by the MPI Forum, a broadly based group of parallel computer vendors, library writers, and applications specialists. Multiple implementations of MPI have been developed. In this paper, we describe MPICH, unique among existing implementations in its design goal of combining portability with high performance. We document its portability and performance and describe the architecture by which these features are simultaneously achieved. We also discuss the set of tools that accompany the free distribution of MPICH, which constitute the beginnings of a portable parallel programming environment. A project of this scope inevitably imparts lessons about parallel computing, the specification being followed, the current hardware and software environment for parallel computing, and project management; we describe those we have learned. Finally, we discuss future developments for MPICH, including those necessary to accommodate extensions to the MPI Standard now being contemplated by the MPI Forum.
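For readers unfamiliar with the interface that MPICH implements, a minimal MPI program in C is sketched below; it is a generic example of the standard API (initialization, point-to-point send/receive, finalization), not code taken from MPICH itself.

```c
/* Minimal MPI example: rank 1 sends an integer to rank 0.
   Generic sketch of the standard API that MPICH implements. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 1) {
        int payload = 42;
        MPI_Send(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    } else if (rank == 0 && size > 1) {
        int payload;
        MPI_Recv(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 0 received %d from rank 1 (of %d ranks)\n", payload, size);
    }

    MPI_Finalize();
    return 0;
}
```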
Modern Software Tools for Scientific Computing | 1997
Satish Balay; William Gropp; Lois Curfman McInnes; Barry F. Smith
Parallel numerical software based on the message passing model is enormously complicated. This paper introduces a set of techniques to manage the complexity, while maintaining high efficiency and ease of use. The PETSc 2.0 package uses object-oriented programming to conceal the details of the message passing, without concealing the parallelism, in a high-quality set of numerical software libraries. In fact, the programming model used by PETSc is also the most appropriate for NUMA shared-memory machines, since they require the same careful attention to memory hierarchies as do distributed-memory machines. Thus, the concepts discussed are appropriate for all scalable computing systems. The PETSc libraries provide many of the data structures and numerical kernels required for the scalable solution of PDEs, offering performance portability.
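The usage model described here, distributed objects and solvers with no explicit message passing in user code, can be sketched as follows. The call names come from the public PETSc API, but the example reflects the modern interface rather than PETSc 2.0 as discussed in the paper, and exact signatures (for example KSPSetOperators) vary across releases.

```c
/* Sketch of the PETSc usage model: the application creates distributed
   objects and calls a solver; PETSc handles the underlying message passing.
   Signatures vary across PETSc releases (e.g., KSPSetOperators). */
#include <petscksp.h>

int main(int argc, char **argv) {
    Mat A; Vec x, b; KSP ksp;
    PetscInt i, rstart, rend, n = 100;

    PetscInitialize(&argc, &argv, NULL, NULL);

    /* distributed tridiagonal matrix; each process owns a block of rows */
    MatCreate(PETSC_COMM_WORLD, &A);
    MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);
    MatSetFromOptions(A);
    MatSetUp(A);
    MatGetOwnershipRange(A, &rstart, &rend);
    for (i = rstart; i < rend; i++) {
        if (i > 0)     MatSetValue(A, i, i - 1, -1.0, INSERT_VALUES);
        if (i < n - 1) MatSetValue(A, i, i + 1, -1.0, INSERT_VALUES);
        MatSetValue(A, i, i, 2.0, INSERT_VALUES);
    }
    MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
    MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

    /* vectors with the same parallel layout as the matrix */
    MatCreateVecs(A, &x, &b);
    VecSet(b, 1.0);

    /* Krylov solver; method and preconditioner selectable at run time */
    KSPCreate(PETSC_COMM_WORLD, &ksp);
    KSPSetOperators(ksp, A, A);   /* older releases take a fourth argument */
    KSPSetFromOptions(ksp);
    KSPSolve(ksp, b, x);

    KSPDestroy(&ksp); VecDestroy(&x); VecDestroy(&b); MatDestroy(&A);
    PetscFinalize();
    return 0;
}
```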
IEEE International Conference on High Performance Computing, Data and Analytics | 2005
Rajeev Thakur; Rolf Rabenseifner; William Gropp
We describe our work on improving the performance of collective communication operations in MPICH for clusters connected by switched networks. For each collective operation, we use multiple algorithms depending on the message size, with the goal of minimizing latency for short messages and minimizing bandwidth use for long messages. Although we have implemented new algorithms for all MPI (Message Passing Interface) collective operations, because of limited space we describe only the algorithms for allgather, broadcast, all-to-all, reduce-scatter, reduce, and allreduce. Performance results on a Myrinet-connected Linux cluster and an IBM SP indicate that, in all cases, the new algorithms significantly outperform the old algorithms used in MPICH on the Myrinet cluster, and, in many cases, they outperform the algorithms used in IBM's MPI on the SP. We also explore in further detail the optimization of two of the most commonly used collective operations, allreduce and reduce, particularly for long messages and non-power-of-two numbers of processes. The optimized algorithms for these operations perform several times better than the native algorithms on a Myrinet cluster, IBM SP, and Cray T3E. Our results indicate that to achieve the best performance for a collective communication operation, one needs to use a number of different algorithms and select the right algorithm for a particular message size and number of processes.
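As a flavor of the algorithm families evaluated here, the sketch below implements recursive doubling for allreduce, a latency-oriented scheme suited to the short-message regime and to power-of-two process counts. It is a simplified illustration, not the MPICH implementation.

```c
/* Recursive-doubling allreduce (sum of doubles): in each of log2(p) steps a
   process exchanges its partial result with a partner at distance 'mask' and
   combines. Simplified sketch: assumes a power-of-two number of processes. */
#include <mpi.h>
#include <stdlib.h>

static void allreduce_sum_recursive_doubling(double *buf, int count, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    double *tmp = (double *)malloc((size_t)count * sizeof(double));

    for (int mask = 1; mask < size; mask <<= 1) {
        int partner = rank ^ mask;
        MPI_Sendrecv(buf, count, MPI_DOUBLE, partner, 0,
                     tmp, count, MPI_DOUBLE, partner, 0,
                     comm, MPI_STATUS_IGNORE);
        for (int i = 0; i < count; i++)
            buf[i] += tmp[i];   /* combine partial results */
    }
    free(tmp);
}
```

The point of the paper is that no single such scheme wins everywhere; the implementation switches among algorithms based on message size and number of processes.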
Workshop on I/O in Parallel and Distributed Systems | 1999
Rajeev Thakur; William Gropp; Ewing L. Lusk
We discuss the issues involved in implementing MPI-IO portably on multiple machines and file systems and also achieving high performance. One way to implement MPI-IO portably is to implement it on top of the basic Unix I/O functions (open, lseek, read, write, and close), which are themselves portable. We argue that this approach has limitations in both functionality and performance. We instead advocate an implementation approach that combines a large portion of portable code and a small portion of code that is optimized separately for different machines and file systems. We have used such an approach to develop a high-performance, portable MPI-IO implementation, called ROMIO. In addition to basic I/O functionality, we consider the issues of supporting other MPI-IO features, such as 64-bit file sizes, noncontiguous accesses, collective I/O, asynchronous I/O, consistency and atomicity semantics, user-supplied hints, shared file pointers, portable data representation, and file preallocation. We describe how we implemented each of these features on various machines and file systems. The machines we consider are the HP Exemplar, IBM SP, Intel Paragon, NEC SX-4, SGI Origin2000, and networks of workstations; and the file systems we consider are HP HFS, IBM PIOFS, Intel PFS, NEC SFS, SGI XFS, NFS, and any general Unix file system (UFS). We also present our thoughts on how a file system can be designed to better support MPI-IO. We provide a list of features desired from a file system that would help in implementing MPI-IO correctly and with high performance.
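For context, the interface ROMIO implements is used roughly as follows: a communicator-wide file open, an optional user-supplied hint passed through an MPI_Info object, and a collective write. This is generic MPI-IO usage, not ROMIO internals; the hint name shown is an assumption about one commonly recognized by ROMIO-based implementations.

```c
/* Minimal MPI-IO example: every process writes its own contiguous block of a
   shared file with a single collective call. Generic use of the interface
   ROMIO implements; the hint shown controls collective buffering. */
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, data[1024];
    MPI_File fh;
    MPI_Info info;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int i = 0; i < 1024; i++) data[i] = rank;

    MPI_Info_create(&info);
    MPI_Info_set(info, "cb_buffer_size", "4194304");  /* user-supplied hint */

    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
    MPI_Offset offset = (MPI_Offset)rank * 1024 * sizeof(int);
    MPI_File_write_at_all(fh, offset, data, 1024, MPI_INT, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}
```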
Symposium on Frontiers of Massively Parallel Computation | 1999
Rajeev Thakur; William Gropp; Ewing L. Lusk
The I/O access patterns of parallel programs often consist of accesses to a large number of small, noncontiguous pieces of data. If an application's I/O needs are met by making many small, distinct I/O requests, however, the I/O performance degrades drastically. To avoid this problem, MPI-IO allows users to access a noncontiguous data set with a single I/O function call. This feature provides MPI-IO implementations an opportunity to optimize data access. We describe how our MPI-IO implementation, ROMIO, delivers high performance in the presence of noncontiguous requests. We explain in detail the two key optimizations ROMIO performs: data sieving for noncontiguous requests from one process and collective I/O for noncontiguous requests from multiple processes. We describe how one can implement these optimizations portably on multiple machines and file systems, control their memory requirements, and also achieve high performance. We demonstrate the performance and portability with performance results for three applications, an astrophysics application template (DIST3D), the NAS BTIO benchmark, and an unstructured code (UNSTRUC), on five different parallel machines: HP Exemplar, IBM SP, Intel Paragon, NEC SX-4, and SGI Origin2000.
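The feature in question, describing a noncontiguous access pattern once and issuing a single collective call so the implementation can apply data sieving or collective I/O, looks roughly like this in user code (a generic MPI-IO sketch, not ROMIO's internals):

```c
/* Noncontiguous access expressed as a single collective call: each process
   sets a strided file view via a derived datatype and reads all its pieces
   at once, leaving the optimization (data sieving, collective I/O) to the
   MPI-IO implementation. */
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, nprocs;
    double local[1000];
    MPI_File fh;
    MPI_Datatype filetype;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* each process reads every nprocs-th block of 10 doubles, starting at a
       rank-dependent offset: a classic small, noncontiguous access pattern */
    MPI_Type_vector(100, 10, 10 * nprocs, MPI_DOUBLE, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File_open(MPI_COMM_WORLD, "in.dat", MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
    MPI_Offset disp = (MPI_Offset)rank * 10 * sizeof(double);
    MPI_File_set_view(fh, disp, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);
    MPI_File_read_all(fh, local, 1000, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
    MPI_Finalize();
    return 0;
}
```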
European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface | 1999
William Gropp; Ewing L. Lusk
In this paper we describe the difficulties inherent in making accurate, reproducible measurements of message-passing performance. We describe some of the mistakes often made in attempting such measurements and the consequences of such mistakes. We describe mpptest, a suite of performance measurement programs developed at Argonne National Laboratory, that attempts to avoid such mistakes and obtain reproducible measures of MPI performance that can be useful to both MPI implementors and MPI application writers. We include a number of illustrative examples of its use.
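One of the classic mistakes warned against here is timing a single message with a coarse timer. The sketch below illustrates the more careful measurement style: many round trips per trial and the minimum over several trials. It is an illustration of the approach, not the mpptest source.

```c
/* Careful ping-pong timing between ranks 0 and 1: time many round trips per
   trial (amortizing timer resolution), repeat several trials, and keep the
   minimum, which is less sensitive to transient system noise. */
#include <mpi.h>
#include <stdio.h>

#define NTRIALS 10
#define REPS    1000

int main(int argc, char **argv) {
    int rank; char buf[1024];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double best = 1e30;
    for (int t = 0; t < NTRIALS; t++) {
        MPI_Barrier(MPI_COMM_WORLD);          /* start each trial together */
        double t0 = MPI_Wtime();
        for (int r = 0; r < REPS; r++) {
            if (rank == 0) {
                MPI_Send(buf, 1024, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, 1024, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, 1024, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, 1024, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double per_msg = (MPI_Wtime() - t0) / (2.0 * REPS);
        if (per_msg < best) best = per_msg;
    }
    if (rank == 0) printf("one-way latency estimate: %g s\n", best);

    MPI_Finalize();
    return 0;
}
```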
IEEE International Conference on High Performance Computing, Data and Analytics | 1999
Omer Zaki; Ewing L. Lusk; William Gropp; Deborah Swider
Jumpshot is a graphical tool for understanding the performance of parallel programs. It is in the tradition of the upshot tool but contains a number of extensions and enhancements that make it suitable for large-scale parallel computations. Jumpshot takes as input a new, more flexible logfile format and comes with a library for generating such logfiles. An MPI profiling library is also included, enabling the automatic generation of such logfiles from MPI programs. Jumpshot is written in Java and can easily be integrated as an applet into browser-based computing environments. The most novel feature of Jumpshot is its automatic detection of anomalous durations, drawing the user’s attention to problem areas in a parallel execution. This capability is particularly useful in large-scale parallel computations containing many events.
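The bundled MPI profiling library relies on the standard PMPI name-shifted interface. A minimal wrapper of that kind is sketched below; it logs to stderr purely for illustration, whereas a real tracing library would append an event record to a logfile of the kind Jumpshot reads.

```c
/* Minimal use of the MPI profiling (PMPI) interface: the wrapper intercepts
   MPI_Send, timestamps it, and forwards to the name-shifted PMPI_Send.
   MPI-3 headers declare buf as const void *; drop const for older headers. */
#include <mpi.h>
#include <stdio.h>

int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm) {
    double t0 = MPI_Wtime();
    int err = PMPI_Send(buf, count, datatype, dest, tag, comm);
    double t1 = MPI_Wtime();
    /* a real tool would write a binary log record here instead of printing */
    fprintf(stderr, "MPI_Send(dest=%d, tag=%d) took %.6f s\n", dest, tag, t1 - t0);
    return err;
}
```

Linking this wrapper ahead of the MPI library instruments an application without changing its source, which is how such logfiles can be generated automatically.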
Symposium on Frontiers of Massively Parallel Computation | 1996
Rajeev Thakur; William Gropp; Ewing L. Lusk
We propose a strategy for implementing parallel I/O interfaces portably and efficiently. We have defined an abstract device interface for parallel I/O, called ADIO. Any parallel I/O API can be implemented on multiple file systems by implementing the API portably on top of ADIO, and implementing only ADIO on different file systems. This approach simplifies the task of implementing an API and yet exploits the specific high performance features of individual file systems. We have used ADIO to implement the Intel PFS interface and subsets of MPI-IO and IBM PIOFS interfaces on PFS, PIOFS, Unix, and NFS file systems. Our performance studies indicate that the overhead of using ADIO as an implementation strategy is very low.
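The layering ADIO provides can be pictured as a portable front end dispatching through a small per-file-system table of operations. The struct and names below are hypothetical, invented to illustrate the abstract-device idea; they are not the actual ADIO declarations.

```c
/* Hypothetical sketch of the abstract-device idea: a portable front end calls
   through a per-file-system table of function pointers, so only this small
   table has to be reimplemented for each file system. All names are invented. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

typedef struct adio_ops {
    int     (*open_fn) (const char *path, int flags, int mode);
    ssize_t (*read_fn) (int fd, void *buf, size_t len, off_t offset);
    ssize_t (*write_fn)(int fd, const void *buf, size_t len, off_t offset);
    int     (*close_fn)(int fd);
} adio_ops;

/* "device" implementation for a plain Unix file system (UFS) */
static int ufs_open(const char *p, int flags, int mode) { return open(p, flags, mode); }
static ssize_t ufs_read(int fd, void *b, size_t n, off_t o) { return pread(fd, b, n, o); }
static ssize_t ufs_write(int fd, const void *b, size_t n, off_t o) { return pwrite(fd, b, n, o); }
static int ufs_close(int fd) { return close(fd); }

static const adio_ops ufs_ops = { ufs_open, ufs_read, ufs_write, ufs_close };

/* the portable layer (and any API implemented on top, such as MPI-IO) sees
   only the table, never the file-system-specific calls */
int main(void) {
    const adio_ops *ops = &ufs_ops;   /* selected per file system at open time */
    int fd = ops->open_fn("demo.dat", O_CREAT | O_RDWR, 0644);
    ops->write_fn(fd, "hello", 5, 0);
    char buf[6] = {0};
    ops->read_fn(fd, buf, 5, 0);
    printf("%s\n", buf);
    return ops->close_fn(fd);
}
```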
Parallel Computing | 2001
William Gropp; Dinesh K. Kaushik; David E. Keyes; Barry F. Smith
Fluid dynamical simulations based on finite discretizations on (quasi-)static grids scale well in parallel, but execute at a disappointing percentage of per-processor peak floating point operation rates without special attention to layout and access ordering of data. We document both claims from our experience with an unstructured grid CFD code that is typical of the state of the practice at NASA. These basic performance characteristics of PDE-based codes can be understood with surprisingly simple models, for which we quote earlier work, presenting primarily experimental results. The performance models and experimental results motivate algorithmic and software practices that lead to improvements in both parallel scalability and per node performance. This snapshot of ongoing work updates our 1999 Bell Prize-winning simulation on ASCI computers.
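One concrete instance of the "layout and access ordering" issue is the choice between interlacing the unknowns of each grid point and storing each field in its own array. The sketch below is a generic illustration of that trade-off, not code from the application studied in the paper.

```c
/* Illustration of data-layout effects in PDE codes: interlacing the unknowns
   of each grid point (array of structures) keeps everything a point-wise
   kernel needs within one cache line, while separate field arrays (structure
   of arrays) stream better for single-field sweeps. Generic example only. */
#include <stddef.h>

#define NPOINTS 1000000

/* interlaced: all 5 unknowns of a grid point are adjacent in memory */
typedef struct { double rho, u, v, w, e; } state_t;
static state_t q_aos[NPOINTS];

/* field-by-field: each unknown stored in its own contiguous array */
static double rho[NPOINTS], u[NPOINTS], v[NPOINTS], w[NPOINTS], e[NPOINTS];

/* point-wise kernel touching every unknown: favors the interlaced layout */
static void scale_state_aos(double s) {
    for (size_t i = 0; i < NPOINTS; i++) {
        q_aos[i].rho *= s; q_aos[i].u *= s; q_aos[i].v *= s;
        q_aos[i].w   *= s; q_aos[i].e *= s;
    }
}

/* single-field sweep: favors the structure-of-arrays layout */
static void scale_density_soa(double s) {
    for (size_t i = 0; i < NPOINTS; i++)
        rho[i] *= s;
}

int main(void) {
    scale_state_aos(1.01);
    scale_density_soa(1.01);
    (void)u; (void)v; (void)w; (void)e;   /* silence unused-variable warnings */
    return 0;
}
```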
IEEE International Conference on High Performance Computing, Data and Analytics | 2013
David E. Keyes; Lois Curfman McInnes; Carol S. Woodward; William Gropp; Eric Myra; Michael Pernice; John B. Bell; Jed Brown; Alain Clo; Jeffrey M. Connors; Emil M. Constantinescu; Donald Estep; Kate Evans; Charbel Farhat; Ammar Hakim; Glenn E. Hammond; Glen A. Hansen; Judith C. Hill; Tobin Isaac; Kirk E. Jordan; Dinesh K. Kaushik; Efthimios Kaxiras; Alice Koniges; Kihwan Lee; Aaron Lott; Qiming Lu; John Harold Magerlein; Reed M. Maxwell; Michael McCourt; Miriam Mehl
We consider multiphysics applications from algorithmic and architectural perspectives, where “algorithmic” includes both mathematical analysis and computational complexity, and “architectural” includes both software and hardware environments. Many diverse multiphysics applications can be reduced, en route to their computational simulation, to a common algebraic coupling paradigm. Mathematical analysis of multiphysics coupling in this form is not always practical for realistic applications, but model problems representative of applications discussed herein can provide insight. A variety of software frameworks for multiphysics applications have been constructed and refined within disciplinary communities and executed on leading-edge computer systems. We examine several of these, expose some commonalities among them, and attempt to extrapolate best practices to future systems. From our study, we summarize challenges and forecast opportunities.
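The common algebraic coupling paradigm can be written, for two physics components, roughly as the system below, together with the two solution strategies usually contrasted (loose operator splitting versus a fully coupled Newton step). This is a generic representation of the kind of model problem discussed, not a formula quoted from the paper.

```latex
% Two coupled physics components u_1, u_2 reduced to one algebraic system:
F(u) = \begin{pmatrix} F_1(u_1, u_2) \\ F_2(u_1, u_2) \end{pmatrix} = 0 .

% Loosely coupled (Gauss-Seidel-style) splitting: solve each block in turn,
% freezing the other component at its most recent value:
F_1(u_1^{k+1}, u_2^{k}) = 0, \qquad F_2(u_1^{k+1}, u_2^{k+1}) = 0 .

% Tightly coupled alternative: one Newton step on the full system,
J(u^k)\, \delta u = -F(u^k), \qquad
J = \begin{pmatrix}
  \partial F_1/\partial u_1 & \partial F_1/\partial u_2 \\
  \partial F_2/\partial u_1 & \partial F_2/\partial u_2
\end{pmatrix}, \qquad u^{k+1} = u^k + \delta u .
```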