Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where David Goodell is active.

Publication


Featured research published by David Goodell.


European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface | 2009

MPI on a Million Processors

Pavan Balaji; Darius Buntinas; David Goodell; William Gropp; Sameer Kumar; Ewing L. Lusk; Rajeev Thakur; Jesper Larsson Träff

Petascale machines with close to a million processors will soon be available. Although MPI is the dominant programming model today, some researchers and users wonder (and perhaps even doubt) whether MPI will scale to such large processor counts. In this paper, we examine the issue of how scalable MPI is. We first examine the MPI specification itself and discuss areas with scalability concerns and how they can be overcome. We then investigate issues that an MPI implementation must address to be scalable. We ran some experiments to measure MPI memory consumption at scale on up to 131,072 processes, or 80% of the IBM Blue Gene/P system at Argonne National Laboratory. Based on the results, we tuned the MPI implementation to reduce its memory footprint. We also discuss issues in application algorithmic scalability to large process counts and features of MPI that enable the use of other techniques to overcome scalability limitations in applications.
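
One scalability concern in the MPI specification that the abstract alludes to is the O(P) argument lists of irregular collectives. The C sketch below is illustrative only (not from the paper): it shows how MPI_Alltoallv requires four integer arrays of length P on every process, so the metadata alone grows with the total process count.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* MPI_Alltoallv takes four arrays of length P on *every* process,
       so the argument lists alone grow as O(P) per process: at a
       million ranks that is roughly 16 MB of integers per process
       before a single byte of payload is exchanged. */
    int *scounts = malloc(nprocs * sizeof(int));
    int *sdispls = malloc(nprocs * sizeof(int));
    int *rcounts = malloc(nprocs * sizeof(int));
    int *rdispls = malloc(nprocs * sizeof(int));
    int *sbuf    = malloc(nprocs * sizeof(int));
    int *rbuf    = malloc(nprocs * sizeof(int));
    for (int i = 0; i < nprocs; i++) {
        scounts[i] = rcounts[i] = 1;
        sdispls[i] = rdispls[i] = i;
        sbuf[i] = rank;
    }

    MPI_Alltoallv(sbuf, scounts, sdispls, MPI_INT,
                  rbuf, rcounts, rdispls, MPI_INT, MPI_COMM_WORLD);

    if (rank == 0)
        printf("each of %d processes carried 4 x %d ints of metadata\n",
               nprocs, nprocs);

    free(scounts); free(sdispls); free(rcounts); free(rdispls);
    free(sbuf); free(rbuf);
    MPI_Finalize();
    return 0;
}
```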


International Conference on Parallel Processing | 2009

Cache-Efficient, Intranode, Large-Message MPI Communication with MPICH2-Nemesis

Darius Buntinas; Brice Goglin; David Goodell; Guillaume Mercier; Stéphanie Moreaud

The emergence of multicore processors raises the need to efficiently transfer large amounts of data between local processes. MPICH2 is a highly portable MPI implementation whose large-message communication schemes suffer from high CPU utilization and cache pollution because of the use of a double-buffering strategy, common to many MPI implementations. We introduce two strategies offering a kernel-assisted, single-copy model with support for noncontiguous and asynchronous transfers. The first one uses the now widely available vmsplice Linux system call; the second one further improves performance thanks to a custom kernel module called KNEM. The latter also offers I/OAT copy offload, which is dynamically enabled depending on both hardware cache characteristics and message size. These new solutions outperform the standard transfer method in the MPICH2 implementation when no cache is shared between the processing cores or when very large messages are being transferred. Collective communication operations show a dramatic improvement, and the IS NAS parallel benchmark shows a 25% speedup and better cache efficiency.
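
The first strategy mentioned above relies on the Linux vmsplice system call. The following is a minimal, Linux-only sketch of the kernel-assisted single-copy idea between two local processes; the pipe-based setup and buffer sizes are illustrative and differ from MPICH2-Nemesis's actual code path.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int p[2];
    if (pipe(p) != 0) { perror("pipe"); return 1; }

    size_t len = 1 << 16;               /* a 64 KiB "large message" */
    char *buf = malloc(len);
    memset(buf, 'x', len);

    pid_t pid = fork();
    if (pid == 0) {                     /* child: the "receiver" */
        close(p[1]);
        char *rbuf = malloc(len);
        size_t got = 0;
        while (got < len) {
            ssize_t n = read(p[0], rbuf + got, len - got);
            if (n <= 0) break;
            got += (size_t)n;
        }
        printf("receiver got %zu bytes\n", got);
        free(rbuf);
        return 0;
    }

    /* Parent: the "sender" attaches its pages to the pipe with vmsplice
       instead of copying through an intermediate shared buffer; the only
       copy happens when the receiver reads.  The sender must not modify
       the buffer until the data has been consumed. */
    close(p[0]);
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    while (iov.iov_len > 0) {
        ssize_t n = vmsplice(p[1], &iov, 1, 0);
        if (n < 0) { perror("vmsplice"); break; }
        iov.iov_base = (char *)iov.iov_base + n;
        iov.iov_len -= (size_t)n;
    }
    close(p[1]);
    waitpid(pid, NULL, 0);
    free(buf);
    return 0;
}
```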


IEEE International Conference on High Performance Computing, Data, and Analytics | 2009

A configurable algorithm for parallel image-compositing applications

Tom Peterka; David Goodell; Robert B. Ross; Han-Wei Shen; Rajeev Thakur

Collective communication operations can dominate the cost of large-scale parallel algorithms. Image compositing in parallel scientific visualization is a reduction operation where this is the case. We present a new algorithm called Radix-k that in many cases performs better than existing compositing algorithms. It does so through a set of configurable parameters, the radices, that determine the number of communication partners in each message round. The algorithm embodies and unifies binary swap and direct-send, two of the best-known compositing methods, and enables numerous other configurations through appropriate choices of radices. While the algorithm is not tied to a particular computing architecture or network topology, the selection of radices allows Radix-k to take advantage of new supercomputer interconnect features such as multiporting. We show scalability across image size and system size, including both powers of two and nonpowers-of-two process counts.
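
The key idea is that a factorization of the process count into radices k1, k2, ... fixes how many partners each process has per round. The helper below is a hypothetical illustration (not the paper's code) that lists a rank's partners in each round; radices of all 2s reproduce binary swap, while a single radix equal to the process count reproduces direct-send.

```c
#include <stdio.h>

/* Hypothetical helper (not the paper's code): list a rank's communication
   partners in each round for a given factorization of the process count. */
static void print_rounds(int rank, const int *radices, int nrounds)
{
    int stride = 1;
    for (int r = 0; r < nrounds; r++) {
        int k = radices[r];
        /* lowest-ranked member of this rank's group in round r */
        int base = rank - ((rank / stride) % k) * stride;
        printf("rank %d, round %d partners:", rank, r);
        for (int j = 0; j < k; j++) {
            int partner = base + j * stride;
            if (partner != rank)
                printf(" %d", partner);
        }
        printf("\n");
        stride *= k;
    }
}

int main(void)
{
    int radices[] = { 4, 2 };           /* 8 processes factored as 4 x 2 */
    for (int rank = 0; rank < 8; rank++)
        print_rounds(rank, radices, 2);
    return 0;
}
```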


Parallel Processing Letters | 2011

MPI on Millions of Cores

Pavan Balaji; Darius Buntinas; David Goodell; William Gropp; Torsten Hoefler; Sameer Kumar; Ewing L. Lusk; Rajeev Thakur; Jesper Larsson Träff

Petascale parallel computers with more than a million processing cores are expected to be available in a couple of years. Although MPI is the dominant programming interface today for large-scale systems that at the highest end already have close to 300,000 processors, a challenging question to both researchers and users is whether MPI will scale to processor and core counts in the millions. In this paper, we examine the issue of scalability of MPI to very large systems. We first examine the MPI specification itself and discuss areas with scalability concerns and how they can be overcome. We then investigate issues that an MPI implementation must address in order to be scalable. To illustrate the issues, we ran a number of simple experiments to measure MPI memory consumption at scale, up to 131,072 processes or 80% of the IBM Blue Gene/P system at Argonne National Laboratory. Based on the results, we identified nonscalable aspects of the MPI implementation and found ways to tune it to reduce its memory footprint. We also briefly discuss issues in application scalability to large process counts and features of MPI that enable the use of other techniques to alleviate scalability limitations in applications.
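
A minimal way to reproduce this kind of memory measurement, sketched below under the assumption of a Linux /proc filesystem (this is not the paper's instrumentation), is to read each process's resident-set size right after MPI_Init and reduce it across ranks.

```c
#include <mpi.h>
#include <stdio.h>

/* Read resident-set size in kilobytes from /proc/self/status (Linux only). */
static long rss_kb(void)
{
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];
    long kb = -1;
    if (!f) return -1;
    while (fgets(line, sizeof line, f))
        if (sscanf(line, "VmRSS: %ld kB", &kb) == 1)
            break;
    fclose(f);
    return kb;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    long mine = rss_kb(), maxkb = 0, sumkb = 0;
    MPI_Reduce(&mine, &maxkb, 1, MPI_LONG, MPI_MAX, 0, MPI_COMM_WORLD);
    MPI_Reduce(&mine, &sumkb, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("post-MPI_Init RSS over %d procs: max %ld kB, avg %ld kB\n",
               nprocs, maxkb, sumkb / nprocs);

    MPI_Finalize();
    return 0;
}
```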


EuroMPI'10: Proceedings of the 17th European MPI Users' Group Meeting on Recent Advances in the Message Passing Interface | 2010

PMI: a scalable parallel process-management interface for extreme-scale systems

Pavan Balaji; Darius Buntinas; David Goodell; William Gropp; Jayesh Krishna; Ewing L. Lusk; Rajeev Thakur

Parallel programming models on large-scale systems require a scalable system for managing the processes that make up the execution of a parallel program. The process-management system must be able to launch millions of processes quickly when starting a parallel program and must provide mechanisms for the processes to exchange the information needed to enable them to communicate with each other. MPICH2 and its derivatives achieve this functionality through a carefully defined interface, called PMI, that allows different process managers to interact with the MPI library in a standardized way. In this paper, we describe the features and capabilities of PMI. We describe both PMI-1, the current generation of PMI used in MPICH2 and all its derivatives, and PMI-2, the second generation of PMI, which eliminates various shortcomings in PMI-1. Together with the interface itself, we also describe a reference implementation for both PMI-1 and PMI-2 in a new process-management framework within MPICH2, called Hydra, and compare their performance in running MPI jobs with thousands of processes.
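
The sketch below shows the typical PMI-1 "business card" exchange pattern the paper describes, assuming MPICH's pmi.h header and a PMI-enabled launcher such as Hydra; the key and value strings are illustrative placeholders, and error handling is elided.

```c
#include <pmi.h>
#include <stdio.h>

int main(void)
{
    int spawned, rank, size;
    char kvsname[256], key[64], value[64];

    PMI_Init(&spawned);
    PMI_Get_rank(&rank);
    PMI_Get_size(&size);
    PMI_KVS_Get_my_name(kvsname, sizeof kvsname);

    /* Publish this process's contact info (its "business card") ... */
    snprintf(key, sizeof key, "card-%d", rank);
    snprintf(value, sizeof value, "host:port-for-rank-%d", rank);
    PMI_KVS_Put(kvsname, key, value);
    PMI_KVS_Commit(kvsname);
    PMI_Barrier();                       /* all cards are now visible */

    /* ... and look up a peer's card to establish a connection. */
    int peer = (rank + 1) % size;
    snprintf(key, sizeof key, "card-%d", peer);
    PMI_KVS_Get(kvsname, key, value, sizeof value);
    printf("rank %d read peer %d card: %s\n", rank, peer, value);

    PMI_Finalize();
    return 0;
}
```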


IEEE International Conference on High Performance Computing, Data, and Analytics | 2010

Fine-Grained Multithreading Support for Hybrid Threaded MPI Programming

Pavan Balaji; Darius Buntinas; David Goodell; William Gropp; Rajeev Thakur

As high-end computing systems continue to grow in scale, recent advances in multi- and many-core architectures have pushed such growth toward more dense architectures, that is, more processing elements per physical node rather than more physical nodes. Although a large number of scientific applications have so far relied on an MPI-everywhere model for programming high-end parallel systems, this model may not be sufficient for future machines, given their physical constraints such as decreasing amounts of memory per processing element and shared caches. As a result, application and computer scientists are exploring alternative programming models that involve using MPI between address spaces and some other threaded model, such as OpenMP, Pthreads, or Intel TBB, within an address space. Such hybrid models require efficient support from an MPI implementation for MPI messages sent from multiple threads simultaneously. In this paper, we explore the issues involved in designing such an implementation. We present four approaches to building a fully thread-safe MPI implementation, with decreasing levels of critical-section granularity (from coarse-grain locks to fine-grain locks to lock-free operations) and correspondingly increasing levels of complexity. We present performance results that demonstrate the performance implications of the different approaches.
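
To make the granularity trade-off concrete, the toy code below (not MPICH2 source) contrasts a coarse-grained global lock around an entire send path with a fine-grained lock held only while a shared internal queue is updated.

```c
#include <pthread.h>
#include <stddef.h>

typedef struct msg { struct msg *next; } msg_t;

typedef struct {
    msg_t *head, *tail;
    pthread_mutex_t lock;          /* fine-grained: protects this queue only */
} queue_t;

static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;

/* Coarse-grained: one global lock serializes the entire send path,
   including work that touches no shared state. */
static void send_coarse(queue_t *q, msg_t *m)
{
    pthread_mutex_lock(&global_lock);
    /* ... header preparation, matching, progress (all serialized) ... */
    m->next = NULL;
    if (q->tail) q->tail->next = m; else q->head = m;
    q->tail = m;
    pthread_mutex_unlock(&global_lock);
}

/* Fine-grained: the lock is held only while the shared queue is updated,
   so threads working on different objects can proceed concurrently. */
static void send_fine(queue_t *q, msg_t *m)
{
    /* ... thread-private preparation runs with no lock held ... */
    m->next = NULL;
    pthread_mutex_lock(&q->lock);
    if (q->tail) q->tail->next = m; else q->head = m;
    q->tail = m;
    pthread_mutex_unlock(&q->lock);
    /* ... completion handling outside the critical section ... */
}

int main(void)
{
    queue_t q = { NULL, NULL, PTHREAD_MUTEX_INITIALIZER };
    msg_t a, b;
    send_coarse(&q, &a);
    send_fine(&q, &b);
    return 0;
}
```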


EuroMPI'10: Proceedings of the 17th European MPI Users' Group Meeting on Recent Advances in the Message Passing Interface | 2010

Enabling concurrent multithreaded MPI communication on multicore petascale systems

Gabor Dozsa; Sameer Kumar; Pavan Balaji; Darius Buntinas; David Goodell; William Gropp; Joe Ratterman; Rajeev Thakur

With the ever-increasing numbers of cores per node on HPC systems, applications are increasingly using threads to exploit the shared memory within a node, combined with MPI across nodes. Achieving high performance when a large number of concurrent threads make MPI calls is a challenging task for an MPI implementation. We describe the design and implementation of our solution in MPICH2 to achieve high-performance multithreaded communication on the IBM Blue Gene/P. We use a combination of a multichannel-enabled network interface, fine-grained locks, lock-free atomic operations, and specially designed queues to provide a high degree of concurrent access while still maintaining MPI's message-ordering semantics. We present performance results that demonstrate that our new design improves the multithreaded message rate by a factor of 3.6 compared with the existing implementation on the BG/P. Our solutions are also applicable to other high-end systems that have parallel network access capabilities.
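
The usage pattern this work targets looks roughly like the following sketch: request MPI_THREAD_MULTIPLE and let several Pthreads issue MPI calls concurrently. The thread count and tags are illustrative; run with at least two MPI processes.

```c
#include <mpi.h>
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

/* Each thread sends (rank 0) or receives (rank 1) one message on its own tag. */
static void *worker(void *arg)
{
    int tid = *(int *)arg, rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Send(&tid, 1, MPI_INT, 1, tid, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int v;
        MPI_Recv(&v, 1, MPI_INT, 0, tid, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("thread %d received %d\n", tid, v);
    }
    return NULL;
}

int main(int argc, char **argv)
{
    int provided, size;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (provided < MPI_THREAD_MULTIPLE || size < 2) {
        fprintf(stderr, "needs MPI_THREAD_MULTIPLE and at least 2 processes\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    pthread_t t[NTHREADS];
    int ids[NTHREADS];
    for (int i = 0; i < NTHREADS; i++) {
        ids[i] = i;
        pthread_create(&t[i], NULL, worker, &ids[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);

    MPI_Finalize();
    return 0;
}
```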


European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface | 2008

Toward Efficient Support for Multithreaded MPI Communication

Pavan Balaji; Darius Buntinas; David Goodell; William Gropp; Rajeev Thakur

To make the most effective use of parallel machines that are being built out of increasingly large multicore chips, researchers are exploring the use of programming models comprising a mixture of MPI and threads. Such hybrid models require efficient support from an MPI implementation for MPI messages sent from multiple threads simultaneously. In this paper, we explore the issues involved in designing such an implementation. We present four approaches to building a fully thread-safe MPI implementation, with decreasing levels of critical-section granularity (from coarse-grain locks to fine-grain locks to lock-free operations) and correspondingly increasing levels of complexity. We describe how we have structured our implementation to support all four approaches and enable one to be selected at build time. We present performance results with a message-rate benchmark to demonstrate the performance implications of the different approaches.
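
A message-rate microbenchmark in the spirit of the one mentioned here can be sketched as follows; the parameters and structure are assumptions rather than the paper's benchmark. Threads on rank 0 each stream small messages to matching threads on rank 1, and the aggregate rate is reported.

```c
#include <mpi.h>
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define NMSGS    10000

static int g_rank;

/* Each thread pumps NMSGS small messages on its own tag:
   rank 0 sends, rank 1 receives. */
static void *pump(void *arg)
{
    int tid = *(int *)arg, payload = 0;
    for (int i = 0; i < NMSGS; i++) {
        if (g_rank == 0)
            MPI_Send(&payload, 1, MPI_INT, 1, tid, MPI_COMM_WORLD);
        else
            MPI_Recv(&payload, 1, MPI_INT, 0, tid, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }
    return NULL;
}

int main(int argc, char **argv)
{
    int provided, size;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &g_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (provided < MPI_THREAD_MULTIPLE || size < 2) {
        if (g_rank == 0)
            fprintf(stderr, "needs MPI_THREAD_MULTIPLE and 2+ processes\n");
        MPI_Finalize();
        return 1;
    }

    double t0 = MPI_Wtime();
    if (g_rank < 2) {                   /* only ranks 0 and 1 take part */
        pthread_t t[NTHREADS];
        int ids[NTHREADS];
        for (int i = 0; i < NTHREADS; i++) {
            ids[i] = i;
            pthread_create(&t[i], NULL, pump, &ids[i]);
        }
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
    }
    double elapsed = MPI_Wtime() - t0;

    if (g_rank == 0)
        printf("aggregate rate: %.0f messages/s\n",
               (double)NTHREADS * NMSGS / elapsed);

    MPI_Finalize();
    return 0;
}
```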


European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface | 2009

Hierarchical Collectives in MPICH2

Hao Zhu; David Goodell; William Gropp; Rajeev Thakur

Most parallel systems on which MPI is used are now hierarchical, such as systems with SMP nodes. Many papers have shown algorithms that exploit shared memory to optimize collective operations to good effect. But how much of the performance benefit comes from tailoring the algorithm to the hierarchical topology of the system? We describe an implementation of many of the MPI collectives based entirely on message-passing primitives that exploits the two-level hierarchy. Our results show that exploiting shared memory directly usually gives only a small additional benefit, and they suggest design approaches for the cases where the benefit is large.
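
The two-level structure can be illustrated with modern MPI-3 communicator splitting, as in the sketch below; the paper itself predates MPI-3 and builds the hierarchy inside MPICH2 rather than with MPI_Comm_split_type.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Level 1: all processes sharing a node. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank,
                        MPI_INFO_NULL, &node_comm);
    int node_rank;
    MPI_Comm_rank(node_comm, &node_rank);

    /* Level 2: one leader (node_rank == 0) per node. */
    MPI_Comm leader_comm;
    MPI_Comm_split(MPI_COMM_WORLD, node_rank == 0 ? 0 : MPI_UNDEFINED,
                   rank, &leader_comm);

    int value = (rank == 0) ? 42 : -1;

    /* Two-level broadcast: across node leaders, then within each node. */
    if (node_rank == 0)
        MPI_Bcast(&value, 1, MPI_INT, 0, leader_comm);
    MPI_Bcast(&value, 1, MPI_INT, 0, node_comm);

    printf("rank %d got %d\n", rank, value);

    if (leader_comm != MPI_COMM_NULL) MPI_Comm_free(&leader_comm);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```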


Proceedings of the 20th European MPI Users' Group Meeting | 2013

Enabling MPI interoperability through flexible communication endpoints

James Dinan; Pavan Balaji; David Goodell; Douglas R. Miller; Marc Snir; Rajeev Thakur

The current MPI model defines a one-to-one relationship between MPI processes and MPI ranks. This model captures many use cases effectively, such as one MPI process per core and one MPI process per node. However, this semantic has limited interoperability between MPI and other programming models that use threads within a node. In this paper, we describe an extension to MPI that introduces communication endpoints as a means to relax the one-to-one relationship between processes and threads. Endpoints enable a greater degree of interoperability between MPI and other programming models, and we illustrate their potential for additional performance and computation-management benefits through the decoupling of ranks from processes.
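
Because the proposed endpoints interface is not part of the MPI standard, the sketch below declares a hypothetical MPIX_Comm_create_endpoints and backs it with a toy stand-in (plain MPI_Comm_dup) purely so the usage pattern compiles; real endpoints would give each handle its own rank in the resulting communicator.

```c
#include <mpi.h>
#include <stdio.h>

/* Hypothetical stand-in for the paper's proposed interface: create
   my_num_ep endpoint handles, each intended to be driven by a different
   thread.  The MPI_Comm_dup body is a toy placeholder, not the proposed
   semantics. */
static int MPIX_Comm_create_endpoints(MPI_Comm parent, int my_num_ep,
                                      MPI_Info info, MPI_Comm out[])
{
    (void)info;
    for (int i = 0; i < my_num_ep; i++)
        MPI_Comm_dup(parent, &out[i]);   /* toy stand-in only */
    return MPI_SUCCESS;
}

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    /* Each OpenMP/Pthreads thread would attach to one of these handles
       and communicate with its own rank, relaxing the one-rank-per-
       process restriction described in the abstract. */
    MPI_Comm ep[4];
    MPIX_Comm_create_endpoints(MPI_COMM_WORLD, 4, MPI_INFO_NULL, ep);

    int rank;
    MPI_Comm_rank(ep[0], &rank);
    printf("process rank %d created 4 endpoint handles\n", rank);

    for (int i = 0; i < 4; i++)
        MPI_Comm_free(&ep[i]);
    MPI_Finalize();
    return 0;
}
```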

Collaboration


Dive into David Goodell's collaborations.

Top Co-Authors

Rajeev Thakur, Argonne National Laboratory
Pavan Balaji, Argonne National Laboratory
Darius Buntinas, Argonne National Laboratory
Ewing L. Lusk, Argonne National Laboratory
Robert B. Ross, Argonne National Laboratory