John M. Mellor-Crummey

Publication

Featured research published by John M. Mellor-Crummey.

Computational Science & Discovery | 2009

Terascale direct numerical simulations of turbulent combustion using S3D

J.H. Chen; Alok N. Choudhary; B.R. de Supinski; M. DeVries; Evatt R. Hawkes; Scott Klasky; Wei-keng Liao; Kwan-Liu Ma; John M. Mellor-Crummey; N Podhorszki; Ramanan Sankaran; Sameer Shende; Chun Sang Yoo

Computational science is paramount to the understanding of underlying processes in internal combustion engines of the future that will utilize non-petroleum-based alternative fuels, including carbon-neutral biofuels, and burn in new combustion regimes that will attain high efficiency while minimizing emissions of particulates and nitrogen oxides. Next-generation engines will likely operate at higher pressures, with greater amounts of dilution, and utilize alternative fuels that exhibit a wide range of chemical and physical properties. Therefore, there is a significant role for high-fidelity simulations, direct numerical simulations (DNS), specifically designed to capture key turbulence-chemistry interactions in these relatively uncharted combustion regimes, and in particular, that can discriminate the effects of differences in fuel properties. In DNS, all of the relevant turbulence and flame scales are resolved numerically using high-order accurate numerical algorithms. As a consequence, terascale DNS are computationally intensive, require massive amounts of computing power, and generate tens of terabytes of data. Recent results from terascale DNS of turbulent flames are presented here, illustrating its role in elucidating flame stabilization mechanisms in a lifted turbulent hydrogen/air jet flame in a hot air coflow, and the flame structure of a fuel-lean turbulent premixed jet flame. Computing at this scale requires close collaborations between computer and combustion scientists to provide optimized scalable algorithms and software for terascale simulations, efficient collective parallel I/O, tools for volume visualization of multiscale, multivariate data, and automation of the combustion workflow. The enabling computer science, applied to combustion science, is also required in many other terascale physics and engineering simulations.
In particular, performance monitoring is used to identify the performance of key kernels in the DNS code, S3D, and especially of memory-intensive loops in the code. Through the careful application of loop transformations, data reuse in cache is exploited, thereby reducing memory bandwidth needs and hence improving S3D's nodal performance. To enhance collective parallel I/O in S3D, an MPI-I/O caching design is used to construct a two-stage write-behind method for improving the performance of write-only operations. The simulations generate tens of terabytes of data requiring analysis. Interactive exploration of the simulation data is enabled by multivariate time-varying volume visualization. The visualization highlights spatial and temporal correlations between multiple reactive scalar fields using an intuitive user interface based on parallel coordinates and time histogram. Finally, an automated combustion workflow is designed using Kepler to manage large-scale data movement, data morphing, and archival, and to provide a graphical display of run-time diagnostics.

Concurrency and Computation: Practice and Experience | 2009

HPCTOOLKIT: tools for performance analysis of optimized parallel programs

Laksono Adhianto; S. Banerjee; Mike Fagan; Mark W. Krentel; Gabriel Marin; John M. Mellor-Crummey; Nathan R. Tallent

HPCTOOLKIT is an integrated suite of tools that supports measurement, analysis, attribution, and presentation of application performance for both sequential and parallel programs. HPCTOOLKIT can pinpoint and quantify scalability bottlenecks in fully optimized parallel programs with a measurement overhead of only a few percent. Recently, new capabilities were added to HPCTOOLKIT for collecting call path profiles for fully optimized codes without any compiler support, pinpointing and quantifying bottlenecks in multithreaded programs, exploring performance information and source code using a new user interface, and displaying hierarchical space–time diagrams based on traces of asynchronous call path samples. This paper provides an overview of HPCTOOLKIT and illustrates its utility for performance analysis of parallel applications.

International Journal of Parallel Programming | 2005

New grid scheduling and rescheduling methods in the GrADS project

Fran Berman; Henri Casanova; Andrew A. Chien; Keith D. Cooper; Holly Dail; Anshuman Dasgupta; W. Deng; Jack J. Dongarra; Lennart Johnsson; Ken Kennedy; Charles Koelbel; Bo Liu; Xin Liu; Anirban Mandal; Gabriel Marin; Mark Mazina; John M. Mellor-Crummey; Celso L. Mendes; A. Olugbile; Jignesh M. Patel; Daniel A. Reed; Zhiao Shi; Otto Sievert; Huaxia Xia; A. YarKhan

The goal of the Grid Application Development Software (GrADS) Project is to provide programming tools and an execution environment to ease program development for the Grid. This paper presents recent extensions to the GrADS software framework: a new approach to scheduling workflow computations, applied to a 3-D image reconstruction application; a simple stop/migrate/restart approach to rescheduling Grid applications, applied to a QR factorization benchmark; and a process-swapping approach to rescheduling, applied to an N-body simulation. Experiments validating these methods were carried out on both the GrADS MacroGrid (a small but functional Grid) and the MicroGrid (a controlled emulation of the Grid).

Measurement and Modeling of Computer Systems | 2004

Cross-architecture performance predictions for scientific applications using parameterized models

Gabriel Marin; John M. Mellor-Crummey

This paper describes a toolkit for semi-automatically measuring and modeling static and dynamic characteristics of applications in an architecture-neutral fashion. For predictable applications, models of dynamic characteristics have a convex and differentiable profile. Our toolkit operates on application binaries and succeeds in modeling key application characteristics that determine program performance. We use these characterizations to explore the interactions between an application and a target architecture. We apply our toolkit to SPARC binaries to develop architecture-neutral models of computation and memory access patterns of the ASCI Sweep3D and the NAS SP, BT and LU benchmarks. From our models, we predict the L1, L2 and TLB cache miss counts as well as the overall execution time of these applications on an Origin 2000 system. We evaluate our predictions by comparing them against measurements collected using hardware performance counters.

High Performance Distributed Computing | 2005

Scheduling strategies for mapping application workflows onto the grid

Anirban Mandal; Ken Kennedy; Charles Koelbel; Gabriel Marin; John M. Mellor-Crummey; Bo Liu; S. Lennart Johnsson

In this work, we describe new strategies for scheduling and executing workflow applications on grid resources using the GrADS [Ken Kennedy et al., 2002] infrastructure. Workflow scheduling is based on heuristic scheduling strategies that use application component performance models. The workflow is executed using a novel strategy to bind and launch the application onto heterogeneous resources. We apply these strategies in the context of executing EMAN, a bio-imaging workflow application, on the grid. The results of our experiments show that our strategy of performance-model-based, in-advance heuristic workflow scheduling yields a makespan 1.5 to 2.2 times better than that of other existing scheduling strategies. This strategy also achieves optimal load balance across the different grid sites for this application.
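
As a rough illustration of heuristic scheduling driven by performance models, the sketch below applies the well-known min-min heuristic to a set of independent tasks. The abstract does not say which heuristics the authors use, so min-min is an assumed stand-in here. The function name `min_min_schedule`, the task and resource names, and all runtime estimates are hypothetical, and the sketch ignores workflow dependencies and data-movement costs.

```python
def min_min_schedule(est_runtime):
    """Map independent tasks to resources with the min-min heuristic.

    est_runtime[task][resource] is a predicted runtime, standing in for a
    component performance model. Returns (assignment, makespan).
    """
    # Time at which each resource becomes free.
    ready_at = {r: 0.0 for runtimes in est_runtime.values() for r in runtimes}
    unscheduled = set(est_runtime)
    assignment = {}
    while unscheduled:
        # Over all (task, resource) pairs, pick the earliest completion time.
        # This equals: best resource per task, then the task with the
        # smallest such completion time (the "min-min" choice).
        finish, task, resource = min(
            (ready_at[r] + t_run, task, r)
            for task in unscheduled
            for r, t_run in est_runtime[task].items()
        )
        assignment[task] = resource
        ready_at[resource] = finish
        unscheduled.remove(task)
    return assignment, max(ready_at.values())


# Hypothetical runtime estimates (seconds) for three tasks on two resources.
estimates = {
    "a": {"fast": 2.0, "slow": 4.0},
    "b": {"fast": 3.0, "slow": 6.0},
    "c": {"fast": 5.0, "slow": 8.0},
}
assignment, makespan = min_min_schedule(estimates)
```

With these made-up estimates, the heuristic places tasks `a` and `b` on the faster resource and sends `c` to the slower one, because waiting for the faster resource would finish `c` later.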

Conference on High Performance Computing (Supercomputing) | 1991

On-the-fly detection of data races for programs with nested fork-join parallelism

John M. Mellor-Crummey

No abstract available

International Parallel and Distributed Processing Symposium | 2002

Toward a framework for preparing and executing adaptive grid programs

Ken Kennedy; Mark Mazina; John M. Mellor-Crummey; Keith D. Cooper; Linda Torczon; Francine Berman; Andrew A. Chien; Holly Dail; Otto Sievert; David Sigfredo Angulo; Ian T. Foster; R. Aydt; Daniel A. Reed

This paper describes the program execution framework being developed by the Grid Application Development Software (GrADS) Project. The goal of this framework is to provide good resource allocation for Grid applications and to support adaptive reallocation if performance degrades because of changes in the availability of Grid resources. At the heart of this strategy is the notion of a configurable object program, which contains, in addition to application code, strategies for mapping the application to different collections of resources and a resource selection model that provides an estimate of the performance of the application on a specific collection of Grid resources. This model must be accurate enough to distinguish collections of resources that will deliver good performance from those that will not. The GrADS execution framework also provides a contract monitoring mechanism for interrupting and remapping an application execution when performance falls below acceptable levels.

ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 1991

Scalable reader-writer synchronization for shared-memory multiprocessors

John M. Mellor-Crummey; Michael L. Scott

Reader-writer synchronization relaxes the constraints of mutual exclusion to permit more than one process to inspect a shared object concurrently, as long as none of them changes its value. On uniprocessors, mutual exclusion and reader-writer locks are typically designed to de-schedule blocked processes; however, on shared-memory multiprocessors it is often advantageous to have processes busy-wait. Unfortunately, implementations of busy-wait locks on shared-memory multiprocessors typically cause memory and network contention that degrades performance. Several researchers have shown how to implement scalable mutual exclusion locks that exploit locality in the memory hierarchies of shared-memory multiprocessors to eliminate contention for memory and for the processor-memory interconnect. In this paper we present reader-writer locks that similarly exploit locality to achieve scalability, with variants for reader preference, writer preference, and reader-writer fairness. Performance results on a BBN TC2000 multiprocessor demonstrate that our algorithms provide low latency and excellent scalability.
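
To make the reader-preference semantics concrete, here is a minimal sketch of a reader-writer lock. It is not the paper's algorithm: it blocks on a condition variable instead of busy-waiting, and it makes no attempt to exploit memory locality or to scale. It only shows the rule that many readers, or exactly one writer, may hold the lock at a time. The class name `ReaderPreferenceRWLock` and the demo workload are hypothetical.

```python
import threading


class ReaderPreferenceRWLock:
    """Blocking reader-writer lock: many readers or one writer at a time."""

    def __init__(self):
        self._cond = threading.Condition()
        self._active_readers = 0
        self._writer_active = False

    def acquire_read(self):
        with self._cond:
            # Reader preference: wait only for an active writer,
            # never for writers that are merely waiting.
            while self._writer_active:
                self._cond.wait()
            self._active_readers += 1

    def release_read(self):
        with self._cond:
            self._active_readers -= 1
            if self._active_readers == 0:
                self._cond.notify_all()  # a waiting writer may proceed

    def acquire_write(self):
        with self._cond:
            # A writer needs exclusive access: no readers, no other writer.
            while self._writer_active or self._active_readers > 0:
                self._cond.wait()
            self._writer_active = True

    def release_write(self):
        with self._cond:
            self._writer_active = False
            self._cond.notify_all()  # wake waiting readers and writers


# Demo: four writer threads increment a shared counter under the lock.
lock = ReaderPreferenceRWLock()
shared = {"value": 0}


def writer():
    for _ in range(1000):
        lock.acquire_write()
        try:
            shared["value"] += 1
        finally:
            lock.release_write()


threads = [threading.Thread(target=writer) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because readers never wait for queued writers, a steady stream of readers can starve writers in this variant. The abstract lists writer-preference and fair variants as alternatives to it.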

Architectural Support for Programming Languages and Operating Systems | 1989

A software instruction counter

John M. Mellor-Crummey; Thomas J. LeBlanc

Although several recent papers have proposed architectural support for program debugging and profiling, most processors do not yet provide even basic facilities, such as an instruction counter. As a result, system developers have been forced to invent software solutions. This paper describes our implementation of a software instruction counter for program debugging. We show that an instruction counter can be reasonably implemented in software, often with less than 10% execution overhead. Our experience suggests that a hardware instruction counter is not necessary for a practical implementation of watch-points and reverse execution; however, it would make program instrumentation much easier for the system developer.

International Journal of Parallel Programming | 2001

Improving Memory Hierarchy Performance for Irregular Applications Using Data and Computation Reorderings

John M. Mellor-Crummey; David B. Whalley; Ken Kennedy

The performance of irregular applications on modern computer systems is hurt by the wide gap between CPU and memory speeds because these applications typically under-utilize multi-level memory hierarchies, which help hide this gap. This paper investigates using data and computation reorderings to improve memory hierarchy utilization for irregular applications. We evaluate the impact of reordering on data reuse at different levels in the memory hierarchy. We focus on coordinated data and computation reordering based on space-filling curves and we introduce a new architecture-independent multi-level blocking strategy for irregular applications. For two particle codes we studied, the most effective reorderings reduced overall execution time by factors of two and four, respectively. Preliminary experience with a scatter benchmark derived from a large unstructured mesh application showed that careful data and computation ordering reduced primary cache misses by a factor of two compared to a random ordering.
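
The idea of reordering data along a space-filling curve can be sketched with a Morton (Z-order) curve, one common choice. The abstract does not name the curve used, so this choice is an assumption. The helper names `interleave_bits` and `morton_order` and the particle positions are hypothetical, and the sketch covers only the data-reordering step for 2-D points in the unit square.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into a Morton (Z-order) key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # x bits occupy even positions
        key |= ((y >> i) & 1) << (2 * i + 1)  # y bits occupy odd positions
    return key


def morton_order(points, bits: int = 16):
    """Return indices that sort 2-D points in [0, 1) x [0, 1) along a Z-order curve."""
    scale = (1 << bits) - 1
    keys = [
        interleave_bits(int(px * scale), int(py * scale), bits)
        for px, py in points
    ]
    return sorted(range(len(points)), key=keys.__getitem__)


# Hypothetical particle positions; spatially close points should end up
# next to each other in the reordered array.
points = [(0.9, 0.9), (0.1, 0.1), (0.9, 0.1), (0.1, 0.9), (0.12, 0.11)]
order = morton_order(points)
reordered = [points[i] for i in order]
```

After sorting, the two particles near the lower-left corner sit next to each other in `reordered`, so a computation that visits spatial neighbours touches nearby memory. Coordinating the computation order with this data order, as the abstract describes, would be a further step not shown here.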
