Douglas Doerfler | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Douglas Doerfler is active.

Explore More

Publication

Featured researches published by Douglas Doerfler.

Lecture Notes in Computer Science | 2006

Measuring MPI send and receive overhead and application availability in high performance network interfaces

Douglas Doerfler; Ronald B. Brightwell

In evaluating new high-speed network interfaces, the usual metrics of latency and bandwidth are commonly measured and reported. There are numerous other message passing characteristics that can have a dramatic effect on application performance that should be analyzed when evaluating a new interconnect. One such metric is overhead, which dictates the networks ability to allow the application to perform non-message passing work while a transfer is taking place. A method for measuring overhead, and hence calculating application availability, is presented. Results for several next-generation network interfaces are also presented.

ieee international symposium on parallel & distributed processing, workshops and phd forum | 2011

Investigating the Impact of the Cielo Cray XE6 Architecture on Scientific Application Codes

Mahesh Rajan; Richard F. Barrett; Douglas Doerfler; Kevin Pedretti

Cielo, a Cray XE6, is the Department of Energy NNSA Advanced Simulation and Computing (ASC) campaigns newest capability machine. Rated at 1.37 PFLOPS, it consists of 8,944 dual-socket oct-core AMD Magny-Cours compute nodes, linked using Crays Gemini interconnect. Its primary mission objective is to enable a suite of the ASC applications implemented using MPI to scale to tens of thousands of cores. Cielo is an evolutionary improvement to a successful architecture previously available to many of our codes, thus enabling a basis for understanding the capabilities of this new architecture. Using three codes strategically important to the ASC campaign, and supplemented with some micro-benchmarks that expose the fundamental capabilities of the XE6, we report on the performance characteristics and capabilities of Cielo.

international conference on cluster computing | 2004

A comparison of 4X InfiniBand and Quadrics Elan-4 technologies

Ron Brightwell; Douglas Doerfler; Keith D. Underwood

Quadrics Elan-4 and 4X InfiniBand have comparable performance in terms of peak bandwidth and ping-pong latency. In contrast, the two network architectures differ dramatically in details ranging from signaling technologies to programming interface design to software stacks. Both networks compete in the high performance computing marketplace, and InfiniBand is currently receiving a significant amount of attention, due mostly to its potential cost/performance advantage. This work compares 4X InfiniBand and Quadrics Elan-4 on identical compute hardware using application benchmarks of importance to the DOE community. We use scaling efficiency as the main performance metric, and we also provide a cost analysis for different network configurations. Although our 32-node test platform is relatively small, some scaling issues are evident. In general, the Quadrics hardware scales slightly better on most of the applications tested.

Future Generation Computer Systems | 2014

Exascale design space exploration and co-design

Sudip S. Dosanjh; Richard F. Barrett; Douglas Doerfler; Simon D. Hammond; Karl Scott Hemmert; Michael A. Heroux; Paul Lin; Kevin Pedretti; Arun Rodrigues; Tim Trucano; Justin Luitjens

The co-design of architectures and algorithms has been postulated as a strategy for achieving Exascale computing in this decade. Exascale design space exploration is prohibitively expensive, at least partially due to the size and complexity of scientific applications of interest. Application codes can contain millions of lines and involve many libraries. Mini-applications, which attempt to capture some key performance issues, can potentially reduce the order of the exploration by a factor of a thousand. However, we need to carefully understand how representative mini-applications are of the full application code. This paper describes a methodology for this comparison and applies it to a particularly challenging mini-application. A multi-faceted methodology for design space exploration is also described that includes measurements on advanced architecture testbeds, experiments that use supercomputers and system software to emulate future hardware, and hardware/software co-simulation tools to predict the behavior of applications on hardware that does not yet exist.

ieee international conference on high performance computing data and analytics | 2012

Navigating an Evolutionary Fast Path to Exascale

Richard F. Barrett; Simon D. Hammond; Douglas Doerfler; Michael A. Heroux; Justin Luitjens; Duncan Roweth

The computing community is in the midst of a disruptive architectural change. The advent of manycore and heterogeneous computing nodes forces us to reconsider every aspect of the system software and application stack. To address this challenge there is a broad spectrum of approaches, which we roughly classify as either revolutionary or evolutionary. With the former, the entire code base is re-written, perhaps using a new programming language or execution model. The latter, which is the focus of this work, seeks a piecewise path of effective incremental change. The end effect of our approach will be revolutionary in that the control structure of the application will be markedly different in order to utilize single-instruction multiple-data/thread (SIMD/SIMT), manycore and heterogeneous nodes, but the physics code fragments will be remarkably similar. Our approach is guided by a set of mission driven applications and their proxies, focused on balancing performance potential with the realities of existing application code bases. Although the specifics of this process have not yet converged, we find that there are several important steps that developers of scientific and engineering application programs can take to prepare for making effective use of these challenging platforms. Aiding an evolutionary approach is the recognition that the performance potential of the architectures is, in a meaningful sense, an extension of existing capabilities: vectorization, threading, and a re-visiting of node interconnect capabilities. Therefore, as architectures, programming models, and programming mechanisms continue to evolve, the preparations described herein will provide significant performance benefits on existing and emerging architectures.

international parallel and distributed processing symposium | 2006

A preliminary analysis of the InfiniPath and XD1 network interfaces

Ron Brightwell; Douglas Doerfler; Keith D. Underwood

Two recently delivered systems have begun a new trend in cluster interconnects. Both the InfiniPath network from PathScale, Inc., and the rapidarray fabric in the XDI system from Cray, Inc., leverage commodity network fabrics while customizing the network interface in an attempt to add value specifically for the high performance computing (HPC) cluster market. Both network interfaces are compatible with standard InfiniBand (IB) switches, but neither use the traditional programming interfaces to support MPI. Another fundamental difference between these networks and other modern network adapters is that much of the processing needed for the network protocol stack is performed on the host processor(s) rather than by the network interface itself. This approach stands in stark contrast to the current direction of most high-performance networking activities, which is to offload as much protocol processing as possible to the network interface. In this paper, we provide an initial performance comparison of the two partially custom networks (PathScales InfiniPath and Crays XDI) with a more commodity network (standard IB) and a more custom network (Quadrics Elan4). Our evaluation includes several micro-benchmark results as well as some initial application performance data

instrumentation and measurement technology conference | 1986

Dynamic testing of a slow sample rate, high-resolution data acquisition system

Douglas Doerfler

Performance testing a slow sample rate, high-resolution data acquisition system (DAS) is difficult due to the large number of possible codes and the time required to fully exercise them. Techniques for dynamically testing such a system are considered with advantages and disadvantages presented. The system resolution is 15 bits, sampling frequency is 128 Hz, input range is ±5 V, and the maximum input frequency is 45 Hz. Two techniques chosen to evaluate the system are the histogram method and the discrete Fourier transform (DFT) method.

Journal of Parallel and Distributed Computing | 2015

Assessing the role of mini-applications in predicting key performance characteristics of scientific and engineering applications

Richard F. Barrett; Paul S. Crozier; Douglas Doerfler; Michael A. Heroux; Paul Lin; Heidi K. Thornquist; Tim Trucano

Computational science and engineering application programs are typically large, complex, and dynamic, and are often constrained by distribution limitations. As a means of making tractable rapid explorations of scientific and engineering application programs in the context of new, emerging, and future computing architectures, a suite of miniapps has been created to serve as proxies for full scale applications. Each miniapp is designed to represent a key performance characteristic that does or is expected to significantly impact the runtime performance of an application program. In this paper we introduce a methodology for assessing the ability of these miniapps to effectively represent these performance issues. We applied this methodology to three miniapps, examining the linkage between them and an application they are intended to represent. Herein we evaluate the fidelity of that linkage. This work represents the initial steps required to begin to answer the question, Under what conditions does a miniapp represent a key performance characteristic in a full app? Proxies are being used to examine the performance of full application codes.We present a methodology for showing the link between full application codes and their proxies.We demonstrate this methodology using four applications and their proxies.

International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems | 2013

Analysis of Cray XC30 Performance Using Trinity-NERSC-8 Benchmarks and Comparison with Cray XE6 and IBM BG/Q

Matthew J. Cordery; Brian Austin; H. J. Wassermann; Christopher S. Daley; Nicholas J. Wright; Simon D. Hammond; Douglas Doerfler

In this paper, we examine the performance of a suite of applications on three different architectures: Edison, a Cray XC30 with Intel Ivy Bridge processors; Hopper and Cielo, both Cray XE6’s with AMD Magny–Cours processors; and Mira, an IBM BlueGene/Q with PowerPC A2 processors. The applications chosen are a subset of the applications used in a joint procurement effort between Lawrence Berkeley National Laboratory, Los Alamos National Laboratory and Sandia National Laboratories. Strong scaling results are presented, using both MPI-only and MPI+OpenMP execution models.

EuroMPI'11 Proceedings of the 18th European MPI Users' Group conference on Recent advances in the message passing interface | 2011

The impact of injection bandwidth performance on application scalability

Kevin Pedretti; Ron Brightwell; Douglas Doerfler; K. Scott Hemmert; James H. Laros

Future exascale systems are expected to have significantly reduced network band width relative to computational performance than current systems. Clearly, this will impact band width-intensive applications, so it is important to gain insight into the magnitude of the negative impact on performance and scalability to help identify mitigation strategies. In this paper, we show how current systems can be configured to emulate the expected imbalance of future systems. We demonstrate this approach by reducing the network injection band width performance of a 160-node, 1920-core Cray XT5 system and analyze the performance and scalability of a suite of MPI benchmarks and applications.

Explore More