
Publication


Featured research published by Nathan R. Tallent.


Concurrency and Computation: Practice and Experience | 2009

HPCTOOLKIT: tools for performance analysis of optimized parallel programs

Laksono Adhianto; S. Banerjee; Mike Fagan; Mark W. Krentel; Gabriel Marin; John M. Mellor-Crummey; Nathan R. Tallent

HPCTOOLKIT is an integrated suite of tools that supports measurement, analysis, attribution, and presentation of application performance for both sequential and parallel programs. HPCTOOLKIT can pinpoint and quantify scalability bottlenecks in fully optimized parallel programs with a measurement overhead of only a few percent. Recently, new capabilities were added to HPCTOOLKIT for collecting call path profiles for fully optimized codes without any compiler support, pinpointing and quantifying bottlenecks in multithreaded programs, exploring performance information and source code using a new user interface, and displaying hierarchical space–time diagrams based on traces of asynchronous call path samples. This paper provides an overview of HPCTOOLKIT and illustrates its utility for performance analysis of parallel applications.
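
The measurement approach described above rests on asynchronous call path sampling. As a rough illustration of the idea only (not HPCTOOLKIT's actual implementation, which unwinds fully optimized code without compiler support), the following hypothetical C++ sketch installs a SIGPROF interval timer and captures a call path at each sample:

```cpp
// Hypothetical sketch of asynchronous call path sampling: a SIGPROF
// interval timer fires periodically and the handler captures the current
// call path. HPCTOOLKIT's real sampler is far more careful.
#include <csignal>
#include <cstdio>
#include <execinfo.h>   // backtrace (glibc); not strictly async-signal-safe
#include <sys/time.h>

static void sample_handler(int) {
    void* pcs[64];
    int depth = backtrace(pcs, 64);   // capture the current call path
    // A real profiler would insert this path into a calling context tree
    // and increment that node's sample count; here we only note the depth.
    fprintf(stderr, "sample: call path of depth %d\n", depth);
}

int main() {
    struct sigaction sa = {};
    sa.sa_handler = sample_handler;
    sigaction(SIGPROF, &sa, nullptr);

    // Fire SIGPROF every 10 ms of consumed CPU time.
    struct itimerval it = {{0, 10000}, {0, 10000}};
    setitimer(ITIMER_PROF, &it, nullptr);

    volatile double x = 0;            // some work to be sampled
    for (long i = 0; i < 300000000L; ++i) x = x + i * 1e-9;
    return 0;
}
```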


The Journal of Supercomputing | 2002

HPCVIEW: A Tool for Top-down Analysis of Node Performance

John M. Mellor-Crummey; Robert J. Fowler; Gabriel Marin; Nathan R. Tallent

It is increasingly difficult for complex scientific programs to attain a significant fraction of peak performance on systems that are based on microprocessors with substantial instruction-level parallelism and deep memory hierarchies. Despite this trend, performance analysis and tuning tools are still not used regularly by algorithm and application designers. To a large extent, existing performance tools fail to meet many user needs and are cumbersome to use. To address these issues, we developed HPCVIEW—a toolkit for combining multiple sets of program profile data, correlating the data with source code, and generating a database that can be analyzed anywhere with a commodity Web browser. We argue that HPCVIEW addresses many of the issues that have limited the usability and the utility of most existing tools. We originally built HPCVIEW to facilitate our own work on data layout and optimizing compilers. Now, in addition to daily use within our group, HPCVIEW is being used by several code development teams in DoD and DoE laboratories as well as at NCSA.
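
The kernel of the HPCVIEW approach, combining multiple profile data sets keyed by source location and computing derived metrics from them, can be sketched as follows. The file names, line numbers, and metric values below are invented for illustration; HPCVIEW itself correlates real profile data with source code and emits a browsable database:

```cpp
// Sketch of HPCVIEW-style profile correlation: merge two metric sets keyed
// by source location and derive a third metric from them.
#include <cstdio>
#include <map>
#include <string>

using Profile = std::map<std::string, double>;  // "file:line" -> metric value

int main() {
    Profile cycles = {{"solve.f:120", 9.0e9}, {"solve.f:205", 2.0e9}};
    Profile flops  = {{"solve.f:120", 1.5e9}, {"solve.f:205", 1.8e9}};

    // Correlate the two measurements line by line; a high cycles/FLOP
    // ratio flags lines that stall on memory rather than compute.
    for (const auto& [loc, cyc] : cycles) {
        auto it = flops.find(loc);
        double ratio = (it != flops.end() && it->second > 0)
                           ? cyc / it->second : 0.0;
        printf("%-14s cycles/FLOP = %.2f\n", loc.c_str(), ratio);
    }
    return 0;
}
```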


ACM Transactions on Mathematical Software | 2008

OpenAD/F: A Modular Open-Source Tool for Automatic Differentiation of Fortran Codes

Jean Utke; Uwe Naumann; Mike Fagan; Nathan R. Tallent; Michelle Mills Strout; Patrick Heimbach; Chris Hill; Carl Wunsch

The OpenAD/F tool allows the evaluation of derivatives of functions defined by a Fortran program. The derivative evaluation is performed by a Fortran code resulting from the analysis and transformation of the original program that defines the function of interest. OpenAD/F has been designed with a particular emphasis on modularity, flexibility, and the use of open-source components. While the code transformation follows the basic principles of automatic differentiation, the tool implements new algorithmic approaches at various levels, for example, for basic block preaccumulation and call graph reversal. Unlike most other automatic differentiation tools, OpenAD/F uses components provided by the OpenAD framework, which supports a comparatively easy extension of the code transformations in a language-independent fashion. It uses code analysis results implemented in the OpenAnalysis component. The interface to the language-independent transformation engine is an XML-based format, specified through an XML schema. The implemented transformation algorithms allow efficient derivative computations using locally optimized cross-country sequences of vertex, edge, and face elimination steps. Specifically, for the generation of adjoint codes, OpenAD/F supports various code reversal schemes with hierarchical checkpointing at the subroutine level. As an example from geophysical fluid dynamics, a nonlinear time-dependent scalable, yet simple, barotropic ocean model is considered. OpenAD/F's reverse mode is applied to compute sensitivities of some of the model's transport properties with respect to gridded fields such as bottom topography as independent (control) variables.
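
OpenAD/F itself differentiates Fortran by source transformation and, as the abstract notes, emphasizes reverse (adjoint) mode. Purely as a conceptual illustration of how derivatives propagate through every operation, here is a minimal forward-mode sketch using operator-overloaded dual numbers in C++, a different technique from OpenAD/F's:

```cpp
// Conceptual illustration of automatic differentiation with forward-mode
// dual numbers. This is NOT how OpenAD/F works (it transforms Fortran
// source); it only conveys the core idea that each operation propagates
// a derivative alongside its value.
#include <cmath>
#include <cstdio>

struct Dual {
    double v;  // value
    double d;  // derivative with respect to the chosen input
};

Dual operator*(Dual a, Dual b) { return {a.v * b.v, a.d * b.v + a.v * b.d}; }
Dual operator+(Dual a, Dual b) { return {a.v + b.v, a.d + b.d}; }
Dual sin(Dual a) { return {std::sin(a.v), std::cos(a.v) * a.d}; }

int main() {
    Dual x = {2.0, 1.0};              // seed dx/dx = 1
    Dual y = sin(x * x) + x;          // f(x) = sin(x^2) + x
    // Analytically, f'(x) = 2x cos(x^2) + 1; the dual part matches it.
    printf("f(2) = %f, f'(2) = %f\n", y.v, y.d);
    return 0;
}
```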


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2009

Effective performance measurement and analysis of multithreaded applications

Nathan R. Tallent; John M. Mellor-Crummey

Understanding why the performance of a multithreaded program does not improve linearly with the number of cores in a shared-memory node populated with one or more multicore processors is a problem of growing practical importance. This paper makes three contributions to performance analysis of multithreaded programs. First, we describe how to measure and attribute parallel idleness, namely, where threads are stalled and unable to work. This technique applies broadly to programming models ranging from explicit threading (e.g., Pthreads) to higher-level models such as Cilk and OpenMP. Second, we describe how to measure and attribute parallel overhead, namely, when a thread is performing miscellaneous work other than executing the user's computation. By employing a combination of compiler support and post-mortem analysis, we incur no measurement cost beyond normal profiling to glean this information. Using idleness and overhead metrics enables one to pinpoint areas of an application where concurrency should be increased (to reduce idleness), decreased (to reduce overhead), or where the present parallelization is hopeless (where idleness and overhead are both high). Third, we describe how to measure and attribute arbitrary performance metrics for high-level multithreaded programming models, such as Cilk. This requires bridging the gap between the expression of logical concurrency in programs and its realization at run time, as it is adaptively partitioned and scheduled onto a pool of threads. We have prototyped these ideas in the context of Rice University's HPCToolkit performance tools. We describe our approach, implementation, and experiences applying this approach to measure and attribute work, idleness, and overhead in executions of Cilk programs.
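
The idleness-attribution idea can be caricatured in a few lines: the runtime keeps a count of working threads, and each sample taken in a working thread also charges that thread's context a proportional share of its peers' idleness. The sketch below is a hypothetical simplification; the real implementation does this inside hpcrun's sample handler against a calling context tree:

```cpp
// Simplified sketch of parallel idleness attribution: a sample in a working
// thread charges its context one unit of work plus (idle / working) units
// of idleness, so idle time is blamed on the code that failed to create
// enough parallelism.
#include <atomic>
#include <cstdio>

std::atomic<int> n_working{0};
const int n_threads = 8;

struct Metrics { double work = 0, idleness = 0; };

// Called on each timer sample taken in a working thread.
void on_sample(Metrics& ctx) {
    int w = n_working.load();
    int idle = n_threads - w;
    ctx.work += 1.0;
    if (w > 0) ctx.idleness += double(idle) / w;  // share blame for idle peers
}

int main() {
    Metrics hot_loop;
    n_working = 5;                   // pretend 5 of 8 threads are working
    for (int s = 0; s < 100; ++s) on_sample(hot_loop);
    printf("work = %.0f samples, attributed idleness = %.1f samples\n",
           hot_loop.work, hot_loop.idleness);   // 100 work, 60 idleness
    return 0;
}
```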


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2010

Analyzing lock contention in multithreaded applications

Nathan R. Tallent; John M. Mellor-Crummey; Allan Porterfield

Many programs exploit shared-memory parallelism using multithreading. Threaded codes typically use locks to coordinate access to shared data. In many cases, contention for locks reduces parallel efficiency and hurts scalability. Being able to quantify and attribute lock contention is important for understanding where a multithreaded program needs improvement. This paper proposes and evaluates three strategies for gaining insight into performance losses due to lock contention. First, we consider using a straightforward strategy based on call stack profiling to attribute idle time and show that it fails to yield insight into lock contention. Second, we consider an approach that builds on a strategy previously used for analyzing idleness in work-stealing computations; we show that this strategy does not yield insight into lock contention. Finally, we propose a new technique for measurement and analysis of lock contention that uses data associated with locks to blame lock holders for the idleness of spinning threads. Our approach incurs ≤ 5% overhead on a quantum chemistry application that makes extensive use of locking (65M distinct locks, a maximum of 340K live locks, and an average of 30K lock acquisitions per second per thread) and attributes lock contention to its full static and dynamic calling contexts. Our strategy, implemented in HPCToolkit, is fully distributed and should scale well to systems with large core counts.
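
The blame-shifting strategy of the third technique can be sketched with a spin lock that carries a blame counter: spinning threads deposit their waiting into the counter, and the holder accepts the accumulated blame at release, so contention is charged to the code holding the lock rather than to the code waiting for it. This hypothetical C++ sketch counts spin iterations where the real tool counts timer samples:

```cpp
// Sketch of lock-contention blame shifting: data associated with the lock
// carries the idleness of spinners, and the holder collects it at release.
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

struct BlameLock {
    std::atomic<bool> held{false};
    std::atomic<long> blame{0};        // idle units deposited by spinners
};

void acquire(BlameLock& l) {
    while (l.held.exchange(true))      // spin until we get the lock,
        l.blame.fetch_add(1);          // blaming the current holder as we wait
}

long release(BlameLock& l) {
    long b = l.blame.exchange(0);      // the holder accepts the accumulated
    l.held.store(false);               // blame; a profiler would charge it to
    return b;                          // the holder's calling context
}

int main() {
    BlameLock l;
    std::atomic<long> total_blame{0};
    auto worker = [&] {
        for (int i = 0; i < 1000; ++i) {
            acquire(l);
            std::this_thread::sleep_for(std::chrono::microseconds(10));
            total_blame += release(l);
        }
    };
    std::thread t1(worker), t2(worker);
    t1.join(); t2.join();
    printf("blame attributed to lock holders: %ld units\n", total_blame.load());
    return 0;
}
```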


ACM SIGPLAN Conference on Programming Language Design and Implementation | 2009

Binary analysis for measurement and attribution of program performance

Nathan R. Tallent; John M. Mellor-Crummey; Mike Fagan

Modern programs frequently employ sophisticated modular designs. As a result, performance problems cannot be identified from costs attributed to routines in isolation; understanding code performance requires information about a routine's calling context. Existing performance tools fall short in this respect. Prior strategies for attributing context-sensitive performance at the source level either compromise measurement accuracy, remain too close to the binary, or require custom compilers. To understand the performance of fully optimized modular code, we developed two novel binary analysis techniques: 1) on-the-fly analysis of optimized machine code to enable minimally intrusive and accurate attribution of costs to dynamic calling contexts; and 2) post-mortem analysis of optimized machine code and its debugging sections to recover its program structure and reconstruct a mapping back to its source code. By combining the recovered static program structure with dynamic calling context information, we can accurately attribute performance metrics to calling contexts, procedures, loops, and inlined instances of procedures. We demonstrate that the fusion of this information provides unique insight into the performance of complex modular codes. This work is implemented in the HPCToolkit performance tools (http://hpctoolkit.org).
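
The post-mortem half of the approach ultimately supports a simple query: map a sampled program counter to the procedure, loop, or inlined instance that contains it. A minimal sketch using an interval map keyed by start address follows; the addresses and scope names are invented, whereas HPCToolkit's hpcstruct recovers the real structure from the binary and its debugging sections:

```cpp
// Sketch of the attribution step: look up a sampled program counter in
// recovered program structure (procedure, loop, inlined callee).
#include <cstdint>
#include <cstdio>
#include <map>
#include <string>

struct Scope { std::string name; };   // e.g., "solve > loop@solve.f:120"

int main() {
    // Intervals keyed by start address; each covers [start, next start).
    std::map<uint64_t, Scope> structure = {
        {0x401000, {"main"}},
        {0x401200, {"solve"}},
        {0x401280, {"solve > loop@solve.f:120"}},
        {0x401400, {"solve > inlined:daxpy"}},
    };

    uint64_t pc = 0x4012a8;           // a sampled program counter
    auto it = structure.upper_bound(pc);
    if (it != structure.begin()) {
        --it;                          // the interval containing pc
        printf("pc 0x%lx attributed to: %s\n",
               (unsigned long)pc, it->second.name.c_str());
    }
    return 0;
}
```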


IEEE International Conference on High Performance Computing, Data, and Analytics | 2010

Scalable Identification of Load Imbalance in Parallel Executions Using Call Path Profiles

Nathan R. Tallent; Laksono Adhianto; John M. Mellor-Crummey

Applications must scale well to make efficient use of today's class of petascale computers, which contain hundreds of thousands of processor cores. Inefficiencies that do not even appear in modest-scale executions can become major bottlenecks in large-scale executions. Because scaling problems are often difficult to diagnose, there is a critical need for scalable tools that guide scientists to the root causes of scaling problems. Load imbalance is one of the most common scaling problems. To provide actionable insight into load imbalance, we present post-mortem parallel analysis techniques for pinpointing and quantifying load imbalance in the context of call path profiles of parallel programs. We show how to identify load imbalance in its static and dynamic context by using only low-overhead asynchronous call path profiling to locate regions of code responsible for communication wait time in SPMD executions. We describe the implementation of these techniques within HPCTOOLKIT.
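
One way to quantify the imbalance in a single calling-context node, sketched below with made-up per-rank costs, is the gap between the maximum and the mean cost across ranks: since every rank waits for the slowest, that gap bounds the time recoverable by perfect balancing. This is a simplification of the paper's analysis, which applies such metrics across entire call path profiles in parallel:

```cpp
// Sketch of a per-node load imbalance metric over per-rank costs.
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <vector>

int main() {
    // Cost (e.g., seconds) attributed to one call path node on each rank.
    std::vector<double> cost = {1.0, 1.1, 0.9, 3.8, 1.0, 1.2, 0.9, 1.1};

    double max  = *std::max_element(cost.begin(), cost.end());
    double mean = std::accumulate(cost.begin(), cost.end(), 0.0) / cost.size();

    // (max - mean) is the average waiting a perfect rebalancing of this
    // node's work would eliminate.
    printf("imbalance: %.2f s per rank (%.0f%% of the node's mean cost)\n",
           max - mean, 100.0 * (max - mean) / mean);
    return 0;
}
```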


Journal of Physics: Conference Series | 2008

HPCToolkit: performance tools for scientific computing

Nathan R. Tallent; John M. Mellor-Crummey; Laksono Adhianto; Mike Fagan; Mark W. Krentel

As part of the U.S. Department of Energys Scientific Discovery through Advanced Computing (SciDAC) program, science teams are tackling problems that require simulation and modeling on petascale computers. As part of activities associated with the SciDAC Center for Scalable Application Development Software (CScADS) and the Performance Engineering Research Institute (PERI), Rice University is building software tools for performance analysis of scientific applications on the leadership-class platforms. In this poster abstract, we briefly describe the HPCToolkit performance tools and how they can be used to pinpoint bottlenecks in SPMD and multi-threaded parallel codes. We demonstrate HPCToolkits utility by applying it to two SciDAC applications: the S3D code for simulation of turbulent combustion and the MFDn code for ab initio calculations of microscopic structure of nuclei.


International Conference on Supercomputing | 2011

Scalable fine-grained call path tracing

Nathan R. Tallent; John M. Mellor-Crummey; Michael Franco; Reed Landrum; Laksono Adhianto

Applications must scale well to make efficient use of even medium-scale parallel systems. Because scaling problems are often difficult to diagnose, there is a critical need for scalable tools that guide scientists to the root causes of performance bottlenecks. Although tracing is a powerful performance-analysis technique, tools that employ it can quickly become bottlenecks themselves. Moreover, to obtain actionable performance feedback for modular parallel software systems, it is often necessary to collect and present fine-grained, context-sensitive data, the very thing scalable tools avoid. While existing tracing tools can collect calling contexts, they do so only in a coarse-grained fashion; and no prior tool scalably presents both context- and time-sensitive data. This paper describes how to collect, analyze, and present fine-grained call path traces for parallel programs. To scale our measurements, we use asynchronous sampling, whose granularity is controlled by a sampling frequency, and a compact representation. To present traces at multiple levels of abstraction and at arbitrary resolutions, we use sampling to render complementary slices of calling-context-sensitive trace data. Because our techniques are general, they can be used on applications that use different parallel programming models (MPI, OpenMP, PGAS). This work is implemented in HPCToolkit.
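
Two of the representation ideas, compact (timestamp, call-path-id) trace records and resolution-controlled rendering by sampling the record stream, can be sketched as follows. The record layout and rendering helper are hypothetical, not HPCToolkit's actual format:

```cpp
// Sketch of compact call path trace records plus rendering by sampling:
// a view at any zoom level reads one record per pixel, not the whole trace.
#include <cstdint>
#include <cstdio>
#include <vector>

struct TraceRecord {
    uint64_t time_us;   // sample timestamp
    uint32_t cct_id;    // id of a node in the calling context tree
};

// Render one trace line at a given horizontal resolution by picking, for
// each pixel, the last record at or before the pixel's midpoint in time.
std::vector<uint32_t> render(const std::vector<TraceRecord>& t, int pixels) {
    std::vector<uint32_t> out(pixels);
    uint64_t t0 = t.front().time_us, t1 = t.back().time_us;
    size_t j = 0;
    for (int p = 0; p < pixels; ++p) {
        uint64_t mid = t0 + (t1 - t0) * (2 * p + 1) / (2 * pixels);
        while (j + 1 < t.size() && t[j + 1].time_us <= mid) ++j;
        out[p] = t[j].cct_id;
    }
    return out;
}

int main() {
    std::vector<TraceRecord> trace;
    for (uint64_t i = 0; i < 100000; ++i)
        trace.push_back({i * 200, uint32_t(i / 25000)});  // four phases
    for (uint32_t id : render(trace, 16)) printf("%u ", id);  // coarse view
    printf("\n");
    return 0;
}
```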


IEEE International Conference on High Performance Computing, Data, and Analytics | 2009

Diagnosing performance bottlenecks in emerging petascale applications

Nathan R. Tallent; John M. Mellor-Crummey; Laksono Adhianto; Mike Fagan; Mark W. Krentel

Cutting-edge science and engineering applications require petascale computing. It is, however, a significant challenge to use petascale computing platforms effectively. Consequently, there is a critical need for performance tools that enable scientists to understand impediments to performance on emerging petascale systems. In this paper, we describe HPCToolkit, a suite of multi-platform tools that supports sampling-based analysis of application performance on emerging petascale platforms. HPCToolkit uses sampling to pinpoint and quantify both scaling and node performance bottlenecks. We study several emerging petascale applications on the Cray XT and IBM BlueGene/P platforms and use HPCToolkit to identify specific source lines, in their full calling context, associated with performance bottlenecks in these codes. Such information is exactly what application developers need to know to improve their applications to take full advantage of the power of petascale systems.
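
A scaling-bottleneck analysis of this kind can be illustrated with the arithmetic for a single calling context: under ideal strong scaling, the context's cost at Q cores should shrink by a factor of Q/P relative to a run at P cores, and any measured excess is scaling loss attributable to that context. The numbers in this sketch are invented, and the formula is a simplified illustration rather than HPCToolkit's exact metric:

```cpp
// Sketch of a differential scaling metric for one calling context.
#include <cstdio>

int main() {
    double P = 256, Q = 8192;          // small-scale and large-scale runs
    double cost_P = 120.0;             // seconds in one context at P cores
    double cost_Q = 9.5;               // seconds in the same context at Q

    double ideal  = cost_P * (P / Q);  // expected under perfect strong scaling
    double excess = cost_Q - ideal;    // scaling loss attributable here

    printf("ideal %.2f s, measured %.2f s, scaling loss %.2f s (%.0f%%)\n",
           ideal, cost_Q, excess, 100.0 * excess / cost_Q);
    return 0;
}
```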

Collaboration


Dive into Nathan R. Tallent's collaboration.

Top Co-Authors

Adolfy Hoisie
Pacific Northwest National Laboratory

Darren J. Kerbyson
Pacific Northwest National Laboratory

Abhinav Vishnu
Pacific Northwest National Laboratory

Nitin A. Gawande
Pacific Northwest National Laboratory

Antonino Tumeo
Pacific Northwest National Laboratory