Network


Latest external collaborations at the country level. Dive into details by clicking on the dots.

Hotspot


Dive into the research topics where John C. Linford is active.

Publication


Featured research published by John C. Linford.


IEEE International Conference on High Performance Computing, Data and Analytics | 2009

Multi-core acceleration of chemical kinetics for simulation and prediction

John C. Linford; John Michalakes; Manish Vachharajani; Adrian Sandu

This work implements a computationally expensive chemical kinetics kernel from a large-scale community atmospheric model on three multi-core platforms: NVIDIA GPUs using CUDA, the Cell Broadband Engine, and Intel Quad-Core Xeon CPUs. A comparative performance analysis for each platform in double and single precision on coarse and fine grids is presented. Platform-specific design and optimization is discussed in a mechanism-agnostic way, permitting the optimization of many chemical mechanisms. The implementation of a three-stage Rosenbrock solver for SIMD architectures is discussed. When used as a template mechanism in the Kinetic PreProcessor, the multi-core implementation enables the automatic optimization and porting of many chemical mechanisms on a variety of multi-core platforms. Speedups of 5.5x in single precision and 2.7x in double precision are observed when compared to eight Xeon cores. Compared to the serial implementation, the maximum observed speedup is 41.1x in single precision.
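The Rosenbrock solvers mentioned in the abstract replace the nonlinear solves of a fully implicit method with one linear solve per stage. As a rough illustration of the idea only (not the paper's three-stage SIMD implementation), a first-order Rosenbrock step for the hypothetical scalar decay problem y' = -λy might look like:

```python
import math

def rosenbrock1_step(y, lam, h):
    """One linearly implicit (first-order Rosenbrock) step for y' = -lam*y.

    Solves (1 - h*J) * k = h*f(y) with Jacobian J = df/dy, then y += k.
    A production kinetics solver uses several such stages and a full
    Jacobian matrix; this sketch shows only the scalar idea.
    """
    f = -lam * y                 # right-hand side f(y)
    J = -lam                     # Jacobian df/dy
    k = h * f / (1.0 - h * J)    # the single linear "solve"
    return y + k

# Integrate y' = -y from y(0) = 1 to t = 1 with small steps;
# the result should track the exact solution exp(-t).
y, h = 1.0, 0.001
for _ in range(1000):
    y = rosenbrock1_step(y, 1.0, h)
```

For this scalar problem the step reduces to implicit Euler; the appeal of Rosenbrock methods for stiff chemistry is that each stage needs only a linear solve, which maps well onto SIMD hardware.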


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2009

A comparison of programming models for multiprocessors with explicitly managed memory hierarchies

Scott Schneider; Jae-Seung Yeom; Benjamin Rose; John C. Linford; Adrian Sandu; Dimitrios S. Nikolopoulos

On multiprocessors with explicitly managed memory hierarchies (EMM), software has the responsibility of moving data in and out of fast local memories. This task can be complex and error-prone even for expert programmers. Before we can allow compilers to handle this complexity for us, we must identify the abstractions that are general enough to allow us to write applications with reasonable effort, yet specific enough to exploit the vast on-chip memory bandwidth of EMM multi-processors. To this end, we compare two programming models against hand-tuned codes on the STI Cell, paying attention to programmability and performance. The first programming model, Sequoia, abstracts the memory hierarchy as private address spaces, each corresponding to a parallel task. The second, Cellgen, is a new framework which provides OpenMP-like semantics and the abstraction of a shared address space divided into private and shared data. We compare three applications programmed using these models against their hand-optimized counterparts in terms of abstractions, programming complexity, and performance.
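The programming burden the abstract describes, staging data through a small, fast local memory by hand, can be mimicked in a few lines. This sketch is illustrative only (real EMM code issues explicit DMA transfers into the Cell SPE local store, and the buffer size here is a made-up constant): it processes a large array in fixed-size tiles copied into a small "local" buffer.

```python
LOCAL_STORE = 64  # hypothetical local-memory capacity, in elements

def scale_with_tiling(data, factor):
    """Process `data` tile by tile: copy a tile in, compute on the local
    copy, write the result back.  On an EMM machine the two copies would
    be explicit DMA transfers; getting tile sizes and overlap right by
    hand is exactly the error-prone work that models like Sequoia and
    Cellgen try to hide from the programmer."""
    out = [0.0] * len(data)
    for start in range(0, len(data), LOCAL_STORE):
        tile = data[start:start + LOCAL_STORE]      # "DMA in"
        tile = [x * factor for x in tile]           # compute on local copy
        out[start:start + LOCAL_STORE] = tile       # "DMA out"
    return out

result = scale_with_tiling(list(range(200)), 2.0)
```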


Parallel Computing | 2009

Scalable timestamp synchronization for event traces of message-passing applications

Daniel Becker; Rolf Rabenseifner; Felix Wolf; John C. Linford

Event traces are helpful in understanding the performance behavior of message-passing applications since they allow the in-depth analysis of communication and synchronization patterns. However, the absence of synchronized clocks may render the analysis ineffective because inaccurate relative event timings may misrepresent the logical event order and lead to errors when quantifying the impact of certain behaviors. Although linear offset interpolation can restore consistency to some degree, time-dependent drifts and other inaccuracies may still disarrange the original succession of events - especially during longer runs. The controlled logical clock algorithm accounts for such violations in point-to-point communication by shifting message events in time as much as needed while trying to preserve the length of local intervals. In this article, we describe how the controlled logical clock is extended to collective communication to enable the correction of realistic message-passing traces. We present a parallel version of the algorithm scaling to more than a thousand processes and evaluate its accuracy by showing that it eliminates inconsistent inter-process timings while preserving the length of local intervals.
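The core invariant enforced here is simple: a receive must not be timestamped earlier than its matching send plus some minimal message latency. A minimal sketch of that forward-shifting correction, using hypothetical event records rather than real trace data, could look like this (the actual controlled logical clock also stretches neighboring events to preserve local interval lengths, which this sketch omits):

```python
MIN_LATENCY = 1e-6  # assumed minimal message transfer time, in seconds

def correct_receives(events, matches):
    """Shift receive timestamps forward so every receive occurs at least
    MIN_LATENCY after its matching send.

    `events` maps event id -> timestamp; `matches` is a list of
    (send_id, recv_id) pairs.  Only the violations are repaired here;
    preserving local interval lengths is left out of this sketch.
    """
    corrected = dict(events)
    for send_id, recv_id in matches:
        earliest = corrected[send_id] + MIN_LATENCY
        if corrected[recv_id] < earliest:   # clock-condition violation
            corrected[recv_id] = earliest
    return corrected

# A receive stamped *before* its send, as unsynchronized clocks allow:
events = {"s1": 10.0, "r1": 9.5}
fixed = correct_receives(events, [("s1", "r1")])
```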


International Conference on Parallel Processing | 2008

Replay-Based Synchronization of Timestamps in Event Traces of Massively Parallel Applications

Daniel Becker; John C. Linford; Rolf Rabenseifner; Felix Wolf

Event traces are helpful in understanding the performance behavior of message-passing applications since they allow in-depth analyses of communication and synchronization patterns. However, the absence of synchronized hardware clocks may render the analysis ineffective because inaccurate relative event timings can misrepresent the logical event order and lead to errors when quantifying the impact of certain behaviors. Although linear offset interpolation can restore consistency to some degree, inaccuracies and time-dependent drifts may still disarrange the original succession of events - especially during longer runs. In our earlier work, we have presented an algorithm that removes the remaining violations of the logical event order postmortem and, in addition, have outlined the initial design of a parallel version. Here, we complete the parallel design and describe its implementation within the SCALASCA trace-analysis framework. We demonstrate its suitability for large-scale applications running on more than a thousand application processes and show how the correction can improve the trace analysis of a real-world application example.


IEEE Transactions on Parallel and Distributed Systems | 2011

Automatic Generation of Multicore Chemical Kernels

John C. Linford; John Michalakes; Manish Vachharajani; Adrian Sandu

This work presents the Kinetics Preprocessor: Accelerated (KPPA), a general analysis and code generation tool that achieves significantly reduced time-to-solution for chemical kinetics kernels on three multicore platforms: NVIDIA GPUs using CUDA, the Cell Broadband Engine, and Intel Quad-Core Xeon CPUs. A comparative performance analysis of chemical kernels from WRFChem and the Community Multiscale Air Quality Model (CMAQ) is presented for each platform in double and single precision on coarse and fine grids. We introduce the multicore architecture parameterization that KPPA uses to generate a chemical kernel for these platforms and describe a code generation system that produces highly tuned platform-specific code. Compared to state-of-the-art serial implementations, speedups exceeding 25x are regularly observed, with a maximum observed speedup of 41.1x in single precision.


ACM Symposium on Applied Computing | 2009

Vector stream processing for effective application of heterogeneous parallelism

John C. Linford; Adrian Sandu

Heterogeneous multicore chipsets with many levels of parallelism are becoming increasingly common in high-performance computing systems. Effective use of parallelism in these new chipsets is paramount. We present a 3D chemical transport module optimized for the Cell Broadband Engine Architecture (CBEA). By leveraging the heterogeneous parallelism of the Cell with a method we call vector stream processing, our transport module achieves performance comparable to two nodes of an IBM BlueGene/P, or eight Xeon cores, on a single Cell chip. Performance of the module on two CBEA systems, an IBM BlueGene/P, and an eight-core shared-memory Intel Xeon workstation are given.


Applied Environmental Education & Communication | 2008

The Sustainable Mobility Learning Laboratory: Interactive Web-Based Education on Transportation and the Environment

Lisa Schweitzer; Linsey C. Marr; John C. Linford; Mary Ashburn Darby

The transportation field has for many years been dominated by engineers and other technical specialists. This article describes the Sustainable Mobility Learning Lab (SMLL), a Web-based tool designed to support classroom and university outreach activities to help initiate a more inclusive, nontechnical discussion about the role of transportation technology and human behavioral changes in sustainability. The SMLL includes both a general, “one-room schoolhouse” approach, in which Web users at a variety of levels can participate, along with classroom exercises for more targeted education. In our first runs with the project, the students were engaged with the material and reported positive experiences, and the Web-based format has allowed us to use the materials in a variety of outreach and open source settings.


Workshop on OpenSHMEM and Related Technologies | 2017

Performance Analysis of OpenSHMEM Applications with TAU Commander

John C. Linford; Samuel Khuvis; Sameer Shende; Allen D. Malony; Neena Imam; Manjunath Gorentla Venkata

The TAU Performance System® (TAU) is a powerful and highly versatile profiling and tracing tool ecosystem for performance engineering of parallel programs. Developed over the last twenty years, TAU has evolved with each new generation of HPC systems and scales efficiently to hundreds of thousands of cores. TAU’s organic growth has resulted in a loosely coupled software toolbox, such that novice users first encountering TAU’s complexity and vast array of features are often intimidated and easily frustrated. To lower the barrier to entry for novice TAU users, ParaTools and the US Department of Energy have developed “TAU Commander,” a performance engineering workflow manager that facilitates a systematic approach to performance engineering, guides users through common profiling and tracing workflows, and offers constructive feedback in case of error. This work compares TAU and TAU Commander workflows for common performance engineering tasks in OpenSHMEM applications and demonstrates workflows targeting two different SHMEM implementations, Intel Xeon “Haswell” and “Knights Landing” processors, direct and indirect measurement methods, and callsite, profile, and trace data.


Workshop on OpenSHMEM and Related Technologies | 2016

Profiling Production OpenSHMEM Applications

John C. Linford; Samuel Khuvis; Sameer Shende; Allen D. Malony; Neena Imam; Manjunath Gorentla Venkata

Developing high performance OpenSHMEM applications routinely involves gaining a deeper understanding of software execution, yet there are numerous hurdles to gathering performance metrics in a production environment. Most OpenSHMEM performance profilers rely on the PSHMEM interface but PSHMEM is an optional and often unavailable feature. We present a tool that generates direct measurement performance profiles of OpenSHMEM applications even when PSHMEM is unavailable. The tool operates on dynamically linked and statically linked application binaries, does not require debugging symbols, and functions regardless of compiler optimization level. Integrated in the TAU Performance System, the tool uses automatically-generated wrapper libraries that intercept OpenSHMEM API calls to gather performance metrics with minimal overhead. Dynamically linked applications may use the tool without modifying the application binary in any way.
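The interception idea behind the tool, wrapping each API call so that entry and exit can be timed without modifying the application, can be illustrated in a few lines. This Python sketch is only an analogy: the actual tool generates C wrapper libraries around the OpenSHMEM API, whereas here a decorator wraps a hypothetical stand-in function to accumulate call counts and inclusive time.

```python
import time
from functools import wraps

profile = {}  # function name -> [call_count, total_seconds]

def intercept(fn):
    """Wrap fn so each call records its count and inclusive time, then
    forwards to the original -- the same pattern a generated wrapper
    library applies to each API entry point it intercepts."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        t0 = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            stats = profile.setdefault(fn.__name__, [0, 0.0])
            stats[0] += 1
            stats[1] += time.perf_counter() - t0
    return wrapper

@intercept
def shmem_put_like(data):
    # Hypothetical stand-in for a one-sided communication call.
    return list(data)

for _ in range(3):
    shmem_put_like([1, 2, 3])
```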


Archive | 2015

Comparison of Performance Analysis Tools for Parallel Programs Applied to CombBLAS

Wesley Collins; Daniel T. Martinez; Michael Monaghan; Alexey A. Munishkin; Ari Rapkin Blenkhorn; Jonathan S. Graf; Samuel Khuvis; John C. Linford

Performance analysis tools are powerful tools for high performance computing. By breaking down a program into how long the CPUs spend in each process (profiling) or showing when events take place on a timeline over the course of running a program (tracing), a performance analysis tool can tell the programmer exactly where a program is running slowly. With this information, the programmer can focus on these performance “hotspots,” and the code can be optimized to run faster. We applied the performance analysis tools TAU, ParaTools ThreadSpotter, Intel VTune, Scalasca, HPCToolkit, and Score-P to the example code CombBLAS (Combinatorial BLAS), a C++ implementation of the GraphBLAS, a set of graph algorithms using BLAS (Basic Linear Algebra Subprograms). Using these performance analysis tools on CombBLAS, we found three major “hotspots” and attempted to improve the code. We were unable to improve these “hotspots” due to time limitations, but we offer suggestions for improving the OpenMP calls in the CombBLAS code.
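The profile-then-optimize workflow the abstract describes can be tried on any code with Python's built-in cProfile module; the tools compared above are far more capable, and the hotspot below is a deliberately slow hypothetical function, not CombBLAS itself.

```python
import cProfile
import io
import pstats

def hotspot():
    # Deliberately slow: quadratic pairwise product sum.
    data = list(range(300))
    return sum(a * b for a in data for b in data)

def driver():
    return [hotspot() for _ in range(5)]

# Profile the run, then sort functions by cumulative time so the
# most expensive call paths ("hotspots") appear at the top.
pr = cProfile.Profile()
pr.enable()
driver()
pr.disable()

out = io.StringIO()
pstats.Stats(pr, stream=out).sort_stats("cumulative").print_stats(5)
report = out.getvalue()
```

Scanning `report` points the programmer straight at `hotspot`, which is the same "where is the time going" question the tools in the study answer at supercomputer scale.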

Collaboration


Dive into John C. Linford's collaborations.

Top Co-Authors

John Michalakes

National Center for Atmospheric Research


Manish Vachharajani

University of Colorado Boulder


Felix Wolf

Technische Universität Darmstadt


Neena Imam

Oak Ridge National Laboratory
