
Publication


Featured research published by Ian Karlin.


international parallel and distributed processing symposium | 2013

Exploring Traditional and Emerging Parallel Programming Models Using a Proxy Application

Ian Karlin; Abhinav Bhatele; Jeff Keasler; Bradford L. Chamberlain; Jonathan D. Cohen; Zachary DeVito; Riyaz Haque; Dan Laney; Edward A. Luke; Felix Wang; David F. Richards; Martin Schulz; Charles H. Still

Parallel machines are becoming more complex with increasing core counts and more heterogeneous architectures. However, the commonly used parallel programming models, C/C++ with MPI and/or OpenMP, make it difficult to write source code that is easily tuned for many targets. Newer language approaches attempt to ease this burden by providing optimization features such as automatic load balancing, overlap of computation and communication, message-driven execution, and implicit data layout optimizations. In this paper, we compare several implementations of LULESH, a proxy application for shock hydrodynamics, to determine the strengths and weaknesses of different programming models for parallel computation. We focus on four traditional (OpenMP, MPI, MPI+OpenMP, CUDA) and four emerging (Chapel, Charm++, Liszt, Loci) programming models. In evaluating these models, we consider programmer productivity, performance, and ease of applying optimizations.
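The trade-offs among these models are easiest to see in the kind of kernel LULESH is built from. As a hedged illustration (this is invented example code, not taken from LULESH or any of the cited implementations), the sketch below mimics, sequentially and in pure Python, what an MPI-style domain decomposition does for a 1-D three-point stencil: each "rank" owns a slice of the mesh and reads one ghost cell per side from its neighbors.

```python
def serial_stencil(u):
    """Three-point average with fixed endpoints (illustrative kernel)."""
    out = u[:]
    for i in range(1, len(u) - 1):
        out[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    return out

def decomposed_stencil(u, ranks):
    """Simulate an MPI-style domain decomposition sequentially:
    each 'rank' owns u[lo:hi]; u[lo-1] and u[hi] stand in for the
    ghost cells it would receive from neighboring ranks."""
    n = len(u)
    out = u[:]
    for r in range(ranks):
        lo, hi = r * n // ranks, (r + 1) * n // ranks
        for i in range(max(lo, 1), min(hi, n - 1)):
            # u[i-1] or u[i+1] may be a ghost cell owned by a neighbor
            out[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    return out
```

In a real MPI code each rank stores only its slice plus ghost cells, and the ghost exchange is the communication step that models like MPI and Charm++ express explicitly while Chapel, Liszt, and Loci largely hide it.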


international conference on conceptual structures | 2016

High-performance Tensor Contractions for GPUs

Ahmad Abdelfattah; Marc Baboulin; Veselin Dobrev; Jack J. Dongarra; Christopher Earl; Joel Falcou; Azzam Haidar; Ian Karlin; Tzanio V. Kolev; Ian Masliah; Stanimire Tomov

We present a computational framework for high-performance tensor contractions on GPUs. High performance is difficult to obtain using existing libraries, especially for many independent contractions where each contraction is very small, e.g., sub-vector/warp in size. However, by using our framework to batch contractions and exploit application-specific knowledge, we demonstrate close to peak performance. In particular, to accelerate large-scale tensor-formulated high-order finite element method (FEM) simulations, which is the main focus and motivation for this work, we represent contractions as tensor index reordering plus matrix-matrix multiplications (GEMMs). This is a key factor in achieving many-fold algorithmic acceleration (vs. not using it) due to the possible reuse of data loaded in fast memory. In addition to using this context knowledge, we design tensor data structures, tensor algebra interfaces, and new tensor contraction algorithms and implementations to achieve 90+% of a theoretically derived peak on GPUs. On a K40c GPU, for contractions resulting in GEMMs on square matrices of size 8, for example, we are 2.8× faster than CUBLAS, and 8.5× faster than MKL on 16 cores of Intel Xeon E5-2670 (Sandy Bridge) 2.60GHz CPUs. Finally, we apply autotuning and code generation techniques to simplify tuning and provide an architecture-aware, user-friendly interface.
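The "index reordering plus GEMM" idea can be sketched in a few lines. Purely as a hypothetical pure-Python illustration (the paper's implementations are GPU code): a contraction C(i,j,k) = Σ_l A(i,l)·B(l,j,k) becomes a single matrix-matrix multiply once B's (j,k) indices are flattened into one column index.

```python
def contract_naive(A, B, I, L, J, K):
    # C[i][j][k] = sum_l A[i][l] * B[l][j][k], four nested loops.
    C = [[[0.0] * K for _ in range(J)] for _ in range(I)]
    for i in range(I):
        for j in range(J):
            for k in range(K):
                for l in range(L):
                    C[i][j][k] += A[i][l] * B[l][j][k]
    return C

def contract_as_gemm(A, B, I, L, J, K):
    # Flatten B's (j,k) indices into one column index: Bmat[l][j*K + k].
    Bmat = [[B[l][j][k] for j in range(J) for k in range(K)] for l in range(L)]
    # One (I x L) by (L x J*K) matrix multiply does the whole contraction.
    Cmat = [[sum(A[i][l] * Bmat[l][c] for l in range(L)) for c in range(J * K)]
            for i in range(I)]
    # Un-flatten the result back to C[i][j][k].
    return [[[Cmat[i][j * K + k] for k in range(K)] for j in range(J)]
            for i in range(I)]
```

Recasting the contraction as one GEMM is what enables the data reuse in fast memory that the abstract refers to: a tuned GEMM kernel loads each tile of A and Bmat once and reuses it many times.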


international parallel and distributed processing symposium | 2014

Characterizing the Impact of Program Optimizations on Power and Energy for Explicit Hydrodynamics

Edgar A. León; Ian Karlin

With the end of Dennard scaling, future systems will be constrained by power and energy. This will impact application developers by forcing them to restructure and optimize their algorithms in terms of these resources. In this paper, we analyze the impact of different code optimizations on power, energy, and execution time. Our optimizations include loop fusion, data structure transformations, global allocation, and compiler selection. We analyze the static and dynamic components of power and energy as applied to the processor chip and memory domains within a system. In addition, our analysis correlates energy and power changes with performance events and shows that data motion is highly correlated with memory power and energy usage and executed instructions are partially correlated with processor power and energy. Our results demonstrate key tradeoffs among power, energy, and execution time for explicit hydrodynamics via a representative kernel. In particular, we observe that loop fusion and compiler selection improve all objectives, while global allocation and data layout transformations present tradeoffs that are objective-dependent.
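Of the optimizations listed, loop fusion is the most self-contained to illustrate. A hedged sketch (invented example arrays, not the paper's hydrodynamics code): two passes over the same data are fused into one, so each element is streamed through the memory hierarchy once instead of twice, while the numerical result is unchanged. Less data motion is exactly the quantity the abstract correlates with memory power and energy.

```python
def two_pass(a, b):
    # Pass 1: scale a; Pass 2: add b. The intermediate array tmp
    # is written out and read back, doubling memory traffic.
    tmp = [2.0 * x for x in a]
    return [t + y for t, y in zip(tmp, b)]

def fused(a, b):
    # Fused loop: each element of a and b is loaded exactly once,
    # and no intermediate array is materialized.
    return [2.0 * x + y for x, y in zip(a, b)]
```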


ieee international conference on high performance computing data and analytics | 2016

Characterizing parallel scientific applications on commodity clusters: an empirical study of a tapered fat-tree

Edgar A. León; Ian Karlin; Abhinav Bhatele; Steven H. Langer; Chris Chambreau; Louis H. Howell; Trent D'Hooge; Matthew L. Leininger

Understanding the characteristics and requirements of applications that run on commodity clusters is key to properly configuring current machines and, more importantly, procuring future systems effectively. There are only a few studies, however, that are current and characterize realistic workloads. For HPC practitioners and researchers, this limits our ability to design solutions that will have an impact on real systems. We present a systematic study that characterizes applications with an emphasis on communication requirements. It includes cluster utilization data, identifying a representative set of applications from a U.S. Department of Energy laboratory, and characterizing their communication requirements. The driver for this work is understanding application sensitivity to a tapered fat-tree network. These results provided key insights into the procurement of our next generation commodity systems. We believe this investigation can provide valuable input to the HPC community in terms of workload characterization and requirements from a large supercomputing center.
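The "taper" in a tapered fat-tree is simple arithmetic: if an edge switch has D links down to compute nodes but only U links up toward the core, the tree is oversubscribed by D/U relative to a full-bisection fat-tree. A hypothetical helper (the link counts below are invented for illustration, not the procurement's actual configuration):

```python
def taper_ratio(down_links, up_links):
    """Oversubscription of an edge switch: 1.0 means full bisection
    bandwidth; values above 1.0 mean the fat-tree is tapered and
    global (inter-switch) bandwidth is reduced by that factor."""
    return down_links / up_links
```

A 2:1 taper, e.g. 24 node-facing links against 12 uplinks, halves the global bandwidth available to communication-heavy applications, which is why the study's emphasis on measured communication requirements matters for procurement.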


international conference on cluster computing | 2016

Fast Multi-parameter Performance Modeling

Alexandru Calotoiu; David Beckingsale; Christopher Earl; Torsten Hoefler; Ian Karlin; Martin Schulz; Felix Wolf

Tuning large applications requires a clever exploration of the design and configuration space. Especially on supercomputers, this space is so large that its exhaustive traversal via performance experiments becomes too expensive, if not impossible. Manually creating analytical performance models provides insights into optimization opportunities but is extremely laborious if done for applications of realistic size. If we must consider multiple performance-relevant parameters and their possible interactions, a common requirement, this task becomes even more complex. We build on previous work on automatic scalability modeling and significantly extend it to allow insightful modeling of any combination of application execution parameters. Multi-parameter modeling has so far been outside the reach of automatic methods due to the exponential growth of the model search space. We develop a new technique to traverse the search space rapidly and generate insightful performance models that enable a wide range of uses from performance predictions for balanced machine design to performance tuning.
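The core idea can be shown as a toy sketch (the actual tool's search space, term combinations, and fitting are far richer; all names below are invented for this example): enumerate candidate single-term models c·n^i·log2(n)^j, fit c for each by one-parameter least squares, and keep the term with the smallest residual. Pruning and ranking terms this way is what makes traversing an exponentially large model space tractable.

```python
import math

def best_term(ns, ys, exps=(0, 1, 2), logexps=(0, 1, 2)):
    """Return (i, j, c) minimizing sum((y - c * n**i * log2(n)**j)**2)
    over the candidate exponent grid -- a toy one-term model search."""
    best = None
    for i in exps:
        for j in logexps:
            f = [n ** i * math.log2(n) ** j for n in ns]
            # Closed-form least squares for the single coefficient c.
            c = sum(y * v for y, v in zip(ys, f)) / sum(v * v for v in f)
            res = sum((y - c * v) ** 2 for y, v in zip(ys, f))
            if best is None or res < best[0]:
                best = (res, i, j, c)
    return best[1], best[2], best[3]
```

On synthetic timings following 3·n·log2(n), the search recovers the n·log2(n) term with coefficient near 3, the kind of insight (here, a sort-like scaling behavior) the abstract describes deriving automatically.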


ieee international conference on high performance computing data and analytics | 2017

DataRaceBench: a benchmark suite for systematic evaluation of data race detection tools

Chunhua Liao; Pei-Hung Lin; Joshua Asplund; Markus Schordan; Ian Karlin

Data races in multi-threaded parallel applications are notoriously damaging yet extremely difficult to detect. Many tools have been developed to help programmers find data races. However, there is no dedicated OpenMP benchmark suite to systematically evaluate data race detection tools for their strengths and limitations. In this paper, we present DataRaceBench, an open-source benchmark suite designed to systematically and quantitatively evaluate the effectiveness of data race detection tools. We focus on data race detection in programs written in OpenMP, the popular parallel programming model for multi-threaded applications. In particular, DataRaceBench includes a set of microbenchmark programs with or without data races. These microbenchmarks are either manually written, extracted from real scientific applications, or automatically generated optimization variants. We also define several metrics to represent the effectiveness and efficiency of data race detection tools. Using DataRaceBench and its metrics, we evaluate four different data race detection tools: Helgrind, ThreadSanitizer, Archer, and Intel Inspector. The evaluation results show that DataRaceBench is effective in providing comparable, quantitative results and discovering the strengths and weaknesses of the tools being evaluated.
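DataRaceBench's microbenchmarks are OpenMP C/C++ programs; purely as an analogous, hypothetical illustration in Python threads, the pair below shows the pattern such a benchmark pair encodes: the same shared-counter update with and without synchronization. (CPython's GIL makes the racy version's lost updates possible but not guaranteed, so only the synchronized variant has a deterministic result.)

```python
import threading

def count_racy(n_threads=4, iters=50_000):
    """Unsynchronized shared counter: the read-modify-write on
    `total` is a data race, so updates may be lost."""
    total = 0
    def work():
        nonlocal total
        for _ in range(iters):
            total += 1  # race: load, add, store are not atomic
    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total

def count_locked(n_threads=4, iters=50_000):
    """The race-free variant: the same loop guarded by a lock."""
    total = 0
    lock = threading.Lock()
    def work():
        nonlocal total
        for _ in range(iters):
            with lock:
                total += 1
    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total
```

A race detector is expected to flag the first function and stay silent on the second; scoring tools against labeled pairs like this is what the suite's metrics quantify.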


international workshop on openmp | 2016

Early Experiences Porting Three Applications to OpenMP 4.5

Ian Karlin; Tom Scogland; Arpith C. Jacob; Samuel F. Antao; Gheorghe-Teodor Bercea; Carlo Bertolli; Bronis R. de Supinski; Erik W. Draeger; Alexandre E. Eichenberger; Jim Glosli; Holger E. Jones; Adam Kunen; David Poliakoff; David F. Richards

Many application developers need code that runs efficiently on multiple architectures, but cannot afford to maintain architecturally specific codes. With the addition of target directives to support offload accelerators, OpenMP now has the machinery to support performance-portable code development. In this paper, we describe application ports of Kripke, Cardioid, and LULESH to OpenMP 4.5 and discuss our successes and failures. Challenges encountered include how OpenMP interacts with C++, including classes with virtual methods and lambda functions. Also, the lack of deep copy support in OpenMP increased code complexity. Finally, the GPU's inability to handle virtual function calls required code restructuring. Despite these challenges, we demonstrate that OpenMP obtains performance within 10% of hand-written CUDA for memory-bandwidth-bound kernels in LULESH. In addition, we show that a minor change to the OpenMP standard can reduce register usage for OpenMP code by up to 10%.


international conference on cluster computing | 2015

Optimizing Explicit Hydrodynamics for Power, Energy, and Performance

Edgar A. León; Ian Karlin; Ryan E. Grant

Practical considerations for future supercomputer designs will impose limits on both instantaneous power consumption and total energy consumption. Working within these constraints while providing the maximum possible performance, application developers will need to optimize their code for speed alongside power and energy concerns. This paper analyzes the effectiveness of several code optimizations including loop fusion, data structure transformations, and global allocations. A per component measurement and analysis of different architectures is performed, enabling the examination of code optimizations on different compute subsystems. Using an explicit hydrodynamics proxy application from the U.S. Department of Energy, LULESH, we show how code optimizations impact different computational phases of the simulation. This provides insight for simulation developers into the best optimizations to use during particular simulation compute phases when optimizing code for future supercomputing platforms. We examine and contrast both x86 and Blue Gene architectures with respect to these optimizations.
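Among the optimizations studied, the data-structure transformation is the easiest to sketch. A hedged, hypothetical example (invented field names, not LULESH's actual layout): converting an array-of-structures to a structure-of-arrays so that a kernel touching only one field streams a single contiguous sequence; the numerical result is unchanged, only the memory-access pattern differs. In Python the dictionaries merely model the two layouts; in C the SoA form is the one that turns strided loads into contiguous, cache-friendly ones.

```python
# Array-of-structures: each cell carries all of its fields together.
aos = [{"x": float(i), "y": 2.0 * i, "e": 0.5 * i} for i in range(8)]

def total_energy_aos(cells):
    # In the C analogue this loads x and y into cache even though
    # only the e field is needed.
    return sum(c["e"] for c in cells)

# Structure-of-arrays: one contiguous sequence per field.
soa = {"x": [c["x"] for c in aos],
       "y": [c["y"] for c in aos],
       "e": [c["e"] for c in aos]}

def total_energy_soa(fields):
    # Streams only the e array -- less data motion for this phase.
    return sum(fields["e"])
```

Which layout wins depends on the simulation phase: a kernel that reads every field per cell favors AoS, while single-field sweeps favor SoA, which is why the paper analyzes optimizations phase by phase.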


parallel computing | 2016

Program optimizations

Edgar A. León; Ian Karlin; Ryan E. Grant; Matthew G. F. Dosanjh

Highlights: We provide an analysis of the power and energy effects of program optimizations. The analysis relies on per-application-phase and per-system-component studies. We provide guidance on tradeoffs when tuning for performance, power, and energy. We identify energy and runtime correlations for optimizations on three architectures. Multi-objective optimizations require per-component and per-application-phase analysis.

Practical considerations for future supercomputer designs will impose limits on both instantaneous power consumption and total energy consumption. Working within these constraints while providing the maximum possible performance, application developers will need to optimize their code for speed alongside power and energy concerns. This paper analyzes the effectiveness of several code optimizations including loop fusion, data structure transformations, and global allocations. A per component measurement and analysis of different architectures is performed, enabling the examination of code optimizations on different compute subsystems. Using an explicit hydrodynamics proxy application from the U.S. Department of Energy, LULESH, we show how code optimizations impact different computational phases of the simulation. This provides insight for simulation developers into the best optimizations to use during particular simulation compute phases when optimizing code for future supercomputing platforms. We examine and contrast both x86 and Blue Gene architectures with respect to these optimizations.


international parallel and distributed processing symposium | 2016

System Noise Revisited: Enabling Application Scalability and Reproducibility with SMT

Edgar A. León; Ian Karlin; Adam Moody

Despite significant advances in reducing system noise, the scalability and performance of scientific applications running on production commodity clusters today continue to suffer from the effects of noise. Unlike custom and expensive leadership systems, the Linux ecosystem provides a rich set of services that application developers utilize to increase productivity and to ease porting. The cost is the overhead that these services impose on a running application, negatively impacting its scalability and performance reproducibility. In this work, we propose and evaluate a simple yet effective way to isolate an application from system processes by leveraging Simultaneous Multi-Threading (SMT), a pervasive architectural feature on current systems. Our method requires no changes to the operating system or to the application. We quantify its effectiveness on a diverse set of scientific applications of interest to the U.S. Department of Energy, showing performance improvements of up to 2.4 times at 16,384 tasks for a high-order finite elements shock hydrodynamics application. Finally, we provide guidance to system and application developers on how to best leverage SMT under different application characteristics and scales.
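The isolation mechanism can be approximated from user space. A hypothetical sketch using Linux CPU affinity (the paper's approach confines OS noise to SMT sibling threads; here we only show pinning the current process to an explicit CPU set, which relies on the same system facility):

```python
import os

def pin_to_cpus(cpus):
    """Restrict the calling process to the given set of logical CPUs,
    e.g. the primary hardware thread of each core, leaving the SMT
    siblings free to absorb system daemons and kernel work.
    Linux-only: uses sched_setaffinity via the os module."""
    os.sched_setaffinity(0, set(cpus))
    return os.sched_getaffinity(0)
```

In the scheme the abstract describes, the application would be confined to one hardware thread per core while system processes run on the siblings, so OS noise no longer preempts application work, without any OS or application changes.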

Collaboration


Dive into Ian Karlin's collaborations.

Top Co-Authors

Edgar A. León, Lawrence Livermore National Laboratory
Christopher Earl, Lawrence Livermore National Laboratory
Azzam Haidar, University of Tennessee
Tzanio V. Kolev, Lawrence Livermore National Laboratory
Veselin Dobrev, Lawrence Livermore National Laboratory
Ian Masliah, University of Paris-Sud
Joel Falcou, University of Paris-Sud
Abhinav Bhatele, Lawrence Livermore National Laboratory
Jeff Keasler, Lawrence Livermore National Laboratory