Jeff Keasler
Lawrence Livermore National Laboratory
Publications
Featured research published by Jeff Keasler.
International Parallel and Distributed Processing Symposium | 2013
Ian Karlin; Abhinav Bhatele; Jeff Keasler; Bradford L. Chamberlain; Jonathan D. Cohen; Zachary DeVito; Riyaz Haque; Dan Laney; Edward A. Luke; Felix Wang; David F. Richards; Martin Schulz; Charles H. Still
Parallel machines are becoming more complex with increasing core counts and more heterogeneous architectures. However, the commonly used parallel programming models, C/C++ with MPI and/or OpenMP, make it difficult to write source code that is easily tuned for many targets. Newer language approaches attempt to ease this burden by providing optimization features such as automatic load balancing, overlap of computation and communication, message-driven execution, and implicit data layout optimizations. In this paper, we compare several implementations of LULESH, a proxy application for shock hydrodynamics, to determine strengths and weaknesses of different programming models for parallel computation. We focus on four traditional (OpenMP, MPI, MPI+OpenMP, CUDA) and four emerging (Chapel, Charm++, Liszt, Loci) programming models. In evaluating these models, we consider programmer productivity, performance, and ease of applying optimizations.
IEEE International Conference on High Performance Computing, Data, and Analytics | 2012
Hari Subramoni; Sreeram Potluri; Krishna Chaitanya Kandalla; Bill Barth; Jérôme Vienne; Jeff Keasler; Karen Tomko; Karl W. Schulz; Adam Moody; Dhabaleswar K. Panda
Over the last decade, InfiniBand has become an increasingly popular interconnect for deploying modern supercomputing systems. However, there exists no detection service that can discover the underlying network topology in a scalable manner and expose this information to runtime libraries and users of the high performance computing systems in a convenient way. In this paper, we design a novel and scalable method to detect the InfiniBand network topology by using Neighbor-Joining techniques (NJ). To the best of our knowledge, this is the first instance where the neighbor joining algorithm has been applied to solve the problem of detecting InfiniBand network topology. We also design a network-topology-aware MPI library that takes advantage of the network topology service. The library places processes taking part in the MPI job in a network-topology-aware manner with the dual aim of increasing intra-node communication and reducing the long distance inter-node communication across the InfiniBand fabric.
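The core of neighbor joining is a pair-selection step over a pairwise distance matrix. The sketch below is a generic illustration of that step (the standard Q-criterion), not the authors' topology-detection implementation:

```cpp
#include <limits>
#include <utility>
#include <vector>

// One selection step of neighbor joining: given an n x n distance matrix d,
// pick the pair (i, j) minimizing the Q-criterion
//   Q(i,j) = (n-2)*d[i][j] - sum_k d[i][k] - sum_k d[j][k].
// In topology detection, "distances" would come from measured hop counts
// or latencies between nodes (an assumption for this sketch).
std::pair<int, int> nj_select(const std::vector<std::vector<double>>& d) {
    int n = static_cast<int>(d.size());
    std::vector<double> rowSum(n, 0.0);
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < n; ++k) rowSum[i] += d[i][k];

    double best = std::numeric_limits<double>::infinity();
    std::pair<int, int> bestPair{-1, -1};
    for (int i = 0; i < n; ++i)
        for (int j = i + 1; j < n; ++j) {
            double q = (n - 2) * d[i][j] - rowSum[i] - rowSum[j];
            if (q < best) { best = q; bestPair = {i, j}; }
        }
    return bestPair;  // these two nodes are joined under a new parent
}
```

A full run repeats this step, replacing the joined pair with a new internal node and recomputing distances, until the tree is resolved.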
Archive | 2013
Kamal Sharma; Ian Karlin; Jeff Keasler; James R. McGraw; Vivek Sarkar
This paper describes a new approach to managing array data layouts to optimize performance for scientific codes. Prior research has shown that changing data layouts (e.g., interleaving arrays) can improve performance. However, there have been two major reasons why such optimizations are not widely used: (1) the need to select different layouts for different computing platforms, and (2) the cost of re-writing codes to use new layouts. We describe a source-to-source translation process that allows us to generate codes with different array interleavings, based on a data layout specification. We used this process to generate 19 different data layouts for an ASC benchmark code (IRSmk) and 32 different data layouts for the DARPA UHPC challenge application (LULESH). Performance results for multicore versions of the benchmarks with different layouts show significant benefits on four computing platforms (IBM POWER7, AMD APU, Intel Sandybridge, IBM BG/Q). For IRSmk, our results show performance improvements ranging from 22.23× on IBM POWER7 to 1.10× on Intel Sandybridge. For LULESH, we see improvements ranging from 1.82× on IBM POWER7 to 1.02× on Intel Sandybridge. We also developed a new optimization algorithm to recommend a layout for an input source program and specific target machine characteristics. Our results show that this automated layout algorithm outperforms the manual layouts in one case and performs within 10% of the best architecture-specific layout in all other cases but one.
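The layout change at the heart of this work can be illustrated with a minimal sketch (hypothetical names, not the paper's tool): the same kernel runs over separate arrays or over one interleaved array, and only the indexing differs, which is exactly what a source-to-source translator can rewrite from a layout specification.

```cpp
#include <cstddef>
#include <vector>

// Layout A: three separate per-element arrays (x | y | z).
struct Separate {
    std::vector<double> x, y, z;
    explicit Separate(std::size_t n) : x(n), y(n), z(n) {}
    double& X(std::size_t i) { return x[i]; }
    double& Y(std::size_t i) { return y[i]; }
    double& Z(std::size_t i) { return z[i]; }
};

// Layout B: one interleaved array (x0 y0 z0 x1 y1 z1 ...).
struct Interleaved {
    std::vector<double> v;
    explicit Interleaved(std::size_t n) : v(3 * n) {}
    double& X(std::size_t i) { return v[3 * i + 0]; }
    double& Y(std::size_t i) { return v[3 * i + 1]; }
    double& Z(std::size_t i) { return v[3 * i + 2]; }
};

// The kernel is written once against the accessors; the layout is a
// template parameter, so switching layouts does not touch the loop body.
template <class Layout>
double sumOfProducts(Layout& a, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i) s += a.X(i) * a.Y(i) + a.Z(i);
    return s;
}
```

Which layout wins depends on the access pattern and the machine's cache and vector hardware, which is why the paper evaluates many layouts per platform.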
IEEE Transactions on Parallel and Distributed Systems | 2017
Didem Unat; Anshu Dubey; Torsten Hoefler; John Shalf; Mark James Abraham; Mauro Bianco; Bradford L. Chamberlain; Romain Cledat; H. Carter Edwards; Hal Finkel; Karl Fuerlinger; Frank Hannig; Emmanuel Jeannot; Amir Kamil; Jeff Keasler; Paul H. J. Kelly; Vitus J. Leung; Hatem Ltaief; Naoya Maruyama; Chris J. Newburn; Miquel Pericàs
The cost of data movement has always been an important concern in high performance computing (HPC) systems. It has now become the dominant factor in terms of both energy consumption and performance. Support for expression of data locality has been explored in the past, but those efforts have had only modest success in being adopted in HPC applications for various reasons. However, with the increasing complexity of the memory hierarchy and higher parallelism in emerging HPC systems, locality management has acquired a new urgency. Developers can no longer limit themselves to low-level solutions and ignore the potential for productivity and performance portability obtained by using locality abstractions. Fortunately, the trend emerging in recent literature on the topic alleviates many of the concerns that got in the way of their adoption by application developers. Data locality abstractions are available in the forms of libraries, data structures, languages and runtime systems; a common theme is increasing productivity without sacrificing performance. This paper examines these trends and identifies commonalities that can combine various locality concepts to develop a comprehensive approach to expressing and managing data locality on future large-scale high-performance computing systems.
International Workshop on OpenMP | 2015
Tom Scogland; Jeff Keasler; John C. Gyllenhaal; Rich Hornung; Bronis R. de Supinski; Hal Finkel
Code-passing abstractions based on lambdas and blocks are becoming increasingly popular to capture repetitive patterns that are amenable to parallelization. These abstractions improve code maintainability and simplify choosing from a range of mechanisms to implement parallelism. Several frameworks that use this model, including RAJA and Kokkos, employ OpenMP as one of their target parallel models. However, OpenMP inadequately supports the abstraction since it frequently requires information that is not available within the abstraction. Thus, OpenMP requires access to variables and parameters not directly supplied by the base language. This paper explores the issues with supporting these abstractions in OpenMP, with a particular focus on device constructs and the aggregation and passing of OpenMP state through base language abstractions. We propose mechanisms to improve support for these abstractions and also to reduce the burden of duplication in existing OpenMP applications.
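The code-passing pattern the paper discusses can be sketched as follows. This is a minimal illustration in the spirit of RAJA/Kokkos, not either library's actual API: the loop body is captured in a lambda, and an execution-policy tag selects how it runs. Note how the OpenMP pragma needs to sit inside the abstraction, away from the user's loop body, which is the mismatch the paper examines.

```cpp
#include <cstddef>

// Execution-policy tags (hypothetical names for this sketch).
struct seq_exec {};  // plain sequential loop
struct omp_exec {};  // OpenMP worksharing loop

// forall: the framework owns the loop; the caller passes only the body.
template <class Body>
void forall(seq_exec, std::size_t n, Body body) {
    for (std::size_t i = 0; i < n; ++i) body(i);
}

template <class Body>
void forall(omp_exec, std::size_t n, Body body) {
    // The pragma lives here, inside the framework, not at the call site.
    // Compiled without OpenMP, the pragma is ignored and the loop runs
    // serially, so the code remains correct either way.
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(n); ++i)
        body(static_cast<std::size_t>(i));
}
```

A call site looks like `forall(omp_exec{}, n, [&](std::size_t i) { c[i] = a[i] + b[i]; });` — switching policies changes one tag, not the loop body.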
European Conference on Parallel Processing | 2015
Kamal Sharma; Ian Karlin; Jeff Keasler; James R. McGraw; Vivek Sarkar
This paper describes a new approach to managing data layouts to optimize performance for array-intensive codes. Prior research has shown that changing data layouts (e.g., interleaving arrays) can improve performance. However, there have been two major reasons why such optimizations are not widely used in practice: (1) the challenge of selecting an optimized layout for a given computing platform, and (2) the cost of re-writing codes to use different layouts for different platforms. We describe a source-to-source code transformation process that enables the generation of different codes with different array interleavings from the same source program, controlled by data layout specifications that are defined separately from the program. Performance results for multicore versions of the benchmarks show significant benefits on four different computing platforms (up to 22.23× for IRSmk, up to 3.68× for SRAD and up to 1.82× for LULESH). We also developed a new optimization algorithm to recommend a good layout for a given source program and specific target machine characteristics. Our results show that this algorithm achieves 78%–95% of the performance of the best manual layout on each platform for different benchmarks (IRSmk, SRAD, LULESH).
International Workshop on OpenMP | 2015
Tom Scogland; John C. Gyllenhaal; Jeff Keasler; Rich Hornung; Bronis R. de Supinski
Maximizing the scope of a parallel region, which avoids the costs of barriers and of launching additional parallel regions, is among the first recommendations in any optimization guide for OpenMP. While clearly beneficial and easily accomplished for code where regions are visibly contiguous, regions often become contiguous only after compiler optimization or resolution of abstraction layers. This paper explores changes to the OpenMP specification that would allow implementations to merge adjacent parallel regions automatically, including the removal of issues that make the transformation non-conforming and the addition of hints that facilitate the optimization. Beyond simple merging, we explore hints to fuse workshared loops that occur in syntactically distinct parallel regions or to apply nowait to such loops. Our evaluation shows these changes can provide an overall speedup of 2–8× for a microbenchmark, or 6% for a representative physics application.
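The transformation being motivated can be shown with standard OpenMP directives (a generic illustration, not the paper's proposed hints): two adjacent parallel regions pay two region launches and two implied barriers, whereas one merged region with `nowait` on the first loop removes both costs, which is safe here because the two loops touch different arrays.

```cpp
#include <vector>

// Before: two back-to-back parallel regions, each with its own launch
// cost and implied barrier at the end of the workshared loop.
void unmerged(std::vector<double>& a, std::vector<double>& b) {
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(a.size()); ++i) a[i] *= 2.0;
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(b.size()); ++i) b[i] += 1.0;
}

// After: one region hosts both loops; "nowait" drops the barrier after
// the first loop. This is only valid because the loops are independent
// (they write disjoint arrays).
void merged(std::vector<double>& a, std::vector<double>& b) {
    #pragma omp parallel
    {
        #pragma omp for nowait
        for (long i = 0; i < static_cast<long>(a.size()); ++i) a[i] *= 2.0;
        #pragma omp for
        for (long i = 0; i < static_cast<long>(b.size()); ++i) b[i] += 1.0;
    }
}
```

The paper's point is that this rewrite is often impossible for a programmer to do by hand when the two regions only become adjacent after abstraction layers are resolved, hence the proposal to let the implementation merge them.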
IEEE International Conference on High Performance Computing, Data, and Analytics | 2012
Ian Karlin; Jim McGraw; Esthela Gallardo; Jeff Keasler; Edgar A. León; Bert Still
Current and planned computer systems present challenges for scientific programming. Memory capacity and bandwidth are limiting performance as floating point capability increases due to more cores per processor and wider vector units. Effectively using hardware requires finding greater parallelism in programs while using relatively less memory. In this poster, we present how we tuned the Livermore Unstructured Lagrange Explicit Shock Hydrodynamics proxy application for on-node performance, resulting in 62% fewer memory reads, a 19% smaller memory footprint, 770% more floating point operations vectorized, and less than 0.1% serial section runtime. Tests show serial code runtime decreases of up to 57% and parallel runtime reductions of up to 75%. We are also applying these optimizations to GPUs and a subset of ALE3D, from which the proxy application was derived. So far we achieve up to a 1.9× speedup on GPUs, and a 13% runtime reduction in the application for the same problem.
Archive | 2015
Rich Hornung; Holger Jones; Jeff Keasler; Rob Neely; Olga Pearce; Si Hammond; Christian Trott; Paul Lin; Jeanine Cook; Rob Hoekstra; Ben Bergen; Josh Payne; Geoff Womeldorff
Presented at IEEE IPDPS 2012, Shanghai, China, May 21–25, 2012 | 2011
Krishna Chaitanya Kandalla; Ulrike Meier Yang; Jeff Keasler; Tzanio V. Kolev; Adam Moody; Hari Subramoni; Karen Tomko; Jérôme Vienne; Dhabaleswar K. Panda