Kalyan Kumaran
Argonne National Laboratory
Publications
Featured research published by Kalyan Kumaran.
IEEE International Conference on High Performance Computing, Data and Analytics | 2011
Jiayuan Meng; Vitali A. Morozov; Kalyan Kumaran; Venkatram Vishwanath; Thomas D. Uram
We propose GROPHECY, a GPU performance projection framework that can estimate the performance benefit of GPU acceleration without actual GPU programming or hardware. Users need only skeletonize the pieces of CPU code that are targets for GPU acceleration. Code skeletons are automatically transformed in various ways to mimic tuned GPU codes with characteristics resembling real implementations. The synthesized characteristics are used by an existing analytical model to project GPU performance. The cost and benefit of GPU development can then be estimated according to the transformed code skeleton that yields the best projected performance. With GROPHECY, users can leap toward GPU acceleration only when the cost-benefit tradeoff makes sense. The framework is validated using kernel benchmarks and data-parallel codes in legacy scientific applications. The measured performance of manually tuned codes deviates from the projected performance by a geometric mean of 17%.
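The projection step can be illustrated with a much-simplified stand-in for the analytical model that GROPHECY feeds with synthesized skeleton characteristics: a roofline-style bound evaluated over candidate skeleton transformations. The skeleton fields, hardware numbers, and variant names below are assumptions for illustration only, not the framework's actual inputs or model.

```python
# Illustrative sketch only: a roofline-style bound used as a stand-in for the
# analytical GPU model that GROPHECY feeds with synthesized skeleton characteristics.
# The skeleton fields and hardware numbers below are assumptions, not the paper's.

def project_kernel_time(flops, bytes_moved, peak_gflops=1300.0, peak_gbps=250.0):
    """Lower-bound execution time (seconds) for a kernel skeleton."""
    compute_time = flops / (peak_gflops * 1e9)       # compute-bound limit
    memory_time = bytes_moved / (peak_gbps * 1e9)    # bandwidth-bound limit
    return max(compute_time, memory_time)

# A "skeleton" reduced to aggregate characteristics of each candidate transformation.
skeleton_variants = [
    {"name": "naive", "flops": 2e9, "bytes": 8e9},
    {"name": "tiled", "flops": 2e9, "bytes": 2e9},   # data reuse cuts memory traffic
]

best = min(skeleton_variants, key=lambda s: project_kernel_time(s["flops"], s["bytes"]))
print(f"best variant: {best['name']}, "
      f"projected time: {project_kernel_time(best['flops'], best['bytes']):.4f} s")
```

The real framework explores far richer transformations (thread mappings, coalescing, shared-memory tiling) with a much more detailed GPU model; the sketch only conveys the pick-the-best-projected-variant workflow.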
International Workshop on OpenMP | 2012
Matthias S. Müller; John Baron; William C. Brantley; Huiyu Feng; Daniel Hackenberg; Robert Henschel; Gabriele Jost; Daniel Molka; Chris Parrott; Joe Robichaux; Pavel Shelepugin; G. Matthijs van Waveren; Brian Whitney; Kalyan Kumaran
This paper describes SPEC OMP2012, a benchmark suite developed by the SPEC High Performance Group. It consists of 15 OpenMP parallel applications from a wide range of fields. In addition to a performance metric based on the run time of the applications, the benchmark adds an optional energy metric. The accompanying run rules detail how the benchmarks are executed and how the results are reported; they also cover the energy measurements. The first set of results demonstrates scalability on three different platforms.
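SPEC scores are conventionally reported as geometric means of per-benchmark ratios against a reference system; the sketch below shows that style of aggregation for both a run-time metric and an optional energy metric. It is a hedged illustration rather than the SPEC OMP2012 run rules, and the benchmark names and all numbers are placeholders.

```python
# Hypothetical sketch of a SPEC-style aggregation: geometric mean of
# reference/measured ratios over the benchmark applications, plus an optional
# energy-based counterpart. Names and values are illustrative only.
from math import prod

results = [
    # (benchmark, reference_seconds, measured_seconds, measured_joules, reference_joules)
    ("bench_a",  900.0, 300.0, 4.5e4, 9.0e4),
    ("bench_b", 1200.0, 400.0, 6.0e4, 1.2e5),
    ("bench_c",  800.0, 250.0, 3.8e4, 7.6e4),
]

def geomean(values):
    return prod(values) ** (1.0 / len(values))

perf_ratios   = [ref / meas for _, ref, meas, _, _ in results]
energy_ratios = [ref_j / meas_j for *_, meas_j, ref_j in results]

print(f"performance score (geomean of time ratios): {geomean(perf_ratios):.2f}")
print(f"energy score (geomean of energy ratios):    {geomean(energy_ratios):.2f}")
```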
IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum | 2013
Michael Boyer; Jiayuan Meng; Kalyan Kumaran
Accelerators such as graphics processors (GPUs) have become increasingly popular for high performance scientific computing. Often, much effort is invested in creating and optimizing GPU code without any guaranteed performance benefit. To reduce this risk, performance models can be used to project a kernel's GPU performance potential before it is ported. However, raw GPU execution time is not the only consideration. The overhead of transferring data between the CPU and the GPU is also an important factor; for some applications, this overhead may even erase the performance benefits of GPU acceleration. To address this challenge, we propose a GPU performance modeling framework that predicts both kernel execution time and data transfer time. Our extensions to an existing GPU performance model include a data usage analyzer for a sequence of GPU kernels, to determine the amount of data that needs to be transferred, and a performance model of the PCIe bus, to determine how long the data transfer will take. We have tested our framework using a set of applications running on a production machine at Argonne National Laboratory. On average, our model predicts the data transfer overhead with an error of only 8%, and the inclusion of data transfer time reduces the error in the predicted GPU speedup from 255% to 9%.
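The accounting the abstract describes, adding host-device transfer time to kernel time before computing speedup, reduces to a few lines. The sketch below is a minimal illustration with placeholder PCIe latency and bandwidth values, not the model's calibrated parameters.

```python
# Minimal sketch of speedup projection that includes host-device data movement,
# in the spirit of the framework described above. PCIe latency/bandwidth values
# and the example workload are assumptions for illustration only.

def transfer_time(bytes_moved, latency_s=10e-6, bandwidth_gbps=12.0):
    """Time to move `bytes_moved` across PCIe (simple latency + bandwidth model)."""
    return latency_s + bytes_moved / (bandwidth_gbps * 1e9)

def projected_speedup(cpu_time_s, gpu_kernel_time_s, bytes_to_device, bytes_to_host):
    total_gpu_time = (gpu_kernel_time_s
                      + transfer_time(bytes_to_device)
                      + transfer_time(bytes_to_host))
    return cpu_time_s / total_gpu_time

# Example: a 2.0 s CPU kernel projected to run in 0.1 s on the GPU, but needing
# 1 GB in and 1 GB out per invocation. Transfers dominate the projected benefit.
print(f"speedup ignoring transfers: {2.0 / 0.1:.1f}x")
print(f"speedup with transfers:     {projected_speedup(2.0, 0.1, 1e9, 1e9):.1f}x")
```

In this toy example a projected 20x kernel speedup shrinks to roughly 7.5x once the round-trip traffic is charged, which is exactly the kind of overhead the framework is designed to expose before porting effort is spent.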
IEEE International Conference on High Performance Computing, Data and Analytics | 2011
Leopold Grinberg; Joseph A. Insley; Vitali A. Morozov; Michael E. Papka; George Em Karniadakis; Dmitry A. Fedosov; Kalyan Kumaran
Interfacing atomistic-based with continuum-based simulation codes is now required in many multiscale physical and biological systems. We present the computational advances that have enabled the first multiscale simulation on 190,740 processors by coupling a high-order (spectral element) Navier-Stokes solver with a stochastic (coarse-grained) Molecular Dynamics solver based on Dissipative Particle Dynamics (DPD). The key contributions are proper interface conditions for overlapped domains, topology-aware communication, SIMDization, multiscale visualization, and a new domain partitioning for atomistic solvers. We study blood flow in a patient-specific cerebrovasculature with a brain aneurysm, and analyze the interaction of blood cells with the arterial walls endowed with a glycocalyx causing thrombus formation and eventual aneurysm rupture. The macro-scale dynamics (about 3 billion unknowns) are resolved by NεκTαr, a spectral element solver; the micro-scale flow and cell dynamics within the aneurysm are resolved by an in-house version of DPD-LAMMPS (for an equivalent of about 100 billion molecules).
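The coupling pattern itself, overlapped continuum and atomistic domains exchanging interface conditions every macro-step, can be sketched generically. The solver classes and field names below are hypothetical placeholders, not the NεκTαr or DPD-LAMMPS interfaces.

```python
# Highly simplified sketch of continuum/atomistic coupling over an overlap region.
# ContinuumSolver and ParticleSolver are hypothetical placeholders; the actual
# coupling uses NεκTαr (spectral elements) and a modified DPD-LAMMPS.

class ContinuumSolver:
    def advance(self, dt, overlap_velocity):       # impose particle-averaged velocity
        ...                                        # (spectral-element time step here)
        return {"velocity_at_overlap": [0.0, 0.0, 0.0]}

class ParticleSolver:
    def advance(self, dt, overlap_velocity):       # drive DPD particles in the overlap
        ...                                        # (many small DPD sub-steps here)
        return {"coarse_grained_velocity": [0.0, 0.0, 0.0]}

def coupled_step(continuum, particles, dt, state):
    # Each solver advances using boundary data taken from the other's overlap region.
    c_out = continuum.advance(dt, state["coarse_grained_velocity"])
    p_out = particles.advance(dt, c_out["velocity_at_overlap"])
    return {**c_out, **p_out}

state = {"coarse_grained_velocity": [0.0, 0.0, 0.0]}
continuum, particles = ContinuumSolver(), ParticleSolver()
for _ in range(10):                                # ten coupled macro-steps
    state = coupled_step(continuum, particles, dt=1e-3, state=state)
```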
IEEE International Conference on High Performance Computing, Data and Analytics | 2014
Guido Juckeland; William C. Brantley; Sunita Chandrasekaran; Barbara M. Chapman; Shuai Che; Mathew E. Colgrove; Huiyu Feng; Alexander Grund; Robert Henschel; Wen-mei W. Hwu; Huian Li; Matthias S. Müller; Wolfgang E. Nagel; Maxim Perminov; Pavel Shelepugin; Kevin Skadron; John A. Stratton; Alexey Titov; Ke Wang; G. Matthijs van Waveren; Brian Whitney; Sandra Wienke; Rengan Xu; Kalyan Kumaran
Hybrid nodes with hardware accelerators are becoming very common in systems today. Users often find it difficult to characterize and understand the performance advantage of such accelerators for their applications. The SPEC High Performance Group (HPG) has developed a set of performance metrics to evaluate the performance and power consumption of accelerators for various science applications. The new benchmark comprises two suites of applications written in OpenCL and OpenACC and measures the performance of accelerators with respect to a reference platform. The first set of published results demonstrates the viability and relevance of the new metrics in comparing accelerator performance. This paper discusses the benchmark suites and selected published results in detail.
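Turning measured power into the energy side of such a metric typically means integrating sampled power over the run. The sketch below uses simple trapezoidal integration with made-up samples; it is an illustration of the idea, not SPEC's accepted power-measurement methodology.

```python
# Illustrative only: converting timestamped power samples into energy-to-solution
# by trapezoidal integration. The sample data and helper names are assumptions.

def energy_joules(samples):
    """samples: list of (time_seconds, power_watts), sorted by time."""
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        total += 0.5 * (p0 + p1) * (t1 - t0)       # trapezoid area = average power * dt
    return total

power_trace = [(0.0, 180.0), (1.0, 250.0), (2.0, 260.0), (3.0, 190.0)]
run_time = power_trace[-1][0] - power_trace[0][0]
print(f"energy to solution: {energy_joules(power_trace):.0f} J over {run_time:.0f} s")
```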
Computing Frontiers | 2014
Jiayuan Meng; Xingfu Wu; Vitali A. Morozov; Venkatram Vishwanath; Kalyan Kumaran; Valerie E. Taylor
Understanding workload behavior plays an important role in performance studies. The growing complexity of applications and architectures has increased the gap among application developers, performance engineers, and hardware designers. To reduce this gap, we propose SKOPE, a SKeleton framework for Performance Exploration, which produces a descriptive model of the semantic behavior of a workload, can infer potential transformations, and helps users understand how workloads may interact with and adapt to emerging hardware. SKOPE models can be shared, annotated, and studied by a community of performance engineers and system designers; they offer readability in the frontend and versatility in the backend. SKOPE can be used for performance analysis, tuning, and projection. We provide two example use cases. First, we project GPU performance from CPU code without GPU programming or access to the hardware; SKOPE automatically explores transformations, and the projected best-achievable performance deviates from the measured performance by 18% on average. Second, we project the multi-node scaling trends of two scientific workloads and achieve a projection accuracy of 95%.
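A skeleton in this sense is a compact, analyzable description of a workload's structure rather than executable code. The toy representation below, a nested-loop descriptor with trip counts and per-iteration costs, is an assumption meant only to convey the idea; it is not SKOPE's specification language.

```python
# Toy "skeleton" of a loop nest: structure plus per-iteration costs, from which a
# backend could estimate total work and traffic. Field names are hypothetical.

skeleton = {
    "loop": "i", "trips": 1024,
    "body": {
        "loop": "j", "trips": 1024,
        "body": {"flops": 4, "loads_bytes": 16, "stores_bytes": 8},
    },
}

def totals(node, factor=1):
    """Walk the skeleton and accumulate total flops and bytes."""
    if "loop" in node:
        return totals(node["body"], factor * node["trips"])
    return {
        "flops": factor * node["flops"],
        "bytes": factor * (node["loads_bytes"] + node["stores_bytes"]),
    }

t = totals(skeleton)
print(f"{t['flops']:.3e} flops, {t['bytes']:.3e} bytes "
      f"-> arithmetic intensity {t['flops'] / t['bytes']:.2f} flop/byte")
```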
International Conference on Parallel Processing | 2012
Vitali A. Morozov; Jiayuan Meng; Venkatram Vishwanath; Jeff R. Hammond; Kalyan Kumaran; Michael E. Papka
As systems grow larger and computation is further spread across nodes, efficient data communication is becoming increasingly important to achieve high throughput and low power consumption for high performance computing systems. However, communication efficacy not only depends on application-specific communication patterns, but also on machine-specific communication subsystems, node architectures, and even the runtime communication libraries. In fact, different hardware systems lead to different tradeoffs with respect to communication mechanisms, which can impact the choice of application implementations. We present a set of MPI-based benchmarks to better understand the communication behavior of the hardware systems and guide the performance tuning of scientific applications. We further apply these benchmarks to three clusters and present several interesting lessons from our experience.
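A representative member of such a suite is a point-to-point ping-pong test for latency and bandwidth. The mpi4py sketch below is a simplified stand-in; message sizes and repetition counts are arbitrary, and the paper's benchmarks cover far more communication patterns than this.

```python
# Minimal ping-pong sketch (mpi4py + NumPy): ranks 0 and 1 bounce a message back
# and forth to estimate latency and bandwidth. Run with: mpiexec -n 2 python pingpong.py
# This is a simplified stand-in for the benchmark set described above.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
reps = 100

for nbytes in (8, 1024, 1 << 20):                  # 8 B, 1 KiB, 1 MiB messages
    buf = np.zeros(nbytes, dtype=np.uint8)
    comm.Barrier()
    t0 = MPI.Wtime()
    for _ in range(reps):
        if rank == 0:
            comm.Send(buf, dest=1, tag=0)
            comm.Recv(buf, source=1, tag=0)
        elif rank == 1:
            comm.Recv(buf, source=0, tag=0)
            comm.Send(buf, dest=0, tag=0)
    elapsed = MPI.Wtime() - t0
    if rank == 0:
        one_way = elapsed / (2 * reps)             # half of a round trip
        print(f"{nbytes:>8} B  latency {one_way * 1e6:8.2f} us  "
              f"bandwidth {nbytes / one_way / 1e9:6.2f} GB/s")
```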
Communications of the ACM | 2016
Salman Habib; Vitali A. Morozov; Nicholas Frontiere; Hal Finkel; Adrian Pope; Katrin Heitmann; Kalyan Kumaran; Venkatram Vishwanath; Tom Peterka; Joseph A. Insley; David Daniel; Patricia K. Fasel; Zarija Lukić
Supercomputing is evolving toward hybrid and accelerator-based architectures with millions of cores. The Hardware/Hybrid Accelerated Cosmology Code (HACC) framework exploits this diverse landscape at the largest scales of problem size, obtaining high scalability and sustained performance. Developed to satisfy the science requirements of cosmological surveys, HACC melds particle and grid methods using a novel algorithmic structure that flexibly maps across architectures, including CPU/GPU, multi/many-core, and Blue Gene systems. In this Research Highlight, we demonstrate the success of HACC on two very different machines, the CPU/GPU system Titan and the BG/Q systems Sequoia and Mira, attaining very high levels of scalable performance. We demonstrate strong and weak scaling on Titan, obtaining up to 99.2% parallel efficiency, evolving 1.1 trillion particles. On Sequoia, we reach 13.94 PFlops (69.2% of peak) and 90% parallel efficiency on 1,572,864 cores, with 3.6 trillion particles, the largest cosmological benchmark yet performed. HACC design concepts are applicable to several other supercomputer applications.
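The headline numbers can be sanity-checked with standard definitions. Sequoia's peak follows from its published configuration of 98,304 BG/Q nodes at 204.8 GF per node, and the efficiency functions below are the usual weak- and strong-scaling definitions rather than anything HACC-specific.

```python
# Sanity-check arithmetic for the figures quoted above. Sequoia's peak is derived
# from its published configuration; the efficiency definitions are the standard
# ones, not HACC-specific formulas.

nodes = 98_304
gflops_per_node = 204.8                      # 16 cores * 1.6 GHz * 8 flops/cycle
peak_pflops = nodes * gflops_per_node / 1e6
sustained_pflops = 13.94

print(f"Sequoia peak: {peak_pflops:.2f} PF, "
      f"sustained fraction: {sustained_pflops / peak_pflops:.1%}")   # ~69.2%

def weak_scaling_efficiency(t_base, t_scaled):
    """Problem size grows with core count; ideal scaling keeps run time constant."""
    return t_base / t_scaled

def strong_scaling_efficiency(t_base, t_scaled, core_count_ratio):
    """Fixed problem size; ideal run time shrinks by the core-count ratio."""
    return t_base / (t_scaled * core_count_ratio)
```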
International Parallel and Distributed Processing Symposium | 2013
Vitali A. Morozov; Kalyan Kumaran; Venkatram Vishwanath; Jiayuan Meng; Michael E. Papka
The Argonne Leadership Computing Facility (ALCF) is home to Mira, a 10 PF Blue Gene/Q (BG/Q) system. The BG/Q system is the third generation of the Blue Gene architecture from IBM and, like its predecessors, combines system-on-chip technology with a proprietary interconnect (5-D torus). Each compute node has 16 augmented PowerPC A2 processor cores with support for simultaneous multithreading, 4-wide double-precision SIMD, and different data-prefetching mechanisms. Mira offers several new opportunities for tuning and scaling scientific applications. This paper discusses our early experience with a subset of micro-benchmarks, MPI benchmarks, and a variety of science and engineering applications running at ALCF. Both performance and power are studied, and results on BG/Q are compared with its predecessor, BG/P. Several lessons gleaned from tuning applications on the BG/Q architecture for better performance and scalability are shared.
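The "10 PF" figure follows directly from the node architecture described here: 16 cores with 4-wide double-precision SIMD and fused multiply-add, at BG/Q's 1.6 GHz clock. The clock rate and Mira's node count are taken from published system specifications rather than from this abstract.

```python
# Reconstructing Mira's peak from the BG/Q node architecture described above.
# Clock rate (1.6 GHz) and node count (49,152) come from published system specs.

cores_per_node = 16
simd_width_dp  = 4          # 4-wide double-precision SIMD (QPX)
flops_per_fma  = 2          # a fused multiply-add counts as two flops
clock_ghz      = 1.6
nodes          = 49_152

gflops_per_core = simd_width_dp * flops_per_fma * clock_ghz          # 12.8 GF
gflops_per_node = cores_per_node * gflops_per_core                   # 204.8 GF
peak_pflops     = nodes * gflops_per_node / 1e6

print(f"per node: {gflops_per_node:.1f} GF, system peak: {peak_pflops:.2f} PF")
# -> system peak: 10.07 PF, i.e. the "10 PF" quoted above
```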
IBM Journal of Research and Development | 2013
Susan Coghlan; Kalyan Kumaran; Raymond M. Loy; Paul Messina; Vitali A. Morozov; James C. Osborn; Scott Parker; Katherine Riley; Nichols A. Romero; Timothy J. Williams
A varied collection of scientific and engineering codes has been adapted and enhanced to take advantage of the IBM Blue Gene®/Q architecture and thus enable research that was previously out of reach. Computational research teams from a number of disciplines collaborated with the staff of the Argonne Leadership Computing Facility to assess which of Blue Gene/Q's many novel features could be exploited for each application to equip it to tackle existing problem classes with greater fidelity and, in some cases, to address new phenomena. The quad floating-point units and the five-dimensional torus interconnect are among the features that were demonstrated to be effective for a number of important applications. Furthermore, data obtained from the hardware counters provided insights that were valuable in guiding the code modifications. Hardware features and programming techniques that were effective across multiple codes are documented as well. First, we have confirmed that no significant code rewrite is needed to run today's production codes with good performance on Mira, an IBM Blue Gene/Q supercomputer. Performance improvements have already been demonstrated, even though our measurements are all on pre-production software and hardware. The application domains included biology, materials science, combustion, chemistry, nuclear physics, and industrial-scale design of nuclear reactors, jet engines, and the efficiency of transportation systems.