Is this you? Create Your Porfile

Abhinav Bhatele

Lawrence Livermore National Laboratory

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Abhinav Bhatele is active.

Explore More

Publication

Featured researches published by Abhinav Bhatele.

international parallel and distributed processing symposium | 2013

Exploring Traditional and Emerging Parallel Programming Models Using a Proxy Application

Ian Karlin; Abhinav Bhatele; Jeff Keasler; Bradford L. Chamberlain; Jonathan D. Cohen; Zachary DeVito; Riyaz Haque; Dan Laney; Edward A. Luke; Felix Wang; David F. Richards; Martin Schulz; Charles H. Still

Parallel machines are becoming more complex with increasing core counts and more heterogeneous architectures. However, the commonly used parallel programming models, C/C++ with MPI and/or OpenMP, make it difficult to write source code that is easily tuned for many targets. Newer language approaches attempt to ease this burden by providing optimization features such as automatic load balancing, overlap of computation and communication, message-driven execution, and implicit data layout optimizations. In this paper, we compare several implementations of LULESH, a proxy application for shock hydrodynamics, to determine strengths and weaknesses of different programming models for parallel computation. We focus on four traditional (OpenMP, MPI, MPI+OpenMP, CUDA) and four emerging (Chapel, Charm++, Liszt, Loci) programming models. In evaluating these models, we focus on programmer productivity, performance and ease of applying optimizations.

international parallel and distributed processing symposium | 2008

Overcoming scaling challenges in biomolecular simulations across multiple platforms

Abhinav Bhatele; Sameer Kumar; Chao Mei; James C. Phillips; Gengbin Zheng; Laxmikant V. Kalé

NAMD is a portable parallel application for biomolecular simulations. NAMD pioneered the use of hybrid spatial and force decomposition, a technique now used by most scalable programs for biomolecular simulations, including Blue Matter and Desmond developed by IBM and D. E. Shaw respectively. NAMD has been developed using Charm++ and benefits from its adaptive communication- computation overlap and dynamic load balancing. This paper focuses on new scalability challenges in biomolecular simulations: using much larger machines and simulating molecular systems with millions of atoms. We describe new techniques developed to overcome these challenges. Since our approach involves automatic adaptive runtime optimizations, one interesting issue involves dealing with harmful interaction between multiple adaptive strategies. NAMD runs on a wide variety of platforms, ranging from commodity clusters to supercomputers. It also scales to large machines: we present results for up to 65,536 processors on IBMs Blue Gene/L and 8,192 processors on Cray XT3/XT4. In addition, we present performance results on NCSAs Abe, SDSCs DataStar and TACCs LoneStar cluster, to demonstrate efficient portability. We also compare NAMD with Desmond and Blue Matter.

international conference on supercomputing | 2009

Dynamic topology aware load balancing algorithms for molecular dynamics applications

Abhinav Bhatele; Laxmikant V. Kalé; Sameer Kumar

Molecular Dynamics applications enhance our understanding of biological phenomena through bio-molecular simulations. Large-scale parallelization of MD simulations is challenging because of the small number of atoms and small time scales involved. Load balancing in parallel MD programs is crucial for good performance on large parallel machines. This paper discusses load balancing algorithms deployed in a MD code called NAMD. It focuses on new schemes deployed in the load balancers and provides an analysis of the performance benefits achieved. Specifically, the paper presents the technique of topology-aware mapping on 3D mesh and torus architectures, used to improve scalability and performance. These techniques have a wide applicability for latency intolerant applications.

Ibm Journal of Research and Development | 2008

Scalable molecular dynamics with NAMD on the IBM Blue Gene/L system

Sameer Kumar; Chao Huang; Gengbin Zheng; Eric J. Bohm; Abhinav Bhatele; James C. Phillips; Hao Yu; Laxmikant V. Kalé

NAMD (nanoscale molecular dynamics) is a production molecular dynamics (MD) application for biomolecular simulations that include assemblages of proteins, cell membranes, and water molecules. In a biomolecular simulation, the problem size is fixed and a large number of iterations must be executed in order to understand interesting biological phenomena. Hence, we need MD applications to scale to thousands of processors, even though the individual timestep on one processor is quite small. NAMD has demonstrated its performance on several parallel computer architectures. In this paper, we present various compiler optimization techniques that use single-instruction, multiple-data (SIMD) instructions to obtain good sequential performance with NAMD on the embedded IBM PowerPC® 440 processor core. We also present several techniques to scale the NAMD application to 20,480 nodes of the IBM Blue Gene/L™ (BG/L) system. These techniques include topology-specific optimizations to localize communication, new messaging protocols that are optimized for the BG/L torus, topology-aware load balancing, and overlap of computation and communication. We also present performance results of various molecular systems with sizes ranging from 5,570 to 327,506 atoms.

ieee international conference on high performance computing data and analytics | 2011

Periodic hierarchical load balancing for large supercomputers

Gengbin Zheng; Abhinav Bhatele; Esteban Meneses; Laxmikant V. Kalé

Large parallel machines with hundreds of thousands of processors are becoming more prevalent. Ensuring good load balance is critical for scaling certain classes of parallel applications on even thousands of processors. Centralized load balancing algorithms suffer from scalability problems, especially on machines with a relatively small amount of memory. Fully distributed load balancing algorithms, on the other hand, tend to take longer to arrive at good solutions. In this paper, we present an automatic dynamic hierarchical load balancing method that overcomes the scalability challenges of centralized schemes and longer running times of traditional distributed schemes. Our solution overcomes these issues by creating multiple levels of load balancing domains which form a tree. This hierarchical method is demonstrated within a measurement-based load balancing framework in Charm++. We discuss techniques to deal with scalability challenges of load balancing at very large scale. We present performance data of the hierarchical load balancing method on up to 16,384 cores of Ranger (at the Texas Advanced Computing Center) and 65,536 cores of Intrepid (the Blue Gene/P at Argonne National Laboratory) for a synthetic benchmark. We also demonstrate the successful deployment of the method in a scientific application, NAMD, with results on Intrepid.

ieee international conference on high performance computing data and analytics | 2011

Avoiding hot-spots on two-level direct networks

Abhinav Bhatele; Nikhil Jain; William Gropp; Laxmikant V. Kalé

A low-diameter, fast interconnection network is going to be a prerequisite for building exascale machines. A two-level direct network has been proposed by several groups as a scalable design for future machines. IBMs PERCS topology and the dragonfly net-work discussed in the DARPA exascale hardware study are examples of this design. The presence of multiple levels in this design leads to hot-spots on a few links when processes are grouped together at the lowest level to minimize total communication volume. This is especially true for communication graphs with a small number of neighbors per task. Routing and mapping choices can impact the communication performance of parallel applications running on a machine with a two-level direct topology. This paper explores intelligent topology aware mappings of different communication patterns to the physical topology to identify cases that minimize link utilization. We also analyze the trade-offs between using direct and indirect routing with different mappings. We use simulations to study communication and overall performance of applications since there are no installations of two-level direct networks yet. This study raises interesting issues regarding the choice of job scheduling, routing and mapping for future machines.

ieee international conference on high performance computing data and analytics | 2011

Improving communication performance in dense linear algebra via topology aware collectives

Edgar Solomonik; Abhinav Bhatele; James Demmel

Recent results have shown that topology aware mapping reduces network contention in communication-intensive kernels on massively parallel machines. We demonstrate that on mesh interconnects, topology aware mapping also allows for the utilization of highly-efficient topology aware collectives. We map novel 2.5D dense linear algebra algorithms to exploit rectangular collectives on cuboid partitions allocated by a Blue Gene/P supercomputer. Our mappings allow the algorithms to exploit optimized line multicasts and reductions. Commonly used 2D algorithms cannot be mapped in this fashion. On 16,384 nodes (65,536 cores) of Blue Gene/P, 2.5D algorithms that exploit rectangular collectives are sig- nificantly faster than 2D matrix multiplication (MM) and LU factorization, up to 8.7x and 2.1x, respectively. These speed-ups are due to communication reduction (up to 95.6% for 2.5D MM with respect to 2D MM). We also derive LogP- based novel performance models for rectangular broadcasts and reductions. Using those, we model the performance of matrix multiplication and LU factorization on a hypothetical exascale architecture.

ieee international conference on high performance computing, data, and analytics | 2010

Automated mapping of regular communication graphs on mesh interconnects

Abhinav Bhatele; Gagan Raj Gupta; Laxmikant V. Kalé; I-Hsin Chung

Network contention has a significantly adverse effect on the performance of parallel applications with increasing size of parallel machines. Machines of the petascale era are forcing application developers to map tasks intelligently to job partitions to achieve the best performance possible. This paper presents a framework for automated mapping of parallel applications with regular communication graphs to two and three dimensional mesh and torus networks. This framework will save much effort on the part of application developers to generate mappings for their individual applications. One component of the framework is a process topology analyzer to find regular patterns and if found, to determine the dimensions of the communication graphs of applications. The other component is a suite of heuristic techniques for mapping 2D object grids to 2D and 3D processor meshes. The framework chooses the best heuristic from the suite for a given object grid and processor mesh pair based on the hop-bytes metric. We show performance improvements using the framework, for a 2D Stencil benchmark in MPI and the Weather Research and Forecasting model running on the IBM Blue Gene/P. We also compare our algorithms with others discussed in literature.

IEEE Transactions on Visualization and Computer Graphics | 2012

Visualizing Network Traffic to Understand the Performance of Massively Parallel Simulations

Aaditya G. Landge; Joshua A. Levine; Abhinav Bhatele; Katherine E. Isaacs; Todd Gamblin; Martin Schulz; S. H. Langer; Peer-Timo Bremer; Valerio Pascucci

The performance of massively parallel applications is often heavily impacted by the cost of communication among compute nodes. However, determining how to best use the network is a formidable task, made challenging by the ever increasing size and complexity of modern supercomputers. This paper applies visualization techniques to aid parallel application developers in understanding the network activity by enabling a detailed exploration of the flow of packets through the hardware interconnect. In order to visualize this large and complex data, we employ two linked views of the hardware network. The first is a 2D view, that represents the network structure as one of several simplified planar projections. This view is designed to allow a user to easily identify trends and patterns in the network traffic. The second is a 3D view that augments the 2D view by preserving the physical network topology and providing a context that is familiar to the application developers. Using the massively parallel multi-physics code pF3D as a case study, we demonstrate that our tool provides valuable insight that we use to explain and optimize pF3Ds performance on an IBM Blue Gene/P system.

international conference on parallel processing | 2010

Hierarchical Load Balancing for Charm++ Applications on Large Supercomputers

Gengbin Zheng; Esteban Meneses; Abhinav Bhatele; Laxmikant V. Kalé

Large parallel machines with hundreds of thousands of processors are being built. Recent studies have shown that ensuring good load balance is critical for scaling certain classes of parallel applications on even thousands of processors. Centralized load balancing algorithms suffer from scalability problems, especially on machines with relatively small amount of memory. Fully distributed load balancing algorithms, on the other hand, tend to yield poor load balance on very large machines. In this paper, we present an automatic dynamic hierarchical load balancing method that overcomes the scalability challenges of centralized schemes and poor solutions of traditional distributed schemes. This is done by creating multiple levels of aggressive load balancing domains which form a tree. This hierarchical method is demonstrated within a measurement-based load balancing framework in Charm++. We present techniques to deal with scalability challenges of load balancing at very large scale. We show performance data of the hierarchical load balancing method on up to 16,384 cores of Ranger (at TACC) for a synthetic benchmark. We also demonstrate the successful deployment of the method in a scientific application, NAMD with results on the Blue Gene/P machine at ANL.

Explore More