Publication


Featured research published by Srinivas Sridharan.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2014

Enabling efficient multithreaded MPI communication through a library-based implementation of MPI endpoints

Srinivas Sridharan; James Dinan; Dhiraj D. Kalamkar

Modern high-speed interconnection networks are designed with capabilities to support communication from multiple processor cores. The MPI endpoints extension has been proposed to ease process and thread count tradeoffs by enabling multithreaded MPI applications to efficiently drive independent network communication. In this work, we present the first implementation of the MPI endpoints interface and demonstrate the first applications running on this new interface. We use a novel library-based design that can be layered on top of any existing, production MPI implementation. Our approach uses proxy processes to isolate threads in an MPI job, eliminating threading overheads within the MPI library and allowing threads to achieve process-like communication performance. We evaluate the performance advantages of our implementation through several benchmarks and kernels. Performance results for the Lattice QCD Dslash kernel indicate that endpoints provides up to 2.9× improvement in communication performance and 1.87× overall performance improvement over a highly optimized hybrid MPI+OpenMP baseline on 128 processors.
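
The endpoints interface lets each thread own its own communication context instead of funneling all traffic through a single MPI rank. The sketch below shows the intended usage pattern in C; MPIX_Comm_create_endpoints is the proposed (not standardized) interface, its name and signature follow the public endpoints proposal, and both may differ from the library-based implementation described in the paper.

```c
/* Illustrative sketch of the proposed MPI endpoints usage pattern.
 * MPIX_Comm_create_endpoints is NOT part of the MPI standard; the name
 * and signature follow the public endpoints proposal and may differ
 * from the library-based implementation described in the paper. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

/* Assumed prototype from the endpoints proposal (hypothetical). */
int MPIX_Comm_create_endpoints(MPI_Comm parent, int num_ep, MPI_Info info,
                               MPI_Comm out_comms[]);

int main(int argc, char **argv) {
    int provided, nthreads = 4;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    MPI_Comm ep_comms[4];
    /* Each process asks for one endpoint per thread. */
    MPIX_Comm_create_endpoints(MPI_COMM_WORLD, nthreads, MPI_INFO_NULL, ep_comms);

    #pragma omp parallel num_threads(nthreads)
    {
        int tid = omp_get_thread_num();
        int ep_rank, buf = tid;
        MPI_Comm_rank(ep_comms[tid], &ep_rank);   /* thread-private endpoint */
        /* Each thread drives its own communication as if it were a rank. */
        MPI_Allreduce(MPI_IN_PLACE, &buf, 1, MPI_INT, MPI_SUM, ep_comms[tid]);
        printf("endpoint rank %d: sum = %d\n", ep_rank, buf);
        MPI_Comm_free(&ep_comms[tid]);
    }
    MPI_Finalize();
    return 0;
}
```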


IEEE International Conference on High Performance Computing, Data, and Analytics | 2016

Comparing Runtime Systems with Exascale Ambitions Using the Parallel Research Kernels

Rob F. Van der Wijngaart; Abdullah Kayi; Jeff R. Hammond; Gabriele Jost; Tom St. John; Srinivas Sridharan; Timothy G. Mattson; John Abercrombie; Jacob Nelson

We use three Parallel Research Kernels to compare the performance of a set of programming models (we employ the term programming model as it is commonly used in the application community; a more accurate term is programming environment, the collective of abstract programming model, its embodiment in an Application Programmer Interface (API), and the runtime that implements it): MPI1 (MPI two-sided communication), MPIOPENMP (MPI+OpenMP), MPISHM (MPI1 with MPI-3 interprocess shared memory), MPIRMA (MPI one-sided communication), SHMEM, UPC, Charm++ and Grappa. The kernels in our study – Stencil, Synch_p2p and Transpose – underlie a wide range of computational science applications. They enable direct probing of properties of programming models, especially communication and synchronization. In contrast to mini- or proxy applications, the PRK allow for rapid implementation, measurement and verification. Our experimental results show MPISHM to be the overall winner, with MPI1, MPIOPENMP and SHMEM also performing well. MPISHM and MPIOPENMP outperform the other models in the strong-scaling limit due to their effective use of shared memory and good granularity control. The non-evolutionary models Grappa and Charm++ are not competitive with the traditional models (MPI and PGAS) for two of the kernels; these models favor irregular algorithms, while the PRK considered here are regular.
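
Of the three kernels, Transpose is the most communication-intensive: every rank exchanges tiles of a distributed matrix with every other rank. A minimal MPI1-style sketch of that core exchange (data layout and sizes simplified for illustration, not the PRK reference code) is:

```c
/* Minimal sketch of the communication at the heart of the Transpose PRK:
 * each rank owns a block of columns stored as per-destination tiles,
 * exchanges tiles with every other rank via MPI_Alltoall, then transposes
 * each received tile locally. Sizes and layout are illustrative only. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, np;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    int block = 4;                        /* tile edge length (assumed)      */
    double *colblock = malloc((size_t)np * block * block * sizeof(double));
    double *recvbuf  = malloc((size_t)np * block * block * sizeof(double));

    for (int i = 0; i < np * block * block; i++)   /* fill local column block */
        colblock[i] = rank * np * block * block + i;

    /* Tile t of my column block goes to rank t. */
    MPI_Alltoall(colblock, block * block, MPI_DOUBLE,
                 recvbuf,  block * block, MPI_DOUBLE, MPI_COMM_WORLD);

    /* Local transpose of each received tile completes the global transpose. */
    for (int t = 0; t < np; t++)
        for (int i = 0; i < block; i++)
            for (int j = 0; j < block; j++)
                colblock[t * block * block + i * block + j] =
                    recvbuf[t * block * block + j * block + i];

    free(colblock); free(recvbuf);
    MPI_Finalize();
    return 0;
}
```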


IEEE International Conference on High Performance Computing, Data, and Analytics | 2012

Extending the BT NAS parallel benchmark to exascale computing

Rob F. Van der Wijngaart; Srinivas Sridharan; Victor W. Lee

The NAS Parallel Benchmarks (NPB) are a well-known suite of benchmarks that proxy scientific computing applications. They specify several problem sizes that represent how such applications may run on different sizes of HPC systems. However, even the largest problem (class F) is still far too small to properly exercise a petascale supercomputer. Our work shows how one may scale the Block Tridiagonal (BT) NPB from today's published size to petascale and exascale computing systems. In this paper we discuss the pros and cons of various ways of scaling. We discuss how scaling BT would impact computation, memory access, and communication, and highlight the expected bottleneck, which turns out to be not memory or communication bandwidth, but network latency. Two complementary ways to overcome latency obstacles are presented. We also describe a practical method to gather approximate performance data for BT at exascale on actual hardware, without requiring an exascale system.
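
With network latency identified as the expected bottleneck, one standard mitigation, shown here as a generic pattern rather than either of the paper's two specific proposals, is to overlap communication with independent computation using nonblocking operations:

```c
/* Generic latency-hiding pattern: post nonblocking sends/receives, do
 * independent local work, then wait. This illustrates the general idea of
 * overlapping communication with computation; it is not the specific
 * scaling technique proposed in the paper. */
#include <mpi.h>

void exchange_and_compute(double *halo_out, double *halo_in, int count,
                          int left, int right, double *interior, int n) {
    MPI_Request reqs[2];
    /* Post the neighbor exchange before touching the interior. */
    MPI_Irecv(halo_in,  count, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(halo_out, count, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    /* Independent interior work proceeds while messages are in flight. */
    for (int i = 0; i < n; i++)
        interior[i] *= 0.5;

    /* Only work that needs the received halo must wait. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}
```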


IEEE International Conference on High Performance Computing, Data, and Analytics | 2017

Deep learning at 15PF: supervised and semi-supervised classification for scientific data

Thorsten Kurth; Jian Zhang; Nadathur Satish; Evan Racah; Ioannis Mitliagkas; Md. Mostofa Ali Patwary; Tareq M. Malas; Narayanan Sundaram; Wahid Bhimji; Mikhail Smorkalov; Jack Deslippe; Mikhail Shiryaev; Srinivas Sridharan; Prabhat; Pradeep Dubey

This paper presents the first 15-PetaFLOP Deep Learning system for solving scientific pattern classification problems on contemporary HPC architectures. We develop supervised convolutional architectures for discriminating signals in high-energy physics data as well as semi-supervised architectures for localizing and classifying extreme weather in climate data. Our Intel Caffe-based implementation obtains ~2 TFLOP/s on a single Cori Phase-II Xeon Phi node. We use a hybrid strategy employing synchronous node groups, while using asynchronous communication across groups. We use this strategy to scale training of a single model to ~9600 Xeon Phi nodes, obtaining peak performance of 11.73-15.07 PFLOP/s and sustained performance of 11.41-13.27 PFLOP/s. At scale, our HEP architecture produces state-of-the-art classification accuracy on a dataset with 10M images, exceeding that achieved by selections on high-level physics-motivated features. Our semi-supervised architecture successfully extracts weather patterns in a 15 TB climate dataset. Our results demonstrate that Deep Learning can be optimized and scaled effectively on many-core HPC systems.
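
The within-group half of that hybrid strategy amounts to a data-parallel allreduce over gradients inside each synchronous node group. A minimal MPI sketch, with illustrative group and gradient sizes and with the asynchronous cross-group exchange omitted, looks like this:

```c
/* Minimal sketch of group-synchronous gradient averaging: ranks are split
 * into fixed-size groups and gradients are allreduced within each group.
 * The asynchronous exchange across groups used in the paper is omitted;
 * group size and gradient length are illustrative. */
#include <mpi.h>

#define GROUP_SIZE 8          /* assumed nodes per synchronous group */
#define NPARAMS    1024       /* assumed gradient length */

void group_sync_step(float *grad) {
    int world_rank, group_size;
    MPI_Comm group_comm;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Ranks with the same color form one synchronous group. In practice the
     * group communicator would be created once and reused every step. */
    MPI_Comm_split(MPI_COMM_WORLD, world_rank / GROUP_SIZE, world_rank, &group_comm);
    MPI_Comm_size(group_comm, &group_size);

    /* Sum gradients across the group, then average. */
    MPI_Allreduce(MPI_IN_PLACE, grad, NPARAMS, MPI_FLOAT, MPI_SUM, group_comm);
    for (int i = 0; i < NPARAMS; i++)
        grad[i] /= (float)group_size;

    MPI_Comm_free(&group_comm);
}
```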


International Parallel and Distributed Processing Symposium | 2015

Exploring Shared-Memory Optimizations for an Unstructured Mesh CFD Application on Modern Parallel Systems

Dheevatsa Mudigere; Srinivas Sridharan; Anand M. Deshpande; Jongsoo Park; Alexander Heinecke; Mikhail Smelyanskiy; Bharat Kaul; Pradeep Dubey; Dinesh K. Kaushik; David E. Keyes

In this work, we revisit the 1999 Gordon Bell Prize winning PETSc-FUN3D aerodynamics code, extending it with highly tuned shared-memory parallelization and detailed performance analysis on modern highly parallel architectures. An unstructured-grid implicit flow solver, which forms the backbone of computational aerodynamics, poses particular challenges due to its large irregular working sets, unstructured memory accesses, and variable/limited amount of parallelism. This code, based on a domain decomposition approach, exposes tradeoffs between the number of threads assigned to each MPI-rank subdomain and the total number of subdomains. By applying several algorithm- and architecture-aware optimization techniques for unstructured grids, we show a 6.9X speed-up on a single-node Intel® Xeon® E5-2690 v2 processor relative to the out-of-the-box compilation. Our scaling studies on the TACC Stampede supercomputer show that our optimizations continue to provide performance benefits over the baseline implementation as we scale up to 256 nodes.
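
The thread-versus-rank tradeoff described above is the standard hybrid MPI+OpenMP decomposition: one subdomain per MPI rank, with OpenMP threads sharing the work inside it. A generic sketch of that structure (not code from PETSc-FUN3D) is:

```c
/* Minimal hybrid MPI+OpenMP sketch: each MPI rank owns one subdomain and
 * threads the loop over its local unstructured-mesh cells. This is a
 * generic illustration, not code from PETSc-FUN3D. */
#include <mpi.h>
#include <omp.h>

void update_subdomain(double *cell_val, const double *residual, int ncells) {
    /* Threads share one rank's subdomain; more ranks means more, smaller
     * subdomains with fewer threads each -- the tradeoff discussed above. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < ncells; i++)
        cell_val[i] += residual[i];
}

int main(int argc, char **argv) {
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    /* ... partition the mesh into one subdomain per rank, then iterate,
     * calling update_subdomain on each rank's local cells ... */
    MPI_Finalize();
    return 0;
}
```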


International Parallel and Distributed Processing Symposium | 2012

High Performance Non-uniform FFT on Modern X86-based Multi-core Systems

Dhiraj D. Kalamkar; Joshua D. Trzaskoz; Srinivas Sridharan; Mikhail Smelyanskiy; Daehyun Kim; Armando Manduca; Yunhong Shu; Matt A. Bernstein; Bharat Kaul; Pradeep Dubey

The Non-Uniform Fast Fourier Transform (NUFFT) is a generalization of the FFT to non-equidistant samples. It has many applications, ranging from medical imaging to radio astronomy to the numerical solution of partial differential equations. Despite recent advances in speeding up the NUFFT on various platforms, its practical applications are still limited due to its high computational cost, which is dominated by the convolution of a signal between non-uniform and uniform grids. The computational cost of the NUFFT is particularly detrimental in cases which require fast reconstruction times, such as iterative 3D non-Cartesian MRI reconstruction. We propose a novel and highly scalable parallel algorithm for performing the NUFFT on x86-based multi-core CPUs. The high performance of our algorithm relies on good SIMD utilization and high parallel efficiency. On convolution, we demonstrate on average 90% SIMD efficiency using SSE, as well as up to linear scalability on a quad-socket, 40-core Intel® Xeon® E7-4870 based system. As a result, on a dual-socket Intel® Xeon® X5670 based server, our NUFFT implementation is more than 4x faster than the best available NUFFT3D implementation when run on the same hardware. On an Intel® Xeon® E5-2670 based server, our NUFFT implementation is 1.5X faster than any published NUFFT implementation today. Such speed improvement opens new usages for the NUFFT. For example, iterative multi-channel reconstruction of a 240×240×240 image could execute in just over 3 minutes, which is on the same order as contemporary non-iterative (and thus less accurate) 3D NUFFT-based MRI reconstructions.
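
The dominant cost named above, convolving samples between the non-uniform and uniform grids, is the gridding step. A scalar 1D sketch of it, with an assumed Gaussian kernel and none of the SIMD or threading optimizations from the paper, is:

```c
/* Minimal 1D gridding sketch: spread each non-uniform sample onto nearby
 * uniform grid points with a Gaussian kernel. Positions are assumed to be
 * given in grid-index units; the kernel choice, width, and scalar loop are
 * illustrative -- the paper's implementation is SIMD- and thread-parallel
 * and operates in 3D. */
#include <math.h>

void grid_1d(const double *sample_pos, const double *sample_val, int nsamples,
             double *grid, int ngrid, int half_width, double tau) {
    for (int s = 0; s < nsamples; s++) {
        int center = (int)floor(sample_pos[s]);          /* nearest grid index */
        for (int k = -half_width; k <= half_width; k++) {
            int g = center + k;
            if (g < 0 || g >= ngrid) continue;
            double d = sample_pos[s] - (double)g;
            grid[g] += sample_val[s] * exp(-d * d / tau); /* convolution weight */
        }
    }
}
```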


Proceedings of the 24th European MPI Users' Group Meeting | 2017

Planning for performance: persistent collective operations for MPI

Bradley Morgan; Daniel J. Holmes; Anthony Skjellum; Purushotham Bangalore; Srinivas Sridharan

Advantages of nonblocking collective communication in MPI have been established over the past quarter century, even predating MPI-1. For regular computations with fixed communication patterns, more optimizations can be revealed through the use of persistence (planned transfers), not currently available in the MPI-3 API except for a limited form of point-to-point persistence (aka half-channels) standardized since MPI-1. This paper covers the design, prototype implementation of LibPNBC (based on LibNBC), and MPI-4 standardization status of persistent nonblocking collective operations. We provide early performance results, using a modified version of NBCBench and an example illustrating the potential performance enhancements for such operations. Persistent operations allow MPI implementations to make intelligent choices about algorithm and resource utilization once and amortize this decision cost across many uses in a long-running program. Evidence that this approach is of value is provided. As with non-persistent, nonblocking collective operations, strong progress and blocking completion notification are jointly needed to maximize the benefit of such operations (e.g., overlap of communication with computation or other communication). Further enhancement of the current implementation prototype, as well as additional opportunities to enhance performance through the application of these new APIs, comprise future work.
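
The pattern these operations enable is plan once, execute many times. A minimal sketch using the MPI-4 persistent-collective interface that this line of work fed into (it requires an MPI-4 library; the LibPNBC prototype's own entry points may differ):

```c
/* Minimal sketch of a persistent collective using the MPI-4 interface.
 * The plan is built once with MPI_Allreduce_init and reused each iteration,
 * letting the library amortize algorithm and resource decisions. */
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    double local = 1.0, global = 0.0;
    MPI_Request req;

    /* Plan the collective once; algorithms/resources can be chosen here. */
    MPI_Allreduce_init(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                       MPI_COMM_WORLD, MPI_INFO_NULL, &req);

    for (int iter = 0; iter < 100; iter++) {
        MPI_Start(&req);              /* begin this use of the planned transfer */
        /* ... independent computation could overlap here ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        local = global * 0.5;         /* illustrative update between iterations */
    }

    MPI_Request_free(&req);           /* release the plan */
    MPI_Finalize();
    return 0;
}
```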


Parallel Computing | 2018

Planning for Performance: Enhancing Achievable Performance for MPI through Persistent Collective Operations

Daniel J. Holmes; Bradley Morgan; Anthony Skjellum; Purushotham Bangalore; Srinivas Sridharan

Advantages of nonblocking collective communication in MPI have been established over the past quarter century, even predating MPI-1. For regular computations with fixed communication patterns, significant additional optimizations can be revealed through the use of persistence (planned transfers), not currently available in the MPI-3 API except for a limited form of point-to-point persistence (aka half-channels) standardized since MPI-1. This paper covers the design, prototype implementation of LibPNBC (based on LibNBC), and MPI-4 standardization status of persistent nonblocking collective operations. We provide early performance results, using a modified version of NBCBench and an example application (based on 3D conjugate gradient) illustrating the potential performance enhancements for such operations. Persistent operations enable MPI implementations to make intelligent choices about algorithm and resource utilization once and amortize this decision cost across many uses in a long-running program. Evidence that this approach is of value is provided. As with non-persistent, nonblocking collective operations, strong progress and blocking completion notification are jointly needed to maximize the benefit of such operations (e.g., to support overlap of communication with computation and/or other communication). Further enhancement of the current reference implementation, as well as additional opportunities to enhance performance through the application of these new APIs, comprise future work.


Concurrency and Computation: Practice and Experience | 2018

TensorFlow at Scale: Performance and productivity analysis of distributed training with Horovod, MLSL, and Cray PE ML

Thorsten Kurth; Mikhail Smorkalov; Peter Mendygral; Srinivas Sridharan; Amrita Mathuriya

Deep learning has proven to be a successful tool for solving a large variety of problems in various scientific fields and beyond. In recent years, the models as well as the available datasets have grown bigger and more complicated, and thus an increasing amount of computing resources is required in order to train these models in a reasonable amount of time. Besides being able to use HPC resources, deep learning model developers want flexible frameworks which allow for rapid prototyping. One of the most important of these frameworks is Google TensorFlow, which provides both features, i.e., good performance as well as flexibility. In this paper, we discuss different solutions for scaling the TensorFlow framework to thousands of nodes on contemporary Cray XC supercomputing systems.
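
Under all three plugins compared (Horovod, MLSL, Cray PE ML), distributed training reduces to an allreduce over gradient buffers each step. As a rough illustration of that communication pattern, and not the implementation of any of these frameworks, an allreduce can be composed from a reduce-scatter followed by an allgather with standard MPI calls, assuming the buffer length divides evenly by the number of ranks:

```c
/* Illustrative composition of an allreduce from reduce-scatter + allgather,
 * the communication pattern underlying Horovod-style gradient averaging.
 * Assumes nelems is divisible by the number of ranks; this is a sketch,
 * not the implementation used by any of the frameworks named above. */
#include <mpi.h>

void composed_allreduce(float *grad, float *scratch, int nelems, MPI_Comm comm) {
    int np;
    MPI_Comm_size(comm, &np);
    int chunk = nelems / np;                 /* assumed to divide evenly */

    /* Each rank ends up with the reduced values for its own chunk. */
    MPI_Reduce_scatter_block(grad, scratch, chunk, MPI_FLOAT, MPI_SUM, comm);

    /* Gather every rank's reduced chunk so all ranks hold the full result. */
    MPI_Allgather(scratch, chunk, MPI_FLOAT, grad, chunk, MPI_FLOAT, comm);
}
```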


arXiv: Distributed, Parallel, and Cluster Computing | 2016

Distributed Deep Learning Using Synchronous Stochastic Gradient Descent.

Dipankar Das; Sasikanth Avancha; Dheevatsa Mudigere; Karthikeyan Vaidyanathan; Srinivas Sridharan; Dhiraj D. Kalamkar; Bharat Kaul; Pradeep Dubey
