
Publication


Featured research published by Bharat Kaul.


International Parallel and Distributed Processing Symposium | 2014

Improving Communication Performance and Scalability of Native Applications on Intel Xeon Phi Coprocessor Clusters

Karthikeyan Vaidyanathan; Kiran Pamnany; Dhiraj D. Kalamkar; Alexander Heinecke; Mikhail Smelyanskiy; Jongsoo Park; Daehyun Kim; Aniruddha G. Shet; Bharat Kaul; Balint Joo; Pradeep Dubey

Intel Xeon Phi coprocessor-based clusters offer high compute and memory performance for parallel workloads and also support direct network access. Many real-world applications are significantly impacted by network characteristics, and to maximize the performance of such applications on these clusters, it is particularly important to effectively saturate network bandwidth and/or hide communication latency. We demonstrate how to do so using techniques such as pipelined DMAs for data transfer, dynamic chunk sizing, and better asynchronous progress. We also show a method for, and the impact of, avoiding serialization and maximizing parallelism during application communication phases. Additionally, we apply application optimizations focused on balancing computation and communication in order to hide communication latency and improve utilization of cores and of network bandwidth. We demonstrate the impact of our techniques on three well-known and highly optimized HPC kernels running natively on the Intel Xeon Phi coprocessor. For the Wilson-Dslash operator from Lattice QCD, we characterize the improvements from each of our optimizations for communication performance, apply our method for maximizing concurrency during communication phases, and show an overall 48% improvement from our previously best published result. For HPL/LINPACK, we show 68.5% efficiency with 97 TFLOPs on 128 Intel Xeon Phi coprocessors, the first ever reported native HPL efficiency on a coprocessor-based supercomputer. For FFT, we show 10.8 TFLOPs using 1024 Intel Xeon Phi coprocessors on the TACC Stampede cluster, the highest reported performance on any Intel Architecture-based cluster and the first such result to be reported on a coprocessor-based supercomputer.
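The payoff of the pipelined-DMA idea can be seen with a simple cost model (an illustrative sketch, not the paper's code): splitting one large message into chunks lets the transfer of chunk i+1 overlap computation on chunk i.

```python
# Illustrative pipelining cost model (assumed, not the authors' code):
# chunking a transfer overlaps communication with computation.
def pipelined_time(t_comm, t_comp, n_chunks):
    """Total time when transfer of chunk i+1 overlaps compute on chunk i,
    assuming perfectly even chunking and no per-chunk overhead."""
    tc = t_comm / n_chunks   # per-chunk transfer time
    tp = t_comp / n_chunks   # per-chunk compute time
    # The first chunk must arrive before compute starts; after that the
    # slower of the two stages dominates each step.
    return tc + (n_chunks - 1) * max(tc, tp) + tp

serial = 10.0 + 10.0                        # no overlap: communicate, then compute
overlapped = pipelined_time(10.0, 10.0, 8)  # 11.25: close to max(comm, comp)
print(serial, overlapped)
```

In practice chunk sizing is dynamic (as the paper notes) because per-chunk startup overhead penalizes very small chunks.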


International Journal of Modern Physics C | 2013

On Vectorization for Lattice-Based Simulations

Aniruddha G. Shet; K. Siddharth; Shahajhan H. Sorathiya; Anand M. Deshpande; Sunil D. Sherlekar; Bharat Kaul; Santosh Ansumali

We present a vector-friendly blocked computing strategy for the lattice Boltzmann method (LBM). This strategy, along with a recently developed data structure, Structure of Arrays of Structures (SoAoS), is implemented for multi-relaxation-type lattice Boltzmann (LB). The proposed methodology enables optimal memory-bandwidth utilization in the advection step and high compute efficiency in the collision step of an LB implementation. In a dense computing environment, the proposed performance optimization framework for the LBM achieves high single-core efficiency.
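A minimal sketch of the SoAoS idea (an assumed illustration with made-up sizes, not the paper's implementation): cells are grouped into fixed-size blocks, and within a block each lattice direction is stored contiguously, so a SIMD unit can load several consecutive cells of one direction in a single vector read.

```python
import numpy as np

# Hypothetical SoAoS layout demo: D2Q9-style data with Q directions per
# cell, N cells, and SIMD-friendly blocks of B cells.
Q, N, B = 9, 32, 8
n_blocks = N // B

aos = np.arange(N * Q).reshape(N, Q)            # cell-major (Array of Structures)
soaos = np.empty((n_blocks, Q, B), aos.dtype)   # block -> direction -> cell-in-block
for c in range(N):
    soaos[c // B, :, c % B] = aos[c]

def load(cell, q):
    """Read direction q of a cell from the SoAoS layout."""
    return soaos[cell // B, q, cell % B]

assert load(13, 4) == aos[13, 4]
# Within a block, one direction's values are contiguous in memory,
# which is what enables vectorized advection/collision loops:
assert np.array_equal(soaos[0, 4], aos[0:B, 4])
```

Plain SoA would also give contiguous per-direction access, but blocking keeps all Q directions of a cell group within a small memory footprint, which helps the collision step's locality.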


International Symposium on Computer Architecture | 2017

ScaleDeep: A Scalable Compute Architecture for Learning and Evaluating Deep Networks

Swagath Venkataramani; Ashish Ranjan; Subarno Banerjee; Dipankar Das; Sasikanth Avancha; Ashok Jagannathan; Ajaya V. Durg; Dheemanth Nagaraj; Bharat Kaul; Pradeep Dubey; Anand Raghunathan

Deep Neural Networks (DNNs) have demonstrated state-of-the-art performance on a broad range of tasks involving natural language, speech, image, and video processing, and are deployed in many real-world applications. However, DNNs impose significant computational challenges owing to the complexity of the networks and the amount of data they process, both of which are projected to grow in the future. To improve the efficiency of DNNs, we propose SCALEDEEP, a dense, scalable server architecture whose processing, memory, and interconnect subsystems are specialized to leverage the compute and communication characteristics of DNNs. While several DNN accelerator designs have been proposed in recent years, the key difference is that SCALEDEEP primarily targets DNN training, as opposed to only inference or evaluation. The key architectural features from which SCALEDEEP derives its efficiency are: (i) heterogeneous processing tiles and chips to match the wide diversity in computational characteristics (FLOPs and Bytes/FLOP ratio) that manifest at different levels of granularity in DNNs, (ii) a memory hierarchy and 3-tiered interconnect topology that is suited to the memory access and communication patterns in DNNs, (iii) a low-overhead synchronization mechanism based on hardware data-flow trackers, and (iv) methods to map DNNs to the proposed architecture that minimize data movement and improve core utilization through nested pipelining. We have developed a compiler to allow any DNN topology to be programmed onto SCALEDEEP, and a detailed architectural simulator to estimate performance and energy. The simulator incorporates timing and power models of SCALEDEEP's components based on synthesis to Intel's 14 nm technology. We evaluate an embodiment of SCALEDEEP with 7032 processing tiles that operates at 600 MHz and has a peak performance of 680 TFLOPs (single precision) and 1.35 PFLOPs (half precision) at 1.4 kW. Across 11 state-of-the-art DNNs containing 0.65M-14.9M neurons and 6.8M-145.9M weights, including winners from 5 years of the ImageNet competition, SCALEDEEP demonstrates 6×-28× speedup at iso-power over the state-of-the-art performance on GPUs.
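The Bytes/FLOP diversity that motivates heterogeneous tiles can be illustrated with back-of-the-envelope arithmetic (a hypothetical sketch, not the ScaleDeep toolchain; byte counts ignore on-chip reuse):

```python
# Hypothetical operational-intensity estimates for two layer types,
# counting each multiply-accumulate as 2 FLOPs and assuming fp32 data.
def conv_stats(h, w, cin, cout, k, bytes_per=4):
    flops = 2 * h * w * cin * cout * k * k
    bytes_moved = bytes_per * (h * w * cin + h * w * cout + k * k * cin * cout)
    return flops, bytes_moved / flops          # (FLOPs, Bytes/FLOP)

def fc_stats(nin, nout, bytes_per=4):
    flops = 2 * nin * nout
    bytes_moved = bytes_per * (nin + nout + nin * nout)
    return flops, bytes_moved / flops

print(conv_stats(56, 56, 64, 64, 3))  # convolution: very low Bytes/FLOP (compute-bound)
print(fc_stats(4096, 4096))           # fully connected: ~2 Bytes/FLOP (bandwidth-bound)
```

Layers at these two extremes favor different tile designs, which is the rationale behind feature (i) above.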


International Parallel and Distributed Processing Symposium | 2015

Exploring Shared-Memory Optimizations for an Unstructured Mesh CFD Application on Modern Parallel Systems

Dheevatsa Mudigere; Srinivas Sridharan; Anand M. Deshpande; Jongsoo Park; Alexander Heinecke; Mikhail Smelyanskiy; Bharat Kaul; Pradeep Dubey; Dinesh K. Kaushik; David E. Keyes

In this work, we revisit the 1999 Gordon Bell Prize-winning PETSc-FUN3D aerodynamics code, extending it with highly tuned shared-memory parallelization and detailed performance analysis on modern highly parallel architectures. An unstructured-grid implicit flow solver, which forms the backbone of computational aerodynamics, poses particular challenges due to its large irregular working sets, unstructured memory accesses, and variable/limited amount of parallelism. This code, based on a domain decomposition approach, exposes tradeoffs between the number of threads assigned to each MPI-rank subdomain and the total number of domains. By applying several algorithm- and architecture-aware optimization techniques for unstructured grids, we show a 6.9X speed-up in performance on a single-node Intel® Xeon™ E5-2690 v2 processor relative to the out-of-the-box compilation. Our scaling studies on the TACC Stampede supercomputer show that our optimizations continue to provide performance benefits over the baseline implementation as we scale up to 256 nodes.
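One classic shared-memory optimization for unstructured-grid kernels of this kind is edge coloring (an assumed illustration, not the PETSc-FUN3D code): color the edges so that no two edges of the same color share a vertex, and then process each color fully in parallel without atomics on the vertex arrays.

```python
# Greedy edge coloring: edges within one color class touch disjoint
# vertices, so their vertex updates can run in parallel conflict-free.
def color_edges(edges):
    colors = []        # color assigned to each edge, in input order
    vertex_used = []   # vertex_used[c] = vertices already taken by color c
    for u, v in edges:
        for c, used in enumerate(vertex_used):
            if u not in used and v not in used:
                used.update((u, v))
                colors.append(c)
                break
        else:                                   # no existing color fits
            vertex_used.append({u, v})
            colors.append(len(vertex_used) - 1)
    return colors

edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
print(color_edges(edges))  # [0, 1, 0, 1, 2]
```

The tradeoff the abstract mentions shows up here too: more colors means more synchronization points between parallel phases, while fewer colors means larger conflict-free batches per phase.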


International Parallel and Distributed Processing Symposium | 2012

High Performance Non-uniform FFT on Modern X86-based Multi-core Systems

Dhiraj D. Kalamkar; Joshua D. Trzaskoz; Srinivas Sridharan; Mikhail Smelyanskiy; Daehyun Kim; Armando Manduca; Yunhong Shu; Matt A. Bernstein; Bharat Kaul; Pradeep Dubey

The Non-Uniform Fast Fourier Transform (NUFFT) is a generalization of the FFT to non-equidistant samples. It has many applications, which vary from medical imaging to radio astronomy to the numerical solution of partial differential equations. Despite recent advances in speeding up the NUFFT on various platforms, its practical applications are still limited due to its high computational cost, which is dominated by the convolution of a signal between non-uniform and uniform grids. The computational cost of the NUFFT is particularly detrimental in cases which require fast reconstruction times, such as iterative 3D non-Cartesian MRI reconstruction. We propose a novel and highly scalable parallel algorithm for performing the NUFFT on x86-based multi-core CPUs. The high performance of our algorithm relies on good SIMD utilization and high parallel efficiency. On convolution, we demonstrate on average 90% SIMD efficiency using SSE, as well as up to linear scalability on a quad-socket, 40-core Intel® Xeon® E7-4870 processor-based system. As a result, on a dual-socket Intel® Xeon® X5670-based server, our NUFFT implementation is more than 4x faster than the best available NUFFT3D implementation when run on the same hardware. On an Intel® Xeon® E5-2670 processor-based server, our NUFFT implementation is 1.5X faster than any published NUFFT implementation today. Such speed improvement opens new usages for the NUFFT. For example, an iterative multi-channel reconstruction of a 240×240×240 image could execute in just over 3 minutes, which is on the same order as contemporary non-iterative (and thus less accurate) 3D NUFFT-based MRI reconstructions.
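A minimal 1-D sketch of the gridding ("convolution") step the paper targets (an assumed illustration, not the paper's optimized kernel; window parameters are chosen for demonstration only): non-uniform samples are spread onto a uniform grid with a truncated Gaussian window, and an ordinary FFT then runs on the grid.

```python
import numpy as np

# Toy 1-D type-1 NUFFT: spread samples at arbitrary points x in [0, 2*pi)
# onto an n_grid uniform grid via a Gaussian window, then FFT.  The
# deconvolution of the window (and all performance tricks) are omitted.
def nufft1d(x, c, n_grid, half_width=6, tau=0.05):
    grid = np.zeros(n_grid, dtype=complex)
    for xk, ck in zip(x, c):
        center = int(round(xk / (2 * np.pi) * n_grid))
        for j in range(center - half_width, center + half_width + 1):
            d = xk - 2 * np.pi * j / n_grid
            grid[j % n_grid] += ck * np.exp(-d * d / (4 * tau))
    return np.fft.fft(grid)

rng = np.random.default_rng(0)
x = rng.uniform(0, 2 * np.pi, 200)            # non-equidistant sample points
spectrum = nufft1d(x, np.exp(1j * 3 * x), 64)  # pure mode-3 signal
print(int(np.argmax(np.abs(spectrum))))        # strongest recovered mode
```

The inner spreading loop is the hot spot: each sample scatters into a small neighborhood of grid points, which is why SIMD utilization and conflict-free parallelization of exactly this step dominate the paper's results.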


European Conference on Computer Vision | 2018

Out-of-Distribution Detection Using an Ensemble of Self Supervised Leave-Out Classifiers

Apoorv Vyas; Nataraj Jammalamadaka; Xia Zhu; Dipankar Das; Bharat Kaul; Theodore L. Willke

As deep learning methods form a critical part of commercially important applications such as autonomous driving and medical diagnostics, it is important to reliably detect out-of-distribution (OOD) inputs while employing these algorithms. In this work, we propose an OOD detection algorithm which comprises an ensemble of classifiers. We train each classifier in a self-supervised manner by leaving out a random subset of training data as OOD data and using the rest as in-distribution (ID) data. We propose a novel margin-based loss over the softmax output which seeks to maintain at least a margin m between the average entropy of the OOD and in-distribution samples. In conjunction with the standard cross-entropy loss, we minimize the novel loss to train an ensemble of classifiers. We also propose a novel method to combine the outputs of the ensemble of classifiers to obtain an OOD detection score and class prediction. Overall, our method convincingly outperforms Hendrycks et al. [7] and the current state-of-the-art ODIN [13] on several OOD detection benchmarks.
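The margin-based entropy term can be sketched as follows (an assumed reimplementation for illustration, not the authors' code): the penalty is zero once the average softmax entropy of the held-out (OOD) batch exceeds that of the in-distribution (ID) batch by at least the margin m.

```python
import numpy as np

def entropy(logits):
    """Shannon entropy of the softmax distribution for each row of logits."""
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=1)

def margin_entropy_loss(id_logits, ood_logits, m=0.4):
    """Hinge penalty: want mean H(OOD) >= mean H(ID) + m."""
    return max(0.0, m + entropy(id_logits).mean() - entropy(ood_logits).mean())

confident = np.array([[8.0, 0.0, 0.0]])   # low-entropy prediction (ID-like)
uniform   = np.array([[1.0, 1.0, 1.0]])   # high-entropy prediction (OOD-like)
print(margin_entropy_loss(confident, uniform))  # 0.0: margin already satisfied
print(margin_entropy_loss(uniform, confident))  # positive: penalized
```

During training this term is added to the usual cross-entropy on the ID samples, pushing the network toward uncertain (high-entropy) outputs on its left-out pseudo-OOD subset.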


arXiv: Distributed, Parallel, and Cluster Computing | 2016

Distributed Deep Learning Using Synchronous Stochastic Gradient Descent.

Dipankar Das; Sasikanth Avancha; Dheevatsa Mudigere; Karthikeyan Vaidyanathan; Srinivas Sridharan; Dhiraj D. Kalamkar; Bharat Kaul; Pradeep Dubey


Physical Review E | 2013

Data structure and movement for lattice-based simulations.

Aniruddha G. Shet; Shahajhan H. Sorathiya; Siddharth Krithivasan; Anand M. Deshpande; Bharat Kaul; Sunil D. Sherlekar; Santosh Ansumali


arXiv: Learning | 2017

Ternary Neural Networks with Fine-Grained Quantization.

Naveen Mellempudi; Abhisek Kundu; Dheevatsa Mudigere; Dipankar Das; Bharat Kaul; Pradeep Dubey


International Conference on Learning Representations | 2018

Mixed Precision Training of Convolutional Neural Networks using Integer Operations

Dipankar Das; Naveen Mellempudi; Dheevatsa Mudigere; Dhiraj D. Kalamkar; Sasikanth Avancha; Kunal Banerjee; Srinivas Sridharan; Karthik Vaidyanathan; Bharat Kaul; Evangelos Georganas; Alexander Heinecke; Pradeep Dubey; Jesus Corbal; Nikita Shustrov; Roman Dubtsov; Evarist Fomenko; Vadim O. Pirogov
