
Publications


Featured research published by Kiran Pamnany.


International Supercomputing Conference | 2013

Lattice QCD on Intel® Xeon Phi™ Coprocessors

Balint Joo; Dhiraj D. Kalamkar; Karthikeyan Vaidyanathan; Mikhail Smelyanskiy; Kiran Pamnany; Victor W. Lee; Pradeep Dubey; William A. Watson

Lattice Quantum Chromodynamics (LQCD) is currently the only known model-independent, non-perturbative computational method for calculations in the theory of the strong interactions, and is of importance in studies of nuclear and high-energy physics. LQCD codes use large fractions of supercomputing cycles worldwide and are often amongst the first to be ported to new high-performance computing architectures. The recently released Intel Xeon Phi architecture from Intel Corporation features parallelism at the level of many x86-based cores, multiple threads per core, and vector processing units. In this contribution, we describe our experiences with optimizing a key LQCD kernel for the Xeon Phi architecture. On a single node, using single precision, our Dslash kernel sustains a performance of up to 320 GFLOPS, while our Conjugate Gradients solver sustains up to 237 GFLOPS. Furthermore, we demonstrate a fully 'native' multi-node LQCD implementation running entirely on KNC nodes with minimal involvement of the host CPU. Our multi-node implementation of the solver has been strong-scaled to 3.9 TFLOPS on 32 KNCs.
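For context, the Dslash kernel discussed above applies the Wilson hopping term of the lattice Dirac operator. In one common convention (an assumption here; signs and normalization factors vary between code bases) it reads:

```latex
(D\psi)(x) \;=\; \sum_{\mu=1}^{4}\Big[\,U_\mu(x)\,(1-\gamma_\mu)\,\psi(x+\hat\mu)\;+\;U_\mu^{\dagger}(x-\hat\mu)\,(1+\gamma_\mu)\,\psi(x-\hat\mu)\,\Big]
```

Each lattice site gathers spinors from its eight nearest neighbors and multiplies them by 3×3 SU(3) link matrices, which is what makes the kernel both bandwidth-hungry and well suited to the Xeon Phi's wide vector units.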


International Parallel and Distributed Processing Symposium | 2014

Improving Communication Performance and Scalability of Native Applications on Intel Xeon Phi Coprocessor Clusters

Karthikeyan Vaidyanathan; Kiran Pamnany; Dhiraj D. Kalamkar; Alexander Heinecke; Mikhail Smelyanskiy; Jongsoo Park; Daehyun Kim; Aniruddha G. Shet; Bharat Kaul; Balint Joo; Pradeep Dubey

Intel Xeon Phi coprocessor-based clusters offer high compute and memory performance for parallel workloads and also support direct network access. Many real-world applications are significantly impacted by network characteristics, and to maximize the performance of such applications on these clusters, it is particularly important to effectively saturate network bandwidth and/or hide communication latency. We demonstrate how to do so using techniques such as pipelined DMAs for data transfer, dynamic chunk sizing, and better asynchronous progress. We also show a method for, and the impact of, avoiding serialization and maximizing parallelism during application communication phases. Additionally, we apply application optimizations focused on balancing computation and communication in order to hide communication latency and improve utilization of cores and of network bandwidth. We demonstrate the impact of our techniques on three well-known and highly optimized HPC kernels running natively on the Intel Xeon Phi coprocessor. For the Wilson-Dslash operator from Lattice QCD, we characterize the improvements from each of our optimizations for communication performance, apply our method for maximizing concurrency during communication phases, and show an overall 48% improvement over our previously best published result. For HPL/LINPACK, we show 68.5% efficiency with 97 TFLOPS on 128 Intel Xeon Phi coprocessors, the first native HPL efficiency ever reported on a coprocessor-based supercomputer. For FFT, we show 10.8 TFLOPS using 1024 Intel Xeon Phi coprocessors on the TACC Stampede cluster, the highest reported performance on any Intel Architecture-based cluster and the first such result to be reported on a coprocessor-based supercomputer.
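The pipelining-with-chunking idea from this abstract can be sketched generically: split a large buffer into chunks and overlap packing of the next chunk with transmission of the current one. The sketch below is a minimal Python stand-in (the `send` callback and the trivial "packing" step are placeholders; the paper's implementation uses DMA engines and dynamically sized chunks, neither of which is modeled here).

```python
from concurrent.futures import ThreadPoolExecutor

def pipelined_transfer(data: bytes, chunk_size: int, send) -> None:
    """Split `data` into chunks and overlap packing of chunk i+1 with
    sending of chunk i (a two-stage software pipeline / double buffer)."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=1) as sender:
        pending = None
        for chunk in chunks:
            packed = bytes(chunk)                 # stand-in for the packing stage
            if pending is not None:
                pending.result()                  # wait for the previous send
            pending = sender.submit(send, packed) # send overlaps the next pack
        if pending is not None:
            pending.result()

received = []
pipelined_transfer(b"abcdefghij", 4, received.append)
```

Keeping at most one send in flight models the two buffers of a classic double-buffering scheme; a deeper pipeline would simply allow more outstanding sends.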


IEEE International Conference on High Performance Computing, Data, and Analytics | 2015

Improving concurrency and asynchrony in multithreaded MPI applications using software offloading

Karthikeyan Vaidyanathan; Dhiraj D. Kalamkar; Kiran Pamnany; Jeff R. Hammond; Pavan Balaji; Dipankar Das; Jongsoo Park; Balint Joo

We present a new approach for multithreaded communication and asynchronous progress in MPI applications, wherein we offload communication processing to a dedicated thread. The central premise is that given the rapidly increasing core counts on modern systems, the improvements in MPI performance arising from dedicating a thread to drive communication outweigh the small loss of resources for application computation, particularly when overlap of communication and computation can be exploited. Our approach allows application threads to make MPI calls concurrently, enqueuing these as communication tasks to be processed by a dedicated communication thread. This not only guarantees progress for such communication operations, but also reduces load imbalance. Our implementation additionally significantly reduces the overhead of mutual exclusion seen in existing implementations for applications using MPI_THREAD_MULTIPLE. Our technique requires no modification to the application, and we demonstrate significant performance improvement (up to 2X) for QCD, 1-D FFT and deep learning CNN applications.
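The offload pattern described here can be illustrated with a small queue-based sketch: application threads enqueue communication tasks, and a single dedicated thread drains the queue, which both guarantees progress and avoids the lock contention of many threads calling into the communication library at once. This Python sketch is a generic stand-in (the paper offloads real MPI calls; here a task is just an arbitrary callable).

```python
import queue
import threading

class CommThread:
    """Dedicated thread that serializes communication tasks enqueued
    by many application threads (the software-offload pattern)."""

    def __init__(self):
        self.tasks = queue.Queue()
        self.worker = threading.Thread(target=self._run, daemon=True)
        self.worker.start()

    def _run(self):
        while True:
            task = self.tasks.get()
            if task is None:          # shutdown sentinel
                break
            task()                    # e.g. would drive MPI_Isend/MPI_Test here

    def submit(self, fn, *args):
        """Enqueue `fn(*args)`; returns an event and a result holder."""
        done, result = threading.Event(), {}
        def wrapper():
            result["value"] = fn(*args)
            done.set()
        self.tasks.put(wrapper)
        return done, result

    def shutdown(self):
        self.tasks.put(None)
        self.worker.join()

comm = CommThread()
done, res = comm.submit(lambda: "sent")
done.wait()
comm.shutdown()
```

Because only the dedicated thread ever touches the (stand-in) communication layer, application threads never block each other on a shared communication lock, which is the mechanism behind the reduced `MPI_THREAD_MULTIPLE` overhead the abstract reports.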


IEEE/ACM Transactions on Computational Biology and Bioinformatics | 2015

Parallel Mutual Information Based Construction of Genome-Scale Networks on the Intel® Xeon Phi™ Coprocessor

Sanchit Misra; Kiran Pamnany; Srinivas Aluru

Construction of whole-genome networks from large-scale gene expression data is an important problem in systems biology. While several techniques have been developed, most cannot handle network reconstruction at the whole-genome scale, and the few that can, require large clusters. In this paper, we present a solution on the Intel Xeon Phi coprocessor, taking advantage of its multi-level parallelism including many x86-based cores, multiple threads per core, and vector processing units. We also present a solution on the Intel® Xeon® processor. Our solution is based on TINGe, a fast parallel network reconstruction technique that uses mutual information and permutation testing for assessing statistical significance. We demonstrate the first ever inference of a plant whole genome regulatory network on a single chip by constructing a 15,575 gene network of the plant Arabidopsis thaliana from 3,137 microarray experiments in only 22 minutes. In addition, our optimization for parallelizing mutual information computation on the Intel Xeon Phi coprocessor holds out lessons that are applicable to other domains.
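The core primitive here, mutual information with permutation testing for significance, can be sketched for discrete data in a few lines. This is only a minimal illustration: TINGe and the paper operate on continuous expression values at genome scale, which this toy version does not attempt.

```python
import math
import random
from collections import Counter

def mutual_information(xs, ys):
    """MI (in nats) between two equal-length discrete sequences."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def permutation_test(xs, ys, trials=200, seed=0):
    """Return (observed MI, p-value): the fraction of shuffled
    replicates whose MI reaches the observed value."""
    rng = random.Random(seed)
    observed = mutual_information(xs, ys)
    ys = list(ys)
    hits = 0
    for _ in range(trials):
        rng.shuffle(ys)               # break any real dependence
        if mutual_information(xs, ys) >= observed:
            hits += 1
    return observed, hits / trials

xs = [0, 0, 1, 1] * 25                # a balanced binary "gene"
obs, p = permutation_test(xs, xs)     # perfectly dependent pair
```

For the identical pair above, the observed MI is log 2 and essentially no shuffled replicate reaches it, so the edge would be kept; an all-pairs version of this loop over thousands of genes is what makes the problem large enough to need the manycore parallelization the paper describes.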


International Parallel and Distributed Processing Symposium | 2014

Parallel Mutual Information Based Construction of Whole-Genome Networks on the Intel® Xeon Phi™ Coprocessor

Sanchit Misra; Kiran Pamnany; Srinivas Aluru

Construction of whole-genome networks from large-scale gene expression data is an important problem in systems biology. While several techniques have been developed, most cannot handle network reconstruction at the whole-genome scale, and the few that can, require large clusters. In this paper, we present a solution on the Intel® Xeon Phi™ coprocessor, taking advantage of its multi-level parallelism including many x86-based cores, multiple threads per core, and vector processing units. We also present a solution on the Intel® Xeon® processor. Our solution is based on TINGe, a fast parallel network reconstruction technique that uses mutual information and permutation testing for assessing statistical significance. We demonstrate the first ever inference of a plant whole genome regulatory network on a single chip by constructing a 15,575 gene network of the plant Arabidopsis thaliana from 3,137 microarray experiments in only 22 minutes. In addition, our optimization for parallelizing mutual information computation on the Intel Xeon Phi coprocessor holds out lessons that are applicable to other domains.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2014

Parallel Bayesian Network Structure Learning for Genome-Scale Gene Networks

Sanchit Misra; Md. Vasimuddin; Kiran Pamnany; Sriram P. Chockalingam; Yong Dong; Min Xie; Maneesha Aluru; Srinivas Aluru

Learning Bayesian networks is NP-hard. Even with recent progress in heuristic and parallel algorithms, modeling capabilities still fall short of the scale of the problems encountered. In this paper, we present a massively parallel method for Bayesian network structure learning, and demonstrate its capability by constructing genome-scale gene networks of the model plant Arabidopsis thaliana from over 168.5 million gene expression values. We report strong scaling efficiency of 75% and demonstrate scaling to 1.57 million cores of the Tianhe-2 supercomputer. Our results constitute three and five orders of magnitude increase over previously published results in the scale of data analyzed and computations performed, respectively. We achieve this through algorithmic innovations, using efficient techniques to distribute work across all compute nodes, all available processors and coprocessors on each node, all available threads on each processor and coprocessor, and vectorization techniques to maximize single thread performance.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2015

Dtree: Dynamic Task Scheduling at Petascale

Kiran Pamnany; Sanchit Misra; Vasimuddin; Xing Liu; Edmond Chow; Srinivas Aluru

Irregular applications are challenging to scale on supercomputers due to the difficulty of balancing load across large numbers of nodes. This challenge is exacerbated by the increasing heterogeneity of modern supercomputers, in which nodes often contain multiple processors and coprocessors operating at different speeds, and with differing core and thread counts. We present Dtree, a dynamic task scheduler designed to address this challenge. Dtree delivers near-optimal results for a class of HPC applications, improving time-to-solution by achieving near-perfect load balance while consuming negligible resources. We demonstrate Dtree’s effectiveness on up to 77,824 heterogeneous cores of the TACC Stampede supercomputer with two different petascale HPC applications: ParaBLe, which performs large-scale Bayesian network structure learning, and GTFock, which implements Fock matrix construction, an essential and expensive step in quantum chemistry codes. For ParaBLe, we show improved performance while eliminating the complexity of managing heterogeneity. For GTFock, we match the most recently published performance without using any application-specific optimizations for data access patterns (such as the task distribution design for communication reduction) that enabled that performance. We also show that Dtree can distribute from tens of thousands to hundreds of millions of irregular tasks across up to 1024 nodes with minimal overhead, while balancing load to within 2% of optimal.
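The self-scheduling principle behind a dynamic task scheduler can be shown on a single node: workers repeatedly claim the next unprocessed task, so faster workers naturally take on more work and irregular task sizes balance out. Dtree itself distributes work hierarchically across nodes of a supercomputer; this Python sketch only illustrates the basic dynamic-claiming idea with threads.

```python
import threading

def run_dynamic(tasks, n_workers):
    """Workers claim the next task index from a shared counter, so load
    balances automatically even when task durations are irregular."""
    next_task = [0]
    lock = threading.Lock()
    results = [None] * len(tasks)

    def worker():
        while True:
            with lock:                # claim the next unprocessed index
                i = next_task[0]
                next_task[0] += 1
            if i >= len(tasks):
                return
            results[i] = tasks[i]()   # irregular work happens outside the lock

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

out = run_dynamic([lambda v=v: v * v for v in range(100)], n_workers=4)
```

Contrast this with a static split of 25 tasks per worker: if one worker's tasks happen to be the expensive ones, the others sit idle, which is exactly the imbalance dynamic claiming avoids.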


Archive | 2013

Lattice QCD on Intel Xeon Phi

Balint Joo; Dhiraj D. Kalamkar; Karthikeyan Vaidyanathan; Mikhail Smelyanskiy; Kiran Pamnany; Victor W. Lee; Pradeep Dubey; William A. Watson


arXiv: Distributed, Parallel, and Cluster Computing | 2016

Learning an Astronomical Catalog of the Visible Universe through Scalable Bayesian Inference

Jeffrey Regier; Kiran Pamnany; Ryan Giordano; Rollin C. Thomas; David J. Schlegel; Jon McAuliffe; Prabhat


Archive | 2015

Parallel Mutual Information Based Construction of Genome-Scale Networks on the Intel® Xeon Phi™ Coprocessor

Sanchit Misra; Kiran Pamnany; Srinivas Aluru

Collaboration


Dive into Kiran Pamnany's collaborations.

Top Co-Authors

Srinivas Aluru
Georgia Institute of Technology

Balint Joo
Thomas Jefferson National Accelerator Facility

David J. Schlegel
Lawrence Berkeley National Laboratory

Edmond Chow
Georgia Institute of Technology

Jeffrey Regier
University of California