Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Manoj Kumar Krishnan is active.

Publication


Featured research published by Manoj Kumar Krishnan.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2006

Advances, Applications and Performance of the Global Arrays Shared Memory Programming Toolkit

Jarek Nieplocha; Bruce J. Palmer; Vinod Tipparaju; Manoj Kumar Krishnan; Harold E. Trease; Edoardo Aprà

This paper describes the capabilities, evolution, performance, and applications of the Global Arrays (GA) toolkit. GA was created to provide application programmers with an interface that allows them to distribute data while maintaining the type of global index space and programming syntax similar to that available when programming on a single processor. The goal of GA is to free the programmer from the low-level management of communication and allow them to deal with their problems at the level at which they were originally formulated. At the same time, compatibility of GA with MPI enables the programmer to take advantage of existing MPI software and libraries when available and appropriate. The variety of applications that have been implemented using Global Arrays attests to the attractiveness of using higher-level abstractions to write parallel code.
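The global-index-space idea described in the abstract can be illustrated with a toy sketch. This is plain Python modelling the concept only, not the actual GA C/Fortran interface: a one-dimensional array is block-distributed across "processes", yet callers read and write it through global indices, with the ownership arithmetic hidden inside the data structure.

```python
# Toy model of a globally-indexed, block-distributed array.
# Conceptual illustration only; not the Global Arrays API.

class GlobalArray:
    def __init__(self, length, nprocs):
        self.length = length
        self.nprocs = nprocs
        self.block = (length + nprocs - 1) // nprocs  # block size per "process"
        # Each "process" owns one local chunk; here they are plain lists.
        self.chunks = [[0] * min(self.block, length - p * self.block)
                       for p in range(nprocs)]

    def owner(self, i):
        """Map a global index to (owning process, local offset)."""
        return i // self.block, i % self.block

    def put(self, i, value):       # write through the global index
        p, off = self.owner(i)
        self.chunks[p][off] = value

    def get(self, i):              # read through the global index
        p, off = self.owner(i)
        return self.chunks[p][off]

ga = GlobalArray(length=10, nprocs=4)
for i in range(10):
    ga.put(i, i * i)
print(ga.get(7))    # prints 49; global index 7 lives on process 2, offset 1
```

In the real toolkit, `put` and `get` become one-sided communication to whichever process owns the block, which is exactly the low-level detail the abstract says GA hides.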


IEEE International Conference on High Performance Computing, Data, and Analytics | 2006

A Component Architecture for High-Performance Scientific Computing

Benjamin A. Allan; Robert C. Armstrong; David E. Bernholdt; Felipe Bertrand; Kenneth Chiu; Tamara L. Dahlgren; Kostadin Damevski; Wael R. Elwasif; Thomas Epperly; Madhusudhan Govindaraju; Daniel S. Katz; James Arthur Kohl; Manoj Kumar Krishnan; Gary Kumfert; J. Walter Larson; Sophia Lefantzi; Michael J. Lewis; Allen D. Malony; Lois C. Mclnnes; Jarek Nieplocha; Boyana Norris; Steven G. Parker; Jaideep Ray; Sameer Shende; Theresa L. Windus; Shujia Zhou

The Common Component Architecture (CCA) provides a means for software developers to manage the complexity of large-scale scientific simulations and to move toward a plug-and-play environment for high-performance computing. In the scientific computing context, component models also promote collaboration using independently developed software, thereby allowing particular individuals or groups to focus on the aspects of greatest interest to them. The CCA supports parallel and distributed computing as well as local high-performance connections between components in a language-independent manner. The design places minimal requirements on components and thus facilitates the integration of existing code into the CCA environment. The CCA model imposes minimal overhead to minimize the impact on application performance. The focus on high performance distinguishes the CCA from most other component models. The CCA is being applied within an increasing range of disciplines, including combustion research, global climate simulation, and computational chemistry.
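The plug-and-play idea behind component models like the CCA can be sketched with a provides/uses-port pattern. All class and method names below are illustrative inventions, not the CCA specification: one component provides an interface (a "port"), another declares that it uses it, and a framework wires them together so providers can be swapped freely.

```python
# Minimal sketch of the provides/uses port idea behind component models
# such as the CCA; names here are illustrative only.
from abc import ABC, abstractmethod

class IntegratorPort(ABC):                 # an abstract "port" (interface)
    @abstractmethod
    def integrate(self, f, lo, hi, n): ...

class MidpointIntegrator(IntegratorPort):  # a component *providing* the port
    def integrate(self, f, lo, hi, n):
        h = (hi - lo) / n
        return sum(f(lo + (k + 0.5) * h) for k in range(n)) * h

class Driver:                              # a component *using* the port
    def __init__(self):
        self.integrator = None
    def connect(self, port: IntegratorPort):   # the framework wires components
        self.integrator = port
    def run(self):
        return self.integrator.integrate(lambda x: x * x, 0.0, 1.0, 1000)

driver = Driver()
driver.connect(MidpointIntegrator())       # swap in any provider of the port
print(round(driver.run(), 4))              # ~ 1/3
```

The driver never names a concrete integrator, which is what lets independently developed components be exchanged without touching the using code.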


IEEE International Conference on High Performance Computing, Data, and Analytics | 2006

High Performance Remote Memory Access Communication: The ARMCI Approach

Jarek Nieplocha; Vinod Tipparaju; Manoj Kumar Krishnan; Dhabaleswar K. Panda

This paper describes the Aggregate Remote Memory Copy Interface (ARMCI), a portable high-performance remote memory access communication interface, developed originally under the U.S. Department of Energy (DOE) Advanced Computational Testing and Simulation Toolkit project and currently used and advanced as a part of the run-time layer of the DOE project Programming Models for Scalable Parallel Computing. The paper discusses the model, addresses the challenges of portable implementations, and demonstrates that ARMCI delivers high performance on a variety of platforms. Special emphasis is placed on latency-hiding mechanisms and the ability to optimize noncontiguous data transfers.
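The noncontiguous (strided) transfers the abstract emphasizes can be pictured with a small sketch. This is not the ARMCI C interface, just the underlying idea: a 2D sub-block of a row-major "remote" array is described once by its bounds and fetched as one logical operation, row segment by row segment, rather than as many element-wise messages.

```python
# Sketch of a strided remote-memory get: one logical operation moves a
# 2D sub-block of a flat row-major "remote" buffer. Illustrative only.

def strided_get(remote, ncols, row_lo, row_hi, col_lo, col_hi):
    """Copy rows [row_lo, row_hi) x cols [col_lo, col_hi) from a flat,
    row-major 'remote' buffer into a contiguous local buffer."""
    local = []
    width = col_hi - col_lo
    for r in range(row_lo, row_hi):
        start = r * ncols + col_lo          # one contiguous segment per row
        local.extend(remote[start:start + width])
    return local

# A 4x5 "remote" array stored flat, holding 0..19.
remote = list(range(20))
block = strided_get(remote, ncols=5, row_lo=1, row_hi=3, col_lo=2, col_hi=4)
print(block)  # rows 1-2, cols 2-3 -> [7, 8, 12, 13]
```

A real implementation optimizes exactly this pattern, e.g. by packing the row segments on one side and sending a single message instead of one per row.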


International Parallel and Distributed Processing Symposium | 2004

SRUMMA: a matrix multiplication algorithm suitable for clusters and scalable shared memory systems

Manoj Kumar Krishnan; Jarek Nieplocha

We describe a novel parallel algorithm that implements a dense matrix multiplication operation with algorithmic efficiency equivalent to that of Cannon's algorithm. It is suitable for clusters and scalable shared memory systems. The current approach differs from other parallel matrix multiplication algorithms in its explicit use of shared memory and remote memory access (RMA) communication rather than message passing. Experimental results on clusters (IBM SP, Linux-Myrinet) and shared memory systems (SGI Altix, Cray X1) demonstrate consistent performance advantages over pdgemm from the ScaLAPACK/PBLAS suite, the leading implementation of the parallel matrix multiplication algorithms in use today. In the best case, on the SGI Altix, the new algorithm performs 20 times better than pdgemm for a matrix size of 1000 on 128 processors. The impact of zero-copy nonblocking RMA communication and shared memory communication on matrix multiplication performance on clusters is also investigated.
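A serial sketch of the blocked kernel that SRUMMA-style algorithms distribute: each C(i,j) block accumulates products of A(i,k) and B(k,j) panels. In the parallel version described in the abstract, those panels would be fetched with one-sided RMA (or read directly from shared memory) so that the fetch of the next panel overlaps the multiply of the current one; the sketch below shows only the serial blocking structure.

```python
# Serial sketch of a blocked matrix multiply, the kernel distributed by
# SRUMMA-style algorithms. Each k-panel is what a parallel version would
# fetch with a one-sided get while computing on the previous panel.

def matmul_blocked(A, B, n, bs):
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, bs):
        for j0 in range(0, n, bs):
            for k0 in range(0, n, bs):       # one "panel fetch" per iteration
                for i in range(i0, min(i0 + bs, n)):
                    for k in range(k0, min(k0 + bs, n)):
                        a = A[i][k]
                        for j in range(j0, min(j0 + bs, n)):
                            C[i][j] += a * B[k][j]
    return C

n = 4
A = [[float(i * n + j) for j in range(n)] for i in range(n)]
B = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # identity
print(matmul_blocked(A, B, n, bs=2) == A)  # multiplying by I returns A -> True
```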


Conference on High Performance Computing (Supercomputing) | 2005

Multilevel Parallelism in Computational Chemistry using Common Component Architecture and Global Arrays

Manoj Kumar Krishnan; Yuri Alexeev; Theresa L. Windus; Jarek Nieplocha

The development of complex scientific applications for high-end systems is a challenging task. Addressing the complexity of the software and algorithms involved is becoming increasingly difficult and requires appropriate software engineering approaches to address interoperability, maintenance, and software composition challenges. At the same time, the requirements for performance and scalability to thousand-processor configurations magnify the difficulties facing the scientific programmer, owing to the variable levels of parallelism available in different algorithms or functional modules of the application. This paper demonstrates how the Common Component Architecture (CCA) and Global Arrays (GA) can be used in the context of computational chemistry to express and manage multilevel parallelism through the use of processor groups. For example, a numerical Hessian calculation using three levels of parallelism in the NWChem computational chemistry package outperformed the original version of the NWChem code, based on single-level parallelism, by 90% when running on 256 processors.
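The processor-group idea can be pictured with a toy sketch. This is plain Python, not GA's pgroup or MPI's communicator API: a "world" of ranks is split into subgroups, and independent coarse-grained tasks, such as the finite-difference displacements of a numerical Hessian, are assigned one per group, with each group free to use its own inner level of parallelism.

```python
# Toy sketch of multilevel parallelism via processor groups: split the
# world of ranks into subgroups and run one independent task per group.
# Illustrative only; real codes use MPI communicators or GA pgroups.

def split_into_groups(world, ngroups):
    size = len(world) // ngroups
    return [world[g * size:(g + 1) * size] for g in range(ngroups)]

world = list(range(8))                  # 8 ranks
groups = split_into_groups(world, 4)    # 4 groups of 2 ranks each
tasks = ["displacement-%d" % t for t in range(4)]

# Outer level: one task per group; inner level: the ranks inside each
# group would parallelize that task's energy/gradient evaluation.
assignment = {tasks[g]: groups[g] for g in range(len(groups))}
print(assignment["displacement-2"])     # ranks [4, 5] work on task 2
```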


International Parallel and Distributed Processing Symposium | 2007

Scalable Visual Analytics of Massive Textual Datasets

Manoj Kumar Krishnan; Shawn J. Bohn; Wendy E. Cowley; Vernon L. Crow; Jarek Nieplocha

This paper describes the first scalable implementation of a text processing engine used in visual analytics tools. These tools aid information analysts in interacting with and understanding large textual information content through visual interfaces. By developing a parallel implementation of the text processing engine, we enabled visual analytics tools to exploit cluster architectures and handle massive datasets. The paper describes the key elements of our parallelization approach and demonstrates virtually linear scaling when processing multi-gigabyte datasets such as PubMed. This approach enables interactive analysis of large datasets beyond the capabilities of existing state-of-the-art visual analytics tools.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2005

Optimizing strided remote memory access operations on the Quadrics QsNetII network interconnect

Jarek Nieplocha; Vinod Tipparaju; Manoj Kumar Krishnan

This paper describes and evaluates protocols for optimizing strided noncontiguous communication on the Quadrics QsNetII high-performance network interconnect. Most previous related studies focused primarily on NIC-based or host-based protocols. This paper discusses the merits of both approaches and attempts to determine the operation types and data sizes for which each protocol should be used. We focus on the Quadrics QsNetII network, which offers powerful communication processors on the network interface card (NIC) and practical, flexible opportunities for exploiting them in the context of the user. Furthermore, the paper focuses on noncontiguous remote memory access (RMA) data transfers and performs the evaluation in the context of standalone communication and application microbenchmarks. In comparison to the vendor-provided noncontiguous interfaces, the proposed approach achieved significant performance improvements in microbenchmarks as well as application kernels: dense matrix multiplication and the Co-Array Fortran version of the NAS BT parallel benchmark. For example, for NAS BT Class B on 64 processes, a 54% improvement in overall communication time and a 42% improvement in matrix multiplication were achieved.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2003

Exploiting non-blocking Remote Memory Access communication in scientific benchmarks

Vinod Tipparaju; Manoj Kumar Krishnan; Jarek Nieplocha; Gopalakrishnan Santhanaraman; Dhabaleswar K. Panda

This paper describes a comparative performance study of the MPI and Remote Memory Access (RMA) communication models in the context of four scientific benchmarks: NAS MG, NAS CG, SUMMA matrix multiplication, and Lennard-Jones molecular dynamics, on clusters with the Myrinet network. It is shown that RMA communication delivers a consistent performance advantage over MPI; in some cases, an improvement of as much as 50% was achieved. The benefits of using non-blocking RMA to overlap computation and communication are also discussed.
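The overlap pattern that non-blocking RMA enables can be sketched as follows. A thread pool stands in for the network here; real codes would issue non-blocking get/put calls and wait on their handles, but the structure is the same: start the fetch of the next chunk, compute on the current one, then wait.

```python
# Sketch of computation/communication overlap with non-blocking gets.
# The executor simulates the network; this is the pattern, not an RMA API.
from concurrent.futures import ThreadPoolExecutor

chunks = [[1, 2], [3, 4], [5, 6]]

def remote_get(i):                 # pretend network fetch of chunk i
    return list(chunks[i])

total = 0
with ThreadPoolExecutor(max_workers=1) as pool:
    handle = pool.submit(remote_get, 0)              # prefetch first chunk
    for i in range(len(chunks)):
        data = handle.result()                       # wait for outstanding get
        if i + 1 < len(chunks):
            handle = pool.submit(remote_get, i + 1)  # overlap next fetch...
        total += sum(x * x for x in data)            # ...with this computation
print(total)  # 1 + 4 + 9 + 16 + 25 + 36 = 91
```

When the per-chunk computation takes at least as long as the fetch, the communication cost is almost entirely hidden, which is the source of the RMA advantage the benchmarks measure.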


Journal of Physics: Conference Series | 2006

Enabling new capabilities and insights from quantum chemistry by using component architectures

Curtis L. Janssen; Joseph P. Kenny; Ida M. B. Nielsen; Manoj Kumar Krishnan; Vidhya Gurumoorthi; Edward F. Valeev; Theresa L. Windus

Steady performance gains in computing power, as well as improvements in scientific computing algorithms, are making possible the study of coupled physical phenomena of great extent and complexity. The software required for such studies is also very complex and requires contributions from experts in multiple disciplines. We have investigated the use of the Common Component Architecture (CCA) as a mechanism to tackle some of the resulting software engineering challenges in quantum chemistry, focusing on three specific application areas. In our first application, we have developed interfaces permitting solvers and quantum chemistry packages to be readily exchanged. This enables our quantum chemistry packages to be used with alternative solvers developed by specialists, remedying deficiencies we discovered in the native solvers provided in each of the quantum chemistry packages. The second application involves development of a set of components designed to improve utilization of parallel machines by allowing multiple components to execute concurrently on subsets of the available processors. This was found to give substantial improvements in parallel scalability. Our final application is a set of components permitting different quantum chemistry packages to interchange intermediate data. These components enabled the investigation of promising new methods for obtaining accurate thermochemical data for reactions involving heavy elements.


Computing Frontiers | 2005

Exploiting processor groups to extend scalability of the GA shared memory programming model

Jarek Nieplocha; Manoj Kumar Krishnan; Bruce J. Palmer; Vinod Tipparaju; Yeliang Zhang

Exploiting processor groups is becoming increasingly important for programming next-generation high-end systems composed of tens or hundreds of thousands of processors. This paper discusses the requirements, functionality, and development of multilevel parallelism based on processor groups in the context of the Global Arrays (GA) shared memory programming model. The main effort involves the management of shared data rather than interprocessor communication. Experimental results for the NAS NPB Conjugate Gradient benchmark and a molecular dynamics (MD) application are presented for a Linux cluster with Myrinet and illustrate the value of the proposed approach for improving scalability. While the original GA version of the CG benchmark lagged behind MPI, the processor-group version outperforms MPI in all cases except for a few points on the smallest problem size. Similarly, processor groups were very effective in improving the scalability of the molecular dynamics application.

Collaboration


Dive into Manoj Kumar Krishnan's collaborations.

Top Co-Authors

Jarek Nieplocha
Pacific Northwest National Laboratory

Vinod Tipparaju
Oak Ridge National Laboratory

Bruce J. Palmer
Pacific Northwest National Laboratory

Abhinav Vishnu
Pacific Northwest National Laboratory

Daniel G. Chavarría-Miranda
Pacific Northwest National Laboratory

Ian Gorton
Pacific Northwest National Laboratory

Jaroslaw Nieplocha
Pacific Northwest National Laboratory