Vinod Tipparaju
Oak Ridge National Laboratory
Publications
Featured research published by Vinod Tipparaju.
Architectural Support for Programming Languages and Operating Systems | 2010
Anthony Danalis; Gabriel Marin; Collin McCurdy; Jeremy S. Meredith; Philip C. Roth; Kyle Spafford; Vinod Tipparaju; Jeffrey S. Vetter
Scalable heterogeneous computing systems, which are composed of a mix of compute devices, such as commodity multicore processors, graphics processors, reconfigurable processors, and others, are gaining attention as one approach to continuing performance improvement while managing the new challenge of energy efficiency. As these systems become more common, it is important to be able to compare and contrast architectural designs and programming systems in a fair and open forum. To this end, we have designed the Scalable HeterOgeneous Computing benchmark suite (SHOC). SHOC's initial focus is on systems containing graphics processing units (GPUs) and multicore processors, and on the new OpenCL programming standard. SHOC is a spectrum of programs that test the performance and stability of these scalable heterogeneous computing systems. At the lowest level, SHOC uses microbenchmarks to assess architectural features of the system. At higher levels, SHOC uses application kernels to determine system-wide performance, including many system features such as intranode and internode communication among devices. SHOC includes benchmark implementations in both OpenCL and CUDA in order to provide a comparison of these programming models.
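A minimal sketch of the idea behind SHOC's lowest tier: a microbenchmark that times a simple operation repeatedly and reports bandwidth. To stay self-contained it measures plain host memory copies in C; SHOC's actual microbenchmarks time OpenCL and CUDA operations, and the 64 MB buffer and 20 iterations here are arbitrary illustrative choices.

    /* Self-contained stand-in for a SHOC-style microbenchmark:
     * time an operation many times, report aggregate bandwidth. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    int main(void)
    {
        const size_t bytes = 64UL << 20;     /* 64 MB working set */
        const int iters = 20;
        char *src = malloc(bytes), *dst = malloc(bytes);
        memset(src, 1, bytes);               /* touch pages before timing */
        memset(dst, 0, bytes);

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < iters; i++)
            memcpy(dst, src, bytes);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        printf("copy bandwidth: %.2f GB/s\n",
               (double)bytes * iters / sec / 1e9);
        free(src);
        free(dst);
        return 0;
    }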
IEEE International Conference on High Performance Computing, Data, and Analytics | 2006
Jarek Nieplocha; Bruce J. Palmer; Vinod Tipparaju; Manoj Kumar Krishnan; Harold E. Trease; Edoardo Aprà
This paper describes the capabilities, evolution, performance, and applications of the Global Arrays (GA) toolkit. GA was created to provide application programmers with an interface that allows them to distribute data while maintaining the type of global index space and programming syntax available when programming on a single processor. The goal of GA is to free the programmer from the low-level management of communication and allow them to deal with their problems at the level at which they were originally formulated. At the same time, compatibility of GA with MPI enables the programmer to take advantage of existing MPI software and libraries when available and appropriate. The variety of applications that have been implemented using Global Arrays attests to the attractiveness of using higher-level abstractions to write parallel code.
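A minimal sketch of the global-index-space style GA offers, following GA's C bindings (NGA_Create, NGA_Put, NGA_Get, GA_Sync); exact initialization requirements, such as MA memory configuration, vary by release, so treat this as illustrative rather than a complete program.

    #include <mpi.h>
    #include "ga.h"
    #include "macdecls.h"   /* C_DBL; some builds also require MA_init() */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        GA_Initialize();

        int dims[2] = {1000, 1000};
        int g_a = NGA_Create(C_DBL, 2, dims, "A", NULL); /* distributed 2-D array */

        /* Any process may address any patch by global indices; GA moves
         * the data, wherever it physically lives. */
        int lo[2] = {0, 0}, hi[2] = {9, 9}, ld[1] = {10};
        double patch[100] = {0};
        NGA_Put(g_a, lo, hi, patch, ld);  /* write a 10x10 global patch */
        GA_Sync();                        /* collective sync point */
        NGA_Get(g_a, lo, hi, patch, ld);  /* read it back */

        GA_Destroy(g_a);
        GA_Terminate();
        MPI_Finalize();
        return 0;
    }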
IEEE International Conference on High Performance Computing, Data, and Analytics | 2006
Jarek Nieplocha; Vinod Tipparaju; Manoj Kumar Krishnan; Dhabaleswar K. Panda
This paper describes the Aggregate Remote Memory Copy Interface (ARMCI), a portable high-performance remote-memory-access communication interface, originally developed under the U.S. Department of Energy (DOE) Advanced Computational Testing and Simulation Toolkit project and currently used and advanced as part of the run-time layer of the DOE project Programming Models for Scalable Parallel Computing. The paper discusses the model, addresses the challenges of portable implementations, and demonstrates that ARMCI delivers high performance on a variety of platforms. Special emphasis is placed on latency-hiding mechanisms and the ability to optimize noncontiguous data transfers.
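A minimal sketch of ARMCI's one-sided model, following the published C interface (ARMCI_Malloc, ARMCI_Put, ARMCI_Fence); error handling and version-specific initialization details are omitted.

    #include <mpi.h>
    #include "armci.h"

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        ARMCI_Init();

        int me, nproc;
        MPI_Comm_rank(MPI_COMM_WORLD, &me);
        MPI_Comm_size(MPI_COMM_WORLD, &nproc);

        /* Collective allocation of remotely accessible memory; ptrs[i]
         * is the base address of process i's segment. */
        void *ptrs[nproc];
        ARMCI_Malloc(ptrs, 1024 * sizeof(double));

        double local[1024];
        for (int i = 0; i < 1024; i++) local[i] = me;

        /* One-sided: deposit data in the right neighbor's memory with no
         * receive call on the remote side. */
        int peer = (me + 1) % nproc;
        ARMCI_Put(local, ptrs[peer], 1024 * sizeof(double), peer);
        ARMCI_Fence(peer);               /* wait for remote completion */

        ARMCI_Barrier();
        ARMCI_Free(ptrs[me]);
        ARMCI_Finalize();
        MPI_Finalize();
        return 0;
    }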
International Parallel and Distributed Processing Symposium | 2003
Vinod Tipparaju; Jarek Nieplocha; Dhabaleswar K. Panda
This paper describes a novel methodology for implementing a common set of collective communication operations on clusters based on symmetric multiprocessor (SMP) nodes. Called Shared-Remote-Memory collectives, or SRM, our approach replaces the point-to-point message passing, traditionally used in implementation of collective message-passing operations, with a combination of shared and remote memory access (RMA) protocols that are used to implement semantics of the collective operations directly. Appropriate embedding of the communication graphs in a cluster maximizes the use of shared memory and reduces network communication. Substantial performance improvements are achieved over the highly optimized commercial IBM implementation and the open-source MPICH implementation of MPI across a wide range of message sizes on the IBM SP. For example, depending on the message size and number of processors, the SRM implementation of broadcast, reduce, and barrier outperforms IBM MPI_Bcast by 27-84%, MPI_Reduce by 24-79%, and MPI_Barrier by 73% on 256 processors, respectively.
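The paper predates MPI-3, but its two-level structure can be sketched with today's MPI shared-memory windows: leaders exchange data across nodes, and the other ranks on each SMP node pick the payload up through shared memory instead of point-to-point messages. This is a modern analogue of the idea, not the paper's SRM implementation; the fences stand in for the finer-grained synchronization a production version would use, and the root is assumed to be global rank 0 (which the split below makes a node leader).

    #include <mpi.h>
    #include <string.h>

    /* Broadcast 'count' bytes from global rank 0 to every rank. */
    void two_level_bcast(void *buf, int count, MPI_Comm comm)
    {
        int rank, noderank;
        MPI_Comm node, leaders;

        /* Group ranks sharing an SMP node; the lowest global rank on
         * each node becomes its leader (noderank 0). */
        MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &node);
        MPI_Comm_rank(node, &noderank);
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_split(comm, noderank == 0 ? 0 : MPI_UNDEFINED,
                       rank, &leaders);

        /* Stage 1: network traffic among node leaders only. */
        if (noderank == 0)
            MPI_Bcast(buf, count, MPI_BYTE, 0, leaders);

        /* Stage 2: the leader deposits the payload in a shared-memory
         * window; the node's other ranks copy it out with plain loads. */
        MPI_Win win;
        char *base;
        MPI_Win_allocate_shared(noderank == 0 ? count : 0, 1,
                                MPI_INFO_NULL, node, &base, &win);
        MPI_Win_fence(0, win);
        if (noderank == 0)
            memcpy(base, buf, count);
        MPI_Win_fence(0, win);
        if (noderank != 0) {
            MPI_Aint sz; int unit; char *leader_mem;
            MPI_Win_shared_query(win, 0, &sz, &unit, (void *)&leader_mem);
            memcpy(buf, leader_mem, count);
        }
        MPI_Win_free(&win);

        if (leaders != MPI_COMM_NULL) MPI_Comm_free(&leaders);
        MPI_Comm_free(&node);
    }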
International Parallel and Distributed Processing Symposium | 2004
Vinod Tipparaju; G. Santhanaraman; Jaroslaw Nieplocha; Dhabaleswar K. Panda
The remote memory access (RMA) model is increasingly important due to its excellent potential for overlapping communication and computation and for achieving high performance on modern networks with RDMA hardware such as InfiniBand. RMA plays a vital role in supporting the emerging global address space programming models. We describe how RMA can be implemented efficiently over InfiniBand. The capabilities not offered directly by the InfiniBand verbs layer can be implemented efficiently using a novel host-assisted approach while achieving zero-copy communication and supporting an excellent overlap of computation with communication. For contiguous data, we achieve a small-message latency of 6 μs and a peak bandwidth of 830 MB/s for put, and a small-message latency of 12 μs and a peak bandwidth of 765 MB/s for get. These numbers are almost as good as the performance of the native VAPI layer. For noncontiguous data, the host-assisted approach can deliver bandwidth close to that for contiguous data. We also demonstrate the superior tolerance of host-assisted data-transfer operations to CPU-intensive tasks, owing to the minimal host involvement in our approach compared to the traditional host-based approach. Our implementation also supports a very high degree of overlap of computation and communication: for large messages, 99% overlap was achieved for contiguous data and up to 95% for noncontiguous data. The NAS MG and matrix multiplication benchmarks were used to validate the effectiveness of our approach and demonstrated excellent overall performance.
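A short sketch of the overlap pattern the paper measures, using ARMCI's nonblocking interface (ARMCI_NbPut, ARMCI_Wait); the arithmetic between initiation and completion is a stand-in for real computation.

    #include "armci.h"

    /* Issue a put, compute while the network makes progress, then wait. */
    void overlapped_put(double *src, double *remote, int n, int peer,
                        double *work, int m)
    {
        armci_hdl_t h;
        ARMCI_INIT_HANDLE(&h);

        /* Start the transfer; RDMA (or the host-assisted path for
         * noncontiguous data) progresses it off the critical path. */
        ARMCI_NbPut(src, remote, n * (int)sizeof(double), peer, &h);

        for (int i = 0; i < m; i++)      /* overlapped computation */
            work[i] = work[i] * 2.0 + 1.0;

        ARMCI_Wait(&h);                  /* local completion: source
                                            buffer is reusable */
    }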
IEEE International Conference on High Performance Computing, Data, and Analytics | 2009
Edoardo Aprà; Alistair P. Rendell; Robert J. Harrison; Vinod Tipparaju; Wibe A. de Jong; Sotiris S. Xantheas
Water is ubiquitous on our planet and plays an essential role in several key chemical and biological processes. Accurate models for water are crucial to understanding, controlling, and predicting the physical and chemical properties of complex aqueous systems. Over the last few years we have been developing a molecular-level approach to a macroscopic model for water that is based on an explicit description of the underlying intermolecular interactions between molecules in water clusters. In the absence of detailed experimental data for small water clusters, highly accurate theoretical results are required to validate and parameterize model potentials. As an example of the benchmarks needed for the development of accurate models of the interaction between water molecules, for the most stable structure of (H2O)20 we ran a coupled-cluster calculation on ORNL's Jaguar petaflop computer that used over 100 TB of memory and sustained 487 TFLOP/s (double precision) on 96,000 processors for 2 hours. By this summer we will have studied multiple structures of both (H2O)20 and (H2O)24 and completed basis-set and other convergence studies, and we anticipate the sustained performance rising close to 1 PFLOP/s.
International Parallel and Distributed Processing Symposium | 2012
James Dinan; Pavan Balaji; Jeffrey R. Hammond; Sriram Krishnamoorthy; Vinod Tipparaju
The industry-standard Message Passing Interface (MPI) provides one-sided communication functionality and is available on virtually every parallel computing system. However, it is believed that MPI's one-sided model is not rich enough to support higher-level global address space parallel programming models. We present the first successful application of MPI one-sided communication as a runtime system for a PGAS model, Global Arrays (GA). This work has an immediate impact on users of GA applications, such as NWChem, who often must wait several months to a year or more before GA becomes available on a new architecture. We explore the challenges present in applying MPI-2 to PGAS models and motivate new features in the upcoming MPI-3 standard. The performance of our system is evaluated on several popular high-performance computing architectures through communication benchmarking and application benchmarking using the NWChem computational chemistry suite.
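The central idea reduces to expressing a GA-style one-sided put as a passive-target MPI-2 RMA epoch around MPI_Put. This is a minimal sketch, not the runtime's actual code; the function name is illustrative and window setup is shown only in outline.

    #include <mpi.h>

    /* GA-style one-sided put over MPI-2 RMA. 'win' must have been
     * created collectively beforehand, e.g.
     *   MPI_Win_create(local_buf, bytes, sizeof(double),
     *                  MPI_INFO_NULL, MPI_COMM_WORLD, &win);
     * Passive target: process 'target' makes no matching call. */
    void ga_style_put(double *src, int count, int target,
                      MPI_Aint disp, MPI_Win win)
    {
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target, 0, win);
        MPI_Put(src, count, MPI_DOUBLE,
                target, disp, count, MPI_DOUBLE, win);
        MPI_Win_unlock(target, win);  /* returns once the put has
                                         completed at the target */
    }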
International Parallel and Distributed Processing Symposium | 2002
Jaroslaw Nieplocha; Vinod Tipparaju; Amina Saify; Dhabaleswar K. Panda
The paper describes a software architecture for supporting remote memory operations on clusters equipped with high-performance networks such as Myrinet and Giganet/Emulex cLAN. It presents protocols and strategies that bridge the gap between user-level API requirements and low-level network-specific interfaces such as GM and VIA. In particular, the issues of memory registration, management of network resources, and memory consumption on the host are discussed and solved to achieve an efficient implementation.
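One of the issues the paper names, memory registration, is commonly handled with a pin-down cache: registering memory with the NIC is expensive under GM or VIA, so regions are registered lazily and reused across transfers. The sketch below is a hedged illustration of that pattern, not the paper's code; nic_register() is a hypothetical placeholder for the network-specific pinning call, and a real cache would also bound its size and deregister cold entries.

    #include <stddef.h>
    #include <stdlib.h>

    /* Hypothetical placeholder for the GM/VIA pin-down call. */
    extern void nic_register(void *addr, size_t len);

    typedef struct region {
        char *addr;
        size_t len;
        struct region *next;
    } region_t;

    static region_t *cache = NULL;

    /* Ensure [addr, addr+len) is registered, pinning it on first use. */
    void *get_registered(void *addr, size_t len)
    {
        char *p = addr;
        for (region_t *r = cache; r; r = r->next)
            if (p >= r->addr && p + len <= r->addr + r->len)
                return addr;             /* hit: already pinned */

        nic_register(addr, len);         /* miss: pin with the NIC */
        region_t *r = malloc(sizeof *r);
        r->addr = p;
        r->len  = len;
        r->next = cache;
        cache   = r;
        return addr;
    }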
2006 IEEE Power Engineering Society General Meeting | 2006
Jaroslaw Nieplocha; Andres Marquez; Vinod Tipparaju; Daniel G. Chavarría-Miranda; Ross T. Guttromson; H. Huang
We are investigating the effectiveness of parallel weighted-least-squares (WLS) state estimation solvers on shared-memory parallel computers. Shared-memory parallel architectures are rapidly becoming ubiquitous due to the advent of multicore processors. In the current evaluation, we are using an LU-based solver as well as a conjugate gradient (CG)-based solver for a 1177-bus system. In lieu of a very wide multicore system, we evaluate the effectiveness of the solvers on an SGI Altix system on up to 32 processors. On this platform, as expected, the shared-memory implementation (pthreads) of the LU solver was found to be more efficient than the MPI version. Our implementation of the CG solver scales and performs significantly better than the state-of-the-art implementation of the LU solver: with CG we can solve the problem 4.75 times faster than with LU. These findings indicate that CG algorithms should be quite effective on multicore processors.
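For reference, the CG iteration the paper compares against LU has a very small core. The sketch below is a plain serial, dense version for a symmetric positive definite system; the paper's solver is parallel and operates on sparse matrices.

    #include <math.h>

    /* Serial, dense CG for an SPD system A x = b (A is n x n, row-major).
     * r, p, Ap are caller-provided scratch arrays of length n.
     * Returns the number of iterations performed. */
    int cg_solve(int n, const double *A, const double *b, double *x,
                 double tol, int maxit, double *r, double *p, double *Ap)
    {
        double rr = 0.0;
        for (int i = 0; i < n; i++) {
            x[i] = 0.0;                  /* start from the zero vector */
            r[i] = p[i] = b[i];          /* r0 = b - A*0 = b */
            rr += r[i] * r[i];
        }

        int k;
        for (k = 0; k < maxit && sqrt(rr) > tol; k++) {
            for (int i = 0; i < n; i++) {            /* Ap = A * p */
                double s = 0.0;
                for (int j = 0; j < n; j++)
                    s += A[i * (size_t)n + j] * p[j];
                Ap[i] = s;
            }
            double pAp = 0.0;
            for (int i = 0; i < n; i++) pAp += p[i] * Ap[i];
            double alpha = rr / pAp;                 /* step length */

            double rr_new = 0.0;
            for (int i = 0; i < n; i++) {
                x[i] += alpha * p[i];
                r[i] -= alpha * Ap[i];
                rr_new += r[i] * r[i];
            }
            double beta = rr_new / rr;               /* direction update */
            for (int i = 0; i < n; i++) p[i] = r[i] + beta * p[i];
            rr = rr_new;
        }
        return k;
    }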
Journal of Chemical Theory and Computation | 2011
Karol Kowalski; Ryan M. Olson; Sriram Krishnamoorthy; Vinod Tipparaju; Edoardo Aprà
The unusual photophysical properties of π-conjugated chromophores make them potential building blocks of various molecular devices. In particular, significant narrowing of the HOMO-LUMO gap can be observed as an effect of functionalizing chromophores with polycyclic aromatic hydrocarbons (PAHs). In this paper we present equation-of-motion coupled-cluster (EOMCC) calculations of vertical excitation energies for several functionalized forms of porphyrins. The results for free-base porphyrin (FBP) clearly demonstrate significant differences between functionalization of FBP with one-dimensional (anthracene) and two-dimensional (coronene) structures. We also compare the EOMCC results with the experimentally available results for anthracene-fused zinc porphyrin. The impact of various types of correlation effects is illustrated on several benchmark models where comparison with experiment is possible. In particular, we demonstrate that for all excited states considered in this paper, all of them dominated by single excitations, the inclusion of triply excited configurations is crucial for attaining qualitative agreement with experiment. We also demonstrate the parallel performance of the most computationally intensive part of the completely renormalized EOMCCSD(T) approach (CR-EOMCCSD(T)) across 120,000 cores.