Vinod Tipparaju
Oak Ridge National Laboratory
Publications
Featured research published by Vinod Tipparaju.
Architectural Support for Programming Languages and Operating Systems | 2010
Anthony Danalis; Gabriel Marin; Collin McCurdy; Jeremy S. Meredith; Philip C. Roth; Kyle Spafford; Vinod Tipparaju; Jeffrey S. Vetter
Scalable heterogeneous computing systems, which are composed of a mix of compute devices, such as commodity multicore processors, graphics processors, reconfigurable processors, and others, are gaining attention as one approach to continuing performance improvement while managing the new challenge of energy efficiency. As these systems become more common, it is important to be able to compare and contrast architectural designs and programming systems in a fair and open forum. To this end, we have designed the Scalable HeterOgeneous Computing benchmark suite (SHOC). SHOC's initial focus is on systems containing graphics processing units (GPUs) and multicore processors, and on the new OpenCL programming standard. SHOC is a spectrum of programs that test the performance and stability of these scalable heterogeneous computing systems. At the lowest level, SHOC uses microbenchmarks to assess architectural features of the system. At higher levels, SHOC uses application kernels to determine system-wide performance, including many system features such as intranode and internode communication among devices. SHOC includes benchmark implementations in both OpenCL and CUDA in order to provide a comparison of these programming models.
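A minimal sketch of the idea behind SHOC's lowest tier: a microbenchmark that times a simple operation repeatedly and reports bandwidth. To stay self-contained it measures plain host memory copies in C; SHOC's actual microbenchmarks time OpenCL and CUDA operations, and the 64 MB buffer and 20 iterations here are arbitrary illustrative choices.

    /* Self-contained stand-in for a SHOC-style microbenchmark:
     * time an operation many times, report aggregate bandwidth. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    int main(void)
    {
        const size_t bytes = 64UL << 20;     /* 64 MB working set */
        const int iters = 20;
        char *src = malloc(bytes), *dst = malloc(bytes);
        memset(src, 1, bytes);               /* touch pages before timing */
        memset(dst, 0, bytes);

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < iters; i++)
            memcpy(dst, src, bytes);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        printf("copy bandwidth: %.2f GB/s\n",
               (double)bytes * iters / sec / 1e9);
        free(src);
        free(dst);
        return 0;
    }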
IEEE International Conference on High Performance Computing, Data, and Analytics | 2006
Jarek Nieplocha; Bruce J. Palmer; Vinod Tipparaju; Manoj Kumar Krishnan; Harold E. Trease; Edoardo Aprà
This paper describes the capabilities, evolution, performance, and applications of the Global Arrays (GA) toolkit. GA was created to provide application programmers with an interface that allows them to distribute data while maintaining the type of global index space and programming syntax available when programming on a single processor. The goal of GA is to free the programmer from the low-level management of communication and allow them to deal with their problems at the level at which they were originally formulated. At the same time, compatibility of GA with MPI enables the programmer to take advantage of existing MPI software and libraries when available and appropriate. The variety of applications that have been implemented using Global Arrays attests to the attractiveness of using higher-level abstractions to write parallel code.
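A minimal sketch of the global-index-space style GA offers, following GA's C bindings (NGA_Create, NGA_Put, NGA_Get, GA_Sync); exact initialization requirements, such as MA memory configuration, vary by release, so treat this as illustrative rather than a complete program.

    #include <mpi.h>
    #include "ga.h"
    #include "macdecls.h"   /* C_DBL; some builds also require MA_init() */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        GA_Initialize();

        int dims[2] = {1000, 1000};
        int g_a = NGA_Create(C_DBL, 2, dims, "A", NULL); /* distributed 2-D array */

        /* Any process may address any patch by global indices; GA moves
         * the data, wherever it physically lives. */
        int lo[2] = {0, 0}, hi[2] = {9, 9}, ld[1] = {10};
        double patch[100] = {0};
        NGA_Put(g_a, lo, hi, patch, ld);  /* write a 10x10 global patch */
        GA_Sync();                        /* collective sync point */
        NGA_Get(g_a, lo, hi, patch, ld);  /* read it back */

        GA_Destroy(g_a);
        GA_Terminate();
        MPI_Finalize();
        return 0;
    }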
IEEE International Conference on High Performance Computing, Data, and Analytics | 2006
Jarek Nieplocha; Vinod Tipparaju; Manoj Kumar Krishnan; Dhabaleswar K. Panda
This paper describes the Aggregate Remote Memory Copy Interface (ARMCI), a portable high-performance remote-memory-access communication interface, originally developed under the U.S. Department of Energy (DOE) Advanced Computational Testing and Simulation Toolkit project and currently used and advanced as part of the run-time layer of the DOE project Programming Models for Scalable Parallel Computing. The paper discusses the model, addresses the challenges of portable implementations, and demonstrates that ARMCI delivers high performance on a variety of platforms. Special emphasis is placed on latency-hiding mechanisms and the ability to optimize noncontiguous data transfers.
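A minimal sketch of ARMCI's one-sided model, following the published C interface (ARMCI_Malloc, ARMCI_Put, ARMCI_Fence); error handling and version-specific initialization details are omitted.

    #include <mpi.h>
    #include "armci.h"

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        ARMCI_Init();

        int me, nproc;
        MPI_Comm_rank(MPI_COMM_WORLD, &me);
        MPI_Comm_size(MPI_COMM_WORLD, &nproc);

        /* Collective allocation of remotely accessible memory; ptrs[i]
         * is the base address of process i's segment. */
        void *ptrs[nproc];
        ARMCI_Malloc(ptrs, 1024 * sizeof(double));

        double local[1024];
        for (int i = 0; i < 1024; i++) local[i] = me;

        /* One-sided: deposit data in the right neighbor's memory with no
         * receive call on the remote side. */
        int peer = (me + 1) % nproc;
        ARMCI_Put(local, ptrs[peer], 1024 * sizeof(double), peer);
        ARMCI_Fence(peer);               /* wait for remote completion */

        ARMCI_Barrier();
        ARMCI_Free(ptrs[me]);
        ARMCI_Finalize();
        MPI_Finalize();
        return 0;
    }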
International Parallel and Distributed Processing Symposium | 2003
Vinod Tipparaju; Jarek Nieplocha; Dhabaleswar K. Panda
This paper describes a novel methodology for implementing a common set of collective communication operations on clusters based on symmetric multiprocessor (SMP) nodes. Called Shared-Remote-Memory collectives, or SRM, our approach replaces the point-to-point message passing, traditionally used in implementation of collective message-passing operations, with a combination of shared and remote memory access (RMA) protocols that are used to implement semantics of the collective operations directly. Appropriate embedding of the communication graphs in a cluster maximizes the use of shared memory and reduces network communication. Substantial performance improvements are achieved over the highly optimized commercial IBM implementation and the open-source MPICH implementation of MPI across a wide range of message sizes on the IBM SP. For example, depending on the message size and number of processors, the SRM implementation of broadcast, reduce, and barrier outperforms IBM MPI_Bcast by 27-84%, MPI_Reduce by 24-79%, and MPI_Barrier by 73% on 256 processors, respectively.
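The paper predates MPI-3, but its two-level structure can be sketched with today's MPI shared-memory windows: leaders exchange data across nodes, and the other ranks on each SMP node pick the payload up through shared memory instead of point-to-point messages. This is a modern analogue of the idea, not the paper's SRM implementation; the fences stand in for the finer-grained synchronization a production version would use, and the root is assumed to be global rank 0 (which the split below makes a node leader).

    #include <mpi.h>
    #include <string.h>

    /* Broadcast 'count' bytes from global rank 0 to every rank. */
    void two_level_bcast(void *buf, int count, MPI_Comm comm)
    {
        int rank, noderank;
        MPI_Comm node, leaders;

        /* Group ranks sharing an SMP node; the lowest global rank on
         * each node becomes its leader (noderank 0). */
        MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &node);
        MPI_Comm_rank(node, &noderank);
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_split(comm, noderank == 0 ? 0 : MPI_UNDEFINED,
                       rank, &leaders);

        /* Stage 1: network traffic among node leaders only. */
        if (noderank == 0)
            MPI_Bcast(buf, count, MPI_BYTE, 0, leaders);

        /* Stage 2: the leader deposits the payload in a shared-memory
         * window; the node's other ranks copy it out with plain loads. */
        MPI_Win win;
        char *base;
        MPI_Win_allocate_shared(noderank == 0 ? count : 0, 1,
                                MPI_INFO_NULL, node, &base, &win);
        MPI_Win_fence(0, win);
        if (noderank == 0)
            memcpy(base, buf, count);
        MPI_Win_fence(0, win);
        if (noderank != 0) {
            MPI_Aint sz; int unit; char *leader_mem;
            MPI_Win_shared_query(win, 0, &sz, &unit, (void *)&leader_mem);
            memcpy(buf, leader_mem, count);
        }
        MPI_Win_free(&win);

        if (leaders != MPI_COMM_NULL) MPI_Comm_free(&leaders);
        MPI_Comm_free(&node);
    }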
International Parallel and Distributed Processing Symposium | 2004
Vinod Tipparaju; G. Santhanaraman; Jaroslaw Nieplocha; Dhabaleswar K. Panda
The remote memory access (RMA) model is increasingly important due to its excellent potential for overlapping communication and computation and for achieving high performance on modern networks with RDMA hardware such as InfiniBand. RMA plays a vital role in supporting the emerging global address space programming models. We describe how RMA can be implemented efficiently over InfiniBand. The capabilities not offered directly by the InfiniBand verbs layer can be implemented efficiently using a novel host-assisted approach while achieving zero-copy communication and supporting an excellent overlap of computation with communication. For contiguous data, we achieve a small-message latency of 6 μs and a peak bandwidth of 830 MB/s for put, and a small-message latency of 12 μs and a peak bandwidth of 765 MB/s for get. These numbers are almost as good as the performance of the native VAPI layer. For noncontiguous data, the host-assisted approach can deliver bandwidth close to that for contiguous data. We also demonstrate the superior tolerance of host-assisted data-transfer operations to CPU-intensive tasks, owing to the minimal host involvement in our approach compared to the traditional host-based approach. Our implementation also supports a very high degree of overlap of computation and communication: for large messages, 99% overlap was achieved for contiguous data and up to 95% for noncontiguous data. The NAS MG and matrix multiplication benchmarks were used to validate the effectiveness of our approach and demonstrated excellent overall performance.
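A short sketch of the overlap pattern the paper measures, using ARMCI's nonblocking interface (ARMCI_NbPut, ARMCI_Wait); the arithmetic between initiation and completion is a stand-in for real computation.

    #include "armci.h"

    /* Issue a put, compute while the network makes progress, then wait. */
    void overlapped_put(double *src, double *remote, int n, int peer,
                        double *work, int m)
    {
        armci_hdl_t h;
        ARMCI_INIT_HANDLE(&h);

        /* Start the transfer; RDMA (or the host-assisted path for
         * noncontiguous data) progresses it off the critical path. */
        ARMCI_NbPut(src, remote, n * (int)sizeof(double), peer, &h);

        for (int i = 0; i < m; i++)      /* overlapped computation */
            work[i] = work[i] * 2.0 + 1.0;

        ARMCI_Wait(&h);                  /* local completion: source
                                            buffer is reusable */
    }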
IEEE International Conference on High Performance Computing, Data, and Analytics | 2009
Edoardo Aprà; Alistair P. Rendell; Robert J. Harrison; Vinod Tipparaju; Wibe A. de Jong; Sotiris S. Xantheas
Water is ubiquitous on our planet and plays an essential role in several key chemical and biological processes. Accurate models for water are crucial to understanding, controlling, and predicting the physical and chemical properties of complex aqueous systems. Over the last few years we have been developing a molecular-level approach to a macroscopic model for water that is based on an explicit description of the underlying intermolecular interactions between molecules in water clusters. In the absence of detailed experimental data for small water clusters, highly accurate theoretical results are required to validate and parameterize model potentials. As an example of the benchmarks needed for the development of accurate models of the interaction between water molecules, for the most stable structure of (H2O)20 we ran a coupled-cluster calculation on ORNL's Jaguar petaflop computer that used over 100 TB of memory and sustained 487 TFLOP/s (double precision) on 96,000 processors for 2 hours. By this summer we will have studied multiple structures of both (H2O)20 and (H2O)24 and completed basis-set and other convergence studies, and we anticipate the sustained performance rising close to 1 PFLOP/s.
International Parallel and Distributed Processing Symposium | 2012
James Dinan; Pavan Balaji; Jeffrey R. Hammond; Sriram Krishnamoorthy; Vinod Tipparaju
The industry-standard Message Passing Interface (MPI) provides one-sided communication functionality and is available on virtually every parallel computing system. However, it is believed that MPI's one-sided model is not rich enough to support higher-level global address space parallel programming models. We present the first successful application of MPI one-sided communication as a runtime system for a PGAS model, Global Arrays (GA). This work has an immediate impact on users of GA applications, such as NWChem, who often must wait several months to a year or more before GA becomes available on a new architecture. We explore the challenges present in applying MPI-2 to PGAS models and motivate new features in the upcoming MPI-3 standard. The performance of our system is evaluated on several popular high-performance computing architectures through communication benchmarking and application benchmarking using the NWChem computational chemistry suite.
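The central idea reduces to expressing a GA-style one-sided put as a passive-target MPI-2 RMA epoch around MPI_Put. This is a minimal sketch, not the runtime's actual code; the function name is illustrative and window setup is shown only in outline.

    #include <mpi.h>

    /* GA-style one-sided put over MPI-2 RMA. 'win' must have been
     * created collectively beforehand, e.g.
     *   MPI_Win_create(local_buf, bytes, sizeof(double),
     *                  MPI_INFO_NULL, MPI_COMM_WORLD, &win);
     * Passive target: process 'target' makes no matching call. */
    void ga_style_put(double *src, int count, int target,
                      MPI_Aint disp, MPI_Win win)
    {
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target, 0, win);
        MPI_Put(src, count, MPI_DOUBLE,
                target, disp, count, MPI_DOUBLE, win);
        MPI_Win_unlock(target, win);  /* returns once the put has
                                         completed at the target */
    }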
International Parallel and Distributed Processing Symposium | 2002
Jaroslaw Nieplocha; Vinod Tipparaju; Amina Saify; Dhabaleswar K. Panda
The paper describes a software architecture for supporting remote memory operations on clusters equipped with high-performance networks such as Myrinet and Giganet/Emulex cLAN. It presents protocols and strategies that bridge the gap between user-level API requirements and low-level network-specific interfaces such as GM and VIA. In particular, the issues of memory registration, management of network resources, and memory consumption on the host are discussed and solved to achieve an efficient implementation.
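One of the issues the paper names, memory registration, is commonly handled with a pin-down cache: registering memory with the NIC is expensive under GM or VIA, so regions are registered lazily and reused across transfers. The sketch below is a hedged illustration of that pattern, not the paper's code; nic_register() is a hypothetical placeholder for the network-specific pinning call, and a real cache would also bound its size and deregister cold entries.

    #include <stddef.h>
    #include <stdlib.h>

    /* Hypothetical placeholder for the GM/VIA pin-down call. */
    extern void nic_register(void *addr, size_t len);

    typedef struct region {
        char *addr;
        size_t len;
        struct region *next;
    } region_t;

    static region_t *cache = NULL;

    /* Ensure [addr, addr+len) is registered, pinning it on first use. */
    void *get_registered(void *addr, size_t len)
    {
        char *p = addr;
        for (region_t *r = cache; r; r = r->next)
            if (p >= r->addr && p + len <= r->addr + r->len)
                return addr;             /* hit: already pinned */

        nic_register(addr, len);         /* miss: pin with the NIC */
        region_t *r = malloc(sizeof *r);
        r->addr = p;
        r->len  = len;
        r->next = cache;
        cache   = r;
        return addr;
    }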
2006 IEEE Power Engineering Society General Meeting | 2006
Jaroslaw Nieplocha; Andres Marquez; Vinod Tipparaju; Daniel G. Chavarría-Miranda; Ross T. Guttromson; H. Huang
We are investigating the effectiveness of parallel weighted-least-squares (WLS) state estimation solvers on shared-memory parallel computers. Shared-memory parallel architectures are rapidly becoming ubiquitous due to the advent of multicore processors. In the current evaluation, we are using an LU-based solver as well as a conjugate gradient (CG)-based solver for a 1177-bus system. In lieu of a very wide multicore system, we evaluate the effectiveness of the solvers on an SGI Altix system on up to 32 processors. On this platform, as expected, the shared-memory implementation (pthreads) of the LU solver was found to be more efficient than the MPI version. Our implementation of the CG solver scales and performs significantly better than the state-of-the-art implementation of the LU solver: with CG we can solve the problem 4.75 times faster than with LU. These findings indicate that CG algorithms should be quite effective on multicore processors.
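For reference, the CG iteration the paper compares against LU has a very small core. The sketch below is a plain serial, dense version for a symmetric positive definite system; the paper's solver is parallel and operates on sparse matrices.

    #include <math.h>

    /* Serial, dense CG for an SPD system A x = b (A is n x n, row-major).
     * r, p, Ap are caller-provided scratch arrays of length n.
     * Returns the number of iterations performed. */
    int cg_solve(int n, const double *A, const double *b, double *x,
                 double tol, int maxit, double *r, double *p, double *Ap)
    {
        double rr = 0.0;
        for (int i = 0; i < n; i++) {
            x[i] = 0.0;                  /* start from the zero vector */
            r[i] = p[i] = b[i];          /* r0 = b - A*0 = b */
            rr += r[i] * r[i];
        }

        int k;
        for (k = 0; k < maxit && sqrt(rr) > tol; k++) {
            for (int i = 0; i < n; i++) {            /* Ap = A * p */
                double s = 0.0;
                for (int j = 0; j < n; j++)
                    s += A[i * (size_t)n + j] * p[j];
                Ap[i] = s;
            }
            double pAp = 0.0;
            for (int i = 0; i < n; i++) pAp += p[i] * Ap[i];
            double alpha = rr / pAp;                 /* step length */

            double rr_new = 0.0;
            for (int i = 0; i < n; i++) {
                x[i] += alpha * p[i];
                r[i] -= alpha * Ap[i];
                rr_new += r[i] * r[i];
            }
            double beta = rr_new / rr;               /* direction update */
            for (int i = 0; i < n; i++) p[i] = r[i] + beta * p[i];
            rr = rr_new;
        }
        return k;
    }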
Journal of Chemical Theory and Computation | 2011
Karol Kowalski; Ryan M. Olson; Sriram Krishnamoorthy; Vinod Tipparaju; Edoardo Aprà
The unusual photophysical properties of π-conjugated chromophores make them potential building blocks of various molecular devices. In particular, significant narrowing of the HOMO-LUMO gap can be observed as an effect of functionalizing chromophores with polycyclic aromatic hydrocarbons (PAHs). In this paper we present equation-of-motion coupled-cluster (EOMCC) calculations of vertical excitation energies for several functionalized forms of porphyrins. The results for free-base porphyrin (FBP) clearly demonstrate significant differences between functionalization of FBP with one-dimensional (anthracene) and two-dimensional (coronene) structures. We also compare the EOMCC results with the experimentally available results for anthracene-fused zinc porphyrin. The impact of various types of correlation effects is illustrated on several benchmark models where comparison with experiment is possible. In particular, we demonstrate that for all excited states considered in this paper, all of them dominated by single excitations, the inclusion of triply excited configurations is crucial for attaining qualitative agreement with experiment. We also demonstrate the parallel performance of the most computationally intensive part of the completely renormalized EOMCCSD(T) approach (CR-EOMCCSD(T)) across 120,000 cores.