
Publications


Featured research published by Rajesh Nishtala.


International Parallel and Distributed Processing Symposium | 2006

Optimizing bandwidth limited problems using one-sided communication and overlap

Christian Bell; Dan Bonachea; Rajesh Nishtala; Katherine A. Yelick

This paper demonstrates that the one-sided communication model used in languages like UPC can provide a significant performance advantage for bandwidth-limited applications. This is shown through communication microbenchmarks and a case study of UPC and MPI implementations of the NAS FT benchmark. Our optimizations rely on aggressively overlapping communication with computation, alleviating bottlenecks that typically occur when communication is isolated in a single phase. The new algorithms send more and smaller messages, yet the one-sided versions achieve a speedup of more than 1.9× over the base Fortran/MPI. Our one-sided versions show an average 15% improvement over the two-sided versions, due to the lower software overhead of one-sided communication, whose semantics are fundamentally lighter-weight than message passing. Our UPC results use Berkeley UPC with GASNet and demonstrate the scalability of that system, with performance approaching 0.5 TFlop/s on the FT benchmark with 512 processors.
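
The key pattern here is issuing one-sided puts and doing useful work before synchronizing. The sketch below illustrates that structure using MPI's RMA interface purely as a stand-in for the UPC/GASNet layer the paper actually uses; the function names, the chunking scheme, and the assumption that the window was created with a displacement unit of sizeof(double) are all hypothetical.

    /* Illustrative overlap of one-sided puts with local computation.
     * MPI RMA is used only as a stand-in for UPC/GASNet one-sided
     * communication; all names and sizes here are hypothetical. */
    #include <mpi.h>
    #include <stddef.h>

    static void compute_chunk(double *buf, int n)   /* placeholder local work */
    {
        for (int j = 0; j < n; j++)
            buf[j] *= 2.0;
    }

    /* Assumes 'win' was created over the target's buffer with
     * disp_unit == sizeof(double). */
    void put_with_overlap(const double *src, double *work,
                          int nchunks, int chunk, int target, MPI_Win win)
    {
        MPI_Win_lock_all(0, win);                 /* open a passive-target epoch */
        for (int i = 0; i < nchunks; i++) {
            /* Issue a one-sided put of chunk i; no matching receive is posted. */
            MPI_Put(src + (size_t)i * chunk, chunk, MPI_DOUBLE,
                    target, (MPI_Aint)i * chunk, chunk, MPI_DOUBLE, win);
            /* Overlap: work on a separate buffer while the network drains the put. */
            compute_chunk(work + (size_t)i * chunk, chunk);
        }
        MPI_Win_flush(target, win);               /* complete all outstanding puts */
        MPI_Win_unlock_all(win);
    }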


Conference on High Performance Computing (Supercomputing) | 2002

Performance Optimizations and Bounds for Sparse Matrix-Vector Multiply

Richard W. Vuduc; James Demmel; Katherine A. Yelick; Shoaib Kamil; Rajesh Nishtala; Benjamin C. Lee

We consider performance tuning, by code and data structure reorganization, of sparse matrix-vector multiply (SpM×V), one of the most important computational kernels in scientific applications. This paper addresses the fundamental questions of what limits exist on such performance tuning, and how closely tuned code approaches these limits. Specifically, we develop upper and lower bounds on the performance (Mflop/s) of SpM×V when tuned using our previously proposed register blocking optimization. These bounds are based on the non-zero pattern in the matrix and the cost of basic memory operations, such as cache hits and misses. We evaluate our tuned implementations with respect to these bounds using hardware counter data on four different platforms and a test set of 44 sparse matrices. We find that we can often get within 20% of the upper bound, particularly on the class of matrices from finite element modeling (FEM) problems; on non-FEM matrices, performance improvements of 2× are still possible. Lastly, we present a new heuristic that selects optimal or near-optimal register block sizes (the key tuning parameters) more accurately than our previous heuristic. Using the new heuristic, we show improvements in SpM×V performance (Mflop/s) by as much as 2.5× over an untuned implementation. Collectively, our results suggest that future performance improvements, beyond those that we have already demonstrated for SpM×V, will come from two sources: (1) consideration of higher-level matrix structures (e.g. exploiting symmetry, matrix reordering, multiple register block sizes), and (2) optimizing kernels with more opportunity for data reuse (e.g. sparse matrix-multiple vector multiply, multiplication of AᵀA by a vector).
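
To make the register blocking optimization concrete, here is a minimal sketch of an SpM×V kernel over a 2×2 block compressed sparse row (BCSR) layout, the kind of code such tuning selects among. The layout and names are illustrative assumptions, not the authors' actual implementation; the point is that each 2×2 block reuses two entries of x and keeps two entries of y in registers.

    /* Sketch of register blocking for y += A*x with a 2x2 BCSR layout.
     * Hypothetical data structure, not the authors' tuned code. */
    #include <stddef.h>

    void spmv_bcsr_2x2(size_t nblockrows,
                       const size_t *brow_ptr,   /* block-row starts      */
                       const size_t *bcol_idx,   /* block-column indices  */
                       const double *bvals,      /* 2x2 blocks, row-major */
                       const double *x, double *y)
    {
        for (size_t I = 0; I < nblockrows; I++) {
            double y0 = y[2 * I], y1 = y[2 * I + 1];   /* outputs stay in registers */
            for (size_t k = brow_ptr[I]; k < brow_ptr[I + 1]; k++) {
                const double *b = &bvals[4 * k];
                size_t j = 2 * bcol_idx[k];
                double x0 = x[j], x1 = x[j + 1];       /* each x entry reused twice */
                y0 += b[0] * x0 + b[1] * x1;
                y1 += b[2] * x0 + b[3] * x1;
            }
            y[2 * I]     = y0;
            y[2 * I + 1] = y1;
        }
    }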


Applicable Algebra in Engineering, Communication and Computing | 2007

When cache blocking of sparse matrix vector multiply works and why

Rajesh Nishtala; Richard W. Vuduc; James Demmel; Katherine A. Yelick

We present new performance models and more compact data structures for cache blocking when applied to sparse matrix-vector multiply (SpM×V). We extend our prior models by relaxing the assumption that the vectors fit in cache and find that the new models are accurate enough to predict optimum block sizes. In addition, we determine criteria that predict when cache blocking improves performance. We conclude with architectural suggestions that would make memory systems execute SpM×V faster.
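
Cache blocking differs from register blocking in that it partitions the matrix into large rectangular sub-matrices so that the slice of the source vector touched by each sub-matrix stays in cache. The sketch below shows that loop structure under an assumed layout in which each cache block is stored as its own small CSR matrix; the struct and names are hypothetical, not the paper's data structure.

    /* Sketch of cache-blocked SpM×V: iterate over CSR sub-matrices so the
     * relevant slice of x is reused while it is cache-resident. */
    #include <stddef.h>

    typedef struct {               /* one cache block, stored in CSR form      */
        size_t nrows, col_start;   /* rows in block, first column it covers    */
        const size_t *row_ptr;     /* nrows + 1 entries                        */
        const size_t *col_idx;     /* column indices relative to col_start     */
        const double *val;
    } csr_block;

    void spmv_cache_blocked(const csr_block *blocks, size_t nblocks,
                            const size_t *row_start,  /* first row of each block */
                            const double *x, double *y)
    {
        for (size_t b = 0; b < nblocks; b++) {
            const csr_block *B = &blocks[b];
            const double *xb = x + B->col_start;   /* x slice reused within block */
            double *yb = y + row_start[b];
            for (size_t i = 0; i < B->nrows; i++) {
                double sum = 0.0;
                for (size_t k = B->row_ptr[i]; k < B->row_ptr[i + 1]; k++)
                    sum += B->val[k] * xb[B->col_idx[k]];
                yb[i] += sum;      /* blocks in the same block row accumulate */
            }
        }
    }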


International Parallel and Distributed Processing Symposium | 2009

Scaling communication-intensive applications on BlueGene/P using one-sided communication and overlap

Rajesh Nishtala; Paul Hargrove; Dan Bonachea; Katherine A. Yelick

In earlier work, we showed that the one-sided communication model found in PGAS languages (such as UPC) offers significant advantages in communication efficiency by decoupling data transfer from processor synchronization. We explore the use of the PGAS model on IBM BlueGene/P, an architecture that combines low-power, quad-core processors with extreme scalability. We demonstrate that the PGAS model, using a new port of the Berkeley UPC compiler and GASNet one-sided communication layer, outperforms two-sided (MPI) communication in both microbenchmarks and a case study of the communication-limited benchmark, NAS FT. We scale the benchmark up to 16,384 cores of the BlueGene/P and demonstrate that UPC consistently outperforms MPI by as much as 66% for some processor configurations and an average of 32%. In addition, the results demonstrate the scalability of the PGAS model and the Berkeley implementation of UPC, the viability of using it on machines with multicore nodes, and the effectiveness of the BG/P communication layer for supporting one-sided communication and PGAS languages.


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2008

Performance without pain = productivity: data layout and collective communication in UPC

Rajesh Nishtala; George S. Almasi; Calin Cascaval

The next generations of supercomputers are projected to have hundreds of thousands of processors. However, as the number of processors grows, application scalability will be the dominant challenge. This forces us to reexamine some of the fundamental ways in which we approach the design and use of parallel languages and runtime systems. In this paper we show how the globally shared arrays in a popular Partitioned Global Address Space (PGAS) language, Unified Parallel C (UPC), can be combined with a new collective interface to improve both performance and scalability. This interface allows subsets, or teams, of threads to perform a collective together. Unlike MPI's communicators, our interface allows sets of threads to be placed in teams instantly rather than requiring communicators to be constructed explicitly, thus allowing more dynamic team construction and manipulation. We motivate our ideas with three application kernels: dense matrix multiplication, dense Cholesky factorization, and multidimensional Fourier transforms. We describe how the three aforementioned applications can be succinctly written in UPC, thereby aiding productivity. We also show how such an interface allows for scalability by running on up to 16,384 processors on the BlueGene/L. In a few lines of UPC code, we wrote a dense matrix multiply routine that achieves 28.8 TFlop/s and a 3D FFT that achieves 2.1 TFlop/s. We analyze our performance results through models and show that the machine resources rather than the interfaces themselves limit the performance.
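
The abstract contrasts instant team formation with MPI communicator construction. Purely as a point of comparison, the sketch below shows the conventional MPI pattern being contrasted: a broadcast over a subset of processes first requires building a communicator collectively with MPI_Comm_split. This is ordinary MPI, not the UPC team API described in the paper; 'color' is a hypothetical team identifier.

    /* Conventional MPI approach: explicit communicator construction before
     * a subset (team) collective can run. Shown only for contrast. */
    #include <mpi.h>

    void bcast_within_team(double *buf, int count, int color, MPI_Comm world)
    {
        int rank;
        MPI_Comm team;
        MPI_Comm_rank(world, &rank);
        /* Collective construction over all of 'world' before the team exists. */
        MPI_Comm_split(world, color, rank, &team);
        MPI_Bcast(buf, count, MPI_DOUBLE, 0, team);   /* root = team rank 0 */
        MPI_Comm_free(&team);
    }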


Parallel Computing | 2011

Tuning collective communication for Partitioned Global Address Space programming models

Rajesh Nishtala; Yili Zheng; Paul Hargrove; Katherine A. Yelick

Partitioned Global Address Space (PGAS) languages offer programmers the convenience of a shared memory programming style combined with the locality control necessary to run on large-scale distributed memory systems. Even within a PGAS language, programmers often need to perform global communication operations such as broadcasts or reductions, which are best performed as collective operations in which a group of threads work together to perform the operation. In this paper we consider the problem of implementing collective communication within PGAS languages and explore some of the design trade-offs in both the interface and the implementation. In particular, PGAS collectives raise semantic issues that differ from those in send-receive style message passing programs, and admit different implementation approaches that take advantage of the one-sided communication style in these languages. We present an implementation framework for PGAS collectives as part of the GASNet communication layer, which supports shared memory, distributed memory, and hybrids. The framework supports a broad set of algorithms for each collective, over which the implementation may be automatically tuned. Finally, we demonstrate the benefit of optimized GASNet collectives using application benchmarks written in UPC, and demonstrate that the GASNet collectives can deliver scalable performance on a variety of state-of-the-art parallel machines including a Cray XT4, an IBM BlueGene/P, and a Sun Constellation system with InfiniBand interconnect.
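
The automatic tuning mentioned above amounts to timing a set of candidate algorithms for a given collective, machine, and message size, and keeping the fastest. A minimal sketch of that selection loop is shown below, written against standard MPI timing primitives for concreteness; the candidate table, trial count, and function names are assumptions, and this is not GASNet's actual tuner.

    /* Minimal sketch of empirical algorithm selection for a collective. */
    #include <mpi.h>
    #include <stddef.h>

    typedef void (*bcast_fn)(void *buf, size_t bytes, MPI_Comm comm);

    size_t pick_best(bcast_fn candidates[], size_t ncand,
                     void *buf, size_t bytes, MPI_Comm comm, int trials)
    {
        size_t best = 0;
        double best_t = 1e30;
        for (size_t a = 0; a < ncand; a++) {
            MPI_Barrier(comm);                  /* align ranks before timing   */
            double t0 = MPI_Wtime();
            for (int i = 0; i < trials; i++)
                candidates[a](buf, bytes, comm);
            double t = (MPI_Wtime() - t0) / trials;
            double tmax;                        /* judge by the slowest rank   */
            MPI_Allreduce(&t, &tmax, 1, MPI_DOUBLE, MPI_MAX, comm);
            if (tmax < best_t) { best_t = tmax; best = a; }
        }
        return best;                            /* index of the fastest algorithm */
    }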


Parallel Computing | 2011

Guest Editorial: Emerging programming paradigms for large-scale scientific computing

Leonid Oliker; Rajesh Nishtala; Rupak Biswas

High performance computing (HPC) systems are experiencing a rapid evolution of node architectures as power and cooling constraints limit increases in microprocessor clock speeds. In response to this power wall, computer architects are dramatically increasing on-chip parallelism to improve overall performance. The traditional doubling of clock speeds every 18 to 24 months is being replaced by increasing the number of cores for explicit parallelism. During the next decade, the level of parallelism on a single microprocessor will rival the number of nodes in the most massively parallel supercomputers of the 1980s. By 2020, extreme scale HPC systems are anticipated to have on the order of 100,000 to 1,000,000 sockets, with each socket containing between 100 and 1000 potentially heterogeneous cores. These enormous levels of concurrency must be exploited efficiently to reap the benefits of such exascale systems. This explosion of parallelism poses two significant challenges from a programming perspective. The first is how to best manage all the available resources to ensure the most efficient use of the peak performance provided by the hardware designs. The other, and equally important, question is how the enormous potential of these systems can be effectively utilized by a wide spectrum of scientists and engineers who are not necessarily parallel programming experts. The problem for application programmers is further compounded by the diversity of multicore architectures that are now emerging, ranging from complex out-of-order CPUs with deep cache hierarchies, to relatively simple cores that support hardware multithreading, to chips that require explicit use of software-controlled memory. Algorithms must therefore expose parallelism at multiple levels to effectively exploit these diverse architectures. These changes are forcing the computational community to reexamine fundamental approaches in the design of parallel languages and runtime systems, as they have a profound effect on the productivity and efficiency of application design and execution. This special issue examines several key research topics geared towards productively enabling high utilization of next-generation supercomputing systems. Broadly speaking, researchers have taken two paths for leveraging the parallelism provided by modern platforms. The first focuses on optimizing parallel programs as aggressively as possible by leveraging knowledge of the underlying architecture, thus establishing a methodology for maximizing system performance. The second path provides the tools, libraries, and runtime systems to simplify the complexities of parallel programming, without sacrificing performance, thereby allowing domain experts to leverage the potential of high-end systems. The first four papers in this special …


Conference on High Performance Computing (Supercomputing) | 2006

Optimized collectives for PGAS languages with one-sided communication

Dan Bonachea; Paul Hargrove; Rajesh Nishtala; Michael L. Welcome; Katherine A. Yelick

Optimized collective operations are a crucial performance factor for many scientific applications. This work investigates the design and optimization of collectives in the context of Partitioned Global Address Space (PGAS) languages such as Unified Parallel C (UPC). Languages with one-sided communication permit a more flexible and expressive collective interface with application code, in turn enabling more aggressive optimization and more effective utilization of system resources. We investigate the design tradeoffs in a collectives implementation for UPC, ranging from resource management to synchronization mechanisms and target-dependent selection of optimal communication patterns. Our collectives are implemented in the Berkeley UPC compiler using the GASNet communication system, tuned across a wide variety of supercomputing platforms, and benchmarked against MPI collectives. Special emphasis is placed on the newly added Cray XT3 backend for UPC, whose characteristics are benchmarked in detail.


USENIX Conference on Hot Topics in Parallelism | 2009

Optimizing collective communication on multicores

Rajesh Nishtala; Katherine A. Yelick


Archive | 2002

Automatic Performance Tuning and Analysis of Sparse Triangular Solve

Richard Vuduc; Shoaib Kamil; Jen Hsu; Rajesh Nishtala; James Demmel; Katherine A. Yelick

Collaboration


Dive into Rajesh Nishtala's collaborations.

Top Co-Authors

Katherine A. Yelick (Lawrence Berkeley National Laboratory)
Dan Bonachea (University of California)
Christian Bell (Lawrence Berkeley National Laboratory)
James Demmel (University of California)
Paul Hargrove (Lawrence Berkeley National Laboratory)
Richard W. Vuduc (Lawrence Livermore National Laboratory)
Christopher J. Bell (University of Texas at Austin)
Michael L. Welcome (Lawrence Berkeley National Laboratory)
Amir Kamil (University of California)