Publications


Featured research published by Siddhartha Chatterjee.


Journal of Parallel and Distributed Computing | 1994

Implementation of a portable nested data-parallel language

Guy E. Blelloch; Jonathan C. Hardwick; Jay Sipelstein; Marco Zagha; Siddhartha Chatterjee

This paper gives an overview of the implementation of NESL, a portable nested data-parallel language. This language and its implementation are the first to fully support nested data structures as well as nested data-parallel function calls. These features allow the concise description of parallel algorithms on irregular data, such as sparse matrices and graphs. In addition, they maintain the advantages of data-parallel languages: a simple programming model and portability. The current NESL implementation is based on an intermediate language called VCODE and a library of vector routines called CVL. It runs on the Connection Machines CM-2 and CM-5, the Cray Y-MP C90, and serial workstations. We compare initial benchmark results of NESL with those of machine-specific code on these machines for three algorithms: least-squares line-fitting, median finding, and a sparse-matrix vector product. These results show that NESL's performance is competitive with that of machine-specific codes for regular dense data, and is often superior for irregular data.
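
Even a toy version shows what nesting buys. Below is a minimal Python sketch (not NESL, whose syntax differs) of the nested data-parallel sparse matrix-vector product the abstract alludes to: the outer level is parallel over rows, each row carries its own inner multiply-and-reduce, and rows of different lengths pose no problem. Flattening such nesting into segmented vector operations is what the NESL implementation automates.

```python
# A minimal Python sketch (not NESL) of the nested data-parallel
# sparse matrix-vector product: one (column, value) list per row, so
# rows may have different lengths, exactly the irregularity that
# nested data parallelism expresses directly.
def sparse_matvec(rows, x):
    # Outer level: parallel over rows. Inner level: a nested
    # multiply-and-reduce per row. NESL's implementation flattens
    # such nesting into segmented vector operations automatically.
    return [sum(v * x[c] for (c, v) in row) for row in rows]

A = [[(0, 2.0), (3, 1.0)],            # row 0: two nonzeros
     [(1, 4.0)],                      # row 1: one nonzero
     [(0, 1.0), (1, 1.0), (2, 1.0)]]  # row 2: three nonzeros
x = [1.0, 2.0, 3.0, 4.0]
print(sparse_matvec(A, x))            # [6.0, 8.0, 6.0]
```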


Conference on High Performance Computing (Supercomputing) | 1990

Scan primitives for vector computers

Siddhartha Chatterjee; Guy E. Blelloch; Marco Zagha

The authors describe an optimized implementation of a set of scan (also called all-prefix-sums) primitives on a single processor of a CRAY Y-MP, and demonstrate that their use leads to greatly improved performance for several applications that cannot be vectorized with existing compiler technology. The algorithm used to implement the scans is based on an algorithm for parallel computers. A set of segmented versions of these scans is only marginally more expensive than the unsegmented versions. The authors describe a radix sorting routine based on the scans that is 13 times faster than a Fortran version and within 20% of a highly optimized library sort routine, three operations on trees that are between 10 and 20 times faster than the corresponding C versions, and a connectionist learning algorithm that is 10 times faster than the corresponding C version for sparse and irregular networks.
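
For readers unfamiliar with the primitive, here is a plain-Python sketch of the exclusive scan, its segmented variant, and the scan-based "split" pass that a radix sort like the one described is built from. Only the data flow is shown; the CRAY implementation vectorizes these loops.

```python
def exclusive_scan(xs):
    # All-prefix-sums: out[i] = xs[0] + ... + xs[i-1].
    out, acc = [], 0
    for x in xs:
        out.append(acc)
        acc += x
    return out

def segmented_scan(xs, flags):
    # flags[i] == 1 marks the start of a new segment; the running sum
    # restarts there. This is the variant that is only marginally more
    # expensive than the unsegmented scan.
    out, acc = [], 0
    for x, f in zip(xs, flags):
        if f:
            acc = 0
        out.append(acc)
        acc += x
    return out

def split_by_bit(keys, bit):
    # One radix-sort pass: stably place keys with a 0 in `bit` before
    # keys with a 1, computing every destination with exclusive scans.
    ones = [(k >> bit) & 1 for k in keys]
    zeros = [1 - o for o in ones]
    zero_dst = exclusive_scan(zeros)
    one_dst = [sum(zeros) + i for i in exclusive_scan(ones)]
    out = [None] * len(keys)
    for k, o, zd, od in zip(keys, ones, zero_dst, one_dst):
        out[od if o else zd] = k
    return out

print(split_by_bit([5, 1, 4, 2, 7, 0], 0))  # [4, 2, 0, 5, 1, 7]
```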


Journal of Parallel and Distributed Computing | 1995

Generating local addresses and communication sets for data-parallel programs

Siddhartha Chatterjee; John R. Gilbert; Fred J. E. Long; Robert Schreiber; Shang-Hua Teng

Generating local addresses and communication sets is an important issue in distributed-memory implementations of data-parallel languages such as High Performance Fortran. We demonstrate a storage scheme, for an array A affinely aligned to a template that is distributed across p processors with a cyclic(k) distribution, that wastes no storage, and show that, under this storage scheme, the local memory access sequence of any processor for a computation involving the regular section A(l:h:s) is characterized by a finite state machine of at most k states. We present fast algorithms for computing the essential information about these state machines, and we extend the framework to handle multidimensional arrays. We also show how to generate communication sets using the state machine approach. Performance results show that this solution requires very little runtime overhead and acceptable preprocessing time.
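
The problem is easy to state by brute force, which the following Python sketch does under one common block-cyclic layout convention (the paper's actual storage scheme may differ in details): enumerate the section's global indices, keep those owned by one processor, and observe that the gaps between consecutive local offsets repeat with a short period, which is the pattern the paper's finite state machine encodes directly.

```python
# A brute-force sketch, assuming the usual block-cyclic convention:
# template cell g lives on processor (g // k) % p at local offset
# (g // (p * k)) * k + g % k.
def local_accesses(l, h, s, p, k, me):
    seq = []
    for g in range(l, h + 1, s):         # global indices of A(l:h:s)
        if (g // k) % p == me:
            seq.append((g // (p * k)) * k + g % k)
    return seq

# The deltas between consecutive local offsets repeat with period at
# most k; the paper's state machine generates them directly, without
# the global enumeration done above.
seq = local_accesses(0, 63, 3, p=4, k=4, me=1)
print(seq)                                     # [2, 5, 8, 11, 14]
print([b - a for a, b in zip(seq, seq[1:])])   # [3, 3, 3, 3]
```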


Symposium on Frontiers of Massively Parallel Computation | 1990

Vcode: a data-parallel intermediate language

Guy E. Blelloch; Siddhartha Chatterjee

A description is given of Vcode, a data-parallel intermediate language. Vcode is designed to allow easy porting of data-parallel languages to a wide class of parallel machines, and for experimenting with compiling such languages. The design goal was to define a simple language whose primitives can be implemented efficiently but that is still powerful enough to express the features of existing data-parallel languages. Vcode contains about 50 instructions, most of which manipulate arbitrarily long vectors of atomic values, and includes a set of segmented instructions that are crucial for implementing data-parallel languages that permit nested parallelism. The design decisions are discussed, and it is shown how three data-parallel languages (C*, Fortran 8x, and Paralation Lisp) can be mapped onto Vcode. The issues encountered in implementing Vcode on different kinds of parallel machines, as well as specific techniques for implementing it on the Connection Machine, are examined.
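
The segmented instructions are the key design point. Here is a minimal Python sketch of the representation behind them, assuming the usual flat encoding of a nested value as a data vector plus a vector of segment lengths (Vcode's exact descriptor may differ):

```python
# A nested value becomes a flat data vector plus a segment descriptor.
def to_segmented(nested):
    data = [x for seg in nested for x in seg]
    lengths = [len(seg) for seg in nested]
    return data, lengths

def seg_reduce_add(data, lengths):
    # One "instruction" sums every segment in a single pass over the
    # flat data vector; on a vector machine this is one vectorized op.
    out, i = [], 0
    for n in lengths:
        out.append(sum(data[i:i + n]))
        i += n
    return out

data, lengths = to_segmented([[1, 2, 3], [], [4, 5]])
print(seg_reduce_add(data, lengths))  # [6, 0, 9]
```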


ACM Transactions on Programming Languages and Systems | 1993

Compiling nested data-parallel programs for shared-memory multiprocessors

Siddhartha Chatterjee

While data parallelism is well suited on algorithmic, architectural, and linguistic grounds to serve as a basis for portable parallel programming, its characteristic fine-grained parallelism makes the efficient implementation of data-parallel languages on MIMD machines a challenging task. The design, implementation, and evaluation of an optimizing compiler are presented for an applicative nested data-parallel language called Vcode, targeted at the Encore Multimax, a shared-memory multiprocessor. The source language supports nested aggregate data types; aggregate operations including elementwise forms, scans, reductions, and permutations; and conditionals and recursion for control flow.
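
As a rough illustration of the MIMD execution model such a compiler targets (this is not the paper's compiler, just the chunking idea): an elementwise data-parallel operation is cut into a few contiguous chunks, one per worker, instead of spawning one fine-grained task per element.

```python
from concurrent.futures import ThreadPoolExecutor

def par_elementwise(f, xs, workers=4):
    out = [None] * len(xs)
    chunk = (len(xs) + workers - 1) // workers

    def run(lo):
        # Each worker owns one contiguous chunk: medium-grained work
        # instead of one task per element.
        for i in range(lo, min(lo + chunk, len(xs))):
            out[i] = f(xs[i])

    with ThreadPoolExecutor(max_workers=workers) as ex:
        list(ex.map(run, range(0, len(xs), chunk)))
    return out

print(par_elementwise(lambda v: v * v, list(range(10))))
```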


Programming Language Design and Implementation | 1991

Size and access inference for data-parallel programs

Siddhartha Chatterjee; Guy E. Blelloch; Allan L. Fisher

Data-parallel programming languages have many desirable features, such as single-thread semantics and the ability to express fine-grained parallelism. However, it is challenging to implement such languages efficiently on conventional MIMD multiprocessors, because these machines incur a high overhead for small grain sizes. This paper presents compile-time analysis techniques for data-parallel program graphs that reduce these overheads in two ways: by increasing the grain size, and by relaxing the synchronous nature of the computation without altering the program semantics. The algorithms partition the program graph into clusters of nodes such that all nodes in a cluster have the same loop structure, and further refine these clusters into epochs based on generation and consumption patterns of data vectors. This converts the fine-grained parallelism in the original program to medium-grained loop parallelism, which is better suited to MIMD machines. A compiler has been implemented based on these ideas. We present performance results for data-parallel kernels analyzed by the compiler and converted to single-program multiple-data (SPMD) code running on an Encore Multimax.
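
A much-simplified Python sketch of the grain-size half of the idea, with an invented (op, length) node format: consecutive elementwise nodes that iterate over the same vector length are fused into one cluster, i.e. one loop. The paper's epoch refinement, which also tracks when vectors are produced and consumed, is not shown.

```python
def cluster_by_loop(nodes):
    # Fuse runs of nodes that share a loop structure (here, just a
    # vector length) into single clusters.
    clusters, cur, cur_len = [], [], None
    for op, n in nodes:
        if cur and n != cur_len:
            clusters.append(cur)       # loop structure changed: flush
            cur = []
        cur.append(op)
        cur_len = n
    if cur:
        clusters.append(cur)
    return clusters

prog = [("add", 100), ("mul", 100), ("neg", 100), ("sum", 1), ("sqrt", 1)]
print(cluster_by_loop(prog))  # [['add', 'mul', 'neg'], ['sum', 'sqrt']]
```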


Symposium on Frontiers of Massively Parallel Computation | 1995

Aligning parallel arrays to reduce communication

Thomas J. Sheffler; Robert Schreiber; John R. Gilbert; Siddhartha Chatterjee

Axis and stride alignment is an important optimization in compiling data-parallel programs for distributed-memory machines. We previously developed an optimal algorithm for aligning array expressions. Here, we examine alignment for more general program graphs. We show that optimal alignment is NP-complete in this setting, so we study heuristic methods. This paper makes two contributions. First, we show how local graph transformations can reduce the size of the problem significantly without changing the best solution. This allows more complex and effective heuristics to be used. Second, we give a heuristic that can explore the space of possible solutions in a number of ways. We show that some of these strategies can give better solutions than a simple greedy approach proposed earlier. Our algorithms have been implemented; we present experimental results showing their effect on the performance of some example programs running on the CM-5.
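
For context, a Python sketch of the kind of simple greedy baseline the paper's heuristics are compared against; the graph encoding and unit cost model here are invented for illustration. Each node is visited in topological order and assigned the alignment that is cheapest with respect to its already-placed neighbors.

```python
def greedy_align(nodes, neighbors, choices, cost):
    # Greedy baseline (not the paper's heuristic): place each node at
    # the alignment minimizing communication with placed neighbors.
    placed = {}
    for v in nodes:                      # assumed topological order
        best = min(choices[v],
                   key=lambda a: sum(cost(a, placed[u])
                                     for u in neighbors[v] if u in placed))
        placed[v] = best
    return placed

nodes = ["A", "B", "C"]
neighbors = {"A": [], "B": ["A"], "C": ["A", "B"]}
choices = {v: [0, 1] for v in nodes}     # e.g. align to axis 0 or axis 1
cost = lambda a, b: 0 if a == b else 1   # one unit per realignment
print(greedy_align(nodes, neighbors, choices, cost))  # {'A': 0, 'B': 0, 'C': 0}
```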


Journal of Parallel and Distributed Computing | 1995

Solving linear recurrences with loop raking

Guy E. Blelloch; Siddhartha Chatterjee; Marco Zagha

We present a variation of the partition method for solving linear recurrences that is well suited to vector multiprocessors. The algorithm fully utilizes both vector and multiprocessor capabilities, and reduces the number of memory accesses and temporary memory requirements as compared to the more commonly used version of the partition method. Our variation uses a general loop restructuring technique called loop raking. We describe an implementation of this technique on the CRAY Y-MP C90, and present performance results for first- and second-order linear recurrences. On a single processor of the C90 our implementations are up to 7.3 times faster than the corresponding optimized library routines in SCILIB, an optimized mathematical library supplied by Cray Research. On four processors, we gain an additional speedup of at least 3.7.
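
Here is a Python sketch of the partition method for the first-order case x[i] = a[i]*x[i-1] + b[i]. Plain slicing stands in for loop raking's strided partitions, and the parallel phases are written as serial loops; what matters is the three-phase structure the abstract refers to.

```python
def recurrence_partition(a, b, x0, parts=4):
    n = len(a)
    bounds = [n * i // parts for i in range(parts + 1)]

    # Phase 1 (independent per partition, hence vectorizable): compose
    # each partition's affine maps x -> a*x + b into one pair (A, B).
    comp = []
    for lo, hi in zip(bounds, bounds[1:]):
        A, B = 1.0, 0.0
        for i in range(lo, hi):
            A, B = a[i] * A, a[i] * B + b[i]
        comp.append((A, B))

    # Phase 2 (short serial loop over partitions): carry the running
    # value across partition boundaries.
    starts, x = [], x0
    for A, B in comp:
        starts.append(x)
        x = A * x + B

    # Phase 3 (independent per partition): expand within partitions.
    out = [0.0] * n
    for (lo, hi), x in zip(zip(bounds, bounds[1:]), starts):
        for i in range(lo, hi):
            x = a[i] * x + b[i]
            out[i] = x
    return out

# Sanity check against the direct serial recurrence:
a, b = [0.5] * 8, [1.0] * 8
print(recurrence_partition(a, b, x0=0.0))
ref, x = [], 0.0
for ai, bi in zip(a, b):
    x = ai * x + bi
    ref.append(x)
print(ref)
```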


SIGPLAN Notices | 1993

Optimal evaluation of array expressions on massively parallel machines (extended abstract)

Siddhartha Chatterjee; John R. Gilbert; Robert Schreiber; Shang-Hua Teng

We investigate the problem of optimal evaluation of Fortran 90-style array expressions on a massively parallel distributed-memory machine. On such machines, an elementwise operation can be performed in unit time for arrays whose corresponding elements are in the same processor. If the arrays are not aligned in this manner, the cost of alignment is part of the cost of expression evaluation. The choice of where to perform the operation then affects this cost. We demonstrate how a dynamic programming technique can be applied to solve this problem efficiently for a wide variety of interconnection schemes, including multidimensional grids and rings, hypercubes, and fat-trees. We also consider the variant where the operations may change the shape of the arrays, and show that our approach extends naturally to handle this case.
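
A small Python sketch of the dynamic program on an expression tree, with an invented tree encoding and distance function: for each node and each candidate location, the cheapest cost is the sum, over children, of the best choice of (child cost there + cost of moving the child's result here).

```python
# Invented tree encoding: ("leaf", home_location) or ("op", left, right).
# The returned dict maps location l to the cheapest cost of having the
# node's result materialized at l.
def min_eval_cost(tree, locs, move):
    if tree[0] == "leaf":
        src = tree[1]
        return {l: move(src, l) for l in locs}
    _, left, right = tree
    lc = min_eval_cost(left, locs, move)
    rc = min_eval_cost(right, locs, move)
    # Evaluate the op at l: each child is computed wherever it is
    # cheapest, counting the cost of moving its result to l.
    return {l: min(lc[m] + move(m, l) for m in locs) +
               min(rc[m] + move(m, l) for m in locs)
            for l in locs}

# Example: locations are positions on a 1-D grid; moving an array by
# one grid step costs 1 (an invented cost model for illustration).
locs = [0, 1, 2]
move = lambda s, d: abs(s - d)
tree = ("op", ("leaf", 0), ("op", ("leaf", 2), ("leaf", 2)))
costs = min_eval_cost(tree, locs, move)
print(costs, "best:", min(costs.values()))
```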


Journal of Programming Languages | 1994

Modeling Data-Parallel Programs with the Alignment-Distribution Graph

Siddhartha Chatterjee; John R. Gilbert; Robert Schreiber; Thomas J. Sheffler

Collaboration


Dive into Siddhartha Chatterjee's collaborations.

Top Co-Authors

Guy E. Blelloch (Carnegie Mellon University)
Marco Zagha (Carnegie Mellon University)
Jay Sipelstein (Carnegie Mellon University)
Shang-Hua Teng (University of Southern California)
Allan L. Fisher (Carnegie Mellon University)