
Publication


Featured research published by Santosh Pande.


IEEE Transactions on Parallel and Distributed Systems | 1995

A scalable scheduling scheme for functional parallelism on distributed memory multiprocessor systems

Santosh Pande; Dharma P. Agrawal; Jon Mauney

We attempt a new variant of the scheduling problem by investigating the scalability of the schedule length with the required number of processors, performing scheduling partially at compile time and partially at run time. Assuming an infinite number of processors, the compile-time schedule is found using a new concept, the threshold of a task, which quantifies a trade-off between the schedule length and the degree of parallelism. The schedule is found to minimize either the schedule length or the number of required processors, and it satisfies two conditions: a feasibility condition, which guarantees that the schedule delay of a task from its earliest start time is below the threshold, and an optimality condition, which uses a merit function to decide the best task-processor match for a set of tasks competing for a given processor. At run time, the tasks are merged, producing a schedule for a smaller number of available processors. This allows the program to be scaled down to the processors actually available at run time. The usefulness of this scheduling heuristic has been demonstrated by incorporating the scheduler in the compiler backend for targeting Sisal (Streams and Iterations in a Single Assignment Language) on the iPSC/860.
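The run-time scaling step described above can be illustrated with a minimal sketch. The folding strategy below (merging virtual processor v onto physical processor v mod P) and the name `fold_schedule` are illustrative assumptions, not the paper's actual heuristic, which merges tasks based on the threshold analysis:

```python
def fold_schedule(schedule, p_avail):
    """Scale a compile-time schedule built for 'virtual' processors
    0..V-1 down to the p_avail processors actually available, by
    merging virtual processor v onto physical processor v % p_avail
    and concatenating the task lists in virtual-processor order.
    """
    merged = [[] for _ in range(p_avail)]
    for v, tasks in enumerate(schedule):
        merged[v % p_avail].extend(tasks)
    return merged

# Four virtual processors folded onto two physical ones
sched = [['t0'], ['t1'], ['t2'], ['t3']]
assert fold_schedule(sched, 2) == [['t0', 't2'], ['t1', 't3']]
```

The key property this preserves is that every task still runs exactly once; what the paper's heuristic additionally optimizes (which tasks to co-locate) is abstracted away here.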


International Conference on Parallel Processing | 1996

A compile time partitioning method for DOALL loops on distributed memory systems

Santosh Pande

The loop partitioning problem on modern distributed memory systems is no longer fully communication bound, primarily due to a significantly lower ratio of communication to computation speeds. Useful parallelism may be exploited on these systems to the extent that the communication balances the parallelism and does not produce an overhead high enough to nullify all the gains due to the parallelism. Motivated by this, we describe a compile-time partitioning and scheduling approach for DOALL loops in which communication without data replication is inevitable. First, the code partitioning phase analyzes the references in the body of the DOALL loop nest and determines a set of partitioning directions that reduce a larger amount of communication by trading away a smaller degree of parallelism. Next, the data distribution phase uses a new 'larger partition owns' rule to achieve computation and communication load balance. The granularity adjustment phase attempts to further eliminate communication by merging partitions to reduce the completion time. Finally, the load balancing phase attempts to reduce the number of processors without degrading the completion time, and the mapping phase schedules the partitions on the available processors. The relevant theory and algorithms are developed, along with a performance evaluation on a Cray T3D.
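The idea of choosing partitioning directions that trade a little parallelism for a large reduction in communication can be sketched as follows. The cost model (crossing traffic proportional to the reference offsets' components along the cut axis) is a simplification assumed here for illustration, not the paper's actual analysis:

```python
def best_partition_axis(offsets):
    """Pick the iteration-space axis to slice along.

    Cutting the iteration space perpendicular to axis d forces every
    reference offset with a nonzero component along d to cross the
    cut; choose the axis whose total crossing traffic is smallest.

    offsets: list of reference offset vectors, e.g. [(1, 0), (0, 2)].
    Returns the index of the axis to partition along.
    """
    dims = len(offsets[0])
    traffic = [sum(abs(o[d]) for o in offsets) for d in range(dims)]
    return traffic.index(min(traffic))

# Offsets (1,0) and (2,0): all communication flows along axis 0,
# so slicing along axis 1 keeps every dependence inside a partition.
assert best_partition_axis([(1, 0), (2, 0)]) == 1
```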


Journal of Parallel and Distributed Computing | 1996

Program Repartitioning on Varying Communication Cost Parallel Architectures

Santosh Pande; Kleanthis Psarris

In an earlier work, a Threshold Scheduling Algorithm was proposed to schedule the functional (DAG) parallelism in a program on distributed memory systems. In this work, we address the issue of adapting the schedule for a set of distributed memory architectures with the same computation costs but higher communication costs. We introduce a new concept, the dominant edges of a schedule, to denote those edges which dictate the schedule time of the destination nodes due to the changes in their communication costs. Using this concept, we derive the conditions under which the schedule for the whole graph, or at least part of it, can be reused for a different architecture, keeping the cost of program repartitioning and rescheduling to a minimum. We demonstrate the practical significance of the method by incorporating it in the compiler backend for targeting Sisal (Streams and Iterations in a Single Assignment Language) on a family of Intel i860 architectures, Gamma, Delta, and Paragon, which vary in their communication costs. It is shown that almost 30 to 65% of the schedule can be reused unchanged, thereby avoiding program repartitioning to a large degree. The remainder of the schedule can be regenerated through a linear algorithm at run time.
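A minimal sketch of identifying dominant edges, under an assumed representation of the schedule (explicit per-node start times, execution times, and a processor assignment); the paper's exact formulation may differ:

```python
def dominant_edges(dag, start, exec_time, comm, proc):
    """Find the edges that dictate their destination's schedule time.

    dag:       {node: [successor nodes]}
    start:     {node: scheduled start time}
    exec_time: {node: execution cost}
    comm:      {(u, v): communication cost of edge (u, v)}
    proc:      {node: assigned processor}

    An edge (u, v) is dominant when v's start time equals the arrival
    time of u's result over that edge; a change in that edge's
    communication cost therefore forces v to be rescheduled.
    Edges between tasks on the same processor incur no communication.
    """
    dom = set()
    for u, succs in dag.items():
        for v in succs:
            c = comm[(u, v)] if proc[u] != proc[v] else 0
            if start[v] == start[u] + exec_time[u] + c:
                dom.add((u, v))
    return dom
```

On an architecture with higher communication costs, only the subgraphs reached through dominant edges would need rescheduling; the rest of the schedule can be reused as the abstract describes.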


International Conference on Parallel Processing | 1994

An Empirical Study of the I Test for Exact Data Dependence

Kleanthis Psarris; Santosh Pande

Parallelizing compilers rely upon subscript analysis to detect data dependences between pairs of array references inside loop nests. The most widely used approximate subscript analysis tests are the GCD test and the Banerjee test. In an earlier work we proposed the I test, an improved subscript analysis test. The I test extends the accuracy of a combination of the GCD test and the Banerjee test, and is able to provide exact data dependence information at no additional computation cost. In the present work we perform an empirical study on the Perfect Club benchmarks to demonstrate the effectiveness and practical importance of the I test. We compare its performance with that of the GCD test and the Banerjee test, and show that the I test is always an exact test in practice.
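For reference, the classical GCD test that the I test improves upon can be stated in a few lines for single-index linear subscripts:

```python
from math import gcd

def gcd_test(a1, a0, b1, b0):
    """GCD test for dependence between references A[a1*i + a0] and
    A[b1*j + b0] (single-index, linear subscripts).

    The equation a1*i - b1*j = b0 - a0 has an integer solution iff
    gcd(a1, b1) divides (b0 - a0).  If it does not, the references
    are provably independent; if it does, a dependence is only
    *possible*, which is why the test is approximate.
    """
    g = gcd(a1, b1)
    if g == 0:                       # both coefficients are zero
        return a0 == b0              # dependence iff constants match
    return (b0 - a0) % g == 0

# A[2*i] vs A[2*j + 1]: gcd(2,2)=2 does not divide 1 -> independent
assert not gcd_test(2, 0, 2, 1)
# A[2*i] vs A[4*j + 2]: gcd(2,4)=2 divides 2 -> dependence possible
assert gcd_test(2, 0, 4, 2)
```

The GCD test ignores loop bounds entirely; that is precisely the imprecision the Banerjee test and the I test address.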


International Conference on Parallel Processing | 1998

Optimal task scheduling to minimize inter-tile latencies

Fabrice Rastello; Amit Rao; Santosh Pande

This work addresses the issue of exploiting intra-tile parallelism by overlapping communication with computation removing the restriction of atomicity of tiles. The effectiveness of tiling is then critically dependent on the execution order of tasks within a tile. We present a theoretical framework based on equivalence classes that provides an optimal task ordering under assumptions of constant and different permutations of tasks in individual tiles. Our framework is able to handle constant but compile-time unknown dependences by generating optimal task permutations at run-time and results in significantly lower loop completion times. Our solution is an improvement over previous approaches (Chou and Kung, 1993) (Dion et al., 1995) and is optimal for all problem instances. We also propose efficient algorithms that provide the optimal solution. The framework has been implemented as an optimization pass in the SUIF compiler and has been tested on a distributed memory system using a message passing model. We show that the performance improvement over previous results is substantial.


Hawaii International Conference on System Sciences | 1995

Classical dependence analysis techniques: sufficiently accurate in practice

Kleanthis Psarris; Santosh Pande

Data dependence analysis is the foundation of any parallelizing compiler. The GCD (greatest common divisor) test and the Banerjee-Wolfe test (U. Banerjee 1988, M. Wolfe 1989) are the two tests traditionally used to determine statement data dependence in automatic vectorization/parallelization of loops. These tests are approximate in the sense that they are necessary but not sufficient conditions for data dependence. In an earlier work (Proc. 6th ACM Int. Conf. Supercomputing, Washington, DC, USA, July 1992), we extended the Banerjee-Wolfe test and a combination of the GCD and Banerjee-Wolfe tests with a set of conditions to derive exact data dependence information. In this paper, we perform an empirical study on the Perfect benchmarks to demonstrate the effectiveness and practical importance of our conditions. We show that the Banerjee-Wolfe test extended with our conditions becomes an exact test for data dependence in actual practice.
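The Banerjee-Wolfe test can likewise be sketched for the single-index case; this simplified version assumes one loop index per reference with constant bounds, which is narrower than the general test:

```python
def banerjee_test(a1, a0, b1, b0, lo, hi):
    """Banerjee bounds test for A[a1*i + a0] vs A[b1*j + b0],
    with loop indices i, j ranging over [lo, hi].

    Dependence requires a1*i - b1*j = b0 - a0 for some i, j in
    bounds.  Compute the min and max of the left-hand side over the
    (real-valued) bounds; if the right-hand side falls outside
    [min, max], the references are provably independent.
    """
    def extremes(c):                 # min/max of c*k for k in [lo, hi]
        return min(c * lo, c * hi), max(c * lo, c * hi)

    lo1, hi1 = extremes(a1)          # contribution of  a1*i
    lo2, hi2 = extremes(-b1)         # contribution of -b1*j
    rhs = b0 - a0
    return lo1 + lo2 <= rhs <= hi1 + hi2   # True = dependence possible

# A[i] vs A[j + 100] for i, j in 1..10: rhs = 100 lies outside
# [1 - 10, 10 - 1] = [-9, 9], so the references are independent.
assert not banerjee_test(1, 0, 1, 100, 1, 10)
```

Unlike the GCD test, this test uses the loop bounds but ignores integrality, which is why the two are traditionally combined, and why the exactness conditions studied in this paper matter.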


International Parallel and Distributed Processing Symposium | 1995

A communication+computation load balanced loop partitioning method

Santosh Pande

We present an iteration and data partitioning approach for DOALL loops on distributed memory systems. The method first examines the highest amount of parallelism (the available parallelism) which could potentially be exploited in a loop nest. It then examines the amount of communication overhead which can potentially nullify the benefits of that parallelism, and attempts to maximally eliminate the communication, minimizing the loop completion time while trading away parallelism only to a minimal extent. This is achieved by determining the directions of iteration-space partitioning which result in minimum communication. Finally, in order to generate a partition that is load balanced with respect to computation+communication, the method uses a new 'larger partition owns' rule to distribute the underlying data. The necessary theoretical framework has been developed, and the merit of the method is shown through a performance evaluation on a Cray T3D.
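One plausible reading of the 'larger partition owns' rule can be sketched as follows; the data layout and function below are illustrative assumptions, not the paper's exact formulation:

```python
def larger_partition_owns(sharing_partitions, part_sizes):
    """Assign ownership of a shared data block.

    When a block of data is referenced by several iteration-space
    partitions, give its ownership to the partition with the largest
    computation share, so the heavier partition avoids communication
    for that block and the computation+communication load stays
    balanced.

    sharing_partitions: ids of partitions referencing the block
    part_sizes:         {partition id: iterations assigned to it}
    """
    return max(sharing_partitions, key=lambda p: part_sizes[p])

# A boundary block touched by partitions 0 and 1; partition 1 has
# the larger computation share, so it owns the block.
assert larger_partition_owns([0, 1], {0: 40, 1: 60}) == 1
```

This contrasts with the conventional owner-computes rule, which fixes ownership by data layout alone regardless of the partitions' relative sizes.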


Concurrency and Computation: Practice and Experience | 1995

Run-time issues in program partitioning on distributed memory systems

Santosh Pande; Dharma P. Agrawal

Our earlier work reported a Threshold Scheduling Method for compile-time mapping of functional parallelism on distributed-memory systems. The work reported in this paper discusses run-time issues in efficiently supporting the functional parallelism with minimal overheads, through a combination of compile-time and run-time ownership analysis. At compile time, the code generation phase determines, through an ownership analysis, whether a local copy of a live definition of a variable needed by a task is available on a given processor. In case ownership cannot be resolved at compile time, appropriate code is generated to perform the analysis at run time. The code generation is carried out so that all the processors carry the same copy of the compiled program, with each individual processor's code being isolated and the universally owned code being replicated on all processors to minimize run-time overheads. The run-time system maintains the static and dynamic ownerships at every processor to avoid communication overhead on ownership information. We demonstrate the approach by incorporating it in the compiler for targeting a parallel functional language, Sisal (Streams and Iterations in a Single Assignment Language), to Intel Touchstone i860 systems. Several benchmarks demonstrate the viability of the approach.
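The two-level ownership lookup described above might be sketched like this; the table-based representation and the function name are assumptions for illustration, not the paper's implementation:

```python
def resolve_owner(var, static_owners, dynamic_owners):
    """Two-level ownership resolution.

    static_owners:  ownerships the compiler resolved at compile time
    dynamic_owners: table the run-time system maintains on every
                    processor for ownerships that could not be
                    resolved statically

    Because both tables are replicated locally, no messages are
    needed just to discover which processor owns a value.
    """
    if var in static_owners:          # resolved at compile time
        return static_owners[var]
    return dynamic_owners[var]        # run-time table lookup

assert resolve_owner('x', {'x': 2}, {}) == 2       # static hit
assert resolve_owner('y', {'x': 2}, {'y': 0}) == 0  # run-time lookup
```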


International Conference on Parallel Processing | 1994

Compiling Functional Parallelism on a Family of Distributed Memory Architectures

Santosh Pande; Kleanthis Psarris

In an earlier work, a Threshold Scheduling Algorithm was proposed to schedule the functional parallelism in a program on distributed memory systems. In this work, we address the issue of regeneration of the schedule for a set of distributed memory architectures with different communication costs. A new concept of dominant edges of a schedule is introduced to denote those edges which dictate schedule regeneration due to the changes in their communication costs. It is shown that under certain conditions, the schedule for the whole graph, or at least part of it, can be reused for a different architecture, reducing the cost of program re-partitioning and re-scheduling. The usefulness of this method is demonstrated by incorporating it in the scheduler of the compiler backend for targeting Sisal (Streams and Iterations in a Single Assignment Language) on a family of Intel i860 architectures: Gamma, Delta, and Paragon, which vary in their communication costs. It is shown that almost 30 to 65% of the schedule can be reused, thereby avoiding program re-partitioning to a large degree.


International Conference on Parallel Architectures and Languages Europe | 1994

A Compilation Technique for Varying Communication Cost NUMA Architectures

Santosh Pande; Kleanthis Psarris

In an earlier work, a Threshold Scheduling Algorithm was proposed to schedule the functional parallelism in a program on distributed memory systems. In this work, we address the issue of regeneration of the schedule for a set of distributed memory architectures with different communication costs. A new concept of dominant edges of a schedule is introduced to denote those edges which dictate schedule regeneration due to the changes in their communication costs. It is shown that under certain conditions, the schedule for the whole graph, or at least part of it, can be reused for a different architecture, reducing the cost of program re-partitioning and re-scheduling. The usefulness of this method is demonstrated by incorporating it in the scheduler of the compiler backend for targeting Sisal (Streams and Iterations in a Single Assignment Language) on a family of Intel i860 architectures: Gamma, Delta, and Paragon, which vary in their communication costs. It is shown that almost 30 to 65% of the schedule can be reused, thereby avoiding program re-partitioning to a large degree.

Collaboration


Top co-authors of Santosh Pande:

Kleanthis Psarris, University of Texas at San Antonio
Jon Mauney, North Carolina State University
Amit Rao, University of Cincinnati