
Publication


Featured research published by Soumen Chakrabarti.


International Colloquium on Automata, Languages and Programming | 1996

Improved Scheduling Algorithms for Minsum Criteria

Soumen Chakrabarti; Cynthia A. Phillips; Andreas S. Schulz; David B. Shmoys; Clifford Stein; Joel Wein

We consider the problem of finding near-optimal solutions for a variety of NP-hard scheduling problems for which the objective is to minimize the total weighted completion time. Recent work has led to the development of several techniques that yield constant worst-case bounds in a number of settings. We continue this line of research by providing improved performance guarantees for several of the most basic scheduling models, and by giving the first constant performance guarantee for a number of more realistically constrained scheduling problems. For example, we give an improved performance guarantee for minimizing the total weighted completion time subject to release dates on a single machine, and subject to release dates and/or precedence constraints on identical parallel machines. We also give improved bounds on the power of preemption in scheduling jobs with release dates on parallel machines.
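
To make the objective concrete, here is a hedged, illustrative baseline for one of the models mentioned: minimizing total weighted completion time with release dates on a single machine. This is the classic greedy WSPT (weighted shortest processing time) rule adapted to release dates, not the paper's improved approximation algorithm; the function name and data layout are invented for this sketch.

```python
# Illustrative baseline only (not the paper's algorithm): nonpreemptive greedy
# WSPT with release dates. Whenever the machine is free, run the released job
# with the smallest processing-time-to-weight ratio.
import heapq

def wspt_with_release_dates(jobs):
    """jobs: list of (release, processing, weight) tuples.
    Returns the total weighted completion time of the greedy schedule."""
    pending = sorted(jobs)                        # sort by release date
    ready, time, total, i = [], 0, 0, 0
    while i < len(pending) or ready:
        # admit every job that has been released by the current time
        while i < len(pending) and pending[i][0] <= time:
            _, p, w = pending[i]
            heapq.heappush(ready, (p / w, p, w))  # smallest p/w = best ratio
            i += 1
        if not ready:
            time = pending[i][0]                  # idle until the next release
            continue
        _, p, w = heapq.heappop(ready)
        time += p                                 # run the job to completion
        total += w * time                         # accumulate w_j * C_j
    return total

# Example: a long cheap job released at 0 blocks a short valuable one at 1,
# illustrating why release dates make the problem hard for greedy rules.
print(wspt_with_release_dates([(0, 3, 1), (1, 1, 10)]))
```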


Symposium on the Theory of Computing | 1995

Parallel randomized load balancing

Micah Adler; Soumen Chakrabarti; Michael Mitzenmacher; Lars Eilstrup Rasmussen

It is well known that after placing n balls independently and uniformly at random into n bins, the fullest bin holds Θ(log n / log log n) balls with high probability. More recently, Azar et al. analyzed the following process: randomly choose d bins for each ball, and then place the balls, one by one, into the least full bin among their d choices. They show that after all n balls have been placed, the fullest bin contains only log log n / log d + Θ(1) balls with high probability. We explore extensions of this result to parallel and distributed settings. Our results focus on the tradeoff between the amount of communication and the final load balance achieved.
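
A minimal sketch of the sequential d-choice process analyzed by Azar et al., useful for comparing against the one-choice baseline. Purely illustrative; the parameters and function name are chosen arbitrarily here.

```python
# Throw n balls into n bins; each ball probes d random bins and joins the
# least loaded one. Demonstrates the "power of d choices" empirically.
import random

def max_load(n, d, seed=0):
    rng = random.Random(seed)
    bins = [0] * n
    for _ in range(n):
        probes = [rng.randrange(n) for _ in range(d)]
        best = min(probes, key=lambda b: bins[b])   # least-full of d choices
        bins[best] += 1
    return max(bins)

# d = 1 exhibits the Theta(log n / log log n) maximum load; d = 2 already
# drops it to roughly log log n / log 2 + O(1).
print(max_load(100_000, 1), max_load(100_000, 2))
```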


ACM Symposium on Parallel Algorithms and Architectures | 1995

Modeling the benefits of mixed data and task parallelism

Soumen Chakrabarti; James Demmel; Katherine A. Yelick

Mixed task and data parallelism exists naturally in many applications, but utilizing it may require sophisticated scheduling algorithms and software support. Recently, significant research effort has been applied to exploiting mixed parallelism in both the theory and systems communities. In this paper, we ask how much mixed parallelism will improve performance in practice, and how architectural evolution impacts these estimates. First, we build and validate a performance model for a class of mixed task and data parallel problems based on machine and problem parameters. Second, we use this model to estimate the gains from mixed parallelism for some scientific applications on current machines. This quantifies our intuition that mixed parallelism is best when either communication is slow or the number of processors is large. Third, we show that, for balanced divide and conquer trees, a simple one-time switch between data and task parallelism gets most of the benefit of general mixed parallelism. Fourth, we establish upper bounds on the benefits of mixed parallelism for irregular task graphs. Apart from these detailed analyses, we provide a framework in which other applications and machines can be evaluated.
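
The following toy cost model is our own simplification, not the paper's calibrated model: a balanced binary task tree where each node performs `work` units of data-parallel computation, and a group of g processors pays an assumed communication overhead of alpha * log2(g) per node. It reproduces the abstract's intuition that mixed parallelism pays off when communication is slow or the machine is large.

```python
# Toy model (assumptions: binary tree, power-of-two processor count p,
# hypothetical alpha*log2(g) communication term). Not the paper's model.
import math

def pure_data_parallel(levels, work, p, alpha):
    nodes = 2 ** (levels + 1) - 1                 # every node uses all p procs
    return nodes * (work / p + alpha * math.log2(p))

def mixed(levels, work, p, alpha):
    k = int(math.log2(p))
    t = 0.0
    for lvl in range(levels + 1):
        if lvl <= k:                              # 2^lvl groups of p/2^lvl procs
            g = p >> lvl
            t += work / g + alpha * math.log2(g)
        else:                                     # one proc per subtree: no comm
            t += (2 ** (lvl - k)) * work
    return t

# With alpha = 0 the two are equal; with slow communication (large alpha)
# or large p, mixed() wins by a wide margin.
print(pure_data_parallel(10, 1.0, 64, 0.5), mixed(10, 1.0, 64, 0.5))
```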


Programming Language Design and Implementation | 1996

Global communication analysis and optimization

Soumen Chakrabarti; Manish Gupta; Jong-Deok Choi

Reducing communication cost is crucial to achieving good performance on scalable parallel machines. This paper presents a new compiler algorithm for global analysis and optimization of communication in data-parallel programs. Our algorithm is distinct from existing approaches in that rather than handling loop-nests and array references one by one, it considers all communication in a procedure and their interactions under different placements before making a final decision on the placement of any communication. It exploits the flexibility resulting from this advanced analysis to eliminate redundancy, reduce the number of messages, and reduce contention for cache and communication buffers, all in a unified framework. In contrast, single loop-nest analysis often retains redundant communication, and more aggressive dataflow analysis on array sections can generate too many messages or cache and buffer contention. The algorithm has been implemented in the IBM pHPF compiler for High Performance Fortran. During compilation, the number of messages per processor goes down by as much as a factor of nine for some HPF programs. We present performance results for the IBM SP2 and a network of Sparc workstations (NOW) connected by a Myrinet switch. In many cases, the communication cost is reduced by a factor of two.
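
As a conceptual sketch only (this is not the pHPF algorithm), the snippet below mimics the two effects described in the abstract: eliminating redundant transfers and coalescing the remaining messages between the same processor pair into fewer, larger ones. The data layout and function name are invented for illustration.

```python
# Given every communication a procedure would issue, as (src, dst, section)
# tuples, drop exact duplicates and batch per processor pair.
from collections import defaultdict

def coalesce(comms):
    unique = set(comms)                          # redundancy elimination
    by_pair = defaultdict(list)                  # message aggregation
    for src, dst, section in unique:
        by_pair[(src, dst)].append(section)
    return {pair: sorted(secs) for pair, secs in by_pair.items()}

comms = [(0, 1, "A[0:10]"), (0, 1, "A[0:10]"),   # a redundant transfer
         (0, 1, "B[5:9]"), (2, 3, "A[10:20]")]
print(coalesce(comms))                           # 4 raw messages become 2
```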


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 1993

Implementing an irregular application on a distributed memory multiprocessor

Soumen Chakrabarti; Katherine A. Yelick

Parallelism with irregular patterns of data, communication and computation is hard to manage efficiently. In this paper we present a case study of the Gröbner basis problem, a symbolic algebra application. We developed an efficient parallel implementation using the following techniques. First, a sequential algorithm was rewritten in a transition axiom style, in which computation proceeds by non-deterministic invocations of guarded statements at multiple processors. Next, the algebraic properties of the problem were studied to modify the algorithm to ensure correctness in spite of locally inconsistent views of the shared data structures. This insight was used to design data structures with very little overhead for maintaining consistency. Finally, an application-specific scheduler was designed and tuned to get good performance. Our distributed memory implementation achieves impressive speedups.
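
A toy illustration (not the authors' code) of the transition-axiom style: computation proceeds by repeatedly firing any enabled guarded statement, in nondeterministic order, until no guard holds. Here the "program" is a stand-in example that computes a gcd purely through guarded updates.

```python
# Run guarded commands nondeterministically until no guard is enabled.
import random

def run_guarded(state, rules, rng=random.Random(0)):
    while True:
        enabled = [act for guard, act in rules if guard(state)]
        if not enabled:
            return state
        rng.choice(enabled)(state)        # nondeterministic choice of action

state = {"x": 84, "y": 30}
rules = [
    (lambda s: s["x"] > s["y"], lambda s: s.update(x=s["x"] - s["y"])),
    (lambda s: s["y"] > s["x"], lambda s: s.update(y=s["y"] - s["x"])),
]
print(run_guarded(state, rules))          # {'x': 6, 'y': 6}: gcd is 6
```

The order in which commands fire does not affect the final answer, which is the property that makes the style attractive for reasoning about parallel executions with inconsistent local views.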


ACM Symposium on Parallel Algorithms and Architectures | 1996

Resource scheduling for parallel database and scientific applications

Soumen Chakrabarti; S. Muthukrishnan

We initiate a study of resource scheduling problems in parallel database and scientific applications, and formulate a problem based on this study. In our formulation, jobs specify their running times and the amounts of a fixed number of other resources (such as memory and I/O) they need. The resource-time trade-off may be fundamentally different for different resource types. The processor resource is malleable, meaning we can trade processors for time gracefully. Other resources may not be malleable; one way to model them is to assume no malleability: the entire requirement of those resources has to be reserved for a job to begin execution, and no smaller quantity is acceptable. The jobs also have precedences amongst them; in our applications, the precedence structure may be restricted to being a collection of trees or series-parallel graphs. Not much is known about considering precedence and non-malleable resource constraints together. For many other problems, it has been possible to find schedules whose length matches, to within a constant factor, the sum of two obvious lower bounds: the total resource-time product of the jobs, denoted V, and the critical path in the precedence graph, denoted Π. We show that there are instances in our model where the optimal makespan is Ω(V + Π log T). Here T is the ratio between the longest and shortest job execution times, where typically T << n, the number of jobs. We then give a polynomial time makespan algorithm that produces a schedule of length O(V + Π log T), which is therefore an O(log T) approximation. This contrasts with most existing solutions for this problem, which are greedy, list-based strategies; these fail under heavy load, and that is provably unavoidable, since theoretical results have established various adversaries that force Ω(T) or Ω(n) approximations. The makespan algorithm can be extended to minimize the weighted average completion time over all jobs with the same approximation factor of O(log T).
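
A worked toy example (with hypothetical numbers) of the two lower bounds named in the abstract: V, the total resource-time "area" divided by capacity, and Π, the critical path of the precedence graph. Any schedule needs at least max(V, Π) time.

```python
# Compute V, Pi, and T for a tiny made-up instance.
from functools import lru_cache

jobs = {"a": (4, 2), "b": (1, 3), "c": (2, 1), "d": (3, 2)}  # (time, demand)
edges = [("a", "c"), ("b", "c"), ("c", "d")]                 # a before c, ...
capacity = 4                                                 # resource units

V = sum(t * r for t, r in jobs.values()) / capacity          # area bound: 4.75

@lru_cache(maxsize=None)
def critical_path(j):
    succ = [b for a, b in edges if a == j]
    return jobs[j][0] + max((critical_path(b) for b in succ), default=0)

Pi = max(critical_path(j) for j in jobs)                     # a->c->d = 9
T = max(t for t, _ in jobs.values()) / min(t for t, _ in jobs.values())
print(V, Pi, T)   # the paper's algorithm achieves O(V + Pi * log T)
```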


IEEE International Conference on High Performance Computing, Data, and Analytics | 1994

Randomized load balancing for tree-structured computation

Soumen Chakrabarti; Abhiram G. Ranade; Katherine A. Yelick

Studies the performance of a randomized algorithm for balancing load across a multiprocessor executing a dynamic, irregular task tree. Specifically, we show that the time taken to explore a task tree is likely to be within a small constant factor of an inherent lower bound for the tree instance. Our model permits arbitrary task times and overlap between computation and load balancing, and thus extends earlier work (R.M. Karp and Y. Zhang, 1988) which assumed fixed-cost tasks and used a bulk synchronous style in which the system alternated between distinct computing and load balancing steps. Our analysis is supported by experiments with application codes, demonstrating that the efficiency is high enough to make this method practical.
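
A schematic simulation in the same spirit, though not the paper's exact protocol: processors expand a dynamic task tree, and an idle processor asks one random peer for work each step, so balancing overlaps with computation. All names and the step-per-message accounting are our own simplifications.

```python
# Simulate randomized work stealing over a task tree given as a dict
# mapping each node to its list of children.
import random

def simulate(tree_children, n_procs, seed=0):
    rng = random.Random(seed)
    queues = [[] for _ in range(n_procs)]
    queues[0].append("root")
    remaining = 1 + sum(len(c) for c in tree_children.values())
    steps = 0
    while remaining:
        steps += 1
        for p in range(n_procs):
            if queues[p]:
                node = queues[p].pop()                 # one unit of work
                queues[p].extend(tree_children.get(node, []))
                remaining -= 1
            else:
                victim = rng.randrange(n_procs)        # ask a random peer
                if queues[victim]:
                    queues[p].append(queues[victim].pop(0))
    return steps

print(simulate({"root": ["a", "b"], "a": ["c", "d"]}, n_procs=2))
```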


Journal of Parallel and Distributed Computing | 1997

Models and Scheduling Algorithms for Mixed Data and Task Parallel Programs

Soumen Chakrabarti; James Demmel; Katherine A. Yelick

An increasing number of scientific programs exhibit two forms of parallelism, often in a nested fashion. At the outer level, the application comprises coarse-grained task parallelism, with dependencies between tasks reflected by an acyclic graph. At the inner level, each node of the graph is a data-parallel operation on arrays. Designers of languages, compilers, and runtime systems are building mechanisms to support such applications by providing processor groups and array remapping capabilities. In this paper we explore how to supplement these mechanisms with policy. What properties of an application, its data size, and the parallel machine determine the maximum potential gains from using both kinds of parallelism? It turns out that large gains can be expected only for specific task graph structures. For such applications, what are practical and effective ways to allocate processors to the nodes of the task graph? In principle one could solve the NP-complete problem of finding the best possible allocation of arbitrary processor subsets to nodes in the task graph. Instead, our analysis and simulations show that a simple switched scheduling paradigm, which alternates between pure task and pure data parallelism, provides nearly optimal performance for the task graphs considered here. Furthermore, our scheme is much simpler to implement, has less overhead than the optimal allocation, and would be attractive even if the optimal allocation were free to compute. To evaluate switching in real applications, we implemented a switching task scheduler in the parallel numerical library ScaLAPACK and used it in a nonsymmetric eigenvalue program. Even for fairly large input sizes, the efficiency improves by factors of 1.5 on the Intel Paragon and 2.5 on the IBM SP-2. The remapping and scheduling overhead is negligible, between 0.5 and 5%.
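
In the spirit of the switched paradigm the abstract describes, here is a sketch under our own assumptions (balanced binary tree, uniform node cost, a hypothetical alpha * log2(g) communication term): run the top of the tree data-parallel on all processors, then switch once to pure task parallelism, each processor running whole subtrees sequentially. The function names are invented.

```python
# Evaluate a one-time switch at tree level s and search for the best s.
import math

def switched_time(levels, work, p, s, alpha):
    # levels 0..s-1: each node data-parallel on all p processors
    top = sum(2 ** l for l in range(s)) * (work / p + alpha * math.log2(p))
    # from level s down: 2^s independent subtrees, run sequentially
    subtree = (2 ** (levels - s + 1) - 1) * work
    bottom = math.ceil(2 ** s / p) * subtree
    return top + bottom

def best_switch(levels, work, p, alpha):
    return min(range(levels + 1),
               key=lambda s: switched_time(levels, work, p, s, alpha))

print(best_switch(10, 1.0, 64, 0.5))   # the single crossover level to use
```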


Archive | 1996

Runtime Support for Portable Distributed Data Structures

Chih-Po Wen; Soumen Chakrabarti; Etienne Deprit; Arvind Krishnamurthy; Katherine A. Yelick

Multipol is a library of distributed data structures designed for irregular applications, including those with asynchronous communication patterns. In this paper, we describe the Multipol runtime layer, which provides an efficient and portable abstraction underlying the data structures. It contains a thread system to express computations with varying degrees of parallelism and to support multiple threads per processor for hiding communication latency. To simplify programming in a multithreaded environment, Multipol threads are small, finite-length computations that are executed atomically. Rather than enforcing a single scheduling policy on threads, users may write their own schedulers or choose one of the schedulers provided by Multipol. The system is designed for distributed memory architectures and performs communication optimizations such as message aggregation to improve efficiency on machines with high communication startup overhead. The runtime system currently runs on the Thinking Machines CM5, Intel Paragon, and IBM SP1, and is being ported to a network of workstations. Multipol applications include an event-driven timing simulator [1], an eigenvalue solver [2], and a program that solves the phylogeny problem [3].
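
Illustrative only (this is not Multipol's API): message aggregation batches many small messages bound for the same destination into one physical send, amortizing a high per-message startup cost. The class and its transport callback are hypothetical.

```python
# Buffer outgoing messages per destination; flush when a batch fills.
from collections import defaultdict

class AggregatingChannel:
    def __init__(self, transport, max_batch=64):
        self.transport = transport          # callable: transport(dest, batch)
        self.max_batch = max_batch
        self.buffers = defaultdict(list)

    def send(self, dest, msg):
        buf = self.buffers[dest]
        buf.append(msg)
        if len(buf) >= self.max_batch:      # amortize startup over the batch
            self.flush(dest)

    def flush(self, dest=None):
        targets = [dest] if dest is not None else list(self.buffers)
        for d in targets:
            if self.buffers[d]:
                self.transport(d, self.buffers.pop(d))

chan = AggregatingChannel(lambda d, batch: print(d, batch), max_batch=3)
for i in range(7):
    chan.send(1, i)     # two physical sends of 3 messages each
chan.flush()            # drain the remainder
```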


Higher-Order and Symbolic Computation / Lisp and Symbolic Computation | 1994

Distributed data structures and algorithms for Gröbner basis computation

Soumen Chakrabarti; Katherine A. Yelick

We present the design and implementation of a parallel algorithm for computing Gröbner bases on distributed memory multiprocessors. The parallel algorithm is irregular both in space and time: the data structures are dynamic pointer-based structures, and the computations on the structures have unpredictable duration. The algorithm is presented as a series of refinements on a transition rule program, in which computation proceeds by nondeterministic invocations of guarded commands. Two key data structures, a set and a priority queue, are distributed across processors in the parallel algorithm. The data structures are designed for high throughput and latency tolerance, as appropriate for distributed memory machines. The programming style represents a compromise between shared-memory and message-passing models. The distributed nature of the data structures shows through their interface in that the semantics are weaker than those of shared atomic objects, but they still provide a shared abstraction that can be used for reasoning about program correctness. In the data structure design there is a classic trade-off between locality and load balance. We argue that this is best solved by designing scheduling structures in tandem with the state data structures, since the decision to replicate or partition state affects the overhead of dynamically moving tasks.
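
A conceptual sketch (not the paper's implementation) of the kind of weakened semantics the abstract describes: a relaxed distributed priority queue keeps one local heap per processor, so remove_min returns a local minimum that may not be the global one, trading strict ordering for locality. All names are invented.

```python
# One heap per processor; inserts are scattered randomly for load balance.
import heapq
import random

class RelaxedPriorityQueue:
    def __init__(self, n_procs, seed=0):
        self.heaps = [[] for _ in range(n_procs)]
        self.rng = random.Random(seed)

    def insert(self, item, priority):
        # scatter inserts to keep the heaps roughly balanced
        target = self.rng.randrange(len(self.heaps))
        heapq.heappush(self.heaps[target], (priority, item))

    def remove_min(self, proc):
        # each processor serves its own heap: cheap, but only locally minimal
        return heapq.heappop(self.heaps[proc]) if self.heaps[proc] else None
```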

Collaboration


Dive into Soumen Chakrabarti's collaborations.

Top Co-Authors

Katherine A. Yelick, Lawrence Berkeley National Laboratory
Etienne Deprit, University of California
Chih-Po Wen, University of California
James Demmel, University of California
Andreas S. Schulz, Massachusetts Institute of Technology
Cynthia A. Phillips, Sandia National Laboratories