Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Guy E. Blelloch is active.

Publication


Featured research published by Guy E. Blelloch.


IEEE Transactions on Computers | 1989

Scans as primitive parallel operations

Guy E. Blelloch

A study of the effects of adding two scan primitives as unit-time primitives to PRAM (parallel random access machine) models is presented. It is shown that the primitives improve the asymptotic running time of many algorithms by an O(log n) factor, greatly simplify the description of many algorithms, and are significantly easier to implement than memory references. It is argued that the algorithm designer should feel free to use these operations as if they were as cheap as a memory reference. The author describes five algorithms that clearly illustrate how the scan primitives can be used in algorithm design: a radix-sort algorithm, a quicksort algorithm, a minimum-spanning-tree algorithm, a line-drawing algorithm, and a merging algorithm. These all run on an EREW (exclusive read, exclusive write) PRAM with the addition of the two scan primitives and are either simpler or more efficient than their pure PRAM counterparts. The scan primitives have been implemented in microcode on the Connection Machine system and are available in PARIS (the parallel instruction set of the machine).
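
To make the role of the scans concrete, here is a minimal Python sketch (a sequential simulation, not the CM-2 microcode) of an exclusive plus-scan and of the "split" operation that the paper builds from two such scans to drive its radix sort; all names are illustrative.

    # Sequential sketch of a scan primitive and scan-based radix sort.
    # On a PRAM with scan primitives, each plus_scan call would be a
    # single unit-time parallel operation.

    def plus_scan(xs):
        """Exclusive prefix sum: out[i] = xs[0] + ... + xs[i-1]."""
        out, total = [], 0
        for x in xs:
            out.append(total)
            total += x
        return out

    def split(values, flags):
        """Stably move elements flagged 0 before elements flagged 1,
        using two plus-scans to compute each element's destination."""
        down = plus_scan([1 - f for f in flags])      # slots for the 0s
        n_zeros = sum(1 - f for f in flags)
        up = [n_zeros + i for i in plus_scan(flags)]  # slots for the 1s
        out = [None] * len(values)
        for v, f, d, u in zip(values, flags, down, up):
            out[u if f else d] = v
        return out

    def radix_sort(values, bits=8):
        """Split on each bit, least significant first."""
        for b in range(bits):
            values = split(values, [(v >> b) & 1 for v in values])
        return values

    print(radix_sort([5, 7, 3, 1, 4, 2, 7, 2]))  # [1, 2, 2, 3, 4, 5, 7, 7]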


Communications of the ACM | 1996

Programming parallel algorithms

Guy E. Blelloch

In the past 20 years there has been tremendous progress in developing and analyzing parallel algorithms. Researchers have developed efficient parallel algorithms to solve most problems for which efficient sequential solutions are known. Although some of these algorithms are efficient only in a theoretical framework, many are quite efficient in practice or have key ideas that have been used in efficient implementations. This research on parallel algorithms has not only improved our general understanding of parallelism but in several cases has led to improvements in sequential algorithms. Unfortunately there has been less success in developing good languages for programming parallel algorithms, particularly languages that are well suited for teaching and prototyping algorithms. There has been a large gap between languages that are too low level, requiring specification of many details that obscure the meaning of the algorithm, and languages that are too high level, making the performance implications of various constructs unclear. In sequential computing many standard languages such as C or Pascal do a reasonable job of bridging this gap, but in parallel languages building such a bridge has been significantly more difficult.
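
The gap the article describes is what NESL was designed to bridge. As a flavor of the nested data parallelism involved, here is a hedged Python sketch of the quicksort the article expresses in NESL, with list comprehensions standing in for NESL's parallel apply-to-each construct.

    # Nested data-parallel quicksort, in the style the article writes in
    # NESL. In NESL the three comprehensions below and the two recursive
    # calls all run in parallel, giving O(n log n) work and O(log^2 n)
    # expected depth; this Python version only shows the structure.

    def quicksort(xs):
        if len(xs) <= 1:
            return xs
        pivot = xs[len(xs) // 2]
        less    = [x for x in xs if x < pivot]    # parallel in NESL
        equal   = [x for x in xs if x == pivot]
        greater = [x for x in xs if x > pivot]
        # NESL would also evaluate both recursive calls in parallel:
        # {quicksort(v) : v in [less, greater]}
        return quicksort(less) + equal + quicksort(greater)

    print(quicksort([4, 9, 1, 7, 3, 8, 2]))  # [1, 2, 3, 4, 7, 8, 9]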


ACM Symposium on Parallel Algorithms and Architectures | 1991

A comparison of sorting algorithms for the connection machine CM-2

Guy E. Blelloch; Charles E. Leiserson; Bruce M. Maggs; C. Greg Plaxton; Stephen J. Smith; Marco Zagha

Sorting is arguably the most studied problem in computer science, both because it is used as a substep in many applications and because it is a simple, combinatorial problem with many interesting and diverse solutions. Sorting is also an important benchmark for parallel supercomputers. It requires significant communication bandwidth among processors, unlike many other supercomputer benchmarks, and the most efficient sorting algorithms communicate data in irregular patterns. Parallel algorithms for sorting have been studied since at least the 1960s. An early advance in parallel sorting came in 1968 when Batcher discovered the elegant Θ(lg² n)-depth bitonic sorting network [3]. For certain families of fixed interconnection networks, such as the hypercube and shuffle-exchange, Batcher's bitonic sorting technique provides a parallel algorithm for sorting n numbers in Θ(lg² n) time with n processors. The question of the existence of a o(lg² n)-depth sorting network remained open until 1983, when Ajtai, Komlos, and Szemeredi [1] provided an optimal Θ(lg n)-depth sorting network, but unfortunately, their construction leads to larger networks than those given by bitonic sort for all "practical" values of n. Leighton [15] has shown that any Θ(lg n)-depth family of sorting networks can be used to sort n numbers in Θ(lg n) time in the bounded-degree fixed interconnection network domain. Not surprisingly, the optimal Θ(lg n)-time fixed interconnection sorting networks implied by the AKS construction are also impractical. In 1983, Reif and Valiant proposed a more practical O(lg n)-time randomized algorithm for sorting [19], called flashsort. Many other parallel sorting algorithms have been proposed in the literature, including parallel versions of radix sort and quicksort [5], a variant of quicksort called hyperquicksort [23], smoothsort [18], column sort [15], Nassimi and Sahni's sort [17], and parallel merge sort [6]. This paper reports the findings of a project undertaken at Thinking Machines Corporation to develop a fast sorting algorithm for the Connection Machine supercomputer model CM-2. The primary goals of this project were:
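
For reference, here is a sequential Python simulation of Batcher's bitonic sorting network mentioned above; each pass of the inner loop corresponds to one depth level of compare-exchange operations that a parallel machine would run simultaneously. This illustrates the network itself, not the CM-2 algorithms the paper develops.

    # Batcher's bitonic sorting network, Theta(lg^2 n) depth. The inner
    # (j) loop is one parallel stage: all compare-exchanges in it are
    # independent. n must be a power of two.

    def bitonic_sort(a):
        n = len(a)
        assert n & (n - 1) == 0, "length must be a power of two"
        k = 2
        while k <= n:              # size of bitonic sequences being merged
            j = k // 2
            while j >= 1:          # one parallel stage of compare-exchanges
                for i in range(n):
                    partner = i ^ j
                    if partner > i:
                        ascending = (i & k) == 0
                        if (a[i] > a[partner]) == ascending:
                            a[i], a[partner] = a[partner], a[i]
                j //= 2
            k *= 2
        return a

    print(bitonic_sort([7, 3, 6, 1, 8, 2, 5, 4]))  # [1, 2, 3, 4, 5, 6, 7, 8]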


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2013

Ligra: a lightweight graph processing framework for shared memory

Julian Shun; Guy E. Blelloch

There has been significant recent interest in parallel frameworks for processing graphs due to their applicability in studying social networks, the Web graph, networks in biology, and unstructured meshes in scientific simulation. Due to the desire to process large graphs, these systems have emphasized the ability to run on distributed memory machines. Today, however, a single multicore server can support more than a terabyte of memory, which can fit graphs with tens or even hundreds of billions of edges. Furthermore, for graph algorithms, shared-memory multicores are generally significantly more efficient on a per core, per dollar, and per joule basis than distributed memory systems, and shared-memory algorithms tend to be simpler than their distributed counterparts. In this paper, we present a lightweight graph processing framework specific to shared-memory parallel/multicore machines, which makes graph traversal algorithms easy to write. The framework has two very simple routines, one for mapping over edges and one for mapping over vertices. Our routines can be applied to any subset of the vertices, which makes the framework useful for many graph traversal algorithms that operate on subsets of the vertices. Based on recent ideas used in a very fast algorithm for breadth-first search (BFS), our routines automatically adapt to the density of vertex sets. We implement several algorithms in this framework, including BFS, graph radii estimation, graph connectivity, betweenness centrality, PageRank, and single-source shortest paths. Our algorithms expressed using this framework are very simple and concise, and perform almost as well as highly optimized code. Furthermore, they get good speedups on a 40-core machine and are significantly more efficient than previously reported results using graph frameworks on machines with many more cores.
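
A minimal sketch of the interface described here, assuming nothing beyond the abstract: a sequential Python edgeMap and a BFS built on it. The real Ligra runs the maps in parallel (with atomic updates) and switches between sparse and dense traversal based on frontier size; none of that is modeled below.

    # Ligra-style edgeMap and a BFS written against it.

    def edge_map(graph, frontier, update, cond):
        """Apply update to edges (u, v) with u in the frontier and cond(v)
        true; return the set of targets for which update succeeded."""
        out = set()
        for u in frontier:
            for v in graph[u]:
                if cond(v) and update(u, v):
                    out.add(v)
        return out

    def bfs(graph, root):
        parent = {root: root}
        frontier = {root}
        while frontier:
            frontier = edge_map(
                graph, frontier,
                # setdefault stands in for the compare-and-swap Ligra uses
                update=lambda u, v: parent.setdefault(v, u) == u,
                cond=lambda v: v not in parent)
        return parent

    g = {0: [1, 2], 1: [3], 2: [3], 3: []}
    print(bfs(g, 0))  # e.g. {0: 0, 1: 0, 2: 0, 3: 1}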


Theory of Computing Systems / Mathematical Systems Theory | 2002

The Data Locality of Work Stealing

Umut A. Acar; Guy E. Blelloch; Robert D. Blumofe

This paper studies the data locality of the work-stealing scheduling algorithm on hardware-controlled shared-memory machines, where movement of data to and from the cache is solely controlled by the hardware. We present lower and upper bounds on the number of cache misses when using work stealing, and introduce a locality-guided work-stealing algorithm and its experimental validation. As a lower bound, we show that a work-stealing application that exhibits good data locality on a uniprocessor may exhibit poor data locality on a multiprocessor. In particular, we show a family of multithreaded computations Gn whose members perform Θ(n) operations (work) and incur a constant number of cache misses on a uniprocessor, while even on two processors the total number of cache misses soars to Ω(n). On the other hand, we show a tight upper bound on the number of cache misses that nested-parallel computations, a large, important class of computations, incur due to multiprocessing. In particular, for nested-parallel computations, we show that on P processors a multiprocessor execution incurs an expected O(C⌈m/s⌉PT∞) more misses than the uniprocessor execution. Here m is the execution time of an instruction incurring a cache miss, s is the steal time, C is the size of the cache, and T∞ is the number of nodes on the longest chain of dependencies. Based on this we give strong execution time bounds for nested-parallel computations using work stealing. For the second part of our results, we present a locality-guided work-stealing algorithm that improves the data locality of multithreaded computations by allowing a thread to have an affinity for a processor. Our initial experiments on iterative data-parallel applications show that the algorithm matches the performance of static partitioning under traditional workloads but improves the performance up to 50% over static partitioning under multiprogrammed workloads. Furthermore, locality-guided work stealing improves the performance of work stealing up to 80%.
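
As a rough illustration of the locality-guided idea (affinities and per-processor mailboxes are as described in the paper; every other detail here is invented), the following sequential Python simulation has each worker prefer tasks whose affinity names it, falling back to its own deque and then to stealing.

    # Sequential simulation of locality-guided work stealing.
    import collections, random

    def take(p, mailbox, deques, done, rng, n_procs):
        while mailbox[p]:                 # 1. affinity mailbox first
            t = mailbox[p].popleft()
            if t[0] not in done:
                return t
        while deques[p]:                  # 2. own deque, newest first
            t = deques[p].pop()
            if t[0] not in done:
                return t
        victim = rng.randrange(n_procs)   # 3. steal oldest from a victim
        while deques[victim]:
            t = deques[victim].popleft()
            if t[0] not in done:
                return t
        return None

    def run(tasks, n_procs, seed=0):
        """tasks: list of (task_id, affinity); returns the fraction of
        tasks that ran on the processor whose cache holds their data."""
        rng = random.Random(seed)
        mailbox = [collections.deque() for _ in range(n_procs)]
        deques = [collections.deque() for _ in range(n_procs)]
        for i, t in enumerate(tasks):
            deques[i % n_procs].append(t)   # initial spread of work
            mailbox[t[1]].append(t)         # also posted to affinity owner
        done, hits = set(), 0
        while len(done) < len(tasks):
            for p in range(n_procs):        # one round-robin "time step"
                t = take(p, mailbox, deques, done, rng, n_procs)
                if t is not None:
                    done.add(t[0])
                    hits += (t[1] == p)     # ran where its data lives?
        return hits / len(tasks)

    rng = random.Random(1)
    tasks = [(i, rng.randrange(4)) for i in range(64)]
    print(f"affinity hit rate: {run(tasks, 4):.0%}")  # most tasks hit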


Journal of Parallel and Distributed Computing | 1994

Implementation of a portable nested data-parallel language

Guy E. Blelloch; Jonathan C. Hardwick; Jay Sipelstein; Marco Zagha; Siddhartha Chatterjee

This paper gives an overview of the implementation of NESL, a portable nested data-parallel language. This language and its implementation are the first to fully support nested data structures as well as nested data-parallel function calls. These features allow the concise description of parallel algorithms on irregular data, such as sparse matrices and graphs. In addition, they maintain the advantages of data-parallel languages: a simple programming model and portability. The current NESL implementation is based on an intermediate language called VCODE and a library of vector routines called CVL. It runs on the Connection Machines CM-2 and CM-5, the Cray Y-MP C90, and serial workstations. We compare initial benchmark results of NESL with those of machine-specific code on these machines for three algorithms: least-squares line-fitting, median finding, and a sparse-matrix vector product. These results show that NESL's performance is competitive with that of machine-specific codes for regular dense data, and is often superior for irregular data.
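
Much of the flattening that makes NESL portable reduces to segmented vector operations in CVL. Here is a hedged Python sketch of a representative primitive, the segmented plus-scan: one flat pass scans every segment of an irregular nested sequence at once.

    # Segmented plus-scan: the flat-vector primitive behind nested data
    # parallelism. A flag vector marks where each segment starts, so a
    # single vector operation scans all segments of a nested structure.

    def segmented_plus_scan(values, seg_starts):
        """Exclusive plus-scan, restarted at each flagged segment start."""
        out, run = [], 0
        for v, start in zip(values, seg_starts):
            if start:
                run = 0
            out.append(run)
            run += v
        return out

    # the nested sequence [[3, 1], [4, 1, 5], [9]] flattened:
    values = [3, 1, 4, 1, 5, 9]
    flags  = [1, 0, 1, 0, 0, 1]
    print(segmented_plus_scan(values, flags))  # [0, 3, 0, 4, 5, 0]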


Programming Language Design and Implementation | 2001

A parallel, real-time garbage collector

Perry Cheng; Guy E. Blelloch

We describe a parallel, real-time garbage collector and present experimental results that demonstrate good scalability and good real-time bounds. The collector is designed for shared-memory multiprocessors and is based on an earlier collector algorithm [2], which provided fixed bounds on the time any thread must pause for collection. However, since our earlier algorithm was designed for simple analysis, it had some impractical features. This paper presents the extensions necessary for a practical implementation: reducing excessive interleaving, handling stacks and global variables, reducing double allocation, and special treatment of large and small objects. An implementation based on the modified algorithm is evaluated on a set of 15 SML benchmarks on a Sun Enterprise 10000, a 64-way UltraSparc-II multiprocessor. To the best of our knowledge, this is the first implementation of a parallel, real-time garbage collector. The average collector speedup is 7.5 at 8 processors and 17.7 at 32 processors. Maximum pause times range from 3 ms to 5 ms. In contrast, a non-incremental collector (whether generational or not) has maximum pause times from 10 ms to 650 ms. Compared to a non-parallel, stop-copy collector, parallelism has a 39% overhead, while real-time behavior adds an additional 12% overhead. Since the collector takes about 15% of total execution time, these features have overall time costs of 6% and 2%, respectively.
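
As a toy illustration of how pause times can be bounded at all (the real collector is parallel and far more involved; this sketch is sequential and invented purely for exposition), here is a Cheney-style copying collector that does at most a fixed budget of work per increment, so no single pause exceeds the budget.

    # Incremental semispace copying with a per-increment work budget.

    class Heap:
        """Toy heap: objects[i] is the list of ids that object i points to."""
        def __init__(self, objects, roots):
            self.objects = objects
            self.forward = {}     # from-space id -> to-space index
            self.to_space = []    # copied objects, children not yet fixed
            self.scan = 0         # Cheney scan pointer
            for r in roots:
                self._copy(r)

        def _copy(self, old):
            if old not in self.forward:
                self.forward[old] = len(self.to_space)
                self.to_space.append(list(self.objects[old]))
            return self.forward[old]

        def collect_increment(self, budget):
            """Scan at most `budget` objects, copying their children;
            return True once collection is complete. A real collector
            counts copied words, not objects."""
            work = 0
            while self.scan < len(self.to_space) and work < budget:
                children = self.to_space[self.scan]
                self.to_space[self.scan] = [self._copy(c) for c in children]
                self.scan += 1
                work += 1
            return self.scan == len(self.to_space)

    # object 4 is unreachable garbage; the other four objects survive
    h = Heap({0: [1, 2], 1: [3], 2: [], 3: [], 4: [2]}, roots=[0])
    pauses = 1
    while not h.collect_increment(budget=2):
        pauses += 1   # the mutator would run between these short pauses
    print(len(h.to_space), "objects copied in", pauses, "bounded pauses")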


ACM Symposium on Parallel Algorithms and Architectures | 2007

Scheduling threads for constructive cache sharing on CMPs

Shimin Chen; Phillip B. Gibbons; Michael Kozuch; Vasileios Liaskovitis; Anastassia Ailamaki; Guy E. Blelloch; Babak Falsafi; Limor Fix; Nikos Hardavellas; Todd C. Mowry; Chris Wilkerson

In chip multiprocessors (CMPs), limiting the number of off-chip cache misses is crucial for good performance. Many multithreaded programs provide opportunities for constructive cache sharing, in which concurrently scheduled threads share a largely overlapping working set. In this paper, we compare the performance of two state-of-the-art schedulers proposed for fine-grained multithreaded programs: Parallel Depth First (PDF), which is specifically designed for constructive cache sharing, and Work Stealing (WS), which is a more traditional design. Our experimental results indicate that PDF scheduling yields a 1.3-1.6X performance improvement relative to WS for several fine-grain parallel benchmarks on projected future CMP configurations; we also report several issues that may limit the advantage of PDF in certain applications. These results also indicate that PDF more effectively utilizes off-chip bandwidth, making it possible to trade off on-chip cache for a larger number of cores. Moreover, we find that task granularity plays a key role in cache performance. Therefore, we present an automatic approach for selecting effective grain sizes, based on a new working set profiling algorithm that is an order of magnitude faster than previous approaches. This is the first paper demonstrating the effectiveness of PDF on real benchmarks, providing a direct comparison between PDF and WS, revealing the limiting factors for PDF in practice, and presenting an approach for overcoming these factors.
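
To illustrate the difference in scheduling order (no cache is modeled; the task graph and parameters are invented), the following Python simulation runs a complete binary task tree under a PDF-style scheduler and under work stealing, and reports which tasks get co-scheduled, measured by their rank in the sequential depth-first order. Under PDF the co-scheduled ranks tend to stay close together, which is what produces constructive cache sharing; under work stealing each core drifts into its own distant subtree.

    import collections, heapq, random

    def df_order(n):
        """Preorder (sequential depth-first) rank of each node in a
        heap-indexed complete binary tree with nodes 1..n."""
        order, stack, t = {}, [1], 0
        while stack:
            v = stack.pop()
            order[v] = t
            t += 1
            for c in (2 * v + 1, 2 * v):  # push right first, visit left first
                if c <= n:
                    stack.append(c)
        return order

    def pdf_schedule(n, p):
        """Each step, run the p ready tasks earliest in depth-first order."""
        rank = df_order(n)
        ready, steps = [(rank[1], 1)], []
        while ready:
            running = [heapq.heappop(ready)[1]
                       for _ in range(min(p, len(ready)))]
            steps.append(sorted(rank[v] for v in running))
            for v in running:
                for c in (2 * v, 2 * v + 1):
                    if c <= n:
                        heapq.heappush(ready, (rank[c], c))
        return steps

    def ws_schedule(n, p, seed=0):
        """Each step, every core pops its newest task or steals an old one."""
        rng = random.Random(seed)
        rank = df_order(n)
        deques = ([collections.deque([1])] +
                  [collections.deque() for _ in range(p - 1)])
        steps = []
        while any(deques):
            running = []
            for q in range(p):
                if deques[q]:
                    running.append((q, deques[q].pop()))
                else:
                    victims = [v for v in range(p) if deques[v]]
                    if victims:
                        running.append((q, deques[rng.choice(victims)].popleft()))
            steps.append(sorted(rank[v] for _, v in running))
            for q, v in running:
                for c in (2 * v, 2 * v + 1):
                    if c <= n:
                        deques[q].append(c)
        return steps

    for name, steps in [("PDF", pdf_schedule(63, 4)), ("WS", ws_schedule(63, 4))]:
        print(name, "co-scheduled tasks' depth-first ranks, steps 3-6:")
        for s in steps[2:6]:
            print("  ", s)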


ACM Transactions on Programming Languages and Systems | 2006

Adaptive functional programming

Umut A. Acar; Guy E. Blelloch; Robert Harper

We present techniques for incremental computing by introducing adaptive functional programming. As an adaptive program executes, the underlying system represents the data and control dependences in the execution in the form of a dynamic dependence graph. When the input to the program changes, a change propagation algorithm updates the output and the dynamic dependence graph by propagating changes through the graph and re-executing code where necessary. Adaptive programs adapt their output to any change in the input, small or large. We show that adaptivity techniques are practical by giving an efficient implementation as a small ML library. The library consists of three operations for making a program adaptive, plus two operations for making changes to the input and adapting the output to these changes. We give a general bound on the time it takes to adapt the output, and based on this, show that an adaptive Quicksort adapts its output in logarithmic time when its input is extended by one key. To show the safety and correctness of the mechanism we give a formal definition of AFL, a call-by-value functional language extended with adaptivity primitives. The modal type system of AFL enforces correct usage of the adaptivity mechanism, which can only be checked at run time in the ML library. Based on the AFL dynamic semantics, we formalize the change-propagation algorithm and prove its correctness.
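
Here is a toy Python sketch of the programming interface described above: modifiable references, reads that record dependences, and a write that propagates changes by re-running affected readers. The real library orders re-execution by time stamps and removes stale dependences; this sketch (which leaks them) only shows the flavor.

    # Minimal modifiables with change propagation.

    class Mod:
        def __init__(self, value):
            self.value, self.readers = value, []

        def read(self, fn):
            self.readers.append(fn)   # record the dependence edge
            fn(self.value)            # run the reader's continuation now

        def write(self, value):
            if value != self.value:
                self.value = value
                for fn in list(self.readers):
                    fn(value)         # change propagation: re-run readers

    # an "adaptive" computation: total = a + b, rebuilt on any change
    a, b, total = Mod(1), Mod(2), Mod(0)
    a.read(lambda av: b.read(lambda bv: total.write(av + bv)))
    print(total.value)  # 3
    a.write(10)         # change propagation updates total automatically
    print(total.value)  # 12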


Journal of the ACM | 1999

Provably efficient scheduling for languages with fine-grained parallelism

Guy E. Blelloch; Phillip B. Gibbons; Yossi Matias

Many high-level parallel programming languages allow for fine-grained parallelism. As in the popular work-time framework for parallel algorithm design, programs written in such languages can express the full parallelism in the program without specifying the mapping of program tasks to processors. A common concern in executing such programs is to schedule tasks to processors dynamically so as to minimize not only the execution time, but also the amount of space (memory) needed. Without careful scheduling, the parallel execution on p processors can use a factor of p or larger more space than a sequential implementation of the same program. This paper first identifies a class of parallel schedules that are provably efficient in both time and space. For any computation with w units of work and critical path length d, and for any sequential schedule that takes space s1, we provide a parallel schedule that takes fewer than w/p + d steps on p processors and requires less than s1 + p·d space. This matches the lower bound that we show, and significantly improves upon the best previous bound of s1·p space for the common case where d << s1. The paper then describes a scheduler for implementing high-level languages with nested parallelism, that generates schedules in this class. During program execution, as the structure of the computation is revealed, the scheduler keeps track of the active tasks, allocates the tasks to the processors, and performs the necessary task synchronization. The scheduler is itself a parallel algorithm, and incurs at most a constant factor overhead in time and space, even when the scheduling granularity is individual units of work. The algorithm is the first efficient solution to the scheduling problem discussed here, even if space considerations are ignored.
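
To make the bounds concrete, here is a small numeric instantiation in Python (the figures are invented for illustration, not taken from the paper):

    # Instantiating the time bound w/p + d and the space bound s1 + p*d.
    w, d, p, s1 = 10**9, 10**3, 64, 10**8  # work, depth, processors, seq. space
    print(w // p + d)   # 15_626_000 steps: near-linear speedup on 64 processors
    print(s1 + p * d)   # 100_064_000 units: only an additive p*d space overhead
    print(s1 * p)       # 6_400_000_000 units: the previous multiplicative bound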

Collaboration


Dive into Guy E. Blelloch's collaborations.

Top Co-Authors

Julian Shun, University of California
Robert Harper, Carnegie Mellon University
Umut A. Acar, Carnegie Mellon University
Kanat Tangwongsan, Carnegie Mellon University
Russell Schwartz, Carnegie Mellon University
Yan Gu, Carnegie Mellon University
R. Ravi, Carnegie Mellon University
Marco Zagha, Carnegie Mellon University