Publications


Featured research published by Robert D. Blumofe.


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 1995

Cilk: an efficient multithreaded runtime system

Robert D. Blumofe; Christopher F. Joerg; Bradley C. Kuszmaul; Charles E. Leiserson; Keith H. Randall; Yuli Zhou

Cilk (pronounced “silk”) is a C-based runtime system for multi-threaded parallel programming. In this paper, we document the efficiency of the Cilk work-stealing scheduler, both empirically and analytically. We show that on real and synthetic applications, the “work” and “critical path” of a Cilk computation can be used to accurately model performance. Consequently, a Cilk programmer can focus on reducing the work and critical path of his computation, insulated from load balancing and other runtime scheduling issues. We also prove that for the class of “fully strict” (well-structured) programs, the Cilk scheduler achieves space, time and communication bounds all within a constant factor of optimal. The Cilk runtime system currently runs on the Connection Machine CM5 MPP, the Intel Paragon MPP, the Silicon Graphics Power Challenge SMP, and the MIT Phish network of workstations. Applications written in Cilk include protein folding, graphic rendering, backtrack search, and the *Socrates chess program, which won third prize in the 1994 ACM International Computer Chess Championship.
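
The work and critical-path measures mentioned in this abstract support a simple back-of-the-envelope performance model. As a hedged illustration (not code from the paper, and with invented numbers), the following C program evaluates the model TP ≈ T1/P + c∞ T∞ that the paper validates empirically:

```c
/* Illustrative only: evaluate the work/critical-path performance model
   TP ~ T1/P + c_inf * Tinf. All numbers here are hypothetical, not
   measurements from the paper. */
#include <stdio.h>

int main(void) {
    double t1    = 100.0;  /* total work T1 (seconds, hypothetical)          */
    double tinf  = 0.5;    /* critical-path length Tinf (hypothetical)       */
    double c_inf = 1.5;    /* critical-path overhead constant (hypothetical) */

    for (int p = 1; p <= 64; p *= 2) {
        double tp = t1 / p + c_inf * tinf;   /* modeled execution time */
        printf("P = %2d: predicted TP = %6.2f s, speedup = %5.2f\n",
               p, tp, t1 / tp);
    }
    return 0;
}
```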


ACM Symposium on Parallel Algorithms and Architectures | 1998

Thread scheduling for multiprogrammed multiprocessors

Nimar S. Arora; Robert D. Blumofe; C. Greg Plaxton

We present a user-level thread scheduler for shared-memory multiprocessors, and we analyze its performance under multiprogramming. We model multiprogramming with two scheduling levels: our scheduler runs at user-level and schedules threads onto a fixed collection of processes, while below this level, the operating system kernel schedules processes onto a fixed collection of processors. We consider the kernel to be an adversary, and our goal is to schedule threads onto processes such that we make efficient use of whatever processor resources are provided by the kernel. Our thread scheduler is a non-blocking implementation of the work-stealing algorithm. For any multithreaded computation with work T1 and critical-path length T∞, and for any number P of processes, our scheduler executes the computation in expected time O(T1/PA + T∞P/PA), where PA is the average number of processors allocated to the computation by the kernel. This time bound is optimal to within a constant factor, and achieves linear speedup whenever P is small relative to the parallelism T1/T∞.
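
The non-blocking scheduler rests on a lock-free deque: the owner pushes and pops work at the bottom while thieves steal from the top with a compare-and-swap. The C11 sketch below is in the spirit of the Arora-Blumofe-Plaxton deque; the fixed capacity, the names, and the packing of a version tag with the top index into one word (to sidestep the ABA problem) are illustrative simplifications, not the paper's exact code:

```c
/* Simplified work-stealing deque sketch (ABP-style); sequentially
   consistent atomics used throughout for clarity. */
#include <stdatomic.h>
#include <stdint.h>
#include <stddef.h>

#define DEQUE_CAP 1024              /* fixed capacity; a simplification */

typedef struct Thread Thread;       /* hypothetical task type */

typedef struct {
    /* "age" packs a version tag and the top index into one word so a
       steal can CAS both at once and avoid ABA. */
    _Atomic uint64_t age;           /* high 32 bits: tag, low 32: top */
    _Atomic size_t   bot;           /* bottom index, owner-only writes */
    Thread *buf[DEQUE_CAP];
} Deque;

static uint64_t pack(uint32_t tag, uint32_t top) { return ((uint64_t)tag << 32) | top; }
static uint32_t tag_of(uint64_t a) { return (uint32_t)(a >> 32); }
static uint32_t top_of(uint64_t a) { return (uint32_t)a; }

/* Owner pushes a ready thread on the bottom; no CAS needed. */
void push_bottom(Deque *d, Thread *t) {
    size_t b = atomic_load(&d->bot);
    d->buf[b] = t;
    atomic_store(&d->bot, b + 1);
}

/* Thief steals from the top with one CAS on the packed age word. */
Thread *pop_top(Deque *d) {
    uint64_t old = atomic_load(&d->age);
    size_t b = atomic_load(&d->bot);
    if (top_of(old) >= b) return NULL;             /* deque empty */
    Thread *t = d->buf[top_of(old)];
    uint64_t upd = pack(tag_of(old), top_of(old) + 1);
    if (atomic_compare_exchange_strong(&d->age, &old, upd))
        return t;                                  /* won the race */
    return NULL;                                   /* lost; caller retries */
}

/* Owner pops from the bottom; only the last-element race needs a CAS. */
Thread *pop_bottom(Deque *d) {
    size_t b = atomic_load(&d->bot);
    if (b == 0) return NULL;
    b -= 1;
    atomic_store(&d->bot, b);
    Thread *t = d->buf[b];
    uint64_t old = atomic_load(&d->age);
    if (b > top_of(old)) return t;                 /* no conflict possible */
    atomic_store(&d->bot, 0);                      /* deque is (or will be) empty */
    uint64_t upd = pack(tag_of(old) + 1, 0);       /* reset with a fresh tag */
    if (b == top_of(old) &&
        atomic_compare_exchange_strong(&d->age, &old, upd))
        return t;                                  /* beat any thief to the last item */
    atomic_store(&d->age, upd);                    /* lost the race: leave it empty */
    return NULL;
}
```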


Architectural Support for Programming Languages and Operating Systems | 2000

Hoard: a scalable memory allocator for multithreaded applications

Emery D. Berger; Kathryn S. McKinley; Robert D. Blumofe; Paul R. Wilson

Parallel, multithreaded C and C++ programs such as web servers, database managers, news servers, and scientific applications are becoming increasingly prevalent. For these applications, the memory allocator is often a bottleneck that severely limits program performance and scalability on multiprocessor systems. Previous allocators suffer from problems that include poor performance and scalability, and heap organizations that introduce false sharing. Worse, many allocators exhibit a dramatic increase in memory consumption when confronted with a producer-consumer pattern of object allocation and freeing. This increase in memory consumption can range from a factor of P (the number of processors) to unbounded memory consumption. This paper introduces Hoard, a fast, highly scalable allocator that largely avoids false sharing and is memory efficient. Hoard is the first allocator to simultaneously solve the above problems. Hoard combines one global heap and per-processor heaps with a novel discipline that provably bounds memory consumption and has very low synchronization costs in the common case. Our results on eleven programs demonstrate that Hoard yields low average fragmentation and improves overall program performance over the standard Solaris allocator by up to a factor of 60 on 14 processors, and up to a factor of 18 over the next best allocator we tested.
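
To make the two-level heap idea concrete, here is a toy C sketch (emphatically not Hoard's code): threads allocate from per-processor heaps and fall back to one global heap, so synchronization usually stays local. Hoard's actual mechanism, superblocks managed under an emptiness invariant with size classes, is more involved; all names below are hypothetical:

```c
/* Toy sketch of a two-level heap organization in the spirit of Hoard:
   per-processor heaps backed by one global heap. Single size class and
   no false-sharing avoidance; Hoard itself does much more. */
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

#define NHEAPS 8                     /* e.g. one heap per processor */

typedef struct Block { struct Block *next; } Block;

typedef struct {
    pthread_mutex_t lock;            /* per-heap lock keeps contention local */
    Block *free_list;                /* one size class, for simplicity */
} Heap;

static Heap global_heap = { PTHREAD_MUTEX_INITIALIZER, NULL };
static Heap local_heaps[NHEAPS];

static void toy_heaps_init(void) {
    for (int i = 0; i < NHEAPS; i++)
        pthread_mutex_init(&local_heaps[i].lock, NULL);
}

/* Map the calling thread to a heap. Non-portable: assumes pthread_t
   converts to an integer (Hoard hashes on processor/thread ids). */
static Heap *my_heap(void) {
    return &local_heaps[(uintptr_t)pthread_self() % NHEAPS];
}

void *toy_alloc(size_t size) {
    Heap *h = my_heap();
    pthread_mutex_lock(&h->lock);
    Block *b = h->free_list;
    if (b) h->free_list = b->next;
    pthread_mutex_unlock(&h->lock);
    if (b) return b;

    /* Local heap empty: try the global heap, then the OS allocator. */
    pthread_mutex_lock(&global_heap.lock);
    b = global_heap.free_list;
    if (b) global_heap.free_list = b->next;
    pthread_mutex_unlock(&global_heap.lock);
    return b ? (void *)b : malloc(size < sizeof(Block) ? sizeof(Block) : size);
}

void toy_free(void *p) {
    /* Return to the caller's heap; Hoard instead moves mostly-empty
       superblocks back to the global heap, which bounds blowup. */
    Heap *h = my_heap();
    Block *b = p;
    pthread_mutex_lock(&h->lock);
    b->next = h->free_list;
    h->free_list = b;
    pthread_mutex_unlock(&h->lock);
}
```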


Theory of Computing Systems / Mathematical Systems Theory | 2002

The Data Locality of Work Stealing

Umut A. Acar; Guy E. Blelloch; Robert D. Blumofe

This paper studies the data locality of the work-stealing scheduling algorithm on hardware-controlled shared-memory machines, where movement of data to and from the cache is solely controlled by the hardware. We present lower and upper bounds on the number of cache misses when using work stealing, and introduce a locality-guided work-stealing algorithm and its experimental validation. As a lower bound, we show that a work-stealing application that exhibits good data locality on a uniprocessor may exhibit poor data locality on a multiprocessor. In particular, we show a family of multithreaded computations Gn whose members perform Θ(n) operations (work) and incur a constant number of cache misses on a uniprocessor, while even on two processors the total number of cache misses soars to Ω(n). On the other hand, we show a tight upper bound on the number of cache misses that nested-parallel computations, a large, important class of computations, incur due to multiprocessing. In particular, for nested-parallel computations, we show that on P processors a multiprocessor execution incurs an expected O(C⌈m/s⌉PT∞) more misses than the uniprocessor execution. Here m is the execution time of an instruction incurring a cache miss, s is the steal time, C is the size of the cache, and T∞ is the number of nodes on the longest chain of dependencies. Based on this we give strong execution time bounds for nested-parallel computations using work stealing. For the second part of our results, we present a locality-guided work-stealing algorithm that improves the data locality of multithreaded computations by allowing a thread to have an affinity for a processor. Our initial experiments on iterative data-parallel applications show that the algorithm matches the performance of static partitioning under traditional workloads but improves the performance by up to 50% over static partitioning under multiprogrammed workloads. Furthermore, locality-guided work stealing improves the performance of work stealing by up to 80%.
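
One concrete way to give threads processor affinity, roughly in the spirit of this line of work, is sketched below in C: a ready task with an affinity is mailed to its preferred worker, which drains its mailbox before falling back to random stealing. The types and names are hypothetical, the queue is single-producer for simplicity, and the duplicate-entry checks a real implementation needs are omitted:

```c
/* Sketch of locality-guided dispatch: mailbox first, steal second. */
#include <stdatomic.h>
#include <stddef.h>

#define NWORKERS    8
#define MAILBOX_CAP 256

typedef struct Task { int affinity; /* preferred worker, or -1 for none */ } Task;

typedef struct {
    _Atomic size_t head, tail;       /* consumer advances head, producer tail */
    Task *slots[MAILBOX_CAP];
} Mailbox;

static Mailbox mailbox[NWORKERS];

/* Called when a task becomes ready: send it home if it has an affinity. */
int mail_task(Task *t) {
    if (t->affinity < 0) return 0;                 /* no affinity */
    Mailbox *m = &mailbox[t->affinity];
    size_t tail = atomic_load(&m->tail);
    if (tail - atomic_load(&m->head) == MAILBOX_CAP)
        return 0;                                  /* full: use the normal deque */
    m->slots[tail % MAILBOX_CAP] = t;
    atomic_store(&m->tail, tail + 1);
    return 1;
}

/* A worker's dispatch loop: affinity hits first, then random stealing. */
Task *next_task(int worker, Task *(*steal_random)(void)) {
    Mailbox *m = &mailbox[worker];
    size_t head = atomic_load(&m->head);
    if (head != atomic_load(&m->tail)) {
        Task *t = m->slots[head % MAILBOX_CAP];
        atomic_store(&m->head, head + 1);
        return t;                                  /* likely warm in this cache */
    }
    return steal_random();                         /* fall back to work stealing */
}
```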


ACM SIGOPS European Workshop | 1996

ATLAS: an infrastructure for global computing

J. Eric Baldeschwieler; Robert D. Blumofe; Eric A. Brewer

In this paper, we present a proposed system architecture for global computing that we call ATLAS, and we describe an early prototype that implements several of the mechanisms and policies that comprise the proposed architecture. ATLAS is designed to execute parallel multithreaded programs on the networked computing resources of the world. The ATLAS system is a marriage of existing technologies from Java and Cilk together with some new technologies needed to extend the system into the global domain.


ACM Symposium on Parallel Algorithms and Architectures | 1996

An analysis of dag-consistent distributed shared-memory algorithms

Robert D. Blumofe; Matteo Frigo; Christopher F. Joerg; Charles E. Leiserson; Keith H. Randall

In this paper, we analyze the performance of parallel multithreaded algorithms that use dag-consistent distributed shared memory. Specifically, we analyze execution time, page faults, and space requirements for multithreaded algorithms executed by a work-stealing thread scheduler and the BACKER algorithm for maintaining dag consistency. We prove that if the accesses to the backing store are random and independent (the BACKER algorithm actually uses hashing), the expected execution time TP(C) of a “fully strict” multithreaded computation on P processors, each with an LRU cache of C pages, is O(T1(C)/P + mCT∞), where T1(C) is the total work of the computation including page faults, T∞ is its critical-path length excluding page faults, and m is the minimum page transfer time. As a corollary to this theorem, we show that the expected number FP(C) of page faults incurred by a computation executed on P processors can be related to the number F1(C) of serial page faults by the formula FP(C) ≤ F1(C) + O(CPT∞). Finally, we give simple bounds on the number of page faults and the space requirements for “regular” divide-and-conquer algorithms. We use these bounds to analyze parallel multithreaded algorithms for matrix multiplication and LU-decomposition.
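
To visualize what BACKER does, here is a toy C sketch of fetch and reconcile against a backing store. The direct-mapped cache (the paper assumes LRU), the fixed sizes, and all names are illustrative assumptions, and the protocol's reconciles at steal and sync points are omitted:

```c
/* Toy sketch of BACKER-style page maintenance: each processor caches
   pages and reconciles dirty ones to the backing store. */
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE   4096
#define CACHE_PAGES 64               /* the cache size C in the bounds above */
#define NPAGES      1024

static uint8_t backing_store[NPAGES][PAGE_SIZE];  /* BACKER hashes pages across it */

typedef struct {
    int     tag;                     /* which page this slot holds; -1 = empty */
    int     dirty;
    uint8_t data[PAGE_SIZE];
} CacheSlot;

typedef struct { CacheSlot slot[CACHE_PAGES]; } ProcCache;

static void cache_init(ProcCache *c) {
    for (int i = 0; i < CACHE_PAGES; i++) {
        c->slot[i].tag = -1;
        c->slot[i].dirty = 0;
    }
}

/* Reconcile: push a dirty cached page back to the backing store. */
static void reconcile(CacheSlot *s) {
    if (s->tag >= 0 && s->dirty) {
        memcpy(backing_store[s->tag], s->data, PAGE_SIZE);
        s->dirty = 0;
    }
}

/* Fetch: bring a page into the cache, reconciling the evictee first.
   Every miss here corresponds to a "page fault" in the bounds above. */
static uint8_t *fetch(ProcCache *c, int page, int will_write) {
    CacheSlot *s = &c->slot[page % CACHE_PAGES];  /* direct-mapped for brevity */
    if (s->tag != page) {
        reconcile(s);                             /* write back the evictee */
        memcpy(s->data, backing_store[page], PAGE_SIZE);
        s->tag = page;
    }
    if (will_write) s->dirty = 1;
    return s->data;
}
```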


ACM Symposium on Parallel Algorithms and Architectures | 2000

The data locality of work stealing

Umut A. Acar; Guy E. Blelloch; Robert D. Blumofe

This paper studies the data locality of the work-stealing scheduling algorithm on hardware-controlled shared-memory machines. We present lower and upper bounds on the number of cache misses using work stealing, and introduce a locality-guided work-stealing algorithm along with experimental validation. As a lower bound, we show that there is a family of multithreaded computations Gn, each member of which requires Θ(n) total instructions (work), for which the number of cache misses under work stealing is constant on one processor, while even on two processors the total number of cache misses is Ω(n). This implies that for general computations there is no useful bound relating multiprocessor to uniprocessor cache misses. For nested-parallel computations, however, we show that on P processors the expected additional number of cache misses beyond those on a single processor is bounded by O(C⌈m/s⌉PT∞), where m is the execution time of an instruction incurring a cache miss, s is the steal time, C is the size of the cache, and T∞ is the number of nodes on the longest chain of dependences. Based on this we give strong bounds on the total running time of nested-parallel computations using work stealing. For the second part of our results, we present a locality-guided work-stealing algorithm that improves the data locality of multithreaded computations by allowing a thread to have an affinity for a processor. Our initial experiments on iterative data-parallel applications show that the algorithm matches the performance of static partitioning under traditional workloads but improves the performance by up to 50% over static partitioning under multiprogrammed workloads. Furthermore, locality-guided work stealing improves the performance of work stealing by up to 80%.


International Conference on Parallel Processing | 1996

DAG-consistent distributed shared memory

Robert D. Blumofe; Matteo Frigo; Christopher F. Joerg; Charles E. Leiserson; Keith H. Randall

Introduces DAG (directed acyclic graph) consistency, a relaxed consistency model for distributed shared memory which is suitable for multithreaded programming. We have implemented DAG consistency in software for the Cilk multithreaded runtime system running on a CM5 Connection Machine. Our implementation includes a DAG-consistent distributed cactus stack for storage allocation. We provide empirical evidence of the flexibility and efficiency of DAG consistency for applications that include blocked matrix multiplication, Strassen's (1969) matrix multiplication algorithm, and a Barnes-Hut code. Although Cilk schedules the executions of these programs dynamically, their performance is competitive with statically scheduled implementations in the literature. We also prove that the number FP of page faults incurred by a user program running on P processors can be related to the number F1 of page faults running serially by the formula FP ≤ F1 + 2Cs, where C is the cache size and s is the number of thread migrations executed by Cilk's scheduler.
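
The cactus stack mentioned here can be pictured as heap-allocated frames linked by parent pointers, so sibling subcomputations share all ancestor frames without copying them. A minimal hypothetical sketch in C, not Cilk's implementation:

```c
/* Minimal cactus-stack sketch: each frame points to its parent, so two
   spawned children can both extend the same parent concurrently.
   malloc is used for clarity; Cilk ties this to its memory system. */
#include <stdlib.h>

typedef struct Frame {
    struct Frame *parent;   /* shared prefix of the stack */
    size_t locals_size;
    /* procedure locals would follow this header */
} Frame;

Frame *push_frame(Frame *parent, size_t locals_size) {
    Frame *f = malloc(sizeof(Frame) + locals_size);
    f->parent = parent;     /* siblings share every ancestor frame */
    f->locals_size = locals_size;
    return f;
}

void pop_frame(Frame *f) {
    free(f);                /* the parent stays live for sibling branches */
}
```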


Theory of Computing Systems / Mathematical Systems Theory | 2001

Thread Scheduling for Multiprogrammed Multiprocessors

N. S. Arora; Robert D. Blumofe; C. G. Plaxton

We present a user-level thread scheduler for shared-memory multiprocessors, and we analyze its performance under multiprogramming. We model multiprogramming with two scheduling levels: our scheduler runs at user-level and schedules threads onto a fixed collection of processes, while below this level, the operating system kernel schedules processes onto a fixed collection of processors. We consider the kernel to be an adversary, and our goal is to schedule threads onto processes such that we make efficient use of whatever processor resources are provided by the kernel. Our thread scheduler is a non-blocking implementation of the work-stealing algorithm. For any multithreaded computation with work T1 and critical-path length T∞, and for any number P of processes, our scheduler executes the computation in expected time O(T1/PA + T∞P/PA), where PA is the average number of processors allocated to the computation by the kernel. This time bound is optimal to within a constant factor, and achieves linear speedup whenever P is small relative to the parallelism T1/T∞.


Measurement and Modeling of Computer Systems | 1998

The performance of work stealing in multiprogrammed environments (extended abstract)

Robert D. Blumofe; Dionisios Papadopoulos

We study the performance of user-level thread schedulers in multiprogrammed environments. Our goal is a user-level thread scheduler that delivers efficient performance under multiprogramming without any need for kernel-level resource management, such as coscheduling or process control. We show that a non-blocking implementation of the work-stealing algorithm achieves this goal. With this implementation, the execution time of a computation running with arbitrarily many processes on arbitrarily many processors can be modeled as a simple function of work and critical-path length. This model holds even when the processes run on a set of processors that arbitrarily grows and shrinks over time. We observe linear speedup whenever the number of processes is small relative to the average parallelism.
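
As a hedged illustration of this model (all numbers invented), the C snippet below averages a varying kernel allocation into PA and evaluates the bound T = O(T1/PA + T∞P/PA) stated above:

```c
/* Illustrative: compute the average processor allocation P_A from a
   hypothetical kernel schedule and evaluate the multiprogrammed bound. */
#include <stdio.h>

int main(void) {
    int alloc[] = {8, 8, 4, 2, 6, 8, 8, 3};    /* processors granted per quantum */
    int steps = (int)(sizeof alloc / sizeof alloc[0]);
    double t1 = 200.0, tinf = 1.0, p = 8.0;    /* work, span, processes (made up) */

    double pa = 0.0;
    for (int i = 0; i < steps; i++) pa += alloc[i];
    pa /= steps;                               /* average allocation P_A */

    printf("P_A = %.2f, modeled time ~ %.2f\n", pa, t1 / pa + tinf * p / pa);
    return 0;
}
```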

Collaboration


Dive into Robert D. Blumofe's collaboration.

Top Co-Authors

Charles E. Leiserson (Massachusetts Institute of Technology)
Emery D. Berger (University of Massachusetts Amherst)
Christopher F. Joerg (Massachusetts Institute of Technology)
Keith H. Randall (Massachusetts Institute of Technology)
Paul R. Wilson (University of Texas at Austin)
Umut A. Acar (Carnegie Mellon University)
Eric A. Brewer (University of California)
Guy E. Blelloch (Carnegie Mellon University)