Bradley C. Kuszmaul
Massachusetts Institute of Technology
Publications
Featured research published by Bradley C. Kuszmaul.
ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 1995
Robert D. Blumofe; Christopher F. Joerg; Bradley C. Kuszmaul; Charles E. Leiserson; Keith H. Randall; Yuli Zhou
Cilk (pronounced “silk”) is a C-based runtime system for multi-threaded parallel programming. In this paper, we document the efficiency of the Cilk work-stealing scheduler, both empirically and analytically. We show that on real and synthetic applications, the “work” and “critical path” of a Cilk computation can be used to accurately model performance. Consequently, a Cilk programmer can focus on reducing the work and critical path of his computation, insulated from load balancing and other runtime scheduling issues. We also prove that for the class of “fully strict” (well-structured) programs, the Cilk scheduler achieves space, time and communication bounds all within a constant factor of optimal. The Cilk runtime system currently runs on the Connection Machine CM5 MPP, the Intel Paragon MPP, the Silicon Graphics Power Challenge SMP, and the MIT Phish network of workstations. Applications written in Cilk include protein folding, graphic rendering, backtrack search, and the *Socrates chess program, which won third prize in the 1994 ACM International Computer Chess Championship.
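The "work" and "critical path" performance model described above can be stated in a few lines. The sketch below is an illustration of that model, not Cilk itself, and the function names are my own: with work T1 (total operations) and critical path T∞ (longest dependency chain), a work-stealing scheduler on P processors runs in roughly T1/P + T∞ time.

```python
# Illustrative sketch of the work/critical-path performance model for a
# work-stealing scheduler: runtime on P processors is approximately
# T1/P + Tinf (within constant factors). Function names are hypothetical.

def predicted_runtime(work, critical_path, processors):
    """Estimate T_P ~ T1/P + Tinf for a work-stealing scheduler."""
    return work / processors + critical_path

def parallelism(work, critical_path):
    """T1/Tinf: beyond this many processors, adding more barely helps."""
    return work / critical_path

# Example: 10^9 units of work with a critical path of 10^6 units.
t64 = predicted_runtime(1e9, 1e6, 64)
avg_parallelism = parallelism(1e9, 1e6)
```

This is why the abstract says a programmer can focus on reducing work and critical path: those two numbers alone predict the runtime, independent of scheduling details.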
International Symposium on High-Performance Computer Architecture | 2005
C.S. Ananian; Krste Asanovic; Bradley C. Kuszmaul; Charles E. Leiserson; Sean Lie
Hardware transactional memory should support unbounded transactions: transactions of arbitrary size and duration. We describe a hardware implementation of unbounded transactional memory, called UTM, which exploits the common case for performance without sacrificing correctness on transactions whose footprint can be nearly as large as virtual memory. We performed a cycle-accurate simulation of a simplified architecture, called LTM. LTM is based on UTM but is easier to implement, because it does not change the memory subsystem outside of the processor. LTM allows nearly unbounded transactions, whose footprint is limited only by physical memory size and whose duration by the length of a timeslice. We assess UTM and LTM through microbenchmarking and by automatically converting the SPECjvm98 Java benchmarks and the Linux 2.4.19 kernel to use transactions instead of locks. We use both cycle-accurate simulation and instrumentation to understand benchmark behavior. Our studies show that the common case is small transactions that commit, even when contention is high, but that some applications contain very large transactions. For example, although 99.9% of transactions in the Linux study touch 54 cache lines or fewer, some transactions touch over 8000 cache lines. Our studies also indicate that hardware support is required, because some applications spend over half their time in critical regions. Finally, they suggest that hardware support for transactions can make Java programs run faster than when run using locks and can increase the concurrency of the Linux kernel by as much as a factor of 4 with no additional programming work.
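UTM and LTM are hardware designs, but the programming model they support, atomic transactions replacing locks, can be mimicked in software. The toy below is deliberately simplified and entirely illustrative (a single global lock stands in for hardware conflict detection; a real HTM runs transactions optimistically and aborts on conflict):

```python
# Toy software stand-in for the atomic-transaction programming model the
# paper studies in hardware. All names are hypothetical; a real HTM tracks
# read/write sets in hardware and rolls back conflicting transactions.
import threading

class ToyTM:
    def __init__(self):
        self._lock = threading.Lock()  # stand-in for hardware conflict detection
        self.memory = {}

    def atomic(self, txn):
        """Run txn(read, write) atomically against self.memory."""
        with self._lock:
            log = {}  # write set, buffered until commit
            def read(addr):
                return log.get(addr, self.memory.get(addr, 0))
            def write(addr, val):
                log[addr] = val
            txn(read, write)
            self.memory.update(log)  # commit the buffered writes

tm = ToyTM()
tm.memory = {"a": 10, "b": 0}

def transfer(read, write):
    # The whole transfer appears to execute indivisibly.
    write("a", read("a") - 5)
    write("b", read("b") + 5)

tm.atomic(transfer)
```

The point of the paper's measurements is that in the common case transactions are small and commit, so hardware can run such atomic blocks faster than lock-based code.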
Journal of Parallel and Distributed Computing | 1996
Charles E. Leiserson; Zahi S. Abuhamdeh; David C. Douglas; Carl R. Feynman; Mahesh N. Ganmukhi; Jeffrey V. Hill; W. Daniel Hillis; Bradley C. Kuszmaul; Margaret A. St. Pierre; David S. Wells; Monica C. Wong-Chan; Shaw-Wen Yang; Robert C. Zak
The Connection Machine Model CM-5 Supercomputer is a massively parallel computer system designed to offer performance in the range of 1 teraflops (10^12 floating-point operations per second). The CM-5 obtains its high performance while offering ease of programming, flexibility, and reliability. The machine contains three communication networks: a data network, a control network, and a diagnostic network. This paper describes the organization of these three networks and how they contribute to the design goals of the CM-5.
ACM Symposium on Parallel Algorithms and Architectures | 1992
Charles E. Leiserson; Zahi S. Abuhamdeh; David C. Douglas; Carl R. Feynman; Mahesh N. Ganmukhi; Jeffrey V. Hill; W. Daniel Hillis; Bradley C. Kuszmaul; Margaret A. St. Pierre; David S. Wells; Monica C. Wong; Shaw-Wen Yang; Robert C. Zak
The Connection Machine Model CM-5 Supercomputer is a massively parallel computer system designed to offer performance in the range of 1 teraflops (10^12 floating-point operations per second). The CM-5 obtains its high performance while offering ease of programming, flexibility, and reliability. The machine contains three communication networks: a data network, a control network, and a diagnostic network. This paper describes the organization of these three networks and how they contribute to the design goals of the CM-5.
ACM Symposium on Parallel Algorithms and Architectures | 2007
Michael A. Bender; Jeremy T. Fineman; Yonatan R. Fogel; Bradley C. Kuszmaul; Jelani Nelson
A streaming B-tree is a dictionary that efficiently implements insertions and range queries. We present two cache-oblivious streaming B-trees, the shuttle tree and the cache-oblivious lookahead array (COLA). For block-transfer size B and on N elements, the shuttle tree implements searches in optimal O(log_{B+1} N) transfers, range queries of L successive elements in optimal O(log_{B+1} N + L/B) transfers, and insertions in O((log_{B+1} N)/B^{Θ(1/(log log B)^2)} + (log^2 N)/B) transfers, which is an asymptotic speedup over traditional B-trees if B ≥ (log N)^{1 + c log log log^2 N} for any constant c > 1. A COLA implements searches in O(log N) transfers, range queries in O(log N + L/B) transfers, and insertions in amortized O((log N)/B) transfers, matching the bounds for a (cache-aware) buffered repository tree. A partially deamortized COLA matches these bounds but reduces the worst-case insertion cost to O(log N) if memory size M = Ω(log N). We also present a cache-aware version of the COLA, the lookahead array, which achieves the same bounds as Brodal and Fagerberg's (cache-aware) B^ε-tree. We compare our COLA implementation to a traditional B-tree. Our COLA implementation runs 790 times faster for random insertions, 3.1 times slower for insertions of sorted data, and 3.5 times slower for searches.
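The COLA's amortized insertion bound comes from a simple mechanism: geometrically growing sorted levels that are merged downward when full. A minimal sketch of that mechanism, with my own class and method names (the real structure adds lookahead pointers to get O(log N) searches, plus the deamortized and cache-aware variants the abstract describes):

```python
# Minimal sketch of COLA-style insertion: level k holds a sorted array of
# either 0 or 2^k elements; inserting cascades merges of full levels into
# the next level, which is what yields the amortized O((log N)/B) transfer
# bound. Simplified: no lookahead pointers, no deamortization.
from bisect import bisect_left
from heapq import merge

class SimpleCOLA:
    def __init__(self):
        self.levels = []  # levels[k] is empty or a sorted list of 2^k items

    def insert(self, x):
        carry = [x]
        k = 0
        while True:
            if k == len(self.levels):
                self.levels.append([])
            if not self.levels[k]:
                self.levels[k] = carry  # found an empty level: done
                return
            # Level k is full: merge it with the carry and move down a level.
            carry = list(merge(self.levels[k], carry))
            self.levels[k] = []
            k += 1

    def search(self, x):
        # A real COLA uses fractional cascading between levels; here we
        # simply binary-search each level independently.
        for lvl in self.levels:
            i = bisect_left(lvl, x)
            if i < len(lvl) and lvl[i] == x:
                return True
        return False
```

Each element is merged O(log N) times over its lifetime, and each merge is a sequential scan, which is why the structure trades slower searches for much faster random insertions, matching the 790x insertion speedup reported above.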
ACM Symposium on Parallel Algorithms and Architectures | 2005
Michael A. Bender; Simai He; Bradley C. Kuszmaul; Charles E. Leiserson
This paper analyzes the worst-case performance of randomized backoff on simple multiple-access channels. Most previous analysis of backoff has assumed a statistical arrival model. For batched arrivals, in which all n packets arrive at time 0, we show the following tight high-probability bounds. Randomized binary exponential backoff has makespan Θ(n lg n), and more generally, for any constant r, r-exponential backoff has makespan Θ(n log^{lg r} n). Quadratic backoff has makespan Θ((n/lg n)^{3/2}), and more generally, for r > 1, r-polynomial backoff has makespan Θ((n/lg n)^{1+1/r}). Thus, for batched inputs, both exponential and polynomial backoff are highly sensitive to backoff constants. We exhibit a monotone superpolynomial subexponential backoff algorithm, called loglog-iterated backoff, that achieves makespan Θ(n lg lg n / lg lg lg n). We provide a matching lower bound showing that this strategy is optimal among all monotone backoff algorithms. Of independent interest is that this lower bound was proved with a delay-sequence argument. In the adversarial-queuing model, we present the following stability and instability results for exponential backoff and loglog-iterated backoff. Given a (λ, T)-stream, in which at most n = λT packets arrive in any interval of size T, exponential backoff is stable for arrival rates of λ = O(1/lg n) and unstable for arrival rates of λ = Ω(lg lg n / lg n); loglog-iterated backoff is stable for arrival rates of λ = O(1/(lg lg n lg n)) and unstable for arrival rates of λ = Ω(1/lg n). Our instability results show that bursty input is close to being worst-case for exponential backoff and variants and that even small bursts can create instabilities in the channel.
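The batched-arrival setting above is easy to simulate. The toy below sketches randomized binary exponential backoff on an idealized slotted channel (my own function names, not the paper's code): all n packets are present at time 0, each slot every waiting packet transmits with probability 1/window, a lone transmitter succeeds, and colliding packets double their windows.

```python
# Toy simulation of randomized binary exponential backoff with batched
# arrivals: all n packets present at time 0 on a slotted channel. Exactly
# one transmitter in a slot is a success; two or more is a collision, and
# the colliding packets double their backoff windows.
import random

def bexp_backoff_makespan(n, seed=0):
    rng = random.Random(seed)
    windows = [1] * n  # current window size of each unfinished packet
    slots = 0
    while windows:
        slots += 1
        # Each packet transmits this slot with probability 1/window.
        senders = [i for i, w in enumerate(windows) if rng.randrange(w) == 0]
        if len(senders) == 1:
            windows.pop(senders[0])   # success: packet leaves the system
        else:
            for i in senders:         # collision: double those windows
                windows[i] *= 2
    return slots  # makespan in slots
```

Running this for growing n shows makespan growing noticeably faster than n, consistent with the Θ(n lg n) bound stated above for binary exponential backoff on batched inputs.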
ACM Symposium on Parallel Algorithms and Architectures | 2005
Michael A. Bender; Jeremy T. Fineman; Seth Gilbert; Bradley C. Kuszmaul
This paper presents concurrent cache-oblivious (CO) B-trees. We extend the cache-oblivious model to a parallel or distributed setting and present three concurrent CO B-trees. Our first data structure is a concurrent lock-based exponential CO B-tree. This data structure supports insertions and non-blocking searches/successor queries. The second and third data structures are lock-based and lock-free variations, respectively, on the packed-memory CO B-tree. These data structures support range queries and deletions in addition to the other operations. Each data structure achieves the same serial performance as the original data structure on which it is based. In a concurrent setting, we show that these data structures are linearizable, meaning that completed operations appear to an outside viewer as though they occurred in some serialized order. The lock-based data structures are also deadlock free, and the lock-free data structure guarantees forward progress by at least one process.
ACM Symposium on Principles of Database Systems | 2006
Michael A. Bender; Bradley C. Kuszmaul
B-trees are the data structure of choice for maintaining searchable data on disk. However, B-trees perform suboptimally when keys are long or of variable length; when keys are compressed, even when using front compression, the standard B-tree compression scheme; for range queries; and with respect to memory effects such as disk prefetching. This paper presents a cache-oblivious string B-tree (COSB-tree) data structure that is efficient in all these ways: The COSB-tree searches asymptotically optimally and inserts and deletes nearly optimally. It maintains an index whose size is proportional to the front-compressed size of the dictionary; furthermore, unlike standard front-compressed strings, keys can be decompressed in a memory-efficient manner. It performs range queries with no extra disk seeks; in contrast, B-trees incur disk seeks when skipping from leaf block to leaf block. It utilizes all levels of a memory hierarchy efficiently and makes good use of disk locality by using cache-oblivious layout strategies.
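Front compression, named above as the standard B-tree compression scheme, stores each key in a sorted run as the length of the prefix it shares with the previous key plus the remaining suffix. A minimal sketch with hypothetical function names:

```python
# Sketch of front compression for a sorted run of string keys: each key is
# stored as (length of prefix shared with the previous key, suffix).
import os

def front_compress(sorted_keys):
    out, prev = [], ""
    for k in sorted_keys:
        lcp = len(os.path.commonprefix([prev, k]))  # shared-prefix length
        out.append((lcp, k[lcp:]))
        prev = k
    return out

def front_decompress(pairs):
    keys, prev = [], ""
    for lcp, suffix in pairs:
        prev = prev[:lcp] + suffix  # rebuild key from the previous one
        keys.append(prev)
    return keys
```

Note that decoding the i-th key requires replaying the run from its start, which illustrates why standard front-compressed strings are awkward to decompress and why the abstract highlights the COSB-tree's memory-efficient decompression as a contribution.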
International Conference on Very Large Data Bases | 2012
Michael A. Bender; Rob Johnson; Russell Kraner; Bradley C. Kuszmaul; Dzejla Medjedovic; Pablo Montes; Pradeep Shetty; Richard P. Spillane; Erez Zadok
This paper presents new alternatives to the well-known Bloom filter data structure. The Bloom filter, a compact data structure supporting set insertion and membership queries, has found wide application in databases, storage systems, and networks. Because the Bloom filter performs frequent random reads and writes, it is used almost exclusively in RAM, limiting the size of the sets it can represent. This paper first describes the quotient filter, which supports the basic operations of the Bloom filter, achieving roughly comparable performance in terms of space and time, but with better data locality. Operations on the quotient filter require only a small number of contiguous accesses. The quotient filter has other advantages over the Bloom filter: it supports deletions, it can be dynamically resized, and two quotient filters can be efficiently merged. The paper then gives two data structures, the buffered quotient filter and the cascade filter, which exploit the quotient filter advantages and thus serve as SSD-optimized alternatives to the Bloom filter. The cascade filter has better asymptotic I/O performance than the buffered quotient filter, but the buffered quotient filter outperforms the cascade filter on small to medium data sets. Both data structures significantly outperform recently-proposed SSD-optimized Bloom filter variants, such as the elevator Bloom filter, buffered Bloom filter, and forest-structured Bloom filter. In experiments, the cascade filter and buffered quotient filter performed insertions 8.6--11 times faster than the fastest Bloom filter variant and performed lookups 0.94--2.56 times faster.
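The quotient filter's name comes from quotienting: a small fingerprint of each key is split into a q-bit quotient, used as a slot index, and an r-bit remainder, stored in the table. The toy below sketches only that split, with my own names; the real structure stores remainders in a single linear-probed array with a compact run encoding, which is what yields the contiguous accesses, deletions, resizing, and merging described above, whereas this simplification keeps a small list per slot.

```python
# Toy sketch of the quotienting idea: a (q+r)-bit fingerprint f of each key
# is split into quotient f >> r (the slot index) and remainder f & (2^r - 1)
# (what gets stored). Per-slot lists are a simplification of the real
# filter's linear-probed, run-encoded array.
import hashlib

class ToyQuotientFilter:
    def __init__(self, q=8, r=8):
        self.q, self.r = q, r
        self.slots = [[] for _ in range(1 << q)]

    def _fingerprint(self, key):
        h = int.from_bytes(hashlib.sha256(key.encode()).digest()[:4], "big")
        f = h & ((1 << (self.q + self.r)) - 1)
        return f >> self.r, f & ((1 << self.r) - 1)  # (quotient, remainder)

    def insert(self, key):
        quot, rem = self._fingerprint(key)
        if rem not in self.slots[quot]:
            self.slots[quot].append(rem)

    def may_contain(self, key):
        # Like a Bloom filter: false positives possible, no false negatives.
        quot, rem = self._fingerprint(key)
        return rem in self.slots[quot]

    def delete(self, key):
        # Deletion is possible because whole fingerprints are stored,
        # unlike a Bloom filter's shared bit array.
        quot, rem = self._fingerprint(key)
        if rem in self.slots[quot]:
            self.slots[quot].remove(rem)
```

Because each key touches a single slot neighborhood rather than several scattered bit positions, lookups and inserts have the data locality that the abstract credits for the filter's SSD friendliness.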
ACM Transactions on Storage | 2015
William Jannen; Jun Yuan; Yang Zhan; Amogh Akshintala; John Esmet; Yizheng Jiao; Ankur Mittal; Prashant Pandey; Phaneendra Reddy; Leif Walsh; Michael A. Bender; Rob Johnson; Bradley C. Kuszmaul; Donald E. Porter
The Bε-tree File System, or BetrFS (pronounced “better eff ess”), is the first in-kernel file system to use a write-optimized data structure (WODS). WODS are promising building blocks for storage systems because they support both microwrites and large scans efficiently. Previous WODS-based file systems have shown promise but have been hampered in several ways, which BetrFS mitigates or eliminates altogether. For example, previous WODS-based file systems were implemented in user space using FUSE, which superimposes many reads on a write-intensive workload, reducing the effectiveness of the WODS. This article also contributes several techniques for exploiting write-optimization within existing kernel infrastructure. BetrFS dramatically improves performance of certain types of large scans, such as recursive directory traversals, as well as performance of arbitrary microdata operations, such as file creates, metadata updates, and small writes to files. BetrFS can make small, random updates within a large file two orders of magnitude faster than other local file systems. BetrFS is an ongoing prototype effort and requires additional data-structure tuning to match current general-purpose file systems on some operations, including deletes, directory renames, and large sequential writes. Nonetheless, many applications realize significant performance improvements on BetrFS. For instance, an in-place rsync of the Linux kernel source sees roughly 1.6–22× speedup over commodity file systems.