Publication


Featured research published by Jeremy T. Fineman.


ACM Symposium on Parallel Algorithms and Architectures | 2009

Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks

Aydin Buluç; Jeremy T. Fineman; Matteo Frigo; John R. Gilbert; Charles E. Leiserson

This paper introduces a storage format for sparse matrices, called compressed sparse blocks (CSB), which allows both Ax and Aᵀx to be computed efficiently in parallel, where A is an n×n sparse matrix with nnz ≥ n nonzeros and x is a dense n-vector. Our algorithms use Θ(nnz) work (serial running time) and Θ(√n lg n) span (critical-path length), yielding a parallelism of Θ(nnz/(√n lg n)), which is amply high for virtually any large matrix. The storage requirement for CSB is the same as that for the more standard compressed-sparse-rows (CSR) format, for which computing Ax in parallel is easy but Aᵀx is difficult. Benchmark results indicate that on one processor, the CSB algorithms for Ax and Aᵀx run just as fast as the CSR algorithm for Ax, but the CSB algorithms also scale up linearly with processors until limited by off-chip memory bandwidth.
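
For context, a minimal sketch of the CSR baseline that the abstract contrasts against, in illustrative Python rather than the paper's implementation: with the standard CSR arrays (row_ptr, col_idx, vals), y = Ax parallelizes naturally over rows, while y = Aᵀx scatters writes across y, which is exactly the asymmetry CSB removes.

```python
# Illustrative CSR matrix-vector products (not the paper's CSB code).

def csr_spmv(n, row_ptr, col_idx, vals, x):
    """y = A @ x for an n x n CSR matrix; rows are independent work."""
    y = [0.0] * n
    for i in range(n):                       # parallelizable across rows
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[col_idx[k]]
    return y

def csr_spmv_transpose(n, row_ptr, col_idx, vals, x):
    """y = A.T @ x; writes scatter across y, so parallel rows would race."""
    y = [0.0] * n
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[col_idx[k]] += vals[k] * x[i]  # concurrent writers conflict here
    return y
```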


ACM Symposium on Parallel Algorithms and Architectures | 2007

Cache-oblivious streaming B-trees

Michael A. Bender; Jeremy T. Fineman; Yonatan R. Fogel; Bradley C. Kuszmaul; Jelani Nelson

A streaming B-tree is a dictionary that efficiently implements insertions and range queries. We present two cache-oblivious streaming B-trees: the shuttle tree and the cache-oblivious lookahead array (COLA). For block-transfer size B and N elements, the shuttle tree implements searches in optimal O(log_{B+1} N) transfers, range queries of L successive elements in optimal O(log_{B+1} N + L/B) transfers, and insertions in O((log_{B+1} N)/B^{Θ(1/(log log B)^2)} + (log^2 N)/B) transfers, which is an asymptotic speedup over traditional B-trees if B ≥ (log N)^{1+c log log log^2 N} for any constant c > 1. A COLA implements searches in O(log N) transfers, range queries in O(log N + L/B) transfers, and insertions in amortized O((log N)/B) transfers, matching the bounds for a (cache-aware) buffered repository tree. A partially deamortized COLA matches these bounds but reduces the worst-case insertion cost to O(log N) if memory size M = Ω(log N). We also present a cache-aware version of the COLA, the lookahead array, which achieves the same bounds as Brodal and Fagerberg's (cache-aware) B^ε-tree. We compare our COLA implementation to a traditional B-tree. Our COLA implementation runs 790 times faster for random insertions, 3.1 times slower for insertions of sorted data, and 3.5 times slower for searches.
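
The doubling-levels idea behind the COLA is compact enough to sketch. The toy version below is a deliberate simplification, with none of the paper's lookahead pointers or deamortization, so searches cost O(log^2 N) comparisons instead of O(log N): level i is a sorted array of size 2^i that is either completely full or completely empty, and an insertion merges a carry upward until it finds an empty level.

```python
import bisect

class ToyCOLA:
    """Simplified cache-oblivious lookahead array: sorted levels of size 2^i."""

    def __init__(self):
        self.levels = []                     # level i: sorted list, full or empty

    def insert(self, key):
        carry = [key]                        # a full run of size 1
        for i, level in enumerate(self.levels):
            if not level:                    # empty level absorbs the carry
                self.levels[i] = carry
                return
            carry = sorted(carry + level)    # stand-in for a linear-time merge
            self.levels[i] = []
        self.levels.append(carry)            # all levels full: grow by one

    def search(self, key):
        for level in self.levels:            # binary-search every level
            j = bisect.bisect_left(level, key)
            if j < len(level) and level[j] == key:
                return True
        return False
```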


ACM Symposium on Parallel Algorithms and Architectures | 2012

Brief announcement: the problem based benchmark suite

Julian Shun; Guy E. Blelloch; Jeremy T. Fineman; Phillip B. Gibbons; Aapo Kyrola; Harsha Vardhan Simhadri; Kanat Tangwongsan

This announcement describes the problem based benchmark suite (PBBS). PBBS is a set of benchmarks designed for comparing parallel algorithmic approaches, parallel programming language styles, and machine architectures across a broad set of problems. Each benchmark is defined concretely in terms of a problem specification and a set of input distributions. No requirements are made in terms of algorithmic approach, programming language, or machine architecture. The goal of the benchmarks is not only to compare runtimes, but also to be able to compare code and other aspects of an implementation (e.g., portability, robustness, determinism, and generality). As such, the code for an implementation of a benchmark is as important as its runtime, and the public PBBS repository will include both code and performance results. The benchmarks are designed to make it easy for others to try their own implementations, or to add new benchmark problems. Each benchmark problem includes the problem specification, the specification of input and output file formats, default input generators, test codes that check the correctness of the output for a given input, driver code that can be linked with implementations, a baseline sequential implementation, a baseline multicore implementation, and scripts for running timings (and checks) and outputting the results in a standard format. The current suite includes the following problems: integer sort, comparison sort, remove duplicates, dictionary, breadth-first search, spanning forest, minimum spanning forest, maximal independent set, maximal matching, k-nearest neighbors, Delaunay triangulation, convex hull, suffix arrays, n-body, and ray casting. For each problem, we report the performance of our baseline multicore implementation on a 40-core machine.


ACM Symposium on Parallel Algorithms and Architectures | 2005

Concurrent cache-oblivious B-trees

Michael A. Bender; Jeremy T. Fineman; Seth Gilbert; Bradley C. Kuszmaul

This paper presents concurrent cache-oblivious (CO) B-trees. We extend the cache-oblivious model to a parallel or distributed setting and present three concurrent CO B-trees. Our first data structure is a concurrent lock-based exponential CO B-tree. This data structure supports insertions and non-blocking searches/successor queries. The second and third data structures are lock-based and lock-free variations, respectively, on the packed-memory CO B-tree. These data structures support range queries and deletions in addition to the other operations. Each data structure achieves the same serial performance as the original data structure on which it is based. In a concurrent setting, we show that these data structures are linearizable, meaning that completed operations appear to an outside viewer as though they occurred in some serialized order. The lock-based data structures are also deadlock free, and the lock-free data structure guarantees forward progress by at least one process.


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2012

Internally deterministic parallel algorithms can be fast

Guy E. Blelloch; Jeremy T. Fineman; Phillip B. Gibbons; Julian Shun

The virtues of deterministic parallelism have been argued for decades and many forms of deterministic parallelism have been described and analyzed. Here we are concerned with one of the strongest forms, requiring that for any input there is a unique dependence graph representing a trace of the computation annotated with every operation and value. This has been referred to as internal determinism, and implies a sequential semantics; that is, considering any sequential traversal of the dependence graph is sufficient for analyzing the correctness of the code. In addition to returning deterministic results, internal determinism has many advantages including ease of reasoning about the code, ease of verifying correctness, ease of debugging, ease of defining invariants, ease of defining good coverage for testing, and ease of formally, informally, and experimentally reasoning about performance. On the other hand, one needs to consider the possible downsides of determinism, which might include making algorithms (i) more complicated, unnatural, or special purpose and/or (ii) slower or less scalable. In this paper we study the effectiveness of this strong form of determinism through a broad set of benchmark problems. Our main contribution is to demonstrate that for this wide body of problems, there exist efficient internally deterministic algorithms, and moreover that these algorithms are natural to reason about and not complicated to code. We leverage an approach to determinism suggested by Steele (1990), which is to use nested parallelism with commutative operations. Our algorithms apply several diverse programming paradigms that fit within the model including (i) a strict functional style (no shared state among concurrent operations), (ii) an approach we refer to as deterministic reservations, and (iii) the use of commutative, linearizable operations on data structures. We describe algorithms for the benchmark problems that use these deterministic approaches and present performance results on a 32-core machine. Perhaps surprisingly, for all problems, our internally deterministic algorithms achieve good speedup and good performance even relative to prior nondeterministic solutions.
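
As a rough illustration of the "deterministic reservations" approach named above, the sequential emulation below processes pending iterates in rounds: every iterate writes its index into each cell it needs (lowest index wins), and only iterates that hold all of their reservations commit. The helper names reserve_cells and commit are hypothetical, not from the paper; the committed set in each round, and hence the final result, is deterministic.

```python
def deterministic_reservations(num_iters, reserve_cells, commit):
    """reserve_cells(i) -> cells iterate i touches; commit(i) applies it."""
    pending = list(range(num_iters))
    while pending:
        owner = {}
        for i in pending:                    # reserve phase (parallel in spirit)
            for c in reserve_cells(i):
                if owner.get(c, i) >= i:     # lowest iterate index wins the cell
                    owner[c] = i
        still_pending = []
        for i in pending:                    # commit phase
            if all(owner[c] == i for c in reserve_cells(i)):
                commit(i)                    # won every reservation: safe to run
            else:
                still_pending.append(i)      # lost a cell: retry next round
        pending = still_pending              # the lowest pending index always commits
```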


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2008

Nested parallelism in transactional memory

Kunal Agrawal; Jeremy T. Fineman; Jim Sukha

This paper investigates adding transactions with nested parallelism and nested transactions to a dynamically multithreaded parallel programming language that generates only series-parallel programs. We describe XConflict, a data structure that facilitates conflict detection for a software transactional memory system which supports transactions with nested parallelism and unbounded nesting depth. For languages that use a Cilk-like work-stealing scheduler, XConflict answers concurrent conflict queries in O(1) time and can be maintained efficiently. In particular, for a program with T1 work and a span (or critical-path length) of T∞, the running time on p processors of the program augmented with XConflict is only O(T1/p + pT∞). Using XConflict, we describe CWSTM, a runtime-system design for software transactional memory which supports transactions with nested parallelism and unbounded nesting depth of transactions. The CWSTM design provides transactional memory with eager updates, eager conflict detection, strong atomicity, and lazy cleanup on aborts. In the restricted case when no transactions abort and there are no concurrent readers, CWSTM executes a transactional computation on p processors also in time O(T1/p + pT∞). Although this bound holds only under rather optimistic assumptions, to our knowledge, this result is the first theoretical performance bound on a TM system that supports transactions with nested parallelism which is independent of the maximum nesting depth of transactions.


ACM Symposium on Parallel Algorithms and Architectures | 2012

Greedy sequential maximal independent set and matching are parallel on average

Guy E. Blelloch; Jeremy T. Fineman; Julian Shun

The greedy sequential algorithm for maximal independent set (MIS) loops over the vertices in an arbitrary order, adding a vertex to the resulting set if and only if no previous neighboring vertex has been added. In this loop, as in many sequential loops, each iterate depends on only a subset of the previous iterates (i.e., knowing that any one of a vertex's previous neighbors is in the MIS, or knowing that it has no previous neighbors, is sufficient to decide its fate one way or the other). This leads to a dependence structure among the iterates. If this structure is shallow, then running the iterates in parallel while respecting the dependencies can lead to an efficient parallel implementation mimicking the sequential algorithm. In this paper, we show that for any graph, and for a random ordering of the vertices, the dependence length of the sequential greedy MIS algorithm is polylogarithmic (O(log^2 n) with high probability). Our results extend previous results that show polylogarithmic bounds only for random graphs. We show similar results for greedy maximal matching (MM). For both problems we describe simple linear-work parallel algorithms based on the approach. The algorithms allow for a smooth tradeoff between more parallelism and reduced work, but always return the same result as the sequential greedy algorithms. We present experimental results that demonstrate efficiency and the tradeoff between work and parallelism.
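
The sequential algorithm under analysis is short enough to state as code. A minimal sketch, assuming an adjacency-dict representation of the graph: visit the vertices in a random order and keep a vertex iff none of its neighbors has already been kept. The paper's parallel algorithm commits ready prefixes of this order in rounds but returns exactly this set.

```python
import random

def greedy_mis(adj):
    """adj: dict mapping each vertex to an iterable of its neighbors."""
    order = list(adj)
    random.shuffle(order)                    # the random ordering being analyzed
    in_mis = set()
    for v in order:                          # earlier vertices are already decided
        if not any(u in in_mis for u in adj[v]):
            in_mis.add(v)                    # no earlier neighbor kept: keep v
    return in_mis
```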


ACM Transactions on Algorithms | 2016

A New Approach to Incremental Cycle Detection and Related Problems

Michael A. Bender; Jeremy T. Fineman; Seth Gilbert; Robert Endre Tarjan

We consider the problem of detecting a cycle in a directed graph that grows by arc insertions, and the related problems of maintaining a topological order and the strong components of such a graph. For these problems, we give two algorithms, one suited to sparse graphs and the other to dense graphs. The former takes O(min{m^{1/2}, n^{2/3}} · m) time to insert m arcs into an n-vertex graph; the latter takes O(n^2 log n) time. Our sparse algorithm is substantially simpler than a previous O(m^{3/2})-time algorithm; it is also faster on graphs of sufficient density. The time bound of our dense algorithm beats the previously best time bound of O(n^{5/2}) for dense graphs. Our algorithms rely for their efficiency on vertex numberings weakly consistent with topological order: we allow ties. Bounds on the size of the numbers give bounds on running time.
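
For contrast with these bounds, a sketch of the naive baseline the paper improves on (this is not the paper's algorithm): with no maintained vertex numbering, inserting arc (u, v) requires a fresh search to decide whether u is already reachable from v, costing O(m) per insertion and O(m^2) over all insertions.

```python
def insert_arc(adj, u, v):
    """adj: dict vertex -> set of successors. Returns False if (u, v) closes a cycle."""
    stack, seen = [v], {v}
    while stack:                             # DFS from v, looking for u
        w = stack.pop()
        if w == u:
            return False                     # path v ->* u exists: reject the arc
        for x in adj.get(w, ()):
            if x not in seen:
                seen.add(x)
                stack.append(x)
    adj.setdefault(u, set()).add(v)          # safe: no cycle is created
    return True
```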


ACM Symposium on Parallel Algorithms and Architectures | 2015

Sorting with Asymmetric Read and Write Costs

Guy E. Blelloch; Jeremy T. Fineman; Phillip B. Gibbons; Yan Gu; Julian Shun

Emerging memory technologies have a significant gap between the cost, both in time and in energy, of writing to memory versus reading from memory. In this paper we present models and algorithms that account for this difference, with a focus on write-efficient sorting algorithms. First, we consider the PRAM model with asymmetric write cost, and show that sorting can be performed in O(n) writes, O(n log n) reads, and logarithmic depth (parallel time). Next, we consider a variant of the External Memory (EM) model that charges k > 1 for writing a block of size B to the secondary memory, and present variants of three EM sorting algorithms (multi-way merge sort, sample sort, and heap sort using buffer trees) that asymptotically reduce the number of writes over the original algorithms and perform roughly k block reads for every block write. Finally, we define a variant of the Ideal-Cache model with asymmetric write costs, and present write-efficient, cache-oblivious parallel algorithms for sorting, FFTs, and matrix multiplication. Adapting prior bounds for work-stealing and parallel-depth-first schedulers to the asymmetric setting, these yield provably good bounds for parallel machines with private caches or with a shared cache, respectively.
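
A quick way to see the asymmetry the models capture is to count element writes for two classic sorts: merge sort performs Θ(n log n) writes because every merge pass rewrites its subarray, while selection sort performs O(n) writes at the price of Θ(n^2) reads. The instrumented Python below illustrates the cost model only; it is not one of the paper's algorithms.

```python
def merge_sort_writes(a):
    """Sort a in place and return the number of element writes performed."""
    writes = 0
    def rec(lo, hi):
        nonlocal writes
        if hi - lo <= 1:
            return
        mid = (lo + hi) // 2
        rec(lo, mid)
        rec(mid, hi)
        merged = sorted(a[lo:hi])            # stand-in for a linear-time merge
        for i, x in enumerate(merged):       # every merge pass rewrites the range
            a[lo + i] = x
            writes += 1
    rec(0, len(a))
    return writes

def selection_sort_writes(a):
    """Sort a in place using at most n swaps, i.e. O(n) writes."""
    writes = 0
    for i in range(len(a)):
        j = min(range(i, len(a)), key=a.__getitem__)   # many reads, no writes
        if j != i:
            a[i], a[j] = a[j], a[i]
            writes += 2
    return writes
```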


Symposium on Principles of Database Systems | 2014

Cost-oblivious storage reallocation

Michael A. Bender; Sándor P. Fekete; Jeremy T. Fineman; Seth Gilbert

Databases allocate and free blocks of storage on disk. Freed blocks introduce holes where no data is stored. Allocation systems attempt to reuse such deallocated regions in order to minimize the footprint on disk. When previously allocated blocks cannot be moved, this problem is called the memory allocation problem. It is known to have a logarithmic overhead in the footprint size. This paper defines the storage reallocation problem, where previously allocated blocks can be moved, or reallocated, but at some cost. This cost is determined by the allocation/reallocation cost function. The algorithms presented here are cost oblivious, in that they work for a broad and reasonable class of cost functions, even when they do not know what the cost function actually is. The objective is to minimize the storage footprint, that is, the largest memory address containing an allocated object, while simultaneously minimizing the reallocation costs. This paper gives asymptotically optimal algorithms for storage reallocation, in which the storage footprint is at most (1+ε) times optimal, and the reallocation cost is at most O((1/ε)log(1/ε)) times the original allocation cost, which is asymptotically optimal for constant ε. The algorithms are cost oblivious, which means they achieve these bounds with no knowledge of the allocation/reallocation cost function, as long as the cost function is subadditive.
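
To make the objective concrete, here is a toy policy (an illustration of the footprint/reallocation trade-off, not the paper's cost-oblivious algorithm) that bump-allocates at the end of memory and compacts whenever the footprint exceeds (1 + ε) times the live volume, charging a move cost proportional to the bytes relocated.

```python
class ToyReallocator:
    """Keep footprint within (1 + eps) of live size by occasional compaction."""

    def __init__(self, eps=0.5):
        self.eps = eps
        self.blocks = {}                     # object id -> (address, size)
        self.end = 0                         # footprint: one past the last byte
        self.live = 0                        # total size of live objects
        self.move_cost = 0                   # reallocation cost paid so far

    def alloc(self, oid, size):
        self.blocks[oid] = (self.end, size)  # bump-allocate at the end
        self.end += size
        self.live += size

    def free(self, oid):
        _, size = self.blocks.pop(oid)
        self.live -= size                    # freeing leaves a hole behind
        if self.end > (1 + self.eps) * max(self.live, 1):
            self._compact()

    def _compact(self):
        addr = 0                             # slide every block to the left
        for oid, (old, size) in sorted(self.blocks.items(), key=lambda kv: kv[1]):
            if old != addr:
                self.move_cost += size       # cost model: pay |block| per move
            self.blocks[oid] = (addr, size)
            addr += size
        self.end = addr
```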

Collaboration


Dive into Jeremy T. Fineman's collaborations.

Top Co-Authors

Seth Gilbert (National University of Singapore)
Guy E. Blelloch (Carnegie Mellon University)
Julian Shun (University of California)
Kunal Agrawal (Washington University in St. Louis)
Yan Gu (Carnegie Mellon University)
Sándor P. Fekete (Braunschweig University of Technology)