Publication


Featured research published by Jacob Nelson.


International Symposium on Microarchitecture | 2013

Approximate storage in solid-state memories

Adrian Sampson; Jacob Nelson; Karin Strauss; Luis Ceze

Memories today expose an all-or-nothing correctness model that incurs significant costs in performance, energy, area, and design complexity. But not all applications need high-precision storage for all of their data structures all of the time. This paper proposes mechanisms that enable applications to store data approximately and shows that doing so can improve the performance, lifetime, or density of solid-state memories. We propose two mechanisms. The first allows errors in multi-level cells by reducing the number of programming pulses used to write them. The second mechanism mitigates wear-out failures and extends memory endurance by mapping approximate data onto blocks that have exhausted their hardware error correction resources. Simulations show that reduced-precision writes in multi-level phase-change memory cells can be 1.7x faster on average and using failed blocks can improve array lifetime by 23% on average with quality loss under 10%.
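The first mechanism trades write precision for speed. As a rough illustration (not the authors' simulator), the sketch below models a multi-level cell whose program-and-verify pulses each shrink a Gaussian write error; capping the pulse count speeds up the write at the cost of a higher raw bit-cell error rate. The cell model, pulse count, and error parameters are assumptions for illustration.

```python
# Minimal sketch of the first mechanism: trading write precision for speed in
# multi-level cells. This is NOT the authors' simulator; it assumes a toy
# program-and-verify model where each pulse shrinks a Gaussian write error.
import random

LEVELS = 4  # 2-bit multi-level cell


def write_cell(target_level, max_pulses, sigma0=0.6, shrink=0.5):
    """Iteratively program a cell toward target_level.

    Each pulse reduces the residual error; capping max_pulses models the
    reduced-precision (approximate) write.
    """
    value = target_level + random.gauss(0, sigma0)
    for _ in range(max_pulses - 1):
        # program-and-verify: nudge the cell back toward the target
        value = target_level + (value - target_level) * shrink \
                + random.gauss(0, sigma0 * shrink)
    return value


def read_cell(analog_value):
    """Quantize the analog cell value back to the nearest level."""
    return min(LEVELS - 1, max(0, round(analog_value)))


def error_rate(max_pulses, trials=10_000):
    errors = 0
    for _ in range(trials):
        level = random.randrange(LEVELS)
        if read_cell(write_cell(level, max_pulses)) != level:
            errors += 1
    return errors / trials


if __name__ == "__main__":
    for pulses in (1, 2, 4, 8):
        print(f"{pulses} pulses -> raw cell error rate ~ {error_rate(pulses):.3f}")
```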


Architectural Support for Programming Languages and Operating Systems | 2011

RCDC: a relaxed consistency deterministic computer

Joseph Devietti; Jacob Nelson; Tom Bergan; Luis Ceze; Dan Grossman

Providing deterministic execution significantly simplifies the debugging, testing, replication, and deployment of multithreaded programs. Recent work has developed deterministic multiprocessor architectures as well as compiler and runtime systems that enforce determinism in current hardware. Such work has incidentally imposed strong memory-ordering properties. Historically, memory ordering has been relaxed in favor of higher performance in shared memory multiprocessors and, interestingly, determinism exacerbates the cost of strong memory ordering. Consequently, we argue that relaxed memory ordering is vital to achieving faster deterministic execution. This paper introduces RCDC, a deterministic multiprocessor architecture that takes advantage of relaxed memory orderings to provide high-performance deterministic execution with low hardware complexity. RCDC has two key innovations: a hybrid HW/SW approach to enforcing determinism; and a new deterministic execution strategy that leverages data-race-free-based memory models (e.g., the models for Java and C++) to improve performance and scalability without sacrificing determinism, even in the presence of races. In our hybrid HW/SW approach, the only hardware mechanisms required are software-controlled store buffering and support for precise instruction counting; we do not require speculation. A runtime system uses these mechanisms to enforce determinism for arbitrary programs. We evaluate RCDC using PARSEC benchmarks and show that relaxing memory ordering leads to performance and scalability close to nondeterministic execution without requiring any form of speculation. We also compare our new execution strategy to one based on TSO (total-store-ordering) and show that some applications benefit significantly from the extra relaxation. We also evaluate a software-only implementation of our new deterministic execution strategy.
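A toy illustration of the two mechanisms the runtime relies on, software-controlled store buffering and deterministic quanta driven by instruction counting, is sketched below. It is not the RCDC hardware; the quantum size, instruction format, and round-robin schedule are assumptions for illustration.

```python
# Toy illustration (not the RCDC hardware) of store buffering plus
# deterministic quantum boundaries: thread interleaving is fixed by
# instruction count, so even racy programs produce the same result every run.
QUANTUM = 3  # deterministic number of instructions per quantum (assumed)


def run_deterministic(programs):
    """programs: list of per-thread instruction lists.

    Each instruction is ('store', addr, val) or ('load', addr, dest).
    Stores are buffered privately and published only at quantum boundaries,
    in a fixed thread order, which keeps execution deterministic.
    """
    shared = {}                              # shared memory
    buffers = [dict() for _ in programs]     # software-controlled store buffers
    regs = [dict() for _ in programs]
    pcs = [0] * len(programs)

    while any(pc < len(p) for pc, p in zip(pcs, programs)):
        # Round-robin over threads, one quantum each: a deterministic schedule.
        for tid, prog in enumerate(programs):
            for _ in range(QUANTUM):
                if pcs[tid] >= len(prog):
                    break
                op, addr, x = prog[pcs[tid]]
                if op == 'store':
                    buffers[tid][addr] = x   # buffered, not yet globally visible
                else:  # load: forward from own buffer, else read shared memory
                    regs[tid][x] = buffers[tid].get(addr, shared.get(addr, 0))
                pcs[tid] += 1
        # Quantum boundary: publish store buffers in a fixed thread order.
        for buf in buffers:
            shared.update(buf)
            buf.clear()
    return shared, regs


if __name__ == "__main__":
    t0 = [('store', 'x', 1), ('load', 'y', 'r0')]
    t1 = [('store', 'y', 1), ('load', 'x', 'r1')]
    print(run_deterministic([t0, t1]))  # identical output on every run
```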


High-Performance Computer Architecture | 2015

SNNAP: Approximate computing on programmable SoCs via neural acceleration

Thierry Moreau; Mark Wyse; Jacob Nelson; Adrian Sampson; Hadi Esmaeilzadeh; Luis Ceze; Mark Oskin

Many applications that can take advantage of accelerators are amenable to approximate execution. Past work has shown that neural acceleration is a viable way to accelerate approximate code. In light of the growing availability of on-chip field-programmable gate arrays (FPGAs), this paper explores neural acceleration on off-the-shelf programmable SoCs. We describe the design and implementation of SNNAP, a flexible FPGA-based neural accelerator for approximate programs. SNNAP is designed to work with a compiler workflow that configures the neural network's topology and weights instead of the programmable logic of the FPGA itself. This approach enables effective use of neural acceleration in commercially available devices and accelerates different applications without costly FPGA reconfigurations. No hardware expertise is required to accelerate software with SNNAP, so the effort required can be substantially lower than custom hardware design for an FPGA fabric and possibly even lower than with current “C-to-gates” high-level synthesis (HLS) tools. Our measurements on a Xilinx Zynq FPGA show that SNNAP yields a geometric mean of 3.8× speedup (as high as 38.1×) and 2.8× energy savings (as high as 28×) with less than 10% quality loss across all applications but one. We also compare SNNAP with designs generated by commercial HLS tools and show that SNNAP has similar performance overall, with better resource-normalized throughput on 4 out of 7 benchmarks.
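The sketch below illustrates the general neural-acceleration idea the paper builds on: an approximable code region is replaced by a small neural network whose topology and weights are what get configured. The target function, the 2-8-1 topology, and the training loop are illustrative assumptions, not the SNNAP compiler workflow or hardware interface.

```python
# Rough sketch of neural acceleration: replace an "expensive" approximable
# function with a small MLP. On SNNAP the trained network would be evaluated
# by the FPGA accelerator; here everything runs in NumPy for illustration.
import numpy as np

rng = np.random.default_rng(0)


def original_kernel(x):
    """The approximable region of code (stand-in example)."""
    return np.sin(x[:, 0]) * np.cos(x[:, 1])


# --- offline phase: the "compiler" trains a tiny MLP on sampled inputs ------
X = rng.uniform(-np.pi, np.pi, size=(4096, 2))
y = original_kernel(X).reshape(-1, 1)

W1 = rng.normal(0, 0.5, (2, 8)); b1 = np.zeros(8)   # assumed 2-8-1 topology
W2 = rng.normal(0, 0.5, (8, 1)); b2 = np.zeros(1)
lr = 0.05
for _ in range(3000):
    h = np.tanh(X @ W1 + b1)              # hidden layer
    pred = h @ W2 + b2                    # linear output
    err = pred - y
    # full-batch gradient descent on mean-squared error
    gW2 = h.T @ err / len(X); gb2 = err.mean(0)
    dh = (err @ W2.T) * (1 - h ** 2)
    gW1 = X.T @ dh / len(X); gb1 = dh.mean(0)
    W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2


def nn_invoke(x):
    """Runtime replacement: evaluate the configured network instead of the
    original code region."""
    return np.tanh(x @ W1 + b1) @ W2 + b2


if __name__ == "__main__":
    test = rng.uniform(-np.pi, np.pi, size=(5, 2))
    print("exact :", original_kernel(test))
    print("approx:", nn_invoke(test).ravel())
```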


ACM Transactions on Computer Systems | 2014

Approximate Storage in Solid-State Memories

Adrian Sampson; Jacob Nelson; Karin Strauss; Luis Ceze

Memories today expose an all-or-nothing correctness model that incurs significant costs in performance, energy, area, and design complexity. But not all applications need high-precision storage for all of their data structures all of the time. This paper proposes mechanisms that enable applications to store data approximately and shows that doing so can improve the performance, lifetime, or density of solid-state memories. We propose two mechanisms. The first allows errors in multi-level cells by reducing the number of programming pulses used to write them. The second mechanism mitigates wear-out failures and extends memory endurance by mapping approximate data onto blocks that have exhausted their hardware error correction resources. Simulations show that reduced-precision writes in multi-level phase-change memory cells can be 1.7× faster on average and using failed blocks can improve array lifetime by 23% on average with quality loss under 10%.
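Complementing the reduced-precision write sketch under the conference version above, the sketch below illustrates the second mechanism from a software point of view: blocks whose error-correction resources are exhausted are normally retired, but can still hold approximate data, extending array lifetime. The block and allocator abstractions are assumptions for illustration, not the paper's interface.

```python
# Sketch of the second mechanism: an allocator that places precise data only
# on fully-correctable blocks, while approximate data may also land on blocks
# whose ECC resources are exhausted. Illustrative abstraction, not the paper's.
import random


class Block:
    def __init__(self, failed_bits=0):
        self.failed_bits = failed_bits      # stuck bits ECC can no longer hide

    @property
    def precise_ok(self):
        return self.failed_bits == 0


class ApproxAllocator:
    """Precise data requires correctable blocks; approximate data may reuse
    worn-out blocks, extending the array's useful lifetime."""

    def __init__(self, blocks):
        self.blocks = blocks

    def allocate(self, approximate):
        candidates = self.blocks if approximate else \
            [b for b in self.blocks if b.precise_ok]
        if not candidates:
            raise MemoryError("no usable block for this precision class")
        return random.choice(candidates)


if __name__ == "__main__":
    # Half the array has worn out: precise data loses half its capacity,
    # approximate data can still use every block.
    blocks = [Block(failed_bits=(i % 2)) for i in range(8)]
    alloc = ApproxAllocator(blocks)
    print("precise ->", alloc.allocate(approximate=False).failed_bits, "failed bits")
    print("approx  ->", alloc.allocate(approximate=True).failed_bits, "failed bits")
```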


IEEE International Conference on High Performance Computing, Data, and Analytics | 2016

Comparing Runtime Systems with Exascale Ambitions Using the Parallel Research Kernels

Rob F. Van der Wijngaart; Abdullah Kayi; Jeff R. Hammond; Gabriele Jost; Tom St. John; Srinivas Sridharan; Timothy G. Mattson; John Abercrombie; Jacob Nelson

We use three Parallel Research Kernels to compare the performance of a set of programming models (we use the term programming model as it is commonly used in the application community; a more accurate term is programming environment, the collective of the abstract programming model, its embodiment in an Application Programmer Interface (API), and the runtime that implements it): MPI1 (MPI two-sided communication), MPIOPENMP (MPI+OpenMP), MPISHM (MPI1 with MPI-3 interprocess shared memory), MPIRMA (MPI one-sided communication), SHMEM, UPC, Charm++ and Grappa. The kernels in our study – Stencil, Synch_p2p and Transpose – underlie a wide range of computational science applications. They enable direct probing of properties of programming models, especially communication and synchronization. In contrast to mini- or proxy applications, the PRK allow for rapid implementation, measurement and verification. Our experimental results show MPISHM to be the overall winner, with MPI1, MPIOPENMP and SHMEM performing well. MPISHM and MPIOPENMP outperform the other models in the strong-scaling limit due to their effective use of shared memory and good granularity control. The non-evolutionary models Grappa and Charm++ are not competitive with traditional models (MPI and PGAS) for two of the kernels; these models favor irregular algorithms, while the PRK considered here are regular.
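For readers unfamiliar with the kernels, the sketch below gives minimal serial versions of two of them (Stencil and Transpose) just to show the access patterns being probed. The actual Parallel Research Kernels are distributed, tuned, and self-verifying; the stencil weights used here are an assumption.

```python
# Minimal serial sketch of the Stencil and Transpose access patterns; the real
# PRK implementations are parallel and self-verifying, and Synch_p2p (a
# pipelined wavefront) is omitted here.
import numpy as np


def star_stencil(grid, radius=1):
    """Apply a star-shaped stencil: each point accumulates weighted neighbours
    within `radius` along its row and column (weights are illustrative)."""
    out = np.zeros_like(grid)
    for r in range(1, radius + 1):
        w = 1.0 / (2.0 * r * radius)
        out[:, :-r] += w * grid[:, r:]   # east neighbours
        out[:, r:] += w * grid[:, :-r]   # west neighbours
        out[:-r, :] += w * grid[r:, :]   # south neighbours
        out[r:, :] += w * grid[:-r, :]   # north neighbours
    return out


def transpose(a, b):
    """Transpose kernel: return b + a^T (the PRK version accumulates so that
    stale results are detectable during verification)."""
    return b + a.T


if __name__ == "__main__":
    n = 512
    grid = np.arange(n * n, dtype=float).reshape(n, n)
    print("stencil interior value:", star_stencil(grid)[1, 1])
    b = transpose(grid, np.zeros((n, n)))
    print("transpose check:", np.allclose(b, grid.T))
```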


2015 9th International Conference on Partitioned Global Address Space Programming Models | 2015

Using the Parallel Research Kernels to Study PGAS Models

Rob F. Van der Wijngaart; Srinivas Sridharan; Abdullah Kayi; Gabriele Jost; Jeff R. Hammond; Timothy G. Mattson; Jacob Nelson

A subset of the Parallel Research Kernels (PRK), simplified parallel application patterns, is used to study the behavior of different runtimes implementing the PGAS programming model. The goal of this paper is to show that such an approach is practical and effective as we approach the exascale era. Our experimental results indicate that for the kernels we selected, MPI with two-sided communication outperforms the PGAS runtimes SHMEM, UPC, Grappa, and MPI-3 with RMA extensions.


Symposium on Cloud Computing | 2018

Parameter Hub: a Rack-Scale Parameter Server for Distributed Deep Neural Network Training

Liang Luo; Jacob Nelson; Luis Ceze; Amar Phanishayee; Arvind Krishnamurthy

Distributed deep neural network (DDNN) training constitutes an increasingly important workload that frequently runs in the cloud. Larger DNN models and faster compute engines are shifting DDNN training bottlenecks from computation to communication. This paper characterizes DDNN training to precisely pinpoint these bottlenecks. We found that timely training requires high-performance parameter servers (PSs) with optimized network stacks and gradient processing pipelines, as well as server and network hardware with balanced computation and communication resources. We therefore propose PHub, a high-performance, multi-tenant, rack-scale PS design. PHub co-designs the PS software and hardware to accelerate rack-level and hierarchical cross-rack parameter exchange, with an API compatible with many DDNN training frameworks. PHub provides a performance improvement of up to 2.7× compared to state-of-the-art cloud-based distributed training techniques for image classification workloads, with 25% better throughput per dollar.
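The sketch below shows the basic parameter-server exchange pattern that PHub accelerates: workers push gradients, the server aggregates and updates the model, and workers pull fresh parameters. It is a single-process toy with made-up keys and gradients; PHub's contribution lies in co-designing the network stack, gradient pipeline, and rack-scale hardware around this pattern.

```python
# Toy, single-process illustration of the parameter-server exchange pattern:
# push gradients, aggregate, update, pull. Not PHub's API or implementation.
import numpy as np

rng = np.random.default_rng(0)


class ParameterServer:
    def __init__(self, shapes, lr=0.1):
        self.params = {k: rng.normal(size=s) for k, s in shapes.items()}
        self.lr = lr
        self.pending = {k: [] for k in shapes}

    def push(self, key, grad):
        """A worker pushes its gradient for one parameter key."""
        self.pending[key].append(grad)

    def synchronize(self, num_workers):
        """Once every worker has pushed, aggregate and apply the update."""
        for key, grads in self.pending.items():
            assert len(grads) == num_workers
            self.params[key] -= self.lr * np.mean(grads, axis=0)
            grads.clear()

    def pull(self, key):
        """A worker pulls the fresh parameter value."""
        return self.params[key]


if __name__ == "__main__":
    shapes = {"w": (4,), "b": (1,)}
    ps = ParameterServer(shapes)
    num_workers = 3
    for step in range(5):
        for _ in range(num_workers):       # each worker pushes its gradients
            for key in shapes:
                ps.push(key, rng.normal(size=shapes[key]))
        ps.synchronize(num_workers)        # server aggregates and updates
    print({k: v.round(3) for k, v in ps.params.items()})
```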


USENIX Annual Technical Conference | 2015

Latency-tolerant software distributed shared memory

Jacob Nelson; Brandon Holt; Brandon Myers; Preston Briggs; Luis Ceze; Simon Kahan; Mark Oskin


USENIX Conference on Hot Topics in Parallelism | 2011

Crunching large graphs with commodity processors

Jacob Nelson; Brandon Myers; A. H. Hunter; Preston Briggs; Luis Ceze; Carl Ebeling; Dan Grossman; Simon Kahan; Mark Oskin


Networked Systems Design and Implementation | 2017

Evaluating the Power of Flexible Packet Processing for Network Resource Allocation

Naveen Kr. Sharma; Antoine Kaufmann; Thomas E. Anderson; Arvind Krishnamurthy; Jacob Nelson; Simon Peter

Collaboration


Dive into Jacob Nelson's collaborations.

Top Co-Authors

Luis Ceze (University of Washington)
Adrian Sampson (University of Washington)
Dan Grossman (University of Washington)
Liang Luo (University of Washington)
Mark Oskin (University of Washington)
Abdullah Kayi (George Washington University)
Brandon Myers (University of Washington)