
Publication


Featured research published by Jeremy W. Sheaffer.


IEEE International Symposium on Workload Characterization | 2009

Rodinia: A benchmark suite for heterogeneous computing

Shuai Che; Michael Boyer; Jiayuan Meng; David Tarjan; Jeremy W. Sheaffer; Sang-Ha Lee; Kevin Skadron

This paper presents and characterizes Rodinia, a benchmark suite for heterogeneous computing. To help architects study emerging platforms such as GPUs (Graphics Processing Units), Rodinia includes applications and kernels that target multi-core CPU and GPU platforms. The choice of applications is inspired by Berkeley's dwarf taxonomy. Our characterization shows that the Rodinia benchmarks cover a wide range of parallel communication patterns, synchronization techniques, and power consumption, and have led to some important architectural insights, such as the growing importance of memory-bandwidth limitations and the consequent importance of data layout.


Journal of Parallel and Distributed Computing | 2008

A performance study of general-purpose applications on graphics processors using CUDA

Shuai Che; Michael Boyer; Jiayuan Meng; David Tarjan; Jeremy W. Sheaffer; Kevin Skadron

Graphics processors (GPUs) provide a vast number of simple, data-parallel, deeply multithreaded cores and high memory bandwidth. GPU architectures are becoming increasingly programmable, offering the potential for dramatic speedups for a variety of general-purpose applications compared to contemporary general-purpose processors (CPUs). This paper uses NVIDIA's C-like CUDA language and an engineering sample of their recently introduced GTX 260 GPU to explore the effectiveness of GPUs for a variety of application types, and describes some specific coding idioms that improve their performance on the GPU. GPU performance is compared to both single-core and multicore CPU performance, with multicore CPU implementations written using OpenMP. The paper also discusses advantages and inefficiencies of the CUDA programming model and some desirable features that might improve ease of use and more readily support a larger body of applications.
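As one illustration of the kind of coding idiom the paper alludes to (this example is generic, not taken from the paper), a common CUDA pattern of that era stages data in on-chip shared memory and reduces it cooperatively, so that each block issues coalesced global loads and a single global store:

    // Generic CUDA idiom (not the paper's code): block-level sum
    // reduction staged in shared memory. Assumes blockDim.x is a power
    // of two and a launch with blockDim.x * sizeof(float) shared bytes.
    __global__ void block_sum(const float *in, float *out, int n)
    {
        extern __shared__ float buf[];
        int tid = threadIdx.x;
        int i   = blockIdx.x * blockDim.x + threadIdx.x;
        buf[tid] = (i < n) ? in[i] : 0.0f;       // coalesced global load
        __syncthreads();
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s)
                buf[tid] += buf[tid + s];        // tree reduction on chip
            __syncthreads();
        }
        if (tid == 0)
            out[blockIdx.x] = buf[0];            // one global store per block
    }

The payoff is that all but one global-memory access per element is replaced by low-latency shared-memory traffic, which is exactly the kind of restructuring the paper reports as necessary to approach peak GPU performance.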


Symposium on Application Specific Processors | 2008

Accelerating Compute-Intensive Applications with GPUs and FPGAs

Shuai Che; Jie Li; Jeremy W. Sheaffer; Kevin Skadron; John Lach

Accelerators are special-purpose processors designed to speed up compute-intensive sections of applications. Two extreme endpoints in the spectrum of possible accelerators are FPGAs and GPUs, which can often achieve better performance than CPUs on certain workloads. FPGAs are highly customizable, while GPUs provide massive parallel execution resources and high memory bandwidth. Applications typically exhibit vastly different performance characteristics depending on the accelerator. This is an inherent problem attributable to the architectural design, middleware support, and programming style of the target platform. For the best application-to-accelerator mapping, factors such as programmability, performance, programming cost, and sources of overhead in the design flows must all be taken into consideration. In general, FPGAs provide the best expectation of performance, flexibility, and low overhead, while GPUs tend to be easier to program and require fewer hardware resources. We present a performance study of three diverse applications - Gaussian elimination, Data Encryption Standard (DES), and Needleman-Wunsch - on an FPGA, a GPU, and a multicore CPU system. We perform a comparative study of application behavior on the accelerators, considering performance and code complexity. Based on our results, we present a mapping from application characteristics to accelerator platforms, which can aid developers in selecting an appropriate target architecture for their chosen application.
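To make one of the workloads concrete, here is a minimal sketch of how a forward-elimination step of Gaussian elimination maps onto a GPU. The two-kernel split mirrors the structure later used in Rodinia's gaussian benchmark, but the code below is an illustration, not the study's actual source:

    // Illustrative sketch (not the paper's code): one elimination step
    // for pivot k on an n x n row-major matrix a with RHS vector b.
    // Kernel 1: compute the multiplier for each row below the pivot.
    __global__ void fan1(const float *a, float *m, int n, int k)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // offset below pivot
        if (i < n - k - 1)
            m[(k + 1 + i) * n + k] = a[(k + 1 + i) * n + k] / a[k * n + k];
    }

    // Kernel 2: subtract multiplier * pivot row from every row below it.
    __global__ void fan2(float *a, float *b, const float *m, int n, int k)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // row offset
        int j = blockIdx.y * blockDim.y + threadIdx.y;   // column offset
        if (i < n - k - 1 && j < n - k) {
            a[(k + 1 + i) * n + (k + j)] -=
                m[(k + 1 + i) * n + k] * a[k * n + (k + j)];
            if (j == 0)
                b[k + 1 + i] -= m[(k + 1 + i) * n + k] * b[k];
        }
    }

The host launches this pair once per pivot (k = 0 .. n-2) and finishes with back-substitution; the per-pivot launch overhead and the dependence between steps are examples of the platform-specific behavior the study compares across the FPGA, GPU, and CPU.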


IEEE International Symposium on Workload Characterization | 2010

A characterization of the Rodinia benchmark suite with comparison to contemporary CMP workloads

Shuai Che; Jeremy W. Sheaffer; Michael Boyer; Lukasz G. Szafaryn; Liang Wang; Kevin Skadron

The recently released Rodinia benchmark suite enables users to evaluate heterogeneous systems including both accelerators, such as GPUs, and multicore CPUs. As Rodinia sees higher levels of acceptance, it becomes important that researchers understand this new set of benchmarks, especially in how they differ from previous work. In this paper, we present recent extensions to Rodinia and conduct a detailed characterization of the Rodinia benchmarks (including performance results on an NVIDIA GeForce GTX 480, the first product released based on the Fermi architecture). We also compare and contrast Rodinia with PARSEC to gain insights into the similarities and differences of the two benchmark collections; we apply principal component analysis to analyze the application-space coverage of the two suites. Our analysis shows that many of the workloads in Rodinia and PARSEC are complementary, capturing different aspects of certain performance metrics.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2011

Dymaxion: optimizing memory access patterns for heterogeneous systems

Shuai Che; Jeremy W. Sheaffer; Kevin Skadron

Graphics processors (GPUs) have emerged as an important platform for general-purpose computing. GPUs offer a large number of parallel cores and have access to high memory bandwidth; however, data-structure layouts in GPU memory often lead to sub-optimal performance for programs designed with a CPU memory interface, or no particular memory interface at all, in mind. This implies that application performance is highly sensitive to irregularity in memory access patterns. This issue is all the more important due to the growing disparity between core and DRAM clocks; memory interfaces have increasingly become bottlenecks in computer systems. In this paper, we propose a simple API, Dymaxion, that allows programmers to optimize memory mappings to improve the efficiency of memory accesses on heterogeneous platforms. Use of Dymaxion requires only minimal modifications to existing CUDA programs. Our current framework extends NVIDIA's CUDA API with the addition of memory-layout remapping and index transformation. We consider the overhead of layout remapping and effectively hide it through chunking and overlapping with PCI-E transfers. We present the implementation of Dymaxion and its optimizations and evaluate a variety of important memory access patterns. Using four case studies, we are able to achieve 3.3× speedup on GPU kernels and 20% overall performance improvement, including the PCI-E transfer, over the original CUDA implementations on an NVIDIA GTX 480 GPU. We also explore the importance of maintaining per-device data layouts and cross-device data mappings with a case study of concurrent CPU-GPU execution.
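Dymaxion's actual API is not reproduced here, but the kind of remapping it automates is easy to sketch. A classic case is converting an array-of-structures (the natural CPU layout) into a structure-of-arrays so that consecutive GPU threads reading the same field touch consecutive addresses; the Particle type and field names below are a hypothetical example:

    // Hypothetical example of the layout remapping Dymaxion automates:
    // AoS (CPU-friendly) to SoA (coalesced on the GPU).
    struct Particle { float x, y, z, w; };   // array-of-structures element

    __global__ void aos_to_soa(const Particle *in, float *x, float *y,
                               float *z, float *w, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            Particle p = in[i];    // strided read, paid once
            x[i] = p.x;            // subsequent kernels now see
            y[i] = p.y;            // unit-stride, coalescible arrays
            z[i] = p.z;
            w[i] = p.w;
        }
    }

The paper hides the cost of such a remap by performing it chunk by chunk and overlapping each chunk with the PCI-E transfer of the next, for example via asynchronous copies on CUDA streams.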


SIGGRAPH/Eurographics Conference on Graphics Hardware | 2004

A flexible simulation framework for graphics architectures

Jeremy W. Sheaffer; David Luebke; Kevin Skadron

In this paper we describe a multipurpose tool for analysis of the performance characteristics of computer graphics hardware and software. We are developing Qsilver, a highly configurable micro-architectural simulator of the GPU that uses the Chromium system's ability to intercept and redirect an OpenGL stream. The simulator produces an annotated trace of graphics commands using Chromium, then runs the trace through a cycle-timer model to evaluate time-dependent behaviors of the various functional units. We demonstrate the use of Qsilver on a simple hypothetical architecture to analyze performance bottlenecks, to explore new GPU microarchitectures, and to model power and leakage properties. One innovation we explore is the use of dynamic voltage scaling across multiple clock domains to achieve significant energy savings at almost negligible performance cost. Finally, we discuss how other architectural features and experiments might be incorporated into the Qsilver framework.
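The abstract does not show Qsilver's internals, but the cycle-timer idea can be illustrated in a few lines: replay annotated trace records against a model of how long each functional unit is busy. The record shape and unit names below are invented for illustration, not taken from Qsilver:

    // Purely illustrative cycle-timer skeleton; not Qsilver's actual model.
    enum Unit { VERTEX, RASTER, FRAGMENT, N_UNITS };

    struct TraceRecord {
        enum Unit unit;   // which functional unit the command occupies
        long cycles;      // annotated cost of the command
    };

    // Replay the trace, serializing work on each unit; the slowest
    // unit's finish time approximates the total cycle count.
    long simulate(const struct TraceRecord *trace, int n)
    {
        long busy[N_UNITS] = {0};
        for (int i = 0; i < n; ++i)
            busy[trace[i].unit] += trace[i].cycles;
        long total = 0;
        for (int u = 0; u < N_UNITS; ++u)
            if (busy[u] > total)
                total = busy[u];
        return total;
    }

A model of this shape makes the bottleneck unit explicit, which is how trace-driven simulation exposes where a hypothetical architecture stalls.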


International Conference on Computer Graphics and Interactive Techniques | 2007

A hardware redundancy and recovery mechanism for reliable scientific computation on graphics processors

Jeremy W. Sheaffer; David Luebke; Kevin Skadron

General-purpose computation on graphics processors (GPGPU) has rapidly evolved since the introduction of commodity programmable graphics hardware. With the appearance of GPGPU computation-oriented APIs such as AMD's Close to the Metal (CTM) and NVIDIA's Compute Unified Device Architecture (CUDA), we begin to see GPU vendors putting financial stakes into this once-niche, non-graphics market. Major supercomputing installations are building GPGPU clusters to take advantage of massively parallel floating-point capabilities, and Folding@Home has even released a GPU port of its protein-folding distributed computation client. But in order for GPGPU to truly become important to the supercomputing community, vendors will have to address the heretofore unimportant reliability concerns of graphics processors. We present a hardware redundancy-based approach to reliability for general-purpose computation on GPUs that requires minimal change to existing GPU architectures. Upon detecting an error, the system invokes an automatic recovery mechanism that recomputes only the erroneous results. Our results show that our technique imposes less than a 1.5× performance penalty and saves energy for GPGPU, but is completely transparent to general graphics and does not affect the performance of the games that drive the market.
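The paper's mechanism lives in hardware, but its detect-then-recompute-only-what-failed idea has a simple software analogue. The sketch below (hypothetical, not the paper's design) compares two redundant runs of a kernel and flags disagreeing outputs so that only those elements need recomputation:

    // Software analogue of redundancy-based error detection: compare two
    // independent runs and mark elements that disagree for recomputation.
    __global__ void mark_mismatches(const float *run_a, const float *run_b,
                                    int *needs_redo, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            needs_redo[i] = (run_a[i] != run_b[i]);  // suspected transient fault
    }

A selective third pass then re-executes only the flagged indices, which is why recovery stays cheap when faults are rare.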


International Symposium on Performance Analysis of Systems and Software | 2005

Studying Thermal Management for Graphics-Processor Architectures

Jeremy W. Sheaffer; Kevin Skadron; David Luebke

We have previously presented Qsilver, a flexible simulation system for graphics architectures. In this paper we describe our extensions to this system, which we use, instrumented with a power model and HotSpot, to analyze the application of standard CPU static and runtime thermal management techniques to the GPU. We describe experiments implementing clock gating, fetch gating, dynamic voltage scaling, multiple clock domains, and permuted floorplanning on the GPU using our simulation environment, and demonstrate that these techniques are beneficial in the GPU domain. Further, we show that the inherent parallelism of GPU workloads enables significant thermal gains on chips designed employing static floorplan repartitioning.
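A one-line, first-order model (standard in the architecture literature, not specific to this paper) shows why dynamic voltage scaling is so attractive for thermally limited hardware:

    P_dyn ≈ α · C · V² · f,    and since attainable f ∝ V,  P_dyn ∝ V³

Lowering voltage buys a roughly cubic reduction in dynamic power at only a linear cost in clock frequency, which is why applying DVS per clock domain on a parallel GPU workload can save substantial energy with little performance loss.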


International Conference on Computer Graphics and Interactive Techniques | 2006

The visual vulnerability spectrum: characterizing architectural vulnerability for graphics hardware

Jeremy W. Sheaffer; David Luebke; Kevin Skadron

With shrinking process technology, the primary cause of transient faults in semiconductors shifts away from high-energy cosmic particle strikes and toward more mundane and pervasive causes: power fluctuations, crosstalk, and other random noise. Smaller transistor features require a lower critical charge to hold and change bits, which leads to faster microprocessors, but also to higher transient fault rates. Current trends, expected to continue, show soft error rates increasing exponentially at a rate of 8% per technology generation. Existing transient-fault research in general-purpose architecture, like the well-established architectural vulnerability factor (AVF), assumes that all computations are equally important and all errors equally intolerable. However, we observe that the effect of transient faults in graphics processing can range from imperceptible, to bothersome visual artifacts, to critical loss of function. We therefore extend and generalize the AVF by introducing the Visual Vulnerability Spectrum (VVS). We apply the VVS to analyze the effect of increased transient error rates on graphics processors. With this analysis in hand, we suggest several targeted, inexpensive solutions that can mitigate the most egregious of soft-error consequences.
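For reference, the AVF that the VVS generalizes is conventionally defined over a structure's ACE bits, the bits required for Architecturally Correct Execution; this is the standard definition from the general-purpose literature, not new material from this paper:

    AVF(structure) = (average number of ACE bits resident in the
                      structure per cycle) / (total bits in the structure)

The VVS, in effect, relaxes AVF's binary assumption that every corrupted bit matters equally, grading faults by the severity of their visual consequences instead.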


International Parallel and Distributed Processing Symposium | 2012

Robust SIMD: Dynamically Adapted SIMD Width and Multi-Threading Depth

Jiayuan Meng; Jeremy W. Sheaffer; Kevin Skadron

Architectures that aggressively exploit SIMD often have many data paths executing in lockstep and use multi-threading to hide latency. They can yield high throughput in terms of area and energy efficiency for many data-parallel applications. To balance productivity and performance, many recent SIMD organizations incorporate implicit cache hierarchies; examples include Intel's MIC, AMD's Fusion, and NVIDIA's Fermi. However, unlike the software-managed streaming memories used in conventional graphics processors (GPUs), hardware-managed caches are more disruptive to SIMD execution; therefore, the interaction between implicit caching and aggressive SIMD execution may no longer follow the conventional wisdom gained from streaming memories. We show that due to more frequent memory latency divergence, lower latency in non-L1 data accesses, and relatively unpredictable L1 contention, cache hierarchies favor different SIMD widths and multi-threading depths than streaming memories. In fact, because these effects are subject to runtime dynamics, a fixed combination of SIMD width and multi-threading depth no longer works ubiquitously across diverse applications or when cache capacities are reduced due to pollution or power saving. To address these issues and reduce design risk, this paper proposes Robust SIMD, which provides wide SIMD and then dynamically adjusts SIMD width and multi-threading depth according to performance feedback. Robust SIMD can trade wider SIMD for deeper multi-threading by splitting a wider SIMD group into multiple narrower SIMD groups. Compared to the performance generated by running every benchmark on its individually preferred SIMD organization, the same Robust SIMD organization performs similarly (sometimes even better, due to phase adaptation) and outperforms the best fixed SIMD organization by 17%. When D-cache capacity is reduced due to runtime disruptiveness, Robust SIMD offers graceful performance degradation: with 25% of the cache lines polluted in a 32 KB D-cache, Robust SIMD performs 1.4× better than a conventional SIMD architecture.
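The adjustment itself is a hardware mechanism, but the width-for-depth trade can be sketched as a simple feedback policy. Everything below (the hill-climbing rule, the bounds) is hypothetical; the paper's actual policy and thresholds are not given in the abstract:

    // Hypothetical feedback policy illustrating Robust SIMD's trade-off
    // between SIMD width and multi-threading depth (not the paper's design).
    #define MIN_WIDTH 4
    #define MAX_WIDTH 64

    void adapt(int *simd_width, int *mt_depth,
               double ipc_now, double ipc_prev)
    {
        if (ipc_now < ipc_prev && *simd_width > MIN_WIDTH) {
            *simd_width /= 2;   // split each group into two narrower groups
            *mt_depth   *= 2;   // twice as many groups to schedule
        } else if (ipc_now > ipc_prev && *simd_width < MAX_WIDTH) {
            *simd_width *= 2;   // re-merge when wide execution is winning
            *mt_depth   /= 2;
        }
    }

Note that the total thread count is conserved across the trade: narrower groups simply expose more independent schedulable entities to hide memory latency divergence.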

Collaboration


Dive into Jeremy W. Sheaffer's collaborations.

Top Co-Authors

Shuai Che

Advanced Micro Devices

Jiayuan Meng

Argonne National Laboratory

Jie Li

University of Virginia
