
Publication


Featured research published by Jeffrey A. Stuecheli.


IBM Journal of Research and Development | 2011

IBM POWER7 multicore server processor

Balaram Sinharoy; Ronald Nick Kalla; William J. Starke; Hung Q. Le; R. Cargnoni; J. A. Van Norstrand; B. J. Ronchetti; Jeffrey A. Stuecheli; Jens Leenstra; G. L. Guthrie; D. Q. Nguyen; Bart Blaner; C. F. Marino; E. Retter; Peter Williams

The IBM POWER® processor is the dominant reduced instruction set computing microprocessor in the world today, with a rich history of implementation and innovation over the last 20 years. In this paper, we describe the key features of the POWER7® processor chip. On the chip is an eight-core processor, with each core capable of four-way simultaneous multithreaded operation. Fabricated in IBM's 45-nm silicon-on-insulator (SOI) technology with 11 levels of metal, the chip contains more than one billion transistors. The processor core and caches are significantly enhanced to boost the performance of both single-threaded, response-time-oriented applications and multithreaded, throughput-oriented applications. The memory subsystem contains three levels of on-chip cache, with SOI embedded dynamic random access memory (DRAM) devices used as the last level of cache. A new memory interface using buffered double-data-rate-three (DDR3) DRAM and improvements in reliability, availability, and serviceability are also discussed.


international symposium on microarchitecture | 2011

Minimalist open-page: a DRAM page-mode scheduling policy for the many-core era

Dimitris Kaseridis; Jeffrey A. Stuecheli; Lizy Kurian John

Contemporary DRAM systems have maintained impressive scaling by managing a careful balance between performance, power, and storage density. In achieving these goals, a significant sacrifice has been made in DRAM's operational complexity. To realize good performance, systems must properly manage the significant number of structural and timing restrictions of the DRAM devices. DRAM's use is further complicated in many-core systems, where the memory interface is shared among multiple cores/threads competing for memory bandwidth.
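As a concrete illustration of the policy space this paper targets, the sketch below keeps a DRAM row open only long enough to serve a small burst of row hits before precharging, sitting between a pure open-page and a pure closed-page policy. This is a minimal sketch under assumptions: the hit budget, the bank state, and all names are hypothetical and not taken from the paper.

```cpp
#include <cstdint>
#include <iostream>

// Hypothetical per-bank state for a row-buffer policy in the spirit of
// Minimalist Open-Page: keep a row open just long enough to capture the
// short burst of hits a single stream generates, then precharge.
struct BankState {
    int64_t open_row = -1;   // -1: bank precharged
    int     hits_left = 0;   // remaining row-hit budget for the open row
};

constexpr int kHitBudget = 4;  // assumed burst length, not from the paper

// Returns true if the access was a row hit.
bool access(BankState& b, int64_t row) {
    if (b.open_row == row && b.hits_left > 0) {
        if (--b.hits_left == 0) b.open_row = -1;  // auto-precharge after budget
        return true;
    }
    // Row miss: activate the new row with a fresh (small) hit budget.
    b.open_row = row;
    b.hits_left = kHitBudget - 1;  // this access consumes one hit
    return false;
}

int main() {
    BankState bank;
    int64_t rows[] = {7, 7, 7, 7, 7, 9};  // a short burst, then a conflict
    for (int64_t r : rows)
        std::cout << "row " << r << (access(bank, r) ? " hit\n" : " miss\n");
}
```

The small budget captures the spatial locality of one stream while keeping the bank free for competing threads, which is the balance a many-core page-mode policy has to strike.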


international symposium on computer architecture | 2010

The virtual write queue: coordinating DRAM and last-level cache policies

Jeffrey A. Stuecheli; Dimitris Kaseridis; David Daly; Hillery C. Hunter; Lizy Kurian John

In computer architecture, caches have primarily been viewed as a means to hide memory latency from the CPU. Cache policies have focused on anticipating the CPU's data needs, and are mostly oblivious to the main memory. In this paper, we demonstrate that the era of many-core architectures has created new main memory bottlenecks and mandates a new approach: coordination of cache policy with main memory characteristics. Using the cache for memory optimization purposes, we propose a Virtual Write Queue, which dramatically expands the memory controller's visibility of processor behavior at low implementation overhead. Through memory-centric modification of existing policies, such as scheduled writebacks, this paper demonstrates that the performance-limiting effects of highly threaded architectures can be overcome. We show that through awareness of the physical main memory layout and by focusing on writes, both read and write average latency can be shortened, memory power reduced, and overall system performance improved. Through full-system cycle-accurate simulations of SPEC CPU2006, we demonstrate that the proposed Virtual Write Queue achieves an average 10.9% system-level throughput improvement on memory-intensive workloads, along with an overall reduction of 8.7% in memory power across the whole suite.
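The coordination described above can be pictured as a batched drain: the controller treats dirty lines near the LRU position of the last-level cache as an extended write queue and writes back, in one burst, lines that map to the same DRAM row, so the burst pays for a single row activation. The sketch below shows only that grouping step, with a hypothetical address mapping; it is not the paper's actual mechanism.

```cpp
#include <cstdint>
#include <iostream>
#include <map>
#include <vector>

// A minimal sketch of the "scheduled writeback" idea: scan the LLC lines
// nearest LRU and drain together those that fall in the same DRAM row.
// The address-mapping fields below are assumed, not the paper's layout.
struct DirtyLine { uint64_t addr; };

struct RowId {
    uint64_t bank, row;
    bool operator<(const RowId& o) const {
        return bank != o.bank ? bank < o.bank : row < o.row;
    }
};

RowId map_addr(uint64_t a) {            // assumed mapping: 64-byte lines,
    uint64_t line = a >> 6;             // 8 banks, 128 lines per row
    return {line % 8, line / (8 * 128)};
}

// Group LRU-region dirty lines by DRAM row and return the fullest group.
std::vector<DirtyLine> pick_batch(const std::vector<DirtyLine>& lru_region) {
    std::map<RowId, std::vector<DirtyLine>> groups;
    for (const auto& d : lru_region) groups[map_addr(d.addr)].push_back(d);
    std::vector<DirtyLine> best;
    for (auto& [row, lines] : groups)
        if (lines.size() > best.size()) best = lines;
    return best;
}

int main() {
    std::vector<DirtyLine> lru = {{0x1000}, {0x1040}, {0x9000}, {0x1080}};
    std::cout << "drain batch of " << pick_batch(lru).size() << " lines\n";
}
```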


international symposium on microarchitecture | 2010

Elastic Refresh: Techniques to Mitigate Refresh Penalties in High Density Memory

Jeffrey A. Stuecheli; Dimitris Kaseridis; Hillery C. Hunter; Lizy Kurian John

High-density memory is becoming more important as many execution streams are consolidated onto single-chip many-core processors. DRAM is ubiquitous as a main memory technology, but while DRAM's per-chip density and frequency continue to scale, the time required to refresh its dynamic cells has grown at an alarming rate. This paper shows how currently employed methods to schedule refresh operations are ineffective in mitigating the significant performance degradation caused by longer refresh times. Current approaches are deficient: they do not effectively exploit the flexibility of DRAMs to postpone refresh operations. This work proposes dynamically reconfigurable predictive mechanisms that exploit the full dynamic range allowed in the JEDEC DDRx SDRAM specifications. The proposed mechanisms are shown to mitigate much of the penalty seen with dense DRAM devices. We refer to the overall scheme as Elastic Refresh, in that the refresh policy is stretched to fit the currently executing workload, such that the maximum benefit of the DRAM flexibility is realized. We extend the GEMS-on-Simics toolset to include Elastic Refresh. Simulations show the proposed solution provides a 10% average performance improvement over existing techniques across the entire SPEC CPU suite, and up to a 41% improvement for certain workloads.
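The slack the mechanism exploits comes from the DDRx specifications, which allow a controller to postpone a bounded number of refresh commands (up to eight) as long as the average refresh rate is maintained. A minimal sketch follows, assuming a simple idle-detection threshold as the predictor; the paper's predictive mechanisms are more sophisticated than this.

```cpp
#include <iostream>

// Sketch of the Elastic Refresh idea: defer refresh while the rank is busy,
// spend the backlog during idle gaps, and force a refresh only when the
// postpone budget is exhausted. Threshold and tick granularity are assumed.
struct ElasticRefresh {
    int pending = 0;              // postponed refreshes (JEDEC cap: 8)
    int idle_ticks = 0;
    static constexpr int kMaxPending = 8;
    static constexpr int kIdleThreshold = 4;  // assumed idle-detect delay

    // Called at each refresh interval boundary (tREFI).
    void refresh_due() { ++pending; }

    // Called every tick; 'busy' = rank has demand requests queued.
    // Returns true when a refresh should be issued now.
    bool tick(bool busy) {
        idle_ticks = busy ? 0 : idle_ticks + 1;
        if (pending == 0) return false;
        bool forced = pending >= kMaxPending;        // out of slack
        bool idle   = idle_ticks >= kIdleThreshold;  // predicted gap
        if (forced || idle) { --pending; return true; }
        return false;
    }
};

int main() {
    ElasticRefresh er;
    er.refresh_due();
    for (int t = 0; t < 8; ++t) {
        bool busy = t < 3;  // a busy burst, then idle
        if (er.tick(busy))
            std::cout << "refresh issued at tick " << t << "\n";
    }
}
```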


IBM Journal of Research and Development | 2002

Functional verification of the POWER4 microprocessor and POWER4 multiprocessor systems

John M. Ludden; Wolfgang Roesner; G. M. Heiling; J. R. Reysa; Jonathan R. Jackson; B.-L. Chu; Michael L. Behm; Jason R. Baumgartner; R. D. Peterson; J. Abdulhafiz; W. E. Bucy; J. H. Klaus; D. J. Klema; T. N. Le; F. D. Lewis; P. E. Milling; L. A. McConville; B. S. Nelson; Viresh Paruthi; T. W. Pouarz; A. D. Romonosky; Jeffrey A. Stuecheli; K. D. Thompson; D. W. Victor; Bruce Wile

This paper describes the methods and simulation techniques used to verify the microarchitecture design and functional performance of the IBM POWER4 processor and the POWER4-based Regatta system. The approach was hierarchical, based on, but considerably expanding, the practice used for verification of the CMOS-based IBM S/390 Parallel Enterprise Server™ G4. For POWER4, verification began at the abstract, high-level design phase and continued through the designer and unit levels, the multi-unit level, and finally the multiple-chip system level. The abstract (high-level design) phase permitted early validation of the POWER4 processor design prior to its commitment to HDL. The designer and unit-level stages focused on ensuring the correctness of the microarchitectural components. Multi-unit-level verification, performed on storage and I/O components as well as on the processor, confirmed architectural compliance for each of the chips and subsystems. Finally, system-level verification tested multiprocessor coherence and system-level function, including processor-to-I/O communication and validation of multiple hardware configurations. In parallel with design and functional validation, verification of reliability functions, performance, and degraded configurations was also performed at most of the levels in the hierarchy.


IBM Journal of Research and Development | 2015

CAPI: A Coherent Accelerator Processor Interface

Jeffrey A. Stuecheli; Bart Blaner; Charles Ray Johns; Michael S. Siegel

Heterogeneous computing systems combine different types of compute elements that share memory. A specific class of heterogeneous systems discussed in this paper pairs traditional general-purpose processing cores and accelerator units. While this arrangement enables significant gains in application performance, device driver overheads and operating system code path overheads can become prohibitive. The I/O interface of a processor chip is a well-suited attachment point from a system design perspective, in that standard server models can be augmented with application-specific accelerators. However, traditional I/O attachment protocols introduce significant device driver and operating system software latencies. With the Coherent Accelerator Processor Interface (CAPI), we enable attaching an accelerator as a coherent CPU peer over the I/O physical interface. The CPU peer features consist of a homogeneous virtual address space across the CPU and accelerator, and hardware-managed caching of this shared data on the I/O device. This attachment method greatly increases the opportunities for acceleration due to the much shorter software path length required to enable its use compared to a traditional I/O model.
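The software-visible consequence is that work can be handed to the accelerator by pointer, with no buffer marshaling through a device driver. The sketch below is conceptual only: an ordinary thread stands in for the accelerator function unit, and the work-element layout is invented for illustration; the real attachment runs over CAPI hardware with user-space support such as IBM's libcxl library, neither of which is shown here.

```cpp
#include <cstddef>
#include <iostream>
#include <thread>
#include <vector>

// Conceptual sketch: because CAPI gives the accelerator the application's
// virtual address space, a work descriptor can carry raw pointers that both
// sides dereference directly, with coherence managed in hardware.
struct WorkElement {         // assumed descriptor layout, for illustration
    const float* src;        // virtual addresses, valid on both sides
    float*       dst;
    size_t       n;
};

void accelerator(const WorkElement& we) {   // stand-in for the AFU
    for (size_t i = 0; i < we.n; ++i)
        we.dst[i] = we.src[i] * 2.0f;       // dereferences app pointers directly
}

int main() {
    std::vector<float> in(4, 1.5f), out(4);
    WorkElement we{in.data(), out.data(), in.size()};
    std::thread afu(accelerator, we);       // "attach" the accelerator
    afu.join();
    std::cout << "out[0] = " << out[0] << "\n";  // no copies, no driver call
}
```

The short software path the abstract refers to is exactly this: enqueue a descriptor and go, versus system calls, pinning, and DMA mapping in a traditional I/O model.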


international symposium on computer architecture | 2013

Understanding and mitigating refresh overheads in high-density DDR4 DRAM systems

Janani Mukundan; Hillery C. Hunter; Kyu-hyoun Kim; Jeffrey A. Stuecheli; Jose F. Martinez

Recent DRAM specifications exhibit increasing refresh latencies. A refresh command blocks a full rank, decreasing the available parallelism in the memory subsystem significantly, thus decreasing performance. Fine Granularity Refresh (FGR) is a feature recently announced as part of JEDEC's DDR4 DRAM specification that attempts to tackle this problem by creating a range of refresh options that provide a trade-off between refresh latency and frequency. In this paper, we first conduct an analysis of DDR4 DRAM's FGR feature and show that there is no one-size-fits-all option across a variety of applications. We then present Adaptive Refresh (AR), a simple yet effective mechanism that dynamically chooses the best FGR mode for each application and phase within the application. When looking at the refresh problem more closely, we identify in high-density DRAM systems a phenomenon that we call command queue seizure, whereby the memory controller's command queue seizes up temporarily because it is full with commands to a rank that is being refreshed. To attack this problem, we propose two complementary mechanisms called Delayed Command Expansion (DCE) and Preemptive Command Drain (PCD). Our results show that AR does exploit DDR4's FGR effectively. However, once our proposed DCE and PCD mechanisms are added, DDR4's FGR becomes redundant in most cases, except in a few highly memory-sensitive applications, where the use of AR does provide some additional benefit. In all, our simulations show that the proposed mechanisms yield 8% (14%) mean speedup with respect to traditional refresh, at normal (extended) DRAM operating temperatures, for a set of diverse parallel applications.
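One way to realize per-phase mode selection is epoch-based sampling: briefly run each FGR mode, measure a fitness statistic, and stay with the winner until the phase changes. The sketch below assumes IPC as that statistic; the abstract does not state which metric AR actually uses, so treat this as an illustrative reduction rather than the paper's mechanism.

```cpp
#include <array>
#include <iostream>

// Epoch-based mode selection in the spirit of Adaptive Refresh: DDR4 FGR
// offers 1x/2x/4x modes trading refresh latency against frequency, and no
// single mode wins everywhere, so sample each mode once and keep the best.
enum class FgrMode { x1, x2, x4 };

struct AdaptiveRefresh {
    std::array<double, 3> ipc{};   // measured IPC per mode (assumed metric)
    int sampling = 0;              // index of mode currently being sampled
    bool done = false;
    FgrMode best = FgrMode::x1;

    // Called at the end of each epoch with the IPC observed in it.
    FgrMode end_epoch(double observed_ipc) {
        if (!done) {
            ipc[sampling] = observed_ipc;
            if (++sampling == 3) {              // sampled all modes once
                done = true;
                int b = 0;
                for (int m = 1; m < 3; ++m)
                    if (ipc[m] > ipc[b]) b = m;
                best = static_cast<FgrMode>(b);
            } else {
                return static_cast<FgrMode>(sampling);  // sample next mode
            }
        }
        return best;  // exploit the winner (re-sampling on phase change omitted)
    }
};

int main() {
    AdaptiveRefresh ar;
    double epoch_ipc[] = {1.10, 1.25, 1.18, 1.24};  // hypothetical measurements
    for (double s : epoch_ipc)
        std::cout << "next mode: " << static_cast<int>(ar.end_epoch(s)) << "\n";
}
```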


IBM Journal of Research and Development | 2015

The cache and memory subsystems of the IBM POWER8 processor

William J. Starke; Jeffrey A. Stuecheli; David Daly; John Steven Dodson; Florian A. Auernhammer; Patricia M. Sagmeister; Guy Lynn Guthrie; Charles F. Marino; Michael S. Siegel; Bart Blaner

In this paper, we describe the IBM POWER8™ cache, interconnect, memory, and input/output subsystems, collectively referred to as the “nest.” This paper focuses on the enhancements made to the nest to achieve balanced and scalable designs, ranging from small 12-core single-socket systems, up to large 16-processor-socket, 192-core enterprise rack servers. A key aspect of the design has been increasing the end-to-end data and coherence bandwidth of the system, now featuring more than twice the bandwidth of the POWER7® processor. The paper describes the new memory-buffer chip, called Centaur, providing up to 128 MB of eDRAM (embedded dynamic random-access memory) buffer cache per processor, along with an improved DRAM (dynamic random-access memory) scheduler with support for prefetch and write optimizations, providing industry-leading memory bandwidth combined with low memory latency. It also describes new coherence-transport enhancements and the transition to directly integrated PCIe® (PCI Express®) support, as well as additions to the cache subsystem to support higher levels of virtualization and scalability including snoop filtering and cache sharing.


high-performance computer architecture | 2010

A bandwidth-aware memory-subsystem resource management using non-invasive resource profilers for large CMP systems

Dimitris Kaseridis; Jeffrey A. Stuecheli; Jian Chen; Lizy Kurian John

By integrating multiple cores in a single chip, Chip Multiprocessors (CMP) provide an attractive approach to improve both system throughput and efficiency. This integration allows the sharing of on-chip resources, which may lead to destructive interference between the executing workloads. The memory subsystem is an important shared resource that contributes significantly to the overall throughput and power consumption. In order to prevent destructive interference, the cache capacity and memory bandwidth requirements of the last-level cache have to be controlled. While previously proposed schemes focus on resource sharing within a chip, we explore additional possibilities both inside and outside a single chip. We propose a dynamic memory-subsystem resource management scheme that considers both cache capacity and memory bandwidth contention in large multi-chip CMP systems. Our approach uses low-overhead, non-invasive resource profilers that are based on Mattson's stack distance algorithm to project each core's resource requirements and guide our cache partitioning algorithms. Our bandwidth-aware algorithm seeks throughput optimizations among multiple chips by migrating workloads from the most resource-overcommitted chips to the ones with more available resources. Use of bandwidth as a criterion results in an overall 18% reduction in memory bandwidth along with a 7.9% reduction in miss rate, compared to existing resource management schemes. Using a cycle-accurate full-system simulator, our approach achieved an average improvement of 8.5% on throughput.
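Mattson's stack distance algorithm, which the profilers build on, has the useful property that one pass over a reference stream yields the hit count of every LRU cache size at once: the depth at which a reference hits in an LRU stack is its stack distance, and a histogram of those depths is a reuse profile. Below is a simplified, fully associative software version; the paper's profilers are non-invasive hardware structures, not this code.

```cpp
#include <cstdint>
#include <iostream>
#include <list>
#include <map>

// Single-pass stack distance profiling: on each reference, record the LRU
// stack depth at which the line is found, then move it to the MRU position.
struct StackProfiler {
    std::list<uint64_t> stack;        // MRU at front
    std::map<size_t, size_t> hist;    // stack distance -> count
    size_t cold = 0;                  // first-touch misses

    void access(uint64_t line) {
        size_t depth = 0;
        for (auto it = stack.begin(); it != stack.end(); ++it, ++depth) {
            if (*it == line) {
                hist[depth]++;             // hit at this stack depth
                stack.erase(it);
                stack.push_front(line);    // move to MRU
                return;
            }
        }
        ++cold;
        stack.push_front(line);
    }

    // Hits achieved by a fully associative LRU cache of 'size' lines.
    size_t hits_for_size(size_t size) const {
        size_t h = 0;
        for (const auto& [d, c] : hist)
            if (d < size) h += c;
        return h;
    }
};

int main() {
    StackProfiler p;
    for (uint64_t line : {1, 2, 3, 1, 2, 3, 4, 1})
        p.access(line);
    for (size_t s : {1, 2, 3, 4})
        std::cout << "size " << s << ": " << p.hits_for_size(s) << " hits\n";
}
```

Because one histogram covers all cache sizes, a partitioning algorithm can read off each core's projected miss rate for any allocation without re-running the workload.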


international conference on parallel processing | 2009

Bank-aware Dynamic Cache Partitioning for Multicore Architectures

Dimitris Kaseridis; Jeffrey A. Stuecheli; Lizy Kurian John

As Chip-Multiprocessor systems (CMP) have become the predominant topology for leading microprocessors, critical components of the system are now integrated on a single chip. This enables sharing of computation resources that was not previously possible. In addition, the virtualization of these computational resources exposes the system to a mix of diverse and competing workloads. The cache is a resource of primary concern, as it can be dominant in controlling overall throughput. In order to prevent destructive interference between divergent workloads, the last level of cache must be partitioned. Many solutions have been proposed in the past, but most of them assume either simplified cache hierarchies with no realistic restrictions or complex cache schemes that are difficult to integrate into a real design. To address this problem, we propose a dynamic partitioning strategy based on realistic last-level cache designs of CMP processors. We used a cycle-accurate, full-system simulator based on Simics and GEMS to evaluate our partitioning scheme on an 8-core DNUCA CMP system. Results for an 8-core system show that our proposed scheme provides on average a 70% reduction in misses compared to non-partitioned shared caches, and a 25% reduction in misses compared to static, equally partitioned (private) caches.
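The allocation loop at the heart of such partitioning schemes can be sketched as greedy, utility-based way assignment: each cache way goes to the core whose profiled miss curve improves most from one more way. The sketch below omits the bank-placement constraints that make the paper's scheme bank-aware, and the miss curves are hypothetical profiler output, so read it as the underlying idea rather than the proposed design.

```cpp
#include <iostream>
#include <vector>

// Greedy utility-based way partitioning: hand out the W ways one at a time,
// each to the core with the largest marginal miss reduction. misses[c][k] is
// the profiled miss count of core c when given k ways; each curve must have
// at least ways+1 entries.
std::vector<int> partition(const std::vector<std::vector<long>>& misses, int ways) {
    size_t cores = misses.size();
    std::vector<int> alloc(cores, 0);
    for (int w = 0; w < ways; ++w) {
        size_t best = 0;
        long best_gain = -1;
        for (size_t c = 0; c < cores; ++c) {
            long gain = misses[c][alloc[c]] - misses[c][alloc[c] + 1];
            if (gain > best_gain) { best_gain = gain; best = c; }
        }
        ++alloc[best];   // grant this way to the highest-utility core
    }
    return alloc;
}

int main() {
    // Hypothetical miss curves (misses vs. allocated ways, k = 0..4).
    std::vector<std::vector<long>> misses = {
        {900, 400, 200, 150, 140},   // cache-friendly core
        {500, 480, 470, 465, 462},   // streaming core: extra ways barely help
    };
    auto alloc = partition(misses, 4);
    std::cout << "core0: " << alloc[0] << " ways, core1: " << alloc[1] << " ways\n";
}
```

The streaming core ends up with few ways because its curve is nearly flat, which is exactly the destructive-interference case the partitioning is meant to prevent.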
