Publication


Featured research published by Jude A. Rivers.


international symposium on computer architecture | 2009

Scalable high performance main memory system using phase-change memory technology

Moinuddin K. Qureshi; Vijayalakshmi Srinivasan; Jude A. Rivers

The memory subsystem accounts for a significant cost and power budget of a computer system. Current DRAM-based main memory systems are starting to hit the power and cost limit. An alternative memory technology that uses resistance contrast in phase-change materials is being actively investigated in the circuits community. Phase Change Memory (PCM) devices offer higher density than DRAM and can help increase the main memory capacity of future systems while remaining within the cost and power constraints. In this paper, we analyze a PCM-based hybrid main memory system using an architecture-level model of PCM. We explore the trade-offs for a main memory system consisting of PCM storage coupled with a small DRAM buffer. Such an architecture has the latency benefits of DRAM and the capacity benefits of PCM. Our evaluations for a baseline system of 16 cores with 8GB DRAM show that, on average, PCM can reduce page faults by 5X and provide a speedup of 3X. As PCM is projected to have limited write endurance, we also propose simple organizational and management solutions for the hybrid memory that reduce the write traffic to PCM, boosting its lifetime from 3 years to 9.7 years.
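
As an illustration of the DRAM-buffer-in-front-of-PCM organization described above, the following sketch models write filtering in such a hierarchy. It is a minimal toy model, not the paper's simulator; the page granularity, buffer capacity, LRU eviction policy, and the HybridMemory class are assumptions made only for this example.

```python
# Minimal sketch (not the paper's simulator) of a small DRAM buffer in front of
# PCM storage: writes that hit in the buffer stay there, and PCM cells are only
# written when a dirty page is evicted, which is what extends PCM lifetime.
from collections import OrderedDict

class HybridMemory:
    def __init__(self, dram_buffer_pages=1024):
        self.dram = OrderedDict()          # page -> dirty flag, in LRU order
        self.capacity = dram_buffer_pages
        self.pcm_writes = 0                # writes that actually reach PCM cells

    def access(self, page, is_write):
        if page in self.dram:              # hit in the DRAM buffer
            self.dram.move_to_end(page)
            if is_write:
                self.dram[page] = True     # mark dirty; the write stays in DRAM
            return
        # Miss: fetch the page from PCM into the DRAM buffer.
        if len(self.dram) >= self.capacity:
            victim, dirty = self.dram.popitem(last=False)
            if dirty:
                self.pcm_writes += 1       # only dirty evictions write to PCM
        self.dram[page] = is_write

# Example: 100 writes to a hot page reach PCM at most once, on eviction.
mem = HybridMemory(dram_buffer_pages=2)
for _ in range(100):
    mem.access(page=0, is_write=True)
mem.access(1, False); mem.access(2, False); mem.access(3, False)
print("PCM writes:", mem.pcm_writes)
```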


dependable systems and networks | 2004

The impact of technology scaling on lifetime reliability

Jayanth Srinivasan; Sarita V. Adve; Pradip Bose; Jude A. Rivers

The relentless scaling of CMOS technology has provided a steady increase in processor performance for the past three decades. However, increased power densities (hence temperatures) and other scaling effects have an adverse impact on long-term processor lifetime reliability. This paper represents a first attempt at quantifying the impact of scaling on lifetime reliability due to intrinsic hard errors, taking workload characteristics into consideration. For our quantitative evaluation, we use RAMP (Srinivasan et al., 2004), a previously proposed industrial-strength model that provides reliability estimates for a workload, but for a given technology. We extend RAMP by adding scaling specific parameters to enable workload-dependent lifetime reliability evaluation at different technologies. We show that (1) scaling has a significant impact on processor hard failure rates - on average, with SPEC benchmarks, we find the failure rate of a scaled 65nm processor to be 316% higher than a similarly pipelined 180nm processor; (2) time-dependent dielectric breakdown and electromigration have the largest increases; and (3) with scaling, the difference in reliability from running at worst-case vs. typical workload operating conditions increases significantly, as does the difference from running different workloads. Our results imply that leveraging a single microarchitecture design for multiple remaps across a few technology generations will become increasingly difficult, and motivate a need for workload specific, microarchitectural lifetime reliability awareness at an early design stage.
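
To make the scaling argument concrete, the sketch below compares relative electromigration failure rates at two hypothetical operating points using Black's equation. It is not RAMP; the exponent, activation energy, current densities, and temperatures are assumed values chosen only to illustrate how scaling-specific parameters enter such a model.

```python
# Illustrative sketch (not RAMP) of how a scaling-aware wear-out model compares
# relative failure rates across technology nodes. Only electromigration is
# modeled, via Black's equation; n, Ea, and both operating points are assumed.
import math

K_BOLTZMANN_EV = 8.617e-5  # Boltzmann constant in eV/K

def em_failure_rate(current_density, temp_kelvin, n=2.0, ea_ev=0.9):
    """Relative electromigration failure rate (1/MTTF), arbitrary units."""
    mttf = current_density ** (-n) * math.exp(ea_ev / (K_BOLTZMANN_EV * temp_kelvin))
    return 1.0 / mttf

# Assumed operating points: scaling raises current density and temperature.
rate_180nm = em_failure_rate(current_density=1.0, temp_kelvin=345.0)
rate_65nm  = em_failure_rate(current_density=1.8, temp_kelvin=360.0)
print(f"relative EM failure-rate increase: {rate_65nm / rate_180nm:.1f}x")
```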


international symposium on computer architecture | 2004

The Case for Lifetime Reliability-Aware Microprocessors

Jayanth Srinivasan; Sarita V. Adve; Pradip Bose; Jude A. Rivers

Ensuring long processor lifetimes by limiting failures due to wear-out related hard errors is a critical requirement for all microprocessor manufacturers. We observe that continuous device scaling and increasing temperatures are making lifetime reliability targets even harder to meet. However, current methodologies for qualifying lifetime reliability are overly conservative since they assume worst-case operating conditions. This paper makes the case that the continued use of such methodologies will significantly and unnecessarily constrain performance. Instead, lifetime reliability awareness at the microarchitectural design stage can mitigate this problem, by designing processors that dynamically adapt in response to the observed usage to meet a reliability target. We make two specific contributions. First, we describe an architecture-level model and its implementation, called RAMP, that can dynamically track lifetime reliability, responding to changes in application behavior. RAMP is based on state-of-the-art device models for different wear-out mechanisms. Second, we propose dynamic reliability management (DRM) - a technique where the processor can respond to changing application behavior to maintain its lifetime reliability target. In contrast to current worst-case behavior based reliability qualification methodologies, DRM allows processors to be qualified for reliability at lower (but more likely) operating points than the worst case. Using RAMP, we show that this can save cost and/or improve performance, that dynamic voltage scaling is an effective response technique for DRM, and that dynamic thermal management neither subsumes nor is subsumed by DRM.
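
A rough feel for dynamic reliability management can be had from the control-loop sketch below, which throttles a hypothetical voltage/frequency level whenever cumulative wear runs ahead of the lifetime budget. The wear model, DVS levels, and thermal proxy are invented for illustration and are not the paper's RAMP-based implementation.

```python
# Toy control loop in the spirit of dynamic reliability management (DRM); the
# wear model, DVS levels, and thermal proxy below are invented for illustration.

def wear_rate(voltage, temperature):
    # Hypothetical wear model: higher voltage and temperature age the chip faster.
    return (voltage / 1.0) ** 3 * (temperature / 345.0) ** 2

def drm_step(consumed_wear, elapsed_frac, level, num_levels):
    """Adjust the DVS level so cumulative wear tracks the lifetime budget."""
    budget = elapsed_frac                  # allowed fraction of lifetime wear so far
    if consumed_wear > budget and level > 0:
        return level - 1                   # over budget: step voltage/frequency down
    if consumed_wear < 0.9 * budget and level < num_levels - 1:
        return level + 1                   # comfortably under budget: step back up
    return level

levels = [(0.9, 0.7), (1.0, 0.85), (1.1, 1.0)]      # assumed (voltage, rel. frequency)
level, consumed = 2, 0.0
for step in range(1, 101):                          # 100 intervals of the target lifetime
    voltage, _freq = levels[level]
    temperature = 330.0 + 40.0 * voltage            # crude thermal proxy (kelvin)
    consumed += wear_rate(voltage, temperature) / 100.0
    level = drm_step(consumed, step / 100.0, level, len(levels))
print(f"final DVS level: {level}, consumed wear fraction: {consumed:.2f}")
```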


international symposium on computer architecture | 2005

Exploiting Structural Duplication for Lifetime Reliability Enhancement

Jayanth Srinivasan; Sarita V. Adve; Pradip Bose; Jude A. Rivers

Increased power densities (and resultant temperatures) and other effects of device scaling are predicted to cause significant lifetime reliability problems in the near future. In this paper, we study two techniques that leverage microarchitectural structural redundancy for lifetime reliability enhancement. First, in structural duplication (SD), redundant microarchitectural structures are added to the processor and designated as spares. Spare structures can be turned on when the original structure fails, increasing the processor's lifetime. Second, graceful performance degradation (GPD) is a technique that exploits existing microarchitectural redundancy for reliability. Redundant structures that fail are shut down while functionality is still maintained, thereby increasing the processor's lifetime, but at lower performance. Our analysis shows that exploiting structural redundancy can provide significant reliability benefits, and we present guidelines for efficient usage of these techniques by identifying situations where each is more beneficial. We show that GPD is the superior technique when only limited performance or cost resources can be sacrificed for reliability. Specifically, on average for our systems and applications, GPD increased processor reliability to 1.42 times the base value for less than a 5% loss in performance. On the other hand, for systems where reliability is more important than performance or cost, SD is more beneficial. SD increases reliability to 3.17 times the base value for 2.25 times the base cost, for our applications. Finally, a combination of the two techniques (SD+GPD) provides the highest reliability benefit.
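
The two policies can be pictured with the small failure-handling sketch below: structural duplication consumes a cold spare first, and graceful performance degradation then disables failed units at reduced throughput. The RedundantStructure class, unit counts, and performance model are illustrative assumptions, not the paper's evaluation framework.

```python
# Sketch of the two policies as a simple failure-handling model: SD swaps in a
# cold spare with no performance loss, while GPD keeps running on the remaining
# units at reduced throughput. Unit counts and the throughput model are assumed.

class RedundantStructure:
    def __init__(self, name, active_units, spare_units=0):
        self.name = name
        self.active = active_units     # units currently doing work
        self.spares = spare_units      # powered-off duplicates (SD)
        self.baseline = active_units

    def on_unit_failure(self):
        """Return False if the structure (and hence the processor) is dead."""
        if self.spares > 0:            # SD: bring a spare online, no perf loss
            self.spares -= 1
            return True
        self.active -= 1               # GPD: shut the failed unit down
        return self.active > 0         # still functional if any unit remains

    def relative_performance(self):
        return self.active / self.baseline

alu = RedundantStructure("integer ALU", active_units=4, spare_units=1)
for _ in range(3):
    alive = alu.on_unit_failure()
# The first failure consumes the spare (SD); later failures degrade gracefully (GPD).
print(f"alive={alive}, relative ALU throughput={alu.relative_performance():.2f}")
```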


international symposium on microarchitecture | 2005

Lifetime reliability: toward an architectural solution

Jayanth Srinivasan; Sarita V. Adve; Pradip Bose; Jude A. Rivers

Developing and maintaining industrywide standards for lifetime reliability is a critical task for all microprocessor manufacturers. Although technology scaling continues to provide significant performance benefits, increasingly smaller feature sizes and increasing power densities are accelerating the onset of wearout-based failures, thus shortening processor life. Microarchitects have traditionally treated processor lifetime reliability as a manufacturing problem, best left to device and process engineers. In current processors, manufacturers enforce lifetime reliability, or qualify it, during device design, circuit layout, manufacture, and chip test. This reliability qualification, which is application-oblivious, is based on estimates of worst case temperature and processor utilization. However, most applications will run at lower temperature and utilization, resulting in higher reliability and longer processor lifetimes than required. As a result, current reliability qualification methodologies are overly conservative, unnecessarily increasing cost or decreasing performance. Sustaining this approach will likely be infeasible in future scaled systems.


international symposium on microarchitecture | 2010

SAFER: Stuck-At-Fault Error Recovery for Memories

Nak Hee Seong; Dong Hyuk Woo; Vijayalakshmi Srinivasan; Jude A. Rivers; Hsien-Hsin S. Lee

As technology scaling poses a threat to DRAM scaling due to physical limitations such as limited charge, alternative memory technologies including several emerging non-volatile memories are being explored as possible DRAM replacements. One main roadblock for wider adoption of these new memories is the limited write endurance, which leads to wear-out related permanent failures. Furthermore, technology scaling increases the variation in cell lifetime, resulting in early failures of many cells. Existing error correcting techniques are primarily devised for recovering from transient faults and are not suitable for recovering from permanent stuck-at faults, which tend to increase gradually with repeated write cycles. In this paper, we propose SAFER, a novel hardware-efficient multi-bit stuck-at fault error recovery scheme for resistive memories, which can function in conjunction with existing wear-leveling techniques. SAFER exploits the key attribute that a failed cell with a stuck-at value is still readable, making it possible to continue to use the failed cell to store data, thereby reducing the hardware overhead for error recovery. SAFER partitions a data block dynamically while ensuring that there is at most one fail bit per partition and uses single error correction techniques per partition for fail recovery. SAFER increases the number of recoverable fails and achieves better lifetime improvement with smaller hardware overhead relative to the recently proposed Error Correcting Pointers scheme and even an ideal Hamming coding scheme.
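
The core observation, that a stuck-at cell remains readable and can therefore be accommodated by a per-partition invert flag, can be sketched as below. This simplification uses a fixed, hand-chosen partitioning rather than SAFER's dynamic bit-group partitioning; the block size, fault positions, and helper functions are assumptions made for the example.

```python
# Simplified sketch of SAFER's key idea (not the paper's dynamic partitioning):
# if each partition holds at most one stuck-at bit, a single per-partition
# "invert" flag suffices to store data correctly, because the stuck cell's
# value is known and readable.

def write_block(data_bits, stuck, partitions):
    """Return (stored_bits, invert_flags) for a block with stuck-at cells.

    data_bits: list of 0/1 to store; stuck: dict position -> stuck value;
    partitions: lists of bit positions, each containing at most one stuck cell.
    """
    stored, flags = list(data_bits), []
    for part in partitions:
        stuck_pos = [p for p in part if p in stuck]
        assert len(stuck_pos) <= 1, "partitioning must isolate stuck bits"
        # Invert the whole partition iff the stuck cell disagrees with the data.
        invert = bool(stuck_pos) and data_bits[stuck_pos[0]] != stuck[stuck_pos[0]]
        flags.append(invert)
        for p in part:
            stored[p] = data_bits[p] ^ invert
            if p in stuck:
                stored[p] = stuck[p]          # the cell physically keeps its value
    return stored, flags

def read_block(stored, flags, partitions):
    data = list(stored)
    for invert, part in zip(flags, partitions):
        for p in part:
            data[p] = stored[p] ^ invert
    return data

data = [1, 0, 1, 1, 0, 0, 1, 0]
stuck = {1: 1, 6: 0}                          # two stuck-at faults in the block
parts = [[0, 1, 2, 3], [4, 5, 6, 7]]          # each partition isolates one fault
stored, flags = write_block(data, stuck, parts)
assert read_block(stored, flags, parts) == data
print("recovered data despite stuck-at cells; invert flags:", flags)
```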


dependable systems and networks | 2005

SoftArch: an architecture-level tool for modeling and analyzing soft errors

Xiaodong Li; Sarita V. Adve; Pradip Bose; Jude A. Rivers

Soft errors are a growing concern for processor reliability. Recent work has motivated architecture-level studies of soft errors since the architecture can mask many raw errors and architectural solutions can exploit workload knowledge. This paper proposes a model and tool, called SoftArch, to enable analysis of soft errors at the architecture-level in modern processors. SoftArch is based on a probabilistic model of the error generation and propagation process in a processor. Compared to prior architecture-level tools, SoftArch is more comprehensive or faster. We demonstrate the use of SoftArch for an out-of-order superscalar processor running SPEC2000 benchmarks. Our results are consistent with, but more comprehensive than, prior work, and motivate selective and dynamic architecture-level soft error protection mechanisms.
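
The intuition that the architecture masks many raw errors can be illustrated with the tiny Monte Carlo sketch below, which counts only those bit flips whose corrupted value is read before being overwritten. The probabilities are exaggerated, invented numbers; this is not SoftArch's analytic error generation and propagation model.

```python
# Tiny Monte Carlo sketch of the intuition behind architecture-level soft-error
# analysis (not SoftArch's model): a raw bit flip only becomes a program failure
# if the corrupted value is read before being overwritten, so the architecture
# masks a large fraction of raw errors. All probabilities below are assumptions.
import random

random.seed(0)
TRIALS = 200_000
RAW_FLIP_PROB = 0.01          # exaggerated per-trial raw error probability
READ_BEFORE_OVERWRITE = 0.3   # assumed fraction of value lifetimes ending in a read

raw_errors = failures = 0
for _ in range(TRIALS):
    if random.random() < RAW_FLIP_PROB:          # a particle strike flips a bit
        raw_errors += 1
        if random.random() < READ_BEFORE_OVERWRITE:
            failures += 1                        # the error propagates to the program
print(f"raw errors: {raw_errors}, visible failures: {failures} "
      f"(derating ~{failures / raw_errors:.2f})")
```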


international conference on parallel processing | 1996

Reducing conflicts in direct-mapped caches with a temporality-based design

Jude A. Rivers; Edward S. Davidson

Direct-mapped caches are often plagued by conflict misses because they lack the associativity to store more than one memory block in each set. Moreover, blocks that have no temporal locality can degrade program execution by displacing blocks that do exhibit temporal behavior. In this paper, we present a simple but efficient hardware design called the non-temporal streaming (NTS) cache, which supplements the conventional direct-mapped cache with a parallel fully associative buffer. Every cache block loaded into the main cache is monitored for temporal behavior by a hardware detection unit. Cache blocks identified as non-temporal are allocated to the buffer on subsequent requests. Our simulations show that the NTS Cache not only provides a performance improvement over the conventional direct-mapped cache, but can also save on-chip area. For some numerical programs like FFTPDE, APPSP and APPBT from the NAS benchmark suite, an integral NTS Cache of size 9 KB (i.e., 8 KB direct-mapped cache plus 1 KB NT buffer) performs as well as a 16 KB conventional direct-mapped cache.
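
The placement policy can be sketched roughly as follows: a block that showed no reuse while resident is remembered as non-temporal and steered into the small fully associative buffer the next time it is fetched. The NTSCache class, the sizes, and the block-level reuse test are simplifications assumed for this example, not the paper's detection hardware.

```python
# Rough sketch of the NTS placement idea (not the paper's detection hardware):
# blocks that showed no reuse while resident are flagged as non-temporal and
# steered into a small fully associative buffer on their next fetch, so they do
# not displace temporal blocks in the direct-mapped cache.
from collections import OrderedDict

class NTSCache:
    def __init__(self, dm_sets=256, nt_entries=16):
        self.dm = {}                         # set index -> (tag, reused while resident?)
        self.dm_sets = dm_sets
        self.nt = OrderedDict()              # small fully associative NT buffer
        self.nt_entries = nt_entries
        self.non_temporal = set()            # tags flagged by the detection logic

    def access(self, block_addr):
        idx, tag = block_addr % self.dm_sets, block_addr
        if idx in self.dm and self.dm[idx][0] == tag:
            self.dm[idx] = (tag, True)       # reuse observed while resident
            return "hit-dm"
        if tag in self.nt:
            self.nt.move_to_end(tag)
            return "hit-nt"
        # Miss: choose placement based on the block's last observed behavior.
        if tag in self.non_temporal:
            if len(self.nt) >= self.nt_entries:
                self.nt.popitem(last=False)
            self.nt[tag] = True
            return "miss->nt"
        if idx in self.dm:                   # evicting the current resident block
            old_tag, reused = self.dm[idx]
            if not reused:
                self.non_temporal.add(old_tag)   # no reuse seen: flag as non-temporal
        self.dm[idx] = (tag, False)
        return "miss->dm"

cache = NTSCache(dm_sets=4, nt_entries=2)
stream = [0, 4, 0, 4, 8, 8]                 # blocks 0, 4, and 8 all conflict in set 0
print([cache.access(b) for b in stream])
```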


international conference on supercomputing | 1998

Utilizing reuse information in data cache management

Jude A. Rivers; Edward S. Tam; Gary S. Tyson; Edward S. Davidson; Matthew K. Farrens

As microprocessor speeds continue to outgrow memory subsystem speeds, minimizing the average data access time grows in importance. As current data caches are often poorly and inefficiently managed, a good management technique can improve the average data access time. This paper presents a comparative evaluation of two approaches that utilize reuse information for more efficiently managing the first-level cache. While one approach is based on the effective address of the data being referenced, the other uses the program counter of the memory instruction generating the reference. Our evaluations show that using effective address reuse information performs better than using program counter reuse information. In addition, we show that the Victim cache performs best for multi-lateral caches with a direct-mapped main cache and high L2 cache latency, while the NTS (effective-address-based) approach performs better as the L2 latency decreases or the associativity of the main cache increases.
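
The two sources of reuse information being compared can be contrasted with the small sketch below, which keeps one prediction table keyed by the effective (block) address and another keyed by the program counter of the memory instruction. The table management and helper functions are illustrative assumptions, not the paper's mechanisms.

```python
# Small sketch contrasting the two sources of reuse information compared in the
# paper: a table keyed by the data's effective (block) address versus a table
# keyed by the PC of the memory instruction. Table management is assumed.
from collections import defaultdict

addr_reuse = defaultdict(bool)   # block address -> did its data show reuse?
pc_reuse = defaultdict(bool)     # load/store PC -> did its data show reuse?

def record_eviction(pc, block_addr, was_reused):
    """Update both predictors when a block leaves the cache."""
    addr_reuse[block_addr] = was_reused
    pc_reuse[pc] = was_reused

def predict_temporal(pc, block_addr, use_effective_address=True):
    """Predict temporal behavior for an incoming block under either scheme."""
    return addr_reuse[block_addr] if use_effective_address else pc_reuse[pc]

record_eviction(pc=0x400b10, block_addr=0x1f00, was_reused=False)
print(predict_temporal(0x400b10, 0x1f00))   # address-based: this block is non-temporal
# PC-based prediction generalizes to a new address touched by the same instruction.
print(predict_temporal(0x400b10, 0x2a40, use_effective_address=False))
```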


dependable systems and networks | 2007

Architecture-Level Soft Error Analysis: Examining the Limits of Common Assumptions

Xiaodong Li; Sarita V. Adve; Pradip Bose; Jude A. Rivers

This paper concerns the validity of a widely used method for estimating the architecture-level mean time to failure (MTTF) due to soft errors. The method first calculates the failure rate for an architecture-level component as the product of its raw error rate and an architecture vulnerability factor (AVF). Next, the method calculates the system failure rate as the sum of the failure rates (SOFR) of all components, and the system MTTF as the reciprocal of this failure rate. Both steps make significant assumptions. We investigate the validity of the AVF+SOFR method across a large design space, using both mathematical and experimental techniques with real program traces from SPEC 2000 benchmarks and synthesized traces to simulate longer real-world workloads. We show that AVF+SOFR is valid for most realistic cases under current raw error rates. However, for some realistic combinations of large systems, long-running workloads with large phases, and/or large raw error rates, the MTTF calculated using AVF+SOFR shows significant discrepancies from that computed using first principles. We also show that SoftArch, a previously proposed alternative method that does not make the AVF+SOFR assumptions, does not exhibit the above discrepancies.
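
The AVF+SOFR calculation under examination works out as in the short example below, using made-up raw FIT rates and AVFs: each component's failure rate is its raw rate times its AVF, the system rate is the sum of these, and the system MTTF is the reciprocal.

```python
# Worked example of the AVF+SOFR calculation examined in the paper, with made-up
# raw FIT rates and AVFs. FIT is failures per 10^9 device-hours; the SOFR step
# sums the component failure rates and takes the reciprocal for system MTTF.
components = {
    #                 raw FIT  AVF
    "reorder buffer": (120.0, 0.30),
    "issue queue":    ( 80.0, 0.25),
    "register file":  (200.0, 0.15),
    "L1 data cache":  (400.0, 0.10),
}

system_fit = sum(raw * avf for raw, avf in components.values())
mttf_hours = 1e9 / system_fit                  # MTTF as the reciprocal of the rate
print(f"system failure rate: {system_fit:.1f} FIT, "
      f"MTTF: {mttf_hours / (24 * 365):.0f} years")
```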
