Network


Latest external collaborations at the country level. Dive into the details by clicking on the dots.

Hotspot


Dive into the research topics where Eric Rotenberg is active.

Publication


Featured research published by Eric Rotenberg.


International Symposium on Microarchitecture | 1996

Trace cache: a low latency approach to high bandwidth instruction fetching

Eric Rotenberg; Steve Bennett; James E. Smith

As the issue width of superscalar processors is increased, instruction fetch bandwidth requirements will also increase. It will become necessary to fetch multiple basic blocks per cycle. Conventional instruction caches hinder this effort because long instruction sequences are not always in contiguous cache locations. We propose supplementing the conventional instruction cache with a trace cache. This structure caches traces of the dynamic instruction stream, so instructions that are otherwise noncontiguous appear contiguous. For the Instruction Benchmark Suite (IBS) and SPEC92 integer benchmarks, a 4-kilobyte trace cache improves performance on average by 28% over conventional sequential fetching. Further, it is shown that the trace cache's efficient, low-latency approach enables it to outperform more complex mechanisms that work solely out of the instruction cache.
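
To make the mechanism concrete, here is a minimal, software-level sketch of a trace cache's fill and lookup policy. The class name, capacity parameters, and eviction policy are illustrative assumptions, not the paper's hardware organization; the key behavior is that a trace hits only if it starts at the fetch PC and its embedded branch outcomes match the current multiple-branch prediction.

```python
class TraceCache:
    """Illustrative trace cache model (capacities and eviction are assumptions)."""

    def __init__(self, num_entries=64, max_trace_len=16, max_branches=3):
        self.entries = {}                    # start_pc -> (instructions, branch_flags)
        self.num_entries = num_entries
        self.max_trace_len = max_trace_len   # instructions per trace
        self.max_branches = max_branches     # embedded conditional branches per trace

    def lookup(self, start_pc, predicted_outcomes):
        """Hit only if a trace starts at start_pc and its embedded branch
        outcomes match the current multiple-branch prediction."""
        entry = self.entries.get(start_pc)
        if entry is None:
            return None
        instructions, branch_flags = entry
        if branch_flags == tuple(predicted_outcomes[:len(branch_flags)]):
            return instructions              # many basic blocks, one contiguous fetch
        return None

    def fill(self, start_pc, retired_instrs, retired_outcomes):
        """Build a trace from the retired dynamic instruction stream."""
        if len(self.entries) >= self.num_entries:
            self.entries.pop(next(iter(self.entries)))   # naive eviction stand-in
        self.entries[start_pc] = (
            retired_instrs[:self.max_trace_len],
            tuple(retired_outcomes[:self.max_branches]),
        )
```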


IEEE International Symposium on Fault-Tolerant Computing | 1999

AR-SMT: a microarchitectural approach to fault tolerance in microprocessors

Eric Rotenberg

This paper speculates that technology trends pose new challenges for fault tolerance in microprocessors. Specifically, the severely reduced design tolerances implied by gigahertz clock rates may result in frequent and arbitrary transient faults. We suggest that existing fault-tolerant techniques (system-level, gate-level, or component-specific approaches) are either too costly for general-purpose computing, overly intrusive to the design, or insufficient for covering arbitrary logic faults. An approach in which the microarchitecture itself provides fault tolerance is required. We propose a new time-redundancy fault-tolerant approach in which a program is duplicated and the two redundant programs simultaneously run on the processor. The technique exploits several significant microarchitectural trends to provide broad coverage of transient faults and restricted coverage of permanent faults. These trends are simultaneous multithreading, control flow and data flow prediction, and hierarchical processors, all of which are intended for higher performance but can be easily leveraged for the specified fault tolerance goals. The overhead for achieving fault tolerance is low, both in terms of performance and changes to the existing microarchitecture. Detailed simulations of five of the SPEC95 benchmarks show that executing two redundant programs on the fault-tolerant microarchitecture takes only 10% to 30% longer than running a single version of the program.
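
A minimal sketch of the active/redundant time-redundancy idea, assuming instructions can be modeled as callables and that the redundant (R) stream trails the active (A) stream through a delay buffer. The lag, the fault-injection hook, and the checkpoint bookkeeping are illustrative, not the paper's SMT implementation.

```python
from collections import deque

class TransientFault(Exception):
    """Raised when the A-stream and R-stream results disagree."""

LAG = 4   # R-stream trails the A-stream by up to LAG instructions (illustrative)

def ar_smt_run(program, inject_fault_at=None):
    """program: list of zero-argument callables returning ints, standing in
    for instructions; inject_fault_at models a transient fault in the A-stream."""
    delay_buffer = deque()
    committed = 0

    def r_check_one():
        nonlocal committed
        j, a_val = delay_buffer.popleft()
        r_val = program[j]()                 # redundant re-execution (R-stream)
        if a_val != r_val:
            raise TransientFault(f"mismatch at instruction {j}; safe state at {committed}")
        committed = j + 1                    # R's committed state is the recovery checkpoint

    for i, instr in enumerate(program):
        a_val = instr()                      # A-stream runs ahead speculatively
        if i == inject_fault_at:
            a_val ^= 1                       # model a transient bit flip
        delay_buffer.append((i, a_val))
        if len(delay_buffer) > LAG:
            r_check_one()
    while delay_buffer:                      # drain: R catches up at the end
        r_check_one()
    return committed
```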


International Symposium on Microarchitecture | 1996

Assigning confidence to conditional branch predictions

Erik Jacobsen; Eric Rotenberg; James E. Smith

Many high-performance processors predict conditional branches and consume processor resources based on the prediction. In some situations, resource allocation can be better optimized if a confidence level is assigned to a branch prediction, i.e., if the quantity of resources allocated is a function of the confidence level. To support such optimizations, we consider hardware mechanisms that partition conditional branch predictions into two sets: those which are accurate a relatively high percentage of the time, and those which are accurate a relatively low percentage of the time. The objective is to concentrate as many of the mispredictions as practical into a relatively small set of low-confidence dynamic branches. We first study an ideal method that profiles branch predictions and sorts static branches into high- and low-confidence sets, depending on the accuracy with which they are dynamically predicted. We find that about 63 percent of the mispredictions can be localized to a set of static branches that account for 20 percent of the dynamic branches. We then study idealized dynamic confidence methods using both one and two levels of branch correctness history. We find that the single-level method performs at least as well as the more complex two-level method and is able to isolate 89 percent of the mispredictions into a set containing 20 percent of the dynamic branches. Finally, we study practical, less expensive implementations and find that they achieve most of the performance of the idealized methods.
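
A sketch of the one-level dynamic confidence method under simple assumptions: each branch, hashed by PC into a table, keeps a short shift register of recent prediction correctness, and a prediction is deemed high confidence when the history holds enough correct outcomes. Table size, history length, and threshold are illustrative.

```python
HIST_BITS = 4      # correctness history length (illustrative)
TABLE_SIZE = 1024  # number of table entries (illustrative)

cir_table = [0] * TABLE_SIZE   # per-entry correctness history registers

def _index(pc):
    return pc % TABLE_SIZE     # simple PC hash stand-in

def is_high_confidence(pc, threshold=HIST_BITS):
    """Count correct bits in the history; high confidence means the recent
    predictions for this branch have (nearly) all been correct."""
    return bin(cir_table[_index(pc)]).count("1") >= threshold

def update(pc, prediction_was_correct):
    """Shift in 1 for a correct prediction, 0 for a misprediction."""
    h = cir_table[_index(pc)]
    h = ((h << 1) | int(prediction_was_correct)) & ((1 << HIST_BITS) - 1)
    cir_table[_index(pc)] = h
```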


International Symposium on Computer Architecture | 2002

A large, fast instruction window for tolerating cache misses

Alvin R. Lebeck; Jinson Koppanalil; Tong Li; Jaidev P. Patwardhan; Eric Rotenberg

Instruction window size is an important design parameter for many modern processors. Large instruction windows offer the potential advantage of exposing large amounts of instruction-level parallelism. Unfortunately, naively scaling conventional window designs can significantly degrade clock cycle time, undermining the benefits of increased parallelism. This paper presents a new instruction window design targeted at achieving the latency tolerance of large windows with the clock cycle time of small windows. The key observation is that instructions dependent on a long-latency operation (e.g., a cache miss) cannot execute until that source operation completes. These instructions are moved out of the conventional, small issue queue to a much larger waiting instruction buffer (WIB). When the long-latency operation completes, the instructions are reinserted into the issue queue. In this paper, we focus specifically on load cache misses and their dependent instructions. Simulations reveal that, for an 8-way processor, a 2K-entry WIB with a 32-entry issue queue can achieve speedups of 20%, 84%, and 50% over a conventional 32-entry issue queue for a subset of the SPEC CINT2000, SPEC CFP2000, and Olden benchmarks, respectively.
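
A minimal sketch of the WIB routing policy, using Python containers in place of the paper's hardware structures. The dependence test is abstracted into a caller-supplied miss identifier, and the capacities are illustrative.

```python
from collections import defaultdict

ISSUE_QUEUE_CAPACITY = 32       # small, fast scheduling window (illustrative)
issue_queue = []
wib = defaultdict(list)         # miss_id -> instructions parked on that miss

def dispatch(instr, miss_id):
    """miss_id: identifier of the outstanding load miss this instruction
    depends on (directly or transitively), or None if it is ready to schedule."""
    if miss_id is not None:
        wib[miss_id].append(instr)      # park it; frees an issue-queue slot
    elif len(issue_queue) < ISSUE_QUEUE_CAPACITY:
        issue_queue.append(instr)
    else:
        raise RuntimeError("dispatch stalls until the issue queue drains")

def on_miss_complete(miss_id):
    """The cache miss returned: wake its dependents back into the issue queue."""
    for instr in wib.pop(miss_id, []):
        dispatch(instr, None)           # may itself stall if the queue is full
```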


International Symposium on Microarchitecture | 1997

Path-based next trace prediction

Quinn Jacobson; Eric Rotenberg; James E. Smith

The trace cache is proposed as a mechanism for providing increased fetch bandwidth by allowing the processor to fetch across multiple branches in a single cycle. But to date, predicting multiple branches per cycle has meant paying a penalty in prediction accuracy. We propose a next trace predictor that treats traces as basic units and explicitly predicts sequences of traces. The predictor collects histories of trace sequences (paths) and makes predictions based on these histories. The basic predictor is enhanced to a hybrid configuration that reduces performance losses due to cold starts and aliasing in the prediction table. The return history stack is introduced to increase predictor performance by saving path history information across procedure calls and returns. Overall, the predictor yields about a 26% reduction in misprediction rates when compared with the most aggressive previously proposed multiple-branch prediction methods.
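
A sketch of path-based next-trace prediction under simple assumptions: recent trace IDs form the path history, a hash of that history indexes the prediction table, and a return history stack saves and restores the path across calls. The hash function, history depth, and table size are stand-ins for the paper's tuned configuration.

```python
HIST_DEPTH = 4                  # trace IDs of path history used (illustrative)
TABLE_SIZE = 4096               # prediction table entries (illustrative)

path_history = []               # most recent trace IDs (the "path")
pred_table = {}                 # hashed path -> predicted next trace ID
return_history_stack = []       # saved path histories across call/return

def _hash(path):
    h = 0
    for tid in path[-HIST_DEPTH:]:
        h = (h * 31 + tid) % TABLE_SIZE   # stand-in for the paper's hashing
    return h

def predict_next_trace():
    return pred_table.get(_hash(path_history))   # None on a cold start

def update(actual_next_trace):
    pred_table[_hash(path_history)] = actual_next_trace
    path_history.append(actual_next_trace)

def on_call():
    return_history_stack.append(list(path_history))   # save path at a call

def on_return():
    if return_history_stack:
        path_history[:] = return_history_stack.pop()  # restore caller's path
```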


High-Performance Computer Architecture | 2006

Retention-aware placement in DRAM (RAPID): software methods for quasi-non-volatile DRAM

Ravi K. Venkatesan; Stephen Herr; Eric Rotenberg

Measurements of an off-the-shelf DRAM chip confirm that different cells retain information for different amounts of time. This result extends to DRAM rows, or pages (retention time of a page is defined as the shortest retention time among its constituent cells). Currently, a single worst-case refresh period is selected based on the page with the shortest retention time. Even with refresh optimized for room temperature, the worst page limits the safe refresh period to no longer than 500 ms. Yet, 99% and 85% of pages have retention times above 3 seconds and 10 seconds, respectively. We propose retention-aware placement in DRAM (RAPID), novel software approaches that can exploit off-the-shelf DRAMs to reduce refresh power to vanishingly small levels approaching non-volatile memory. The key idea is to favor longer-retention pages over shorter-retention pages when allocating DRAM pages. This allows selecting a single refresh period that depends on the shortest-retention page among populated pages, instead of the shortest-retention page overall. We explore three versions of RAPID and observe refresh energy savings of 83%, 93%, and 95%, relative to the best temperature-compensated refresh. RAPID with off-the-shelf DRAM also approaches the energy levels of idealized techniques that require custom DRAM support.
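
A sketch of the RAPID allocation idea, assuming per-page retention times have already been measured (for example, at boot). Pages are handed out longest-retention first, and the refresh period is derived from the shortest-retention populated page; the guard-band factor is an illustrative safety margin.

```python
def make_free_list(retention_ms):
    """retention_ms: dict page_id -> measured retention time in milliseconds.
    Return pages ordered longest-retention first, so allocation favors them."""
    return sorted(retention_ms, key=retention_ms.get, reverse=True)

def allocate(free_list, n_pages, retention_ms, guard_band=0.5):
    """Populate the first n_pages and derive a safe refresh period from the
    shortest-retention *populated* page, not the shortest page overall."""
    populated = free_list[:n_pages]
    shortest = min(retention_ms[p] for p in populated)
    refresh_period_ms = shortest * guard_band   # margin below worst populated page
    return populated, refresh_period_ms

# With 99% of pages retaining longer than 3 seconds, a lightly loaded system
# can refresh far less often than the blanket worst-case 500 ms period.
```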


International Conference on Parallel Architectures and Compilation Techniques | 2001

Adaptive Mode Control: A Static-Power-Efficient Cache Design

Huiyang Zhou; Mark C. Toburen; Eric Rotenberg; Thomas M. Conte

Lower threshold voltages in deep sub-micron technologies cause increased leakage current, raising static power dissipation. This trend, combined with the trend of larger/more cache memories dominating die area, has prompted circuit designers to develop SRAM cells with low-leakage operating modes (e.g., sleep mode). Sleep mode reduces static power dissipation, but data stored in a sleeping cell is unreliable or lost. So, at the architecture level, there is interest in exploiting sleep mode to reduce static power dissipation while maintaining high performance. Current approaches dynamically control the operating mode of large groups of cache lines or even individual cache lines. However, the performance monitoring mechanism that controls the percentage of sleep-mode lines, and identifies particular lines for sleep mode, is somewhat arbitrary. There is no way to know what the performance could be with all cache lines active, so arbitrary miss-rate targets are set (perhaps on a per-benchmark basis using profile information) and the control mechanism tracks these targets. We propose applying sleep mode only to the data store and not the tag store. By keeping the entire tag store active, the hardware knows what the hypothetical miss rate would be if all data lines were active, and the actual miss rate can be made to precisely track it. Simulations show an average of 73% of I-cache lines and 54% of D-cache lines are put in sleep mode with an average IPC impact of only 1.7%, for 64KB caches.
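
A sketch of the control loop this enables, under the assumption of a single global line-decay interval. Because the tag store is always active, a tag hit on a sleeping data line (a "sleep miss") is distinguishable from a true miss, so the controller can bound sleep-miss overhead relative to the hypothetical all-active miss rate. The parameters and the doubling/halving policy are illustrative.

```python
class AMCController:
    """Illustrative adaptive-mode-control loop for a sleep-mode cache."""

    def __init__(self, decay_interval=512, target_ratio=0.02):
        self.decay_interval = decay_interval   # idle cycles before a line sleeps
        self.target_ratio = target_ratio       # allowed sleep-miss overhead
        self.true_misses = 0                   # tag misses: unavoidable
        self.sleep_misses = 0                  # tag hits on sleeping data lines

    def record_access(self, tag_hit, data_awake):
        if not tag_hit:
            self.true_misses += 1
        elif not data_awake:
            self.sleep_misses += 1             # only observable because tags stay on

    def adapt(self):
        """Periodically compare actual vs. hypothetical miss rate and retune."""
        if self.sleep_misses > self.target_ratio * max(self.true_misses, 1):
            self.decay_interval *= 2           # too aggressive: keep lines awake longer
        else:
            self.decay_interval = max(64, self.decay_interval // 2)
        self.true_misses = self.sleep_misses = 0
```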


IEEE Transactions on Computers | 1999

A trace cache microarchitecture and evaluation

Eric Rotenberg; Steve Bennett; James E. Smith

As the instruction issue width of superscalar processors increases, instruction fetch bandwidth requirements will also increase. It will eventually become necessary to fetch multiple basic blocks per clock cycle. Conventional instruction caches hinder this effort because long instruction sequences are not always in contiguous cache locations. Trace caches overcome this limitation by caching traces of the dynamic instruction stream, so instructions that are otherwise noncontiguous appear contiguous. In this paper, we present and evaluate a microarchitecture incorporating a trace cache. The microarchitecture provides high instruction fetch bandwidth with low latency by explicitly sequencing through the program at the higher level of traces, both in terms of (1) control flow prediction and (2) instruction supply. For the SPEC95 integer benchmarks, trace-level sequencing improves performance from 15 percent to 35 percent over an otherwise equally sophisticated, but contiguous, multiple-block fetch mechanism. Most of this performance improvement is due to the trace cache. However, for one benchmark whose performance is limited by branch mispredictions, the performance gain is almost entirely due to improved prediction accuracy.


ACM Transactions on Embedded Computing Systems | 2006

FAST: Frequency-aware static timing analysis

Kiran Seth; Aravindh Anantaraman; Frank Mueller; Eric Rotenberg

Energy is a valuable resource in embedded systems, as the lifetime of many such systems is constrained by their battery capacity. Recent advances in processor design have added support for dynamic frequency/voltage scaling (DVS) for saving energy. Recent work on real-time scheduling focuses on saving energy in static as well as dynamic scheduling environments by exploiting idle time and slack from early task completion for DVS of subsequent tasks. These scheduling algorithms rely on a priori knowledge of the worst-case execution time (WCET) of each task. They assume that DVS has no effect on the worst-case execution cycles (WCEC) of a task and scale the WCET according to the processor frequency. However, for systems with memory hierarchies, the WCEC typically does change under DVS because of frequency modulation. Hence, current assumptions used by DVS schemes result in a highly exaggerated WCET. This paper contributes novel techniques for tight and flexible static timing analysis, particularly well suited for dynamic scheduling schemes. The technical contributions are as follows: (1) We assess the problem of changing execution cycles owing to scaling techniques. (2) We propose a parametric approach toward bounding the WCET statically with respect to frequency. Using a parametric model, we can capture the effect of changes in frequency on the WCEC and, thus, accurately model the WCET over any frequency range. (3) We discuss the design and implementation of the frequency-aware static timing analysis (FAST) tool based on our prior experience with static timing analysis. (4) We demonstrate in experiments that our FAST tool provides safe upper bounds on the WCET, which are tight. The FAST tool allows us to capture the WCET of six benchmarks using equations that overestimate the WCET by less than 1%. FAST equations can also be used to improve existing DVS scheduling schemes to ensure that the effect of frequency scaling on WCET is considered and that the WCET used is not exaggerated. (5) We leverage three DVS scheduling schemes by incorporating FAST into them and by showing that energy consumption further decreases. (6) We compare experimental results using two different energy models to demonstrate or verify the validity of the simulation methods. To the best of our knowledge, this study of DVS effects on timing analysis is unprecedented.
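
A worked sketch of the parametric intuition: compute-bound cycles are frequency-independent, but each memory stall costs a fixed amount of time, so its cost in cycles grows with clock frequency. The function below is an illustrative model with made-up numbers, not the FAST tool's actual equations.

```python
from math import ceil

def wcet_seconds(f_hz, c_cpu_cycles, n_mem_stalls, t_mem_s=60e-9):
    """Parametric WCET(f): frequency-independent compute cycles plus memory
    stalls whose cycle cost is rescaled to the target frequency."""
    stall_cycles = n_mem_stalls * ceil(f_hz * t_mem_s)   # fixed time -> more cycles at high f
    wcec = c_cpu_cycles + stall_cycles                   # so the WCEC depends on f
    return wcec / f_hz

# Naively scaling a 1 GHz WCET down to 100 MHz overstates it, because the
# fixed-time memory stalls cost far fewer cycles at the lower frequency:
#   wcet_seconds(1e9, 1_000_000, 10_000)   # ~1.6 ms at 1 GHz
#   wcet_seconds(1e8, 1_000_000, 10_000)   # ~10.6 ms, not the naive 16 ms
```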


International Symposium on Computer Architecture | 2003

Virtual simple architecture (VISA): exceeding the complexity limit in safe real-time systems

Aravindh Anantaraman; Kiran Seth; Kaustubh Patil; Eric Rotenberg; Frank Mueller

Meeting deadlines is a key requirement in safe real-time systems, and worst-case execution times (WCET) of tasks are needed for safe planning. Contemporary worst-case timing analysis tools can safely and tightly bound execution time on in-order, single-issue pipelines with caches and static branch prediction. However, this simple pipeline appears to be a complexity limit, due to the need for analyzability. This excludes a whole class of high-performance processors from many embedded systems. We reconcile the complexity/safety trade-off by decoupling worst-case timing analysis from the processor implementation, through a virtual simple architecture (VISA). A VISA is the timing specification of a hypothetical simple pipeline and is the basis for worst-case timing analysis. However, the underlying microarchitecture can be arbitrarily complex. A task is divided into multiple sub-tasks, which provide a means to gauge progress on the complex pipeline. Each sub-task is assigned an interim deadline, or checkpoint, based on the latest allowable completion time of the sub-task on the hypothetical simple pipeline. If no checkpoints are missed, then the complex pipeline is as timely as the safe pipeline. If a checkpoint is missed, the pipeline switches to a simple mode of operation that directly implements the VISA, so that the execution time of unfinished sub-tasks is safely bounded. The significance of our approach is that we circumvent worst-case timing analysis of the complex pipeline by dynamically confirming that its behavior is bounded by worst-case timing analysis of a simpler proxy pipeline. The benefit of using a high-performance processor is that tasks finish much sooner than they would have on an explicitly safe processor. The new slack in the schedule can be exploited for higher throughput or lower power. With the VISA approach, an arbitrarily complex SMT processor can safely run non-real-time tasks at the same time as a real-time task. Alternatively, frequency/voltage can be safely lowered to take up the slack. We explore the latter application and show that a VISA-compliant complex pipeline consumes 43-61% less power than an explicitly safe pipeline.
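
A sketch of the VISA checkpointing discipline, assuming sub-tasks are callables, per-sub-task simple-pipeline WCETs are known, and the hardware exposes a switch_to_simple_mode() hook; that hook and the pipeline object are hypothetical names, not the paper's interface.

```python
import time

def run_task_with_visa(subtasks, visa_wcet_s, pipeline):
    """subtasks: list of callables; visa_wcet_s[i]: WCET of sub-task i on the
    hypothetical simple (VISA) pipeline, in seconds. `pipeline` is assumed to
    expose a switch_to_simple_mode() hook (hypothetical)."""
    checkpoint = time.monotonic()
    simple_mode = False
    for subtask, wcet in zip(subtasks, visa_wcet_s):
        checkpoint += wcet                     # latest allowable completion time
        subtask()                              # runs on the complex pipeline
        if not simple_mode and time.monotonic() > checkpoint:
            pipeline.switch_to_simple_mode()   # missed checkpoint: fall back
            simple_mode = True                 # remaining sub-tasks safely bounded
```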

Collaboration


Dive into Eric Rotenberg's collaboration.

Top Co-Authors

James E. Smith, University of Wisconsin-Madison
Aravindh Anantaraman, North Carolina State University
Brandon H. Dwiel, North Carolina State University
Elliott Forbes, North Carolina State University
Niket Kumar Choudhary, North Carolina State University
Paul D. Franzon, North Carolina State University
Ahmed S. Al-Zawawi, North Carolina State University
Huiyang Zhou, North Carolina State University