Keith I. Farkas
Hewlett-Packard
Publication
Featured research published by Keith I. Farkas.
international symposium on microarchitecture | 2003
Rakesh Kumar; Keith I. Farkas; Norman P. Jouppi; Parthasarathy Ranganathan; Dean M. Tullsen
This paper proposes and evaluates single-ISA heterogeneous multi-core architectures as a mechanism to reduce processor power dissipation. Our design incorporates heterogeneous cores representing different points in the power/performance design space; during an application's execution, system software dynamically chooses the most appropriate core to meet specific performance and power requirements. Our evaluation of this architecture shows significant energy benefits. For an objective function that optimizes for energy efficiency with a tight performance threshold, for 14 SPEC benchmarks, our results indicate a 39% average energy reduction while sacrificing only 3% in performance. An objective function that optimizes for energy-delay with looser performance bounds achieves, on average, nearly a factor of three improvement in energy-delay product while sacrificing only 22% in performance. Energy savings are substantially greater than those achievable with chip-wide voltage/frequency scaling.
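The core-switching policy the abstract describes can be sketched as a constrained selection problem: among cores at different power/performance points, pick the one minimizing energy per unit of work subject to a performance threshold. The core names and all numbers below are illustrative assumptions, not the paper's measured data.

```python
# Hypothetical sketch of dynamic core selection on a single-ISA heterogeneous
# multi-core: choose the lowest-energy core whose performance stays within a
# threshold. Core names and power/performance figures are made up for
# illustration.
from dataclasses import dataclass

@dataclass
class Core:
    name: str
    relative_perf: float   # throughput relative to the biggest core (1.0)
    relative_power: float  # power relative to the biggest core (1.0)

    def energy_per_unit_work(self) -> float:
        # energy = power * time, and time scales as 1 / performance
        return self.relative_power / self.relative_perf

def choose_core(cores, perf_threshold):
    """Return the core minimizing energy per unit of work, subject to
    delivering at least `perf_threshold` of peak performance."""
    eligible = [c for c in cores if c.relative_perf >= perf_threshold]
    return min(eligible, key=lambda c: c.energy_per_unit_work())

cores = [
    Core("small",  relative_perf=0.3, relative_power=0.05),
    Core("medium", relative_perf=0.5, relative_power=0.15),
    Core("large",  relative_perf=1.0, relative_power=1.00),
]

# Tight threshold: only the large core retains 90% of peak performance.
print(choose_core(cores, 0.9).name)  # large
# Loose threshold: 30% of peak suffices, so the small core wins on energy.
print(choose_core(cores, 0.3).name)  # small
```

With these assumed figures the loose-threshold choice cuts energy per unit of work by a factor of six relative to the large core, which mirrors why looser performance bounds yield the bigger energy-delay gains in the abstract.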
international symposium on microarchitecture | 1997
Keith I. Farkas; Paul Chow; Norman P. Jouppi; Zvonko G. Vranesic
The multicluster architecture that we introduce offers a decentralized, dynamically scheduled architecture, in which the register files, dispatch queue, and functional units of the architecture are distributed across multiple clusters, and each cluster is assigned a subset of the architectural registers. The motivation for the multicluster architecture is to reduce the clock cycle time, relative to a single-cluster architecture with the same number of hardware resources, by reducing the size and complexity of components on critical timing paths. Resource partitioning, however, introduces instruction-execution overhead and may reduce the number of concurrently executing instructions. To counter these two negative by-products of partitioning, we developed a static instruction scheduling algorithm. We describe this algorithm, and using trace-driven simulations of SPEC92 benchmarks, evaluate its effectiveness. This evaluation indicates that for the configurations considered the multicluster architecture may have significant performance advantages at feature sizes below 0.35 µm, and warrants further investigation.
international symposium on computer architecture | 2002
M. S. Hrishikesh; Doug Burger; Norman P. Jouppi; Stephen W. Keckler; Keith I. Farkas; Premkishore Shivakumar
Microprocessor clock frequency has improved by nearly 40% annually over the past decade. This improvement has been provided, in equal measure, by smaller technologies and deeper pipelines. From our study of the SPEC 2000 benchmarks, we find that for a high-performance architecture implemented in 100nm technology, the optimal clock period is approximately 8 fan-out-of-four (FO4) inverter delays for integer benchmarks, comprising 6 FO4 of useful work and an overhead of about 2 FO4. The optimal clock period for floating-point benchmarks is 6 FO4. We find these optimal points to be insensitive to latch and clock skew overheads. Our study indicates that further pipelining can at best improve performance of integer programs by a factor of 2 over current designs. At these high clock frequencies it will be difficult to design the instruction issue window to operate in a single cycle. Consequently, we propose and evaluate a high-frequency design called a segmented instruction window.
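The useful-work vs. overhead tradeoff above can be illustrated with a toy model: each extra pipeline stage shrinks the logic per cycle but pays a fixed latch/skew overhead, while hazard stalls grow with depth. The numbers below are assumptions chosen so the optimum lands at 6 FO4 of logic plus 2 FO4 of overhead per stage, echoing the integer-benchmark result the abstract reports; they are not the paper's simulation data.

```python
# Toy model of optimal pipeline depth (all parameters are illustrative):
#   clock period (FO4) = total logic / stages + per-stage overhead
#   CPI grows linearly with depth as hazards cost more cycles
def performance(total_logic_fo4, stages, overhead_fo4, stall_per_stage):
    period = total_logic_fo4 / stages + overhead_fo4  # FO4 per cycle
    cpi = 1.0 + stall_per_stage * stages              # deeper => more stalls
    return 1.0 / (period * cpi)                       # work per unit time

# Assumed: 180 FO4 of total logic, 2 FO4 latch/skew overhead per stage,
# and 0.1 extra stall cycles per instruction per stage of depth.
depths = range(4, 40)
best = max(depths, key=lambda s: performance(180.0, s, 2.0, 0.1))
period_at_best = 180.0 / best + 2.0  # optimal clock period in FO4
```

Under these assumptions the model picks a 30-stage pipeline with an 8 FO4 clock period; making the overhead or stall terms larger pushes the optimum toward shallower, slower-clocked designs.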
measurement and modeling of computer systems | 2000
Keith I. Farkas; Jason Flinn; Godmar Back; Dirk Grunwald; Jennifer-Ann M. Anderson
In this paper, we examine the energy consumption of a state-of-the-art pocket computer. Using a data acquisition system, we measure the energy consumption of the Itsy Pocket Computer, developed by Compaq Computer Corporation's Palo Alto Research Labs. We begin by showing that the energy usage characteristics of the Itsy differ markedly from those of a notebook computer. Then, since we expect that flexible software environments will become increasingly prevalent on pocket computers, we consider applications running in a Java environment. In particular, we explain some of the Java design tradeoffs applicable to pocket computers, and quantify their energy costs. For the design options we considered and the three workloads we studied, we find a maximum change in energy use of 25%.
international symposium on computer architecture | 1997
Keith I. Farkas; Paul Chow; Norman P. Jouppi; Zvonko G. Vranesic
In this paper, we identify performance trends and design relationships between the following components of the data memory hierarchy in a dynamically-scheduled processor: the register file, the lockup-free data cache, the stream buffers, and the interface between these components and the lower levels of the memory hierarchy. Similar performance was obtained from all systems having support for fewer than four in-flight misses, irrespective of the register-file size, the issue width of the processor, and the memory bandwidth. While providing support for more than four in-flight misses did increase system performance, the improvement was less than that obtained by increasing the number of registers. The addition of stream buffers to the investigated systems led to a significant performance increase, with larger increases for systems having less in-flight-miss support, greater memory bandwidth, or more instruction issue capability. The performance of these systems was not significantly affected by the inclusion of traffic filters, dynamic-stride calculators, or the per-load non-unity stride-predictor and incremental-prefetching techniques, which we introduce. However, the incremental-prefetching technique reduces the bandwidth consumed by stream buffers by 50% without a significant impact on performance.
international symposium on computer architecture | 1994
Keith I. Farkas; Norman P. Jouppi
Non-blocking loads are a very effective technique for tolerating the cache-miss latency on data cache references. In this paper, we describe several methods for implementing non-blocking loads. A range of resulting hardware complexity/performance tradeoffs are investigated using an object-code translation and instrumentation system. We have investigated the SPEC92 benchmarks and have found that for the integer benchmarks, a simple hit-under-miss implementation achieves almost all of the available performance improvement for relatively little cost. However, for most of the numeric benchmarks, more expensive implementations are worthwhile. The results also point out the importance of using a compiler capable of scheduling load instructions for cache misses rather than cache hits in non-blocking systems.
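The hit-under-miss implementation the abstract highlights can be sketched with a small trace-driven timing model: a blocking cache stalls for the full miss penalty on every miss, while a hit-under-miss cache keeps servicing hits beneath a single outstanding miss and stalls only when a second miss arrives before the first resolves. The trace and latency below are assumptions for illustration, not the paper's measurements.

```python
# Illustrative cycle counts for blocking vs. hit-under-miss caches on an
# access trace of "hit"/"miss" events. MISS_LATENCY is an assumed figure.
MISS_LATENCY = 20  # cycles

def blocking_cycles(trace):
    # 1 cycle per access, plus the full miss penalty serialized on each miss
    return sum(1 + (MISS_LATENCY if kind == "miss" else 0) for kind in trace)

def hit_under_miss_cycles(trace):
    cycles, miss_ready = 0, 0  # cycle at which the outstanding miss resolves
    for kind in trace:
        if kind == "miss":
            cycles = max(cycles, miss_ready)  # a second miss must wait
            miss_ready = cycles + MISS_LATENCY
        cycles += 1  # issue the access; hits overlap the outstanding miss
    return max(cycles, miss_ready)

# Hypothetical trace: two misses with enough hits between them to hide
# part of the first miss's latency.
trace = ["miss"] + ["hit"] * 10 + ["miss"] + ["hit"] * 5
print(blocking_cycles(trace))        # 57
print(hit_under_miss_cycles(trace))  # 40
```

This also hints at the compiler point in the abstract: the benefit depends on how many independent hits the scheduler can place under each miss, so scheduling loads for the miss case rather than the hit case matters.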
high-performance computer architecture | 1996
Keith I. Farkas; Norman P. Jouppi; Paul Chow
We have investigated the register file requirements of dynamically scheduled processors using register renaming and dispatch queues running the SPEC92 benchmarks. We looked at processors capable of issuing either four or eight instructions per cycle and found that in most cases implementing precise exceptions requires a relatively small number of additional registers compared to imprecise exceptions. Systems with aggressive non-blocking load support were able to achieve performance similar to processors with perfect memory systems at the cost of some additional registers. Given our machine assumptions, we found that the performance of a four-issue machine with a 32-entry dispatch queue tends to saturate around 80 registers. For an eight-issue machine with a 64-entry dispatch queue, performance does not saturate until about 128 registers. Assuming the machine cycle time is proportional to the register file cycle time, the 8-issue machine yields only 20% higher performance than the 4-issue machine, due in part to the cycle time impact of the additional hardware.
IEEE Computer Architecture Letters | 2002
Rakesh Kumar; Keith I. Farkas; Norman P. Jouppi; Parthasarathy Ranganathan; Dean M. Tullsen
In this paper, we present an architecture-compiler based approach to reduce energy consumption in the processor. While we mainly target the fetch unit, an important side-effect of our approach is that we obtain energy savings in many other parts of the processor. The explanation is that the fetch unit often runs substantially ahead of execution, bringing in instructions to different stages in the processor that may never be executed. We have found that, although the degree of Instruction Level Parallelism (ILP) of a program tends to vary over time, it can be statically predicted by the compiler with considerable accuracy. Our Instructions Per Clock (IPC) prediction scheme uses a dependence-testing-based analysis and simple heuristics to guide a front-end fetch-throttling mechanism. We develop the necessary architecture support and include its power overhead. We perform experiments over a wide range of architectural configurations, using SPEC2000 applications. Our results are very encouraging: we obtain up to 15% total energy savings in the processor with generally little performance degradation. In fact, in some cases our intelligent throttling scheme even increases performance.
Keywords: low-power design, compiler-architecture interaction, instruction-level parallelism, fetch throttling
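The fetch-throttling idea above can be sketched simply: the compiler tags each code region with a predicted IPC, and the front end narrows its fetch width in low-ILP regions so it stops running far ahead of execution. The throttling heuristic, region profile, and machine width below are assumptions for illustration, not the paper's mechanism in detail.

```python
# Sketch of compiler-guided fetch throttling: fetch width tracks the
# statically predicted IPC of the current region. All numbers are assumed.
FETCH_WIDTH = 8  # hypothetical maximum fetch width of the machine

def fetch_width_for(predicted_ipc: float) -> int:
    """Throttle fetch to roughly match the predicted ILP of the region."""
    # Never fetch fewer than 1 instruction; never exceed the machine width.
    return max(1, min(FETCH_WIDTH, round(predicted_ipc)))

def fetched_per_cycle(regions):
    """regions: list of (predicted_ipc, cycles) pairs for compiler-marked code."""
    total_cycles = sum(c for _, c in regions)
    fetched = sum(fetch_width_for(ipc) * c for ipc, c in regions)
    return fetched / total_cycles

# Hypothetical profile: a long low-ILP region followed by a high-ILP loop.
regions = [(1.2, 100), (6.0, 50)]
print(fetched_per_cycle(regions))  # far below the unthrottled width of 8
```

Fewer fetched-then-squashed instructions means less wasted work in the fetch, decode, and rename stages, which is where the abstract's "energy savings in many other parts of the processor" come from.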
IEEE Computer | 2001
William Hamburgen; Deborah A. Wallach; Marc A. Viredaz; Lawrence S. Brakmo; Carl A. Waldspurger; Joel F. Bartlett; Timothy Mann; Keith I. Farkas
The Compaq Itsy, a prototype pocket computer with enough processing power and memory capacity to run cycle-hungry applications such as continuous-speech recognition and real-time MPEG-1 movie decoding, has proved to be a useful experimental tool for applications, systems work, and power studies.
international conference on computer graphics and interactive techniques | 1999
Joel McCormack; Ronald N. Perry; Keith I. Farkas; Norman P. Jouppi
Texture mapping using trilinearly filtered mip-mapped data is efficient and looks much better than point-sampled or bilinearly filtered data. These properties have made it ubiquitous: trilinear filtering is offered on a