Richard James Eickemeyer

Publication

Featured research published by Richard James Eickemeyer.

high-performance computer architecture | 2005

Stretching the limits of clock-gating efficiency in server-class processors

Hans M. Jacobson; Pradip Bose; Zhigang Hu; Alper Buyuktosunoglu; Victor Zyuban; Richard James Eickemeyer; Lee Evan Eisen; John Barry Griswell; Doug Logan; Balaram Sinharoy; Joel M. Tendler

Clock-gating has been introduced as the primary means of dynamic power management in recent high-end commercial microprocessors. The temperature drop resulting from active power reduction can result in additional leakage power savings in future processors. In this paper we first examine the realistic benefits and limits of clock-gating in current generation high-performance processors (e.g. of the POWER4™ or POWER5™ class). We then look beyond classical clock-gating: we examine additional opportunities to avoid unnecessary clocking in real workload executions. In particular, we examine the power reduction benefits of a couple of newly invented schemes called transparent pipeline clock-gating and elastic pipeline clock-gating. Based on our experiences with current designs, we try to bound the practical limits of clock-gating efficiency in future microprocessors.

ieee international conference on high performance computing data and analytics | 1997

Evaluation of Multithreaded Processors and Thread-Switch Policies

Richard James Eickemeyer; Ross E. Johnson; Steven R. Kunkel; Beng-Hong Lim; Mark S. Squillante; Ching-Farn Eric Wu

This paper examines the use of coarse-grained multithreading to lessen the negative impact of memory access latencies on the performance of uniprocessor on-line transaction processing systems. It considers the effect of switching threads on cache misses in a two-level cache system. It also examines several different thread-switch policies. The results suggest that multithreading with a small number (3–5) of active threads can significantly improve the performance of such commercial environments.

annual conference on computers | 1993

Architectural effects on dual instruction issue with interlock collapsing ALUs

Nadeem Malik; Richard James Eickemeyer; Stamatis Vassiliadis

The authors present an evaluation of an innovative interlock collapsing arithmetic logic unit (ALU) in combination with several dual instruction issue processor organizations for two very different example architectures, IBM S/370 and MIPS R2000. The interlock collapsing ALU collapses execution interlocks between some integer operations as well as between address generation operations, without increasing the cycle time of the base machine. This allows two execution-dependent ALU instructions to run in parallel in a single cycle instead of being executed sequentially. Results demonstrate that the overall contribution of the various processor organization design alternatives to the increase in instruction-level parallelism is remarkably similar for the two example processors, considering that the architectures are very different, while the contribution of the individual design alternatives varies.

ACM Sigarch Computer Architecture News | 1992

Instruction-level parallelism from execution interlock collapsing

Nadeem Malik; Richard James Eickemeyer; Stamatis Vassiliadis

An innovative technique has been developed that permits the collapsing of execution interlocks between integer ALU operations as well as between address generation operations, allowing parallel execution of two instructions, having true dependencies, in a single cycle. Given that the proposed scheme has been shown not to increase the machine cycle time, it potentially provides an attractive means for increasing the instruction-level parallelism. Preliminary results show that within the basic blocks, the geometric mean of the speedup from this new design technique is up to 10% in the integer SPEC Benchmarks. The geometric mean of the speedup including floating point benchmarks is up to 6%. The results also suggest that depending on the application environment this new design may be used as an alternative to the relatively more expensive out-of-order instruction issue approach.

annual conference on computers | 1993

In-cache pre-processing and decode mechanisms for fine grain parallelism in SCISM

Stamatis Vassiliadis; Bartholomew Blaner; Richard James Eickemeyer; James Edward Phillips; Nadeem Malik

A study was initiated that investigated detractors to parallelism and implementation constraints associated with the critical paths in the design of fine grain parallel machines. The outcome of the research has been a new machine organization that facilitates parallel instruction issue, addresses possible increases in cycle time, and improves instruction-level parallelism using specialized hardware. The authors describe the attributes of the proposed machine organization related to the analysis of instruction sequences for parallel issue and execution. They also describe the permanent preprocessing in the cache that allows instructions to be selected for parallel execution prior to instruction fetch and issue.

annual conference on computers | 1993

Execution dependencies and their resolution in fine grain parallel machines

Nadeem Malik; Stamatis Vassiliadis; Richard James Eickemeyer; J. Philips

Execution dependence between sequential instructions is one of the factors that limits the level of parallelism which can be exploited by fine grain parallel machines. Several architectural, compiler and machine organization techniques that have been used to alleviate this restriction are examined. They are compared against a relatively new mechanism that simply eliminates the execution dependency. The dependency elimination is achieved by using a novel integer arithmetic logic unit (ALU) design, which performs arithmetic and logical operations on three operands in a single cycle, but without extending the cycle time of the base machine.

Archive | 1997