Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Richard J. Eickemeyer is active.

Publication


Featured research published by Richard J. Eickemeyer.


IBM Journal of Research and Development | 2005

POWER5 System microarchitecture

Balaram Sinharoy; Ronald Nick Kalla; Joel M. Tendler; Richard J. Eickemeyer; Jody B. Joyner

The IBM POWER5 is a new microprocessor organized in a system structure that includes new technology to form systems. The name POWER5 as used in this context refers not only to a chip, but also to the structure used to interconnect chips to form systems. In this paper we describe the processor microarchitecture as well as the interconnection architecture employed to form systems up to a 64-way symmetric multiprocessor.


IBM Journal of Research and Development | 1993

A load-instruction unit for pipelined processors

Richard J. Eickemeyer; Stamatis Vassiliadis

A special-purpose load unit is proposed as part of a processor design. The unit prefetches data from the cache by predicting the address of the data fetch in advance. This prefetch allows the cache access to take place early, in an otherwise unused cache cycle, eliminating one cycle from the load instruction. The prediction also allows the cache to prefetch data if they are not already in the cache. The cache-miss handling can be overlapped with other instruction execution. It is shown, using trace-driven simulations, that the proposed mechanism, when incorporated in a design, may contribute to a significant increase in processor performance. The paper also compares different prediction methods and describes a hardware implementation for the load unit.
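A minimal sketch of the prediction idea behind such a load unit, assuming a simple last-address-plus-stride policy (the table layout, policy, and names below are illustrative, not the paper's design): a table indexed by the load instruction's address remembers the last data address and the last observed stride, so the next access can be predicted and the cache accessed, or a prefetch started, before the load actually issues.

```python
# Illustrative stride-based load-address predictor (assumption: the paper's
# unit may use a different prediction policy; this only sketches the concept).

class LoadAddressPredictor:
    def __init__(self):
        # Keyed by the load instruction's address (PC):
        # (last data address seen, last observed stride).
        self.table = {}

    def predict(self, load_pc):
        """Return a predicted data address, or None if no history exists."""
        entry = self.table.get(load_pc)
        if entry is None:
            return None
        last_addr, stride = entry
        return last_addr + stride

    def update(self, load_pc, actual_addr):
        """Record the actual address after the load executes."""
        entry = self.table.get(load_pc)
        if entry is None:
            self.table[load_pc] = (actual_addr, 0)
        else:
            last_addr, _ = entry
            self.table[load_pc] = (actual_addr, actual_addr - last_addr)


# Example: a load at PC 0x400 walking an array with stride 8.
predictor = LoadAddressPredictor()
for addr in (0x1000, 0x1008, 0x1010, 0x1018):
    predicted = predictor.predict(0x400)
    print(f"actual={hex(addr)} predicted={hex(predicted) if predicted else None} "
          f"correct={predicted == addr}")
    predictor.update(0x400, addr)
```

After a short warm-up the strided accesses are predicted correctly, which is what lets the early cache access (or prefetch on a miss) hide the load latency.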


IBM Journal of Research and Development | 2000

A multithreaded PowerPC processor for commercial servers

John Michael Borkenhagen; Richard J. Eickemeyer; Ronald Nick Kalla; Steven R. Kunkel

This paper describes the microarchitecture of the RS64 IV, a multithreaded PowerPC® processor, and its memory system. Because this processor is used only in IBM iSeries™ and pSeries™ commercial servers, it is optimized solely for commercial server workloads. Increasing miss rates because of trends in commercial server applications and increasing latency of cache misses because of rapidly increasing clock frequency are having a compounding effect on the portion of execution time that is wasted on cache misses. As a result, several optimizations are included in the processor design to address this problem. The most significant of these is the use of coarse-grained multithreading to enable the processor to perform useful instructions during cache misses. This provides a significant throughput increase while adding less than 5% to the chip area and having very little impact on cycle time. When compared with other performance-improvement techniques, multithreading yields an excellent ratio of performance gain to implementation cost. Second, the miss rate of the L2 cache is reduced by making it four-way associative. Third, the latency of cache-to-cache movement of data is minimized. Fourth, the size of the L1 caches is relatively large. In addition to addressing cache misses, pipeline holes caused by branches are minimized with large instruction buffers, large L1 I-cache fetch bandwidth, and optimized resolution of the branch direction. In part, the branches are resolved quickly because of the short but efficient pipeline. To minimize pipeline holes due to data dependencies, the L1 D-cache access is optimized to yield a one-cycle load-to-use penalty.
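A toy model of the switch-on-miss behavior described above, under assumed miss-rate and miss-penalty values (the numbers, the switch policy, and the one-instruction-per-cycle core are illustrative assumptions, not the RS64 IV design): instead of stalling on a long-latency cache miss, the core hands the pipeline to another ready thread.

```python
# Toy model of coarse-grained (switch-on-miss) multithreading.
# Miss rate, miss penalty, and the switch policy are illustrative assumptions.

import random

MISS_PENALTY = 80      # cycles a missing thread is blocked (assumed)
MISS_RATE = 0.02       # probability an instruction misses the cache (assumed)
INSTRUCTIONS = 100_000

def run(num_threads, seed=0):
    rng = random.Random(seed)
    blocked_until = [0] * num_threads   # cycle at which each thread is ready again
    done = 0                            # total instructions completed
    cycle = 0
    current = 0                         # thread currently holding the pipeline
    while done < INSTRUCTIONS:
        if blocked_until[current] > cycle:
            # Current thread is waiting on a miss: switch to a ready thread if any.
            ready = [t for t in range(num_threads) if blocked_until[t] <= cycle]
            if not ready:
                cycle = min(blocked_until)   # everyone is stalled: skip ahead
                continue
            current = ready[0]
        done += 1                            # one instruction per cycle while running
        if rng.random() < MISS_RATE:
            blocked_until[current] = cycle + MISS_PENALTY
        cycle += 1
    return done / cycle                      # throughput in instructions per cycle

print(f"1 thread : {run(1):.2f} IPC")
print(f"2 threads: {run(2):.2f} IPC")
```

Running it shows the qualitative effect the abstract describes: the second thread recovers much of the time a single thread would otherwise spend waiting on misses.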


International Symposium on Computer Architecture | 1996

Evaluation of Multithreaded Uniprocessors for Commercial Application Environments

Mark S. Squillante; Ross Evan Johnson; Shiafun Liu; Steven R. Kunkel; Richard J. Eickemeyer

As memory speeds grow at a considerably slower rate than processor speeds, memory accesses are starting to dominate the execution time of processors, and this will likely continue into the future. This trend will be exacerbated by growing miss rates due to commercial applications, object-oriented programming and micro-kernel based operating systems. We examine the use of coarse-grained multithreading to address this important problem in uniprocessor on-line transaction processing environments where there is a natural, coarse-grained parallelism between the tasks resulting from transactions being executed concurrently, with no application software modifications required. Our results suggest that multithreading can provide significant performance improvements for uniprocessor commercial computing environments.


IBM Journal of Research and Development | 2000

A performance methodology for commercial servers

Steven R. Kunkel; Richard J. Eickemeyer; Mikko H. Lipasti; Timothy J. Mullins; Brian W. O'Krafka; Harold Rosenberg; Steven P. Vanderwiel; Philip L. Vitale; Larry D. Whitley

This paper discusses a methodology for analyzing and optimizing the performance of commercial servers. Commercial server workloads are shown to have unique characteristics which expand the elements that must be optimized to achieve good performance and require a unique performance methodology. The steps in the process of server performance optimization are described and include the following:

1. Selection of representative commercial workloads and identification of key characteristics to be evaluated.
2. Collection of performance data. Various instrumentation techniques are discussed in light of the requirements placed by commercial server workloads on the instrumentation.
3. Creation of input data for performance models on the basis of measured workload information. This step in the methodology must overcome the operating environment differences between the instance of the measured system under test and the target system design to be modeled.
4. Creation of performance models. Two general types are described: high-level models and detailed cycle-accurate simulators. These types are applied to model the processor, memory, and I/O system.
5. System performance optimization. The tuning of the operating system and application software is described.

Optimization of performance among commercial applications is not simply an exercise in using traces to maximize the processor MIPS. Equally significant are items such as the use of probabilities to reflect future workload characteristics, software tuning, cache miss rate optimization, memory management, and I/O performance. The paper presents techniques for evaluating the performance of each of these key contributors so as to optimize the overall performance and cost/performance of commercial servers.
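A hedged sketch of the kind of "high-level model" mentioned in step 4: a CPI stack that adds memory-hierarchy and branch penalties, derived from measured event rates, onto a base CPI. The component names, rates, and penalties below are hypothetical values for illustration, not data from the paper.

```python
# Minimal CPI-stack style high-level model of the kind described in step 4.
# All parameter names and values below are illustrative assumptions.

def cpi_stack(base_cpi, components):
    """base_cpi: CPI with a perfect memory hierarchy.
    components: list of (name, events_per_instruction, penalty_cycles)."""
    stack = {"base": base_cpi}
    for name, per_instr, penalty in components:
        stack[name] = per_instr * penalty
    return stack

# Hypothetical measured workload characteristics (events per instruction).
stack = cpi_stack(
    base_cpi=1.1,
    components=[
        ("L2 miss to memory", 0.004, 300),
        ("L1 D-cache miss to L2", 0.030, 12),
        ("branch mispredict", 0.008, 15),
    ],
)

total = sum(stack.values())
for name, cycles in stack.items():
    print(f"{name:24s} {cycles:5.2f} CPI ({100 * cycles / total:4.1f}%)")
print(f"{'total':24s} {total:5.2f} CPI")
```

A breakdown of this form makes it easy to see which contributor (memory latency, cache misses, branches) dominates and therefore where optimization effort pays off.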


IBM Journal of Research and Development | 1994

SCISM: a scalable compound instruction set machine

Stamatis Vassiliadis; Bart Blaner; Richard J. Eickemeyer

In this paper we describe a machine organization suitable for RISC and CISC architectures. The proposed organization reduces hardware complexity in parallel instruction fetch and issue logic by minimizing possible increases in cycle time caused by parallel instruction issue decisions in the instruction buffer. Furthermore, it improves instruction-level parallelism by means of special features. The improvements are achieved by analyzing instruction sequences and deciding which instructions will issue and execute in parallel prior to actual instruction fetch and issue, by incorporating preprocessed information for parallel issue and execution of instructions in the cache, by categorizing instructions for parallel issue and execution on the basis of hardware utilization rather than opcode description, by attempting to avoid memory interlocks through the preprocessing mechanism, and by eliminating execution interlocks with specialized hardware.

Improvements in the performance of computer systems relate to circuit-level or technology improvements and to organizational techniques such as pipelining, cache memories, out-of-order execution, multiple functional units, and exploitation of instruction-level parallelism. One increasingly popular approach for exploiting instruction-level parallelism, i.e., allowing multiple instructions to be issued and executed in one machine cycle, is the so-called superscalar machine organization [1]. A number of such machines with varying degrees of parallelism have recently been described [2, 3]. The increasing popularity of superscalar machine organizations may be attributed to the increased instruction execution rate such systems may offer, concomitant with technology improvements that have made their organizations more feasible.
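A rough sketch of the compounding idea described above, assuming made-up instruction categories and a simple two-instruction pairing rule (the real SCISM rules are more elaborate): instructions are categorized by the hardware they occupy, and adjacent independent pairs are tagged ahead of issue so the issue logic only has to read the tag.

```python
# Illustrative pre-issue "compounding": tag adjacent instruction pairs that can
# issue together, so the decision is made once (e.g., when a line is brought
# into the cache) rather than in the issue stage. Categories and the pairing
# rule are assumptions for illustration only.

from dataclasses import dataclass
from typing import Optional, Tuple

# Map opcode to the hardware resource class it occupies (assumed categories).
RESOURCE = {"add": "alu", "sub": "alu", "load": "mem", "store": "mem", "br": "branch"}

@dataclass
class Instr:
    op: str
    dst: Optional[str]
    srcs: Tuple[str, ...]

def can_compound(a: Instr, b: Instr) -> bool:
    # No true dependence: b must not read or overwrite a's result register,
    # and the pair must not need the same resource class in the same cycle.
    if a.dst is not None and (a.dst in b.srcs or a.dst == b.dst):
        return False
    return RESOURCE[a.op] != RESOURCE[b.op]

def tag(instructions):
    """Return compounding tags: 1 marks the start of a two-instruction group."""
    tags, i = [], 0
    while i < len(instructions):
        if i + 1 < len(instructions) and can_compound(instructions[i], instructions[i + 1]):
            tags += [1, 0]
            i += 2
        else:
            tags += [0]
            i += 1
    return tags

code = [
    Instr("load", "r1", ("r2",)),
    Instr("add", "r3", ("r4", "r5")),    # independent of the load: compoundable
    Instr("add", "r6", ("r3", "r1")),    # reads r3 and r1: must wait
    Instr("store", None, ("r6", "r2")),
]
print(tag(code))   # [1, 0, 0, 0]
```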


High-Performance Computer Architecture | 2011

Abstraction and microarchitecture scaling in early-stage power modeling

Hans M. Jacobson; Alper Buyuktosunoglu; Pradip Bose; Emrah Acar; Richard J. Eickemeyer

Early-stage, microarchitecture-level power modeling methodologies have been used in industry and academic research for a decade (or more). Such methods use cycle-accurate performance simulators and deduce active power based on utilization markers. A key question faced in this context is: what key utilization metrics to monitor, and how many are needed for accuracy? Is there a systematic way to select the “best” markers? We also pose a key follow-on question: is it possible to perform accurate scaling of an abstracted model to enable exploration of new microarchitecture features? In this paper, we address these particular questions and examine the results for a range of abstraction levels. We highlight innovative insights for intelligent abstraction and microarchitecture scaling, and point out the pitfalls of abstractions that are not based on a systematic methodology or sound theory.
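A small sketch of a utilization-marker power model of the general kind discussed here, assuming synthetic data, hypothetical marker names, and a plain least-squares fit (the paper's methodology and marker selection are not reproduced): power is modeled as a weighted sum of per-cycle utilization counts plus a constant term, and fit error gives one way to compare candidate marker sets.

```python
# Illustrative utilization-based power model: fit active power as a weighted
# sum of per-cycle utilization markers. Marker names, synthetic data, and the
# plain least-squares fit are assumptions for illustration only.

import numpy as np

rng = np.random.default_rng(0)
markers = ["instr_dispatched", "fpu_busy", "l2_accesses", "lsu_busy"]

# Synthetic "training" data: rows are sampled intervals, columns are per-cycle
# utilization of each marker; power is a hidden linear combination plus noise.
X = rng.uniform(0.0, 1.0, size=(200, len(markers)))
true_weights = np.array([12.0, 8.0, 3.0, 5.0])   # watts per unit utilization (assumed)
leakage = 20.0                                   # constant idle/leakage power (assumed)
power = X @ true_weights + leakage + rng.normal(0.0, 0.5, size=200)

# Fit the marker weights plus a constant term with least squares.
A = np.hstack([X, np.ones((200, 1))])
coef, *_ = np.linalg.lstsq(A, power, rcond=None)

for name, w in zip(markers + ["constant"], coef):
    print(f"{name:18s} {w:6.2f} W")

# Model quality on the training data (a real study would use held-out workloads).
rms_err = np.sqrt(np.mean((A @ coef - power) ** 2))
print(f"RMS error: {rms_err:.2f} W")
```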


IBM Journal of Research and Development | 2005

Characterization of simultaneous multithreading (SMT) efficiency in POWER5

Harry M. Mathis; Alex E. Mericas; John D. McCalpin; Richard J. Eickemeyer; Steven R. Kunkel

Coarse-grained multithreading, the switching of threads to avoid idle processor time during long-latency events, has been available on IBM systems since 1998. Simultaneous multithreading (SMT), first available on the POWER5™ processor, moves beyond simple thread switching to the maintenance of two thread streams that are issued as continuously as possible to ensure the maximum use of processor resources. Because SMT has the potential of increasing processor efficiency and correspondingly increasing the amount of work done for a given time span, the reader might suppose that SMT would exhibit a performance gain for all workloads. This is true for most workloads, but is not true in some exceptional cases. In SMT mode, the processor resources--register sets, caches, queues, translation buffers, and the system memory nest--must be shared by both threads, and conditions can occur that degrade or even obviate SMT performance improvement. The POWER4™ and POWER5 processors have very powerful performance monitor (PM) toolsets that can help the user to determine what is occurring in workloads that may not be providing expected SMT gains. In this paper, the results of measured differences among workloads having large, medium, small, and even negative SMT performance gains are presented along with an approach to investigating workloads to determine the source of SMT performance gain limits.
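A hypothetical numeric example of the comparison such performance-monitor data supports: SMT gain computed as combined two-thread throughput relative to single-thread throughput over the same cycle window. The counter values are invented to show one workload with a healthy gain and one where cache sharing erases it.

```python
# Hypothetical example of quantifying SMT gain from performance-monitor style
# counts: instructions completed per thread over a fixed cycle window.
# The numbers below are made up for illustration.

def smt_gain(st_instr, st_cycles, smt_instr_t0, smt_instr_t1, smt_cycles):
    ipc_single = st_instr / st_cycles
    ipc_smt = (smt_instr_t0 + smt_instr_t1) / smt_cycles
    return ipc_smt / ipc_single - 1.0

# Workload A: threads do not thrash shared resources -> healthy gain.
print(f"A: {smt_gain(9.0e8, 1.0e9, 6.0e8, 5.5e8, 1.0e9):+.1%}")

# Workload B: threads compete for the cache -> combined IPC barely changes.
print(f"B: {smt_gain(9.0e8, 1.0e9, 4.5e8, 4.0e8, 1.0e9):+.1%}")
```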


ACM SIGARCH Computer Architecture News | 1992

On the attributes of the SCISM organization

Stamatis Vassiliadis; Bart Blaner; Richard J. Eickemeyer

In this paper, we describe some of the attributes of the SCISM organization, a multiple instruction-issuing machine, the outcome of five years of research at the IBM Glendale Laboratory, in Endicott, New York. The proposed organization embodies a number of mechanisms, including the analysis of instruction sequences and deciding which instructions will execute in parallel prior to instruction fetch and issue, the incorporation of permanent preprocessing of instructions to be executed in parallel, the categorization of instructions for parallel execution on the basis of hardware utilization rather than opcode description, the avoidance of memory interlocks through the preprocessing mechanism, and the elimination of execution interlocks with specialized hardware. It is shown that by incorporating these mechanisms, a SCISM capable of issuing and executing two instructions per cycle can achieve more than 90% of the theoretical maximum performance of an idealized, dual instruction issue superscalar machine.


IBM Journal of Research and Development | 2011

IBM POWER7 performance modeling, verification, and evaluation

M. Srinivas; Balaram Sinharoy; Richard J. Eickemeyer; Ram Raghavan; Steven R. Kunkel; Tien Chi Chen; W. Maron; D. Flemming; A. Blanchard; P. Seshadri; Jeffrey W. Kellington; Alex E. Mericas; A. E. Petruski; V. R. Indukuru; S. Reyes

In this paper, we describe the key performance enhancements in the IBM POWER7® microarchitecture and its memory hierarchy, including the performance modeling and verification methodology. We also describe the performance characteristics of server applications, including Standard Performance Evaluation Corporation (SPEC) CPU, SAP Sales and Distribution, SPECjbb, online transaction processing workloads, and high-performance computing applications running on POWER7 processor-based systems compared with other systems.

Collaboration


Dive into Richard J. Eickemeyer's collaborations.

Top Co-Authors


Stamatis Vassiliadis

Delft University of Technology
