
Publications


Featured research published by Thomas R. Puzak.


High Performance Computer Architecture | 2001

Branch history guided instruction prefetching

Viji Srinivasan; Edward S. Davidson; Gary S. Tyson; Mark J. Charney; Thomas R. Puzak

Instruction cache misses stall the fetch stage of the processor pipeline and hence affect instruction supply to the processor. Instruction prefetching has been proposed as a mechanism to reduce instruction cache (I-cache) misses. However, a prefetch is effective only if accurate and initiated sufficiently early to cover the miss penalty. This paper presents a new hardware-based instruction prefetching mechanism, Branch History Guided Prefetching (BHGP), to improve the timeliness of instruction prefetches. BHGP correlates the execution of a branch instruction with I-cache misses and uses branch instructions to trigger prefetches of instructions that occur (N-1) branches later in the program execution, for a given N>1. Evaluations on commercial applications, Windows NT applications, and some CPU2000 applications show an average reduction of 66% in miss rate over all applications. BHGP improved the IPC by 12 to 14% for the CPU2000 applications studied; on average 80% of the BHGP prefetches arrived in cache before their next use, even on a 4-wide issue machine with a 15 cycle L2 access penalty.
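The correlation idea behind BHGP can be illustrated with a toy prefetch table. This is a hypothetical sketch of the mechanism described above, not IBM's hardware design; the class name, unbounded table, and the choice N=2 are assumptions made for brevity.

```python
from collections import defaultdict, deque

# Hypothetical BHGP-style sketch: a table maps a branch address to the
# instruction blocks that missed in the I-cache (N-1) branches after that
# branch executed. N=2 here, so a branch triggers prefetches for misses
# observed one branch later on earlier passes.
N = 2

class BHGPPrefetcher:
    def __init__(self):
        self.table = defaultdict(set)            # branch addr -> blocks
        self.recent_branches = deque(maxlen=N - 1)

    def on_branch(self, branch_addr):
        # Issue the prefetches learned for this branch, then remember it.
        prefetches = sorted(self.table[branch_addr])
        self.recent_branches.append(branch_addr)
        return prefetches

    def on_icache_miss(self, block_addr):
        # Correlate the miss with the branch executed (N-1) branches ago.
        if len(self.recent_branches) == N - 1:
            self.table[self.recent_branches[0]].add(block_addr)

p = BHGPPrefetcher()
p.on_branch(0x100)        # first pass: nothing learned yet
p.on_icache_miss(0x800)   # miss is correlated with branch 0x100
assert p.on_branch(0x100) == [0x800]   # later pass: prefetch triggered
```

Because the trigger fires N-1 branches ahead of the miss, the prefetch gains lead time to cover the miss penalty, which is the timeliness property the paper targets.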


International Symposium on Computer Architecture | 2002

The optimum pipeline depth for a microprocessor

Allan M. Hartstein; Thomas R. Puzak

The impact of pipeline length on the performance of a microprocessor is explored both theoretically and by simulation. An analytical theory is presented that shows two opposing architectural parameters affect the optimal pipeline length: the degree of instruction level parallelism (superscalar) decreases the optimal pipeline length, while the lack of pipeline stalls increases the optimal pipeline length. This theory is tested by analyzing the optimal pipeline length for 35 applications representing three classes of workloads. Trace tapes are collected from SPEC95 and SPEC2000 applications, traditional (legacy) database and on-line transaction processing (OLTP) applications, and modern (e.g., Web) applications primarily written in Java and C++. The results show that there is a clear and significant difference in the optimal pipeline length between the SPEC workloads and both the legacy and modern applications. The SPEC applications, written in C, optimize to a shorter pipeline length than the legacy applications, largely written in assembler language, with relatively little overlap in the two distributions. Additionally, the optimal pipeline length distribution for the C++ and Java workloads overlaps with the legacy applications, suggesting similar workload characteristics. These results are explored across a wide range of superscalar processors, both in-order and out-of-order.
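The tension between the two opposing parameters can be shown with a toy model. All constants and the exact cost function below are assumptions for illustration, not values from the paper: per-stage logic delay shrinks as depth p grows, while stall penalties grow with p, so time per instruction has an interior minimum.

```python
import math

# Toy model of the opposing forces (all constants assumed):
#   T(p) = t_logic / p + t_latch + h * p * t_pen
# where t_logic is total logic delay, t_latch the per-stage latch
# overhead, h the stall frequency, and t_pen the per-stage stall cost.
def time_per_instruction(p, t_logic=20.0, t_latch=1.0, h=0.1, t_pen=2.0):
    return t_logic / p + t_latch + h * p * t_pen

# Setting dT/dp = 0 gives the optimum depth p* = sqrt(t_logic / (h * t_pen)).
p_opt = math.sqrt(20.0 / (0.1 * 2.0))        # 10 stages for these numbers
best = min(range(1, 50), key=time_per_instruction)
assert best == 10 and p_opt == 10.0
```

More stalls (larger h) push the optimum shallower and more overlap pushes it deeper, matching the qualitative trade-off the theory describes.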


International Symposium on Microarchitecture | 2003

Optimum power/performance pipeline depth

Allan M. Hartstein; Thomas R. Puzak

The impact of pipeline length on both the power and performance of a microprocessor is explored both theoretically and by simulation. A theory is presented for a wide range of power/performance metrics, BIPS^m/W. The theory shows that the more important power is to the metric, the shorter the optimum pipeline length that results. For typical parameters neither BIPS/W nor BIPS^2/W yield an optimum, i.e., a non-pipelined design is optimal. For BIPS^3/W the optimum, averaged over all 55 workloads studied, occurs at a 22.5 FO4 design point, a 7 stage pipeline, but this value is highly dependent on the assumed growth in latch count with pipeline depth. As dynamic power grows, the optimal design point shifts to shorter pipelines. Clock gating pushes the optimum to deeper pipelines. Surprisingly, as leakage power grows, the optimum is also found to shift to deeper pipelines. The optimum pipeline depth varies for different classes of workloads: SPEC95 and SPEC2000 integer applications, traditional (legacy) database and on-line transaction processing applications, modern (e.g., Web) applications, and floating point applications.
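The effect of the exponent m in BIPS^m/W can be illustrated numerically. Every constant below is an assumption chosen to make the shapes visible, not a value from the paper: deeper pipelines raise frequency, but latch count, and hence power, grows with depth.

```python
# Toy BIPS^m/W illustration (all constants assumed).
def bips(p):
    # Performance: logic delay shrinks with depth p, stalls grow with p.
    return 1.0 / (20.0 / p + 1.0 + 0.2 * p)

def watts(p):
    # Power tracks latch count, which grows with pipeline depth.
    return 1.0 + 0.5 * p

def metric(p, m):
    return bips(p) ** m / watts(p)

opts = {m: max(range(1, 40), key=lambda p: metric(p, m)) for m in (1, 2, 3)}
# The heavier performance is weighted (larger m), the deeper the optimum.
assert opts[1] <= opts[2] <= opts[3]
```

With these assumed constants the optimum moves from 4 stages for BIPS/W to 7 stages for BIPS^3/W, reproducing the qualitative trend that weighting power more heavily favors shorter pipelines.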


IBM Journal of Research and Development | 1997

Prefetching and memory system behavior of the SPEC95 benchmark suite

Mark J. Charney; Thomas R. Puzak

This paper presents instruction and data cache miss rates for the SPEC95™ benchmark suite. We have simulated the instruction and data traffic resulting from 500 million instructions of each of the 18 programs. Simulation results show that only a few of the applications place more than modest demands on the memory system. This was especially true for instruction caches, where only a few workloads required more than a 32-KB cache to achieve miss rates of less than one miss every 1000 instructions. We also analyze two prefetching algorithms using the SPEC95 workload: next-sequential prefetching and shadow-directory prefetching. Each prefetching algorithm is evaluated using three performance metrics: coverage, accuracy, and traffic. Variations in each prefetching algorithm involve the use of a confirmation mechanism that receives feedback information about the quality of each prefetch. With confirmation, the prefetching algorithm is able to enhance the accuracy of prefetching decisions. The results show that shadow-directory prefetching averages miss coverage about ten percentage points higher than next-sequential prefetching when used in prefetching instructions (about 60 percent coverage for next-sequential prefetching versus 70 percent for shadow-directory prefetching). The prefetching accuracy for both algorithms is more than 90 percent when a confirmation mechanism is used. In general, data prefetching is shown to be less accurate and to provide less coverage than instruction prefetching. Shadow-directory prefetching averaged about a 40 percent miss coverage versus a 25 percent miss coverage for next-sequential prefetching. Prefetching accuracy is over 70 percent when confirmation is applied.
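The three evaluation metrics have straightforward definitions, sketched below on made-up example counts (the numbers are illustrative assumptions, not the paper's data).

```python
# Coverage, accuracy, and traffic for a prefetching algorithm, computed
# from assumed example counts.
def prefetch_metrics(misses_without_prefetch, prefetches_issued,
                     prefetches_used):
    coverage = prefetches_used / misses_without_prefetch   # misses removed
    accuracy = prefetches_used / prefetches_issued         # useful fraction
    extra_traffic = prefetches_issued - prefetches_used    # wasted transfers
    return coverage, accuracy, extra_traffic

cov, acc, waste = prefetch_metrics(1000, 700, 600)
assert (cov, waste) == (0.6, 100)
```

A confirmation mechanism, in these terms, raises accuracy by suppressing prefetches that previously went unused, at the cost of some coverage.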


Computing Frontiers | 2006

Cache miss behavior: is it √2?

Allan M. Hartstein; Viji Srinivasan; Thomas R. Puzak; Philip G. Emma

It has long been empirically observed that the cache miss rate decreases as a power law of cache size, where the power is approximately -1/2. In this paper, we examine the dependence of the cache miss rate on cache size both theoretically and through simulation. By combining the observed time dependence of the cache reference pattern with a statistical treatment of cache entry replacement, we predict that the cache miss rate should vary with cache size as an inverse power law for a first level cache. The exponent in the power law is directly related to the time dependence of cache references, and lies between -0.3 and -0.7. Results are presented for both direct mapped and set associative caches, and for various levels of the cache hierarchy. Our results demonstrate that the dependence of cache miss rate on cache size arises from the temporal dependence of the cache access pattern.
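The power law, and the √2 in the title, can be made concrete with a small numeric sketch. The constants below are assumptions for illustration.

```python
import math

# Power-law model of the empirical observation (C and alpha assumed):
#   miss_rate(S) = C * S**alpha, with alpha near -1/2.
C, alpha = 0.04, -0.5

def miss_rate(size_kb):
    return C * size_kb ** alpha

# The exponent is the slope of the miss-rate curve on log-log axes.
s1, s2 = 32, 256
slope = math.log(miss_rate(s2) / miss_rate(s1)) / math.log(s2 / s1)
assert abs(slope - alpha) < 1e-12

# With alpha = -1/2, doubling the cache divides the miss rate by sqrt(2),
# which is the ratio the title asks about.
assert abs(miss_rate(64) / miss_rate(128) - math.sqrt(2)) < 1e-12
```

The paper's finding that the fitted exponent actually ranges from about -0.3 to -0.7 means the doubling ratio is only approximately √2, varying with the workload's temporal reference pattern.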


ACM Transactions on Architecture and Code Optimization | 2004

The optimum pipeline depth considering both power and performance

Allan M. Hartstein; Thomas R. Puzak

The impact of pipeline length on both the power and performance of a microprocessor is explored both by theory and by simulation. A theory is presented for a range of power/performance metrics, BIPS^m/W. The theory shows that the more important power is to the metric, the shorter the optimum pipeline length that results. For typical parameters neither BIPS/W nor BIPS^2/W yield an optimum, i.e., a non-pipelined design is optimal. For BIPS^3/W the optimum, averaged over all 55 workloads studied, occurs at a 22.5 FO4 design point, a 7 stage pipeline, but this value is highly dependent on the assumed growth in latch count with pipeline depth. As dynamic power grows, the optimal design point shifts to shorter pipelines. Clock gating pushes the optimum to deeper pipelines. Surprisingly, as leakage power grows, the optimum is also found to shift to deeper pipelines. The optimum pipeline depth varies for different classes of workloads: SPEC95 and SPEC2000 integer applications, traditional (legacy) database and on-line transaction processing applications, modern (e.g., Web) applications, and floating point applications.


Computing Frontiers | 2005

When prefetching improves/degrades performance

Thomas R. Puzak; Allan M. Hartstein; Philip G. Emma; Viji Srinivasan

We formulate a new method for evaluating any prefetching algorithm (real or hypothetical). This method allows researchers to analyze the potential improvements prefetching can bring to an application independent of any known prefetching algorithm. We characterize prefetching with the metrics: timeliness, coverage, and accuracy. We demonstrate the usefulness of this method using a Markov prefetch algorithm. Under ideal conditions, prefetching can remove nearly all of the pipeline stalls associated with a cache miss. However, in today's processors, we show that nearly all of the performance benefits derived from prefetching are eroded and, in many cases, prefetching loses performance. We perform a quantitative analysis of these trade-offs and show that there are linear relationships between overall performance and coverage, accuracy, and bandwidth.
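A toy CPI model shows how prefetching can either help or hurt. Every constant below is an assumption chosen for illustration, not the paper's data: covered misses save the full miss penalty, but inaccurate prefetches cost bandwidth and cache pollution.

```python
# Toy CPI model (all constants assumed) of the prefetching trade-off.
def cpi(coverage, accuracy, cpi_inf=1.0, mpki=20.0, miss_pen=100.0,
        bad_prefetch_cost=30.0):
    misses = mpki / 1000.0                   # misses per instruction
    useful = coverage * misses               # misses removed by prefetching
    issued = useful / accuracy if accuracy else 0.0
    wasted = issued - useful                 # useless prefetches
    return (cpi_inf
            + (misses - useful) * miss_pen   # remaining miss stalls
            + wasted * bad_prefetch_cost)    # bandwidth/pollution cost

base = cpi(coverage=0.0, accuracy=1.0)       # no prefetching
good = cpi(coverage=0.8, accuracy=0.9)       # accurate prefetching helps
bad = cpi(coverage=0.8, accuracy=0.2)        # inaccurate prefetching hurts
assert good < base < bad
```

Because CPI is linear in both the useful and the wasted prefetch counts, the model also exhibits the linear performance-versus-coverage/accuracy/bandwidth relationships the paper reports.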


Winter Simulation Conference | 1989

Simulation and Analysis of a Pipeline Processor

Philip G. Emma; Joshua W. Knight; James H. Pomerene; Thomas R. Puzak; Rudolph Nathan Rechtschaffen

In this paper we describe a software simulator (a timer) that is used to model a wide range of pipeline processors. A set of performance equations is developed that allows a user to separate the performance of a processor into its infinite-cache and finite-cache performance values. We then use the timer to study the performance of two different machine organizations. Performance curves are presented that help a user compare the performance of each organization (in terms of MIPS and cycles per instruction) to the cycle time chosen to implement the design.
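The separation the timer enables can be sketched in two lines; the numbers below are assumptions for illustration.

```python
# Total CPI splits into an infinite-cache component and a finite-cache
# adder contributed by misses (all numbers assumed).
def total_cpi(cpi_infinite, misses_per_instr, miss_penalty_cycles):
    return cpi_infinite + misses_per_instr * miss_penalty_cycles

def mips(cpi, cycle_time_ns):
    # One instruction takes cpi * cycle_time ns, so millions/second is:
    return 1000.0 / (cpi * cycle_time_ns)

c = total_cpi(cpi_infinite=1.2, misses_per_instr=0.02,
              miss_penalty_cycles=15)        # 1.2 + 0.3 finite-cache adder
assert abs(c - 1.5) < 1e-9
```

Sweeping the cycle time through `mips` reproduces the kind of MIPS-versus-cycle-time curves the paper uses to compare organizations.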


Microprocessing and Microprogramming | 1992

Contrasting instruction-fetch time and instruction-decode time branch prediction mechanisms: Achieving synergy through their cooperative operation

David R. Kaeli; Philip G. Emma; Joshua W. Knight; Thomas R. Puzak

We present a new design in which two branch prediction mechanisms are used in conjunction. We show that the combination of these mechanisms will reduce branch penalty, while also reducing chip area.
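The flavor of such cooperation can be sketched as follows. This is a hypothetical illustration of a fetch-time table backed by a decode-time direction table, not the paper's exact design; all names and structures are assumptions.

```python
# Hypothetical sketch: a small fetch-time BTB supplies an early target
# when it hits, and a decode-time 1-bit direction table covers BTB
# misses with a later, but still pre-execute, prediction.
class CooperativePredictor:
    def __init__(self):
        self.btb = {}          # branch addr -> target, used at fetch time
        self.taken_bit = {}    # branch addr -> last direction, decode time

    def predict(self, addr, fallthrough):
        if addr in self.btb:                   # early, fetch-time hit
            return self.btb[addr], 'fetch'
        if self.taken_bit.get(addr, False):    # decode-time: predict taken
            return None, 'decode-taken'        # target not yet computed
        return fallthrough, 'decode-not-taken'

    def update(self, addr, taken, target):
        self.taken_bit[addr] = taken
        if taken:
            self.btb[addr] = target            # learn the target
        else:
            self.btb.pop(addr, None)           # free the BTB entry

bp = CooperativePredictor()
assert bp.predict(0x40, 0x44) == (0x44, 'decode-not-taken')
bp.update(0x40, True, 0x80)
assert bp.predict(0x40, 0x44) == (0x80, 'fetch')
```

The area saving in such schemes comes from letting the cheaper decode-time table handle branches the small BTB cannot hold, rather than growing the BTB itself.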


Archive | 1984

Prefetching system for a cache having a second directory for sequentially accessed blocks

James Herbert Pomerene; Thomas R. Puzak; Rudolph Nathan Rechtschaffen; Frank John Sparacio
