Publications


Featured research published by Kishore N. Menezes.


International Conference on Computer Design | 1996

Reducing state loss for effective trace sampling of superscalar processors

Thomas M. Conte; Mary Ann Hirsch; Kishore N. Menezes

There is a wealth of technological alternatives that can be incorporated into a processor design. These include reservation station designs, functional unit duplication, and branch handling strategies. The performance of a given design is measured through the execution of application programs and other workloads. Presently, trace-driven simulation is the most popular method of processor performance analysis during the development stage of system design. Current techniques of trace-driven simulation, however, are extremely slow and expensive. A fast and accurate method for statistical trace sampling of superscalar processors is proposed.
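
As a rough illustration of the sampling idea (a hypothetical Python sketch, not the paper's simulator), one can measure only short trace clusters, each preceded by an unmeasured warm-up window that rebuilds predictor state to limit the error the title calls state loss:

```python
import random

def simulate(window, state):
    """Toy stand-in for a cycle simulator: updates a branch-predictor-like
    table (the 'state') and returns the fraction of mispredicted branches."""
    mispredicts = branches = 0
    for pc, taken in window:
        branches += 1
        if state.get(pc, False) != taken:
            mispredicts += 1
        state[pc] = taken                      # 1-bit last-outcome predictor
    return mispredicts / branches if branches else 0.0

def sampled_misprediction_rate(trace, cluster_len=1000, warmup_len=200, clusters=50):
    """Estimate the misprediction rate from sampled clusters. Each cluster is
    preceded by an unmeasured warm-up window that rebuilds predictor state,
    reducing the cold-start ('state loss') bias of sampling."""
    estimates = []
    for _ in range(clusters):
        start = random.randrange(warmup_len, len(trace) - cluster_len)
        state = {}
        simulate(trace[start - warmup_len:start], state)   # warm up, discard
        estimates.append(simulate(trace[start:start + cluster_len], state))
    return sum(estimates) / len(estimates)

# Example: a synthetic trace of (branch PC, taken?) pairs.
trace = [(random.choice([0x400, 0x404, 0x408]), random.random() < 0.7)
         for _ in range(100_000)]
print(f"sampled misprediction rate: {sampled_misprediction_rate(trace):.3f}")
```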


International Symposium on Computer Architecture | 1995

Optimization of instruction fetch mechanisms for high issue rates

Thomas M. Conte; Kishore N. Menezes; Patrick Mills; Burzin A. Patel

Recent superscalar processors issue four instructions per cycle and are powered by highly parallel execution cores. This potential performance can be exploited only when the core is fed with sufficient instruction bandwidth, a task that falls to the instruction fetch unit. Accurate branch prediction and low I-cache miss ratios are essential for efficient operation of the fetch unit, and several studies on cache design and branch prediction address this problem. However, these techniques alone are not sufficient: even with efficient cache designs and branch prediction, the fetch unit must continuously extract multiple, non-sequential instructions from the instruction cache, realign them in the proper order, and supply them to the decoder. This paper explores solutions to this problem and presents several schemes with varying degrees of performance and cost. The most general scheme, the collapsing buffer, achieves near-perfect performance, correctly aligning instructions more than 90% of the time over a wide range of issue rates. The performance boost provided by compiler optimization techniques is also investigated; results show that compiler optimization can significantly enhance performance across all schemes, and the collapsing buffer supplemented by compiler techniques remains the best-performing mechanism. The paper closes with recommendations and suggestions for future work.
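
A toy model of the alignment problem the collapsing buffer addresses (the slot layout and interface below are invented for illustration): when a predicted-taken branch targets another slot in the same cache line, the skipped slots are collapsed out so the decoder still receives a dense, in-order packet:

```python
def collapse_fetch(cache_line, fetch_pos, taken_branches, issue_width=4):
    """Toy collapsing-buffer-style fetch: walk one cache line from the fetch
    position, and when a predicted-taken branch targets another slot in the
    same line, skip ('collapse over') the intervening slots.

    cache_line:     list of instruction words (one cache line)
    fetch_pos:      slot index where fetch begins
    taken_branches: dict mapping branch slot -> predicted intra-line target
    """
    packet, pos = [], fetch_pos
    while pos < len(cache_line) and len(packet) < issue_width:
        packet.append(cache_line[pos])
        # Follow a predicted-taken branch to its target within the line.
        pos = taken_branches.get(pos, pos + 1)
    return packet

# A taken branch in slot 1 jumps to slot 5; slots 2-4 are collapsed away.
line = [f"insn{i}" for i in range(8)]
print(collapse_fetch(line, fetch_pos=0, taken_branches={1: 5}))
# ['insn0', 'insn1', 'insn5', 'insn6']
```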


International Symposium on Microarchitecture | 1996

Instruction fetch mechanisms for VLIW architectures with compressed encodings

Thomas M. Conte; Sanjeev Banerjia; Sergei Y. Larin; Kishore N. Menezes; Sumedh W. Sathaye

VLIW architectures use very wide instruction words in conjunction with high bandwidth to the instruction cache to achieve multiple instruction issue. This report uses the TINKER experimental testbed to examine instruction fetch and instruction cache mechanisms for VLIWs. A compressed instruction encoding for VLIWs is defined and a classification scheme for i-fetch hardware for such an encoding is introduced. Several interesting cache and i-fetch organizations are described and evaluated through trace-driven simulations. A new i-fetch mechanism using a silo cache is found to have the best performance.
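
One common style of compressed VLIW encoding, sketched hypothetically below (not necessarily the TINKER encoding), stores a per-slot mask word plus only the non-NOP operations; the i-fetch hardware re-expands the word after the cache:

```python
NOP = "nop"

def compress(wide_word):
    """Compress a fixed-width VLIW word into (mask, ops): the mask has one
    bit per slot, and only non-NOP operations are stored."""
    mask, ops = 0, []
    for slot, op in enumerate(wide_word):
        if op != NOP:
            mask |= 1 << slot
            ops.append(op)
    return mask, ops

def expand(mask, ops, width):
    """I-fetch-side decompression: re-insert NOPs into their original slots."""
    out, it = [], iter(ops)
    for slot in range(width):
        out.append(next(it) if mask & (1 << slot) else NOP)
    return out

word = ["add", NOP, NOP, "load", NOP, "br", NOP, NOP]
mask, ops = compress(word)
assert expand(mask, ops, len(word)) == word
print(f"mask={mask:08b}, stored ops={ops}")  # mask=00101001, stored ops=['add', 'load', 'br']
```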


International Symposium on Microarchitecture | 1996

Accurate and practical profile-driven compilation using the profile buffer

Thomas M. Conte; Kishore N. Menezes; Mary Ann Hirsch

Profiling is a technique for gathering program statistics to aid program optimization. In particular, it is an essential component of compiler optimization for the extraction of instruction-level parallelism. Code instrumentation has been the most popular method of profiling. However, real-time, interactive, and transaction-processing applications suffer from the high execution-time overhead imposed by software instrumentation. This paper suggests the use of hardware dedicated to the task of profiling. The proposed hardware consists of a set of counters, the profile buffer. A profile collection method that combines hardware, compiler, and operating system support is described. Three methods for profile buffer indexing (address mapping, selective indexing, and compiler indexing) are presented that allow this approach to produce accurate profiling information with very little execution slowdown. The profile information obtained is applied to a prominent compiler optimization, namely superblock scheduling. The resulting instruction-level parallelism approaches that obtained with perfect profile information.
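
A minimal software model of the profile buffer idea (the counter count and index function below are assumptions, not the paper's parameters): retired branches increment address-mapped counters, and the compiler later reads back each branch's bias:

```python
class ProfileBuffer:
    """Toy model of a hardware profile buffer: a small array of counters.
    Each retired branch increments the counter its address maps to (simple
    address mapping; the paper also describes selective and compiler-directed
    indexing)."""
    def __init__(self, num_counters=64):
        self.taken = [0] * num_counters
        self.executed = [0] * num_counters
        self.size = num_counters

    def record(self, branch_pc, taken):
        idx = (branch_pc >> 2) % self.size     # address-mapped index
        self.executed[idx] += 1
        self.taken[idx] += taken

    def bias(self, branch_pc):
        """Taken probability the compiler would read back for this branch."""
        idx = (branch_pc >> 2) % self.size
        return self.taken[idx] / self.executed[idx] if self.executed[idx] else None

buf = ProfileBuffer()
for _ in range(1000):
    buf.record(0x4010, taken=True)
for _ in range(200):
    buf.record(0x4010, taken=False)
print(f"branch 0x4010 bias: {buf.bias(0x4010):.2f}")   # ~0.83
```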


International Journal of Parallel Programming | 1996

Hardware-based profiling: an effective technique for profile-driven optimization

Thomas M. Conte; Burzin A. Patel; Kishore N. Menezes; J. Stan Cox

Profile-based optimization can be used for instruction scheduling, loop scheduling, data preloading, function inlining, and instruction cache performance enhancement. However, these techniques have not been embraced by software vendors because programs instrumented for profiling run significantly slower, an awkward compile-run-recompile sequence is required, and a test input suite must be collected and validated for each program. This paper introduces hardware-based profiling, which uses traditional branch handling hardware to generate profile information in real time. Techniques are presented for both one-level and two-level branch hardware organizations. The approach produces high accuracy with a small slowdown in execution (0.4%–4.6%). This allows a program to be profiled while it is used, eliminating the need for a test input suite. With contemporary processors driven increasingly by compiler support, hardware-based profiling is important for high-performance systems.
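
One way to picture profiling with existing branch handling hardware (a loose sketch; the paper's actual one-level and two-level techniques differ) is to periodically sample a two-bit saturating counter table and vote on each branch's bias:

```python
import random
from collections import defaultdict

class TwoBitTable:
    """The branch handler's existing 2-bit saturating counters (0..3)."""
    def __init__(self):
        self.counters = defaultdict(lambda: 1)   # weakly not-taken

    def update(self, pc, taken):
        c = self.counters[pc]
        self.counters[pc] = min(3, c + 1) if taken else max(0, c - 1)

def sampled_profile(trace, table, sample_period=500):
    """Every sample_period branches, read the counter states and vote:
    counters >= 2 count as 'mostly taken'. Accumulating these votes
    approximates each branch's bias with no software instrumentation."""
    votes = defaultdict(lambda: [0, 0])           # pc -> [taken votes, samples]
    for i, (pc, taken) in enumerate(trace):
        table.update(pc, taken)
        if i % sample_period == 0:
            for bpc, c in table.counters.items():
                votes[bpc][0] += (c >= 2)
                votes[bpc][1] += 1
    return {pc: t / n for pc, (t, n) in votes.items()}

trace = [(0x100, random.random() < 0.9) for _ in range(10_000)]
print(sampled_profile(trace, TwoBitTable())[0x100])   # close to 1.0
```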


International Symposium on Microarchitecture | 2000

The Intel IA-64 compiler code generator

Jay Bharadwaj; William Y. Chen; Weihaw Chuang; Gerolf F. Hoflehner; Kishore N. Menezes; Kalyan Muthukumar; Jim Pierce

In planning the new EPIC (Explicitly Parallel Instruction Computing) architecture, Intel designers wanted to exploit the high level of instruction-level parallelism (ILP) found in application code. To accomplish this goal, they incorporated a powerful set of features such as control and data speculation, predication, register rotation, loop branches, and a large register file. Through these features, the compiler plays a crucial role in achieving the overall performance of an IA-64 platform. This paper describes the Electron code generator (ECG), the component of Intel's IA-64 production compiler that maximizes the benefits of these features. The ECG consists of multiple phases. The first phase, translation, converts the optimizer's intermediate representation (ILO) of the program into the ECG IR. Predicate region formation, if-conversion, and compare generation occur in the predication phase. The ECG contains two schedulers: the software pipeliner for targeted cyclic regions and the global code scheduler for all remaining regions. Both schedulers make use of control and data speculation. The software pipeliner also uses rotating registers, predication, and loop branches to generate efficient schedules for integer as well as floating-point loops.
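
For intuition about the predication phase, here is a hedged sketch of if-conversion on a toy IR (not the ECG's actual IR): a branch hammock becomes straight-line code guarded by complementary predicates:

```python
# A minimal if-conversion sketch on a toy IR.
# A branch hammock:
#     if (x > 0) { a = b + c; } else { a = b - c; }
# becomes straight-line predicated code:
#     p1, p2 = cmp.gt(x, 0)       # complementary predicates
#     (p1) a = b + c
#     (p2) a = b - c

def if_convert(cond, then_insns, else_insns):
    """Return a single predicated instruction stream: a compare that sets
    complementary predicates p1/p2, then both arms guarded by them."""
    stream = [("cmp", "p1", "p2", cond)]
    stream += [("pred", "p1", insn) for insn in then_insns]
    stream += [("pred", "p2", insn) for insn in else_insns]
    return stream

for insn in if_convert("x > 0", ["a = b + c"], ["a = b - c"]):
    print(insn)
```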


IEEE Transactions on Very Large Scale Integration (VLSI) Systems | 2000

System-level power consumption modeling and tradeoff analysis techniques for superscalar processor design

Thomas M. Conte; Kishore N. Menezes; Sumedh W. Sathaye; Mark C. Toburen

This paper presents systematic techniques to find low-power, high-performance superscalar processors tailored to specific user applications. The model of power is novel because it separates power into architectural and technology components. The architectural component is found via trace-driven simulation, which also produces performance estimates. An example technology model is presented that estimates the technology component, along with critical delay time and real estate usage. This model is based on case studies of actual designs. It is used to solve an important problem: decreasing power consumption in a superscalar processor without greatly impacting performance. Results are presented from runs using simulated annealing to reduce power consumption subject to performance reduction bounds. The major contributions of this paper are the separation of architectural and technology components of dynamic power, the use of trace-driven simulation for architectural power measurement, and the use of a near-optimal search to tailor a processor design to a benchmark.
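
The separation of components can be pictured with a small worked example (all numbers invented): activity counts from trace-driven simulation are combined with per-access energies from a technology model to yield total energy and average power:

```python
# Illustrative numbers only: activity counts would come from trace-driven
# simulation, per-access energies from the technology model.
activity = {"icache": 9.0e8, "dcache": 3.5e8, "alu": 1.2e9, "regfile": 2.4e9}
energy_pj = {"icache": 50.0, "dcache": 60.0, "alu": 8.0, "regfile": 12.0}
runtime_s = 0.5

# Dynamic energy = sum over structures of accesses * energy per access.
total_j = sum(activity[u] * energy_pj[u] * 1e-12 for u in activity)
print(f"dynamic energy: {total_j:.3f} J, average power: {total_j / runtime_s:.3f} W")
```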


International Symposium on Microarchitecture | 1999

Wavefront scheduling: path based data representation and scheduling of subgraphs

Jay Bharadwaj; Kishore N. Menezes; Chris McKinsey

The IA-64 architecture is rich with features that enable aggressive exploitation of instruction-level parallelism. Features such as speculation, predication, multiway branches, and others provide compilers with new opportunities for the extraction of parallelism in programs. Code scheduling is a central component of any compiler for the IA-64 architecture. This paper describes the implementation of the global code scheduler (GCS) in Intel's reference compiler for the IA-64 architecture. GCS schedules code over acyclic regions of control flow, with a tight coupling between the formation and scheduling of regions. GCS employs a new path-based data dependence representation that combines control flow and data dependence information to make dependence analysis simple and accurate; this paper provides details of this representation. The scheduler uses a novel instruction scheduling technique called wavefront scheduling. The concepts of wavefront scheduling and deferred compensation are explained to demonstrate the efficient generation of compensation code during scheduling. The paper also presents P-ready code motion, an opportunistic instruction-level tail duplication that aims to strike a balance between code expansion and performance potential. Performance results show more than 30% speedup for wavefront scheduling over basic block scheduling on the Merced microarchitecture.
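
As a greatly simplified stand-in for the GCS (which schedules whole acyclic regions, handles predication, and generates compensation code), a basic list scheduler over a single block's dependence graph shows the core scheduling loop:

```python
def list_schedule(deps, latency, width=2):
    """deps: insn -> list of insns it depends on. latency: insn -> cycles.
    Returns insn -> issue cycle on an in-order, width-limited machine."""
    remaining = set(deps)
    start = {}
    cycle = 0
    while remaining:
        # An insn is ready when all predecessors have completed by this cycle.
        ready = [i for i in remaining
                 if all(p in start and start[p] + latency[p] <= cycle
                        for p in deps[i])]
        for insn in ready[:width]:         # issue at most `width` per cycle
            start[insn] = cycle
            remaining.discard(insn)
        cycle += 1
    return start

deps = {"ld1": [], "ld2": [], "add": ["ld1", "ld2"], "st": ["add"]}
lat = {"ld1": 2, "ld2": 2, "add": 1, "st": 1}
print(list_schedule(deps, lat))   # e.g. {'ld1': 0, 'ld2': 0, 'add': 2, 'st': 3}
```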


International Conference on Parallel Architectures and Compilation Techniques | 1998

A fast interrupt handling scheme for VLIW processors

Emre Özer; Sumedh W. Sathaye; Kishore N. Menezes; Sanjeev Banerjia; Matthew D. Jennings; Thomas M. Conte

Interrupt handling in out-of-order execution processors requires complex hardware to maintain the sequential state, and the amount of hardware is substantial in VLIW architectures because a very large number of instructions issue in each cycle. Precise interrupts are therefore hard to implement in out-of-order execution machines, especially VLIW processors. In this paper, we apply the reorder buffer with future file and the history buffer methods to a VLIW platform, and present a novel scheme, called the current-state buffer, which employs modest hardware with compiler support. Unlike the other interrupt handling schemes, the current-state buffer keeps no history state, result buffering, or bypass mechanisms. It is a fast interrupt handling scheme with a relatively small buffer that records the execution and exception status of operations, making it suitable for embedded processors that require fast interrupt handling with modest hardware.
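
A toy model of the current-state buffer idea (field names and interface invented for illustration): for each in-flight operation it records only execution and exception status, with no result buffering or history state:

```python
from dataclasses import dataclass, field

@dataclass
class CurrentStateBuffer:
    """Toy current-state buffer: per in-flight VLIW operation it records only
    execution/exception status (no result buffering and no history state,
    unlike a reorder buffer or history buffer)."""
    executed: dict = field(default_factory=dict)   # op id -> bool
    faulted: dict = field(default_factory=dict)    # op id -> bool

    def issue(self, op_id):
        self.executed[op_id] = False
        self.faulted[op_id] = False

    def complete(self, op_id, fault=False):
        self.executed[op_id] = True
        self.faulted[op_id] = fault

    def interrupt_snapshot(self):
        """What the handler needs to restart correctly: which operations of
        the current wide instruction finished, and which one faulted."""
        return {op: ("faulted" if self.faulted[op]
                     else "done" if self.executed[op] else "pending")
                for op in self.executed}

csb = CurrentStateBuffer()
for op in ("mul0", "ld1", "add2"):
    csb.issue(op)
csb.complete("mul0")
csb.complete("ld1", fault=True)        # e.g. a page fault in slot 1
print(csb.interrupt_snapshot())
# {'mul0': 'done', 'ld1': 'faulted', 'add2': 'pending'}
```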


Hawaii International Conference on System Sciences | 1995

A technique to determine power-efficient, high-performance superscalar processors

Thomas M. Conte; Kishore N. Menezes; Sumedh W. Sathaye

Processor performance advances are increasingly inhibited by limitations in thermal power dissipation. Part of the problem is the lack of architectural power estimates before implementation. Although high-performance designs exist that dissipate low power, the method for finding these designs has been through trial-and-error. The paper presents systematic techniques to find low-power, high-performance superscalar processors tailored to specific user benchmarks. The model of power is novel because it separates power into architectural and technology components. The architectural component is found via trace-driven simulation, which also produces performance estimates. An example technology model is presented that estimates the technology component, along with critical delay time and real estate usage. This model is based on case studies of actual designs. It is used to solve an important problem: increasing the duplication in superscalar execution units without excessive power consumption. Results are presented from runs using simulated annealing to maximize processor performance subject to power and area constraints. The major contributions of the paper are the separation of architectural and technology components of dynamic power, the use of trace-driven simulation for architectural power measurement, and the use of a near-optimal search to tailor a processor design to a benchmark.
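
The search procedure can be sketched as follows (the objective, power model, and unit costs are all invented stand-ins for the paper's trace-driven simulation and technology model): simulated annealing varies execution-unit duplication counts, maximizing a performance estimate subject to a power budget:

```python
import math, random

# Invented stand-ins: real values would come from trace-driven simulation
# (performance) and the technology model (power).
def performance(cfg):                  # diminishing returns per extra unit
    return sum(math.sqrt(n) for n in cfg.values())

def power(cfg):                        # each duplicated unit costs power
    cost = {"alu": 1.0, "mul": 2.5, "lsu": 1.8}
    return sum(cost[u] * n for u, n in cfg.items())

def anneal(budget=12.0, steps=20_000, t0=2.0):
    cfg = {"alu": 1, "mul": 1, "lsu": 1}
    best = dict(cfg)
    for step in range(steps):
        t = t0 * (1 - step / steps) + 1e-6        # cooling schedule
        cand = dict(cfg)
        unit = random.choice(list(cand))
        cand[unit] = max(1, cand[unit] + random.choice([-1, 1]))
        if power(cand) > budget:
            continue                    # reject configs over the power budget
        delta = performance(cand) - performance(cfg)
        # Accept improvements always; accept regressions with temperature-
        # dependent probability so the search can escape local optima.
        if delta > 0 or random.random() < math.exp(delta / t):
            cfg = cand
            if performance(cfg) > performance(best):
                best = dict(cfg)
    return best, performance(best), power(best)

print(anneal())
```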

Collaboration


Dive into Kishore N. Menezes's collaboration.

Top Co-Authors

Thomas M. Conte
Georgia Institute of Technology

Burzin A. Patel
University of South Carolina

Mary Ann Hirsch
North Carolina State University