Sumedh W. Sathaye | Researchain

Publication

Featured research published by Sumedh W. Sathaye.

IEEE Transactions on Computers | 2001

Dynamic binary translation and optimization

Kemal Ebcioglu; Erik R. Altman; Michael Karl Gschwind; Sumedh W. Sathaye

We describe a VLIW architecture designed specifically as a target for dynamic compilation of an existing instruction set architecture. This design approach offers the simplicity and high performance of statically scheduled architectures, achieves compatibility with an established architecture, and makes use of dynamic adaptation. Thus, the original architecture is implemented using dynamic compilation, a process we refer to as DAISY (Dynamically Architected Instruction Set from Yorktown). The dynamic compiler exploits runtime profile information to optimize translations so as to extract instruction level parallelism. This paper reports different design trade-offs in the DAISY system and their impact on final system performance. The results show high degrees of instruction parallelism with reasonable translation overhead and memory usage.

IEEE Computer | 2000

Dynamic and transparent binary translation

Michael Karl Gschwind; Erik R. Altman; Sumedh W. Sathaye; Paul Ledak; David Appenzeller

High-frequency design and instruction-level parallelism (ILP) are important for high-performance microprocessor implementations. The Binary-translation Optimized Architecture (BOA), an implementation of the IBM PowerPC family, combines binary translation with dynamic optimization. The authors use these techniques to simplify the hardware by bridging a semantic gap between the PowerPC's reduced instruction set and even simpler hardware primitives. Processors like the Pentium Pro and Power4 have tried to achieve high frequency and ILP by implementing a cracking scheme in hardware: an instruction decoder in the pipeline generates multiple micro-operations that can then be scheduled out of order. BOA relies on an alternative software approach to decompose complex operations and to generate schedules, and thus offers significant advantages over purely static compilation approaches. This article explains BOA's translation strategy, detailing system issues and architecture implementation.

International Symposium on Microarchitecture | 1996

Instruction fetch mechanisms for VLIW architectures with compressed encodings

Thomas M. Conte; Sanjeev Banerjia; Sergei Y. Larin; Kishore N. Menezes; Sumedh W. Sathaye

VLIW architectures use very wide instruction words in conjunction with high bandwidth to the instruction cache to achieve multiple instruction issue. This report uses the TINKER experimental testbed to examine instruction fetch and instruction cache mechanisms for VLIWs. A compressed instruction encoding for VLIWs is defined and a classification scheme for i-fetch hardware for such an encoding is introduced. Several interesting cache and i-fetch organizations are described and evaluated through trace-driven simulations. A new i-fetch mechanism using a silo cache is found to have the best performance.

International Symposium on Microarchitecture | 1999

Optimizations and oracle parallelism with dynamic translation

Kemal Ebcioglu; Erik R. Altman; Michael Karl Gschwind; Sumedh W. Sathaye

We describe several optimizations which can be employed in a dynamic binary translation (DBT) system, where low compilation/translation overhead is essential. These optimizations achieve a high degree of ILP, sometimes even surpassing a static compiler employing more sophisticated, and more time-consuming algorithms. We present results in which we employ these optimizations in a dynamic binary translation system capable of computing oracle parallelism.

Proceedings of the IEEE | 2001

Advances and future challenges in binary translation and optimization

Erik R. Altman; Kemal Ebcioglu; Michael Karl Gschwind; Sumedh W. Sathaye

Binary translation and optimization have achieved a high profile in recent years. Binary translation has several potential attractions. While still in its early stages, could binary translation offer a new way to design processors, i.e. is it a disruptive technology? This paper discusses this question, examines some future possibilities for binary translation, and then gives an overview of selected projects (DAISY, Crusoe, Dynamo and LaTTe). One future possibility for binary translation is the Virtual IT Shop. Binary translation offers a possible solution for better utilization of computational resources as services over the World Wide Web. The Internet is radically changing the software landscape, and is fostering platform independence and interoperability. Along the lines of software convergence, recent advances in binary JIT (just-in-time) optimizations also present the future possibility of a convergence virtual machine (CVM). CVM aims to address research challenges in allowing the same standard operating system and application object code to run on different hardware platforms, through state-of-the-art JIT compilation and virtual device emulation.

IEEE Transactions on Very Large Scale Integration Systems | 2000

System-level power consumption modeling and tradeoff analysis techniques for superscalar processor design

Thomas M. Conte; Kishore N. Menezes; Sumedh W. Sathaye; Mark C. Toburen

This paper presents systematic techniques to find low-power high-performance superscalar processors tailored to specific user applications. The model of power is novel because it separates power into architectural and technology components. The architectural component is found via trace-driven simulation, which also produces performance estimates. An example technology model is presented that estimates the technology component, along with critical delay time and real estate usage. This model is based on case studies of actual designs. It is used to solve an important problem: decreasing power consumption in a superscalar processor without greatly impacting performance. Results are presented from runs using simulated annealing to reduce power consumption subject to performance reduction bounds. The major contributions of this paper are the separation of architectural and technology components of dynamic power, the use of trace-driven simulation for architectural power measurement, and the use of a near-optimal search to tailor a processor design to a benchmark.
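The search described in this abstract, simulated annealing that lowers power while keeping performance within a bound, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the design space (`UNIT_RANGES`), the toy `power` and `performance` models, the linear cooling schedule, and the 10% performance-loss bound. The paper itself derives its estimates from trace-driven simulation and a technology model.

```python
import math
import random

# Hypothetical design space: allowed count of each execution unit type.
UNIT_RANGES = {"alu": (1, 8), "fpu": (1, 4), "load_store": (1, 4)}

def power(cfg):
    # Toy model (assumption): power grows linearly with unit count.
    weights = {"alu": 1.0, "fpu": 2.5, "load_store": 1.5}
    return sum(weights[unit] * n for unit, n in cfg.items())

def performance(cfg):
    # Toy model (assumption): diminishing returns for each added unit.
    return sum(math.log2(1 + n) for n in cfg.values())

def neighbor(cfg, rng):
    # Move one unit count up or down by one, clamped to its range.
    new = dict(cfg)
    unit = rng.choice(sorted(new))
    lo, hi = UNIT_RANGES[unit]
    new[unit] = min(hi, max(lo, new[unit] + rng.choice((-1, 1))))
    return new

def anneal(baseline, max_perf_loss=0.10, steps=5000, t0=5.0, seed=0):
    """Minimize power while keeping performance above a floor."""
    rng = random.Random(seed)
    floor = (1.0 - max_perf_loss) * performance(baseline)
    current = dict(baseline)
    best = dict(baseline)
    for step in range(steps):
        temp = t0 * (1.0 - step / steps) + 1e-9  # linear cooling
        cand = neighbor(current, rng)
        if performance(cand) < floor:
            continue  # reject designs that violate the performance bound
        delta = power(cand) - power(current)
        # Always accept improvements; accept worse moves with a
        # probability that shrinks as the temperature drops.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current = cand
            if power(current) < power(best):
                best = dict(current)
    return best
```

Calling `anneal({"alu": 8, "fpu": 4, "load_store": 4})` returns a configuration whose toy power is no higher than the baseline's and whose toy performance stays within the bound. Rejecting infeasible candidates outright is only one way to handle the constraint; a penalty term in the objective is a common alternative.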

European Conference on Parallel Processing | 1999

Execution-Based Scheduling for VLIW Architectures

Kemal Ebcioglu; Erik R. Altman; Sumedh W. Sathaye; Michael Karl Gschwind

We describe a new dynamic software scheduling technique for VLIW architectures, which compiles into VLIW code the program paths that are actually executed. Unlike trace processors, or DIF, the technique executes operations speculatively on multiple paths through the code, is resilient to branch mispredictions, and can achieve very large dynamic window sizes necessary for high ILP. Aggressive optimizations are applied to frequently executed portions of the code. Encouraging performance results were obtained on SPECint95 and TPC-C. The technique can be used for binary translation for achieving architectural compatibility with an existing processor, or as a VLIW scheduling technique in its own right.

International Conference on Parallel Architectures and Compilation Techniques | 1998

A fast interrupt handling scheme for VLIW processors

Emre Özer; Sumedh W. Sathaye; Kishore N. Menezes; Sanjeev Banerjia; Matthew D. Jennings; Thomas M. Conte

Interrupt handling in out-of-order execution processors requires complex hardware schemes to maintain the sequential state. The amount of hardware will be substantial in VLIW architectures due to the nature of issuing a very large number of instructions in each cycle. It is hard to implement precise interrupts in out-of-order execution machines, especially in VLIW processors. In this paper, we will apply the reorder buffer with future file and the history buffer methods to a VLIW platform, and present a novel scheme, called the current-state buffer, which employs modest hardware with compiler support. Unlike the other interrupt handling schemes, the current-state buffer does not keep history state, result buffering or bypass mechanisms. It is a fast interrupt handling scheme with a relatively small buffer that records the execution and exception status of operations. It is suitable for embedded processors that require a fast interrupt handling mechanism with modest hardware.

International Conference on Supercomputing | 2000

Binary translation and architecture convergence issues for IBM System/390

Michael Karl Gschwind; Kemal Ebcioglu; Erik R. Altman; Sumedh W. Sathaye

We describe the design issues in an implementation of the ESA/390 architecture based on binary translation to a very long instruction word (VLIW) processor. During binary translation, complex ESA/390 instructions are decomposed into instruction “primitives” which are then scheduled onto a wide-issue machine. The aim is to achieve high instruction level parallelism due to the increased scheduling and optimization opportunities which can be exploited by binary translation software, combined with the efficiency of long instruction word architectures. A further aim is to study the feasibility of a common execution platform for different instruction set architectures, such as ESA/390, RS/6000, AS/400 and the Java Virtual Machine, so that multiple systems can be built around a common execution platform.

Hawaii International Conference on System Sciences | 1995

A technique to determine power-efficient, high-performance superscalar processors

Thomas M. Conte; Kishore N. Menezes; Sumedh W. Sathaye

Processor performance advances are increasingly inhibited by limitations in thermal power dissipation. Part of the problem is the lack of architectural power estimates before implementation. Although high-performance designs exist that dissipate low power, the method for finding these designs has been through trial-and-error. The paper presents systematic techniques to find low-power, high-performance superscalar processors tailored to specific user benchmarks. The model of power is novel because it separates power into architectural and technology components. The architectural component is found via trace-driven simulation, which also produces performance estimates. An example technology model is presented that estimates the technology component, along with critical delay time and real estate usage. This model is based on case studies of actual designs. It is used to solve an important problem: increasing the duplication in superscalar execution units without excessive power consumption. Results are presented from runs using simulated annealing to maximize processor performance subject to power and area constraints. The major contributions of the paper are the separation of architectural and technology components of dynamic power, the use of trace-driven simulation for architectural power measurement, and the use of a near-optimal search to tailor a processor design to a benchmark.
