
Publication


Featured research published by Erik R. Altman.


international symposium on computer architecture | 1997

DAISY: Dynamic Compilation for 100% Architectural Compatibility

Kemal Ebcioglu; Erik R. Altman

Although VLIW architectures offer the advantages of simplicity of design and high issue rates, a major impediment to their use is that they are not compatible with the existing software base. We describe new simple hardware features for a VLIW machine we call DAISY (Dynamically Architected Instruction Set from Yorktown). DAISY is specifically intended to emulate existing architectures, so that all existing software for an old architecture (including operating system kernel code) runs without changes on the VLIW. Each time a new fragment of code is executed for the first time, the code is translated to VLIW primitives, parallelized and saved in a portion of main memory not visible to the old architecture, by a Virtual Machine Monitor (software) residing in read only memory. Subsequent executions of the same fragment do not require a translation (unless cast out). We discuss the architectural requirements for such a VLIW, to deal with issues including self-modifying code, precise exceptions, and aggressive reordering of memory references in the presence of strong MP consistency and memory mapped I/O. We have implemented the dynamic parallelization algorithms for the PowerPC architecture. The initial results show high degrees of instruction level parallelism with reasonable translation overhead and memory usage.
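The translate-once, cache-thereafter flow the abstract describes can be sketched as follows. This is a minimal illustration, not DAISY's implementation: the names (`translate_fragment`, `VLIW_CACHE`) and the string stand-in for translated code are invented; in DAISY the cache lives in a memory region invisible to the old architecture and the translator is the Virtual Machine Monitor.

```python
# Hypothetical sketch of translate-on-first-execution with a translation cache.
VLIW_CACHE = {}      # stand-in for the hidden memory region: entry address -> code
TRANSLATIONS = 0     # counts how often the (expensive) translator actually runs

def translate_fragment(entry_addr):
    """Stand-in for the VMM translator that parallelizes old-ISA code into VLIW primitives."""
    global TRANSLATIONS
    TRANSLATIONS += 1
    return f"vliw-code@{entry_addr:#x}"

def execute(entry_addr):
    """Translate a fragment on its first execution; later executions hit the cache."""
    if entry_addr not in VLIW_CACHE:          # first time, or after a cast-out
        VLIW_CACHE[entry_addr] = translate_fragment(entry_addr)
    return VLIW_CACHE[entry_addr]

# Re-executing the same fragment does not re-translate:
execute(0x1000); execute(0x1000); execute(0x2000)
```

A cast-out (eviction) would simply delete a cache entry, forcing re-translation on the next execution, which matches the "unless cast out" caveat above.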


IEEE Transactions on Computers | 2001

Dynamic binary translation and optimization

Kemal Ebcioglu; Erik R. Altman; Michael Karl Gschwind; Sumedh W. Sathaye

We describe a VLIW architecture designed specifically as a target for dynamic compilation of an existing instruction set architecture. This design approach offers the simplicity and high performance of statically scheduled architectures, achieves compatibility with an established architecture, and makes use of dynamic adaptation. Thus, the original architecture is implemented using dynamic compilation, a process we refer to as DAISY (Dynamically Architected Instruction Set from Yorktown). The dynamic compiler exploits runtime profile information to optimize translations so as to extract instruction level parallelism. This paper reports different design trade-offs in the DAISY system and their impact on final system performance. The results show high degrees of instruction parallelism with reasonable translation overhead and memory usage.


IEEE Computer | 2000

Dynamic and transparent binary translation

Michael Karl Gschwind; Erik R. Altman; Sumedh W. Sathaye; Paul Ledak; David Appenzeller

High-frequency design and instruction-level parallelism (ILP) are important for high-performance microprocessor implementations. The Binary-translation Optimized Architecture (BOA), an implementation of the IBM PowerPC family, combines binary translation with dynamic optimization. The authors use these techniques to simplify the hardware by bridging a semantic gap between the PowerPC's reduced instruction set and even simpler hardware primitives. Processors like the Pentium Pro and Power4 have tried to achieve high frequency and ILP by implementing a cracking scheme in hardware: an instruction decoder in the pipeline generates multiple micro-operations that can then be scheduled out of order. BOA relies on an alternative software approach to decompose complex operations and to generate schedules, and thus offers significant advantages over purely static compilation approaches. This article explains BOA's translation strategy, detailing system issues and architecture implementation.


international conference on parallel architectures and compilation techniques | 1999

LaTTe: a Java VM just-in-time compiler with fast and efficient register allocation

Byung-Sun Yang; Soo-Mook Moon; Seong-Bae Park; Junpyo Lee; Seungil Lee; Jinpyo Park; Yoo C. Chung; Suhyun Kim; Kemal Ebcioglu; Erik R. Altman

For network computing on desktop machines, fast execution of Java bytecode programs is essential because these machines are expected to run substantial application programs written in Java. Higher Java performance can be achieved by just-in-time (JIT) compilers which translate the stack-based bytecode into register-based machine code on demand. One crucial problem in Java JIT compilation is how to map and allocate stack entries and local variables into registers efficiently and quickly, so as to improve the Java performance. This paper introduces LaTTe, a Java JIT compiler that performs fast and efficient register mapping and allocation for RISC machines. LaTTe first translates the bytecode into pseudo RISC code with symbolic registers, which is then register allocated while coalescing those copies corresponding to pushes and pops between local variables and the stack. The LaTTe JVM also includes an enhanced object model, a lightweight monitor, a fast mark-and-sweep garbage collector, and an on-demand exception handling mechanism, all of which are closely coordinated with LaTTe's JIT compilation.
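The copy-coalescing idea above can be illustrated with a toy translator. This is not LaTTe's actual algorithm, just a sketch of the principle: pushing a local does not emit a copy (the stack slot simply aliases the local's register), and a store into a local rewrites the defining instruction's destination instead of emitting a move.

```python
# Toy bytecode-to-register translation with push/pop copy coalescing (illustrative).
def translate(bytecode):
    stack, code, temp = [], [], 0
    for op, *args in bytecode:
        if op == "iload":                 # push local: coalesced, no copy emitted
            stack.append(f"l{args[0]}")
        elif op == "iadd":                # pop two operands, push a fresh temporary
            b, a = stack.pop(), stack.pop()
            dst = f"t{temp}"; temp += 1
            code.append(f"add {dst}, {a}, {b}")
            stack.append(dst)
        elif op == "istore":              # pop into local
            src = stack.pop()
            if code and code[-1].startswith(f"add {src}"):
                # coalesce: retarget the defining instruction instead of copying
                code[-1] = code[-1].replace(src, f"l{args[0]}", 1)
            else:
                code.append(f"mov l{args[0]}, {src}")
    return code

# "iload 1; iload 2; iadd; istore 3" becomes a single add, with no stack traffic
print(translate([("iload", 1), ("iload", 2), ("iadd",), ("istore", 3)]))
```

A naive translator would emit two loads, an add into a temporary, and a store; coalescing collapses all of that into one register-to-register instruction.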


IEEE Computer | 2000

Welcome to the opportunities of binary translation

Erik R. Altman; David R. Kaeli; Yaron Sheffer

A new processor architecture poses significant financial risk to hardware and software developers alike, so both have a vested interest in easily porting code from one processor to another. Binary translation offers solutions for automatically converting executable code to run on new architectures without recompiling the source code.


international symposium on microarchitecture | 1994

Minimizing register requirements under resource-constrained rate-optimal software pipelining

R. Govindarajan; Erik R. Altman; Guang R. Gao

In this paper we address the following software pipelining problem: given a loop and a machine architecture with a fixed number of processor resources (e.g. function units), how can one construct a software-pipelined schedule which runs on the given architecture at the maximum possible iteration rate (i.e., rate-optimal) while minimizing the number of registers? The main contributions of this paper are: First, we demonstrate that such a problem can be described by a simple mathematical formulation with precise optimization objectives under a periodic linear scheduling framework. The mathematical formulation provides a clear picture which permits one to visualize the overall solution space (for rate-optimal schedules) under different sets of constraints. Secondly, we show that a precise mathematical formulation and its solution does make a significant performance difference. We evaluated the performance of our method against three other leading contemporary heuristic methods. Experimental results show that the method described in this paper performed significantly better than these methods.
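The "maximum possible iteration rate" in this line of work is bounded by the minimum initiation interval (MII): the larger of a resource bound (function-unit usage per iteration) and a recurrence bound (dependence cycles, latency over loop-carried distance). The sketch below computes MII on made-up inputs; the numbers are illustrative, not from the paper.

```python
# Illustrative minimum-initiation-interval (MII) computation for software pipelining.
from math import ceil

def resource_mii(uses, units):
    # uses: operations per function-unit type per iteration; units: available count
    return max(ceil(uses[k] / units[k]) for k in uses)

def recurrence_mii(cycles):
    # each dependence cycle: (total latency along the cycle, total loop-carried distance)
    return max(ceil(lat / dist) for lat, dist in cycles)

uses   = {"alu": 6, "mem": 4}        # hypothetical loop body demands
units  = {"alu": 2, "mem": 2}        # hypothetical machine resources
cycles = [(4, 1), (6, 3)]            # e.g. a 4-cycle dependence with distance 1
mii = max(resource_mii(uses, units), recurrence_mii(cycles))
print(mii)  # a rate-optimal schedule initiates one iteration every MII cycles
```

Here the recurrence bound (4) dominates the resource bound (3), so no schedule on this machine can start iterations more often than every 4 cycles; the paper's formulation then minimizes register pressure among schedules achieving that rate.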


compiler construction | 1992

A Register Allocation Framework Based on Hierarchical Cyclic Interval Graphs

Laurie J. Hendren; Guang R. Gao; Erik R. Altman; Chandrika Mukerji

In this paper, we present a new register allocation framework based on hierarchical cyclic interval graphs. We motivate our approach by demonstrating that cyclic interval graphs provide a feasible and effective representation to characterize sequences of live ranges of variables in successive iterations of a loop. Based on this representation we provide a new heuristic algorithm for minimum register allocation, the fat cover algorithm. In addition, we present a spilling algorithm that makes use of the extra information available in the interval graph representation. Whenever possible, it favors register floats (moving values from one register to another) over the traditional register spills (storing a spilled variable into memory).
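A key quantity behind the cyclic-interval view is the maximum number of intervals overlapping at any point in the loop body, which lower-bounds the registers needed; intervals whose end precedes their start model values live across the loop's back edge. The sketch below computes that width on invented intervals and is only an illustration of the representation, not the paper's fat cover algorithm.

```python
# Maximum overlap ("width") of cyclic live intervals over one loop period (illustrative).
def max_width(intervals, period):
    # intervals: (start, end) cycles in [0, period), endpoints inclusive;
    # end < start means the value stays live across the loop back edge.
    live = [0] * period
    for s, e in intervals:
        t = s
        while True:
            live[t] += 1
            if t == e:
                break
            t = (t + 1) % period
    return max(live)

# three values in a 4-cycle loop; the last interval wraps around the back edge
print(max_width([(0, 2), (1, 3), (3, 1)], period=4))
```

A heuristic allocator over this representation tries to pack intervals into that many registers; when it cannot, spilling (or the cheaper register floats mentioned above) reduces the width.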


IEEE Transactions on Parallel and Distributed Systems | 1996

A framework for resource-constrained rate-optimal software pipelining

R. Govindarajan; Erik R. Altman; Guang R. Gao

The rapid advances in high-performance computer architecture and compilation techniques provide both challenges and opportunities to exploit the rich solution space of software pipelined loop schedules. In this paper, we develop a framework to construct a software pipelined loop schedule which runs on the given architecture (with a fixed number of processor resources) at the maximum possible iteration rate (i.e., rate-optimal) while minimizing the number of buffers-a close approximation to minimizing the number of registers. The main contributions of this paper are: First, we demonstrate that such a problem can be described by a simple mathematical formulation with precise optimization objectives under a periodic linear scheduling framework. The mathematical formulation provides a clear picture which permits one to visualize the overall solution space (for rate-optimal schedules) under different sets of constraints. Secondly, we show that a precise mathematical formulation and its solution does make a significant performance difference. We evaluated the performance of our method against three leading contemporary heuristic methods. Experimental results show that the method described in this paper performed significantly better than these methods. The techniques proposed in this paper are useful in two different ways: 1) As a compiler option which can be used in generating faster schedules for performance-critical loops (if the interested users are willing to trade longer compile time for faster runtime). 2) As a framework for compiler writers to evaluate and improve other heuristics-based approaches by providing quantitative information as to where and how much their heuristic methods could be further improved.


conference on object-oriented programming systems, languages, and applications | 2010

Performance analysis of idle programs

Erik R. Altman; Matthew Arnold; Stephen J. Fink; Nick Mitchell

This paper presents an approach for performance analysis of modern enterprise-class server applications. In our experience, performance bottlenecks in these applications differ qualitatively from bottlenecks in smaller, stand-alone systems. Small applications and benchmarks often suffer from CPU-intensive hot spots. In contrast, enterprise-class multi-tier applications often suffer from problems that manifest not as hot spots, but as idle time, indicating a lack of forward motion. Many factors can contribute to undesirable idle time, including locking problems, excessive system-level activities like garbage collection, various resource constraints, and problems driving load. We present the design and methodology for WAIT, a tool to diagnose the root cause of idle time in server applications. Given lightweight samples of Java activity on a single tier, the tool can often pinpoint the primary bottleneck on a multi-tier system. The methodology centers on an informative abstraction of the states of idleness observed in a running program. This abstraction allows the tool to distinguish, for example, between hold-ups on a database machine, insufficient load, lock contention in application code, and a conventional bottleneck due to a hot method. To compute the abstraction, we present a simple expert system based on an extensible set of declarative rules. WAIT can be deployed on the fly, without modifying or even restarting the application. Many groups in IBM have applied the tool to diagnose performance problems in commercial systems, and we present a number of examples as case studies.
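The "extensible set of declarative rules" can be pictured as an ordered rule list over sampled thread states. The rules, state names, and frame strings below are invented for illustration; they only convey the shape of such an expert system, not WAIT's actual rule base.

```python
# Toy rule-based classifier for sampled thread states (illustrative, not WAIT's rules).
RULES = [  # (predicate on a sample, diagnosis) -- first matching rule wins
    (lambda s: s["state"] == "RUNNABLE" and "jdbc" in s["top_frame"],
     "waiting on database"),
    (lambda s: s["state"] == "BLOCKED",
     "lock contention in application code"),
    (lambda s: s["state"] == "WAITING" and "ThreadPool" in s["top_frame"],
     "idle worker: insufficient load"),
    (lambda s: s["state"] == "RUNNABLE",
     "CPU-bound hot method"),
]

def diagnose(sample):
    """Map one thread sample to an idleness category via the first matching rule."""
    for predicate, diagnosis in RULES:
        if predicate(sample):
            return diagnosis
    return "unclassified"

print(diagnose({"state": "RUNNABLE", "top_frame": "jdbc.Statement.execute"}))
print(diagnose({"state": "WAITING", "top_frame": "ThreadPoolExecutor.getTask"}))
```

Because the rule list is ordered and declarative, adding a new diagnosis is just prepending a new (predicate, label) pair, which matches the paper's emphasis on extensibility.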


international conference on computer design | 1998

An eight-issue tree-VLIW processor for dynamic binary translation

Kemal Ebcioglu; Jason E. Fritts; Stephen V. Kosonocky; Michael Karl Gschwind; Erik R. Altman; Krishnan K. Kailas; Terry Bright

Presented is an 8-issue tree-VLIW processor designed for efficient support of dynamic binary translation. This processor confronts two primary problems faced by VLIW architectures: binary compatibility and branch performance. Binary compatibility with existing architectures is achieved through dynamic binary translation which translates and schedules PowerPC instructions to take advantage of the available instruction level parallelism. Efficient branch performance is achieved through tree instructions that support multi-way path and branch selection within a single VLIW instruction. The processor architecture is described, along with design details of the branch unit, pipeline, register file and memory hierarchy for a 0.25 micron standard-cell design. Performance simulations show that the simplicity of a VLIW architecture allows a wide-issue processor to operate at high frequencies.
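The multi-way selection a tree instruction provides can be sketched abstractly: one VLIW word encodes a small decision tree over condition bits, so several dependent branches resolve in a single issue. The encoding below is invented for illustration and says nothing about the processor's actual instruction format.

```python
# Abstract model of multi-way path selection in a tree instruction (illustrative).
def select_path(tree, conds):
    """Walk a decision tree of condition tests; the leaf names the branch target taken."""
    node = tree
    while isinstance(node, tuple):      # internal node: (condition, if_true, if_false)
        cond, if_true, if_false = node
        node = if_true if conds[cond] else if_false
    return node                         # leaf: target label

# a three-way branch resolved by one tree instruction
tree = ("c0", "L_then", ("c1", "L_elif", "L_else"))
print(select_path(tree, {"c0": False, "c1": True}))
```

In hardware, all condition bits are tested in parallel within the cycle; the sequential walk here is only a functional model of which path is selected.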

Collaboration


Dive into Erik R. Altman's collaborations.
