Yoav Almog | Researchain

Publication

Featured researches published by Yoav Almog.

Symposium on Code Generation and Optimization | 2004

Specialized dynamic optimizations for high-performance energy-efficient microarchitecture

Yoav Almog; Roni Rosner; Naftali Schwartz; Ari Schmorak

We study several major characteristics of dynamic optimization within the PARROT power-aware, trace-cache-based microarchitectural framework. We investigate the benefit of providing optimizations which, although tightly coupled with the microarchitecture in substance, are decoupled in time. The tight coupling in substance provides the potential for tailoring optimizations to the microarchitecture in a manner impossible or impractical not only for traditional static compilers but even for a JIT. We show that the contribution of common, generic optimizations to processor performance and energy efficiency may be more than doubled by creating a more intimate correlation between hardware specifics and the optimizer. In particular, dynamic optimizations can profit greatly from hardware supporting fused and SIMDified operations. At the same time, the decoupling in time allows optimizations to be arbitrarily aggressive without significant performance loss. We demonstrate that requiring up to 512 repetitions before a trace is optimized sacrifices almost no performance or efficiency compared with lower thresholds. These results confirm the feasibility of an energy-efficient hardware implementation of an aggressive optimizer.
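The repetition threshold mentioned in this abstract can be illustrated with a minimal sketch. It assumes a simple per-trace execution counter; the class name `HotTraceSelector`, the method `record_execution`, and the trace identifiers are hypothetical and are not taken from the paper, which does not describe its selection hardware here.

```python
from collections import Counter


class HotTraceSelector:
    """Counts executions per trace and flags a trace once it reaches a threshold.

    Hypothetical illustration only: the counter-based policy is an assumption.
    """

    def __init__(self, threshold=512):
        self.threshold = threshold
        self.counts = Counter()   # trace id -> executions seen so far
        self.optimized = set()    # traces already handed to the optimizer

    def record_execution(self, trace_id):
        """Return True only on the execution that reaches the threshold."""
        if trace_id in self.optimized:
            return False
        self.counts[trace_id] += 1
        if self.counts[trace_id] >= self.threshold:
            self.optimized.add(trace_id)
            return True
        return False


selector = HotTraceSelector(threshold=512)
triggers = [selector.record_execution("trace_A") for _ in range(1000)]
cold = selector.record_execution("trace_B")  # seen once, stays unoptimized
```

In this sketch the optimizer is triggered once per hot trace, on its 512th execution, while rarely executed traces never reach it.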

International Symposium on Computer Architecture | 2004

Power Awareness through Selective Dynamically Optimized Traces

Roni Rosner; Yoav Almog; Micha Moffie; Naftali Schwartz; Avi Mendelson

We present the PARROT concept that seeks to achieve higher performance with reduced energy consumption through gradual optimization of frequently executed code traces. The PARROT microarchitectural framework integrates trace caching, dynamic optimizations and pipeline decoupling. We employ a selective approach, applying complex mechanisms only to the most frequently used traces to maximize the performance gain at any given power constraint, thus attaining finer control of tradeoffs between performance and power awareness. We show that the PARROT-based microarchitecture can improve the performance of aggressively designed processors by providing the means to improve the utilization of their more elaborate resources. At the same time, rigorous selection of traces prior to storage and optimization provides the key to attenuating increases in the power budget. For resource-constrained designs, PARROT-based architectures deliver better performance (up to an average 16% increase in IPC) at a comparable energy level, whereas the conventional path to a similar performance improvement consumes an average 70% more energy. Meanwhile, for those designs which can tolerate a higher power budget, PARROT gracefully scales up to use additional execution resources in a uniformly efficient manner. In particular, a PARROT-style doubly-wide machine delivers an average 45% IPC improvement while actually improving the cubic-MIPS-per-WATT power awareness metric by over 50%.
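The closing figures can be related with a short arithmetic sketch. It assumes the metric is MIPS cubed divided by watts, as its name suggests, and that MIPS scales with IPC at a fixed clock frequency; neither assumption is stated in the abstract, and the function names are hypothetical.

```python
def cubic_mips_per_watt(mips, watts):
    """Assumed form of the metric: performance cubed per unit of power."""
    return mips ** 3 / watts


def max_power_ratio(perf_ratio, metric_ratio):
    """Largest power ratio that still yields the given metric improvement.

    From metric_ratio = perf_ratio**3 / power_ratio, solved for power_ratio.
    """
    return perf_ratio ** 3 / metric_ratio


# 45% IPC improvement (ratio 1.45) with the metric improved by 50% (ratio 1.5).
bound = max_power_ratio(1.45, 1.5)
```

Under these assumptions, a 45% performance gain with a metric improvement of over 50% implies that power grew by less than roughly 2.03x.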

IEEE Transactions on Very Large Scale Integration Systems | 2003

Micro-operation cache: a power aware frontend for variable instruction length ISA

Baruch Solomon; Avi Mendelson; Ronny Ronen; Doron Orenstien; Yoav Almog

Modern computer architectures that support variable-length instruction set architectures (ISA), such as Intel's IA-32, distinguish between the architectural level of presentation and the micro-architectural representations of the instructions. At the micro-architectural level, instructions are represented by fixed-length micro-operations termed uops, and complex instructions are broken into sequences of uops. The fetch and decode operations in such architectures are extremely complicated and power hungry, especially if they aim to handle several variable-length instructions per cycle. This paper suggests caching uop sequences from decoded instructions in a special structure, termed the uop cache (UC), and using this fixed-length decoded format when possible. Doing so enables a reduction in the processor's power and energy consumption while not compromising performance. We show that a moderately sized UC can eliminate about 75% of instruction decodes across a broad range of benchmarks and over 90% in multimedia applications and high-power tests. For existing Intel P6 family processors, the eliminated work may save about 10% of the full-chip power consumption. While the proposed technique can be used to save power without degrading performance, it can also be used to improve processor performance when power is constrained.

High Performance Computer Architecture | 2000

eXtended block cache

Stephan J. Jourdan; Lihu Rappoport; Yoav Almog; Mattan Erez; Adi Yoaz; Ronny Ronen

This paper describes a new instruction-supply mechanism, called the eXtended Block Cache (XBC). The goal of the XBC is to improve on the Trace Cache (TC) hit rate while providing the same bandwidth. The improved hit rate is achieved by making the XBC a nearly redundancy-free structure. The basic unit recorded in the XBC is the extended block (XB), which is a multiple-entry, single-exit instruction block. An XB is a sequence of instructions ending on a conditional or an indirect branch. Unconditional direct jumps do not end an XB. In order to enable multiple entry points per XB, the XB index is derived from the IP of its ending instruction. Instructions within the XB are recorded in reverse order, enabling easy extension of XBs. The multiple entry points remove most of the redundancy. Since there is at most one conditional branch per XB, we can fetch up to n XBs per cycle by predicting n branches. The multiple fetch enables the XBC to match the TC bandwidth.
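The indexing and reverse-order storage described in this abstract can be sketched as a small software model. It is a simplification under stated assumptions: instructions are represented only by their IPs, capacity is unbounded, and the names `ExtendedBlockCache`, `insert`, and `fetch` are hypothetical rather than taken from the paper.

```python
class ExtendedBlockCache:
    """Toy model of an XBC: blocks are keyed by the IP of their ending branch
    and stored in reverse program order, so a block grows by appending.
    """

    def __init__(self):
        self.blocks = {}  # ending IP -> instruction IPs, last instruction first

    def insert(self, instrs):
        """instrs: instruction IPs in program order; the last is the ending branch."""
        rev = instrs[::-1]
        stored = self.blocks.get(instrs[-1])
        if stored is None:
            self.blocks[instrs[-1]] = rev
        elif len(rev) > len(stored) and rev[:len(stored)] == stored:
            # Same tail, earlier entry point: extend the block instead of
            # recording a second, redundant copy of the shared instructions.
            stored.extend(rev[len(stored):])

    def fetch(self, end_ip, entry_ip):
        """Return instructions from entry_ip to the ending branch, in program order."""
        stored = self.blocks.get(end_ip)
        if stored is None or entry_ip not in stored:
            return None
        depth = stored.index(entry_ip)
        return stored[depth::-1]


xbc = ExtendedBlockCache()
xbc.insert([0x18, 0x1C, 0x20])              # first entry point at 0x18
xbc.insert([0x10, 0x14, 0x18, 0x1C, 0x20])  # earlier entry point, same exit
from_middle = xbc.fetch(0x20, 0x14)
```

Both entry points are served from a single stored block, which is the redundancy-removal property the abstract attributes to multiple entry points.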

International Symposium on Low Power Electronics and Design | 2001

Micro-operation cache: a power aware frontend for the variable instruction length ISA

Baruch Solomon; Avi Mendelson; Doron Orenstein; Yoav Almog; Ronny Ronen

Introduces the micro-operation cache (Uop Cache, UC), designed to reduce the processor's frontend power and energy consumption without performance degradation. The UC caches basic blocks of instructions pre-decoded into micro-operations (uops). The UC fetches a single basic block's worth of uops per cycle. Fetching complete pre-decoded basic blocks eliminates the need to repeatedly decode variable-length instructions and simplifies the process of predicting, fetching, rotating and aligning fetched instructions. The UC design enables even a small structure to be quite effective. Results: a moderate-sized UC eliminates about 75% of instruction decodes across a broad range of benchmarks and over 90% in multimedia applications and high-power tests. For existing Intel P6 family processors, the eliminated work may save about 10% of the full-chip power consumption with no performance degradation.
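How caching pre-decoded basic blocks removes repeated decode work can be shown with a toy model. It assumes an unbounded cache keyed by basic-block address and a caller-supplied decoder; the names `UopCache`, `fetch_block`, and `toy_decode` are hypothetical, and the resulting fraction is a property of this toy loop, not a reproduction of the reported figures.

```python
class UopCache:
    """Toy uop cache: each basic block is decoded once, then served pre-decoded."""

    def __init__(self, decode):
        self.decode = decode  # hypothetical decoder: block address -> list of uops
        self.lines = {}       # block address -> cached uops
        self.fetches = 0
        self.decodes = 0

    def fetch_block(self, block_addr):
        """Return the uops of one basic block, decoding only on a miss."""
        self.fetches += 1
        if block_addr not in self.lines:
            self.decodes += 1
            self.lines[block_addr] = self.decode(block_addr)
        return self.lines[block_addr]

    def decodes_eliminated(self):
        """Fraction of block fetches that needed no decode."""
        if self.fetches == 0:
            return 0.0
        return 1 - self.decodes / self.fetches


def toy_decode(block_addr):
    # Placeholder decoder for illustration; real decoding is ISA specific.
    return [f"uop_{block_addr:#x}_{i}" for i in range(3)]


uc = UopCache(toy_decode)
for _ in range(5):          # a loop body executed five times
    uops = uc.fetch_block(0x100)
saved = uc.decodes_eliminated()
```

Here one decode serves five fetches, so four of the five decodes are avoided in this toy loop.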

PACS'03: Proceedings of the Third International Conference on Power-Aware Computer Systems | 2003

PARROT: power awareness through selective dynamically optimized traces

Roni Rosner; Yoav Almog; Micha Moffie; Naftali Schwartz; Avi Mendelson

We present the PARROT concept, aimed at both higher performance and power awareness. The PARROT microarchitectural framework integrates trace caching, dynamic optimizations and pipeline decoupling. We employ a gradual and selective approach, applying complex mechanisms only to the most frequently used traces to maximize the performance gain at any given power constraint, thus attaining finer control of tradeoffs between performance and power awareness. We show that the PARROT microarchitecture delivers performance increases comparable to those available through conventional doubling of execution resources (average 16% IPC improvement). This improvement comes through better utilization of all available resources with the combination of a trace cache and selective trace optimization. By contrast, the performance advantage of a trace cache alone is limited to wide-machine configurations. No less critical, however, is power awareness. The PARROT microarchitecture delivers the performance increase at a comparable energy level, whereas the conventional path to higher performance consumes an average 70% more energy. Meanwhile, for those designs which can tolerate a higher power budget, PARROT gracefully scales up to use additional execution resources in a uniformly efficient manner. In particular, a PARROT-style doubly-wide machine delivers an average 45% IPC improvement while actually improving the Cubic-MIPS-per-WATT power awareness metric by over 50%.

Archive | 2004

Method and apparatus to vectorize multiple input instructions

Yoav Almog; Roni Rosner; Ronny Ronen

Archive | 2004

System, method and apparatus for dependency chain processing

Satish Narayanasamy; Hong Wang; John Paul Shen; Roni Rosner; Yoav Almog; Naftali Schwartz; Gerolf F. Hoflehner; Daniel M. Lavery; Wei Li; Xinmin Tian; Milind Girkar; Perry H. Wang

Lecture Notes in Computer Science | 2004