Ronny Ronen
Intel
Publications
Featured research published by Ronny Ronen.
international symposium on computer architecture | 1999
Adi Yoaz; Mattan Erez; Ronny Ronen; Stephan J. Jourdan
State-of-the-art microprocessors achieve high performance by executing multiple instructions per cycle. In an out-of-order engine, the instruction scheduler is responsible for dispatching instructions to execution units based on dependencies, latencies, and resource availability. Most existing instruction schedulers do a less-than-optimal job of scheduling memory accesses and the instructions that depend on them, for the following reasons:
• Memory dependencies cannot be resolved prior to execution, so loads are not advanced ahead of preceding stores.
• The dynamic latencies of load instructions are unknown, so the scheduling of dependent instructions is based on either an optimistic load-use delay (which may cause re-scheduling and re-execution) or a pessimistic delay (which creates unnecessary stalls).
• Memory pipelines are more expensive than other execution units and are therefore a scarce resource. Currently, an increase in memory execution bandwidth is usually achieved through multi-banked caches, where bank conflicts limit efficiency.
In this paper we present three techniques to address these scheduler limitations. The first improves the scheduling of load instructions by using a simple memory disambiguation mechanism. The second improves the scheduling of load-dependent instructions by employing a data-cache hit-miss predictor to predict dynamic load latencies. The third improves the efficiency of load scheduling in a multi-banked cache through cache-bank prediction.
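To make the hit-miss prediction idea concrete, here is a minimal sketch of a per-load predictor built from 2-bit saturating counters indexed by load PC; the abstract does not fix a concrete organization, so the table size and latency values below are assumptions.

```python
# Minimal sketch of a data-cache hit/miss predictor: a 2-bit saturating
# counter per load PC. Table size and latencies are illustrative assumptions.

HIT_LATENCY = 3    # assumed cycles for an L1 hit
MISS_LATENCY = 12  # assumed cycles for a miss

class HitMissPredictor:
    def __init__(self, entries=1024):
        self.entries = entries
        self.counters = [2] * entries  # start weakly predicting "hit"

    def _index(self, pc):
        return (pc >> 2) % self.entries

    def predict_latency(self, pc):
        # Counter >= 2 means "predict hit": wake dependents after HIT_LATENCY.
        return HIT_LATENCY if self.counters[self._index(pc)] >= 2 else MISS_LATENCY

    def update(self, pc, was_hit):
        i = self._index(pc)
        if was_hit:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

pred = HitMissPredictor()
pred.update(0x4000, was_hit=False)   # train: this load missed
print(pred.predict_latency(0x4000))  # 12: the scheduler now assumes a miss
```

The scheduler uses the predicted latency to decide when to wake a load's dependents, instead of always assuming the optimistic hit latency.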
Proceedings of the IEEE | 2001
Ronny Ronen; Avi Mendelson; Konrad K. Lai; Shih-Lien Lu; Fred J. Pollack; John Paul Shen
In the past several decades, the world of computers, and especially that of microprocessors, has witnessed phenomenal advances. Computers have exhibited ever-increasing performance and decreasing costs, making them more affordable and, in turn, accelerating additional software and hardware development that fueled this process even more. The technology that enabled this exponential growth is a combination of advancements in process technology, microarchitecture, architecture, and design and development tools. While the pace of this progress has been quite impressive over the last two decades, it has become harder and harder to keep up. New process technology requires more expensive megafabs, and new performance levels require larger dies, higher power consumption, and enormous design and validation effort. Furthermore, as CMOS technology continues to advance, microprocessor design is exposed to a new set of challenges. In the near future, microarchitecture will have to consider and explicitly manage the limits of semiconductor technology, such as wire delays, power dissipation, and soft errors. In this paper we describe the role of microarchitecture in the computer world, present the challenges ahead of us, and highlight areas where microarchitecture can help address these challenges.
international conference on computer design | 2004
Ed Grochowski; Ronny Ronen; John Paul Shen; Hong Wang
This paper describes the tradeoff between latency performance and throughput performance in a power-constrained environment. We show that the key to achieving both excellent latency performance as well as excellent throughput performance is to dynamically vary the amount of energy expended to process instructions according to the amount of parallelism available in the software. We survey four techniques for achieving variable energy per instruction: voltage/frequency scaling, asymmetric cores, variable-size cores, and speculation control. We estimate the potential range of energies obtainable by each technique and conclude that a combination of asymmetric cores and voltage/frequency scaling offers the most promising approach to design a chip-level multiprocessor that can achieve both excellent latency performance and excellent throughput performance.
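As a rough illustration of trading energy per instruction against available parallelism, the sketch below divides a fixed power budget across a varying number of active cores under a textbook P ∝ f³ voltage/frequency-scaling rule; all constants are assumptions, not figures from the paper.

```python
# Illustrative sketch of the variable energy-per-instruction (EPI) idea via
# voltage/frequency scaling. Numbers and the P ~ f^3 rule are assumptions.

CHIP_POWER_BUDGET = 100.0  # watts, assumed
F_MAX = 3.0                # GHz at full voltage, assumed
P_MAX = 100.0              # watts for one core at F_MAX, assumed

def core_frequency(active_cores):
    # Split the fixed budget across cores; with P ~ f^3 (voltage tracks
    # frequency), each core's frequency scales as the cube root of its share.
    share = CHIP_POWER_BUDGET / active_cores
    return F_MAX * (share / P_MAX) ** (1.0 / 3.0)

for n in (1, 2, 4, 8):
    f = core_frequency(n)
    print(f"{n} cores: {f:.2f} GHz each, {n * f:.2f} GHz aggregate")
# One core gets peak latency performance; eight slower cores deliver about
# four times the aggregate throughput inside the same power envelope.
```

When software exposes little parallelism, the budget is spent on one fast core (low latency, high EPI); when parallelism is abundant, many slow cores maximize throughput at low EPI.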
international symposium on microarchitecture | 1998
Stephan J. Jourdan; Ronny Ronen; Michael Bekerman; Bishara Shomar; Adi Yoaz
Hardware renaming schemes provide multiple physical locations (register or memory) for each logical name. In current renaming schemes, a new physical location is allocated for each dispatched instruction regardless of its result value. However, these values exhibit a high level of temporal locality (result redundancy). This paper proposes:
1. Physical register reuse: reuse a physical location whenever an incoming result value is detected to match a previous one. This is performed during register renaming and requires some value-identity detection hardware. By mapping several logical registers holding the same value to the same physical register, physical register reuse enables:
• Sharing: exploit the high level of value redundancy in the register file to either reduce the file's size and complexity or effectively enlarge the active instruction window. Our results suggest reduction factors of 2 to 4 in some cases. Performance is increased either by the enlarged instruction window or by the higher frequency enabled by a smaller register file requiring fewer ports.
• Result reuse and dependency redirection: move the responsibility for generating results (1) from the functional units to the register renamer, possibly eliminating processed instructions from the execution stream, and (2) from one instruction to an earlier one in the instruction stream, possibly allowing instructions to be scheduled earlier. In this way, large performance speedups are achieved.
2. Unification: combine the memory renamer with the register renamer to extend the sharing and result-reuse/dependency-redirection ideas to both registers and memory locations. This allows even greater hardware savings and performance improvements, and also simplifies the processing of store instructions.
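A minimal sketch of the value-identity idea at rename time, assuming result values are available when the destination is renamed (e.g., immediates or detected move-like identities); the table organization and policies are illustrative only.

```python
# Sketch of physical-register reuse via value-identity detection at rename.
# Values are assumed known at rename time; sizes/policies are illustrative.

class ReuseRenamer:
    def __init__(self):
        self.value_to_preg = {}  # value-identity table
        self.logical_map = {}    # logical register -> physical register
        self.next_preg = 0

    def rename_dest(self, logical, value):
        preg = self.value_to_preg.get(value)
        if preg is None:
            preg = self.next_preg        # allocate a fresh physical register
            self.next_preg += 1
            self.value_to_preg[value] = preg
        # Reused or fresh, the logical register now maps to this preg.
        self.logical_map[logical] = preg
        return preg

r = ReuseRenamer()
print(r.rename_dest("eax", 0))  # preg 0 allocated for value 0
print(r.rename_dest("ebx", 0))  # same value: preg 0 shared, no allocation
print(r.rename_dest("ecx", 7))  # new value: preg 1
```

Note that sharing one physical register among several logical registers means it can be freed only when all mappings are gone, so a real design would add reference counting.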
international symposium on computer architecture | 1999
Michael Bekerman; Stephan J. Jourdan; Ronny Ronen; Gilad Kirshenboim; Lihu Rappoport; Adi Yoaz; Uri C. Weiser
As microprocessors become faster, the relative performance cost of memory accesses increases. Bigger and faster caches significantly reduce the absolute load-to-use delay. However, the increase in processor operating frequencies impairs the relative load-to-use latency, measured in processor cycles (e.g., from two cycles on the Pentium® processor to three cycles or more in current designs). Load-address prediction techniques were introduced to partially hide the load-to-use latency. This paper focuses on advanced address-prediction schemes to further shorten program execution time.
Existing address-prediction schemes can predict simple address patterns, consisting mainly of constant addresses or stride-based addresses. This paper explores the characteristics of the remaining loads and suggests new techniques to improve prediction effectiveness:
• Context-based prediction to tackle part of the remaining, difficult-to-predict load instructions.
• New prediction algorithms that take advantage of global correlation among different static loads.
• New confidence mechanisms to increase the correct-prediction rate and to eliminate costly mispredictions.
• Mechanisms that prevent long or random address sequences from polluting the predictor's data structures while providing some hysteresis in the predictions.
Such an enhanced address predictor accurately predicts 67% of all loads while keeping the misprediction rate close to 1%. We further show that the proposed predictor works reasonably well in a deeply pipelined architecture, where the predict-to-update delay may significantly impair both prediction rate and accuracy.
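To illustrate context-based prediction with a confidence gate, here is a sketch that learns per-load address deltas keyed by the recent delta history and predicts only once a context has repeatedly proven correct. The history depth, thresholds, and table structure are assumptions, not the paper's design.

```python
# Sketch of a context-based load-address predictor with a confidence gate.
# History depth, table structure, and thresholds are illustrative assumptions.

class ContextAddressPredictor:
    def __init__(self, history_len=4):
        self.history_len = history_len
        self.histories = {}   # load PC -> tuple of recent address deltas
        self.contexts = {}    # (pc, history) -> predicted next delta
        self.confidence = {}  # (pc, history) -> 2-bit saturating counter
        self.last_addr = {}   # load PC -> last seen address

    def predict(self, pc):
        key = (pc, self.histories.get(pc, ()))
        # Only predict when this context has proven itself repeatedly.
        if self.confidence.get(key, 0) >= 3 and pc in self.last_addr:
            return self.last_addr[pc] + self.contexts[key]
        return None  # no prediction: avoid a costly misprediction

    def update(self, pc, addr):
        prev = self.last_addr.get(pc)
        if prev is not None:
            delta = addr - prev
            key = (pc, self.histories.get(pc, ()))
            if self.contexts.get(key) == delta:
                self.confidence[key] = min(3, self.confidence.get(key, 0) + 1)
            else:  # wrong or new context: relearn, reset confidence
                self.contexts[key] = delta
                self.confidence[key] = 0
            hist = (self.histories.get(pc, ()) + (delta,))[-self.history_len:]
            self.histories[pc] = hist
        self.last_addr[pc] = addr

p = ContextAddressPredictor()
for _ in range(8):            # alternating pattern no stride predictor catches
    p.update(0x40, 100)
    p.update(0x40, 104)
print(p.predict(0x40))        # 100: the (+4, -4) context is now confident
```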
programming language design and implementation | 2009
Bratin Saha; Xiaocheng Zhou; Hu Chen; Ying Gao; Shoumeng Yan; Mohan Rajagopalan; Jesse Fang; Peinan Zhang; Ronny Ronen; Avi Mendelson
The client computing platform is moving towards a heterogeneous architecture consisting of a combination of cores focused on scalar performance and a set of throughput-oriented cores. The throughput-oriented cores (e.g., a GPU) may be connected over both coherent and non-coherent interconnects, and may have different ISAs. This paper describes a programming model for such heterogeneous platforms. We discuss the language constructs, runtime implementation, and memory model for such a programming environment. We implemented this programming environment in an x86 heterogeneous platform simulator, ported a number of workloads to it, and present its performance on these workloads.
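The offload pattern such a programming model exposes might look like the hypothetical sketch below: data lives in a region visible to both core types, and work dispatched to the throughput cores is bracketed by release/acquire synchronization. The names (SharedRegion, offload, release, acquire) are illustrative stand-ins, not the constructs defined in the paper.

```python
# Hypothetical sketch of a shared-memory offload pattern between scalar and
# throughput cores. All names are illustrative, not the paper's constructs.

class SharedRegion:
    """Data here is assumed visible to both the scalar and throughput cores."""
    def __init__(self, data):
        self.data = list(data)

def release():
    pass  # a real runtime would publish pending scalar-core writes here

def acquire():
    pass  # ...and make throughput-core writes visible back to scalar cores

def offload(fn, region):
    release()                 # publish inputs before the remote cores start
    result = fn(region.data)  # stands in for execution on the throughput cores
    acquire()                 # pull results back under the memory model
    return result

buf = SharedRegion(range(8))
print(offload(lambda xs: [x * x for x in xs], buf))
```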
IEEE Computer Architecture Letters | 2003
Aviad Cohen; Lev Finkelstein; Avi Mendelson; Ronny Ronen; Dmitry Rudoy
In this paper we focus on dynamic thermal management (DTM) strategies that use dynamic voltage scaling (DVS) for power control. We perform a theoretical analysis aimed at estimating the optimal strategy, and show two facts: (1) when there is a gap between the initial and the limit temperatures, it is best to start with a high (though not necessarily maximal) frequency and decrease it exponentially until the limit temperature is reached; (2) when close to the limit temperature, the best strategy is to stay there. We use the patterns exhibited by the optimal strategy to analyze some existing DTM techniques.
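The shape of the optimal strategy can be reproduced with a lumped thermal model: start at a high frequency, decay it exponentially, and hold the sustainable frequency once the limit temperature is reached. All constants below are assumptions for the sketch, not values from the paper.

```python
# Numerical illustration of the optimal-strategy shape under a simple lumped
# thermal model; every constant here is an assumption for the sketch.

K = 0.1                      # cooling coefficient, assumed
T_AMB, T_LIMIT = 35.0, 85.0  # ambient and limit temperatures, assumed
DT = 0.1                     # time step

def power(f):
    return 20.0 * f ** 3     # dynamic power scales roughly as f^3 under DVS

# Frequency that exactly balances cooling at the limit temperature.
F_SUSTAIN = (K * (T_LIMIT - T_AMB) / 20.0) ** (1.0 / 3.0)

T, f = 45.0, 1.3             # start below the limit, at a high frequency
for t in range(60):
    if T >= T_LIMIT:
        f = F_SUSTAIN                     # at the limit: hold steady state
    else:
        f = max(f * 0.97, F_SUSTAIN)      # exponential decay, floored
    T = min(T + DT * (power(f) - K * (T - T_AMB)), T_LIMIT)
    if t % 10 == 0:
        print(f"t={t:2d}  f={f:.2f}  T={T:.1f}")
```

The printout shows the predicted pattern: frequency decays exponentially from its high starting point while temperature climbs toward the limit, then settles at the sustainable operating point.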
international symposium on computer architecture | 2000
Michael Bekerman; Adi Yoaz; Freddy Gabbay; Stephan J. Jourdan; Maxim Kalaev; Ronny Ronen
Higher microprocessor frequencies accentuate the performance cost of memory accesses. This is especially noticeable in Intel's IA-32 architecture, where the scarcity of registers results in an increased number of memory accesses. This paper presents a novel, non-speculative technique that partially hides the increasing load-to-use latency by allowing the early issue of load instructions. Early load-address resolution relies on register tracking to safely compute the addresses of memory references in the front-end of the processor pipeline. Register tracking enables decode-time computation of register values by tracking simple operations of the form reg±immediate, and may be performed in any pipeline stage following instruction decode and prior to execution. Several tracking schemes are proposed in this paper:
• Stack-pointer tracking allows safe early resolution of stack references by keeping track of the value of the ESP register (the stack pointer). About 25% of all loads are stack loads, and 95% of these loads may be resolved in the front-end.
• Absolute-address tracking allows the early resolution of constant-address loads.
• Displacement-based tracking tackles all loads with addresses of the form reg±immediate by tracking the values of all general-purpose registers. This class corresponds to 82% of all loads, and about 65% of these loads can be safely resolved in the front-end pipeline.
The paper describes the tracking schemes, analyzes their performance potential in a deeply pipelined processor, and discusses the integration of tracking with memory disambiguation.
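A minimal sketch of decode-time register tracking, assuming a tiny three-operation instruction form: immediate moves and reg±immediate updates keep a register's value known, any other write invalidates it, and a load resolves early only if its base register is known.

```python
# Minimal sketch of decode-time register tracking for early load resolution.
# The three-operation instruction form is an assumption for illustration.

class RegisterTracker:
    def __init__(self):
        self.known = {}  # register -> value known at decode time, if any

    def decode(self, op, reg, imm=0):
        if op == "mov_imm":      # reg = imm: value becomes known
            self.known[reg] = imm
        elif op == "add_imm":    # reg += imm: adjust if already known
            if reg in self.known:
                self.known[reg] += imm
        else:                    # any other write: value no longer known
            self.known.pop(reg, None)

    def resolve_load(self, base_reg, disp):
        # An address of the form reg +/- immediate resolves in the front-end
        # iff the base register's value is currently known.
        if base_reg in self.known:
            return self.known[base_reg] + disp
        return None

t = RegisterTracker()
t.decode("mov_imm", "esp", 0x8000)    # stack pointer established
t.decode("add_imm", "esp", -16)       # frame allocation tracked at decode
print(hex(t.resolve_load("esp", 8)))  # 0x7ff8: stack load resolved early
```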
IEEE Transactions on Very Large Scale Integration Systems | 2003
Baruch Solomon; Avi Mendelson; Ronny Ronen; Doron Orenstien; Yoav Almog
Modern computer architectures that support variable-length instruction set architectures (ISAs), such as Intel's IA-32, distinguish between the architectural representation of instructions and their micro-architectural representation. At the micro-architectural level, instructions are represented by fixed-length micro-operations, termed uops, and complex instructions are broken into sequences of uops. The fetch and decode operations in such architectures are extremely complicated and power hungry, especially if they aim to handle several variable-length instructions per cycle. This paper suggests caching uop sequences from decoded instructions in a special structure, termed a uop cache (UC), and using this fixed-length decoded format whenever possible. Doing so reduces the processor's power and energy consumption without compromising performance. We show that a moderately sized UC can eliminate about 75% of instruction decodes across a broad range of benchmarks, and over 90% in multimedia applications and high-power tests. For existing Intel P6 family processors, the eliminated work may save about 10% of full-chip power consumption. While the proposed technique can be used to save power without degrading performance, it can also be used to improve processor performance when power is constrained.
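The decode-elimination effect is easy to see in a toy model: a uop cache in front of the decoder supplies already-decoded uops on a hit, so the variable-length decode runs only on misses. The direct-mapped organization below is an assumption for the sketch.

```python
# Illustrative sketch of a uop cache (UC) in front of the decoder. The
# direct-mapped, one-line-per-fetch-address organization is assumed.

class UopCache:
    def __init__(self, lines=256):
        self.lines = lines
        self.store = {}       # set index -> (tag, decoded uops)
        self.decodes = self.eliminated = 0

    def fetch(self, pc, decode_fn):
        idx, tag = pc % self.lines, pc // self.lines
        entry = self.store.get(idx)
        if entry and entry[0] == tag:
            self.eliminated += 1   # UC hit: variable-length decode skipped
            return entry[1]
        self.decodes += 1          # UC miss: decode and fill
        uops = decode_fn(pc)
        self.store[idx] = (tag, uops)
        return uops

uc = UopCache()
decode = lambda pc: [f"uop@{pc:#x}.{i}" for i in range(2)]  # stand-in decoder
for _ in range(10):                # a hot loop body re-fetches the same PCs
    for pc in (0x100, 0x104, 0x108):
        uc.fetch(pc, decode)
print(uc.decodes, uc.eliminated)   # 3 decodes, 27 eliminated (90%)
```

In loop-dominated code, as here, nearly all fetches hit the UC, which is why decode elimination rates above 90% are plausible for multimedia kernels.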
international conference on parallel architectures and compilation techniques | 2001
Roni Rosner; Avi Mendelson; Ronny Ronen
The trace cache is becoming an important building block of modern, wide-issue processors. This paper makes three main contributions: it shows that trace-cache optimizations directed at reducing power consumption do not necessarily coincide with optimizations directed at increasing fetch bandwidth; it extends our understanding of how well the trace cache utilizes its resources; and it introduces a new trace-cache organization based on filtering techniques. We observe that (1) the majority of traces inserted into the trace cache are rarely used again before being replaced; (2) the majority of the instructions delivered for execution originate from the few traces that are heavily and repeatedly used; and (3) techniques that aim to improve instruction fetch bandwidth may increase the number of traces built during program execution. Based on these observations, we propose splitting the trace cache into two components: the filter trace-cache (FTC) and the main trace-cache (MTC). The FTC/MTC organization exhibits an important benefit: it decreases the number of traces built, reducing power consumption while improving overall performance. An extension of the filtering concept adds a second-level (L2) trace cache that stores less frequent traces replaced from the FTC or the MTC. This extra level of caching allows an order-of-magnitude reduction in the number of trace builds, and proves particularly useful for applications with large instruction footprints.
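A sketch of the filtering idea: new traces enter a small FTC, and only a trace that is reused while still resident gets promoted into the MTC, so rarely reused traces never pollute it. Sizes and the promotion policy below are assumptions.

```python
# Sketch of the filter trace-cache (FTC) / main trace-cache (MTC) split.
# Capacities and the promote-on-first-reuse policy are illustrative.

from collections import OrderedDict

class FilteredTraceCache:
    def __init__(self, ftc_size=4, mtc_size=16):
        self.ftc = OrderedDict()  # small LRU filter for newly built traces
        self.mtc = OrderedDict()  # main trace cache, LRU, reused traces only
        self.ftc_size, self.mtc_size = ftc_size, mtc_size
        self.builds = 0

    def fetch(self, trace_id):
        if trace_id in self.mtc:
            self.mtc.move_to_end(trace_id)
            return "mtc-hit"
        if trace_id in self.ftc:
            del self.ftc[trace_id]        # reused once: promote to the MTC
            self.mtc[trace_id] = True
            if len(self.mtc) > self.mtc_size:
                self.mtc.popitem(last=False)
            return "promoted"
        self.builds += 1                  # miss: build the trace into the FTC
        self.ftc[trace_id] = True
        if len(self.ftc) > self.ftc_size:
            self.ftc.popitem(last=False)  # cold trace filtered out
        return "built"

tc = FilteredTraceCache()
for t in [1, 2, 1, 3, 1, 4, 5, 6, 7, 1]:  # trace 1 is hot, the rest are cold
    tc.fetch(t)
print(tc.builds, list(tc.mtc))            # 7 builds; only trace 1 in the MTC
```

An L2 trace cache, as the abstract describes, would catch evictions from the FTC and MTC so that a returning trace can be refilled instead of rebuilt.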