Stephan J. Jourdan
Intel
Publication
Featured research published by Stephan J. Jourdan.
international symposium on computer architecture | 1999
Adi Yoaz; Mattan Erez; Ronny Ronen; Stephan J. Jourdan
State-of-the-art microprocessors achieve high performance by executing multiple instructions per cycle. In an out-of-order engine, the instruction scheduler is responsible for dispatching instructions to execution units based on dependencies, latencies, and resource availability. Most existing instruction schedulers do a less-than-optimal job of scheduling memory accesses and instructions dependent on them, for the following reasons:
• Memory dependencies cannot be resolved prior to execution, so loads are not advanced ahead of preceding stores.
• The dynamic latencies of load instructions are unknown, so the scheduling of dependent instructions is based on either an optimistic load-use delay (which may cause re-scheduling and re-execution) or a pessimistic delay (which creates unnecessary stalls).
• Memory pipelines are more expensive than other execution units and, as such, are a scarce resource. Currently, an increase in memory execution bandwidth is usually achieved through multi-banked caches, where bank conflicts limit efficiency.
In this paper we present three techniques to address these scheduler limitations. The first improves the scheduling of load instructions by using a simple memory disambiguation mechanism. The second improves the scheduling of load-dependent instructions by employing a Data Cache Hit-Miss Predictor to predict dynamic load latencies. The third improves the efficiency of load scheduling in a multi-banked cache through Cache-Bank Prediction.
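The hit-miss prediction idea above can be illustrated with a minimal sketch: a table of 2-bit saturating counters indexed by a hash of the load's program counter. The table size, indexing scheme, and initial bias are illustrative assumptions, not details taken from the paper.

```python
class HitMissPredictor:
    """Toy Data Cache Hit-Miss Predictor: 2-bit saturating counters
    indexed by load PC (assumed organization, for illustration only)."""

    def __init__(self, entries=1024):
        self.entries = entries
        self.table = [3] * entries  # start strongly biased toward "hit"

    def _index(self, pc):
        return pc % self.entries

    def predict(self, pc):
        """Return True if the load at `pc` is predicted to hit the cache."""
        return self.table[self._index(pc)] >= 2

    def update(self, pc, hit):
        """Train the counter with the actual cache outcome."""
        i = self._index(pc)
        if hit:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)
```

In a scheduler, such a predictor would be consulted when a load dispatches: dependents of predicted-hit loads are scheduled with the hit latency, dependents of predicted-miss loads with the miss latency, avoiding both blind optimism and blind pessimism.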
IEEE Micro | 2014
Per Hammarlund; Alberto J. Martinez; Atiq Bajwa; David L. Hill; Erik G. Hallnor; Hong Jiang; Martin G. Dixon; Michael N. Derr; Mikal C. Hunsaker; Rajesh Kumar; Randy B. Osborne; Ravi Rajwar; Ronak Singhal; Reynold V. D'Sa; Robert S. Chappell; Shiv Kaushik; Srinivas Chennupaty; Stephan J. Jourdan; Steve H. Gunther; Thomas A. Piazza; Ted Burton
Haswell, Intel's fourth-generation core processor architecture, delivers a range of client parts, a converged core for the client and server, and technologies used across many products. It uses an optimized version of Intel's 22-nm process technology. Haswell provides enhancements in power-performance efficiency, power management, form factor and cost, core and uncore microarchitecture, and the core's instruction set.
international symposium on computer architecture | 1999
Michael Bekerman; Stephan J. Jourdan; Ronny Ronen; Gilad Kirshenboim; Lihu Rappoport; Adi Yoaz; Uri C. Weiser
As microprocessors become faster, the relative performance cost of memory accesses increases. Bigger and faster caches significantly reduce the absolute load-to-use delay. However, the increase in processor operating frequencies worsens the relative load-to-use latency, measured in processor cycles (e.g., from two cycles on the Pentium® processor to three cycles or more in current designs). Load-address prediction techniques were introduced to partially hide the load-to-use latency. This paper focuses on advanced address-prediction schemes that further shorten program execution time.
Existing address-prediction schemes can predict simple address patterns, consisting mainly of constant addresses or stride-based addresses. This paper explores the characteristics of the remaining loads and suggests new, enhanced techniques to improve prediction effectiveness:
• Context-based prediction to tackle part of the remaining, difficult-to-predict load instructions.
• New prediction algorithms that take advantage of global correlation among different static loads.
• New confidence mechanisms to increase the correct-prediction rate and to eliminate costly mispredictions.
• Mechanisms to prevent long or random address sequences from polluting the predictor's data structures while providing some hysteresis in the predictions.
Such an enhanced address predictor accurately predicts 67% of all loads while keeping the misprediction rate close to 1%. We further show that the proposed predictor works reasonably well in a deeply pipelined architecture, where the predict-to-update delay may significantly impair both prediction rate and accuracy.
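The stride-based baseline that the paper's enhanced schemes build on can be sketched as follows. The per-PC tables and the confidence threshold are assumptions chosen for clarity, not the paper's exact organization.

```python
class StrideAddressPredictor:
    """Toy stride-based load-address predictor with a small confidence
    counter; predicts last_address + stride once the stride has repeated."""

    def __init__(self):
        self.last_addr = {}   # load PC -> last observed address
        self.stride = {}      # load PC -> current stride
        self.conf = {}        # load PC -> confidence counter (0..3)

    def predict(self, pc):
        """Predict the next address, or None when confidence is low."""
        if self.conf.get(pc, 0) >= 2:
            return self.last_addr[pc] + self.stride[pc]
        return None

    def update(self, pc, addr):
        """Train on the actual address computed at execution."""
        if pc in self.last_addr:
            new_stride = addr - self.last_addr[pc]
            if new_stride == self.stride.get(pc):
                self.conf[pc] = min(3, self.conf.get(pc, 0) + 1)
            else:
                self.conf[pc] = 0  # stride changed: reset confidence
            self.stride[pc] = new_stride
        self.last_addr[pc] = addr
```

A predictor like this captures constant and stride patterns; the context-based and globally correlated schemes in the paper target the loads whose address sequences this simple table cannot follow.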
international conference on parallel architectures and compilation techniques | 1999
Pierre Michaud; André Seznec; Stephan J. Jourdan
The effective performance of wide-issue superscalar processors depends on many parameters, such as branch prediction accuracy, available instruction-level parallelism, and instruction-fetch bandwidth. This paper explores the relations between some of these parameters, and more particularly, the instruction-fetch bandwidth requirement. We introduce new enhancements that effectively boost the instruction-fetch bandwidth of conventional fetch engines. However, experiments clearly show that, for a given instruction-fetch bandwidth gain, performance improves less as the base fetch bandwidth increases. At the bandwidth levels delivered by the proposed schemes, the performance improvement is small. This brings to light potential relations between the fetch bandwidth and the other parameters. We provide a model to explain this behaviour and quantify some of these relations. Based on the experimental observation that the available parallelism in an instruction window of size N grows as the square root √N, we derive from the model that the instruction-fetch bandwidth requirement increases as the square root of the distance between mispredicted branches. We also show that the instruction-fetch bandwidth requirement increases linearly with the parallelism available in a fixed-size instruction window.
international symposium on computer architecture | 2000
Michael Bekerman; Adi Yoaz; Freddy Gabbay; Stephan J. Jourdan; Maxim Kalaev; Ronny Ronen
Higher microprocessor frequencies accentuate the performance cost of memory accesses. This is especially noticeable in Intel's IA32 architecture, where the lack of registers results in an increased number of memory accesses. This paper presents a novel, non-speculative technique that partially hides the increasing load-to-use latency by allowing the early issue of load instructions. Early load-address resolution relies on register tracking to safely compute the addresses of memory references in the front-end part of the processor pipeline. Register tracking enables decode-time computation of register values by tracking simple operations of the form reg±immediate. It may be performed in any pipeline stage following instruction decode and prior to execution. Several tracking schemes are proposed in this paper:
• Stack-pointer tracking allows the safe early resolution of stack references by keeping track of the value of the ESP register (the stack pointer). About 25% of all loads are stack loads, and 95% of these loads may be resolved in the front-end.
• Absolute-address tracking allows the early resolution of constant-address loads.
• Displacement-based tracking tackles all loads with addresses of the form reg±immediate by tracking the values of all general-purpose registers. This class corresponds to 82% of all loads, and about 65% of these loads can be safely resolved in the front-end pipeline.
The paper describes the tracking schemes, analyzes their performance potential in a deeply pipelined processor, and discusses the integration of tracking with memory disambiguation.
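Stack-pointer tracking, the first scheme listed above, can be sketched with a toy instruction stream: the front end mirrors updates of the form ESP = ESP ± immediate, so stack-relative load addresses are known at decode time. The tuple instruction encoding below is invented for illustration; a real tracker must also invalidate its ESP copy on instructions whose effect on ESP the decoder cannot follow (e.g., mov esp, eax).

```python
def resolve_stack_loads(instructions, esp=0x1000):
    """Decode-time stack-pointer tracking sketch (assumed encoding).

    instructions: list of (op, imm) tuples with op in:
      "sub_esp"  - ESP -= imm (e.g., push / frame allocation)
      "add_esp"  - ESP += imm (e.g., pop / frame deallocation)
      "load_esp" - load from [ESP + imm]
    Returns the list of load addresses resolved in the front end.
    """
    resolved = []
    for op, imm in instructions:
        if op == "sub_esp":
            esp -= imm
        elif op == "add_esp":
            esp += imm
        elif op == "load_esp":
            resolved.append(esp + imm)  # address known before execution
    return resolved
```

Because every update to ESP in this stream is of the reg±immediate form, each stack load's address can be computed non-speculatively in the front-end pipeline and the load issued early.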
International Journal of Parallel Programming | 2001
Pierre Michaud; André Seznec; Stephan J. Jourdan
The performance of superscalar processors depends on many parameters with correlated effects. This paper explores the relations between some of these parameters, and more particularly, the instruction-fetch bandwidth requirement. We introduce new enhancements to increase the bandwidth of conventional instruction-fetch engines. However, experiments show that performance does not increase proportionally to the fetch bandwidth. Once the measured IPC is half the instruction-fetch bandwidth, increasing the fetch bandwidth brings very little improvement. To better understand this behavior, we develop a model from the empirical observation that the available instruction parallelism grows as the square root of the instruction-window size. From the model, we derive that the fetch-bandwidth requirement grows as the square root of the distance between mispredicted branches. We also verify experimentally that, to double the IPC, one should both double the fetch bandwidth and decrease the number of mispredicted branches fourfold.
high performance computer architecture | 2000
Stephan J. Jourdan; Lihu Rappoport; Yoav Almog; Mattan Erez; Adi Yoaz; Ronny Ronen
This paper describes a new instruction-supply mechanism, called the eXtended Block Cache (XBC). The goal of the XBC is to improve on the Trace Cache (TC) hit rate while providing the same bandwidth. The improved hit rate is achieved by making the XBC a nearly redundancy-free structure. The basic unit recorded in the XBC is the extended block (XB), a multiple-entry, single-exit instruction block. An XB is a sequence of instructions ending on a conditional or an indirect branch; unconditional direct jumps do not end an XB. To enable multiple entry points per XB, the XB index is derived from the IP of its ending instruction. Instructions within an XB are recorded in reverse order, enabling easy extension of XBs. The multiple entry points remove most of the redundancy. Since there is at most one conditional branch per XB, up to n XBs can be fetched per cycle by predicting n branches. This multiple fetch enables the XBC to match the TC bandwidth.
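The XB formation rules in the abstract can be sketched directly: cut the instruction stream at conditional and indirect branches (not at unconditional direct jumps), index each block by its ending IP, and store its instructions in reverse order. The (ip, kind) instruction representation is invented for this example.

```python
def build_xb_cache(instructions):
    """Form extended blocks from a linear instruction stream (sketch).

    instructions: list of (ip, kind) with kind in
      {"op", "jmp_direct", "cond_branch", "indirect_branch"}.
    Returns {ending_ip: [instructions in reverse order]} so that each
    XB is indexed by the IP of its ending branch and can grow backward.
    """
    cache = {}
    current = []
    for ip, kind in instructions:
        current.append((ip, kind))
        if kind in ("cond_branch", "indirect_branch"):
            # XB ends here; a direct jump would NOT terminate the block.
            cache[ip] = list(reversed(current))
            current = []
    return cache
```

Indexing by the ending IP is what allows several entry points to share one stored block: any fetch that reaches the same ending branch finds the same XB, which is how the structure avoids the duplication a trace cache suffers from.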
ieee hot chips symposium | 2009
Lily P. Looi; Stephan J. Jourdan
Intel maintains pace of innovation and execution - Next generation performance - 32nm: Another Process Technology Breakthrough. Enabling Nehalem for every segment - Delivering outstanding Nehalem performance to mainstream desktops and laptop computers. Redesigning more efficient platforms - Best performance across all segments - Low power and better power management - Higher levels of integration.
architectural support for programming languages and operating systems | 2002
Robert N. Cooksey; Stephan J. Jourdan; Dirk Grunwald
Archive | 2012
Stephen H. Gunther; Edward A. Burton; Anant Deval; Stephan J. Jourdan; Robert J. Greiner; Michael P. Cornaby