Soner Önder
Michigan Technological University
Publications
Featured research published by Soner Önder.
compiler construction | 2002
Siddharth Rele; Santosh Pande; Soner Önder; Rajiv Gupta
We present a novel approach that combines compiler, instruction set, and microarchitecture support to turn off functional units that are idle for long periods of time, reducing the static power they dissipate through power gating [2,9]. The compiler identifies program regions in which functional units are expected to be idle and communicates this information to the hardware by issuing directives that turn units off at the entry points of idle regions and back on at their exits. The microarchitecture treats the compiler directives as hints, ignoring a pair of off and on directives if they are too close together. Our experiments show that some functional units can be kept off for over 90% of the time at the cost of a minimal performance degradation of under 1%.
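The compiler/hardware interplay above can be sketched in a few lines. This is an illustrative model, not the paper's implementation: `find_idle_regions` plays the role of the compiler pass that emits off/on directives around long idle regions, and `apply_hints` models the hardware treating directives as hints, ignoring an off/on pair whose region is shorter than a break-even threshold. All names and threshold values are assumptions.

```python
def find_idle_regions(fu_busy, min_len):
    """Return (start, end) cycle spans where a functional unit is idle
    for at least min_len consecutive cycles (end is exclusive)."""
    regions, start = [], None
    for cycle, busy in enumerate(fu_busy):
        if not busy and start is None:
            start = cycle
        elif busy and start is not None:
            if cycle - start >= min_len:
                regions.append((start, cycle))
            start = None
    if start is not None and len(fu_busy) - start >= min_len:
        regions.append((start, len(fu_busy)))
    return regions

def apply_hints(regions, break_even):
    """Hardware side: honor only hints whose off period covers the
    break-even cost of power gating; treat the rest as no-ops."""
    return [r for r in regions if r[1] - r[0] >= break_even]

busy = [1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0]
hints = find_idle_regions(busy, min_len=3)   # compiler-issued directives
gated = apply_hints(hints, break_even=5)     # hardware filters short regions
print(hints)  # [(1, 5), (8, 14)]
print(gated)  # [(8, 14)]
```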
Proceedings of the 2004 workshop on Memory system performance | 2004
Changpeng Fang; Steve Carr; Soner Önder; Zhenlin Wang
Feedback-directed optimization has become an increasingly important tool in designing and building optimizing compilers. Recently, reuse-distance analysis has shown much promise in predicting the memory behavior of programs over a wide range of data sizes. Reuse-distance analysis predicts program locality by experimentally determining locality properties as a function of the data size of a program, allowing accurate locality analysis when the program's data size changes. Prior work has established the effectiveness of reuse-distance analysis in predicting whole-program locality and miss rates. In this paper, we show that reuse distance can also effectively predict locality and miss rates on a per-instruction basis. Rather than predict locality by analyzing reuse distances for memory addresses alone, we relate those addresses to particular static memory operations and predict the locality of each instruction. Our experiments show that using reuse distance without cache simulation to predict miss rates of instructions is superior to using cache simulations on a single representative data set to predict miss rates on various data sizes. In addition, our analysis allows us to identify the critical memory operations that are likely to produce a significant number of cache misses for a given data size. With this information, compilers can target cache optimizations specifically at the instructions that benefit most from them.
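Per-instruction reuse distance can be illustrated with a small sketch (not the authors' profiling infrastructure): for each access we record the number of distinct addresses touched since the previous access to the same address, and attribute that distance to the accessing instruction's PC. The PC labels and addresses below are made up.

```python
from collections import OrderedDict, defaultdict

def per_pc_reuse_distances(trace):
    """trace: list of (pc, address). Returns {pc: [reuse distances]};
    a first touch of an address yields distance None (a cold miss)."""
    lru = OrderedDict()                 # address -> None, most recent last
    dists = defaultdict(list)
    for pc, addr in trace:
        if addr in lru:
            # count distinct addresses accessed after addr's last use
            keys = list(lru)
            dists[pc].append(len(keys) - keys.index(addr) - 1)
            del lru[addr]
        else:
            dists[pc].append(None)
        lru[addr] = None                # addr becomes most recently used
    return dict(dists)

trace = [("L1", 0xA), ("L2", 0xB), ("L1", 0xA), ("L2", 0xC), ("L3", 0xB)]
print(per_pc_reuse_distances(trace))
# {'L1': [None, 1], 'L2': [None, None], 'L3': [2]}
```

Relating distances to PCs rather than addresses is what lets the analysis predict, per static instruction, whether its accesses will fit in a cache of a given size.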
international conference on parallel architectures and compilation techniques | 2005
Changpeng Fang; Steve Carr; Soner Önder; Zhenlin Wang
Feedback-directed optimization has become an increasingly important tool in designing and building optimizing compilers, as it provides a means to analyze complex program behavior that is not possible using traditional static analysis. Feedback-directed optimization offers the compiler opportunities to analyze and optimize the memory behavior of programs even when traditional array-based analysis is not applicable. As a result, both floating-point and integer programs can benefit from memory hierarchy optimization. In this paper, we examine the notion of memory distance as it is applied to the instruction space of a program and to feedback-directed optimization. Memory distance is defined as a dynamic quantifiable distance in terms of memory references between two accesses to the same memory location. We use memory distance to predict the miss rates of instructions in a program. Using the miss rates, we then identify the program's critical instructions - the set of high-miss instructions whose cumulative misses account for 95% of the L2 cache misses in the program - in both integer and floating-point programs. Our experiments show that memory-distance analysis can effectively identify critical instructions in both integer and floating-point programs. Additionally, we apply memory-distance analysis to memory disambiguation in out-of-order issue processors, using those distances to determine when a load may be speculated ahead of a preceding store. Our experiments show that memory-distance-based disambiguation on average achieves within 5-10% of the performance gain of the store set technique, which requires a hardware table.
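The critical-instruction selection described above amounts to a simple greedy cut: rank instructions by predicted misses and take the top of the ranking until the chosen set covers 95% of all misses. The sketch below uses made-up miss counts; the function name and threshold parameter are assumptions.

```python
def critical_instructions(miss_counts, coverage=0.95):
    """miss_counts: {pc: predicted misses}. Returns the smallest
    high-miss set of PCs whose cumulative misses reach the requested
    fraction of total misses."""
    total = sum(miss_counts.values())
    picked, covered = [], 0
    for pc, misses in sorted(miss_counts.items(),
                             key=lambda kv: kv[1], reverse=True):
        if covered >= coverage * total:
            break
        picked.append(pc)
        covered += misses
    return picked

misses = {"ld1": 900, "ld2": 60, "ld3": 30, "ld4": 10}
print(critical_instructions(misses))  # ['ld1', 'ld2']
```

Here two of the four loads account for 96% of the misses, so optimization effort can be focused on just those two.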
international conference on supercomputing | 2001
Soner Önder; Rajiv Gupta
The detection of opportunities for value reuse optimizations in memory operations requires both the addresses and values associated with these operations to be available. Although the values are typically available in the physical register file, their presence cannot be exploited because no correspondence between values and addresses is maintained. In this paper we propose explicitly managing the physical register file contents as a level in the memory hierarchy by supporting the Value Address Association Structure (VAAS). The entries in VAAS have a one-to-one correspondence with entries in the physical register file. For each value in the register file that is involved in a load or store operation, the associated information, including the memory address, is stored in the corresponding VAAS entry. Several optimization tasks can be performed using the combination of physical registers and VAAS. Specifically, VAAS enables a unified implementation of the following optimization tasks: (i) store-to-load forwarding is performed without explicitly saving the stored values; (ii) load-to-load forwarding is performed without saving loaded values in a reuse buffer; (iii) silent stores are eliminated without saving or loading the prior value stored to the same address; (iv) bit switching in the L1 cache is minimized without saving additional history; and (v) false memory access order violations are avoided without holding speculatively loaded values in the speculated loads table. Our experiments demonstrate that our implementation of non-speculative optimizations is highly effective, as it eliminates the memory references due to 60% (58%) of loads in SPECint95 (SPECfp95) and 25% (22.6%) of stores in SPECint95 (SPECfp95). On average, over 45% of cache references are eliminated due to non-speculative reuse, and L1 switching activity is reduced by 7.75%.
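The value-address association idea can be modeled in miniature (class and method names are illustrative assumptions, not the hardware design): alongside each physical register holding a loaded or stored value, remember the memory address it corresponds to. A later load to that address reuses the register instead of accessing the cache, and a store writing the same value is detected as silent.

```python
class VAAS:
    """Toy model: one address association per physical register."""
    def __init__(self):
        self.addr_to_reg = {}   # memory address -> physical register
        self.reg_file = {}      # physical register -> value

    def store(self, reg, value, addr):
        self.reg_file[reg] = value
        # Silent store: the value already at this address matches.
        silent = (addr in self.addr_to_reg and
                  self.reg_file[self.addr_to_reg[addr]] == value)
        self.addr_to_reg[addr] = reg
        return silent            # True: the store need not touch the cache

    def load(self, reg, addr, memory):
        if addr in self.addr_to_reg:   # store-to-load / load-to-load reuse
            value, hit = self.reg_file[self.addr_to_reg[addr]], True
        else:
            value, hit = memory[addr], False
        self.reg_file[reg] = value
        self.addr_to_reg[addr] = reg
        return value, hit

mem = {0x10: 7}
v = VAAS()
v.store("p1", 42, 0x20)
print(v.load("p2", 0x20, mem))   # (42, True)  forwarded, no cache access
print(v.load("p3", 0x10, mem))   # (7, False)  first touch goes to memory
print(v.store("p4", 7, 0x10))    # True        silent store eliminated
```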
international symposium on microarchitecture | 1999
Soner Önder; Rajiv Gupta
With the help of a memory dependence predictor, the instruction scheduler can speculatively issue load instructions at the earliest possible time without causing a significant number of memory order violations. For maximum performance, the scheduler must also allow fully out-of-order issuing of store instructions, since any superfluous ordering of stores results in false memory dependencies which adversely affect the timely issuing of dependent loads. Unfortunately, simple techniques for detecting memory order violations do not work well when store instructions issue out-of-order, since they yield many false memory order violations. By using a novel memory order violation detection mechanism that is employed in the retire logic of the processor and delaying the checking for memory order violations, we are able to allow fully out-of-order issuing of store instructions without causing false memory order violations. In addition, our mechanism can take advantage of data value redundancy. We present an implementation of our technique using the store set memory dependence predictor. An out-of-order superscalar processor that uses our technique delivers an IPC within 100%, 96%, and 85% of that of a processor equipped with an ideal memory disambiguator at issue widths of 8, 16, and 32 instructions, respectively.
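A simplified picture of retire-time checking with value redundancy (structure and names are assumptions, not the hardware): rather than comparing issue order against stores, each load re-reads memory when it retires, and only a value mismatch counts as a real violation. A load that issued "too early" but still saw the final value is not squashed.

```python
def retire_check(loads, memory_at_retire):
    """loads: list of (addr, value_seen_at_issue), in program order.
    Returns indices of loads that must be squashed and replayed."""
    violations = []
    for i, (addr, seen) in enumerate(loads):
        if memory_at_retire[addr] != seen:   # stale value: true violation
            violations.append(i)
    return violations

# Load 0 issued ahead of a store to 0x10, but the store wrote the same
# value, so no squash is needed (data value redundancy). A store did
# change 0x20 after load 1 issued, so load 1 must replay.
loads = [(0x10, 5), (0x20, 9)]
memory = {0x10: 5, 0x20: 7}
print(retire_check(loads, memory))  # [1]
```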
international conference on supercomputing | 2005
Peng Zhou; Soner Önder; Steve Carr
Current trends in modern out-of-order processors involve implementing deeper pipelines and larger instruction windows to achieve high performance. However, as pipeline depth increases, the branch misprediction penalty becomes a critical factor in overall processor performance. Current approaches to handling branch mispredictions either incrementally roll back to in-order state by waiting until the mispredicted branch reaches the head of the reorder buffer, or utilize checkpointing at branches for faster recovery. Rolling back to in-order state stalls the pipeline for a significant number of cycles, and checkpointing is costly. This paper proposes a fast recovery mechanism, called Eager Misprediction Recovery (EMR), to reduce the branch misprediction penalty. Upon a misprediction, the processor immediately starts fetching and renaming instructions from the correct path without restoring the map table. Those instructions that access incorrect speculative values wait until the correct data are restored; however, instructions that access correct values continue executing while recovery occurs. Thus, the recovery mechanism hides the latency of long branch recovery with useful instructions. EMR achieves a mean performance improvement very close to that of a recovery mechanism that supports checkpointing at each branch. In addition, EMR provides an average of 9.0% and up to 19.9% better performance than traditional sequential misprediction recovery on the SPEC2000 benchmark suite.
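The selective-wait behavior of eager recovery can be illustrated with a toy model (not the paper's hardware design): registers whose mappings were corrupted by the squashed wrong path are "poisoned", and a correct-path instruction executes immediately only if none of its sources is poisoned. Register and instruction names below are made up.

```python
def eager_recovery_ready(instructions, poisoned):
    """instructions: list of (name, source_regs). Returns the names that
    can execute at once vs. those that must wait for map-table recovery."""
    ready, waiting = [], []
    for name, srcs in instructions:
        if any(s in poisoned for s in srcs):
            waiting.append(name)     # reads a still-poisoned mapping
        else:
            ready.append(name)       # overlaps with recovery for free
    return ready, waiting

poisoned = {"r3"}                    # overwritten on the wrong path
insts = [("add", ["r1", "r2"]), ("mul", ["r3", "r4"]), ("sub", ["r5"])]
print(eager_recovery_ready(insts, poisoned))  # (['add', 'sub'], ['mul'])
```

The point of the mechanism is that `add` and `sub` do useful work during cycles that a sequential rollback would spend stalled.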
international conference on parallel architectures and compilation techniques | 2002
Soner Önder
Memory dependence prediction allows out-of-order issue processors to achieve high degrees of instruction-level parallelism by issuing load instructions at the earliest time without causing a significant number of memory order violations. We present a simple mechanism which incorporates multiple speculation levels within the processor and classifies the load and store instructions at run time to the appropriate speculation level. Each speculation level is termed a color, and the sets of load and store instructions are called color sets. We present how this mechanism can be incorporated into the issue logic of a conventional superscalar processor and show that this simple mechanism can provide similar performance to that of more costly schemes, resulting in reduced hardware complexity and cost. The performance of the technique is evaluated with respect to the store set algorithm. At very small table sizes, the color set approach provides up to 21% better performance than the store set algorithm for floating-point Spec95 benchmarks and up to 18% better performance for integer benchmarks using harmonic means.
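A rough sketch of run-time color classification (the table structure, default level, and demotion policy are illustrative assumptions): a load starts at the most aggressive speculation level and is demoted when it causes a memory order violation, so future instances of that load speculate less.

```python
class ColorSets:
    """Toy classifier: load PC -> speculation level ("color")."""
    def __init__(self, levels=3):
        self.levels = levels
        self.color = {}                  # load PC -> current level

    def level(self, pc):
        return self.color.get(pc, self.levels - 1)   # default: aggressive

    def on_violation(self, pc):
        self.color[pc] = max(0, self.level(pc) - 1)  # demote one level

    def may_speculate(self, pc, unresolved_store_level):
        # A load may bypass an unresolved store only if its own color
        # outranks the store's speculation level.
        return self.level(pc) > unresolved_store_level

cs = ColorSets()
print(cs.may_speculate("ld@0x40", 1))  # True: default aggressive level 2
cs.on_violation("ld@0x40")
print(cs.may_speculate("ld@0x40", 1))  # False: demoted to level 1
```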
european conference on parallel processing | 2001
Soner Önder; Rajiv Gupta
While the central window implementation in a superscalar processor is an effective approach to waking up ready instructions, this implementation does not scale to large instruction window sizes. We propose a new wake-up algorithm that dynamically associates explicit wake-up lists with executing instructions according to the dependences between instructions. Instead of repeatedly examining a waiting instruction for wake-up until it can be issued, this algorithm identifies and considers for wake-up a fresh subset of waiting instructions from the instruction window in each cycle. The direct wake-up microarchitecture (DWMA) that we present is able to achieve approximately 80%, 75%, and 63% of the performance of a central window processor at high issue widths of 8, 16, and 32, respectively.
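The explicit wake-up list idea can be modeled minimally (data structures and names are assumptions): each producer instruction records the list of its dependents, and its completion wakes exactly those instructions, instead of every waiting instruction being matched against every result in a central window each cycle.

```python
from collections import defaultdict

class DirectWakeup:
    def __init__(self):
        self.wakeup_list = defaultdict(list)  # producer -> dependents
        self.missing = {}                     # consumer -> #unready sources

    def dispatch(self, name, sources, ready):
        """Register an instruction; sources are producer names it needs.
        Returns True if it is issuable immediately."""
        pending = [s for s in sources if s not in ready]
        self.missing[name] = len(pending)
        for s in pending:
            self.wakeup_list[s].append(name)  # explicit wake-up list
        return self.missing[name] == 0

    def complete(self, name):
        """Wake only the recorded dependents; return the newly ready ones."""
        woken = []
        for dep in self.wakeup_list.pop(name, []):
            self.missing[dep] -= 1
            if self.missing[dep] == 0:
                woken.append(dep)
        return woken

dw = DirectWakeup()
done = {"i0"}
print(dw.dispatch("i1", ["i0"], done))        # True: all sources ready
print(dw.dispatch("i2", ["i1", "i0"], done))  # False: waits on i1
print(dw.complete("i1"))                      # ['i2']
```

Only `i2` is examined when `i1` completes; nothing else in the window is touched, which is the scalability argument.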
IFIP World Computer Congress, TC 2 | 2004
Jeff Bastian; Soner Önder
Designing, testing, and producing a new computer processor is a complex and very expensive process. To reduce costly mistakes in hardware, the microarchitecture is usually designed and tested with the aid of a software simulator. The FAST system enables microarchitects to develop architecture simulators rapidly and is less error-prone than using a high-level language such as C. In this paper, we describe how the FAST system's Architecture Description Language (ADL) has been extended to facilitate the description of complex instruction sets such as Intel's IA-32 instruction set architecture. In this respect, we demonstrate that the notion of inheritance, a key concept in object-oriented programming languages, can be extended to selective inheritance to enable the specification of complex instruction set architectures in architecture description languages.
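The benefit of inheritance for instruction-set description can be shown by analogy in Python (this is not ADL syntax; the class and field names are invented): a variant instruction inherits the common decode behavior of its base and selectively overrides only the fields that differ, which keeps large CISC-style instruction families compact to describe.

```python
class Instruction:
    """Base description shared by a whole instruction family."""
    operand_size = 4                      # bytes, default operand width

    def decode(self):
        return {"op": type(self).__name__.lower(),
                "size": self.operand_size}

class Add(Instruction):
    pass                                  # inherits everything unchanged

class Add16(Add):
    operand_size = 2                      # selectively override one field

print(Add().decode())     # {'op': 'add', 'size': 4}
print(Add16().decode())   # {'op': 'add16', 'size': 2}
```

Without selective overriding, each of the many IA-32 operand-size and addressing-mode variants would need a full, mostly duplicated description.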
international conference on parallel architectures and compilation techniques | 1999
Soner Önder; Jun Xu; Rajiv Gupta
A sequence of branch instructions in the dynamic instruction stream forms a branch sequence if at most one non-branch instruction separates each consecutive pair of branches in the sequence. We propose a branch prediction scheme in which branch sequence history is explicitly maintained to identify frequently encountered branch sequences at runtime; when the first branch in a sequence is encountered, the outcomes of all of the branches in the sequence are predicted. We have designed an implementation of a branch sequence predictor which provides overall misprediction rates that are comparable with those of the gshare single-branch predictor. Using this branch sequence predictor, we have devised a novel instruction fetch mechanism. By saving the instructions following the first branch of a branch sequence in a sequence table, the proposed mechanism eliminates fetches of nonconsecutive instruction cache lines containing these instructions, and the delays associated with fetching them are therefore avoided. Experiments comparing the proposed fetch mechanism with a simple fetch mechanism based upon single-branch prediction on the Spec95 benchmarks demonstrate that the total number of I-cache lines fetched during execution decreases by as much as 15%, the number of useful instructions per fetched cache line increases by as much as 18%, and the overall IPC achieved on a superscalar processor increases by as much as 17% for some benchmarks.
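A toy version of sequence-level prediction (table organization and names are assumptions, and training is simplified to a single `record` call): once a sequence has been seen, fetching its first branch yields predicted outcomes for every branch in the sequence in one lookup.

```python
class SequencePredictor:
    """Maps the PC of a sequence's first branch to recorded outcomes."""
    def __init__(self):
        self.table = {}   # first-branch PC -> list of branch outcomes

    def record(self, first_pc, outcomes):
        """Train: remember the outcomes observed for this sequence."""
        self.table[first_pc] = list(outcomes)

    def predict(self, first_pc):
        # One lookup predicts the entire sequence, or returns None if
        # the sequence has not been seen yet.
        return self.table.get(first_pc)

sp = SequencePredictor()
sp.record(0x400, [True, False, True])   # taken, not-taken, taken
print(sp.predict(0x400))                # [True, False, True]
print(sp.predict(0x500))                # None
```

Predicting all outcomes at once is what allows the fetch mechanism to pull the sequence's instructions from a sequence table instead of fetching several nonconsecutive cache lines.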