Joseph J. Sharkey
Binghamton University
Publications
Featured research published by Joseph J. Sharkey.
PACS'04: Proceedings of the 4th International Conference on Power-Aware Computer Systems | 2004
Joseph J. Sharkey; Dmitry Ponomarev; Kanad Ghose; Oguz Ergin
Dynamic instruction scheduling logic is one of the most critical components of modern superscalar microprocessors, both from the delay and power dissipation standpoints. The delay and energy requirement of driving the result tags across the associatively-addressed issue queue accounts for a significant percentage of the scheduler's overhead and also limits the design scalability. We propose two schemes to reduce the power consumption and the delays of the wakeup logic. Our first scheme, instruction packing, shares the associative part of an issue queue entry between two instructions, each with at most one non-ready source. As a result, the number of entries in the issue queue (and, hence, the length of the tag buses) can be reduced by a factor of two with almost no impact on the IPCs, because most instructions either enter the pipeline with at least one of their source operands ready or do not use two source registers to begin with. Our second scheme, tag memoization, avoids driving the upper portion of a tag if those bits did not change from what was driven on the same tag bus during the most recent broadcast. While instruction packing reduces the length of the tag buses, tag memoization reduces the number of tag lines that need to be driven. We evaluate our designs using detailed microarchitectural simulations of the SPEC 2000 benchmarks and SPICE simulations of the issue queue layouts.
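The pairing step can be pictured with a small, hedged sketch (Python, not the authors' hardware): a queue whose entries each carry a single shared tag comparator and hold up to two instructions, each waiting on at most one source tag. All class and field names here are illustrative assumptions; an instruction that arrives with no non-ready source would bypass the wakeup path entirely in a real design.

from dataclasses import dataclass, field
from typing import Optional, List

@dataclass
class HalfEntry:
    dest_tag: int                  # tag this instruction broadcasts when it completes
    waiting_tag: Optional[int]     # the single non-ready source tag (None if already ready)

@dataclass
class PackedEntry:
    halves: List[HalfEntry] = field(default_factory=list)   # up to two packed instructions

class PackedIssueQueue:
    def __init__(self, num_entries: int):
        self.entries = [PackedEntry() for _ in range(num_entries)]

    def dispatch(self, instr: HalfEntry) -> bool:
        # Prefer sharing an entry that already holds one instruction, so the
        # associative slot (and its tag-bus load) is amortized over two.
        for e in self.entries:
            if len(e.halves) == 1:
                e.halves.append(instr)
                return True
        for e in self.entries:
            if not e.halves:
                e.halves.append(instr)
                return True
        return False                                  # queue full: dispatch stalls

    def wakeup(self, broadcast_tag: int) -> List[HalfEntry]:
        # A single tag comparator per entry serves both packed halves.
        woken = []
        for e in self.entries:
            remaining = []
            for h in e.halves:
                if h.waiting_tag == broadcast_tag:
                    woken.append(h)                   # last operand arrived: ready to issue
                else:
                    remaining.append(h)
            e.halves = remaining
        return woken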
International Conference on Parallel Architectures and Compilation Techniques | 2006
Joseph J. Sharkey; Deniz Balkan; Dmitry Ponomarev
In SMT processors, the complex interplay between private and shared datapath resources must be considered in order to realize the full performance potential. In this paper, we show that blindly increasing the size of the per-thread reorder buffers to provide a larger number of in-flight instructions does not yield the expected performance gains but instead degrades the instruction throughput for virtually all multithreaded workloads. The reason for this performance loss is excessive pressure on the shared datapath resources, especially the instruction scheduling logic. We propose intelligent mechanisms for dynamically adapting the number of reorder buffer entries allocated to each thread, avoiding allocations that would detrimentally impact the scheduler. We achieve this goal by categorizing program execution into issue-bound and commit-bound phases and performing buffer allocations only for threads operating in commit-bound phases. Our adaptive technique achieves improvements of 21% in instruction throughput and 10% in the fairness metric compared to the best-performing baseline configuration with static ROBs.
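As a rough illustration of the phase-gated allocation policy, the sketch below classifies a thread's sampling interval as issue-bound or commit-bound from a simple waiting-instruction ratio and hands the spare ROB entries only to commit-bound threads. The interval statistics, threshold, and function names are assumptions made for illustration, not the paper's exact heuristic.

def classify_phase(issued_from_queue: int, dispatched: int, threshold: float = 0.5) -> str:
    """Label an interval 'issue-bound' if a large share of the thread's
    dispatched instructions are still sitting in the scheduler waiting to issue."""
    if dispatched == 0:
        return "commit-bound"
    waiting_ratio = 1.0 - issued_from_queue / dispatched
    return "issue-bound" if waiting_ratio > threshold else "commit-bound"

def rob_share(phases: dict, total_rob: int, base_share: int) -> dict:
    """Give every thread its base allocation; distribute the spare entries
    only among commit-bound threads, which can actually use them."""
    spare = total_rob - base_share * len(phases)
    commit_bound = [t for t, p in phases.items() if p == "commit-bound"]
    shares = {t: base_share for t in phases}
    for t in commit_bound:
        shares[t] += spare // max(len(commit_bound), 1)
    return shares

# Example: thread 0 stalls in the scheduler (issue-bound), thread 1 is
# commit-bound, so the spare ROB capacity goes to thread 1.
phases = {0: classify_phase(issued_from_queue=20, dispatched=100),
          1: classify_phase(issued_from_queue=95, dispatched=100)}
print(rob_share(phases, total_rob=256, base_share=64))   # {0: 64, 1: 192}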
High-Performance Computer Architecture | 2006
Joseph J. Sharkey; Dmitry Ponomarev
We propose dynamic scheduler designs that improve scheduler scalability and reduce its complexity in SMT processors. Our first design adapts the recently proposed instruction packing technique to SMT: it opportunistically packs two instructions (possibly from different threads), each with at most one non-ready source operand at the time of dispatch, into the same issue queue entry. Our second design, termed 2OP_BLOCK, takes these ideas one step further and completely avoids dispatching instructions with two non-ready source operands. This technique has several advantages. First, it reduces the scheduling complexity (and the associated delays) because the logic needed to support instructions with two non-ready source operands is eliminated. More surprisingly, 2OP_BLOCK simultaneously improves performance, because the same issue queue entry can be reallocated multiple times to instructions with at most one non-ready source (which usually spend fewer cycles in the queue), instead of being held by an instruction that enters the queue with two non-ready sources. For schedulers with the capacity to hold 64 instructions, the 2OP_BLOCK design outperforms the traditional queue by 11% on average and at the same time reduces the overall scheduling delay by 10%.
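A minimal sketch of the 2OP_BLOCK dispatch rule follows. It is a simplified variant that parks an instruction with two non-ready sources in a small side buffer rather than stalling the dispatch stage, and every structure and name in it is an assumption for illustration, not the paper's design.

from collections import deque

def try_dispatch(instr: dict, ready_regs: set, issue_queue: list,
                 iq_capacity: int, blocked: deque) -> None:
    non_ready = [s for s in instr["sources"] if s not in ready_regs]
    if len(non_ready) >= 2:
        blocked.append(instr)            # 2OP_BLOCK: keep it out of the scheduler for now
    elif len(issue_queue) < iq_capacity:
        issue_queue.append(instr)        # at most one operand left to wait for
    else:
        blocked.append(instr)            # queue full; retry on a later cycle

def drain_blocked(ready_regs: set, issue_queue: list, iq_capacity: int,
                  blocked: deque) -> None:
    # Re-examine the held-back instructions once per cycle, oldest first.
    for _ in range(len(blocked)):
        try_dispatch(blocked.popleft(), ready_regs, issue_queue, iq_capacity, blocked)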
IEEE Transactions on Very Large Scale Integration (VLSI) Systems | 2008
Deniz Balkan; Joseph J. Sharkey; Dmitry Ponomarev; Kanad Ghose
Much of the complexity in today's superscalar microprocessors stems from the need to maintain speculatively produced results in on-chip storage until those results can be safely discarded without endangering the reconstruction of the precise state or impeding recovery from possible branch misspeculations. For this, modern designs use large, heavily-ported physical register files (RFs) to increase instruction throughput. The high complexity and power dissipation of such RFs mainly stem from the need to retain each and every result for a large number of cycles after it is generated. We observed that a significant fraction (about 45%) of result values are delivered to their consumers via the bypass network (consumed "on the fly") and are never read out of the destination registers. In this paper, we first formulate conditions for identifying such transient values and describe their microarchitectural implementation; we then propose a technique that avoids writing such transient values back into the RF. With 64-entry integer and floating-point register files, our technique achieves an 11% performance improvement and a 29% reduction in RF energy consumption compared to a baseline machine with the same number of registers. Furthermore, for the same performance target, the selective writeback scheme results in a 38% reduction in RF energy consumption compared to the baseline machine.
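The notion of a transient value can be made concrete with a back-of-the-envelope sketch: a result is treated as transient when every consumer issues within the bypass window and the destination register is never read afterwards. The 3-cycle window and the trace format below are illustrative assumptions, not the paper's parameters.

BYPASS_WINDOW = 3   # cycles during which a result is assumed available on the bypass network

def is_transient(produce_cycle: int, consumer_issue_cycles: list,
                 later_read_from_rf: bool) -> bool:
    all_bypassed = all(0 <= c - produce_cycle <= BYPASS_WINDOW
                       for c in consumer_issue_cycles)
    return all_bypassed and not later_read_from_rf

def count_skipped_writebacks(results: list) -> int:
    # Each result: (produce_cycle, [consumer issue cycles], read_from_rf_later)
    return sum(is_transient(*r) for r in results)

trace = [(10, [11, 12], False),    # all consumers bypassed -> writeback can be skipped
         (20, [26], False),        # a consumer issues too late, must read the register file
         (30, [31], True)]         # value also read later (e.g. after a misprediction)
print(count_skipped_writebacks(trace))   # 1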
International Conference on Parallel Processing | 2006
Deniz Balkan; Joseph J. Sharkey; Dmitry Ponomarev; Aneesh Aggarwal
We propose a series of aggressive register deallocation mechanisms to reduce register file pressure and increase the parallelism exploited by superscalar microprocessors. Our techniques are based on a key observation: a register value can be temporarily decoupled from its register identifier. Specifically, even if a physical register is deallocated, the value is still available in the register and can be read by dependent instructions until the register is overwritten. In these situations, we can effectively overlap the consumption of the produced register value with the partial processing of the instruction to which the same register is reassigned. In this paper, we propose several realizations of this address-value decoupling idea and discuss their performance implications. Our most aggressive scheme achieves an average IPC speedup of 14.6% across the simulated SPEC 2000 benchmarks.
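A tiny model of the timing condition behind address-value decoupling is sketched below: after a register is deallocated, a consumer's read of the old value is still correct as long as it completes before the re-assigned instruction writes the register. The event-timing abstraction and function name are assumptions made for illustration, not the paper's hardware.

def read_is_safe(dealloc_cycle: int, consumer_read_cycle: int,
                 new_producer_writeback_cycle: int) -> bool:
    """The old value may still be read after deallocation, as long as the read
    happens before the re-assigned instruction overwrites the register."""
    return dealloc_cycle <= consumer_read_cycle < new_producer_writeback_cycle

# The register is freed at cycle 100, re-allocated, and overwritten at cycle 107.
print(read_is_safe(100, 104, 107))   # True: value still in place, the late read succeeds
print(read_is_safe(100, 109, 107))   # False: the value has already been overwritten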
International Conference on Parallel Architectures and Compilation Techniques | 2006
Deniz Balkan; Joseph J. Sharkey; Dmitry Ponomarev; Kanad Ghose
High-performance microprocessors use large, heavily-ported physical register files (RFs) to increase instruction throughput. The high complexity and power dissipation of such RFs mainly stem from the need to retain each and every result for a large number of cycles after it is generated. We observed that a significant fraction (about 45%) of result values are never read from the register file and are not required to recover from branch mispredictions. In this paper, we propose SPARTAN, a set of microarchitectural extensions that predicts such transient values and in many cases completely avoids allocating physical registers to them. We show that transient values can be predicted as such with more than 97% accuracy, on average, across the simulated SPEC 2000 benchmarks. We evaluate the performance of SPARTAN on a variety of configurations and show that significant improvements in performance and energy efficiency can be realized. Furthermore, we directly compare SPARTAN against a number of previously proposed register optimization schemes and show that our technique significantly outperforms them.
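One plausible way to picture such a predictor, under assumptions not taken from the paper, is a small PC-indexed table of saturating counters that learns which static instructions tend to produce transient results; the table size, counter width, and reset-on-mispredict policy below are all illustrative choices.

TABLE_SIZE = 1024
CONF_THRESHOLD = 3        # predict "transient" only at the top of a 2-bit counter

table = [0] * TABLE_SIZE  # 2-bit saturating counters, one per (hashed) PC

def predict_transient(pc: int) -> bool:
    return table[pc % TABLE_SIZE] >= CONF_THRESHOLD

def train(pc: int, was_transient: bool) -> None:
    idx = pc % TABLE_SIZE
    if was_transient:
        table[idx] = min(table[idx] + 1, 3)
    else:
        table[idx] = 0    # reset on a misprediction to stay conservative

A register allocation is skipped only when predict_transient(pc) returns True; mispredicted cases would need the fallback handling the abstract alludes to.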
ACM Transactions on Architecture and Code Optimization | 2006
Joseph J. Sharkey; Dmitry Ponomarev; Kanad Ghose; Oguz Ergin
Traditional dynamic scheduler designs use one issue queue entry per instruction, regardless of the actual number of operands actively involved in the wakeup process. We propose Instruction Packing, a novel microarchitectural technique that reduces both the delay and the power consumption of the issue queue by sharing the associative part of an issue queue entry between two instructions, each with at most one non-ready register source operand at the time of dispatch. Our results show that this technique yields a 40% reduction in issue queue power and a 14% reduction in scheduling delay with negligible IPC degradation.
International Conference on Computer Design | 2005
Joseph J. Sharkey; Kanad Ghose; Dmitry Ponomarev; Oguz Ergin
The dynamic instruction scheduling logic is one of the most critical components of modern superscalar microprocessors, both from the delay and power dissipation standpoints. The delay and energy requirement of driving the wakeup tags across the associatively-addressed issue queue accounts for a significant percentage of the scheduler's overhead and also limits the design scalability. We propose tag memoization and tagline folding, two schemes that reduce the power of wakeup tag broadcasts by reducing the number of tag bits driven in each broadcast. Our results show that the combination of these mechanisms provides a 22.3% average reduction in wakeup tag broadcast power with no impact on the IPC.
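Tag memoization itself is easy to sketch: each tag bus latches the upper bits of its last broadcast and drives them again only when they change. The field split and class below are assumptions chosen for illustration, not the circuit described in the paper.

UPPER_SHIFT = 3           # assumed split of a tag into upper and lower fields

class TagBus:
    def __init__(self):
        self.last_upper = None        # upper bits latched from the previous broadcast

    def broadcast(self, tag: int) -> dict:
        upper, lower = tag >> UPPER_SHIFT, tag & ((1 << UPPER_SHIFT) - 1)
        drive_upper = upper != self.last_upper     # memoization: skip if unchanged
        self.last_upper = upper
        return {"lower_bits": lower, "upper_bits": upper if drive_upper else None}

bus = TagBus()
print(bus.broadcast(0b1011_010))   # upper bits driven (first broadcast on this bus)
print(bus.broadcast(0b1011_110))   # same upper bits: only the lower bits are driven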
International Conference on Supercomputing | 2007
Joseph J. Sharkey; Dmitry Ponomarev
The register file is one of the most critical datapath components limiting the number of threads that can be supported on a Simultaneous Multithreading (SMT) processor. To allow the use of smaller register files without degrading performance, techniques that maximize register usage efficiency through aggressive register allocation/deallocation can be considered. In this paper, we propose a novel technique for the early deallocation of physical registers allocated to threads that experience L2 cache misses. This is accomplished by speculatively committing load-independent instructions and deallocating the registers corresponding to the previous mappings of their destinations, without waiting for the cache miss to be serviced. The early-deallocated registers are immediately made available for allocation to instructions within the same thread as well as within other threads, improving overall processor throughput. On average across the simulated mixes of multiprogrammed SPEC 2000 workloads, our technique results in a 33% improvement in throughput and a 25% improvement in the harmonic mean of weighted IPCs over a baseline SMT with the state-of-the-art DCRA policy. This is achieved without creating checkpoints, maintaining per-register counters of pending consumers, performing tag re-broadcasts or register re-mappings, or adding associative searches.
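The early-deallocation step can be approximated with the following hedged sketch, which walks the ROB in order while a load waits on its L2 miss, frees the previous physical mapping of each load-independent destination, and conservatively stops at the first miss-dependent instruction; the data structures are simplified assumptions, not the paper's mechanism.

def early_deallocate(rob: list, load_dependent: set, free_list: list) -> None:
    """rob: in-order list of dicts with 'id' and 'prev_phys' (the physical
    register that this instruction's destination previously mapped to)."""
    for entry in rob:
        if entry["id"] in load_dependent:
            break                        # conservative: stop at the first miss-dependent instruction
        if entry.get("prev_phys") is not None and not entry.get("early_committed"):
            free_list.append(entry["prev_phys"])   # the previous mapping is dead now
            entry["early_committed"] = True

free_regs = []
rob = [{"id": 1, "prev_phys": 12}, {"id": 2, "prev_phys": 7},
       {"id": 3, "prev_phys": 20}]
early_deallocate(rob, load_dependent={3}, free_list=free_regs)
print(free_regs)    # [12, 7]: registers freed without waiting for the miss to return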
European Conference on Parallel Processing | 2005
Joseph J. Sharkey; Dmitry Ponomarev
The dynamic instruction scheduling logic (the issue queue and the associated control logic) forms the core of an out-of-order microprocessor. Traditional scheduling mechanisms, based on tag broadcasts and associative tag matching logic within the issue queue, are limited by high power consumption, large access delay, and poor scalability. To address these inefficiencies, researchers have proposed various flavors of so-called wakeup-free scheduling logic. Such wakeup-free techniques remove the wakeup delay from the critical path but incur other forms of complexity, essentially stemming from the need to track the cycle in which each physical register will become ready and each instruction can be (speculatively) issued. We propose instruction recirculation, a wakeup-free instruction scheduler design that completely eliminates the counting and issue-time estimation logic inherent in previously proposed wakeup-free schedulers. This complexity reduction is also accompanied by a 3.6% IPC improvement over the state-of-the-art wakeup-free scheduler.
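As a rough sketch of a recirculating, wakeup-free scheduler, the loop below issues instructions without any tag broadcast and simply re-inserts those whose operands turn out not to be ready yet; the queue discipline, instruction format, and policy are assumptions for illustration only.

from collections import deque

def schedule_cycle(queue: deque, ready_regs: set, issue_width: int) -> list:
    issued, recirculated = [], []
    for _ in range(min(issue_width, len(queue))):
        instr = queue.popleft()
        if all(src in ready_regs for src in instr["sources"]):
            issued.append(instr)            # operands ready: proceed to execute
        else:
            recirculated.append(instr)      # not ready yet: recirculate, no wakeup CAM needed
    queue.extend(recirculated)              # back into the queue for a later cycle
    return issued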