
Publication


Featured research published by Srikanth T. Srinivasan.


International Symposium on Microarchitecture | 2003

Checkpoint processing and recovery: towards scalable large instruction window processors

Haitham Akkary; Ravi Rajwar; Srikanth T. Srinivasan

Large instruction window processors achieve high performance by exposing large amounts of instruction level parallelism. However, accessing the large hardware structures typically required to buffer and process such instruction window sizes significantly degrades the cycle time. This paper proposes a checkpoint processing and recovery (CPR) microarchitecture, and shows how to implement a large instruction window processor without requiring large structures, thus permitting a high clock frequency. We focus on four critical aspects of a microarchitecture: 1) scheduling instructions; 2) recovering from branch mispredicts; 3) buffering a large number of stores and forwarding data from stores to any dependent load; and 4) reclaiming physical registers. While scheduling window size is important, we show the performance of large instruction windows to be more sensitive to the other three design issues. Our CPR proposal incorporates novel microarchitectural schemes for addressing these design issues: a selective checkpoint mechanism for recovering from mispredicts, a hierarchical store queue organization for fast store-load forwarding, and an effective algorithm for aggressive physical register reclamation. Our proposals allow a processor to realize performance gains due to instruction windows of thousands of instructions without requiring large cycle-critical hardware structures.
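The selective-checkpoint recovery described above can be sketched as a toy cost model. The checkpoint interval and indices below are invented for illustration; the paper's mechanism actually checkpoints at selected (e.g. low-confidence) branches rather than at fixed intervals.

```python
# Toy model of CPR-style recovery: a mispredict rolls back to the nearest
# earlier checkpoint, re-executing the instructions fetched after it.

def recovery_cost(checkpoints, mispredict_at):
    """Instructions re-executed when rolling back to the nearest
    checkpoint at or before the mispredicted instruction index."""
    nearest = max((c for c in checkpoints if c <= mispredict_at), default=0)
    return mispredict_at - nearest

# a 1024-instruction window with a checkpoint every 64 instructions
ckpts = range(0, 1024, 64)
print(recovery_cost(ckpts, 200))   # rolls back to checkpoint 192 -> 8
```

Fewer checkpoints mean smaller cycle-critical structures but a longer re-execution on a mispredict; the paper's selective placement aims at that trade-off.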


Architectural Support for Programming Languages and Operating Systems | 2004

Continual flow pipelines

Srikanth T. Srinivasan; Ravi Rajwar; Haitham Akkary; Amit Gandhi; Michael D. Upton

Increased integration in the form of multiple processor cores on a single die, relatively constant die sizes, shrinking power envelopes, and emerging applications create a new challenge for processor architects: how to build a processor that provides high single-thread performance, enables many such cores to be placed on the same die for high throughput, and adapts dynamically to future applications. Conventional approaches to high single-thread performance rely on large and complex cores to sustain a large instruction window for memory tolerance, making them unsuitable for multi-core chips. We present Continual Flow Pipelines (CFP), a new non-blocking processor pipeline architecture that achieves the performance of a large instruction window without requiring cycle-critical structures such as the scheduler and register file to be large. We show that to achieve the benefits of a large instruction window, inefficiencies in the management of both the scheduler and register file must be addressed, and we propose a unified solution. The non-blocking property of CFP keeps small the key processor structures affecting cycle time and power (scheduler, register file) and die size (second-level cache). The memory latency-tolerant CFP core allows multiple cores on a single die while outperforming current processor cores on single-thread applications.
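The non-blocking drain-and-reinject idea can be illustrated with a minimal sketch (the class and method names here are hypothetical, not the paper's design): instructions dependent on a long-latency cache miss release their scheduler slot and wait in a slice buffer until the miss returns.

```python
from collections import deque

# Minimal sketch of a continual-flow scheduler: miss-dependent instructions
# are drained into a slice buffer so the small scheduler never fills up
# waiting on memory, then re-injected when the load data arrives.

class TinyScheduler:
    def __init__(self, slots):
        self.slots = slots              # small, cycle-critical capacity
        self.ready = deque()            # instructions waiting to execute
        self.slice_buffer = deque()     # deferred miss-dependent slice

    def issue(self, insn, depends_on_miss):
        if depends_on_miss:
            self.slice_buffer.append(insn)   # drained: frees a scheduler slot
        elif len(self.ready) < self.slots:
            self.ready.append(insn)

    def miss_returns(self):
        # re-inject the deferred slice for execution
        while self.slice_buffer:
            self.ready.append(self.slice_buffer.popleft())

s = TinyScheduler(slots=32)
s.issue("add r1, r2", depends_on_miss=False)
s.issue("use r3 (load miss)", depends_on_miss=True)
s.miss_returns()
print(len(s.ready), len(s.slice_buffer))   # -> 2 0
```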


International Symposium on Low Power Electronics and Design | 2007

Impact of die-to-die and within-die parameter variations on the throughput distribution of multi-core processors

Keith A. Bowman; Alaa R. Alameldeen; Srikanth T. Srinivasan; Chris Wilkerson

A statistical performance simulator is developed to explore the impact of die-to-die (D2D) and within-die (WID) parameter variations on the distributions of maximum clock frequency (FMAX) and throughput for multi-core processors in a future 22nm technology. The simulator integrates a compact analytical throughput model, which captures the key dependencies of multi-core processors, into a statistical simulation framework that models the effects of D2D and WID parameter variations on critical path delays across a die. The salient contributions from this paper are: (1) product-level variation analysis for multi-core processors must focus on throughput, rather than just FMAX, and (2) multi-core processors are inherently more variation tolerant than single-core processors due to the larger impact of memory latency and bandwidth on overall throughput. To elucidate these two points, simulations show that multi-core and single-core processors have similar chip-level FMAX distributions (mean degradation of 9% and standard deviation of 5%) for multi-threaded applications. In contrast to single-core processors, memory latency and bandwidth constraints significantly limit the throughput dependency on FMAX in multi-core processors, thus reducing the throughput mean degradation and standard deviation by 50%. Since single-threaded applications running on a multi-core processor can execute on the fastest core, mean FMAX and throughput gains of 4% are achieved relative to the nominal design target.
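The qualitative claim, that a memory-bandwidth cap compresses the throughput distribution relative to the FMAX distribution, can be reproduced with a toy Monte Carlo model. All parameters and functional forms below are invented for illustration; they are not the paper's calibrated 22nm models.

```python
import random
import statistics

random.seed(0)

def sample_chip_fmax(n_cores, d2d_sigma=0.05, wid_sigma=0.03):
    d2d = random.gauss(1.0, d2d_sigma)                  # shared die-to-die shift
    delays = [d2d * (1.0 + abs(random.gauss(0.0, wid_sigma)))
              for _ in range(n_cores)]                  # per-core within-die spread
    return 1.0 / max(delays)                            # slowest core sets chip FMAX

def throughput(fmax, mem_bound=0.95):
    # once frequency outruns what memory bandwidth can feed,
    # extra FMAX adds no throughput, compressing the distribution
    return min(fmax, mem_bound)

fmax_samples = [sample_chip_fmax(8) for _ in range(20000)]
tput_samples = [throughput(f) for f in fmax_samples]
print(statistics.pstdev(tput_samples) < statistics.pstdev(fmax_samples))  # -> True
```

The clipping here is a crude stand-in for the paper's analytical throughput model, but it shows the same mechanism: throughput spread shrinks relative to FMAX spread when memory limits bite.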


IEEE Transactions on Very Large Scale Integration (VLSI) Systems | 2009

Impact of Die-to-Die and Within-Die Parameter Variations on the Clock Frequency and Throughput of Multi-Core Processors

Keith A. Bowman; Alaa R. Alameldeen; Srikanth T. Srinivasan; Chris Wilkerson

A statistical performance simulator is developed to explore the impact of parameter variations on the maximum clock frequency (FMAX) and throughput distributions of multi-core processors in a future 22 nm technology. The simulator captures the effects of die-to-die (D2D) and within-die (WID) transistor and interconnect parameter variations on critical path delays in a die. A key component of the simulator is an analytical multi-core processor throughput model, which enables computationally efficient and accurate throughput calculations, as compared with cycle-accurate performance simulators, for single-threaded and highly parallel multi-threaded (MT) workloads. Based on microarchitecture designs from previous microprocessors, three multi-core processors with either small, medium, or large cores are projected for the 22 nm technology generation to investigate a range of design options. These three multi-core processors are optimized for maximum throughput within a constant die area. A traditional single-core processor is also scaled to the 22 nm technology to provide a baseline comparison. The salient contributions from this paper are: 1) product-level variation analysis for multi-core processors must focus on throughput, rather than just FMAX, and 2) multi-core processors are more variation tolerant than single-core processors due to the larger impact of memory latency and bandwidth on throughput. To elucidate these two points, statistical simulations indicate that multi-core and single-core processors with an equivalent total core area have similar FMAX distributions (mean degradation of 9% and standard deviation of 5%) for MT applications. In contrast to single-core processors, memory latency and bandwidth constraints significantly limit the throughput dependency on FMAX in multi-core processors, thus reducing the throughput mean degradation and standard deviation by ~50% for the small and medium core designs and by ~30% for the large core design. This improvement in the throughput distribution indicates that multi-core processors could significantly reduce the product design and process development complexities due to parameter variations as compared to single-core processors, enabling faster time to market for high-performance microprocessor products.


High-Performance Computer Architecture | 2004

Reducing branch misprediction penalty via selective branch recovery

Amit Gandhi; Haitham Akkary; Srikanth T. Srinivasan

Branch misprediction penalty consists of two components: the time wasted on misspeculative execution until the mispredicted branch is resolved, and the time to restart the pipeline with useful instructions once the branch is resolved. Current processor trends, large instruction windows and deep pipelines, amplify both components of the branch misprediction penalty. We propose a novel method, called selective branch recovery (SBR), to reduce both components of branch misprediction penalty. SBR exploits a frequently occurring type of control independence, exact convergence, where the mispredicted path converges exactly at the beginning of the correct path. In such cases, SBR selectively reuses the results computed during misspeculative execution and obviates the need to fetch or rename convergent instructions again. Thus, SBR addresses both components of branch misprediction penalty. To increase the likelihood of branch mispredictions that can be handled with SBR, we also present an effective means for inducing exact convergence on misspeculative paths. With SBR, we significantly improve performance (between 3% and 22%, 8% on average) on a wide range of benchmarks over our baseline processor that does not exploit SBR.
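The exact-convergence reuse described above can be sketched as follows. The instruction traces and the matching rule are illustrative assumptions, not the paper's hardware mechanism: if the wrong path reached the entry point of the correct path, the overlapping convergent instructions need not be refetched or renamed.

```python
# Toy sketch of SBR-style reuse: count how many correct-path instructions
# must still be refetched after a mispredict, given what the wrong path
# already fetched and executed.

def sbr_refetch(wrong_path, correct_path):
    for i in range(len(wrong_path)):
        tail = wrong_path[i:]
        # exact convergence: the wrong path fell into the correct path's start
        if correct_path[:len(tail)] == tail:
            return len(correct_path) - len(tail)   # reuse the overlap
    return len(correct_path)                       # no convergence: full refetch

wrong   = ["w1", "w2", "c1", "c2"]        # c1, c2 executed on the wrong path
correct = ["c1", "c2", "c3", "c4"]
print(sbr_refetch(wrong, correct))        # -> 2 : only c3, c4 are refetched
```

Both penalty components shrink: the wrong-path work on c1 and c2 is no longer wasted, and the restart fetches fewer instructions.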


ACM Transactions on Architecture and Code Optimization | 2004

An analysis of a resource efficient checkpoint architecture

Haitham Akkary; Ravi Rajwar; Srikanth T. Srinivasan

Large instruction window processors achieve high performance by exposing large amounts of instruction level parallelism. However, accessing the large hardware structures typically required to buffer and process such instruction window sizes significantly degrades the cycle time. This paper proposes a novel checkpoint processing and recovery (CPR) microarchitecture, and shows how to implement a large instruction window processor without requiring large structures, thus permitting a high clock frequency. We focus on four critical aspects of a microarchitecture: (1) scheduling instructions, (2) recovering from branch mispredicts, (3) buffering a large number of stores and forwarding data from stores to any dependent load, and (4) reclaiming physical registers. While scheduling window size is important, we show the performance of large instruction windows to be more sensitive to the other three design issues. Our CPR proposal incorporates novel microarchitectural schemes for addressing these design issues: a selective checkpoint mechanism for recovering from mispredicts, a hierarchical store queue organization for fast store-load forwarding, and an effective algorithm for aggressive physical register reclamation. Our proposals allow a processor to realize performance gains due to instruction windows of thousands of instructions without requiring large cycle-critical hardware structures.


International Symposium on Microarchitecture | 2004

Continual flow pipelines: achieving resource-efficient latency tolerance

Srikanth T. Srinivasan; Ravi Rajwar; Haitham Akkary; Amit Gandhi; Michael D. Upton

With the natural trend toward integration, microprocessors are increasingly supporting multiple cores on a single chip. To keep design effort and costs down, designers of these multicore microprocessors frequently target an entire product range, from mobile laptops to high-end servers. This article discusses a continual flow pipeline (CFP) processor. Such a processor architecture can sustain a large number of in-flight instructions (commonly referred to as the instruction window and comprising all instructions renamed but not retired) without requiring the cycle-critical structures to scale up. By keeping these structures small and making the processor core tolerant of memory latencies, a CFP mechanism enables the new core to achieve high single-thread performance, and many of these new cores can be placed on a chip for high throughput. The resulting large instruction window reveals substantial instruction-level parallelism and achieves memory latency tolerance, while the small size of cycle-critical resources permits a high clock frequency.


International Symposium on Microarchitecture | 2003

Checkpoint processing and recovery: an efficient, scalable alternative to reorder buffers

Haitham Akkary; Ravi Rajwar; Srikanth T. Srinivasan

Processors require a combination of large instruction windows and high clock frequency to achieve high performance. Traditional processors use reorder buffers, but these structures do not scale efficiently as window size increases. A new technique, checkpoint processing and recovery, offers an efficient means of increasing the instruction window size without requiring large, cycle-critical structures, and provides a promising microarchitecture for future high-performance processors.


International Conference on Computer Design | 2004

A minimal dual-core speculative multi-threading architecture

Srikanth T. Srinivasan; Haitham Akkary; Tom Holman; Konrad K. Lai

Speculative multi-threading (SpMT) can improve single-threaded application performance using the multiple thread contexts available in current processors. We propose a minimal SpMT model that uses only two thread contexts. The model achieves significant speedups for single-threaded applications using a low-overhead scheme for detecting and selectively recovering from data dependence violations, and a novel wrong path predictor to reduce the number of speculative threads executing along the wrong path. We also study the interactions between three previously proposed SpMT thread spawning policies that can be implemented dynamically in hardware (Fork on Call, Loop Continuation, and Run Ahead) and show that it is beneficial to implement all three policies together in a processor. While the individual thread spawning policies show performance benefits of 14%, 5%, and 4% respectively on our SpMT model over a base processor that does not exploit SpMT, combining all three policies shows an average performance gain of 20%. Finally, we identify the sources of SpMT benefit: on average, 58% of the performance benefit comes from cache prefetching, 33% from instruction reuse, and 9% from branch precomputation. We show that all three sources must be exploited to realize the full potential of SpMT.
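The dependence-violation detection mentioned above can be sketched with a simple read-set check (the class and addresses here are invented for illustration, not the paper's hardware): the speculative thread logs the addresses it loads, and a store by the non-speculative thread to a logged address signals a violation requiring selective recovery.

```python
# Toy sketch of data-dependence violation detection in a two-context
# SpMT model: the speculative thread tracks its load addresses; a store
# to any tracked address by the committing thread is a violation.

class SpecThread:
    def __init__(self):
        self.read_set = set()      # addresses speculatively loaded

    def spec_load(self, addr):
        self.read_set.add(addr)

def violates(spec, store_addr):
    """True if a non-speculative store hits the speculative read set."""
    return store_addr in spec.read_set

t = SpecThread()
t.spec_load(0x1000)
print(violates(t, 0x1000), violates(t, 0x2000))   # -> True False
```

A violation need not squash the entire speculative thread; the paper's selective recovery re-executes only the affected work, which is what keeps the scheme low-overhead.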


International Conference on Supercomputing | 2003

Recycling waste: exploiting wrong-path execution to improve branch prediction

Haitham Akkary; Srikanth T. Srinivasan; Konrad K. Lai

Despite continuous improvement in branch prediction algorithms, branch misprediction remains a major limitation on microprocessor performance. As pipelines become wider and deeper, branch prediction will become even more crucial. This paper taps into a currently wasted resource, wrong-path execution, to help improve branch prediction. Due to control independence, the outcomes of branches executed along the wrong path often match their outcomes on the correct path. Current branch prediction methods rely on correlation between branches on the correct path, leaving potentially useful wrong-path branch information unexploited. We present a new, simple, and effective method that extends branch prediction to allow the recycling of wrong-path branch outcomes at the fetch stage. Simulations of deeply pipelined processors using a selected set of SpecInt 2000 and other benchmarks, with more than 5 branch mispredictions per thousand micro-operations, show that the branch misprediction rate can be reduced by up to 30%. Depending on the pipeline depth, the corresponding average performance improvement varies from 5% to 20%.
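The recycling idea can be sketched with a toy policy (the one-shot override below is a simplifying assumption for illustration, not the paper's exact mechanism): branch outcomes resolved on the wrong path are saved by PC, and when the same branch is fetched again on the correct path, the recycled outcome overrides the direction predictor.

```python
# Toy sketch of wrong-path outcome recycling: control independence means a
# branch resolved on the wrong path often has the same outcome when the
# correct path reaches it, so the saved outcome beats the predictor's guess.

recycled = {}                  # PC -> outcome resolved on the wrong path

def record_wrong_path(pc, taken):
    recycled[pc] = taken

def predict(pc, predictor_guess):
    # a recycled outcome wins once, then the normal predictor resumes
    return recycled.pop(pc, predictor_guess)

record_wrong_path(0x400, True)
print(predict(0x400, False))   # recycled wrong-path outcome -> True
print(predict(0x400, False))   # consumed; fall back to the predictor -> False
```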
