Santosh G. Abraham | Researchain

Publication

Featured researches published by Santosh G. Abraham.

international symposium on computer architecture | 2004

Microarchitecture Optimizations for Exploiting Memory-Level Parallelism

Yuan Chou; Brian M. Fahs; Santosh G. Abraham

The performance of memory-bound commercial applications such as databases is limited by increasing memory latencies. In this paper, we show that exploiting memory-level parallelism (MLP) is an effective approach for improving the performance of these applications and that microarchitecture has a profound impact on achievable MLP. Using the epoch model of MLP, we reason how traditional microarchitecture features such as out-of-order issue and state-of-the-art microarchitecture techniques such as runahead execution affect MLP. Simulation results show that a moderately aggressive out-of-order issue processor improves MLP over an in-order issue processor by 12-30%, and that aggressive handling of loads, branches and serializing instructions is needed to attain the full benefits of large out-of-order instruction windows. The results also show that a processor's issue window and reorder buffer should be decoupled to exploit MLP more efficiently. In addition, we demonstrate that runahead execution is highly effective in enhancing MLP, potentially improving the MLP of the database workload by 82% and its overall performance by 60%. Finally, our limit study shows that there is considerable headroom in improving MLP and overall performance by implementing effective instruction prefetching, more accurate branch prediction and better value prediction in addition to runahead execution.

high-performance computer architecture | 2005

Chip multithreading: opportunities and challenges

Lawrence Spracklen; Santosh G. Abraham

Chip multi-threaded (CMT) processors provide support for many simultaneous hardware threads of execution in various ways, including simultaneous multithreading (SMT) and chip multiprocessing (CMP). CMT processors are especially suited to server workloads, which generally have high levels of thread-level parallelism (TLP). In this paper, we describe the evolution of CMT chips in industry and highlight the pervasiveness of CMT designs in upcoming general-purpose processors. The CMT design space accommodates a range of designs between the extremes represented by the SMT and CMP designs, and a variety of attractive design options are currently unexplored. Though there has been extensive research on utilizing multiple hardware threads to speed up single-threaded applications via speculative parallelization, there are many challenges in designing CMT processors, even when sufficient TLP is present. This paper describes some of these challenges, including hot sets, hot banks, speculative prefetching strategies, request prioritization and off-chip bandwidth reduction.

measurement and modeling of computer systems | 1993

Efficient simulation of caches under optimal replacement with applications to miss characterization

Rabin A. Sugumar; Santosh G. Abraham

Cache miss characterization models such as the three Cs model are useful in developing schemes to reduce cache misses and their penalty. In this paper we propose the OPT model that uses cache simulation under optimal (OPT) replacement to obtain a finer and more accurate characterization of misses than the three Cs model. However, current methods for optimal cache simulation are slow and difficult to use. We present three new techniques for optimal cache simulation. First, we propose a limited lookahead strategy with error fixing, which allows one pass simulation of multiple optimal caches. Second, we propose a scheme to group entries in the OPT stack, which allows efficient tree based fully-associative cache simulation under OPT. Third, we propose a scheme for exploiting partial inclusion in set-associative cache simulation under OPT. Simulators based on these algorithms were used to obtain cache miss characterizations using the OPT model for nine SPEC benchmarks. The results indicate that miss ratios under OPT are substantially lower than those under LRU replacement, by up to 70% in fully-associative caches, and up to 32% in two-way set-associative caches.
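The gap between OPT and LRU miss counts that this abstract reports can be illustrated on a toy trace. The sketch below is a naive fully-associative simulator with full lookahead (Belady's optimal policy: evict the block whose next use is farthest in the future). It is not the paper's limited-lookahead, stack-grouping, or partial-inclusion algorithms, and the function names and the trace are made up for illustration.

```python
from collections import OrderedDict

def lru_misses(trace, capacity):
    """Count misses in a fully-associative cache with LRU replacement."""
    cache = OrderedDict()  # keys ordered from least to most recently used
    misses = 0
    for block in trace:
        if block in cache:
            cache.move_to_end(block)  # mark as most recently used
        else:
            misses += 1
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict least recently used
            cache[block] = True
    return misses

def opt_misses(trace, capacity):
    """Count misses under optimal replacement, using full knowledge of the trace."""
    inf = float("inf")
    # next_use[i] = index of the next reference to trace[i], or inf if none.
    next_use = [inf] * len(trace)
    last_seen = {}
    for i in range(len(trace) - 1, -1, -1):
        next_use[i] = last_seen.get(trace[i], inf)
        last_seen[trace[i]] = i

    cache = {}  # block -> index of its next reference
    misses = 0
    for i, block in enumerate(trace):
        if block not in cache:
            misses += 1
            if len(cache) >= capacity:
                # Evict the block referenced farthest in the future.
                victim = max(cache, key=cache.get)
                del cache[victim]
        cache[block] = next_use[i]
    return misses

# Hypothetical block-address trace, chosen only for illustration.
trace = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
print(lru_misses(trace, 3), opt_misses(trace, 3))  # prints "10 7"
```

The naive victim search scans the whole cache on every miss, which is the kind of cost that makes straightforward OPT simulation slow on long traces.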

application-specific systems, architectures, and processors | 2000

High-level synthesis of nonprogrammable hardware accelerators

Robert Schreiber; Shail Aditya; B. Ramakrishna Rau; Vinod Kathail; Scott A. Mahlke; Santosh G. Abraham; Greg Snider

The PICO-N system automatically synthesizes embedded nonprogrammable accelerators to be used as co-processors for functions expressed as loop nests in C. The output is synthesizable VHDL that defines the accelerator at the register transfer level (RTL). The system generates a synchronous array of customized VLIW (very-long instruction word) processors, their controller, local memory, and interfaces. The system also modifies the user's application software to make use of the generated accelerator. The user indicates the throughput to be achieved by specifying the number of processors and their initiation interval. In experimental comparisons, PICO-N designs are slightly more costly than hand-designed accelerators with the same performance.

international symposium on microarchitecture | 2005

Dynamic helper threaded prefetching on the Sun UltraSPARC® CMP processor

Jiwei Lu; Abhinav Das; Wei-Chung Hsu; Khoa Nguyen; Santosh G. Abraham

Data prefetching via helper threading has been extensively investigated on simultaneous multi-threading (SMT) or virtual multi-threading (VMT) architectures. Although reportedly large cache latency can be hidden by helper threads at runtime, most techniques rely on hardware support to reduce context switch overhead between the main thread and helper thread, and on static profile feedback to construct the helper thread code. This paper develops a new solution by exploiting helper threaded prefetching through dynamic optimization on the latest UltraSPARC chip-multiprocessing (CMP) processor. Our experiments show that by utilizing the otherwise idle processor core, a single user-level helper thread is sufficient to improve the runtime performance of the main thread without triggering multiple thread slices. Moreover, since the multiple cores are physically decoupled in the CMP, contention introduced by helper threading is minimal. This paper also discusses several key technical challenges of building a lightweight dynamic optimization/software scouting system on the UltraSPARC/Solaris platform.

international conference on supercomputing | 2004

Effective stream-based and execution-based data prefetching

Sorin Iacobovici; Lawrence Spracklen; Sudarshan Kadambi; Yuan Chou; Santosh G. Abraham

With processor speeds continuing to outpace the memory subsystem, cache-missing memory operations continue to become increasingly important to application performance. In response to this continuing trend, most modern processors now support hardware (HW) prefetchers, which act to reduce the missing loads observed by an application. This paper analyzes the behavior of cache-missing loads in SPEC CPU2000 and highlights the inability of unit and single non-unit stride prefetchers to correctly prefetch for some commonly occurring streams. In response to this analysis, a novel multi-stride prefetcher that supports streams with up to four distinct strides is proposed. Performance analysis for SPEC CPU2000 illustrates that the proposed multi-stride prefetcher can outperform current stride prefetchers on several benchmarks, most notably on mcf, lucas and facerec, where it achieves an additional performance gain of up to 57%. Performance of the strided HW prefetchers is also contrasted with another recently proposed prefetch scheme, runahead execution (RAE), and the synergy between the schemes is investigated.
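Why a single-stride predictor struggles on streams with several strides can be seen in a small software model. The sketch below is purely illustrative and is not the prefetcher design from the paper: it assumes a stream whose strides repeat with a short period, and compares a predictor that reuses the last stride against a hypothetical predictor that looks for a repeating stride pattern of period up to four. All names and the example stream are assumptions.

```python
def single_stride_hits(addrs):
    """Predict the next address as last address + last stride; count correct predictions."""
    hits = 0
    for i in range(2, len(addrs)):
        last_stride = addrs[i - 1] - addrs[i - 2]
        if addrs[i - 1] + last_stride == addrs[i]:
            hits += 1
    return hits

def multi_stride_hits(addrs, max_period=4):
    """Predict the next stride from a repeating stride pattern of period <= max_period.

    Hypothetical policy: trust period p once the last p strides equal the p
    strides before them, preferring the longest such period.
    """
    strides = [b - a for a, b in zip(addrs, addrs[1:])]
    hits = 0
    for i in range(1, len(strides)):
        history = strides[:i]
        prediction = None
        for p in range(max_period, 0, -1):
            if len(history) >= 2 * p and history[-p:] == history[-2 * p:-p]:
                prediction = history[-p]  # the pattern repeats, so reuse the stride from p steps ago
                break
        if prediction == strides[i]:
            hits += 1
    return hits

# Hypothetical stream with the repeating stride pattern +1, +1, +6.
addrs = [0, 1, 2, 8, 9, 10, 16, 17, 18, 24, 25, 26, 32]
print(single_stride_hits(addrs), multi_stride_hits(addrs))  # prints "4 6"
```

On this stream the single-stride predictor mispredicts at every change of stride, while the pattern-based predictor makes no further mispredictions once two full periods have been observed.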

international conference on supercomputing | 1995

Optimum modulo schedules for minimum register requirements

Alexandre E. Eichenberger; Edward S. Davidson; Santosh G. Abraham

Modulo scheduling is an efficient technique for exploiting instruction level parallelism in a variety of loops, resulting in high performance code but increased register requirements. We present a combined approach that schedules the loop operations for the highest steady state throughput and minimum register requirements. Our method determines optimal register requirements for machines with finite resources and for general dependence graphs. We compare the performance of this and other modulo schedulers for a benchmark of 629 loops from the Perfect Club, SPEC-89, and the Livermore Fortran Kernels. Measurements demonstrate the potential of register-sensitive modulo schedulers, which will be useful in evaluating the performance of register-sensitive modulo scheduling heuristics.

high-performance computer architecture | 2005

Effective instruction prefetching in chip multiprocessors for modern commercial applications

Lawrence Spracklen; Yuan Chou; Santosh G. Abraham

In this paper, we study the instruction cache miss behavior of four modern commercial applications (a database workload, TPC-W, SPECjAppServer2002 and SPECweb99). These applications exhibit high instruction cache miss rates for both the L1 and L2 caches, and a sizable performance improvement can be achieved by eliminating these misses. We show that it is important to address not only sequential misses but also misses due to branches and function calls. As a result, we propose an efficient discontinuity prefetching scheme that can be effectively combined with traditional sequential prefetching to address all forms of instruction cache misses. Additionally, with the emergence of chip multiprocessors (CMPs), instruction prefetching schemes must take into account their effect on the shared L2 cache. Specifically, aggressive instruction cache prefetching can result in an increase in the number of L2 cache data misses. As a solution, we propose a scheme that does not install prefetches into the L2 cache unless they are proven to be useful. Overall, we demonstrate that the combination of our proposed schemes is successful in reducing the instruction miss rate to only 10%-16% of the original miss rate and results in a 1.08X-1.37X performance improvement for the applications studied.

ACM Transactions on Computer Systems | 1995

Set-associative cache simulation using generalized binomial trees

Rabin A. Sugumar; Santosh G. Abraham

Set-associative caches are widely used in CPU memory hierarchies, I/O subsystems, and file systems to reduce average access times. This article proposes an efficient simulation technique for simulating a group of set-associative caches in a single pass through the address trace, where all caches have the same line size but varying associativities and varying number of sets. The article also introduces a generalization of the ordinary binomial tree and presents a representation of caches in this class using the Generalized Binomial Tree (gbt). The tree representation permits efficient search and update of the caches. Theoretically, the new algorithm, GBF_LS, based on the gbt structure, always takes fewer comparisons than the two earlier algorithms for the same class of caches: all-associativity and generalized forest simulation. Experimentally, the new algorithm shows performance gains in the range of 1.2 to 3.8 over the earlier algorithms on address traces of the SPEC benchmarks. A related algorithm for simulating multiple alternative direct-mapped caches with fixed cache size, but varying line size, is also presented.

international conference on supercomputing | 1992

Register requirements of pipelined processors

William Mangione-Smith; Santosh G. Abraham; Edward S. Davidson

To enable concurrent instruction execution, scientific computers generally rely on pipelining, which combines with faster system clocks to achieve greater throughput. Each concurrently executing instruction requires buffer space, usually implemented as a register, to receive its result. This paper focuses on the issue of how many registers are required to achieve optimal performance in pipelined scientific computers. Four machine models are considered: single, double, and triple issue scalar machines, and vector machines with various register lengths. A model is presented that accurately relates the register requirements for optimum performance of cyclically scheduled loops with tree-dependence graphs to the degree of function unit pipelining, the instruction issue bandwidth, and code properties. A method for finding upper and lower bounds on the minimum register requirements is also presented. The result of this work is a theory for assessing register requirements that can be used to reveal fundamental differences among machines within a space of architectural and implementation design choices. Some experimental data is also provided to support the theory.