Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Aneesh Aggarwal is active.

Publication


Featured research published by Aneesh Aggarwal.


international conference on computer design | 2003

Energy efficient asymmetrically ported register files

Aneesh Aggarwal; Manoj Franklin

Power consumption in the register file (RF) forms a considerable fraction of the total power consumption in a chip. With increasing instruction window sizes and issue widths, RF power consumption will grow significantly. Exploiting the fact that many register values are small and require only a few bits to represent, we propose a novel asymmetrically ported RF, in which some of the ports can only read/write small-sized values, to reduce RF power consumption. We experiment with both monolithic and partitioned versions of asymmetrically ported RFs. The power savings with a partitioned asymmetrically ported RF reach as high as 60%. These reductions in RF power consumption come with about a 40% improvement in RF access time and a negligible impact on IPC (instructions per cycle).
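
A minimal sketch of the narrow-value test that would steer a result to an asymmetric port, assuming 64-bit registers and an illustrative 16-bit narrow-port width (the paper's actual port sizing may differ):

```python
# Hypothetical sketch: deciding whether a result can use a narrow RF port.
# The 64-bit register size and 16-bit narrow width are illustrative
# assumptions, not values taken from the paper.

NARROW_BITS = 16

def is_narrow(value: int, width: int = NARROW_BITS) -> bool:
    """True if `value` sign-extends losslessly from `width` bits."""
    lo = -(1 << (width - 1))
    hi = (1 << (width - 1)) - 1
    return lo <= value <= hi

def pick_port(value: int) -> str:
    """Route small results to a cheap narrow port, others to a full port."""
    return "narrow" if is_narrow(value) else "full"

if __name__ == "__main__":
    for v in (42, -1, 70000):
        print(v, "->", pick_port(v))  # 42 and -1 fit a narrow port
```

Because many values pass this test, most accesses can be serviced by the cheaper narrow ports, which is what drives the reported power savings.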


international conference on supercomputing | 2001

Evaluating the impact of memory system performance on software prefetching and locality optimizations

Abdel-Hameed A. Badawy; Aneesh Aggarwal; Donald Yeung; Chau-Wen Tseng

Software prefetching and locality optimizations are techniques for overcoming the speed gap between processor and memory. In this paper, we evaluate the impact of memory trends on the effectiveness of software prefetching and locality optimizations for three types of applications: regular scientific codes, irregular scientific codes, and pointer-chasing codes. We find for many applications, software prefetching outperforms locality optimizations when there is sufficient memory bandwidth, but locality optimizations outperform software prefetching under bandwidth-limited conditions. The break-even point (for 1 Ghz processors) occurs at roughly 2.5 GBytes/sec on todays memory systems, and will increase on future memory systems. We also study the interactions between software prefetching and locality optimizations when applied in concert. Naively combining the techniques provides robustness to changes in memory bandwidth and latency, but does not yield additional performance gains. We propose and evaluate several algorithms to better integrate software prefetching and locality optimizations, including a modified tiling algorithm, padding for prefetching, and index prefetching.
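
A minimal sketch of loop tiling (blocking), the basic locality optimization the paper builds on; the tile size of 64 is an illustrative assumption, and the paper's modified tiling and padding algorithms are more involved:

```python
# Toy illustration of loop tiling: traverse a 2D array in blocks so each
# block stays cache-resident while it is being used. Tile size is an
# illustrative assumption, not the paper's tuned value.

def sum_tiled(a, tile=64):
    """Sum a 2D list of numbers tile by tile for better cache locality."""
    n, m = len(a), len(a[0])
    total = 0
    for ii in range(0, n, tile):
        for jj in range(0, m, tile):
            for i in range(ii, min(ii + tile, n)):
                for j in range(jj, min(jj + tile, m)):
                    total += a[i][j]
    return total

if __name__ == "__main__":
    print(sum_tiled([[1] * 100 for _ in range(100)]))  # 10000
```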


international symposium on performance analysis of systems and software | 2001

An empirical study of the scalability aspects of instruction distribution algorithms for clustered processors

Aneesh Aggarwal; Manoj Franklin

In the evolving submicron technology, wire delays are becoming much more important than gate delays, making it particularly attractive to use decentralized designs. A common form of decentralization adopted in processors is to partition the execution core into multiple clusters, each with a small instruction window and a set of functional units. A number of algorithms have been proposed for distributing instructions among the clusters. The first part of this paper analyzes (qualitatively as well as quantitatively) the effect of various hardware parameters, such as the type of cluster interconnect, the fetch size, the cluster issue width, the cluster window size, and the number of clusters, on the performance of different instruction distribution algorithms. The study shows that the relative performance of the algorithms is very sensitive to these hardware parameters, and that the algorithms that perform relatively better with four or fewer clusters are generally not the best ones for a larger number of clusters. This is important, given that with an imminent increase in the transistor budget, more clusters are expected to be integrated on a single chip. Moreover, among the existing algorithms, no single algorithm works uniformly best across all hardware configurations, which motivates the need for alternate interconnects and instruction distribution algorithms. The second part of the paper therefore investigates interconnects that provide scalable performance as the number of clusters is increased. In particular, it investigates two hierarchical interconnects - a single ring of crossbars and multiple rings of crossbars - as well as instruction distribution algorithms that take advantage of these interconnects. Our study shows that these new interconnects, with the appropriate distribution techniques, achieve an IPC (instructions per cycle) that is 15-20 percent better than the most scalable existing configuration, and within 2 percent of that achieved by a hypothetical ideal processor having a 1-cycle latency crossbar interconnect. These results confirm the utility and applicability of hierarchical interconnects and hierarchical distribution algorithms in clustered processors.
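
A hypothetical sketch of one common family of distribution heuristics the paper compares: send an instruction to the cluster holding most of its producers (to cut inter-cluster communication), falling back to the least-loaded cluster when that one is full. The function names and tie-breaking are illustrative, not the paper's exact algorithms:

```python
# Illustrative dependence-based instruction distribution heuristic.
# producer_cluster maps a producer instruction id to its cluster;
# loads[c] is the current occupancy of cluster c's window.

from collections import Counter

def assign_cluster(src_ids, producer_cluster, loads, window_size):
    """Pick a cluster for a new instruction with sources `src_ids`."""
    votes = Counter(producer_cluster[s] for s in src_ids)
    if votes:
        best, _ = votes.most_common(1)[0]
        if loads[best] < window_size:
            return best          # minimize inter-cluster communication
    # Load-balance fallback: least occupied cluster.
    return min(range(len(loads)), key=lambda c: loads[c])
```

The tension visible here, communication locality versus load balance, is exactly what makes the algorithms' relative ranking shift as the cluster count and interconnect change.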


high-performance computer architecture | 2006

Reducing resource redundancy for concurrent error detection techniques in high performance microprocessors

Sumeet Kumar; Aneesh Aggarwal

With reducing feature sizes, increasing chip capacity, and increasing clock speeds, microprocessors are becoming increasingly susceptible to transient (soft) errors. Redundant multi-threading (RMT) is an attractive approach for concurrent error detection and recovery. However, redundant threads significantly increase the pressure on processor resources, resulting in a dramatic performance impact. In this paper, we propose reducing resource redundancy as a means to mitigate this performance impact: all instructions are still redundantly executed, but the redundant instructions do not occupy many of the resources used by their main counterparts. The approach exploits the runtime profile of the leading thread to optimally allocate resources to the trailing thread in a staggered RMT architecture. The key observation is that, even with a small slack between the two threads, many instructions in the leading thread have already produced their results before their trailing counterparts are renamed. We investigate two techniques: (i) register bits reuse, which attempts to use the same register (but different bits) for both copies of the same instruction if the result produced by the instruction is small; and (ii) register value reuse, which attempts to use the same register for a main instruction and a distinct redundant instruction if both instructions produce the same result. These techniques, along with some others, are used to reduce redundancy in the register file, reorder buffer, and load/store buffer, and are evaluated in terms of their performance, power, and vulnerability impact on an RMT processor. Our experiments show that the techniques achieve about a 95% performance improvement and about a 17% energy reduction, while the vulnerability of the RMT processor remains the same.
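
A minimal sketch of the two reuse checks described above, assuming 64-bit registers split into two 32-bit halves; the names and the half-word split are illustrative assumptions:

```python
# Hypothetical checks behind the two reuse techniques. The 32-bit "half"
# width is an illustrative assumption about the register layout.

HALF_BITS = 32

def can_share_bits(leading_result: int) -> bool:
    """Register bits reuse: the leading copy's result is small enough
    that both copies of the instruction fit in halves of one register."""
    lo = -(1 << (HALF_BITS - 1))
    hi = (1 << (HALF_BITS - 1)) - 1
    return lo <= leading_result <= hi

def can_share_register(main_result: int, redundant_result: int) -> bool:
    """Register value reuse: a main instruction and a distinct redundant
    instruction produce the same value, so one register can serve both."""
    return main_result == redundant_result
```

Both checks rely on the slack between threads: the leading result is already known when the trailing instruction is renamed, so the allocator can make the decision at rename time.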


programming language design and implementation | 2001

Related field analysis

Aneesh Aggarwal; Keith H. Randall

We present an extension of field analysis (see [4]) called related field analysis, a general technique for proving relationships between two or more fields of an object. We demonstrate the feasibility and applicability of related field analysis by applying it to the problem of removing array bounds checks. For array bounds check removal, we define a pair of related fields to be an integer field and an array field for which the integer field has a known relationship to the length of the array. This related field information can then be used to remove array bounds checks from accesses to the array field. Our results show that related field analysis can remove an average of 50% of the dynamic array bounds checks on a wide range of applications. We describe the implementation of related field analysis in the Swift optimizing compiler for Java, as well as the optimizations that exploit its results.
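
A small sketch of the array-bounds use case: `count` is a related field known to satisfy 0 <= count <= len(items), so any index bounded by `count` needs no separate bounds check. The class and field names are hypothetical, and the paper's analysis operates on Java fields inside the Swift compiler rather than Python:

```python
# Hypothetical illustration of a related-field invariant. The methods
# maintain 0 <= count <= len(items), which is exactly the relationship
# related field analysis would prove between the two fields.

class Buffer:
    def __init__(self, capacity: int):
        self.items = [0] * capacity
        self.count = 0  # invariant: 0 <= count <= len(self.items)

    def push(self, v):
        if self.count == len(self.items):
            raise IndexError("buffer full")
        self.items[self.count] = v
        self.count += 1

    def total(self):
        # i ranges over [0, count); the invariant proves i < len(self.items),
        # so a compiler with related-field information can elide the
        # per-access bounds check on self.items[i].
        return sum(self.items[i] for i in range(self.count))
```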


high-performance computer architecture | 2006

Increasing the cache efficiency by eliminating noise

Prateek Pujara; Aneesh Aggarwal

Caches are utilized very inefficiently because not all the excess data fetched into the cache, to exploit spatial locality, is actually used. We define cache utilization as the percentage of data brought into the cache that is actually used. Our experiments showed that the Level 1 data cache has a utilization of only about 57%. In this paper, we show that the useless data in a cache block (cache noise) is highly predictable. This can be used to bring only the to-be-referenced data into the cache on a cache miss, reducing the energy, cache space, and bandwidth wasted on useless data. Cache noise prediction is based on the most recent word-usage history of each cache block. Our experiments showed that a code-context predictor is the best-performing predictor, with a predictability of about 95%. In a code-context predictor, each cache block belongs to a code context determined by the upper-order PC bits of the instructions that fetched the block. Applying cache noise prediction to the L1 data cache, we observed about a 37% improvement in cache utilization, and about 23% and 28% reductions in cache energy consumption and bandwidth requirement, respectively. Cache noise mispredictions increased the miss rate by only 0.1% and had almost no impact on the instructions per cycle (IPC) count. Compared to a sub-blocked cache, fetching only the to-be-referenced data resulted in 97% and 44% improvements in miss rate and cache utilization, respectively, with the sub-blocked cache having a bandwidth requirement of about 35% of that of the cache noise prediction based approach.
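
A toy model of the code-context predictor in the spirit of the paper: blocks fetched by instructions that share upper PC bits share a predicted word-usage mask. The 8-words-per-block size and the PC shift amount are illustrative assumptions:

```python
# Sketch of a code-context cache noise predictor. Block geometry and the
# number of PC bits defining a context are illustrative assumptions.

WORDS_PER_BLOCK = 8
CONTEXT_SHIFT = 6  # drop low PC bits; the upper bits name the code context

class NoisePredictor:
    def __init__(self):
        self.table = {}  # code context -> bitmask of words predicted useful

    def predict(self, fetch_pc: int) -> int:
        """On a miss, return the mask of words to fetch for this block.
        Unseen contexts conservatively fetch the whole block."""
        default = (1 << WORDS_PER_BLOCK) - 1
        return self.table.get(fetch_pc >> CONTEXT_SHIFT, default)

    def train(self, fetch_pc: int, used_mask: int):
        """On eviction, record which words were actually referenced."""
        self.table[fetch_pc >> CONTEXT_SHIFT] = used_mask
```

Training on eviction with the block's observed usage is what makes the predictor track each code region's access pattern; a misprediction simply triggers a fetch of the missing words.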


international conference on computer design | 2005

Restrictive compression techniques to increase level 1 cache capacity

Prateek Pujara; Aneesh Aggarwal

Increasing cache latencies limit L1 cache sizes. In this paper we investigate restrictive compression techniques for level 1 data cache, to avoid an increase in the cache access latency. The basic technique - all words narrow (AWN) - compresses a cache block only if all the words in the cache block are of narrow size. We extend the AWN technique to store a few upper half-words (AHS) in a cache block to accommodate a small number of normal-sized words in the cache block. Further, we make the AHS technique adaptive, where the additional half-words space is adaptively allocated to the various cache blocks. We also propose techniques to reduce the increase in the tag space that is inevitable with compression techniques. Overall, the techniques in this paper increase the average L1 data cache capacity (in terms of the average number of valid cache blocks per cycle) by about 50%, compared to the conventional cache, with no or minimal impact on the cache access time. In addition, the techniques have the potential of reducing the average L1 data cache miss rate by about 23%.
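
A minimal sketch of the compressibility tests implied by AWN and AHS, assuming an illustrative 16-bit narrow size and a hypothetical budget of extra half-word slots per block:

```python
# Sketch of the AWN and AHS compressibility checks. The 16-bit narrow
# width and the 2-half-word budget are illustrative assumptions.

def is_narrow(word: int, bits: int = 16) -> bool:
    return -(1 << (bits - 1)) <= word < (1 << (bits - 1))

def awn_compressible(block) -> bool:
    """All Words Narrow: compress only if every word is narrow."""
    return all(is_narrow(w) for w in block)

def ahs_compressible(block, extra_halfwords: int = 2) -> bool:
    """AHS variant: tolerate a few normal-sized words whose upper
    half-words go into a small side area of the cache block."""
    wide_words = sum(1 for w in block if not is_narrow(w))
    return wide_words <= extra_halfwords
```

The adaptive AHS scheme in the paper goes further by varying the half-word budget across blocks at runtime rather than fixing it as this sketch does.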


international conference on parallel architectures and compilation techniques | 2003

Instruction replication: reducing delays due to inter-PE communication latency

Aneesh Aggarwal; Manoj Franklin

As feature sizes are becoming smaller, wire delays are becoming very critical. Clustering is a popular decentralization approach to reduce the impact of shrinking technologies on clock speed. In this approach, the centralized instruction window is replaced with multiple smaller windows, called clusters (PEs). The performance of these clustered processors depends on the amount of inter-PE communication and load imbalance incurred by the distribution algorithm used to distribute instructions among the PEs. In this paper, we investigate a novel approach of reducing the impact of inter-PE communication latency, while preserving good load balance. The basic idea is to selectively replicate instructions in those PEs where their results are required. The replication is done based on heuristics that weigh the potential benefits of replication. We found that, with instruction replication, the IPC of a clustered processor is significantly higher than that obtained without instruction replication and is within just 8 percent of that of a superscalar configuration with a centralized instruction scheduler.
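
A hypothetical sketch of the kind of benefit-weighing heuristic described above: replicate an instruction into remote PEs when the communication cycles avoided outweigh the extra issue slots consumed. The cost weights are illustrative stand-ins, not the paper's tuned heuristics:

```python
# Illustrative replication decision for a clustered (multi-PE) processor.
# hop_latency and issue_cost are hypothetical cost weights.

def should_replicate(consumer_pes, home_pe, hop_latency=2, issue_cost=1):
    """Decide whether to replicate an instruction whose consumers live in
    `consumer_pes` while the original executes in `home_pe`."""
    remote_pes = {pe for pe in consumer_pes if pe != home_pe}
    benefit = len(remote_pes) * hop_latency  # inter-PE transfer cycles avoided
    cost = len(remote_pes) * issue_cost      # extra issue slots for replicas
    return benefit > cost

if __name__ == "__main__":
    print(should_replicate(consumer_pes=[1, 2, 2], home_pe=0))  # True
```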


design, automation, and test in europe | 2009

Architectural support for low overhead detection of memory violations

Saugata Ghose; Latoya Gilgeous; Polina Dudnik; Aneesh Aggarwal; Corey Waxman

Violations in memory references cause tremendous losses in productivity, catastrophic mission failures, losses of privacy and security, and much more. Software mechanisms to detect memory violations have high false positive and false negative rates or huge performance overheads. This paper proposes architectural support to detect memory reference violations in inherently unsafe languages such as C and C++. In this approach, the ISA is extended to include "safety" instructions that provide compile-time information on pointers and objects, and the microarchitecture is extended to execute the safety instructions efficiently. We explore optimizations, such as delayed violation detection and stack-based handling of local pointers, to reduce the performance overhead. Our experiments show that the synergy between hardware and software gives this approach less than 5% average performance overhead, while an exclusively software mechanism incurs a 480% overhead on the same benchmarks.
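
An illustrative software model of the hardware/software split: the compiler emits "safety" instructions that register an object's bounds, and the hardware checks each access against that metadata. The table structure and method names here are hypothetical stand-ins for the ISA extension, not its actual encoding:

```python
# Toy model of bounds metadata registered by "safety" instructions and
# checked on each memory access. All names are hypothetical.

class SafetyTable:
    def __init__(self):
        self.bounds = {}  # object base address -> (base, limit)

    def register(self, base: int, size: int):
        """Models a compiler-emitted safety instruction for one object."""
        self.bounds[base] = (base, base + size)

    def check(self, base: int, addr: int):
        """Models the hardware check on a load/store through a pointer
        derived from `base`; raises on an out-of-bounds access."""
        lo, hi = self.bounds[base]
        if not (lo <= addr < hi):
            raise MemoryError(f"memory violation at {addr:#x}")

if __name__ == "__main__":
    t = SafetyTable()
    t.register(base=0x1000, size=64)
    t.check(0x1000, 0x1010)   # in bounds, silent
    t.check(0x1000, 0x1040)   # one past the end, raises MemoryError
```

In the real design the check happens in the pipeline rather than in software, which is where the sub-5% overhead comes from; optimizations like delayed detection move the check off the critical path.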


international conference on computer design | 2004

Defining wakeup width for efficient dynamic scheduling

Aneesh Aggarwal; Manoj Franklin; Oguz Ergin

A larger dynamic scheduler (DS) exposes more instruction-level parallelism (ILP), giving better performance. However, a larger DS also results in a longer scheduler latency and a slower clock speed. In this paper, we propose a new DS design that reduces the scheduler critical-path latency by reducing the wakeup width, defined as the effective number of results used for instruction wakeup. The design is based on the observation that the average number of results per cycle that are immediately required to wake up dependent instructions is considerably less than the processor issue width. Our designs are evaluated using simulations of the SPEC 2000 benchmarks and SPICE simulations of the actual issue queue layouts in a 0.18 micron process. We found that a significant reduction in scheduler latency, power consumption, and area is achieved with less than a 2% reduction in the instructions per cycle (IPC) count for the SPEC2K benchmarks.
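
A toy measurement of the quantity the design rests on: how many results per cycle actually need to wake a waiting dependent. The trace format is an illustrative assumption:

```python
# Sketch of the measurement motivating a reduced wakeup width: per cycle,
# count results that immediately wake a dependent instruction. The input
# format (list of cycles, each a list of (tag, has_waiting_dependent)
# pairs) is an illustrative assumption.

def avg_immediate_wakeups(cycles):
    counts = [sum(1 for _, waiting in cyc if waiting) for cyc in cycles]
    return sum(counts) / len(counts) if cycles else 0.0

if __name__ == "__main__":
    trace = [[(1, True), (2, False), (3, False), (4, True)],
             [(5, False), (6, True), (7, False), (8, False)]]
    print(avg_immediate_wakeups(trace))  # 1.5, well below issue width 4
```

If this average stays well below the issue width, the wakeup logic can be built with fewer result buses than issue slots, shortening the scheduler critical path at little IPC cost.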

Collaboration


Dive into Aneesh Aggarwal's collaborations.

Top Co-Authors

Tameesh Suri
State University of New York System

Keith H. Randall
Massachusetts Institute of Technology