
Publication


Featured research published by Banit Agrawal.


Design Automation Conference | 2006

A thermally-aware performance analysis of vertically integrated (3-D) processor-memory hierarchy

Gian Luca Loi; Banit Agrawal; Navin Srivastava; Sheng-Chih Lin; Timothy Sherwood; Kaustav Banerjee

Three-dimensional (3D) integrated circuits have emerged as promising candidates to overcome the interconnect bottlenecks of nanometer-scale designs. While they offer several other advantages, it is expected that the benefits of this technology can potentially be offset by thermal considerations, which impact chip performance and reliability. The work presented in this paper is the first attempt to study the performance benefits of 3D technology under the influence of such thermal constraints. Using a processor-cache-memory system and carefully chosen applications encompassing different memory behaviors, the performance of the 3D architecture is compared with a conventional planar (2D) design. It is found that the substantial increase in memory bus frequency and bus width contributes to a significant reduction in execution time with a 3D design. It is also found that increasing the clock frequency translates into larger gains in system performance with 3D designs than with planar 2D designs in memory-intensive applications. The thermal profile of the vertically stacked chip is generated taking into account the highly temperature-sensitive leakage power dissipation. The maximum allowed operating frequency imposed by the temperature constraint is shown to be lower for 3D than for 2D designs. In spite of these constraints, it is shown that the 3D system registers a large performance improvement for memory-intensive applications.
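As a rough illustration of the thermal constraint described above, the sketch below iterates a simple steady-state thermal model in which leakage power grows exponentially with temperature, then sweeps clock frequency to find the highest value that keeps the die under a temperature cap. This is a minimal sketch, not the paper's model; every constant (thermal resistance, power coefficients, the 85 °C cap) is a hypothetical placeholder.

```python
import math

# Hypothetical first-order model: dynamic power scales linearly with
# frequency, leakage grows exponentially with temperature, and the die
# temperature is T_ambient + theta_ja * P_total. None of these constants
# come from the paper; they only illustrate the shape of the trade-off.

def steady_state_temp(freq_ghz, t_ambient=45.0, theta_ja=0.8,
                      cdyn_w_per_ghz=12.0, leak0_w=5.0, leak_k=0.02):
    """Fixed-point iteration on T = T_amb + theta_ja * P(T)."""
    temp = t_ambient
    for _ in range(100):
        p_dyn = cdyn_w_per_ghz * freq_ghz
        p_leak = leak0_w * math.exp(leak_k * (temp - t_ambient))
        temp_next = t_ambient + theta_ja * (p_dyn + p_leak)
        if abs(temp_next - temp) < 1e-3:
            break
        temp = temp_next
    return temp

def max_frequency(t_cap=85.0):
    """Sweep frequency in 100 MHz steps until the cap is exceeded."""
    f = 0.1
    while steady_state_temp(f + 0.1) <= t_cap:
        f += 0.1
    return f

print(f"Highest frequency under the cap: {max_frequency():.1f} GHz")
```

Because leakage feeds back into temperature, the attainable frequency drops faster than a fixed-leakage model would predict, which is the effect that lowers the 3D stack's frequency ceiling relative to the 2D design.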


International Symposium on Performance Analysis of Systems and Software | 2006

Modeling TCAM power for next generation network devices

Banit Agrawal; Timothy Sherwood

Applications in computer networks often require high throughput access to large data structures for lookup and classification. Many advanced algorithms exist to speed these search primitives on network processors, general purpose machines, and even custom ASICs. However, supporting these applications with standard memories requires very careful analysis of access patterns, and achieving worst case performance can be quite difficult and complex. A simple solution is often possible if a Ternary CAM (content addressable memory) is used to perform a fully parallel search across the entire data set. Unfortunately, this parallelism means that large portions of the chip are switching during each cycle, causing large amounts of power to be consumed. While researchers have begun to explore new ways of managing the power consumption, quantifying design alternatives is difficult due to a lack of available models. In this paper, we examine the structure inside a modern TCAM and present a simple, yet accurate, power model. We present techniques to estimate the dynamic power consumption of a large TCAM. We validate the model using industrial TCAM datasheets and prior published works. We present an extensive analysis of the model by varying architectural parameters. We also describe how new network algorithms have the potential to address the growing problem of power management in next-generation network devices.
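For intuition, a first-order version of such a model can be written down directly: every fully parallel search drives the search lines down every row and precharges every match line, so dynamic energy scales with rows × bits. The sketch below is a deliberately simplified stand-in with made-up per-cell capacitances, not the calibrated model from the paper.

```python
# First-order sketch of TCAM dynamic search power, with hypothetical
# per-cell capacitances (NOT the paper's validated constants).

C_ML_PER_CELL = 0.3e-15   # match-line capacitance per cell (F), hypothetical
C_SL_PER_CELL = 0.2e-15   # search-line capacitance per cell (F), hypothetical

def tcam_search_power(rows, bits, vdd=1.2, searches_per_sec=250e6):
    """Dynamic power when a search is issued every cycle (E = C * V^2 * f)."""
    e_matchlines = rows * bits * C_ML_PER_CELL * vdd ** 2
    e_searchlines = rows * bits * C_SL_PER_CELL * vdd ** 2
    return (e_matchlines + e_searchlines) * searches_per_sec  # joules/s = W

# Example: a 512K-entry, 144-bit TCAM searched 250M times per second.
print(f"{tcam_search_power(512 * 1024, 144):.1f} W")
```

Even this crude accounting reproduces the headline observation: search power grows linearly with the number of entries, which is why algorithmic techniques that activate only portions of the array matter so much.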


IEEE Transactions on Very Large Scale Integration (VLSI) Systems | 2008

Ternary CAM Power and Delay Model: Extensions and Uses

Banit Agrawal; Timothy Sherwood

Applications in computer networks often require high throughput access to large data structures for lookup and classification. While advanced algorithms exist to speed these search primitives on network processors and even custom application-specific integrated circuits (ASICs), achieving tight bounds on worst case performance with standard memories often requires a very careful analysis of all possible access patterns. An alternative, and oftentimes simpler, solution is possible if a ternary CAM (TCAM) is used to perform a fully parallel search across the entire data set. Unfortunately, this parallelism means that large portions of the chip are switching during each cycle, causing large amounts of power to be consumed. While researchers at all levels of design (from algorithms to circuits) have begun to explore new ways of managing the power consumption, quantifying design alternatives is difficult due to a lack of available models. In this paper, we examine the structure of a modern TCAM and present a simple, yet accurate, power and delay model. We present techniques to estimate the dynamic power consumption and leakage power of a TCAM structure and validate the model using a combination of industrial TCAM datasheets and prior published works. Such a model is a critical first step in bridging the intellectual divide between circuit-level and algorithm-level optimizations. To demonstrate the utility of our model, we present an extensive analysis of the model by varying architectural parameters and describe how our model can be easily extended to handle several circuit optimizations in the TCAM structure. In addition, we present a comparative study of SRAM and TCAM energy consumption to directly quantify the many design options, which will be very useful for network designers exploring various power management schemes.


International Conference on VLSI Design | 2008

Exploring the Processor and ISA Design for Wireless Sensor Network Applications

Shashidhar Mysore; Banit Agrawal; Frederic T. Chong; Timothy Sherwood

Power consumption, physical size, and architecture design of sensor node processors have been the focus of sensor network research in the architecture community. What lies at the foundation of this research is the hardware-level design, which determines the boundaries of achievable utility and performance. Architecture design and evaluation, however, cannot be accomplished independently of the applications and software that run on these sensor nodes. On one hand, some researchers have proposed architectures that can cater to a variety of application classes while trading off some performance. On the other hand, a set of application-specific architectures have been proposed which perform certain operations extremely well but are not versatile enough to run a variety of applications. This paper provides a design-space exploration and optimization platform to characterize the processor and ISA design tailored for a particular application or class of applications. We collect a wide variety of sensor network applications to create a comprehensive benchmark suite called WiSeNBench. We then present a careful profiling of these benchmark applications using an ARM simulator to identify some of their key characteristic behaviors. This also opens up an avenue to revisit the classes of applications that could be supported on next-generation sensor networks and the efficient architectural designs needed to enable them.
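The profiling step described above amounts to characterizing instruction mixes and similar behaviors from simulator output. A minimal sketch of that idea follows, assuming a hypothetical trace format of one disassembled ARM mnemonic per line; the category map is illustrative, not the WiSeNBench methodology.

```python
from collections import Counter

# Toy instruction-mix profiler over a hypothetical simulator trace.
# The mnemonic-to-category mapping below is an assumption for
# illustration only.
CATEGORIES = {
    "ldr": "memory", "str": "memory",
    "add": "alu", "sub": "alu", "mov": "alu", "cmp": "alu",
    "mul": "multiply",
    "b": "branch", "bl": "branch", "beq": "branch", "bne": "branch",
}

def instruction_mix(trace_lines):
    """Return each category's fraction of the dynamic instruction count."""
    counts = Counter()
    for line in trace_lines:
        mnemonic = line.split()[0].lower()
        counts[CATEGORIES.get(mnemonic, "other")] += 1
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()}

trace = ["LDR r0, [r1]", "ADD r0, r0, #1", "STR r0, [r1]", "BNE loop"]
print(instruction_mix(trace))
```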


International Symposium on Microarchitecture | 2008

A small cache of large ranges: Hardware methods for efficiently searching, storing, and updating big dataflow tags

Mohit Tiwari; Banit Agrawal; Shashidhar Mysore; Jonathan Valamehr; Timothy Sherwood

Dynamically tracking the flow of data within a microprocessor creates many new opportunities to detect and track malicious or erroneous behavior, but these schemes all rely on the ability to associate tags with all of virtual or physical memory. If one wishes to store large 32-bit tags, multiple tags per data element, or tags at the granularity of bytes rather than words, then directly storing one tag on chip to cover one byte or word (in a cache or otherwise) can be an expensive proposition. We show that dataflow tags in fact naturally exhibit a very high degree of spatial-value locality, an observation we can exploit by storing metadata on ranges of addresses (which cover a non-aligned contiguous span of memory) rather than on individual elements. In fact, a small 128-entry on-chip range cache (with area equivalent to 4 KB of SRAM) hits more than 98% of the time on average. The key to this approach is our proposed method by which ranges of tags are kept in cache in an optimally RLE-compressed form, queried at high speed, swapped in and out with secondary memory storage, and (most importantly for dataflow tracking) rapidly stitched together into the largest possible ranges as new tags are written on every store, all the while correctly handling the cases of unaligned and overlapping ranges. We examine the effectiveness of this approach by simulating its use in definedness tracking (covering both the stack and the heap), in tracking network-derived dataflow through a multi-language web application, and through a synthesizable prototype implementation.
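A minimal software sketch of the central idea, storing tags on (start, end) intervals and stitching adjacent equal-tagged ranges together on every write, is shown below. It omits the hardware specifics (the fixed 128-entry cache, RLE compression, eviction to secondary storage) and is purely illustrative.

```python
# Toy range-tagging structure: tags live on disjoint [start, end)
# intervals rather than on individual bytes, and writes coalesce
# adjacent ranges that carry the same tag.

class RangeTags:
    def __init__(self):
        self.ranges = []          # sorted, disjoint (start, end, tag)

    def write(self, start, end, tag):
        kept = []
        for s, e, t in self.ranges:
            if e <= start or s >= end:
                kept.append((s, e, t))        # no overlap: keep as-is
            else:                             # overlap: keep uncovered pieces
                if s < start:
                    kept.append((s, start, t))
                if e > end:
                    kept.append((end, e, t))
        kept.append((start, end, tag))
        kept.sort()
        # Stitch adjacent ranges with identical tags into one range.
        merged = [kept[0]]
        for s, e, t in kept[1:]:
            ps, pe, pt = merged[-1]
            if s == pe and t == pt:
                merged[-1] = (ps, e, t)
            else:
                merged.append((s, e, t))
        self.ranges = merged

    def lookup(self, addr):
        for s, e, t in self.ranges:
            if s <= addr < e:
                return t
        return None

rt = RangeTags()
rt.write(0x1000, 0x1004, "tainted")
rt.write(0x1004, 0x1008, "tainted")   # coalesces into one 8-byte range
print(rt.ranges)                       # [(4096, 4104, 'tainted')]
```

The coalescing step is what keeps the number of live ranges small: contiguous stores with the same tag collapse into a single entry, which is the spatial-value locality the paper measures.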


ACM Transactions on Embedded Computing Systems | 2009

Energy-efficient encoding techniques for off-chip data buses

Dinesh C. Suresh; Banit Agrawal; Jun Yang; Walid A. Najjar

Reducing the power consumption of computing devices has gained a lot of attention recently. Many research works have focused on reducing power consumption in the off-chip buses as they consume a significant amount of total power. Since the bus power consumption is proportional to the switching activity, reducing the bus switching is an effective way to reduce bus power. While numerous techniques exist for reducing bus power in address buses, only a handful of techniques have been proposed for data-bus power reduction, where frequent value encoding (FVE) is the best existing scheme to reduce the transition activity on the data buses. In this article, we propose improved frequent value data bus-encoding techniques aimed at reducing more switching activity and, hence, power consumption. We propose three new schemes and five new variations to exploit bit-wise temporal and spatial locality in the data-bus values. Our techniques just use one external control signal and capture bit-wise locality to efficiently encode data values. For all the embedded and SPEC applications we tested, the overall average switching reduction is 53% over unencoded data and 10% more than the conventional FVE scheme.
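To make the switching-activity metric concrete: bus power tracks the Hamming distance between consecutive bus words, and a frequent-value scheme reduces it by sending table hits one-hot, so back-to-back hits toggle at most two wires. The sketch below shows the general FVE idea only, not the specific bit-wise variants proposed in this article; the frequent-value table contents are hypothetical.

```python
# Simplified frequent-value style encoding. Hits in a small table of
# frequent values are sent as one-hot codes; misses go out raw. A real
# scheme also needs a control signal so the receiver can tell the two
# cases apart. The table below is a made-up example.

FREQ_TABLE = [0x00000000, 0xFFFFFFFF, 0x00000001, 0x80000000]

def encode(value):
    if value in FREQ_TABLE:
        return 1 << FREQ_TABLE.index(value)   # one-hot code for a hit
    return value                              # miss: raw value

def switching(values, codec=lambda v: v):
    """Total bit transitions on a 32-bit bus for a sequence of words."""
    total, prev = 0, 0
    for v in values:
        cur = codec(v)
        total += bin(prev ^ cur).count("1")   # Hamming distance
        prev = cur
    return total

seq = [0x00000000, 0xFFFFFFFF, 0x00000000, 0xFFFFFFFF, 0x12345678]
print("unencoded transitions:", switching(seq))
print("encoded transitions:  ", switching(seq, encode))
```

On this sequence the unencoded bus toggles 115 wire transitions while the encoded bus toggles 21, since alternating frequent values cost only two transitions per word once encoded.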


International Symposium on Microarchitecture | 2006

Virtually Pipelined Network Memory

Banit Agrawal; Timothy Sherwood

We introduce virtually-pipelined memory, an architectural technique that efficiently supports high-bandwidth, uniform-latency memory accesses and high-confidence throughput even under adversarial conditions. We apply this technique to the network processing domain, where memory hierarchy design is an increasingly challenging problem as network bandwidth increases. Virtual pipelining provides a simple-to-analyze programming model of a deep pipeline (deterministic latencies) with a completely different physical implementation (a memory system with banks and probabilistic mapping). This allows designers to effectively decouple the analysis of their algorithms and data structures from the analysis of the memory buses and banks. Unlike specialized hardware customized for a specific data-plane algorithm, our system makes no assumption about the memory access patterns. In the domain of network processors this will be of growing importance as the size of the routing tables, the complexity of the packet classification rules, and the amount of packet buffering required all continue to grow at a staggering rate. We present a mathematical argument for our system's ability to provably provide bandwidth with high confidence and demonstrate its functionality and area overhead through a synthesizable design. We further show that, even though our scheme is general-purpose enough to support new applications such as packet reassembly, it outperforms the state of the art in specialized packet buffering architectures.
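A toy cycle-level rendering of the idea, assuming illustrative parameters rather than anything from the paper: addresses are spread over DRAM banks by a randomized hash, each bank is busy for a fixed number of cycles per access, and every request is handed back at one uniform latency, so bank conflicts stay hidden as long as no bank falls more than the virtual latency behind.

```python
import random

# Toy virtual-pipelining simulator. NUM_BANKS, BANK_BUSY, and
# VIRTUAL_LATENCY are hypothetical parameters, not the paper's values.

NUM_BANKS = 16
BANK_BUSY = 8            # cycles a bank is occupied per access
VIRTUAL_LATENCY = 64     # uniform latency presented to the data-plane

SALT = random.getrandbits(32)

def bank_of(addr):
    return hash((addr, SALT)) % NUM_BANKS    # stand-in for a hardware hash

def simulate(addresses):
    """One new request per cycle; count missed uniform-latency deadlines."""
    bank_free = [0] * NUM_BANKS              # cycle at which each bank frees
    violations = 0
    for cycle, addr in enumerate(addresses):
        b = bank_of(addr)
        start = max(cycle, bank_free[b])     # wait if the bank is busy
        bank_free[b] = start + BANK_BUSY
        if start + BANK_BUSY > cycle + VIRTUAL_LATENCY:
            violations += 1                  # conflict could not be hidden
    return violations

reqs = [random.randrange(1 << 20) for _ in range(10_000)]
print("deadline violations:", simulate(reqs), "of", len(reqs))
```

With a randomized mapping, a bank must accumulate roughly VIRTUAL_LATENCY / BANK_BUSY pending requests before any deadline is missed, which is the kind of event the paper bounds probabilistically, even against adversarial access patterns.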


Compilers, Architecture, and Synthesis for Embedded Systems | 2003

Power efficient encoding techniques for off-chip data buses

Dinesh C. Suresh; Banit Agrawal; Jun Yang; Walid A. Najjar; Laxmi N. Bhuyan

Reducing the power consumption of computing devices has gained a lot of attention recently. Many research works have focused on reducing power consumption in the off-chip buses as they consume a significant amount of total power. Since the bus power consumption is proportional to the switching activity, reducing the bus switching is an effective way to reduce bus power. While numerous techniques exist for reducing bus power in address buses, only a handful of techniques have been proposed for data-bus power reduction, where Frequent Value Encoding (FVE) is the best existing scheme to reduce the transition activity on the data buses. In this paper, we propose improved frequent value data-bus encoding techniques aimed at reducing more switching activity and, hence, more power consumption. We propose three new schemes and five new variations to exploit bit-wise temporal and spatial locality in the data-bus values. Our technique does not use an additional external control signal and captures bit-wise locality to efficiently encode data values. For all the embedded and SPEC applications we tested, the overall average switching reduction is 53% over unencoded data and 11% more than the conventional FVE scheme.


IEEE/ACM Transactions on Networking | 2009

High-bandwidth network memory system through virtual pipelines

Banit Agrawal; Timothy Sherwood

As network bandwidth increases, designing an effective memory system for network processors becomes a significant challenge. The size of the routing tables, the complexity of the packet classification rules, and the amount of packet buffering required all continue to grow at a staggering rate. Simply relying on large, fast SRAMs alone is not likely to be scalable or cost-effective. Instead, trends point to the use of low-cost commodity DRAM devices as a means to deliver the worst-case memory performance that network data-plane algorithms demand. While DRAMs can deliver a great deal of throughput, the problem is that memory banking significantly complicates the worst-case analysis, and specialized algorithms are needed to ensure that specific types of access patterns are conflict-free. We introduce virtually pipelined memory, an architectural technique that efficiently supports high bandwidth, uniform latency memory accesses, and high-confidence throughput even under adversarial conditions. Virtual pipelining provides a simple-to-analyze programming model of a deep pipeline (deterministic latencies) with a completely different physical implementation (a memory system with banks and probabilistic mapping). This allows designers to effectively decouple the analysis of their algorithms and data structures from the analysis of the memory buses and banks. Unlike specialized hardware customized for a specific data-plane algorithm, our system makes no assumption about the memory access patterns. We present a mathematical argument for our system's ability to provably provide bandwidth with high confidence and demonstrate its functionality and area overhead through a synthesizable design. We further show that, even though our scheme is general-purpose enough to support new applications such as packet reassembly, it outperforms the state-of-the-art in specialized packet buffering architectures.


Symposium on Code Generation and Optimization | 2006

Profiling over Adaptive Ranges

Shashidhar Mysore; Banit Agrawal; Timothy Sherwood; Nisheeth Shrivastava; Subhash Suri

Modern computer systems are called on to deal with billions of events every second, whether they are instructions executed, memory locations accessed, or packets forwarded. This presents a serious challenge to those who seek to quantify, analyze, or optimize such systems, because important trends and behaviors may easily be lost in a sea of data. We present range adaptive profiling (RAP) as a new and general purpose profiling method capable of hierarchically classifying streams of data efficiently in hardware. Through the use of RAP, events in an input stream are dynamically classified into increasingly precise categories based on the frequency with which they occur. The more important a class, or range of events, the more precisely it is quantified. Despite the dynamic nature of our technique, we build upon tight theoretical bounds covering both worst-case error and the required memory. In the limit, it is known that the error and memory bounds can be independent of the stream size and grow only linearly with the level of precision desired. Significantly, we expose the critical constants in these algorithms and, through careful engineering, algorithm re-design, and use of heuristics, we show how a high-performance profiling system can be implemented for range adaptive profiling. RAP can be used on various profiles such as PCs, load values, and memory addresses, and has a broad range of uses, from hot-region profiling to quantifying cache miss value locality. We propose two methods of implementation, one in software and the other with specialized hardware, and we show that with just 8 KB of memory, range profiles can be gathered with an average accuracy of 98%.
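The core mechanism can be sketched in a few lines: begin with one coarse range and split any range whose event count crosses a threshold, so hot regions are profiled at progressively finer granularity. The toy below omits the merging, error bounds, and fixed memory budget of the real design; the threshold and workload are made up.

```python
import random

SPLIT_THRESHOLD = 64   # hypothetical; the real design manages this adaptively

class Range:
    """A profiled address range that splits in two once it gets hot."""
    def __init__(self, lo, hi):
        self.lo, self.hi, self.count = lo, hi, 0
        self.children = None

    def record(self, addr):
        if self.children is not None:          # already split: descend
            mid = (self.lo + self.hi) // 2
            self.children[addr >= mid].record(addr)
            return
        self.count += 1
        if self.count >= SPLIT_THRESHOLD and self.hi - self.lo > 1:
            mid = (self.lo + self.hi) // 2     # hot: refine this range
            self.children = (Range(self.lo, mid), Range(mid, self.hi))

    def leaves(self):
        if self.children is not None:
            return self.children[0].leaves() + self.children[1].leaves()
        return [(self.lo, self.hi, self.count)]

root = Range(0, 1 << 16)
for _ in range(10_000):
    if random.random() < 0.8:                  # hot region around 0x1200
        addr = int(random.gauss(0x1200, 64))
    else:                                      # uniform background
        addr = random.randrange(1 << 16)
    root.record(addr & 0xFFFF)

finest = min(hi - lo for lo, hi, _ in root.leaves())
print(len(root.leaves()), "leaf ranges; finest is", finest, "addresses wide")
```

Cold regions stay covered by a handful of coarse ranges while the hot region is subdivided many times over, which is exactly the adaptive precision-for-importance trade the profile makes.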

Collaboration


Dive into Banit Agrawal's collaborations.

Top Co-Authors

Jun Yang

University of Pittsburgh
