Rama Sangireddy
Iowa State University
Publication
Featured research published by Rama Sangireddy.
IEEE Journal on Selected Areas in Communications | 2003
Rama Sangireddy; Arun K. Somani
With a rapid increase in data transmission link rates and immense continuous growth in Internet traffic, the demand for routers that perform Internet protocol packet forwarding at high speed and throughput is ever increasing. The key issue in router performance is the IP address lookup mechanism, based on the longest prefix matching scheme. Earlier work on fast Internet protocol version 4 (IPv4) routing table lookup includes software mechanisms based on tree traversal or binary search methods, and hardware schemes based on content addressable memory (CAM), memory lookups, and CPU caching. These schemes depend on memory access technology, which limits their performance. The paper presents a binary decision diagram (BDD)-based optimized combinational logic for an efficient implementation of a fast address lookup scheme in reconfigurable hardware. The results show that the BDD hardware engine gives a throughput of up to 175.7 million lookups per second (Ml/s) for a large AADS routing table with 33,796 prefixes, up to 168.6 Ml/s for an MAE-West routing table with 29,487 prefixes, and up to 229.3 Ml/s for the Pacbell routing table with 6,822 prefixes. Beyond the performance of the scheme, routing table update and scalability to Internet protocol version 6 (IPv6) are also discussed.
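To illustrate the lookup problem the paper optimizes, the following is a minimal sketch of longest prefix matching using a binary trie. This is not the BDD-based scheme itself, and the prefixes and port names are made up for illustration.

```python
# Minimal sketch of longest-prefix-match (LPM) lookup with a binary trie.
# NOT the paper's BDD scheme; prefixes and next hops below are hypothetical.

class TrieNode:
    def __init__(self):
        self.children = {}    # bit ('0' or '1') -> TrieNode
        self.next_hop = None  # set if a routing prefix ends at this node

def insert(root, prefix_bits, next_hop):
    node = root
    for bit in prefix_bits:
        node = node.children.setdefault(bit, TrieNode())
    node.next_hop = next_hop

def lookup(root, addr_bits):
    node, best = root, None
    for bit in addr_bits:
        if node.next_hop is not None:
            best = node.next_hop      # remember longest match so far
        node = node.children.get(bit)
        if node is None:
            break                     # no deeper prefix matches
    else:
        if node.next_hop is not None:
            best = node.next_hop
    return best

root = TrieNode()
insert(root, "1100", "port1")    # hypothetical prefixes
insert(root, "110010", "port2")
print(lookup(root, "11001011"))  # longest match is "110010" -> port2
```

A lookup walks at most W trie levels (W = address width), which is exactly the memory-access dependence that the paper's combinational-logic approach avoids.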
IEEE Transactions on Computers | 2006
Rama Sangireddy
In modern high-performance processors, the complexity of the register rename logic grows along with the pipeline width and leads to larger renaming time delay and higher power consumption. Renaming logic in the front-end of the processor is one of the largest contributors to peak temperatures on the chip and, thus, demands attention to reduce its power consumption. Further, with the advent of clustered microarchitectures, the rename map table at the front-end is shared by the clusters and, hence, its critical path delay should not become a bottleneck in determining the processor clock cycle time. Analysis of the characteristics of Spec2000 integer benchmark programs reveals that, when the programs are processed in a 4-wide processor, none or only one two-source instruction (an instruction with two source registers) is renamed in a cycle for 94 percent of the total execution time. Similarly, in an 8-wide processor, none or only one two-source instruction is renamed in a cycle for 92 percent of the total execution time. Thus, the rename map table port bandwidth is highly underutilized for a significant portion of the time. Based on this analysis, in this paper, we propose a novel technique to significantly reduce the number of ports in the rename map table. The novelty of the technique is that it is easy to implement and succeeds in reducing the access time, power, and area of the rename logic, without any additional power, area, or delay overheads in any other logic on the chip. The proposed technique performs the register renaming of instructions in the order of their fetch, with no significant impact on the processor's performance.
With this technique in an 8-wide processor, as compared to a conventional rename map table in an integer pipeline with 16 ports to look up source operands, a rename map table with nine ports results in a reduction in access time, power, and area by 14 percent, 42 percent, and 49 percent, respectively, with only a 4.7 percent loss in instructions committed per cycle (IPC). The implementation of the technique in a 4-wide processor results in a reduction in access time, power, and area by 7 percent, 38 percent, and 59 percent, respectively, with an IPC loss of only 4.4 percent.
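The port-underutilization observation above can be sketched as a toy model: group a stream of instructions into rename groups of the pipeline width and histogram how many two-source instructions appear per cycle. The instruction stream here is synthetic; the paper's measurements used Spec2000 integer benchmarks.

```python
# Toy model of rename map-table read-port demand. The stream of per-
# instruction source-register counts is synthetic, for illustration only.

from collections import Counter

def port_demand_per_cycle(stream, width):
    """stream: per-instruction source-register counts (0, 1, or 2)."""
    histogram = Counter()
    for i in range(0, len(stream), width):
        group = stream[i:i + width]                # one rename group per cycle
        two_src = sum(1 for n in group if n == 2)  # two-source instructions
        histogram[two_src] += 1
    return histogram

# Mostly 0/1-source instructions, mirroring the trend the analysis observes.
stream = [1, 0, 2, 1, 0, 1, 1, 2, 0, 1, 1, 0, 2, 1, 1, 1]
hist = port_demand_per_cycle(stream, width=4)
print(hist)  # cycles keyed by number of two-source instructions renamed
```

When the histogram mass sits at 0 or 1 two-source instructions per cycle, most map-table read ports sit idle, which is what motivates cutting the port count.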
IEEE Transactions on Computers | 2004
Rama Sangireddy; Huesung Kim; Arun K. Somani
The demand for higher computing power and, thus, more on-chip computing resources is ever increasing. The size of on-chip cache memory has also been consistently increasing to keep up with developments in implementation technology. However, some applications may not utilize full cache capacity and, on the contrary, require more computing resources. To efficiently utilize silicon real estate on the chip, we exploit the possibility of using a part of cache memory for computational purposes to strike a balance in the usage of memory and computing resources for various applications. In an earlier part of our work, the idea of the adaptive balanced computing (ABC) architecture was evolved, where a module of an L1 data cache is used as a coprocessor controlled by the main processor. A part of an L1 data cache is designed as a reconfigurable functional cache (RFC) that can be configured to perform a selective core function in a media application whenever such computing capability is required. The ABC architecture provides speedups ranging from 1.04x to 5.0x for various media applications. We show that a reduced number of cache accesses and lesser utilization of other on-chip resources, due to a significant reduction in the execution time of an application, result in power savings. For this purpose, we first develop a model to compute the power consumed by the RFC while accelerating the computation of multimedia applications. The results show that up to a 60 percent reduction in power consumption is achieved for MPEG decoding and a reduction in the range of 10 to 20 percent for various other multimedia applications. Beyond the discussions in earlier work on the ABC architecture, we present a detailed circuit-level implementation of the core functions in the RFC modules. Further, we study the impact of converting the conventional cache into an RFC on both access time and energy consumption.
The analysis is performed on a wide spectrum of cache organizations, with sizes varying from 8 KB to 256 KB and with varying set associativity.
international conference on computer design | 2004
Rama Sangireddy; Arun K. Somani
A large register file with multiple ports, yet minimal access time, is a critical component of a superscalar processor. Analysis of the lifetime of a logical-to-physical register mapping reveals that there are long latencies between the times a physical register is allocated, consumed, and released. In this paper, we propose the TriBank register file, a novel register file organization that exploits such long latencies, resulting in larger register bandwidth and smaller register access time. Implementation of the TriBank register file organization, as compared to a conventional monolithic register file in an 8-wide out-of-order issue superscalar processor, reduced the register access time by up to 34%, while also enhancing the throughput in instructions per cycle (IPC) by 3% and 14% for SpecInt2000 and SpecFP2000, respectively.
international conference on computer communications and networks | 2001
Rama Sangireddy; Arun K. Somani
With immense continuous growth in Internet traffic, the demand for routers that perform IP routing at high speed and throughput is ever increasing. The key issue in router performance is the IP routing lookup mechanism, based on the longest prefix matching scheme. Earlier works on fast IPv4 routing table lookup are based on content addressable memory (CAM), memory lookups, and CPU caching. These schemes depend on memory access technology, which limits their performance. Moreover, most of these address lookup schemes, designed for the 32-bit IPv4 address, are not extensible to the forthcoming IPv6, where the IP address is 128 bits long. The paper presents a binary decision diagram-based optimized combinational logic for an efficient implementation of a fast address lookup scheme in reconfigurable hardware. The experimental results show that, for the large MAE-East routing table with 32-bit IP addresses, the number of redundant nodes is more than 99.99% in constructing the binary decision tree. With binary encoding of the output port, an additional 36% reduction is obtained in the number of effective nodes. Beyond the performance of the scheme, issues relating to routing table update and scalability to IPv6 are discussed.
international conference on computer communications and networks | 2003
Natsuhiko Futamura; Rama Sangireddy; Srinivas Aluru; Arun K. Somani
IP address lookup algorithms can be evaluated on a number of metrics: lookup time, update time, memory usage, and, to a lesser extent, the time to construct the data structure used to support lookups and updates. Many of the existing methods are geared towards optimizing a specific metric and hence do not scale well with ever-expanding routing tables and the forthcoming IPv6, with its 128-bit IP address. In contrast, our effort is directed at simultaneously optimizing multiple metrics and providing solutions that scale well to IPv6. In this paper, we present two IP address lookup schemes: the Elevator-Stairs algorithm and the logW-Elevators algorithm. For a routing table with N prefixes, the Elevator-Stairs algorithm uses optimal O(N) memory and achieves better lookup and update times than other methods with similar memory requirements. The logW-Elevators algorithm gives O(log W) lookup time, where W is the length of an IP address, while improving upon update time and memory usage. Experimental results using the MAE-West router with 29,487 prefixes show that the Elevator-Stairs algorithm gives an average throughput of 15.7 million lookups per second (Mlps) using 459 KB of memory, and the logW-Elevators algorithm gives an average throughput of 21.41 Mlps with a memory usage of 1259 KB.
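The classic route to O(log W) lookup time is binary search over prefix lengths, with one hash table per length. The sketch below illustrates that complexity class only; it is not the logW-Elevators algorithm itself, and it omits the marker entries a production scheme needs to make the binary search correct for arbitrary prefix sets. The prefixes and next hops are hypothetical.

```python
# Hedged sketch: binary search over prefix lengths, the textbook way to
# reach O(log W) lookups (W = address length in bits). Illustrative only;
# real schemes add marker entries so the search never misses a match.

def build_tables(prefixes):
    """prefixes: dict of bit-string prefix -> next hop.
    Returns one hash table per distinct prefix length."""
    tables = {}
    for p, hop in prefixes.items():
        tables.setdefault(len(p), {})[p] = hop
    return tables

def lookup(tables, addr_bits):
    lengths = sorted(tables)  # candidate prefix lengths, ascending
    best = None
    lo, hi = 0, len(lengths) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        L = lengths[mid]
        hop = tables[L].get(addr_bits[:L])
        if hop is not None:
            best = hop          # match found: try longer prefixes
            lo = mid + 1
        else:
            hi = mid - 1        # no match: try shorter prefixes
    return best

tables = build_tables({"1": "a", "110": "b"})  # hypothetical prefixes
print(lookup(tables, "1101"))
```

Each probe halves the set of candidate lengths, giving at most log W hash lookups per address instead of the W steps a bit-at-a-time trie walk takes.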
IEEE Transactions on Circuits and Systems Ii-express Briefs | 2006
Rama Sangireddy; Arun K. Somani
Applications, depending on their nature, demand either higher computing capacity, larger data-storage capacity, or both. Hence, providing on-chip memory and computing resources that are fixed in nature is expensive and does not enable an efficient utilization of on-chip silicon real estate. In this brief, we design the circuit of an adaptive register file computing (ARC) unit, a novel on-chip dual-role circuit with a minimal area overhead of 0.233 mm2 at 0.18-μm technology. It supplements the conventional register bank to provide larger register storage capacity or acts as a specialized computing unit to provide higher on-chip computing capacity, depending on the requirement of a specific application. The brief discusses the circuit-level details for the implementation of the dual-role ARC unit, its integration in a wide-issue processor pipeline, and the corresponding performance enhancement in various multimedia applications.
application-specific systems, architectures, and processors | 2004
Rama Sangireddy
A large register file with multiple ports is a critical component of a high-performance processor. A large number of registers is necessary for processing a larger number of in-flight instructions to exploit higher instruction-level parallelism (ILP). Multiple ports for a register file are necessary to support execution of multiple instructions each cycle. These necessities lead to a larger register access time. However, register access time has to be minimal to enable the design of high-frequency processors. Analysis of the lifetime of a logical-to-physical register mapping reveals that there are long latencies between the times a physical register is allocated, consumed, and released. We propose a dual-bank register file organization that exploits such long latencies, resulting in large bandwidth with reduced register access time. Implementation of one flavor of the proposed register file organization, as compared to a conventional monolithic register file in an 8-wide out-of-order issue superscalar processor, enhanced instructions per cycle (IPC) throughput by up to 6% for Spec2000 applications while increasing register access time by up to 22%. Another flavor of the register file organization, with an access time similar to the conventional monolithic register file, enhanced the IPC by up to 15%. Thus, a trade-off between register access time and ILP exploitation is shown.
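The register-lifetime analysis that motivates the banked organizations can be sketched as a toy model: for each physical register, measure the gap between allocation (rename) and last read, and between last read and release. The event trace below is synthetic, not from a real pipeline simulation.

```python
# Toy model of physical-register lifetime analysis. The event trace is
# synthetic; real analyses use cycle-accurate simulation of Spec2000.

def lifetime_gaps(events):
    """events: list of (cycle, op, preg) with op in {'alloc','read','free'}.
    Returns (alloc -> last read, last read -> free) gaps per freed register."""
    alloc, last_read = {}, {}
    gaps = []
    for cycle, op, preg in events:
        if op == 'alloc':
            alloc[preg] = cycle
        elif op == 'read':
            last_read[preg] = cycle
        elif op == 'free':
            gaps.append((last_read[preg] - alloc[preg],  # alloc -> last read
                         cycle - last_read[preg]))       # last read -> free
    return gaps

events = [(0, 'alloc', 'p7'), (3, 'read', 'p7'), (9, 'read', 'p7'),
          (40, 'free', 'p7')]
print(lifetime_gaps(events))  # [(9, 31)]: long idle tail before release
```

A long idle tail between last read and release means a register can live in a slower, less-ported bank for much of its lifetime, which is the slack a banked organization exploits.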
ieee international conference on high performance computing data and analytics | 2002
Rama Sangireddy; Huesung Kim; Arun K. Somani
The demand for higher computing power, and thus more on-chip computing resources, is ever increasing. The size of on-chip cache memory has also been consistently increasing. To efficiently utilize silicon real estate on the chip, a part of the L1 data cache is designed as a Reconfigurable Functional Cache (RFC) that can be configured to perform a selective core function in a media application whenever higher computing capability is required. The idea of the Adaptive Balanced Computing (ABC) architecture was developed, where the RFC module is used as a coprocessor controlled by the main processor. Initial results have proved that the ABC architecture provides speedups ranging from 1.04x to 5.0x for various media applications. In this paper, we address the impact of the RFC on cache access time and energy dissipation. We show that a reduced number of cache accesses and lesser utilization of other on-chip resources result in energy savings of up to 60% for MPEG decoding, and in the range of 10% to 20% for various other multimedia applications.
ieee international conference on high performance computing, data, and analytics | 2003
Rama Sangireddy; Huesung Kim; Arun K. Somani
The concept of a reconfigurable coprocessor controlled by the general-purpose processor, with the coprocessor acting as a specialized functional unit, has evolved to accelerate applications requiring higher computing power. The idea of the Adaptive Balanced Computing (ABC) architecture has evolved, where a module of the Reconfigurable Functional Cache (RFC) is configured with a selective core function in the application whenever higher computing resources are required. Initial results have proved that the ABC architecture provides speedups ranging from 1.04x to 5.0x depending on the application, and speedups in the range of 2.61x to 27.4x are observed for the core functions. This paper further explores the issues of management of the RFC, where the impact of various schemes for configuring a core function into the RFC module is studied. This paper also gives a detailed analysis of the performance of the ABC architecture for various configuration schemes, including a study of the effect of the percentage of the core function in an entire application on the management of RFC modules.