Rohit Sunkam Ramanujam

Publication

Featured research published by Rohit Sunkam Ramanujam.

Architectures for Networking and Communications Systems | 2010

Destination-based adaptive routing on 2D mesh networks

Rohit Sunkam Ramanujam; Bill Lin

The choice of routing algorithm plays a vital role in the performance of on-chip interconnection networks. Adaptive routing is appealing because it offers better latency and throughput than oblivious routing, especially under non-uniform and bursty traffic. The performance of an adaptive routing algorithm is determined by its ability to accurately estimate congestion in the network. In this regard, maintaining global congestion information using a separate monitoring network offers better congestion visibility into distant parts of the network than solutions relying only on local congestion state. However, the main challenge in designing such routing schemes is to keep the logic and bandwidth overhead as low as possible to fit into the tight power, area and delay budgets of on-chip routers. In this paper, we propose a minimal destination-based adaptive routing strategy (DAR) where every node estimates the delay to every other node in the network, and routing decisions are based on these per-destination delay estimates. DAR outperforms Regional Congestion Awareness (RCA), the best previously known adaptive routing algorithm that uses non-local congestion knowledge. This is because the per-destination delay estimates in DAR are more accurate and not corrupted by congestion on links outside the admissible routing paths to the destination. We show that DAR outperforms minimal adaptive routing by up to 65% and RCA by up to 41% in terms of latency on SPLASH-2 benchmarks. It also outperforms these algorithms in latency and throughput under synthetic traffic patterns on both 8×8 and 16×16 mesh topologies.
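The core idea of choosing among admissible minimal next hops by per-destination delay can be illustrated with a minimal Python sketch. It is not the DAR implementation from the paper: the `delay_est` table, its `(neighbour, destination)` keying, and the function names are assumptions made for illustration, and the sketch says nothing about how the estimates are gathered or propagated.

```python
from typing import Dict, List, Tuple

Node = Tuple[int, int]  # (x, y) coordinates on a 2D mesh


def productive_hops(cur: Node, dst: Node) -> List[Node]:
    """Neighbours of `cur` that lie on some minimal path to `dst`."""
    (x, y), (dx, dy) = cur, dst
    hops: List[Node] = []
    if dx != x:
        hops.append((x + (1 if dx > x else -1), y))
    if dy != y:
        hops.append((x, y + (1 if dy > y else -1)))
    return hops


def choose_next_hop(cur: Node, dst: Node,
                    delay_est: Dict[Tuple[Node, Node], float]) -> Node:
    """Pick the productive neighbour with the smallest estimated delay to `dst`.

    `delay_est[(neighbour, dst)]` is a hypothetical table of per-destination
    delay estimates; missing entries are treated as zero delay.
    """
    hops = productive_hops(cur, dst)
    if not hops:  # already at the destination
        return cur
    return min(hops, key=lambda n: delay_est.get((n, dst), 0.0))


# Hypothetical estimates: the X-direction neighbour looks more congested.
estimates = {((2, 1), (3, 3)): 9.0, ((1, 2), (3, 3)): 4.0}
next_hop = choose_next_hop((1, 1), (3, 3), estimates)
```

Because only neighbours on minimal paths are considered, congestion elsewhere in the mesh cannot influence the choice, which mirrors the abstract's point about estimates restricted to admissible routing paths.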

IEEE Computer Architecture Letters | 2009

A High-Throughput Distributed Shared-Buffer NoC Router

Vassos Soteriou; Rohit Sunkam Ramanujam; Bill Lin; Li-Shiuan Peh

Router microarchitecture plays a central role in the performance of an on-chip network (NoC). Buffers are needed in routers to house incoming flits which cannot be immediately forwarded due to contention. This buffering can be done at the inputs or the outputs of a router, corresponding to an input-buffered router (IBR) or an output-buffered router (OBR). OBRs are attractive because they can sustain higher throughputs and have lower queuing delays under high loads than IBRs. However, a direct implementation of an OBR requires a router speedup equal to the number of ports, making such a design prohibitive under aggressive clocking needs and limited power budgets of most NoC applications. In this paper, we propose a new router design that aims to emulate an OBR practically, based on a distributed shared-buffer (DSB) router architecture. We introduce innovations to address the unique constraints of NoCs, including efficient pipelining and novel flow control. We also present practical DSB configurations that can reduce the power overhead with negligible degradation in performance. The proposed DSB router achieves up to 19% higher throughput on synthetic traffic and reduces packet latency by 60% on average for SPLASH-2 benchmarks with high contention, compared to a state-of-the-art pipelined IBR. On average, the saturation throughput of DSB routers is within 10% of the theoretically ideal saturation throughput under the synthetic workloads evaluated.

Networks on Chips | 2010

Design of a High-Throughput Distributed Shared-Buffer NoC Router

Rohit Sunkam Ramanujam; Vassos Soteriou; Bill Lin; Li-Shiuan Peh

Microarchitectural configurations of buffers in routers have a significant impact on the overall performance of an on-chip network (NoC). This buffering can be at the inputs or the outputs of a router, corresponding to an input-buffered router (IBR) or an output-buffered router (OBR). OBRs are attractive because they have higher throughput and lower queuing delays under high loads than IBRs. However, a direct implementation of OBRs requires a router speedup equal to the number of ports, making such a design prohibitive given the aggressive clocking and power budgets of most NoC applications. In this letter, we propose a new router design that aims to emulate an OBR practically based on a distributed shared-buffer (DSB) router architecture. We introduce innovations to address the unique constraints of NoCs, including efficient pipelining and novel flow control. Our DSB design can achieve significantly higher bandwidth at saturation, with an improvement of up to 20% when compared to a state-of-the-art pipelined IBR with the same amount of buffering, and our proposed microarchitecture can achieve up to 94% of the ideal saturation throughput.

IEEE Computer Architecture Letters | 2008

Randomized Partially-Minimal Routing on Three-Dimensional Mesh Networks

Rohit Sunkam Ramanujam; Bill Lin

This letter presents a new oblivious routing algorithm for 3D mesh networks called Randomized Partially-Minimal (RPM) routing that provably achieves optimal worst-case throughput for 3D meshes when the network radix k is even and within a factor of 1/k² of optimal when k is odd. Although this optimality result has been achieved with the minimal routing algorithm O1TURN [9] for the 2D case, the worst-case throughput of O1TURN degrades tremendously in higher dimensions. Other existing routing algorithms suffer from either poor worst-case throughput (DOR [10], ROMM [8]) or poor latency (VAL [14]). RPM, on the other hand, achieves near-optimal worst-case and good average-case throughput as well as good latency performance.

International Conference on Computer Design | 2008

Near-optimal oblivious routing on three-dimensional mesh networks

Rohit Sunkam Ramanujam; Bill Lin

The increasing viability of three-dimensional (3D) silicon integration technology has opened new opportunities for chip architecture innovations. One direction is in the extension of two-dimensional (2D) mesh-based tiled chip-multiprocessor architectures into three dimensions. In this paper, we focus on efficient routing algorithms for such 3D mesh networks. As in the case of 2D mesh networks, throughput and latency are important design metrics for routing algorithms. Existing routing algorithms suffer from either poor worst-case throughput (DOR, ROMM) or poor latency (VAL). Although the minimal routing algorithm O1TURN already achieves near-optimal worst-case throughput for the 2D case, the optimality result does not extend to higher dimensions. For 3D and higher dimensional meshes, the worst-case throughput of O1TURN degrades tremendously. The main contribution of this paper is the design of a new oblivious routing algorithm for 3D mesh networks called randomized partially-minimal (RPM) routing. RPM provably achieves optimal worst-case throughput for 3D meshes when the network radix k is even and within a factor of 1/k² of optimal worst-case throughput when k is odd. RPM also outperforms VAL, DOR, ROMM, and O1TURN in average-case throughput by 33.3%, 111%, 47%, and 30%, respectively, when averaged over one million random traffic patterns on an 8×8×8 topology. Finally, whereas VAL achieves optimal worst-case throughput at a penalty factor of 2 in average latency over DOR, RPM achieves (near) optimal worst-case throughput with a much smaller factor of 1.33. In practice, the average latency of RPM is expected to be closer to minimal routing because 3D mesh networks are not expected to be symmetric in 3D chip designs. The number of available device layers is expected to be much less than the number of processor tiles that can be placed along an edge of a device layer. For practical asymmetric 3D mesh configurations, the average latency of RPM reduces to just a factor of 1.11 of DOR.
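The latency factors quoted above can be checked with a small hop-count calculation in Python. The factor of 2 for VAL follows from routing minimally to a uniformly random intermediate node and then minimally to the destination. The abstract does not describe how RPM builds its paths, so the `partial` figure below rests on an assumption: that the detour is confined to the vertical (Z) dimension through a randomly chosen intermediate layer. Hop count under uniform random traffic is used as a stand-in for latency, and the function names are illustrative.

```python
from fractions import Fraction
from typing import Tuple


def mean_dist(n: int) -> Fraction:
    """Mean |a - b| for a, b drawn uniformly from 0..n-1 (one mesh dimension)."""
    total = sum(abs(a - b) for a in range(n) for b in range(n))
    return Fraction(total, n * n)


def latency_ratios(kx: int, ky: int, kz: int) -> Tuple[Fraction, Fraction]:
    """Average hop count relative to minimal (DOR-style) routing.

    Returns (VAL / DOR, partially-minimal / DOR) under uniform random traffic.
    """
    dx, dy, dz = mean_dist(kx), mean_dist(ky), mean_dist(kz)
    dor = dx + dy + dz          # minimal routing in all three dimensions
    val = 2 * dor               # two minimal phases via a random intermediate node
    # ASSUMPTION: the detour is only along Z, via a random intermediate layer,
    # so the Z distance is paid twice while X and Y stay minimal.
    partial = dx + dy + 2 * dz
    return val / dor, partial / dor


symmetric = latency_ratios(8, 8, 8)      # 8x8x8 mesh
asymmetric = latency_ratios(16, 16, 4)   # hypothetical layout with few layers
```

Under this assumption the symmetric 8×8×8 case gives exactly 2 and 4/3, and the ratio falls toward 1 as the number of layers shrinks relative to the edge length; the 16×16×4 configuration is only an example and is not taken from the paper.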

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 2011

Extending the Effective Throughput of NoCs With Distributed Shared-Buffer Routers

Rohit Sunkam Ramanujam; Vassos Soteriou; Bill Lin; Li-Shiuan Peh

Router microarchitecture plays a central role in the performance of networks-on-chip (NoCs). Buffers are needed in routers to house incoming flits that cannot be immediately forwarded due to contention. This buffering can be done at the inputs or the outputs of a router, corresponding to an input-buffered router (IBR) or an output-buffered router (OBR). OBRs are attractive because they can sustain higher throughputs and have lower queuing delays under high loads than IBRs. However, a direct implementation of an OBR requires a router speedup equal to the number of ports, making such a design prohibitive under aggressive clocking needs and limited power budgets of most NoC applications. In this paper, a new router design based on a distributed shared-buffer (DSB) architecture is proposed that aims to practically emulate an OBR. The proposed architecture introduces innovations to address the unique constraints of NoCs, including efficient pipelining and novel flow control. Practical DSB configurations are also presented with reduced power overheads while exhibiting negligible performance degradation. Compared to a state-of-the-art pipelined IBR, the proposed DSB router achieves up to 19% higher throughput on synthetic traffic and reduces packet latency on average by 61% when running SPLASH-2 benchmarks with high contention. On average, the saturation throughput of DSB routers is within 7% of the theoretically ideal saturation throughput under the synthetic workloads evaluated.

Design Automation Conference | 2010

Trace-driven optimization of networks-on-chip configurations

Andrew B. Kahng; Bill Lin; Kambiz Samadi; Rohit Sunkam Ramanujam

Networks-on-chip (NoCs) are becoming increasingly important in general-purpose and application-specific multi-core designs. Although uniform router configurations are appropriate for general-purpose NoCs, router configurations for application-specific NoCs can be non-uniformly optimized to application-specific traffic characteristics. In this paper, we specifically consider the problem of virtual channel (VC) allocation in application-specific NoCs. Prior solutions to this problem have been average-rate driven. However, average-rate models are poor representations of real application traffic, and can lead to designs that are poorly matched to the application. We propose an alternative trace-driven paradigm in which configuration of NoCs is driven by application traces. We propose two simple greedy trace-driven VC allocation schemes. Compared to uniform allocation, we observe up to 51% reduction in the number of VCs under a given average packet latency constraint, or up to 74% reduction in average packet latency with the same number of VCs. Our results suggest that average-rate driven methods cannot effectively select appropriate links for VC allocation because they fail to consider the impact of traffic bursts. As a case study, we compare our proposed approach with an existing average-rate driven method and observe up to 35% reduction in the number of VCs for a given target latency.
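A greedy allocation of this general kind can be sketched in Python as follows. The abstract does not spell out the two schemes, so this is a generic greedy-addition loop rather than either of the paper's algorithms. The `latency_of` callback is a hypothetical stand-in for replaying an application trace on a candidate configuration, and the toy cost model at the end is invented purely to make the example runnable.

```python
from typing import Callable, Dict, Iterable


def greedy_vc_allocation(links: Iterable[str], budget: int,
                         latency_of: Callable[[Dict[str, int]], float],
                         start: int = 1) -> Dict[str, int]:
    """Add up to `budget` extra VCs, one at a time, to whichever link helps most.

    `latency_of(alloc)` is a hypothetical evaluator returning average packet
    latency for a per-link VC allocation (e.g. from a trace-driven simulation).
    """
    links = list(links)
    alloc = {link: start for link in links}
    for _ in range(budget):
        base = latency_of(alloc)
        best_link, best_gain = None, 0.0
        for link in links:
            alloc[link] += 1                  # tentatively add one VC
            gain = base - latency_of(alloc)
            alloc[link] -= 1                  # undo the tentative change
            if gain > best_gain:
                best_link, best_gain = link, gain
        if best_link is None:                 # no link improves latency further
            break
        alloc[best_link] += 1
    return alloc


# Toy evaluator (assumption): latency contribution of a link is load / VCs.
load = {"a": 8.0, "b": 2.0}
toy_latency = lambda alloc: sum(load[l] / alloc[l] for l in alloc)
result = greedy_vc_allocation(load, budget=2, latency_of=toy_latency)
```

With the toy model both extra VCs go to the heavily loaded link, giving a non-uniform allocation, which is the kind of outcome the abstract contrasts with uniform allocation.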

IEEE Transactions on Computers | 2013

Randomized Throughput-Optimal Oblivious Routing for Torus Networks

Rohit Sunkam Ramanujam; Bill Lin

In this paper, we study the problem of optimal oblivious routing for 1D and 2D torus networks. We introduce a new closed-form oblivious routing algorithm called W2TURN that is worst-case throughput optimal for 2D torus networks. W2TURN is based on a weighted random selection of paths that contain at most two turns. Restricting the maximum number of turns in routing paths to just two enables a simple deadlock-free implementation of W2TURN. In terms of average hop count, W2TURN outperforms the best previously known closed-form worst-case throughput optimal routing algorithm called IVAL. When the network radix is odd, W2TURN achieves the minimum average hop count that can be achieved with 2-turn paths while remaining worst-case throughput optimal. When the network radix is even, W2TURN comes very close to achieving the minimum average hop count while remaining worst-case throughput optimal, within just 0.72 percent on a 12×12 torus. We also describe another routing algorithm based on weighted random selection of paths with at most two turns called I2TURN and show that I2TURN is equivalent to IVAL. However, I2TURN eliminates the need for loop removal at runtime and provides a closed-form analytical expression for evaluating the average hop count. The latter enables us to demonstrate analytically that W2TURN strictly outperforms IVAL (and I2TURN) in average hop count. Finally, we present a new optimal weighted random routing algorithm for rings called Weighted Random Direction (WRD). WRD provides a closed-form expression for the optimal distribution of traffic along the minimal and nonminimal directions in a ring topology to achieve minimum average hop count while guaranteeing optimal worst-case throughput. Based on our evaluations, in addition to being worst-case throughput optimal, W2TURN and WRD also perform well in the average case, and outperform the best previously known worst-case throughput optimal routing algorithms with closed-form descriptions in latency and throughput over a wide range of traffic patterns.

Architectures for Networking and Communications Systems | 2009

Weighted random oblivious routing on torus networks

Rohit Sunkam Ramanujam; Bill Lin

Torus, mesh, and flattened butterfly networks have all been considered as candidate architectures for on-chip interconnection networks. In this paper, we study the problem of optimal oblivious routing for one of these architecture classes, namely, the torus network. We introduce a new closed-form oblivious routing algorithm called W2TURN that is worst-case throughput optimal for 2D-torus networks. W2TURN is based on a weighted random selection of paths that contain at most two turns. Restricting the maximum number of turns in routing paths to just two results in a simple deadlock-free implementation of W2TURN. In terms of average hop count, W2TURN outperforms the best previously known closed-form worst-case throughput optimal routing algorithm called IVAL [14]. We also provide another routing algorithm based on the weighted random selection of paths with at most two turns called I2TURN and show that it is equivalent to IVAL. However, I2TURN eliminates the need for loop removal at runtime and provides a closed-form analytical expression for evaluating the average hop count. The latter enables us to demonstrate analytically that W2TURN strictly outperforms IVAL (and I2TURN) in average hop count. Finally, we present a new optimal weighted random routing algorithm for rings called WRD (Weighted Random Direction). WRD provides a closed-form expression for the optimal distribution of traffic along the minimal and non-minimal directions in a ring topology to achieve minimum average hop count under maximum worst-case throughput.

ACM Transactions on Design Automation of Electronic Systems | 2013

Destination-based congestion awareness for adaptive routing in 2D mesh networks

Rohit Sunkam Ramanujam; Bill Lin

The choice of routing algorithm plays a vital role in the performance of on-chip interconnection networks. Adaptive routing is appealing because it offers better latency and throughput than oblivious routing, especially under nonuniform and bursty traffic. The performance of an adaptive routing algorithm is determined by its ability to accurately estimate congestion in the network. In this regard, maintaining global congestion state using a separate monitoring network offers better congestion visibility into distant parts of the network compared to solutions relying only on local congestion. However, the main challenge in designing such routing schemes is to keep the logic and bandwidth overhead as low as possible to fit into the tight power, area, and delay budgets of on-chip routers. In this article, we propose a minimal destination-based adaptive routing strategy (DAR), where every node estimates the delay to every other node in the network, and routing decisions are based on these per-destination delay estimates. DAR outperforms Regional Congestion Awareness (RCA), the best previously known adaptive routing algorithm that uses nonlocal congestion state. The performance improvement is brought about by maintaining fine-grained per-destination delay estimates in DAR that are more accurate than regional congestion metrics measured in RCA. The increased accuracy is a consequence of the fact that the per-destination delay estimates are not corrupted by congestion on links outside the admissible routing paths to the destination. A scalable version of DAR, referred to as SDAR, is also proposed for minimizing the overheads associated with DAR in large network topologies. We show that DAR outperforms local adaptive routing by up to 79% and RCA by up to 58% in terms of latency on SPLASH-2 benchmarks. DAR and SDAR also outperform existing adaptive and oblivious routing algorithms in latency and throughput under synthetic traffic patterns on 8×8 and 16×16 mesh topologies, respectively.
