Aniruddha S. Vaidya
Intel
Publication
Featured research published by Aniruddha S. Vaidya.
high-performance computer architecture | 1997
Sucheta Chodnekar; Viji Srinivasan; Aniruddha S. Vaidya; Anand Sivasubramaniam; Chita R. Das
The interconnection network (ICN) is a vital component of a parallel machine and is often the limiting factor in the performance of several parallel applications. While ICN performance evaluation has been a widely researched topic, there have been very few studies that have used real applications to drive this research. In this paper we develop a framework for characterizing the communication properties of parallel applications. Message generation frequency, spatial distribution of messages and message length are the three attributes that quantify any communication. We develop a methodology to quantify these attributes, in particular the first two attributes. We employ two strategies, namely dynamic and static, in our methodology. In the former, the applications are executed on an execution-driven simulator called SPASM, while in the latter they are executed on a parallel machine, IBM SP2. We gather communication events from these executions and feed them to a 2-D mesh network simulator. The log of the network activity is then analyzed using a statistical analysis package (SAS) to find the message inter-arrival time distribution and spatial distribution via regression analysis. Five shared memory applications and two message passing applications are analyzed to quantify their communication workloads. It is shown that it is possible to express the message generation and spatial distribution of an application in terms of commonly used distributions. These distributions can be used in the analysis of ICNs for developing realistic performance models.
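The regression step described in this abstract can be illustrated with a minimal sketch. It assumes the inter-arrival times follow an exponential distribution and fits the rate by least squares on the linearized empirical CDF. The function name `fit_exponential_rate` and the synthetic samples are hypothetical; the paper itself used SAS on logged network activity, and the distributions it reports may differ.

```python
import math
import random

def fit_exponential_rate(inter_arrivals):
    """Estimate lambda for F(t) = 1 - exp(-lambda * t).

    Linearize the CDF as -ln(1 - F(t)) = lambda * t and fit the slope
    by least-squares regression through the origin.
    """
    xs = sorted(inter_arrivals)
    n = len(xs)
    num = 0.0
    den = 0.0
    for i, t in enumerate(xs):
        f_emp = (i + 0.5) / n          # empirical CDF, kept strictly below 1
        y = -math.log(1.0 - f_emp)     # linearized CDF value
        num += t * y
        den += t * t
    return num / den

# Synthetic stand-in for a message log (NOT data from the paper).
rng = random.Random(0)
true_rate = 0.25
samples = [rng.expovariate(true_rate) for _ in range(5000)]
estimated_rate = fit_exponential_rate(samples)
```

With a real trace, `samples` would be the differences between consecutive message generation timestamps at one node.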
IEEE Transactions on Parallel and Distributed Systems | 2001
Aniruddha S. Vaidya; Anand Sivasubramaniam; Chita R. Das
Research on multiprocessor interconnection networks has primarily focused on wormhole switching, virtual channel flow control, and routing algorithms to enhance their performance. The rationale behind this research is that by alleviating the network latency for high network loads, the overall system performance would improve; many studies have used synthetic workloads to support this claim. However, such workloads may not necessarily capture the behavior of real applications. In this paper, we have used parallel applications for a closer examination of the network behavior. In particular, the performance benefit from enhancing a 2D mesh with virtual channels (VCs) and a fully adaptive routing algorithm is examined with a set of shared-memory and message passing applications. Execution time and average message latency of shared memory applications are measured using execution-driven simulation and by varying many architectural attributes that affect the network workload. The communication traces of message passing applications, collected on an IBM-SP2, are used to run a trace-driven simulation of the mesh architecture to obtain message latency. Simulation results show that VCs and adaptive routing can reduce the network latency to varying degrees depending on the application. However, these modest benefits do not translate to significant improvements in the overall execution time because the load on the network is not high enough to exploit the advantages of the network enhancements. Moreover, this benefit may be negated if the architectural enhancements increase the network cycle time. Rather, emphasis should be placed on improving the raw network bandwidth and faster network interfaces.
high-performance computer architecture | 1999
Aniruddha S. Vaidya; Anand Sivasubramaniam; Chita R. Das
Earlier research has shown that adaptive routing can help in improving network performance. However, it has not received adequate attention in commercial routers mainly due to the additional hardware complexity, and the perceived cost and performance degradation that may result from this complexity. These concerns can be mitigated if one can design a cost-effective router that can support adaptive routing. This paper proposes a three-step recipe for cost-effective, high-performance pipelined adaptive router design, called the LAPSES approach: Look-Ahead routing, intelligent Path Selection, and an Economic Storage implementation. The first step, look-ahead routing, removes a pipeline stage from the router by making table lookup and arbitration concurrent. Next, three new traffic-sensitive path selection heuristics (LRU, LFU and MAX-CREDIT) are proposed to select one of the available alternate paths. Finally, two techniques for reducing routing table size of the adaptive router are presented. These are called meta-table routing and economical storage. The proposed economical storage needs a routing table with only 9 and 27 entries for two- and three-dimensional meshes, respectively. All these design ideas are evaluated on a 16×16 mesh network via simulation. A fully adaptive algorithm and various traffic patterns are used to examine the performance benefits. Performance results show that the look-ahead design as well as the path selection heuristics boost network performance, while the economical storage approach turns out to be an ideal choice in comparison to full-table and meta-table options. We believe the router resulting from these three design enhancements can make adaptive routing a viable choice for interconnects.
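The abstract does not say how the 9 and 27 table entries arise. One reading consistent with those counts (3^2 and 3^3) is a table indexed by the sign of the destination offset in each dimension. The sketch below illustrates that assumption only; the names `build_sign_table` and `candidate_ports` and the port encoding are hypothetical and not taken from the paper.

```python
from itertools import product

def sign(x):
    """Return -1, 0, or +1 according to the sign of x."""
    return (x > 0) - (x < 0)

def build_sign_table(dims):
    """Build one entry per combination of offset signs.

    Each entry lists the minimal-path output ports as (dimension, direction)
    pairs. An empty list means the packet has reached its destination.
    """
    table = {}
    for signs in product((-1, 0, 1), repeat=dims):
        table[signs] = [(d, s) for d, s in enumerate(signs) if s != 0]
    return table

def candidate_ports(table, current, dest):
    """Look up the admissible output ports for a packet at `current`."""
    key = tuple(sign(d - c) for c, d in zip(current, dest))
    return table[key]

table_2d = build_sign_table(2)   # 9 entries
table_3d = build_sign_table(3)   # 27 entries
ports = candidate_ports(table_2d, (3, 3), (5, 1))  # +x and -y are both minimal
```

Under this assumption the table size depends only on the number of dimensions, not on the number of nodes, which is what makes it smaller than a full per-destination table.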
international conference on supercomputing | 1997
Aniruddha S. Vaidya; Anand Sivasubramaniam; Chita R. Das
Recent research on multiprocessor interconnection networks has primarily focused on wormhole switching, virtual channel flow control and routing algorithms. These architectural features are aimed at enhancing the network performance by reducing the network latency, which in turn should improve the overall system performance. Many research results support this design philosophy by claiming significant reduction in average message latency. However, these conclusions are drawn using synthetic workloads that may not necessarily capture the behavior of real applications. In this paper, we have used parallel applications for a closer examination of the network behavior. In particular, the performance benefit from enhancing a 2-D mesh with virtual channels (VCs) and a routing algorithm (oblivious or fully adaptive) is examined with five shared memory applications using an execution-driven simulator, SPASM. In order to analyze the performance implications in greater detail, we also consider other parameters that have a direct bearing on network traffic. These are the number of processors used to solve a problem, problem size and memory consistency model. Simulation results show that VCs can reduce the network latency to varying degrees depending on the application. Similar gain is possible with a fully adaptive routing algorithm compared to the oblivious routing. However, with respect to the overall execution time, the performance benefit using these enhancements is negligible. Moreover, this benefit is negated when we consider the cost of implementing the VCs. These results suggest that the performance rewards may not justify the cost of these enhancements. Rather, we need to emphasize improving the raw network bandwidth through simpler and improved router designs.
international symposium on computer architecture | 2013
Aniruddha S. Vaidya; Anahita Shayesteh; Dong Hyuk Woo; Roy Saharoy; Mani Azimi
SIMD execution units in GPUs are increasingly used for high performance and energy efficient acceleration of general purpose applications. However, SIMD control flow divergence effects can result in reduced execution efficiency in a class of GPGPU applications, classified as divergent applications. Improving SIMD efficiency, therefore, has the potential to bring significant performance and energy benefits to a wide range of such data parallel applications. Recently, the SIMD divergence problem has received increased attention, and several micro-architectural techniques have been proposed to address various aspects of this problem. However, these techniques are often quite complex and, therefore, unlikely candidates for practical implementation. In this paper, we propose two micro-architectural optimizations for GPGPU architectures, which utilize relatively simple execution cycle compression techniques when certain groups of turned-off lanes exist in the instruction stream. We refer to these optimizations as basic cycle compression (BCC) and swizzled-cycle compression (SCC), respectively. In this paper, we will outline the additional requirements for implementing these optimizations in the context of the studied GPGPU architecture. Our evaluations with divergent SIMD workloads from OpenCL (GPGPU) and OpenGL (graphics) applications show that BCC and SCC reduce execution cycles in divergent applications by as much as 42% (20% on average). For a subset of divergent workloads, the execution time is reduced by an average of 7% for today's GPUs or by 18% for future GPUs with a better provisioned memory subsystem. The key contribution of our work is in simplifying the micro-architecture for delivering divergence optimizations while providing the bulk of the benefits of more complex approaches.
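The idea of cycle compression can be sketched with a simple cycle-count model. It assumes a wide SIMD instruction is issued over several cycles on a narrower execution unit, one group of lanes per cycle. The group width of 4, the function names, and the unrestricted lane permutation in the SCC case are all assumptions for illustration; the hardware described in the paper may constrain swizzling more tightly.

```python
def lane_groups(mask, width=4):
    """Split an execution mask into consecutive groups of `width` lanes."""
    return [mask[i:i + width] for i in range(0, len(mask), width)]

def baseline_cycles(mask, width=4):
    """Without compression, every lane group costs one cycle."""
    return len(lane_groups(mask, width))

def bcc_cycles(mask, width=4):
    """Basic compression: skip groups in which every lane is turned off."""
    return sum(1 for group in lane_groups(mask, width) if any(group))

def scc_cycles(mask, width=4):
    """Swizzled compression (idealized): pack active lanes into full groups."""
    active = sum(1 for lane in mask if lane)
    return -(-active // width)  # ceiling division

# Hypothetical 16-lane mask with three active lanes spread over three groups.
mask = [1, 0, 0, 0,  0, 0, 0, 0,  0, 1, 0, 0,  0, 0, 1, 0]
cycles = (baseline_cycles(mask), bcc_cycles(mask), scc_cycles(mask))
```

In this model BCC helps only when whole groups are idle, while SCC also recovers cycles when the active lanes are scattered across groups.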
IEEE Transactions on Parallel and Distributed Systems | 2002
Ki Hwan Yum; Eun Jung Kim; Chita R. Das; Aniruddha S. Vaidya
With the increasing use of clusters in real-time applications, it has become essential to design high-performance networks with quality-of-service (QoS) guarantees. We explore the feasibility of providing QoS in wormhole switched routers, which are widely used in designing scalable, high-performance cluster interconnects. In particular, we are interested in supporting multimedia video streams with CBR and VBR traffic, in addition to the conventional best-effort traffic. The proposed MediaWorm router uses a rate-based bandwidth allocation mechanism, called Fine-Grained VirtualClock (FGVC), to schedule network resources for different traffic classes. Our simulation results on an 8-port router indicate that it is possible to provide jitter-free delivery to VBR/CBR traffic up to an input load of 70-80 percent of link bandwidth and the presence of best-effort traffic has no adverse effect on real-time traffic. Although the MediaWorm router shows a slightly lower performance than a pipelined circuit switched (PCS) router, commercial success of wormhole switching, coupled with simpler and cheaper design, makes it an attractive alternative. Simulation of a 2×2 fat-mesh using this router shows performance comparable to that of a single switch and suggests that clusters designed with appropriate bandwidth balance between links can provide required performance for different types of traffic.
high performance computer architecture | 2000
Ki Hwan Yum; Aniruddha S. Vaidya; Chita R. Das; Anand Sivasubramaniam
With the increasing use of clusters in real-time applications, it has become essential to design high performance networks with quality of service (QoS) guarantees. In this paper, we explore the feasibility of providing QoS in worm-hole switched routers, which are otherwise well known for designing high performance interconnects. In particular, we are interested in supporting multimedia video streams, in addition to the conventional best-effort traffic. The proposed MediaWorm router uses a rate-based bandwidth allocation mechanism, called Virtual Clock, to schedule network resources for different traffic classes. Our simulation results on an 8-port router indicate that it is possible to provide jitter-free delivery to VBR/CBR traffic up to an input load of 70-80% of link bandwidth, and the presence of best effort traffic has no adverse effect on the real-time traffic. Although the MediaWorm router shows a slightly lower performance than a pipelined circuit switched (PCS) router, commercial success of worm-hole switching coupled with the simpler and cheaper design makes it an attractive alternative. Simulation of a 2×2 fat-mesh using this router suggests that clusters designed with appropriate bandwidth balance between links can provide good performance for different types of traffic.
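Rate-based scheduling with Virtual Clock can be illustrated with a minimal sketch of the classic algorithm: each flow keeps a virtual clock that advances by the inverse of its reserved rate per packet, and packets are served in order of their virtual timestamps. The class and method names are hypothetical, rates are in packets per time unit by assumption, and the router-level variant used in MediaWorm may differ in granularity and detail.

```python
import heapq

class VirtualClockScheduler:
    """Serve packets in increasing order of per-flow virtual timestamps."""

    def __init__(self):
        self._vtick = {}    # flow id -> virtual time consumed per packet
        self._aux_vc = {}   # flow id -> current virtual clock of the flow
        self._queue = []    # heap of (stamp, arrival order, flow id, payload)
        self._seq = 0       # arrival counter, breaks timestamp ties FIFO

    def add_flow(self, flow_id, rate):
        """Register a flow with a reserved rate (packets per time unit)."""
        self._vtick[flow_id] = 1.0 / rate
        self._aux_vc[flow_id] = 0.0

    def enqueue(self, flow_id, now, payload):
        """Stamp an arriving packet and queue it."""
        stamp = max(now, self._aux_vc[flow_id]) + self._vtick[flow_id]
        self._aux_vc[flow_id] = stamp
        heapq.heappush(self._queue, (stamp, self._seq, flow_id, payload))
        self._seq += 1

    def dequeue(self):
        """Return the payload with the smallest timestamp, or None if empty."""
        if not self._queue:
            return None
        return heapq.heappop(self._queue)[3]

# Hypothetical flows: the video stream reserves twice the best-effort rate.
sched = VirtualClockScheduler()
sched.add_flow("best_effort", rate=0.25)
sched.add_flow("video", rate=0.5)
for name in ("be0", "be1", "be2"):
    sched.enqueue("best_effort", now=0.0, payload=name)
for name in ("v0", "v1", "v2"):
    sched.enqueue("video", now=0.0, payload=name)
order = [sched.dequeue() for _ in range(6)]
```

Although the best-effort packets arrive first in this example, the video packets are interleaved ahead of most of them because their timestamps grow more slowly.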
IEEE Transactions on Parallel and Distributed Systems | 1999
Aniruddha S. Vaidya; Chita R. Das; Anand Sivasubramaniam
This paper presents a comprehensive evaluation testbed for interconnection networks and routing algorithms using real applications. The testbed is flexible enough to implement any network topology and fault-tolerant routing algorithm, and allows the system architect to study the cost versus performance trade-offs for a range of network parameters. We illustrate its use with one fault-tolerant algorithm and analyze the performance of four shared memory applications with different fault conditions. We also show how the testbed can be used to drive future research in fault-tolerant routing algorithms and architectures by proposing and evaluating novel architectural enhancements to the network router, called path selection heuristics (PSH). We propose three such schemes and the Least Recently Used (LRU) PSH is shown to give the best performance in the presence of faults.
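A least-recently-used path selection heuristic can be sketched as follows, assuming the router has already computed the set of admissible, non-faulty output ports for a message and only needs to pick one. The class name, the port labels, and the tie-breaking rule (first candidate wins) are hypothetical; the paper's exact definition of the LRU PSH may differ.

```python
class LruPathSelector:
    """Pick the candidate output port that was used least recently."""

    def __init__(self, ports):
        # -1 marks a port that has never been selected.
        self._last_used = {port: -1 for port in ports}
        self._clock = 0

    def select(self, candidates):
        """Choose among admissible ports; ties go to the first candidate."""
        choice = min(candidates, key=lambda port: self._last_used[port])
        self._last_used[choice] = self._clock
        self._clock += 1
        return choice

selector = LruPathSelector(["+x", "-x", "+y", "-y"])
picks = [
    selector.select(["+x", "+y"]),  # tie, first candidate wins
    selector.select(["+x", "+y"]),  # "+y" has not been used yet
    selector.select(["+x", "+y"]),  # "+x" was used longer ago than "+y"
    selector.select(["+y", "-x"]),  # "-x" has never been used
]
```

The effect is to spread consecutive messages across the alternate paths rather than repeatedly loading the same output port.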
IEEE Transactions on Computers | 2014
Dongkook Park; Aniruddha S. Vaidya; Akhilesh Kumar; Mani Azimi
The number of cores in a single chip keeps increasing with process technology scaling, requiring a scalable interconnection network topology. Buffered wormhole-switched interconnect architectures are attractive for such multicore architectures. The 2D mesh on-chip interconnect provides a scalable, cost-efficient, flexible, and reliable next-generation interconnect topology in this context. In this paper, we provide a microarchitecture for a power and area efficient router for a 2D mesh interconnect. We propose an efficient crossbar implementation, called MoDe-X, that uses a reasonable power-performance tradeoff. The MoDe-X router uses a Modular-Decoupled Crossbar (MoDe-X) that incorporates dimensional decomposition and segmentation to achieve power and area savings. However, unlike most prior work that considers only logical representation of the crossbars, MoDe-X is a physically aware router accounting for the actual layout of router components to reflect practical design requirements. Our simulation results and power estimates show that the MoDe-X router architectures can reduce the overall router area by up to 40 percent and power consumption by up to 35 percent with very little performance impact that occurs only at higher loads. Further, by applying aggressive power gating techniques the net power reductions can be as much as 99 percent for some workloads with no additional performance impact.
field programmable gate arrays | 2010
Donglai Dai; Aniruddha S. Vaidya; Roy Saharoy; Seungjoon Park; Dongkook Park; Hariharan Thantry; Ralf Plate; Elmar Maas; Akhilesh Kumar; Mani Azimi
Many-core chip multiprocessors can be expected to scale to tens of cores and beyond in the near future. Existing and emerging workloads on general-purpose many-core processors typically exhibit fast-changing, unpredictable on-chip communication traffic full of burstiness and jitter between different functional blocks. To provide high sustainable performance, scalable interconnects with a rich feature set including support for adaptive and flexible communication, performance isolation, and fault-tolerance are needed. 2D mesh and torus are attractive choices because they are physical layout friendly and scale more gracefully in network latency and bisection bandwidth than other simple interconnects such as buses or rings. However, the adoption of 2D mesh/torus in many-core processor designs is dependent on a verifiable and robust micro-architecture and a validated set of features. FPGA based systems have recently become a cost-effective, rapid prototyping vehicle for chip multiprocessor architectures. In this paper we present an FPGA based prototype of a 2D on-die interconnect architecture. Our prototype is a highly configurable full-scale design that supports options selecting many different micro-architectural features and routing algorithms. The prototype incorporates a synthetic traffic generator to exercise and evaluate our design. To facilitate evaluation and characterization, a rich development environment and novel software capabilities, including a very detailed performance visualization infrastructure, have been developed. We demonstrate the experimental results of several configurations on a 6x6 2D network emulator setup in this paper.