Vijay Lakamraju

University of Massachusetts Amherst

Publication

Featured research published by Vijay Lakamraju.

field programmable gate arrays | 2000

Tolerating operational faults in cluster-based FPGAs

Vijay Lakamraju; Russell Tessier

In recent years the application space of reconfigurable devices has grown to include many platforms with a strong need for fault tolerance. While these systems frequently contain hardware redundancy to allow for continued operation in the presence of operational faults, the need to recover faulty hardware and return it to full functionality quickly and efficiently is great. In addition to providing functional density, FPGAs provide a level of fault tolerance generally not found in mask-programmable devices by including the capability to reconfigure around operational faults in the field. In this paper, incremental CAD techniques are described that allow functional recovery of FPGA design configurations in the presence of single or multiple operational faults. Our preferred approach to fault recovery takes advantage of device routing hierarchy in architectural families such as Xilinx Virtex [2] and Altera Apex [3] to quickly swap unused logic and routing resources in place of faulty ones within logic clusters. These algorithms allow for straightforward implementation within a local fault-tolerant system without the need to access a remote processing location. If initial recovery attempts through localized swapping fail, an incremental router based on the widely used PathFinder maze routing algorithm [10] can be applied remotely in an attempt to form connections between newly allocated logic and interconnect based on the history of the initial design route.

The Journal of Supercomputing | 2000

Application-Level Fault Tolerance as a Complement to System-Level Fault Tolerance

Joshua Haines; Vijay Lakamraju; Israel Koren; C. Mani Krishna

As multiprocessor systems become more complex, their reliability will need to increase as well. In this paper we propose a novel technique which is applicable to a wide variety of distributed real-time systems, especially those exhibiting data parallelism. System-level fault tolerance involves reliability techniques incorporated within the system hardware and software whereas application-level fault tolerance involves reliability techniques incorporated within the application software. We assert that, for high reliability, a combination of system-level fault tolerance and application-level fault tolerance works best. In many systems, application-level fault tolerance can be used to bridge the gap when system-level fault tolerance alone does not provide the required reliability. We exemplify this with the RTHT target tracking benchmark and the ABF beamforming benchmark.

IEEE Transactions on Parallel and Distributed Systems | 2007

Software-Based Failure Detection and Recovery in Programmable Network Interfaces

Yizheng Zhou; Vijay Lakamraju; Israel Koren; C. M. Krishna

Emerging network technologies have complex network interfaces that have renewed concerns about network reliability. In this paper, we present an effective low-overhead fault tolerance technique to recover from network interface failures. Failure detection is based on a software watchdog timer that detects network processor hangs and a self-testing scheme that detects interface failures other than processor hangs. The proposed self-testing scheme achieves failure detection by periodically directing the control flow to go through only active software modules in order to detect errors that affect instructions in the local memory of the network interface. Our failure recovery is achieved by restoring the state of the network interface using a small backup copy containing just the right amount of information required for complete recovery. The paper shows how this technique can be made to minimize the performance impact to the host system and be completely transparent to the user.

IEEE Transactions on Parallel and Distributed Systems | 2002

Filtering random graphs to synthesize interconnection networks with multiple objectives

Vijay Lakamraju; Israel Koren; C. M. Krishna

Synthesizing networks that satisfy multiple requirements, such as high reliability, low diameter, good embeddability, etc., is a difficult problem to which there has been no completely satisfactory solution. We present a simple, yet very effective, approach to this problem. The crux of our approach is a filtration process that takes as input a large set of randomly generated graphs and filters out those that do not meet the specified requirements. Our experimental results show that this approach is both practical and powerful. The use of random regular networks as the raw material for the filtration process was motivated by their surprisingly good performance with regard to almost all properties that characterize a good interconnection network. We provide results related to the generation of networks that have low diameter, high fault tolerance, and good embeddability. Through this, we show that the generated networks are serious competitors to several traditional well-known networks. We also explore how random networks can be used in a packaging hierarchy and comment on the scope of application of these networks.
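The filtration idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the pairing-model generator, and the single diameter criterion are assumptions chosen for illustration, whereas the paper filters on several properties (diameter, fault tolerance, embeddability).

```python
import random
from collections import deque


def random_regular_graph(n, d, rng):
    """Sample a simple d-regular graph on n nodes with the pairing model.

    Hypothetical helper: retries until the random pairing contains
    no self-loops and no parallel edges.
    """
    if (n * d) % 2 != 0 or d >= n:
        raise ValueError("need n*d even and d < n")
    while True:
        # Each node contributes d "stubs"; a shuffled list pairs them up.
        stubs = [v for v in range(n) for _ in range(d)]
        rng.shuffle(stubs)
        adj = {v: set() for v in range(n)}
        simple = True
        for i in range(0, len(stubs), 2):
            u, w = stubs[i], stubs[i + 1]
            if u == w or w in adj[u]:
                simple = False
                break
            adj[u].add(w)
            adj[w].add(u)
        if simple:
            return adj


def diameter(adj):
    """Return the diameter via BFS from every node, or None if disconnected."""
    worst = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        if len(dist) != len(adj):
            return None
        worst = max(worst, max(dist.values()))
    return worst


def filter_graphs(candidates, max_diameter):
    """Keep only connected candidates whose diameter meets the requirement."""
    kept = []
    for adj in candidates:
        diam = diameter(adj)
        if diam is not None and diam <= max_diameter:
            kept.append(adj)
    return kept
```

A further requirement would be added as another predicate inside `filter_graphs`, so that each candidate must pass every filter to survive.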

dependable systems and networks | 2003

Low overhead fault tolerant networking in Myrinet

Vijay Lakamraju; Israel Koren; C. M. Krishna

Emerging networking technologies have complex network interfaces that have renewed concerns about network reliability. In this paper, we present an effective low-overhead fault tolerance technique to recover from network interface failures, more particularly network processor hangs. We demonstrate the technique in the context of Myrinet. Fault recovery is achieved by restoring the state of the network interface using a small backup copy containing just the right amount of information required for complete recovery. Our fault detection is based on a software watchdog that detects network processor hangs. Results on the Myrinet platform show that the complete fault recovery can be achieved in under 2 seconds while incurring a latency overhead of just 1.5 µs during normal operation. The paper also shows how this fault recovery can be made completely transparent to the user.
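The watchdog-plus-backup pattern described in this abstract can be sketched as a small simulation. Everything here is hypothetical: the class names, the tick-based timing, and the single-field state are assumptions for illustration and do not reflect the Myrinet firmware or the authors' code. The sketch also copies the whole state as its backup, whereas the paper keeps only the information needed for recovery.

```python
class SoftwareWatchdog:
    """Counts ticks since the last kick and reports a hang at the timeout."""

    def __init__(self, timeout_ticks):
        self.timeout_ticks = timeout_ticks
        self._ticks_since_kick = 0

    def kick(self):
        # Called by the monitored code to signal that it is still alive.
        self._ticks_since_kick = 0

    def tick(self):
        # Called by the monitor; returns True once the timeout is reached.
        self._ticks_since_kick += 1
        return self._ticks_since_kick >= self.timeout_ticks


class SimulatedInterface:
    """Stand-in for a network interface processor (assumption, not a real API)."""

    def __init__(self):
        self.state = {"next_seq": 0}
        self.backup = dict(self.state)
        self.hung = False

    def step(self, watchdog):
        if self.hung:
            return  # a hung processor makes no progress and never kicks
        self.state["next_seq"] += 1
        self.backup = dict(self.state)  # checkpoint the recoverable state
        watchdog.kick()

    def recover(self):
        # Restore from the backup copy and resume normal operation.
        self.state = dict(self.backup)
        self.hung = False


def run(interface, watchdog, total_ticks, hang_at):
    """Drive the simulation and return the number of recoveries performed."""
    recoveries = 0
    for t in range(total_ticks):
        if t == hang_at:
            interface.hung = True  # inject a processor hang
        interface.step(watchdog)
        if watchdog.tick():
            interface.recover()
            watchdog.kick()
            recoveries += 1
    return recoveries
```

Calling `run(SimulatedInterface(), SoftwareWatchdog(3), 10, 4)` injects one hang at tick 4; the watchdog detects it after the timeout, and the interface resumes from its last checkpoint.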

international parallel processing symposium | 1998

Measuring the vulnerability of interconnection networks in embedded systems

Vijay Lakamraju; Zahava Koren; Israel Koren; C. M. Krishna

Studies of the fault-tolerance of graphs have tended to largely concentrate on classical graph connectivity. This measure is very basic, and conveys very little information for designers to use in selecting a suitable topology for the interconnection network in embedded systems. In this paper, we study the vulnerability of interconnection networks to the failure of individual links, using a set of four measures which, taken together, provide a much fuller characterization of the network. Moreover, while traditional studies typically limit themselves to uncorrelated link failures, our model deals with both uncorrelated and correlated failure modes. This is of practical significance, since quite often, failures in networks are correlated due to physical considerations.

international conference on parallel and distributed systems | 2006

Software-based adaptive and concurrent self-testing in programmable network interfaces

Yizheng Zhou; Vijay Lakamraju; Israel Koren; C. M. Krishna

Emerging network technologies have complex network interfaces that have renewed concerns about network reliability. In this paper, we present an effective low-overhead failure detection technique, which is based on a software watchdog timer that detects network processor hangs and a self-testing scheme that detects interface failures other than processor hangs. The proposed adaptive and concurrent self-testing scheme achieves failure detection by periodically directing the control flow to go through only active software modules in order to detect errors that affect instructions in the local memory of the network interface. The paper shows how this technique can be made to minimize the performance impact on the host system and be completely transparent to the user.

international geoscience and remote sensing symposium | 2006

Testing and Validation of the CASA DCAS System

A. Bekkerman; Vijay Lakamraju; Israel Koren; C. M. Krishna

We present a system emulator that provides a versatile environment for testing and validating a distributed system for sensing of the lower atmosphere during its development lifecycle. The emulator, along with its comprehensive RAPIDS toolbox, provides the necessary infrastructure to experiment with different parameters of the system to ensure that the system specifications are being met, while also providing scope for experimenting with newer configurations and futuristic designs. Extensive monitoring of the system, made possible through RAPIDS, helps expose bugs and performance bottlenecks. The remote experimentation feature of the system allows several users to drive and monitor the system without requiring physical access to the emulator.

international conference on mobile systems, applications, and services | 2005