Publication
Featured research published by Amber Roy-Chowdhury.
ieee international symposium on fault tolerant computing | 1993
Amber Roy-Chowdhury; Prithviraj Banerjee
A scheme is presented for dealing with the roundoff errors that complicate the check phases of algorithm-based fault tolerance methods. The method is based on error analysis incorporating some simplifications, which results in easier derivation of error bounds and more useful error expressions in cases where the theoretical error bound may be too wide to be of much use as a tolerance in the check phase. The methods are used to derive error bounds for three applications, and it is shown that the fault-tolerant encodings for these applications using the authors' error expressions achieve high error coverage and no false alarms for a wide spectrum of data sets at low overheads. The authors contrast their tolerance-bound calculation method with an earlier method and show that theirs is much more robust. They argue that, to be of practical use, algorithm-based fault-tolerance techniques need to pay close attention to the thresholding issue, and they point out that their scheme is the first to achieve good results across a wide range of data sets.
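The thresholding idea can be illustrated with a minimal sketch (Python/NumPy; the function name, the specific tolerance formula, and `rel_tol` are illustrative assumptions, not the paper's derived bound): a column-checksum test for a matrix product that compares against a tolerance scaled to the operand magnitudes instead of testing exact equality.

```python
import numpy as np

def column_checksum_ok(A, B, C, rel_tol=1e-10):
    """Check a claimed product C = A @ B via a column checksum.

    In exact arithmetic e^T C == (e^T A) B; in floating point the two
    differ by roundoff, so the comparison uses a tolerance scaled to the
    operand magnitudes (a simplified stand-in for an error-analysis bound)
    rather than exact equality, which would raise false alarms.
    """
    direct = C.sum(axis=0)        # checksum of the computed result
    encoded = A.sum(axis=0) @ B   # checksum propagated through the multiply
    tol = rel_tol * (np.abs(A).sum(axis=0) @ np.abs(B) + 1.0)
    return bool(np.all(np.abs(direct - encoded) <= tol))
```

A fault that perturbs an entry of `C` by more than the tolerance changes one column checksum and is flagged, while ordinary roundoff stays below it.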
ieee international symposium on fault tolerant computing | 1994
Amber Roy-Chowdhury; Prithviraj Banerjee
Previous algorithm-based methods for developing reliable versions of numerical algorithms have mostly concerned themselves with error detection. A truly fault-tolerant algorithm, however, needs to locate errors and recover from them once they are located. In a parallel processing environment, this corresponds to locating the faulty processors and recovering the data corrupted by them. In this paper, we discuss in detail a fault-tolerant version of a matrix multiplication algorithm. The ideas developed in the derivation of the fault-tolerant matrix multiplication algorithm may be used to derive fault-tolerant versions of other numerical algorithms, and we outline how two others, QR factorization and Gaussian elimination, may be made fault-tolerant using our approach. Our fault model assumes that a faulty processor can corrupt all the data it possesses. We present error coverage and overhead results for the single-faulty-processor case for fault-locating and fault-tolerant versions of three numerical algorithms on an Intel iPSC/2 hypercube multicomputer.
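The step from detection to location and correction can be sketched with the classic full-checksum encoding (a minimal single-entry illustration, not the paper's multiprocessor scheme; the function names and the `tol` value are assumptions): a matrix is augmented with a row of column sums and a column of row sums, and a single corrupted entry is located at the intersection of the inconsistent row and column syndromes, then repaired.

```python
import numpy as np

def encode_full_checksum(M):
    """Augment M with a column-sum row, a row-sum column, and a total sum."""
    n, m = M.shape
    F = np.zeros((n + 1, m + 1))
    F[:n, :m] = M
    F[n, :m] = M.sum(axis=0)   # column checksums
    F[:n, m] = M.sum(axis=1)   # row checksums
    F[n, m] = M.sum()          # total checksum
    return F

def locate_and_correct(F, tol=1e-8):
    """Locate a single corrupted data entry and repair it in place.

    Returns the (row, col) of the corrected entry, or None if no unique
    single-error location is indicated by the syndromes.
    """
    n, m = F.shape[0] - 1, F.shape[1] - 1
    col_err = F[:n, :m].sum(axis=0) - F[n, :m]   # per-column syndrome
    row_err = F[:n, :m].sum(axis=1) - F[:n, m]   # per-row syndrome
    bad_rows = np.nonzero(np.abs(row_err) > tol)[0]
    bad_cols = np.nonzero(np.abs(col_err) > tol)[0]
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        F[i, j] -= row_err[i]   # subtract the syndrome to restore the entry
        return (i, j)
    return None
```

The row syndrome gives the magnitude of the corruption, so subtracting it restores the original value exactly (up to roundoff).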
IEEE Transactions on Computers | 1996
Amber Roy-Chowdhury; Prithviraj Banerjee
Algorithm-based fault-tolerance (ABFT) is an inexpensive method of incorporating fault-tolerance into existing applications. Applications are modified to operate on encoded data and produce encoded results which may then be checked for correctness. An attractive feature of the scheme is that it requires little or no modification to the underlying hardware or system software. Previous algorithm-based methods for developing reliable versions of numerical programs for general-purpose multicomputers have mostly concerned themselves with error detection. A truly fault-tolerant algorithm, however, needs to locate errors and recover from them once they are located. In a parallel processing environment, this corresponds to locating the faulty processors and recovering the data corrupted by the faulty processors. In this paper, we first present a general scheme for performing fault-location and recovery under the ABFT framework. Our fault model assumes that a faulty processor can corrupt all the data it possesses. The fault-location scheme is an application of system-level diagnosis theory to the ABFT framework, while the fault-recovery scheme uses ideas from coding theory to maintain redundant data and uses this to recover corrupted data in the event of processor failures. Results are presented on implementations of three numerical algorithms on a 16-processor Intel iPSC/2 hypercube multicomputer, which demonstrate acceptably low overheads for the single and double fault location and recovery cases.
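The coding-theory flavor of the recovery scheme can be sketched at block level (assuming, purely for illustration, a single parity block holding the elementwise sum of every processor's data block; the paper's actual redundancy scheme may differ): if one processor corrupts all the data it holds, as the fault model allows, its block is rebuilt from the parity and the surviving blocks.

```python
import numpy as np

def make_parity(blocks):
    """Parity block: elementwise sum of all processors' data blocks."""
    return sum(blocks)

def recover(blocks, parity, lost):
    """Rebuild the block held by processor `lost` from the parity block
    and the surviving processors' blocks (sum-code style recovery)."""
    return parity - sum(b for i, b in enumerate(blocks) if i != lost)
```

This tolerates one failed processor per parity block; tolerating multiple simultaneous failures would require additional independent checksums.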
IEEE Transactions on Computers | 1996
Amber Roy-Chowdhury; Nikolaos Bellas; Prithviraj Banerjee
Algorithm-based fault tolerance is an inexpensive method of achieving fault tolerance without requiring any hardware modifications. For numerical applications involving the iterative solution of linear systems arising from discretization of various PDEs, there exist almost no fault-tolerant algorithms in the literature. We describe an error-detecting version of a parallel algorithm for iteratively solving the Laplace equation over a rectangular grid. This error-detecting algorithm is based on the popular successive overrelaxation scheme with red-black ordering. We use the Laplace equation merely as a vehicle for discussion; we show how to modify the algorithm to devise error-detecting iterative schemes for solving linear systems arising from discretizations of other PDEs, such as the Poisson equation and a variant of the Laplace equation with a mixed derivative term. We also discuss a modification of the basic scheme to handle situations where the underlying solution domain is not rectangular. We then discuss a somewhat different error-detecting algorithm for iterative solution of PDEs which can be expected to yield better error coverage. We also present a new way of dealing with the roundoff errors which complicate the check phase of algorithm-based schemes. Our approach is based on error analysis incorporating some simplifications and gives high fault coverage and no false alarms for a large variety of data sets. We report experimental results on the error coverage and performance overhead of our algorithm-based error-detection schemes on an Intel iPSC/2 hypercube multiprocessor.
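The red-black ordering underlying the scheme can be sketched as follows (a plain red-black SOR sweep for the 2-D Laplace equation with Dirichlet boundaries; the checksum-based error detection itself is omitted, and the relaxation factor `omega` is an illustrative choice):

```python
import numpy as np

def red_black_sor_sweep(u, omega=1.5):
    """One red-black SOR sweep for the 2-D Laplace equation.

    Interior points are split into 'red' ((i+j) even) and 'black'
    ((i+j) odd) sets; all points of one color can be updated in parallel
    because each point's four neighbours lie in the other set.
    """
    for parity in (0, 1):   # red sweep, then black sweep
        for i in range(1, u.shape[0] - 1):
            for j in range(1, u.shape[1] - 1):
                if (i + j) % 2 == parity:
                    gs = 0.25 * (u[i-1, j] + u[i+1, j] + u[i, j-1] + u[i, j+1])
                    u[i, j] += omega * (gs - u[i, j])
    return u
```

Because the red and black half-sweeps are each fully parallel, this ordering maps naturally onto the distributed-memory machines the abstract targets.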
international conference on parallel processing | 1993
Amber Roy-Chowdhury; Prithviraj Banerjee
Algorithm-based fault tolerance is an inexpensive method of achieving fault tolerance without requiring any hardware modifications. Algorithm-based schemes have been proposed for a wide variety of numerical applications. However, for a particular class of numerical applications, namely those involving the iterative solution of linear systems, almost no fault-tolerant algorithms exist in the literature. In this paper, we describe a fault-tolerant version of a parallel algorithm for iteratively solving the Laplace equation over a grid.
ieee international symposium on fault tolerant computing | 1996
Amber Roy-Chowdhury; Prithviraj Banerjee
We have developed an automated compile-time approach to generating error-detecting parallel programs. The compiler identifies statements implementing affine transformations within the program and automatically inserts code for computing, manipulating, and comparing checksums in order to detect data errors at runtime. Statements which do not implement affine transformations are checked by duplication. Checksums are reused from one loop to the next where possible, rather than being recomputed for every statement; a global dataflow analysis determines the points at which checksums must be recomputed. We also use a novel method of specifying the data distributions of the check data, using data distribution directives, so that the computations on the original data and the corresponding check computations are performed on different processors. Results on the time overhead and error coverage of the error-detecting parallel programs relative to the original programs are presented on an Intel Paragon distributed-memory multicomputer.
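The checksum identity the compiler exploits for an affine statement y = A x + b can be sketched as a hand-written stand-in for the generated check (the function name and the roundoff tolerance are illustrative assumptions, not the compiler's output):

```python
import numpy as np

def affine_checksum_ok(A, b, x, y, rel_tol=1e-10):
    """Runtime check for an affine statement y = A @ x + b.

    The output checksum sum(y) must equal (e^T A) @ x + sum(b), computed
    directly from the inputs; the comparison allows for roundoff via a
    tolerance scaled to the operand magnitudes.
    """
    encoded = A.sum(axis=0) @ x + b.sum()
    tol = rel_tol * (np.abs(A).sum(axis=0) @ np.abs(x) + np.abs(b).sum() + 1.0)
    return bool(abs(y.sum() - encoded) <= tol)
```

In the generated program, the `encoded` computation would be mapped to different processors than the original statement, per the data-distribution directives described above, so that a single faulty processor cannot corrupt both the data and its check.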
ieee international symposium on fault tolerant computing | 1994
Nicholas S. Bowen; Amber Roy-Chowdhury
The data sharing approach to building distributed database systems is becoming more common because of its potentially higher processing power and flexibility compared to data partitioning. However, due to the large amounts of hardware and complex software involved, the likelihood of a single node failure in the system increases. Following a single node failure, some processing has to be done to determine the set of locks held by transactions which were executing at the failed node. These locks cannot be released until database recovery has completed on the failed node. This phenomenon can cause throughput degradation even if the processing power on the surviving nodes is adequate to handle all incoming transactions. This paper studies the throughput dropoff behavior following a single node failure in a data sharing system through simulations and analytical modeling. The analytical model reveals several important factors affecting post-failure behavior and is shown to match simulations quite accurately. The effect of hot locks (locks which are frequently accessed) on post-failure behavior is observed. Simulations are performed to observe system behavior after the set of locks held by transactions on the failed node has been determined and show that if the delay in obtaining this information is too large, the system is prone to thrashing.
Archive | 2008
Shankar Ramaswamy; Amber Roy-Chowdhury
Archive | 2003
Samar Choudhary; John M. Lucassen; Shankar Ramaswamy; Sai G. Rathnam; Amber Roy-Chowdhury; Douglass J. Wilson
Archive | 2005
Amber Roy-Chowdhury; Srikanth Thirumalai