Douglas M. Blough
University of California, Irvine
Publication
Featured research published by Douglas M. Blough.
IEEE Transactions on Computers | 1999
Douglas M. Blough; Hongying W. Brown
This paper describes a new comparison-based model for distributed fault diagnosis in multicomputer systems with a weak reliable broadcast capability. The classical problems of diagnosability and diagnosis are both considered under this broadcast comparison model. A characterization of diagnosable systems is given, which leads to a polynomial-time diagnosability algorithm. A polynomial-time diagnosis algorithm for t-diagnosable systems is also given. A variation of this algorithm, which allows dynamic fault occurrence and incomplete diagnostic information, has been implemented in the COmmon Spaceborne Multicomputer Operating System (COSMOS). Results produced using a simulator for the JPL MAX multicomputer system running COSMOS show that the algorithm diagnoses all fault situations with low latency and very little overhead. These simulations demonstrate the practicality of the proposed diagnosis model and algorithm for multicomputer systems having weak reliable broadcast. This includes systems with fault-tolerant hardware for broadcast, as well as those where reliable broadcast is implemented in software.
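As a toy illustration of comparison-based diagnosis (not the paper's broadcast comparison algorithm), units that run the same job and produce identical outputs can be grouped, and the largest agreeing group taken as fault-free when at most t units are faulty; the function name and data layout here are invented for the sketch:

```python
def diagnose_by_comparison(outputs, t):
    """Group units by identical job output; assuming all fault-free units
    agree and at most t units are faulty, the largest agreement group is
    the fault-free set. Illustrative sketch only -- not the paper's
    broadcast-comparison diagnosis algorithm."""
    groups = {}
    for unit, out in outputs.items():
        groups.setdefault(out, []).append(unit)
    fault_free = max(groups.values(), key=len)
    if len(outputs) - len(fault_free) > t:
        raise ValueError("more than t units disagree; diagnosis fails here")
    return set(fault_free)
```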
symposium on reliable distributed systems | 1990
Douglas M. Blough; Gregory F. Sullivan
The problem of voting is studied for both the exact and inexact cases. Optimal solutions based on explicit computation of conditional probabilities are given. The most commonly used strategies, i.e., majority, median, and plurality, are compared quantitatively. The results show that plurality voting is the most powerful of these techniques and is, in fact, optimal for a certain class of probability distributions. An efficient method of implementing a generalized plurality voter when nonfaulty processes can produce differing answers is also given.
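A generalized plurality voter along these lines can be sketched as follows; the cluster-by-tolerance scheme is an assumption for illustration, not necessarily the paper's implementation:

```python
def plurality_vote(answers, tol=0.0):
    """Generalized plurality voter: nonfaulty processes may produce
    answers that differ by up to `tol`, so approximately equal answers
    are clustered and a representative of the largest cluster wins.
    With tol=0 this reduces to exact plurality voting."""
    clusters = []  # each entry: (representative, list of member answers)
    for a in answers:
        for rep, members in clusters:
            if abs(a - rep) <= tol:
                members.append(a)
                break
        else:
            clusters.append((a, [a]))
    _, members = max(clusters, key=lambda c: len(c[1]))
    return sum(members) / len(members)
```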
IEEE Transactions on Computers | 1992
Douglas M. Blough; Andrzej Pelc
The authors consider a comparison-based probabilistic model for multiprocessor fault diagnosis. They study the problem of optimal diagnosis, which is to correctly identify the status (faulty/fault-free) of units in the system, with maximum probability. For some parameter values, this probabilistic model is well approximated by the asymmetric comparison model introduced by M. Malek (1980). For arbitrary systems it is shown that optimal diagnosis in the probabilistic model and in Malek's model is NP-hard. However, the authors construct efficient diagnosis algorithms in the asymmetric comparison model for a class of systems corresponding to bipartite graphs which includes hypercubes, grids, and forests. Furthermore, for ring systems, a linear-time algorithm to perform optimal diagnosis in the probabilistic model is presented.
ieee international symposium on fault tolerant computing | 1989
Douglas M. Blough; Gregory F. Sullivan; Gerald M. Masson
The authors present a general approach to fault diagnosis that is widely applicable and requires only a limited number of connections among units. Each unit in the system forms a private opinion on the status of each of its neighboring units based on duplication of jobs and comparison of job results over time. A diagnosis algorithm that consists of simply taking a majority vote among the neighbors of a unit to determine the status of that unit is then executed. The performance of this simple majority-vote diagnosis algorithm is analyzed using a probabilistic model for the faults in the system. It is shown that with high probability, for systems composed of n units, the algorithm will correctly identify the status of all units when each unit is connected to O(log n) other units. It is also shown that the algorithm works with high probability in a class of systems in which the average number of neighbors of a unit is constant. The results indicate that fault diagnosis can in fact be achieved quite simply in multiprocessor systems containing a low to moderate number of testing connections.
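The majority-vote diagnosis step itself is simple to sketch; the dictionary layout for neighbor lists and opinions below is a hypothetical encoding, not the paper's:

```python
def majority_vote_diagnosis(neighbors, opinion):
    """Diagnose each unit by a majority vote among its neighbors'
    opinions. `neighbors[u]` lists the units adjacent to u, and
    `opinion[(v, u)]` is v's opinion of u (True = "u is faulty").
    A strict majority of faulty votes marks the unit as faulty."""
    status = {}
    for u, nbrs in neighbors.items():
        faulty_votes = sum(opinion[(v, u)] for v in nbrs)
        status[u] = faulty_votes > len(nbrs) / 2
    return status
```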
international parallel processing symposium | 1993
Wei-jing Guan; Wei Kang Tsai; Douglas M. Blough
The communication performance of the interconnection network is critical in a multicomputer system. Wormhole routing is known to be more efficient than traditional circuit switching and packet switching. To evaluate wormhole routing, a queueing-theoretic analysis is used. This paper presents a general analytical model for wormhole routing based on very basic assumptions. The model is used to evaluate the routing delays in hypercubes and meshes. Calculated delays are compared against those obtained from simulations, and these comparisons show that the model achieves reasonable accuracy.
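For flavor, the classic M/M/1 mean-delay formula below shows the style of queueing-theoretic delay analysis; the paper's wormhole-routing model is considerably more detailed than this single-queue sketch:

```python
def mm1_delay(arrival_rate, service_rate):
    """Mean time in system for an M/M/1 queue: T = 1 / (mu - lambda).
    Illustrates the queueing-theoretic style of analysis only; it is
    not the wormhole-routing model developed in the paper."""
    if arrival_rate >= service_rate:
        raise ValueError("queue is unstable (utilization >= 1)")
    return 1.0 / (service_rate - arrival_rate)
```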
IEEE Transactions on Computers | 1992
Douglas M. Blough; Gregory F. Sullivan; Gerald M. Masson
The problem of fault diagnosis in multiprocessor systems is considered under a probabilistic fault model. The focus is on minimizing the number of tests that must be conducted to correctly diagnose the state of every processor in the system with high probability. A diagnosis algorithm that can correctly diagnose these states with probability approaching one in a class of systems performing slightly more than a linear number of tests is presented. A nearly matching lower bound on the number of tests required to achieve correct diagnosis in arbitrary systems is proved. Lower and upper bounds on the number of tests required for regular systems are presented. A class of regular systems which includes hypercubes is shown to be correctly diagnosable with high probability. In all cases, the number of tests required under this probabilistic model is shown to be significantly less than under a bounded-size fault set model. These results represent a substantial improvement in the performance of system-level diagnosis techniques.
IEEE Transactions on Computers | 1992
Douglas M. Blough; Gregory F. Sullivan; Gerald M. Masson
The authors present and analyze a probabilistic model for the self-diagnosis capabilities of a multiprocessor system. In this model an individual processor fails with probability p and a nonfaulty processor testing a faulty processor detects a fault with probability q. This models the situation where processors can be intermittently faulty or the situation where tests are not capable of detecting all possible faults within a processor. An efficient algorithm that can achieve correct diagnosis with high probability in systems of O(n log n) connections, where n is the number of processors, is presented. It is the first algorithm to be able to diagnose a large number of intermittently faulty processors in a class of systems that includes hypercubes. It is shown that, under this model, no algorithm can achieve correct diagnosis with high probability in regular systems which conduct a number of tests dominated by n log n. Examples of systems which perform a modest number of tests are given in which the probability of correct diagnosis for the algorithm is very nearly one.
IEEE Transactions on Reliability | 1996
Douglas M. Blough
Reconfiguration of memory arrays using spare rows and columns is useful for yield enhancement of memories. This paper presents a reconfiguration algorithm (QRCF) for memories that contain clustered faults. QRCF operates in a branch-and-bound fashion similar to known optimal algorithms that require exponential time. However, QRCF repairs faults in clusters rather than individually. Since many faults are repaired simultaneously, the execution time of QRCF does not become prohibitive even for large memories containing many faults. The performance of QRCF is evaluated under a probabilistic model for clustered faults in a memory array. For a special case of the fault model, QRCF solves the reconfiguration problem exactly in polynomial time. In the general case, QRCF produces an optimal solution with high probability. The algorithm is also evaluated through simulation. The performance and execution time of QRCF on arrays containing clustered faults are compared with other approximation algorithms and with an optimal algorithm. The simulation results show that QRCF outperforms previous approximation algorithms by a wide margin and performs nearly as well as the optimal algorithm with an execution time that is orders of magnitude less.
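The underlying repair problem can be illustrated with a simple greedy spare-allocation heuristic (not QRCF itself, which repairs whole fault clusters in a branch-and-bound fashion):

```python
from collections import Counter

def greedy_repair(faults, spare_rows, spare_cols):
    """Greedily repair a faulty memory array: at each step, spend a spare
    row or column on the line covering the most remaining faults.
    A heuristic sketch of the repair problem, not the QRCF algorithm."""
    faults = set(faults)  # set of (row, col) faulty-cell coordinates
    used_r = used_c = 0
    while faults:
        rows = Counter(r for r, c in faults)
        cols = Counter(c for r, c in faults)
        best_row = rows.most_common(1)[0] if used_r < spare_rows else None
        best_col = cols.most_common(1)[0] if used_c < spare_cols else None
        if best_row and (not best_col or best_row[1] >= best_col[1]):
            faults = {f for f in faults if f[0] != best_row[0]}
            used_r += 1
        elif best_col:
            faults = {f for f in faults if f[1] != best_col[0]}
            used_c += 1
        else:
            return False  # spares exhausted with faults remaining
    return True
```

Greedy allocation is not optimal in general, which is why branch-and-bound approaches like QRCF are needed when a provably best repair matters.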
IEEE Transactions on Computers | 1993
Douglas M. Blough; Andrzej Pelc
Reconfiguration of memory arrays using spare rows and spare columns, which has been shown to be a useful technique for yield enhancement of memories, is considered. A clustered failure model that adopts the center-satellite approach of F.J. Meyer and D.K. Pradhan (1989) is proposed and utilized to show that the total number of faulty cells that can be tolerated when clustering occurs is larger than when faults are independent. It is also shown that an optimal solution to the reconfiguration problem can be found in polynomial time for a special case of the clustering model. An efficient approximation algorithm is given for the general case of the probabilistic model assumed. It is shown, through simulation, that the computation time required by this algorithm to repair large arrays containing a significant number of clustered faults is small.
[1991] Digest of Papers. Fault-Tolerant Computing: The Twenty-First International Symposium | 1991
Douglas M. Blough
Reconfiguration of memory arrays using spare rows and spare columns, a useful technique for yield enhancement of memories, is considered under a compound probabilistic model that exhibits clustering of faults. It is shown that the total number of faulty cells that can be tolerated when clustering occurs is larger than when faults are independent. It is shown that an optimal solution to the reconfiguration problem can be found in polynomial time for a special case of the clustering model. Efficient approximation algorithms are given for the situation in which faults appear in clusters only and the situation in which faults occur both in clusters and singly. It is shown through simulation that the computation time required by these algorithms to repair large arrays containing a significant number of clustered faults is small.