Chris J. Walter | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Chris J. Walter is active.

Explore More

Publication

Featured researches published by Chris J. Walter.

IEEE Transactions on Computers | 1988

The MAFT architecture for distributed fault tolerance

R.M. Keichafer; Chris J. Walter; Alan M. Finn; Philip M. Thambidurai

A description is given of the multicomputer architecture for fault tolerance (MAFT), a distributed system designed to provide extremely reliable computation in real-time control systems. MAFT is based on the physical and functional partitioning of executive functions from applications functions. The implementation of the executive functions in a special-purpose hardware processor allows the fault-tolerance functions to be transparent to the application programs and minimizes overhead. Byzantine agreement and approximate agreement algorithms are used for critical system parameters. MAFT supports the use of multiversion hardware and software to tolerate built-in or generic faults. Graceful degradation and restoration of the application workload is permitted in response to the exclusion and readmission of nodes, respectively. >

IEEE Transactions on Software Engineering | 1997

Formally verified on-line diagnosis

Chris J. Walter; Patrick Lincoln; Neeraj Suri

A reconfigurable fault tolerant system achieves the attributes of dependability of operations through fault detection, fault isolation and reconfiguration, typically referred to as the FDIR paradigm. Fault diagnosis is a key component of this approach, requiring an accurate determination of the health and state of the system. An imprecise state assessment can lead to catastrophic failure due to an optimistic diagnosis, or conversely, result in underutilization of resources because of a pessimistic diagnosis. Differing from classical testing and other off-line diagnostic approaches, we develop procedures for maximal utilization of the system state information to provide for continual, on-line diagnosis and reconfiguration capabilities as an integral part of the system operations. Our diagnosis approach, unlike existing techniques, does not require administered testing to gather syndrome information but is based on monitoring the system message traffic among redundant system functions. We present comprehensive on-line diagnosis algorithms capable of handling a continuum of faults of varying severity at the node and link level. Not only are the proposed algorithms on-line in nature, but are themselves tolerant to faults in the diagnostic process. Formal analysis is presented for all proposed algorithms. These proofs offer both insight into the algorithm operations and facilitate a rigorous formal verification of the developed algorithms.

Proceedings of the IEEE | 1994

Synchronization issues in real-time systems

Neeraj Suri; Michelle M. Hugue; Chris J. Walter

Real-time systems must accomplish executive and application tasks within specified timing constraints. In distributed real-time systems, the mechanisms that ensure fair access to shared resources, achieve consistent deadlines, meet timing or precedence constraints, and avoid deadlocks all utilize the notion of a common system-wide time base. A synchronization primitive is essential in meeting the demands of real-time critical computing. This paper provides a tutorial on the terminology, issues, and techniques essential to synchronization in real-time systems. >

Archive | 1995

Continual On-Line Diagnosis of Hybrid Faults

Chris J. Walter; Neeraj Suri; Michelle M. Hugue

An accurate system-state determination is essential in ensuring system dependability. An imprecise state assessment can lead to catastrophic failure through optimistic diagnosis, or underutilization of resources due to pessimistic diagnosis. Dependability is usually achieved through a fault detection, isolation and reconfiguration (FDIR) paradigm, of which the diagnosis procedure is a primary component. Fault resolution in on-line diagnosis is key to providing an accurate system-state assessment. Most diagnostic strategies are based on limited fault models that adopt either an optimistic (all faults s-a-X) or pessimistic (all faults Byzantine) bias. Our Hybrid Fault-Effects Model (HFM) handles a continuum of fault types that are distinguished by their impact on system operations. While this approach has been shown to enhance system functionality and dependability, on-line FDIR is required to make the HFM practical. In this paper, we develop a methodology for utilization of the system-state information to provide continual on-line diagnosis and reconfiguration as an integral part of the system operations. We present diagnosis algorithms applicable under the generalized HFM and introduce the notion of fault decay time. Our diagnosis approach is based primarily on monitoring the system’s message traffic. Unlike existing approaches, no explicit test procedures are required.

Theoretical Computer Science | 2003

The customizable fault/error model for dependable distributed systems

Chris J. Walter; Neeraj Suri

Dependability is a qualitative term referring to a systems ability to meet its service requirements in the presence of faults. The types and number of faults covered by a system play a primary role in determining the level of dependability which that system can potentially provide. Given the variety and multiplicity of fault types, to simplify the design process, the system algorithm design often focuses on specific fault types, resulting in either over-optimistic (all fault permanent) or over-pessimistic (all faults malicious) dependable system designs.A more practical and realistic approach is to recognize that faults of varied severity levels and of differing occurrence probabilities may appear as combinations rather than the assumed single fault type occurrences. The ability to allow the user to select/customize a particular combination of fault types of varied severity characterizes the proposed customizable fault/error model (CFEM). The CFEM organizes diverse fault categories into a cohesive framework by classifying faults according to the effect they have on the required system services rather than by targeting the source of the fault condition. In this paper, we develop (a) the complete framework for the CFEM fault classification, (b) the voting functions applicable under the CFEM, and (c) the fundamental distributed services of consensus and convergence under the CFEM on which dependable distributed functionality can be supported.

IEEE Transactions on Reliability | 1990

Evaluation and design of an ultra-reliable distributed architecture for fault tolerance

Chris J. Walter

The issues related to the experimental evaluation of an early conceptual prototype of the MAFT (multicomputer architecture for fault tolerance) architecture are discussed. A completely automated testing approach was designed to allow fault-injection experiments to be performed, including stuck-at and memory faults. Over 2000 injection tests were run and the system successfully tolerated all faults. Concurrent with the experimental evaluation, an analytic evaluation was carried out to determine if higher levels of reliability could be achieved. The lessons learned in the evaluation phase culminated in a new design of the MAFT architecture for applications needing ultrareliability. The design uses the concept of redundantly self-checking functions to address the rigid requirements proposed for a future generation of mission-critical avionics. The testing of three subsystems critical to the operation of the new MAFT is presented with close to 50-k test cycles performed over 51 different IC devices to verify the designs. >

ieee international symposium on fault tolerant computing | 1989

Clock synchronization in MAFT

Philip M. Thambidurai; Alan M. Finn; Roger M. Kieckhafer; Chris J. Walter

The steady-state clock synchronization algorithm of MAFT (multicomputer architecture for fault tolerance), an extremely reliable system for real-time applications, is discussed. The synchronization algorithm has been implemented in hardware and a system prototype constructed. The algorithm uses an interactive convergence approach, based on synchronized rounds of message transmission. The authors derive the maximum skew between nonfaulty clocks in terms of basic system parameters. The problem of detecting clock faults is also addressed, with attention to the minimum amount of synchronization error guaranteed to be unambiguously detected. The authors discuss the various practicalities which arise in the implementation of the algorithm as an integrated part of the whole system. Relationships between the synchronization subsystem and the total system are discussed.<<ETX>>

[1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium | 1990

Identifying the cause of detected errors

Chris J. Walter

The author presents an approach to the consistent diagnosis of error monitoring observations in a distributed fault-tolerant computing system, even when the faulty source produces arbitrary errors. He describes the online algorithm used in the multicomputer architecture for fault tolerance (MAFT) to diagnose faulty system elements. By the use of syndrome information which categorizes detected errors as either symmetric or asymmetric, bounds for correct diagnosis can be deduced. Finally, an interactive consistency algorithm is employed to guarantee consistent diagnosis in a distributed environment and to provide online verification of all diagnostic units.<<ETX>>

ieee international symposium on fault tolerant computing | 1992

Reliability modeling of large fault-tolerant systems

Neeraj Suri; Michelle M. Hugue; Chris J. Walter

A cluster-based ultrareliable architecture is presented, offering synchronization and system functionality comparable to that of fully connected systems, with reduced system overhead. A reliability model considering the distribution of concurrent faults across the system clusters is shown to increase the accuracy of reliability and system fault-tolerance estimates. The hybrid fault model, which classifies faults based on their behavior, further improves reliability estimates and enhances the fault handling capability of each cluster. Linear growth in cluster reliability with respect to cluster size is possible, as are refinements in the convergence and consistency algorithms for synchronization.<<ETX>>

conference on high performance computing supercomputing | 1989

DASP: a general-purpose MIMD parallel computer using distributed associative processing

Y. Park; Chris J. Walter; Henry C. Yee; T. Roden; Simon Y. Berkovich

This paper presents a general purpose MIMD (Multiple Instruction Stream Multiple Data Stream) loosely-coupled parallel computer called DASP (Distributed Associative Processor). The DASP organization partitions the communication and application functions. The communication functions are performed by custom-made communication handlers called Network Communication Modules, while application functions are performed by any general purpose processor suitable for the application. The communication subsystem of DASP takes advantage of the properties of loosely-coupled MIMD parallel computers: the very short inter-processor distances and the locality of task reference. By pipelining the time slices of the bus hierarchically with the CITO (Content Induced Transaction Overlap) protocol, DASP provides virtual full-connectivity to the application processors without physical full connections; thus, its architecture exhibits a very high degree of extensibility and modularity. Analytical and simulation results have validated the DASP approach. A prototype has been constructed and several algorithms have been successfully implemented on the prototype.

Explore More