Douglas Craig Bossen | Researchain

Publication

Featured research published by Douglas Craig Bossen.

International Symposium on Microarchitecture | 2002

Power4 system design for high reliability

Douglas Craig Bossen; Joel M. Tendler; Kevin Franklin Reick

The Power4 is an integrated system on a chip (SoC) designed for systems that will initially target enterprise Unix customers and, at a later date, OS/400 customers. Power4 systems are servers composed of several Power4 chips. To achieve reliability goals, Power4 system design incorporates fault tolerance throughout the hardware, firmware, and operating system. Together, these system components provide concurrent and deferred maintenance, multi-level recovery from errors, and run-time diagnostics.

IBM Journal of Research and Development | 2002

Fault-tolerant design of the IBM pSeries 690 system using POWER4 processor technology

Douglas Craig Bossen; Alongkorn Kitamorn; Kevin Franklin Reick; Michael Stephen Floyd

The POWER4-based p690 systems offer the highest performance of the IBM eServer pSeries™ line of computers. Within the general-purpose UNIX® server market, they also offer the highest levels of concurrent error detection, fault isolation, recovery, and availability. High availability is achieved by minimizing component failure rates through improvements in the base technology, and through design techniques that permit hard- and soft-failure detection, recovery, and isolation, repair deferral, and component replacement concurrent with system operation. In this paper, we discuss the fault-tolerant design techniques that were used for the array, logic, storage, and I/O subsystems of the p690. We also present the diagnostic strategy, fault-isolation, and recovery techniques. New features such as POWER4 synchronous machine-check interrupt, PCI bus error recovery, array dynamic redundancy, and minimum-element dynamic reconfiguration are described. The design process used to verify error detection, fault isolation, and recovery is also described.

National Computer Conference | 1970

Optimum test patterns for parity networks

Douglas Craig Bossen; Daniel L. Ostapko; Arvind M. Patel

The logic related to the error-detecting and/or error-correcting circuitry of digital computers often contains portions that calculate the parity of a collection of bits. This calculation is performed by a tree structure composed of Exclusive-OR gates. Like any other circuitry, such a parity tree is subject to malfunctions. This report presents a procedure for testing a parity tree for malfunctions.
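The abstract does not spell out the procedure. As an illustration only, the Python sketch below shows one construction that applies all four input combinations to every gate of a fanout-free tree of 2-input XOR gates using just four patterns: each line gets a 4-bit label (bit k is the line's value in pattern k) drawn from {0011, 0101, 0110}, any two of which XOR to the third. The tuple-based tree representation and the function names are hypothetical, and the sketch checks functional coverage per gate rather than any specific fault model from the paper.

```python
# Sketch: four test patterns for a fanout-free tree of 2-input XOR gates.
# A tree is either an input name (str) or a gate given as a tuple (left, right).
# Each line gets a 4-bit label; bit k of the label is the line's value in pattern k.
LABELS = (0b0011, 0b0101, 0b0110)  # any two of these XOR to the third


def assign_labels(tree, label, out):
    """Top-down: a gate whose output has `label` gets the other two labels on its inputs."""
    if isinstance(tree, str):
        out[tree] = label
        return
    left, right = (lab for lab in LABELS if lab != label)
    assign_labels(tree[0], left, out)
    assign_labels(tree[1], right, out)


def parity_patterns(tree):
    """Return the four input assignments, one dict per test pattern."""
    labels = {}
    assign_labels(tree, LABELS[0], labels)
    return [{name: (lab >> k) & 1 for name, lab in labels.items()} for k in range(4)]


def evaluate_parity(tree, values, seen, path=""):
    """Evaluate the tree and record the input pair applied to each gate (keyed by path)."""
    if isinstance(tree, str):
        return values[tree]
    x = evaluate_parity(tree[0], values, seen, path + "L")
    y = evaluate_parity(tree[1], values, seen, path + "R")
    seen.setdefault(path, set()).add((x, y))
    return x ^ y


if __name__ == "__main__":
    tree = ((("a", "b"), "c"), ("d", ("e", "f")))
    seen = {}
    for pattern in parity_patterns(tree):
        evaluate_parity(tree, pattern, seen)
    full = {(0, 0), (0, 1), (1, 0), (1, 1)}
    print(len(seen), all(pairs == full for pairs in seen.values()))  # 5 True
```

Because each pair of labels covers 00, 01, 10, and 11 across the four bit positions, the pattern count stays at four regardless of how many gates the tree has.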

IBM Journal of Research and Development | 1982

Model for transient and permanent error-detection and fault-isolation coverage

Douglas Craig Bossen; My-Yue Hsiao

As computer technologies advance to achieve higher performance and density, intermittent failures become more dominant than solid failures, so the effectiveness of any diagnostic procedure that relies on reproducing failures is greatly reduced. This problem is solved at the system level by a new strategy of dynamic error detection and fault isolation based on error checking and analysis of captured information. The model developed in this paper allows the system designer to project the dynamic error-detection and fault-isolation coverages of the system as a function of the failure rates of components and the types and placement of error checkers. Applying the model has resulted in significant improvements to both detection and isolation in the IBM 3081 Processor Unit, and it has also led to new probabilistic isolation strategies based on the likelihood of failures. Our experience with this model on several IBM products, including the 3081, shows good correlation between the model and practical experiments.
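The paper's model itself is not reproduced in the abstract. A minimal sketch of the general idea, weighting each component's detection and isolation fractions by its failure rate and ranking suspects by their share of the failure rate, is shown below in Python. The `Component` fields, the FIT units, and the ranking rule are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Component:
    # Hypothetical fields for illustration; the paper's model may differ.
    name: str
    fit: float      # failure rate, assumed in FIT (failures per 10**9 hours)
    detect: float   # fraction of this component's failures caught by some checker
    isolate: float  # fraction of detected failures isolated to this component alone


def coverage(components):
    """Failure-rate-weighted detection and isolation coverage of the whole system."""
    total = sum(c.fit for c in components)
    detected = sum(c.fit * c.detect for c in components)
    isolated = sum(c.fit * c.detect * c.isolate for c in components)
    return detected / total, isolated / total


def rank_suspects(fits):
    """Probabilistic isolation sketch: order the candidates feeding a fired checker
    by their share of the combined failure rate (most likely first)."""
    total = sum(fits.values())
    return sorted(((name, f / total) for name, f in fits.items()),
                  key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    parts = [Component("A", 100.0, 0.75, 0.5), Component("B", 300.0, 0.5, 0.25)]
    print(coverage(parts))                          # (0.5625, 0.1875)
    print(rank_suspects({"A": 100.0, "B": 300.0}))  # [('B', 0.75), ('A', 0.25)]
```

In this toy example, component B dominates both the uncovered failures and the suspect list simply because its failure rate is three times higher, which is the kind of trade-off such a model lets a designer see before placing checkers.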

IBM Journal of Research and Development | 1992

Fault-tolerance design of the IBM Enterprise System/9000 Type 9021 processors

Chin-Long Chen; Nandakumar N. Tendolkar; Arthur James Sutton; My-Yue Hsiao; Douglas Craig Bossen

The 9021-type processors offer the highest performance of the IBM Enterprise System/9000™ (ES/9000™) series. They also have the highest levels of concurrent error detection, fault isolation, recovery, and availability of any IBM general-purpose processor. High availability is achieved by minimizing component failure rates through improvements in the base technology, and through design techniques that permit hard- and soft-failure detection, recovery, and isolation, and component replacement concurrent with system operation. In this paper, we discuss fault-tolerant design techniques for the array, logic, and storage subsystems. We also present the diagnostic strategy, fault-isolation, and recovery techniques. New features such as the redundant power system and the Processor Availability Facility are described. The overall recovery design is described, as well as specific implementation schemes. The design process used to verify error detection, fault isolation, and recovery is also described.

IBM Journal of Research and Development | 1984

Fault alignment exclusion for memory using address permutation

Douglas Craig Bossen; Chin-Long Chen; My-Yue Hsiao

A significant improvement in memory fault tolerance, beyond what is already provided by an appropriate error-correcting code (ECC), can be achieved by electronic chip swapping without any compromise of data integrity, even as large numbers of faults are allowed to accumulate. Most large and medium-sized semiconductor memories are organized so that each bit position of the system word (ECC codeword) is fed from a different chip, and quite often from a different array card or at least from distinct partitions of an array card; consequently, the various bit positions have separate address circuitry on the array cards. This fact can be exploited to provide an effective address permutation capability, which realigns faults that would otherwise have caused an uncorrectable multiple error in an ECC codeword. When faults in a codeword produce an uncorrectable error (UE), the addressing within the array card of one of the erroneous bit positions can be altered using simple EX-OR circuitry and storage latches. The latch contents are computed by an algorithm from a fault map of the memory. These techniques are referred to as Fault Alignment Exclusion (FAE) using address permutation. This paper presents practical considerations regarding the complexity of the fault map, the number of storage latches per bit position, and the overall effectiveness of the permutation in dispersing the expected numbers of errors.
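A small Python sketch of the address-permutation idea, under stated assumptions: the ECC is modeled as correcting any codeword with at most one faulty bit, bit position p of logical word A is read from physical address A XOR latch[p], and the latch values are found by a simple greedy search. The greedy search, the fault-map format, and the function names are hypothetical stand-ins; the paper's actual latch-computation algorithm is not given in the abstract.

```python
from collections import Counter

# Assumptions (hypothetical model): a fault is (bit_position, physical_address);
# bit position p of logical word A is read from physical address A ^ latches[p];
# the ECC corrects any word with at most one faulty bit.


def errors_per_word(faults, latches):
    """Count faulty bits in each logical word under the given latch settings."""
    counts = Counter()
    for pos, phys in faults:
        counts[phys ^ latches[pos]] += 1  # logical word that now reads this faulty cell
    return counts


def find_latches(faults, width, addr_bits):
    """Greedy search (illustrative only): choose each bit position's latch so that
    no logical word collects more than one faulty bit. Returns None on failure."""
    latches = [0] * width
    for pos in range(width):
        considered = [f for f in faults if f[0] <= pos]
        for value in range(1 << addr_bits):
            latches[pos] = value
            if max(errors_per_word(considered, latches).values(), default=0) <= 1:
                break
        else:
            return None  # no latch value for this position avoids a double error
    return latches


if __name__ == "__main__":
    faults = [(0, 5), (2, 5)]  # two bad cells aligned at address 5: an uncorrectable error
    print(errors_per_word(faults, [0, 0, 0, 0])[5])  # 2
    latches = find_latches(faults, width=4, addr_bits=3)
    print(latches)  # [0, 0, 1, 0]
    print(max(errors_per_word(faults, latches).values()))  # 1
```

Setting bit position 2's latch to 1 moves its faulty cell from logical word 5 to word 4, so each word now holds at most one faulty bit, which the ECC can correct.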

Pacific Rim International Symposium on Fault-Tolerant Systems | 1997

Time-lag duplexing: a fault tolerance technique for online transaction processing systems

Arun Chandra; Douglas Craig Bossen

In this paper, the concept of time-lag duplexing is proposed to achieve fault tolerance. Time-lag duplexing combines time and component redundancy to provide, for transient errors, both easy error recovery and tolerance against errors in common non-redundant components, while incurring only minimal performance and cost penalties. The fault detection and recovery algorithm using time-lag duplexing is shown for both transient and permanent faults. We demonstrate the applicability of time-lag duplexing by developing a prototype Online Transaction Processing (OLTP) system in a network computing environment. Faults are injected into this prototype to show the viability of this technique.
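The detect-and-retry idea can be sketched in Python as follows, assuming two replicas of a deterministic operation, with the lagging copy modeled simply as running after the leading one. The retry count, the result labels, and the `fail_once` fault injector are hypothetical; the sketch does not reproduce the paper's algorithm or its treatment of common non-redundant components.

```python
def time_lag_duplex(leading, lagging, x, retries=2):
    """Sketch: run two replicas of a deterministic operation, the lagging copy after
    the leading one, and compare. A mismatch that clears on retry is treated as
    transient; one that persists is reported as a suspected permanent fault."""
    for attempt in range(retries + 1):
        a = leading(x)
        b = lagging(x)  # in a real system this copy would start a fixed lag later
        if a == b:
            return a, ("ok" if attempt == 0 else "recovered")
    raise RuntimeError("replicas still disagree: suspected permanent fault")


def fail_once(op):
    """Hypothetical fault injector: corrupts only the first result (a transient error)."""
    state = {"calls": 0}

    def wrapped(x):
        state["calls"] += 1
        result = op(x)
        return result + 1 if state["calls"] == 1 else result

    return wrapped


def square(v):
    return v * v


if __name__ == "__main__":
    print(time_lag_duplex(square, square, 3))             # (9, 'ok')
    print(time_lag_duplex(square, fail_once(square), 3))  # (9, 'recovered')
    try:
        time_lag_duplex(square, lambda v: v * v + 1, 3)   # permanent fault in one replica
    except RuntimeError as err:
        print(err)
```

The injected transient corrupts only the first comparison, so the retry agrees and the result is returned; a replica that is always wrong exhausts the retries and is flagged instead.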

Design Automation Conference | 1971

Minimum test patterns for residue networks

Douglas Craig Bossen; Daniel L. Ostapko; Arvind M. Patel; Martin S. Schmookler

Residue networks are logic trees consisting of residue gates, which calculate the modulo-m sum of two or more inputs. The principal result of this paper is that for a single-output residue network consisting of modulo-m gates having n or fewer inputs each, the required number of test patterns is m^n, provided the gates are of a particular logical construction. This is the minimum possible number of test patterns, since it is exactly the number required for a complete functional test of a single modulo-m gate with n inputs. In the next section, an application of the method described in this paper is briefly illustrated. The remaining sections describe the properties of residue gates that are necessary and sufficient for the use of this method, and various means of generating the test patterns.
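For 2-input modulo-m gates (n = 2), the m^2 bound can be illustrated with a labelling construction: each line carries a linear form (p, q) over the integers mod m, and its value in pattern (i, j) is (p*i + q*j) mod m. If the two inputs of every gate carry forms whose 2x2 determinant is invertible mod m, that gate sees all m^2 input pairs. The Python sketch below is an assumed illustration of this idea, not the paper's construction; the function names are hypothetical, and it does not model the "particular logical construction" of gates that the result requires.

```python
from itertools import product
from math import gcd

# Sketch for trees of 2-input modulo-m adders (n = 2). A tree is an input name (str)
# or a gate tuple (left, right). Each line carries a linear form (p, q); its value in
# test pattern (i, j) is (p*i + q*j) % m, for i, j in range(m): m*m patterns in all.


def residue_labels(tree, label, m, out):
    """Top-down: split a gate's output form into input forms a + b == label
    such that the 2x2 matrix [a; b] is invertible mod m."""
    if isinstance(tree, str):
        out[tree] = label
        return
    p, q = label
    for b1, b2 in product(range(m), repeat=2):
        if gcd((p * b2 - q * b1) % m, m) == 1:  # det[label; b] is a unit mod m
            break
    else:
        raise ValueError("label has no invertible completion mod m")
    a = ((p - b1) % m, (q - b2) % m)  # det[a; b] == det[label; b], so also a unit
    residue_labels(tree[0], a, m, out)
    residue_labels(tree[1], (b1, b2), m, out)


def residue_patterns(tree, m):
    """Return the m*m input assignments, one dict per test pattern."""
    labels = {}
    residue_labels(tree, (1, 0), m, labels)
    return [{name: (p * i + q * j) % m for name, (p, q) in labels.items()}
            for i, j in product(range(m), repeat=2)]


def evaluate_residue(tree, values, m, seen, path=""):
    """Evaluate the network and record the input pair applied to each gate."""
    if isinstance(tree, str):
        return values[tree]
    x = evaluate_residue(tree[0], values, m, seen, path + "L")
    y = evaluate_residue(tree[1], values, m, seen, path + "R")
    seen.setdefault(path, set()).add((x, y))
    return (x + y) % m


if __name__ == "__main__":
    tree = (("a", "b"), ("c", ("d", "e")))
    for m in (3, 4):
        seen = {}
        patterns = residue_patterns(tree, m)
        for values in patterns:
            evaluate_residue(tree, values, m, seen)
        # every gate receives all m*m input pairs
        print(m, len(patterns), all(len(s) == m * m for s in seen.values()))
```

The construction works for composite m as well as prime m, because it only requires each determinant to be coprime to m. With m = 2 it yields a four-pattern test for a tree of 2-input XOR gates.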

International Symposium on VLSI Technology, Systems, and Applications | 1993

BC412 bar code for silicon wafers

Chin-Long Chen; My-Yue Hsiao; Douglas Craig Bossen

The authors present a newly invented BC412 bar code for silicon wafer identification in automated semiconductor manufacturing floor control systems. This code has been tested and was approved as a SEMI standard in June 1992. It is a very important technology breakthrough and is expected to be used by all semiconductor wafer manufacturers. The technical details of BC412 are presented.

Archive | 2000