Publication


Featured research published by Dong Tang.


IEEE Transactions on Software Engineering | 1993

FINE: A fault injection and monitoring environment for tracing the UNIX system behavior under faults

Wei-lun Kao; Ravishankar K. Iyer; Dong Tang

The authors present a fault injection and monitoring environment (FINE) as a tool to study fault propagation in the UNIX kernel. FINE injects hardware-induced software errors and software faults into the UNIX kernel and traces the execution flow and key variables of the kernel. FINE consists of a fault injector, a software monitor, a workload generator, a controller, and several analysis utilities. Experiments on SunOS 4.1.2 are conducted by applying FINE to investigate fault propagation and to evaluate the impact of various types of faults. Fault propagation models are built for both hardware and software faults. Transient Markov reward analysis is performed to evaluate the loss of performance due to an injected fault. Experimental results show that memory and software faults usually have a very long latency, while bus and CPU faults tend to crash the system immediately. About half of the detected errors are data faults, which are detected when the system tries to access an unauthorized memory location. Only about 8% of faults propagate to other UNIX subsystems. Markov reward analysis shows that the performance loss incurred by bus faults and CPU faults is much higher than that incurred by software and memory faults. Among software faults, the impact of pointer faults is higher than that of nonpointer faults.


IEEE Transactions on Computers | 1992

Analysis and modeling of correlated failures in multicomputer systems

Dong Tang; Ravishankar K. Iyer

Based on measurements from two DEC VAXcluster multicomputer systems, the issue of correlated failures is addressed. In particular, the characteristics of correlated failures, their impact on dependability, and their modeling are discussed. It is found from the data that most correlated failures are related to errors in shared resources and propagate from one machine to another. Comparisons between measurement-based models and analytical models that assume failure independence show that the impact of correlated failures on dependability is significant. Two validated models, the c-dependent model and the p-dependent model, are developed to evaluate the dependability of systems with correlated failures.


IEEE Transactions on Computers | 1993

Dependability measurement and modeling of a multicomputer system

Dong Tang; Ravishankar K. Iyer

A measurement-based analysis of error data collected from a DEC VAXcluster multicomputer system is presented. Basic system dependability characteristics such as error/failure distributions and hazard rates are obtained for both the individual machines and the entire VAXcluster. Markov reward models are developed to analyze error/failure behavior and to evaluate performance loss due to errors/failures. Correlation analysis is then performed to quantify relationships of errors/failures across machines and across time. It is found that shared resources constitute a major reliability bottleneck. It is shown that, for the measured system, the homogeneous Markov model, which assumes constant failure rates, overestimates the transient reward rate for short-term operation and underestimates it for long-term operation. Correlation analysis shows that errors are highly correlated across machines and across time. The failure correlation coefficient is low; however, its effect on system unavailability is significant.
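The homogeneous-model comparison in the abstract can be made concrete with the simplest case: a two-state up/down Markov chain with constant failure rate λ and repair rate μ, whose transient availability (reward rate 1 when up, 0 when down) has a closed form. A minimal sketch, with hypothetical rates that are not values from the paper:

```python
import math

def transient_availability(lam, mu, t):
    """Transient availability A(t) of a two-state up/down Markov chain
    with constant failure rate lam and repair rate mu, starting in the
    up state. Closed form:
        A(t) = mu/(lam+mu) + lam/(lam+mu) * exp(-(lam+mu)*t)
    """
    s = lam + mu
    return mu / s + (lam / s) * math.exp(-s * t)

# Hypothetical rates: one failure per 100 hours, one repair per hour.
a_start = transient_availability(0.01, 1.0, 0.0)    # starts at 1.0
a_limit = transient_availability(0.01, 1.0, 1e6)    # tends to mu/(lam+mu)
```

The closed form shows why a constant-rate model is monotone: availability decays smoothly from 1 toward the steady-state value μ/(λ+μ), with none of the burst behavior the measurements reveal.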


IEEE Transactions on Reliability | 1993

Measurement-based evaluation of operating system fault tolerance

Inhwan Lee; Dong Tang; Ravishankar K. Iyer; Mei Chen Hsueh

The authors demonstrate a methodology for evaluating the fault-tolerance characteristics of operational software and illustrate it through case studies of three operating systems: the Tandem GUARDIAN fault-tolerant system, the VAX/VMS distributed system, and the IBM/MVS system. Based on measurements from these systems, software error characteristics are investigated by analyzing error distributions and correlation. Two levels of models are developed to analyze the error and recovery processes inside an operating system and the interactions among multiple copies of an operating system running in a distributed environment. Reward analysis is used to evaluate the loss of service due to software errors and the effect of fault-tolerant techniques implemented in the systems.


IEEE International Symposium on Fault-Tolerant Computing | 1991

Error/failure analysis using event logs from fault tolerant systems

Inhwan Lee; Ravishankar K. Iyer; Dong Tang

A methodology for the analysis of automatically generated event logs from fault tolerant systems is presented. The methodology is illustrated using event log data from three Tandem systems. Two are experimental systems, with nonstandard hardware and software components causing accelerated stresses and failures. Errors are identified on the basis of knowledge of the architectural and operational characteristics of the measured systems. The methodology takes a raw event log and reduces the data by event filtering and time-domain clustering. Probability distributions to characterize the error detection and recovery processes are obtained, and the corresponding hazards are calculated. Multivariate statistical techniques (factor analysis and cluster analysis) are used to investigate error and failure dependency among different system components. The dependency analysis is illustrated using processor halt data from one of the measured systems. It is found that the number of errors is small, even though the measurement period is relatively long. This reflects the high dependability of the measured systems.
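The time-domain clustering step described above can be sketched in a few lines: events whose timestamps fall within a fixed coalescence window of the previous event are merged into one cluster, on the assumption that they stem from the same underlying problem. The window value and sample data below are hypothetical, not from the paper:

```python
def coalesce(timestamps, window):
    """Time-domain clustering: group event timestamps (in seconds) so
    that consecutive events separated by less than `window` seconds
    fall into the same cluster. Returns a list of clusters."""
    clusters = []
    for t in sorted(timestamps):
        if clusters and t - clusters[-1][-1] < window:
            clusters[-1].append(t)   # within the window of the previous event
        else:
            clusters.append([t])     # gap exceeded: start a new cluster
    return clusters

events = [0, 2, 3, 50, 52, 300]
print(coalesce(events, 10))  # → [[0, 2, 3], [50, 52], [300]]
```

Choosing the window is the delicate part: too small splits one fault into several "errors", too large merges independent faults; the paper's event filtering addresses this with system-specific knowledge.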


IEEE International Symposium on Fault-Tolerant Computing | 1990

Failure analysis and modeling of a VAXcluster system

Dong Tang; Ravishankar K. Iyer; Sujatha S. Subramani

The authors discuss the results of a measurement-based analysis of real error data collected from a DEC VAXcluster multicomputer system. In addition to evaluating basic system dependability characteristics, such as error and failure distributions and hazard rates for both individual machines and the VAXcluster, they develop reward models to analyze the impact of failures on the system as a whole. The results show that more than 46% of all failures were due to errors in shared resources, despite the fact that these errors have a recovery probability greater than 0.99. The hazard rate calculations show that not only errors but also failures occur in bursts: approximately 40% of all failures occur in bursts and involve multiple machines. This result indicates that correlated failures are significant. Analysis of rewards shows that software errors have the lowest reward (0.05 versus 0.74 for disk errors). The expected reward rate (reliability measure) of the VAXcluster drops to 0.5 in 18 hours for the 7-out-of-7 model and in 80 days for the 3-out-of-7 model. The VAXcluster system availability is evaluated to be 0.993 over 250 days of operation.
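For intuition about the 7-out-of-7 versus 3-out-of-7 contrast above, the plain combinatorial k-out-of-n reliability is easy to compute. Note what this sketch assumes: independent, identical machines (precisely the assumption the correlated-failure results call into question) and no reward weighting, so it is an illustration of the model family, not the paper's calculation. The per-machine reliability value is arbitrary:

```python
from math import comb

def k_out_of_n_reliability(k, n, r):
    """Reliability of a k-out-of-n system of independent, identical
    components, each with reliability r: the probability that at
    least k of the n components are working."""
    return sum(comb(n, i) * r**i * (1 - r)**(n - i) for i in range(k, n + 1))

# With a hypothetical per-machine reliability of 0.9, a 3-out-of-7
# system is far more reliable than the 7-out-of-7 (series) case:
r3 = k_out_of_n_reliability(3, 7, 0.9)
r7 = k_out_of_n_reliability(7, 7, 0.9)  # equals 0.9**7
```

The large gap between the two configurations mirrors the 18-hours-versus-80-days gap reported in the abstract, though the measured system needs correlation-aware models to get the numbers right.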


Archive | 1992

Impact of Correlated Failures on Dependability in a VAXcluster System

Dong Tang; Ravishankar K. Iyer

This paper addresses the issue of correlated failures and their impact on system dependability. Measurements are made on a VAXcluster system, and validated analytical models are proposed to calculate availability and reliability for simple systems with correlated failures. A correlation analysis of the VAXcluster data shows that errors are highly correlated across machines (average correlation coefficient ρ = 0.62) due to sharing of resources. The measured failure correlation coefficient, however, is not high (0.15). Based on the VAXcluster data, it is shown that models that ignore correlated failures can underestimate unavailability by orders of magnitude; even a small correlation significantly affects system unavailability. A validated analytical model to calculate the unavailability of 1-out-of-2 systems with correlated failures is derived. Similarly, reliability is also significantly influenced by correlated failures. The joint failure rate of the two components, λ_f, is found to be the key parameter for estimating the reliability of 1-out-of-2 systems with correlated failures. A validated relationship between ρ and λ_f is also derived.
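To illustrate why even a small joint failure rate matters, here is a minimal steady-state sketch, not the validated model derived in the paper: a three-state Markov chain over the number of working units in a 1-out-of-2 system, with an extra common-cause transition at rate λ_f taking the system from both-up directly to both-down. All rates are hypothetical:

```python
def unavailability_1oo2(lam, lam_f, mu):
    """Steady-state unavailability of a 1-out-of-2 system with a
    common-cause (joint) failure rate lam_f, modeled as a 3-state
    Markov chain over the number of working units {2, 1, 0}.
    Individual failure rate lam, single-repair rate mu. This chain
    structure is an illustrative assumption, not the paper's model."""
    p2 = 1.0                              # unnormalized probabilities
    p1 = p2 * (2 * lam + lam_f) / mu      # balance equation at state 2
    p0 = (p2 * lam_f + p1 * lam) / mu     # balance equation at state 0
    total = p2 + p1 + p0
    return p0 / total                     # system down only in state 0

# Hypothetical rates: individual failures at 0.001/h, repairs at 1/h.
u_indep = unavailability_1oo2(0.001, 0.0, 1.0)     # independence assumed
u_corr  = unavailability_1oo2(0.001, 0.0001, 1.0)  # small joint rate added
```

With these arbitrary numbers, a joint failure rate one-tenth of the individual rate raises unavailability by roughly a factor of fifty, in line with the paper's qualitative claim that ignoring correlation can understate unavailability by orders of magnitude.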


Measurement and Modeling of Computer Systems | 1993

MEASURE+: a measurement-based dependability analysis package

Dong Tang; Ravishankar K. Iyer

Most existing dependability modeling and evaluation tools are designed for building and solving commonly used models, with emphasis on solution techniques rather than on identifying realistic models from measurements. In this paper, a measurement-based dependability analysis package, MEASURE+, is introduced. Given measured data from real systems in a specified format, MEASURE+ can generate appropriate dependability models and measures, including Markov and semi-Markov models, k-out-of-n availability models, failure distribution and hazard functions, and correlation parameters. These models and measures obtained from data are valuable for understanding actual error/failure characteristics, identifying system bottlenecks, evaluating dependability of real systems, and verifying assumptions made in analytical models. The paper illustrates MEASURE+ by applying it to data from a VAXcluster multicomputer system. Models of field failure behavior identified by MEASURE+ indicate that neither traditional models assuming failure independence nor the few models that take correlation into account are representative of the actual occurrence process of correlated failures.
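One of the measures listed above, the hazard function, can be estimated directly from measured inter-failure times. A minimal piecewise-constant (actuarial) estimate is sketched below with hypothetical data; a roughly flat estimate suggests exponential (constant-rate) behavior, while a rising or bursty one contradicts it:

```python
def empirical_hazard(samples, bin_width):
    """Piecewise-constant empirical hazard estimate from a list of
    inter-failure times: for each time bin,
        h_i = (failures in bin i) / (bin_width * units surviving at bin start).
    """
    data = sorted(samples)
    n = len(data)
    hazards = []
    t, idx = 0.0, 0
    while idx < n:
        at_risk = n - idx            # samples that survived to time t
        end = t + bin_width
        failed = 0
        while idx < n and data[idx] < end:
            failed += 1              # count failures inside this bin
            idx += 1
        hazards.append(failed / (bin_width * at_risk))
        t = end
    return hazards

# Hypothetical inter-failure times (bursts at the start and near t=10):
print(empirical_hazard([1, 2, 3, 10, 11, 12], 5))  # → [0.1, 0.0, 0.2]
```

Estimates like this, computed from field data, are what let a package of this kind check the constant-failure-rate assumptions baked into analytical models.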


Asian Test Symposium | 1993

Study of fault propagation using fault injection in the UNIX system

Wei-lun Kao; Dong Tang; Ravishankar K. Iyer

This paper presents a fault propagation study for the UNIX operating system (SunOS 4.1.2). Both hardware and software faults are injected into the UNIX kernel using FINE, a fault injection and monitoring environment, to investigate the propagation of various types of faults. Based on the experimental results, fault propagation models are built and transient reward analysis is performed to evaluate the performance loss due to a fault. Results from the experiments provide insight into the vulnerable aspects of the system where fault tolerance techniques could be applied.


Archive | 1994

Measurement-Based Dependability Evaluation of Operational Computer Systems

Ravishankar K. Iyer; Dong Tang

This paper discusses methodologies and advances in measurement-based dependability evaluation of operational computer systems. Research work over the past 15 years in this area is briefly reviewed. Methodologies are illustrated through discussion of the authors' representative studies. Specifically, measurement and data processing techniques, basic error characterization, dependency analysis, Markov reward modeling, software dependability, and fault diagnosis are addressed. The discussion covers methods used in the area and several important issues previously studied, including workload/failure dependency, correlated failures, and software fault tolerance.
