Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Qiang Guan is active.

Publication


Featured research published by Qiang Guan.


Symposium on Reliable Distributed Systems | 2013

Adaptive Anomaly Identification by Exploring Metric Subspace in Cloud Computing Infrastructures

Qiang Guan; Song Fu

Cloud computing has become increasingly popular by obviating the need for users to own and maintain complex computing infrastructures. However, due to their inherent complexity and large scale, production cloud computing systems are prone to various runtime problems caused by hardware and software faults and environmental factors. Autonomic anomaly detection is a crucial technique for understanding emergent, cloud-wide phenomena and self-managing cloud resources for system-level dependability assurance. To detect anomalous cloud behaviors, we need to monitor the cloud execution and collect runtime cloud performance data. These data consist of values of runtime performance metrics, and different types of failures exhibit different correlations with these metrics. In this paper, we present an adaptive anomaly identification mechanism that explores the most relevant principal components of different failure types in cloud computing infrastructures. It integrates the cloud performance metric analysis with filtering techniques to achieve automated, efficient, and accurate anomaly identification. The proposed mechanism adapts itself by recursively learning from the newly verified detection results to refine future detections. We have implemented a prototype of the anomaly identification system and conducted experiments in an on-campus cloud computing environment and by using the Google data center traces. Our experimental results show that our mechanism can achieve more efficient and accurate anomaly detection than other existing schemes.
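The abstract does not include an implementation, but the general idea of scoring samples against a low-dimensional subspace learned from normal performance metrics can be sketched as follows. This is a minimal illustration only; the synthetic data, metric counts, and 99th-percentile threshold are assumptions, not the paper's actual mechanism.

```python
# Minimal sketch of principal-component-based anomaly scoring on cloud
# performance metrics. Illustrative only; data sizes and the threshold are
# assumptions, not the paper's implementation.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def fit_normal_profile(normal_metrics, n_components=5):
    """Learn a low-dimensional subspace from metrics observed during normal runs."""
    scaler = StandardScaler().fit(normal_metrics)
    pca = PCA(n_components=n_components).fit(scaler.transform(normal_metrics))
    return scaler, pca

def anomaly_scores(scaler, pca, metrics):
    """Reconstruction error: distance of each sample from the learned subspace."""
    x = scaler.transform(metrics)
    x_hat = pca.inverse_transform(pca.transform(x))
    return np.linalg.norm(x - x_hat, axis=1)

# Synthetic example: 1000 normal samples, 200 performance metrics each.
rng = np.random.default_rng(0)
normal = rng.normal(size=(1000, 200))
scaler, pca = fit_normal_profile(normal)
threshold = np.percentile(anomaly_scores(scaler, pca, normal), 99)
scores = anomaly_scores(scaler, pca, rng.normal(loc=3.0, size=(10, 200)))
flagged = scores > threshold
```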


Journal of Communications | 2012

Ensemble of Bayesian Predictors and Decision Trees for Proactive Failure Management in Cloud Computing Systems

Qiang Guan; Ziming Zhang; Song Fu

In modern cloud computing systems, hundreds and even thousands of cloud servers are interconnected by multi-layer networks. In such large-scale and complex systems, failures are common. Proactive failure management is a crucial technology to characterize system behaviors and forecast failure dynamics in the cloud. To make failure predictions, we need to monitor the system execution and collect health-related runtime performance data. However, in newly deployed or managed cloud systems, these data are usually unlabeled. Supervised learning based approaches are not suitable in this case. In this paper, we present an unsupervised failure detection method using an ensemble of Bayesian models. It characterizes normal execution states of the system and detects anomalous behaviors. After the anomalies are verified by system administrators, labeled data are available. Then, we apply supervised learning based on decision tree classifiers to predict future failure occurrences in the cloud. Experimental results in an institute-wide cloud computing system show that our methods can achieve high true positive rate and low false positive rate for proactive failure management.
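A rough sketch of the two-stage workflow described above is shown below: an unsupervised model flags low-likelihood samples in unlabeled data, and once the flags are verified, a decision tree is trained on the resulting labels. GaussianMixture here is only a stand-in for the paper's ensemble of Bayesian models, and the data and thresholds are illustrative assumptions.

```python
# Sketch of the two-stage workflow: unsupervised detection on unlabeled data,
# then supervised decision-tree prediction on admin-verified labels.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
unlabeled = rng.normal(size=(2000, 20))          # unlabeled runtime metrics

# Stage 1: model the distribution of (mostly normal) data, flag low-likelihood samples.
gmm = GaussianMixture(n_components=3, random_state=1).fit(unlabeled)
log_lik = gmm.score_samples(unlabeled)
suspected_anomaly = log_lik < np.percentile(log_lik, 2)   # bottom 2% by likelihood

# Stage 2: administrators verify the flags (simulated here), producing labels
# that a supervised decision tree can learn from.
verified_labels = suspected_anomaly.astype(int)           # placeholder for admin feedback
clf = DecisionTreeClassifier(max_depth=5, random_state=1).fit(unlabeled, verified_labels)
predicted_failure = clf.predict(rng.normal(size=(5, 20)))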


Availability, Reliability and Security | 2011

Proactive Failure Management by Integrated Unsupervised and Semi-Supervised Learning for Dependable Cloud Systems

Qiang Guan; Ziming Zhang; Song Fu

Cloud computing systems continue to grow in their scale and complexity. They are changing dynamically as well due to the addition and removal of system components, changing execution environments, frequent updates and upgrades, online repairs and more. In such large-scale, complex and dynamic systems, failures are common. In this paper, we present a failure prediction mechanism exploiting both unsupervised and semi-supervised learning techniques for building dependable cloud computing systems. The unsupervised failure detection method uses an ensemble of Bayesian models. It characterizes normal execution states of the system and detects anomalous behaviors. After the anomalies are verified by system administrators, labeled data are available. Then, we apply supervised learning based on decision tree classifiers to predict future failure occurrences in the cloud. Experimental results in an institute-wide cloud computing system show that our proposed method can forecast failure dynamics with high accuracy.
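The semi-supervised step can be illustrated, under assumptions, with scikit-learn's self-training wrapper around a decision tree: a few administrator-verified labels plus many unlabeled samples (marked -1) train the classifier. This is not the paper's method, only a sketch of the general approach.

```python
# Sketch of semi-supervised failure prediction: a handful of verified labels
# plus many unlabeled samples train a decision-tree-based self-training model.
import numpy as np
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 15))                 # runtime performance samples
y = np.full(1000, -1)                           # -1 marks unverified (unlabeled) samples
y[:15] = 0                                      # a few admin-verified normal samples
y[15:30] = 1                                    # a few admin-verified failures

model = SelfTrainingClassifier(DecisionTreeClassifier(max_depth=4), threshold=0.8)
model.fit(X, y)
forecast = model.predict(rng.normal(size=(5, 15)))
```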


International Parallel and Distributed Processing Symposium | 2014

F-SEFI: A Fine-Grained Soft Error Fault Injection Tool for Profiling Application Vulnerability

Qiang Guan; Nathan DeBardeleben; Sean Blanchard; Song Fu

As the high performance computing (HPC) community continues to push towards exascale computing, resilience remains a serious challenge. With the expected decrease of both feature size and operating voltage, we expect a significant increase in hardware soft errors. HPC applications of today are only affected by soft errors to a small degree but we expect that this will become a more serious issue as HPC systems grow. We propose F-SEFI, a Fine-grained Soft Error Fault Injector, as a tool for profiling software robustness against soft errors. In this paper we utilize soft error injection to mimic the impact of errors on logic circuit behavior. Leveraging the open source virtual machine hypervisor QEMU, F-SEFI enables users to modify emulated machine instructions to introduce soft errors. F-SEFI can control what application, which sub-function, when and how to inject soft errors with different granularities, without interference to other applications that share the same environment. F-SEFI does this without requiring revisions to the application source code, compilers or operating systems. We discuss the design constraints for F-SEFI and the specifics of our implementation. We demonstrate use cases of F-SEFI on several benchmark applications to show how data corruption can propagate to incorrect results.
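F-SEFI injects faults inside QEMU's instruction emulation, which cannot be reproduced in a short snippet, but the underlying notion of a soft error can be illustrated: flip one bit of a floating-point operand and watch the corruption propagate through a computation. The helper names and the dot-product example below are hypothetical.

```python
# Illustration of the soft-error concept only, not F-SEFI itself: flip a single
# bit of an IEEE-754 double and observe how the corruption propagates.
import random
import struct

def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its 64-bit floating-point representation flipped."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (corrupted,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return corrupted

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

a = [1.0, 2.0, 3.0, 4.0]
b = [0.5, 0.25, 0.125, 0.0625]

clean = dot(a, b)
a[2] = flip_bit(a[2], random.randrange(64))   # inject the fault into one operand
faulty = dot(a, b)
print(f"clean={clean}, faulty={faulty}, corrupted={clean != faulty}")
```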


International Conference on Computer Communications and Networks | 2011

Ensemble of Bayesian Predictors for Autonomic Failure Management in Cloud Computing

Qiang Guan; Ziming Zhang; Song Fu

In modern cloud computing systems, hundreds and even thousands of cloud servers are interconnected by multi-layer networks. In such large-scale and complex systems, failures are common. Proactive failure management is a crucial technology to characterize system behaviors and forecast failure dynamics in the cloud. To make failure predictions, we need to monitor the system execution and collect health-related runtime performance data. However, in newly deployed or managed cloud systems, these data are usually unlabeled. Supervised learning based approaches are not suitable in this case. In this paper, we present an unsupervised failure detection method using an ensemble of Bayesian models. It estimates the probability distribution of runtime performance data collected by health monitoring tools when cloud servers perform normally. It characterizes normal execution states of the system and detects anomalous behaviors. Experimental results in an institute-wide cloud computing system show that our methods can achieve high true positive rate and low false positive rate for proactive failure management.
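A minimal sketch of density-based unsupervised detection, assuming a per-metric Gaussian model as a stand-in for the paper's ensemble of Bayesian predictors: estimate the distribution of metrics collected during normal operation and flag samples whose likelihood falls below a threshold.

```python
# Sketch of density-based unsupervised failure detection. The independent
# per-metric Gaussian model is an illustrative stand-in, not the paper's method.
import numpy as np

class GaussianEnsembleDetector:
    def fit(self, normal):
        self.mu = normal.mean(axis=0)
        self.sigma = normal.std(axis=0) + 1e-9
        return self

    def log_likelihood(self, x):
        # Sum of per-metric Gaussian log densities (independence assumption).
        z = (x - self.mu) / self.sigma
        return (-0.5 * z**2 - np.log(self.sigma) - 0.5 * np.log(2 * np.pi)).sum(axis=1)

rng = np.random.default_rng(3)
normal = rng.normal(size=(5000, 50))            # metrics from healthy servers
detector = GaussianEnsembleDetector().fit(normal)
threshold = np.percentile(detector.log_likelihood(normal), 1)
is_anomaly = detector.log_likelihood(rng.normal(loc=4, size=(10, 50))) < threshold
```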


Computer Software and Applications Conference | 2010

An Anomaly Detection Framework for Autonomic Management of Compute Cloud Systems

Derek Smith; Qiang Guan; Song Fu

In large-scale compute cloud systems, component failures become norms instead of exceptions. Failure occurrence as well as its impact on system performance and operation costs are becoming an increasingly important concern to system designers and administrators. When a system fails to function properly, health-related data are valuable for troubleshooting. However, it is challenging to effectively detect anomalies from the voluminous amount of noisy, high-dimensional data. The traditional manual approach is time-consuming, error-prone, and not scalable. In this paper, we present an autonomic mechanism for anomaly detection in compute cloud systems. A set of techniques is presented to automatically analyze collected data: data transformation to construct a uniform data format for data analysis, feature extraction to reduce data size, and unsupervised learning to detect the nodes acting differently from others. We evaluate our prototype implementation on an institute-wide compute cloud environment. The results show that our mechanism can effectively detect faulty nodes with high accuracy and low computation overhead.
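The three-step pipeline described above (data transformation, feature extraction, unsupervised learning) might look roughly like the following, with standardization, PCA, and an off-the-shelf outlier detector as illustrative stand-ins for the paper's specific techniques.

```python
# Sketch of the pipeline: standardize per-node health data, reduce its
# dimensionality, and flag nodes that behave differently from their peers.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(4)
node_metrics = rng.normal(size=(256, 120))       # one row of health metrics per node
node_metrics[7] += 6                             # one node drifts from the others

features = PCA(n_components=10).fit_transform(StandardScaler().fit_transform(node_metrics))
labels = LocalOutlierFactor(n_neighbors=20).fit_predict(features)   # -1 marks outliers
faulty_nodes = np.flatnonzero(labels == -1)
```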


Pacific Rim International Symposium on Dependable Computing | 2012

CDA: A Cloud Dependability Analysis Framework for Characterizing System Dependability in Cloud Computing Infrastructures

Qiang Guan; Chi-Chen Chiu; Song Fu

Cloud computing has become increasingly popular by obviating the need for users to own and maintain complex computing infrastructure. However, due to their inherent complexity and large scale, production cloud computing systems are prone to various runtime problems caused by hardware and software failures. Dependability assurance is crucial for building sustainable cloud computing services. Although many techniques have been proposed to analyze and enhance reliability of distributed systems, there is little work on understanding the dependability of cloud computing environments. As virtualization has been an enabling technology for the cloud, it is imperative to investigate the impact of virtualization on the cloud dependability, which is the focus of this work. In this paper, we present a cloud dependability analysis (CDA) framework with mechanisms to characterize failure behavior in cloud computing infrastructures. We design failure-metric DAGs (directed acyclic graphs) to analyze the correlation of various performance metrics with failure events in virtualized and non-virtualized systems. We study multiple types of failures. By comparing the generated DAGs in the two environments, we gain insight into the impact of virtualization on the cloud dependability. This paper is the first attempt to study this crucial issue. In addition, we exploit the identified metrics for failure detection. Experimental results from an on-campus cloud computing test bed show that our approach can achieve high detection accuracy while using a small number of performance metrics.
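The correlation analysis underlying the failure-metric DAGs can be approximated, under assumptions, by correlating each metric with a binary failure indicator and keeping the most strongly correlated metrics for detection; building the full DAG structure is beyond this sketch, and the data and threshold are illustrative.

```python
# Sketch of failure-metric correlation ranking, not the CDA framework itself.
import numpy as np

rng = np.random.default_rng(5)
metrics = rng.normal(size=(10_000, 80))          # runtime metrics per sample
failed = (metrics[:, 3] + 0.5 * metrics[:, 17] + rng.normal(size=10_000)) > 2.5

# Point-biserial (Pearson) correlation of each metric with the failure events.
corr = np.array([np.corrcoef(metrics[:, j], failed)[0, 1] for j in range(metrics.shape[1])])
ranked = np.argsort(-np.abs(corr))
selected = ranked[:10]                           # metrics most correlated with failures
print("metrics most correlated with failures:", selected)
```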


International Conference on Parallel Processing | 2011

Experimental framework for injecting logic errors in a virtual machine to profile applications for soft error resilience

Nathan DeBardeleben; Sean Blanchard; Qiang Guan; Ziming Zhang; Song Fu

As the high performance computing (HPC) community continues to push for ever larger machines, reliability remains a serious obstacle. Further, as feature size and voltages decrease, the rate of transient soft errors is on the rise. HPC programmers of today have to deal with these faults to a small degree and it is expected this will only be a larger problem as systems continue to scale. In this paper we present SEFI, the Soft Error Fault Injection framework, a tool for profiling software for its susceptibility to soft errors. In particular, we focus in this paper on logic soft error injection. Using the open source virtual machine and processor emulator (QEMU), we demonstrate modifying emulated machine instructions to introduce soft errors. We conduct experiments by modifying the virtual machine itself in a way that does not require intimate knowledge of the tested application. With this technique, we show that we are able to inject simulated soft errors in the logic operations of a target application without affecting other applications or the operating system sharing the VM. We present some initial results and discuss where we think this work will be useful in next generation hardware/software co-design.
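As with F-SEFI above, SEFI itself operates inside QEMU, but the idea of injecting a fault into a logic operation without touching the application code can be sketched at a much coarser level: wrap an operation so that, with some probability, one bit of its result is flipped. The wrapper, probability, and operation below are hypothetical.

```python
# Coarse illustration of logic soft-error injection: corrupt the result of an
# operation with some probability, leaving the calling code unchanged.
import random

def with_soft_errors(op, flip_probability=0.01, width=64):
    """Wrap an integer operation so its result may suffer a single-bit upset."""
    def faulty(*args):
        result = op(*args)
        if random.random() < flip_probability:
            result ^= 1 << random.randrange(width)   # single-bit upset in the result
        return result
    return faulty

multiply = with_soft_errors(lambda a, b: a * b, flip_probability=0.5)
print([multiply(6, 7) for _ in range(5)])   # mostly 42, occasionally corrupted
```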


International Performance, Computing, and Communications Conference | 2010

auto-AID: A data mining framework for autonomic anomaly identification in networked computer systems

Qiang Guan; Song Fu

Networked computer systems continue to grow in scale and in the complexity of their components and interactions. Component failures become norms instead of exceptions in these environments. A failure will cause one or multiple computer(s) to be unavailable, which affects the resource utilization and system throughput. When a computer fails to function properly, health-related data are valuable for troubleshooting. However, it is challenging to effectively identify anomalies from the voluminous amount of noisy, high-dimensional data. In this paper, we present auto-AID, an autonomic mechanism for anomaly identification in networked computer systems. It is composed of a set of data mining techniques that facilitates automatic analysis of system health data. The identification results are very valuable for the system administrators to manage systems and schedule the available resources. We implement a prototype of auto-AID and evaluate it on a production institution-wide compute grid. The results show that auto-AID can effectively identify anomalies with little human intervention.


Networking, Architecture and Storage | 2012

Efficient and Accurate Anomaly Identification Using Reduced Metric Space in Utility Clouds

Qiang Guan; Chi-Chen Chiu; Ziming Zhang; Song Fu

The online detection of anomalies is a vital element of operations in utility clouds. Detection should function for different levels of abstraction including hardware and software, and for the various metrics used in cloud computing systems. Given ever-increasing cloud sizes coupled with the complexity of system components, continuous monitoring leads to an overwhelming volume of data collected by health monitoring tools. High metric dimensionality and the existence of interacting metrics compromise the detection accuracy and lead to high detection complexity. In this paper, we present a metric selection framework and propose systematic approaches to effectively identify and select the most essential metrics for online anomaly detection in utility clouds. Specifically, a mutual information based approach selects metrics with maximized mutual relevance and minimized redundancy. Then metric space combination and separation are explored to reduce the metric dimensionality further. Experimental results on utility cloud scenarios demonstrate the viability and efficiency of this framework. The selected metrics contribute to a high efficiency and accuracy in anomaly detection.
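A sketch of mutual-information-based metric selection in the spirit of the framework above: greedily pick metrics with high relevance to the anomaly label and low redundancy with metrics already selected (an mRMR-style criterion). The data, sizes, and selection budget are illustrative assumptions, not the paper's framework.

```python
# Greedy max-relevance, min-redundancy metric selection sketch.
import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def select_metrics(X, y, k=5):
    relevance = mutual_info_classif(X, y, random_state=0)
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        best, best_score = None, -np.inf
        for j in range(X.shape[1]):
            if j in selected:
                continue
            redundancy = np.mean([
                mutual_info_regression(X[:, [j]], X[:, s], random_state=0)[0]
                for s in selected
            ])
            score = relevance[j] - redundancy          # max relevance, min redundancy
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
    return selected

rng = np.random.default_rng(6)
X = rng.normal(size=(1000, 30))                  # candidate monitoring metrics
y = (X[:, 5] + X[:, 21] > 1).astype(int)         # synthetic anomaly label
print(select_metrics(X, y, k=5))
```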

Collaboration


Dive into Qiang Guan's collaborations.

Top Co-Authors

Nathan DeBardeleben, Los Alamos National Laboratory
Song Fu, University of North Texas
Sean Blanchard, Los Alamos National Laboratory
Ziming Zhang, University of North Texas
Xin Liang, University of California
James P. Ahrens, Los Alamos National Laboratory
Jieyang Chen, University of California
Laura Monroe, Los Alamos National Laboratory
Li-Ta Lo, Los Alamos National Laboratory
Panruo Wu, University of California