Song Fu | Researchain

Publication

Featured researches published by Song Fu.

conference on high performance computing (supercomputing) | 2007

Exploring event correlation for failure prediction in coalitions of clusters

Song Fu; Cheng Zhong Xu

In large-scale networked computing systems, component failures become norms instead of exceptions. Failure prediction is a crucial technique for self-managing resource burdens. Failure events in coalition systems exhibit strong correlations in the time and space domains. In this paper, we develop a spherical covariance model with an adjustable timescale parameter to quantify the temporal correlation and a stochastic model to describe spatial correlation. We further utilize the information of application allocation to discover more correlations among failure instances. We cluster failure events based on their correlations and predict their future occurrences. We implemented a failure prediction framework, called PREdictor of Failure Events Correlated Temporal-Spatially (hPREFECTs), which explores correlations among failures and forecasts the time-between-failure of future instances. We evaluate the performance of hPREFECTs in both offline prediction of failure by using the Los Alamos HPC traces and online prediction in an institute-wide cluster coalition environment. Experimental results show the system achieves more than 76% accuracy in offline prediction and more than 70% accuracy in online prediction during the time from May 2006 to April 2007.
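The spherical covariance model named in this abstract can be illustrated with the standard geostatistical form, C(d) = 1 - 1.5(d/θ) + 0.5(d/θ)^3 for d < θ and 0 otherwise. The sketch below assumes that form and treats θ as the adjustable timescale; the function name and the sample timestamps are hypothetical, and the paper's exact formulation may differ.

```python
def spherical_covariance(dt, theta):
    """Spherical covariance for a time gap dt and timescale theta.

    Assumed standard geostatistical form; not taken from the paper itself.
    Returns 1.0 for simultaneous events and 0.0 once the gap reaches theta.
    """
    if theta <= 0:
        raise ValueError("theta must be positive")
    d = abs(dt)
    if d >= theta:
        return 0.0
    r = d / theta
    return 1.0 - 1.5 * r + 0.5 * r ** 3


# Hypothetical failure timestamps (hours) and timescale, for illustration only.
failure_times = [0.0, 2.0, 3.0, 40.0]
theta = 10.0

# Pairwise temporal correlation between every two failure events.
pairs = [
    (a, b, spherical_covariance(b - a, theta))
    for i, a in enumerate(failure_times)
    for b in failure_times[i + 1:]
]
for a, b, c in pairs:
    print(f"t={a:5.1f} and t={b:5.1f}: correlation {c:.3f}")
```

Events closer together than θ receive a nonzero correlation, so a clustering step could group them; events separated by more than θ are treated as temporally uncorrelated.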

symposium on reliable distributed systems | 2007

Quantifying Temporal and Spatial Correlation of Failure Events for Proactive Management

Song Fu; Cheng Zhong Xu

Networked computing systems continue to grow in scale and in the complexity of their components and interactions. Component failures become norms instead of exceptions in these environments. Moreover, failure events exhibit strong correlations in the time and space domains. In this paper, we develop a spherical covariance model with an adjustable timescale parameter to quantify the temporal correlation and a stochastic model to characterize spatial correlation. The models are further extended to take into account the information of application allocation to discover more correlations among failure instances. We cluster failure events based on their correlations and predict their future occurrences. Experimental results on a production coalition system, the Wayne State Grid, show the offline and online predictions by our predicting system can forecast 72.7% to 85.3% of the failure occurrences and capture failure correlations in a cluster coalition environment.

symposium on reliable distributed systems | 2013

Adaptive Anomaly Identification by Exploring Metric Subspace in Cloud Computing Infrastructures

Qiang Guan; Song Fu

Cloud computing has become increasingly popular by obviating the need for users to own and maintain complex computing infrastructures. However, due to their inherent complexity and large scale, production cloud computing systems are prone to various runtime problems caused by hardware and software faults and environmental factors. Autonomic anomaly detection is a crucial technique for understanding emergent, cloud-wide phenomena and self-managing cloud resources for system-level dependability assurance. To detect anomalous cloud behaviors, we need to monitor the cloud execution and collect runtime cloud performance data. These data consist of values of performance metrics for different types of failures, which display different correlations with the performance metrics. In this paper, we present an adaptive anomaly identification mechanism that explores the most relevant principal components of different failure types in cloud computing infrastructures. It integrates the cloud performance metric analysis with filtering techniques to achieve automated, efficient, and accurate anomaly identification. The proposed mechanism adapts itself by recursively learning from the newly verified detection results to refine future detections. We have implemented a prototype of the anomaly identification system and conducted experiments in an on-campus cloud computing environment and by using the Google data center traces. Our experimental results show that our mechanism can achieve more efficient and accurate anomaly detection than other existing schemes.

Journal of Communications | 2012

Ensemble of Bayesian Predictors and Decision Trees for Proactive Failure Management in Cloud Computing Systems

Qiang Guan; Ziming Zhang; Song Fu

In modern cloud computing systems, hundreds and even thousands of cloud servers are interconnected by multi-layer networks. In such large-scale and complex systems, failures are common. Proactive failure management is a crucial technology to characterize system behaviors and forecast failure dynamics in the cloud. To make failure predictions, we need to monitor the system execution and collect health-related runtime performance data. However, in newly deployed or managed cloud systems, these data are usually unlabeled. Supervised learning based approaches are not suitable in this case. In this paper, we present an unsupervised failure detection method using an ensemble of Bayesian models. It characterizes normal execution states of the system and detects anomalous behaviors. After the anomalies are verified by system administrators, labeled data are available. Then, we apply supervised learning based on decision tree classifiers to predict future failure occurrences in the cloud. Experimental results in an institute-wide cloud computing system show that our methods can achieve high true positive rate and low false positive rate for proactive failure management.

Journal of Parallel and Distributed Computing | 2010

Quantifying event correlations for proactive failure management in networked computing systems

Song Fu; Cheng Zhong Xu

Networked computing systems continue to grow in scale and in the complexity of their components and interactions. Component failures become norms instead of exceptions in these environments. Moreover, failure events exhibit strong correlations in the time and space domains. In this paper, we develop a spherical covariance model with an adjustable timescale parameter to quantify the temporal correlation and a stochastic model to characterize spatial correlation. The models are further extended to take into account the information of application allocation to discover more correlations among failure instances. We cluster failure events based on their correlations and predict their future occurrences. Experimental results on a production coalition system, the Wayne State Computational Grid, show the offline and online predictions made by our predicting system can forecast 72.7-85.3% of the failure occurrences and capture failure correlations in a cluster coalition environment.

global communications conference | 2011

Performance Metric Selection for Autonomic Anomaly Detection on Cloud Computing Systems

Song Fu

With ever-growing complexity and dynamicity of cloud computing systems, dependability assurance has become a major concern in system design and management. In this paper, we propose a framework for autonomic anomaly detection in the cloud. Mutual information is exploited to quantify the relevance and redundancy among the large number of performance metrics. An incremental search algorithm is presented for metric selection. We apply principal component analysis to further reduce the metric dimension, while preserving as much of the variance in the health-related data as possible. A detection mechanism with semi-supervised decision tree classifiers works on the reduced metric dimensionality and identifies anomalies. We have implemented a prototype of our autonomic anomaly detection framework and evaluated its performance on an institute-wide cloud computing system.
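As a rough illustration of how mutual information can score relevance and redundancy among metrics, the sketch below computes it for discretized samples. The metric names, the binning, and the sample values are hypothetical; the abstract does not specify how the paper discretizes metrics or runs its incremental search.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi


# Hypothetical binned metrics (0 = low, 1 = high) and health labels (1 = anomalous).
cpu_bin = [0, 0, 1, 1, 1, 0, 1, 0]
mem_bin = [0, 1, 0, 1, 0, 1, 0, 1]
label = [0, 0, 1, 1, 1, 0, 1, 0]

relevance_cpu = mutual_information(cpu_bin, label)   # metric vs. label
relevance_mem = mutual_information(mem_bin, label)
redundancy = mutual_information(cpu_bin, mem_bin)    # metric vs. metric
print(relevance_cpu, relevance_mem, redundancy)
```

A selection step in this spirit would prefer metrics with high relevance to the label and low redundancy with metrics already chosen.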

availability, reliability and security | 2011

Proactive Failure Management by Integrated Unsupervised and Semi-Supervised Learning for Dependable Cloud Systems

Qiang Guan; Ziming Zhang; Song Fu

Cloud computing systems continue to grow in their scale and complexity. They are changing dynamically as well due to the addition and removal of system components, changing execution environments, frequent updates and upgrades, online repairs and more. In such large-scale complex and dynamic systems, failures are common. In this paper, we present a failure prediction mechanism exploiting both unsupervised and semi-supervised learning techniques for building dependable cloud computing systems. The unsupervised failure detection method uses an ensemble of Bayesian models. It characterizes normal execution states of the system and detects anomalous behaviors. After the anomalies are verified by system administrators, labeled data are available. Then, we apply supervised learning based on decision tree classifiers to predict future failure occurrences in the cloud. Experimental results in an institute-wide cloud computing system show that our proposed method can forecast failure dynamics with high accuracy.

international conference on parallel processing | 2005

Service migration in distributed virtual machines for adaptive grid computing

Song Fu; Cheng Zhong Xu

Computational grids can integrate geographically distributed resources into a seamless environment. To facilitate managing these heterogeneous resources, the virtual machine technology provides a powerful layer of abstraction and allows multiple applications to multiplex the resources of a grid computer. On the other hand, grid dynamics require the virtual machine system to be distributed and reconfigurable. However, the existing migration approaches only move the execution entities, such as processes, threads, and mobile agents, among servers and leave the runtime services behind. They cannot achieve service reconfiguration in the face of server overload or failures. In this paper, we propose a service migration mechanism, which moves the computational services of a virtual server, for instance a shared array runtime support system, to available servers for adaptive grid computing. In this way, parallel jobs can resume computation on a remote server without requiring service preinstallation. As an illustration of the service migration mechanism, we incorporated it into a Java-compliant distributed virtual machine, DSA, and formed a Mobile DSA (M-DSA) to accommodate adaptive parallel applications in grids. We measured the performance of M-DSA in the execution of applications from the SPLASH-2 benchmark suite on a campus grid. Experimental results show that service migration can achieve system adaptivity effectively.

international parallel and distributed processing symposium | 2014

F-SEFI: A Fine-Grained Soft Error Fault Injection Tool for Profiling Application Vulnerability

Qiang Guan; Nathan DeBardeleben; Sean Blanchard; Song Fu

As the high performance computing (HPC) community continues to push towards exascale computing, resilience remains a serious challenge. With the expected decrease of both feature size and operating voltage, we expect a significant increase in hardware soft errors. HPC applications of today are only affected by soft errors to a small degree but we expect that this will become a more serious issue as HPC systems grow. We propose F-SEFI, a Fine-grained Soft Error Fault Injector, as a tool for profiling software robustness against soft errors. In this paper we utilize soft error injection to mimic the impact of errors on logic circuit behavior. Leveraging the open source virtual machine hypervisor QEMU, F-SEFI enables users to modify emulated machine instructions to introduce soft errors. F-SEFI can control what application, which sub-function, when and how to inject soft errors with different granularities, without interference to other applications that share the same environment. F-SEFI does this without requiring revisions to the application source code, compilers or operating systems. We discuss the design constraints for F-SEFI and the specifics of our implementation. We demonstrate use cases of F-SEFI on several benchmark applications to show how data corruption can propagate to incorrect results.
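To give a sense of what a single soft error can do to application data, the sketch below flips one bit in the IEEE 754 encoding of a double. It only illustrates the data-level effect; it is not F-SEFI's mechanism, which operates on emulated machine instructions inside QEMU, and the helper name `flip_bit` is hypothetical.

```python
import struct


def flip_bit(value, bit):
    """Return value with one bit of its 64-bit IEEE 754 encoding flipped.

    Bit 0 is the least significant mantissa bit; bit 63 is the sign bit.
    """
    if not 0 <= bit < 64:
        raise ValueError("bit must be in the range 0..63")
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (corrupted,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return corrupted


# The same single-bit fault has very different impact depending on its position.
for bit in (0, 52, 62, 63):
    print(f"bit {bit:2d}: 1.0 -> {flip_bit(1.0, bit)!r}")
```

A flip in a low mantissa bit perturbs the value only slightly, while a flip in an exponent or sign bit changes it drastically, which is one reason fault position matters when profiling vulnerability.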

symposium on reliable distributed systems | 2012

AAD: Adaptive Anomaly Detection System for Cloud Computing Infrastructures

Husanbir Singh Pannu; Jianguo Liu; Song Fu

Cloud computing has become increasingly popular by obviating the need for users to own and maintain complex computing infrastructure. However, due to their inherent complexity and large scale, production cloud computing systems are prone to various runtime problems caused by hardware and software failures. Autonomic failure detection is a crucial technique for understanding emergent, cloud-wide phenomena and self-managing cloud resources for system-level dependability assurance. To detect failures, we need to monitor the cloud execution and collect runtime performance data. These data are usually unlabeled, and thus a prior failure history is not always available in production clouds, especially for newly managed or deployed systems. In this paper, we present an Adaptive Anomaly Detection (AAD) framework for cloud dependability assurance. It employs data description using a hypersphere for adaptive failure detection. Based on the cloud performance data, AAD detects possible failures, which are verified by the cloud operators. They are confirmed as either true failures with failure types or normal states. The algorithm adapts itself by recursively learning from these newly verified detection results to refine future detections. Meanwhile, it exploits the observed but undetected failure records reported by the cloud operators to identify new types of failures. We have implemented a prototype of the algorithm and conducted experiments in an on-campus cloud computing environment. Our experimental results show that AAD can achieve more efficient and accurate failure detection than other existing schemes.