Ziming Zhang
University of North Texas
Publication
Featured research published by Ziming Zhang.
Journal of Communications | 2012
Qiang Guan; Ziming Zhang; Song Fu
In modern cloud computing systems, hundreds and even thousands of cloud servers are interconnected by multi-layer networks. In such large-scale and complex systems, failures are common. Proactive failure management is a crucial technology to characterize system behaviors and forecast failure dynamics in the cloud. To make failure predictions, we need to monitor the system execution and collect health-related runtime performance data. However, in newly deployed or managed cloud systems, these data are usually unlabeled. Supervised learning based approaches are not suitable in this case. In this paper, we present an unsupervised failure detection method using an ensemble of Bayesian models. It characterizes normal execution states of the system and detects anomalous behaviors. After the anomalies are verified by system administrators, labeled data are available. Then, we apply supervised learning based on decision tree classifiers to predict future failure occurrences in the cloud. Experimental results in an institute-wide cloud computing system show that our methods can achieve a high true positive rate and a low false positive rate for proactive failure management.
availability, reliability and security | 2011
Qiang Guan; Ziming Zhang; Song Fu
Cloud computing systems continue to grow in their scale and complexity. They are changing dynamically as well due to the addition and removal of system components, changing execution environments, frequent updates and upgrades, online repairs and more. In such large-scale, complex and dynamic systems, failures are common. In this paper, we present a failure prediction mechanism exploiting both unsupervised and semi-supervised learning techniques for building dependable cloud computing systems. The unsupervised failure detection method uses an ensemble of Bayesian models. It characterizes normal execution states of the system and detects anomalous behaviors. After the anomalies are verified by system administrators, labeled data are available. Then, we apply supervised learning based on a decision tree classifier to predict future failure occurrences in the cloud. Experimental results in an institute-wide cloud computing system show that our proposed method can forecast failure dynamics with high accuracy.
international conference on computer communications and networks | 2011
Qiang Guan; Ziming Zhang; Song Fu
In modern cloud computing systems, hundreds and even thousands of cloud servers are interconnected by multi-layer networks. In such large-scale and complex systems, failures are common. Proactive failure management is a crucial technology to characterize system behaviors and forecast failure dynamics in the cloud. To make failure predictions, we need to monitor the system execution and collect health-related runtime performance data. However, in newly deployed or managed cloud systems, these data are usually unlabeled. Supervised learning based approaches are not suitable in this case. In this paper, we present an unsupervised failure detection method using an ensemble of Bayesian models. It estimates the probability distribution of runtime performance data collected by health monitoring tools when cloud servers perform normally. It characterizes normal execution states of the system and detects anomalous behaviors. Experimental results in an institute-wide cloud computing system show that our methods can achieve a high true positive rate and a low false positive rate for proactive failure management.
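The idea of modeling normal behavior and flagging low-probability observations can be illustrated with a minimal sketch. This is not the authors' ensemble of Bayesian models: it assumes a single naive-Bayes-style model with one independent Gaussian per metric, and the function names, the sample metrics (CPU utilization, I/O wait), and the log-likelihood threshold are all hypothetical.

```python
from math import log, pi
from statistics import mean, pvariance


def fit_gaussians(samples):
    """Fit one independent Gaussian (mean, variance) per metric column.

    `samples` holds measurements taken while servers are presumed healthy.
    """
    columns = list(zip(*samples))
    # A small variance floor avoids division by zero for constant metrics.
    return [(mean(col), max(pvariance(col), 1e-9)) for col in columns]


def log_likelihood(model, sample):
    """Joint log-likelihood of one sample under the independent Gaussians."""
    return sum(
        -0.5 * (log(2 * pi * var) + (x - mu) ** 2 / var)
        for (mu, var), x in zip(model, sample)
    )


def detect_anomalies(model, samples, threshold):
    """Return indices of samples whose log-likelihood falls below threshold."""
    return [i for i, s in enumerate(samples) if log_likelihood(model, s) < threshold]


# Hypothetical (cpu_utilization_percent, io_wait_percent) pairs from normal runs.
normal_runs = [(50.0, 1.0), (52.0, 1.2), (48.0, 0.8), (51.0, 1.1), (49.0, 0.9)]
model = fit_gaussians(normal_runs)

# The threshold of -10.0 is an arbitrary illustrative choice.
observed = [(50.0, 1.0), (95.0, 5.0), (52.0, 1.2)]
flagged = detect_anomalies(model, observed, threshold=-10.0)
```

In practice the threshold would be tuned against the desired true positive and false positive rates, and flagged samples would go to an administrator for verification.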
international conference on parallel processing | 2011
Nathan DeBardeleben; Sean Blanchard; Qiang Guan; Ziming Zhang; Song Fu
As the high performance computing (HPC) community continues to push for ever larger machines, reliability remains a serious obstacle. Further, as feature size and voltages decrease, the rate of transient soft errors is on the rise. HPC programmers of today have to deal with these faults to a small degree and it is expected this will only be a larger problem as systems continue to scale. In this paper we present SEFI, the Soft Error Fault Injection framework, a tool for profiling software for its susceptibility to soft errors. In particular, we focus in this paper on logic soft error injection. Using the open source virtual machine and processor emulator (QEMU), we demonstrate modifying emulated machine instructions to introduce soft errors. We conduct experiments by modifying the virtual machine itself in a way that does not require intimate knowledge of the tested application. With this technique, we show that we are able to inject simulated soft errors in the logic operations of a target application without affecting other applications or the operating system sharing the VM. We present some initial results and discuss where we think this work will be useful in next generation hardware/software co-design.
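The core operation behind logic soft error injection, corrupting the result of an operation by flipping a bit, can be sketched at the application level. This is only an illustration of the concept: SEFI itself works by modifying emulated machine instructions inside QEMU, which this sketch does not attempt. The function names and the injection probability parameter are hypothetical.

```python
import random
import struct


def flip_bit_int(value: int, bit: int, width: int = 64) -> int:
    """Flip one bit of an unsigned integer of the given bit width."""
    if not 0 <= bit < width:
        raise ValueError("bit index out of range")
    mask = (1 << width) - 1
    return (value ^ (1 << bit)) & mask


def flip_bit_float(value: float, bit: int) -> float:
    """Flip one bit in the IEEE 754 double-precision encoding of a float."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (corrupted,) = struct.unpack("<d", struct.pack("<Q", flip_bit_int(raw, bit)))
    return corrupted


def faulty_multiply(a: float, b: float, rng: random.Random, probability: float) -> float:
    """Multiply two floats; with the given probability, flip one random result bit."""
    result = a * b
    if rng.random() < probability:
        result = flip_bit_float(result, rng.randrange(64))
    return result
```

Flipping bit 63 of a double changes only its sign, while flips in the exponent field can change the magnitude by many orders, which is one reason susceptibility varies so much between applications.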
networking architecture and storages | 2012
Qiang Guan; Chi-Chen Chiu; Ziming Zhang; Song Fu
The online detection of anomalies is a vital element of operations in utility clouds. Detection should function for different levels of abstraction including hardware and software, and for the various metrics used in cloud computing systems. Given ever-increasing cloud sizes coupled with the complexity of system components, continuous monitoring leads to an overwhelming volume of data collected by health monitoring tools. High metric dimensionality and the existence of interacting metrics compromise the detection accuracy and lead to high detection complexity. In this paper, we present a metric selection framework and propose systematic approaches to effectively identify and select the most essential metrics for online anomaly detection in utility clouds. Specifically, a mutual information based approach selects metrics with maximized mutual relevance and minimized redundancy. Then metric space combination and separation are explored to reduce the metric dimensionality further. Experimental results on utility cloud scenarios demonstrate the viability and efficiency of this framework. The selected metrics contribute to high efficiency and accuracy in anomaly detection.
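One common way to combine maximal relevance with minimal redundancy is a greedy selection that scores each candidate metric by its mutual information with the target minus its average mutual information with the metrics already chosen. The sketch below follows that scheme under several assumptions not stated in the abstract: metric values are already discretized, relevance is measured against anomaly labels, and the metric names and data are hypothetical. It does not cover the metric space combination and separation step.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (count / n) * log2(count * n / (px[x] * py[y]))
        for (x, y), count in pxy.items()
    )


def select_metrics(metrics, labels, k):
    """Greedily pick k metric names: relevance minus mean redundancy."""
    selected = []
    candidates = list(metrics)

    def score(name):
        relevance = mutual_information(metrics[name], labels)
        if not selected:
            return relevance
        redundancy = sum(
            mutual_information(metrics[name], metrics[s]) for s in selected
        ) / len(selected)
        return relevance - redundancy

    while candidates and len(selected) < k:
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Hypothetical discretized metrics (0 = low, 1 = high) and anomaly labels.
labels = [0, 0, 0, 1, 1, 1, 1, 1]
metrics = {
    "cpu_user": [0, 0, 0, 0, 1, 1, 1, 1],
    "cpu_total": [0, 0, 0, 0, 1, 1, 1, 1],  # duplicates cpu_user
    "disk_io": [0, 1, 0, 1, 0, 1, 0, 1],
}
chosen = select_metrics(metrics, labels, k=2)
```

Here the duplicated metric is passed over in the second round because its redundancy with the first pick outweighs its relevance, so a weaker but independent metric is preferred.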
ieee international symposium on parallel distributed processing workshops and phd forum | 2010
Ziming Zhang; Song Fu
Networked computer systems continue to grow in scale and in the complexity of their components and interactions. Component failures become norms instead of exceptions in these environments. Failure occurrence as well as its impact on system performance and operation costs are becoming an increasingly important concern to system designers and administrators. To achieve self-management of failures and resources in networked computer systems, we propose a framework for autonomic failure management with hierarchical failure prediction functionality for large coalition systems, such as coalition clusters and compute grids. It analyzes node, cluster and system wide failure behaviors and forecasts the prospective failure occurrences based on quantified failure dynamics. Failure correlations are inspected by the predictor. Experimental results in a computational grid on campus show the offline and online predictions by our predictors accurately forecast the failure trend and capture failure correlations in the production environment.
ieee international conference on cloud computing technology and science | 2011
Ziming Zhang; Song Fu
Power and energy are primary concerns in the design and management of modern cloud computing systems and data centers. Operational costs for powering and cooling large-scale cloud systems will soon exceed acquisition costs. To improve the energy efficiency of cloud computing systems and applications, it is critical to profile the power usage of real systems and applications. Many factors influence power and energy usage in cloud systems, including each component's electrical specification, the system usage characteristics of the applications, and system software. In this work, we present the power profiling results on a cloud test bed. We combine hardware and software tools to achieve power and energy profiling at server granularity. We collect the power and energy usage data with varying server/cloud configurations, and quantify their correlation. Our experiments reveal how different system configurations affect the server/cloud power and energy usage.
international workshop on energy efficient supercomputing | 2014
Ziming Zhang; Michael Lang; Scott Pakin; Song Fu
Power-aware parallel job scheduling has been recognized as a demanding issue in the high-performance computing (HPC) community. The goal is to efficiently allocate and utilize power and energy in machine rooms. In practice the power for machine rooms is well over-provisioned, specified by high energy LINPACK runs or nameplate power estimates. This results in a considerable amount of trapped power capacity. Instead of being wasted, this trapped power capacity should be reclaimed to accommodate more compute nodes in the machine room and thereby increase system throughput. But to do this we need the ability to enforce a system-wide power cap. In this paper, we present TracSim, a full-system simulator that enables users to evaluate the performance of different policies for scheduling parallel tasks under a power cap. TracSim simulates the executing environment of an HPC cluster at Los Alamos National Laboratory (LANL). We use real measurements from the LANL cluster to set the configuration parameters of TracSim. TracSim enables users to specify the system topology, hardware configuration, power cap, and task workload, and to develop resource configuration and task scheduling policies aiming to maximize machine-room throughput while keeping power consumption under a power cap by exploiting CPU throttling techniques. We leverage TracSim to implement and evaluate three resource scheduling policies. Simulation results show the performance of those policies and quantify the amount of trapped capacity that can effectively be reclaimed.
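A scheduling policy under a system-wide power cap can be sketched as a first-come, first-served admission loop that throttles a job to a lower CPU frequency when full speed would exceed the remaining budget. This is a hypothetical toy policy, not one of the three policies evaluated with TracSim, and the per-node power figures, frequency levels, and names below are invented for illustration rather than taken from the LANL measurements.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    nodes: int


# Hypothetical per-node power draw in watts at each CPU frequency in GHz.
NODE_POWER_W = {2.4: 300.0, 1.8: 240.0, 1.2: 190.0}


def schedule_under_cap(jobs, free_nodes, power_cap_w, node_power=NODE_POWER_W):
    """Start jobs in arrival order at the highest frequency that fits the cap.

    Returns the list of (job name, frequency) pairs that were started and the
    total power committed to them.
    """
    started = []
    used_power_w = 0.0
    for job in jobs:
        if job.nodes > free_nodes:
            continue  # not enough nodes; a fuller policy might queue or backfill
        for freq in sorted(node_power, reverse=True):
            demand_w = job.nodes * node_power[freq]
            if used_power_w + demand_w <= power_cap_w:
                started.append((job.name, freq))
                used_power_w += demand_w
                free_nodes -= job.nodes
                break
    return started, used_power_w


queue = [Job("A", 4), Job("B", 4), Job("C", 2)]
started, used = schedule_under_cap(queue, free_nodes=10, power_cap_w=2200.0)
```

With these numbers, job A runs at full frequency, job B is throttled to fit under the cap, and job C is left waiting even though nodes are free, which is the kind of trade-off a simulator lets users quantify.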
international performance computing and communications conference | 2011
Ziming Zhang; Song Fu
Power and energy consumption has become a major concern in modern data centers and cloud systems. In order to develop efficient power management mechanisms for green clouds, we need a deep understanding of the influence of system configurations on the power consumption in real cloud systems. Power profiling provides such a vehicle. Existing fine-grain profiling approaches require special hardwired connections to the pins of individual hardware devices, which is not practical for large-scale production clouds. Moreover, they cannot provide a macroscopic view of the cloud-wide power dynamics. In this paper, we present macropower, a coarse-grain power and energy profiling framework. It provides a combination of hardware and software tools that achieves power/energy profiling at server granularity. It uses direct or derived measurements to isolate and combine influences from system components in cloud power profiles. It also generates the correlations between system activities and server/cloud-wide power/energy usage. We implement a prototype of macropower and test it in a cloud testbed. The profiled data are analyzed and the impact of system configurations on the server/cloud power usage is quantified, which is valuable for autonomic and energy-efficient management of cloud resources.
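Two calculations that server-granularity profiling of this kind relies on are converting sampled power into energy and correlating a system activity metric with power draw. The sketch below shows one plausible way to do both, using trapezoidal integration and the Pearson correlation coefficient. These are assumptions about a reasonable analysis, not a description of how macropower computes its results, and the sample data are invented.

```python
from math import sqrt


def energy_joules(timestamps_s, power_w):
    """Integrate power samples (watts) over time (seconds) with the trapezoid rule."""
    return sum(
        (t1 - t0) * (p0 + p1) / 2.0
        for t0, t1, p0, p1 in zip(timestamps_s, timestamps_s[1:], power_w, power_w[1:])
    )


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical samples from one server: time, wall power, CPU utilization.
timestamps_s = [0.0, 1.0, 2.0, 3.0]
power_w = [100.0, 120.0, 140.0, 160.0]
cpu_util_percent = [10.0, 20.0, 30.0, 40.0]

total_energy_j = energy_joules(timestamps_s, power_w)
cpu_power_corr = pearson(cpu_util_percent, power_w)
```

Note that `pearson` divides by zero if either sequence is constant, so real measurements would need a guard for idle or saturated intervals.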
International Journal of Computer Theory and Engineering | 2012
Qiang Guan; Ziming Zhang; Song Fu
Modern data centers continue to grow in their scale and complexity. They are changing dynamically as well due to the addition and removal of system components, changing execution environments, frequent updates and upgrades, online repairs and more. Classical reliability theory and conventional methods rarely consider the actual state of a system and are therefore not capable of reflecting the dynamics of runtime systems and failure processes. In this paper, we present an unsupervised failure detection and prediction method using an ensemble of Bayesian models. It characterizes normal execution states of the system and detects anomalous behaviors. We implement a prototype of our failure detection and prediction mechanism and evaluate its performance on a data center test platform. Experimental results show that our proposed method can forecast failure dynamics with high accuracy.