
Zhiling Lan

Illinois Institute of Technology


Publication

Featured research published by Zhiling Lan.

cluster computing and the grid | 2006

Exploit failure prediction for adaptive fault-tolerance in cluster computing

Yawei Li; Zhiling Lan

As the scale of cluster computing grows, it is becoming hard for long-running applications to complete without facing failures on large-scale clusters. To address this issue, checkpointing/restart is widely used to provide the basic fault-tolerant functionality, yet it suffers from high overhead and its reactive characteristic. In this work, we propose FT-Pro, an adaptive fault management mechanism that optimally chooses migration, checkpointing or no action to reduce the application execution time in the presence of failures based on the failure prediction. A cost-based evaluation model is presented for dynamic decision at run-time. Using the actual failure log from a production cluster at NCSA, we demonstrate that even with modest failure prediction accuracy, FT-Pro outperforms the traditional checkpointing/restart strategy by 13%-30% in terms of reducing the application execution time despite failures, which is a significant performance improvement for long-running applications.
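The run-time choice among migration, checkpointing, and no action can be illustrated with a small expected-cost comparison. This is only a sketch under simplifying assumptions, not the cost model from the paper: the function name `choose_action`, its parameters, and the three cost formulas are all hypothetical, and migration is assumed to avoid the predicted failure entirely.

```python
def choose_action(p_fail, interval, ckpt_cost, migrate_cost, restart_cost, work_at_risk):
    """Pick the action with the lowest expected time cost for the next interval.

    All parameters are hypothetical inputs, in seconds except p_fail:
      p_fail        predicted probability of a failure during the next interval
      interval      length of the next decision interval
      ckpt_cost     overhead of taking a checkpoint now
      migrate_cost  overhead of migrating away from the suspect node
      restart_cost  time to restart after a failure
      work_at_risk  work done since the last checkpoint
    """
    # No action: a failure loses the unsaved work and the interval, plus a restart.
    cost_none = p_fail * (work_at_risk + interval + restart_cost)
    # Checkpoint now: pay the overhead; a failure then loses only the interval.
    cost_checkpoint = ckpt_cost + p_fail * (interval + restart_cost)
    # Migrate: pay the overhead; assumed (optimistically) to avoid the failure.
    cost_migrate = migrate_cost
    costs = {"none": cost_none, "checkpoint": cost_checkpoint, "migrate": cost_migrate}
    return min(costs, key=costs.get)


if __name__ == "__main__":
    print(choose_action(0.01, 600, 60, 120, 100, 600))
    print(choose_action(0.9, 600, 60, 120, 100, 3000))
```

With a low predicted failure probability the overhead of either protective action outweighs the expected loss, so the sketch picks no action; as the probability rises, the cheaper protective action wins.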

IEEE Transactions on Parallel and Distributed Systems | 2010

Toward Automated Anomaly Identification in Large-Scale Systems

Zhiling Lan; Ziming Zheng; Yawei Li

When a system fails to function properly, health-related data are collected for troubleshooting. However, it is challenging to effectively identify anomalies from the voluminous amount of noisy, high-dimensional data. The traditional manual approach is time-consuming, error-prone, and even worse, not scalable. In this paper, we present an automated mechanism for node-level anomaly identification in large-scale systems. A set of techniques is presented to automatically analyze collected data: data transformation to construct a uniform data format for data analysis, feature extraction to reduce data size, and unsupervised learning to detect the nodes acting differently from others. Moreover, we compare two techniques, principal component analysis (PCA) and independent component analysis (ICA), for feature extraction. We evaluate our prototype implementation by injecting a variety of faults into a production system at NCSA. The results show that our mechanism, in particular, the one using ICA-based feature extraction, can effectively identify faulty nodes with high accuracy and low computation overhead.
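The final step, detecting nodes that behave differently from the others, can be sketched with a simple robust outlier score. This is an assumption-laden stand-in rather than the paper's method: it skips the PCA/ICA feature extraction step entirely and replaces the unsupervised learning step with a per-feature median and median absolute deviation (MAD) test. The function name, the data layout, and the threshold are hypothetical.

```python
import statistics


def find_outlier_nodes(features, threshold=3.0):
    """Return nodes whose metrics deviate strongly from the majority.

    features: dict mapping node name -> list of metric values (same length
    for every node). This layout is a hypothetical uniform data format.
    """
    nodes = list(features)
    dims = len(features[nodes[0]])
    centers, scales = [], []
    for d in range(dims):
        column = [features[n][d] for n in nodes]
        med = statistics.median(column)
        mad = statistics.median([abs(v - med) for v in column])
        centers.append(med)
        scales.append(mad or 1.0)  # avoid division by zero for constant features
    outliers = []
    for n in nodes:
        # Score a node by its worst deviation across all features.
        score = max(abs(features[n][d] - centers[d]) / scales[d] for d in range(dims))
        if score > threshold:
            outliers.append(n)
    return outliers


if __name__ == "__main__":
    sample = {
        "n1": [0.50, 100.0],
        "n2": [0.52, 102.0],
        "n3": [0.49, 99.0],
        "n4": [0.51, 101.0],
        "n5": [0.95, 400.0],
    }
    print(find_outlier_nodes(sample))
```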

dependable systems and networks | 2009

System log pre-processing to improve failure prediction

Ziming Zheng; Zhiling Lan; Byung-Hoon Park; Al Geist

Log preprocessing, a process applied on the raw log before applying a predictive method, is of paramount importance to failure prediction and diagnosis. While existing filtering methods have demonstrated good compression rate, they fail to preserve important failure patterns that are crucial for failure analysis. To address the problem, in this paper we present a log preprocessing method. It consists of three integrated steps: (1) event categorization to uniformly classify system events and identify fatal events; (2) event filtering to remove temporal and spatial redundant records, while also preserving necessary failure patterns for failure analysis; (3) causality-related filtering to combine correlated events for filtering through apriori association rule mining. We demonstrate the effectiveness of our preprocessing method by using real failure logs collected from the Cray XT4 at ORNL and the Blue Gene/L system at SDSC. Experiments show that our method can preserve more failure patterns for failure analysis, thereby improving failure prediction by up to 174%.
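To make the event filtering step concrete, the sketch below shows a plain threshold-based filter that drops temporally and spatially redundant records. It is a baseline illustration only: the pattern-preserving logic and the causality-related filtering described above are not shown, and the record layout `(timestamp, location, event_type)`, the function name, and the 300-second window are assumptions.

```python
def filter_redundant_events(events, window=300):
    """Drop records that repeat an event type within `window` seconds.

    events: list of (timestamp, location, event_type) tuples sorted by
    timestamp (a hypothetical layout). A repeat from the same location is
    temporal redundancy; a repeat from another location is spatial
    redundancy. Both are removed here.
    """
    last_seen = {}  # event_type -> timestamp of the most recent record
    kept = []
    for ts, loc, etype in events:
        prev = last_seen.get(etype)
        last_seen[etype] = ts  # sliding window: compare with the previous record
        if prev is not None and ts - prev <= window:
            continue
        kept.append((ts, loc, etype))
    return kept


if __name__ == "__main__":
    log = [
        (0, "n1", "ECC"),
        (10, "n1", "ECC"),    # temporal repeat
        (20, "n2", "ECC"),    # spatial repeat
        (1000, "n1", "ECC"),  # outside the window, kept
        (1005, "n3", "LINK"),
    ]
    print(filter_redundant_events(log))
```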

international parallel and distributed processing symposium | 2010

Analyzing and adjusting user runtime estimates to improve job scheduling on the Blue Gene/P

Wei Tang; Narayan Desai; Daniel Buettner; Zhiling Lan

Backfilling and short-job-first are widely acknowledged enhancements to the simple but popular first-come, first-served job scheduling policy. However, both enhancements depend on user-provided estimates of job runtime, which research has repeatedly shown to be inaccurate. We have investigated the effects of this inaccuracy on backfilling and different queue prioritization policies, determining which part of the scheduling policy is most sensitive. Using these results, we have designed and implemented several estimation-adjusting schemes based on historical data. We have evaluated these schemes using workload traces from the Blue Gene/P system at Argonne National Laboratory. Our experimental results demonstrate that dynamically adjusting job runtime estimates can improve job scheduling performance by up to 20%.
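One way to picture an estimation-adjusting scheme based on historical data is to scale each new request by the user's past ratio of actual to requested runtime. The sketch below assumes exactly that; the class name, the per-user averaging, and the cap at 1.0 are hypothetical choices, not the schemes evaluated in the paper.

```python
from collections import defaultdict


class EstimateAdjuster:
    """Scale user runtime estimates by each user's historical accuracy."""

    def __init__(self):
        self._ratios = defaultdict(list)  # user -> list of actual/requested ratios

    def record(self, user, requested, actual):
        """Store the outcome of a finished job (times in seconds)."""
        # Jobs are killed at their requested limit, so cap the ratio at 1.0.
        self._ratios[user].append(min(actual / requested, 1.0))

    def adjust(self, user, requested):
        """Return an adjusted estimate; fall back to the request if no history."""
        history = self._ratios.get(user)
        if not history:
            return requested
        return requested * sum(history) / len(history)


if __name__ == "__main__":
    adjuster = EstimateAdjuster()
    adjuster.record("alice", 3600, 900)
    adjuster.record("alice", 3600, 2700)
    print(adjuster.adjust("alice", 7200))
    print(adjuster.adjust("bob", 100))
```

The scheduler would use the adjusted value only for ordering and backfilling decisions, while the user's original request remains the kill limit.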

Journal of Parallel and Distributed Computing | 2002

A novel dynamic load balancing scheme for parallel systems

Zhiling Lan; Valerie E. Taylor; Greg L. Bryan

Adaptive mesh refinement (AMR) is a type of multiscale algorithm that achieves high resolution in localized regions of dynamic, multidimensional numerical simulations. One of the key issues related to AMR is dynamic load balancing (DLB), which allows large-scale adaptive applications to run efficiently on parallel systems. In this paper, we present an efficient DLB scheme for structured AMR (SAMR) applications. This scheme interleaves a grid-splitting technique with direct grid movements (e.g., direct movement from an overloaded processor to an underloaded processor), for which the objective is to efficiently redistribute workload among all the processors so as to reduce the parallel execution time. The potential benefits of our DLB scheme are examined by incorporating our techniques into a SAMR cosmology application, the ENZO code. Experiments show that by using our scheme, the parallel execution time can be reduced by up to 57% and the quality of load balancing can be improved by a factor of six, as compared to the original DLB scheme used in ENZO.
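The idea of interleaving direct grid movement with grid splitting can be sketched as a greedy loop: move a grid from the most loaded processor to the least loaded one when a suitable grid exists, and split a grid when none fits. This is a toy illustration under stated assumptions, not the scheme from the paper or from ENZO: grids are reduced to scalar workloads, and the function name, `tolerance`, and the selection rule are hypothetical.

```python
def rebalance(loads, tolerance=1.0, max_steps=100):
    """Greedily even out workload across processors.

    loads: dict mapping processor -> list of positive grid workloads
    (a hypothetical simplification of SAMR grids). Mutated and returned.
    """
    for _ in range(max_steps):
        totals = {p: sum(grids) for p, grids in loads.items()}
        src = max(totals, key=totals.get)  # most overloaded processor
        dst = min(totals, key=totals.get)  # most underloaded processor
        gap = totals[src] - totals[dst]
        if gap <= tolerance:
            break
        # Moving a grid helps only if it is smaller than the gap.
        movable = [w for w in loads[src] if w < gap]
        if movable:
            # Direct movement: pick the grid closest to half the gap.
            piece = min(movable, key=lambda w: abs(w - gap / 2))
            loads[src].remove(piece)
        else:
            # Grid splitting: carve half the gap off the largest grid.
            biggest = max(loads[src])
            loads[src].remove(biggest)
            piece = gap / 2
            loads[src].append(biggest - piece)
        loads[dst].append(piece)
    return loads


if __name__ == "__main__":
    print(rebalance({"p0": [8.0, 4.0, 4.0], "p1": [2.0], "p2": [6.0]}))
    print(rebalance({"a": [10.0], "b": []}))
```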

international conference on cluster computing | 2009

Reliability-aware scalability models for high performance computing

Ziming Zheng; Zhiling Lan

Scalability models are powerful analytical tools for evaluating and predicting the performance of parallel applications. Unfortunately, existing scalability models do not quantify failure impact and therefore cannot accurately account for application performance in the presence of failures. In this study, we extend two well-known models, namely Amdahl's law and Gustafson's law, by considering the impact of failures and the effect of fault tolerance techniques on applications. The derived reliability-aware models can be used to predict application scalability in failure-present environments and evaluate fault tolerance techniques. Trace-based simulations via real failure logs demonstrate that the newly developed models provide a better understanding of application performance and scalability in the presence of failures.
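For reference, the two classic failure-free laws that the study extends can be written in a few lines. The sketch below shows only the standard forms; the reliability-aware extensions derived in the paper are not reproduced here, and the function and parameter names are chosen for illustration.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Amdahl's law: speedup for a fixed problem size."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


def gustafson_speedup(parallel_fraction, processors):
    """Gustafson's law: scaled speedup when the problem grows with the machine."""
    serial_fraction = 1.0 - parallel_fraction
    return serial_fraction + parallel_fraction * processors


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, amdahl_speedup(0.9, n), gustafson_speedup(0.9, n))
```

Neither formula contains a failure term, which is the gap the abstract points to: time lost to failures and to fault tolerance overhead does not appear anywhere in the speedup.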

international conference on parallel processing | 2001

Dynamic load balancing for structured adaptive mesh refinement applications

Zhiling Lan; Valerie E. Taylor; Greg L. Bryan

Adaptive Mesh Refinement (AMR) is a type of multiscale algorithm that achieves high resolution in localized regions of dynamic, multidimensional numerical simulations. One of the key issues related to AMR is dynamic load balancing (DLB), which allows large-scale adaptive applications to run efficiently on parallel systems. In this paper, we present an efficient DLB scheme for structured AMR (SAMR) applications. Our DLB scheme combines a grid-splitting technique with direct grid movements (e.g., direct movement from an overloaded processor to an underloaded processor), for which the objective is to efficiently redistribute workload among all the processors so as to reduce the parallel execution time. The potential benefits of our DLB scheme are examined by incorporating our techniques into a parallel, cosmological application that uses SAMR techniques. Experiments show that by using our scheme, the parallel execution time can be reduced by up to 47% and the quality of load-balancing can be improved by a factor of four.

high performance distributed computing | 2015

Lightweight Silent Data Corruption Detection Based on Runtime Data Analysis for HPC Applications

Eduardo Berrocal; Leonardo Bautista-Gomez; Sheng Di; Zhiling Lan; Franck Cappello

Next-generation supercomputers are expected to have more components and, at the same time, consume several times less energy per operation. Consequently, the number of soft errors is expected to increase dramatically in the coming years. In this respect, techniques that leverage certain properties of iterative HPC applications (such as the smoothness of the evolution of a particular dataset) can be used to detect silent errors at the application level. In this paper, we present a pointwise detection model with two phases: one involving the prediction of the next expected value in the time series for each data point, and another determining a range (i.e., normal value interval) surrounding the predicted next-step value. We show that dataset correlation can be used to detect corruptions indirectly and limit the size of the data set to monitor, taking advantage of the underlying physics of the simulation. Our results show that, using our techniques, we can detect a large number of corruptions (i.e., above 90% in some cases) with 84% memory overhead, and 13.75% extra computation time.
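The two-phase pointwise model (predict the next value, then accept only values inside a range around the prediction) can be sketched for a single data point as follows. The specific choices here are assumptions made for illustration, not the predictors or range rules from the paper: linear extrapolation stands in for the prediction phase, and the range radius is derived from the most recent step size. The names `is_suspicious`, `slack`, and `floor` are hypothetical.

```python
def is_suspicious(history, observed, slack=2.0, floor=1e-6):
    """Flag `observed` if it falls outside the normal interval for one data point.

    history: the point's values at previous time steps (at least two).
    """
    prev, last = history[-2], history[-1]
    # Phase 1: predict the next value by linear extrapolation.
    predicted = last + (last - prev)
    # Phase 2: build a normal-value interval around the prediction,
    # sized by the most recent change (with a small floor for flat data).
    radius = slack * max(abs(last - prev), floor)
    return abs(observed - predicted) > radius


if __name__ == "__main__":
    print(is_suspicious([1.0, 1.1], 1.2))  # smooth evolution
    print(is_suspicious([1.0, 1.1], 5.0))  # sudden jump, e.g. a flipped bit
```

Such a check relies on the smoothness property mentioned above: a dataset that evolves gradually between iterations makes a large jump at one point stand out.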

IEEE Transactions on Computers | 2008

Adaptive Fault Management of Parallel Applications for High-Performance Computing

Zhiling Lan; Yawei Li

As the scale of high-performance computing (HPC) continues to grow, failure resilience of parallel applications becomes crucial. In this paper, we present FT-Pro, an adaptive fault management approach that combines proactive migration with reactive checkpointing. It aims to enable parallel applications to avoid anticipated failures via preventive migration and, in the case of unforeseeable failures, to minimize their impact through selective checkpointing. An adaptation manager is designed to make runtime decisions in response to failure prediction. Extensive experiments, by means of stochastic modeling and case studies with real applications, indicate that FT-Pro outperforms periodic checkpointing, in terms of reducing application completion times and improving resource utilization, by up to 43 percent.

ieee international conference on high performance computing data and analytics | 2013

Integrating dynamic pricing of electricity into energy aware scheduling for HPC systems

Xu Yang; Zhou Zhou; Sean Wallace; Zhiling Lan; Wei Tang; Susan Coghlan; Michael E. Papka

The research literature to date has mainly aimed at reducing energy consumption in HPC environments. In this paper, we propose a job power-aware scheduling mechanism to reduce an HPC system's electricity bill without degrading system utilization. The novelty of our job scheduling mechanism is its ability to take the variation of electricity price into consideration as a means to make better decisions about the timing of scheduling jobs with diverse power profiles. We verified the effectiveness of our design by conducting trace-based experiments on an IBM Blue Gene/P and a cluster system as well as a case study on Argonne's 48-rack IBM Blue Gene/Q system. Our preliminary results show that our power-aware algorithm can reduce the electricity bill of HPC systems by as much as 23%.
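The core intuition, letting the electricity price influence which waiting job starts next, can be sketched with a two-tier price model. This is a deliberately minimal illustration and not the algorithm from the paper: the on-peak/off-peak flag, the job dictionary layout, and the function name are assumptions, and a real scheduler would also have to account for job size and fairness.

```python
def pick_next_job(window, on_peak):
    """Choose the next job to start from a window of waiting jobs.

    window: list of dicts with an 'id' and an estimated 'power' draw in
    watts (hypothetical job profile). on_peak: True when electricity is
    expensive. A job is always chosen if any is waiting, so nodes are
    never left idle just to save power.
    """
    if not window:
        return None
    # Expensive electricity: prefer the least power-hungry job.
    # Cheap electricity: run the most power-hungry job now.
    chooser = min if on_peak else max
    return chooser(window, key=lambda job: job["power"])


if __name__ == "__main__":
    waiting = [
        {"id": "a", "power": 40_000},
        {"id": "b", "power": 65_000},
        {"id": "c", "power": 52_000},
    ]
    print(pick_next_job(waiting, on_peak=True)["id"])
    print(pick_next_job(waiting, on_peak=False)["id"])
```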
