Edward Chuah
University of Texas at Austin
Publications
Featured research published by Edward Chuah.
Symposium on Reliable Distributed Systems | 2013
Edward Chuah; Arshad Jhumka; Sai Narasimhamurthy; John Hammond; James C. Browne; Bill Barth
Bursts of abnormally high use of resources are thought to be an indirect cause of failures in large cluster systems, but little work has systematically investigated the role of high resource usage in system failures, largely due to the lack of a comprehensive resource monitoring tool that resolves resource use by job and node. The recently developed TACC_Stats resource use monitor provides the required resource use data. This paper presents the ANCOR diagnostics system, which applies TACC_Stats data to identify resource use anomalies and applies log analysis to link those anomalies with system failures. Applying ANCOR to first identify multiple sources of resource anomalies on the Ranger supercomputer, then correlating them with failures recorded in the message logs and diagnosing the cause of those failures, has identified four new causes of compute node soft lockups. ANCOR can be adapted to any system that uses a resource use monitor which resolves resource use by job.
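As a rough illustration of the kind of linking ANCOR performs, the sketch below flags per-node resource-use anomalies with a simple deviation test and pairs each one with failures logged shortly afterwards. It is a minimal sketch only: the function names, the three-sigma threshold and the 30-minute window are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch: link resource-use anomalies to failures logged soon afterwards.
# flag_anomalies, link_to_failures, the 3-sigma threshold and the 30-minute window
# are illustrative assumptions, not part of ANCOR itself.
from statistics import mean, stdev
from datetime import timedelta

def flag_anomalies(samples, threshold=3.0):
    """samples: list of (timestamp, value) for one node and one resource counter.
    Return timestamps whose value deviates more than `threshold` sigmas from the mean."""
    values = [v for _, v in samples]
    mu, sigma = mean(values), stdev(values)
    return [t for t, v in samples if sigma > 0 and abs(v - mu) / sigma > threshold]

def link_to_failures(anomalies, failures, window=timedelta(minutes=30)):
    """Pair each anomaly timestamp with failure timestamps that follow within `window`."""
    return [(a, f) for a in anomalies for f in failures if a <= f <= a + window]
```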
IEEE International Conference on High Performance Computing, Data, and Analytics | 2010
Edward Chuah; Shyh-Hao Kuo; Paul Hiew; William-Chandra Tjhi; Gary Kee Khoon Lee; John Hammond; Marek T. Michalewicz; Terence Hung; James C. Browne
System event logs are often the primary source of information for diagnosing (and predicting) the causes of failures in cluster systems. Due to interactions among the system hardware and software components, the event logs of large cluster systems comprise streams of interleaved events, and only a small fraction of the events over a small time span are relevant to the diagnosis of a given failure. Furthermore, the process of troubleshooting the causes of failures is largely manual and ad hoc. In this paper, we present a systematic methodology for reconstructing event order and establishing correlations among events which indicate the root-causes of a given failure from very large syslogs. We developed a diagnostics tool, FDiag, which extracts the log entries as structured message templates and uses statistical correlation analysis to establish probable cause and effect relationships for the fault being analyzed. We applied FDiag to analyze failures due to breakdowns in interactions between the Lustre file system and its clients on the Ranger supercomputer at the Texas Advanced Computing Center (TACC). The results are positive. FDiag is able to identify the dates and the time periods that contain the significant events which eventually led to the occurrence of compute node soft lockups.
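The sketch below shows one common way to turn raw syslog lines into structured message templates of the kind FDiag correlates: variable fields such as numbers and hexadecimal identifiers are masked so that lines describing the same event collapse into a single template. The masking rules and function names are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch: collapse raw syslog lines into structured message templates.
# The masking rules and names are illustrative, not FDiag's actual implementation.
import re
from collections import Counter

def to_template(message):
    """Mask hexadecimal identifiers and numbers so similar lines share one template."""
    msg = re.sub(r'0x[0-9a-fA-F]+', '<HEX>', message)
    msg = re.sub(r'\d+(\.\d+)*', '<NUM>', msg)
    return msg

def template_counts(log_lines):
    """Count how often each message template occurs in a batch of log lines."""
    return Counter(to_template(line) for line in log_lines)
```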
IEEE International Conference on Dependable, Autonomic and Secure Computing | 2011
Edward Chuah; Gary Kee Khoon Lee; William-Chandra Tjhi; Shyh-Hao Kuo; Terence Hung; John Hammond; Tommy Minyard; James C. Browne
A goal for the analysis of supercomputer logs is to establish causal relationships among events which reflect significant state changes in the system. Establishing these relationships is at the heart of failure diagnosis. In principle, a log analysis tool could automate many of the manual steps systems administrators must currently use to diagnose system failures. However, supercomputer logs are unstructured, incomplete and contain considerable ambiguity, so that direct discovery of causal relationships is difficult. This paper describes the second-generation FDiag log-based failure diagnostics framework, which automates the manual failure diagnosis process and determines, with high confidence, the likely cause of the failure, the components involved and the event sequences which contain the times of the causal and terminal events. FDiag extracts relevant events from the system logs, performs correlation analysis on these events, and from these correlations determines the components involved and the event sequences. The diagnostics capabilities of FDiag are validated by comparing its assessments against known instances of recurrent failures on the Ranger supercomputer at the University of Texas at Austin. We believe FDiag is the first log analyzer to demonstrate this level of diagnostics capability from the system logs of an open-source software stack incorporating Linux and the Lustre file system. FDiag will be put into production use to support failure diagnosis on Ranger in September 2011.
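One ingredient of such correlation analysis is checking whether two event templates occur together over time: templates whose per-interval counts rise and fall together are candidates for a cause-and-effect pair. The sketch below scores this with a plain Pearson correlation over time bins; the binning scheme and the function names are assumptions for illustration, not FDiag's actual algorithm.

```python
# Minimal sketch: score co-occurrence of two event templates across time bins.
# The binning scheme, the Pearson measure and all names are illustrative assumptions.
def bin_counts(timestamps, bin_edges):
    """Count events falling into each [bin_edges[i], bin_edges[i+1]) interval."""
    counts = [0] * (len(bin_edges) - 1)
    for t in timestamps:
        for i in range(len(bin_edges) - 1):
            if bin_edges[i] <= t < bin_edges[i + 1]:
                counts[i] += 1
                break
    return counts

def pearson(xs, ys):
    """Pearson correlation of two equal-length count series (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0
```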
IEEE International Conference on High Performance Computing, Data, and Analytics | 2014
Alejandro Pelaez; Andres Quiroz; James C. Browne; Edward Chuah; Manish Parashar
Ensuring high reliability of large-scale clusters is becoming more critical as the size of these machines continues to grow, since larger size increases the complexity and number of interactions between nodes and thus results in a higher failure frequency. For this reason, predicting node failures in order to prevent errors from happening in the first place has become extremely valuable. A common approach to failure prediction is to analyze traces of system events to find correlations between event types or anomalous event patterns and node failures, and to use the types or patterns identified as failure predictors at run-time. However, typical centralized solutions for failure prediction in this manner suffer from high transmission and processing overheads at very large scales. We present a solution to the problem of predicting compute node soft lockups in large-scale clusters by using a decentralized online clustering algorithm (DOC) to detect anomalies in resource usage logs, which have been shown to correlate with particular types of node failures in supercomputer clusters. We demonstrate the effectiveness of this system using the monitoring logs from the Ranger supercomputer at the Texas Advanced Computing Center. Experiments show that this approach can achieve similar accuracy to other related approaches while maintaining low RAM and bandwidth usage, with a runtime impact on running applications of less than 2%.
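A rough, single-node flavour of online clustering for anomaly detection is sketched below: a resource-usage sample that fits no existing cluster is treated as a potential anomaly. DOC itself is decentralized and coordinates cluster summaries across nodes, so the radius rule, class name and parameters here are assumptions for illustration only.

```python
# Minimal per-node sketch of online clustering used for anomaly detection.
# DOC coordinates clustering across nodes; the radius rule, class name and
# parameters below are illustrative assumptions only.
class OnlineClusterer:
    def __init__(self, radius):
        self.radius = radius
        self.clusters = []          # list of (centroid, member_count) pairs

    def observe(self, point):
        """Fold `point` into the nearest cluster; return True if it fits none (potential anomaly)."""
        for i, (centroid, count) in enumerate(self.clusters):
            dist = sum((a - b) ** 2 for a, b in zip(centroid, point)) ** 0.5
            if dist <= self.radius:
                updated = [(a * count + b) / (count + 1)
                           for a, b in zip(centroid, point)]
                self.clusters[i] = (updated, count + 1)
                return False
        self.clusters.append((list(point), 1))
        return True

# Example: feed per-sample resource vectors (e.g. CPU, memory, I/O wait) from one node.
clusterer = OnlineClusterer(radius=2.0)
is_anomaly = clusterer.observe([0.7, 0.4, 0.1])
```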
International Conference on Multimedia and Expo | 2008
Xiaorong Li; Edward Chuah; Jo Yew Tham; Kwong Huang Goh
Thanks to advances in multimedia compression and network communication technologies, scalable media streaming services have become available to provide QoS-differentiated services for heterogeneous users. However, it is still a big challenge to support consistent end-to-end quality of service (QoS) for users because of the dynamic nature of the Internet, and abrupt variability of network resources may severely affect client-perceived QoS. In this paper, we address the issue of smooth QoS adaptation for scalable streaming services. We propose an Optimal Smooth QoS Adaptation (OS-QA) strategy which allocates server resources adaptively to cope with the variability of network bandwidth and protects the service quality of different quality classes under dynamic resource constraints. We analyze the quality variation caused by resource fluctuation and propose OS-QA to minimize the average QoS variance under the resource constraints. Simulations are conducted to compare the proposed method with other QoS adaptation methods, and performance is analyzed in terms of QoS variance and PSNR. Results show that the proposed method is able to gracefully adapt the QoS and protect client-perceived QoS by minimizing the QoS variance under dynamic network resource constraints.
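The core trade-off described above, tracking the available bandwidth quickly versus keeping per-class quality fluctuation small, can be illustrated with a tiny rate-smoothing loop. The proportional targets, the 10% per-step cap and the function name below are assumptions for illustration, not the OS-QA algorithm itself.

```python
# Minimal sketch of smooth QoS adaptation: re-scale per-class rates towards the
# current bandwidth but cap each per-step change, trading tracking speed for low
# QoS variance. The targets and the max_step cap are illustrative assumptions only.
def smooth_allocate(prev_rates, available_bw, max_step=0.1):
    """Return new per-class rates; each rate moves at most max_step (10%) per step."""
    total = sum(prev_rates)
    targets = [available_bw * r / total for r in prev_rates]
    return [min(max(target, prev * (1 - max_step)), prev * (1 + max_step))
            for prev, target in zip(prev_rates, targets)]

# Example: bandwidth drops sharply, but allocated rates shrink gradually,
# temporarily overshooting the budget while they converge.
rates = [4.0, 2.0, 1.0]
for bw in (7.0, 5.0, 5.0, 5.0):
    rates = smooth_allocate(rates, bw)
    print([round(r, 2) for r in rates])
```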
International Parallel and Distributed Processing Symposium | 2015
Nentawe Gurumdimma; Arshad Jhumka; Maria Liakata; Edward Chuah; James C. Browne
The ability to automatically detect faults or fault patterns to enhance system reliability is important for system administrators in reducing system failures. To achieve this objective, the message logs from cluster systems are typically augmented with failure information, i.e., the raw log data is labelled. However, tagging or labelling raw log data is very costly. In this paper, our objective is to detect failure patterns in the message logs using unlabelled data. To achieve this aim, we propose a methodology in which a pre-processing step first removes redundant data. A clustering algorithm is then executed on the resulting logs, and we further develop an unsupervised algorithm that detects failure patterns in the clustered log by harnessing the characteristics of these sequences. We evaluated our methodology on large production data, and the results show that, on average, an f-measure of 78% can be obtained without data labels. The implication of our methodology is that a system administrator with little knowledge of the system can detect failure runs with reasonably high accuracy.
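The two pre-processing ideas described above, removing redundant entries and grouping the remaining messages into sequences, can be sketched as follows; the gap-based split and the function names are assumptions for illustration, not the paper's actual method.

```python
# Minimal sketch of the pre-processing described above: drop consecutive duplicate
# messages, then split a node's event stream into sequences separated by long gaps.
# The gap-based split and the function names are illustrative assumptions.
from itertools import groupby

def dedupe(events):
    """events: list of (timestamp, message). Keep the first of each run of duplicate messages."""
    return [next(group) for _, group in groupby(events, key=lambda e: e[1])]

def sessionize(events, gap):
    """Split (timestamp, message) events into sequences whenever the time gap exceeds `gap`."""
    sequences, current = [], []
    for t, msg in events:
        if current and t - current[-1][0] > gap:
            sequences.append(current)
            current = []
        current.append((t, msg))
    if current:
        sequences.append(current)
    return sequences
```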
Symposium on Reliable Distributed Systems | 2016
Nentawe Gurumdimma; Arshad Jhumka; Maria Liakata; Edward Chuah; James C. Browne
The use of console logs for error detection in large scale distributed systems has proven to be useful to system administrators. However, such logs are typically redundant and incomplete, making accurate detection very difficult. In an attempt to increase this accuracy, we complement these incomplete console logs with resource usage data, which captures the resource utilisation of every job in the system. We then develop a novel error detection methodology, the CRUDE approach, that makes use of both the resource usage data and console logs. We thus make the following specific technical contributions: we develop (i) a clustering algorithm to group nodes with similar behaviour, (ii) an anomaly detection algorithm to identify jobs with anomalous resource usage, (iii) an algorithm that links jobs with anomalous resource usage with erroneous nodes. We then evaluate our approach using console logs and resource usage data from the Ranger Supercomputer. Our results are positive: (i) our approach detects errors with a true positive rate of about 80%, and (ii) when compared with the well-known Nodeinfo error detection algorithm, our algorithm provides an average improvement of around 85% over Nodeinfo, with a best-case improvement of 250%.
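As an illustration of the third step, linking jobs with anomalous resource usage to erroneous nodes, the sketch below marks a node as suspect when it logged errors within a small slack window around an anomalous job's runtime. The data shapes, the slack window and all names are assumptions, not the CRUDE implementation.

```python
# Minimal sketch of the linking step (iii): tie jobs with anomalous resource usage to
# nodes that logged errors while those jobs ran. Data shapes, the slack window and all
# names are illustrative assumptions.
from datetime import timedelta

def link_anomalous_jobs(anomalous_jobs, job_nodes, node_errors,
                        slack=timedelta(minutes=10)):
    """anomalous_jobs: {job_id: (start, end)}; job_nodes: {job_id: [node, ...]};
    node_errors: {node: [error_timestamp, ...]}. Return {job_id: [suspect_node, ...]}."""
    links = {}
    for job_id, (start, end) in anomalous_jobs.items():
        suspects = [node for node in job_nodes.get(job_id, [])
                    if any(start - slack <= t <= end + slack
                           for t in node_errors.get(node, []))]
        if suspects:
            links[job_id] = suspects
    return links
```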
IEEE International Conference on High Performance Computing, Data, and Analytics | 2016
Edward Chuah; Arshad Jhumka; James C. Browne; Nentawe Gurumdimma; Sai Narasimhamurthy; Bill Barth
Failure diagnosis for large compute clusters using only message logs is known to be incomplete. The recent availability of resource use data provides another potentially useful source of data for failure detection and diagnosis. Early work combining message logs and resource use data for failure diagnosis has shown promising results. This paper describes the CRUMEL framework, which implements a new approach to combining rationalized message logs and resource use data for failure diagnosis. CRUMEL identifies patterns of errors and resource use and correlates these patterns by time with system failures. Application of CRUMEL to data from the Ranger supercomputer has yielded improved diagnoses over previous research. CRUMEL has: (i) shown that additional events correlated with system failures can be identified only by applying different correlation algorithms, (ii) confirmed six groups of errors, (iii) identified Lustre I/O resource use counters which are correlated with the occurrence of Lustre faults and are potential flags for online detection of failures, (iv) matched the dates of correlated error events and correlated resource use with the dates of compute node hang-ups, and (v) identified two more error groups associated with compute node hang-ups. The pre-processed data will be placed in the public domain in September 2016.
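Finding (i) above, that different correlation algorithms surface different failure-related events, can be illustrated by flagging an error pattern when either a linear or a rank-based correlation with the per-interval failure counts is strong. The sketch below is illustrative only: the 0.7 threshold, the data shapes and the names are assumptions, and it needs Python 3.10+ for statistics.correlation.

```python
# Minimal sketch: flag an error pattern if either a linear or a rank-based correlation
# with the failure-count series is strong. Threshold, data shapes and names are
# illustrative assumptions. Requires Python 3.10+ for statistics.correlation.
from statistics import correlation, StatisticsError

def ranks(xs):
    """Replace each value by its rank (0-based, ties broken by position)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

def flagged_patterns(pattern_counts, failure_counts, threshold=0.7):
    """pattern_counts: {name: per-interval counts}. Return the union of patterns
    flagged by either correlation measure."""
    flagged = set()
    failure_ranks = ranks(failure_counts)
    for name, counts in pattern_counts.items():
        try:
            linear = correlation(counts, failure_counts)
            ranked = correlation(ranks(counts), failure_ranks)
        except StatisticsError:      # a constant series has no defined correlation
            continue
        if abs(linear) >= threshold or abs(ranked) >= threshold:
            flagged.add(name)
    return flagged
```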
European Dependable Computing Conference | 2015
Edward Chuah; Arshad Jhumka; James C. Browne; Bill Barth; Sai Narasimhamurthy
Large cluster systems are composed of complex, interacting hardware and software components. Components, or the interactions between components, may fail for many different reasons, leading to the eventual failure of executing jobs. This paper investigates an open question in failure diagnosis: what are the characteristics of the errors that lead to cluster system failures? To this end, the paper presents a systematic process for identifying and characterizing the root-causes of failures. We applied an extended version of the FDiagV3 diagnostics toolkit to the log files of the Ranger and Lonestar supercomputers. Our results show that: (i) failures were the result of recurrent issues and errors, (ii) a small set of nodes is associated with these issues and errors, and (iii) Ranger and Lonestar display similar sets of problems. FDiagV3 will be put in the public domain to support failure diagnosis for large cluster systems in May 2015.
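Finding (ii), that a small set of nodes accounts for most of the recurrent issues, suggests a simple characterization step: rank nodes by error count and keep the smallest prefix that covers most of the errors. The sketch below is illustrative only; the 80% coverage figure and the function name are assumptions, not part of FDiagV3.

```python
# Minimal sketch of a characterization step consistent with finding (ii): find the
# smallest set of nodes accounting for a given share of all recurrent errors.
# The 80% coverage figure and the function name are illustrative assumptions.
from collections import Counter

def hot_nodes(error_events, coverage=0.8):
    """error_events: iterable of (node, error_template). Return the nodes that,
    taken in descending order of error count, cover `coverage` of all errors."""
    per_node = Counter(node for node, _ in error_events)
    total = sum(per_node.values())
    picked, covered = [], 0
    for node, count in per_node.most_common():
        picked.append(node)
        covered += count
        if covered >= coverage * total:
            break
    return picked
```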
Trust, Security and Privacy in Computing and Communications | 2015
Nentawe Gurumdimma; Arshad Jhumka; Maria Liakata; Edward Chuah; James C. Browne