IEEE Transactions on Cloud Computing | 2019

A Markov Random Field Based Approach for Analyzing Supercomputer System Logs


Abstract


High-performance computing systems composed of hundreds or thousands of computational nodes can generate a high volume of system log entries at high velocity. Analyzing these logs soon after they are generated is a significant challenge because of the complexity of log messages, the speed at which they are produced, and the lack of a method to quickly map or categorize messages into meaningful sets. As a result, it is not possible to comprehensively glean timely information from logs about the overall system or the health of individual nodes. In this paper, we address this problem by developing a novel approach to system log analysis based on a Markov random field (MRF) that can quickly categorize system log messages into multiple categories using representative training examples provided by a user. We present a theoretical model of our approach, followed by an extensive evaluation of the accuracy and performance of our implementation of the model. We found that our MRF-based approach can quickly categorize system log messages with a high degree of accuracy.

Volume 7
Pages 611-624
DOI 10.1109/TCC.2017.2678473
Language English
Journal IEEE Transactions on Cloud Computing
