Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Thomas Reidemeister is active.

Publication


Featured research published by Thomas Reidemeister.


international conference on autonomic computing | 2009

System monitoring with metric-correlation models: problems and solutions

Miao Jiang; Mohammad Ahmad Munawar; Thomas Reidemeister; Paul Ward

Correlations among management metrics in software systems allow errors to be detected and their cause localized. Prior research shows that linear models can capture many of these correlations. However, our research shows that several factors may prevent linear models from accurately describing correlations, even if the underlying relationship is linear. Two common phenomena we have observed are relationships that evolve, typically with time, and heterogeneous variance of the correlated metrics. Two-variable linear models proposed thus far fail to capture these phenomena, and thus fail to describe system dynamics correctly. Often, these phenomena are caused by a missing variable. However, searching for three-variable correlations is O(n³) for n metrics, which is costly for systems with many metrics. In this paper we address the above challenges by improving on two-variable Ordinary Least Squares regression models. We validate our models using a realistic Java-Enterprise-Edition application. Using fault-injection experiments we show that our improved models capture system behavior accurately. We detect errors within 8 sample periods on average from the injection of the fault, which is less than half the time required by the current linear-model approach.
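
To make the baseline concrete, here is a minimal Python sketch of the two-variable linear-correlation idea this paper improves on: fit one metric against another by ordinary least squares on fault-free data, then flag samples whose residual is large. The metric names, synthetic data, and 3-sigma threshold are illustrative assumptions, not the paper's.

```python
# Minimal sketch of a two-variable metric-correlation monitor (the
# baseline this paper improves on, not its improved estimator).
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical fault-free training data: request rate vs. CPU time,
# linearly related with some noise.
x_train = rng.uniform(100, 1000, size=500)             # requests/s
y_train = 0.8 * x_train + 5 + rng.normal(0, 10, 500)   # CPU ms/s

# Fit y = a*x + b by ordinary least squares.
a, b = np.polyfit(x_train, y_train, deg=1)
sigma = np.std(y_train - (a * x_train + b))

def residual_alarm(x: float, y: float, k: float = 3.0) -> bool:
    """Flag an error when an observed pair deviates more than k
    standard deviations from the learned linear correlation."""
    return abs(y - (a * x + b)) > k * sigma

print(residual_alarm(500.0, 0.8 * 500 + 5))  # False: consistent pair
print(residual_alarm(500.0, 900.0))          # True: correlation broken
```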


dependable systems and networks | 2009

Automatic fault detection and diagnosis in complex software systems by information-theoretic monitoring

Miao Jiang; Mohammad Ahmad Munawar; Thomas Reidemeister; Paul Ward

Management metrics of complex software systems exhibit stable correlations which can enable fault detection and diagnosis. Current approaches use specific analytic forms, typically linear, for modeling correlations. In this paper we use Normalized Mutual Information as a similarity measure to identify clusters of correlated metrics, without knowing the specific form. We show how we can apply the Wilcoxon Rank-Sum test to identify anomalous behaviour. We present two diagnosis algorithms to locate faulty components: RatioScore, based on the Jaccard Coefficient, and SigScore, which incorporates knowledge of component dependencies. We evaluate our mechanisms in the context of a complex enterprise application. Through fault-injection experiments, we show that we can detect 17 out of 22 faults without any false positives. We diagnose the faulty component in the top five anomaly scores 7 times out of 17 using SigScore, which is 40% better than when system structure is ignored.
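
A hedged sketch of the two statistical building blocks named above, using off-the-shelf SciPy/scikit-learn routines: Normalized Mutual Information between discretized metric streams (dependence without an assumed form) and a Wilcoxon rank-sum comparison of two observation windows. The quantile binning and the 0.01 significance level are assumptions, not the paper's settings.

```python
import numpy as np
from scipy.stats import ranksums
from sklearn.metrics import normalized_mutual_info_score

def discretize(x: np.ndarray, bins: int = 10) -> np.ndarray:
    """Map a continuous metric onto quantile-bin labels so NMI applies."""
    edges = np.quantile(x, np.linspace(0, 1, bins + 1)[1:-1])
    return np.digitize(x, edges)

def metric_nmi(x: np.ndarray, y: np.ndarray) -> float:
    return normalized_mutual_info_score(discretize(x), discretize(y))

def window_anomalous(reference: np.ndarray, current: np.ndarray,
                     alpha: float = 0.01) -> bool:
    """Rank-sum test: did the metric's distribution shift?"""
    return ranksums(reference, current).pvalue < alpha

rng = np.random.default_rng(1)
x = rng.normal(size=2000)
print(metric_nmi(x, x ** 2))                  # high: nonlinear dependence
print(metric_nmi(x, rng.normal(size=2000)))   # near 0: independent
print(window_anomalous(x[:1000], x[1000:] + 2.0))  # True: shifted window
```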


IEEE Transactions on Dependable and Secure Computing | 2011

Efficient Fault Detection and Diagnosis in Complex Software Systems with Information-Theoretic Monitoring

Miao Jiang; Mohammad Ahmad Munawar; Thomas Reidemeister; Paul Ward

Management metrics of complex software systems exhibit stable correlations which can enable fault detection and diagnosis. Current approaches use specific analytic forms, typically linear, for modeling correlations. In practice, more complex nonlinear relationships exist between metrics. Moreover, most intermetric correlations form clusters rather than simple pairwise correlations. These clusters provide additional information and offer the possibility for optimization. In this paper, we address these issues by using Normalized Mutual Information (NMI) as a similarity measure to identify clusters of correlated metrics, without assuming any specific form for the metric relationships. We show how to apply the Wilcoxon Rank-Sum test on the entropy measures to detect errors in the system. We also present three diagnosis algorithms to locate faulty components: RatioScore, based on the Jaccard coefficient, SigScore, which incorporates knowledge of component dependencies, and BayesianScore, which uses Bayesian inference to assign a fault probability to each component. We evaluate our approach in the context of a complex enterprise application, and show that 1) stable, nonlinear correlations exist and can be captured with our approach; 2) we can detect a large fraction of faults with a low false positive rate (we detect up to 18 of the 22 faults we injected); and 3) we improve the diagnosis with our new diagnosis algorithms.
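
The RatioScore idea can be illustrated in a few lines: rank each component by the Jaccard overlap between the metrics it owns and the metrics currently flagged anomalous. This is a sketch of the scoring principle only; the paper's exact formulation, and the component and metric names below, are assumptions.

```python
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def rank_components(anomalous: set, owned_by: dict) -> list:
    """Return (component, score) pairs, most suspicious first."""
    scores = {c: jaccard(anomalous, m) for c, m in owned_by.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical component-to-metric ownership map.
owned_by = {
    "web-tier": {"http.latency", "http.errors"},
    "app-tier": {"jvm.heap", "servlet.time", "http.latency"},
    "db-tier":  {"db.locks", "db.reads"},
}
anomalous = {"jvm.heap", "servlet.time", "http.latency"}
print(rank_components(anomalous, owned_by))  # app-tier ranks first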


conference of the centre for advanced studies on collaborative research | 2009

Diagnosis of recurrent faults using log files

Thomas Reidemeister; Mohammad Ahmad Munawar; Miao Jiang; Paul Ward

Enterprise software systems (ESS) are becoming larger and increasingly complex. Failure in business-critical systems is expensive, leading to consequences such as loss of critical data, loss of sales, customer dissatisfaction, and even lawsuits. Therefore, detecting failures and diagnosing their root cause in a timely manner is essential. Many studies suggest that a large fraction of failures encountered in practice are recurrent (i.e., they have been seen before). Fast and accurate detection of these failures can accelerate problem determination, and thereby improve system reliability. To this end, we explore machine learning techniques, including the Naïve Bayes classifier, partially-supervised learning, and decision trees (using C4.5), to automatically recognize symptoms of recurrent faults and to derive detection rules from samples of log data. This work focuses on log files, since they are readily available and do not put any additional computational burden on the component generating the data. The methods explored in this work can aid the development of tools to assist support personnel in problem-determination tasks. Instead of requiring operators to manually define patterns for identifying recurrent problems, such tools can be trained using prior solved and unsolved cases from existing support databases.
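
As a toy illustration of the classification step (not the paper's pipeline), each failure episode can be represented as a vector of log-message-type counts, with a decision tree trained on labelled episodes; the message types, counts, and fault labels below are invented.

```python
from sklearn.tree import DecisionTreeClassifier

# Rows: episodes; columns: counts of three hypothetical message types
# ("conn refused", "heap warning", "timeout").
X = [
    [12, 0, 1],   # many "conn refused" -> database outage
    [10, 1, 0],
    [0, 9, 2],    # many "heap warning" -> memory leak
    [1, 8, 3],
]
y = ["db-outage", "db-outage", "mem-leak", "mem-leak"]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(clf.predict([[11, 0, 2]]))  # -> ['db-outage']
```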


international conference on performance engineering | 2013

DataMill: rigorous performance evaluation made easy

Augusto Born de Oliveira; Jean-Christophe Petkovich; Thomas Reidemeister; Sebastian Fischmeister

Empirical systems research is facing a dilemma. Minor aspects of an experimental setup can have a significant impact on its associated performance measurements and potentially invalidate conclusions drawn from them. Examples of such influences, often called hidden factors, include binary link order, process environment size, compiler-generated randomized symbol names, or group scheduler assignments. The growth in complexity and size of modern systems will further aggravate this dilemma, especially given the time pressure of producing results. So how can one trust any reported empirical analysis of a new idea or concept in computer science? This paper introduces DataMill, a community-based, easy-to-use, services-oriented open benchmarking infrastructure for performance evaluation. DataMill facilitates producing robust, reliable, and reproducible results. The infrastructure incorporates the latest results on hidden factors and automates the variation of these factors. Multiple research groups already participate in DataMill. DataMill is also of interest for research on performance evaluation. The infrastructure supports quantifying the effect of hidden factors, disseminating the research results beyond mere reporting. It provides a platform for investigating interactions and composition of hidden factors.
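
One documented hidden factor, process environment size, can be varied mechanically across benchmark runs; this is the kind of variation DataMill automates, though the sketch below is not DataMill's API. "./bench" is a placeholder executable assumed to print a single timing value.

```python
import os
import subprocess

def run_with_env_padding(cmd: list, pad_bytes: int) -> float:
    """Run the benchmark with the environment grown by pad_bytes."""
    env = dict(os.environ, PADDING="x" * pad_bytes)
    out = subprocess.run(cmd, env=env, capture_output=True, text=True)
    return float(out.stdout.strip())  # assumes a single timing on stdout

for pad in (0, 1024, 4096):  # sweep the hidden factor
    times = [run_with_env_padding(["./bench"], pad) for _ in range(5)]
    print(pad, sum(times) / len(times))
```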


conference of the centre for advanced studies on collaborative research | 2008

Information-theoretic modeling for tracking the health of complex software systems

Miao Jiang; Mohammad Ahmad Munawar; Thomas Reidemeister; Paul A. S. Ward

Stable correlation models are effective in detecting errors in complex software systems. However, most studies assume a specific mathematical form, typically linear, for the underlying correlations. In practice, more complex non-linear relationships exist between metrics. Moreover, most inter-metric correlations form clusters rather than simple pairwise correlations. These clusters provide additional information for error detection and offer the possibility for optimization. We address these issues by adopting the Normalized Mutual Information as a similarity measure. We also employ the entropy of metrics in clusters to monitor system state. Our approach does not require learning specific correlation models, thus reducing computation overhead. We have implemented the proposed approach and show, through experiments with a multi-tier enterprise software system, that it is effective. Our evaluation shows that (i) stable non-linear correlations exist in practice; (ii) the entropy of system metrics in clusters can efficiently detect anomalies caused by faults and provide information for diagnosis; and (iii) we can detect errors which were not captured by previous linear-correlation approaches.
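
A small sketch of the entropy idea: estimate a metric's entropy over a sliding window from a histogram and alert when it drifts from the value seen during fault-free operation. The bin count, window size, and 0.5-nat drift threshold are assumptions.

```python
import numpy as np
from scipy.stats import entropy

def window_entropy(window: np.ndarray, bins: int = 20) -> float:
    counts, _ = np.histogram(window, bins=bins)
    return entropy(counts + 1e-12)  # nats; smoothing avoids log(0)

rng = np.random.default_rng(2)
healthy = window_entropy(rng.normal(0.0, 1.0, 1000))
faulty = window_entropy(np.full(1000, 3.0))  # metric stuck at a constant
print(abs(faulty - healthy) > 0.5)  # True: entropy collapsed, raise alarm
```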


high-assurance systems engineering | 2008

Detection and Diagnosis of Recurrent Faults in Software Systems by Invariant Analysis

Miao Jiang; Mohammad Ahmad Munawar; Thomas Reidemeister; Paul A. S. Ward

A correctly functioning enterprise-software system exhibits long-term, stable correlations between many of its monitoring metrics. Some of these correlations no longer hold when there is an error in the system, potentially enabling error detection and fault diagnosis. However, existing approaches are inefficient, requiring a large number of metrics to be monitored and ignoring the relative discriminative properties of different metric correlations. In enterprise-software systems, similar faults tend to reoccur. It is therefore possible to significantly improve existing correlation-analysis approaches by learning the effects of common recurrent faults on correlations. We present methods to determine the most significant correlations to track for efficient error detection, and the correlations that contribute the most to diagnosis accuracy. We apply machine learning to identify the relevant correlations, removing the need for manually configured correlation thresholds, as used in the prior approaches. We validate our work on a multi-tier enterprise-software system. We are able to detect and correctly diagnose 8 of 10 injected faults to within three possible causes, and to within two in 7 out of 8 cases. This compares favourably with the existing approaches whose diagnosis accuracy is 3 out of 10 to within 3 possible causes. We achieve a precision of at least 95%.
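
The correlation-selection step can be approximated with standard tooling: fit a classifier on per-correlation residual features labelled by injected fault, then keep the correlations the model found most discriminative. The synthetic data and feature layout below are illustrative only, not the paper's method.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)
# 200 fault episodes x 6 tracked metric-pair residuals; pair 2 reacts
# to fault A and pair 5 to fault B, the rest are noise.
X = rng.normal(size=(200, 6))
y = np.where(rng.random(200) < 0.5, "fault-A", "fault-B")
X[y == "fault-A", 2] += 4.0
X[y == "fault-B", 5] += 4.0

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
ranked = np.argsort(clf.feature_importances_)[::-1]
print("track correlations:", ranked[:2])  # the shifted pairs rank first
```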


distributed systems operations and management | 2008

Adaptive Monitoring with Dynamic Differential Tracing-Based Diagnosis

Mohammad Ahmad Munawar; Thomas Reidemeister; Miao Jiang; Allen Ajit George; Paul A. S. Ward

Ensuring high availability, adequate performance, and proper operation of enterprise software systems requires continuous monitoring. Today, most systems operate with minimal monitoring, typically based on service-level objectives (SLOs). Detailed metric-based monitoring is often too costly to use in production, while tracing is prohibitively expensive. Configuring monitoring when problems occur is a manual process. In this paper we propose an alternative: Minimal monitoring with SLOs is used to detect errors. When an error is detected, detailed monitoring is automatically enabled to validate errors using invariant-correlation models. If validated, Application-Response-Measurement (ARM) tracing is dynamically activated on the faulty subsystem and a healthy peer to perform differential trace-data analysis and diagnosis. Based on fault-injection experiments, we show that our system is effective; it correctly detected and validated errors caused by 14 out of 15 injected faults. Differential analysis of the trace data collected for 210 seconds allowed us to top-rank the faulty component in 80% of the cases. In the remaining cases the faulty component was ranked within the top-7 out of 81 components. We also demonstrate that the overhead of our system is low; given a false positive rate of one per hour, the overhead is less than 2.5%.
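
The staged-monitoring policy reduces to a small state machine: cheap SLO checks by default, correlation-model validation on an SLO miss, and tracing only once the error is validated. The sketch below is an assumed simplification; the boolean inputs stand in for the real monitors.

```python
from enum import Enum, auto

class Mode(Enum):
    SLO_ONLY = auto()  # minimal monitoring, always on
    METRICS = auto()   # invariant-correlation models, on SLO miss
    TRACING = auto()   # ARM-style tracing of the suspect subsystem

def next_mode(mode: Mode, slo_missed: bool, error_validated: bool) -> Mode:
    if mode is Mode.SLO_ONLY and slo_missed:
        return Mode.METRICS  # escalate to detailed metric monitoring
    if mode is Mode.METRICS:
        # validated -> trace; otherwise treat as false alarm, de-escalate
        return Mode.TRACING if error_validated else Mode.SLO_ONLY
    return mode

mode = Mode.SLO_ONLY
mode = next_mode(mode, slo_missed=True, error_validated=False)  # METRICS
mode = next_mode(mode, slo_missed=True, error_validated=True)   # TRACING
print(mode)
```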


integrated network management | 2011

Mining unstructured log files for recurrent fault diagnosis

Thomas Reidemeister; Miao Jiang; Paul Ward

Enterprise software systems are large and complex, with limited support for automated root-cause analysis. Avoiding system downtime and loss of revenue dictates a fast and efficient root-cause analysis process. Operator practice and academic research have shown that about 80% of failures in such systems have recurrent causes; therefore, significant efficiency gains can be achieved by automating their identification. In this paper, we present a novel approach to modelling features of log files. This model offers a compact representation of log data that can be efficiently extracted from large amounts of monitoring data. We also use decision-tree classifiers to learn and classify symptoms of recurrent faults. This representation enables automated fault matching and, in addition, lets human investigators understand manifestations of failure easily. Our model does not require access to application source code, a specification of log messages, or deep application knowledge. We evaluate our proposal using fault-injection experiments against other proposals in the field. First, we show that the features needed for symptom definition can be extracted more efficiently than in related work. Second, we show that these features enable an accurate classification of recurrent faults using only standard machine-learning techniques. This enables us to accurately identify up to 78% of the faults in our evaluation data set.
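
One common way to realize the first step of such a feature model, sketched here under assumed masking rules (not the paper's exact model): mask variable fields such as hex identifiers, paths, and numbers, and treat the masked template as the message type.

```python
import re
from collections import Counter

def message_type(line: str) -> str:
    line = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", line)
    line = re.sub(r"(/[\w.-]+)+", "<PATH>", line)
    line = re.sub(r"\d+", "<NUM>", line)
    return line

logs = [
    "conn to 10.0.0.5:5432 refused",
    "conn to 10.0.0.7:5432 refused",
    "GC pause 312 ms on heap 0x7f3a",
]
print(Counter(message_type(l) for l in logs))
# Counter({'conn to <NUM>.<NUM>.<NUM>.<NUM>:<NUM> refused': 2,
#          'GC pause <NUM> ms on heap <HEX>': 1})
```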


network operations and management symposium | 2010

Identifying symptoms of recurrent faults in log files of distributed information systems

Thomas Reidemeister; Mohammad Ahmad Munawar; Paul Ward

The manual process of identifying causes of failure in distributed information systems is difficult and time-consuming. The underlying reason is the large size and complexity of these systems, and the vast amount of monitoring data they generate. Despite its high cost, this manual process is necessary in order to avoid the detrimental consequences of system downtime. Several studies and operator practice suggest that a large fraction of the failures in these systems are caused by recurrent faults. Therefore, significant efficiency gains can be achieved by automating the identification of these faults. In this work we present methods, drawing from the areas of information retrieval and machine learning, to automate the task of inferring symptoms pertinent to failures caused by specific faults. In particular, we present a method to infer message types from plain-text log messages, and we leverage these types to train classifiers and extract rules that identify symptoms of recurrent faults automatically.
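
Once a symptom classifier is trained, the rule-extraction step can be done with scikit-learn's tree-rule export, as in this sketch; the feature names and training data are invented for illustration.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[5, 0], [6, 1], [0, 7], [1, 6]]  # message-type counts per episode
y = ["fault-A", "fault-A", "fault-B", "fault-B"]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
# Print human-readable detection rules learned by the tree.
print(export_text(clf, feature_names=["conn_refused", "heap_warning"]))
```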

Collaboration


Dive into Thomas Reidemeister's collaboration.

Top Co-Authors

Miao Jiang
University of Waterloo

Paul Ward
University of Huddersfield

Peiyi Chen
University of Waterloo