Felix Salfner | Researchain

Publication

Featured research published by Felix Salfner.

ACM Computing Surveys | 2010

A survey of online failure prediction methods

Felix Salfner; Maren Lenk; Miroslaw Malek

With the ever-growing complexity and dynamicity of computer systems, proactive fault management is an effective approach to enhancing availability. Online failure prediction is the key to such techniques. In contrast to classical reliability methods, online failure prediction is based on runtime monitoring and a variety of models and methods that use the current state of a system and, frequently, the past experience as well. This survey describes these methods. To capture the wide spectrum of approaches concerning this area, a taxonomy has been developed, whose different approaches are explained and major concepts are described in detail.

Symposium on Reliable Distributed Systems | 2007

Using Hidden Semi-Markov Models for Effective Online Failure Prediction

Felix Salfner; Miroslaw Malek

Proactive handling of faults requires that the risk of upcoming failures is continuously assessed. One of the promising approaches is online failure prediction, which means that the current state of the system is evaluated in order to predict the occurrence of failures in the near future. More specifically, we focus on methods that use event-driven sources such as errors. We use hidden semi-Markov models (HSMMs) for this purpose and demonstrate effectiveness based on field data of a commercial telecommunication system. For comparative analysis we selected three well-known failure prediction techniques: a straightforward method that is based on a reliability model, the dispersion frame technique by Lin and Siewiorek, and the eventset-based method introduced by Vilalta et al. We assess and compare the methods in terms of precision, recall, F-measure, false-positive rate, and computing time. The experiments suggest that our HSMM approach is very effective with respect to online failure prediction.
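The evaluation metrics named in this abstract can be sketched as follows. This is a minimal illustration using the standard textbook definitions; the class name `Confusion`, the function names, and the example counts are assumptions made for illustration and are not taken from the paper or its data.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Counts of prediction outcomes over an evaluation period (hypothetical container)."""
    tp: int  # warning raised and a failure occurred
    fp: int  # warning raised but no failure occurred
    fn: int  # no warning but a failure occurred
    tn: int  # no warning and no failure occurred


def precision(c: Confusion) -> float:
    """Fraction of raised warnings that were followed by a failure."""
    return c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0


def recall(c: Confusion) -> float:
    """Fraction of failures that were preceded by a warning."""
    return c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0


def f_measure(c: Confusion) -> float:
    """Harmonic mean of precision and recall."""
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


def false_positive_rate(c: Confusion) -> float:
    """Fraction of non-failure situations that triggered a warning."""
    return c.fp / (c.fp + c.tn) if (c.fp + c.tn) else 0.0


# Made-up counts, for illustration only.
example = Confusion(tp=8, fp=2, fn=4, tn=86)
print(precision(example), recall(example), f_measure(example), false_positive_rate(example))
```

Precision and recall pull in opposite directions when a warning threshold is tuned, which is one reason a combined score such as the F-measure is commonly reported alongside them.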

International Parallel and Distributed Processing Symposium | 2006

Predicting failures of computer systems: a case study for a telecommunication system

Felix Salfner; Michael Schieschke; Miroslaw Malek

The goal of online failure prediction is to forecast imminent failures while the system is running. This paper compares similar events prediction (SEP) with two other well-known techniques for online failure prediction: a straightforward method that is based on a reliability model and the dispersion frame technique (DFT). SEP is based on recognition of failure-prone patterns utilizing a semi-Markov chain in combination with clustering. We applied the approaches to real data of a commercial telecommunication system. Results are presented in terms of precision, recall, F-measure, and accumulated runtime cost. The results suggest significantly improved forecasting performance.

International Parallel and Distributed Processing Symposium | 2004

Comprehensive logfiles for autonomic systems

Felix Salfner; Steffen Tschirpke; Miroslaw Malek

Summary form only given. A proposal for a new generation of logfiles with regard to new challenges posed by autonomic computing is presented. While a variety of techniques are being developed on the way to autonomic computing, the problem of a system's logfiles remains critical. Several recommendations for logfile design are introduced: (a) Event type and source of events have to be distinguishably incorporated into the design, (b) Logfiles should incorporate hierarchical numbering schemes, (c) Information contained in logfiles should be categorized into classes, (d) Logfiles have to easily lend themselves to automatic analysis, (e) Information density of logfiles should be considered in the design of logging functionality. A metric to measure information entropy of logfiles is proposed. The type of information to be included in logfiles in order to support various aspects of autonomic computing, as originally defined by IBM's eight elements, is specified.
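The abstract does not define its entropy metric, so the following is only a stand-in to make the idea concrete: plain Shannon entropy over the empirical distribution of event types in a log. The function name `event_type_entropy` and the event labels are assumptions for illustration, not the metric proposed in the paper.

```python
import math
from collections import Counter


def event_type_entropy(event_types):
    """Shannon entropy in bits of the empirical event-type distribution.

    Assumption: this is generic Shannon entropy, used here as a stand-in;
    it is not necessarily the metric the paper proposes.
    """
    counts = Counter(event_types)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Sum of p * log2(1 / p) over all observed event types.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


# Hypothetical logs: one repeats a single event type, one is evenly spread.
repetitive_log = ["E1", "E1", "E1", "E1"]
varied_log = ["E1", "E2", "E3", "E4"]

print(event_type_entropy(repetitive_log))  # a log with a single event type carries no surprise
print(event_type_entropy(varied_log))      # four equally likely types
```

Under this stand-in measure, a log that repeats one message has zero entropy, while a log whose event types are evenly distributed has the maximum entropy for that number of types.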

Archive | 2004

Advanced Failure Prediction in Complex Software Systems

Günther A. Hoffmann; Felix Salfner; Miroslaw Malek

The availability of software systems can be increased by preventive measures which are triggered by failure prediction mechanisms. In this paper we present and evaluate two non-parametric techniques which model and predict the occurrence of failures as a function of discrete and continuous measurements of system variables. We employ two modelling approaches: an extended Markov chain model and a function approximation technique utilising universal basis functions (UBF). The presented modelling methods are data driven rather than analytical and can handle large amounts of variables and data. Both modelling techniques have been applied to real data of a commercial telecommunication platform. The data includes event-based log files and system states measured continuously over time. Results are presented in terms of precision, recall, F-measure and cumulative cost. We compare our results to standard techniques such as linear ARMA models. Our findings suggest significantly improved forecasting performance compared to alternative approaches. By using the presented modelling techniques, software availability may be improved by an order of magnitude.

International Symposium on Object/Component/Service-Oriented Real-Time Distributed Computing | 2011

Timely Virtual Machine Migration for Pro-active Fault Tolerance

Andreas Polze; Peter Tröger; Felix Salfner

Next generation processor and memory technologies will provide tremendously increasing computing and memory capacities for application scaling. However, this comes at a price: Due to the growing number of transistors and shrinking structural sizes, overall system reliability of future server systems is about to suffer significantly. This makes reactive fault tolerance schemes less appropriate for server applications under reliability and timeliness constraints. We propose an architectural blueprint for managing server system dependability in a pro-active fashion, in order to keep service-level promises for response times and availability even with increasing hardware failure rates. We introduce the concept of anticipatory virtual machine migration that proactively moves computation away from faulty or suspicious machines. The migration decision is based on health indicators at various system levels that are combined into a global probabilistic reliability measure. Based on this measure, live migration techniques can be triggered in order to move computation to healthy machines even before a failure brings the system down.

Journal of Systems and Software | 2010

Analysis of service availability for time-triggered rejuvenation policies

Felix Salfner; Katinka Wolter

In this paper we investigate the effect of three time-triggered system rejuvenation policies on service availability using a queuing model. The model is formulated as an extended stochastic Petri net using a variety of distributions for times between state changes. We define a metric for steady-state service availability and derive how it can be estimated from the models in a hybrid approach combining simulation and analytical reasoning. We further analyze time-to-failure of systems with rejuvenation. Experiments show that the optimal rejuvenation interval as well as the achievable service availability improvement depend significantly on system utilization. The experiments also show that service availability can deviate significantly from steady-state system availability. For low utilization all rejuvenation policies perform well. For medium utilization, one policy is significantly inferior to the other two, while for high utilization, no rejuvenation should be performed at all.

International Conference on High Performance Computing and Simulation | 2009

Cross-core event monitoring for processor failure prediction

Felix Salfner; Peter Tröger; Steffen Tschirpke

A recent trend in the design of commodity processors is the combination of multiple independent execution units on one chip. With the resulting increase in complexity and transistor count, it becomes more and more likely that a single execution unit on a processor becomes faulty. In order to tackle this situation, we propose an architecture for dependable process management in chip-multiprocessing machines. In our approach, execution units survey each other to anticipate future hardware failures. The prediction relies on the analysis of processor hardware performance counters by a statistical rank-sum test. Initial experiments with the Intel Core processor platform proved the feasibility of the approach, but also showed the need for further investigation due to a high variation in prediction quality in most of the cases.
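To illustrate what a rank-sum test on counter readings looks like, here is a minimal sketch of the Wilcoxon rank-sum test using the normal approximation. It is not the paper's implementation: the function names, the omission of a tie correction in the variance, and the counter values are all assumptions made for illustration.

```python
import math


def rank_sum_z(sample_a, sample_b):
    """Wilcoxon rank-sum z statistic for sample_a versus sample_b.

    Uses the normal approximation and average ranks for ties; the variance
    is not tie-corrected (a simplification assumed for this sketch).
    """
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    # Tag each value with its group (0 = sample_a, 1 = sample_b) and sort by value.
    combined = sorted([(v, 0) for v in sample_a] + [(v, 1) for v in sample_b])
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    n1, n2 = len(sample_a), len(sample_b)
    w = sum(r for r, (_, group) in zip(ranks, combined) if group == 0)
    mean_w = n1 * (n1 + n2 + 1) / 2
    var_w = n1 * n2 * (n1 + n2 + 1) / 12
    return (w - mean_w) / math.sqrt(var_w)


def two_sided_p(z):
    """Two-sided p-value under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


# Made-up performance counter readings, for illustration only.
reference_core = [100, 102, 98, 101, 99]
observed_core = [120, 125, 118, 130, 122]

z = rank_sum_z(reference_core, observed_core)
print(z, two_sided_p(z))
```

A rank-based test makes no assumption about the distribution of counter values, which is a plausible reason to prefer it for hardware counters. For samples this small, an exact test would be more appropriate than the normal approximation used here.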

Archive | 2006

Modeling Event-driven Time Series with Generalized Hidden Semi-Markov Models

Felix Salfner

This report introduces a new model for event-driven temporal sequence processing: Generalized Hidden Semi-Markov Models (GHSMMs). GHSMMs are an extension of hidden Markov models to continuous time that builds on turning the stochastic process of hidden state traversals into a semi-Markov process. A large variety of probability distributions can be used to specify transition durations. It is shown how GHSMMs can be used to address the principal problems of temporal sequence processing: sequence generation, sequence recognition and sequence prediction. Additionally, an algorithm is described by which the parameters of GHSMMs can be determined from a set of training data: the Baum-Welch algorithm is extended by an embedded expectation-maximization algorithm. Under some conditions the procedure can be simplified to the estimation of distribution moments. A proof of convergence and a complexity assessment are provided.

International Parallel and Distributed Processing Symposium | 2005

Proactive fault handling for system availability enhancement

Felix Salfner; Miroslaw Malek

Proactive fault handling combines prevention and repair actions with failure prediction techniques. We extend the standard availability formula by five key measures: (1) precision and (2) recall assess failure prediction, while failure handling is gauged by (3) prevention probability, (4) repair time improvement, and (5) risk of introducing additional failures. We give a short survey of actions that are suited to be combined with failure prediction and provide a procedure to estimate the five key measures. Altogether, this makes it possible to quantify the impact of proactive fault handling on system availability and may provide valuable input for system design.
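The abstract does not give the extended formula, so the sketch below shows only the standard steady-state availability, MTTF / (MTTF + MTTR), plus a deliberately simplified toy adjustment. The toy model is an assumption for illustration, not the paper's formula: it uses only recall and prevention probability, and ignores precision, repair time improvement, and the risk of induced failures. All function names and numbers are hypothetical.

```python
def steady_state_availability(mttf: float, mttr: float) -> float:
    """Standard steady-state availability: MTTF / (MTTF + MTTR)."""
    return mttf / (mttf + mttr)


def toy_proactive_availability(mttf: float, mttr: float,
                               recall: float, prevention_probability: float) -> float:
    """Toy model, NOT the paper's formula.

    Assumption: a fraction recall * prevention_probability of failures is
    predicted and successfully prevented, so the failure rate shrinks by
    that fraction and the effective MTTF grows accordingly.
    """
    avoided = recall * prevention_probability
    if not 0.0 <= avoided < 1.0:
        raise ValueError("recall * prevention_probability must be in [0, 1)")
    effective_mttf = mttf / (1.0 - avoided)
    return steady_state_availability(effective_mttf, mttr)


# Made-up numbers (hours), for illustration only.
baseline = steady_state_availability(mttf=1000.0, mttr=10.0)
proactive = toy_proactive_availability(mttf=1000.0, mttr=10.0,
                                       recall=0.5, prevention_probability=0.5)
print(baseline, proactive)
```

Even this reduced model shows the qualitative point of the abstract: prediction quality and the effectiveness of the triggered action jointly determine how much availability can be gained.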
