Tutorial 2: Proactive Fault Management for Availability Enhancement

  • Miroslaw Malek, Humboldt-Universitaet zu Berlin, Germany
  • Felix Salfner, Humboldt-Universitaet zu Berlin, Germany

In this tutorial, we focus on runtime monitoring, failure avoidance and prediction algorithms and technologies, proactive recovery, and preventive maintenance. These are the main steps in proactive fault management, and they may have a major impact on the availability and performance of computer systems. We first survey runtime monitoring techniques and long-term and short-term prediction techniques, then introduce prediction quality measures, and finally demonstrate how the availability of software and hardware systems can be increased by preventive measures that are triggered by short-term failure prediction mechanisms. We present and evaluate mainly non-parametric techniques that model and predict the occurrence of failures as a function of discrete and continuous measurements of system variables.

We introduce two modeling approaches to failure prediction: hidden semi-Markov models and a function approximation technique utilizing universal basis functions. The presented modeling methods are data-driven rather than analytical and can handle large numbers of variables and large volumes of data. They offer the potential to capture the underlying dynamics of even high-dimensional and noisy systems. Both modeling techniques have been applied to real data from a commercial telecommunication platform. The data includes event-based log files and measured system states. We compare the effectiveness of the discussed techniques with that of other methods in terms of precision, recall, F-measure, and cumulative cost. The two methods demonstrate significantly improved forecasting performance compared to alternative approaches such as linear ARMA models.
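
As a minimal sketch of the prediction quality measures named above, the following Python snippet computes precision, recall, and the F-measure from the counts of correct failure warnings, false warnings, and missed failures. The function name `prediction_quality` and the example counts are hypothetical and chosen only for illustration; they are not taken from the tutorial's data. The F-measure is assumed here to be the usual harmonic mean of precision and recall.

```python
def prediction_quality(true_pos: int, false_pos: int, false_neg: int):
    """Return (precision, recall, f_measure) for a failure predictor.

    true_pos  -- failures that were correctly predicted
    false_pos -- warnings raised although no failure followed
    false_neg -- failures that occurred without a warning
    """
    warnings = true_pos + false_pos      # all warnings raised
    failures = true_pos + false_neg      # all failures that occurred

    # Guard against division by zero when there were no warnings or failures.
    precision = true_pos / warnings if warnings else 0.0
    recall = true_pos / failures if failures else 0.0

    # Harmonic mean of precision and recall (assumed definition of F-measure).
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Hypothetical counts: 8 correct warnings, 2 false warnings, 8 missed failures.
precision, recall, f_measure = prediction_quality(8, 2, 8)
print(f"precision={precision:.2f} recall={recall:.2f} F={f_measure:.3f}")
```

With these illustrative counts, most warnings are correct (high precision) but half of the failures go unpredicted (moderate recall); the F-measure condenses the trade-off into a single number. Cumulative cost, also mentioned above, would additionally require a cost assigned to each outcome and is not covered by this sketch.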

Finally, we present a wide range of preventive measures that can be applied once a failure has been predicted to be imminent.


By using the presented proactive fault management techniques, system availability may be improved by an order of magnitude.