Abstract of Model driven root cause analysis and reliability enhancement for large distributed computing systems

Michel Zasadzinski

Over the last years, the number of Big Data, supercomputing, Internet of Things (IoT), and edge systems has snowballed. Large, distributed, and complex IT systems are at the core of many areas and services in academia and business. Any failure or performance degradation in these systems has negative effects, e.g., degraded user experience or higher operational costs. In response, IT operators resolve failures, issues, and unexpected events. They are aided by tools for monitoring, diagnostics, and root cause analysis (RCA), and they perform actions to restore a system to its normal state. However, the characteristics of future IT systems make diagnostics and root cause analysis complicated. Thus, even the most skillful operators struggle to satisfy QoS constraints. In this thesis, we aim to aid operators in their work and, in the long term, to replace them in RCA. We contribute to such environments in two areas: (1) diagnostics, RCA, and root cause classification; and (2) prevention of failures. We focus on aspects such as scalability, dynamism, lack of knowledge about system failures, predictability, and prevention of failures. We use different IT environments to diversify the use cases.

Firstly, we propose a fast RCA system based on probabilistic reasoning. The system can diagnose networks of devices with millions of nodes in the diagnostic model, which solves the scalability problem of RCA. We create diagnostic models based on Bayesian networks and then transform them into Arithmetic Circuits, a more efficient structure for runtime use. Thanks to the optimization proposed for this transformation and a cache-based mechanism, the solution achieves high performance. We also propose actor-based RCA. This method distributes the diagnostic calculations across the devices and uses the self-diagnostics paradigm. As a result, partial diagnosis results remain available even when connectivity with part of the diagnosed system is lost. We show that the contribution works well in a simulated IoT system with a highly dynamic structure.
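To make the idea of probabilistic diagnosis concrete, the following minimal sketch computes the posterior probability of each candidate root cause after a symptom is observed. It is only an illustration under stated assumptions: the cause names, the probabilities, and the noisy-OR symptom model are hypothetical, and the inference is brute-force enumeration over a tiny Bayesian network, not the Arithmetic Circuit compilation or caching described in the thesis.

```python
from itertools import product

# Hypothetical priors for two independent root causes (illustrative values only).
PRIORS = {"link_down": 0.01, "overload": 0.05}
# Hypothetical noisy-OR strengths: P(symptom | only this cause is active).
STRENGTH = {"link_down": 0.9, "overload": 0.6}
# Hypothetical probability of the symptom when no cause is active.
LEAK = 0.001


def p_symptom(active):
    """Noisy-OR probability that the symptom appears given the active causes."""
    p_absent = 1.0 - LEAK
    for cause in active:
        p_absent *= 1.0 - STRENGTH[cause]
    return 1.0 - p_absent


def posterior_given_symptom():
    """Return P(cause | symptom observed) for every cause, by enumeration."""
    causes = list(PRIORS)
    evidence = 0.0                       # accumulates P(symptom)
    joint = {c: 0.0 for c in causes}     # accumulates P(cause, symptom)
    for states in product([False, True], repeat=len(causes)):
        active = [c for c, on in zip(causes, states) if on]
        p = p_symptom(active)
        for c, on in zip(causes, states):
            p *= PRIORS[c] if on else 1.0 - PRIORS[c]
        evidence += p
        for c in active:
            joint[c] += p
    return {c: joint[c] / evidence for c in causes}


if __name__ == "__main__":
    for cause, prob in posterior_given_symptom().items():
        print(f"{cause}: {prob:.3f}")
```

Enumeration grows exponentially with the number of causes, which is why a compiled runtime structure matters for models with millions of nodes.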

Secondly, we focus on knowledge integration and on partial knowledge of a diagnosed system. The path to NoOps requires precise, reliable, and fast diagnostics, as well as reusing as much knowledge as possible after the system is reconfigured or changed. We propose a weighted graph framework that can transfer knowledge and perform high-quality diagnostics of IT systems. We encode all available data in a graph representation of a system state and automatically calculate the weights of these graphs. Then, through similarity evaluation, we transfer knowledge about failures from one system to another and use it for diagnostics. We successfully evaluate the proposed approach on a Big Data cluster and a cloud system of containers running Spark, Hadoop, Kafka, and Cassandra.
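The similarity-based transfer of failure knowledge can be sketched as a nearest-neighbour lookup over weighted graphs. The abstract does not state which similarity measure or weighting scheme the framework uses, so this sketch assumes a weighted Jaccard similarity over edge weights; the function names, edge labels, and failure labels are hypothetical.

```python
def weighted_jaccard(g1, g2):
    """Weighted Jaccard similarity of two graphs given as {edge: weight} dicts."""
    edges = set(g1) | set(g2)
    shared = sum(min(g1.get(e, 0.0), g2.get(e, 0.0)) for e in edges)
    total = sum(max(g1.get(e, 0.0), g2.get(e, 0.0)) for e in edges)
    return shared / total if total else 1.0


def diagnose(state, labelled_states):
    """Return the failure label of the most similar known system state.

    labelled_states is a list of (graph, label) pairs, possibly collected
    on a different system than the one being diagnosed.
    """
    best_graph, best_label = max(
        labelled_states, key=lambda pair: weighted_jaccard(state, pair[0])
    )
    return best_label


if __name__ == "__main__":
    # Hypothetical labelled states from a source system.
    known = [
        ({("spark", "hdfs"): 1.0, ("kafka", "zookeeper"): 0.2}, "disk_full"),
        ({("kafka", "zookeeper"): 1.0}, "broker_down"),
    ]
    # Hypothetical state observed on the target system.
    observed = {("kafka", "zookeeper"): 0.9, ("spark", "hdfs"): 0.1}
    print(diagnose(observed, known))
```

Representing a state as a dictionary of edge weights keeps the comparison independent of the system that produced it, which is what makes reuse across systems possible in this toy setting.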

Thirdly, we focus on the predictability of a supercomputing environment and the prevention of failures. Failed jobs in a supercomputer waste resources, e.g., CPU time and energy. Mining data collected during the operation of data centers helps to find patterns that explain failures and can be used to predict them. Automating system reactions, e.g., early termination of jobs when software failures are predicted, not only increases availability and reduces operating costs but also frees people's time. We explore a unique dataset containing the topology, operation metrics, and job scheduler history from the petascale Mistral supercomputer. We use decision trees to extract the system features most relevant to the final state of a job. Then, we propose actions to prevent failures. We create a model that predicts job evolution based on the power time series of nodes. Finally, we evaluate the effect of static and dynamic job termination policies on CPU time savings. We finish the thesis with a discussion of the contributions and state directions for future work.
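How early termination translates into CPU time savings can be illustrated with a small calculation. The abstract does not define the static and dynamic policies, so this sketch assumes that a static policy checks each job once at a fixed hour, while a dynamic policy terminates a job as soon as a predictor raises an alarm. The `Job` fields, the predictor alarm time, and all numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:
    cores: int
    runtime_h: float           # how long the job actually ran
    failed: bool               # final state reported by the scheduler
    alarm_h: Optional[float]   # hour at which a hypothetical predictor flagged failure


def saved_core_hours(jobs, fixed_check_h=None):
    """Net core-hours saved by terminating flagged jobs early.

    fixed_check_h=None models the assumed dynamic policy (kill at alarm time);
    a number models the assumed static policy (kill only at that fixed hour,
    and only if the alarm was already raised by then).
    """
    saved = 0.0
    for job in jobs:
        if job.alarm_h is None:
            continue
        if fixed_check_h is None:
            kill_h = job.alarm_h
        elif job.alarm_h <= fixed_check_h:
            kill_h = fixed_check_h
        else:
            continue  # alarm came after the single static check
        if kill_h >= job.runtime_h:
            continue  # job had already ended, nothing to terminate
        if job.failed:
            saved += job.cores * (job.runtime_h - kill_h)
        else:
            saved -= job.cores * kill_h  # false alarm: useful work is lost
    return saved


if __name__ == "__main__":
    jobs = [
        Job(cores=100, runtime_h=10.0, failed=True, alarm_h=2.0),
        Job(cores=50, runtime_h=4.0, failed=False, alarm_h=None),
        Job(cores=10, runtime_h=8.0, failed=True, alarm_h=6.0),
    ]
    print("dynamic:", saved_core_hours(jobs))
    print("static at 4 h:", saved_core_hours(jobs, fixed_check_h=4.0))
```

Charging false alarms as lost work keeps the comparison honest: an aggressive predictor can terminate healthy jobs and end up with negative net savings.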
