Scalable and efficient distributed self-healing with self-optimization features in fixed IP networks
Skalierbare und effiziente verteilte Selbstheilung mit Selbstoptimierungsfunktionen in festen IP-Netzwerken
The Internet is continuously gaining importance in our society. It is gradually becoming the backbone of the modern world, with an impact on areas such as politics, communication, intercultural exchange, and emergency services. As these areas develop, the technical infrastructure around the Internet's core protocol, the Internet Protocol (IP), is increasingly exposed to various challenges. One of these challenges is the need for sophisticated resilience mechanisms that can guarantee the robustness of the IP infrastructure in case of faults, failures, and natural disasters. This dissertation aims to develop a new architectural framework for improving the resilience of network nodes in fixed IP network infrastructures, i.e., IP networks without mobility or a continuously changing physical topology. The thesis approaches the topic of resilience from two perspectives. First, it recognizes that self-healing mechanisms are already embedded in diverse network protocols, as well as in the applications and services running on top of a fixed IP network. Second, it analyzes the importance of network and systems management processes for the availability of the network and IT infrastructure. This leads to the identification of a gap between the resilience features intrinsically embedded in the protocols and applications, on the one hand, and the network and systems management processes, on the other. The gap consists in the lack of a framework that runs on top of the protocols and applications and manages them with respect to incidents, thereby automating aspects of the established management standards. In addition, this framework is meant to serve as a layer between the network or system administrator and the networked infrastructure.
That is, on the one hand, the framework is configured and provided with knowledge by the human experts who tune and improve the system. On the other hand, the framework is designed to escalate faulty conditions that it cannot resolve to the operations personnel, so that appropriate management actions can be initiated. The architectural framework consists of software components that operate in a distributed manner inside the nodes of the networked system in question. These software components respond to faulty conditions both proactively and reactively: failures are predicted and avoided, and faulty conditions that already exist trigger an automatic response. To evaluate the concepts and mechanisms, a number of case studies are conducted. Additionally, the scalability and overhead (e.g., memory consumption) of the proposed framework are evaluated. Furthermore, the framework and algorithms are designed to enable real-time self-healing in which the reaction strategy is always optimized so that key performance indicators of the networked system are improved.
Berlin, TU, Diss., 2019