Distributed systems are a fundamental part of cloud-based systems, such as those operated by Google, and of cyber-physical systems such as smart cities and electric grids. Achieving robustness and providing failure detection and recovery in distributed systems is difficult because their components are subject to local conditions and can fail unexpectedly. The main goal of this research is to define algorithms that provide robustness and self-healing, addressing failure detection and recovery in distributed systems. The research integrates several nature-inspired approaches. First, it improves the robustness of distributed data-collection tasks performed by failure-prone mobile agents, using techniques inspired by animal foraging and swarm intelligence. Results show that agents are able to collect and replicate data from the entire target space despite agent failures. Then, the performance and robustness of the pheromone-based algorithm and of random exploration are studied for data collection in complex networks with different topologies (lattice, small-world, community and scale-free). Experimental results show that network topology affects data collection and synchronisation, and that the proposed pheromone-based approach can improve performance and success rates across most networks. Next, a replication-based self-healing mechanism is proposed. It uses communication time-outs to detect agent failure and learns the time-outs automatically to minimise false positives. Finally, a model to self-heal the structure of a complex network after node failures is proposed. This model differs from existing approaches in that it creates replicas of failing nodes and their links instead of rewiring the network to recover its functionality. Experimental results show that it is possible to recover from node failures if nodes know the topology. However, in some cases the topology is unknown or changes dynamically.
To solve this problem, the data-collection strategies studied previously are applied to synchronise the network topology. Results show the benefits of this approach over a reference multicast-based solution: with mobile agents, a good part of the network is maintained with less message overhead than with multicast. Additionally, the strategy for replicating failing mobile agents is extended to handle node failures, making it possible for agents to synchronise the topology data and enabling nodes that hold this information to recover failed agents and neighbouring nodes at the same time. The results provide key information that may help in designing distributed systems for applications such as sensor networks, swarm robotics, server clusters, clouds and the Internet of Things (IoT).