Resumen de Fault Tolerance in Multicore Clusters. Techniques to Balance Performance and Dependability

In High Performance Computing (HPC) the demand for more performance is satisfied by increasing the number of components. With the growing scale of HPC applications has came an increase in the number of interruptions as a consequence of hardware failures. The remarkable decrease of Mean Times Between Failures (MTBF) in current systems encourages the research of suitable Fault Tolerance (FT) solutions which makes it possible to guarantee the successful completion of parallel applications. By executing applications on HPC systems, we aim to improve the performance despite the failures that may affect systems. Our research focuses on analyzing and reducing the impact of scalable FT techniques based on rollback-recovery (e.g. uncoordinated checkpoint). As message logging is normally the main source of overhead when using uncoordinated checkpoint approaches, our research focuses on analyzing and reducing the impact of current pessimistic receiver-based message logging techniques. Taking into account the advent of multicore machines, our main contributions aim to make an efficient use of the parallel environment considering the interaction between applications processes and fault tolerance tasks.

Acceso de usuarios registrados

¿Es nuevo? Regístrese

Coordinado por: