TY - GEN
T1 - CoLoR
T2 - 24th IEEE International Conference on Parallel and Distributed Systems, ICPADS 2018
AU - Hussain, Zaeem
AU - Cui, Xiaolong
AU - Znati, Taieb
AU - Melhem, Rami
N1 - Publisher Copyright: © 2018 IEEE.
PY - 2018/7/2
Y1 - 2018/7/2
N2 - With the increase in scale of HPC systems, the frequency of system-wide failures is expected to increase. The performance of Coordinated Checkpoint/Restart (C/R), the traditional fault tolerance technique, degrades under high failure rates because of frequent global rollbacks, which themselves are susceptible to failures. We propose CoLoR, a fault tolerance scheme that i) requires only the failing process to recover, ii) overlaps reexecution with restart, and iii) avoids the cumulative effect of successive failures. Our theoretical analysis reveals that such a scheme results in lower expected completion time than coordinated C/R. We also provide a proof-of-concept implementation in MPI using receiver-based message logging and colocated rescuer (CoLoR) processes, and evaluate its performance on several HPC benchmarks. Our experimental results, combined with observations from the theoretical analysis, show that CoLoR can outperform both traditional C/R and replication over a large range of system sizes, without using extra logger nodes.
UR - http://www.scopus.com/inward/record.url?scp=85063351239&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85063351239&partnerID=8YFLogxK
U2 - 10.1109/PADSW.2018.8644528
DO - 10.1109/PADSW.2018.8644528
M3 - Conference contribution
AN - SCOPUS:85063351239
T3 - Proceedings of the International Conference on Parallel and Distributed Systems - ICPADS
SP - 569
EP - 576
BT - Proceedings - 2018 IEEE 24th International Conference on Parallel and Distributed Systems, ICPADS 2018
PB - IEEE Computer Society
Y2 - 11 December 2018 through 13 December 2018
ER -