TY - GEN
T1 - CoLoR
T2 - 24th IEEE International Conference on Parallel and Distributed Systems, ICPADS 2018
AU - Hussain, Zaeem
AU - Cui, Xiaolong
AU - Znati, Taieb
AU - Melhem, Rami
N1 - Funding Information:
This research is based in part upon work supported by the Department of Energy under contract DE-SC0014376. This research was supported in part by the University of Pittsburgh Center for Research Computing through the resources provided. This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number OCI-1053575. Specifically, it used the Bridges system, which is supported by NSF award number ACI-1445606, at the Pittsburgh Supercomputing Center (PSC).
Publisher Copyright:
© 2018 IEEE.
PY - 2018/7/2
Y1 - 2018/7/2
N2 - With the increase in scale of HPC systems, the frequency of system-wide failures is expected to increase. The performance of Coordinated Checkpoint/Restart (C/R), the traditional fault tolerance technique, degrades under high failure rates because of frequent global rollbacks, which themselves are susceptible to failures. We propose CoLoR, a fault tolerance scheme that i) requires only the failing process to recover, ii) overlaps reexecution with restart, and iii) avoids the cumulative effect of successive failures. Our theoretical analysis reveals that such a scheme results in lower expected completion time than coordinated C/R. We also provide a proof-of-concept implementation in MPI using receiver-based message logging and colocated rescuer (CoLoR) processes, and evaluate its performance on several HPC benchmarks. Our experimental results, combined with observations from the theoretical analysis, show that CoLoR can outperform both traditional C/R and replication over a large range of system sizes, without using extra logger nodes.
AB - With the increase in scale of HPC systems, the frequency of system-wide failures is expected to increase. The performance of Coordinated Checkpoint/Restart (C/R), the traditional fault tolerance technique, degrades under high failure rates because of frequent global rollbacks, which themselves are susceptible to failures. We propose CoLoR, a fault tolerance scheme that i) requires only the failing process to recover, ii) overlaps reexecution with restart, and iii) avoids the cumulative effect of successive failures. Our theoretical analysis reveals that such a scheme results in lower expected completion time than coordinated C/R. We also provide a proof-of-concept implementation in MPI using receiver-based message logging and colocated rescuer (CoLoR) processes, and evaluate its performance on several HPC benchmarks. Our experimental results, combined with observations from the theoretical analysis, show that CoLoR can outperform both traditional C/R and replication over a large range of system sizes, without using extra logger nodes.
UR - http://www.scopus.com/inward/record.url?scp=85063351239&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85063351239&partnerID=8YFLogxK
U2 - 10.1109/PADSW.2018.8644528
DO - 10.1109/PADSW.2018.8644528
M3 - Conference contribution
AN - SCOPUS:85063351239
T3 - Proceedings of the International Conference on Parallel and Distributed Systems - ICPADS
SP - 569
EP - 576
BT - Proceedings - 2018 IEEE 24th International Conference on Parallel and Distributed Systems, ICPADS 2018
PB - IEEE Computer Society
Y2 - 11 December 2018 through 13 December 2018
ER -