TY - GEN
T1 - Optimal placement of in-memory checkpoints under heterogeneous failure likelihoods
AU - Hussain, Zaeem
AU - Znati, Taieb
AU - Melhem, Rami
N1 - Publisher Copyright:
© 2019 IEEE
PY - 2019/5
Y1 - 2019/5
N2 - In-memory checkpointing has increased in popularity over the years because it significantly improves the time to take a checkpoint. It is usually accomplished by placing all or part of a processor's checkpoint into the local memory of a remote node within the cluster. If, however, the checkpointed node and the node containing its checkpoint both fail in quick succession, recovery using in-memory checkpoints becomes impossible. In this paper, we explore the problem of placing in-memory checkpoints among nodes whose individual failure likelihoods are not identical. We provide theoretical results on the optimal way to place in-memory checkpoints such that the probability of occurrence of a catastrophic failure, i.e. failure of a node as well as the node containing its checkpoint, is minimized. Using the failure logs spread over 5 years of a 49,152 node supercomputer, we show that checkpoint placement schemes that utilize knowledge of node failure likelihoods, and are guided by the theoretical results we provide, can significantly reduce the total number of such catastrophic failures when compared with placement schemes that are oblivious of the heterogeneity in nodes based on their failure likelihoods.
AB - In-memory checkpointing has increased in popularity over the years because it significantly improves the time to take a checkpoint. It is usually accomplished by placing all or part of a processor's checkpoint into the local memory of a remote node within the cluster. If, however, the checkpointed node and the node containing its checkpoint both fail in quick succession, recovery using in-memory checkpoints becomes impossible. In this paper, we explore the problem of placing in-memory checkpoints among nodes whose individual failure likelihoods are not identical. We provide theoretical results on the optimal way to place in-memory checkpoints such that the probability of occurrence of a catastrophic failure, i.e. failure of a node as well as the node containing its checkpoint, is minimized. Using the failure logs spread over 5 years of a 49,152 node supercomputer, we show that checkpoint placement schemes that utilize knowledge of node failure likelihoods, and are guided by the theoretical results we provide, can significantly reduce the total number of such catastrophic failures when compared with placement schemes that are oblivious of the heterogeneity in nodes based on their failure likelihoods.
KW - Failure logs
KW - Fault tolerance
KW - In-memory checkpoint
KW - Multilevel checkpoint
UR - http://www.scopus.com/inward/record.url?scp=85072833947&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85072833947&partnerID=8YFLogxK
U2 - 10.1109/IPDPS.2019.00098
DO - 10.1109/IPDPS.2019.00098
M3 - Conference contribution
AN - SCOPUS:85072833947
T3 - Proceedings - 2019 IEEE 33rd International Parallel and Distributed Processing Symposium, IPDPS 2019
SP - 900
EP - 910
BT - Proceedings - 2019 IEEE 33rd International Parallel and Distributed Processing Symposium, IPDPS 2019
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 33rd IEEE International Parallel and Distributed Processing Symposium, IPDPS 2019
Y2 - 20 May 2019 through 24 May 2019
ER -