TY - GEN
T1 - Partial redundancy in HPC systems with non-uniform node reliabilities
AU - Hussain, Zaeem
AU - Znati, Taieb
AU - Melhem, Rami
N1 - Funding Information:
We are thankful to reviewers for their constructive feedback that has helped us improve the quality of this paper. This research is based in part upon work supported by the Department of Energy under contract DE-SC0014376. This research was supported in part by the University of Pittsburgh Center for Research Computing through the resources provided.
Publisher Copyright:
© 2018 IEEE.
PY - 2019/3/11
Y1 - 2019/3/11
N2 - We study the usefulness of partial redundancy in HPC message passing systems where individual node failure distributions are not identical. Prior research on fault tolerance has generally assumed identical failure distributions for the nodes of the system. In such settings, partial replication has never been shown to outperform the two extremes (full replication and no replication) for any significant range of node counts. In this work, we argue that partial redundancy may provide the best performance under the more realistic assumption of non-identical node failure distributions. We provide theoretical results on arranging nodes with different reliability values among replicas such that system reliability is maximized. Moreover, using system reliability to compute MTTI (mean-time-to-interrupt) and expected completion time of a partially replicated system, we numerically determine the optimal partial replication degree. Our results indicate that partial replication can be a more efficient alternative to full replication at system scales where Checkpoint/Restart alone is not sufficient.
KW - Checkpoint
KW - Fault tolerance
KW - HPC
KW - Replication
KW - Resilience
UR - http://www.scopus.com/inward/record.url?scp=85064106081&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85064106081&partnerID=8YFLogxK
U2 - 10.1109/SC.2018.00047
DO - 10.1109/SC.2018.00047
M3 - Conference contribution
AN - SCOPUS:85064106081
T3 - Proceedings - International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2018
SP - 566
EP - 576
BT - Proceedings - International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2018
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2018 International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2018
Y2 - 11 November 2018 through 16 November 2018
ER -