TY - GEN
T1 - Rejuvenating Shadows
T2 - 19th IEEE Intl Conference on High Performance Computing and Communications, 15th IEEE Intl Conference on Smart City, and 3rd IEEE Intl Conference on Data Science and Systems, HPCC/SmartCity/DSS 2017
AU - Cui, Xiaolong
AU - Znati, Taieb
AU - Melhem, Rami
N1 - Funding Information:
This research is based in part upon work supported by the Department of Energy under contract DE-SC0014376. This research was supported in part by the University of Pittsburgh Center for Research Computing through the resources provided. This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number OCI-1053575. Specifically, it used the Bridges system, which is supported by NSF award number ACI-1445606, at the Pittsburgh Supercomputing Center (PSC).
Publisher Copyright:
© 2017 IEEE.
PY - 2018/2/14
Y1 - 2018/2/14
N2 - In today's large-scale High Performance Computing (HPC) systems, an increasing portion of the computing capacity is wasted due to failures and recoveries. It is expected that exascale machines will decrease the mean time between failures to a few hours, making fault tolerance a major challenge. This work explores novel approaches to fault tolerance that achieve forward recovery, power-awareness, and scalability. The proposed model, referred to as Rejuvenating Shadows, is able to deal with multiple types of failure and maintain a consistent level of resilience. An implementation is provided for MPI and empirically evaluated with various benchmark applications that represent a wide range of HPC workloads. The results demonstrate Rejuvenating Shadows' ability to tolerate high failure rates and to outperform in-memory checkpointing/restart in both execution time and resource utilization.
AB - In today's large-scale High Performance Computing (HPC) systems, an increasing portion of the computing capacity is wasted due to failures and recoveries. It is expected that exascale machines will decrease the mean time between failures to a few hours, making fault tolerance a major challenge. This work explores novel approaches to fault tolerance that achieve forward recovery, power-awareness, and scalability. The proposed model, referred to as Rejuvenating Shadows, is able to deal with multiple types of failure and maintain a consistent level of resilience. An implementation is provided for MPI and empirically evaluated with various benchmark applications that represent a wide range of HPC workloads. The results demonstrate Rejuvenating Shadows' ability to tolerate high failure rates and to outperform in-memory checkpointing/restart in both execution time and resource utilization.
KW - Extreme-scale computing
KW - Forward recovery
KW - Leaping
KW - Rejuvenation
KW - Reliability
UR - http://www.scopus.com/inward/record.url?scp=85047454736&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85047454736&partnerID=8YFLogxK
U2 - 10.1109/HPCC-SmartCity-DSS.2017.71
DO - 10.1109/HPCC-SmartCity-DSS.2017.71
M3 - Conference contribution
AN - SCOPUS:85047454736
T3 - Proceedings - 2017 IEEE 19th Intl Conference on High Performance Computing and Communications, HPCC 2017, 2017 IEEE 15th Intl Conference on Smart City, SmartCity 2017 and 2017 IEEE 3rd Intl Conference on Data Science and Systems, DSS 2017
SP - 547
EP - 554
BT - Proceedings - 2017 IEEE 19th Intl Conference on High Performance Computing and Communications, HPCC 2017, 2017 IEEE 15th Intl Conference on Smart City, SmartCity 2017 and 2017 IEEE 3rd Intl Conference on Data Science and Systems, DSS 2017
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 18 December 2017 through 20 December 2017
ER -