TY - GEN
T1 - Differential Shadowing
T2 - 2021 IEEE International Performance, Computing, and Communications Conference, IPCCC 2021
AU - Li, Longhao
AU - Znati, Taieb
AU - Melhem, Rami
N1 - Funding Information:
This material is based in part upon work supported by the National Science Foundation under Grant Numbers CNS-1252306 and CNS-1253218. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
Publisher Copyright:
© 2021 IEEE.
PY - 2021
Y1 - 2021
AB - Achieving resilience in extreme-scale environments while minimizing energy consumption is a daunting challenge. At extreme scale, classic recovery techniques such as checkpoint-restart and replication become inadequate. In this paper, we propose a novel application-aware elastic resilience model, dShadowing, for extreme-scale environments, as an efficient and scalable alternative to checkpointing, pure replication, and re-execution. The basic tenet of this model is the dShadow, a derivative of its associated main process whose functional and non-functional attributes are derived to achieve high tolerance to failure at minimum energy cost while closely adhering to QoS requirements. Contrary to current schemes, dShadowing assumes heterogeneous environments in which cores fail independently but non-identically. Experimental results show that the dShadowing model can achieve, on average, a reduction of over 20% in energy consumption and expected completion time compared to a baseline shadowing model that assumes cores fail uniformly. The results also demonstrate the flexibility of the dShadowing model and its ability to tolerate failures at scale adaptively and efficiently.
KW - application-aware
KW - dShadowing
KW - extreme-scale
KW - heterogeneous environment
KW - resilience
UR - http://www.scopus.com/inward/record.url?scp=85125181765&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85125181765&partnerID=8YFLogxK
U2 - 10.1109/IPCCC51483.2021.9679435
DO - 10.1109/IPCCC51483.2021.9679435
M3 - Conference contribution
AN - SCOPUS:85125181765
T3 - Conference Proceedings of the IEEE International Performance, Computing, and Communications Conference
BT - 2021 IEEE International Performance, Computing, and Communications Conference, IPCCC 2021
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 29 October 2021 through 31 October 2021
ER -