TY - GEN
T1 - Adaptive and Power-Aware Resilience for Extreme-Scale Computing
AU - Cui, Xiaolong
AU - Znati, Taieb
AU - Melhem, Rami
N1 - Publisher Copyright: © 2016 IEEE.
PY - 2017/1/12
Y1 - 2017/1/12
N2 - With concerted efforts from researchers in hardware, software, algorithms, and resource management, HPC is moving towards extreme scale, featuring a computing capability of exaFLOPS. As we approach the new era of computing, however, several daunting scalability challenges remain to be conquered. Delivering extreme-scale performance will require a computing platform that supports billion-way parallelism, necessitating a dramatic increase in the number of computing, storage, and networking components. At such a large scale, failure would become the norm rather than the exception, driving the system to significantly lower efficiency with an unprecedented amount of power consumption. To tackle these challenges, we propose an adaptive, power-aware algorithm, referred to as Lazy Shadowing, as an efficient, scalable approach to achieve high levels of resilience, through forward progress, in extreme-scale, failure-prone computing environments. Lazy Shadowing associates with each process a 'shadow' (process) that executes at a reduced rate, and opportunistically rolls forward each shadow to catch up with its leading process during failure recovery. Compared to existing fault tolerance methods, our approach can achieve 20% energy savings with a potential reduction in solution time at scale.
AB - With concerted efforts from researchers in hardware, software, algorithms, and resource management, HPC is moving towards extreme scale, featuring a computing capability of exaFLOPS. As we approach the new era of computing, however, several daunting scalability challenges remain to be conquered. Delivering extreme-scale performance will require a computing platform that supports billion-way parallelism, necessitating a dramatic increase in the number of computing, storage, and networking components. At such a large scale, failure would become the norm rather than the exception, driving the system to significantly lower efficiency with an unprecedented amount of power consumption. To tackle these challenges, we propose an adaptive, power-aware algorithm, referred to as Lazy Shadowing, as an efficient, scalable approach to achieve high levels of resilience, through forward progress, in extreme-scale, failure-prone computing environments. Lazy Shadowing associates with each process a 'shadow' (process) that executes at a reduced rate, and opportunistically rolls forward each shadow to catch up with its leading process during failure recovery. Compared to existing fault tolerance methods, our approach can achieve 20% energy savings with a potential reduction in solution time at scale.
KW - Extreme-scale computing
KW - Forward progress
KW - Lazy Shadowing
KW - Reliability
UR - http://www.scopus.com/inward/record.url?scp=85013173202&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85013173202&partnerID=8YFLogxK
U2 - 10.1109/UIC-ATC-ScalCom-CBDCom-IoP-SmartWorld.2016.0111
DO - 10.1109/UIC-ATC-ScalCom-CBDCom-IoP-SmartWorld.2016.0111
M3 - Conference contribution
AN - SCOPUS:85013173202
T3 - Proceedings - 13th IEEE International Conference on Ubiquitous Intelligence and Computing, 13th IEEE International Conference on Advanced and Trusted Computing, 16th IEEE International Conference on Scalable Computing and Communications, IEEE International Conference on Cloud and Big Data Computing, IEEE International Conference on Internet of People and IEEE Smart World Congress and Workshops, UIC-ATC-ScalCom-CBDCom-IoP-SmartWorld 2016
SP - 671
EP - 679
BT - Proceedings - 13th IEEE International Conference on Ubiquitous Intelligence and Computing, 13th IEEE International Conference on Advanced and Trusted Computing, 16th IEEE International Conference on Scalable Computing and Communications, IEEE International Conference on Cloud and Big Data Computing, IEEE International Conference on Internet of People and IEEE Smart World Congress and Workshops, UIC-ATC-ScalCom-CBDCom-IoP-SmartWorld 2016
A2 - El Baz, Didier
A2 - Bourgeois, Julien
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 13th IEEE International Conference on Ubiquitous Intelligence and Computing, 13th IEEE International Conference on Advanced and Trusted Computing, 16th IEEE International Conference on Scalable Computing and Communications, IEEE International Conference on Cloud and Big Data Computing, IEEE International Conference on Internet of People and IEEE Smart World Congress and Workshops, UIC-ATC-ScalCom-CBDCom-IoP-SmartWorld 2016
Y2 - 18 July 2016 through 21 July 2016
ER -