Rejuvenating Shadows: Fault Tolerance with Forward Recovery

Xiaolong Cui, Taieb Znati, Rami Melhem

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution

2 Citations (Scopus)

Abstract

In today's large-scale High Performance Computing (HPC) systems, an increasing portion of the computing capacity is wasted due to failures and recoveries. The mean time between failures on exascale machines is expected to drop to a few hours, making fault tolerance a major challenge. This work explores novel approaches to fault tolerance that achieve forward recovery, power-awareness, and scalability. The proposed model, referred to as Rejuvenating Shadows, is able to deal with multiple types of failure and maintain a consistent level of resilience. An implementation is provided for MPI and empirically evaluated with various benchmark applications that represent a wide range of HPC workloads. The results demonstrate Rejuvenating Shadows' ability to tolerate high failure rates and to outperform in-memory checkpointing/restart in both execution time and resource utilization.

Original language: English
Title of host publication: Proceedings - 2017 IEEE 19th Intl Conference on High Performance Computing and Communications, HPCC 2017, 2017 IEEE 15th Intl Conference on Smart City, SmartCity 2017 and 2017 IEEE 3rd Intl Conference on Data Science and Systems, DSS 2017
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 547-554
Number of pages: 8
ISBN (Electronic): 9781538625880
DOIs
Publication status: Published - Feb 14 2018
Externally published: Yes
Event: 19th IEEE Intl Conference on High Performance Computing and Communications, 15th IEEE Intl Conference on Smart City, and 3rd IEEE Intl Conference on Data Science and Systems, HPCC/SmartCity/DSS 2017 - Bangkok, Thailand
Duration: Dec 18 2017 to Dec 20 2017

Publication series

Name: Proceedings - 2017 IEEE 19th Intl Conference on High Performance Computing and Communications, HPCC 2017, 2017 IEEE 15th Intl Conference on Smart City, SmartCity 2017 and 2017 IEEE 3rd Intl Conference on Data Science and Systems, DSS 2017
Volume: 2018-January

Conference

Conference: 19th IEEE Intl Conference on High Performance Computing and Communications, 15th IEEE Intl Conference on Smart City, and 3rd IEEE Intl Conference on Data Science and Systems, HPCC/SmartCity/DSS 2017
Country/Territory: Thailand
City: Bangkok
Period: 12/18/17 to 12/20/17

Keywords

  • Extreme-scale computing
  • Forward recovery
  • Leaping
  • Rejuvenation
  • Reliability

ASJC Scopus subject areas

  • Information Systems
  • Hardware and Architecture
  • Computer Science Applications
  • Computer Networks and Communications
  • Computational Theory and Mathematics
  • Software
