CoLoR: Co-Located Rescuers for Fault Tolerance in HPC Systems

Zaeem Hussain, Xiaolong Cui, Taieb Znati, Rami Melhem

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

1 Citation (Scopus)

Abstract

As HPC systems grow in scale, the frequency of system-wide failures is expected to increase. The performance of Coordinated Checkpoint/Restart (C/R), the traditional fault tolerance technique, degrades under high failure rates because of frequent global rollbacks, which are themselves susceptible to failures. We propose CoLoR, a fault tolerance scheme that (i) requires only the failing process to recover, (ii) overlaps re-execution with restart, and (iii) avoids the cumulative effect of successive failures. Our theoretical analysis reveals that such a scheme results in a lower expected completion time than coordinated C/R. We also provide a proof-of-concept implementation in MPI using receiver-based message logging and co-located rescuer (CoLoR) processes, and evaluate its performance on several HPC benchmarks. Our experimental results, combined with observations from the theoretical analysis, show that CoLoR can outperform both traditional C/R and replication over a large range of system sizes, without using extra logger nodes.
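The abstract's opening claim, that coordinated C/R degrades as systems scale, can be illustrated with a small back-of-the-envelope sketch. This is not the paper's analysis: it assumes independent node failures (so system MTBF is node MTBF divided by node count) and uses Young's well-known first-order approximation for the checkpoint interval. The function names and the example numbers are hypothetical.

```python
import math

def system_mtbf(node_mtbf_hours, num_nodes):
    # Assumption: independent, identically distributed node failures,
    # so the system-wide failure rate is the sum of the node rates.
    return node_mtbf_hours / num_nodes

def young_interval(checkpoint_cost_hours, mtbf_hours):
    # Young's first-order approximation of the optimal checkpoint interval.
    return math.sqrt(2.0 * checkpoint_cost_hours * mtbf_hours)

def waste_fraction(checkpoint_cost_hours, mtbf_hours):
    # First-order estimate of time lost to checkpointing plus rework,
    # evaluated at Young's interval. Only meaningful when the checkpoint
    # cost is much smaller than the MTBF.
    tau = young_interval(checkpoint_cost_hours, mtbf_hours)
    return checkpoint_cost_hours / tau + tau / (2.0 * mtbf_hours)

if __name__ == "__main__":
    node_mtbf = 50_000.0      # hypothetical per-node MTBF, in hours
    checkpoint_cost = 0.1     # hypothetical global checkpoint cost, in hours
    for nodes in (1_000, 10_000, 100_000):
        mtbf = system_mtbf(node_mtbf, nodes)
        print(nodes, round(mtbf, 2), round(waste_fraction(checkpoint_cost, mtbf), 3))
```

Under these assumptions the estimated waste grows with the square root of the node count, which is the scaling pressure that motivates schemes in which only the failing process recovers.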

Original language: English
Title of host publication: Proceedings - 2018 IEEE 24th International Conference on Parallel and Distributed Systems, ICPADS 2018
Publisher: IEEE Computer Society
Pages: 569-576
Number of pages: 8
ISBN (Electronic): 9781538673089
Publication status: Published - Feb 19 2019
Externally published: Yes
Event: 24th IEEE International Conference on Parallel and Distributed Systems, ICPADS 2018 - Singapore, Singapore
Duration: Dec 11 2018 to Dec 13 2018

Publication series

Name: Proceedings of the International Conference on Parallel and Distributed Systems - ICPADS
Volume: 2018-December
ISSN (Print): 1521-9097

Conference

Conference: 24th IEEE International Conference on Parallel and Distributed Systems, ICPADS 2018
Country/Territory: Singapore
City: Singapore
Period: 12/11/18 to 12/13/18

ASJC Scopus subject areas

  • Hardware and Architecture
