Optimal placement of in-memory checkpoints under heterogeneous failure likelihoods

Zaeem Hussain, Taieb Znati, Rami Melhem

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

4 Citations (Scopus)

Abstract

In-memory checkpointing has increased in popularity over the years because it significantly reduces the time needed to take a checkpoint. It is usually accomplished by placing all or part of a processor's checkpoint into the local memory of a remote node within the cluster. If, however, the checkpointed node and the node containing its checkpoint both fail in quick succession, recovery using in-memory checkpoints becomes impossible. In this paper, we explore the problem of placing in-memory checkpoints among nodes whose individual failure likelihoods are not identical. We provide theoretical results on the optimal way to place in-memory checkpoints such that the probability of a catastrophic failure, i.e., the failure of a node as well as the node containing its checkpoint, is minimized. Using failure logs spanning 5 years from a 49,152-node supercomputer, we show that checkpoint placement schemes that utilize knowledge of node failure likelihoods, and are guided by the theoretical results we provide, can significantly reduce the total number of such catastrophic failures when compared with placement schemes that are oblivious to the heterogeneity of node failure likelihoods.
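
To make the placement problem in the abstract concrete, the following is a minimal Python sketch, not the paper's method. It assumes a simplified model in which nodes are grouped into buddy pairs that hold each other's checkpoints, each node fails independently within a recovery window with a known probability, and a catastrophic failure means both nodes of some pair fail. The probabilities in `p`, the names `all_pairings` and `catastrophic_probability`, and the brute-force search are all illustrative assumptions; the abstract does not state the optimal rule, so the sketch only compares an oblivious pairing with the best pairing found by exhaustive search on a tiny example.

```python
from math import prod


def all_pairings(nodes):
    """Yield every way to split an even-sized list of nodes into buddy pairs."""
    if not nodes:
        yield []
        return
    first, rest = nodes[0], nodes[1:]
    for k, partner in enumerate(rest):
        remaining = rest[:k] + rest[k + 1:]
        for tail in all_pairings(remaining):
            yield [(first, partner)] + tail


def catastrophic_probability(pairing, p):
    """Probability that at least one buddy pair loses both of its nodes.

    Assumes node failures are independent (an assumption of this sketch).
    """
    return 1.0 - prod(1.0 - p[a] * p[b] for a, b in pairing)


# Hypothetical per-node failure probabilities (not taken from the paper's logs).
p = [0.30, 0.25, 0.05, 0.04, 0.02, 0.01]
nodes = list(range(len(p)))

# Oblivious placement: pair neighbouring node indices, ignoring failure likelihoods.
oblivious = [(0, 1), (2, 3), (4, 5)]

# Likelihood-aware placement: exhaustive search, feasible only for tiny clusters.
best = min(all_pairings(nodes), key=lambda m: catastrophic_probability(m, p))

print("oblivious:", oblivious, catastrophic_probability(oblivious, p))
print("best:     ", best, catastrophic_probability(best, p))
```

Exhaustive search grows far too quickly to use at the scale of a real cluster, which is one reason closed-form placement results of the kind the abstract describes are useful.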

Original language: English
Title of host publication: Proceedings - 2019 IEEE 33rd International Parallel and Distributed Processing Symposium, IPDPS 2019
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 900-910
Number of pages: 11
ISBN (Electronic): 9781728112466
Publication status: Published - May 2019
Externally published: Yes
Event: 33rd IEEE International Parallel and Distributed Processing Symposium, IPDPS 2019 - Rio de Janeiro, Brazil
Duration: May 20 2019 to May 24 2019

Publication series

Name: Proceedings - 2019 IEEE 33rd International Parallel and Distributed Processing Symposium, IPDPS 2019

Conference

Conference: 33rd IEEE International Parallel and Distributed Processing Symposium, IPDPS 2019
Country/Territory: Brazil
City: Rio de Janeiro
Period: 5/20/19 to 5/24/19

Keywords

  • Failure logs
  • Fault tolerance
  • In-memory checkpoint
  • Multilevel checkpoint

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Hardware and Architecture
  • Information Systems and Management
