Differential Shadowing: A Resilience Framework for Extreme-scale, Heterogeneous Environments with Non-Uniform Node Failure Distribution

Longhao Li, Taieb Znati, Rami Melhem

Research output: Chapter in Book/Report/Conference proceeding (Conference contribution)

1 Citation (Scopus)

Abstract

Achieving resilience in extreme-scale environments while minimizing energy consumption is a daunting challenge. At extreme scale, classic recovery techniques such as checkpoint-restart and replication become inadequate. In this paper, we propose a novel application-aware elastic resilience model, dShadowing, for extreme-scale environments, as an efficient and scalable alternative to checkpointing, pure replication, and re-execution. The basic tenet of this model is the dShadow, a derivative of its associated main process whose functional and non-functional attributes are derived to achieve high tolerance to failure at minimum energy cost while closely adhering to QoS requirements. Contrary to current schemes, dShadowing assumes heterogeneous environments in which cores fail independently but not identically. Experimental results show that the dShadowing model can achieve, on average, over 20% reduction in energy consumption and expected completion time compared with a baseline shadowing model that assumes cores fail uniformly. The results also demonstrate the flexibility of the dShadowing model and its ability to tolerate failure at scale adaptively and efficiently.

Original language: English
Title of host publication: 2021 IEEE International Performance, Computing, and Communications Conference, IPCCC 2021
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781665443319
Publication status: Published - 2021
Externally published: Yes
Event: 2021 IEEE International Performance, Computing, and Communications Conference, IPCCC 2021 - Austin, United States
Duration: Oct 29 2021 to Oct 31 2021

Publication series

Name: Conference Proceedings of the IEEE International Performance, Computing, and Communications Conference
Volume: 2021-October
ISSN (Print): 1097-2641

Conference

Conference: 2021 IEEE International Performance, Computing, and Communications Conference, IPCCC 2021
Country/Territory: United States
City: Austin
Period: 10/29/21 to 10/31/21

Keywords

  • application-aware
  • dShadowing
  • extreme-scale
  • heterogeneous environment
  • resilience

ASJC Scopus subject areas

  • General Engineering
