A systematic fault-tolerant computational model for both crash failures and silent data corruption

Xiaolong Cui, Zaeem Hussain, Taieb Znati, Rami Melhem

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

2 Citations (Scopus)

Abstract

As the boundaries between Cloud and HPC continue to blur, it is clear that there is an urgent demand for a systematic computational model that adapts to the computing platform and accommodates the underlying workloads. As computing systems continue to scale out to satisfy the increasingly large demands on computing capacity, power awareness and fault tolerance have become major concerns. This paper proposes a novel computational model that applies to both compute- and data-intensive workloads, and deals with diverse types of faults. Evaluation results demonstrate that the proposed model is able to achieve significant energy savings compared to existing fault tolerance techniques, while maintaining the same level of fault tolerance.

Original language: English
Title of host publication: 21st Conference on Innovation in Clouds, Internet and Networks, ICIN 2018
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1-8
Number of pages: 8
ISBN (Electronic): 9781538634585
Publication status: Published - Jun 29 2018
Externally published: Yes
Event: 21st International Conference on Innovation in Clouds, Internet and Networks, ICIN 2018 - Paris, France
Duration: Feb 19 2018 to Feb 22 2018

Publication series

Name: 21st Conference on Innovation in Clouds, Internet and Networks, ICIN 2018

Conference

Conference: 21st International Conference on Innovation in Clouds, Internet and Networks, ICIN 2018
Country/Territory: France
City: Paris
Period: 2/19/18 to 2/22/18

Keywords

  • Extreme-scale
  • Fault tolerance
  • Power awareness
  • Shadow Computing
  • Silent data corruption

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Hardware and Architecture
