Enhancing Reliability-Aware Speedup Modelling via Replication

Zaeem Hussain, Taieb Znati, Rami Melhem

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

4 Citations (Scopus)

Abstract

Reliability-aware speedup models study the expected speedup of a parallel application as a function of the number of processors, on a platform susceptible to processor failures. Existing works in this area have developed models using checkpoint-restart (without replication) as the only fault tolerance mechanism, and have studied the upper bound on the number of processors beyond which the application speedup starts to degrade due to the increasing likelihood of failure. In this work, we develop speedup models in which replication, specifically dual replication, is also employed for resilience. We demonstrate that the upper bound on the number of processors to execute a perfectly parallel application using dual replication is of the order λ^{-2}, where λ is the individual processor failure rate. We also compare the dual replication model with that of no-replication. Specifically, we find that, given the same hardware resources, replication starts offering better speedup just before the upper bound on the number of processors for no-replication is reached. Taken together, our results indicate that replication can significantly enhance reliability-aware speedup models by i) pushing the number of processors that yield the optimal speedup to a much higher value than what is possible without replication, and ii) improving on the optimal speedup possible through checkpoint-restart alone.
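
As a rough illustration of why such an upper bound exists in the no-replication case, the sketch below uses the classical Young/Daly first-order checkpointing approximation rather than the model developed in the paper. It assumes a perfectly parallel application, exponentially distributed failures with per-processor rate λ (so the platform failure rate is Pλ), and a checkpoint cost C that does not depend on P. The function names and the sample values of λ and C are hypothetical. Under these assumptions the fraction of time lost is about sqrt(2CPλ), so the speedup P(1 - sqrt(2CPλ)) peaks at P = 2/(9Cλ), which is of order λ^{-1}. The approximation is crude near the peak, where the lost fraction is no longer small, and it says nothing about the replicated case.

```python
import math


def speedup_no_replication(p, lam, ckpt_cost):
    """First-order expected speedup on p processors with checkpoint-restart.

    Assumptions (illustrative, not the paper's model): perfectly parallel
    work, platform failure rate p * lam, checkpoint cost independent of p,
    and the Young/Daly first-order waste sqrt(2 * ckpt_cost * p * lam).
    """
    waste = math.sqrt(2.0 * ckpt_cost * p * lam)
    # Clamp at zero: beyond waste = 1 the first-order formula is meaningless.
    return p * max(0.0, 1.0 - waste)


def optimal_processors(lam, ckpt_cost):
    """Maximiser of speedup_no_replication, from d/dp [p * (1 - a * sqrt(p))] = 0."""
    return 2.0 / (9.0 * ckpt_cost * lam)


if __name__ == "__main__":
    lam = 1e-9        # hypothetical per-processor failure rate (per second)
    ckpt_cost = 600.0  # hypothetical checkpoint cost (seconds)
    p_star = optimal_processors(lam, ckpt_cost)
    for factor in (0.5, 1.0, 1.5):
        p = factor * p_star
        print(f"P = {p:12.0f}  speedup = {speedup_no_replication(p, lam, ckpt_cost):12.0f}")
```

Halving λ in this sketch doubles the optimal processor count, which is the λ^{-1} scaling that the abstract contrasts with the λ^{-2} bound reported for dual replication.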

Original language: English
Title of host publication: Proceedings - 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2020
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 528-539
Number of pages: 12
ISBN (Electronic): 9781728158099
Publication status: Published - Jun 2020
Externally published: Yes
Event: 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2020 - Valencia, Spain
Duration: Jun 29 2020 → Jul 2 2020

Publication series

Name: Proceedings - 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2020

Conference

Conference: 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2020
Country/Territory: Spain
City: Valencia
Period: 6/29/20 → 7/2/20

Keywords

  • fault tolerance
  • hpc
  • modelling
  • parallel
  • reliability
  • speedup

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Hardware and Architecture
  • Information Systems
  • Information Systems and Management
  • Safety, Risk, Reliability and Quality
