Energy consumption of resilience mechanisms in large scale systems

Bryan Mills, Taieb Znati, Rami Melhem, Kurt B. Ferreira, Ryan E. Grant

Research output: Contribution to conference (paper, peer-reviewed)

19 Citations (Scopus)

Abstract

As HPC systems continue to grow to meet the requirements of tomorrow's exascale-class systems, two of the biggest challenges are power consumption and system resilience. On current systems, the dominant resilience technique is checkpoint/restart. It is believed, however, that this technique alone will not scale to the level necessary to support future systems. Therefore, alternative methods, such as process replication, have been suggested to augment checkpoint/restart. In this paper we address resilience and power together, in contrast to much of the completed work, which addresses them independently. Using an analytical model that accounts for both power consumption and failures, we study the performance of checkpoint and replication-based techniques on current and future systems, and we use power measurements from current systems to validate our findings. Lastly, in an attempt to optimize power consumption for replication, we introduce a new protocol termed shadow replication, which not only reduces energy consumption but also produces faster response times than checkpoint/restart and traditional replication when operating under system power constraints.

Original language: English
Pages: 528-535
Number of pages: 8
Publication status: Published - 2014
Externally published: Yes
Event: 2014 22nd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, PDP 2014 - Turin, Italy
Duration: Feb 12 2014 to Feb 14 2014

Conference

Conference: 2014 22nd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, PDP 2014
Country/Territory: Italy
City: Turin
Period: 2/12/14 to 2/14/14

Keywords

  • energy-aware
  • fault tolerance
  • power-aware
  • resilience
  • scheduling
  • shadow computing

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Software
