Harvesting Underutilized Resources to Improve Responsiveness and Tolerance to Crash and Silent Faults for Data-Intensive Applications

Debashis Ganguly, Mohammad H. Mofrad, Taieb Znati, Rami Melhem, John R. Lange

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

2 Citations (Scopus)

Abstract

Low latency is critical for emerging data-intensive real-time analytic and interactive single-wave applications. As the complexity and heterogeneity of cloud computing continue to increase, so will the frequency of errors, which manifest themselves in unpredictable ways and significantly affect both the correctness of computation and the responsiveness of cloud services. In this paper, we propose a new data-centric computational model that improves the responsiveness of data-intensive applications under crash faults and augments their ability to handle silent errors, ensuring computational accuracy. The basic tenet of the proposed model is a task replication scheme that interweaves the processing of a replicated data split among multiple distributed tasks, with each task consuming data at a different offset. In the absence of a failure, the concurrent execution of the tasks ensures complete processing of the data split, with a significant reduction in total execution time. When a task fails, the remaining tasks take over the unfinished work and finish on time. The proposed scheme also guarantees timely detection and correction of silent data corruptions along with crash faults. We demonstrate the effectiveness of our scheme by extending Hadoop's MapReduce code base as a case study. Results show a performance improvement of 50% over Hadoop's Speculative Execution when dealing with crash faults, and an improvement of 33% when dealing with silent errors in the failure-free case.
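The offset-based replication idea in the abstract can be illustrated with a toy lockstep simulation. This is only a sketch under stated assumptions, not the authors' Hadoop extension: the function name `interleaved_run`, the `crash_at` parameter, the even spacing of start offsets, and the rule that a replica skips records already completed by a peer are all hypothetical choices made for illustration.

```python
def interleaved_run(num_records, num_replicas, crash_at=None):
    """Simulate replicas scanning one replicated split from staggered offsets.

    crash_at (hypothetical) maps a replica index to the step at which it
    crashes; a replica does no work from that step onward.
    Returns the number of lockstep steps until every record is processed.
    """
    crash_at = crash_at or {}
    # Assumption: start offsets are spaced evenly across the split.
    stride = num_records // num_replicas
    cursors = [r * stride for r in range(num_replicas)]
    done = set()
    steps = 0
    while len(done) < num_records:
        alive = [r for r in range(num_replicas)
                 if crash_at.get(r, float("inf")) > steps]
        if not alive:
            raise RuntimeError("every replica crashed before the split was covered")
        for r in alive:
            if len(done) == num_records:
                break
            # Skip records a peer has already finished, wrapping circularly.
            while cursors[r] % num_records in done:
                cursors[r] += 1
            done.add(cursors[r] % num_records)
        steps += 1
    return steps


if __name__ == "__main__":
    print(interleaved_run(12, 3))          # no failure
    print(interleaved_run(12, 3, {1: 2}))  # replica 1 crashes at step 2
```

In this toy model, three replicas cover a 12-record split in 4 steps when nothing fails, and in 5 steps when one replica crashes at step 2, because the survivors wrap around and pick up the records the crashed replica left unprocessed. A single task working alone would need 12 steps.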

Original language: English
Title of host publication: Proceedings - 2017 IEEE 10th International Conference on Cloud Computing, CLOUD 2017
Editors: Geoffrey C. Fox
Publisher: IEEE Computer Society
Pages: 536-543
Number of pages: 8
ISBN (Electronic): 9781538619933
Publication status: Published - Sept 8 2017
Externally published: Yes
Event: 10th IEEE International Conference on Cloud Computing, CLOUD 2017 - Honolulu, United States
Duration: Jun 25 2017 to Jun 30 2017

Publication series

Name: IEEE International Conference on Cloud Computing, CLOUD
Volume: 2017-June
ISSN (Print): 2159-6182
ISSN (Electronic): 2159-6190

Conference

Conference: 10th IEEE International Conference on Cloud Computing, CLOUD 2017
Country/Territory: United States
City: Honolulu
Period: 6/25/17 to 6/30/17

Keywords

  • crash faults
  • data-intensive computing
  • fault-tolerance
  • replication
  • silent errors

ASJC Scopus subject areas

  • Artificial Intelligence
  • Information Systems
  • Software
