TY - GEN
T1 - Harvesting Underutilized Resources to Improve Responsiveness and Tolerance to Crash and Silent Faults for Data-Intensive Applications
AU - Ganguly, Debashis
AU - Mofrad, Mohammad H.
AU - Znati, Taieb
AU - Melhem, Rami
AU - Lange, John R.
N1 - Publisher Copyright: © 2017 IEEE.
PY - 2017/9/8
Y1 - 2017/9/8
N2 - Low latency is critical for emerging data-intensive real-time analytic and interactive single-wave applications. As the complexity and heterogeneity of cloud computing continue to increase, so will the frequency of errors, which will manifest themselves in unpredictable ways with a significant impact on the correctness of the computing and responsiveness of cloud services. In this paper, we propose a new data-centric computational model to improve the responsiveness of the data-intensive applications to crash faults and augment its ability to deal with silent errors to ensure computational accuracy. The basic tenet of the proposed model is a task replication scheme, which interweaves the processing of a replicated data split among multiple distributed tasks, with each task consuming data at a different offset. In the absence of a failure, the concurrent execution of the tasks ensures complete processing of the data split, with a significant reduction in the total execution time. In the case of an error, however, the remaining tasks take over the execution of the unfinished work and finish on time. The proposed scheme also guarantees timely detection and correction of silent data corruptions along with crash faults. We demonstrate the effectiveness of our scheme by extending Hadoop's MapReduce code base as a case study. Results show a performance improvement of 50% over Hadoop's Speculative Execution when dealing with crash-faults and an improvement of 33% when dealing with silent errors in case of no failure.
AB - Low latency is critical for emerging data-intensive real-time analytic and interactive single-wave applications. As the complexity and heterogeneity of cloud computing continue to increase, so will the frequency of errors, which will manifest themselves in unpredictable ways with a significant impact on the correctness of the computing and responsiveness of cloud services. In this paper, we propose a new data-centric computational model to improve the responsiveness of the data-intensive applications to crash faults and augment its ability to deal with silent errors to ensure computational accuracy. The basic tenet of the proposed model is a task replication scheme, which interweaves the processing of a replicated data split among multiple distributed tasks, with each task consuming data at a different offset. In the absence of a failure, the concurrent execution of the tasks ensures complete processing of the data split, with a significant reduction in the total execution time. In the case of an error, however, the remaining tasks take over the execution of the unfinished work and finish on time. The proposed scheme also guarantees timely detection and correction of silent data corruptions along with crash faults. We demonstrate the effectiveness of our scheme by extending Hadoop's MapReduce code base as a case study. Results show a performance improvement of 50% over Hadoop's Speculative Execution when dealing with crash-faults and an improvement of 33% when dealing with silent errors in case of no failure.
KW - crash faults
KW - data-intensive computing
KW - fault-tolerance
KW - replication
KW - silent errors
UR - http://www.scopus.com/inward/record.url?scp=85032175000&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85032175000&partnerID=8YFLogxK
U2 - 10.1109/CLOUD.2017.74
DO - 10.1109/CLOUD.2017.74
M3 - Conference contribution
AN - SCOPUS:85032175000
T3 - IEEE International Conference on Cloud Computing, CLOUD
SP - 536
EP - 543
BT - Proceedings - 2017 IEEE 10th International Conference on Cloud Computing, CLOUD 2017
A2 - Fox, Geoffrey C.
PB - IEEE Computer Society
T2 - 10th IEEE International Conference on Cloud Computing, CLOUD 2017
Y2 - 25 June 2017 through 30 June 2017
ER -