TY - GEN
T1 - Enhancing Reliability-Aware Speedup Modelling via Replication
AU - Hussain, Zaeem
AU - Znati, Taieb
AU - Melhem, Rami
N1 - Publisher Copyright: © 2020 IEEE.
PY - 2020/6
Y1 - 2020/6
AB - Reliability-aware speedup models study the expected speedup of a parallel application as a function of the number of processors, on a platform susceptible to processor failures. Existing works in this area have developed models using checkpoint-restart (without replication) as the only fault tolerance mechanism, and have studied the upper bound on the number of processors beyond which the application speedup starts to degrade due to increasing likelihood of failure. In this work, we develop speedup models in which replication, specifically dual replication, is also employed for resilience. We demonstrate that the upper bound on the number of processors to execute a perfectly parallel application using dual replication is of the order λ^-2, where λ is the individual processor failure rate. We also compare the dual replication model with that of no-replication. Specifically, we found that, given the same hardware resources, replication starts offering better speedup just before the upper bound on the number of processors for no-replication is reached. Taken together, our results indicate that replication can significantly enhance reliability-aware speedup models by i) pushing the number of processors that yield the optimal speedup to a much higher value than what is possible without replication, and ii) improving on the optimal speedup possible through checkpoint-restart alone.
KW - fault tolerance
KW - hpc
KW - modelling
KW - parallel
KW - reliability
KW - speedup
UR - http://www.scopus.com/inward/record.url?scp=85090425162&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85090425162&partnerID=8YFLogxK
U2 - 10.1109/DSN48063.2020.00065
DO - 10.1109/DSN48063.2020.00065
M3 - Conference contribution
AN - SCOPUS:85090425162
T3 - Proceedings - 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2020
SP - 528
EP - 539
BT - Proceedings - 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2020
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2020
Y2 - 29 June 2020 through 2 July 2020
ER -