TY - GEN
T1 - An efficient fault-tolerant algorithm for distributed cloud services
AU - Al-Jaroodi, Jameela
AU - Mohamed, Nader
AU - Al Nuaimi, Klaithem
PY - 2012
Y1 - 2012
N2 - Several approaches to fault tolerance in distributed systems have been introduced; however, they require prior knowledge of the environment's operating conditions and/or constant monitoring of these conditions at run time. This allows applications to adjust the load and redistribute tasks when failures occur. These techniques work well when communication delays are low. This is not the case in the Cloud, where data and computation servers are connected over the Internet and distributed across large geographic areas. Such servers usually exhibit high and dynamic communication delays, which make discovering and recovering from failures take a long time. This paper proposes a delay-tolerant fault-tolerance algorithm that effectively reduces execution time and adapts to failures while minimizing the fault discovery and recovery overhead in the Cloud. Distributed tasks that can use this algorithm include downloading data from replicated servers and executing parallel applications on multiple independent distributed servers in the Cloud. The experimental results show the efficiency of the algorithm and its fault-tolerance feature.
AB - Several approaches to fault tolerance in distributed systems have been introduced; however, they require prior knowledge of the environment's operating conditions and/or constant monitoring of these conditions at run time. This allows applications to adjust the load and redistribute tasks when failures occur. These techniques work well when communication delays are low. This is not the case in the Cloud, where data and computation servers are connected over the Internet and distributed across large geographic areas. Such servers usually exhibit high and dynamic communication delays, which make discovering and recovering from failures take a long time. This paper proposes a delay-tolerant fault-tolerance algorithm that effectively reduces execution time and adapts to failures while minimizing the fault discovery and recovery overhead in the Cloud. Distributed tasks that can use this algorithm include downloading data from replicated servers and executing parallel applications on multiple independent distributed servers in the Cloud. The experimental results show the efficiency of the algorithm and its fault-tolerance feature.
KW - Cloud computing
KW - Fault-tolerance
KW - Heterogeneous distributed systems
KW - Load balancing
UR - http://www.scopus.com/inward/record.url?scp=84875591272&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84875591272&partnerID=8YFLogxK
U2 - 10.1109/NCCA.2012.21
DO - 10.1109/NCCA.2012.21
M3 - Conference contribution
AN - SCOPUS:84875591272
SN - 9780769549439
T3 - Proceedings - IEEE 2nd Symposium on Network Cloud Computing and Applications, NCCA 2012
SP - 1
EP - 8
BT - Proceedings - IEEE 2nd Symposium on Network Cloud Computing and Applications, NCCA 2012
T2 - 2012 IEEE 2nd Symposium on Network Cloud Computing and Applications, NCCA 2012
Y2 - 3 December 2012 through 4 December 2012
ER -