TY - GEN
T1 - AtFP
T2 - 13th International Conference on Reliability, Maintainability, and Safety, ICRMS 2022
AU - Li, Longhao
AU - Znati, Taieb
N1 - Funding Information:
This research is based in part upon work supported by the Department of Energy under contract DE-SC0014376, and in part upon work supported by the National Science Foundation under Grant Numbers CNS-1252306 and CNS-1253218. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
Publisher Copyright:
© 2022 IEEE.
PY - 2022
Y1 - 2022
N2 - Extreme-scale computing is paving the way for unparalleled advances in scientific discovery and innovation. However, as systems scale, their propensity to fail increases significantly, making it difficult for long-running applications that span a large number of computing nodes to make forward progress. Achieving high performance in extreme-scale environments while minimizing energy consumption has emerged as a daunting challenge. Significant advances in dealing with failure, both physical and logical, have been achieved, with varying degrees of success. Effective fault tolerance relies heavily on the ability of the scheme to predict failure accurately. Various approaches, including intelligent methods, have been proposed to predict failures. In this paper, we propose an attention-based failure predictor (AtFP), which automatically extracts representative features from raw event log data to predict failure. The results show that, using the same input and output layers, AtFP outperforms frequently used LSTM methods. The proposed model improves the F1 score by 39% and reduces the training time by 65%.
AB - Extreme-scale computing is paving the way for unparalleled advances in scientific discovery and innovation. However, as systems scale, their propensity to fail increases significantly, making it difficult for long-running applications that span a large number of computing nodes to make forward progress. Achieving high performance in extreme-scale environments while minimizing energy consumption has emerged as a daunting challenge. Significant advances in dealing with failure, both physical and logical, have been achieved, with varying degrees of success. Effective fault tolerance relies heavily on the ability of the scheme to predict failure accurately. Various approaches, including intelligent methods, have been proposed to predict failures. In this paper, we propose an attention-based failure predictor (AtFP), which automatically extracts representative features from raw event log data to predict failure. The results show that, using the same input and output layers, AtFP outperforms frequently used LSTM methods. The proposed model improves the F1 score by 39% and reduces the training time by 65%.
KW - encoder-decoder model
KW - event log
KW - failure prediction
KW - fault tolerance
KW - Transformer model
UR - http://www.scopus.com/inward/record.url?scp=85143087651&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85143087651&partnerID=8YFLogxK
U2 - 10.1109/ICRMS55680.2022.9944604
DO - 10.1109/ICRMS55680.2022.9944604
M3 - Conference contribution
AN - SCOPUS:85143087651
T3 - 13th International Conference on Reliability, Maintainability, and Safety: Reliability and Safety of Intelligent Systems, ICRMS 2022
SP - 23
EP - 27
BT - 13th International Conference on Reliability, Maintainability, and Safety
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 21 August 2022 through 24 August 2022
ER -