Abstract
Extreme-scale computing is paving the way for unparalleled advances in scientific discovery and innovation. However, as systems scale, their propensity to fail increases significantly, making it difficult for long-running applications that span a large number of computing nodes to make forward progress. Achieving high performance in extreme-scale environments, while minimizing energy consumption, has emerged as a daunting challenge. Significant advances in dealing with failures, both physical and logical, have been made, with varying degrees of success. Effective fault tolerance relies heavily on the ability of the scheme to predict failures accurately. Various approaches, including intelligent methods, have been proposed to predict failures. In this paper, we propose an attention-based failure predictor (AtFP), which automatically extracts representative features from raw event log data to predict failures. The results show that, using the same input and output layers, AtFP outperforms frequently used LSTM methods. The proposed model improves the F1 score by 39% and reduces the training time by 65%.
| Original language | English |
|---|---|
| Title of host publication | 13th International Conference on Reliability, Maintainability, and Safety |
| Subtitle of host publication | Reliability and Safety of Intelligent Systems, ICRMS 2022 |
| Publisher | Institute of Electrical and Electronics Engineers Inc. |
| Pages | 23-27 |
| Number of pages | 5 |
| ISBN (Electronic) | 9781665486903 |
| DOIs | |
| Publication status | Published - 2022 |
| Event | 13th International Conference on Reliability, Maintainability, and Safety, ICRMS 2022 - Hong Kong, China. Duration: Aug 21 2022 → Aug 24 2022 |
Publication series
| Name | 13th International Conference on Reliability, Maintainability, and Safety: Reliability and Safety of Intelligent Systems, ICRMS 2022 |
|---|---|
Conference
| Conference | 13th International Conference on Reliability, Maintainability, and Safety, ICRMS 2022 |
|---|---|
| Country/Territory | China |
| City | Hong Kong |
| Period | 8/21/22 → 8/24/22 |
UN SDGs
This output contributes to the following UN Sustainable Development Goals (SDGs)
- SDG 7 Affordable and Clean Energy
Keywords
- Transformer model
- encoder-decoder model
- event log
- failure prediction
- fault tolerance
ASJC Scopus subject areas
- Artificial Intelligence
- Computer Science Applications
- Energy Engineering and Power Technology
- Electrical and Electronic Engineering
- Safety, Risk, Reliability and Quality
Fingerprint
Dive into the research topics of 'AtFP: Attention-based Failure Predictor for Extreme-scale Computing'. Together they form a unique fingerprint.