AtFP: Attention-based Failure Predictor for Extreme-scale Computing

Longhao Li, Taieb Znati

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Extreme-scale computing is paving the way for unparalleled advances in scientific discovery and innovation. However, as systems scale, their propensity to fail increases significantly, making it difficult for long-running applications that span a large number of computing nodes to make forward progress. Achieving high performance in extreme-scale environments, while minimizing energy consumption, has emerged as a daunting challenge. Significant advances in dealing with failures, both physical and logical, have been achieved, with varying degrees of success. A key component of fault tolerance is the ability to predict failure accurately. Various approaches, including intelligent methods, have been proposed to predict failures. In this paper, we propose an attention-based failure predictor (AtFP), which automatically extracts representative features from raw event log data to predict failure. The results show that, using the same input and output layers, AtFP outperforms frequently used LSTM methods. The proposed model improves the F1 score by 39% and reduces the training time by 65%.

Original language: English
Title of host publication: 13th International Conference on Reliability, Maintainability, and Safety
Subtitle of host publication: Reliability and Safety of Intelligent Systems, ICRMS 2022
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 23-27
Number of pages: 5
ISBN (Electronic): 9781665486903
Publication status: Published - 2022
Event: 13th International Conference on Reliability, Maintainability, and Safety, ICRMS 2022 - Hong Kong, China
Duration: Aug 21 2022 to Aug 24 2022

Publication series

Name: 13th International Conference on Reliability, Maintainability, and Safety: Reliability and Safety of Intelligent Systems, ICRMS 2022

Conference

Conference: 13th International Conference on Reliability, Maintainability, and Safety, ICRMS 2022
Country/Territory: China
City: Hong Kong
Period: 8/21/22 to 8/24/22

Keywords

  • Transformer model
  • encoder-decoder model
  • event log
  • failure prediction
  • fault tolerance

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computer Science Applications
  • Energy Engineering and Power Technology
  • Electrical and Electronic Engineering
  • Safety, Risk, Reliability and Quality
