TY - GEN
T1 - Multilevel Feature Representation for Hybrid Transformers-based Emotion Recognition
AU - Swain, Monorama
AU - Maji, Bubai
AU - Khan, Mustaqeem
AU - El Saddik, Abdulmotaleb
AU - Gueaieb, Wail
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
N2 - Automated Speech Emotion Recognition (SER) systems and human-computer interaction systems both rely heavily on emotion. Global and temporal representations of utterances are crucial to the effectiveness of an SER module. Research demonstrates that the temporal information captured by a transformer can significantly improve an SER system's overall recognition rate. Although hybrid models outperform conventional classifiers, all existing hybrid models have limitations: in particular, the relationship between different speech cues and the learning of high-level global and temporal cues with a transformer has not been studied thoroughly. This research therefore proposes an efficient transformer-based hybrid technique for emotion recognition via multilevel feature representation of speech signals. To learn deeper information from global and temporal representations, the proposed method comprises a parallel convolutional encoder, a spatial encoder, and a sequential encoder. The learned cues then pass through the proposed transformer to capture the information salient to a specific emotion in the input sequence. To verify its effectiveness, we evaluated the proposed approach and achieved state-of-the-art (SOTA) results: 75.29% and 88.18% weighted accuracy, and 76.34% and 88.49% unweighted accuracy, on the IEMOCAP and SITB-OSED corpora, respectively.
AB - Automated Speech Emotion Recognition (SER) systems and human-computer interaction systems both rely heavily on emotion. Global and temporal representations of utterances are crucial to the effectiveness of an SER module. Research demonstrates that the temporal information captured by a transformer can significantly improve an SER system's overall recognition rate. Although hybrid models outperform conventional classifiers, all existing hybrid models have limitations: in particular, the relationship between different speech cues and the learning of high-level global and temporal cues with a transformer has not been studied thoroughly. This research therefore proposes an efficient transformer-based hybrid technique for emotion recognition via multilevel feature representation of speech signals. To learn deeper information from global and temporal representations, the proposed method comprises a parallel convolutional encoder, a spatial encoder, and a sequential encoder. The learned cues then pass through the proposed transformer to capture the information salient to a specific emotion in the input sequence. To verify its effectiveness, we evaluated the proposed approach and achieved state-of-the-art (SOTA) results: 75.29% and 88.18% weighted accuracy, and 76.34% and 88.49% unweighted accuracy, on the IEMOCAP and SITB-OSED corpora, respectively.
KW - Emotion Recognition
KW - Human-Computer Interaction
KW - Hybrid Transformer
KW - Multilevel Feature Representation
KW - Speech Signal
UR - http://www.scopus.com/inward/record.url?scp=85165435559&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85165435559&partnerID=8YFLogxK
U2 - 10.1109/BioSMART58455.2023.10162089
DO - 10.1109/BioSMART58455.2023.10162089
M3 - Conference contribution
AN - SCOPUS:85165435559
T3 - BioSMART 2023 - Proceedings: 5th International Conference on Bio-Engineering for Smart Technologies
BT - BioSMART 2023 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 5th International Conference on Bio-engineering for Smart Technologies, BioSMART 2023
Y2 - 7 June 2023 through 9 June 2023
ER -