TY - GEN
T1 - Grouped Echo State Network with Late Fusion for Speech Emotion Recognition
AU - Ibrahim, Hemin
AU - Loo, Chu Kiong
AU - Alnajjar, Fady
N1 - Funding Information:
This work was supported by the Covid-19 Special Research Grant under Project CSRG008-2020ST, the Impact Oriented Interdisciplinary Research Grant Programme (IIRG), IIRG002C-19HWB, from the University of Malaya, and the AUA-UAEU Joint Research Grant 31R188.
Publisher Copyright:
© 2021, Springer Nature Switzerland AG.
PY - 2021
Y1 - 2021
N2 - Speech Emotion Recognition (SER) has become a popular research topic because of its significant role in many practical applications, and it is considered a key effort in Human-Computer Interaction (HCI). Previous work in this field has mostly focused on global features or on time series feature representations with deep learning models. The main focus of this work, in contrast, is to design a simple model for SER by adopting a multivariate time series feature representation. The model uses an Echo State Network (ESN), a special case of the Recurrent Neural Network (RNN), with parallel reservoir layers, and applies Principal Component Analysis (PCA) to reduce the high-dimensional output of the reservoir layers. Late grouped fusion is applied to capture additional information independently from the two reservoirs. Additionally, the hyperparameters are optimized using a Bayesian approach. The high performance of the proposed SER model is demonstrated in speaker-independent experiments on the SAVEE dataset and the FAU Aibo Emotion Corpus. The experimental results show that the designed model outperforms the state-of-the-art results.
KW - Grouped echo state network
KW - Recurrent neural network
KW - Reservoir computing
KW - Speech emotion recognition
KW - Time series classification
UR - http://www.scopus.com/inward/record.url?scp=85121925231&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85121925231&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-92238-2_36
DO - 10.1007/978-3-030-92238-2_36
M3 - Conference contribution
AN - SCOPUS:85121925231
SN - 9783030922375
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 431
EP - 442
BT - Neural Information Processing - 28th International Conference, ICONIP 2021, Proceedings
A2 - Mantoro, Teddy
A2 - Lee, Minho
A2 - Ayu, Media Anugerah
A2 - Wong, Kok Wai
A2 - Hidayanto, Achmad Nizar
PB - Springer Science and Business Media Deutschland GmbH
T2 - 28th International Conference on Neural Information Processing, ICONIP 2021
Y2 - 8 December 2021 through 12 December 2021
ER -