TY - JOUR
T1 - Bidirectional parallel echo state network for speech emotion recognition
AU - Ibrahim, Hemin
AU - Loo, Chu Kiong
AU - Alnajjar, Fady
N1 - Funding Information:
This work was supported in part by the COVID-19 Special Research Grant under Project CSRG008-2020ST, Impact Oriented Interdisciplinary Research Grant Programme (IIRG), IIRG002C-19HWB from University of Malaya, and the AUA-UAEU Joint Research Grant 31R188.
Publisher Copyright:
© 2022, The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature.
PY - 2022/10
Y1 - 2022/10
AB - Speech is an effective way for humans to communicate and exchange complex information. Speech signals have attracted considerable attention in human-computer interaction, and emotion recognition from speech has therefore become an active research topic in human-machine interaction. In this paper, we propose a novel speech emotion recognition (SER) system that uses a multivariate time-series representation of handcrafted features extracted from speech signals. A bidirectional echo state network (ESN) with two parallel reservoir layers is applied to capture additional independent information. The parallel reservoirs produce multiple representations for each direction of the bidirectional data, with two stages of concatenation. Sparse random projection is applied separately to each direction to reduce the high-dimensional sparse output of both reservoirs. Random over-sampling and random under-sampling are used to address the class imbalance in the speech emotion datasets. The proposed parallel ESN model is evaluated in speaker-independent experiments on the EMO-DB, SAVEE, RAVDESS, and FAU Aibo datasets. The results show that the proposed SER model outperforms both a single-reservoir model and state-of-the-art approaches.
KW - Random resampling
KW - Recurrent neural network
KW - Reservoir computing
KW - Speech emotion recognition
UR - http://www.scopus.com/inward/record.url?scp=85131091494&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85131091494&partnerID=8YFLogxK
U2 - 10.1007/s00521-022-07410-2
DO - 10.1007/s00521-022-07410-2
M3 - Article
AN - SCOPUS:85131091494
SN - 0941-0643
VL - 34
SP - 17581
EP - 17599
JO - Neural Computing and Applications
JF - Neural Computing and Applications
IS - 20
ER -