Abstract
Speech is the most natural means of communication between human beings and, eventually, a primary modality for Human–Computer Interaction (HCI) to exchange emotions and information. Recognizing emotions from speech signals is challenging due to the sparse nature of emotional data and features. In this article, we propose a Deep Echo State Network (DeepESN) system for speech emotion recognition that combines a dilated convolutional neural network with a multi-headed attention mechanism. To reduce model complexity, we incorporate a DeepESN that uses reservoir computing for higher-dimensional feature mapping. We also apply a fine-tuned Sparse Random Projection (SRP) to reduce dimensionality, adopt an early fusion strategy to combine the extracted cues, and pass the joint feature vector through a classification layer to recognize emotions. The proposed model is evaluated on two public speech corpora, EMO-DB and RAVDESS, in both speaker-dependent and speaker-independent settings. The results show that our system achieves high recognition rates of 91.14% and 85.57% on EMO-DB, and 82.01% and 77.02% on RAVDESS, for speaker-dependent and speaker-independent experiments, respectively. It outperforms the State-of-The-Art (SOTA) while requiring less computational time.
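The abstract does not give implementation details, so the following is only a minimal, illustrative Python sketch of the pipeline it describes: dilated convolutions with multi-headed attention, a random echo-state reservoir, Sparse Random Projection, and early fusion by concatenation before a classification layer. The 64-bin mel input, layer sizes, leaky reservoir update, and 7-class output are assumptions for illustration, not the authors' configuration.

```python
# Hedged sketch of the described pipeline; all sizes and the reservoir
# update rule are illustrative assumptions, not the paper's exact setup.
import numpy as np
import torch
import torch.nn as nn
from sklearn.random_projection import SparseRandomProjection


class DilatedAttentionBranch(nn.Module):
    """Dilated 1-D convolutions followed by multi-headed self-attention."""

    def __init__(self, n_mels=64, channels=128, heads=4):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(n_mels, channels, kernel_size=3, dilation=1, padding=1),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, dilation=2, padding=2),
            nn.ReLU(),
        )
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x):                      # x: (batch, n_mels, frames)
        h = self.convs(x).transpose(1, 2)      # (batch, frames, channels)
        h, _ = self.attn(h, h, h)              # multi-headed self-attention
        return h.mean(dim=1)                   # temporal average pooling


def reservoir_states(x, n_reservoir=500, leak=0.3, seed=0):
    """Leaky echo-state reservoir: maps a (frames, n_mels) sequence to its
    final high-dimensional state. Weights are random and never trained."""
    rng = np.random.default_rng(seed)
    w_in = rng.uniform(-0.5, 0.5, (n_reservoir, x.shape[1]))
    w = rng.uniform(-0.5, 0.5, (n_reservoir, n_reservoir))
    w *= 0.9 / np.max(np.abs(np.linalg.eigvals(w)))   # spectral radius < 1
    state = np.zeros(n_reservoir)
    for frame in x:
        state = (1 - leak) * state + leak * np.tanh(w_in @ frame + w @ state)
    return state


# Toy usage on random "spectrograms" (batch of 8, 64 mel bins, 200 frames).
batch = np.random.randn(8, 64, 200).astype(np.float32)

branch = DilatedAttentionBranch()
conv_feats = branch(torch.from_numpy(batch)).detach().numpy()        # (8, 128)

esn_feats = np.stack([reservoir_states(s.T) for s in batch])         # (8, 500)
esn_feats = SparseRandomProjection(n_components=64,
                                   random_state=0).fit_transform(esn_feats)

fused = np.concatenate([conv_feats, esn_feats], axis=1)   # early fusion
# Untrained linear head, assuming 7 emotion classes (EMO-DB-style labels).
logits = nn.Linear(fused.shape[1], 7)(torch.from_numpy(fused).float())
print(logits.shape)                                        # torch.Size([8, 7])
```

In practice the reservoir and projected features would be concatenated with the attention-pooled convolutional cues exactly as above, and only the classification head (and the convolution/attention branch) would be trained, which is what keeps the model complexity low.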
Original language | English |
---|---|
Article number | 110525 |
Journal | Knowledge-Based Systems |
Volume | 270 |
Publication status | Published - Jun 21 2023 |
Externally published | Yes |
Keywords
- Affective computing
- Attention mechanism
- Audio speech signals
- Convolution neural network
- Echo state networks
- Emotion recognition
- Human–computer interaction
ASJC Scopus subject areas
- Software
- Management Information Systems
- Information Systems and Management
- Artificial Intelligence