Abstract
The field of Multimodal Emotion Recognition (MER) has made considerable advancements in recent years; however, the synergistic relationships between different modalities remain largely untapped. This paper introduces an MER approach employing a Joint Multi-Scale Multimodal Transformer (JMMT) with recursive cross-attention for recognizing emotions in naturalistic settings by capturing and enhancing inter- and intra-modal relationships across the visual and audio modalities. We compute multi-scale attention weights from the cross-correlations between multi-scale joint representations of the combined and individual cues to capture inter- and intra-modal dynamics. The attended features of the individual modalities are then recursively fed back as inputs during fusion for further refinement. By capturing these synergistic characteristics across visual and audio inputs, the JMMT model offers a cost-effective solution for consumer devices. JMMT outperforms state-of-the-art (SOTA) MER methods, as evaluated on the IEMOCAP and MELD datasets.
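The abstract does not give implementation details, but a minimal PyTorch sketch of the recursive cross-attention fusion it describes might look as follows. The projection layers, feature dimensions, number of recursion steps, and residual feedback are all assumptions for illustration; the multi-scale branch of the full JMMT architecture is omitted for brevity.

```python
import torch
import torch.nn as nn


class RecursiveCrossAttentionFusion(nn.Module):
    """Hypothetical sketch of recursive audio-visual cross-attention fusion.

    A joint representation is formed from the concatenated audio and visual
    features; attention weights are computed from the cross-correlation
    between the joint representation and each individual modality, and the
    attended features are recursively fed back as inputs for refinement.
    """

    def __init__(self, dim: int = 128, steps: int = 2):
        super().__init__()
        self.steps = steps  # number of recursive refinement passes (assumed)
        # Learnable projections used to form the cross-correlations (assumed design).
        self.proj_joint = nn.Linear(2 * dim, dim)
        self.proj_audio = nn.Linear(dim, dim)
        self.proj_visual = nn.Linear(dim, dim)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor):
        # audio, visual: (batch, seq_len, dim)
        for _ in range(self.steps):
            # Joint representation of the combined audio-visual cues.
            joint = self.proj_joint(torch.cat([audio, visual], dim=-1))
            # Cross-correlation between the joint and per-modality features,
            # normalized into attention weights over the sequence.
            attn_a = torch.softmax(joint @ self.proj_audio(audio).transpose(1, 2), dim=-1)
            attn_v = torch.softmax(joint @ self.proj_visual(visual).transpose(1, 2), dim=-1)
            # Attended features are fed back as the next iteration's inputs.
            audio = audio + attn_a @ audio
            visual = visual + attn_v @ visual
        return audio, visual


# Usage on dummy features: batch of 4 clips, 50 time steps, 128-d features.
fusion = RecursiveCrossAttentionFusion(dim=128, steps=2)
a_ref, v_ref = fusion(torch.randn(4, 50, 128), torch.randn(4, 50, 128))
```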
| Original language | English |
|---|---|
| Journal | IEEE Transactions on Consumer Electronics |
| DOIs | |
| Publication status | Accepted/In press - 2025 |
Keywords
- Emotion recognition
- Joint feature representation
- Multi-scale transformer
- Multimodal fusion
- Recursive attention
ASJC Scopus subject areas
- Media Technology
- Electrical and Electronic Engineering