Joint Multi-Scale Multimodal Transformer for Emotion Using Consumer Devices

Mustaqeem Khan, Jamil Ahmad, Wail Gueaieb, Giulia De Masi, Fakhri Karray, Abdulmotaleb El Saddik

Research output: Contribution to journal › Article › peer-review

Abstract

Multimodal Emotion Recognition (MER) has advanced considerably in recent years; however, the synergistic relationships between modalities remain largely untapped. This paper introduces an MER approach based on a Joint Multi-Scale Multimodal Transformer (JMMT) with recursive cross-attention, which recognizes naturalistic emotions by capturing both inter- and intra-modal relationships across the visual and audio modalities. Multi-scale attention weights are computed from cross-correlations between multi-scale joint representations of the combined and individual cues, capturing inter- and intra-modal dynamics. The attended features of each individual modality are then fed back recursively during fusion for further refinement. By capturing these synergistic characteristics across visual and audio inputs, JMMT offers a cost-effective solution for consumer devices. Evaluated on the IEMOCAP and MELD datasets, JMMT outperforms state-of-the-art (SOTA) MER methods.
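To make the fusion idea concrete, below is a minimal PyTorch sketch of recursive cross-attention between audio and visual streams, in the spirit of the abstract. All module names, dimensions, the number of recursive passes, and the concatenation-based joint representation are illustrative assumptions rather than the authors' implementation; the paper's multi-scale cross-correlation weighting is omitted for brevity.

```python
# Hypothetical sketch of recursive cross-attention fusion in the spirit of JMMT.
# Module names, dimensions, and the joint representation are assumptions,
# not the authors' code.
import torch
import torch.nn as nn


class RecursiveCrossAttentionFusion(nn.Module):
    """Attends each modality to a joint audio-visual representation, then
    feeds the refined features back for further recursive passes."""

    def __init__(self, dim: int = 256, heads: int = 4, steps: int = 2):
        super().__init__()
        self.steps = steps  # number of recursive refinement passes (assumed)
        self.audio_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.visual_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # audio, visual: (batch, seq_len, dim)
        for _ in range(self.steps):
            # Joint representation of the combined cues (simple concatenation
            # along the sequence axis, as an illustrative stand-in).
            joint = torch.cat([audio, visual], dim=1)
            # Each modality queries the joint representation (cross-attention);
            # the refined output is fed back into the next pass (recursion).
            a, _ = self.audio_attn(audio, joint, joint)
            v, _ = self.visual_attn(visual, joint, joint)
            audio = self.norm_a(audio + a)
            visual = self.norm_v(visual + v)
        # Pool and concatenate the refined modality features for a classifier head.
        return torch.cat([audio.mean(dim=1), visual.mean(dim=1)], dim=-1)


if __name__ == "__main__":
    fusion = RecursiveCrossAttentionFusion(dim=256)
    a = torch.randn(2, 50, 256)   # e.g. audio frame embeddings
    v = torch.randn(2, 30, 256)   # e.g. visual frame embeddings
    print(fusion(a, v).shape)     # torch.Size([2, 512])
```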

Original language: English
Journal: IEEE Transactions on Consumer Electronics
Publication status: Accepted/In press - 2025

Keywords

  • Emotion recognition
  • Joint feature representation
  • Multi-scale transformer
  • Multimodal fusion
  • Recursive attention

ASJC Scopus subject areas

  • Media Technology
  • Electrical and Electronic Engineering

