HuBERT-CLAP: Contrastive Learning-Based Multimodal Emotion Recognition using Self-Alignment Approach

Long H. Nguyen, Nhat Truong Pham, Mustaqeem Khan, Alice Othmani, Abdulmotaleb El Saddik

Research output: Chapter in Book/Report/Conference proceeding - Conference contribution

Abstract

Breakthroughs in deep learning have led to improvements in speech emotion recognition (SER), but most studies process fixed-length segments, which degrades performance. Multimodal approaches that combine audio and text features improve SER but lack modality alignment. In this study, we introduce HuBERT-CLAP, a contrastive language-audio self-alignment pre-training framework for SER that addresses these issues. First, we employ CLIP to train a contrastive self-alignment model, using HuBERT for audio and BERT/DistilBERT for text, to extract discriminative cues from the input sequences and map informative text features onto the audio features. The HuBERT encoder in the pre-trained HuBERT-CLAP is then partially fine-tuned to improve its effectiveness in predicting emotional states. We evaluated our model on the IEMOCAP dataset, where it outperformed the non-pre-trained model, achieving a weighted accuracy of 77.22%. Our source code is publicly available at https://github.com/oggyfaker/HuBERT-CLAP/ for reproducibility.
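The abstract describes a CLIP-style contrastive alignment between HuBERT audio embeddings and BERT/DistilBERT text embeddings. As a rough illustration of that idea (not the authors' implementation), the sketch below computes the symmetric InfoNCE objective used in CLIP-style training on a batch of paired embeddings; the function name, the NumPy formulation, and the temperature value are assumptions for illustration, and the HuBERT/BERT feature extraction itself is not shown.

```python
import numpy as np

def clip_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric CLIP-style InfoNCE loss (hypothetical sketch).

    audio_emb, text_emb: (batch, dim) arrays; row i of each is a matched pair.
    """
    # L2-normalize so dot products become cosine similarities.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    logits = a @ t.T / temperature        # (batch, batch) similarity matrix
    labels = np.arange(len(a))            # matched pairs lie on the diagonal

    def cross_entropy(lg, lb):
        # Numerically stable log-softmax per row, then pick the target column.
        shifted = lg - lg.max(axis=1, keepdims=True)
        log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(lb)), lb].mean()

    # Average the audio-to-text and text-to-audio directions, as in CLIP.
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```

With perfectly aligned, well-separated embeddings the loss approaches zero, while mismatched audio/text pairs drive it up, which is what pushes the two encoders toward a shared space during pre-training.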

Original language: English
Title of host publication: Proceedings of the 6th ACM International Conference on Multimedia in Asia, MMAsia 2024
Publisher: Association for Computing Machinery, Inc
ISBN (Electronic): 9798400712739
DOIs
Publication status: Published - Dec 28, 2024
Externally published: Yes
Event: 6th ACM International Conference on Multimedia in Asia, MMAsia 2024 - Auckland, New Zealand
Duration: Dec 3, 2024 - Dec 6, 2024

Publication series

Name: Proceedings of the 6th ACM International Conference on Multimedia in Asia, MMAsia 2024

Conference

Conference: 6th ACM International Conference on Multimedia in Asia, MMAsia 2024
Country/Territory: New Zealand
City: Auckland
Period: 12/3/24 - 12/6/24

Keywords

  • Affective Computing
  • Contrastive Learning
  • Human-Computer Interaction
  • Partial Fine-Tuning
  • Speech Emotion Recognition

ASJC Scopus subject areas

  • Computer Graphics and Computer-Aided Design
  • Human-Computer Interaction
