TY - GEN
T1 - HuBERT-CLAP
T2 - 6th ACM International Conference on Multimedia in Asia, MMAsia 2024
AU - Nguyen, Long H.
AU - Pham, Nhat Truong
AU - Khan, Mustaqeem
AU - Othmani, Alice
AU - El Saddik, Abdulmotaleb
N1 - Publisher Copyright:
© 2024 Copyright held by the owner/author(s).
PY - 2024/12/28
Y1 - 2024/12/28
N2 - Breakthroughs in deep learning have led to improvements in speech emotion recognition (SER), but these studies tend to process fixed-length segments, resulting in degraded performance. Meanwhile, multimodal approaches that combine audio and text features improve SER but lack modality alignment. In this study, we introduce HuBERT-CLAP, a contrastive language-audio self-alignment pre-training framework for SER that addresses these issues. Initially, we employ CLIP to train a contrastive self-alignment model using HuBERT for audio and BERT/DistilBERT for text, extracting discriminative cues from the input sequences and mapping informative features from text to audio features. Additionally, HuBERT in the pre-trained HuBERT-CLAP undergoes partial fine-tuning to enhance its effectiveness in predicting emotional states. Furthermore, we evaluated our model on the IEMOCAP dataset, where it outperformed the non-pre-trained baseline, achieving a weighted accuracy of 77.22%. Our source code is publicly available at https://github.com/oggyfaker/HuBERT-CLAP/ for reproducibility.
AB - Breakthroughs in deep learning have led to improvements in speech emotion recognition (SER), but these studies tend to process fixed-length segments, resulting in degraded performance. Meanwhile, multimodal approaches that combine audio and text features improve SER but lack modality alignment. In this study, we introduce HuBERT-CLAP, a contrastive language-audio self-alignment pre-training framework for SER that addresses these issues. Initially, we employ CLIP to train a contrastive self-alignment model using HuBERT for audio and BERT/DistilBERT for text, extracting discriminative cues from the input sequences and mapping informative features from text to audio features. Additionally, HuBERT in the pre-trained HuBERT-CLAP undergoes partial fine-tuning to enhance its effectiveness in predicting emotional states. Furthermore, we evaluated our model on the IEMOCAP dataset, where it outperformed the non-pre-trained baseline, achieving a weighted accuracy of 77.22%. Our source code is publicly available at https://github.com/oggyfaker/HuBERT-CLAP/ for reproducibility.
KW - Affective Computing
KW - Contrastive Learning
KW - Human-Computer Interaction
KW - Partial Fine-Tuning
KW - Speech Emotion Recognition
UR - http://www.scopus.com/inward/record.url?scp=85216185627&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85216185627&partnerID=8YFLogxK
U2 - 10.1145/3696409.3700183
DO - 10.1145/3696409.3700183
M3 - Conference contribution
AN - SCOPUS:85216185627
T3 - Proceedings of the 6th ACM International Conference on Multimedia in Asia, MMAsia 2024
BT - Proceedings of the 6th ACM International Conference on Multimedia in Asia, MMAsia 2024
PB - Association for Computing Machinery, Inc
Y2 - 3 December 2024 through 6 December 2024
ER -