Preprint Article, Version 1 (this version is not peer-reviewed)

AVaTER: A Multimodal Approach of Recognizing Emotion using Cross-modal Attention Technique

Version 1 : Received: 10 July 2024 / Approved: 11 July 2024 / Online: 11 July 2024 (12:40:02 CEST)

How to cite: Das, A.; Sarma, M. S.; Hoque, M. M.; Siddique, N.; Dewan, M. A. A. AVaTER: A Multimodal Approach of Recognizing Emotion using Cross-modal Attention Technique. Preprints 2024, 2024070917. https://doi.org/10.20944/preprints202407.0917.v1

Abstract

Multimodal emotion classification (MEC) involves analyzing and identifying human emotions by integrating data from multiple sources, such as audio, video, and text. This approach leverages the complementary strengths of each modality to enhance the accuracy and robustness of emotion recognition systems. However, one significant challenge is effectively integrating these diverse data sources, each with its own characteristics and levels of noise. Additionally, the scarcity of large, annotated multimodal datasets in Bangla limits the training and evaluation of models. In this work, we introduce MAViT-Bangla (Multimodal Audio, Video, and Text Bangla dataset), a pioneering multimodal Bangla dataset. Comprising 1002 samples across audio, video, and text modalities, it is a unique resource for emotion recognition studies in the Bangla language. It covers emotional categories such as anger, fear, joy, and sadness, providing a comprehensive platform for research. Additionally, we develop a framework for emotion recognition that employs a cross-modal attention mechanism among unimodal features. This mechanism fosters the interaction and fusion of features from different modalities, enhancing the model's ability to capture nuanced emotional cues. The effectiveness of this approach is demonstrated by an F1-score of 0.64, a significant improvement over unimodal methods.
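To illustrate the cross-modal attention idea described above, the following is a minimal PyTorch sketch of pairwise cross-modal attention fusion over unimodal features. The feature dimensions, number of heads, pooling strategy, class count, and classifier head are illustrative assumptions and do not reflect the exact AVaTER architecture.

```python
# Minimal sketch of cross-modal attention fusion over unimodal features.
# All dimensions and the fusion layout are hypothetical, not the paper's exact design.
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Lets one modality (query) attend to another modality (key/value)."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_feats: torch.Tensor, context_feats: torch.Tensor) -> torch.Tensor:
        # query_feats: (batch, seq_q, dim); context_feats: (batch, seq_kv, dim)
        attended, _ = self.attn(query_feats, context_feats, context_feats)
        return self.norm(query_feats + attended)  # residual connection + layer norm

class TriModalFusion(nn.Module):
    """Fuses audio, video, and text features via cross-modal attention, then classifies."""
    def __init__(self, dim: int = 256, num_classes: int = 4):
        super().__init__()
        self.text_from_audio = CrossModalAttention(dim)
        self.text_from_video = CrossModalAttention(dim)
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, audio: torch.Tensor, video: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # Text queries attend to audio and video contexts; the pooled
        # cross-modal representations are concatenated and classified.
        t_a = self.text_from_audio(text, audio).mean(dim=1)
        t_v = self.text_from_video(text, video).mean(dim=1)
        return self.classifier(torch.cat([t_a, t_v], dim=-1))

# Usage with random unimodal features (4 classes, e.g. anger, fear, joy, sadness)
audio = torch.randn(2, 50, 256)
video = torch.randn(2, 30, 256)
text = torch.randn(2, 20, 256)
logits = TriModalFusion()(audio, video, text)  # shape: (2, 4)
```

In this sketch the textual features serve as queries against the audio and video contexts; other pairings (e.g., audio attending to text) or a symmetric all-pairs scheme are equally plausible design choices for the kind of cross-modal interaction the abstract describes.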

Keywords

Multimodal Emotion Recognition; Natural Language Processing; Multimodal Dataset; Cross-Modal Attention; Transformers

Subject

Computer Science and Mathematics, Artificial Intelligence and Machine Learning
