Wang, Y., Haiyan Lan, and Xinjia Zhang. 2024. "Speech Emotion Recognition Using Multiscale Global-Local Representation Learning with Feature Pyramid Network." Preprints. https://doi.org/10.20944/preprints202410.1002.v1
Abstract
Speech emotion recognition (SER) is important for facilitating natural human-computer interaction. In speech sequence modeling, a key challenge is to learn context-aware sentence expression and the temporal dynamics of paralinguistic features in order to achieve an unambiguous understanding of emotional semantics. In previous studies, SER methods based on single-scale cascaded feature extraction modules could not effectively preserve the temporal structure of speech signals in deep layers, which degraded sequence modeling performance. In this paper, we propose a novel multi-scale feature pyramid network to mitigate these limitations. Through the bi-directional feature fusion of the pyramid network, an emotional representation with adequate temporal semantics is obtained. Experiments on the IEMOCAP corpus demonstrate the effectiveness of the proposed methods, which achieve competitive results under speaker-independent validation.
Computer Science and Mathematics, Artificial Intelligence and Machine Learning
Copyright:
This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.