1. Introduction
The Internet of Everything (IoE) presents a plethora of opportunities for human-robot interaction. Speech is the most natural and convenient communication mode between humans and robots. Emotion information from speech can effectively help robots understand the speaker's intentions in natural human-robot interaction. Therefore, speech emotion recognition (SER) holds immense potential for diverse applications in human-robot interaction, including but not limited to intelligent driving, service robotics, online education, telemedicine, and criminal investigations [1].
The extraction of emotional features is one of the key technologies in SER. Commonly used emotional features mainly include hand-crafted low-level descriptors (LLDs) and their high-level statistical features (HSFs) [2], Mel-filterbank features [3], and spectrograms [4, 5]. However, researchers have not yet identified the best speech features for SER and continue to explore effective features that can represent emotional states [6]. Humans can easily perceive emotional information and its changes through the auditory system. Sound reaches the auditory cortex after passing through several stages of auditory signal processing, and the cortex then perceives differences in intensity and tone to produce varying psychological responses. Therefore, identifying emotions from the perspective of auditory perception can be an effective approach. However, the human auditory system is complex and its signal processing mechanisms remain unclear. Researchers have built functional models that simulate characteristics of the auditory system, such as models of the cochlear basilar membrane, the inner hair cells, nerve conduction, and the auditory center. These models are mainly applied to cochlear implants, hearing aids, sound source localization, and speech enhancement [7], but there are still few studies on auditory perception and understanding. Psychoacoustic research reveals that speech signals are decomposed into spectral-temporal components in the cochlea and undergo spectral-temporal modulation along the auditory pathway, generating a modulation spectrum [8]. This modulation spectrum plays an essential role in speech perception and understanding [9, 10]. Several studies have applied statistical functions to the modulation spectrum to obtain modulation spectral features (MSFs) for SER tasks [11]. Avila et al. [12] proposed a feature pooling scheme for dimensional emotion recognition that combines MSFs with a 3D spectrum representation; they extracted the amplitude envelopes of a Gammatone auditory filterbank and applied the discrete Fourier transform (DFT) to obtain a spectral-temporal modulation representation. However, this method uses the DFT to convert the envelope signal into the frequency domain before temporal modulation, which increases computational complexity. Peng et al. [13] proposed the modulation-filtered cochleagram (MCG) feature to extract high-level auditory representations for dimensional emotion recognition. Their experiments showed excellent performance on arousal and valence prediction, but the effectiveness of this feature for categorical emotion recognition requires further investigation.
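To make the envelope-based modulation representation concrete, the following minimal Python sketch decomposes a waveform into acoustic bands, takes the Hilbert amplitude envelope of each band, and applies a framewise DFT to the envelopes. It assumes a Butterworth band-pass bank as a stand-in for the Gammatone filterbank used in the cited work, and the band edges and frame settings are illustrative only.

```python
# Minimal sketch of an envelope-based modulation-spectrum front end.
# Assumption: a Butterworth band-pass bank replaces the Gammatone cochlear
# filterbank of the cited work; band edges and frame settings are illustrative.
import numpy as np
from scipy.signal import butter, sosfilt, hilbert

def acoustic_band_envelopes(x, sr, band_edges):
    """Split x into acoustic bands and return their amplitude envelopes."""
    envelopes = []
    for lo, hi in band_edges:
        sos = butter(4, [lo, hi], btype="bandpass", fs=sr, output="sos")
        band = sosfilt(sos, x)
        envelopes.append(np.abs(hilbert(band)))   # Hilbert amplitude envelope
    return np.stack(envelopes)                    # (n_bands, n_samples)

def modulation_spectrum(envelopes, sr, frame_len=0.256, hop=0.064):
    """Framewise DFT of each band envelope -> (frames, acoustic band, mod bin)."""
    n, h = int(frame_len * sr), int(hop * sr)
    frames = []
    for start in range(0, envelopes.shape[1] - n + 1, h):
        seg = envelopes[:, start:start + n]
        # magnitude spectrum of the envelope = temporal modulation content
        frames.append(np.abs(np.fft.rfft(seg, axis=1)))
    return np.stack(frames)

# Example with synthetic audio; real use would load a speech waveform.
sr = 16000
x = np.random.randn(sr * 2).astype(np.float32)
edges = [(100, 300), (300, 700), (700, 1500), (1500, 3000), (3000, 6000)]
env = acoustic_band_envelopes(x, sr, edges)
ms = modulation_spectrum(env, sr)
print(ms.shape)   # (n_frames, n_acoustic_bands, n_modulation_bins)
```

The resulting three-dimensional array (frames by acoustic bands by modulation bins) is the kind of spectral-temporal modulation representation to which statistical functionals or modulation filtering are then applied.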
In order to extract high-level feature representations from speech features, deep learning methods such as convolutional neural networks (CNN), recurrent neural networks (RNN), and the Transformer are mainly used for the SER task. CNN is often used to extract high-level speech feature representations because of its scale and rotation invariance [14]. RNN is often used to capture long-term dependencies in speech sequences [15, 16]. Recently, attention mechanisms have been incorporated into deep learning methods to automatically capture salient emotion features in speech sequences. Neumann et al. [3] proposed an attentive CNN (ACNN) based on the attention model to identify emotions from log-Mel filterbank features. Mirsamadi et al. [17] introduced an attentive RNN (ARNN) model that recognizes emotions from frame-level LLDs, using local attention as a weighted pooling method. Peng et al. [18] proposed an attention-based sliding recurrent neural network (ASRNN) that effectively models auditory representation sequences by mimicking auditory attention to capture salient emotion regions. In addition, the Transformer employs a self-attention mechanism within an encoder-decoder architecture to track context relations in sequence data. Chen et al. [7] introduced a Key-Sparse Transformer that dynamically judges the importance of each frame in the speech signal, helping the model attend as much as possible to emotion-related segments.
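As a concrete illustration of attention used as weighted pooling over frame-level features, in the spirit of the attentive-RNN approaches above, the following PyTorch sketch scores each frame of a BiGRU encoding and pools frames by their attention weights. The layer sizes, the scalar scoring layer, and the four-class output are assumptions, not the cited configurations.

```python
# Hedged sketch of attention-based weighted pooling over frame-level features;
# dimensions and the single-vector scoring layer are illustrative assumptions.
import torch
import torch.nn as nn

class AttentivePooling(nn.Module):
    def __init__(self, feat_dim, hidden_dim=128, n_classes=4):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden_dim, 1)      # scalar score per frame
        self.classifier = nn.Linear(2 * hidden_dim, n_classes)

    def forward(self, x):                  # x: (batch, frames, feat_dim)
        h, _ = self.rnn(x)                 # (batch, frames, 2*hidden)
        scores = self.attn(h).squeeze(-1)  # (batch, frames)
        alpha = torch.softmax(scores, dim=1)
        # weighted pooling: frames with higher attention contribute more
        utt = torch.sum(alpha.unsqueeze(-1) * h, dim=1)
        return self.classifier(utt), alpha

model = AttentivePooling(feat_dim=40)
logits, weights = model(torch.randn(2, 300, 40))   # e.g. 300 frames of 40-d LLDs
print(logits.shape, weights.shape)
```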
Some novel attention models, such as channel attention and spatial attention, have been proposed for image recognition and behavior detection. Channel attention is used to obtain the importance of different channels, as in SE-Net [19], SK-Net [20], and ECA-Net [21]. Spatial attention transforms the feature map into another space through a spatial conversion module while retaining key information, as in A2-Net [22], DANet [23], and the convolutional block attention module (CBAM) [24]. In addition, some studies have constructed multi-level attention models from different dimensions. Ma et al. [25] proposed TripleNet, which uses a hierarchical representation module to construct representations of context, reply, and query in multi-turn dialogue, applying a triple attention mechanism to update these representations. Liu et al. [26] proposed TANet for object detection, which jointly considers triple attention over channels, points, and voxels. Jiang et al. [27] proposed a convolutional-recurrent neural network with multiple attention mechanisms for SER; this method employs a multiple attention layer to calculate weights for different frames and features, and a self-attention layer to calculate weights from Mel-spectrum features. Liu et al. [28] proposed a novel multi-level attention network containing a multiscale low-level feature extractor and a multi-unit attention module for SER. Zou et al. [29] proposed an end-to-end speech emotion recognition system that uses multi-level acoustic information with a newly designed co-attention module. These methods use multiple attention models to extract different channel and spatial attention maps from LLDs, spectrograms, and waveforms, and then fuse these attention maps to recognize emotions, without capturing the significant emotional regions of speech sequences through temporal attention. To address this issue and to investigate the effectiveness of MCG features in discrete emotion recognition, this paper proposes a categorical emotion recognition method that employs a multi-level attention network to extract salient information from modulation-filtered cochleagram features. First, a 3D-CNN is used to extract a high-level auditory feature representation from the modulation-filtered cochleagram. Then, a channel-level attention module captures dependencies among channels of the 3D convolutional feature map, and a spatial-level attention module captures dependencies within the spectral-temporal spatial structure of the feature representation. Finally, a temporal-level attention module captures significant emotional regions from the concatenated sequence of the channel and spatial attention maps.
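The following PyTorch sketch outlines this multi-level attention flow under stated assumptions: an SE-style block stands in for the channel-level attention, a CBAM-style map for the spatial-level attention, and a simple frame-scoring layer for the temporal-level attention applied to the concatenated attention maps. All dimensions and layer choices are illustrative rather than the exact configuration used in this paper.

```python
# Rough sketch of a channel -> spatial -> temporal attention pipeline over
# CNN feature maps; assumptions: SE-style channel attention, CBAM-style
# spatial attention, and a scalar frame-scoring temporal attention.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):          # squeeze-and-excitation style
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                   # x: (B, C, T, F)
        w = self.fc(x.mean(dim=(2, 3)))     # global average pool per channel
        return x * w.unsqueeze(-1).unsqueeze(-1)

class SpatialAttention(nn.Module):          # CBAM-style spectral-temporal map
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):                   # x: (B, C, T, F)
        s = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(s))

class TemporalAttention(nn.Module):         # salient-region pooling over time
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, seq):                 # seq: (B, T, D)
        alpha = torch.softmax(self.score(seq).squeeze(-1), dim=1)
        return (alpha.unsqueeze(-1) * seq).sum(dim=1)

# Toy forward pass: a convolutional feature map (batch, channels, time, freq)
# passes through channel and spatial attention; the two attended maps are
# concatenated, flattened per frame, and pooled by temporal attention.
x = torch.randn(2, 32, 100, 64)
ca = ChannelAttention(32)(x)                # channel-attended map
sa = SpatialAttention()(x)                  # spatially-attended map
fused = torch.cat([ca, sa], dim=1)          # (B, 2C, T, F)
seq = fused.permute(0, 2, 1, 3).flatten(2)  # frame sequence (B, T, 2C*F)
utt = TemporalAttention(seq.shape[-1])(seq) # utterance-level embedding
print(utt.shape)
```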
The major contributions of this study are as follows:
Under the same convolutional recurrent neural network, MCG features perform better than the other evaluated features in categorical emotion recognition.
A multi-level attention network is proposed, in which the channel-level and spatial-level attention modules obtain fused features from the MCG features, and a temporal-level attention module further captures significant emotional regions from the fused feature sequence, thereby improving emotion recognition performance.
The proposed method is evaluated on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database and achieves an unweighted accuracy of 71%, demonstrating the effectiveness of our approach.
The remainder of this paper is organized as follows. Section II describes the modulation-filtered cochleagram feature. Section III describes the proposed emotion recognition framework with the multi-level attention network. The experiments and results are presented in Section IV. Finally, the paper is concluded in Section V.