4.2. Experimental Results: Accuracy and Loss Curves
In the sequel, we present the accuracy and loss curves of our six models in Figures 6 to 21, adding comments where needed. It is important to note that, unlike the other models, the software package used for the Deep Belief Network (DBN) provides only loss information, so for this model we have only one graph per dataset.
Figure 6. DBN loss training charts for RAVDESS (left) and SAVEE (right) databases.
Figure 7. DNN training charts for RAVDESS database.
Figure 8. DNN training charts for SAVEE database.
Figure 9. LSTM network training charts for RAVDESS database.
Figure 10. LSTM network training charts for SAVEE database.
Figure 11. LSTM network with attention mechanism training charts for RAVDESS database.
Figure 12. LSTM network with attention mechanism training charts for SAVEE database.
Figure 13. CNN network training charts for RAVDESS database.
Figure 14. CNN network training charts for SAVEE database.
The model of Figure 13 and Figure 14 uses batch normalization and dropout with a rate of 50% to prevent large overfitting, but good generalization is still not achieved. For this reason, two more variations of our model were tested.
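For illustration, a minimal Keras-style sketch of such a convolutional block, combining batch normalization with 50% dropout, is given below; the framework, filter counts, and kernel sizes are assumptions for the example and not the exact configuration of our model.

```python
# Illustrative only: a Conv1D block with batch normalization and 50% dropout,
# the two techniques used against overfitting. Layer sizes are placeholders.
from tensorflow.keras import layers, models

def conv_block(x, filters):
    x = layers.Conv1D(filters, kernel_size=5, activation="relu")(x)
    x = layers.BatchNormalization()(x)  # normalize activations over each mini-batch
    x = layers.Dropout(0.5)(x)          # randomly drop 50% of the units during training
    return x

inputs = layers.Input(shape=(228, 1))   # e.g., the 228-dimensional RAVDESS input vector
x = conv_block(inputs, 64)
x = conv_block(x, 128)
x = layers.GlobalAveragePooling1D()(x)
outputs = layers.Dense(8, activation="softmax")(x)  # 8 emotion classes in RAVDESS
model = models.Model(inputs, outputs)
```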
Initially, zero padding was tested on the input vectors at the hidden layers, so as to keep the dimension of the vector containing the input sequence the same as the model proceeds to the next hidden layers (228 for RAVDESS and 308 for SAVEE). This test was performed on the RAVDESS database, and the accuracy and loss curves are presented in Figure 15.
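In Keras-style code, one common way to obtain this behavior is the padding="same" option of the convolutional layers, as in the sketch below (the framework and layer sizes are assumptions for illustration):

```python
# Sketch: with padding="same", every Conv1D layer keeps the sequence length
# unchanged (228 for RAVDESS, 308 for SAVEE), instead of shrinking it per layer.
from tensorflow.keras import layers, models

seq_len = 228  # RAVDESS feature-vector length; 308 would be used for SAVEE

inputs = layers.Input(shape=(seq_len, 1))
x = layers.Conv1D(64, kernel_size=5, padding="same", activation="relu")(inputs)
x = layers.Conv1D(128, kernel_size=5, padding="same", activation="relu")(x)
model = models.Model(inputs, x)
model.summary()  # both hidden layers report an output shape of (None, 228, filters)
```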
As we can see from the graphs of the zero-padding test compared to our model in Figure 13, the accuracy remains the same (70.5%), as does the overfitting (30%). However, comparing the loss graphs of the two implementations, we observe that from epoch 50 onwards the error of the zero-padding test increases gradually and does not stabilize around 2 after epoch 100, as happens with the model of Figure 13; on the contrary, it continues to grow and exceeds 2.
Then, since the first test on the RAVDESS database had not yielded the desired results, a second test was performed using zero padding in combination with replacing the ReLU activation function with LeakyReLU. The ReLU activation function sets all negative values to zero, which helps the network ensure that the most important neurons take positive values (> 0). However, this can also be a problem: the gradient in the negative region is 0, so neurons that reach large negative values cannot recover and remain stuck at 0. Such neurons effectively die, which is why this phenomenon is called “the dying ReLU problem”.
Figure 15. CNN network training charts for RAVDESS database with zero padding.
To avoid this phenomenon, LeakyReLU was proposed, in which negative values, instead of being set to zero, are multiplied by a factor of 0.01: f(x) = x for x > 0, and f(x) = 0.01x for x ≤ 0.
Compared with ReLU, LeakyReLU has a slight slope in the negative region, which prevents neurons with negative values from being trapped at 0. Success has generally been reported with this activation function, although the results are not always consistent. The reason we apply it to the CNN model is that it can further reduce overfitting and smooth out the instabilities of our system, which appear as abrupt changes (spikes) in the curves.
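The following sketch (assuming a NumPy/Keras implementation, purely for illustration) contrasts the two activations and shows how the swap would look in a convolutional block:

```python
# ReLU zeroes all negative inputs; LeakyReLU keeps a small slope (0.01) there,
# so neurons with negative pre-activations still receive a gradient.
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(x))        # [0.    0.    0.    0.5   3.  ]
print(leaky_relu(x))  # [-0.03  -0.005  0.    0.5   3.  ]

# In a Keras layer stack, the change amounts to replacing activation="relu" with
# a separate LeakyReLU layer, e.g.:
#   x = layers.Conv1D(64, 5, padding="same")(x)
#   x = layers.LeakyReLU(alpha=0.01)(x)
```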
This test was performed on the RAVDESS and SAVEE databases, and the accuracy and loss curves are depicted in Figure 16 and Figure 17, respectively.
Figure 16. CNN network training charts for RAVDESS database with zero padding and LeakyReLU.
Figure 17. CNN network training charts for SAVEE database with zero padding and LeakyReLU.
As we can see from the graphs of the test with zero padding and LeakyReLU on the RAVDESS database, compared to our model in Figure 13, the accuracy decreases slightly (67%) and the overfitting increases slightly (33%). Comparing the loss graphs of the two implementations, we notice that the instability (spikes) of the system is improved. However, from epoch 50 onwards the error of the test with zero padding and LeakyReLU increases gradually and does not stabilize around 2 after epoch 100, as happens with the model of Figure 13; on the contrary, it continues to increase, approaching values of 3 and 4.
Comparing now the accuracy graph of the test with zero padding and LeakyReLU on the SAVEE database with that of Figure 14, we notice that the accuracy increases marginally (55.7%) compared to the model of Figure 14 (55.2%), so there is a very small reduction in overfitting, of 0.5%. Comparing the loss graphs of the two implementations, we notice that the instability (spikes) of the system is slightly improved. From epoch 50 onwards the error of the test with zero padding and LeakyReLU increases gradually and does not stabilize, approaching 3.5. This is a small improvement, as the loss of Figure 14 approaches 4.5 at epoch 200, although it stabilizes from epoch 200 onwards.
From the above comparison of the two tests with the implementation of Figure 13 on the RAVDESS database, we conclude that our model with ReLU and without zero padding gives better overall results in terms of accuracy and loss. Regarding the comparison of the test with zero padding and LeakyReLU with the implementation of Figure 14 for the SAVEE database, we observe that the test with zero padding and LeakyReLU achieves marginally better overall results, in terms of both accuracy and loss. This may be due to the fact that this database has fewer samples than RAVDESS. Therefore, we conclude that the model implemented with ReLU and without zero padding performs better in general, and that the techniques tested do not achieve the desired goal, i.e., a significant reduction of overfitting and loss.
Figure 18. CNN network with attention training charts for RAVDESS database.
Figure 19. CNN network with attention training charts for SAVEE database.
Because the models show a degree of overfitting that cannot be reduced further, we performed a test in which we removed the dropout regularization technique from the model, keeping only batch normalization, in order to observe the effects. The results of this model (without dropout) for the RAVDESS and SAVEE databases are depicted in Figure 20 and Figure 21, respectively.
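As a sketch of this ablation (hypothetical code, assuming the Keras-style block used above), the only change is that the Dropout layer is omitted while batch normalization is kept:

```python
# Ablation sketch: the same convolutional block built with and without dropout.
from tensorflow.keras import layers

def conv_block(x, filters, use_dropout=True):
    x = layers.Conv1D(filters, kernel_size=5, padding="same", activation="relu")(x)
    x = layers.BatchNormalization()(x)   # kept in both variants
    if use_dropout:
        x = layers.Dropout(0.5)(x)       # removed in the ablated model
    return x
```

Both variants are then trained with the same hyperparameters, so any difference in the accuracy and loss curves can be attributed to the presence or absence of dropout.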
Figure 20. CNN network with attention training charts for RAVDESS database without Dropout.
Figure 21. CNN network with attention training charts for SAVEE database without Dropout.
It is obvious that with the removal of dropout the model shows a significant increase in overfitting, given that for both the RAVDESS and SAVEE datasets we observe a significant reduction in the accuracy of the model and some increase in loss. More specifically, for the RAVDESS dataset, the model without dropout achieves 65% accuracy, in contrast to the model of Figure 18, which achieves 77% accuracy. This means that the model without dropout overfits by 45% on this dataset. Also, the loss of the model without dropout fluctuates (spikes) around 2.5, while the loss of the model of Figure 18 fluctuates (spikes) around 1.5 and, from epoch 100 onwards, remains around 1 with a slight upward trend towards 1.2.
For the SAVEE dataset, the model without dropout achieves an accuracy of 47%, in contrast to the model of Figure 19, which achieves an accuracy of 74%. This means that the model without dropout overfits by 63% on this dataset. Also, from epoch 25 onwards the loss of the model without dropout shows a steady upward trend from 1.3 to 1.6, while the loss of the model of Figure 19 fluctuates (spikes) around 1.5. From the above, it is obvious that the proposed CNN-Attention model with the dropout technique against overfitting is the best option.
4.4. Experimental Results: Confusion Matrices
For the six implemented models, the confusion matrices for the RAVDESS and SAVEE datasets are presented in this subsection, in Figures 22 to 27. As is well known, the confusion matrices contain information about the success rate of each model’s predictions for every emotion of the selected datasets separately.
Table 1 shows the correspondence between the labels used in the confusion matrices and the emotions of the two datasets.
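For reference, a confusion matrix of this kind can be computed from a model’s predictions as in the short sketch below (toy labels and scikit-learn are used purely for illustration; the label order would follow Table 1):

```python
# Toy example: rows are true emotions, columns are predicted emotions.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 1, 1, 2, 2, 2])   # hypothetical ground-truth labels
y_pred = np.array([0, 1, 1, 1, 2, 0, 2])   # hypothetical model predictions (argmax of softmax)

cm = confusion_matrix(y_true, y_pred)
per_class_rate = cm.diagonal() / cm.sum(axis=1)  # per-emotion success rate, e.g., 33/41 = 80.5%
print(cm)
print(per_class_rate)
```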
From the confusion matrix of DBN for the RAVDESS dataset (Figure 22), we can conclude that the ‘neutral’ and ‘angry’ emotions yield the lowest success rate (0%), as the ‘neutral’ emotion is most often confused with the emotion of ‘calm’ {in 9 out of 20 samples} and the emotion of ‘anger’ is most often confused with the emotion of ‘surprise’ {in 27 out of 36 samples}. In contrast, the emotion of ‘surprise’ receives the best success rate (80.5%) compared to the rest, as it identifies 33 out of 41 samples.
Figure 22. DBN confusion matrices for RAVDESS (left) and SAVEE (right).
From the confusion matrix of DBN for the SAVEE dataset (Figure 22), we can conclude that the ‘anger’ emotion yields the lowest success rate (25%), identifying only 3 of the 12 samples, as it is most often confused with the emotion of ‘sadness’ {in 8 out of 12 samples}. In contrast, the emotion of ‘surprise’ receives the best success rate (66.7%) compared to the rest, as it identifies 8 out of 12 samples.
In the confusion matrix of DNN for the RAVDESS dataset (Figure 23), we can see that the ‘neutral’ emotion yields the lowest success rate (14.3%), as it is most often confused with the ‘sad’ emotion {in 5 out of 21 samples} and the emotion of ‘disgust’ {in 5 out of 21 samples}. In contrast, the emotion of ‘surprise’ receives the best success rate (61%) compared to the rest, as it identifies 25 out of 41 samples.
Similarly, in the confusion matrix of DNN for the SAVEE dataset (Figure 23), we can see that the emotion of ‘surprise’ yields the lowest success rate (33.3%), identifying only 4 out of the 12 samples, as it is most often confused with the emotion of ‘fear’ {in 5 out of the 12 samples}. In contrast, the ‘neutral’ emotion receives the best success rate (91.3%) compared to the rest, as it identifies 21 out of 23 samples.
Figure 23. DNN confusion matrices for RAVDESS (left) and SAVEE (right).
Figure 24. LSTM network confusion matrices for RAVDESS (left) and SAVEE (right).
From the confusion matrix of LSTM for the RAVDESS dataset (Figure 24), we can conclude that the ‘happy’ and ‘neutral’ emotions yield the lowest success rates, achieving 44.8% and 66.7%, respectively. This is because the ‘happy’ emotion is most often confused with the emotion of ‘fearful’ {in 7 out of 29 samples} and the ‘neutral’ emotion is sometimes confused with the ‘sad’ emotion {in 3 out of 21 samples}. In contrast, the emotions of ‘calm’ and ‘surprise’ receive the best success rates (80% and 78%), as they identify 32 out of 40 and 32 out of 41 samples, respectively.
From the confusion matrix of LSTM for the SAVEE dataset (Figure 24), we can conclude that the emotion of ‘happiness’ yields the lowest success rate (36.4%), identifying only 4 out of 11 samples, as it is confused several times with both the emotion of ‘anger’ and the emotion of ‘surprise’ {in 3 out of 12 samples for each}. In contrast, the ‘neutral’ emotion receives the best success rate (95.6%) compared to the rest, as it identifies 22 of the 23 samples.
Figure 25. LSTM network with attention mechanism confusion matrices for RAVDESS (left) and SAVEE (right).
From the confusion matrix of LSTM-ATN for the RAVDESS dataset (Figure 25), we can conclude that the ‘sad’ emotion achieves the lowest success rate (50%), identifying only 22 out of 44 samples, as it is sometimes confused with the emotion of ‘disgust’ {in 6 out of 44 samples}. In contrast, the ‘calm’ emotion receives the best success rate (95%) compared to the rest, as it identifies 38 out of 40 samples.
From the confusion matrix of LSTM-ATN for the SAVEE dataset (Figure 25), we can conclude that the emotion of ‘happiness’ yields the lowest success rate (27.3%), identifying only 3 of the 11 samples, as it is most often confused with the emotion of ‘surprise’ {in 3 out of 11 samples}. In contrast, the ‘neutral’ emotion receives the best success rate (95.6%) compared to the rest, as it identifies 22 out of 23 samples.
Figure 26. CNN network confusion matrices for RAVDESS (left) and SAVEE (right).
From the confusion matrix of CNN for the RAVDESS dataset (Figure 26), we can conclude that the ‘neutral’ emotion achieves the lowest success rate (28.6%), identifying only 6 of the 21 samples, as it is most often confused with the emotion of ‘calm’ {in 8 out of 21 samples}. In contrast, the emotion of ‘calm’ receives the best success rate (85%) compared to the rest, as it identifies 34 out of 40 samples.
From the confusion matrix of CNN for the SAVEE dataset (Figure 26), we can conclude that the emotion of ‘disgust’ yields the lowest success rate (10%), identifying only 1 out of 10 samples, as it is confused several times with both the emotion of ‘sadness’ {in 4 out of 10 samples} and the ‘neutral’ emotion {in 3 out of 10 samples}. In contrast, the ‘neutral’ emotion receives the best success rate (82.6%) compared to the rest, as it identifies 19 out of 23 samples.
Figure 27. CNN network with attention mechanism confusion matrices for RAVDESS (left) and SAVEE (right).
From the confusion matrix of CNN-ATN for the RAVDESS dataset (Figure 27), we can conclude that the ‘neutral’ emotion yields the lowest success rate (57%), as it is often confused with the emotion of ‘calm’ {in 9 out of 21 samples}. In contrast, the ‘calm’ emotion receives the best success rate (97%) compared to the rest, as it identifies 39 out of 40 samples.
From the confusion matrix of CNN-ATN for the SAVEE dataset (Figure 27), we can conclude that the emotion of ‘disgust’ yields the lowest success rate (60%), identifying 6 out of 10 samples, as it is confused a few times with the ‘neutral’ emotion {in 2 out of 10 samples}. In contrast, both the emotion of ‘anger’ and the emotion of ‘sadness’ receive the best success rate (91.6%) compared to the rest, as each identifies 11 out of 12 samples.
In conclusion, from all the above we observe that, for the RAVDESS database, our models can on average recognize the emotions of ‘calm’ and ‘surprise’ more successfully, while they find it difficult to identify the ‘neutral’ emotion. This may be due to the fact that this database contains two basic emotions that approach the ‘neutral’ emotion, one from a negative and the other from a positive point of view, making it difficult to determine the right emotion. Regarding the SAVEE database, we observe that our models recognize the ‘neutral’ emotion most successfully, as in this database it is unique, while they find it difficult to identify the emotions of ‘happiness’ and ‘disgust’.