3.1. Implementation details
Our experiments were conducted in the PyTorch environment with the following specifications: Ubuntu 18.04.6 LTS, 128 GB RAM, and two NVIDIA RTX A6000 GPUs. We used the Aff-Wild2 dataset for the experiments. To extract image features, we utilized the SimMIM model [21], which was not trained directly on the Aff-Wild2 dataset. Instead, it was initially trained on the AffectNet [25], CASIA-WebFace [26], CelebA [27], and IMDB-WIKI [28] datasets, and then fine-tuned using the Expression, Valence-Arousal, and Action Unit ground truth from the Aff-Wild2 dataset. To obtain audio features, we employed the Wav2Vec model [22], using its pre-trained version without any additional fine-tuning.
The Aff-Wild2 dataset is an in-the-wild dataset that provides extensive annotations across three key tasks: valence-arousal (VA), expression recognition (Expr), and action unit detection (AU). For the VA task, the dataset includes 594 videos with around 3 million frames from 584 subjects, annotated for valence and arousal. For the Expr task, there are 548 videos with approximately 2.7 million frames annotated for the six basic expressions (anger, disgust, fear, happiness, sadness, surprise), the neutral state, and an 'other' category for additional expressions or affective states. Finally, for the AU task, the dataset comprises 547 videos with around 2.7 million frames annotated for 12 action units (AU1, AU2, AU4, AU6, AU7, AU10, AU12, AU15, AU23, AU24, AU25, AU26), offering a detailed framework for action unit analysis.
3.2. Evaluation Metrics
For the Affective Behavior Analysis in-the-wild (ABAW) competition, each of the three tasks, Expression (EXPR), Valence-Arousal (VA), and Action Unit (AU), has a specific performance measure.
Expression (EXPR) Task: The F1 score is used as the evaluation metric. The F1 score combines precision and recall into a single measure, which is particularly useful for imbalanced datasets. The overall score is the macro average of the per-class F1 scores:

$$P_{EXPR} = \frac{1}{n} \sum_{i=1}^{n} F_1^{(i)}, \qquad F_1^{(i)} = \frac{2 \, P_i \, R_i}{P_i + R_i},$$

where $n$ is the number of emotion classes, $P_i$ is the precision of the $i$-th class, and $R_i$ is the recall of the $i$-th class.
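The macro-averaged F1 score can be sketched as follows; this is a minimal illustrative implementation (the function name and toy labels are our own, not part of the competition tooling):

```python
import numpy as np

def macro_f1(y_true, y_pred, n_classes):
    """Macro-averaged F1: the unweighted mean of per-class F1 scores."""
    f1_scores = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))  # true positives for class c
        fp = np.sum((y_pred == c) & (y_true != c))  # false positives
        fn = np.sum((y_pred != c) & (y_true == c))  # false negatives
        precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
        if precision + recall > 0:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    return float(np.mean(f1_scores))
```

Because every class contributes equally regardless of its frame count, rare expressions weigh as much as frequent ones, which is why macro averaging suits imbalanced datasets.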
Valence-Arousal (VA) Task: The performance measure is the mean Concordance Correlation Coefficient (CCC) of valence and arousal. The CCC measures the agreement between observed and predicted scores. The performance $P$ is given by:

$$P_{VA} = \frac{CCC_{arousal} + CCC_{valence}}{2},$$

where $CCC_{arousal}$ is the CCC for arousal and $CCC_{valence}$ is the CCC for valence.
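A minimal sketch of the CCC computation, using the standard definition $CCC = 2\sigma_{xy} / (\sigma_x^2 + \sigma_y^2 + (\mu_x - \mu_y)^2)$ with population moments (the helper name and toy values are ours):

```python
import numpy as np

def ccc(x, y):
    """Concordance Correlation Coefficient between predictions x and labels y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mean_x, mean_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()              # population variances
    cov = ((x - mean_x) * (y - mean_y)).mean()   # population covariance
    return 2 * cov / (var_x + var_y + (mean_x - mean_y) ** 2)
```

Unlike the Pearson correlation, the CCC penalizes both scale and location shifts between predictions and labels, so a model that tracks the trend but mis-centers its outputs still scores below 1.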
Action Unit (AU) Task: The performance measure is the average F1 score across all 12 AU categories. The performance $P$ is computed as:

$$P_{AU} = \frac{1}{12} \sum_{i=1}^{12} F_1^{(i)},$$

where $F_1^{(i)}$ is the F1 score for the $i$-th AU category.
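Since AU detection is a multi-label problem, the F1 score is computed independently for each AU column of the binary label matrix and then averaged. A sketch under that assumption (function names and toy matrices are ours):

```python
import numpy as np

def binary_f1(y_true, y_pred):
    """F1 for one binary label using the 2*TP / (2*TP + FP + FN) identity."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

def au_performance(Y_true, Y_pred):
    """Average binary F1 over AU columns (frames x n_AUs arrays of 0/1)."""
    return float(np.mean([binary_f1(Y_true[:, j], Y_pred[:, j])
                          for j in range(Y_true.shape[1])]))
```

For the competition setting, `Y_true` and `Y_pred` would each have 12 columns, one per annotated AU.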
3.3. Results
Expression (EXPR) Task: For the EXPR task, our model achieved varying levels of performance across the emotion classes. The highest F1-scores were observed in the 'Anger', 'Disgust', and 'Fear' categories, each achieving a perfect score of 1.0, indicating that our model identified these expressions with high precision and recall. The 'Neutral' expression received an F1-score of 0.6063, which suggests a reasonably good recognition rate. However, the model struggled with the 'Surprise' and 'Sadness' expressions, evidenced by F1-scores of 0.0028 and 0.0178, respectively. This indicates a need for further model refinement to improve its sensitivity to these expressions. The 'Happiness' expression, often easily recognizable, had a lower-than-expected F1-score of 0.2857, which could be attributed to the complexity of the dataset. The 'Other' category, encompassing various non-standard expressions, achieved a moderate F1-score of 0.4238. Overall, the average F1-score for the EXPR task was 0.5420, providing a baseline for future improvements. The detailed results for each emotion class can be found in Table 1.
Action Unit (AU) Task: The AU task results were promising, with an average F1-score of 0.7077 across all 12 categories. The model was particularly effective in detecting 'AU15' and 'AU23', where it reached the maximum F1-score of 1.0, indicating a perfect match between predictions and the ground truth. Other AUs such as 'AU6', 'AU7', 'AU10', 'AU12', and 'AU25' also showed high F1-scores, all above 0.69, demonstrating the model's strong capability in recognizing these facial muscle movements. 'AU24' received the lowest score of 0.4886, suggesting areas where the model may require additional training data or feature engineering. The detailed F1-scores for each AU category are presented in Table 2.
Valence-Arousal (VA) Task: For the VA task, the model's performance was quantified using the Concordance Correlation Coefficient (CCC), with 'Arousal' obtaining a CCC of 0.5906 and 'Valence' a CCC of 0.4328. The average CCC for the VA task was 0.5117, indicating moderate agreement with the ground truth. These results highlight the challenges in accurately predicting the subtle variations in emotional intensity represented by the valence and arousal dimensions. Detailed performance metrics for valence and arousal can be seen in Table 3.