Facial expressions are among the most important cues for understanding humans at a psychological level. Because the analysis of human behavior can be informed by facial expressions, it is essential to employ indicators such as expression (Expr), valence-arousal (VA), and action units (AU). In this paper, we introduce the method we proposed for the Challenge of the 6th Workshop and Competition on Affective Behavior Analysis in-the-wild (ABAW) at CVPR 2024. Our method uses the multi-modal Aff-Wild2 dataset, which we split into spatial and audio modalities. For the spatial data, we extract features using a SimMIM model pre-trained on a diverse set of facial expression data. For the audio data, we extract features using a WAV2VEC model. To fuse the extracted spatial and audio features, we employ the cascaded cross-attention mechanism of a transformer.
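As a rough illustration of the fusion step, the sketch below stacks two cross-attention stages in which per-frame spatial features act as queries over audio features. It is a minimal sketch under stated assumptions, not the submitted implementation: the function names (`cross_attention`, `cascaded_fusion`), the number of stages, the residual connection, the feature sizes, and the random placeholder features standing in for the SimMIM and WAV2VEC outputs are all hypothetical, and the learned projection matrices, multi-head structure, and normalization of a real transformer block are omitted.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Scaled dot-product attention: queries (T_q x d) attend to context (T_c x d).

    Assumption: no learned Q/K/V projections, single head, plain residual add.
    """
    d = len(queries[0])
    scores = matmul(queries, transpose(context))  # T_q x T_c
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    attended = matmul(weights, context)  # T_q x d
    fused = [[q + a for q, a in zip(q_row, a_row)]
             for q_row, a_row in zip(queries, attended)]
    return fused, weights


def cascaded_fusion(spatial, audio, num_stages=2):
    """Chain cross-attention stages; each stage's output is the next stage's queries."""
    fused = spatial
    stage_weights = []
    for _ in range(num_stages):
        fused, weights = cross_attention(fused, audio)
        stage_weights.append(weights)
    return fused, stage_weights


# Placeholder features (hypothetical sizes): 4 video frames, 6 audio frames, 8 dims.
random.seed(0)
spatial_feats = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]
audio_feats = [[random.gauss(0, 1) for _ in range(8)] for _ in range(6)]

fused, attn = cascaded_fusion(spatial_feats, audio_feats)
print(len(fused), len(fused[0]))  # 4 8
```

The fused sequence keeps the length and width of the spatial stream, so under this assumed design a per-frame prediction head for Expr, VA, or AU could be attached to it directly.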