1. Introduction
Human-Computer Interaction (HCI) is becoming increasingly important as digital devices and software proliferate. HCI is moving beyond simple command input and output toward providing a better user experience by having machines understand human emotions and intentions and respond appropriately. Emotion Recognition (ER) plays a key role in this interaction, and technology that understands human emotions and responds to them appropriately has become an important element in improving the user experience [1]. Emotion recognition technology improves the quality of HCI and increases usability, accessibility, and efficiency in various industries, including education and organizational productivity [2]. In education, emotion recognition technology can monitor students’ emotional states and positively affect learning performance by enabling customized learning approaches and adaptive feedback. In an e-learning environment, multimodal emotion recognition can identify students’ emotions in real time so that learning difficulty can be adjusted or appropriate feedback provided [3]. This can increase learning engagement and reduce dropout caused by stress and frustration. In the area of organizational productivity, emotion recognition technology contributes to stress management, fatigue control, and improved job satisfaction by monitoring the emotional state of employees in the workplace. Since employees’ emotional states are closely related to work performance, it is important to understand them in real time and take appropriate measures. Emotion recognition systems are even more necessary in remote work environments, where face-to-face interaction is limited; they allow organizations to monitor employees’ emotional states and provide the necessary support, maintaining and improving productivity even in non-face-to-face settings. The need for emotion recognition technology will therefore continue to grow across fields, and it will play an important role in improving work efficiency and performance in each industry.
Korean has unique linguistic characteristics for emotion recognition. First, in terms of text, Korean is an agglutinative language in which grammatical functions are expressed through particles and word inflection. Because word order is flexible and subjects and objects are often omitted, understanding the entire context of a sentence is essential. For example, the meaning of a sentence can change depending on the position or inflection of the same word, and the emotional expression can shift subtly as well. Korean text emotion analysis therefore requires analyzing the structure and context of the sentence, not just the vocabulary. In terms of phonetics, Korean intonation and pronunciation patterns play an important role in emotional expression. The same word can convey different emotions depending on intonation, and features of the Korean sound system (long and short vowels, aspirated sounds, liquid sounds, etc.) provide important clues for understanding emotional states. For example, emotions in Korean vary with the pitch of the speech, pronunciation stress, and speaking speed, so emotional states can be predicted from these cues. Scherer’s study [4] explains that the pitch and speed of speech can indicate emotional state, and these factors are also important in Korean speech emotion recognition.
Existing emotion recognition studies have mainly focused on a single modality, either text or speech, and have shown limitations in capturing important emotion information. For example, the study by Bharti [5] combined several text datasets (ISEAR, WASSA, Emotion-stimulus) and used a combination of CNN (Convolutional Neural Network) and Bi-GRU (Bidirectional Gated Recurrent Unit) models to recognize emotions. That study classified emotions from text alone, without considering speech or other modalities, and thus could not sufficiently reflect the complexity of emotion. Similar limitations appear in speech emotion recognition. The study by Kim [6] extracted speech features such as the Mel-Spectrogram and MFCC (Mel-Frequency Cepstral Coefficients) from the Emo-DB and RAVDESS datasets and combined a BiLSTM-Transformer with a 2D CNN to recognize emotions. However, that study also focused only on speech data and did not consider interactions with other modalities such as text. Single-modality approaches thus struggle to capture the complex characteristics of emotion. In addition, most emotion recognition studies have been conducted on German or English data, and studies on Korean are relatively scarce. To address these problems, multimodal approaches that combine modalities such as text, speech, and video have recently attracted attention. A multimodal approach combines the emotional information unique to each modality to increase the accuracy of emotion recognition. For example, H. Park [7] trained separate deep learning models for text and speech and then used a weighted average ensemble to improve emotion recognition performance. Y. Kim [8] likewise trained separate deep learning models for text and speech and combined them by averaging, achieving higher performance than a single modality. However, multimodal studies tend to rely on simple combination methods such as average or weighted average ensembles, which do not sufficiently reflect the interaction between modalities and can therefore limit emotion recognition performance. To complement this, this study proposes a deep learning model that incorporates preprocessing reflecting the characteristics of the Korean language, adds a transformer encoder for each modality, and strengthens the interaction between text and speech through cross-modal attention.
This study aims to design an emotion recognition model that reflects the unique linguistic and speech characteristics of Korean. Korean is an agglutinative language with rich grammatical variation, so context-dependent analysis is essential. To this end, text data are processed with the KoELECTRA model, and speech data are analyzed with MFCC and Pitch features that capture the key properties of the speech signal. KoELECTRA is an ELECTRA-based pretrained language model specialized for Korean that captures honorific and informal registers as well as complex morphological variation. Text data are thereby converted into high-dimensional contextual vectors, and speech data are vectorized by extracting MFCC, which represents timbre, and Pitch, which reflects intonation. A multimodal transformer model is used to combine the text and speech data. It consists of a transformer encoder that processes each modality individually and a cross-modal attention mechanism that fuses the two: cross-modal attention learns the interaction between the modalities by using text embeddings as queries and speech embeddings as keys and values. This effectively combines the complementary information in text and speech and improves the accuracy of emotion recognition. This study therefore proposes a multimodal transformer emotion recognition model that reflects the characteristics of the Korean language, considers the interaction between Korean text and speech, and overcomes the limitations of existing studies. By reflecting the linguistic and speech characteristics of Korean and improving Korean multimodal emotion recognition performance, this work also contributes to the development of emotion recognition technology that can be applied to other languages and modality combinations in the future.
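To make the feature-extraction step concrete, the sketch below shows one way the text and speech inputs described above could be vectorized. It is a minimal illustration rather than the exact pipeline of this study: the KoELECTRA checkpoint name, the 16 kHz sampling rate, the 40 MFCC coefficients, and the use of librosa's pYIN pitch tracker are all illustrative assumptions.

```python
# Minimal feature-extraction sketch. The checkpoint name, sampling rate,
# and feature sizes are illustrative assumptions, not values from the paper.
import librosa
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

TEXT_CKPT = "monologg/koelectra-base-v3-discriminator"  # assumed KoELECTRA checkpoint
tokenizer = AutoTokenizer.from_pretrained(TEXT_CKPT)
text_encoder = AutoModel.from_pretrained(TEXT_CKPT)

def extract_text_embedding(sentence: str) -> torch.Tensor:
    """Convert a Korean sentence into contextual token embeddings (1, T, 768)."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = text_encoder(**inputs)
    return outputs.last_hidden_state

def extract_speech_features(wav_path: str, sr: int = 16000, n_mfcc: int = 40) -> np.ndarray:
    """Extract frame-level MFCC (timbre) and pitch (intonation) features."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)       # (n_mfcc, frames)
    f0, _, _ = librosa.pyin(y,
                            fmin=librosa.note_to_hz("C2"),
                            fmax=librosa.note_to_hz("C7"),
                            sr=sr)                               # (frames,)
    f0 = np.nan_to_num(f0)                                       # unvoiced frames -> 0
    frames = min(mfcc.shape[1], len(f0))
    # Stack MFCC and pitch into one (frames, n_mfcc + 1) acoustic feature matrix.
    return np.concatenate([mfcc[:, :frames].T, f0[:frames, None]], axis=1)
```

Frame-level MFCC and pitch values are stacked here so that the speech branch receives a single sequence of acoustic feature vectors alongside the token-level text embeddings.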
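The fusion stage can likewise be sketched, assuming PyTorch: each modality passes through its own transformer encoder, and a cross-modal attention layer uses the text sequence as queries and the speech sequence as keys and values before a simple classifier predicts the emotion label. The layer counts, hidden dimension, pooling strategy, and number of emotion classes below are placeholders, not the configuration used in this study.

```python
# Sketch of the fusion stage: modality-specific transformer encoders followed by
# cross-modal attention (text as queries, speech as keys/values). Dimensions,
# layer counts, and the class count are placeholder assumptions.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4, n_layers: int = 2,
                 n_classes: int = 7, text_dim: int = 768, speech_dim: int = 41):
        super().__init__()
        # Project each modality into a shared model dimension.
        self.text_proj = nn.Linear(text_dim, d_model)
        self.speech_proj = nn.Linear(speech_dim, d_model)
        # Modality-specific transformer encoders.
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)
        self.speech_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)
        # Cross-modal attention: text queries attend to speech keys/values.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, text_emb: torch.Tensor, speech_feat: torch.Tensor) -> torch.Tensor:
        text = self.text_encoder(self.text_proj(text_emb))            # (B, T_text, d_model)
        speech = self.speech_encoder(self.speech_proj(speech_feat))   # (B, T_speech, d_model)
        fused, _ = self.cross_attn(query=text, key=speech, value=speech)
        pooled = fused.mean(dim=1)                                     # simple mean pooling
        return self.classifier(pooled)                                 # emotion logits

# Example with random tensors: batch of 2, 20 text tokens, 120 speech frames.
logits = CrossModalFusion()(torch.randn(2, 20, 768), torch.randn(2, 120, 41))
```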
The structure of this paper is as follows. In Section 2, we review existing studies related to text, speech, and multimodal emotion recognition, and in Section 3, we explain the proposed multimodal transformer model. In Section 4, we evaluate the proposed model through experiments and verify its performance by comparing it with existing emotion recognition studies using objective performance indicators. Finally, in Section 5, we present the conclusions of the study.
5. Conclusion
In this paper, we proposed KoMPT, a multimodal emotion recognition model that integrates KoELECTRA, MFCC, and Pitch with a multimodal transformer to improve emotion recognition performance. The proposed model combines KoELECTRA-based text embeddings with MFCC and Pitch speech features and effectively learns the interaction between text and speech through a cross-modal attention mechanism, achieving higher accuracy and efficiency in emotion recognition. The experimental results show that the multimodal approach outperforms single-modality models: the multimodal transformer recorded higher accuracy and F1-Score than models using text or speech alone and achieved an accuracy of 73.13% in emotion classification. This shows that emotion recognition performance can be significantly improved by combining the complementary information in text and speech. In addition, the preprocessing and model design that reflect the linguistic and phonetic characteristics of Korean enhanced performance: KoELECTRA provided embeddings that capture Korean context well, while MFCC and Pitch successfully extracted the frequency components and intonation information of the speech, improving speech emotion recognition. This study contributes to research on Korean-based multimodal emotion recognition and can support the development of emotion recognition technology that combines various languages and modalities in the future. In future work, emotion recognition performance could be further improved by incorporating new modalities such as video or by applying more advanced multimodal fusion techniques.