1. Introduction
Sentiment analysis and emotion analysis are often used to refer to the same task, and sentiment analysis techniques are not restricted to keyboard and mouse input: using modalities such as speech, gesture, messaging, and facial expression, they assess opinions, emotions, and polarity. These expressions are subjective and may convey classes such as positive, negative, neutral, joyful, and many others. Take, for example, the sentence "Mai dislikes the battery of the ABC phones." In this statement Mai expresses her opinion, holding an unfavorable view of the "battery" of the "ABC phone." The extraction of sentiment from voice, text, and facial expressions has been the subject of extensive research in recent years. For example, Nguyen et al. [1] used an ontological method to determine entity ratings; the authors then ran trials using these entity scores to categorize opinions and detect opinion spam. In another paper [2], Tran et al. used a popular machine learning method (SVM) and the WEKA library to build a Java web application for sentiment analysis of English comments on dresses, handbags, shoes, and rings; their system was trained on 300 comments, tested on 400 comments, and achieved 89.3% precision (a minimal sketch of such a text-only SVM classifier follows this paragraph). For Vietnamese sentiment analysis, Dang et al. [3] proposed hybrid deep learning models, and their results show that hybrid models achieve higher accuracy on Vietnamese datasets. However, because of the ambiguity and variability of the data, researchers face many obstacles in solving sentiment problems effectively. A single piece of evidence is typically insufficient to yield reliable information; sentiment analysis methods, for instance, cannot precisely categorize user attitudes in cases of irony, subjectivity, tone, and sarcasm. The same text can be written in numerous ways, and the correct class cannot be determined from a single data source alone. Personality, cultural, gender, and situational differences may also change a person's facial expression. As a result, developing precise predictions has become more difficult in recent years. These challenges motivate innovative and interactive approaches that integrate many information sources to produce more precise classification and to enhance accuracy and dependability.
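As a toy illustration of the kind of SVM-based text sentiment classification described above, the following minimal sketch uses scikit-learn rather than the WEKA setup of [2]; the comments, labels, and feature choices are illustrative assumptions, not the data or pipeline of the cited work.

```python
# Minimal sketch of an SVM-based text sentiment classifier (illustrative only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy training data: product comments labelled positive (1) or negative (0).
comments = [
    "I love this handbag, the leather feels great",
    "The dress fits perfectly and looks elegant",
    "These shoes fell apart after a week",
    "The ring arrived scratched and looks cheap",
]
labels = [1, 1, 0, 0]

# TF-IDF features feeding a linear SVM: a simple unimodal text pipeline.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(comments, labels)

# Predict the sentiment class of a new comment.
print(model.predict(["the shoes look cheap and fell apart quickly"]))
```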
With the rise of social media and mobile devices, people increasingly post photographs and text together to communicate their thoughts and feelings. Multimodal Sentiment Analysis (MSA) is an emerging field of study that combines data from several modalities to analyse and identify sentiments, categorising people's thoughts and emotions more accurately than any single modality. Applications that build on multimodal sentiment include opinion mining, tailored advertising, affective cross-modal retrieval, and decision-making, among others. Numerous applications in areas such as social media, navigation tools, and human-to-human interaction have already been implemented using the multimodal framework, demonstrating MSA's viability and significance.
In a survey paper published in 2021 [4], the authors highlighted multimodal representation learning, multimodal alignment, and multimodal information fusion as the three primary issues in MSA. Information fusion is a primary challenge because: (1) modalities may not have their information temporally aligned; (2) fusion models may find it difficult to exploit complementarity between modalities; and (3) noise types and intensities may differ among modal data. The survey also concentrated on deep learning (DL) fusion techniques for MSA, such as CNNs, RNNs, LSTMs, Transformers, and attention mechanisms. For example, the fusion of text and image is called static MSA; a CNN-based method achieved an accuracy of nearly 91%. In this model, a hybrid Convolutional Neural Network (CNN) analyses the text, while a support vector machine (SVM) classifier trained on a bag of visual words predicts the sentiment of the visual information. The fusion of text, image, audio, and video is called dynamic MSA. An LSTM-based approach extracts text features with textCNN, audio features with openSMILE, and visual features with a 3D-CNN, and models shared information among the multimodal features with a context-sensitive LSTM; this model achieved an accuracy of 80.3% (a minimal sketch of this context-sensitive fusion appears after this paragraph). CNN+LSTM-based and RNN-based methods are used in particular for MSA in conversation. In a separate survey [5], the authors discussed the use of recurrent neural networks for sentiment analysis of textual, visual, and multimodal inputs. That work also discussed how textual SA extracts rich semantic information using DL models; RNNs, LSTMs, and their derivatives are used to extract features from a series of visual frames, while visual SA uses deep CNNs to extract more abstract features. A third study [6] on multimodal sentiment analysis focused primarily on SA, with only a brief discussion of MSA.
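As a rough illustration of the dynamic-MSA pipeline summarised above, the following PyTorch sketch concatenates per-utterance text, audio, and visual feature vectors (placeholders standing in for textCNN, openSMILE, and 3D-CNN outputs) and feeds them through a bidirectional LSTM so that each utterance's prediction can draw on conversational context. The dimensions and layer choices are assumptions for illustration, not the published architecture.

```python
# Sketch of context-sensitive LSTM fusion over per-utterance multimodal features.
import torch
import torch.nn as nn

class ContextLSTMFusion(nn.Module):
    def __init__(self, text_dim=100, audio_dim=74, visual_dim=35,
                 hidden_dim=128, num_classes=3):
        super().__init__()
        fused_dim = text_dim + audio_dim + visual_dim
        self.context_lstm = nn.LSTM(fused_dim, hidden_dim,
                                    batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, text_feats, audio_feats, visual_feats):
        # Feature-level fusion of the three modalities for each utterance.
        fused = torch.cat([text_feats, audio_feats, visual_feats], dim=-1)
        # Context-sensitive modelling over the utterance sequence.
        context, _ = self.context_lstm(fused)
        return self.classifier(context)  # one logit vector per utterance

# Example: a batch of 2 videos, each with 5 utterances.
model = ContextLSTMFusion()
logits = model(torch.randn(2, 5, 100), torch.randn(2, 5, 74), torch.randn(2, 5, 35))
print(logits.shape)  # torch.Size([2, 5, 3])
```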
Contribution. In contrast to existing survey papers, our survey aims to provide a comprehensive overview of MSA datasets and techniques, with a specific emphasis on multimodal features and multimodal fusion, and to offer insights into MSA based on text and image data. The contributions are outlined as follows:
we offer a thorough examination of datasets and tasks specifically within the field of MSA;
we review and analyse multimodal features and multimodal fusion;
we present the challenges and future research directions in MSA, addressing issues such as cross-modal interactions, context-dependent interpretations, and the prospect of constructing a knowledge graph of multimodal representations for semantic analytics.
Paper Structure. The rest of this paper is organised as follows: in Section 2, we detail the methodology employed to select papers from the literature; in Section 3, we analyse current research on MSA datasets, multimodal features, multimodal fusion, and the analysis/modelling techniques applied; Section 4 discusses the main findings of the survey; and finally, in Section 5, we present conclusions and discuss further research on this topic.
4. Results and Discussion
4.1. Publication Year
Figure 2 shows that the number of research papers increases year on year. Notably, approximately 50% of the selected papers were published in 2023, reflecting the rapid growth of social media and the potential of MSA as a tool for detecting and interpreting public sentiment.
4.2. Datasets
Table 3 provides an overview of the datasets utilized in the selected literature. Approximately 63% of the papers used third-party, open data that is accessible to the research community. A smaller proportion, around 30%, relied on self-generated, closed data that is not publicly available. A minority of papers, roughly 7%, used datasets generated by a third party but kept private.
Table 2 summarizes two distinct types of data sources: Self (self-built by the authors) and 3rd (supplied by a third party). Beyond identifying the source, the table records whether each dataset is open for access (O) or not open for access (NO). This classification provides important context for understanding the availability and accessibility of the datasets used in the reviewed multimodal sentiment analysis studies. The dataset is one of the key components of MSA. A multimodal sentiment analysis model with strong generalization and broad applicability could be trained on a vast and diverse dataset, reflecting the diversity of languages and ethnicities across many nations. Furthermore, researchers must label multimodal datasets more precisely, since current annotations have low accuracy and do not reach truly continuous values. Most multimodal datasets available today include only text, voice, and visual modalities; they do not include modal information paired with physiological signals such as pulses and brain waves.
4.3. Fusion Methods
Table 3 also outlines fusion methods, including Late (∼73%) and Early (∼20%); a much smaller percentage of papers do not address fusion methods (∼7%). Early and Late fusion are illustrated in Figure 3, together with components such as the Feature of Modality (FOM) and the Sentiment Classifier (SC). Notably, text-image fusion mainly relies on late fusion, where each modal input is handled by its own model and combination techniques are applied during the decision phase to generate the output.
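The following minimal sketch contrasts the two strategies in Figure 3 for a text-image setting: early fusion concatenates the features of each modality (FOM) before a single sentiment classifier (SC), while late fusion applies a separate classifier per modality and combines the decisions (here by simple averaging). The linear classifiers and feature dimensions are placeholder assumptions; real systems would use, for example, BERT and CNN encoders.

```python
# Sketch of early (feature-level) vs. late (decision-level) fusion.
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    """Concatenate modality features, then apply a single sentiment classifier."""
    def __init__(self, text_dim, image_dim, num_classes=3):
        super().__init__()
        self.sc = nn.Linear(text_dim + image_dim, num_classes)

    def forward(self, text_fom, image_fom):
        return self.sc(torch.cat([text_fom, image_fom], dim=-1))

class LateFusion(nn.Module):
    """Classify each modality separately, then combine the decisions by averaging."""
    def __init__(self, text_dim, image_dim, num_classes=3):
        super().__init__()
        self.text_sc = nn.Linear(text_dim, num_classes)
        self.image_sc = nn.Linear(image_dim, num_classes)

    def forward(self, text_fom, image_fom):
        return 0.5 * (self.text_sc(text_fom) + self.image_sc(image_fom))

# Example: a batch of 4 text/image feature pairs.
text_fom, image_fom = torch.randn(4, 768), torch.randn(4, 512)
print(EarlyFusion(768, 512)(text_fom, image_fom).shape)  # torch.Size([4, 3])
print(LateFusion(768, 512)(text_fom, image_fom).shape)   # torch.Size([4, 3])
```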
4.4. Models
In terms of model development, the contributions in MSA comprise 'new model' (∼66%), 'improvement of a baseline' (∼17%), and 'experiment' (∼17%), as shown in Table 3. Recent years have seen the development of almost entirely new models, with a particular emphasis on multimodal fusion and multimodal features.
A new framework, the HG-BERT model, combines a gate channel and a hierarchical multi-attention mechanism to optimize the BERT model. Wav2Vec 2.0, ELECTRA, ViT embeddings, and relevant sub-spaces of BERT are self-supervised pre-trained models used for prediction. The newly proposed Hyperbolic Hierarchical Attention Network model captured the semantic similarity between written and visual content when compared to the summary. Similarly, a newly proposed CB-Transformer framework consists of global self-attention representations, cross-modal feature fusion, and local temporal learning; the transformer encoder and the residual-based cross-modal fusion, represented by TransEncoder and CrossModal, are the two key elements of this module. Another system consists of two training paths: the first uses textual data for perceived sentiment analysis, and the second uses video data for induced sentiment analysis. A unified DL framework is constructed on top of the unified modalities with an inter-modal attention mechanism. Furthermore, in text-image fusion, the BERT model was almost always used for text classification, while CNN was employed for image classification. Additionally, newly suggested models, such as those combining a CNN with CBAM, can focus on the relationship between text and image.
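As an illustration of the residual cross-modal fusion idea mentioned above (e.g., the CrossModal block of the CB-Transformer), the following sketch lets text token features attend over image patch features and adds the result back through a residual connection. The dimensions, layer choices, and naming are assumptions for illustration and do not reproduce the published architectures.

```python
# Sketch of residual cross-modal attention: text queries attend to image patches.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens, image_patches):
        # Text features act as queries; image features provide keys and values.
        attended, _ = self.cross_attn(query=text_tokens,
                                      key=image_patches,
                                      value=image_patches)
        # Residual connection followed by layer normalisation.
        return self.norm(text_tokens + attended)

fusion = CrossModalFusion()
text = torch.randn(2, 32, 256)    # e.g. BERT-style token features
image = torch.randn(2, 49, 256)   # e.g. ViT-style patch features
print(fusion(text, image).shape)  # torch.Size([2, 32, 256])
```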
4.5. Future Development
MSA models need to be examined further with a view to increasing accuracy or other metrics, or they might serve as the basis for new models built on sophisticated temporal modelling and fusion techniques. Regarding features, models could be designed with time-dependent interactions in mind. They could also make use of social-context features, such as user profiles and propagation patterns, as well as invariant feature learning, to better distinguish biased features and facilitate bias estimation.
Network training can be used to estimate parameters of the feature distribution, to combine features from different sources into more relevant multimodal characteristics, and to explore additional feature types that could reveal more about online sentiment behavior. Transfer learning approaches are a modelling technique that has gained popularity recently. Additionally, MSA models have the potential to track a user's credibility by utilizing metadata and comments in conjunction with user-related data. They can also leverage adversarial learning and knowledge graphs to enhance the effectiveness of unified inter-modal attention approaches. Furthermore, models can investigate the relative importance of the modalities and capture complicated relations between them. The interpretability of emotion identification in the aforementioned modalities can be investigated through additional methodologies, cross-modal linkages, and filtering mechanisms.