1. Introduction
The proliferation of social media has led to a vast increase in user-generated content across various platforms. This expansion has transformed social media networks from specialized forums for sharing specific types of content, like texts or images, to versatile environments where users can share a diverse array of multimedia elements including texts, images, audio, and video clips. Current trends indicate that multimedia posts, particularly those incorporating images or videos, significantly enhance user engagement and interaction. Given the rich and expressive nature of multimodal content, it presents a unique opportunity to understand and predict user behaviors and preferences. By analyzing such multimodal data, we can extract valuable insights into user sentiments and emotions, which can be leveraged in numerous applications such as targeted advertising, personalized content delivery, and interactive marketing strategies. Despite the importance of accurate social media content classification, traditional methodologies often focus on a single modality, typically neglecting the rich context provided by combining multiple data types. For instance, while visual content may vividly portray emotions, textual content also carries nuanced emotional cues that are equally significant. By examining both text and images, we can achieve a more holistic understanding of the underlying sentiments in user posts.
The development of advanced sensors that capture high-quality audio and video data [131] has catalyzed significant advancements in various sectors, notably in the development of passive, non-invasive monitoring technologies. These technologies hold promise for improving the ongoing management of chronic and mental health issues such as diabetes, hypertension, and depression [56]. Expected to be incorporated into daily environments like homes and workplaces, these sensors could transform how these spaces interact with their inhabitants by subtly adjusting to and managing their emotional and psychological well-being.
Emotion recognition has become a crucial research focus within affective computing, propelled by the need to decode and interpret human emotions across diverse applications, from interactive gaming to psychological assessments [60]. The field has seen significant advancements through the application of deep learning methods, which enhance both the accuracy and the efficiency of emotion detection from complex data sets. Increasingly, research has turned toward multimodal emotion recognition, which combines insights from facial expressions, vocal tones, and physiological signals to form a comprehensive picture of an individual’s emotional state. Innovations by researchers such as Kossaifi et al. have shown how neural networks can effectively untangle these complex inputs to predict emotions with increased accuracy [52,69]. However, the field continues to grapple with the nuances of context-specific emotional expressions and the subjective nature of interpreting emotional data, spurring ongoing innovation in this evolving area.
Analyzing both audio and video data is essential for multimodal emotion recognition, providing a deeper context for understanding the subtleties of human behavior. This analysis benefits from merging visual indicators like facial expressions and body gestures with auditory elements such as voice tone and pitch, offering a detailed perspective on an individual’s emotional state. The challenge of synchronizing these modalities involves matching their temporal dynamics and extracting relevant features indicative of emotional states. Groundbreaking efforts by Trigeorgis et al. in integrating audio and video data through advanced deep learning frameworks highlight significant progress in this domain, demonstrating potential for markedly improved recognition accuracy compared to single-modality approaches [50,77]. These developments emphasize the need for robust algorithms capable of effectively parsing and analyzing the intricate interactions between audio and visual data to boost the precision and usability of emotion recognition technologies.
These technologies offer more than just convenience; they aim to provide essential support in managing conditions like autism spectrum disorders, fatigue, and substance abuse through continuous monitoring and real-time feedback. The ability to accurately discern and react to emotional states via multimodal analysis is vital for these technologies to fulfill their potential [40,50,51]. Nonetheless, the path to effective deployment in real-world settings is laden with challenges, including the precise acquisition and analysis of complex spatio-temporal data across varied populations and conditions [74,75]. Moreover, developing extensive, well-annotated multimodal datasets for training effective models remains an expensive and time-consuming process.
To address these challenges, we propose the Unified Multimodal Classifier (UMC), which integrates neural network architectures to fuse multiple modalities effectively [30,32,39,71,72,73]. Unlike previous approaches that require the presence of all modalities and involve complex configurations tailored to specific tasks, UMC simplifies the integration process and enhances flexibility, facilitating application to a broader range of problems and accommodating the absence of certain modalities.
In this paper, we articulate our contributions as follows:
We introduce a novel, generalized methodology, UMC, that amalgamates data from various modalities to improve the classification accuracy of social media content significantly.
Our approach is designed to be robust, maintaining high classification performance despite the absence of one or more modalities, and is scalable to incorporate additional modalities or adapt to new application domains.
We have developed and will release two comprehensive datasets that include both textual and visual modalities, annotated with precise labels to facilitate extensive testing and future research in multimodal emotion analysis.
The remainder of the paper is structured as follows: Section 2 reviews existing literature in the fields of multimodal classification and emotion analysis. Section 3 details the traditional and our proposed fusion models. Section 4 elaborates on the specific methodologies employed in UMC. Experimental setups and results are discussed in Section 5, and Section 6 provides concluding remarks.
2. Related Work
The realm of multimodal classification in social media is broadly categorized into two distinct methodologies based on the integration of data from multiple sources.
Late fusion processes modalities independently before combining the outcomes at the decision-making stage, operating under the assumption of modality independence, which often does not hold, as different modalities usually depict correlated aspects of the same phenomena [22]. An innovative twist on late fusion utilizes the Kullback-Leibler divergence to align the results from different modalities, ensuring a more coherent decision process [27]. In contrast, early fusion merges modalities at the data level, creating a unified feature set for subsequent classification [28]. This approach is favored in applications like sentiment analysis, where methods such as LSTM networks integrate visual and textual data [25], or hierarchical classifiers manage complex event categorization from combined features [28]. Beyond these, intermediate fusion strategies employ techniques like Latent Dirichlet Allocation (LDA) or Canonical Correlation Analysis (CCA) to discover underlying relationships between modalities in contexts such as image and text classification [11]. Although effective, these multimodal classification frameworks often require the presence of all modalities and can be quite complex. Our Unified Multimodal Classifier (UMC) introduces a simplified, yet robust, variant of early fusion that adapts to the absence of modalities and integrates divergent neural network architectures for different modalities, moving away from the traditional Siamese network configurations [12], in which identical networks are used.
Emotion analysis techniques have evolved from leveraging hand-crafted features derived from art and psychology to utilizing deep learning models that automatically extract discriminative features. Traditional methods involve low-level features such as shape [13], color, and texture [15], often combined into more complex representations [2,8,9,17]. These approaches, while intuitive, require extensive expert knowledge to design and may not capture all emotion-relevant aspects. To overcome these limitations, recent advancements have shifted towards deep learning, particularly using Convolutional Neural Networks (CNNs), to autonomously learn features from data [3]. Such techniques have demonstrated superior performance in emotion recognition solely from images. However, they neglect the rich emotional context provided by textual data. Our UMC model innovatively combines both visual and textual modalities using a hybrid CNN architecture, significantly enhancing the accuracy and depth of emotion analysis.
The importance of emotion recognition has grown within the domain of human-computer interaction, propelling forward innovations from customer service interfaces to therapeutic applications. There has been considerable focus on enhancing recognition algorithms using machine learning models that process intricate datasets from facial, vocal, and biometric signals [101]. The use of CNNs and Recurrent Neural Networks (RNNs) has been emphasized to capture the nuanced dynamics of emotional expressions over time [51]. Recent research has delved into context-aware systems that adapt their processing based on situational nuances, addressing the variability and ambiguity typical of human emotions [66]. These systems aim not just to identify basic emotions but to understand complex affective states and their fluctuations, challenging traditional emotion recognition models with more adaptive and enriched frameworks [96,97,98,99,100,101].
In the study of audio-video analysis for emotion recognition, the integration of auditory and visual cues has been extensively explored to develop more precise and reliable systems. This multidisciplinary approach utilizes the strengths of each modality to overcome the limitations of the others, frequently employing sophisticated signal processing and deep learning techniques [53,74]. For instance, combining facial expression analysis with voice tone analysis enables systems to discern emotional subtleties that might remain ambiguous when analyzed in isolation [130]. Researchers have crafted frameworks that dynamically align audio and video streams, extracting temporally correlated features to enhance the coherence and accuracy of emotion detection processes [20,133]. These methodologies are crucial in advancing real-time emotion recognition systems, expanding their practical applications in fields like interactive media, surveillance, and telecommunications.
The automatic detection of emotional states via auditory signals has also seen considerable advancements, particularly concerning depression and emotion recognition. These systems use acoustic features to infer psychological states, drawing parallels in their application. France et al. demonstrated that variations in formant frequencies could reliably indicate depression and suicidal tendencies [37]. Cummings et al. and Moore et al. have successfully used energy, spectral, and prosodic features to classify depression with accuracy rates around 70-75% [36,45]. The rise of machine learning has led to the widespread use of deep neural networks, Long Short-Term Memory networks (LSTMs), and Convolutional Neural Networks (CNNs) in refining the accuracy of emotion detection systems [31,32,40,57,93,94,95].
The integration of multimodal data sources has proven to be an effective method for improving the accuracy and reliability of emotion recognition systems. This approach typically involves combining information at the feature, score, or decision level, with each modality providing complementary information that boosts the overall performance of the system [31,32,40]. Recent research has explored adaptive frameworks that intelligently merge input modalities, leveraging varying degrees of certainty from vocal and facial data to more accurately detect depression and other emotional states [34,41]. For instance, Meng et al. introduced a layered system that uses Motion History Histogram features, and Nasir et al. implemented a multi-resolution model that combines audio and video features for more effective depression diagnosis [44,46]. Williamson et al. developed a system that utilizes speech, prosody, and facial action units to assess the severity of depression, underscoring the value of multimodal integration [54].
Despite these advancements, challenges remain in deploying these technologies in real-world settings. Often, models are constructed using limited datasets that may not accurately represent the broader population, leading to potential biases and inaccuracies in emotion recognition [124,134,135]. The variability in data capture, which often relies on standard equipment in uncontrolled environments, adds complexity to the process. The dynamic nature of human expressions and environmental factors necessitates adaptable models capable of handling variations within classes and shifting domains. This paper addresses these issues with the Unified Multimodal Classifier (UMC), a deep learning approach designed to effectively integrate multimodal information for robust emotion recognition. This approach goes beyond traditional feature-level and score-level fusion, implementing a hybrid system that optimizes both features and classifiers for comprehensive multimodal integration.
3. Multimodal Social Media Classification
This section outlines the various modalities encountered in social media, which are incorporated into our Unified Multimodal Classifier (UMC) models. Additionally, we review traditional and contemporary approaches for integrating these modalities within classification frameworks.
3.1. Multimodality in Social Media
Multimodality refers to the utilization of multiple communicative modes, including textual, aural, linguistic, spatial, and visual resources, to convey messages in social media [14]. Our research primarily focuses on the textual and visual modalities prevalent in social media platforms. The flexibility of our proposed UMC models allows for easy adaptation to include additional modalities if required.
A social media post, denoted as x, may include an image (i), a text (s), or both, with an inherent semantic relationship assumed between text and image when both are present. Each image i is characterized by a feature vector v_i, which is derived using state-of-the-art convolutional neural networks (CNNs) trained on extensive image datasets. Textual content, ranging from short phrases to longer paragraphs, is represented by a vector v_s, extracted via advanced text processing techniques such as word embeddings or transformer-based models, moving beyond traditional bag-of-words schemas.
3.2. Multimodal Classification using Feature Level Fusion
Our UMC approach aims to assign a post x to one of the classes in a set Y, based on the highest probability across potential classes:

ŷ = argmax_{y ∈ Y} P(y | x).     (1)
For unimodal scenarios where x equals i or s, separate classifiers are trained for each modality. However, our focus is on the integration of both image and text modalities, utilizing traditional fusion techniques termed late fusion and early fusion, differentiated by the stage at which data integration occurs within the classification process.
3.2.1. Late Fusion
Late fusion involves the creation of two independent classifiers, one for images and one for texts. The final classification is determined by combining the output probabilities of these classifiers, where the predicted class maximizes the joint probability,

ŷ = argmax_{y ∈ Y} P(y | i) · P(y | s).
This method assumes conditional independence between modalities, which simplifies the classification but may not fully capture the interdependencies between text and image content.
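To make the decision rule concrete, the following minimal Python sketch multiplies the per-class outputs of two independently trained classifiers and picks the class with the highest joint probability; the three-class probability vectors are hypothetical and this is an illustration of the rule above rather than the exact implementation used in our experiments.

```python
import numpy as np

def late_fusion_predict(p_image: np.ndarray, p_text: np.ndarray) -> int:
    """Late fusion under conditional independence:
    y_hat = argmax_y P(y | i) * P(y | s)."""
    joint = p_image * p_text          # element-wise product of class probabilities
    return int(np.argmax(joint))

# Hypothetical outputs of the two unimodal classifiers for one post:
p_image = np.array([0.2, 0.5, 0.3])   # P(y | image)
p_text = np.array([0.1, 0.3, 0.6])    # P(y | text)
print(late_fusion_predict(p_image, p_text))   # -> 2
```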
3.2.2. Early Fusion
Contrary to late fusion, early fusion amalgamates the modalities at the feature level before classification, thus leveraging the interplay between text and image features. This fusion is mathematically represented as

x_f = [v_i ; v_s],

where x_f represents the concatenated feature vector of post x. The combined vector x_f facilitates the application of sophisticated classification models such as support vector machines (SVM) or more complex neural networks. In our research, we employ deep learning techniques to construct the classifier on this enriched feature space, capitalizing on the strengths of both modalities to enhance classification performance.
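As an illustration of feature-level fusion, the sketch below concatenates precomputed image and text feature matrices and trains a single classifier on the joint representation. The feature dimensions, the random placeholder data, and the choice of a linear SVM are assumptions made for brevity, not our exact experimental configuration.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
V_img = rng.normal(size=(100, 512))      # placeholder CNN image features (one row per post)
V_txt = rng.normal(size=(100, 200))      # placeholder text embeddings
y = rng.integers(0, 8, size=100)         # placeholder labels for eight emotion classes

# Early fusion: concatenate the modality features into one vector per post.
X = np.concatenate([V_img, V_txt], axis=1)   # shape (100, 712)

clf = LinearSVC().fit(X, y)              # any downstream classifier can be used here
print(clf.predict(X[:3]))
```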
4. Joint Fusion with Neural Network Models
While both late fusion and early fusion techniques integrate visual and textual data for classification, each method exhibits specific limitations. Late fusion, involving the creation of dual classifiers, often lacks efficiency, as the marginal gains from dual processing may not justify the complexity. Conversely, early fusion demands simultaneous availability of both image and text modalities, which might not always be feasible. In this section, we introduce two innovative approaches, termed joint fusion and common space fusion, which combine the simplicity of early fusion with the flexibility of late fusion under our Unified Multimodal Classifier (UMC) framework. These methods leverage neural networks for their adaptability and efficiency in learning directly from data, circumventing the need for extensive feature engineering.
4.1. Mathematical Notations and Neural Layers
A neural network in our framework models the probability P(y | x) using a parametric function f_θ (referenced in Equation 1), where θ encompasses all trainable parameters of the network. For an input x, the function f_θ operates through multiple layers:

f_θ(x) = f_L(f_{L-1}( ... f_1(x) ... )),

where L represents the total number of layers.
We denote matrices in bold uppercase letters (e.g., W) and vectors in bold lowercase letters (e.g., w, b). Element W_j refers to the j-th row of matrix W, and w_j denotes the j-th element of vector w. Vectors are assumed to be column vectors unless stated otherwise. Here, we describe two fundamental layers used in neural network-based classifiers: the linear layer and the softmax layer.
4.1.1. Linear Layer
This layer implements a linear transformation on its input h:

z = W h + b,

where W is the weight matrix and b represents the bias vector. The dimensions of these parameters adjust based on the input feature vector, whether it be from an image or text.
4.1.2. Softmax Layer
Following the computation of class scores z in the penultimate layer, the softmax layer converts these scores into a probability distribution:

P(y = k | x) = exp(z_k) / Σ_j exp(z_j).
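The two layers amount to only a few lines of code; the NumPy sketch below follows the definitions above (z = W h + b, then a softmax over the class scores), with illustrative dimensions.

```python
import numpy as np

def linear_layer(h: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Linear layer: map a feature vector to one score per class (z = W h + b)."""
    return W @ h + b

def softmax_layer(z: np.ndarray) -> np.ndarray:
    """Softmax layer: convert class scores into a probability distribution."""
    z = z - z.max()                  # subtract the maximum for numerical stability
    e = np.exp(z)
    return e / e.sum()

h = np.random.randn(712)             # illustrative fused post feature
W = np.random.randn(8, 712)          # eight emotion classes
b = np.zeros(8)
print(softmax_layer(linear_layer(h, W, b)).sum())   # sums to ~1.0
```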
4.2. Early Fusion as a Neural Network Model
Early fusion is depicted within our neural network model as a three-layer structure, where the first layer, f_1, combines the feature vectors from both modalities. The fusion layer performs a concatenation operation on the feature vectors v_i and v_s from image and text respectively, resulting in the combined post vector x_f. The subsequent layers, f_2 and f_3, are linear and softmax layers respectively, transforming the fused features into class probabilities. The learnable parameters in this setup are θ = {W, b}, belonging to the linear layer; they are adjusted during training to minimize the classification loss.
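A minimal PyTorch sketch of this three-layer view is given below: f_1 concatenates the modality features, f_2 is a linear layer, and f_3 is a softmax. The feature and class dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class EarlyFusionNet(nn.Module):
    """f_1: concatenate image and text features; f_2: linear layer; f_3: softmax."""
    def __init__(self, d_img=512, d_txt=200, n_classes=8):
        super().__init__()
        self.linear = nn.Linear(d_img + d_txt, n_classes)

    def forward(self, v_img, v_txt):
        x_f = torch.cat([v_img, v_txt], dim=-1)   # f_1: fusion by concatenation
        z = self.linear(x_f)                      # f_2: class scores
        return torch.softmax(z, dim=-1)           # f_3: class probabilities

model = EarlyFusionNet()
probs = model(torch.randn(4, 512), torch.randn(4, 200))
print(probs.shape)   # torch.Size([4, 8])
```

In practice the softmax is usually folded into the cross-entropy loss during training; it is shown explicitly here to mirror the layer decomposition above.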
4.3. Unified Feature Representations
A critical component of our UMC is the robust representation of images and texts. Images are processed through a convolutional neural network to extract a feature vector v_i, utilizing networks pre-trained on large datasets like ImageNet for initial weights, which can be fine-tuned to our specific task. Texts are transformed into vectors v_s using embedding techniques, followed by an aggregation layer that might employ operations like averaging or max pooling to condense word embeddings into a single text representation.
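The sketch below shows one plausible realization of these representations: image features from a pre-trained ResNet-18 with its classification head removed, and text features obtained by averaging word embeddings. The specific backbone and the embedding lookup table are illustrative assumptions rather than the exact components evaluated in this paper.

```python
import torch
import torchvision.models as models

# Image encoder: pre-trained CNN with the final classification layer removed.
resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
image_encoder = torch.nn.Sequential(*list(resnet.children())[:-1])   # 512-d pooled features

def encode_image(img_tensor: torch.Tensor) -> torch.Tensor:
    """img_tensor: normalized (3, 224, 224) image -> 512-d feature vector v_i."""
    with torch.no_grad():
        return image_encoder(img_tensor.unsqueeze(0)).flatten(1).squeeze(0)

def encode_text(tokens, embeddings, dim=200) -> torch.Tensor:
    """Average the embeddings of in-vocabulary tokens into a single vector v_s."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]   # embeddings: {token: tensor}
    return torch.stack(vecs).mean(dim=0) if vecs else torch.zeros(dim)
```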
4.4. Joint and Common Space Fusion Models
Our joint fusion model is designed to handle cases where only one modality is available by modifying how image and text features are integrated. Rather than simple concatenation, this model employs pooling strategies to merge features into a unified representation, ensuring that either modality can independently contribute to the classification task.
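A minimal sketch of such pooling-based joint fusion is shown below: each modality is projected into a shared dimension and an element-wise max pool is taken over whichever projections are present, so a post with a missing modality still yields a fixed-size representation. The projection sizes are illustrative and not the exact UMC configuration.

```python
import torch
import torch.nn as nn

class JointFusion(nn.Module):
    """Project each modality into a shared space, then max-pool over the available ones."""
    def __init__(self, d_img=512, d_txt=200, d_shared=256, n_classes=8):
        super().__init__()
        self.img_proj = nn.Linear(d_img, d_shared)
        self.txt_proj = nn.Linear(d_txt, d_shared)
        self.classifier = nn.Linear(d_shared, n_classes)

    def forward(self, v_img=None, v_txt=None):
        parts = []
        if v_img is not None:
            parts.append(self.img_proj(v_img))
        if v_txt is not None:
            parts.append(self.txt_proj(v_txt))
        pooled, _ = torch.stack(parts, dim=0).max(dim=0)   # element-wise max pooling
        return self.classifier(pooled)                     # class scores (logits)

model = JointFusion()
print(model(v_img=torch.randn(4, 512)).shape)                              # image-only posts
print(model(v_img=torch.randn(4, 512), v_txt=torch.randn(4, 200)).shape)   # both modalities
```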
In common space fusion, the aim is to align the feature spaces of both modalities, enhancing the model’s ability to learn from correlated features across different types of data. This alignment is facilitated by an auxiliary learning task that optimizes the feature representations to be similar for the same class and distinct across different classes, thus refining the model’s generalization capabilities across diverse social media posts.
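One way to realize the auxiliary alignment objective is a margin-based term that pulls the projected image and text vectors of the same post together while pushing apart projections belonging to different classes; the classification loss remains the main objective. The sketch below is a hedged illustration of this idea, with the margin and weighting treated as hypothetical hyperparameters rather than the exact objective optimized by UMC.

```python
import torch
import torch.nn.functional as F

def common_space_loss(z_img, z_txt, labels, margin=0.5):
    """Auxiliary alignment loss over projected features z_img, z_txt of shape (N, d)."""
    pos = F.pairwise_distance(z_img, z_txt)              # same-post image/text pairs
    neg = F.pairwise_distance(z_img, z_txt.roll(1, 0))   # mismatched pairs (shifted batch)
    same_class = (labels == labels.roll(1, 0)).float()   # do not push apart same-class pairs
    hinge = torch.clamp(margin - neg, min=0.0) * (1.0 - same_class)
    return pos.mean() + hinge.mean()

# Illustrative training objective:
# total_loss = classification_loss + lambda_aux * common_space_loss(z_img, z_txt, labels)
```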
5. Experiments with Emotion Classification
This section presents a comprehensive evaluation of our proposed Unified Multimodal Classifier (UMC) on the task of emotion classification using both visual and textual data. To the best of our knowledge, this is among the first approaches to address emotion classification from combined visual and textual social media content.
5.1. Emotion as Discrete Categories
In the realm of emotion classification, our objective is to assign a given social media post x to an appropriate emotion class from a predefined set Y, where the classes are mutually exclusive. We adopt Plutchik’s renowned model of primary emotions as the foundation for our classification schema [21].
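For reference, Plutchik’s model comprises eight primary emotions; a minimal encoding of such a discrete label set (assuming these eight categories) is:

```python
# Plutchik's eight primary emotions, treated as mutually exclusive classes.
EMOTIONS = ["joy", "trust", "fear", "surprise",
            "sadness", "disgust", "anger", "anticipation"]
LABEL_TO_ID = {name: idx for idx, name in enumerate(EMOTIONS)}
```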
5.2. Datasets
There is a notable absence of large-scale datasets comprising both visual and textual data specifically curated for emotion classification. This scarcity prompted us to both enhance an existing dataset and construct a new one from the ground up.
Enriching an image-only dataset. Originally compiled by You et al. [26], the flickr dataset contains images tagged according to eight emotional categories by multiple annotators from Amazon Mechanical Turk. We expanded this dataset by scraping titles and descriptions from Flickr, ensuring all textual data is in English and contains more than five words. The updated flickr dataset statistics are presented in Table 1.
Constructing an emotion dataset from scratch. We also developed a dataset by aggregating content from various Reddit subreddits associated with specific emotions, such as happy (joy), creepy (fear), rage (anger), and gore (disgust). Each selected submission included both an image and corresponding textual content, ensuring rich multimodal data. The collection process targeted posts with a high number of upvotes to ensure relevance and engagement.
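A hedged sketch of this collection step is shown below, using the PRAW Reddit API wrapper; the subreddit-to-emotion mapping mirrors the examples above, while the credentials, score threshold, and retrieval limit are placeholders.

```python
import praw  # Python Reddit API Wrapper

SUBREDDIT_TO_EMOTION = {"happy": "joy", "creepy": "fear",
                        "rage": "anger", "gore": "disgust"}

reddit = praw.Reddit(client_id="...", client_secret="...",
                     user_agent="emotion-dataset-builder")   # placeholder credentials

records = []
for sub, emotion in SUBREDDIT_TO_EMOTION.items():
    for post in reddit.subreddit(sub).top(limit=1000):        # highly upvoted submissions
        if post.score > 100 and post.url and post.title:       # keep engaged, multimodal posts
            records.append({"image_url": post.url, "text": post.title, "label": emotion})
```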
5.3. Experimental Setup
Our experimental framework employs GloVe word vectors trained on Twitter data, aligning with the social media context of our datasets. We chose a word vector size of 200 for optimal performance, as validated against other vector sizes. Each model’s performance was rigorously evaluated against a baseline of traditional and single-modality classifiers on both datasets, as shown in Table 2.
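For reproducibility, the sketch below shows one way to load the 200-dimensional GloVe Twitter vectors and turn a post’s text into a single vector by averaging its in-vocabulary tokens; the file path and the whitespace tokenization are illustrative choices.

```python
import numpy as np

def load_glove(path="glove.twitter.27B.200d.txt"):
    """Load GloVe vectors into a {word: vector} dictionary."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def text_vector(text, glove, dim=200):
    """Average the GloVe vectors of all in-vocabulary tokens of a post."""
    tokens = text.lower().split()
    vecs = [glove[t] for t in tokens if t in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim, dtype=np.float32)
```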
5.4. Results and Discussion
The results demonstrate that UMC consistently outperforms all baselines and traditional fusion methods, confirming the effectiveness of integrating visual and textual modalities for emotion classification. Specifically, the Common Space Fusion approach of UMC shows a slight advantage over Joint Fusion, suggesting that creating a shared feature space for different modalities enhances the classifier’s ability to generalize across diverse inputs.
5.5. Qualitative Analysis
Further qualitative analysis reveals specific instances where UMC correctly identifies complex emotions from multimodal inputs, highlighting scenarios where either visual or textual cues alone might lead to misclassification. This analysis substantiates the robustness of UMC in handling real-world, noisy social media data.
6. Conclusions and Future Work
In this study, we introduced the Unified Multimodal Classifier (UMC), a set of models designed to efficiently integrate data from various modalities to perform social media analysis. The UMC framework is designed to be highly adaptable, capable of processing inputs even when some modalities are absent, and demonstrates its robustness across different scenarios. Our experiments on emotion classification reveal that UMC, while straightforward in its architecture, delivers impressive performance, surpassing traditional unimodal and multimodal approaches. We supported our research with two custom-constructed multimodal datasets, which were instrumental in demonstrating the efficacy of our models under diverse conditions.
6.1. Contributions
The primary contributions of this work are:
Development of the UMC, which simplifies the integration of multimodal data for social media analysis.
Validation of UMC’s effectiveness in handling incomplete data modalities without compromising on performance.
Creation of two novel datasets that encompass a wide range of emotions, providing a comprehensive platform for testing multimodal emotion classification.
6.2. Future Directions
Looking forward, we aim to expand the capabilities of UMC by exploring additional modalities, such as structured and user-annotated data, which hold the potential to further enrich the models’ understanding and accuracy [7,19]. Moreover, while this research focused primarily on emotion classification, future applications could include, but are not limited to, sentiment analysis, behavioral prediction, and personalized content delivery.
6.3. Challenges and Opportunities
The integration of increasingly complex modalities presents both challenges and opportunities. Challenges include the scalability of the models to handle large-scale data and the ability to maintain high accuracy levels across diverse datasets. On the other hand, opportunities lie in leveraging cutting-edge AI techniques like deep learning and natural language processing to enhance the interpretability and efficiency of multimodal data analysis.
6.4. Extending to Real-World Applications
We also plan to collaborate with industry partners to test the real-world applicability of UMC in various sectors such as marketing, healthcare, and public safety. These applications could benefit from advanced predictive analytics, offering insights that are not only accurate but also actionable.
In summary, the UMC framework sets a new benchmark for multimodal data fusion in social media analytics. Its ability to handle incomplete data, combined with high accuracy and simplicity of use, makes it a promising tool for researchers and practitioners alike. The continued development and application of UMC will undoubtedly open new avenues in the field of artificial intelligence and data science.