A series of practical experiments was conducted to evaluate the effectiveness of the proposed ST-SA model on Arabic dialect vernaculars, and its performance in classifying the sentiment of Arabic dialects (ADs) was thoroughly examined.
4.1. Data
The proposed model was trained on three reference datasets. The first, HARD [25], contains reviews collected from various reservation websites and categorized into five distinct classes. The second, BRAD [24], whose reviews were collected from the Goodreads website and rated on a five-point scale, and the third, LABR [53], were likewise employed for training. All three are review-level datasets. The class distributions of HARD, BRAD, and LABR are detailed in Table 1, Table 2, and Table 3, respectively.

It is important to note that the datasets were obtained in their raw, unprocessed state, which could affect the reliability of the proposed model. All sentences therefore underwent preprocessing (a sketch of the cleaning pipeline is given after the list below): a sentence breaker segmented each review into individual sentences; Latin letters, non-Arabic characters, diacritics, hashtags, punctuation, and URLs were removed entirely from the AD texts; the texts were orthographically normalized for consistency; emoticons were replaced with their corresponding textual descriptions; and elongated words were adjusted. To prevent overfitting, we applied early stopping with a patience of three epochs, and a model-checkpoint mechanism saved the best-performing weights of the proposed ST-SA model, which utilizes MTL for sentiment analysis of Arabic vernaculars.

Beyond their division into training and testing portions, the HARD, BRAD, and LABR datasets yield valuable insight into how polarities are distributed among their samples. The HARD dataset, with 409,562 samples in total, is categorized into five polarities, each signifying a distinct sentiment or attitude. Allocating 80% of the dataset to training (327,649 samples) and 20% to testing (81,912 samples) ensures a comprehensive portrayal of the various polarities in both sets. Similarly, the BRAD dataset, comprising 510,598 samples, is divided into 80% (408,478 samples) for training and 20% (101,019 samples) for testing. Likewise, the LABR dataset, encompassing 63,257 samples, is split into 80% (50,606 samples) for training and 20% (12,651 samples) for testing. This partitioning guarantees that the five polarities are well represented in both the training and testing phases, allowing models to capture the subtleties of sentiment variation and to generalize effectively to unseen data.

Biases can wield significant sway over the effectiveness of sentiment analysis models: if biases are present in the training data, they can skew the outcomes. To tackle this concern and determine the appropriate data selection for the presented ST-SA sentiment analysis model for Arabic vernaculars, we took the following steps into account:
Guarantee that the training dataset comprises a multitude of origins and encompasses a broad spectrum of demographic profiles, geographic locales, and societal contexts. This approach serves the purpose of mitigating biases, resulting in a dataset that is not only more exhaustive but also more equitable in its composition.
Confirm that the sentiment labels in the training dataset are evenly distributed among all demographic segments and viewpoints. This helps to reduce the risk of over-generalization and biases stemming from an unequal distribution of sentiment instances.
Set forth precise labeling directives that explicitly guide human annotators to remain impartial and refrain from introducing their personal biases into the sentiment labels. This approach aids in upholding uniformity and reducing the potential for biases.
Conduct an exhaustive examination of the training data to pinpoint potential biases. This entails scrutinizing factors such as demographic disparities, stereotype reinforcement, and any inadequately represented groups. Upon identification, we implemented appropriate measures to rectify these biases, employing techniques such as data augmentation, oversampling of underrepresented groups, and dedicated preprocessing methods.
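To make the text cleaning concrete, the following is a minimal sketch of the preprocessing pipeline described above, written with standard Python regular expressions. The emoticon map, normalization rules, and sentence terminators shown here are illustrative assumptions; the actual implementation may use a larger dictionary and additional rules.

```python
import re

# Illustrative emoticon-to-description map (the full mapping is larger).
EMOTICONS = {":)": "سعيد", ":(": "حزين"}  # "happy", "sad"

# Arabic diacritic (tashkeel) code-point ranges.
DIACRITICS = re.compile(r"[\u0610-\u061A\u064B-\u065F\u0670]")

def clean_ad_text(text: str) -> str:
    """Apply the cleaning steps of Section 4.1 to one sentence."""
    # Replace emoticons with textual descriptions first,
    # before non-Arabic symbols are stripped.
    for emo, desc in EMOTICONS.items():
        text = text.replace(emo, f" {desc} ")
    # Remove URLs and hashtags.
    text = re.sub(r"https?://\S+|www\.\S+|#\S+", " ", text)
    # Strip diacritics.
    text = DIACRITICS.sub("", text)
    # Orthographic normalization: unify alef variants and alef maqsura.
    text = re.sub("[إأآ]", "ا", text)
    text = text.replace("ى", "ي")
    # Collapse elongated words, e.g. "جمييييل" -> "جميل".
    text = re.sub(r"(.)\1{2,}", r"\1", text)
    # Drop Latin letters, digits, punctuation, and other non-Arabic symbols.
    text = re.sub(r"[^\u0621-\u064A\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def split_sentences(review: str) -> list:
    """A simple sentence breaker: segment a review into sentences."""
    return [s.strip() for s in re.split(r"[.!?؟؛\n]+", review) if s.strip()]
```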
4.4. State-of-the-Art Approaches:
Using the five-point datasets BRAD, HARD, and LABR for AD sentiment analysis, the proposed ST-SA model was assessed against the latest benchmark methods. Logistic Regression (LR) with unigram, bigram, and TF-IDF features was introduced in [24] and applied to the BRAD dataset; in a similar vein, LR with unigram, bigram, and TF-IDF features was advocated in [25] and applied to the HARD dataset. The proposed ST-SA model was also subjected to a comparative analysis on the LABR dataset against the following reference methods: SVM, a support vector machine classifier with n-gram features, as recommended in [60]; MNB, a multinomial Naive Bayes approach with bag-of-words features, as outlined in [53]; HC, a model built from hierarchical classifiers using the divide-and-conquer technique introduced in [61]; and HC(KNN), an enhanced iteration of the hierarchical-classifier model, still rooted in the divide-and-conquer strategy, as delineated in [62]. In recent times, natural language processing (NLP) tasks have achieved remarkable proficiency through the bidirectional encoder representations from transformers model, known as BERT [63]. AraBERT [64], an Arabic pre-trained BERT model, was trained on three distinct corpora: OSIAN [65], Arabic Wikipedia, and an MSA corpus, together encompassing 1.5 billion words. We conducted a comparative analysis between the proposed ST-SA system for ADs and AraBERT [64], which has 768 hidden dimensions, 12 attention heads, and 12 encoder layers.
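For reference, a baseline such as AraBERT can be loaded and its geometry inspected with the HuggingFace transformers library. This is a sketch; the checkpoint identifier below is the commonly published one and should be verified against the release actually used.

```python
from transformers import AutoConfig, AutoModel, AutoTokenizer

# Published AraBERT checkpoint (identifier assumed; verify before use).
name = "aubmindlab/bert-base-arabert"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

cfg = AutoConfig.from_pretrained(name)
# BERT-base geometry: 768 hidden dimensions, 12 heads, 12 encoder layers.
print(cfg.hidden_size, cfg.num_attention_heads, cfg.num_hidden_layers)
```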
4.5. Results:
Numerous empirical experiments were conducted with the proposed ST-SA system for Arabic dialects. The system was trained with varying numbers of attention heads (AH) in the MHA sub-layer and varying numbers of encoders to ascertain the most efficient structure, as well as with varying word-embedding dimensions per token. This research also examined how training the proposed system with two multitasking methodologies, jointly and alternately, affects performance. The efficacy of the system's sentiment analysis was assessed using the automated accuracy metric. This section details the evaluation of the proposed ST-SA system on the five-polarity classification task for ADs. The results of the practical experiments on HARD, BRAD, and LABR are delineated in
Table 4,
Table 5, and
Table 6, respectively. The efficiency of the suggested ST-SA model under joint and alternate learning on HARD and BRAD is succinctly summarized in
Table 10. As elucidated in Figure 4, Table 4, and Table 7, the proposed ST-SA system achieved an accuracy of 84.02% on the imbalanced HARD dataset with 2 attention heads, 90 tokens, 10 experts, a batch size of 60, a filter size of 32, a dropout rate of 0.25, and an embedding dimension of 23 per token. This commendable accuracy was attained owing to the favorable combined impact of the MTL framework, the MoE mechanism, and the MHA approach, particularly on right-to-left texts such as ADs. MoE employs a collection of expert networks to grasp distinct facets of the input data and amalgamates their outputs via a gating network; this enables the model to dynamically select among parameter sets (i.e., expert modules) based on the input, so that the proposed model can detect sentiments accurately (a routing sketch follows this paragraph). When juxtaposed with the top-performing systems on the HARD dataset, the ST-SA model surpassed LR [66] by an accuracy differential of 7.92%, outshone AraBERT [64] by 3.17%, and outperformed the MTL-MHA SA model [46] by 2.19%. Consequently, the concurrent execution of related learning tasks augmented the pool of usable data and mitigated the risk of overfitting [67]. The presented system demonstrated proficiency in capturing both syntactic and semantic attributes, enabling it to discern the sentiments conveyed in AD sentences.
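To illustrate the routing idea, the following is a minimal NumPy sketch of a switch-style MoE layer with top-1 routing. The sizes mirror the best HARD configuration in Table 4, but the layer itself is an illustration of the mechanism rather than the exact implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class SwitchMoE:
    """Top-1 (switch) routing over a set of expert feed-forward networks."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        # One gating matrix, plus a two-layer FFN per expert.
        self.w_gate = rng.normal(0, 0.02, (d_model, n_experts))
        self.w1 = rng.normal(0, 0.02, (n_experts, d_model, d_ff))
        self.w2 = rng.normal(0, 0.02, (n_experts, d_ff, d_model))

    def __call__(self, tokens: np.ndarray) -> np.ndarray:
        # tokens: (n_tokens, d_model). The gate scores every expert per token.
        gate_probs = softmax(tokens @ self.w_gate)        # (n_tokens, n_experts)
        expert_ids = gate_probs.argmax(axis=-1)           # top-1 expert per token
        out = np.zeros_like(tokens)
        for e in np.unique(expert_ids):
            idx = expert_ids == e
            h = np.maximum(tokens[idx] @ self.w1[e], 0.0)  # ReLU FFN
            # Scale by the gate probability so routing stays differentiable
            # in a real (framework-based) implementation.
            out[idx] = (h @ self.w2[e]) * gate_probs[idx, e:e + 1]
        return out

# Example: route 90 token vectors of width 23 across 10 experts,
# mirroring the best HARD configuration reported in Table 4.
layer = SwitchMoE(d_model=23, d_ff=32, n_experts=10)
print(layer(np.random.default_rng(1).normal(size=(90, 23))).shape)  # (90, 23)
```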
Furthermore, the recommended ST-SA system demonstrated superior effectiveness on the imbalanced BRAD dataset. As depicted in Table 5, the proposed model achieved an accuracy of 68.81% with 3 attention heads, 24 tokens, 15 experts, a batch size of 53, a filter size of 30, a dropout rate of 0.24, and an embedding dimension of 50 per token. As elucidated in Table 8, the suggested ST-SA system surpassed the logistic regression (LR) approach advocated in [24] by an accuracy differential of 21.71%, outperformed the AraBERT model [64] by a margin of 7.96%, and exceeded the MTL-MHA SA system [46] by 7.08%. Additionally, the Switch-Transformer-based shared encoder (one per classification task) enabled the suggested model to glean a comprehensive representation encompassing the preceding, subsequent, and localized contexts of any position within a sentence; a minimal sketch of this shared-encoder multitask setup follows.
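The sketch below shows the shared-encoder multitask shape, together with the early stopping (patience of three epochs) and checkpointing described in Section 4.1, in a Keras-style setup. The dense shared block merely stands in for the Switch-Transformer encoder, and all sizes and names are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Illustrative sizes; the paper's shared encoder is a Switch-Transformer,
# represented here by a plain dense block for brevity.
N_TOKENS, EMB_DIM, N_CLASSES = 24, 50, 5

inputs = layers.Input(shape=(N_TOKENS, EMB_DIM))
shared = layers.Dense(128, activation="relu")(layers.Flatten()(inputs))

# One softmax head per five-polarity classification task.
head_hard = layers.Dense(N_CLASSES, activation="softmax", name="hard")(shared)
head_brad = layers.Dense(N_CLASSES, activation="softmax", name="brad")(shared)
model = tf.keras.Model(inputs, [head_hard, head_brad])

# Early stopping with a patience of three epochs, plus checkpointing
# of the best-performing weights, as described in Section 4.1.
callbacks = [
    tf.keras.callbacks.EarlyStopping(patience=3, restore_best_weights=True),
    tf.keras.callbacks.ModelCheckpoint("best_st_sa.keras", save_best_only=True),
]
model.compile(optimizer="adam",
              loss={"hard": "sparse_categorical_crossentropy",
                    "brad": "sparse_categorical_crossentropy"})
# model.fit(..., validation_data=..., callbacks=callbacks)
```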
Moreover, the suggested Switch-Transformer sentiment analysis model utilizing multitask learning (ST-SA), detailed in Table 6, exhibited exceptional performance on the demanding imbalanced LABR dataset, attaining a noteworthy accuracy of 83.91% and surpassing the alternative methodologies. This result was obtained with a configuration of 3 attention heads (AH), a filter size of 35, 100 tokens, 12 experts, a batch size of 70, a dropout rate of 0.27, and an embedding dimension of 60 per token. This accomplishment underscores the resilience of the ST-SA model in navigating the intricacies of sentiment analysis on an imbalanced dataset. As demonstrated in Table 9, the proposed ST-SA system exhibited its superiority over several alternative approaches by substantial margins: it displayed a remarkable accuracy differential of 33.61% over the SVM model [60], an impressive 38.91% over the MNB model [53], a significant 26.11% over the HC model [61], and a noteworthy 24.95% over AraBERT [64]; it even surpassed HC(KNN) [62] by 11.27%. Additionally, the proposed model surpassed the MTL-MHA SA model [46] by an accuracy differential of 5.78%.
Joint training, within the realm of deep learning, involves training a single neural network model concurrently on multiple interrelated tasks. Rather than training distinct models for each task, this approach enables the model to learn shared representations that can benefit all tasks, leading to enhanced adaptability, heightened efficiency, and potentially superior performance on each specific task. Imbalanced data signifies an uneven distribution of classes (or categories) within a dataset: one or more classes may possess notably fewer instances than others, which can bias deep learning models towards the majority class and degrade performance on minority classes. The evaluation results suggest that the presented ST-SA system exhibited strong efficiency under both joint and alternate learning; alternate training outperformed joint learning, yielding accuracies of 84.02% versus 76.62% on the imbalanced HARD dataset and 67.37% versus 64.23% on BRAD, as outlined in Table 10. Upon comparison with benchmark methods, it became evident that alternate training for five-point classification can yield more comprehensive feature representations of the text sequence than a single learning task. These outcomes highlight that alternate learning is better suited to complex SA tasks and can build a more robust latent representation for intricate AD SA tasks (a sketch contrasting the two regimes is given below). The discernible contrast in effectiveness between the two methodologies lies in how alternate learning is influenced by the volume of data in each task's dataset: shared layers tend to hold more information when a task has a larger dataset, whereas joint learning may become biased when one task's dataset is significantly larger than the other's. Consequently, alternate training methods are deemed more suitable for sentiment analysis of Arabic dialects, particularly when two distinct datasets serve different tasks, as in machine translation tasks where translation is conducted from AD to MSA and then to English [57]. The efficacy of each task can be augmented by designing the network in an alternate configuration, obviating the need for additional training data [58]. Moreover, related tasks can further bolster the efficiency of five-point classification. The significance of the enhancements observed in our proposed model's performance can be attributed to multiple factors. Surpassing state-of-the-art models such as AraBERT and LR is a notable achievement in itself, given AraBERT's established effectiveness in Arabic language processing tasks; by outperforming AraBERT on the same datasets, our proposed model demonstrates heightened precision in handling Arabic dialects. Additionally, even slight accuracy gains are significant, as they elevate the overall performance of models for processing Arabic dialects and have practical implications for sentiment analysis, information retrieval, and other NLP applications for Arabic dialects.
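The distinction between the two regimes can be made concrete with a short sketch: joint training takes one gradient step on the sum of both task losses, whereas alternate training cycles single-task steps. This assumes the two-headed Keras model sketched earlier and is an illustration, not the exact training code.

```python
import tensorflow as tf

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
opt = tf.keras.optimizers.Adam()

def apply_grads(model, tape, loss):
    grads = tape.gradient(loss, model.trainable_variables)
    # A head untouched by this loss receives no gradient; filter those out.
    opt.apply_gradients([(g, v) for g, v in
                         zip(grads, model.trainable_variables) if g is not None])

def joint_step(model, batch_hard, batch_brad):
    """Joint training: both task losses contribute to a single update."""
    (xa, ya), (xb, yb) = batch_hard, batch_brad
    with tf.GradientTape() as tape:
        loss = loss_fn(ya, model(xa)[0]) + loss_fn(yb, model(xb)[1])
    apply_grads(model, tape, loss)

def alternate_step(model, batch_hard, batch_brad):
    """Alternate training: one single-task update at a time."""
    for (x, y), head in ((batch_hard, 0), (batch_brad, 1)):
        with tf.GradientTape() as tape:
            loss = loss_fn(y, model(x)[head])
        apply_grads(model, tape, loss)
```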
Table 4.
Results of the proposed ST-SA model on the HARD dataset for the five-polarity classification task (E-D-T: embedding dimension per token; NT: number of tokens; AH: number of attention heads; FS: filter size; NE: number of experts; BS: batch size; DO: dropout rate).
| E-D-T | NT | AH | FS | NE | BS | DO | Accuracy (5-Polarity) |
|-------|-----|----|----|----|----|------|------------------------|
| 50 | 50 | 4 | 50 | 10 | 50 | 0.30 | 81.39% |
| 32 | 100 | 2 | 32 | 10 | 50 | 0.25 | 83.81% |
| 23 | 90 | 2 | 32 | 10 | 60 | 0.25 | 84.02% |
| 30 | 150 | 4 | 30 | 5 | 50 | 0.25 | 82.89% |
| 30 | 25 | 4 | 30 | 5 | 50 | 0.30 | 82.72% |
Table 5.
Results of the ST-SA model on the BRAD dataset for the five-polarity classification task (column abbreviations as in Table 4).
| E-D-T | NT | AH | FS | NE | BS | DO | Accuracy (5-Polarity) |
|-------|----|----|----|----|----|------|------------------------|
| 30 | 20 | 2 | 30 | 6 | 40 | 0.22 | 66.72% |
| 40 | 15 | 3 | 30 | 10 | 55 | 0.25 | 67.37% |
| 35 | 17 | 3 | 35 | 13 | 52 | 0.30 | 64.95% |
| 50 | 24 | 3 | 30 | 15 | 53 | 0.24 | 68.81% |
| 55 | 30 | 3 | 40 | 18 | 56 | 0.26 | 67.15% |
Table 6.
Results of the ST-SA model on the LABR dataset for the five-polarity classification task (column abbreviations as in Table 4).
| E-D-T | NT | AH | FS | NE | BS | DO | Accuracy (5-Polarity) |
|-------|-----|----|----|----|----|------|------------------------|
| 40 | 20 | 3 | 35 | 10 | 50 | 0.30 | 80.09% |
| 60 | 100 | 3 | 35 | 12 | 70 | 0.27 | 83.91% |
| 35 | 40 | 2 | 40 | 10 | 60 | 0.20 | 81.74% |
| 20 | 40 | 4 | 39 | 15 | 40 | 0.30 | 82.65% |
Table 7.
The performance of the proposed ST-SA model compared with benchmark approaches on the HARD dataset.
| Model | Polarity | Accuracy |
|-------|----------|----------|
| LR [66] | 5 | 76.1% |
| AraBERT [64] | 5 | 80.85% |
| MTL-MHA-SA [46] | 5 | 81.83% |
| The proposed ST-SA model | 5 | 84.02% |
Table 8.
The performance of the proposed ST-SA model compared with benchmark approaches on the BRAD dataset.
| Model | Polarity | Accuracy |
|-------|----------|----------|
| LR [24] | 5 | 47.7% |
| AraBERT [64] | 5 | 60.85% |
| MTL-MHA SA [46] | 5 | 61.73% |
| The proposed ST-SA model | 5 | 68.81% |
Table 9.
The performance of the proposed ST-SA model compared with benchmark approaches on the imbalanced LABR dataset.
| Model | Polarity | Accuracy |
|-------|----------|----------|
| SVM [60] | 5 | 50.3% |
| MNB [53] | 5 | 45.0% |
| HC [61] | 5 | 57.8% |
| AraBERT [64] | 5 | 58.96% |
| HC(KNN) [62] | 5 | 72.64% |
| MTL-MHA SA [46] | 5 | 78.13% |
| The proposed ST-SA model | 5 | 83.91% |
Table 10.
Performance of joint and alternate training for five-polarity classification.
| ST-SA Training Method | HARD (imbalanced) Accuracy | BRAD (imbalanced) Accuracy |
|-----------------------|----------------------------|----------------------------|
| Alternately | 84.02% | 67.37% |
| Jointly | 76.62% | 64.23% |
Figure 4.
The Evaluation Accuracy of the Proposed ST-SA Model in Comparison with State-of-the-Art Approaches on the HARD Test Dataset.
Figure 5.
The Evaluation Accuracy of the Proposed ST-SA Model in Comparison with State-of-the-Art Approaches on the BRAD Test Dataset.
Figure 6.
The Evaluation Accuracy of the Proposed ST-SA Model in Comparison with State-of-the-Art Approaches on the LABR Test Dataset.