Altmetrics
Downloads
73
Views
54
Comments
0
A peer-reviewed article of this preprint also exists.
This version is not peer-reviewed
Submitted:
15 October 2024
Posted:
17 October 2024
You are already at the latest version
The rise of abusive language on social media is a significant threat to mental health and 1 social cohesion. For Bengali speakers, the need for effective detection is critical. However, current 2 methods fall short in addressing the massive volume of content. Improved techniques are urgently 3 needed to combat online hate speech in Bengali. Traditional machine learning techniques, while 4 useful, often require large, linguistically diverse datasets to train models effectively. This paper 5 addresses the urgent need for improved hate speech detection methods in Bengali, aiming to fill the 6 existing research gap. Contextual understanding is crucial in differentiating between harmful speech 7 and benign expressions. Large language models (LLMs) have shown state-of-the-art performance in 8 various natural language tasks due to their extensive training on vast amounts of data. We explore the 9 application of LLMs, specifically GPT-3.5 Turbo and Gemini 1.5 Pro, for Bengali hate speech detection 10 using Zero-Shot and Few-Shot Learning approaches. Unlike conventional methods, Zero-Shot 11 Learning identifies hate speech without task-specific training data, making it highly adaptable to new 12 datasets and languages. Few-Shot Learning, on the other hand, requires minimal labeled examples, 13 allowing for efficient model training with limited resources. Our experimental results show that 14 LLMs outperform traditional approaches. In this study, we evaluated GPT-3.5 Turbo and Gemini 1.5 15 Pro on multiple datasets. To further enhance our study, we considered the distribution of comments 16 in different datasets and the challenge of class imbalance, which can affect model performance. The 17 BD-SHS dataset consists of 35,197 comments in the training set, 7,542 in the validation set, and 7,542 18 in the test set. The Bengali Hate Speech Dataset v1.0 & v2.0 includes comments distributed across 19 various hate categories: personal hate (629), political hate (1,771), religious hate (502), geopolitical hate 20 (1,179), and gender abusive hate (316). The Bengali Hate Dataset comprises 7,500 non-hate and 7,500 21 hate comments. GPT-3.5 Turbo achieved impressive results with 97.33%, 98.42%, and 98.53% accuracy. 22 In contrast, Gemini 1.5 Pro showed lower performance across all datasets. Specifically, GPT-3.5 Turbo 23 excelled with significantly higher accuracy compared to Gemini 1.5 Pro. These outcomes highlight a 24 6.28% increase in accuracy compared to traditional methods, which achieved 92.25%. Our research 25 contributes to the growing body of literature on LLM applications in natural language processing, 26 particularly in the context of low-resource languages.
Types | Authors | Year | Models Employed | Performance Metrics | Key Findings |
---|---|---|---|---|---|
Traditional Approaches | Manash et al. [12] | 2022 | Gated Recurrent Unit (GRU), Logistic Regression, Random Forest, Multinomial Naive Bayes (MNB), Support Vector Machine (SVM) | GRU: 78.89% accuracy, MNB: 80.51% accuracy | Developed a dataset of 2,000 Bengali comments; highlighted scarcity of Bengali datasets and importance of context-specific feature extraction; MNB and GRU models effective in detecting anti-social comments. |
Sherin et al. [13] | 2022 | Logistic Regression, Multinomial Naive Bayes, Random Forest, Support Vector Machine (SVM), Gradient Boosting | SVM: 85.7% accuracy | Dataset of 5,000 comments; emphasized challenges in multi-class classification for Bangla; binary classification was used; highlighted importance of data preprocessing and TFIDF feature extraction. | |
Istiaq et al. [14] | 2021 | Logistic Regression, Gated Recurrent Unit (GRU) | GRU: 98.89% accuracy | Created dataset from scratch with videos from YouTube; high accuracy with GRU model; logistic regression also showed high precision, recall, and F1 scores; focused on detecting hate speech in Bangla videos. | |
Deep Learning Approaches | Rezaul et al. [15] | 2020 | Multichannel Convolutional LSTM (MConv-LSTM), incorporating BengFastText | MConv-LSTM: F1-scores of 90.45% | Developed BengFastText, the largest Bengali word embedding model based on 250 million articles; created three extensive datasets; MC-LSTM with BengFastText outperformed baseline models. |
Nauros et al. [4] | 2022 | Bi-LSTM, Support Vector Machine (SVM) | Bi-LSTM: F1-score of 91.0% | Introduced BD-SHS, a large manually labeled dataset with over 50,200 offensive comments; Bi-LSTM trained with informal embeddings achieved highest F1-score; outperformed other pre-trained embeddings like BengFastText and MFT. | |
Amit et al. [1] | 2022 | LSTM, GRU, Attention-based decoders | Attention-based model: 77% accuracy | Proposed an encoder-decoder-based model for classifying Bengali Facebook comments; collected 7,425 comments across seven hate speech categories; attention-based model achieved highest accuracy; included Bangla Emot Module | |
Alvi et al. [16] | 2019 | GRU, Random Forest | GRU: 70.10% accuracy | Compiled and annotated a dataset of 5,126 comments into six classes; Random Forest achieved 52.20% accuracy, GRU model improved to 70.10%; emphasized importance of linguistic and quantitative feature extraction for Bengali. |
Types | Authors | Year | Models Employed | Performance Metrics | Key Findings |
---|---|---|---|---|---|
Transformer Based Approaches | Jobair et al. [3] | 2023 | BERT, SVM, LSTM, BiLSTM | BERT: 80% accuracy on new dataset, 97% accuracy on existing dataset | Compiled a dataset of 8600 comments; BERT showed highest accuracy at 80% on new dataset and 97% on existing dataset of 30,000 records; BERT outperformed SVM, LSTM, and BiLSTM. |
Mithun et al. [5] | 2022 | m-BERT, XLM-RoBERTa, IndicBERT, MuRIL | m-BERT: F1-score of 0.81 | Developed an annotated dataset of 10K Bengali posts (5K actual, 5K Romanized); XLM-RoBERTa performed best in separate training; MuRIL outperformed in joint and few-shot training scenarios. | |
Rezaul et al. [17] | 2020 | BERT variants (including XLM-RoBERTa), traditional ML models, DNN models (CNN, Bi-LSTM) | XLM-RoBERTa: F1-score of 87%, MCC score of 0.82 | Evaluated BERT variants; XLM-RoBERTa achieved highest F1-score of 87%; ensemble approach improved overall accuracy by 1.8%; highlighted challenges in detecting political hate speech; traditional ML models showed varied performance due to feature selection. | |
Large Language Models | Keyan et al. [7] | 2024 | GPT-3.5-turbo, Chain-of-Thought prompts | Accuracy: 0.85, Precision: 0.8, Recall: 0.95, F1 Score: 0.87 | Chain-of-Thought reasoning prompts significantly outperform other strategies, capturing intricate contextual details for accurate hate speech detection. |
Sarthak et al. [8] | 2023 | Flan-T5-large, text-davinci-003, GPT-3.5-turbo-0301 | F1 Scores: Flan-T5-large: 0.59 (HateXplain), 0.63 (implicit hate), text-davinci-003: 0.45 (HateXplain), 0.36 (implicit hate) | Flan-T5-large outperforms other models with vanilla prompts. Incorporating target community information into prompts yields a 20-30% performance boost. Precise prompt engineering is critical for optimizing LLMs in hate speech detection. | |
Flor et al. [9] | 2023 | mT0, FLAN-T5, multilingual XLM-RoBERTa | Macro-F1 Scores: FLAN-T5: 65.34 (English), 62.61 (Spanish), 57.29 (Italian) | Zero-shot learning with prompting can match or surpass fine-tuned models’ performance, particularly with instruction fine-tuned models. Prompt and model selection significantly impact accuracy. |
Architecture | Model | Layers | Attention Heads | Parameters | Objective Type During Training | Embedding Size |
---|---|---|---|---|---|---|
ELECTRA | BanglaBERT | 12 | 12 | 110M | MLM with Replaced Token Detection (RTD) | 768 |
BERT | BanglaBERT Base | 12 | 12 | 110M | Masked Language Model (MLM) | 768 |
mBERT | 12 | 12 | 110M | Multilingual Masked Language Model (MLM) | 768 | |
XLM-RoBERTa | 24 | 16 | 125M | Masked Language Model (MLM) | 768 | |
ALBERT | sahajBERT | 24 | 16 | 18M | Multilingual Masked Language Model (MLM) | 128 |
Model | Accuracy | Precision | Recall | F1-Score |
---|---|---|---|---|
BanglaBERT | 0.9225 | 0.9223 | 0.9227 | 0.9219 |
Bangla BERT Base | 0.9129 | 0.9130 | 0.9124 | 0.9127 |
mBERT | 0.9128 | 0.9130 | 0.9224 | 0.9219 |
XLM-RoBERTa | 0.9122 | 0.9136 | 0.9128 | 0.9027 |
sahajBERT | 0.9067 | 0.9088 | 0.9014 | 0.9039 |
Model | Accuracy | Precision | Recall | F1-Score |
---|---|---|---|---|
BanglaBERT | 0.8921 | 0.8814 | 0.8921 | 0.8920 |
Bangla BERT Base | 0.8853 | 0.8903 | 0.8853 | 0.8849 |
mBERT | 0.8793 | 0.8805 | 0.8793 | 0.8792 |
XLM-RoBERTa | 0.8723 | 0.8732 | 0.8723 | 0.8723 |
sahajBERT | 0.8793 | 0.8821 | 0.8793 | 0.8791 |
Model | Accuracy | Precision | Recall | F1-Score |
---|---|---|---|---|
BanglaBERT | 0.9042 | 0.9087 | 0.9025 | 0.9063 |
Bangla BERT Base | 0.9134 | 0.9176 | 0.9112 | 0.9154 |
mBERT | 0.9021 | 0.9143 | 0.9084 | 0.9126 |
XLM-RoBERTa | 0.8552 | 0.7768 | 0.8184 | 0.7892 |
sahajBERT | 0.8563 | 0.7807 | 0.8481 | 0.8014 |
Dataset | Model | Accuracy | Precision | Recall | F1-Score |
---|---|---|---|---|---|
Dataset 1 | GPT 3.5 Turbo | 0.8661 | 0.8669 | 0.8671 | 0.8665 |
Gemini 1.5 Pro | 0.8220 | 0.8218 | 0.8224 | 0.8219 | |
Dataset 2 | GPT 3.5 Turbo | 0.8029 | 0.8031 | 0.8024 | 0.8027 |
Gemini 1.5 Pro | 0.8130 | 0.8130 | 0.8130 | 0.8130 | |
Dataset 3 | GPT 3.5 Turbo | 0.8331 | 0.8330 | 0.8331 | 0.8331 |
Gemini 1.5 Pro | 0.8776 | 0.8782 | 0.8769 | 0.8775 |
Dataset | Model | Accuracy | Precision | Recall | F1-Score |
---|---|---|---|---|---|
Dataset 1 | GPT 3.5 Turbo | 0.9379 | 0.9385 | 0.9373 | 0.9379 |
Gemini 1.5 Pro | 0.9129 | 0.9130 | 0.9124 | 0.9127 | |
Dataset 2 | GPT 3.5 Turbo | 0.9378 | 0.9382 | 0.9374 | 0.9378 |
Gemini 1.5 Pro | 0.9365 | 0.9371 | 0.9379 | 0.9364 | |
Dataset 3 | GPT 3.5 Turbo | 0.9465 | 0.9463 | 0.9467 | 0.9465 |
Gemini 1.5 Pro | 0.9229 | 0.9230 | 0.9224 | 0.9227 |
Dataset | Model | Accuracy | Precision | Recall | F1-Score |
---|---|---|---|---|---|
Dataset 1 | GPT 3.5 Turbo | 0.9453 | 0.9448 | 0.9457 | 0.9452 |
Gemini 1.5 Pro | 0.9375 | 0.9372 | 0.9378 | 0.9376 | |
Dataset 2 | GPT 3.5 Turbo | 0.9567 | 0.9563 | 0.9569 | 0.9566 |
Gemini 1.5 Pro | 0.9667 | 0.9663 | 0.9669 | 0.9666 | |
Dataset 3 | GPT 3.5 Turbo | 0.9567 | 0.9563 | 0.9569 | 0.9566 |
Gemini 1.5 Pro | 0.9320 | 0.9318 | 0.9324 | 0.9319 |
Dataset | Model | Accuracy | Precision | Recall | F1-Score |
---|---|---|---|---|---|
Dataset 1 | GPT 3.5 Turbo | 0.9733 | 0.9731 | 0.9735 | 0.9733 |
Gemini 1.5 Pro | 0.9711 | 0.9702 | 0.9715 | 0.9713 | |
Dataset 2 | GPT 3.5 Turbo | 0.9842 | 0.9840 | 0.9844 | 0.9842 |
Gemini 1.5 Pro | 0.9723 | 0.9727 | 0.9726 | 0.9723 | |
Dataset 3 | GPT 3.5 Turbo | 0.9853 | 0.9851 | 0.9855 | 0.9853 |
Gemini 1.5 Pro | 0.9747 | 0.9743 | 0.9746 | 0.9748 |
Paper | Dataset | Approach | Performance Metrics | Comments |
---|---|---|---|---|
This paper | BD-SHS, Bengali Hate Speech Dataset v1.0, Bengali Hate Speech Dataset v2.0, Bengali Hate Dataset | In the context of 15-shot learning, both GPT 3.5 Turbo and Gemini 1.5 Pro were evaluated. | 97.33% in Dataset 1, 98.42% in Dataset 2, and 98.53% in Dataset 3. | GPT 3.5 Turbo excelled particularly in Dataset 1, Dataset 2 and Dataset 3, demonstrating significantly higher accuracy compared to Gemini 1.5 Pro. |
Saroar et al. [2] | Offensive posts filtered: 8.5k. Non-offensive posts identified: 8.5k. Final manually labeled dataset: 15k posts (balanced with 7.5k offensive and 7.5k non-offensive posts). | The existing BanglaBERT model, pre-trained on 18.6 GB of Bengali text (1 million steps over 3 billion tokens), was retrained with 1.5 million offensive posts for 15 epochs (almost 2 million steps) in batches of 64 samples using MLM and the Adam optimizer with a learning rate of 5e-5. | Bangla Hate BERT: Accuracy - 94.3%, F1 Score - 94.1% | The dataset is balanced with equal offensive and non-offensive posts, and high-quality labels from manual annotation. A limitation is the need for a large corpus for traditional models. However, LLMs can generalize from large-scale pre-existing datasets, reducing the need for extensive domain-specific annotated data. |
Rezaul et al. [15] | The dataset has 100,000 annotated hate speech statements, covering political, personal, gender-based, geopolitical, and religious hate, created with a bootstrapping and semi-automatic annotation approach. | The MC-LSTM integrates BengFastText embeddings for hate speech detection, capturing contextual and semantic information from Bengali texts. Additionally, traditional ML models (SVM, KNN, LR, NB, DT, RF, GBT) and embedding models (Word2Vec, GloVe) were trained for a comprehensive performance comparison. | Achieved up to 90.45% F1-score. | The authors’ traditional model training approach didn’t address the need for a large corpus. LLMs mitigate this by generalizing from large pre-existing datasets, showing that LLMs offer a more efficient and adaptive alternative to traditional methods. |
Nauros et al. [4] | BD-SHS, the largest Bangla hate speech dataset, consists of 50,281 comments manually labeled in different social contexts. 24,156 comments are tagged as hate speech (HS). | Various ML models, including SVM and Bi-LSTM, were used to identify and categorize hate speech, combined with word embeddings like pre-trained formal (BFT, MFT) and informal (IFT) embeddings. | Weighted F1-score of 91.00% | LLMs can be leveraged to mitigate the need for extensive labeled data, which is often time-consuming to gather, by utilizing few-shot learning techniques and transfer learning to achieve robust performance with minimal annotated examples. |
Jobair et al. [3] | The new dataset consists of 8,600 user comments from Facebook and YouTube, categorized into sports, religion, politics, entertainment, and others. | Conducted a comprehensive study using five distinct models to analyze abusive language in Bengali. The models tested include CNN, LSTM, Bi-LSTM, GRU, and BERT. Additionally, we ran these models on an existing dataset of 30,000 records to compare performance across different datasets. | The BERT model outperformed others with a 97% accuracy and an F1-score of 96%. | LLMs can effectively minimize the reliance on extensive labeled datasets. Leveraging techniques like few-shot learning, zero-shot learning, and transfer learning, LLMs achieve robust performance even with minimal annotated examples, circumventing the time-consuming process of gathering extensive labeled data. |
1 |
Category Name | Description |
---|---|
HS Comments |
|
NH Comments |
|
Parameter | Description | Value |
---|---|---|
Temperature | Controls randomness; lower values increase determinism, higher values increase diversity | 1.0 |
Top P | Selects from most probable tokens; 1.0 considers tokens until cumulative probability reaches 100%, balancing diversity and relevance | 1.0 |
Maximum Tokens | Limits number of generated tokens per response, ensuring concise and relevant outputs. | 256 |
Frequency Penalty | Penalizes model for generating frequently used tokens; 0.0 avoids bias towards common words in hate speech detection. | 0.0 |
Presence Penalty | Penalizes model based on presence of discouraged tokens or sequences; 0.0 ensures unbiased consideration of all text aspects in hate speech detection. | 0.0 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2024 MDPI (Basel, Switzerland) unless otherwise stated