Submitted: 18 November 2024
Posted: 20 November 2024
Abstract
This research investigates the impact of pre-processing techniques on the effectiveness of topic modeling algorithms for Arabic texts, comparing BERTopic, Latent Dirichlet Allocation (LDA), and Non-Negative Matrix Factorization (NMF). Using the Single-label Arabic News Articles Dataset (SANAD), which includes 195,174 Arabic news articles, this study explores pre-processing methods such as cleaning, stemming, normalization, and stop word removal, which are crucial steps given the complex morphology of Arabic. The influence of six different embedding models on topic modeling performance was also assessed. The originality of this work lies in optimizing BERTopic by adjusting the n-gram range parameter and combining it with different embedding models for Arabic topic modeling, a combination not examined in previous studies. Pre-processing techniques were fine-tuned to improve data quality before applying BERTopic, LDA, and NMF, and performance was assessed using topic coherence and topic diversity, with coherence measured using Normalized Pointwise Mutual Information (NPMI). The results show that the Tashaphyne stemmer significantly enhanced the performance of LDA and NMF. BERTopic, optimized with pre-processing and bi-grams, outperformed LDA and NMF in both coherence and diversity. The CAMeL-Lab/bert-base-arabic-camelbert-da embedding yielded the best results, emphasizing the importance of pre-processing in Arabic topic modeling.
Keywords:
1. Introduction
2. Related Work
Topic Modeling Comparative Studies Using Arabic Data Sets
3. Methodology
3.1. Data Set
3.2. Pre-Processing Techniques
- Text Normalization: Standardizing text by removing diacritics, numbers, punctuation, and special characters and unifying character variations [14].
- Tokenization: Splitting text into individual words or tokens, considering Arabic-specific linguistic rules [9].
- Stop Word Removal: Eliminating common Arabic words that do not contribute significant meaning [17].
- Stemming and Lemmatization: Reducing words to their root forms using two different stemmers, ISRI and Tashaphyne (a minimal pipeline sketch follows this list).
  - ISRI Stemmer: An Arabic stemmer based on the ISRI algorithm developed by the Information Science Research Institute. This stemmer removes prefixes, suffixes, and infixes from words and incorporates some degree of morphological analysis to accurately determine the roots of Arabic words. This approach generally enhances the accuracy of root identification [18].
  - Tashaphyne Stemmer: Provided by the Tashaphyne library, this is a light Arabic stemmer that merges light stemming and root-based stemming principles. It segments Arabic words into roots and patterns while also removing common prefixes and suffixes, without performing a full morphological analysis. The Tashaphyne stemmer has achieved remarkable results, outperforming other competitive stemmers (e.g., the Khoja, ISRI, Motaz/Light10, FARASA, and Assem stemmers) in extracting roots and stems [19].
- n-Gram Construction: Creating combinations of adjacent words (n-grams) to capture more context within the text [20].
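To make the pipeline above concrete, the following is a minimal sketch of Arabic pre-processing using NLTK's Arabic stop word list and ISRI stemmer together with the Tashaphyne light stemmer. The library choices, regex rules, and function names are illustrative assumptions, not the exact implementation used in this study.

```python
# Hedged sketch of the pre-processing steps listed above (normalization,
# tokenization, stop word removal, and stemming with ISRI or Tashaphyne).
# Library choices and regex rules are assumptions, not the authors' exact code.
import re

from nltk.corpus import stopwords            # requires nltk.download("stopwords")
from nltk.stem.isri import ISRIStemmer
from tashaphyne.stemming import ArabicLightStemmer

DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")     # harakat and tatweel
NON_ARABIC = re.compile(r"[^\u0621-\u064A\s]")        # digits, Latin letters, punctuation

ARABIC_STOPWORDS = set(stopwords.words("arabic"))
isri = ISRIStemmer()
tashaphyne = ArabicLightStemmer()


def normalize(text):
    """Remove diacritics and non-Arabic characters, and unify common letter variants."""
    text = DIACRITICS.sub("", text)
    text = NON_ARABIC.sub(" ", text)
    text = re.sub("[إأآ]", "ا", text)   # unify alef forms
    text = re.sub("ى", "ي", text)       # unify alef maqsura and ya
    text = re.sub("ة", "ه", text)       # unify ta marbuta and ha
    return re.sub(r"\s+", " ", text).strip()


def preprocess(text, stemmer="tashaphyne"):
    """Clean and tokenize a document, drop stop words, and optionally stem each token."""
    tokens = [t for t in normalize(text).split() if t not in ARABIC_STOPWORDS]
    if stemmer == "isri":
        tokens = [isri.stem(t) for t in tokens]
    elif stemmer == "tashaphyne":
        stems = []
        for t in tokens:
            tashaphyne.light_stem(t)       # segment the word into affixes, stem, and root
            stems.append(tashaphyne.get_stem())
        tokens = stems
    return tokens
```

N-grams (for instance, the bi-grams used later with BERTopic) can then be built from these tokens, e.g., via scikit-learn's CountVectorizer with ngram_range=(1, 2).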
3.3. Modeling
- NMF: A linear algebraic model that uses matrix factorization to identify thematic structures within a collection of texts. It decomposes a non-negative document-term matrix into lower-dimensional, non-negative factors in an unsupervised fashion, simplifying the complexity of the original matrix. For mining topics from brief texts, NMF-based learning models have proven to be a potent alternative to those based on LDA [23].
- BERTopic: A topic modeling method that employs an unsupervised clustering approach. It leverages Bidirectional Encoder Representations from Transformers (BERT) to generate contextual embeddings that capture the semantic content of documents, enabling the algorithm to identify topics through their contextual similarity. BERTopic follows a three-stage process: it first transforms documents into embedding representations using a pre-trained language model, then reduces the dimensionality of these embeddings to facilitate more effective clustering, and finally derives topic representations from the document clusters by applying a class-based variant of TF-IDF (c-TF-IDF), as detailed by Grootendorst [7]. A key strength of BERTopic lies in the precision of its clusters and its ability to propose names for topics based on those clusters. Unlike traditional topic modeling approaches, BERTopic does not require the number of topics to be pre-defined, offering a flexible and dynamic framework for topic discovery [24]. A configuration sketch follows this list.
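As a concrete illustration, the following is a minimal BERTopic configuration in the spirit of the setup studied here: a pre-trained Arabic BERT checkpoint supplies the document embeddings, and a CountVectorizer controls the n-gram range used for the class-based TF-IDF topic representations. The specific hyperparameters and the placeholder `docs` list are assumptions for illustration, not the exact experimental settings.

```python
# Hedged sketch: BERTopic with an Arabic BERT embedding model and bi-grams.
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer

docs = [...]  # placeholder: pre-processed SANAD articles (see Section 3.2)

# SentenceTransformer wraps the Hugging Face checkpoint (mean pooling is added
# automatically when the checkpoint is not a native sentence-transformers model).
embedding_model = SentenceTransformer("CAMeL-Lab/bert-base-arabic-camelbert-da")

# ngram_range=(1, 2) lets topic representations include two-word phrases.
vectorizer_model = CountVectorizer(ngram_range=(1, 2))

topic_model = BERTopic(
    embedding_model=embedding_model,
    vectorizer_model=vectorizer_model,
    nr_topics=None,                  # let BERTopic infer the number of topics
    calculate_probabilities=False,
)

topics, _ = topic_model.fit_transform(docs)
print(topic_model.get_topic_info().head())   # cluster sizes and suggested topic labels
print(topic_model.get_topic(0))              # top (word, c-TF-IDF weight) pairs for topic 0
```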
3.4. Evaluation Metrics
- Topic Coherence: This metric measures the interpretability of the topics, with coherence scores calculated using Normalized Pointwise Mutual Information (NPMI) [25]; a computation sketch for coherence and diversity follows this list.
- Topic Diversity: This metric assesses the variety of topics by calculating the proportion of unique words among the top words across all topics [12].
- Perplexity: Used only for LDA, this metric assesses how well the model predicts held-out text; lower values indicate a better predictive fit [12].
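The coherence and diversity scores reported below can be computed roughly as in the sketch below, which uses gensim's CoherenceModel for NPMI and a simple unique-word ratio for diversity. The helper names and the top-k choice are assumptions, not the exact evaluation code used in this study.

```python
# Hedged sketch of the evaluation metrics: NPMI coherence via gensim and
# topic diversity as the fraction of unique words among the topics' top words.
from gensim.corpora import Dictionary
from gensim.models.coherencemodel import CoherenceModel


def npmi_coherence(topic_words, tokenized_docs, topn=10):
    """topic_words: one list of top words per topic; tokenized_docs: the pre-processed corpus."""
    dictionary = Dictionary(tokenized_docs)
    cm = CoherenceModel(
        topics=topic_words,
        texts=tokenized_docs,
        dictionary=dictionary,
        coherence="c_npmi",   # NPMI-based coherence, ranging from -1 to 1
        topn=topn,
    )
    return cm.get_coherence()


def topic_diversity(topic_words, topn=10):
    """Proportion of unique words among the top-n words of all topics (1.0 = no overlap)."""
    top = [word for topic in topic_words for word in topic[:topn]]
    return len(set(top)) / len(top)
```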
4. Results
4.1. Training Experiments
- LDA Experiments: The results of the LDA experiments are summarized in Table 1.
- Original Articles without Pre-processing: This baseline experiment resulted in a low coherence score (0.012) and moderate topic diversity (0.571). The model exhibited high perplexity (17,652.513), indicating difficulty in predicting the unseen data, as well as the longest training time (649.00 s).
- Cleaning: After cleaning the text, the coherence score improved significantly to 0.040 and topic diversity increased to 0.729. The perplexity decreased to 14,655.042 and the training time was reduced to 380.82 s, reflecting the benefits of removing noise from the data.
- Cleaning + Stemming (Tashaphyne): This configuration achieved the highest coherence score (0.051) among the LDA experiments, with a slight decrease in topic diversity (0.724). The perplexity dropped dramatically to 2836.581 and the training time decreased further to 289.69 s. This indicates that Tashaphyne stemming effectively enhanced the model’s interpretability and prediction accuracy.
- Cleaning + Stemming (ISRI): While this setup resulted in a slightly lower coherence score (0.037) compared to Tashaphyne, it achieved the lowest perplexity (1424.293) among the LDA experiments, suggesting superior predictive performance. The training time was similar to the Tashaphyne setup, at 289.51 s.
Among the LDA experiments, Experiment 3, with basic cleaning and stemming using Tashaphyne, achieved the highest coherence score (0.051) and good topic diversity (0.724). However, when considering predictive performance, Experiment 4, cleaning + stemming (ISRI), stood out with the lowest perplexity. This suggests that, while Tashaphyne stemming can enhance the interpretability of topics, ISRI stemming significantly improves the model's ability to predict and generalize to unseen data.
- NMF Experiments: NMF was also tested under similar pre-processing configurations, with the results summarized in Table 2.
- Original Articles without Pre-processing: The NMF model initially achieved a coherence score of 0.032 and a lower topic diversity (0.557) than LDA. The training time was 2741.79 s, indicating that the model struggled with raw data.
- Cleaning: After cleaning, the coherence score improved to 0.038 and the topic diversity increased to 0.624. The training time significantly decreased to 1084.76 s, demonstrating that cleaning had a substantial positive impact.
- Cleaning + Stemming (Tashaphyne): This configuration resulted in the highest coherence score (0.071) and the best topic diversity (0.714) across all NMF experiments. The training time was further reduced to 528.50 s. The Tashaphyne stemmer clearly improved the model’s ability to produce coherent and diverse topics.
- Cleaning + Stemming (ISRI): The ISRI stemming method also improved coherence (0.053) and diversity (0.681), although not as much as Tashaphyne. The training time was the lowest among all NMF experiments, at 396.90 s, highlighting ISRI’s efficiency.
Experiment 3, which involved cleaning and stemming using Tashaphyne, provided the highest coherence score (0.071) and topic diversity (0.714), making it the preferred configuration for this technique. This indicates that Tashaphyne stemming significantly enhances the model's ability to produce coherent and diverse topics. Although Experiment 4 (with ISRI stemming) also improved performance, it did not match the results achieved with Tashaphyne. The Tashaphyne stemmer outperformed the ISRI stemmer primarily because ISRI's aggressive approach can lead to over-stemming: words are reduced so far that critical morphological details needed to differentiate topics are lost, and distinct words are erroneously conflated, yielding topics that are less meaningful and coherent. In contrast, Tashaphyne employs a lighter stemming technique that preserves more of the original word form, retaining more accurate and relevant words and thereby enhancing the coherence of the generated topics.
- BERTopic Experiments: Various embedding models were tested using different pre-processing techniques, and the results are summarized in Table 3 (a sketch of this embedding-model sweep follows the per-embedding results below).
- UBC-NLP/MARBERT Embedding: Basic cleaning yielded the highest coherence score (0.110) and excellent topic diversity (0.871). Interestingly, stemming with ISRI slightly improved the topic diversity (0.886) but reduced coherence (0.046). Tashaphyne stemming resulted in a negative coherence score (−0.004), despite high topic diversity (0.857), suggesting that this stemmer may overly simplify the text for this embedding.
- xlm-roberta-base Embedding: Cleaning significantly improved the coherence to 0.107 and maintained high topic diversity (0.843). Stemming with Tashaphyne and ISRI improved diversity but slightly reduced coherence, indicating that, while stemming is beneficial, it might not always enhance coherence with this embedding.
- aubmindlab/bert-base-arabertv02 Embedding: Cleaning provided the best results, with a coherence score of 0.120 and topic diversity of 0.843. Stemming generally improved diversity, but had mixed effects on coherence.
- CAMeL-Lab/bert-base-arabic-camelbert-da Embedding: Cleaning produced the highest coherence score (0.211) and the best topic diversity (0.886) across all BERTopic experiments. Stemming, while still effective, did not surpass the results achieved with basic cleaning.
- asafaya/bert-base-arabic Embedding: Both cleaning and stemming methods performed well, with cleaning achieving a coherence score of 0.120 and topic diversity of 0.886. The original articles (without pre-processing) had slightly higher coherence (0.126) but lower diversity (0.843).
- qarib/bert-base-qarib Embedding: With this embedding, cleaning provided the highest coherence (0.144) and excellent topic diversity (0.900). Stemming also performed well, maintaining high diversity and coherence close to that obtained after cleaning.
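The embedding-model sweep summarized in Table 3 follows a simple pattern: fit BERTopic once per Hugging Face checkpoint and per pre-processing variant, then score the resulting topics. The sketch below shows that pattern for the cleaned articles only; the `docs` placeholder and the `npmi_coherence`/`topic_diversity` helpers refer to the earlier sketches and are assumptions, not the authors' experiment code.

```python
# Hedged sketch of the embedding-model comparison behind Table 3.
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

EMBEDDING_MODELS = [
    "UBC-NLP/MARBERT",
    "xlm-roberta-base",
    "aubmindlab/bert-base-arabertv02",
    "CAMeL-Lab/bert-base-arabic-camelbert-da",
    "asafaya/bert-base-arabic",
    "qarib/bert-base-qarib",
]

docs = [...]                        # placeholder: cleaned SANAD articles
tokenized = [d.split() for d in docs]

for name in EMBEDDING_MODELS:
    model = BERTopic(embedding_model=SentenceTransformer(name))
    model.fit(docs)
    # Collect each topic's top words, skipping the -1 outlier cluster.
    topic_words = [
        [word for word, _ in model.get_topic(topic_id)]
        for topic_id in model.get_topics()
        if topic_id != -1
    ]
    print(
        name,
        round(npmi_coherence(topic_words, tokenized), 3),
        round(topic_diversity(topic_words), 3),
    )
```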
4.2. Summary of Results
4.3. Evaluation and Optimization
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Li, X.; Lei, L. A bibliometric analysis of topic modelling studies (2000–2017). J. Inf. Sci. 2021, 47, 161–175. [Google Scholar] [CrossRef]
- Liu, L.; Tang, L.; Dong, W.; Yao, S.; Zhou, W. An overview of topic modeling and its current applications in bioinformatics. SpringerPlus 2016, 5, 1–22. [Google Scholar] [CrossRef] [PubMed]
- Blei, D.M.; Lafferty, J.D. Dynamic topic models. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, USA, 25–29 June 2006; pp. 113–120. [Google Scholar]
- Deerwester, S.; Dumais, S.T.; Furnas, G.W.; Landauer, T.K.; Harshman, R. Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci. 1990, 41, 391–407. [Google Scholar] [CrossRef]
- Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent dirichlet allocation. J. Mach. Learn. Res. 2003, 3, 993–1022. [Google Scholar]
- Lee, D.; Seung, H.S. Algorithms for non-negative matrix factorization. Adv. Neural Inf. Process. Syst. 2000, 13, 535–541. [Google Scholar]
- Grootendorst, M. BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv 2022, arXiv:2203.05794. [Google Scholar]
- Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
- Bagheri, R.; Entezarian, N.; Sharifi, M.H. Topic Modeling on System Thinking Themes Using Latent Dirichlet Allocation, Non-Negative Matrix Factorization and BERTopic. J. Syst. Think. Pract. 2023, 2, 33–56. [Google Scholar]
- Antoun, W.; Baly, F.; Hajj, H. Arabert: Transformer-based model for arabic language understanding. arXiv 2020, arXiv:2003.00104. [Google Scholar]
- Ma, T.; Al-Sabri, R.; Zhang, L.; Marah, B.; Al-Nabhan, N. The impact of weighting schemes and stemming process on topic modeling of arabic long and short texts. ACM Trans. Asian Low-Resour. Lang. Inf. Process. (TALLIP) 2020, 19, 1–23. [Google Scholar] [CrossRef]
- Abdelrazek, A.; Eid, Y.; Gawish, E.; Medhat, W.; Hassan, A. Topic modeling algorithms and applications: A survey. Inf. Syst. 2023, 112, 102131. [Google Scholar] [CrossRef]
- Abuzayed, A.; Al-Khalifa, H. BERT for Arabic topic modeling: An experimental study on BERTopic technique. Procedia Comput. Sci. 2021, 189, 191–194. [Google Scholar] [CrossRef]
- Al-Khalifa, S.; Alhumaidhi, F.; Alotaibi, H.; Al-Khalifa, H.S. ChatGPT across Arabic Twitter: A Study of Topics, Sentiments, and Sarcasm. Data 2023, 8, 171. [Google Scholar] [CrossRef]
- Berrimi, M.; Oussalah, M.; Moussaoui, A.; Saidi, M. A Comparative Study of Effective Approaches for Arabic Text Classification. SSRN Electron. J. 2023. Available online: https://ssrn.com/abstract=4361591. [Google Scholar]
- Einea, O.; Elnagar, A.; Al Debsi, R. Sanad: Single-label arabic news articles dataset for automatic text categorization. Data Brief 2019, 25, 104076. [Google Scholar] [CrossRef]
- El Kah, A.; Zeroual, I. The effects of pre-processing techniques on Arabic text classification. Int. J. 2021, 10, 1–12. [Google Scholar]
- Taghva, K.; Elkhoury, R.; Coombs, J. Arabic stemming without a root dictionary. In Proceedings of the International Conference on Information Technology: Coding and Computing (ITCC’05)-Volume II, Las Vegas, NV, USA, 4–6 April 2005; IEEE: Piscataway, NJ, USA, 2005; Volume 1, pp. 152–157. [Google Scholar]
- Al-Khatib, R.M.; Zerrouki, T.; Abu Shquier, M.M.; Balla, A. Tashaphyne0. 4: A new arabic light stemmer based on rhyzome modeling approach. Inf. Retr. J. 2023, 26, 14. [Google Scholar] [CrossRef]
- Nithyashree, V. What Are N-Grams and How to Implement Them in Python? 2021. Available online: https://www.analyticsvidhya.com/blog/2021/11/what-are-n-grams-and-how-to-implement-them-in-python/ (accessed on 15 October 2024).
- Abdelrazek, A.; Medhat, W.; Gawish, E.; Hassan, A. Topic Modeling on Arabic Language Dataset: Comparative Study. In Proceedings of the International Conference on Model and Data Engineering, Cairo, Egypt, 21–24 November 2022; Springer: Berlin, Germany, 2022; pp. 61–71. [Google Scholar]
- Jelodar, H.; Wang, Y.; Yuan, C.; Feng, X.; Jiang, X.; Li, Y.; Zhao, L. Latent Dirichlet allocation (LDA) and topic modeling: Models, applications, a survey. Multimed. Tools Appl. 2019, 78, 15169–15211. [Google Scholar] [CrossRef]
- Chen, Y.; Zhang, H.; Liu, R.; Ye, Z.; Lin, J. Experimental explorations on short text topic mining between LDA and NMF based Schemes. Knowl.-Based Syst. 2019, 163, 1–13. [Google Scholar] [CrossRef]
- Egger, R.; Yu, J. A topic modeling comparison between lda, nmf, top2vec, and bertopic to demystify twitter posts. Front. Sociol. 2022, 7, 886498. [Google Scholar] [CrossRef]
- Chang, J.; Gerrish, S.; Wang, C.; Boyd-Graber, J.; Blei, D. Reading tea leaves: How humans interpret topic models. Adv. Neural Inf. Process. Syst. 2009, 22. [Google Scholar]
- Biniz, M. DataSet for Arabic Classification. Mendeley Data, V2, 2018. [CrossRef]


Table 1. Results of the LDA experiments under different pre-processing configurations.

| Pre-processing | NPMI Coherence Score ↑ | Topic Diversity ↑ | Perplexity ↓ |
|---|---|---|---|
| Original articles (without pre-processing) | 0.012 | 0.571 | 17,652.513 |
| Cleaning | 0.040 | 0.729 | 14,655.042 |
| Cleaning + stemming (Tashaphyne) | 0.051 | 0.724 | 2836.581 |
| Cleaning + stemming (ISRI) | 0.037 | 0.643 | 1424.293 |
Table 2. Results of the NMF experiments under different pre-processing configurations.

| Pre-processing | NPMI Coherence Score ↑ | Topic Diversity ↑ |
|---|---|---|
| Original articles (without pre-processing) | 0.032 | 0.557 |
| Cleaning | 0.038 | 0.624 |
| Cleaning + stemming (Tashaphyne) | 0.071 | 0.714 |
| Cleaning + stemming (ISRI) | 0.053 | 0.681 |
Table 3. Results of the BERTopic experiments with different embedding models and pre-processing configurations.

| # | Embedding Model | Pre-processing | NPMI Coherence Score ↑ | Topic Diversity ↑ |
|---|---|---|---|---|
| 1 | UBC-NLP/MARBERT | Original articles (without pre-processing) | 0.062 | 0.814 |
| 2 | | Cleaning | 0.110 | 0.871 |
| 3 | | Cleaning + stemming (Tashaphyne) | −0.004 | 0.857 |
| 4 | | Cleaning + stemming (ISRI) | 0.046 | 0.886 |
| 5 | xlm-roberta-base | Original articles (without pre-processing) | −0.088 | 0.743 |
| 6 | | Cleaning | 0.107 | 0.843 |
| 7 | | Cleaning + stemming (Tashaphyne) | 0.104 | 0.800 |
| 8 | | Cleaning + stemming (ISRI) | 0.083 | 0.829 |
| 9 | aubmindlab/bert-base-arabertv02 | Original articles (without pre-processing) | 0.024 | 0.757 |
| 10 | | Cleaning | 0.120 | 0.843 |
| 11 | | Cleaning + stemming (Tashaphyne) | 0.074 | 0.800 |
| 12 | | Cleaning + stemming (ISRI) | 0.022 | 0.771 |
| 13 | CAMeL-Lab/bert-base-arabic-camelbert-da | Original articles (without pre-processing) | −0.040 | 0.857 |
| 14 | | Cleaning | 0.211 | 0.886 |
| 15 | | Cleaning + stemming (Tashaphyne) | 0.139 | 0.843 |
| 16 | | Cleaning + stemming (ISRI) | 0.062 | 0.814 |
| 17 | asafaya/bert-base-arabic | Original articles (without pre-processing) | 0.126 | 0.843 |
| 18 | | Cleaning | 0.120 | 0.886 |
| 19 | | Cleaning + stemming (Tashaphyne) | 0.089 | 0.871 |
| 20 | | Cleaning + stemming (ISRI) | 0.081 | 0.871 |
| 21 | qarib/bert-base-qarib | Original articles (without pre-processing) | −0.026 | 0.786 |
| 22 | | Cleaning | 0.144 | 0.900 |
| 23 | | Cleaning + stemming (Tashaphyne) | 0.138 | 0.900 |
| 24 | | Cleaning + stemming (ISRI) | 0.131 | 0.871 |
