Preprint Article Version 1 This version is not peer-reviewed

Advancing Arabic Word Embeddings: A Multi-Corpora Approach with Optimized Hyperparameters and Custom Evaluation

Version 1 : Received: 15 October 2024 / Approved: 16 October 2024 / Online: 17 October 2024 (02:44:46 CEST)

How to cite: Allahim, A.; Cherif, A. Advancing Arabic Word Embeddings: A Multi-Corpora Approach with Optimized Hyperparameters and Custom Evaluation. Preprints 2024, 2024101288. https://doi.org/10.20944/preprints202410.1288.v1

Abstract

The expanding base of Arabic-speaking internet users presents a unique opportunity for researchers to tap into vast online Arabic resources. However, the scarcity of reliable Arabic word embedding models and the limited availability of Arabic corpora pose significant challenges. This paper addresses these gaps by training word embedding models on Arabic corpora and examining how varying hyperparameter values influences performance on different NLP tasks. Training data was collected from three distinct sources: Wikipedia, newspapers, and 31 Arabic books, and the specific impact of each corpus on the outcomes was analyzed. The models were evaluated on diverse NLP tasks, including sentiment analysis, similarity tests, and analogy tests. The findings revealed that both corpus size and hyperparameter values affected each test differently. In the analogy test, a larger vocabulary size improved the outcomes, with FastText models using the skip-gram architecture performing well on analogy questions. For sentiment analysis, vocabulary size likewise played a crucial role. In the similarity test, FastText models achieved the highest scores, with smaller window sizes and smaller vector sizes leading to better results. Overall, our models performed well, achieving accuracies of 99% and 90% on sentiment analysis and the analogy test, respectively, and a similarity score of 8 out of 10. These findings indicate that our models can serve as a valuable resource for Arabic NLP researchers, providing a robust tool for handling Arabic text.

Keywords

Word embedding; Word2Vec; FastText; Arabic embedding; Arabic corpus

Subject

Computer Science and Mathematics, Computer Science
