Version 1: Received: 1 October 2024 / Approved: 2 October 2024 / Online: 3 October 2024 (09:00:22 CEST)
How to cite:
Jha, B. The Babel Effect: Analyzing Multilingual Performance Discrepancies in Large Language Models. Preprints 2024, 2024100167. https://doi.org/10.20944/preprints202410.0167.v1
APA Style
Jha, B. (2024). The Babel Effect: Analyzing Multilingual Performance Discrepancies in Large Language Models. Preprints. https://doi.org/10.20944/preprints202410.0167.v1
Chicago/Turabian Style
Jha, B. 2024. "The Babel Effect: Analyzing Multilingual Performance Discrepancies in Large Language Models." Preprints. https://doi.org/10.20944/preprints202410.0167.v1
Abstract
Large Language Models (LLMs) like GPT-4 and mBERT have revolutionized natural language processing (NLP) by providing multilingual capabilities, making it possible to develop models that handle inputs across many languages. Despite these advances, a noticeable gap remains between model performance in high-resource languages such as English and in low-resource languages such as Nepali or Malagasy. We term this phenomenon the "Babel Effect," highlighting the disproportionate performance that arises from differences in resource availability across languages. This paper explores the root causes of these performance discrepancies in LLMs, focusing on the underlying challenges in tokenization, training, and data scarcity. We use cross-lingual benchmarks, such as XGLUE and TyDiQA, to quantify these performance variations and examine them in detail. Furthermore, we propose solutions, including enhancing tokenization strategies, employing data augmentation techniques, and refining fine-tuning methods. The paper concludes with a discussion of how these improvements can mitigate the Babel Effect and lead to more equitable language modeling across diverse linguistic contexts.
Keywords
Multilingual Language Models; Large Language Models; Low-resource Languages; Cross-lingual Learning; Natural Language Processing; Tokenization; Data Augmentation
Subject
Computer Science and Mathematics, Artificial Intelligence and Machine Learning
Copyright:
This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.