Preprint Essay, Version 1 (this version is not peer-reviewed)

The Babel Effect: Analyzing Multilingual Performance Discrepancies in Large Language Models

Version 1: Received: 1 October 2024 / Approved: 2 October 2024 / Online: 3 October 2024 (09:00:22 CEST)

How to cite: Jha, B. The Babel Effect: Analyzing Multilingual Performance Discrepancies in Large Language Models. Preprints 2024, 2024100167. https://doi.org/10.20944/preprints202410.0167.v1

Abstract

Large Language Models (LLMs) such as GPT-4 and mBERT have revolutionized natural language processing (NLP) by providing multilingual capabilities, making it possible to develop models that handle diverse linguistic inputs across many languages. Despite these advances, however, a noticeable performance gap remains between high-resource languages such as English and low-resource languages such as Nepali or Malagasy. We term this phenomenon the "Babel Effect," highlighting the disproportionate performance that arises from differences in resource availability across languages.

This paper explores the root causes of these performance discrepancies in LLMs, focusing on the underlying challenges in tokenization, training, and data scarcity. We use cross-lingual benchmarks such as XGLUE and TyDiQA to quantify these performance variations and examine them in detail. Furthermore, we propose solutions, including enhancing tokenization strategies, employing data augmentation techniques, and refining fine-tuning methods. The paper concludes with a discussion of how these improvements can mitigate the Babel Effect and lead to more equitable language modeling across diverse linguistic contexts.
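To make the tokenization challenge concrete, the sketch below (not from the paper) compares subword fragmentation between a high-resource and a low-resource language. It assumes the Hugging Face transformers library and the publicly released mBERT tokenizer (bert-base-multilingual-cased); the sample sentences and the tokens-per-word ("fertility") measure are illustrative assumptions, not the paper's methodology.

```python
# Minimal sketch: comparing subword "fertility" (tokens per word) across
# languages with the mBERT tokenizer. Illustrative only; the paper's own
# evaluation relies on cross-lingual benchmarks such as XGLUE and TyDiQA.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

samples = {
    "English": "The model answers questions about everyday topics.",
    # Approximate Nepali translation of the English sentence above.
    "Nepali": "मोडेलले दैनिक विषयहरूबारे प्रश्नहरूको उत्तर दिन्छ।",
}

for lang, text in samples.items():
    words = text.split()
    tokens = tokenizer.tokenize(text)
    # A higher tokens-per-word ratio means each word is split into more
    # subword pieces, a disparity often associated with weaker model
    # performance in low-resource languages.
    ratio = len(tokens) / len(words)
    print(f"{lang}: {len(tokens)} tokens / {len(words)} words = {ratio:.2f} tokens per word")
```

In runs of this kind, the low-resource language typically shows a markedly higher tokens-per-word ratio, which is one way the resource imbalance described above surfaces at the tokenization stage.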

Keywords

Multilingual Language Models; Large Language Models; Low-resource Languages; Cross-lingual Learning; Natural Language Processing; Tokenization; Data Augmentation

Subject

Computer Science and Mathematics, Artificial Intelligence and Machine Learning
