Preprint
Article

TWIENG: A Multi-Domain Twi-English Parallel Corpus for Machine Translation of Twi, a Low-Resource African Language

This version is not peer-reviewed.

Submitted: 21 March 2022

Posted: 23 March 2022

Abstract
A Twi-English parallel corpus is an important resource for Machine Translation of Twi (ISO 639-3), a Low-Resource African Language (LRAL) spoken mainly in Ghana and Ivory Coast. A large-scale, multi-domain Twi-English parallel corpus is currently unavailable, partly because of the difficulty and arduous effort its design requires. In this paper, we present TWIENG: a large-scale multi-domain Twi-English parallel corpus. To create the corpus, we crawled sentences from the web using web crawlers, then translated, aligned, tokenized, and compiled them. The English sentences were drawn from Ghanaian indigenous electronic news portals, Ghanaian Parliamentary Hansards, the Twi Bible, and crowdsourcing via Google Forms. Professional translators and linguists translated the sentences, which were then aligned, tokenized, and compiled. The corpus was curated using Sketch Engine, a corpus manager and analysis software developed by Lexical Computing Limited, and was manually evaluated by professional Twi linguists. The corpus contains 5,419 parallel sentences.
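The align-and-tokenize step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline. It assumes that translations are already paired one-to-one by position, and it uses a simple regular-expression tokenizer in place of whatever tooling was actually used. The names `tokenize` and `build_parallel_corpus` are hypothetical, and the example sentence is illustrative rather than taken from TWIENG.

```python
import re
from typing import List, Tuple

# Hypothetical tokenizer: runs of word characters (optionally joined by an
# apostrophe or hyphen), or any single punctuation mark. Python's \w is
# Unicode-aware, so Twi letters such as ɛ and ɔ are kept inside words.
TOKEN_RE = re.compile(r"\w+(?:['’-]\w+)*|[^\w\s]")


def tokenize(sentence: str) -> List[str]:
    """Split a sentence into word and punctuation tokens."""
    return TOKEN_RE.findall(sentence)


def build_parallel_corpus(
    twi_sentences: List[str], english_sentences: List[str]
) -> List[Tuple[List[str], List[str]]]:
    """Pair sentences by position (assumed 1:1 alignment) and tokenize both sides."""
    if len(twi_sentences) != len(english_sentences):
        raise ValueError("Both sides must contain the same number of sentences")

    pairs = []
    for twi, english in zip(twi_sentences, english_sentences):
        twi, english = twi.strip(), english.strip()
        # Skip pairs where either side is empty after trimming whitespace.
        if not twi or not english:
            continue
        pairs.append((tokenize(twi), tokenize(english)))
    return pairs


if __name__ == "__main__":
    # Illustrative sentence pair, not drawn from the corpus.
    corpus = build_parallel_corpus(["Me din de Kofi."], ["My name is Kofi."])
    for twi_tokens, english_tokens in corpus:
        print(" ".join(twi_tokens), "\t", " ".join(english_tokens))
```

Positional pairing only works when each source sentence has exactly one translated counterpart; material that is split or merged during translation would need a dedicated sentence aligner.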
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.

Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.
