HausaMT v1.0: Towards English-Hausa Neural Machine Translation
Adewale Akinfaderin
Data Duality
[email protected]
Abstract
Neural Machine Translation (NMT) for low-resource languages suffers from low performance because of the lack of large amounts of parallel data and language diversity. To contribute to ameliorating this problem, we built a baseline model for English–Hausa machine translation, which is considered a low-resource language task. The Hausa language is the second-largest Afro-Asiatic language in the world after Arabic, and it is the third-largest language for trading across a larger swath of West African countries, after English and French. In this paper, we curated different datasets containing a Hausa–English parallel corpus for our translation task. We trained baseline models and evaluated their performance using the Recurrent and Transformer encoder–decoder architectures with two tokenization approaches: standard word-level tokenization and Byte Pair Encoding (BPE) subword tokenization.
Introduction

Hausa is a language spoken in the western part of Africa. It belongs to the Afro-Asiatic phylum and is the second most spoken native language on the continent, after Swahili. The language is spoken by more than 40 million people as a first language, and about 15 million people use it as a second or third language. Most of the speakers are concentrated in Nigeria, Niger and Chad, resulting in both anglophone and francophone influences (Sabiu et al., 2018; Eberhard et al., 2019). Our work on curating datasets and creating an evaluation benchmark for English–Hausa Neural Machine Translation (NMT) is inspired by the socio-linguistic facts of the Hausa language. Hausa has been referred to as the largest internal political unit in Africa. There has been extensive linguistic academic research on Hausa, and the language benefits from trans-border communication in the West African Sahel belt and from the availability of international radio stations such as BBC Hausa and Voice of America Hausa (Odoje, 2013).

The exponential growth of social media platforms has eased communication among users. However, these advances in technological adoption have also created the need to translate between human languages. In low-resource countries, language inequality can be ameliorated by using machine translation to bridge gaps in technological, political and socio-economic advancement (Odoje, 2016). The recent successes of NMT over Phrase-Based Statistical Machine Translation (PBSMT) under high-resource data conditions can be leveraged to explore best practices, data curation and evaluation benchmarks for low-resource NMT tasks (Bentivogli et al., 2016; Isabelle et al., 2017). Using the JW300, Tanzil, Tatoeba and Wikimedia public datasets, we trained and evaluated baseline NMT models for the Hausa language.
Related Work

Hausa Words Embedding: Researchers have recently curated datasets and trained word embedding models for the Hausa language. The results from these trained models have been promising, with approximately a 300% increase in prediction accuracy over other baseline models (Abdulmumin & Galadanci, 2019).
Masakhane: Due to the linguistic complexity and morphological properties of languages native to the African continent, abstractions from successful resource-rich cross-lingual machine translation tasks often fail for low-resource NMT tasks. The Masakhane project was created to bridge this gap by facilitating open-source NMT research efforts for African languages (∀, Orife et al., 2020).

Datasets

For the HausaMT task, we used the JW300, Tanzil, Tatoeba and Wikimedia public datasets. The JW300 dataset is a crawl of the parallel data available on the Jehovah's Witnesses' website. Most of the data come from the magazines Awake! and Watchtower, and they cover a diverse range of societal topics in a religious context (Agić & Vulić, 2019). The Tatoeba database is a collection of parallel sentences in 330 languages (Raine, 2018); the dataset is crowdsourced and published under a Creative Commons Attribution 2.0 license. The Tanzil dataset is a multilingual text aimed at producing a highly verified multi-text of the Quran (Zarrabi-Zadeh et al., 2007). The Wikimedia dataset consists of parallel sentence pairs extracted and filtered from noisy parallel and comparable Wikipedia corpora (Wolk & Marasek, 2014). For this work, we trained on two datasets: 1) JW300 as our baseline, and 2) all the datasets combined. The number of tokens, number of sentences and statistical properties of the datasets are in Table 1.
Table 1: Sentence length (mean ± std), number of tokens and number of sentences for the English and Hausa sides of the JW300 dataset and of the combined dataset (∗combination of the JW300, Tanzil, Tatoeba and Wikimedia datasets). The JW300 English side has a mean sentence length of 18.11 tokens.
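The corpus statistics in Table 1 are straightforward to reproduce from the raw parallel files. Below is a minimal sketch, assuming whitespace-delimited tokens and hypothetical file names (jw300.en, jw300.ha); the tokenizer actually used for Table 1 is not specified in the paper.

    # Sketch: reproduce the Table 1 statistics (sentence count, token count,
    # mean +/- std sentence length) for one side of a parallel corpus.
    # File names are hypothetical; whitespace tokenization is an assumption.
    import statistics

    def corpus_stats(path):
        lengths = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                lengths.append(len(line.split()))  # tokens per sentence
        return {
            "sentences": len(lengths),
            "tokens": sum(lengths),
            "mean": statistics.mean(lengths),
            "std": statistics.stdev(lengths),
        }

    for side in ("jw300.en", "jw300.ha"):  # hypothetical file names
        s = corpus_stats(side)
        print(f"{side}: {s['sentences']} sentences, {s['tokens']} tokens, "
              f"{s['mean']:.2f} +/- {s['std']:.2f} tokens/sentence")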
Models

For our baseline model, we trained a recurrent model with Long Short-Term Memory (LSTM) networks as the encoder and decoder, using the Luong attention mechanism (Luong et al., 2015). To achieve an improved benchmark, we also trained a Transformer encoder–decoder model. The Transformer is based on the attention mechanism, and its training time is significantly faster than that of architectures based on convolutional or recurrent networks (Vaswani et al., 2017). For both the recurrent and Transformer architectures, we used an embedding size of 256, 256 hidden units, a batch size of 4096, and an encoder and decoder depth of 6.
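As a concrete illustration of these dimensions, the following sketch instantiates a Transformer encoder–decoder with the hyperparameters above, using torch.nn.Transformer as a stand-in for Joey NMT's model classes; the number of attention heads and the feed-forward size are not reported in the paper, so those values are assumptions.

    # Minimal sketch of the Transformer dimensions described above, using
    # torch.nn.Transformer as a stand-in for Joey NMT's model classes.
    # nhead and dim_feedforward are not reported in the paper; the values
    # below are illustrative assumptions.
    import torch
    import torch.nn as nn

    SRC_VOCAB = TGT_VOCAB = 4000               # e.g., the 4000 BPE tokens used later

    src_embed = nn.Embedding(SRC_VOCAB, 256)   # embedding size 256
    tgt_embed = nn.Embedding(TGT_VOCAB, 256)
    model = nn.Transformer(
        d_model=256,            # hidden size 256
        nhead=8,                # assumption: not reported in the paper
        num_encoder_layers=6,   # encoder depth 6
        num_decoder_layers=6,   # decoder depth 6
        dim_feedforward=1024,   # assumption: not reported in the paper
        batch_first=True,
    )
    generator = nn.Linear(256, TGT_VOCAB)

    # One forward pass over dummy token ids (batch of 2, source length 7,
    # target length 5) to check the shapes.
    src = torch.randint(0, SRC_VOCAB, (2, 7))
    tgt = torch.randint(0, TGT_VOCAB, (2, 5))
    logits = generator(model(src_embed(src), tgt_embed(tgt)))
    print(logits.shape)  # torch.Size([2, 5, 4000])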
Dataset   Model         BPE (dev / test)    Word (dev / test)
JW300     Recurrent     20.06 / 19.39       25.36 / 24.75
JW300     Transformer
All       Recurrent     31.89 / 33.48       40.78 / 42.29
All       Transformer
Table 2: BLEU scores for BPE and word-level tokenization on the dev and test sets.

To preprocess the parallel corpus, we used standard word-level tokenization and Byte Pair Encoding (BPE) (Gage, 1994). BPE is a subword tokenization that has become a successful choice in translation tasks. The model was trained with the 4000 BPE tokens used in a recent machine translation study for South African languages (Martinus & Abbott, 2019). To train our models, we used the Joey NMT minimalist toolkit, which is open source and based on PyTorch (Kreutzer et al., 2019). The models were trained on a Tesla P100 GPU. Training for the baseline and repeated tasks (datasets and tokenization types) took between 5 and 9 hours per run.
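The 4000-token BPE vocabulary described above can be learned with a few lines of code; the paper does not name its BPE implementation, so sentencepiece is shown here as one common choice, and the training file name is hypothetical.

    # Sketch: learn a 4000-symbol BPE vocabulary. The paper does not name its
    # BPE implementation; sentencepiece is one common choice. The training
    # file name is hypothetical.
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(
        input="train.en-ha.txt",  # hypothetical concatenated training text
        model_prefix="bpe4000",
        vocab_size=4000,          # the 4000 BPE tokens from Martinus & Abbott (2019)
        model_type="bpe",
    )

    sp = spm.SentencePieceProcessor(model_file="bpe4000.model")
    print(sp.encode("Me ya sa hakan yake da muhimmanci?", out_type=str))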
Results and Discussion

Evaluating the models on the test set, we observed that word-level tokenization outperforms BPE on BLEU by a factor of roughly 1.27–1.42 (Table 2). The quality of the English-to-Hausa translations with both word-level and BPE subword tokenization was rated positively by first-language speakers.
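The BLEU scores in Table 2 can be reproduced with any standard implementation; the paper does not say which one was used. The sketch below uses sacrebleu, with a tiny in-line hypothesis/reference pair standing in for real model output and test references.

    # Sketch: corpus-level BLEU as reported in Table 2. The paper does not
    # state which BLEU implementation was used; sacrebleu is shown here.
    import sacrebleu

    hyps = ["Me ya sa wannan yake da muhimmanci?"]
    refs = [["Me ya sa hakan yake da muhimmanci?"]]  # one reference stream

    bleu = sacrebleu.corpus_bleu(hyps, refs)
    print(f"BLEU = {bleu.score:.2f}")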
Table 3 shows some example translations.

Source: This is normal, because they themselves have not been anointed.
Reference: Hakan ba abin mamaki ba ne don ba a shafa su da ruhu mai tsarki ba.
Hypothesis: Wannan ba daidai ba ne, domin ba a shafe su ba.

Source: A white-haired man in a frock coat appears on screen.
Reference: Wani mutum mai furfura ya bayyana da dogon kwat a majigin.
Hypothesis: Wani mutum mai suna da wani mutum mai suna da ke cikin mota yana da nisa a cikin kabari.

Source: Why is that of vital importance?
Reference: Me ya sa hakan yake da muhimmanci?
Hypothesis: Me ya sa wannan yake da muhimmanci?

Table 3: Example translations.

Conclusion and Future Work

A significant portion of both the training and test datasets comes from the JW300 parallel data, which consists of religious texts. We acknowledge that to reach a viable state of real-world translation quality, we need to evaluate our model on "general" Hausa data; however, parallel data for other out-of-domain areas do not exist. High-yielding avenues for future work include evaluating on English texts in other domains and crowd-sourcing L1 speakers to manually evaluate the quality of the translations by editing them. The post-edited translations can then be used as references to calculate the evaluation metric. Other future work includes an empirical study of the effect of word-level and subword tokenization. Methods such as linguistically motivated vocabulary reduction (LMVR) have been shown to perform better for languages in the Afro-Asiatic family (Ataman & Federico, 2018). The datasets, pre-trained models and configurations are available on GitHub at https://github.com/WalePhenomenon/Hausa-NMT.

Acknowledgements
The author would like to thank Gabriel Idakwo for the qualitative analysis of the translations.
References
David M. Eberhard, Gary F. Simons, and Charles D. Fennig (eds.). 2019. Ethnologue: Languages of the World. Twenty-second edition.

Ibrahim T. Sabiu, Fakhrul A. Zainol, and Mohammed S. Abdullahi. 2018. Hausa People of Northern Nigeria and their Development. Asian People Journal (APJ), 1(1), pp. 179-189.

Clement Odoje. 2013. Language Inequality: Machine Translation as the Bridging Bridge for African Languages. 4(01).

Clement Odoje. 2016. The Peculiar Challenges of SMT to African Languages. ICT, Globalisation and the Study of Languages and Linguistics in Africa, p. 223.

Luisa Bentivogli, Arianna Bisazza, Mauro Cettolo, and Marcello Federico. 2016. Neural versus Phrase-Based Machine Translation Quality: a Case Study. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 257-267, Austin, Texas.

Pierre Isabelle, Colin Cherry, and George Foster. 2017. A Challenge Set Approach to Evaluating Machine Translation. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2486-2496, Copenhagen, Denmark.

Idris Abdulmumin and Bashir S. Galadanci. 2019. hauWE: Hausa Words Embedding for Natural Language Processing. 2019 2nd International Conference of the IEEE Nigeria Computer Chapter, pp. 1-6, Zaria, Nigeria.

∀, Iroro F. O. Orife, Julia Kreutzer, Blessing Sibanda, Kathleen Siminyu, Laura Martinus, Jamiil Toure Ali, Jade Abbott, Vukosi Marivate, Salomon Kabongo, Musie Meressa, Espoir Murhabazi, Orevaoghene Ahia, Elan van Biljon, Arshath Ramkilowan, Adewale Akinfaderin, Alp Öktem, Wole Akin, Ghollah Kioko, Kevin Degila, Herman Kamper, Bonaventure Dossou, Chris Emezue, Kelechi Ogueji, and Abdallah Bashir. 2020. Masakhane – Machine Translation For Africa. To appear in the Proceedings of the AfricaNLP Workshop, International Conference on Learning Representations (ICLR 2020).

Željko Agić and Ivan Vulić. 2019. JW300: A Wide-Coverage Parallel Corpus for Low-Resource Languages. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3204-3210, Florence, Italy.

Paul Raine. 2018. Building Sentences with Web 2.0 and the Tatoeba Database. Accents Asia, 10(2), pp. 2-7.

Hamid Zarrabi-Zadeh, Abbas Ahmadi, Morteza Bagheri, Yousef Daneshvar, Mohammad Derakhshani, Mohammad Fakharzadeh, Ehsan Fathi, Yusof Ganji, Mojtaba Haghighi, Nasser Lashgarian, Zahra Mousavian, Mohsen Saboorian, Yaser Shanjani, Mohammad-Reza Nikseresht, and Mahdi Mousavian. 2007. Tanzil Project. URL: http://tanzil.net/docs/home.

Krzysztof Wolk and Krzysztof Marasek. 2014. Building Subject-aligned Comparable Corpora and Mining it for Truly Parallel Sentence Pairs. Procedia Technology, Volume 18, pp. 126-132.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Philip Gage. 1994. A New Algorithm for Data Compression. C Users Journal, 12(2), pp. 23-38.

Laura Martinus and Jade Abbott. 2019. A Focus on Neural Machine Translation for African Languages. CoRR, abs/1906.05685. URL: http://arxiv.org/abs/1906.05685.

Julia Kreutzer, Joost Bastings, and Stefan Riezler. 2019. Joey NMT: A Minimalist NMT Toolkit for Novices. Proceedings of the 2019 EMNLP and the 9th IJCNLP (System Demonstrations), pp. 109-114, Hong Kong, China.

Duygu Ataman and Marcello Federico. 2018. An Evaluation of Two Vocabulary Reduction Methods for Neural Machine Translation. Proceedings of AMTA 2018, vol. 1: MT Research Track, pp. 97-110, Boston, MA.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective Approaches to Attention-based Neural Machine Translation. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1412-1421, Lisbon, Portugal.