Improving English-Persian Neural Machine Translation System through Filtered Back-Translation Method
Abstract
This study utilizes the neural machine translation (NMT) approach to improve the VRU English-Persian NMT system. In an NMT system, the encoder takes a sequence of source words as input, and the decoder attends to the encoded source vectors through an attention mechanism and generates the target words. Because English-Persian is a low-resource language pair and little research has been carried out on it, it is important to augment the NMT system with additional data. The study explores two methods to enhance the VRU system: back-translation and data filtering. First, we created NMT models using two corpora, Amirkabir and Persica. To see whether higher ratios of synthetic data lead to decreases or increases in translation performance, we modeled different ratios using back-translation and found that back-translation significantly improved the VRU NMT system. Second, a filtering method was applied to eliminate noisy data using sentence-BLEU, Average Alignment Similarity (AAS), Maximum Alignment Similarity (MAS), the combination of AAS and MAS, and the combination of AAS, MAS, and sentence-BLEU. Results show that the combination of AAS, MAS, and sentence-BLEU produced the largest improvement, with a BLEU score of 30.65. The study concludes that the proposed methods effectively enhance the VRU English-Persian NMT system.

Keywords: AAS, Back-translation, BLEU, Filtering, MAS, Neural machine translation, Tensor2Tensor
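The filtering step described above scores each synthetic (back-translated) sentence pair with sentence-BLEU, AAS, and MAS and discards pairs that look noisy. The sketch below is a minimal illustration of how such a combined filter could be wired up; it is not the authors' implementation. The functions `embed_src`/`embed_tgt` (assumed cross-lingual word-embedding lookups), the round-trip translation used as the sentence-BLEU reference, and all thresholds are illustrative assumptions.

```python
# Minimal sketch of score-based filtering for back-translated pairs.
# Assumptions (not from the paper): embed_src/embed_tgt map tokens to vectors
# in a shared cross-lingual space, and sentence-BLEU is computed against a
# round-trip translation of the source. Thresholds are placeholders.
import numpy as np
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction


def cosine_matrix(src_vecs, tgt_vecs):
    """Pairwise cosine similarities between source and target word vectors."""
    src = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    tgt = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    return src @ tgt.T


def aas(sim):
    """Average Alignment Similarity: mean over all word-pair similarities."""
    return sim.mean()


def mas(sim):
    """Maximum Alignment Similarity: mean of each source word's best-matching
    target-word similarity (one direction only, for brevity)."""
    return sim.max(axis=1).mean()


def keep_pair(src_tokens, tgt_tokens, round_trip_tokens,
              embed_src, embed_tgt,
              bleu_thr=0.2, aas_thr=0.3, mas_thr=0.5):
    """Keep a synthetic pair only if sentence-BLEU, AAS, and MAS all pass
    their (illustrative) thresholds."""
    sim = cosine_matrix(
        np.stack([embed_src(w) for w in src_tokens]),
        np.stack([embed_tgt(w) for w in tgt_tokens]),
    )
    bleu = sentence_bleu([src_tokens], round_trip_tokens,
                         smoothing_function=SmoothingFunction().method1)
    return bleu >= bleu_thr and aas(sim) >= aas_thr and mas(sim) >= mas_thr
```

In this sketch a pair survives only if all three scores clear their thresholds; the paper's actual combination of AAS, MAS, and sentence-BLEU may weight, threshold, or sequence the scores differently.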
References
Abdulmumin, I., Galadanci, B. S., Isa, A., Kakudi, H. A., & Sinan, I. I. (2021). A Hybrid Approach for Improved Low Resource Neural Machine Translation using Monolingual Data. Engineering Letters, 29(4), 1478–1493.
Ahmadnia, B., & Aranovich, R. (2020). An Effective Optimization Method for Neural Machine Translation: The Case of English-Persian Bilingually Low-Resource Scenario. Proceedings of the 7th Workshop on Asian Translation (WAT 2020).
Ahmadnia, B., & Dorr, B. J. (2019). Augmenting Neural Machine Translation through Round-Trip Training Approach. Open Computer Science, 9, 268–278.
Ahmadnia, B., Dorr, B. J., & Aranovich, R. (2021). Impact of Filtering Generated Pseudo Bilingual Texts in Low-Resource Neural Machine Translation Enhancement: The Case of Persian-Spanish. Procedia Computer Science, 189, 136–141.
Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).
Currey, A., & Heafield, K. (2019). Zero-Resource Neural Machine Translation with Monolingual Pivot Data. Proceedings of the 3rd Workshop on Neural Generation and Translation (WNGT 2019), Association for Computational Linguistics, 99–107.
Edunov, S., Ott, M., Auli, M., & Grangier, D. (2018). Understanding Back-Translation at Scale. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 489–500.
Eghbalzadeh, H. B., Khadivi, Sh., & Khodabakhsh, A. (2012). Persica: A Persian Corpus for Multipurpose Text Mining and Natural Language Processing. In Sixth International Symposium on Telecommunications (IST), IEEE, Tehran.
Fadaee, M., & Monz, C. (2018). Back-Translation Sampling by Targeting Difficult Words in Neural Machine Translation. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 436–446.
Gülçehre, C., Firat, O., Xu, K., Cho, K., Barrault, L., Lin, H., Bougares, F., Schwenk, H., & Bengio, Y. (2015). On Using Monolingual Corpora in Neural Machine Translation. arXiv preprint arXiv:1503.03535v2.
He, D., Xia, Y., Qin, T., Wang, L., Yu, L., Liu, T., & Ma, W. (2016). Dual Learning for Machine Translation. 30th Conference on Neural Information Processing Systems (NIPS).
Hoang, C., Haffari, Gh., & Cohn, T. (2018). Improved Neural Machine Translation using Side Information. In Proceedings of the Australasian Language Technology Association Workshop 2018, Dunedin, New Zealand, 6–16.
Jabbari, F., Bakhshaei, S., Ziabary, S. M. M., & Khadivi, Sh. (2012, November). Developing an Open-domain English-Farsi Translation System Using AFEC: Amirkabir Bilingual Farsi-English Corpus. In The Fourth Workshop on Computational Approaches to Arabic Script-based Languages.
Jaiswal, N., Patidar, M., Kumari, S., Patwardhan, M., Karande, S., Agarwal, P., & Vig, L. (2020, December). Improving NMT via Filtered Back-translation. In Proceedings of the 7th Workshop on Asian Translation, 154–159.
Koehn, P., & Knowles, R. (2017). Six challenges for neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation, Vancouver. Association for Computational Linguistics, 28–39.
Lambert, P., Schwenk, H., Servan, Ch., & Abdul-Rauf, S. (2011). Investigations on translation model adaptation using monolingual data. In Proceedings of the Sixth Workshop on Statistical Machine Translation, WMT ’11, Stroudsburg, PA, USA. Association for Computational Linguistics, 284–293.
Lample, G., Conneau, A., Denoyer, L., & Ranzato, M. (2018). Unsupervised Machine Translation Using Monolingual Corpora Only. ICLR, 280–290.
Papineni, K., Roukos, S., Ward, T., & Zhu, W. (2002). BLEU: A Method for Automatic Evaluation of Machine Translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), 311–318.
Sabbagh-Jafari, M., & Ramezani, S. (2019). Filtering low quality parallel corpora, to improve the performance of neural machine translation by using the combination of sentence pair similarity criteria. Proceedings of the First National Interdisciplinary Conference on Iranian Studies, Linguistics and Translation Studies, Vali-e-Asr University (AJ) Rafsanjan.
Sennrich, R., Firat, O., Cho, K., Birch, A., Haddow, B., Hitschler, J., & Nădejde, M. (2017). Nematus: A Toolkit for Neural Machine Translation. In Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics, 65–68, Valencia, Spain. Association for Computational Linguistics.
Sennrich, R., Haddow, B., & Birch, A. (2016). Neural Machine Translation of Rare Words with Subword Units. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 1715–1725.
Song, Y., & Roth, D. (2015). Unsupervised sparse vector densification for short text similarity. In Proceedings of the 2015 conference of the North American chapter of the association for computational linguistics: human language technologies, 1275–1280.
License
Copyright (c) 2023 Pariya Razmdideh, Fatemeh Pour-Ali Momen-Abadi, Sajjad Ramezani
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Copyright Licensee: Iranian Journal of Translation Studies. This is an open access article distributed under the terms and conditions of the Creative Commons Attribution–NonCommercial 4.0 International (CC BY-NC 4.0) license.