An Efficient Method to Add Chunker Rules in Persian to English Rule-based Apertium Machine Translation System
Abstract
Rule-based machine translation (RBMT) captures linguistic information about the source and target languages. This information is drawn from (bilingual) dictionaries and grammar rules. This paper proposes an active learning (AL) method to grow structural transfer rules at the chunker level. To this end, two sets of experiments are performed on two types of sentences, manually selected and randomly selected, extracted from the Mizan English-Persian Parallel Corpus. The results show that adding newly written chunker rules to the transformation file using a pool-based AL technique improves the translation system more than a baseline that selects chunker rules at random.
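The contrast between pool-based selection and the random baseline can be illustrated with a minimal sketch. It is not the implementation used in the paper: the abstract does not state the selection criterion, so the utility below (how many previously uncovered development sentences a candidate rule would handle) is an assumed stand-in, and the rule names, the COVERAGE table, and both function names are hypothetical.

```python
import random
from typing import Callable, Dict, List, Sequence, Set

# Hypothetical toy data: for each candidate chunker rule, the ids of the
# development sentences it would handle. Purely illustrative values.
COVERAGE: Dict[str, Set[int]] = {
    "rule_a": {1, 2, 3, 4},
    "rule_b": {3, 4, 5},
    "rule_c": {6, 7},
    "rule_d": {1, 2},
}


def coverage_gain(rule: str, chosen: Set[str]) -> int:
    """Assumed utility: sentences newly covered if `rule` is added."""
    covered: Set[int] = set().union(*(COVERAGE[r] for r in chosen))
    return len(COVERAGE[rule] - covered)


def pool_based_selection(
    pool: Sequence[str],
    gain: Callable[[str, Set[str]], float],
    budget: int,
) -> List[str]:
    """Greedily pick up to `budget` rules, re-scoring the pool after each pick."""
    remaining = list(pool)
    selected: List[str] = []
    chosen: Set[str] = set()
    for _ in range(min(budget, len(remaining))):
        # Score every remaining candidate given the rules already selected.
        best = max(remaining, key=lambda rule: gain(rule, chosen))
        remaining.remove(best)
        selected.append(best)
        chosen.add(best)
    return selected


def random_selection(pool: Sequence[str], budget: int, seed: int = 0) -> List[str]:
    """Baseline: pick up to `budget` rules uniformly at random."""
    rng = random.Random(seed)
    return rng.sample(list(pool), min(budget, len(pool)))


if __name__ == "__main__":
    print(pool_based_selection(list(COVERAGE), coverage_gain, budget=2))
    print(random_selection(list(COVERAGE), budget=2))
```

In this toy setting the pool-based strategy re-evaluates every remaining candidate after each addition, whereas the baseline ignores the utility altogether. In a real experiment the utility would presumably be replaced by a measure of translation quality.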
Keywords:
Pool-based active learning, Rule-based machine translation, Apertium, Chunker rules
License
Copyright Licensee: Iranian Journal of Translation Studies. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution–NonCommercial 4.0 International (CC BY-NC 4.0 license).