COMPARATIVE ANALYSIS OF MULTILINGUAL QA MODELS AND THEIR ADAPTATION TO THE KAZAKH LANGUAGE

Authors

Arailym Tleubayeva, Aday Shomanov

DOI:

https://doi.org/10.37943/19WHRK2878

Keywords:

Multilingual models, NLP, Kazakh language, mBERT, XLM-R, mT5, GPT, AYA, question-answering, low-resource languages

Abstract

This paper presents a comparative analysis of large pretrained multilingual models for question-answering (QA) systems, with a specific focus on their adaptation to the Kazakh language. The study evaluates models including mBERT, XLM-R, mT5, AYA, and GPT, which were tested on QA tasks using the Kazakh sKQuAD dataset. To enhance model performance, fine-tuning strategies such as adapter modules, data augmentation techniques (back-translation, paraphrasing), and hyperparameter optimization were applied. Specific adjustments to learning rates, batch sizes, and training epochs were made to boost accuracy and stability. Among the models tested, mT5 achieved the highest F1 score of 75.72%, showcasing robust generalization across diverse QA tasks. GPT-4-turbo closely followed with an F1 score of 73.33%, effectively managing complex Kazakh QA scenarios. In contrast, native Kazakh models like Kaz-RoBERTa showed improvements through fine-tuning but continued to lag behind larger multilingual models, underlining the need for additional Kazakh-specific training data and further architectural enhancements. Kazakh’s agglutinative morphology and the scarcity of high-quality training data present significant challenges for model adaptation. Adapter modules helped mitigate computational costs, allowing efficient fine-tuning in resource-constrained environments without significant performance loss. Data augmentation techniques, such as back-translation and paraphrasing, were instrumental in enriching the dataset, thereby enhancing model adaptability and robustness. This study underscores the importance of advanced fine-tuning and data augmentation strategies for QA systems tailored to low-resource languages like Kazakh. By addressing these challenges, this research aims to make AI technologies more inclusive and accessible, offering practical insights for improving natural language processing (NLP) capabilities in underrepresented languages. Ultimately, these findings contribute to bridging the gap between high-resource and low-resource language models, fostering a more equitable distribution of AI solutions across diverse linguistic contexts.
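To illustrate the adapter-based fine-tuning strategy described in the abstract, the following Python sketch shows one common way such a setup can be assembled with the Hugging Face transformers and peft libraries. It is an illustrative assumption rather than the authors' exact configuration: the base model (xlm-roberta-base), the LoRA-style adapter settings, and the hyperparameter values are placeholders of the kind adjusted in the study.

# Minimal sketch, not the authors' exact setup: parameter-efficient
# fine-tuning of a multilingual encoder for extractive Kazakh QA.
# LoRA adapters from the `peft` library stand in for the adapter
# modules mentioned in the abstract; all names and values below are
# illustrative assumptions.
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, TrainingArguments
from peft import LoraConfig, TaskType, get_peft_model

base = "xlm-roberta-base"  # stand-in for the multilingual models compared in the paper
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForQuestionAnswering.from_pretrained(base)

# Adapter-style tuning: the multilingual backbone is frozen and only
# small injected modules are trained, which keeps fine-tuning feasible
# in resource-constrained environments.
peft_config = LoraConfig(
    task_type=TaskType.QUESTION_ANS,
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()  # typically only a small fraction of the full model

# Hyperparameters of the kind tuned in the study (example values only).
args = TrainingArguments(
    output_dir="kazakh-qa-adapter",
    learning_rate=3e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
)

Because only the adapter parameters are updated while the pretrained backbone stays frozen, a setup of this kind keeps memory and compute requirements low, matching the resource-constrained scenario the abstract describes.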

Author Biographies

Arailym Tleubayeva, K. Kulazhanov Kazakh University of Technology and Business, Kazakhstan

Master of Technical Sciences, Senior Lecturer, Department of Information Technologies

Aday Shomanov, Nazarbayev University, Kazakhstan

PhD, Instructor, School of Engineering and Digital Sciences

 

References

Jurafsky, D., & Martin, J. H. (2024). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition with Language Models (3rd ed.). Retrieved August 20, 2024, from https://web.stanford.edu/~jurafsky/slp3/.

Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171-4186. https://doi.org/10.18653/v1/N19-1423.

Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., ... & Stoyanov, V. (2020). Unsupervised Cross-lingual Representation Learning at Scale. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 8440-8451. https://doi.org/10.18653/v1/2020.acl-main.747.

Xue, L., Constant, N., Roberts, A., Kale, M., Al-Rfou, R., Siddhant, A., ... & Raffel, C. (2021). mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 483-498. https://doi.org/10.18653/v1/2021.naacl-main.41.

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., ... & Amodei, D. (2020). Language Models are Few-Shot Learners. arXiv. https://doi.org/10.48550/arXiv.2005.14165.

Maxutov, A., Myrzakhmet, A., & Braslavski, P. (2024). Do LLMs speak Kazakh? A pilot evaluation of seven models. Proceedings of the First Workshop on Natural Language Processing for Turkic Languages (SIGTURK 2024), 81–91. https://aclanthology.org/2024.sigturk-1.8.

Alqahtani, S., Korolova, A., & Saleh, M. (2020). Question Answering on Kazakh Language Using Multilingual BERT. In Proceedings of the 12th Language Resources and Evaluation Conference, 5488-5494.

Mansurova, A., Nugumanova, A., & Makhambetova, Z. (2023). Development of a question-answering chatbot for blockchain domain. Scientific Journal of Astana IT University, 15(15), 27–40. https://doi.org/10.37943/15XNDZ6667.

Bimagambetova, Z., Rakhymzhanov, D., Jaxylykova, A., & Pak, A. (2023). Evaluating Large Language Models for Sentence Augmentation in Low-Resource Languages: A Case Study on Kazakh. 2023 19th International Asian School-Seminar on Optimization Problems of Complex Systems (OPCS), Novosibirsk, Russia, 14-18. https://doi.org/10.1109/OPCS59592.2023.10275753.

Shymbayev, M., & Alimzhanov, Y. (2023). Extractive Question Answering for Kazakh Language. 2023 IEEE International Conference on Smart Information Systems and Technologies (SIST), Astana, Kazakhstan, 401-405. https://doi.org/10.1109/SIST58284.2023.10223508.

Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., & Soricut, R. (2019). ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. arXiv preprint arXiv:1909.11942. https://doi.org/10.48550/arXiv.1909.11942.

Singh, A., Yadav, M., Verma, S., & Gupta, R. (2024). AYA: A 13 Billion Parameter Multilingual Model Based on mT5-xxl. Proceedings of the 2024 International Conference on Computational Linguistics, 102-114.

Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., ... & Gelly, S. (2019). Parameter-efficient Transfer Learning for NLP. In International Conference on Machine Learning, 2790-2799. PMLR.

Artetxe, M., & Schwenk, H. (2019). Massively Multilingual Sentence Embeddings for Zero-shot Cross-lingual Transfer and Beyond. Transactions of the Association for Computational Linguistics, 7, 597-610.

Li, Z., Li, X., Sheng, J., & Slamu, W. (2020). AgglutiFiT: Efficient Low-Resource Agglutinative Language Model Fine-Tuning. IEEE Access, 8, 148489-148499. https://doi.org/10.1109/ACCESS.2020.3015854.

Li, X., Li, Z., Sheng, J., & Slamu, W. (2020, October). Low-resource text classification via cross-lingual language model fine-tuning. In China National Conference on Chinese Computational Linguistics (pp. 231-246). Cham: Springer International Publishing. https://doi.org/10.1007/978-3-030-63031-7_17.

Yeshpanov, R., Efimov, P., Boytsov, L., Shalkarbayuli, A., & Braslavski, P. (2024). KazQAD: Kazakh open-domain question answering dataset. arXiv. https://arxiv.org/abs/2404.04487.

Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press. https://www.deeplearningbook.org/.

Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., de Laroussilhe, Q., Gesmundo, A., Attariyan, M., & Gelly, S. (2019). Parameter-efficient transfer learning for NLP. arXiv. https://doi.org/10.48550/arXiv.1902.00751.

Ghandour, R., Potams, A. J., Boulkaibet, I., Neji, B., & Al Barakeh, Z. (2021). Driver behavior classification system analysis using machine learning methods. Applied Sciences, 11(22), 10562. https://doi.org/10.3390/app112210562.

Nugumanova, A., Apayev, K., Saken, A., Quandyq, S., Mansurova, A., & Kamiluly, A. (2024). Developing a Kazakh Question-Answering Model: Standing on the Shoulders of Multilingual Giants. 2024 IEEE 4th International Conference on Smart Information Systems and Technologies (SIST), Astana, Kazakhstan, 600-605. https://doi.org/10.1109/SIST61555.2024.10629326.

Downloads

Published

2024-09-30

How to Cite

Tleubayeva, A., & Shomanov, A. (2024). COMPARATIVE ANALYSIS OF MULTILINGUAL QA MODELS AND THEIR ADAPTATION TO THE KAZAKH LANGUAGE. Scientific Journal of Astana IT University, 19, 89–97. https://doi.org/10.37943/19WHRK2878

Issue

Section

Information Technologies