AI-BASED QUESTION GENERATION FOR AVIATION TRAINING: COMPARING RETRIEVAL-AUGMENTED GENERATION AND FINE-TUNED MODELS


DOI:

https://doi.org/10.37943/25VYNZ9998

Keywords:

retrieval-augmented generation, question generation, fine-tuning, transformer models, aviation education, natural language processing, domain adaptation, low-rank adaptation, mistral

Abstract

This study examines how retrieval-augmented and fine-tuned architectures influence the cognitive complexity, terminology usage, and pedagogical characteristics of automatically generated aviation-related questions. The objective is to determine how different modeling strategies affect not only linguistic quality but also the educational value of generated content. A retrieval-augmented generation pipeline was implemented by combining vector-based document retrieval using Facebook AI Similarity Search with the Mistral-7B language model (seven billion parameters), applied to a curated knowledge base of 238 aviation documents. In parallel, a T5-small language model (60 million parameters) was fine-tuned using the Low-Rank Adaptation method on a dataset of 920 aviation context–question pairs.
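The retrieval step of such a pipeline can be sketched in miniature. The toy bag-of-words embedder, the example documents, and the `FlatIPIndex` class below are all illustrative stand-ins (the study itself uses FAISS with learned sentence embeddings and Mistral-7B for generation); the sketch shows only the core idea of inner-product retrieval over normalized vectors followed by prompt construction:

```python
import numpy as np

def tokenize(text):
    return [t.strip(".,?!").lower() for t in text.split()]

def embed(texts, vocab):
    """Toy bag-of-words embedding over a fixed vocabulary (a stand-in
    for a real sentence encoder such as Sentence-BERT)."""
    vecs = np.zeros((len(texts), len(vocab)))
    for i, text in enumerate(texts):
        for tok in tokenize(text):
            if tok in vocab:
                vecs[i, vocab[tok]] += 1.0
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / np.clip(norms, 1e-9, None)  # unit vectors: inner product = cosine

class FlatIPIndex:
    """Minimal exact inner-product search, mirroring the behavior of
    faiss.IndexFlatIP on a small corpus."""
    def __init__(self, vectors):
        self.vectors = vectors
    def search(self, queries, k):
        scores = queries @ self.vectors.T            # (n_queries, n_docs)
        order = np.argsort(-scores, axis=1)[:, :k]   # top-k per query
        return np.take_along_axis(scores, order, axis=1), order

docs = [
    "The aileron controls roll about the longitudinal axis.",
    "The rudder controls yaw about the vertical axis.",
    "ICAO publishes standards for competency-based training.",
]
vocab = {t: i for i, t in enumerate(sorted({w for d in docs for w in tokenize(d)}))}

index = FlatIPIndex(embed(docs, vocab))
_, ids = index.search(embed(["Which control surface governs roll?"], vocab), k=2)

# The top-ranked passage becomes the grounding context for the generator.
context = docs[ids[0, 0]]
prompt = f"Context: {context}\nWrite one exam question answerable from the context."
```

In the full pipeline, `prompt` would be passed to the language model, so generated questions stay grounded in the retrieved aviation documents rather than the model's parametric knowledge alone.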

Both systems were evaluated on a test set of 116 examples. The evaluation framework included expert-based assessment aligned with Bloom's taxonomy of cognitive learning objectives, as well as domain-specific criteria such as aviation terminology coverage and lexical diversity. In addition, widely used text similarity metrics were employed, including Bilingual Evaluation Understudy, Recall-Oriented Understudy for Gisting Evaluation with the longest common subsequence variant, and Bidirectional Encoder Representations from Transformers Score.
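As an illustration of one of these metrics, a minimal ROUGE-L implementation might look as follows. Tokenization is simplified to whitespace splitting, and the example strings are invented; production evaluations would typically use an established package such as `rouge-score`:

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists,
    computed with standard dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_f1(candidate, reference):
    """ROUGE-L F-score: harmonic mean of LCS-based precision and recall."""
    c, r = candidate.lower().split(), reference.lower().split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

reference = "what is the purpose of the rudder"
generated = "what is the function of the rudder"
score = rouge_l_f1(generated, reference)  # LCS = 6 of 7 tokens on each side
```

Here the longest common subsequence ("what is the ... of the rudder") spans six of seven tokens in both strings, so precision, recall, and F-score all equal 6/7.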

The results reveal distinct differences in the cognitive profiles of the generated questions. All questions produced by the fine-tuned model corresponded to the Knowledge level of Bloom's taxonomy, indicating a strong emphasis on factual recall. In contrast, the retrieval-augmented system generated questions that more frequently addressed higher cognitive levels, particularly Comprehension (53.3%) and Application (40.0%). It also demonstrated broader coverage of aviation terminology (92.2% compared to 44.0%) and greater output diversity (112 unique questions versus 56). Conversely, the fine-tuned model achieved higher similarity scores and approximately five times faster inference.

Author Biographies

Aruzhan Tugambayeva, Kazakh British Technical University

Master's student, Faculty of Information Technology

Aivar Sakhipov, Astana IT University

PhD, Assistant Professor, School of Software Engineering

References

Maity, S., Deroy, A., & Sarkar, S. (2025). Can large language models meet the challenge of generating school-level questions? Computers and Education: Artificial Intelligence, 8, 100370. https://doi.org/10.1016/j.caeai.2025.100370

Wang, H., Li, J., Wu, H., Hovy, E., & Sun, Y. (2022). Pre-trained language models and their applications. Engineering, 25, 51–65. https://doi.org/10.1016/j.eng.2022.04.024

Howard, J., & Ruder, S. (2018). Universal language model fine-tuning for text classification. Proceedings of the 56th Annual Meeting of the ACL, 328–339. https://doi.org/10.18653/v1/P18-1031

Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W., Rocktäschel, T., Riedel, S., & Kiela, D. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33, 9459–9474. https://proceedings.neurips.cc/paper/2020/hash/6b493230205f780e1bc26945df7481e5-Abstract.html

ICAO. (2020). Training development guide: Competency-based training methodology. Doc 9941. International Civil Aviation Organization. https://www.icao.int/

Du, X., Shao, J., & Cardie, C. (2017). Learning to ask: Neural question generation for reading comprehension. Proceedings of the 55th Annual Meeting of the ACL, 1342–1352. https://doi.org/10.18653/v1/P17-1123

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., & Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140), 1–67. https://jmlr.org/papers/v21/20-074.html

Rodriguez-Torrealba, R., Garcia-Lopez, E., & Garcia-Cabot, A. (2022). End-to-end generation of multiple-choice questions using text-to-text transfer transformer models. Expert Systems with Applications, 208, 118258. https://doi.org/10.1016/j.eswa.2022.118258

Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2022). LoRA: Low-rank adaptation of large language models. Proceedings of ICLR. https://doi.org/10.48550/arXiv.2106.09685

Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J., & Wang, H. (2024). Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997. https://doi.org/10.48550/arXiv.2312.10997

Johnson, J., Douze, M., & Jégou, H. (2019). Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 7(3), 535–547. https://doi.org/10.1109/TBDATA.2019.2921572

Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using Siamese BERT-networks. Proceedings of EMNLP-IJCNLP, 3982–3992. https://doi.org/10.18653/v1/D19-1410

Soudani, H., Kanoulas, E., & Hasibi, F. (2024). Fine tuning vs. retrieval augmented generation for less popular knowledge. SIGIR-AP 2024: Proceedings of the 2024 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region, 12–22. https://doi.org/10.1145/3673791.3698415

Wang, L., Chou, J., Tien, A., Zhou, X., & Baumgartner, D. (2024). AviationGPT: A large language model for the aviation domain. In AIAA AVIATION FORUM AND ASCEND 2024 (p. 4250). https://doi.org/10.48550/arXiv.2311.17686

Faraby, S. A., Adiwijaya, A., & Romadhony, A. (2023). Review on neural question generation for education purposes. International Journal of Artificial Intelligence in Education, 34(3), 1008–1045. https://doi.org/10.1007/s40593-023-00374-x

Karabacak, M., Ozkara, B. B., Margetis, K., Wintermark, M., & Bisdas, S. (2023). The advent of generative language models in medical education. JMIR Medical Education, 9, e48163. https://doi.org/10.2196/48163

Ling, J., & Afzaal, M. (2024). Automatic question-answer pairs generation using pre-trained large language models in higher education. Computers and Education: Artificial Intelligence, 6, 100252. https://doi.org/10.1016/j.caeai.2024.100252

Tugambayeva, A., & Sakhipov, A. (2025). Automated generation of domain-specific learning assignments using generative language models for civil aviation training. Vestnik AGAKAZ, 4(39), 211–224. https://doi.org/10.53364/24138614_2025_39_4_16

Sai, A. B., Mohankumar, A. K., & Khapra, M. M. (2022). A survey of evaluation metrics used for NLG systems. ACM Computing Surveys, 55(2), 1–39. https://doi.org/10.1145/3485766

Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., & Artzi, Y. (2020). BERTScore: Evaluating text generation with BERT. Proceedings of ICLR. https://doi.org/10.48550/arXiv.1904.09675

Larsen, T., Endo, B., Yee, A., Do, T., & Lo, S. (2022). Probing internal assumptions of the Revised Bloom's Taxonomy. CBE Life Sciences Education, 21(4), ar66. https://doi.org/10.1187/cbe.20-08-0170


Published

2026-03-30

How to Cite

Tugambayeva, A., & Sakhipov, A. (2026). AI-BASED QUESTION GENERATION FOR AVIATION TRAINING: COMPARING RETRIEVAL-AUGMENTED GENERATION AND FINE-TUNED MODELS. Scientific Journal of Astana IT University, 25. https://doi.org/10.37943/25VYNZ9998

Section

Information Technologies