MULTILINGUAL AUTOMATIC SPEECH RECOGNITION INTERFACE FOR TYPING: USABILITY STUDY AND PERFORMANCE EVALUATION FOR KAZAKH, RUSSIAN, AND ENGLISH
DOI: https://doi.org/10.37943/24AHNP6638

Keywords: automatic speech recognition (ASR), cognitive load, usability, human-computer interaction (HCI), human-AI interaction, speech-based typing

Abstract
We present a multilingual automatic speech recognition (ASR) system for Kazakh, Russian, and English designed for the trilingual community of Kazakhstan. Although prior research has shown that speech-based text entry can outperform conventional keyboard typing for human–computer interaction and for interaction with large language models (LLMs), little is known about its performance and usability in low-resource multilingual contexts, particularly for Kazakh. To address this gap, we fine-tuned a Whisper-based model on additional Kazakh speech data, reducing the Kazakh word error rate (WER) from 21.55% with the off-the-shelf OpenAI baseline to 8.84% while preserving competitive performance for Russian and English. We then conducted a user study with 38 participants from Nazarbayev University, who performed dictated reading and editing tasks in all three languages. We evaluated performance using words per minute (WPM), characters per minute (CPM), WER, and character error rate (CER), and assessed usability and cognitive effort with the System Usability Scale (SUS) and the Raw NASA Task Load Index (NASA-TLX). Participants reached high speech-based typing speeds without editing and moderate speeds with editing across all three languages. Importantly, there were no statistically significant differences between Kazakh, Russian, and English in error rates, cognitive load, or perceived usability. Users reported low cognitive load (NASA-TLX < 40) and consistently high usability (SUS > 80%), indicating that the interface is efficient, easy to use, and requires minimal mental effort. These results demonstrate that a Kazakh-adapted Whisper model enables accurate, usable, and low-effort multilingual ASR and highlight the potential of speech-driven text entry for trilingual contexts such as Kazakhstan.
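For readers who wish to see how the reported measures are conventionally computed, the sketch below illustrates the performance and usability metrics named in the abstract. It is a minimal, illustrative example rather than the authors' evaluation code: the open-source `jiwer` package is assumed for WER and CER, the standard five-characters-per-word convention is assumed for WPM, and SUS and Raw NASA-TLX follow their usual published scoring rules.

```python
# Illustrative metric computations (assumed implementations, not the study's code).
import jiwer


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate between a reference prompt and the ASR transcript."""
    return jiwer.wer(reference, hypothesis)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate between a reference prompt and the ASR transcript."""
    return jiwer.cer(reference, hypothesis)


def wpm(transcript: str, seconds: float) -> float:
    """Words per minute, using the common 5-characters-per-word convention."""
    return (len(transcript) / 5) / (seconds / 60)


def cpm(transcript: str, seconds: float) -> float:
    """Characters per minute of entered text."""
    return len(transcript) / (seconds / 60)


def sus_score(responses: list[int]) -> float:
    """System Usability Scale: 10 items rated 1-5, rescaled to 0-100."""
    assert len(responses) == 10
    odd = sum(r - 1 for r in responses[0::2])   # items 1, 3, 5, 7, 9
    even = sum(5 - r for r in responses[1::2])  # items 2, 4, 6, 8, 10
    return (odd + even) * 2.5


def raw_tlx(responses: list[float]) -> float:
    """Raw NASA-TLX: unweighted mean of the six workload subscales (0-100)."""
    assert len(responses) == 6
    return sum(responses) / 6
```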