MULTILINGUAL AUTOMATIC SPEECH RECOGNITION INTERFACE FOR TYPING: USABILITY STUDY AND PERFORMANCE EVALUATION FOR KAZAKH, RUSSIAN, AND ENGLISH
DOI: https://doi.org/10.37943/24AHNP6638

Keywords: automatic speech recognition (ASR), cognitive load, usability, human-computer interaction (HCI), human-AI interaction, speech-based typing

Abstract
We present a multilingual automatic speech recognition (ASR) system for Kazakh, Russian, and English designed for the trilingual community of Kazakhstan. Although prior research has shown that speech-based text entry can outperform conventional keyboard typing for human–computer interaction and for interaction with large language models (LLMs), little is known about its performance and usability in low-resource multilingual contexts, particularly for Kazakh. To address this gap, we fine-tuned a Whisper-based model on additional Kazakh speech data, reducing the Kazakh word error rate (WER) from 21.55% with the off-the-shelf OpenAI baseline to 8.84% while preserving competitive performance for Russian and English. We then conducted a user study with 38 participants from Nazarbayev University, who performed dictated reading and editing tasks in all three languages. We evaluated performance using words per minute (WPM), characters per minute (CPM), WER, and character error rate (CER), and assessed usability and cognitive effort with the System Usability Scale (SUS) and the Raw NASA Task Load Index (NASA-TLX). Participants reached high speech-based typing speeds without editing and moderate speeds with editing across all three languages. Importantly, there were no statistically significant differences between Kazakh, Russian, and English in error rates, cognitive load, or perceived usability. Users reported low cognitive load (NASA-TLX < 40) and consistently high usability (SUS > 80%), indicating that the interface is efficient, easy to use, and requires minimal mental effort. These results demonstrate that a Kazakh-adapted Whisper model enables accurate, usable, and low-effort multilingual ASR and highlight the potential of speech-driven text entry for trilingual contexts such as Kazakhstan.
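For readers who wish to see how the reported measures are conventionally computed, the sketch below illustrates the performance and usability metrics named in the abstract. It is a minimal, illustrative example rather than the authors' evaluation code: the open-source `jiwer` package is assumed for WER and CER, the standard five-characters-per-word convention is assumed for WPM, and SUS and Raw NASA-TLX follow their usual published scoring rules.

```python
# Illustrative metric computations (assumed implementations, not the study's code).
import jiwer


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate between a reference prompt and the ASR transcript."""
    return jiwer.wer(reference, hypothesis)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate between a reference prompt and the ASR transcript."""
    return jiwer.cer(reference, hypothesis)


def wpm(transcript: str, seconds: float) -> float:
    """Words per minute, using the common 5-characters-per-word convention."""
    return (len(transcript) / 5) / (seconds / 60)


def cpm(transcript: str, seconds: float) -> float:
    """Characters per minute of entered text."""
    return len(transcript) / (seconds / 60)


def sus_score(responses: list[int]) -> float:
    """System Usability Scale: 10 items rated 1-5, rescaled to 0-100."""
    assert len(responses) == 10
    odd = sum(r - 1 for r in responses[0::2])   # items 1, 3, 5, 7, 9
    even = sum(5 - r for r in responses[1::2])  # items 2, 4, 6, 8, 10
    return (odd + even) * 2.5


def raw_tlx(responses: list[float]) -> float:
    """Raw NASA-TLX: unweighted mean of the six workload subscales (0-100)."""
    assert len(responses) == 6
    return sum(responses) / 6
```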