KAZMORPHLM: MORPHEME-AWARE LANGUAGE MODEL FOR KAZAKH AUTOMATIC SPEECH RECOGNITION
DOI:
https://doi.org/10.37943/25SCDM2312
Keywords:
morpheme language model, Kazakh speech recognition, agglutinative morphology, vowel harmony, morpheme segmentation, n-gram interpolation, ASR rescoring, Turkic languages, low-resource ASR
Abstract
This paper presents KazMorphLM, a morpheme-aware language model for automatic speech recognition (ASR) in the Kazakh language. Kazakh belongs to the Turkic family and is characterised by a highly agglutinative morphology, in which a single root can generate a large number of inflected forms through productive suffixation. This property causes severe data sparsity for conventional word-level language models and reduces recognition accuracy.
The proposed model introduces three main innovations. First, a rule-based morpheme segmenter uses an inventory of 230 suffixes across fourteen grammatical categories and includes phonological validation through vowel harmony and consonant assimilation rules. Second, a two-level interpolated n-gram architecture combines a 7-gram morpheme-level model with a 5-gram word-level model using an interpolation ratio of 0.6 to 0.4 and Witten–Bell smoothing. Third, a four-channel rescoring mechanism integrates acoustic confidence, word-level and morpheme-level language-model probabilities, and a vowel-harmony consistency score.
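As a rough illustration of the second and third components, the sketch below combines a 0.6/0.4 interpolation with a four-channel hypothesis score. It is a minimal sketch under stated assumptions, not the paper's implementation: the interpolation is assumed to be linear in probability space, the vowel classes are a simplified textbook split, and the function names, channel weights, and n-best scores are all hypothetical values chosen for the example.

```python
# Simplified Kazakh vowel classes (assumption: и, у and vowels of
# Russian origin are treated as neutral and ignored).
BACK_VOWELS = set("аоұы")
FRONT_VOWELS = set("әөүіе")


def vowel_harmony_score(word: str) -> float:
    """Share of harmonic vowels that agree with the majority class."""
    w = word.lower()
    back = sum(ch in BACK_VOWELS for ch in w)
    front = sum(ch in FRONT_VOWELS for ch in w)
    total = back + front
    return 1.0 if total == 0 else max(back, front) / total


def interpolate(p_morph: float, p_word: float, lam: float = 0.6) -> float:
    """Linear interpolation: lam * P_morph + (1 - lam) * P_word."""
    return lam * p_morph + (1.0 - lam) * p_word


def rescore(hyp: dict, weights=(1.0, 0.4, 0.6, 0.5)) -> float:
    """Weighted sum of four channels; the weights are hypothetical."""
    w_ac, w_word, w_morph, w_vh = weights
    words = hyp["text"].split()
    # Average harmony score over words; an empty hypothesis counts as consistent.
    vh = sum(vowel_harmony_score(w) for w in words) / len(words) if words else 1.0
    return (
        w_ac * hyp["acoustic_logp"]
        + w_word * hyp["word_lm_logp"]
        + w_morph * hyp["morph_lm_logp"]
        + w_vh * vh
    )


# Made-up n-best list; the log-probabilities are illustrative only.
nbest = [
    {"text": "балалар келді", "acoustic_logp": -4.0,
     "word_lm_logp": -6.0, "morph_lm_logp": -5.0},
    {"text": "балалер келды", "acoustic_logp": -3.9,
     "word_lm_logp": -9.0, "morph_lm_logp": -8.5},
]
best = max(nbest, key=rescore)
print(best["text"])
```

In this toy list the second hypothesis has a slightly better acoustic score, but it loses on both language-model channels and on harmony consistency, so the first hypothesis is selected.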
KazMorphLM was integrated into a hybrid ASR pipeline combining NVIDIA FastConformer and Meta MMS-1B acoustic models. On the FLEURS test set, the system achieves a word error rate of 6.86%, a 14.6% relative improvement over word-level KenLM rescoring. The results indicate that higher-order morpheme modelling is essential for agglutinative languages and that corpus quality outweighs corpus size. The approach is applicable to other morphologically rich Turkic languages.
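The relative improvement can be related to an absolute baseline with a short calculation. The abstract does not state the baseline word error rate, so the value computed here is a back-calculation, assuming relative improvement is defined as (baseline − new) / baseline. It is subject to rounding in the reported figures, and the function name is hypothetical.

```python
def implied_baseline_wer(wer: float, relative_improvement: float) -> float:
    """Invert new = baseline * (1 - r) to recover the baseline WER."""
    return wer / (1.0 - relative_improvement)


# Reported figures: 6.86% WER, 14.6% relative improvement.
baseline = implied_baseline_wer(6.86, 0.146)
print(f"implied baseline WER: {baseline:.2f}%")
print(f"absolute reduction: {baseline - 6.86:.2f} points")
```

Under that definition, the word-level KenLM baseline would sit at roughly 8.0% WER, an absolute reduction of a little over one percentage point.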
License
Copyright (c) 2026 Articles are open access under the Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.