KAZMORPHLM: MORPHEME-AWARE LANGUAGE MODEL FOR KAZAKH AUTOMATIC SPEECH RECOGNITION

Yerlan Karabaliyev; Kateryna Kolesnikova

doi:10.37943/25SCDM2312

Authors

Yerlan Karabaliyev International Information Technology University https://orcid.org/0009-0001-9465-3998
Kateryna Kolesnikova International Information Technology University https://orcid.org/0000-0002-9160-5982

DOI:

https://doi.org/10.37943/25SCDM2312

Keywords:

morpheme language model, Kazakh speech recognition, agglutinative morphology, vowel harmony, morpheme segmentation, n-gram interpolation, ASR rescoring, Turkic languages, low-resource ASR

Abstract

This paper presents KazMorphLM, a morpheme-aware language model for automatic speech recognition (ASR) in the Kazakh language. Kazakh belongs to the Turkic family and is characterised by a highly agglutinative morphology, in which a single root can generate a large number of inflected forms through productive suffixation. This property causes severe data sparsity for conventional word-level language models and reduces recognition accuracy.

The proposed model introduces three main innovations. First, a rule-based morpheme segmenter uses an inventory of 230 suffixes across fourteen grammatical categories and includes phonological validation through vowel harmony and consonant assimilation rules. Second, a two-level interpolated n-gram architecture combines a 7-gram morpheme-level model with a 5-gram word-level model using an interpolation ratio of 0.6 to 0.4 and Witten–Bell smoothing. Third, a four-channel rescoring mechanism integrates acoustic confidence, word-level and morpheme-level language-model probabilities, and a vowel-harmony consistency score.
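The interpolation and rescoring steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that the 0.6 weight applies to the morpheme-level model and 0.4 to the word-level model, that interpolation is linear in the probability domain, and that the four channels are combined as a weighted sum of log-scores. The function names, the channel weights, and the simplified vowel classes (back а, о, ұ, ы; front ә, е, ө, ү, і) are hypothetical choices made only for illustration.

```python
import math

# Simplified vowel classes (assumption: ambiguous letters such as и and у are ignored).
BACK_VOWELS = set("аоұы")
FRONT_VOWELS = set("әеөүі")


def harmony_score(word: str) -> float:
    """Share of harmonic vowels that belong to the word's majority class."""
    lowered = word.lower()
    back = sum(ch in BACK_VOWELS for ch in lowered)
    front = sum(ch in FRONT_VOWELS for ch in lowered)
    total = back + front
    if total == 0:
        return 1.0  # no harmonic vowels, so nothing to violate
    return max(back, front) / total


def interpolate(p_morph: float, p_word: float, lam: float = 0.6) -> float:
    """Linear interpolation of morpheme-level and word-level probabilities."""
    return lam * p_morph + (1.0 - lam) * p_word


def rescore(hyp: dict, weights: dict) -> float:
    """Weighted sum of four channels for one n-best hypothesis (log domain)."""
    words = hyp["text"].split()
    if words:
        harmony = sum(harmony_score(w) for w in words) / len(words)
    else:
        harmony = 1.0
    return (
        weights["acoustic"] * hyp["acoustic_logp"]
        + weights["word_lm"] * hyp["word_lm_logp"]
        + weights["morph_lm"] * hyp["morph_lm_logp"]
        + weights["harmony"] * math.log(harmony)
    )


# Hypothetical channel weights and n-best list, for illustration only.
WEIGHTS = {"acoustic": 1.0, "word_lm": 1.0, "morph_lm": 1.0, "harmony": 1.0}
NBEST = [
    {"text": "балаларге", "acoustic_logp": -4.0,
     "word_lm_logp": -6.0, "morph_lm_logp": -5.0},
    {"text": "балаларға", "acoustic_logp": -4.0,
     "word_lm_logp": -6.0, "morph_lm_logp": -5.0},
]
best = max(NBEST, key=lambda h: rescore(h, WEIGHTS))
```

In this toy n-best list the two hypotheses differ only in the final suffix vowel, so the harmony channel alone decides the ranking. In a real system the channel weights would be tuned on held-out data.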

KazMorphLM was integrated into a hybrid ASR pipeline combining NVIDIA FastConformer and Meta MMS-1B acoustic models. On the FLEURS test set, the system achieves a word error rate of 6.86%, a 14.6% relative improvement over word-level KenLM rescoring. The results indicate that higher-order morpheme modelling is essential for agglutinative languages and that corpus quality outweighs corpus size. The approach is applicable to other morphologically rich Turkic languages.
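As a reading aid, the relative improvement can be unpacked under the usual definition of relative WER reduction. The baseline figure below is implied by the two reported numbers rather than stated in the abstract, and it is approximate because both inputs are rounded.

```latex
\[
\Delta_{\mathrm{rel}}
  = \frac{\mathrm{WER}_{\mathrm{base}} - \mathrm{WER}_{\mathrm{new}}}{\mathrm{WER}_{\mathrm{base}}}
\quad\Longrightarrow\quad
\mathrm{WER}_{\mathrm{base}}
  = \frac{\mathrm{WER}_{\mathrm{new}}}{1 - \Delta_{\mathrm{rel}}}
  = \frac{6.86\%}{1 - 0.146}
  \approx 8.03\%
\]
```

On this reading, word-level KenLM rescoring would sit at roughly 8.0% WER, an absolute gap of about 1.2 percentage points.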

Author Biographies

Yerlan Karabaliyev, International Information Technology University

PhD candidate, Clever System

Kateryna Kolesnikova, International Information Technology University

Doctor of Technical Sciences, Professor

