KAZAKH SPEECH AND RECOGNITION METHODS: ERROR ANALYSIS AND IMPROVEMENT PROSPECTS

Yerlan Karabaliyev; Kateryna Kolesnikova

doi:10.37943/20DZGH8448

Authors

Yerlan Karabaliyev International Information Technology University, Kazakhstan https://orcid.org/0009-0001-9465-3998
Kateryna Kolesnikova International Information Technology University, Kazakhstan https://orcid.org/0000-0002-9160-5982

DOI:

https://doi.org/10.37943/20DZGH8448

Keywords:

Kazakh speech recognition, Automatic speech recognition, Kaldi, Mozilla DeepSpeech, Google Speech-to-Text API, Speech recognition errors, Phonetic analysis, Acoustic model adaptation, Linguistic features, Kazakh language processing

Abstract

This study offers a detailed evaluation of automatic speech recognition (ASR) systems for Kazakh, examining how well they recognize the phonetic and linguistic features unique to the language. Kazakh presents specific challenges for ASR due to its complex phonology, vowel harmony, and multiple regional dialects. To address these challenges, a comparative analysis of three leading ASR systems (Kaldi, Mozilla DeepSpeech, and Google Speech-to-Text API) was conducted using a dataset of 101 recordings of spoken Kazakh text. The study focuses on the systems' word error rates (WER) and identifies common misrecognitions, especially of Kazakh-specific phonemes such as "қ," "ң," and "ү." Kaldi and Mozilla DeepSpeech exhibited high WERs, struggling in particular with Kazakh’s vowel harmony and consonant distinctions, while Google Speech-to-Text achieved the lowest WER of the three. However, none of the systems reached accuracy levels sufficient for practical applications, as errors in recognizing Kazakh’s agglutinative morphology and case endings remained pervasive. To improve these outcomes, a series of enhancements is proposed, including adapting acoustic models to better reflect Kazakh’s phonetic and morphological traits, integrating dialect-specific data, and employing machine learning methods such as transfer learning and hybrid models. Additional steps include refining data preprocessing and increasing dataset diversity to capture Kazakh’s linguistic nuances more accurately. With these limitations addressed, ASR systems can better handle complex sentence structures and regional speech variations. This research thus provides a foundation for advancing Kazakh ASR technologies and contributes insights for developing inclusive, effective ASR systems capable of supporting linguistically diverse users.
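
The abstract compares systems by word error rate but does not say how the metric was computed. As a minimal sketch, assuming the standard definition (word-level edit distance divided by the number of reference words), WER could be calculated as follows. The function name `word_error_rate` and the example sentence are illustrative assumptions, not taken from the study.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference length.

    Assumption: words are separated by whitespace and compared exactly;
    the study may have applied additional text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: "қ" and "ғ" are misrecognized as "к" and "г",
# so one of the three reference words counts as a substitution.
wer = word_error_rate("мен қалаға бардым", "мен калага бардым")
```

In this made-up case a single confused word out of three gives a WER of one third, which shows how misrecognizing Kazakh-specific letters in one word counts as a full word error under this metric.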

Author Biographies

Yerlan Karabaliyev, International Information Technology University, Kazakhstan

PhD candidate, Clever System

Kateryna Kolesnikova, International Information Technology University, Kazakhstan

Doctor of Technical Sciences, Professor

