KAZAKH SPEECH AND RECOGNITION METHODS: ERROR ANALYSIS AND IMPROVEMENT PROSPECTS
DOI: https://doi.org/10.37943/20DZGH8448
Keywords:
Kazakh speech recognition, Automatic speech recognition, Kaldi, Mozilla DeepSpeech, Google Speech-to-Text API, Speech recognition errors, Phonetic analysis, Acoustic model adaptation, Linguistic features, Kazakh language processing
Abstract
This study offers a detailed evaluation of automatic speech recognition (ASR) systems for the Kazakh language, examining their performance in recognizing the phonetic and linguistic features unique to the language. Kazakh presents specific challenges for ASR due to its complex phonology, vowel harmony, and multiple regional dialects. To address these challenges, a comparative analysis of three leading ASR systems (Kaldi, Mozilla DeepSpeech, and the Google Speech-to-Text API) was conducted using a dataset of 101 recordings of spoken Kazakh text. The study focuses on the systems' word error rates (WER), identifying common misrecognitions, especially of Kazakh-specific phonemes such as "қ," "ң," and "ү." Kaldi and Mozilla DeepSpeech exhibited high WERs, particularly struggling with Kazakh's vowel harmony and consonant distinctions, while Google Speech-to-Text achieved the lowest WER of the three. However, none of the systems demonstrated accuracy levels sufficient for practical applications, as errors in recognizing Kazakh's agglutinative morphology and case endings remained pervasive. To improve these outcomes, a series of enhancements is proposed, including adapting acoustic models to better reflect Kazakh's phonetic and morphological traits, integrating dialect-specific data, and employing machine learning methods such as transfer learning and hybrid models. Additional steps include refining data preprocessing and increasing dataset diversity to capture Kazakh's linguistic nuances more accurately. By addressing these limitations, ASR systems can better handle complex sentence structures and regional speech variations. This research thus provides a foundation for advancing Kazakh ASR technologies and contributes insights vital for developing inclusive, effective ASR systems capable of supporting linguistically diverse users.
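The comparison above rests on word error rate (WER): the number of word-level substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. As an illustration only (this is a minimal sketch, not code from the study, and the Kazakh sentence pair below is a hypothetical example of a "қ" misrecognition):

```python
from typing import List

def wer(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    computed via word-level Levenshtein distance."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Hypothetical example: the Kazakh-specific "қ" misread as "к"
ref = "қазақ тілі"
hyp = "казак тілі"
print(f"WER = {wer(ref, hyp):.2f}")  # 0.50: one substituted word out of two
```

On a corpus such as the 101-recording dataset used here, per-utterance edit counts would typically be summed over all reference words rather than averaging per-utterance WERs.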
License
Copyright (c) 2024. Articles are open access under the Creative Commons License.
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Authors who publish a manuscript in this journal agree to the following terms:
- The authors retain authorship of their work and transfer to the journal the right of first publication under the terms of the Creative Commons Attribution License, which allows others to freely distribute the published work with a mandatory link to the original work and its first publication in this journal.
- Authors have the right to conclude separate, additional agreements for the non-exclusive distribution of the work in the form in which it was published by this journal (for example, to post the work in an institutional electronic repository or to publish it as part of a monograph), provided that a link to the first publication of the work in this journal is included.
- Other terms stated in the Copyright Agreement.